MongoDB stores and manages unstructured and semi-structured data, including JSON-like documents with dynamic schemas. It supports various programming languages and platforms, offering scalability and high availability. A career as a MongoDB developer can be very rewarding whether you are a fresher, intermediate or expert. Follow these top basic and advanced MongoDB interview questions and turn yourself into an essential MongoDB developer. We have covered the most-asked questions on topics like pagination, top operators, data synchronization, modes of reading, the authorization model, storage engines, troubleshooting, queries in MongoDB, audit trails, sharding, backup methods, replication and more. With these top interview questions on MongoDB, you will understand the detailed structure of MongoDB and its different applications. These will qualify you to become a MEAN stack developer, backend engineer, front-end developer and more.
db.<collection>.find().skip(n).limit(n)
Note: n is the page size. limit(n) limits the documents returned from the cursor to n, and skip(n) skips n documents from the cursor; for the first page, skip(n) is not applicable.
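As a quick illustration (assuming a hypothetical users collection and a page size of 10), page 3 would be fetched by skipping the first two pages:
// page size 10, page 3: skip the first 20 documents and return the next 10
db.users.find().skip(20).limit(10)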
This can be achieved in MongoDB using the $type operator. A null value, i.e., BSON type null has the type number 10. Using this type number, only those documents can be retrieved whose value is null.
Take the example of the below two documents in startup collection
{ _id: 1, name: "XYZ Tech", website: null }, { _id: 2, name: "ABC Pvt Ltd" }
The query { website : { $type: 10 } } will retrieve only those documents where the website is null; in the above case it would be the startup "XYZ Tech".
Note: The query { website : null } on the other hand will match documents where the website is null or the documents where the website field does not exist. For the above collection data, this query will return both the startups.
This is one of the most frequently asked MongoDB interview questions and answers in recent times.
The $exists operator matches only those documents that contain the field specified in the query.
For the following documents in employee collection
{ _id: 1, name: "Jonas", linkedInProfile: null }, { _id: 2, name: "Williams" }
The query { linkedInProfile: { $exists: true } } will return only the employee “Jonas”
In MongoDB we have built-in roles as well as custom roles. Built-in roles already have pre-defined access associated with them; we can assign these roles directly to users or groups. To run mongostat, we require the privilege to run serverStatus on the server.
The built-in role clusterMonitor comes with the required access for the same.
Custom roles or user-defined roles are the ones where we have to manually define access actions to a particular resource. MongoDB provides method db.createRole() for creating user-defined roles. These roles can be created in a specific database as MongoDB uses a combination of database and role name to uniquely define the role.
We will create a custom role mongostatRole that provides only the privileges to run mongostat.
First, we need to connect to mongod or mongos to the admin database with a user that has privileges to create roles in the admin as well as other databases.
mongo --port 27017 -u admin -p 'abc***' --authenticationDatabase 'admin'
Now we will create a desired custom role in the admin database.
use admin
db.createRole( { role: "mongostatRole", privileges: [ { resource: { cluster: true }, actions: [ "serverStatus" ] } ], roles: [] } )
This role can now be assigned to members of the monitoring team.
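As a sketch, assuming a monitoring user named monitorUser already exists in the admin database, the role could be granted like this:
use admin
// grant the custom role so the user can run mongostat against this deployment
db.grantRolesToUser( "monitorUser", [ { role: "mongostatRole", db: "admin" } ] )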
In MongoDB, data is stored as JSON documents. These documents can have different sets of fields, with a different data type for each field. For example, within one collection we can have documents containing a number, a string and an array.
{ "a" : 143 } { "name" : "john" } { "x" : [1,2,3] }
It is not correct to say MongoDB is schemaless, in fact, schema plays an important role in the designing of MongoDB applications. MongoDB has a dynamic schema having database structure with collections and indexes. These collections can be created either implicitly or explicitly.
Due to the dynamic behaviour of the schema, MongoDB has several advantages over RDBMS systems.
Schema migrations become very easy. In traditional systems we had to use the ALTER TABLE command after adding any column, which could result in downtime; in MongoDB such adjustments are transparent and automatic. For example, if we want to add a CITY field to a people collection, we can simply add the attribute and resave the document, and that's it. In a traditional system we would have to run an ALTER TABLE command, possibly followed by a reorg, which would require downtime.
The first part of the query would give all documents where y >= 10, so we will have 2 documents, i.e.:
d> { "_id" : 4, "x" : 4, "y" : 10 } e> { "_id" : 5, "x" : 5, "y" : 75 }
Now the second part of the query would update the value of y for the above 2 documents to 75. We already have one document with the value y: 75, which will not be modified, so finally only 1 document will be updated by the provided query:
d> { "_id" : 4, "x" : 4, "y" : 10 }
Every operation on the primary is logged in operation logs known as the oplog. These oplogs are replicated to the secondary members. For a healthy replica set, it is recommended that all members are in sync with no replication lag. Data is first written on the primary by the applications and then replicated to the secondaries. This synchronization is important to maintain up-to-date copies of data on all members. Synchronization happens in 2 ways: initial sync and continuous replication.
The oplog is an operation log that keeps a record of all operations that modify the data stored in the databases. We can define the oplog size while starting MongoDB by specifying the --oplogSize option. If we do not specify this option, it takes the default value, which in the case of WiredTiger is 5% of free disk space. While the default value is sufficient for most workloads, in some cases we may need to change the oplog size for the replica set.
The oplog size is changed in a rolling manner: first we change it on all secondaries and then on the primary member of the replica set. To change the oplog size:
use local
db.oplog.rs.stats().maxSize
db.adminCommand({replSetResizeOplog: 1, size: "Size-in-MB"})
MongoDB applies database operations on the primary and then records the operations in the primary's oplog. The secondary members then copy and apply these operations in an asynchronous process. For each operation, there is a separate oplog entry.
First, let’s check how many rows the query would fetch by changing delete to find operation.
db.sample.find( { state : "WA" } )
This will return all the documents where the state is WA.
{"firstName" : "Arthur", "lastName" : "Aaronson", "state" : "WA", "city" : "Seattle", "likes" : [ "dogs", "cats" ] } {"firstName" : "Beth", "lastName" : "Barnes", "state" : "WA", "city" : "Richland", "likes" : [ "forest", "cats" ] } {"firstName" : "Dawn", "lastName" : "Davis", "state" : "WA", "city" : "Seattle", "likes" : [ "forest", "mountains" ] }
Now, ideally a delete should remove all matching documents, but the query says deleteOne.
If the query had said deleteMany, then all the matching documents would have been deleted and there would be 3 oplog entries. deleteOne, however, removes only the first matching document, so 1 oplog entry will be generated for the provided query.
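To make the difference concrete on the sample collection above (a rough sketch):
db.sample.deleteOne( { state : "WA" } )   // removes only the first matching document -> 1 oplog entry
db.sample.deleteMany( { state : "WA" } )  // would remove all 3 matching documents -> 3 oplog entries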
Idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application. In MongoDB, oplog entries are idempotent, meaning that even if they are applied multiple times the same output is produced. So if the server goes down and we need to apply oplogs, there will not be any inconsistency: even if logs that were already applied are applied again, the database's end state does not change.
Also, there was a desire for the new state of a document to be independent of the previous state. For this, all operators that rely on the previous state to determine a new value need to be transformed so that the oplog records the actual resulting values. For example, if an addition operation changes a value from 21 to 30, the operation should be recorded as setting the value 30 on the field; replaying the operation multiple times then produces the same result.
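A minimal sketch of this transformation (the accounts collection and balance field are illustrative):
// application issues an increment; balance goes from 21 to 30
db.accounts.updateOne( { _id: 1 }, { $inc: { balance: 9 } } )
// the oplog records the resulting state, roughly equivalent to { $set: { balance: 30 } },
// so replaying the entry any number of times still leaves balance at 30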
In MongoDB we can read from the primary as well as the secondary members of a replica set. This behaviour can be controlled by us, as we can define the desired read preference by which clients route read operations to members of the replica set. If we do not specify any read preference, by default MongoDB reads from the primary. There are situations when you would want to reduce the load on your primary by forcing applications to read from secondaries; an example of routing reads this way is shown after the list of modes below.
Below are the different MongoDB read preference modes:
primary: This is the default mode. Applications read from the replica set primary.
primaryPreferred: In this mode, applications read from the primary, but if the primary member is not available they read from a secondary.
secondary: All applications read from the secondary members of the replica set.
secondaryPreferred: In this mode, applications read from secondaries, but if no secondary member is available they read from the primary.
nearest: In this mode, applications read from the member that is nearest to them in terms of network latency, irrespective of the member being primary or secondary.
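For example, reads can be routed away from the primary from the mongo shell using cursor.readPref() (the orders collection is illustrative):
// route this query to a secondary if one is available, otherwise to the primary
db.orders.find().readPref("secondaryPreferred")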
Shard key selection is based on the workload. Since the first query is used 90% of the time, it should drive the selection of the shard key.
A combination of fields from that query would make the best shard key; this eliminates options b, c and d.
Options a and e both use a subset of fields from the most-used query, and either could be the shard key, but the option covering more of those fields would be more suitable.
One of the most frequently posed MongoDB scenario based interview questions, be ready for this conceptual question.
Chunk split operations are carried out automatically by the system when an insert operation causes a chunk to exceed the maximum chunk size. The balancer then migrates recently split chunks to other shards. But in some cases we may want to pre-split the chunks manually:
To split the chunks manually we can use the split command with helper sh.splitFind() and sh.splitAt().
Example:
To split the chunk of the people collection on the employeeid field at a value of 713626, the below command should be used.
sh.splitAt( "test.people", { "employeeid": "713626" } )
We should be careful while pre-splitting chunks as sometimes it can lead to a collection with different sized chunks.
In some cases, chunks can grow beyond the specified chunk size but cannot undergo a split. The most common scenario is when a chunk represents a single shard key value. Since the chunk cannot split, it continues to grow beyond the chunk size, becoming a jumbo chunk. These jumbo chunks can become a performance bottleneck as they continue to grow, especially if the shard key value occurs with high frequency.
The addition of new data or new shards can result in data distribution imbalances within the cluster. A particular shard may acquire more chunks than another shard, or the size of a chunk may grow beyond the configured maximum chunk size.
MongoDB ensures a balanced cluster using two processes: chunk splitting and the balancer.
In replication we have multiple copies of the same data kept in sync with each other; it is mainly useful for high availability purposes. In sharding we divide our entire dataset into small chunks and distribute them among several servers. Sharding is used where we have some sort of hardware bottleneck or to get the benefits of query parallelism. If our dataset is very small, sharding would not provide many advantages, but as the data grows we should move to sharding.
Below are a few of the situations where sharding is recommended over replication.
Breaking the dataset over shards means having more resources available to handle the subset of data each shard owns, and operations that move data across machines for replication, backups and restores will also be faster.
Loading every document into RAM means that the query is not using an index efficiently and has to fetch documents from disk into RAM.
For using an index, the initial match in the find statement should either use index or index prefix.
The below query filters on the b field, which does not use any existing index or index prefix, so it would have to fetch documents from disk.
db.sample.find( { b : 1 } ).sort( { c : 1, a : 1 } )
If we start the Mongo server using mongod with the --auth option, it will enable authorization, but we will not be able to do anything; even listing databases using show dbs will fail with the error "not authorized on admin to execute command". This is because we have not yet authenticated to the database. But this is a new database with no users, so how can we create new users without any authorization?
MongoDB provides localhost exception for creating the first user on the database without any authorization. With this first user, we can create other users with relevant access.
But there are a few considerations to it:
MongoDB follows the Role-Based Access Control (RBAC) authorization model. In this model, users are assigned one or more roles which give them access to database resources and operations. Apart from these role assignments, users do not have any access to the resources. When we enable internal authentication, it automatically enables client authorization. Authorization can be enabled by starting mongod with the --auth option or by providing the security.authorization setting in the config file.
These groups of privileges are called roles and can be assigned to users.
MongoDB provides several built-in roles like Database User Roles, Database Administration Roles, Cluster Administration Roles, Backup and Restoration Roles, All-Database Roles and Superuser Roles. We can also create our own custom roles based on requirements, which are called user-defined roles.
For example, suppose we want to create a role that provides only the privilege to run serverStatus on the cluster; we can create the below user-defined role.
use admin
db.createRole( { role: "mongostatRole", privileges: [ { resource: { cluster: true }, actions: [ "serverStatus" ] } ], roles: [] } )
If secondaries are falling behind they are experiencing replication lag, which is a delay in the application of oplog from primary to secondary. Replication lag can be a significant issue and can seriously affect MongoDB replica set deployments. Excessive replication lag makes a member ineligible to quickly become primary and increases the possibility of distributed read operations to be inconsistent.
We can check the replication lag by calling the rs.printSlaveReplicationInfo() method or by running the rs.status() command.
Possible causes of replication lag include:
A staple in MongoDB technical interview questions, be prepared to answer this one using your hands-on experience.
The storage engine is the component that lies between the database and the storage layer and is primarily responsible for managing data. MongoDB provides a few choices of storage engines, enabling us to use the one best suited for our applications. Choosing the appropriate storage engine can significantly impact performance.
WiredTiger replaced MMAPv1 in version 3.2 to become the default storage engine. If we install MongoDB and do not specify any storage engine, WiredTiger is enabled. As it provides a document-level concurrency model, checkpointing and compression, it is suited for most workloads. It also supports encryption at rest in MongoDB Enterprise.
Various applications require predictable latencies which can be achieved by storing the documents in memory rather than on disk. In-Memory Storage Engine is helpful for such applications. It is available only in MongoDB enterprise edition.
MongoDB started with the MMAPv1 storage engine only, but it was successful for a specific subset of use cases, due to which it was deprecated in version 4.0.
Concurrency in MongoDB is different for different storage engines. While WiredTiger uses document level concurrency control for write operations, MMAPV1 has concurrency at the collection level.
In WiredTiger, locking is at the document level, due to which multiple clients can modify documents at the same time, so it uses optimistic concurrency control for most reads and writes. Instead of exclusive locks, it uses only intent locks at the global, database and collection levels. If the storage engine detects a conflict between two operations, one will incur a write conflict, causing MongoDB to transparently retry that operation.
Sometimes global “instance-wide” locks are required for global operations which involve multiple databases. Exclusive database lock incurs for operations such as dropping a collection.
MMAPv1 still uses a collection-level lock, meaning that if we have 3 applications writing to the same collection, 2 would have to wait until the first application completes, as it applies a collection-level write lock.
MongoDB's WiredTiger storage engine ensures data durability with journaling and checkpoints. Journals are write-ahead logs, while checkpoints are point-in-time snapshots.
In wiredTiger, with the start of each operation, a point-in-time snapshot is taken which presents a consistent view of in-memory data. WiredTiger then writes all snapshot data to the disk in a consistent way across all data files. This data on disk is durable and acts as a checkpoint in the data files. The checkpoint ensures all data files are consistent from the last checkpoint.
These checkpoints usually occur every 60 seconds, so we have a consistent snapshot at 60-second intervals, thus ensuring durability.
The journal is a write-ahead log which persists all data changes between 2 checkpoints. In case we require data between checkpoints for recovery, these journal files can be used. These journal files act as crash recovery files in case of interruptions; once the system is back up, they can be replayed for recovery.
In MongoDB, data set consistency is ensured by locking. In any database system, long-running queries degrade performance as requests and operations have to wait for a lock. Locking issues can be intermittent, so they need to be investigated and resolved promptly.
MongoDB provides us with tools and utilities to troubleshoot these locking issues. The serverStatus() command provides us a view of the system including the locking-related information. We should look for locks and a globalLock section of serverStatus() command for troubleshooting locking issues.
We can use below commands to filter locking related information from the serverStatus output.
db.serverStatus().globalLock db.serverStatus().locks
To get the approximate average wait time for a lock mode, we can divide locks.timeAcquiringMicros by locks.acquireWaitCount.
To check the number of times deadlocks occurred locks.deadlockCount should be checked.
If the application performance is constantly degraded there might be concurrency issues, in such cases, we should look at globalLock.currentQueue.total. A high value indicates concurrency issues.
Sometimes globalLock.totalTime is high relative to uptime which suggests database has been in a lock state for a significant time.
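To see which operations are currently holding or waiting on locks, db.currentOp() can be filtered; for example, the threshold below is illustrative:
// list operations that have been running for more than 5 seconds
db.currentOp( { "secs_running": { $gt: 5 } } )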
Indexes are important to consider while designing the databases as they impact the performance of the applications. Without index, query will perform collection scan in which all the documents of the collection are scanned one by one to find the matching fields of the query. If we have indexed for a particular query, MongoDB can use the index to limit the number of documents scanned to execute the query as the indexes store the values of the query in ascending or descending ordered form.
While indexes help improve the performance of find operations, for write operations like insert and update there can be a significant negative impact of adding indexes: with each write, MongoDB also needs to update all the indexes associated with the modified documents. This is overhead on the system, and we may end up with performance degradation.
So while find() performance will improve, updateOne and insertOne would become slower, as with every update or insert the related indexes need to be updated as well.
A staple in senior MongoDB interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a MongoDB developer.
MongoDB provides several utilities for data movement activities, like mongodump, mongoexport etc. mongodump exports the contents of a collection to an external file in BSON (binary) format. The files exported by this method can then be used by the mongorestore command to restore into another database or a different collection. mongodump does not capture index data; it captures only the documents. Since the contents are exported in binary format, this method cannot be used for exporting to a CSV file.
To export the contents in JSON or CSV format, we can use the mongoexport command. The exported collection can then be restored using the mongoimport command. Since mongoexport cannot export in BSON, the rich BSON data types are not preserved while exporting the data. For this reason, mongoexport should be used with careful consideration.
Below is the command that can be used for the same.
mongoexport --host host:27017 -d test -c sample --type=csv -f fields -o sample.csv
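The exported CSV can later be loaded back into the same or a different collection with mongoimport; a rough example (host and collection names are illustrative):
mongoimport --host host:27017 -d test -c sample_copy --type=csv --headerline --file sample.csv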
In a sharded cluster we may have a database which has sharded as well as unsharded collections. While the sharded collections are spread across all the shards, all the unsharded collections are stored on a single shard known as the primary shard. Every database in the sharded cluster has its own primary shard. When we create a new database, mongos picks the shard with the least amount of data in the cluster and marks it as the primary shard.
If there is a need to change the primary shard we can do so by using the movePrimary command. This migration may take significant time to complete. We should not access any collections associated with migrating database until the process completes. Also, the migration of primary shard should be done at the lean time as it may impact the performance of the overall cluster.
Eg. To migrate the primary shard of accounts database to Shard0007 below command should be used.
db.adminCommand( { movePrimary : "accounts", to : "shard0007" } )
When we create any collection in MongoDB, a unique index on the _id field is created automatically. This unique index prevents applications from inserting multiple documents with the same value for the _id field. This is enforced by the system, and we cannot drop this index on the _id field. Moreover, in replica sets the unique _id values are used in the oplog to reference the documents to update.
In a sharded cluster if we do not have unique _id values across the sharded collection chunk migrations may fail as when documents migrate to another shard, any identical values will not be inserted to receiver shard. In such cases, we should code application such that it ensures uniqueness on _id for given collection across all shards in a sharded cluster.
If we use _id as the shard key, this uniqueness of values is automatically enforced by the system. In that case, chunk ranges are assigned to a single shard, and that shard enforces uniqueness on the values in its range.
MongoDB provides a database profiler which captures and stores detailed information about commands executed on a running instance. Captured details include CRUD operations, administrative commands and configuration commands. The data collected by the profiler is stored in the system.profile collection of the database where profiling is enabled.
By default, the profiler is turned off. We can enable it and set different profiling levels based on requirements. It provides 3 profiling levels:
0 - the profiler is off and collects no data (the default)
1 - the profiler collects data only for operations slower than the slowms threshold
2 - the profiler collects data for all operations
To capture slow-running queries, we can start the profiler with profiling level 1 or 2. The default slow operation threshold is 100 milliseconds; we can change this threshold by specifying the slowms option.
Eg: To enable profiler which captures all queries slower than 50ms below command should be used:
db.setProfilingLevel(1, { slowms: 50 })
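Captured slow operations can then be inspected directly from the system.profile collection of the profiled database; a quick sketch (the 50 ms threshold is illustrative):
// show the five most recent operations that took longer than 50 ms
db.system.profile.find( { millis: { $gt: 50 } } ).sort( { ts: -1 } ).limit(5)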
Consider the following compound index
{ "accountHolder": 1, "accountNumber": 1, "currency": 1 }
The index prefixes are
{ accountHolder: 1 } { accountHolder: 1, accountNumber: 1 }
The query plan will use this index if the query filters on the following fields:
accountHolder
accountHolder and accountNumber
accountHolder, accountNumber and currency
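A small sketch of this behaviour (the accounts collection is assumed for illustration):
db.accounts.createIndex( { accountHolder: 1, accountNumber: 1, currency: 1 } )
db.accounts.find( { accountHolder: "John", accountNumber: 12345 } )   // uses the index prefix
db.accounts.find( { currency: "USD" } )                               // not a prefix, cannot use this index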
The $addToSet operator should be used with the $each modifier for this. The $each modifier allows the $addToSet operator to add multiple values to the array field.
For example, startups are tagged as per the technology skills that they excel in
{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }
Now the start up needs to be updated with additional skills
db.startups.update( { _id: 5 }, { $addToSet: { skills: { $each: [ "Machine Learning", "RPA" ] } } } )
The resultant document after update()
{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud", "Machine Learning", "RPA"] }
Note: There is no particular ordering of elements in the modified set, $addToSet does not guarantee that. Duplicate items will not be added.
When "fast reads" are the single most important criteria, Embedded documents can be the best way to model one-to-one and one-to-many relationships.
Consider the example of certifications awarded to an employee, in the below example the certification data is embedded in the employee document which is a denormalized way of storing data
{ _id: "10", name: "Sarah Jones", certifications: [ { certification: "Certified Project Management Professional”, certifying_auth: “PMI”, date: "06/06/2015" }, { certification: "Oracle Certified Professional”, certifying_auth: “Oracle Corporation”, date: "10/10/2017" } ] }
In a normalized form, there would be a reference to the employee document from the certificate document, for example
{ employee_id: "10", certification: "Certified Project Management Professional", certifying_auth: "PMI", date: "06/06/2015" }
Embedded documents are best used when the entire relationship data needs to be frequently retrieved together. Data can be retrieved via single query and hence is much faster.
Note: Embedded documents should not grow unbounded, otherwise it can slow down both read and write operations. Other factors like consistency and frequency of data change should be considered before making the final design decision for the application.
MongoDB has the db.collection.explain() and cursor.explain() methods and the explain command to provide information on the query plan. The results of explain contain a lot of information; the key ones are the winning plan in the queryPlanner section and, when run with executionStats, fields such as nReturned, totalKeysExamined, totalDocsExamined and executionTimeMillis.
Recursive queries can be performed within a collection using $graphLookup, which is an aggregation pipeline stage.
If a collection has a self-referencing field, like the classic example of a manager for an employee, then a query to get the entire reporting structure for manager "David" would look like this
db.employees.aggregate( [ { $graphLookup: { from: "employees", startWith: "David", connectFromField: "manager", connectToField: "name", as: "Reporting Structure" } } ] )
For the following documents in the employee collection,
{ "_id" : 4, "name" : " David " , "manager" : "Sarah" } { "_id" : 5, "name" : "John" , "manager" : "David" } { "_id" : 6, "name" : "Richard", "manager" : " John " } { "_id" : 7, "name" : "Stacy" , "manager" : " Richard " }
Output of the above $graphLookup operation would result in the following 3 documents returned
{ "_id" : 5, "name" : "John" , "manager" : "David", … } { "_id" : 6, "name" : "Richard", "manager" : " John ", … } { "_id" : 7, "name" : "Stacy" , "manager" : " Richard", … }
The hierarchy starts with "David", which is specified in startWith, and from there the data for each of the members in that reporting hierarchy is fetched recursively.
The $graphLookup looks like this for a query from the employees collection where "manager" is the self-referencing field
db.employees.aggregate( [ { $graphLookup: { from: "employees", startWith: "David", connectFromField: "manager", connectToField: "name", as: "Reporting Structure" } } ] )
The value of as, which is "Reporting Structure" in this case, is the name of the array field that contains the documents traversed in the $graphLookup to reach the output document.
For the following documents in the employee collection,
{ "_id" : 4, "name" : " David " , "manager" : "Sarah" } { "_id" : 5, "name" : "John" , "manager" : "David" } { "_id" : 6, "name" : "Richard", "manager" : " John " } { "_id" : 7, "name" : "Stacy" , "manager" : " Richard " }
"Reporting Structure" for each output document would look like this
{ "_id" : 5, "name" : "John", "manager" : "David", "Reporting Structure" : [ ] } { "_id" : 6, "name" : "Richard", "manager" : "John", "Reporting Structure" : [ { "_id" : 5, "name" : "John", "manager" : "David" } ] } { "_id" : 7, "name" : "Stacy", "manager" : "Richard", "Reporting Structure" : [ { "_id" : 5, "name" : "John", "manager" : "David" }, { "_id" : 6, "name" : "Richard", "manager" : "John" } ] }
Yes, there is very much a simpler way of achieving this without having to do this programmatically. The $unwind operator deconstructs an array field resulting in a document for each element.
Consider user "John" with multiple addresses
{ "_id" : 1, "name" : "John", addresses: [ "Permanent Addr", "Temporary Addr", "Office Addr"] }
db.users.aggregate( [ { $unwind : "$addresses" } ] )
would result in 3 documents, one for each of the addresses
{ "_id" : 1, " name " : " John ", " addresses " : "Permanent Addr" } { "_id" : 1, " name " : " John ", " addresses " : "Temporary Addr" } { "_id" : 1, " name " : " John ", " addresses " : "Office Addr" }
This is one of the most frequently asked MongoDB interview questions for freshers in recent times.
MongoDB supports Capped collections which are fixed-size collections. Once the allocated space is filled up, space is made for new documents by removing (overwriting) oldest documents. The insertion order is preserved and if a query does not specify any ordering then the ordering of results is same as the insertion order. The oplog.rs collection is a capped collection, thus ensuring that the collection of logs do not grow infinitely.
A query that is able to return the entire result using only the index is called a covered query. This is one of the optimization techniques that can be used for faster retrieval of data. A query can be a covered query only if all the fields in the query filter are part of an index and all the fields returned in the results are in the same index.
Since everything is part of the index, there is no need for the query to check the documents for any information.
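A minimal sketch of a covered query, assuming a users collection:
db.users.createIndex( { status: 1, name: 1 } )
// covered: the filter and the projection use only indexed fields, and _id is excluded
db.users.find( { status: "A" }, { name: 1, _id: 0 } )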
Expect to come across this, one of the most important MongoDB interview questions and answers for experienced in Database management, in your next interviews.
Multikey indexes can be used for supporting efficient querying against array fields. MongoDB creates an index key for each element in the array.
Note: MongoDB will automatically create a multikey index if any indexed field is an array, no separate indication required.
Consider the startups collection with array of skills
{ _id: 1, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }
Multikey indexes allow searching on the values in the skills array.
db.startups.createIndex( { skills : 1 } )
The query db.startups.find( { skills : "AI" } ) will use this index on skills to return the matching document
All the 3 projection operators, i.e., $, $elemMatch, $slice are used for manipulating arrays. They are used to limit the contents of an array from the query results.
For example, db.startups.find( {}, { skills: { $slice: 2 } } ) selects the first 2 items from the skills array for each document returned.
Starting in version 4.0, multi-document transactions are possible in MongoDB. Earlier to this version, atomic operations were possible only on a single document.
With embedded documents and arrays, data in the documents are generally denormalized and stored in a single structure. With this as the recommended data model, MongoDB's single document atomicity is sufficient for most of the applications.
Multi-document transactions now enable the remaining small percentage of applications that require this (because related data is spread across documents) to rely on the database to handle transactions rather than implementing this programmatically in the application (which can cause performance overheads).
Note: Performance cost is more for multi-document transactions (in most of the cases), hence it should be judiciously used.
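A minimal mongo shell sketch of a multi-document transaction (the bank database and accounts collection are illustrative):
session = db.getMongo().startSession()
session.startTransaction()
try {
    // both updates succeed or fail together
    session.getDatabase("bank").accounts.updateOne( { _id: 1 }, { $inc: { balance: -100 } } )
    session.getDatabase("bank").accounts.updateOne( { _id: 2 }, { $inc: { balance: 100 } } )
    session.commitTransaction()
} catch (e) {
    session.abortTransaction()
    throw e
} finally {
    session.endSession()
}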
In the case of an error, whether the remaining operations get processed or not is determined by whether the bulk operation is ordered or unordered. If it is ordered, MongoDB will not process the remaining operations, whereas if it is unordered, MongoDB will continue to process the remaining operations.
Note: “ordered” is an optional Boolean parameter that can be passed to bulkWrite(), by default this is true.
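For example, the behaviour can be controlled explicitly (collection and documents are illustrative):
db.items.bulkWrite(
    [
        { insertOne: { document: { _id: 1, name: "pen" } } },
        { insertOne: { document: { _id: 1, name: "pencil" } } },   // duplicate _id, this operation errors
        { insertOne: { document: { _id: 2, name: "eraser" } } }    // still processed because ordered is false
    ],
    { ordered: false }
)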
The MongoDB enterprise version includes auditing capability and this is fairly easy to set up. Some salient features of auditing in MongoDB
Note: Auditing adds performance overhead and the amount of overhead is determined by a combination of the several factors listed above. The specific needs of the application should be taken into account to arrive at the optimal configuration.
Once selected, the shard key cannot easily be changed later; hence it should be chosen after a lot of consideration. The distribution of the documents of a collection between the cluster shards is based on the shard key. The effectiveness of the chunk distribution is important for efficient querying and writing of the MongoDB database, and this effectiveness is directly related to the shard key. That is why choosing the right shard key up front is of utmost importance.
A must-know for anyone looking for top MongoDB interview questions, this is one of the frequently asked MongoDB advanced interview questions.
When any text content within a document needs to be searchable, all the string fields of the document can be indexed using the $** wildcard specifier.
db.articles.createIndex( { "$**" : "text" } )
Note: Any new string field added to the document after creating the index will automatically be indexed. When data is huge, wildcard indexes will have an impact on performance and hence should be used with due consideration of this.
BSON is binary-encoded JSON. Inside the database, there is a need for a binary representation for efficiency.
There are 3 major reasons for preferring BSON: it is fast to traverse (fields can be skipped without parsing the entire document), it supports additional data types that JSON lacks (such as dates and binary data), and it is efficient to encode and decode.
Example: In below document, we have a large subdocument named hobbies, now suppose we want to query field "active" skipping "hobbies" we can do so in BSON due to its linear serialization property.
{ _id: "32781", name: "Smith", age: 30, hobbies: { .............................500 KB ..............}, active: "true" }
First, we have the MongoDB query language.
This is the set of instructions and commands that we use to interact with MongoDB. All CRUD operations and the documents that we send back and forth in MongoDB are managed by this layer. It translates the incoming BSON wire protocol messages, which MongoDB uses to communicate with the client-side application libraries that we call drivers, into MongoDB operations.
Then, we have the MongoDB Data Model Layer.
This is the layer responsible for applying all the CRUD operations defined in the MongoDB query language and how they should result in the data structures managed by MongoDB. Management of namespaces, database names, and collections, which indexes are defined per namespace and which interactions need to be performed to respond to the incoming requests are all managed here.
This is also the layer where the replication mechanism is defined; this is where we define the write concerns and read concerns that applications may require.
Next, we have the storage layer.
At this layer, we have all the calls that persist data to the physical medium: how data is stored on disk, what kind of files it uses, and what levels of compression, among other settings, can be set. MongoDB has several different types of storage engines that persist data with different properties, depending on how the system is configured. WiredTiger is the default storage engine. At this layer, all the actions regarding flushes to disk, journal commits, compression operations and low-level system access happen.
Shards themselves are replica sets, which makes them highly available units. A sharded cluster has other components as well, such as mongos query routers and config servers.
Suppose we have 3 servers: abc.com, xyz.com and pqr.com.
The --replSet option is used for creating a replica set; we have given the replica set name as rs0. The bind IP is the IP through which the server can be reached from outside.
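A minimal sketch of the startup command on each of the three servers (port and data path are illustrative):
# use the host's own name (abc.com, xyz.com or pqr.com) in --bind_ip on each server
mongod --replSet rs0 --port 27017 --bind_ip localhost,abc.com --dbpath /data/db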
Login to server abc.com and run the below command; it will take you to the mongo shell.
mongo
Now we need to initiate the replica set with a configuration of all 3 members.
rs.initiate( { _id : "rs0", members: [ { _id: 0, host: "abc.com:27017" }, { _id: 1, host: "xyz.com:27017" }, { _id: 2, host: "pqr.com:27017" } ] } )
MongoDB initiates the replica set using the default replica set configuration. To view the current configuration, run:
rs.conf()
Also, to check the status of each member, we can run the command
rs.status()
The server from which we run rs.initiate will become the primary, and the other 2 servers will become secondaries.
The first requirement eliminates any five-node replica set where one node is an arbiter, as arbiters do not have a copy of the data.
The second requirement eliminates setting a priority of 0 for dc1-01, dc1-02, dc2-01 or dc2-02. They can be assigned any positive number, or the default value of 1, to be electable as primary.
As per the third requirement, dc3-01 can never be primary so its priority has to be set 0.
Finally, as per the fourth requirement, dc3-01 cannot be configured as hidden, as this would prevent reading from this replica member.
So below will be the config file meeting all the requirements.
{ "_id" : "rs0", "version" : 1, "members" : [ { "_id" : "dc1-01", "host" : "mongodb0.example.net:27017" }, { "_id" : "dc1-02", "host" : "mongodb1.example.net:27017" }, { "_id" : "dc2-01", "host" : "mongodb2.example.net:27017"}, { "_id" : "dc2-02", "host" : "mongodb3.example.net:27017"}, { "_id" : "dc3-01", "host" : "mongodb4.example.net:27017","priority" : 0 } ] }
When the primary of a replica set is not available, a secondary becomes primary; this is done via elections, where the most appropriate member of the replica set is promoted to primary. Apart from unavailability of the primary, there are a few other situations when elections are triggered, such as initiating a replica set, adding a new member to the replica set, performing maintenance using methods like rs.stepDown() or rs.reconfig(), and a secondary losing connectivity to the primary for longer than the configured election timeout.
Don't be surprised if this question pops up as one of the top MongoDB questions for interview in your next interview.
Big data systems with large data sets or high throughput requirements usually challenge the capacity of a single server; for example, a large number of parallel queries can exhaust the CPU capacity of the server. Also, working sets larger than the system's RAM can cause I/O bottlenecks and disk performance degradation. Such growth is generally handled either by vertical scaling or horizontal scaling.
In vertical scaling, bottlenecks are handled by increasing the capacity of a single server: adding more RAM, a more powerful CPU or more storage. This works only up to a limit, as even the biggest server has limits on RAM, CPU and storage, and beyond a point we cannot add capacity. This scaling method is also very expensive, as bigger servers cost much more than commodity servers.
In horizontal scaling, bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU as the data is spread out. This also allows high throughput, as we can use the parallelism of resources. We also get the benefit of comparatively lower cost due to the use of commodity servers.
MongoDB supports horizontal scaling through sharding. It supports very large data sets and high throughput operations with sharding. In sharding data is distributed among several machines called shards.
A MongoDB sharded cluster consists of the following components:
Application data in a MongoDB sharded cluster is stored in shards. Each shard has a subset of the collection data, divided on the basis of the shard key which we define at the time of sharding a collection. These shards are usually deployed as replica sets. A query performed against a single shard returns only a subset of the data. Applications usually should not connect to individual shards; connections to individual shards should be made by administrators for maintenance purposes.
In a sharded cluster, applications should connect through mongos, which acts as a query router and as an interface between client applications and the sharded cluster. mongos fetches the metadata from the config servers regarding which data is on which shard and caches it. This metadata is then used by mongos to route queries to the appropriate shard. We should have multiple mongos instances for redundancy; they can either be deployed on separate servers or co-located with application servers. To reduce latency, it is recommended to deploy them on the application servers. These mongos instances utilize minimal server resources and do not have any persistent state.
All the metadata and configuration settings for the sharded cluster are stored in the config servers. The metadata shows which data is stored on which shard, the number of chunks, and the distribution of shard keys across the cluster. It is recommended to deploy the config server as a replica set. If the config server replica set has no primary at any time, the cluster cannot perform metadata changes and effectively becomes read-only for that period, so the config server replica set should be monitored and maintained just like the shards holding application data.
MongoDB sharded cluster has 3 components namely shards, mongos and config servers. We will deploy all components using the below process.
We need to start all the members of the shard replica sets with the --shardsvr option.
mongod --replSet "rs0" --shardsvr mongod --replSet "rs1" --shardsvr
Suppose we have 2 shards, each a 3-member replica set; all 6 members should be started with the above option. These shard members are deployed as replica sets on hosts h1 through h6 at port 27017.
sh1(M1, M2, M3 as replica set “rs0”) and sh2(M4, M5, M6 as replica set “rs1”)
We need to start all members of the config server replica set with the --configsvr option.
mongod --configsvr --replSet "cf1"
Config server (members c1, c2 and c3 as a replica set cf1) on hosts h7 and h8 at port 27017.
Start the mongos specifying the config server replica set name followed by a slash / and at least one of the config server hostnames and ports. Mongos is deployed on server h9 at port 27017.
mongos --configdb cf1/h7:27017,h8:27017,h9:27017
mongo h9:27017/admin
sh.addShard( "rs0/h1:27017,h2:27017,h3:27017" )
sh.addShard( "rs1/h4:27017,h5:27017,h6:27017" )
mongo h9:27017/admin
sh.enableSharding( "test" )
use test
db.test_collection.createIndex( { a : 1 } )
sh.shardCollection( "test.test_collection", { "a" : 1 } )
Shard key selection is an important aspect of the sharded cluster as it affects the performance and overall efficiency of a cluster. Chunk creation and distribution among several shards is based on the choice of the shard key. Ideally shard key should allow MongoDB to distribute documents evenly across all the shards in the cluster.
There are three main factors that affect the selection of the shard key:
Cardinality refers to the number of distinct values for a given shard key. Ideally, the shard key should have high cardinality, as it represents the maximum number of chunks that can exist in the cluster.
For example, suppose we have an application that was used only by members of a particular city and we are sharding on the state, we will have a maximum of one chunk as both upper and lower values of chunk would be that state only. And one chunk would only allow us to have one shard. Hence we need to ensure the shard key field has high cardinality.
If we cannot have a field with high cardinality we can increase the cardinality of our shard key by creating compound shard key. So in the above scenario, we can have shard key with a combination of state and name for ensuring cardinality.
Frequency: Apart from having a large number of different values for our shard key, it is important to have an even distribution for each value. If certain values occur more often than others, we may not have an equal distribution of load across the cluster, which limits the ability to handle scaled reads and writes. For example, suppose we have an application where the majority of people using it have the last name 'jones'; the throughput of our application would be constrained by the shard holding those values. Chunks containing these values grow larger and larger and may sometimes become jumbo chunks. These jumbo chunks reduce the ability to scale horizontally as they cannot be split. To address such issues, we should choose a good compound shard key; in the above scenario, we can add _id as an additional field so that no single compound shard key value occurs with high frequency.
Monotonically changing values: We should avoid shard keys on fields whose values are always increasing or decreasing, for example ObjectId in MongoDB, whose value always increases with each new document. In such a case, all writes go to the same chunk holding the upper-bound key; for monotonically decreasing values, writes go to the first chunk with the lower bound. We can still include such a field in a compound shard key as long as it is not the first field.
To backup sharded cluster we need to take the backup for config database and individual shards.
First, we would need to disable the balancer from mongos. If we do not stop the balancer, the backup could duplicate data or omit data as chunks migrate while recording backups.
use config
sh.stopBalancer()
For each shard replica set in the sharded cluster, connect a mongo shell to the secondary member’s mongod instance and run db.fsyncLock().
db.fsyncLock()
Connect to secondary of config server replica set and run
db.fsyncLock()
Now we will back up the locked config secondary member. We are using mongodump for the backup, but we can also use any other method like cp or rsync.
Once the backup is taken, we unlock the member so that it resumes applying the oplog from the config primary.
mongodump --oplog
db.fsyncUnlock()
Now we will back up the locked member of each shard. We are using mongodump for the backup, but we can also use any other method like cp or rsync.
Once the backup is taken, we unlock the member so that it resumes applying the oplog from the shard primary.
mongodump --oplog
db.fsyncUnlock()
Once we have the backups from the config servers and each shard, we re-enable the balancer by connecting to the config database.
use config
sh.setBalancerState(true)
We can broadly divide MongoDB authentication mechanisms into 2 parts: client/user authentication, which mainly deals with how clients of the database authenticate to MongoDB, and internal authentication, which is how different members of replica sets or sharded clusters authenticate with each other.
SCRAM-SHA-1, MONGODB-CR, X.509, LDAP and Kerberos
SCRAM-SHA-1 and MONGODB-CR are challenge/response mechanisms. From version 3.0, SCRAM-SHA-1 is the default mechanism and has replaced MONGODB-CR.
MongoDB currently supports two internal authentication mechanisms: keyfile authentication, which uses SCRAM-SHA-1, and X.509 authentication.
With keyfile authentication, the contents of keyfile essentially act as a shared password between the members of a replica set or sharded cluster. The same keyfile must be present on each member that talks to one another.
X.509 is another internal authentication mechanism that utilizes certificates to authenticate members to one another. While we can use the same certificate on all members, it is recommended to issue a different certificate to each member. This way, if one of the certificates is compromised, we only need to reissue and deploy that one certificate instead of having to update the entire cluster.
It's important to note that whenever we enable internal authentication, either with X.509 or with keyfile based authentication, this automatically will enable client authentication.
There are a few key differences while setting up authentication on a sharded cluster. To set up authentication we should connect to mongos instead of mongod. Also, clients who want to authenticate to the sharded cluster must do so through mongos.
Ensure sharded cluster has at least two mongos instances available as it requires restarting each mongos in the cluster. If the sharded cluster has only one mongos instance, this results in downtime during the period that the mongos is offline.
db.createUser({ user: "admin", pwd: "<password>", roles: [ { role: "clusterAdmin", db: "admin" }, { role: "userAdmin", db: "admin" }]});
security:
  transitionToAuth: true
  keyFile: <path-to-keyfile>
The new configuration file should contain all of the configuration settings previously used by the mongos as well as the new security settings.
Connect to the primary member of each shard replica set and create a user with the db.createUser() method.
db.createUser({ user: "admin1", pwd: "<password>", roles: [ { role: "clusterAdmin", db: "admin" }, { role: "userAdmin", db: "admin" }]});
This user can be used for maintenance activities on individual shards.
When deploying MongoDB in production, we should have a strategy for capturing and restoring backups in the case of data loss events. Below are the different backup options:
MongoDB Atlas, the official MongoDB cloud service, provides 2 fully-managed methods for backups:
MongoDB Cloud Manager and Ops Manager provide back up, monitoring, and automation service for MongoDB. They support backing up and restoring MongoDB replica sets and sharded clusters from a graphical user interface.
Back Up by Copying Underlying Data Files
MongoDB can also be backed up with operating system features which are not specific to MongoDB. Point-in-time filesystem snapshots can be used for backup if the volume where MongoDB stores its data files supports snapshots.
MongoDB deployments can also be backed up using the system commands cp or rsync in case the storage system does not support snapshots. It is recommended to stop all writes to MongoDB before copying database files, as copying multiple files is not an atomic operation.
mongodump is the utility with which we can take a backup of a MongoDB database in BSON file format. The backup files can then be used by the mongorestore utility for restoring to another database. mongodump reads the data document by document, which takes a lot of time, and so it is not recommended for very large deployments.
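A typical mongodump/mongorestore pair might look like this (host names and paths are illustrative):
mongodump --host host:27017 --db test --out /backups/
mongorestore --host otherhost:27017 /backups/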
Encryption plays a key role in securing any production environment. MongoDB offers encryption at-rest as well as transport encryption.
Transport encryption offers to encrypt information over the network traffic between the client and the server. MongoDB supports TLS/SSL (Transport Layer Security/Secure Sockets Layer) to encrypt all of MongoDB’s network traffic. TLS/SSL ensures that MongoDB network traffic is only readable by the intended client.
Encryption at rest encrypts the data on disk. This can be achieved either by encrypting at the storage engine level or at the application level. Application-level encryption is done at the application end and is similar to field-level masking as traditionally done in RDBMS systems.
Encrypted Storage Engine
MongoDB Enterprise 3.2 introduces a native encryption option for the WiredTiger storage engine. This allows MongoDB to encrypt data files such that only parties with the decryption key can decode and read the data.
The data encryption process includes generating a master key, generating keys for each database, encrypting the data with the database keys, and encrypting the database keys with the master key.
The encryption occurs transparently in the storage layer; i.e. all data files are fully encrypted from a file system perspective, and data only exists in an unencrypted state in memory and during transmission.
Application Level Encryption
Application Level Encryption provides encryption on a per-field or per-document basis within the application layer. To encrypt document or field level data, write custom encryption and decryption routines or use a commercial solution.
The MongoDB balancer is a background process that monitors the number of chunks on each shard. When the number of chunks on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate chunks between shards and reach an equal number of chunks per shard.
All chunk migrations use the following procedure: the balancer sends a moveChunk command to the source shard; the source shard starts the move while operations on the chunk continue to route to it; the destination shard builds any indexes it is missing and begins copying the documents in the chunk; after the final document is copied, the destination synchronizes any changes that occurred during the migration; the source shard then updates the cluster metadata on the config servers with the chunk's new location and, once there are no open cursors on it, deletes its copy of the documents.
The MongoDB WiredTiger storage engine uses both the WiredTiger internal cache and the file system cache for storing data. If we do not define the WiredTiger internal cache size, by default it utilizes the larger of either 256MB or 50% of (RAM - 1GB). For example, if a system has a total of 6GB RAM, 2.5GB (50% of (6GB - 1GB)) will be allocated to the WiredTiger internal cache. This default setting assumes that there is only one mongod process running; if we have multiple mongod instances on the server, we should decrease the WiredTiger internal cache size to accommodate the other instances.
WiredTiger also provides compression options for both collections and indexes by default. While snappy compression is used for collections, prefix compression is used for all indexes. We can set the compression at the database as well as collection and index level.
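These settings can be tuned in the configuration file; a sketch with illustrative values:
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1.5
    collectionConfig:
      blockCompressor: zlib
    indexConfig:
      prefixCompression: true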
Data in the WiredTiger internal cache has a different representation from the on-disk format, while the filesystem cache holds data in the same format as the data files on disk.
All free memory that is not used by wiredTiger cache or by any other process is automatically used by MongoDB filesystem cache.
Any query on a sharded cluster goes through mongos, which uses metadata from the config database about the chunk distribution to decide where to send it.
These queries are generally divided into broadly 2 groups:
Scatter gather queries:
Scatter-gather queries are the ones which do not include the shard key. Since there is no shard key, mongos does not know which shard to send the query to, so it sends it to all shards in the cluster. These queries are generally inefficient and are unfeasible for routine operations on large clusters.
Targeted queries:
If a query includes the shard key, mongos directs the query only to the specific shards that hold data for that shard key range. These queries are very efficient.
Now, in this case, we have a query with a shard key range of 15000 <= employeeid <= 70000, which is a subset of the data in the entire cluster, so it is a targeted query. Any shard holding employeeid values within this range will be queried. From the above sample, we can see the below shards fall within this range and will all be accessed by the query.
Shard0000
Shard0002
Shard0003
Shard0004
Shard0005
Shard0006
Shard0007
If MongoDB cannot split a chunk that exceeds the specified chunk size or contains a number of documents that exceeds the max, MongoDB labels the chunk as jumbo. If the chunk size no longer hits the limits, MongoDB clears the jumbo flag for the chunk when the mongos reloads or rewrites the chunk metadata.
But in some cases we need to follow the below process to clear the jumbo flag manually:
If the chunk is divisible, MongoDB removes the flag upon successful split of the chunk.
Process
Below output from sh.status(true) shows that chunk with shard key range { "x" : 2 } -->> { "x" : 4 } is jumbo.
--- Sharding Status ---
  ..................
  ..................
  test.foo
    shard key: { "x" : 1 }
    chunks:
      shard-b  2
      shard-a  2
    { "x" : { "$minKey" : 1 } } -->> { "x" : 1 } on : shard-b Timestamp(2, 0)
    { "x" : 1 } -->> { "x" : 2 } on : shard-a Timestamp(3, 1)
    { "x" : 2 } -->> { "x" : 4 } on : shard-a Timestamp(2, 2) jumbo
    { "x" : 4 } -->> { "x" : { "$maxKey" : 1 } } on : shard-b Timestamp(3, 0)
sh.splitAt( "test.foo", { x: 3 })
MongoDB removes the jumbo flag upon successful split of the chunk.
In some instances, MongoDB cannot split the no-longer-jumbo chunk, such as a chunk whose range is a single shard key value, and the preferred method to clear the flag is not applicable.
Process
mongodump --db config --port <config server port> --out <output file>
In the chunks collection of the config database, unset the jumbo flag for the chunk. For example,
db.getSiblingDB("config").chunks.update( { ns: "test.foo", min: { x: 2 }, jumbo: true }, { $unset: { jumbo: "" } } )
After the jumbo flag has been cleared out from the chunks collection, update the cluster routing metadata cache.
db.adminCommand( { flushRouterConfig: "test.foo" } )
This is a common yet one of the most important MongoDB basic interview questions; don't miss this one.
Monitoring is a critical component of all database administration. A firm grasp of MongoDB’s reporting will allow us to assess the state of the database and maintain deployment without crisis.
Below are some of the utilities used for MongoDB monitoring.
The mongostat utility provides a quick overview of the status of a currently running mongod or mongos instance. mongostat is functionally similar to the UNIX/Linux utility vmstat but provides data regarding mongod and mongos instances.
In order to run mongostat, the user must have the serverStatus privilege action on the cluster resource.
E.g., to run mongostat every 2 minutes (120 seconds), the below command can be used.
mongostat 120
mongotop provides a method to track the amount of time a MongoDB instance (mongod) spends reading and writing data. mongotop provides statistics at a per-collection level. By default, mongotop returns values every second.
E.g., to run mongotop every 30 seconds, the below command can be used.
mongotop 30
MongoDB includes a number of commands that report on the state of the database.
The serverStatus command, or db.serverStatus() from the shell, returns a general overview of the status of the database, detailing disk usage, memory use, connections, journaling, and index access. The command returns quickly and does not impact MongoDB performance.
The dbStats command, or db.stats() from the shell, returns a document that addresses storage use and data volumes. The dbStats reflect the amount of storage used, the quantity of data contained in the database, and the object, collection, and index counters.
We can use this data to monitor the state and storage capacity of a specific database. This output also allows us to compare usage between databases and to determine the average document size in a database.
The collStats command, or db.collection.stats() from the shell, provides statistics that resemble dbStats at the collection level, including a count of the objects in the collection, the size of the collection, the amount of disk space used by the collection, and information about its indexes.
The replSetGetStatus command (rs.status() from the shell) returns an overview of the replica set's status. The replSetGetStatus document details the state and configuration of the replica set and statistics about its members.
This data can be used to ensure that replication is properly configured, and to check the connections between the current host and the other members of the replica set.
Apart from the above tools, MongoDB also provides options for GUI-based monitoring with Ops Manager and Cloud Manager. These are very efficient and are mostly used in large enterprise environments.
Security is very important for any production database. MongoDB provides best practices to harden our MongoDB deployment. This list of best practices should act as a security checklist before we give the green light to any production deployment.
The balancer is a background process that runs on the primary of the config server replica set in a cluster. It constantly monitors the number of chunks on each shard, and if the difference in chunks between shards crosses the migration threshold, it automatically migrates chunks between shards so that chunks are evenly distributed. The balancer migrates chunks from shards with more chunks to shards with fewer chunks. For example, suppose we have 2 shards [shard01, shard02] with 4 and 5 chunks respectively. Now suppose there is a need to add another shard [shard03]. Initially, shard03 will have no chunks. The balancer will notice this uneven distribution and migrate chunks from shard01 and shard02 to shard03 until all 3 shards have three chunks each.
There might be a performance impact when the balancer migrates chunks, as migrations carry some overhead in terms of bandwidth and workload, which can impact database performance. To minimize this impact, the balancer migrates only one chunk at a time from a given shard, starts migrating only when the chunk difference crosses the migration threshold, and can be restricted to run only within a configured balancing window.
Impact of Adding and Removing Shards on a balancer
Adding or removing a shard from the cluster creates an imbalance, as either the new shard has no chunks or the removed shard's chunks need to be redistributed throughout the cluster. In case a shard is removed from a cluster with uneven chunk distribution, the balancer drains the chunks from that shard before balancing the remaining uneven chunks. When the balancer notices this imbalance it starts the chunk migration process immediately. The migration process takes time to complete.
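A rough sketch of what adding a shard and checking the balancer looks like in the mongo shell (the replica set name and host below are placeholder assumptions):
sh.addShard( "shard03/shard03-host1:27017" )
sh.getBalancerState()    // true when the balancer is enabled
sh.isBalancerRunning()   // true while a balancing round is in progress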
MongoDB creates oplog entries for each write operation on the primary, and these are then replicated to the secondaries. MongoDB uses asynchronous replication and automatic failover to do this efficiently.
Oplog entries from the primary are applied to the secondaries asynchronously. This helps applications continue without downtime despite the failure of members. MongoDB deployments usually run on commodity servers, and with commodity servers, synchronous replication would add acknowledgement latency on the order of 100 ms, which is quite high. For this reason, MongoDB prefers asynchronous replication.
From version 4.0.6, MongoDB provides the capability to log entries of slow oplog operations for secondary members of a replica set. These slow oplog messages are logged for the secondaries in the diagnostic log under the REPL component. These slow oplog entries do not depend on log levels or profiling level but depend only on the slow operation threshold. The profiler does not capture slow oplog entries.
Many traditional databases follow a master-slave setup, but in case of a master failure we have to manually cut over to a slave database. In MongoDB, we can have one primary with multiple secondaries. With only a few servers we could still afford a manual cutover, but a big-data MongoDB deployment may have 100 shards and it is impossible to cut over manually every time. So MongoDB has automatic failover. When the primary is unable to communicate with the other members for more than the configured time (electionTimeoutMillis), an eligible secondary triggers an election to nominate itself as primary. Until the new primary is elected the cluster cannot serve write requests and can only serve read requests. Once the new primary is elected the cluster resumes normal operations.
The architecture of the cluster should be designed keeping in mind network latency and the time required for replica sets to complete elections, as they affect the time our cluster runs without a primary.
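For example, electionTimeoutMillis can be tuned through the replica set settings (the value below is only illustrative; the default is 10000 ms):
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 12000
rs.reconfig(cfg)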
Indexes help in improving the performance of queries. Without indexes, a query must perform a collection scan where each and every document of the collection is scanned for the desired query result. With proper indexes, we can limit the number of documents scanned, thus improving query performance.
Like collections, indexes also use storage as they store a small portion of the collection's data. For example, if we create an index on the field 'name', it will store data for this field in ascending or descending order, which also helps sort operations. Using indexes, we can satisfy equality matches and range-based queries more efficiently.
Some of the different index options available for MongoDB are:
By default, MongoDB creates an index on the _id field at the time of creating a collection. This is a unique index and prevents applications from inserting multiple documents with the same value for the _id field. MongoDB ensures that this index cannot be deleted.
These are indexes on any single field or on a combination of fields, e.g.
db.records.createIndex( { score: 1 } ) – index on the single field "score"
db.products.createIndex( { "item": 1, "stock": 1 } ) – compound index on the combination of "item" and "stock"
MongoDB provides the option of creating an index on the contents stored in arrays. For every element of the array, a separate index entry is created. We can select matching elements of the array using multikey indexes more efficiently.
MongoDB also provides geospatial indexes, which help to efficiently query geospatial coordinate data: 2d indexes for planar geometry and 2dsphere indexes for spherical geometry.
To support string content search in a collection, MongoDB provides the text index. These indexes only store root words while ignoring language-specific stop words like 'the', 'a', etc.
To index only the documents matching a specified filter expression, partial indexes are used. Since they store only a subset of the documents in a collection, they have lower storage requirements, and index creation, maintenance and performance costs are also lower.
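A small sketch of creating a partial index (the collection and field names here are made up for illustration):
db.restaurants.createIndex( { cuisine: 1 }, { partialFilterExpression: { rating: { $gt: 5 } } } )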
Sparse indexes contain entries only for documents that have the indexed field, skipping documents where the field is absent. This keeps the index smaller when the field is present in only a portion of the documents.
Certain applications have requirements where documents need to be removed automatically after a certain amount of time. We can achieve this using TTL indexes. We specify a TTL (time to live) for the documents, after which a background process runs and removes these documents. This index is ideal for logs, session data and event data, as such data only needs to persist for a limited time.
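A sketch of a TTL index that would remove documents one hour after their createdAt timestamp (the collection and field names are illustrative):
db.eventlog.createIndex( { createdAt: 1 }, { expireAfterSeconds: 3600 } )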
It is important to maintain data consistency in any database, especially when multiple applications access the same piece of data simultaneously. MongoDB uses locking and other concurrency control measures to ensure consistency. Multiple clients can read and write the same data while MongoDB ensures that all writes to a single document either occur in full or not at all, so that clients never see inconsistent data.
Effect of sharding on concurrency
In sharding, collections are distributed among several shard servers, which improves concurrency. The mongos process routes many operations concurrently to different shards and combines the results before sending them back to the client.
In a sharded cluster locking is at individual shard level rather than cluster level so the operations in one shard do not block other shard operations. Each shard uses its own locks independent of other shards in the cluster.
Effect of replication on concurrency
In a MongoDB replica set, each operation on the primary is also written to a special capped collection in the local database called the oplog. So every time an application writes to MongoDB, it locks both databases, i.e., the collection's database and the local database. Both databases must be locked at the same time to maintain consistency, ensuring that even with replication, write operations keep their 'all-or-nothing' property.
In MongoDB replication, the application does not write to the secondaries; the secondaries receive writes from the primary in the form of the oplog. These oplog entries are not applied one by one but are collected in batches, and the batches are applied in parallel. The write operations are applied in the same order as they appear in the oplog. While an oplog batch is being applied, the secondary does not allow reads on the data being applied, to maintain consistency.
MongoDB uses replication to provide high availability and redundancy, which are the basis for any production database. With replica sets, we can achieve HA as well as DR capability. Replication also enables horizontal scaling, allowing the use of commodity servers instead of enterprise servers. With proper configuration, replication can prevent downtime even if an entire data center goes down.
There are several types of replica set members based on the requirement: primary, secondary, arbiter, priority 0, hidden, and delayed members.
This, along with other interview questions on MongoDB for freshers, is a regular feature in MongoDB interviews, be ready to tackle it with the approach mentioned.
We can change the configuration of the replica set as per the requirements of the application. Configuration changes may include adding a new member, adding an arbiter, removing a member, changing priority or votes for members, or changing a member from a normal secondary to a hidden or delayed member.
To add a new member, first we need to start the mongod process with the --replSet option on the new server, and then run:
rs.add( "hostname:port" )
Once added member will fetch the data from primary using initial sync and replication synchronism.
To add an arbiter, run:
rs.addArb( "hostname:port" )
To remove a member, run:
rs.remove( "hostname:port" )
As a good practice, we should shut down the member being removed before running the above command.
Other configuration changes are applied with:
rs.reconfig(newConfig)
Reconfig can be explained better with the below examples. Suppose we have a replica set "rs0".
From Primary:
To change the priority of member 1 to 2:
cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)
To remove the vote of member 2:
cfg = rs.conf()
cfg.members[2].votes = 0
rs.reconfig(cfg)
To convert member n to a hidden, delayed member (delayed by 1 hour):
cfg = rs.conf()
cfg.members[n].priority = 0
cfg.members[n].hidden = true
cfg.members[n].slaveDelay = 3600
rs.reconfig(cfg)
To convert member n to a hidden member:
cfg = rs.conf()
cfg.members[n].priority = 0
cfg.members[n].hidden = true
rs.reconfig(cfg)
In MongoDB, data is stored as JSON-like documents. These documents can have different sets of fields, with a different data type for each field. For example, a single collection can hold documents whose fields are a number, a string and an array:
{ "a" : 143 }
{ "name" : "john" }
{ "x" : [1,2,3] }
It is not correct to say MongoDB is schemaless; in fact, schema plays an important role in the design of MongoDB applications. MongoDB has a dynamic schema, with a database structure consisting of collections and indexes. Collections can be created either implicitly or explicitly.
Due to the dynamic behaviour of the schema, MongoDB has several advantages over RDBMS systems.
Schema migrations become very easy. In traditional systems we had to use the ALTER TABLE command after adding any column, which could result in downtime. In MongoDB, such adjustments are transparent and automatic: for example, if we want to add a CITY field to the people collection, we can simply add the attribute and resave the document. In a traditional system we would have to run ALTER TABLE followed by a reorg, which would require downtime.
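For instance, backfilling the new field for all existing documents could look like this (the collection name and default value are assumptions for illustration):
db.people.updateMany( {}, { $set: { city: "Unknown" } } )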
The first part of the query would give all documents where y >= 10, so we have 2 documents, i.e.
d> { "_id" : 4, "x" : 4, "y" : 10 } e> { "_id" : 5, "x" : 5, "y" : 75 }
Now the second part of the query would update the value of y for the above 2 documents to 75, but document e already has the value y: 75 and so will not be updated.
Finally, only 1 document will be updated by the provided query:
d> { "_id" : 4, "x" : 4, "y" : 10 }
Every operation on the primary is logged in the operation log, known as the oplog. These oplog entries are then replicated to the secondaries. For a healthy replica set, it is recommended that all members are in sync with no replication lag. Data is first written on the primary by the applications and then replicated to the secondaries. This synchronization is important to maintain up-to-date copies of data on all members. Synchronization happens in 2 ways: initial sync and continuous replication.
The oplog is an operation log that keeps a record of all operations that modify the data stored in the databases. We can define the oplog size when starting MongoDB by specifying the --oplogSize option. If we do not specify this option it takes the default value, which for WiredTiger is 5% of free disk space. While the default value is sufficient for most workloads, in some cases we may need to change the oplog size for the replica set.
Oplog size is changed in a rolling manner: first we change it on all secondaries, and then on the primary member of the replica set. To change the oplog size, connect to the member and run:
use local
db.oplog.rs.stats().maxSize
db.adminCommand( { replSetResizeOplog: 1, size: <size-in-MB> } )
MongoDB applies database operations on the primary and then records the operations in the primary's oplog. The secondary members then copy and apply these operations in an asynchronous process. There is a separate oplog entry for each operation.
First, let’s check how many rows the query would fetch by changing delete to find operation.
db.sample.find( { state : "WA" } )
This will give all the documents where the state is "WA".
{"firstName" : "Arthur", "lastName" : "Aaronson", "state" : "WA", "city" : "Seattle", "likes" : [ "dogs", "cats" ] }
{"firstName" : "Beth", "lastName" : "Barnes", "state" : "WA", "city" : "Richland", "likes" : [ "forest", "cats" ] }
{"firstName" : "Dawn", "lastName" : "Davis", "state" : "WA", "city" : "Seattle", "likes" : [ "forest", "mountains" ] }
Now, ideally a delete should remove all matching documents, but the query says deleteOne.
If the query had said deleteMany, then all the matching documents would have been deleted and there would be 3 oplog entries, but deleteOne removes only the first matching document. So 1 oplog entry will be generated by the provided query.
Idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application. In MongoDB, the oplog is idempotent, meaning that even if entries are applied multiple times, the same output is produced. So if the server goes down and we need to reapply oplog entries, there will not be any inconsistency; even if entries that were already applied are replayed, the database's end state does not change.
Also, there was a desire to have the new state of a document be independent of the previous state. For this, all operators that rely on the previous state to determine a new value need to be transformed to record the actual resulting values. For example, if an addition operation results in modifying a value from '21' to '30', the operation is rewritten to set the value '30' on the field. Replaying the operation multiple times then produces the same result.
In MongoDB we can read from the primary as well as the secondary members of a replica set. This behaviour can be controlled by defining the desired read preference, which determines to which members clients route read operations. If we do not specify any read preference, by default MongoDB reads from the primary. There are situations where you would want to reduce the load on your primary by forcing applications to read from secondaries.
Below are the different MongoDB read preference modes (an example of setting one follows the list):
primary: This is the default mode. Applications read from the replica set primary.
primaryPreferred: In this mode, applications read from the primary, but if the primary is not available they read from a secondary.
secondary: All applications read from the secondary members of the replica set.
secondaryPreferred: In this mode, applications read from secondaries, but if no secondary member is available they read from the primary.
nearest: In this mode, applications read from the member that is nearest to them in terms of network latency, irrespective of whether that member is primary or secondary.
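As an example, a read preference can be set per query in the shell or in the connection string (the collection, hosts and mode below are illustrative):
db.orders.find( { status: "shipped" } ).readPref( "secondaryPreferred" )
// or in the connection string:
// mongodb://host1:27017,host2:27017/?replicaSet=rs0&readPreference=secondaryPreferred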
Shard key selection is based on the workload. Since the first query is used 90% of the time, it should drive the selection of the shard key.
A combination of fields from that query would make the best shard key. This eliminates options b, c and d.
Options a and e both use a subset of fields from the most-used query, so both could serve as shard keys, but the option that includes more of those fields would be more suitable.
One of the most frequently posed MongoDB scenario based interview questions, be ready for this conceptual question.
Chunk split operations are carried out automatically by the system when an insert operation causes a chunk to exceed the maximum chunk size. The balancer then migrates recently split chunks to other shards. But in some cases we may want to pre-split the chunks manually, for example when data is about to be bulk-loaded into an empty sharded collection, where automatic splitting and balancing would otherwise lag behind the inserts.
To split chunks manually we can use the split command with the helpers sh.splitFind() and sh.splitAt().
Example:
To split the chunk of the employee collection on the employeeid field at the value 713626, the below command should be used.
sh.splitAt( "test.employee", { "employeeid": 713626 } )
We should be careful while pre-splitting chunks as it can sometimes lead to a collection with unevenly sized chunks.
In some cases, chunks can grow beyond the specified chunk size but cannot undergo a split. The most common scenario is when a chunk represents a single shard key value. Since the chunk cannot split, it continues to grow beyond the chunk size, becoming a jumbo chunk. These jumbo chunks can become a performance bottleneck as they continue to grow, especially if the shard key value occurs with high frequency.
The addition of new data or new shards can result in data distribution imbalances within the cluster. A particular shard may acquire more chunks than another shard, or the size of a chunk may grow beyond the configured maximum chunk size.
MongoDB ensures a balanced cluster using two processes: chunk splitting and the balancer.
In replication we have multiple copies of the same data kept in sync with each other; it is mainly used for high availability. In sharding we divide the entire dataset into small chunks and distribute them among several servers. Sharding is used where we have some sort of hardware bottleneck or where we want the benefits of query parallelism. If our dataset is very small, sharding does not provide many advantages, but as the data grows we should move to sharding.
Below are a few of the situations where sharding is recommended over replication.
Breaking the dataset over shards means each shard has more resources available to handle the subset of data it owns, and operations that move data across machines for replication, backups and restores are also faster.
Loading every document into RAM means that the query is not using an index efficiently and has to fetch documents from disk into RAM.
For a query to use an index, the initial match in the find statement should use either the index or an index prefix.
The below query matches on the key b, which does not use any existing index or index prefix, so it has to fetch documents from disk.
db.sample.find( { b : 1 } ).sort( { c : 1, a : 1 } )
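One way this query could use an index (a sketch only; the right choice depends on the full workload) is a compound index whose prefix is b, which also supports the sort on c and a:
db.sample.createIndex( { b: 1, c: 1, a: 1 } )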
If we start the Mongo server using mongod with the --auth option it will enable authorization, but we will not be able to do anything; even listing databases using show dbs will fail with the error "not authorized on admin to execute the command". This is because we have not yet authenticated to the database. But this is a new database with no users in it, so how can we create new users with no authorization?
MongoDB provides the localhost exception for creating the first user on the database without any authorization. With this first user, we can create other users with relevant access.
But there are a few considerations: the connection must come from the same machine (localhost), and the exception is intended only for creating the first user or role, after which all further access requires proper authentication, so that first user should be created with user-administration privileges.
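A minimal sketch of creating that first administrative user through the localhost exception (the username and password are placeholders):
use admin
db.createUser( { user: "admin", pwd: "strongPassword", roles: [ { role: "userAdminAnyDatabase", db: "admin" } ] } )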
MongoDB follows the Role-Based Access Control (RBAC) authorization model. In this model, users are assigned one or more roles which give them access to database resources and operations. Apart from these role assignments, users do not have any access to the resources. When we enable internal authentication it automatically enables client authorization. Authorization can be enabled by starting mongod with the --auth option or by providing the security.authorization setting in the config file.
A privilege pairs a resource with the actions permitted on it; groups of privileges form roles that can be assigned to users.
MongoDB provides several built-in roles like Database User Roles, Database Administration Roles, Cluster Administration Roles, Backup and Restoration Roles, All-Database Roles and Superuser Roles, but we can also create our own custom roles based on the requirement, which are called user-defined roles.
For example, suppose we want a role that only allows running monitoring commands such as mongostat; we can create the below user-defined role.
use admin
db.createRole( { role: "mongostatRole", privileges: [ { resource: { cluster: true }, actions: [ "serverStatus" ] } ], roles: [] } )
If secondaries are falling behind, they are experiencing replication lag, which is a delay in the application of the oplog from the primary to the secondaries. Replication lag can be a significant issue and can seriously affect MongoDB replica set deployments. Excessive replication lag makes a member ineligible to quickly become primary and increases the possibility that distributed read operations will be inconsistent.
We can check the replication lag by calling the rs.printSlaveReplicationInfo() method or by running the rs.status() command.
Possible causes of replication lag include network latency between members, insufficient disk throughput on the secondaries, long-running operations on the primary, and heavy bulk writes issued without an appropriate write concern.
A staple in MongoDB technical interview questions, be prepared to answer this one using your hands-on experience.
The storage engine is the component that lies between the database and the storage layer and is primarily responsible for managing data. MongoDB provides a few choices of storage engines, enabling us to use the one best suited for our application. Choosing the appropriate storage engine can significantly impact performance.
WiredTiger replaced MMAPv1 in version 3.2 to become the default storage engine. If we install MongoDB and do not specify any storage engine, WiredTiger is used. As it provides a document-level concurrency model, checkpointing and compression, it is suited for most workloads. It also supports encryption at rest in MongoDB Enterprise.
Various applications require predictable latencies which can be achieved by storing the documents in memory rather than on disk. In-Memory Storage Engine is helpful for such applications. It is available only in MongoDB enterprise edition.
MongoDB started with the MMAPv1 storage engine, but it suited only a specific subset of use cases and has been deprecated since version 4.0.
Concurrency in MongoDB differs between storage engines. While WiredTiger uses document-level concurrency control for write operations, MMAPv1 locks at the collection level.
In WiredTiger, locking is at the document level, so multiple clients can modify different documents of a collection at the same time; it uses optimistic concurrency control for most reads and writes. Instead of exclusive locks, it uses only intent locks at the global, database and collection levels. If the storage engine detects a conflict between operations, one of them incurs a write conflict, causing MongoDB to transparently retry that operation.
Sometimes global 'instance-wide' locks are required for operations that involve multiple databases, and an exclusive database lock is taken for operations such as dropping a collection.
MMAPv1 still uses a collection-level lock, meaning that if we have 3 applications writing to the same collection, 2 have to wait until the first application completes, as it holds a collection-level write lock.
MongoDB's WiredTiger engine ensures data durability with journaling and checkpoints. Journals are write-ahead logs, while checkpoints are point-in-time snapshots.
In WiredTiger, at the start of each checkpoint a point-in-time snapshot is taken which presents a consistent view of the in-memory data. WiredTiger then writes all snapshot data to disk in a consistent way across all data files. This data on disk is durable and acts as a checkpoint in the data files. The checkpoint ensures the data files are consistent up to and including the last checkpoint.
These checkpoints usually occur every 60 seconds, so we have a consistent on-disk snapshot at roughly 60-second intervals, which ensures durability of checkpointed data.
The journal is a write-ahead log which persists all data changes between two checkpoints. If data written between checkpoints is needed for recovery, these journal files can be used. The journal files act as crash-recovery files in case of interruptions; once the system is back up, they can be replayed for recovery.
In MongoDB, data set consistency is ensured by locking. In any database system, long-running queries degrade performance as other requests and operations have to wait for a lock. Locking issues are often intermittent and so need to be investigated promptly.
MongoDB provides tools and utilities to troubleshoot these locking issues. The serverStatus command gives us a view of the system, including locking-related information. We should look at the locks and globalLock sections of the serverStatus output when troubleshooting locking issues.
We can use below commands to filter locking related information from the serverStatus output.
db.serverStatus().globalLock
db.serverStatus().locks
To get the approximate average wait time for a lock mode, we can divide locks.timeAcquiringMicros by locks.acquireWaitCount.
To check the number of times deadlocks occurred locks.deadlockCount should be checked.
If the application performance is constantly degraded there might be concurrency issues, in such cases, we should look at globalLock.currentQueue.total. A high value indicates concurrency issues.
Sometimes globalLock.totalTime is high relative to uptime which suggests database has been in a lock state for a significant time.
Indexes are important to consider while designing databases as they impact application performance. Without an index, a query performs a collection scan in which all the documents of the collection are scanned one by one to find those matching the query. With an index for a particular query, MongoDB can limit the number of documents scanned, as indexes store the values of the indexed fields in sorted (ascending or descending) order.
While indexes help improve the performance of find operations, for write operations like insert and update there can be a significant negative impact from adding indexes: with each write, MongoDB also needs to update every index associated with the collection. This is overhead on the system, and we may end up with performance degradation.
So while find() performance will improve, updateOne and insertOne operations will degrade, because with every update or insert the related indexes need to be updated as well.
A staple in senior MongoDB interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a MongoDB developer.
MongoDB provides several utilities for data movement, like mongodump, mongoexport etc. mongodump exports the contents of a database or collection to external files in BSON (binary) format. The contents exported by this method can then be used by the mongorestore command to restore into another database or a different collection. mongodump does not dump the contents of indexes; indexes are rebuilt when the data is restored. Since the contents are exported in binary format, this method cannot be used for exporting to a CSV file.
To export contents in JSON or CSV format we can use the mongoexport command. The exported collection can then be restored using the mongoimport command. Since mongoexport cannot export in BSON, the rich BSON data types are not preserved while exporting the data. For this reason, mongoexport should be used with careful consideration.
Below is the command that can be used for the same.
mongoexport --host host:27017 -d test -c sample --type=csv -f fields -o sample.csv
In a sharded cluster we may have a database which has sharded as well as non-sharded collections. While sharded collections are spread across the shards, all the unsharded collections are stored on a single shard known as the primary shard. Every database in the sharded cluster has its own primary shard. When we create any new database, mongos picks the shard with the least amount of data in the cluster and marks it as the primary shard.
If there is a need to change the primary shard we can do so by using the movePrimary command. This migration may take significant time to complete, and we should not access any collections associated with the migrating database until the process completes. Also, the migration of the primary shard should be done during a quiet period, as it may impact the performance of the overall cluster.
E.g., to migrate the primary shard of the accounts database to shard0007, the below command should be used.
db.adminCommand( { movePrimary : "accounts", to : "shard0007" } )
When we create any collection in MongoDB, a unique index on the _id field is created automatically. This unique index prevents applications from inserting multiple documents with the same value for the _id field. This is enforced by the system and we cannot drop this index on the _id field. Moreover, in replica sets the unique _id values are used in the oplog to reference the documents to update.
In a sharded cluster, if we do not have unique _id values across the sharded collection, chunk migrations may fail: when documents migrate to another shard, any documents with identical _id values will not be inserted into the receiving shard. In such cases, we should code the application so that it ensures uniqueness of _id for a given collection across all shards in the sharded cluster.
If we use _id as the shard key, this uniqueness of values is automatically enforced by the system, since each chunk range is assigned to a single shard and that shard enforces uniqueness of the values in its range.
MongoDB provides a database profiler which captures and stores detailed information about commands executed on a running instance. Captured details include CRUD operations, administrative commands and configuration commands. The data collected by the profiler is stored in the system.profile collection of the profiled database.
By default, the profiler is turned off. We can enable it and set it to different profiling levels based on the requirements. It provides 3 profiling levels: 0 – the profiler is off and collects no data (the default), 1 – the profiler collects data only for operations slower than the slow-operation threshold, and 2 – the profiler collects data for all operations.
To capture slow-running queries, we can start the profiler with profiling level 1 or 2; the default slow-operation threshold is 100 milliseconds. We can change this threshold by specifying the slowms option.
E.g., to enable a profiler which captures all queries slower than 50 ms, the below command should be used:
db.setProfilingLevel(1, { slowms: 50 })
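Captured slow operations can then be inspected from the profiled database, for example:
db.system.profile.find( { millis: { $gt: 50 } } ).sort( { ts: -1 } ).limit( 5 )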
Consider the following compound index
{ "accountHolder": 1, "accountNumber": 1, "currency": 1 }
The index prefixes are
{ accountHolder: 1 }
{ accountHolder: 1, accountNumber: 1 }
The query planner will use this index if the query includes accountHolder alone, accountHolder together with accountNumber, or all three fields (accountHolder, accountNumber and currency).
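For example, queries like the following could use the index (assuming the collection is named accounts):
db.accounts.find( { accountHolder: "John" } )
db.accounts.find( { accountHolder: "John", accountNumber: "1001" } )
db.accounts.find( { accountHolder: "John", accountNumber: "1001", currency: "USD" } )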
The $addToSet operator should be used with the $each modifier for this. The $each modifier allows the $addToSet operator to add multiple values to the array field.
For example, startups are tagged as per the technology skills that they excel in:
{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }
Now the startup needs to be updated with additional skills:
db.startups.update( { _id: 5 }, { $addToSet: { skills: { $each: [ "Machine Learning", "RPA" ] } } } )
The resultant document after update()
{ _id: 5, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud", "Machine Learning", "RPA" ] }
Note: There is no particular ordering of elements in the modified set; $addToSet does not guarantee one. Duplicate items will not be added.
When "fast reads" are the single most important criterion, embedded documents can be the best way to model one-to-one and one-to-many relationships.
Consider the example of certifications awarded to an employee, in the below example the certification data is embedded in the employee document which is a denormalized way of storing data
{ _id: "10", name: "Sarah Jones", certifications: [ { certification: "Certified Project Management Professional", certifying_auth: "PMI", date: "06/06/2015" }, { certification: "Oracle Certified Professional", certifying_auth: "Oracle Corporation", date: "10/10/2017" } ] }
In a normalized form, there would be a reference to the employee document from the certificate document, example
{ employee_id: "10", certification: "Certified Project Management Professional", certifying_auth: "PMI", date: "06/06/2015" }
Embedded documents are best used when the entire relationship data needs to be frequently retrieved together. Data can be retrieved via single query and hence is much faster.
Note: Embedded documents should not grow unbounded, otherwise it can slow down both read and write operations. Other factors like consistency and frequency of data change should be considered before making the final design decision for the application.
MongoDB has the db.collection.explain() and cursor.explain() helpers and the explain command to provide information on the query plan. The results of explain contain a lot of information, the key items being the winning plan under queryPlanner (whether an index or a collection scan is used) and, under executionStats, the number of documents returned (nReturned), the number of index keys and documents examined (totalKeysExamined, totalDocsExamined), and the execution time (executionTimeMillis).
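For example, an execution-stats explain looks like this (the collection and filter are illustrative):
db.startups.find( { skills: "AI" } ).explain( "executionStats" )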
Recursive queries can be performed within a collection using $graphLookUp which is an aggregate pipeline stage.
If a collection has a self-referencing field, like the classic example of the manager of an employee, then a query to get the entire reporting structure under manager "David" would look like this:
db.employees.aggregate( [ { $graphLookup: { from: "employees", startWith: "David", connectFromField: "name", connectToField: "manager", as: "Reporting Structure" } } ] )
For the following documents in the employee collection,
{ "_id" : 4, "name" : "David", "manager" : "Sarah" }
{ "_id" : 5, "name" : "John", "manager" : "David" }
{ "_id" : 6, "name" : "Richard", "manager" : "John" }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard" }
The above $graphLookup traversal starting from "David" would recursively find the following 3 documents:
{ "_id" : 5, "name" : "John", "manager" : "David", … }
{ "_id" : 6, "name" : "Richard", "manager" : "John", … }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard", … }
The hierarchy starts with "David", which is specified in startWith, and from there the data for each member in that reporting hierarchy is fetched recursively.
The $graphLookup stage looks like this for a query on the employees collection where "manager" is the self-referencing field:
db.employees.aggregate( [ { $graphLookup: { from: "employees", startWith: "David", connectFromField: "name", connectToField: "manager", as: "Reporting Structure" } } ] )
The value of as, which is "Reporting Structure" in this case, is the name of the array field that holds the documents found by the recursive $graphLookup search for each output document.
For the following documents in the employee collection,
{ "_id" : 4, "name" : "David", "manager" : "Sarah" }
{ "_id" : 5, "name" : "John", "manager" : "David" }
{ "_id" : 6, "name" : "Richard", "manager" : "John" }
{ "_id" : 7, "name" : "Stacy", "manager" : "Richard" }
Since startWith here is the literal value "David", the traversal is the same for every input document, so each output document carries the same "Reporting Structure" array, for example:
{ "_id" : 5, "name" : "John", "manager" : "David", "Reporting Structure" : [ { "_id" : 5, "name" : "John", "manager" : "David" }, { "_id" : 6, "name" : "Richard", "manager" : "John" }, { "_id" : 7, "name" : "Stacy", "manager" : "Richard" } ] }
Yes, there is a much simpler way of achieving this without doing it programmatically. The $unwind operator deconstructs an array field, outputting a document for each element.
Consider user “John” with multiple addresses
{ "_id" : 1, "name" : "John", addresses: [ "Permanent Addr", "Temporary Addr", "Office Addr" ] }
db.users.aggregate( [ { $unwind : "$addresses" } ] )
would result in 3 documents, one for each of the addresses
{ "_id" : 1, "name" : "John", "addresses" : "Permanent Addr" }
{ "_id" : 1, "name" : "John", "addresses" : "Temporary Addr" }
{ "_id" : 1, "name" : "John", "addresses" : "Office Addr" }
This is one of the most frequently asked MongoDB interview questions for freshers in recent times.
MongoDB supports capped collections, which are fixed-size collections. Once the allocated space is filled up, space is made for new documents by removing (overwriting) the oldest documents. The insertion order is preserved, and if a query does not specify any ordering then the ordering of results is the same as the insertion order. The oplog.rs collection is a capped collection, ensuring that the collection of logs does not grow infinitely.
A query that can return its entire result using only the index is called a covered query. This is one of the optimization techniques that can be used for faster retrieval of data. A query can be a covered query only if all the fields in the query filter are part of an index and all the fields returned in the results are in the same index (with _id excluded from the projection unless it is part of the index).
Since everything is part of the index, there is no need for the query to check the documents for any information.
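A minimal sketch of a covered query (the collection and field are illustrative); note that _id is excluded in the projection so that the index alone can satisfy the query:
db.users.createIndex( { name: 1 } )
db.users.find( { name: "John" }, { name: 1, _id: 0 } )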
Expect to come across this, one of the most important MongoDB interview questions and answers for experienced in Database management, in your next interviews.
Multikey indexes can be used for supporting efficient querying against array fields. MongoDB creates an index key for each element in the array.
Note: MongoDB will automatically create a multikey index if any indexed field is an array, no separate indication required.
Consider the startups collection with array of skills
{ _id: 1, name: "XYZ Technology", skills: [ "Big Data", "AI", "Cloud" ] }
Multikey indexes allow us to search on the values in the skills array
db.startups.createIndex( { skills : 1 } )
The query db.startups.find( { skills : "AI" } ) will use this index on skills to return the matching document
All the 3 projection operators, i.e., $, $elemMatch, $slice are used for manipulating arrays. They are used to limit the contents of an array from the query results.
For example, db.startups.find( {}, { skills: { $slice: 2 } } ) selects the first 2 items from the skills array for each document returned.
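Similarly, a sketch using $elemMatch to return only the first matching array element (reusing the startups example):
db.startups.find( {}, { skills: { $elemMatch: { $eq: "AI" } } } )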
Starting in version 4.0, multi-document transactions are possible in MongoDB. Earlier to this version, atomic operations were possible only on a single document.
With embedded documents and arrays, data in the documents are generally denormalized and stored in a single structure. With this as the recommended data model, MongoDB's single document atomicity is sufficient for most of the applications.
Multi-document transactions now enable the remaining small percentage of applications which require this (due to related data spread across documents) to depend on the database to handle transactions automatically rather than implement this programmatically into their application (which can cause performance overheads).
Note: Performance cost is more for multi-document transactions (in most of the cases), hence it should be judiciously used.
In the case of an error, whether the remaining operations get processed or not is determined by whether the bulk operation is ordered or unordered. If it is ordered, MongoDB will not process the remaining operations, whereas if it is unordered, MongoDB will continue to process the remaining operations.
Note: "ordered" is an optional Boolean parameter that can be passed to bulkWrite(); by default it is true.
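A small sketch of an unordered bulk write (the collection and documents are illustrative); the duplicate _id below triggers an error, but with ordered: false the third insert is still processed:
db.items.bulkWrite(
  [
    { insertOne: { document: { _id: 1, item: "pen" } } },
    { insertOne: { document: { _id: 1, item: "pencil" } } },
    { insertOne: { document: { _id: 2, item: "paper" } } }
  ],
  { ordered: false }
)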
The MongoDB Enterprise version includes auditing capability and it is fairly easy to set up. Some salient features of auditing in MongoDB: the audit trail can be written to the console, to syslog, or to a JSON or BSON file; audit filters can restrict which operations and which users are audited; and events such as authentication, authorization failures, schema (DDL) changes and replica set or sharding configuration changes are captured, with auditing of CRUD operations available as an additional option.
Note: Auditing adds performance overhead and the amount of overhead is determined by a combination of the several factors listed above. The specific needs of the application should be taken into account to arrive at the optimal configuration.
Once selected, the shard key cannot simply be changed later, hence it should be chosen after a lot of consideration. The distribution of the documents of a collection between the cluster's shards is based on the shard key. The effectiveness of the chunk distribution is important for efficient querying and writing of the MongoDB database, and this effectiveness is directly related to the shard key. That is why choosing the right shard key up front is of utmost importance.
A must-know for anyone looking for top MongoDB interview questions, this is one of the frequently asked MongoDB advanced interview questions.
When any text content within a document needs to be searchable, all the string fields of the document can be indexed using the $** wildcard specifier.
db.articles.createIndex( { "$**" : "text" } )
Note: Any new string field added to the document after creating the index will automatically be indexed. When data is huge, wildcard text indexes can impact performance and hence should be used with due consideration.
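With such an index in place, a text search query would look like this (the search term is illustrative):
db.articles.find( { $text: { $search: "replication" } } )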
BSON is a binary JSON. Inside the database, there is a need for binary representation for efficiency.
There are 3 major reasons for preferring BSON: it is fast to traverse (fields can be skipped without parsing the whole document), it is space-efficient for types such as numbers and dates, and it supports additional data types (dates, binary data, ObjectId) that plain JSON lacks.
Example: In below document, we have a large subdocument named hobbies, now suppose we want to query field "active" skipping "hobbies" we can do so in BSON due to its linear serialization property.
{ _id: "32781", name: "Smith", age: 30, hobbies: { .............................500 KB ..............}, active: "true" }
First, we have the MongoDB query language.
This is the set of instructions and commands that we use to interact with MongoDB. All CRUD operations and the documents that we send back and forth in MongoDB are managed by this layer. It translates the incoming BSON wire protocol messages, which MongoDB uses to communicate with the client-side application libraries that we call drivers, into MongoDB operations.
Then, we have the MongoDB Data Model Layer.
This is the layer responsible for applying all the CRUD operations defined in the MongoDB query language and determining how they should affect the data structures managed by MongoDB. Management of namespaces, database names and collections, which indexes are defined per namespace, and which interactions need to be performed to respond to incoming requests are all handled here.
This is also the layer where a replication mechanism is defined. This is where we define WriteConcerns, ReadConcerns that applications may require.
Next, we have the storage layer.
At this layer we have all the calls that persist data to the physical medium: how data is stored on disk, what kind of files it uses, and what levels of compression, among other settings. MongoDB has several different types of storage engines that persist data with different properties, depending on how the system is configured. WiredTiger is the default storage engine. All the actions regarding flushes to disk, journal commits, compression operations, and low-level system access happen at this layer.
Shards themselves are replica sets, i.e., highly available units. A sharded cluster also has other components, such as the mongos query routers and the config servers.
Suppose we have 3 servers: abc.com, xyz.com and pqr.com.
The --replSet option is used for creating a replica set; we have named the replica set rs0. The bind IP is the address on which the server can be reached from outside. The mongod process must be started with these options on all three servers.
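A minimal sketch of the start command (the dbpath and the bind addresses are illustrative assumptions; each server would use its own hostname):
mongod --replSet rs0 --port 27017 --bind_ip localhost,abc.com --dbpath /data/db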
Log in to the server abc.com and run the command:
mongo
This will take you to the mongo shell.
Now we need to initiate the replica set with a configuration of all 3 members.
rs.initiate( { _id : "rs0", members: [ { _id: 0, host: "abc.com:27017" }, { _id: 1, host: "xyz.com:27017" }, { _id: 2, host: "pqr.com:27017" } ] } )
MongoDB initiates a replica set, using the default replica set configuration.
To verify the configuration, we can run
rs.conf()
Also to check the status for each member we can run command
rs.status()
The server from which we run rs.initiate() will become the primary and the other 2 servers will become secondaries.
The first requirement eliminates any five-node replica set where one node is an arbiter, as arbiters do not have a copy of the data.
The second requirement eliminates setting a priority of 0 for dc1-01, dc1-02, dc2-01 or dc2-02. They can be assigned any positive value, such as the default of 1, to be electable as primary.
As per the third requirement, dc3-01 can never be primary, so its priority has to be set to 0.
Finally, as per the fourth requirement, dc3-01 cannot be configured as hidden, as this would prevent reading from this replica member.
So below will be the config meeting all the requirements (members 0 and 1 are dc1-01 and dc1-02, members 2 and 3 are dc2-01 and dc2-02, and member 4 is dc3-01).
{
  "_id" : "rs0",
  "version" : 1,
  "members" : [
    { "_id" : 0, "host" : "mongodb0.example.net:27017" },
    { "_id" : 1, "host" : "mongodb1.example.net:27017" },
    { "_id" : 2, "host" : "mongodb2.example.net:27017" },
    { "_id" : 3, "host" : "mongodb3.example.net:27017" },
    { "_id" : 4, "host" : "mongodb4.example.net:27017", "priority" : 0 }
  ]
}
When the primary of a replica set is not available, a secondary becomes primary; this is done via elections, where the most appropriate member of the replica set is promoted to primary. Apart from primary unavailability, there are a few other situations when elections are triggered, such as adding a new node to the replica set, initiating a replica set, performing maintenance with rs.stepDown() or rs.reconfig(), and secondaries losing connectivity to the primary for more than the configured timeout.
Don't be surprised if this question pops up as one of the top MongoDB questions for interview in your next interview.
Big data systems with large data sets or high throughput requirements usually challenge the capacity of a single server; for example, a large number of parallel queries can exhaust the CPU capacity of the server. Also, working sets larger than the system's RAM can cause I/O bottlenecks and degrade disk performance. Such growth is generally handled either by vertical scaling or horizontal scaling.
Vertical scaling: here bottlenecks are handled by increasing the capacity of a single server by adding more RAM, a more powerful CPU or more storage. This works up to a limit, as even the biggest server has limits on RAM, CPU and storage, beyond which we cannot add capacity. Also, this scaling method is very expensive, as big servers cost much more than commodity servers.
Horizontal scaling: here bottlenecks are handled by dividing the dataset across multiple commodity servers. We get the benefit of more storage, RAM and CPU when the data is spread out. This also allows high throughput, as the resources can work in parallel. We also get the benefit of comparatively lower cost due to the use of commodity servers.
MongoDB supports horizontal scaling through sharding. It supports very large data sets and high throughput operations with sharding. In sharding data is distributed among several machines called shards.
A MongoDB sharded cluster consists of the following components:
Application data in a MongoDB sharded cluster is stored in shards. Each shard has a subset of the collection data, divided on the basis of the shard key which we define at the time of sharding the collection. These shards can also be deployed as replica sets. A query performed against a single shard returns only a subset of the data, so applications usually should not connect to individual shards; connections to individual shards should be made only by administrators for maintenance purposes.
In a sharded cluster, applications should connect through mongos, which acts as a query router and as the interface between client applications and the sharded cluster. Mongos fetches metadata from the config servers about which data is on which shard and caches it. This metadata is then used by mongos to route queries to the appropriate shards. We should have multiple mongos instances for redundancy; they can be deployed either on separate servers or together with application servers. To reduce latency, it is recommended to deploy them on the application servers. Mongos uses minimal server resources and does not keep any persistent state.
All the metadata and configuration settings for the sharded cluster are stored in the config servers. The metadata records which data is stored on which shard, the number of chunks, and the distribution of shard keys across the cluster. It is recommended to deploy the config servers as a replica set. If the config server replica set has no primary at any time, the cluster cannot perform metadata changes and becomes effectively read-only for that period, so the config server replica set should be monitored and maintained just like the shards that hold application data.
MongoDB sharded cluster has 3 components namely shards, mongos and config servers. We will deploy all components using the below process.
We need to start all the members of the shard replica sets with the --shardsvr option.
mongod --replSet "rs0" --shardsvr mongod --replSet "rs1" --shardsvr
Suppose we have 2 shards with a 3-member replica set each; all 6 members should be started with the above option. These shard members are deployed as replica sets on the hosts (h1, h2, h3 ... h6) at port 27017.
sh1(M1, M2, M3 as replica set “rs0”) and sh2(M4, M5, M6 as replica set “rs1”)
We need to start all members of the config server replica set with the --configsvr option.
mongod --configsvr --replSet “cf1”
Config server (members c1, c2 and c3 as replica set cf1) deployed on hosts h7, h8 and h9 at port 27019.
Start the mongos, specifying the config server replica set name followed by a slash / and at least one of the config server hostnames and ports. The mongos is deployed on server h9 at port 27017.
mongos --configdb cf1/h7:27019,h8:27019,h9:27019
mongo h9:27017/admin
sh.addShard( "rs0/h1:27017,h2:27017,h3:27017" )
sh.addShard( "rs1/h4:27017,h5:27017,h6:27017" )
mongo h9:27017/admin
sh.enableSharding( "test" )
use test
db.test_collection.createIndex( { a : 1 } )
sh.shardCollection( "test.test_collection", { "a" : 1 } )
Shard key selection is an important aspect of the sharded cluster as it affects the performance and overall efficiency of a cluster. Chunk creation and distribution among several shards is based on the choice of the shard key. Ideally shard key should allow MongoDB to distribute documents evenly across all the shards in the cluster.
There are three main factors that affect the selection of the shard key:
Cardinality refers to the number of distinct values for a given shard key. Ideally the shard key should have high cardinality, as it determines the maximum number of chunks that can exist in the cluster.
For example, suppose we have an application used only by residents of a particular city and we shard on state; we would have at most one chunk, as both the upper and lower bounds of the chunk would be that one state, and one chunk effectively limits us to one shard. Hence we need to ensure the shard key field has high cardinality.
If we cannot have a field with high cardinality we can increase the cardinality of our shard key by creating compound shard key. So in the above scenario, we can have shard key with a combination of state and name for ensuring cardinality.
Apart from having a large number of distinct values for our shard key, it is important that documents are evenly distributed across those values. If certain values occur much more often than others, we may not have an equal distribution of load across the cluster, which limits the ability to scale reads and writes. For example, suppose we have an application where the majority of users have the last name 'Jones'; the throughput of our application would be constrained by the shard holding those values. Chunks containing these values grow larger and larger and may become jumbo chunks, which reduce the ability to scale horizontally as they cannot be split. To address such issues, we should choose a good compound shard key; in the above scenario, we can add _id as a second field of the compound key to compensate for the high frequency.
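A sketch of sharding on such a compound key (the database, collection and field names are assumptions for illustration):
sh.shardCollection( "mydb.people", { lastname: 1, _id: 1 } )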
We should avoid shard keys on fields whose values are always increasing or decreasing, for example, ObjectId in MongoDB, whose value always increases with each new document. In such a case, all writes go to the same chunk holding the upper-bound key; for monotonically decreasing values, writes go to the first chunk holding the lower bound. We can still include an ObjectId field in the shard key as long as it is not the first field.
To back up a sharded cluster we need to back up the config database as well as each individual shard.
First, we would need to disable the balancer from mongos. If we do not stop the balancer, the backup could duplicate data or omit data as chunks migrate while recording backups.
use config
sh.stopBalancer()
For each shard replica set in the sharded cluster, connect a mongo shell to the secondary member’s mongod instance and run db.fsyncLock().
db.fsyncLock()
Connect to a secondary of the config server replica set and run
db.fsyncLock()
Now we will back up the locked config secondary member. We are using mongodump for the backup, but we can also use other methods like cp or rsync.
Once the backup is taken, we can unlock the member so that it resumes applying the oplog from the config server primary.
mongodump --oplog
db.fsyncUnlock()
Now we will back up the locked member of each shard. We are using mongodump for the backup, but we can also use other methods like cp or rsync.
Once the backup is taken, we can unlock the member so that it resumes applying the oplog from the shard primary.
mongodump --oplog
db.fsyncUnlock()
Once we have the backups from the config servers and each shard, we re-enable the balancer by connecting to the config database.
use config
sh.setBalancerState(true)
We can broadly divide MongoDB authentication mechanisms into two parts: client/user authentication, which deals with how clients of the database authenticate to MongoDB, and internal authentication, which is how members of replica sets or sharded clusters authenticate with each other.
The supported client authentication mechanisms are SCRAM-SHA-1, MONGODB-CR, X.509, LDAP and Kerberos.
SCRAM-SHA-1 and MONGODB-CR are challenge/response mechanisms. From version 3.0, SCRAM-SHA-1 is the default mechanism and has replaced MONGODB-CR.
MongoDB currently supports two internal authentication mechanisms: keyfile authentication, which uses SCRAM-SHA-1, and X.509 certificate authentication.
With keyfile authentication, the contents of the keyfile essentially act as a shared password between the members of a replica set or sharded cluster. The same keyfile must be present on every member that talks to another.
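A minimal sketch of setting this up (the file path, replica set name and dbpath below are assumptions for illustration):
openssl rand -base64 756 > /srv/mongodb/keyfile
chmod 400 /srv/mongodb/keyfile
mongod --replSet rs0 --keyFile /srv/mongodb/keyfile --dbpath /srv/mongodb/db --port 27017
Every member of the replica set or sharded cluster would be started with the same keyfile.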
X.509 is the other internal authentication mechanism; it uses certificates to authenticate members to one another. Although we can use the same certificate on all members, it is recommended to issue a different certificate to each member. This way, if one certificate is compromised, we only need to reissue and deploy that one certificate instead of updating the entire cluster.
It's important to note that whenever we enable internal authentication, either with X.509 or with keyfile based authentication, this automatically will enable client authentication.
There are a few key differences when setting up authentication on a sharded cluster. To set up authentication we should connect to mongos instead of mongod, and clients that want to authenticate to the sharded cluster must do so through mongos.
Ensure the sharded cluster has at least two mongos instances available, as the procedure requires restarting each mongos in the cluster. If the sharded cluster has only one mongos instance, this results in downtime while that mongos is offline.
db.createUser({ user: "admin", pwd: "<password>", roles: [ { role: "clusterAdmin", db: "admin" }, { role: "userAdmin", db: "admin" }]});
security:
  transitionToAuth: true
  keyFile: <path-to-keyfile>
The new configuration file should contain all of the configuration settings previously used by the mongos as well as the new security settings.
Connect to the primary member of each shard replica set and create a user with the db.createUser() method.
db.createUser({ user: "admin1", pwd: "<password>", roles: [ { role: "clusterAdmin", db: "admin" }, { role: "userAdmin", db: "admin" }]});
This user can be used for maintenance activities on individual shards.
When deploying MongoDB in production, we should have a strategy for capturing and restoring backups in the case of data loss events. Below are the different backup options:
MongoDB Atlas, the official MongoDB cloud service, provides two fully managed backup methods: continuous backups, which take incremental backups and allow restores to a point in time, and cloud provider snapshots, which use the native snapshot capability of the underlying cloud provider.
MongoDB Cloud Manager and Ops Manager provide backup, monitoring, and automation services for MongoDB. They support backing up and restoring MongoDB replica sets and sharded clusters from a graphical user interface.
Back Up by Copying Underlying Data Files
MongoDB can also be backed up with operating system features that are not specific to MongoDB. Point-in-time filesystem snapshots can be used for backup if the volume where MongoDB stores its data files supports snapshots.
MongoDB deployments can also be backed up using system commands such as cp or rsync when the storage system does not support snapshots. It is recommended to stop all writes to MongoDB before copying the database files, as copying multiple files is not an atomic operation.
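A rough sketch of such a copy-based backup (the paths below are assumptions):
In the mongo shell, flush pending writes and block new ones:
db.fsyncLock()
From the OS shell, copy the dbPath:
cp -R /var/lib/mongodb /backups/mongodb-$(date +%F)
Back in the mongo shell, allow writes again:
db.fsyncUnlock()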
mongodump is the utility with which we can take a backup of a MongoDB database as BSON files. The backup files can then be used by the mongorestore utility for restoring to another database. mongodump reads data page by page, which takes a lot of time, so it is not recommended for large deployments.
Encryption plays a key role in securing any production environment. MongoDB offers encryption at-rest as well as transport encryption.
Transport encryption encrypts information in the network traffic between the client and the server. MongoDB supports TLS/SSL (Transport Layer Security/Secure Sockets Layer) to encrypt all of MongoDB's network traffic, ensuring that this traffic is only readable by the intended client.
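As a hedged sketch (the certificate paths and hostname are assumptions; on versions before 4.2 the equivalent options are --sslMode requireSSL and --sslPEMKeyFile):
mongod --tlsMode requireTLS --tlsCertificateKeyFile /etc/ssl/mongodb.pem
mongo --tls --host db.example.com --tlsCAFile /etc/ssl/ca.pem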
Encryption at rest encrypts the data on disk. This can be achieved either by encrypting at the storage engine level or at the application level. Application-level encryption is performed by the application itself and is similar to data masking as traditionally done in an RDBMS.
Encrypted Storage Engine
MongoDB Enterprise 3.2 introduces a native encryption option for the WiredTiger storage engine. This allows MongoDB to encrypt data files such that only parties with the decryption key can decode and read the data.
The data encryption process includes generating a master key for the whole instance, generating individual keys for each database, encrypting the data with the database keys, and encrypting the database keys with the master key.
The encryption occurs transparently in the storage layer; i.e. all data files are fully encrypted from a file system perspective, and data only exists in an unencrypted state in memory and during transmission.
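A minimal sketch of enabling the encrypted storage engine (MongoDB Enterprise only; the local keyfile shown here is an assumption and suitable only for testing, a KMIP key server is recommended for production):
openssl rand -base64 32 > /srv/mongodb/encryption-keyfile
chmod 600 /srv/mongodb/encryption-keyfile
mongod --enableEncryption --encryptionKeyFile /srv/mongodb/encryption-keyfile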
Application Level Encryption
Application Level Encryption provides encryption on a per-field or per-document basis within the application layer. To encrypt document or field level data, write custom encryption and decryption routines or use a commercial solution.
The MongoDB balancer is a background process that monitors the number of chunks on each shard. When the number of chunks on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate chunks between shards and reach an equal number of chunks per shard.
All chunk migrations use the following procedure: the balancer sends the moveChunk command to the source shard, which starts the move while continuing to serve operations for the chunk; the destination shard builds any required indexes and begins requesting and receiving copies of the chunk's documents; after the final document arrives, the destination synchronizes any changes that occurred during the migration; once fully synchronized, the source shard updates the cluster metadata in the config database with the chunk's new location, and finally deletes its own copy of the documents once there are no open cursors on the chunk.
MongoDB's WiredTiger storage engine uses both the WiredTiger internal cache and the filesystem cache for storing data. If we do not set the WiredTiger internal cache size, by default it uses the larger of 256MB or 50% of (RAM - 1GB). For example, if a system has a total of 6GB RAM, 2.5GB (50% of (6GB - 1GB)) will be allocated to the WiredTiger internal cache. This default assumes there is only one mongod process running; if we run multiple mongod instances on the same server, we should decrease the WiredTiger internal cache size to accommodate the other instances.
WiredTiger also provides compression for both collections and indexes by default: snappy compression is used for collections and prefix compression for indexes. Compression can be configured at the database level as well as per collection and per index.
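For example (the values and collection name below are illustrative assumptions, not recommendations):
mongod --wiredTigerCacheSizeGB 2 --wiredTigerCollectionBlockCompressor zlib
// per-collection override: create a collection that uses zlib block compression
db.createCollection( "logs", { storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } } } )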
The WiredTiger internal cache and the filesystem cache differ in how they represent data: data in the filesystem cache is in the same format as on disk (and so benefits from compression), while the WiredTiger internal cache uses a different, uncompressed representation.
All free memory that is not used by the WiredTiger cache or by other processes is automatically used by MongoDB through the filesystem cache.
Any query on a sharded cluster goes through mongos, which looks up metadata about the chunk distribution in the config database.
These queries are broadly divided into two groups:
Scatter gather queries:
Scatter-gather queries are the ones that do not include the shard key. Since there is no shard key, mongos does not know which shard to send the query to, so it queries all shards in the cluster. These queries are generally inefficient and are unfeasible for routine operations on large clusters.
Targeted queries:
If a query includes the shard key, mongos directs the query only to the shards that hold the relevant range of shard key values. These queries are very efficient.
Now, in this case, we have a query on the shard key with 15000 <= employeeid <= 70000, which targets a subset of the data in the cluster, so it is a targeted query. Any shard holding employee ids within this range will be queried. From the above sample, we can see the shards below fall within this range and will all be accessed by the query.
Shard0000
Shard0002
Shard0003
Shard0004
Shard0005
Shard0006
Shard0007
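One rough way to see whether a query is targeted or scatter-gather is to run it through explain() on mongos (the collection name follows the employeeid example above and is otherwise an assumption):
db.employees.find( { employeeid: { $gte: 15000, $lte: 70000 } } ).explain()
A targeted query lists only the shards that own the relevant ranges in the plan's shards section, whereas a query that omits the shard key produces a plan that merges results from every shard.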
If MongoDB cannot split a chunk that exceeds the specified chunk size or contains more documents than the maximum allowed, MongoDB labels the chunk as jumbo. If the chunk later no longer exceeds these limits, MongoDB clears the jumbo flag when the mongos reloads or rewrites the chunk metadata.
But in some cases we need to follow the below process to clear the jumbo flag manually:
If the chunk is divisible, MongoDB removes the flag upon successful split of the chunk.
Process
Below output from sh.status(true) shows that chunk with shard key range { "x" : 2 } -->> { "x" : 4 } is jumbo.
--- Sharding Status ---
  ..................
  ..................
  test.foo
    shard key: { "x" : 1 }
    chunks:
      shard-b  2
      shard-a  2
    { "x" : { "$minKey" : 1 } } -->> { "x" : 1 } on : shard-b Timestamp(2, 0)
    { "x" : 1 } -->> { "x" : 2 } on : shard-a Timestamp(3, 1)
    { "x" : 2 } -->> { "x" : 4 } on : shard-a Timestamp(2, 2) jumbo
    { "x" : 4 } -->> { "x" : { "$maxKey" : 1 } } on : shard-b Timestamp(3, 0)
sh.splitAt( "test.foo", { x: 3 })
MongoDB removes the jumbo flag upon successful split of the chunk.
In some instances, MongoDB cannot split the no-longer-jumbo chunk, such as a chunk whose range spans a single shard key value; in that case the preferred method of clearing the flag is not applicable.
Process
mongodump --db config --port <config server port> --out <output file>
In the chunks collection of the config database, unset the jumbo flag for the chunk. For example,
db.getSiblingDB("config").chunks.update( { ns: "test.foo", min: { x: 2 }, jumbo: true }, { $unset: { jumbo: "" } } )
After the jumbo flag has been cleared out from the chunks collection, update the cluster routing metadata cache.
db.adminCommand( { flushRouterConfig: "test.foo" } )
This is a common yet one of the most important MongoDB basic interview questions; don't miss this one.
Monitoring is a critical component of all database administration. A firm grasp of MongoDB's reporting will allow us to assess the state of the database and maintain the deployment without crisis.
Below are some of the utilities used for MongoDB monitoring.
The mongostat utility provides a quick overview of the status of a currently running mongod or mongos instance. mongostat is functionally similar to the UNIX/Linux utility vmstat but provides data about mongod and mongos instances.
In order to run mongostat, the user must have the serverStatus privilege action on the cluster resource.
E.g., to run mongostat every 2 minutes, the below command can be used.
mongostat 120
mongotop provides a method to track the amount of time a mongod instance spends reading and writing data. mongotop provides statistics on a per-collection level and, by default, returns values every second.
E.g., to run mongotop every 30 seconds, the below command can be used.
mongotop 30
MongoDB includes a number of commands that report on the state of the database.
The serverStatus command, or db.serverStatus() from the shell, returns a general overview of the status of the database, detailing disk usage, memory use, connections, journaling, and index access. The command returns quickly and does not impact MongoDB performance.
The dbStats command, or db.stats() from the shell, returns a document describing storage use and data volumes: the amount of storage used, the quantity of data contained in the database, and object, collection, and index counters.
We can use this data to monitor the state and storage capacity of a specific database, compare usage between databases, and determine the average document size in a database.
The collStats command, or db.collection.stats() from the shell, provides statistics similar to dbStats at the collection level, including a count of the objects in the collection, the size of the collection, the amount of disk space used by the collection, and information about its indexes.
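For example, from the mongo shell (the collection name is an assumption):
db.stats(1024 * 1024)             // database statistics with sizes reported in megabytes
db.orders.stats()                 // collection-level statistics
db.serverStatus().connections     // just the connections section of serverStatus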
The replSetGetStatus command (rs.status() from the shell) returns an overview of the replica set's status, detailing the state and configuration of the replica set and statistics about its members.
This data can be used to ensure that replication is properly configured, and to check the connections between the current host and the other members of the replica set.
Apart from the above tools, MongoDB also provides GUI-based monitoring with Ops Manager and Cloud Manager. These are very efficient and mostly used in large enterprise environments.
Security is very important for any production database. MongoDB provides best practices to harden our MongoDB deployment, and this list of best practices should act as a security checklist before we give the green light to any production deployment.
The balancer is a background process that runs on the primary of the config server replica set in a cluster. It constantly monitors the number of chunks on each shard, and if the number of chunks for a specific shard exceeds the migration threshold, it automatically migrates chunks between shards so that each shard has roughly the same number of chunks. The balancer migrates chunks from shards with more chunks to shards with fewer chunks. For example, suppose we have 2 shards [shard01, shard02] with 4 and 5 chunks respectively, and we then add another shard [shard03]. Initially shard03 will have no chunks; the balancer will notice this uneven distribution and migrate chunks from shard01 and shard02 to shard03 until all three shards have three chunks each.
There can be a performance impact when the balancer migrates chunks, as migrations carry overhead in terms of bandwidth and workload. To minimize this impact, the balancer migrates only one chunk at a time, and only starts a migration once the difference in chunk counts between the most- and least-loaded shards reaches the migration threshold. We can also restrict balancing to a specific window so that migrations run only during off-peak hours, as sketched below.
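A sketch of restricting the balancer to an off-peak window, run against a mongos (the window times are assumptions):
use config
db.settings.update(
   { _id: "balancer" },
   { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
   { upsert: true }
)
sh.getBalancerState()   // confirm the balancer itself is still enabled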
Impact of Adding and Removing Shards on a balancer
Adding or removing a shard creates an imbalance, since a new shard has no chunks and a removed shard's chunks need to be redistributed throughout the cluster. If a shard is removed while chunk distribution is already uneven, the balancer drains the chunks from the shard being removed before balancing the remaining uneven chunks. When the balancer notices such an imbalance it starts the chunk migration process immediately, although the migration process takes time to complete.
MongoDB records every write operation on the primary in the oplog, and these entries are then replicated to the secondaries. MongoDB uses asynchronous replication with automatic failover to do this efficiently.
Oplog entries from the primary are applied to the secondaries asynchronously. This lets applications continue without downtime despite the failure of individual members. MongoDB deployments usually run on commodity servers, and with synchronous replication on such hardware the latency of waiting for acknowledgements can be on the order of 100ms, which is quite high. For this reason, MongoDB prefers asynchronous replication.
From version 4.0.6, MongoDB can log entries for slow oplog operations on the secondary members of a replica set. These slow oplog messages are written to the secondary's diagnostic log under the REPL component. They do not depend on log levels or the profiling level, only on the slow operation threshold, and the profiler does not capture them.
Many traditional databases follow a master-slave setup, but in case of master failure we have to manually cut over to a slave database. In MongoDB, we have one primary with multiple secondaries. With only a few servers we could still afford a manual cutover, but a large MongoDB deployment may have a hundred shards and it is impractical to cut over manually every time, so MongoDB provides automatic failover. When the primary is unable to communicate with the other members for more than the configured time (electionTimeoutMillis), an eligible secondary calls an election to nominate itself as the new primary. Until the new primary is elected the cluster cannot serve write requests and can only serve reads; once the new primary is selected, the cluster resumes normal operations.
The architecture of the cluster should be designed keeping in mind network latency and the time required for the replica set to complete elections, as these affect how long the cluster runs without a primary.
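For instance, the election timeout can be tuned through the replica set configuration (the 12-second value below is just an illustrative assumption):
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 12000
rs.reconfig(cfg)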
Indexes help improve query performance. Without indexes, a query must perform a collection scan, where every document in the collection is examined to find the desired results. With proper indexes, we can limit the number of documents scanned and thus improve query performance.
Like collections, indexes also use storage, as they hold a small portion of the collection's data. For example, an index on the field 'name' stores that field's values in ascending or descending order, which also helps sort operations. Using indexes, we can satisfy equality matches and range-based queries more efficiently.
Some of the different index options available for MongoDB are:
By default, MongoDB creates a unique index on the _id field when a collection is created. This prevents applications from inserting multiple documents with the same _id value, and MongoDB ensures that this index cannot be dropped.
These are indexes on a single field or on a combination of fields (compound indexes).
e.g.
db.records.createIndex( { score: 1 } ) – index on the single field "score"
db.products.createIndex( { "item": 1, "stock": 1 } ) – compound index on "item" and "stock"
MongoDB provides the option of creating an index on the contents of arrays: for every element of the array, a separate index entry is created. Multikey indexes let us match elements of an array more efficiently.
MongoDB also provides a geospatial index which helps to efficiently query the geospatial coordinate data. 2d indexes for planar geometry and 2dsphere indexes for spherical geometry.
To support searching string content in a collection, MongoDB provides text indexes. These indexes store only root words and ignore language-specific stop words like 'the', 'a', etc.
Partial indexes index only the documents in a collection that match a specified filter expression. Since they store only a subset of the documents, they have lower storage requirements, and index creation and maintenance costs are also lower.
A sparse index contains entries only for documents that have the indexed field; documents that do not contain the field are skipped entirely.
Certain applications require documents to be removed automatically after a certain amount of time. We can achieve this using TTL indexes: we specify a TTL (time to live) for the documents, after which a background process removes them. These indexes are ideal for logs, session data and event data, since such data only needs to persist for a limited time.
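A few hedged examples of these index options (the collection and field names are assumptions):
// TTL index: documents are removed roughly one hour after their createdAt time
db.sessions.createIndex( { createdAt: 1 }, { expireAfterSeconds: 3600 } )
// Partial index: only documents matching the filter expression are indexed
db.orders.createIndex( { status: 1 }, { partialFilterExpression: { amount: { $gt: 100 } } } )
// Sparse index: only documents that actually contain the indexed field get an entry
db.employees.createIndex( { linkedInProfile: 1 }, { sparse: true } )
// Text index: supports string content search on the description field
db.products.createIndex( { description: "text" } )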
It is important to maintain data consistency in any database, especially when multiple applications access the same piece of data simultaneously. MongoDB uses locking and other concurrency control measures to ensure consistency: multiple clients can read and write the same data, while all writes to a single document either occur in full or not at all, so clients never see inconsistent data.
Effect of sharding on concurrency
In sharding, collections are distributed among several shard servers, which improves concurrency. The mongos process routes many operations concurrently to different shards and combines the results before sending them back to the client.
In a sharded cluster, locking happens at the individual shard level rather than the cluster level, so operations on one shard do not block operations on other shards. Each shard uses its own locks, independent of the other shards in the cluster.
Effect of replication on concurrency
In a MongoDB replica set, each operation on the primary is also written to a special capped collection in the local database called the oplog. So every time an application writes to MongoDB, it locks both databases, i.e. the collection's database and the local database. Both must be locked at the same time to keep the database consistent and to ensure that, even with replication, write operations keep their 'all-or-nothing' property.
In MongoDB replication, the application does not write to the secondaries; instead, each secondary pulls writes from the primary in the form of the oplog. Oplog entries are not applied one by one; they are collected into batches and the batches are applied in parallel, while preserving the order in which the operations appear in the oplog. While a batch is being applied, the secondary does not allow reads on the data being applied, to maintain consistency.
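As a quick sketch, replication health and lag on the secondaries can be checked from the shell:
rs.status()                          // state, optimes and health of every member
rs.printSecondaryReplicationInfo()   // per-secondary lag behind the primary
                                     // (rs.printSlaveReplicationInfo() on older versions)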
MongoDB uses replication to provide high availability and redundancy, which are the basis for any production database. With replica sets we can achieve HA as well as DR capability. Replication also sets us up for horizontal scaling, enabling the use of commodity servers instead of enterprise servers, and with a proper configuration we can avoid downtime even if an entire data center goes down.
There are several types of replica set members, chosen based on requirements: regular secondaries that can become primary, priority 0 members that cannot become primary, hidden members that are invisible to applications, delayed members that maintain a lagging copy of the data, and arbiters that only vote in elections and hold no data.
This, along with other interview questions on MongoDB for freshers, is a regular feature in MongoDB interviews, be ready to tackle it with the approach mentioned.
We can change the configuration of the replica set as per the requirement of the application. Configuration changes may include adding a new member, adding Arbiter, removing a member, changing priority or votes for members, or changing member from normal secondary to hidden or delayed member.
To add a new member, first we need to start the mongod process with the --replSet option on the new server.
rs.add( { host: "hostname:port" } )
Once added, the member will fetch the data from the primary using an initial sync and then stay up to date through replication.
To add an arbiter:
rs.addArb( "hostname:port" )
To remove a member:
rs.remove( "hostname:port" )
As a good practice, we should shut down the member being removed before running the above command.
rs.reconfig(newConfig)
Reconfig can be explained better with the below examples. Suppose we have a replica set "rs0" with the below configuration.
From Primary:
To change the priority of a member:
cfg = rs.conf(); cfg.members[1].priority = 2; rs.reconfig(cfg);
To change the votes of a member:
cfg = rs.conf(); cfg.members[2].votes = 0; rs.reconfig(cfg);
To convert a secondary into a delayed (and hidden) member:
cfg = rs.conf()
cfg.members[n].priority = 0
cfg.members[n].hidden = true
cfg.members[n].slaveDelay = 3600
rs.reconfig(cfg)
To convert a secondary into a hidden member:
cfg = rs.conf()
cfg.members[n].priority = 0
cfg.members[n].hidden = true
rs.reconfig(cfg)
MongoDB is an open-source NoSQL database that uses a document-oriented data model and a non-structured query language. It overcame one of the biggest pitfalls of traditional database systems: scalability. MongoDB is used by some of the biggest companies in the world and offers a unique set of features that help organizations handle unstructured data.
MongoDB is used across many companies in multiple domains. Research found that 26,929 companies use it, most often in the United States and in the computer software industry, typically companies with 10-50 employees and revenue of 1-10 million dollars. There is a huge demand for professionals who are qualified, hold a MongoDB certification and can work with both the basics and advanced features of MongoDB, and they can expect a promising career. Organizations around the world are using MongoDB to meet the fast-changing requirements of their customers.
These MongoDB interview questions and answers are prepared by experienced industry experts and can prove very useful for newcomers as well as experienced professionals who want to become MongoDB developers. They will help you strengthen your technical skills, prepare for job interviews and quickly revise the concepts. Going through them will give you in-depth knowledge and help you ace your MongoDB interview.
To relieve you of the worry and burden of preparing for your upcoming interviews, we have compiled the above MongoDB interview questions and answers prepared by industry experts. These common interview questions on MongoDB will help you ace your MongoDB interview. The KnowledgeHut MongoDB certification also helps you learn advanced interview tricks and concepts that are most expected in these interviews.
Learning MongoDB will definitely give a boost to your career because the demand for MongoDB in the market is increasing at a tremendous pace. All the best!