
Application Design

 

Normalization versus Denormalization

There are many ways of representing data, and one of the most important decisions is how much you should normalize it.

Normalization is dividing up data into multiple collections with references between collections*.

Denormalization is the opposite of normalization: embedding all of the data in a single document.

Typically, normalizing makes writes faster and denormalizing makes reads faster. 

*MongoDB has no joining facilities, so gathering documents from multiple collections will require multiple queries.
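As a minimal sketch of the two approaches (the collection and field names here are made up for illustration), the same user-and-address data could either be referenced across two collections or embedded in a single document:

> // normalized: addresses live in their own collection and are referenced by _id
> db.addresses.insert({"_id" : 1, "city" : "Springfield", "state" : "NY"})
> db.users.insert({"username" : "sid", "address_ids" : [1]})
> // reading the user's addresses takes a second query, since there are no joins
> var user = db.users.findOne({"username" : "sid"})
> db.addresses.find({"_id" : {"$in" : user.address_ids}})

> // denormalized: the address is embedded directly in the user document
> db.users.insert({"username" : "sid", "addresses" : [{"city" : "Springfield", "state" : "NY"}]})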

Embedding is better for…                                        | References are better for…
Small subdocuments                                              | Large subdocuments
Data that does not change regularly                             | Volatile data
When eventual consistency is acceptable                         | When immediate consistency is necessary
Documents that grow by a small amount                           | Documents that grow by a large amount
Data that you’ll often need to perform a second query to fetch  | Data that you’ll often exclude from the results
Fast reads                                                      | Fast writes

 

Optimizations for Data Manipulation

 

To optimize your application, you must first know what its bottleneck is by evaluating its read and write performance.

  • Better read performance generally involves having the correct indexes and returning as much of the information as possible in a single document.
  • Better write performance usually involves minimizing the number of indexes you have and making updates as efficient as possible.

There is often a trade-off between schemas that are optimized for writing quickly and those that are optimized for reading quickly, so you may have to decide which is more
important for your application. Factor in not only the importance of reads versus writes, but also their proportions: if writes are more important but you’re doing a thousand
reads to every write, you may still want to optimize reads first.

 

Storing Files with GridFS

GridFS is a mechanism for storing large binary files in MongoDB.

Pros:

  • Using GridFS can simplify your stack. If you’re already using MongoDB, you might be able to use GridFS instead of a separate tool for file storage.
  • GridFS will leverage any existing replication or autosharding that you’ve set up for MongoDB, so getting failover and scale-out for file storage is easier.
  • GridFS can alleviate some of the issues that certain filesystems can exhibit when being used to store user uploads. For example, GridFS does not have issues with storing large numbers of files in the same directory.
  • You can get great disk locality with GridFS, because MongoDB allocates data files in 2 GB chunks.

There are some downsides, too:

  • Slower performance: accessing files from MongoDB will not be as fast as going directly through the filesystem.
  • You can only modify files by deleting them and resaving the whole thing. GridFS stores files as multiple documents, so it cannot lock all of the chunks in a file at the same time.
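As a quick sketch of the basic workflow, the mongofiles utility that ships with MongoDB can put, list, and get files in GridFS (foo.txt is just an example file):

$ echo "Hello, GridFS" > foo.txt
$ mongofiles put foo.txt
$ mongofiles list
$ mongofiles get foo.txt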

How $-Operators Use Indexes

Some queries can use indexes more efficiently than others, and some queries cannot use indexes at all.

Inefficient operators

There are a few queries that cannot use an index at all, such as "$where" queries and checking whether a key exists ({"key" : {"$exists" : true}}). There are several other operators that use indexes, but not very efficiently.

If there is a "vanilla" index on "x", querying for documents where "x" does not exist can use the index ({"x" : {"$exists" : false}}). However, because nonexistent fields are stored the same way as null fields in the index, the query must visit each document to check whether the value is actually null or nonexistent. If you use a sparse index, it cannot be used for either {"$exists" : true} or {"$exists" : false}.

In general, negation is inefficient. "$ne" queries can use an index, but not very well.
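A minimal sketch of the difference, assuming a collection named users with hypothetical fields x and y:

> db.users.ensureIndex({"x" : 1})                    // ordinary ("vanilla") index
> db.users.find({"x" : {"$exists" : false}})         // can use the index, but still has to visit
>                                                    // each document to tell null from missing
> db.users.ensureIndex({"y" : 1}, {"sparse" : true}) // sparse index
> db.users.find({"y" : {"$exists" : true}})          // cannot use the sparse index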

OR queries

As of this writing, MongoDB can only use one index per query. That is, if you create one index on {"x" : 1} and another index on {"y" : 1} and then do a query on {"x" : 123, "y" : 456}, MongoDB will use one of the indexes you created, not both. The one exception is "$or": because "$or" actually performs one query per clause and then merges the results, it can use one index per clause. If you must use an "$or", keep in mind that MongoDB needs to look through the query results of every clause and remove any duplicates (documents that matched more than one "$or" clause).
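A minimal sketch, assuming a collection named foo with hypothetical fields x and y:

> db.foo.ensureIndex({"x" : 1})
> db.foo.ensureIndex({"y" : 1})
> // uses only one of the two indexes
> db.foo.find({"x" : 123, "y" : 456})
> // each $or clause runs as its own query and can use its own index,
> // but the combined results must then be deduplicated
> db.foo.find({"$or" : [{"x" : 123}, {"y" : 456}]})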

Indexing embedded docs

Indexes can be created on keys in embedded documents in the same way that they are created on normal keys. If we had a collection where each document represented a user, we might have an embedded document that described each user’s location:

{
    "username" : "sid",
    "loc" : {
        "ip" : "1.2.3.4",
        "city" : "Springfield",
        "state" : "NY"
    }
}

If we create an index on "loc", it indexes the whole subdocument, so it can only be used when we query for the entire embedded document with all of its fields (ip, city, state) in that exact order. Another option is to create an index on a single embedded field, such as "loc.city". Again, take a look at your queries and run explain() on the most common ones.
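A minimal sketch of the two indexing options for the document above, assuming the collection is named users:

> // indexes the whole embedded document; only helps exact queries on all of "loc"
> db.users.ensureIndex({"loc" : 1})
> db.users.find({"loc" : {"ip" : "1.2.3.4", "city" : "Springfield", "state" : "NY"}})
> // indexes a single embedded field
> db.users.ensureIndex({"loc.city" : 1})
> db.users.find({"loc.city" : "Springfield"})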

Index Cardinality

Cardinality refers to how many distinct values there are for a field in a collection. Some fields, such as "gender" or "newsletter opt-out", might only have two possible values, which is considered very low cardinality. Others, such as "username" or "email", might have a unique value for every document in the collection, which is high cardinality. Still others fall somewhere in between, such as "age" or "zip code". As a rule of thumb, try to create indexes on high-cardinality keys, or at least put high-cardinality keys first in compound indexes (before low-cardinality keys).
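For example (a sketch, assuming a users collection with these fields), the high-cardinality key goes first in the compound index:

> // "username" (high cardinality) before "gender" (low cardinality)
> db.users.ensureIndex({"username" : 1, "gender" : 1})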

The Query Optimizer

MongoDB’s query optimizer works a bit differently than most other databases’. Basically, if an index exactly matches a query (you are querying for "x" and have an index on "x"), the query optimizer will use that index. Otherwise, there might be a few possible indexes that could work well for your query. MongoDB will select a subset of likely indexes and run the query once with each plan, in parallel. The first plan to return 100 results is the "winner", and the other plans’ executions are halted. That plan is cached and used for subsequent instances of the query until the collection has seen a certain amount of churn since the initial plan evaluation, at which point the query optimizer will re-race the possible plans. Plans will also be reevaluated after index creation or every 1,000 queries. The "allPlans" field in explain()’s output shows each plan the query tried running.
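To see the plans that were raced, you can ask explain() for verbose output; in the shell of this era, passing true includes the "allPlans" field (the collection foo and its fields here are hypothetical):

> db.foo.find({"x" : 123, "y" : 456}).explain(true)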

When Not to Index

Indexes are most effective at retrieving small subsets of data, and some types of queries are faster without indexes. Indexes become less and less efficient as you need to get larger percentages of a collection, because using an index requires two lookups: one to look at the index entry and one following the index’s pointer to the document. A table scan only requires one: looking at the document. In the worst case (returning all of the documents in a collection), using an index would take twice as many lookups and would generally be significantly slower than a table scan.

As a rule of thumb: if a query is returning 30% or more of the collection, start looking at whether indexes or table scans are faster. However, this number can vary from 2% to 60%. You can force a table scan by hinting {"$natural" : 1}; see the sketch after the next heading.

Sorting Au Naturel

There is a special type of sort that you can do with capped collections, called a natural sort. A natural sort returns the documents in the order that they appear on disk:

> db.my_collection.find().sort({"$natural" : -1})
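A minimal sketch of both ideas, assuming a regular collection named foo and a capped collection named my_collection:

> // force a table scan on an ordinary collection by hinting natural order
> db.foo.find().hint({"$natural" : 1})
> // create a capped collection, then read it back in reverse insertion order
> db.createCollection("my_collection", {"capped" : true, "size" : 100000})
> db.my_collection.find().sort({"$natural" : -1})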

Tailable Cursors

Tailable cursors are a special type of cursor that are not closed when their results are exhausted. They were inspired by the tail -f command and, similar to that command, will continue fetching output for as long as possible. Because the cursors do not die when they run out of results, they can continue to fetch new results as documents are added to the collection. Tailable cursors can be used only on capped collections, since insert order is not tracked for normal collections. Tailable cursors are often used for processing documents as they are inserted onto a "work queue" (the capped collection). Because tailable cursors will time out after 10 minutes of no results, it is important to include logic to re-query the collection if they die. The mongo shell does not allow you to use tailable cursors.

Time-To-Live Indexes

If you need a more flexible age-out system, time-to-live (TTL) indexes allow you to set a timeout for each document. When a document reaches a preconfigured age, it will be deleted. This type of index is useful for caching problems like session storage. You can create a TTL index by specifying the expireAfterSeconds option in the second argument to ensureIndex:

> // 24-hour timeout
> db.foo.ensureIndex({"lastUpdated" : 1}, {"expireAfterSeconds" : 60*60*24})
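One detail worth noting when trying this out: the TTL sweep only removes documents whose indexed field holds a BSON date, so a sketch of an insert that the index above could expire looks like this:

> db.foo.insert({"session" : "xyz", "lastUpdated" : new Date()})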