
Category Archives: java

Storing Files with GridFS

GridFS is a mechanism for storing large binary files in MongoDB.

Pros:

  • Using GridFS can simplify your stack. If you’re already using MongoDB, you might be able to use GridFS instead of a separate tool for file storage.
  • GridFS will leverage any existing replication or autosharding that you’ve set up for MongoDB, so getting failover and scale-out for file storage is easier.
  • GridFS can alleviate some of the issues that certain filesystems can exhibit when being used to store user uploads. For example, GridFS does not have issues with storing large numbers of files in the same directory.
  • You can get great disk locality with GridFS, because MongoDB allocates data files in 2 GB chunks.

There are some downsides, too:

  • Slower performance: accessing files from MongoDB will not be as fast as going directly through the filesystem.
  • You can only modify a file by deleting it and resaving the whole thing. MongoDB stores files as multiple documents, so it cannot lock all of the chunks in a file at the same time.
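To make the "multiple documents" point concrete, here is a minimal sketch of splitting a file into numbered chunks and joining them again. It does not use the MongoDB driver; the class name ChunkingSketch, the Chunk type, and the chunk size of 256 bytes are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics how a file can be split into chunk "documents". */
final class ChunkingSketch {

    /** One chunk: its position in the file (n) plus its bytes. */
    static final class Chunk {
        final int n;
        final byte[] data;

        Chunk(int n, byte[] data) {
            this.n = n;
            this.data = data;
        }
    }

    /** Splits content into consecutive chunks of at most chunkSize bytes. */
    static List<Chunk> split(byte[] content, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (int offset = 0, n = 0; offset < content.length; offset += chunkSize, n++) {
            int end = Math.min(offset + chunkSize, content.length);
            chunks.add(new Chunk(n, Arrays.copyOfRange(content, offset, end)));
        }
        return chunks;
    }

    /** Rebuilds the file by concatenating the chunks in list order. */
    static byte[] join(List<Chunk> chunks) {
        int total = 0;
        for (Chunk c : chunks) {
            total += c.data.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (Chunk c : chunks) {
            System.arraycopy(c.data, 0, out, pos, c.data.length);
            pos += c.data.length;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = split(new byte[1000], 256);
        System.out.println(chunks.size() + " chunks");
    }
}
```

Because the file lives in several independent pieces, a change in the middle cannot be applied to all pieces atomically, which is why the notes above recommend deleting and resaving.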

How $-Operators Use Indexes

Some queries can use indexes more efficiently than others; some queries cannot use indexes at all… :(

Inefficient operators

There are a few queries that cannot use an index at all, such as “$where” queries and checking if a key exists ({“key” : {“$exists” : true}}). There are several other operators that use indexes, but not very efficiently.

If there is a “vanilla” index on “x”, querying for documents where “x” does not exist can use the index ({“x” : {“$exists” : false}}). However, as nonexistent fields are stored the same way as null fields in the index, the query must visit each document to check if the value is actually null or nonexistent. If you use a sparse index, it cannot be used for either {“$exists” : true} or {“$exists” : false}.

In general, negation is inefficient. “$ne” queries can use an index, but not very well.
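The null-versus-missing problem can be pictured with a toy index. This is a sketch in plain Java, not MongoDB's actual index structure; ExistsSketch, buildIndex and findMissing are hypothetical names. Missing and explicitly null values land under the same index key, so the "does not exist" lookup still has to open each candidate document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: why an index alone cannot tell "null" from "missing". */
final class ExistsSketch {

    /** Maps each value of the field to the positions of the documents holding it. */
    static Map<Object, List<Integer>> buildIndex(List<Map<String, Object>> docs, String field) {
        Map<Object, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            // get() returns null both for a missing field and for an explicit null.
            Object key = docs.get(i).get(field);
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    /** Finds documents where the field is absent: index lookup plus a per-document check. */
    static List<Integer> findMissing(List<Map<String, Object>> docs,
                                     Map<Object, List<Integer>> index,
                                     String field) {
        List<Integer> result = new ArrayList<>();
        for (int i : index.getOrDefault(null, List.of())) {
            // The index entry is ambiguous, so the document itself must be inspected.
            if (!docs.get(i).containsKey(field)) {
                result.add(i);
            }
        }
        return result;
    }
}
```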

OR queries

As of this writing, MongoDB can only use one index per query. That is, if you create one index on {“x” : 1} and another index on {“y” : 1} and then do a query on {“x” : 123, “y” : 456}, MongoDB will use one of the indexes you created, not both. The exception is “$or”: it can use one index per clause, because it runs one query per clause and merges the results. If you must use an $or, keep in mind that MongoDB needs to look through the results of each clause’s query and remove any duplicates (documents that matched more than one $or clause).
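The merge-and-deduplicate step can be sketched like this. It is an in-memory illustration, not what the server does internally; OrMergeSketch, Doc and the field names id, x and y are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model of an $or: run each clause separately, then drop duplicates. */
final class OrMergeSketch {

    static final class Doc {
        final int id;
        final int x;
        final int y;

        Doc(int id, int x, int y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    /** Evaluates every clause over the collection and merges matches by id. */
    static List<Doc> or(List<Doc> collection, List<Predicate<Doc>> clauses) {
        Map<Integer, Doc> merged = new LinkedHashMap<>();
        for (Predicate<Doc> clause : clauses) {
            for (Doc d : collection) {
                if (clause.test(d)) {
                    // A document matching several clauses is kept only once.
                    merged.putIfAbsent(d.id, d);
                }
            }
        }
        return new ArrayList<>(merged.values());
    }
}
```

The extra pass over every clause's results is the cost the paragraph above warns about.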

Indexing embedded docs

Indexes can be created on keys in embedded documents in the same way that they are created on normal keys. If we had a collection where each document represented a user, we might have an embedded document that described each user’s location:

{ “username” : “sid”, “loc” : { “ip” : “”, “city” : “Springfield”, “state” : “NY” } }

If we create an index on “loc”, it can be used only when we query for the whole subdocument, with all of its fields (ip, city, state) in this order. Another option is to create an index on “”, for instance. Again, look at your queries and run explain() on the most frequently used ones.

Index Cardinality

Cardinality refers to how many distinct values there are for a field in a collection. Some fields, such as “gender” or “newsletter opt-out”, might only have two possible values, which is considered a very low cardinality. Others, such as “username” or “email”, might have a unique value for every document in the collection, which is high cardinality. Still others fall somewhere in between, such as “age” or “zip code”. As a rule of thumb, try to create indexes on high-cardinality keys, or at least put high-cardinality keys first in compound indexes (before low-cardinality keys).
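A quick way to get a feel for cardinality is to count distinct values per field on a sample of documents. The sketch below does this in memory; CardinalitySketch and the sample field names are assumptions for illustration, and documents are modeled as plain maps.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Counts distinct values of a field across a sample of documents. */
final class CardinalitySketch {

    static int cardinality(List<Map<String, Object>> docs, String field) {
        Set<Object> distinct = new HashSet<>();
        for (Map<String, Object> doc : docs) {
            // Documents that lack the field do not contribute a value.
            if (doc.containsKey(field)) {
                distinct.add(doc.get(field));
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> users = List.of(
                Map.<String, Object>of("username", "ann", "gender", "f"),
                Map.<String, Object>of("username", "bob", "gender", "m"),
                Map.<String, Object>of("username", "cid", "gender", "m"));
        System.out.println("username: " + cardinality(users, "username"));
        System.out.println("gender: " + cardinality(users, "gender"));
    }
}
```

A field whose count stays close to the number of documents is a good candidate to lead a compound index.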

The Query Optimizer

MongoDB’s query optimizer works a bit differently than any other database’s. Basically, if an index exactly matches a query (you are querying for “x” and have an index on “x”), the query optimizer will use that index. Otherwise, there might be a few possible indexes that could work well for your query. MongoDB will select a subset of likely indexes and run the query once with each plan, in parallel. The first plan to return 100 results is the “winner” and the other plans’ executions are halted. This plan is cached and used subsequently for that query until the collection has seen a certain amount of churn. Once the collection has changed a certain amount since the initial plan evaluation, the query optimizer will re-race the possible plans. Plans will also be reevaluated after index creation or every 1,000 queries. The “allPlans” field in explain()’s output shows each plan the query tried running.
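The plan race can be pictured with a small deterministic simulation. This is only a sketch of the idea: the real optimizer runs candidate plans in parallel, while here the plans are stepped round-robin, and the Plan type, its docsPerResult parameter, and the plan names are all hypothetical.

```java
import java.util.List;

/** Toy model of racing query plans until one returns enough results. */
final class PlanRaceSketch {

    /** A hypothetical plan that yields one result per docsPerResult documents examined. */
    static final class Plan {
        final String name;
        final int docsPerResult;
        int examined = 0;
        int returned = 0;

        Plan(String name, int docsPerResult) {
            if (docsPerResult <= 0) {
                throw new IllegalArgumentException("docsPerResult must be positive");
            }
            this.name = name;
            this.docsPerResult = docsPerResult;
        }

        /** Examines one more document and records a result when one is produced. */
        void step() {
            examined++;
            if (examined % docsPerResult == 0) {
                returned++;
            }
        }
    }

    /** Steps all plans in turn; the first to reach target results wins and the rest stop. */
    static Plan race(List<Plan> plans, int target) {
        if (plans.isEmpty()) {
            throw new IllegalArgumentException("need at least one plan");
        }
        while (true) {
            for (Plan p : plans) {
                p.step();
                if (p.returned >= target) {
                    return p;
                }
            }
        }
    }

    public static void main(String[] args) {
        Plan winner = race(List.of(new Plan("indexOnX", 5), new Plan("indexOnY", 2)), 100);
        System.out.println("winner: " + winner.name);
    }
}
```

The more selective plan wins because it needs to examine fewer documents to reach the 100-result threshold.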

When Not to Index

Indexes are most effective at retrieving small subsets of data, and some types of queries are faster without indexes. Indexes become less and less efficient as you need to get larger percentages of a collection, because using an index requires two lookups: one to look at the index entry and one following the index’s pointer to the document. A table scan only requires one: looking at the document. In the worst case (returning all of the documents in a collection), using an index would take twice as many lookups and would generally be significantly slower than a table scan.

As a rule of thumb: if a query is returning 30% or more of the collection, start looking at whether indexes or table scans are faster. However, this number can vary from 2% to 60%. You can force a query to do a table scan by hinting {“$natural” : 1}.

Sorting Au Naturel

There is a special type of sort that you can do with capped collections, called a natural sort. A natural sort returns the documents in the order that they appear on disk:

> db.my_collection.find().sort({"$natural" : -1})
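The lookup arithmetic from “When Not to Index” can be written down directly: two lookups per returned document through an index versus one lookup per document in the collection for a table scan. The sketch below only counts lookups; LookupCostSketch and its method names are assumptions, and it treats every lookup as equally expensive.

```java
/** Counts lookups for an index-driven query versus a full table scan. */
final class LookupCostSketch {

    /** Index access: one lookup in the index plus one to fetch each matching document. */
    static long indexLookups(long matchingDocs) {
        return 2 * matchingDocs;
    }

    /** Table scan: every document in the collection is looked at exactly once. */
    static long tableScanLookups(long collectionSize) {
        return collectionSize;
    }

    /** True when the index needs strictly fewer lookups than the scan. */
    static boolean indexIsCheaper(long collectionSize, long matchingDocs) {
        return indexLookups(matchingDocs) < tableScanLookups(collectionSize);
    }

    public static void main(String[] args) {
        long size = 1_000_000L;
        for (int percent : new int[] {2, 30, 60, 100}) {
            long matching = size * percent / 100;
            System.out.println(percent + "% of collection: index=" + indexLookups(matching)
                    + " scan=" + tableScanLookups(size));
        }
    }
}
```

By raw lookup count the two approaches break even at 50% of the collection. The 2% to 60% range quoted above suggests that the two kinds of lookup do not cost the same in practice, which this sketch does not model.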

Tailable Cursors

Tailable cursors are a special type of cursor that is not closed when its results are exhausted. They were inspired by the tail -f command and, similar to the command, will continue fetching output for as long as possible. Because the cursors do not die when they run out of results, they can continue to fetch new results as documents are added to the collection. Tailable cursors can be used only on capped collections, since insert order is not tracked for normal collections. Tailable cursors are often used for processing documents as they are inserted onto a “work queue” (the capped collection). Because tailable cursors will time out after 10 minutes of no results, it is important to include logic to re-query the collection if they die. The mongo shell does not allow you to use tailable cursors.
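The re-query logic boils down to remembering how far you have read, so that a fresh query can resume where the dead cursor stopped. The sketch below shows that bookkeeping with an in-memory stand-in; it does not use the MongoDB driver, WorkQueue and Tailer are hypothetical names, and the size cap of a real capped collection is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of tailing an insertion-ordered work queue. */
final class TailSketch {

    /** Stand-in for a capped collection: append-only and insertion-ordered. */
    static final class WorkQueue {
        private final List<String> docs = new ArrayList<>();

        void insert(String doc) {
            docs.add(doc);
        }

        /** Returns the documents inserted at or after the given position. */
        List<String> findFrom(int position) {
            int from = Math.min(position, docs.size());
            return new ArrayList<>(docs.subList(from, docs.size()));
        }
    }

    /** Keeps the read position so each new query resumes after the last seen document. */
    static final class Tailer {
        private final WorkQueue queue;
        private int nextPosition = 0;

        Tailer(WorkQueue queue) {
            this.queue = queue;
        }

        /** Issues a fresh query for anything not yet processed. */
        List<String> poll() {
            List<String> fresh = queue.findFrom(nextPosition);
            nextPosition += fresh.size();
            return fresh;
        }
    }
}
```

With a real driver the position would typically be a field of the last processed document rather than a list offset.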

Time-To-Live Indexes

If you need a more flexible age-out system, time-to-live (TTL) indexes allow you to set a timeout for each document. When a document reaches a preconfigured age, it will be deleted. This type of index is useful for caching problems like session storage. You can create a TTL index by specifying the expireAfterSeconds option in the second argument to ensureIndex:

> // 24-hour timeout
> db.my_collection.ensureIndex({"lastUpdated" : 1}, {"expireAfterSeconds" : 60*60*24})
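The expiry rule itself is simple date arithmetic: a document is due for deletion once its indexed date plus the timeout is no longer in the future. Here is a minimal sketch of that check; TtlSketch and isExpired are hypothetical names, and the server applies the rule from a background task, so actual removal can lag a little behind the moment of expiry.

```java
import java.time.Duration;
import java.time.Instant;

/** Expiry check matching the TTL idea: indexed date plus timeout versus now. */
final class TtlSketch {

    /** True once now has reached lastUpdated + ttl. */
    static boolean isExpired(Instant lastUpdated, Duration ttl, Instant now) {
        return !now.isBefore(lastUpdated.plus(ttl));
    }

    public static void main(String[] args) {
        Instant lastUpdated = Instant.parse("2020-01-01T00:00:00Z");
        Duration ttl = Duration.ofHours(24); // same 24-hour timeout as in the notes
        System.out.println(isExpired(lastUpdated, ttl, Instant.parse("2020-01-01T12:00:00Z")));
        System.out.println(isExpired(lastUpdated, ttl, Instant.parse("2020-01-02T00:00:00Z")));
    }
}
```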

  • Local class: Use it if you need to create more than one instance of it, access its constructor, or introduce a new, named type (because, for example, you need to invoke additional methods later).
  • Anonymous class: Use it if you need a one-off instance of a class or of a non-functional interface, or if you need to declare fields or additional methods.
  • Lambda expression:
    • Use it if you are encapsulating a single unit of behavior that you want to pass to other code. For example, you would use one if you want a certain action performed on each element of a collection, when a process is completed, or when a process encounters an error.
    • Use it if you need a simple instance of a functional interface and none of the above apply (for example, you do not need a constructor, a named type, fields, or additional methods).
  • Nested class: Use it if your requirements are similar to those of a local class, you want to make the type more widely available, and you don’t require access to local variables or method parameters.
    • Use a non-static nested class (or inner class) if you require access to an enclosing instance’s non-public fields and methods. Use a static nested class if you don’t require this access.
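The list above is easier to remember with all the options side by side. The following sketch is one possible illustration; every name in it (ClassKindsSketch, Pair, Labeler, Counter, demo) is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** One small example of each kind of class discussed above. */
final class ClassKindsSketch {
    private final String prefix = "item-";

    /** Static nested class: a reusable type that needs no enclosing instance. */
    static final class Pair {
        final String left;
        final String right;

        Pair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Inner class: reads a private field of the enclosing instance. */
    final class Labeler {
        String label(int n) {
            return prefix + n;
        }
    }

    static List<String> demo() {
        List<String> out = new ArrayList<>();

        // Local class: a named type with a constructor, instantiated more than once.
        class Counter {
            private int count;

            Counter(int start) {
                count = start;
            }

            int next() {
                return count++;
            }
        }
        Counter a = new Counter(1);
        Counter b = new Counter(10);
        out.add("local:" + a.next() + "," + a.next() + "," + b.next());

        // Anonymous class: a single instance that needs a field of its own.
        Function<String, String> tagger = new Function<String, String>() {
            private int calls = 0;

            @Override
            public String apply(String s) {
                calls++;
                return s + "#" + calls;
            }
        };
        out.add("anonymous:" + tagger.apply("x") + "," + tagger.apply("y"));

        // Lambda: a single unit of behavior with no state, constructor, or name.
        Function<String, String> upper = s -> s.toUpperCase();
        out.add("lambda:" + upper.apply("abc"));

        // Inner class instances are created through an enclosing instance.
        Labeler labeler = new ClassKindsSketch().new Labeler();
        out.add("inner:" + labeler.label(7));

        Pair p = new Pair("a", "b");
        out.add("nested:" + p.left + p.right);
        return out;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Note that the anonymous class is used here only because it keeps a counter field; without that state the lambda form would be the simpler choice.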