Problem
Most of my slow queries on my server are due to a single “find” query. I have a collection that represents website pages. I want to be able to look up a page in the collection by URL. The caveat is that a website might have URLs that redirect to it.
I’ve modeled my schema (with Node.js/Mongoose) like this:
{
'url': { type: String, index: {unique: true, dropDups: true} },
'redirects': { type: Array, index: true },
// ... other stuff
}
In other words, one entry might have the URL http://codereview.stackexchange.com/
but also have an array of 5 other URLs (such as bit.ly URLs) that redirect to it.
Later, I want to query the collection to see if we have an entry which matches an arbitrary set of URLs. I don’t know if the URLs are redirected URLs or what — I just want to find all matches in my collection that represent these URLs.
So I do this:
// urls is an array of URL strings we want to find...
model.find({$or: [{'url': {$in: urls}}, {'redirects': {$in: urls}}]}).lean().exec( ... );
Unfortunately, I might be looking for around 200 URLs at once, so this query sometimes takes > 1 second.
Is there further optimization I can do, or would it be better to split the query into multiple queries, capping the search-size each time?
Solution
I haven’t used MongoDB, but I have a few suggestions based on experiences with optimization and other databases.
-
Index hashes of the URLs and search for those instead. Using a simple MD5 hash would probably speed up searching with the cost of dealing with false positives (unlikely but possible).
-
Store every URL as a top-level object and add a
redirectsTo
attribute. This alleviates the need to index or search theredirects
collections. While now you need to perform two queries–one for the original URLs and another for the redirected URLs–this could end up being faster if only a small percentage of URLs are redirects.Here’s some psuedocode to clarify what I mean:
all = [] found = model.find({'url': {$in: urls}})... redirects = [] for each item in found if item.redirectsTo redirects += item.redirectsTo else all += item redirected += model.find({'url': {$in: redirects}})... for each item in redirected all += item
-
Update your question with the higher-level problem you’re trying to solve. I often find that when I’m trying to optimize slow task
T
to solve problemP
, the better approach is to rethinkP
.