
Update on Searching and Entities

Sep 28th, 2011


Ted Han

Users who tried to search for almost anything on DocumentCloud this morning quickly noticed that something was not quite right on our servers. The short story is this: the problem was caused by human error, and our servers are rebuilding the index that failed.

The longer story, for those of you who’ve been tracking updates about our search outage, is this:

DocumentCloud works because we index data about the contents of each document in a structured format – it means that when you search for a phrase or entity, our system doesn’t go through each document in our repository one by one looking for the term. Instead it looks it up in a structured index.
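To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the general technique, not our actual search code: the function names and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, term):
    """Look the term up directly instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents, keyed by id.
documents = {
    1: "budget memo for the city council",
    2: "council meeting minutes",
    3: "annual budget report",
}
index = build_index(documents)
print(search(index, "budget"))
print(search(index, "council"))
```

A lookup like this touches only the entry for the search term, which is why losing the index breaks search even though the documents themselves are untouched.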

It happens that we haven’t been backing up our search index: everything else is backed up nightly, but not the search index. You can probably guess where this is going.

We’d configured a testing machine to re-index documents, not quite realizing that the testing machine was using the live search index. The first step in reindexing is to delete the existing index data, and, voila: the search index was gone, which meant that searches of public and private documents in our system turned up zero results.
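One way to guard against this kind of mistake is to make the destructive step check which environment it is about to touch. The sketch below is hypothetical and is not how our reindexing is actually implemented: the environment labels, the `reindex` function and the `IndexGuardError` class are all assumptions made for illustration.

```python
class IndexGuardError(RuntimeError):
    """Raised when a machine tries to wipe an index it does not own."""


def reindex(machine_env, index_env, index, documents):
    """Rebuild `index` in place, but only if the environments match."""
    if machine_env != index_env:
        raise IndexGuardError(
            f"{machine_env!r} machine refused to delete the {index_env!r} index"
        )
    index.clear()  # step one of a reindex: delete the existing data
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


# Hypothetical live index that a testing machine is pointed at by mistake.
live_index = {"budget": {1, 3}}
try:
    reindex("testing", "production", live_index, {})
except IndexGuardError as err:
    print(err)
```

With the check in place, the mismatched call fails before anything is deleted, so the live index survives the misconfiguration.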

We’re re-creating the index, but with almost 3 million pages of documents in our system that will take some time, even with the cloud at our disposal.

The problems with our search index shouldn’t have had any impact on other data stored in our database – users were still able to upload, annotate, and publish documents.

We did have to disable the entities tab in the workspace. Our faceted search system and the entities tab are also powered by our search index, and both will stay disabled for the foreseeable future. You’ll still be able to run all the boolean searches you’d expect, but without entity faceting.

If document entities are one of your favorite features, we do want to hear from you about how you were using them, but you shouldn’t fret: we have plans to expose them in new ways in our ongoing quest to help journalists analyze and connect primary source documents.

What did we learn from all this? First of all, we’re making nightly backups of our search index now. We’ve double-checked our other backup systems and we’re confident that we’ll be able to recover from something like this much more quickly in the future.

Thanks for bearing with us, and our (especially my) apologies for any inconvenience you’ve encountered today.

