Every once in a while, DocumentCloud gets hit with the kind of document stash that really slows us down. We can take a lot, but if one newsroom finally gets a 25,000-page FOIA turned over to them and another gets hold of 30,000 pages of documents for a breaking news story on the same afternoon, that’s a volume that will tax our servers.
We recently established a “fast lane” to ensure that smaller documents don’t have to get in line behind behemoths, but that doesn’t help if you’ve got a few MB of documents about a local scandal — you’ll still have to shuffle into line with the big sets.
As you know, DocumentCloud is growing fast and we’re learning how to keep up with demand, but we also thought this would be a great time to explain how our queue works. When you upload a document to DocumentCloud, we do a ton of processing on it:
- If the document isn’t already a PDF, we convert it to a PDF.
- We break the document into images of each page and create thumbnails for each image.
- If the document doesn’t already have text information, we OCR each page.
- We organize all the information about each document — where to find the page images, what text appears on each page, the title, source and description you provided — into our database.
- We send the text of the document to OpenCalais and stash the entity information they return to us in our database.
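As a rough illustration (not DocumentCloud’s actual code), the steps above can be sketched as a function that decides which stages a given upload needs. The `Upload` class, its fields, and the step names are all hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    # Hypothetical description of an uploaded file.
    title: str
    is_pdf: bool      # already a PDF?
    has_text: bool    # already carries text information?
    page_count: int


def plan_processing(doc: Upload) -> list[str]:
    """Return the ordered steps this upload would go through."""
    steps = []
    if not doc.is_pdf:
        steps.append("convert_to_pdf")
    # Every document gets page images and thumbnails.
    steps.append("render_page_images")
    steps.append("create_thumbnails")
    if not doc.has_text:
        steps.append("ocr_pages")
    # Page images, page text, title, source and description go to the database.
    steps.append("store_metadata")
    # Text goes out for entity extraction; the results are stored too.
    steps.append("extract_entities")
    return steps


scanned_memo = Upload("memo.doc", is_pdf=False, has_text=False, page_count=3)
print(plan_processing(scanned_memo))
```

A text-based PDF skips both the conversion and the OCR step, which is one reason processing time varies so much from one document to the next.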
When all of that is done, we have what we need to give you a usable document in DocumentCloud’s workspace. The whole process takes anywhere from a few seconds to a few minutes, depending on the size of your document. We process documents in the order they’re received, which means that on a typical day it will take just minutes to process any one document you have uploaded. On a day like Thursday, when a total of 55,000 pages were added to DocumentCloud, that number creeps up towards an hour or more.
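To see why first-come, first-served ordering hurts on a heavy day, here is a back-of-the-envelope sketch. It assumes a single processing lane and a made-up throughput of 1,000 pages per minute; the real system and its real rate are certainly different, so treat the numbers as illustrative only.

```python
def finish_minutes(queue_pages, pages_per_minute):
    """For a first-in, first-out queue, return when each job finishes.

    queue_pages: page counts in arrival order.
    pages_per_minute: assumed (hypothetical) processing rate.
    """
    finished = []
    elapsed = 0.0
    for pages in queue_pages:
        # A job cannot start until everything ahead of it is done.
        elapsed += pages / pages_per_minute
        finished.append(elapsed)
    return finished


# Two huge sets arrive first, then a modest 250-page document.
busy_day = finish_minutes([25000, 30000, 250], pages_per_minute=1000)
quiet_day = finish_minutes([250], pages_per_minute=1000)

print(busy_day)   # [25.0, 55.0, 55.25]
print(quiet_day)  # [0.25]
```

Under these assumptions the same 250-page document is ready in a quarter of a minute on a quiet day and in about 55 minutes when it lands behind 55,000 pages.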
We’re all used to computers that don’t work. We’re used to trying things again when they don’t work out the first time. So it isn’t unreasonable that DocumentCloud users will often see that a modest-sized document they uploaded an hour ago is still processing and say “Still processing?!? Dang. Maybe I should try again.” As far as we can tell, that is in fact how most users react. You know what that means: on a day when our load is already unusually heavy, you’re just making it heavier.
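A small sketch makes the effect of a repeat upload concrete. It uses the same simplifying assumptions as any toy queue model: one lane, strict arrival order, and a hypothetical rate of 1,000 pages per minute. The job names and sizes are invented for the example.

```python
def finish_times(queue, pages_per_minute):
    """Map each job name to its finish time (in minutes) in a FIFO queue."""
    elapsed = 0.0
    times = {}
    for name, pages in queue:
        elapsed += pages / pages_per_minute
        times[name] = elapsed
    return times


waiting = [("foia", 25000), ("breaking", 30000), ("memo", 250)]
next_user = ("next-user", 500)

# Case 1: the memo's owner waits patiently.
patient = finish_times(waiting + [next_user], 1000)

# Case 2: the memo's owner uploads the same file again.
impatient = finish_times(waiting + [("memo-again", 250), next_user], 1000)

print(patient["memo"], impatient["memo"])            # 55.25 55.25
print(patient["next-user"], impatient["next-user"])  # 55.75 56.0
```

The original memo finishes at exactly the same moment either way, because the duplicate joins the back of the line. The only thing that changes is that everyone behind it waits a little longer.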
We’ve learned a thing or two. It is time for us to set some minimum acceptable standards for how long it should take us to process your documents. We need to be more alert to the size of our queue so we can increase our capacity when documents are taking longer than they ought to. We also need to do a better job of communicating with users about our load and current capacity. We’re working on all of those things.
In the meantime, the answer is no: if DocumentCloud is taking a really long time to process your documents, uploading the same document again won’t change that. Bringing the long processing time to our attention will.