Over the past two years, we have released much of our toolset as open-source code: Backbone.js, Underscore.js, Jammit, CloudCrowd, and others. Today, we’re launching another piece of DocumentCloud — both on DocumentCloud.org and as a component you can integrate into your own projects. VisualSearch.js is a rich search box for real data. It enhances ordinary search boxes with the ability to autocomplete facets and values for sophisticated searches.
For example, here’s a query that filters The New York Times’ copies of Sarah Palin’s recently-released emails. First we filter just the annotated emails, then just the emails from 2007, and then drill down to a specific date. Visual search works with the arbitrary metadata you’ve already added to your documents.
We are excited to not only see what clever uses developers come up with for VisualSearch, but also what additions you write that can be merged back into the main repository.
The document upload dialog we rolled out in January allowed users to upload multiple documents at once and incorporated some nice touches like a progress bar that tracked progress of your documents. The Flash uploader was a great move forward for us but a handful of our users were having trouble with it, so we’ve rewritten it in open-standards based HTML5.
Today, we just pushed out an update to the file uploader. Continue reading »
We’ve added a “pages” tab to document viewers in our workspace and embedded on news sites. The new tab offers a birds-eye view of an entire document. This new tab, which now appears in your document viewer right next to the “document” tab, allows you to browse a document more quickly by showing you thumbnail images of every page. For long documents, this tab allows you to identify exactly where you want to go in the document without having to scroll and search repeatedly until you find a specific section in the document. Continue reading »
Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2, and our CloudCrowd parallel-processing system. Since the new Micro instances were just announced, we thought it would be wise to try them out by benchmarking some real world work on these new servers. If they proved cost-effective, it would be beneficial for us to use them as worker machines for our document processing.
Benchmarking with Docsplit
To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…). Continue reading »
If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:
Related documents can span all documents visible to your account, including documents that other organizations have made public.
Under the hood, we are using a technique known as tf/idf, which compares document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by weighing the frequency of use of each word in a particular, divided by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words obtain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built-in.
We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.
There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.