
Related Documents

Aug 2nd, 2010


Samuel Clay

If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:

Finding Related Documents

Related documents can span all documents visible to your account, including documents that other organizations have made public.

Under the hood, we are using a technique known as tf/idf (term frequency / inverse document frequency), which measures document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by taking the frequency of the word in a particular document and dividing it by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words gain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built in.
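To make the idea concrete, here is a minimal sketch of tf/idf weighting and similarity ranking in plain Python. It is not the code DocumentCloud runs, and it is not Lucene's scoring formula: the function names, the whitespace tokenizer, the logarithmic form of the inverse document frequency, and the use of cosine similarity are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real analyzer
    # would also strip punctuation, drop stop words, and stem.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency in this document, scaled down by how common
            # the word is across the whole collection.
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(docs, index):
    """Indexes of documents related to docs[index], most similar first."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], v), i)
        for i, v in enumerate(vectors)
        if i != index
    ]
    return [i for score, i in sorted(scores, reverse=True) if score > 0]


docs = [
    "the senate budget hearing",
    "the budget vote in the senate",
    "the weather report",
    "the recipe",
]
related_to_first = related(docs, 0)
```

In this toy collection, “the” appears in every document, so its weight is zero and it contributes nothing to similarity, while “senate” and “budget” are what link the first two documents.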

We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.

There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.
