Whenever you upload a document to DocumentCloud we send the contents to OpenCalais, a service that discovers the entities (people, places, organizations, terms, etc.) that are present in plain text. OpenCalais can tell us that “Barack Obama” is the same person as “President Obama”, “Senator Obama”, “Mr. President” … and even “he” or “his” in clauses like “his policy proposals”.
Last month, we stopped indexing entities for faceting because DocumentCloud has reached the point where our search index can no longer support the strain of keeping track of the millions of unique entities stored in our database. We still hope to bring back some form of entity faceting — a feature you may remember as the “Entities” tab — using a different implementation in the future. But for the time being, we have added a new feature that allows you to easily browse through all of the entities associated with a document:
The entities are displayed in a chart that shows how often each entity occurs across each page. Using this chart, you can see which companies and individuals tend to be mentioned together frequently, or which parts of a long document concern a certain topic. Hover over any mention (the small gray boxes) to see the surrounding context, and click on it to jump directly to that mention within the document itself.
If you want to try out an example, here is a link to a recent document that ran with a disability fraud story in today’s New York Times. Right-click on the document and choose View Entities from the context menu, or select the document and choose View Entities from the Analyze menu.
We’re still polishing these charts, so let us know if you have any ideas for improving them, or ideas for other ways that we can make extracted entities more useful for your reporting.