Latest Updates: Our Blog

VisualSearch.js v0.2.1 release…

Posted
Nov 15th, 2011

Tags
Twitter

Author
documentcloud

VisualSearch.js v0.2.1 released: http://t.co/Sq5oj06I. New feature: preserve the order of your facets.

DocumentCloud gets Entity Char…

Posted
Oct 27th, 2011

Tags
Twitter

Author
documentcloud

DocumentCloud gets Entity Charts: http://t.co/cIvNMkeQ Give them a try on your documents. /ja

New Feature: Entity Charts

Posted
Oct 27th, 2011

Tags
Workspace

Author
Jeremy Ashkenas

Whenever you upload a document to DocumentCloud we send the contents to OpenCalais, a service that discovers the entities (people, places, organizations, terms, etc.) that are present in plain text. OpenCalais can tell us that “Barack Obama” is the same person as “President Obama”, “Senator Obama”, “Mr. President” … and even “he” or “his” in clauses like “his policy proposals”.

Last month, we stopped indexing entities for faceting because DocumentCloud has reached the point where our search index can no longer support the strain of keeping track of the millions of unique entities stored in our database. We still hope to bring back some form of entity faceting — a feature you may remember as the “Entities” tab — using a different implementation in the future. But for the time being, we have added a new feature that allows you to easily browse through all of the entities associated with a document:

The entities are displayed in a chart that shows how often each entity occurs across each page. Using this chart, you can see which companies and individuals tend to be mentioned together frequently, or which parts of a long document concern a certain topic. Hover over any mention (the small gray boxes) to see the surrounding context, and click on it to jump directly to that mention within the document itself.
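As a rough illustration of the data behind such a chart, the sketch below groups mentions into per-entity, per-page counts. It is not DocumentCloud's implementation: the `mentions` list and the `chart_rows` helper are hypothetical names invented for this example.

```python
from collections import defaultdict

# Hypothetical input: one (entity name, page number) pair per mention.
mentions = [
    ("Barack Obama", 1),
    ("Barack Obama", 1),
    ("Senate", 2),
    ("Barack Obama", 3),
]

def chart_rows(mentions):
    """Group mentions into {entity: {page: mention count}}."""
    rows = defaultdict(lambda: defaultdict(int))
    for entity, page in mentions:
        rows[entity][page] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {entity: dict(pages) for entity, pages in rows.items()}

rows = chart_rows(mentions)
```

Each key in `rows` corresponds to one row of the chart, and each page count to the mentions drawn on that page.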

If you want to try out an example, here is a link to a recent document that ran with a disability fraud story in today’s New York Times. Right-click on the document and choose View Entities from the context menu, or select the document and choose View Entities from the Analyze menu.

We’re still polishing these charts, so let us know if you have any ideas for improving them, or ideas for other ways that we can make extracted entities more useful for your reporting.

.@amandabee make that: If you …

Posted
Oct 22nd, 2011

Tags
Twitter

Author
documentcloud

.@amandabee make that: If you are trying to make @documentcloud handle data you’ll want to watch out for @pandaproject! /th

You at SEJ in Miami? So’s @doc…

Posted
Oct 22nd, 2011

Tags
Twitter

Author
documentcloud

You at SEJ in Miami? So’s @documentcloud! If you can’t make our Saturday panel, @amandabee will be around for the afternoon. /abh

If your CMS (Ahem, WordPress) …

Posted
Oct 20th, 2011

Tags
Twitter

Author
documentcloud

If your CMS (Ahem, WordPress) gives you trouble pasting in document viewer embed codes, we’ve just added a “remove line breaks” link. /ja
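For anyone who wants the same effect in a script, here is a minimal sketch of collapsing an embed code onto one line. The `remove_line_breaks` function and the sample snippet (including `loadViewer`) are hypothetical and are not DocumentCloud's actual embed code or implementation.

```python
def remove_line_breaks(embed_code):
    """Join the non-empty lines of an embed code with single spaces."""
    lines = (line.strip() for line in embed_code.splitlines())
    return " ".join(line for line in lines if line)

# Hypothetical embed snippet, used only to exercise the function.
snippet = """<div id="viewer"></div>
<script>
  loadViewer("viewer");
</script>"""

one_line = remove_line_breaks(snippet)
```

Lines are joined with a space rather than nothing so that adjacent tokens never run together. Script bodies that contain `//` comments or rely on automatic semicolon insertion would need more care than this sketch takes.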

RT @dansinker: Congrats to @pr…

Posted
Oct 4th, 2011

Tags
Twitter

Author
documentcloud

RT @dansinker: Congrats to @propubnerds for a successful launch of DocDiver. Looks *amazing* http://t.co/cQKnmoO6 /abh

DocumentCloud now supports com…

Posted
Oct 3rd, 2011

Tags
Twitter

Author
documentcloud

DocumentCloud now supports combined metadata searches: `citizen: Guatemala citizen: Mexico` would find documents for both countries. /ja
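One way to picture a combined metadata search is sketched below. It assumes that repeated values for the same key are alternatives (a document matches if it carries any of them). That reading, the `parse_metadata_query` and `matches` helpers, and the single-word value format are assumptions made for this example, not a description of DocumentCloud's search code.

```python
import re
from collections import defaultdict

# Matches "key: value" pairs; values are limited to a single word here.
PAIR = re.compile(r"(\w+):\s*(\S+)")

def parse_metadata_query(query):
    """Collect the requested values for each metadata key."""
    wanted = defaultdict(set)
    for key, value in PAIR.findall(query):
        wanted[key].add(value)
    return dict(wanted)

def matches(doc_metadata, wanted):
    """Assumption: for every key, any one of the listed values is enough."""
    return all(doc_metadata.get(key) in values for key, values in wanted.items())

wanted = parse_metadata_query("citizen: Guatemala citizen: Mexico")
```

Under that assumption, a document tagged `{"citizen": "Mexico"}` matches and one tagged `{"citizen": "Peru"}` does not.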

Update on Searching and Entities

Posted
Sep 28th, 2011

Tags
Workspace

Author
Ted Han

Users who tried to search for pretty much anything on DocumentCloud this morning quickly noticed that something was not quite right on our servers. The short story is this: the problem was caused by human error, and our servers are in the process of rebuilding the index that failed.

The longer story, for those of you who’ve been tracking updates about our search outage, is this: Continue reading »

Our search index should now be…

Posted
Sep 28th, 2011

Tags
Twitter

Author
documentcloud

Our search index should now be fully recovered for all documents. If you have any trouble searching this evening, please let us know. /ja

Search within document viewers…

Posted
Sep 28th, 2011

Tags
Twitter

Author
documentcloud

Search within document viewers should now be restored for all documents. /ja

Search recovery is proceeding….

Posted
Sep 28th, 2011

Tags
Twitter

Author
documentcloud

Search recovery is proceeding. We should have search restored for about half of all documents at this point. More updates TK. /ja

Problems with search on @docum…

Posted
Sep 28th, 2011

Tags
Twitter

Author
documentcloud

Problems with search on @documentcloud today. Looking into it. /abh

We’re currently seeing incompl…

Posted
Sep 28th, 2011

Tags
Twitter

Author
documentcloud

We’re currently seeing incomplete search results on DocumentCloud this morning. We’ll let you know as soon as search is back in order. /ja

DocumentCloud now supports adv…

Posted
Sep 27th, 2011

Tags
Twitter

Author
documentcloud

DocumentCloud now supports advanced boolean searches. http://t.co/14BkP7R9 /ja

Advanced Boolean Searches

Posted
Sep 27th, 2011

Tags
Workspace

Author
Jeremy Ashkenas

DocumentCloud now supports advanced boolean search queries, allowing you to more easily perform searches that hone right in on the documents you’re trying to find. You may be familiar with boolean operators from other search engines, but here’s a quick refresher on the available options:

  • and: both terms must exist in the document. Example: Perry and Romney
  • or: either term may match. Example: indicted or accused
  • !: the term must not exist in the document. Example: obama !barack
  • *: a wildcard to match any sequence of letters. Example: J*e Smith (matches Joe, Jane or Jake Smith)
  • ( ): group together words into a term. Example: (Perry or Romney) and governor
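To make the operators concrete, the sketch below applies the same logic to the words of a single sample sentence. It is a toy model, not how the search engine evaluates queries: `words_of` and `has_term` are hypothetical helpers, and the wildcard uses Python's `fnmatch`, which also gives special meaning to `?` and `[...]`.

```python
from fnmatch import fnmatchcase

def words_of(text):
    """Lowercase the text and split it into a set of bare words."""
    return {word.strip(".,;:!?\"'()").lower() for word in text.split()}

def has_term(words, pattern):
    """True if any word matches the pattern; '*' matches any run of characters."""
    return any(fnmatchcase(word, pattern.lower()) for word in words)

doc = words_of("Governor Perry was not indicted.")

# and: both terms must be present.
both = has_term(doc, "Perry") and has_term(doc, "Romney")
# or: either term may match.
either = has_term(doc, "indicted") or has_term(doc, "accused")
# !: the second term must be absent.
negated = has_term(doc, "Perry") and not has_term(doc, "Romney")
# ( ): evaluate the group first, then combine it with the other term.
grouped = (has_term(doc, "Perry") or has_term(doc, "Romney")) and has_term(doc, "governor")
# *: wildcard inside a word.
wildcard = [name for name in ("joe", "jane", "jake", "john") if fnmatchcase(name, "j*e")]
```

For this sentence, `both` is false because Romney never appears, while `either`, `negated` and `grouped` are true. The wildcard keeps "joe", "jane" and "jake" and drops "john".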


Here’s an example of what that last search looks like in action:

Behind the scenes, we’re using the latest stable release of the open-source Solr/Lucene search engine (3.4.0). It includes a new query parser called “edismax” that adds boolean operators to the previous implementation of full text search.
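For readers curious what a request to that parser might look like, here is a minimal sketch that builds a Solr select URL with `defType=edismax`. The host, port and path in `SOLR_SELECT` are placeholders, the `edismax_url` helper is hypothetical, and whether DocumentCloud sets `lowercaseOperators` (an edismax option that treats lowercase "and" and "or" as operators, in Solr versions that support it) is an assumption.

```python
from urllib.parse import urlencode

# Placeholder endpoint; a real deployment would use its own host and core.
SOLR_SELECT = "http://localhost:8983/solr/select"

def edismax_url(user_query):
    """Build a select URL that asks Solr to parse the query with edismax."""
    params = {
        "q": user_query,
        "defType": "edismax",          # select the extended dismax parser
        "lowercaseOperators": "true",  # assumption: accept lowercase and/or
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = edismax_url("(Perry or Romney) and governor")
```

`urlencode` percent-encodes the parentheses and turns spaces into `+`, so the query text can be passed through exactly as the user typed it.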

Give boolean searches a spin, and let us know if they’re working well for your ongoing projects.

We’re at #eij11 — come by Kni…

Posted
Sep 27th, 2011

Tags
Twitter

Author
documentcloud

We’re at #eij11 — come by Knight’s Innovation panel at 2:30 to get a great intro to @documentcloud. /abh

Neat CPI use of DocumentCloud …

Posted
Sep 27th, 2011

Tags
Twitter

Author
documentcloud

Neat CPI use of DocumentCloud note embeds to compare signatures: http://t.co/yVoE4Bdq /ja

We released a feature this pas…

Posted
Sep 26th, 2011

Tags
Twitter

Author
documentcloud

We released a feature this past week: Printing Document Annotations! Learn more about it on our blog: http://t.co/2wdfGRj3 /th

Printing Document Annotations

Posted
Sep 26th, 2011

Tags
Documents

Author
Ted Han

We’ve been hard at work during our short Columbia, Missouri hackathon at DocumentCloud’s new home at the Investigative Reporters & Editors office. As a result we’ve rolled out a new feature for readers and journalists to print annotations made on documents.

Journalists have been publishing documents through DocumentCloud for a while now, as well as annotating them both for readers and for their own story-writing processes. We think it’s just as important for DocumentCloud to make story writing quicker and easier as it is to help readers find primary source material.

So, when Marshall Allen of ProPublica told us that he would like to try using DocumentCloud to take his story notes, we did our best to help out. As a result, you can now select one or more documents in the workspace and choose “Print Notes” under the “Publish” menu.

This way you can annotate your sources in DocumentCloud and have a single copy of all your research ready at hand for your copy editor, or to read when your flight attendant announces that all power switches should be in the off position.

And readers can find a “Print Notes” link in the sidebar footer of the document viewer too.

We hope this will help readers and journalists alike note and collect information in the format that best suits their workflows. Happy Printing (and remember to recycle)!

Busy week! Our upload queue is…

Posted
Sep 23rd, 2011

Tags
Twitter

Author
documentcloud

Busy week! Our upload queue is a bit backed up, so we’re spinning up several more worker machines to cope. Thanks for bearing with us /th

If you’re in Columbia, MO, @ja…

Posted
Sep 21st, 2011

Tags
Twitter

Author
documentcloud

If you’re in Columbia, MO, @jashkenas and @knowtheory will be speaking at Broadway Brewery tonight, at 7 p.m. http://ow.ly/6AFfa

Welcome Aboard, Ted Han

Posted
Sep 21st, 2011

Tags
IdeaLab,People

Author
Amanda Hickman

Back in August, we announced that we’d be welcoming a new lead developer, but he’s been on the job two weeks already and we managed to forget to say anything like “Welcome aboard!”

Well, better late than never.

If you want to re-OCR the text…

Posted
Sep 13th, 2011

Tags
Twitter

Author
documentcloud

If you want to re-OCR the text of an existing document: open it, and click “Reprocess Text” under the “Text Tools” section. /ja