Latest Updates: Our Blog

August 2010

Excerpts for your Entities

Posted
Aug 27th, 2010

Tags
Workspace

Author
Jeremy Ashkenas

Today we added excerpts to our timelines and to the Entities tab. We use OpenCalais to parse every document you upload and extract the names of people, organizations, terms and places in the text. We display these under the Entities tab. We also extract date information from each document that you upload and plot those dates on a timeline, which you can access from the Analyze menu.

DocumentCloud’s entities show you, at a glance, the people who are mentioned the most times in a given project, or the organizations that are named in each of the documents you’ve selected. Here’s how it works now:

Click on the “show pages” link next to any entity to reveal a thumbnail of each page in each document that contains that term, alongside excerpts highlighting the mention of the entity in the text. Clicking on the highlighted phrase will take you directly to the term within the document itself. In the screenshot below, you can see how the Environmental Protection Agency was correctly identified by both its proper name and its acronym.

Uploading Documents Gets a Little Easier

Posted
Aug 18th, 2010

Tags
Workspace

Author
Amanda Hickman

You’ve always been able to script batch uploads using our API, but for users without coding skills, uploads were one at a time. Today we’re rolling out an improved document uploading dialog that will let you upload as many documents as you want, all in one fell swoop.

You’ll still use the “New Documents” button, but now that button takes you straight to a file selection dialog.
Use the control (on MS Windows) or command (on Macs) key to select additional documents, just like you would in your file browser.

File selection screenshot

We’ll start you off by suggesting a title, based on each file’s name, but you can edit that name and add additional information, including the source of each document and a description. As with the old upload dialog, you can decide right when you upload your document whether or not you’re ready to share it with the world yet. As ever, you can edit all of these fields again later.

If your documents share a common source or description, use the “Apply to All Files” link to copy your metadata to each document in this batch. Note: this new upload interface requires Flash. If that’s an issue for you, let us know ASAP and we’ll whip up an alternate interface that doesn’t require any plugins. Promise.

As the files upload from your computer to DocumentCloud, you’ll see the progress of each transfer.

Better Processing, Too
This week’s release is more than just a new upload dialog. We’ve made some big changes to Docsplit and the RightAWS gem, and we’re hoping this means the dreaded “import failed” error will be a thing of the past. If looking under the hood is your thing, both Docsplit and our fork of RightAWS are on GitHub for your viewing (and reusing) pleasure.

Don’t be a Stranger
If you have gigabytes worth of documents to upload, get in touch before you start uploading so we can add more horsepower to handle your job. Otherwise, happy uploading! And don’t forget to tell us about what you’re publishing with DocumentCloud.

DocumentCloud Helps Arizona Paper with Annotated Immigration Law

Posted
Aug 3rd, 2010

Tags
IdeaLab

Author
Amanda Hickman

Cross-posted from PBS IdeaLab.

We opened the DocumentCloud floodgates less than six months ago and we’re still working hard to make DocumentCloud a better tool. We’re rolling out improvements at a healthy clip, including SSL support, better documentation, and support for cross-newsroom collaboration. We continue to listen to feedback from our really incredible crop of beta testers (who now number close to 500!).

There are nearly 100 newsrooms participating in the DocumentCloud beta and requests are still pouring in. We’ve been doing a fair amount of outreach and more is in the works, but it turns out that our users are our best advocates: After John Addams in Great Falls, Montana, blogged about his experiences with DocumentCloud we were deluged with requests from Montana news organizations large and small.

Uses in Arizona, Chicago, Memphis

The really great stories about how reporters are using DocumentCloud continue to surprise all of us.

Not long after Arizona’s governor signed that state’s now infamous immigration law, the Arizona Republic published the bill in full, complete with annotations by a local law professor. Republic reporters told us that traffic to the annotated legislation outpaced the paper’s popular entertainment guide in its first weekend, and continues to draw traffic as the bill stays in the news.

Meanwhile, in Chicago, reporters at the Tribune have been uploading each document and transcript entered into evidence in former governor Rod Blagojevich’s corruption trial — the documents are just part of their extensive coverage of the trial.

In Memphis, the Commercial Appeal published a sample ballot alongside their voter guide.

These are just a few of the great uses reporters have put DocumentCloud to — there are many more great stories already out there and plenty of new ones on the way.

Discuss DocumentCloud Helps Arizona Paper with Annotated Immigration Law on PBS’s IdeaLab.

Related Documents

Posted
Aug 2nd, 2010

Tags
Workspace

Author
Samuel Clay

If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:

Finding Related Documents

Related documents can span all documents visible to your account, including documents that other organizations have made public.

Under the hood, we are using a technique known as tf/idf, which compares document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by taking the frequency of that word in a particular document and dividing it by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words obtain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built in.
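
As a rough illustration of the tf/idf idea, here is a minimal Python sketch. It is not DocumentCloud's or Lucene's actual implementation: the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `related`) are hypothetical, the sample documents are invented, and the sketch assumes the common logarithmic form of inverse document frequency with cosine similarity for ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document.

    Weight = (term frequency in the document) * log(N / document frequency).
    A word that appears in every document gets log(1) = 0, so it drops out.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(docs, index):
    """Rank every other document by similarity to docs[index], best first."""
    vectors = tf_idf_vectors(docs)
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return sorted(scores, reverse=True)


# Invented sample documents, for illustration only.
docs = [
    "the epa issued a ruling on water pollution",
    "water pollution ruling by the epa draws criticism",
    "the city council approved the school budget",
]
ranking = related(docs, 0)
```

In this toy collection the word "the" appears in every document, so its weight is zero, while shared distinctive words such as "epa" and "pollution" pull the first two documents together. With only a handful of short documents the scores are coarse, which is in line with the caveat below about shorter documents.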

We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.

There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.