Excerpts for your Entities
We added excerpts to our timelines and entities tab today. We are using OpenCalais to parse every document you upload and extract the names of people, organizations, terms and places in the text. We display these under the Entities tab. We also extract date information from each document that you upload and plot those on a timeline which you can access from the Analyze menu.
DocumentCloud’s entities show you, at a glance, the people who are mentioned the most times in a given project, or the organizations that are named in each of the documents you’ve selected. Here’s how it works now:

Click on the “show pages” link next to any entity to reveal a thumbnail of each page in each document that contains that term, alongside excerpts highlighting the mention of the entity in the text. Clicking on the highlighted phrase will take you directly to the term within the document itself. In the screenshot below, you can see how the Environmental Protection Agency was correctly identified by both its proper name and its acronym.

We’ve added excerpts to the timeline as well. When you open a timeline from the Analyze menu and hover over any date, you’ll see a few words along with the date as it appears in the document, which is useful for corroborating a single event across multiple sources or for comparing different accounts of what should be a shared timeline. Click on a date to go straight to the point in the document where that date appears.

Hopefully excerpts will come in handy for your DocumentCloud projects. If you think of a way we can make them even more useful, comment or let us know.
Uploading Documents Gets a Little Easier
You’ve always been able to script batch uploads using our API, but for users without coding skills, uploads were one at a time. Today we’re rolling out an improved document uploading dialog that will let you upload as many documents as you want, all in one fell swoop.
You’ll still use the “New Documents” button, but now that button takes you straight to a file selection dialog.
Use the control (on MS Windows) or command (on Macs) key to select additional documents, just like you would in your file browser.

We’ll start you off by suggesting a title, based on each file’s name, but you can edit that name and add additional information, including the source of each document and a description. As with the old upload dialog, you can decide right when you upload your document whether or not you’re ready to share it with the world yet. As ever, you can edit all of these fields again later.

If your documents share a common source or description, use the “Apply to All Files” link to copy your metadata to each document in this batch. Note: this new upload interface requires Flash. If that’s an issue for you, let us know ASAP and we’ll whip up an alternate interface that doesn’t require any plugins. Promise.
As the files upload from your computer to DocumentCloud, you’ll see the progress of each transfer.
Better Processing, Too
This week’s release is more than just a new upload dialog. We’ve made some big changes to Docsplit and the RightAWS gem, and we’re hoping this means the dreaded “import failed” error will be a thing of the past. If looking under the hood is your thing, both Docsplit and our fork of RightAWS are on GitHub for your viewing (and reusing) pleasure.
Don’t be a Stranger
If you have gigabytes’ worth of documents to upload, get in touch before you start uploading so we can add more horsepower to handle your job. Otherwise, happy uploading! And don’t forget to tell us about what you’re publishing with DocumentCloud.
Related Documents
If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:

Related documents can span all documents visible to your account, including documents that other organizations have made public.
Under the hood, we are using a technique known as tf/idf, which measures document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by taking the frequency of the word in a particular document and dividing it by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words gain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built in.
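For the curious, here is a minimal sketch of the tf/idf idea in Python. It is an illustration only, not DocumentCloud's or Lucene's actual scoring code: the function names (`tokenize`, `tf_idf`, `cosine_similarity`) are made up for this example, the sample sentences are invented, and it assumes the common variant that scales term frequency by the logarithm of the inverse document frequency, then compares documents by cosine similarity.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text, split on whitespace and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def tf_idf(documents):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    weights = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words) or 1
        weights.append({
            # Term frequency in this document, scaled down for words that
            # appear in many documents. A word found in every document
            # gets log(1) = 0 and effectively drops out.
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "The EPA issued a permit for the plant.",
        "The EPA revoked the permit.",
        "The Senate passed the budget.",
    ]
    vectors = tf_idf(docs)
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

In this toy collection the word “the” appears in every document, so its weight is zero, while the first two documents are linked by the distinctive words they share. With very short documents there are few distinctive words to compare, which hints at why brief documents may turn up no related results.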

We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.
There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.

Introducing Page Notes
The Document Viewer has always supported the ability to create “page notes” — annotations that sit between two pages and provide commentary about a specific page as a whole or an introduction to a new section of a document. This morning, we released an update to DocumentCloud that provides a way for you to create page notes from within the viewer.
To try it out, open a document you’ve uploaded and click on one of the “Add a Note” links in the sidebar. Hover your crosshair over the margin in between pages, and you’ll see a dotted line appear, with a note tab on the left:

If you click, you’ll create a page note in between the two pages. Add a title and some text and click the “Save” button. The note’s title will appear in the navigation on the right. Of course, page notes are viewable and editable from the workspace, just like any other note.

If you’re logged in, you can take a look at the sample document shown here.
HTTPS Support (and Other Updates)
Monday morning we rolled out SSL support on DocumentCloud.org. Visit https://www.documentcloud.org to view, browse and edit documents in your workspace over an encrypted connection. When you use HTTPS, all traffic between your computer and DocumentCloud is encrypted before it’s sent over the internet. If you’re working on a public wireless connection, are on an unsecured network or are dealing with highly sensitive documents, we recommend using HTTPS.
You can tell if you’re on a secure connection by looking at your browser. When visiting a secure website, all browsers display a lock icon somewhere on the window. Here’s what the lock looks like in Google Chrome:
More Search Parameters
We’ve also added new ways to filter your DocumentCloud searches. You can now use “access” to filter your documents by their access level, and “projectid” to designate a specific project when you’re using our search API. (Access to searches and the API is limited to registered users during the beta.)
To view only your private documents in a particular project, you can add “access: private” to your search terms. Searching by “access: public” will show you only public documents, while “access: organization” will show you those documents shared within your organization.
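If you are scripting searches, a small helper can keep these filters consistent. The sketch below is hypothetical: `build_query` is not part of any DocumentCloud library, and the exact “projectid: 123” spelling is an assumption modeled on the “access: private” form shown above. It only assembles the search string and does not call the API.

```python
# Access levels named in this post.
ACCESS_LEVELS = ("public", "private", "organization")


def build_query(text="", access=None, project_id=None):
    """Assemble a search string from free text plus optional filters.

    Hypothetical helper: the "projectid: N" format is an assumption
    modeled on the "access: private" syntax described in the post.
    """
    terms = []
    if text.strip():
        terms.append(text.strip())
    if access is not None:
        if access not in ACCESS_LEVELS:
            raise ValueError(f"unknown access level: {access!r}")
        terms.append(f"access: {access}")
    if project_id is not None:
        terms.append(f"projectid: {project_id}")
    return " ".join(terms)


if __name__ == "__main__":
    print(build_query("madoff", access="private"))
    print(build_query(access="organization", project_id=42))
```

Rejecting unknown access levels up front means a typo fails loudly in your script instead of quietly returning an unfiltered search.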
Already using the search API? We’ve added search terms that let you limit public results to a single project. Drop a line to support AT documentcloud DOT org if you’d like to take advantage of this one.
Still waiting for an important feature? Let us know!
These improvements are only available to users who have an account on DocumentCloud. If you’re a reporter who works with primary source documents, and you’re not using DocumentCloud yet, contact us to find out how to start.
Guides and How-to’s
For some time now, instructions about how to use the DocumentCloud workspace have been available through our wiki. This morning, we released an update that pulls the help pages right into the workspace for easy access, and hopefully makes it faster to get your questions answered.
We’ve included pages on searching, account management, collaboration, privacy, uploading documents, troubleshooting failed uploads, and editing and publishing documents on your web site. The next time you log in to DocumentCloud, take a peek at the new “Help” tab, and let us know if there’s anything you think we should add to the guides.
Bidding IE6 Adieu
Last week, we rolled out an update to DocumentCloud’s document viewer that included a wide range of improvements that you might never even notice. Page layouts and scrolling look very different under the hood, pages load and scroll much faster now, annotations work better, and readers can resize a document viewer without setting off a barrage of little hiccups. We replaced much of the viewer’s JavaScript with CSS, which we hope will form a much more stable foundation for DocumentCloud development going forward. In the process, however, we stopped supporting Internet Explorer 6.
IE6 has long been the bane of web developers: developing web applications that work as well in IE6 as in other browsers is substantially more difficult than bypassing the ten-year-old browser.
IE6 users will still be able to download the original PDF of any document and will see a landing page that encourages them to upgrade their browser or install Chrome Frame.
The New York Times, with whom we continue to collaborate closely on development of the viewer component of DocumentCloud, has long planned to phase out support for IE6. They don’t test new tools against the browser and will soon update to the same version of their document viewer that DocumentCloud is running on. The Times isn’t alone: YouTube began phasing out support for IE6 in March and other Google products are expected to follow suit. We’re certainly open to feedback on our implementation.
Meantime, take a look at some of the great things reporters are doing with DocumentCloud.
Welcome, Samuel Clay
Our third hire! Developer Samuel Clay joins DocumentCloud today, bringing our full-time staff to a total of three.
Samuel joins us from Storybird, a collaborative storytelling startup which works with artists to give children access to high-quality narrative art that they can use to publish their own original stories. He’s also the mastermind behind NewsBlur, an open source feed reader that uses artificial intelligence to suggest stories you might want to read. Think of it as an RSS reader with intelligence.
Samuel lives in Brooklyn with his dog and guinea pigs, and he photographs historic districts for New York Field Guide. Find him at samuel@documentcloud.org or on Twitter.
He’ll be bringing his formidable JavaScript skills to DocumentCloud’s workspace, which should be getting more awesome twice as fast now.
Collaboration
Since we launched DocumentCloud’s beta, one of the most common requests has been: “How can I share documents with reporters from other organizations?”
Now you can share a project with any other DocumentCloud user — in any newsroom.
How does it work?
Let’s say I have a project with documents relating to the Madoff Ponzi scheme, and I want to share them with Scott. To open the project for editing, I click on its edit icon.

Inside of the project, I click on the “Add a collaborator to this project” link, and I type in Scott’s email address — the one that he uses to log in to DocumentCloud.

After I click the “Add” button, Scott appears as a collaborator on this project.

The next time Scott logs in to DocumentCloud, “The Madoff Files” will show up as one of the projects in his sidebar. He can now view, edit and annotate all of the documents inside of it. He can add documents of his own to the project and I’ll be able to see and edit those as well.

Project collaborators can do anything with the documents in a project that you can do: they can edit public notes, change settings like the document’s title or source, and add “related article links.” Collaborators can also add people to the project or remove them from it. You can only collaborate with fellow DocumentCloud users, though: if you’re collaborating with a newsroom that isn’t yet part of DocumentCloud, send them our way and we’ll get them set up.
We’d love it if you would give it a spin and let us know what you think: write to support@documentcloud.org, or suggest improvements in our support forum, where fellow users can weigh in as well.
Embedding Documents on Your Site (UPDATED)
Over the past few months, you might have noticed a handful of news organizations using embedded documents to complement their reporting.
- The Chicago Tribune has a series of documents surrounding the ongoing Blagojevich criminal trial.
- The Arizona Republic publishes a law professor’s analysis of Senate Bill 1070, the controversial immigration law.
- PBS NewsHour details the coalition agreement between Conservatives and Liberal Democrats.
- The Center for Public Integrity releases emergency logs from the U.S. Coast Guard of the Gulf Coast oil spill.
This morning, we’re opening up the ability to embed documents to all of the newsrooms participating in DocumentCloud. When you log into your workspace, you’ll notice a new menu: “Publish”.

From here, you can grab an embed code (a short snippet of HTML) that can be dropped onto a web page to create a document viewer. You may be familiar with such snippets from embedding YouTube videos: this works in a similar fashion. For guidelines on setting up a template and other help, check out our documentation.
If you still have questions about the process, we’re listening at support@documentcloud.org.
Note: we know you’re eager to host documents yourself, and you can do that now, but we recommend that you stick with embedded documents so that you can take advantage of bug fixes and other improvements to the viewer. We don’t know yet whether we plan to offer embedding as a long-term service. Keep in mind, as well, that this is still a beta. As described in our terms, our capacity to commit to uninterrupted service is limited, as is our liability if service is interrupted in some way.
For those news organizations that want to host documents on their own servers, we’re now offering that as an alternative too. Click on “Download Document Viewer” to get a zipped up folder with all the code, text, and images bundled together as a web page. Drop the folder into any web server (no special software required), and voila, it’s online.
Search of the document’s text is provided by DocumentCloud as a service, but everything else in the package is completely static — just HTML, images, JavaScript and CSS. If you choose to use this alternative, there is a caveat: If you edit your annotations, or want to make any changes to the document, you’ll have to download it again.
Here at DocumentCloud, we’re looking forward to seeing the great reporting you do with embedded documents — don’t forget to use the workspace to add a “Related Article” link.

