When you upload a document to DocumentCloud, and the file does not contain text, we attempt to perform OCR (optical character recognition) on the document, using the open source Tesseract project. Tesseract is a venerable piece of software, originally developed at Hewlett-Packard between 1985 and 1995. Google acquired the project in 2006, and has been sponsoring work on it since then. A few months ago, Tesseract 3.0 was released; and this morning, we’ve deployed the new version of Tesseract as part of DocumentCloud. Continue reading »
This morning we rolled out a much-requested update: the ability to set the date on which a document will become public. Until now, users have been manually making documents public when a story was ready to go live. Newsrooms that run new reporting late into the night were left to make documents public before the reporting was available. As DocumentCloud’s user base grows, and as we inch towards public searches of the catalog, we knew reporters needed to be able to set the hour of publication and know that no one will get a sneak peek at their reporting. Continue reading »
When Krista Kjellman Schmidt was putting together the chart illustrating key deletions in a set of documents, she needed a quick way to grab thumbnail images of particular pages. She was lucky — I just happened to be at ProPublica’s offices and just happened to overhear a few snips of conversation. You might be able to use firebug, someone suggested, and then reduce the images? another followed. I’m not sure what prompted me to look up (possibly the fact that I’m a busybody?) but I did, and when I realized what she was trying to do I had a far better suggestion. It went something like this: Continue reading »
Last week, Investigative Reporters and Editors had us over for a webinar all about DocumentCloud. A video of the webinar will be available very soon from IRE. In the meantime, I did promise to answer every last question from the back channel.
We covered a lot of ground but there were plenty of questions that I couldn’t get to in the hour allotted. I realized, as I read through the questions that folks were asking, that a good many of our current users have been hesitant to ask questions when our documentation isn’t clear. Part of being in beta means that DocumentCloud is changing fast and often. Sometimes our documentation lags behind, sometimes we miss things. Sometimes what we think is clear makes no sense at all to you. So we’re trying something new: on top of all the other ways in which you’re more than welcome to contact us, we’re setting up virtual office hours. Stop by: we’ll take all questions, no matter how technical or mundane. As for the questions you’ve already asked … Continue reading »
I’m pleased to announce that today (finally) we’re releasing one of our most highly-requested features: a smaller-format document viewer that can be embedded inline within the text of an article. A few newsrooms have been beta-testing these fixed-size document viewers over the past couple of weeks, so you can see them in action by visiting this Des Moines Register article about Fred Hoiberg’s contract, or this WNYC annotation of New York’s new ballot design. Continue reading »
We added excerpts to our timelines and entities tab today. We are using OpenCalais to parse every document you upload and extract the names of people, organizations, terms and places in the text. We display these under the Entities tab. We also extract date information from each document that you upload and plot those on a timeline which you can access from the Analyze menu.
DocumentCloud’s entities show you, at a glance, the people who are mentioned the most times in a given project, or the organizations that are named in each of the documents you’ve selected. Here’s how it works now:
Click on the “show pages” link next to any entity to reveal a thumbnail of each page in each document that contains that term, alongside excerpts highlighting the mention of the entity in the text. Clicking on the highlighted phrase will take you directly to the term within the document itself. In the screenshot below, you can see how the Environmental Protection Agency was correctly identified by both its proper name and its acronym. Continue reading »
You’ve always been able to script batch uploads using our API, but for users without coding skills, uploads were one at a time. Today we’re rolling out an improved document uploading dialog that will let you upload as many documents as you want, all in one fell swoop.
You’ll still use the “New Documents” button, but now that button takes you straight to a file selection dialog.
Use the control (on MS Windows) or command (on Macs) key to select additional documents, just like you would in your file browser.
We’ll start you off by suggesting a title, based on each file’s name, but you can edit that name and add additional information, including the source of each document and a description. As with the old upload dialog, you can decide right when you upload your document whether or not you’re ready to share it with the world yet. As ever, you can edit all of these fields again later.
If your documents share a common source or description, use the “Apply to All Files” link to copy your metadata to each document in this batch. Note: this new upload interface requires Flash. If that’s an issue for you, let us know ASAP and we’ll whip up an alternate interface that doesn’t require any plugins. Promise.
As the files upload from your computer to DocumentCloud, you’ll see the progress of each transfer.
Better Processing, Too
This week’s release is more than just a new upload dialog. We’ve made some big changes to Docsplit and the RightAWS gem, and we’re hoping this means the dreaded “import failed” error will be a thing of the past. If looking under the hood is your thing, both Docsplit and our fork of RightAWS are on GitHub for your viewing (and reusing) pleasure.
Don’t be a Stranger
If you have gigabytes worth of documents to upload, get in touch before you start uploading so we can add more horsepower to handle your job. Otherwise, happy uploading! And don’t forget to tell us about what you’re publishing with DocumentCloud.
If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:
Under the hood, we are using a technique known as tf/idf, which measures document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by taking the frequency of the word in a particular document and dividing it by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words gain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built in.
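To make the idea concrete, here is a minimal sketch of tf/idf similarity in plain Python. It is an illustration only, not the scoring that Lucene or DocumentCloud actually runs: the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `related`) and the sample sentences are invented for this example, and it uses the common log-scaled form of the "divide by frequency in the collection" step, which is an assumption on our part.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic words."""
    return [word for word in text.lower().split() if word.isalpha()]


def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency in this document, scaled down by how common
            # the word is across the whole collection (log-scaled: an
            # assumption, not necessarily Lucene's formula).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(docs, index):
    """Rank every other document by similarity to docs[index]."""
    vectors = tf_idf_vectors(docs)
    scores = [
        (cosine(vectors[index], vector), i)
        for i, vector in enumerate(vectors)
        if i != index
    ]
    return sorted(scores, reverse=True)


# Made-up sample documents, for illustration only.
docs = [
    "the agency released the pollution report",
    "the agency disputed the pollution figures",
    "the coach signed the contract",
]
ranking = related(docs, 0)
```

In this toy collection, “the” appears in every document, so its weight is zero and it contributes nothing to any comparison. The first two documents share the more distinctive words “agency” and “pollution”, so the second document ranks as the closest match to the first, while the third shares only “the” and scores zero.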
We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.
There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.
For some time now, instructions about how to use the DocumentCloud workspace have been available through our wiki. This morning, we released an update that pulls the help pages right into the workspace for easy access, and hopefully makes it faster to get your questions answered. Continue reading »
Since we launched DocumentCloud’s beta, one of the most common requests has been: “How can I share documents with reporters from other organizations?”
Now you can share a project with any other DocumentCloud user — in any newsroom.
How does it work?
Let’s say I have a project with documents relating to the Madoff Ponzi scheme, and I want to share them with Scott. To open the project for editing, I click on its edit icon.
Inside of the project, I click on the “Add a collaborator to this project” link, and I type in Scott’s email address — the one that he uses to log in to DocumentCloud.
After clicking the “Add” button, Scott now appears as a collaborator on this project.
The next time Scott logs in to DocumentCloud, “The Madoff Files” will show up as one of the projects in his sidebar. He can now view, edit and annotate all of the documents inside of it. He can add documents of his own to the project and I’ll be able to see and edit those as well.
Project collaborators can do anything with the documents in a project that you can do: they can edit public notes, change settings like the document’s title or source, and add “related article links.” Collaborators can also add people to the project or remove them from it. You can only collaborate with fellow DocumentCloud users, though: if you’re collaborating with a newsroom that isn’t yet part of DocumentCloud, send them our way and we’ll get them set up.
You’ll notice a lot of changes when you next log in to DocumentCloud. We’re still working hard to make DocumentCloud work better, and we hope that our new layout will make it much easier to access and use the entities (aka keywords: people, places, organizations, terms…) that OpenCalais provides. You’ll also notice a ton of little improvements, if you look closely.
You can now view all the entities that OpenCalais identified in your search results by clicking over to the “Entities” tab at any time. The number alongside each term shows you how many documents in your results contain each term. Select any term to filter your search to include that term and then use the “show pages” link to see each page on which a term appears.
View More Documents:
In addition to viewing the details of each document in a list, you can now choose to display a grid of your documents as thumbnails, with 30 documents on a page. This should make it easier to organize projects, and scan through search results.
Many of our (very cool, we realize) keyword visualizations are gone. The timeline is still there to show you what dates appear where in your documents, but gone are the swooping lines that highlighted exactly which documents contained any particular entity term. As you probably noticed if you tried to use them for more than just a test drive, they were a little too limited to be useful.
You can still access all the information you used to be able to get from those visualizations through the new entities tab, in what is hopefully a far more helpful fashion, but we’re taking the visual displays back to the workshop for the time being.
The workspace’s new look is thanks to the talented Folkert Gorter, the interface designer responsible for Cargo and Good.is, among other fine work. We’re continuing to work together closely on the workspace and other portions of DocumentCloud, as we move into the home stretch of our first year.
DocumentCloud is still very much in beta and we continue to welcome your suggestions, bug reports and feedback.
DocumentCloud is still in beta. One thing that means is that it gets better every day. Or almost every day. A few recent improvements are worth noting:
Manage Access from the Edit Menu
Sharp-eyed users will notice that the “Manage” pulldown has disappeared. If you want to delete, publish (or un-publish) documents, you can do that right from the Edit pulldown.
If you’ve published a story that made use of a document you uploaded to DocumentCloud, you can tell us about it by adding the URL to the “Related Article” field, available from the edit menu.
Links to your reporting will then appear in the document viewer if someone finds your documents through a search of public documents.
Click and Double-Click
We heard you loud and clear: clicking thumbnails to “select” and titles to “open” wasn’t intuitive. We changed all that. Now you can select anywhere on a document’s thumbnail, title or description. Click once to select it and twice to open it.
Try out ctrl-click (command-click for you Mac users) and shift-click as well: you’ll find they work a lot more like they do on your desktop.
Want to join the beta? Write to firstname.lastname@example.org and tell us about the documents you’re working with.