Sep 22nd, 2010


Amanda Hickman

Last week, Investigative Reporters and Editors had us over for a webinar all about DocumentCloud. A video of the webinar will be available very soon from IRE. In the meantime, I did promise to answer every last question from the back channel.

We covered a lot of ground but there were plenty of questions that I couldn’t get to in the hour allotted. I realized, as I read through the questions that folks were asking, that a good many of our current users have been hesitant to ask questions when our documentation isn’t clear. Part of being in beta means that DocumentCloud is changing fast and often. Sometimes our documentation lags behind, sometimes we miss things. Sometimes what we think is clear makes no sense at all to you. So we’re trying something new: on top of all the other ways in which you’re more than welcome to contact us, we’re setting up virtual office hours. Stop by: we’ll take all questions, no matter how technical or mundane. As for the questions you’ve already asked …

The questions below were edited ever so very lightly from the actual webinar chat log and re-arranged into something like order. We’ll be incorporating most of these answers into our own materials soon as well.

Getting Started

Is there a preferred browser?

We caused something of a stir when we stopped supporting Internet Explorer 6, so IE6 definitely won’t work. Firefox 3.x will, as will IE 7 or any recent version of Chrome, Opera or Safari. If a browser isn’t working for you, definitely let us know!

Is this Scribd 2.0 basically?

Scribd is a great resource, and a better fit for plenty of projects. I’m sure that Scribd is working hard on their own Scribd 2.0. Our primary interest is in supporting journalism and promoting the kind of engagement and participation that comes with really digging into the news. Those primary goals are always going to guide us as we develop DocumentCloud. Scribd’s mission is different. We’re working hard to incorporate more tools, like OpenCalais entity extraction and timelines. We’re looking for ways to facilitate the kind of collaboration that goes on in newsrooms, and we’re trying to build a catalog of primary source document material that will be accessible. That’s part of why we’re manually approving account requests: when we open DocumentCloud up to public searches, we want the public to be confident that a document attributed to Memphis’ Commercial Appeal really was supplied by that paper.


Is there a cost for an account?

DocumentCloud is free of charge as long as you’re making documents public.

Is DocumentCloud limited to one account per newsroom, or can several people in one newsroom get individual accounts? / Can the whole newsroom use one account?

We’d prefer that everyone in your newsroom have their own account. Our (somewhat rough) system is this: the first account we set up for your newsroom is an administrator, whose privileges include the ability to add additional users. Those users can be either administrators or simply contributors.

We’d definitely prefer that you create accounts for each user — it is really not hard to do.

Could IRE be an institutional user group for DocumentCloud (as an alternative to publication or journalism school groups)?

For that to happen, IRE would need to be prepared to take responsibility for managing accounts in their organizational group and IRE would be accountable for everything uploaded under that group’s auspices. That would be a lot of work for IRE and it doesn’t make a lot of sense at this stage.

What about freelancers?

If you’re a reporter who is working with primary source documents and you don’t fit the newsroom mold, we need to hear from you. Accounts are currently all publication based, and we haven’t worked out the best approach to working with freelancers yet. So far, most independent reporters have done one of two things: they’ve signed their own blog up as a document contributor or they’ve worked with an established news outlet that they report regularly for.

If neither of those options is a good fit for you, we want to hear from you. You can help by giving us a good sense of why a newsroom account won’t work for you and who you are as a reporter. Information like … Who have you written for recently? Do you teach, and if so where? What professional associations are you a member of? … will help us figure out what freelancers need from us and how we can bring more independent journalists on board.

Should the person administering this in each newsroom be the person who handles the CMS? If we’re talking snippets of code and pixel-width counts, that seems like extra confusion.

That’s up to you. Your newsroom can have as many or as few administrators as you need.

What’s the difference between administrator accounts and contributor accounts?

Administrators can add new users and edit existing users. They can also edit public notes on documents that do not belong to them, but which are visible to them because they have been shared across the newsroom group. Administrators can also change the access level and metadata (e.g. title, description, source) for any document they have access to. Administrators cannot, however, access private documents.

Collaborators are another story. A collaborator is anyone who you have invited to share a project. Collaborators can view and edit any document in that shared project, no matter how its access level is set. This means they can view and edit public annotations as well as document metadata.

No one but you can read your private notes.

Does all this seem complex? We want to hear from you: what works about our system of account privileges? What doesn’t? What do you need to make DocumentCloud fit your workflow better?

Uploading Documents

What is the maximum size (or page limit) for document upload? / I’ve tried to upload large PDFs that didn’t process.

We recently improved our document processing utilities: most users should not be getting “import failed” messages anymore. Usually, when we can’t process a document the problem is more specific than just size. Locked or password protected documents will still choke, for instance. We try to keep an eye on the queue, but if you hit a wall, definitely let us know. Also, look over our troubleshooting page for more insights on troubleshooting your documents.

As far as maximum sizes go, we haven’t found one yet.

What about tables and numbers?

Tables and numbers shouldn’t break the processing system, though DocumentCloud doesn’t have the foggiest idea what rows and columns are, and it definitely can’t do math.

How well does your OCR work?

We’re using a free and open source product called Tesseract. I would call our OCR pretty good. Not excellent, certainly not bowl-me-over amazing. If your primary interest in DocumentCloud is free OCR, expect to be disappointed. You don’t need us to run Tesseract! If you have the resources to license better OCR, definitely do. For most documents, though, Tesseract’s OCR is good enough to support text searches of your document, provide text to OpenCalais and inform our date extraction engine.

How is this different from OCRing and using Google Desktop to search?

We (obviously) think it is pretty different. You’re accessing and contributing to a growing public catalog of documents, so you have access to more than just your own documents. You’re also able to publish the documents you’re working with, with or without annotations. I could make a long, long list, but if you’ve read this far you already know most of what is on it!

Have you considered allowing users to post web pages to their account?

At this stage, I think the option would be to print the web page to a PDF and upload that PDF to DocumentCloud. I think a person could, if they were inspired to do so, write a Firefox toolbar widget to do all of that in one sweep, though it isn’t on our list yet. What you really need is citability!

Working With Documents

Can I merge PDFs after they’ve been added to DocumentCloud?

At this stage we don’t offer any PDF manipulation tools, but take a look at our troubleshooting guide for a good round up of tools that you can use to manipulate PDFs before you upload them.

What, no tagging?

No tagging. OpenCalais provides extensive automated data about each document that you upload, and you can add documents to any number of projects for your own sorting needs. If you need more than that, let us know a bit more about what you’re trying to accomplish and we’ll see if we can’t find a solution that makes sense.

What kind of security is in place? Hate to have sensitive documents stored on the Web and have it hacked …

DocumentCloud is definitely not the place for truly sensitive documents. For one thing, everything is stored, unencrypted, on Amazon S3 servers. We’re still in beta and we’re making changes all the time. We test new features intensively before we push them to our production server but there’s always the chance that we’ll make a mistake.

We take privacy seriously, but we definitely do not encourage you to store highly sensitive documents on our servers.

Can I download the OCR’d document?

Unfortunately, you can’t. OCRing the document is one thing. Re-assembling the OCR information into a complete PDF is more complex. You can download the full text of the document, separate from the original PDF, but if the PDF you uploaded didn’t include text information, the PDF you download won’t either.

If you’re working on a document and it’s public, can everyone add notes to the document? Or can only the owner add notes?

Any user can add private notes to any document they’re able to access, but only a document’s owner can add or edit public notes. If a document is part of any shared projects, all collaborators on that project can add and edit notes as well. Any administrators in the owner’s newsroom group can also add and edit public notes if a document is public.

Searching the Catalog

Do you have to put searches in quotations?

Nope. Take a look at our search tips to review all your search options. Putting a term in quotes will limit your search results to the whole phrase. For instance, a search for world trade center will find documents that refer to world, trade of all sorts and/or centers of any kind, while a search for "world trade center" will only find documents that use those three words together as a phrase.
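If the difference between the two kinds of search is easier to see in code, here is a toy sketch in Python. It is only an illustration of the idea, not how DocumentCloud's search is actually implemented: the function names are made up for this example, and the `require_all` switch exists because this sketch doesn't assume whether an unquoted search needs every word or just some of them.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into plain words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_terms(text, query, require_all=False):
    """Unquoted search: look for the query's words anywhere, in any order.

    require_all is an assumption of this sketch; flip it to see both readings.
    """
    words = set(tokenize(text))
    terms = tokenize(query)
    if not terms:
        return False
    hits = [term in words for term in terms]
    return all(hits) if require_all else any(hits)


def matches_phrase(text, phrase):
    """Quoted search: the words must appear together, in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))
```

A document about "the world of trade" and "a new shopping center" would satisfy `matches_terms` for world trade center, but only a document that uses the three words side by side would satisfy `matches_phrase`.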

Why doesn’t it tell me how many documents meet the search criteria?

It actually does, way down at the bottom of the screen. Sounds like we should make that more visible!

Publishing Documents

There were quite a few questions specific to the document publishing process, so we’re looking into arranging a follow-up session with current users where we’ll cover embedding and publishing documents more thoroughly.

Let Amanda know if we should be sure to alert you when that’s scheduled!

Will your project eventually be visible to the public?

No, we never expose your project structure. You can, however, use our API to search for public documents by project in order to display them on your own site, which means you can expose projects yourself.
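For the curious, here is a rough Python sketch of what building one of those API searches might look like. Treat every specific as an assumption to check against our API documentation: the endpoint URL, the `projectid:` search syntax, the `page` and `per_page` parameters, and the project id in the usage note are all placeholders for illustration. The sketch only builds the URL; it doesn't make a request.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the API documentation before relying on it.
API_BASE = "https://www.documentcloud.org/api/search.json"


def project_search_url(project_id, page=1, per_page=10):
    """Build a search URL for the public documents in one project.

    The "projectid:" query syntax and the paging parameters are assumptions.
    """
    params = {
        "q": f"projectid: {project_id}",
        "page": page,
        "per_page": per_page,
    }
    # urlencode percent-escapes the colon and turns the space into "+".
    return f"{API_BASE}?{urlencode(params)}"
```

Calling `project_search_url("123-example")` gives you a URL you could fetch from your own site's code and use to list that project's public documents on a page of your own.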

How do we add the annotated documents to our websites?

You want to select a document and then use the “Embed Document Viewer” option in the publish menu, and follow the instructions there to get to a few lines of HTML that you can paste into your own template. You have other options, however. If you want a document viewer that is wholly independent of DocumentCloud, use the “Download Document Viewer” option instead.

You said not to just copy and paste it into our page. What else do we need to do?

You need to create some sort of HTML template to put the document into. A bare .html page with just our embed code and no html, head or body tags is incomplete and will break in most browsers.
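If your CMS doesn't hand you a template, a minimal one is not much work. Here is a small Python sketch that wraps whatever embed code you copied in a complete page. The function name and the stand-in embed snippet in the usage note are hypothetical; paste in the real lines from the publish menu.

```python
from html import escape


def wrap_embed(embed_code, title="Document"):
    """Return a complete HTML page with the embed code inside the body."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        # Escape the title so characters like & and < don't break the page.
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        # The embed code is inserted as-is; it is already HTML.
        f"{embed_code}\n"
        "</body>\n"
        "</html>\n"
    )
```

Something like `wrap_embed('<div id="viewer"></div>', "City budget memo")` returns a page with the html, head and body tags in place, ready to save as an .html file or adapt into your own template.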

Can you go over how skinning and implementation of DocumentCloud on my own website would work?

You have a lot of options for skinning or templating the documents you embed on your own site. We’ve collected a bunch of great examples and you’ll find pretty good documentation of the publishing process on our help pages.

If you’re still stuck, drop by our office hours and ask questions until you’re unstuck. And please, don’t be shy! If you’re confused, other people are probably confused, too. We need your help to find the soft spots in our documentation.

Is there a list with links of stories/projects published with DocumentCloud for examples of use? / What does/can it look like when you publish documents?

There is! Take a look at our list of featured documents.

Can you add the article url after publication if you don’t have it yet?

You sure can. While you’re logged in to DocumentCloud, select the document(s) and choose “Related Article URL” from the edit menu.


