Latest Updates: Our Blog

Author Archive

DocumentCloud.org seeks developer

Posted
Mar 1st, 2018

Tags
Documents

Author
Ted Han

Come help build the future of journalism. DocumentCloud.org is looking for a developer to help us expand our open-source platform designed to make journalism more impactful, transparent, and trustworthy.

DocumentCloud.org is a global platform that now hosts more than 4 million documents and serves hundreds of millions of pageviews each year. It has been used by tens of thousands of journalists worldwide to upload, analyze, annotate and publish primary source documents.

DocumentCloud has helped the largest and most innovative news organizations in the world tackle high-profile stories such as WikiLeaks, the Panama Papers and the Snowden documents.

The role is for a full-stack programmer who can continue to evolve and run a large-scale Ruby on Rails site in an AWS environment where uptime matters. The job requires being involved with the entire stack, from front-end to document indexing, but one needn’t be an expert in everything.

You’ll be working closely with an experienced team to help you grow in the areas you’re interested in, including opportunities to experiment with machine learning, chatbots, crowdsourcing and more.

Just as important, this role is for someone who cares about making better journalism but knows that code isn’t the solution to every problem. The developer will work hand-in-hand with the rest of the team to assess new opportunities, forge key partnerships, and evolve the product roadmap and overall strategy for DocumentCloud.

We are extremely flexible with location and remote working. DocumentCloud is a distributed team, based coast-to-coast, with people in Philadelphia, the Bay Area and Boston.

We believe deeply in the importance of diversity and inclusion. We strongly encourage people of color and LGBTQIA candidates to apply. We’re committed to structuring our hiring process and organization to be welcoming to applicants of all backgrounds.

This is a full-time position with paid vacation; health care, vision and dental benefits; and a 401(k).

The ideal candidate will have:

  • Experience building and scaling Ruby on Rails applications;
  • Experience working with Amazon Web Services or a similar cloud-based infrastructure;
  • Experience with front-end technologies and frameworks;
  • A desire to work for a mission-driven organization.

Other skills we’d like to see, though they are not required:

  • Familiarity with indexing engines, such as Solr;
  • Experience with Python and Django;
  • Experience working with distributed teams;
  • Experience with using data and documents to tell a story, either through traditional reporting or in other ways.

If this role sounds interesting to you, drop Aron a line at aron@documentcloud.org or apply directly at jobs@documentcloud.org.

Easier publishing with WordPress and oEmbed

Posted
May 5th, 2015

Tags
Code,Documents

Author
Ted Han

Today, we’re making it easier for you to embed documents and notes with an updated WordPress plugin and an oEmbed service to power it.

The WordPress plugin – which builds upon an earlier version developed by Chris Amico for NPR’s StateImpact project – adds the ability to embed notes as well as documents using shortcodes. And we’ve updated our embed wizard so it will generate the WordPress shortcodes for you like this:

[documentcloud url="http://www.documentcloud.org/documents/1699074-sb0101-05-enrs/annotations/210824.html" ]

Which embed like this:

 

The plugin is powered by our new oEmbed API. It's our next step in helping you integrate DocumentCloud embed codes into your content management system so embedding documents and notes is as simple as pasting in a URL.

You can install our plugin right now by visiting its WordPress page. Developers interested in adding simple embedding with our oEmbed API can find its documentation in our help pages.

Read on for more details:

WordPress

Since 2011, WordPress users have been able to embed documents on their blogs thanks to a plugin created by Chris Amico for NPR’s StateImpact project.

Chris’ plugin let users embed documents with shortcodes by translating the shortcodes into HTML embed codes.

With Chris’ help, and input from Adam Schweigert of the Project Largo team, we’ve released a new version of the plugin that adds:

  1. Note Embeds: You can now embed individual notes as easily as documents. Just pass a note URL into the shortcode with the url attribute.
  2. Raw URL Support: You can now paste the URL for a document or note onto its own line, and the plugin will translate that into an embed.
  3. oEmbed Support: Embed codes are now fetched from our oEmbed service rather than generated internally, so they’ll always be up to date.

Full installation and usage instructions are available on the plugin page.

Note for users of the existing plugin: Because we’re releasing a whole new plugin, prior installations of Navis DocumentCloud won’t automatically update. You’ll have to deactivate/delete Navis DocumentCloud from your site and install the new plugin. Sorry! This should be a one-time process, and future updates will be delivered through the normal WordPress update mechanism.

oEmbed

We’ve added an oEmbed API to make it easier for developers to get embed codes for documents and notes. oEmbed has become a standard for easily embedding Web content, and it has long been one of our users’ top feature requests. Now, instead of having to reverse engineer the format of our embed codes, developers can just send a request to the DocumentCloud API to ask for a correctly formatted embed code.
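As a rough illustration of what such a request might look like, here is a minimal Ruby sketch. The endpoint address and the helper names (`OEMBED_ENDPOINT`, `oembed_request_url`, `embed_html`) are assumptions made for this example, not the documented API; check the API help pages for the real address. The `url` and `maxwidth` parameters and the `html` response field come from the oEmbed specification itself.

```ruby
require "uri"
require "json"

# Assumed endpoint, for illustration only. Consult the API help pages
# for the address DocumentCloud actually serves.
OEMBED_ENDPOINT = "https://www.documentcloud.org/api/oembed.json"

# Build an oEmbed request URL for a document or note URL.
# `url` is required by the oEmbed spec; `maxwidth` is optional.
def oembed_request_url(resource_url, maxwidth: nil)
  params = { "url" => resource_url }
  params["maxwidth"] = maxwidth.to_s if maxwidth
  uri = URI.parse(OEMBED_ENDPOINT)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Pull the embed markup out of an oEmbed JSON response body.
# Raises KeyError if the response carries no "html" field.
def embed_html(response_body)
  JSON.parse(response_body).fetch("html")
end
```

A CMS would fetch the URL returned by `oembed_request_url` with its HTTP client of choice and pass the response body to `embed_html`, then drop the resulting markup into the page.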

CMSes that support oEmbed can turn DocumentCloud URLs for documents and notes directly into embeds, so users can be assured that their CMS won’t eat or mangle an HTML/JavaScript embed code. Developers, in turn, can use existing oEmbed tools to support DocumentCloud’s current embeds as well as future types of embeds we have in the works.

If you’d like your CMS to support embedding documents as easily as DocumentCloud’s WordPress plugin, you or your developers can read more about our oEmbed service in our API help pages.

Thanks for reading. Follow us on Twitter for more updates.

Shearing PDFs with PDFShaver at DocumentCloud

Posted
Mar 7th, 2015

Tags
Documents

Author
Ted Han

As of this week, documents uploaded to DocumentCloud will process much faster thanks to a new tool we’ve written called PDFShaver that wraps Google Chrome’s PDFium library.

How much faster? From our preliminary statistics, a lot.

Under the covers, DocumentCloud uses our Docsplit open source library to disassemble documents. Prior to PDFShaver, Docsplit relied upon GraphicsMagick and Ghostscript (GM+GS) to render PDFs and save pages as images.

GraphicsMagick and Ghostscript have served DocumentCloud well, but we’ve had trouble processing some poorly constructed documents that journalists receive from sources — governments, companies and non-profits, for example. Our search for a replacement led us to PDFium, and we found that not only did it solve a number of our issues but it also provided substantial gains in speed.

Testing PDFShaver and GraphicsMagick on 50 documents picked at random from DocumentCloud’s public collection shows that PDFShaver can render documents an order of magnitude faster (here’s our raw data). These data are a preliminary sample, but we’re excited about what they show about the kinds of speed gains we can make to our processing pipeline. We’ll continue to track PDFShaver and DocumentCloud’s performance as we make improvements, so look forward to more updates!

Rendering PDF pages with PDFShaver & PDFium

PDFShaver works by connecting PDFium to Ruby with a C/C++ extension inside a Ruby gem.  PDFium itself is an open source library and the software that powers Google Chrome’s PDF viewer. And aside from taking advantage of the speed and capabilities Chrome’s tools provide, we’re happy to be able to make open source PDF processing easier to access through a programming language such as Ruby.

For example, once the gem is loaded, picking the landscape-oriented pages out of a document and rendering them takes just three lines:

require "pdfshaver"

document = PDFShaver::Document.new("./path/to/document.pdf")
# An aspect ratio above 1 means the page is wider than it is tall.
landscape_pages = document.pages.select { |page| page.aspect > 1 }
landscape_pages.each { |page| page.render("page_#{page.number}.gif") }

We plan to keep improving PDFShaver and make more of PDFium’s features accessible to give Rubyists, data scientists and journalists a boost for overcoming the impediments that PDFs present.

If you’re interested in installing and using PDFShaver, you can read how on our GitHub repository. And if you’d like to help journalists and others free information from PDFs, your contributions are welcome!

 

Welcome Aboard, Anthony DeBarros

Posted
Jan 6th, 2015

Tags
People

Author
Ted Han

I’m excited to announce that Anthony DeBarros is joining DocumentCloud this week. We’ve known Tony for a long time as a DocumentCloud user at Gannett Digital and USA TODAY, where he’s led technology projects and data-driven stories and interactives. Tony will be joining me to lead and manage DocumentCloud’s product efforts as our platform and team grow.

Thanks to a grant from the Knight Foundation, we’re looking forward to expanding and improving our platform. We’d like to find new ways to help journalists with reporting, research and publishing on deadline. The grant is also designed to give us the footing we need to guarantee DocumentCloud’s long-term sustainability, so that journalists can continue to rely on the platform into the future, and Tony is going to be an important part of that process.

So, welcome Tony!  We’re happy to have you.

You can reach him at anthony@documentcloud.org and on Twitter as anthonydb, and you can learn more about him on the IRE website.

 

DocumentCloud searches for Product Manager

Posted
Oct 9th, 2014

Tags
Jobs

Author
Ted Han

DocumentCloud is a technology service created to help journalists and improve transparency in journalism.  Our platform helps journalists find and highlight interesting information in the primary source documents used in their reporting.

Our document research & publishing platform is used by the likes of the New York Times, ProPublica, the Guardian, NPR, La Nacion, Al Jazeera, Homicidewatch and many more. Journalists have used our tools to support their innovative, award-winning and world-changing reporting (you can see examples on our featured reporting page).

With the help of a $1.4 million grant from the Knight Foundation, we’re expanding our team so that DocumentCloud can grow as an organization and sustain itself as an effective and efficient tool for our users.

We’re searching for a product manager who has experience working with software development teams.  You need to be comfortable with human-centric design processes (even in an informal way).  As manager you’ll work with our development team to plan, prioritize and keep us focused on making good products.  Together with our team, you’ll be responsible for developing products & features in line with our mission and our goal of generating enough revenue to sustain DocumentCloud.

We care deeply about contributing to the civic sphere, whether that’s through better tools for reporters or through open source software and patterns. In fact, the DocumentCloud platform itself was written as open source components, many of which, including Backbone.js, Underscore.js, VisualSearch and Docsplit among others, have gone on to be adopted in other industries.

So, we hope you’ll join us to make great products with an eye toward making the world greater too. You’ll be able to join us from wherever you live, as DocumentCloud operates as a distributed organization. Officially we’re based out of the Columbia, Missouri offices of our parent organization, Investigative Reporters & Editors, but we primarily function through Slack, IRC and Hangouts/Skype.

Email us at jobs@documentcloud.org by October 31st!

 

 

How we made DocumentCloud note embeds responsive

Posted
Sep 25th, 2014

Tags
Documents

Author
Ted Han

By Tom Meagher, data editor at The Marshall Project,
Emily Yount, interaction designer at The Washington Post,
Matt DeLong, national digital projects editor at The Washington Post
and Ted Han, lead developer at IRE/DocumentCloud

On Aug. 3, The Marshall Project, a new nonprofit journalism organization focused on criminal justice issues, published an investigation in partnership with The Washington Post that revealed new evidence raising doubts about a high-profile Texas execution.

Tom: Our reporter, Maurice Possley, began working on this story months before most of the rest of our newsroom at the Marshall Project was even hired. By the time we were able to start helping, the story was mostly reported, so we dove into the documents to bring ourselves up to speed.

The case against Cameron Todd Willingham — who was executed in Texas for the murder of his three daughters — had been written about extensively over the last 22 years, but a lot of new information was uncovered, and it was all in the documents. We knew we wanted to be able to explore and highlight the correspondence that cast this case in an entirely new light. DocumentCloud was clearly the answer.

In the course of his reporting, Possley, who has covered this case for more than a decade, was given access to copies of dozens of primary source documents that tell the backstory of Cameron Todd Willingham and the informer who helped convict him. In filing its grievance with the State Bar of Texas against the former prosecutor in the case, the Innocence Project had acquired these documents and assembled them into a series of appendices. They gave us eight PDF files that added up to nearly 400 pages. We used DocumentCloud to stitch them all back together into one large file.

We then combed through the appendices and dozens of other records of court testimony and correspondence. As we saw the various typefaces and handwriting styles that made up the key passages, we knew we wanted to use DocumentCloud notes to present excerpts directly in the story.

Matt: I started working on the story in earnest a couple of weeks before it published. We were very excited about having so many primary-source documents to enrich the narrative. The Post has been using DocumentCloud for years, but we’ve long been frustrated by one of its biggest limitations: it isn’t mobile-friendly. This isn’t really DocumentCloud’s fault; these scanned documents are a set size, so when you scale them down, at some point words will become too small to read.

We had seen how the New York Times addressed the problem, by putting up the text and linking to the original document in DocumentCloud. That’s totally logical and fine if the words are all that you care about, but in this case we have official letters and handwritten notes between the characters in the story. The pages themselves are interesting, and many readers will want to see them with their own eyes.

We decided at the outset that however we ended up displaying the documents we included in the story, they had to be responsive. But this meant we’d have to come up with our own hack. Emily and I had both been thinking about this problem individually for a while, and we had some time to work on it, so we decided to try to figure out a solution that we could use in this project.

Emily: At the time of publication, DocumentCloud’s note embed code already resized and repositioned the note based on the width of the DC-note-container div, so I knew we only needed to solve for when the note is wider than the note embed and the right side of the note is cut off (see image below).

image01

To solve this problem, when the embed first loads, the code stores the coordinates, width and height of the note relative to an image of the page of the annotated document. When the page loads, the browser resizes or the orientation of your device changes, JavaScript media queries (matchMedia) detect whether the note is wider than the embed and, if so, resize and reposition the document image.

The original coordinates, width and height allow us to determine how wide the note is in relation to the document image and resize the document just enough to make the note 100% of the embed, instead of the document image 100% of the embed. This helps with readability by making the text as large as it can be. At times, depending on the width of the note and the size of the text, there will still be readability issues, so cropping the annotations carefully and testing to make sure they are readable is really important.
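The resizing arithmetic can be sketched in a few lines. This is an illustrative Ruby sketch only, not the code that shipped in documentcloud-notes (which runs in the browser as JavaScript); the `Note` struct, the `fit_note` helper and the assumption that the page image is displayed at its original pixel size when the note already fits are all hypothetical.

```ruby
# Stored geometry of a note, in pixels of the original page image.
# (Hypothetical structure for this sketch.)
Note = Struct.new(:left, :top, :width, :height)

# Returns the page-image width and left offset (in display pixels) that make
# the note span the embed exactly, or nil when the note already fits.
def fit_note(note, image_width, embed_width)
  return nil if note.width <= embed_width

  # Shrink the whole page image by the ratio of embed width to note width,
  # so the note (not the page) ends up 100% of the embed.
  scale = embed_width.to_f / note.width
  {
    image_width: image_width * scale,
    # Shift the image left so the note's left edge meets the embed's edge.
    offset_left: -note.left * scale
  }
end
```

For a note 600 pixels wide that starts 100 pixels from the left of a 1,000-pixel page image, a 300-pixel embed gives a scale of 0.5: the image is drawn 500 pixels wide and shifted 50 pixels to the left, so the note exactly fills the embed.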

Here’s an example from the Willingham story of a responsive note on an iPhone 5:

image00

Ted: We were thrilled when Ben Chartoff (OpenNews fellow at the Washington Post) reached out to put Emily in touch with us.

We believe deeply in DocumentCloud as an open source project as well as the service to which journalists post documents relevant to the public interest. Emily and Matt’s motivation to extend the behavior which DocumentCloud already provides and to share their code back is exactly the kind of effort we love to see and encourage.

Technology in the world of news is a means toward the end of better reporting. Especially in competitive industries like ours, an open source ethos around the tools we all share is an avenue for us to work together to improve the state of all reporting. Anyone who solves an issue for their own needs can help to solve that issue for everyone.

In that spirit we were excited to incorporate Emily’s code into our own. To do so, we spun our note code off into its own repository to make it easier for anyone to contribute (you can find the code on GitHub as documentcloud-notes). Then with the Washington Post’s & Marshall Project’s stories as a basis we began incorporating the changes. Ultimately, we ended up rewriting much of Emily’s code in the process, but what she had written served as the design criteria to anchor the code we wrote.

Our responsive notes code is live on DocumentCloud now, and journalists needn’t take any additional steps to use it. Any embedded note from DocumentCloud will now behave responsively.

DocumentCloud Seeks Developer to Work on Open Source Software Platform

Posted
Jul 16th, 2012

Tags
People

Author
Ted Han

We have a lot of projects going on at DocumentCloud and to serve those goals we’re looking for others to join us! For those who may be unfamiliar with our project, we’ve included the full details below.

DocumentCloud is a web-based platform allowing journalists to upload, analyze, annotate, and publish primary source documents. We want to give journalists the tools to show their audience their source material, not just tell them about it. In addition to the newsrooms worldwide who use DocumentCloud, our open source software projects, such as Backbone.js, Underscore.js, Docsplit, and Jammit, are relied upon by companies such as LinkedIn, Walmart, Foursquare and more. DocumentCloud is run by Investigative Reporters & Editors.

What DocumentCloud is building

  • DocumentCloud is growing fast, and we’re looking to accelerate that pace by expanding our tools into other languages beyond English. In the next year we’ll adapt our platform to accommodate multi-language OCR, search indexing, and entity extraction tools.
  • DocumentCloud always looks for new ways to present documents and engage readers. We are extending DocumentCloud’s document viewer and annotation tools so that readers can make their own comments and notes on documents.

DocumentCloud is looking for someone with a combination of the following skills

Experience with Ruby and JavaScript; API-driven web applications; working on and fostering FOSS; user-centered products; experience with the JVM toolchain; Linux administration on Platform as a Service providers such as AWS.

Things we like and hope you like too!

Literate programming; Extracting libraries from app code; Polyglot programming; Web standards; Journalists; Natural Language Processing

Practical Details

Investigative Reporters & Editors is based in Columbia, Missouri, on the University of Missouri’s campus. DocumentCloud is comfortable operating with a distributed team.

You can email us at jobs@documentcloud.org.

State of the Backbone

Posted
Jun 4th, 2012

Tags
Code

Author
Ted Han

It may surprise some to find that Backbone.js was released just over a year and a half ago. In that time, Backbone has gained a remarkable reach, and one that stretched beyond our expectations. As just one example (and there are many examples to be found in the Backbone documentation), last week brought us to BackboneConf in Boston where several hundred developers gathered to discuss JavaScript application development.

Jeremy Ashkenas, engineer emeritus of DocumentCloud, opened the conference with the first State of the Backbone keynote. We’ve recorded it for those who weren’t able to join us at BackboneConf.

Update on Searching and Entities

Posted
Sep 28th, 2011

Tags
Workspace

Author
Ted Han

Users who tried to search for pretty much anything on DocumentCloud this morning noticed pretty quickly that there was something not quite right on our servers. The short story is this: the problem was caused by human error and our servers are in the process of rebuilding the index that failed.

The longer story, for those of you who’ve been tracking updates about our search outage, is this: Continue reading »

Printing Document Annotations

Posted
Sep 26th, 2011

Tags
Documents

Author
Ted Han

We’ve been hard at work during our short Columbia, Missouri hackathon at DocumentCloud’s new home at the Investigative Reporters & Editors office. As a result we’ve rolled out a new feature for readers and journalists to print annotations made on documents.

Journalists have been publishing documents through DocumentCloud for a while now as well as annotating documents both for readers and for their own story writing processes. We think it’s just as important for DocumentCloud to make story writing quicker and easier as it is to help readers find primary source material.

So, when Marshall Allen of ProPublica told us that he would like to try using DocumentCloud to take his story notes, we did our best to help out. As a result, you can now select one or more documents in the workspace and choose “Print Notes” under the “Publish” menu.

This way you can annotate your sources in DocumentCloud, and have a single copy of all your research ready at hand for your copy editor or to read when your flight attendant announces that all power switches should be in the off position.

And readers can find a “Print Notes” link in the sidebar footer of the document viewer too.

We hope this will help readers and journalists alike note and collect information in the format that best suits their workflows. Happy Printing (and remember to recycle)!