Latest Updates: Our Blog

September 2009

Introducing Switch, A News Game About New York City’s Energy Gap

Posted
Sep 30th, 2009

Tags
IdeaLab

Author
Amanda Hickman

Cross-posted from PBS IdeaLab.

Our latest (and last, for now) news game, Switch, is live. It is no Energyville, but we think it is pretty awesome. Not only is it live, but the source code and installation instructions are already available.

With gadgets guzzling ever more energy, New York City faces a looming energy gap. New Yorkers will have to cut back on electricity use or start generating a lot more power. Our game lets people explore the options that are on the table, along with a few that aren’t. Should the city ban air conditioning? Harness the tides? Go nuclear? Warning: the game is addictive.

Switch is a concentration-style game that deals each player 18 pairs of cards, each representing an opportunity for the city to conserve or produce electricity. As players match pairs, they’re asked to decide whether each policy initiative is a good fit for New York City. At the end (or whenever the player grows bored!) players “flip the switch” to see how the measures they’ve accepted would add up against the city’s predicted 2030 energy needs.
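For readers who think in code, here is a rough sketch of the mechanic described above: deal two cards per measure, then tally the accepted measures against the projected gap. It is not taken from the Switch source. The `Measure` class, the function names, the three-card deck (the real game deals 18 pairs) and every number are made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str
    energy_units: float  # placeholder value, not a real estimate


def deal(measures, rng):
    """Return a shuffled deck holding two cards for every measure."""
    deck = [m for m in measures for _ in range(2)]
    rng.shuffle(deck)
    return deck


def flip_the_switch(accepted, gap_units):
    """Add up the accepted measures and compare the total with the gap."""
    total = sum(m.energy_units for m in accepted)
    return total, total >= gap_units


# Three options named in the post; the numbers are invented for illustration.
measures = [
    Measure("Ban air conditioning", 40.0),
    Measure("Harness the tides", 15.0),
    Measure("Go nuclear", 60.0),
]

deck = deal(measures, random.Random(0))
accepted = [measures[0], measures[2]]  # the player said yes to these two
total, gap_closed = flip_the_switch(accepted, gap_units=90.0)
print(f"Accepted measures cover {total} of 90.0 units; gap closed: {gap_closed}")
```

Seeding the shuffle with `random.Random(0)` only keeps the sketch repeatable; a real game would want a fresh shuffle each time.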

We worked with Will James of Tekimaki, whom we met through his very cool subway map project at onNYTurf which, in addition to being both early and awesome, is the only online NYC map I know of that is available in Estonian.

We’ve learned a lot about gaming and news games over the last two years, and a lot about building them on the cheap. More on that after you’ve all played Switch!

Discuss Introducing Switch, A News Game About New York City’s Energy Gap on PBS’s IdeaLab.

Two Dozen Media Outlets and Others Join Us as Beta Testers

Posted
Sep 24th, 2009

Tags
Code

Author
Scott

We have some more news: About two dozen news and other organizations have signed on as beta testers. They’ll be contributing documents to DocumentCloud and giving us feedback as we work out the kinks. It’s a wide-ranging list:

  • ACLU National Security Project
  • Arizona Republic
  • The Atlantic
  • Center for Democracy and Technology / OpenCRS
  • Centre for Investigative Journalism, City University London
  • Center for Investigative Reporting / California Watch
  • Center for Public Integrity
  • Chicago Tribune
  • Dallas Morning News
  • The Investigative Reporting Workshop at American University
  • The New Yorker
  • NewsHour
  • MinnPost
  • MSNBC
  • Mother Jones
  • Public.Resource.Org
  • St. Petersburg Times
  • Sunlight Foundation
  • Voice of San Diego
  • Washington Post
  • WNYC

These organizations will be joining our original set of contributors — The New York Times, ProPublica, Talking Points Memo, The National Security Archive, and Gotham Gazette — all of whom will of course be working with us during the testing too.

Earlier this morning we also announced that we’re working with Thomson Reuters’ OpenCalais service to extract and make available information from the documents contributed to DocumentCloud.

E-mail us if you’d like to participate in the testing. We’re interested in any organization, including non-profits and academic institutions, that has obtained documents during its research.

If you’re new here, the goal of DocumentCloud is to super-charge investigations by making documents, and the information in them, easier to find and share. Readers will be able to search documents on DocumentCloud and then will be pointed to the documents themselves on contributing organizations’ Web sites. (Here’s a FAQ with more details.)

Finally, you can keep following our progress on this blog, on Twitter, or via RSS. And we’re releasing our code each step of the way.

Thomson Reuters and OpenCalais

Posted
Sep 24th, 2009

Tags
Code

Author
Scott

This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.

If you’ve seen us do a presentation about DocumentCloud, you already know OpenCalais is going to be a key part of what makes DocumentCloud great.

CloudCrowd — Parallel Processing for the Rest of Us

Posted
Sep 14th, 2009

Tags
Code

Author
Jeremy Ashkenas

As we began to prototype DocumentCloud, it quickly became apparent that we were going to need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.

Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.

[Screenshot: CloudCrowd Operations Center]

CloudCrowd is intended for a moderate volume of highly expensive tasks — things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This sort of work is the kind of thing that always needs to be extracted from a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should help provide a convenient and scalable way to offload the work.

We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up and the more machines you add to the cluster, the faster you’ll be able to process the inputs. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
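To make the split-and-process idea concrete, here is a minimal sketch in plain Python. It does not use CloudCrowd or its API. A thread pool stands in for the worker daemons, documents are modeled as lists of page strings, and word counting stands in for the expensive work. The function names, and the `merge` step that recombines chunk results, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_PAGES = 10  # chunk size mentioned in the post


def split(document):
    """Break one document (a list of page strings) into 10-page chunks."""
    return [document[i:i + CHUNK_PAGES] for i in range(0, len(document), CHUNK_PAGES)]


def process(chunk):
    """Stand-in for expensive work: count the words on every page of a chunk."""
    return sum(len(page.split()) for page in chunk)


def merge(results):
    """Combine the per-chunk results into one result for the whole document."""
    return sum(results)


def run_job(documents, workers=4):
    """Run every chunk of every document through the pool, then merge per document."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit all chunks before collecting any result, so chunks from
        # different documents can be worked on at the same time.
        pending = [[pool.submit(process, chunk) for chunk in split(doc)]
                   for doc in documents]
        return [merge(future.result() for future in futures) for futures in pending]


documents = [["one two three"] * 25, ["four five"] * 3]
print(run_job(documents))  # word count per document
```

Threads inside one process will not speed up CPU-bound work the way separate worker machines do; the point here is only the shape of the job: split each input, process the pieces independently, and recombine.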

This is DocumentCloud’s first release of open-source code, hopefully the first of many to follow. If your batch-processing needs are similar to ours, I encourage you to give CloudCrowd a try. The source is available on GitHub, along with a wiki and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!

Our First Hire

Posted
Sep 14th, 2009

Tags
People

Author
Scott

We’re excited to announce that Jeremy Ashkenas has joined the team as the lead developer for DocumentCloud. His previous job was at Zenbe Inc., a provider of online email and collaboration software. He’s the creator of the Ruby-Processing visualization toolkit, and a two-time winner of the Sunlight Foundation’s Apps for America competition. Jeremy graduated from Brown University with a degree in Literary Systems.

Over the past few weeks, he’s been working on the central processing system for a DocumentCloud prototype. We are planning to open source this tool shortly … so stay tuned.

Improving Access to Information is One Way to Make Reporting Cheaper

Posted
Sep 9th, 2009

Tags
IdeaLab

Author
Amanda Hickman

Cross-posted from PBS IdeaLab.

When he’s not toasting escapism, our tireless editor Mark Glaser has been asking why reporting costs so much. I can’t tell you much about investigative reporting (a $400,000 product of which started the conversation), except to say that six-figure salaries do add up. But I can tell you that when it comes to local reporting, improved access to information could make a big dent in the expense of getting a story written.

If you want to take a look at the distribution of discretionary funds by the New York City Council, you have to start with a 400-page PDF full of tables of information. And then you need someone on hand who knows how to pull tables from a PDF into a workable spreadsheet. That, or you need a pencil sharpener and a calculator. And while highlighters and pencil sharpeners are not blowing holes in anyone’s reporting budget, the hours required to process this information certainly are. The situation is absurd: this information started out in a database and there’s no reason that anyone, whether a reporter, civic gadfly or deli manager, should have to jump through hoops to put it back into a database.

Of course, those hoops are just for information the city already makes public. If you want to know where pedestrians are being hit by cars, or how parking placards are distributed in a city where curbside space is valuable and abuse of parking privileges is well documented, you’d better know who has that data and have someone on hand who can write an airtight FOIL (Freedom of Information Law) request. Want to know about the distribution of lead poisoning cases in the city? For that you’ll need lawyers.

FOILs take time, which means money. Lawyers, too, tend to want money for their time. One way to make information cheaper is to step up the data requirements in local transparency laws. New York City is considering legislation that would amend existing public records laws to require that information be made available and that it “be presented and structured in a format that permits automated processing.” That is to say, raw data. Just publish it — don’t make us ask.

With the law itself lingering in committee, the mayor’s office announced a competition, NYC Big Apps, for applications that will use city data. Perhaps the idea is to deflect attention from the bill, which the mayor is no fan of. The contest, which offers a prize that includes dinner with the mayor, is not really a substitute for making data available.

Steve Romalewski, a pioneer of web-based GIS and community mapping projects, is also skeptical of the contest. He notes that it offers no explicit guarantee that any datasets will be fully available for the long haul, and that no one has offered any explanation of why just 80 data sets are included.

Romalewski also rattles off a good list of datasets that are currently only available on a per-request basis — which means, among other things, that you need to know they are there. His list includes the types and locations of small businesses, green spaces, recreational spaces and housing violations, as well as interim multiple dwellings (aka lofts) throughout the city. He also points out that land use data currently must be licensed from the city at a rate of $1,500 per year if you want all five boroughs: not a trivial expense to small projects like Gotham Gazette.

Romalewski argues that we shouldn’t have to ask for data, and that most of what city agencies aggregate belongs in the public domain. I’m with him there, and curious as I am to see what comes out of NYC Big Apps, I’m not convinced that the contest is going to help put city data in the public domain in New York City.

I don’t know whether or not the legislation currently sitting in committee is the answer we need, but I do know that New York City is not alone in needing far better access to the data that civil servants use and aggregate in the course of their work. I also don’t think that simply providing us with the raw data is enough — but at least it’s the bare minimum we need to fill the role of government watchdog.

By the way, if you want that list of under-publicized city data, skip to the comments in Romalewski’s post.

Discuss Improving Access to Information is One Way to Make Reporting Cheaper on PBS’s IdeaLab.