Why Bother with Bulk?

An alternative approach to bulk data, and one that is sometimes mentioned as an equivalent solution, is for government to provide a data application programming interface (API). An API is like a 411 telephone directory service that people can call to ask for specific information about a particular person or business. The directory operator looks up the answer in the telephone book and replies to the caller. In the same way, computers can “call” an API and query it for specific information, in this case, from a government database that is otherwise inaccessible to the public, and the API responds with an answer once it is found. Whether a third-party website uses an API or hosts its own copy of the government data is an architectural question that is not likely to be directly observable by the website’s end users.

APIs can be excellent, disappointing, or anywhere in between, but generally speaking, providing an API does not produce the same transformative value as providing the underlying data in bulk. While APIs can enable some innovative third-party uses of data, they constrain the range of possible outcomes by controlling what kinds of questions can be asked about the data. A very poorly designed API, for example, might not offer access to certain portions of the underlying data because the API builder considered those data columns to be unimportant. A better API might theoretically permit access to all of the data, but may not allow users to get the desired data out efficiently. For instance, an API for local spending might be able to return lists of all projects by industry sector, but might lack the functionality to return a list of all projects funded within a particular zip code, or all projects contracted to a particular group of companies. Because of API design decisions, a user who wants this information would face a difficult task: she would need to find or develop a list of all possible sectors, query the API for each one, and then manually filter the aggregate results by zip code or contractor.
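The workaround described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, sector names, zip codes, and every function name below are hypothetical, and `api_projects_by_sector` is an in-memory stand-in for the one query the imagined local-spending API supports, not a real government interface.

```python
# Hypothetical data: none of these records or names come from a real system.
BULK_RECORDS = [
    {"project": "Bridge repair", "sector": "transportation", "zip": "08540", "contractor": "Acme Paving"},
    {"project": "School roof", "sector": "education", "zip": "08540", "contractor": "Summit Builders"},
    {"project": "Bus depot", "sector": "transportation", "zip": "08608", "contractor": "Acme Paving"},
    {"project": "Clinic wiring", "sector": "health", "zip": "08608", "contractor": "Volt Electric"},
]


def api_projects_by_sector(sector):
    """Stand-in for the only question the imagined API can answer."""
    return [r for r in BULK_RECORDS if r["sector"] == sector]


def projects_in_zip_via_api(zip_code, known_sectors):
    """Workaround: one API call per sector, then filter on the client side."""
    matches = []
    calls = 0
    for sector in known_sectors:
        calls += 1
        for record in api_projects_by_sector(sector):
            if record["zip"] == zip_code:
                matches.append(record)
    return matches, calls


def projects_in_zip_via_bulk(zip_code, records):
    """With a bulk copy, the same question is a single local filter."""
    return [r for r in records if r["zip"] == zip_code]


# The user must first find or develop this list; a missing sector
# silently drops that sector's projects from the API-based answer.
SECTORS = ["transportation", "education", "health"]

via_api, api_calls = projects_in_zip_via_api("08540", SECTORS)
via_bulk = projects_in_zip_via_bulk("08540", BULK_RECORDS)
```

Both paths return the same projects here, but the API path needs one call per sector and is only as complete as the user's list of sectors, while the bulk path asks the question directly.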

APIs and finished, user-facing websites face the same fundamental limit for the same reason: both require a designer to decide on a single monolithic interface for the data. Even with the best of intentions, these top-down technical decisions can only limit how citizens can interact with the underlying data. Past experience shows that, in these situations, interested developers will struggle to reconstruct a complete copy of the underlying data in a machine-readable way, imposing a high cost in terms of human capital and creating a risk of low data quality. The task would be like reconstructing the phone book by calling 411: “First, I want the last names starting with Aa….” Moreover, APIs and websites are likely to be more expensive for government to develop and maintain than simply publishing copies of the raw data and allowing third parties to host mirrors.
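The phone-book analogy can be made concrete with a small sketch. Everything here is assumed for illustration: the toy `DIRECTORY`, the `PAGE_LIMIT` cap on results per call, and the `lookup` function, which stands in for a directory-style service that answers prefix questions and says whether its answer was cut short. The sketch also assumes names contain only lowercase ASCII letters.

```python
import string

# Hypothetical directory hidden behind the service.
DIRECTORY = sorted(["aaron", "abbott", "abel", "adams",
                    "baker", "banks", "barnes", "chen"])
PAGE_LIMIT = 2  # assumed cap on names returned per call


def lookup(prefix):
    """Stand-in for the service: up to PAGE_LIMIT names, plus a truncation flag."""
    hits = [name for name in DIRECTORY if name.startswith(prefix)]
    return hits[:PAGE_LIMIT], len(hits) > PAGE_LIMIT


def reconstruct(prefix=""):
    """Rebuild the full list by asking about ever-longer prefixes."""
    names, truncated = lookup(prefix)
    found, calls = set(names), 1
    if truncated:
        # The answer was cut short, so ask again for each longer prefix.
        for letter in string.ascii_lowercase:
            more, extra = reconstruct(prefix + letter)
            found |= more
            calls += extra
    return found, calls


rebuilt, calls = reconstruct()
```

Under these assumptions the caller does recover every name, but only by placing many more calls than there are entries in the directory, and only if the guess about the alphabet is right. A published bulk copy makes the whole exercise unnecessary.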

If government releases the data first in bulk, citizens will not be restricted to just the approved interfaces. Since APIs, like websites, do serve a useful purpose in efficient data delivery, developers will build, on top of bulk data sets, the APIs that best suit their own needs and those of downstream users. Indeed, a number of nonprofit groups have already built and are now offering public APIs for data the government has published in bulk form. OMB Watch, for example, combines multiple government contract and grant databases into a single “FedSpending” API that other developers use for their own sites. The National Institute on Money in State Politics offers a “Follow the Money” API that provides convenient access to its comprehensive state-level campaign finance data set (see Chapter 19).