Chapter 19
IN THIS CHAPTER
Accessing open data from government sources
Getting data from the World Bank
Looking at private sources of open data
Exploring the world of open spatial data
Open data is part of a larger trend toward a less restrictive, more open understanding of the idea of intellectual property, a trend that’s been gaining tremendous popularity over the past decade. Think of open data as data that has been made publicly available and is permitted to be used, reused, built on, and shared with others. Maybe you’ve heard of open-source software, open hardware, open-content creative work, open access to scientific journals, and open science. Along with open data, they’re all part of the aptly named open movement — a movement committed to the notion that content (including raw data from experiments) should be shared freely.
The distinguishing feature of open licenses is that they have copyleft instead of copyright. With copyleft, the only restriction is that the source of the work must be identified, sometimes with the caveat that derivative works can’t be copyrighted with a more restrictive license than the original. If the second condition is in force, successfully commercializing the work itself becomes difficult, although people often find plenty of other indirect, creative avenues of commercialization.
People often confuse open licenses with Creative Commons licenses. Creative Commons is a not-for-profit organization that’s dedicated to encouraging and spreading creative works by offering a legal framework through which usage permissions can be granted and obtained so that sharing parties are safe from legal risks when building on and using work and knowledge that’s been openly shared. Some Creative Commons licenses are open, and some explicitly forbid derivative works and/or commercialization.
As part of more recent open government initiatives, governments around the world began releasing open government data. Governments generally provide this data so that it can be used by volunteer analysts and civic hackers — programmers who work collaboratively to build open-source solutions that use open data to solve social problems — in an effort to benefit society at large. In 2013, the G8 nations (France, the United States, the United Kingdom, Russia, Germany, Japan, Italy, and Canada) signed a charter committing themselves to open data, prioritizing the areas of national statistics, election results, government budgets, and national maps.
The open government movement promotes government transparency and accountability, nurtures a well-informed electorate, and encourages public engagement. To put it in computing terms, open government facilitates a read/write relationship between a government and its citizenry.
The Data.gov program (at www.data.gov
) was started by the administration of former US president Barrack Obama to provide open access to nonclassified US government data. Data.gov data is being produced by all departments in the executive branch — the White House and all cabinet-level departments — as well as datasets from other levels of government. By mid-2014, you could search for over 100,000 datasets by using the Data.gov search. The website is an unparalleled resource if you’re looking for US-government-derived data on the following indicators:
Datasets aren’t the only things that are open on Data.gov. You can also find over 60 open-source application programming interfaces (APIs) available on the platform. You can use these APIs to create tools and apps that pull data from government departments listed in the Data.gov data catalog. The catalog itself uses the popular open-source CKAN API. (CKAN here is short for Comprehensive Knowledge Archive Network.) Even the code used to generate the Data.gov website is open source and is published on GitHub (at https://github.com/ckan/ckan
), in case you’re interested in digging into that.
For many decades, Canada has been a world leader for its data collection and publication practices. Both The Economist and the Public Policy Forum have repeatedly named Statistics Canada — Canada’s federal department for statistics — as the best statistical organization in the world.
If you take a look at the Canada Open Data website (http://open.canada.ca
), the nation’s strong commitment to data is overwhelmingly evident. At this site, you can find over 200,000 datasets. Among the 25 most popular offerings on the site are datasets that cover the following indicators:
The United Kingdom got off to a late start in the open government movement. Data.gov.uk (http://data.gov.uk
) was started in 2010, and by mid-2014, only about 20,000 datasets were yet available. Like Data.gov (discussed in the section “Digging through Data.gov,” earlier in this chapter), data.gov.uk is also powered by the CKAN data catalog.
Although data.gov.uk is still playing catch-up, it has an impressive collection of ordnance survey maps old enough — 50 years or more — to be out of copyright. If you’re looking for world-renowned, free-to-use survey maps, data.gov.uk is an incredible place for you to explore. Beyond its stellar survey maps, data.gov.uk is a useful source for data on the following indicators:
Interestingly, the dataset most frequently downloaded from data.gov.uk is a dataset that covers the Bona Vacantia division — the government division charged with tracking the complicated processes involved in determining the proper inheritance of British estates.
The US Census is held every ten years, and since 2010, the data has been made freely available at www.census.gov
. Statistics are available down to the level of the census block — which aggregates by 30-person counts, on average. The demographics data provided by the US Census Bureau can be extremely helpful if you’re doing marketing or advertising research and need to target your audience according to the following classifications:
In addition to its census counts on people in the United States, the bureau conducts a census of businesses. You can use this business census data as a source for practical industry research to tell you information such as the number of businesses, the number of employees, and the size of payroll per industry per state or metropolitan area.
Lastly, the US Census Bureau carries out an annual American Community Survey to track demographics with a statistically representative sample of the population during non-census years. You can check this data if you need specific information about what has happened during a particular year or set of years.
With respect to features and functionality, the US Census Bureau has a lot to offer. You can use QuickFacts (www.census.gov/programs-surveys/sis/resources/data-tools/quickfacts.html
) to quickly source and pull government data from the US federal, state, county, or municipal level.
Since its inception in 1958, NASA has made public all its nonclassified project data. It has been in the open data game so long that NASA has tons of data! NASA datasets have been growing even faster with recent improvements in satellite and communication technology. In fact, NASA now generates 4 terabytes of new earth-science data per day — that’s equivalent to over a million MP3 files. Many of NASA’s projects have accumulated data into the petabyte range.
NASA’s open data portal is called DATA.NASA.GOV (http://data.nasa.gov
). This portal is a source of all kinds of wonderful data, including data about
Some examples from its hundreds of datasets are detailed data on the color of Earth’s oceans, a database of every lunar sample and where it’s stored, and the Great Images in NASA (GRIN) collection of historically significant photographs.
The World Bank is an international financial institution that provides loans to developing countries to pay for capital investment that will lead (one hopes) to poverty reduction and some surplus so that the recipient nations can repay the loan amounts over time. It provides loans to developing countries to pay for capital investment that will lead (one hopes) to poverty reduction and some surplus so that the recipient nations can repay the loan amounts over time. Because World Bank officers need to make well-informed decisions about which countries would be more likely to repay their loans, they’ve gathered an enormous amount of data on member nations. They’ve made this data available to the public at the World Bank Open Data page (http://data.worldbank.org
).
If you’re looking for data to buttress your argument in a truly interesting data-journalism piece that’s supported by global statistics, the World Bank should be your go-to source. No matter the scope of your project, if you need data about what’s happening in developing nations, the World Bank is the place to go. You can use the website to download entire datasets or simply view the data visualizations online. You can also use the World Bank’s Open Data API to access what you need.
World Bank Open Data supplies data on the following indicators (and many, many more):
World Bank Data also includes microdata — sample surveys of households and businesses in developing countries. You can use microdata to explore variations in your datasets.
Knoema (pronounced “no-mah”) purports to be the largest repository of public data on the web. The Knoema platform houses a staggering 500+ databases, in addition to its 150 million time series — 150 million collections of data on attribute values over time, in other words. Knoema includes, but isn’t limited to, all these data sources:
Knoema is an outstanding resource if you’re looking for international data on agriculture, crime statistics, demographics, economy, education, energy, environment, food security, foreign trade, health, land use, national defense, poverty, research and development, telecommunications, tourism, transportation, or water.
In addition to being an incredible data source, Knoema is a multifaceted tasking platform. You can use the Knoema platform to make dashboards that automatically track all your favorite datasets. You can use the platform’s data visualization tools to quickly and easily see your data in a tabular or map format. You can use the Knoema Data Atlas (http://knoema.com/atlas
) to drill down among categories and/or geographic regions and quickly access the specific datasets you need. As an individual, you can upload your own data and use Knoema as a free hosting service.
Quandl (www.quandl.com
) is a Toronto-based website that aims to be a search engine for numeric data. Unlike most search engines, however, its database isn’t automatically generated by spiders that crawl the web. Rather, it focuses on linked data that’s updated via crowdsourcing — updated manually via human curators, in other words.
Because most financial data is in numeric format, Quandl is an excellent tool for staying up-to-date on the latest business informatics. As you can see in Figure 19-2, a search for Apple returns over 4,700 datasets from 11 different sources with time series at the daily, weekly, monthly, quarterly, or annual level. Many of these results are related to the United Nations’ agricultural data. If you’re looking for data on Apple, you can narrow the scope of your search by replacing the Apple search term with the company’s stock abbreviation, AAPL.
The Quandl database includes links to over 10 million datasets (although it uses a generous metric in declaring what distinguishes one dataset from another). Quandl links to 2.1 million UN datasets and many other sources, including datasets in the Open Financial Data Project, the central banks, real estate organizations, and well-known think tanks.
You can browse Quandl data and get instant charts based on what you find. If you sign up for a free registered account, you can download as much data as you want or use the Quandl application programming interface (API). The Quandl Application Programming Interface (API) includes wrappers to accommodate platforms such as Java, Python, Julia, R, MATLAB, Excel, and Stata, among others. (In case you’re new to wrappers, think of them as a collection of functions that make it easier to interact with an API. They're language-specific and are sometimes called “wrappers” because they wrap the API calls into easy-to-use functions.)
Modeled after GitHub — the cloud-hosted platform across which programmers can collaboratively share and review code — Exversion aims to provide the same collaborative functionality around data that GitHub provides around code. The Exversion platform offers version control functionality and hosting services for uploading and sharing your data. To illustrate how Exversion works, imagine a platform that would allow you to first fork (or copy) a dataset and then make the changes you want. Exversion would be there to keep track of what has changed from the original set and every change that you make to it. Exversion also allows users to rate, review, and comment on datasets.
Datasets hosted on the Exversion platform are either provided by a user or created by a spider that crawls and indexes open data to make it searchable from a single application programming interface (API). As with GitHub, with a free user account, all the data you upload to Exversion is public. If you’re willing to pay for an account, you can create your own, private data repositories. Also, with the paid account, you gain the option to share your data with selected users for collaborative projects.
Exversion is extremely useful in the data-cleanup stage. Most developers are familiar with data-cleanup hassles. Imagine that you want to use a particular dataset, but in order to do so, you must insert tabs in all the right places to make the columns line up correctly. Meanwhile, the other 100 developers out there working with that dataset are doing exactly the same thing. In contrast, if you download, clean, and then upload the data to Exversion, other developers can use it and don’t have to spend their time doing the same work later. In this way, everyone can benefit from each other’s work, and each individual person can spend more time analyzing data and less time cleaning it.
OpenStreetMap (OSM) is an open, crowd-sourced alternative to commercial mapping products such as Google Maps and ESRI ArcGIS Online. In OSM, users create, upload, or digitize geographic data into the central repository.
The data in OSM isn’t stored as maps, but rather as geometric and text representations — points, lines, polygons, and map annotation — so all of OSM’s data can be quickly downloaded from the website and easily assembled into a cartographic representation via a desktop application.