Use Public Datasets Cataloged on Data.gov to Power Data Science Projects
Data.gov houses metadata that describes over 280,000 free and public datasets published at the U.S. federal, state, and local government levels. It simplifies the process of finding data and acquiring it through downloads or APIs.
Introduction
Recently, I published an article on how to acquire and analyze data from analytics.usa.gov, which reports on the public’s use of roughly 57,000 U.S. federal government websites. Data.gov, another government site, serves as a public clearinghouse for a vast collection of government datasets of all kinds.
Data.gov contains metadata that describes over 280,000 free public datasets. It catalogs a rich and varied collection of data managed by U.S. federal government entities and, in some cases, by state, local, and tribal governments.
This article describes what Data.gov is, how government entities can publish metadata about their public datasets on the site, the types of data the site catalogs, how to search the data, and how to download datasets from their source entities.
What is Data.gov?
The Technology Transformation Services (TTS) department within the U.S. General Services Administration (GSA) manages Data.gov. The service was established in 2009. To date, TTS has collected, documented, and published metadata for 280,518 datasets.
The following statements guide the work on Data.gov:
Mission: Design and deliver a digital government with and for the American public.
Vision: Trusted modern government experiences for all.
TTS built Data.gov with CKAN and WordPress. It develops its code publicly on GitHub.
Datasets indexed on Data.gov follow the DCAT-US Schema v1.1 guidelines. This schema applies a consistent set of metadata fields (Title, Description, Tags, Publisher, and so on) to every dataset, which makes the datasets easier to discover and understand.
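To make the idea of a consistent metadata set concrete, here is a minimal Python sketch that checks a metadata record for missing fields. The field names reflect my understanding of the required dataset-level fields in DCAT-US v1.1 (where tags are stored under keyword); treat the list as an assumption and confirm it against the schema. The sample record is hypothetical and does not describe a real dataset.

```python
# Required dataset-level fields as I understand DCAT-US v1.1.
# This list is an assumption; verify it against the published schema.
REQUIRED_FIELDS = (
    "title",
    "description",
    "keyword",
    "modified",
    "publisher",
    "contactPoint",
    "identifier",
    "accessLevel",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical record, invented for illustration only.
sample = {
    "title": "Example Fish Stocking Report",
    "description": "Illustrative record; not a real dataset.",
    "keyword": ["fishing", "example"],
    "publisher": {"name": "Example Agency"},
}

print(missing_fields(sample))
# ['modified', 'contactPoint', 'identifier', 'accessLevel']
```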
How do Government Entities Add Datasets to Data.gov?
The OPEN Government Data Act, part of the Foundations for Evidence-Based Policymaking Act of the U.S. Congress, requires the federal government to make its data available to the public in open and machine-readable form. At the same time, it must ensure privacy and security.
How-to guides show government entities how to publish metadata that describes their public datasets on the site. Consistent metadata improves the discoverability and impact of those datasets.
Data.gov is primarily a federal government site and service. But state, local, and tribal governments can publish metadata to describe their public datasets on the platform as well.
What Types of Data does Data.gov Store?
As mentioned above, Data.gov stores metadata that describes datasets hosted elsewhere. It does not hold the data itself. Instead, it stores and displays links to the downloadable files and APIs through which the data can be acquired.
With over 280,000 datasets, the data indexed on Data.gov is too vast to describe concisely. Here is a small sample of available data types:
- Gross domestic product
- Climate and weather
- Tax revenues, rates, and refunds
- Geospatial data
- Housing statistics
- Oceans
- Census and population
- Wind and solar power
- Education
- Birth and death rates
How can I Find Data on Data.gov?
The Data.gov data catalog makes it easy to search for datasets. The screenshot below provides tips to find datasets using the data catalog search page.
How can I Acquire Data and API Information from Data.gov?
To access data and APIs, click the data source links in the search results.
Download Data
As shown in the screenshot below, searching for the keyword fishing on the data catalog search page returns 14,230 datasets. The first two datasets contain data about fish stocking and fishing facilities in North Dakota.
In this example, clicking a data source link takes you either to a website hosted by North Dakota or directly to a dataset file.
Clicking on CSV downloads a CSV file. When opened in Excel, the file shows data about fish that the state stocked in its lakes.
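If you would rather load a CSV file in a program than open it in Excel, a sketch like the following may help. It uses only the Python standard library. The function names are my own, and the URL you pass to fetch_text is whatever CSV link the dataset page gives you; none is hard-coded here. The column name in the usage comment is hypothetical, since the actual columns depend on the file you download.

```python
import csv
import io
import urllib.request


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a text file and decode it as UTF-8, dropping any BOM."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8-sig")


def parse_csv(text: str) -> list[dict]:
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by(rows: list[dict], column: str) -> dict:
    """Count how many rows share each value in the given column."""
    counts: dict = {}
    for row in rows:
        key = row.get(column, "")
        counts[key] = counts.get(key, 0) + 1
    return counts


# Usage (the URL and column name are placeholders you supply):
#   rows = parse_csv(fetch_text(csv_url))
#   print(count_by(rows, "Species"))
```

Separating the download from the parsing keeps the parsing step easy to try on a small in-memory string before pointing it at a real file.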
API Information
Metadata API
Data.gov manages metadata about datasets, not the raw data. While the search facility makes it easy to find datasets of interest, the CKAN API and the CSW endpoint can be used in programs to query datasets and retrieve metadata. See Data Harvesting for more information.
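As a rough illustration of querying metadata programmatically, the sketch below calls CKAN's package_search action with the Python standard library. The base URL is my assumption about where the Data.gov catalog exposes the CKAN action API; confirm it (and whether an API key is required) before relying on it. The helper names are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Data.gov CKAN action API; confirm before use.
CKAN_ACTION_URL = "https://catalog.data.gov/api/3/action"


def build_search_url(query: str, rows: int = 5, base: str = CKAN_ACTION_URL) -> str:
    """Build a package_search URL for a keyword query."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{base}/package_search?{params}"


def summarize(response: dict) -> tuple:
    """Return (total match count, list of dataset titles) from a response."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    result = response["result"]
    titles = [pkg.get("title", "") for pkg in result.get("results", [])]
    return result.get("count", 0), titles


def search(query: str, rows: int = 5) -> tuple:
    """Run the search over the network and summarize the JSON reply."""
    with urllib.request.urlopen(build_search_url(query, rows), timeout=30) as resp:
        return summarize(json.load(resp))


# Usage (performs a network request):
#   count, titles = search("fishing", rows=2)
```

CKAN wraps each reply in an object with success and result keys, so summarize checks success before reading the count and the list of matching datasets.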
Dataset APIs
Most datasets are available as downloadable files. Fewer datasets have APIs that can be used to access data. To find the APIs or information about them, search the Data.gov dataset catalog with API selected in the Formats filter section.
Clicking the API link for the Great Smoky Mountains National Park Fish Distribution (2014) dataset opens a new browser tab that displays a page with information about accessing the data.
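The same Formats filter can be approximated in code once you have a dataset's metadata record. The sketch below assumes the record is a CKAN package dictionary whose resources list carries name, format, and url fields, which is how CKAN normally describes a dataset's files and links. The function name and the sample record are hypothetical.

```python
def resources_with_format(package: dict, wanted: str = "API") -> list:
    """Return (name, url) pairs for resources whose format matches `wanted`."""
    wanted = wanted.strip().lower()
    return [
        (res.get("name") or "", res.get("url") or "")
        for res in package.get("resources", [])
        if (res.get("format") or "").strip().lower() == wanted
    ]


# Hypothetical package record, invented for illustration only.
example_package = {
    "title": "Example Dataset",
    "resources": [
        {"name": "Data file", "format": "CSV", "url": "https://example.com/data.csv"},
        {"name": "Service endpoint", "format": "API", "url": "https://example.com/api"},
    ],
}

print(resources_with_format(example_package))
# [('Service endpoint', 'https://example.com/api')]
```

The comparison ignores case and surrounding whitespace because publishers do not always enter format labels consistently.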
Upcoming Articles about Data.gov
Watch for upcoming articles about Data.gov such as these:
- Write a program that uses the CKAN API to access dataset metadata.
- Explore a variety of types of datasets.
- Explore data tools to support the work of data practitioners.
- Highlight interesting or unusual datasets.
Conclusion
Data.gov provides easy-to-find metadata about datasets managed by all levels of government within the United States. The platform simplifies the process of finding and acquiring data through downloads or APIs.
Data.gov is an example of government done well.
About the Author
Randy Runtsch is a writer, data engineer, data analyst, programmer, photographer, cyclist, and adventurer. He and his wife live in southeastern Minnesota, U.S.A.
Watch for Randy’s upcoming articles on public datasets to drive data analytics insights and decision-making, programming, data analytics, photography, bicycle touring, and more. You can see some of his photographs at shootproof.com and shutterstock.com.
