Marco Peixeiro

Summary

The web content provides an overview of six reputable sources for obtaining free real-life time series data for data science practice.

Abstract

The article "Top 6 Sources of Free Real-Life Time Series Data" outlines valuable resources for data scientists to enhance their skills in time series analysis. It emphasizes the importance of working with real-life datasets to gain experience with practical challenges such as missing values and noise. The sources listed include Statistics Canada, NYC Open Data, Monash Time Series Forecasting Repository, Papers With Code, Numenta Anomaly Benchmark, and the UCI Machine Learning Repository. These platforms offer a diverse range of datasets across various domains and frequencies, from daily to annual, and even include data from past forecasting competitions. The article also reminds users to properly cite the datasets if used in research or publications and provides links to further reading and examples of how to work with the data.

Opinions

  • The author advocates for the use of real-life datasets to better prepare for the complexities of real-world data science challenges.
  • They highlight the convenience of dataset customization available on Statistics Canada's platform, such as adding columns or pivoting tables before downloading.
  • The author points out the user-friendly aspect of NYC Open Data's curated popular datasets, which can help users navigate the database more efficiently.
  • The Monash Time Series Forecasting Repository is noted for its comprehensive list of datasets, including versions with and without missing values, and for providing benchmarks for forecasting models.
  • Papers With Code is recognized for its unique offering of research papers alongside code implementations, although the author notes it may be less intuitive for those accustomed to direct CSV downloads.
  • The Numenta Anomaly Benchmark is acknowledged for its benchmarks in anomaly detection and for providing both real-life and simulated data, though it requires combining separate files for complete datasets.
  • The UCI Machine Learning Repository is praised as a popular and extensive source for time series datasets, with various filtering options for task, domain, and dataset characteristics.
  • The author encourages readers to follow their work for more insights on time series data and to connect on LinkedIn, indicating a commitment to community engagement and knowledge sharing.
  • A call for support through "Buy me a coffee" suggests the author values community feedback and encouragement, fostering a collaborative environment for learning and improvement.

Top 6 Sources of Free Real-Life Time Series Data

Use these sources of open real-life datasets to practice your skills in forecasting, classifying or detecting anomalies in time series.

Photo by Malvestida on Unsplash

Practice makes perfect, and for data scientists working with time series data, that means that you first need access to data.

Here, we list the top 6 sources of openly available time series data, so that you can practice forecasting, classifying, or detecting anomalies.

We emphasize real-life datasets here, so that we get used to working with missing values and noisy data.

Of course, while the datasets are free and open for anyone, make sure to properly cite them if you use them in a research paper or blog article.

Learn the latest time series analysis techniques with my free time series cheat sheet in Python! Get the implementation of statistical and deep learning techniques, all in Python and TensorFlow!

Let’s get started!

1. Statistics Canada

Statistics Canada is Canada's national statistics office. It compiles data on a wide range of subjects, from census data to the agricultural, economic and social aspects of Canada.

We can search datasets by keyword, or filter by frequency from daily to annual. Lower-frequency data is also available, such as series published every 2 years, every 3 years, or only occasionally.

Datasets can be downloaded as CSV files, and you can also customize a dataset before downloading it, for example by adding columns or pivoting the table.
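Once a CSV is downloaded, turning it into a time series usually takes only a few lines of pandas. The sketch below is a minimal example under assumptions: the column names REF_DATE, GEO and VALUE are my assumption about the file layout, and the numbers are made-up placeholders, not real statistics. Check the header of your own file and adjust accordingly.

```python
import io

import pandas as pd

# Stand-in for a downloaded file. The column names (REF_DATE, GEO, VALUE)
# are an assumption about the CSV layout, and the values are placeholders.
raw_csv = """REF_DATE,GEO,VALUE
2020-01,Canada,100.0
2020-04,Canada,101.5
2020-07,Canada,102.0
2020-10,Canada,103.5
"""

# With a real download, pass the file path to read_csv instead of StringIO.
df = pd.read_csv(io.StringIO(raw_csv))

# Parse the reference period into real timestamps.
df["REF_DATE"] = pd.to_datetime(df["REF_DATE"], format="%Y-%m")

# Keep one geography and index the values by date to get a time series.
series = (
    df[df["GEO"] == "Canada"]
    .set_index("REF_DATE")["VALUE"]
    .sort_index()
)

print(series)
```

From here, the series can be plotted, resampled, or passed to a forecasting model.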

If you want an example of what you can do with these datasets, check out my article on deploying a population forecasting model.

Website: Statistics Canada

2. NYC Open Data

As the name suggests, this website compiles free public data published by New York City agencies and other partners.

This is similar to Statistics Canada, in the sense that you have data in the fields of health, education, environment, and more.

They also curate the most popular datasets, which can be a great starting point if you feel overwhelmed by searching their entire database.

While you cannot filter by frequency, searching the database using keywords like “monthly” or “daily” will help you find time series data quickly.
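Since a keyword like "monthly" in a title is only a hint, it can be worth checking the actual frequency of a dataset after loading it. Here is a small sketch using pandas' infer_freq; the helper name describe_frequency is hypothetical, and the timestamps are generated for illustration rather than taken from any NYC dataset.

```python
import pandas as pd


def describe_frequency(timestamps):
    """Return the pandas frequency alias of the timestamps, or None.

    None means the timestamps are too few or too irregular for pandas
    to recognize a single frequency.
    """
    index = pd.DatetimeIndex(pd.to_datetime(timestamps)).sort_values()
    # infer_freq needs at least three timestamps to work with.
    if len(index) < 3:
        return None
    return pd.infer_freq(index)


# Illustrative timestamps only, not taken from a real dataset.
daily = pd.date_range("2023-01-01", periods=10, freq="D")
irregular = ["2023-01-01", "2023-01-02", "2023-02-05", "2023-04-11", "2023-04-20"]

print(describe_frequency(daily))
print(describe_frequency(irregular))
```

A result of None is a good signal that the data needs to be resampled or cleaned before forecasting.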

Website: NYC Open Data

3. Monash Time Series Forecasting Repository

The people behind this repository aim to create a comprehensive list of time series datasets for forecasting to facilitate the evaluation of forecasting models.

It contains a list of 30 datasets, both publicly available and curated by their team.

Datasets come in different versions depending on the frequency, and they also have versions with and without missing values, bringing the total number of datasets to 58.

The repository covers both real-world and competition datasets from different domains. For example, you can find the data used in past M forecasting competitions.
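If you pick a version with missing values, a natural first exercise is to count the gaps and fill them yourself. The sketch below uses time-based linear interpolation in pandas on a toy series. This is one simple option that I chose for illustration, not the method used by the repository's authors, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy daily series with gaps; the values are made up for illustration.
index = pd.date_range("2023-01-01", periods=6, freq="D")
series = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0], index=index)

# Count the missing observations before deciding how to handle them.
n_missing = int(series.isna().sum())

# Fill the gaps by interpolating linearly along the time axis.
filled = series.interpolate(method="time")

print(f"Missing values: {n_missing}")
print(filled)
```

Comparing your filled series against the repository's version without missing values is a good way to see how much the imputation choice matters.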

For all the details on each dataset, make sure to read their paper.

They have also included the performance of various models on all their datasets, which can help you discover forecasting techniques and check whether you can reproduce the results.

Website: Monash Forecasting Repository

4. Papers With Code

Papers With Code is a website where we can consult research papers along with the code implementation of the paper.

They also have a section listing all the datasets used in the papers, including time series data. We can filter by task, depending on whether we want to work on forecasting, classification or anomaly detection.

The only drawback in my opinion is that it’s not as intuitive as other websites to use the data, because we often need to use data loaders instead of just downloading a CSV.

Website: Papers With Code

5. Numenta Anomaly Benchmark

The Numenta Anomaly Benchmark repository contains the scripts and datasets that set benchmarks for anomaly detection in time series.

The repository has both real-life and simulated data for anomaly detection, and you can perform either point-wise anomaly detection (finding points in time that are anomalous), or pattern-wise anomaly detection (finding sequences in time that are anomalous).

Just note that the datasets and the labels are in separate files, so you have to combine information from two different files to have a complete dataset.
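Combining the two files can be sketched as follows. This is a minimal example under assumptions: I assume the data file is a CSV with timestamp and value columns, and that the labels file is a JSON object mapping each data file's path to a list of anomalous timestamps. The key example/series.csv and all values are hypothetical, so check the repository for the actual layout.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for the two files. The layout (timestamp/value
# columns, JSON mapping a file path to anomalous timestamps) is an assumption.
data_csv = """timestamp,value
2014-07-01 00:00:00,10
2014-07-01 01:00:00,95
2014-07-01 02:00:00,11
2014-07-01 03:00:00,12
"""
labels_json = '{"example/series.csv": ["2014-07-01 01:00:00"]}'

# With real files, use the file paths instead of the in-memory strings.
df = pd.read_csv(io.StringIO(data_csv), parse_dates=["timestamp"])
labels = json.loads(labels_json)

# Look up the labels for this series and parse them as timestamps.
anomaly_times = pd.to_datetime(labels["example/series.csv"])

# Flag each row whose timestamp appears in the list of labeled anomalies.
df["is_anomaly"] = df["timestamp"].isin(anomaly_times).astype(int)

print(df)
```

The result is a single table with the values and a 0/1 label per timestamp, which is the shape most anomaly detection workflows expect.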

If you want an example of how to work with their data, check out my guide on anomaly detection in time series.

Website: Numenta Anomaly Benchmark

6. UCI Machine Learning Repository

Of course, the UCI Machine Learning Repository makes it onto this list, as it is probably one of the most popular data sources for practising our data science skills.

At the time of writing, it contains 126 time series datasets, and you can filter by task (like classification or regression), by domain, and also by number of attributes and instances.

Website: UCI Machine Learning Repository

Conclusion

There you have it, a list of my favourite places to get open real-life time series data. We can only get better by practising and facing new situations, and I hope that these sources will help you do that!

For anything related to time series, make sure to follow me as I publish many articles related to working with time series data.

We can also keep in touch on LinkedIn!

Cheers! 🍻

Support me

Enjoying my work? Show your support with Buy me a coffee, a simple way for you to encourage me, and I get to enjoy a cup of coffee! If you feel like it, just click the button below 👇
