Zach Quinn

How to Generate Realistic (Fake) Data for Your Projects Using Mockaroo

Mockaroo, a beginner-friendly UI for creating randomized datasets, enables aspiring professionals to simulate authentic data for analyses and ML models. We’ll use Mockaroo to create a sample dataset for a data engineering project.

Image courtesy of Wikipedia: https://en.wikipedia.org/wiki/Eastern_grey_kangaroo

Disclaimer: I’m not affiliated with Mockaroo; I’m just an admirer of the tool.

Purpose

Mockaroo’s free version allows users to generate up to 1,000 rows of data in CSV, XML, JSON and SQL formats, choosing from 150+ data types. Before I found Mockaroo, I downloaded data from Kaggle. Kaggle provided a range of datasets, but they were preprocessed, which deprives beginning data engineers, scientists and analysts of an authentic development experience. I relied on Mockaroo particularly when I was learning SQL because it was one of the few tools I knew of that could generate and export SQL data.
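To make the SQL use case concrete, here is a minimal sketch of loading a SQL export into an in-memory SQLite database with Python's standard library. The table name, columns and rows below are made up for illustration and are not real Mockaroo output; the sketch also assumes the export contains a create-table statement followed by insert statements.

```python
import sqlite3

# Hypothetical stand-in for a small SQL export. In practice you would read
# the downloaded .sql file from disk instead of using an inline string.
SAMPLE_SQL = """
create table MOCK_DATA (
    id INT,
    first_name VARCHAR(50),
    email VARCHAR(50)
);
insert into MOCK_DATA (id, first_name, email) values (1, 'Ada', 'ada@example.com');
insert into MOCK_DATA (id, first_name, email) values (2, 'Grace', null);
insert into MOCK_DATA (id, first_name, email) values (3, null, 'linus@example.com');
"""


def load_sql_export(sql_text):
    """Run every statement in the export against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


conn = load_sql_export(SAMPLE_SQL)

# Blank values arrive as NULL, so they are matched with IS NULL, not = ''.
total = conn.execute("SELECT COUNT(*) FROM MOCK_DATA").fetchone()[0]
missing_email = conn.execute(
    "SELECT COUNT(*) FROM MOCK_DATA WHERE email IS NULL"
).fetchone()[0]

print(f"{missing_email} of {total} rows have no email")
```

SQLite is used here only because it ships with Python; the same statements could be run against whichever database you are learning on, possibly with small dialect changes.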

How it Works: Mockaroo

The first step to using Mockaroo is to define a schema. Mockaroo supports standard data types such as strings and integers, as well as specialized data types like healthcare codes and cryptocurrency values. One of the best aspects of Mockaroo for data engineers looking for data preprocessing experience is the ability to specify the percentage of blank values in each field. Users can also apply Mockaroo’s conditional syntax to create custom values within the provided fields.

I chose my fields and set the share of blank values to between 5% and 30% per column. This means that when I import the data into Python, the blanks show up as NaN values that I can filter. In SQL, the same blanks would appear as NULL. In addition to standard data types, I also generated a column with empty arrays.
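As a rough sketch of what those blanks look like once loaded, the snippet below reads a tiny CSV with pandas and measures the share of missing values per column. The column names and rows are invented for illustration, and the inline string stands in for a downloaded file.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a downloaded CSV; empty fields
# between commas represent the blank values configured in the schema.
SAMPLE_CSV = """id,first_name,email
1,Ada,ada@example.com
2,,grace@example.com
3,Linus,
4,Margaret,margaret@example.com
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# pandas reads empty fields as NaN; isna().mean() gives the blank share per column.
blank_share = df.isna().mean()
print(blank_share)

# Keep only the rows where both fields are present.
complete_rows = df.dropna(subset=["first_name", "email"])
print(complete_rows)
```

Comparing `blank_share` against the percentages you set in the schema is a quick sanity check that the file loaded the way you expected.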

The final step is to download the file and import it into Python as a dataframe, which you can use to practice cleaning, sorting and aggregating realistic data.
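A minimal sketch of that practice loop with pandas might look like the following. The `department` and `salary` columns and their values are hypothetical; to work with a real download, pass the path of your file to `pd.read_csv` instead of the inline string.

```python
import io

import pandas as pd

# Hypothetical sample with a few blank fields, standing in for a real download.
SAMPLE_CSV = """id,department,salary
1,Sales,52000
2,Engineering,
3,Sales,61000
4,,58000
5,Engineering,75000
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Cleaning: drop rows that are missing a department or a salary.
clean = df.dropna(subset=["department", "salary"])

# Sorting: highest salary first.
ranked = clean.sort_values("salary", ascending=False)

# Aggregation: average salary per department.
avg_salary = clean.groupby("department")["salary"].mean()

print(ranked)
print(avg_salary)
```

Dropping incomplete rows is only one option; filling blanks with a default or a column average is an equally reasonable exercise on the same data.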

Mockaroo API

While Mockaroo’s flagship product is its static data generator, it also allows users to create mock APIs to help developers preview the infrastructure of their APIs prior to deployment. Users can create a dummy URL, path variables, query strings and entity bodies like they would for a real API.
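To show how those pieces fit together from the client side, here is a small sketch that assembles a request with a path variable, a query string and a JSON body using Python's standard library. The base URL, route and field names are placeholders I made up, not Mockaroo endpoints, and the request is only built, never sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Placeholder base URL; substitute the URL of your own mock API.
BASE_URL = "https://example.com/mock"


def build_user_update(user_id, params, body):
    """Assemble a PUT request for a hypothetical /users/<id> route."""
    path = f"/users/{quote(str(user_id))}"   # path variable
    query = urlencode(params)                # query string
    data = json.dumps(body).encode("utf-8")  # entity body
    return Request(
        f"{BASE_URL}{path}?{query}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_user_update(42, {"verbose": "true"}, {"first_name": "Ada"})
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it, which is the point where a mock API lets you test client code before the real service exists.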

Take-Away

Mockaroo and tools like it help beginner data engineers and data science students quickly create data so that they can focus on developing the programmatic, logical and analytic skills necessary to excel in a data career. Although Mockaroo is a good place to quickly generate dummy data or learn about data types, in order to advance in any data discipline, you’ll want to work toward creating datasets through aggregation and web scraping.

Create a job-worthy data portfolio. Learn how with my free project guide.
