Kurt Klingensmith


How to Create Synthetic Data

Go from nothing to a complete dataframe with Python

Photo by Joshua Sortino on Unsplash.

After submitting a recent article to Towards Data Science’s editorial team, I received a message back with a simple inquiry: are the datasets licensed for commercial use? It was a great question — the datasets in my draft came from Seaborn, a common Python library that comes complete with 17 sample datasets [1]. The datasets certainly seemed open source and, sure enough, many had easily discoverable licenses authorizing commercial use. Unfortunately for me, I happened to pick one of the few datasets that I couldn’t find a license for. But instead of switching to a different Seaborn dataset, I decided to make my own Synthetic Data.

What is Synthetic Data?

IBM’s Kim Martineau defines Synthetic Data as “information that’s been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias” [2].

Synthetic Data may look like information from a real-world event, but it’s not. This avoids licensing issues, hides proprietary data, and protects personal information.

Synthetic Data differs from anonymized or masked data, which takes real data from actual events and alters certain fields to make the data non-attributional. If you’re looking to anonymize names in real data, you can read a how-to on name anonymization here.

Synthetic Data does not need to be perfect. In my previous article’s use case, I was writing a guide on how to use the Python GroupBy() function. All I needed was a dataset that had numeric data, categorical data, and a domain (in this case, student test scores and grades) understandable to the reader to help me deliver the message. Based on the work for that article, below I’ll provide a guide on building a Synthetic Dataset of your own.

Code:

The Jupyter notebook with full Python code used in this walkthrough is available at the linked GitHub page. Download or clone the repository to follow along!

The code requires the following libraries:

# Data Handling
import pandas as pd
import numpy as np

# Data visualization
import plotly.express as px

# Anonymizer:
from faker import Faker

1. Building the Student Test Score Data Frame

Before getting to the code, let’s apply some domain knowledge to what student test score data might involve:

  • Student Information
  • Class Data
  • Test Scores
  • Overall Scores and Grades

Let’s start by creating a student body for our Synthetic Data. Using the Faker library, the following code generates our student names [3]:

from faker import Faker
faker = Faker()

# Create a list of Students:
fake_names = [faker.name() for _ in range(100)]
df = pd.DataFrame(fake_names, columns=['Student'])

# Add age:
df['Age'] = np.random.randint(20, 25, df.shape[0])

df.head()

The result is 100 rows of random student names and ages. The ages are random integers from 20 to 24 picked via the numpy random.randint() function [4] — note that randint()'s upper bound is exclusive, so passing 25 yields a maximum age of 24. The result is:

Screenshot by author.

One hundred students is a lot for one class; let’s split this class into three sessions: morning, afternoon, and evening. The following code uses numpy’s random.choice() function to randomly assign students to the three class sessions [5]:

# Define Class Session:
testTime = ['Morning', 'Afternoon', 'Evening']
df['ClassSession'] = np.random.choice(testTime, size=len(df))

df.head()
Screenshot by author.
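As written, random.choice() assigns the three sessions with equal probability, so each session ends up with roughly a third of the students. If you want uneven session sizes, random.choice() also accepts a p parameter of probabilities — a sketch with illustrative weights (not from the article's notebook):

```python
import numpy as np

testTime = ['Morning', 'Afternoon', 'Evening']

# Weighted assignment: ~50% Morning, ~30% Afternoon, ~20% Evening
sessions = np.random.choice(testTime, size=100, p=[0.5, 0.3, 0.2])
```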

We now have a basic dataframe for our students, their ages, and their class session. Let’s add some test scores next!

2. Generating Test Scores

Let’s assume the test scores generally follow a normal distribution. We can generate test scores using the numpy random.normal() function [6] and numpy clip() [7] as shown in the following code:

# Morning Score Test 1:

low = 0
high = 100

mean = 85
scale = 8
size = len(df[df['ClassSession'] == 'Morning'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Morning', 'Test1'] = np.clip(values, low, high).astype(int)

Numpy’s random.normal() takes loc (the mean), scale (the standard deviation), and size as inputs. For this example, I’ve chosen a mean score of 85, and the size is defined as the number of students in the morning class session. The scale, or standard deviation, defines the spread of the data.

The values returned by numpy’s random.normal() function are then clipped via the clip() function, which clips “values outside the interval …to the interval edges” [8]. For example, suppose a value returned from random.normal() is 103; numpy clip() will change that value to 100 based on the set parameters of 0 (low) and 100 (high), which is our allowed score range for each test.
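The clipping behavior is easy to verify in isolation:

```python
import numpy as np

# Values outside [0, 100] are pulled to the interval edges;
# values inside the interval pass through unchanged:
raw = np.array([103.2, 95.0, -4.1])
clipped = np.clip(raw, 0, 100)
```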

Adding .astype(int) truncates the float output of random.normal() to integers. The values are written to a new column, Test1, for Morning session students. Let’s repeat this two more times, giving the morning session a total of three tests:

# Morning Score Test 2:

mean = 74
scale = 7
size = len(df[df['ClassSession'] == 'Morning'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Morning', 'Test2'] = np.clip(values, low, high).astype(int)

# Morning Score Test 3:

mean = 89
scale = 5
size = len(df[df['ClassSession'] == 'Morning'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Morning', 'Test3'] = np.clip(values, low, high).astype(int)

To illustrate how selecting different means and scales impacts the distribution of test scores, run the following code:

# Show example of Morning Session test score distribution: 
morningScores = df[df['ClassSession'] == 'Morning'][['Test1', 'Test2', 'Test3']]

morningScores.plot.kde(figsize=(15,9));

The output is:

Screenshot by author.

The above Kernel Density Estimate (KDE) plot quickly shows the effects our various inputs for mean and scale had on the numpy random.normal() outputs. Note how a smaller scale for Test3 results in a tighter distribution of scores.

Let’s repeat this process for the afternoon class:

# Afternoon Score Test 1:

low = 0
high = 100

mean = 78
scale = 5
size = len(df[df['ClassSession'] == 'Afternoon'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Afternoon', 'Test1'] = np.clip(values, low, high).astype(int)

# Afternoon Score Test 2:

mean = 71
scale = 9
size = len(df[df['ClassSession'] == 'Afternoon'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Afternoon', 'Test2'] = np.clip(values, low, high).astype(int)

# Afternoon Score Test 3:

mean = 85
scale = 8
size = len(df[df['ClassSession'] == 'Afternoon'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Afternoon', 'Test3'] = np.clip(values, low, high).astype(int)

And let’s add the evening scores:

# Evening Score Test 1:

low = 0
high = 100

mean = 74
scale = 7
size = len(df[df['ClassSession'] == 'Evening'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Evening', 'Test1'] = np.clip(values, low, high).astype(int)

# Evening Score Test 2:

mean = 70
scale = 6
size = len(df[df['ClassSession'] == 'Evening'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Evening', 'Test2'] = np.clip(values, low, high).astype(int)

# Evening Score Test 3:

mean = 81
scale = 9
size = len(df[df['ClassSession'] == 'Evening'])

values = np.random.normal(mean, scale, size)
df.loc[df['ClassSession'] == 'Evening', 'Test3'] = np.clip(values, low, high).astype(int)

We now have a dataframe consisting of Student names, Ages, Class Sessions, and normally distributed test scores from three different tests, with each class session and test varying slightly in their performance. The resulting dataframe appears as follows:

Screenshot by author.
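The nine nearly identical blocks above can also be collapsed into a small helper function — a sketch under the same assumptions (the name add_test_scores is mine, not from the article's notebook):

```python
import numpy as np
import pandas as pd

def add_test_scores(df, session, test, mean, scale, low=0, high=100):
    """Fill one test column for one class session with clipped,
    normally distributed integer scores."""
    mask = df['ClassSession'] == session
    values = np.random.normal(mean, scale, mask.sum())
    df.loc[mask, test] = np.clip(values, low, high).astype(int)

# Each block above then becomes a one-liner, e.g.:
# add_test_scores(df, 'Morning', 'Test1', mean=85, scale=8)
```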

2.1. Custom Test Scores

Let’s add some custom test scores to our data to see if we can generate some outliers. Overwriting some student scores is easy:

# Add a top performer to the Evening Class:
idx = df.index[df['ClassSession'] == 'Evening'][0]
df.loc[idx, ['Test1', 'Test2', 'Test3']] = [88, 90, 95]

# Add a poor performer to the Morning Class:
idx = df.index[df['ClassSession'] == 'Morning'][0]
df.loc[idx, ['Test1', 'Test2', 'Test3']] = [61, 55, 57]

# Add a poor performer to the Afternoon Class:
idx = df.index[df['ClassSession'] == 'Afternoon'][0]
df.loc[idx, ['Test1', 'Test2', 'Test3']] = [50, 61, 75]

In the above code, the first student in the dataframe for the Evening session now becomes a consistent top performer, while the first students in the dataframe for the Morning and Afternoon sessions become consistently poor performers.

3. Additional Features

Now that we have all of our test scores, there are still some additional features to add, such as final grade, which is the mean of the three test scores:

# Determine Course Grade:
df['Grade'] = df[['Test1', 'Test2', 'Test3']].mean(axis=1).round(1)

Additionally, we can assign a final letter grade:

# Define Letter Grade:
df['LetterGrade'] = ['A' if x >= 90 else
                     'B' if x >= 80 else
                     'C' if x >= 70 else
                     'D' if x >= 60 else
                     'F' for x in df['Grade']]

And we can also flag a student as having passed or failed. Let’s set the threshold for passing at 70:

# Pass or Fail:
df['CoursePass'] = ['Yes' if x >= 70 else
                    'No' for x in df['Grade']]
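The two list comprehensions above work fine; an equivalent, arguably more idiomatic pandas approach bins the grades with pd.cut — a sketch (not from the article's notebook) that matches the thresholds above:

```python
import numpy as np
import pandas as pd

grades = pd.Series([95.0, 85.0, 72.0, 65.0, 50.0])

# right=False makes each bin closed on the left, so a grade of
# exactly 90 maps to 'A', matching the x >= 90 comprehension:
letters = pd.cut(grades,
                 bins=[-np.inf, 60, 70, 80, 90, np.inf],
                 labels=['F', 'D', 'C', 'B', 'A'],
                 right=False)
```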

We now have our complete Synthetic Dataframe, which looks like this:

Screenshot by author.

Don’t forget to save it as a CSV:

# Export to CSV:
df.to_csv('StudentData.csv', index=False)

4. Example Use Case

Let’s test out our data in a use case for someone writing a Towards Data Science article on how to use Plotly Express for data visualization [9]. Using our Synthetic Data, we can freely create some visualizations:

# Visualize Grade Distributions:

plot = px.box(df, x='ClassSession', y='Grade', color='ClassSession')
plot.update_layout(
    title={'text': "Distribution of Final Grades by Class Session",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.45},
    legend_title_text='Class Session',
    xaxis_title='',
    yaxis_title='Final Grade')
plot.show()
Screenshot by author.

And another:

# Plot of Student Test Grades by Class Session:

df = df.sort_values(by='Grade', ascending=False)

plot = px.scatter(df, x='Student',
                  y=['Test1', 'Test2', 'Test3'],
                  color = "ClassSession")
plot.update_layout(
    title={'text': "Students versus Test Scores by Class Session",
           'xanchor': 'center',
           'yanchor': 'top',
           'x': 0.5},
    xaxis_title='',
    yaxis_title='Test Scores')
plot.show()
Screenshot by author.

Try re-running the code with different means and scales for the numpy random.normal() function and see how it impacts the visualizations and class performance.

5. Conclusion

The advantages of creating Synthetic Data include protection of personal and proprietary information as well as avoiding legal issues surrounding data ownership and licensing. It is important to note, though, that not all Synthetic Data is a perfect representation of the data it attempts to represent.

In the example provided above, we made several assumptions (test scores being normally distributed) and also manually injected some high and low test scores, but the real-world student performances may not always fit into such assumptions. Also, keep in mind that the data created above was done through rather simple means; more advanced statistical modeling may be necessary to generate Synthetic Data for complex use cases.

However, the Synthetic Data still achieved the goals of this article by demonstrating data creation and by supporting a notional use case of displaying Plotly visualizations for a data science article. Data scientists charged with creating Synthetic Data must ensure the methodology underpinning the data’s creation is robust enough to make the data useful for the specific problem or use case.

Feel free to download or clone the code and make your own Synthetic Data!

References:

[1] Seaborn, Seaborn: statistical data visualization (2024).

[2] IBM, What is synthetic data? (2023).

[3] Faker PyPI, Faker 13.0 (2022).

[4] NumPy, numpy.random.randint (2024).

[5] NumPy, numpy.random.choice (2024).

[6] NumPy, numpy.random.normal (2024).

[7] NumPy, numpy.clip (2024).

[8] NumPy, numpy.clip (2024).

[9] Plotly, Plotly Express in Python (2024).
