Kurt Klingensmith

Summary

The web content provides guidance on anonymizing personal names in Python using various libraries and custom functions to protect sensitive data.

Abstract

The article discusses the importance of anonymizing personal information in datasets, particularly in the context of data science, to prevent misuse such as fraud or identity theft. It outlines several methods for anonymizing names in Python, including the use of the Faker and AnonymizeDF libraries, as well as a custom-built word scrambler. The author demonstrates how to generate fake data to replace real names, create a key for de-anonymization, and clean the dataframe for secure data handling. The article emphasizes the necessity of safeguarding personal information and suggests that advanced techniques may be required for complex datasets to ensure complete anonymity.

Opinions

  • The author conveys that data scientists must prioritize the protection of personal information to maintain trust and avoid legal repercussions.
  • Anonymization techniques are essential for data scientists to learn and apply to ensure ethical data handling.
  • The use of third-party libraries like Faker and AnonymizeDF is recommended for their ease of use and functionality in generating fake data.
  • The article suggests that simple anonymization may not be sufficient for complex datasets, implying the need for more sophisticated de-identification techniques or synthetic data generation.
  • The author provides a pragmatic approach by including code examples and a downloadable Jupyter notebook, indicating a preference for practical, hands-on learning and application.

How to Quickly Anonymize Personal Names in Python

Photo by Julian Hochgesang on Unsplash.

Eventually, most data scientists will handle datasets with personal information. Personnel data is highly sensitive, and aggregation of such data can reveal privileged information about an individual or an organization. The Federal Trade Commission’s (FTC) guide for Protecting Personal Information elaborates further [1]:

[If] sensitive data falls into the wrong hands, it can lead to fraud, identity theft, or similar harms [resulting in] losing your customers’ trust and perhaps even defending yourself against a lawsuit.

— FTC Guide for Protecting Personal Information

Thus, data scientists who fail to safeguard personnel data will have short-lived careers. Fortunately, several options exist within Python to anonymize names and easily generate fake personnel data. Follow the examples below and download the complete Jupyter notebook with code examples at the linked GitHub page.

The Scenario — Test Scores

First, let’s create a scenario. Suppose a professor has a set of test scores from a recent exam, but wishes to obscure the student names when discussing exam trends with the class. To set up this scenario, the Python library Faker lets us generate fake data, including names [2]. Generating a notional dataset is simple:

# Import Faker
from faker import Faker
faker = Faker()
# Test fake data generation
print("The Faker library can generate fake names. By running 'faker.name()', we get:")
print(faker.name())

This code should provide a fake name, which will change with each execution. An example output is below:

Screenshot by author.

That only gets us one name, however; to generate an example dataframe of ten students with ten random test scores, run the following code:

# Import pandas and NumPy
import numpy as np
import pandas as pd
# Create a list of fake names
fake_names = [faker.name() for _ in range(10)]
df = pd.DataFrame(fake_names, columns=['Student'])
# Generate random test scores (integers from 50 to 99)
df['TestScore'] = np.random.randint(50, 100, df.shape[0])
# Optional - Export to CSV
df.to_csv('StudentTestScores.csv', index=False)

The resultant dataframe is:

Screenshot by author.

This will serve as the student test results data. For the scenario, the randomly generated names above represent the real names of the students.

1. Anonymization via AnonymizeDF

AnonymizeDF is a Python library capable of generating fake data, including names, IDs, numbers, categories, and more [3]. Here’s an example code block to generate fake names:

# Import AnonymizeDF
from anonymizedf.anonymizedf import anonymize
# Prepare the dataframe for anonymization
anon = anonymize(df)
# AnonymizeDF can generate fake names
anon.fake_names("Student")

The resultant output is:

Screenshot by author.

AnonymizeDF can also create fake identifications. An example follows:

anon.fake_ids("Student")

Screenshot by author.

AnonymizeDF can also create categories. This feature takes the column name and adds a number to it. An example follows:

anon.fake_categories("Student")

Screenshot by author.

AnonymizeDF provides a powerful set of options for data scientists looking to obscure and anonymize user names, and is easy to use. But there are alternatives for those seeking other options.

2. Anonymization via Faker

Similar to AnonymizeDF, Faker is a Python library that generates fake data ranging from names to addresses and more [4]. Faker is very easy to use:

# Import Faker
from faker import Faker
faker = Faker()
# Seed the generator so the fake names are reproducible
Faker.seed(4321)
# Map each unique real name to a fake name
dict_names = {name: faker.name() for name in df['Student'].unique()}
df['New Student Name'] = df['Student'].map(dict_names)

Screenshot by author.
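One caveat with the dictionary approach: nothing guarantees that faker.name() returns a different fake name for each student, and a repeated pseudonym would make any later key ambiguous. The sketch below shows one way to guard against that. The helper build_pseudonym_map and the stand-in generator toy_name are hypothetical names introduced for illustration and are not part of Faker; in the scenario above, faker.name would presumably be passed in as make_name.

```python
import random


def build_pseudonym_map(real_names, make_name, max_tries=1000):
    """Map each distinct real name to a distinct pseudonym.

    make_name is any zero-argument callable that returns a candidate name.
    """
    reserved = set(real_names)  # never hand out a real name as a pseudonym
    used = set()
    mapping = {}
    for name in dict.fromkeys(real_names):  # distinct names, original order
        for _ in range(max_tries):
            candidate = make_name()
            if candidate not in used and candidate not in reserved:
                break
        else:
            raise RuntimeError("could not find an unused pseudonym")
        used.add(candidate)
        mapping[name] = candidate
    return mapping


# Hypothetical stand-in generator so the sketch runs without Faker installed.
rng = random.Random(4321)


def toy_name():
    return f"Student {rng.randint(1000, 9999)}"


students = ["Ada Lovelace", "Alan Turing", "Ada Lovelace", "Grace Hopper"]
pseudonyms = build_pseudonym_map(students, toy_name)
print(pseudonyms)
```

Repeated real names share one pseudonym, while different real names never do, so the mapping can be inverted later without ambiguity.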

Faker also has some unique capabilities, such as creating a fake address. For example:

print(faker.address())

Screenshot by author.

3. Custom Built Word Scrambler

In addition to using third-party libraries, homebuilt solutions are also an option. These can range from word scramblers to functions that replace names with random words or numbers. An example scrambler function is below:

# Scrambler
from random import shuffle
# Create a scrambler function
def word_scrambler(word):
    word = list(word)
    shuffle(word)
    return ''.join(word)

Apply this function to the dataframe with the following code:

df['ScrambledName'] = df.Student.apply(word_scrambler)
df['ScrambledName'] = df['ScrambledName'].str.replace(" ","")

This yields the following dataframe:

Screenshot by author.

There are some limitations to this approach. First, an individual with prior knowledge of the student names could deduce who is who based on the capitalized letters in the scrambled name, which represent the initials. Second, the scrambled letters are not as clean in appearance or as interpretable as a pseudonym. Further customization could scrub capital letters or generate random numbers in place of names; the most appropriate choice depends on the scenario and needs of the customer.
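The two customizations mentioned above could look roughly like the following. This is a minimal sketch using only the standard library; the function names scramble_without_initials and random_id_map are hypothetical and were chosen for illustration.

```python
import random


def scramble_without_initials(name, rng=random):
    """Shuffle the letters of a name after lowercasing it and dropping
    spaces and punctuation, so capital letters no longer reveal initials."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    rng.shuffle(letters)
    return "".join(letters)


def random_id_map(names, rng=random, digits=6):
    """Assign each distinct name a distinct random number with `digits` digits."""
    distinct = list(dict.fromkeys(names))
    ids = rng.sample(range(10 ** (digits - 1), 10 ** digits), k=len(distinct))
    return dict(zip(distinct, ids))


names = ["Ada Lovelace", "Alan Turing", "Ada Lovelace"]
print([scramble_without_initials(n) for n in names])
print(random_id_map(names))
```

Even without capitals, a scrambled name still exposes its length and letters, so the random numeric IDs are likely the safer of the two when the audience knows the class roster.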

4. Putting it All Together: Anonymize, Clean, and De-Anonymize the Data Frame

Once a technique is chosen, applying it to a dataframe, cleaning the frame, and storing a “key” is quite simple. Consider the original dataframe of student test scores from the beginning:

Screenshot by author.

Let’s use AnonymizeDF to create anonymous names. The following code block will:

  • Generate fake names.
  • Create a CSV “Key” containing the real and fake names.
  • Drop the real names from the original dataframe.
  • Present a clean, anonymized dataframe with the same structure as the original.

# Create the Fake Student Names
anon = anonymize(df)
anon.fake_names('Student')
# Create a "Key" and save it without the dataframe index
dfKey = df[['Student', 'Fake_Student']]
dfKey.to_csv('key.csv', index=False)
# Overwrite the real names with the fake names, then drop the helper column
df = df.assign(Student = df['Fake_Student'])
df = df.drop(columns='Fake_Student')

The output is the following dataframe:

Screenshot by author.

De-anonymizing this is a simple matter of loading the CSV “Key” and mapping the fake student names back to the original student names:

# Load in the decoder key
dfKey = pd.read_csv('key.csv')
# Return to the original Data
df['Student'] = df['Student'].map(dfKey.set_index('Fake_Student')['Student'])

This is what it looks like in Jupyter Notebook (notebook downloadable at the linked GitHub page):

Screenshot by author.
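To make the logic of the key lookup explicit, here is the same round trip sketched with only the standard library. The sample names are invented, and an in-memory buffer stands in for key.csv; this illustrates the idea and does not replace the pandas code above.

```python
import csv
import io

# Invented sample key; in the article this would be the contents of key.csv.
key_rows = [
    {"Student": "Ada Lovelace", "Fake_Student": "Jane Roe"},
    {"Student": "Alan Turing", "Fake_Student": "John Doe"},
]

# Write the key to an in-memory buffer that stands in for the CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Student", "Fake_Student"])
writer.writeheader()
writer.writerows(key_rows)

# Read the key back and build a fake-name -> real-name lookup.
buffer.seek(0)
fake_to_real = {}
for row in csv.DictReader(buffer):
    fake = row["Fake_Student"]
    if fake in fake_to_real:
        raise ValueError(f"pseudonym {fake!r} appears twice in the key")
    fake_to_real[fake] = row["Student"]

# De-anonymize; names missing from the key come back as None.
anonymized = ["John Doe", "Jane Roe", "Unknown Person"]
restored = [fake_to_real.get(name) for name in anonymized]
print(restored)
```

The lookup only works if every pseudonym appears once in the key, which is why the sketch raises on duplicates. In the pandas version, names missing from the key come back as NaN instead of None.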

Conclusion

It is inevitable that a data scientist will encounter datasets with personal information, the safeguarding of which is critical for protecting individuals and organizations. The simple anonymization techniques highlighted above provide a means to quickly generate fake data as placeholders to protect individuals.

However, for certain datasets, simply anonymizing a name might be insufficient. Other datapoints such as addresses or personal attributes could allow a third party to reconstruct the identity associated with an observation. Thus, more complex datasets will require advanced anonymization and de-identification techniques, and in some cases synthetic data may be the best route for conducting analysis while protecting personal information.
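As a rough illustration of that risk, the sketch below counts how many records are the only one with a given combination of remaining attributes. The columns ZipCode and BirthYear and all the values are hypothetical and do not come from the test score dataset above.

```python
from collections import Counter

# Hypothetical anonymized records that still carry other attributes.
records = [
    {"Student": "Fake A", "ZipCode": "12345", "BirthYear": 2001, "TestScore": 88},
    {"Student": "Fake B", "ZipCode": "12345", "BirthYear": 2001, "TestScore": 73},
    {"Student": "Fake C", "ZipCode": "54321", "BirthYear": 1999, "TestScore": 91},
]


def count_unique_combinations(rows, columns):
    """Count rows whose combination of values in `columns` occurs exactly once."""
    combos = Counter(tuple(row[col] for col in columns) for row in rows)
    return sum(1 for count in combos.values() if count == 1)


exposed = count_unique_combinations(records, ["ZipCode", "BirthYear"])
print(f"{exposed} of {len(records)} records are unique on ZipCode and BirthYear")
```

Here one record stands alone on those two attributes, so anyone who knows that student's zip code and birth year could pick out the row despite the fake name.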

References:

[1] Federal Trade Commission, Protecting Personal Information: A Guide for Business (2016).

[2] Faker PyPI, Faker 13.0 (2022), Python Package Index.

[3] Anonymize DF, Anonymizedf 1.0.1 (2022), Python Package Index.

[4] Faker PyPI, Faker 13.0 (2022), Python Package Index.
