# Introduction
Production data is typically subject to notable privacy and compliance constraints. For this reason, anonymizing such data becomes critical in virtually every real-world data science project involving the launch of a data-driven product, service, or solution.
Mimesis is an open-source Python library that excels at producing realistic “fake” data with high performance. Because it is freely available, it offers a no-cost, reliable option for this kind of data privacy work. This article walks through a simple step-by-step example, which you can follow in your IDE or notebook environment, to demonstrate how to use this library to anonymize sensitive production data.
# Step-by-Step Guide
If you are new to Mimesis, you may need to install it in your Python environment with the command `pip install mimesis`.
Remember to add ! at the beginning of the pip command if you are working in a Google Colab notebook environment or similar.
We are now ready to begin! Consider a scenario involving a software product’s tier-based subscription system. To keep things simple, we will synthesize a toy dataset with customer information and a subscription tier. As you can see below, some of the dataset attributes contain highly sensitive information:
```python
import pandas as pd

# Creation of a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}
df = pd.DataFrame(production_data)

print("--- Original Sensitive Data ---")
print(df.head())
```
While subscription tiers are not necessarily sensitive data in our example, user names, emails, and phone numbers are. With the aid of Mimesis, we can initialize a provider: a sort of tailored data anonymization template suited to the type of data we have. Since our data observations are associated with people, we can import and use the Person class — a provider that, given a specific language like English and aided by a random seed, can be used to generate fake substitutes for real, sensitive personal data:
```python
from mimesis import Person
from mimesis.locales import Locale

# Initializing a Person provider for English locales
person = Person(locale=Locale.EN, seed=42)
```
From this point onwards, the process to anonymize personally identifiable information (PII) is quite simple. All it takes is replacing the sensitive columns — specified by us — with freshly generated data from the Mimesis person locale generator. This is done by iterating through the DataFrame object containing the whole dataset and calling suitable Mimesis functions to realistically create substitutes for the data, depending on each given attribute:
```python
# 1. Replacing real names with fake, realistic names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replacing real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replacing real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to reflect that it is no longer the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)
```
Notice above how Mimesis’ Person class provides dedicated functions for generating full names, emails, and telephone numbers, among others. In addition, the name column is renamed to reflect that the name included in the updated dataset is no longer real but anonymized.
We now verify the results by looking at the transformed DataFrame. The sensitive PII fields have completely changed: they are now overwritten with legitimate-looking synthetic data, while the overall dataset structure and the information that matters for downstream analyses, like subscription_tier, remain intact.
```python
# The leading "\n" prints a blank line before the header
print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())
```
Output:
```
--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone
0      101    Anthony Reilly    archived1911@duck.com     +13312271333
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676

  subscription_tier
0           Premium
1             Basic
2             Basic
3        Enterprise
```
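Beyond eyeballing the printout, a similar check can be made programmatically. The sketch below is one possible approach rather than anything Mimesis provides: `leaked_values` is a hypothetical helper, and the "anonymized" frame uses made-up placeholder values so the example does not depend on any particular generated output.

```python
import pandas as pd

def leaked_values(original, anonymized, column_pairs):
    """Return, per original column, the original values still present after anonymization."""
    leaks = {}
    for original_col, anonymized_col in column_pairs.items():
        overlap = set(original[original_col]) & set(anonymized[anonymized_col])
        if overlap:
            leaks[original_col] = overlap
    return leaks

original = pd.DataFrame({
    "real_name": ["Alice Smith", "Bob Jones"],
    "subscription_tier": ["Premium", "Basic"],
})

# Stand-in for the anonymized frame; these values are made up for illustration
anonymized = pd.DataFrame({
    "anon_name": ["Placeholder One", "Placeholder Two"],
    "subscription_tier": ["Premium", "Basic"],
})

# No original name survives in the anonymized column
print(leaked_values(original, anonymized, {"real_name": "anon_name"}))  # → {}

# The non-sensitive column is unchanged
print(original["subscription_tier"].equals(anonymized["subscription_tier"]))  # → True
```

A check like this only confirms that exact original values are gone. It does not by itself prove that the dataset is safe to share.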
Fantastic! Thanks to the open-source Mimesis library, it took only a few simple steps to anonymize several sensitive data fields of the kind typically found in real-world production data science projects and analyses.
To conclude, here are some best practices and observations about the anonymization procedure we just covered:
- We replaced the columns directly in the `DataFrame`. Depending on your context, consider whether this is the right approach, or whether you may want to store the new information in a separate `DataFrame` if there is a risk of losing the original data.
- Mimesis uses data consistency to ensure that generated data matches the expected data types.
- Seeding facilitates reproducibility and helps maintain consistency of generated data across various runs.
# Wrapping Up
We demonstrated how to use Mimesis, a powerful Python library for creating fake and anonymized data, to transform a sensitive production dataset into a safe-to-use version without exposing real people’s PII.
Iván Palomares Carrascosa is an authority on AI, machine learning, deep learning, and LLMs. He instructs and trains others on how to use artificial intelligence (AI) in the real world.