A Practical Guide to Anonymizing Datasets with Python & Faker
How Not to Lose Friends and Alienate People
If you want to keep a secret, you must also hide it from yourself.
— George Orwell, 1984
In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to illustrate examples of how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson, because only that can provide for deep and meaningful exploration. Unfortunately, non-trivial datasets can be hard to find for a few reasons, one of which is that many contain personally identifying information (PII).
A possible solution to dealing with PII is to anonymize the dataset by replacing information that would identify a real individual with information about a fake (but similarly behaving or sounding) individual. Unfortunately, this is not as easy as it sounds. A simple mapping of real data to randomized data is not enough, because in order to be used as a stand-in for analytical purposes, anonymization must preserve the semantics of the original data. As a result, issues related to entity resolution, like managing duplicates or producing linkable results, frequently come into play.
The good news is that we can take a cue from the database community, who routinely generate simulated data to evaluate the performance of a database system. This community has developed plenty of tools for generating very realistic data for a variety of information types. For this post, I'll explore using the Faker library to generate a realistic, anonymized dataset that can be utilized for downstream analysis.
The goal: given a target dataset (for example, a CSV file with multiple columns), produce a new dataset such that for each row in the target, the anonymized dataset does not contain any personally identifying information. The anonymized dataset should have the same amount of data and maintain its analytical value. As shown in the figure below, one possible transformation simply maps original information to fake and therefore anonymous information but maintains the same overall structure.
Anonymizing CSV Data
This post will study a simple example that requires the anonymization of only two fields: full name and email. Sounds easy, right? The difficulty is in preserving the semantic relationships and distributions in our target dataset so that we can hand it off to be analyzed or mined for interesting patterns. What happens if there are multiple rows per user? Since CSV data is naturally denormalized (e.g. contains redundant data like rows with repeated full names and emails), we will need to maintain a mapping of profile information.
Note: Since we're going to be using Python 2.7 in this example, you'll need to install the unicodecsv module with pip. Additionally, you'll need the Faker library:
$ pip install fake-factory unicodecsv
The following example shows a simple anonymize_rows function that maintains this mapping and also shows how to generate data with Faker. We'll also go a step further by reading the data from a source CSV file and writing the anonymized data to a target CSV file. The end result is a file very similar in terms of length, row order, and fields, except that the names and emails have been replaced with fake names and emails.
import unicodecsv as csv
from faker import Factory
from collections import defaultdict


def anonymize_rows(rows):
    """
    Rows is an iterable of dictionaries that contain name and
    email fields that need to be anonymized.
    """
    # Load the faker and its providers
    faker = Factory.create()

    # Create mappings of names & emails to faked names & emails.
    names = defaultdict(faker.name)
    emails = defaultdict(faker.email)

    # Iterate over the rows and yield anonymized rows.
    for row in rows:
        # Replace the name and email fields with faked fields.
        row['name'] = names[row['name']]
        row['email'] = emails[row['email']]

        # Yield the row back to the caller
        yield row


def anonymize(source, target):
    """
    The source argument is a path to a CSV file containing data to anonymize,
    while target is a path to write the anonymized CSV data to.
    """
    with open(source, 'rU') as f:
        with open(target, 'w') as o:
            # Use the DictReader to easily extract fields
            reader = csv.DictReader(f)
            writer = csv.DictWriter(o, reader.fieldnames)

            # Write the header row so the target keeps the same fields
            writer.writeheader()

            # Read and anonymize data, writing to target file.
            for row in anonymize_rows(reader):
                writer.writerow(row)
The entry point for this code is the anonymize function itself. It takes as input the paths to two files: source, where the original data is held in CSV form, and target, the path to write the anonymized data out to. These paths are opened for reading and writing respectively. The unicodecsv module is used to read and parse each row, transforming it into a Python dictionary. Those dictionaries are passed into the anonymize_rows function, which transforms and yields each row to be written to disk by the CSV writer.
The anonymize_rows function takes any iterable of dictionaries that contain name and email keys. It loads the fake factory using Factory.create — a class method that loads various providers with methods that generate fake data (more on this later). We then create two defaultdict instances to map real names to fake names and real emails to fake emails.
The Python collections module provides defaultdict, which is similar to a regular dict except that if a key does not exist in the dictionary, a default value is supplied by the callable passed in at instantiation. For example, d = defaultdict(int) would provide a default value of 0 for every key not already in the dictionary. Therefore, when we use defaultdict(faker.name), we're saying that for every key not in the dictionary, create a fake name (and similarly for email). This allows us to generate a mapping of real data to fake data and ensures that the same real value always maps to the same fake value.
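As a quick illustration, the mapping behaves like this (the generated values will differ on every run, since Faker produces random data):

>>> from faker import Factory
>>> from collections import defaultdict
>>> faker = Factory.create()
>>> names = defaultdict(faker.name)
>>> names['James Hinglee']   # the first lookup generates and stores a fake name
u'Vernice Stracke'
>>> names['James Hinglee']   # later lookups return the same fake name
u'Vernice Stracke'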
From there, we simply iterate through all the rows, replacing data as necessary. If our original CSV file looked like this (imagine clickstream data from an email marketing campaign):
name,email,value,time,ipaddr
James Hinglee,jhinglee@gmail.com,a,1446288248,202.12.32.123
Nancy Smithfield,unicorns4life@yahoo.com,b,1446288250,67.212.123.201
J. Hinglee,jhinglee@gmail.com,b,1446288271,202.12.32.123
...it would be transformed into something like:
name,email,value,time,ipaddr
Mr. Sharif Lehner,keion.hilll@gmail.com,a,1446288248,202.12.32.123
Webster Kulas,nienow.finnegan@gmail.com,b,1446288250,67.212.123.201
Maceo Turner MD,keion.hilll@gmail.com,b,1446288271,202.12.32.123
We now have a new wrangling tool in our toolbox that will allow us to transform CSVs with name and email fields into anonymized datasets! This naturally leads us to the question: what else can we anonymize?
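Running the whole pipeline is then a one-liner (the file names here are just placeholders for your own data):

if __name__ == '__main__':
    anonymize('emails.csv', 'emails_anonymized.csv')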
Generating Fake Data
There are two third-party libraries for generating fake data with Python that come up on Google search results: Faker by @deepthawtz and Fake Factory by @joke2k, which is also called “Faker”. Faker provides anonymization for user profile data, which is completely generated on a per-instance basis. Fake Factory (used in the example above) uses a providers approach to load many different fake data generators in multiple languages. I typically prefer Fake Factory over Faker because it has multiple language support and a wider array of fake data generators. Next we'll explore Fake Factory in detail (for the rest of this post, when I refer to Faker, I'm referring to Fake Factory).
The primary interface that Faker provides is called a Generator. Generators are a collection of Provider instances, which are responsible for formatting random data for a particular domain. Generators also provide a wrapper around the random module, allowing you to set the random seed and perform other operations. While you could theoretically instantiate your own Generator with your own providers, Faker provides a Factory to automatically load all the providers on your behalf:
>>> from faker import Factory
>>> fake = Factory.create()
If you inspect the fake object, you'll see around 158 methods (at the time of this writing), all of which generate fake data. Try the fake.credit_card_number(), fake.military_ship(), and fake.hex_color() methods, to name a few, just to get a sense of the variety of generators that exist.
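For example (your values will differ, since every call draws new random data):

>>> fake.credit_card_number()
u'4937963434836'
>>> fake.military_ship()
u'USNS'
>>> fake.hex_color()
u'#4f9e3d'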
Importantly, providers can also be localized using a language code. This is probably the best reason to use the Factory object — to ensure that localized providers, or subsets of providers, are loaded correctly. For example, to load the French localization:
>>> fake = Factory.create('fr_FR')
>>> fake.catch_phrase_verb()
u"d'atteindre vos buts"
And for fun, some Chinese:
>>> fake = Factory.create('zh_CN')
>>> print fake.company()
快讯科技有限公司
The Faker library has the most comprehensive set of data generators I've ever encountered for a variety of domains. Unfortunately there is no single provider listing; the best way to explore all the providers in detail is simply to look at the providers package on GitHub.
Creating A Provider
Although the Faker library has a comprehensive array of providers, occasionally you need a domain-specific fake data generator. In order to add a custom provider, you will need to subclass BaseProvider and expose custom faker methods as class methods using the @classmethod decorator. One very easy approach is to create a list of the data you'd like to expose and simply select from it at random:
from faker.providers import BaseProvider


class OceanProvider(BaseProvider):

    __provider__ = "ocean"
    __lang__ = "en_US"

    oceans = [
        u'Atlantic', u'Pacific', u'Indian', u'Arctic', u'Southern',
    ]

    @classmethod
    def ocean(cls):
        return cls.random_element(cls.oceans)
In order to change the likelihood or distribution with which oceans are selected, simply add duplicates to the oceans list so that each name has the probability of selection that you'd like (a weighted sketch follows the example below). Then add your provider to the Faker object:
>>> fake = Factory.create()
>>> fake.add_provider(OceanProvider)
>>> fake.ocean()
u'Indian'
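For instance, a weighted variant of the provider might look like this (a minimal sketch; the duplicate counts are arbitrary and should be tuned to whatever distribution you need):

class WeightedOceanProvider(BaseProvider):

    __provider__ = "ocean"
    __lang__ = "en_US"

    # Roughly 40% Pacific, 30% Atlantic, and 10% each for the rest
    oceans = (
        [u'Pacific'] * 4 + [u'Atlantic'] * 3 +
        [u'Indian', u'Arctic', u'Southern']
    )

    @classmethod
    def ocean(cls):
        return cls.random_element(cls.oceans)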
In routine data wrangling operations, you can create a package structure with localization similar to Faker's and load things on demand. Don't forget — if you come up with a generic provider that may be useful to many people, submit it back as a pull request!
Maintaining Data Quality
Now that we understand the wide variety of fake data we can generate, let's get back to our original example of creating user profile data with just a name and email address. Looking at the results in the Anonymizing CSV Data section above, we can make a few observations:
- Pro: exact duplicates of name and email are maintained via the mapping.
- Pro: our user profiles are now fake data and PII is protected.
- Con: the name and the email are weird and don't match.
- Con: fuzzy duplicates (e.g. J. Smith vs. John Smith) are blown away.
- Con: all the domains are "free email" like Yahoo and Gmail.
Basically we want to improve our user profile to include email addresses that are similar to the names (or a non-name based username), and we want to ensure that the domains are a bit more realistic for work addresses. We also want to include aliases, nicknames, or different versions of the name. Faker does include a profile provider:
>>> fake.simple_profile()
{
    "username": "autumn.weissnat",
    "name": "Jalyn Crona",
    "birthdate": "1981-01-29",
    "sex": "F",
    "address": "Unit 2875 Box 1477\nDPO AE 18742-1954",
    "mail": "zollie.schamberger@hotmail.com"
}
But as you can see, it suffers from the same problem. In this section, we'll explore different techniques that allow us to modify our fake data generation such that it matches the distributions we're seeing in the original data set. In particular we'll deal with the domain, create more realistic fake profiles, and add duplicates to our data set with fuzzy matching.
Domain Distribution
One idea for maintaining the distribution of domains is to do a first pass over the data and create a mapping of real domains to fake domains. Many domains like gmail.com can simply be whitelisted and mapped directly to themselves (we just need a fake username). This method also lets us preserve capitalization and spelling, e.g. "Gmail.com" and "GMAIL.com", which might be important for datasets that have been entered by hand.
In order to create the domain mapping/whitelist, we'll need to create an object that can load a whitelist from disk, or generate one from our original dataset. For example:
import csv
import json

from faker import Factory
from collections import Counter
from collections import MutableMapping


class DomainMapping(MutableMapping):

    @classmethod
    def load(cls, fobj):
        """
        Load the mapping from a JSON file on disk.
        """
        data = json.load(fobj)
        return cls(**data)

    @classmethod
    def generate(cls, emails):
        """
        Pass through a list of emails and count domains to whitelist.
        """
        # Count all the domains in each email address
        counts = Counter([
            email.split("@")[-1] for email in emails
        ])

        # Create a domain mapping
        domains = cls()

        # Ask the user what domains to whitelist based on frequency
        for idx, (domain, count) in enumerate(counts.most_common()):
            prompt = "{}/{}: Whitelist {} ({} addresses)?".format(
                idx + 1, len(counts), domain, count
            )
            print prompt

            ans = raw_input("[y/n/q] > ").lower()
            if ans.startswith('y'):
                # Whitelist the domain
                domains[domain] = domain
            elif ans.startswith('n'):
                # Create a fake domain by touching the key
                domains[domain]
            elif ans.startswith('q'):
                break
            else:
                continue

        return domains

    def __init__(self, whitelist=None, mapping=None):
        # Create the domain mapping properties (avoid mutable default arguments)
        self.fake = Factory.create()
        self.domains = mapping or {}

        # Add the whitelist as a mapping to itself.
        for domain in (whitelist or []):
            self.domains[domain] = domain

    def dump(self, fobj):
        """
        Dump the domain mapping whitelist/mapping to JSON.
        """
        whitelist = []
        mapping = self.domains.copy()

        for key in mapping.keys():
            if key == mapping[key]:
                whitelist.append(mapping.pop(key))

        json.dump({
            'whitelist': whitelist,
            'mapping': mapping
        }, fobj, indent=2)

    def __getitem__(self, key):
        """
        Get a fake domain for a real domain.
        """
        if key not in self.domains:
            self.domains[key] = self.fake.domain_name()
        return self.domains[key]

    def __setitem__(self, key, val):
        self.domains[key] = val

    def __delitem__(self, key):
        del self.domains[key]

    def __iter__(self):
        for key in self.domains:
            yield key

    def __len__(self):
        return len(self.domains)
That's quite a lot of code all at once, so let's break it down a bit. First, the class extends MutableMapping, which is an abstract base class (ABC) in the collections module. The ABC gives us the ability to make this class act just like a dict object. All we have to do is provide __getitem__, __setitem__, __delitem__, __iter__, and __len__ methods, and all of the other dictionary methods like pop or values will work on our behalf. Here, we're just wrapping an inner dictionary called domains.
The thing to note about our __getitem__ method is that it acts very much like a defaultdict — that is, if you try to fetch a key that is not in the mapping, it generates fake data on your behalf. This way, any domains that we don't have in our whitelist or mapping will automatically be anonymized.
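For example (the fake domain below is illustrative; your output will differ):

>>> domains = DomainMapping(whitelist=['gmail.com'])
>>> domains['gmail.com']                # whitelisted domains map to themselves
'gmail.com'
>>> domains['districtdatalabs.com']     # unseen domains get a fake domain...
u'konopelski.com'
>>> domains['districtdatalabs.com']     # ...and keep it on later lookups
u'konopelski.com'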
Next, we want to be able to load and dump this data to a JSON file on disk so that we can maintain our mapping between anonymization runs. The load method is fairly straightforward; it just takes an open file-like object, parses it using the json module, instantiates the domain mapping, and returns it. The dump method is a bit more complex: it has to break down the whitelist and the mapping into separate objects so that we can easily modify the data on disk if needed. Together, these methods allow us to save and load our mapping as a JSON file that looks similar to:
{
  "whitelist": [
    "gmail.com",
    "yahoo.com"
  ],
  "mapping": {
    "districtdatalabs.com": "fadel.org",
    "umd.edu": "ferrystanton.org"
  }
}
The final method of note is generate, which allows you to do a first pass through a list of emails, count the frequency of the domains, and then propose each domain to the user in order of frequency to decide whether or not to add it to the whitelist. For each domain in the emails, the user is prompted as follows:
1/245: Whitelist "gmail.com" (817 addresses)?
[y/n/q] >
Note that the prompt includes a progress indicator (this is prompt 1 of 245) as well as a way to quit early. This is especially important for large datasets that have a lot of unique domains; if you quit, the remaining domains will still be faked, and the user only sees the most frequent examples for whitelisting. The idea behind this mechanism is to read through your CSV once, generate the whitelist, then save it to disk so that you can use it for anonymization on a routine basis. Moreover, you can modify domains in the JSON file to better match any semantics you might have (e.g. including .edu or .gov domains, which are not generated by the internet provider in Faker).
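Putting the pieces together, a typical workflow might look something like this (a sketch; emails.csv and domains.json are hypothetical file names, and the email column is assumed to be called email):

import unicodecsv as csv

# First pass: collect the email addresses from the source file
with open('emails.csv', 'rb') as f:
    emails = [row['email'] for row in csv.DictReader(f)]

# Interactively build the whitelist/mapping and save it to disk
domains = DomainMapping.generate(emails)
with open('domains.json', 'w') as f:
    domains.dump(f)

# On later anonymization runs, simply reload the saved mapping
with open('domains.json', 'r') as f:
    domains = DomainMapping.load(f)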
Realistic Profiles
To create realistic profiles, we'll create a provider that uses the domain map from above and generates fake data for every combination we see in the dataset. It will also give us a way to map multiple names and email addresses to a single profile, so that we can use the profile to create fuzzy duplicates in the next section. Here is the code:
class Profile(object):

    def __init__(self, domains):
        self.domains = domains
        self.generator = Factory.create()

    def fuzzy_profile(self, name=None, email=None):
        """
        Return a profile that allows for fuzzy names and emails.
        """
        parts = self.fuzzy_name_parts()
        return {
            "names": {name: self.fuzzy_name(parts, name)},
            "emails": {email: self.fuzzy_email(parts, email)},
        }

    def fuzzy_name_parts(self):
        """
        Returns first, middle, and last name parts
        """
        return (
            self.generator.first_name(),
            self.generator.first_name(),
            self.generator.last_name()
        )

    def fuzzy_name(self, parts, name=None):
        """
        Creates a name that has a similar case to the passed in name.
        """
        # Extract the first, middle, and last name from the parts.
        first, middle, last = parts

        # Create the name, with chance of middle or initial included.
        chance = self.generator.random_digit()
        if chance < 2:
            fname = u"{} {}. {}".format(first, middle[0], last)
        elif chance < 4:
            fname = u"{} {} {}".format(first, middle, last)
        else:
            fname = u"{} {}".format(first, last)

        if name is not None:
            # Match the capitalization of the name
            if name.isupper(): return fname.upper()
            if name.islower(): return fname.lower()

        return fname

    def fuzzy_email(self, parts, email=None):
        """
        Creates an email similar to the name and original email.
        """
        # Extract the first, middle, and last name from the parts.
        first, middle, last = parts

        # Use the domain mapping to identify the fake domain.
        if email is not None:
            domain = self.domains[email.split("@")[-1]]
        else:
            domain = self.generator.domain_name()

        # Create the username based on the name parts
        chance = self.generator.random_digit()
        if chance < 2:
            username = u"{}.{}".format(first, last)
        elif chance < 3:
            username = u"{}.{}.{}".format(first, middle[0], last)
        elif chance < 6:
            username = u"{}{}".format(first[0], last)
        elif chance < 8:
            username = last
        else:
            username = u"{}{}".format(
                first, self.generator.random_number()
            )

        # Match the case of the email
        if email is not None:
            if email.islower(): username = username.lower()
            if email.isupper(): username = username.upper()
        else:
            username = username.lower()

        return u"{}@{}".format(username, domain)
Again, this is a lot of code, so make sure you go through it carefully to understand what is happening. First off, a profile in this case is the combination of mapping names to fake names and emails to fake emails. The key is that the names and emails are related to the original data somehow. Here the relationship is through case, such that "DANIEL WEBSTER" is faked to "JAKOB WILCOTT" instead of to "Jakob Wilcott". Additionally, through our domain mapping, we maintain the relationship of the original email domain to the fake domain, e.g. everyone with the email domain "@districtdatalabs.com" will be mapped to the same fake domain.
In order to maintain the relationship of names to emails (which is very common), we need to be able to access the name more directly. We have a name parts generator which generates fake first, middle, and last names. We then randomly generate names of the form “first last”, “first middle last”, or “first i. last”. The email can take a variety of forms based on the name parts as well. Now we get slightly more realistic profiles:
>>> profile = Profile(domains)
>>> profile.fuzzy_profile()
{'names': {None: u'Zaire Ebert'}, 'emails': {None: u'ebert@von.com'}}
>>> profile.fuzzy_profile(
...     name='Daniel Webster', email='dictionaryguy@gmail.com')
{'names': {'Daniel Webster': u'Georgia McDermott'},
 'emails': {'dictionaryguy@gmail.com': u'georgia9@gmail.com'}}
Importantly, this profile object makes it easy to map multiple names and emails to the same profile object in order to create "fuzzy" profiles and duplicates in your dataset. We will discuss how to perform fuzzy matching in the next section.
Fuzzing Fake Names from Duplicates
If you noticed, our original dataset had a clear entity duplication: the same email, but different names. In fact, the second name was simply the first initial and last name, but you can imagine other duplication scenarios like nicknames ("Bill" instead of "William"), or having both work and personal emails in the dataset. The fuzzy profile objects we generated in the last section allow us to maintain a mapping of all name parts to generated fake names, but we need some way to detect duplicates and combine their profiles: enter the fuzzywuzzy module.
$ pip install fuzzywuzzy python-Levenshtein
Similarly to our domain mapping approach, we're going to pass through the entire dataset and look for similar name-email pairs to propose to the user. If the user thinks they're duplicates, then we'll merge them together into a single profile, and use the mappings as we anonymize. This is also something you can save to disk and load on demand for multiple anonymization passes and to include user-based edits.
The first step is to get pairs and eliminate exact duplicates. To do this we'll create a hashable data structure for our profiles using a namedtuple.
from collections import namedtuple
from itertools import combinations

Person = namedtuple('Person', 'name, email')


def pairs_from_rows(rows):
    """
    Expects rows of dictionaries with name and email keys.
    """
    # Create a set of person tuples (no exact duplicates)
    people = set([
        Person(row['name'], row['email']) for row in rows
    ])

    # Yield ordered pairs of people objects without replacement
    for pair in combinations(people, 2):
        yield pair
The namedtuple is an immutable data structure that is compact, efficient, and allows us to access properties by name. Because it is immutable, it is also hashable (unlike mutable dictionaries), meaning we can use it in sets and as dictionary keys. This is important, because the first thing our pairs_from_rows function does is eliminate exact matches by creating a set of Person tuples. We then use the combinations function in itertools to generate every pair without replacement.
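For example, using the sample rows from earlier (the ordering within the pair may vary, since sets are unordered):

>>> rows = [
...     {'name': 'James Hinglee', 'email': 'jhinglee@gmail.com'},
...     {'name': 'J. Hinglee', 'email': 'jhinglee@gmail.com'},
...     {'name': 'James Hinglee', 'email': 'jhinglee@gmail.com'},  # exact duplicate
... ]
>>> list(pairs_from_rows(rows))
[(Person(name='J. Hinglee', email='jhinglee@gmail.com'),
  Person(name='James Hinglee', email='jhinglee@gmail.com'))]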
The next step is to figure out how similar each pair is. To do this, we'll use the fuzzywuzzy library to come up with a partial ratio score: the mean of the similarity of the names and the emails for each pair:
from fuzzywuzzy import fuzz
from functools import partial


def normalize(value, email=False):
    """
    Make everything lowercase and remove spaces.
    If email, only take the username portion to compare.
    """
    if email:
        value = value.split("@")[0]
    return value.lower().replace(" ", "")


def person_similarity(pair):
    """
    Returns the mean of the normalized partial ratio scores.
    """
    # Normalize the names and the emails
    names = map(normalize, [p.name for p in pair])
    emails = map(
        partial(normalize, email=True), [p.email for p in pair]
    )

    # Compute the partial ratio scores for both names and emails
    scores = [
        fuzz.partial_ratio(a, b) for a, b in [names, emails]
    ]

    # Return the mean score of the pair
    return float(sum(scores)) / len(scores)
The score will be between 0 (no similarity) and 100 (exact match), though hopefully you won't get any scores of 100 since we eliminated exact matches above. For example:
>>> person_similarity([
... Person('John Lennon', 'john.lennon@gmail.com'),
... Person('J. Lennon', 'jlennon@example.org')
... ])
80.5
The fuzzing process will go through our entire dataset, create pairs of people, and compute their similarity score. We can then filter out all pairs with scores below a certain threshold (say, 50) and propose the remaining results to the user in descending score order so they can decide whether each pair is a duplicate. When a duplicate is found, we merge the profile objects so that the new names and emails map together.
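A minimal sketch of that review loop might look like the following (propose_duplicates is not part of the post's code; it simply ties together pairs_from_rows and person_similarity and leaves the actual profile merge to the caller):

def propose_duplicates(rows, threshold=50):
    """
    Score every pair of people and ask the user about likely duplicates,
    most similar first. Yields the pairs the user confirms.
    """
    scored = [
        (person_similarity(pair), pair) for pair in pairs_from_rows(rows)
    ]

    # Keep only pairs at or above the threshold, highest scores first
    scored = sorted(
        (item for item in scored if item[0] >= threshold), reverse=True
    )

    for score, (a, b) in scored:
        print u"{:.1f}: {} <{}> vs {} <{}>".format(
            score, a.name, a.email, b.name, b.email
        )
        ans = raw_input("Duplicates? [y/n/q] > ").lower()

        if ans.startswith('q'):
            break
        if ans.startswith('y'):
            # The caller merges the fuzzy profiles for these two people
            yield (a, b)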
Conclusion
Anonymization of datasets is a critical method to promote the exploration and practice of data science through open data. Fake data generators that already exist give us the opportunity to ensure that private data is obfuscated. The issue becomes how to leverage these fake data generators while still maintaining and preserving a high quality dataset with semantic relations for further analysis. As we've seen throughout the post, even the anonymization of just two common fields like name and email can lead to potential problems.
This problem, and the code in this post, are associated with a real case study. For District Data Labs' Entity Resolution Research Lab, I wanted to create a dataset that removed members' PII while maintaining duplicates and structure so that it could be used to study entity resolution. The source dataset was 1,343 records in CSV form and contained names and emails that I wanted to anonymize.
Using the domain name mapping strategy described above, I found that the dataset contained 245 distinct domain names, 185 of which appeared only once. There was a definite long tail: the 20 or so most frequent domains accounted for the majority of the records. Once I generated the whitelist as described above, I manually edited the mappings to ensure that there were no duplicates and that major work domains were sufficiently "professional."
The fuzzy matching process was also a bear: it took, on average, 28 seconds to compute the pairwise scores. Using a threshold score of 50, I was presented with 5,110 proposed duplicates (out of a possible 901,153 combinations). I went through 354 entries (until the score dropped below 65) and was satisfied that I had covered many of the duplicates in the dataset.
The resulting anonymized dataset was of high quality and obfuscated personally identifying information like name and email. Of course, you could still reverse-engineer some of the information in the dataset. For example, I'm listed in the dataset, and one of the records indicates a relationship between a fake user and a blog post that I'm on record as having written. However, even though you could figure out who I am and what else I've done through the dataset, you wouldn't be able to use it to extract my email address, which was the goal.
In the end, even though anonymizing data requires a lot of data wrangling effort and considered thought, the benefits of open data are invaluable. Only by sharing data, resources, and tools can we put many eyes on a problem, generate multiple insights, and drive the field of data science forward.
Acknowledgments
I would like to thank Michal Haskell and Rebecca Bilbro for their help editing and preparing this post. This discussion was a challenge, and they cleaned up my bleary-eyed writing to make the article readable. A special thank you to Rebecca Bilbro as well for drawing the figure used to describe the anonymization process.
Footnotes
1. Anonymize: remove identifying particulars from (test results) for statistical or other purposes.
2. Entity Resolution: tools or techniques that identify, group, and link digital mentions or manifestations of some object in the real world.
3. DDL Research Labs is an applied research program intended to develop novel, innovative data science solutions towards practical applications.