Simple CSV Data Wrangling with Python
Efficient Processing, Schemas, and Serialization
I wanted to write a quick post today about a task that most of us do routinely but often think very little about - loading CSV (comma-separated value) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I have to jump through hoops to deal with this type of data, when it feels like it should be as easy as JSON or other serialization formats.
The basic problem is this: CSVs have inherent schemas. In fact, most of the CSVs that I work with are dumps from a database. While the database can maintain schema information alongside the data, the schema is lost when serializing to disk. Worse, if the dump is denormalized (a join of two tables), then the relationships are also lost, making it harder to extract entities. Although a header row can give us the names of the fields in the file, it won't give us the types, and there is nothing structural about the serialization format (as there is with JSON) from which we can infer the types.
That said, I love CSVs. CSVs are a compact data format - one row, one record. CSVs can grow to massive sizes without cause for concern. I don't flinch when reading 4 GB CSV files with Python because they can be split into multiple files, read one row at a time for memory efficiency, and multiprocessed with seeks to speed up the job. This is in stark contrast to JSON or XML, which have to be read completely from end to end in order to get the full data (a JSON document generally has to be loaded into memory all at once, and with XML you have to use a streaming parser like SAX).
CSVs are the file format of choice for big data appliances like Hadoop for good reason. If you can get past encoding issues, extra dependencies, schema inference, and typing, CSVs are a great serialization format. In this post, I will provide you with a series of pro tips that I have discovered for using and wrangling CSV data.
Specifically, this post will cover the following:
- The basics of CSV processing with Python
- Avoiding Unicode issues in Python 2.7
- Using namedtuples or slots for memory efficiency
- Serializing data with a schema using Avro
Although we won't cover it in this post, these techniques also give you a great start towards multiprocessing, where you can quickly dig through a CSV file from many different positions in it at once. Hopefully this intro has made CSVs sound more exciting, so let's dive in.
The Data
Consider the following data set, a listing of company funding records as reported by TechCrunch. You can download the data set here.
The first ten rows are shown below:
permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
mycityfaces,MyCityFaces,7,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a
infusionsoft,Infusionsoft,105,software,Gilbert,AZ,1-Oct-07,9000000,USD,a
gauto,gAuto,4,web,Scottsdale,AZ,1-Jan-08,250000,USD,seed
chosenlist-com,ChosenList.com,5,web,Scottsdale,AZ,1-Oct-06,140000,USD,seed
chosenlist-com,ChosenList.com,5,web,Scottsdale,AZ,25-Jan-08,233750,USD,angel
digg,Digg,60,web,San Francisco,CA,1-Dec-06,8500000,USD,b
The CSV file has a header row, so we have the field names, but we do have a couple of data type conversions to make. In particular, the `fundedDate` needs to be transformed to a Python date object and the `raisedAmt` needs to be converted to an integer. This isn't particularly onerous, but consider that this is just a simple example; more complex conversions can easily be imagined.
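For reference, here is a minimal sketch of those two conversions, assuming the row comes in as a dictionary of strings (the full parsing approach is developed later in the post):

from datetime import datetime

def convert_row(row):
    # row is a dict of strings as produced by csv.DictReader
    row['raisedAmt'] = int(row['raisedAmt'])
    row['fundedDate'] = datetime.strptime(row['fundedDate'], '%d-%b-%y').date()
    return row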
Note: This file was obtained from the following source SpatialKey Sample Data on October 21, 2014.
The Basics of CSV Processing
Python has a built-in `csv` module that handles all the ins and outs of processing CSV files, from dealing with dialects (Excel, anyone?) to quoting fields that may contain the delimiter to handling a variety of delimiters. Very simply, this is how you would read all the data from the funding CSV file:
import csv

FUNDING = 'data/funding.csv'

def read_funding_data(path):
    with open(path, 'rU') as data:
        reader = csv.DictReader(data)
        for row in reader:
            yield row

if __name__ == "__main__":
    for idx, row in enumerate(read_funding_data(FUNDING)):
        if idx > 10: break
        print "%(company)s (%(numEmps)s employees) raised %(raisedAmt)s on %(fundedDate)s" % row
A couple of key points with the code above:
- Always wrap the CSV reader in a function that returns a generator (via the `yield` statement).
- Open the file in universal newline mode with `'rU'` for backwards compatibility.
- Use context managers (`with [callable] as [name]`) to ensure that the handle to the file is closed automatically.
- Use the `csv.DictReader` class only when headers are present; otherwise just use `csv.reader`. (You can pass a list of fieldnames, but you'll see it's better just to use a `namedtuple`, as we discuss below.)
This code allows you to treat the data source as just another iterator or list in your code. In fact if you do the following:
data = read_funding_data(FUNDING)
print repr(data)
You'll see that the function "returns" a generator, thanks to the `yield` statement in the function definition. This means, among other things, that the data is evaluated lazily. The file is not opened, read, or parsed until you need it. No more than one row of the file is in memory at any given time. The context manager in the function also ensures that the file handle is closed when you've finished reading the data, even in the face of an exception elsewhere in code. Note that this pattern requires the file to be open while you're reading from it. If you attempt to read from a data file that's closed, you will get an error.
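If you only need a handful of rows, or you need them available after the file handle has gone away, materialize just what you want while the generator is live. A quick, illustrative sketch using `itertools.islice`:

from itertools import islice

# Pull only the first ten rows into memory; the file is opened lazily and
# closed once the generator is exhausted or garbage collected.
first_ten = list(islice(read_funding_data(FUNDING), 10))
print first_ten[0]['company']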
This pattern is powerful because it means that even for much larger data sets you will have efficient, portable code. Moreover, as we start looking at wrangling or munging the data from the CSV file, you'll have well encapsulated code that can handle a variety of situations. In fact, in code that has to read and parse files from a variety of sources, it is common to wrap the `csv` module in a class so that you can persist statistics about the data and provide multiple reads of the CSV file. In the above example you have to call the function every time you want to read the data or perform a computation, whereas a class can save some state of the data between reads. Here is an example of persisted data for our funding class:
from collections import Counter

class FundingReader(object):

    def __init__(self, path):
        self.path = path
        self._length = None
        self._counter = None

    def __iter__(self):
        self._length = 0
        self._counter = Counter()
        with open(self.path, 'rU') as data:
            reader = csv.DictReader(data)
            for row in reader:
                # Save the statistics
                self._length += 1
                self._counter[row['company']] += 1
                yield row

    def __len__(self):
        if self._length is None:
            for row in self: continue  # Read the data for length and counter
        return self._length

    @property
    def counter(self):
        if self._counter is None:
            for row in self: continue  # Read the data for length and counter
        return self._counter

    @property
    def companies(self):
        return self.counter.keys()

    def reset(self):
        """
        In case of partial reads (e.g. breaking in the middle of iteration),
        clear the cached statistics so they are recomputed on the next pass.
        """
        self._length = None
        self._counter = None

if __name__ == "__main__":
    reader = FundingReader(FUNDING)
    print "Funding reader with %i rows and %i companies" % (len(reader), len(reader.companies))
This is a pretty simple class that keeps track of the number of rows overall as well as the number of rows for each company. All that's stored in memory is the company name and the number of funding records, but you can see how this might be extremely useful, especially if you're going to filter the data by company. No matter what, you have to make one complete read to get this data, but after that first full pass the statistics are stored and cached. So when we ask for the length of the data at the end, the file is completely read, and when we ask for the number of companies, we don't return to disk.
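As an example of how you might build on this, here is a small, hypothetical filter (not part of the class above) that re-reads the file and yields only the rows for a single company:

def rounds_for(reader, company):
    # Re-reads the file on each call, yielding only the requested company's rows.
    for row in reader:
        if row['company'] == company:
            yield row

for row in rounds_for(FundingReader(FUNDING), 'LifeLock'):
    print "%(company)s raised %(raisedAmt)s in round %(round)s" % row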
Extra Dependencies
The `csv` module is excellent for reading and writing CSV files, is memory efficient (providing an iterator that reads only one line of the file into memory at a time), and comes included with Python. Why would you need anything else?
The thing is, I regularly see this in the code of my colleagues (especially those that do research):
import pandas as pd
data = pd.read_csv('data/funding.csv')
print data
Although it's rarer, I also see things like `numpy.genfromtxt` used as a production CSV reader.
The utilities for analysis that Pandas gives you, especially DataFrames, are extremely useful, and there is a natural 1:1 relationship between DataFrames and CSV files. I routinely use Pandas for data analyses, quick insights, and even data wrangling of smaller files. The problem is that Pandas is not meant for production-level ingestion or wrangling systems. It is meant for data analysis that can happen completely in memory. As such, when you run this line of code, the entirety of the CSV file is loaded into memory. Likewise, Numpy arrays are fixed-size structures that are loaded completely into memory. You've just lost your memory efficiency, especially for larger data sets.
And while this may be fine on your Macbook Pro, keep in mind that if you're writing data pipeline code, it will probably be run more routinely on a virtual cloud server such as Rackspace, AWS, or Google App Engine. Since you have a budget, it will also probably be running on small or micro servers that might have 1 GB of memory, if you're lucky. You don't want to blow it away!
Then there is the issue of dependency management. The Pandas library depends on Numpy, which itself takes a lot of compilation to put together. On my Macbook Pro, pip installing Pandas into a `virtualenv` takes about 5 minutes. This is a lot of overhead when you're not planning on using DataFrames to their full potential!
So don't do it. Use the native `csv` module.
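The same module also covers the writing side of the round trip. As a brief, illustrative sketch (the output path and category filter here are just examples), `csv.DictWriter` can write a filtered copy of the funding data back out:

def write_filtered(inpath=FUNDING, outpath='data/funding_web.csv', category='web'):
    # Copy only the rows for the given category into a new CSV file.
    with open(inpath, 'rU') as infile, open(outpath, 'wb') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row['category'] == category:
                writer.writerow(row)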
Avoiding Unicode Issues
Text encoding is the bane of my existence in so many ways. A mismatched encoding can crash a program that has been running continually for a long time and has passed its unit tests. Even if you handle encoding, if you choose the wrong encoding scheme (usually `UTF-8`) and something encoded as `Latin-1` comes in, you can get invisible errors that don't raise any exceptions. Encoding is always something you should consider when opening a file from anywhere.
If you're lucky enough to be able to use Python 3, you can skip this section. But as mentioned previously, production ingestion and wrangling systems usually run on small or micro servers in the cloud, typically on the system Python. So if you're like me and many other data scientists, you're using Python 2.6 or 2.7, which is what currently ships with Linux servers.
If you look at the `csv` module documentation, it suggests adding three classes to encode, read, and write unicode data from a file passed to the csv module as a wrapped function. These encoders translate the native file encoding to `UTF-8`, which the `csv` module can read because it's 8-bit safe. I won't include those classes here. Instead, you should always do the following when you import the csv module:
try:
    import unicodecsv as csv
except ImportError:
    import warnings
    warnings.warn("can't import `unicodecsv`; encoding errors may occur")
    import csv
This is a very small dependency that can be easily fetched via `pip install unicodecsv`, and you can find this module in pretty much every one of my data product repositories' `requirements.txt` files. To learn more, visit the PyPI page for the latest version: unicodecsv 0.9.4.
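As a quick illustration, the `unicodecsv` readers and writers accept an `encoding` argument (the default is UTF-8), so a file in another encoding can be handled explicitly. The file name below is just a hypothetical example:

import unicodecsv

with open('data/latin1_dump.csv', 'rb') as f:    # hypothetical Latin-1 encoded dump
    reader = unicodecsv.DictReader(f, encoding='latin-1')
    for row in reader:
        print row['company']                     # values come back as unicode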
Registering Schemas via NamedTuples
Now that we can easily extract data from a CSV file with a memory efficient and encoding-tolerant method, we can begin to look at actually wrangling our data. CSVs don't provide a lot in terms of types or indications about how to deal with data that is coming in as strings. As noted previously, if there is a header line, we may get information about the various field names, but that's about it. We have already parsed our rows into Python dictionaries using the built-in `csv.DictReader`, but this definitely depends on the header row being there. If the header row is not there, you can pass a list of field names to `csv.DictReader` and still get dictionaries back.
Dictionaries are great because you can access your data by name rather than by position, and these names may help you parse the data that's coming in. However, dictionaries aren't schemas. They are key-value pairs that are completely mutable. When using a single CSV file, you may expect to have dictionaries with all the same keys, but these keys aren't recorded anywhere in your code, making your code dependent on the structure of the CSV file in an implicit way.
Moreover, dictionaries have a bit more overhead associated with them in order to make them mutable. They use more memory when instantiated and in order to provide fast lookups, they are hashed (unordered). They are great for when you don't know what data is coming in, but when you do, you can use something much better: namedtuples.
from collections import namedtuple

fields = ("permalink", "company", "numEmps", "category", "city", "state",
          "fundedDate", "raisedAmt", "raisedCurrency", "round")

FundingRecord = namedtuple('FundingRecord', fields)

def read_funding_data(path):
    with open(path, 'rU') as data:
        data.readline()            # Skip the header
        reader = csv.reader(data)  # Create a regular tuple reader
        for row in map(FundingRecord._make, reader):
            yield row

if __name__ == "__main__":
    for row in read_funding_data(FUNDING):
        print row
        break
Running this code produces the result shown below. As you can see, the result is a tuple (an ordered set of values) whose values have names associated with them, much like the keys of a dictionary. Not only that, but the tuple is typed: it is not just a variable named "FundingRecord"; each row actually is a `FundingRecord`.
FundingRecord(permalink=u'lifelock', company=u'LifeLock', numEmps=u'', category=u'web', city=u'Tempe', state=u'AZ', fundedDate=u'1-May-07', raisedAmt=u'6850000', raisedCurrency=u'USD', round=u'b')
The `namedtuple` function actually returns a subclass of tuple with the type specified by the typename, the first argument to the function. The type that is returned is immutable once constructed and is extremely lightweight because it contains no internal `__dict__`. It is very similar to using `__slots__` for memory performance and efficiency. As you can see from the above code, you would use the `FundingRecord` just like any other object, so long as you didn't modify it in place.
This leads to an obvious question: how can you parse the data into various types if you can't assign it back to the object? Well, the nice thing is that the return from `namedtuple` is an actual type, one that you can subclass to add methods to!
from datetime import datetime

fields = ("permalink", "company", "numEmps", "category", "city", "state",
          "fundedDate", "raisedAmt", "raisedCurrency", "round")

class FundingRecord(namedtuple('FundingRecord_', fields)):

    @classmethod
    def parse(klass, row):
        row = list(row)                                 # Make row mutable
        row[2] = int(row[2]) if row[2] else None        # Convert "numEmps" to an integer
        row[6] = datetime.strptime(row[6], "%d-%b-%y")  # Parse the "fundedDate"
        row[7] = int(row[7])                            # Convert "raisedAmt" to an integer
        return klass(*row)

    def __str__(self):
        date = self.fundedDate.strftime("%d %b, %Y")
        return "%s raised %i in round %s on %s" % (self.company, self.raisedAmt, self.round, date)

def read_funding_data(path):
    with open(path, 'rU') as data:
        data.readline()            # Skip the header
        reader = csv.reader(data)  # Create a regular tuple reader
        for row in map(FundingRecord.parse, reader):
            yield row

if __name__ == "__main__":
    for row in read_funding_data(FUNDING):
        print row
        break
If you run this code, you'll see a result as follows:
LifeLock raised 6850000 in round b on 01 May, 2007
We now have a simple, self-documenting methodology for constructing schemas and parsers for CSVs in code that is not only memory efficient and fast but also encapsulates all the work we might want to do on a single record in the CSV file.
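To illustrate what the typed records buy us, here is a short, illustrative aggregation that sums the amount raised per company, leaning on the fact that `raisedAmt` is already an integer after parsing:

from collections import Counter

def total_raised(path=FUNDING):
    totals = Counter()
    for record in read_funding_data(path):
        # raisedAmt is already an int thanks to FundingRecord.parse
        totals[record.company] += record.raisedAmt
    return totals

for company, amount in total_raised().most_common(5):
    print "%s: %i" % (company, amount)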
A couple of caveats to note:

- Subclasses of tuple don't have an `__init__` since they are immutable, only `__new__`.
- When passing the type name to the `namedtuple` function, use a trailing `_` to prevent type name conflicts.
- You won't be able to access a `keys` method, since this isn't a dictionary; however, there is an internal `_fields` attribute that returns a similar result.

Keep in mind that the class is immutable, and therefore read only!
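If you do need a modified record (say, normalizing the round name), namedtuples provide a `_replace` method that returns a new instance rather than mutating in place. A quick sketch using the values from the first row of our data:

record = FundingRecord.parse([
    "lifelock", "LifeLock", "", "web", "Tempe", "AZ",
    "1-May-07", "6850000", "USD", "b"
])

updated = record._replace(round=record.round.upper())  # returns a new FundingRecord
print updated.round          # B
print FundingRecord._fields  # the schema as a tuple of field names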
Performance of namedtuples vs dicts
As a side example, I tested the performance in both time and space of using a DictReader versus a namedtuple. The tests operated on a 548 MB CSV file that contained 5,000,000 rows, each a tuple matching the schema (uuid, name, date, page, latitude, longitude). I loaded the entire data set into memory by appending each record (either dict or namedtuple) to a list of records and then monitored the memory usage of the Python process. The code I used is below.
import time
import unicodecsv as csv

def benchmark_dicts(path='code/testdata.csv'):
    fields = ('uuid', 'name', 'date', 'page', 'lat', 'lng')
    records = []

    start = time.time()
    with open(path, 'rU') as data:
        reader = csv.DictReader(data, fields)
        for row in reader:
            records.append(row)
    finit = time.time()

    delta = finit - start
    print "Dict Benchmark took %0.3f seconds" % delta

if __name__ == "__main__":
    benchmark_dicts()
The result I got on my Macbook Pro, with a 2.8 GHz Intel Core i7 processor and 16 GB of memory, was as follows:
Dict Benchmark took 102.723 seconds
The namedtuple benchmark code is similar and is as follows:
import time
import unicodecsv as csv

from collections import namedtuple

def benchmark_namedtuples(path='code/testdata.csv'):
    Record = namedtuple('Record', ('uuid', 'name', 'date', 'page', 'lat', 'lng'))
    records = []

    start = time.time()
    with open(path, 'rU') as data:
        reader = csv.reader(data)
        for row in map(Record._make, reader):
            records.append(row)
    finit = time.time()

    delta = finit - start
    print "NamedTuple Benchmark took %0.3f seconds" % delta

if __name__ == "__main__":
    benchmark_namedtuples()
Running this code on the same data, on the same machine gave me the following result:
NamedTuple Benchmark took 49.018 seconds
The namedtuples also used significantly less memory. Here is a summary of the overall performance:
Test | Time (seconds) | Memory
---|---|---
dict | 44.358 | 6.66 GB
namedtuple | 20.682 | 2.23 GB
Using dictionaries is almost twice as slow, if a little more generic. Worse, using dictionaries takes almost three times the memory because of the aforementioned overhead of a mutable vs. immutable type. These small adjustments can mean the difference between performant applications and searching for new solutions.
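The memory numbers above were taken by monitoring the Python process. If you want a rough in-process measurement instead, one option on a Unix-like system is `resource.getrusage`; the sketch below assumes OS X, where `ru_maxrss` is reported in bytes (on Linux it is in kilobytes):

import resource

def peak_memory_mb():
    # ru_maxrss is in bytes on OS X and kilobytes on Linux; adjust accordingly.
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage / (1024.0 * 1024.0)

if __name__ == "__main__":
    benchmark_namedtuples()
    print "Peak memory: %0.2f MB" % peak_memory_mb()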
Serializing Data with Avro
These challenges with CSV as a default serialization format may prompt us to find a different solution, especially as we propagate data throughout our data pipelines. My suggestion is to use a binary serialization format like Avro, which stores the schema alongside compressed data. Because the file is stored with a binary serialization, it is as compact as CSV and can be read in a memory efficient manner. Because the data is stored with its schema, it can be read from any code and immediately extracted and automatically parsed.
So let's return to our original example with the `funding.csv` data set. In order to convert our parsed `FundingRecord` rows, we first need to define a schema. Schemas in Avro are simply JSON documents that define the field names along with their types. The schema for a FundingRecord is as follows:
{"namespace": "funding.districtdatalabs",
"type": "record",
"name": "FundingRecord",
"fields": [
{"name": "permalink", "type": "string"},
{"name": "company", "type": "string"},
{"name": "numEmps", "type": ["int", "null"]},
{"name": "category", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "fundedDate", "type": "string"},
{"name": "raisedAmt", "type": "int"},
{"name": "raisedCurrency", "type": "string"},
{"name": "round", "type": "string"}
]
}
Save this schema in a file called `funding.avsc`. To save our parsed CSV out to an Avro-serialized format, use the following code:
import avro.schema

from avro.datafile import DataFileWriter
from avro.io import DatumWriter

def parse_schema(path="funding.avsc"):
    with open(path, 'r') as data:
        return avro.schema.parse(data.read())

def serialize_records(records, outpath="funding.avro"):
    schema = parse_schema()
    with open(outpath, 'wb') as out:
        writer = DataFileWriter(out, DatumWriter(), schema)
        for record in records:
            # Convert the namedtuple into a dict keyed by field name
            record = dict((f, getattr(record, f)) for f in record._fields)
            # Avro has no native datetime type, so store an ISO 8601 string
            record['fundedDate'] = record['fundedDate'].strftime('%Y-%m-%dT%H:%M:%S')
            writer.append(record)
        writer.close()

if __name__ == "__main__":
    serialize_records(read_funding_data(FUNDING))
Unfortunately, Avro doesn't currently support native `datetime` objects in Python, but hopefully it will soon. In the meantime we simply convert our `namedtuple` back into a dictionary (or just use the `DictReader` directly), format the date as a string, and pass the record to the Avro serializer.
Once saved to disk, the savings become apparent. The original file was approximately 92K. After serializing with Avro, we have compressed our data to 64K! Not only do we have a compressed file format, but we also have the schema associated with the data. Reading the data is pretty easy now and requires none of the parsing we did before.
from avro.datafile import DataFileReader
from avro.io import DatumReader

def read_avro_funding_data(path="funding.avro"):
    with open(path, 'rb') as data:
        reader = DataFileReader(data, DatumReader())
        for record in reader:
            yield record

if __name__ == "__main__":
    for record in read_avro_funding_data():
        print record
        break
{u'category': u'web', u'city': u'Tempe', u'permalink': u'lifelock', u'raisedAmt': 6850000, u'fundedDate': u'2007-05-01T00:00:00', u'numEmps': None, u'company': u'LifeLock', u'raisedCurrency': u'USD', u'state': u'AZ', u'round': u'b'}
As you can see, the data in the dictionary has already been converted (at least the two integers) and no extra parsing is required. Avro will also enforce schema constraints and invariants to ensure that your wrangling is successful without things like databases for normalization.
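Since `fundedDate` comes back as a formatted string, one more conversion restores the `datetime` if you need it. A small sketch, matching the format string used when writing:

from datetime import datetime

def deserialize_record(record):
    # Reverse the string conversion applied during serialization
    record['fundedDate'] = datetime.strptime(record['fundedDate'], '%Y-%m-%dT%H:%M:%S')
    return record

if __name__ == "__main__":
    for record in read_avro_funding_data():
        print deserialize_record(record)['fundedDate']
        break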
Avro is also a standard for Apache Hadoop. If you're going to be loading this data into HDFS, it's a good idea to consider loading it as Avro data files rather than as CSV files. It can make your life a lot easier, especially if you're using Hadoop streaming.
Conclusion
In this post, we looked at several issues that arise when wrangling CSV data in Python. First, we reviewed the basics of CSV processing in Python, taking a look at the `csv` module and how it compares to Pandas and Numpy for importing and wrangling data stored in CSV files. Next, we highlighted the importance of encoding and how to avoid unicode issues. Then we took a look at the differences between dictionaries and namedtuples, and we saw how namedtuples are better when you know what data you're ingesting. And finally, we covered how to serialize CSV data with Avro so that the data is stored not only in a compact format but with its schema as well.
Hopefully, what you'll take away from this post is a basic data wrangling process that transforms serialized input data into a form that your application or data model can use. The process relies on a lot of transformations to and from flat files on disk, but done in batch this is far faster than accessing databases directly. If you have CSVs in multiple formats, this strategy can come in very handy for normalizing data into a backwards-compatible, safe methodology for storing and accessing data in WORM (write once, read many) storage.
The code files to accompany this post can be found on Github.
Finally, below are some additional resources where you can continue exploring some of the stuff we've covered.
Additional Reading
- Python for Data Analysis
- Apache Avro
- Python CSV Module Documentation
- Saving 9GB of RAM with Python Slots
- Serializing Python Objects
Acknowledgements
I'd like to thank Tony Ojeda for his help with preparing and producing this post. This started as an iPython notebook full of tips for some colleagues and classmates, and without his help it would not be the polished post that you see in front of you now!