Data Exploration with Python, Part 1

Preparing Yourself to Become a Great Explorer

Exploratory data analysis (EDA) is an important pillar of data science, a critical step required to complete every project regardless of the domain or the type of data you are working with. It is exploratory analysis that gives us a sense of what additional work should be performed to quantify and extract insights from our data. It also . . .

December 29, 2016

Building a Classifier from Census Data

An end-to-end machine learning example using Pandas and Scikit-Learn

One of the machine learning workshops given to students in the Georgetown Data Science Certificate asks them to build a classification, regression, or clustering model using one of the UCI Machine Learning Repository datasets. The idea behind the workshop is to ingest data from a website, perform some initial analyses to get a sense for what's . . .

May 02, 2016

A Practical Guide to Anonymizing Datasets with Python & Faker

How Not to Lose Friends and Alienate People

If you want to keep a secret, you must also hide it from yourself.

- George Orwell, 1984

In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to illustrate examples of how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or . . .

March 02, 2016

Simple CSV Data Wrangling with Python

Efficient Processing, Schemas, and Serialization

I wanted to write a quick post today about a task that most of us do routinely but often think very little about: loading CSV (comma-separated values) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I . . .

November 08, 2014