District Data Labs

A Practical Guide to Anonymizing Datasets with Python & Faker

How Not to Lose Friends and Alienate People

If you want to keep a secret, you must also hide it from yourself.

— George Orwell 1984

In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to illustrate examples of how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or . . .

Posted in: wrangling, python

March 02, 2016

An Introduction to Machine Learning with Python

For the mind does not require filling like a bottle, but rather, like wood, it only requires kindling to create in it an impulse to think independently and an ardent desire for the truth.

— Plutarch On Listening to Lectures

The impulse to ingest more data is our first and most powerful instinct. Born with billions of neurons, as . . .

January 21, 2016

Time Maps: Visualizing Discrete Events Across Many Timescales

Discrete events pervade our daily lives. These include phone calls, online transactions, and heartbeats. Despite the simplicity of discrete event data, it’s hard to visualize many events over a long time period without hiding details about shorter timescales.

The plot below illustrates this problem. It shows the number of website visits made . . .

September 03, 2015

Modern Methods for Sentiment Analysis

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment . . .

Posted in: python, nlp

March 31, 2015

Getting Started with Spark (in Python)

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap . . .

Posted in: spark, python

February 02, 2015

Simple CSV Data Wrangling with Python

Efficient Processing, Schemas, and Serialization

I wanted to write a quick post today about a task that most of us do routinely but often think very little about: loading CSV (comma-separated values) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I . . .

Posted in: python, wrangling

November 08, 2014

Computing a Bayesian Estimate of Star Rating Means

Consumers rely on the collective intelligence of other consumers to protect themselves from coffee pots that break at the first sign of water, bad food at the wrong restaurant, and stunning flops at the theater. Although occasionally there are metrics like Rotten Tomatoes, we primarily prejudge products we would like to consume through . . .

Posted in: python

September 11, 2014