District Data Labs

Basics of Entity Resolution

with Python and Dedupe

Entity resolution (ER) is the task of disambiguating records that correspond to real-world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems . . .

March 11, 2017

Data Exploration with Python, Part 2

Preparing Your Data to be Explored

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), . . .

February 07, 2017

Forward Propagation: Building a Skip-Gram Net From the Ground Up

Part 1: Skip-gram Feedforward

Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab. Make sure to check out the other posts in the series so far:

Let's continue our treatment of the . . .

January 12, 2017

Ten Things to Try in 2017

New Year's Resolutions for the Intermediate Data Scientist

2016 marked a zenith in the data science renaissance. In the wake of a series of articles and editorials declaiming the shortage of data analysts, the internet responded in force, exploding with blog posts, tutorials, and listicles aimed at launching the beginner into the world of data science. And yet, in spite of all the claims that this . . .

December 31, 2016

Data Exploration with Python, Part 1

Preparing Yourself to Become a Great Explorer

Exploratory data analysis (EDA) is an important pillar of data science, a critical step required to complete every project regardless of the domain or the type of data you are working with. It is exploratory analysis that gives us a sense of what additional work should be performed to quantify and extract insights from our data. It also . . .

December 29, 2016

Exploring Bureau of Labor Statistics Time Series

Machine learning models benefit from an increased number of features: “more data beats better algorithms”. In the financial and social domains, macroeconomic indicators are routinely added to models, particularly those that contain a discrete time or date. For example, loan or credit analyses that predict the likelihood of . . .

December 09, 2016

The Trends Behind What's Trending

Anticipating the Extent of Article Virality

Editor's Note: This article highlights one of the capstone projects from the Georgetown Data Science Certificate program, where several of the DDL faculty teach. We've invited groups with interesting projects to share an overview of their work here on the DDL blog. We hope you find their projects interesting and are able to learn from . . .

December 05, 2016