District Data Labs

Parameter Tuning with Hyperopt

This post will cover a few things needed to quickly implement a fast, principled method for machine learning model parameter tuning. There are two common methods of parameter tuning: grid search and random search. Each has its pros and cons. Grid search is slow but effective at searching the whole search space, while random search is fast, . . .

September 21, 2015

Time Maps: Visualizing Discrete Events Across Many Timescales

Discrete events pervade our daily lives. These include phone calls, online transactions, and heartbeats. Despite the simplicity of discrete event data, it’s hard to visualize many events over a long time period without hiding details about shorter timescales.

The plot below illustrates this problem. It shows the number of website visits made . . .

September 03, 2015

The Age of the Data Product

We are living through an information revolution. Like any economic revolution, it has had a transformative effect on society, academia, and business. The present revolution, driven as it is by networked communication systems and the Internet, is unique in that it has created a surplus of a valuable new material - data - and transformed us all . . .

Posted in: data products

May 20, 2015

Markup for Fast Data Science Publication

A central lesson of science is that to understand complex issues (or even simple ones), we must try to free our minds of dogma and to guarantee the freedom to publish, to contradict, and to experiment.

— Carl Sagan in Billions & Billions: Thoughts on Life and Death at the Brink of the Millennium

As data scientists, it's easy . . .

April 20, 2015

Modern Methods for Sentiment Analysis

Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment . . .

Posted in: python, nlp

March 31, 2015

Getting Started with Spark (in Python)

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap . . .

Posted in: spark, python

February 02, 2015

Creating a Hadoop Pseudo-Distributed Environment

Hadoop developers usually test their scripts and code on a pseudo-distributed environment (also known as a single node setup), which is a virtual machine that runs all of the Hadoop daemons simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote . . .

Posted in: hadoop

January 08, 2015
