District Data Labs
District Data Labs PyCon Recap
Overview of our Talk, Tutorial, Posters, and Sprints
Last week, a group of us from District Data Labs flew to Portland, Oregon, to attend PyCon, the largest annual gathering for the Python community. We had a talk, a tutorial, and two posters accepted to the conference, and we also hosted development sprints for several open source projects. With this blog post, we are putting everything . . .
Visual Diagnostics for More Informed Machine Learning: Part 3
Visual Evaluation and Parameter Tuning
Note: Before starting Part 3, be sure to read Part 1 and Part 2!
Welcome back! In this final installment of Visual Diagnostics for More Informed Machine Learning, we'll close the loop on visualization tools for navigating the different phases of the machine learning workflow. Recall that we are framing the workflow in terms of . . .
Posted in: machine learning, python, visualization
Preparing for NLP with NLTK and Gensim
PyCon 2016 Tutorial on Sunday May 29, 2016 at 9am
This post is designed to point you to the resources that you need in order to prepare for the NLP tutorial at PyCon this coming weekend! If you have any questions, please contact us according to the directions at the end of the post.
In this tutorial, we will explore the features of the NLTK library for text processing in . . .
Visual Diagnostics for More Informed Machine Learning: Part 2
Demystifying Model Selection
Note: Before starting Part 2, be sure to read Part 1!
When it comes to machine learning, ultimately the most important picture to have is the big picture. Discussions of (i.e. arguments about) machine learning are usually about which model is the best. Whether it's logistic regression, random forests, Bayesian methods, support . . .
Posted in: machine learning, python, visualization
Visual Diagnostics for More Informed Machine Learning: Part 1
Feature Analysis
How could they see anything but the shadows if they were never allowed to move their heads?
Plato, The Allegory of the Cave
Python and high level libraries like Scikit-learn, TensorFlow, NLTK, PyBrain, Theano, and MLPY have made machine learning accessible to a broad programming community that might never . . .
Posted in: machine learning, python, visualization
Named Entity Recognition and Classification for Entity Extraction
Combining NERCs to Improve Entity Extraction
The overwhelming amount of unstructured text data available today from traditional media sources as well as newer ones, like social media, provides a rich source of information if the data can be structured. Named Entity Extraction forms a core subtask to build knowledge from semi-structured and unstructured text sources. Some of the first . . .
Building a Classifier from Census Data
An end-to-end machine learning example using Pandas and Scikit-Learn
One of the machine learning workshops given to students in the Georgetown Data Science Certificate asks them to build a classification, regression, or clustering model using one of the UCI Machine Learning Repository datasets. The idea behind the workshop is to ingest data from a website, perform some initial analyses to get a sense for what's . . .
Posted in: machine learning, python, wrangling