Preparing for NLP with NLTK and Gensim
PyCon 2016 Tutorial on Sunday May 29, 2016 at 9am
This post is designed to point you to the resources that you need in order to prepare for the NLP tutorial at PyCon this coming weekend! If you have any questions, please contact us according to the directions at the end of the post.
In this tutorial, we will explore the features of the NLTK library for text processing in order to build language-aware data products with machine learning. In particular, we will use a corpus of RSS feeds that have been collected since March to create supervised document classifiers as well as unsupervised topic models and document clusters. To do this, we will use language analysis to preprocess and vectorize our documents into a format suitable for machine learning. We will use NLTK's built-in classifiers, Naive Bayes and Maximum Entropy, for document classification, and K-means clustering and LDA in Gensim for unsupervised topic modeling.
Abstract
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world - unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, Gensim, and the Natural Language Toolkit (NLTK).
NLTK is an excellent library for machine-learning based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
In this tutorial we will begin by exploring the tools NLTK has to preprocess text into meaningful feature representations. Because most NLP practitioners work with their own domain-specific corpora, we will focus on building a language-aware data product from a specific corpus - a collection of RSS feeds that have been collected by the Baleen tool since March. We will use this corpus to do supervised machine learning, building a text classification system, as well as unsupervised topic modeling using Gensim.
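To give you a flavor of the two ends of that pipeline before Sunday, here is a minimal sketch of an NLTK Naive Bayes classifier and a tiny Gensim LDA model. The toy documents and feature names below are made up for illustration and are not drawn from the Baleen corpus or the tutorial materials.

from nltk.classify import NaiveBayesClassifier
from gensim import corpora, models

# Supervised: train NLTK's Naive Bayes classifier on hand-built feature dicts.
train = [
    ({"contains(python)": True, "contains(code)": True}, "tech"),
    ({"contains(goal)": True, "contains(score)": True}, "sports"),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({"contains(python)": True}))  # -> 'tech'

# Unsupervised: fit a small LDA topic model with Gensim on tokenized texts.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response"],
    ["graph", "trees", "minors"],
]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())

In the tutorial we'll build these features and bag-of-words representations from real documents rather than toy lists, but the shape of the code stays the same.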
Getting Ready
There are a number of tools that you’ll need to install as well as provided corpora to download. Note that there will be very little time before the tutorial (it starts at 9am!) and limited bandwidth at the conference center. We highly recommend that you install and download these things before you come!
Note that Python 2.7 or 3.5 is required for this tutorial. The instructor will be using Python 3.5 inside of Jupyter notebooks for demonstration purposes. Fair warning: there may be some 2to3 compatibility issues (we haven't made the switch yet) but it shouldn't be anything too dramatic.
Note that Anaconda (or a similarly packaged Python distribution) is perfectly acceptable to use if you're on Windows or having difficulty installing binaries. However, the instructors will be focusing on pure Python rather than any other analytical tools.
Installation
The primary dependencies will be NLTK and Gensim, which you can install with pip as follows:
$ pip install nltk gensim
$ python -m nltk.downloader all
This will install both nltk and gensim as well as all the downloadable corpora from the NLTK library. If you'd like to follow along in the notebooks instead of plain Python scripts, please install Jupyter:
$ pip install jupyter
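To verify that the installation and data download worked, you can run a quick sanity check from the Python interpreter. This snippet is just a check, not part of the tutorial materials; the tokenizer relies on the punkt models and the movie reviews corpus that ship with the full NLTK download.

import nltk
import gensim

# Tokenization uses the punkt models installed by the nltk.downloader step.
print(nltk.word_tokenize("Natural language processing with NLTK and Gensim."))

# Confirm a downloaded corpus is readable, e.g. the movie reviews corpus.
from nltk.corpus import movie_reviews
print(len(movie_reviews.fileids()))

print(nltk.__version__, gensim.__version__)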
Secondary dependencies include python-readability, BeautifulSoup, lxml, and Requests. To install these, also use pip:
$ pip install requests beautifulsoup4 lxml readability-lxml
This should cover 95% of the dependencies in advance. However, please make sure that you check the requirements.txt file of any of our repositories that you'll be using and pip install the updated dependencies or specific versions.
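The secondary dependencies are what turn raw HTML into clean text. As a rough illustration only (the URL below is just an example, and this is not the tutorial's actual ingestion code), readability-lxml can extract the main article content and BeautifulSoup can strip the remaining markup:

import requests
from readability import Document
from bs4 import BeautifulSoup

# Fetch a page; any article URL will do (this one is just an example).
response = requests.get("https://www.python.org/")
response.raise_for_status()

# readability-lxml extracts the title and the main content as HTML.
doc = Document(response.text)
print(doc.title())

# BeautifulSoup (with the lxml parser) strips the remaining tags to plain text.
soup = BeautifulSoup(doc.summary(), "lxml")
print(soup.get_text().strip()[:500])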
Resources
Ok, now that you've installed the required code, you'll need to get the data and other resources in order to follow along. Most of the content can be found in our PyCon 2016 GitHub Repository and on our website:
This repository contains the notebooks and the source code for the slides. The slides are composed using Reveal.js and can be viewed locally by opening the index.html file in your favorite browser. Note that the slides and materials for @rebeccabilbro's talk, “Visual Diagnostics for More Informed Machine Learning”, are also in the repository, and we hope to see you on Monday for her talk as well!
The next step is data. There is a lot of it: 332,321 HTML files ingested from several RSS feeds since March 3, 2016. The compressed dataset is about 10GB, and a sequential read with a single Python process takes roughly 30 hours, depending on what processing you're doing. Obviously that's unsustainable for a tutorial, so we've created a simple random sample of 33,232 files (10% of the data source). This is what we'll be working through during the tutorial.
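For the curious, a simple random sample like that can be drawn in a few lines of Python. This is only a sketch under the assumption of a directory tree of .html files; the paths and seed here are hypothetical and this is not how the official sample was produced.

import os
import random
import shutil

SOURCE = "corpus/full"     # hypothetical location of the full corpus
TARGET = "corpus/sample"   # hypothetical location for the sample
RATE = 0.10

# Collect every HTML file path under the source directory.
paths = [
    os.path.join(root, name)
    for root, _, names in os.walk(SOURCE)
    for name in names
    if name.endswith(".html")
]

# Draw a simple random sample of roughly 10% of the documents and copy them.
random.seed(42)
sample = random.sample(paths, int(len(paths) * RATE))

os.makedirs(TARGET, exist_ok=True)
for path in sample:
    shutil.copy(path, TARGET)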
You can find the datasets here:
- Baleen Corpus Sample: download this! (515.9 MB MD5 9e25885ec2cff6b82248cfeb81647262)
- Full Baleen Corpus: for adventures post-tutorial (11.2 GB MD5 959ed49718b3fe35e1ac5eaaea990270)
A note on the copyright of the corpus: it is intended to be used for academic purposes only, and we expect you to use it as though you had downloaded the documents from the RSS feeds yourself. What does that mean? It means that the copyright of each individual document belongs to its owner, who grants you the ability to download a copy for your own reading and analysis. We expect that you'll respect that copyright and not republish this corpus or use it for anything other than the tutorial analysis.
Optional Libraries and Code
We won't be explicitly demonstrating Scikit-Learn in the tutorial, but we think it's a really useful tool for doing machine learning on text in coordination with NLTK. You may also want to use NetworkX to perform graph analysis of text, for example to extract significant keywords. To install Scikit-Learn and NetworkX:
$ pip install scikit-learn networkx
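As a taste of the NetworkX idea (purely illustrative, and not part of the tutorial materials), you could build a word co-occurrence graph and rank candidate keywords by degree centrality:

import itertools
import networkx as nx
from nltk import word_tokenize

# Toy sentences; in practice these would come from your corpus.
sentences = [
    "natural language processing with python and nltk",
    "topic modeling of natural language with gensim and python",
]

# Build a co-occurrence graph: words in the same sentence share an edge.
G = nx.Graph()
for sentence in sentences:
    tokens = set(word_tokenize(sentence))
    G.add_edges_from(itertools.combinations(tokens, 2))

# Rank candidate keywords by degree centrality.
ranked = sorted(nx.degree_centrality(G).items(), key=lambda item: item[1], reverse=True)
print(ranked[:5])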
The demonstration libraries are Baleen and Minke. These are for demonstration only, though we'd be happy to have your help developing them into more fully featured libraries or frameworks at the Sprints. If you'd like to use them, you can clone them with Git and then install their requirements as follows:
$ pip install -r requirements.txt
Neither library is required for the tutorial, but both are recommended for further investigation.
Conclusion
Ok, that's a lot of stuff to do before Sunday! If you have any questions, please let us know via direct message on the PyCon website or through a message on Twitter:
The plan for Sunday is a combination of slides and notebooks. I'm going to walk you through the end-to-end process of doing natural language processing with NLTK specifically for machine learning. We will then have follow-on time at the PyCon Sprints for you to join us and work on Baleen, Minke, or any of the other materials we presented. Looking forward to seeing you all on Sunday!
District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!