District Data Labs
Modern Methods for Sentiment Analysis
Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment . . .
Getting Started with Spark (in Python)
Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap . . .
Creating a Hadoop Pseudo-Distributed Environment
Hadoop developers usually test their scripts and code on a pseudo-distributed environment (also known as a single node setup), which is a virtual machine that runs all of the Hadoop daemons simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote . . .
Posted in: hadoop
Simple CSV Data Wrangling with Python
Efficient Processing, Schemas, and Serialization
I wanted to write a quick post today about a task that most of us do routinely but often think very little about - loading CSV (comma-separated value) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised how often I . . .
Conditional Probability with R
Likelihood, Independence, and Bayes
In addition to regular probability, we often want to figure out how probability is affected by observing some event. For example, the NFL season is rife with possibilities. From the beginning of each season, fans start trying to figure out how likely it is that their favorite team will make the playoffs. After every game the team plays, these . . .
Posted in: probabilityr
Computing a Bayesian Estimate of Star Rating Means
Consumers rely on the collective intelligence of other consumers to protect themselves from coffee pots that break at the first sign of water, eating bad food at the wrong restaurant, and stunning flops at the theater. Although occasionally there are metrics like Rotten Tomatoes, we primarily prejudge products we would like to consume through . . .
Posted in: python
How to Develop Quality Python Code
Workflows and Development Tools
Developing in Python is very different from developing in other languages. Python is an interpreted language like Ruby or Perl, so developers are able to use read-evaluate-print loops (REPLs) to execute Python in real-time. This feature of Python means that it can be used for rapid development and prototyping because there is no build . . .
Posted in: python