<?xml version="1.0" encoding="UTF-8"?>
<rss version='2.0' xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>District Data Labs</title>
    <description>Hands-on data science tutorials, lessons, and other awesome content.</description>
    <link>https://districtdatalabs.silvrback.com/feed</link>
    <atom:link href="https://districtdatalabs.silvrback.com/feed" rel="self" type="application/rss+xml"/>
    <category domain="districtdatalabs.silvrback.com">Content Management/Blog</category>
    <language>en-us</language>
    <pubDate>Fri, 31 Mar 2017 15:47:38 -0400</pubDate>
    <managingEditor>tojeda@districtdatalabs.com (District Data Labs)</managingEditor>
    <item>
        <guid>http://blog.districtdatalabs.com/data-exploration-with-python-3#26376</guid>
        <pubDate>Fri, 31 Mar 2017 15:47:38 -0400</pubDate>
        <link>http://blog.districtdatalabs.com/data-exploration-with-python-3</link>
        <title>Data Exploration with Python, Part 3</title>
        <description>Embarking on an Insight-Finding Mission</description>
        <content:encoded><![CDATA[<p><em>This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out <a href="http://blog.districtdatalabs.com/data-exploration-with-python-1">Part 1</a> and <a href="http://blog.districtdatalabs.com/data-exploration-with-python-2">Part 2</a>!</em></p>

<p>Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That&#39;s exactly what we are going to start doing in this post. </p>

<p>We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:</p>

<ul>
<li>The higher-level categories we created via category aggregations.</li>
<li>The quintiles we created by binning our continuous variables. </li>
<li>The clusters we generated via k-means clustering based on numeric variables. </li>
</ul>
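<p>As a quick reminder of where those <em>Fuel Efficiency</em> quintiles came from, the binning can be sketched with pandas' <code>qcut</code>. The values and labels below are illustrative stand-ins, not the actual data set.</p>

```python
import pandas as pd

# Illustrative stand-in for the Combined MPG column.
mpg = pd.Series([12, 15, 18, 20, 22, 25, 28, 31, 35, 40])

labels = ['Very Low Efficiency', 'Low Efficiency', 'Moderate Efficiency',
          'High Efficiency', 'Very High Efficiency']

# qcut splits the values into five equal-sized bins (quintiles).
fuel_efficiency = pd.qcut(mpg, 5, labels=labels)
```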

<p>Now, without further ado, let&#39;s embark on our insight-finding mission!</p>

<h2 id="making-our-data-smaller-filter-aggregate">Making Our Data Smaller: Filter + Aggregate</h2>

<p>One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: <em>filtering</em> and <em>aggregating</em>. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group. </p>
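<p>Before applying these ideas to the vehicle data, here is a minimal sketch of both operations on a small synthetic data frame (the values are made up purely for illustration):</p>

```python
import pandas as pd

# Synthetic stand-in for the vehicle data set.
df = pd.DataFrame({
    'Year': [2016, 2016, 1985, 2016],
    'Vehicle Category': ['Car', 'SUV', 'Car', 'Car'],
    'Combined MPG': [30, 22, 25, 35],
})

# Filtering: remove rows (and columns) to focus on a subset of interest.
subset = df[df['Year'] == 2016][['Vehicle Category', 'Combined MPG']]

# Aggregating: group by a categorical field, then calculate on a numeric one.
summary = subset.groupby('Vehicle Category')['Combined MPG'].mean()
```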

<p>To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a <code>group_field</code> that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names. </p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">agg_count</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">group_field</span><span class="p">):</span>
    <span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">group_field</span><span class="p">)</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="n">grouped</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="n">grouped</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">grouped</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">grouped</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">group_field</span><span class="p">,</span> <span class="s1">&#39;Count&#39;</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">grouped</span>
</pre></div>
<p>Now that we have this function in our toolkit, let&#39;s use it. Suppose we were looking at the <em>Vehicle Category</em> field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the <code>agg_count</code> function to transform it to show what we wanted to know.  </p>
<div class="highlight"><pre><span></span><span class="n">vehicles_2016</span> <span class="o">=</span> <span class="n">vehicles</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Year&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">2016</span><span class="p">]</span>
<span class="n">category_counts</span> <span class="o">=</span> <span class="n">agg_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Filter Aggregate Count" src="https://silvrback.s3.amazonaws.com/uploads/42f4db91-ee92-4a97-a5b1-9c4970bd6552/filter_agg_categories.png" /></p>

<p>This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart. </p>
<div class="highlight"><pre><span></span><span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">category_counts</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Count&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Number of Vehicles Manufactured&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Vehicles Manufactured by Category (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Vehicles Manufactured 2016 Bar Chart" src="https://silvrback.s3.amazonaws.com/uploads/99f6e70a-596a-4954-b6ab-ae49a004e43e/category_counts_barchart_large.png" /></p>

<p>Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985). </p>
<div class="highlight"><pre><span></span><span class="n">vehicles_1985</span> <span class="o">=</span> <span class="n">vehicles</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Year&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">1985</span><span class="p">]</span>
<span class="n">category_counts</span> <span class="o">=</span> <span class="n">agg_count</span><span class="p">(</span><span class="n">vehicles_1985</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">category_counts</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Count&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Number of Vehicles Manufactured&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Vehicles Manufactured by Category (1985) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Vehicles Manufactured 1985 Bar Chart" src="https://silvrback.s3.amazonaws.com/uploads/1ed9fe80-ec31-403c-8f77-29d5fead34bd/category_counts_barchart_1985_large.png" /></p>

<p>If we wanted to stick with the year 2016 but drill down to the more granular <em>Vehicle Class</em>, we could do that as well. </p>
<div class="highlight"><pre><span></span><span class="n">class_counts</span> <span class="o">=</span> <span class="n">agg_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span> <span class="s1">&#39;Vehicle Class&#39;</span><span class="p">)</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">class_counts</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Count&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Number of Vehicles Manufactured&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Vehicles Manufactured by Class (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Vehicles Manufactured by Class 2016 Bar Chart" src="https://silvrback.s3.amazonaws.com/uploads/c47e4aab-b59c-48e7-bc8f-ee82f98b1137/class_counts_barchart_large.png" /></p>

<p>We could also look at vehicle counts by manufacturer. </p>
<div class="highlight"><pre><span></span><span class="n">make_counts</span> <span class="o">=</span> <span class="n">agg_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span> <span class="s1">&#39;Make&#39;</span><span class="p">)</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">make_counts</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Count&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Make&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Number of Vehicles Manufactured&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Vehicles Manufactured by Make (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Vehicles Manufactured by Make 2016 Bar Chart" src="https://silvrback.s3.amazonaws.com/uploads/c02ebb36-5ec7-40c5-906d-17afa20316ba/make_counts_barchart_2016.png" /></p>

<p>What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our <code>agg_count</code> function. Below, instead of filtering by <em>Year</em>, I&#39;ve filtered on the <em>Fuel Efficiency</em> field, which contains the fuel efficiency quintiles we generated in the last post. Let&#39;s choose the <em>Very High Efficiency</em> value so that we can see how many very efficient vehicles each manufacturer has made. </p>
<div class="highlight"><pre><span></span><span class="n">very_efficient</span> <span class="o">=</span> <span class="n">vehicles</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Efficiency&#39;</span><span class="p">]</span><span class="o">==</span><span class="s1">&#39;Very High Efficiency&#39;</span><span class="p">]</span>
<span class="n">make_counts</span> <span class="o">=</span> <span class="n">agg_count</span><span class="p">(</span><span class="n">very_efficient</span><span class="p">,</span> <span class="s1">&#39;Make&#39;</span><span class="p">)</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">make_counts</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Count&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Make&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Number of Vehicles Manufactured&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Very Fuel Efficient Vehicles by Make </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Very Efficient Vehicles by Make" src="https://silvrback.s3.amazonaws.com/uploads/d80731bf-eb04-45e2-9d6f-28dd59030c06/make_efficiency_count_barchart.png" /></p>

<p>What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called <code>agg_avg</code> that calculates the mean of a designated numerical field.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">agg_avg</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">group_field</span><span class="p">,</span> <span class="n">calc_field</span><span class="p">):</span>
    <span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">group_field</span><span class="p">,</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="n">calc_field</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">grouped</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">calc_field</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">grouped</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="n">group_field</span><span class="p">,</span> <span class="s1">&#39;Avg &#39;</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">calc_field</span><span class="p">)]</span>
    <span class="k">return</span> <span class="n">grouped</span>
</pre></div>
<p>We can then simply swap out the <code>agg_count</code> function with our new <code>agg_avg</code> function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the <em>Combined MPG</em> field, by vehicle category. </p>
<div class="highlight"><pre><span></span><span class="n">category_avg_mpg</span> <span class="o">=</span> <span class="n">agg_avg</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">,</span> <span class="s1">&#39;Combined MPG&#39;</span><span class="p">)</span>

<span class="n">ax</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">category_avg_mpg</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s1">&#39;Avg Combined MPG&#39;</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Average Combined MPG&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Average Combined MPG by Category (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Average Fuel Efficiency by Vehicle Category" src="https://silvrback.s3.amazonaws.com/uploads/c3a1416e-4553-4b8a-8661-78609b2d111f/avg_mpg_barchart_2016_large.png" /></p>

<h2 id="pivoting-the-data-for-more-detail">Pivoting the Data for More Detail</h2>

<p>Up until this point, we&#39;ve been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from <em>Vehicle Category</em> to <em>Vehicle Class</em> to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we&#39;re going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.</p>
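<p>To make the mechanics concrete before we wrap them in a function, here is what a count-style pivot table looks like on a tiny synthetic data frame (the field names mirror ours, but the values are made up):</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Fuel Efficiency': ['High', 'High', 'Low', 'Low', 'High'],
    'Engine Size': ['Small', 'Small', 'Large', 'Large', 'Large'],
    'Combined MPG': [30, 32, 18, 20, 26],
})

# Rows = Fuel Efficiency, columns = Engine Size, cells = record counts.
pivot = df.pivot_table(values='Combined MPG', index='Fuel Efficiency',
                       columns='Engine Size', aggfunc=np.size)
```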

<p>First, we will create a <code>pivot_count</code> function, similar to the <code>agg_count</code> function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify. </p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">pivot_count</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">rows</span><span class="p">,</span> <span class="n">columns</span><span class="p">,</span> <span class="n">calc_field</span><span class="p">):</span>
    <span class="n">df_pivot</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">calc_field</span><span class="p">,</span> 
                              <span class="n">index</span><span class="o">=</span><span class="n">rows</span><span class="p">,</span> 
                              <span class="n">columns</span><span class="o">=</span><span class="n">columns</span><span class="p">,</span> 
                              <span class="n">aggfunc</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">size</span>
                             <span class="p">)</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">&#39;all&#39;</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">df_pivot</span>
</pre></div>
<p>We will then use this function on our <code>vehicles_2016</code> data frame and pivot it out with the <em>Fuel Efficiency</em> quintiles we created in the last post representing the rows, the <em>Engine Size</em> quintiles representing the columns, and then counting the number of vehicles that had a <em>Combined MPG</em> value. </p>
<div class="highlight"><pre><span></span><span class="n">effic_size_pivot</span> <span class="o">=</span> <span class="n">pivot_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span><span class="s1">&#39;Fuel Efficiency&#39;</span><span class="p">,</span>
                               <span class="s1">&#39;Engine Size&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Filter Pivot Count Fuel Efficiency vs. Engine Size" src="https://silvrback.s3.amazonaws.com/uploads/bf04bb64-890d-42e0-8d77-12a37a482c9d/filter_count_pivot.png" /></p>

<p>This is OK, but it would be faster to analyze visually. Let&#39;s create a heatmap that will color the magnitude of the counts and present us with a more intuitive view. </p>
<div class="highlight"><pre><span></span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">effic_size_pivot</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">&#39;g&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Engine Size&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Fuel Efficiency vs. Engine Size (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Fuel Efficiency by Engine Size Heatmap 2016" src="https://silvrback.s3.amazonaws.com/uploads/cb309263-4b8c-47d0-8159-ea0f9933cfa2/efficiency_size_heatmap.png" /></p>

<p>Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here&#39;s what this heatmap looks like for 1985.</p>
<div class="highlight"><pre><span></span><span class="n">effic_size_pivot</span> <span class="o">=</span> <span class="n">pivot_count</span><span class="p">(</span><span class="n">vehicles_1985</span><span class="p">,</span><span class="s1">&#39;Fuel Efficiency&#39;</span><span class="p">,</span>
                               <span class="s1">&#39;Engine Size&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">effic_size_pivot</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">&#39;g&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Engine Size&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Fuel Efficiency vs. Engine Size (1985) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Fuel Efficiency by Engine Size Heatmap 1985" src="https://silvrback.s3.amazonaws.com/uploads/5f0f5ec8-f37e-4092-8828-07b12ae48307/efficiency_size_heatmap_1985.png" /></p>

<p>With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables. </p>
<div class="highlight"><pre><span></span><span class="n">effic_size_category</span> <span class="o">=</span> <span class="n">pivot_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span>
                                  <span class="p">[</span><span class="s1">&#39;Engine Size&#39;</span><span class="p">,</span><span class="s1">&#39;Fuel Efficiency&#39;</span><span class="p">],</span>
                                  <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">effic_size_category</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">&#39;g&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Fuel Efficiency + Engine Size vs. Vehicle Category (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Engine Size and Fuel Efficiency by Vehicle Category Heatmap" src="https://silvrback.s3.amazonaws.com/uploads/c366a0f2-0fd7-4347-bcec-871fd9f43463/efficiency_size_category.png" /></p>

<p>In this heatmap, we have <em>Engine Size</em> and <em>Fuel Efficiency</em> combinations represented by the rows, and we&#39;ve added a third variable (the <em>Vehicle Category</em>) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year. </p>

<p>As a final example for this section, let&#39;s create a pivot heatmap that plots <em>Make</em> against <em>Vehicle Category</em> for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company&#39;s current offerings in terms of the breadth vs. depth of vehicle types they make.</p>
<div class="highlight"><pre><span></span><span class="n">effic_size_pivot</span> <span class="o">=</span> <span class="n">pivot_count</span><span class="p">(</span><span class="n">vehicles_2016</span><span class="p">,</span> <span class="s1">&#39;Make&#39;</span><span class="p">,</span>
                               <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">)</span>

<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">20</span><span class="p">))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">effic_size_pivot</span><span class="p">,</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">fmt</span><span class="o">=</span><span class="s1">&#39;g&#39;</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Make vs. Vehicle Category (2016) </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Make vs. Vehicle Category 2016" src="https://silvrback.s3.amazonaws.com/uploads/133b4395-0c9e-48ed-8d2b-eab02a45b7cf/make_category_heatmap.png" /></p>

<h2 id="visualizing-changes-over-time">Visualizing Changes Over Time</h2>

<p>So far in this post, we&#39;ve been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a <code>multi_line</code> function that accepts a data frame and x/y fields and then plots them on a multiline chart. </p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">multi_line</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">ax</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">8</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="s2">&quot;Set2&quot;</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">ax</span>
</pre></div>
<p>Let&#39;s use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year. </p>
<div class="highlight"><pre><span></span><span class="n">multi_line</span><span class="p">(</span><span class="n">vehicles</span><span class="p">,</span> <span class="s1">&#39;Year&#39;</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Year&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Vehicle Categories Over Time </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Vehicle Categories Over Time" src="https://silvrback.s3.amazonaws.com/uploads/5050b0cf-32dd-43fd-9373-206f53f67f1b/categories_over_time.png" /></p>

<p>We can see from the chart that Small Cars have generally dominated across the board, with a small decline in the late 90s that reversed in the early 2000s. We can also see the introduction and rising popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.</p>

<p>If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the largest number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let&#39;s filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon. </p>
<div class="highlight"><pre><span></span><span class="n">bmw</span> <span class="o">=</span> <span class="n">vehicles</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Make&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;BMW&#39;</span><span class="p">]</span>

<span class="n">multi_line</span><span class="p">(</span><span class="n">bmw</span><span class="p">,</span> <span class="s1">&#39;Year&#39;</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Year&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;BMW Vehicle Categories Over Time </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="BMW Vehicle Categories Over Time" src="https://silvrback.s3.amazonaws.com/uploads/19c75310-4cd8-4794-a2bb-e6dacad5727f/bmw_categories.png" /></p>

<p>We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, which started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture fewer of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.</p>
<div class="highlight"><pre><span></span><span class="n">toyota</span> <span class="o">=</span> <span class="n">vehicles</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Make&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;Toyota&#39;</span><span class="p">]</span>

<span class="n">multi_line</span><span class="p">(</span><span class="n">toyota</span><span class="p">,</span> <span class="s1">&#39;Year&#39;</span><span class="p">,</span> <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">&#39;</span><span class="se">\n</span><span class="s1"> Year&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">&#39;Toyota Vehicle Categories Over Time </span><span class="se">\n</span><span class="s1">&#39;</span><span class="p">)</span>
</pre></div>
<p><img alt="Toyota Vehicle Categories Over Time" src="https://silvrback.s3.amazonaws.com/uploads/a3a21ba3-de16-41d1-89df-0bc5b5075554/toyota_categories.png" /></p>

<h2 id="examining-relationships-between-variables">Examining Relationships Between Variables</h2>

<p>The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations is, what typical combinations of numerical values exist in our data, and which combinations are anomalies. </p>

<p>When looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird&#39;s eye view of the relationships between all the numerical fields in my data set. With just a couple of lines of code, we can not only create a scatter matrix, but also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post. </p>
<div class="highlight"><pre><span></span><span class="n">select_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Engine Displacement&#39;</span><span class="p">,</span><span class="s1">&#39;Cylinders&#39;</span><span class="p">,</span><span class="s1">&#39;Fuel Barrels/Year&#39;</span><span class="p">,</span>
                   <span class="s1">&#39;City MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Highway MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">,</span>
                   <span class="s1">&#39;CO2 Emission Grams/Mile&#39;</span><span class="p">,</span> <span class="s1">&#39;Fuel Cost/Year&#39;</span><span class="p">,</span> <span class="s1">&#39;Cluster Name&#39;</span><span class="p">]</span>

<span class="n">sns</span><span class="o">.</span><span class="n">pairplot</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="n">select_columns</span><span class="p">],</span> <span class="n">hue</span><span class="o">=</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</pre></div>
<p><img alt="Scatter Matrix with Cluster Hue" src="https://silvrback.s3.amazonaws.com/uploads/83a2b8d1-2c13-47df-accf-fb28fe86191a/scatter_matrix_clusters.png" /></p>

<p>From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them. </p>
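If we wanted to put numbers on these visual impressions, we could compute Pearson correlation coefficients (on the real data frame, `vehicles[select_columns].corr()` would do it in one call). Here is a minimal pure-Python sketch on made-up values chosen to mimic the patterns in the scatter matrix:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values illustrating the relationships described above:
city_mpg = [18, 22, 25, 30, 35]
highway_mpg = [25, 30, 33, 38, 44]        # moves with city MPG
displacement = [4.4, 3.0, 2.5, 2.0, 1.6]  # moves against both

print(pearson_r(city_mpg, highway_mpg))   # strongly positive, near 1
print(pearson_r(city_mpg, displacement))  # strongly negative
```

A coefficient near +1 or -1 confirms a strong linear relationship; the hyperbolic MPG-vs-displacement relationships will score high in magnitude but are better judged visually, which is why the scatter matrix comes first.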

<p>Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let&#39;s take a closer look at <em>Engine Displacement</em> plotted against <em>Combined MPG</em>. </p>
<div class="highlight"><pre><span></span><span class="n">sns</span><span class="o">.</span><span class="n">lmplot</span><span class="p">(</span><span class="s1">&#39;Engine Displacement&#39;</span><span class="p">,</span> <span class="s1">&#39;Combined MPG&#39;</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">vehicles</span><span class="p">,</span> 
           <span class="n">hue</span><span class="o">=</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">fit_reg</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</pre></div>
<p><img alt="Displacement MPG Scatter Plot" src="https://silvrback.s3.amazonaws.com/uploads/0476c1e0-b3ad-40d2-a2ff-684c5052281f/disp_mpg_scatter.png" /></p>

<p>In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our <em>Small Very Efficient</em> cluster resides in the upper left, followed by our <em>Midsized Balanced</em> cluster that looks smaller and more compact than the others. After that, we have our <em>Large Moderately Efficient</em> cluster and finally our <em>Large Inefficient</em> cluster on the bottom right. </p>

<p>We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you&#39;re at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!</p>

<h2 id="conclusion">Conclusion</h2>

<p>We have covered quite a bit in this post, and I hope I&#39;ve provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data.  Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you&#39;re able to find in the comments!</p>

<p>Also, make sure to stay tuned because in the next (and final) post of this series, I&#39;m going to cover how to identify and think about the different <em>networks</em> that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it&#39;s published!</p>

<p><em>District Data Labs provides data science <a href="http://www.districtdatalabs.com/consulting/">consulting</a> and <a href="http://www.districtdatalabs.com/training/">corporate training</a> services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? <a href="mailto:contact@districtdatalabs.com?subject=Consulting%20and%20Corporate%20Training%20Services&body=Hello!%20I&#x27;m%20interested%20in%20learning%20more%20about%20your%20data%20science%20consulting%20and%20corporate%20training%20offerings.">Let us know</a>!</em></p>
]]></content:encoded>
      </item>
      <item>
        <guid>http://blog.districtdatalabs.com/basics-of-entity-resolution#30219</guid>
          <pubDate>Sat, 11 Mar 2017 13:55:00 -0500</pubDate>
        <link>http://blog.districtdatalabs.com/basics-of-entity-resolution</link>
        <title>Basics of Entity Resolution</title>
        <description>with Python and Dedupe</description>
        <content:encoded><![CDATA[<p>Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.  </p>

<p>Unfortunately, the problems associated with entity resolution are equally big &mdash; as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.  </p>

<h3 id="naming-your-problem">Naming Your Problem</h3>

<p>Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let&#39;s define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.</p>

<p>How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren&#39;t the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?</p>

<p>Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.</p>

<p>The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:    </p>

<ol>
<li><strong>Deduplication:</strong> eliminating duplicate (exact) copies of repeated data.</li>
<li><strong>Record linkage:</strong> identifying records that reference the same entity across different sources.</li>
<li><strong>Canonicalization:</strong> converting data with more than one possible representation into a standard form.</li>
</ol>
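To make the three tasks concrete, here is a toy sketch in plain Python (all names and the matching rule are made up; real entity resolution is far fuzzier than this):

```python
def canonicalize(name):
    """Canonicalization: reduce a name to one standard form
    (lowercase, punctuation stripped, whitespace squeezed)."""
    return " ".join(name.replace(".", "").lower().split())

# Source A contains an exact duplicate; source B is a second dataset.
records_a = ["Betty Jones", "betty jones ", "Elizabeth Jones"]
records_b = ["B. Jones", "Carl Smith"]

# Deduplication: exact copies collapse once canonicalized.
deduped = sorted(set(canonicalize(n) for n in records_a))
print(deduped)  # ['betty jones', 'elizabeth jones']

# Record linkage: naive match across sources on last name + first initial.
def link_key(name):
    parts = canonicalize(name).split()
    return (parts[-1], parts[0][0])

links = [(a, b) for a in deduped for b in records_b
         if link_key(a) == link_key(b)]
print(links)  # [('betty jones', 'B. Jones')]
```

Note what the toy version cannot do: it has no way to decide whether "Betty Jones" and "Elizabeth Jones" are the same person. That is exactly the gap that Dedupe's machine learning approach is designed to fill.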

<p>Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library: we will cover its basic functionality, walk through how it works under the hood, and perform a demonstration on two different datasets.</p>

<h2 id="about-dedupe">About Dedupe</h2>

<p><a href="https://pypi.python.org/pypi/dedupe/1.6.5">Dedupe</a> is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn&#39;t the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as its primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well &mdash; in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.</p>

<h3 id="how-dedupe-works">How Dedupe Works</h3>

<p>Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record <em>should</em> look like, even if they&#39;ve never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.    </p>

<h3 id="testing-out-dedupe">Testing Out Dedupe</h3>

<p>Getting started with Dedupe is easy, and the developers have provided a <a href="https://github.com/datamade/dedupe-examples">convenient repo</a> with examples that you can use and iterate on. Let&#39;s start by walking through the <em>csv_example.py</em> from the <em>dedupe-examples</em>. To get Dedupe running, we&#39;ll need to install <code>unidecode</code>, <code>future</code>, and <code>dedupe</code>.    </p>

<p>In your terminal (we recommend doing so inside a <a href="https://districtdatalabs.silvrback.com/how-to-develop-quality-python-code">virtual environment</a>):    </p>
<div class="highlight"><pre><span></span>git clone https://github.com/DistrictDataLabs/dedupe-examples.git
<span class="nb">cd</span> dedupe-examples

pip install unidecode
pip install future
pip install dedupe
</pre></div>
<p>Then we&#39;ll run the csv_example.py file to see what dedupe can do:    </p>
<div class="highlight"><pre><span></span>python csv_example.py
</pre></div>
<h3 id="blocking-and-affine-gap-distance">Blocking and Affine Gap Distance</h3>

<p>Let&#39;s imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we&#39;ve been using to log purchases assigns a new unique ID for every customer interaction.</p>

<p>But it turns out we&#39;re a great business, so we have a lot of repeat customers! We&#39;d like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer&#39;s information is duplicated exactly in every purchase log. But what if it looks something like the table below?   </p>

<p><img alt="Silvrback blog image" class="sb_float_center" src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/er_bizcase.png" /></p>

<p>How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear &mdash; sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system.  Enter Dedupe.</p>

<p>When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records <em>field by field</em>. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:   </p>
<div class="highlight"><pre><span></span><span class="n">fields</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Name&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Phone&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;Exact&#39;</span><span class="p">,</span> <span class="s1">&#39;has missing&#39;</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Address&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span> <span class="s1">&#39;has missing&#39;</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Purchases&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">},</span>
    <span class="p">]</span>
</pre></div>
<p>Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These <code>uncertainPairs</code> are identified using a combination of <strong>blocking</strong>, <strong>affine gap distance</strong>, and <strong>active learning</strong>.</p>

<p>Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe&#39;s method of blocking involves engineering subsets of feature vectors (these are called &#39;predicates&#39;) that can be compared across records. In the case of our people dataset above, the predicates might be things like:</p>

<ul>
<li>the first three digits of the phone number<br></li>
<li>the full name<br></li>
<li>the first five characters of the name<br></li>
<li>a random 4-gram within the city name<br></li>
</ul>
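A few of these predicates can be sketched as simple functions over a record dictionary (the function names and the sample record are hypothetical, and the random 4-gram predicate is omitted; Dedupe generates its predicates internally):

```python
def first_three_phone_digits(record):
    """Predicate: the phone number's first three digits (the area code)."""
    return record.get("Phone", "")[:3] or None

def full_name(record):
    """Predicate: the complete name string."""
    return record.get("Name")

def first_five_name_chars(record):
    """Predicate: the first five characters of the name."""
    return record.get("Name", "")[:5] or None

record = {"Name": "Pat Smith", "Phone": "2025550198", "City": "Washington"}
predicates = [first_three_phone_digits, full_name, first_five_name_chars]
print([p(record) for p in predicates])  # ['202', 'Pat Smith', 'Pat S']
```

Two records land in the same block whenever any predicate returns the same value for both, which is what keeps the number of detailed comparisons manageable.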

<p>Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called <strong>affine gap distance</strong>, which is a variation on edit distance that makes subsequent consecutive deletions or insertions cheaper.</p>

<p><img alt="Silvrback blog image" src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/AGSlide1.png" /></p>

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/AGSlide2.png" /></p>

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/AGSlide3.png" /></p>
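To make the cost structure concrete, here is a minimal pure-Python sketch of an affine gap distance in the style of Gotoh&#39;s algorithm, with made-up penalty values (Dedupe&#39;s own implementation and learned parameters differ):

```python
from itertools import product

def affine_gap_distance(a, b, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimum alignment cost where a run of consecutive insertions or
    deletions pays gap_open once, then the cheaper gap_extend per character."""
    INF = float("inf")
    n, m = len(a), len(b)
    # Three DP states: M ends in a (mis)match, X in a gap in b, Y in a gap in a.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i, j in product(range(1, n + 1), range(1, m + 1)):
        sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
        M[i][j] = sub + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
        X[i][j] = min(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend,
                      Y[i - 1][j] + gap_open)
        Y[i][j] = min(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend,
                      X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])

# Two consecutive trailing insertions cost 1.0 + 0.5 here,
# where plain Levenshtein distance would charge 2.0:
print(affine_gap_distance("ab", "abxy"))  # 1.5
```

Cheaper runs matter for messy name fields: a record that appends a whole extra word (a middle name, a suffix, a department) should not be penalized as heavily as one that differs by the same number of scattered single-character edits.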

<p>Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).</p>

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/BlockingSlide4.png" /></p>
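That area-code blocking step can be sketched in a few lines, using the record IDs from the figures above with hypothetical phone numbers (only the area codes matter for this predicate):

```python
from collections import defaultdict

records = [
    {"id": 452, "phone": "2025550123"},
    {"id": 821, "phone": "2025550987"},
    {"id": 233, "phone": "3345550345"},
    {"id": 699, "phone": "3345550678"},
    {"id": 720, "phone": None},
]

def area_code(record):
    """Blocking predicate: the first three digits of the phone number."""
    phone = record["phone"]
    return phone[:3] if phone else None

# Group record IDs by predicate value; only IDs that share a block
# will be compared to each other during active learning.
blocks = defaultdict(list)
for r in records:
    blocks[area_code(r)].append(r["id"])

print(dict(blocks))  # {'202': [452, 821], '334': [233, 699], None: [720]}
```
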

<p>The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.</p>

<p>Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not.  In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.    </p>
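As a much-simplified stand-in for that clustering step, the sketch below simply unions records whose (made-up) pairwise duplicate probabilities clear a threshold. Dedupe&#39;s actual hierarchical clustering with centroidal linkage is more careful than this transitive grouping, but the input and output have the same shape: pair scores in, entity groups out.

```python
def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(pair_scores, threshold=0.5):
    """Group record IDs whose pairwise duplicate probability > threshold."""
    parent = {}
    for (a, b), p in pair_scores.items():
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        if p > threshold:
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical learned probabilities for three record pairs:
scores = {(452, 821): 0.94, (233, 699): 0.81, (452, 720): 0.10}
print(cluster(scores))  # [[233, 699], [452, 821], [720]]
```
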

<h2 id="active-learning">Active Learning</h2>

<p>Running the example script, you can see that <code>dedupe</code> operates as a command line application, prompting the user to engage in active learning by showing pairs of entities and asking if they are the same or different.</p>
<div class="highlight"><pre><span></span>Do these records refer to the same thing?
<span class="o">(</span>y<span class="o">)</span>es / <span class="o">(</span>n<span class="o">)</span>o / <span class="o">(</span>u<span class="o">)</span>nsure / <span class="o">(</span>f<span class="o">)</span>inished
</pre></div>
<p>Active learning is the so-called <em>special sauce</em> behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:</p>

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/dedupeEX.png" /></p>

<p>You can experiment with typing the <em>y</em>, <em>n</em>, and <em>u</em> keys to flag duplicates for active learning. When you are finished, enter <em>f</em> to quit.</p>

<ul>
<li>(y)es:    confirms that the two references are to the same entity<br></li>
<li>(n)o:     labels the two references as not the same entity<br></li>
<li>(u)nsure: does not label the two references as the same entity or as different entities<br></li>
<li>(f)inished: ends the active learning session and triggers the supervised learning phase<br></li>
</ul>
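The core idea of proposing uncertain pairs can be sketched in a few lines of uncertainty sampling (the scores and helper below are hypothetical, not Dedupe&#39;s API): ask the user about the pair whose current duplicate probability is closest to 0.5, because a label there teaches the model the most.

```python
def most_uncertain_pair(pair_probs):
    """Return the pair whose duplicate probability is closest to 0.5."""
    return min(pair_probs, key=lambda pair: abs(pair_probs[pair] - 0.5))

# Hypothetical model scores for three candidate pairs:
pair_probs = {
    ("Betty Jones", "betty jones"): 0.97,  # almost surely a match
    ("Betty Jones", "Carl Smith"): 0.02,   # almost surely not
    ("Betty Jones", "B. Jones"): 0.55,     # genuinely uncertain -> ask user
}
print(most_uncertain_pair(pair_probs))  # ('Betty Jones', 'B. Jones')
```

After each label, Dedupe relearns its weights and rescores the pairs, so the "most uncertain" pair changes as the session goes on.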

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/dedupeEX2.png" /></p>

<p>As you can see in the example above, some comparison decisions are very easy. The first pair has zero matches across all four attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity: Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.</p>

<p>The <em>csv_example</em> also includes an <a href="https://github.com/datamade/dedupe-examples/blob/master/csv_example/csv_evaluation.py">evaluation script</a> that will enable you to determine how successfully you were able to resolve the entities. It&#39;s important to note that the blocking, active learning, and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the <em>csv_example</em>, the script nominates the following four attributes:</p>
<div class="highlight"><pre><span></span><span class="n">fields</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Site name&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Address&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Zip&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;Exact&#39;</span><span class="p">,</span> <span class="s1">&#39;has missing&#39;</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="p">{</span><span class="s1">&#39;field&#39;</span> <span class="p">:</span> <span class="s1">&#39;Phone&#39;</span><span class="p">,</span> <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span> <span class="s1">&#39;has missing&#39;</span> <span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
    <span class="p">]</span>
</pre></div>
<p>A different combination of attributes would result in a different blocking, a different set of <code>uncertainPairs</code>, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.</p>

<h2 id="something-a-bit-more-challenging">Something a Bit More Challenging</h2>

<p>In order to try Dedupe on a more challenging project, we decided to deduplicate the White House visitors&#39; log. Our hypothesis was that it would be interesting to be able to answer questions such as &quot;How many times has person X visited the White House during administration Y?&quot; However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities.  In other words, a good challenge!</p>

<p>The data set we used was pulled from the <a href="https://open.whitehouse.gov/dataset/White-House-Visitor-Records-Requests/p86s-ychb">WhiteHouse.gov</a> website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here&#39;s a snapshot of what the data looks like via the White House API.  </p>

<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/visitors.png" /></p>

<p>The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:</p>

<table><thead>
<tr>
<th>Database Field</th>
<th>Field Description</th>
</tr>
</thead><tbody>
<tr>
<td>NAMELAST</td>
<td>Last name of entity</td>
</tr>
<tr>
<td>NAMEFIRST</td>
<td>First name of entity</td>
</tr>
<tr>
<td>NAMEMID</td>
<td>Middle name of entity</td>
</tr>
<tr>
<td>UIN</td>
<td>Unique Identification Number</td>
</tr>
<tr>
<td>BDGNBR</td>
<td>Badge Number</td>
</tr>
<tr>
<td>Type of Access</td>
<td>Access type to White House</td>
</tr>
<tr>
<td>TOA</td>
<td>Time of arrival</td>
</tr>
<tr>
<td>POA</td>
<td>Post on arrival</td>
</tr>
<tr>
<td>TOD</td>
<td>Time of departure</td>
</tr>
<tr>
<td>POD</td>
<td>Post on departure</td>
</tr>
<tr>
<td>APPT_MADE_DATE</td>
<td>When the appointment was made</td>
</tr>
<tr>
<td>APPT_START_DATE</td>
<td>When the appointment is scheduled to start</td>
</tr>
<tr>
<td>APPT_END_DATE</td>
<td>When the appointment is scheduled to end</td>
</tr>
<tr>
<td>APPT_CANCEL_DATE</td>
<td>When the appointment was canceled</td>
</tr>
<tr>
<td>Total_People</td>
<td>Total number of people scheduled to attend</td>
</tr>
<tr>
<td>LAST_UPDATEDBY</td>
<td>Who was the last person to update this event</td>
</tr>
<tr>
<td>POST</td>
<td>Classified as &#39;WIN&#39;</td>
</tr>
<tr>
<td>LastEntryDate</td>
<td>When the last update to this instance occurred</td>
</tr>
<tr>
<td>TERMINAL_SUFFIX</td>
<td>ID for terminal used to process visitor</td>
</tr>
<tr>
<td>visitee_namelast</td>
<td>The visitee&#39;s last name</td>
</tr>
<tr>
<td>visitee_namefirst</td>
<td>The visitee&#39;s first name</td>
</tr>
<tr>
<td>MEETING_LOC</td>
<td>The location of the meeting</td>
</tr>
<tr>
<td>MEETING_ROOM</td>
<td>The room number of the meeting</td>
</tr>
<tr>
<td>CALLER_NAME_LAST</td>
<td>Last name of the person who authorized the visit</td>
</tr>
<tr>
<td>CALLER_NAME_FIRST</td>
<td>First name of the person who authorized the visit</td>
</tr>
<tr>
<td>CALLER_ROOM</td>
<td>Room number of the person who authorized the visit</td>
</tr>
<tr>
<td>Description</td>
<td>Description of the event or visit</td>
</tr>
<tr>
<td>RELEASE_DATE</td>
<td>The date this set of logs was released to the public</td>
</tr>
</tbody></table>

<h3 id="loading-the-data">Loading the Data</h3>

<p>Using the API, the <em>White House Visitor Log Requests</em> can be exported in a variety of formats, including .json, .csv, .xlsx, .pdf, .xml, and RSS. However, it&#39;s important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using <code>requests</code>:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span>

<span class="k">def</span> <span class="nf">getData</span><span class="p">(</span><span class="n">url</span><span class="p">,</span><span class="n">fname</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;</span>
<span class="sd">    Download the dataset from the webpage.</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fname</span><span class="p">,</span> <span class="s1">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>  <span class="c1"># response.content is bytes, so write in binary mode</span>
        <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>

<span class="n">DATAURL</span> <span class="o">=</span> <span class="s2">&quot;https://open.whitehouse.gov/api/views/p86s-ychb/rows.csv?accessType=DOWNLOAD&quot;</span>
<span class="n">ORIGFILE</span> <span class="o">=</span> <span class="s2">&quot;fixtures/whitehouse-visitors.csv&quot;</span>

<span class="n">getData</span><span class="p">(</span><span class="n">DATAURL</span><span class="p">,</span><span class="n">ORIGFILE</span><span class="p">)</span>
</pre></div>
<p>Once downloaded, we can clean it up and load it into a database for more secure and stable storage.</p>

<h2 id="tailoring-the-code">Tailoring the Code</h2>

<p>Next, we&#39;ll discuss what&#39;s needed to tailor a <code>dedupe</code> example to the White House visitors&#39; log dataset. The main challenge with this dataset is its sheer size. First, we&#39;ll need to import a few modules and connect to our database:    </p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">psycopg2</span>
<span class="kn">from</span> <span class="nn">dateutil</span> <span class="kn">import</span> <span class="n">parser</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="n">conn</span> <span class="o">=</span> <span class="bp">None</span>

<span class="c1"># Fill in your own connection details below</span>
<span class="n">DATABASE</span> <span class="o">=</span> <span class="s2">&quot;your_db_name&quot;</span>
<span class="n">USER</span> <span class="o">=</span> <span class="s2">&quot;your_user_name&quot;</span>
<span class="n">HOST</span> <span class="o">=</span> <span class="s2">&quot;your_hostname&quot;</span>
<span class="n">PASSWORD</span> <span class="o">=</span> <span class="s2">&quot;your_password&quot;</span>

<span class="k">try</span><span class="p">:</span>
    <span class="n">conn</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">database</span><span class="o">=</span><span class="n">DATABASE</span><span class="p">,</span> <span class="n">user</span><span class="o">=</span><span class="n">USER</span><span class="p">,</span> <span class="n">host</span><span class="o">=</span><span class="n">HOST</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="n">PASSWORD</span><span class="p">)</span>
    <span class="k">print</span> <span class="p">(</span><span class="s2">&quot;I&#39;ve connected&quot;</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
    <span class="k">print</span> <span class="p">(</span><span class="s2">&quot;I am unable to connect to the database&quot;</span><span class="p">)</span>
<span class="n">cur</span> <span class="o">=</span> <span class="n">conn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
</pre></div>
<p>Our dataset poses two other challenges: numerous missing values and <em>datetime</em> formatting irregularities. We wanted to use the <em>datetime</em> strings to help with entity resolution, so we needed the formatting to be as consistent as possible. The following script handles both the <em>datetime</em> parsing and the missing values by combining Python&#39;s <code>dateutil</code> module with PostgreSQL&#39;s fairly forgiving &#39;varchar&#39; type.</p>
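<p>For example, <code>dateutil</code> will normalize several inconsistent spellings of the same date to a single ordinal. A quick standalone check (our own illustration, not part of the original script):</p>

```python
from dateutil import parser

# Three inconsistent spellings of October 24, 2010 all collapse to the
# same proleptic Gregorian ordinal once the time of day is dropped.
raw_dates = ['10/24/2010 14:30', '2010-10-24', '24 Oct 2010 2:30 PM']
ordinals = {parser.parse(raw).toordinal() for raw in raw_dates}
assert len(ordinals) == 1  # all three normalize to one value
```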

<p>This function takes the csv data in as input, parses the <em>datetime</em> fields we&#39;re interested in (&#39;apptmade&#39;, &#39;apptstart&#39;, and &#39;apptend&#39;), and outputs a database table that retains the desired columns (&#39;lastname&#39;, &#39;firstname&#39;, &#39;uin&#39;, &#39;apptmade&#39;, &#39;apptstart&#39;, &#39;apptend&#39;, &#39;meeting_loc&#39;). Keep in mind this will take a while to run.        </p>
<div class="highlight"><pre><span></span><span class="c1"># Column indexes of the datetime fields: apptmade, apptstart, apptend</span>
<span class="n">DATEFIELDS</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">12</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">dateParseSQL</span><span class="p">(</span><span class="n">nfile</span><span class="p">):</span>
    <span class="n">cur</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">&#39;&#39;&#39;CREATE TABLE IF NOT EXISTS visitors_er</span>
<span class="s1">                  (visitor_id SERIAL PRIMARY KEY,</span>
<span class="s1">                  lastname    varchar,</span>
<span class="s1">                  firstname   varchar,</span>
<span class="s1">                  uin         varchar,</span>
<span class="s1">                  apptmade    varchar,</span>
<span class="s1">                  apptstart   varchar,</span>
<span class="s1">                  apptend     varchar,</span>
<span class="s1">                  meeting_loc varchar);&#39;&#39;&#39;</span><span class="p">)</span>
    <span class="n">conn</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">nfile</span><span class="p">,</span> <span class="s1">&#39;rU&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">infile</span><span class="p">:</span>
        <span class="n">reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">infile</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">&#39;,&#39;</span><span class="p">)</span>
        <span class="nb">next</span><span class="p">(</span><span class="n">reader</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">reader</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">field</span> <span class="ow">in</span> <span class="n">DATEFIELDS</span><span class="p">:</span>
                <span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="n">field</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;&#39;</span><span class="p">:</span>
                    <span class="k">try</span><span class="p">:</span>
                        <span class="n">dt</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="n">field</span><span class="p">])</span>
                        <span class="n">row</span><span class="p">[</span><span class="n">field</span><span class="p">]</span> <span class="o">=</span> <span class="n">dt</span><span class="o">.</span><span class="n">toordinal</span><span class="p">()</span>  <span class="c1"># We also tried dt.isoformat()</span>
                    <span class="k">except</span><span class="p">:</span>
                        <span class="k">continue</span>
            <span class="n">sql</span> <span class="o">=</span> <span class="s2">&quot;INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) </span><span class="se">\</span>
<span class="s2">                   VALUES (</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">,</span><span class="si">%s</span><span class="s2">)&quot;</span>
            <span class="n">cur</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span> <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">3</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">10</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">11</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">12</span><span class="p">],</span><span class="n">row</span><span class="p">[</span><span class="mi">21</span><span class="p">],))</span>
            <span class="n">conn</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
    <span class="k">print</span> <span class="p">(</span><span class="s2">&quot;All done!&quot;</span><span class="p">)</span>


<span class="n">dateParseSQL</span><span class="p">(</span><span class="n">ORIGFILE</span><span class="p">)</span>
</pre></div>
<p>About 60 of our rows contained non-ASCII characters, which we dropped using this SQL command:</p>
<div class="highlight"><pre><span></span><span class="k">delete</span> <span class="k">from</span> <span class="n">visitors</span> <span class="k">where</span> <span class="n">firstname</span> <span class="o">~</span> <span class="s1">&#39;[^[:ascii:]]&#39;</span> <span class="k">OR</span> <span class="n">lastname</span> <span class="o">~</span> <span class="s1">&#39;[^[:ascii:]]&#39;</span><span class="p">;</span>
</pre></div>
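<p>If you want to preview the affected rows in Python before deleting them, a small helper (ours, mirroring the regex above) performs the same test:</p>

```python
def is_ascii(text):
    # Equivalent to NOT matching '[^[:ascii:]]': True only when every
    # character fits in the ASCII range.
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

print(is_ascii('Smith'), is_ascii('Müller'))  # True False
```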
<p>For our deduplication script, we modified the <a href="https://github.com/datamade/dedupe-examples/blob/master/pgsql_example/pgsql_example.py">PostgreSQL example</a> as well as <a href="https://twitter.com/dchud">Dan Chudnov</a>&#39;s <a href="https://github.com/dchud/osha-dedupe/blob/master/pgdedupe.py">adaptation of the script</a> for the OSHA dataset.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="kn">import</span> <span class="nn">dedupe</span>
<span class="kn">import</span> <span class="nn">psycopg2</span>
<span class="kn">from</span> <span class="nn">psycopg2.extras</span> <span class="kn">import</span> <span class="n">DictCursor</span>
</pre></div>
<p>Initially, we wanted to try to use the datetime fields to deduplicate the entities, but <code>dedupe</code> was not a big fan of the <em>datetime</em> fields, whether in isoformat or ordinal, so we ended up nominating the following fields:    </p>
<div class="highlight"><pre><span></span><span class="n">KEY_FIELD</span> <span class="o">=</span> <span class="s1">&#39;visitor_id&#39;</span>
<span class="n">SOURCE_TABLE</span> <span class="o">=</span> <span class="s1">&#39;visitors&#39;</span>

<span class="n">FIELDS</span> <span class="o">=</span>  <span class="p">[{</span><span class="s1">&#39;field&#39;</span><span class="p">:</span> <span class="s1">&#39;firstname&#39;</span><span class="p">,</span> <span class="s1">&#39;variable name&#39;</span><span class="p">:</span> <span class="s1">&#39;firstname&#39;</span><span class="p">,</span>
               <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span><span class="s1">&#39;has missing&#39;</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
              <span class="p">{</span><span class="s1">&#39;field&#39;</span><span class="p">:</span> <span class="s1">&#39;lastname&#39;</span><span class="p">,</span> <span class="s1">&#39;variable name&#39;</span><span class="p">:</span> <span class="s1">&#39;lastname&#39;</span><span class="p">,</span>
               <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span><span class="s1">&#39;has missing&#39;</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
              <span class="p">{</span><span class="s1">&#39;field&#39;</span><span class="p">:</span> <span class="s1">&#39;uin&#39;</span><span class="p">,</span> <span class="s1">&#39;variable name&#39;</span><span class="p">:</span> <span class="s1">&#39;uin&#39;</span><span class="p">,</span>
               <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span><span class="s1">&#39;has missing&#39;</span><span class="p">:</span> <span class="bp">True</span><span class="p">},</span>
              <span class="p">{</span><span class="s1">&#39;field&#39;</span><span class="p">:</span> <span class="s1">&#39;meeting_loc&#39;</span><span class="p">,</span> <span class="s1">&#39;variable name&#39;</span><span class="p">:</span> <span class="s1">&#39;meeting_loc&#39;</span><span class="p">,</span>
               <span class="s1">&#39;type&#39;</span><span class="p">:</span> <span class="s1">&#39;String&#39;</span><span class="p">,</span><span class="s1">&#39;has missing&#39;</span><span class="p">:</span> <span class="bp">True</span><span class="p">}</span>
              <span class="p">]</span>
</pre></div>
<p>We modified a function Dan wrote to generate the predicate blocks:    </p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">candidates_gen</span><span class="p">(</span><span class="n">result_set</span><span class="p">):</span>
    <span class="n">lset</span> <span class="o">=</span> <span class="nb">set</span>
    <span class="n">block_id</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">records</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">result_set</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">row</span><span class="p">[</span><span class="s1">&#39;block_id&#39;</span><span class="p">]</span> <span class="o">!=</span> <span class="n">block_id</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">records</span><span class="p">:</span>
                <span class="k">yield</span> <span class="n">records</span>

            <span class="n">block_id</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s1">&#39;block_id&#39;</span><span class="p">]</span>
            <span class="n">records</span> <span class="o">=</span> <span class="p">[]</span>
            <span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>

            <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">10000</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;{} blocks&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>

        <span class="n">smaller_ids</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s1">&#39;smaller_ids&#39;</span><span class="p">]</span>
        <span class="k">if</span> <span class="n">smaller_ids</span><span class="p">:</span>
            <span class="n">smaller_ids</span> <span class="o">=</span> <span class="n">lset</span><span class="p">(</span><span class="n">smaller_ids</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;,&#39;</span><span class="p">))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">smaller_ids</span> <span class="o">=</span> <span class="n">lset</span><span class="p">([])</span>

        <span class="n">records</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">row</span><span class="p">[</span><span class="n">KEY_FIELD</span><span class="p">],</span> <span class="n">row</span><span class="p">,</span> <span class="n">smaller_ids</span><span class="p">))</span>

    <span class="k">if</span> <span class="n">records</span><span class="p">:</span>
        <span class="k">yield</span> <span class="n">records</span>
</pre></div>
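<p>The generator above is essentially a streaming group-by on <code>block_id</code>. With a few made-up rows (already sorted by <code>block_id</code>, as the SQL query in the full script guarantees), the same grouping can be sketched with <code>itertools.groupby</code>:</p>

```python
from itertools import groupby

# Made-up result-set rows standing in for the database cursor.
rows = [
    {'block_id': 1, 'visitor_id': 10, 'smaller_ids': ''},
    {'block_id': 1, 'visitor_id': 11, 'smaller_ids': '10'},
    {'block_id': 2, 'visitor_id': 12, 'smaller_ids': ''},
]

# Each yielded block is a list of (key, row, smaller_ids_set) tuples,
# matching what candidates_gen produces.
blocks = [
    [(r['visitor_id'], r,
      set(r['smaller_ids'].split(',')) if r['smaller_ids'] else set())
     for r in group]
    for _, group in groupby(rows, key=lambda r: r['block_id'])
]
print(len(blocks), len(blocks[0]))  # 2 2
```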
<p>And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:    </p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">find_dupes</span><span class="p">(</span><span class="n">args</span><span class="p">):</span>
    <span class="n">deduper</span> <span class="o">=</span> <span class="n">dedupe</span><span class="o">.</span><span class="n">Dedupe</span><span class="p">(</span><span class="n">FIELDS</span><span class="p">)</span>

    <span class="k">with</span> <span class="n">psycopg2</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">database</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">dbname</span><span class="p">,</span>
                          <span class="n">host</span><span class="o">=</span><span class="s1">&#39;localhost&#39;</span><span class="p">,</span>
                          <span class="n">cursor_factory</span><span class="o">=</span><span class="n">DictCursor</span><span class="p">)</span> <span class="k">as</span> <span class="n">con</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">c</span><span class="p">:</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">&#39;SELECT COUNT(*) AS count FROM </span><span class="si">%s</span><span class="s1">&#39;</span> <span class="o">%</span> <span class="n">SOURCE_TABLE</span><span class="p">)</span>
            <span class="n">row</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">fetchone</span><span class="p">()</span>
            <span class="n">count</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s1">&#39;count&#39;</span><span class="p">]</span>
            <span class="n">sample_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">count</span> <span class="o">*</span> <span class="n">args</span><span class="o">.</span><span class="n">sample</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Generating sample of {} records&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">sample_size</span><span class="p">))</span>
            <span class="k">with</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">(</span><span class="s1">&#39;deduper&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">c_deduper</span><span class="p">:</span>
                <span class="n">c_deduper</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s1">&#39;SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM </span><span class="si">%s</span><span class="s1">&#39;</span> <span class="o">%</span> <span class="n">SOURCE_TABLE</span><span class="p">)</span>
                <span class="n">temp_d</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">((</span><span class="n">i</span><span class="p">,</span> <span class="n">row</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">c_deduper</span><span class="p">))</span>
                <span class="n">deduper</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">temp_d</span><span class="p">,</span> <span class="n">sample_size</span><span class="p">)</span>
                <span class="k">del</span><span class="p">(</span><span class="n">temp_d</span><span class="p">)</span>

            <span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">training</span><span class="p">):</span>
                <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Loading training file from {}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">training</span><span class="p">))</span>
                <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">training</span><span class="p">)</span> <span class="k">as</span> <span class="n">tf</span><span class="p">:</span>
                    <span class="n">deduper</span><span class="o">.</span><span class="n">readTraining</span><span class="p">(</span><span class="n">tf</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Starting active learning&#39;</span><span class="p">)</span>
            <span class="n">dedupe</span><span class="o">.</span><span class="n">convenience</span><span class="o">.</span><span class="n">consoleLabel</span><span class="p">(</span><span class="n">deduper</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Starting training&#39;</span><span class="p">)</span>
            <span class="n">deduper</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">ppc</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">uncovered_dupes</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Saving new training file to {}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">training</span><span class="p">))</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">training</span><span class="p">,</span> <span class="s1">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">training_file</span><span class="p">:</span>
                <span class="n">deduper</span><span class="o">.</span><span class="n">writeTraining</span><span class="p">(</span><span class="n">training_file</span><span class="p">)</span>

            <span class="n">deduper</span><span class="o">.</span><span class="n">cleanupTraining</span><span class="p">()</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Creating blocking_map table&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                DROP TABLE IF EXISTS blocking_map</span>
<span class="s2">                &quot;&quot;&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE blocking_map</span>
<span class="s2">                (block_key VARCHAR(200), </span><span class="si">%s</span><span class="s2"> INTEGER)</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="n">KEY_FIELD</span><span class="p">)</span>

            <span class="k">for</span> <span class="n">field</span> <span class="ow">in</span> <span class="n">deduper</span><span class="o">.</span><span class="n">blocker</span><span class="o">.</span><span class="n">index_fields</span><span class="p">:</span>
                <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Selecting distinct values for &quot;{}&quot;&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">field</span><span class="p">))</span>
                <span class="n">c_index</span> <span class="o">=</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">(</span><span class="s1">&#39;index&#39;</span><span class="p">)</span>
                <span class="n">c_index</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                    SELECT DISTINCT </span><span class="si">%s</span><span class="s2"> FROM </span><span class="si">%s</span><span class="s2"></span>
<span class="s2">                    &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">field</span><span class="p">,</span> <span class="n">SOURCE_TABLE</span><span class="p">))</span>
                <span class="n">field_data</span> <span class="o">=</span> <span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="n">field</span><span class="p">]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c_index</span><span class="p">)</span>
                <span class="n">deduper</span><span class="o">.</span><span class="n">blocker</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">field_data</span><span class="p">,</span> <span class="n">field</span><span class="p">)</span>
                <span class="n">c_index</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Generating blocking map&#39;</span><span class="p">)</span>
            <span class="n">c_block</span> <span class="o">=</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">(</span><span class="s1">&#39;block&#39;</span><span class="p">)</span>
            <span class="n">c_block</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                SELECT * FROM </span><span class="si">%s</span><span class="s2"></span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="n">SOURCE_TABLE</span><span class="p">)</span>
            <span class="n">full_data</span> <span class="o">=</span> <span class="p">((</span><span class="n">row</span><span class="p">[</span><span class="n">KEY_FIELD</span><span class="p">],</span> <span class="n">row</span><span class="p">)</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">c_block</span><span class="p">)</span>
            <span class="n">b_data</span> <span class="o">=</span> <span class="n">deduper</span><span class="o">.</span><span class="n">blocker</span><span class="p">(</span><span class="n">full_data</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Inserting blocks into blocking_map&#39;</span><span class="p">)</span>
            <span class="n">csv_file</span> <span class="o">=</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s1">&#39;blocks_&#39;</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
            <span class="n">csv_writer</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">writer</span><span class="p">(</span><span class="n">csv_file</span><span class="p">)</span>
            <span class="n">csv_writer</span><span class="o">.</span><span class="n">writerows</span><span class="p">(</span><span class="n">b_data</span><span class="p">)</span>
            <span class="n">csv_file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

            <span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">csv_file</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">copy_expert</span><span class="p">(</span><span class="s2">&quot;COPY blocking_map FROM STDIN CSV&quot;</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
            <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>

            <span class="n">os</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="n">csv_file</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>

            <span class="n">con</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Indexing blocks&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)</span>
<span class="s2">                &quot;&quot;&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;DROP TABLE IF EXISTS plural_key&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;DROP TABLE IF EXISTS plural_block&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;DROP TABLE IF EXISTS covered_blocks&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;DROP TABLE IF EXISTS smaller_coverage&quot;</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Calculating plural_key&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE plural_key</span>
<span class="s2">                (block_key VARCHAR(200),</span>
<span class="s2">                block_id SERIAL PRIMARY KEY)</span>
<span class="s2">                &quot;&quot;&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                INSERT INTO plural_key (block_key)</span>
<span class="s2">                SELECT block_key FROM blocking_map</span>
<span class="s2">                GROUP BY block_key HAVING COUNT(*) &gt; 1</span>
<span class="s2">                &quot;&quot;&quot;</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Indexing block_key&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)</span>
<span class="s2">                &quot;&quot;&quot;</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Calculating plural_block&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE plural_block</span>
<span class="s2">                AS (SELECT block_id, </span><span class="si">%s</span><span class="s2"></span>
<span class="s2">                FROM blocking_map INNER JOIN plural_key</span>
<span class="s2">                USING (block_key))</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="n">KEY_FIELD</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Adding {} index&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">))</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE INDEX plural_block_</span><span class="si">%s</span><span class="s2">_idx</span>
<span class="s2">                    ON plural_block (</span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE UNIQUE INDEX plural_block_block_id_</span><span class="si">%s</span><span class="s2">_uniq</span>
<span class="s2">                ON plural_block (block_id, </span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Creating covered_blocks&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE covered_blocks AS</span>
<span class="s2">                    (SELECT </span><span class="si">%s</span><span class="s2">,</span>
<span class="s2">                            string_agg(CAST(block_id AS TEXT), &#39;,&#39;</span>
<span class="s2">                            ORDER BY block_id) AS sorted_ids</span>
<span class="s2">                     FROM plural_block</span>
<span class="s2">                     GROUP BY </span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                 &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Indexing covered_blocks&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE UNIQUE INDEX covered_blocks_</span><span class="si">%s</span><span class="s2">_idx</span>
<span class="s2">                    ON covered_blocks (</span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>
            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Committing&#39;</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Creating smaller_coverage&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE smaller_coverage AS</span>
<span class="s2">                    (SELECT </span><span class="si">%s</span><span class="s2">, block_id,</span>
<span class="s2">                        TRIM(&#39;,&#39; FROM split_part(sorted_ids,</span>
<span class="s2">                        CAST(block_id AS TEXT), 1))</span>
<span class="s2">                        AS smaller_ids</span>
<span class="s2">                     FROM plural_block</span>
<span class="s2">                     INNER JOIN covered_blocks</span>
<span class="s2">                     USING (</span><span class="si">%s</span><span class="s2">))</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>
            <span class="n">con</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Clustering...&#39;</span><span class="p">)</span>
            <span class="n">c_cluster</span> <span class="o">=</span> <span class="n">con</span><span class="o">.</span><span class="n">cursor</span><span class="p">(</span><span class="s1">&#39;cluster&#39;</span><span class="p">)</span>
            <span class="n">c_cluster</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                SELECT *</span>
<span class="s2">                FROM smaller_coverage</span>
<span class="s2">                INNER JOIN </span><span class="si">%s</span><span class="s2"></span>
<span class="s2">                    USING (</span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                ORDER BY (block_id)</span>
<span class="s2">                &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">SOURCE_TABLE</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>
            <span class="n">clustered_dupes</span> <span class="o">=</span> <span class="n">deduper</span><span class="o">.</span><span class="n">matchBlocks</span><span class="p">(</span>
                    <span class="n">candidates_gen</span><span class="p">(</span><span class="n">c_cluster</span><span class="p">),</span> <span class="n">threshold</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Creating entity_map table&#39;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;DROP TABLE IF EXISTS entity_map&quot;</span><span class="p">)</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                CREATE TABLE entity_map (</span>
<span class="s2">                    </span><span class="si">%s</span><span class="s2"> INTEGER,</span>
<span class="s2">                    canon_id INTEGER,</span>
<span class="s2">                    cluster_score FLOAT,</span>
<span class="s2">                    PRIMARY KEY(</span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                )&quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">KEY_FIELD</span><span class="p">))</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Inserting entities into entity_map&#39;</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">cluster</span><span class="p">,</span> <span class="n">scores</span> <span class="ow">in</span> <span class="n">clustered_dupes</span><span class="p">:</span>
                <span class="n">cluster_id</span> <span class="o">=</span> <span class="n">cluster</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
                <span class="k">for</span> <span class="n">key_field</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">cluster</span><span class="p">,</span> <span class="n">scores</span><span class="p">):</span>
                    <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;&quot;&quot;</span>
<span class="s2">                        INSERT INTO entity_map</span>
<span class="s2">                            (</span><span class="si">%s</span><span class="s2">, canon_id, cluster_score)</span>
<span class="s2">                        VALUES (</span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">)</span>
<span class="s2">                        &quot;&quot;&quot;</span> <span class="o">%</span> <span class="p">(</span><span class="n">KEY_FIELD</span><span class="p">,</span> <span class="n">key_field</span><span class="p">,</span> <span class="n">cluster_id</span><span class="p">,</span> <span class="n">score</span><span class="p">))</span>

            <span class="k">print</span> <span class="p">(</span><span class="s1">&#39;Indexing head_index&#39;</span><span class="p">)</span>
            <span class="n">c_cluster</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
            <span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">&quot;CREATE INDEX head_index ON entity_map (canon_id)&quot;</span><span class="p">)</span>
            <span class="n">con</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">&#39;__main__&#39;</span><span class="p">:</span>
    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;--dbname&#39;</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">&#39;dbname&#39;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s1">&#39;whitehouse&#39;</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;database name&#39;</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;-s&#39;</span><span class="p">,</span> <span class="s1">&#39;--sample&#39;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mf">0.10</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">float</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;sample size (percentage, default 0.10)&#39;</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">&#39;-t&#39;</span><span class="p">,</span> <span class="s1">&#39;--training&#39;</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s1">&#39;training.json&#39;</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">&#39;name of training file&#39;</span><span class="p">)</span>
    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">find_dupes</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>
</pre></div>
<h2 id="active-learning-observations">Active Learning Observations</h2>

<p>We ran multiple experiments:</p>

<ul>
<li>Test 1: lastname, firstname, meeting_loc =&gt; 447 (15 minutes of training)</li>
<li>Test 2: lastname, firstname, uin, meeting_loc =&gt; 3385 (5 minutes of training) - one instance that had 168 duplicates</li>
</ul>

<p>We experienced a lot of uncertainty during the active learning phase, mostly because of the sheer size of the dataset. This was particularly pronounced with names that seemed more common or more domestic to us, since those occur much more frequently in this dataset. For example, are two records containing the name Michael Grant the same entity?</p>

<p>Additionally, we noticed a lot of variation in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames, or what could have been references to separate entities: <em>KIM ASKEW</em> vs. <em>KIMBERLEY ASKEW</em> and <em>Kathy Edwards</em> vs. <em>Katherine Edwards</em> (and yes, <code>dedupe</code> does preserve variations in case). On the other hand, since nicknames generally appear only in people&#39;s first names, when we saw a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.</p>

<p>Other things that made the labeling easier were clearly gendered names (e.g. <em>Brian Murphy</em> vs. <em>Briana Murphy</em>), which helped us identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in labeling two references as matches for a single entity (<em>Davifd Culp</em> vs. <em>David Culp</em>). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (<em>Jon Doe</em> and <em>Ben Jealous</em>).</p>
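<p>The intuition about misspellings can be made concrete with edit distance. As a minimal sketch (not part of the original pipeline; <code>dedupe</code> learns its own string comparators), a plain Levenshtein distance already separates a one-character typo from a genuine nickname expansion:</p>

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('Davifd Culp', 'David Culp'))     # 1 -- a likely typo
print(levenshtein('KIM ASKEW', 'KIMBERLEY ASKEW'))  # 6 -- a possible nickname
```

<p>A distance of 1 on an otherwise rare name is strong evidence of a single entity; a distance of 6 could be a nickname or two different people, which is exactly where human labeling was needed.</p>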

<p>One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase, and thus on the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:</p>
<div class="highlight"><pre><span></span>lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
<span class="o">(</span>y<span class="o">)</span>es / <span class="o">(</span>n<span class="o">)</span>o / <span class="o">(</span>u<span class="o">)</span>nsure / <span class="o">(</span>f<span class="o">)</span>inished
</pre></div>
<p>Some were hard:    </p>
<div class="highlight"><pre><span></span>lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
<span class="o">(</span>y<span class="o">)</span>es / <span class="o">(</span>n<span class="o">)</span>o / <span class="o">(</span>u<span class="o">)</span>nsure / <span class="o">(</span>f<span class="o">)</span>inished
</pre></div>
<h2 id="results">Results</h2>

<p>What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of <em>visitor_id</em> number and to have the same meeting location and the same <em>uin</em> (which, confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the <em>frequent flier</em> &mdash; people who seem to spend a lot of time at the White House, such as staffers and other political appointees.</p>
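<p>The distinction between the two kinds of duplicates can be sketched with a toy example. The rows below are hypothetical stand-ins echoing the samples shown later in this post: exact matches that share a <em>uin</em> suggest a within-visit duplicate, while the same name under different <em>uin</em> values suggests a frequent flier:</p>

```python
import pandas as pd

# Hypothetical rows; column names follow the visitor log schema used above.
visits = pd.DataFrame([
    {'lastname': 'Ryan', 'firstname': 'Patrick', 'meeting_loc': 'OEOB', 'uin': 'U62671'},
    {'lastname': 'Ryan', 'firstname': 'Patrick', 'meeting_loc': 'OEOB', 'uin': 'U62671'},
    {'lastname': 'TANGHERLINI', 'firstname': 'DANIEL', 'meeting_loc': 'OEOB', 'uin': 'U02692'},
    {'lastname': 'TANGHERLINI', 'firstname': 'DANIEL', 'meeting_loc': 'NEOB', 'uin': 'U73085'},
])

# Within-visit duplicates: identical name and uin.
within_visit = visits[visits.duplicated(['lastname', 'firstname', 'uin'], keep=False)]

# Frequent fliers: the same name appearing under more than one uin.
fliers = (visits.drop_duplicates()
                .groupby(['lastname', 'firstname'])['uin']
                .nunique())

print(len(within_visit))                  # 2 -- the duplicated Ryan rows
print(fliers[('TANGHERLINI', 'DANIEL')])  # 2 -- two distinct uins
```

<p>Exact-match rules like these only catch the obvious cases, of course; the point of <code>dedupe</code> is to also catch the fuzzier ones.</p>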

<p>During the dedupe process, we computed that there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.</p>
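<p>To put those figures in perspective, roughly a third of the records are flagged as potential duplicates:</p>

```python
potential_duplicates = 332_606
total_entities = 1_048_576

share = potential_duplicates / total_entities
print(f'{share:.1%}')  # 31.7%
```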

<h3 id="within-visit-duplicates">Within-Visit Duplicates</h3>
<div class="highlight"><pre><span></span><span class="n">lastname</span> <span class="o">:</span> <span class="n">Ryan</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">OEOB</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">Patrick</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U62671</span>

<span class="n">lastname</span> <span class="o">:</span> <span class="n">Ryan</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">OEOB</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">Patrick</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U62671</span>
</pre></div>
<h3 id="across-visit-duplicates-frequent-fliers">Across-Visit Duplicates (Frequent Fliers)</h3>
<div class="highlight"><pre><span></span><span class="n">lastname</span> <span class="o">:</span> <span class="n">TANGHERLINI</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">OEOB</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">DANIEL</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U02692</span>

<span class="n">lastname</span> <span class="o">:</span> <span class="n">TANGHERLINI</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">NEOB</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">DANIEL</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U73085</span>
</pre></div><div class="highlight"><pre><span></span><span class="n">lastname</span> <span class="o">:</span> <span class="n">ARCHULETA</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">WH</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">KATHERINE</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U68121</span>

<span class="n">lastname</span> <span class="o">:</span> <span class="n">ARCHULETA</span>
<span class="n">meeting_loc</span> <span class="o">:</span> <span class="n">OEOB</span>
<span class="n">firstname</span> <span class="o">:</span> <span class="n">KATHERINE</span>
<span class="n">uin</span> <span class="o">:</span> <span class="n">U76331</span>
</pre></div>
<p><img alt="Silvrback blog image " src="https://silvrback.s3.amazonaws.com/uploads/a7c655cb-7d2e-4439-8338-348f90b19145/dedupe_errors.png" /></p>

<h2 id="conclusion">Conclusion</h2>

<p>In this beginner&#39;s guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately, we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.</p>

<p>Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion of a number of other topics important to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!</p>

<h2 id="recommended-reading">Recommended Reading</h2>

<ul>
<li><a href="http://www.slideshare.net/BenjaminBengfort/a-primer-on-entity-resolution">A Primer on Entity Resolution by Benjamin Bengfort</a><br></li>
<li><a href="http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data">Entity Resolution for Big Data: A Summary of the KDD 2013 Tutorial Taught by Dr. Lise Getoor and Dr. Ashwin Machanavajjhala</a><br></li>
<li><a href="http://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf">A Theory for Record Linkage by Ivan P. Fellegi and Alan B. Sunter</a><br></li>
</ul>

<p><em>District Data Labs provides data science <a href="http://www.districtdatalabs.com/consulting/">consulting</a> and <a href="http://www.districtdatalabs.com/training/">corporate training</a> services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? <a href="mailto:contact@districtdatalabs.com?subject=Consulting%20and%20Corporate%20Training%20Services&body=Hello!%20I&#x27;m%20interested%20in%20learning%20more%20about%20your%20data%20science%20consulting%20and%20corporate%20training%20offerings.">Let us know</a>!</em></p>
]]></content:encoded>
      </item>
      <item>
        <guid>http://blog.districtdatalabs.com/data-exploration-with-python-2#28902</guid>
          <pubDate>Tue, 07 Feb 2017 03:04:00 -0500</pubDate>
        <link>http://blog.districtdatalabs.com/data-exploration-with-python-2</link>
        <title>Data Exploration with Python, Part 2</title>
        <description>Preparing Your Data to be Explored</description>
        <content:encoded><![CDATA[<p><em>This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out <a href="http://blog.districtdatalabs.com/data-exploration-with-python-1">Data Exploration with Python, Part 1</a>!</em></p>

<blockquote>
<p>Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.</p>
</blockquote>

<p>When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let&#39;s review where we should be at this point and then continue our exploration process with data preparation.</p>

<p>In <a href="http://blog.districtdatalabs.com/data-exploration-with-python-1">Part 1</a> of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.</p>

<p><img alt="Exploratory Data Analysis Framework" src="https://silvrback.s3.amazonaws.com/uploads/774cd15b-faa2-447f-86e2-7a099e79f8b5/framework_large.png" /></p>

<p>We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.</p>

<p><img alt="EPA Vehicle Fuel Economy Data" src="https://silvrback.s3.amazonaws.com/uploads/c170f837-017b-4ada-aa7f-03036632abb4/data_set_large.png" /></p>

<p>We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework&#39;s <em>Prep Phase</em>, the <em>Create</em> stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.</p>

<h2 id="renaming-columns-to-be-more-intuitive">Renaming Columns to be More Intuitive</h2>

<p>Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what <code>displ</code> or <code>co2TailpipeGpm</code> mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.</p>

<p>We can use the code below to rename the columns in our vehicles data frame.</p>
<div class="highlight"><pre><span></span><span class="n">vehicles</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Make&#39;</span><span class="p">,</span><span class="s1">&#39;Model&#39;</span><span class="p">,</span><span class="s1">&#39;Year&#39;</span><span class="p">,</span><span class="s1">&#39;Engine Displacement&#39;</span><span class="p">,</span><span class="s1">&#39;Cylinders&#39;</span><span class="p">,</span>
                    <span class="s1">&#39;Transmission&#39;</span><span class="p">,</span><span class="s1">&#39;Drivetrain&#39;</span><span class="p">,</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">,</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">,</span>
                    <span class="s1">&#39;Fuel Barrels/Year&#39;</span><span class="p">,</span><span class="s1">&#39;City MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Highway MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">,</span>
                    <span class="s1">&#39;CO2 Emission Grams/Mile&#39;</span><span class="p">,</span><span class="s1">&#39;Fuel Cost/Year&#39;</span><span class="p">]</span>
</pre></div>
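<p>As an aside, assigning to <code>columns</code> requires listing every column in the right order. A sketch of an equivalent, more explicit approach is pandas&#39; <code>rename</code>, which maps old names to new ones and leaves the rest untouched (the two-column frame below is a hypothetical slice using the raw EPA names mentioned above):</p>

```python
import pandas as pd

# Hypothetical two-column slice of the raw data, using original EPA field names.
vehicles = pd.DataFrame({'displ': [2.0, 3.5], 'co2TailpipeGpm': [350.0, 410.0]})

vehicles = vehicles.rename(columns={
    'displ': 'Engine Displacement',
    'co2TailpipeGpm': 'CO2 Emission Grams/Mile',
})

print(list(vehicles.columns))  # ['Engine Displacement', 'CO2 Emission Grams/Mile']
```

<p>Because <code>rename</code> ignores keys that are not present, the same mapping can be reused safely even if the raw extract gains or loses columns.</p>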
<h2 id="thinking-about-categorization">Thinking About Categorization</h2>

<p>Now that we have changed our column names to be more intuitive, let&#39;s take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information &mdash; how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.</p>

<p>Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any <a href="http://blog.districtdatalabs.com/an-introduction-to-machine-learning-with-python">supervised machine learning</a> or modeling that may happen further down the <a href="http://blog.districtdatalabs.com/the-age-of-the-data-product">data science pipeline</a>. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.</p>

<p>In this stage of the framework, we are going to create additional categories in 3 distinct ways:</p>

<ul>
<li>Category Aggregations</li>
<li>Binning Continuous Variables</li>
<li>Clustering</li>
</ul>

<p>Now that we have a better idea of what we are doing and why, let&#39;s get started.</p>

<h3 id="aggregating-to-higher-level-categories">Aggregating to Higher-Level Categories</h3>

<p>The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">unique_col_values</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">column</span> <span class="ow">in</span> <span class="n">df</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s2">&quot;{} | {} | {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
            <span class="n">df</span><span class="p">[</span><span class="n">column</span><span class="p">]</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">column</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()),</span> <span class="n">df</span><span class="p">[</span><span class="n">column</span><span class="p">]</span><span class="o">.</span><span class="n">dtype</span>
        <span class="p">))</span>

<span class="n">unique_col_values</span><span class="p">(</span><span class="n">vehicles</span><span class="p">)</span>
</pre></div><div class="highlight"><pre><span></span>Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64
</pre></div>
<p>From looking at the output, it is clear that we have some numeric columns (<em>int64</em> and <em>float64</em>) and some categorical columns (<em>object</em>). For now, let&#39;s focus on the six categorical columns in our data set.</p>

<ul>
<li>Make: 126 unique values</li>
<li>Model: 3,491 unique values</li>
<li>Transmission: 43 unique values</li>
<li>Drivetrain: 7 unique values</li>
<li>Vehicle Class: 34 unique values</li>
<li>Fuel Type: 13 unique values</li>
</ul>

<p>When aggregating and summarizing data, having too many categories can be problematic. The average person is said to be able to hold about <a href="https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two">7 objects</a> at a time in short-term working memory. Accordingly, I have noticed that once a category exceeds 8-10 discrete values, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.</p>

<p>What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).</p>

<p>For example, imagine aggregating by <em>Transmission</em>, which has 43 discrete values in our data set. It is going to be difficult to derive insights because any aggregated metrics will be spread across more categories than you can hold in short-term memory. However, if we examine the different transmission values with the goal of finding common features to group on, we find that all 43 fall into one of two transmission types: <em>Automatic</em> or <em>Manual</em>.</p>

<p><img alt="Category Aggregations - Transmission " src="https://silvrback.s3.amazonaws.com/uploads/4e6e18d2-4a00-4bf5-9744-181deff30659/transmission.png" /></p>

<p>Let&#39;s create a new <em>Transmission Type</em> column in our data frame and, with the help of the <code>loc</code> method in pandas, assign it a value of <em>Automatic</em> where the first character of <em>Transmission</em> is the letter A and a value of <em>Manual</em> where the first character is the letter M.</p>
<div class="highlight"><pre><span></span><span class="n">AUTOMATIC</span> <span class="o">=</span> <span class="s2">&quot;Automatic&quot;</span>
<span class="n">MANUAL</span> <span class="o">=</span> <span class="s2">&quot;Manual&quot;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Transmission&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;A&#39;</span><span class="p">),</span>
             <span class="s1">&#39;Transmission Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">AUTOMATIC</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Transmission&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">&#39;M&#39;</span><span class="p">),</span>
             <span class="s1">&#39;Transmission Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">MANUAL</span>
</pre></div>
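<p>A quick way to sanity-check an aggregation like this is to confirm that every transmission string lands in one of the two buckets. Below is a minimal sketch using a toy frame with made-up transmission values (not the real data set):</p>

```python
import pandas as pd

# Toy stand-in for the vehicles data frame (made-up transmission values).
df = pd.DataFrame({'Transmission': ['Automatic 4-spd', 'Manual 5-spd',
                                    'Auto(AM7)', 'Manual(M6)']})

df.loc[df['Transmission'].str.startswith('A'), 'Transmission Type'] = 'Automatic'
df.loc[df['Transmission'].str.startswith('M'), 'Transmission Type'] = 'Manual'

# Every row should now fall into exactly one of the two types.
counts = df['Transmission Type'].value_counts()
```

<p>If any transmission strings started with a letter other than A or M, they would be left as <code>NaN</code> in the new column, which is worth checking for before moving on.</p>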
<p>We can apply the same logic to the <em>Vehicle Class</em> field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.</p>

<p><img alt="Category Aggregations - Vehicle Class" src="https://silvrback.s3.amazonaws.com/uploads/4cafec85-6fe9-4527-be2b-6f17622cfd07/vehicle_class.png" /></p>
<div class="highlight"><pre><span></span><span class="n">small</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Compact Cars&#39;</span><span class="p">,</span><span class="s1">&#39;Subcompact Cars&#39;</span><span class="p">,</span><span class="s1">&#39;Two Seaters&#39;</span><span class="p">,</span><span class="s1">&#39;Minicompact Cars&#39;</span><span class="p">]</span>
<span class="n">midsize</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Midsize Cars&#39;</span><span class="p">]</span>
<span class="n">large</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Large Cars&#39;</span><span class="p">]</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">small</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Small Cars&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">midsize</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Midsize Cars&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">large</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Large Cars&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Station&#39;</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Station Wagons&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Truck&#39;</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Pickup Trucks&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Special Purpose&#39;</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Special Purpose&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Sport Utility&#39;</span><span class="p">),</span> 
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Sport Utility&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Vehicle Class&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;van&#39;</span><span class="p">)),</span>
             <span class="s1">&#39;Vehicle Category&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Vans &amp; Minivans&#39;</span>
</pre></div>
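<p>The chain of <code>loc</code> assignments above can also be written as a single loop over (pattern, category) pairs, which keeps the whole mapping in one place. Here is a sketch of that idea on a toy frame with a few made-up vehicle classes, applying each substring rule case-insensitively:</p>

```python
import pandas as pd

# Toy frame with made-up vehicle classes (not the real data set).
df = pd.DataFrame({'Vehicle Class': ['Compact Cars', 'Midsize Cars',
                                     'Small Station Wagons',
                                     'Vans, Passenger Type']})

# (substring to match, category to assign), applied in order so later
# rules would overwrite earlier ones if they overlapped.
rules = [('Compact Cars', 'Small Cars'),
         ('Midsize', 'Midsize Cars'),
         ('Station', 'Station Wagons'),
         ('van', 'Vans & Minivans')]

for pattern, category in rules:
    mask = df['Vehicle Class'].str.lower().str.contains(pattern.lower())
    df.loc[mask, 'Vehicle Category'] = category
```

<p>Because the rules apply sequentially, their ordering matters when patterns overlap; that is the same behavior as the sequence of <code>loc</code> statements above.</p>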
<p>Next, let&#39;s look at the <em>Make</em> and <em>Model</em> fields, which have 126 and 3,491 unique values respectively. While I can&#39;t think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating <em>Make</em> and the first word of the <em>Model</em> field together into a new <em>Model Type</em> field. This would allow us to, for example, categorize all <em>Chevrolet Suburban C1500 2WD</em> vehicles and all <em>Chevrolet Suburban K1500 4WD</em> vehicles as simply <em>Chevrolet Suburbans</em>.</p>
<div class="highlight"><pre><span></span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Model Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Make&#39;</span><span class="p">]</span> <span class="o">+</span> <span class="s2">&quot; &quot;</span> <span class="o">+</span>
                          <span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Model&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">split</span><span class="p">()</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
</pre></div>
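<p>On a couple of made-up rows mirroring the Suburban example above, the concatenation behaves like this:</p>

```python
import pandas as pd

df = pd.DataFrame({'Make': ['Chevrolet', 'Chevrolet'],
                   'Model': ['Suburban C1500 2WD', 'Suburban K1500 4WD']})

# Keep only the first word of Model, then prepend the Make.
df['Model Type'] = df['Make'] + " " + df['Model'].str.split().str.get(0)
```

<p>Both rows end up with a <em>Model Type</em> of <em>Chevrolet Suburban</em>, collapsing the trim-level variations into a single value.</p>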
<p>Finally, let&#39;s look at the <em>Fuel Type</em> field, which has 13 unique values. On the surface, that doesn&#39;t seem too bad, but upon further inspection, you&#39;ll notice some complexity embedded in the categories that could probably be organized more intuitively.</p>
<div class="highlight"><pre><span></span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
</pre></div><div class="highlight"><pre><span></span>array([&#39;Regular&#39;, &#39;Premium&#39;, &#39;Diesel&#39;, &#39;Premium and Electricity&#39;,
       &#39;Premium or E85&#39;, &#39;Premium Gas or Electricity&#39;, &#39;Gasoline or E85&#39;,
       &#39;Gasoline or natural gas&#39;, &#39;CNG&#39;, &#39;Regular Gas or Electricity&#39;,
       &#39;Midgrade&#39;, &#39;Regular Gas and Electricity&#39;, &#39;Gasoline or propane&#39;],
        dtype=object)
</pre></div>
<p>This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.</p>
<div class="highlight"><pre><span></span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Gas&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Ethanol&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Electric&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Propane&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Natural Gas&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span>
        <span class="s1">&#39;Regular|Gasoline|Midgrade|Premium|Diesel&#39;</span><span class="p">),</span><span class="s1">&#39;Gas&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;E85&#39;</span><span class="p">),</span><span class="s1">&#39;Ethanol&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Electricity&#39;</span><span class="p">),</span><span class="s1">&#39;Electric&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;propane&#39;</span><span class="p">),</span><span class="s1">&#39;Propane&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;natural|CNG&#39;</span><span class="p">),</span><span class="s1">&#39;Natural Gas&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</pre></div>
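<p>Note that <code>str.contains</code> treats its pattern as a regular expression, so an alternation like <code>&#39;Regular|Gasoline|Midgrade|Premium|Diesel&#39;</code> matches a row if <em>any</em> of the alternatives appear. A small check on a few made-up fuel type strings illustrates this:</p>

```python
import pandas as pd

# Toy frame with a few of the fuel type strings (not the full data set).
df = pd.DataFrame({'Fuel Type': ['Premium or E85', 'CNG', 'Regular']})

df['Gas'] = 0
df['Ethanol'] = 0

# '|' in the pattern is regex alternation: match any listed fuel type.
df.loc[df['Fuel Type'].str.contains('Regular|Gasoline|Premium'), 'Gas'] = 1
df.loc[df['Fuel Type'].str.contains('E85'), 'Ethanol'] = 1
```

<p>A row like <em>Premium or E85</em> gets flagged in both the <em>Gas</em> and <em>Ethanol</em> columns, which is exactly what lets these indicator columns preserve the fuel-type combinations.</p>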
<p>As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let&#39;s create a second set of categories - specifically, a new <em>Gas Type</em> field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.</p>
<div class="highlight"><pre><span></span><span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span>
        <span class="s1">&#39;Regular|Gasoline&#39;</span><span class="p">),</span><span class="s1">&#39;Gas Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Regular&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;Midgrade&#39;</span><span class="p">,</span>
             <span class="s1">&#39;Gas Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Midgrade&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;Premium&#39;</span><span class="p">),</span>
             <span class="s1">&#39;Gas Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Premium&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;Diesel&#39;</span><span class="p">,</span>
             <span class="s1">&#39;Gas Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Diesel&#39;</span>

<span class="n">vehicles</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Type&#39;</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;natural|CNG&#39;</span><span class="p">),</span>
             <span class="s1">&#39;Gas Type&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Natural&#39;</span>
</pre></div>
<p>An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that allow us to view the information in the data set at different (often higher) levels, and we can always drill down to the more granular original categories when needed. In other words, we now have a choice that we did not have before performing these category aggregations.</p>

<h3 id="creating-categories-from-continuous-variables">Creating Categories from Continuous Variables</h3>

<p>The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because they give me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I&#39;ve identified 4 fields that we can bin this way.</p>

<p><img alt="Binning Continuous Variables" src="https://silvrback.s3.amazonaws.com/uploads/3ca61771-8114-4ee7-a262-361f2ed31fd1/binning.png" /></p>

<p>Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a <code>qcut</code> method that makes binning extremely easy, so let&#39;s use that to create our quintiles for each of the continuous variables we identified.</p>
<div class="highlight"><pre><span></span><span class="n">efficiency_categories</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Very Low Efficiency&#39;</span><span class="p">,</span> <span class="s1">&#39;Low Efficiency&#39;</span><span class="p">,</span>
                         <span class="s1">&#39;Moderate Efficiency&#39;</span><span class="p">,</span><span class="s1">&#39;High Efficiency&#39;</span><span class="p">,</span>
                         <span class="s1">&#39;Very High Efficiency&#39;</span><span class="p">]</span>

<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Efficiency&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">qcut</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">],</span>
                                      <span class="mi">5</span><span class="p">,</span> <span class="n">efficiency_categories</span><span class="p">)</span>
</pre></div><div class="highlight"><pre><span></span><span class="n">engine_categories</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Very Small Engine&#39;</span><span class="p">,</span> <span class="s1">&#39;Small Engine&#39;</span><span class="p">,</span><span class="s1">&#39;Moderate Engine&#39;</span><span class="p">,</span>
                     <span class="s1">&#39;Large Engine&#39;</span><span class="p">,</span> <span class="s1">&#39;Very Large Engine&#39;</span><span class="p">]</span>

<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Engine Size&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">qcut</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Engine Displacement&#39;</span><span class="p">],</span>
                                  <span class="mi">5</span><span class="p">,</span> <span class="n">engine_categories</span><span class="p">)</span>
</pre></div><div class="highlight"><pre><span></span><span class="n">emission_categories</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Very Low Emissions&#39;</span><span class="p">,</span> <span class="s1">&#39;Low Emissions&#39;</span><span class="p">,</span>
                        <span class="s1">&#39;Moderate Emissions&#39;</span><span class="p">,</span><span class="s1">&#39;High Emissions&#39;</span><span class="p">,</span>
                        <span class="s1">&#39;Very High Emissions&#39;</span><span class="p">]</span>

<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Emissions&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">qcut</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;CO2 Emission Grams/Mile&#39;</span><span class="p">],</span>
                                 <span class="mi">5</span><span class="p">,</span> <span class="n">emission_categories</span><span class="p">)</span>
</pre></div><div class="highlight"><pre><span></span><span class="n">fuelcost_categories</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Very Low Fuel Cost&#39;</span><span class="p">,</span> <span class="s1">&#39;Low Fuel Cost&#39;</span><span class="p">,</span>
                       <span class="s1">&#39;Moderate Fuel Cost&#39;</span><span class="p">,</span><span class="s1">&#39;High Fuel Cost&#39;</span><span class="p">,</span>
                       <span class="s1">&#39;Very High Fuel Cost&#39;</span><span class="p">]</span>

<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Cost&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">qcut</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Fuel Cost/Year&#39;</span><span class="p">],</span>
                                <span class="mi">5</span><span class="p">,</span> <span class="n">fuelcost_categories</span><span class="p">)</span>
</pre></div>
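<p>Because <code>qcut</code> splits on quantiles, each label should cover roughly a fifth of the rows. A quick check on synthetic data (100 evenly spaced values, not the vehicle data) makes this behavior concrete:</p>

```python
import pandas as pd

values = pd.Series(range(100))  # 100 synthetic values
labels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

# qcut assigns each value to a quantile-based bin with the given labels.
binned = pd.qcut(values, 5, labels=labels)
counts = binned.value_counts()
```

<p>With evenly spaced values, each bin receives exactly 20 rows. On real data with ties and skew, the bins will only be approximately equal in size, but the quintile boundaries still adapt to the distribution of the variable.</p>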
<h3 id="clustering-to-create-additional-categories">Clustering to Create Additional Categories</h3>

<p>The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration <em>together at the same time</em>, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.</p>

<p>Clustering also creates new categories for you automatically, grouping similar items together in far less time than it would take to comb through the data yourself identifying patterns across attributes that you could form categories on.</p>

<p>The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I&#39;m a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I&#39;ll write more about my thoughts on that in future posts, but for now, let&#39;s move on to clustering our data.</p>

<p>The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.</p>
<div class="highlight"><pre><span></span><span class="n">cluster_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Engine Displacement&#39;</span><span class="p">,</span><span class="s1">&#39;Cylinders&#39;</span><span class="p">,</span><span class="s1">&#39;Fuel Barrels/Year&#39;</span><span class="p">,</span>
                   <span class="s1">&#39;City MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Highway MPG&#39;</span><span class="p">,</span><span class="s1">&#39;Combined MPG&#39;</span><span class="p">,</span>
                   <span class="s1">&#39;CO2 Emission Grams/Mile&#39;</span><span class="p">,</span> <span class="s1">&#39;Fuel Cost/Year&#39;</span><span class="p">]</span>
</pre></div>
<p>Next, we want to scale the features we are going to cluster on. There are a variety of ways to <a href="http://scikit-learn.org/stable/modules/preprocessing.html">normalize and scale variables</a>, but I&#39;m going to keep things relatively simple and just use Scikit-Learn&#39;s <code>MaxAbsScaler</code>, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don&#39;t have any negatives).</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">preprocessing</span>
<span class="n">scaler</span> <span class="o">=</span> <span class="n">preprocessing</span><span class="o">.</span><span class="n">MaxAbsScaler</span><span class="p">()</span>

<span class="n">vehicle_clusters</span> <span class="o">=</span> <span class="n">scaler</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">vehicles</span><span class="p">[</span><span class="n">cluster_columns</span><span class="p">])</span>
<span class="n">vehicle_clusters</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">vehicle_clusters</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">cluster_columns</span><span class="p">)</span>
</pre></div>
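<p>Under the hood, <code>MaxAbsScaler</code> just divides each column by its maximum absolute value, which you can verify by hand. Here is a sketch of the equivalent computation in plain pandas on a toy frame (made-up values, not the vehicle data):</p>

```python
import pandas as pd

# Toy frame with made-up values for two of the cluster columns.
df = pd.DataFrame({'City MPG': [10, 20, 40],
                   'Fuel Cost/Year': [500, 1000, 2000]})

# Equivalent of MaxAbsScaler: divide each column by its max absolute value,
# so every column's largest value becomes 1 and ratios are preserved.
scaled = df / df.abs().max()
```

<p>Each column now tops out at 1.0 while keeping its original shape, which is why this scaler preserves the distributions in the data.</p>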
<p>Now that our features are scaled, let&#39;s write a couple of functions. The first function we are going to write is a <code>kmeans_cluster</code> function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named <em>Cluster</em>.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.cluster</span> <span class="kn">import</span> <span class="n">KMeans</span>

<span class="k">def</span> <span class="nf">kmeans_cluster</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">n_clusters</span><span class="o">=</span><span class="mi">2</span><span class="p">):</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">KMeans</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">=</span><span class="n">n_clusters</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">clusters</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit_predict</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
    <span class="n">cluster_results</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
    <span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">clusters</span>
    <span class="k">return</span> <span class="n">cluster_results</span>
</pre></div>
<p>Our second function, called <code>summarize_clustering</code>, is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">summarize_clustering</span><span class="p">(</span><span class="n">results</span><span class="p">):</span>
    <span class="n">cluster_size</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;Cluster&#39;</span><span class="p">])</span><span class="o">.</span><span class="n">size</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
    <span class="n">cluster_size</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">,</span> <span class="s1">&#39;Count&#39;</span><span class="p">]</span>
    <span class="n">cluster_means</span> <span class="o">=</span> <span class="n">results</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">&#39;Cluster&#39;</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">cluster_summary</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">cluster_size</span><span class="p">,</span> <span class="n">cluster_means</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s1">&#39;Cluster&#39;</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">cluster_summary</span>
</pre></div>
<p>We now have the functions we need, so the next step is to actually cluster our data. But wait, our <code>kmeans_cluster</code> function is supposed to accept a number of clusters. How do we determine how many clusters we want?</p>

<p>There are a <a href="https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set">number of approaches</a> for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good-enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables into fewer than 8-10 discrete values? We are going to apply the same logic here. Let&#39;s start out with 8 clusters and see what kind of results we get.</p>
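<p>If you would rather let the data suggest a starting point, the classic elbow heuristic tracks k-means inertia (the sum of squared distances from each point to its nearest cluster center) as the number of clusters grows, and looks for the bend where adding more clusters stops paying off. A minimal sketch of the idea, with random data standing in for our scaled features:</p>

```python
import numpy as np
from sklearn.cluster import KMeans

# Random values in [0, 1] stand in for the scaled vehicle_clusters frame
rng = np.random.RandomState(1)
X = rng.rand(500, 7)

# Fit k-means for a range of cluster counts and record the inertia
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(X)
    inertias.append(model.inertia_)

# Inertia shrinks as k grows; the "elbow" is where improvement levels off
for k, inertia in zip(range(1, 11), inertias):
    print(k, round(inertia, 1))
```

<p>Plotting <code>inertias</code> against the cluster counts (for example with <code>plt.plot</code>) makes the elbow easy to spot visually.</p>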
<div class="highlight"><pre><span></span><span class="n">cluster_results</span> <span class="o">=</span> <span class="n">kmeans_cluster</span><span class="p">(</span><span class="n">vehicle_clusters</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="n">cluster_summary</span> <span class="o">=</span> <span class="n">summarize_clustering</span><span class="p">(</span><span class="n">cluster_results</span><span class="p">)</span>
</pre></div>
<p>After running the couple of lines of code above, your <code>cluster_summary</code> should look similar to the following.</p>

<p><img alt="Silvrback blog image" class="sb_float_center" src="https://silvrback.s3.amazonaws.com/uploads/5a920c7e-789b-4923-bdc9-8c98a10e4a0f/8_clusters.png" /></p>

<p>By looking at the Count column, you can tell that there are some clusters that have significantly more records in them (e.g., Cluster 7) and others that have significantly fewer (e.g., Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don&#39;t know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.</p>

<p>This is a prime opportunity to use a visualization to discover insights faster. With just a couple of import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>

<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">cluster_summary</span><span class="p">[</span><span class="n">cluster_columns</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
<p><img alt="Silvrback blog image" class="sb_float_center" src="https://silvrback.s3.amazonaws.com/uploads/005f29c0-88cc-4bf4-bfa6-d227b8157498/8_cluster_heatmap.png" /></p>

<p>In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or different the clusters look from one another. Our goal in clustering these features is ultimately to create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish each one from the others. This heatmap lets us do that quickly and visually.</p>

<p>With this goal in mind, it is apparent that we probably have too many clusters because:</p>

<ul>
<li> Clusters 3, 4, and 7 look pretty similar</li>
<li>Clusters 2 and 5 look similar as well</li>
<li>Clusters 0 and 6 are also a little close for comfort</li>
</ul>

<p>From the way our heatmap currently looks, I&#39;m willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let&#39;s re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.</p>
<div class="highlight"><pre><span></span><span class="n">cluster_results</span> <span class="o">=</span> <span class="n">kmeans_cluster</span><span class="p">(</span><span class="n">vehicle_clusters</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">cluster_summary</span> <span class="o">=</span> <span class="n">summarize_clustering</span><span class="p">(</span><span class="n">cluster_results</span><span class="p">)</span>

<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">cluster_summary</span><span class="p">[</span><span class="n">cluster_columns</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">annot</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre></div>
<p><img alt="Silvrback blog image" class="sb_float_center" src="https://silvrback.s3.amazonaws.com/uploads/4fe18749-0d32-45b7-a192-32c568177352/4_cluster_heatmap.png" /></p>

<p>These clusters look more distinct, don&#39;t they? Clusters 1 and 3 look like polar opposites of each other, Cluster 0 looks pretty well balanced across all the features, and Cluster 2 looks about halfway between Cluster 0 and Cluster 1.</p>

<p>We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 <em>mean</em>, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.</p>

<ul>
<li>Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I&#39;m going to label them <em>Large Inefficient</em>.</li>
<li>Cluster 3 vehicles have small, fuel efficient engines that don&#39;t produce a lot of emissions and are relatively inexpensive to fill up. I&#39;m going to label them <em>Small Very Efficient</em>.</li>
<li>Cluster 0 vehicles are fairly balanced across every category, so I&#39;m going to label them <em>Midsized Balanced</em>.</li>
<li>Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I&#39;m going to label them <em>Large Moderately Efficient</em>.</li>
</ul>

<p>Now that we have come up with these descriptive names for our clusters, let&#39;s add a <em>Cluster Name</em> column to our <code>cluster_results</code> data frame, and then copy the cluster names over to our original <code>vehicles</code> data frame.</p>
<div class="highlight"><pre><span></span><span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span>
<span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">][</span><span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Midsized Balanced&#39;</span>
<span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">][</span><span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Large Inefficient&#39;</span>
<span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">][</span><span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Large Moderately Efficient&#39;</span>
<span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">][</span><span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster&#39;</span><span class="p">]</span><span class="o">==</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;Small Very Efficient&#39;</span>

<span class="n">vehicles</span> <span class="o">=</span> <span class="n">vehicles</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">vehicles</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">cluster_results</span><span class="p">[</span><span class="s1">&#39;Cluster Name&#39;</span><span class="p">]</span>
</pre></div>
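<p>As a quick sanity check, it is worth confirming that every row received a name and that the per-name counts match the Count column of the cluster summary. Here is a self-contained sketch of the idea, with toy cluster labels standing in for the real results; the mapping mirrors the labels we chose above, and a dictionary passed to <code>map</code> is also a tidy alternative to the four separate assignments:</p>

```python
import pandas as pd

# Toy stand-in for cluster_results after clustering
cluster_results = pd.DataFrame({'Cluster': [0, 1, 1, 2, 3, 3]})

names = {0: 'Midsized Balanced', 1: 'Large Inefficient',
         2: 'Large Moderately Efficient', 3: 'Small Very Efficient'}
cluster_results['Cluster Name'] = cluster_results['Cluster'].map(names)

# Every row should have a name, and value_counts should mirror
# the Count column from the cluster summary
assert not cluster_results['Cluster Name'].isnull().any()
print(cluster_results['Cluster Name'].value_counts())
```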
<h2 id="conclusion">Conclusion</h2>

<p>In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.</p>

<p>Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!</p>

<p><em>District Data Labs provides data science <a href="http://www.districtdatalabs.com/consulting/">consulting</a> and <a href="http://www.districtdatalabs.com/training/">corporate training</a> services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? <a href="mailto:contact@districtdatalabs.com?subject=Consulting%20and%20Corporate%20Training%20Services&body=Hello!%20I&#x27;m%20interested%20in%20learning%20more%20about%20your%20data%20science%20consulting%20and%20corporate%20training%20offerings.">Let us know</a>!</em></p>
]]></content:encoded>
      </item>
  </channel>
</rss>