Exploring Bureau of Labor Statistics Time Series
Machine learning models benefit from an increased number of features: as the saying goes, "more data beats better algorithms." In the financial and social domains, macroeconomic indicators are routinely added to models, particularly those that contain a discrete time or date. For example, loan or credit analyses that predict the likelihood of default can benefit from unemployment indicators, and a model that attempts to quantify pay gaps between genders can benefit from demographic employment statistics.
The Bureau of Labor Statistics (BLS) collects information related to the labor market, working conditions, and prices into periodic time series data. Moreover, BLS provides a public API, making it easy to ingest essential economic information into a variety of analytics. However, while BLS provides raw data and even a few reports that analyze employment conditions in the United States, its tables are geared toward specialists, and the information can be difficult to interpret at a glance.
In this post, we will review simple ingestion of BLS time series data, enabling routine collection of data on a periodic basis so that local models are as up to date as possible. We will then visualize the time series using pandas and matplotlib, exploring the available series with a functional methodology. By the end of this post, you will have a mechanism to fetch data from BLS and quickly view and explore data using the BLS series id key.
The BLS API
The BLS API currently has two versions, but you are strongly encouraged to use the V2 API, which requires registration. Once you register, you will receive an API key that authorizes your requests, ensuring that you get access to as many data sets as possible and the highest request limits.
The API is organized to return data based on the BLS series id, a string that identifies the survey type and encodes which version or facet of the data is being represented. To find series ids, I recommend going to the data tools section of the BLS website and clicking on the "top picks" button next to the survey you're interested in; the series id is provided after each series title. For example, the Current Population Survey (CPS), which provides employment statistics for the United States, lists its series; here are a few examples:
- Unemployment Rate - LNS14000000
- Discouraged Workers - LNU05026645
- Persons At Work Part Time for Economic Reasons - LNS12032194
- Unemployment Rate - 25 Years & Over, Some College or Associate Degree - LNS14027689
The series ids, in this case, start with LNS or LNU: LNS14000000, LNU05026645, LNS12032194, and LNS14027689. There are two methods to fetch data from the API. You can GET data from a single series endpoint, or you can POST a list of up to 25 ids to fetch multiple time series at a time. Generally, BLS data sets are fetched in groups, so we'll look at the multiple time series ingestion method. Using the requests module, we can write a function that returns a JSON data set for a list of series ids:
import os
import json
import requests
BLS_API_KEY = os.environ.get('BLS_API_KEY')
BLS_ENDPOINT = "http://api.bls.gov/publicAPI/v2/timeseries/data/"
def fetch_bls_series(series, **kwargs):
    """
    Pass in a list of BLS timeseries to fetch data and return the series
    in JSON format. Arguments can also be provided as kwargs:

        - startyear (4 digit year)
        - endyear (4 digit year)
        - catalog (True or False)
        - calculations (True or False)
        - annualaverage (True or False)
        - registrationKey (api key from BLS website)

    If the registrationKey is not passed in, this function will use the
    BLS_API_KEY fetched from the environment.
    """
    if len(series) < 1 or len(series) > 25:
        raise ValueError("Must pass in between 1 and 25 series ids")

    # Create headers and payload post data
    headers = {'Content-Type': 'application/json'}
    payload = {
        'seriesid': series,
        'registrationKey': BLS_API_KEY,
    }

    # Update the payload with the keyword arguments and convert to JSON
    payload.update(kwargs)
    payload = json.dumps(payload)

    # Fetch the response from the BLS API
    response = requests.post(BLS_ENDPOINT, data=payload, headers=headers)
    response.raise_for_status()

    # Parse the JSON result
    result = response.json()
    if result['status'] != 'REQUEST_SUCCEEDED':
        raise Exception(result['message'][0])

    return result
This script looks up your API key from the environment, a best practice for handling keys, which should not be committed to GitHub or otherwise saved somewhere they can be discovered publicly. You can either change the line to hard code your API key as a string, or you can export the variable in your terminal as follows:
$ export BLS_API_KEY=yourapikey
The function accepts a list of series ids and a set of generic keyword arguments, which are stored as a dictionary in the kwargs variable. The first step of the function is to ensure that between 1 and 25 series ids have been passed in (otherwise an error is raised). If so, we create request headers to send and receive JSON data and construct a payload with our request parameters. The payload is built from the keyword arguments, the registration key from the environment, and the list of series ids. Finally, we POST the request, check that it returned successfully, and return the parsed JSON data.
To run this function for the series we listed before:
>>> series = ['LNS14000000', 'LNU05026645', 'LNS12032194', 'LNS14027689']
>>> data = fetch_bls_series(series, startyear=2000, endyear=2015)
>>> print(json.dumps(data, indent=2))
You should see something similar to the following result:
{
  "Results": {
    "series": [
      {
        "seriesID": "LNS14027689",
        "data": [
          {
            "year": "2009",
            "period": "M12",
            "periodName": "December",
            "footnotes": [
              {}
            ],
            "value": "8.7"
          },
          {
            "year": "2009",
            "period": "M11",
            "periodName": "November",
            "footnotes": [
              {}
            ],
            "value": "8.8"
          }
          [… snip …]
        ]}]}}
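As an aside, the single-series GET endpoint mentioned earlier can be useful when you only need one series. The sketch below assumes the V2 convention of appending the series id to the endpoint path and passing the API key as a registrationkey query parameter; check the BLS developer documentation if your request is rejected.
def fetch_single_bls_series(blsid, **params):
    """
    Fetch a single timeseries by GETting the series endpoint directly.
    The URL layout and 'registrationkey' parameter are assumptions based
    on the BLS V2 developer documentation.
    """
    params.setdefault('registrationkey', BLS_API_KEY)
    response = requests.get(BLS_ENDPOINT + blsid, params=params)
    response.raise_for_status()
    return response.json()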
From here it is a simple matter to operationalize the routine (monthly) ingestion of new data. One method is to store the data in a relational database like PostgreSQL or SQLite so that complex queries can be run across series. As an example of database ingestion and wrangling, see the github.com/bbengfort/jobs-report repository. This project was a web/D3 visualization of the BLS time series data, but it utilized a routine ingestion mechanism as described in the README of the ingestion module. To simplify data access, we'll use a database dump from that project in the next section, but you can also use the data downloaded as JSON from the API if you wish.
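As a rough illustration of what that storage step might look like, the sketch below flattens the API response into a simple SQLite records table; the schema is illustrative only and is not the one used in the jobs-report project.
import sqlite3

def store_records(result, path='bls.db'):
    """
    Flatten a fetch_bls_series result into (blsid, period, value) rows
    and insert them into a SQLite table. Illustrative schema only.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (blsid TEXT, period TEXT, value REAL)"
    )

    for series in result['Results']['series']:
        for row in series['data']:
            # Skip annual averages (M13); keep monthly observations only.
            if row['period'] == 'M13':
                continue
            period = "{}-{}-01".format(row['year'], row['period'][1:])
            conn.execute(
                "INSERT INTO records VALUES (?, ?, ?)",
                (series['seriesID'], period, float(row['value'])),
            )

    conn.commit()
    conn.close()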
Loading Data into Pandas Series
For this section, we have created a database of BLS data (using the API) that has two tables: a series table with information describing each time series, and a records table where each row is essentially a (blsid, period, value) tuple. This allows us to aggregate and query the timeseries data effectively, particularly in a DataFrame. We've dumped the two tables as CSV files, which can be downloaded here: BLS time series CSV tables.
Querying Series Information
The first step is to create a data frame from the series.csv file so that we can query information about each time series without having to store or duplicate the data.
import pandas as pd
info = pd.read_csv('../data/bls/series.csv')
info.head()
Working backward, we can create a function that accepts a BLS ID and returns the information from the info table:
def series_info(blsid, info=info):
    return info[info.blsid == blsid]

# Use this function to lookup specific BLS series info.
series_info("LAUST280000000000003")
I use this function a fair amount to check whether I have a time series in my dataset or to look up seemingly related time series. In fact, we can see a pattern starting to emerge from this function and the API fetch function from the last section. Our basic methodology is going to be to create functions that accept one or more BLS series ids and then perform some work on them. Unifying our function signatures in this way and keying our data on a single identifier dramatically simplifies exploratory workflows.
However, the BLS ids themselves aren't necessarily informative, so, as with the previous function, we need the ability to query the data frame. Here are a few example queries:
info[info.source == 'LAUS']
This query returns all of the time series whose source is the Local Area Unemployment Statistics (LAUS) program, which breaks down unemployment by state. However, you'll notice from the previous section that series with related prefixes seem to belong together, even though the prefix does not necessarily correspond to the source. We can also query based on the prefix to find related series:
info[info.blsid.apply(lambda r: r.startswith('LNS14'))]
Combining queries like these into a functional methodology will easily allow you to explore the 3,368 series in this dataset and more as you continue to ingest series information using the API!
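For example, a small helper (a sketch; it assumes the title column present in the series.csv dump) makes it easy to search the series descriptions directly:
def search_series(term, info=info):
    # Case-insensitive search of series titles for a matching term.
    return info[info.title.str.contains(term, case=False, na=False)]

# Find every series whose title mentions unemployment.
search_series("unemployment")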
Loading the Series
The next step is to load the actual time series data into Pandas. Pandas implements two primary data structures for data analysis: the Series and DataFrame objects. Both objects are indexed, meaning that they contain more information about the underlying data than simple one- or two-dimensional arrays (which they wrap). Typically the indices are simple integers that represent the position from the beginning of the series or frame, but they can be more complex than that. For time series analysis, we can use a Period index, which indexes the series values by a granular interval (by month, as in the BLS dataset). Alternatively, for specific events you can use a Timestamp index, but periods suit our data well.
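For instance, a quick interactive check shows that a date-like string parses to a daily period by default, and asfreq coerces it to the month granularity we'll use below:
>>> pd.Period('2015-02-01')
Period('2015-02-01', 'D')
>>> pd.Period('2015-02-01').asfreq('M')
Period('2015-02', 'M')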
To load the data from the records.csv file, we need to construct one Series per BLS time series, creating a collection of them. Here's a function to do just that:
import csv
from itertools import groupby
from operator import itemgetter
# Load each series, grouping by BLS ID
def load_series_records(path='../data/bls/records.csv'):
    with open(path, 'r') as f:
        reader = csv.DictReader(f)

        for blsid, rows in groupby(reader, itemgetter('blsid')):
            # Read all the data from the file and sort
            rows = list(rows)
            rows.sort(key=itemgetter('period'))

            # Extract specific data from each row, namely:
            #   - the period at the month granularity
            #   - the value as a float
            periods = [pd.Period(row['period']).asfreq('M') for row in rows]
            values = [float(row['value']) for row in rows]

            yield pd.Series(values, index=periods, name=blsid)
In this function we use the csv module, part of the Python standard library, to read and parse each line of our opened CSV file. The csv.DictReader generates rows as dictionaries whose keys are based on the header row of the CSV file. Because each record is in the (blsid, period, value) format (with some extra information as well), we can groupby the blsid. This does require the records.csv file to be sorted by blsid, since the groupby function simply scans ahead and collects rows into the rows variable until it sees a new blsid.
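If your copy of the file isn't already sorted by blsid, a one-time fix (a sketch using pandas, assuming the blsid and period column names described above) is to sort and rewrite it before loading:
# Sort the records so that groupby sees each blsid as a contiguous block.
records = pd.read_csv('../data/bls/records.csv')
records.sort_values(['blsid', 'period']).to_csv('../data/bls/records.csv', index=False)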
Once we have our rows grouped by blsid, we can load them into memory and sort them by time period. The period value is a string in the form 2015-02-01, which is a sortable format; however, if we create a pd.Period from this string, the period will have day granularity. For each row, we use .asfreq('M') to transform the period to month granularity. Finally, we parse the values into floating point numbers for data analysis and construct a pd.Series object with the values, the monthly period index, and a name: the blsid string, which we will continue to use to query our data.
This function uses the yield statement to return a generator of pd.Series objects. We can collect all of the series into a single, correctly indexed data frame as follows:
series = pd.concat(list(load_series_records()), axis=1)
If you're using our data, the series data frame should be indexed by period, and there should be roughly 183 months (rows) in the dataset. There are also 3366 time series in the data frame, represented as columns whose column ids are the BLS IDs. If any series did not have a value for a period in the global period index, the concat function correctly fills in that value as np.nan. As you can see from a simple head(), the data frame contains a wide range of data, and the domain of every series can be dramatically different.
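A couple of quick sanity checks on the assembled frame (the exact numbers assume the CSV dump linked above):
>>> series.shape                 # roughly (183, 3366): months by series
>>> series.isnull().sum().sum()  # periods that concat filled in as np.nan
>>> series.head()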
Visualizing Series with Matplotlib
Now that we've gone through the data wrangling hoops, we can start to visualize our series using matplotlib and the plotting library that comes with Pandas. The first step is to create a function that takes a blsid as input and uses the series and info data frames to create a visualization:
import matplotlib.pyplot as plt

def plot_single_series(blsid, series=series, info=info):
    title = info.get_value(info[info.blsid == blsid].title.index[0], 'title')
    series[blsid].plot(title=title)
The first thing this function does is look up the title of the series using the series info data frame. To do this, it uses the get_value method of the data frame, which returns the value of a particular column for a particular row by index. To look up the title by blsid, we query the info data frame for that row, info[info.blsid == blsid], fetch its index, and then use that index to get the 'title' column. After that, we simply plot the series, fetching it directly from the series data frame and plotting it with Pandas.
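As an aside, the same lookup can be written with plain boolean indexing, which some may find easier to read (a sketch equivalent to the get_value call above):
# Equivalent title lookup without get_value:
title = info[info.blsid == "LNS12300000"].title.values[0]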
Warning: don't try series.plot(), which will attempt to plot a line for every series (all 3366 of them); I've crashed a few notebooks that way!
>>> plot_single_series("LNS12300000")
This function is certainly an enhancement of the series_info function from before, allowing us to think more completely about the domain, range, and structure of the time series data for a single blsid. I typically use this function when adding macroeconomic features to datasets to decide whether I should use a simple magnitude, a slope or delta, or some other representation of the data based on its shape. Even better would be the ability to compare a few time series together:
def plot_multiple_series(blsids, series=series, info=info):
    for blsid in blsids:
        title = info.get_value(info[info.blsid == blsid].title.index[0], 'title')
        series[blsid].plot(label=title)

    plt.title("BLS Time Series: {}".format(", ".join(blsids)))
    plt.legend(loc='best')
In this function, instead of providing a single blsid, the argument is a list of blsid strings. We plot each series, adding its title as the label, which allows us to create a legend with all of the series names. Finally, we add a title that lists the series blsids for reference later. We can now start making visual series comparisons:
>>> plot_multiple_series(["LNS14000025", "LNS14000026"])
One thing to note is that comparing these series worked because they had approximately the same range. However, not all series in the BLS data set share a range; they can be orders of magnitude apart. One way to combat this is to add a normalize=True argument to the plot_multiple_series function and use a normalization method to bring each series into the range [0, 1].
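Here is one possible sketch of that variant; min-max scaling is just one reasonable choice of normalization, and the function name is my own rather than part of the original tooling:
def plot_multiple_series_normalized(blsids, series=series, info=info, normalize=True):
    for blsid in blsids:
        title = info.get_value(info[info.blsid == blsid].title.index[0], 'title')
        data = series[blsid]
        if normalize:
            # Min-max scale the series into [0, 1] so that series with
            # very different magnitudes can share the same axes.
            data = (data - data.min()) / (data.max() - data.min())
        data.plot(label=title)

    plt.title("BLS Time Series: {}".format(", ".join(blsids)))
    plt.legend(loc='best')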
Conclusion
The addition of macroeconomic features to models can greatly expand their predictive power, particularly when those features inform the behavior of the target variable. When instances have some time element that can be mapped to a period, the economic data collected by the Census Bureau and BLS can be easily incorporated into models.
This post was designed to equip you with a workflow for ingesting and exploring macroeconomic data from BLS in particular. By centering our workflow on the blsid of each time series, we were able to create functions that accept an id or a list of ids and work meaningfully with them. This allows us to connect exploration on the BLS data explorer with exploration in our data frames.
Our exploration process was end-to-end with respect to the data science pipeline. We ingested data from the BLS API, then stored and wrangled that data into a database format. Computation on the timeseries involved the pd.Period and pd.Series objects, which were then aggregated into a single data frame. At all points we explored querying and limiting the data, culminating with visualizing single and multiple timeseries using matplotlib.
Helpful Links
- BLS Timeseries CSV Tables
- BLS Developer Documentation
- Jobs Report BLS Database Ingestion
- Interactive Exploration of the Employment Situation Report
Acknowledgements
Nicole Donnelly reviewed and edited this post, and Tony Ojeda helped wrangle the datasets.
District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!