Editor's Note: This article highlights one of the capstone projects from the Georgetown Data Science Certificate program, where several of the DDL faculty teach. We've invited groups with interesting projects to share an overview of their work here on the DDL blog. We hope you find their projects interesting and are able to learn from their experiences.
Producing online content that goes viral continues to be more art than science. Often, the virality of content depends heavily on cultural context, relevance to current events, and the mercurial interest of the target audience. In today's dynamic world of constantly shifting tastes and interests, reliance on the experience and intuition of the editing staff is no longer sufficient to generate high-engagement digital content.
Based on this understanding of the media landscape, we used this project to ask the question: is it possible to optimize the likelihood that an article's headline is clicked on? Taking a data science approach, we used Natural Language Processing and Machine Learning methodologies to analyze a summer's worth of BuzzFeed article data. We hypothesized that with sufficient data, it would be possible to identify and evaluate the most common features of successful articles.
What's a BuzzFeed?
BuzzFeed is a global network of social news and entertainment content that is currently published in 11 countries, each with its own landing page. The website is frequently accessed outside of these 11 countries, though it does not target them directly. Of the 11 countries, 5 are English speaking: Australia, Canada, India, the United Kingdom, and the United States.
BuzzFeed was selected for this project because it can be considered a successful digital media company, one that has become an example of how to commodify the popularity of digital content. In fact, its entire business model is built on this premise. Even so, it still finds itself prioritizing quantity over quality, acting as what might be referred to as a "content machine gun" that churns out as many articles as possible in the hope that something will stick with the audience. Additionally, BuzzFeed's trending article API was free and open to the public, which allowed us to query their data on a regular basis without needing to request and frequently update developer keys or credentials.
The number of visitors to the BuzzFeed site has been rising steadily since 2013. In the first eight months of 2016, the monthly number of unique visitors ranged from 180 million to 210 million, generating between 440 million and 500 million visits to the site. On average, the United States is responsible for approximately 50% of this traffic.
The project's approach was formed based on two central hypotheses:
The country in which an article is written has an impact on where the article will go viral. In other words, we believe that the popular culture within a given country is a good indicator of what types of content are likely to go viral within that country's digital space.
Certain aspects of the language associated with an article, i.e. topics, keywords, and categories, are indicative of popularity and audience location. In other words, language plays a major role not only in terms of what will go viral but also in terms of where it will go viral.
Building a Usable Dataset from Scratch
To collect data from five different BuzzFeed API queries (we hit each English site's trending article list), we set up an Ubuntu instance on Amazon Web Services' EC2 service, which allowed us to schedule a crontab job to run our Python query scripts every hour. We then saved each API response as a JSON file in a WORM (write once, read many) directory. Each call returned data points on the top 25 articles for each location. We then parsed the JSON files into a relational format and loaded each article as a row into a Postgres database, which allowed us to see very quickly what type of data we had.
Below is a visual glimpse of a portion of our final data set. On the left, you'll see a cluster of the data pulled from the US API. To better understand what we had, we color-coded the articles based on the country of origin and sized each one based on its frequency (i.e. the number of occurrences) within the dataset. On the right is the same visualization, but for data pulled from the India API.
Once we had a sufficient number of instances, we realized that our analysis would benefit from generating some additional data points. In particular, because BuzzFeed's methodology for determining which articles appear on their trending lists is proprietary and was not available for our consideration, we quickly determined that we needed to generate a metric to indicate what we meant when we used the term virality.
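The collection steps described above can be sketched roughly as follows. This is a minimal illustration rather than our actual scripts: the payload field names ("articles", "id", "title", "impressions"), the file naming scheme, and the table layout are all assumptions, and SQLite from the standard library stands in for Postgres so that the example is self-contained. Fetching the response from the API is left out.

```python
import json
import sqlite3
from datetime import datetime
from pathlib import Path

# A crontab entry along these lines would run a script hourly
# (both paths are hypothetical):
#   0 * * * * /usr/bin/python3 /home/ubuntu/pull_trending.py


def save_response(payload: dict, country: str, out_dir: Path, now: datetime) -> Path:
    """Write one API response to a new timestamped file, never overwriting."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{country}_{now:%Y%m%dT%H%M%S}.json"
    # Mode "x" raises FileExistsError if the file already exists,
    # which approximates write-once behavior at the script level.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return path


def load_rows(path: Path, conn: sqlite3.Connection) -> int:
    """Flatten one saved response into one row per article."""
    country, pulled_at = path.stem.split("_")
    payload = json.loads(path.read_text(encoding="utf-8"))
    rows = [
        (country, pulled_at, position, item["id"], item["title"], item["impressions"])
        for position, item in enumerate(payload["articles"], start=1)
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trending ("
        "country TEXT, pulled_at TEXT, position INTEGER, "
        "article_id TEXT, title TEXT, impressions INTEGER)"
    )
    conn.executemany("INSERT INTO trending VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Keeping the raw JSON untouched and loading rows in a separate step means the relational schema can be revised later and rebuilt from the saved files.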
Ultimately, we defined our own heuristic function:
Virality = Log(Impressions * Frequency).
We then used this metric to generate a new category for our data, treating our target as a classification problem. What we came up with was essentially a virality scale from 1 to 3, where 1 was the least viral and 3 was the most viral.
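As a rough sketch of the heuristic and the three-level scale, the snippet below computes the score and then bins it. Two things here are assumptions: the natural logarithm is used because the base of the log is not stated, and the scores are split into classes by rank terciles, which is only one plausible way to arrive at a 1 to 3 scale.

```python
import math


def virality(impressions: int, frequency: int) -> float:
    """Heuristic score: log of impressions multiplied by frequency."""
    # Natural log is an assumption; the formula does not name a base.
    return math.log(impressions * frequency)


def to_classes(scores: list[float]) -> list[int]:
    """Assign class 1, 2, or 3 by rank tercile (an assumed binning rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    classes = [0] * len(scores)
    for position, idx in enumerate(order):
        # The lowest third of ranks maps to 1, the middle to 2, the top to 3.
        classes[idx] = 1 + (3 * position) // len(scores)
    return classes


# Invented (impressions, frequency) pairs, for illustration only.
samples = [(200000, 12), (1000, 1), (50000, 4)]
scores = [virality(i, f) for i, f in samples]
labels = to_classes(scores)
```

Because the logarithm is monotonic, it does not change the rank order of articles; it only compresses the long tail of very large impression counts.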
Data Modeling with Textual Data
Applying a Machine Learning solution to a problem that consists mostly of textual and categorical data can be tricky. We leaned heavily on the open source Scikit-Learn Python library to do the heavy lifting.
We vectorized the text associated with each article across the entire data set, transforming our textual values into numerical vectors. To perform this vectorization, we used the CountVectorizer class available through Scikit-Learn. Another tool we relied on heavily was Scikit-Learn's pipeline module, which allowed us to consistently transform our data and obtain human-readable outputs, or predictions, from our models.
Once our pipeline was fine-tuned, we trained two classification models on our data: Multinomial Naive Bayes and Logistic Regression. With our pipeline, Logistic Regression performed best, with a cross-validated accuracy score of 0.864645.
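To make the vectorization step concrete, here is a minimal sketch using Scikit-Learn's CountVectorizer and Pipeline. The headlines and class labels are invented toy data, the step names are arbitrary, and the default vectorizer settings are an assumption, not the configuration we tuned.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy headlines and virality classes, invented for illustration only.
headlines = [
    "21 photos of cats being cats",
    "which city should you live in",
    "cats that are done with mondays",
    "the best city for coffee lovers",
]
classes = [3, 1, 3, 2]

# On its own, the vectorizer turns each headline into a row of token counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(headlines)

# Inside a pipeline, the same transformation is applied consistently
# at both training time and prediction time.
model = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", MultinomialNB()),
])
model.fit(headlines, classes)
prediction = model.predict(["photos of cats in the city"])
```

The resulting `counts` matrix has one row per headline and one column per distinct token, and `prediction` holds a class on the 1 to 3 scale for the new headline.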
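A comparison of this kind can be sketched with Scikit-Learn's cross_val_score. The function name `compare_models`, the number of folds, and the `max_iter` setting are assumptions made for illustration; the sketch is not expected to reproduce the score reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, folds=3):
    """Return the mean cross-validated accuracy for each candidate model."""
    candidates = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, estimator in candidates.items():
        # Vectorizing inside the pipeline refits the vocabulary on each
        # training fold, so held-out text never leaks into it.
        pipe = make_pipeline(CountVectorizer(), estimator)
        scores = cross_val_score(pipe, texts, labels, cv=folds, scoring="accuracy")
        results[name] = scores.mean()
    return results
```

Calling `compare_models(texts, labels)` with a list of strings and a matching list of class labels returns a dictionary of mean accuracies keyed by model name.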
Side-by-Side Comparison of our Models' Results:
What did we learn?
Most capstone projects tend to be an exercise in building a large machine and driving it simultaneously, and ours was no exception. Throughout the project, we were constantly faced with the need to ramp up knowledge in an unfamiliar area in order to make progress. In particular, we felt this acutely when attempting to apply a Machine Learning solution to a large amount of textual data, a practice that requires both a unique approach and at least basic familiarity with vectorization methodologies.
Additionally, we realized too late in the project that we had made the mistake of spreading our resources too thin by attempting to work on other problems at the same time as we were working on our trending prediction model. Had the entire focus of the project been on this one issue, we might have determined a better way to address some of the setbacks we encountered.
Pulling data from a public API that is not well documented and that includes publication-specific variables and terminology left us with a significant gap in our domain knowledge, despite one team member's background in web analytics.
What could have been better?
As we made progress collecting and augmenting our dataset, we quickly realized that our analysis might benefit from also collecting the contents of each article. However, because we were using BuzzFeed as our primary data source, a significant number of articles consisted of aggregated content, that is, content pulled in from other web platforms such as Twitter, Tumblr, and YouTube. Moreover, the API data did not contain a direct link to the articles themselves, though this issue might have been sidestepped by combining the "username" and "uri" data points.
Ultimately, incorporating additional article data would have required a significant level of time and effort that was simply not available for this project. Nevertheless, a more thorough approach might have found a way to web-scrape the content of the articles for which we were pulling API data.
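The link reconstruction mentioned above might look something like the sketch below. We did not implement or verify this: the URL pattern (site root, then username, then uri) and the `BASE_URL` constant are assumptions, and the example values are made up.

```python
from urllib.parse import quote

# Assumed site root; the real link structure was not confirmed.
BASE_URL = "https://www.buzzfeed.com"


def article_url(username: str, uri: str) -> str:
    """Join the two API fields into a candidate article link."""
    # Strip stray slashes and percent-encode each path segment.
    user_part = quote(username.strip("/"))
    slug_part = quote(uri.strip("/"))
    return f"{BASE_URL}/{user_part}/{slug_part}"


candidate = article_url("someuser", "some-article-slug")
```

Any link built this way would need to be checked against the live site before being passed to a scraper.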
Practically speaking, a project like this might allow news media organizations to evaluate the potential popularity of a given article based on its textual inputs. This would not act as a stand-in for editorial judgment, but rather as a tool to supplement and refine the process as it currently stands.
Rather than churning out excessive digital content and hoping that something will stick, further progress in this area would allow editors and content authors to see a macro-level view of what content is successful for each region and refine their efforts based on this perspective. These efforts could be improved even further if one were to train the models categorically to generate predictions that are contingent on the audience that each category serves.