Graph Analytics Over Relational Datasets with Python
The analysis of interconnection structures among entities connected through relationships has proven immensely valuable for understanding the inner workings of networks across many data domains, including finance, health care, business, and computer science. These analyses have come together under the umbrella of Graph Analytics -- the study of the characteristics of these graph structures through various graph algorithms. Examples of insights offered by graph analytics include finding clusters of closely connected entities, calculating optimal paths between entities (where the definition of optimal depends on the dataset and use case), understanding the hierarchy of entities within an organization, and determining the impact each entity has inside the network.
Graph structured data is a specialized type of dataset in terms of the way we need to access it; therefore it needs to be stored in ways that complement these access patterns. This has sparked the emergence of a wide variety of specialized graph databases such as Neo4j, OrientDB, and Titan. These systems are highly optimized specifically for efficient graph storage, traversal, and analysis.
Despite the existence of these graph database management systems, however, users do not typically store their data in a graph database unless they are strictly dealing with graph workloads. The reason is that while graph structured storage may complement graph analysis, it ultimately undermines traditional relational analytics through declarative languages like SQL, which companies and organizations have leveraged heavily for decades. Users thus typically store their data in a Relational Database Management System (RDBMS), which, backed by many years of research and development, is highly robust and optimized for these generalized analyses and a wide variety of query types.
So how do we get the best of both worlds?
GraphGen (currently under development at the University of Maryland) is a system that enables users to efficiently conduct graph analytics on the many different types of graphs that exist inside relational datasets. The main idea behind GraphGen is to let users effortlessly leverage graph analytics on top of their data, without going through a labor-intensive and time-consuming Extract, Transform and Load (ETL) process. It allows users to declaratively express any set of entities, and relationships between them, that exists in the underlying dataset, and extract it into memory where it can be analyzed -- all through a simple language abstraction! For more information on the project, please visit the GraphGen project webpage.
In this blog post, we will go through the entire process of conducting graph analytics on a relational database: writing an extraction query, extracting the graph, and analyzing it. GraphGen is natively written in Java and includes many optimizations for efficient extraction and memory usage, but for the purposes of this post we will use graphgenpy, a Python wrapper over the GraphGen system that enables quickly extracting a graph and serializing it to disk. We will demonstrate an end-to-end analysis on the openfootball football.db database, restricted to the 2014 World Cup, from extraction to analysis of a graph between players. You can download the pre-built version of the football.db SQLite database for the 2014 World Cup that we'll be using in this tutorial from here: worldcup2014.db.
Installing Graphgenpy
To install graphgenpy onto your system, simply download and uncompress the packaged zip file linked on the GraphGen webpage into your workspace, and enter the graphgen-pkg folder.
If you're using a virtual environment (virtualenv) -- which we highly recommend -- then you can simply run:
python setup.py install
Note: You may need to use sudo or Administrator privileges if you're installing to your global site-packages.
Using Graphgenpy Without Installation
If you'd prefer to try out graphgenpy in your local workspace without having to install it, first install the requirements using pip:
pip install -r requirements.txt
and then export your PYTHONPATH to include the graphgenpy folder in your local workspace:
export PYTHONPATH=$PYTHONPATH:{full path of downloaded graphgenpy folder}
After that you can immediately import graphgenpy and begin exploring graphs in your relational datasets.
Extracting a World Cup 2014 Graph
Relational datasets, as their name suggests, often include a wide variety of underlying entities and relationships between them. One of GraphGen's biggest strengths is that it enables users to explore any type of graph that may exist within the dataset, be it graphs with different types of entities (nodes) or different ways in which those entities are connected to each other (edges).
For the sake of this example, we will extract a graph of players in which there is an edge between two players if they've played against each other.
First, we need to inspect the database schema in order to find which relations (tables) we need to utilize to extract this information. A subset of the schema can be seen below:
As we can see here, the information about how we want the player nodes to be connected in this graph is held in the persons, rosters and games tables. The persons table holds the data for each player, the rosters table tells us which team each player is a part of, and the games table holds the data about each game that occurred, including which teams played against each other.
GraphGen uses a custom-written declarative language (which gets translated to SQL queries under the hood). The query that extracts the graph we want looks like this in that language:
Nodes(id, name) :- persons(id,_,name).
Edges(id1, id2) :- rosters(_,id1,team1),rosters(_,id2,team2), games(_,_,_,_,_,team1,team2).
The intuition behind the above query is that we are declaring how we want the nodes and edges in this graph to be defined in terms of the tables in this database. Here, we are declaring that we want the nodes to have two properties, an id and a name, both taken from the persons table. We are also stating that we want the node with id1 and the node with id2 to be connected to each other if there is a game in the games table where id1's team played against id2's team. For a more in-depth look at the language and at writing queries for GraphGen, please refer to the GraphGen project webpage.
The Python code that allows us to do this is quite simple:
from graphgenpy import GraphGenerator
# extraction query string
extractionQuery = """
Nodes(id, name) :- persons(id,_,name).
Edges(id1, id2) :- rosters(_,id1,team1),rosters(_,id2,team2), games(_,_,_,_,_,team1,team2).
"""
# Absolute path to the db file or dbname (if using PostgreSQL)
dbpath = ".../worldcup2014.db"
# The name of the output graph file
graph_name = "extracted_graph_football"
# Credentials for connecting to the database
# PostgreSQL Usage: GraphGenerator(db_name, host, port, username, password).
# SQLite Usage: GraphGenerator(db_path)
gg = GraphGenerator(dbpath)
# Evaluate graph extraction query and serialize the resulting graph to disk. Returns the file path of the extracted graph.
# Currently supported graph formats: GML and JSON
fpath = gg.generateGraph(extractionQuery,graph_name,GraphGenerator.GML)
Note 1: By definition, this query will include all individuals in the persons table in the database, even those who did not participate in any of the games in the games table.
Note 2: Currently, the language is limited to extracting undirected graphs with no edge properties.
We have now extracted the graph we wanted in a matter of seconds and serialized it to disk in a standard graph format -- all without needing to write a single SQL query or a script for wrangling the data into graph form.
Most importantly, however, the capabilities of the above language allow us to view this relational dataset from essentially any angle, and to explore any type of graph within it that we may think of and find interesting.
Now we can go ahead and conduct analysis on the extracted graph.
Analyzing the Graph with NetworkX
As mentioned before, a wide variety of graph databases and graph analytics engines are optimized for graph traversal and analysis, and we are now in a position to import our extracted graph and utilize any of them. For the purposes of this post, we will use NetworkX, a widely adopted graph analysis library written and distributed in Python.
NetworkX is a very rich library that implements a wide array of graph algorithms for many different types of analyses, including clustering, communities, centrality, distance measures and many more. It also parses several standard graph formats; for our purposes we will use the GML format, as extracted in the previous section's code.
Here's how we import the graph into a networkx Graph object:
import networkx as nx
# 'id' here is the node property we want to parse in as the identifier in the networkx Graph structure.
G = nx.read_gml(fpath,'id')
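To get a feel for what read_gml produces, here is a minimal, self-contained sketch that parses a tiny hand-written GML snippet with nx.parse_gml (the player names and ids are made up for illustration, not taken from the actual extracted file) and inspects the result:

```python
import networkx as nx

# A toy GML snippet mirroring the shape of the extracted file:
# player nodes carrying a 'name' property, joined by undirected edges.
gml = """graph [
  node [ id 0 label "0" name "Player A" ]
  node [ id 1 label "1" name "Player B" ]
  edge [ source 0 target 1 ]
]"""
G = nx.parse_gml(gml)

# Basic sanity checks on what was parsed
print(G.number_of_nodes())   # 2
print(G.number_of_edges())   # 1
print(G.nodes["0"]["name"])  # Player A
```

The node properties declared in the extraction query (id and name) come back as node attributes, so the rest of the analysis can look players up by id and print their names.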
Finding Major Players (PageRank)
One classic type of analysis on any graph is identifying the key or most "important" entities in the graph. To do this, we will run the PageRank algorithm on top of the graph. This assigns a score to each node based on the structure of its incoming edges -- we can then find the node with the highest PageRank score.
# Run PageRank on the graph; the algorithm returns a dictionary of scores keyed by node
d = nx.pagerank(G)
# Find the node with the maximum score
max_score = 0
max_node_p = None
for node, score in d.items():
    if score > max_score:
        max_score = score
        max_node_p = node
print('Max PageRank player is {}:{}, Score: {}'.format(max_node_p, G.nodes[max_node_p]['name'], max_score))
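Since nx.pagerank returns a plain dictionary, the score-tracking loop above can also be collapsed into a one-liner with Python's built-in max. A small sketch with made-up scores (not actual PageRank output):

```python
# Toy score dictionary standing in for the one returned by nx.pagerank
scores = {"p1": 0.12, "p2": 0.34, "p3": 0.21}

# max with key=scores.get returns the key whose value is largest
top = max(scores, key=scores.get)
print(top, scores[top])  # p2 0.34
```

The same trick works for any of the centrality dictionaries NetworkX returns, including the betweenness centrality scores below.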
Betweenness Centrality
Another interesting aspect we may want to analyze is finding the key intermediary nodes in terms of how information flows inside the graph, commonly measured by Betweenness Centrality. Vertices with high betweenness centrality have a large influence on the connectivity between their neighbors and the rest of the nodes in the graph. The code here is identical to the way we ran PageRank:
d = nx.betweenness_centrality(G)
max_score = 0
max_node_b = None
for node, score in d.items():
    if score > max_score:
        max_score = score
        max_node_b = node
print('Max betweenness centrality player is {}:{}, Score: {}'.format(max_node_b, G.nodes[max_node_b]['name'], max_score))
Finding Triangles
Another common analysis is counting how many "triangles" each node in the graph participates in, if any. A triangle is a subgraph of exactly 3 nodes connected via exactly 3 edges (i.e., a complete subgraph on 3 nodes). Triangle counting is used in a wide variety of graph mining and analysis algorithms, and can be done using networkx.
# Count the triangles each node in the graph is part of
print(nx.triangles(G))
It's interesting to see that in this specific graph there isn't a single triangle; every node in the graph is part of exactly 0 triangles.
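One way to double-check a result like this is to reduce the per-node counts to a single total: each triangle is counted once by each of its three member nodes, so summing nx.triangles over all nodes and dividing by 3 gives the number of distinct triangles. A toy sketch on a made-up graph (not the World Cup data):

```python
import networkx as nx

# A small graph containing exactly one triangle (1-2-3) plus a pendant edge
G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])

tri = nx.triangles(G)           # per-node triangle counts
total = sum(tri.values()) // 3  # each triangle is counted 3 times
print(total)  # 1
```

For the extracted World Cup graph, the same sum comes out to 0, matching the per-node counts printed above.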
Visualization
In many cases, visual analysis of a graph can also provide great value, as it brings the human factor into the mix, which is essential for pinpointing patterns in the graph structure. Visualizations also enable collaborative discussion of analysis outcomes, and make it easy to share and communicate results and ideas to others.
The simplest way to visualize a graph extracted using GraphGen is with networkx itself. The networkx library provides a variety of ways to easily draw the input graph using matplotlib. It also provides a wide array of graph layouts for laying out nodes and edges in non-overlapping ways, enabling maximal visibility of the information in the graph and simplifying visual analysis.
An example of the code we would write to draw the graph we extracted in the previous sections is:
import matplotlib.pyplot as plt
# Specify the layout of the graph
pos = nx.spring_layout(G)
# Specify the nodes we want colored differently, and their color values
val_map = {max_node_p: 1.0,
           max_node_b: 0.5714285714285714}
# Specify the nodes we want sized differently, and their sizes
sz_map = {max_node_p: 300,
          max_node_b: 300}
# Create the lists of color and size values to pass into the draw method
values = [val_map.get(node, 0.20) for node in G.nodes()]
sizes = [sz_map.get(node, 50) for node in G.nodes()]
# Draw the graph
nx.draw(G, pos, with_labels=False, width=0.1, cmap=plt.get_cmap('jet'),
        node_color=values, node_size=sizes)
# Plot the graph
plt.show()
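If you're running this in a non-interactive environment (a server, or a script without a display), you can save the figure to a file instead of calling plt.show(). A minimal self-contained sketch using a toy graph; the filename and the choice of the headless Agg backend are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display window needed
import matplotlib.pyplot as plt
import networkx as nx

# Toy graph standing in for the extracted World Cup graph
G = nx.path_graph(10)
nx.draw(G, nx.spring_layout(G), node_size=50, width=0.1)

# Write the drawing to disk instead of opening a window
plt.savefig("graph.png", dpi=150, bbox_inches="tight")
```

The same plt.savefig call can simply replace plt.show() in the drawing code above.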
Another very effective way to visualize graphs in a standard format is with a third-party tool called Gephi. Gephi provides a great Graphical User Interface (GUI) for manually applying different layouts, node sizes, colors and various edge-drawing techniques. Note that most of what Gephi does is also possible through matplotlib, but would require a significant amount of complex code to accomplish.
Below is an example of using Gephi to load the extracted .gml file and shade each node's color according to its degree. The layout algorithm used here is called Yifan Hu.
Gephi makes it effortless to resize labels, choose and re-apply different layouts, and play around with different node and edge sizes and colors. Below is an example of the same graph, but using the Fruchterman–Reingold layout and including the names of the players in a small font.
Gephi produces vector graphics, so it naturally allows for zooming into details you'd want to explore further, and lets you manually click and drag nodes of special interest into specific positions -- something significantly more complex to do using networkx and matplotlib.
Conclusion
Analysis of graph structured data has proven its worth time and again, providing invaluable insights about the relationships between entities and enabling optimizations over networks of interconnected objects. Graph analytics, however, is only one direction in which we'd like to leverage the information in our datasets, and we typically do not want to center our entire data collection workflow around graph analysis. A majority of users and businesses rely on the robust, mature features and analytical capabilities provided by relational databases. GraphGen brings the best of both worlds by enabling effortless extraction of the many different types of graphs that exist inside a relational dataset, thus opening up graph analytics to any dataset that lives inside a SQL-powered database engine.
District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!