Data science and machine learning have traditionally revolved around building models on the assumption that individual data points are independent of one another. However, this ignores a potentially very strong signal: the relationships between data points. We will treat such connected data as a network graph and explore how to unlock that signal using a graph database.
This hands-on tutorial will begin with a discussion of how data is queried in tabular environments such as SQL or pandas DataFrames. We will show how to look for hints in that data that your problem would be better expressed as a graph problem. From there, we will give a brief introduction to the graph theory concepts most relevant to data scientists, such as centrality algorithms (e.g., PageRank), community detection algorithms, node similarity, and path finding. Next, we will discuss some standard Python packages for graph analytics, which will serve as motivation for working with graph databases, given their significant improvements in scalability and their simpler querying. We will then each create our own free graph database using the Neo4j Sandbox to do some hands-on data science. Using this database, we will demonstrate how to use standard Python packages to populate the graph and query the data within it. This will include a brief introduction to the Cypher query language, commonly used for analyzing graphs, and a discussion of why this approach is much more efficient than using a traditional relational database or in-memory graph analytics in Python. There will also be an introduction to visualizing graphs within the browser. We will conclude by showing how to create a machine learning model from a graph, based on the calculation of graph embeddings, to perform a common task such as node classification.
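As a small taste of the ideas above, the following is a minimal sketch of how a tabular edge list can be reinterpreted as a graph and scored with a centrality algorithm. It is not the tutorial's own code: the `rows` table, the names in it, and the helper functions `build_adjacency` and `pagerank` are hypothetical, and the PageRank shown is a plain power-iteration version written with the standard library only, assuming a damping factor of 0.85.

```python
# Hypothetical "who follows whom" table, shaped like rows returned
# from a SQL query or a two-column DataFrame (source, target).
rows = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "carol"),
]


def build_adjacency(edges):
    """Turn (source, target) rows into a dict of outgoing neighbors."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        # Make sure nodes with no outgoing edges still appear as keys.
        adjacency.setdefault(target, [])
    return adjacency


def pagerank(adjacency, damping=0.85, iterations=50):
    """Simple power-iteration PageRank over a directed graph."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not adjacency[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = adjacency[node]
            for target in targets:
                # Each node shares its rank equally among its targets.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


graph = build_adjacency(rows)
scores = pagerank(graph)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.4f}")
```

In this toy table, the node with the most incoming relationships ends up with the highest score, which is information that no single row carries on its own. An in-memory approach like this is fine for a handful of rows; the scalability concerns discussed in the tutorial arise as the number of nodes and relationships grows.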