This tutorial introduces machine learning on graphs, giving participants insight into techniques and algorithms through multiple examples built on open source code and public datasets, including a text dataset and Twitter networks. Finally, we will show how you can use these techniques to improve your everyday machine learning tasks.
In recent years, the amount of available data has increased drastically. However, labelling such data is hugely expensive. In this scenario, semi-supervised learning emerges as a vitally important tool: it combines labelled data (as in supervised learning) with unlabelled data (as in unsupervised learning) in order to make better predictions. In particular, graph-based algorithms take into account the relationships between the instances of the data and the underlying graph structure to make those predictions. In addition, in data analysis there are scenarios that can naturally be thought of as graphs. This occurs when, in addition to the individual properties of each element, the connectivity between the elements of the dataset is also important. It is therefore logical for machine learning models to include information from both a node and its neighbours when making a prediction.
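To make the semi-supervised idea concrete, here is a minimal sketch of a simple iterative label propagation scheme on the Zachary karate club graph bundled with networkx. It is only an illustration under stated assumptions, not necessarily the method used in the tutorial: we assume that only the two club leaders (nodes 0 and 33) are labelled, and that every other node repeatedly takes the average of its neighbours' label scores.

```python
import networkx as nx
import numpy as np

# Zachary karate club graph shipped with networkx.
G = nx.karate_club_graph()
nodes = list(G.nodes())
index = {n: i for i, n in enumerate(nodes)}

# Unweighted adjacency matrix, row-normalised so each row averages over neighbours.
A = (nx.to_numpy_array(G, nodelist=nodes) > 0).astype(float)
P = A / A.sum(axis=1, keepdims=True)

# Assumption for this sketch: only the two leaders are labelled.
# Class 0 = "Mr. Hi" (node 0), class 1 = "Officer" (node 33).
labelled = {0: 0, 33: 1}

F = np.zeros((len(nodes), 2))
for node, cls in labelled.items():
    F[index[node], cls] = 1.0

for _ in range(100):
    F = P @ F  # each node averages the scores of its neighbours
    for node, cls in labelled.items():  # clamp the known labels
        F[index[node]] = 0.0
        F[index[node], cls] = 1.0

predicted = F.argmax(axis=1)
true = np.array([0 if G.nodes[n]["club"] == "Mr. Hi" else 1 for n in nodes])
accuracy = (predicted == true).mean()
print(f"accuracy with 2 labelled nodes out of {len(nodes)}: {accuracy:.2f}")
```

The point of the sketch is that the unlabelled nodes still contribute through the graph structure: the two known labels spread along the edges until every node has a score for each class.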
In this tutorial we will go through how to:
- Model a graph using networkx.
- Understand and calculate graph properties such as the degree, PageRank or betweenness centrality using networkx.
- Detect communities in a graph using networkx.
- Combine networkx and sklearn in order to train a machine learning classifier and make predictions.
- Adapt common machine learning models to a graph dataset using edge-based regularisation.
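As a rough preview of the first three items, the following sketch uses the Zachary karate club graph that ships with networkx. The choice of `greedy_modularity_communities` for community detection is an assumption made for illustration; networkx offers several other community detection algorithms and the tutorial may use a different one.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Build the graph: here we load a small example bundled with networkx.
G = nx.karate_club_graph()
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Node-level properties, each returned as a {node: value} mapping.
degree = dict(G.degree())
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

# Show the five nodes with the highest PageRank alongside their other properties.
for node in sorted(pagerank, key=pagerank.get, reverse=True)[:5]:
    print(
        f"node {node}: degree={degree[node]}, "
        f"pagerank={pagerank[node]:.3f}, "
        f"betweenness={betweenness[node]:.3f}"
    )

# Community detection by greedy modularity maximisation
# (an illustrative choice, not the only option).
communities = greedy_modularity_communities(G)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```

Each property is a plain dictionary keyed by node, which makes it straightforward to turn the results into columns of a pandas DataFrame or a feature matrix later on.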
The examples covered in this tutorial will be:
- The Zachary karate club.
- The Twitter network during the 2017 Argentinean elections (including millions of political tweets).
- The Cora citation dataset (which includes features extracted from the text of the papers).

All the code and the datasets will be uploaded to GitHub.
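As a sketch of how networkx and sklearn might be combined on the first of these examples, the code below builds a small feature matrix from graph properties of the karate club and cross-validates a classifier that predicts club membership. The feature set (degree, average neighbour degree, PageRank, and shortest-path distance to nodes 0 and 33, the two club leaders) and the choice of logistic regression are assumptions made for illustration, not the tutorial's actual pipeline.

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

G = nx.karate_club_graph()
nodes = list(G.nodes())

# Graph-derived features: the node itself, its neighbours, and its position.
degree = dict(G.degree())
neighbour_degree = nx.average_neighbor_degree(G)
pagerank = nx.pagerank(G)
dist_to_0 = nx.single_source_shortest_path_length(G, 0)    # hops to node 0
dist_to_33 = nx.single_source_shortest_path_length(G, 33)  # hops to node 33

X = np.array(
    [
        [degree[n], neighbour_degree[n], pagerank[n], dist_to_0[n], dist_to_33[n]]
        for n in nodes
    ]
)

# Target: the club each member joined after the split, stored as a node attribute.
y = np.array([1 if G.nodes[n]["club"] == "Officer" else 0 for n in nodes])

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```

Once the graph has been summarised as a feature matrix, the rest of the workflow is ordinary sklearn: any classifier and any cross-validation strategy can be swapped in.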
The intention of this tutorial is to introduce the topic, give participants deep insight into graph-based algorithms, and show how you can use them to improve the performance of your everyday machine learning tasks.
Minimal knowledge of the pandas (how to import and manipulate a dataset) and sklearn (how to perform a classification) packages is expected. Familiarity with classification, cross-validation and basic graph theory (what a node, an edge or a community is) is also assumed, so that we can focus on what matters: how to adapt machine learning techniques and implementations to graph datasets.