A fast-growing online platform produces numerous metrics. As the number of metrics increases, methods of exploratory data analysis become more and more important. We will show how recognizing similar metrics and clustering them can make monitoring feasible and provide a better understanding of their mutual dependencies.
Monitoring the growth of an online platform means keeping track of a variety of metrics: activity, user retention, revenue, marketing campaigns, etc. These metrics are often split by characteristics such as gender and country. The resulting multitude of trackable metrics quickly exceeds what can be maintained and monitored manually. More importantly, unexpected increases or decreases may require action and, at the very least, raise the following questions: Which parts of the platform are affected? Can the effect be isolated to single device types, countries or other dimensions? Possible causes (e.g. outages, spam attacks, bug fixes, changes in marketing activity) are in most cases not obvious, particularly when multiple factors interact. Knowing which metrics evolve in a similar way is essential for understanding how they interact and affect each other. However, comparing pairwise correlations of metrics manually is tedious and does not reveal the whole picture. Considering clusters of similar metrics can provide better insights than investigating single metrics or single pairs of metrics. Clusters of similar metrics give a clear, condensed representation of how the platform's dynamics evolve. Furthermore, the cluster assignments show which metrics are closely connected in terms of positive correlation. To obtain a proper clustering, we compute a similarity for each pair of metrics, collect these values in a similarity matrix, and then use spectral clustering to detect a block structure in that matrix. In the talk we will sketch how spectral clustering works on a toy example of a small perturbed block matrix with three (quasi-)blocks, and apply the algorithm to real masked data. Once connected clusters of metrics have been identified, we can compute representative averages for each cluster.
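To make the toy example concrete, here is a minimal sketch of spectral clustering on a small perturbed block matrix with three quasi-blocks. It is not the code from the talk: the block sizes, the noise level, the use of the normalized graph Laplacian with row-normalized eigenvectors, and the small k-means step with farthest-point initialization are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def toy_block_similarity(sizes=(4, 3, 5), noise=0.1, seed=0):
    """Symmetric similarity matrix with three quasi-blocks (synthetic toy data)."""
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # exact blocks
    E = rng.uniform(0.0, noise, size=(n, n))
    E = (E + E.T) / 2.0                     # symmetric perturbation
    S = np.abs(ideal - E)                   # ~1 inside a block, ~0 outside
    np.fill_diagonal(S, 1.0)
    return S, labels


def spectral_clustering(S, k, n_iter=50):
    """Cluster the rows of a similarity matrix S into k groups."""
    # Normalized Laplacian L = I - D^{-1/2} S D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # rows on the unit sphere

    # Simple k-means on the embedded rows, farthest-point initialization.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = U[assign == j].mean(axis=0)
    return assign


# Shuffle rows and columns so the block structure is hidden, then recover it.
S, truth = toy_block_similarity()
perm = np.random.default_rng(42).permutation(len(S))
S_perm = S[np.ix_(perm, perm)]
truth_perm = truth[perm]
pred = spectral_clustering(S_perm, k=3)
print(pred)
print(truth_perm)
```

The predicted labels should agree with the hidden blocks up to a renaming of the cluster ids. In practice a library implementation of spectral clustering would likely replace the hand-written steps.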
We will close the talk with a visualization that highlights which metrics fall into which clusters, together with confidence bounds for the cluster averages that illustrate the spread of each cluster. The examples will be available as a Jupyter notebook at https://github.com/metterlein/spectral_clustering.
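As a rough illustration of the similarity matrix and the cluster averages with bounds, the following sketch works on synthetic metrics. Everything specific here is an assumption rather than the method of the talk: the three latent drivers, the clipping of negative correlations to zero, the z-scoring of each metric, and the normal-approximation bounds (mean plus or minus 1.96 standard errors across cluster members). The label vector stands in for the output of a clustering step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 120
t = np.arange(n_days)

# Three hypothetical latent drivers (synthetic, for illustration only).
drivers = np.vstack([
    0.05 * t,                          # steady growth
    np.sin(2 * np.pi * t / 7.0),       # weekly seasonality
    np.cos(2 * np.pi * t / 30.0),      # monthly cycle
])

# Nine synthetic metrics; labels stand in for the clustering output.
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
scale = rng.uniform(0.5, 5.0, size=len(labels))
noise = rng.normal(0.0, 0.1, size=(len(labels), n_days))
metrics = scale[:, None] * (drivers[labels] + noise)

# Similarity matrix: pairwise correlation, keeping only positive correlation.
S = np.clip(np.corrcoef(metrics), 0.0, None)

# Z-score every metric so that differently scaled metrics can be averaged.
Z = (metrics - metrics.mean(axis=1, keepdims=True)) / metrics.std(axis=1, keepdims=True)

# Per cluster: average curve and approximate 95% bounds for that average.
summary = {}
for c in np.unique(labels):
    members = Z[labels == c]
    center = members.mean(axis=0)
    stderr = members.std(axis=0, ddof=1) / np.sqrt(len(members))
    summary[int(c)] = (center, center - 1.96 * stderr, center + 1.96 * stderr)

for c, (center, lower, upper) in summary.items():
    width = float(np.mean(upper - lower))
    print(f"cluster {c}: {np.sum(labels == c)} metrics, mean band width {width:.3f}")
```

Plotting each `center` with the region between `lower` and `upper` shaded would give one panel per cluster; a wide band indicates that the member metrics disagree at that point in time.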