Topic models can be difficult to interpret; making sense of them has often been likened to reading tea leaves. This talk suggests an approach for aggregating topics across many model runs (in particular using Apache Spark) to create higher-confidence clusters, which can then feed further pipelines or potentially streamline and expedite human annotation.
This presentation describes proposed improvements in which multiple topic model runs are aggregated to produce more coherent, consistent topics, each with a relative confidence score, without having to define a target topic count up front. This approach has been used to generate the models supporting a recent paper in HDSR on COVID-19 research. The talk also suggests approaches for making the documents associated with each topic more relevant, and describes a pipeline by which the models can be generated efficiently at scale.
This is a personal project: a pipeline that uses Apache Spark to build the models, aggregate the topics, and then generate unique visualizations on top of the aggregated clusters. Experiments were performed on several datasets, including PubMed abstracts, US Patent Office grants, the CORD-19 dataset for COVID-19 research, and EEBO (Early English Books Online).
Experiments were run using the Apache Spark LDA implementation. An individual Spark LDA model tends to have a slightly lower coherence score than a comparable model generated via gensim. However, the fast runtimes of the natively parallelized implementation allow many model runs to be generated in a short time on a relatively small cluster. For example, in a couple of experiments ~45 models with different topic counts (50, 100, 200, 400) were generated for a corpus of ~30k PubMed abstracts within a few hours; for a different set of ~400k abstracts, the same number of models took less than a day to generate.
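A minimal sketch of how such a sweep over topic counts and seeds might look with pyspark's LDA; it assumes an existing SparkSession and a DataFrame `docs_df` with a "features" column produced by a CountVectorizer, and the counts and seed range are illustrative rather than the exact experimental settings.

```python
# Sketch: generate many Spark LDA models across topic counts and random seeds.
# Assumes `docs_df` holds term-count vectors in a "features" column.
from pyspark.ml.clustering import LDA

topic_counts = [50, 100, 200, 400]
seeds = range(10)  # several runs per topic count

runs = []
for k in topic_counts:
    for seed in seeds:
        lda = LDA(k=k, maxIter=50, seed=seed, featuresCol="features")
        model = lda.fit(docs_df)
        # describeTopics returns the top terms and weights per topic,
        # which is the input aggregated downstream.
        runs.append((k, seed, model.describeTopics(maxTermsPerTopic=20)))
```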
The approach allows consistent topics to bubble to the top, provides a more natural way to arrive at the number of topics/clusters, and smooths out the terms within each topic cluster. Clusters are formed by aggregating topics whose pairwise Jensen-Shannon distance falls below a threshold, together with some network-based heuristics; and because the output is based on an ensemble of topic models, a confidence/ubiquity score can be associated with each topic cluster in the aggregated model.
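The following is a hedged sketch of that aggregation idea: topics from all runs become graph nodes, edges connect pairs whose Jensen-Shannon distance is below a threshold, and each connected component becomes a topic cluster. The threshold value, function names, and the choice of connected components (rather than the talk's specific network heuristics) are illustrative assumptions.

```python
# Sketch: cluster topics from many model runs by Jensen-Shannon distance.
import numpy as np
import networkx as nx
from scipy.spatial.distance import jensenshannon

def cluster_topics(topics, run_ids, threshold=0.3):
    """topics: list of term-probability vectors over a shared vocabulary;
    run_ids: index of the model run each topic came from."""
    g = nx.Graph()
    g.add_nodes_from(range(len(topics)))
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if jensenshannon(topics[i], topics[j]) < threshold:
                g.add_edge(i, j)

    clusters = []
    for component in nx.connected_components(g):
        members = sorted(component)
        # Confidence/ubiquity: fraction of model runs contributing to the cluster.
        confidence = len({run_ids[m] for m in members}) / len(set(run_ids))
        # Smoothed term distribution: average of the member topics.
        centroid = np.mean([topics[m] for m in members], axis=0)
        clusters.append({"members": members,
                         "confidence": confidence,
                         "terms": centroid})
    return clusters
```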
A blend of open-source Python technologies was used to build out this pipeline, including pyspark, gensim, sklearn, networkx, and django, as well as AngularJS and cytoscape for the frontend.