Saturday 12:15–13:00 in A208

Evaluating Topic Models

Matti Lyra

Audience level:


Unsupervised models in natural language processing (NLP) have become very popular recently. Word2vec, GloVe and LDA provide powerful computational tools for dealing with natural language and make exploring large document collections feasible. We would like to be able to say whether a model is objectively good or bad, and to compare different models with each other, but this is often tricky to do in practice.


Supervised models are trained on labelled data and optimised against an external metric such as log loss or accuracy. Unsupervised models, on the other hand, typically try to fit a predefined distribution so that it is consistent with the statistics of some large unlabelled data set, or to maximise the vector similarity of words that appear in similar contexts. Evaluating the trained model often starts by "eyeballing" the results, i.e. checking that the model fulfils your own expectations of similarity.
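As a minimal sketch of what "eyeballing" might look like for word vectors, the snippet below ranks the neighbours of a word by cosine similarity. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for illustration; a real check would use vectors from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "pen": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Return the other words sorted by decreasing similarity to `word`."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for neighbour, score in most_similar("cat", toy_vectors):
    print(f"cat ~ {neighbour}: {score:.3f}")
```

With these toy vectors "dog" ranks above "pen" as a neighbour of "cat", which is the kind of expectation one would check by eye before reaching for a formal metric.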

Documents that talk about football should be in the same category, and "cat" should be more similar to "dog" than to "pen". But is "cat" more similar to "tiger" than to "dog"? Ideally this information should be captured in a single metric that can be maximised. Tools such as pyLDAvis and gensim provide many different ways to get an overview of the learned model or a single metric that can be optimised: topic coherence, perplexity, ontological similarity, term co-occurrence, word analogy. Using these methods without a good understanding of what the metric represents can give misleading results. Unsupervised models are also often used as part of larger processing pipelines. It is not clear whether these intrinsic evaluation measures are appropriate in such cases; perhaps the models should instead be evaluated against an external metric, such as accuracy, for the entire pipeline.
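To give a feel for what a coherence score measures, here is a simplified sketch of a UMass-style topic coherence computed from document co-occurrence counts. It is not the gensim implementation; the function name `umass_coherence` and the toy corpus are assumptions made for this example.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l))
    over ordered pairs of a topic's top words, where D counts the
    documents containing all of the given words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denominator = doc_freq(top_words[l])
            if denominator == 0:
                raise ValueError(f"{top_words[l]!r} does not occur in the corpus")
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denominator)
    return score


# Hypothetical tokenised corpus, invented for illustration only.
toy_docs = [
    ["football", "goal", "match"],
    ["football", "match", "referee"],
    ["goal", "match", "football"],
    ["cat", "dog", "pet"],
    ["dog", "pet", "vet"],
]

coherent_topic = ["football", "match", "goal"]
mixed_topic = ["football", "dog", "goal"]

print("coherent:", umass_coherence(coherent_topic, toy_docs))
print("mixed:   ", umass_coherence(mixed_topic, toy_docs))
```

On this toy corpus the football topic scores higher than the mixed one because its words actually co-occur in documents. The score depends entirely on the reference corpus used for counting, which is one reason such a number can mislead if read without context.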

In this talk I will give an intuition of what the evaluation metrics are trying to achieve, give some recommendations for when to use them, and discuss the pitfalls one should be aware of when using topic models, as well as the inherent difficulty of measuring or even defining semantic similarity concisely.

I assume that you are familiar with topic models; I will not cover how they are defined or trained. I will talk specifically about the tools that are available for evaluating a topic model, irrespective of which algorithm you've used to learn one. The talk is accompanied by a notebook at
