Saturday 11:00–11:35 in Auditorium

SAMENVATTR, or, How to Make a Text Summarization Tool for your Language of Choice

Robert Rodger

Audience level:
Intermediate

Description

Ever wonder how the autotldr bot on Reddit works? Ever want to build "tl;dr" functionality into your text processing system? Ever try to do so, only to find that all of the good out-of-the-box options are coded to work with English text and not your desired input language? If you answered "yes" to any of the above questions, this talk is for you.

Abstract

Gensim, a Python library for text processing best known for its word embedding and topic modeling capabilities, also has a top-notch extractive summarization feature useful for adding "tl;dr" functionality to your code. Unfortunately, it only supports English input out of the box. In this talk, I'll first describe TextRank, the algorithm underlying Gensim's summarizer, and then I'll demonstrate how we can use this knowledge to modify Gensim's internals to support summarization in our language of choice. I'll also demo SAMENVATTR, my modification for the Dutch language.
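
To give a feel for the idea behind TextRank, here is a minimal sketch in plain Python. It is not Gensim's code and not the SAMENVATTR implementation: it assumes the word-overlap similarity from the original TextRank paper (computed here on sets of words), a fixed number of PageRank-style iterations, and a tiny hypothetical Dutch stopword list named DUTCH_STOPWORDS. All function names are made up for illustration. The point is that the language-specific parts (tokenization and stopwords) are separate from the graph ranking, which is what makes swapping in another language feasible.

```python
import math
import re

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "in", "is", "op", "dat", "te"}


def tokenize(sentence, stopwords):
    """Lowercase, keep runs of letters, drop stopwords; return a set of words."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    return {w for w in words if w not in stopwords}


def similarity(a, b):
    """Shared words, normalized by the log of each sentence's word count."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, stopwords, damping=0.85, iterations=50):
    """Score sentences by running weighted PageRank on the similarity graph."""
    tokens = [tokenize(s, stopwords) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, stopwords, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = textrank(sentences, stopwords)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


example = [
    "De kat zit op de mat.",
    "De kat en de hond spelen op de mat.",
    "De hond blaft.",
    "Het weer is mooi vandaag.",
]
print(summarize(example, DUTCH_STOPWORDS, k=1))
```

In this toy input the second sentence shares words with two others, so it ends up as the most central node in the graph. Supporting another language in a sketch like this would mean replacing the stopword set and, where needed, the tokenizer; the ranking step stays the same.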
