Friday 15:00–16:30 in GoDataDriven

Tricks, tips and topics in Text Analysis

Bhargav Srinivasa Desikan

Audience level:
Novice

Description

There is an abundance of easily mineable text data (WhatsApp, Twitter, and even our own e-mails!), and we have no excuse not to analyse it. In this workshop, we will learn some tips and tricks for dealing with messy text data, before moving on to some less commonly covered text analysis techniques, such as text summarisation, working with distance metrics, and an old personal favourite: topic models.

Abstract

In this workshop we will cover the entire text analysis process: we'll do a little tweet collection and web scraping, set up a pipeline to clean and annotate our text, and get it ready for more advanced analysis. We will also spend some time discussing the different ways you can create your own dataset for text analysis, and how much of our personal textual data (e-mails, messages, tweets) can be accessed and used.
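As a rough idea of what a cleaning step in such a pipeline might look like, here is a minimal sketch using only the Python standard library. It is not the workshop's own code: the function name `clean_text`, the regular expressions, and the tiny stop-word list are all illustrative assumptions, and a real pipeline would more likely lean on a library such as spaCy.

```python
import re

# Patterns for common tweet noise (illustrative assumptions, not a standard).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")

# A deliberately tiny stop-word list, made up for this example.
STOP_WORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "it", "at"}


def clean_text(raw):
    """Lowercase, strip URLs and @mentions, tokenise, and drop stop words."""
    text = raw.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOP_WORDS]


sample = "Loving the #PyData workshop at @GoDataDriven! Slides: https://example.com/slides"
print(clean_text(sample))
```

Hashtags are kept as plain words here (the `#` is simply dropped by the tokeniser); whether to keep, strip, or treat them separately is a design choice that depends on the analysis.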

After this, we'll introduce two different approaches to playing with text. One is a more statistical, machine learning approach, where we will use Topic Modelling, Clustering/Classification, and Deep Learning techniques to understand what may be hidden in our text. The other is a Computational Linguistics approach, where we will use linguistic information such as Part-Of-Speech tags, Named Entity Recognition, and Dependency trees.
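Clustering of the kind mentioned above rests on some measure of how far apart two documents are. As a small taste of the statistical approach, the sketch below computes a cosine distance between two bag-of-words count vectors using only the standard library. The function name `cosine_distance` and the choice to return 1.0 for empty documents are assumptions made for this example; libraries such as Gensim provide their own similarity tools.

```python
import math
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """Return 1 - cosine similarity of two token lists (bag-of-words counts)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an empty document as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


doc_1 = ["topic", "models", "for", "text"]
doc_2 = ["topic", "models", "for", "tweets"]
doc_3 = ["dependency", "trees"]
print(cosine_distance(doc_1, doc_2))  # small: three of four words shared
print(cosine_distance(doc_1, doc_3))  # 1.0: no words in common
```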

The tutorial will be carried out in a Jupyter Notebook, and the packages we will use are spaCy, Gensim, and Keras. The purpose of the tutorial is to introduce the audience to the different sources of textual data available, and to give a taste of the different ways we can approach our analysis. It is meant to be a survey of the field's techniques that favours breadth over depth, and we leave it to the audience to decide which technique would be most useful for their particular use case.
