Not only is there an abundance of textual data, there is also an abundance of tools to help analyse this data - and it is tough to choose the right tool for the right task. In this workshop we will cover the entire text analysis process - this means we'll start by finding our data, set up a pipeline to clean our text, annotate it, and then have it ready for more advanced analysis. We will also spend a while discussing the different ways you can create your own dataset for text analysis, and how much of our personal textual data (e-mails, messages, tweets) can be accessed and used.
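To give a flavour of what the cleaning step involves, here is a minimal sketch in plain Python (in the workshop itself we will use spaCy's pipeline, which handles tokenisation and annotation far more robustly than this illustration):

```python
import re
import string

def clean(text):
    """A minimal cleaning step: lowercase, strip punctuation, tokenize."""
    text = text.lower()
    # Strip all punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace into tokens
    return re.findall(r"\S+", text)

tokens = clean("Hello, World! This is   raw text...")
print(tokens)  # ['hello', 'world', 'this', 'is', 'raw', 'text']
```

A real pipeline would typically also remove stop words and lemmatise, which is exactly where a library like spaCy takes over.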
After this, we'll introduce two different approaches to playing with text. One is a more statistical, machine learning approach, where we will use Topic Modelling, Clustering/Classification, and Deep Learning techniques to uncover what may be hidden in our text. The other is a Computational Linguistics approach, where we will use linguistic information such as Part-Of-Speech tags, Named Entity Recognition, and dependency trees.
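The statistical approach rests on representing documents as word-count vectors. The following is only a stdlib sketch of that idea - bag-of-words vectors compared with cosine similarity, the representation underlying the clustering and topic modelling we will do with Gensim - not the workshop's actual code:

```python
from collections import Counter
from math import sqrt

def bow(tokens):
    """Bag-of-words: map each word to its count in the document."""
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = bow("the cat sat on the mat".split())
doc2 = bow("the cat lay on the rug".split())
doc3 = bow("stocks fell sharply in trading".split())

# Documents about the same subject share vocabulary, so they score higher
print(cosine(doc1, doc2) > cosine(doc1, doc3))  # True
```

Topic models such as LDA go one step further, inferring latent topics from these counts rather than comparing documents pairwise.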
The tutorial will be carried out via a Jupyter Notebook, using the spaCy, Gensim, and Keras packages. The purpose of the tutorial is to introduce the audience to the different sources of textual data available, and to give a taste of the different ways we can approach our analysis. It is meant to be a breadth rather than depth survey of the techniques in the field, and we leave the audience to decide which technique is most useful for their particular use case.