This tutorial introduces SpaCy, a new natural language processing library written in Cython, and the NLP capabilities of Scikit-Learn, a machine learning library. It is intended for those with experience working with text as data.
Using SpaCy, we will cover part-of-speech tagging, dispersion plot analyses, dependency parsing, and word embeddings (word and document vectorization). Using Scikit-Learn, we will perform dimensionality reduction and other tasks. We will also visualize sentence diagrams using Sent2Tree, a custom library built for this tutorial. We will analyze texts such as Jane Austen's Pride and Prejudice and the screenplay of Monty Python and the Holy Grail in order to answer questions about them.
These techniques may also be applied more generally to any text.