PyData New York City 2017 - Presentation: Text Analysis with SpaCy and Scikit-Learn

This tutorial, intended for those with experience working with text as data, introduces SpaCy, a new natural language processing library written in Cython, and the NLP capabilities of Scikit-Learn, a machine learning library.

Using SpaCy, we will cover part-of-speech tagging, dispersion plot analyses, dependency parsing, and word embeddings (word and document vectorization). Using Scikit-Learn, we will perform dimensionality reduction and other tasks. We will also visualize sentence diagrams with Sent2Tree, a custom library built for this tutorial. We will analyze texts such as Jane Austen's Pride and Prejudice and the screenplay of Monty Python and the Holy Grail to answer questions like:

What adjectives are used to describe Mr. Darcy, a character in Pride and Prejudice?
Which text (Pride and Prejudice or Monty Python) has the larger proportion of nouns, and why?
What actions (verbs) are most associated with bravely bold Sir Robin, a character from Monty Python and the Holy Grail?
What is the longest sentence in Pride and Prejudice?
What are the most frequent improbable words in these texts?

These techniques may also be applied more generally to any text.
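
As a rough illustration of how a few of the questions above might be approached, here is a minimal sketch. It is not the tutorial's own code. The helper names (noun_proportion, adjectives_modifying, longest_sentence, analyze) are hypothetical, the model name en_core_web_sm is an assumption about which English pipeline is installed, and counting proper nouns as nouns is a choice made here for illustration. The helpers rely only on standard spaCy token attributes (is_alpha, pos_, dep_, head, lemma_) and on doc.sents.

```python
from collections import Counter


def noun_proportion(doc):
    """Return the fraction of alphabetic tokens tagged NOUN or PROPN.

    Treating proper nouns as nouns is an assumption of this sketch.
    """
    words = [tok for tok in doc if tok.is_alpha]
    if not words:
        return 0.0
    nouns = [tok for tok in words if tok.pos_ in {"NOUN", "PROPN"}]
    return len(nouns) / len(words)


def adjectives_modifying(doc, name):
    """Count adjectives attached to `name` by an `amod` dependency arc.

    This only finds attributive adjectives ("proud Darcy"); predicative
    uses ("Darcy is proud") would need additional dependency patterns.
    """
    counts = Counter()
    for tok in doc:
        if tok.dep_ == "amod" and tok.head.text == name:
            counts[tok.lemma_.lower()] += 1
    return counts


def longest_sentence(doc):
    """Return the sentence span with the most tokens."""
    return max(doc.sents, key=len)


def analyze(text, name, model="en_core_web_sm"):
    """Run all three helpers on raw text (hypothetical convenience wrapper)."""
    # Imported here so the helpers above stay usable without spaCy installed.
    import spacy

    nlp = spacy.load(model)  # assumes this model has been downloaded
    doc = nlp(text)
    return {
        "noun_proportion": noun_proportion(doc),
        "adjectives": adjectives_modifying(doc, name),
        "longest_sentence": longest_sentence(doc).text,
    }
```

With spaCy and an English model installed, calling analyze(text, "Darcy") on the full text of Pride and Prejudice would return all three results in one dictionary. For a text of that length, spaCy's default maximum input length may need raising, or the text may need to be processed in chunks.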

Thursday 11:00 AM–12:30 PM in Central Park West 6501 (6th fl)


Jonathan Reeve

