Sunday 2:15 p.m.–2:55 p.m.

NLP and text analytics at scale with PySpark and notebooks

Paco Nathan

Audience level:
Intermediate

Description

Who's who in a developer community, what do they discuss, and with whom? This project, based on Apache Spark, provides Python pipelines for scraping, parsing, and analyzing discussion forums for a given Apache developer community, along with analytics for related meetup events and conference talks.

Abstract

Who's who in a developer community, what do they discuss, and with whom? This project, based on Apache Spark, provides Python pipelines for scraping, parsing, and analyzing discussion forums for a given Apache developer community, along with analysis of related meetup events and conference talks.

Messages get parsed with NLTK and TextBlob, then represented as JSON. Analytics pipelines, organized as notebooks, produce leaderboards with Spark SQL, predictive models using MLlib, and visualizations in Seaborn, while storing the data with Parquet. Code is available on GitHub.
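To make the data flow concrete, here is a minimal sketch of the parse-to-JSON and leaderboard steps. It is not the project's code: the sample messages, the record fields, and the helper names `parse_message` and `leaderboard` are all assumptions made for illustration. The standard library stands in for NLTK/TextBlob (plain whitespace tokenization) and for Spark SQL (a simple counter), so the example runs without a cluster.

```python
import json
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample messages; a real pipeline would scrape these
# from a mailing-list archive.
RAW_MESSAGES = [
    ("From: Alice <alice@example.org>\n"
     "Subject: Shuffle tuning\n"
     "Message-ID: <1@example.org>\n"
     "\n"
     "Has anyone tuned shuffle partitions?\n"),
    ("From: Bob <bob@example.org>\n"
     "Subject: Re: Shuffle tuning\n"
     "Message-ID: <2@example.org>\n"
     "In-Reply-To: <1@example.org>\n"
     "\n"
     "Yes, start with the default and measure.\n"),
    ("From: Alice <alice@example.org>\n"
     "Subject: Re: Shuffle tuning\n"
     "Message-ID: <3@example.org>\n"
     "In-Reply-To: <2@example.org>\n"
     "\n"
     "Thanks, that helped.\n"),
]


def parse_message(raw):
    """Turn one raw message into a flat, JSON-serializable record."""
    msg = message_from_string(raw)
    name, addr = parseaddr(msg["From"])
    body = msg.get_payload()
    return {
        "id": msg["Message-ID"],
        "sender": name or addr,
        "subject": msg["Subject"],
        "reply_to": msg["In-Reply-To"],  # None for thread starters
        # Whitespace split is a stand-in for NLTK/TextBlob tokenization.
        "tokens": body.split(),
    }


def leaderboard(records, top_n=3):
    """Rank senders by message count (stand-in for a Spark SQL GROUP BY)."""
    counts = Counter(r["sender"] for r in records)
    return counts.most_common(top_n)


records = [parse_message(raw) for raw in RAW_MESSAGES]
json_lines = [json.dumps(r) for r in records]  # one JSON object per line
top = leaderboard(records)
```

In a Spark setting, JSON lines of this shape could be loaded with `spark.read.json`, the leaderboard expressed as a `GROUP BY sender` query in Spark SQL, and the result saved with `DataFrame.write.parquet`. The `reply_to` field is what would let an analysis answer the "with whom?" question, by linking each reply to the message it answers.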
