Saturday 11:45 AM–12:30 PM in Room #1023/1022/1020 (1st Floor)

Beyond Bag of Words: A Practitioner’s Guide to Advanced NLP Using Open Source

Ariel M’ndange-Pfupfu, Mike Anderson

Audience level:
Intermediate

Description

We offer a foundation in building intelligent business applications using machine learning. We walk through every step from prototype to production (data cleaning, feature engineering, model building and evaluation, and deployment), then dive into two applications: anomaly detection and a personalized recommendation engine. All concepts are presented with example code in Python.

Abstract

Machine learning is an important concept in data science, but it doesn't exist in isolation. We will show how algorithms such as linear regression, k-nearest neighbors, and SVD fit into a larger workflow of preprocessing, feature engineering, tuning, and testing. We'll delve into techniques that can dramatically boost accuracy with minimal computational overhead. We'll explore how the popular NumPy, pandas, and scikit-learn stack handles a variety of use cases.

The first problem we'll explore involves anomaly detection in time series data. We'll use tools like autocorrelation, Fourier analysis, and modeling to remove seasonal trends from bikeshare data, then apply statistics to classify each data point as an outlier or not.
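
As a rough illustration of the "remove seasonality, then test for outliers" idea, here is a minimal sketch using only the Python standard library. It is not the session's code: the per-position median profile stands in for the autocorrelation and Fourier tools named above, the hourly series is synthetic rather than the bikeshare data, and the function names, the 24-hour period, and the 3.5 cutoff on the robust z-score are all assumptions chosen for the example.

```python
import math
import random
import statistics


def seasonal_residuals(values, period):
    """Remove a repeating pattern by subtracting the per-position median."""
    # profile[i] is the typical value at position i within one period.
    profile = [statistics.median(values[i::period]) for i in range(period)]
    return [v - profile[i % period] for i, v in enumerate(values)]


def flag_outliers(values, period, threshold=3.5):
    """Return indices whose residual has a robust z-score above threshold."""
    residuals = seasonal_residuals(values, period)
    center = statistics.median(residuals)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median([abs(r - center) for r in residuals])
    if mad == 0:
        return []  # degenerate case: the score is undefined without spread
    return [i for i, r in enumerate(residuals)
            if 0.6745 * abs(r - center) / mad > threshold]


def make_demo_series(days=14, period=24, spike_at=200, seed=0):
    """Synthetic hourly counts (hypothetical, not the bikeshare data)."""
    rng = random.Random(seed)
    series = [100 + 50 * math.sin(2 * math.pi * (t % period) / period)
              + rng.uniform(-2, 2) for t in range(days * period)]
    series[spike_at] += 80  # inject one anomaly
    return series


series = make_demo_series()
print(flag_outliers(series, period=24))
```

Using medians for both the seasonal profile and the spread keeps a single spike from distorting the baseline it is measured against; a mean-based profile would absorb part of the anomaly.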

The second problem we'll discuss involves building a custom recommendation engine using data about users, movies, and ratings. We'll try and compare different approaches, using feature similarity, regression, and both content-based and collaborative methods.
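
To make the collaborative approach concrete, here is a small sketch of item-based filtering with cosine similarity, again in plain Python. The ratings table, user names, and function names are hypothetical, and the prediction rule (a similarity-weighted average of the user's own ratings, with no mean-centering) is a deliberate simplification rather than the method presented in the session.

```python
import math

# Hypothetical toy data: user -> {movie: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"Alien": 5, "Blade Runner": 4, "Clueless": 1},
    "bob": {"Alien": 4, "Blade Runner": 5, "Amelie": 2},
    "cat": {"Clueless": 5, "Amelie": 4, "Alien": 1},
    "dan": {"Clueless": 4, "Amelie": 5},
}


def item_vectors(ratings):
    """Invert user -> movie ratings into movie -> {user: rating}."""
    vectors = {}
    for user, movies in ratings.items():
        for movie, rating in movies.items():
            vectors.setdefault(movie, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings, user, movie):
    """Similarity-weighted average of the ratings the user already gave."""
    vectors = item_vectors(ratings)
    weighted = total = 0.0
    for seen, rating in ratings[user].items():
        sim = cosine(vectors[movie], vectors[seen])
        weighted += sim * rating
        total += sim
    return weighted / total if total else None


def recommend(ratings, user, top_n=2):
    """Rank the movies the user has not rated by predicted rating."""
    unseen = [m for m in item_vectors(ratings) if m not in ratings[user]]
    scored = [(predict(ratings, user, m), m) for m in unseen]
    scored = [(s, m) for s, m in scored if s is not None]
    return [m for _, m in sorted(scored, reverse=True)[:top_n]]


print(recommend(RATINGS, "dan"))
```

A content-based variant would compare movies by their own features (genre, cast, year) instead of by who rated them, which is the kind of trade-off the comparison above is meant to explore.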

This course emphasizes a practical, exploratory approach to machine learning in Python. It highlights the flexible workflow and functionality that make Python applicable to a wide range of situations. Participants will walk away with take-home code samples that they can apply directly to their work.