Monday 12:35–13:05 in Track 2

From Data to Deliverable

Steph Samson

Audience level:


The data we need often comes from different places, and sometimes in different formats. I will talk about how I took data from several APIs, cleaned and preprocessed it, and wrapped it up under a new API for a restaurant discovery service. This talk is for attendees looking to jump into the data science industry for the first time.


Practitioners in our industry often work with unclean data. Such data cannot be used immediately with off-the-shelf libraries, whether those libraries are for machine learning or for creating a RESTful API. In this talk, I will describe my end-to-end process for getting a data set into a state that can be used for a deliverable.

The pipeline begins with extracting data. Then comes the cleaning and preprocessing, which is rarely, if ever, painless. One will often encounter formatting and encoding errors, and sometimes different representations altogether. I will demonstrate some techniques I use when examining dirty data sets. Once the data is prepared, I will demonstrate how I resolve the differences between the data sources by creating a new data structure. This data structure is subsequently exposed through a RESTful API.
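
As a rough illustration of the steps described above (not the code from the talk), the sketch below cleans records from two sources and maps them onto one shared structure that an API layer could serialize. The source layouts, field names, rating scales, and the `Restaurant` class are all assumptions made up for this example.

```python
import json
import unicodedata
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Restaurant:
    """Hypothetical unified structure shared by all sources."""
    name: str
    city: str
    rating: Optional[float]  # assumed to be normalized to a 0-5 scale


def clean_text(value: str) -> str:
    # Normalize Unicode so equivalent encodings of the same character
    # compare equal, then collapse leading, trailing, and repeated whitespace.
    normalized = unicodedata.normalize("NFKC", value)
    return " ".join(normalized.split())


def from_source_a(record: dict) -> Restaurant:
    # Assumed layout for source A: flat keys, rating as a string out of 5.
    raw = record.get("rating")
    return Restaurant(
        name=clean_text(record["name"]),
        city=clean_text(record["city"]).title(),
        rating=float(raw) if raw not in (None, "") else None,
    )


def from_source_b(record: dict) -> Restaurant:
    # Assumed layout for source B: nested location, numeric score out of 10.
    score = record.get("score")
    return Restaurant(
        name=clean_text(record["title"]),
        city=clean_text(record["location"]["city"]).title(),
        rating=score / 2 if score is not None else None,
    )


def to_json(restaurants: list) -> str:
    # The payload a RESTful endpoint could return; the web framework
    # that would serve it is left out of this sketch.
    return json.dumps([asdict(r) for r in restaurants], ensure_ascii=False)
```

With this shape, records that differ in key names, text encoding, and rating scale end up as equal `Restaurant` values, so the API layer only has to know about one representation.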

By the end of this talk, attendees will hopefully have gained some insight into how much data cleaning and preprocessing is involved in a work day.
