Friday 13:15–14:45 in Track 2

A Hands-On Introduction to Your First Data Science Project

Em Grasmeder, Jin Yang

Audience level:
Intermediate

Description

We’ve created a playground data science application that’s ready for you to jump into. It uses real data and an industry-standard toolset, and it addresses a typical machine learning task. We will run you through the workflow and life cycle of the data science application, and you’ll spend most of the session experimenting with and improving the machine learning model. Please bring a computer!

Abstract

Software developers interested in getting started with data science are often overwhelmed by the number of choices: which programming language to use, which machine learning libraries to pick, and where to find suitable data. Furthermore, it can be difficult to understand the whole scope of a data science project, which entails exploratory analysis, data cleansing, feature engineering, machine learning prototyping, machine learning productization, and finally continuous model serving and model integration. Software developers are often unaware of the approaches and tools for exploratory analysis and feature engineering in particular, although these skills are crucial to the success of a machine learning project. That’s why we concentrate on these topics in our workshop. We believe there is no better way of learning than to actually touch and modify a system (of understandable size) and to see how the results change.
To aid in this process, we have created an open-source example application that is optimized to serve as a playground for learning and experimentation. It works on a realistic dataset, addresses a typical machine learning task one may encounter on the job, and applies an industry-standard toolset (Python, Pandas, Jupyter Notebook, AWS). The application uses machine learning to make purchase predictions for a large grocery store in Ecuador, a task that comes up in many industries. Sales forecasting helps lower costs, reduce waste of perishable goods, and lower prices for consumers. The data is from a company called Favorita, which has published sales data in the public domain on Kaggle.com. With its range of stores, products, and dates of sales transactions, the dataset illustrates the complexity of a data science problem.
In our presentation, we first explain the problem of demand forecasting and the dataset we will be working with. Then, we move directly into hands-on data analysis. Participants will spend most of their time doing data analysis, using the same tools data scientists use day to day to improve machine learning models. This will include explanations of how, why, and where to ask questions of the data, and instruction on using the appropriate tools for the job. Towards the end of the workshop, we will talk about turning insights from the session into improvements in the predictive power of machine learning models. The data and the challenge are publicly available, and discussions and resources to go with them are easy to find. The source code for our playground application is open source as well and can be found on GitHub at github.com/ThoughtWorksInc/twde-datalab, so participants can continue working after the tutorial if they choose to.
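To give a flavour of the kind of exploratory analysis and feature engineering described above, here is a minimal Pandas sketch. It is an illustration only, not code from the playground application: the table, its column names (date, store, item, units), and its values are made-up stand-ins and do not reflect the actual Favorita schema.

```python
import pandas as pd

# Toy stand-in for sales data. Column names and values are hypothetical
# and are NOT the actual Favorita schema.
sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-08-01", "2017-08-01", "2017-08-02",
        "2017-08-02", "2017-08-05", "2017-08-06",
    ]),
    "store": [1, 2, 1, 2, 1, 1],
    "item": ["milk", "milk", "milk", "bread", "bread", "milk"],
    "units": [10, 4, 12, 7, 20, 18],
})

# Exploratory question: how many units does each store sell in total?
units_per_store = sales.groupby("store")["units"].sum()

# Feature engineering: derive calendar features a model could learn from.
sales["day_of_week"] = sales["date"].dt.dayofweek  # Monday=0 ... Sunday=6
sales["is_weekend"] = sales["day_of_week"] >= 5

# Exploratory question: are average units per row higher on weekends?
mean_units_by_weekend = sales.groupby("is_weekend")["units"].mean()

print(units_per_store)
print(mean_units_by_weekend)
```

Asking a question of the data (a groupby), turning the answer into a feature (is_weekend), and checking whether that feature separates high and low sales is the loop participants repeat throughout the hands-on part, on far larger and messier data.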
