Saturday 17:00–17:45 in Hall 7

Chain, Loop & Group: How Celery Empowered our Data Scientists to Take Control of our Data Pipeline

Michelle Tran

Audience level:
Novice

Description

In this talk, I’ll discuss how we optimized the twice-monthly refresh of data in our online analytics product using Celery and transferred ownership of this process from the development team to the data science team. I will also introduce the custom tool we built to submit jobs to BigQuery via Google’s Python API.

Abstract

As a mobile app analytics company, we work with petabytes of data. Every 1-2 weeks, we are responsible for calculating and ETLing daily download and revenue estimates for 3 million apps in 57 countries over a 1.5-year (and growing) historical time period.

We push frequent updates to our estimation models, which means our data pipeline has to be easily adaptable. We need to be able to prototype, test and implement changes to the pipeline on short notice, and our dev team doesn’t have the capacity to hold our hands through that process. Thanks to a Python tool we built, which submits jobs to BigQuery via Google’s Python API over Celery’s distributed task queue, our data science team can now develop the data pipeline directly. We have direct control over how and when queries run because we design the data flows ourselves.
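As a rough illustration of the kind of glue code involved (this is a hypothetical sketch, not the tool described in the talk; the app name, broker URL, project ID and table name are placeholders), a Celery task that submits a query job to BigQuery with the google-cloud-bigquery client could look like this:

```python
# Hypothetical sketch, not the actual tool from the talk: a Celery task that
# submits a query job to BigQuery with the google-cloud-bigquery client.
# The app name, broker URL, project ID and table name are placeholders.
from celery import Celery
from google.cloud import bigquery

app = Celery("pipeline", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3)
def run_query(self, sql, destination_table=None):
    """Submit a query to BigQuery and block until the job finishes."""
    client = bigquery.Client(project="my-analytics-project")
    job_config = bigquery.QueryJobConfig()
    if destination_table:
        # e.g. "my-analytics-project.estimates.daily_downloads"
        job_config.destination = destination_table
        job_config.write_disposition = "WRITE_TRUNCATE"
    job = client.query(sql, job_config=job_config)
    try:
        job.result()  # wait for completion; raises on query errors
    except Exception as exc:
        # Back off and retry the job a few times before giving up.
        raise self.retry(exc=exc, countdown=60)
    return job.job_id
```

In the real tool the queries, destination tables and retry policy would be driven by the pipeline’s own configuration; the point is simply that a few lines of Celery let data scientists queue BigQuery work without touching the worker infrastructure.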

In this talk I’ll discuss the specifics of our use case (e.g. the volume of data, the complexity of the data pipeline, the different views displayed in our end platform, and testing and troubleshooting) and how Celery provides a framework for us to address these issues.
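The “chain” and “group” in the title refer to Celery’s canvas primitives for composing tasks. As a rough, hypothetical illustration of how per-country steps could be fanned out in parallel and then collected (again, placeholder task names and countries, not the pipeline described in the talk):

```python
# Hypothetical sketch of composing pipeline steps with Celery canvas
# primitives; task names, countries and the broker URL are placeholders.
from celery import Celery, chain, group

app = Celery(
    "pipeline",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


@app.task
def estimate_downloads(country):
    # Stand-in for a task that submits the per-country estimation query.
    return f"downloads:{country}"


@app.task
def load_results(results):
    # The group's results arrive here as a list, one entry per country.
    return f"loaded {len(results)} partitions"


countries = ["US", "JP", "KR"]

# Fan out the per-country estimates in parallel, then load everything
# once they have all finished.
workflow = chain(
    group(estimate_downloads.s(c) for c in countries),
    load_results.s(),
)
result = workflow.apply_async()
```

Chaining a group into a follow-up task is implicitly upgraded to a chord, which is why a result backend is configured above; the actual layout of our pipeline is one of the things the talk walks through.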