Monday 17:10–17:40 in Track 2

High Performance Data Processing in Python

Donald Whyte

Audience level:
Novice

Description

numpy and numba are popular Python libraries for processing large quantities of data. This talk explains how numpy/numba work under the hood and how they use vectorisation to process large amounts of data extremely quickly. We use these tools to reduce the processing time of a real 600GB dataset from one month to 40 minutes, even when the code is run on a single MacBook Pro.

Abstract

The Internet age generates vast amounts of data. Most of this data is unstructured and needs to be post-processed in some way. Python has become the standard tool for transforming this data into more usable forms.

numpy and numba are popular Python libraries for processing large quantities of data. When running complex transformations on large datasets, many developers fall into common pitfalls that kill the performance of these libraries.
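As an illustration (not code from the talk itself), one common pitfall of this kind is looping over a numpy array element by element in pure Python instead of expressing the operation as a single vectorised call:

    import numpy as np

    data = np.random.rand(10_000_000)

    # Pitfall: a pure Python loop pays interpreter overhead on every element,
    # bypassing numpy's optimised C internals entirely.
    total = 0.0
    for x in data:
        total += x * x

    # Vectorised equivalent: one call that runs in compiled C over contiguous memory.
    total = np.dot(data, data)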

This talk explains how numpy/numba work under the hood and how they use vectorisation to process large amounts of data extremely quickly. We use these tools to reduce the processing time of a real 600GB dataset from one month to 40 minutes, even when the code is run on a single MacBook Pro.
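For illustration only (this is a minimal sketch, not the talk's own code), numba is typically applied by compiling a hot transformation function to native machine code, so that explicit loops no longer pay Python interpreter overhead:

    import numpy as np
    from numba import njit

    @njit  # compiles the function to native machine code on first call
    def normalise(values):
        # Explicit loops are fine inside numba-compiled code: they run at
        # native speed rather than through the Python interpreter.
        mean = values.mean()
        std = values.std()
        out = np.empty_like(values)
        for i in range(values.shape[0]):
            out[i] = (values[i] - mean) / std
        return out

    data = np.random.rand(1_000_000)
    result = normalise(data)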
