Saturday 15:00–15:45 in Tower Suite 1

Modern Data Science: A new approach to DataFrames and pipelines

Maarten Breddels, Jovan Veljanoski

Audience level:
Intermediate

Description

We show how to work with massive datasets on modest hardware using Vaex, a Python DataFrame library. Using computational graphs, efficient algorithms and storage (Apache Arrow / hdf5), Vaex can easily handle up to a billion rows, even on your laptop. As a bonus, Vaex can automatically generate a machine learning pipeline from the graph structure built up internally in the DataFrame.

Abstract

Working with datasets comprising millions or billions of samples is becoming an increasingly common task, one that is typically tackled with distributed computing. However, setting up a cluster for standard data science and machine learning tasks, when the data source already fits on a single hard drive, seems like overkill and might incur additional costs.

Applying concepts like computational graphs, which are common in neural network libraries, to a DataFrame library enables efficient memory and CPU usage. Together with memory-mapped storage (Apache Arrow & hdf5) and out-of-core algorithms, this lets us process much larger datasets with fewer resources. As a side effect, the computational graph ‘remembers’ all operations applied to a DataFrame. Manually building pipelines becomes a thing of the past, since we can generate them automatically.
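To make the idea concrete, here is a minimal pure-Python sketch of a lazily evaluated column expression that is streamed over the data in chunks and then reused on new data. It is a toy illustration only, not Vaex's API or implementation: the names `Expr`, `Col`, `LazyFrame`, `add_virtual_column`, and the `distance` / `duration` columns are all hypothetical.

```python
import operator
from itertools import islice


class Expr:
    """A node in a tiny computational graph: an operation plus its inputs."""

    def __init__(self, op, *args):
        self.op, self.args = op, args

    # Arithmetic builds new graph nodes instead of computing anything.
    def __add__(self, other):
        return Expr(operator.add, self, other)

    def __mul__(self, other):
        return Expr(operator.mul, self, other)

    def __truediv__(self, other):
        return Expr(operator.truediv, self, other)

    def evaluate(self, row):
        # Recursively evaluate inputs; plain numbers are used as constants.
        vals = [a.evaluate(row) if isinstance(a, Expr) else a for a in self.args]
        return self.op(*vals)


class Col(Expr):
    """A leaf node that reads a named column from a row."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, row):
        return row[self.name]


class LazyFrame:
    """Hypothetical frame that only records expressions (the 'pipeline')."""

    def __init__(self):
        self.virtual = {}  # column name -> Expr, remembered for later reuse

    def add_virtual_column(self, name, expr):
        self.virtual[name] = expr  # no data is touched here

    def mean(self, name, rows, chunk_size=1000):
        # Stream the rows chunk by chunk so only one chunk is held in memory.
        # Assumes at least one row is supplied.
        expr, total, count = self.virtual[name], 0.0, 0
        rows = iter(rows)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            total += sum(expr.evaluate(row) for row in chunk)
            count += len(chunk)
        return total / count


frame = LazyFrame()
frame.add_virtual_column("speed", Col("distance") / Col("duration"))

# A generator stands in for data that is too large to load at once.
train = ({"distance": 2.0 * i, "duration": float(i)} for i in range(1, 10_001))
train_mean = frame.mean("speed", train)

# The recorded expression is applied unchanged to previously unseen rows.
new_mean = frame.mean("speed", [{"distance": 9.0, "duration": 3.0}])
```

Because `add_virtual_column` stores only the expression, defining a feature costs no memory, and the same recorded definition can be replayed on new data without rewriting the transformation by hand.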

In this talk, we will demonstrate Vaex, an open-source DataFrame library that embodies these concepts. Using data from the New York City YellowCab taxi service, we will show, in real time, how to conduct an exploratory data analysis and build a machine learning model on a single laptop, even when the data source contains over 1 billion samples.

Vaex is open source (MIT License)

In particular, we will:

Resources:
