We show how to deal with massive datasets on modest hardware using the Python Vaex DataFrame library. By combining computational graphs, efficient algorithms, and memory-mapped storage (Apache Arrow / hdf5), Vaex can easily handle up to a billion rows, even on your laptop. As a bonus, Vaex can automatically generate a machine learning pipeline from the graph structure built up internally in the DataFrame.
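To make the last point concrete, here is a minimal sketch of the underlying idea, not Vaex's actual implementation: if derived columns are stored as recorded expressions rather than materialized values, the same recipe can later be replayed on new data, which is what makes an automatic pipeline possible. All class and method names below (`LazyFrame`, `add_virtual`, `state_transfer`) are hypothetical.

```python
# Toy illustration of a computational graph "remembering" operations.
# Names are invented for the sketch; this is not the Vaex API.

class LazyFrame:
    def __init__(self, columns):
        self.columns = dict(columns)   # name -> materialized values
        self.virtual = {}              # name -> function building the column

    def add_virtual(self, name, fn):
        # Record the operation; nothing is evaluated yet.
        self.virtual[name] = fn

    def evaluate(self, name):
        if name in self.columns:
            return self.columns[name]
        return self.virtual[name](self)   # evaluated lazily, on demand

    def state_transfer(self, other_columns):
        # Replay the recorded operations on fresh data:
        # the automatically generated "pipeline".
        new = LazyFrame(other_columns)
        new.virtual = dict(self.virtual)
        return new

df = LazyFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
df.add_virtual("r", lambda f: [a + b for a, b in zip(f.evaluate("x"),
                                                     f.evaluate("y"))])
print(df.evaluate("r"))                         # [4.0, 6.0]

# Apply the same recorded transformations to new data.
new_df = df.state_transfer({"x": [10.0], "y": [20.0]})
print(new_df.evaluate("r"))                     # [30.0]
```

In Vaex itself the analogous mechanism is the DataFrame state, which can be extracted and applied to another DataFrame.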
Working with datasets comprising millions or billions of samples is an increasingly common task, one that is typically tackled with distributed computing. However, setting up a cluster for standard data science and machine learning tasks on a data source that already fits on a single hard drive is overkill, and may incur additional costs.
Applying concepts like computational graphs, which are common in neural network libraries, to a DataFrame library enables efficient memory and CPU usage. Together with memory-mapped storage (Apache Arrow & hdf5) and out-of-core algorithms, this lets us process much larger datasets with fewer resources. As a side effect, the computational graph ‘remembers’ all operations applied to a DataFrame. Manually building pipelines becomes a thing of the past, since we can generate them automatically.
In this talk, we will demonstrate Vaex, an open-source DataFrame library that embodies these concepts. Using data from the New York City YellowCab taxi service, we will show how to conduct an exploratory data analysis and build a machine learning model in real time on a single laptop, even when the data source contains over 1 billion samples.
Vaex is open source (MIT License).
In particular, we will:
Resources: