PyData New York City 2017 - Presentation: JuliaDB: A data system for Julia

We have known for decades that sending code to data is the way to make distributed computing scale. Yet data scientists still load datasets into memory on their laptops, using convenient single-machine languages like R or Python for their daily data analysis work. When they do bite the bullet and adopt a distributed system like Hadoop or Spark, they essentially rewrite their analyses from their convenience language. They gain scale but sacrifice productivity, and the performance of these systems is often poor on a single machine and often improves underwhelmingly as more hardware is thrown at the problem. Databases can send queries to the data, but SQL and other query languages are not nearly expressive enough for modern analytics, especially machine learning and AI. On the other hand, running user-defined R or Python in databases is simply not practical or efficient.

Julia’s unique combination of features upends this status quo: Julia is natively distributed and compiles user code to native machine instructions that are as fast as C/C++ or Fortran, on local or remote machines. This combination allows sending arbitrary user code to the data, wherever it lives, achieving seamless, high-performance distributed computing. JuliaDB is a pure Julia database that leverages these capabilities to deliver on Julia’s promise for distributed data analysis: it is a fast, scalable, distributed data store that lets you just load your data and get to work. JuliaDB interoperates with Julia’s rich ecosystem of packages for plotting, statistics, linear algebra, deep learning, and optimization, to name a few. It provides a seamless experience: a single language lets you write high-performance production code and query interactively in a REPL. It gives the data scientist a single coherent language and system for everything from data cleaning to sophisticated machine learning.
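
The idea of shipping a user-defined function to each partition of the data, rather than pulling the data to the code, can be sketched in a few lines. The following is a minimal illustration in Python and is not JuliaDB's API: the partitions are in-memory lists, a thread pool stands in for remote workers, and `map_partitions` and `partial_sum_count` are hypothetical names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    """User code that runs next to one partition and returns a small summary."""
    return sum(partition), len(partition)


def map_partitions(func, partitions, max_workers=4):
    """Apply func to every partition; only the small results travel back.

    Assumption: a thread pool stands in for workers on remote machines.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, partitions))


# Three partitions standing in for data stored on three machines (made-up values).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

partials = map_partitions(partial_sum_count, partitions)
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)
```

Only the per-partition `(sum, count)` pairs are combined centrally, which is why this pattern can scale: the amount of data moved does not grow with the size of each partition.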

JuliaDB’s closest analog from the Python world is pandas. But it also comes with built-in parallel and out-of-core compute, in which respect it is similar to dask dataframes. This talk will introduce some of the features of JuliaDB and compare their capabilities and performance with those of existing systems. We will also emphasize the profound productivity gains that come from JuliaDB being embedded in a high-performance and high-productivity language, Julia.
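
To make "out-of-core compute" concrete, here is a minimal Python sketch, assuming a CSV source that is too large to load at once. It reads a fixed number of rows at a time and keeps only running totals in memory. The helper names `read_chunks` and `chunked_mean`, the column names, and the sample values are all made up for illustration; this is not how JuliaDB or dask is implemented.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed rows from a CSV source."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(lines, column, chunk_size=2):
    """Compute the mean of a column while holding one chunk in memory."""
    total, count = 0.0, 0
    for chunk in read_chunks(lines, chunk_size):
        values = [float(row[column]) for row in chunk]
        total += sum(values)
        count += len(values)
    return total / count


# Made-up sample data; a real workload would pass an open file instead.
sample = io.StringIO("ticker,close\nA,10.0\nB,20.0\nC,30.0\nD,40.0\nE,50.0\n")
result = chunked_mean(sample, "close")
print(result)
```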

Outline of the talk:

Motivation
    Why we need JuliaDB
The JuliaDB workflow
    Loading data from CSV files
    Dealing with CSV’s quirks
    Indexing
    Parsing benchmarks
The indexed table data structure
    Basic structure and accessor functions
    Relational operations
    As a key-value store
    Analogy to N-D sparse arrays
Distributed table
S&P 500 case study walk through
    Plotting the OHLC data
    Aggregates and online statistics
    Join with market cap
    Buy and sell with JuMP - mathematical optimization
Benchmarks
    Aggregates
    Joins
    Parallel sorting
Comparisons
    With pandas
    With dask dataframes
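
The outline's view of an indexed table as both a key-value store and an N-D sparse array can be illustrated with a toy structure. The sketch below is written in Python under stated assumptions and does not reproduce JuliaDB's actual data structure or API: `ToyIndexedTable` is a hypothetical class that keeps rows sorted by a tuple of index columns, and the tickers and prices are invented.

```python
from bisect import bisect_left


class ToyIndexedTable:
    """Toy table: rows kept sorted by a tuple of index columns."""

    def __init__(self, rows):
        # rows: iterable of (key_tuple, value) pairs
        pairs = sorted(rows, key=lambda kv: kv[0])
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def __getitem__(self, key):
        """Key-value lookup, like indexing a sparse array at one coordinate."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)

    def select(self, first):
        """Return all (key, value) pairs whose first index column equals first."""
        # (first,) sorts before every key that starts with first.
        i = bisect_left(self.keys, (first,))
        out = []
        while i < len(self.keys) and self.keys[i][0] == first:
            out.append((self.keys[i], self.values[i]))
            i += 1
        return out


# Invented data: (ticker, date) -> closing price.
table = ToyIndexedTable([
    (("XYZ", "2017-01-03"), 20.0),
    (("ABC", "2017-01-04"), 11.0),
    (("ABC", "2017-01-03"), 10.0),
])

print(table[("ABC", "2017-01-04")])
print(table.select("ABC"))
```

Because the keys are sorted, a single-key lookup is a binary search, and all rows sharing a leading index value sit next to each other, which is the property that makes slicing along one dimension cheap.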

JuliaDB was developed at Julia Computing, most notably by Jeff Bezanson, Stefan Karpinski and Shashi Gowda. It is distributed under the MIT license. You can learn more about it at http://juliadb.org/

Monday 10:00 AM–10:40 AM in Central Park West 6501 (6th fl)

JuliaDB: A data system for Julia

Shashi Gowda, Jeff Bezanson

