Modern data analysis pipelines routinely involve gluing together multiple systems and languages: SQL, Python, R, C++, Unix tools, and more. This leads to unnecessary complexity and inefficiency. JuliaDB is a fast, productivity-focused, distributed database that, together with the Julia language, forms a single coherent system for everything from data preparation to machine learning with no glue.
We have known for decades that sending code to data is the way to make distributed computing scale. Yet data scientists still load datasets into memory on their laptops in convenient single-machine languages like R or Python for their daily data analysis work. When they do bite the bullet and adopt a distributed system like Hadoop or Spark, they essentially rewrite their analyses from their convenience language. They gain scale, but in doing so sacrifice productivity, and the performance of these systems is poor on a single machine and often improves only underwhelmingly as more hardware is thrown at the problem. Databases can send queries to the data, but SQL and other query languages are not nearly expressive enough for modern analytics, especially machine learning and AI. On the other hand, running user-defined R or Python in databases is simply not practical or efficient.
Julia’s unique combination of features upends this status quo: Julia is natively distributed and compiles user code to native machine instructions that are as fast as C/C++ or Fortran – on local or remote machines. This combination allows sending arbitrary user code to the data, wherever it lives, achieving seamless, high-performance, distributed compute. JuliaDB is a pure Julia database that leverages these capabilities to deliver on Julia’s promise for distributed data analysis – it’s a fast, scalable, distributed data store that lets you just load your data and get to work. JuliaDB interoperates easily with Julia’s rich ecosystem of packages for plotting, statistics, linear algebra, deep learning, and optimization, to name a few. It provides a seamless experience: a single language lets you write high-performance production code and interactively query in a REPL. It gives the data scientist a single coherent language and system for everything from data cleaning to sophisticated machine learning.
JuliaDB’s closest analog from the Python world is Pandas. But it also comes with built-in parallel and out-of-core compute, in which respect it is similar to dask dataframes. This talk will introduce some of the features of JuliaDB and compare their capabilities and performance with those of existing systems. We will also emphasize the profound productivity gains that come from its being embedded in a high-performance and high-productivity language, Julia.
Outline of the talk:
JuliaDB was developed at Julia Computing, most notably by Jeff Bezanson, Stefan Karpinski and Shashi Gowda. It is distributed under the MIT license. You can learn more about it at http://juliadb.org/