PyData New York City 2017 - Presentation: Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow

Dremio is a new open source project for self-service data fabric. Dremio simplifies and accelerates access to data from any source and of any size, including relational databases, NoSQL, Hadoop, Parquet, and text files. We'll show how you can use Dremio to visually curate data from any source, then query it from Pandas or a Jupyter notebook.

Modern data lives in a wide range of technologies and formats, from relational databases, to JSON in NoSQL databases, to CSV files on cloud file systems like S3, and more. Each of these sources presents different challenges for accessing the data and reshaping it to fit the needs of a particular job. In addition, the speed of accessing each source from Python depends on a variety of factors, many of which are out of your control.

Python users spend inordinate cycles getting data ready for analysis, which slows their ability to iterate on their work. In addition, analyzing these data sources is slow and cumbersome, requiring access to and mastery of many different technologies.

Dremio is a new open source project for self-service data fabric, designed to help analysts, data scientists, and data engineers work with the complexity of modern data. Based on Apache Arrow, it simplifies and accelerates access to any data source, of any size, and is tightly integrated with Pandas. Users can search across an indexed catalog of data sources to quickly find relevant datasets, regardless of the underlying technology or format, then visually curate the data for their particular needs using a browser-based interface. The curated results, called Virtual Datasets, are first-class SQL objects that can be queried with standard SQL from Pandas or Jupyter notebooks.
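
As a rough illustration of the idea that a curated dataset is a saved query rather than a copy of the data, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Dremio's API: the table `raw_trips`, the view `curated_trips`, and the helper function names are all hypothetical, and a plain SQL view only approximates what a Virtual Dataset does.

```python
import sqlite3


def build_demo_connection():
    """Create an in-memory database holding one raw table and one view."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical raw source data; one row has a missing tip.
    conn.execute("CREATE TABLE raw_trips (city TEXT, fare REAL, tip REAL)")
    conn.executemany(
        "INSERT INTO raw_trips VALUES (?, ?, ?)",
        [("NYC", 10.0, 2.0), ("NYC", 20.0, 3.0), ("SF", 15.0, None)],
    )
    # The view stands in for a curated dataset: it stores only the
    # transformation (fill missing tips, add a total), not a data copy.
    conn.execute(
        "CREATE VIEW curated_trips AS "
        "SELECT city, fare + COALESCE(tip, 0) AS total FROM raw_trips"
    )
    return conn


def total_by_city(conn):
    """Query the curated view with ordinary SQL, as any client could."""
    rows = conn.execute(
        "SELECT city, SUM(total) FROM curated_trips "
        "GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)


conn = build_demo_connection()
totals = total_by_city(conn)
print(totals)
```

In a notebook, the same SELECT statement could be passed to `pandas.read_sql_query` together with a database connection to get a DataFrame back. How a connection to Dremio itself is opened is not covered in this abstract and is left out here.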

In this talk we'll work through an example: accessing data from a mix of NoSQL, HDFS, and Amazon S3 data sources, curating and transforming the data for a specific analytical job, then accessing the data from a simple Python application using a Jupyter notebook. We'll also explore how Dremio is architected and look at examples of how customers are deploying Dremio with Python to solve important use cases.

Tuesday 11:20 AM–12:00 PM in Radio City 6604 (6th fl)


Sudheesh Katkam

