Tuesday 3:45 PM–4:30 PM in Central Park East (6501a)

Free Your Esoteric Data Using Apache Arrow and Python

Maciej Wojton

Audience level:
Intermediate

Description

Do you work with esoteric data that has no schema, no human-readable output, and/or inconsistent interfaces? Is your data only readable from C++ classes with a secret encoding? Let me demonstrate how to use Python and Apache Arrow to quickly read your data into pandas and analyze it elegantly.

Abstract

One common scenario in large enterprise systems is esoteric or inconsistently structured data. This data is crucial to a firm’s success but cannot be easily read, analyzed, or extracted. The data might not have a schema and might exist only in memory. One example is C++ code whose classes have strange, inconsistent interfaces and no fast, human-readable serialization. Programmers get stuck when they need to test and analyze this data. A better solution is to migrate the data to a common schema-based format and use Python data science libraries to analyze it.

Python (with clang) and Apache Arrow enable you to quickly and easily transform data into the Apache Parquet format, which you can then analyze with PyArrow and pandas. Attendees will learn how to:

  1. Use PyArrow to read CSV, JSON, custom data and hierarchical data.
  2. Use pandas to elegantly compare data.
  3. Parse their C++ classes using the clang cindex module, extract all the relevant data accessors, and generate a “schema.”
  4. Use the schema with a Jinja template to convert from C++ to Parquet using the Apache Arrow C++ API.

Outline

Problem statement

Pandas

Free Data!!

Apache Arrow Overview

Put Clang, Jinja, Arrow and Python together to access your data:
