PyData New York City 2019 - Presentation: Free Your Esoteric Data Using Apache Arrow and Python

Free Your Esoteric Data Using Apache Arrow and Python

Audience level:

Intermediate

Description

Do you work with esoteric data that has no schema, no human-readable output, and/or inconsistent interfaces? Is your data only readable from C++ classes with a secret encoding? Let me demonstrate how to use Python and Apache Arrow to quickly read your data into pandas and elegantly analyze the data.

Abstract

One common scenario in large enterprise systems is esoteric or inconsistently structured data. This data is crucial to a firm’s success, but cannot be easily read, analyzed or extracted. The data might not have a schema and might only exist in memory. An example of this is C++ code whose classes have strange and inconsistent interfaces and no fast, human-readable serialization. Programmers get stuck when they need to test and analyze this data. A better solution is to migrate the data to a common schema-based format and use Python data science libraries to analyze it.

Python (with clang) and Apache Arrow enable you to quickly and easily transform data into the Apache Parquet format, which you can then analyze with PyArrow and pandas. Attendees will learn how to:

Use PyArrow to read CSV, JSON, custom data and hierarchical data.
Use pandas to elegantly compare data.
Parse their C++ classes using the cindex module and extract all the relevant data accessors and generate a “schema.”
Use the schema with a Jinja template to convert from C++ to Parquet using the Apache Arrow C++ API.
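
The accessor-extraction step in the list above can be sketched as follows. This is a minimal illustration, not the talk's actual code: it assumes the clang Python bindings and a matching libclang are installed, and the `Trade` class, the `extract_accessors` helper, and the rule "public, const, zero-argument methods are accessors" are all hypothetical choices made for the example.

```python
from clang.cindex import AccessSpecifier, CursorKind, Index

# Hypothetical C++ class standing in for "esoteric" in-memory data.
SOURCE = """
class Trade {
public:
    int id() const { return id_; }
    double price() const { return price_; }
    void set_price(double p) { price_ = p; }
private:
    int id_;
    double price_;
};
"""


def extract_accessors(source, filename="trade.hpp"):
    """Return (class, method, return type) for each accessor-like method."""
    index = Index.create()
    # Parse the source from memory instead of reading a file from disk.
    tu = index.parse(
        filename,
        args=["-x", "c++", "-std=c++17"],
        unsaved_files=[(filename, source)],
    )
    accessors = []
    for cursor in tu.cursor.walk_preorder():
        if cursor.kind != CursorKind.CXX_METHOD:
            continue
        # Assumed convention: accessors are public, const, and take no arguments.
        if cursor.access_specifier != AccessSpecifier.PUBLIC:
            continue
        if not cursor.is_const_method():
            continue
        if list(cursor.get_arguments()):
            continue
        accessors.append(
            (cursor.semantic_parent.spelling, cursor.spelling, cursor.result_type.spelling)
        )
    return accessors


schema = extract_accessors(SOURCE)
for class_name, method, cpp_type in schema:
    print(class_name, method, cpp_type)
```

The resulting list of tuples is one possible shape for the generated “schema”; real classes would likely need extra rules for nested types, containers, and inherited methods.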

Outline

Problem statement

Go over the problem and show what esoteric data is.

Pandas

Review pandas' role in comparing esoteric data.
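
As a rough sketch of what such a comparison might look like, the example below lines up two small data sets by key and reports rows that are missing on one side or whose values differ. The column names and values are invented for illustration, and an outer merge is only one of several ways to compare frames in pandas.

```python
import pandas as pd

# Hypothetical "expected" and "actual" extracts of the same data.
expected = pd.DataFrame({"trade_id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
actual = pd.DataFrame({"trade_id": [1, 2, 4], "price": [10.0, 21.0, 40.0]})

# An outer merge keeps every key; the indicator column records where each row came from.
merged = expected.merge(
    actual,
    on="trade_id",
    how="outer",
    suffixes=("_expected", "_actual"),
    indicator=True,
)

# Rows present on only one side.
missing = merged[merged["_merge"] != "both"]

# Rows present on both sides whose prices disagree.
both = merged[merged["_merge"] == "both"]
changed = both[both["price_expected"] != both["price_actual"]]

print(missing[["trade_id", "_merge"]])
print(changed[["trade_id", "price_expected", "price_actual"]])
```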

Free Data!!

Apache Arrow Overview

Go over examples of PyArrow and why you want to use Apache Arrow.
- How to read CSV, JSON, custom and hierarchical data and save it to Parquet.

Put Clang, Jinja, Arrow and Python together to access your data:

Review a specific example of how to parse C++ classes using the cindex module from Clang’s Python bindings
Create the schema from the parsing
Generate the converter using Jinja templates
- Convert the C++ data to Parquet
Show details of the generated Parquet file
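
The generation step might look roughly like the sketch below, which renders a fragment of C++ from a schema using Jinja. Everything here is an assumption made for illustration: the schema dictionary, the `ARROW_TYPES` mapping, the `items` variable in the generated code, and the template itself are hypothetical, and the output is only a fragment, not a complete converter. The Arrow C++ names used in the template (`arrow::Int32Builder`, `arrow::DoubleBuilder`, `arrow::field`, `arrow::schema`, `ARROW_RETURN_NOT_OK`) come from the Arrow C++ API.

```python
from jinja2 import Template

# Hypothetical mapping from C++ return types to Arrow C++ builders and type factories.
ARROW_TYPES = {
    "int": {"builder": "arrow::Int32Builder", "factory": "arrow::int32()"},
    "double": {"builder": "arrow::DoubleBuilder", "factory": "arrow::float64()"},
}

# Hypothetical schema, in the shape a cindex-based parser might produce.
schema = {
    "class_name": "Trade",
    "fields": [
        {"name": "id", "cpp_type": "int"},
        {"name": "price", "cpp_type": "double"},
    ],
}

TEMPLATE = Template(
    """\
// Generated converter fragment for {{ class_name }} (illustrative only).
{% for f in fields %}
{{ arrow[f.cpp_type].builder }} {{ f.name }}_builder;
{% endfor %}
for (const {{ class_name }}& item : items) {
{% for f in fields %}
  ARROW_RETURN_NOT_OK({{ f.name }}_builder.Append(item.{{ f.name }}()));
{% endfor %}
}
auto arrow_schema = arrow::schema({
{% for f in fields %}
  arrow::field("{{ f.name }}", {{ arrow[f.cpp_type].factory }}){{ "," if not loop.last }}
{% endfor %}
});
""",
    trim_blocks=True,
    lstrip_blocks=True,
)

generated = TEMPLATE.render(arrow=ARROW_TYPES, **schema)
print(generated)
```

Keeping the type mapping in a plain dictionary means a new C++ type can be supported by adding one entry, without touching the template.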

Tuesday 3:45 PM–4:30 PM in Central Park East (6501a)