Thursday October 28 5:30 PM – Thursday October 28 6:00 PM

5 Reasons Parquet files are better than CSV for data analyses

Matthew Powers

Prior knowledge:
No previous knowledge expected

Summary

Parquet files are supported by most major data languages and libraries, are easier to work with than CSV files, and are typically faster to query. This talk summarizes the main benefits of Parquet files and uses benchmarks to show the performance difference. You’ll also learn how to convert CSV files to Parquet.

Description

5 reasons Parquet files are better than CSV:

  • schema - examine with PyArrow how the schema is embedded in the file metadata
  • file sizes - compare file sizes when identical data is written to CSV and Parquet
  • columnar file format - examine the performance benefit of column pruning, which reads only the columns a query needs
  • predicate pushdown filtering - understand how to query row group metadata with PyArrow and how to skip entire row groups based on their column statistics
  • immutability - why immutable file formats are better

The talk also shows how to convert CSV files to Parquet with Pandas, Dask, and PySpark, covering both a single file and multiple files in parallel.

Finally, the talk covers when to use CSV files and when to avoid them.