Feather, Avro, Parquet, Arrow... with all these new file formats, what's wrong with good old CSVs? In this talk, we will cover why CSV files are not always the optimal storage format for tabular data and what optimizations each of these formats makes. We will then do a deep dive on Feather, a cross-language format created by Wes McKinney and Hadley Wickham.
CSV files are great for storing tabular data: the format is both human and machine readable. But it is not always the best choice. Robustly parsing CSV files is challenging; see, for example, the sheer number of optional parameters accepted by the read_csv function in Pandas. CSV files also lose type information: since every field comes back as a string or a number, encoding categorical variables is difficult.
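A small sketch of that type loss (file and column names here are invented for illustration): a categorical column written to CSV comes back as plain strings.

    import pandas as pd

    df = pd.DataFrame({
        "city": pd.Categorical(["NYC", "SF", "NYC"]),
        "visits": [10, 3, 7],
    })
    print(df.dtypes)          # city: category, visits: int64

    df.to_csv("example.csv", index=False)
    roundtrip = pd.read_csv("example.csv")
    print(roundtrip.dtypes)   # city: object -- the category dtype is gone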
The Hadoop and Spark communities have solved this data serialization problem many times over. Among the more popular file formats are Avro, Parquet, and Arrow, each optimized for a different use: Avro for row-oriented, JSON-like records; Parquet for columnar storage on disk; and Arrow for columnar data in memory. Like CSV, these formats are supported by a wide range of readers and writers, ensuring that data written by a Hadoop task can later be read by PySpark.
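As a hedged sketch of that interoperability (file name invented), a table written with pyarrow can be read back by any Parquet-aware engine, PySpark included:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    pq.write_table(pa.Table.from_pandas(df), "example.parquet")

    # Any Parquet reader can now load this file, e.g.
    # spark.read.parquet("example.parquet") in PySpark.
    table = pq.read_table("example.parquet")
    print(table.to_pandas())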
Wes McKinney and Hadley Wickham recently created the Feather format, aimed at making it easier to share data between languages, specifically Python and R. Feather is most similar to Arrow, although Feather is an on-disk representation rather than an in-memory one. Among the many advantages of using Feather instead of CSV files to cross the language boundary is the substantial speedup that comes from removing the parser: aside from some metadata, reading a Feather file does not require any parsing, because the on-disk representations of column types such as integers, floating-point numbers, and categoricals mirror their in-memory representations in Python and R.
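A minimal sketch of the round trip, using the feather-format package (file name invented; on the R side the same file loads with feather::read_feather, with the categorical column arriving as a factor):

    import feather
    import pandas as pd

    df = pd.DataFrame({
        "x": [1.5, 2.5, 3.5],
        "label": pd.Categorical(["a", "b", "a"]),
    })
    feather.write_dataframe(df, "example.feather")

    # No parsing on the way back in: the category dtype survives intact.
    restored = feather.read_dataframe("example.feather")
    print(restored.dtypes)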
In this talk, we'll do a deep dive on the Feather format, manually deconstructing the on-disk representation with some struct and NumPy magic.
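As a taste of that deconstruction, here is a sketch (reusing the example.feather file from above) that checks only the outer framing of a Feather v1 file: the 4-byte magic "FEA1" appears at both ends, with a flatbuffer of metadata and its uint32 length sitting just before the trailing magic:

    import struct

    with open("example.feather", "rb") as f:
        data = f.read()

    head_magic = data[:4]
    # Tail layout: ...metadata flatbuffer, uint32 metadata size, magic.
    meta_size, tail_magic = struct.unpack("<I4s", data[-8:])

    print(head_magic, tail_magic)   # both should be b"FEA1"
    print("metadata flatbuffer is", meta_size, "bytes")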