Saturday 11:45 AM–12:30 PM in Room 2

Mind the Gap! Bridging the pandas – scikit-learn dtype divide

Tom Augspurger

Audience level:
Intermediate

Description

This talk briefly introduces the two data models involved: the NumPy arrays that scikit-learn operates on and the DataFrames that pandas provides. We'll see why the mismatch can cause problems for users of both libraries. Finally, we'll discuss strategies for managing the differences.

Abstract

Scikit-learn typically operates on NumPy arrays, which are homogeneous (every element shares a single data type). pandas DataFrames, on the other hand, are heterogeneous and may contain columns with different data types. Additionally, pandas implements several extension dtypes, such as Categoricals and datetimes with time zones, that can't be stored natively in NumPy arrays. Users of these libraries must be careful when crossing from pandas types to NumPy types.
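
As a rough illustration of the divide described above, the sketch below builds a small DataFrame and converts it to NumPy. The column names and values are made up for the example and are not taken from the talk.

```python
import numpy as np
import pandas as pd

# A small heterogeneous frame: float, integer, and Categorical columns.
# (Column names and values are illustrative assumptions.)
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "count": [1, 2, 3],
    "color": pd.Categorical(["red", "blue", "red"]),
})

# pandas tracks one dtype per column.
print(df.dtypes)

# Numeric columns share a common NumPy dtype, so the conversion is clean:
# the integers are upcast to float64.
numeric = df[["price", "count"]].to_numpy()

# Including the Categorical column leaves no common numeric dtype,
# so NumPy falls back to a generic object array.
mixed = df.to_numpy()

print(numeric.dtype, mixed.dtype)
```

An object array holds arbitrary Python objects rather than numbers, which is usually not what a scikit-learn estimator expects as input.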

We'll introduce the two systems, noting the vast area of agreement where pandas reuses NumPy's type system. However, the interesting cases are the relatively small areas where they differ. Next, we'll look at methods for converting from a pandas extension dtype to something suitable for scikit-learn. Depending on the statistical properties of your problem, one of several options may be appropriate. We'll cover factorization and dummy (one-hot) encoding. Finally, we'll implement a custom scikit-learn Transformer for use in pipelines.
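
The conversion strategies named above can be sketched roughly as follows. This is a minimal illustration, not the code presented in the talk: the `DummyEncoder` class, the column names, and the sample data are all hypothetical, and the transformer assumes its input is a DataFrame whose categorical columns already use the pandas `category` dtype.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Factorization: map each distinct value to an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(pd.Series(["red", "blue", "red", "green"]))

# Dummy (one-hot) encoding: one indicator column per category.
dummies = pd.get_dummies(pd.Series(["red", "blue", "red"]), dtype=float)


class DummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that one-hot encodes 'category' columns."""

    def fit(self, X, y=None):
        # Remember the categories seen at fit time so that transform
        # always produces the same output columns.
        cat_cols = X.select_dtypes(include="category").columns
        self.categories_ = {col: X[col].cat.categories for col in cat_cols}
        return self

    def transform(self, X):
        X = X.copy()
        for col, cats in self.categories_.items():
            # Re-apply the fitted categories; unseen values become missing.
            X[col] = pd.Categorical(X[col], categories=cats)
        return pd.get_dummies(X, dtype=float)


# Illustrative data (assumed for this sketch).
train = pd.DataFrame({
    "size": [1.0, 2.0, 3.0],
    "color": pd.Categorical(["red", "blue", "green"]),
})
y = [10.0, 20.0, 30.0]
new = pd.DataFrame({"size": [4.0], "color": pd.Categorical(["red"])})

encoder = DummyEncoder().fit(train)
encoded = encoder.transform(new)

# The transformer can sit in front of any estimator in a pipeline.
pipe = make_pipeline(DummyEncoder(), LinearRegression()).fit(train, y)
pred = pipe.predict(new)
```

Storing the categories during `fit` is one possible design choice: it keeps the encoded column layout stable even when later data contains only a subset of the levels, as `new` does here. Factorization imposes an arbitrary ordering on the codes, so it tends to suit models that are insensitive to that ordering, while dummy encoding avoids the ordering at the cost of more columns.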