Friday 11:00–12:00 in LG6

Pandas from the inside

Stephen Simmons

Audience level:


Pandas is a great way to quickly get started with data analysis in Python: intuitive DataFrames from R; fast numpy arrays under the hood; groupby like SQL. But this familiarity is deceptive, and new Pandas users often get stuck on things they feel should be simple. This tutorial/talk takes a look inside Pandas to see how DataFrames actually work when indexing, grouping and joining tables.


We briefly set the scene with a map of the evolving Pandas landscape for large-scale analytics. This is all very exciting and brings many new options. However, for many users, Pandas' sweet spot remains at a much smaller scale. For this talk, we focus on the DataFrame and how it actually works.

We will dig deeper and deeper into the DataFrame and its components, including Series, Categorical, Index, MultiIndex and GroupBy objects. For common operations like selecting rows, grouping, subtotals and joins, we will see what internal structures are created and how they fit together. From this, we get a better understanding of the various API layers and how to use them effectively.
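As a rough illustration of the kind of structures involved, the sketch below pokes at a tiny DataFrame using only public pandas attributes. The column names (city, year, sales) and the data are invented for this example and are not taken from the talk; the talk itself may go further into internals than this does.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame(
    {
        "city": pd.Categorical(["Oslo", "Rome", "Oslo", "Rome"]),
        "year": [2015, 2015, 2016, 2016],
        "sales": [10.0, 20.0, 30.0, 40.0],
    }
)

# Each column is a Series that shares the DataFrame's row Index.
sales = df["sales"]
same_index = sales.index.equals(df.index)

# A Categorical stores integer codes plus a lookup table of categories.
codes = df["city"].cat.codes.tolist()
categories = df["city"].cat.categories.tolist()

# A GroupBy object records which row positions belong to each key;
# the aggregation only runs when a method such as sum() is called.
grouped = df.groupby("city", observed=True)
row_positions = {key: pos.tolist() for key, pos in grouped.indices.items()}
totals = grouped["sales"].sum()

# Setting two columns as the index builds a MultiIndex, which keeps
# one array of unique labels (a level) per column.
indexed = df.set_index(["city", "year"])
levels = [list(level) for level in indexed.index.levels]

print(same_index, codes, categories)
print(row_positions)
print(totals)
print(levels)
```

Looking at codes, row_positions and levels side by side shows that grouping and multi-level indexing both reduce labels to integer positions, which is one reason these operations are fast on top of numpy arrays.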

We conclude the talk with a jump back to large-scale data analysis and a quick look at how a distributed computing library like Dask solves one of these problems.