Pandas is great for data analysis in Python. It promises intuitive DataFrames from R; speed like numpy; groupby like SQL. But there are plenty of pitfalls. This tutorial looks inside pandas to see how DataFrames actually work when building, indexing and grouping tables. You will learn how to write fast, efficient code, and how to scale up to bigger problems with libraries like Dask.
Pandas is a great way to quickly get started with data analysis in Python: intuitive DataFrames from R; fast numpy arrays under the hood; groupby just like SQL. But this familiarity is deceptive, and both new and experienced pandas users often get stuck on things they feel should be simple.
In the first part of this tutorial, we look inside pandas to see how DataFrames actually work when building, indexing and grouping tables. We will learn which pandas operations are fast and why, and how to avoid common performance pitfalls. By the end of the tutorial, you will have developed a strong and reliable intuition about using pandas effectively.
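As a taste of the kind of pitfall meant here, the sketch below computes the same total twice: once with a Python-level row loop and once with a single column-wise expression. The column names `price` and `qty`, the data, and both helper functions are made up for illustration and are not taken from the tutorial materials.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 2,000 rows of random prices and quantities.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.random(2_000),
    "qty": rng.integers(1, 10, size=2_000),
})


def total_loop(frame):
    """Row-by-row loop: each iteration builds a new Series object in Python."""
    total = 0.0
    for _, row in frame.iterrows():
        total += row["price"] * row["qty"]
    return total


def total_vectorized(frame):
    """Column-wise arithmetic: the multiply and sum run on whole arrays."""
    return float((frame["price"] * frame["qty"]).sum())


loop_result = total_loop(df)
vectorized_result = total_vectorized(df)
```

Both functions return the same value up to floating-point rounding. Timing them (for example with `%timeit` in a Jupyter notebook) typically shows the vectorized version to be far faster, because the work stays inside the underlying arrays rather than the Python interpreter.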
In the second part, we switch gears to bigger problems where our data sets can't fit in local memory. First we see how pandas behaves as we start to hit memory limits. Then we look at Dask, whose distributed/deferred DataFrames are a near drop-in replacement for pandas. Then we come back to pure pandas and look for ways to manage bigger datasets with clever data storage.
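One small example of what managing a bigger dataset in pure pandas can look like is sketched below. It assumes that a categorical dtype is among the storage techniques of interest, which the description above does not state, and the `city` column and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100,000 rows drawn from only three distinct strings.
cities = np.array(["Amsterdam", "Berlin", "Copenhagen"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": cities[rng.integers(0, 3, size=100_000)]})

# Memory used by the column as plain strings (deep=True counts the string data).
before = df["city"].memory_usage(deep=True)

# A categorical column stores each distinct string once, plus a small integer code per row.
compact = df["city"].astype("category")
after = compact.memory_usage(deep=True)

print(f"strings: {before} bytes, categorical: {after} bytes")
```

For a column with few distinct values like this one, the categorical version should be considerably smaller. The saving shrinks as the number of distinct values approaches the number of rows.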
During this tutorial, you are welcome to follow along on your laptop with the sample data sets and example code in a Jupyter notebook. These will be made available on GitHub here just before the tutorial. The code targets Python 3 and the latest pandas/dask release: