pandas' high-level API makes it easy for newcomers to wrangle and analyze data without knowing much about data structures or low-level computing. But this initial ease of use can become a problem when performing complex operations, or when performance issues arise with large volumes of data. This talk covers the essentials of data structures and pandas.
There is no doubt that pandas' popularity is skyrocketing. Stack Overflow recently reported that 1% of its traffic from high-income countries goes to pandas-tagged questions. Besides the library's unquestionable popularity, this traffic is also driven by pandas often behaving in tricky and unexpected ways, especially for inexperienced users. By gaining an understanding of pandas internals, attendees will be able to write more predictable, robust and performant code.
The talk does not assume prior knowledge of data structures, computer architecture or low-level Python; these topics will be covered in the first part. Computer engineers may find this part a bit basic, but it should be very useful for the many pandas users coming from other fields (economics, physical sciences...).
The second part of the talk will focus on pandas itself: how pandas represents information, and what happens behind the scenes when we perform operations on a pandas structure such as a Series or DataFrame. Understanding how things work under the hood will make it easier to understand pandas behavior that often seems wrong or counter-intuitive. For example, why pandas automatically converts integer columns with missing values to float, or why the famous SettingWithCopyWarning happens and what to do about it.
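As a small illustration of the integer-to-float conversion mentioned above, the sketch below shows a Series changing dtype as soon as a missing value appears. It assumes a reasonably recent pandas with default settings. The variable names are made up for this example, and the nullable `Int64` dtype shown at the end is not part of the talk description; it is included only as one possible way around the conversion.

```python
import pandas as pd

# A Series built from plain Python integers gets an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# With a missing value, the default representation falls back to float64:
# the missing entry is stored as NaN, which is a floating-point value, and
# a NumPy integer array has no bit pattern reserved for "missing".
with_missing = pd.Series([1, 2, None])
print(with_missing.dtype)

# The same thing happens when an operation introduces missing values,
# for example reindexing with a label that does not exist.
reindexed = ints.reindex([0, 1, 2, 3])
print(reindexed.dtype)

# Assumption for illustration: pandas' nullable integer extension dtype
# ("Int64", capital I) keeps integers by tracking missing values in a
# separate mask instead of using NaN.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
```

The float fallback follows from how the values are stored, which is why knowing the underlying data structure makes the behavior less surprising.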
The talk will conclude with an overview of pandas 2: a short note on how Apache Arrow and Ibis will make the pandas of the future more robust, less tricky, and much more scalable.