This tutorial will explore tools in the Pandas data analysis library for analyzing unevenly spaced time series data. It will start with a brief primer on Pandas and the data.world API, then demonstrate how to use Pandas tools to analyze data on episodes of The Simpsons hosted at data.world.
Indeed data scientists occasionally analyze time series data in which the events of interest are unevenly spaced. For example, when we want to understand how a change to a user interface for Indeed Hire recruiters affects the time it takes them to review candidates, we might look at changes in the time intervals between individual candidate dispositions in our logs. When we want to understand the ratio of new business to repeat business, or explore different definitions of repeat business, we analyze the intervals between the creation dates of new requisitions from the same client.
The Pandas data analysis library offers powerful tools for conducting time series analysis. When working on unevenly spaced time series, we have found the shift() and transform() DataFrame methods particularly helpful. Many of the examples we found on the web applied these methods only to small, artificial datasets. Determining how best to apply them to real datasets was not always as straightforward as we would have hoped.
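For orientation only, here is a minimal sketch of how the two methods fit together when measuring intervals between unevenly spaced events. It is exactly the kind of small, artificial example described above: the `events` table, its `client` and `created` columns, and the derived column names are hypothetical and not taken from any Indeed or data.world dataset.

```python
import pandas as pd

# Hypothetical event log: one row per requisition, unevenly spaced in time.
events = pd.DataFrame({
    "client": ["a", "a", "a", "b", "b"],
    "created": pd.to_datetime([
        "2017-01-01", "2017-01-04", "2017-01-20",
        "2017-02-01", "2017-02-02",
    ]),
})

# Sort so that "previous row" means "previous event for the same client".
events = events.sort_values(["client", "created"])

# shift(): pull each client's previous creation date onto the current row.
events["previous"] = events.groupby("client")["created"].shift(1)

# Interval in days between consecutive events (NaN for a client's first event).
events["gap_days"] = (events["created"] - events["previous"]).dt.days

# transform(): broadcast a per-client summary back onto every row.
events["mean_gap_days"] = events.groupby("client")["gap_days"].transform("mean")

print(events)
```

Because transform() returns a result aligned with the original rows, each event can be compared directly with its client's typical interval without a separate merge step.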
Rather than use internal proprietary data to illustrate how these methods can be used effectively on unevenly spaced time series, we will use a publicly available dataset of episodes of The Simpsons at data.world. In doing so, we will also provide an introduction to the data.world API.
The purpose of this tutorial is to
Participants will be best prepared for this tutorial if they
Update: Jupyter notebooks associated with the tutorial have been uploaded to a GitHub repository.