DataFrames of all sorts have become very popular for tabular data analytics thanks to their nifty APIs and ease of use. But as they usually operate as in-memory engines, they can be hard to scale. In my talk, I'd like to outline several ways to scale a compute platform to handle larger datasets without incurring much cost.
If you've dealt with tabular data in Python, you've probably used pandas. But given its in-memory nature, you may have struggled as you tried loading larger and larger datasets. What I'd like to cover is how to overcome this limitation in various ways.
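To make the pain point concrete, here is a minimal sketch of where the in-memory model bites and the simplest built-in workaround. The file name and column are hypothetical, and this is just an illustration of the problem, not one of the four approaches from the talk.

```python
import pandas as pd

# Naive approach: reads the entire file into RAM at once, which can fail
# with a MemoryError once the dataset outgrows available memory.
# df = pd.read_csv("events.csv")
# total = df["amount"].sum()

# One built-in mitigation: stream the file in chunks and aggregate
# incrementally, keeping only one chunk in memory at a time.
total = 0.0
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()

print(f"Total amount: {total}")
```

Chunking like this works for simple aggregations, but it quickly breaks down for joins, sorts, or anything that needs to see the whole dataset at once, which is where the approaches below come in.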
Much like in other areas, there isn't a one-size-fits-all solution. I will show four ways I have overcome this issue in the past, each chosen for a different reason. These approaches are:
I will cover the differences between these approaches and outline their upsides (e.g. scaling and performance) and downsides (e.g. DevOps difficulties, cost). I won't be doing any live demos (demo gods have not been kind to me in the past); instead, I will focus on the use cases that fit each approach and what to look out for.