As companies scale prototypes and ad hoc analyses into production systems, it is critical to build automated (and repeatable) systems for data collection/processing and model training/evaluation that are fault-tolerant enough to adapt to changing constraints. Sustainable software development is often an afterthought for data scientists, especially since the tools for analysis (R, scientific Python, etc.) do not naturally lend themselves to building scalable and extensible software abstractions. But now we can have our cake and eat it too... all with Python!
In this workshop you will see how (and why) to leverage the PyData ecosystem to build a robust data pipeline. More specifically, you will learn how to use the Luigi framework to integrate multiple stages of a model-building pipeline: collection, processing, vectorization, training of multiple models, and validation.
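To ground the idea, here is a minimal sketch of how Luigi chains two pipeline stages together; the task names and file paths are illustrative, not taken from the workshop material. Each task declares its upstream dependencies via requires() and its artifact via output(), and Luigi skips any task whose output already exists, which is what makes re-runs repeatable and fault-tolerant.

import luigi


class CollectData(luigi.Task):
    """Stage 1: collect raw data (stubbed with a toy record for illustration)."""

    def output(self):
        return luigi.LocalTarget("raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,text\n1,Hello PyData\n")


class ProcessData(luigi.Task):
    """Stage 2: depends on CollectData; Luigi runs upstream tasks first."""

    def requires(self):
        return CollectData()

    def output(self):
        return luigi.LocalTarget("processed.csv")

    def run(self):
        # self.input() resolves to the output() of the required task
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.lower())


if __name__ == "__main__":
    # The local scheduler resolves the dependency graph in-process;
    # tasks whose output targets already exist are not re-run.
    luigi.build([ProcessData()], local_scheduler=True)

Running the script a second time completes immediately, since both targets already exist; this idempotence is the property the workshop builds on when composing longer pipelines.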
Outline: