Conda environments can be fantastic for managing your data science dependencies. They can also be fragile, conflict-riddled, disk-filling monsters. Wouldn't it be great if we could easily maintain, delete, and reproduce these environments on a project-by-project basis? We can, and all it takes is a little Makefile magic.
At our shop, we had a problem: our conda environments were a mess. Most of us kept one or two monolithic environments per Python version around (conda activate data_science_37, anyone?), but these quickly became fragile and unmaintainable. Upgrading packages was near-impossible because of version conflicts with other installed packages. Switching machines was a nightmare, as we were never really sure which packages were required for a particular application. We couldn't easily fix environments, and we couldn't delete them. We didn't know how to recreate them, and so we had no easy way to share them. We were stuck.
In desperation, we started scripting our conda environment creation. Since we were already using make for our data pipelines, we started stashing the creation code there, forcing ourselves to create a unique conda environment for each git repo and checking it in with the rest of the codebase.
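As a rough sketch of the idea (the target names, variables, and environment name here are illustrative, not our exact setup), the per-repo Makefile gains a couple of targets that create and delete a project-specific environment from a checked-in environment.yml:

```makefile
# Per-project environment name (illustrative; one environment per git repo)
PROJECT_NAME ?= my_project
CONDA_EXE ?= conda

.PHONY: create_environment delete_environment

# Create the project's conda environment from the checked-in environment.yml
create_environment:
	$(CONDA_EXE) env create --name $(PROJECT_NAME) --file environment.yml

# Remove the project's conda environment entirely; it can be recreated any time
delete_environment:
	$(CONDA_EXE) env remove --name $(PROJECT_NAME)
```

Because environment.yml lives in the repo next to the Makefile, deleting and recreating an environment becomes a cheap, routine operation rather than a scary one.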
Over time, we tweaked these Makefile targets to work around some long-standing limitations of our conda setups. We added lockfiles and self-documenting targets. We found reliable ways to mix pip and conda (in the odd cases where it was needed), and started making heavy use of editable python modules in our workflow. It worked out better than we ever imagined. Our work became reproducible, portable, and better documented.
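To give a flavor of those two tweaks (again, a hedged sketch building on the variables from the previous snippet, not our exact Makefile: the environment.lock.yml filename and the '## ' comment convention are just one common way to do this), a lockfile can be exported every time the environment is updated, and a help target can list every annotated target:

```makefile
.PHONY: update_environment help

# Update the environment from environment.yml, then export a lockfile that
# pins the exact versions actually installed
update_environment: ## Update the conda environment and regenerate the lockfile
	$(CONDA_EXE) env update --name $(PROJECT_NAME) --file environment.yml --prune
	$(CONDA_EXE) env export --name $(PROJECT_NAME) > environment.lock.yml

# Self-documenting Makefile: print every target annotated with a '## ' comment
help: ## Show this help message
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*## "}; {printf "%-22s %s\n", $$1, $$2}'
```

Mixing pip and conda stays manageable the same way: pip-only requirements (including an editable install of the project's own code) can be listed under a pip: section of environment.yml, so the same update_environment target keeps both package managers in sync.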
In this talk, I walk you through the challenges of creating a reproducible, maintainable data science environment using little more than conda, environment.yml, Makefiles, and git, in hopes that you too will be able to make your conda environments more manageable.