PyData Austin 2019 - Presentation: Productionalizing a Data Science Team

As an early member of two data science teams, I’ve spent a lot of time developing processes and practices for running a data science team. I’ll give you some concrete tips on helping your team write production code. This talk is aimed at anyone, whether manager or individual contributor, who wants their team to run more smoothly.

Pre-req

Hire data scientists who actually want to work on a team. Most of my advice does not work with people who do not want to collaborate with others.

Tracking your work

Many data scientists avoid thinking about work tracking and end up with no process for it at all. While this might work for smaller teams and for specific kinds of people, I strongly suggest investing time in a work-tracking process so that it still holds up as your team grows. I personally like Agile, but there are many possible ways to organize your team.

Documentation

Every time I’ve started working with a new dataset, I’ve discovered weird idiosyncrasies in the data. Often when I mention these issues to a coworker, it turns out they were already known but never documented. My main piece of advice is to write all of this information down somewhere! Extensive documentation helps everyone, especially new team members.

Code Organization/Standards

You’ll also want to decide how to organize your repos. I’ve found that many data scientists are unsure what to do with their code once it is ready to move out of Jupyter notebooks. A set standard lets your team organize code without spending a lot of time thinking about it. In addition to code organization, I’ll give some suggestions for coding standards, a practice that I believe is underutilized by many data scientists.
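As a rough illustration of what moving out of a notebook can look like, here is a minimal sketch of notebook logic rewritten as an importable, documented function. The module name `features.py` and the function `fill_missing_with_mean` are hypothetical, chosen only for this example; they are not taken from the talk.

```python
"""features.py: hypothetical module for logic that started life in a notebook."""
from statistics import mean
from typing import List, Optional, Sequence


def fill_missing_with_mean(values: Sequence[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the observed values.

    Raises ValueError if every entry is missing, since no mean exists.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    # Keep observed values as they are; substitute the mean only where missing.
    return [fill if v is None else v for v in values]
```

A function like this can be imported from both a notebook and production code, which gives a reviewer a single, named unit to read and test.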

Code Reviews

Code reviews go hand in hand with coding standards. In my opinion, no code should go into production without being reviewed by at least one other person, but most data scientists are not trained to read and critique someone else’s code. It’s important to think about how your team wants to conduct reviews and how you will train new team members.

Deploying Models

Finally, most teams collaborate with software engineers, so you also need to think about how you want to deploy your models. In most cases, handing over a Jupyter notebook will not be sufficient. I generally recommend that a team choose one method for model deployment (e.g., choose one API framework and use it for everything).
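One way to picture a single deployment method is a shared request handler that every model sits behind. The sketch below is framework-agnostic and uses only the standard library; `toy_model`, its weights, and the `{"features": [...]}` payload shape are assumptions made up for illustration, not a format described in the talk.

```python
import json
from typing import Callable, Sequence

# Any callable mapping a feature vector to a score can stand behind the handler.
Model = Callable[[Sequence[float]], float]


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model: a fixed weighted sum (weights are made up)."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_predict(body: str, model: Model = toy_model) -> str:
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(body)
    features = payload.get("features")
    is_numeric_list = isinstance(features, list) and all(
        isinstance(x, (int, float)) for x in features
    )
    if not is_numeric_list:
        return json.dumps({"error": "features must be a list of numbers"})
    return json.dumps({"prediction": model(features)})
```

Whichever API framework a team picks would wrap a handler like this in a route, so the request and response shapes stay the same across models.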

I hope by the end of this talk you’ll have some ideas you can take back to your own team.

Sunday 3:50 PM–4:35 PM in Track 2

Nicole Carlson