Many data scientists who are used to working independently struggle when transitioning to teams. This talk covers best practices for data science team processes, drawn from two teams I've worked on. Topics include tracking your work, organizing your code, and deploying models. These tips are meant for anyone on the team, managers and individual contributors alike.
As someone who has been an early member of two data science teams, I’ve spent a lot of time developing good processes and practices for running a data science team. I’ll give you some concrete tips on helping your team write production code. This talk is aimed at anyone (whether you are a manager or individual contributor) who wants their team to run more smoothly.
Hire data scientists who actually want to work on a team. Most of my advice does not work with people who do not want to collaborate with others.
Many data scientists avoid thinking about work tracking and end up with no process for it. While this might work for smaller teams and for specific kinds of people, I strongly suggest investing time in putting a work-tracking process in place so that it can scale as your team grows. I personally like Agile, but there are many possible ways to organize your team.
Every time I’ve started working with a new dataset, I’ve discovered weird idiosyncrasies in the data. Often when I mention these issues to a coworker, it turns out they were already known but never documented. My main piece of advice is to write all of this information down somewhere! Extensive documentation helps everyone, especially new team members.
You’ll also want to figure out how to organize your repos. I’ve found that many data scientists are unsure what to do with their code once they’re ready to move out of Jupyter notebooks. A set standard helps everyone on your team organize their code without having to spend a lot of time thinking about it. In addition to code organization, I’ll give some suggestions for coding standards, a practice that I believe is underutilized by many data scientists.
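As one illustration of what moving code out of a notebook can look like, here is a minimal sketch of a notebook cell turned into an importable, documented function. The module name `features.py` and the function `fill_missing_with_mean` are hypothetical examples, not a standard taken from the talk; a team would substitute its own conventions.

```python
"""features.py: hypothetical module for logic that started in a notebook."""
from statistics import mean
from typing import List, Optional


def fill_missing_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the observed values.

    Raises ValueError when there are no observed values to average.
    """
    # Keep only the entries that are actually present.
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    # Preserve order; substitute the mean wherever a value is missing.
    return [fill if v is None else v for v in values]
```

Type hints, a docstring, and an explicit error for the empty case are the kind of details a coding standard might require, and they give a reviewer something concrete to check.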
Code reviews go hand in hand with coding standards. In my opinion, no code should go into production without being reviewed by at least one other person, but most data scientists are not trained to read and critique someone else’s code. It’s important to think about how your team wants to conduct reviews and how you will train new team members.
Finally, most teams will be collaborating with software engineers, so you also need to think about how you want to deploy your models. In most cases, handing over a Jupyter notebook will not be sufficient. I generally recommend that a team choose one method for model deployment (e.g. choose one API framework and use it for everything).
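To make the idea of a single deployment method concrete, here is a rough, framework-independent sketch of one shared request handler that every model goes through. `ThresholdModel` and `handle_predict_request` are hypothetical names invented for illustration, and the JSON request shape is an assumption; in practice the handler would sit behind whichever API framework the team picks.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, features: dict) -> int:
        # Return 1 when the score meets the threshold, otherwise 0.
        return int(features["score"] >= self.threshold)


def handle_predict_request(model, body: str) -> str:
    """Parse a JSON request, call the model, and return a JSON response.

    Assumes a request of the form {"features": {...}}.
    """
    payload = json.loads(body)
    prediction = model.predict(payload["features"])
    return json.dumps({"prediction": prediction})
```

Because every model exposes the same `predict` method and the same request and response shape, the engineers consuming the service only have to learn one interface.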
I hope that by the end of this talk you’ll have some ideas you can take back to your own team.