When confronted with large-scale data challenges, we often reach for complex tools to solve the problem. In the past 7 years, AncestryDNA has amassed the largest collection of consumer genomic data in the world, and with it an unprecedented scaling challenge. We'll show how picking simple tools in the Python ecosystem helped us meet that challenge in production.
With data sets growing larger by the day, and the number of big-data tools growing right along with them, it can be daunting to select the right tool for the job. Sometimes, too, it's tempting to apply the latest-and-greatest ones to the problems we're working on. But, more often than not, the simplest tool is the right one and, fortunately for us, that's where Python shines.
In just 7 years, Ancestry has collected nearly 15 million DNA samples. Data of this magnitude has proved a massive scaling challenge for the production pipeline that must analyze it to produce customer results every single day. This talk tells the story of how that pipeline has evolved over the years: from a manual command-line process, to a scheduled Hadoop pipeline, and finally into the Python-based, event-driven system we use today.
The talk will give a basic overview of our DNA test and cover our core relative detection and ethnicity algorithms at a high level. I'll then dive into the constraints and specific challenges the pipeline presents, and how we decided to leverage Python & Celery to solve those problems. Lastly, I'll describe the benefits of switching to Python, demonstrating the simplicity, performance, and reduction in code it provided.