Biology is experiencing a Big Data revolution brought on by advances in genome sequencing technologies, leading to new challenges and opportunities in computational biology. To address one of these challenges, we built a Python library named SplitThreader to represent complex genomes as graphs, which we are using to untangle hundreds of mutations in a cancer genome.
The field of biology is in the midst of a sequencing revolution. The amount of data collected is growing exponentially, fueled by sequencing costs that are falling faster than Moore's Law.
In Python terms, the human genome is a "list" containing 46 "strings" (chromosomes) for a total of roughly 6 billion characters. Every single character can be the site of a mutation that brings you one step closer to cancer. My research is in cancer genomics, and I have been working to reconstruct the history of rearrangements that brought one patient's cancer genome from 46 chromosomes to 86. In an effort to untangle hundreds of large, overlapping mutations, we built a genomic graph library in Python named SplitThreader. I will explain why a special graph library is needed to represent genomes and show how the same library can be used to understand human genetic variation.
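To make the "list of strings" picture concrete, here is a minimal toy sketch of why a graph helps once chromosomes are rearranged. It is an illustration only: the sequences are made up, and the names `segments`, `edges`, and `spell` are hypothetical and are not SplitThreader's API.

```python
# Toy illustration only: hypothetical names, not SplitThreader's API.

# A "genome" as a list of strings, one string per chromosome.
# Real chromosomes are millions of characters long; these are made up.
genome = ["AAAACCCC", "GGGGTTTT"]

total_characters = sum(len(chrom) for chrom in genome)

# A flat list of strings cannot say that part of one chromosome is now
# joined to part of another. A graph can: nodes are segments of the
# reference genome, and edges are joins between segments.
segments = {
    "chr1_left": (0, 0, 4),   # (chromosome index, start, end)
    "chr1_right": (0, 4, 8),
    "chr2_left": (1, 0, 4),
    "chr2_right": (1, 4, 8),
}

# Joins present in the reference, plus one rearrangement (a translocation).
edges = {
    "chr1_left": ["chr1_right", "chr2_right"],  # second entry is the novel join
    "chr2_left": ["chr2_right"],
}


def spell(path):
    """Concatenate the sequence of each segment along a path in the graph."""
    for a, b in zip(path, path[1:]):
        if b not in edges.get(a, []):
            raise ValueError(f"no edge from {a} to {b}")
    pieces = []
    for name in path:
        chrom, start, end = segments[name]
        pieces.append(genome[chrom][start:end])
    return "".join(pieces)


reference_chr1 = spell(["chr1_left", "chr1_right"])
rearranged_chr = spell(["chr1_left", "chr2_right"])
print(total_characters, reference_chr1, rearranged_chr)
```

Walking different paths through the same graph spells out either the reference chromosome or a rearranged one, which is the kind of question a genome graph is meant to answer at the scale of hundreds of overlapping rearrangements.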
I will also discuss some of the major challenges we are facing in genomics, how Big Data is introducing a new way of doing science, and how we have used Python to iterate quickly on new ideas and algorithms. Together, these topics offer an overview of where computational biology stands today.