I will give an introduction to genome assembly and the challenges it presents. I will include a brief synopsis of the journey from taking a few sample cells to producing a high-standard 'reference genome', before moving on to the engineering challenges, with lessons applicable to all data-intensive tasks.
Genome assembly is a cousin of string reconstruction, but the similarities end quickly. Genomes are not random; they are generated by evolution. This makes them simultaneously interesting and a pain to deal with. Sequencing data acquisition techniques vary, and evolve all the time, so you have to start from scratch every 3 or 4 years. DNA sequencing data production vastly outstrips Moore's law, and the nature of the genome sequence exacerbates the combinatorial explosion in many methods, often presenting worst-case scenarios. Assembly algorithms are usually based on heuristics and are resource-hungry. The methods used are therefore suboptimal, but they enable us to answer biological questions.
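To make the string-reconstruction analogy concrete, here is a minimal illustrative sketch (a toy example, not code from the talk or the pipeline): building a tiny de Bruijn graph from short reads. The repeated substring in the toy genome makes one node branch, which is exactly the kind of structure that pushes combinatorial assembly methods towards their worst cases.

```python
# Illustrative toy only: build a de Bruijn graph from reads and show how a
# repeat creates branching, i.e. more than one reconstruction consistent
# with the data.
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Reads drawn from the toy genome "ATGGCGTGGCA"; the repeat "TGGC" makes the
# node "GGC" branch into two successors ("GCG" and "GCA").
reads = ["ATGGCGT", "GCGTGGC", "GTGGCA"]
for node, successors in sorted(de_bruijn_graph(reads, 4).items()):
    print(node, "->", sorted(successors))
```

Real assemblers work on billions of reads with errors and far longer repeats, which is where the heuristics and the resource hunger come from.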
Having developed bioinformatics methods to enable us to analyse species' DNA, we move on to the question: how can we get better at big data processing? What genome assembly and bioinformatics show is that not only do you need top-notch computational skills, but actually knowing the nature of your data, its biases, and how your results are going to be used allows you to slightly re-define problems in ways that make them more tractable.
Part of the talk will focus on how to dramatically improve the performance of software without sacrificing the quality of the results. This will be relevant to anybody confronted with vast amounts of error-prone data, or with legacy codebases that can no longer cope with the rate of data generation. In our case the performance depends on many factors, including the structure of the genome being assembled and the quality and quantity of sequencing data. One of our success stories is the open source assembly pipeline we use to cope with the massive and highly repetitive bread wheat genome. We still need pretty big computers, but we now have a stable codebase that produces robust results and has reduced RAM requirements from 9TB to 4TB, and runtimes from more than a month to around 10 days. This has taken over a year of intensive effort, which started with detailed profiling, and remains an ongoing task. The whole pipeline, which I will use as a case study in the talk, is open source, available at https://github.com/bioinfologics/w2rap and https://github.com/bioinfologics/w2rap-contigger, and is used widely by genomic researchers. All the software we develop is open source, with either MIT or GPL licenses, and is available on our research team's GitHub page: https://github.com/bioinfologics.
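As a flavour of the kind of data-representation change that drives memory reductions of this sort (an illustrative sketch only, not the actual encoding used in the w2rap codebase), consider packing DNA into 2 bits per base instead of one byte per character:

```python
# Generic illustration of a memory-saving representation: 2 bits per base
# instead of 8. Not the w2rap implementation, just the principle.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq):
    """Pack an ACGT string into a bytearray using 2 bits per base."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= _CODE[base] << ((i % 4) * 2)
    return packed

def unpack(packed, length):
    """Recover the original string from its 2-bit packed form."""
    return "".join(_BASE[(packed[i // 4] >> ((i % 4) * 2)) & 3] for i in range(length))

seq = "ACGTACGTTTGACA"
assert unpack(pack(seq), len(seq)) == seq
print(len(seq), "bases stored in", len(pack(seq)), "bytes")
```

Tricks like this are only a small part of the story; the larger gains described above came from profiling the actual pipeline and letting the measurements, rather than intuition, decide where to spend the engineering effort.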