Genotype-phenotype studies aim to predict traits, such as whether someone will go bald or develop a particular disease, given only their genome. We look at how Python libraries such as scikit-learn and Keras have made it easier to develop these statistical models. We describe a pipeline to predict antimicrobial resistance in bacteria and elaborate on challenges when working with genomic data.
Working with genomic datasets poses a number of challenges, such as high-dimensional vector spaces and extreme amounts of noise. All genotype-phenotype studies ask "Given this textual genome, will some trait be observed?" and can be viewed through the lens of a text classification problem. Predicted traits can be anything observable in the organism, such as hair color or the likelihood of developing cancer. In this talk, I demonstrate the process for constructing a genotype-phenotype machine learning pipeline that predicts resistance to certain pharmaceuticals in Neisseria. We will discuss how data libraries in Python are used at each step of this pipeline.
We will first get an overview of the necessary biological background, including the central dogma of molecular biology and the types of gene mutation. We will then look at popular feature vector representations for genomes, including k-mers and single-nucleotide polymorphism presence. I will demonstrate training classifiers to predict whether a given organism will exhibit antimicrobial resistance, and inspect the trained classifiers to draw biological conclusions. Lastly, I discuss how recurrent neural nets, which are useful for variable-length input, naturally work well with genomes.
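
As a rough illustration of the k-mer representation mentioned above, the following minimal sketch turns a DNA string into a fixed-length vector of k-mer counts using only the standard library. The function name `kmer_vector`, the choice of k, and the toy sequences are assumptions made for this example; they are not taken from the talk's actual pipeline or from real Neisseria data.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return a fixed-length list of k-mer counts for a DNA sequence."""
    # Enumerate every possible k-mer so all genomes share the same feature order.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k across the sequence and count each substring.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Substrings with characters outside the alphabet (e.g. "N") are ignored here.
    return [counts[kmer] for kmer in vocabulary]


# Hypothetical toy sequences, invented for illustration only.
genomes = ["ACGTACGT", "TTTTACGA"]
features = [kmer_vector(g, k=2) for g in genomes]
```

Each row of `features` has one entry per possible k-mer (4^k entries for the four-letter DNA alphabet), so genomes of different lengths map to vectors of the same size. Rows like these, paired with resistance labels, could then be passed to a classifier's `fit` method in a library such as scikit-learn.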