Millions of people are affected by the flu each year, and I wanted to create a better way for scientists to make flu vaccines. My solution is to predict future flu genetic sequences, so that scientists can analyze them and design vaccines more easily. The main topics are the use of Biopython and scikit-learn in a scientific environment, data preprocessing, and model selection.
Every year, millions of people are affected by the flu, and every year we take a vaccine in the hope of gaining immunity to that year’s strains. However, the vaccines do not always work. In recent years, people have tried to predict how the influenza virus will change. For example, every year, world health officials try to predict which strains to include in the coming year’s vaccine. However, many of the methods used have been called “questionable” by the National Institutes of Medicine. So, my goal in this project was to come up with a better way for scientists to make flu vaccines.
In my talk, I will show how I used Python to implement my flu prediction algorithm, and how libraries made each step simpler and more efficient.
To create my algorithm, I first had to obtain the genetic sequences of the flu from the flu database. These came in FASTA format. I originally read the file in as a plain .txt file, but the sequences are over 1700 base pairs long, and parsing them by hand was awkward. Switching to the Biopython library, which has built-in functions for quickly parsing FASTA files, made this step much easier.
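As a rough illustration of the parsing step, here is a minimal sketch using Biopython's `SeqIO.parse`. The two records, their ids, and the helper name `read_sequences` are made up for this example. A real run would pass an open FASTA file downloaded from a sequence database instead of the in-memory text.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records standing in for a real FASTA download (assumption).
FASTA_TEXT = """>strain_A example record (hypothetical)
ATGAAGACTATCATTGCTTTG
AGCTACATTCTATGTCTGGTT
>strain_B example record (hypothetical)
ATGAAGACTATCATTGCTTTGAGC
"""


def read_sequences(handle):
    """Return a dict mapping each record id to its sequence as a plain string."""
    return {record.id: str(record.seq) for record in SeqIO.parse(handle, "fasta")}


# SeqIO.parse joins sequences that are wrapped over several lines.
sequences = read_sequences(StringIO(FASTA_TEXT))
for name, seq in sequences.items():
    print(name, len(seq))
```

Reading the file as plain text would mean handling the `>` header lines and the line-wrapped sequences by hand, which is the bookkeeping the parser takes care of.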
Next, I tried a variety of machine learning algorithms. I originally started writing them on my own, but this proved tedious, so I switched to the implementations in scikit-learn. Eventually, I found that the random forest algorithm performed best. I will explain how I evaluated each model’s performance and why the random forest came out ahead.
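One common way to compare models in scikit-learn is k-fold cross-validation, sketched below. This is not necessarily the evaluation used in the project: the data is synthetic (generated with `make_classification`, with four classes loosely standing in for the four nucleotides), and the candidate models and their settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data, not flu sequences: 4 classes stand in for A, C, G, T.
X, y = make_classification(
    n_samples=400,
    n_features=20,
    n_informative=8,
    n_classes=4,
    random_state=0,
)

# Candidate models; the choice and hyperparameters are illustrative only.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model on the same 5 folds and keep the mean accuracy.
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("best on this toy data:", best)
```

Which model wins here says nothing about real flu data. The point is only that each candidate is scored the same way, so the comparison is fair.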
What the audience will learn:
I will explain the benefits and drawbacks of using the Biopython and scikit-learn functions. I will cover interpreting and preprocessing the gene sequences, encoding the data, fitting the model to the data, evaluating each model’s performance, and selecting the best model.
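To make the encoding step concrete: scikit-learn models expect numeric features, so the nucleotide letters have to be turned into numbers first. The sketch below uses one-hot encoding, which is one possible scheme and not necessarily the one used in the project. The function name `one_hot_encode` and the handling of unknown characters are assumptions made for this example.

```python
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Encode a nucleotide string as a flat list of 0/1 values, four per base.

    Characters outside A, C, G, T (for example the ambiguity code N) become
    four zeros. That is a choice made for this sketch, not a fixed rule.
    """
    encoded = []
    for base in sequence.upper():
        encoded.extend(1 if base == letter else 0 for letter in NUCLEOTIDES)
    return encoded


features = one_hot_encode("ACGTN")
print(features)
```

Each base becomes four columns, so a sequence of length n yields 4n features. Compared with mapping the letters to the integers 0 to 3, this avoids suggesting an ordering among the bases that does not exist.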