Friday 10 a.m.–noon

PySnpTools: A New Open-Source Library for Reading and Manipulating Matrix Data (including Genomics)

Carl Kadie

Audience level:
Novice

Description

Anyone who uses fast numeric NumPy arrays but wants a simpler-than-Pandas way to slice and dice, read and write, will find PySnpTools useful. I'll describe PySnpTools and explain how it fits into our machine learning research group's long-term move from C++/VB to C# to Python. I'll also show how we use PySnpTools in FaST-LMM to do state-of-the-art genome-wide association studies (GWAS).

Abstract

The tutorial will cover:

PySnpTools details:

  • PstReader: Full NumPy-meets-Pandas-like slicing and subsetting of matrix data before (and after) reading from disk. (For genomics, it includes support for the PLINK Bed and phenotype formats. It also includes low-memory, high-speed methods for common operations such as standardization and kernel creation.)

  • Utilities: One-line intersection and reordering of data for machine learning and statistics. Faster-than-NumPy extraction of a subarray from a NumPy array.

  • IntRangeSet: Manipulate from zero to billions of integers as sets with very little memory.
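Two of the ideas in the list above, slicing before reading and storing integers as ranges, can be illustrated with a minimal standard-library sketch. This is not PySnpTools code or its API: the class names LazyRows and RangeSet, their methods, and the loader callable are all hypothetical, chosen only to show why deferred slicing avoids unneeded reads and why a range representation needs so little memory.

```python
from bisect import bisect_left, bisect_right


class LazyRows:
    """Hypothetical reader: slicing only records row indices; read() loads them."""

    def __init__(self, loader, row_count, rows=None):
        self._loader = loader  # assumed callable: row index -> list of values
        self._rows = list(range(row_count)) if rows is None else rows

    def __getitem__(self, key):
        # Accept a slice or a list of positions; nothing is loaded here.
        if isinstance(key, slice):
            picked = self._rows[key]
        else:
            picked = [self._rows[i] for i in key]
        return LazyRows(self._loader, 0, picked)

    def read(self):
        # Only the rows that survived every slicing step are loaded.
        return [self._loader(i) for i in self._rows]


class RangeSet:
    """Hypothetical integer set kept as sorted, disjoint, half-open ranges."""

    def __init__(self):
        self._starts = []
        self._stops = []

    def add_range(self, start, stop):
        # Add every integer in [start, stop), merging ranges that overlap or touch.
        if start >= stop:
            return
        lo = bisect_left(self._stops, start)   # first range ending at or after start
        hi = bisect_right(self._starts, stop)  # one past last range starting at or before stop
        if lo < hi:
            start = min(start, self._starts[lo])
            stop = max(stop, self._stops[hi - 1])
        self._starts[lo:hi] = [start]
        self._stops[lo:hi] = [stop]

    def __contains__(self, value):
        i = bisect_right(self._starts, value) - 1
        return i >= 0 and value < self._stops[i]

    def __len__(self):
        return sum(stop - start for start, stop in self.ranges())

    def ranges(self):
        return list(zip(self._starts, self._stops))
```

In this sketch, LazyRows(loader, 1000)[::250][[0, 2]].read() calls loader only for rows 0 and 500, and a RangeSet holding a billion consecutive integers stores just one (start, stop) pair.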

Python Trade-offs We Observe:

Our industrial research group focuses on machine learning. Over 15 years, we have moved from C++/VB to C# to Python. I'll talk about why we chose Python and what trade-offs we see.

Application:

PySnpTools spun out of FaST-LMM, an open-source, Python-based, state-of-the-art system for genome-wide association studies (GWAS). It is described in publications in Nature Methods, Nature Genetics, and Bioinformatics. I'll talk about:

  • a layman’s overview of GWAS
  • how to use FaST-LMM
  • how FaST-LMM uses PySnpTools

To Install:
