Saturday 3:15 PM–4:00 PM in Room 1

Pyglmnet: A Python package for elastic-net regularized generalized linear models

Pavan Ramkumar

Audience level:
Experienced

Description

In the era of big data and high-performance computing, generalized linear models (GLMs) have come to be widely applied across the sciences, economics, business, and finance. Although a popular R package (glmnet) exists, there is no equivalent in Python. I developed pyglmnet to address this need. In this talk, I will give a theoretical overview of GLMs and demonstrate the package with examples.

Abstract

Generalized linear models (GLMs) are powerful tools for multivariate regression. They allow us to model different types of target variables — real-valued, categorical, count, ordinal, and so on — using multiple predictors or features.
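As a concrete illustration (a minimal NumPy sketch, not part of the package): a Poisson GLM models count-valued targets by passing a linear combination of the predictors through an exponential inverse link.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix: 100 samples, 3 predictors
X = rng.normal(size=(100, 3))
beta = np.array([0.5, -0.25, 1.0])  # hypothetical true coefficients
beta0 = 0.1                          # intercept

# Poisson GLM: the conditional mean is the inverse link applied
# to the linear predictor, here mu = exp(beta0 + X @ beta)
mu = np.exp(beta0 + X @ beta)

# Count-valued targets drawn from the Poisson distribution
y = rng.poisson(mu)
```

Swapping the inverse link and the noise distribution (e.g. the logistic function with Bernoulli noise) yields the other members of the GLM family.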

Pyglmnet (http://github.com/pavanramkumar/pyglmnet) is a newly launched, fast-growing Python package for GLMs with state-of-the-art elastic-net regularization and a growing list of link functions. It aims to mimic the functionality of the widely popular R package glmnet. The API documentation is inspired by scikit-learn, and the design is mindful of both R and scikit-learn users. It interoperates easily with scikit-learn tools for preprocessing, cross-validation, scoring, and more.

Here is a talk outline.

Regularized multivariate regression models

I will set up the basic GLM, walk through the most popular variants, and show how the model parameters can be estimated by maximizing the log-likelihood of the data. In problems with a large number of predictors, often only a small fraction of them are useful. Therefore, penalizing complex models — known as regularization — is important to prevent overfitting to the finite dataset. I will describe one popular regularization method called the elastic net.
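For concreteness, the elastic net penalty is a convex combination of L1 (lasso) and L2 (ridge) terms. A minimal sketch (the 1/2 factor on the L2 term is one common convention; glmnet uses the same):

```python
import numpy as np

def elastic_net_penalty(beta, reg_lambda, alpha):
    """Elastic net penalty:
    reg_lambda * (alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2).
    alpha=1 recovers the lasso; alpha=0 recovers ridge regression."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return reg_lambda * (alpha * l1 + 0.5 * (1 - alpha) * l2)

beta = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(beta, reg_lambda=0.1, alpha=1.0))  # pure L1: 0.1 * 3 = 0.3
```

The L1 term drives small coefficients exactly to zero (sparsity), while the L2 term stabilizes the solution when predictors are correlated.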

Optimization

Once the problem is set up, the GLM parameters are estimated by maximizing the penalized log-likelihood. This problem is solved using proximal gradient descent with shrinkage. I will introduce the intuitions behind gradient descent and show how modern Python tools such as SymPy and Theano can be used to calculate gradients symbolically or automatically.
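The two key ingredients can be sketched briefly (a minimal illustration, not pyglmnet's actual implementation): SymPy can differentiate one sample's contribution to a Poisson log-likelihood symbolically, and the proximal step for the L1 part of the penalty is the soft-thresholding (shrinkage) operator.

```python
import numpy as np
import sympy

# Symbolic gradient with SymPy: one scalar term of a Poisson
# log-likelihood (dropping the constant log(y!) term).
beta, x, y = sympy.symbols('beta x y')
loglik = y * beta * x - sympy.exp(beta * x)  # one sample's contribution
grad = sympy.diff(loglik, beta)              # x*y - x*exp(beta*x)

# Proximal operator of threshold * ||.||_1: shrink each
# coefficient toward zero by the threshold, zeroing small ones.
def soft_threshold(z, threshold):
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

print(soft_threshold(np.array([1.5, -0.2, 0.7]), 0.5))
# -> [ 1.   0.   0.2]
```

A proximal gradient iteration simply alternates a gradient step on the smooth (log-likelihood plus L2) part with this soft-thresholding step on the L1 part.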

Python implementation and examples

I will walk through the Python implementation, demonstrating the GLM class and its methods, along with applications to a real biological dataset.

Prerequisites

Working knowledge of freshman-level calculus and linear algebra. Exposure to linear regression and optimization would be useful. A strong interest in machine learning.

Learning goals

After the talk, you will be able to:

  • Appreciate the versatility of GLMs in data analysis

  • Set up the optimization problems using log-likelihoods

  • Understand the intuition behind regularization and the elastic net penalty

  • Understand gradient descent

  • Use symbolic differentiation to compute gradients with SymPy

  • Appreciate the power of automatic differentiation with Theano

  • Apply pyglmnet to your favorite prediction problem