In the era of big data and high-performance computing, generalized linear models (GLMs) have come to be widely applied across the sciences, economics, business, and finance. Although a popular R package (glmnet) exists, there is no equivalent in Python. I developed pyglmnet to address this need. In this talk, I will give a theoretical overview of GLMs and demonstrate the package with examples.
Generalized linear models (GLMs) are powerful tools for multivariate regression. They allow us to model many types of target variables (real-valued, categorical, count, ordinal, etc.) using multiple predictors or features.
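Concretely, a GLM models the conditional mean of the target through a link function g (this is the standard textbook formulation, not notation specific to pyglmnet):

```latex
\mathbb{E}[y \mid x] \;=\; \mu \;=\; g^{-1}\!\left(\beta_0 + \beta^\top x\right)
```

The identity link recovers ordinary linear regression, the logit link gives logistic regression for binary targets, and the log link gives Poisson regression for counts.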
Pyglmnet (http://github.com/pavanramkumar/pyglmnet) is a newly launched and fast-growing Python package for GLMs with state-of-the-art elastic net regularization and a growing list of link functions. It aims to mimic the functionality of the widely popular R package glmnet. The API documentation is inspired by scikit-learn, and the design is mindful of both R and scikit-learn users. It interoperates easily with scikit-learn tools for preprocessing, cross-validation, scoring, etc.
Here is an outline of the talk.
I will set up the basic GLM model, walk through the most popular variants, and show how the model parameters can be estimated by maximizing the log-likelihood of the data. In problems with a large number of predictors, only a small fraction of them are typically useful. Penalizing model complexity, known as regularization, is therefore important to prevent overfitting to a finite dataset. I will describe one popular regularization method called the elastic net.
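In the notation used by glmnet, with regularization strength λ and mixing parameter α in [0, 1], the elastic-net-penalized objective takes the form:

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\log p\!\left(y_i \mid x_i;\, \beta_0, \beta\right)
\;+\;
\lambda\left(\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right)
```

Setting α = 1 recovers the lasso penalty, while α = 0 recovers ridge regression.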
Once the problem is set up, the GLM parameters are estimated by maximizing the penalized log-likelihood. This problem is solved using proximal gradient descent with shrinkage. I will introduce the intuitions behind gradient descent and show how modern Python tools such as SymPy and Theano can be useful for calculating gradients symbolically or automatically.
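To make the proximal step concrete, here is a didactic sketch of proximal gradient descent for a lasso-penalized linear model (a special case of the elastic net with α = 1). This is an illustration of the technique only, not pyglmnet's actual solver; the function names and step size are my own choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: elementwise soft-thresholding (shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gradient_lasso(X, y, lam, lr=0.01, n_iter=1000):
    """Minimize 0.5/n * ||y - X @ beta||^2 + lam * ||beta||_1
    by alternating a gradient step on the smooth loss with a shrinkage step.
    Didactic sketch only, not pyglmnet's implementation."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n                    # gradient of the smooth part
        beta = soft_threshold(beta - lr * grad, lr * lam)   # proximal (shrinkage) step
    return beta

# Tiny demo: only the first of five features truly matters.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)
beta = prox_gradient_lasso(X, y, lam=0.1)
print(np.round(beta, 2))
```

The L1 penalty drives the four irrelevant coefficients to (near) zero while slightly shrinking the informative one, which is exactly the sparsity-inducing behavior that motivates the elastic net.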
I will walk through the Python implementation, demonstrating the GLM class and its methods, along with an application to a real biological dataset.
Working knowledge of freshman-level calculus and linear algebra is assumed. Exposure to linear regression and optimization would be useful, as would a strong interest in machine learning.
After the talk, you will be able to:
Appreciate the versatility of GLMs in data analysis
Set up the optimization problem using log-likelihoods
Understand the intuition behind regularization and the elastic net penalty
Understand gradient descent
Use symbolic differentiation to compute gradients with SymPy
Appreciate the power of automatic differentiation with Theano
Apply pyglmnet to your favorite prediction problem
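As a taste of the symbolic-differentiation workflow, here is a minimal SymPy sketch that differentiates the log-likelihood of a single Poisson observation under a log link. The one-predictor setup is a simplification I chose for illustration.

```python
import sympy as sp

# Symbols: one predictor x, one weight beta, one observed count y
beta, x, y = sp.symbols('beta x y')

# Poisson GLM with log link: rate mu = exp(beta * x)
mu = sp.exp(beta * x)

# Log-likelihood of one observation, dropping the constant log(y!) term
loglik = y * sp.log(mu) - mu

# Symbolic gradient with respect to beta
grad = sp.simplify(sp.diff(loglik, beta))
print(grad)
```

The resulting expression equals x*(y - exp(beta*x)) up to rearrangement, i.e. the predictor weighted by the residual between the observed and predicted counts, which matches the usual Poisson GLM score equation.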