Logistic regression models are powerful tools in the data science toolkit. In this talk we will explore various implementations of logistic regression in Python and SAS, with a focus on output and performance, and discuss the numerical and statistical implications (including Bayesian interpretations) of the various options.
The logistic regression model is a linear model for the log-odds of a binary or multi-class response. Empirically, carefully built logistic models perform well on many diverse tasks, including survival analysis and classification problems. Business requirements often impose statistical constraints on modelers (e.g., interpretability), making logistic regression a standard industry tool.
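For concreteness, in the binary case the model relates the log-odds of a positive response to a linear function of the predictors (a standard formulation, written here with predictors x_1, …, x_p and coefficients β_0, …, β_p):

```latex
\log \frac{\Pr(Y = 1 \mid x_1, \dots, x_p)}{\Pr(Y = 0 \mid x_1, \dots, x_p)}
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
```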
Many organizations, including Capital One, are experimenting with the transition from SAS to open-source tools such as Python. During this transition, modelers must confront a multitude of options for performing old tasks in new ways, including which package(s) to use for fitting logistic models. This naturally raises many questions: Are the same tweaks and options still available (e.g., offsets, the Firth adjustment)? Can we reproduce the same output and trust the results (e.g., p-values, coefficient estimates)? How should a modeler navigate the different numerical implementations?
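To make these questions concrete, here is a minimal sketch (assuming statsmodels and scikit-learn are installed; the data are synthetic) that fits the same model two ways. It illustrates one well-known pitfall: scikit-learn's LogisticRegression applies L2 regularization by default, so its coefficients need not match an unpenalized maximum-likelihood fit such as the one statsmodels or SAS PROC LOGISTIC produces.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Synthetic data: two predictors and a binary response drawn from a
# known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
eta = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]        # true linear predictor
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # binary response

# statsmodels: unpenalized maximum-likelihood fit with standard errors
# and p-values, comparable to the tables printed by SAS PROC LOGISTIC.
sm_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(sm_fit.params)    # coefficient estimates
print(sm_fit.pvalues)   # Wald p-values

# scikit-learn: L2-regularized by default (C=1.0); disable the penalty
# to approximate the statsmodels/SAS estimates.
# (penalty=None requires scikit-learn >= 1.2; older releases use 'none'.)
sk_fit = LogisticRegression(penalty=None).fit(X, y)
print(sk_fit.intercept_, sk_fit.coef_)
```

Differences like this default penalty, the choice of optimizer, and the availability of inference output (standard errors, p-values) are exactly the kinds of implementation details this talk digs into.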
In this talk, we will take a deep dive into logistic regression, with a focus on performance and output comparisons between SAS and various Python packages. We will also dig into the mathematical underpinnings of the implementations and discuss the numerical and statistical implications, including Bayesian interpretations (e.g., L2 regularization corresponds to a Gaussian prior on the coefficients). Understanding what's going on behind the scenes can lead to powerful insights and innovations, and hopefully will serve as inspiration for improved future models!