- Audience level:
- Intermediate

Generalized linear mixed-effects models, ubiquitous in social science research, are rarely seen in applied data science work despite their relevance and simplicity. We will discuss this class of statistical models and their usefulness in recommender systems, and present Diamond, a fast, scalable Python solver for them.

- fashion ecommerce company using data science extensively
- human stylists + algorithms pick out clothes for customers

- elegant way to learn from the experience of others
- linear model
- $$ Y = X \beta + \epsilon $$
- $ \epsilon \sim N(0, \sigma^2) $

- linear mixed-effects model
- $$ Y = X \beta + Z b + \epsilon $$
- X is the "fixed" design matrix
- Z is the "random" design matrix
- $$ b \sim N(0, \Sigma) $$
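As a sketch (all sizes and variances below are made-up), the model $Y = X \beta + Z b + \epsilon$ can be simulated directly; $Z$ is just a matrix of group indicators:

```python
import numpy as np

# Minimal sketch of Y = X @ beta + Z @ b + eps for grouped data.
# Group counts, dimensions, and variances are made-up for illustration.
rng = np.random.default_rng(4)
n_groups, per_group = 10, 20
n = n_groups * per_group

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design: intercept + slope
group = np.repeat(np.arange(n_groups), per_group)
Z = np.eye(n_groups)[group]                            # random design: group indicators

beta = np.array([2.0, 0.5])          # shared "fixed" coefficients
b = rng.normal(0, 1.0, n_groups)     # b ~ N(0, Sigma); Sigma = I here
Y = X @ beta + Z @ b + rng.normal(0, 0.3, n)
```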

- "fixed" vs. "random" effects

- An extremely simple model:
`rating ~ 1 + (1 | user_id) + (1 | item_id)`

- Mathematically, this means
- $\mu_{global}$ is not a random variable, but unknown. "global bias"
- $\mu_{user} \sim N(0, \sigma_{user}^2)$. "user bias"
- $\mu_{item} \sim N(0, \sigma_{item}^2)$. "item bias"
- these biases are very strong signals in most systems with user-item interactions
- some items are more popular than others
- some users are more picky than others
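A quick simulation (made-up variances and sample sizes) shows how much of the rating variance these biases carry:

```python
import numpy as np

# Sketch: simulate ratings from the intercept-only mixed model above.
# sigma_user, sigma_item, and the sample sizes are made-up values.
rng = np.random.default_rng(0)
n_users, n_items, n_obs = 200, 50, 5000
mu_global = 3.5                                  # fixed but unknown in practice
mu_user = rng.normal(0.0, 0.5, size=n_users)     # user bias ~ N(0, sigma_user^2)
mu_item = rng.normal(0.0, 0.8, size=n_items)     # item bias ~ N(0, sigma_item^2)

u = rng.integers(0, n_users, size=n_obs)
i = rng.integers(0, n_items, size=n_obs)
rating = mu_global + mu_user[u] + mu_item[i] + rng.normal(0.0, 1.0, size=n_obs)

# Compare residual variance with and without the user/item biases.
print(np.var(rating - mu_global))                 # roughly 0.5^2 + 0.8^2 + 1^2
print(np.var(rating - mu_global - mu_user[u] - mu_item[i]))  # roughly 1.0
```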

- in mixed-effects, regularization is learned automatically!
- regularization controlled by $ \sigma_{user}$ and $\sigma_{item}$
- users/items with few observations will be shrunk towards the mean
- still a cold-start problem
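With known variances, the shrinkage has a closed form for this intercept-only model: a user's estimated bias is their observed mean bias scaled by $n \sigma_{user}^2 / (n \sigma_{user}^2 + \sigma^2)$. A sketch (variances are made-up):

```python
import numpy as np

# Shrinkage factor sketch: with known noise variance sigma^2 and user-bias
# variance sigma_user^2, the posterior mean of a user's bias given n
# centered ratings with sample mean ybar is
#   b_hat = n * sigma_user^2 / (n * sigma_user^2 + sigma^2) * ybar
def shrunk_bias(ybar, n, sigma_user=0.5, sigma=1.0):
    w = n * sigma_user**2 / (n * sigma_user**2 + sigma**2)
    return w * ybar

# A user with 2 ratings is pulled strongly toward 0 (the population mean);
# a user with 200 ratings keeps almost all of their observed bias.
print(shrunk_bias(1.0, n=2))    # ≈ 0.33
print(shrunk_bias(1.0, n=200))  # ≈ 0.98
```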

- Add features:
`rating ~ 1 + (1 + item_feature1 + item_feature2 | user_id) + (1 + user_feature1 + user_feature2 | item_id)`

- Now, $\beta_{user} \sim N(0, \Sigma_{user})$ i.e. multivariate
- Now, $\beta_{item} \sim N(0, \Sigma_{item})$ i.e. multivariate
- In general, covariance matrices are hard to learn: the number of parameters scales as $O(p^2)$ in the number of features $p$
- But, this model lets us learn important things like
- individual users' preferences for items, e.g. color or print
- which items are liked by which users, e.g. "item 4602 is more popular with older users"

- Connection to matrix factorization
- this model has no embedding of users or items
- But, you could fit this model first, then fit the residuals using matrix factorization
- this separate optimization might be competitive with a joint optimization, especially one that uses stochastic gradient descent
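A toy sketch of that two-stage idea on a simulated dense rating matrix (stage one uses simple means in place of a fitted mixed model; all sizes and ranks are made-up):

```python
import numpy as np

# Sketch of the two-stage idea: fit biases first, then factorize residuals.
# All data is simulated; sizes, variances, and the rank are made-up.
rng = np.random.default_rng(1)
n_users, n_items, rank = 100, 40, 3
R = (3.5
     + rng.normal(0, 0.5, (n_users, 1))          # user bias
     + rng.normal(0, 0.8, (1, n_items))          # item bias
     + rng.normal(0, 1.0, (n_users, rank)) @ rng.normal(0, 1.0, (rank, n_items))
     + rng.normal(0, 0.3, (n_users, n_items)))   # noise

# Stage 1: estimate biases (here with plain means; a mixed model would
# additionally shrink sparse users/items).
global_bias = R.mean()
user_bias = R.mean(axis=1, keepdims=True) - global_bias
item_bias = R.mean(axis=0, keepdims=True) - global_bias
residual = R - global_bias - user_bias - item_bias

# Stage 2: low-rank factorization of the residuals via truncated SVD.
U, s, Vt = np.linalg.svd(residual, full_matrices=False)
approx = U[:, :rank] * s[:rank] @ Vt[:rank]
print(np.linalg.norm(residual - approx) / np.linalg.norm(residual))
```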

- these models have a Bayesian flavor, and can be fit using Markov chain Monte Carlo or variational inference
- however, there is a more traditional maximum likelihood-based approach illustrated by the lme4 vignette
- Roughly, the way to fit these models is an alternating sequence of
- learn covariance structure (nonlinear optimization)
- given covariance structure, estimate coefficients (standard convex optimization)
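For a Gaussian response, the second step reduces to a ridge-like penalized least-squares problem: given the covariance structure, $b$ solves $(Z^\top Z / \sigma^2 + \Sigma^{-1})\, b = Z^\top (y - X\beta) / \sigma^2$. A minimal sketch with simulated data (intercept-only random effects, no fixed effects, made-up sizes):

```python
import numpy as np

# Sketch of step 2 for a Gaussian response: with the covariance structure
# known, the random effects b solve a ridge-like penalized least-squares
# problem. Z is a one-hot user indicator matrix, Sigma = sigma_u^2 * I.
rng = np.random.default_rng(2)
n_users, n_obs = 30, 500
sigma, sigma_u = 1.0, 0.5

u = rng.integers(0, n_users, size=n_obs)
Z = np.eye(n_users)[u]                        # random-effects design matrix
b_true = rng.normal(0, sigma_u, n_users)
y = Z @ b_true + rng.normal(0, sigma, n_obs)  # no fixed effects, for brevity

# Penalized normal equations: (Z'Z / sigma^2 + Sigma^{-1}) b = Z'y / sigma^2
A = Z.T @ Z / sigma**2 + np.eye(n_users) / sigma_u**2
b_hat = np.linalg.solve(A, Z.T @ y / sigma**2)
print(np.corrcoef(b_true, b_hat)[0, 1])
```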

- diamond only does the second step, because our use case is
- take a sample of data and fit the full mixed-effects model using MCMC or maximum likelihood
- using the learned covariance matrices, use diamond to estimate coefficients for all users and items
- R can be dicey for production use, so we prefer to fit models in Python
- Also easier to integrate into eng systems for production use at Stitch Fix

- Computational approach is second-order Newton-Raphson with some tricks from this paper
- difficult to invert the Hessian at every iteration -> use a fixed-Hessian approximation
- conjugate gradient step sizes
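A sketch of the fixed-Hessian idea for logistic regression: since the weights $p(1-p)$ never exceed $1/4$, $X^\top X / 4$ dominates the true Hessian and can be factorized once and reused every iteration. This illustrates only the fixed-Hessian part, not Diamond's full algorithm, and all data is simulated:

```python
import numpy as np

# Fixed-Hessian Newton steps for logistic regression: factorize
# H_fixed = X'X / 4 once, then reuse it instead of the changing Hessian.
rng = np.random.default_rng(3)
n, p = 1000, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

H_fixed = X.T @ X / 4.0              # upper bound on the Hessian
chol = np.linalg.cholesky(H_fixed)   # factorize once

beta = np.zeros(p)
for _ in range(200):
    prob = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - prob)          # gradient of the log-likelihood
    step = np.linalg.solve(chol.T, np.linalg.solve(chol, grad))
    beta = beta + step
    if np.linalg.norm(grad) < 1e-8:
        break

print(beta.round(2))
```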

- This involved a lot of paper and pencil math, but is very fast!
- Diamond supports logistic regression for binary and ordinal response variables
- Results of speed test on MovieLens dataset