Thursday 10:00 AM–10:40 AM in Central Park East (#6501a)

Train, Evaluate, Repeat: Building a Credit Card Fraud Detection System

Leela Senthil Nathan

Audience level:
Intermediate

Description

This talk covers three major ML problems Stripe faced (and solved!) in building its credit card fraud detection system: choosing labels for fraud that work across all merchants, addressing class imbalance (legitimate charges greatly outnumber fraudulent ones), and performing counterfactual evaluation (to measure performance and obtain training data when the ML system is itself changing outcomes).

Abstract

Stripe processes billions of dollars of transactions for hundreds of thousands of merchants across the globe, so it is imperative that we protect our merchants from as much fraud as possible. However, machine learning models that fight credit card fraud have real-world consequences for merchants when they get a decision wrong: false positives result in lost revenue and customer churn, while false negatives lead to the loss of the sold items.

In order to effectively protect our merchants from fraud, Stripe must constantly retrain our transaction fraud models and evaluate their performance in the real world. We have encountered three major challenges in this process:

  1. It’s not always easy to agree on a definition of fraud. For instance, a merchant may label a charge as “fraud” because it believes the customer is fraudulent, even if the charge never actually turns out to be fraudulent.
  2. Training on highly imbalanced data usually doesn’t produce the best-performing models. Only a very small percentage of all charges end up being fraudulent, regardless of how you define fraud. If you trained on charge data as-is, you might get a model that appears reasonable at first glance, but it will often exhibit artifacts of the class imbalance (e.g., it’s hard to achieve very high precision when the target label is so rare); see the sketch after this list.
  3. Evaluating performance in production, and generating training data for future models, is not straightforward. If Stripe blocks a transaction because we think it is fraudulent, we never observe the charge’s ultimate outcome, so we can’t tell whether we got it right. And if we naively trained only on transactions with observed outcomes (i.e., ones we did not block), we’d end up with a training set that under-represents fraud, and new models would “forget” how to detect it.
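
To make the class-imbalance point concrete, here is a minimal sketch of one common remedy (not necessarily Stripe’s actual pipeline): downsample the majority class of legitimate charges and re-weight the kept examples so predicted probabilities stay roughly calibrated. The synthetic data, the 5% keep rate, and the logistic regression model are all illustrative assumptions.

```python
# Sketch: handle class imbalance by downsampling legitimate charges and
# re-weighting them so the model's implied class prior stays calibrated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for charge data: ~1% of charges are fraudulent.
X, y = make_classification(n_samples=200_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

keep_rate = 0.05  # keep 5% of legitimate charges, keep all fraudulent ones
rng = np.random.default_rng(0)
is_fraud = y_train == 1
keep = is_fraud | (rng.random(len(y_train)) < keep_rate)

X_sub, y_sub = X_train[keep], y_train[keep]
# Negatives were sampled at `keep_rate`, so weight them by 1/keep_rate to
# undo the sampling and keep predicted probabilities roughly calibrated.
weights = np.where(y_sub == 1, 1.0, 1.0 / keep_rate)

model = LogisticRegression(max_iter=1000)
model.fit(X_sub, y_sub, sample_weight=weights)

pred = model.predict_proba(X_test)[:, 1] > 0.5
print("precision:", precision_score(y_test, pred),
      "recall:", recall_score(y_test, pred))
```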

In this talk, I’ll first focus on how Stripe decided what labels to use for fraudulent charges. Then, I’ll cover techniques we use during training to boost the signal of fraudulent charges in the imbalanced data set (and thus improve overall model performance). Once we’ve discussed labeling and class imbalance, I’ll dive into the most challenging aspect of our model pipeline: developing an effective counterfactual evaluation technique that we use both to gauge how well our model is doing in the real world and to generate unbiased training data. I’ll describe two different approaches we have employed for counterfactual evaluation, and why we ultimately decided that one of them was more effective.
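
The abstract doesn’t name the two counterfactual-evaluation approaches, but one widely used technique that fits the problem described above is inverse-propensity weighting: allow a small random fraction of charges the model would otherwise block, record each charge’s probability of being allowed, and weight the observed outcomes by the inverse of that probability. The sketch below is a hypothetical illustration of that idea; the exploration rate, the scoring policy, and the simulated outcomes are all assumptions, not Stripe’s actual numbers.

```python
# Sketch: counterfactual evaluation via inverse-propensity weighting.
# Assumption: a small random fraction of would-be-blocked charges is allowed
# through, and each charge records the probability with which it was allowed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
score = rng.random(n)              # model's fraud score for each charge
would_block = score > 0.9          # policy: block high-scoring charges

# Exploration: allow would-be-blocked charges with small probability epsilon.
epsilon = 0.05
allow_prob = np.where(would_block, epsilon, 1.0)
allowed = rng.random(n) < allow_prob

# True outcomes (simulated here) are only observable for allowed charges.
is_fraud = rng.random(n) < 0.02 + 0.3 * score

df = pd.DataFrame({
    "score": score,
    "allowed": allowed,
    "allow_prob": allow_prob,
    "is_fraud": np.where(allowed, is_fraud, np.nan),  # unobserved if blocked
})

# Each observed charge stands in for 1 / allow_prob charges like it, which
# corrects for risky charges being under-represented among observed outcomes.
obs = df[df.allowed]
weights = 1.0 / obs.allow_prob

# Estimated fraud rate over all charges, had nothing been blocked:
est_fraud_rate = np.average(obs.is_fraud, weights=weights)
print("estimated overall fraud rate:", est_fraud_rate)
```

The same weights can be attached to the observed charges when they are fed back in as training data, which counteracts the “forgetting” problem described in challenge 3.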
