Monday 2:00 PM–2:45 PM in The Forum, 4th Floor / NLP

Extracting Structured Data from Legal Documents

Zack Witten

Audience level:
Intermediate

Description

You’ll learn how to take a never-before-seen legal document, like a contract or a convertible note, and use machine learning to “read” the document and answer questions like “Who’s the investor?” and “What interest rate did the parties agree to?”

Abstract

If you’re a law firm, when you get a new client, they’re going to send you a giant zip file with hundreds or thousands of documents. You might then sic a team of highly educated, highly paid paralegals on the pile to read each document and painstakingly enter the counterparty, date of execution, etc. into a spreadsheet. But what if an algorithm could do it for free?

In this talk, we’ll cover extracting this information from a legal document using a Conditional Random Field (CRF), an extension of the Hidden Markov Model. We’ll go through the math behind the model in some depth, give some tips for feature selection, and talk about how to vectorize a legal document. We’ll close with a description of the pluses and minuses of using CRFs in lieu of more complex deep learning-type models.
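
To make "vectorizing a legal document" concrete, here is a minimal sketch of one common way to prepare text for a sequence labeler such as a CRF: each token becomes a small dictionary of hand-picked features, and each token gets a BIO tag. This is not necessarily the approach used in the talk. The function names, the feature set, the example sentence, and the RATE and INVESTOR label names are all assumptions chosen for illustration.

```python
import re

def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the token and its neighbours.

    The feature set is a hypothetical starting point, not a tuned one.
    """
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        # Matches tokens such as "8%" or "7.5%", a likely interest rate.
        "is_percent": bool(re.fullmatch(r"\d+(\.\d+)?%", tok)),
        "suffix3": tok[-3:].lower(),
    }
    # Context features: the neighbouring words, or a boundary marker.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

def vectorize(tokens):
    """Turn a token sequence into a list of feature dicts, one per token."""
    return [token_features(tokens, i) for i in range(len(tokens))]

# Invented example sentence with hand-assigned BIO labels (illustration only).
sentence = "The Note bears interest at 8% per annum payable to Acme Ventures LLC".split()
labels = ["O", "O", "O", "O", "O", "B-RATE", "O", "O", "O", "O",
          "B-INVESTOR", "I-INVESTOR", "I-INVESTOR"]
features = vectorize(sentence)
```

A list of feature dicts paired with a list of labels is the kind of input that CRF toolkits typically expect for training, though the exact format depends on the library you choose.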

The target audience is anyone who’s ever heard of Bayes’ Rule.
