Friday October 29 5:30 PM – Friday October 29 7:00 PM in Workshop/Tutorial I

What's in your data: Data Profiler - An Open Source Solution to Explain Your Data

Austin Walters, Jeremy Goodsitt

Prior knowledge:
No previous knowledge expected

Summary

Data understanding is crucial for most machine learning applications. As data scientists and engineers, we need to answer these questions for every project: Is our data secure? What is in our data? How do we monitor data properties over time? The DataProfiler, an open source project from Capital One, is a Python library designed to facilitate data analysis, monitoring and sensitive data detection.

Description

DataProfiler accepts a wide range of data formats, including CSV, Avro, Parquet, JSON, text, and pandas DataFrames. Whether the data is structured, semi-structured, or unstructured, the library identifies the schema, statistics, and entities in the data. In addition, DataProfiler provides a pre-trained deep learning model that identifies sensitive information (PII such as customer names, physical addresses, bank account numbers, and credit card numbers). This helps companies detect sensitive data across different data sources and formats. Because the data labeler is interchangeable, DataProfiler can be customized to help users learn what is in their data, and its models can be modified as needed. Running multiple models on the same dataset is easy, since choosing a preexisting data labeler to train and predict takes just a few lines of code.
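As a rough illustration of the "few lines of code" mentioned above, the sketch below profiles a small CSV file and runs the pretrained structured data labeler on it. The sample rows, file name, and helper function names are made up for this example; the `dp.Data`, `dp.Profiler`, `report`, and `dp.DataLabeler` calls follow the project README as best we understand it and may differ between versions.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample rows, used only for illustration.
ROWS = [
    {"name": "Ada Lovelace", "city": "London", "balance": "1200.50"},
    {"name": "Alan Turing", "city": "Wilmslow", "balance": "980.00"},
    {"name": "Grace Hopper", "city": "Arlington", "balance": "1530.25"},
]


def write_sample_csv(path):
    """Write the sample rows to a CSV file and return the path."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(ROWS[0]))
        writer.writeheader()
        writer.writerows(ROWS)
    return path


def profile_csv(path):
    """Profile a CSV file and return the report as a dict."""
    # Imported here so write_sample_csv works without the package installed.
    import dataprofiler as dp

    data = dp.Data(str(path))    # the file type is detected automatically
    profile = dp.Profiler(data)  # computes schema and per-column statistics
    return profile.report(report_options={"output_format": "compact"})


def label_csv(path):
    """Run the pretrained structured data labeler on a CSV file."""
    import dataprofiler as dp

    data = dp.Data(str(path))
    labeler = dp.DataLabeler(labeler_type="structured")
    return labeler.predict(data)


def main():
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = write_sample_csv(Path(tmp) / "sample.csv")
        try:
            report = profile_csv(csv_path)
        except ImportError:
            print("DataProfiler is not installed; try: pip install DataProfiler[ml]")
            return
        print(report["global_stats"])
        print(label_csv(csv_path))


if __name__ == "__main__":
    main()
```

The labeler step needs the optional machine learning dependencies, which is why the install hint uses the `[ml]` extra.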

We invite data scientists, machine learning engineers, and software engineers, from beginner to expert level, to learn how to extract data properties efficiently with the DataProfiler.

Outline

The tutorial is planned to last 90 minutes, with the following schedule:

  1. DataProfiler overview (10 minutes)
    • Basic usage of DataProfiler
    • Output reports on properties and statistics, and profiler options
  2. Data readers: reading and detecting schema from different data file types (10 minutes)
  3. Merge and update between profiles (10 minutes)
  4. Save and load profiles (5 minutes)
  5. Unstructured profiler (10 minutes)
  6. Break (5 minutes)
  7. Sensitive information detection with the data labeler component of DataProfiler
    • Getting started with the pretrained Data Labeler for sensitive data detection (10 minutes)
      - Structured data
      - Unstructured data
    • Retrain the Data Labeler with a new dataset (5 minutes)
    • Transfer-learn the Data Labeler with new labels (5 minutes)
    • Build a new model for Data Labeler (10 minutes)
      - Build a new character-level LSTM model that inherits the CNN model
      - Load the DataLabeler from the DataProfiler
      - Swap the existing CNN model with the new LSTM model
      - Train the data labeler pipeline on a given dataset
  8. Q & A (10 minutes)
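
Items 3 and 4 of the outline (merging, updating, saving, and loading profiles) can be sketched roughly as follows. The batch contents, file names, and helper function names are hypothetical; the `+` merge, `update_profile`, `save`, and `Profiler.load` calls reflect the project README as we understand it, and merging assumes both profiles share the same schema.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, rows):
    """Write a list of dict rows to a CSV file and return the path."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path


def merge_update_roundtrip(path_a, path_b, save_path):
    """Merge two profiles, update one in place, then save and reload."""
    # Imported here so write_csv works without the package installed.
    import dataprofiler as dp

    profile_a = dp.Profiler(dp.Data(str(path_a)))
    profile_b = dp.Profiler(dp.Data(str(path_b)))

    # Merge: combine two profiles built separately over the same schema.
    merged = profile_a + profile_b

    # Update: feed a new batch of data into an existing profile.
    profile_a.update_profile(dp.Data(str(path_b)))

    # Save the merged profile to disk and load it back.
    merged.save(filepath=str(save_path))
    return dp.Profiler.load(str(save_path))


def main():
    # Two made-up batches with identical columns.
    batch_a = [{"id": "1", "amount": "10.5"}, {"id": "2", "amount": "7.25"}]
    batch_b = [{"id": "3", "amount": "3.0"}, {"id": "4", "amount": "12.75"}]
    with tempfile.TemporaryDirectory() as tmp:
        path_a = write_csv(Path(tmp) / "batch_a.csv", batch_a)
        path_b = write_csv(Path(tmp) / "batch_b.csv", batch_b)
        try:
            loaded = merge_update_roundtrip(path_a, path_b, Path(tmp) / "merged.pkl")
        except ImportError:
            print("DataProfiler is not installed; try: pip install DataProfiler[ml]")
            return
        print(loaded.report(report_options={"output_format": "compact"})["global_stats"])


if __name__ == "__main__":
    main()
```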

Notes:

  • Users can run DataProfiler with a few lines of code and without much knowledge of machine learning, AI, or natural language processing. However, because DataProfiler is written in Python, users should be familiar with that language.

  • A laptop, desktop, or cloud machine (e.g., AWS EC2 instances) is sufficient for the tutorial. Users only need to clone the repo (https://github.com/capitalone/DataProfiler) and install the required packages therein. DataProfiler is fully functional on Mac and Linux.