Data understanding is crucial for most machine learning applications. As data scientists and engineers, we need to answer these questions for every project: Is our data secure? What is in our data? How do we monitor data properties over time? The DataProfiler, an open-source project from Capital One, is a Python library designed to facilitate data analysis, monitoring, and sensitive data detection.
DataProfiler was designed to accept a wide range of data formats, including csv, avro, parquet, json, text, and pandas DataFrames. Whether the data is structured, semi-structured, or unstructured, the library can identify the schema, statistics, and entities in the data. In addition, the DataProfiler provides a pre-trained deep learning model that efficiently identifies sensitive information (PII) such as customer names, physical addresses, bank account numbers, and credit card numbers. This helps companies detect sensitive data across different data sources and formats. Because the data labeler is interchangeable, DataProfiler can be customized to help users learn what is in their data, and the underlying models can be modified as needed. Running multiple models on the same dataset is easy, since choosing a preexisting data labeler to train and predict takes just a few lines of code.
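As a rough illustration of the workflow described above, the following sketch loads a small CSV file, profiles it, and runs the pre-trained structured data labeler on it. It assumes the `dataprofiler` package is installed with its machine learning dependencies. The helper names (`write_sample_csv`, `profile_file`, `label_file`) and the sample rows are hypothetical and exist only for this example, and the exact report contents may differ between library versions.

```python
import csv
import os
import tempfile


def write_sample_csv(path):
    """Write a tiny, made-up CSV file so the sketch has something to profile."""
    rows = [
        {"name": "Alice Example", "age": 34, "city": "Richmond"},
        {"name": "Bob Example", "age": 41, "city": "McLean"},
        {"name": "Carol Example", "age": 29, "city": "Plano"},
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def profile_file(path):
    """Profile a file and return a compact report (assumes dataprofiler is installed)."""
    import dataprofiler as dp  # imported lazily so the helpers above work without it

    data = dp.Data(path)          # file type (csv, json, parquet, ...) is auto-detected
    profile = dp.Profiler(data)   # computes schema, statistics, and entity labels
    return profile.report(report_options={"output_format": "compact"})


def label_file(path):
    """Run the pre-trained structured data labeler on a file (assumes the ML extras)."""
    import dataprofiler as dp

    data = dp.Data(path)
    labeler = dp.DataLabeler(labeler_type="structured")
    return labeler.predict(data)  # dictionary of predictions
```

To try it, write the sample with `write_sample_csv(os.path.join(tempfile.mkdtemp(), "sample.csv"))` and pass the returned path to `profile_file` and `label_file`. The imports are kept inside the functions so that the file-writing helper can be exercised on its own, without the library installed.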
We invite data scientists, machine learning engineers, and software engineers, from beginner to expert level, to learn how to extract data properties efficiently with the DataProfiler.
The tutorial is planned to last 90 minutes, with the following schedule:
Sensitive information detection with the data labeler component of DataProfiler
Q & A (10 minutes)
Users can run DataProfiler with a few lines of code, without much knowledge of machine learning, AI, or Natural Language Processing. However, because DataProfiler is written in Python, users should be familiar with that language.
A laptop, desktop, or cloud machine (e.g., AWS EC2 instances) is sufficient for the tutorial. Users only need to clone the repo (https://github.com/capitalone/DataProfiler) and install the required packages therein. DataProfiler is fully functional on Mac and Linux.