Automatic Pattern-Based Data Catalogs

Zachary Blackwood

Prior knowledge:
No previous knowledge expected

Summary

Do you have data files that follow fixed patterns, such as model versions, folders for different dates, or customer names? Introducing intake-pattern-catalog, a new Intake plugin that takes pattern-based files and creates an automatically updating API for accessing specific files, without your having to know the details of how they are stored. Learn how DTN is using this plugin.

Description

What is Intake?

Intake is a lightweight set of tools for loading and sharing data in data science projects.

Motivation for using Intake: creating a single catalog with a consistent API for accessing all the data available across DTN. The driver used to access the data (e.g. csv, pyarrow), the path to the data, and other metadata are stored in a YAML file or a series of YAML files. Instead of instructions like “look in this folder on <service x>, look for the file you want, then use <tool y> to download the data and then select <part z> of the data you want”, the instructions become “run catalog.category.dataset.read()”.
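
To illustrate the shape of that access pattern, here is a toy sketch using only the Python standard library. It is not Intake itself: the CsvSource class, the weather/stations names, and the sample file are all hypothetical stand-ins, chosen only to show how a catalog hides the driver and path behind a single read() call.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


class CsvSource:
    """Hypothetical stand-in for a catalog entry that knows its own path."""

    def __init__(self, path):
        self.path = path

    def read(self):
        # The caller never sees the path or the parsing details.
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))


# Write a small sample file so the example is self-contained.
csv_path = Path(tempfile.mkdtemp()) / "stations.csv"
csv_path.write_text("station,temp_c\nKMSP,21.5\nKDEN,18.0\n")

# A nested namespace plays the role of catalog.category.dataset.
catalog = SimpleNamespace(weather=SimpleNamespace(stations=CsvSource(csv_path)))

rows = catalog.weather.stations.read()
print(rows)
```

In a real Intake catalog, the driver, path, and metadata would come from the YAML file rather than from Python code.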

Intake supports plugins for accessing new file types and data locations, and quite a few have already been created.

Motivation for creating a new plugin: the desire to automatically generate a number of catalog entries based on a file pattern (e.g. s3://bucket/folder/{customer}/{date}/{version}.parquet). This capability did not exist in Intake or in any existing plugin.
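
The pattern syntax resembles a Python format string. Assuming that reading, the relationship between field values and a concrete path can be sketched with str.format. The strftime spec on the date field and the sample values are assumptions added for illustration; this shows the idea, not how the plugin is implemented.

```python
from datetime import datetime

# Pattern from the text, with an assumed strftime spec on the date field.
pattern = "s3://bucket/folder/{customer}/{date:%Y-%m-%d}/{version}.parquet"

# datetime objects honor strftime directives inside a format spec.
path = pattern.format(customer="acme", date=datetime(2021, 1, 1), version=23)
print(path)  # s3://bucket/folder/acme/2021-01-01/23.parquet
```

Each distinct combination of customer, date, and version corresponds to one file, and therefore to one catalog entry.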

Using the intake-pattern-catalog plugin:

  • Specifying a pattern and driver
  • Pattern can include strings and datetime objects, with standard strftime formatting (e.g. {date:%Y-%m-%d-%H})
  • The plugin looks in the local/remote folder and finds all the matching files, and creates an entry for each one
  • Entries can be accessed with catalog.get_entry(customer="acme", date=datetime(2021, 1, 1, 6), version=23)
  • All available entries can be viewed with list(catalog)
  • All the valid arguments to get_entry can be listed via catalog.get_entry_kwarg_sets()
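
The steps above can be sketched in plain Python to show the underlying idea: turn the pattern into a regular expression, match it against a listing of paths, and look entries up by field values. This is a minimal sketch, not the plugin's code. The helper names (compile_pattern, parse_path, find_entry) and the sample paths are hypothetical, only four strftime directives are handled, and non-datetime fields are kept as strings.

```python
import re
from datetime import datetime

PATTERN = "s3://bucket/folder/{customer}/{date:%Y-%m-%d-%H}/{version}.parquet"

# Regex fragments for the only strftime directives this sketch supports.
STRFTIME_REGEX = {"%Y": r"\d{4}", "%m": r"\d{2}", "%d": r"\d{2}", "%H": r"\d{2}"}

# Matches {name} or {name:format}.
FIELD = re.compile(r"\{(\w+)(?::([^}]+))?\}")


def compile_pattern(pattern):
    """Return a compiled regex and a map of field name -> strftime format."""
    regex, formats, pos = "", {}, 0
    for m in FIELD.finditer(pattern):
        regex += re.escape(pattern[pos:m.start()])
        name, fmt = m.group(1), m.group(2)
        if fmt:
            formats[name] = fmt
            body = re.escape(fmt)
            for directive, fragment in STRFTIME_REGEX.items():
                body = body.replace(re.escape(directive), fragment)
        else:
            body = r"[^/]+"  # plain string field: anything within one path segment
        regex += f"(?P<{name}>{body})"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    return re.compile(regex), formats


def parse_path(path, regex, formats):
    """Return the field values for a matching path, or None if it does not match."""
    m = regex.fullmatch(path)
    if m is None:
        return None
    values = m.groupdict()
    for name, fmt in formats.items():
        values[name] = datetime.strptime(values[name], fmt)
    return values


def find_entry(entries, **kwargs):
    """Return the single path whose parsed fields equal the keyword arguments."""
    matches = [path for path, values in entries.items() if values == kwargs]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


# Hypothetical listing; a real catalog would list the local or remote folder.
paths = [
    "s3://bucket/folder/acme/2021-01-01-06/23.parquet",
    "s3://bucket/folder/acme/2021-01-02-06/23.parquet",
    "s3://bucket/folder/globex/2021-01-01-06/7.parquet",
    "s3://bucket/folder/README.txt",  # does not match the pattern
]

regex, formats = compile_pattern(PATTERN)
entries = {}
for p in paths:
    values = parse_path(p, regex, formats)
    if values is not None:
        entries[p] = values

found = find_entry(entries, customer="acme", date=datetime(2021, 1, 1, 6), version="23")
print(found)
```

Here list(entries.values()) plays a role similar to listing the valid argument sets: it contains one dictionary of field values per matching file.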

Common uses at DTN:

  • Files with the customer name in the path
  • Cleaned data with a version of the cleaning in the file name
  • Predictions generated on a schedule and stored in a different folder for each date
  • Model versions