Do you have data files that follow fixed patterns, such as model versions, folders for different dates, or customer names? Introducing intake-pattern-catalog, a new Intake plugin that takes pattern-based files and creates an automatically updating API for accessing specific files, without you having to know the details of how they are stored. Learn about how DTN is using it.
What is Intake?
Intake is a lightweight set of tools for loading and sharing data in data science projects.
The motivation for using Intake is to create a single catalog with a consistent API for accessing all the data available across DTN. The driver used to access the data (e.g. csv, pyarrow), the path to the data, and other metadata are stored in a YAML file or a series of YAML files. Rather than the instructions for accessing the data being “look in this folder on <service x>, look for the file you want, then use <tool y> to download the data, and then select <part z> of the data you want”, the instructions become “Run catalog.category.dataset.read()”.
Intake supports plugins for accessing new file types and data locations, and quite a few have already been created.
The motivation for creating a new plugin was the desire to automatically generate a number of catalog entries based on a file pattern (e.g. s3://bucket/folder/{customer}/{date}/{version}.parquet). This capability did not exist in Intake or in any existing plugin.
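To make the idea concrete, here is a minimal, standard-library-only sketch of how a file pattern like the one above could be turned into one set of field values per matching file. It is not the plugin's implementation: pattern_to_regex, discover_entries, and the sample paths (including the customer name globex) are hypothetical and exist only for illustration.

```python
import re

PATTERN = "s3://bucket/folder/{customer}/{date}/{version}.parquet"


def pattern_to_regex(pattern):
    """Hypothetical helper: turn '{name}' placeholders into named regex groups."""
    # re.split with one capture group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def discover_entries(pattern, paths):
    """Hypothetical helper: return one dict of field values per matching path."""
    matcher = pattern_to_regex(pattern)
    return [m.groupdict() for p in paths if (m := matcher.match(p))]


# Made-up listing standing in for the contents of a bucket.
paths = [
    "s3://bucket/folder/acme/2021-01-01/23.parquet",
    "s3://bucket/folder/globex/2021-01-02/7.parquet",
    "s3://bucket/folder/README.txt",  # does not fit the pattern, so it is skipped
]

entries = discover_entries(PATTERN, paths)
for entry in entries:
    print(entry)
```

Each resulting dict plays the role of one catalog entry, and files that do not fit the pattern are ignored. In this sketch every extracted value is a plain string; no type conversion is attempted.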
Using the intake-pattern-catalog plugin:
{date:%Y-%m-%d-%H}
catalog.get_entry(customer="acme", date=datetime(2021,1,1,6), version=23)
list(catalog)
catalog.get_entry_kwarg_sets()
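As a rough illustration of what the date format specification and the keyword arguments above imply, the following standard-library-only sketch fills a pattern with the same values passed to get_entry. It does not call the plugin. Combining the {date:%Y-%m-%d-%H} field with the example S3 pattern is an assumption made here, and resolve_path is a hypothetical helper.

```python
from datetime import datetime

# Assumed pattern: the example S3 path with a formatted date field.
PATTERN = "s3://bucket/folder/{customer}/{date:%Y-%m-%d-%H}/{version}.parquet"


def resolve_path(pattern, **kwargs):
    """Hypothetical helper: fill the pattern's fields using Python's format syntax."""
    # For datetime values, str.format applies the spec after ':' via strftime.
    return pattern.format(**kwargs)


path = resolve_path(PATTERN, customer="acme", date=datetime(2021, 1, 1, 6), version=23)
print(path)  # s3://bucket/folder/acme/2021-01-01-06/23.parquet

# The same format string can parse the path segment back into a datetime.
parsed = datetime.strptime("2021-01-01-06", "%Y-%m-%d-%H")
print(parsed == datetime(2021, 1, 1, 6))  # True
```

The point of the format specification is that callers can work with real datetime objects while the storage layout keeps its own string convention.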
Common uses at DTN: