12-06, 17:30–18:00 (UTC), General Track
Intake is a Python library for describing, cataloging, finding and loading data. It has long had
the ethos of "load and get out of the way", which limited its scope but provided a lot of
convenience. However, complexity built up over the years, creating a barrier for new users
starting with Intake.
In this talk, I will present Intake 2, a complete rewrite of the package, featuring a much
simpler reader interface and the removal of many complex and unused features. This overhaul also
enabled the development of a general-purpose data-pipelining description, making Intake both
simpler and much more powerful.
There is a lot of data out there and all DS/AI/ML workflows are fundamentally data problems. Yet,
there is a huge heterogeneous range of places data can come from, each with their own internal
ontologies and processes. In fact, a surprisingly large number of scripts and notebooks still
have those typical lines of code to pluck bytes from some opaque URL with a complex set of
options, followed by more lines to massage the result into something ready for analysis. Such
code is annoying and fragile, and typically poorly distributed, documented or tested.
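As an illustration, here is a minimal sketch of the kind of ad-hoc loading cell meant here; the
URL, token and options are all hypothetical, but the pattern will be familiar:

    import io

    import pandas as pd
    import requests

    TOKEN = "..."  # credential pasted between notebooks

    # Pluck bytes from some opaque URL, with options nobody remembers choosing.
    resp = requests.get(
        "https://example.com/exports/latest.csv.gz",  # hypothetical URL
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

    # Massage the result into something ready for analysis.
    df = pd.read_csv(
        io.BytesIO(resp.content),
        compression="gzip",
        sep=";",
        parse_dates=["timestamp"],
        na_values=["-999"],
    )

None of this is documented, tested or reusable, which is exactly the gap Intake targets.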
Intake has been serving this use case for many years, moving the data-loading step into
declarative catalogs, so that it is clear, distributable and amenable to version control, and no
longer crowds the important analysis code of data practitioners (see the sketch after this
paragraph). A healthy ecosystem of data drivers and catalog ingesters has built up over time,
providing a consistent API over all data loading.
However, if a user ever needs to write or edit YAML, they will probably never convert to using
Intake. Intake is at its best when many datasets of many types are already available to a user,
for them to find what they need to get their work done.
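For readers new to Intake, this is roughly what the catalog workflow looks like in Intake 1;
intake.open_catalog and .read() are real API, while the file and entry names are hypothetical:

    import intake

    # The YAML catalog declares each dataset (driver, URL, load options),
    # so none of that boilerplate lives in the analysis code itself.
    cat = intake.open_catalog("datasets.yaml")  # hypothetical catalog file

    print(list(cat))           # discover the available datasets
    df = cat.trip_data.read()  # hypothetical entry name; loads a DataFrame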
Intake 2 attempts to keep the best parts of what made Intake great and build on them. Readers
are now far simpler, and there are helpers to let users find the correct reader for the data they
have in hand. There is a new convert-and-transform pipeline system, so that any process can be
applied to any data, with the compute shipped off to one of many possible engines (Dask, DuckDB,
Ray, Spark, etc.). We now support a massive and growing number of data formats and third-party
data-processing packages, and can recommend pathways when a user asks,
for instance, to "go from this URL to a matplotlib figure".
No previous knowledge expected
Staff Software Engineer at Anaconda, Inc. Creator of fastparquet, fsspec, intake and kerchunk.