Intake is a simple library providing a single interface for cataloging, describing and reading any kind of data. Catalogs give end-users an easy way to find data, locally, in a cloud service, or on an Intake server. Thus, Intake separates the definition of data sources from their use and analysis, so that Data Engineers and Data Scientists can get on with their respective jobs.
Defining and loading data-sets costs time and effort. Before going to the effort of loading a specific data-set and beginning to analyze it, the data scientist needs to know what data are available and the characteristics of each data-set. Furthermore, they might need to learn the API of some Python package specific to the target format. The code that does this data loading often makes up the first block of every notebook or script, propagated by copy and paste.
Intake has been designed as a simple layer over other Python libraries to provide:
Despite its simple design and relatively small code-base, Intake has many features. We will demonstrate the main ones and show typical work-flows from two points of view:
the end user (e.g., Data Scientist), who will be provided with an easy-to-use method for browsing data catalogs and getting a basic description of each entry. For each data-set, detailed metadata, data structure descriptions and quick-look plotting are available as one-liners, allowing for quick decisions on which data are appropriate for a given problem, after which Intake gets out of the way. The user need never know the details of the storage format of the data.
the catalog designer (e.g., Data Engineer), who wants to get on with deciding where data should be stored, in which format, and which package is best suited to load each data-set, provided that the data scientist, above, gets the data-frame (or other artefact) they need. Catalogs also provide a natural place to describe each data-set, with text and arbitrary metadata. Finally, catalogs can also encode user parameters, either to offer natural choices to the end user (e.g., to filter a data-set, or to choose between version A and B), or to obtain information required for data access from the user’s environment.
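The division between these two roles can be illustrated with a toy sketch in plain Python. This is deliberately not Intake's API: the `Entry` class, the `catalog` dictionary, the `_load_csv` loader and the in-memory data are all hypothetical names invented for illustration. The sketch only shows the idea that the designer decides how data are loaded and which parameters are offered, while the end user browses by name and reads.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# --- Catalog designer's side (all names here are hypothetical) ---

# Stand-in for real storage; an actual catalog would point at files or URLs.
_RAW_CSV = {
    "A": "city,temp\nOslo,3\nCairo,30\n",
    "B": "city,temp\nOslo,4\nCairo,31\n",
}


def _load_csv(version: str) -> List[dict]:
    """Format-specific loader; the end user never calls this directly."""
    return list(csv.DictReader(io.StringIO(_RAW_CSV[version])))


@dataclass
class Entry:
    description: str
    loader: Callable[..., List[dict]]
    # parameter name -> (default value, allowed values)
    parameters: Dict[str, tuple] = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def read(self, **user_args) -> List[dict]:
        """Apply defaults, validate the user's choices, then call the loader."""
        unknown = set(user_args) - set(self.parameters)
        if unknown:
            raise TypeError(f"unknown parameters: {sorted(unknown)}")
        args = {}
        for name, (default, allowed) in self.parameters.items():
            value = user_args.get(name, default)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} not in {allowed}")
            args[name] = value
        return self.loader(**args)


catalog = {
    "temperatures": Entry(
        description="City temperatures (toy data)",
        loader=_load_csv,
        parameters={"version": ("A", ("A", "B"))},
        metadata={"units": "degC"},
    )
}

# --- End user's side: browse, inspect, read ---

names = list(catalog)                  # which data-sets exist?
entry = catalog["temperatures"]
summary = (entry.description, entry.metadata)
rows_default = entry.read()            # uses the default version "A"
rows_b = entry.read(version="B")       # a choice offered by the catalog
```

In this sketch the designer could swap the CSV loader for any other format without the end user's three lines changing, which is the separation described above.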
Thus, Intake provides a very simple yet useful division between the users of data and the maintainers of data source catalogs. Intake has approachable code and is extensible in many places, and so, we hope, can grow into an all-inclusive data ecosystem for numerical Python.