Though Jupyter notebooks are a useful tool for working with datasets of all shapes and sizes, until now users have had to write their own custom code within a notebook in order to marshal and load large or remote datasets. Recently, significant progress has been made on tools in the JupyterLab UI that enable working interactively with such datasets.
The notebook server's `contents` API always eagerly creates a local copy of any loaded data, which makes it a poor fit for large or remote datasets. Several projects now work around the `contents` API:

- `jupyterlab-remote-data` (Ian R. Rose): handles remote data (via a remote content manager) within the notebook server itself.
- `@jupyterlab/hdf5` (my own contribution): bypasses the `contents` API by serving its data via its own custom server extension (built on `h5py`). Its UI is built on `@phosphor/datagrid`, which can be set up to fetch data only as needed. As a result, `@jupyterlab/hdf5` can be used to work with very large (TB) files in remote environments.
- `jupyterlab-data-explorer` (Saul Shanabrook, et al.): `@jupyterlab/hdf5` includes support for the `data-explorer`. As it is not tied to the `contents` API, `jupyterlab-data-explorer` is the leading candidate to replace the filebrowser: users can browse datasets in the `data-explorer` and then consume the data within notebook code.
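The on-demand access that makes `@jupyterlab/hdf5` viable for TB-scale files comes from how `h5py` itself works: opening a file and a dataset only creates handles, and data is read from disk only when a slice is requested. A minimal sketch of that pattern (illustrative only; the real extension serves such slices over HTTP rather than printing them):

```python
# h5py reads just the requested slice from disk, so the full dataset
# is never copied into memory -- the core idea behind serving only
# the tiles the datagrid asks for.
import h5py
import numpy as np

# Create a toy HDF5 file standing in for a very large dataset.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("grid", data=np.arange(10_000).reshape(100, 100))

with h5py.File("demo.h5", "r") as f:
    dset = f["grid"]          # no data read yet: just a handle
    tile = dset[10:12, 0:3]   # only this 2x3 tile is read from disk
    print(tile.tolist())      # [[1000, 1001, 1002], [1100, 1101, 1102]]
```

The same slicing works for datasets far larger than memory, since only the requested region is materialized.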
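To show how a custom server extension might map an incoming request onto an `h5py`-style slice, here is a hypothetical sketch. The query-parameter name (`ixstr`) and the helper (`parse_ixstr`) are my own illustrative inventions, not the actual `@jupyterlab/hdf5` API:

```python
# Hypothetical: translate a slice string from a request query
# (e.g. "?ixstr=0:10,5:8") into the tuple of Python slice objects
# a server extension would apply to a dataset before returning JSON.
from typing import Tuple


def parse_ixstr(ixstr: str) -> Tuple[slice, ...]:
    """Parse a comma-separated list of start:stop pairs into slices."""
    slices = []
    for part in ixstr.split(","):
        start, stop = (int(x) for x in part.split(":"))
        slices.append(slice(start, stop))
    return tuple(slices)


# Apply the parsed slices to a small in-memory stand-in for a dataset.
data = [[r * 10 + c for c in range(10)] for r in range(10)]
rows, cols = parse_ixstr("2:4,0:3")
tile = [row[cols] for row in data[rows]]
print(tile)  # [[20, 21, 22], [30, 31, 32]]
```

In the real extension, the handler would apply such slices to an open `h5py` dataset, so each HTTP request pulls only one tile of data off disk.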