Oceanography and climate science are experiencing rapid growth in both observational data and numerical model output. The tools and workflows that professional researchers and students currently use are not keeping pace with this growth. Recent additions to the Python data stack, such as xarray and dask, offer a way for scientists to work with these ever-growing datasets.
A historical challenge in oceanography has been the limited amount of observational data. The ocean is effectively opaque to electromagnetic radiation, and in-situ measurements have traditionally required very expensive ship-based campaigns. Recently, data richness has increased explosively as new initiatives such as cabled ocean observatories and autonomous sensor platforms are deployed. Programs such as the international Argo program, the Global Drifter Program, and the Ocean Observatories Initiative are producing unprecedented amounts of in-situ oceanographic data at very high resolution.
Multiplying the growth in observational data volume by a factor of 100 or more is the output produced by numerical ocean-ice-atmosphere analysis and prediction systems, run at supercomputing centres on thousands of cores concurrently. The volume of this output grows continually as computing capacity and ocean-ice model resolution increase. For example, global ocean circulation models are routinely run at one-quarter degree resolution, and configurations as fine as 1/30th of a degree are being proposed. Regional models are being produced at up to 1/32nd degree resolution over substantial geographic areas. Such simulations are typically run daily for short-period forecasts (2-90 days) and in planned long-term runs of decades or more (projections). Furthermore, there is a move toward ensemble runs, in which these already large ocean models are run 50 to 100 times with various stochastic perturbations to create an ensemble of ocean conditions.
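A back-of-envelope calculation illustrates the scale involved. The sketch below estimates the daily output volume of a global model at the resolutions mentioned above; the number of vertical levels, variables, and output frequency are illustrative assumptions, not figures from any particular model.

```python
# Rough estimate of daily output size for a global ocean model at
# 1/4 degree vs 1/30 degree horizontal resolution (illustrative numbers).

BYTES_PER_VALUE = 4          # single-precision float
N_LEVELS = 75                # assumed number of vertical levels
N_VARIABLES = 5              # e.g. temperature, salinity, u, v, sea level
N_OUTPUTS_PER_DAY = 24       # assumed hourly output

def daily_volume_gb(deg_resolution):
    """Approximate output volume (GB) for one model day at a given resolution."""
    nx = int(360 / deg_resolution)
    ny = int(180 / deg_resolution)
    n_values = nx * ny * N_LEVELS * N_VARIABLES * N_OUTPUTS_PER_DAY
    return n_values * BYTES_PER_VALUE / 1e9

for res in (1/4, 1/30):
    print(f"1/{int(1/res)} degree: ~{daily_volume_gb(res):,.0f} GB per model day")
```

Even with these modest assumptions, a single model day at 1/4 degree runs to tens of gigabytes, and a 1/30 degree configuration reaches the terabyte range, so multi-year runs quickly outgrow a workstation.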
Accurately and effectively finding meaning in this volume of data poses a substantial challenge. The traditional approach has been to run a model on a supercomputer and then download the output, along with observational data, to a local workstation for further analysis, processing, and synthesis into journal papers and plots that increase our understanding. The challenge is that the size of the model output can exceed the physical resources (memory, computing power, and storage) of the workstation, and Internet transfer of large datasets is prohibitively slow. To keep data volumes within local workstation limits, the researcher often aggregates the data into spatial or temporal means, or re-grids it to a coarser numerical grid. The former is a problem if the researcher is studying extreme events, the latter if she is interested in detailed ocean conditions in a specific area. Full-resolution output provides maximum accuracy and the greatest flexibility for downstream analysis.
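The kind of data reduction described above is straightforward to express with xarray; the sketch below shows a temporal mean and a spatial coarsening step. The file name, variable name, and dimension names are hypothetical, and the specific reductions are only examples of the information-losing steps a researcher is forced into.

```python
import xarray as xr

# Hypothetical output file and variable; the reduction steps are the point.
ds = xr.open_dataset("ocean_model_output.nc")

# Temporal mean: collapses the time dimension, discarding extreme events.
time_mean = ds["sst"].mean("time")

# Spatial coarsening: average 4x4 blocks of grid cells, discarding local detail.
coarse = ds["sst"].coarsen(lat=4, lon=4, boundary="trim").mean()

time_mean.to_netcdf("sst_time_mean.nc")
coarse.to_netcdf("sst_coarse.nc")
```

Both reductions shrink the data enough to move and store locally, but only by throwing away exactly the detail that extreme-event or regional studies need.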
Problems introduced by large-scale data are not unique to oceanography, and various approaches exist to manage them. However, these approaches are often not compatible with existing workflows or with the data training scientists receive. A related challenge is the reproducibility of data-intensive computations: the fragmentation of software tools and environments renders most atmospheric, ocean, and climate research effectively unreproducible and prone to failure. In this talk, I will discuss recent progress in closing this technology gap to enable scientists to work with ever-increasing dataset sizes, along with some ideas for making scientific workflows more reproducible. As a specific example, I will present techniques for performing offline diagnostics of ocean models that are currently limited by insufficient memory or disk space. The Python modules xarray (N-D labelled arrays and datasets) and dask (a parallel computing library) will be discussed as tools for building scalability into oceanographic analysis.
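As a flavour of the approach, the sketch below shows the out-of-core pattern that xarray and dask enable: a multi-file dataset is opened lazily in chunks, an analysis is expressed as a task graph, and the computation streams through the chunks rather than loading everything into memory. The file pattern, chunk size, and variable name are illustrative assumptions.

```python
import xarray as xr

# Open many model output files as one lazy, dask-backed dataset.
ds = xr.open_mfdataset(
    "model_output/ocean_*.nc",
    chunks={"time": 10},        # dask splits each variable into ~10-timestep blocks
    combine="by_coords",
    parallel=True,
)

# Operations build a lazy dask task graph; no data is loaded yet.
anomaly = ds["temperature"] - ds["temperature"].mean("time")

# Writing the result triggers computation chunk by chunk, so the full
# dataset never needs to fit in workstation memory.
anomaly.to_netcdf("temperature_anomaly.nc")
```

Because the same code runs unchanged on a laptop or, with a dask distributed scheduler, across many cores of an analysis server, it offers one route to diagnostics that would otherwise be blocked by memory or disk limits.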