Monday 2:45 PM–3:25 PM in Track 1

Accelerating and Scaling GeoPandas with Cython and Dask

Matthew Rocklin

Audience level:
Intermediate

Description

GeoPandas is the standard analytical tool for tabular geospatial data in Python. It is well-loved but known to be slow. This talk describes recent work to accelerate GeoPandas with Cython and Dask to make it one of the fastest and most scalable geospatial libraries in existence.

Abstract

Geospatial data is used in city planning, real estate, agriculture, and any other field in which location has an impact. GeoPandas is the standard analytical tool for tabular geospatial data in Python. It has a well-loved API and integrates cleanly with the rest of the geospatial ecosystem, but it can be very slow.

This talk describes two recent modifications to GeoPandas that both accelerate its performance and enable it to handle very large datasets.

  1. We Cythonize the core of GeoPandas, bringing it down to C-level speeds
  2. We use Dask to parallelize GeoPandas, allowing it to use both multi-core processors and distributed memory clusters
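
The partition-and-combine pattern behind the second item can be sketched in a few lines. This is an illustration only, written with the Python standard library rather than the Dask or GeoPandas APIs; the function names (`count_in_box`, `partition`, `parallel_count_in_box`) and the bounding-box query are hypothetical and do not come from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_box(points, box):
    """Count (x, y) points that fall inside box = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return sum(1 for x, y in points if xmin <= x <= xmax and ymin <= y <= ymax)


def partition(items, n_parts):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    if not items:
        return []
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_count_in_box(points, box, n_parts=4):
    """Run the same query on each chunk, then combine the partial results."""
    chunks = partition(points, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = pool.map(lambda chunk: count_in_box(chunk, box), chunks)
        return sum(partial_counts)


# Hypothetical data: a 10 x 10 grid of points spaced 0.5 apart.
points = [(x * 0.5, y * 0.5) for x in range(10) for y in range(10)]
box = (1.0, 1.0, 3.0, 3.0)
total = parallel_count_in_box(points, box)
```

The per-chunk function here is pure Python, so the threads show the structure of the computation rather than a real speedup. The sketch assumes the query can be answered independently on each chunk and the partial results merged afterwards.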

This talk will include a brief overview of geospatial data and the GeoPandas project using examples from open datasets. It will then describe the use of Cython and Dask to accelerate and scale the project to handle larger datasets more quickly. We will end with benchmarking information and plans for the future.