Thursday 2:50 PM–3:30 PM in Room 2

A pythonista’s pipeline for large-scale geospatial analytics

Alice Broadhead

Audience level:
Intermediate

Description

We created an automated framework for analysis of large-scale geospatial data using Spark, Impala, and Python. We use this framework to join billions of daily device locations from mobile phone users to millions of points of interest. We discuss our project structure and workflow. Our work has broader application to the movement of populations through time and space.

Abstract

Large-scale geospatial data is a common analytical challenge. We receive billions of user location data points per day, which we must filter and map to millions of points of interest. Our end use case is to provide clients with information about the attributes of users who visit their stores. The workflow, however, applies to the broader analytical problem of mapping geospatially located entities of interest to points of interest with known locations. We also present our approach to the frequently encountered need for both standardized and flexible reporting: we have automated the standard analyses in our workflow, and we created an API that lets analysts quickly develop ad-hoc analyses based on client requests.
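To make the standardized-versus-flexible reporting idea concrete, here is a minimal sketch of the pattern. The package layout, the function names (`visits_for_poi`, `standard_visit_report`), and the stubbed data are hypothetical illustrations, not our actual API:

```python
# Hypothetical analyst-facing package; names and data are illustrative only.
import pandas as pd


def visits_for_poi(poi_id: str, start: str, end: str) -> pd.DataFrame:
    """Return one row per device visit to a point of interest.

    In a real pipeline this would query the tables produced by the batch
    jobs; here it returns a small stub so the example is self-contained.
    """
    return pd.DataFrame(
        {"device_id": ["a", "b", "b"], "distance_km": [2.1, 5.4, 4.9]}
    )


def standard_visit_report(poi_id: str, start: str, end: str) -> pd.DataFrame:
    """Automated, standardized report: summary statistics for one POI."""
    visits = visits_for_poi(poi_id, start, end)
    return visits["distance_km"].describe().to_frame(name=poi_id)


if __name__ == "__main__":
    # Standardized report, produced automatically.
    print(standard_visit_report("store_123", "2017-01-01", "2017-01-31"))

    # Ad-hoc analysis: an analyst reuses the same primitive to answer a
    # one-off client question (share of visitors coming from > 5 km away).
    visits = visits_for_poi("store_123", "2017-01-01", "2017-01-31")
    print(f"Share of visits from more than 5 km away: {(visits['distance_km'] > 5).mean():.0%}")
```

The point of the pattern is that the automated reports and the analysts' ad-hoc work call the same primitives, so a custom client request does not require touching the production pipeline.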

Freely available tools let the Python user complete every task in our data analysis pipeline. At a high level, the architecture of our process is as follows: HDFS storage for the large-scale geospatial data, Spark for geospatial joins, Cloudera Impala accessed via Ibis to query the resulting datasets, and scientific Python for conducting analyses. We make the automated analyses available to end users via a web portal built with Flask and Celery. We expose an API to analysts as a Python package so that they can quickly perform the custom analyses clients request, and we used Sphinx to document the API and make it easier to use.
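To illustrate the Spark stage, here is a minimal sketch of a bucketed geospatial join: device pings and points of interest are snapped to a coarse grid cell (a geohash prefix would serve the same purpose), joined on the cell, and then filtered by haversine distance. The column names, the 50-meter visit radius, and the grid precision are illustrative assumptions, not the parameters of our production jobs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("poi-join-sketch").getOrCreate()

# Toy stand-ins for the HDFS-backed tables of device pings and POIs.
pings = spark.createDataFrame(
    [("dev1", 40.7128, -74.0060), ("dev2", 40.7130, -74.0058)],
    ["device_id", "lat", "lon"],
)
pois = spark.createDataFrame(
    [("store_123", 40.7129, -74.0059)],
    ["poi_id", "poi_lat", "poi_lon"],
)


def cell(lat, lon):
    # Coarse spatial bucket (roughly 1 km at two decimal places) so the join
    # is not a full cross product; a geohash prefix plays the same role.
    return F.concat_ws("_", F.round(lat, 2).cast("string"),
                       F.round(lon, 2).cast("string"))


def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres, built from column expressions.
    dlat = F.radians(lat2 - lat1)
    dlon = F.radians(lon2 - lon1)
    a = (F.sin(dlat / 2) ** 2
         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.sin(dlon / 2) ** 2)
    return 2 * 6371000 * F.asin(F.sqrt(a))


pings = pings.withColumn("cell", cell(F.col("lat"), F.col("lon")))
pois = pois.withColumn("cell", cell(F.col("poi_lat"), F.col("poi_lon")))

visits = (
    pings.join(pois, "cell")
    .withColumn("dist_m", haversine_m(F.col("lat"), F.col("lon"),
                                      F.col("poi_lat"), F.col("poi_lon")))
    .filter(F.col("dist_m") < 50)  # assumed visit radius
)
visits.show()
```

In the full pipeline, visit tables like this are what Ibis queries against Impala, and scientific Python picks up from the query results.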

In this talk we describe our architecture and workflow to accomplish the following analytical tasks related to user location:

1.) Estimation of the user’s approximate home location (within a few miles)
2.) Estimation of approximate demographic characteristics of visitors to a location
3.) Distribution of distances traveled to reach a specific location
4.) Distribution of visit propensities over time
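As an illustration of the first task, one simple heuristic, shown here as an assumption rather than our exact method, is to keep each device's overnight pings and take its most frequent coarse grid cell as the approximate home location:

```python
import pandas as pd

# Toy ping data: device id, timestamp, coordinates.
pings = pd.DataFrame(
    {
        "device_id": ["d1", "d1", "d1", "d2", "d2"],
        "ts": pd.to_datetime(
            ["2017-03-01 02:10", "2017-03-01 23:40", "2017-03-02 13:00",
             "2017-03-01 03:05", "2017-03-02 01:30"]
        ),
        "lat": [40.71, 40.71, 40.75, 34.05, 34.05],
        "lon": [-74.01, -74.01, -73.99, -118.24, -118.25],
    }
)

# Keep overnight pings (10 PM-6 AM), when most users are likely at home.
night = pings[(pings["ts"].dt.hour >= 22) | (pings["ts"].dt.hour < 6)].copy()

# Snap to a coarse grid (two decimal places, roughly a mile) and take the
# modal cell per device as the approximate home location.
night["cell"] = list(zip(night["lat"].round(2), night["lon"].round(2)))
approx_home = (
    night.groupby("device_id")["cell"]
    .agg(lambda cells: cells.mode().iloc[0])
    .rename("approx_home_cell")
)
print(approx_home)
```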

We also describe an analytic technique that uses propensity-based test and control groups to statistically determine the impact of a treatment on users’ geographical behavior patterns.
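A generic sketch of this kind of propensity-based comparison, using synthetic data and scikit-learn rather than our production code: model each user's probability of exposure from pre-treatment covariates, match exposed users to unexposed users with similar propensity scores, and compare post-period behavior between the matched groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000

# Synthetic pre-treatment covariates (e.g. baseline visit rate, distance to store).
X = rng.normal(size=(n, 2))
# Exposure ("treatment", e.g. saw a campaign) depends on the covariates.
exposed = rng.random(n) < 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
# Post-period visits depend on the covariates plus a true treatment effect of +0.3.
visits = 1.0 * X[:, 0] + 0.3 * exposed + rng.normal(scale=0.5, size=n)

# 1. Propensity scores: probability of exposure given covariates.
propensity = LogisticRegression().fit(X, exposed).predict_proba(X)[:, 1]

# 2. Match each exposed user to the unexposed user with the nearest score.
treated = propensity[exposed].reshape(-1, 1)
control = propensity[~exposed].reshape(-1, 1)
match_idx = (
    NearestNeighbors(n_neighbors=1).fit(control)
    .kneighbors(treated, return_distance=False).ravel()
)

# 3. Compare outcomes between the matched test and control groups.
naive = visits[exposed].mean() - visits[~exposed].mean()
matched = visits[exposed].mean() - visits[~exposed][match_idx].mean()
print(f"naive difference: {naive:.2f}, matched estimate: {matched:.2f} (true effect 0.3)")
```

On this synthetic data the matched estimate should land much closer to the simulated treatment effect than the naive difference in means, which is confounded by the covariates that drive both exposure and visits.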

Python is the only language with tools readily available to efficiently complete this pipeline end-to-end.