The democratization of GPS-enabled devices has led to a surge of interest in high-quality geocoded datasets. This data poses both opportunities and challenges for the study of social behavior. The goal of this tutorial is to introduce attendees to the state of the art in mining and analyzing spatial data, with a special focus on real-world applications.
In this tutorial we will provide an overview of workflows for location-rich data, from data collection to analysis and visualization, using Python tools. In particular:
Introduction to location-rich data: In this part, attendees will be given an overview of location-based technologies, datasets, applications and services.
Online Data Collection: A brief introduction to the APIs of Twitter, Foursquare, Uber and Airbnb using Python (urllib2, requests, BeautifulSoup). The focus will be on highlighting their similarities and differences and how they provide different perspectives on user behavior and urban activity. Special reference will be made to the availability of open datasets, a notable example being the NYC Yellow Taxi dataset (NYC Taxi).
Data Analysis and Measurement: With data collected through the APIs listed above, we will perform several simple analyses to illustrate not only different techniques and libraries (geopy, shapely, data science toolkit, etc.) but also the different kinds of insights that this kind of data makes possible, particularly in the study of population demographics, human mobility, urban activity, neighborhood modeling and spatial economics.
Applied Data Mining and Machine Learning: In this part of the tutorial we will focus on using the datasets collected in the previous part to solve interesting real-world problems. After a brief introduction to Python’s machine learning library, scikit-learn, we will formulate three optimization problems: i) predict the best area in New York City for opening a Starbucks using Foursquare check-in data, ii) predict the price of an Airbnb listing, and iii) predict the average Uber surge multiplier of an area in New York City.
Visualization: Finally, we will introduce some simple techniques for mapping location data and placing it in a geographical context using matplotlib Basemap and py.processing.
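As a rough illustration of the measurement step outlined above, two primitives come up in almost every analysis of geocoded data: the distance between two coordinates, and the assignment of points to areas so that activity can be counted per area. In practice, libraries such as geopy and shapely cover these tasks. The sketch below uses only the Python standard library, so the function names, the 0.01-degree grid cell size, and the sample coordinates (approximate Manhattan locations) are illustrative assumptions and not part of the tutorial material.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical-Earth assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def grid_cell(lat, lon, cell_deg=0.01):
    """Index of the square lat/lon grid cell containing the point.

    cell_deg is a hypothetical cell size; a real study would more likely
    use neighborhood or census polygons instead of a regular grid.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def count_per_cell(points, cell_deg=0.01):
    """Count how many (lat, lon) points fall into each grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)


# Hypothetical check-in coordinates, for illustration only.
checkins = [
    (40.7580, -73.9855),
    (40.7583, -73.9851),
    (40.7484, -73.9857),
]

counts = count_per_cell(checkins)
span_km = haversine_km(*checkins[0], *checkins[2])
```

Per-cell counts of this kind are one simple way to turn raw check-ins into area-level features, which is the form the area-level prediction problems listed above would need as input. A regular grid is chosen here only for brevity.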