PyData DC 2016 | Presentation: Forecasting critical food violations at restaurants using open data

Saturday 2:15 PM–3:00 PM in Room #1025 (1st Floor)

Forecasting critical food violations at restaurants using open data

Nicole Donnelly

Audience level: Novice

Description

As many as 105 million Americans suffer from foodborne illness annually. In 2014, the City of Chicago began forecasting critical violations so that it could direct limited health inspection resources toward the most likely sites, and found critical violations at food establishments about 7 days sooner as a result. This talk provides an end-to-end walkthrough of predicting critical violations in Washington, DC using Python.

Abstract

Detailed Abstract

In 2014, data scientists at the Department of Innovation and Technology for the City of Chicago built an algorithm to predict likely health code violations at restaurants from publicly available data, in an attempt to reduce foodborne illness. They released it as a free, open-source project written in R and hosted on GitHub. However, in spite of the prevalence of foodborne illness and its associated costs (as much as $2–$4 billion annually¹), so far only one other location in the country has taken advantage of Chicago's work to implement this model.² That place is Montgomery County, MD, which, with the assistance of Open Data Nation, is successfully adapting the model to the local environment.

This talk provides an end-to-end demonstration of how to replicate the process using Python and open data from Washington, DC. The content is targeted toward the novice data scientist and will discuss the practical aspects of planning and executing the project. Learn how you can combine Python libraries like Requests, BeautifulSoup, SQLite, NumPy, and Scikit-Learn to build your own machine learning model to predict health code violations!
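
To give a feel for the ingestion and storage step, here is a minimal sketch, not the speaker's code. It assumes inspection results arrive as a simple HTML table, and it swaps in the standard library's html.parser and sqlite3 for Requests and BeautifulSoup so that it runs without extra installs. The sample HTML, the table layout, and the names RowParser and load_inspections are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a scraped inspection listing.
SAMPLE_HTML = """
<table>
  <tr><td>Cafe Alpha</td><td>2016-03-01</td><td>2</td></tr>
  <tr><td>Bistro Beta</td><td>2016-03-04</td><td>0</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def load_inspections(html, conn):
    """Parse the table and store one row per inspection; return the row count."""
    parser = RowParser()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inspections "
        "(name TEXT, inspected_on TEXT, critical_violations INTEGER)"
    )
    conn.executemany(
        "INSERT INTO inspections VALUES (?, ?, ?)",
        [(name, day, int(count)) for name, day, count in parser.rows],
    )
    conn.commit()
    return len(parser.rows)


conn = sqlite3.connect(":memory:")
n_loaded = load_inspections(SAMPLE_HTML, conn)
flagged = conn.execute(
    "SELECT name FROM inspections WHERE critical_violations > 0"
).fetchall()
```

In a real project the HTML would come from an HTTP request and the database would live on disk; the in-memory database here only keeps the sketch self-contained.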

Outline

Introduction/Problem statement

Foodborne illness outbreaks affect millions of people annually
DC has 22 inspectors responsible for ~5500 food establishments in addition to other facilities
Prioritization of routine inspections handled by risk category

Data Science Pipeline

Data Ingestion - Identifying sources and building a dataset for modeling from scraped, API, and open city data
Data Munging and Wrangling - Getting the data to work together
Computation and Analyses - Do you have what you think you have? Feature engineering
Modeling and Application - Algorithm identification, building a pipeline
Reporting and Visualization - What does the data say?
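
As an illustration of the feature engineering stage in the pipeline above, the following sketch turns an inspection history into one training example per inspection, using only information available before that inspection. The records, the feature names, and the build_examples helper are hypothetical; the talk does not specify which features were used.

```python
from datetime import date

# Hypothetical history: (establishment, inspection date, critical violations found)
HISTORY = [
    ("Cafe Alpha", date(2015, 6, 1), 1),
    ("Cafe Alpha", date(2016, 1, 10), 0),
    ("Cafe Alpha", date(2016, 3, 1), 2),
    ("Bistro Beta", date(2016, 3, 4), 0),
]


def build_examples(history):
    """Return (name, date, features, label) tuples in chronological order.

    Features describe only earlier inspections of the same establishment,
    so the label (any critical violation found) never leaks into them.
    """
    examples = []
    seen = {}  # establishment -> list of (date, critical count) already processed
    for name, day, critical in sorted(history, key=lambda record: record[1]):
        prior = seen.setdefault(name, [])
        features = {
            "prior_inspections": len(prior),
            "prior_critical": sum(count for _, count in prior),
            # None marks a first inspection with no earlier visit to measure from.
            "days_since_last": (day - prior[-1][0]).days if prior else None,
        }
        label = int(critical > 0)
        examples.append((name, day, features, label))
        prior.append((day, critical))
    return examples


examples = build_examples(HISTORY)
```

Processing records in date order and updating the history only after each example is built is one simple way to keep future information out of the features.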

Lessons learned

Problem identification
When the data you thought was available is not
Class imbalance in existing data
Domain expertise
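
One common way to handle the class imbalance mentioned above is to weight the rarer class more heavily during training. The sketch below computes weights with the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for class_weight="balanced". The label counts are hypothetical, and this is not necessarily the approach taken in the talk.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 9 inspections without a critical violation, 1 with.
labels = [0] * 9 + [1] * 1
weights = balanced_class_weights(labels)
```

With these labels the rare positive class receives a much larger weight than the common negative class, so each missed violation costs the model more than a false alarm.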