Saturday 2:15 PM–3:00 PM in Room #1025 (1st Floor)

Forecasting critical food violations at restaurants using open data

Nicole Donnelly


As many as 105 million Americans suffer from foodborne illness annually. In 2014, the City of Chicago began forecasting these outbreaks, targeting its limited health inspection resources toward the most likely sites and achieving a seven-day improvement in locating critical violations at food establishments. This talk provides an end-to-end walkthrough of predicting critical violations in Washington, DC using Python.


Detailed Abstract

In 2014, data scientists at the City of Chicago's Department of Innovation and Technology built an algorithm to predict likely health code violations at restaurants from publicly available data, in an attempt to reduce foodborne illness. They released the work as a freely available open source project, written in R and hosted on GitHub. However, despite the prevalence of foodborne illness and its associated costs (as much as $2–$4 billion annually [1]), so far only one other jurisdiction in the country has taken advantage of Chicago's work to implement this model [2]: Montgomery County, MD, which, with the assistance of Open Data Nation, is successfully adapting the model to its local environment.

This talk provides an end-to-end demonstration of how to replicate that process using Python and open data from Washington, DC. The content is targeted toward the novice data scientist and covers the practical aspects of planning and executing the project. Learn how to combine Python libraries like Requests, BeautifulSoup, SQLite, NumPy, and scikit-learn to build your own machine learning model to predict health code violations!
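To make the library combination concrete, here is a minimal sketch of the scraping step with BeautifulSoup. The table structure, CSS class, and `parse_inspections` helper are hypothetical placeholders; a real scraper would fetch the DC portal's actual markup with `requests.get(url).text` and adapt the selectors to it.

```python
from bs4 import BeautifulSoup


def parse_inspections(html):
    """Extract (establishment, result) pairs from an HTML table of inspections."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.inspections tr")
    records = []
    for row in rows[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            records.append((cells[0], cells[1]))
    return records


# Demonstrate on a small inline sample rather than a live request.
sample = """
<table class="inspections">
  <tr><th>Establishment</th><th>Result</th></tr>
  <tr><td>Cafe A</td><td>Critical violation</td></tr>
  <tr><td>Diner B</td><td>Pass</td></tr>
</table>
"""
print(parse_inspections(sample))
```

Parsed records like these can then be written to SQLite to build the modeling dataset.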


Introduction/Problem statement

  • Foodborne illness outbreaks affect millions of people annually
  • DC has 22 inspectors responsible for ~5,500 food establishments, in addition to other facilities
  • Routine inspections are currently prioritized by risk category

Data Science Pipeline

  • Data Ingestion - Identifying scraped, API, and open city data sources to build a modeling dataset
  • Data Munging and Wrangling - Getting disparate datasets to work together
  • Computation and Analyses - Do you have what you think you have? Feature engineering
  • Modeling and Application - Choosing an algorithm and building a pipeline
  • Reporting and Visualization - What does the data say?
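The modeling step above can be sketched with a scikit-learn `Pipeline` that chains feature scaling and a classifier. The three feature columns and the synthetic data are illustrative assumptions, not the talk's actual dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical engineered features: prior violations, days since last
# inspection, risk category (all scaled to [0, 1] for this sketch).
X = rng.random((200, 3))
# Synthetic binary target: 1 = critical violation found.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.2, 200) > 0.9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and the estimator so they are fit together.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))
```

Bundling preprocessing into the pipeline keeps the scaler from leaking test-set statistics into training and makes the whole model one object to cross-validate or serialize.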

Lessons learned

  • Problem identification
  • When the data you thought was available is not
  • Class imbalance in existing data
  • Domain expertise
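The class-imbalance lesson above has a common first remedy: weight classes inversely to their frequency. A minimal sketch with scikit-learn, using synthetic imbalanced data (about 10% positives) as an assumption; the talk's real violation rates will differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(1)
X = rng.random((300, 2))
# Only ~10% of establishments have a critical violation: imbalanced target.
y = (rng.random(300) < 0.1).astype(int)

# "balanced" weights each class by n_samples / (n_classes * class_count).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 2))))

# class_weight="balanced" applies the same correction inside the model,
# so the rare violation class is not drowned out during fitting.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Resampling strategies (e.g. oversampling the minority class) are an alternative when reweighting alone is not enough.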