Thursday October 28 7:00 PM – Thursday October 28 7:30 PM in Talks II

Data infrastructure at the COVID Tracking Project

Julia Kodysh

Prior knowledge:
No previous knowledge expected

Summary

The COVID Tracking Project started in March 2020 as a volunteer effort to collect and publish crucial data related to the COVID-19 outbreak in the United States. As the organization grew, everything had to scale at once: the number of volunteers, our processes, and our tooling. This talk will go into some of the data problems, technical details and human element behind the project.

Description

Intended audience: This talk is for any attendee who is interested in the story of the COVID Tracking Project. It will be a largely technical talk, describing several pieces of the open-source data infrastructure we built to support the data collection effort and publish the data. That said, the talk will also go into the human story of how the project evolved.

I’m going to go into these topics:

- Why we didn’t fully automate data collection: lack of consistency or standards in data reporting. (But we did build a website screenshot system specific to our data sources, and I'll mention this as well.)
- The data pipeline that we built to support data collection instead: spreadsheets for data entry, eventual introduction of a PostgreSQL database to house current and historical data, internal Flask API with SQLAlchemy to preserve correctness on reads/writes of structured data, public-facing website and API. I’ll also talk about the evolution of our systems as the project changed over time.
- Our database and data model design.
- The public APIs we built, and our evolution from … a public Google sheet (v0?) to a flat CSV API (v1) to structured JSON benefiting from our evolving understanding of our data (v2).
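As a rough illustration of the step from a flat CSV API to structured JSON, the sketch below groups the columns of a flat row into nested sections. The column names, the nesting, and the sample values are all hypothetical placeholders chosen for illustration; they are not the project's actual schema or data.

```python
import csv
import io
import json

# Hypothetical flat, v1-style CSV export: one row per state per day.
# Column names and values are placeholders, not real project data.
FLAT_CSV = """date,state,positive,negative,hospitalized_currently
2020-10-28,AA,10,90,3
2020-10-28,BB,20,,5
"""


def to_int(value):
    """Return an int, or None when the source left the cell blank."""
    value = value.strip()
    return int(value) if value else None


def flat_row_to_structured(row):
    """Group the flat columns of one CSV row into a nested record."""
    return {
        "date": row["date"],
        "state": row["state"],
        "tests": {
            "positive": to_int(row["positive"]),
            "negative": to_int(row["negative"]),
        },
        "hospitalization": {
            "currently": to_int(row["hospitalized_currently"]),
        },
    }


def convert(flat_csv_text):
    """Parse flat CSV text and return a list of structured records."""
    reader = csv.DictReader(io.StringIO(flat_csv_text))
    return [flat_row_to_structured(row) for row in reader]


if __name__ == "__main__":
    print(json.dumps(convert(FLAT_CSV), indent=2))
```

One design point the sketch tries to show: a blank cell becomes an explicit `None` (JSON `null`) rather than `0`, so a value that was never reported stays distinguishable from a reported zero.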

The COVID Tracking Project became an incredible effort, eventually coming to be cited in over 1000 academic papers and over 7700 news stories. Some data-related takeaways we have learned:

- Stay connected to the data. The data is meaningless without context around it.
- Think about data storage and representation early. Early decisions will strongly influence how processes and workflow evolve later on.
- Be flexible. In a project with many unknowns, the ability to build the plane while flying it was paramount.
- Do not automate things too quickly. Get to know the subject matter as much as possible.