The value of individual-level data for research must be balanced against the privacy concerns of individuals. I discuss an approach for generating synthetic data sets that preserve relevant statistical properties of input data while also providing privacy guarantees for individuals in the data.
Individual-level data is valuable to researchers because it supports arbitrary analyses. However, that value must be balanced against the privacy concerns of the individuals who appear in the data. In this talk, I discuss an approach for generating synthetic data sets that preserve relevant statistical properties of the input data while also providing privacy guarantees for the individuals who appear in it. I offer a PyTorch implementation of a differentially private Wasserstein generative adversarial network (DPWGAN), a method that provides mathematical guarantees of privacy. To illustrate how the method works, I apply the DPWGAN to ACS PUMS data, a collection of individual-level records from the U.S. Census Bureau. I show that the DPWGAN models correlations in the data and that cross tabs computed on the synthetic data are close to those computed on the original data. The PyTorch code is open source. Participants will come away with an understanding of the basics of differential privacy and generative adversarial networks, as well as the ability to apply the DPWGAN code to their own data sets. Slides for the talk can be found at bit.ly/DPWGAN.
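As a rough illustration of the cross-tab comparison mentioned above, the sketch below computes the share of records in each cell of a two-way table for a real and a synthetic data set, then reports the largest gap between them. It is not taken from the talk's code: the function names, the column names, and the toy records are all hypothetical, and the maximum absolute gap is just one plausible way to measure closeness.

```python
from collections import Counter


def crosstab_proportions(rows, col_a, col_b):
    """Return {(a, b): share of rows} for two categorical columns."""
    counts = Counter((row[col_a], row[col_b]) for row in rows)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def max_crosstab_gap(real_rows, synth_rows, col_a, col_b):
    """Largest absolute difference in cell shares between two data sets."""
    real = crosstab_proportions(real_rows, col_a, col_b)
    synth = crosstab_proportions(synth_rows, col_a, col_b)
    # Compare over every cell seen in either table; a missing cell counts as 0.
    cells = set(real) | set(synth)
    return max(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in cells)


# Toy records invented for illustration only (not ACS PUMS data).
real_rows = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "no"},
    {"sex": "M", "employed": "yes"},
    {"sex": "M", "employed": "yes"},
]
synth_rows = [
    {"sex": "F", "employed": "yes"},
    {"sex": "F", "employed": "yes"},
    {"sex": "M", "employed": "yes"},
    {"sex": "M", "employed": "no"},
]

gap = max_crosstab_gap(real_rows, synth_rows, "sex", "employed")
print(f"largest cell-share gap: {gap:.2f}")
```

A small gap across many pairs of columns would suggest the synthetic data reproduces the pairwise structure of the original; how small is small enough depends on the analysis the data is meant to support.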