Economists often fuse or impute features between survey datasets to simulate economic policies. I describe common methods such as regression and matching, and show how the synthimpute package's random-forest-based method outperforms them on holdout sets. I also give examples of how PolicyEngine uses synthimpute to research a range of universal basic income policies.
When economists simulate policy reforms, they turn to representative survey microdata. For example, policy reforms in the United States are often evaluated by simulating the impact on each of the 60,000 households that respond to the Current Population Survey. But the Current Population Survey doesn't include all household information that might be relevant to a policy; for example, it doesn't include wealth or carbon emissions. Simulating wealth taxes or carbon taxes using the Current Population Survey requires imputing wealth from the Survey of Consumer Finances, or carbon emissions from the Consumer Expenditure Survey.
Typical methods for this include regression (sometimes a mix of logistic and linear regression) and statistical matching. However, these methods tend to underestimate predictive uncertainty, which produces too few extreme values. They also don't fully account for interactions among the predictors common to both datasets, and they can't be easily adjusted for systematic under- or over-reporting of the predicted quantity in the survey.
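To illustrate the underestimated-uncertainty problem, here is a minimal sketch (not code from the paper, and using simulated data) showing that regression-based imputation assigns each record its conditional mean, collapsing the dispersion of the imputed variable and understating its tails:

```python
# Illustrative sketch: mean-based regression imputation shrinks the
# distribution of imputed values relative to the true values.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated "donor" data: one shared predictor, noisy target (e.g. wealth).
x = rng.normal(size=(5000, 1))
y = x[:, 0] + rng.normal(scale=2.0, size=5000)

model = LinearRegression().fit(x, y)
imputed = model.predict(x)  # conditional means only, no residual noise

# The imputed values are far less dispersed than the true values,
# so extreme quantiles (e.g. the 99th percentile) are badly understated.
print(np.std(y), np.std(imputed))
```

Statistical matching avoids this particular collapse by copying observed donor values, but it still conditions only on the matched predictors and so misses interactions and residual variation.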
I introduce the synthimpute Python package and its methods for data fusion based on random forests. I show that this method outperforms current models on common imputation tasks, as assessed by quantile loss. I also demonstrate its capability to optimize deviations from uniform quantile selection so that imputations sum to administrative targets.
I contextualize the synthimpute technology with examples from PolicyEngine, a tech nonprofit that lets anyone reform the tax and benefit system and see the impact on society and their own household.