As algorithms for statistical learning advance, the needs of social science research increasingly conflict with the privacy concerns of individuals in databases. Even when databases are “anonymized,” today’s algorithms can exploit statistical patterns to reveal sensitive information about individuals. As a result, some organizations are dramatically scaling back the extent to which they make data available for social science research.
The goal of this project is to develop new techniques for supporting social science research while maintaining formal privacy guarantees. We will build a system that takes arbitrary code written by a researcher, evaluates its behavior under a bootstrap procedure, and inserts noise to mask information about individuals. Because the code is unrestricted, this setting is incompatible with classical definitions of differential privacy. Instead, we will adopt a statistical lens: estimating the distribution of privacy losses and providing users with guarantees that are asymptotically precise. We will further study how our bootstrap system relates to regularization in machine learning, proving results that connect privacy and generalization error.
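To make the intended mechanism concrete, the following is a minimal sketch, not the project's actual system: it runs an arbitrary analysis function on bootstrap resamples, uses the bootstrap variability to calibrate Gaussian noise, and releases a noised estimate. The function name `bootstrap_private_release`, its parameters, and the choice of Gaussian noise are illustrative assumptions; the proposed system would instead estimate a distribution of privacy losses rather than a single noise scale.

```python
import numpy as np

def bootstrap_private_release(analysis, data, n_boot=200, noise_scale=None, seed=None):
    """Hypothetical sketch of a bootstrap-based noisy release.

    `analysis` is an arbitrary researcher-supplied function mapping a
    dataset (1-D array) to a scalar statistic. The spread of the statistic
    across bootstrap resamples is used to set the noise scale.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)

    # Evaluate the researcher's code on bootstrap resamples of the data.
    estimates = np.array([
        analysis(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])

    # Calibrate noise to the bootstrap variability (an assumption here;
    # the real system would characterize the privacy-loss distribution).
    scale = noise_scale if noise_scale is not None else estimates.std(ddof=1)

    # Release the statistic on the full data, masked with Gaussian noise.
    return analysis(data) + rng.normal(scale=scale)


# Example usage with a simple mean as the "arbitrary" researcher code.
sample = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=500)
print(bootstrap_private_release(np.mean, sample, seed=1))
```

In this sketch the bootstrap plays two roles that the project aims to formalize: it probes how the researcher's code responds to perturbations of the data, and it supplies the variability estimate used to mask individual contributions.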