Administrative Records for Survey Methodology. Group of authors
are not equivalent. In a regression discontinuity design, for example, there will now be a window around the break point in the running variable that reflects the uncertainty associated with the noise infusion. If the effect is not large enough, it will be swamped by noise even though all the inputs to the analysis are unbiased, or nearly so. Once again, using the unmodified confidential data via a restricted access agreement does not completely solve the problem. Once the noisy data have been published, the agency has to consider the consequences of allowing publication of a clean regression discontinuity estimate, because the plot of the unprotected outcomes against the running variable can then be compared with the corresponding plot produced from the public noisy data.
An even more invasive input noise technique is data swapping. Sensitive data records (usually households) are identified based on a priori criteria. Then, sensitive records are compared to “nearby” records on the basis of a few variables. If there is a match, the values of some or all of the other variables are swapped (usually the geographic identifiers, effectively relocating each record to the other’s location). The formal theory of data swapping was developed shortly after the theory of primary/complementary suppression (Dalenius and Reiss 1982, first presented at the American Statistical Association (ASA) Meetings in 1978). Basically, the marginal distribution of the variables used to match the records is preserved at the cost of all joint and conditional distributions involving the swapped variables. In general, very little is published about the swapping rates, the matching variables, or the definition of “nearby,” making analysis of the effects of this protection method very difficult. Furthermore, even arrangements that permit restricted access to the confidential files still require the use of the swapped data. Some providers destroy the unswapped data. Data swapping is used by the Census Bureau, NCHS, and many other agencies (FCSM 2005). The Census Bureau does not allow analysis of the unswapped decennial and ACS data except under extraordinary circumstances. These usually involve the preparation of linked data from outside sources, followed by reimposition of the original swap (so the records acquire the correct linked information, but the geographies are swapped according to the original algorithm before any analysis is performed). NCHS allows the use of unswapped data in its restricted access environment but prohibits publication of most subnational geographies when the research is published.
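The swapping procedure described above can be sketched in a few lines. This is a toy illustration only, not any agency's algorithm: the record layout, the sensitivity rule (`income > 200`), the single matching variable (`size`), and the function name `swap_geography` are all assumptions made for the example.

```python
import random
from collections import Counter

def swap_geography(records, is_sensitive, match_keys, rng):
    """Return a copy of records in which each sensitive record trades its
    'geo' value with a randomly chosen partner from another geography
    that agrees with it on every variable in match_keys."""
    out = [dict(r) for r in records]          # leave the confidential file untouched
    used = set()                              # each record is swapped at most once
    for i, rec in enumerate(out):
        if i in used or not is_sensitive(rec):
            continue
        candidates = [
            j for j, other in enumerate(out)
            if j != i and j not in used
            and other["geo"] != rec["geo"]
            and all(other[k] == rec[k] for k in match_keys)
        ]
        if not candidates:
            continue                          # no "nearby" partner: record is left as is
        j = rng.choice(candidates)
        out[i]["geo"], out[j]["geo"] = out[j]["geo"], out[i]["geo"]
        used.update((i, j))
    return out

# Hypothetical household file: geography, household size, income.
records = [
    {"id": 1, "geo": "A", "size": 1, "income": 250},
    {"id": 2, "geo": "A", "size": 3, "income": 60},
    {"id": 3, "geo": "B", "size": 1, "income": 40},
    {"id": 4, "geo": "B", "size": 3, "income": 55},
    {"id": 5, "geo": "C", "size": 1, "income": 45},
]

swapped = swap_geography(
    records,
    is_sensitive=lambda r: r["income"] > 200,  # assumed a priori criterion
    match_keys=["size"],                       # assumed matching variable
    rng=random.Random(0),
)

# Tabulations by geography and the matching variable are unchanged ...
size_by_geo_before = Counter((r["geo"], r["size"]) for r in records)
size_by_geo_after = Counter((r["geo"], r["size"]) for r in swapped)
# ... but the joint distribution of geography and income is not.
income_by_geo_before = Counter((r["geo"], r["income"]) for r in records)
income_by_geo_after = Counter((r["geo"], r["income"]) for r in swapped)
```

In this toy file the geography-by-size table is identical before and after the swap, while the geography-by-income table differs. That is the trade-off the text describes: distributions of the matching variables survive, and joint distributions involving the swapped variable do not.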
The basic problem for empirical social scientists is that agencies must have a general purpose data publication strategy in order to provide the public good that is the reason for incurring the cost of data collection in the first place. But this publication strategy inherently advantages certain analyses over others. Statisticians and computer scientists have developed two related ways to address this problem: synthetic data combined with validation servers and privacy-protected query systems. Statisticians define “synthetic data” as samples from the joint probability distribution of the confidential data that are released for analysis. After the researcher analyzes the synthetic data, the validation server is used to repeat some or all of the analyses on the underlying confidential data. Conventional SDL methods are used to protect the statistics released from the validation server.
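The synthetic data and validation server workflow can be illustrated with a deliberately simple sketch. Everything here is an assumption made for the example: the "model" is just the table of empirical cell frequencies (which would offer little real protection, since actual synthesizers draw from a fitted statistical model), the rounding step is a stand-in for whatever conventional SDL rule an agency applies, and the function names are hypothetical.

```python
import random
from collections import Counter

def fit_cell_probabilities(records):
    """Estimate the joint distribution of categorical records as cell shares."""
    counts = Counter(records)
    n = len(records)
    return {cell: c / n for cell, c in counts.items()}

def draw_synthetic(probabilities, n, rng):
    """Sample n synthetic records from the estimated joint distribution."""
    cells = list(probabilities)
    weights = [probabilities[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)

def share_employed(records):
    """The researcher's analysis: share of records with employed == True."""
    return sum(1 for (_, employed) in records if employed) / len(records)

def validation_server(analysis, confidential, round_to=0.05):
    """Rerun the analysis on the confidential data, then coarsen the result
    (rounding stands in for a conventional SDL rule)."""
    exact = analysis(confidential)
    return round(exact / round_to) * round_to

# Hypothetical confidential file of (age group, employed) pairs.
confidential = (
    [("young", True)] * 30 + [("young", False)] * 20
    + [("old", True)] * 10 + [("old", False)] * 40
)

rng = random.Random(7)
synthetic = draw_synthetic(fit_cell_probabilities(confidential), len(confidential), rng)

estimate_on_synthetic = share_employed(synthetic)               # developed by the researcher
validated_estimate = validation_server(share_employed, confidential)  # released by the agency
```

The researcher only ever touches `synthetic`. The confidential file is read solely inside `validation_server`, and only its protected output is released.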
2.2.2 Formal Privacy Models
Computer scientists define a privacy-protected query system as one in which all analyses of the confidential data are passed through a noise-infusion filter before they are published. Some of these systems use input noise infusion – the confidential data are permanently altered at the record level, and then all analyses are done on the protected data. Other formally private systems apply output noise infusion to the results of statistical analyses before they are released.
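The difference between input and output noise infusion can be made concrete with a small sketch. The Gaussian noise, the scale values, and the function names are assumptions for illustration. Nothing here is calibrated to deliver a formal privacy guarantee.

```python
import random

def mean(values):
    return sum(values) / len(values)

def protect_inputs(values, scale, rng):
    """Input noise infusion: perturb every record once; all later analyses
    run on this protected copy."""
    return [v + rng.gauss(0.0, scale) for v in values]

def noisy_output(statistic, values, scale, rng):
    """Output noise infusion: compute the statistic on the confidential
    records, then perturb only the released result."""
    return statistic(values) + rng.gauss(0.0, scale)

rng = random.Random(42)
confidential = [31.0, 45.5, 52.0, 38.5, 60.0]   # hypothetical earnings, in thousands

# Input noise: the protected file is fixed, so repeating an analysis
# returns the same answer every time.
protected = protect_inputs(confidential, scale=2.0, rng=rng)
input_noise_answer = mean(protected)
input_noise_answer_again = mean(protected)

# Output noise: each release draws fresh noise, so repeated queries differ.
output_noise_answer = noisy_output(mean, confidential, scale=0.5, rng=rng)
output_noise_answer_again = noisy_output(mean, confidential, scale=0.5, rng=rng)
```

With input noise, protection is applied once and the analyst can run any number of analyses on the altered file. With output noise, every released statistic consumes fresh randomness, which is why such systems need an accounting of cumulative privacy loss.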
All formal privacy models define a cumulative, global privacy loss associated with all of the publications released from a given confidential database. This is called the total privacy-loss budget. The budget can then be allocated to each of the released queries. Once the budget is exhausted, no more analysis can be conducted. The researcher must decide how much of the privacy-loss budget to spend on each query – producing noisy answers to many queries or sharp answers to a few. The agency must decide the total privacy-loss budget for all queries and how to allocate it among competing potential users.
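A privacy-loss budget can be sketched as a simple accountant. The example assumes that privacy losses add up across queries (basic sequential composition) and that the noise scale of a query is inversely proportional to the budget spent on it. Both are modeling assumptions made for illustration, and the class and function names are hypothetical.

```python
class BudgetExhausted(Exception):
    """Raised when a query would exceed the total privacy-loss budget."""

class PrivacyLossBudget:
    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Charge one query against the budget, or refuse it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:   # small float tolerance
            raise BudgetExhausted("no privacy-loss budget left for this query")
        self.spent += epsilon

def noise_scale(sensitivity, epsilon):
    """Assumed inverse relationship between budget spent and noise added."""
    return sensitivity / epsilon

# Same total budget, two allocation strategies.
many_queries = PrivacyLossBudget(total_epsilon=1.0)
for _ in range(10):
    many_queries.spend(0.1)
scale_when_many = noise_scale(sensitivity=1.0, epsilon=0.1)   # noisier answers

few_queries = PrivacyLossBudget(total_epsilon=1.0)
for _ in range(2):
    few_queries.spend(0.5)
scale_when_few = noise_scale(sensitivity=1.0, epsilon=0.5)    # sharper answers
```

Spreading the same budget over ten queries makes each answer five times noisier than spending it on two, and any further call to `spend` raises `BudgetExhausted`.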
An increasing number of modern SDL and formal privacy procedures replace methods like deterministic suppression and targeted random swapping with some form of noisy query system. Over the last decade these approaches have moved to the forefront because they provide the agency with a formal method of quantifying the global disclosure risk in the output and of evaluating the data quality along dimensions that are broadly relevant.
Relatively recently, formal privacy models have emerged from the literature on database security and cryptography. In formal privacy models, the data are distorted by a randomized mechanism prior to publication. The goal is to explicitly characterize, given a particular mechanism, how much private information is leaked to data users.
Differential privacy is a particularly prominent and useful approach to characterizing formal privacy guarantees. Briefly, a formal privacy mechanism that grants ε-differential privacy places an upper bound, parameterized by ε, on the ability of a user to infer from the published output whether any specific data item, or response, was in the original, confidential data (see Dwork and Roth 2014 for an in-depth discussion).
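As a minimal sketch of an ε-differentially private release, the example below answers a counting query with the Laplace mechanism described in Dwork and Roth (2014). In the standard formulation, a mechanism M is ε-differentially private if, for any two databases D and D′ that differ in one record and any set of outputs S, Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]. The records, the predicate, and the function names are illustrative assumptions, and the sketch ignores the floating-point issues that a production implementation would have to address.

```python
import random

def laplace_noise(scale, rng):
    """Draw from a Laplace(0, scale) distribution as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity of the query is 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(12345)
ages = [23, 67, 71, 45, 80, 34, 52]          # hypothetical confidential records

def is_over_65(age):
    return age > 65

# Smaller epsilon means a stronger guarantee and a noisier answer.
strong_protection = private_count(ages, is_over_65, epsilon=0.1, rng=rng)
weak_protection = private_count(ages, is_over_65, epsilon=2.0, rng=rng)
```

Because ε and the noise distribution can be published, an analyst can account for the added variance directly, for example by noting that Laplace noise of scale 1/ε has variance 2/ε².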
Formal privacy models are very intriguing because they solve two key challenges for disclosure limitation. First, formal privacy models by definition provide provable guarantees on how much privacy is lost, in a probabilistic sense, in any given data publication. Second, the privacy guarantee does not require that the implementation details, specifically the parameter ε, be kept secret. This allows researchers using data published under formal privacy models to conduct fully SDL-aware analysis. This is not the case with many traditional disclosure limitation methods which require that key parameters, such as the swap rate, suppression rate, or variance of noise, not be made available to data users (Abowd and Schmutte 2015).
2.3 Confidentiality Protection in Linked Data: Examples
To illustrate the application of new disclosure avoidance techniques, we describe three examples of linked data and the means by which confidentiality protection is applied to each. First, the Health and Retirement Study (HRS) links extensive survey information to respondents’ administrative data from the Social Security Administration (SSA) and the Centers for Medicare & Medicaid Services (CMS). To protect confidentiality in the linked HRS–SSA data, its data custodians use a combination of restrictive licensing agreements, physical security, and restrictions on model output. Our second example is the Census Bureau’s Survey of Income and Program Participation (SIPP), which has also been linked to earnings data from the Internal Revenue Service (IRS) and benefit data from the SSA. Census makes the linked data available to researchers as the SIPP Synthetic Beta File (SSB). Researchers can directly access synthetic data via a restricted server and, once their analysis is ready, request output based on the original harmonized confidential data via a validation server. Finally, the Longitudinal Employer-Household Dynamics Program (LEHD) at the Census Bureau links data provided by 51 state administrations to data from federal agencies and to surveys and censuses of businesses, households, and people conducted by the Census Bureau. Tabular summaries of LEHD are published with greater detail than most business and demographic data. The LEHD is accessible in restricted enclaves, but there are also restrictions on the output researchers can release. There are many other linked data sources. These three are each innovative in some fashion and allow us to illustrate the issues faced when devising disclosure avoidance methods for linked data.
2.3.1 HRS–SSA
2.3.1.1 Data Description
The HRS is conducted by the Institute for Social Research at the University of Michigan. Data collection began in 1992, and the original sample of respondents has been reinterviewed every two years since then. New cohorts and sample refreshment have made the HRS one of the largest representative longitudinal samples of Americans over 50, with over 26 000 respondents in a given wave (Sonnega and Weir 2014). In 2006, the HRS started collecting