Administrative Records for Survey Methodology. Group of authors
was to make the newly available linked administrative data at LEHD accessible to researchers. The network operates under physical security constraints managed by the Census Bureau and the IRS, in locations that are considered part of the Census Bureau itself, and staffed by Census Bureau employees.
Statistical data enclaves can be central locations, in which a single location at the statistical agency is made available to approved researchers. In the United States, NCHS and BLS follow this model, in addition to using the FSRDC network. In Canada, business data can be accessed at Statistics Canada headquarters, while other data may be accessed both there and at the geographically dispersed RDCs, which obtain physical copies of the confidential data.
Some facilities are hybrid facilities. The statistical processing occurs at a central location, but the secure remote access facilities are distributed geographically. The U.S. FSRDCs have worked this way since the early 2000s. A central computing facility is housed in the Census Bureau’s primary data center. Secure remote access is provided to approved researchers at designated sites throughout the country, namely the FSRDCs. Each of the FSRDC sites is a secure Census Bureau facility that is physically located on controlled premises provided by the partner organization, often a university or Federal Reserve Bank. The German IAB locates certified thin clients in dedicated rooms at partner institutions. Secure spaces are costly to build and certify. Recently, institutions in the United Kingdom have attempted to reduce the cost by commoditizing such secure spaces (Raab, Dibben, and Burton 2015). In France, the Centre d’accès sécurisé distant aux données (CASD) has a secure central computing facility, and allows for remote access through custom secure devices from designated but otherwise ordinary university offices, which satisfy certain physical requirements, but are not dedicated facilities. Similar arrangements are used by Scandinavian NSOs, as well as by survey organizations such as the HRS. Remote access to full desktop environments within the secure data enclave, commonly referred to as “virtual desktop infrastructure” (VDI), from regular laptops or workstations, is increasingly common.
The location of remote access points is often limited to the country of the data provider (United States, Canada), or to countries with reciprocal or common enforcement mechanisms (within the European Union, for European NSOs). Cross-border access, even within the European Union, remains exceedingly rare, with only a handful of cross-border secure remote access points open in the European Union. The most prolific user of cross-border secure remote access points, as of this writing, is the German IAB, with multiple data access points in the United States and a recently opened one in the United Kingdom.
2.4.2 Remote Processing
Two alternative remote access mechanisms are often used: manual and automatic remote processing. Manual remote processing occurs when the remote “processor” is a staff member of the data provider. This can be as simple as sending programs in by email, or finding a co-author who is an employee of the data provider. The U.S. NCHS, German IAB, and Statistics Canada provide this type of access. Generally, the costs of manual remote processing are paid by the users.
More sophisticated mechanisms automate some or all of the data flow. For instance, programs may be executed automatically based on email or web submission, but disclosure review is performed manually. This method is used by the IAB’s JoSuA (Institute for Employment Research 2016). Fully automated mechanisms, such as LISSY (Luxembourg), ANDRE (U.S. NCHS), DAS (U.S. NCES), Australia’s Remote Access Data Laboratory (RADL), and Canada’s Real Time Remote Access (RTRA), generally restrict the command set of the allowed statistical programming languages (SAS, Stata, and SPSS) and limit users to those statistical procedures and languages for which known automated disclosure limitation procedures have been implemented.
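The idea of a restricted command set can be illustrated with a minimal sketch. It is not a description of how any of the systems named above actually work. The allow-list, the function name `check_program`, and the Stata-like sample program are all hypothetical, and a real system would parse the language properly rather than inspect the first word of each line.

```python
from typing import List, Tuple

# Hypothetical allow-list; an actual facility defines its own permitted commands.
ALLOWED_COMMANDS = {"use", "regress", "tabulate", "summarize", "logit"}


def check_program(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, command) pairs for commands not on the allow-list."""
    rejected = []
    for lineno, raw in enumerate(source.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and Stata-style comment lines.
        if not line or line.startswith("*"):
            continue
        command = line.split()[0].lower()
        if command not in ALLOWED_COMMANDS:
            rejected.append((lineno, command))
    return rejected


# Hypothetical submission: the last line tries to print unit-level records.
submitted = """\
* wage regression
use survey_extract
regress wage age education
list id wage in 1/10
"""

print(check_program(submitted))  # prints [(4, 'list')]
```

In this sketch the submission would be refused before execution because `list` would expose individual records, while the regression itself is permitted.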
Most of these systems only provide access to household and person surveys. Of the known systems surveyed above, only Australia’s RADL systems and the Bank of Italy’s implementation of LISSY (Bruno, D’Aurizio, and Tartaglia-Polcini 2009, 2014) seem to provide access to business microdata through automated remote processing facilities.
2.4.3 Licensing
Users of secure research data centers always sign some form of legally binding user or licensing agreement. These agreements describe acceptable user behavior, such as not copying or photographing screen contents. However, licensing alone may also be used to provide access to restricted-use microdata outside of formal restricted access data centers. In general, the detail in licensed microdata files is greater than in the equivalent (or related) public-use file, and may allow for disclosure of confidential data if inappropriately exploited. For this reason, licensed microdata files tend to have several additional levels of disclosure avoidance methods applied, including output review in some cases. For instance, even without linkages, the HRS licensed files have more detailed geography on respondents (county, say, rather than Census region), but do not have the most detailed geography (GPS coordinates or exact address). Generally, the legally enforceable license imposes restrictions on what can be published by the researchers, and restricts who can access the data, and for what purpose. The contracting organization is the researcher’s university, which is subject to penalties such as loss of eligibility status for research grants if the license is violated.
In the United States, some surveys (NCES, NLSY, and HRS) use licensing to distribute portions of the data they collect on their respondents. Commercial data providers (COMPUSTAT, etc.) also license the data distributed to researchers. Penalties for license infractions range from restricting future research grant funding, for example in HRS, to monetary penalties, for example in commercial data licenses. We are not aware of any studies that quantify the violation rates or financial penalties actually incurred due to license violations. Licensing may be limited by the enforceability of laws or contracts, and thus may be limited to residents of the same jurisdiction in which the data provider is housed. Often, some licensing is combined with the creation of ad-hoc data enclaves, the simplest of these being stand-alone, nonnetworked computer workstations.
2.4.4 Disclosure Avoidance Methods
Data enclaves exist to allow researchers to perform analyses within the restricted environment, and then extract or publish some form of statistical summary that can be released from the secure environment. Generally, these summaries are estimates from a statistical model. In general, model-based output is evaluated in accordance with the same criteria traditionally used for tabular output (minimum number of units within a reporting cell, minimum percentage of global activity within a reporting cell). In contrast to licensing arrangements, which allow researchers to self-monitor, statistical data enclaves have regimented output monitoring, typically by staff of the data provider. Generally, released statistical outputs are registered in some fashion, but documentation of the full provenance chain may be limited.
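The two tabular criteria mentioned above (a minimum number of units per reporting cell and a minimum share of overall activity per cell) can be sketched as a simple check. The thresholds, the function name `review_cells`, and the sample records are hypothetical; actual thresholds are set by each data provider and are often not published.

```python
from collections import defaultdict

MIN_UNITS = 3      # hypothetical minimum number of units per cell
MIN_SHARE = 0.01   # hypothetical minimum share of total activity per cell


def review_cells(records, min_units=MIN_UNITS, min_share=MIN_SHARE):
    """Map each cell to True if it passes both criteria, else False.

    records is an iterable of (cell_key, value) pairs, one per unit.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for cell, value in records:
        counts[cell] += 1
        totals[cell] += value
    grand_total = sum(totals.values())

    decisions = {}
    for cell in counts:
        share = totals[cell] / grand_total if grand_total else 0.0
        decisions[cell] = counts[cell] >= min_units and share >= min_share
    return decisions


# Hypothetical unit-level values grouped by industry cell.
records = [
    ("retail", 120.0), ("retail", 80.0), ("retail", 100.0),
    ("mining", 500.0), ("mining", 200.0),
    ("crafts", 1.0), ("crafts", 2.0), ("crafts", 1.0),
]

print(review_cells(records))
# prints {'retail': True, 'mining': False, 'crafts': False}
```

Here the mining cell fails because it has only two units, and the crafts cell fails because it accounts for less than 1 percent of total activity.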
No systematic attempt has been made, to our knowledge, to measure formally the cumulative privacy impact of model-based releases because the science and technology for doing so are rudimentary. Remote processing facilities, on the other hand, when using automated mechanisms, rely on several practices to reduce the risk of disclosure. First, they limit the scope of possible analyses to those for which the agency has developed safe procedures. The number of times a researcher may request releases may also be limited. Nevertheless, most agencies recognize that this review system does not scale because a full accounting of all possible query combinations over time is infeasible. In general, they apply basic disclosure avoidance techniques such as suppression, perturbation, masking, recoding, and bootstrap sampling of the input data to each project separately. Some systems apply automated analysis of log and output files (Schouten and Cigrang 2003), although often a manual review is also included (O’Keefe et al. 2013). Some systems provide for self-monitored release of model results, either under licensing or remote access. There are also limitations on quantity and frequency of self-released results, combined with sampling by human reviewers. More sophisticated tools, such as perturbation or synthesizing of estimated model parameters, have been proposed (Reiter 2003). Finally, such systems require review of the draft research paper before submission to any publication medium, including online preprint repositories such as arXiv.org.
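As a rough illustration of limiting the quantity of self-released results and sampling some of them for human review, consider the following sketch. The class `ReleaseGate`, its parameters, and the status strings are hypothetical and do not describe any particular agency's system.

```python
import random


class ReleaseGate:
    """Per-researcher release quota with random sampling for human review."""

    def __init__(self, max_releases, review_rate, seed=None):
        self.max_releases = max_releases  # hypothetical quota per researcher
        self.review_rate = review_rate    # probability a release is sampled
        self.used = {}                    # releases already granted, by researcher
        self._rng = random.Random(seed)

    def request(self, researcher):
        used = self.used.get(researcher, 0)
        if used >= self.max_releases:
            return "denied"
        self.used[researcher] = used + 1
        # random() is in [0, 1), so a rate of 1.0 samples every release
        # and a rate of 0.0 samples none.
        if self._rng.random() < self.review_rate:
            return "released_and_sampled_for_review"
        return "released"


# Hypothetical policy: two self-released outputs, every one sampled.
gate = ReleaseGate(max_releases=2, review_rate=1.0)
statuses = [gate.request("researcher_a") for _ in range(3)]
print(statuses)
# prints ['released_and_sampled_for_review',
#         'released_and_sampled_for_review', 'denied']
```

A quota of this kind bounds the number of outputs but not their cumulative information content, which is the scaling limitation noted above.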