Administrative Records for Survey Methodology. Group of authors
surveys.
1 On the Use of Proxy Variables in Combining Register and Survey Data
Li-Chun Zhang
S3RI/Department of Social Statistics and Demography, University of Southampton, SO17 1BJ, Southampton, UK
1.1 Introduction
In this chapter, we present an overview of the uses of proxy variables when combining data from multiple sources. In the remainder of this introductory section, we will explain what we mean by register and survey data, how the multisource data perspective differs from the survey-data centric view, and the concept of proxy variables in the context of multisource data. In Section 1.2, we consider the many and various instances of proxy variables, based on a systematic examination of the processing steps of data integration and associated error sources. In Section 1.3, we classify and outline estimation methods in the presence of multiple proxy variables. It is seen that the traditional role of auxiliary data from administrative sources can be greatly extended under the multisource data perspective. A short summary and discussion of future research is given in Section 1.4.
1.1.1 A Multisource Data Perspective
Under the presumption that the target units and measures are collected in survey data, register data traditionally have two principal uses: to provide the frames for sampling and estimation, and to provide the auxiliary data for reducing both sampling and non-sampling survey errors (Särndal, Swensson, and Wretman 1992). The term auxiliary data conveys that register data play a helpful supporting role but are ultimately not indispensable. A broader view is necessary in order to cover the full range of approaches for combining register and survey data, where the two types of data are on an equal footing.
Let us first clarify what we mean by register and survey data. We shall simply refer to statistical data arising from administrative sources as register data. On the one hand, this extends the narrow interpretation of the term register as an authoritative list of objects; on the other hand, it implies that generally some processing may be required in order to transform “raw” administrative data into a state that permits them to be utilized for statistical purposes. Next, we shall simply refer to statistical data collected from samples and censuses as survey data. Our usage of the term survey here is conventional and more limiting, e.g. compared to that of Statistics Canada (2015), where it is used generically to cover any activity that collects or acquires statistical data, including administrative records and estimated data. We do not wish to contest the general interpretation, but we adopt the convention to facilitate the discussion that follows. A central distinction between what we call register and survey data is that the survey data are purposely designed and collected for statistical uses, whilst the register data are originally generated and recorded for purposes other than making statistics. This is also the reason why we refer to both survey sampling and census data as survey data, rather than taking on an even narrower interpretation which equates survey data with survey sampling data.
Brackstone (1987) characterizes the uses of administrative records, i.e. register data, into (i) direct tabulation, (ii) indirect estimation, (iii) survey frames, and (iv) survey evaluation. To appreciate what we shall refer to as the multisource data perspective and by way of introduction, let us consider the following question: Are the four uses (i)–(iv) of register data equally applicable to survey data?
Direct tabulation refers to the situation where statistics are produced based on the relevant register data without any explicit use of survey data. The scope of such register-based statistics has increased greatly in the past decades. A prominent example is the latest round of register-based census-like statistics in a number of European countries (UNECE 2014). See Wallgren and Wallgren (2014) for many other examples. As Zhang and Giusti (2016) point out and illustrate, sometimes relevant survey data are available and used implicitly to define the processing rules or to assess the accuracy of the register data, but do not enter the statistics directly. Clearly, in this sense, one can equally speak of direct tabulation based on survey data, such as the use of the Horvitz–Thompson estimator in survey sampling, or direct census enumeration of the population size.
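To make the survey-side analogue of direct tabulation concrete, the following is a minimal sketch of the Horvitz–Thompson estimator of a population total, which weights each sampled value by the inverse of its inclusion probability. The function name and the toy values are illustrative assumptions and do not come from the chapter.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_i / pi_i over the sample.

    values          -- observed values y_i for the sampled units
    inclusion_probs -- first-order inclusion probabilities pi_i (0 < pi_i <= 1)
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    total = 0.0
    for y, pi in zip(values, inclusion_probs):
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        # Each sampled unit represents 1 / pi units of the population.
        total += y / pi
    return total


# Hypothetical sample of three units (illustrative numbers only).
sample_values = [10.0, 20.0, 30.0]
sample_probs = [0.5, 0.25, 0.5]
ht_estimate = horvitz_thompson_total(sample_values, sample_probs)
```

The estimate uses only the sample values and the design information, which is the sense in which it counts as direct tabulation from survey data.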
Brackstone (1987) includes, under indirect estimation, the cases where register data “comprise one of the inputs into an estimation process.” In the split-population or split-data approach (UNECE 2011), register and survey data supplement each other literally. A practical example of the split-population approach is the Unified Enterprise Survey at Statistics Canada, where register data are used for over half of the smaller enterprises with simple structures, and survey data are collected from the remaining units with more complex structures. Under the split-data approach, register data would provide some but not all of the required variables for the whole population, which otherwise would have to be collected in survey questionnaires. For example, at Statistics Norway, it is possible to derive income and education level data from statistical registers, so that these variables are not collected in the European Union Statistics on Income and Living Conditions (EU-SILC) and other social surveys. Imputation for survey nonresponse using register data can be viewed as a hybrid approach, where the units and variables to be substituted are determined post hoc after survey data collection. Indirect estimation beyond the split-population/split-data approach will be discussed in detail later on, after we have explained the concept of proxy variables in Section 1.1.2.
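The split-population approach described above can be sketched as follows, under simplifying assumptions: register values are taken as given for the units covered by the register part, and the remaining part of the population is estimated from a weighted sample. The class names, the variable `turnover`, and all numbers are hypothetical and serve only as an illustration; this is not a description of any agency's production system.

```python
from dataclasses import dataclass


@dataclass
class RegisterUnit:
    unit_id: str
    turnover: float  # value taken directly from the register


@dataclass
class SurveyResponse:
    unit_id: str
    turnover: float       # value reported in the survey
    design_weight: float  # number of population units this response represents


def split_population_total(register_units, survey_responses):
    """Combine a register-based sum with a weighted survey estimate."""
    # Part 1: units with simple structures, enumerated from the register.
    register_part = sum(u.turnover for u in register_units)
    # Part 2: units with complex structures, estimated from the sample.
    survey_part = sum(r.design_weight * r.turnover for r in survey_responses)
    return register_part + survey_part


# Hypothetical data: two simple units from the register, two sampled complex units.
simple_units = [RegisterUnit("A", 100.0), RegisterUnit("B", 150.0)]
complex_sample = [
    SurveyResponse("C", 1000.0, 2.0),
    SurveyResponse("D", 500.0, 4.0),
]
combined_total = split_population_total(simple_units, complex_sample)
```

The two parts must partition the target population; a unit that appears in both would be counted twice, so in practice the split has to be defined on the frame before data collection.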
Regarding the use of register data to create, supplement, or update frames for sample surveys and censuses, it takes only a moment of reflection to realize that exactly the same can be said of survey data. For instance, a census can be used to create, supplement, or update frames for postcensal sample surveys. The yearly Structural Business Survey and specific quality assurance surveys are used to check or update the Business Register. As a noteworthy special case, one may include here population size estimation based on Census and Census Coverage Surveys (Nirel and Glickman 2009).
Survey evaluation covers the use of register data for checking, validating, or assessing survey data, whether they are collected in a sample or census. This may be done at both individual and aggregate levels. Conversely, using survey estimates for external validation of register-based statistics has been a natural approach from early on (Myrskyla 1991). A quality survey in a census year is another common approach in Scandinavia (Axelson et al. 2020), which is usually not directed at the population coverage errors of the Central Population Register in those countries, but at the various classification and measurement errors in the register data. Or, as mentioned above, survey data are commonly used implicitly to define the processing rules or to assess the accuracy of the register data.
In summary, one can speak of a multisource data perspective for combining register and survey data on at least two different levels. In the wider sense, it is possible to characterize equally the uses of both register and survey data into four broad categories: (i) single-source estimation, (ii) multisource estimation, (iii) frames,