Administrative Records for Survey Methodology. Group of authors
party votes be the table of interest. Suppose one can obtain Ya+ and Y+j in an election, but there are no joint observations of the cells (a, j). This can be framed as a problem of statistical matching. Provided a proxy table X, say, ethnicity by party membership, the IPF can be applied to obtain an estimated table Ŷ. Zhang (2015a) develops an uncertainty measure that combines the identification uncertainty and the sampling uncertainty in this context, which enables one to quantify the relative efficiency of the proxy data X, compared to statistical matching without X. The application of the IPF here is an example of the benchmarked adjustment approach.
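The raking step described above can be sketched as follows. This is a minimal illustration, assuming a two-way proxy table with strictly positive cells and margins that share the same grand total; the function name `ipf`, the example counts, and the margin values are hypothetical and are not taken from the text or from Zhang (2015a).

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a two-way table to fixed margins.

    seed        -- proxy table X (list of rows), assumed strictly positive
    row_targets -- known row margins (the Ya+ of the text)
    col_targets -- known column margins (the Y+j of the text)
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target margin.
        for a, row in enumerate(table):
            factor = row_targets[a] / sum(row)
            table[a] = [v * factor for v in row]
        # Column step: scale each column so that it sums to its target margin.
        for j, target in enumerate(col_targets):
            factor = target / sum(row[j] for row in table)
            for row in table:
                row[j] *= factor
        # Column margins now hold exactly; stop once the row margins do too.
        gap = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if gap < tol:
            break
    return table


# Hypothetical proxy table X (rows: ethnicity, columns: party membership).
proxy = [[40, 10],
         [20, 30]]
row_totals = [60, 40]   # assumed known row margins
col_totals = [55, 45]   # assumed known column margins

estimated = ipf(proxy, row_totals, col_totals)
```

Because each step only rescales whole rows or whole columns, the fitted table keeps the cross-product (odds) ratios of the proxy table while matching the observed margins. That is the sense in which the proxy supplies the association structure that the margins alone cannot identify.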
1.3.3 Symmetric Setting
In the symmetric setting, none of the proxy variables is ideal, due to errors of relevance, measurement, or coverage. The two most common approaches under the symmetric-linked setting are capture–recapture methodology for population size estimation and Structural Equation Modeling (SEM), which covers the latent class models mentioned earlier.
Capture–recapture methods, which originate from wildlife, social, and medical applications, are traditionally used for under-count adjustment. Imagine catching fish in a pond on two separate occasions, where one marks and identifies the fish that happen to be caught on both occasions (i.e. the recaptures). Then, under a number of simplifying assumptions, including independent and constant-probability captures, it becomes possible to estimate the total number of fish in the pond (i.e. the target population), for which the captures on each occasion generally entail undercounts. The method can be generalized to multiple captures to allow for relaxation of the independence assumption. The capture probability can be modeled using covariates to allow for heterogeneity across different subpopulations. See, e.g. Böhning, Van der Heijden, and Bunge (2017), for some recent developments.
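The two-occasion case can be made concrete with a short sketch. It assumes the simplest setting named above (independent captures with constant capture probability) and uses the classical Lincoln–Petersen estimator, together with Chapman's bias-corrected variant, which is not discussed in the text and is included only for comparison. The function names and the counts are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Two-occasion population size estimate: n1 * n2 / m.

    n1 -- number captured on the first occasion
    n2 -- number captured on the second occasion
    m  -- number captured on both occasions (the recaptures)
    """
    if m <= 0:
        raise ValueError("at least one recapture is needed")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts: 200 units on the first list, 150 on the second,
# 30 units found on both.
n_hat = lincoln_petersen(200, 150, 30)   # 1000.0
n_hat_chapman = chapman(200, 150, 30)
```

The reasoning behind the first estimator is that, under independence, the recapture fraction m / n2 estimates the first-occasion capture probability n1 / N; solving for N gives n1 * n2 / m.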
Combining survey and register-based enumerations for population size estimation has attracted growing interest in recent years, under the assumption that none of the sources can yield the true target population enumeration directly. We refer to the Journal of Official Statistics (2015, vol. 31, issue 3) for several useful references in this regard. There is plenty of scope for developing a range of models to address the different problems, including erroneous enumerations, which are not dealt with in the traditional capture–recapture methodology. The potential impact can be huge if it enables one to produce census-like population statistics without the traditional census.
SEM is often considered to have evolved from the genetic path modeling of Sewall Wright. See, e.g. Kline (2016) for a general introduction. The approach is popular in many social science disciplines that share a common interest in “latent constructs” such as intelligence, attitude, well-being, living standard, and so on. The postulated latent constructs cannot be measured directly and are only manifested through observable indicators. The SEM consists of two main components: the structural model showing potentially causal dependencies among the latent variables, and the measurement model relating the latent variables and their indicators. The approach can be referred to in different ways depending on the continuous-categorical nature of the variables involved, the presence of causality or stochastic process on the latent level, etc.
The SEM approach is applicable under the symmetric-linked setting, where the proxy variables are treated as the indicators of the unobserved target measure. In the context of combining register and survey data, this can serve a number of purposes, including assessment of the potential relevance bias of proxy measures, detection and possible treatment of measurement errors in editing and estimation, and statistical analysis of latent relationships using proxy indicators. For examples of data types that have been studied recently, see e.g. Pavlopoulos and Vermunt (2015) for temporary employment, Guarnera and Varriale (2015) for labor cost, and Burger et al. (2015) for turnover.
Di Cecco et al. (2018) apply latent class models for population size estimation based on multiple register enumerations that entail both over- and under-counts. It is intriguing to notice the connection with some recent developments in record linkage. Imagine K lists of records, where each record may or may not refer to a target population unit (i.e. latent entity). Provided the union of the lists entails only over-counts of the target population, a potential alternative approach is record linkage, also referred to as entity resolution or co-reference – see e.g. Steorts, Hall, and Fienberg (2015). The records in the same list that refer to the same entity represent duplicated enumerations; the records in different lists that refer to the same entity can be conceived as the target for record linkage. The errors in compiling the population total are then the potential de-duplication and record linkage errors, which are traditionally the topics of computerized record linkage.
Multiple macro-level proxy totals may need to be reconciled under the symmetric-unlinked setting. A typical example is multiple time series with different frequencies, e.g. with register-based yearly figures and survey-based sub-annual figures. Another example is the Supply-and-Use Tables for the production of GDP, where the initial estimates generally do not balance out because they are derived from different sources, or because the GDP is compiled using different approaches. Census output tables derived from fragmented data sources instead of a one-number file are yet another example. See e.g. Bikker, Daalmans, and Mushkudiani (2013) and Mushkudiani, Daalmans, and Pannekoek (2014, 2015).
Reconciliation is often achieved as the solution to a constrained optimization problem. The approach requires the specification of two components. First, a loss function is defined to measure the changes from the initial proxy estimates to the final reconciled estimates. Second, the constraints that the final estimates must satisfy, which may include both equalities and inequalities, need to be stated explicitly. Minimizing the loss function subject to the constraints then yields the final estimates. The approach is feasible without linked data across the sources. Notice that there are many advanced techniques of constrained optimization in Applied Mathematics, Engineering, and Computer Science.
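A minimal sketch of such a reconciliation is given below, for the special case of a weighted quadratic loss and a single linear equality constraint, where the Lagrangian solution is available in closed form. Inequality constraints and multiple simultaneous constraints, which the general approach allows, are outside this sketch and would call for a numerical solver. The function name `reconcile`, the quarterly figures, and the annual benchmark are hypothetical.

```python
def reconcile(initial, weights, coeffs, target):
    """Weighted least-squares adjustment under one linear equality constraint.

    Minimizes  sum_i weights[i] * (x[i] - initial[i]) ** 2
    subject to sum_i coeffs[i] * x[i] == target.

    Setting the gradient of the Lagrangian to zero gives
    x[i] = initial[i] + gap * (coeffs[i] / weights[i]) / denom.
    """
    gap = target - sum(c * x for c, x in zip(coeffs, initial))
    denom = sum(c * c / w for c, w in zip(coeffs, weights))
    return [x + gap * (c / w) / denom
            for x, c, w in zip(initial, coeffs, weights)]


# Hypothetical example: four survey-based quarterly figures that must add up
# to a register-based annual figure of 100.
quarters = [24.0, 26.0, 25.0, 27.0]   # initial estimates, summing to 102
ones = [1.0] * 4                      # constraint: the plain sum of quarters
annual = 100.0

# Equal weights spread the discrepancy evenly over the quarters.
additive = reconcile(quarters, [1.0] * 4, ones, annual)

# Weights inversely proportional to the initial values penalize relative
# change, which reproduces a pro-rata adjustment.
pro_rata = reconcile(quarters, [1.0 / q for q in quarters], ones, annual)
```

The choice of weights encodes how much each initial estimate is trusted: a larger weight makes the corresponding figure more costly to move, so more of the discrepancy is absorbed by the other figures.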
Mushkudiani, Pannekoek, and Zhang (2016) develop scalar uncertainty measures of macro accounts to replace, say, the entire variance–covariance matrix of all the estimates involved. Devising simple summary measures of statistical uncertainty for an accounting system such as the System of National Accounts can be helpful in at least two respects: (i) it can inform the choice among alternative adjustment methods that seem equally viable to start with, and (ii) it can identify and assess the changes, or potential improvements, that are most effective in terms of the final estimated account directly. Implementation of the approach to the System of National Accounts is currently under development.
1.4 Summary
In this chapter, we have provided an overview of the uses of proxy variables when combining register and survey data. The nature of the proxy variables discussed is such that they are not the subject of data editing methods, even when they can be considered to have arisen from some kind of measurement error. The presence of proxy variables is a characteristic feature of settings involving data from multiple sources, because in a single-source setting proxy variables can be eliminated by design. The various instances discussed in Section 1.2 demonstrate the ubiquitous presence of proxy variables in multisource statistics. Sometimes proxy variables raise challenges because the conflict between them needs to be resolved; sometimes they are a blessing – indeed, statistics may be impossible without them, as in the case of capture–recapture methods for population size estimation. Either way, they always represent potentially useful sources of statistical information. We believe that the appropriate conceptualization, treatment, and usage of proxy variables provide a wide-ranging perspective, which enables one to draw on insights and experiences from diverse problems. An important theme for future research is the assessment of statistical uncertainty associated with indirect estimation based on unlinked data (Table 1.1). Several methods are mentioned in