Resampling methods have a long and honorable history. Survey data provide an ideal context for using resampling methods to approximate the sampling distribution of statistics, owing both to generally large sample sizes and to data of typically good quality.
1. Introduction
Generalities
Resampling methods have a long and honorable history, going back at least to the seminal paper by [1].
In essence, virtually all resampling methodologies used in sampling from finite populations are based on the idea of accounting for the effect of the sampling design. The main effect of the sampling design is that the data cannot generally be assumed independent and identically distributed (i.i.d.). A large portion of the literature on resampling from finite populations focuses on estimating the variance of estimators. The main approaches are essentially the ad hoc approach and the plug-in approach.
The basic idea of the ad hoc approach consists in maintaining Efron’s bootstrap as a resampling procedure while properly rescaling the data in order to account for the dependence among units. This approach is used, among others, in [3,4], where the resampled data produced by the “usual” i.i.d. bootstrap are properly rescaled, as well as in [5,6]; cf. also the review in [7]. In [8], a “rescaled bootstrap process” based on asymptotic arguments is proposed. Among the ad hoc approaches, we also classify [9] (based on a rescaling of weights) and the “direct bootstrap” of [10].
Almost all ad hoc resampling techniques rest on the same justification: in the case of linear statistics, the first two moments of the resampled statistic should match (at least approximately) the corresponding estimators; cf., among others, [10]. Cf. also [9], where an analysis in terms of the first three moments is performed for Poisson sampling.
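As a minimal illustration of this moment-matching argument (the notation below is introduced here and is not taken from the excerpt), consider simple random sampling without replacement with sampling fraction f = n/N and a rescaling of the Rao–Wu type: if y_1^*, ..., y_m^* is an i.i.d. with-replacement resample from the original sample and the resampled values are rescaled as
$$
\tilde{y}_i = \bar{y} + \sqrt{\frac{m(1-f)}{n-1}}\,\bigl(y_i^{*} - \bar{y}\bigr), \qquad i = 1, \dots, m,
$$
then the rescaled bootstrap mean $\tilde{\bar{y}} = m^{-1}\sum_{i=1}^{m}\tilde{y}_i$ satisfies
$$
\mathrm{E}_{*}\bigl(\tilde{\bar{y}}\bigr) = \bar{y}, \qquad \mathrm{Var}_{*}\bigl(\tilde{\bar{y}}\bigr) = (1-f)\,\frac{s^{2}}{n},
$$
where $s^{2}$ is the sample variance: the first two bootstrap moments reproduce the sample mean and its customary unbiased variance estimator under this design.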
2. Accounting for the Sampling Design in Resampling: The Pseudo-Population Approach
Among the several techniques that aim at accounting for the sampling design in resampling from finite populations, we consider here the approach based on pseudo-populations. The idea of a pseudo-population goes back, at least, to [11], in the case of median estimation, essentially under simple random sampling (srs) when the population size is a multiple of the sample size. Rather similar ideas appear in [12] for srs, again under the condition that the ratio between population size and sample size is an integer, and in [13] for stratified random sampling. A major step forward is the paper by [14], where the construction of a pseudo-population is studied under a general πps sampling design with general first-order inclusion probabilities. In [19], a different approach to the construction of a pseudo-population, very interesting in many respects, is considered.
The pseudo-population approach to resampling can be considered as a two-phase procedure. In the first phase, a pseudo-population (roughly speaking, a prediction of the population) is constructed. In the second phase, a (bootstrap) sample is drawn from the pseudo-population. Broadly speaking, this approach parallels the plug-in principle by Efron.
The pseudo-population is plugged into the sampling process and used as a “surrogate” of the actual finite population. In the second phase, a sample is drawn from the pseudo-population according to a sampling design that mimics the original one. In this view, the pseudo-population mimics the real population, and the (re)sampling process from the pseudo-population mimics the (original) sampling process from the real population.
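As a purely illustrative sketch of the two phases (not a specific published algorithm): the randomized rounding of the Horvitz–Thompson weights, the use of simple random sampling without replacement in the second phase, and all data below are simplifying assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(12345)

def pseudo_population(y, pi):
    """Phase 1: predict the population by replicating each sample unit.

    Unit i is replicated floor(1/pi_i) times, plus one extra copy with
    probability equal to the fractional part of 1/pi_i (randomized rounding,
    used here only as a simple device to obtain integer replication counts).
    """
    w = 1.0 / pi                              # Horvitz-Thompson weights
    base = np.floor(w).astype(int)
    extra = rng.random(len(w)) < (w - base)   # stochastic rounding of the fractional parts
    return np.repeat(y, base + extra)

def pp_bootstrap_mean(y, pi, B=1000):
    """Phase 2: resample from the pseudo-population, mimicking the original design.

    Simple random sampling without replacement of size n is assumed here purely
    for illustration; a real implementation mimics the actual sampling design.
    """
    n = len(y)
    boot = np.empty(B)
    for b in range(B):
        pseudo = pseudo_population(y, pi)                     # phase 1 (regenerated at each replicate)
        resample = rng.choice(pseudo, size=n, replace=False)  # phase 2
        boot[b] = resample.mean()
    return boot

# Hypothetical sample: values and first-order inclusion probabilities
y  = np.array([3.2, 1.8, 4.5, 2.9, 3.7, 5.1, 2.2, 4.0])
pi = np.array([0.04, 0.02, 0.05, 0.03, 0.04, 0.06, 0.02, 0.05])

boot_means = pp_bootstrap_mean(y, pi, B=2000)
print("bootstrap standard error of the mean:", boot_means.std(ddof=1))
```

Whether the pseudo-population is regenerated at every replicate, as in this sketch, or built once and kept fixed is itself a choice that differs across proposals in the literature.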
2.1. Resampling from Pseudo-Populations
Resampling based on pseudo-populations closely parallels Efron’s bootstrap for i.i.d. observations. The basic ideas are relatively simple, once the problem is approached in terms of an appropriate estimator of the finite population distribution function (f.p.d.f.).
2.2. Resampling Based on Pseudo-Populations: Basic Results for Descriptive Inference
The main theoretical justification for resampling based on pseudo-populations is of an asymptotic nature, similar, in many respects, to the results in [17] for Efron’s bootstrap.
3. Computational Issues
Use of the pseudo-population approach, despite its many theoretical merits, is held back by its computational complexity. Real populations could contain millions of people, and thus the construction of a pseudo-population could be computationally cumbersome. For this reason, it is of primary interest to develop shortcuts that, while possessing the fundamental theoretical properties described in the above sections, are computationally simple to implement because they avoid the physical construction of the pseudo-population.
The above points are thoroughly discussed in [26], where the problem of resampling for finite populations is addressed as a problem of sampling with replacement, with different drawing probabilities, directly from the sample data (henceforth, the original sample).
An attempt to avoid the complications related to integer-valued N_i^*'s is in [27], where non-integer N_i^*'s are allowed via the Horvitz–Thompson-based bootstrap (HTB) method. However, unless the sampling fraction n/N tends to 0 as N and n increase, HTB does not generally possess the good asymptotic properties outlined in the previous sections.
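For concreteness (a reconstruction under common conventions, since the notation is not defined in this excerpt), N_i^* denotes the number of replicates of sample unit i in the pseudo-population, and a Horvitz–Thompson-type construction takes
$$
N_i^{*} = \frac{1}{\pi_i}, \qquad \hat{N} = \sum_{i \in s} \frac{1}{\pi_i},
$$
where π_i is the first-order inclusion probability of unit i and N̂ is the implied pseudo-population size; both are in general non-integer, which is precisely the source of the complications mentioned above.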
An interesting computational shortcut is in [28], where the pseudo-population (again with possibly non-integer N_i^*'s) is only implicitly used, and a computational scheme based on drawings with replacement from the original sample is proposed. Unfortunately, although the main idea behind that paper is interesting, the proposed bootstrap method fails to possess good asymptotic properties.
Computational shortcuts based on ideas similar to those in [28], but relying on correct approximations of the first-order inclusion probabilities, were developed in [29] for descriptive, design-based inference. In particular, in that paper, methodologies based on drawings with replacement from the original sample were proposed, and their merits, from both a theoretical and a computational point of view, were studied.
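To convey the flavor of such with-replacement shortcuts, a generic sketch follows; it is not the specific scheme of [29], and both the drawing probabilities (here simply proportional to the design weights) and the bootstrap statistic are placeholders, since in an actual method they are dictated by the approximations of the inclusion probabilities being used.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical original sample: values and first-order inclusion probabilities
y  = np.array([3.2, 1.8, 4.5, 2.9, 3.7, 5.1, 2.2, 4.0])
pi = np.array([0.04, 0.02, 0.05, 0.03, 0.04, 0.06, 0.02, 0.05])
n, B = len(y), 2000

# Placeholder drawing probabilities, proportional to the design weights 1/pi_i;
# a specific method derives them from (approximations of) the inclusion probabilities.
p = (1.0 / pi) / (1.0 / pi).sum()

# Draw B resamples of size n with replacement directly from the original sample:
# no pseudo-population is ever constructed, and only selection counts are needed.
idx = rng.choice(n, size=(B, n), replace=True, p=p)
counts = np.stack([np.bincount(row, minlength=n) for row in idx])

# Placeholder bootstrap statistic computed from the counts (a weighted sample mean);
# the appropriate estimator depends on the particular resampling scheme.
boot_stats = (counts * y).sum(axis=1) / counts.sum(axis=1)
print("bootstrap standard error:", boot_stats.std(ddof=1))
```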
As remarked by a referee, another drawback of the pseudo-population approach is the apparent need to generate and save a large number of bootstrap sample files. However, it is not necessary to save all the bootstrap sample files: only the original sample file must be saved, along with two additional variables for each bootstrap replicate, one containing the number of times each sample unit is used to create the pseudo-population and another containing the number of times each sample unit has been selected in the bootstrap sample. In other words, the approach can be implemented similarly to methods that rescale the sampling weights.
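A minimal sketch of this bookkeeping (hypothetical data; the pseudo-population construction and the resampling design are the same simplifying assumptions used in the earlier sketch): for each replicate, only two length-n count vectors are stored alongside the original sample file.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical original sample (the only file that needs to be kept on disk)
y  = np.array([3.2, 1.8, 4.5, 2.9, 3.7, 5.1, 2.2, 4.0])
pi = np.array([0.04, 0.02, 0.05, 0.03, 0.04, 0.06, 0.02, 0.05])
n, B = len(y), 500

m_counts = np.zeros((B, n), dtype=int)  # times each unit is used to create the pseudo-population
k_counts = np.zeros((B, n), dtype=int)  # times each unit is selected in the bootstrap sample

for b in range(B):
    w = 1.0 / pi
    m = np.floor(w).astype(int) + (rng.random(n) < (w - np.floor(w)))  # randomized rounding
    labels = np.repeat(np.arange(n), m)                  # pseudo-population held as unit labels only
    picked = rng.choice(labels, size=n, replace=False)   # resample (SRSWOR assumed for illustration)
    m_counts[b] = m
    k_counts[b] = np.bincount(picked, minlength=n)

# Any bootstrap statistic can later be recomputed from (y, m_counts, k_counts) alone,
# e.g. the bootstrap sample means:
boot_means = (k_counts * y).sum(axis=1) / k_counts.sum(axis=1)
```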
4. Open Problems and Final Considerations
The pseudo-population approach, despite its merits, requires further development from both the theoretical and the computational perspectives. From a theoretical point of view, the results obtained thus far only refer to non-informative single-stage designs. The consideration of multi-stage designs appears to be a necessary development, as does the treatment of non-respondent units.
Again, from a theoretical perspective, a major issue is the development of theoretically sound resampling methodologies for informative sampling designs. The major drawback is that, apart from the exception of adaptive designs (cf. [30] and the references therein), first-order inclusion probabilities can rarely be computed, as they might depend on unobserved quantities. This is what happens, for instance, with most of the network sampling designs that are actually used for hidden populations, where the inclusion probabilities are unknown and depend on unobserved/unknown network links (cf. [30,31] and the references therein).
From a computational point of view, as indicated earlier, the computational shortcuts developed thus far only work in the case of descriptive inference. The development of theoretically well-founded computational schemes valid for analytic inference is an important issue that deserves further attention.
This entry is adapted from the peer-reviewed paper 10.3390/stats5010016