Resampling methods have a long and honorable history. In principle, survey data are an ideal context in which to apply resampling methods to approximate the (unknown) sampling distribution of statistics, due to both a usually large sample size and data of typically good, controlled quality. However, survey data cannot generally be assumed independent and identically distributed (i.i.d.), so that any resampling methodology to be used in sampling from finite populations must be adapted to account for the sampling design effect. A principled appraisal is given and discussed here.
Resampling methods have a long and honorable history, going back at least to Efron's seminal paper in the late 70s [1]. In extreme synthesis, virtually all resampling methodologies used in sampling from finite populations are based on the idea of accounting for the effect of the sampling design. In fact, the main effect of the sampling design is that data cannot generally be assumed independent and identically distributed (i.i.d.).
The main approaches are essentially two: the ad hoc approach and the plug-in approach. The basic idea of the ad hoc approach consists in maintaining Efron's bootstrap as a resampling procedure, but in properly rescaling the data in order to account for the dependence among units. This approach is used, among others, in [2][3], where the resampled data produced by the "usual" i.i.d. bootstrap are properly rescaled, as well as in [4][5]; cf. also the review in [6]. In [7] a "rescaled bootstrap process" based on asymptotic arguments is proposed. Among the ad hoc approaches, we also classify [8] (based on a rescaling of weights) and the "direct bootstrap" by [9]. Almost all ad hoc resampling techniques for finite populations are based on the same justification: in the case of linear statistics, the first two moments of the resampled statistic should match (at least approximately) the corresponding estimators; cf., among others, [9]. Cf. also [8], where an analysis in terms of the first three moments is performed for Poisson sampling.
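To make the moment-matching justification concrete, the rescaling idea can be sketched for simple random sampling without replacement: units are resampled i.i.d. as in the plain Efron bootstrap, then shrunk toward the sample mean so that the bootstrap variance of the mean reproduces the usual variance estimator. This is a minimal illustrative sketch, not the exact procedure of any of the cited papers; the resample size m = n - 1 is one common convention.

```python
import random
import statistics

def rescaled_bootstrap_means(y, N, B=4000, seed=1):
    """Rescaled bootstrap for the sample mean under SRSWOR.

    Resampled values are rescaled around the sample mean so that the
    bootstrap variance of the mean matches (1 - f) * s^2 / n.
    """
    rng = random.Random(seed)
    n = len(y)
    f = n / N                                  # sampling fraction
    m = n - 1                                  # a common choice of resample size
    ybar = statistics.fmean(y)
    scale = (m * (1.0 - f) / (n - 1)) ** 0.5   # calibrates the first two moments
    means = []
    for _ in range(B):
        ystar = rng.choices(y, k=m)            # "usual" i.i.d. bootstrap draw
        means.append(statistics.fmean(ybar + scale * (v - ybar) for v in ystar))
    return means
```

For any resample size m the factor `scale` makes the bootstrap variance of the mean equal to the standard estimator (1 - f)s²/n, which is the moment-matching property mentioned above.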
Here the second approach, based on pseudo-populations, is considered. The main reasons behind this choice are: i) resampling based on pseudo-populations actually parallels Efron's bootstrap for i.i.d. observations; ii) the basic ideas are relatively simple to understand and to apply once the problem is approached in terms of an appropriate estimator of the finite population distribution function (f.p.d.f.); and iii) the main theoretical justification for resampling based on pseudo-populations is of an asymptotic nature, similar in many respects to the well-known Bickel-Freedman results [10] for Efron's bootstrap.
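The basic mechanics of the pseudo-population approach can be sketched for Poisson sampling with known inclusion probabilities: each sampled unit is replicated roughly 1/π_i times to form a pseudo-population, and bootstrap samples are then redrawn from it under the original design. The following is a minimal hypothetical sketch (function names are ours), not the exact scheme of any specific paper.

```python
import random

def make_pseudo_population(sample, rng):
    """Replicate each sampled unit (y, pi) about 1/pi times, with
    stochastic rounding so the expected number of copies is exactly 1/pi."""
    pseudo = []
    for y, pi in sample:
        reps = int(1.0 / pi)
        if rng.random() < 1.0 / pi - reps:
            reps += 1
        pseudo.extend([(y, pi)] * reps)
    return pseudo

def pseudo_population_bootstrap(sample, B=2000, seed=0):
    """Bootstrap distribution of the Horvitz-Thompson total: rebuild a
    pseudo-population and redraw a Poisson sample from it, B times."""
    rng = random.Random(seed)
    totals = []
    for _ in range(B):
        pseudo = make_pseudo_population(sample, rng)
        # Redraw with the original design: each pseudo-unit enters the
        # bootstrap sample independently with its own probability pi.
        boot = [(y, pi) for y, pi in pseudo if rng.random() < pi]
        totals.append(sum(y / pi for y, pi in boot))
    return totals
```

The spread of the returned totals approximates the sampling distribution of the Horvitz-Thompson estimator under the original design, which is exactly how Efron's bootstrap is paralleled here.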
Another practical drawback related to the pseudo-population approach is the seeming necessity of generating and storing a large number of bootstrap sample files. However, it is not necessary to save all the bootstrap sample files. Only the original sample file needs to be saved, along with two additional variables for each bootstrap replicate: one containing the number of times each sample unit is used to create the pseudo-population, and another containing the number of times each sample unit has been selected in the bootstrap sample. In other words, the approach can be implemented similarly to methods that rescale the sampling weights.
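The bookkeeping just described can be made concrete: rather than saving B bootstrap sample files, one keeps the original sample plus two integer count vectors per replicate, from which any replicate statistic can be recomputed. A hypothetical minimal sketch, again for Poisson sampling (variable and function names are ours):

```python
import random

def bootstrap_count_variables(pis, B=1000, seed=0):
    """For each replicate, store only two count vectors:
    (a) copies of each unit in the pseudo-population,
    (b) times each unit is selected in the bootstrap sample."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        pseudo_counts, sel_counts = [], []
        for pi in pis:
            reps = int(1.0 / pi)
            if rng.random() < 1.0 / pi - reps:  # stochastic rounding
                reps += 1
            pseudo_counts.append(reps)
            # Poisson redraw: each of the reps copies is kept with prob pi.
            sel_counts.append(sum(rng.random() < pi for _ in range(reps)))
        replicates.append((pseudo_counts, sel_counts))
    return replicates

def replicate_total(y, pis, sel_counts):
    """Recompute one replicate's Horvitz-Thompson total from stored counts."""
    return sum(c * yi / pi for yi, pi, c in zip(y, pis, sel_counts))
```

Storage is thus linear in the sample size per replicate, exactly as with methods that store rescaled sampling weights.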
The pseudo-population approach, despite its merits, requires further development from both the theoretical and computational perspectives.
From a theoretical point of view, the results obtained thus far only refer to non-informative single-stage designs. The consideration of multi-stage designs appears to be a necessary development, as does the consideration of non-respondent units. Again from a theoretical perspective, a major issue calling for more research is the development of theoretically sound resampling methodologies for informative sampling designs. The main drawback is that, apart from the exception of adaptive designs (cf. [21] and the references therein), first-order inclusion probabilities can rarely be computed, as they might depend on unobserved quantities. This is what happens, for instance, with most of the network sampling designs actually used for hidden populations, where the inclusion probabilities are unknown and depend on unobserved/unknown network links (cf. [21][22] and the references therein). From a computational point of view, the shortcuts developed thus far only apply to the case of descriptive inference. The development of theoretically well-founded computational schemes valid for analytic inference is an important issue that deserves further attention.