One of the pillars of experimental science is sampling: based on analyses conducted on samples, estimations for the populations are made. Distributions split into two main groups, continuous and discrete, and the present study applies to the continuous ones. One of the challenges of sampling is its accuracy or, in other words, how representative the sample is of the population from which it was drawn. Another challenge, connected with this one, is the presence of outliers: observations wrongly collected, not actually belonging to the population subjected to study. The present study proposes a statistic (and a test) intended to be used for any continuous distribution to detect outliers, by constructing the confidence interval for the extreme value in the sample, at a certain (preselected) risk of being in error, and depending on the sample size. The proposed statistic is operational for known distributions (those with a known probability density function) and is also dependent on the statistical parameters of the population.
Many statistical techniques are sensitive to the presence of outliers. All calculations, including the mean and standard deviation, may be distorted by a single grossly inaccurate data point; checking for outliers should therefore be a routine part of any data analysis.
Several tests have been developed to date for the purpose of identifying outliers of certain distributions. Most of the studies are connected with the Normal (or Gauss) distribution (Gauss, 1809)[1]. Probably the first paper to attract attention to this matter is (Tippett, 1925)[2], followed by the derivation of the distribution of the extreme values in samples taken from the Normal distribution (Fisher & Tippett, 1928)[3]. Later, a series of tests was developed - probably the first being (Thompson, 1935)[4] - subjected to evaluation (Pearson & Sekar, 1936)[5], and revised (Grubbs, 1950)[6], (Grubbs, 1969)[7].
For other distributions, such as the Gamma distribution, procedures for detecting outliers were proposed (Nooghabi, et al., 2010)[8], revised (Kumar & Lalitha, 2012)[9], and unfortunately proved inefficient (Lucini & Frery, 2017)[10].
The first attempt to generalize the criterion for detecting outliers to any distribution is (Hartley, 1942)[11], but research on this subject is unfortunately very scarce; a notable recent attempt is (Bardet & Dimby, 2017)[12].
In (Jäntschi, 2019)[13] a method is proposed for constructing the confidence intervals for the extreme values of any continuous distribution for which the cumulative distribution function is also obtainable. The method has, as a direct application, a simple test for detecting outliers. The proposed method is based on deriving the statistic for the extreme values of the uniform distribution.
When a sample of data is tested under the null hypothesis that it follows a certain distribution, it is intrinsically assumed that the distribution is known. The usual assumption is that we possess its probability density function (PDF; for a continuous distribution).
When the PDF is (possibly intrinsically) known, its (statistical) parameters need not be known, and here a complex problem of estimating the parameters of the (population) distribution from the sample can be (re)opened.
The estimation of the parameters of the distribution of the data is, in general, biased by the presence of outliers in the data; thus, identifying the outliers along with estimating the parameters of the distribution is a difficult task operating on two statistical hypotheses.
Taking the general case, for (x1, …, xn) as n independent draws (or observations) from an (assumed known) continuous distribution defined by its probability density function PDF(x; (πj)1≤j≤m), where (πj)1≤j≤m are the (assumed unknown) m statistical parameters of the distribution, by way of integration over the (assumed known) domain ("D") of the distribution we gain access to the associated cumulative distribution function ("CDF") CDF(x; (πj)1≤j≤m; "PDF"), expressed as (eq.1):

CDF(x; (πj)1≤j≤m; "PDF") = ∫[inf(D), x] PDF(t; (πj)1≤j≤m) dt (eq.1)

where inf(D) is used in disfavor of min(D) to include unbounded domains (e.g. when inf(D) = -∞). Please note that having the PDF and the CDF does not necessarily imply that we have an explicit formula (or expression) for either of them; with access to numerical integration methods (Davis & Rabinowitz, 1975)[14] it is enough to have the possibility to evaluate them at any point (x).
Unlike PDF(x), CDF(x) is a bijective function and is therefore always invertible (even if we do not have an explicit formula for it); let "InvCDF" be its inverse (eq.2):

InvCDF(p; (πj)1≤j≤m; "PDF") = x, where p = CDF(x; (πj)1≤j≤m; "PDF"), p ∈ (0, 1) (eq.2)
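Even without explicit formulas, both eq.1 and eq.2 can be evaluated numerically. The following is a minimal sketch (in Python, assuming NumPy and SciPy are available; the standard normal PDF is only a hypothetical example, as the method is independent of this choice):

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    def make_cdf(pdf, inf_d):
        # eq.1: evaluate the CDF at x by numerical integration of the PDF
        def cdf(x):
            value, _error = quad(pdf, inf_d, x)
            return value
        return cdf

    def make_invcdf(cdf, lo, hi):
        # eq.2: invert the CDF by root finding (the CDF is monotone, so brentq applies)
        def invcdf(p):
            return brentq(lambda x: cdf(x) - p, lo, hi)
        return invcdf

    # Hypothetical example: the standard normal PDF
    pdf = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    cdf = make_cdf(pdf, -np.inf)        # quad accepts -inf as lower limit, i.e. inf(D) = -inf
    invcdf = make_invcdf(cdf, -10, 10)  # the bracket must contain the solution
    print(cdf(1.96), invcdf(0.975))     # approx. 0.975 and approx. 1.96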
CDF(x; (πj)1≤j≤m; "PDF") is a strong tool that greatly simplifies the problem at hand: the problem of analyzing any distribution function (PDF) is translated into analyzing only one (the continuous uniform distribution) when the series of observed data (xi)1≤i≤n is expressed through the associated probabilities pi = CDF(xi) (for 1 ≤ i ≤ n), and the analysis can be conducted on the (pi)1≤i≤n series instead (eq.3):

pi = CDF(xi; (πj)1≤j≤m; "PDF"), 1 ≤ i ≤ n (eq.3)
A statistic called g1 was developed to be applicable to any distribution. For a series of probabilities (pi)1≤i≤n, the (ri)1≤i≤n differences (the departures of the probabilities from the median of the continuous uniform distribution, 0.5) are calculated (eq.4):

ri = |pi - 0.5|, 1 ≤ i ≤ n (eq.4)
The g1 statistic is then calculated as the largest of these departures (eq.5):

g1 = max{ri : 1 ≤ i ≤ n} (eq.5)
The g1 statistic has a very simple calculation formula (see eq.5) and, as expected, its CDF formula is also very simple (see eq.6). Thus, for a calculated sample statistic g1 (x ← g1 in eq.5), the significance level (α ← 1-p) is immediate (eq.6):

CDFg1(x; n) = (2x)^n, for 0 ≤ x ≤ 0.5 (eq.6)
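A minimal sketch of eqs.4-6 follows (in Python with NumPy; the probability series p is a hypothetical example standing for the output of eq.3):

    import numpy as np

    def g1_statistic(p):
        # eqs.4-5: the largest departure of the probabilities from 0.5
        return np.max(np.abs(np.asarray(p) - 0.5))

    def g1_cdf(x, n):
        # eq.6: the CDF of the g1 statistic for a sample of size n
        return (2.0 * x) ** n

    # Hypothetical probability series; its significance level is alpha = 1 - CDFg1(g1; n)
    p = [0.02, 0.31, 0.55, 0.78, 0.9993]
    g1 = g1_statistic(p)
    alpha = 1.0 - g1_cdf(g1, len(p))
    print(g1, alpha)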
Regarding the g1 statistic, a variable distributed by the g1 statistic (see eq.6) takes values between 0 and 0.5, with the mode at p = 0 (a vertical asymptote at p = 0), a median of n^(-1)·2^(-1/n) (and with a left asymmetry decreasing as n increases, converging - for n → ∞ - to symmetry), and a mean of 1/(2(n+1)).
The expression of CDFg1 is easy to invert (eq.7):

InvCDFg1(p; n) = p^(1/n)/2 (eq.7)
Eq.7 can be used to calculate critical values of the g1 statistic for any values of α (α ← 1-p) and n. The critical values of the g1 statistic act as the boundaries of the confidence intervals.
For convenience, the figure (from [13]) represents the value of the estimation error at each observation point (999 points, corresponding to p = 0.001 to p = 0.999, for each n from 2 to 12) from a MC simulation intended to prove the connection between eq.5 and eq.6.
Figure. Departures between expected and observed probabilities for g1 statistic (eq.5 vs. eq.6)
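Such a Monte Carlo check can be sketched as follows (in Python with NumPy, using the reconstructions of eqs.4-7 given above; the values of n and the number of runs are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(42)
    n, runs = 6, 100_000

    # 'runs' samples of size n from the continuous uniform distribution
    p = rng.random((runs, n))
    g1 = np.max(np.abs(p - 0.5), axis=1)   # eqs.4-5, applied row-wise

    # Compare observed and expected probabilities at a few grid points
    for prob in (0.1, 0.25, 0.5, 0.75, 0.9):
        x = prob ** (1 / n) / 2            # InvCDFg1(prob; n), eq.7
        observed = (g1 <= x).mean()        # empirical estimate of CDFg1(x; n)
        print(prob, observed)              # the departure shrinks as 'runs' grows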
By setting the risk of being in error α (usually at 5%), then p = 1-α and eq.7 can be used to calculate the statistic associated with it (InvCDFg1(1-α; n)). By replacing this value into eqs.5 & 6 the (extreme) probabilities can be extracted (eq.8):

plow = 0.5 - InvCDFg1(1-α; n), phigh = 0.5 + InvCDFg1(1-α; n) (eq.8)
In order to arrive at the confidence interval for the extreme values in the sampled data (eq.9), it is necessary to use (again) the inverse of the CDF, this time for the distribution of the sampled data:

CI = [InvCDF(plow; (πj)1≤j≤m; "PDF"), InvCDF(phigh; (πj)1≤j≤m; "PDF")] (eq.9)
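A minimal sketch of eqs.7-9 (in Python with SciPy; the normal distribution with hypothetical parameters is used here only as an example of the data distribution):

    from scipy.stats import norm

    def extreme_probabilities(alpha, n):
        # eq.8: boundary probabilities for the extreme values at risk alpha
        g1_crit = (1.0 - alpha) ** (1.0 / n) / 2.0   # InvCDFg1(1-alpha; n), eq.7
        return 0.5 - g1_crit, 0.5 + g1_crit

    def extremes_confidence_interval(alpha, n, invcdf):
        # eq.9: confidence interval for the extreme values in the sampled data
        p_low, p_high = extreme_probabilities(alpha, n)
        return invcdf(p_low), invcdf(p_high)

    # Example: n = 206 observations assumed normal with hypothetical mean 6.5 and sd 0.8
    low, high = extremes_confidence_interval(0.05, 206,
                                             lambda p: norm.ppf(p, loc=6.5, scale=0.8))
    print(low, high)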
To illustrate arriving at the confidence intervals for the extreme values in the sampled data, and the use of the statistic as a test detecting outliers, two examples are given: the first is based on Table 4 from [13], and the second on Table 11 from [13].
The same data were tested under the assumption that they follow a generalized Gauss-Laplace distribution (eq.10) and a normal (Gauss) distribution (eq.11).
The (sorted) sample of data is: {4.151, 4.401, 4.421, 4.601, 4.941, 5.021, 5.023, 5.150, 5.180, 5.295, 5.301, 5.311, 5.311, 5.335, 5.343, 5.404, 5.421, 5.447, 5.452, 5.452, 5.481, 5.504, 5.517, 5.537, 5.537, 5.551, 5.561, 5.572, 5.577, 5.577, 5.627, 5.637, 5.637, 5.667, 5.667, 5.671, 5.677, 5.677, 5.691, 5.717, 5.743, 5.751, 5.757, 5.761, 5.767, 5.767, 5.787, 5.811, 5.817, 5.827, 5.867, 5.897, 5.897, 5.904, 5.943, 5.957, 5.957, 5.987, 6.041, 6.047, 6.047, 6.047, 6.057, 6.077, 6.091, 6.111, 6.117, 6.117, 6.137, 6.137, 6.137, 6.137, 6.137, 6.142, 6.167, 6.177, 6.177, 6.177, 6.204, 6.207, 6.221, 6.227, 6.227, 6.231, 6.237, 6.257, 6.267, 6.267, 6.267, 6.291, 6.304, 6.327, 6.357, 6.357, 6.367, 6.367, 6.371, 6.427, 6.457, 6.467, 6.487, 6.497, 6.511, 6.517, 6.517, 6.523, 6.532, 6.547, 6.583, 6.587, 6.587, 6.587, 6.607, 6.611, 6.647, 6.647, 6.647, 6.647, 6.647, 6.657, 6.657, 6.671, 6.671, 6.677, 6.677, 6.677, 6.697, 6.704, 6.717, 6.717, 6.737, 6.737, 6.737, 6.747, 6.767, 6.767, 6.767, 6.797, 6.827, 6.857, 6.867, 6.897, 6.897, 6.937, 6.937, 6.957, 6.961, 6.997, 7.027, 7.027, 7.027, 7.057, 7.071, 7.087, 7.087, 7.117, 7.117, 7.117, 7.121, 7.123, 7.147, 7.151, 7.177, 7.177, 7.187, 7.187, 7.207, 7.207, 7.207, 7.211, 7.247, 7.247, 7.277, 7.277, 7.277, 7.281, 7.304, 7.307, 7.307, 7.321, 7.337, 7.367, 7.391, 7.427, 7.441, 7.467, 7.516, 7.527, 7.527, 7.557, 7.567, 7.592, 7.627, 7.627, 7.657, 7.657, 7.717, 7.747, 7.751, 7.933, 8.007, 8.164, 8.423, 8.683, 9.143, 9.603}. The sample size is n = 206.
The MLE estimates for the population parameters are:
The greatest departure from the median (0.5) is for 9.603 in both cases:
For the sample size (n = 206), at α = 5% risk of being in error, the g1 statistic detects an outlier if its associated probability departs beyond {0.000124483, 0.9998755} (see eq.8), and thus the confidence interval for the extreme values, at 5% risk of being in error, for the sample having n = 206 values is [0.000124483, 0.9998755]. With the above given results, at 5% risk of being in error, 9.603 is an outlier for the Gauss (normal) distribution (0.999918 > 0.9998755) and it is not an outlier for the generalized Gauss-Laplace distribution (0.000124483 < 0.999804 < 0.9998755).
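The example can be replicated with the following sketch (in Python with NumPy/SciPy, fitting only the normal distribution by MLE; the short array is an illustrative placeholder and should be replaced with the full n = 206 sample listed above):

    import numpy as np
    from scipy.stats import norm

    # Placeholder subset; use the full sorted sample of n = 206 values given above
    data = np.array([4.151, 4.401, 5.021, 5.295, 5.637, 6.137, 6.647,
                     7.027, 7.307, 7.627, 8.007, 9.603])
    n = len(data)

    mu, sigma = data.mean(), data.std()    # MLE estimates for the normal distribution
    p = norm.cdf(data, mu, sigma)          # eq.3: probabilities of the observations

    alpha = 0.05
    g1_crit = (1 - alpha) ** (1 / n) / 2   # InvCDFg1(1-alpha; n), eq.7
    p_low, p_high = 0.5 - g1_crit, 0.5 + g1_crit   # eq.8; for n = 206: {0.000124483, 0.9998755}

    outliers = data[(p < p_low) | (p > p_high)]
    print(p_low, p_high, outliers)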
7. Conclusion
The extreme values statistic g1 provides a simple method for detecting outliers. The method is applicable to any continuous distribution at any risk of being in error.