What are outliers in statistics?



An outlier is an observation that lies far from the other observations. Outliers can arise from natural variability, but they can also indicate an experimental error.

Outliers can occur by chance in any distribution and therefore in any data set. Nevertheless, they often indicate errors in the execution or measurement of an experiment. Alternatively, outliers are common in heavy-tailed distributions. In the case of measurement errors in particular, we want to exclude the affected observation from the analysis. In the case of heavy-tailed distributions, one should instead check whether the methods used (which often assume a normal distribution) still yield valid conclusions given the high degree of skewness. Outliers also often occur when two different distributions (and thus two heterogeneous populations) are mixed and analyzed together.

In larger data collections, some data points will lie further from the mean than is considered acceptable. This can be due to a systematic error, to a flawed theory that made incorrect assumptions about the distribution of the data, or simply to the fact that some observations lie further out than others. Outliers can therefore indicate corrupt data, faulty procedures, or areas in which the a priori theory breaks down. However, the probability of obtaining “natural” outliers increases with the size of the sample, so a few extreme values in a large sample are to be expected and need not signal a problem.
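The effect of sample size can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the data are normally distributed and that an "outlier" means a value more than three standard deviations from the mean. Neither assumption comes from the text above, and the function names are hypothetical.

```python
import math


def tail_probability(k: float) -> float:
    """Probability that a standard normal value lies more than k
    standard deviations from the mean (both tails combined)."""
    return math.erfc(k / math.sqrt(2))


def prob_at_least_one(n: int, k: float = 3.0) -> float:
    """Probability that at least one of n independent normal
    observations lies more than k standard deviations out."""
    p = tail_probability(k)
    return 1.0 - (1.0 - p) ** n


# Assumed threshold: 3 standard deviations (an illustrative convention).
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: P(at least one extreme value) = {prob_at_least_one(n):.3f}")
```

Under these assumptions a single observation is extreme in well under one percent of cases, yet in a sample of a thousand observations at least one such value is more likely than not.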

The minimum or maximum of a data set can be an outlier, but neither is necessarily one. Whether it is depends on how far the minimum or maximum lies from the rest of the data. (In statistics programs, missing data is often coded as 999 or a similar value. It must be ensured that the program correctly treats such a code as a missing value rather than as a real observation.)
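As a minimal sketch of the missing-value problem, the following example shows what happens when a code of 999 is left in the data. The sample values, the constant `MISSING_CODE`, and the helper `drop_missing` are all made up for illustration; a real statistics package will have its own way of declaring missing-value codes.

```python
import statistics

MISSING_CODE = 999  # assumed sentinel for "no value recorded"


def drop_missing(values, missing_code=MISSING_CODE):
    """Return the values with every occurrence of the missing code removed."""
    return [v for v in values if v != missing_code]


# Illustrative measurements; the third entry is a missing-value code,
# not a real observation.
raw = [23, 25, 999, 24, 26, 22]
clean = drop_missing(raw)

print("maximum with code left in:", max(raw))
print("mean with code left in:   ", statistics.mean(raw))
print("maximum after cleaning:   ", max(clean))
print("mean after cleaning:      ", statistics.mean(clean))
```

With the code left in, the maximum looks like an extreme outlier and pulls the mean far away from the actual measurements. Once the code is removed, the maximum sits close to the rest of the data.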

Above all, outliers are a problem because they reduce the informative value of certain statistical methods. In addition, most studies try to make statements about the bulk of a population. Values that lie far from the other values may indicate a different population, one that is not necessarily of interest.
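How strongly a single outlier can distort a summary statistic is easy to see in a small example. The sketch below uses made-up numbers and a hypothetical helper `summarize`. It compares the mean and standard deviation, which react strongly to extreme values, with the median, which is shown here only as one example of a more robust measure.

```python
import statistics

# Illustrative data: five typical values, then the same values plus one outlier.
typical = [10, 11, 12, 13, 14]
with_outlier = typical + [100]


def summarize(values):
    """Return mean, median and sample standard deviation of the values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


for label, data in (("typical", typical), ("with outlier", with_outlier)):
    s = summarize(data)
    print(f"{label:>12}: mean = {s['mean']:.2f}, "
          f"median = {s['median']:.2f}, stdev = {s['stdev']:.2f}")
```

In this example the single extreme value more than doubles the mean and inflates the standard deviation many times over, while the median moves by only half a unit. A mean computed on the second data set therefore says little about the bulk of the observations.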