In this chapter, we discuss the limitations of classic statistics that build on the concepts of the mean and variance. We argue that the mean and variance are appropriate measures of the center and the scatter of symmetric distributions. Many distributions we deal with are asymmetric, including distributions of positive data. The mean not only has weak practical appeal but also may create theoretical trouble in the form of unbiased estimation - the existence of an unbiased estimator is more the exception than the rule.
Optimal statistical inference for normal variance in the form of minimum length or unbiased CI was developed more than 50 years ago and has been forgotten. This example serves as a motivation for our theory. Many central concepts, such as unbiased tests, mode, and maximum concentration estimators for normal variance serve as prototypes for the general theory to be deployed in subsequent chapters.
The Neyman-Pearson lemma is a fundamental statistical result that identifies the test with maximum power among all tests with a fixed type I error. In this chapter, we prove two results that extend this lemma; they are used later to demonstrate some optimal properties of M-statistics, such as the superiority of the sufficient statistic and the minimum volume of the density level test.
A long time ago, several prominent statisticians pointed out the limitations of the mean as a measure of central tendency or, for short, the center (Deming 1964; Tukey 1977). Starting from introductory statistics textbooks, the mean is often criticized because it is not robust to outliers. We argue that the mean's limitations are conceptually more serious than those of the other centers, the median and the mode.
For example, when characterizing the distribution of English letters the mean is not applicable, but the mode is "e". Occasionally, statistics textbooks discuss the difference between mean, mode, and median from the application standpoint. For example, consider the distribution of house prices on the real estate market in a particular town. For a town clerk, the most appropriate measure of the center is the mean because the total property taxes received by the town are proportional to the sum of house values and therefore to the mean. For a potential buyer, who compares prices between small nearby towns, the most appropriate center is the mode as the typical house price. A person who can afford a house at the median price knows that they can afford 50% of the houses on the market.
Remarkably, modern statistical packages, like R, compute the mean and median as mean(Y) and median(Y), where Y is the array of data, but not the mode, although it requires just two lines of code. The centerpiece of the mode computation is the density function, which by default assumes the Gaussian kernel and the bandwidth computed by Silverman's "rule of thumb" (1986).
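A minimal sketch of such a mode computation in R might look as follows. The simulated lognormal sample standing in for Y and the name mode_Y are assumptions made for illustration only; density(Y) is called with its defaults, that is, the Gaussian kernel and the bw.nrd0 bandwidth (Silverman's rule of thumb).

```r
# Simulated right-skewed positive data stand in for Y (illustrative assumption)
set.seed(1)
Y <- rlnorm(10000, meanlog = 0, sdlog = 0.5)

# Kernel density estimate with the default Gaussian kernel and bw.nrd0 bandwidth
d <- density(Y)
# Mode estimate: the grid point at which the estimated density is highest
mode_Y <- d$x[which.max(d$y)]

# The three centers side by side
c(mean = mean(Y), median = median(Y), mode = mode_Y)
```

For a right-skewed sample like this one, the three centers are expected to line up as mode, median, mean from smallest to largest. The mode estimate depends on the bandwidth, so a different bw argument may shift it slightly.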
Consider another example of reporting the summary statistics for U.S. hourly wages (the data are obtained from the Bureau of Labor Statistics at https://www.bls.gov/mwe). Figure 1.1 depicts the Gaussian kernel density of hourly wages for 234,986 employees. The mean is almost twice as large as the mode because of the heavy right tail. What center should be used when reporting the average wage? The answer depends on how the center is used. The mean may be informative to the U.S. government because the sum of wages is proportional to consumer buying power and collected income taxes. The median has a clear interpretation: 50% of workers earn less than $17.10 per hour. The mode offers a better interpretation at the individual level as the typical wage - the point of maximum concentration of wages. In parentheses, we report the proportion of workers who earn within $1 of each center. The mode has the maximum data concentration probability - that is why we call the mode the typical value. The mean ($20.40) may be happily reported by the government, but $11.50 is what people typically earn.
Figure 1.1: The Gaussian kernel density for a sample of 234,986 hourly wages in the country. The percent value in the parentheses estimates the probability that the wage is within $1 of the respective center.
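The proportion of observations within $1 of each center can be computed along the following lines. This is only a sketch: the lognormal sample W is a hypothetical stand-in and is not the BLS data behind Figure 1.1, so the printed numbers will not reproduce the figure.

```r
# Hypothetical wage-like sample; NOT the BLS data shown in Figure 1.1
set.seed(2)
W <- rlnorm(50000, meanlog = log(17), sdlog = 0.6)

# Three centers; the mode is taken from the default Gaussian kernel density
d <- density(W)
centers <- c(mean = mean(W), median = median(W), mode = d$x[which.max(d$y)])

# Share of observations lying within $1 of each center
within1 <- sapply(centers, function(cn) mean(abs(W - cn) <= 1))

round(rbind(center = centers, share = within1), 3)
```

Because the mode sits at the peak of the density, the $1 window around it is expected to capture the largest share of observations among the three centers.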
The mean is a convenient quantity for computers, but humans never count and sum - they judge and compare samples based on the typical value.
Figure 1.2 illustrates this statement. It depicts a NASA comet image downloaded from https://solarviews.com/cap/comet/cometneat.htm. The bull's-eye of the comet is the mode, where the concentration of mass is maximum. The mean does not have a particular interpretation.
Mean is for computers, and mode is for people. People immediately identify the mode as the maximum concentration of the distribution, but they never sum the data in their heads and divide it by the number of points - this is what computers do. This picture points to the heart of this book: the mean is easy to compute because it requires only arithmetic operations suitable for computers. The mode requires more sophisticated techniques such as density estimation, which were unavailable at the time when statistics was born. Estimation of the mode is absent even in comprehensive modern statistics books. The time has come to reconsider and rewrite statistical theory.
Figure 1.2: Image of comet C/2001 Q4 (NEAT) taken at the WIYN 0.9-meter telescope at Kitt Peak National Observatory near Tucson, Arizona, on May 7, 2004. Source: NASA.
The mean dominates not only statistical applications but also statistical theory in the form of an unbiased estimator. Finding a new unbiased estimator is regarded as one of the most rewarding works of a statistician. However, unbiasedness has serious limitations:
Note that while we criticize unbiased estimators, there is nothing wrong with unbiased statistical tests and CIs: although the same term, unbiasedness, is used, these concepts are not related. Our theory embraces unbiased tests and CIs and derives the mode (MO) and maximum concentration (MC) estimators as the limit points of the unbiased and minimum length CIs, respectively, when the coverage probability approaches zero.
Classic statistics uses the equal-tail approach for statistical hypothesis testing and CIs. This approach works for symmetric distributions or large sample sizes. It was convenient in the pre-computer era when tables at the end of statistics textbooks were used. The unequal-tail approach, embraced in the present work, requires computer algorithms and implies optimal statistical inference for any sample size. Why use a suboptimal equal-tail approach when a better one exists? True, for a fairly large sample size the difference is negligible, but when the number of observations is small, say, from 5 to 10, we may gain up to 20% improvement measured as the length of the CI or the power of the test.
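To make the comparison concrete, the sketch below contrasts the equal-tail CI for the normal variance with a numerically found minimum length CI. It assumes the standard chi-square pivot S/sigma^2 with n - 1 degrees of freedom, where S is the sum of squared deviations from the sample mean, so that the CI is (S/q2, S/q1). The function names ci_length and compare_ci are hypothetical, and the construction is not necessarily the one developed later in the book.

```r
# Length of the CI (S/q2, S/q1) for sigma^2, divided by S.
# p is the probability placed in the lower tail of the chi-square pivot.
ci_length <- function(p, n, alpha = 0.05) {
  q1 <- qchisq(p, df = n - 1)              # lower chi-square quantile
  q2 <- qchisq(p + 1 - alpha, df = n - 1)  # upper quantile; coverage stays 1 - alpha
  1 / q1 - 1 / q2
}

# Compare the equal-tail choice p = alpha/2 with the length-minimizing p
compare_ci <- function(n, alpha = 0.05) {
  equal_tail <- ci_length(alpha / 2, n, alpha)
  opt <- optimize(ci_length, interval = c(1e-8, alpha - 1e-8),
                  n = n, alpha = alpha, tol = 1e-10)
  c(n = n, equal_tail = equal_tail, min_length = opt$objective,
    lower_tail_p = opt$minimum,
    reduction = 1 - opt$objective / equal_tail)
}

res <- t(sapply(c(5, 10, 30, 100), compare_ci))
round(res, 4)
```

The reduction column gives the relative shortening of the interval; it is expected to be largest for small n and to shrink as n grows. The tails are split unequally because the chi-square distribution is asymmetric.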
Classic statistical inference was developed almost 100 years ago. It tends to offer simple procedures, often relying on precomputed tables of distributions printed at the end of books. This explains why equal-tail tests and CIs are still widely used even though the respective inference is suboptimal for asymmetric distributions. Certainly, for a moderate sample size the difference is usually negligible, but when the sample size is small the difference can be considerable. Classic equal-tail statistical inference is outdated. Yes, unequal-tail inferences do not have a closed-form solution, but this should not serve as an excuse for using suboptimal inference....