The False Discovery Rate
An essential tool for statisticians and data scientists seeking to interpret the vast troves of data that increasingly power our world
First developed in the 1990s, the False Discovery Rate (FDR) is a way of describing the rate at which null hypothesis testing produces errors. It has since become an essential tool for interpreting large datasets. In recent years, as datasets have become ever larger, and as the importance of 'big data' to scientific research has grown, the significance of the FDR has grown correspondingly.
The False Discovery Rate provides an analysis of the FDR's value as a tool, including why it should generally be preferred to the Bonferroni correction and other methods of accounting for multiplicity. It offers a systematic overview of the FDR, its core claims, and its applications.
Readers of The False Discovery Rate will also find:
The False Discovery Rate is ideal for Statistics and Data Science courses, and short courses associated with conferences. It is also useful as supplementary reading in courses in other disciplines that require the statistical interpretation of 'big data'. The book will also be of great value to statisticians and researchers looking to learn more about the FDR.
STATISTICS IN PRACTICE
A series of practical books outlining the use of statistical techniques in a wide range of application areas:
N. W. GALWEY is a Statistics Leader, Research Statistics, at GlaxoSmithKline Research and Development (Retired).
In the beginning was the significance threshold. By the early twentieth century, researchers with an awareness of random variation became concerned that the interesting results that they wished to report might have occurred by chance. Mathematicians worked to develop methods for quantifying this risk, and in 1926, R.A. Fisher wrote,
…it is convenient to draw the line at about the level at which we can say: 'Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials'.
That is, he suggested a threshold of α = 0.05, adding,
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent. point) or one in a hundred (the 1 per cent. point).
(Fisher, 1926, p. 504)
Fisher's suggestion was taken up by researchers, but for the next few decades they did not usually calculate the probability of obtaining, by coincidence, the observed result of each particular study. Such a calculation required a substantial amount of work by a trained mathematician. Instead, the researcher calculated a test statistic - z, t, F, χ² or r (the correlation coefficient), depending on the design of the study and the question asked - and compared the value obtained with a published table of values corresponding to particular thresholds, typically α = 0.05, 0.01 and 0.001. For example, suppose that a researcher analysing data from an experiment obtained the result t = −3.1, with 8 degrees of freedom (d.f. = 8). If they were interested in large effects either positive or negative, they would consult a table of values of the t statistic for a two-sided test, and find that P(|T₈| > 2.306) = 0.05 and P(|T₈| > 3.355) = 0.01. Hence, they would conclude that their result was significant at the 5% (α = 0.05) level, but not at the 1% (α = 0.01) level. Though no probability had been calculated, such a conclusion could be reported in terms of a p-value - in this case, p < 0.05.
By the late 1970s, many researchers had desktop or pocket calculators offering statistical functions, or even had access to programmable computers. This enabled them to present the actual p-value associated with their result, rather than comparing the result to pre-specified thresholds. In the present case, they would report p = 0.015. However, the preoccupation with thresholds that had its origin in arithmetical convenience persisted, and a value of p > 0.05 was (and is) typically presented as 'non-significant' ('NS'), whereas p < 0.05 is 'significant' (often indicated by '*'); p < 0.01 is 'highly significant' ('**'); and p < 0.001 is 'very highly significant' ('***').
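Today the worked example above takes one line rather than a published table. The following sketch (using SciPy's t distribution; any statistics package would serve equally well) reproduces both the tabulated critical values and the exact p-value:

```python
from scipy import stats

# Two-sided test of the worked example: t = -3.1 with 8 degrees of freedom.
t_obs, df = -3.1, 8

# Critical values corresponding to the published two-sided tables:
crit_05 = stats.t.ppf(1 - 0.05 / 2, df)  # approx. 2.306, the 5% threshold
crit_01 = stats.t.ppf(1 - 0.01 / 2, df)  # approx. 3.355, the 1% threshold

# Exact two-sided p-value: P(|T_8| > 3.1)
p = 2 * stats.t.sf(abs(t_obs), df)
print(round(p, 3))  # 0.015: significant at the 5% level but not the 1% level
```

Because 2.306 < |−3.1| < 3.355, the computed p-value necessarily falls between 0.01 and 0.05, matching the table-based conclusion.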
By this time, such significance tests had become the mainstay of statistical data analysis in the biological and social sciences - a status that they still retain. However, it was apparent from the outset that there are conceptual problems associated with such tests. Firstly, the test does not address precisely the question that the researcher most wants to answer. The researcher is not primarily interested in the probability of their dataset - in a sense its probability is irrelevant, as it is an event that has actually happened. What they really want to know is the probability of the hypothesis that the experiment was designed to test. This is the problem of 'inverse' or 'Bayesian' probability, the probability of things that are not - and cannot be - observed. Secondly, although the probability that a single experiment will give a significant result by coincidence is low, if more tests are conducted, the probability that at least one of them will do so increases.
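The second difficulty can be made concrete: if m independent tests are each conducted at level α and every null hypothesis is true, the probability that at least one gives a significant result is 1 − (1 − α)^m. A minimal arithmetic sketch (no external libraries, illustrative function name):

```python
# Probability of at least one false positive among m independent tests,
# each conducted at significance level alpha with H0 true in every case.
def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(m, round(family_wise_error_rate(m), 3))
# With 20 tests at alpha = 0.05, the chance of at least one spurious
# 'significant' result is already about 0.64; with 100 tests, over 0.99.
```

Fisher's "once in twenty trials" guarantee thus erodes quickly as the number of tests grows, which is the core motivation for the multiplicity adjustments discussed later in this book.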
Initially, these difficulties were dealt with by an informal understanding that if results were unlikely to be obtained by coincidence, then the probability that they were indeed produced by coincidence was low, and hence the hypothesis that this had occurred - the null hypothesis, H0 - could be rejected. It followed that among all the 'discoveries' announced by many researchers working over many years, it would not often turn out that the null hypothesis had, after all, been correct. As long as every study and every statistical analysis conducted required a considerable effort on the part of the researcher, there was some reason for confidence in this argument: researchers would not usually waste their time and other resources striving to detect effects that were unlikely to exist.
However, in the last decades of the twentieth century, technological developments changed the situation. There were several aspects to this expansion of the scope for statistical analysis, namely:
By the 1990s, the term 'big data' started to be used to refer to such developments. The management, manipulation and exploration of these huge datasets were characterised as a discipline called 'data science', distinct from classical statistics:
While the term data science is not new, the meanings and connotations have changed over time. The word first appeared in the '60s as an alternative name for statistics. In the late '90s, computer science professionals formalized the term. A proposed definition for data science saw it as a separate field with three aspects: data design, collection, and analysis. It still took another decade for the term to be used outside of academia. (Amazon Web Services, https://aws.amazon.com/what-is/data-science/, accessed 15 April 2024)
When thousands of statistical hypothesis tests could be performed with negligible effort either in the collection or the analysis of the data, the prospect that multiple testing would lead to significant results in cases where H0 was true - false positives - became effectively a certainty.
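The scale of the problem follows from a basic property of significance tests: under H0, the p-value of a well-calibrated test is uniformly distributed on (0, 1), so among m true-null tests roughly αm will fall below the threshold by chance alone. A small simulation (standard library only; the figures are illustrative, not from the text) shows this:

```python
import random

random.seed(1)  # fixed seed for a reproducible illustration

m, alpha = 10_000, 0.05
# Under H0 the p-value is Uniform(0, 1), so we can simulate the
# p-values of m true-null tests directly.
p_values = [random.random() for _ in range(m)]

false_positives = sum(p < alpha for p in p_values)
print(false_positives)  # close to alpha * m = 500 'discoveries', all spurious
```

Ten thousand tests at the conventional 5% level therefore yield around five hundred 'significant' results even when no real effects exist at all.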
The problem of false-positive results is exacerbated if the multiple testing that has caused it is not apparent when the results are reported. This can occur, inadvertently or deliberately, due to several distinct mechanisms, for example, as follows: