
Statistics

Person
Michael J. Crawley, FRS, Department of Biological Sciences, Imperial College of Science, Technology and Medicine. Author of three bestselling Wiley statistics titles and five life science books.
Content
Preface
Chapter 1 Fundamentals
Everything Varies
Significance
Good and Bad Hypotheses
Null Hypotheses
p Values
Interpretation
Model Choice
Statistical Modelling
Maximum Likelihood
Experimental Design
The Principle of Parsimony (Occam's Razor)
Observation, Theory and Experiment
Controls
Replication: It's the ns that Justify the Means
How Many Replicates?
Power
Randomization
Strong Inference
Weak Inference
How Long to Go On?
Pseudoreplication
Initial Conditions
Orthogonal Designs and Non-Orthogonal Observational Data
Aliasing
Multiple Comparisons
Summary of Statistical Models in R
Organizing Your Work
Housekeeping within R
References
Further Reading
Chapter 2 Dataframes
Selecting Parts of a Dataframe: Subscripts
Sorting
Summarizing the Content of Dataframes
Summarizing by Explanatory Variables
First Things First: Get to Know Your Data
Relationships
Looking for Interactions between Continuous Variables
Graphics to Help with Multiple Regression
Interactions Involving Categorical Variables
Further Reading
Chapter 3 Central Tendency
Further Reading
Chapter 4 Variance
Degrees of Freedom
Variance
Variance: A Worked Example
Variance and Sample Size
Using Variance
A Measure of Unreliability
Confidence Intervals
Bootstrap
Non-constant Variance: Heteroscedasticity
Further Reading
Chapter 5 Single Samples
Data Summary in the One-Sample Case
The Normal Distribution
Calculations Using z of the Normal Distribution
Plots for Testing Normality of Single Samples
Inference in the One-Sample Case
Bootstrap in Hypothesis Testing with Single Samples
Student's t Distribution
Higher-Order Moments of a Distribution
Skew
Kurtosis
Reference
Further Reading
Chapter 6 Two Samples
Comparing Two Variances
Comparing Two Means
Student's t Test
Wilcoxon Rank-Sum Test
Tests on Paired Samples
The Binomial Test
Binomial Tests to Compare Two Proportions
Chi-Squared Contingency Tables
Fisher's Exact Test
Correlation and Covariance
Correlation and the Variance of Differences between Variables
Scale-Dependent Correlations
Reference
Further Reading
Chapter 7 Regression
Linear Regression
Linear Regression in R
Calculations Involved in Linear Regression
Partitioning Sums of Squares in Regression: SSY = SSR + SSE
Measuring the Degree of Fit, r²
Model Checking
Transformation
Polynomial Regression
Non-Linear Regression
Generalized Additive Models
Influence
Further Reading
Chapter 8 Analysis of Variance
One-Way ANOVA
Shortcut Formulas
Effect Sizes
Plots for Interpreting One-Way ANOVA
Factorial Experiments
Pseudoreplication: Nested Designs and Split Plots
Split-Plot Experiments
Random Effects and Nested Designs
Fixed or Random Effects?
Removing the Pseudoreplication
Analysis of Longitudinal Data
Derived Variable Analysis
Dealing with Pseudoreplication
Variance Components Analysis (VCA)
References
Further Reading
Chapter 9 Analysis of Covariance
Further Reading
Chapter 10 Multiple Regression
The Steps Involved in Model Simplification
Caveats
Order of Deletion
Carrying Out a Multiple Regression
A Trickier Example
Further Reading
Chapter 11 Contrasts
Contrast Coefficients
An Example of Contrasts in R
A Priori Contrasts
Treatment Contrasts
Model Simplification by Stepwise Deletion
Contrast Sums of Squares by Hand
The Three Kinds of Contrasts Compared
Reference
Further Reading
Chapter 12 Other Response Variables
Introduction to Generalized Linear Models
The Error Structure
The Linear Predictor
Fitted Values
A General Measure of Variability
The Link Function
Canonical Link Functions
Akaike's Information Criterion (AIC) as a Measure of the Fit of a Model
Further Reading
Chapter 13 Count Data
A Regression with Poisson Errors
Analysis of Deviance with Count Data
The Danger of Contingency Tables
Analysis of Covariance with Count Data
Frequency Distributions
Further Reading
Chapter 14 Proportion Data
Analyses of Data on One and Two Proportions
Averages of Proportions
Count Data on Proportions
Odds
Overdispersion and Hypothesis Testing
Applications
Logistic Regression with Binomial Errors
Proportion Data with Categorical Explanatory Variables
Analysis of Covariance with Binomial Data
Further Reading
Chapter 15 Binary Response Variable
Incidence Functions
ANCOVA with a Binary Response Variable
Further Reading
Chapter 16 Death and Failure Data
Survival Analysis with Censoring
Further Reading
Appendix Essentials of the R Language
R as a Calculator
Built-in Functions
Numbers with Exponents
Modulo and Integer Quotients
Assignment
Rounding
Infinity and Things that Are Not a Number (NaN)
Missing Values (NA)
Operators
Creating a Vector
Named Elements within Vectors
Vector Functions
Summary Information from Vectors by Groups
Subscripts and Indices
Working with Vectors and Logical Subscripts
Addresses within Vectors
Trimming Vectors Using Negative Subscripts
Logical Arithmetic
Repeats
Generate Factor Levels
Generating Regular Sequences of Numbers
Matrices
Character Strings
Writing Functions in R
Arithmetic Mean of a Single Sample
Median of a Single Sample
Loops and Repeats
The ifelse Function
Evaluating Functions with apply
Testing for Equality
Testing and Coercing in R
Dates and Times in R
Calculations with Dates and Times
Understanding the Structure of an R Object Using str
Reference
Further Reading
Index
Chapter 1 Fundamentals
The hardest part of any statistical work is getting started. And one of the hardest things about getting started is choosing the right kind of statistical analysis. The choice depends on the nature of your data and on the particular question you are trying to answer. The truth is that there is no substitute for experience: the way to know what to do is to have done it properly lots of times before.
The key is to understand what kind of response variable you have got, and to know the nature of your explanatory variables. The response variable is the thing you are working on: it is the variable whose variation you are attempting to understand. This is the variable that goes on the y axis of the graph (the ordinate). The explanatory variable goes on the x axis of the graph (the abscissa); you are interested in the extent to which variation in the response variable is associated with variation in the explanatory variable. A continuous measurement is a variable like height or weight that can take any real-numbered value. A categorical variable is a factor with two or more levels: sex is a factor with two levels (male and female), and rainbow might be a factor with seven levels (red, orange, yellow, green, blue, indigo, violet).
It is essential, therefore, that you know:
- which of your variables is the response variable?
- which are the explanatory variables?
- are the explanatory variables continuous or categorical, or a mixture of both?
- what kind of response variable have you got: is it a continuous measurement, a count, a proportion, a time-at-death, or a category?
These simple keys will then lead you to the appropriate statistical method:
- The explanatory variables (pick one of the rows):
  (a) All explanatory variables continuous: Regression
  (b) All explanatory variables categorical: Analysis of variance (ANOVA)
  (c) Some explanatory variables continuous, some categorical: Analysis of covariance (ANCOVA)
- The response variable (pick one of the rows):
  (a) Continuous: Regression, ANOVA or ANCOVA
  (b) Proportion: Logistic regression
  (c) Count: Log linear models
  (d) Binary: Binary logistic analysis
  (e) Time at death: Survival analysis
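The two keys can be read as simple lookups. The sketch below is only an illustration in Python (the book itself works in R); the function names `explanatory_key` and `response_key` and the lower-case labels passed to them are assumptions made for this example, not anything defined in the text.

```python
def explanatory_key(kinds):
    """First key: map the types of the explanatory variables to a method.

    kinds: iterable of "continuous" / "categorical" labels,
    one per explanatory variable (labels are an assumption of this sketch).
    """
    kinds = set(kinds)
    if kinds == {"continuous"}:
        return "Regression"
    if kinds == {"categorical"}:
        return "Analysis of variance (ANOVA)"
    if kinds == {"continuous", "categorical"}:
        return "Analysis of covariance (ANCOVA)"
    raise ValueError(f"unrecognised explanatory variable types: {sorted(kinds)}")


# Second key: map the type of the response variable to a method.
RESPONSE_KEY = {
    "continuous": "Regression, ANOVA or ANCOVA",
    "proportion": "Logistic regression",
    "count": "Log linear models",
    "binary": "Binary logistic analysis",
    "time at death": "Survival analysis",
}


def response_key(kind):
    """Look up the method suggested by the kind of response variable."""
    return RESPONSE_KEY[kind]


print(explanatory_key(["continuous", "categorical"]))  # Analysis of covariance (ANCOVA)
print(response_key("count"))                           # Log linear models
```

The two lookups are kept separate because the text presents them as two independent keys; how they combine for a non-continuous response is left to the later chapters.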
There is a small core of key ideas that need to be understood from the outset. We cover these here before getting into any detail about different kinds of statistical model.
Everything Varies
If you measure the same thing twice you will get two different answers. If you measure the same thing on different occasions you will get different answers because the thing will have aged. If you measure different individuals, they will differ for both genetic and environmental reasons (nature and nurture). Heterogeneity is universal: spatial heterogeneity means that places always differ, and temporal heterogeneity means that times always differ.
Because everything varies, finding that things vary is simply not interesting. We need a way of discriminating between variation that is scientifically interesting, and variation that just reflects background heterogeneity. That is why you need statistics. It is what this whole book is about.
The key concept is the amount of variation that we would expect to occur by chance alone, when nothing scientifically interesting was going on. If we measure bigger differences than we would expect by chance, we say that the result is statistically significant. If we measure no more variation than we might reasonably expect to occur by chance alone, then we say that our result is not statistically significant. It is important to understand that this is not to say that the result is not important. Non-significant differences in human life span between two drug treatments may be massively important (especially if you are the patient involved). Non-significant is not the same as 'not different'. The lack of significance may be due simply to the fact that our replication is too low.
On the other hand, when nothing really is going on, then we want to know this. It makes life much simpler if we can be reasonably sure that there is no relationship between y and x. Some students think that 'the only good result is a significant result'. They feel that their study has somehow failed if it shows that 'A has no significant effect on B'. This is an understandable failing of human nature, but it is not good science. The point is that we want to know the truth, one way or the other. We should try not to care too much about the way things turn out. This is not an amoral stance; it just happens to be the way that science works best. Of course, it is hopelessly idealistic to pretend that this is the way that scientists really behave. Scientists often passionately want a particular experimental result to turn out to be statistically significant, so that they can get a Nature paper and get promoted. But that does not make it right.
Significance
What do we mean when we say that a result is significant? The normal dictionary definitions of significant are 'having or conveying a meaning' or 'expressive; suggesting or implying deeper or unstated meaning'. But in statistics we mean something very specific indeed. We mean that 'a result was unlikely to have occurred by chance'. In particular, we mean 'unlikely to have occurred by chance if the null hypothesis was true'. So there are two elements to it: we need to be clear about what we mean by 'unlikely', and also what exactly we mean by the 'null hypothesis'. Statisticians have an agreed convention about what constitutes 'unlikely'. They say that an event is unlikely if it occurs less than 5% of the time. In general, the null hypothesis says that 'nothing is happening' and the alternative says that 'something is happening'.
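One way to make the 5% convention concrete is to simulate a world in which the null hypothesis is true. The sketch below, in Python for illustration only, repeatedly draws two samples from the same normal population and records how far apart their means fall by chance alone. The function name `chance_threshold`, the population mean and standard deviation, the number of simulations and the seed are all assumptions chosen for this example.

```python
import random
import statistics


def chance_threshold(n, n_sims=2000, mean=10.0, sd=2.0, seed=1):
    """Estimate the difference between two sample means that is exceeded
    only about 5% of the time when nothing is happening.

    Both samples of size n come from the SAME normal population, so every
    difference generated here is due to chance alone.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        a = [rng.gauss(mean, sd) for _ in range(n)]
        b = [rng.gauss(mean, sd) for _ in range(n)]
        diffs.append(abs(statistics.fmean(a) - statistics.fmean(b)))
    diffs.sort()
    # About 5% of the simulated chance differences lie above this value.
    return diffs[n_sims * 95 // 100]


few = chance_threshold(n=5)
many = chance_threshold(n=50)
print(f"n = 5:  chance differences rarely exceed {few:.2f}")
print(f"n = 50: chance differences rarely exceed {many:.2f}")
```

With few replicates the threshold is much larger, so a real difference has to be big before it stands out from background variation. This is the sense in which a lack of significance may simply reflect low replication.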
Good and Bad Hypotheses
Karl Popper was the first to point out that a good hypothesis was one that was capable of rejection. He argued that a good hypothesis is a falsifiable hypothesis. Consider the following two assertions:
- A: there are vultures in the local park
- B: there are no vultures in the local park
Both involve the same essential idea, but one is refutable and the other is not. Ask yourself how you would refute option A. You go out into the park and you look for vultures. But you do not see any. Of course, this does not mean that there are none. They could have seen you coming, and hidden behind you. No matter how long or how hard you look, you cannot refute the hypothesis. All you can say is 'I went out and I didn't see any vultures'. One of the most important scientific notions is that absence of evidence is not evidence of absence.
Option B is fundamentally different. You reject hypothesis B the first time you see a vulture in the park. Until the time that you do see your first vulture in the park, you work on the assumption that the hypothesis is true. But if you see a vulture, the hypothesis is clearly false, so you reject it.
Null Hypotheses
The null hypothesis says 'nothing is happening'. For instance, when we are comparing two sample means, the null hypothesis is that the means of the two populations are the same. Of course, the two sample means are not identical, because everything varies. Again, when working with a graph of y against x in a regression study, the null hypothesis is that the slope of the relationship is zero (i.e. y is not a function of x, or y is independent of x). The essential point is that the null hypothesis is falsifiable. We reject the null hypothesis when our data show that the null hypothesis is sufficiently unlikely.
p Values
Here we encounter a much-misunderstood topic. The p value is not the probability that the null hypothesis is true, although you will often hear people saying this. In fact, p values are calculated on the assumption that the null hypothesis is true. It is correct to say that p values have to do with the plausibility of the null hypothesis, but in a rather subtle way.
As you will see later, we typically base our hypothesis testing on what are known as test statistics: you may have heard of some of these already (Student's t, Fisher's F and Pearson's chi-squared, for instance). p values are about the size of the test statistic. In particular, a p value is an estimate of the probability that a value of the test statistic, or a value more extreme than this, could have occurred by chance when the null hypothesis is true. Big values of the test statistic indicate that the null hypothesis is unlikely to be true. For sufficiently large values of the test statistic, we reject the null hypothesis and accept the alternative hypothesis.
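The definition of a p value can be followed step by step with a small exact permutation test. This is a swapped-in technique chosen because it needs no distribution tables; it is not one of the test statistics named above. The sketch is in Python for illustration only, and the two samples and the function name `permutation_p_value` are hypothetical. The test statistic is the absolute difference between the two sample means, and the p value is the fraction of all possible regroupings of the data whose statistic is at least as extreme as the observed one.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(a, b):
    """Exact two-sided permutation p value for a difference in means.

    Under the null hypothesis the group labels are arbitrary, so every way
    of splitting the pooled data into groups of the original sizes is
    equally likely.
    """
    pooled = list(a) + list(b)
    observed = abs(fmean(a) - fmean(b))
    indices = range(len(pooled))
    as_extreme = 0
    total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen_set]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Count regroupings whose statistic is as big as, or bigger than,
        # the one actually observed.
        if abs(fmean(group_a) - fmean(group_b)) >= observed:
            as_extreme += 1
    return as_extreme / total


# Hypothetical measurements, invented for this example.
sample_a = [3, 5, 4, 6]
sample_b = [8, 9, 7, 10]

p = permutation_p_value(sample_a, sample_b)
print(round(p, 4))  # 0.0286, i.e. 2 of the 70 possible regroupings
```

Here only the observed split and its mirror image are as extreme as the data, so p = 2/70, which is below the conventional 5% threshold. Note that the calculation assumes the null hypothesis throughout: it says how surprising the data are if nothing is happening, not how likely the null hypothesis is to be true.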
Note also that saying 'we do not reject the null hypothesis' and 'the null hypothesis is true' are two quite different things. For...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e. "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.