
Categorical Data Analysis by Example
Reviews / Votes
"Concise introduction to dealing with categorical data (with supporting R code) which will help the general data scientist." (Raspberry Pi, March 2017)
Content
Preface
Acknowledgments
1 Introduction
1.1 What are categorical data?
1.2 A typical data set
1.3 Visualization and cross-tabulation
1.4 Samples, populations, and random variation
1.5 Proportion, probability and conditional probability
1.6 Probability distributions
1.6.1 The binomial distribution
1.6.2 The multinomial distribution
1.6.3 The Poisson distribution
1.6.4 The normal distribution
1.6.5 The chi-squared (χ²) distribution
1.7 * The likelihood
2 Estimation and inference for categorical data
2.1 Goodness of fit
2.1.1 Pearson's X² goodness-of-fit statistic
2.1.2 * The link between X² and the Poisson and χ² distributions
2.1.3 The likelihood-ratio goodness-of-fit statistic, G²
2.1.4 * Why the G² and X² statistics usually have similar values
2.2 Hypothesis tests for a binomial proportion (large sample)
2.2.1 The normal score test
2.2.2 * Link to Pearson's X² goodness-of-fit test
2.2.3 G² for a binomial proportion
2.3 Hypothesis tests for a binomial proportion (small sample)
2.3.1 One-tailed hypothesis test
2.3.2 Two-tailed hypothesis tests
2.4 Interval estimates for a binomial proportion
2.4.1 Laplace's method
2.4.2 Wilson's method
2.4.3 The Agresti-Coull method
2.4.4 Small samples and exact calculations
3 The 2 × 2 contingency table
3.1 Introduction
3.2 Fisher's exact test (for independence)
3.2.1 * Derivation of the exact test formula
3.3 Testing independence with large cell frequencies
3.3.1 Using Pearson's goodness-of-fit test
3.3.2 The Yates correction
3.4 The 2 × 2 table in a medical context
3.5 Measuring lack of independence (comparing proportions)
3.5.1 Difference of proportions
3.5.2 Relative risk
3.5.3 Odds-ratio
4 The I × J contingency table
4.1 Notation
4.2 Independence in the I × J contingency table
4.2.1 Estimation and degrees of freedom
4.2.2 Odds-ratios and independence
4.2.3 Goodness-of-fit and lack of fit of the independence model
4.3 Partitioning
4.3.1 * Additivity of G²
4.3.2 Rules for partitioning
4.4 Graphical displays
4.4.1 Mosaic plots
4.4.2 Cobweb diagrams
4.5 Testing independence with ordinal variables
5 The exponential family
5.1 Introduction
5.2 The exponential family
5.2.1 The exponential dispersion family
5.3 Components of a general linear model
5.4 Estimation
6 A model taxonomy
6.1 Underlying questions
6.1.1 Which variables are of interest?
6.1.2 What categories should be used?
6.1.3 What is the type of each variable?
6.1.4 What is the nature of each variable?
6.2 Identifying the type of model
7 The 2 × J contingency table
7.1 A problem with X² (and G²)
7.2 Using the logit
7.2.1 Estimation of the logit
7.2.2 The null model
7.3 Individual data and grouped data
7.4 Precision, confidence intervals, and prediction intervals
7.4.1 Prediction intervals
7.5 Logistic regression with a categorical explanatory variable
7.5.1 ... > 2)
7.5.2 The dummy variable representation of a categorical variable
8 Logistic regression with several explanatory variables
8.1 Degrees of freedom when there are no interactions
8.2 Getting a feel for the data
8.3 Models with two variable interactions
8.3.1 Link to the testing of independence between two variables
9 Model selection and diagnostics
9.1 Introduction
9.1.1 Ockham's razor
9.2 Notation for interactions and for models
9.3 Stepwise methods for model selection using G²
9.3.1 Forward selection
9.3.2 Backward elimination
9.3.3 Complete stepwise
9.4 AIC and related measures
9.5 The problem caused by rare combinations of events
9.5.1 Tackling the problem
9.6 Simplicity versus accuracy
9.7 DFBETAS
10 Multinomial logistic regression
10.1 A single continuous explanatory variable
10.2 Nominal categorical explanatory variables
10.3 Models for an ordinal response variable
10.3.1 Cumulative logits
10.3.2 Proportional odds models
10.3.3 Adjacent-category logit models
10.3.4 Continuation-ratio logit models
11 Log-linear models for I × J tables
11.1 The saturated model
11.1.1 Cornered constraints
11.1.2 Centered constraints
11.2 The independence model for an I × J table
12 Log-linear models for I × J × K tables
12.1 Mutual independence: A=B=C
12.2 The model AB=C
12.3 Conditional independence and independence
12.4 The model AB=AC
12.5 The models AB=AC=BC and ABC
12.6 Simpson's paradox
12.7 Connection between log-linear models and logistic regression
13 Implications and uses of Birch's result
13.1 Birch's result
13.2 Iterative scaling
13.3 The hierarchy constraint
13.4 Inclusion of the all-factor interaction
13.5 Mostellerising
14 Model selection for log-linear models
14.1 Three variables
14.2 More than three variables
15 Incomplete tables, dummy variables, and outliers
15.1 Incomplete tables
15.1.1 Degrees of freedom
15.2 Quasi-independence
15.3 Dummy variables
15.4 Detection of outliers
16 Panel data and repeated measures
16.1 The mover-stayer model
16.2 The loyalty model
16.3 Symmetry
16.4 Quasi-symmetry
16.5 The loyalty-distance model
A R code for Cobweb function
Index
Author Index
Index of Examples
CHAPTER 1
Introduction
This chapter introduces basic statistical ideas and terminology in what the author hopes is a suitably concise fashion. Many readers will be able to turn to Chapter 2 without further ado!
1.1 What are categorical data?
Categorical data are the observed values of variables such as the color of a book, a person's religion, gender, political preference, social class, etc. In short, any variable other than a continuous variable (such as length, weight, time, distance, etc.).
If the categories have no obvious order (e.g., Red, Yellow, White, Blue) then the variable is described as a nominal variable. If the categories have an obvious order (e.g., Small, Medium, Large) then the variable is described as an ordinal variable. In the latter case the categories may relate to an underlying continuous variable where the precise value is unrecorded, or where it simplifies matters to replace the measurement by the relevant category. For example, while an individual's age may be known, it may suffice to record it as belonging to one of the categories "Under 18," "Between 18 and 65," "Over 65."
If a variable has just two categories, then it is a binary variable and whether or not the categories are ordered has no effect on the ensuing analysis.
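The age example above, where an exact measurement is replaced by an ordered category, can be sketched in a few lines of Python. This is only an illustration: the function name `age_category` and the sample ages are hypothetical, and treating both 18 and 65 as belonging to the middle category is an assumption, since the category labels do not say where the boundary ages fall.

```python
# The three categories, listed in their natural order (this ordering is
# what makes the variable ordinal rather than nominal).
AGE_CATEGORIES = ["Under 18", "Between 18 and 65", "Over 65"]


def age_category(age):
    """Map an exact age in years to one of the three ordered categories.

    Assumption: ages 18 and 65 are both counted as "Between 18 and 65".
    """
    if age < 18:
        return AGE_CATEGORIES[0]
    if age <= 65:
        return AGE_CATEGORIES[1]
    return AGE_CATEGORIES[2]


ages = [12, 18, 40, 65, 66, 90]  # hypothetical ages, for illustration only
labels = [age_category(a) for a in ages]
print(labels)
```

Once the ages have been recoded in this way, only the category counts are needed for the analyses that follow.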
1.2 A typical data set
The basic data with which we are concerned are counts, also called frequencies. Such data occur naturally when we summarize the answers to questions in a survey such as that in Table 1.1.
Table 1.1 Hypothetical sports preference survey
Sports preference questionnaire
(A) Are you:- Male / Female?
(B) Are you:- Aged 45 or under / Aged over 45?
(C) Do you:- Prefer golf to tennis / Prefer tennis to golf?

The people answering this (fictitious) survey will be classified by each of the three characteristics: gender, age, and sport preference. Suppose that the 400 replies were as given in Table 1.2 which shows that males prefer golf to tennis (142 out of 194 is 73%) whereas females prefer tennis to golf (161 out of 206 is 78%). However, there is a lot of other information available. For example:
Table 1.2 Results of sports preference survey

| Category of response | Frequency |
| --- | --- |
| Male, aged 45 or under, prefers golf to tennis | 64 |
| Male, aged 45 or under, prefers tennis to golf | 28 |
| Male, aged over 45, prefers golf to tennis | 78 |
| Male, aged over 45, prefers tennis to golf | 24 |
| Female, aged 45 or under, prefers golf to tennis | 22 |
| Female, aged 45 or under, prefers tennis to golf | 86 |
| Female, aged over 45, prefers golf to tennis | 23 |
| Female, aged over 45, prefers tennis to golf | 75 |

- There are more replies from females than males.
- There are more tennis lovers than golf lovers.
- Amongst males, the proportion preferring golf to tennis is greater amongst those aged over 45 (78/102 is 76%) than those aged 45 or under (64/92 is 70%).
This book is concerned with models that can reveal all of these subtleties simultaneously.
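The totals and percentages quoted in this section can be reproduced directly from the eight frequencies in Table 1.2. The following Python sketch is offered only as a check on the arithmetic; the dictionary layout and the helper `total` are illustrative choices, not taken from the book (whose supporting code is in R).

```python
# Frequencies from Table 1.2, keyed by (gender, age group, preferred sport).
counts = {
    ("Male", "45 or under", "Golf"): 64,
    ("Male", "45 or under", "Tennis"): 28,
    ("Male", "Over 45", "Golf"): 78,
    ("Male", "Over 45", "Tennis"): 24,
    ("Female", "45 or under", "Golf"): 22,
    ("Female", "45 or under", "Tennis"): 86,
    ("Female", "Over 45", "Golf"): 23,
    ("Female", "Over 45", "Tennis"): 75,
}


def total(gender=None, age=None, sport=None):
    """Sum the cells matching every characteristic that was specified."""
    return sum(
        n
        for (g, a, s), n in counts.items()
        if gender in (None, g) and age in (None, a) and sport in (None, s)
    )


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(total())                                          # 400 replies in all
print(total(gender="Male"), total(gender="Female"))     # 194 206
print(total(sport="Tennis"), total(sport="Golf"))       # 213 187

# Males preferring golf, and females preferring tennis.
print(percent(total(gender="Male", sport="Golf"), total(gender="Male")))        # 73
print(percent(total(gender="Female", sport="Tennis"), total(gender="Female")))  # 78

# Amongst males: golf preference by age group.
print(percent(total("Male", "Over 45", "Golf"), total("Male", "Over 45")))          # 76
print(percent(total("Male", "45 or under", "Golf"), total("Male", "45 or under")))  # 70
```

Each printed value is a marginal or conditional summary of the same eight counts, which is why a single table can support several different comparisons.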
1.3 Visualization and cross-tabulation
While Table 1.2 certainly summarizes the results, it does so in a clumsily long-winded fashion. We need a more succinct alternative, which is provided in Table 1.3.
Table 1.3 Presentation of survey results by gender

| Sport | Male: 45 and under | Male: Over 45 | Male: Total | Female: 45 and under | Female: Over 45 | Female: Total |
| --- | --- | --- | --- | --- | --- | --- |
| Tennis | 28 | 24 | 52 | 86 | 75 | 161 |
| Golf | 64 | 78 | 142 | 22 | 23 | 45 |
| Total | 92 | 102 | 194 | 108 | 98 | 206 |

A table of this type is referred to as a contingency table; in this case it is (in effect) a three-dimensional contingency table. The locations in the body of the table are referred to as the cells of the table. Note that the table can be presented in several different ways. One alternative is Table 1.4.
Table 1.4 Presentation of survey results by sport preference
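
| Gender | Prefers tennis: 45 and under | Prefers tennis: Over 45 | Prefers tennis: Total | Prefers golf: 45 and under | Prefers golf: Over 45 | Prefers golf: Total |
| --- | --- | --- | --- | --- | --- | --- |
| Female | 86 | 75 | 161 | 22 | 23 | 45 |
| Male | 28 | 24 | 52 | 64 | 78 | 142 |
| Total | 114 | 99 | 213 | 86 | 101 | 187 |

Figure 1.1 Illustration of results of sports preference survey.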
In this example, the problem is that the page of a book is two-dimensional, whereas, with its three classifying variables, the data set is essentially three-dimensional, as Figure 1.1 indicates. Each face of the diagram contains information about the 2 × 2 category combinations for two variables for some particular category of the third variable.
With a small table and just three variables, a diagram is feasible, as Figure 1.1 illustrates. In general, however, there will be too many variables and too many categories for this to be a useful approach.
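The different presentations of the same data (by gender in Table 1.3, by sport preference in Table 1.4) are simply different two-way slices of one three-way array of counts. A minimal Python sketch of that idea follows; the record layout and the function `crosstab` are hypothetical names introduced for illustration, and the book itself works in R.

```python
from collections import defaultdict

# Table 1.2 as (gender, age group, sport, frequency) records.
records = [
    ("Male", "45 and under", "Golf", 64),
    ("Male", "45 and under", "Tennis", 28),
    ("Male", "Over 45", "Golf", 78),
    ("Male", "Over 45", "Tennis", 24),
    ("Female", "45 and under", "Golf", 22),
    ("Female", "45 and under", "Tennis", 86),
    ("Female", "Over 45", "Golf", 23),
    ("Female", "Over 45", "Tennis", 75),
]

# Position of each classifying variable within a record.
FIELDS = {"gender": 0, "age": 1, "sport": 2}


def crosstab(row_var, col_var, layer_var, layer_value):
    """Two-way table of row_var by col_var for one category of layer_var."""
    r, c, layer = FIELDS[row_var], FIELDS[col_var], FIELDS[layer_var]
    table = defaultdict(int)
    for rec in records:
        if rec[layer] == layer_value:
            table[(rec[r], rec[c])] += rec[3]
    return dict(table)


# The Male block of Table 1.3: sport by age group.
male = crosstab("sport", "age", "gender", "Male")
# The "Prefers tennis" block of Table 1.4: gender by age group.
tennis = crosstab("gender", "age", "sport", "Tennis")

print(male)
print(tennis)
print(sum(male.values()), sum(tennis.values()))  # 194 213
```

Choosing a different layer variable corresponds to looking at a different face of the three-dimensional arrangement of counts.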
1.4 Samples, populations, and random variation
Suppose we repeat the survey of sport preferences, interviewing a second group of 400 individuals and obtaining the results summarized in Table 1.5.
Table 1.5 The results of a second survey

| Gender | Prefers tennis: 45 and under | Prefers tennis: Over 45 | Prefers tennis: Total | Prefers golf: 45 and under | Prefers golf: Over 45 | Prefers golf: Total |
| --- | --- | --- | --- | --- | --- | --- |
| Female | 81 | 76 | 157 | 16 | 24 | 40 |
| Male | 26 | 34 | 60 | 62 | 81 | 143 |
| Total | 107 | 110 | 217 | 78 | 105 | 183 |

As one would expect, the results are very similar to those from the first survey, but they are not identical. All the principal characteristics (for example, the preference of females for tennis and males for golf) are again present, but there are slight...
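How close the two surveys are can be checked by comparing the headline proportions. The Python sketch below uses the row totals of Tables 1.4 and 1.5; the group sizes for the second survey (203 males, 197 females) are not printed in Table 1.5 and are obtained here by adding its row totals (60 + 143 and 157 + 40). The variable names are illustrative.

```python
# (number with the stated preference, group size) for each survey.
first = {
    "males preferring golf": (142, 194),
    "females preferring tennis": (161, 206),
}
second = {
    "males preferring golf": (143, 60 + 143),
    "females preferring tennis": (157, 157 + 40),
}

# Proportion in each survey, side by side.
proportions = {
    key: (first[key][0] / first[key][1], second[key][0] / second[key][1])
    for key in first
}

for key, (p1, p2) in proportions.items():
    print(f"{key}: first survey {p1:.1%}, second survey {p2:.1%}")
```

The proportions differ by only a few percentage points between the two surveys, which is the kind of random variation this section describes.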
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.