An Introduction to Correspondence Analysis

Eric J. Beh, Rosaria Lombardo (Authors)

Wiley (Publisher)

1st Edition

Published on 9 April 2021

240 pages

E-Book

ePUB with Adobe-DRM

978-1-119-04197-9 (ISBN)

€60.99 incl. 7% VAT

E-Book Single Licence

Available for download

Description

Master the fundamentals of correspondence analysis with this illuminating resource

An Introduction to Correspondence Analysis assists researchers in improving their familiarity with the concepts, terminology, and application of several variants of correspondence analysis. The accomplished academics and authors deliver a comprehensive and insightful treatment of the fundamentals of correspondence analysis, including the statistical and visual aspects of the subject.

Written in three parts, the book begins by offering readers a description of two variants of correspondence analysis that can be applied to two-way contingency tables for nominal categories of variables. Part Two shifts the discussion to categories of ordinal variables and demonstrates how the ordered structure of these variables can be incorporated into a correspondence analysis. Part Three describes the analysis of multiple nominal categorical variables, including both multiple correspondence analysis and multi-way correspondence analysis.

Readers will benefit from explanations of a wide variety of specific topics, for example:

* Simple correspondence analysis, including how to reduce multidimensional space, measuring symmetric associations with the Pearson Ratio, constructing low-dimensional displays, and detecting statistically significant points

* Non-symmetrical correspondence analysis, including quantifying asymmetric associations

* Simple ordinal correspondence analysis, including how to decompose the Pearson Residual for ordinal variables

* Multiple correspondence analysis, including crisp coding and the indicator matrix, the Burt Matrix, and stacking

* Multi-way correspondence analysis, including symmetric multi-way analysis

Perfect for researchers who seek to improve their understanding of key concepts in the graphical analysis of categorical data, An Introduction to Correspondence Analysis will also assist readers already familiar with correspondence analysis who wish to review the theoretical and foundational underpinnings of crucial concepts.

Content

Preface

1 Introduction

1.1 Data Visualisation

1.2 Correspondence Analysis in a "Nutshell"

1.3 Data Sets

1.3.1 Traditional European Food Data

1.3.2 Temperature Data

1.3.3 Shoplifting Data

1.3.4 Alligator Data

1.4 Symmetrical Versus Asymmetrical Association

1.5 Notation

1.5.1 The Two-way Contingency Table

1.5.2 The Three-way Contingency Table

1.6 Formal Test of Symmetrical Association

1.6.1 Test of Independence for Two-way Contingency Tables

1.6.2 The Chi-squared Statistic for a Two-way Table

1.6.3 Analysis of the Traditional European Food Data

1.6.4 The Chi-squared Statistic for a Three-way Table

1.6.5 Analysis of the Alligator Data

1.7 Formal Test of Asymmetrical Association

1.7.1 Test of Predictability for Two-way Contingency Tables

1.7.2 The Goodman-Kruskal tau Index

1.7.3 Analysis of the Traditional European Food Data

1.7.4 Test of Predictability for Three-way Contingency Tables

1.7.5 Marcotorchino's Index

1.7.6 Analysis of the Alligator Data

1.7.7 The Gray-Williams Index and Delta Index

1.8 Correspondence Analysis and R

1.9 Overview of the Book

Part I Classical Analysis of Two Categorical Variables

2 Simple Correspondence Analysis

2.1 Introduction

2.2 Reducing Multi-dimensional Space

2.2.1 Profiles Cloud of Points

2.2.2 Profiles for the Traditional European Food Data

2.2.3 Weighted Centred Profiles

2.3 Measuring Symmetric Association

2.3.1 The Pearson Ratio

2.3.2 Analysis of the Traditional European Food Data

2.4 Decomposing the Pearson Residual for Nominal Variables

2.4.1 The Generalised SVD of ¿¿¿¿ij - 1

2.4.2 SVD of the Pearson Ratios

2.4.3 GSVD and the Traditional European Food Data

2.5 Constructing a Low-Dimensional Display

2.5.1 Standard Coordinates

2.5.2 Principal Coordinates

2.6 Practicalities of the Low-Dimensional Plot

2.6.1 The Two-Dimensional Correspondence Plot

2.6.2 What is NOT Being Shown in a Two-Dimensional Correspondence Plot?

2.6.3 The Three-Dimensional Correspondence Plot

2.7 The Biplot Display

2.7.1 Definition

2.7.2 Isometric Biplots of the Traditional European Food Data

2.7.3 What is NOT Being Shown in a Two-Dimensional Biplot?

2.8 The Case for No Visual Display

2.9 Detecting Statistically Significant Points

2.9.1 Confidence Circles and Ellipses

2.9.2 Confidence Ellipses for the Traditional European Food Data

2.10 Approximate p-values

2.10.1 The Hypothesis Test and its p-value

2.10.2 P-values and the Traditional European Food Data

2.11 Final Comments

3 Non-Symmetrical Correspondence Analysis

3.1 Introduction

3.2 Quantifying Asymmetric Association

3.2.1 The Goodman-Kruskal tau Index

3.2.2 The τ Index and the Traditional European Food Data

3.2.3 Weighted Centred Column Profile

3.2.4 Profiles of the Traditional European Food Data

3.3 Decomposing ¿¿¿¿i|j for Nominal Variables

3.3.1 The Generalised SVD of ¿¿¿¿i|j

3.3.2 GSVD and the Traditional Food Data

3.4 Constructing a Low-Dimensional Display

3.4.1 Standard Coordinates

3.4.2 Principal Coordinates

3.5 Practicalities of the Low-Dimensional Plot

3.5.1 The Two-Dimensional Correspondence Plot

3.5.2 The Three-Dimensional Correspondence Plot

3.6 The Biplot Display

3.6.1 Definition

3.6.2 The Column Isometric Biplot for the Traditional Food Data

3.6.3 The Three-Dimensional Biplot

3.7 Detecting Statistically Significant Points

3.7.1 Confidence Circles and Ellipses

3.7.2 Confidence Ellipses for the Traditional Food Data

3.8 Final Comments

Part II Ordinal Analysis of Two Categorical Variables

4 Simple Ordinal Correspondence Analysis

4.1 Introduction

4.2 A Simple Correspondence Analysis of the Temperature Data

4.3 On the Mean and Variation of Profiles with Ordered Categories

4.3.1 Profiles of the Temperature Data

4.3.2 Defining Scores

4.3.3 On the Mean of the Profiles

4.3.4 On the Variation of the Profiles

4.3.5 Mean and Variation of Profiles for the Temperature Data

4.4 Decomposing the Pearson Residual for Ordinal Variables

4.4.1 The Bivariate Moment Decomposition of ¿¿¿¿ij - 1

4.4.2 BMD and the Temperature Data

4.5 Constructing a Low-Dimensional Display

4.5.1 Standard Coordinates

4.5.2 Principal Coordinates

4.5.3 Practicalities of the Ordered Principal Coordinates

4.6 The Biplot Display

4.6.1 Definition

4.6.2 Ordered Column Isometric Biplot

4.6.3 Ordered Row Isometric Biplot

4.6.4 Ordered Isometric Biplots for the Temperature Data

4.7 Final Comments

5 Ordered Non-symmetrical Correspondence Analysis

5.1 Introduction

5.2 The Goodman-Kruskal tau Index Revisited

5.3 Decomposing ¿¿¿¿i|j for Ordinal and Nominal Variables

5.3.1 The Hybrid Decomposition of ¿¿¿¿i|j

5.3.2 Hybrid Decomposition and the Shoplifting Data

5.4 Constructing a Low-Dimensional Display

5.4.1 Standard Coordinates

5.4.2 Principal Coordinates

5.5 The Biplot

5.5.1 An Overview

5.5.2 Column Isometric Biplot

5.5.3 Column Isometric Biplot of the Shoplifting Data

5.5.4 Row Isometric Biplot

5.5.5 Row Isometric Biplot of the Shoplifting Data

5.5.6 Distance Measures and the Row Isometric Biplots

5.6 Some Final Words

Part III Analysis of Multiple Categorical Variables

6 Multiple Correspondence Analysis

6.1 Introduction

6.2 Crisp Coding and the Indicator Matrix

6.2.1 Crisp Coding

6.2.2 The Indicator Matrix

6.2.3 Crisp Coding and the Alligator Data

6.2.4 Application of Multiple Correspondence Analysis using the Indicator Matrix

6.3 The Burt Matrix

6.4 Stacking

6.4.1 A Definition

6.4.2 Stacking and the Alligator Data - Lake(Size) × Food

6.4.3 Stacking and the Alligator Data - Food(Size) × Lake

6.5 Final Comments

7 Multi-way Correspondence Analysis

7.1 An Introduction

7.2 Pearson's Residual ¿¿¿¿ijk - 1 and the Partition of X²

7.2.1 The Pearson Residual

7.2.2 The Partition of X²

7.2.3 Partition of X² for the Alligator Data

7.3 Symmetric Multi-way Correspondence Analysis

7.3.1 Tucker3 Decomposition of ¿¿¿¿ijk - 1

7.3.2 T3D and the Analysis of Two Variables

7.3.3 On the Choice of the Number of Components

7.3.4 Tucker3 Decomposition of ¿¿¿¿ijk - 1 and the Alligator Data

7.4 Constructing a Low-Dimensional Display

7.4.1 Principal Coordinates

7.4.2 The Interactive Biplot

7.4.3 Column-Tube Interactive Biplot for the Alligator Data

7.4.4 Row Interactive Biplot for the Alligator Data

7.5 The Marcotorchino Residual ¿¿¿¿i|j,k and the Partition of τ_M

7.5.1 The Marcotorchino Residual

7.5.2 The Partition of τ_M

7.5.3 Partition of τ_M for the Alligator Data

7.6 Non-symmetrical Multi-way Correspondence Analysis

7.6.1 Tucker3 Decomposition of ¿¿¿¿i|j,k

7.6.2 Tucker3 Decomposition of ¿¿¿¿i|j,k and the Alligator Data

7.7 Constructing a Low-Dimensional Display

7.7.1 On the Choice of Coordinates

7.7.2 Column-Tube Interactive Biplot for the Alligator Data

7.8 Final Comments

References

Author Index

Subject Index

1 Introduction

1.1 Data Visualisation

Every statistical technique has a long and interesting history. Studying how to numerically and graphically analyse the association between categorical variables is no exception. The contributions of some of the most influential statisticians, including Karl Pearson, R.A. Fisher and G.U. Yule, have left an indelible imprint on how categorical data analysis is performed. Excellent descriptions of the historical development of categorical data analysis, in particular the analysis of contingency tables, can be found by referring to, for example, Goodman and Kruskal (1954) and Agresti (2002, Chapter 16). The influence of the early pioneers has led to almost countless statistical techniques that measure, model, visualise and further scrutinise how categorical variables are related to each other. Much of the key focus has been on the numerical assessment of the strength of the association between the variables - whether the analysis is concerned with two, three or more variables. Yule and Kendall (1950), Bishop et al. (1975) and Liebetrau (1983) also provided excellent discussions of a large number of measures of association for contingency tables. The most influential and widely adopted statistical technique for analysing the association between categorical variables is Pearson's chi-squared statistic (Pearson 1904). The importance and wide applicability of this statistic has been discussed vigorously throughout the literature - see, for example, Lancaster (1969) and Greenwood and Nikulin (1996). The statistic, simply put, is defined as

$$X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the sum is taken over all cells of the table, "Observed" refers to the observed count made in each cell of a table and "Expected" is its expected value under some model (even if that model reflects independence between the variables). While this statistic can detect if there is a statistically significant association between the variables, it does not say anything more about the structure of the association. Various techniques may be considered for examining exactly how the association is structured. These include simple measures such as the product moment correlation (Pearson 1895), which will not only determine the strength of the association but also its direction. Model-based approaches such as log-linear models and logistic models are commonly taught as a means of numerically assessing the nature of the association.
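As a rough illustration of the statistic described above, the following Python sketch computes the expected counts under independence (row total times column total, divided by the grand total) and then sums the squared differences between observed and expected counts, each divided by the expected count. The 2 x 2 table and the function names `expected_counts` and `pearson_chi_squared` are hypothetical and are not taken from the book.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_squared(table):
    """Sum of (Observed - Expected)^2 / Expected over every cell of the table."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table of counts, used purely for illustration.
observed = [[10, 20],
            [30, 40]]

print(expected_counts(observed))                # expected counts under independence
print(round(pearson_chi_squared(observed), 4))  # value of the statistic
```

A table whose rows are exactly proportional to one another, such as `[[10, 20], [20, 40]]`, gives a statistic of zero, since every observed count then equals its expected count.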

Despite the importance of modelling in statistics and its allied fields, there are two issues that need to be considered. Firstly, elementary statistics courses worldwide teach students about the importance of visualising the structure of the data as a means of "seeing" what it looks like before resorting to inferential techniques; this might be through constructing a bar chart, histogram or boxplot of the data. However, in practice many statistical techniques for categorical data (of course, not all) ignore this visual component altogether and go straight to modelling the structure. Secondly, modelling techniques rely on methodological assumptions about the data, or on the behaviour of the data as perceived by the analyst. Such thoughts are elegantly, and simply, captured in George Box's (1979) famous quote

All models are wrong but some are useful

Earlier, Box (1976) had said

Since all models are wrong, the scientist cannot obtain a "correct" one by excessive elaboration.

Of course, such general phrases have caused a stir amongst the statistical community since a model can never fully capture the "truth" of a phenomenon. We certainly see many advantages in the wide range, and flexibility, of models that are now available but we urge caution when adopting some of them.

An alternative philosophy that can be adopted for assessing the association between the variables of a contingency table is to explore how they are associated with each other by visualising the association. There is now a plethora of strategies available for visualising numerical and categorical data. Some of the more popular approaches include the mosaic plot (Friendly 2000, 2002; Theus 2012), the four-fold display (Fienberg 1975) and the cobweb diagram (Upton 2000). The interested reader may also refer to Gabriel (2002) and Wegman and Solka (2002) for the visualisation of multivariate data. The key features of any graphical summary are that what is produced is simple, easy to interpret, and provides a quick and accurate visual representation of data. Cook and Weisberg (1999, p. 29) say of any graphical summary

In statistical graphics, information is contained in observable shapes and patterns. The task of the creator of a graph is to construct an informative view of the data that is appropriately grounded in a statistical context. The task of the viewer is to find the patterns, and then to interpret their meaning in the same context. Just as an interpretation of a painting or drawing requires understanding of the artist's context, interpreting a graph requires an understanding of the statistical context that surrounds the graph. As in art, conclusions about a graph without understanding the context are likely to be wrong, or off the point at best.

A very good example of the interplay between data visualisation and statistical context can be found by considering Anscombe's quartet (Anscombe 1973). While discussing his point in terms of simple linear regression, Anscombe (1973) provided a compelling argument for the need to visualise data by highlighting four very different scatterplots with equal correlations and equal parameter estimates from a simple linear regression model. His argument shows that the context of the statistical technique needs to be made in terms of the data being analysed, and a visualisation of this context can help the analyst to better understand the statistical and practical contexts of the data being analysed.
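The point about Anscombe's quartet can be checked numerically. The Python sketch below uses the quartet's values as they are commonly reproduced from Anscombe (1973) and computes, for each of the four data sets, the correlation and the least-squares line. The helper name `summarise` is hypothetical. Each data set gives a correlation of roughly 0.82 and a fitted line close to y = 3.00 + 0.50x, even though their scatterplots look nothing alike.

```python
from math import sqrt

# Anscombe's quartet: the first three data sets share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return (correlation, slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sxy / sqrt(sxx * syy), slope, intercept


for name, (x, y) in QUARTET.items():
    r, slope, intercept = summarise(x, y)
    print(f"{name}: r = {r:.2f}, fitted line y = {intercept:.2f} + {slope:.2f}x")
```

The numerical summaries alone cannot distinguish the four data sets; only plotting them reveals the curvature, the outlier and the single high-leverage point that Anscombe used to make his argument.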

1.2 Correspondence Analysis in a "Nutshell"

So where does correspondence analysis fit into this discussion? It is first important to recognise that often the first task in assessing the association structure between categorical variables is to either model or measure this association, with the structure reflected in the sign and magnitude of a numerical measure. However, as we shall explore in this book, correspondence analysis (in a nutshell) provides a way to visualise the association between two or more categorical variables that form a contingency table. In doing so we gain an understanding of how particular categories from the same variable, or from different variables, "correspond" to each other. From such visual summaries, one can better understand how the variables (and categories) under inspection are associated. In doing so, the analyst can then refine their research question and postulate other structures that may exist in the data. This is all undertaken without the need to make any assumption about the structure of the data, nor does one need to impose untestable, unnecessary, or unnecessarily complicated assumptions on the data (or on the technique). The analyst, whether they are of a technical or practical persuasion, need not rely on a suite of numbers to interpret the association between the variables (unless they want to, of course). Therefore, correspondence analysis is a technique that allows the data to inform the analyst of what it is trying to say rather than the model defining how the structure may be defined. The philosophy of letting the "data speak for itself" in correspondence analysis harks back to Jean-Paul Benzécri and his team at the University of Paris, France. Thus, Benzécri is considered to be the father of correspondence analysis although, in truth, many of the technical (and not visual) features stem back to earlier times.

Since the early work of Benzécri and his team, the development of correspondence analysis and its many variants has been dominant in many parts of the European statistical, and allied, communities. This is especially so in France, Italy, The Netherlands and Spain. Outside of Europe, it has developed due to the contribution of researchers in Great Britain, Japan and, to a lesser extent, the USA. Unfortunately, in the Australasian region, correspondence analysis has not received the same level of attention as in other parts of the world.

Before we continue with our discussion of correspondence analysis, it is worth highlighting that there are many excellent texts on its historical, computational, practical and theoretical development. The first major work that helped to expose correspondence analysis to the English speaking/reading statistical world was that of Hill (1974). Interestingly, he titled his paper "Correspondence analysis: A neglected multivariate method", which was published in the Journal of the Royal Statistical Society, Series C (Applied Statistics). Since then, the growth of correspondence analysis has been quite slow, but further insight was made 10 years later with the publication of a book by Michael Greenacre. This book, titled Theory and Applications of Correspondence Analysis, was published by Academic Press and remains the most cited book of all on the topic. It brought correspondence analysis out of the (mainly) French statistical literature and exposed it to the vast English reading/speaking research community; it is thus considered a landmark publication in correspondence analysis. Another excellent book is that of Lebart et al....
