
Data Analysis and Applications 1
Description
Volume 1 begins with an introductory chapter by Gilbert Saporta, a leading expert in the field, who summarizes the developments in data analysis over the last 50 years. The book is then divided into three parts: Part 1 presents clustering and regression cases; Part 2 examines grouping and decomposition, GARCH and threshold models, structural equations, and SME modeling; and Part 3 presents symbolic data analysis, time series and multiple choice models, modeling in demography, and data mining.


Persons
Christos H. Skiadas is the Founder and former Director of the Data Analysis and Forecasting Laboratory at the Technical University of Crete, Greece. He continues his work at the university at the ManLab in the Department of Production Engineering and Management.
James R. Bozeman holds a PhD in Mathematics from Dartmouth College, USA, and is Professor of Mathematics at the American University of Malta.
Content
Preface xi
Introduction xv
Gilbert SAPORTA
Part 1 Clustering and Regression 1
Chapter 1 Cluster Validation by Measurement of Clustering Characteristics Relevant to the User 3
Christian HENNIG
1.1 Introduction 3
1.2 General notation 5
1.3 Aspects of cluster validity 6
1.3.1 Small within-cluster dissimilarities 6
1.3.2 Between-cluster separation 7
1.3.3 Representation of objects by centroids 7
1.3.4 Representation of dissimilarity structure by clustering 8
1.3.5 Small within-cluster gaps 9
1.3.6 Density modes and valleys 9
1.3.7 Uniform within-cluster density 12
1.3.8 Entropy 12
1.3.9 Parsimony 13
1.3.10 Similarity to homogeneous distributional shapes 13
1.3.11 Stability 13
1.3.12 Further Aspects 14
1.4 Aggregation of indexes 14
1.5 Random clusterings for calibrating indexes 15
1.5.1 Stupid K-centroids clustering 16
1.5.2 Stupid nearest neighbors clustering 16
1.5.3 Calibration 17
1.6 Examples 18
1.6.1 Artificial data set 18
1.6.2 Tetragonula bees data 20
1.7 Conclusion 22
1.8 Acknowledgment 23
1.9 References 23
Chapter 2 Histogram-Based Clustering of Sensor Network Data 25
Antonio BALZANELLA and Rosanna VERDE
2.1 Introduction 25
2.2 Time series data stream clustering 28
2.2.1 Local clustering of histogram data 30
2.2.2 Online proximity matrix updating 32
2.2.3 Off-line partitioning through the dynamic clustering algorithm for dissimilarity tables 33
2.3 Results on real data 34
2.4 Conclusions 36
2.5 References 36
Chapter 3 The Flexible Beta Regression Model 39
Sonia MIGLIORATI, Agnese M. DI BRISCO and Andrea ONGARO
3.1 Introduction 39
3.2 The FB distribution 41
3.2.1 The beta distribution 41
3.2.2 The FB distribution 41
3.2.3 Reparameterization of the FB 42
3.3 The FB regression model 43
3.4 Bayesian inference 44
3.5 Illustrative application 47
3.6 Conclusion 48
3.7 References 50
Chapter 4 S-weighted Instrumental Variables 53
Jan Ámos VÍŠEK
4.1 Summarizing the previous relevant results 53
4.2 The notations, framework, conditions and main tool 55
4.3 S-weighted estimator and its consistency 57
4.4 S-weighted instrumental variables and their consistency 59
4.5 Patterns of results of simulations 64
4.5.1 Generating the data 65
4.5.2 Reporting the results 66
4.6 Acknowledgment 69
4.7 References 69
Part 2 Models and Modeling 73
Chapter 5 Grouping Property and Decomposition of Explained Variance in Linear Regression 75
Henri WALLARD
5.1 Introduction 75
5.2 CAR scores 76
5.2.1 Definition and estimators 76
5.2.2 Historical criticism of the CAR scores 79
5.3 Variance decomposition methods and SVD 79
5.4 Grouping property of variance decomposition methods 80
5.4.1 Analysis of grouping property for CAR scores 81
5.4.2 Demonstration with two predictors 82
5.4.3 Analysis of grouping property using SVD 83
5.4.4 Application to the diabetes data set 86
5.5 Conclusions 87
5.6 References 88
Chapter 6 On GARCH Models with Temporary Structural Changes 91
Norio WATANABE and Fumiaki OKIHARA
6.1 Introduction 91
6.2 The model 92
6.2.1 Trend model 92
6.2.2 Intervention GARCH model 93
6.3 Identification 96
6.4 Simulation 96
6.4.1 Simulation on trend model 96
6.4.2 Simulation on intervention trend model 98
6.5 Application 98
6.6 Concluding remarks 102
6.7 References 103
Chapter 7 A Note on the Linear Approximation of TAR Models 105
Francesco GIORDANO, Marcella NIGLIO and Cosimo Damiano VITALE
7.1 Introduction 105
7.2 Linear representations and linear approximations of nonlinear models 107
7.3 Linear approximation of the TAR model 109
7.4 References 116
Chapter 8 An Approximation of Social Well-Being Evaluation Using Structural Equation Modeling 117
Leonel SANTOS-BARRIOS, Monica RUIZ-TORRES, William GÓMEZ-DEMETRIO, Ernesto SÁNCHEZ-VERA, Ana LORGA DA SILVA and Francisco MARTÍNEZ-CASTAÑEDA
8.1 Introduction 117
8.2 Wellness 118
8.3 Social welfare 118
8.4 Methodology 119
8.5 Results 120
8.6 Discussion 123
8.7 Conclusions 123
8.8 References 123
Chapter 9 An SEM Approach to Modeling Housing Values 125
Jim FREEMAN and Xin ZHAO
9.1 Introduction 125
9.2 Data 126
9.3 Analysis 127
9.4 Conclusions 134
9.5 References 135
Chapter 10 Evaluation of Stopping Criteria for Ranks in Solving Linear Systems 137
Benard ABOLA, Pitos BIGANDA, Christopher ENGSTRÖM and Sergei SILVESTROV
10.1 Introduction 137
10.2 Methods 139
10.2.1 Preliminaries 139
10.2.2 Iterative methods 140
10.3 Formulation of linear systems 142
10.4 Stopping criteria 143
10.5 Numerical experimentation of stopping criteria 146
10.5.1 Convergence of stopping criterion 147
10.5.2 Quantiles 147
10.5.3 Kendall correlation coefficient as stopping criterion 148
10.6 Conclusions 150
10.7 Acknowledgments 151
10.8 References 151
Chapter 11 Estimation of a Two-Variable Second-Degree Polynomial via Sampling 153
Ioanna PAPATSOUMA, Nikolaos FARMAKIS and Eleni KETZAKI
11.1 Introduction 153
11.2 Proposed method 154
11.2.1 First restriction 154
11.2.2 Second restriction 155
11.2.3 Third restriction 156
11.2.4 Fourth restriction 156
11.2.5 Fifth restriction 157
11.2.6 Coefficient estimates 158
11.3 Experimental approaches 159
11.3.1 Experiment A 159
11.3.2 Experiment B 161
11.4 Conclusions 163
11.5 References 163
Part 3 Estimators, Forecasting and Data Mining 165
Chapter 12 Displaying Empirical Distributions of Conditional Quantile Estimates: An Application of Symbolic Data Analysis to the Cost Allocation Problem in Agriculture 167
Dominique DESBOIS
12.1 Conceptual framework and methodological aspects of cost allocation 167
12.2 The empirical model of specific production cost estimates 168
12.3 The conditional quantile estimation 169
12.4 Symbolic analyses of the empirical distributions of specific costs 170
12.5 The visualization and the analysis of econometric results 172
12.6 Conclusion 178
12.7 Acknowledgments 179
12.8 References 179
Chapter 13 Frost Prediction in Apple Orchards Based upon Time Series Models 181
Monika A. TOMKOWICZ and Armin O. SCHMITT
13.1 Introduction 181
13.2 Weather database 182
13.3 ARIMA forecast model 183
13.3.1 Stationarity and differencing 184
13.3.2 Non-seasonal ARIMA models 186
13.4 Model building 188
13.4.1 ARIMA and LR models 188
13.4.2 Binary classification of the frost data 189
13.4.3 Training and test set 189
13.5 Evaluation 189
13.6 ARIMA model selection 190
13.7 Conclusions 192
13.8 Acknowledgments 193
13.9 References 193
Chapter 14 Efficiency Evaluation of Multiple-Choice Questions and Exams 195
Evgeny GERSHIKOV and Samuel KOSOLAPOV
14.1 Introduction 195
14.2 Exam efficiency evaluation 196
14.2.1 Efficiency measures and efficiency weighted grades 196
14.2.2 Iterative execution 198
14.2.3 Postprocessing 199
14.3 Real-life experiments and results 200
14.4 Conclusions 203
14.5 References 204
Chapter 15 Methods of Modeling and Estimation in Mortality 205
Christos H. SKIADAS and Konstantinos N. ZAFEIRIS
15.1 Introduction 205
15.2 The appearance of life tables 206
15.3 On the law of mortality 207
15.4 Mortality and health 211
15.5 An advanced health state function form 217
15.6 Epilogue 220
15.7 References 221
Chapter 16 An Application of Data Mining Methods to the Analysis of Bank Customer Profitability and Buying Behavior 225
Pedro GODINHO, Joana DIAS and Pedro TORRES
16.1 Introduction 225
16.2 Data set 227
16.3 Short-term forecasting of customer profitability 230
16.4 Churn prediction 235
16.5 Next-product-to-buy 236
16.6 Conclusions and future research 238
16.7 References 239
List of Authors 241
Index 245
Introduction
50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning
In 1962, J.W. Tukey wrote his famous paper "The Future of Data Analysis" and promoted exploratory data analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit, J.P. Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations occurred, but the basic methods (SVD, k-means, etc.) are still incredibly efficient in the Big Data era.
On the other hand, algorithmic modeling or machine learning is successful in predictive modeling, the goal being accuracy and not interpretability. Supervised learning proves in many applications that it is not necessary to understand, when one needs only predictions.
However, considering some failures and flaws, we advocate that a better understanding may improve prediction. Causal inference for Big Data is probably the challenge of the coming years.
It is a little presumptuous to attempt a panorama of 50 years of data analysis when David Donoho (2017) has just published a paper entitled "50 Years of Data Science". But 1968 is the year when I began my studies as a statistician, and I would very much like to talk about the debates of the time and the digital revolution that profoundly transformed statistics and which I witnessed. The terminology followed this evolution-revolution: from data analysis to data mining and then to data science, while we went from a time when the asymptotics began at 30 observations with a few variables to the era of Big Data and high dimension.
I.1. The revolt against mathematical statistics
Since the 1960s, the availability of data has led to an international movement back to the sources of statistics ("let the data speak") and to sometimes fierce criticisms of an abusive formalization. Along with John Tukey, who was cited above, here is a portrait gallery of some notable protagonists in the United States, France, Japan, the Netherlands and Italy (for a color version of this figure, see www.iste.co.uk/skiadas/data1.zip).
And an anthology of quotes:
He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis. (Anscombe 1967)
Statistics is not probability. Under the name of mathematical statistics, a pompous discipline was built on theoretical assumptions that are rarely met in practice. (Benzécri 1972)
The models should follow the data, not vice versa. (Benzécri 1972)
Using the computer implies abandoning all the techniques designed before the age of computing. (Benzécri 1972)
Statistics is intimately connected with science and technology, and few mathematicians have experience or understanding of the methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc., and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences [...]. The apparatus on which their statistics course has been constructed is often worse than irrelevant - it is misleading about what is important in examining data and making inferences. (Nelder 1985)
Data analysis was basically descriptive and non-probabilistic, in the sense that no reference was made to the data-generating mechanism. Data analysis favors algebraic and geometrical tools of representation and visualization.
This movement gave rise to conferences, especially in Europe. In 1977, E. Diday and L. Lebart initiated a series entitled Data Analysis and Informatics, and in 1981, J. Janssen launched the biennial ASMDA conferences (Applied Stochastic Models and Data Analysis), which still continue.
The principles of data analysis inspired those of data mining, which developed in the 1990s on the border between databases, information technology and statistics. Fayyad (1995) is credited with the following definition: "Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Hand et al. (2000) put it as follows: "I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets".
The metaphor of data mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools. Data mining is generally concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected not primarily for analysis, but for the management of individual cases. Data mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand et al. 2000).
I.2. EDA and unsupervised methods for dimension reduction
Essentially, exploratory methods of data analysis are dimension reduction methods: unsupervised classification or clustering methods operate on the number of statistical units, whereas factorial methods reduce the number of variables by searching for linear combinations associated with new axes of the space of individuals.
I.2.1. The time of syntheses
It was quickly realized that all the methods looking for eigenvalues and eigenvectors of matrices related to the dispersion of a cloud (total or intra) or of correlation matrices could be expressed as special cases of certain techniques.
Correspondence analyses (single and multiple) and canonical discriminant analysis are particular principal component analyses. It suffices to extend the classical Principal Components Analysis (PCA) by weighting the units and introducing metrics. The duality scheme introduced by Cailliez and Pagès (1976) is an abstract way of representing the relationships between arrays, matrices and associated spaces. The paper by De la Cruz and Holmes (2011) brought it back to light.
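As a rough illustration of what "weighting the units and introducing metrics" means in practice, the sketch below computes the inertia carried by the first principal axis for a triplet (X, M, D): data X, a metric M on the variables and unit weights D. It assumes a diagonal metric and uses plain power iteration on V M, where V = X' D X is the weighted covariance matrix. The function name `leading_inertia` and the toy data are hypothetical, not taken from the book.

```python
import math


def leading_inertia(rows, weights, metric_diag, n_iter=200):
    """Leading eigenvalue and eigenvector of V M, found by power iteration.

    rows        : list of observations, each a list of p numbers (X)
    weights     : unit weights summing to 1 (diagonal of D)
    metric_diag : diagonal of the metric M (diagonal M is an assumption here)
    """
    p = len(rows[0])
    # Weighted mean of each variable, then centering.
    means = [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Weighted covariance matrix V = X' D X.
    v = [[sum(w * r[a] * r[b] for w, r in zip(weights, centered))
          for b in range(p)] for a in range(p)]
    # Product V M; with a diagonal M this rescales the columns of V.
    vm = [[v[a][b] * metric_diag[b] for b in range(p)] for a in range(p)]
    u = [1.0] * p
    lam = 0.0
    for _ in range(n_iter):
        new = [sum(vm[a][b] * u[b] for b in range(p)) for a in range(p)]
        lam = math.sqrt(sum(x * x for x in new))
        if lam == 0.0:
            break
        u = [x / lam for x in new]
    return lam, u


# Two perfectly correlated variables measured on very different scales.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
weights = [0.25] * 4
# Identity metric: PCA on the covariance matrix.
print(leading_inertia(rows, weights, [1.0, 1.0])[0])
# Inverse-variance metric: equivalent to PCA on standardized variables.
print(leading_inertia(rows, weights, [1 / 1.25, 1 / 125.0])[0])
```

Changing only the metric switches between covariance-based and standardized PCA on the same data. With the inverse-variance metric, the total inertia equals the number of variables, so for this toy example the leading value should be close to 2.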
From another point of view (Bouroche and Saporta 1983), the main factorial methods PCA, Multiple Correspondence Analysis (MCA), as well as multiple regression are particular cases of canonical correlation analysis.
Another synthesis comes from the generalization of canonical correlation analysis to several groups of variables introduced by J.D. Carroll (1968). Given p blocks of variables $X_j$, we look for components $z$ maximizing the following criterion: $\sum_{j=1}^{p} R^2(z, X_j)$.
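Assuming the criterion is the sum of squared multiple correlations between z and each block, as in Carroll's generalized canonical analysis, the maximization reduces to an eigenproblem. The sketch below assumes centered blocks $X_j$ of full column rank with uniform unit weights, and introduces $P_j$ for the orthogonal projector onto the column space of $X_j$; this notation is an assumption made for illustration.

```latex
\begin{aligned}
P_j &= X_j \left(X_j^{\top} X_j\right)^{-1} X_j^{\top}, \\
R^2(z, X_j) &= \frac{z^{\top} P_j\, z}{z^{\top} z}, \\
\sum_{j=1}^{p} R^2(z, X_j) &= \frac{z^{\top} \left(\sum_{j=1}^{p} P_j\right) z}{z^{\top} z}.
\end{aligned}
```

The last expression is a Rayleigh quotient, so under these assumptions it is maximized by an eigenvector of $\sum_{j} P_j$ associated with its largest eigenvalue.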
The extension of this criterion in the form $\max_{Y} \sum_{j=1}^{p} F(Y, X_j)$, where $F$ is an adequate measure of association, leads to the maximum association principle (Tenenhaus 1977; Marcotorchino 1986; Saporta 1988), which also includes the case of k-means partitioning.
The PLS approach to structural equation modeling also provides a global framework for many linear methods, as has been shown by Tenenhaus (1999) and Tenenhaus and Tenenhaus (2011).
Table I.1. Various cases of the maximum association principle
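Criterion | Analysis
$\max \sum_{j} r^2(z, x_j)$ with $x_j$ numerical | PCA
$\max \sum_{j} \eta^2(z, x_j)$ with $x_j$ categorical | MCA
$\max \sum_{j} R^2(z, X_j)$ with $X_j$ data set | GCA (Carroll)
$\max \sum_{j} \mathrm{Rand}(Y, x_j)$ with $Y$ and $x_j$ categorical | Central partition
$\max \sum_{j} \tau(y, x_j)$ with rank orders | Condorcet aggregation rule
I.2.2. The time of clusterwise methods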
The search for partitions in k classes of a set of units belonging to a Euclidean space is most often done using the k-means algorithm: this method converges very quickly, even for large sets of data, but not necessarily toward the global optimum. Under the name of dynamic clustering, Diday (1971) has proposed multiple extensions, where the representatives of classes can be groups of points, varieties, etc. The simultaneous search for k classes and local models by alternating k-means and modeling is a geometric and non-probabilistic way of addressing mixture problems. Clusterwise regression is the best-known case: in each class, a regression model is fitted and the assignment to the classes is done according to the best model. Clusterwise methods allow for non-observable heterogeneity and are particularly useful for large data sets where the relevance of a simple and global model is questionable. In the 1970s, Diday and his collaborators developed "typological" approaches for most linear techniques: PCA, regression (Charles 1977), discrimination. These methods are again the subject of numerous publications in association with functional data (Preda and Saporta 2005), symbolic data (de Carvalho et al. 2010) and in multiblock cases (De Roover et al. 2012; Bougeard et al. 2017).
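The alternating scheme behind clusterwise regression can be sketched in a few lines. The code below is a minimal illustration, not the algorithm of any cited paper. It assumes a single predictor, ordinary least squares within each class, squared-residual reassignment and a user-supplied starting partition. The names `fit_line` and `clusterwise_regression` and the toy data are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares fit y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def clusterwise_regression(points, labels, max_iter=100):
    """Alternate per-class fits and reassignment until labels stop changing."""
    labels = list(labels)
    k = max(labels) + 1
    models = [(0.0, 0.0)] * k
    for _ in range(max_iter):
        # Modeling step: refit one regression line per class.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty class keeps its previous model
                models[c] = fit_line(members)
        # Assignment step: each unit goes to the class whose model fits it best.
        new_labels = [
            min(range(k),
                key=lambda c: (y - models[c][0] - models[c][1] * x) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, models


# Toy data: two hidden regimes, y = 1 + 2x and y = 20 - x, for x = 0..5.
points = [(x, 1 + 2 * x) for x in range(6)] + [(x, 20 - x) for x in range(6)]
# Starting partition with one unit of each regime deliberately mislabeled.
start = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
labels, models = clusterwise_regression(points, start)
print(labels)
print(models)
```

Like k-means, this alternation stops at a fixed point that depends on the starting partition, so in practice several random starts are usually compared on the total within-class residual sum of squares.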
I.2.3. Extensions to new types of data
I.2.3.1. Functional data
Jean-Claude Deville (1974) showed that the Karhunen-Loève decomposition was nothing other than the PCA of the trajectories of a process, opening the way to functional data analysis (Ramsay and Silverman 1997). The number of...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e. "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.