
Data Analysis and Applications 2


Persons
Christos H. Skiadas is the Founder and former Director of the Data Analysis and Forecasting Laboratory at the Technical University of Crete, Greece. He continues his work at the university at the ManLab in the Department of Production Engineering and Management.
James R. Bozeman holds a PhD in Mathematics from Dartmouth College, USA, and is Professor of Mathematics at the American University of Malta.
Content
Preface xi
Introduction xiii Gilbert SAPORTA
Part 1 Applications 1
Chapter 1 Context-specific Independence in Innovation Study 3 Federica NICOLUSSI and Manuela CAZZARO
1.1 Introduction 3
1.2 Parametrization for CS independencies 4
1.3 Stratified chain graph models 6
1.4 Application on real data 7
1.5 Conclusion 12
1.6 References 12
Chapter 2 Analysis of the Determinants and Outputs of Innovation in the Nordic Countries 15 Cátia ROSÁRIO, António Augusto COSTA and Ana LORGA DA SILVA
2.1 Introduction 15
2.2 Innovation 16
2.3 Methodology 19
2.4 Results 21
2.5 Conclusion 25
2.6 References 26
Chapter 3 Bibliometric Variables Determining the Quality of a Dentistry Journal 29 Pilar VALDERRAMA, Manuel ESCABIAS, Evaristo JIMÉNEZ-CONTRERAS, Mariano J. VALDERRAMA and Pilar BACA
3.1 Introduction 29
3.2 Statistical methodology 30
3.3 Results 32
3.4 Conclusions 35
3.5 Acknowledgment 35
3.6 References 36
Chapter 4 Analysis of Dependence among Growth Rates of GDP of V4 Countries Using Four-dimensional Vine Copulas 37 Jozef KOMORNÍK, Magda KOMORNÍKOVÁ and Tomáš BACIGÁL
4.1 Introduction 37
4.2 Theory 38
4.3 Results 42
4.4 Conclusion and future work 45
4.5 Acknowledgment 47
4.6 References 47
Chapter 5 Monitoring the Compliance of Countries on Emissions Mitigation Using Dissimilarity Indices 49 Eleni KETZAKI, Stavros RALLAKIS, Nikolaos FARMAKIS and Eftichios SARTZETAKIS
5.1 Introduction 49
5.2 The proposed method 50
5.2.1 Description of method for individual data 51
5.2.2 Description of method for grouped data 52
5.3 Application of method 53
5.3.1 Application of method for individual data 54
5.3.2 Application of method for grouped data 55
5.4 Conclusions 55
5.5 Appendix 57
5.6 References 58
Chapter 6 Maximum Entropy and Distributions of Five-Star Ratings 59 Yiannis DIMOTIKALIS
6.1 Introduction 59
6.2 Entropy framework to five-star ratings 60
6.3 Maximum entropy of ratings for values k = 1, 2, 3, ..., 30 66
6.3.1 Ratings with two outcomes (k = 1) 66
6.3.2 Ratings with three outcomes (k = 2) 69
6.3.3 Ratings with four outcomes (k = 3) 73
6.3.4 Ratings with five outcomes (k = 4) 76
6.3.5 Ratings with more than five outcomes (k > 4) 80
6.3.6 Maximum entropy constraints for the binomial distribution 82
6.4 Application to real five-star rating data 83
6.5 Conclusions 86
6.6 References 86
Part 2 The Impact of the Economic and Financial Crisis in Europe 89
Chapter 7 Access to Credit for SMEs after the 2008 Financial Crisis: The Northern Italian Perspective 91 Cinzia COLAPINTO and Mariangela ZENGA
7.1 Introduction 91
7.2 Italian SMEs and access to credit 92
7.3 The data 93
7.4 Methodology 94
7.5 Analysis and discussion 97
7.5.1 The measure for the Great Recession period (2008-2012) 97
7.5.2 The measure for the recovery period (2013-2015) 99
7.5.3 Comparing the two crisis phases 102
7.6 Conclusion 105
7.7 References 105
Chapter 8 Gender-Based Differences in the Impact of the Economic Crisis on Labor Market Flows in Southern Europe 107 Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU
8.1 Introduction 107
8.2 Data, methods and limitations 108
8.3 Results 111
8.4 Conclusions and discussion 111
8.5 References 119
Chapter 9 Measuring Labor Market Transition Probabilities in Europe with Evidence from the EU-SILC 121 Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU
9.1 Introduction 121
9.2 Data, methods and limitations 122
9.3 Results 124
9.4 Conclusions 135
9.5 References 135
Part 3 Student Assessment and Employment in Europe 137
Chapter 10 Almost Graduated, Close to Employment? Taking into Account the Characteristics of Companies Recruiting at a University Job Placement Office 139 Franca CRIPPA, Mariangela ZENGA and Paolo MARIANI
10.1 Introduction 139
10.2 Recruiters and graduates seeking an HEI common ground 140
10.3 Web survey pitfalls: considerations for data collection 141
10.4 Sampled recruiters: an outline 144
10.5 Conclusion 146
10.6 References 146
Chapter 11 How Variation of Scores of the Programme for International Student Assessment can be Explained through Analysis of Information 149 Valérie GIRARDIN, Justine LEQUESNE and Olivier THÉVENON
11.1 Introduction 149
11.2 Multiplicative models and Zighera's parameterization 151
11.3 Application to PISA surveys 155
11.3.1 Data and variables 155
11.3.2 Analysis of scores in mathematics 157
11.3.3 Conclusion 162
11.4 References 163
Part 4 Visualization 165
Chapter 12 A Topological Discriminant Analysis 167 Rafik ABDESSELAM
12.1 Introduction 167
12.2 Topological equivalence 168
12.3 Topological discriminant analysis 171
12.4 Application example 173
12.5 Conclusion and perspectives 175
12.6 Appendix 176
12.7 References 178
Chapter 13 Using Graph Partitioning to Calculate PageRank in a Changing Network 179 Christopher ENGSTRÖM and Sergei SILVESTROV
13.1 Introduction 179
13.1.1 Computing PageRank 181
13.2 Changes in personalization vector 182
13.3 Adding or removing edges between components 184
13.3.1 Computations in practice 186
13.3.2 Adding or removing an edge inside a component 187
13.3.3 Maintaining the component structure 189
13.4 Conclusions 190
13.5 References 191
Chapter 14 Visualizing the Political Spectrum of Germany by Contiguously Ordering the Party Policy Profiles 193 Andranik TANGIAN
14.1 Introduction 193
14.2 The model 195
14.3 Conclusions 206
14.4 References 206
List of Authors 209
Index 213
Introduction
50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning
In 1962, J.W. Tukey wrote his famous paper "The Future of Data Analysis" and promoted exploratory data analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit, J.P. Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations have occurred, but the basic methods (SVD, k-means, etc.) are still incredibly efficient in the Big Data era.
On the other hand, algorithmic modeling or machine learning is successful in predictive modeling, the goal being accuracy and not interpretability. Supervised learning proves in many applications that it is not necessary to understand, when one needs only predictions.
However, considering some failures and flaws, we advocate that a better understanding may improve prediction. Causal inference for Big Data is probably the challenge of the coming years.
It is a little presumptuous to attempt a panorama of 50 years of data analysis when David Donoho (2017) has just published a paper entitled "50 Years of Data Science". But 1968 is the year when I began my studies as a statistician, and I would very much like to talk about the debates of the time and the digital revolution that profoundly transformed statistics and which I witnessed. The terminology followed this evolution-revolution: from data analysis to data mining and then to data science, while we went from a time when asymptotics began at 30 observations with a few variables to the era of Big Data and high dimension.
I.1. The revolt against mathematical statistics
Since the 1960s, the availability of data has led to an international movement back to the sources of statistics ("let the data speak") and to sometimes fierce criticisms of an abusive formalization. Along with John Tukey, cited above, here is a portrait gallery of some well-known protagonists in the United States, France, Japan, the Netherlands and Italy (for a color version of this figure, see www.iste.co.uk/skiadas/data2.zip).
And an anthology of quotes:
He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis. (Anscombe 1967)
Statistics is not probability. Under the name of mathematical statistics, a pompous discipline was built, based on theoretical assumptions that are rarely met in practice. (Benzécri 1972)
The models should follow the data, not vice versa. (Benzécri 1972)
Using the computer implies abandoning all the techniques designed before the age of computing. (Benzécri 1972)
Statistics is intimately connected with science and technology, and few mathematicians have experience or understanding of the methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc., and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences [...]. The apparatus on which their statistics course has been constructed is often worse than irrelevant - it is misleading about what is important in examining data and making inferences. (Nelder 1985)
Data analysis was basically descriptive and non-probabilistic, in the sense that no reference was made to the data-generating mechanism. Data analysis favors algebraic and geometrical tools of representation and visualization.
This movement has resulted in conferences, especially in Europe. In 1977, E. Diday and L. Lebart initiated a series entitled Data Analysis and Informatics, and in 1981, J. Janssen launched the biennial ASMDA conferences (Applied Stochastic Models and Data Analysis), which still continue.
The principles of data analysis inspired those of data mining, which developed in the 1990s on the border between databases, information technology and statistics. Fayyad (1995) is credited with the following definition: "Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Hand et al. (2000) specified: "I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets".
The metaphor of data mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools. Data mining is generally concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected not primarily for analysis, but for the management of individual cases. Data mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand et al. 2000).
I.2. EDA and unsupervised methods for dimension reduction
Essentially, exploratory methods of data analysis are dimension reduction methods: unsupervised classification or clustering methods operate on the number of statistical units, whereas factorial methods reduce the number of variables by searching for linear combinations associated with new axes of the space of individuals.
I.2.1. The time of syntheses
It was quickly realized that all the methods looking for eigenvalues and eigenvectors of matrices related to the dispersion of a cloud (total or intra) or of correlation matrices could be expressed as special cases of certain techniques.
Correspondence analyses (single and multiple) and canonical discriminant analysis are particular principal component analyses. It suffices to extend the classical Principal Components Analysis (PCA) by weighting the units and introducing metrics. The duality scheme introduced by Cailliez and Pagès (1976) is an abstract way of representing the relationships between arrays, matrices and associated spaces. The paper by De la Cruz and Holmes (2011) brought it back to light.
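The idea of extending PCA with unit weights and a metric can be illustrated with a minimal NumPy sketch. The function name `generalized_pca` and its arguments are hypothetical, chosen for illustration; the sketch assumes the metric is a symmetric positive definite matrix and is not taken from the cited papers.

```python
import numpy as np

def generalized_pca(X, weights=None, metric=None):
    """PCA of a data table X with unit weights (D) and a metric M on the variables.

    Assumes `metric` is symmetric positive definite. Returns the eigenvalues
    (in decreasing order) and the principal components.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                      # weights sum to one
    M = np.eye(p) if metric is None else np.asarray(metric, dtype=float)

    Xc = X - w @ X                       # centre on the weighted mean
    V = Xc.T @ (w[:, None] * Xc)         # weighted covariance matrix X' D X

    # Symmetric square root of the metric, so that eigh can be used.
    m_vals, m_vecs = np.linalg.eigh(M)
    M_half = m_vecs @ np.diag(np.sqrt(m_vals)) @ m_vecs.T

    lam, B = np.linalg.eigh(M_half @ V @ M_half)
    order = np.argsort(lam)[::-1]
    lam, B = lam[order], B[:, order]

    components = Xc @ M_half @ B         # weighted variance of each column equals lam
    return lam, components

# Illustrative data only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
lam, comps = generalized_pca(X)                                  # ordinary PCA
lam_std, _ = generalized_pca(X, metric=np.diag(1.0 / X.var(axis=0)))  # standardized PCA
```

With uniform weights and the identity metric this reduces to ordinary PCA on the covariance matrix; with the diagonal metric of inverse variances it reduces to PCA on the correlation matrix.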
From another point of view (Bouroche and Saporta 1983), the main factorial methods, PCA and Multiple Correspondence Analysis (MCA), as well as multiple regression, are particular cases of canonical correlation analysis.
Another synthesis comes from the generalization of canonical correlation analysis to several groups of variables introduced by J.D. Carroll (1968). Given p blocks of variables Xj, we look for components z maximizing the following criterion: $\sum_{j=1}^{p} R^2(z, X_j)$.
The extension of this criterion in the form $\max_Y \sum_{j=1}^{p} F(Y, X_j)$, where F is an adequate measure of association, leads to the maximum association principle (Tenenhaus 1977; Marcotorchino 1986; Saporta 1988), which also includes the case of k-means partitioning.
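As an illustration of the generalized canonical analysis idea, the sketch below assumes the criterion is the sum over blocks of the squared multiple correlation between z and each block, the usual statement of Carroll's method. Under that assumption, the maximizing component is the leading eigenvector of the sum of the orthogonal projectors onto the centered blocks. The function names `projector`, `carroll_gca` and `sum_r2` are hypothetical, and each block is assumed to have full column rank.

```python
import numpy as np

def projector(X):
    """Orthogonal projector onto the column space of the centered block X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    Q, _ = np.linalg.qr(Xc)          # orthonormal basis (full column rank assumed)
    return Q @ Q.T

def carroll_gca(blocks):
    """Return the component z maximizing the sum of R^2(z, X_j) and the maximum value."""
    P = sum(projector(X) for X in blocks)
    evals, evecs = np.linalg.eigh(P)  # eigenvalues in ascending order
    return evecs[:, -1], evals[-1]

def sum_r2(z, blocks):
    """Sum over blocks of the squared multiple correlation between z and the block."""
    zc = z - z.mean()
    return sum(zc @ projector(X) @ zc for X in blocks) / (zc @ zc)

# Illustrative data only: three blocks observed on the same 40 units.
rng = np.random.default_rng(1)
blocks = [rng.normal(size=(40, 2)), rng.normal(size=(40, 3)), rng.normal(size=(40, 2))]
z, best = carroll_gca(blocks)
```

The value `best` lies between 1 and the number of blocks, since each squared multiple correlation is at most 1 and z can always be chosen inside one block's column space.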
The PLS approach to structural equation modeling also provides a global framework for many linear methods, as has been shown by Tenenhaus (1999) and Tenenhaus and Tenenhaus (2011).
Table I.1. Various cases of the maximum association principle
Criterion                     Analysis
with xj numerical             PCA
with xj categorical           MCA
with Xj data set              GCA (Carroll)
with Y and xj categorical     Central partition
with rank orders              Condorcet aggregation rule
I.2.2. The time of clusterwise methods
The search for partitions in k classes of a set of units belonging to a Euclidean space is most often done using the k-means algorithm: this method converges very quickly, even for large sets of data, but not necessarily toward the global optimum. Under the name of dynamic clustering, Diday (1971) has proposed multiple extensions, where the representatives of classes can be groups of points, varieties, etc. The simultaneous search for k classes and local models by alternating k-means and modeling is a geometric and non-probabilistic way of addressing mixture problems. Clusterwise regression is the best-known case: in each class, a regression model is fitted and the assignment to the classes is done according to the best model. Clusterwise methods allow for non-observable heterogeneity and are particularly useful for large data sets where the relevance of a simple and global model is questionable. In the 1970s, Diday and his collaborators developed "typological" approaches for most linear techniques: PCA, regression (Charles 1977), discrimination. These methods are again the subject of numerous publications in association with functional data (Preda and Saporta 2005), symbolic data (de Carvalho et al. 2010) and in multiblock cases (De Roover et al. 2012; Bougeard et al. 2017).
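The alternation between fitting local models and reassigning units can be sketched as follows for clusterwise linear regression with a single predictor. This is a minimal illustration under stated assumptions, not the procedure of any cited author: the function name `clusterwise_regression`, the random initial partition and the stopping rule are all hypothetical choices. Like k-means, it can stop at a local optimum that depends on the initial partition.

```python
import numpy as np

def clusterwise_regression(x, y, k=2, n_iter=50, seed=0):
    """Alternate between per-class least squares fits and reassignment of units."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones_like(x), x])    # design matrix: intercept and slope
    labels = rng.integers(k, size=len(x))        # random initial partition
    coefs = np.zeros((k, 2))
    for _ in range(n_iter):
        # Modeling step: fit one regression line per class.
        for c in range(k):
            mask = labels == c
            if mask.sum() >= 2:                  # keep the old line if the class is too small
                coefs[c] = np.linalg.lstsq(A[mask], y[mask], rcond=None)[0]
        # Assignment step: each unit goes to the class whose line fits it best.
        resid = (y[:, None] - A @ coefs.T) ** 2
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

# Illustrative data only: a mixture of two lines with small noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)
truth = rng.integers(2, size=80)
y = np.where(truth == 0, 2.0 * x + 1.0, -1.0 * x + 5.0) + rng.normal(scale=0.1, size=80)
labels, coefs = clusterwise_regression(x, y, k=2)
```

In practice, several random starts are usually compared and the partition with the smallest total squared residual is kept.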
I.2.3. Extensions to new types of data
I.2.3.1. Functional data
Jean-Claude Deville (1974) showed that the Karhunen-Loève decomposition was nothing other than the PCA of the trajectories of a process, opening the way to functional data analysis (Ramsay and Silverman 1997). The number of variables being...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.