Data Analysis and Applications 2

Name: Data Analysis and Applications 2 | Utilization of Results in Europe and Other Topics
Brand: Wiley
Price: 139.99 EUR
Availability: OnlineOnly

Utilization of Results in Europe and Other Topics

Christos H. Skiadas, James R. Bozeman (Editors)

Wiley (Publisher)

1st Edition

Published on 7 March 2019

252 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-119-57953-3 (ISBN)

€139.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface xi

Introduction xiii Gilbert SAPORTA

Part 1 Applications 1

Chapter 1 Context-specific Independence in Innovation Study 3 Federica NICOLUSSI and Manuela CAZZARO

1.1 Introduction 3

1.2 Parametrization for CS independencies 4

1.3 Stratified chain graph models 6

1.4 Application on real data 7

1.5 Conclusion 12

1.6 References 12

Chapter 2 Analysis of the Determinants and Outputs of Innovation in the Nordic Countries 15 Cátia ROSÁRIO, António Augusto COSTA and Ana LORGA DA SILVA

2.1 Introduction 15

2.2 Innovation 16

2.3 Methodology 19

2.4 Results 21

2.5 Conclusion 25

2.6 References 26

Chapter 3 Bibliometric Variables Determining the Quality of a Dentistry Journal 29 Pilar VALDERRAMA, Manuel ESCABIAS, Evaristo JIMÉNEZ-CONTRERAS, Mariano J. VALDERRAMA and Pilar BACA

3.1 Introduction 29

3.2 Statistical methodology 30

3.3 Results 32

3.4 Conclusions 35

3.5 Acknowledgment 35

3.6 References 36

Chapter 4 Analysis of Dependence among Growth Rates of GDP of V4 Countries Using Four-dimensional Vine Copulas 37 Jozef KOMORNÍK, Magda KOMORNÍKOVÁ and Tomáš BACIGÁL

4.1 Introduction 37

4.2 Theory 38

4.3 Results 42

4.4 Conclusion and future work 45

4.5 Acknowledgment 47

4.6 References 47

Chapter 5 Monitoring the Compliance of Countries on Emissions Mitigation Using Dissimilarity Indices 49 Eleni KETZAKI, Stavros RALLAKIS, Nikolaos FARMAKIS and Eftichios SARTZETAKIS

5.1 Introduction 49

5.2 The proposed method 50

5.2.1 Description of method for individual data 51

5.2.2 Description of method for grouped data 52

5.3 Application of method 53

5.3.1 Application of method for individual data 54

5.3.2 Application of method for grouped data 55

5.4 Conclusions 55

5.5 Appendix 57

5.6 References 58

Chapter 6 Maximum Entropy and Distributions of Five-Star Ratings 59 Yiannis DIMOTIKALIS

6.1 Introduction 59

6.2 Entropy framework to five-star ratings 60

6.3 Maximum entropy of ratings for values k = 1, 2, 3, ..., 30 66

6.3.1 Ratings with two outcomes (k = 1) 66

6.3.2 Ratings with three outcomes (k = 2) 69

6.3.3 Ratings with four outcomes (k = 3) 73

6.3.4 Ratings with five outcomes (k = 4) 76

6.3.5 Ratings with k > 4 80

6.3.6 Maximum entropy constraints for the binomial distribution 82

6.4 Application to real five-star rating data 83

6.5 Conclusions 86

6.6 References 86

Part 2 The Impact of the Economic and Financial Crisis in Europe 89

Chapter 7 Access to Credit for SMEs after the 2008 Financial Crisis: The Northern Italian Perspective 91 Cinzia COLAPINTO and Mariangela ZENGA

7.1 Introduction 91

7.2 Italian SMEs and access to credit 92

7.3 The data 93

7.4 Methodology 94

7.5 Analysis and discussion 97

7.5.1 The measure for the Great Recession period (2008-2012) 97

7.5.2 The measure for the recovery period (2013-2015) 99

7.5.3 Comparing the two crisis phases 102

7.6 Conclusion 105

7.7 References 105

Chapter 8 Gender-Based Differences in the Impact of the Economic Crisis on Labor Market Flows in Southern Europe 107 Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU

8.1 Introduction 107

8.2 Data, methods and limitations 108

8.3 Results 111

8.4 Conclusions and discussion 111

8.5 References 119

Chapter 9 Measuring Labor Market Transition Probabilities in Europe with Evidence from the EU-SILC 121 Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU

9.1 Introduction 121

9.2 Data, methods and limitations 122

9.3 Results 124

9.4 Conclusions 135

9.5 References 135

Part 3 Student Assessment and Employment in Europe 137

Chapter 10 Almost Graduated, Close to Employment? Taking into Account the Characteristics of Companies Recruiting at a University Job Placement Office 139 Franca CRIPPA, Mariangela ZENGA and Paolo MARIANI

10.1 Introduction 139

10.2 Recruiters and graduates seeking an HEI common ground 140

10.3 Web survey pitfalls: considerations for data collection 141

10.4 Sampled recruiters: an outline 144

10.5 Conclusion 146

10.6 References 146

Chapter 11 How Variation of Scores of the Programme for International Student Assessment can be Explained through Analysis of Information 149 Valérie GIRARDIN, Justine LEQUESNE and Olivier THÉVENON

11.1 Introduction 149

11.2 Multiplicative models and Zighera's parameterization 151

11.3 Application to PISA surveys 155

11.3.1 Data and variables 155

11.3.2 Analysis of scores in mathematics 157

11.3.3 Conclusion 162

11.4 References 163

Part 4 Visualization 165

Chapter 12 A Topological Discriminant Analysis 167 Rafik ABDESSELAM

12.1 Introduction 167

12.2 Topological equivalence 168

12.3 Topological discriminant analysis 171

12.4 Application example 173

12.5 Conclusion and perspectives 175

12.6 Appendix 176

12.7 References 178

Chapter 13 Using Graph Partitioning to Calculate PageRank in a Changing Network 179 Christopher ENGSTRÖM and Sergei SILVESTROV

13.1 Introduction 179

13.1.1 Computing PageRank 181

13.2 Changes in personalization vector 182

13.3 Adding or removing edges between components 184

13.3.1 Computations in practice 186

13.3.2 Adding or removing an edge inside a component 187

13.3.3 Maintaining the component structure 189

13.4 Conclusions 190

13.5 References 191

Chapter 14 Visualizing the Political Spectrum of Germany by Contiguously Ordering the Party Policy Profiles 193 Andranik TANGIAN

14.1 Introduction 193

14.2 The model 195

14.3 Conclusions 206

14.4 References 206

List of Authors 209

Index 213

Introduction
50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning

In 1962, J.W. Tukey wrote his famous paper "The Future of Data Analysis" and promoted exploratory data analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit, J.P. Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations occurred, but the basic methods (SVD, k-means, etc.) are still incredibly efficient in the Big Data era.

On the other hand, algorithmic modeling or machine learning is successful in predictive modeling, the goal being accuracy and not interpretability. Supervised learning proves in many applications that it is not necessary to understand, when one needs only predictions.

However, considering some failures and flaws, we advocate that a better understanding may improve prediction. Causal inference for Big Data is probably the challenge of the coming years.

It is a little presumptuous to attempt a panorama of 50 years of data analysis when David Donoho (2017) has just published a paper entitled "50 Years of Data Science". But 1968 is the year when I began my studies as a statistician, and I would very much like to talk about the debates of the time and about the digital revolution, which I witnessed and which profoundly transformed statistics. The terminology followed this evolution-revolution: from data analysis to data mining and then to data science, while we went from a time when asymptotics began at 30 observations with a few variables to the era of Big Data and high dimension.

I.1. The revolt against mathematical statistics

Since the 1960s, the availability of data has led to an international movement back to the sources of statistics ("let the data speak") and to sometimes fierce criticisms of excessive formalization. Along with John Tukey, who was cited above, here is a portrait gallery of some prominent protagonists in the United States, France, Japan, the Netherlands and Italy (for a color version of this figure, see www.iste.co.uk/skiadas/data2.zip).

And an anthology of quotes:

He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis. (Anscombe 1967)

Statistics is not probability; under the name of mathematical statistics, a pompous discipline was built, based on theoretical assumptions that are rarely met in practice. (Benzécri 1972)

The models should follow the data, not vice versa. (Benzécri 1972)

The use of the computer implies the abandonment of all the techniques designed before the age of computing. (Benzécri 1972)

Statistics is intimately connected with science and technology, and few mathematicians have experience or understanding of the methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc., and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences [...]. The apparatus on which their statistics course has been constructed is often worse than irrelevant - it is misleading about what is important in examining data and making inferences. (Nelder 1985)

Data analysis was basically descriptive and non-probabilistic, in the sense that no reference was made to the data-generating mechanism. Data analysis favors algebraic and geometrical tools of representation and visualization.

This movement has resulted in conferences, especially in Europe. In 1977, E. Diday and L. Lebart initiated a series entitled Data Analysis and Informatics, and in 1981, J. Janssen launched the biennial ASMDA conferences (Applied Stochastic Models and Data Analysis), which are still continuing.

The principles of data analysis inspired those of data mining, which developed in the 1990s on the border between databases, information technology and statistics. Fayyad (1995) is credited with the following definition: "Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Hand et al. specified in 2000, "I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets".

The metaphor of data mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools. Data mining is generally concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected not primarily for analysis, but for the management of individual cases. Data mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand et al. 2000).

I.2. EDA and unsupervised methods for dimension reduction

Essentially, exploratory methods of data analysis are dimension reduction methods: unsupervised classification or clustering methods operate on the number of statistical units, whereas factorial methods reduce the number of variables by searching for linear combinations associated with new axes of the space of individuals.

I.2.1. The time of syntheses

It was quickly realized that all the methods looking for eigenvalues and eigenvectors of matrices related to the dispersion of a cloud (total or intra) or of correlation matrices could be expressed as special cases of certain techniques.

Correspondence analyses (single and multiple) and canonical discriminant analysis are particular principal component analyses. It suffices to extend the classical Principal Components Analysis (PCA) by weighting the units and introducing metrics. The duality scheme introduced by Cailliez and Pagès (1976) is an abstract way of representing the relationships between arrays, matrices and associated spaces. The paper by De la Cruz and Holmes (2011) brought it back to light.
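The idea of extending PCA "by weighting the units and introducing metrics" can be sketched numerically. The following is only an illustrative sketch, not the authors' implementation: the function name generalized_pca and its arguments are hypothetical, the weights are assumed positive and the metric M is assumed symmetric positive definite. It diagonalizes the weighted covariance matrix in the metric M; choosing M as the diagonal matrix of inverse variances recovers the usual standardized PCA.

```python
import numpy as np

def generalized_pca(X, weights=None, metric=None):
    """PCA of the triplet (X, M, D): units weighted by D, distances measured by M.

    Hypothetical helper for illustration; M is assumed symmetric positive definite.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if weights is None:
        w = np.full(n, 1.0 / n)            # uniform weights
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                    # normalize weights to sum to 1
    M = np.eye(p) if metric is None else np.asarray(metric, dtype=float)

    Xc = X - w @ X                         # weighted centering
    V = Xc.T @ (w[:, None] * Xc)           # weighted covariance matrix X' D X

    # Work with the symmetric matrix M^{1/2} V M^{1/2}
    m_vals, m_vecs = np.linalg.eigh(M)
    M_half = m_vecs @ np.diag(np.sqrt(m_vals)) @ m_vecs.T
    lam, U = np.linalg.eigh(M_half @ V @ M_half)
    order = np.argsort(lam)[::-1]          # sort eigenvalues in decreasing order
    lam, U = lam[order], U[:, order]

    components = Xc @ M_half @ U           # principal components of the units
    return lam, components
```

With uniform weights and the identity metric this reduces to ordinary PCA on the covariance matrix; other choices of weights and metric give the variants mentioned in the text, although the specific encodings needed for correspondence analysis are not shown here.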

From another point of view (Bouroche and Saporta 1983), the main factorial methods PCA, Multiple Correspondence Analysis (MCA), as well as multiple regression are particular cases of canonical correlation analysis.

Another synthesis comes from the generalization of canonical correlation analysis to several groups of variables introduced by J.D. Carroll (1968). Given p blocks of variables $X_j$, we look for components $z$ maximizing the following criterion: $\sum_{j=1}^{p} R^2(z, X_j)$.

The extension of this criterion in the form $\max_{Y} \sum_{j=1}^{p} F(Y, X_j)$, where $F$ is an adequate measure of association, leads to the maximum association principle (Tenenhaus 1977; Marcotorchino 1986; Saporta 1988), which also includes the case of k-means partitioning.
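Carroll's criterion is usually written as the sum, over the p blocks, of the squared multiple correlation between z and each block. Assuming that form, a maximizing component can be obtained as a leading eigenvector of the sum of the orthogonal projectors onto the centered blocks. The sketch below illustrates this under that assumption; the function names are hypothetical and not taken from the book.

```python
import numpy as np

def carroll_gca_component(blocks):
    """First component z maximizing sum_j R^2(z, X_j) over a list of data blocks.

    Illustrative sketch: each block is an (n, p_j) array observed on the same n units.
    Returns z (unit norm, centered) and the value of the criterion.
    """
    n = blocks[0].shape[0]
    P_sum = np.zeros((n, n))
    for X in blocks:
        Xc = X - X.mean(axis=0)               # center each block
        P_sum += Xc @ np.linalg.pinv(Xc)      # orthogonal projector onto span(Xc)
    P_sum = (P_sum + P_sum.T) / 2.0           # remove numerical asymmetry
    lam, vecs = np.linalg.eigh(P_sum)         # eigenvalues in ascending order
    return vecs[:, -1], float(lam[-1])

def squared_multiple_correlation(z, X):
    """R^2 of the least-squares regression of z on the columns of X (with intercept)."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    fitted = Xc @ np.linalg.lstsq(Xc, zc, rcond=None)[0]
    return float(fitted @ fitted / (zc @ zc))
```

The largest eigenvalue equals the attained value of the criterion, so it can be checked against the sum of the R^2 values computed block by block with squared_multiple_correlation.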

The PLS approach to structural equation modeling also provides a global framework for many linear methods, as has been shown by Tenenhaus (1999) and Tenenhaus and Tenenhaus (2011).

Table I.1. Various cases of the maximum association principle

Criterion | Analysis
with xj numerical | PCA
with xj categorical | MCA
with Xj data set | GCA (Carroll)
with Y and xj categorical | Central partition
with rank orders | Condorcet aggregation rule

I.2.2. The time of clusterwise methods

The search for partitions in k classes of a set of units belonging to a Euclidean space is most often done using the k-means algorithm: this method converges very quickly, even for large sets of data, but not necessarily toward the global optimum. Under the name of dynamic clustering, Diday (1971) has proposed multiple extensions, where the representatives of classes can be groups of points, varieties, etc. The simultaneous search for k classes and local models by alternating k-means and modeling is a geometric and non-probabilistic way of addressing mixture problems. Clusterwise regression is the best-known case: in each class, a regression model is fitted and the assignment to the classes is done according to the best model. Clusterwise methods allow for non-observable heterogeneity and are particularly useful for large data sets where the relevance of a simple and global model is questionable. In the 1970s, Diday and his collaborators developed "typological" approaches for most linear techniques: PCA, regression (Charles 1977), discrimination. These methods are again the subject of numerous publications in association with functional data (Preda and Saporta 2005), symbolic data (de Carvalho et al. 2010) and in multiblock cases (De Roover et al. 2012; Bougeard et al. 2017).
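The alternating scheme described above for clusterwise regression (fit one regression per class, then reassign each unit to the class whose model fits it best) can be sketched as follows. This is a minimal illustration under simple assumptions (ordinary least squares with an intercept, squared residual as the assignment rule, random initial partition); the function name and parameters are hypothetical. Like k-means, it decreases its criterion at each step but may stop at a local optimum.

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Alternate per-class least-squares fits and reassignment of units.

    Returns the class labels, the (k, p + 1) coefficient matrix (intercept first)
    and the total within-class sum of squared residuals.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])     # design matrix with intercept
    labels = rng.integers(k, size=n)          # random initial partition
    coefs = np.zeros((k, Xd.shape[1]))

    for _ in range(n_iter):
        # Modeling step: fit one regression per class
        for c in range(k):
            mask = labels == c
            if mask.sum() >= Xd.shape[1]:     # skip classes too small to fit
                coefs[c], *_ = np.linalg.lstsq(Xd[mask], y[mask], rcond=None)
        # Assignment step: each unit goes to the best-fitting model
        resid = (y[:, None] - Xd @ coefs.T) ** 2
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # partition is stable
        labels = new_labels

    sse = float(resid[np.arange(n), labels].sum())
    return labels, coefs, sse
```

In practice one would typically run the procedure from several random starts (different values of seed) and keep the solution with the smallest sum of squared residuals.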

I.2.3. Extensions to new types of data

I.2.3.1. Functional data

Jean-Claude Deville (1974) showed that the Karhunen-Loève decomposition was nothing other than the PCA of the trajectories of a process, opening the way to functional data analysis (Ramsay and Silverman 1997). The number of variables being...

Schweitzer Fachinformationen
