
Advances in Data Science
Description
Advances in Data Science fills this gap. It presents a collection of up-to-date contributions by eminent scholars following two international workshops held in Beijing and Paris. The 10 chapters are organized into four parts: Symbolic Data, Complex Data, Network Data and Clustering. They include fundamental contributions, as well as applications to several domains, including business and the social sciences.


Persons
Edwin Diday is Emeritus Professor at Paris-Dauphine University-PSL. He helped to introduce the symbolic data analysis paradigm and the dynamic clustering method (opening the path to local models), as well as pyramidal clustering for spatial representation of overlapping clusters.
Rong Guan is Associate Professor at the School of Statistics and Mathematics, Central University of Finance and Economics, Beijing. Her research covers complex and symbolic data analysis and financial distress diagnosis.
Gilbert Saporta is Emeritus Professor at Conservatoire National des Arts et Métiers, France. His current research focuses on functional data analysis and clusterwise and sparse methods. He is Honorary President of the French Statistical Society.
Huiwen Wang is Professor at the School of Economics and Management, Beihang University, Beijing. Her research covers dimension reduction, PLS regression, symbolic data analysis, compositional data analysis, functional data analysis and statistical modeling methods for mixed data.
Content
Preface xi
Part 1. Symbolic Data 1
Chapter 1. Explanatory Tools for Machine Learning in the Symbolic Data Analysis Framework 3
Edwin DIDAY
1.1. Introduction 4
1.2. Introduction to Symbolic Data Analysis 6
1.2.1. What are complex data? 6
1.2.2. What are "classes" and "class of complex data"? 7
1.2.3. Which kind of class variability? 7
1.2.4. What are "symbolic variables" and "symbolic data tables"? 7
1.2.5. Symbolic Data Analysis (SDA) 9
1.3. Symbolic data tables from Dynamic Clustering Method and EM 10
1.3.1. The "dynamical clustering method" (DCM) 10
1.3.2. Examples of DCM applications 10
1.3.3. Clustering methods by mixture decomposition 12
1.3.4. Symbolic data tables from clustering 13
1.3.5. A general way to compare results of clustering methods by the "explanatory power" of their associated symbolic data table 15
1.3.6. Quality criteria of classes and variables based on the cells of the symbolic data table containing intervals or inferred distributions 15
1.4. Criteria for ranking individuals, classes and their bar chart descriptive symbolic variables 16
1.4.1. A theoretical framework for SDA 16
1.4.2. Characterization of a category and a class by a measure of discordance 18
1.4.3. Link between a characterization by the criteria W and the standard Tf-Idf 19
1.4.4. Ranking the individuals, the symbolic variables and the classes of a bar chart symbolic data table 21
1.5. Two directions of research 23
1.5.1. Parametrization of concordance and discordance criteria 23
1.5.2. Improving the explanatory power of any machine learning tool by a filtering process 25
1.6. Conclusion 27
1.7. References 28
Chapter 2. Likelihood in the Symbolic Context 31
Richard EMILION and Edwin DIDAY
2.1. Introduction 31
2.2. Probabilistic setting 32
2.2.1. Description variable and class variable 32
2.2.2. Conditional distributions 33
2.2.3. Symbolic variables 33
2.2.4. Examples 35
2.2.5. Probability measures on (C, C), likelihood 37
2.3. Parametric models for p = 1 38
2.3.1. LDA model 38
2.3.2. BLS method 41
2.3.3. Interval-valued variables 42
2.3.4. Probability vectors and histogram-valued variables 42
2.4. Nonparametric estimation for p = 1 45
2.4.1. Multihistograms and multivariate polygons 45
2.4.2. Dirichlet kernel mixtures 45
2.4.3. Dirichlet Process Mixture (DPM) 45
2.5. Density models for p = 2 46
2.6. Conclusion 46
2.7. References 47
Chapter 3. Dimension Reduction and Visualization of Symbolic Interval-Valued Data Using Sliced Inverse Regression 49
Han-Ming WU, Chiun-How KAO and Chun-houh CHEN
3.1. Introduction 49
3.2. PCA for interval-valued data and the sliced inverse regression 51
3.2.1. PCA for interval-valued data 51
3.2.2. Classic SIR 52
3.3. SIR for interval-valued data 53
3.3.1. Quantification approaches 54
3.3.2. Distributional approaches 56
3.4. Projections and visualization in DR subspace 58
3.4.1. Linear combinations of intervals 58
3.4.2. The graphical representation of the projected intervals in the 2D DR subspace 59
3.5. Some computational issues 61
3.5.1. Standardization of interval-valued data 61
3.5.2. The slicing schemes for iSIR 62
3.5.3. The evaluation of DR components 62
3.6. Simulation studies 63
3.6.1. Scenario 1: aggregated data 63
3.6.2. Scenario 2: data based on interval arithmetic 63
3.6.3. Results 64
3.7. A real data example: face recognition data 65
3.8. Conclusion and discussion 73
3.9. References 74
Chapter 4. On the "Complexity" of Social Reality. Some Reflections About the Use of Symbolic Data Analysis in Social Sciences 79
Frédéric LEBARON
4.1. Introduction 79
4.2. Social sciences facing "complexity" 80
4.2.1. The total social fact, a designation of "complexity" in social sciences 80
4.2.2. Two families of answers 80
4.2.3. The contemporary deepening of the two approaches, "reductionist" and "encompassing" 81
4.2.4. Issues of scale and heterogeneity 82
4.3. Symbolic data analysis in the social sciences: an example 83
4.3.1. Symbolic data analysis 83
4.3.2. An exploratory case study on European data 83
4.3.3. A sociological interpretation 94
4.4. Conclusion 95
4.5. References 96
Part 2. Complex Data 99
Chapter 5. A Spatial Dependence Measure and Prediction of Georeferenced Data Streams Summarized by Histograms 101
Rosanna VERDE and Antonio BALZANELLA
5.1. Introduction 101
5.2. Processing setup 103
5.3. Main definitions 104
5.4. Online summarization of a data stream through CluStream for Histogram data 106
5.5. Spatial dependence monitoring: a variogram for histogram data 107
5.6. Ordinary kriging for histogram data 110
5.7. Experimental results on real data 112
5.8. Conclusion 116
5.9. References 116
Chapter 6. Incremental Calculation Framework for Complex Data 119
Huiwen WANG, Yuan WEI and Siyang WANG
6.1. Introduction 119
6.2. Basic data 122
6.2.1. The basic data space 122
6.2.2. Sample covariance matrix 123
6.3. Incremental calculation of complex data 124
6.3.1. Transformation of complex data 124
6.3.2. Online decomposition of covariance matrix 125
6.3.3. Adopted algorithms 128
6.4. Simulation studies 131
6.4.1. Functional linear regression 131
6.4.2. Compositional PCA 133
6.5. Conclusion 135
6.6. Acknowledgment 135
6.7. References 135
Part 3. Network Data 139
Chapter 7. Recommender Systems and Attributed Networks 141
Françoise FOGELMAN-SOULIÉ, Lanxiang MEI, Jianyu ZHANG, Yiming LI, Wen GE, Yinglan LI and Qiaofei YE
7.1. Introduction 141
7.2. Recommender systems 142
7.2.1. Data used 143
7.2.2. Model-based collaborative filtering 145
7.2.3. Neighborhood-based collaborative filtering 145
7.2.4. Hybrid models 148
7.3. Social networks 150
7.3.1. Non-independence 150
7.3.2. Definition of a social network 150
7.3.3. Properties of social networks 151
7.3.4. Bipartite networks 152
7.3.5. Multilayer networks 153
7.4. Using social networks for recommendation 154
7.4.1. Social filtering 154
7.4.2. Extension to use attributes 155
7.4.3. Remarks 156
7.5. Experiments 156
7.5.1. Performance evaluation 156
7.5.2. Datasets 157
7.5.3. Analysis of one-mode projected networks 158
7.5.4. Models evaluated 160
7.5.5. Results 160
7.6. Perspectives 163
7.7. References 163
Chapter 8. Attributed Networks Partitioning Based on Modularity Optimization 169
David COMBE, Christine LARGERON, Baptiste JEUDY, Françoise FOGELMAN-SOULIÉ and Jing WANG
8.1. Introduction 169
8.2. Related work 171
8.3. Inertia based modularity 172
8.4. I-Louvain 174
8.5. Incremental computation of the modularity gain 176
8.6. Evaluation of I-Louvain method 179
8.6.1. Performance of I-Louvain on artificial datasets 179
8.6.2. Run-time of I-Louvain 180
8.7. Conclusion 181
8.8. References 182
Part 4. Clustering 187
Chapter 9. A Novel Clustering Method with Automatic Weighting of Tables and Variables 189
Rodrigo C. DE ARAÚJO, Francisco DE ASSIS TENORIO DE CARVALHO and Yves LECHEVALLIER
9.1. Introduction 189
9.2. Related Work 190
9.3. Definitions, notations and objective 191
9.3.1. Choice of distances 192
9.3.2. Criterion W measures the homogeneity of the partition P on the set of tables 193
9.3.3. Optimization of the criterion W 195
9.4. Hard clustering with automated weighting of tables and variables 196
9.4.1. Clustering algorithms MND-W and MND-WT 196
9.5. Applications: UCI data sets 201
9.5.1. Application I: Iris plant 201
9.5.2. Application II: multi-features dataset 204
9.6. Conclusion 206
9.7. References 206
Chapter 10. Clustering and Generalized ANOVA for Symbolic Data Constructed from Open Data 209
Simona KORENJAK-ČERNE, Nataša KEJŽAR and Vladimir BATAGELJ
10.1. Introduction 209
10.2. Data description based on discrete (membership) distributions 210
10.3. Clustering 212
10.3.1. TIMSS - study of teaching approaches 215
10.3.2. Clustering countries based on age-sex distributions of their populations 217
10.4. Generalized ANOVA 221
10.5. Conclusion 225
10.6. References 226
List of Authors 229
Index 233
Chapter 1. Explanatory Tools for Machine Learning in the Symbolic Data Analysis Framework
The aim of this chapter is mainly to give explanatory tools for understanding standard, complex and big data. First, we recall some basic notions in Data Science: what are complex data? What are classes and classes of complex data? Which kinds of internal class variability can be considered? Then, we define "symbolic data" and "symbolic data tables", which express the within-class variability, and we give some advantages of this kind of class description. In practice, the classes are often given. When they are not, clustering can be used to build them, for instance with the Dynamic Clustering Method (DCM), from which DCM regression, DCM canonical analysis, DCM mixture decomposition and the like can be obtained. Aggregating the descriptions of these classes yields a symbolic data table. We say that the description of a class is much more explanatory when it is given by symbolic variables (closer to the natural language of the users) than by its usual analytical multidimensional description. The explanatory and characteristic power of classes can then be measured by criteria based on the symbolic data description of these classes, which induces a way of comparing clustering methods by their explanatory power. These criteria are defined in a Symbolic Data Analysis framework for categorical variables, based on three random variables defined on the ground population. Tools are then given for ranking individuals, classes and their symbolic descriptive variables from the most to the least characteristic. These characteristics are not only explanatory but can also express the concordance or the discordance of a class with the other classes. We suggest several directions of research, mainly on parametric aspects of these criteria and on improving the explanatory power of Machine Learning tools. We finally present the conclusion and the wide domain of potential applications in sociodemography, medicine, web security and so on.
1.1. Introduction
A "Data Scientist" is someone who is able to extract new knowledge from Standard, Big and Complex Data. Here we consider complex data as data that cannot be expressed in terms of a standard data table, where units are described by quantitative and qualitative variables. Complex data arise in the case of unstructured data, unpaired samples, and multisource data (such as a mixture of numerical, textual, image and social network data). The aggregation, fusion, and summarization of such data can be done by grouping row units into classes that are considered as new units. Classes can be obtained by unsupervised learning, giving a concise and structured view of the data. In supervised learning, classes are used to provide efficient rules for the allocation of new units to a class. A third way is to consider classes as new units described by "symbolic" variables whose values are "symbols" such as intervals, probability distributions, weighted sequences of numbers or categories, functions, and the like, in order to express their within-class variability. For example, "Regions" express the variability of their inhabitants, "Companies" express the variability of their web intrusions, and "Species" express the variability of their specimens. One of the advantages of this approach is that unstructured data and unpaired samples at the level of row units become structured and paired at the level of classes (see section 1.2.4).
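As an illustration of classes treated as new units, the following minimal Python sketch aggregates row-level individuals into one symbolic description per class: a Min-Max interval for a numerical variable and a bar chart (relative frequencies) for a categorical one. The data, the variable names and the function `build_symbolic_table` are hypothetical and only serve the illustration; they are not taken from the chapter.

```python
from collections import Counter, defaultdict

# Hypothetical row-level data: each individual belongs to a class (a region).
individuals = [
    {"region": "North", "age": 34, "job": "services"},
    {"region": "North", "age": 51, "job": "industry"},
    {"region": "North", "age": 28, "job": "services"},
    {"region": "South", "age": 45, "job": "agriculture"},
    {"region": "South", "age": 62, "job": "services"},
]


def build_symbolic_table(rows, class_key, numeric_vars, categorical_vars):
    """Aggregate individuals into one symbolic description per class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)

    table = {}
    for cls, members in groups.items():
        description = {}
        # Numerical variable -> Min-Max interval over the class members.
        for var in numeric_vars:
            values = [m[var] for m in members]
            description[var] = (min(values), max(values))
        # Categorical variable -> bar chart of relative frequencies.
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            description[var] = {c: n / len(members) for c, n in counts.items()}
        table[cls] = description
    return table


symbolic_table = build_symbolic_table(individuals, "region", ["age"], ["job"])
for cls, description in symbolic_table.items():
    print(cls, description)
```

Each row of the resulting table describes a class rather than an individual, so the within-class variability is kept in the cell values instead of being averaged away.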
Three principles guide this chapter in conformity with the Data Science framework. First, new tools are needed to transform huge databases intended for management into databases usable by Data Science tools. This transformation leads to the construction of new statistical units described by aggregated data in terms of symbols, since single-valued data are not suitable: they cannot incorporate the additional information on data structure available in symbolic data. Second, we work on the symbolic data as they are given in databases and not as we wish they were given. For example, if the data contain intervals, we work on them even if the within-interval uniformity is statistically not satisfactory. Moreover, by considering Min-Max intervals, we can obtain useful knowledge, complementary to the one given without the uniformity assumption. Hence, we take Min-Max or interquartile intervals as they are, in a setting where the aim is to extract useful knowledge from the data and not only to infer models (even if inferring models, as in standard statistics, can certainly give complementary knowledge). Third, we use marginal descriptions of classes by vectors of univariate symbols rather than joint symbolic descriptions by multivariate symbols: 99% of the users would say that a joint distribution describing a class often contains too many low or zero values and so has a poor explanatory power in comparison with marginal distributions describing the same class. For example, with 10 variables of 5 categories each, the joint multivariate distribution leads to a sparse symbolic data table where the classes are described by a unique bar chart symbolic variable value containing 5^10 categories and taking, for each class, 5^10 low or zero values. On the other hand, the 10 marginal bar chart symbolic variables' values describe the classes by vectors of 10 bar charts of 5 categories each, which are easy to interpret and to compare between classes. Nevertheless, a compromise can be obtained by using joint instead of marginal distributions for the most dependent variables.
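The arithmetic behind this example can be checked directly. The short Python sketch below compares the number of cells needed by the joint description (5 categories raised to the power of 10 variables) with the number needed by the 10 marginal bar charts. The class size of 1,000 individuals is a hypothetical value chosen only to show how sparse the joint bar chart must be.

```python
n_variables = 10
n_categories = 5

# Joint description: one bar chart over every combination of categories.
joint_cells = n_categories ** n_variables
# Marginal description: one bar chart of 5 categories per variable.
marginal_cells = n_variables * n_categories

print(joint_cells)     # 9765625
print(marginal_cells)  # 50

# Hypothetical class size: a class of n individuals fills at most n joint cells,
# so every other cell of its joint bar chart is necessarily zero.
class_size = 1000
min_zero_fraction = 1 - class_size / joint_cells
print(f"at least {min_zero_fraction:.4%} of the joint cells are zero")
```

With these numbers, the joint bar chart has almost ten million categories per class, nearly all of them empty, while the marginal description needs only 50 values per class.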
Symbolic Data Analysis (SDA) is an extension of standard data analysis and data mining to symbolic data. The theory and practice of SDA have been developed in several books [AFO 18a], [BIL 06], [BOC 00], [DID 08], many papers (see overviews in [BIL 03] and [DID 16]), and several international workshops (http://vladowiki.fmf.uni-lj.si/doku.php?id=sda:meet:pa18). Special issues related to SDA have been published, for example, in the RNTI journal, edited by Guan et al. [GUA 13], on Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis; in the ADAC journal on SDA, edited by Brito et al. [BRI 16]; and in IEEE Transactions on Cybernetics [SU 16].
This chapter is organized into five sections. Section 1.2 defines symbolic data obtained from the descriptions of classes of statistical units (called "individuals") in order to account for their internal variability. "Complex data", "classes", and "classes of complex data" are defined. The symbolic data appear in the cells of a "symbolic data table", where the rows describe classes and the columns are associated with symbolic-valued variables. Some advantages of symbolic data are finally given in this section.
Section 1.3 is devoted to the case where the classes are not given, but built by a clustering process. We illustrate this case with two clustering tools: the Dynamic Clustering Method (DCM) and mixture decomposition with the Expectation-Maximization (EM) method. We present different variants of the DCM, which can lead to different kinds of clusters, depending on the kind of cluster representative: regression, canonical analysis, distributions, and so on. Then, we show how to build a symbolic data table from the results of these clustering methods. Several criteria measuring the explanatory power of a symbolic data table are suggested. Consequently, the explanatory quality of clustering methods can be compared using these criteria.
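To make the role of the cluster representative concrete, here is a minimal Python sketch of the alternating scheme on which dynamic clustering relies: a representation step fits one representative per cluster, and an allocation step reassigns each point to the closest representative. The function name `dynamic_clustering` and its parameters `fit` and `distance` are assumptions made for this illustration, not the chapter's notation. With the mean as representative, the scheme reduces to k-means; other choices of `fit` and `distance` (for example a regression line and its residual) would give other variants.

```python
def dynamic_clustering(points, init_labels, fit, distance, max_iter=100):
    """Alternate representation and allocation steps until labels are stable.

    fit(members) returns the representative of a non-empty cluster;
    distance(point, representative) measures how well the point fits it.
    """
    labels = list(init_labels)
    k = max(labels) + 1
    reps = [None] * k
    for _ in range(max_iter):
        # Representation step: refit each non-empty cluster
        # (an emptied cluster keeps its previous representative).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                reps[j] = fit(members)
        # Simplifying assumption: every cluster is non-empty at the start.
        assert all(r is not None for r in reps), "initial clusters must be non-empty"
        # Allocation step: assign each point to its closest representative.
        new_labels = [
            min(range(k), key=lambda j: distance(p, reps[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, reps


# Hypothetical 1-D data with the mean as representative (k-means special case).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels, reps = dynamic_clustering(
    points,
    init_labels=[0, 1, 0, 1, 0, 1],
    fit=lambda members: sum(members) / len(members),
    distance=lambda p, rep: abs(p - rep),
)
print(labels)
print(reps)
```

The resulting clusters can then be aggregated into symbolic descriptions (intervals, bar charts), one per cluster.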
In section 1.4, our aim is to define other kinds of explanatory criteria in the case where the initial variables defined on the ground population are categorical. We introduce, in this case, a theoretical framework of SDA based on three random variables. From this framework, we define two kinds of bar chart. The first, called "fx(c)", assigns to each category x its frequency in the class, and the second, called "gc, E (x)", associates its frequency with each event E containing fx(c). These functions yield the characterization of pairs (category and class) by different kinds of criteria. We show that these criteria generalize to symbolic data the standard Tf-Idf widely used in text mining (see, for example, [ROB 04]). According to these criteria, the individuals, the classes, the symbolic variables and the symbolic data tables can be ranked from the most to the least characteristic.
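For readers unfamiliar with Tf-Idf, the following Python sketch computes one common variant of it, with classes playing the role of documents and categories the role of terms. It illustrates only the standard text-mining score, not the chapter's generalized criteria; the toy classes and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical classes, each listing the category observed for its individuals.
classes = {
    "c1": ["a", "a", "b", "a"],
    "c2": ["b", "b", "c"],
    "c3": ["b", "c", "c", "c"],
}


def class_frequencies(members):
    """Bar chart of a class: relative frequency of each category."""
    counts = Counter(members)
    return {cat: n / len(members) for cat, n in counts.items()}


def tf_idf(classes):
    """Score each (class, category) pair: frequency in the class times
    log(number of classes / number of classes containing the category)."""
    n_classes = len(classes)
    freqs = {c: class_frequencies(m) for c, m in classes.items()}
    containing = Counter(cat for f in freqs.values() for cat in f)
    return {
        c: {cat: tf * math.log(n_classes / containing[cat]) for cat, tf in f.items()}
        for c, f in freqs.items()
    }


scores = tf_idf(classes)
# Rank the (class, category) pairs from the most to the least characteristic.
ranking = sorted(
    ((score, c, cat) for c, row in scores.items() for cat, score in row.items()),
    reverse=True,
)
for score, c, cat in ranking:
    print(f"{c} {cat} {score:.3f}")
```

Category "b" occurs in every class and therefore scores zero everywhere, while "a", which occurs only in c1, is the most characteristic pair.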
Finally, in section 1.5, we suggest two directions of research. First, in this SDA framework, there are different possible parameterizations of the criteria expressed in terms of concordance or discordance of a class with the other classes. An interesting open question is to find under which conditions, when a sequence of partitions converges toward a trivial partition, such parametric criteria defined on classes converge toward a parametric distribution defined on the population O, since it is interesting and economical to go from distributions on classes to the distribution on the population (in the case of concordance or discordance). Second, as explaining for understanding is complementary to discriminating for learning, we suggest a filtering process that improves the explanatory power on a filtered sub-population without degrading the discriminating power of any machine learning tool.
1.2. Introduction to Symbolic Data Analysis
1.2.1. What are complex data?
By definition, "complex data" are any data set...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.