2. From Conventional Data Analysis Methods to Big Data Analytics
2.1. From data analysis to data mining: exploring and predicting
Data analysis here mainly means descriptive and exploratory methods, also known as unsupervised methods. The objective is to describe and structure a set of data that can be represented as a rectangular table crossing n statistical units and p variables. We generally consider the n observations as points in a p-dimensional vector space which, once equipped with a distance, is a Euclidean space. Numerical variables are vectors in an n-dimensional space. Data analysis methods are essentially dimension reduction methods that fall into two categories:
- on the one hand, factor methods (principal component analysis for numerical variables, correspondence analysis for categorical variables), which lead to new numerical variables, combinations of the original variables, allowing representations in low-dimensional spaces. Mathematically, these are variants of the singular value decomposition of the data table;
- on the other hand, unsupervised classification or clustering methods, which divide the observations, or the variables, into homogeneous groups. The main algorithms are either hierarchical (step-by-step construction of the classes by successive merging of units) or direct searches for a partition, such as k-means (a minimal code sketch of both families follows this list).
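As a minimal illustration of these two families of methods, the sketch below applies a principal component analysis followed by a k-means partition to an arbitrary numeric table; the use of the scikit-learn library, the simulated data and the chosen numbers of components and clusters are assumptions made purely for this example.

```python
# Sketch of the two families of methods described above:
# a factor method (PCA) followed by a clustering method (k-means).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 observations, p = 5 numerical variables

# Factor method: new variables that are linear combinations of the originals,
# allowing a representation in a low-dimensional space.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
print("explained variance ratios:", pca.explained_variance_ratio_)

# Clustering: direct search for a partition of the observations into k groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_std)
print("cluster sizes:", np.bincount(labels))
```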
Many works, such as [SAP 11], are devoted to these methods.
However, data analysis is also an attitude that consists of "letting the data speak" by assuming nothing, or at least very little, a priori about the generating mechanism. Let us recall here the principle stated by [BEN 72]: "The model must follow the data, and not the opposite". Data analysis developed in the 1960s and 1970s in reaction to the abuses of formalization; see what [ANS 67] wrote about John Tukey: "He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis."
Data mining, a movement which began in the 1990s at the intersection of statistics and information technologies (databases, artificial intelligence, machine learning, etc.), also aims at discovering structures in large datasets and promotes new tools, such as association rules. The metaphor of data mining suggests that there are treasures, or nuggets, hidden under mountains of data that can be discovered with specialized tools. Data mining is a step in the knowledge discovery process, which involves applying data analysis algorithms. [HAN 99] defined it thus: "I shall define data mining as the discovery of interesting, unexpected, or valuable structures in large data sets." Data mining analyzes data collected for other purposes: it is often a secondary analysis of databases designed for the management of individual data, where no particular care was taken to collect the data effectively (through surveys or experimental designs).
Data mining also seeks predictive models of a response denoted Y, but from a very different perspective than that of conventional modeling. A model is nothing more than an algorithm, not a representation of the mechanism that generated the data. We then proceed by exploring a set of linear or nonlinear algorithms, explicit or not, in order to select the best one, i.e. the one that provides the most accurate forecasts without falling into the overfitting trap. We distinguish between regression methods, where Y is quantitative, and supervised classification methods (also called discrimination methods), where Y is categorical, most often with two modalities. Massive data processing has only reinforced the trends already present in data mining.
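To illustrate this algorithmic viewpoint, the sketch below compares two candidate algorithms purely on cross-validated predictive accuracy rather than on their fit to a supposed generating mechanism; the simulated data, the two candidate models and the error metric are arbitrary choices made for this example.

```python
# Sketch: choosing among algorithms by out-of-sample accuracy (cross-validation),
# not by how well they reproduce a supposed generating mechanism.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in candidates.items():
    # Mean squared error estimated on held-out folds (5-fold cross-validation).
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: cross-validated MSE = {-scores.mean():.1f}")
```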
2.2. Obsolete approaches
Inferential statistics were developed in a context of scarce data, so much so that a sample of more than 30 units was considered large! The volume of data radically changes the practice of statistics. Here are some examples:
- any deviation from a theoretical value becomes "significant". Thus, a correlation coefficient of 0.01 calculated between two variables on a million observations (and even less, as the reader will easily verify) will be declared significantly different from zero; see the short calculation after this list. Is it a useful result?
- the confidence intervals of the parameters of a model become of zero width, since their width is generally of the order of 1/√n. Does this mean that the model will be known with certainty?
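The first point can be checked numerically: with r = 0.01 and n = 10^6, the usual test statistic t = r√(n − 2)/√(1 − r²) is about 10, far beyond any conventional threshold. A short sketch (the use of scipy here is simply one way of obtaining the p-value):

```python
# Sketch: a tiny correlation becomes "significant" with a million observations.
import math
from scipy import stats

r, n = 0.01, 1_000_000
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # test statistic for H0: rho = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-sided p-value
print(f"t = {t:.1f}, p-value = {p:.1e}")         # t is about 10, p is about 1e-23
```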
In general, no generative model applies to such large amounts of data, any more than do the rules for model choice by penalized likelihood that are the subject of so many publications.
It should be noted that criteria of the type:

AIC = −2 ln(L) + 2k [2.1]

and

BIC = −2 ln(L) + k ln(n) [2.2]

used to choose between simple models, where k is the number of parameters and L the maximized likelihood, become ineffective when comparing predictive algorithms for which neither the likelihood nor the number of parameters is known, as in decision trees and the more complex methods discussed in the next chapter. Note that it is illogical, as is often seen, to use AIC and BIC simultaneously, since they come from two incompatible theories: Kullback-Leibler information for the first, and Bayesian choice among a priori equiprobable models for the second.
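For concreteness, here is a minimal sketch computing the two criteria [2.1] and [2.2] from the maximized log-likelihood of a Gaussian linear model; statsmodels is used only as an example (its results object also exposes aic and bic attributes), and the simulated data are arbitrary.

```python
# Sketch: AIC [2.1] and BIC [2.2] from a fitted model's maximized log-likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 3)))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(scale=1.0, size=200)

res = sm.OLS(y, X).fit()
logL = res.llf             # maximized log-likelihood ln(L)
k = res.df_model + 1       # number of estimated mean parameters, intercept included
n = res.nobs

aic = -2 * logL + 2 * k            # [2.1]
bic = -2 * logL + k * np.log(n)    # [2.2]
print(aic, bic)                    # compare with the built-in res.aic and res.bic
```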
The large volume of data could be an argument in favor of the asymptotic properties of BIC, if it were calculable, since it has been shown that the probability of choosing the true model tends to 1 when the number of observations tends to infinity. The true model, however, must belong to the family studied, and above all this "true" model must exist, which is a fiction: a model (in the generative sense) is only a simplified representation of reality. Thirty years ago, well before anyone talked about big data, George Box declared: "All models are wrong, but some are useful."
The abuses of so-called conventional statistics had been vigorously denounced by John Nelder [NEL 85], the co-inventor of generalized linear models, in a 1985 discussion of Chatfield's article: "Statistics is intimately connected with science and technology, and few mathematicians have experience or understand the methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc. etc. and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences […]. The apparatus on which their statistics course has been constructed is often worse than irrelevant, it is misleading about what is important in examining data and making inferences."
2.3. Understanding or predicting?
The use of learning algorithms leads to methods known as "black boxes", which show empirically that it is not necessary to understand in order to predict. This fact, which is disturbing for scientists, is explicitly claimed by learning theorists such as [VAP 06], who writes: "Better models are sometimes obtained by deliberately avoiding to reproduce the true mechanisms."
[BRE 01] confirmed this in his famous Statistical Science article entitled "Statistical Modeling: The Two Cultures": "Modern statistical thinking makes a clear distinction between the statistical model and the world. The actual mechanisms underlying the data are considered unknown. The statistical models do not need to reproduce these mechanisms to emulate the observable data." Breiman thus contrasted two modeling cultures for drawing conclusions from data: one assumes that the data are generated by a given stochastic model, while the other considers the generating mechanism as unknown and uses algorithms.
In the first case, attention is paid to the fit of the model to the data (goodness of fit); in the second, the focus is on forecast accuracy.
[DON 15] recently took up this discussion by talking of generative modeling culture and predictive modeling culture. The distinction between models for understanding and models for predicting was also explicit in [SAP 08] and [SHM 10].
2.4. Validation of predictive models
The quality of a forecasting model cannot be judged solely by how well it fits the data: it must also provide good forecasts on future data, which is called its generalization capacity. Indeed, it is easy to see that the more complex a model, for example a polynomial of high degree, the better it will fit the data, to the point of passing through all the observations; but this apparent quality degrades on new observations: this is the overfitting phenomenon.
Figure 2.1. From underfitting to overfitting
(source: available at http://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted)
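A minimal sketch of this phenomenon: polynomials of increasing degree are fitted to a small training sample and evaluated on held-out points; the simulated sinusoidal signal, the sample sizes and the degrees tried are arbitrary choices for illustration.

```python
# Sketch: training error keeps decreasing with polynomial degree,
# while the error on new observations eventually worsens (overfitting).
import numpy as np

rng = np.random.default_rng(0)

def truth(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 15)
y_train = truth(x_train) + rng.normal(scale=0.2, size=15)
x_test = rng.uniform(0, 1, 200)
y_test = truth(x_test) + rng.normal(scale=0.2, size=200)

for degree in (1, 3, 9, 14):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```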
It is therefore appropriate to seek models that behave in a comparable way on the available data (or learning data) and on future data. But this is not a sufficient criterion since, for example, the constant model ŷ = c satisfies this property! Forecasts must also be of good quality.
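To make the remark about the constant model concrete, the sketch below compares a baseline that always predicts the training mean with an ordinary linear regression: the baseline's errors are nearly identical on learning and test data, yet much larger. The use of scikit-learn's DummyRegressor and the simulated data are assumptions made for this illustration only.

```python
# Sketch: the constant model behaves the same on learning and future data,
# but its forecasts are of poor quality.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (DummyRegressor(strategy="mean"), LinearRegression()):
    model.fit(X_tr, y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_te = mean_squared_error(y_te, model.predict(X_te))
    print(f"{type(model).__name__}: train MSE = {mse_tr:.1f}, test MSE = {mse_te:.1f}")
```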
2.4.1. Elements of learning theory
The inequalities of statistical learning theory make it possible to bound the difference between the learning error and the generalization error (on future data) as a function of the number of observations in the learning sample and the complexity of the family...
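As an indication of the general form of such inequalities, a classical bound of Vapnik's type for binary classification (quoted here only as an illustration; the exact constants vary with the formulation) states that, with probability at least 1 − δ over the learning sample of size n,

$$ R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h}+1\right)+\ln\frac{4}{\delta}}{n}} $$

where R(f) is the generalization error, R_emp(f) the learning error and h the Vapnik-Chervonenkis dimension, which measures the complexity of the family of models.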