Over the past century, advancements in computer science have consistently resulted from extensive mathematical work. Even today, innovations in the digital domain continue to be grounded in a strong mathematical foundation. To succeed in this profession, both today's students and tomorrow's computer engineers need a solid mathematical background.
The goal of this book series is to offer a solid foundation of the knowledge essential to working in the digital sector. Across three volumes, it explores fundamental principles, digital information, data analysis, and optimization. Whether the reader is pursuing initial training or looking to deepen their expertise, the Mathematics for Digital Science series revisits familiar concepts, helping them refresh and expand their knowledge while also introducing equally essential, newer topics.
Gérard-Michel Cochard is Professor Emeritus at Université de Picardie Jules Verne, France, where he has held various senior positions. He has also served at the French Ministry of Education and the CNAM (Conservatoire National des Arts et Métiers). His research is conducted at the Eco-PRocédés, Optimisation et Aide à la Décision (EPROAD) laboratory, France.
Mhand Hifi is Professor of Computer Science at Université de Picardie Jules Verne, France, where he heads the EPROAD UR 4669 laboratory and manages the ROD team. As an expert in operations research and NP-hard problem-solving, he actively contributes to numerous international conferences and journals in the field.
This brief chapter serves as a reminder of the concepts presented in detail in Volume 1. It primarily provides an overview of basic statistical analysis tools, particularly linear regression and correlation for two-dimensional data.
References: [SAP 11].
Consider a population of n elements. Each element i is characterized by the value of a variable $x = x_i$. The n values $x_i$ constitute a one-dimensional statistical series, whose characteristics are the mean, the variance and the standard deviation:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad v(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s(x) = \sqrt{v(x)}$$
In this definition, it is assumed that all elements have the same statistical weight. If the weights are not equal, the following expression is used for the mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}$$

where $p_i$ represents the statistical weight of individual i.
Huygens' theorem provides another method for calculating the variance:

$$v(x) = \overline{x^2} - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$
This relationship is often summarized as "the average of squares minus the square of the mean".
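As a quick numerical check of this identity, the following Python sketch computes the variance both from its definition and from the "average of squares minus square of the mean" shortcut; the data set is purely illustrative and is not taken from the book.

```python
# Illustrative check of Huygens' theorem: both computations give the same variance.
values = [120, 95, 143, 160, 88, 132, 101, 155, 99, 127]  # hypothetical data

n = len(values)
mean = sum(values) / n

# Definition: average of the squared deviations from the mean.
variance_def = sum((x - mean) ** 2 for x in values) / n

# Huygens' theorem: average of the squares minus the square of the mean.
variance_huygens = sum(x ** 2 for x in values) / n - mean ** 2

print(variance_def, variance_huygens)  # the two values coincide (up to rounding)
```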
Consider the statistical series shown in Figure 1.1, which represents the number of rainy days over 10 consecutive years at a given location.
Figure 1.1. Statistical series
The average is easily calculated by assigning equal statistical weight to each measurement; from the average of the squares, the variance v(x) = 1284 and the standard deviation s(x) = 35.83 are then obtained.
Figure 1.2 shows the graphical representation of the statistical series in the form of a histogram. This histogram illustrates the distribution of data regarding the number of rainy days over the 10 years.
The average is a measure of the position of the statistical series along the number of days axis, while the standard deviation serves as a dispersion parameter, providing an indicator of the spread of the statistical series.
Figure 1.2. Graphical representation of the statistical series
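To illustrate these two parameters, the short sketch below computes the position and dispersion measures and prints a rudimentary text histogram; the rainy-day counts are hypothetical, since the values of Figure 1.1 are not reproduced here.

```python
from statistics import mean, pstdev

# Hypothetical numbers of rainy days over 10 consecutive years
# (the actual values of Figure 1.1 are not reproduced here).
rainy_days = [118, 95, 160, 142, 101, 180, 97, 133, 150, 124]

print("average:", mean(rainy_days))               # position along the number-of-days axis
print("standard deviation:", pstdev(rainy_days))  # spread around that position

# Rudimentary text histogram: one bar per year, one '#' per 10 rainy days.
for year, days in enumerate(rainy_days, start=1):
    print(f"year {year:2d} | {'#' * (days // 10)} ({days})")
```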
Now, consider a two-dimensional statistical series, where each element is characterized by the values of two variables, x and y. For each variable, various statistical measures can be calculated, such as the average, variance and standard deviation.
To graphically represent this two-dimensional series, a two-dimensional Cartesian coordinate system is used. The x-axis represents the variable x, and the y-axis represents the variable y. Each element i of the series is represented as a point $(x_i, y_i)$ in this coordinate system, where the coordinate $x_i$ corresponds to the value of the variable x, and the coordinate $y_i$ corresponds to the value of the variable y. Figure 1.3 shows examples of graphical representations of two two-dimensional series.
Figure 1.3. Example of two two-dimensional series
When observing a two-dimensional series and detecting a certain structure in the set of representative points, we may be inclined to model this structure using a curve. This involves finding a mathematical function that best describes the relationships between the variables x and y. In the examples shown in Figure 1.3, a straight line can be proposed for modeling the first example, and a parabola for the second example, as shown in Figure 1.4. These models are adjustments that simplify the representation of trends or relationships observed in the data.
Figure 1.4. Examples of adjustments
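To make the idea of an adjustment concrete, the sketch below fits both a straight line and a parabola to a small two-dimensional series with numpy.polyfit; the data points are purely illustrative and are not those of Figure 1.3.

```python
import numpy as np

# Illustrative two-dimensional series (not the data of Figure 1.3).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Degree-1 adjustment: straight line y = a*x + b.
a, b = np.polyfit(x, y, deg=1)
print(f"line: y = {a:.3f} x + {b:.3f}")

# Degree-2 adjustment: parabola y = c2*x**2 + c1*x + c0.
c2, c1, c0 = np.polyfit(x, y, deg=2)
print(f"parabola: y = {c2:.3f} x^2 + {c1:.3f} x + {c0:.3f}")
```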
The linear adjustment is the simplest of all analytical adjustments. It involves obtaining the equation of the straight line that best fits the set of representative points of the series.
A classic method for obtaining the equation of the line in linear adjustment is the least squares method. This method involves minimizing the sum of the squares of the deviations between the observed values and the values predicted by the line. For the variables x and y, the respective means, denoted by $\bar{x}$ and $\bar{y}$, are calculated assuming equal statistical weight for each value of i:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
Next, the deviations from these averages are calculated for each point in the series (it is convenient to work with these "centered" coordinates):

$$X_i = x_i - \bar{x}, \qquad Y_i = y_i - \bar{y}$$
It is easy to verify that:

$$\sum_{i=1}^{n} X_i = 0, \qquad \sum_{i=1}^{n} Y_i = 0$$
The squares of these deviations are obtained by squaring these values:

$$X_i^2 = (x_i - \bar{x})^2, \qquad Y_i^2 = (y_i - \bar{y})^2$$
The least squares method involves finding the coefficients a and b of the equation of the line y = ax + b. Alternatively, using the centered coordinates, the equation becomes Y´ = AX + B, so that for each of the representative points the adjusted value is $Y'_i = AX_i + B$. The relationship between (A, B) and (a, b) is:

$$a = A, \qquad b = B + \bar{y} - A\bar{x}$$
The goal is to reduce the sum of squared deviations to a minimum. Mathematically, this involves minimizing the following objective function:

$$M = \sum_{i=1}^{n} \left(Y_i - Y'_i\right)^2$$
In other words, the aim is to minimize the following quantity:

$$M = \sum_{i=1}^{n} \left(Y_i - AX_i - B\right)^2$$
The minimum of M corresponds to the cancellation of the first derivatives with respect to A and B, the only unknowns in M. Taking the partial derivatives:

$$\frac{\partial M}{\partial A} = -2\sum_{i=1}^{n} X_i \left(Y_i - AX_i - B\right), \qquad \frac{\partial M}{\partial B} = -2\sum_{i=1}^{n} \left(Y_i - AX_i - B\right)$$
which leads to:

$$\sum_{i=1}^{n} X_i \left(Y_i - AX_i - B\right) = 0, \qquad \sum_{i=1}^{n} \left(Y_i - AX_i - B\right) = 0$$
These conditions lead to the following equations:

$$A\sum_{i=1}^{n} X_i^2 + B\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} X_i Y_i, \qquad A\sum_{i=1}^{n} X_i + nB = \sum_{i=1}^{n} Y_i$$

Since $\sum_i X_i = \sum_i Y_i = 0$, it follows that $B = 0$ and

$$A = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$$

Returning to the original coordinates, the adjustment line is $y = ax + b$ with $a = A$ and $b = \bar{y} - a\bar{x}$.
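A minimal Python sketch of this procedure is given below; it works on an illustrative data set (the series of Figure 1.5 is not reproduced here), computes A and B in centered coordinates, and then recovers a and b.

```python
# Least squares adjustment line via centered coordinates (illustrative data).
x = [60, 75, 90, 105, 120, 135]                        # hypothetical values of x
y = [85000, 106000, 128000, 146000, 167000, 186000]   # hypothetical values of y

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centered coordinates X_i = x_i - x_bar, Y_i = y_i - y_bar.
X = [xi - x_bar for xi in x]
Y = [yi - y_bar for yi in y]

# A = sum(X_i * Y_i) / sum(X_i**2), B = 0, then a = A and b = y_bar - a * x_bar.
A = sum(Xi * Yi for Xi, Yi in zip(X, Y)) / sum(Xi ** 2 for Xi in X)
a = A
b = y_bar - a * x_bar
print(f"adjustment line: y = {a:.2f} x + {b:.2f}")
```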
Let us consider the statistical series showing the number of rainy days (x) and umbrella sales in local currency (y) (see Figure 1.5).
Figure 1.5. Statistical series (x, y)
Figure 1.6. Detailed adjustment calculations
Figure 1.7. Adjustment line.
Figure 1.6 summarizes the calculations required to determine the best-fit adjustment line, with the values of a = 1311.53 and b = 8831.78. Figure 1.7 displays the best-fit adjustment line.
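As a purely illustrative use of this line (the value x = 100 is not taken from the book), the adjustment equation predicts the level of umbrella sales for a year with 100 rainy days:

$$y = 1311.53 \times 100 + 8831.78 = 139{,}984.78$$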
Figure 1.8. Different correlation situations
In the case of adjustment, the goal is to express y as a function of x. This choice is arbitrary, as x could be expressed as a function of y. In that case, two adjustment lines would be obtained, both intersecting at the mean point $(\bar{x}, \bar{y})$.
By treating the variables x and y symmetrically, the concept of correlation between these variables can be introduced. Correlation measures the relationship between two variables and quantifies the possible influence of one on the other. Figure 1.8 presents various examples of scatter plots to illustrate different correlation situations.
In particular, in the case of linear correlation, it is interesting to note that when the two best-fit adjustment lines, y = f(x) and x = f´(y), coincide, this indicates maximum linear correlation between the variables x and y.
For the series in Example 1.2, the following two best-fit adjustment lines are obtained:
Figure 1.9 shows that the two straight lines are very close to each other, indicating a strong correlation between the variables.
Figure 1.9. Adjustment lines.
The two best-fit adjustment lines, y = ax + b and x = a´y + b´, have direction coefficients a and a´. If the lines coincide, then a = 1/a´ or, equivalently, a × a´ = 1. Now, the least squares method gives:

$$a = \frac{\sum_{i} X_i Y_i}{\sum_{i} X_i^2}, \qquad a' = \frac{\sum_{i} X_i Y_i}{\sum_{i} Y_i^2}, \qquad \text{so that} \quad a \times a' = \frac{\left(\sum_{i} X_i Y_i\right)^2}{\sum_{i} X_i^2 \sum_{i} Y_i^2}$$
The maximum correlation corresponds to the following equality (known as the Cauchy-Schwarz equality):

$$\left(\sum_{i} X_i Y_i\right)^2 = \sum_{i} X_i^2 \sum_{i} Y_i^2$$
The analytical definition of the linear correlation coefficient is:

$$r = \frac{\sum_{i} X_i Y_i}{\sqrt{\sum_{i} X_i^2}\,\sqrt{\sum_{i} Y_i^2}}$$
which is simply:

$$r^2 = a \times a'$$
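This relationship between the two direction coefficients and r can be checked numerically; the sketch below (on illustrative data, not the series of the example) computes a, a´ and r and verifies that r² = a × a´.

```python
# Check that r**2 equals a * a' on an illustrative data set.
x = [60, 75, 90, 105, 120, 135]
y = [85000, 109000, 125000, 149000, 164000, 188000]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
X = [xi - x_bar for xi in x]
Y = [yi - y_bar for yi in y]

sxy = sum(Xi * Yi for Xi, Yi in zip(X, Y))
sxx = sum(Xi ** 2 for Xi in X)
syy = sum(Yi ** 2 for Yi in Y)

a = sxy / sxx            # direction coefficient of y = f(x)
a_prime = sxy / syy      # direction coefficient of x = f'(y)
r = sxy / (sxx ** 0.5 * syy ** 0.5)

print(a * a_prime, r ** 2)  # the two printed values coincide
```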
Let us return to Example 1.3. The equations of the adjustment lines are:
The linear correlation coefficient is close to 1, i.e. r = 0.98 ≈ 1. This indicates an almost maximal linear correlation between the variables x and y. In this case, a strong relationship exists between x and y.
The linear correlation coefficient r is often written in another form, using the standard deviations s(x) and s(y):

$$r = \frac{\frac{1}{n}\sum_{i} X_i Y_i}{s(x)\, s(y)}$$
Furthermore, the covariance cov(x, y) is defined by:

$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i$$
It follows that:

$$r = \frac{\mathrm{cov}(x, y)}{s(x)\, s(y)}$$
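As a quick cross-check (again on illustrative data), the covariance form of r can be compared with numpy's built-in correlation coefficient.

```python
import numpy as np

# Illustrative data.
x = np.array([60, 75, 90, 105, 120, 135], dtype=float)
y = np.array([85000, 109000, 125000, 149000, 164000, 188000], dtype=float)

# Population covariance and standard deviations (divisor n, not n - 1).
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = cov_xy / (x.std() * y.std())

print(r, np.corrcoef(x, y)[0, 1])  # both expressions give the same value
```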
In the case of linear fitting, the expression for M is:

$$M = \sum_{i=1}^{n} \left(Y_i - AX_i - B\right)^2$$
The minimum is found by replacing A and B with the values obtained ($A = \sum_i X_i Y_i / \sum_i X_i^2$ and $B = 0$):

$$M_{\min} = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} X_i Y_i\right)^2}{\sum_{i=1}^{n} X_i^2}$$
By definition, M and therefore Mmin are positive or zero quantities. This leads to the Cauchy-Schwarz inequality:

$$\left(\sum_{i=1}^{n} X_i Y_i\right)^2 \leq \sum_{i=1}^{n} X_i^2 \sum_{i=1}^{n} Y_i^2$$
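Dividing both sides by the right-hand side (assumed non-zero) makes the consequence for r explicit:

$$r^2 = \frac{\left(\sum_{i} X_i Y_i\right)^2}{\sum_{i} X_i^2 \sum_{i} Y_i^2} \leq 1 \quad\Longrightarrow\quad -1 \leq r \leq 1$$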
Figure 1.10. Variations in the linear correlation coefficient
This inequality implies that the linear correlation coefficient lies in the range −1 ≤ r ≤ 1; in other words, r can take any value between −1 and 1, inclusive. Figure 1.10 shows such a correlation scale, where different ranges of r values are...