RICHARD G. BRERETON is Director of Brereton Consultancy and Emeritus Professor at the University of Bristol, UK. He is a Fellow of the Royal Society of Chemistry, the Royal Statistical Society and the Royal Society of Medicine. He has applied chemometrics in a wide variety of areas including pharmaceuticals, materials, metabolomics, heritage studies and forensics, and has published over 400 articles and written or edited eight books.
Preface to Second Edition xi
Preface to First Edition xiii
Acknowledgements xv
About the Companion Website xvii
1 Introduction 1
1.1 Historical Parentage 1
1.1.1 Applied Statistics 1
1.1.2 Statistics in Analytical and Physical Chemistry 2
1.1.3 Scientific Computing 3
1.2 Developments since the 1970s 3
1.3 Software and Calculations 4
1.4 Further Reading 6
1.4.1 General 6
1.4.2 Specific Areas 7
References 8
2 Experimental Design 11
2.1 Introduction 11
2.2 Basic Principles 14
2.2.1 Degrees of Freedom 14
2.2.2 Analysis of Variance 17
2.2.3 Design Matrices and Modelling 23
2.2.4 Assessment of Significance 29
2.2.5 Leverage and Confidence in Models 38
2.3 Factorial Designs 43
2.3.1 Full Factorial Designs 44
2.3.2 Fractional Factorial Designs 49
2.3.3 Plackett-Burman and Taguchi Designs 55
2.3.4 Partial Factorials at Several Levels: Calibration Designs 57
2.4 Central Composite or Response Surface Designs 62
2.4.1 Setting up the Design 62
2.4.2 Degrees of Freedom 65
2.4.3 Axial Points 66
2.4.4 Modelling 67
2.4.5 Statistical Factors 69
2.5 Mixture Designs 70
2.5.1 Mixture Space 70
2.5.2 Simplex Centroid 71
2.5.3 Simplex Lattice 74
2.5.4 Constraints 76
2.5.5 Process Variables 81
2.6 Simplex Optimisation 82
2.6.1 Fixed Sized Simplex 82
2.6.2 Elaborations 84
2.6.3 Modified Simplex 84
2.6.4 Limitations 86
Problems 86
3 Signal Processing 101
3.1 Introduction 101
3.1.1 Environmental and Geological Processes 101
3.1.2 Industrial Process Control 101
3.1.3 Chromatograms and Spectra 102
3.1.4 Fourier Transforms 102
3.1.5 Advanced Methods 102
3.2 Basics 103
3.2.1 Peak Shapes 103
3.2.2 Digitisation 107
3.2.3 Noise 109
3.2.4 Cyclicity 112
3.3 Linear Filters 112
3.3.1 Smoothing Functions 112
3.3.2 Derivatives 116
3.3.3 Convolution 118
3.4 Correlograms and Time Series Analysis 122
3.4.1 Auto-correlograms 122
3.4.2 Cross-correlograms 124
3.4.3 Multivariate Correlograms 127
3.5 Fourier Transform Techniques 128
3.5.1 Fourier Transforms 128
3.5.2 Fourier Filters 135
3.5.3 Convolution Theorem 140
3.6 Additional Methods 142
3.6.1 Kalman Filters 142
3.6.2 Wavelet Transforms 145
3.6.3 Bayes' Theorem 148
3.6.4 Maximum Entropy 150
Problems 153
4 Principal Component Analysis and Unsupervised Pattern Recognition 163
4.1 Introduction 163
4.1.1 Exploratory Data Analysis 163
4.1.2 Cluster Analysis 164
4.2 The Concept and Need for Principal Components Analysis 164
4.2.1 History 164
4.2.2 Multivariate Data Matrices 165
4.2.3 Case Studies 166
4.2.4 Aims of PCA 171
4.3 Principal Components Analysis: The Method 171
4.3.1 Scores and Loadings 171
4.3.2 Rank and Eigenvalues 175
4.4 Factor Analysis 183
4.5 Graphical Representation of Scores and Loadings 184
4.5.1 Scores Plots 185
4.5.2 Loadings Plots 188
4.6 Pre-processing 191
4.6.1 Transforming Individual Elements of a Matrix 191
4.6.2 Row Scaling 193
4.6.3 Mean Centring 194
4.6.4 Standardisation 197
4.6.5 Further Methods 199
4.7 Comparing Multivariate Patterns 199
4.7.1 Biplots 200
4.7.2 Procrustes Analysis 201
4.8 Unsupervised Pattern Recognition: Cluster Analysis 201
4.8.1 Similarity 202
4.8.2 Linkage 204
4.8.3 Next Steps 206
4.8.4 Dendrograms 206
4.9 Multi-way Pattern Recognition 207
4.9.1 Tucker3 Models 207
4.9.2 Parallel Factor Analysis (PARAFAC) 208
4.9.3 Unfolding 209
Problems 210
5 Classification and Supervised Pattern Recognition 215
5.1 Introduction 215
5.1.1 Background 215
5.1.2 Case Study 216
5.2 Two-Class Classifiers 216
5.2.1 Distance-Based Methods 217
5.2.2 Partial Least-Squares Discriminant Analysis 224
5.2.3 K Nearest Neighbours 226
5.3 One-Class Classifiers 229
5.3.1 Quadratic Discriminant Analysis 229
5.3.2 Disjoint PCA and SIMCA 232
5.4 Multi-Class Classifiers 236
5.5 Optimisation and Validation 237
5.5.1 Validation 238
5.5.2 Optimisation 245
5.6 Significant Variables 246
5.6.1 Partial Least-Squares Discriminant Loadings and Weights 248
5.6.2 Univariate Statistical Indicators 250
5.6.3 Variable Selection for SIMCA 251
Problems 252
6 Calibration 265
6.1 Introduction 265
6.1.1 History, Usage and Terminology 265
6.1.2 Case Study 267
6.2 Univariate Calibration 267
6.2.1 Classical Calibration 269
6.2.2 Inverse Calibration 272
6.2.3 Intercept and Centring 274
6.3 Multiple Linear Regression 276
6.3.1 Multi-detector Advantage 276
6.3.2 Multi-wavelength Equations 277
6.3.3 Multivariate Approaches 280
6.4 Principal Components Regression 284
6.4.1 Regression 284
6.4.2 Quality of Prediction 287
6.5 Partial Least Squares Regression 289
6.5.1 PLS1 289
6.5.2 PLS2 294
6.5.3 Multi-way PLS 297
6.6 Model Validation and Optimisation 302
6.6.1 Auto-prediction 302
6.6.2 Cross-validation 303
6.6.3 Independent Test Sets 305
Problems 309
7 Evolutionary Multivariate Signals 323
7.1 Introduction 323
7.2 Exploratory Data Analysis and Pre-processing 325
7.2.1 Baseline Correction 325
7.2.2 Principal Component-Based Plots 325
7.2.3 Scaling the Data after PCA 329
7.2.4 Scaling the Data before PCA 332
7.2.5 Variable Selection 339
7.3 Determining Composition 341
7.3.1 Composition 341
7.3.2 Univariate Methods 342
7.3.3 Correlation- and Similarity-Based Methods 345
7.3.4 Eigenvalue-Based Methods 348
7.3.5 Derivatives 352
7.4 Resolution 355
7.4.1 Selectivity for All Components 356
7.4.2 Partial Selectivity 360
7.4.3 Incorporating Constraints: ITTFA, ALS and MCR 362
Problems 365
A Appendix 375
A.1 Vectors and Matrices 375
A.1.1 Notation and Definitions 375
A.1.2 Matrix and Vector Operations 375
A.2 Algorithms 377
A.2.1 Principal Components Analysis 377
A.2.2 PLS1 378
A.2.3 PLS2 379
A.2.4 Tri-Linear PLS1 380
A.3 Basic Statistical Concepts 381
A.3.1 Descriptive Statistics 381
A.3.2 Normal Distribution 383
A.3.3 χ²-Distribution 383
A.3.4 t-Distribution 386
A.3.5 F-Distribution 386
A.4 Excel for Chemometrics 390
A.4.1 Names and Addresses 390
A.4.2 Equations and Functions 394
A.4.3 Add-Ins 398
A.4.4 Charts 398
A.4.5 Downloadable Macros 400
A.5 Matlab for Chemometrics 408
A.5.1 Getting Started 408
A.5.2 File Types 409
A.5.3 Matrices 411
A.5.4 Importing and Exporting Data 416
A.5.5 Introduction to Programming and Structure 417
A.5.6 Graphics 418
Answers to the Multiple Choice Questions 429
Index 433
There are many opinions about the origin of chemometrics. Until quite recently, its birth was considered to have happened in the 1970s. The name first appeared in 1972 in an article by Svante Wold [1]; in fact, the topic of that article was not one we would recognise as core to chemometrics, being relevant neither to multivariate analysis nor to experimental design. For over a decade, the word chemometrics kept a very low profile, and it developed a recognisable presence only in the 1980s, as described below.
However, if an explorer describes a new species in a forest, the species was there long before the explorer. The naming of the discipline thus simply recognises that it had reached some level of visibility and maturity. As people re-evaluate the origins of chemometrics, its birth can be traced back many years.
Chemometrics burst into the world owing to three fundamental factors: applied statistics (multivariate methods and experimental design), statistics in analytical and physical chemistry, and scientific computing.
The ideas of multivariate statistics have been around a long time. R.A. Fisher and colleagues working at Rothamsted, UK, formalised many of our modern ideas while applying them primarily to agriculture. In the UK before the First World War, many of the upper classes owned extensive land and relied for their income on tenant farmers and agricultural labourers. After the First World War, the cost of labour rose as many workers moved to the cities, and there was stronger competition from imported food. Historic agricultural practices came to be seen as inefficient, and it was hard for landowners (or the companies that took over large estates) to remain economic and competitive; hence there was a huge emphasis on agricultural research, including statistics, to improve practice. R.A. Fisher and co-workers published some of the first major books and papers that we would regard as defining modern statistical thinking [2, 3], introducing ideas ranging from the null hypothesis to discriminant analysis to ANOVA. Some of Fisher's work followed from the pioneering work of Karl Pearson at University College London, who had earlier founded the world's first statistics department and had first formulated ideas such as p-values and correlation coefficients.
During the 1920s and 1930s, a number of important pioneers of multivariate statistics published their work, many strongly influenced by, or having worked with, Fisher. Among them was Harold Hotelling, credited by many with defining principal components analysis (PCA) [4], although Pearson had independently described the method some 30 years earlier under a different guise. As so often in science, ideas are reported several times over, and it is the person who names and popularises a method that often gets the credit: in the early twentieth century, libraries were often localised and there were very few international journals (Hotelling worked mainly in the US) and certainly no internet, so parallel work was common.
The principles of statistical experimental design were also formulated at around this period. There had been earlier reports of what we would regard as modern approaches to formal designs, for example James Lind's work on scurvy in the eighteenth century and Charles Peirce's discussion of randomised trials in the nineteenth century, but Fisher's classic work of the 1930s put all the concepts together in a rigorous statistical format [5].
For nearly a century, much non-Bayesian applied statistical thinking has been based on principles established in the 1920s and 1930s. Early applications included agriculture, psychology, finance and genetics. After the Second World War, the chemical industry took an interest. In the 1920s, an important need had been to improve agricultural practice, but by the 1950s a major need was to improve manufacturing processes, especially in chemical engineering; hence, many more statisticians were employed within industry. O.L. Davies edited an important book on experimental design with contributions from colleagues at ICI [6]. Foremost was G.E.P. Box, son-in-law of Fisher, whose book with colleagues is one of the most important post-war classics in experimental design and multi-linear regression [7].
These statistical building blocks were already mature by the time people started calling themselves chemometricians and have changed only a little during the intervening period.
Statistical methods, for example to estimate the accuracy and precision of measurements or to determine a best-fit linear relationship between two variables, have been available to analytical and physical chemists for over a century. Almost every general analytical textbook includes chapters on univariate statistics, and has done for decades. Although theoretically we could view this as applied statistics, on the whole the people who advanced statistics in analytical chemistry did not class themselves as applied statisticians, and a specialist terminology has developed over time.
Until the 1970s, most quantitative analytical and physical chemistry was viewed as a univariate field; that is, only one independent variable was measured in an experiment, with all other external factors kept constant. This so-called 'One Factor at a Time' (OFAT) approach worked well in mechanics and fundamental physics. Hence, statistical methods were primarily used for univariate analysis of data. By the late 1940s, some analytical chemists were aware of ANOVA, F-tests and linear regression [8], although the term chemometrics had not yet been invented; multivariate data came along much later.
In these early days, there would have been very limited cross-fertilisation between applied statisticians, working in mathematics departments, and analytical chemists in chemistry departments. Different departments often had different buildings, different libraries and different textbooks. A chemist, however numerate, would feel a stranger walking into a maths building and would probably cocoon himself or herself in their own library. There was no such thing as the Internet, the Web of Knowledge or electronic journals. Maths journals published papers for mathematicians, and likewise chemistry journals for chemists. Although in areas such as agriculture and psychology there was a tradition of consulting statisticians, chemists were numerate and tended to talk to each other: an experimental chemist wanting to fit a straight line would talk to a physical chemist in the tea room if need be. Hence, ideas did not travel within academia. Industry was somewhat more pragmatic, but even there the main statistical innovations were in chemical engineering and process chemistry, often classed as industrial chemistry. The top universities often did not teach or research industrial chemistry, although they did teach Newtonian physics and relativity. In fact, the treatment of variables and errors by physicists trying, for example, to measure gravitational effects or the distance of a star is quite different from multivariate statistics: the former try to design experiments so that only one factor is studied and any errors are minimised and come from a single source, whereas a multivariate statistician might accept and even expect data to be multifactorial.
Hence, statistics in analytical chemistry diverged from applied statistics for many decades. Caulcutt and Boddy's book, first published in 1983, contains nothing on multivariate statistics [9], and in Miller and Miller's book of 1993 just one of six main chapters is devoted to experimental design, optimisation and pattern recognition (including PCA) [10].
Even now, there are numerous useful books aimed at analytical and physical chemists that omit multivariate statistics. An elaborate vocabulary has developed for the needs of analytical chemists, with specialist concepts that are rarely encountered in other areas. Some analytical chemists in the 1960s to 1980s were aware that multivariate approaches existed and did venture into chemometrics, but good multivariate data were limited; most, though, were aware of ANOVA and experimental design. However, statistics for analytical chemistry tends to lead a separate existence from chemometrics, although multivariate methods derived from chemometrics do have a small foothold within most graduate-level courses and books in general analytical chemistry, and certainly quantitative analytical (and physical) chemistry was an important building block for modern chemometrics.
Over the last two decades, however, applications of chemometrics have moved far beyond traditional quantitative analytical chemistry, for example into metabolomics, the environment, cultural heritage and food, where the outcome is not necessarily to measure accurately the concentration of an analyte or to determine how many compounds are present in the spectra of a series of mixtures. This means that the aim of some chemometric analysis has changed. We often do not have, for example, well-established reference samples, and in many cases we cannot judge a method by how efficiently it predicts properties of such samples. We may not know whether the spectra of some extracts of urine samples contain enough information to tell whether our donors are diseased or not: it may depend on how far the disease has progressed, how good the diagnosis is, what the genetics of the donors are, and so on. Hence, we may never have a model that perfectly distinguishes two groups of samples. In classical physical or...