Machine Learning

Name: Machine Learning | A Concise Introduction
Brand: Wiley
Price: 89.99 EUR
Availability: OnlineOnly

A Concise Introduction

Steven W. Knox(Author)

Wiley (Publisher)

2nd Edition

Published on 6. February 2026

775 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-394-32527-6 (ISBN)

€89.99incl. 7% vat

System requirements

for ePUB with Adobe-DRM

E-Book Single Licence

Available for download

Description

Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.

Alles über E-Books, Kopierschutz & Dateiformate finden Sie in unserem Info- & Hilfebereich.

New edition of a PROSE award finalist title on core concepts for machine learning, updated with the latest developments in the field, now with Python and R source code side-by-side

Machine Learning is a comprehensive text on the core concepts, approaches, and applications of machine learning. It presents fundamental ideas, terminology, and techniques for solving applied problems in classification, regression, clustering, density estimation, and dimension reduction. New content for this edition includes chapter expansions which provide further computational and algorithmic insights to improve reader understanding. This edition also revises several chapters to account for developments since the prior edition.

In this book, the design principles behind the techniques are emphasized, including the bias-variance trade-off and its influence on the design of ensemble methods, enabling readers to solve applied problems more efficiently and effectively. This book also includes methods for optimization, risk estimation, model selection, and dealing with biased data samples and software limitations - essential elements of most applied projects.

Written by an expert in the field, this important resource:

Illustrates many classification methods with a single, running example, highlighting similarities and differences between methods
Presents side-by-side Python and R source code which shows how to apply and interpret many of the techniques covered
Includes many thoughtful exercises as an integral part of the text, with an appendix of selected solutions
Contains useful information for effectively communicating with clients on both technical and ethical topics
Details classification techniques including likelihood methods, prototype methods, neural networks, classification trees, and support vector machines

A volume in the popular Wiley Series in Probability and Statistics, Machine Learning offers the practical information needed for an understanding of the methods and application of machine learning for advanced undergraduate and beginner graduate students, data science and machine learning practitioners, and other technical professionals in adjacent fields.

More details

Other editions

Person

Content

Preface xi

Organization - How to Use This Book xii

Acknowledgments xiv

About the Companion Website xiv

1 Introduction - Examples from Real Life 1

2 The Problem of Learning 3

2.1 Domain 3

2.2 Range 4

2.3 Data 4

2.4 Loss 5

2.5 Risk 8

2.6 The Reality of the Unknown Function 12

2.7 Training and Selection of Models 12

2.8 Purposes of Learning 14

2.9 Notation 14

3 Regression 15

3.1 General Framework 16

3.2 Loss 17

3.3 Estimating the Model Parameters 17

3.4 Properties of Fitted Values 19

3.5 Estimating the Variance 22

3.6 A Normality Assumption 23

3.7 Computation 25

3.8 Categorical Features 26

3.9 Feature Expansions, Interactions, and Transformations 28

3.10 Penalized Regression: Model Transformation for Risk Reduction 31

3.11 Variations in Linear Regression 37

3.12 Nonlinear Regression 39

3.13 Nonparametric Regression 42

4 Classification 45

4.1 The Bayes Classifier 46

4.2 Introduction to Classifiers 47

4.3 Mitigating Biases in Software, Biases in Data, and Zero Probabilities 49

4.3.1 Mitigating Biases in Software by Adjusting Loss and Prior Probabilities 50

4.3.2 Mitigating Biases in Data by Adjusting Loss or Prior Probabilities 51

4.3.3 Mitigating Effects of Zero-One Probability Estimates 52

4.4 Class Boundaries 53

4.5 A Running Example 54

4.6 Likelihood Methods 55

4.6.1 Quadratic Discriminant Analysis 56

4.6.2 Linear Discriminant Analysis 58

4.6.3 Gaussian Mixture Models 60

4.6.4 Kernel Density Estimation 61

4.6.5 Histograms 65

4.6.6 The Naive Bayes Classifier 68

4.7 Prototype Methods 69

4.7.1 k-Nearest-Neighbor 69

4.7.2 Condensed k-Nearest-Neighbor 72

4.7.3 Nearest-Cluster 73

4.7.4 Learning Vector Quantization 73

4.8 Logistic Regression 76

4.8.1 The Logistic Regression Model 76

4.8.2 Adjusting the Marginal or Prior Distribution of Classes 79

4.8.3 Class Boundaries, Hyperplanes, and Geometry 81

4.9 Neural Networks 81

4.9.1 Activation Functions 81

4.9.2 Neurons 82

4.9.3 Single-Hidden-Layer Neural Networks 84

4.9.4 Multi-Hidden-Layer Neural Networks 90

4.9.5 Adjusting the Marginal or Prior Distribution of Classes 91

4.9.6 Logistic Regression and Zero-Hidden-Layer Neural Networks 91

4.10 Classification Trees 93

4.10.1 Classification of Data by Leaves (Terminal Nodes) 93

4.10.2 Impurity of Nodes and Trees 94

4.10.3 Growing Trees 95

4.10.4 Pruning Trees 98

4.10.5 Regression Trees 99

4.11 Support Vector Machines 100

4.11.1 A Geometric Definition of "Good" 100

4.11.2 Support Vector Machine Classifiers for Linearly Separable Data 101

4.11.3 The Central Role of Inner Products 103

4.11.4 Support Vector Machine Classifiers for Data Not Linearly Separable 104

4.11.5 Slack Variables as Hinge Loss 105

4.11.6 Multiple Classes, General Loss, and Non-uniform Class Prior 107

4.11.7 Approximation of the Bayes Classifier 109

4.11.8 Inner Products via Kernel Functions 110

4.12 Postscript: Example Problem Revisited 119

5 Bias-Variance Trade-Off 121

5.1 Squared-Error Loss 121

5.2 General Loss 125

6 Combining Classifiers 131

6.1 Ensembles 131

6.2 Ensemble Design 136

6.3 Bootstrap Aggregation (Bagging) 138

6.4 Random Forests 141

6.5 Boosting and Arcing 142

6.6 Classification by Regression Ensemble 147

6.7 Gradient Boosting 151

6.8 Stacking and Mixture of Experts 156

6.9 Postscript: Example Problem Revisited 160

7 Risk Estimation and Model Selection 163

7.1 Risk Estimation via Training Data 164

7.2 Risk Estimation via Validation or Test Data 164

7.2.1 Training, Validation, and Test Data Sets 164

7.2.2 Training, Validation, and Test Estimates of Risk 165

7.2.3 Application - Precision of Validation and Test Estimates of Risk 166

7.2.4 Application - Comparing a Model's Risk to a Target Value 166

7.2.5 Application - Comparing the Difference of Models' Risks to a Target Value 168

7.3 Cross-Validation 169

7.4 Improvements on Cross-Validation 171

7.5 Out-of-Bag Risk Estimation 172

7.6 Akaike's Information Criterion 173

7.7 Schwartz's Bayesian Information Criterion 174

7.8 Rissanen's Minimum Description Length Criterion 175

7.9 R 2 and Adjusted R 2 175

7.10 Stepwise Model Selection 177

7.11 Occam's Razor 177

7.12 Size of Validation and Test Data Sets 178

7.12.1 Measures of Performance for Hypothesis Tests About Risk 178

7.12.2 Size of Training, Validation, and Test Data Sets 178

7.12.3 Example Construction of Training, Validation, and Test Data Sets 180

7.12.4 Example Use of Training and Validation Data Sets 183

7.12.5 Example Use of Test Data Sets 186

8 Consistency 187

8.1 Convergence of Sequences of Random Variables 187

8.2 Consistency for Parameter Estimation 188

8.3 Consistency for Prediction 188

8.4 There Are Consistent and Universally Consistent Classifiers 189

8.5 Convergence to Asymptopia Is Not Uniform and May Be Slow 191

9 Clustering 193

9.1 Gaussian Mixture Models 194

9.2 k-Means 194

9.3 Clustering by Mode-Hunting in a Density Estimate 195

9.4 Using Classifiers to Cluster 196

9.5 Dissimilarity 196

9.6 k-Medoids 197

9.7 k-Modes and k-Prototypes 197

9.8 Agglomerative Hierarchical Clustering 198

9.9 Divisive Hierarchical Clustering 199

9.10 How Many Clusters Are There? Interpretation of Clustering 200

9.11 An Impossibility Theorem 201

10 Optimization 203

10.1 Quasi-Newton Methods 204

10.1.1 The Newton-Raphson Method for Finding Zeros 204

10.1.2 The Newton-Raphson Method for Optimization 205

10.1.3 Gradient Descent 205

10.1.4 The Broyden-Fletcher-Goldfarb-Shanno Algorithm 205

10.1.5 Modifications to Quasi-Newton Methods 206

10.2 The Nelder-Mead Algorithm 207

10.3 Simulated Annealing 207

10.4 Genetic Algorithms 209

10.5 Particle Swarm Optimization 210

10.6 General Remarks on Optimization 211

10.6.1 Imperfectly Known Objective Functions 211

10.6.2 Objective Functions That Are Sums 212

10.6.3 Optimization from Multiple Starting Points 212

10.7 Solving Least-Squares Problems via Quasi-Newton Methods 213

10.8 Gradient Computation for Neural Networks via Backpropagation 214

10.9 Handling Missing Data via the Expectation-Maximization Algorithm 219

10.9.1 The General Algorithm 219

10.9.2 EM Climbs the Marginal Likelihood of the Observations 220

10.9.3 Example - Fitting a Gaussian Mixture Model via EM 222

10.9.4 Example - The Expectation Step 223

10.9.5 Example - The Maximization Step 224

10.10 Fitting Support Vector Machines via Sequential Minimal Optimization 224

10.10.1 Primal and Dual Forms of the Linear SVM Optimization Problem 225

10.10.2 Slater's Condition and the Karush-Kuhn-Tucker Conditions 226

10.10.3 Generalization to Kernel Support Vector Machines 228

10.10.4 Computation of the Intercept 229

10.10.5 Solving the Dual Problem via Sequential Minimal Optimization 229

10.10.6 Step 1 - Choosing a Pair of Coordinates to Optimize 230

10.10.7 Step 2 - Constrained Optimization of a Pair of Coordinates 231

11 High-Dimensional Data 235

11.1 The Curse of Dimensionality 236

11.2 Two Running Examples 242

11.2.1 Example 1: Equilateral Simplex 242

11.2.2 Example 2: Text 242

11.3 Reducing Dimension While Preserving Information 243

11.3.1 The Geometry of Means and Covariances of Real Features 245

11.3.2 Principal Component Analysis 246

11.3.3 Working in "Dissimilarity Space" 248

11.3.4 Linear Multidimensional Scaling 248

11.3.5 The Singular Value Decomposition and Low-Rank Approximation 250

11.3.6 Stress-Minimizing Multidimensional Scaling 252

11.3.7 Projection Pursuit 252

11.3.8 Feature Selection 253

11.3.9 Clustering 254

11.3.10 Manifold Learning 254

11.3.11 Autoencoders 257

11.4 Model Regularization 261

11.4.1 Duality and the Geometry of Parameter Penalization 262

11.4.2 Parameter Penalization as Prior Information 263

12 Communication with Clients 267

12.1 Binary Classification and Hypothesis Testing 267

12.2 Terminology for Binary Decisions 269

12.3 Receiver Operating Characteristic (ROC) Curves 271

12.4 One-Dimensional Measures of Performance 273

12.5 Confusion Matrices 276

12.6 Pairwise Model Comparison 277

12.7 Multiple Testing 277

12.7.1 Control the Familywise Error 278

12.7.2 Control the False Discovery Rate 278

12.8 Expert Systems 279

12.9 Ethics in Machine Learning 280

12.9.1 Philosophical Foundations 280

12.9.2 Clear Goals 281

12.9.3 Good Practice 281

12.9.4 Machine Learning Might Not Be the Answer, and That's OK 282

12.9.5 Documentation 282

13 Current Challenges in Machine Learning 283

13.1 Streaming Data 283

13.2 Distributed Data 283

13.3 Semi-Supervised Learning 283

13.4 Active Learning 284

13.5 Feature Construction via Deep Neural Networks 284

13.6 Transfer Learning 284

13.7 Interpretability and Protection of Complex Models 285

14 R and Python Source Code 287

14.1 Author's Biases 288

14.2 Packages and Code 288

14.3 The Running Example (Section 4.5) 289

14.4 The Bayes Classifier (Section 4.1) 292

14.5 Quadratic Discriminant Analysis (Section 4.6.1) 294

14.6 Linear Discriminant Analysis (Section 4.6.2) 296

14.7 Gaussian Mixture Models (Section 4.6.3) 297

14.8 Kernel Density Estimation (Section 4.6.4) 300

14.9 Histograms (Section 4.6.5) 304

14.10 The Naive Bayes Classifier (Section 4.6.6) 309

14.11 k-Nearest-Neighbor (Section 4.7.1) 312

14.12 Learning Vector Quantization (Section 4.7.4) 314

14.13 Logistic Regression (Section 4.8) 317

14.14 Neural Networks (Section 4.9) 319

14.15 Classification Trees (Section 4.10) 324

14.16 Support Vector Machines (Section 4.11) 332

14.17 Bootstrap Aggregation (Bagging) (Section 6.3) 341

14.18 Random Forests (Section 6.4) 343

14.19 Boosting by Reweighting (Section 6.5) 345

14.20 Boosting by Sampling (Arcing) (Section 6.5) 346

14.21 Gradient Boosted Trees (Section 6.7) 347

Appendix-A: List of Symbols 351

Appendix-B: The Condition Number of a Matrix with Respect to a Norm 353

Appendix-C: Converting Between Normal Parameters and Level-Curve Ellipsoids 357

Appendix-D: The Geometry of Linear Functions and Linear Classifiers 359

Appendix-E: Training Data and Fitted Parameters 367

Appendix-F: Solutions to Selected Exercises 371

Bibliography 399

Index 413

Chapter 2
The Problem of Learning

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

- John Tukey, The Future of Data Analysis, 1962

This book treats The Problem of Learning, which can be stated generally and succinctly as follows.

The Problem of Learning

There are a known set and an unknown function on . Given data, construct a good approximation of . This is called learning .

The problem of learning has been studied in many guises and in different fields, such as statistics, computer science, mathematics, and the natural and social sciences. An advantage of this situation is that the set of methods now available for addressing the problem has benefited from a tremendous diversity of perspective and knowledge. A disadvantage is a certain lack of coherence in the language used to describe the problem: there are, perhaps, more names for things than there are things which need names, and this can make the study of the problem of learning appear more complicated and confusing than it really is.

This chapter introduces the main ideas which occur in almost any applied problem in learning, and introduces language commonly used to describe these ideas. The ideas and language introduced here will be used in every other chapter of this book.

2.1 Domain

The set is called feature space, and an element is called a feature vector (or an input). The coordinates of are called features. Individual features may take values in a continuum, a discrete ordered set, or a discrete unordered set.

In Problem 1 ("Shuttle"), is an interval of possible air temperatures: (-60, 140) Fahrenheit, say, or Kelvin. In Problem 4 ("Postal Code"), is the set of all 8 × 8 matrices with entries in {0, 1, ., 255}. In Problem 2 ("Ballot"), is for the given number of features , each feature being a count of people or votes, though in practice might well be taken to be the -dimensional real numbers, .

2.2 Range

The range is usually either a finite, unordered set, in which case learning is called classification,1 or it is a continuum, in which case learning is called regression. If the range is a finite, ordered set, learning is called ordinal regression.2 An element is called a class in classification and a response in regression.

In Problem 1 ("Shuttle"), , the set of probabilities that an O-ring is damaged.3 In Problem 4 ("Postal Code"), = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}, the quotes indicating that these are unordered labels rather than numbers. In Problem 2 ("Ballot"), is the set of nonnegative integers less than or equal to the number of registered voters in Palm Beach County, but in practice would usually be taken to be , the set of real numbers.

2.3 Data

In principle, data are random draws from a probability distribution on . Depending on the problem at hand, the data which are observed may either consist of domain-range pairs or just domain values : learning is called supervised in the former case and unsupervised in the latter.

In supervised learning, the data are

where each is drawn from a joint probability distribution on . Such data are called marked data.4 It is sometimes useful to consider the data as produced by a two-step process, in one of two ways: by drawing from marginal distribution on and then drawing a corresponding feature vector from conditional distribution on ; or by drawing feature vector from marginal distribution on and then drawing a corresponding from conditional distribution on . These two points of view correspond to the two factorizations5

Both are useful in classification. The latter is more useful in regression, where the function to be learned is often, though not necessarily, the expected value of the conditional distribution of , that is,

In unsupervised learning, the data are

Such data are called unmarked data. The range is either assumed to be finite and unordered, in which case unsupervised learning is called clustering, or it is and the function to be learned is the mass or density function of the marginal distribution of the features,

in which case unsupervised learning is called density estimation. In clustering problems, it may be useful to think of as being drawn originally as , but somehow were unobserved or lost: that is, it may be useful to think of as latent variables. In such cases, there may be reason to believe that discrete labels really did exist at some point while in other cases the discrete labels are potentially useful products of the machine learning practitioner's imagination.

Problem 1 ("Shuttle") is supervised learning, since the data are of the form

Problem 6 ("Vaults") is unsupervised learning, and is unknown (though archaeologists might have other evidence outside the data - such as artifacts recovered from the vaults or related sites - which indicates that, for example, ).

Figure 2.1 summarizes some of the terminology introduced so far and illustrates some connections between the four learning problems mentioned. Sometimes both marked and unmarked data are available. This situation is called semi-supervised learning.6

Figure 2.1 Four categories of learning problems, in bold type, split according to whether the range of is finite (and unordered) or continuous, and whether or not elements of the range are observed directly. Some solutions to these problems are devised by transforming one problem into another: examples of this are shown in red and described in Chapters 3, 4, 6, and 9.

2.4 Loss

What does "a good approximation" mean? This is specified by a loss function

where for each , is the loss, or penalty, incurred by approximating with . Common examples are:

name typical application formula squared-error loss regression absolute-error loss regression zero-one loss classification

In classification, an arbitrary loss function can be specified by the cells of a loss matrix, usually with all-zero diagonal elements and positive off-diagonal elements (so correct classification incurs no loss and misclassification incurs positive loss).7 Zero-one loss is represented by the matrix with 0's on the diagonal and 1's everywhere off the diagonal.

Some techniques for solving a classification problem focus on converting the problem into a regression problem, solving the regression problem to learn a continuous function

and then extracting predicted class labels from the continuous function as

Such approaches follow the top-most red arrow in Figure 2.1. When there are classes, sometimes the class labels are represented as -1 and +1, and the predicted class at is

Loss functions used to learn the regression function in such situations include

name typical application formula hinge loss binary classification exponential loss binary classification

Sometimes a regression approach is used to estimate the conditional probability distribution on class labels given the feature vector,

resulting in

A loss function used to learn the regression function in such situations is:

name typical application formula cross-entropy loss classification

These loss functions are sometimes implicit in specific methods: we shall see that cross-entropy loss arises in logistic regression and neural networks in Sections 4.8 and 4.9; hinge loss arises in support vector machines in Section 4.11; and exponential loss arises in ensembles of classifiers produced by AdaBoost and similar algorithms in Sections 6.5 and 6.6.

The choice of loss function is subjective and problem-dependent.8 An asymmetric loss function is sometimes appropriate: for example, in Problem 3 ("Heart"), declaring a healthy person to be sick may be viewed as a much less costly error than declaring a sick person to be healthy.9 That said, loss functions are often chosen for convenience and computational tractability.10 Exercises in Sections 3.6 and 4.8, respectively, derive squared-error loss and cross-entropy loss from the (subjective) principle that the best model in a class of models is one which assigns maximal likelihood to the observed data.

Exercise 2.1 (Cross-entropy loss and Kullback-Leibler information):

The Kullback-Leibler...

System requirements

Save as PDF Copy link into clipboard

Schweitzer Fachinformationen

Machine Learning

Description

More details

Other editions

Additional editions

Person

Content

Chapter 2
The Problem of Learning

The Problem of Learning

2.1 Domain

2.2 Range

2.3 Data

2.4 Loss

Exercise 2.1 (Cross-entropy loss and Kullback-Leibler information):

System requirements

Schweitzer Fachinformationen

Machine Learning

Description

More details

Other editions

Additional editions

Person

Content

Chapter 2 The Problem of Learning

The Problem of Learning

2.1 Domain

2.2 Range

2.3 Data

2.4 Loss

Exercise 2.1 (Cross-entropy loss and Kullback-Leibler information):

System requirements

Chapter 2
The Problem of Learning