Machine Learning

Name: Machine Learning | a Concise Introduction
Brand: Wiley
Price: 92.99 EUR
Availability: OnlineOnly

a Concise Introduction

Steven W. Knox(Author)

Wiley (Publisher)

1st Edition

Published on 15. March 2018

352 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-119-43898-4 (ISBN)

€92.99incl. 7% vat

System requirements

for ePUB with Adobe-DRM

E-Book Single Licence

Available for download

Description

Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.

Alles über E-Books, Kopierschutz & Dateiformate finden Sie in unserem Info- & Hilfebereich.

AN INTRODUCTION TO MACHINE LEARNING THAT INCLUDES THE FUNDAMENTAL TECHNIQUES, METHODS, AND APPLICATIONS

PROSE Award Finalist 2019
Association of American Publishers Award for Professional and Scholarly Excellence

Machine Learning: a Concise Introduction offers a comprehensive introduction to the core concepts, approaches, and applications of machine learning. The author--an expert in the field--presents fundamental ideas, terminology, and techniques for solving applied problems in classification, regression, clustering, density estimation, and dimension reduction. The design principles behind the techniques are emphasized, including the bias-variance trade-off and its influence on the design of ensemble methods. Understanding these principles leads to more flexible and successful applications. Machine Learning: a Concise Introduction also includes methods for optimization, risk estimation, and model selection-- essential elements of most applied projects. This important resource:

* Illustrates many classification methods with a single, running example, highlighting similarities and differences between methods

* Presents R source code which shows how to apply and interpret many of the techniques covered

* Includes many thoughtful exercises as an integral part of the text, with an appendix of selected solutions

* Contains useful information for effectively communicating with clients

A volume in the popular Wiley Series in Probability and Statistics, Machine Learning: a Concise Introduction offers the practical information needed for an understanding of the methods and application of machine learning.

STEVEN W. KNOX holds a Ph.D. in Mathematics from the University of Illinois and an M.S. in Statistics from Carnegie Mellon University. He has over twenty years' experience in using Machine Learning, Statistics, and Mathematics to solve real-world problems. He currently serves as Technical Director of Mathematics Research and Senior Advocate for Data Science at the National Security Agency.

More details

Other editions

Person

Content

Preface xi

Organization-How to Use This Book xiii

Acknowledgments xvii

About the Companion Website xix

1 Introduction-Examples from Real Life 1

2 The Problem of Learning 3

2.1 Domain 4

2.2 Range 4

2.3 Data 4

2.4 Loss 6

2.5 Risk 8

2.6 The Reality of the Unknown Function 12

2.7 Training and Selection of Models, and Purposes of Learning 12

2.8 Notation 13

3 Regression 15

3.1 General Framework 16

3.2 Loss 17

3.3 Estimating the Model Parameters 17

3.4 Properties of Fitted Values 19

3.5 Estimating the Variance 22

3.6 A Normality Assumption 23

3.7 Computation 24

3.8 Categorical Features 25

3.9 Feature Transformations, Expansions, and Interactions 27

3.10 Variations in Linear Regression 28

3.11 Nonparametric Regression 32

4 Survey of Classification Techniques 33

4.1 The Bayes Classifier 34

4.2 Introduction to Classifiers 37

4.3 A Running Example 38

4.4 Likelihood Methods 40

4.4.1 Quadratic Discriminant Analysis 41

4.4.2 Linear Discriminant Analysis 43

4.4.3 Gaussian Mixture Models 45

4.4.4 Kernel Density Estimation 47

4.4.5 Histograms 51

4.4.6 The Naive Bayes Classifier 54

4.5 Prototype Methods 54

4.5.1 k-Nearest-Neighbor 55

4.5.2 Condensed k-Nearest-Neighbor 56

4.5.3 Nearest-Cluster 56

4.5.4 Learning Vector Quantization 58

4.6 Logistic Regression 59

4.7 Neural Networks 62

4.7.1 Activation Functions 62

4.7.2 Neurons 64

4.7.3 Neural Networks 65

4.7.4 Logistic Regression and Neural Networks 73

4.8 Classification Trees 74

4.8.1 Classification of Data by Leaves (Terminal Nodes) 74

4.8.2 Impurity of Nodes and Trees 75

4.8.3 Growing Trees 76

4.8.4 Pruning Trees 79

4.8.5 Regression Trees 81

4.9 Support Vector Machines 81

4.9.1 Support Vector Machine Classifiers 81

4.9.2 Kernelization 88

4.9.3 Proximal Support Vector Machine Classifiers 92

4.10 Postscript: Example Problem Revisited 93

5 Bias-Variance Trade-off 97

5.1 Squared-Error Loss 98

5.2 Arbitrary Loss 101

6 Combining Classifiers 107

6.1 Ensembles 107

6.2 Ensemble Design 110

6.3 Bootstrap Aggregation (Bagging) 112

6.4 Bumping 115

6.5 Random Forests 116

6.6 Boosting 118

6.7 Arcing 121

6.8 Stacking and Mixture of Experts 121

7 Risk Estimation and Model Selection 127

7.1 Risk Estimation via Training Data 128

7.2 Risk Estimation via Validation or Test Data 128

7.2.1 Training, Validation, and Test Data 128

7.2.2 Risk Estimation 129

7.2.3 Size of Training, Validation, and Test Sets 130

7.2.4 Testing Hypotheses About Risk 131

7.2.5 Example of Use of Training, Validation, and Test Sets 132

7.3 Cross-Validation 133

7.4 Improvements on Cross-Validation 135

7.5 Out-of-Bag Risk Estimation 137

7.6 Akaike's Information Criterion 138

7.7 Schwartz's Bayesian Information Criterion 138

7.8 Rissanen's Minimum Description Length Criterion 139

7.9 R2 and Adjusted R2 140

7.10 Stepwise Model Selection 141

7.11 Occam's Razor 142

8 Consistency 143

8.1 Convergence of Sequences of Random Variables 144

8.2 Consistency for Parameter Estimation 144

8.3 Consistency for Prediction 145

8.4 There Are Consistent and Universally Consistent Classifiers 145

8.5 Convergence to Asymptopia Is Not Uniform and May Be Slow 147

9 Clustering 149

9.1 Gaussian Mixture Models 150

9.2 k-Means 150

9.3 Clustering by Mode-Hunting in a Density Estimate 151

9.4 Using Classifiers to Cluster 152

9.5 Dissimilarity 153

9.6 k-Medoids 153

9.7 Agglomerative Hierarchical Clustering 154

9.8 Divisive Hierarchical Clustering 155

9.9 How Many Clusters Are There? Interpretation of Clustering 155

9.10 An Impossibility Theorem 157

10 Optimization 159

10.1 Quasi-Newton Methods 160

10.1.1 Newton's Method for Finding Zeros 160

10.1.2 Newton's Method for Optimization 161

10.1.3 Gradient Descent 161

10.1.4 The BFGS Algorithm 162

10.1.5 Modifications to Quasi-Newton Methods 162

10.1.6 Gradients for Logistic Regression and Neural Networks 163

10.2 The Nelder-Mead Algorithm 166

10.3 Simulated Annealing 168

10.4 Genetic Algorithms 168

10.5 Particle Swarm Optimization 169

10.6 General Remarks on Optimization 170

10.6.1 Imperfectly Known Objective Functions 170

10.6.2 Objective Functions Which Are Sums 171

10.6.3 Optimization from Multiple Starting Points 172

10.7 The Expectation-Maximization Algorithm 173

10.7.1 The General Algorithm 173

10.7.2 EM Climbs the Marginal Likelihood of the Observations 173

10.7.3 Example-Fitting a Gaussian Mixture Model Via EM 176

10.7.4 Example-The Expectation Step 177

10.7.5 Example-The Maximization Step 178

11 High-Dimensional Data 179

11.1 The Curse of Dimensionality 180

11.2 Two Running Examples 187

11.2.1 Example 1: Equilateral Simplex 187

11.2.2 Example 2: Text 187

11.3 Reducing Dimension While Preserving Information 190

11.3.1 The Geometry of Means and Covariances of Real Features 190

11.3.2 Principal Component Analysis 192

11.3.3 Working in "Dissimilarity Space" 193

11.3.4 Linear Multidimensional Scaling 195

11.3.5 The Singular Value Decomposition and Low-Rank Approximation 197

11.3.6 Stress-Minimizing Multidimensional Scaling 199

11.3.7 Projection Pursuit 199

11.3.8 Feature Selection 201

11.3.9 Clustering 202

11.3.10 Manifold Learning 202

11.3.11 Autoencoders 205

11.4 Model Regularization 209

11.4.1 Duality and the Geometry of Parameter Penalization 212

11.4.2 Parameter Penalization as Prior Information 213

12 Communication with Clients 217

12.1 Binary Classification and Hypothesis Testing 218

12.2 Terminology for Binary Decisions 219

12.3 ROC Curves 219

12.4 One-Dimensional Measures of Performance 224

12.5 Confusion Matrices 225

12.6 Multiple Testing 226

12.6.1 Control the Familywise Error 226

12.6.2 Control the False Discovery Rate 227

12.7 Expert Systems 228

13 Current Challenges in Machine Learning 231

13.1 Streaming Data 231

13.2 Distributed Data 231

13.3 Semi-supervised Learning 232

13.4 Active Learning 232

13.5 Feature Construction via Deep Neural Networks 233

13.6 Transfer Learning 233

13.7 Interpretability of Complex Models 233

14 R Source Code 235

14.1 Author's Biases 236

14.2 Libraries 236

14.3 The Running Example (Section 4.3) 237

14.4 The Bayes Classifier (Section 4.1) 241

14.5 Quadratic Discriminant Analysis (Section 4.4.1) 243

14.6 Linear Discriminant Analysis (Section 4.4.2) 243

14.7 Gaussian Mixture Models (Section 4.4.3) 244

14.8 Kernel Density Estimation (Section 4.4.4) 245

14.9 Histograms (Section 4.4.5) 248

14.10 The Naive Bayes Classifier (Section 4.4.6) 253

14.11 k-Nearest-Neighbor (Section 4.5.1) 255

14.12 Learning Vector Quantization (Section 4.5.4) 257

14.13 Logistic Regression (Section 4.6) 259

14.14 Neural Networks (Section 4.7) 260

14.15 Classification Trees (Section 4.8) 263

14.16 Support Vector Machines (Section 4.9) 267

14.17 Bootstrap Aggregation (Section 6.3) 272

14.18 Boosting (Section 6.6) 274

14.19 Arcing (Section 6.7) 275

14.20 Random Forests (Section 6.5) 275

A List of Symbols 277

B Solutions to Selected Exercises 279

C Converting Between Normal Parameters and Level-Curve Ellipsoids 299

D Training Data and Fitted Parameters 301

References 305

Index 315

2
The Problem of Learning

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

-John Tukey, The Future of Data Analysis, 1962

This book treats The Problem of Learning, which can be stated generally and succinctly as follows.

The Problem of Learning. There are a known set and an unknown function f on . Given data, construct a good approximation of f. This is called learning f.

The problem of learning has been studied in many guises and in different fields, such as statistics, computer science, mathematics, and the natural and social sciences. An advantage of this situation is that the set of methods now available for addressing the problem has benefited from a tremendous diversity of perspective and knowledge. A disadvantage is a certain lack of coherence in the language used to describe the problem: there are, perhaps, more names for things than there are things which need names, and this can make the study of the problem of learning appear more complicated and confusing than it really is.

This chapter introduces the main ideas which occur in almost any applied problem in learning, and introduces language commonly used to describe these ideas. The ideas and language introduced here will be used in every other chapter of this book.

2.1 Domain

The set is called feature space and an element is called a feature vector (or an input). The coordinates of X are called features. Individual features may take values in a continuum, a discrete, ordered set, or a discrete, unordered set.

In Problem 1 ("Shuttle"), is an interval of possible air temperatures: (- 60, 140) Fahrenheit, say, or [0, 8) Kelvin. In Problem 4 ("ZIP Code"), is the set of all 8 × 8 matrices with entries in {0, 1, ., 255}. In Problem 2 ("Ballot"), is {0, 1, 2, .}m for the given number of features m, each feature being a count of people or votes, though in practice might well be taken to be the m-dimensional real numbers, .

2.2 Range

The range is usually either a finite, unordered set, in which case learning is called classification,1 or it is a continuum, in which case learning is called regression.2 An element is called a class in classification and a response in regression.

In Problem 1 ("Shuttle"), , the set of probabilities that an O-ring is damaged.3 In Problem 4 ("Postal Code"), {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}, the quotes indicating that these are unordered labels rather than numbers. In Problem 2 ("Ballot"), is the set of non-negative integers less than or equal to the number of registered voters in Palm Beach County, but in practice would usually be taken to be , the set of real numbers.

2.3 Data

In principle, data are random draws (X, Y) from a probability distribution on . Depending on the problem at hand, the data which are observed may either consist of domain-range pairs (X, Y) or just domain values X: learning is called supervised in the former case and unsupervised in the latter.

In supervised learning, the data are

where each (xi, yi) is drawn from a joint probability distribution on . Such data are called marked data.4 It is sometimes useful to consider the data as produced by a two-step process, in one of two ways: by drawing y from marginal distribution on and then drawing a corresponding feature vector x from conditional distribution on ; or by drawing feature vector x from marginal distribution on and then drawing a corresponding y from conditional distribution on . These two points of view correspond to the two factorizations,5

Both are useful in classification. The latter is more useful in regression, where the function f to be learned is often, though not necessarily, the expected value of the conditional distribution of Y | X, that is,

In unsupervised learning, the data are6

Such data are called unmarked data. The range is either assumed to be finite, in which case unsupervised learning is called clustering, or it is [0, 8) and the function f to be learned is the mass or density function of the marginal distribution of the features,

in which case unsupervised learning is called density estimation. In clustering problems, the size of the range, , may or may not be known: usually it is not.

Problem 1 ("Shuttle") is supervised learning, since the data are of the form

Problem 6 ("Vaults") is unsupervised learning, and is unknown (though archaeologists might have other evidence outside the data - such as artifacts recovered from the vaults or related sites - which indicates that, for example, ).

Figure 2.1 summarizes some of the terminology introduced so far and illustrates some connections between the four learning problems mentioned.

Figure 2.1 Four categories of learning problems, in bold type, split according to whether the range of f is finite (and unordered) or continuous, and whether or not elements of the range are observed directly. Some solutions to these problems are devised by transforming one problem into another: examples of this are shown in red and described in Chapters 3, 4, 6, and 9.

Sometimes both marked and unmarked data are available. This situation is called semi-supervised learning.7

2.4 Loss

What does "a good approximation" mean? This is specified by a loss function

where for each , is the loss, or penalty, incurred by approximating y with . Common examples are:

In classification, an arbitrary loss function can be specified by the cells of a C × C loss matrix, usually with all-zero diagonal elements and positive off-diagonal elements (so correct classification incurs no loss and misclassification incurs positive loss). Zero-one loss is represented by the C × C matrix with 0's on the diagonal and 1's everywhere off the diagonal.

Some techniques for solving classification problems focus on estimating the conditional probability distribution on class labels, given the feature vector,

(such techniques translate a classification problem into a regression problem, following the top-most red arrow in Figure 2.1). Let

denote an estimate of this vector. Predicted class labels, , are obtained from this estimated probability distribution in some way: for example, by predicting the most probable class. Techniques which do this essentially convert a classification problem into a regression problem, solve the regression problem, and apply the solution to classification. A loss function8 often used in such cases is

The choice of loss function is subjective and problem-dependent.9 An asymmetric loss function is sometimes appropriate: for example, in Problem 3 ("Heart"), declaring a healthy person to be sick may be viewed as a much less costly error than declaring a sick person to be healthy.10 That said, loss functions are often chosen for convenience and computational tractability.11 Exercises in Sections 3.6 and 4.6, respectively, derive squared-error loss and cross-entropy loss from the (subjective) principle that the best model in a class of models is one which assigns maximal likelihood to the observed data.

Exercise 2.1 The Kullback-Leibler information12 (also called cross-entropy) for discriminating between discrete probability distributions p and q on the set {1, ., C} is

It is sometimes interpreted as "the cost of using q in place of p when the true distribution is p," where the "cost" refers to loss of efficiency in encoding data drawn from distribution p (see Cover and Thomas (2006) for a full explanation). When a datum is class y, the true class probability vector is t = (0, ., 0, 1, 0, ., 0), where the single 1 occurs in position y.13 Show that cross-entropy loss is the same as Kullback-Leibler information:

2.5 Risk

A standard criterion for a "good" approximation of f is one which minimizes the expected loss,14 known as the generalization error or risk. Usually the expectation is with respect to the joint probability distribution on points .

The risk of approximation at point is the expected loss incurred by using in place of Y for new data (X, Y) such that X = x,

The response variable Y is treated as random, with the conditional distribution , while the input X = x and trained classifier are treated as fixed.15 Choosing an approximation to minimize the risk of at...

System requirements

Save as PDF Copy link into clipboard

Schweitzer Fachinformationen

Machine Learning

Description

More details

Other editions

Additional editions

Person

Content

2
The Problem of Learning

2.1 Domain

2.2 Range

2.3 Data

2.4 Loss

2.5 Risk

System requirements