2
The Problem of Learning
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
-John Tukey, The Future of Data Analysis, 1962
This book treats The Problem of Learning, which can be stated generally and succinctly as follows.
The Problem of Learning. There are a known set $\mathcal{X}$ and an unknown function $f$ on $\mathcal{X}$. Given data, construct a good approximation $\hat{f}$ of $f$. This is called learning $f$.
The problem of learning has been studied in many guises and in different fields, such as statistics, computer science, mathematics, and the natural and social sciences. An advantage of this situation is that the set of methods now available for addressing the problem has benefited from a tremendous diversity of perspective and knowledge. A disadvantage is a certain lack of coherence in the language used to describe the problem: there are, perhaps, more names for things than there are things which need names, and this can make the study of the problem of learning appear more complicated and confusing than it really is.
This chapter introduces the main ideas which occur in almost any applied problem in learning, along with the language commonly used to describe them. The ideas and language introduced here will be used in every other chapter of this book.
2.1 Domain
The set $\mathcal{X}$ is called feature space and an element $X \in \mathcal{X}$ is called a feature vector (or an input). The coordinates of $X$ are called features. Individual features may take values in a continuum, a discrete, ordered set, or a discrete, unordered set.
In Problem 1 ("Shuttle"), is an interval of possible air temperatures: (- 60, 140) Fahrenheit, say, or [0, 8) Kelvin. In Problem 4 ("ZIP Code"), is the set of all 8 × 8 matrices with entries in {0, 1, ., 255}. In Problem 2 ("Ballot"), is {0, 1, 2, .}m for the given number of features m, each feature being a count of people or votes, though in practice might well be taken to be the m-dimensional real numbers, .
2.2 Range
The range $\mathcal{Y}$ is usually either a finite, unordered set, in which case learning is called classification, or a continuum, in which case learning is called regression. An element $Y \in \mathcal{Y}$ is called a class in classification and a response in regression.
In Problem 1 ("Shuttle"), , the set of probabilities that an O-ring is damaged.3 In Problem 4 ("Postal Code"), {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}, the quotes indicating that these are unordered labels rather than numbers. In Problem 2 ("Ballot"), is the set of non-negative integers less than or equal to the number of registered voters in Palm Beach County, but in practice would usually be taken to be , the set of real numbers.
2.3 Data
In principle, data are random draws $(X, Y)$ from a probability distribution on $\mathcal{X} \times \mathcal{Y}$. Depending on the problem at hand, the data which are observed may consist either of domain-range pairs $(X, Y)$ or of domain values $X$ alone: learning is called supervised in the former case and unsupervised in the latter.
In supervised learning, the data are
$$(x_1, y_1),\, (x_2, y_2),\, \ldots,\, (x_n, y_n),$$
where each $(x_i, y_i)$ is drawn from a joint probability distribution $P(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$. Such data are called marked data. It is sometimes useful to consider the data as produced by a two-step process, in one of two ways: by drawing $y$ from the marginal distribution $P(Y)$ on $\mathcal{Y}$ and then drawing a corresponding feature vector $x$ from the conditional distribution $P(X \mid Y = y)$ on $\mathcal{X}$; or by drawing a feature vector $x$ from the marginal distribution $P(X)$ on $\mathcal{X}$ and then drawing a corresponding $y$ from the conditional distribution $P(Y \mid X = x)$ on $\mathcal{Y}$. These two points of view correspond to the two factorizations,
$$P(X, Y) = P(Y)\,P(X \mid Y) = P(X)\,P(Y \mid X).$$
Both are useful in classification. The latter is more useful in regression, where the function $f$ to be learned is often, though not necessarily, the expected value of the conditional distribution of $Y \mid X$, that is,
$$f(x) = \mathrm{E}[\,Y \mid X = x\,].$$
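To make the two factorizations concrete, here is a small Python sketch (not from the text: the particular distributions, class probabilities, and function $f$ are assumptions chosen for illustration) that generates marked data both ways, first by drawing a class and then a feature vector, and then by drawing a feature vector and then a response whose conditional mean is $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Classification view, P(X, Y) = P(Y) P(X | Y):
# draw a class label y, then a feature x from a class-conditional distribution.
y_cls = rng.choice([0, 1], size=n, p=[0.7, 0.3])          # y ~ P(Y)
x_cls = rng.normal(loc=np.where(y_cls == 0, -1.0, 1.0))   # x ~ P(X | Y = y)

# Regression view, P(X, Y) = P(X) P(Y | X):
# draw x, then y from a conditional distribution whose mean is f(x) = E[Y | X = x].
def f(x):
    return 2.0 * x + 1.0   # the (normally unknown) function to be learned

x_reg = rng.uniform(-1.0, 1.0, size=n)                    # x ~ P(X)
y_reg = f(x_reg) + rng.normal(0.0, 0.5, size=n)           # y ~ P(Y | X = x)

# Averaging responses whose x lies near a fixed point roughly recovers f there.
x0 = 0.25
near = np.abs(x_reg - x0) < 0.05
print(f(x0), y_reg[near].mean())
```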
In unsupervised learning, the data are
$$x_1,\, x_2,\, \ldots,\, x_n.$$
Such data are called unmarked data. The range is either assumed to be finite, in which case unsupervised learning is called clustering, or it is $[0, \infty)$ and the function $f$ to be learned is the mass or density function of the marginal distribution of the features,
$$f(x) = p_X(x),$$
in which case unsupervised learning is called density estimation. In clustering problems, the size of the range, $|\mathcal{Y}|$, may or may not be known: usually it is not.
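As a toy illustration of density estimation from unmarked data, the following sketch (the mixture distribution and the number of bins are assumptions made for the example) forms a simple histogram estimate of the marginal density of a one-dimensional feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unmarked data: feature values only, drawn here from an assumed mixture of
# two normal distributions playing the role of the unknown marginal of X.
x = np.concatenate([rng.normal(-2.0, 1.0, 700), rng.normal(3.0, 0.5, 300)])

# Histogram estimate of the density f: the fraction of data in each bin,
# divided by the bin width.
counts, edges = np.histogram(x, bins=30)
widths = np.diff(edges)
f_hat = counts / (counts.sum() * widths)   # one estimated density value per bin

print((f_hat * widths).sum())   # 1.0: the estimate integrates to one by construction
```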
Problem 1 ("Shuttle") is supervised learning, since the data are of the form
Problem 6 ("Vaults") is unsupervised learning, and is unknown (though archaeologists might have other evidence outside the data - such as artifacts recovered from the vaults or related sites - which indicates that, for example, ).
Figure 2.1 summarizes some of the terminology introduced so far and illustrates some connections between the four learning problems mentioned.
Figure 2.1 Four categories of learning problems, in bold type, split according to whether the range of f is finite (and unordered) or continuous, and whether or not elements of the range are observed directly. Some solutions to these problems are devised by transforming one problem into another: examples of this are shown in red and described in Chapters 3, 4, 6, and 9.
Sometimes both marked and unmarked data are available. This situation is called semi-supervised learning.
2.4 Loss
What does "a good approximation" mean? This is specified by a loss function
where for each , is the loss, or penalty, incurred by approximating y with . Common examples are:
In classification, an arbitrary loss function can be specified by the cells of a $C \times C$ loss matrix, where $C = |\mathcal{Y}|$ is the number of classes, usually with all-zero diagonal elements and positive off-diagonal elements (so correct classification incurs no loss and misclassification incurs positive loss). Zero-one loss is represented by the $C \times C$ matrix with 0's on the diagonal and 1's everywhere off the diagonal.
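The matrix view of a loss function is easy to make concrete. The Python sketch below (a minimal illustration; the number of classes is chosen arbitrarily) builds the zero-one loss matrix and, for comparison, evaluates squared-error loss for a regression-style prediction.

```python
import numpy as np

C = 3   # number of classes (illustrative)

# Zero-one loss as a C x C matrix: 0 on the diagonal (correct classification
# costs nothing), 1 everywhere else (every misclassification costs the same).
zero_one = np.ones((C, C)) - np.eye(C)

y_true, y_pred = 2, 0              # classes indexed 0, ..., C-1
print(zero_one[y_true, y_pred])    # 1.0: a misclassification

# Squared-error loss for regression.
def squared_error(y, y_hat):
    return (y - y_hat) ** 2

print(squared_error(1.5, 1.2))     # about 0.09
```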
Some techniques for solving classification problems focus on estimating the conditional probability distribution on class labels, given the feature vector,
$$\big(P(Y = 1 \mid X = x), \ldots, P(Y = C \mid X = x)\big)$$
(such techniques translate a classification problem into a regression problem, following the top-most red arrow in Figure 2.1). Let
$$\big(\hat{p}_1(x), \ldots, \hat{p}_C(x)\big)$$
denote an estimate of this vector. Predicted class labels, $\hat{y}$, are obtained from this estimated probability distribution in some way: for example, by predicting the most probable class. Techniques which do this essentially convert a classification problem into a regression problem, solve the regression problem, and apply the solution to classification. A loss function often used in such cases is cross-entropy loss,
$$\ell\big(y, (\hat{p}_1(x), \ldots, \hat{p}_C(x))\big) = -\log \hat{p}_y(x).$$
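A small numerical example may help (the probability estimates here are made-up values): given an estimated class-probability vector, predict the most probable class and evaluate the cross-entropy loss for different possible true classes.

```python
import numpy as np

# Estimated class-probability vector (p_hat_1(x), ..., p_hat_C(x)) for one
# feature vector x; the values are invented for illustration.
p_hat = np.array([0.1, 0.7, 0.2])

# Predict the most probable class (classes indexed 0, ..., C-1 here).
y_pred = int(np.argmax(p_hat))     # 1

# Cross-entropy loss when the true class is y.
def cross_entropy(y, p_hat):
    return -np.log(p_hat[y])

print(y_pred, cross_entropy(1, p_hat))   # small loss: the true class was judged likely
print(cross_entropy(2, p_hat))           # larger loss: the true class was judged unlikely
```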
The choice of loss function is subjective and problem-dependent. An asymmetric loss function is sometimes appropriate: for example, in Problem 3 ("Heart"), declaring a healthy person to be sick may be viewed as a much less costly error than declaring a sick person to be healthy (a loss matrix encoding such a preference is sketched after this paragraph). That said, loss functions are often chosen for convenience and computational tractability. Exercises in Sections 3.6 and 4.6, respectively, derive squared-error loss and cross-entropy loss from the (subjective) principle that the best model in a class of models is one which assigns maximal likelihood to the observed data.
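Such an asymmetric preference can be encoded as a $2 \times 2$ loss matrix; the particular costs below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Asymmetric loss matrix for a two-class problem like Problem 3 ("Heart"),
# with classes indexed 0 = healthy, 1 = sick. Rows index the true class and
# columns the predicted class. The cost of calling a sick person healthy is
# set (arbitrarily) to ten times the cost of calling a healthy person sick.
loss = np.array([[0.0,  1.0],
                 [10.0, 0.0]])

print(loss[1, 0])   # 10.0: the costly error (sick person declared healthy)
print(loss[0, 1])   #  1.0: the cheaper error (healthy person declared sick)
```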
Exercise 2.1 The Kullback-Leibler information (also called cross-entropy) for discriminating between discrete probability distributions $p$ and $q$ on the set $\{1, \ldots, C\}$ is
$$K(p, q) = \sum_{c=1}^{C} p_c \log \frac{p_c}{q_c}.$$
It is sometimes interpreted as "the cost of using $q$ in place of $p$ when the true distribution is $p$," where the "cost" refers to loss of efficiency in encoding data drawn from distribution $p$ (see Cover and Thomas (2006) for a full explanation). When a datum is class $y$, the true class probability vector is $t = (0, \ldots, 0, 1, 0, \ldots, 0)$, where the single 1 occurs in position $y$. Show that cross-entropy loss is the same as Kullback-Leibler information (using the convention $0 \log 0 = 0$):
$$\ell\big(y, (\hat{p}_1(x), \ldots, \hat{p}_C(x))\big) = K\big(t, (\hat{p}_1(x), \ldots, \hat{p}_C(x))\big).$$
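Before attempting the proof, the claimed identity can be checked numerically; in the sketch below the estimated class probabilities are made-up values.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler information K(p, q), with the convention 0 * log(0/q) = 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log(p[nonzero] / q[nonzero])))

C = 4
p_hat = np.array([0.1, 0.2, 0.6, 0.1])   # estimated class probabilities (invented)
y = 2                                    # true class, indexed 0, ..., C-1 here

t = np.zeros(C)
t[y] = 1.0                               # one-hot "true" class-probability vector

print(kl(t, p_hat))        # equals ...
print(-np.log(p_hat[y]))   # ... the cross-entropy loss, -log p_hat_y
```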
2.5 Risk
A standard criterion for a "good" approximation of $f$ is one which minimizes the expected loss, known as the generalization error or risk. Usually the expectation is with respect to the joint probability distribution on points $(X, Y) \in \mathcal{X} \times \mathcal{Y}$.
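The expected loss can be approximated by averaging the loss over a large sample from the joint distribution. In the sketch below, the joint distribution, the already-trained approximation $\hat{f}$, and the choice of squared-error loss are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_hat(x):
    # A fixed, already-trained approximation of f (assumed for the example).
    return 1.8 * x + 0.9

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

# Approximate the risk E[ loss(Y, f_hat(X)) ] by a sample average over draws
# (x, y) from an assumed joint distribution on X x Y.
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=n)   # the "true" relationship plus noise

risk_estimate = squared_error(y, f_hat(x)).mean()
print(risk_estimate)
```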
The risk of approximation $\hat{f}$ at a point $x \in \mathcal{X}$ is the expected loss incurred by using $\hat{f}(x)$ in place of $Y$ for new data $(X, Y)$ such that $X = x$,
$$\mathrm{E}\big[\,\ell(Y, \hat{f}(x)) \mid X = x\,\big].$$
The response variable $Y$ is treated as random, with the conditional distribution $P(Y \mid X = x)$, while the input $X = x$ and trained classifier $\hat{f}$ are treated as fixed. Choosing an approximation to minimize the risk of $\hat{f}$ at...