
Handbook of Regression Analysis With Applications in R
Description
Handbook of Regression Analysis with Applications in R, Second Edition is a comprehensive and up-to-date guide to conducting complex regressions in the R statistical programming language. The authors' thorough treatment of "classical" regression analysis in the first edition is complemented here by their discussion of more advanced topics including time-to-event survival data and longitudinal and clustered data.
The book further pays particular attention to methods that have become prominent in the last few decades as increasingly large data sets have made new techniques and applications possible. These include:
* Regularization methods
* Smoothing methods
* Tree-based methods
In the new edition of the Handbook, the data analyst's toolkit is explored and expanded. Examples are drawn from a wide variety of real-life applications and data sets. All the utilized R code and data are available via an author-maintained website.
Of interest to undergraduate and graduate students taking courses in statistics and regression, the Handbook of Regression Analysis will also be invaluable to practicing data scientists and statisticians.

Persons
Samprit Chatterjee, PhD, is Professor Emeritus of Statistics at New York University. A Fellow of the American Statistical Association, Dr. Chatterjee has been a Fulbright scholar in both Kazakhstan and Mongolia. He is the coauthor of multiple editions of Regression Analysis By Example, Sensitivity Analysis in Linear Regression, A Casebook for a First Course in Statistics and Data Analysis, and the first edition of Handbook of Regression Analysis, all published by Wiley.
Jeffrey S. Simonoff, PhD, is Professor of Statistics at the Leonard N. Stern School of Business of New York University. He is a Fellow of the American Statistical Association, a Fellow of the Institute of Mathematical Statistics, and an Elected Member of the International Statistical Institute. He has authored, coauthored, or coedited more than one hundred articles and seven books on the theory and applications of statistics.
Content
Preface to the Second Edition
Preface to the First Edition
Part I The Multiple Linear Regression Model
1 Multiple Linear Regression
1.1 Introduction
1.2 Concepts and Background Material
1.2.1 The Linear Regression Model
1.2.2 Estimation Using Least Squares
1.2.3 Assumptions
1.3 Methodology
1.3.1 Interpreting Regression Coefficients
1.3.2 Measuring the Strength of the Regression Relationship
1.3.3 Hypothesis Tests and Confidence Intervals for β
1.3.4 Fitted Values and Predictions
1.3.5 Checking Assumptions Using Residual Plots
1.4 Example: Estimating Home Prices
1.5 Summary
2 Model Building
2.1 Introduction
2.2 Concepts and Background Material
2.2.1 Using Hypothesis Tests to Compare Models
2.2.2 Collinearity
2.3 Methodology
2.3.1 Model Selection
2.3.2 Example: Estimating Home Prices (continued)
2.4 Indicator Variables and Modeling Interactions
2.4.1 Example: Electronic Voting and the 2004 Presidential Election
2.5 Summary
Part II Addressing Violations of Assumptions
3 Diagnostics for Unusual Observations
3.1 Introduction
3.2 Concepts and Background Material
3.3 Methodology
3.3.1 Residuals and Outliers
3.3.2 Leverage Points
3.3.3 Influential Points and Cook's Distance
3.4 Example: Estimating Home Prices (continued)
3.5 Summary
4 Transformations and Linearizable Models
4.1 Introduction
4.2 Concepts and Background Material: The Log-Log Model
4.3 Concepts and Background Material: Semilog Models
4.3.1 Logged Response Variable
4.3.2 Logged Predictor Variable
4.4 Example: Predicting Movie Grosses After One Week
4.5 Summary
5 Time Series Data and Autocorrelation
5.1 Introduction
5.2 Concepts and Background Material
5.3 Methodology: Identifying Autocorrelation
5.3.1 The Durbin-Watson Statistic
5.3.2 The Autocorrelation Function (ACF)
5.3.3 Residual Plots and the Runs Test
5.4 Methodology: Addressing Autocorrelation
5.4.1 Detrending and Deseasonalizing
5.4.2 Example: e-Commerce Retail Sales
5.4.3 Lagging and Differencing
5.4.4 Example: Stock Indexes
5.4.5 Generalized Least Squares (GLS): The Cochrane-Orcutt Procedure
5.4.6 Example: Time Intervals Between Old Faithful Geyser Eruptions
5.5 Summary
Part III Categorical Predictors
6 Analysis of Variance
6.1 Introduction
6.2 Concepts and Background Material
6.2.1 One-Way ANOVA
6.2.2 Two-Way ANOVA
6.3 Methodology
6.3.1 Codings for Categorical Predictors
6.3.2 Multiple Comparisons
6.3.3 Levene's Test and Weighted Least Squares
6.3.4 Membership in Multiple Groups
6.4 Example: DVD Sales of Movies
6.5 Higher-Way ANOVA
6.6 Summary
7 Analysis of Covariance
7.1 Introduction
7.2 Methodology
7.2.1 Constant Shift Models
7.2.2 Varying Slope Models
7.3 Example: International Grosses of Movies
7.4 Summary
Part IV Non-Gaussian Regression Models
8 Logistic Regression
8.1 Introduction
8.2 Concepts and Background Material
8.2.1 The Logit Response Function
8.2.2 Bernoulli and Binomial Random Variables
8.2.3 Prospective and Retrospective Designs
8.3 Methodology
8.3.1 Maximum Likelihood Estimation
8.3.2 Inference, Model Comparison, and Model Selection
8.3.3 Goodness-of-Fit
8.3.4 Measures of Association and Classification Accuracy
8.3.5 Diagnostics
8.4 Example: Smoking and Mortality
8.5 Example: Modeling Bankruptcy
8.6 Summary
9 Multinomial Regression
9.1 Introduction
9.2 Concepts and Background Material
9.2.1 Nominal Response Variable
9.2.2 Ordinal Response Variable
9.3 Methodology
9.3.1 Estimation
9.3.2 Inference, Model Comparisons, and Strength of Fit
9.3.3 Lack of Fit and Violations of Assumptions
9.4 Example: City Bond Ratings
9.5 Summary
10 Count Regression
10.1 Introduction
10.2 Concepts and Background Material
10.2.1 The Poisson Random Variable
10.2.2 Generalized Linear Models
10.3 Methodology
10.3.1 Estimation and Inference
10.3.2 Offsets
10.4 Overdispersion and Negative Binomial Regression
10.4.1 Quasi-likelihood
10.4.2 Negative Binomial Regression
10.5 Example: Unprovoked Shark Attacks in Florida
10.6 Other Count Regression Models
10.7 Poisson Regression and Weighted Least Squares
10.7.1 Example: International Grosses of Movies (continued)
10.8 Summary
11 Models for Time-to-Event (Survival) Data
11.1 Introduction
11.2 Concepts and Background Material
11.2.1 The Nature of Survival Data
11.2.2 Accelerated Failure Time Models
11.2.3 The Proportional Hazards Model
11.3 Methodology
11.3.1 The Kaplan-Meier Estimator and the Log-Rank Test
11.3.2 Parametric (Likelihood) Estimation
11.3.3 Semiparametric (Partial Likelihood) Estimation
11.3.4 The Buckley-James Estimator
11.4 Example: The Survival of Broadway Shows (continued)
11.5 Left-Truncated/Right-Censored Data and Time-Varying Covariates
11.5.1 Left-Truncated/Right-Censored Data
11.5.2 Example: The Survival of Broadway Shows (continued)
11.5.3 Time-Varying Covariates
11.5.4 Example: Female Heads of Government
11.6 Summary
Part V Other Regression Models
12 Nonlinear Regression
12.1 Introduction
12.2 Concepts and Background Material
12.3 Methodology
12.3.1 Nonlinear Least Squares Estimation
12.3.2 Inference for Nonlinear Regression Models
12.4 Example: Michaelis-Menten Enzyme Kinetics
12.5 Summary
13 Models for Longitudinal and Nested Data
13.1 Introduction
13.2 Concepts and Background Material
13.2.1 Nested Data and ANOVA
13.2.2 Longitudinal Data and Time Series
13.2.3 Fixed Effects Versus Random Effects
13.3 Methodology
13.3.1 The Linear Mixed Effects Model
13.3.2 The Generalized Linear Mixed Effects Model
13.3.3 Generalized Estimating Equations
13.3.4 Nonlinear Mixed Effects Models
13.4 Example: Tumor Growth in a Cancer Study
13.5 Example: Unprovoked Shark Attacks in the United States
13.6 Summary
14 Regularization Methods and Sparse Models
14.1 Introduction
14.2 Concepts and Background Material
14.2.1 The Bias-Variance Tradeoff
14.2.2 Large Numbers of Predictors and Sparsity
14.3 Methodology
14.3.1 Forward Stepwise Regression
14.3.2 Ridge Regression
14.3.3 The Lasso
14.3.4 Other Regularization Methods
14.3.5 Choosing the Regularization Parameter(s)
14.3.6 More Structured Regression Problems
14.3.7 Cautions About Regularization Methods
14.4 Example: Human Development Index
14.5 Summary
Part VI Nonparametric and Semiparametric Models
15 Smoothing and Additive Models
15.1 Introduction
15.2 Concepts and Background Material
15.2.1 The Bias-Variance Tradeoff
15.2.2 Smoothing and Local Regression
15.3 Methodology
15.3.1 Local Polynomial Regression
15.3.2 Choosing the Bandwidth
15.3.3 Smoothing Splines
15.3.4 Multiple Predictors, the Curse of Dimensionality, and Additive Models
15.4 Example: Prices of German Used Automobiles
15.5 Local and Penalized Likelihood Regression
15.5.1 Example: The Bechdel Rule and Hollywood Movies
15.6 Using Smoothing to Identify Interactions
15.6.1 Example: Estimating Home Prices (continued)
15.7 Summary
16 Tree-Based Models
16.1 Introduction
16.2 Concepts and Background Material
16.2.1 Recursive Partitioning
16.2.2 Types of Trees
16.3 Methodology
16.3.1 CART
16.3.2 Conditional Inference Trees
16.3.3 Ensemble Methods
16.4 Examples
16.4.1 Estimating Home Prices (continued)
16.4.2 Example: Courtesy in Airplane Travel
16.5 Trees for Other Types of Data
16.5.1 Trees for Nested and Longitudinal Data
16.5.2 Survival Trees
16.6 Summary
Bibliography
Index
CHAPTER ONE
Multiple Linear Regression
- 1.1 Introduction
- 1.2 Concepts and Background Material
- 1.3 Methodology
- 1.4 Example: Estimating Home Prices
- 1.5 Summary
1.1 Introduction
This is a book about regression modeling, but when we refer to regression models, what do we mean? The regression framework can be characterized in the following way:
- We have one particular variable that we are interested in understanding or modeling, such as sales of a particular product, sale price of a home, or voting preference of a particular voter. This variable is called the target, response, or dependent variable, and is usually represented by $y$.
- We have a set of other variables that we think might be useful in predicting or modeling the target variable (the price of the product, the competitor's price, and so on; or the lot size, number of bedrooms, number of bathrooms of the home, and so on; or the gender, age, income, party membership of the voter, and so on). These are called the predicting, or independent variables, and are usually represented by $x_1$, $x_2$, etc.
Typically, a regression analysis is used for one (or more) of three purposes:
- modeling the relationship between $x$ and $y$;
- prediction of the target variable (forecasting);
- and testing of hypotheses.
In this chapter, we introduce the basic multiple linear regression model, and discuss how this model can be used for these three purposes. Specifically, we discuss the interpretations of the estimates of different regression parameters, the assumptions underlying the model, measures of the strength of the relationship between the target and predictor variables, the construction of tests of hypotheses and intervals related to regression parameters, and the checking of assumptions using diagnostic plots.
1.2 Concepts and Background Material
1.2.1 THE LINEAR REGRESSION MODEL
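The data consist of $n$ observations, which are sets of observed values $\{x_{1i}, x_{2i}, \ldots, x_{pi}, y_i\}$ that represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship,
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \tag{1.1}$$
where the $\beta$ coefficients are unknown parameters, and the $\varepsilon_i$ are random error terms. By a linear model, it is meant that the model is linear in the parameters; a quadratic model,
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i,$$
paradoxically enough, is a linear model, since $x$ and $x^2$ are just versions of $x_1$ and $x_2$.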
It is important to recognize that this, or any statistical model, is not viewed as a true representation of reality; rather, the goal is that the model be a useful representation of reality. A model can be used to explore the relationships between variables and make accurate forecasts based on those relationships even if it is not the "truth." Further, any statistical model is only temporary, representing a provisional version of views about the random process being studied. Models can, and should, change, based on analysis using the current model, selection among several candidate models, the acquisition of new data, new understanding of the underlying random process, and so on. Further, it is often the case that there are several different models that are reasonable representations of reality. Having said this, we will sometimes refer to the "true" model, but this should be understood as referring to the underlying form of the currently hypothesized representation of the regression relationship.
FIGURE 1.1: The simple linear regression model. The solid line corresponds to the true regression line, and the dotted lines correspond to the random errors $\varepsilon_i$.
The special case of (1.1) with $p = 1$ corresponds to the simple regression model, and is consistent with the representation in Figure 1.1. The solid line is the true regression line, the expected value of $y$ given the value of $x$. The dotted lines are the random errors $\varepsilon_i$ that account for the lack of a perfect association between the predictor and the target variables.
1.2.2 ESTIMATION USING LEAST SQUARES
The true regression function represents the expected relationship between the target and the predictor variables, which is unknown. A primary goal of a regression analysis is to estimate this relationship, or equivalently, to estimate the unknown parameters $\boldsymbol{\beta}$. This requires a data-based rule, or criterion, that will give a reasonable estimate. The standard approach is least squares regression, where the estimates are chosen to minimize
$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}) \right]^2. \tag{1.2}$$
Figure 1.2 gives a graphical representation of least squares that is based on Figure 1.1. Now the true regression line is represented by the gray line, and the solid black line is the estimated regression line, designed to estimate the (unknown) gray line as closely as possible. For any choice of estimated parameters $\hat{\boldsymbol{\beta}}$, the estimated expected response value given the observed predictor values equals
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi},$$
and is called the fitted value. The difference between the observed value $y_i$ and the fitted value $\hat{y}_i$ is called the residual, the set of which is represented by the signed lengths of the dotted lines in Figure 1.2. The least squares regression line minimizes the sum of squares of the lengths of the dotted lines; that is, the ordinary least squares (OLS) estimates minimize the sum of squares of the residuals.
FIGURE 1.2: Least squares estimation for the simple linear regression model, using the same data as in Figure 1.1. The gray line corresponds to the true regression line, the solid black line corresponds to the fitted least squares line (designed to estimate the gray line), and the lengths of the dotted lines correspond to the residuals. The sum of squared values of the lengths of the dotted lines is minimized by the solid black line.
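The least squares criterion for a single predictor can be sketched in a few lines of code. The example below is only an illustration and is not taken from the book, which works in R; Python is used here, and the function names `fit_simple_ols` and `residual_sum_of_squares` as well as the five data points are made up for the purpose. It uses the standard closed-form solution for the slope and intercept when there is one predictor.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
fitted = [b0 + b1 * xi for xi in x]                   # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]    # observed minus fitted
rss = residual_sum_of_squares(x, y, b0, b1)
```

Any other line, for example one with a slightly different slope, gives a larger value of `residual_sum_of_squares` on the same data, which is what "least squares" means. With an intercept in the model, the residuals also sum to zero up to rounding error.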
In higher dimensions ($p > 1$), the true and estimated regression relationships correspond to planes ($p = 2$) or hyperplanes ($p > 2$), but otherwise the principles are the same. Figure 1.3 illustrates the case with two predictors. The length of each vertical line corresponds to a residual (solid lines refer to positive residuals, while dashed lines refer to negative residuals), and the (least squares) plane that goes through the observations is chosen to minimize the sum of squares of the residuals.
FIGURE 1.3: Least squares estimation for the multiple linear regression model with two predictors. The plane corresponds to the fitted least squares relationship, and the lengths of the vertical lines correspond to the residuals. The sum of squared values of the lengths of the vertical lines is minimized by the plane.
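The linear regression model can be written compactly using matrix notation. Define the following matrix and vectors:
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The regression model (1.1) is then
$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{1.3}$$
The normal equations [which determine the minimizer of (1.2)] can be shown (using multivariate calculus) to be
$$(X'X)\hat{\boldsymbol{\beta}} = X'\mathbf{y},$$
which implies that the least squares estimates satisfy
$$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}. \tag{1.4}$$
The fitted values are then
$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = X(X'X)^{-1}X'\mathbf{y} \equiv H\mathbf{y}, \tag{1.5}$$
where $H = X(X'X)^{-1}X'$ is the so-called "hat" matrix (since it takes $\mathbf{y}$ to $\hat{\mathbf{y}}$). The residuals $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ thus satisfy
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - X(X'X)^{-1}X'\mathbf{y}, \tag{1.6}$$
or
$$\mathbf{e} = (I - H)\mathbf{y}.$$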
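The matrix form of least squares can also be sketched directly: build the matrix of predictors with a leading column of ones, form the normal equations, and solve them. The example below is an illustrative sketch and is not from the book, which uses R; it is written in plain Python with no external libraries, and the helper names (`transpose`, `matmul`, `solve`, `ols`) and the data are assumptions made for the example. It fits the quadratic model from Section 1.2.1, which shows concretely why that model counts as linear: the columns are simply 1, x, and x squared.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a k x k linear system a @ beta = b by Gaussian elimination
    with partial pivoting; b is a plain list of length k."""
    k = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(k):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - s) / m[r][r]
    return beta


def ols(X, y):
    """Least squares estimates obtained by solving (X'X) beta = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(u * v for u, v in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]

# Design matrix: a column of ones, then x, then x^2
X = [[1.0, xi, xi ** 2] for xi in x]

beta_hat = ols(X, y)
fitted = [sum(b * v for b, v in zip(beta_hat, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

Because the data here contain no error term, the estimates recover the generating coefficients up to rounding. Solving the normal equations directly is done for clarity; statistical software typically relies on more numerically stable decompositions rather than forming the inverse of X'X explicitly.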
1.2.3 ASSUMPTIONS
The least squares criterion will not necessarily yield sensible results unless certain assumptions hold. One is given in (1.1) - the linear model should be appropriate. In addition, the following assumptions are needed to justify using least squares regression.
- The expected value of the errors is zero ($E(\varepsilon_i) = 0$ for all $i$). That is, it cannot be true that for certain observations the model is systematically too low, while for others it is systematically too high. A violation of this assumption will lead to difficulties in estimating $\beta_0$. More importantly, this reflects that the model does not include a necessary systematic component, which has instead been absorbed into the error terms.
- The variance of the errors is constant ($V(\varepsilon_i) = \sigma^2$ for all $i$). That is, it cannot be true that the strength of the model is greater for some parts of the population (smaller $\sigma$) and less for other parts (larger $\sigma$). This assumption of constant variance is called homoscedasticity, and its violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and better estimates are available. More importantly, it also results in poorly calibrated confidence and (especially) prediction intervals.
- The errors are uncorrelated with each other. That is, it cannot be true that knowing that the...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.