
Regression Analysis By Example Using R
Description
A STRAIGHTFORWARD AND CONCISE DISCUSSION OF THE ESSENTIALS OF REGRESSION ANALYSIS
In the newly revised sixth edition of Regression Analysis By Example Using R, distinguished statistician Dr Ali S. Hadi delivers an expanded and thoroughly updated discussion of exploratory data analysis using regression analysis in R. The book provides in-depth treatments of regression diagnostics, transformation, multicollinearity, logistic regression, and robust regression.
The author clearly demonstrates effective methods of regression analysis with examples that contain the types of data irregularities commonly encountered in the real world. This newest edition also offers a brand-new, easy-to-read chapter on the freely available statistical software package R.
Readers will also find:
* Reorganized, expanded, and upgraded exercises at the end of each chapter with an emphasis on data analysis
* Updated data sets and examples throughout the book
* Complimentary access to a companion website that provides data sets in xlsx, csv, and txt formats
Perfect for upper-level undergraduate or beginning graduate students in statistics, mathematics, biostatistics, and computer science programs, Regression Analysis By Example Using R will also benefit readers who need a reference for quick updates on regression methods and applications.

Persons
Ali S. Hadi, PhD, Fellow ASA (1997), Member ISI (1998), Fellow AAS (2019), is Distinguished University Professor and former Chair of the Department of Mathematics and Actuarial Science at the American University in Cairo (AUC). He is also the Founder of the Actuarial Science Program at AUC (2004), the Founder of the Data Science Program at AUC (2019), and the former Vice Provost and Director of Graduate Studies and Research at AUC. Dr. Hadi is also a Stephen H. Weiss Presidential Fellow and Professor Emeritus at Cornell University, USA. He is the author or co-author of four other books and numerous articles. For more information, see his Website at: www1.aucegypt.edu/faculty/hadi.
Content
Preface
1 Introduction
1.1 What Is Regression Analysis?
1.2 Publicly Available Data Sets
1.3 Selected Applications of Regression Analysis
1.3.1 Agricultural Sciences
1.3.2 Industrial and Labor Relations
1.3.3 Government
1.3.4 History
1.3.5 Environmental Sciences
1.3.6 Industrial Production
1.3.7 The Space Shuttle Challenger
1.3.8 Cost of Health Care
1.4 Steps in Regression Analysis
1.4.1 Statement of the Problem
1.4.2 Selection of Potentially Relevant Variables
1.4.3 Data Collection
1.4.4 Model Specification
1.4.5 Method of Fitting
1.4.6 Model Fitting
1.4.7 Model Criticism and Selection
1.4.8 Objectives of Regression Analysis
1.5 Scope and Organization of the Book
2 A Brief Introduction to R
2.1 What Is R and RStudio?
2.2 Installing R and RStudio
2.3 Getting Started With R
2.3.1 Command Level Prompt
2.3.2 Calculations Using R
2.3.3 Editing Your R Code
2.3.4 Best Practice: Object Names in R
2.4 Data Values and Objects in R
2.4.1 Types of Data Values in R
2.4.2 Types (Structures) of Objects in R
2.4.3 Object Attributes
2.4.4 Testing (Checking) Object Type
2.4.5 Changing Object Type
2.5 R Packages (Libraries)
2.5.1 Installing R Packages
2.5.2 Name Spaces
2.5.3 Updating R
2.5.4 Datasets in R Packages
2.6 Importing (Reading) Data into R Workspace
2.6.1 Best Practice: Working Directory
2.6.2 Reading ASCII (Text) Files
2.6.3 Reading CSV Files
2.6.4 Reading Excel Files
2.6.5 Reading Files from the Internet
2.7 Writing (Exporting) Data to Files
2.7.1 Diverting Normal R Output to a File
2.7.2 Saving Graphs in Files
2.7.3 Exporting Data to Files
2.8 Some Arithmetic and Other Operators
2.8.1 Vectors
2.8.2 Matrix Computations
2.9 Programming in R
2.9.1 Best Practice: Script Files
2.9.2 Some Useful Commands or Functions
2.9.3 Conditional Execution
2.9.4 Loops
2.9.5 Functions and Functionals
2.9.6 User Defined Functions
2.10 Bibliographic Notes
3 Simple Linear Regression
3.1 Introduction
3.2 Covariance and Correlation Coefficient
3.3 Example: Computer Repair Data
3.4 The Simple Linear Regression Model
3.5 Parameter Estimation
3.6 Tests of Hypotheses
3.7 Confidence Intervals
3.8 Predictions
3.9 Measuring the Quality of Fit
3.10 Regression Line Through the Origin
3.11 Trivial Regression Models
3.12 Bibliographic Notes
4 Multiple Linear Regression
4.1 Introduction
4.2 Description of the Data and Model
4.3 Example: Supervisor Performance Data
4.4 Parameter Estimation
4.5 Interpretations of Regression Coefficients
4.6 Centering and Scaling
4.6.1 Centering and Scaling in Intercept Models
4.6.2 Scaling in No-Intercept Models
4.7 Properties of the Least Squares Estimators
4.8 Multiple Correlation Coefficient
4.9 Inference for Individual Regression Coefficients
4.10 Tests of Hypotheses in a Linear Model
4.10.1 Testing All Regression Coefficients Equal to Zero
4.10.2 Testing a Subset of Regression Coefficients Equal to Zero
4.10.3 Testing the Equality of Regression Coefficients
4.10.4 Estimating and Testing of Regression Parameters
4.11 Predictions
4.12 Summary
5 Regression Diagnostics: Detection of Model Violations
5.1 Introduction
5.2 The Standard Regression Assumptions
5.3 Various Types of Residuals
5.4 Graphical Methods
5.5 Graphs Before Fitting a Model
5.5.1 One-Dimensional Graphs
5.5.2 Two-Dimensional Graphs
5.5.3 Rotating Plots
5.5.4 Dynamic Graphs
5.6 Graphs After Fitting a Model
5.7 Checking Linearity and Normality Assumptions
5.8 Leverage, Influence, and Outliers
5.8.1 Outliers in the Response Variable
5.8.2 Outliers in the Predictors
5.8.3 Masking and Swamping Problems
5.9 Measures of Influence
5.9.1 Cook's Distance
5.9.2 Welsch and Kuh Measure
5.9.3 Hadi's Influence Measure
5.10 The Potential-Residual Plot
5.11 Regression Diagnostics in R
5.12 What to Do with the Outliers?
5.13 Role of Variables in a Regression Equation
5.13.1 Added-Variable Plot
5.13.2 Residual Plus Component Plot
5.14 Effects of an Additional Predictor
5.15 Robust Regression
6 Qualitative Variables as Predictors
6.1 Introduction
6.2 Salary Survey Data
6.3 Interaction Variables
6.4 Systems of Regression Equations
6.4.1 Models with Different Slopes and Different Intercepts
6.4.2 Models with Same Slope and Different Intercepts
6.4.3 Models with Same Intercept and Different Slopes
6.5 Other Applications of Indicator Variables
6.6 Seasonality
6.7 Stability of Regression Parameters Over Time
7 Transformation of Variables
7.1 Introduction
7.2 Transformations to Achieve Linearity
7.3 Bacteria Deaths Due to X-Ray Radiation
7.3.1 Inadequacy of a Linear Model
7.3.2 Logarithmic Transformation for Achieving Linearity
7.4 Transformations to Stabilize Variance
7.5 Detection of Heteroscedastic Errors
7.6 Removal of Heteroscedasticity
7.7 Weighted Least Squares
7.8 Logarithmic Transformation of Data
7.9 Power Transformation
7.10 Summary
8 Weighted Least Squares
8.1 Introduction
8.2 Heteroscedastic Models
8.2.1 Supervisors Data
8.2.2 College Expense Data
8.3 Two-Stage Estimation
8.4 Education Expenditure Data
8.5 Fitting a Dose-Response Relationship Curve
9 The Problem of Correlated Errors
9.1 Introduction: Autocorrelation
9.2 Consumer Expenditure and Money Stock
9.3 Durbin-Watson Statistic
9.4 Removal of Autocorrelation by Transformation
9.5 Iterative Estimation with Autocorrelated Errors
9.6 Autocorrelation and Missing Variables
9.7 Analysis of Housing Starts
9.8 Limitations of the Durbin-Watson Statistic
9.9 Indicator Variables to Remove Seasonality
9.10 Regressing Two Time Series
10 Analysis of Collinear Data
10.1 Introduction
10.2 Effects of Collinearity on Inference
10.3 Effects of Collinearity on Forecasting
10.4 Detection of Collinearity
10.4.1 Simple Signs of Collinearity
10.4.2 Variance Inflation Factors
10.4.3 The Condition Indices
11 Working With Collinear Data
11.1 Introduction
11.2 Principal Components
11.3 Computations Using Principal Components
11.4 Imposing Constraints
11.5 Searching for Linear Functions of the β's
11.6 Biased Estimation of Regression Coefficients
11.7 Principal Components Regression
11.8 Reduction of Collinearity in the Estimation Data
11.9 Constraints on the Regression Coefficients
11.10 Principal Components Regression: A Caution
11.11 Ridge Regression
11.12 Estimation by the Ridge Method
11.13 Ridge Regression: Some Remarks
11.14 Summary
11.15 Bibliographic Notes
12 Variable Selection Procedures
12.1 Introduction
12.2 Formulation of the Problem
12.3 Consequences of Variables Deletion
12.4 Uses of Regression Equations
12.4.1 Description and Model Building
12.4.2 Estimation and Prediction
12.4.3 Control
12.5 Criteria for Evaluating Equations
12.5.1 Residual Mean Square
12.5.2 Mallows Cp
12.5.3 Information Criteria
12.6 Collinearity and Variable Selection
12.7 Evaluating All Possible Equations
12.8 Variable Selection Procedures
12.8.1 Forward Selection Procedure
12.8.2 Backward Elimination Procedure
12.8.3 Stepwise Method
12.9 General Remarks on Variable Selection Methods
12.10 A Study of Supervisor Performance
12.11 Variable Selection with Collinear Data
12.12 The Homicide Data
12.13 Variable Selection Using Ridge Regression
12.14 Selection of Variables in an Air Pollution Study
12.15 A Possible Strategy for Fitting Regression Models
12.16 Bibliographic Notes
13 Logistic Regression
13.1 Introduction
13.2 Modeling Qualitative Data
13.3 The Logit Model
13.4 Example: Estimating Probability of Bankruptcies
13.5 Logistic Regression Diagnostics
13.6 Determination of Variables to Retain
13.7 Judging the Fit of a Logistic Regression
13.8 The Multinomial Logit Model
13.8.1 Multinomial Logistic Regression
13.8.2 Example: Determining Chemical Diabetes
13.8.3 Ordinal Logistic Regression
13.8.4 Example: Determining Chemical Diabetes Revisited
13.9 Classification Problem: Another Approach
14 Further Topics
14.1 Introduction
14.2 Generalized Linear Model
14.3 Poisson Regression Model
14.4 Introduction of New Drugs
14.5 Robust Regression
14.6 Fitting a Quadratic Model
14.7 Distribution of PCB in U.S. Bays
Exercises
References
Index
CHAPTER 1
INTRODUCTION
1.1 WHAT IS REGRESSION ANALYSIS?
Regression analysis is a conceptually simple method for investigating functional relationships among variables. A real estate appraiser may wish to relate the sale price of a home to selected physical characteristics of the building and taxes (local, school, county) paid on the building. We may wish to examine whether cigarette consumption is related to various socioeconomic and demographic variables such as age, education, income, and price of cigarettes. The relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor variables. In the cigarette consumption example, the response variable is cigarette consumption (measured by the number of packs of cigarettes sold in a given state on a per capita basis during a given year) and the explanatory or predictor variables are the various socioeconomic and demographic variables. In the real estate appraisal example, the response variable is the price of a home and the explanatory or predictor variables are the characteristics of the building and taxes paid on the building.
We denote the response variable by $Y$ and the set of predictor variables by $X_1, X_2, \ldots, X_p$, where $p$ denotes the number of predictor variables. The true relationship between $Y$ and $X_1, X_2, \ldots, X_p$ can be approximated by the regression model
$$Y = f(X_1, X_2, \ldots, X_p) + \varepsilon, \qquad (1.1)$$
where $\varepsilon$ is assumed to be a random error representing the discrepancy in the approximation. It accounts for the failure of the model to fit the data exactly. The function $f(X_1, X_2, \ldots, X_p)$ describes the relationship between $Y$ and $X_1, X_2, \ldots, X_p$. An example is the linear regression model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon, \qquad (1.2)$$
where $\beta_0, \beta_1, \ldots, \beta_p$, called the regression parameters or coefficients, are unknown constants to be determined (estimated) from the data. We follow the commonly used notational convention of denoting unknown parameters by Greek letters.
The predictor or explanatory variables are also called by other names such as independent variables, covariates, regressors, factors, and carriers. The name independent variable, though commonly used, is the least preferred, because in practice the predictor variables are rarely independent of each other.
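As a rough illustration of the linear regression model with a single predictor, the following Python sketch (the book itself works in R) simulates responses as a linear function of one predictor plus a random error and then estimates the two coefficients from the simulated data. The parameter values, the sample size, the error distribution, and the function name `fit_simple_linear` are assumptions made only for this example, and least squares is used here as one common way of estimating the coefficients; the excerpt itself does not prescribe a fitting method at this point.

```python
import random


def fit_simple_linear(x, y):
    """Return least squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical "true" parameters, chosen only for illustration.
BETA0, BETA1 = 2.0, 0.5

rng = random.Random(42)  # fixed seed so the simulation is reproducible
x = [float(i) for i in range(1, 51)]
# Each response is the linear function of x plus a random error term.
y = [BETA0 + BETA1 * xi + rng.gauss(0.0, 1.0) for xi in x]

b0_hat, b1_hat = fit_simple_linear(x, y)
print(f"estimated intercept: {b0_hat:.3f}, estimated slope: {b1_hat:.3f}")
```

Because of the random error, the estimates should land near, but not exactly on, the values used to generate the data.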
1.2 PUBLICLY AVAILABLE DATA SETS
Regression analysis has numerous areas of application. A partial list would include economics, finance, business, law, meteorology, medicine, biology, chemistry, engineering, physics, education, sports, history, sociology, and psychology. A few examples of such applications are given in Section 1.3. Regression analysis is learned most effectively by analyzing data that are of direct interest to the reader. We invite readers to think about questions (in their own areas of work, research, or interest) that can be addressed using regression analysis. Readers should collect the relevant data and then apply the regression analysis techniques presented in this book to their own data. To help the reader locate real-life data, this section provides some sources and links to a wealth of data sets that are available for public use.
A number of data sets are available in books and on the Internet. The book by Hand et al. (1994) contains data sets from many fields. These data sets are small in size and are suitable for use as exercises. The book by Chatterjee et al. (1995) provides numerous data sets from diverse fields. The data are included on a diskette that comes with the book and can also be found at the Website.
Data sets are also available on the Internet at many other sites. Some of the Websites given below allow the direct copying and pasting into the statistical package of choice, while others require downloading the data file and then importing them into a statistical package. Some of these sites also contain further links to yet other data sets or statistics-related Websites.
The Data and Story Library (DASL, pronounced "dazzle") is one of the most interesting sites; it contains a number of data sets accompanied by the "story" or background associated with each data set. DASL is an online library of data files and stories that illustrate the use of basic statistical methods. The data sets cover a wide variety of topics. DASL comes with a powerful search engine to locate the story or data file of interest.
Another Website, which also contains data sets arranged by the method used in the analysis, is the Electronic Dataset Service. The site also contains many links to other data sources on the Internet.
Finally, this book has a Website, which contains, among other things, all the data sets that are included in this book and more. These and other data sets can be found at the Book's Website.
1.3 SELECTED APPLICATIONS OF REGRESSION ANALYSIS
Regression analysis is one of the most widely used statistical tools because it provides simple methods for establishing a functional relationship among variables. It has extensive applications in many subject areas. The cigarette consumption and real estate appraisal examples mentioned above are but two of them. In this section, we give a few additional examples demonstrating the wide applicability of regression analysis in real-life situations. Some of the data sets described here will be used later in the book to illustrate regression techniques or in the exercises at the end of various chapters.
1.3.1 Agricultural Sciences
The Dairy Herd Improvement Cooperative (DHI) in upstate New York collects and analyzes data on milk production. One question of interest here is how to develop a suitable model to predict current milk production from a set of measured variables. The response variable (current milk production in pounds) and the predictor variables are given in Table 1.1. Samples are taken once a month during milking. The period that a cow gives milk is called lactation. Number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation. The data set, consisting of 199 observations, was compiled from the DHI milk production records. The Milk Production data can be found at the Book's Website.
Table 1.1 Variables in Milk Production Data
Variable    Definition
Current     Current month milk production in pounds
Previous    Previous month milk production in pounds
Fat         Percent of fat in milk
Protein     Percent of protein in milk
Days        Number of days since present lactation
Lactation   Number of lactations
I79         Indicator variable (0 if Days ≤ 79 and 1 if Days > 79)
1.3.2 Industrial and Labor Relations
In 1947, the United States Congress passed the Taft-Hartley Amendments to the Wagner Act. The original Wagner Act had permitted the unions to use a Closed Shop Contract unless prohibited by state law. The Taft-Hartley Amendments made the use of the Closed Shop Contract illegal and gave individual states the right to prohibit union shops as well. These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States? To answer this question, a data set consisting of 38 geographic locations has been assembled from various sources. The variables used are defined in Table 1.2. The Right-To-Work Laws data can be found at the Book's Website.
Table 1.2 Variables in Right-To-Work Laws Data
Variable    Definition
COL         Cost of living for a four-person family
PD          Population density (persons per square mile)
URate       State unionization rate in 1978
Pop         Population in 1975
Taxes       Property taxes in 1972
Income      Per capita income in 1974
RTWL        Indicator variable (1 if there are right-to-work laws in the state and 0 otherwise)
1.3.3 Government
Information about domestic immigration (the movement of people from one state or area of a country to another) is important to state and local governments. It is of interest to build a model that predicts domestic immigration or to answer the question of why people leave one place to go to another. There are many factors that influence domestic immigration, such as weather conditions, crime, taxes, and unemployment rates. A data set for the 48 contiguous states has been created. Alaska and Hawaii are excluded from the analysis because the environments of these states are significantly different from the other 48, and their locations present certain barriers...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.