Applied Logistic Regression

Name: Applied Logistic Regression
Brand: Wiley
Price: 133.99 EUR
Availability: OnlineOnly

David W. Hosmer, Stanley Lemeshow, Rodney X. Sturdivant (Authors)

Wiley (Publisher)

3rd Edition

Published on 26 February 2013

528 pages

E-Book

ePUB with Adobe-DRM

978-1-118-54835-6 (ISBN)

€133.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface to the Third Edition

1 Introduction to the Logistic Regression Model

1.1 Introduction

1.2 Fitting the Logistic Regression Model

1.3 Testing for the Significance of the Coefficients

1.4 Confidence Interval Estimation

1.5 Other Estimation Methods

1.6 Data Sets Used in Examples and Exercises

1.6.1 The ICU Study

1.6.2 The Low Birth Weight Study

1.6.3 The Global Longitudinal Study of Osteoporosis in Women

1.6.4 The Adolescent Placement Study

1.6.5 The Burn Injury Study

1.6.6 The Myopia Study

1.6.7 The NHANES Study

1.6.8 The Polypharmacy Study

Exercises

2 The Multiple Logistic Regression Model

2.1 Introduction

2.2 The Multiple Logistic Regression Model

2.3 Fitting the Multiple Logistic Regression Model

2.4 Testing for the Significance of the Model

2.5 Confidence Interval Estimation

2.6 Other Estimation Methods

Exercises

3 Interpretation of the Fitted Logistic Regression Model

3.1 Introduction

3.2 Dichotomous Independent Variable

3.3 Polychotomous Independent Variable

3.4 Continuous Independent Variable

3.5 Multivariable Models

3.6 Presentation and Interpretation of the Fitted Values

3.7 A Comparison of Logistic Regression and Stratified Analysis for 2 × 2 Tables

Exercises

4 Model-Building Strategies and Methods for Logistic Regression

4.1 Introduction

4.2 Purposeful Selection of Covariates

4.2.1 Methods to Examine the Scale of a Continuous Covariate in the Logit

4.2.2 Examples of Purposeful Selection

4.3 Other Methods for Selecting Covariates

4.3.1 Stepwise Selection of Covariates

4.3.2 Best Subsets Logistic Regression

4.3.3 Selecting Covariates and Checking their Scale Using Multivariable Fractional Polynomials

4.4 Numerical Problems

Exercises

5 Assessing the Fit of the Model

5.1 Introduction

5.2 Summary Measures of Goodness of Fit

5.2.1 Pearson Chi-Square Statistic, Deviance, and Sum-of-Squares

5.2.2 The Hosmer-Lemeshow Tests

5.2.3 Classification Tables

5.2.4 Area Under the Receiver Operating Characteristic Curve

5.2.5 Other Summary Measures

5.3 Logistic Regression Diagnostics

5.4 Assessment of Fit via External Validation

5.5 Interpretation and Presentation of the Results from a Fitted Logistic Regression Model

Exercises

6 Application of Logistic Regression with Different Sampling Models

6.1 Introduction

6.2 Cohort Studies

6.3 Case-Control Studies

6.4 Fitting Logistic Regression Models to Data from Complex Sample Surveys

Exercises

7 Logistic Regression for Matched Case-Control Studies

7.1 Introduction

7.2 Methods for Assessment of Fit in a 1-M Matched Study

7.3 An Example Using the Logistic Regression Model in a 1-1 Matched Study

7.4 An Example Using the Logistic Regression Model in a 1-M Matched Study

Exercises

8 Logistic Regression Models for Multinomial and Ordinal Outcomes

8.1 The Multinomial Logistic Regression Model

8.1.1 Introduction to the Model and Estimation of Model Parameters

8.1.2 Interpreting and Assessing the Significance of the Estimated Coefficients

8.1.3 Model-Building Strategies for Multinomial Logistic Regression

8.1.4 Assessment of Fit and Diagnostic Statistics for the Multinomial Logistic Regression Model

8.2 Ordinal Logistic Regression Models

8.2.1 Introduction to the Models, Methods for Fitting, and Interpretation of Model Parameters

8.2.2 Model Building Strategies for Ordinal Logistic Regression Models

Exercises

9 Logistic Regression Models for the Analysis of Correlated Data

9.1 Introduction

9.2 Logistic Regression Models for the Analysis of Correlated Data

9.3 Estimation Methods for Correlated Data Logistic Regression Models

9.4 Interpretation of Coefficients from Logistic Regression Models for the Analysis of Correlated Data

9.4.1 Population Average Model

9.4.2 Cluster-Specific Model

9.4.3 Alternative Estimation Methods for the Cluster-Specific Model

9.4.4 Comparison of Population Average and Cluster-Specific Model

9.5 An Example of Logistic Regression Modeling with Correlated Data

9.5.1 Choice of Model for Correlated Data Analysis

9.5.2 Population Average Model

9.5.3 Cluster-Specific Model

9.5.4 Additional Points to Consider when Fitting Logistic Regression Models to Correlated Data

9.6 Assessment of Model Fit

9.6.1 Assessment of Population Average Model Fit

9.6.2 Assessment of Cluster-Specific Model Fit

9.6.3 Conclusions

Exercises

10 Special Topics

10.1 Introduction

10.2 Application of Propensity Score Methods in Logistic Regression Modeling

10.3 Exact Methods for Logistic Regression Models

10.4 Missing Data

10.5 Sample Size Issues when Fitting Logistic Regression Models

10.6 Bayesian Methods for Logistic Regression

10.6.1 The Bayesian Logistic Regression Model

10.6.2 MCMC Simulation

10.6.3 An Example of a Bayesian Analysis and Its Interpretation

10.7 Other Link Functions for Binary Regression Models

10.8 Mediation

10.8.1 Distinguishing Mediators from Confounders

10.8.2 Implications for the Interpretation of an Adjusted Logistic Regression Coefficient

10.8.3 Why Adjust for a Mediator?

10.8.4 Using Logistic Regression to Assess Mediation: Assumptions

10.9 More About Statistical Interaction

10.9.1 Additive versus Multiplicative Scale-Risk Difference versus Odds Ratios

10.9.2 Estimating and Testing Additive Interaction

Exercises

References

Index

Chapter 1: Introduction to the Logistic Regression Model

1.1 Introduction

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. Quite often the outcome variable is discrete, taking on two or more possible values. The logistic regression model is the most frequently used regression model for the analysis of these data.

Before beginning a thorough study of the logistic regression model it is important to understand that the goal of an analysis using this model is the same as that of any other regression model used in statistics, that is, to find the best fitting and most parsimonious, clinically interpretable model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables. The independent variables are often called covariates. The most common example of modeling, and one assumed to be familiar to the readers of this text, is the usual linear regression model where the outcome variable is assumed to be continuous.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. This difference between logistic and linear regression is reflected both in the form of the model and its assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow, more or less, the same general principles used in linear regression. Thus, the techniques used in linear regression analysis motivate our approach to logistic regression. We illustrate both the similarities and differences between logistic regression and linear regression with an example.

Example 1: Table 1.1 lists the age in years (AGE), and presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects in a hypothetical study of risk factors for heart disease. The table also contains an identifier variable (ID) and an age group variable (AGEGRP). The outcome variable is CHD, which is coded with a value of "0" to indicate that CHD is absent, or "1" to indicate that it is present in the individual. In general, any two values could be used, but we have found it most convenient to use zero and one. We refer to this data set as the CHDAGE data.

Table 1.1 Age, Age Group, and Coronary Heart Disease (CHD) Status of 100 Subjects

It is of interest to explore the relationship between AGE and the presence or absence of CHD in this group. Had our outcome variable been continuous rather than binary, we probably would begin by forming a scatterplot of the outcome versus the independent variable. We would use this scatterplot to provide an impression of the nature and strength of any relationship between the outcome and the independent variable. A scatterplot of the data in Table 1.1 is given in Figure 1.1.

Figure 1.1 Scatterplot of presence or absence of coronary heart disease (CHD) by AGE for 100 subjects.

In this scatterplot, all points fall on one of two parallel lines representing the absence of CHD ($y = 0$) or the presence of CHD ($y = 1$). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and AGE.

The main problem with Figure 1.1 is that the variability in CHD at all ages is large. This makes it difficult to see any functional relationship between AGE and CHD. One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. We use this strategy by grouping age into the categories (AGEGRP) defined in Table 1.1. Table 1.2 contains, for each age group, the frequency of occurrence of each outcome, as well as the percent with CHD present.
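The grouping-and-averaging step described above can be sketched in a few lines of Python. This is only an illustration: the ages, outcomes, and interval boundaries below are hypothetical and are not the CHDAGE data or the AGEGRP categories of Table 1.1, and the helper name `grouped_means` is invented for this sketch.

```python
from collections import defaultdict

def grouped_means(ages, outcomes, bin_edges):
    """Return {(lo, hi): (n, n_present, mean)} for half-open bins [lo, hi)."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [n, number with outcome = 1]
    for age, y in zip(ages, outcomes):
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= age < hi:
                counts[(lo, hi)][0] += 1
                counts[(lo, hi)][1] += y
                break
    # The mean of a 0/1 outcome within a bin is the proportion with outcome = 1.
    return {b: (n, k, k / n) for b, (n, k) in sorted(counts.items())}

# Hypothetical illustrative data, NOT the CHDAGE data from Table 1.1.
ages = [22, 25, 28, 33, 36, 38, 41, 44, 47, 52, 55, 58, 61, 64, 67]
chd = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
edges = [20, 30, 40, 50, 60, 70]  # assumed interval boundaries

table = grouped_means(ages, chd, edges)
for (lo, hi), (n, k, mean) in table.items():
    # Integer ages are assumed, so [lo, hi) is printed as lo-(hi - 1).
    print(f"{lo}-{hi - 1}: n={n}, CHD present={k}, mean={mean:.2f}")
```

Because the outcome is coded 0/1, the within-group mean is simply the proportion of subjects with CHD present, which is the quantity a frequency table of this kind reports as a percentage.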

Table 1.2 Frequency Table of Age Group by CHD

By examining this table, a clearer picture of the relationship begins to emerge. It shows that as age increases, the proportion (mean) of individuals with evidence of CHD increases. Figure 1.2 presents a plot of the percent of individuals with CHD versus the midpoint of each age interval. This plot provides considerable insight into the relationship between CHD and AGE in this study, but the functional form for this relationship needs to be described. The plot in this figure is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We note two important differences.

Figure 1.2 Plot of the percentage of subjects with CHD in each AGE group.

The first difference concerns the nature of the relationship between the outcome and independent variables. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and is expressed as "$E(Y \mid x)$" where $Y$ denotes the outcome variable and $x$ denotes a specific value of the independent variable. The quantity $E(Y \mid x)$ is read "the expected value of $Y$, given the value $x$". In linear regression we assume that this mean may be expressed as an equation linear in $x$ (or some transformation of $x$ or $Y$), such as

$$E(Y \mid x) = \beta_0 + \beta_1 x.$$

This expression implies that it is possible for $E(Y \mid x)$ to take on any value as $x$ ranges between $-\infty$ and $+\infty$.

The column labeled "Mean" in Table 1.2 provides an estimate of $E(Y \mid x)$. We assume, for purposes of exposition, that the estimated values plotted in Figure 1.2 are close enough to the true values of $E(Y \mid x)$ to provide a reasonable assessment of the functional relationship between CHD and AGE. With a dichotomous outcome variable, the conditional mean must be greater than or equal to zero and less than or equal to one (i.e., $0 \le E(Y \mid x) \le 1$). This can be seen in Figure 1.2. In addition, the plot shows that this mean approaches zero and one "gradually". The change in $E(Y \mid x)$ per unit change in $x$ becomes progressively smaller as the conditional mean gets closer to zero or one. The curve is said to be S-shaped and resembles a plot of the cumulative distribution of a continuous random variable. Thus, it should not seem surprising that some well-known cumulative distributions have been used to provide a model for $E(Y \mid x)$ in the case when $Y$ is dichotomous. The model we use is based on the logistic distribution.

Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable. Cox and Snell (1989) discuss some of these. There are two primary reasons for choosing the logistic distribution. First, from a mathematical point of view, it is an extremely flexible and easily used function. Second, its model parameters provide the basis for clinically meaningful estimates of effect. A detailed discussion of the interpretation of the model parameters is given in Chapter 3.

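In order to simplify notation, we use the quantity $\pi(x) = E(Y \mid x)$ to represent the conditional mean of $Y$ given $x$ when the logistic distribution is used. The specific form of the logistic regression model we use is:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{1.1}$$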
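As a rough numerical illustration of the logistic form $\pi(x) = e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x})$, the sketch below evaluates the conditional mean and its slope at a few values of $x$. The coefficients are hypothetical values chosen only for illustration, not estimates from the CHDAGE data, and the function names are invented for this sketch. The slope uses the standard derivative of the logistic function, $\beta_1 \pi(x)[1 - \pi(x)]$.

```python
import math

def logistic_mean(x, beta0, beta1):
    """pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), evaluated without overflow."""
    z = beta0 + beta1 * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_slope(x, beta0, beta1):
    """Derivative of pi(x) with respect to x: beta1 * pi(x) * (1 - pi(x))."""
    p = logistic_mean(x, beta0, beta1)
    return beta1 * p * (1.0 - p)

# Hypothetical coefficients for illustration only (not fitted values).
beta0, beta1 = -5.0, 0.1

for x in (20, 40, 50, 60, 80):
    p = logistic_mean(x, beta0, beta1)
    s = logistic_slope(x, beta0, beta1)
    print(f"x={x}: pi(x)={p:.4f}, slope={s:.5f}")
```

With these assumed coefficients the slope is largest where $\pi(x) = 0.5$ and shrinks toward the tails, which matches the S-shaped, gradually flattening curve described in the text.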

A transformation of $\pi(x)$ that is central to our study of logistic regression is the logit transformation. This transformation is defined, in terms of $\pi(x)$, as:

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x.$$

The importance of this transformation is that $g(x)$ has many of the desirable properties of a linear regression model. The logit, $g(x)$, is linear in its parameters, may be continuous, and may range from $-\infty$ to $+\infty$, depending on the range of $x$.
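The linearity of the logit follows in a few steps from the logistic form of the conditional mean. The derivation below is a short worked sketch using the standard single-covariate notation, with $\pi(x)$ for the conditional mean and $g(x)$ for the logit:

```latex
\begin{align*}
\pi(x) &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
1 - \pi(x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}}, \\[4pt]
\frac{\pi(x)}{1 - \pi(x)} &= e^{\beta_0 + \beta_1 x}, \\[4pt]
g(x) &= \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x .
\end{align*}
```

The ratio in the second line is the odds of the outcome, so the logit is the log-odds, and it is this quantity, rather than $\pi(x)$ itself, that is linear in the parameters.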

The second important difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as $y = E(Y \mid x) + \varepsilon$. The quantity $\varepsilon$ is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that $\varepsilon$ follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given $x$ is normal with mean $E(Y \mid x)$, and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation, we may express the value of the outcome variable given $x$ as $y = \pi(x) + \varepsilon$. Here the quantity $\varepsilon$ may assume one of two possible values. If $y = 1$ then $\varepsilon = 1 - \pi(x)$ with probability $\pi(x)$, and if $y = 0$ then $\varepsilon = -\pi(x)$ with probability $1 - \pi(x)$. Thus, $\varepsilon$ has a distribution with mean zero and variance equal to $\pi(x)[1 - \pi(x)]$. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, $\pi(x)$.
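The stated mean and variance of the error can be checked directly from its two possible values. The short calculation below is a worked sketch, writing $\pi$ for $\pi(x)$ to keep the algebra compact:

```latex
\begin{align*}
E(\varepsilon \mid x) &= (1 - \pi)\,\pi + (-\pi)\,(1 - \pi) = 0, \\[4pt]
\operatorname{Var}(\varepsilon \mid x) &= (1 - \pi)^2\,\pi + (-\pi)^2\,(1 - \pi) \\
&= \pi(1 - \pi)\,\bigl[(1 - \pi) + \pi\bigr] \\
&= \pi(1 - \pi).
\end{align*}
```

Unlike the linear regression case, this variance is not constant: it depends on $x$ through $\pi(x)$ and is largest when $\pi(x) = 0.5$.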

In summary, we have shown that in a regression analysis when the outcome variable is dichotomous:

1. The model for the conditional mean of the regression equation must be bounded between zero and one. The logistic regression model, $\pi(x)$, given in equation 1.1, satisfies this constraint.

2. The binomial, not the normal, distribution describes the distribution of the errors and is the statistical distribution on which the analysis is based.

3. The principles that guide an analysis using linear regression also guide us in logistic...
