
Statistical Analysis with Missing Data
Description
The topic of missing data has gained considerable attention in recent decades. This new edition by two acknowledged experts on the subject offers an up-to-date account of practical methodology for handling missing data problems. Blending theory and application, authors Roderick Little and Donald Rubin review historical approaches to the subject and describe simple methods for multivariate analysis with missing values. They then provide a coherent theory for analysis of problems based on likelihoods derived from statistical models for the data and the missing data mechanism, and then they apply the theory to a wide range of important missing data problems.
Statistical Analysis with Missing Data, Third Edition starts by introducing readers to the subject and approaches toward solving it. It looks at the patterns and mechanisms that create the missing data, as well as a taxonomy of missing data. It then goes on to examine missing data in experiments, before discussing complete-case and available-case analysis, including weighting methods. The new edition expands its coverage to include recent work on topics such as nonresponse in sample surveys, causal inference, diagnostic methods, and sensitivity analysis, among a host of other topics.
* An updated "classic" written by renowned authorities on the subject
* Features over 150 exercises (including many new ones)
* Covers recent work on important methods like multiple imputation, robust alternatives to weighting, and Bayesian methods
* Revises previous topics based on past student feedback and class experience
* Contains an updated and expanded bibliography
The authors were awarded The Karl Pearson Prize in 2017 by the International Statistical Institute, for a research contribution that has had profound influence on statistical theory, methodology or applications. Their work "has been no less than defining and transforming." (ISI)
Statistical Analysis with Missing Data, Third Edition is an ideal textbook for upper undergraduate and/or beginning graduate level students of the subject. It is also an excellent source of information for applied statisticians and practitioners in government and industry.

Persons
Roderick J. A. Little, PhD, is Richard D. Remington Distinguished University Professor of Biostatistics, Professor of Statistics, and Research Professor, Institute for Social Research, at the University of Michigan.
Donald B. Rubin, PhD, is Professor, Yau Mathematical Sciences Center, Tsinghua University; Murray Shusterman Senior Research Fellow, Department of Statistical Science, Fox School of Business at Temple University; and Professor Emeritus, Harvard University.
Content
Preface to the Third Edition
Part I Overview and Basic Approaches
1 Introduction
1.1 The Problem of Missing Data
1.2 Missingness Patterns and Mechanisms
1.3 Mechanisms That Lead to Missing Data
1.4 A Taxonomy of Missing Data Methods
2 Missing Data in Experiments
2.1 Introduction
2.2 The Exact Least Squares Solution with Complete Data
2.3 The Correct Least Squares Analysis with Missing Data
2.4 Filling in Least Squares Estimates
2.4.1 Yates's Method
2.4.2 Using a Formula for the Missing Values
2.4.3 Iterating to Find the Missing Values
2.4.4 ANCOVA with Missing Value Covariates
2.5 Bartlett's ANCOVA Method
2.5.1 Useful Properties of Bartlett's Method
2.5.2 Notation
2.5.3 The ANCOVA Estimates of Parameters and Missing Y-Values
2.5.4 ANCOVA Estimates of the Residual Sums of Squares and the Covariance Matrix of β̂
2.6 Least Squares Estimates of Missing Values by ANCOVA Using Only Complete-Data Methods
2.7 Correct Least Squares Estimates of Standard Errors and One Degree of Freedom Sums of Squares
2.8 Correct Least-Squares Sums of Squares with More Than One Degree of Freedom
3 Complete-Case and Available-Case Analysis, Including Weighting Methods
3.1 Introduction
3.2 Complete-Case Analysis
3.3 Weighted Complete-Case Analysis
3.3.1 Weighting Adjustments
3.3.2 Poststratification and Raking to Known Margins
3.3.3 Inference from Weighted Data
3.3.4 Summary of Weighting Methods
3.4 Available-Case Analysis
4 Single Imputation Methods
4.1 Introduction
4.2 Imputing Means from a Predictive Distribution
4.2.1 Unconditional Mean Imputation
4.2.2 Conditional Mean Imputation
4.3 Imputing Draws from a Predictive Distribution
4.3.1 Draws Based on Explicit Models
4.3.2 Draws Based on Implicit Models - Hot Deck Methods
4.4 Conclusion
5 Accounting for Uncertainty from Missing Data
5.1 Introduction
5.2 Imputation Methods that Provide Valid Standard Errors from a Single Filled-in Data Set
5.3 Standard Errors for Imputed Data by Resampling
5.3.1 Bootstrap Standard Errors
5.3.2 Jackknife Standard Errors
5.4 Introduction to Multiple Imputation
5.5 Comparison of Resampling Methods and Multiple Imputation
Part II Likelihood-Based Approaches to the Analysis of Data with Missing Values
6 Theory of Inference Based on the Likelihood Function
6.1 Review of Likelihood-Based Estimation for Complete Data
6.1.1 Maximum Likelihood Estimation
6.1.2 Inference Based on the Likelihood
6.1.3 Large Sample Maximum Likelihood and Bayes Inference
6.1.4 Bayes Inference Based on the Full Posterior Distribution
6.1.5 Simulating Posterior Distributions
6.2 Likelihood-Based Inference with Incomplete Data
6.3 A Generally Flawed Alternative to Maximum Likelihood: Maximizing over the Parameters and the Missing Data
6.3.1 The Method
6.3.2 Background
6.3.3 Examples
6.4 Likelihood Theory for Coarsened Data
7 Factored Likelihood Methods When the Missingness Mechanism Is Ignorable
7.1 Introduction
7.2 Bivariate Normal Data with One Variable Subject to Missingness: ML Estimation
7.2.1 ML Estimates
7.2.2 Large-Sample Covariance Matrix
7.3 Bivariate Normal Monotone Data: Small-Sample Inference
7.4 Monotone Missingness with More Than Two Variables
7.4.1 Multivariate Data with One Normal Variable Subject to Missingness
7.4.2 The Factored Likelihood for a General Monotone Pattern
7.4.3 ML Computation for Monotone Normal Data via the Sweep Operator
7.4.4 Bayes Computation for Monotone Normal Data via the Sweep Operator
7.5 Factored Likelihoods for Special Nonmonotone Patterns
8 Maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse
8.1 Alternative Computational Strategies
8.2 Introduction to the EM Algorithm
8.3 The E Step and The M Step of EM
8.4 Theory of the EM Algorithm
8.4.1 Convergence Properties of EM
8.4.2 EM for Exponential Families
8.4.3 Rate of Convergence of EM
8.5 Extensions of EM
8.5.1 The ECM Algorithm
8.5.2 The ECME and AECM Algorithms
8.5.3 The PX-EM Algorithm
8.6 Hybrid Maximization Methods
9 Large-Sample Inference Based on Maximum Likelihood Estimates
9.1 Standard Errors Based on The Information Matrix
9.2 Standard Errors via Other Methods
9.2.1 The Supplemented EM Algorithm
9.2.2 Bootstrapping the Observed Data
9.2.3 Other Large-Sample Methods
9.2.4 Posterior Standard Errors from Bayesian Methods
10 Bayes and Multiple Imputation
10.1 Bayesian Iterative Simulation Methods
10.1.1 Data Augmentation
10.1.2 The Gibbs' Sampler
10.1.3 Assessing Convergence of Iterative Simulations
10.1.4 Some Other Simulation Methods
10.2 Multiple Imputation
10.2.1 Large-Sample Bayesian Approximations of the Posterior Mean and Variance Based on a Small Number of Draws
10.2.2 Approximations Using Test Statistics or p-Values
10.2.3 Other Methods for Creating Multiple Imputations
10.2.4 Chained-Equation Multiple Imputation
10.2.5 Using Different Models for Imputation and Analysis
Part III Likelihood-Based Approaches to the Analysis of Incomplete Data: Some Examples
11 Multivariate Normal Examples, Ignoring the Missingness Mechanism
11.1 Introduction
11.2 Inference for a Mean Vector and Covariance Matrix with Missing Data Under Normality
11.2.1 The EM Algorithm for Incomplete Multivariate Normal Samples
11.2.2 Estimated Asymptotic Covariance Matrix of (θ − θ̂)
11.2.3 Bayes Inference and Multiple Imputation for the Normal Model
11.3 The Normal Model with a Restricted Covariance Matrix
11.4 Multiple Linear Regression
11.4.1 Linear Regression with Missingness Confined to the Dependent Variable
11.4.2 More General Linear Regression Problems with Missing Data
11.5 A General Repeated-Measures Model with Missing Data
11.6 Time Series Models
11.6.1 Introduction
11.6.2 Autoregressive Models for Univariate Time Series with Missing Values
11.6.3 Kalman Filter Models
11.7 Measurement Error Formulated as Missing Data
12 Models for Robust Estimation
12.1 Introduction
12.2 Reducing the Influence of Outliers by Replacing the Normal Distribution by a Longer-Tailed Distribution
12.2.1 Estimation for a Univariate Sample
12.2.2 Robust Estimation of the Mean and Covariance Matrix with Complete Data
12.2.3 Robust Estimation of the Mean and Covariance Matrix from Data with Missing Values
12.2.4 Adaptive Robust Multivariate Estimation
12.2.5 Bayes Inference for the t Model
12.2.6 Further Extensions of the t Model
12.3 Penalized Spline of Propensity Prediction
13 Models for Partially Classified Contingency Tables, Ignoring the Missingness Mechanism
13.1 Introduction
13.2 Factored Likelihoods for Monotone Multinomial Data
13.2.1 Introduction
13.2.2 ML and Bayes for Monotone Patterns
13.2.3 Precision of Estimation
13.3 ML and Bayes Estimation for Multinomial Samples with General Patterns of Missingness
13.4 Loglinear Models for Partially Classified Contingency Tables
13.4.1 The Complete-Data Case
13.4.2 Loglinear Models for Partially Classified Tables
13.4.3 Goodness-of-Fit Tests for Partially Classified Data
14 Mixed Normal and Nonnormal Data with Missing Values, Ignoring the Missingness Mechanism
14.1 Introduction
14.2 The General Location Model
14.2.1 The Complete-Data Model and Parameter Estimates
14.2.2 ML Estimation with Missing Values
14.2.3 Details of the E Step Calculations
14.2.4 Bayes' Computation for the Unrestricted General Location Model
14.3 The General Location Model with Parameter Constraints
14.3.1 Introduction
14.3.2 Restricted Models for the Cell Means
14.3.3 Loglinear Models for the Cell Probabilities
14.3.4 Modifications to the Algorithms of Previous Sections to Accommodate Parameter Restrictions
14.3.5 Simplifications When Categorical Variables are More Observed than Continuous Variables
14.4 Regression Problems Involving Mixtures of Continuous and Categorical Variables
14.4.1 Normal Linear Regression with Missing Continuous or Categorical Covariates
14.4.2 Logistic Regression with Missing Continuous or Categorical Covariates
14.5 Further Extensions of the General Location Model
15 Missing Not at Random Models
15.1 Introduction
15.2 Models with Known MNAR Missingness Mechanisms: Grouped and Rounded Data
15.3 Normal Models for MNAR Missing Data
15.3.1 Normal Selection and Pattern-Mixture Models for Univariate Missingness
15.3.2 Following up a Subsample of Nonrespondents
15.3.3 The Bayesian Approach
15.3.4 Imposing Restrictions on Model Parameters
15.3.5 Sensitivity Analysis
15.3.6 Subsample Ignorable Likelihood for Regression with Missing Data
15.4 Other Models and Methods for MNAR Missing Data
15.4.1 MNAR Models for Repeated-Measures Data
15.4.2 MNAR Models for Categorical Data
15.4.3 Sensitivity Analyses for Chained-Equation Multiple Imputations
15.4.4 Sensitivity Analyses in Pharmaceutical Applications
References
Author Index
Subject Index
1 Introduction
1.1 The Problem of Missing Data
Standard statistical methods have been developed to analyze rectangular data sets. Traditionally, the rows of the data matrix represent units, also called cases, observations, or subjects depending on context, and the columns represent characteristics or variables measured for each unit. The entries in the data matrix are nearly always real numbers, either representing the values of essentially continuous variables, such as age and income, or representing categories of response, which may be ordered (e.g., level of education) or unordered (e.g., race, sex). This book concerns the analysis of such a data matrix when some of the entries in the matrix are not observed. For example, respondents in a household survey may refuse to report income; in an industrial experiment, some results are missing because of mechanical failures unrelated to the experimental process; in an opinion survey, some individuals may be unable to express a preference for one candidate over another.
In the first two examples, it is natural to treat the values that are not observed as missing, in the sense that there are actual underlying values that would have been observed if survey techniques had been better or the industrial equipment had been better maintained. In the third example, however, it is less clear that a well-defined candidate preference has been masked by the nonresponse; thus, it is less natural to treat the unobserved values as missing. Instead, in this example, the lack of a response is essentially an additional point in the sample space of the variable being measured, which identifies a "no preference" or "don't know" stratum of the population for that variable.
Older review articles on the statistical analysis of data with missing values include Afifi and Elashoff (1966), Hartley and Hocking (1971), Orchard and Woodbury (1972), Dempster et al. (1977), Little and Rubin (1983a), Little and Schenker (1994), and Little (1997). More recent literature includes books on the topic, such as Schafer (1997), van Buuren (2012), Carpenter and Kenward (2014), and Raghunathan (2015).
Part I considers basic approaches, including analysis of the complete cases and associated weighting methods, and methods that impute (that is, fill in) the missing values. Part II considers more principled approaches based on statistical models and the associated likelihood function, and Part III provides applications of these methods. Our generally preferred philosophy of inference can be termed "calibrated Bayes," where the inference is Bayesian, using models that yield inferences with good frequentist properties (Rubin 1984, 2019; Little 2006). For example, 95% Bayesian credibility intervals should have approximately 95% confidence coverage in repeated sampling from the population. The method of multiple imputation has such a Bayesian justification but can be used in conjunction with standard frequentist approaches to the complete-data inference.
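The calibration requirement in this paragraph can be checked by simulation. The following is a minimal sketch, not taken from the book, and it assumes a normal sample with known standard deviation and a flat prior on the mean, so that the posterior for the mean is normal with mean equal to the sample mean and variance sigma²/n. The function names, sample size, and parameter values are illustrative assumptions.

```python
import random
import statistics

def credible_interval(sample, sigma, z=1.96):
    """95% posterior interval for a normal mean (known sigma, flat prior)."""
    center = statistics.fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return center - half_width, center + half_width

def coverage(true_mean=10.0, sigma=2.0, n=25, replications=2000, seed=0):
    """Fraction of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = credible_interval(sample, sigma)
        hits += low <= true_mean <= high
    return hits / replications

rate = coverage()
print(f"Empirical coverage of the 95% interval: {rate:.3f}")
```

Under these assumptions the Bayesian interval coincides with the classical confidence interval, so the printed rate should be close to 0.95. With less convenient models or priors the two need not agree, which is when a check of this kind becomes informative.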
Most statistical software packages allow the identification of nonrespondents by creating one or more special codes for those entries of the data matrix that are not observed. More than one code might be used to identify particular types of nonresponse, such as "don't know," or "refuse to answer," or "out of legitimate range." Some statistical software excludes units that have missing value codes for any of the variables involved in an analysis. This strategy, which is often termed a "complete-case analysis," is generally inappropriate because the investigator is usually interested in making inferences about the entire target population, rather than about the portion of the target population that would provide responses on all relevant variables in the analysis. Our aim is to describe a collection of techniques that are more generally appropriate than complete-case analysis when missing entries in the data set mask the underlying values.
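As an illustration of the practice described above, the sketch below recodes special nonresponse codes and then performs a complete-case analysis by discarding every unit with any missing entry. The codes `-8` and `-9`, the toy data, and the function names are hypothetical and chosen only for this example; real software and surveys define their own codes.

```python
# Hypothetical missing-value codes, chosen only for this example.
MISSING_CODES = {-8: "don't know", -9: "refused to answer"}

def recode_missing(rows):
    """Replace coded nonresponse with None so it is not mistaken for data."""
    return [[None if value in MISSING_CODES else value for value in row]
            for row in rows]

def complete_cases(rows):
    """Keep only units observed on every variable (complete-case analysis)."""
    return [row for row in rows if all(value is not None for value in row)]

# Toy data matrix: one row per unit, columns are (age, income in $1000s).
raw = [
    [34, 52],
    [51, -9],   # income refused
    [29, 38],
    [-8, 61],   # age not known
    [45, -9],   # income refused
]

data = recode_missing(raw)
kept = complete_cases(data)
print(f"{len(kept)} of {len(data)} units are complete cases")
```

Here three of the five units are discarded, and any summary computed from `kept` describes only the units that answered every question, which is the limitation the paragraph above points out.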
Definition 1.1 Missing data are unobserved values that would be meaningful for analysis if observed; in other words, a missing value hides a meaningful value.
When Definition 1.1 applies, it makes sense to consider analyses that effectively predict, or "impute" (that is, fill in), the unobserved values. If, on the other hand, Definition 1.1 does not apply, then imputing the unobserved values makes little sense, and an analysis that creates strata of the population defined by the pattern of observed data is more appropriate. Example 1.1 describes a situation with longitudinal data on obesity where Definition 1.1 clearly makes sense. Example 1.2 describes the case of a randomized experiment where it makes sense for one outcome variable (survival) but not for another (quality of life); and Example 1.3 describes a situation in opinion polling where Definition 1.1 may or may not make sense, depending on the specific setting.
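When Definition 1.1 does not apply, the alternative mentioned above is to form strata from the pattern of observed data. A minimal sketch of that grouping step follows; the toy poll data and the function name are assumptions made for illustration, with `None` standing for "no stated preference".

```python
from collections import Counter

def missingness_pattern(row):
    """Return a tuple of indicators per variable: 1 = missing, 0 = observed."""
    return tuple(1 if value is None else 0 for value in row)

# Toy poll data: (age group, preferred candidate).
units = [
    ("18-34", "A"), ("18-34", None), ("35-64", "B"),
    ("35-64", "A"), ("65+", None), ("65+", "B"),
]

# Each distinct pattern of observed data defines one stratum.
strata = Counter(missingness_pattern(unit) for unit in units)
for pattern, count in sorted(strata.items()):
    print(pattern, count)
```

In this toy example the pattern `(0, 1)` collects the respondents without a stated preference, who would be analyzed as their own stratum rather than having a preference imputed.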
Example 1.1 Nonresponse for a Binary Outcome Measured at Three Time Points. Woolson and Clarke (1984) analyze data from the Muscatine Coronary Risk Factor Study, a longitudinal study of coronary risk factors in schoolchildren. Table 1.1 summarizes the pattern of missing data in the data matrix. Five variables (sex, age, and obesity for three rounds of the survey) are recorded for 4856 units; sex and age are completely recorded, but the three obesity variables are sometimes missing, thereby generating six patterns of missingness. Because age is recorded in five categories and the obesity variables are binary, the data can be displayed as counts in a contingency table. Table 1.2 displays the data in this form, with missingness of obesity treated as a third category of the variable, where O = obese, N = not obese, and M = missing. Thus, the pattern MON denotes missing at the first round, obese at the second round, and not obese at the third round, and the other five patterns are defined analogously.
Table 1.1 Example 1.1: data matrix for children in a survey summarized by the pattern of missing data: 1 = missing, 0 = observed
          Variables
Pattern   Age   Sex   Weight 1   Weight 2   Weight 3   No. of children with pattern
A         0     0     0          0          0          1770
B         0     0     0          0          1           631
C         0     0     0          1          0           184
D         0     0     1          0          0           645
E         0     0     0          1          1           756
F         0     0     1          0          1           370
G         0     0     1          1          0           500

Woolson and Clarke analyze these data by fitting multinomial distributions over the 3³ − 1 = 26 response categories for each column in Table 1.2. That is, missingness is regarded as defining strata of the population. We suspect that for these data, it makes good sense to regard the nonrespondents as having a true underlying value for the obesity variable. Hence, we would argue for treating the nonresponse categories as missing value indicators and estimating the joint distribution of the three dichotomous outcome variables from the partially missing data. Appropriate methods for handling such categorical data with missing values effectively impute the values of obesity that are not observed, as described in Chapter 13. The methods involve quite straightforward modifications of existing algorithms for categorical data analysis, which are now widely available in statistical software packages. For an analysis of these data that averages over patterns of missing data, see Ekholm and Skinner (1998).
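The count of 26 response categories and the total of 4856 children can be verified directly. The sketch below enumerates the three-round response patterns over O, N, and M and drops MMM, on the reading that Table 1.1 contains no pattern with all three weights missing; that reading, and the variable names, are assumptions made for this illustration. The counts are copied from Table 1.1.

```python
from itertools import product

# O = obese, N = not obese, M = missing; one letter per survey round.
LEVELS = "ONM"
all_patterns = ["".join(p) for p in product(LEVELS, repeat=3)]  # 3**3 = 27

# Assumption: MMM is excluded because no row of Table 1.1 has all three
# weight variables missing.
categories = [p for p in all_patterns if p != "MMM"]
print(len(categories))  # 3**3 - 1 = 26

# Table 1.1: (weight 1, weight 2, weight 3) indicators, 1 = missing.
table_1_1 = {
    (0, 0, 0): 1770,  # A
    (0, 0, 1): 631,   # B
    (0, 1, 0): 184,   # C
    (1, 0, 0): 645,   # D
    (0, 1, 1): 756,   # E
    (1, 0, 1): 370,   # F
    (1, 1, 0): 500,   # G
}
total = sum(table_1_1.values())
print(total)  # 4856, the number of units stated in Example 1.1
```

Of the 26 categories, 8 are fully observed patterns such as ONN and the remaining 18 contain at least one M.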
Table 1.2 Example 1.1: number of children classified by population and relative weight category in three rounds of a survey
Males Females Response Age group Age...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.