Multiple Imputation and its Application

Name: Multiple Imputation and its Application
Brand: Wiley-ISTE
Price: 70.99 EUR
Availability: OnlineOnly

James R. Carpenter, Jonathan W. Bartlett, Tim P. Morris, Angela M. Wood, Matteo Quartagno, Michael G. Kenward (Authors)

Wiley-ISTE (Publisher)

2nd Edition

Published on 20 July 2023

464 pages

E-Book

ePUB with Adobe-DRM

978-1-119-75610-1 (ISBN)

€70.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface to the second edition

Data acknowledgements

Acknowledgements

Glossary

Part I Foundations

1 Introduction

1.1 Reasons for missing data

1.2 Examples

1.3 Patterns of missing data

1.4 Inferential framework and notation

1.5 Using observed data to inform assumptions about the missingness mechanism

1.6 Implications of missing data mechanisms for regression analyses

1.7 Summary

2 The Multiple Imputation Procedure and Its Justification

2.1 Introduction

2.2 Intuitive outline of the MI procedure

2.3 The generic MI procedure

2.4 Bayesian justification of MI

2.5 Frequentist inference

2.6 Choosing the number of imputations

2.7 Some simple examples

2.8 MI in more general settings

2.9 Constructing congenial imputation models

2.10 Discussion

Part II Multiple Imputation for Simple Data Structures

3 Multiple Imputation of Quantitative Data

3.1 Regression imputation with a monotone missingness pattern

3.2 Joint modelling

3.3 Full conditional specification

3.4 Full conditional specification versus joint modelling

3.5 Software for multivariate normal imputation

3.6 Discussion

4 Multiple Imputation of Binary and Ordinal Data

4.1 Sequential imputation with monotone missingness pattern

4.2 Joint modelling with the multivariate normal distribution

4.3 Modelling binary data using latent normal variables

4.4 General location model

4.5 Full conditional specification

4.6 Issues with over-fitting

4.7 Pros and cons of the various approaches

4.8 Software

4.9 Discussion

5 Imputation of Unordered Categorical Data

5.1 Monotone missing data

5.2 Multivariate normal imputation for categorical data

5.3 Maximum indicant model

5.4 General location model

5.5 FCS with categorical data

5.6 Perfect prediction issues with categorical data

5.7 Software

5.8 Discussion

Part III Multiple Imputation in Practice

6 Non-linear Relationships, Interactions, and Other Derived Variables

6.1 Introduction

6.2 No missing data in derived variables

6.3 Simple methods

6.4 Substantive-model-compatible imputation

6.5 Returning to the problems

7 Survival Data

7.1 Missing covariates in time-to-event data

7.2 Imputing censored event times

7.3 Non-parametric, or 'hot deck' imputation

7.4 Case-cohort designs

7.5 Discussion

8 Prognostic Models, Missing Data, and Multiple Imputation

8.1 Introduction

8.2 Motivating example

8.3 Missing data at model implementation

8.4 Multiple imputation for prognostic modelling

8.5 Model building

8.6 Model performance

8.7 Model validation

8.8 Incomplete data at implementation

9 Multi-level Multiple Imputation

9.1 Multi-level imputation model

9.2 MCMC algorithm for imputation model

9.3 Extensions

9.4 Other imputation methods

9.5 Individual participant data meta-analysis

9.6 Software

9.7 Discussion

10 Sensitivity Analysis: MI Unleashed

10.1 Review of MNAR modelling

10.2 Framing sensitivity analysis: estimands

10.3 Pattern mixture modelling with MI

10.4 Pattern mixture approach with longitudinal data via MI

10.5 Reference based imputation

10.6 Approximating a selection model by importance weighting

10.7 Discussion

11 Multiple Imputation for Measurement Error and Misclassification

11.1 Introduction

11.2 Multiple imputation with validation data

11.3 Multiple imputation with replication data

11.4 External information on the measurement process

11.5 Discussion

12 Multiple Imputation with Weights

12.1 Using model-based predictions in strata

12.2 Bias in the MI variance estimator

12.3 MI with weights

12.4 A multi-level approach

12.5 Further topics

12.6 Discussion

13 Multiple Imputation for Causal Inference

13.1 Multiple imputation for causal inference in point exposure studies

13.2 Multiple imputation and propensity scores

13.3 Principal stratification via multiple imputation

13.4 Multiple imputation for IV analysis

13.5 Discussion

14 Using Multiple Imputation in Practice

14.1 A general approach

14.2 Objections to multiple imputation

14.3 Reporting of analyses with incomplete data

14.4 Presenting incomplete baseline data

14.5 Model diagnostics

14.6 How many imputations?

14.7 Multiple imputation for each substantive model, project, or dataset?

14.8 Large datasets

14.9 Multiple imputation and record linkage

14.10 Setting random number seeds for multiple imputation analyses

14.11 Simulation studies including multiple imputation

14.12 Discussion

Appendix A Markov Chain Monte Carlo

A.1 Metropolis Hastings sampler

A.2 Gibbs sampler

A.3 Missing data

Appendix B Probability Distributions

B.1 Posterior for the multivariate normal distribution

Appendix C Overview of Multiple Imputation in R, Stata

C.1 Basic multiple imputation using R

C.2 Basic MI using Stata

References

Author Index

Index of Examples

Subject Index

1 Introduction

Collecting, analysing, and drawing inferences from data are central to research in the medical and social sciences. Unfortunately, for any number of reasons, it is rarely possible to collect all the intended data. The ubiquity of missing data, and the problems this poses for both analysis and inference, has spawned a substantial statistical literature dating from the 1950s. At that time, when statistical computing was in its infancy, many analyses were only feasible because of the carefully planned balance in the dataset (for example, the same number of observations on each unit). Missing data meant the available data for analysis were unbalanced, thus complicating the planned analysis and in some instances rendering it infeasible. Early work on the problem was therefore largely computational (e.g. Healy and Westmacott, 1956; Afifi and Elashoff, 1966; Orchard and Woodbury, 1972; Dempster et al., 1977).

The wider question of the consequences of non-trivial proportions of missing data for inference was neglected until the seminal paper by Rubin (1976). This set out a typology for assumptions about the reasons for missing data and sketched their implications for analysis and inference. It marked the beginning of a broad stream of research about the analysis of partially observed data. The literature is now huge and continues to grow, both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.

For a broad overview of the literature, a good place to start for applied statisticians is Little and Rubin (2019). They give a good overview of likelihood methods and an introduction to multiple imputation. Allison (2002) presents a less technical overview. Schafer (1997) is more algorithmic, focusing on the expectation maximisation (EM) algorithm and imputation using the multivariate normal and general location model. Molenberghs and Kenward (2007) focus on clinical studies, while Daniels and Hogan (2008) focus on longitudinal studies with a Bayesian emphasis.

The above books concentrate on parametric approaches. However, there is also a growing literature based around using inverse probability weighting, in the spirit of Horvitz and Thompson (1952), and associated doubly robust methods. In particular, we refer to the work of Robins and colleagues (e.g. Robins and Rotnitzky, 1995; Scharfstein et al., 1999). Vansteelandt et al. (2009) give an accessible introduction to these developments. A comparison with multiple imputation in a simple setting is given by Carpenter et al. (2006). The pros and cons are debated in Kang and Schafer (2007) and the theory is brought together by Tsiatis (2006).

This book is concerned with a particular statistical method for analysing and drawing inferences from incomplete data called multiple imputation (MI). MI was initially proposed by Rubin (1987) in the context of surveys. Since then, increasing awareness among researchers about the possible effects of missing data (e.g. Klebanoff and Cole, 2008) has led to an upsurge of interest (e.g. Sterne et al., 2009; Kenward and Carpenter, 2007; Schafer, 1999a; Rubin, 1996), fuelled by the increasing availability of software and computing power.

MI is attractive because it is both practical and widely applicable. Well-developed statistical software (see, for example, issue 45 of the Journal of Statistical Software) has placed MI within the reach of most researchers in the medical and social sciences, whether or not they have undertaken advanced training in statistics. However, the increasing use of MI in a range of settings beyond that originally envisaged has led to a bewildering proliferation of algorithms and software. Further, the implications of the underlying assumptions in the context of the data at hand are often unclear.

We are writing for researchers in the medical and social sciences with the aim of clarifying the issues raised by missing data, outlining the rationale for MI, explaining the motivation for, and relationships between, the various imputation algorithms, and describing and illustrating the application of MI in various settings and to some complex data structures.

Throughout most of the book (with the partial exception of Chapter 8), we will assume that a key aim of analysis with incomplete data is to recover the information lost due to missing data. More specifically, we will take the 'substantive model' as the model that would be used with complete data. We can then define certain desirable properties of our estimator with incomplete data. First, it should be unbiased for the value of the parameter we would see with complete data. Second, it should have low variance. Third, we should have a reliable variance formula and a means of constructing confidence intervals with the advertised coverage.
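A minimal sketch of how these three properties might be checked by simulation is given below. It is not taken from the book: the estimator (the mean of the observed values), the sample sizes, and the assumption that each value is missing independently of everything else are all illustrative choices made here.

```python
import random
import statistics


def simulate(n_reps=2000, n=200, true_mean=10.0, sd=2.0, p_missing=0.3, seed=2023):
    """Check bias, variance and 95% interval coverage of the observed-data mean.

    Illustrative assumption: each value is missing independently with
    probability p_missing, unrelated to the value itself.
    """
    rng = random.Random(seed)
    estimates, estimated_variances = [], []
    covered = 0
    for _ in range(n_reps):
        full = [rng.gauss(true_mean, sd) for _ in range(n)]
        # Keep each value with probability 1 - p_missing.
        observed = [y for y in full if rng.random() >= p_missing]
        est = statistics.fmean(observed)
        var_hat = statistics.variance(observed) / len(observed)
        half_width = 1.96 * var_hat ** 0.5
        covered += (est - half_width) <= true_mean <= (est + half_width)
        estimates.append(est)
        estimated_variances.append(var_hat)
    return {
        # Property 1: average estimate minus the complete-data target.
        "bias": statistics.fmean(estimates) - true_mean,
        # Property 2: variability of the estimator across replications.
        "empirical_variance": statistics.variance(estimates),
        # Property 3: the variance formula should track the empirical variance,
        # and the intervals should cover the target about 95% of the time.
        "mean_estimated_variance": statistics.fmean(estimated_variances),
        "coverage": covered / n_reps,
    }


results = simulate()
```

Under this particular missingness assumption the bias should be close to zero and the coverage close to 0.95; other assumptions about why data are missing need not give the same result.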

In the context of multiple imputation, it is worth noting that these remain our aims; the aim of multiple imputation is not to accurately predict the missing values. Rubin (1996) describes it as follows:

'Judging the quality of missing data procedures by their ability to recreate the individual missing values [...] does not lead to choosing procedures that result in valid inference, which is our objective'.

An objection may be that the ability to perfectly predict missing values would result in valid inference; however, in our view, this hypothetical scenario would be one in which data are not really 'missing'.

Central to the analysis of partially observed data is an understanding of why the data are missing and the implications of this for the analysis. This is the focus of the remainder of this chapter. Introducing some of the examples that run through the book, we show how Rubin's typology (Rubin, 1976) provides the foundational framework for understanding the implications of missing data.

1.1 Reasons for missing data

In this section, we consider possible reasons for missing data, illustrate these with examples, and note some preliminary implications for inference. We use the word 'possible' advisedly, since we can rarely be sure of the mechanism giving rise to missing data. Instead, a range of possible mechanisms are consistent with the observed data. In practice, we therefore wish to analyse the data under different mechanisms to establish the robustness of our inference in the face of uncertainty about the missingness mechanism.

All datasets consist of a series of units each of which provides information on a series of items. For example, in a cross-sectional questionnaire survey, the units would be individuals, and the items their answers to the questions. In a household survey, the units would be households, and the items information about the household and members of the household. In longitudinal studies, units would typically be individuals, while items would be longitudinal data from those individuals. In this book, units therefore correspond to the highest level in multi-level (i.e. hierarchical) data, and unless stated otherwise, data from different units are statistically independent.

Within this framework, it is useful to distinguish between units where all the information is missing, termed unit non-response, and units that contribute partial information, termed item non-response. The statistical issues are the same in both cases and both can in principle be handled by MI. However, the main focus of this book is the latter.
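The distinction between unit and item non-response can be sketched in a few lines of code. The survey, its item names, and the function name below are hypothetical, and missing items are assumed to be recorded as None.

```python
from typing import Optional


def classify_response(items: dict[str, Optional[object]]) -> str:
    """Label one unit according to how many of its items are missing (None)."""
    n_missing = sum(value is None for value in items.values())
    if n_missing == 0:
        return "complete"
    if n_missing == len(items):
        return "unit non-response"
    return "item non-response"


# Hypothetical cross-sectional survey: each unit is an individual,
# and each item is that individual's answer to one question.
survey = {
    "unit_1": {"age": 34, "income": 28000, "smoker": False},
    "unit_2": {"age": 51, "income": None, "smoker": True},
    "unit_3": {"age": None, "income": None, "smoker": None},
}

labels = {unit: classify_response(items) for unit, items in survey.items()}
```

Here unit_2 is an example of item non-response, while unit_3 contributes no information at all.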

Figure 1.1 Detail from a senior mandarin's house front in New Territories, Hong Kong.

1.2 Examples

We now introduce two key examples, which we return to throughout the book.

1.3 Patterns of missing data

It is very important to investigate the patterns of missing data before embarking on a formal analysis. This can throw up vital information that might otherwise be overlooked and may even allow the missing data to be traced. For example, when analysing the new wave of a longitudinal survey, a colleague's careful examination of missing data patterns established that many of the missing questionnaires could be traced to a set of cardboard boxes. These turned out to have been left behind in a move. They were recovered, and the data entered.
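One simple way to begin such an investigation is to tabulate how often each combination of observed and missing items occurs. The sketch below assumes missing items are stored as None; the function name, variable names, and data are hypothetical.

```python
from collections import Counter


def missingness_patterns(rows, variables):
    """Count how often each pattern of observed (1) / missing (0) items occurs."""
    patterns = Counter(
        tuple(0 if row.get(var) is None else 1 for var in variables)
        for row in rows
    )
    # Most frequent patterns first.
    return patterns.most_common()


# Hypothetical dataset with one outcome (y) and two covariates (x1, x2).
rows = [
    {"y": 1.2, "x1": 3, "x2": "a"},
    {"y": None, "x1": 4, "x2": "b"},
    {"y": 0.7, "x1": None, "x2": None},
    {"y": 2.1, "x1": 5, "x2": "a"},
    {"y": None, "x1": 2, "x2": "c"},
]

patterns = missingness_patterns(rows, ["y", "x1", "x2"])
for pattern, count in patterns:
    print(pattern, count)
```

Each tuple follows the order of the variables passed in, so (0, 1, 1) denotes units with y missing and both covariates observed.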

Table 1.1 YCS variables for exploring the relationship between Year 11 attainment and social stratification.

Variable name | Description
Cohort | Year of data collection: 1990, '93, '95, '97, '99
Boy | Indicator variable for boys
Occupation | Parental occupation, categorised as managerial, intermediate, or working
Ethnicity | Categorised as Bangladeshi, Black, Indian, other Asian, Other, Pakistani, or...

Schweitzer Fachinformationen
