Multiple Imputation and its Application

Name: Multiple Imputation and its Application
Brand: Wiley
Price: 61.99 EUR
Availability: OnlineOnly

James Carpenter, Michael Kenward (Authors)

Wiley (Publisher)

Published on 19 December 2012

368 pages

E-Book

ePUB with Adobe-DRM

978-1-118-44261-6 (ISBN)

€61.99 incl. 7% VAT

System requirements

for ePUB with Adobe-DRM

E-Book Single Licence

Available for download

New edition available

Description

A practical guide to analysing partially observed data.

Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.

This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms and its application to increasingly complex data structures.

Multiple Imputation and its Application:

* Discusses the issues raised by the analysis of partially observed data, and the assumptions on which analyses rest.
* Presents a practical guide to the issues to consider when analysing incomplete data from both observational studies and randomized trials.
* Provides a detailed discussion of the practical use of MI with real-world examples drawn from medical and social statistics.
* Explores handling non-linear relationships and interactions with multiple imputation, survival analysis, multilevel multiple imputation, sensitivity analysis via multiple imputation, using non-response weights with multiple imputation and doubly robust multiple imputation.

Multiple Imputation and its Application is aimed at quantitative researchers and students in the medical and social sciences with the aim of clarifying the issues raised by the analysis of incomplete data, outlining the rationale for MI and describing how to consider and address the issues that arise in its application.

Content

Preface

Data acknowledgements

Acknowledgements

Glossary

PART I FOUNDATIONS

1 Introduction

1.1 Reasons for missing data

1.2 Examples

1.3 Patterns of missing data

1.3.1 Consequences of missing data

1.4 Inferential framework and notation

1.4.1 Missing Completely At Random (MCAR)

1.4.2 Missing At Random (MAR)

1.4.3 Missing Not At Random (MNAR)

1.4.4 Ignorability

1.5 Using observed data to inform assumptions about the missingness mechanism

1.6 Implications of missing data mechanisms for regression analyses

1.6.1 Partially observed response

1.6.2 Missing covariates

1.6.3 Missing covariates and response

1.6.4 Subtle issues I: The odds ratio

1.6.5 Implication for linear regression

1.6.6 Subtle issues II: Subsample ignorability

1.6.7 Summary: When restricting to complete records is valid

1.7 Summary

2 The multiple imputation procedure and its justification

2.1 Introduction

2.2 Intuitive outline of the MI procedure

2.3 The generic MI procedure

2.4 Bayesian justification of MI

2.5 Frequentist inference

2.5.1 Large number of imputations

2.5.2 Small number of imputations

2.6 Choosing the number of imputations

2.7 Some simple examples

2.8 MI in more general settings

2.8.1 Survey sample settings

2.9 Constructing congenial imputation models

2.10 Practical considerations for choosing imputation models

2.11 Discussion

PART II MULTIPLE IMPUTATION FOR CROSS SECTIONAL DATA

3 Multiple imputation of quantitative data

3.1 Regression imputation with a monotone missingness pattern

3.1.1 MAR mechanisms consistent with a monotone pattern

3.1.2 Justification

3.2 Joint modelling

3.2.1 Fitting the imputation model

3.3 Full conditional specification

3.3.1 Justification

3.4 Full conditional specification versus joint modelling

3.5 Software for multivariate normal imputation

3.6 Discussion

4 Multiple imputation of binary and ordinal data

4.1 Sequential imputation with monotone missingness pattern

4.2 Joint modelling with the multivariate normal distribution

4.3 Modelling binary data using latent normal variables

4.3.1 Latent normal model for ordinal data

4.4 General location model

4.5 Full conditional specification

4.5.1 Justification

4.6 Issues with over-fitting

4.7 Pros and cons of the various approaches

4.8 Software

4.9 Discussion

5 Multiple imputation of unordered categorical data

5.1 Monotone missing data

5.2 Multivariate normal imputation for categorical data

5.3 Maximum indicant model

5.3.1 Continuous and categorical variable

5.3.2 Imputing missing data

5.3.3 More than one categorical variable

5.4 General location model

5.5 FCS with categorical data

5.6 Perfect prediction issues with categorical data

5.7 Software

5.8 Discussion

6 Nonlinear relationships

6.1 Passive imputation

6.2 No missing data in nonlinear relationships

6.3 Missing data in nonlinear relationships

6.3.1 Predictive Mean Matching (PMM)

6.3.2 Just Another Variable (JAV)

6.3.3 Joint modelling approach

6.3.4 Extension to more general models and missing data patterns

6.3.5 Metropolis-Hastings sampling

6.3.6 Rejection sampling

6.3.7 FCS approach

6.4 Discussion

7 Interactions

7.1 Interaction variables fully observed

7.2 Interactions of categorical variables

7.3 General nonlinear relationships

7.4 Software

7.5 Discussion

PART III ADVANCED TOPICS

8 Survival data, skips and large datasets

8.1 Time-to-event data

8.1.1 Imputing missing covariate values

8.1.2 Survival data as categorical

8.1.3 Imputing censored survival times

8.2 Nonparametric, or 'hot deck' imputation

8.2.1 Nonparametric imputation for survival data

8.3 Multiple imputation for skips

8.4 Two-stage MI

8.5 Large datasets

8.5.1 Large datasets and joint modelling

8.5.2 Shrinkage by constraining parameters

8.5.3 Comparison of the two approaches

8.6 Multiple imputation and record linkage

8.7 Measurement error

8.8 Multiple imputation for aggregated scores

8.9 Discussion

9 Multilevel multiple imputation

9.1 Multilevel imputation model

9.2 MCMC algorithm for imputation model

9.3 Imputing level-2 covariates using FCS

9.4 Individual patient meta-analysis

9.4.1 When to apply Rubin's rules

9.5 Extensions

9.5.1 Random level-1 covariance matrices

9.5.2 Model fit

9.6 Discussion

10 Sensitivity analysis: MI unleashed

10.1 Review of MNAR modelling

10.2 Framing sensitivity analysis

10.3 Pattern mixture modelling with MI

10.3.1 Missing covariates

10.3.2 Application to survival analysis

10.4 Pattern mixture approach with longitudinal data via MI

10.4.1 Change in slope post-deviation

10.5 Piecing together post-deviation distributions from other trial arms

10.6 Approximating a selection model by importance weighting

10.6.1 Algorithm for approximate sensitivity analysis by re-weighting

10.7 Discussion

11 Including survey weights

11.1 Using model based predictions

11.2 Bias in the MI variance estimator

11.2.1 MI with weights

11.2.2 Estimation in domains

11.3 A multilevel approach

11.4 Further developments

11.5 Discussion

12 Robust multiple imputation

12.1 Introduction

12.2 Theoretical background

12.2.1 Simple estimating equations

12.2.2 The Probability Of Missingness (POM) model

12.2.3 Augmented inverse probability weighted estimating equation

12.3 Robust multiple imputation

12.3.1 Univariate MAR missing data

12.3.2 Longitudinal MAR missing data

12.4 Simulation studies

12.4.1 Univariate MAR missing data

12.4.2 Longitudinal monotone MAR missing data

12.4.3 Longitudinal nonmonotone MAR missing data

12.4.4 Nonlongitudinal nonmonotone MAR missing data

12.4.5 Results and discussion

12.5 The RECORD study

12.6 Discussion

Appendix A Markov Chain Monte Carlo

Appendix B Probability distributions

B.1 Posterior for the multivariate normal distribution

Bibliography

Index of Authors

Index of Examples

Index

Chapter 1

Introduction

Collecting, analysing and drawing inferences from data are central to research in the medical and social sciences. Unfortunately, for any number of reasons, it is rarely possible to collect all the intended data. The ubiquity of missing data, and the problems this poses for both analysis and inference, has spawned a substantial statistical literature dating from the 1950s. At that time, when statistical computing was in its infancy, many analyses were only feasible because of the carefully planned balance in the dataset (for example, the same number of observations on each unit). Missing data meant the available data for analysis were unbalanced, thus complicating the planned analysis and in some instances rendering it unfeasible. Early work on the problem was therefore largely computational (e.g. Healy and Westmacott, 1956; Afifi and Elashoff, 1966; Orchard and Woodbury, 1972; Dempster et al., 1977).

The wider question of the consequences of nontrivial proportions of missing data for inference was neglected until a seminal paper by Rubin (1976). This set out a typology for assumptions about the reasons for missing data, and sketched their implications for analysis and inference. It marked the beginning of a broad stream of research about the analysis of partially observed data. The literature is now huge, and continues to grow, both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.

For a broad overview of the literature, a good place to start is one of the recent excellent textbooks. Little and Rubin (2002) write for applied statisticians. They give a good overview of likelihood methods, and give an introduction to multiple imputation. Allison (2002) presents a less technical overview. Schafer (1997) is more algorithmic, focusing on the EM algorithm and imputation using the multivariate normal and general location model. Molenberghs and Kenward (2007) focus on clinical studies, while Daniels and Hogan (2008) focus on longitudinal studies with a Bayesian emphasis.

The above books concentrate on parametric approaches. However, there is also a growing literature based around using inverse probability weighting, in the spirit of Horvitz and Thompson (1952), and associated doubly robust methods. In particular, we refer to the work of Robins and colleagues (e.g. Robins et al., 1995; Scharfstein et al., 1999). Vansteelandt et al. (2009) give an accessible introduction to these developments. A comparison with multiple imputation in a simple setting is given by Carpenter et al. (2006). The pros and cons are debated in Kang and Schafer (2007) and the theory is brought together by Tsiatis (2006).

This book is concerned with a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI was initially proposed by Rubin (1987) in the context of surveys; since then, increasing awareness among researchers about the possible effects of missing data (e.g. Klebanoff and Cole, 2008) has led to an upsurge of interest (e.g. Sterne et al., 2009; Kenward and Carpenter, 2007; Schafer, 1999a; Rubin, 1996).

Multiple imputation (MI) is attractive because it is both practical and widely applicable. Recently developed statistical software (see, for example, issue 45 of the Journal of Statistical Software) has placed it within the reach of most researchers in the medical and social sciences, whether or not they have undertaken advanced training in statistics. However, the increasing use of MI in a range of settings beyond that originally envisaged has led to a bewildering proliferation of algorithms and software. Further, the implication of the underlying assumptions in the context of the data at hand is often unclear.

We are writing for researchers in the medical and social sciences with the aim of clarifying the issues raised by missing data, outlining the rationale for MI, explaining the motivation and relationship between the various imputation algorithms, and describing and illustrating its application to increasingly complex data structures.

Central to the analysis of partially observed data is an understanding of why the data are missing and the implications of this for the analysis. This is the focus of the remainder of this chapter. Introducing some of the examples that run through the book, we show how Rubin's typology (Rubin, 1976) provides the foundational framework for understanding the implications of missing data.

1.1 Reasons for missing data

In this section we consider possible reasons for missing data, illustrate these with examples, and draw some preliminary implications for inference. We use the word ‘possible’ advisedly, since with partially observed data we can rarely be sure of the mechanism giving rise to missing data. Instead, a range of possible mechanisms are consistent with the observed data. In practice, we therefore wish to analyse the data under different mechanisms, to establish the robustness of our inference in the face of uncertainty about the missingness mechanism.
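The idea of analysing the same data under different missingness mechanisms can be sketched in a few lines. The example below is only an illustration and not the book's method: the data values and the shift parameter `delta` are hypothetical, and `delta` simply encodes an assumption about how the unseen values differ on average from the observed ones.

```python
from statistics import mean

# Hypothetical observed values; two further units have no recorded value.
observed = [4.0, 5.0, 6.0, 5.0]
n_missing = 2


def overall_mean(observed, n_missing, delta):
    """Mean over all units, assuming the missing values average
    mean(observed) + delta. delta = 0 says the missing values look
    like the observed ones; other values encode a departure."""
    assumed_missing_mean = mean(observed) + delta
    total = sum(observed) + n_missing * assumed_missing_mean
    return total / (len(observed) + n_missing)


# The observed data are identical in every scenario; only the
# assumption about the unobserved values changes.
for delta in (0.0, -1.5, 1.5):
    print(f"delta={delta:+.1f}  estimated mean={overall_mean(observed, n_missing, delta):.2f}")
```

If the estimate changes little across plausible values of `delta`, the conclusion is robust to this particular uncertainty; if it changes a lot, the assumed mechanism matters.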

All datasets consist of a series of units each of which provides information on a series of items. For example, in a cross-sectional questionnaire survey, the units would be individuals and the items their answers to the questions. In a household survey, the units would be households, and the items information about the household and members of the household. In longitudinal studies, units would typically be individuals while items would be longitudinal data from those individuals. In this book, units therefore correspond to the highest level in multilevel (i.e., hierarchical) data, and unless stated otherwise data from different units are statistically independent.

Within this framework, it is useful to distinguish between units where all the information is missing, termed unit nonresponse, and units who contribute partial information, termed item nonresponse. The statistical issues are the same in both cases, and both can in principle be handled by MI. However, the main focus of this book is the latter.
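The distinction between unit and item nonresponse can be made concrete with a small sketch. The records, item names and the helper `classify_unit` below are hypothetical and chosen only for illustration; a missing item is represented by `None`.

```python
from typing import Mapping, Optional, Sequence


def classify_unit(record: Mapping[str, Optional[float]], items: Sequence[str]) -> str:
    """Label a unit by how many of its intended items were observed."""
    n_observed = sum(record.get(item) is not None for item in items)
    if n_observed == 0:
        return "unit nonresponse"   # nothing was recorded for this unit
    if n_observed < len(items):
        return "item nonresponse"   # some, but not all, items recorded
    return "complete"


# Hypothetical questionnaire: each dict is one unit, each key one item.
ITEMS = ["age", "income", "score"]
survey = [
    {"age": 34, "income": 28000, "score": 7},
    {"age": 51, "income": None, "score": 5},
    {"age": None, "income": None, "score": None},
]

labels = [classify_unit(unit, ITEMS) for unit in survey]
print(labels)
```

The check uses `is not None` rather than truthiness so that a legitimately observed value of zero is not mistaken for a missing item.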

Example 1.1 Mandarin tableau

Figure 1.1, which is also shown on the cover, shows part of the frontage of a senior mandarin's house in the New Territories, Hong Kong. We suppose interest focuses on characteristics of the figurines, for example their number, height, facial characteristics and dress. Unit nonresponse then corresponds to missing figurines, and item nonresponse to damaged (hence partially observed) figurines.

Figure 1.1 Detail from a senior mandarin's house front in New Territories, Hong Kong. Photograph by H. Goldstein.

1.2 Examples

We now introduce two key examples, which we return to throughout the book.

Example 1.2 Youth Cohort Study (YCS)

The Youth Cohort Study of England and Wales (YCS) is an ongoing UK government funded representative survey of pupils in England and Wales at school-leaving age (School year 11, age 16–17) (UK Data Archive, 2007). Each year that a new cohort is surveyed, detailed information is collected on each young person's experience of education and their qualifications as well as information on employment and training. A limited amount of information is collected on their personal characteristics, family, home circumstances, and aspirations. Over the life-cycle of the YCS, different organisations have had responsibility for the structure and timings of data collection. Unfortunately, the documentation of older cohorts is poor. Croxford et al. (2007) have recently deposited a harmonised dataset that comprises YCS cohorts from 1984 to 2002 (UK Data Archive Study Number 5765). We consider data from pupils attending comprehensive schools from five YCS cohorts; these pupils reached the end of Year 11 in 1990, 1993, 1995, 1997 and 1999. We explore relationships between Year 11 educational attainment (the General Certificate of Secondary Education) and key measures of social stratification. The units are pupils and the items are measurements on these pupils, and a nontrivial number of items are partially observed.

Example 1.3 Randomised controlled trial of patients with chronic asthma

We consider data from a 5-arm asthma clinical trial to assess the efficacy and safety of budesonide, a second-generation glucocorticosteroid, on patients with chronic asthma. 473 patients with chronic asthma were enrolled in the 12-week randomised, double-blind, multi-centre parallel-group trial, which compared the effect of a daily dose of 200, 400, 800 or 1600 mcg of budesonide with placebo. Key outcomes of clinical interest include patients' peak expiratory flow rate (their maximum speed of expiration in litres/minute) and their Forced Expiratory Volume, FEV1 (the volume of air, in litres, the patient with fully inflated lungs can breathe out in one second). In summary, the trial found a statistically significant dose-response effect, at the 5% level, for the mean change from baseline over the study in morning peak expiratory flow, evening peak expiratory flow and FEV1. Budesonide treated patients also showed reduced asthma symptoms and bronchodilator use compared with placebo, while there were no clinically significant differences in treatment related adverse experiences between the treatment groups. Further details about the conduct of the trial, its conclusions and the variables collected can be found elsewhere (Busse et...
