The Statistical Analysis of Doubly Truncated Data

With Applications in R

Jacobo de Uña-Álvarez, Rosa M. Crujeiras, Carla Moreira (Authors)

Wiley (Publisher)

1st edition

Published on 10 November 2021

192 pages

E-book

ePUB with Adobe DRM

978-1-119-50047-6 (ISBN)

€67.99 incl. 7% VAT

E-book single licence

Available for download

Contents

Preface

List of Abbreviations

Notation

1 Introduction

1.1 Random Truncation

1.2 One-sided Truncation

1.2.1 Left-truncation

1.2.2 Right-truncation

1.2.3 Truncation vs. Censoring

1.3 Double Truncation

1.4 Real Data Examples

1.4.1 Childhood Cancer Data

1.4.2 AIDS Blood Transfusion Data

1.4.3 Equipment-S Rounded Failure Time Data

1.4.4 Quasar Data

1.4.5 Parkinson's Disease Data

1.4.6 Acute Coronary Syndrome Data

References

2 One-Sample Problems

2.1 Nonparametric Estimation of a Distribution Function

2.1.1 The NPMLE

2.1.2 Numerical Algorithms for Computing the NPMLE

2.1.3 Theoretical Properties of the NPMLE

2.1.4 Standard Errors and Confidence Limits

2.2 Semiparametric and Parametric Approaches

2.2.1 Semiparametric Approach

2.2.2 Parametric Approach

2.3 R Code for the Examples

2.3.1 Code for Example 2.1.8

2.3.2 Code for Examples 2.1.11 and 2.1.13

2.3.3 Code for Example 2.1.14

2.3.4 Code for Example 2.1.15

2.3.5 Code for Example 2.1.22

2.3.6 Code for Example 2.2.6

2.3.7 Code for Example 2.2.8

References

3 Smoothing Methods

3.1 Some Background in Kernel Estimation

3.2 Estimating the Density Function

3.3 Asymptotic Properties

3.4 Data-driven Bandwidth Selection

3.4.1 Normal Reference Bandwidth Selection

3.4.2 Plug-in Bandwidth Selection

3.4.3 Least-squares Cross-validation Bandwidth Selection

3.4.4 Smoothed Bootstrap Bandwidth Selection

3.4.5 Bandwidth Selectors in Practice

3.5 Further Issues in Kernel Density Estimation

3.6 Estimating the Hazard Function

3.7 R Code for the Examples

3.7.1 Code for Example 3.2.1

3.7.2 Code for Examples 3.3.4 and 3.3.5

3.7.3 Code for Examples 3.4.2 and 3.4.3

3.7.4 Code for Example 3.5.1

3.7.5 Code for Example 3.6.4

3.7.6 Code for Example 3.6.5

References

4 Regression Analysis

4.1 Observational Bias in Regression

4.2 Proportional Hazards Regression

4.3 Accelerated Failure Time Regression

4.4 Nonparametric Regression

4.5 R Code for the Examples

4.5.1 Code for Example 4.1.1

4.5.2 Code for Example 4.1.4

4.5.3 Code for Example 4.2.4

4.5.4 Code for Example 4.3.2

4.5.5 Code for Example 4.4.2

References

5 Further Topics

5.1 Two-Sample Problems

5.2 Competing Risks

5.2.1 Cumulative Incidences

5.2.2 Regression Models for Competing Risks

5.3 Testing for Quasi-independence

5.4 Dependent Truncation

5.5 R Code for the Examples

5.5.1 Code for Example 5.1.3

5.5.2 Code for Example 5.2.4

5.5.3 Code for Example 5.2.6

5.5.4 Code for Example 5.3.1

5.5.5 Code for Example 5.4.3

References

A Packages and Functions in R

A.1 Computing the NPMLE and Standard Errors

A.2 Assessing the Existence and Uniqueness of the NPMLE

A.3 Semiparametric and Parametric Estimation

A.4 Kernel Estimation

A.5 Regression Analysis

A.6 Competing Risks

A.7 Simulating Data

A.8 Testing Quasi-independence

A.9 Dependent Truncation

References

Index

1 Introduction

1.1 Random Truncation

Random truncation generally refers to a situation in which a number of individuals of the target population cannot be sampled because a certain random event precludes them. When this random event is unrelated to the variables of interest, standard statistical methods apply, the only inconvenience being a smaller sample size. In many practical cases, however, the truncation event is related to the variables under study, and specific methods to overcome the sampling bias must be considered.

This book is focused on random truncation phenomena that arise (usually, but not only) when sampling time-to-event data. That is, the variable of interest $X$ is the time elapsed from a well-defined origin to another well-defined end point. In this setting, a truncated sample of $X$ is a set of independent and identically distributed (iid) random variables $X_1, \ldots, X_n$ with the conditional distribution of $X$ given $X \in A$, where $A$ is a random set. Since the truncation event $\{X \in A\}$ is obviously related to $X$, standard statistical methods applied to the truncated sample may be systematically biased. For example, the ordinary empirical cumulative distribution function (ecdf) of $X_1, \ldots, X_n$ at point $x$, $n^{-1} \sum_{i=1}^{n} I(X_i \le x)$, converges to $P(X \le x \mid X \in A)$ rather than to the target cumulative distribution function (cdf) $F(x) = P(X \le x)$. This problem has received remarkable attention since the seminal paper by Turnbull (1976). Special forms of truncation when sampling time-to-event data are reviewed in Sections 1.2 and 1.3.
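The bias of the ordinary ecdf can be checked with a small simulation. The following Python sketch is only an illustration under assumed distributions that are not taken from the book: the target variable is exponential with rate 1, and the random set is a half-line starting at an independent uniform point on (0, 2). The function name `simulate_truncated_ecdf` is hypothetical.

```python
import math
import random

def simulate_truncated_ecdf(n_population=200_000, x0=1.0, seed=1):
    """Return the ecdf at x0 for the whole population and for the truncated sample."""
    rng = random.Random(seed)
    full, truncated = [], []
    for _ in range(n_population):
        x = rng.expovariate(1.0)   # target variable, Exp(1) (assumption)
        u = rng.uniform(0.0, 2.0)  # random set is [u, infinity) (assumption)
        full.append(x)
        if x >= u:                 # the value is sampled only if it falls in the random set
            truncated.append(x)
    ecdf_full = sum(v <= x0 for v in full) / len(full)
    ecdf_trunc = sum(v <= x0 for v in truncated) / len(truncated)
    return ecdf_full, ecdf_trunc

ecdf_full, ecdf_trunc = simulate_truncated_ecdf()

# Closed-form limits for this particular design.
true_cdf = 1.0 - math.exp(-1.0)                                   # target cdf at 1
cond_cdf = (1.0 - 2.0 * math.exp(-1.0)) / (1.0 - math.exp(-2.0))  # conditional cdf at 1

print(f"ecdf, full population : {ecdf_full:.3f} (target cdf {true_cdf:.3f})")
print(f"ecdf, truncated sample: {ecdf_trunc:.3f} (conditional cdf {cond_cdf:.3f})")
```

In this design small values are rarely sampled, so the ecdf of the truncated sample settles near the conditional cdf, well below the target cdf.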

Time-to-event data are relevant in fields like Survival Analysis and Reliability Engineering, in which random truncation often occurs. Random truncation is found in Astronomy too, where $X$ represents the luminosity of a stellar object that is subject to observation limits. Examples from these areas will be introduced and analysed throughout this book.

1.2 One-sided Truncation

1.2.1 Left-truncation

Left-truncation is a common feature when sampling time-to-event data. A left-truncation time for the target $X$ is defined as a random variable $U$ such that $X$ is observed only when $U \le X$, determining the random set $A = [U, \infty)$ in the previous section.

Left-truncation occurs, for example, with cross-sectional sampling, where the sampled individuals are those who are between the origin and the end point at a certain calendar time, the cross-section date (Wang, 1991). That is, the observer arrives at the process at a given date and is able to observe the time-to-event and the left-truncation time for the individuals 'in progress' at that date. With cross-sectional sampling, the variable $U$ is simply defined as the time from onset to the cross-section date. This sampling procedure is often applied because it entails relatively little effort to reach a pre-specified sample size. In medical research, such a design leads to the sampling of the so-called prevalent cases: patients already diagnosed with a certain disease of interest who survived beyond the cross-section date. Clearly, such a sampling design implies an observational bias, in the sense that individuals with longer survival (the $X$ value) will be observed with a relatively large probability. There exist well-investigated proposals to overcome such a bias, based on the simple idea of taking the observed left-truncation times into account to define suitable risk sets. For this purpose, independence between $X$ and $U$ has traditionally been assumed. This independence assumption states that the time-to-event distribution remains unchanged over time, being unrelated to the date of onset. A classical example of left-truncation is the Channing House data, where the age at death is measured for people living in that retirement centre; in this case, the target variable is left-truncated by the age at entry into the residence (Klein and Moeschberger, 2003).

Another feature leading to left-truncation is delayed entry into the study. This happens when the individuals enter the study only at some random time after onset. For example, diagnosis of a certain disease may not be ascertained until the first visit to the hospital. If the 'end-of-disease' event occurs before the potential date of visit, the time-to-event of such a patient will never be known, with the resulting difficulty in observing relatively small event times. Beyersmann et al. (2012) provide an illustrative example of this issue in the investigation of abortion times.

1.2.2 Right-truncation

In some particular settings, the target variable of ultimate interest is observed only for the individuals who experience the event before a certain calendar time $d$. A typical example of such a situation is the investigation of the incubation (or induction) times for AIDS; see for example Klein and Moeschberger (2003). The incubation time is defined as the time elapsed between the date of HIV infection, $T_0$ say, and the development of AIDS. If $X$ stands for the incubation time and $V = d - T_0$, then the incubation times of individuals developing AIDS prior to $d$ follow the distribution of $X$ conditionally on $X \le V$. Here, $V$ is called the right-truncation time. An immediate effect of right-truncation is that large values of $X$ are sampled with a relatively small probability.

1.2.3 Truncation vs. Censoring

At this point, the reader may be curious about the difference between truncation and censoring. Right-censoring is a well-known phenomenon in Survival Analysis and reliability studies, among other fields. It happens when the follow-up of a given individual stops before the event of interest has taken place. In such a case, the observer only knows that the target variable is larger than the registered value, which is referred to as the censoring time. A sample made up of real and censored values is typically analysed by the Kaplan-Meier estimator (Kaplan and Meier, 1958), which corrects for the fact that some of the recorded values for $X$ are smaller than the true ones. With truncated data, every value in the sample corresponds to a true observation of $X$; however, the distribution of the observed values may be shifted with respect to the true one due to the truncation event. This difference between truncation and censoring suggests that specific methods to estimate the target distribution under random truncation should be employed. Indeed, Woodroofe (1985) provides a deep analysis of one-sided truncation, introducing the original idea of Lynden-Bell (1971) as a nonparametric maximum likelihood estimator (NPMLE) of the probability distribution in that setting. The estimator in Woodroofe (1985) is a particular case of the estimator corresponding to doubly truncated data, on which this book is focused.
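To make the one-sided case concrete, here is a minimal Python sketch of the product-limit form in which the Lynden-Bell estimator for left-truncated data is commonly written: at each observed time, the estimated survival is multiplied by one minus the reciprocal of the size of the risk set (subjects whose truncation time is at most that time and whose observed value is at least that time). The simulated design (a shifted exponential target and a uniform truncation time) and the function name `lynden_bell_cdf` are assumptions made for this illustration. They are not the book's examples or its R code.

```python
import bisect
import math
import random

def lynden_bell_cdf(u, x):
    """Product-limit cdf estimate from left-truncated pairs (u_i, x_i) with u_i <= x_i.

    Returns the sorted observed times and the estimated cdf at each of them.
    Assumes continuous data, that is, no ties among the x_i.
    """
    u_sorted = sorted(u)
    x_sorted = sorted(x)
    surv = 1.0
    cdf = []
    for rank, t in enumerate(x_sorted):
        entered = bisect.bisect_right(u_sorted, t)  # subjects with u_j <= t
        risk = entered - rank                       # remove those with x_j < t
        surv *= 1.0 - 1.0 / risk
        cdf.append(1.0 - surv)
    return x_sorted, cdf

# Simulated left-truncated sample (all distributions are assumptions).
rng = random.Random(7)
u_obs, x_obs = [], []
for _ in range(50_000):
    x = 0.5 + rng.expovariate(1.0)  # target variable, shifted exponential
    u = rng.uniform(0.0, 2.0)       # independent left-truncation time
    if u <= x:                      # the pair is observed only when u <= x
        u_obs.append(u)
        x_obs.append(x)

x0 = 1.5
times, cdf = lynden_bell_cdf(u_obs, x_obs)
lb_at_x0 = cdf[bisect.bisect_right(times, x0) - 1]
naive_at_x0 = sum(v <= x0 for v in x_obs) / len(x_obs)
true_at_x0 = 1.0 - math.exp(-(x0 - 0.5))

print(f"true cdf {true_at_x0:.3f}, product-limit {lb_at_x0:.3f}, naive ecdf {naive_at_x0:.3f}")
```

Unlike the naive ecdf, the product-limit estimate uses the observed truncation times through the risk sets, which is the correction idea described for left-truncated data.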

1.3 Double Truncation

A variable of interest $X$ is said to be doubly truncated by a couple of random variables $(U, V)$ if the observation of $X$ is possible only when $U \le X \le V$ occurs. In such a case, $U$ and $V$ are called left- and right-truncation variables respectively. Double truncation reduces to left-truncation when $V$ degenerates at $\infty$, while it corresponds to right-truncation when $U = -\infty$. This book is focused on the problem of estimating the distribution of $X$, and other related curves, from a set of iid triplets $(U_i, X_i, V_i)$, $1 \le i \le n$, with the distribution of $(U, X, V)$ given $U \le X \le V$.

There are several scenarios where double truncation appears in practice. One setting leading to double truncation is that of interval sampling, where the sample is restricted to the individuals with event between two specific dates $d_0$ and $d_1$ (Zhu and Wang, 2012). Then, the right-truncation time is $V = d_1 - T_0$, where $T_0$ denotes the date of onset for the time-to-event, and the left-truncation time is $U = V - \tau$, where $\tau = d_1 - d_0$ is the interval width. The Childhood Cancer Data in Section 1.4.1 is an example of data obtained through interval sampling.

With interval sampling the variable $V - U$ is degenerate at $\tau$. This occurs in other sampling schemes too, in which $U$ and $V$ are certain subject-specific event dates. An illustrative example is given by the Parkinson's Disease Data, see Section 1.4.5, where $V$ is the individual age at blood sampling. When $V - U$ is constant, the couple $(U, V)$ falls on a line, and its joint density does not exist, even when the truncating variables may be continuous.
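The interval sampling construction can be sketched in a few lines of Python. The dates of the sampling window, the uniform distribution of the onset dates, and the exponential time-to-event are assumptions chosen only for illustration, and `interval_sample` is a hypothetical helper name.

```python
import random

def interval_sample(n_population=50_000, d0=10.0, d1=15.0, seed=2):
    """Simulate interval sampling: keep subjects whose event date lies in [d0, d1]."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_population):
        onset = rng.uniform(0.0, d1)  # calendar date of onset (assumption)
        x = rng.expovariate(0.25)     # time-to-event with mean 4 (assumption)
        u = d0 - onset                # left-truncation time
        v = d1 - onset                # right-truncation time
        if u <= x <= v:               # event date onset + x falls in the window
            sample.append((u, x, v))
    return sample

sample = interval_sample()

# The difference between the truncation times equals the window width for everyone.
widths = {round(v - u, 9) for u, _, v in sample}
print(f"sampled {len(sample)} of 50000 subjects; distinct widths: {widths}")
```

Because both truncation times are functions of the same onset date, every sampled pair lies on one line, which is why no joint density exists in this scheme. Note that the left-truncation time is negative for subjects whose onset falls inside the window.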

In other situations, the truncating variables $U$ and $V$ are not linked through the linear equation $V = U + \tau$. For example, $U$ and $V$ could represent some random observation limits beyond which the variable of interest cannot be sampled or detected. Situations like this occur for example in Astronomy, as illustrated in Section 1.4.4.

With random double truncation, both large and small values of $X$ are observed in principle with a relatively small probability. However, the real observational bias for $X$ varies from application to application, depending on the joint distribution of $(U, V)$. We will see, for example, that the probability of sampling a value $X = x$, namely $G(x) = P(U \le x \le V)$, may be roughly constant, inducing no observational bias; or that it may be roughly decreasing, indicating the dominance of the right-truncation bias relative to the left-truncation bias.
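Both shapes of the sampling probability can be reproduced by Monte Carlo. The Python sketch below assumes, purely for illustration, that the right-truncation time equals the left-truncation time plus a fixed width of 2, and it estimates the probability that a value x lies between the two truncation times under two assumed uniform laws for the left-truncation time. The function name `sampling_probability` is hypothetical.

```python
import random

def sampling_probability(draw_u, width, grid, n=200_000, seed=3):
    """Monte Carlo estimate of P(U <= x <= U + width) for each x in grid."""
    rng = random.Random(seed)
    us = [draw_u(rng) for _ in range(n)]
    return [sum(u <= x <= u + width for u in us) / n for x in grid]

grid = [0.5, 1.0, 1.5]

# Left-truncation time spread over a wide range: the probability is flat (exactly 0.2 here).
g_flat = sampling_probability(lambda rng: rng.uniform(-5.0, 5.0), 2.0, grid)

# Left-truncation time never positive: only the upper limit binds, and the
# probability decreases linearly as 1 - x/2 on [0, 2].
g_decr = sampling_probability(lambda rng: rng.uniform(-2.0, 0.0), 2.0, grid)

for x, a, b in zip(grid, g_flat, g_decr):
    print(f"x = {x:.1f}: flat design {a:.3f}, decreasing design {b:.3f}")
```

In the first design every value in the grid is equally likely to be sampled, so no observational bias arises. In the second, larger values are sampled less often, as happens when right-truncation dominates.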

Another issue of relevance is that of the identifiability of the distribution of $X$. Intuitively it is clear that with doubly truncated data it is only possible to estimate the...
