Methodological Developments in Data Linkage

Name: Methodological Developments in Data Linkage
Brand: Wiley
Price: 75.99 EUR
Availability: OnlineOnly

Katie Harron, Harvey Goldstein and Chris Dibben (Authors)

Wiley (Publisher)

Published on 22 September 2015

288 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-119-07248-5 (ISBN)

€75.99 incl. 7% VAT

E-Book Single Licence

Available for download

Description

All about e-books: answers to questions about e-books, copy protection and file formats can be found in our Info & Help section.

A comprehensive compilation of new developments in data linkage methodology

The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those which describe the seminal work from the 1950s and 1960s, with some updates. Linkage and analysis of data across sources remain problematic due to lack of discriminatory and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage.

Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting-edge methodology in this field. It presents opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.

Key Features:

* Presents cutting-edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
* Covers the essential issues associated with data linkage today
* Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
* Novel approach incorporates technical aspects of linkage, management and analysis of linked data

This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.

Content

Foreword

Contributors

1 Introduction
Katie Harron, Harvey Goldstein and Chris Dibben

1.1 Introduction: data linkage as it exists

1.2 Background and issues

1.3 Data linkage methods

1.3.1 Deterministic linkage

1.3.2 Probabilistic linkage

1.3.3 Data preparation

1.4 Linkage error

1.5 Impact of linkage error on analysis of linked data

1.6 Data linkage: the future

2 Probabilistic linkage
William E. Winkler

2.1 Introduction

2.2 Overview of methods

2.2.1 The Fellegi-Sunter model of record linkage

2.2.2 Learning parameters

2.2.3 Additional methods for matching

2.2.4 An empirical example

2.3 Data preparation

2.3.1 Description of a matching project

2.3.2 Initial file preparation

2.3.3 Name standardisation and parsing

2.3.4 Address standardisation and parsing

2.3.5 Summarising comments on preprocessing

2.4 Advanced methods

2.4.1 Estimating false-match rates without training data

2.4.2 Adjusting analyses for linkage error

2.5 Concluding comments

3 The data linkage environment
Chris Dibben, Mark Elliot, Heather Gowans, Darren Lightfoot and Data Linkage Centres

3.1 Introduction

3.2 The data linkage context

3.2.1 Administrative or routine data

3.2.2 The law and the use of administrative (personal) data for research

3.2.3 The identifiability problem in data linkage

3.3 The tools used in the production of functional anonymity through a data linkage environment

3.3.1 Governance, rules and the researcher

3.3.2 Application process, ethics scrutiny and peer review

3.3.3 Shaping 'safe' behaviour: training, sanctions, contracts and licences

3.3.4 'Safe' data analysis environments

3.3.5 Fragmentation: separation of linkage process and temporary linked data

3.4 Models for data access and data linkage

3.4.1 Single centre

3.4.2 Separation of functions: firewalls within single centre

3.4.3 Separation of functions: TTP linkage

3.4.4 Secure multiparty computation

3.5 Four case study data linkage centres

3.5.1 Population Data BC

3.5.2 The Secure Anonymised Information Linkage Databank, United Kingdom

3.5.3 Centre for Data Linkage (Population Health Research Network), Australia

3.5.4 The Centre for Health Record Linkage, Australia

3.6 Conclusion

4 Bias in data linkage studies
Megan Bohensky

4.1 Background

4.2 Description of types of linkage error

4.2.1 Missed matches from missing linkage variables

4.2.2 Missed matches from inconsistent case ascertainment

4.2.3 False matches: Description of cases incorrectly matched

4.3 How linkage error impacts research findings

4.3.1 Results

4.3.2 Assessment of linkage bias

4.4 Discussion

4.4.1 Potential biases in the review process

4.4.2 Recommendations and implications for practice

5 Secondary analysis of linked data
Raymond Chambers and Gunky Kim

5.1 Introduction

5.2 Measurement error issues arising from linkage

5.2.1 Correct links, incorrect links and non-links

5.2.2 Characterising linkage errors

5.2.3 Characterising errors from non-linkage

5.3 Models for different types of linking errors

5.3.1 Linkage errors under binary linking

5.3.2 Linkage errors under multi-linking

5.3.3 Incomplete linking

5.3.4 Modelling the linkage error

5.4 Regression analysis using complete binary-linked data

5.4.1 Linear regression

5.4.2 Logistic regression

5.5 Regression analysis using incomplete binary-linked data

5.5.1 Linear regression using incomplete sample to register linked data

5.6 Regression analysis with multi-linked data

5.6.1 Uncorrelated multi-linking: Complete linkage

5.6.2 Uncorrelated multi-linking: Sample to register linkage

5.6.3 Correlated multi-linkage

5.6.4 Incorporating auxiliary population information

5.7 Conclusion and discussion

6 Record linkage: A missing data problem
Harvey Goldstein and Katie Harron

6.1 Introduction

6.2 Probabilistic Record Linkage (PRL)

6.3 Multiple Imputation (MI)

6.4 Prior-Informed Imputation (PII)

6.4.1 Estimating matching probabilities

6.5 Example 1: Linking electronic healthcare data to estimate trends in bloodstream infection

6.5.1 Methods

6.5.2 Results

6.5.3 Conclusions

6.6 Example 2: Simulated data including non-random linkage error

6.6.1 Methods

6.6.2 Results

6.7 Discussion

6.7.1 Non-random linkage error

6.7.2 Strengths and limitations: Handling linkage error

6.7.3 Implications for data linkers and data users

7 Using graph databases to manage linked data
James M. Farrow

7.1 Summary

7.2 Introduction

7.2.1 Flat approach

7.2.2 Oops, your legacy is showing

7.2.3 Shortcomings

7.3 Graph approach

7.3.1 Overview of graph concepts

7.3.2 Graph queries versus relational queries

7.3.3 Comparison of data in flat database versus graph database

7.3.4 Relaxing the notion of 'truth'

7.3.5 Not a linkage approach per se but a management approach which enables novel linkage approaches

7.3.6 Linkage engine independent

7.3.7 Separates out linkage from cluster identification phase (and clerical review)

7.4 Methodologies

7.4.1 Overview of storage and extraction approach

7.4.2 Overall management of data as collections

7.4.3 Data loading

7.4.4 Identification of equivalence sets and deterministic linkage

7.4.5 Probabilistic linkage

7.4.6 Clerical review

7.4.7 Determining cut-off thresholds

7.4.8 Final cluster extraction

7.4.9 Graph partitioning

7.4.10 Data management/curation

7.4.11 User interface challenges

7.4.12 Final cluster extraction

7.4.13 A typical end-to-end workflow

7.5 Algorithm implementation

7.5.1 Graph traversal

7.5.2 Cluster identification

7.5.3 Partitioning visitor

7.5.4 Encapsulating edge following policies

7.5.5 Graph partitioning

7.5.6 Insertion of review links

7.5.7 How to migrate while preserving current clusters

7.6 New approaches facilitated by graph storage approach

7.6.1 Multiple threshold extraction

7.6.2 Possibility of returning graph to end users

7.6.3 Optimised cluster analysis

7.6.4 Other link types

7.7 Conclusion

8 Large-scale linkage for total populations in official statistics
Owen Abbott, Peter Jones and Martin Ralphs

8.1 Introduction

8.2 Current practice in record linkage for population censuses

8.2.1 Introduction

8.2.2 Case study: the 2011 England and Wales Census assessment of coverage

8.3 Population-level linkage in countries that operate a population register: register-based censuses

8.3.1 Introduction

8.3.2 Case study 1: Finland

8.3.3 Case study 2: The Netherlands Virtual Census

8.3.4 Case study 3: Poland

8.3.5 Case study 4: Germany

8.3.6 Summary

8.4 New challenges in record linkage: the Beyond 2011 Programme

8.4.1 Introduction

8.4.2 Beyond 2011 linking methodology

8.4.3 The anonymisation process in Beyond 2011

8.4.4 Beyond 2011 linkage strategy using pseudonymised data

8.4.5 Linkage quality

8.4.6 Next steps

8.4.7 Conclusion

8.5 Summary

9 Privacy-preserving record linkage
Rainer Schnell

9.1 Introduction

9.2 Chapter outline

9.3 Linking with and without personal identification numbers

9.3.1 Linking using a trusted third party

9.3.2 Linking with encrypted PIDs

9.3.3 Linking with encrypted quasi-identifiers

9.3.4 PPRL in decentralised organisations

9.4 PPRL approaches

9.4.1 Phonetic codes

9.4.2 High-dimensional embeddings

9.4.3 Reference tables

9.4.4 Secure multiparty computations for PPRL

9.4.5 Bloom filter-based PPRL

9.5 PPRL for very large databases: blocking

9.5.1 Blocking for PPRL with Bloom filters

9.5.2 Blocking Bloom filters with MBT

9.5.3 Empirical comparison of blocking techniques for Bloom filters

9.5.4 Current recommendations for linking very large datasets with Bloom filters

9.6 Privacy considerations

9.6.1 Probability of attacks

9.6.2 Kind of attacks

9.6.3 Attacks on Bloom filters

9.7 Hardening Bloom filters

9.7.1 Randomly selected hash values

9.7.2 Random bits

9.7.3 Avoiding padding

9.7.4 Standardising the length of identifiers

9.7.5 Sampling bits for composite Bloom filters

9.7.6 Rehashing

9.7.7 Salting keys with record-specific data

9.7.8 Fake injections

9.7.9 Evaluation of Bloom filter hardening procedures

9.8 Future research

9.9 PPRL research and implementation with national databases

10 Summary
Katie Harron, Chris Dibben and Harvey Goldstein

10.1 Introduction

10.2 Part 1: Data linkage as it exists today

10.3 Part 2: Analysis of linked data

10.3.1 Quality of identifiers

10.3.2 Quality of linkage methods

10.3.3 Quality of evaluation

10.4 Part 3: Data linkage in practice: new developments

10.5 Concluding remarks

References

Index

1 Introduction

Katie Harron¹, Harvey Goldstein²,³ and Chris Dibben⁴

¹ London School of Hygiene and Tropical Medicine, London, UK

² Institute of Child Health, University College London, London, UK

³ Graduate School of Education, University of Bristol, Bristol, UK

⁴ University of Edinburgh, Edinburgh, UK

1.1 Introduction: data linkage as it exists

The increasing availability of large administrative databases for research has led to a dramatic rise in the use of data linkage. The speed and accuracy of linkage have improved considerably over recent decades with developments such as string comparators, coding systems and blocking, yet the methods that still underpin most of the linkage performed today were proposed in the 1950s and 1960s. Linkage and analysis of data across sources remain problematic owing to the lack of identifiers that are both fully accurate and discriminatory, to missing data and to regulatory issues, especially those concerning privacy.

In this context, recent developments in data linkage methodology have concentrated on bias in the analysis of linked data, novel approaches to organising relationships between databases and privacy-preserving linkage. Methodological Developments in Data Linkage brings together a collection of chapters on cutting-edge developments in data linkage methodology, contributed by members of the international data linkage community.

The first section of the book covers the current state of data linkage, methodological issues that are relevant to linkage systems and analyses today and case studies from the United Kingdom, Canada and Australia. In this introduction, we provide a brief background to the development of data linkage methods and introduce common terms. We highlight the most important issues that have emerged in recent years and describe how the remainder of the book attempts to deal with these issues. Chapter 2 summarises the advances in linkage accuracy and speed that have arisen from the traditional probabilistic methods proposed by Fellegi and Sunter. The first section concludes with a description of the data linkage environment as it is today, with case study examples. Chapter 3 describes the opportunities and challenges provided by data linkage, focussing on legal and security aspects and models for data access and linkage.

The middle section of the book focusses on the immediate future of data linkage, in terms of methods that have been developed and tested and can be put into practice today. It concentrates on analysis of linked data and the difficulties associated with linkage uncertainty, highlighting the problems caused by errors that occur in linkage (false matches and missed matches) and the impact that these errors can have on the reliability of results based on linked data. This section of the book discusses two methods for handling linkage error, the first relating to regression analyses and the second to an extension of the standard multiple imputation framework. Chapter 7 presents an alternative data storage solution compared to relational databases that provides significant benefits for linkage.

The final section of the book tackles an aspect of the potential future of data linkage. Ethical considerations relating to data linkage and research based on linked data are a subject of continued debate. Privacy-preserving data linkage attempts to avoid the controversial release of personal identifiers by providing means of linking and performing analysis on encrypted data. This section of the book describes the debate and provides examples.

The establishment of large-scale linkage systems has provided new opportunities for important and innovative research that, until now, have not been possible but that also present unique methodological and organisational challenges. New linkage methods are now emerging that take a different approach to the traditional methods that have underpinned much of the research performed using linked data in recent years, leading to new possibilities in terms of speed, accuracy and transparency of research.

1.2 Background and issues

A statistical definition of data linkage is 'a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record' (Organisation for Economic Co-operation and Development (OECD)). Data linkage has many different synonyms (record linkage, record matching, re-identification, entity heterogeneity, merge/purge) within various fields of application (computer science, marketing, fraud detection, censuses, bibliographic data, insurance data) (Elmagarmid, Ipeirotis and Verykios, 2007).

The term 'record linkage' was first applied to health research in 1946, when Dunn described linkage of vital records from the same individual (birth and death records) and referred to the process as 'assembling the book of life' (Dunn, 1946). Dunn emphasised the importance of such linkage to both the individual and health and other organisations. Since then, data linkage has become increasingly important to the research environment.

The development of computerised data linkage meant that valuable information could be combined efficiently and cost-effectively, avoiding the high cost, time and effort associated with setting up new research studies (Newcombe et al., 1959). This led to a large body of research based on enhanced datasets created through linkage. Internationally, large linkage systems of note are the Western Australia Record Linkage System, which links multiple datasets (over 30) for up to 40 years at a population level, and the Manitoba Population-Based Health Information System (Holman et al., 1999; Roos et al., 1995). In the United Kingdom, several large-scale linkage systems have also been developed, including the Scottish Health Informatics Programme (SHIP), the Secure Anonymised Information Linkage (SAIL) Databank and the Clinical Practice Research Datalink (CPRD). As data linkage becomes a more established part of research relating to health and society, there has been an increasing interest in methodological issues associated with creating and analysing linked datasets (Maggi, 2008).

1.3 Data linkage methods

Data linkage brings together information relating to the same individual that is recorded in different files. A set of linked records is created by comparing records, or parts of records, in different files and applying a set of linkage criteria or rules to determine whether or not records belong to the same individual. These rules utilise the values on 'linking variables' that are common to each file. The aim of linkage is to determine the true match status of each comparison pair: a match if records belong to the same individual and a non-match if records belong to different individuals.

As the true match status is unknown, linkage criteria are used to assign a link status for each comparison pair: a link if records are classified as belonging to the same individual and a non-link if records are classified as belonging to different individuals.

In a perfect linkage, all matches are classified as links, and all non-matches are classified as non-links. If comparison pairs are misclassified (false matches or missed matches), error is introduced. False matches occur when records from different individuals link erroneously; missed matches occur when records from the same individual fail to link.
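The distinction between true match status and assigned link status can be illustrated with a minimal sketch. The function names and the toy list of comparison pairs below are hypothetical, and the two rates are computed under one common convention (missed matches as a share of all true matches, false matches as a share of all links), which is an assumption rather than a definition given in this chapter.

```python
from collections import Counter


def classify_pair(is_match: bool, is_link: bool) -> str:
    """Label one comparison pair by true match status and assigned link status."""
    if is_match and is_link:
        return "true match"
    if is_match and not is_link:
        return "missed match"      # same individual, but the records failed to link
    if not is_match and is_link:
        return "false match"       # different individuals, but the records linked
    return "true non-match"


def linkage_error_summary(pairs):
    """Summarise linkage error for an iterable of (is_match, is_link) pairs."""
    counts = Counter(classify_pair(m, l) for m, l in pairs)
    matches = counts["true match"] + counts["missed match"]
    links = counts["true match"] + counts["false match"]
    return {
        "counts": dict(counts),
        # Assumed convention: share of true matches that were not linked.
        "missed_match_rate": counts["missed match"] / matches if matches else 0.0,
        # Assumed convention: share of links that join different individuals.
        "false_match_rate": counts["false match"] / links if links else 0.0,
    }


# Hypothetical comparison pairs: (true match status, assigned link status).
example_pairs = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]
summary = linkage_error_summary(example_pairs)
print(summary)
```

In practice the true match status is unknown, so a summary of this kind is only possible where a gold-standard or simulated dataset is available.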

1.3.1 Deterministic linkage

In deterministic linkage, a set of predetermined rules is used to classify pairs of records as links and non-links. Typically, deterministic linkage requires exact agreement on a specified set of identifiers or matching variables. For example, two records may be classified as a link if their values of National Insurance number, surname and sex agree exactly. Modifications of strict deterministic linkage include 'stepwise' deterministic linkage, which uses a succession of rules; the 'n-1' deterministic procedure, which allows a link to be made if all but one of a set of identifiers agree; and ad hoc deterministic procedures, which allow partial identifiers to be combined into a pseudo-identifier (Abrahams and Davy, 2002; Maso, Braga and Franceschi, 2001; Mears et al., 2010). For example, a combination of the first letter of surname, month of birth and postcode area (e.g. H01N19) could form the basis for linkage.
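The deterministic rules described above can be sketched in a few lines. This is an illustration only: the record layout, field names (`ni_number`, `surname`, `sex`, `birth_month`, `postcode_area`) and example values are hypothetical, and the treatment of missing values as disagreement is an assumption, not a rule taken from the cited procedures.

```python
def agrees(a: dict, b: dict, key: str) -> bool:
    """Exact agreement on one identifier; missing values count as disagreement (assumption)."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def strict_link(a: dict, b: dict, keys) -> bool:
    """Strict deterministic rule: every identifier must agree exactly."""
    return all(agrees(a, b, k) for k in keys)


def n_minus_1_link(a: dict, b: dict, keys) -> bool:
    """'n-1' rule: a link is made if all but one of the identifiers agree."""
    return sum(agrees(a, b, k) for k in keys) >= len(keys) - 1


def stepwise_link(a: dict, b: dict, rule_sets):
    """Stepwise rule: try a succession of key sets; return the first step that links, else None."""
    for step, keys in enumerate(rule_sets, start=1):
        if strict_link(a, b, keys):
            return step
    return None


def pseudo_identifier(record: dict) -> str:
    """Ad hoc pseudo-identifier: first letter of surname + month of birth + postcode area."""
    return (
        f"{record['surname'][0].upper()}"
        f"{record['birth_month']:02d}"
        f"{record['postcode_area'].upper()}"
    )


KEYS = ("ni_number", "surname", "sex")

# Hypothetical records for the same person; the second has a misspelt surname.
rec_a = {"ni_number": "QQ123456C", "surname": "Harris", "sex": "F",
         "birth_month": 1, "postcode_area": "N19"}
rec_b = {"ni_number": "QQ123456C", "surname": "Haris", "sex": "F",
         "birth_month": 1, "postcode_area": "N19"}

print(strict_link(rec_a, rec_b, KEYS))                              # False
print(n_minus_1_link(rec_a, rec_b, KEYS))                           # True
print(stepwise_link(rec_a, rec_b, [KEYS, ("ni_number", "sex")]))    # 2
print(pseudo_identifier(rec_a))                                     # H01N19
```

The misspelt surname defeats the strict rule but not the relaxed ones, which shows how such modifications recover matches that exact agreement would miss, at the cost of a greater chance of linking records from different individuals.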

Strict deterministic methods that require identifiers to match exactly often have a high rate of missed matches, as any recording errors or missing values can prevent identifiers from agreeing. Conversely, the rate of false matches is typically low, as the majority of linked pairs are true matches (records are unlikely to agree exactly on a set of identifiers by chance) (Grannis, Overhage and McDonald, 2002). Deterministic linkage is a relatively straightforward and quick linkage method and is useful when records have highly discriminative or unique identifiers that are well completed and accurate. For example, the community health index (CHI) is used for much of the linkage in the Scottish Record Linkage System.

1.3.2 Probabilistic linkage

Newcombe was the first to propose that comparison pairs could be classified using a probabilistic approach (Newcombe et al.,...

Schweitzer Fachinformationen