Biological Data Integration

Computer and Statistical Approaches

Christine Froidevaux, Marie-Laure Martin-Magniette, Guillem Rigaill (Editors)

Wiley (Publisher)

1st Edition

Published on 7 December 2023

288 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-394-25730-0 (ISBN)

€130.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface
Christine FROIDEVAUX, Marie-Laure MARTIN-MAGNIETTE and Guillem RIGAILL

Part 1 Knowledge Integration

Chapter 1 Clinical Data Warehouses
Maxime WACK and Bastien RANCE

1.1 Introduction to clinical information systems and biomedical warehousing: data warehouses for what purposes?

1.1.1 Warehouse history

1.1.2 Using data warehouses today

1.2 Challenge: widely scattered data

1.3 Data warehouses and clinical data

1.3.1 Warehouse structures

1.3.2 Warehouse construction and supply

1.3.3 Uses

1.4 Warehouses and omics data: challenges

1.4.1 Challenges of data volumetry and structuring omic data

1.4.2 Attempted solutions

1.5 Challenges and prospects

1.5.1 Toward general-purpose warehouses

1.5.2 Ethical dimension of the implementation and the use of warehouses

1.5.3 Origin and reproducibility

1.5.4 Data quality

1.5.5 Data warehousing federation and data sharing

1.6 References

Chapter 2 Semantic Web Methods for Data Integration in Life Sciences
Olivier DAMERON

2.1 Data-related requirements in life sciences

2.1.1 Databases for the life sciences

2.1.2 Requirements

2.1.3 Common approaches: InterMine and BioMart

2.2 Semantic Web

2.2.1 Techniques

2.2.2 Implementation

2.3 Perspectives

2.3.1 Facilitating appropriation to users

2.3.2 Facilitating the appropriation by software programs: FAIR data

2.3.3 Federated queries

2.4 Conclusion

2.5 References

Chapter 3 Workflows for Bioinformatics Data Integration
Sarah COHEN-BOULAKIA and Frédéric LEMOINE

3.1 Introduction

3.2 Bioinformatics data processing chains: difficulties

3.2.1 Designing a data processing chain

3.2.2 Analysis execution and reproducibility

3.2.3 Maintenance, sharing and reuse

3.3 Solutions provided by scientific workflow systems

3.3.1 Fundamentals of workflow systems

3.3.2 Workflow systems

3.4 Use case: RNA-seq data analysis

3.4.1 Study description

3.4.2 From data processing chain to workflows

3.4.3 Data processing chains implemented as workflows: conclusion

3.5 Challenges, open problems and research opportunities

3.5.1 Formalizing workflow development

3.5.2 Workflow testing

3.5.3 Discovering and sharing workflows

3.6 Conclusion

3.7 References

Part 2 Integration and Statistics

Chapter 4 Variable Selection in the General Linear Model: Application to Multiomic Approaches for the Study of Seed Quality
Céline LÉVY-LEDUC, Marie PERROT-DOCKÈS, Gwendal CUEFF and Loïc RAJJOU

4.1 Introduction

4.2 Methodology

4.2.1 Estimation of the covariance matrix Σq

4.2.2 Estimation of B

4.3 Numerical experiments

4.3.1 Statistical performance

4.3.2 Numerical performance

4.4 Application to the study of seed quality

4.4.1 Metabolomics data

4.4.2 Proteomics data

4.5 Conclusion

4.6 Appendices

4.6.1 Example of using the package MultiVarSel for metabolomic data analysis

4.6.2 Example of using the package MultiVarSel for proteomic data analysis

4.7 Acknowledgments

4.8 References

Chapter 5 Structured Compression of Genetic Information and Genome-Wide Association Study by Additive Models
Florent GUINOT, Marie SZAFRANSKI and Christophe AMBROISE

5.1 Genome-wide association studies

5.1.1 Introduction to genetic mapping and linkage analysis

5.1.2 Principles of genome-wide association studies

5.1.3 Single nucleotide polymorphism

5.1.4 Disease penetrance and odds ratio

5.1.5 Single marker analysis

5.1.6 Multi-marker analysis

5.2 Structured compression and association study

5.2.1 Context

5.2.2 New structured compression approach

5.3 Application to ankylosing spondylitis (AS)

5.3.1 Data

5.3.2 Predictive power evaluation

5.3.3 Manhattan diagram

5.3.4 Estimation for the most significant SNP aggregates

5.4 Conclusion

5.5 References

Chapter 6 Kernels for Omics
Jérôme MARIETTE and Nathalie VIALANEIX

6.1 Introduction

6.2 Relational data

6.2.1 Data described by the kernel

6.2.2 Data described by a general (dis)similarity measure

6.3 Exploratory analysis for relational data

6.3.1 Kernel clustering

6.3.2 Kernel principal component analysis

6.3.3 Kernel self-organizing maps

6.3.4 Limitations of relational methods

6.4 Combining relational data

6.4.1 Data integration in systems biology

6.4.2 Kernel approaches in data integration

6.4.3 A consensual kernel

6.4.4 A parsimonious kernel that preserves the topology of the initial data

6.4.5 A complete kernel preserving the topology of the initial data

6.5 Application

6.5.1 Loading Tara Ocean data

6.5.2 Data integration by kernel approaches

6.5.3 Exploratory analysis: kernel PCA

6.6 Session information for the results of the example

6.7 References

Chapter 7 Multivariate Models for Data Integration and Biomarker Selection in 'Omics Data
Sébastien DÉJEAN and Kim-Anh LÊ CAO

7.1 Introduction

7.2 Background

7.2.1 Mathematical notations

7.2.2 Terminology

7.2.3 Multivariate projection-based approaches

7.2.4 A criterion to maximize specific to each methodology

7.2.5 A linear combination of variables to reduce the dimension of the data

7.2.6 Identifying a subset of relevant molecular features

7.2.7 Summary

7.3 From the biological question to the statistical analysis

7.3.1 Exploration of one dataset: PCA

7.3.2 Classify samples: projection to latent structure discriminant analysis

7.3.3 Integration of two datasets: projection to latent structure and related methods

7.3.4 Integration of several datasets: multi-block approaches

7.4 Graphical outputs

7.4.1 Individual plots

7.4.2 Variable plots

7.5 Overall summary

7.6 Liver toxicity study

7.6.1 The datasets

7.6.2 Biological questions and statistical methods

7.6.3 Single dataset analysis

7.6.4 Integrative analysis

7.7 Conclusion

7.8 Acknowledgments

7.9 Appendix: reproducible R code

7.9.1 Toy examples

7.9.2 Liver toxicity

7.10 References

List of Authors

Index

Preface

Christine FROIDEVAUX¹, Marie-Laure MARTIN-MAGNIETTE²,³ and Guillem RIGAILL²,⁴

¹ Université Paris-Saclay, CNRS, LISN, Orsay, France

² IPS2, Université Paris-Saclay, CNRS, INRAE, Université d'Évry, Université Paris Cité, Gif-sur-Yvette, France

³ MIA Paris-Saclay, Université Paris-Saclay, AgroParisTech, INRAE, France

⁴ LaMME, Université Paris-Saclay, CNRS, Université d'Évry, Évry-Courcouronnes, France

P.1. Introduction

The study of biological data has undergone fundamental changes in recent years. First, the volume of these data has increased dramatically owing to new high-throughput experimental techniques. Second, remarkable advances in computational and statistical analysis methods, and in infrastructure, have made it possible to process these large datasets. These data should then be integrated, that is, their complementarity should be exploited with the prospect of advancing biological knowledge. Using data integration to allow the most exhaustive analysis possible thus represents a major challenge in biology.

This book presents research in biological data science with a pedagogical approach, focusing first on computational approaches to biological data integration and then on statistical approaches to omics data integration.

P.2. Computer-based approaches to biological data integration

P.2.1. Challenges of biological knowledge integration

Biological knowledge has given rise to new fields of application: beyond integrative and systems biology, it is valuable for health and the environment. In particular, the linking of omics data with knowledge of pathologies and clinical data has led to the emergence of precision medicine, which holds tremendous promise for individual health. To deliver on this promise, however, we need to be able to analyze all the available knowledge in an integrated way.

The integration of life sciences data faces several difficulties: the data are massive (Big Data), heterogeneous (in widely varying formats), dispersed (spread across many databases), of varying granularity (from genomic data to pathology information) and of very variable quality (not all databases offer the same guarantees of verification, or curation).

Unlike other application areas where the integration process is based on the identification of concepts structured in ontologies and on which data matching is performed, biological data integration proceeds by reconciling data using algorithmic, learning and statistical approaches. This integration increasingly attempts to put the human being at the center of the process.

P.2.2. Computer-based solutions

A new paradigm has emerged: the procedure no longer consists of two distinct phases, a first phase that gathers data distributed across different databases and integrates them, followed by a second phase that analyzes the integrated data. The two phases are now intertwined: integration is used for analysis, which in turn is the basis for better integration.

A number of data warehouses have been developed to gather fragmented data related to the same field of biology in an integrated, that is, structured, coherent and complementary, manner. These warehouses come with data querying methods that make analysis of the data possible. The data can be annotated with conceptual terms derived from ontologies, which make it possible to keep track of the deep knowledge associated with them. Ontologies make it possible not only to enrich knowledge with annotations but also to reason about this knowledge. They are at the heart of the Semantic Web, which aims at a fine-grained representation of data to facilitate their automatic integration and interpretation (Chen et al. 2012).
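
As a rough illustration of how ontology annotations support reasoning, the following Python sketch stores a small class hierarchy and a few annotations, then retrieves every record annotated with a term or with any more specific term. The term names, record identifiers and function names are all hypothetical, and a real Semantic Web setup would rely on RDF and SPARQL rather than plain dictionaries.

```python
# Hypothetical mini-ontology: each term maps to its direct parent terms.
SUBCLASS_OF = {
    "protein_kinase": ["enzyme"],
    "phosphatase": ["enzyme"],
    "enzyme": ["protein"],
    "protein": [],
}

# Hypothetical annotations: record identifier -> ontology term.
ANNOTATIONS = {
    "rec1": "protein_kinase",
    "rec2": "phosphatase",
    "rec3": "protein",
}


def ancestors(term, hierarchy):
    """Return the set of all terms reachable from `term` through parent links."""
    seen = set()
    stack = list(hierarchy.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hierarchy.get(current, []))
    return seen


def records_for(term, annotations, hierarchy):
    """Records annotated with `term` itself or with any more specific term."""
    return sorted(
        rec for rec, t in annotations.items()
        if t == term or term in ancestors(t, hierarchy)
    )


print(records_for("enzyme", ANNOTATIONS, SUBCLASS_OF))
```

Querying for "enzyme" returns the records annotated as kinase or phosphatase even though neither carries the term "enzyme" directly, which is the kind of inference that plain keyword matching cannot provide.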

Finally, the analyses performed on the data use a multitude of very different tools. A data processing procedure that applies several tools one after another, called a workflow, is becoming a fundamental part of data analysis and is at the heart of the paradigm shift described at the start of this section. Designing and executing these bioinformatics data processing chains are important issues.
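
The idea of a workflow as a chain of tools can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the two steps (`trim`, `count`), the `fingerprint` helper and the `run_workflow` function are hypothetical, and actual workflow systems add scheduling, caching and environment management that are not shown here. The sketch also records a digest of each step's input and output as a very simple form of provenance.

```python
import hashlib
import json


def trim(reads):
    """Hypothetical step: drop reads shorter than 4 characters."""
    return [r for r in reads if len(r) >= 4]


def count(reads):
    """Hypothetical step: count occurrences of each read."""
    counts = {}
    for r in reads:
        counts[r] = counts.get(r, 0) + 1
    return counts


def fingerprint(obj):
    """Short, stable digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(data, steps):
    """Apply each step in order and log input/output digests (provenance)."""
    provenance = []
    for step in steps:
        result = step(data)
        provenance.append({
            "step": step.__name__,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, provenance


result, log = run_workflow(["ACGT", "AC", "ACGT", "TTGA"], [trim, count])
print(result)
for entry in log:
    print(entry)
```

Because each step's output digest matches the next step's input digest, the log makes it possible to check after the fact which intermediate data a given result was derived from.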

P.2.3. Presentation of the first part

Chapter 1 introduces data warehouses for the life sciences, focusing on clinical data. Chapter 2 introduces Semantic Web concepts and techniques for omics data integration. Finally, Chapter 3 presents bioinformatics problems and solutions for designing and executing scientific workflows.

These chapters underline the close relationships between good integration and the FAIR (Findable, Accessible, Interoperable, Reusable) data principles and stress the importance of data provenance (Zheng et al. 2015). They point out the ethical challenges implied by the protection of stored personal data, especially in the health field, in connection with the security of computer systems.

Throughout these chapters, the reader will be able to see how, in terms of data integration, advances in computational research benefit the life sciences, and how wider adoption of computational methods could benefit them even more. Conversely, the life sciences offer a tremendous field of investigation for the development of innovative computational methods.

P.3. Statistical approaches to omics data integration

P.3.1. Integration statistical challenges

Omics data integration is a very broad topic: it is very difficult to accurately define its contours. Our vision of omics data integration is quite close to the one presented by Ritchie et al. (2015):

[...] (multi)-omics information integration in a meaningful manner to provide a more complete analysis of a biological point of interest.

This definition emphasizes the objectives of integration. The analysis must make sense, of course, but more importantly it must shed new light on a biological question of interest: in other words, it must do "better" than a non-integrative analysis.

On the biological level, a systemic vision of the functioning of the cell fully motivates the development of methodologies for integrating omic information. How could we understand the regulations of the cell without studying the numerous molecular interactions that take place in it: DNA-DNA, DNA-RNA, RNA-protein, etc.? Nonetheless, omics data integration is not an easy task. It is not a miracle solution, and demonstrating that an integrative analysis provides a more complete biological picture than a non-integrative analysis is not always straightforward. We mention here very briefly some of the statistical difficulties associated with data integration (Ritchie et al. 2015).

P.3.1.1. Heterogeneous and complex data

One of the first difficulties encountered is certainly data diversity. For example:

data with very different formats have to be integrated: graphs, matrices, signals, etc.;
data corresponding to a wide range of molecular scales have to be integrated, for example, transcriptomic and proteomic data;
unbalanced datasets where some samples are not present in all the datasets have to be integrated.
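
The last point, unbalanced datasets, can be made concrete with a minimal Python sketch that keeps only the samples observed in every dataset before concatenating their measurements. The dataset contents and the function names `shared_samples` and `align` are hypothetical; restricting the analysis to complete samples is only the simplest possible strategy, shown here as an assumption, and it discards information that other approaches try to retain.

```python
# Hypothetical datasets: sample identifier -> measurements.
transcriptome = {"s1": [5.1, 0.3], "s2": [4.8, 0.9], "s3": [6.0, 0.1]}
proteome = {"s2": [1.2], "s3": [0.7], "s4": [2.5]}


def shared_samples(*datasets):
    """Sorted identifiers of the samples present in every dataset."""
    common = set(datasets[0])
    for dataset in datasets[1:]:
        common &= set(dataset)
    return sorted(common)


def align(*datasets):
    """Concatenate the measurements of each shared sample across datasets."""
    return {
        sample: [value for dataset in datasets for value in dataset[sample]]
        for sample in shared_samples(*datasets)
    }


print(align(transcriptome, proteome))
```

Here samples s1 and s4 are dropped because each appears in only one of the two datasets, so half of the measured samples do not reach the integrative analysis.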

P.3.1.2. Quality data

As Ritchie et al. (2015) rightly point out, before integrating data, it is necessary to analyze each dataset separately and validate its quality. To obtain high-quality results from an integrative analysis, high-quality data are necessary.

P.3.1.3. High-dimensional data

In genomics, we are often faced with the problem of high dimensionality (Giraud 2014): the number of variables $p$ (genes, proteins, transcripts) is often much larger than the number of observations $n$ (individuals, samples). Integration tends to make the problem worse. For simplicity, let us assume that in each dataset $d$ to be integrated, the same $n$ samples are observed and that we measure $p_d$ variables. If in each dataset we already have $n \ll p_d$, then a fortiori $n \ll \sum_d p_d$.

One way to mitigate this problem is to reduce the dimension of each dataset. Many techniques exist for this purpose, for example, data mining techniques or the use of knowledge bases.
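
As one possible illustration of reducing each dataset before integration, the Python sketch below keeps the most variable features of each dataset and then merges the reduced datasets. Variance filtering is an assumption chosen here for its simplicity, not a technique named in the text, and the data, the function names `top_variable` and `integrate` and the choice of `k` are hypothetical.

```python
from statistics import pvariance

# Hypothetical datasets: variable name -> values over the same n = 4 samples.
genes = {
    "g1": [1.0, 1.0, 1.0, 1.0],
    "g2": [0.0, 4.0, 0.0, 4.0],
    "g3": [1.0, 2.0, 1.0, 2.0],
}
proteins = {
    "p1": [3.0, 3.1, 3.0, 3.1],
    "p2": [0.0, 2.0, 4.0, 6.0],
}


def top_variable(dataset, k):
    """Keep the k variables with the largest variance across samples."""
    ranked = sorted(dataset, key=lambda name: pvariance(dataset[name]), reverse=True)
    return {name: dataset[name] for name in ranked[:k]}


def integrate(datasets, k):
    """Reduce each dataset to k variables, then merge them into one table."""
    merged = {}
    for dataset in datasets:
        merged.update(top_variable(dataset, k))
    return merged


reduced = integrate([genes, proteins], k=1)
print(sorted(reduced))
```

With k = 1, the five original variables are reduced to two (g2 and p2), so the total number of variables grows with the number of datasets rather than with the sum of their original dimensions.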

P.3.2. Omic or multiomic knowledge integration and acquisition

The focus is often on the need for multi-omics data integration. This need is undeniable. However, at the statistical level, we should not forget the need for mono-omic integration. A large number of classical analysis tools model biological entities independently (or almost independently). For example, for the study of RNA-seq data, differential analysis is most often used and genes are analyzed almost independently (Robinson et al. 2010; Love et al. 2014). Some integration does take place, in the estimation of the overdispersion parameter or in pathway analyses, and it already raises very important statistical difficulties. However, more should be done in modeling dependencies within a type of omics data (see Chapters 4 and 5, for example).

Clearly, integrating data should make it possible to take advantage of the very large number of datasets already available and perform powerful meta-analyses. A more...

Schweitzer Fachinformationen
