Christine FROIDEVAUX1, Marie-Laure MARTIN-MAGNIETTE2,3 and Guillem RIGAILL2,4
1Université Paris-Saclay, CNRS, LISN, Orsay, France
2IPS2, Université Paris-Saclay, CNRS, INRAE, Université d'Évry, Université Paris Cité, Gif-sur-Yvette, France
3MIA Paris-Saclay, Université Paris-Saclay, AgroParisTech, INRAE, France
4LaMME, Université Paris-Saclay, CNRS, Université d'Évry, Évry-Courcouronnes, France
The study of biological data has undergone fundamental changes in recent years. First, the volume of these data has increased dramatically with new high-throughput experimental techniques. Second, remarkable advances in computational and statistical analysis methods, as well as in infrastructures, have made processing these large datasets possible. These data should then be integrated, that is, their complementarity exploited, with the prospect of advancing biological knowledge. Using data integration to allow the most exhaustive analysis possible thus represents a major challenge in biology.
This book addresses research in biological data science with a pedagogical approach, focusing first on computational approaches to biological data integration and then on statistical approaches to omics data integration.
Biological knowledge has given rise to new fields of application: beyond integrative and systems biology, it is valuable for health and the environment. In particular, the linking of omics data with knowledge of pathologies and clinical data has led to the emergence of precision medicine, which holds tremendous promise for individual health. However, to achieve it, we need to be able to analyze all the knowledge available in an integrated way.
Life sciences data integration faces several difficulties: in addition to being massive (Big Data), the data are heterogeneous (very varied formats), dispersed (spread across many databases), of various granularities (from genomic data to pathology information) and of highly variable quality (not all databases offer the same guarantees of verification, or curation).
Unlike other application areas, where integration is based on the identification of concepts structured in ontologies over which data matching is performed, biological data integration proceeds by reconciling data using algorithmic, machine learning and statistical approaches. This integration increasingly attempts to put the human being at the center of the process.
A new paradigm has emerged: the procedure no longer consists of two distinct phases, the first gathering data distributed across different databases and integrating them, the second analyzing the integrated data. The two phases are intertwined: integration serves analysis, which in turn is the basis for better integration.
A number of data warehouses have been developed to gather fragmented data related to the same field of biology in an integrated, that is to say structured, coherent and complementary, manner. The constitution of these warehouses is accompanied by data querying methods that make their analysis possible. These data can be annotated with conceptual terms derived from ontologies, which keep track of the deep knowledge associated with them. Ontologies not only allow knowledge to be enriched with annotations but also support reasoning about this knowledge. They are at the heart of the Semantic Web, which aims at a fine-grained representation of data to facilitate their automatic integration and interpretation (Chen et al. 2012).
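To make this concrete, here is a minimal sketch, in Python with the rdflib library, of annotating a gene with an ontology term and querying the annotations in SPARQL. The ex: namespace and the gene identifier are hypothetical illustrations; GO:0006355 is a real Gene Ontology term.

```python
# Minimal sketch: ontology-based annotation and querying with rdflib.
# The ex: namespace and gene identifier are hypothetical; GO:0006355
# ("regulation of DNA-templated transcription") is a real GO term.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/bio/")
GO = Namespace("http://purl.obolibrary.org/obo/GO_")

g = Graph()
g.add((EX.gene42, RDF.type, EX.Gene))
g.add((EX.gene42, EX.annotatedWith, GO["0006355"]))
g.add((GO["0006355"], RDFS.label,
       Literal("regulation of DNA-templated transcription")))

# SPARQL query: which genes carry this annotation?
q = """
PREFIX ex: <http://example.org/bio/>
PREFIX go: <http://purl.obolibrary.org/obo/GO_>
SELECT ?gene WHERE { ?gene ex:annotatedWith go:0006355 . }
"""
for row in g.query(q):
    print(row.gene)   # http://example.org/bio/gene42
```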
Finally, the analyses performed on the data use a multitude of very different tools. A data processing procedure that chains several tools one after another, called a workflow, is becoming a fundamental part of data analysis and is at the heart of the paradigm shift mentioned above. Designing and executing these bioinformatics processing chains are important issues.
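As an illustration, the following is a minimal sketch of such a processing chain in Python, with each step recording its provenance (input, output, checksum, timestamp). The step functions and file names are hypothetical placeholders; a real pipeline would call actual tools through a dedicated workflow engine.

```python
# Minimal sketch of a workflow as an explicit chain of steps, each
# recording its provenance. Step functions and file names are
# hypothetical placeholders standing in for real bioinformatics tools.
import hashlib
import json
import time
from pathlib import Path

def run_step(name, func, infile, outfile, provenance):
    """Run one step and record what ran, on what, producing what."""
    func(infile, outfile)
    provenance.append({
        "step": name,
        "input": str(infile),
        "output": str(outfile),
        "sha256": hashlib.sha256(Path(outfile).read_bytes()).hexdigest(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })

def trim_reads(infile, outfile):   # placeholder for a read trimmer
    Path(outfile).write_bytes(Path(infile).read_bytes())

def align_reads(infile, outfile):  # placeholder for a read aligner
    Path(outfile).write_bytes(Path(infile).read_bytes())

Path("reads.fastq").write_text("@r1\nACGT\n+\nIIII\n")  # toy input
provenance = []
run_step("trim", trim_reads, "reads.fastq", "trimmed.fastq", provenance)
run_step("align", align_reads, "trimmed.fastq", "aligned.sam", provenance)
Path("provenance.json").write_text(json.dumps(provenance, indent=2))
```

Keeping the provenance record alongside the outputs is what makes such a chain reproducible and auditable, which connects directly to the FAIR principles discussed below.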
Chapter 1 introduces data warehouses for the life sciences, focusing on clinical data. Chapter 2 introduces Semantic Web concepts and techniques for omics data integration. Finally, Chapter 3 presents bioinformatics problems and solutions for designing and executing scientific workflows.
These chapters underline the close relationship between good integration and the FAIR (Findable, Accessible, Interoperable, Reusable) data principles and stress the importance of data provenance (Zheng et al. 2015). They also point out the ethical challenges raised by the protection of stored personal data, especially in the health field, in connection with the security of computer systems.
Throughout these chapters, the reader will see how, in terms of data integration, advances in computational research benefit the life sciences, and how wider adoption of computational methods could benefit them even more. Conversely, the life sciences offer a tremendous field of investigation for the development of innovative computational methods.
Omics data integration is a very broad topic: it is very difficult to accurately define its contours. Our vision of omics data integration is quite close to the one presented by Ritchie et al. (2015):
[…] (multi)-omics information integration in a meaningful manner to provide a more complete analysis of a biological point of interest.
This definition emphasizes the objectives of integration. The analysis must make sense, of course, but more importantly it must shed new light on a biological question of interest: in other words, it must do "better" than a non-integrative analysis.
On the biological level, a systemic vision of the functioning of the cell strongly motivates the development of methodologies for integrating omics information. How could we understand the regulation of the cell without studying the numerous molecular interactions that take place within it: DNA-DNA, DNA-RNA, RNA-protein, etc.? Nonetheless, omics data integration is not an easy task. It is not a miraculous solution, and demonstrating that an integrative analysis provides a more complete biological picture than a non-integrative one is not always straightforward. We very briefly mention here some of the statistical difficulties associated with data integration (Ritchie et al. 2015).
One of the first difficulties encountered is certainly data diversity.
As Ritchie et al. (2015) rightly remind us, before integrating data, it is necessary to analyze each dataset separately and validate its quality. To obtain high-quality results from an integrative analysis, high-quality data are necessary.
In genomics, we are often faced with the problem of high dimensionality (Giraud 2014): the number of variables p (genes, proteins, transcripts) is often much larger than the number of observations n (individuals, samples). Integration tends to make the problem worse. For simplicity, let us assume that the same n samples are observed in each dataset d to be integrated and that we measure $p_d$ variables. If we already have $n \ll p_d$ in each dataset, then a fortiori $n \ll \sum_d p_d$ for the integrated data.
One way to mitigate this problem is to reduce the dimension of each dataset. Many techniques exist for this purpose, for example data mining techniques or the use of knowledge bases.
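As a toy illustration, the following Python sketch reduces each omics table separately before concatenating them, keeping the integrated table low-dimensional. The data are simulated, and scikit-learn's PCA stands in for any of the many possible reduction techniques.

```python
# Minimal sketch: per-dataset dimension reduction before integration,
# assuming two omics tables measured on the same n samples.
# Data are simulated; PCA is one reduction technique among many.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 30                                        # samples
transcriptome = rng.normal(size=(n, 2000))    # n << p_1
proteome = rng.normal(size=(n, 500))          # n << p_2

# Naive concatenation: p = p_1 + p_2 variables, worsening n << p.
concatenated = np.hstack([transcriptome, proteome])
print(concatenated.shape)                     # (30, 2500)

# Reducing each block first keeps the integrated table manageable.
reduced = np.hstack([
    PCA(n_components=10).fit_transform(transcriptome),
    PCA(n_components=10).fit_transform(proteome),
])
print(reduced.shape)                          # (30, 20)
```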
The focus is often on the need for multi-omics data integration. This need is undeniable. However, at the statistical level, we should not forget the need for mono-omics integration. A large number of classical analysis tools model biological entities independently (or almost independently). For example, for the study of RNA-seq data, differential analysis is most often used and genes are analyzed almost independently (Robinson et al. 2010; Love et al. 2014). A form of integration already occurs in the estimation of the overdispersion parameter, or in pathway analyses. This integration already raises very important statistical difficulties. However, more should be done in modeling dependencies within a type of omics data (see Chapters 4 and 5, for example).
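To fix ideas, here is a highly simplified Python sketch of this kind of within-omics integration: a gene-wise method-of-moments overdispersion estimate that is then shrunk toward the genome-wide mean. This only illustrates the borrowing-strength idea, not the actual moderated estimators implemented in tools such as edgeR or DESeq2; the counts and the shrinkage weight are simulated and hypothetical.

```python
# Minimal sketch: gene-wise negative binomial overdispersion estimated
# by moments, then shrunk toward the genome-wide mean (a simplified
# stand-in for the moderated estimators of edgeR / DESeq2).
import numpy as np

rng = np.random.default_rng(1)
# Gamma-mixed Poisson counts = negative binomial: 1000 genes x 8 samples.
lam = rng.gamma(shape=2.0, scale=10.0, size=(1000, 1))
counts = rng.poisson(lam=lam, size=(1000, 8))

mu = counts.mean(axis=1)
var = counts.var(axis=1, ddof=1)
# Method of moments: Var = mu + alpha * mu^2  =>  alpha = (Var - mu) / mu^2
alpha = np.clip((var - mu) / np.maximum(mu, 1e-8) ** 2, 0.0, None)

# Borrow strength across genes: pull noisy per-gene estimates
# toward the genome-wide mean (hypothetical fixed weight).
weight = 0.3
alpha_shrunk = weight * alpha.mean() + (1 - weight) * alpha
print(alpha_shrunk[:5])
```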
Clearly, integrating data should make it possible to take advantage of the very large number of datasets already available and perform powerful meta-analyses.