Christine FROIDEVAUX1, Marie-Laure MARTIN-MAGNIETTE2,3 and Guillem RIGAILL2,4
1Université Paris-Saclay, CNRS, LISN, Orsay, France
2IPS2, Université Paris-Saclay, CNRS, INRAE, Université d'Évry, Université Paris Cité, Gif-sur-Yvette, France
3MIA Paris-Saclay, Université Paris-Saclay, AgroParisTech, INRAE, France
4LaMME, Université Paris-Saclay, CNRS, Université d'Évry, Évry-Courcouronnes, France
The study of biological data has undergone fundamental changes in recent years. First, the volume of these data has increased dramatically with new high-throughput experimental techniques. Second, remarkable advances in computational and statistical analysis methods, as well as in infrastructures, have made processing these large datasets possible. These data should then be integrated, that is, their complementarity exploited, with the prospect of advancing biological knowledge. Using data integration to allow the most exhaustive analysis possible thus represents a major challenge in biology.
This book addresses research in biological data science with a pedagogical approach, focusing first on computational approaches to biological data integration and then on statistical approaches to omics data integration.
Biological knowledge has given rise to new fields of application: beyond integrative and systems biology, it is valuable for health and the environment. In particular, the linking of omics data with knowledge of pathologies and clinical data has led to the emergence of precision medicine, which holds tremendous promise for individual health. However, to achieve it, we need to be able to analyze all the knowledge available in an integrated way.
Life sciences data integration faces several difficulties: in addition to being massive (Big Data), the data are heterogeneous (very varied formats), dispersed (spread across many databases), of various granularities (from genomic data to pathology information) and of highly variable quality (not all databases offer the same guarantees of verification, or curation).
Unlike other application areas, where integration is based on the identification of concepts structured in ontologies over which data matching is performed, biological data integration proceeds by reconciling data using algorithmic, machine learning and statistical approaches. This integration increasingly attempts to put the human being at the center of the process.
A new paradigm has emerged: the procedure no longer consists of two distinct phases, the first gathering data distributed across different databases and integrating them, the second analyzing the integrated data. The two phases are intertwined: integration serves analysis, which in turn is the basis for better integration.
A number of data warehouses have been developed to gather fragmented data related to the same field of biology in an integrated, that is to say structured, coherent and complementary, manner. The constitution of these warehouses is accompanied by data querying methods that make their analysis possible. These data can be annotated with conceptual terms derived from ontologies, which keep track of the deep knowledge associated with them. Ontologies not only allow knowledge to be enriched with annotations but also support reasoning about this knowledge. They are at the heart of the Semantic Web, which aims at a fine-grained representation of data to facilitate their automatic integration and interpretation (Chen et al. 2012).
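To make this concrete, here is a minimal sketch, in Python with the rdflib library, of annotating a gene with an ontology term and querying the annotations in SPARQL. The ex: namespace and the gene identifier are hypothetical illustrations; GO:0006355 is a real Gene Ontology term.

```python
# Minimal sketch: ontology-based annotation and querying with rdflib.
# The ex: namespace and gene identifier are hypothetical; GO:0006355
# ("regulation of DNA-templated transcription") is a real GO term.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/bio/")
GO = Namespace("http://purl.obolibrary.org/obo/GO_")

g = Graph()
g.add((EX.gene42, RDF.type, EX.Gene))
g.add((EX.gene42, EX.annotatedWith, GO["0006355"]))
g.add((GO["0006355"], RDFS.label,
       Literal("regulation of DNA-templated transcription")))

# SPARQL query: which genes carry this annotation?
q = """
PREFIX ex: <http://example.org/bio/>
PREFIX go: <http://purl.obolibrary.org/obo/GO_>
SELECT ?gene WHERE { ?gene ex:annotatedWith go:0006355 . }
"""
for row in g.query(q):
    print(row.gene)   # http://example.org/bio/gene42
```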
Finally, the analyses performed on the data use a multitude of very different tools. A data processing procedure that chains several tools one after another, called a workflow, is becoming a fundamental part of data analysis and is at the heart of the paradigm shift mentioned above. Designing and executing these bioinformatics processing chains are important issues.
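As an illustration, the following is a minimal sketch of such a processing chain in Python, with each step recording its provenance (input, output, checksum, timestamp). The step functions and file names are hypothetical placeholders; a real pipeline would call actual tools through a dedicated workflow engine.

```python
# Minimal sketch of a workflow as an explicit chain of steps, each
# recording its provenance. Step functions and file names are
# hypothetical placeholders standing in for real bioinformatics tools.
import hashlib
import json
import time
from pathlib import Path

def run_step(name, func, infile, outfile, provenance):
    """Run one step and record what ran, on what, producing what."""
    func(infile, outfile)
    provenance.append({
        "step": name,
        "input": str(infile),
        "output": str(outfile),
        "sha256": hashlib.sha256(Path(outfile).read_bytes()).hexdigest(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })

def trim_reads(infile, outfile):   # placeholder for a read trimmer
    Path(outfile).write_bytes(Path(infile).read_bytes())

def align_reads(infile, outfile):  # placeholder for a read aligner
    Path(outfile).write_bytes(Path(infile).read_bytes())

Path("reads.fastq").write_text("@r1\nACGT\n+\nIIII\n")  # toy input
provenance = []
run_step("trim", trim_reads, "reads.fastq", "trimmed.fastq", provenance)
run_step("align", align_reads, "trimmed.fastq", "aligned.sam", provenance)
Path("provenance.json").write_text(json.dumps(provenance, indent=2))
```

Keeping the provenance record alongside the outputs is what makes such a chain reproducible and auditable, which connects directly to the FAIR principles discussed below.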
Chapter 1 introduces data warehouses for the life sciences, focusing on clinical data. Chapter 2 introduces Semantic Web concepts and techniques for omics data integration. Finally, Chapter 3 presents bioinformatics problems and solutions for designing and executing scientific workflows.
These chapters underline the close relationship between good integration and the FAIR (Findable, Accessible, Interoperable, Reusable) data principles and stress the importance of data provenance (Zheng et al. 2015). They also point out the ethical challenges raised by the protection of stored personal data, especially in the health field, in connection with the security of computer systems.
Throughout these chapters, the reader will see how, in terms of data integration, advances in computational research benefit the life sciences, and how wider adoption of computational methods could benefit them even more. Conversely, the life sciences offer a tremendous field of investigation for the development of innovative computational methods.
Omics data integration is a very broad topic: it is very difficult to accurately define its contours. Our vision of omics data integration is quite close to the one presented by Ritchie et al. (2015):
[…] (multi)-omics information integration in a meaningful manner to provide a more complete analysis of a biological point of interest.
This definition emphasizes the objectives of integration. The analysis must make sense, of course, but more importantly it must shed new light on a biological question of interest: in other words, it must do "better" than a non-integrative analysis.
On the biological level, a systemic vision of the functioning of the cell strongly motivates the development of methodologies for integrating omics information. How could we understand the regulation of the cell without studying the numerous molecular interactions that take place within it: DNA-DNA, DNA-RNA, RNA-protein, etc.? Nonetheless, omics data integration is not an easy task. It is not a miraculous solution, and demonstrating that an integrative analysis provides a more complete biological picture than a non-integrative one is not always straightforward. We very briefly mention here some of the statistical difficulties associated with data integration (Ritchie et al. 2015).
One of the first difficulties encountered is certainly data diversity.
As Ritchie et al. (2015) rightly remind us, before integrating data, it is necessary to analyze each dataset separately and validate its quality. To obtain high-quality results from an integrative analysis, high-quality data are necessary.
In genomics, we are often faced with the problem of high dimensionality (Giraud 2014): the number of variables p (genes, proteins, transcripts) is often much larger than the number of observations n (individuals, samples). Integration tends to make the problem worse. For simplicity, let us assume that the same n samples are observed in each dataset d to be integrated and that we measure $p_d$ variables. If we already have $n \ll p_d$ in each dataset, then a fortiori $n \ll \sum_d p_d$ for the integrated data.
One way to mitigate this problem is to reduce the dimension of each dataset. Many techniques exist for this purpose, for example data mining techniques or the use of knowledge bases.
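As a toy illustration, the following Python sketch reduces each omics table separately before concatenating them, keeping the integrated table low-dimensional. The data are simulated, and scikit-learn's PCA stands in for any of the many possible reduction techniques.

```python
# Minimal sketch: per-dataset dimension reduction before integration,
# assuming two omics tables measured on the same n samples.
# Data are simulated; PCA is one reduction technique among many.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 30                                        # samples
transcriptome = rng.normal(size=(n, 2000))    # n << p_1
proteome = rng.normal(size=(n, 500))          # n << p_2

# Naive concatenation: p = p_1 + p_2 variables, worsening n << p.
concatenated = np.hstack([transcriptome, proteome])
print(concatenated.shape)                     # (30, 2500)

# Reducing each block first keeps the integrated table manageable.
reduced = np.hstack([
    PCA(n_components=10).fit_transform(transcriptome),
    PCA(n_components=10).fit_transform(proteome),
])
print(reduced.shape)                          # (30, 20)
```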
The focus is often on the need for multi-omics data integration. This need is undeniable. However, at the statistical level, we should not forget the need for mono-omics integration. A large number of classical analysis tools model biological entities independently (or almost independently). For example, for the study of RNA-seq data, differential analysis is most often used and genes are analyzed almost independently (Robinson et al. 2010; Love et al. 2014). A form of integration already occurs in the estimation of the overdispersion parameter, or in pathway analyses. This integration already raises very important statistical difficulties. However, more should be done in modeling dependencies within a type of omics data (see Chapters 4 and 5, for example).
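To fix ideas, here is a highly simplified Python sketch of this kind of within-omics integration: a gene-wise method-of-moments overdispersion estimate that is then shrunk toward the genome-wide mean. This only illustrates the borrowing-strength idea, not the actual moderated estimators implemented in tools such as edgeR or DESeq2; the counts and the shrinkage weight are simulated and hypothetical.

```python
# Minimal sketch: gene-wise negative binomial overdispersion estimated
# by moments, then shrunk toward the genome-wide mean (a simplified
# stand-in for the moderated estimators of edgeR / DESeq2).
import numpy as np

rng = np.random.default_rng(1)
# Gamma-mixed Poisson counts = negative binomial: 1000 genes x 8 samples.
lam = rng.gamma(shape=2.0, scale=10.0, size=(1000, 1))
counts = rng.poisson(lam=lam, size=(1000, 8))

mu = counts.mean(axis=1)
var = counts.var(axis=1, ddof=1)
# Method of moments: Var = mu + alpha * mu^2  =>  alpha = (Var - mu) / mu^2
alpha = np.clip((var - mu) / np.maximum(mu, 1e-8) ** 2, 0.0, None)

# Borrow strength across genes: pull noisy per-gene estimates
# toward the genome-wide mean (hypothetical fixed weight).
weight = 0.3
alpha_shrunk = weight * alpha.mean() + (1 - weight) * alpha
print(alpha_shrunk[:5])
```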
Clearly, integrating data should make it possible to take advantage of the very large number of datasets already available and perform powerful meta-analyses.