
From Sequences to Graphs
Description
Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.
More details
Other editions
Additional editions


Persons
Annie Chateau is a lecturer at the University of Montpellier, France. Her research interests include algorithms and combinatorial structures.
Mikaël Salson is a lecturer at the University of Lille, France. His work focuses mainly on indexing and sequence comparison.
Content
Preface xi
Author Biographies xvii
Chapter 1 Methodological Concepts: Algorithmic Solutions of Bioinformatics Problems 1
Annie CHATEAU and Tom DAVOT-GRANGÉ
1.1 Data, Models, Problem Formalism in Bioinformatics 1
1.1.1 Data 1
1.1.2 Genome Modeling 4
1.1.3 Problems in Bioinformatics 5
1.2 Mathematical Preliminaries 6
1.2.1 Propositional Logic Preliminaries 6
1.2.2 Preliminaries on Sets 7
1.3 Vocabulary in Text Algorithmics 9
1.4 Graph Theory 10
1.4.1 Subgraphs 12
1.4.2 Path in a Graph 13
1.4.3 Matching 13
1.4.4 Planarity 14
1.4.5 Tree Decomposition 15
1.5 Algorithmic Problems 16
1.5.1 Definition 16
1.5.2 Graph Problem 17
1.5.3 Satisfiability Problems 19
1.6 Problem Solutions 20
1.6.1 Algorithm 20
1.6.2 Complexity 21
1.6.3 Runtime 24
1.7 Complexity Classes 26
1.7.1 Generality 26
1.7.2 Exact Algorithms 28
1.7.3 Approximation Algorithms 32
1.7.4 Solvers 34
1.8 Some Algorithmic Techniques 35
1.8.1 Dynamic Programming 35
1.8.2 Tree Traversal 38
1.9 Validation 41
1.9.1 The Different Types of Errors 42
1.9.2 Quality Measures 44
1.9.3 And in the Non-Binary Case? 46
1.10 Conclusion 47
1.11 References 47
Chapter 2 Sequence Indexing 49
Thierry LECROQ and Mikaël SALSON
2.1 Introduction 49
2.1.1 What is Indexing? 50
2.1.2 When to Index? 51
2.1.3 What to Index? 51
2.1.4 Indexing Structures and Queries Considered 52
2.1.5 Basic Notions and Vocabulary 53
2.2 Word Indexing 54
2.2.1 Bloom Filters 54
2.2.2 Inverted List 56
2.2.3 De Bruijn Graphs 60
2.2.4 Efficient Structures for Targeted Queries 61
2.3 Full-Text Indexing 62
2.3.1 Suffix Tree 62
2.3.2 (Extended) Suffix Array 64
2.3.3 Burrows-Wheeler Transform 67
2.4 Indexing Choice Criteria 76
2.4.1 Based on the Type of the Necessary Query 77
2.4.2 Based on the Space-Time and Data Quantity Trade-Off 77
2.4.3 Based on the Need to Add or Modify Indexed Data 79
2.4.4 Indexing Choices According to Applications 80
2.5 Conclusion and Perspectives 81
2.5.1 Efficient Methods for Indexing a Few Genomes or Sequencing Sets 81
2.5.2 Methods that Struggle to Take Advantage of Data Redundancy 82
2.6 References 83
Chapter 3 Sequence Alignment 87
Laurent NOÉ
3.1 Introduction 87
3.1.1 What is Pairwise Alignment? 87
3.1.2 How to Evaluate an Alignment? 88
3.2 Exact Alignment 90
3.2.1 Representation in Edit Graph Form 90
3.2.2 Global Alignment and Needleman-Wunsch Algorithm 93
3.2.3 Local Alignment and Smith-Waterman Algorithm 94
3.2.4 Alignment with Affine Indel Function and the Gotoh Algorithm 96
3.3 Heuristic Alignment 98
3.3.1 Seeds 99
3.3.2 Min-Hash and Global Sampling 105
3.3.3 Minimizing and Local Sampling 106
3.4 References 109
Chapter 4 Genome Assembly 113
Dominique LAVENIER
4.1 Introduction 113
4.2 Sequencing Technologies 116
4.2.1 Short Reads 117
4.2.2 Long Reads 118
4.2.3 Linked Reads 118
4.2.4 Hi-C Reads 119
4.2.5 Optical Mapping 119
4.3 Assembly Strategies 120
4.3.1 The Main Steps 120
4.3.2 Cleaning and Correction of Reads 121
4.3.3 Scaffold Construction 122
4.3.4 Scaffold Ordering 123
4.4 Scaffold Construction Methods 124
4.4.1 Greedy Assembly 124
4.4.2 OLC Assembly 126
4.4.3 DBG Assembly 127
4.4.4 Constrained Assembly 130
4.5 Scaffold-Ordering Methods 132
4.5.1 Hi-C Data-Based Methods 132
4.5.2 Optical Mapping-Based Methods 137
4.6 Assembly Validation 139
4.6.1 Metrics 140
4.6.2 Read Realignment 140
4.6.3 Gene Prediction 141
4.6.4 Competitions 141
4.7 Conclusion 142
4.8 References 143
Chapter 5 Metagenomics and Metatranscriptomics 147
Cervin GUYOMAR and Claire LEMAITRE
5.1 What is Metagenomics? 147
5.1.1 Motivations and Historical Context 147
5.1.2 The Metagenomics Data 148
5.1.3 Bioinformatics Challenges for Metagenomics 151
5.2 "Who Are They": Taxonomic Characterization of Microbial Communities 153
5.2.1 Methods for Targeted Metagenomics 154
5.2.2 Whole-Genome Methods with Reference 155
5.2.3 Reference-Free Methods 160
5.3 "What Are They Able To Do?": Functional Metagenomics 166
5.3.1 Gene Prediction and Annotation 166
5.3.2 Metatranscriptomics 167
5.3.3 Reconstruction of Metabolic Networks 168
5.4 Comparative Metagenomics 169
5.4.1 Comparative Metagenomics with Diversity Estimation 170
5.4.2 De Novo Comparative Metagenomics 170
5.5 Conclusion 175
5.6 References 176
Chapter 6 RNA Folding 185
Yann PONTY And Vladimir REINHARZ
6.1 Introduction 185
6.1.1 RNA Folding 186
6.1.2 Secondary Structure 189
6.2 Optimization for Structure Prediction 192
6.2.1 Computing the Minimum Free-Energy (MFE) Structure 192
6.2.2 Listing (Sub)optimal Structures 198
6.2.3 Comparative Prediction: Simultaneous Alignment/Folding of RNAs 203
6.2.4 Joint Alignment/Folding Model 204
6.3 Analyzing the Boltzmann Ensemble 210
6.3.1 Computing the Partition Function 210
6.3.2 Statistical Sampling 215
6.3.3 Boltzmann Probability of Structural Patterns 220
6.4 Studying RNA Structure in Practice 225
6.4.1 The Turner Model 225
6.4.2 Tools 228
6.5 References 228
Conclusion 233
List of Authors 237
Index 239
Preface
Scientists have long been interested in studying living organisms, both at a macroscopic scale, by analyzing their external appearance or their overall internal functioning, and at a more microscopic scale. Taken to its extreme, their observation consists of studying the nucleus of cells and the molecules of living organisms that define their functioning: namely DNA and RNA. In an organism, DNA is actually the carrier of genetic information, which is called the genome, thus playing an important role. However, the genome is not everything; it is composed of genes that allow for RNA production which have various functions such as protein synthesis or regulation of cell activity. In digital form, these DNA or RNA molecules are most often represented as text from four-letter alphabets (A, C, G and T for DNA; A, C, G and U for RNA). Using these DNA and RNA sequences, computer-based methods are able to answer a number of biological questions. This is the heart of this book. In the chapters of this book, we will find the answers to these questions, as well as their limitations to some fundamental questions that have been, and are still being, addressed by bioinformatics. How can a short sequence of a few hundred nucleotides quickly be searched for in a genome that can make a few billion of them? How can sequences be compared with one another? How can the complete sequence of a genome be reconstructed? How can the bacteria that make up our intestinal flora be identified? Based on their sequences, how can the structure that certain RNAs will adopt be predicted?
The methods that are described in this book have their roots anchored in two fields with long-standing foundations, which have long evolved alongside each other without any significant interaction: computer science and biology. It was during the 20th century that the symbiosis between computational methods and biological problems led to combined modeling and the design of bioinformatics algorithms, methods and tools.
DNA was first sequenced in the late 1970s, with low volumes and incurring huge costs. The need to store and manipulate this data automatically soon became very pressing. As a result, the mid-1980s witnessed the development of the first sequence databases. These databases are pooled by the community who feeds them sequencing experiments, which are increasingly growing and require more efficient methods. This is how sequence alignment methods are implemented, which are dedicated to these genomic sequences, and designed for optimizing the time and space used for this operation. These databases are not only maintained, but also expanded and made public internationally, further accelerating access to knowledge. Data acquisition is also accelerating since the first complete bacterial or yeast genomes in the mid-1990s, as well as with the human genome project, which has kept many teams busy for over a decade. Access to knowledge about these genomes leads to questioning living organisms from a completely new point of view, and opens up new avenues for several fields of application, particularly in the health sector, and also in ecology and evolution, along with the enrichment of the fundamental knowledge of organisms and how they function.
Since the mid-2000s, genomic data have been acquired at a much faster pace following the advent of high-throughput sequencers, which enable, to a certain extent, the transformation of DNA or RNA molecules into short sequences of letters at a low cost and at an increasingly frenetic pace. There is an ongoing discussion of projects involving several thousand, or even tens of thousands, complete genomes of individuals of interest. These developments make it now possible to question living organisms in a finer manner, at the scale of varieties and individuals of the same species, but also at the scale of the different tissues that make up an organism, or even at the scale of a natural environment sample containing thousands of different organisms. With these new questions emerges the need to model data as a whole, in a structured way, and to develop methods for the purpose of answering them.
At the same time, storage and information processing capacities, as well as computational performance allowed by increasingly powerful processors and exploiting increasingly complex parallelism, have accompanied an extraordinarily rapid progress in the field of algorithmics and problem modeling based on the use of elaborate discrete structures. Some particular operations that seemed inaccessible have become commonplace at a lower cost in modern programs, and it is not uncommon today to run calculations over a grid whose capacities far exceed what could be imagined some 20 years ago. Nonetheless, this is not enough to make feasible all the studies that we would like to achieve on sequencing data and their derivatives.
The data produced by the sequencers, due to their quantity (up to 10 million nucleotides per second) and their particularities (whose lengths and types of errors vary according to sequencing technologies), require the appropriate use of methods in order to extract relevant information therefrom in a reasonable amount of time without resorting to gigantic computing infrastructures.
Although the methods developed are generally independent of the technology, they must take into account the constraints of the technology in order to yield solutions for practical applications. In particular, the increase in the volume of data to be processed makes some solutions impractical and requires the use of much more faster heuristics. Therefore, the methods used in bioinformatics are most often at the crossroads between exact and approximate methods.
In order to better understand the specific terms and tools of bioinformatics, we have introduced most of them in Chapter 1. This chapter also covers in detail the data (DNA, RNA and proteins) which we are working with, and the way they are obtained. Moreover, this chapter presents some algorithmic notions that are useful for understanding this book, and addresses the concepts used in bioinformatics more broadly. The remaining chapters present the most commonly studied problems in bioinformatics from genomic data. Some chapters focus more on tools, others on methods and still others on a more detailed description of data. We briefly present the questions which the following chapters of this book will answer.
Sequence indexing. In order to address the influx of data, how can these be easily stored, queried and manipulated? This is the subject of Chapter 2, which explains how to respond to these different aspects. The crucial issues here are the conservation of information, the flexibility of the structure and its ability to answer in a reasonable time the most common question, namely "Is this sequence indeed in my genome?"
Sequence alignment. When studying one or more sequences, a question arises very quickly: how can we tell whether the sequences are similar if a sequence is approximately found in another, and also can a score that allows for classifying these comparisons between them be determined? A crucial point in bioinformatics is to be able to answer the question "What are the most significant occurrences of my pattern in my sequence?" This is what is called the sequence alignment, which is the subject of Chapter 3. This chapter also deals with the aspects of alignment-free comparison where, in order to cope with the volume of data and sometimes significant error rates, making use of an alignment is not feasible and heuristics are developed.
Genome assembly. In Chapter 4, the following question is addressed: "How can the complete sequence of an organism be obtained based on the reads produced by sequencing?" This fundamental problem thus arose from a technical difficulty that makes it impossible to read the genome of an organism in one piece from its cells. This technical difficulty will most likely disappear if advances in sequencing make it possible to read the genome in a single pass; however, the assembly is for the moment essential to the knowledge of the genome and raises many problems, such as "how to choose between two possibilities to assemble the reads?" or "how could the quality of the reconstruction be evaluated?" Graphs prove to be very interesting models in this context of reconstruction.
Metagenomics and metatranscriptomics. When several organisms are mixed in a sample, for example, of soil, from a marine environment or from the internal environment of an organism (the well-known microbiota), additional problems can occur along with those already existing during the assembly. For example, "how can we determine which species are present?" or still "how can genomes be assembled when they are mixed together?" This is the subject of Chapter 5.
RNA folding. RNA data are particular data because their secondary structure plays a fundamental role in the functioning of organisms. Chapter 6 proposes an overview of methods for modeling and inferring this secondary structure from sequence data. Here the fundamental question is "how can the folding of a word on itself be found and evaluated, taking into account the affinities between the characters of this word?"
Apart from the solutions provided to answer this large number of questions, it now becomes all the more necessary to take a step back from the methods capable of processing these data. What does it mean when we "find the same piece of sequence" of one organism in another, and what is the significance of...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., „flowing” text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a „hard” copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.