
Multiple Biological Sequence Alignment
Description
Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.
Reviews / Votes
"Covers the full spectrum of the field, from alignment algorithms to scoring methods, practical techniques, and alignment tools and their evaluations." (Zentralblatt MATH, 2016)More details
Other editions
Additional editions


Persons
Content
Preface xi
1 Introduction 1
1.1 Motivation 2
1.2 The Organization of this Book 2
1.3 Sequence Fundamentals 3
1.3.1 Protein 5
1.3.2 DNA/RNA 6
1.3.3 Sequence Formats 6
1.3.4 Motifs 7
1.3.5 Sequence Databases 9
2 Protein/DNA/RNA Pairwise Sequence Alignment 11
2.1 Sequence Alignment Fundamentals 12
2.2 Dot-Plot Matrix 12
2.3 Dynamic Programming 14
2.3.1 Needleman-Wunsch's Algorithm 15
2.3.2 Example 16
2.3.3 Smith-Waterman's Algorithm 17
2.3.4 Affine Gap Penalty 19
2.4 Word Method 19
2.4.1 Example 20
2.5 Searching Sequence Databases 21
2.5.1 FASTA 21
2.5.2 BLAST 21
3 Quantifying Sequence Alignments 25
3.1 Evolution and Measuring Evolution 25
3.1.1 Jukes and Cantor's Model 26
3.1.2 Measuring Relatedness 28
3.2 Substitution Matrices and Scoring Matrices 28
3.2.1 Identity Scores 28
3.2.2 Substitution/Mutation Scores 29
3.3 GAPS 32
3.3.1 Sequence Distances 35
3.3.2 Example 35
3.4 Scoring Multiple Sequence Alignments 36
3.4.1 Sum-of-Pair Score 36
3.5 Circular Sum Score 38
3.6 Conservation Score Schemes 39
3.6.1 Wu and Kabat's Method 39
3.6.2 Jores's Method 39
3.6.3 Lockless and Ranganathan's Method 40
3.7 Diversity Scoring Schemes 40
3.7.1 Background 41
3.7.2 Methods 41
3.8 Stereochemical Property Methods 42
3.8.1 Valdar's Method 43
3.9 Hierarchical Expected Matching Probability Scoring Metric (HEP) 44
3.9.1 Building an AACCH Scoring Tree 44
3.9.2 The Scoring Metric 46
3.9.3 Proof of Scoring Metric Correctness 47
3.9.4 Examples 48
3.9.5 Scoring Metric and Sequence Weighting Factor 49
3.9.6 Evaluation Data Sets 50
3.9.7 Evaluation Results 52
4 Sequence Clustering 59
4.1 Unweighted Pair Group Method with Arithmetic Mean - UPGMA 60
4.2 Neighborhood-Joining Method - NJ 61
4.3 Overlapping Sequence Clustering 65
5 Multiple Sequences Alignment Algorithms 69
5.1 Dynamic Programming 70
5.1.1 DCA 70
5.2 Progressive Alignment 71
5.2.1 Clustal Family 73
5.2.2 PIMA: Pattern-Induced Multisequence Alignment 73
5.2.3 PRIME: Profile-Based Randomized Iteration Method 74
5.2.4 DIAlign 75
5.3 Consistency and Probabilistic MSA 76
5.3.1 POA: Partial Order Graph Alignment 76
5.3.2 PSAlign 77
5.3.3 ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment 78
5.3.4 T-Coffee: Tree-Based Consistency Objective Function for Alignment Evaluation 79
5.3.5 MAFFT: MSA Based on Fast Fourier Transform 80
5.3.6 AVID 81
5.3.7 Eulerian Path MSA 81
5.4 Genetic Algorithms 82
5.4.1 SAGA: Sequence Alignment by Genetic Algorithm 83
5.4.2 GA and Self-Organizing Neural Networks 84
5.4.3 FAlign 85
5.5 New Development in Multiple Sequence Alignment Algorithms 85
5.5.1 KB-MSA: Knowledge-Based Multiple Sequence Alignment 85
5.5.2 PADT: Progressive Multiple Sequence Alignment Based on Dynamic Weighted Tree 94
5.6 Test Data and Alignment Methods 97
5.7 Results 98
5.7.1 Measuring Alignment Quality 98
5.7.2 RT-OSM Results 98
6 Phylogeny in Multiple Sequence Alignments 103
6.1 The Tree of Life 103
6.2 Phylogeny Construction 105
6.2.1 Distance Methods 106
6.2.2 Character-Based Methods 107
6.2.3 Maximum Likelihood Methods 109
6.2.4 Bootstrapping 110
6.2.5 Subtree Pruning and Re-grafting 111
6.3 Inferring Phylogeny from Multiple Sequence Alignments 112
7 Multiple Sequence Alignment on High-Performance Computing Models 113
7.1 Parallel Systems 113
7.1.1 Multiprocessor 113
7.1.2 Vector 114
7.1.3 GPU 114
7.1.4 FPGA 114
7.1.5 Reconfigurable Mesh 114
7.2 Exiting Parallel Multiple Sequence Alignment 114
7.3 Reconfigurable-Mesh Computing Models - (R-Mesh) 116
7.4 Pairwise Dynamic Programming Algorithms 118
7.4.1 R-Mesh Max Switches 118
7.4.2 R-Mesh Adder/Subtractor 118
7.4.3 Constant-Time Dynamic Programming on R-Mesh 120
7.4.4 Affine Gap Cost 123
7.4.5 R-Mesh On/Off Switches 124
7.4.6 Dynamic Programming Backtracking on R-Mesh 125
7.5 Progressive Multiple Sequence Alignment ON R-Mesh 126
7.5.1 Hierarchical Clustering on R-Mesh 127
7.5.2 Constant Run-Time Sum-of-Pair Scoring Method 128
7.5.3 Parallel Progressive MSA Algorithm and Its Complexity Analysis 129
8 Sequence Analysis Services 133
8.1 EMBL-EBI: European Bioinformatics Institute 133
8.2 NCBI: National Center for Biotechnology Information 135
8.3 GenomeNet and Data Bank of Japan 136
8.4 Other Sequence Analysis and Alignment Web Servers 137
8.5 SeqAna: Multiple Sequence Alignment with Quality Ranking 138
8.6 Pairwise Sequence Alignment and Other Analysis Tools 140
8.7 Tool Evaluation 142
9 Multiple Sequence for Next-Generation Sequences 145
9.1 Introduction 145
9.2 Overview of Next Generation Sequence Alignment Algorithms 147
9.2.1 Alignment Algorithms Based on Seeding and Hash Tables 147
9.2.2 Alignment Algorithms Based on Suffix Tries 151
9.3 Next-Generation Sequencing Tools 154
10 Multiple Sequence Alignment for Variations Detection 161
10.1 Introduction 161
10.2 Genetic Variants 163
10.3 Variation Detection Methods Based on MSA 165
10.4 Evaluation Methodology 172
10.4.1 Performance Metrics 172
10.4.2 Simulated Sequence Data 174
10.4.3 Real Sequence Data 175
10.5 Conclusion and Future Work 176
11 Multiple Sequence Alignment for Structure Detection 179
11.1 Introduction 179
11.2 RNA Secondary Structure Prediction Based on MSA 180
11.2.1 Common Information in Multiple Aligned RNA Sequences 182
11.2.2 Review of RNA SS Prediction Methods 183
11.2.3 Measures of Quality of RNA SS Prediction 187
11.3 Protein Secondary Structure Prediction Based on MSA 189
11.3.1 Review of Protein Secondary Structure Prediction Methods 190
11.3.2 Measures of Quality of Protein SS Prediction 195
11.4 Conclusion and Future Work 196
References 199
Index 219
Chapter 1
Introduction
Majority of organisms on Earth, though diverse, share a significant biological similarity. There is an abundance of biological sequence data showing that any two mammals can have as many as 99% genes in common. Humans and fruit flies are two very different species that share at least 50% common genes. These striking facts have been discovered largely through biological sequence analysis.
Multiple sequence alignment is a fundamental task in bioinformatics and sequence analysis. In the early 1970s, deoxyribonucleic acid (DNA) sequences were obtained using laborious methods based on 2D chromatography. Thus, the number of sequences is limited and often being studied and annotated individually by scientists. By the late 1970s, Gilbert [1] and Sanger and Coulson [2] proposed DNA sequencing by chemical degradation and enzymatic synthesis, respectively. Their works earned a Nobel Prize in chemistry in 1980. Later, sequences are obtained by many newer methods such as dye-based methods [3], microarrays, mass spectrometry, X-ray, ultracentrifugation, and so on. Since the development of Sanger's method, the volume of sequences being identified and deposited is enormous. The current commercial sequencing such as "454 sequencing" can read up to 20 million bases per run and produce the sequences in hours. With this vast amount of sequences, manually annotating each sequence is infeasible. However, we need to categorize them by family, analyze them, find features that are common between them, and so on. The main step to solve this problem is finding the best way to start with the sequence fundamentals, thus leading the readers to the most modern and practical alignment techniques that have been proven to be effective in biological sequence analysis.
1.1 Motivation
There are two popular trends in sequence analysis. One trend focuses primarily on applying rigorous mathematical methods to bring out the optimal alignment of the sequences, thus leading to revelation of possible hidden biological significance between sequences. The other trend stretches on correctly identifying the actual biological significance between the sequences, where some or all biological features may have already been known. These two trends emerge from specific tasks that bioinformatics scientists are dealing with. The first trend relates to prediction of the sequence structures and homology, evolution of species, or determination of the relationship between sequences in order to categorize and organize sequence databases. The second trend is to perform a daily task inwhich scientists want to arrange similar known features of the sequences into the same columns to see how closely they resemble each other. Thus, the second trend can be seen in evolution analysis, in sequence structure and functional analysis, or in drug design and discovery. In the later case, for each specific virus sequence, drug designers search for possible drug-like compounds from libraries of simple sequence models annotated with functional sites and specific drug-like compounds that can bind [4 5] . Hence, aligning a sequence obtained from a new virus against the library of sequences may lead to a manageable set of sequences and compounds to work with.
Consequently, these two distinctive perspectives lead to different approaches to sequence alignment, and the development of sequence alignment algorithms, in turn, allows scientists to automate these tremendous and time-consuming tasks.
1.2 The Organization of this Book
Multiple sequence alignment study involves many aspects of sequence analysis, and it requires broad and significant background information. Therefore, we present each aspect as a chapter starting with existing methodologies and following by our contributions.
The rest of this chapter provides basic information on biological sequences.
- Chapter 2 provides fundamentals in pairwise sequence alignment.
- Chapter 3 describes popular existing quantitative models that have been designed to quantify multiple sequence alignment along with their analysis and evaluations.
- Chapter 4 describes practical clustering techniques that have been used in multiple sequence alignment.
- Chapter 5 describes, characterizes, and relates many multiple sequence alignment models.
- Chapter 6 describes how traditional phylogenetic trees have been constructed and how available sequence knowledge bases can be used to improve the accuracy of reconstructing phylogeny trees.
- Chapter 7 describes the latest methods developed to improve the run-time efficiency of multiple sequence alignment. A large section of this chapter is devoted to parallel alignment model on reconfigurable networks.
- Chapter 8 describes several popular existing multiple sequence alignment server and services.
- Chapter 9 describe several multiple sequence alignment techniques that have been developed to handle short sequences (reads) produced by the next-generation sequencing technique (NSG).
- Chapter 10 describes a bioinformatics application, genetic variant detection, using multiple sequence alignment of short reads or whole genomes as input.
- Lastly, Chapter 11 provides a review of ribonucleic acid (RNA) and protein secondary structure prediction using the evolution information inferred from multiple sequence alignments.
1.3 Sequence Fundamentals
DNA is the fundamental unit that characterizes a living organism and its genome, that is, its genetic information set. DNA contains thousands of genes that carry the genetic information of a cell. Each gene holds information of how to build a protein molecule, which serves as building blocks for the cell or performs important tasks for the cell functions. The DNA is positioned in the nucleus, which is organized into chromosomes. Since DNA contains the genetic information of the cell, it must be duplicated before the cell divides. This technique is called duplication. When proteins are required, the corresponding genes are transcribed into RNA (transcription), the noncoding parts of the RNA are removed, and the RNA is transported out of the nucleus. Proteins are built outside of the nucleus based on the code in the RNA (see Figure 1.1). Thus, DNA sequence determines protein sequence and its structure; and the protein structure dictates the protein function. In this section, we present the fundamentals of sequences and their basic components.
Figure 1.1 Transcription and translation processes: DNA RNA protein (the "central dogma" of biology).
1.3.1 Protein
Proteins (also known as polypeptides) are organic compounds consisting of amino acids linked with peptide bonds forming a linear polypeptide chain and folded into a globular form. The linear sequence of amino acids in the protein molecule represents its primary structure. The secondary structure refers to regions of local regularity within a protein fold such as -helices, -turns, or -strands. The size of a protein sequence ranges from a few to several thousand residues. Figure 1.2 shows both the structure and the primary sequence of the yeast 1M2V protein. Tables 1.1 and 1.2 list the amino acids and their common abbreviations. Figure 1.3 shows a substitution matrix to quantify the probable substitution between two amino acids in a protein molecule.
Figure 1.2 The yeast Sec23/24 heterodimer 1M2V: (a) protein structure and (b) primary sequence.
Table 1.1 Common Amino Acids
Name 3-Letter 1-Letter Alanine Ala A Arginine Arg R Asparagine Asn N Aspartic acid Asp D Cysteine Cys C Glutamic acid Glu E Glutamine Gln Q Glycine Gly G Histidine His H Isoleucine Ile I Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W Tyrosine Tyr Y Valine Val V Unknown or "other" XTable 1.2 Ambiguous Amino Acids
Ambiguous Amino Acids 3-Letter 1-Letter Asparagine or aspartic Asx B Glutamine or glutamic acid Glx Z Leucine or isoleucine Xle JFigure 1.3 BLOSUM62...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., „flowing” text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a „hard” copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.