
Algorithms in Bioinformatics
Description
Explore a comprehensive and insightful treatment of the practical application of bioinformatic algorithms in a variety of fields
Algorithms in Bioinformatics: Theory and Implementation delivers a thorough treatment of some of the main algorithms used to explain biological functions and relationships. It introduces readers to the art of algorithms in a practical manner that is linked with biological theory and interpretation. The book covers many key areas of bioinformatics, including global and local sequence alignment, forced alignment, detection of motifs, sequence logos, Markov chains, and information entropy. Other novel approaches are also described, such as self-sequence alignment, Objective Digital Stains (ODSs), Spectral Forecast, and the Discrete Probability Detector (DPD) algorithm.
The text incorporates graphical illustrations that highlight the technical details of the computational algorithms and further the reader's understanding and retention of the material. Throughout, the book is written in an accessible and practical manner, showing how algorithms can be implemented and used in JavaScript in web browsers. The author has included more than 120 open-source implementations of the material, as well as 33 ready-to-use presentations. The book contains original material that has been class-tested by the author, and numerous cases are examined in a biological and medical context. Readers will also benefit from the inclusion of:
* A thorough introduction to biological evolution, including the emergence of life, classifications and some known theories and molecular mechanisms
* A detailed presentation of new methods, such as Self-sequence alignment, Objective Digital Stains and Spectral Forecast
* A treatment of sequence alignment, including local sequence alignment, global sequence alignment and forced sequence alignment with full implementations
* Discussions of position-specific weight matrices, including the count, weight, relative frequencies, and log-likelihoods matrices
* A detailed presentation of the methods related to Markov Chains as well as a description of their implementation in Bioinformatics and adjacent fields
* An examination of information and entropy, including sequence logos and explanations related to their meaning
* An exploration of the current state of bioinformatics, including what is known and what issues are usually avoided in the field
* A chapter on philosophical transactions that allows the reader a broader view of the prediction process
* Native computer implementations in the context of the field of Bioinformatics
* Extensive worked examples with detailed case studies that point out the meaning of different results
Perfect for professionals and researchers in biology, medicine, engineering, and information technology, as well as upper-level undergraduate students in these fields, Algorithms in Bioinformatics: Theory and Implementation will also earn a place in the libraries of software engineers who wish to understand how to implement bioinformatic algorithms in their products.
Person
Paul A. Gagniuc, PhD, is an Associate Professor of Bioinformatics and a Professor of Programming Languages at University Politehnica of Bucharest in Romania. He obtained his doctorate in Genetics at the University of Bucharest. Dr. Gagniuc is also an Academic Editor at PLoS ONE and an active reviewer for several well-known scientific journals. He has published numerous high-profile scientific articles and is the recipient of several awards for exceptional scientific results.
Content
Preface
About the Companion Website
1 The Tree of Life (I)
1.1 Introduction
1.2 Emergence of Life
1.2.1 Timeline Disagreements
1.3 Classifications and Mechanisms
1.4 Chromatin Structure
1.5 Molecular Mechanisms
1.5.1 Precursor Messenger RNA
1.5.2 Precursor Messenger RNA to Messenger RNA
1.5.3 Classes of Introns
1.5.4 Messenger RNA
1.5.5 mRNA to Proteins
1.5.6 Transfer RNA
1.5.7 Small RNA
1.5.8 The Transcriptome
1.5.9 Gene Networks and Information Processing
1.5.10 Eukaryotic vs. Prokaryotic Regulation
1.5.11 What Is Life?
1.6 Known Species
1.7 Approaches for Compartmentalization
1.7.1 Two Main Approaches for Organism Formation
1.7.2 Size and Metabolism
1.8 Sizes in Eukaryotes
1.8.1 Sizes in Unicellular Eukaryotes
1.8.2 Sizes in Multicellular Eukaryotes
1.9 Sizes in Prokaryotes
1.10 Virus Sizes
1.10.1 Viruses vs. the Spark of Metabolism
1.11 The Diffusion Coefficient
1.12 The Origins of Eukaryotic Cells
1.12.1 Endosymbiosis Theory
1.12.2 DNA and Organelles
1.12.3 Membrane-bound Organelles with DNA
1.12.4 Membrane-bound Organelles Without DNA
1.12.5 Control and Division of Organelles
1.12.6 The Horizontal Gene Transfer
1.12.7 On the Mechanisms of Horizontal Gene Transfer
1.13 Origins of Eukaryotic Multicellularity
1.13.1 Colonies Inside an Early Unicellular Common Ancestor
1.13.2 Colonies of Early Unicellular Common Ancestors
1.13.3 Colonies of Inseparable Early Unicellular Common Ancestors
1.13.4 Chimerism and Mosaicism
1.14 Conclusions
2 Tree of Life: Genomes (II)
2.1 Introduction
2.2 Rules of Engagement
2.3 Genome Sizes in the Tree of Life
2.3.1 Alternative Methods
2.3.2 The Weaving of Scales
2.3.3 Computations on the Average Genome Size
2.3.4 Observations on Data
2.4 Organellar Genomes
2.4.1 Chloroplasts
2.4.2 Apicoplasts
2.4.3 Chromatophores
2.4.4 Cyanelles
2.4.5 Kinetoplasts
2.4.6 Mitochondria
2.5 Plasmids
2.6 Virus Genomes
2.7 Viroids and Their Implications
2.8 Genes vs. Proteins in the Tree of Life
2.9 Conclusions
3 Sequence Alignment (I)
3.1 Introduction
3.2 Style and Visualization
3.3 Initialization of the Score Matrix
3.4 Calculation of Scores
3.4.1 Initialization of the Score Matrix for Global Alignment
3.4.2 Initialization of the Score Matrix for Local Alignment
3.4.3 Optimization of the Initialization Steps
3.4.4 Curiosities
3.5 Traceback
3.6 Global Alignment
3.7 Local Alignment
3.8 Alignment Layout
3.9 Local Sequence Alignment - The Final Version
3.10 Complementarity
3.11 Conclusions
4 Forced Alignment (II)
4.1 Introduction
4.2 Global and Local Sequence Alignment
4.2.1 Short Notes
4.2.2 Understanding the Technology
4.2.3 Main Objectives
4.3 Experiments and Discussions
4.3.1 Alignment Layout
4.3.2 Forced Alignment Regime
4.3.3 Alignment Scores and Significance
4.3.4 Optimal Alignments
4.3.5 The Main Significance Scores
4.3.6 The Information Content
4.3.7 The Match Percentage
4.3.8 Significance vs. Chance
4.3.9 The Importance of Randomness
4.3.10 Sequence Quality and the Score Matrix
4.3.11 The Significance Threshold
4.3.12 Optimal Alignments by Numbers
4.3.13 Chaos Theory on Sequence Alignment
4.3.14 Image-Encoding Possibilities
4.4 Advanced Features and Methods
4.4.1 Sequence Detector
4.4.2 Parameters
4.4.3 Heatmap
4.4.4 Text Visualization
4.4.5 Graphics for Manuscript Figures and Didactic Presentations
4.4.6 Dynamics
4.4.7 Independence
4.4.8 Limits
4.4.9 Local Storage
4.5 Conclusions
5 Self-Sequence Alignment (I)
5.1 Introduction
5.2 True Randomness
5.3 Information and Compression Algorithms
5.4 White Noise and Biological Sequences
5.5 The Mathematical Model
5.5.1 A Concrete Example
5.5.2 Model Dissection
5.5.3 Conditions for Maxima and Minima
5.6 Noise vs. Redundancy
5.7 Global and Local Information Content
5.8 Signal Sensitivity
5.9 Implementation
5.9.1 Global Self-Sequence Alignment
5.9.2 Local Self-Sequence Alignment
5.10 A Complete Scanner for Information Content
5.11 Conclusions
6 Frequencies and Percentages (II)
6.1 Introduction
6.2 Base Composition
6.3 Percentage of Nucleotide Combinations
6.4 Implementation
6.5 A Frequency Scanner
6.6 Examples of Known Significance
6.7 Observation vs. Expectation
6.8 A Frequency Scanner with a Threshold
6.9 Conclusions
7 Objective Digital Stains (III)
7.1 Introduction
7.2 Information and Frequency
7.3 The Objective Digital Stain
7.3.1 A 3D Representation Over a 2D Plane
7.3.2 ODSs Relative to the Background
7.4 Interpretation of ODSs
7.5 The Significance of the Areas in the ODS
7.6 Discussions
7.6.1 A Similarity Between Dissimilar Sequences
7.7 Conclusions
8 Detection of Motifs (I)
8.1 Introduction
8.2 DNA Motifs
8.2.1 DNA-binding Proteins vs. Motifs and Degeneracy
8.2.2 Concrete Examples of DNA Motifs
8.3 Major Functions of DNA Motifs
8.3.1 RNA Splicing and DNA Motifs
8.4 Conclusions
9 Representation of Motifs (II)
9.1 Introduction
9.2 The Training Data
9.3 A Visualization Function
9.4 The Alignment Matrix
9.5 Alphabet Detection
9.6 The Position-Specific Scoring Matrix (PSSM) Initialization
9.7 The Position Frequency Matrix (PFM)
9.8 The Position Probability Matrix (PPM)
9.8.1 A Kind of PPM Pseudo-Scanner
9.9 The Position Weight Matrix (PWM)
9.10 The Background Model
9.11 The Consensus Sequence
9.11.1 The Consensus - Not Necessarily Functional
9.12 Mutational Intolerance
9.13 From Motifs to PWMs
9.14 Pseudo-Counts and Negative Infinity
9.15 Conclusions
10 The Motif Scanner (III)
10.1 Introduction
10.2 Looking for Signals
10.3 A Functional Scanner
10.4 The Meaning of Scores
10.4.1 A Score Value Above Zero
10.4.2 A Score Value Below Zero
10.4.3 A Score Value of Zero
10.5 Conclusions
11 Understanding the Parameters (IV)
11.1 Introduction
11.2 Experimentation
11.2.1 A Scanner Implementation Based on Pseudo-Counts
11.2.2 A Scanner Implementation Based on Propagation of Zero Counts
11.3 Signal Discrimination
11.4 False-Positive Results
11.5 Sensitivity Adjustments
11.6 Beyond Bioinformatics
11.7 A Scanner That Uses a Known PWM
11.8 Signal Thresholds
11.8.1 Implementation and Filter Testing
11.9 Conclusions
12 Dynamic Backgrounds (V)
12.1 Introduction
12.2 Toward a Scanner with Two PFMs
12.2.1 The Implementation of Dynamic PWMs
12.2.2 Issues and Corrections for Dynamic PWMs
12.2.3 Solutions for Aberrant Positive Likelihood Values
12.3 A Scanner with Two PFMs
12.4 Information and Background Frequencies on Score Values
12.5 Dynamic Background vs. Null Model
12.6 Conclusions
13 Markov Chains: The Machine (I)
13.1 Introduction
13.2 Transition Matrices
13.3 Discrete Probability Detector
13.3.1 Alphabet Detection
13.3.2 Matrix Initialization
13.3.3 Frequency Detection
13.3.4 Calculation of Transition Probabilities
13.3.5 Particularities in Calculating the Transition Probabilities
13.4 Markov Chains Generators
13.4.1 The Experiment
13.4.2 The Implementation
13.4.3 Simulation of Transition Probabilities
13.4.4 The Markov machine
13.4.5 Result Verification
13.5 Conclusions
14 Markov Chains: Log Likelihood (II)
14.1 Introduction
14.2 The Log-Likelihood Matrix
14.2.1 A Log-Likelihood Matrix Based on the Null Model
14.2.2 A Log-Likelihood Matrix Based on Two Models
14.3 Interpretation and Use of the Log-Likelihood Matrix
14.4 Construction of a Markov Scanner
14.5 A Scanner That Uses a Known LLM
14.6 The Meaning of Scores
14.7 Beyond Bioinformatics
14.8 Conclusions
15 Spectral Forecast (I)
15.1 Introduction
15.2 The Spectral Forecast Model
15.3 The Spectral Forecast Equation
15.4 The Spectral Forecast Inner Workings
15.4.1 Each Part on a Single Matrix
15.4.2 Both Parts on a Single Matrix
15.4.3 Both Parts on Separate Matrices
15.4.4 Concrete Example 1
15.4.5 Concrete Example 2
15.4.6 Concrete Example 3
15.5 Implementations
15.5.1 Spectral Forecast for Signals
15.5.2 What Does the Value of d Mean?
15.5.3 Spectral Forecast for Matrices
15.6 The Spectral Forecast Model for Predictions
15.6.1 The Spectral Forecast Model for Signals
15.6.2 Experiments on the Similarity Index Values
15.6.3 The Spectral Forecast Model for Matrices
15.7 Conclusions
16 Entropy vs. Content (I)
16.1 Introduction
16.2 Information Entropy
16.3 Implementation
16.4 Information Content vs. Information Entropy
16.4.1 Implementation
16.4.2 Additional Considerations
16.5 Conclusions
17 Philosophical Transactions
17.1 Introduction
17.2 The Frame of Reference
17.2.1 The Fundamental Layer of Complexity
17.2.2 On the Complexity of Life
17.3 Random vs. Pseudo-random
17.4 Random Numbers and Noise
17.5 Determinism and Chaos
17.5.1 Chaos Without Noise
17.5.2 Chaos with Noise
17.5.3 Limits of Prediction
17.5.4 On the Wings of Chaos
17.6 Free Will and Determinism
17.6.1 The Greatest Disappointment
17.6.2 The Most Powerful Processor in Existence
17.6.3 Certainty vs. Interpretation
17.6.4 A Wisdom that Applies
17.7 Conclusions
Appendix A
A.1 Association of Numerical Values with Letters
A.2 Sorting Values on Columns
A.3 The Implementation of a Sequence Logo
A.4 Sequence Logos Based on Maximum Values
A.5 Using Logarithms to Build Sequence Logos
A.6 From a Motif Set to a Sequence Logo
References
Index
1 The Tree of Life (I)
1.1 Introduction
This chapter provides an overview of life and raises some important questions: When did life on Earth begin? What is life? How is it organized? When did multicellular organisms appear and why? How many species exist on Earth? Notions of biology related to the emergence and classification of life are discussed in connection with different strategies of organism formation. Some ultrastructural images (electron microscopy) are presented as examples for reference. The lower and upper physical dimensions of eukaryotic and prokaryotic organisms are explored in detail. The same exploration is made for viruses that interact within the kingdoms of life (Animalia, Plantae, Fungi, Protista, [Archaea and Bacteria] or Monera). Moreover, a discussion examines the reference system and the requirements for life, with special consideration given to the "spark of metabolism." Next, some concrete topics are introduced, namely the origins of eukaryotic cells, the endosymbiosis theory, the origins of organelles, the notion of reductive evolution, and the importance of horizontal gene transfer (HGT). Toward the end of the chapter, the main hypotheses regarding the origin of eukaryotic multicellularity are explored using the behavior observed in current species.
1.2 Emergence of Life
The Earth is believed to be ~4.5 billion years old. Geological evidence shows that liquid water, continental crust, and a rudimentary atmosphere existed on Earth just 100 million years later (4.4 billion years ago) [1]. The planetary atmosphere consisted of water vapor, carbon dioxide, methane, and ammonia [2]. It is unknown exactly when or how life began on Earth [3]. It is considered that life began on the early Earth soon after conditions became favorable for a chain of consecutive, yet undetermined chemical reactions [4]. The field of prebiotic chemistry tries to explain how organic compounds formed in the absence of biology and how these simple molecules self-assembled to ignite life on Earth and possibly on other planets. The oldest fossils of single-celled organisms date to around 3.5 billion years ago [5, 6]. Nonetheless, only organisms with a dense biological structure would have resisted the intense metamorphism experienced by crustal rocks for more than 3.5 billion years. In turn, a dense biological structure may indicate high organism complexity. Thus, the earliest known microfossils could actually indicate the presence of structurally complex unicellular organisms [7]. It stands to reason that those "first" organisms must have required a long time to develop their complexity. Evidence for life on Earth before 3.8 billion years ago has been proposed in the past [7]. Preserved carbon, potentially of biogenic nature, pushes the origin of life on Earth back to 4.1 billion years ago [8]. This indicates that life may have arisen fairly quickly after the formation of the planet: 4.5 - 4.1 = 0.4 and 4.5 - 3.8 = 0.7 billion years, that is, 400 million to 700 million years after the planet formed. Moreover, the observation has important implications for our beliefs about how fast life ignites on other planets with similar conditions. In the next important event, life brings chemical modifications to the planetary atmosphere.
An oxygen-containing atmosphere and evidence of cyanobacteria and photosynthesis date to around 2.4-2.2 billion years ago [9, 10]. Large colonial organisms with coordinated growth in oxygenated environments have been found as far back as 2.1 billion years ago [11]. Life made a gradual step toward eukaryotic unicellular organisms ~2 billion years ago. Eukaryotes divided into three main groups around 1.5 billion years ago, namely the unicellular ancestors of modern plants, fungi, and animals [12, 13]. The appearance of multicellular eukaryotes is an unclear period. During evolution, gain or loss of multicellularity often occurred until a stable multicellular state was reached [14]. Knowledge of the complexity and size of current single-celled eukaryotic organisms calls into question the interpretation of many more complex fossils. It is difficult to determine whether some macroscopic-sized fossils indicate multicellular or unicellular macroscopic organisms. Some certainty appears in the fossil record with the rise of bilaterians (organisms with bilateral symmetry) 550-600 million years ago [15]. Nevertheless, even in the case of these fossils, some uncertainty overshadows interpretation. The macroscopic dimensions and the observed bilateral symmetry still cannot indicate with certainty the multicellular nature of these extinct organisms. Again, many of these fossils can be interpreted as multicellular organisms or as unicellular organisms (e.g. giant protists) [16]. Clearer evidence suggests that multicellular organisms may have been present around 635 million years ago [17]. Recent molecular clock analyses estimate that animals started to evolve ~650 million years ago [18]. Bilaterian metazoans (animals with bilateral symmetry) first appeared around 600 million years ago [12]. Moreover, the evolutionary origins of the blood vascular system date to around the same period [19]. Later, bilaterians split into the protostomes and deuterostomes [12].
Protostomes give rise to all the arthropods (e.g. insects, spiders, crabs, shrimp, and so on). Deuterostomes eventually give rise to all vertebrates [12]. Perhaps the most important leap made in the evolution of life was the appearance of motility in multicellular organisms. Fossilized trails of bilaterian animals suggest that eukaryotic multicellular organisms acquired motility around 551-539 million years ago [20]. A "few" million years later, the first true vertebrate with a backbone appears in the fossil record (545-490 million years ago) [21]. Moreover, fossil evidence shows that animals were exploring the land for the first time around 500 million years ago (544-457) [22]. Plants begin colonizing the land around the same time (470 million years ago) [22]. The first four-legged animals (tetrapods) explored the land 385-359 million years ago and gave rise to all amphibians, reptiles, birds, and mammals [22, 23]. The oldest fossilized tree also dates from this period [22]. Important diversifications in eukaryotic species appear after this period, both on land and in water. The first mammal-like forms appear in the fossil record around 225 million years ago [24]. More recent periods bring many wonders. For instance, the largest eukaryotes in Earth's history have been observed around 100 million years ago, namely the Cretaceous dinosaurs (e.g. Argentinosaurus; length: 22-35 m; estimated mass: 50 000-100 000 kg) [25]. The last extinction event (Cretaceous-Tertiary extinction), which occurred around 66-65 million years ago, allowed for a relaxed evolution of mammals [26]. Our own story begins, of course, at the origin of life. However, more distinguished developments start with the origins of the first primates around 55 million years ago [27]. Note that the timeline of past events is detailed in the literature; only some general points are covered here.
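The interval arithmetic used in this section (the gap between the formation of the planet and the earliest proposed evidence of life) can be reproduced in a few lines. The following is a minimal Python sketch for illustration only: the function name, the constant, and the rounding step are assumptions introduced here and do not come from the book, whose own implementations are in JavaScript.

```python
# Illustrative sketch: ages are in billions of years before present,
# taken from the estimates quoted in the text above.
EARTH_AGE_GYA = 4.5  # hypothetical constant name


def delay_after_formation(evidence_gya, earth_age_gya=EARTH_AGE_GYA):
    """Return the delay, in millions of years, between planet formation
    and the given evidence of life (hypothetical helper)."""
    # Rounding removes floating-point noise such as 0.40000000000000036.
    return round((earth_age_gya - evidence_gya) * 1000)


carbon_estimate = delay_after_formation(4.1)   # preserved carbon [8]
earlier_proposal = delay_after_formation(3.8)  # earlier proposed evidence [7]
print(carbon_estimate, earlier_proposal)       # prints: 400 700
```

The same helper can be applied to any other pair of dates in the timeline, bearing in mind that the underlying estimates themselves vary in the literature.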
1.2.1 Timeline Disagreements
Microfossils (the imprint left by an organism in stone), stromatolites (layered rocks derived from photosynthetic cyanobacteria remains sedimented over time), sedimentary carbon isotope ratios, and molecular fossils derived from cellular and membrane lipids ("biomarkers") are used to estimate the origin and diversification of life in the distant past [28]. Data expressed in billions and hundreds of millions of years are particularly subjective and can lead to variations in the literature of up to plus or minus half a billion years. These issues are known and must be taken at face value. While timeline estimates may vary, the order of events is particularly objective. Note that timeline disagreements in the paleontology literature indicate that evolution has no milestones, only trial periods that overlap: some trials were more successful, and others we will probably never know about. Nevertheless, the closer the events get to the present, the more reliable the numbers become. Although relative, timeline estimations in paleontology represent a reliable reference system for important past events on our planet.
1.3 Classifications and Mechanisms
Life on Earth is classified into three major domains, namely Bacteria, Archaea, and Eukarya (Figure 1.1) [29]. Bacteria (Greek - bakterion, "small stick") and Archaea (Greek - Arkhe, "origin") are prokaryotes (Greek - pro, "before"; karyon, "kernel" or "core"). Prokaryotes are single-cell organisms (unicellular) without a "core," namely without a nucleus, and are considered similar or close in sophistication to the first living organisms on Earth. Eukarya (Greek - eu, "well" or "true") includes all unicellular and multicellular organisms with cells that contain nuclei, and it refers to animals (including us), plants, fungi, and single-celled protists.
Figure 1.1 The tree of life - basic diagram....
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.