Preface
Gilles DIDIER¹ and Stéphane GUINDON²
¹ IMAG, CNRS, Université de Montpellier, France
² LIRMM, CNRS, Université de Montpellier, France
Evolution, in the usual life-science sense of the process by which living species change over time, is an essential biological process. It can arguably be said to be the most important one, since all other biological phenomena are in some way derived from it. Evolution is not only at the origin of the extraordinary diversity of living beings, but has also shaped every biological function that can be observed. Its general theory dates back to the 19th century with the work of Darwin. Although this theory is now widely accepted, its study remains an extremely lively and fertile scientific field, which has developed extensively in recent decades and is consequently rather broad.
The aim of this book is therefore not to present every single aspect of evolution, but to focus on the many interactions between biology, mathematics and computer science that its study has fostered. Evolutionary studies make such extensive use of mathematical models and algorithms mainly because biological evolution is a process that has been going on for more than 4 billion years and that is not, with rare exceptions (see Chapter 11), directly observable on our time scale. As such, it can only be studied from the data that are available to us today, that is, present-day species and the fossil record. In order to test hypotheses about the mechanisms governing evolution, it is generally necessary to express them in terms of mathematical models. These models are simplified representations of biological reality that can be used to (imperfectly) reconstruct the evolutionary history of the contemporary species (as well as ancestral species in the case of fossils) that we observe. By fitting these models to the data, their relevance can be assessed and the hypotheses initially proposed can be validated or rejected. The design of models and the inference of their parameters require us to deal with mathematical, computational and statistical problems that have contributed to opening up new fields of research, both theoretical and applied, in these areas.
The central object in the study of evolution is the tree of life. It is first of all a natural representation of the diversification of species, in the sense that it describes the relationships between species (or even between individuals). In modeling, we shall see that it is sometimes interpreted as a support for evolution, which we can try to reconstruct from the available data, and sometimes as a representation of the statistical dependency between the characters carried by the species. Trees are also theoretical objects that have been studied in computer science and mathematics, particularly from the point of view of their combinatorics. Chapter 1 briefly presents this aspect before describing different evolutionary models leading to trees and the probabilities associated with these trees under the models.
Although trees represent the framework in which evolution takes place, evolution itself operates on the various characteristics (taken here in the broadest sense) that can be found in living beings. Moreover, it is through these characters that it can be studied. The most widely used "character" in this context is genetic material, namely the molecules (polymers) whose sequence of elementary building blocks can be written as (long) words over finite alphabets. In fact, the development of genetics, first identified as a support for evolution, and of DNA sequencing techniques has revolutionized the study of biological evolution by changing the nature of the exploitable data in this field and causing an explosion in their amount. Using these (and other) data to better understand evolution requires ever greater mathematical and computational resources to cope with the ever-changing amount and type of data.
Chapter 2 presents the main Markovian models of DNA sequence evolution. These models, generally considered as mechanistic, describe evolutionary processes at the molecular level, over periods of time long enough that intraspecies genetic variability is negligible compared to interspecies variability. The vast majority of these models assume that the different positions along the genetic sequences evolve independently of each other and follow the same continuous-time Markovian model. This chapter also describes probabilistic models that account for the variability of evolutionary rates along sequences, an important phenomenon from a biological point of view, particularly with regard to the evolution of the coding regions of genomes, which are constrained by the structure of the genetic code. Finally, models of the same type as those used for DNA sequences can also be used to model the evolution of discrete characters, such as the presence or absence of a given morphological characteristic or the number of fingers.
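To make the idea of independent sites following the same continuous-time Markov model concrete, here is a minimal sketch. It assumes the Jukes-Cantor (JC69) model, which is chosen here purely for illustration and is not named in the text above; the function names `jc69_prob` and `pair_log_likelihood` are hypothetical.

```python
import math

def jc69_prob(same: bool, d: float) -> float:
    """Transition probability under JC69 (an assumed, illustrative model).

    d is the expected number of substitutions per site along the branch.
    Returns P(x -> x) if same is True, else P(x -> y) for one given y != x.
    """
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def pair_log_likelihood(seq_x: str, seq_y: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by distance d > 0.

    Sites are treated as independent and identically distributed, so the
    log-likelihood is a sum over sites. The factor 0.25 is the stationary
    frequency of each nucleotide under JC69.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(
        math.log(0.25 * jc69_prob(x == y, d))
        for x, y in zip(seq_x, seq_y)
    )
```

For a short branch, two identical sequences receive a higher log-likelihood than two sequences that differ at every site, as expected. More realistic models relax the equal-rate and equal-frequency assumptions made in this sketch.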
Evolution also concerns the physical characteristics of species, in particular the so-called quantitative characters such as height, weight and so on. Although these are less used for phylogenetic inference, understanding their evolution is essential to biology, for example, for testing hypotheses about morphometric and allometric relationships in ecology. Evolutionary models of continuous characters also prove to be a relevant tool for detecting possible traces of natural selection on the evolution of morphological characters. Chapter 3 presents in detail the generic framework in which these models are implemented, as well as a wide range of commonly used approaches to appropriately model the correlation between character values that derives from the evolutionary relationships between the species compared.
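The way evolutionary relationships induce correlation between character values can be illustrated with a small sketch. It assumes a Brownian motion model of character evolution, which is an illustrative choice not named in the text above: under that assumption, the covariance between the values at two tips is proportional to the length of the path they share from the root. The function name `bm_covariance` and the edge labels are hypothetical.

```python
def bm_covariance(paths, sigma2=1.0):
    """Covariance of tip values under an assumed Brownian motion model.

    paths maps each tip name to a dict {edge_label: edge_length} listing
    the edges on the path from the root to that tip. Two tips covary in
    proportion to the total length of the edges their paths have in common.
    """
    cov = {}
    for a in paths:
        for b in paths:
            shared = sum(
                length for edge, length in paths[a].items() if edge in paths[b]
            )
            cov[(a, b)] = sigma2 * shared
    return cov

# Hypothetical three-species tree: A and B share an internal edge, C does not.
example_paths = {
    "A": {"root-AB": 1.0, "AB-A": 2.0},
    "B": {"root-AB": 1.0, "AB-B": 2.0},
    "C": {"root-C": 3.0},
}
example_cov = bm_covariance(example_paths)
```

In this example, A and B have covariance 1.0 (their shared edge), each tip has variance 3.0 (its full path length), and C is uncorrelated with the other two.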
The models presented in Chapters 2 and 3 assume that the characters being considered are probabilistically independent (e.g. sequence sites and the different edges in a phylogeny are considered as independent). Although this independence avoids an explosion of the computation time and of the size of the models, it may not be realistic in many cases. Chapter 4 presents different approaches that make it possible to highlight and study the evolutionary interdependence of several characters, discrete or continuous. Like the models presented in Chapter 3, co-evolution models treat the phylogenetic tree as a nuisance parameter in order to evaluate the part of the correlation between morphological characters that is not explained by evolutionary relationships.
Genetic sequences do not evolve only by means of mutations, as presented in Chapter 2, but sometimes change in a more radical way. In particular, at the genome level, evolution sometimes proceeds by duplication or inversion of whole sections of chromosomes. These types of changes are highly informative about the evolutionary distances between genomes. They provide a more comprehensive view of evolution than the "simple" models of point substitutions between nucleotides. Nonetheless, genomic rearrangements are more difficult to model mathematically. Chapter 5 reviews the general approach to detecting these rearrangements and reconstructing evolution at the genome scale.
The first five chapters of our book provide an overview of evolutionary models. The next four chapters illustrate how some of these models are applied in the context of phylogenetic inference, that is, for determining the evolutionary relationships between species or individuals and the time since they diverged. They present several approaches commonly used to answer these questions.
The first approach to reconstructing the evolution of a group of species consists of considering a distance or dissimilarity matrix, that is, a measure of the "resemblance" between pairs of species. Here, the assumption is made that the further apart species are from an evolutionary perspective, the less similar they are from the point of view of the chosen measure. Under this hypothesis, the tree that best represents these distances, which we will try to determine, is close to the one that traces their evolution. The dissimilarity, or distance, used can be calculated from genetic sequences and morphological characters. Although a distance drastically summarizes all the characters inherent to the species, different methods presented in Chapter 6 enable the reconstruction of realistic evolutionary trees from this information. Because of their computational speed, these approaches are virtually the only ones that can be applied in practice to reconstruct trees comprising a large number of species.
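As an illustration of how a tree can be built from a distance matrix alone, here is a minimal sketch of UPGMA, one classical agglomerative method. It is an assumed example and not necessarily one of the methods covered in Chapter 6. UPGMA repeatedly merges the two closest clusters and averages their distances to the remaining ones; it implicitly assumes that all species evolve at the same rate. The function name `upgma` and the input format are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a rooted tree topology by UPGMA (illustrative sketch).

    names: list of species labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the topology as nested 2-tuples of labels.
    """
    # Each cluster key maps to (subtree, number of leaves in the subtree).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)  # working copy, extended with merged clusters
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    [(tree, _)] = clusters.values()
    return tree

# Hypothetical distances: A and B are closest, then C, then D.
example_dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
example_tree = upgma(["A", "B", "C", "D"], example_dist)
```

On this example the sketch groups A with B first, then attaches C, then D. The loop over all pairs is kept naive for readability; practical implementations use faster data structures to scale to many species.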
While the methods presented in Chapter 6 prove to be very fast, they do not directly take the evolution of the species under study into consideration, since the evolutionary distances between species only provide an approximate summary of the raw data. Other approaches directly involve mechanisms or evolutionary models for phylogenetic inference. One of the first to be considered is parsimony, which is based on the principle of Occam's razor and seeks phylogenies involving the fewest possible evolutionary events. Parsimony has gradually been supplanted by approaches based on probabilistic models such as those presented in Chapter 2. The first way to use such models for phylogenetic inference consists of looking for the tree maximizing the probability of the observed data under the chosen model. This is...