
Handbook of Statistical Systems Biology
Reviews / Votes
"A very remarkable collection of essays. Strongly recommended to workers in this area." (International Statistical Review, 1 October 2013)
"I would highly recommend this book as a useful guide for the students and practitioners of systems biology." (Science Progress, 1 September 2012)
"This handbook will be a key resource for researchers practising systems biology, and those requiring a comprehensive overview of this important field." (Zentralblatt MATH, 2012)
Persons
Michael Stumpf, Theoretical Systems Biology at Imperial College London
David Balding, Statistical Genetics in the Institute of Genetics at University College London
Mark Girolami, Department of Computing Science and the Department of Statistics
Content
Chapter 2
Introduction to Statistical Methods for Complex Systems
Tristan Mary-Huard and Stéphane Robin
Agro ParisTech and INRA, Paris, France
2.1 Introduction
The aim of the present chapter is to introduce and illustrate some concepts of statistical inference useful in systems biology. Here we limit ourselves to the classical, so-called ‘frequentist’ statistical inference where parameters are fixed quantities that need to be estimated. The Bayesian approach will be presented in Chapter 3.
Modelling and inference techniques are illustrated in three recurrent problems in systems biology:
Class comparison aims at assessing the effect of some treatment or experimental condition on some biological response. This requires proper statistical modelling to account for the experimental design, various covariates or stratifications or dependence between the measurements. As systems biology often deals with high-throughput technologies, it also raises multiple testing issues.
Class prediction refers to learning techniques that aim at building a rule to predict the status (e.g. well or ill) of an individual, based on a set of biological descriptors. An exhaustive list of classification algorithms is out of reach, but general techniques such as regularization or aggregation are of prime interest in systems biology where the number of variables often exceeds the number of observations by far. Evaluating the performances of a classifier also requires relevant tools.
Class discovery aims at uncovering some structure in a set of observations. These techniques include distance-based or model-based clustering methods and make it possible to identify distinct groups of individuals in the absence of a prior classification. However, the underlying structure may have more complex forms, each raising specific issues in terms of inference.
This chapter focuses on generic statistical concepts and methods that can be applied regardless of the technology used for data acquisition. In practice, applications to any biological problem will necessitate both a relevant strategy for the data collection and a careful tuning of the methods to obtain meaningful results. These two steps of data collection (or experimental design conception) and adaptation of the generic methods require taking into account the nature of the data. Therefore, they are dependent on the data acquisition technology, and will be discussed in Part B of this Handbook.
In this chapter, the data are assumed to arise from a static process. The analysis of a dynamic biological system would require more sophisticated methods, such as partial differential equations or network modelling.
These topics are not discussed here as they will be reviewed in depth in Parts C and D.
Lastly, a basic knowledge of statistics is assumed, covering topics including point estimation (in particular maximum likelihood estimation), hypothesis testing, and a background in regression and linear models.
2.2 Class Comparison
We consider here the general problem of assessing the effect of some treatment, experimental condition or covariate on some response. We first address the problem of modelling the data resulting from the experiments, focusing on how to account for the dependency between the observations. We then turn to the problem of multiple testing, which is recurrent in high-throughput data analyses.
2.2.1 Models for Dependent Data
Many biological experiments aim at observing the effects of a given treatment (or combination of treatments) on a given response. ‘Treatment’ is used here in a very broad sense, including controlled experimental conditions, uncontrolled covariates, time, population structure, etc. In the following, $n$ will stand for the total number of experiments.
Linear (Gaussian) models (Searle 1971; Dobson 1990) provide a general framework to describe the influence of a set of controlled conditions and/or uncontrolled covariates, summarized in an $n \times p$ matrix $X$, on the observed response gathered in an $n$-dimensional vector $Y$ as
$$Y \sim \mathcal{N}(X\theta, \Sigma) \tag{2.1}$$
where $\theta$ is the $p$-dimensional vector containing all parameters. In the most classical setting, the response is supposed to be Gaussian, and the dependency structure between the observations is then fully specified by the (co-)variance matrix $\Sigma$, which contains the variance of each observation on the diagonal and the covariances between pairs of observations elsewhere. In the simplest setting, the responses are supposed to be independent with the same variance $\sigma^2$, that is, $\Sigma = \sigma^2 I$.
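As a minimal sketch of the simplest setting (independent responses with a common variance), the following plain-Python example simulates a response from a design with an intercept and one covariate, then recovers the parameters by least squares. The function names, the design and the parameter values are illustrative assumptions, not taken from the chapter.

```python
import random

def simulate_linear_model(x, theta, sigma, rng):
    """Draw Y_i = theta[0] + theta[1] * x_i + E_i with independent E_i ~ N(0, sigma^2)."""
    return [theta[0] + theta[1] * xi + rng.gauss(0.0, sigma) for xi in x]

def least_squares_fit(x, y):
    """Closed-form least-squares estimate for an intercept plus one covariate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [y_bar - slope * x_bar, slope]

rng = random.Random(0)                      # fixed seed for reproducibility
x = [i / 10 for i in range(200)]            # hypothetical covariate values
y = simulate_linear_model(x, theta=[1.0, 2.0], sigma=0.5, rng=rng)
theta_hat = least_squares_fit(x, y)         # estimates of (intercept, slope)
```

With independent, equal-variance Gaussian errors, the least-squares estimate coincides with the maximum likelihood estimate of the mean parameters, which is why no variance matrix appears in the fit.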
2.2.1.1 Writing the Right (Mixed) Model
In more complex experiments, the assumption that observations are independent does not hold and the structure of $\Sigma$ needs to be adapted. Because it contains $n(n+1)/2$ parameters, the shape of $\Sigma$ has to be strongly constrained to allow good inference. We first present here some typical experimental settings, and the associated dependency structures.
Variance Components
Consider the study of the combined effects of the genotype (indexed by $i$) and of the cell type ($j$) on some gene expression. Several individuals ($k$) from each genotype are included and cells from each type are harvested in each of them. In such a setting the expected response is $\mathbb{E}(Y_{ijk}) = \mu_{ij}$, which is often decomposed into a genotype effect, a cell type effect and an interaction as $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$.
The most popular way to account for the dependency between measures obtained on the same individual is to add a random term $U_{ik}$ associated with each individual. The complete model can then be written as
$$Y_{ijk} = \mu_{ij} + U_{ik} + E_{ijk} \tag{2.2}$$
where all $U_{ik}$ and $E_{ijk}$ are independent centred Gaussian variables with variance $\gamma^2$ and $\sigma^2$, respectively. The variance of one observation is then $\gamma^2 + \sigma^2$, where $\gamma^2$ is the ‘biological’ variance and $\sigma^2$ is the ‘technical’ one (Kerr and Churchill 2001). The random effect induces a uniform correlation between observations from the same individual since:
$$\mathrm{Cov}(Y_{ijk}, Y_{i'j'k'}) = \gamma^2 \quad \text{if } (i, k) = (i', k') \text{ and } j \neq j', \tag{2.3}$$
and 0 if $(i, k) \neq (i', k')$. The matrix form of this model is a generalization of (2.1):
$$Y = X\theta + ZU + E \tag{2.4}$$
where $Z$ describes the individual structure: each row corresponds to one measurement and each column to one individual and contains a 1 at the intersection if the measurement has been made on the individual, and a 0 otherwise. The denomination ‘mixed’ of ‘linear mixed models’ comes from the simultaneous presence of fixed and random effects. It corresponds to the simplest form of so-called ‘variance components’ models. The variance matrix corresponding to (2.3) is $\Sigma = \gamma^2 ZZ' + \sigma^2 I$. Application of such a model to gene expression data can be found in Wolfinger et al. (2001) or Tempelman (2008).
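The structure of the variance matrix can be checked on a toy design. The sketch below assumes the usual variance-components form, Sigma = gamma2 * Z Z' + sigma2 * I, where Z is the measurement-by-individual incidence matrix, gamma2 is the individual (‘biological’) variance and sigma2 the residual (‘technical’) variance. The design (three individuals, two measurements each) and the numerical values are hypothetical.

```python
def incidence_matrix(individual_of_measurement, n_individuals):
    """Row r has a 1 in column k if measurement r was made on individual k."""
    return [
        [1 if ind == k else 0 for k in range(n_individuals)]
        for ind in individual_of_measurement
    ]

def variance_components_sigma(Z, gamma2, sigma2):
    """Assumed form: Sigma = gamma2 * Z Z' + sigma2 * I."""
    n = len(Z)
    return [
        [
            gamma2 * sum(a * b for a, b in zip(Z[r], Z[s]))
            + (sigma2 if r == s else 0.0)
            for s in range(n)
        ]
        for r in range(n)
    ]

# Hypothetical design: 3 individuals, 2 cell types measured in each.
individuals = [0, 0, 1, 1, 2, 2]
Z = incidence_matrix(individuals, 3)
Sigma = variance_components_sigma(Z, gamma2=2.0, sigma2=1.0)

for row in Sigma:
    print(row)
```

The printed matrix is block diagonal: the total variance sits on the diagonal, the individual variance links two measurements of the same individual, and measurements from different individuals are uncorrelated.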
Repeated Measurements
Consider a similar design where, in place of cell types, we compare successive harvesting times (indexed by $t$) within each individual. The uniform correlation within each individual given in (2.3) may then seem inappropriate, for it does not account for the delay between times of observation. A common dependency form is then the so-called ‘autoregressive’ one, which states that
$$\mathrm{Cov}(Y_{itk}, Y_{i't'k'}) = \sigma^2 \rho^{|t - t'|} \quad \text{if } (i, k) = (i', k'),$$
and 0 otherwise. This is to assume that the correlation decreases (at an exponential rate) with the time delay. Such a variance structure cannot be put in a simple matrix form similar to (2.4). Note that Equation (2.1) is still valid, but with a nondiagonal variance matrix $\Sigma$.
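A short sketch of the within-individual block of such an autoregressive structure, assuming the standard first-order form in which the covariance between two times t and s is sigma2 * rho ** |t - s|. The function name and the values of sigma2 and rho are illustrative assumptions.

```python
def ar1_block(times, sigma2, rho):
    """Covariance block for one individual: sigma2 * rho ** |t - s|."""
    return [[sigma2 * rho ** abs(t - s) for s in times] for t in times]

# Hypothetical example: four harvesting times on one individual.
block = ar1_block(times=[0, 1, 2, 3], sigma2=2.0, rho=0.5)

for row in block:
    print(row)
```

The full variance matrix is obtained by placing one such block per individual along the diagonal, with zeros between individuals.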
Spatial Dependency
It is also desirable to account for spatial dependency when observations have some spatial localization. Suppose one wants to compare treatments (indexed by $i$), and that replicates ($k$) have respective localizations $\ell_{ik}$. A typical variance structure (Cressie 1993) is
$$\mathbb{V}(Y_{ik}) = \sigma^2 + \gamma^2, \qquad \mathrm{Cov}(Y_{ik}, Y_{i'k'}) = \gamma^2 e^{-\delta \|\ell_{ik} - \ell_{i'k'}\|},$$
where $\sigma^2$ accounts for the measurement error variability and $\delta$ controls the speed at which the dependency decreases with distance.
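As an illustration only, the sketch below builds a spatial variance matrix under one common assumption: an exponentially decaying covariance gamma2 * exp(-delta * distance) between distinct observations, plus a measurement error variance sigma2 added on the diagonal. The two-dimensional coordinates, the Euclidean distance and all parameter values are hypothetical choices.

```python
import math

def spatial_sigma(locations, gamma2, sigma2, delta):
    """Assumed structure: gamma2 * exp(-delta * d) off the diagonal,
    gamma2 + sigma2 on the diagonal (d = Euclidean distance)."""
    n = len(locations)
    sigma = []
    for a in range(n):
        row = []
        for b in range(n):
            d = math.dist(locations[a], locations[b])
            cov = gamma2 * math.exp(-delta * d)
            if a == b:
                cov += sigma2  # measurement error only affects the variance
            row.append(cov)
        sigma.append(row)
    return sigma

# Hypothetical plot coordinates for four replicates.
locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
Sigma = spatial_sigma(locations, gamma2=2.0, sigma2=0.5, delta=1.0)
```

A larger delta makes the covariance vanish faster with distance, so distant replicates become nearly independent.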
The dependency structures described above can of course be combined. Also note that this list is far from exhaustive. The limitations often come from the software at hand or the specific computing developments that can be made. A large catalogue of such structures can be found in software such as SAS (2002-03) or R (www.r-project.org).
2.2.1.2 Inference
Some problems related to the inference of mixed linear models are still unresolved. We only provide here an introduction to the most popular approaches and emphasize some practical issues that can be faced when using them.
Estimation
Mixed model inference requires estimating both $\theta$ and $\Sigma$. We start with the estimation of $\Sigma$, which reduces to the estimation of a few variance parameters such as $\gamma^2$, $\sigma^2$, $\rho$ or $\delta$ in the examples given above.
Moment estimates can be obtained (Searle 1971; Demidenko 2004), typically for variance component models. Such estimates are often based on sums of squares, that is, squared distances between $Y$ and its projection on various linear spaces, such as span($X$), span($Z$) or span($X$, $Z$). The expectation of these sums of squares can often be related to the different variance parameters, and the estimation then reduces to solving a set of linear equations.
The maximum likelihood (ML) estimator is defined as
$$(\hat{\theta}, \hat{\Sigma}) = \arg\max_{(\theta, \Sigma)} \log L(\theta, \Sigma; Y),$$
where $L$ denotes the Gaussian likelihood of model (2.1), and can be used for all models. Unfortunately, ML variance estimates are known to be biased in many (almost all) situations, because both $\theta$ and $\Sigma$ have to be estimated at the same time. The most popular way to circumvent this problem consists of changing to a model where $\theta$ is known (Verbeke and Molenberghs 2000). Defining some matrix $T$ such that $TX = 0$, we may define the Gaussian vector $\tilde{Y} = TY$ which satisfies
$$\tilde{Y} \sim \mathcal{N}(0, T \Sigma T').$$
The most natural choice for $T$ is the projector on the linear space orthogonal to span($X$). The so-called...
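The effect of the projection can be illustrated in the simplest case. The sketch below assumes independent errors with common variance sigma2 and a design made of an intercept and one covariate, for which the projector orthogonal to span(X) has a closed form. It checks that the projector annihilates X, then compares by simulation the ML variance estimate (residual sum of squares divided by n) with the estimate based on the projected vector (divided by n - p). The design, the seed and the number of replications are hypothetical.

```python
import random

def orthogonal_projector(x):
    """T = I - X (X'X)^{-1} X' for the design X = [1, x]."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [
        [
            (1.0 if r == s else 0.0)
            - 1.0 / n
            - (x[r] - x_bar) * (x[s] - x_bar) / sxx
            for s in range(n)
        ]
        for r in range(n)
    ]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
n, p = len(x), 2
T = orthogonal_projector(x)

# T annihilates both columns of X, so T X = 0 up to rounding error.
max_abs_TX = max(abs(v) for v in matvec(T, [1.0] * n) + matvec(T, x))

rng = random.Random(1)
n_rep = 20000
mean_ml, mean_corrected = 0.0, 0.0
for _ in range(n_rep):
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # true sigma2 = 1
    y_tilde = matvec(T, y)               # distribution free of theta
    rss = sum(v * v for v in y_tilde)
    mean_ml += rss / n / n_rep           # ML estimate, biased downwards
    mean_corrected += rss / (n - p) / n_rep
```

On average the ML estimate is close to sigma2 * (n - p) / n rather than sigma2, while the estimate built from the projected vector is centred on the true value.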