Metaheuristics for String Problems in Bio-informatics

Name: Metaheuristics for String Problems in Bio-informatics
Brand: Wiley-ISTE
Price: 139.99 EUR
Availability: OnlineOnly

Christian Blum Paola Festa(Author)

Wiley-ISTE (Publisher)

Published on 16. August 2016

228 pages

E-Book

PDF with Adobe-DRM

System requirements

978-1-119-13680-4 (ISBN)

€139.99incl. 7% vat

System requirements

for PDF with Adobe-DRM

E-Book Single Licence

Available for download

Description

More details

Other editions

Persons

Content

Preface ix

Acknowledgments xi

List of Acronyms xiii

Chapter 1 Introduction 1

1.1 Complete methods for combinatorial optimization 3

1.1.1 Linear programming relaxation 6

1.1.2 Cutting plane techniques 9

1.1.3 General-purpose ILP solvers 18

1.1.4 Dynamic programming 19

1.2 Approximate methods: metaheuristics 20

1.2.1 Ant colony optimization 22

1.2.2 Evolutionary algorithms 24

1.2.3 Greedy randomized adaptive search procedures 25

1.2.4 Iterated local search 26

1.2.5 Simulated annealing 27

1.2.6 Other metaheuristics 29

1.2.7 Hybrid approaches 29

1.3 Outline of the book 32

Chapter 2 Minimum Common String Partition Problem 37

2.1 The MCSP problem 38

2.1.1 Technical description of the UMCSP problem 38

2.1.2 Literature review 39

2.1.3 Organization of this chapter 40

2.2 An ILP model for the UMCSP problem 40

2.3 Greedy approach 42

2.4 Construct, merge, solve and adapt 42

2.5 Experimental evaluation 45

2.5.1 Benchmarks 46

2.5.2 Tuning CMSA 46

2.5.3 Results 47

2.6 Future work 54

Chapter 3 Longest Common Subsequence Problems 55

3.1 Introduction 56

3.1.1 LCS problems 56

3.1.2 ILP models for LCS and RFLCS problems 59

3.1.3 Organization of this chapter 61

3.2 Algorithms for the LCS problem 61

3.2.1 Beam search 61

3.2.2 Upper bound 64

3.2.3 Beam search framework 64

3.2.4 Beam-ACO 67

3.2.5 Experimental evaluation 71

3.3 Algorithms for the RFLCS problem 75

3.3.1 CMSA 77

3.3.2 Experimental evaluation 80

3.4 Future work 85

Chapter 4 The Most Strings With Few Bad Columns Problem 87

4.1 The MSFBC problem 88

4.1.1 Literature review 88

4.2 An ILP model for the MSFBC problem 89

4.3 Heuristic approaches 90

4.3.1 Frequency-based greedy 91

4.3.2 Truncated pilot method 91

4.4 ILP-based large neighborhood search 92

4.5 Experimental evaluation 94

4.5.1 Benchmarks 94

4.5.2 Tuning of LNS 96

4.5.3 Results 97

4.6 Future work 104

Chapter 5 Consensus String Problems 107

5.1 Introduction 107

5.1.1 Creating diagnostic probes for bacterial infections 108

5.1.2 Primer design 108

5.1.3 Discovering potential drug targets 108

5.1.4 Motif search 109

5.2 Organization of this chapter 110

5.3 The closest string problem and the close to most string problem 110

5.3.1 ILP models for the CSP and the CTMSP 111

5.3.2 Literature review 112

5.3.3 Exact approaches for the CSP 113

5.3.4 Approximation algorithms for the CSP 113

5.3.5 Heuristics and metaheuristics for the CSP 114

5.4 The farthest string problem and the far from most string problem 117

5.4.1 ILP models for the FSP and the FFMSP 117

5.4.2 Literature review 118

5.4.3 Heuristics and metaheuristics for the FFMSP 119

5.5 An ILP-based heuristic 141

5.6 Future work 146

Chapter 6 Alignment Problems 149

6.1 Introduction 149

6.1.1 Organization of this chapter 150

6.2 The pairwise alignment problem 151

6.2.1 Smith and Waterman's algorithm 154

6.3 The multiple alignment problem 157

6.3.1 Heuristics for the multiple alignment problem 161

6.3.2 Metaheuristics for the multiple alignment problem 162

6.4 Conclusion and future work 173

Chapter 7 Conclusions 175

7.1 DNA sequencing 175

7.1.1 DNA fragment assembly 176

7.1.2 DNA sequencing by hybridization 177

7.2 Founder sequence reconstruction 180

7.2.1 The FSRP problem 181

7.2.2 Existing heuristics and metaheuristics 182

7.3 Final remarks 184

Bibliography 187

Index 205

1
Introduction

In computer science, a string s is defined as a finite sequence of characters from a finite alphabet S. Apart from the alphabet, an important characteristic of a string s is its length which, in this book, will be denoted by |s|. A string is, generally, understood to be a data type i.e. a string is used to represent and store information. For example, words in a specific language are stored in a computer as strings. Even the entire text may be stored by means of strings. Apart from fields such as information and text processing, strings arise in the field of bioinformatics. This is because most of the genetic instructions involved in the growth, development, functioning and reproduction of living organisms are stored in a molecule which is known as deoxyribonucleic acid (DNA) which can be represented in the form of a string in the following way. DNA is a nucleic acid that consists of two biopolymer strands forming a double helix (see Figure 1.1). The two strands of DNA are called polynucleotides as they are composed of simpler elements called nucleotides. Each nucleotide consists of a nitrogen-containing nucleobase as well as deoxyribose and a phosphate group. The four different nucleobases of DNA are cytosine (C), guanine (G), adenine (A) and thymine (T). Each DNA strand is a sequence of nucleotides that are joined to one another by covalent bonds between the sugar of one nucleotide and the phosphate of the next. This results in an alternating sugar-phosphate backbone (see Figure 1.1). Furthermore, hydrogen bonds bind the bases of the two separate polynucleotide strands to make double-stranded DNA. As a result, A can only bind with T and C can only bind with G. Therefore, a DNA molecule can be stored as a string of symbols from S = {A, C, T, G} that represent one of the two polynucleotide strands. Similarly, most proteins can be stored as a string of letters from an alphabet of 20 letters, representing the 20 standard amino acids that constitute most proteins.

Figure 1.1. DNA double helix (image courtesy of Wikipedia)

As a result, many optimization problems in the field of computational biology are concerned with strings representing, for example, DNA or protein sequences. In this book we will focus particularly on recent works concerning string problems that can be expressed in terms of combinatorial optimization (CO). In early work by Papadimitriou and Steiglitz [PAP 82], a CO problem is defined by a tuple (, f), where is a finite set of objects and is a function that assigns a non-negative cost value to each of the objects s ? . Solving a CO problem requires us to find an object s* of minimum cost value1. As a classic academic example, consider the well-known traveling salesman problem (TSP). In the case of the TSP, set consists of all Hamiltonian cycles in a completely connected, undirected graph with positive edge weights. The objective function value of such a Hamiltonian cycle is the sum of the weights of its edges.

Unfortunately, most CO problems are difficult to optimality solve in practice. In theoretical terms, this is confirmed by corresponding results about non-deterministic (NP)-hardness and non-approximability. Due to the hardness of CO problems, a large variety of algorithmic and mathematical approaches have emerged to tackle these problems in recent decades. The most intuitive classification labels these approaches as either complete/exact or approximate methods. Complete methods guarantee that, for every instance of a CO problem of finite size, there is an optimal solution in bounded time [PAP 82, NEM 88]. However, assuming that P ? NP, no algorithm that solves a CO problem classified as being NP-hard in polynomial time exists [GAR 79]. As a result, complete methods, in the worst case, may require an exponential computation time to generate an evincible optimal solution. When faced with large size, the time required for computation may be too high for practical purposes. Thus, research on approximate methods to solve CO problems, also in the bioinformatics field, has enjoyed increasing attention in recent decades. In contrast to complete methods, approximate algorithms produce, not necessarily optimal, solutions in relatively acceptable computation times. Moreover, note that in the context of a combinatorial optimization problem in bioinformatics, finding an optimal solution is often not as important as in other domains. This is due to the fact that often the available data are error prone.

In the following section, we will give a short introduction into some of the most important complete methods, including integer linear programming and dynamic programming. Afterward, some of the most popular approximate methods for combinatorial optimization are reviewed. Finally, the last part of this chapter outlines the string problems considered in this book.

1.1. Complete methods for combinatorial optimization

Many combinatorial optimization problems can be expressed in terms of an integer programming (IP) problem, in a way that involves maximizing or minimizing an objective function of a certain number of decision variables, subject to inequality and/or equality constraints and integrality restrictions on the decision variables.

Formally, we require the following ingredients:

1) a n-dimensional vector , where each xj, j = 1, 2, . , n, is called a decision variable and x is called the decision vector;
2) m scalar functions of the decision vector x;
3) a m-dimensional vector , called the right-hand side vector;
4) a scalar function of the decision vector x.

With these ingredients, an IP problem can be formulated mathematically as follows:

where ~ ? {=, =, =}. Therefore, the set

is called a feasible set and consists of all those points that satisfy the m constraints gi(x) ~ bi, i = 1, 2, . , m. Function z is called the objective function. A feasible point x* ? X, for which the objective function assumes the minimum (respectively maximum) value i.e. z(x*) = z(x) (respectively z(x*) = z(x)) for all x ? X is called an optimal solution for (IP).

In the special case where is a finite set, the (IP) problem is a CO problem. The optimization problems in the field of computational biology described in this book are concerned with strings and can be classified as CO problems. Most of them have a binary nature, i.e. a feasible solution to any of these problems is a n-dimensional vector x ? {0, 1}n, where for each xi it is necessary to choose between two possible choices. Classical binary CO problems are, for example, the 0/1 knapsack problem, assignment and matching problems, set covering, set packing and set partitioning problems [NEM 88, PAP 82].

In what follows, unless explicitly noted, we will refer to a general integer linear programming (ILP) problem in standard form:

[1.1] [1.2] [1.3]

where is the matrix of the technological coefficients aij, i = 1, 2, . , m, j = 1, 2, . , n, and are the m-dimensional array of the constant terms and the n-dimensional array of the objective function coefficients, respectively. Finally, is the n-dimensional array of the decision variables, each being non-negative and integer. ILP is linear since both the objective function [1.1] and the m equality constraints [1.2] are linear in the decision variables x1, . , xn.

Note that the integer constraints [1.3] define a lattice of points in n, some of them belonging to the feasible set X of ILP. Formally, , where (Figure 1.2).

Figure 1.2. Graphical representation of the feasible region X of a generic ILP problem: , where

Many CO problems are computationally intractable. Moreover, in contrast to linear programming problems that may be solved efficiently, for example by the simplex method, there is no single method capable of solving all of them efficiently. Since many of them exhibit special properties, because they are defined on networks with a special topology or because they are characterized by a special cost structure, the scientific community has historically lavished its efforts on the design and development of ad hoc techniques for specific problems. The remainder of this section is devoted to the description of mathematical programming methods and to dynamic programming.

1.1.1. Linear programming relaxation

A simple way to approximately solving an ILP using mathematical programming techniques is as follows:

a) Relax the integer constraints [1.3], obtaining the following continuous linear programming problem in the standard form:

ILP-c is called the linear programming relaxation of ILP. It is the problem that arises by the replacement of the integer constraints [1.3] by weaker...

System requirements

Save as PDF Copy link into clipboard

Schweitzer Fachinformationen

Metaheuristics for String Problems in Bio-informatics

Description

More details

Other editions

Additional editions

Persons

Content

1
Introduction

1.1. Complete methods for combinatorial optimization

1.1.1. Linear programming relaxation

System requirements