Computational Methods for Next Generation Sequencing Data Analysis

Name: Computational Methods for Next Generation Sequencing Data Analysis
Brand: Wiley
Price: 107.99 EUR
Availability: OnlineOnly

Ion Mandoiu, Alexander Zelikovsky (Authors)

Wiley (Publisher)

Published on 12 September 2016

464 pages

E-Book

ePUB with Adobe-DRM

978-1-119-27217-5 (ISBN)

€107.99 incl. 7% VAT

E-Book Single Licence

Available for download

Description

Introduces readers to core algorithmic techniques for next-generation sequencing (NGS) data analysis and discusses a wide range of computational techniques and applications.

This book provides an in-depth survey of some of the recent developments in NGS and discusses mathematical and computational challenges in various application areas of NGS technologies. The 18 chapters featured in this book have been authored by bioinformatics experts and represent the latest work in leading labs actively contributing to the fast-growing field of NGS.

The book is divided into four parts:

Part I focuses on computing and experimental infrastructure for NGS analysis, including chapters on cloud computing, modular pipelines for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.
Part II concentrates on analysis of DNA sequencing data, covering the classic scaffolding problem, detection of genomic variants, including insertions and deletions, and analysis of DNA methylation sequencing data.
Part III is devoted to analysis of RNA-seq data. This part discusses algorithms and compares software tools for transcriptome assembly along with methods for detection of alternative splicing and tools for transcriptome quantification and differential expression analysis.
Part IV explores computational tools for NGS applications in microbiomics, including a discussion on error correction of NGS reads from viral populations, methods for viral quasispecies reconstruction, and a survey of state-of-the-art methods and future trends in microbiome analysis.

Computational Methods for Next Generation Sequencing Data Analysis:

* Reviews computational techniques such as new combinatorial optimization methods, data structures, high performance computing, machine learning, and inference algorithms
* Discusses the mathematical and computational challenges in NGS technologies
* Covers NGS error correction, de novo genome transcriptome assembly, variant detection from NGS reads, and more

This text is a reference for biomedical professionals interested in expanding their knowledge of computational techniques for NGS data analysis. The book is also useful for graduate and post-graduate students in bioinformatics.

Content

CONTRIBUTORS
PREFACE
ABOUT THE COMPANION WEBSITE

PART I COMPUTING AND EXPERIMENTAL INFRASTRUCTURE FOR NGS
1 Cloud Computing for Next-Generation Sequencing Data Analysis
  Xuan Guo, Ning Yu, Bing Li, and Yi Pan
2 Introduction to the Analysis of Environmental Sequence Information Using Metapathways
  Niels W. Hanson, Kishori M. Konwar, Shang-Ju Wu, and Steven J. Hallam
3 Pooling Strategy for Massive Viral Sequencing
  Pavel Skums, Alexander Artyomenko, Olga Glebova, Sumathi Ramachandran, David S. Campo, Zoya Dimitrova, Ion I. Mândoiu, Alexander Zelikovsky, and Yury Khudyakov
4 Applications of High-Fidelity Sequencing Protocol to RNA Viruses
  Serghei Mangul, Nicholas C. Wu, Ekaterina Nenastyeva, Nicholas Mancuso, Alexander Zelikovsky, Ren Sun, and Eleazar Eskin

PART II GENOMICS AND EPIGENOMICS
5 Scaffolding Algorithms
  Igor Mandric, James Lindsay, Ion I. Mândoiu, and Alexander Zelikovsky
6 Genomic Variants Detection and Genotyping
  Jorge Duitama
7 Discovering and Genotyping Twilight Zone Deletions
  Tobias Marschall and Alexander Schönhuth
8 Computational Approaches for Finding Long Insertions and Deletions with NGS Data
  Jin Zhang, Chong Chu, and Yufeng Wu
9 Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies
  Jeong-Hyeon Choi and Huidong Shi
10 Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis
  Elena Harris and Stefano Lonardi

PART III TRANSCRIPTOMICS
11 Computational Methods for Transcript Assembly from RNA-SEQ Reads
  Stefan Canzar and Liliana Florea
12 An Overview and Comparison of Tools for RNA-Seq Assembly
  Rasiah Loganantharaj and Thomas A. Randall
13 Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms From RNA-SEQ Data
  Sing-Hoi Sze
14 Transcriptome Quantification and Differential Expression From NGS Data
  Olga Glebova, Yvette Temate-Tiagueu, Adrian Caciula, Sahar Al Seesi, Alexander Artyomenko, Serghei Mangul, James Lindsay, Ion I. Mândoiu, and Alexander Zelikovsky

PART IV MICROBIOMICS
15 Error Correction of NGS Reads from Viral Populations
  Pavel Skums, Alexander Artyomenko, Olga Glebova, David S. Campo, Zoya Dimitrova, Alexander Zelikovsky, and Yury Khudyakov
16 Probabilistic Viral Quasispecies Assembly
  Armin Töpfer and Niko Beerenwinkel
17 Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data
  Bassam Tork, Ekaterina Nenastyeva, Alexander Artyomenko, Nicholas Mancuso, Mazhar I. Khan, Rachel O'Neill, Ion I. Mândoiu, and Alexander Zelikovsky
18 Microbiome Analysis: State of the Art and Future Trends
  Mitch Fernandez, Vanessa Aguiar-Pulido, Juan Riveros, Wenrui Huang, Jonathan Segal, Erliang Zeng, Michael Campos, Kalai Mathee, and Giri Narasimhan

INDEX

Chapter 1
Cloud Computing for Next-Generation Sequencing Data Analysis

Xuan Guo, Ning Yu, Bing Li and Yi Pan

Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA

1.1 Introduction

Automated Sanger sequencing, regarded as the first-generation sequencing technology, dominated from the 1980s (1) and gave researchers their first opportunity to steadily build an effective ecosystem for the production and consumption of genomic information. A large number of computational tools have been developed to decode the biological information held in the ecosystem's sequence databases. Because first-generation sequencing was expensive, only a few bacteria, organisms with relatively small and simple genomes, were sequenced and published. However, with the completion of the Human Genome Project at the beginning of the 21st century, large-scale genome analysis became feasible, driven by a proliferation of genomic sequence data that had been unimaginable only a few years earlier. The advent of newer sequencing methods, known as next-generation sequencing (NGS) technologies (2), threatens the conventional genome informatics ecosystem in terms of both storage space and the efficiency of traditional tools when analyzing such huge amounts of data. The medical discoveries of the future will largely rely on our ability to dig the "treasure" out of this massive biological data. Thus, unprecedented demands are placed on storage and analysis approaches for big data. Moreover, voluminous data may consume all the network bandwidth available to an organization and congest the network as large data sets are uploaded and downloaded. In addition, local data centers constantly face other issues, including control of data access, sufficient input/output, data backup, power supply, and cooling of computing resources. All of these obstacles have led to a solution in the form of cloud computing, which has become a significant technology in the big data era and has had a revolutionary influence on both academia and industry.

1.2 Challenges for NGS Data Analysis

Since the 1980s, the genomic ecosystem (Figure 1.1 (3)) for the production and consumption of genomic information has consisted of sequencing labs, archives, power users, and casual users. The sequencing labs submit their data to large archival databases, such as GenBank at the National Center for Biotechnology Information (NCBI) (4), the European Bioinformatics Institute EMBL database (5), and the Sequence Read Archive (SRA, previously known as the Short Read Archive) (6). Most of these databases maintain, organize, and distribute sequencing data, and also provide free data access and associated tools to both power users and casual users. Most users obtain information either via websites created by the archival databases or via value-added integrators.

Figure 1.1 The old genome informatics ecosystem prior to the advent of next-generation sequencing technologies (3).

The basis for the above ecosystem is Moore's law (7), a long-term trend first described in 1965 by Intel co-founder Gordon Moore. Moore's law states that "the number of transistors that can be placed on an integrated circuit board is increasing exponentially, with a rate of doubling in roughly 18 months" (8). The trend has held for approximately 40 years across multiple changes in semiconductor and manufacturing techniques. Similar phenomena have been noted for disk storage, where hard drive capacity doubles roughly annually (Kryder's law) (9), and for network capacity, where the cost of sending a bit of information over optical networks halves every 9 months (Nielsen's law and Butters' law) (10). As genome sequencing technology improved, the rate of improvement in DNA sequencing initially approximated the growth of computing and storage capacity. The archival databases and computational biologists did not need to worry about running out of disk storage space or lacking access to sufficiently powerful networks, because the slight difference between the two rates allowed them to upgrade their capacity ahead of the curve.

However, a deluge of biological sequence data has been generated since the Human Genome Project was completed in 2003. The advent of NGS technologies in the mid-2000s abruptly increased the slope of the DNA sequencing curve and now threatens the conventional genome informatics ecosystem. The commercially available NGS technologies, including the 454 Sequencer (11), Solexa/Illumina (12), and ABI SOLiD (13), generated a tsunami of petabyte-scale genomic data, which flooded biological databases as never before. We illustrate this with the long-term trends in hard disk prices and DNA sequencing costs (Figure 1.2) (7) plotted by Stein (14). Note that exponential curves appear as straight lines on a logarithmic scale. According to the figure, the cost of storing a byte of data halved every 14 months during 1990-2010. By contrast, the cost of sequencing a base halved every 19 months during 1990-2004, which is slower than the decline in the unit cost of storage. After the widespread adoption of NGS technologies, the cost of sequencing a base halved every 5 months, so the cost of genome sequencing now drops several times faster than the cost of storage. It is not difficult to predict that, in the near future, it will cost less to sequence a base of DNA than to store it on a hard disk. There is no guarantee that the trend will keep accelerating, but recently announced results from Illumina (15), Pacific Biosciences (16), Helicos (17), and Ion Torrent (18) promise to continue the trend for at least another half-century. The development of NGS confronts the current ecosystem with four challenges: storage, transportation, analysis, and economy.
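The halving times quoted above (14, 19, and 5 months) can be turned into yearly cost-reduction factors, and into an estimate of when sequencing a base becomes cheaper than storing it. The following is a minimal sketch, assuming both costs decay as pure exponentials; the function names and the starting cost ratio used in the usage note are hypothetical illustrations, not values from the chapter.

```python
import math


def annual_cost_factor(halving_months: float) -> float:
    """Factor by which a unit cost falls over 12 months,
    given that it halves every `halving_months` months."""
    return 2.0 ** (12.0 / halving_months)


def months_to_crossover(cost_ratio: float,
                        fast_halving_months: float,
                        slow_halving_months: float) -> float:
    """Months until two exponentially falling costs become equal.

    `cost_ratio` is today's (fast-falling cost) / (slow-falling cost),
    a hypothetical input. Solves
        cost_ratio * 2**(-t / fast) == 2**(-t / slow)
    for t.
    """
    if fast_halving_months >= slow_halving_months:
        raise ValueError("the first cost must halve faster than the second")
    rate_gap = 1.0 / fast_halving_months - 1.0 / slow_halving_months
    return math.log2(cost_ratio) / rate_gap


if __name__ == "__main__":
    for label, months in [("storage, 1990-2010", 14),
                          ("sequencing, 1990-2004", 19),
                          ("sequencing, after NGS", 5)]:
        print(f"{label}: cost falls {annual_cost_factor(months):.2f}x per year")
```

Under these assumptions, storage gets roughly 1.8 times cheaper per year, pre-NGS sequencing roughly 1.5 times, and post-NGS sequencing roughly 5.3 times. If sequencing a base hypothetically cost 512 times as much as storing it today, `months_to_crossover(512, 5, 14)` puts the crossover 70 months out.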

Storage. The tsunami of genomic data from NGS projects threatens public biological databases in terms of space and cost. For example, after just the first 6 months of the 1000 Genomes Project, the raw sequencing data deposited in GenBank's Sequence Read Archive (SRA) division (19) were twice as large as all the data deposited in GenBank over the previous 30 years (7). In another instance, NCBI announced that it would discontinue access to high-throughput sequence data because of the unaffordable cost of the SRA service (20).
Transportation. Uploading and downloading huge amounts of data can easily exhaust all the network capacity available to researchers. Annual worldwide sequencing capacity is reported to exceed 13 Pbp (21). Both power users and value-added genome integrators must download the data, directly or indirectly, from archival databases via the Internet and store copies in local storage systems in order to analyze them and provide web services. Mirroring data sets across the network in multiple local storage systems is increasingly cumbersome, error-prone, and expensive, and it becomes even worse when databases are updated and all mirrors must be refreshed.
Analysis. The massive amounts of sequence data generated by NGS place a significant computational burden on traditional analysis tools. Take sequence assembly of the human genome as an example. Velvet (22), a popular sequential assembly program, needs at least 2 TB of memory and several weeks to fully assemble the human genome from Illumina data. A single desktop computer is not powerful enough to produce results in an acceptable time. On the other hand, porting traditional programs to computing clusters requires high-performance computing programming experience that is not easy to acquire.
Economy. The load on servers that provide access to genome databases and web services usually fluctuates hourly, daily, and seasonally, so large data centers, such as NCBI, UCSC, and other genome data providers, are forced to choose between a cluster sized to meet average daily requirements and a more powerful one sized to handle peak usage. Whichever option is chosen, a large portion of computing resources will sit idle waiting for bursts of activity, such as the submission of a new large genome data set or the approach of a major scientific conference. In addition, as long as the services are online, all the computers require electricity and maintenance, which is a significant cost.
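The transportation challenge above can be made concrete with a back-of-the-envelope transfer-time estimate. This is a rough sketch only: the function name, the `efficiency` parameter, and the example data set size and link speed are hypothetical and do not come from the chapter.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_terabytes: float,
                  link_megabits_per_s: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move a data set over a network link.

    dataset_terabytes   -- size in decimal terabytes (1 TB = 1e12 bytes)
    link_megabits_per_s -- nominal link speed in Mbit/s
    efficiency          -- fraction of the nominal speed actually achieved
                           (hypothetical knob; 1.0 means an ideal link)
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    total_bits = dataset_terabytes * 1e12 * 8
    bits_per_second = link_megabits_per_s * 1e6 * efficiency
    return total_bits / bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical example: mirroring 100 TB over a dedicated 1 Gbit/s link.
    print(f"{transfer_days(100, 1000):.1f} days")
```

Even on an ideal, fully dedicated 1 Gbit/s link, the hypothetical 100 TB mirror takes about 9.3 days, and every database update repeats part of that cost for each mirror.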

Figure 1.2 Historical trends in storage prices versus DNA sequencing costs (7).

Source: Stein et al. 2010. Creative Commons Attribution License 4.0.

1.3 Background for Cloud Computing and Its Programming Models

A promising solution to the four challenges mentioned earlier lies in cloud computing, which has been an emerging trend in the scientific community (23). The term "cloud computing" is often depicted with the cloud symbol used in Internet flowcharts. Based on virtualization technologies, cloud computing provides a variety of services from the hardware level to the application level, and all the services are charged on a pay-per-use basis. Therefore, scientists can gain immediate access to the resources they need, such as the computation power and storage space of large distributed infrastructures, without advance planning, and can release them as soon as experiments finish to save cost.
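The economic argument for pay-per-use can be sketched by comparing a cluster sized for peak load with resources rented only when needed. This is an illustrative model under simple assumptions: the hourly demand profile, the prices, and all function names are hypothetical, and real cloud pricing is more involved.

```python
from typing import Sequence


def _check(hourly_demand: Sequence[float]) -> None:
    if len(hourly_demand) == 0:
        raise ValueError("hourly_demand must not be empty")


def peak_cluster_cost(hourly_demand: Sequence[float],
                      cost_per_node_hour: float) -> float:
    """Cost of owning enough nodes for the busiest hour, all the time."""
    _check(hourly_demand)
    return max(hourly_demand) * len(hourly_demand) * cost_per_node_hour


def pay_per_use_cost(hourly_demand: Sequence[float],
                     price_per_node_hour: float) -> float:
    """Cost of renting exactly the nodes needed in each hour."""
    _check(hourly_demand)
    return sum(hourly_demand) * price_per_node_hour


def idle_fraction(hourly_demand: Sequence[float]) -> float:
    """Share of a peak-sized cluster's capacity that goes unused."""
    _check(hourly_demand)
    capacity = max(hourly_demand) * len(hourly_demand)
    return 1.0 - sum(hourly_demand) / capacity


if __name__ == "__main__":
    # Hypothetical load: three quiet hours and one burst (nodes needed per hour).
    demand = [2, 2, 2, 10]
    print("idle fraction:", idle_fraction(demand))
    print("peak-sized cluster:", peak_cluster_cost(demand, 1.0))
    print("pay-per-use at a 50% price premium:", pay_per_use_cost(demand, 1.5))
```

In this toy profile the peak-sized cluster sits 60% idle, so renting on demand remains cheaper even when each rented node-hour is priced 50% higher than an owned one. The comparison reverses when load is nearly flat, which is why the fluctuation described in the Economy challenge matters.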

1.3.1 Overview of Cloud Computing

The general notions in cloud computing can...

Schweitzer Fachinformationen
