
Computational Methods for Next Generation Sequencing Data Analysis
Chapter 1
Cloud Computing for Next-Generation Sequencing Data Analysis
Xuan Guo, Ning Yu, Bing Li and Yi Pan
Department of Computer Science, Department of Biology, Georgia State University, Atlanta, GA, USA
1.1 Introduction
Automated Sanger sequencing, regarded as the first-generation sequencing technology, dominated from the 1980s (1) and gave researchers their first opportunity to build, step by step, an effective ecosystem for the production and consumption of genomic information. A large number of computational tools have been developed to decode the biological information held in the sequence databases of this ecosystem. Because first-generation sequencing was expensive, only a few bacteria, organisms with relatively small and simple genomes, were sequenced and published. With the completion of the Human Genome Project at the beginning of the 21st century, however, large-scale genome analysis became feasible, driven by a proliferation of genomic sequence data that was unimaginable only a few years earlier. The advent of newer sequencing methods, known as next-generation sequencing (NGS) technologies (2), threatens the conventional genome informatics ecosystem in terms of both storage space and the efficiency of traditional tools when analyzing such huge amounts of data. The medical discoveries of the future will largely rely on our ability to dig the "treasure" out of this massive biological data, which places unprecedented demands on storage and analysis approaches for big data. Moreover, uploading and downloading large data sets may consume all the network bandwidth available to an organization and congest its network. Local data centers also face persistent issues, including control of data access, sufficient input/output, data backup, power supply, and cooling of computing resources. All of these obstacles have led to a solution in the form of cloud computing, which has become a significant technology in the big data era and has had a revolutionary influence on both academia and industry.
1.2 Challenges for NGS Data Analysis
Since the 1980s, the genomic ecosystem (Figure 1.1 (3)) for the production and consumption of genomic information has consisted of sequencing labs, archives, power users, and casual users. The sequencing labs submit their data to large archival databases, such as GenBank at the National Center for Biotechnology Information (NCBI) (4), the European Bioinformatics Institute EMBL database (5), and the Sequence Read Archive (SRA, previously known as the Short Read Archive) (6). Most of these databases maintain, organize, and distribute sequencing data, and provide free data access and associated tools to both power users and casual users. Most users obtain information either from websites created by the archival databases or from value-added integrators.
Figure 1.1 The old genome informatics ecosystem prior to the advent of next-generation sequencing technologies (3).
The basis for the above ecosystem is Moore's law (7), which describes a long-term trend first introduced in 1965 by Intel co-founder Gordon Moore. Moore's law stated that "the number of transistors that can be placed on an integrated circuit board is increasing exponentially, with a rate of doubling in roughly 18 months" (8). The trend has held for approximately 40 years across multiple changes in semiconductor and manufacturing techniques. Similar phenomena have been noted for disk storage, where hard drive capacity doubles roughly annually (Kryder's law) (9), and for network capacity, where the cost of sending a bit of information over optical networks halves every 9 months (Nielsen's law and Butters' law) (10). In the early years, as genome sequencing technology improved, the growth rate of DNA sequencing roughly matched the growth of computing and storage capacity. The archival databases and computational biologists did not need to worry about running out of disk storage space or lacking access to sufficiently powerful networks, because the slight difference between the two rates allowed them to upgrade their capacity ahead of the curve.
However, a deluge of biological sequence data has been generated since the Human Genome Project was completed in 2003. The advent of NGS technologies in the mid-2000s abruptly increased the slope of the DNA sequencing curve and now threatens the conventional genome informatics ecosystem. The commercially available NGS technologies, including the 454 Sequencer (11), Solexa/Illumina (12), and ABI SOLiD (13), generated a tsunami of petabyte-scale genomic data, which flooded biological databases as never before. Figure 1.2 (7), plotted by Stein (14), illustrates this with the long-term trends in hard disk prices and DNA sequencing costs. Note that exponential curves appear as straight lines on a logarithmic scale. According to the figure, the cost of storing a byte of data halved every 14 months during 1990-2010. By contrast, the cost of sequencing a base halved every 19 months during 1990-2004, which is slower than the decline in the unit cost of storage. After the widespread adoption of NGS technologies, the cost of sequencing a base has halved every 5 months, so the cost of genome sequencing is now falling several times faster than the cost of storage. It is not difficult to predict that, in the near future, it will cost less to sequence a base of DNA than to store it on a hard disk. There is no guarantee that these trends will keep accelerating, but results recently announced by Illumina (15), Pacific Biosciences (16), Helicos (17), and Ion Torrent (18) suggest that they will continue for at least another half-decade. The development of NGS confronts the current ecosystem with four challenges, from the perspectives of storage, transportation, analysis, and economy.
- Storage. The tsunami of genomic data from NGS projects threatens public biological databases in terms of space and cost. For example, after just the first 6 months of the 1000 Genomes Project, the raw sequencing data deposited in GenBank's Sequence Read Archive (SRA) division (19) were twice as large as all of the data deposited into GenBank in the preceding 30 years (7). In another instance, NCBI announced that it would discontinue access to high-throughput sequence data because the cost of the SRA service had become unaffordable (20).
- Transportation. Uploading and downloading huge amounts of data can easily exhaust all the network capacity available to researchers. Annual worldwide sequencing capacity is reported to exceed 13 Pbp (21). Both power users and value-added genome integrators must directly or indirectly download the data from archival databases via the Internet and store copies in local storage systems in order to analyze them or to provide web services. Mirroring data sets across the network in multiple local storage systems is increasingly cumbersome, error-prone, and expensive, and it becomes even worse when databases are updated and all mirrors need to be refreshed.
- Analysis. The massive amounts of sequence data generated by NGS place a significant computational burden on traditional analysis. Take sequence assembly of the human genome as an example. Velvet (22), a popular sequential assembly program, needs at least 2 TB of memory and several weeks to fully assemble the human genome from Illumina platform data. A single desktop computer is not powerful enough to produce results in an acceptable time. On the other hand, if we try to port traditional programs to computing clusters, the programming expertise that traditional high-performance computing requires is not easy to acquire.
- Economy. The load on servers that provide access to genome databases and web services usually fluctuates hourly, daily, and seasonally, so large data centers, such as NCBI, UCSC, and other genome data providers, are forced to choose between a cluster sized to meet average daily requirements and a more powerful one sized to handle peak usage. Whichever option is chosen, a large portion of the computing resources will sit idle, waiting for events such as the submission of a new large genome data set or the approach of a major scientific conference. In addition, as long as the services are online, all the computers require electricity and maintenance, which is a considerable part of the cost.
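The provisioning trade-off described under Economy can be made concrete with a small calculation. The sketch below is only an illustration: the hourly load profile, the unit of "server units", and the function names are hypothetical and do not come from any data center mentioned in the text.

```python
def utilization(load, capacity):
    """Fraction of the provisioned capacity that is actually used."""
    served = sum(min(demand, capacity) for demand in load)
    return served / (capacity * len(load))


def unmet_demand(load, capacity):
    """Total demand that exceeds the provisioned capacity."""
    return sum(max(demand - capacity, 0) for demand in load)


# Hypothetical load in "server units" over eight time slots (an assumption).
hourly_load = [20, 20, 30, 50, 80, 100, 60, 40]

peak = max(hourly_load)
average = sum(hourly_load) / len(hourly_load)

# Sized for the peak: every request is served, but much capacity sits idle.
idle_at_peak = 1 - utilization(hourly_load, peak)

# Sized for the average: less idle capacity, but some requests go unserved.
dropped_at_average = unmet_demand(hourly_load, average)

print(f"peak sizing:    idle fraction = {idle_at_peak:.2f}")
print(f"average sizing: unmet demand  = {dropped_at_average:.0f} server units")
```

Under these assumed numbers, a cluster sized for the peak is idle half of the time, while one sized for the average fails to serve part of the demand. A pay-per-use model would, in principle, pay only for the load that actually occurs.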
Figure 1.2 Historical trends in storage prices versus DNA sequencing costs (7).
Source: Stein et al. 2010. Creative Commons Attribution License 4.0.
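The halving times read off Figure 1.2 (14 months for storage, 19 months for sequencing before NGS, 5 months after) can be turned into a short calculation of how quickly the two costs diverge. This is a minimal sketch that assumes clean exponential decay; the function names and the starting cost ratio of 100 are hypothetical and chosen only for illustration.

```python
import math


def cost_after(initial_cost, halving_months, elapsed_months):
    """Cost after elapsed_months if it halves every halving_months."""
    return initial_cost * 0.5 ** (elapsed_months / halving_months)


def relative_halving_time(seq_halving, storage_halving):
    """Months for the ratio (sequencing cost / storage cost) to halve.

    A negative result means the ratio is growing, that is, storage is
    getting cheaper faster than sequencing.
    """
    return 1.0 / (1.0 / seq_halving - 1.0 / storage_halving)


def months_until_parity(cost_ratio_now, seq_halving, storage_halving):
    """Months until both unit costs are equal, given today's ratio (> 1)."""
    return math.log2(cost_ratio_now) * relative_halving_time(
        seq_halving, storage_halving
    )


STORAGE_HALVING = 14      # months, from the text
SEQ_HALVING_PRE_NGS = 19  # months, 1990-2004, from the text
SEQ_HALVING_NGS = 5       # months, after NGS, from the text

# Before NGS the ratio grew slowly; after NGS it shrinks quickly.
print(relative_halving_time(SEQ_HALVING_PRE_NGS, STORAGE_HALVING))
print(relative_halving_time(SEQ_HALVING_NGS, STORAGE_HALVING))

# Hypothetical starting point: sequencing a base costs 100 times more
# than storing it (an assumption, not a figure from the text).
print(months_until_parity(100, SEQ_HALVING_NGS, STORAGE_HALVING))
```

With the post-NGS rates, the cost of sequencing relative to storage halves roughly every 7.8 months, which is the arithmetic behind the prediction that sequencing a base will eventually cost less than storing it.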
1.3 Background for Cloud Computing and Its Programming Models
A promising solution to the four challenges described above lies in cloud computing, an emerging trend in the scientific community (23). The term "cloud computing" comes from the cloud symbol that is often used to depict the Internet in flowcharts. Built on virtualization technologies, cloud computing provides a variety of services from the hardware level to the application level, and all of these services are charged on a pay-per-use basis. Scientists can therefore gain immediate access to the resources they need, such as the computing power and storage space of large distributed infrastructures, without advance planning, and can release them as soon as their experiments finish to save cost.
1.3.1 Overview of Cloud Computing
The general notions in cloud computing can...