
Genomics in the AWS Cloud
Description
Genomics in the AWS Cloud: Analyzing Genetic Code Using Amazon Web Services enables a person who has moderate familiarity with AWS Cloud to perform full genome analysis and research. Using the information in this book, you'll be able to take a FASTQ file containing raw data from a lab or a BAM file from a service provider and perform genome analysis on it. You'll also be able to identify potentially pathogenic gene sequences.
* Get an introduction to Whole Genome Sequencing (WGS)
* Make sense of WGS on AWS
* Master AWS services for genome analysis
A key advantage of using AWS for genomic analysis is the wide choice of compute services that researchers can use to process diverse datasets in analysis pipelines. The genomic sequencers that generate raw data files are located in on-premises labs, and AWS provides solutions that help customers transfer these files to AWS reliably and securely. Storing genomic and medical (e.g., imaging) data at different stages requires enormous amounts of cost-effective storage. Amazon Simple Storage Service (Amazon S3), Amazon Glacier, and Amazon Elastic Block Store (Amazon EBS) provide the necessary solutions to securely store, manage, and scale genomic file storage. Moreover, these storage services can interface with various AWS compute services to process the files.
Whether you're just getting started or have already been analyzing genomics data using the AWS Cloud, this book provides you with the information you need in order to use AWS services and features in the ways that will make the most sense for your genomic research.

People
David Wall is a consulting engineer. He designs, builds, and supports hardware, software, and business processes. He is an AWS Certified Solutions Architect.
Contents
Chapter 1 Why Do Genome Analysis Yourself When Commercial Offerings Exist?
Chapter 2 A Crash Course in Molecular Biology
Chapter 3 Obtaining Your Genome
Chapter 4 The Bioinformatics Workflow
Chapter 5 AWS Services for Genome Analysis
Chapter 6 Building Your Environment in the AWS Cloud
Chapter 7 Linux and AWS Command-Line Basics for Genomics
Chapter 8 Processing the Sequencing Data
Chapter 9 Visualizing the Genome
Chapter 10 Containerizing Your Workflow on the Desktop
Chapter 11 Variants and Applications
Chapter 12 Cancer Genomics
Index
Introduction
Welcome to Genomics in the AWS Cloud!
From its title, you can conclude that this book is about two things: genomics (the science of sequencing and interpreting genetic data) and Amazon Web Services (one of the three big hosted computing platforms). Genomics in the AWS Cloud, therefore, is meant to appeal either to people from a biology background who want to learn how to do genomics work with AWS or to people with a computer background who want to find out how to apply their skills to genomics.
Both of these areas, genomics and cloud computing, are evolving constantly, and practically no one can claim to be completely au fait with either. This book, therefore, aims at not one but two separate moving targets. Our goal as authors is not to teach you everything there is to know about AWS and genomics, or even about the intersection of the two fields, but rather to show you the following:
- Enough of the general concepts of cloud computing and genomics that you understand the problems to be solved and the technologies available to work on those problems
- Enough specifics to enable you to work through actual genomics tasks and see results
Who Should Read This Book
This book is intended for people who aren't content to use commercial genome sequencing services and want to do their own analysis. We walk you through the process of getting raw data from a blood sample via a lab and then using AWS services to analyze it, learning which genes are present in the sample and what they might say about you and your health. This will enable you to investigate aspects of your genome that commercial services don't explore because they are not allowed to give medical advice.
This book is also suited to people who want to learn about the AWS cloud and want to structure their study around a useful field: genomics.
Genomics
At the core of genomics is genome sequencing, which is the process of taking some biological material, such as blood or tissue, and converting it to pure information. This is a complicated process that combines the traditional work of a biologist (which is to say, manipulating actual cells in a "wet lab" environment) with information technology. Cells go into the process; a computer-readable data file comes out.
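The book's description names FASTQ as the raw file a lab delivers, so it may help to see what that computer-readable output looks like. The following is a minimal sketch, not the book's own code: it assumes the common four-line FASTQ layout (an `@` header, the bases, a `+` separator, and a quality string) and the widespread Phred+33 quality encoding. The names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class FastqRecord:
    identifier: str  # text after the leading "@"
    sequence: str    # base calls, e.g. "GATTACA"
    quality: str     # one ASCII character per base


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record.

    Reading stops at the end of input or at the first empty header line.
    """
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None or header == "":
            return
        sequence = next(stripped, "")
        separator = next(stripped, "")
        quality = next(stripped, "")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters (assumption: Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


# Two made-up reads, purely for demonstration.
example = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(read_fastq(example.splitlines()))
for record in records:
    print(record.identifier, record.sequence, phred_scores(record.quality))
```

Real sequencing output is usually compressed and far too large to hold in memory, which is why the reader above is written as a generator that yields one record at a time.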
Genome sequencing took a long time to figure out. Crude, expensive methods were first employed in the 1970s and 1980s. More automated methods became available in the late 1990s, and these enabled the sequencing of relatively simple organisms: yeasts, bacteria, and a nearly microscopic nematode worm (Caenorhabditis elegans, long popular in biology labs as an experimental subject). Uncomplicated plants (notably Arabidopsis thaliana, a European weed with a particularly small genome) and modest insects (Drosophila melanogaster, the fruit fly), both longtime standard experimental subjects, soon followed around the turn of the century.
Two biologists described the first draft human genome sequence in an article in the journal Science in 2001. Scientists have worked to refine the human genome sequence since then and also have worked to sequence the genomes of thousands of other organisms.
Key to their work has been a continuous drop in the cost of full-genome sequencing. The first human genome sequence in 2001 cost roughly 2.7 billion U.S. dollars to produce; it required funding of the sort only national governments could provide. Within less than a decade, by 2005, the cost had fallen by more than three orders of magnitude to something like $1 million, which is still quite a lot. At this writing, in 2022, it is possible to have a human genome sequenced for less than the cost of a high-end smartphone, and plenty of companies are attracting funding for their plans to bring the cost to less than $100. By the time the asteroid 99942 Apophis makes its closest approach to Earth in 2029, sequencing a full genome will almost certainly cost about the same as the simplest routine medical blood test costs now. The cost of knowing everything about your genetic makeup will be trivial (assuming Apophis doesn't render this unimportant, which the latest reports assure us it won't).
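The size of the early price drop can be checked with a quick calculation. This sketch uses only the two rounded figures quoted above (roughly $2.7 billion and something like $1 million), so the result is approximate; the variable names are illustrative.

```python
import math

first_genome_cost_usd = 2.7e9  # rounded figure quoted for the first sequence
later_cost_usd = 1.0e6         # "something like $1 million"

# How many times cheaper, and how many powers of ten that represents.
fold_reduction = first_genome_cost_usd / later_cost_usd
orders_of_magnitude = math.log10(fold_reduction)

print(f"{fold_reduction:,.0f}-fold cheaper")
print(f"about {orders_of_magnitude:.1f} orders of magnitude")
```

On these figures the drop works out to between three and four orders of magnitude.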
The dramatic fall of the price of genome sequencing, from billions of dollars at the turn of the twenty-first century to a few hundred dollars today, makes it possible for almost all of us to explore our genetic makeup. While we all, as humans, share practically all of our genetic code (upwards of 99.9 percent), the differences make all the difference.
The tiny fraction of our individual genomes that differs from those of other humans is what accounts for whether we are male or female, all of our physical characteristics, many of our personality traits, and our propensity to health or various kinds of disease.
The availability of low-cost genome sequencing has revolutionized medical and pharmaceutical research and has started to change the practice of medicine. It also enables us to start to understand the building blocks of life and how much, or rather how little, we differ from other life forms.
Genomics in the AWS Cloud is about discovering and studying those differences and learning from them. But there is another part to the equation, which is to say the other set of tools and techniques identified in the title.
Cloud Computing and AWS
Almost in parallel with the advances in genomics that took place between 1995 and the present day, cloud computing evolved enormously, and today it represents a standard way of designing, deploying, and operating information processing systems.
Now, the idea of computing resources that are not local to the people who need them is not new at all. The earliest commercial and scientific computers were, of course, mainframes that were shared across many users, and more than a few of these remain in place today. Servers in organizational or co-location data centers, providing storage and computing resources to privileged users and the general public, have long been part of information technology. In the case of mainframes and client-server systems, users access remote computer systems (often not knowing or caring where they actually are). Functionally, that's cloud computing, and it's not a new thing.
What is new is the ease with which modern cloud computing platforms allow rapid construction and cost-effective use of complex and powerful systems. You can quickly set up elaborate workflows, test them with minimal computing power, and then scale them up enormously when it's time for a production run. More or less, you pay only for the computing power you use, and there are ways to schedule the use of processor cycles for times of low demand, when computing is cheaper. With the exception of storage and data transfer (meaning machine images that can be turned into working compute resources, as well as input and output data), systems configured in the cloud can cost practically nothing when they are not doing useful work. Such efficient use of expensive resources isn't possible with on-premises or traditionally hosted solutions.
There are three main players in the cloud-computing industry.
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
Each of them has its points of strength and weakness, and those relative merits are beyond the scope of this publication. Most organizations mix and match parts of each anyway to take advantage of relative technical superiorities and to maintain leverage with vendors. Suffice it to say that we chose to do our genomics work on AWS.
Considering that genome analysis is essentially a workflow to be carried out on storage and computing resources, AWS is well suited to the job. Here are some of the tools we will use:
- Elastic Compute Cloud (EC2) for building and running the Linux servers that actually run the software required for genome analysis
- Elastic Block Store (EBS) for maintaining updated disk images, ready to attach to a working machine when needed
- Simple Storage Service (S3) for storing input and output files when ready access to them is needed
- Simple Workflow Service (SWF) for automating processes
- Glacier when files need to be archived at low cost
- Identity and Access Management (IAM) for maintaining security and appropriate user privileges
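To make the split between S3 and Glacier in the list above more concrete, here is a minimal sketch of choosing a storage class at upload time. It is not taken from the book. It assumes the boto3 library's S3 client, whose `upload_file` method accepts `ExtraArgs={"StorageClass": ...}`; the day thresholds, the bucket name, the object key, and the helper names are all hypothetical. A small stand-in client is used so the sketch runs without AWS credentials.

```python
from typing import Any, Dict, List, Tuple


def pick_storage_class(days_until_next_use: int) -> str:
    """Map expected access to an S3 storage class name.

    The class names are real S3 values; the thresholds are hypothetical.
    """
    if days_until_next_use <= 30:
        return "STANDARD"      # ready access to working files
    if days_until_next_use <= 180:
        return "GLACIER"       # low-cost archive
    return "DEEP_ARCHIVE"      # long-term, rarely retrieved


def upload_genomic_file(s3_client: Any, local_path: str, bucket: str,
                        key: str, days_until_next_use: int) -> Dict[str, str]:
    """Upload one file with a storage class chosen from expected access."""
    storage_class = pick_storage_class(days_until_next_use)
    s3_client.upload_file(local_path, bucket, key,
                          ExtraArgs={"StorageClass": storage_class})
    return {"bucket": bucket, "key": key, "storage_class": storage_class}


class RecordingClient:
    """Stand-in for an S3 client; it records calls instead of uploading."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, str, Any]] = []

    def upload_file(self, Filename, Bucket, Key, ExtraArgs=None):
        self.calls.append((Filename, Bucket, Key, ExtraArgs))


# With real credentials one would pass boto3.client("s3") instead (assumption).
fake_client = RecordingClient()
result = upload_genomic_file(
    fake_client,
    "sample.fastq.gz",              # hypothetical local file
    "example-genomics-bucket",      # hypothetical bucket name
    "raw/sample.fastq.gz",          # hypothetical object key
    days_until_next_use=365,
)
print(result)
```

Passing the client in as an argument keeps the storage decision testable on a desktop before any AWS resources exist.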
What You'll Learn from This Book
This book is intended to educate its readers in two areas: the science of genomics and the technology of Amazon Web Services. The idea is that you use the latter as a tool to explore the former.
Since it's unlikely that many readers are familiar with both genomics and AWS, this book is meant to teach you either subject (or both, if you are familiar with neither) and how they work together.
How This Book Is Organized
Here is a quick introduction to each chapter in this book. You can skip directly to the parts that interest you most, or you can read from beginning to end to get a complete picture.
- Chapter 1: Why Do Genome Analysis Yourself When Commercial Offerings Exist? This chapter explains what turnkey commercial services (such as 23andMe) exist and what they are good for. It then explains what they do not do and why you might want to do your own genomics work.
- Chapter 2: A Crash Course in Molecular...
System Requirements
File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see the e-book help).
- Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app or the PocketBook app before downloading (see the e-book help).
- E-book reader: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)
The ePUB file format is very well suited to novels and nonfiction, that is, to "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays.
Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the e-book. You therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend authorizing the reading software with your personal Adobe ID after installing it.
Further information is available in our e-book help.