Genomics in the Cloud by O'Connor, Brian D.

Index

Foreword
Preface

Purpose, Scope, and Intended Audience of This Book

What You Will Learn from This Book
What Computational Experience Is Needed for the Exercises?

Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

1. Introduction

The Promises and Challenges of Big Data in Biology and Life Sciences
Infrastructure Challenges
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis

Cloud-Hosted Data and Compute Platforms for Research in the Life Sciences
Standardization and Reuse of Infrastructure

Being FAIR
Wrap-Up and Next Steps

2. Genomics in a Nutshell: A Primer for Newcomers to the Field

Introduction to Genomics

The Gene as a Discrete Unit of Inheritance (Sort Of)
The Central Dogma of Biology: DNA to RNA to Protein
The Origins and Consequences of DNA Mutations
Genomics as an Inventory of Variation in and Among Genomes
The Challenge of Genomic Scale, by the Numbers

Genomic Variation

The Reference Genome as Common Framework
Physical Classification of Variants
Germline Variants Versus Somatic Alterations

High-Throughput Sequencing Data Generation

From Biological Sample to Huge Pile of Read Data
Types of DNA Libraries: Choosing the Right Experimental Design

Data Processing and Analysis

Mapping Reads to the Reference Genome
Variant Calling
Data Quality and Sources of Error
Functional Equivalence Pipeline Specification

Wrap-Up and Next Steps

3. Computing Technology Basics for Life Scientists

Basic Infrastructure Components and Performance Bottlenecks

Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
Levels of Compute Organization: Core, Node, Cluster, and Cloud
Addressing Performance Bottlenecks

Parallel Computing

Parallelizing a Simple Analysis
From Cores to Clusters and Clouds: Many Levels of Parallelism
Trade-Offs of Parallelism: Speed, Efficiency, and Cost

Pipelining for Parallelization and Automation

Workflow Languages
Popular Pipelining Languages for Genomics
Workflow Management Systems

Virtualization and the Cloud

VMs and Containers
Introducing the Cloud
Categories of Research Use Cases for Cloud Services

Wrap-Up and Next Steps

4. First Steps in the Cloud

Setting Up Your Google Cloud Account and First Project

Creating a Project
Checking Your Billing Account and Activating Free Credits

Running Basic Commands in Google Cloud Shell

Logging in to the Cloud Shell VM
Using gsutil to Access and Manage Files
Pulling a Docker Image and Spinning Up the Container
Mounting a Volume to Access the Filesystem from Within the Container

Setting Up Your Own Custom VM

Creating and Configuring Your VM Instance
Logging into Your VM by Using SSH
Checking Your Authentication
Copying the Book Materials to Your VM
Installing Docker on Your VM
Setting Up the GATK Container Image
Stopping Your VM…to Stop It from Costing You Money

Configuring IGV to Read Data from GCS Buckets
Wrap-Up and Next Steps

5. First Steps with GATK

Getting Started with GATK

Operating Requirements
Command-Line Syntax
Multithreading with Spark
Running GATK in Practice

Getting Started with Variant Discovery

Calling Germline SNPs and Indels with HaplotypeCaller
Filtering Based on Variant Context Annotations

Introducing the GATK Best Practices

Best Practices Workflows Covered in This Book

Other Major Use Cases

Wrap-Up and Next Steps

6. GATK Best Practices for Germline Short Variant Discovery

Data Preprocessing

Mapping Reads to the Genome Reference
Marking Duplicates
Recalibrating Base Quality Scores

Joint Discovery Analysis

Overview of the Joint Calling Workflow
Calling Variants per Sample to Generate GVCFs
Consolidating GVCFs
Applying Joint Genotyping to Multiple Samples
Filtering the Joint Callset with Variant Quality Score Recalibration
Refining Genotype Assignments and Adjusting Genotype Confidence
Next Steps and Further Reading

Single-Sample Calling with CNN Filtering

Overview of the CNN Single-Sample Workflow
Applying 1D CNN to Filter a Single-Sample WGS Callset
Applying 2D CNN to Include Read Data in the Modeling

Wrap-Up and Next Steps

7. GATK Best Practices for Somatic Variant Discovery

Challenges in Cancer Genomics
Somatic Short Variants (SNVs and Indels)

Overview of the Tumor-Normal Pair Analysis Workflow
Creating a Mutect2 PoN
Running Mutect2 on the Tumor-Normal Pair
Estimating Cross-Sample Contamination
Filtering Mutect2 Calls
Annotating Predicted Functional Effects with Funcotator

Somatic Copy-Number Alterations

Overview of the Tumor-Only Analysis Workflow
Creating a Somatic CNA PoN
Applying Denoising
Performing Segmentation and Call CNAs
Additional Analysis Options

Wrap-Up and Next Steps

8. Automating Analysis Execution with Workflows

Introducing WDL and Cromwell
Installing and Setting Up Cromwell
Your First WDL: Hello World

Learning Basic WDL Syntax Through a Minimalist Example
Running a Simple WDL with Cromwell on Your Google VM
Interpreting the Important Parts of Cromwell’s Logging Output
Adding a Variable and Providing Inputs via JSON
Adding Another Task to Make It a Proper Workflow

Your First GATK Workflow: Hello HaplotypeCaller

Exploring the WDL
Generating the Inputs JSON
Running the Workflow
Breaking the Workflow to Test Syntax Validation and Error Messaging

Introducing Scatter-Gather Parallelism

Exploring the WDL
Generating a Graph Diagram for Visualization

Wrap-Up and Next Steps

9. Deciphering Real Genomics Workflows

Mystery Workflow #1: Flexibility Through Conditionals

Mapping Out the Workflow
Reverse Engineering the Conditional Switch

Mystery Workflow #2: Modularity and Code Reuse

Mapping Out the Workflow
Unpacking the Nesting Dolls

Wrap-Up and Next Steps

10. Running Single Workflows at Scale with Pipelines API

Introducing the GCP Genomics Pipelines API Service

Enabling Genomics API and Related APIs in Your Google Cloud Project

Directly Dispatching Cromwell Jobs to PAPI

Configuring Cromwell to Communicate with PAPI
Running Scattered HaplotypeCaller via PAPI
Monitoring Workflow Execution on Google Compute Engine

Understanding and Optimizing Workflow Efficiency

Granularity of Operations
Balance of Time Versus Money
Suggested Cost-Saving Optimizations
Platform-Specific Optimization Versus Portability

Wrapping Cromwell and PAPI Execution with WDL Runner

Setting Up WDL Runner
Running the Scattered HaplotypeCaller Workflow with WDL Runner
Monitoring WDL Runner Execution

Wrap-Up and Next Steps

11. Running Many Workflows Conveniently in Terra

Getting Started with Terra

Creating an Account
Creating a Billing Project
Cloning the Preconfigured Workspace

Running Workflows with the Cromwell Server in Terra

Running a Workflow on a Single Sample
Running a Workflow on Multiple Samples in a Data Table
Monitoring Workflow Execution
Locating Workflow Outputs in the Data Table
Running the Same Workflow Again to Demonstrate Call Caching

Running a Real GATK Best Practices Pipeline at Full Scale

Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
Examining the Preloaded Data
Selecting Data and Configuring the Full-Scale Workflow
Launching the Full-Scale Workflow and Monitoring Execution
Options for Downloading Output Data—or Not

Wrap-Up and Next Steps

12. Interactive Analysis in Jupyter Notebook

Introduction to Jupyter in Terra

Jupyter Notebooks in General
How Jupyter Notebooks Work in Terra

Getting Started with Jupyter in Terra

Inspecting and Customizing the Notebook Runtime Configuration
Opening Notebook in Edit Mode and Checking the Kernel
Running the Hello World Cells
Using gsutil to Interact with Google Cloud Storage Buckets
Setting Up a Variable Pointing to the Germline Data in the Book Bucket
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket

Visualizing Genomic Data in an Embedded IGV Window

Setting Up the Embedded IGV Browser
Adding Data to the IGV Browser
Setting Up an Access Token to View Private Data

Running GATK Commands to Learn, Test, or Troubleshoot

Running a Basic GATK Command: HaplotypeCaller
Loading the Data (BAM and VCF) into IGV
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser

Visualizing Variant Context Annotation Data

Exporting Annotations of Interest with VariantsToTable
Loading R Script to Make Plotting Functions Available
Making Density Plots for QUAL by Using makeDensityPlot
Making a Scatter Plot of QUAL Versus DP
Making a Scatter Plot Flanked by Marginal Density Plots

Wrap-Up and Next Steps

13. Assembling Your Own Workspace in Terra

Managing Data Inside and Outside of Workspaces

The Workspace Bucket as Data Repository
Accessing Private Data That You Manage Outside of Terra
Accessing Data in the Terra Data Library

Re-Creating the Tutorial Workspace from Base Components

Creating a New Workspace
Adding the Workflow to the Methods Repository and Importing It into the Workspace
Creating a Configuration Quickly with a JSON File
Adding the Data Table
Filling in the Workspace Resource Data Table
Creating a Workflow Configuration That Uses the Data Tables
Adding the Notebook and Checking the Runtime Environment
Documenting Your Workspace and Sharing It

Starting from a GATK Best Practices Workspace

Cloning a GATK Best Practices Workspace
Examining GATK Workspace Data Tables to Understand How the Data Is Structured
Getting to Know the 1000 Genomes High Coverage Dataset
Copying Data Tables from the 1000 Genomes Workspace
Using TSV Load Files to Import Data from the 1000 Genomes Workspace
Running a Joint-Calling Analysis on the Federated Dataset

Building a Workspace Around a Dataset

Cloning the 1000 Genomes Data Workspace
Importing a Workflow from Dockstore
Configuring the Workflow to Use the Data Tables

Wrap-Up and Next Steps

14. Making a Fully Reproducible Paper

Overview of the Case Study

Computational Reproducibility and the FAIR Framework
Original Research Study and History of the Case Study
Assessing the Available Information and Key Challenges
Designing a Reproducible Implementation

Generating a Synthetic Dataset as a Stand-In for the Private Data

Overall Methodology
Retrieving the Variant Data from 1000 Genomes Participants
Creating Fake Exomes Based on Real People
Mutating the Fake Exomes
Generating the Definitive Dataset

Re-Creating the Data Processing and Analysis Methodology

Mapping and Variant Discovery
Variant Effect Prediction, Prioritization, and Variant Load Analysis
Analytical Performance of the New Implementation

The Long, Winding Road to FAIRness
Final Conclusions

Glossary
Index
