Purpose, Scope, and Intended Audience of This Book
What You Will Learn from This Book
What Computational Experience Is Needed for the Exercises?
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
1. Introduction
The Promises and Challenges of Big Data in Biology and Life Sciences
Infrastructure Challenges
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
Cloud-Hosted Data and Compute
Platforms for Research in the Life Sciences
Standardization and Reuse of Infrastructure
Being FAIR
Wrap-Up and Next Steps
2. Genomics in a Nutshell: A Primer for Newcomers to the Field
Introduction to Genomics
The Gene as a Discrete Unit of Inheritance (Sort Of)
The Central Dogma of Biology: DNA to RNA to Protein
The Origins and Consequences of DNA Mutations
Genomics as an Inventory of Variation in and Among Genomes
The Challenge of Genomic Scale, by the Numbers
Genomic Variation
The Reference Genome as Common Framework
Physical Classification of Variants
Germline Variants Versus Somatic Alterations
High-Throughput Sequencing Data Generation
From Biological Sample to Huge Pile of Read Data
Types of DNA Libraries: Choosing the Right Experimental Design
Data Processing and Analysis
Mapping Reads to the Reference Genome
Variant Calling
Data Quality and Sources of Error
Functional Equivalence Pipeline Specification
Wrap-Up and Next Steps
3. Computing Technology Basics for Life Scientists
Basic Infrastructure Components and Performance Bottlenecks
Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
Levels of Compute Organization: Core, Node, Cluster, and Cloud
Addressing Performance Bottlenecks
Parallel Computing
Parallelizing a Simple Analysis
From Cores to Clusters and Clouds: Many Levels of Parallelism
Trade-Offs of Parallelism: Speed, Efficiency, and Cost
Pipelining for Parallelization and Automation
Workflow Languages
Popular Pipelining Languages for Genomics
Workflow Management Systems
Virtualization and the Cloud
VMs and Containers
Introducing the Cloud
Categories of Research Use Cases for Cloud Services
Wrap-Up and Next Steps
4. First Steps in the Cloud
Setting Up Your Google Cloud Account and First Project
Creating a Project
Checking Your Billing Account and Activating Free Credits
Running Basic Commands in Google Cloud Shell
Logging in to the Cloud Shell VM
Using gsutil to Access and Manage Files
Pulling a Docker Image and Spinning Up the Container
Mounting a Volume to Access the Filesystem from Within the Container
Setting Up Your Own Custom VM
Creating and Configuring Your VM Instance
Logging in to Your VM by Using SSH
Checking Your Authentication
Copying the Book Materials to Your VM
Installing Docker on Your VM
Setting Up the GATK Container Image
Stopping Your VM…to Stop It from Costing You Money
Configuring IGV to Read Data from GCS Buckets
Wrap-Up and Next Steps
5. First Steps with GATK
Getting Started with GATK
Operating Requirements
Command-Line Syntax
Multithreading with Spark
Running GATK in Practice
Getting Started with Variant Discovery
Calling Germline SNPs and Indels with HaplotypeCaller
Filtering Based on Variant Context Annotations
Introducing the GATK Best Practices
Best Practices Workflows Covered in This Book
Other Major Use Cases
Wrap-Up and Next Steps
6. GATK Best Practices for Germline Short Variant Discovery
Data Preprocessing
Mapping Reads to the Genome Reference
Marking Duplicates
Recalibrating Base Quality Scores
Joint Discovery Analysis
Overview of the Joint Calling Workflow
Calling Variants per Sample to Generate GVCFs
Consolidating GVCFs
Applying Joint Genotyping to Multiple Samples
Filtering the Joint Callset with Variant Quality Score Recalibration
Refining Genotype Assignments and Adjusting Genotype Confidence
Next Steps and Further Reading
Single-Sample Calling with CNN Filtering
Overview of the CNN Single-Sample Workflow
Applying 1D CNN to Filter a Single-Sample WGS Callset
Applying 2D CNN to Include Read Data in the Modeling
Wrap-Up and Next Steps
7. GATK Best Practices for Somatic Variant Discovery
Challenges in Cancer Genomics
Somatic Short Variants (SNVs and Indels)
Overview of the Tumor-Normal Pair Analysis Workflow
Creating a Mutect2 PoN
Running Mutect2 on the Tumor-Normal Pair
Estimating Cross-Sample Contamination
Filtering Mutect2 Calls
Annotating Predicted Functional Effects with Funcotator
Somatic Copy-Number Alterations
Overview of the Tumor-Only Analysis Workflow
Creating a Somatic CNA PoN
Applying Denoising
Performing Segmentation and Calling CNAs
Additional Analysis Options
Wrap-Up and Next Steps
8. Automating Analysis Execution with Workflows
Introducing WDL and Cromwell
Installing and Setting Up Cromwell
Your First WDL: Hello World
Learning Basic WDL Syntax Through a Minimalist Example
Running a Simple WDL with Cromwell on Your Google VM
Interpreting the Important Parts of Cromwell’s Logging Output
Adding a Variable and Providing Inputs via JSON
Adding Another Task to Make It a Proper Workflow
Your First GATK Workflow: Hello HaplotypeCaller
Exploring the WDL
Generating the Inputs JSON
Running the Workflow
Breaking the Workflow to Test Syntax Validation and Error Messaging
Introducing Scatter-Gather Parallelism
Exploring the WDL
Generating a Graph Diagram for Visualization
Wrap-Up and Next Steps
9. Deciphering Real Genomics Workflows
Mystery Workflow #1: Flexibility Through Conditionals
Mapping Out the Workflow
Reverse Engineering the Conditional Switch
Mystery Workflow #2: Modularity and Code Reuse
Mapping Out the Workflow
Unpacking the Nesting Dolls
Wrap-Up and Next Steps
10. Running Single Workflows at Scale with Pipelines API
Introducing the GCP Genomics Pipelines API Service
Enabling Genomics API and Related APIs in Your Google Cloud Project
Directly Dispatching Cromwell Jobs to PAPI
Configuring Cromwell to Communicate with PAPI
Running Scattered HaplotypeCaller via PAPI
Monitoring Workflow Execution on Google Compute Engine
Understanding and Optimizing Workflow Efficiency
Granularity of Operations
Balance of Time Versus Money
Suggested Cost-Saving Optimizations
Platform-Specific Optimization Versus Portability
Wrapping Cromwell and PAPI Execution with WDL Runner
Setting Up WDL Runner
Running the Scattered HaplotypeCaller Workflow with WDL Runner
Monitoring WDL Runner Execution
Wrap-Up and Next Steps
11. Running Many Workflows Conveniently in Terra
Getting Started with Terra
Creating an Account
Creating a Billing Project
Cloning the Preconfigured Workspace
Running Workflows with the Cromwell Server in Terra
Running a Workflow on a Single Sample
Running a Workflow on Multiple Samples in a Data Table
Monitoring Workflow Execution
Locating Workflow Outputs in the Data Table
Running the Same Workflow Again to Demonstrate Call Caching
Running a Real GATK Best Practices Pipeline at Full Scale
Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
Examining the Preloaded Data
Selecting Data and Configuring the Full-Scale Workflow
Launching the Full-Scale Workflow and Monitoring Execution
Options for Downloading Output Data—or Not
Wrap-Up and Next Steps
12. Interactive Analysis in Jupyter Notebook
Introduction to Jupyter in Terra
Jupyter Notebooks in General
How Jupyter Notebooks Work in Terra
Getting Started with Jupyter in Terra
Inspecting and Customizing the Notebook Runtime Configuration
Opening the Notebook in Edit Mode and Checking the Kernel
Running the Hello World Cells
Using gsutil to Interact with Google Cloud Storage Buckets
Setting Up a Variable Pointing to the Germline Data in the Book Bucket
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
Visualizing Genomic Data in an Embedded IGV Window
Setting Up the Embedded IGV Browser
Adding Data to the IGV Browser
Setting Up an Access Token to View Private Data
Running GATK Commands to Learn, Test, or Troubleshoot
Running a Basic GATK Command: HaplotypeCaller
Loading the Data (BAM and VCF) into IGV
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
Visualizing Variant Context Annotation Data
Exporting Annotations of Interest with VariantsToTable
Loading R Script to Make Plotting Functions Available
Making Density Plots for QUAL by Using makeDensityPlot
Making a Scatter Plot of QUAL Versus DP
Making a Scatter Plot Flanked by Marginal Density Plots
Wrap-Up and Next Steps
13. Assembling Your Own Workspace in Terra
Managing Data Inside and Outside of Workspaces
The Workspace Bucket as Data Repository
Accessing Private Data That You Manage Outside of Terra
Accessing Data in the Terra Data Library
Re-Creating the Tutorial Workspace from Base Components
Creating a New Workspace
Adding the Workflow to the Methods Repository and Importing It into the Workspace
Creating a Configuration Quickly with a JSON File
Adding the Data Table
Filling in the Workspace Resource Data Table
Creating a Workflow Configuration That Uses the Data Tables
Adding the Notebook and Checking the Runtime Environment
Documenting Your Workspace and Sharing It
Starting from a GATK Best Practices Workspace
Cloning a GATK Best Practices Workspace
Examining GATK Workspace Data Tables to Understand How the Data Is Structured
Getting to Know the 1000 Genomes High Coverage Dataset
Copying Data Tables from the 1000 Genomes Workspace
Using TSV Load Files to Import Data from the 1000 Genomes Workspace
Running a Joint-Calling Analysis on the Federated Dataset
Building a Workspace Around a Dataset
Cloning the 1000 Genomes Data Workspace
Importing a Workflow from Dockstore
Configuring the Workflow to Use the Data Tables
Wrap-Up and Next Steps
14. Making a Fully Reproducible Paper
Overview of the Case Study
Computational Reproducibility and the FAIR Framework
Original Research Study and History of the Case Study
Assessing the Available Information and Key Challenges
Designing a Reproducible Implementation
Generating a Synthetic Dataset as a Stand-In for the Private Data
Overall Methodology
Retrieving the Variant Data from 1000 Genomes Participants
Creating Fake Exomes Based on Real People
Mutating the Fake Exomes
Generating the Definitive Dataset
Re-Creating the Data Processing and Analysis Methodology
Mapping and Variant Discovery
Variant Effect Prediction, Prioritization, and Variant Load Analysis
Analytical Performance of the New Implementation
The Long, Winding Road to FAIRness
Final Conclusions
