Index
Foreword
Preface
Purpose, Scope, and Intended Audience of This Book
What You Will Learn from This Book
What Computational Experience Is Needed for the Exercises?
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Introduction
The Promises and Challenges of Big Data in Biology and Life Sciences
Infrastructure Challenges
Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
Cloud-Hosted Data and Compute Platforms for Research in the Life Sciences
Standardization and Reuse of Infrastructure
Being FAIR
Wrap-Up and Next Steps
2. Genomics in a Nutshell: A Primer for Newcomers to the Field
Introduction to Genomics
The Gene as a Discrete Unit of Inheritance (Sort Of)
The Central Dogma of Biology: DNA to RNA to Protein
The Origins and Consequences of DNA Mutations
Genomics as an Inventory of Variation in and Among Genomes
The Challenge of Genomic Scale, by the Numbers
Genomic Variation
The Reference Genome as Common Framework
Physical Classification of Variants
Germline Variants Versus Somatic Alterations
High-Throughput Sequencing Data Generation
From Biological Sample to Huge Pile of Read Data
Types of DNA Libraries: Choosing the Right Experimental Design
Data Processing and Analysis
Mapping Reads to the Reference Genome
Variant Calling
Data Quality and Sources of Error
Functional Equivalence Pipeline Specification
Wrap-Up and Next Steps
3. Computing Technology Basics for Life Scientists
Basic Infrastructure Components and Performance Bottlenecks
Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
Levels of Compute Organization: Core, Node, Cluster, and Cloud
Addressing Performance Bottlenecks
Parallel Computing
Parallelizing a Simple Analysis
From Cores to Clusters and Clouds: Many Levels of Parallelism
Trade-Offs of Parallelism: Speed, Efficiency, and Cost
Pipelining for Parallelization and Automation
Workflow Languages
Popular Pipelining Languages for Genomics
Workflow Management Systems
Virtualization and the Cloud
VMs and Containers
Introducing the Cloud
Categories of Research Use Cases for Cloud Services
Wrap-Up and Next Steps
4. First Steps in the Cloud
Setting Up Your Google Cloud Account and First Project
Creating a Project
Checking Your Billing Account and Activating Free Credits
Running Basic Commands in Google Cloud Shell
Logging in to the Cloud Shell VM
Using gsutil to Access and Manage Files
Pulling a Docker Image and Spinning Up the Container
Mounting a Volume to Access the Filesystem from Within the Container
Setting Up Your Own Custom VM
Creating and Configuring Your VM Instance
Logging into Your VM by Using SSH
Checking Your Authentication
Copying the Book Materials to Your VM
Installing Docker on Your VM
Setting Up the GATK Container Image
Stopping Your VM…to Stop It from Costing You Money
Configuring IGV to Read Data from GCS Buckets
Wrap-Up and Next Steps
5. First Steps with GATK
Getting Started with GATK
Operating Requirements
Command-Line Syntax
Multithreading with Spark
Running GATK in Practice
Getting Started with Variant Discovery
Calling Germline SNPs and Indels with HaplotypeCaller
Filtering Based on Variant Context Annotations
Introducing the GATK Best Practices
Best Practices Workflows Covered in This Book
Other Major Use Cases
Wrap-Up and Next Steps
6. GATK Best Practices for Germline Short Variant Discovery
Data Preprocessing
Mapping Reads to the Genome Reference
Marking Duplicates
Recalibrating Base Quality Scores
Joint Discovery Analysis
Overview of the Joint Calling Workflow
Calling Variants per Sample to Generate GVCFs
Consolidating GVCFs
Applying Joint Genotyping to Multiple Samples
Filtering the Joint Callset with Variant Quality Score Recalibration
Refining Genotype Assignments and Adjusting Genotype Confidence
Next Steps and Further Reading
Single-Sample Calling with CNN Filtering
Overview of the CNN Single-Sample Workflow
Applying 1D CNN to Filter a Single-Sample WGS Callset
Applying 2D CNN to Include Read Data in the Modeling
Wrap-Up and Next Steps
7. GATK Best Practices for Somatic Variant Discovery
Challenges in Cancer Genomics
Somatic Short Variants (SNVs and Indels)
Overview of the Tumor-Normal Pair Analysis Workflow
Creating a Mutect2 PoN
Running Mutect2 on the Tumor-Normal Pair
Estimating Cross-Sample Contamination
Filtering Mutect2 Calls
Annotating Predicted Functional Effects with Funcotator
Somatic Copy-Number Alterations
Overview of the Tumor-Only Analysis Workflow
Creating a Somatic CNA PoN
Applying Denoising
Performing Segmentation and Calling CNAs
Additional Analysis Options
Wrap-Up and Next Steps
8. Automating Analysis Execution with Workflows
Introducing WDL and Cromwell
Installing and Setting Up Cromwell
Your First WDL: Hello World
Learning Basic WDL Syntax Through a Minimalist Example
Running a Simple WDL with Cromwell on Your Google VM
Interpreting the Important Parts of Cromwell’s Logging Output
Adding a Variable and Providing Inputs via JSON
Adding Another Task to Make It a Proper Workflow
Your First GATK Workflow: Hello HaplotypeCaller
Exploring the WDL
Generating the Inputs JSON
Running the Workflow
Breaking the Workflow to Test Syntax Validation and Error Messaging
Introducing Scatter-Gather Parallelism
Exploring the WDL
Generating a Graph Diagram for Visualization
Wrap-Up and Next Steps
9. Deciphering Real Genomics Workflows
Mystery Workflow #1: Flexibility Through Conditionals
Mapping Out the Workflow
Reverse Engineering the Conditional Switch
Mystery Workflow #2: Modularity and Code Reuse
Mapping Out the Workflow
Unpacking the Nesting Dolls
Wrap-Up and Next Steps
10. Running Single Workflows at Scale with Pipelines API
Introducing the GCP Genomics Pipelines API Service
Enabling Genomics API and Related APIs in Your Google Cloud Project
Directly Dispatching Cromwell Jobs to PAPI
Configuring Cromwell to Communicate with PAPI
Running Scattered HaplotypeCaller via PAPI
Monitoring Workflow Execution on Google Compute Engine
Understanding and Optimizing Workflow Efficiency
Granularity of Operations
Balance of Time Versus Money
Suggested Cost-Saving Optimizations
Platform-Specific Optimization Versus Portability
Wrapping Cromwell and PAPI Execution with WDL Runner
Setting Up WDL Runner
Running the Scattered HaplotypeCaller Workflow with WDL Runner
Monitoring WDL Runner Execution
Wrap-Up and Next Steps
11. Running Many Workflows Conveniently in Terra
Getting Started with Terra
Creating an Account
Creating a Billing Project
Cloning the Preconfigured Workspace
Running Workflows with the Cromwell Server in Terra
Running a Workflow on a Single Sample
Running a Workflow on Multiple Samples in a Data Table
Monitoring Workflow Execution
Locating Workflow Outputs in the Data Table
Running the Same Workflow Again to Demonstrate Call Caching
Running a Real GATK Best Practices Pipeline at Full Scale
Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
Examining the Preloaded Data
Selecting Data and Configuring the Full-Scale Workflow
Launching the Full-Scale Workflow and Monitoring Execution
Options for Downloading Output Data—or Not
Wrap-Up and Next Steps
12. Interactive Analysis in Jupyter Notebook
Introduction to Jupyter in Terra
Jupyter Notebooks in General
How Jupyter Notebooks Work in Terra
Getting Started with Jupyter in Terra
Inspecting and Customizing the Notebook Runtime Configuration
Opening Notebook in Edit Mode and Checking the Kernel
Running the Hello World Cells
Using gsutil to Interact with Google Cloud Storage Buckets
Setting Up a Variable Pointing to the Germline Data in the Book Bucket
Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
Visualizing Genomic Data in an Embedded IGV Window
Setting Up the Embedded IGV Browser
Adding Data to the IGV Browser
Setting Up an Access Token to View Private Data
Running GATK Commands to Learn, Test, or Troubleshoot
Running a Basic GATK Command: HaplotypeCaller
Loading the Data (BAM and VCF) into IGV
Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
Visualizing Variant Context Annotation Data
Exporting Annotations of Interest with VariantsToTable
Loading R Script to Make Plotting Functions Available
Making Density Plots for QUAL by Using makeDensityPlot
Making a Scatter Plot of QUAL Versus DP
Making a Scatter Plot Flanked by Marginal Density Plots
Wrap-Up and Next Steps
13. Assembling Your Own Workspace in Terra
Managing Data Inside and Outside of Workspaces
The Workspace Bucket as Data Repository
Accessing Private Data That You Manage Outside of Terra
Accessing Data in the Terra Data Library
Re-Creating the Tutorial Workspace from Base Components
Creating a New Workspace
Adding the Workflow to the Methods Repository and Importing It into the Workspace
Creating a Configuration Quickly with a JSON File
Adding the Data Table
Filling in the Workspace Resource Data Table
Creating a Workflow Configuration That Uses the Data Tables
Adding the Notebook and Checking the Runtime Environment
Documenting Your Workspace and Sharing It
Starting from a GATK Best Practices Workspace
Cloning a GATK Best Practices Workspace
Examining GATK Workspace Data Tables to Understand How the Data Is Structured
Getting to Know the 1000 Genomes High Coverage Dataset
Copying Data Tables from the 1000 Genomes Workspace
Using TSV Load Files to Import Data from the 1000 Genomes Workspace
Running a Joint-Calling Analysis on the Federated Dataset
Building a Workspace Around a Dataset
Cloning the 1000 Genomes Data Workspace
Importing a Workflow from Dockstore
Configuring the Workflow to Use the Data Tables
Wrap-Up and Next Steps
14. Making a Fully Reproducible Paper
Overview of the Case Study
Computational Reproducibility and the FAIR Framework
Original Research Study and History of the Case Study
Assessing the Available Information and Key Challenges
Designing a Reproducible Implementation
Generating a Synthetic Dataset as a Stand-In for the Private Data
Overall Methodology
Retrieving the Variant Data from 1000 Genomes Participants
Creating Fake Exomes Based on Real People
Mutating the Fake Exomes
Generating the Definitive Dataset
Re-Creating the Data Processing and Analysis Methodology
Mapping and Variant Discovery
Variant Effect Prediction, Prioritization, and Variant Load Analysis
Analytical Performance of the New Implementation
The Long, Winding Road to FAIRness
Final Conclusions
Glossary
Index