Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration, by Matt Casters

Index

Cover; Title Page; Copyright; Dedication; About the Authors; Credits; Acknowledgments; Introduction

The Origins of Kettle; About This Book; How This Book Is Organized; Prerequisites; On the Website; Further Resources

Part I: Getting Started

Chapter 1: ETL Primer

OLTP versus Data Warehousing; What Is ETL?; ETL, ELT, and EII; Data Integration Challenges; ETL Tool Requirements; Summary

Chapter 2: Kettle Concepts

Design Principles; The Building Blocks of Kettle Design; Parameters and Variables; Visual Programming; Summary

Chapter 3: Installation and Configuration

Kettle Software Overview; Installation; Configuration; Summary

Chapter 4: An Example ETL Solution—Sakila

Sakila; Prerequisites and Some Basic Spoon Skills; The Sample ETL Solution; Summary

Part II: ETL

Chapter 5: ETL Subsystems

Introduction to the 34 Subsystems; Summary

Chapter 6: Data Extraction

Kettle Data Extraction Overview; Working with ERP and CRM Systems; Data Profiling; CDC: Change Data Capture; Delivering Data; Summary

Chapter 7: Cleansing and Conforming

Data Cleansing; Error Handling; Auditing Data and Process Quality; Deduplicating Data; Scripting; Summary

Chapter 8: Handling Dimension Tables

Managing Keys; Loading Dimension Tables; Slowly Changing Dimensions; More Dimensions; Summary

Chapter 9: Loading Fact Tables

Loading in Bulk; Dimension Lookups; Fact Table Handling; Summary

Chapter 10: Working with OLAP Data

OLAP Benefits and Challenges; Working with Mondrian; Working with XML/A Servers; Working with Palo; Summary

Part III: Management and Deployment

Chapter 11: ETL Development Lifecycle

Solution Design; Agile Development; Testing and Debugging; Documenting the Solution; Summary

Chapter 12: Scheduling and Monitoring

Scheduling; Monitoring; Summary

Chapter 13: Versioning and Migration

Version Control Systems; Kettle Metadata; Managing Repositories; Version Migration System; Summary

Chapter 14: Lineage and Auditing

Batch-Level Lineage Extraction; Lineage; Logging and Operational Metadata; Summary

Part IV: Performance and Scalability

Chapter 15: Performance Tuning

Transformation Performance: Finding the Weakest Link; Improving Transformation Performance; Improving Job Performance; Summary

Chapter 16: Parallelization, Clustering, and Partitioning

Multi-Threading; Using Carte as a Slave Server; Clustering Transformations; Partitioning; Summary

Chapter 17: Dynamic Clustering in the Cloud

Dynamic Clustering; Cloud Computing; EC2; Summary

Chapter 18: Real-Time Data Integration

Introduction to Real-Time ETL; Transformation Streaming; Summary

Part V: Advanced Topics

Chapter 19: Data Vault Management

Introduction to Data Vault Modeling; Do You Need a Data Vault?; Data Vault Building Blocks; Transforming Sakila to the Data Vault Model; Loading the Data Vault: A Sample ETL Solution; Updating a Data Mart from a Data Vault; Summary

Chapter 20: Handling Complex Data Formats

Non-Relational and Non-Tabular Data Formats; Non-Relational Tabular Formats; Semi- and Unstructured Data; Key/Value Pairs; Summary

Chapter 21: Web Services

Web Pages and Web Services; Data Formats; XML Examples; SOAP Examples; JSON Example; RSS; Summary

Chapter 22: Kettle Integration

The Kettle API; Executing Existing Transformations and Jobs; Embedding Kettle; OEM Versions and Forks; Summary

Chapter 23: Extending Kettle

Plugin Architecture Overview; Transformation Step Plugins; The User-Defined Java Class Step; Job Entry Plugins; Partitioning Method Plugins; Repository Type Plugins; Database Type Plugins; Summary

Appendix A: The Kettle Ecosystem

Kettle Development and Versions; The Pentaho Community Wiki; Using the Forums; Jira; ##pentaho

Appendix B: Kettle Enterprise Edition Features

Appendix C: Built-in Variables and Properties Reference

Internal Variables; Kettle Variables; Variables for Configuring VFS; Noteworthy JRE Variables

Index
