Index
Title Page
Third Edition
Copyright
Learning Pentaho Data Integration 8 CE, Third Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Introducing Pentaho Data Integration
Using PDI in real-world scenarios
Loading data warehouses or data marts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI along with other Pentaho tools
Installing PDI
Launching the PDI Graphical Designer - Spoon
Starting and customizing Spoon
Exploring the Spoon interface
Extending the PDI functionality through the Marketplace
Introducing transformations
The basics about transformations
Creating a Hello World! Transformation
Designing a Transformation
Previewing and running a Transformation
Installing useful related software
Summary
Getting Started with Transformations
Designing and previewing transformations
Getting familiar with editing features
Using the mouseover assistance toolbar
Adding steps and creating hops
Working with grids
Designing transformations
Putting the editing features in practice
Previewing and fixing errors as they appear
Looking at the results in the execution results pane
The Logging tab
The Step Metrics tab
Running transformations in an interactive fashion
Understanding PDI data and metadata
Understanding the PDI rowset
Adding or modifying fields by using different PDI steps
Explaining the PDI data types
Handling errors
Implementing the error handling functionality
Customizing the error handling
Summary
Creating Basic Task Flows
Introducing jobs
Learning the basics about jobs
Creating a Simple Job
Designing and running jobs
Revisiting the Spoon interface and the editing features
Designing jobs
Getting familiar with the job design process
Looking at the results in the Execution results window
The Logging tab
The Job metrics tab
Enriching your work by sending an email
Running transformations from a Job
Using the Transformation Job Entry
Understanding and changing the flow of execution
Changing the flow of execution based on conditions
Forcing a status with an abort Job or success entry
Changing the execution to be synchronous
Managing files
Creating a Job that moves some files
Selecting files and folders
Working with regular expressions
Summarizing the Job entries that deal with files
Customizing the file management
Knowing the basics about Kettle variables
Understanding the kettle.properties file
How and when you can use variables
Summary
Reading and Writing Files
Reading data from files
Reading a simple file
Troubleshooting reading files
Learning to read all kinds of files
Specifying the name and location of the file
Reading several files at the same time
Reading files that are compressed or located on a remote server
Reading a file whose name is known at runtime
Describing the incoming fields
Reading Date fields
Reading Numeric fields
Reading only a subset of the file
Reading the most common kinds of sources
Reading text files
Reading spreadsheets
Reading XML files
Reading JSON files
Outputting data to files
Creating a simple file
Learning to create all kinds of files and write data into them
Providing the name and location of an output file
Creating a file whose name is known only at runtime
Creating several files whose names depend on the content of the file
Describing the content of the output file
Formatting Date fields
Formatting Numeric fields
Creating the most common kinds of files
Creating text files
Creating spreadsheets
Creating XML files
Creating JSON files
Working with Big Data and cloud sources
Reading files from an AWS S3 instance
Writing files to an AWS S3 instance
Getting data from HDFS
Sending data to HDFS
Summary
Manipulating PDI Data and Metadata
Manipulating simple fields
Working with strings
Extracting parts of strings using regular expressions
Searching and replacing using regular expressions
Doing some math with Numeric fields
Operating with dates
Performing simple operations on dates
Subtracting dates with the Calculator step
Getting information relative to the current date
Using the Get System Info step
Performing other useful operations on dates
Getting the month names with a User Defined Java Class step
Modifying the metadata of streams
Working with complex structures
Working with XML
Introducing XML terminology
Getting familiar with the XPath notation
Parsing XML structures with PDI
Reading an XML file with the Get data from XML step
Parsing an XML structure stored in a field
PDI Transformation and Job files
Parsing JSON structures
Introducing JSON terminology
Getting familiar with the JSONPath notation
Parsing JSON structures with PDI
Reading a JSON file with the JSON input step
Parsing a JSON structure stored in a field
Summary
Controlling the Flow of Data
Filtering data
Filtering rows upon conditions
Reading a file and getting the list of words found in it
Filtering unwanted rows with a Filter rows step
Filtering rows by using the Java Filter step
Filtering data based on row numbers
Splitting streams unconditionally
Copying rows
Distributing rows
Introducing partitioning and clustering
Splitting the stream based on conditions
Splitting a stream based on a simple condition
Exploring PDI steps for splitting a stream based on conditions
Merging streams in several ways
Merging two or more streams
Customizing the way of merging streams
Looking up data
Looking up data with a Stream lookup step
Summary
Cleansing, Validating, and Fixing Data
Cleansing data
Cleansing data by example
Standardizing information
Improving the quality of data
Introducing PDI steps useful for cleansing data
Dealing with non-exact matches
Cleansing by doing a fuzzy search
Deduplicating non-exact matches
Validating data
Validating data with PDI
Validating and reporting errors to the log
Introducing common validations and their implementation with PDI
Treating invalid data by splitting and merging streams
Fixing data that doesn't match the rules
Summary
Manipulating Data by Coding
Doing simple tasks with the JavaScript step
Using the JavaScript language in PDI
Inserting JavaScript code using the JavaScript step
Adding fields
Modifying fields
Organizing your code
Controlling the flow using predefined constants
Testing the script using the Test script button
Parsing unstructured files with JavaScript
Doing simple tasks with the Java Class step
Using the Java language in PDI
Inserting Java code using the Java Class step
Learning to insert Java code in a Java Class step
Data types equivalence
Adding fields
Modifying fields
Controlling the flow with the putRow() function
Testing the Java Class using the Test class button
Getting the most out of the Java Class step
Receiving parameters
Reading data from additional steps
Redirecting data to different target steps
Parsing JSON structures
Avoiding coding using purpose-built steps
Summary
Transforming the Dataset
Sorting data
Sorting a dataset with the Sort rows step
Working on groups of rows
Aggregating data
Summarizing the PDI steps that operate on sets of rows
Converting rows to columns
Converting row data to column data using the Row denormaliser step
Aggregating data with a Row Denormaliser step
Normalizing data
Modifying the dataset with a Row Normaliser step
Going forward and backward across rows
Picking rows backward and forward with the Analytic Query step
Summary
Performing Basic Operations with Databases
Connecting to a database and exploring its content
Connecting with Relational Database Management Systems
Exploring a database with the Database Explorer
Previewing and getting data from a database
Getting data from the database with the Table input step
Using the Table input step to run flexible queries
Adding parameters to your queries
Using Kettle variables in your queries
Inserting, updating, and deleting data
Inserting new data into a database table
Inserting or updating data with the Insert / Update step
Deleting records of a database table with the Delete step
Performing CRUD operations with more flexibility
Verifying a connection, running DDL scripts, and doing other useful tasks
Looking up data in different ways
Doing simple lookups with the Database Value Lookup step
Making a performance difference when looking up data in a database
Performing complex database lookups
Looking for data using a Database join step
Looking for data using a Dynamic SQL row step
Summary
Loading Data Marts with PDI
Preparing the environment
Exploring the Jigsaw database model
Creating the database and configuring the environment
Introducing dimensional modeling
Loading dimensions with data
Learning the basics of dimensions
Understanding dimensions' technical details
Loading a time dimension
Introducing and loading Type I slowly changing dimensions
Loading a Type I SCD with a combination lookup/update step
Introducing and loading Type II slowly changing dimensions
Loading Type II SCDs with a dimension lookup/update step
Loading a Type II SCD for the first time
Loading a Type II SCD and verifying how history is kept
Explaining and loading Type III SCD and Hybrid SCD
Loading other kinds of dimensions
Loading a mini dimension
Loading junk dimensions
Explaining degenerate dimensions
Loading fact tables
Learning the basics about fact tables
Deciding the level of granularity
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the junk dimension
Obtaining the surrogate key for a time dimension
Loading a cumulative fact table
Loading a snapshot fact table
Loading a fact table by inserting snapshot data
Loading a fact table by overwriting snapshot data
Summary
Creating Portable and Reusable Transformations
Defining and using Kettle variables
Introducing all kinds of Kettle variables
Explaining predefined variables
Revisiting the kettle.properties file
Defining variables at runtime
Setting a variable with a constant value
Setting a variable with a value unknown beforehand
Setting variables with partial or total results of your flow
Defining and using named parameters
Using variables as fields of your stream
Creating reusable Transformations
Creating and executing sub-transformations
Creating and testing a sub-transformation
Executing a sub-transformation
Introducing more elaborate sub-transformations
Making the data flow between transformations
Transferring data using the copy/get rows mechanism
Executing transformations in an iterative way
Using Transformation executors
Configuring the executors with advanced settings
Getting the results of the execution of the inner transformation
Working with groups of data
Using variables and named parameters
Continuing the flow after executing the inner transformation
Summary
Implementing Metadata Injection
Introducing metadata injection
Explaining how metadata injection works
Creating a template Transformation
Injecting metadata
Discovering metadata and injecting it
Identifying use cases to implement metadata injection
Summary
Creating Advanced Jobs
Enhancing your processes with the use of variables
Running nested jobs
Understanding the scope of variables
Using named parameters
Using variables to create flexible processes
Using variables to name jobs and transformations
Using variables to name Job and Transformation folders
Accessing copied rows for different purposes
Using the copied rows to manage files in advanced ways
Using the copied rows as parameters of a Job or Transformation
Working with filelists
Maintaining a filelist
Using the filelist for different purposes
Attaching files in an email
Copying, moving, and deleting files
Introducing other ways to process the filelist
Executing jobs in an iterative way
Using Job executors
Configuring the executors with advanced settings
Getting the results of the execution of the job
Working with groups of data
Using variables and named parameters
Capturing the result filenames
Summary
Launching Transformations and Jobs from the Command Line
Using the Pan and Kitchen utilities
Running jobs and transformations
Checking the exit code
Supplying named parameters and variables
Using command-line arguments
Deciding between the use of a command-line argument and named parameters
Sending the output of executions to log files
Automating the execution
Summary
Best Practices for Designing and Deploying a PDI Project
Setting up a new project
Setting up the local environment
Defining a folder structure for the project
Dealing with external resources
Defining and adopting a versioning system
Best practices to design jobs and transformations
Styling your work
Making the work portable
Designing and developing reusable jobs and transformations
Maximizing the performance
Analyzing Step Metrics
Analyzing performance graphs
Deploying the project in different environments
Modifying the Kettle home directory
Modifying the Kettle home in Windows
Modifying the Kettle home in Unix-like systems
Summary