Index
Title Page
Third Edition
Copyright
Learning Pentaho Data Integration 8 CE, Third Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Introducing Pentaho Data Integration
Using PDI in real-world scenarios
Loading data warehouses or data marts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI along with other Pentaho tools
Installing PDI
Launching the PDI Graphical Designer - Spoon
Starting and customizing Spoon
Exploring the Spoon interface
Extending the PDI functionality through the Marketplace
Introducing transformations
The basics about transformations
Creating a Hello World! Transformation
Designing a Transformation
Previewing and running a Transformation
Installing useful related software
Summary
Getting Started with Transformations
Designing and previewing transformations
Getting familiar with editing features
Using the mouseover assistance toolbar
Adding steps and creating hops
Working with grids
Designing transformations
Putting the editing features in practice
Previewing and fixing errors as they appear
Looking at the results in the execution results pane
The Logging tab
The Step Metrics tab
Running transformations in an interactive fashion
Understanding PDI data and metadata
Understanding the PDI rowset
Adding or modifying fields by using different PDI steps
Explaining the PDI data types
Handling errors
Implementing the error handling functionality
Customizing the error handling
Summary
Creating Basic Task Flows
Introducing jobs
Learning the basics about jobs
Creating a Simple Job
Designing and running jobs
Revisiting the Spoon interface and the editing features
Designing jobs
Getting familiar with the job design process
Looking at the results in the Execution results window
The Logging tab
The Job metrics tab
Enriching your work by sending an email
Running transformations from a Job
Using the Transformation Job Entry
Understanding and changing the flow of execution
Changing the flow of execution based on conditions
Forcing a status with an abort Job or success entry
Changing the execution to be synchronous
Managing files
Creating a Job that moves some files
Selecting files and folders
Working with regular expressions
Summarizing the Job entries that deal with files
Customizing the file management
Knowing the basics about Kettle variables
Understanding the kettle.properties file
How and when you can use variables
Summary
Reading and Writing Files
Reading data from files
Reading a simple file
Troubleshooting reading files
Learning to read all kinds of files
Specifying the name and location of the file
Reading several files at the same time
Reading files that are compressed or located on a remote server
Reading a file whose name is known at runtime
Describing the incoming fields
Reading Date fields
Reading Numeric fields
Reading only a subset of the file
Reading the most common kinds of sources
Reading text files
Reading spreadsheets
Reading XML files
Reading JSON files
Outputting data to files
Creating a simple file
Learning to create all kinds of files and write data into them
Providing the name and location of an output file
Creating a file whose name is known only at runtime
Creating several files whose names depend on the content of the file
Describing the content of the output file
Formatting Date fields
Formatting Numeric fields
Creating the most common kinds of files
Creating text files
Creating spreadsheets
Creating XML files
Creating JSON files
Working with Big Data and cloud sources
Reading files from an AWS S3 instance
Writing files to an AWS S3 instance
Getting data from HDFS
Sending data to HDFS
Summary
Manipulating PDI Data and Metadata
Manipulating simple fields
Working with strings
Extracting parts of strings using regular expressions
Searching and replacing using regular expressions
Doing some math with Numeric fields
Operating with dates
Performing simple operations on dates
Subtracting dates with the Calculator step
Getting information relative to the current date
Using the Get System Info step
Performing other useful operations on dates
Getting the month names with a User Defined Java Class step
Modifying the metadata of streams
Working with complex structures
Working with XML
Introducing XML terminology
Getting familiar with the XPath notation
Parsing XML structures with PDI
Reading an XML file with the Get data from XML step
Parsing an XML structure stored in a field
PDI Transformation and Job files
Parsing JSON structures
Introducing JSON terminology
Getting familiar with the JSONPath notation
Parsing JSON structures with PDI
Reading a JSON file with the JSON input step
Parsing a JSON structure stored in a field
Summary
Controlling the Flow of Data
Filtering data
Filtering rows upon conditions
Reading a file and getting the list of words found in it
Filtering unwanted rows with a Filter rows step
Filtering rows by using the Java Filter step
Filtering data based on row numbers
Splitting streams unconditionally
Copying rows
Distributing rows
Introducing partitioning and clustering
Splitting the stream based on conditions
Splitting a stream based on a simple condition
Exploring PDI steps for splitting a stream based on conditions
Merging streams in several ways
Merging two or more streams
Customizing the way of merging streams
Looking up data
Looking up data with a Stream lookup step
Summary
Cleansing, Validating, and Fixing Data
Cleansing data
Cleansing data by example
Standardizing information
Improving the quality of data
Introducing PDI steps useful for cleansing data
Dealing with non-exact matches
Cleansing by doing a fuzzy search
Deduplicating non-exact matches
Validating data
Validating data with PDI
Validating and reporting errors to the log
Introducing common validations and their implementation with PDI
Treating invalid data by splitting and merging streams
Fixing data that doesn't match the rules
Summary
Manipulating Data by Coding
Doing simple tasks with the JavaScript step
Using the JavaScript language in PDI
Inserting JavaScript code using the JavaScript step
Adding fields
Modifying fields
Organizing your code
Controlling the flow using predefined constants
Testing the script using the Test script button
Parsing unstructured files with JavaScript
Doing simple tasks with the Java Class step
Using the Java language in PDI
Inserting Java code using the Java Class step
Learning to insert Java code in a Java Class step
Data types equivalence
Adding fields
Modifying fields
Controlling the flow with the putRow() function
Testing the Java Class using the Test class button
Getting the most out of the Java Class step
Receiving parameters
Reading data from additional steps
Redirecting data to different target steps
Parsing JSON structures
Avoiding coding using purpose-built steps
Summary
Transforming the Dataset
Sorting data
Sorting a dataset with the Sort rows step
Working on groups of rows
Aggregating data
Summarizing the PDI steps that operate on sets of rows
Converting rows to columns
Converting row data to column data using the Row denormaliser step
Aggregating data with a Row Denormaliser step
Normalizing data
Modifying the dataset with a Row Normaliser step
Going forward and backward across rows
Picking rows backward and forward with the Analytic Query step
Summary
Performing Basic Operations with Databases
Connecting to a database and exploring its content
Connecting with Relational Database Management Systems
Exploring a database with the Database Explorer
Previewing and getting data from a database
Getting data from the database with the Table input step
Using the Table input step to run flexible queries
Adding parameters to your queries
Using Kettle variables in your queries
Inserting, updating, and deleting data
Inserting new data into a database table
Inserting or updating data with the Insert / Update step
Deleting records of a database table with the Delete step
Performing CRUD operations with more flexibility
Verifying a connection, running DDL scripts, and doing other useful tasks
Looking up data in different ways
Doing simple lookups with the Database Value Lookup step
Making a performance difference when looking up data in a database
Performing complex database lookups
Looking for data using a Database join step
Looking for data using a Dynamic SQL row step
Summary
Loading Data Marts with PDI
Preparing the environment
Exploring the Jigsaw database model
Creating the database and configuring the environment
Introducing dimensional modeling
Loading dimensions with data
Learning the basics of dimensions
Understanding dimensions' technical details
Loading a time dimension
Introducing and loading Type I slowly changing dimensions
Loading a Type I SCD with a combination lookup/update step
Introducing and loading Type II slowly changing dimensions
Loading Type II SCDs with a dimension lookup/update step
Loading a Type II SCD for the first time
Loading a Type II SCD and verifying how history is kept
Explaining and loading Type III SCD and Hybrid SCD
Loading other kinds of dimensions
Loading a mini dimension
Loading junk dimensions
Explaining degenerate dimensions
Loading fact tables
Learning the basics about fact tables
Deciding the level of granularity
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the junk dimension
Obtaining the surrogate key for a time dimension
Loading a cumulative fact table
Loading a snapshot fact table
Loading a fact table by inserting snapshot data
Loading a fact table by overwriting snapshot data
Summary
Creating Portable and Reusable Transformations
Defining and using Kettle variables
Introducing all kinds of Kettle variables
Explaining predefined variables
Revisiting the kettle.properties file
Defining variables at runtime
Setting a variable with a constant value
Setting a variable with a value unknown beforehand
Setting variables with partial or total results of your flow
Defining and using named parameters
Using variables as fields of your stream
Creating reusable Transformations
Creating and executing sub-transformations
Creating and testing a sub-transformation
Executing a sub-transformation
Introducing more elaborate sub-transformations
Making the data flow between transformations
Transferring data using the copy/get rows mechanism
Executing transformations in an iterative way
Using Transformation executors
Configuring the executors with advanced settings
Getting the results of the execution of the inner transformation
Working with groups of data
Using variables and named parameters
Continuing the flow after executing the inner transformation
Summary
Implementing Metadata Injection
Introducing metadata injection
Explaining how metadata injection works
Creating a template Transformation
Injecting metadata
Discovering metadata and injecting it
Identifying use cases to implement metadata injection
Summary
Creating Advanced Jobs
Enhancing your processes with the use of variables
Running nested jobs
Understanding the scope of variables
Using named parameters
Using variables to create flexible processes
Using variables to name jobs and transformations
Using variables to name Job and Transformation folders
Accessing copied rows for different purposes
Using the copied rows to manage files in advanced ways
Using the copied rows as parameters of a Job or Transformation
Working with filelists
Maintaining a filelist
Using the filelist for different purposes
Attaching files in an email
Copying, moving, and deleting files
Introducing other ways to process the filelist
Executing jobs in an iterative way
Using Job executors
Configuring the executors with advanced settings
Getting the results of the execution of the job
Working with groups of data
Using variables and named parameters
Capturing the result filenames
Summary
Launching Transformations and Jobs from the Command Line
Using the Pan and Kitchen utilities
Running jobs and transformations
Checking the exit code
Supplying named parameters and variables
Using command-line arguments
Deciding between the use of a command-line argument and named parameters
Sending the output of executions to log files
Automating the execution
Summary
Best Practices for Designing and Deploying a PDI Project
Setting up a new project
Setting up the local environment
Defining a folder structure for the project
Dealing with external resources
Defining and adopting a versioning system
Best practices to design jobs and transformations
Styling your work
Making the work portable
Designing and developing reusable jobs and transformations
Maximizing the performance
Analyzing Step Metrics
Analyzing performance graphs
Deploying the project in different environments
Modifying the Kettle home directory
Modifying the Kettle home in Windows
Modifying the Kettle home in Unix-like systems
Summary