Loading a Type I SCD with a combination lookup/update step

We will explain the loading process of a Type I SCD by example. We will load the region dimension, which contains geographical information—cities and countries. The source of our dimension will be the cities and countries tables, and the dimension table to be loaded is lk_regions.

Note that we will take information from the operational database js and load a table in another database js_dw.

We load Type I SCDs with the Combination lookup/update step. Before loading the dimension, it's a good idea to take a look at the definition of the table so that we can identify the different kinds of fields needed to configure the Combination lookup/update step properly:

CREATE TABLE LK_REGIONS (
id INT(4) NOT NULL,
city CHAR(30) DEFAULT 'N/A' NOT NULL,
country CHAR(30) DEFAULT 'N/A' NOT NULL,
region CHAR(30) DEFAULT 'N/A' NOT NULL,
id_js INT(4) NOT NULL,
lastupdate DATE,
PRIMARY KEY (id)
);

Among the fields, we have the surrogate key (id), the business key (id_js), and the fields for the different attributes: city, country, and region. We also have a field named lastupdate, not mandatory as we will shortly see.

This is how we proceed. First of all, we have to gather the region information from the source:

  1. Launch Spoon and create a new Transformation.
  2. Drag a Table input step to the work area and double-click on it.
  3. As Connection, select js.
  4. In the SQL area, type the following query and click on the OK button:
SELECT ci.city_id, city_name, country_name 
FROM cities ci, countries co 
WHERE ci.cou_id = co.cou_id 

Now that you have all the data to populate the dimension, do the following:

  1. After the Table input step, add a Combination lookup/update step. You will find it in the Data Warehouse category of steps.
  2. Double-click on the step; as Connection, select dw. As Target table, browse and select lk_regions or simply type it. Set Commit size to 1.
  1. Fill the grid, as shown in the following screenshot:
Configuring a Combination lookup/update step
  1. As Technical key field, type id, and as Date of last update field (optional), type lastupdate.
  2. Close the window.
  3. After the Combination lookup/update step, add an Update step and double-click on it.
  4. As Connection, select js_dw and as Target table, type lk_regions.
  5. Fill the upper grid by adding the condition id = id. The id attribute to the left is the table's id, while id to the right is the stream id.
  6. Fill the lower grid. Add one row with the city and city_name values. Add a second row with the country and country_name values. This will update the table columns city and country with the values city_name and country_name coming in the stream.
  7. Save the Transformation and run it.
  8. Explore the js_dw database and preview the lk_regions table. You should see this:
Region dimension populated
Be aware that in the Combination lookup/update step, the Dimension field, Technical key field, and Date of last update field options do not refer to fields in the stream, but to columns in the table. In those cases, where you see the word field, you should know that it refers to a column: Dimension column, Technical key column, and Date of last update column. Also, note that the term Technical refers to the surrogate key.

How does this work? The Combination lookup/update (Combination L/U) looks in the dimension table for a record that matches the key fields that you put in the upper grid in the setting window:

This behavior implies that the Combination lookup/update step merely inserts the row for each city, whereas the Update step is the one that adds the city and country information.

Note that the dimension table lk_regions has a column named region that you didn't update because you don't have data for this column. The column is filled with a default value set in the DDL definition of the table.

In the Combination lookup/update step, we set the commit size to 1. This is the only way in which this Transformation will run successfully and reliably every time. There are conflicts in timing if the commit size is larger as we are attempting to update a row that may not have been committed yet. Of course, if you don't have fields to update, you don't have to change the commit size.

A commit size equal to 1 certainly degrades the performance when loading dimensions of high cardinality. If that's the case, you have an alternative way to load Type I SCD, which is explained in the next section.

Now you may wonder why we update the attributes with a separate step. We do it because they are not part of the key. In our example, if we put the city and country in the Combination lookup/update step, the first time that we run the Transformation, the row would be inserted without problem. If some change is applied to the city or the country fields in the source system, the next time we run the Transformation, PDI would assume that the key changed and would insert a new record for the same city, which would be wrong.

As the Combination L/U only maintains the key information, if you have non-key columns in the table, you must update them with an extra Update step. These values must have a default value or must allow null values. If none of these conditions is true, the insert operation will fail.

In our sample database, there is another Type I SCD: manufacturers. The process to load this dimension is exactly the same as the one explained for the REGION dimension.