We will explain the loading process of a Type I SCD by example. We will load the region dimension, which contains geographical information—cities and countries. The source of our dimension will be the cities and countries tables, and the dimension table to be loaded is lk_regions.
We load Type I SCDs with the Combination lookup/update step. Before loading the dimension, it's a good idea to take a look at the definition of the table so that we can identify the different kinds of fields needed to configure the Combination lookup/update step properly:
CREATE TABLE LK_REGIONS (
id INT(4) NOT NULL,
city CHAR(30) DEFAULT 'N/A' NOT NULL,
country CHAR(30) DEFAULT 'N/A' NOT NULL,
region CHAR(30) DEFAULT 'N/A' NOT NULL,
id_js INT(4) NOT NULL,
lastupdate DATE,
PRIMARY KEY (id)
);
Among the fields, we have the surrogate key (id), the business key (id_js), and the fields for the different attributes: city, country, and region. We also have a field named lastupdate, not mandatory as we will shortly see.
This is how we proceed. First of all, we have to gather the region information from the source:
- Launch Spoon and create a new Transformation.
- Drag a Table input step to the work area and double-click on it.
- As Connection, select js.
- In the SQL area, type the following query and click on the OK button:
SELECT ci.city_id, city_name, country_name FROM cities ci, countries co WHERE ci.cou_id = co.cou_id
Now that you have all the data to populate the dimension, do the following:
- After the Table input step, add a Combination lookup/update step. You will find it in the Data Warehouse category of steps.
- Double-click on the step; as Connection, select dw. As Target table, browse and select lk_regions or simply type it. Set Commit size to 1.
- Fill the grid, as shown in the following screenshot:
- As Technical key field, type id, and as Date of last update field (optional), type lastupdate.
- Close the window.
- After the Combination lookup/update step, add an Update step and double-click on it.
- As Connection, select js_dw and as Target table, type lk_regions.
- Fill the upper grid by adding the condition id = id. The id attribute to the left is the table's id, while id to the right is the stream id.
- Fill the lower grid. Add one row with the city and city_name values. Add a second row with the country and country_name values. This will update the table columns city and country with the values city_name and country_name coming in the stream.
- Save the Transformation and run it.
- Explore the js_dw database and preview the lk_regions table. You should see this:
How does this work? The Combination lookup/update (Combination L/U) looks in the dimension table for a record that matches the key fields that you put in the upper grid in the setting window:
- If the combination exists, the step returns the surrogate key of the found record.
- If it doesn't exist, the step generates a new surrogate key and inserts a row with the key fields and the generated surrogate key. In this case, the surrogate key is also added to the output stream.
This behavior implies that the Combination lookup/update step merely inserts the row for each city, whereas the Update step is the one that adds the city and country information.
Note that the dimension table lk_regions has a column named region that you didn't update because you don't have data for this column. The column is filled with a default value set in the DDL definition of the table.
A commit size equal to 1 certainly degrades the performance when loading dimensions of high cardinality. If that's the case, you have an alternative way to load Type I SCD, which is explained in the next section.
Now you may wonder why we update the attributes with a separate step. We do it because they are not part of the key. In our example, if we put the city and country in the Combination lookup/update step, the first time that we run the Transformation, the row would be inserted without problem. If some change is applied to the city or the country fields in the source system, the next time we run the Transformation, PDI would assume that the key changed and would insert a new record for the same city, which would be wrong.
In our sample database, there is another Type I SCD: manufacturers. The process to load this dimension is exactly the same as the one explained for the REGION dimension.