When you transform data, you will often detect inaccuracies or errors. Sometimes the issues you find are not severe enough to justify discarding the rows. You may be able to guess what data was supposed to be there instead of the current values, or you may have default values to use in place of the invalid ones. Let's see some examples:
- You have a field defined as a string, and this field represents the date of birth of a person. Besides valid dates, the field contains other strings, for example N/A, -, ???, and so on. Any attempt to run a date calculation with these values would lead to an error.
- You have two dates representing the start date and end date of the execution of a task. Suppose that you have 2018-01-05 and 2017-10-31 as the start date and end date respectively. They are well-formatted dates, but if you try to calculate the time that it took to execute the task, you will get a negative value, which is clearly wrong.
- You have a field representing the nationality of a person. The field is mandatory but there are some null values.
In these cases and many more like these, the problem is not so critical and you can do some work to avoid aborting or discarding data because of these anomalies:
- In the first example, you could delete the invalid value and leave the field empty.
- In the second example, you could interchange the values, assuming that the user who typed the dates switched them unintentionally.
- Finally, in the last example, you could set a predefined default value.
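To make the three fixes concrete, here is a minimal sketch in plain Python, independent of any particular transformation tool. The function names, the `YYYY-MM-DD` date format, and the `"unknown"` default are assumptions made for illustration; adapt them to your own data.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# Hypothetical default used when the mandatory nationality field is missing.
DEFAULT_NATIONALITY = "unknown"


def clean_birth_date(raw: Optional[str]) -> Optional[date]:
    """Return a date if raw is a valid YYYY-MM-DD string, otherwise None.

    Placeholders such as "N/A", "-", or "???" fail to parse, so the
    field is left empty instead of causing an error later on.
    """
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None


def order_dates(start: date, end: date) -> Tuple[date, date]:
    """Swap the two dates when the end date precedes the start date."""
    if end < start:
        return end, start
    return start, end


def fill_nationality(value: Optional[str]) -> str:
    """Replace a null or empty nationality with the predefined default."""
    return value if value else DEFAULT_NATIONALITY
```

For instance, `order_dates(date(2018, 1, 5), date(2017, 10, 31))` returns the pair with the 2017 date first, so a later duration calculation no longer yields a negative value.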
In general, you can do your best to fix the issues and send the rows back to the main stream. This applies both to regular streams and to streams that result from error handling.
In this section, you will see an example of fixing these kinds of issues and avoiding having to discard the rows that cause errors or are considered invalid. You will do it using the concepts learned in the last chapter: splitting and merging streams.
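Before moving on, the split, fix, and merge idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: rows are modeled as dictionaries with hypothetical `start` and `end` keys, and the function name `split_fix_merge` is invented for this sketch.

```python
from datetime import date
from typing import Dict, List

Row = Dict[str, date]


def split_fix_merge(rows: List[Row]) -> List[Row]:
    """Split rows by validity, repair the invalid ones, and merge both streams."""
    # Split: rows whose dates are in order go one way, the rest go another.
    valid = [r for r in rows if r["start"] <= r["end"]]
    invalid = [r for r in rows if r["start"] > r["end"]]

    # Fix: swap the dates instead of discarding the row.
    fixed = [{**r, "start": r["end"], "end": r["start"]} for r in invalid]

    # Merge: send the repaired rows back to the main stream.
    return valid + fixed


rows = [
    {"start": date(2017, 3, 1), "end": date(2017, 3, 15)},
    {"start": date(2018, 1, 5), "end": date(2017, 10, 31)},
]
merged = split_fix_merge(rows)
```

Note that in this sketch the merged result does not preserve the original row order: repaired rows are appended after the valid ones.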