9 Data Lifecycle Management
Effectively managing data growth in your data warehouse can be an intensive, large-scale task. In this chapter, we'll discuss planning for growth and performance up front during design, which can greatly simplify your data management efforts.
As data is loaded into a data warehouse, it's propagated through the data flow and replicated throughout the LSA++ architecture. For example, data extracted from a table in SAP ERP could be loaded into the operational delta queue (ODQ), one or more DSO layers, change logs, and corporate memory. The result of this propagation is data replication and a storage requirement beyond that of the original source system. In some cases, data may be denormalized to improve data load performance and/or reporting performance (e.g., including sales order header information in the sales order item data flow). Denormalization is the intentional duplication of columns in multiple tables, and it increases data redundancy, as the sketch below illustrates.
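To make the denormalization trade-off concrete, the following Python sketch contrasts a normalized sales order model, in which item records reference a separate header table, with a denormalized one, in which header columns are copied into every item row. All table and field names here are invented for illustration.

```python
# Normalized model: item rows reference the header via a key (no duplication).
headers = {"0001": {"customer": "ACME", "order_date": "2024-01-15"}}
items_normalized = [
    {"order": "0001", "item": 10, "material": "M-100", "qty": 5},
    {"order": "0001", "item": 20, "material": "M-200", "qty": 2},
]

# Reading customer data requires a join-style lookup against the header.
for item in items_normalized:
    header = headers[item["order"]]
    print(item["material"], header["customer"])

# Denormalized model: header columns are copied into every item row.
# Reads avoid the lookup, but each row repeats the same customer and
# order_date values, increasing redundancy and storage requirements.
items_denormalized = [
    {"order": "0001", "item": 10, "material": "M-100", "qty": 5,
     "customer": "ACME", "order_date": "2024-01-15"},
    {"order": "0001", "item": 20, "material": "M-200", "qty": 2,
     "customer": "ACME", "order_date": "2024-01-15"},
]
```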
Failure to proactively manage data loaded into SAP BW/4HANA can seriously jeopardize system performance and directly affect the total cost of ownership. Housekeeping activities must be an integral part of the operational concept of a complex enterprise data warehouse. Planning and executing housekeeping activities regularly in the system will ensure optimum utilization of system resources and thereby optimize the overall system performance.
An integral part of any data monitoring and housekeeping effort is a data retention strategy, as we’ll discuss in Section 9.1; such a strategy may include a hybrid approach that uses all tools at your disposal, such as aggregation, near-line storage, archival processes, and data deletion. In Section 9.2, we’ll introduce the data temperature concept for managing data with different retention requirements and access frequencies. For cold data, in Section 9.3 we’ll discuss in detail how to configure the data archiving process using SAP’s near-line storage (NLS) on SAP IQ or Hadoop.
9.1 Data Retention Strategy
In an operational data warehouse, it’s natural for the volume of data to perpetually increase, which is known as organic growth. Constant changes to business and legal requirements mean that this data must be available for longer than ever before. The impact of keeping a large volume of data in the system is generally a decrease in performance and an increase in administration effort. To combat and manage these impacts, you should implement a data aging strategy early in the lifecycle of the data warehouse; such a strategy can have a profound influence on design and architecture.
Efficient storage of historical data for comprehensive enterprise data warehousing is essential to maintaining a healthy and high-performing system. The following are some advantages of a carefully planned data retention strategy:
- Reduced infrastructure costs for the following categories:
  - Storage capacity costs
  - Processing capacity costs
- Improved performance of the SAP BW/4HANA system in the following areas:
  - Query performance
  - Data loading performance
  - System administration performance
- Reduced administration and maintenance effort in the following areas:
  - Reloading and repartitioning effort
  - Data management effort
When considering a data retention strategy, there are always trade-offs: benefits and costs, advantages and risks. Making a decision about what data should remain available and accessible to end users is fraught with consequences.
However, the good news is that solid advice is available, because this dilemma has been wrestled with many times before. To start, it may be helpful to consider some of the following questions:
- At what point does the processing overhead incurred by the data volume become unacceptable? In other words, what is the data volume threshold above which response times are intolerable?
- Are there any catastrophic consequences associated with loss of processing for this system?
- What other systems or processes does the data retention strategy affect or feed? How many downstream systems rely on data availability in this data warehouse?
- What is the cost and elapsed time of restoring or recreating the data that’s no longer retained?
- Are there legal requirements (such as a litigation hold, court order, or law) to retain the data or keep the system operational?
- What’s the risk or exposure related to keeping old data available for reporting? What liabilities arise in the absence of a strictly enforced data retention strategy?
The greatest complexity related to defining a data retention strategy is that answering these questions requires candid discussion and agreement among infrastructure, technical support, and business representatives. Recognizing that the data retention requirements for each subject area or line of business may be completely different adds a bit more complexity. The good news is that a data retention strategy need not be one size fits all. In fact, multiple options are available, and the success of the whole strategy may depend on defining unique data aging tactics for each area based on specific data content and the business value of accessing that data.
The specific options or tactics available range from doing nothing (i.e., keeping all data available online indefinitely) to deleting data directly from the database with no option for recovery. Thankfully, there are also several options in between! Again, because these options may influence design and architecture, defining the data aging or retention strategy should not be deferred; if it is, a significant redesign or rearchitecting effort may be needed later.
The key point is that as data ages, details lose relevance. Keeping irrelevant data online not only adversely impacts performance (in read and write operations) but also increases administration and maintenance efforts and costs. The following list ranks different data retention tactics in order of data availability online and, indirectly, data relevance:
- Detail InfoProviders
- History InfoProviders
- Summary InfoProviders
- NLS
- Traditional archiving
- Data deletion
Note
Traditional archiving isn’t available for customer data loaded to SAP BW/4HANA InfoProviders; however, it is available for system administration data, such as IDocs and application logs.
From a purely technical perspective, the first step in deciding which tactic suits each data set is to determine how the relevant data sets are actually used. The technical content statistics are a great place to start: check the number of navigations and the number of users executing queries on each InfoProvider for that data set (i.e., subject area). Ranking each InfoProvider by usage and performance is a good way to initiate discussions with the business about the need for a data retention strategy (see the sketch after Table 9.1). The flaw in this approach is that the available statistics don't reveal the age of the data being queried, so engaging the business on end user behavior is critical to fully understanding how the data is consumed. Table 9.1 illustrates a general guideline: keep frequently used data online, and consider moving rarely used data offline.
| Data Access Frequency | Detail InfoProviders | Summary InfoProviders | Historical InfoProviders | NLS | Traditional Archiving | Deletion |
|---|---|---|---|---|---|---|
| Frequently queried data | Yes | Yes | Yes | No | No | No |
| Rarely used data | Yes | Yes | Yes | Yes | No | No |
| Very rarely used data | No | Yes | Yes | Yes | No | No |
| Legal hold or audit data | No | No | Yes | Yes | Yes | No |
| Expired or obsolete data | No | No | No | No | Yes | Yes |

Table 9.1 Data Retention Tactics versus Data Access Frequency Matrix
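To profile usage in practice, something as simple as the following Python sketch can rank InfoProviders by query activity. The statistics records are hard-coded stand-ins for whatever you extract from the technical content, and all names are invented for illustration.

```python
# Hypothetical usage statistics per InfoProvider, standing in for figures
# extracted from the technical content (all names are invented).
stats = [
    {"infoprovider": "SALES_DTL", "navigations": 12500, "users": 180},
    {"infoprovider": "FIN_SUMM", "navigations": 3100, "users": 95},
    {"infoprovider": "SALES_HIST", "navigations": 240, "users": 12},
    {"infoprovider": "HR_DTL", "navigations": 8, "users": 2},
]

# Rank by navigations, then by user count, descending: providers at the
# bottom of the list are the first candidates for NLS or deletion.
ranked = sorted(stats, key=lambda s: (s["navigations"], s["users"]), reverse=True)
for rank, s in enumerate(ranked, start=1):
    print(f"{rank}. {s['infoprovider']}: "
          f"{s['navigations']} navigations, {s['users']} users")
```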
Although no single tactic should be employed for all data sets, Table 9.1 illustrates some general guidelines that can be employed fairly consistently. Before proceeding, we should review these tactics in a little more detail and evaluate the advantages and disadvantages of each approach.
The first three tactics (using detail, historical, and summary InfoProviders) will influence the design of the data models in the warehouse. The fourth one (using NLS) will influence the system architecture. The final two (using data archiving and data deletion) will only influence the warehouse administration.
First and foremost, all data that is frequently accessed should remain available in full detail. In fact, every available detail of each transaction should be retained to minimize the need to enhance the data structure by adding fields later. This is the default tactic, represented by the Detail InfoProviders column in Table 9.1. The other columns represent alternative tactics.
The use of historical InfoProviders involves creating a new InfoProvider into which old data is loaded at its original granularity (level of detail). Current data remains in the original InfoProvider. After data is moved to the historical InfoProvider, it must be deleted from the original InfoProvider so that queries won't return duplicate data (see the sketch after the lists below).
The historical InfoProvider tactic provides the following benefits:
- Storing historical data in a semantic partition improves performance of read/write access to the original InfoProvider, because most loading and querying is performed against current data in the original cube, which has less data.
- All data, current and historical, remains available online and can be accessed in the same query when both the historical and current InfoProviders are part of the same MultiProvider.
- No custom programming is required to implement historical InfoProviders, so development is simple. The transformation rules mapping the original InfoProvider to the historical InfoProvider are 1:1.
However, the historical InfoProvider is associated with a few disadvantages:
- The cost/benefit ratio of maintaining historical data is poor. Older data loses relevance and is accessed less frequently, yet because it remains online, it costs as much per storage unit as more relevant data.
- Unless an automated mechanism is developed, data moved from the original InfoProvider to the historical InfoProvider must be selectively deleted from the original InfoProvider, which requires planning and execution effort, as well as data loading downtime.
- There is no reduction in infrastructure costs (storage capacity) or administration effort. In fact, the existence of historical data may increase administration effort, because changes made to the original InfoProvider may need to be applied to the historical InfoProvider to keep results consistent.
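To illustrate the move-and-delete step described above, here is a minimal Python sketch with in-memory lists standing in for the original and historical InfoProviders; the cutoff date and record layout are assumptions, not an SAP API.

```python
from datetime import date

# In-memory stand-ins for the original and historical InfoProviders.
original = [
    {"order": "0001", "posting_date": date(2019, 3, 1), "amount": 100.0},
    {"order": "0002", "posting_date": date(2024, 6, 1), "amount": 250.0},
]
historical = []

CUTOFF = date(2023, 1, 1)  # assumed boundary between current and historical

# 1:1 move: copy old records into the historical provider at the same
# granularity, then delete them from the original so that a MultiProvider
# spanning both does not return duplicates.
to_move = [r for r in original if r["posting_date"] < CUTOFF]
historical.extend(to_move)
original = [r for r in original if r["posting_date"] >= CUTOFF]

print(len(original), "current record(s),", len(historical), "historical record(s)")
```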
Rather than keeping historical data at the same granularity as relevant or current data, another option is to simplify the data model and summarize the data. As the details lose relevance, this approach can be a very practical way to keep the relevant information available; a minimal aggregation sketch follows the lists below. The following are advantages of a summary InfoProvider:
- Storing summarized data in a specific InfoProvider improves performance of read/write access to the original InfoProvider, because most loading and querying is performed against current data in the original cube, which has less data.
- Summarized data, current and historical, remains available online and can be accessed in the same query when both the summarized and current InfoProviders are part of the same MultiProvider.
- Minimal custom programming is required to implement summary InfoProviders, so development is simple. The transformation rules mapping the original InfoProvider to the summary InfoProvider are usually 1:1.
- Data in the summary InfoProvider can be aggregated by more than one characteristic, thus compounding the value of the summary InfoProvider.
- Query response times against the summary InfoProvider will be faster.
- The reduction in storage capacity will be directly correlated to the reduction in granularity so long as the data in the original InfoProvider is deleted.
- There will be a reduction in infrastructure, maintenance, and administration costs, including backups, because the database holds less data.
The summary InfoProvider tactic is prone to the following disadvantages:
- An analysis effort needs to be conducted to determine whether each InfoProvider is a candidate for summarization based on the business use cases for queries on the data contained within the InfoProviders.
- If data isn’t deleted from the original InfoProvider, there’s neither a reduction of cost and effort nor an improvement of performance over the original InfoProvider.
- In many cases, queries may need to be adapted if historical data is summarized, or new queries may need to be created.
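To illustrate the summarization itself, the following Python sketch rolls detail records up to a coarser granularity, dropping the document number and aggregating by month and material; all field names are invented for illustration.

```python
from collections import defaultdict

# Detail records at document-line granularity (all field names invented).
detail = [
    {"doc": "0001", "month": "2019-03", "material": "M-100", "amount": 100.0},
    {"doc": "0002", "month": "2019-03", "material": "M-100", "amount": 50.0},
    {"doc": "0003", "month": "2019-04", "material": "M-200", "amount": 75.0},
]

# Summarize: drop the document number and aggregate the key figure by the
# remaining characteristics (month, material). Three detail rows collapse
# into two summary rows; the storage saving tracks the drop in granularity.
totals = defaultdict(float)
for row in detail:
    totals[(row["month"], row["material"])] += row["amount"]

summary = [
    {"month": month, "material": material, "amount": amount}
    for (month, material), amount in totals.items()
]
print(summary)
```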
Beyond summarized data, there's no other online data storage option. The next alternative is to move data out of the database and into a near-line storage (NLS) database. Near-line storage is an intermediate type of data storage, a compromise between online storage and archiving. When data is near-lined from the SAP HANA database, it's moved to the NLS system but remains accessible by SAP BW/4HANA for reporting; a conceptual sketch of this access pattern follows the lists below. The next section will dive deeper into the NLS options, but for now, the advantages of implementing NLS include the following:
- Data stored in NLS can reach compression ratios of up to 95 percent, significantly reducing storage capacity requirements and costs in the original database.
- Less expensive media can be used for the NLS solution, thus further reducing the cost of the reduced storage capacity.
- Data in NLS can be accessed by the original queries with little or no maintenance and only a nominal performance impact.
- Data that is near-lined can be restored to the original database if necessary.
The disadvantages of NLS as a data retention tactic include the following:
- Data stored in the NLS solution doesn’t meet the same performance benchmarks as data that remains online; there’s a performance impact, which is usually acceptable for older data.
- Only static data can be near-lined. Once data is near-lined, no deltas can be loaded unless the near-lined data is restored to the original database, after which it must be moved to the NLS database again.
- When data is moved from the original database to the NLS database, no new data can be loaded to the original InfoProvider until the NLS process is completed.
- Each NLS solution has a license and maintenance cost, which, in some cases, depends on the volume of data stored in the NLS solution.
- To look up data from NLS in transformation rules, special code needs to be added to ensure the NLS data is queried.
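Conceptually, query access to a near-lined InfoProvider behaves like a union over two stores, as in this simplified Python sketch. The two lists stand in for the online SAP HANA store and the NLS store; nothing here reflects SAP's actual implementation.

```python
# Stand-ins for the two physical stores behind one logical InfoProvider.
online_store = [{"order": "0002", "year": 2024, "amount": 250.0}]
nls_store = [{"order": "0001", "year": 2019, "amount": 100.0}]

def read_provider(from_year):
    """Serve a query transparently from both stores.

    Reading the online store is fast; reading the NLS store is slower,
    which is usually acceptable for the older data that lives there.
    """
    rows = [r for r in online_store if r["year"] >= from_year]
    rows += [r for r in nls_store if r["year"] >= from_year]
    return rows

# A query spanning old and new data returns rows from both stores.
print(read_provider(from_year=2018))
```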
When data is truly no longer relevant for any end-use scenario, or liability policies dictate data should be purged from all production systems, then selective deletion of data from SAP BW/4HANA InfoProviders is the best and sometimes only option. The following are advantages of deleting data:
- It’s the cheapest option from an infrastructure perspective; once the data is deleted, the savings are pure.
- Minimal maintenance is required, especially if a custom programming solution is developed to selectively delete irrelevant data from InfoProviders (a sketch of such a routine follows these lists).
- It leads to better read and write performance on the affected InfoProviders, which now hold less data.
- Less storage capacity is needed for all InfoProviders being cleansed in this way.
- It creates a healthier system as redundant and irrelevant data is deleted.
The disadvantages of data deletion, on the other hand, are quite stark:
- It’s very risky if the data was never archived and important or legally required data is no longer available.
- Recovery of deleted data can only come from the source system; if the data is archived from the source, recovery can be extremely costly and difficult. If data has been deleted from the source system, no recovery is possible.
- Additional administration is required to detect candidate data for deletion and confirm whether it can be deleted.
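As an example of the kind of custom housekeeping routine mentioned above, the following Python sketch selects deletion candidates by an assumed retention period, excludes records under legal hold, and purges only confirmed candidates. The retention rule and record layout are placeholder assumptions.

```python
from datetime import date, timedelta

RETENTION_DAYS = 7 * 365  # assumed seven-year retention policy

records = [
    {"doc": "0001", "posting_date": date(2012, 5, 1), "legal_hold": False},
    {"doc": "0002", "posting_date": date(2012, 6, 1), "legal_hold": True},
    {"doc": "0003", "posting_date": date(2024, 1, 1), "legal_hold": False},
]

def deletion_candidates(rows, today=None):
    """Return rows past the retention period and not under legal hold."""
    today = today or date.today()
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [r for r in rows if r["posting_date"] < cutoff and not r["legal_hold"]]

candidates = deletion_candidates(records)
print("Candidates for deletion:", [r["doc"] for r in candidates])

# Purge only after the candidates have been reviewed and confirmed;
# deletion is irreversible if the data was never archived.
confirmed = {r["doc"] for r in candidates}
records = [r for r in records if r["doc"] not in confirmed]
```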
It’s important to note that these tactics are not mutually exclusive and every tactic could be employed in a single scenario. The keys to defining a successful data aging strategy include the following:
- Forecasting the impact of future data growth on capacity and performance (see the sketch following this list)
- Developing a cost model for data storage and maintenance
- Profiling data activity and access for all data sets
- Defining data retention tactics for all data sets
- Choosing and implementing technology to minimize impact to the business
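For the forecasting step, even a simple compound-growth model yields a useful first estimate of when a capacity threshold will be crossed, as in this sketch; the starting size, growth rate, and capacity are placeholder assumptions.

```python
# Placeholder assumptions: 2 TB today, 3% net growth per month, 6 TB capacity.
current_tb = 2.0
monthly_growth = 0.03
capacity_tb = 6.0

# Project compound monthly growth and report when capacity is exceeded.
size = current_tb
for month in range(1, 61):
    size *= 1 + monthly_growth
    if size > capacity_tb:
        print(f"Capacity of {capacity_tb} TB exceeded in month {month} "
              f"(projected size: {size:.2f} TB)")
        break
else:
    print(f"Projected size after 60 months: {size:.2f} TB")
```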
In the next section, we’ll discuss the NLS solutions for SAP BW/4HANA in more detail and learn how to create the connections necessary to use them.