9    Data Lifecycle Management

Effectively managing data growth in your data warehouse can be an intensive and massive task. In this chapter, we’ll discuss planning for growth and performance up front during design, which can simplify and facilitate your data management efforts.

As data is loaded into a data warehouse, it’s propagated through the data flow and replicated throughout the LSA++ architecture. For example, data extracted from a table in SAP ERP could be loaded into the operational delta queue (ODQ), one or more DSO layers, change logs, and corporate memory. The result of this data propagation is data replication and an increased storage requirement beyond that of the original source system. In some cases, data may be denormalized to improve data load performance and/or reporting performance (i.e., including sales order header information in the sales order item data flow). Denormalization is the intentional duplication of columns in multiple tables, and it increases data redundancy.

Failure to proactively manage data loaded into SAP BW/4HANA can seriously jeopardize system performance and directly affect the total cost of ownership. Housekeeping activities must be an integral part of the operational concept of a complex enterprise data warehouse. Planning and executing housekeeping activities regularly in the system will ensure optimum utilization of system resources and thereby optimize the overall system performance.

An integral part of any data monitoring and housekeeping effort is a data retention strategy, as we’ll discuss in Section 9.1; such a strategy may include a hybrid approach that uses all tools at your disposal, such as aggregation, near-line storage, archival processes, and data deletion. In Section 9.2, we’ll introduce the data temperature concept for managing data with different retention requirements and access frequencies. For cold data, in Section 9.3 we’ll discuss in detail how to configure the data archiving process using SAP’s near-line storage (NLS) on SAP IQ or Hadoop.

9.1    Data Retention Strategy

In an operational data warehouse, it’s natural for the volume of data to perpetually increase, which is known as organic growth. Constant changes to business and legal requirements mean that this data must be available for longer than ever before. The impact of keeping a large volume of data in the system is generally a decrease in performance and an increase in administration effort. To combat and manage these impacts, you should implement a data aging strategy early in the lifecycle of the data warehouse; such a strategy can have a profound influence on design and architecture.

Efficient storage of historical data for comprehensive enterprise data warehousing is essential to maintaining a healthy and high-performing system. The following are some advantages of a carefully planned data retention strategy:

When considering a data retention strategy, there are always trade-offs: benefits and costs, advantages and risks. Making a decision about what data should remain available and accessible to end users is fraught with consequences.

However, the good news is that solid advice is available, because this dilemma has been wrestled with many times before. To start, it may be helpful to consider some of the following questions:

The greatest complexity related to defining a data retention strategy is that answering these questions requires candid discussion and agreement among infrastructure, technical support, and business representatives. Recognizing that the data retention requirements for each subject area or line of business may be completely different adds a bit more complexity. The good news is that a data retention strategy need not be one size fits all. In fact, multiple options are available, and the success of the whole strategy may depend on defining unique data aging tactics for each area based on specific data content and the business value of accessing that data.

The specific options or tactics available range from doing nothing (i.e., keeping all data available online for all time) to deleting data directly from the database with no option for recovery. Thankfully, there are also several options in between! Again, because these options may influence design and architecture, defining the data aging or retention strategy should not be deferred; if it is, a significant effort to redesign or rearchitecture may be needed.

The key point is that as data ages, details lose relevance. Keeping irrelevant data online not only adversely impacts performance (in read and write operations) but also increases administration and maintenance efforts and costs. The following list ranks different data retention tactics in order of data availability online and, indirectly, data relevance:

  1. Detail InfoProviders
  2. History InfoProviders
  3. Summary InfoProviders
  4. NLS
  5. Traditional archiving
  6. Data deletion
Note

Traditional archiving isn’t available for customer data loaded to SAP BW/4HANA InfoProviders; however, it is available for system administration data, such as IDocs and application logs.

From a purely technical perspective, the first step to defining which tactic is suitable for each data set is to determine the usage of the relevant data sets. The technical content statistics are a great place to start: Check the number of navigations and number of users executing queries on each InfoProvider for that data set (i.e., subject area). Ranking each InfoProvider by usage and performance is a good way to initiate discussions with the business about the need for a data retention strategy. The flaw in this approach is the inability to decipher the age of the data being queried from the statistical information available, so engaging the business on end user behavior is critical to fully understanding how the data is being consumed. A general guideline to follow is illustrated by Table 9.1. Data used more frequently should be kept online, and you should consider keeping data used less frequently offline.

Detail
InfoProviders
Summary
InfoProviders
Historical InfoProviders NLS Traditional Archiving Deletion
Frequently queried data Yes Yes Yes No No No
Rarely used data Yes Yes Yes Yes No No
Very rarely used data No Yes Yes Yes No No
Legal hold or audit data No No Yes Yes Yes No
Expired or obsolete data No No No No Yes Yes

Table 9.1    Data Retention Tactics versus Data Access Frequency Matrix

Although no single tactic should be employed for all data sets, Table 9.1 illustrates some general guidelines that can be employed fairly consistently. Before proceeding, we should review these tactics in a little more detail and evaluate the advantages and disadvantages of each approach.

The first three tactics (using detail, historical, and summary InfoProviders) will influence the design of the data models in the warehouse. The fourth one (using NLS) will influence the system architecture. The final two (using data archiving and data deletion) will only influence the warehouse administration.

First and foremost, all data that is frequently accessed should remain available in full detail. In fact, all available details of each transaction should be available to minimize the need to enhance the data structure by adding data fields later. This is the default tactic and is represented by the first detail InfoProviders column in Table 9.1. The other columns represent alternative tactics.

The use of historical InfoProviders involves the creation of a new InfoProvider in which old data is loaded at the original granularity (level of detail). Current data will remain in the original InfoProvider. After data is moved to the history cube, it must be deleted from the original InfoProvider so that queries won’t return duplicate data.

The historical InfoProvider tactic provides the following benefits:

However, the historical InfoProvider is associated with a few disadvantages:

Rather than keeping historical data at the same granularity as relevant or current data, another option is to simplify the data model and summarize the data. As the details lose relevance, this approach can be very practical to keep the relevant information available. The following are advantages of a summary InfoProvider:

The summary InfoProvider tactic is prone to the following disadvantages:

Beyond summarized data, there’s no other online data storage option. The next alternative is to move data out of the database and onto a near-line storage (NLS) database. Near-line storage is an intermediate type of data storage representing a compromise between online storage and archiving. When data is near-lined from the SAP HANA database, it is moved to the NLS system, but is still accessible by SAP BW/4HANA for reporting. The next section will dive deeper into the NLS options, but for now, the advantages of implementing NLS include the following:

The disadvantages of NLS as a data retention tactic include the following:

When data is truly no longer relevant for any end-use scenario, or liability policies dictate data should be purged from all production systems, then selective deletion of data from SAP BW/4HANA InfoProviders is the best and sometimes only option. The following are advantages of deleting data:

The disadvantages of data deletion, on the other hand, are quite stark:

It’s important to note that these tactics are not mutually exclusive and every tactic could be employed in a single scenario. The keys to defining a successful data aging strategy include the following:

In the next section, we’ll discuss the NLS solutions for SAP BW/4HANA in more detail and learn how to create the connections necessary to use them.