12.3Predictive Analysis Library
In comparison to the BFL, the Predictive Analysis Library (PAL) provides a series of generic, statistical algorithms that can be used on any data models. Table 12.4 contains some examples of algorithms you can implement in PAL.
Function |
Description |
Sample scenario |
---|---|---|
Anomaly detection |
Determination of outliers |
Detecting unusual system behavior: long response times despite normal system load |
A priori |
Detection of correlations for deriving rules |
Analysis of purchasing behavior: “Customers who have purchased products A and B often purchase product C also” |
K-means |
Classification of data into groups |
Segmentation of a customer base into target groups for promotions |
Table 12.4Some Functions of the Predictive Analysis Library
Not all PAL functions are provided for direct use. For example, some of the more complex PAL functions provide as a return value a description in the Predictive Model Markup Language (PMML) format, a standardized XML format for statistical models. Such functions are aimed at exchange options with other products, such as SAP Predictive Analytics, which was described initially.
[»]Background: Training and Execution of Statistical Models
In real life, there are only a few cases where you can apply standard algorithms to an existing dataset for gaining new information. Usually, you first have to prepare the data elaborately and determine the right input and output variables in modeling so that these models can then be trained using datasets and the resulting algorithm can be run (scoring). To keep the examples comprehensible, this chapter is restricted to the technical call of PAL functions.
In this section, we’ll demonstrate a PAL function that you can use to segment general datasets: the K-means function. Here, a dataset is divided into a specified number (K) of groups (or clusters). We won’t discuss the underlying mathematical algorithm in detail at this point. However, the basic idea is based on assigning an initial selection of centers of data records to the cluster whose center is closest. This enables you to identify patterns and classify datasets (e.g., customers, products, and so on). Figure 12.3 visualizes sample values and displays the corresponding cluster.
Figure 12.3Schematic Visualization of Dataset Segmentation via the K-Means Function
As sample values, you can imagine that each point represents a flight connection, and the values on the axes represent the average seat utilization (Y-axis) and the percentage share of bookings with excess baggage (X-axis) in a period of time. Via the segmentation, you get a classification of flights into different categories. Flight connections with a high utilization and low excess baggage may indicate, for example, very frequent usage by business travelers (left upper cluster).
Table 12.5 shows the input and output parameters of the interface of the K-means function, where this segmentation is based on two numeric values (V000 and V001).
Parameter |
Explanation |
Column Structure |
|
---|---|---|---|
Input: |
The dataset to be classified, consisting of ID and numeric values |
ID |
INTEGER |
V000 |
DOUBLE |
||
V001 |
DOUBLE |
||
Input: |
Parameterization of segmentation by name/value pairs, for example, cluster number (GROUP_NUMBER) |
NAME |
NVARCHAR (50) |
INTARGS |
INTEGER |
||
DOUBLEARGS |
DOUBLE |
||
STRINGARGS |
NVARCHAR |
||
Output: |
Assignment of data records to a cluster |
ID |
INTEGER |
CENTER_ASSIGN |
INTEGER |
||
DISTANCE |
DOUBLE |
||
Output: |
List of centers of the groups (cluster ID and coordinates of the center) |
CENTER_ID |
INTEGER |
V000 |
DOUBLE |
||
V001 |
DOUBLE |
Table 12.5Interface of K-Means Function from PAL
12.3.1Generating the K-Means Function via the SQL Console
As described in Section 12.1, you as an administrator must first generate this interface. In this section, we’ll show you how to do this using Native SQL. Then, we’ll discuss the graphical AFM so that you as the developer no longer need to use manual generation. However, we’ll show the functioning because you’ll get a better understanding of the procedures behind the scenes of AFM and because generation the procedures from a program is beneficial in some generic cases.
To generate the K-means function with the interface from Table 12.5, execute the SQL statements from Listing 12.3 using the SQL console in SAP HANA Studio. Here, table types for the input and output parameters from Table 12.5 are first created in the _SYS_AFL schema, and then the desired interface is created using the AFL_WRAPPER_GENERATOR database procedure. After the successful execution of these SQL statements, the _SYS_AFL schema contains a procedure named PAL_KMEANS. For more information on generating database procedures for the BFL and PAL libraries, see the reference documentation of these libraries at http://help.sap.com/hana.
-- Create table types for interface
CREATE TYPE PAL_KMEANS_ASSIGNED_T AS TABLE(
"ID" INT,
"CENTER_ASSIGN" INT,
"DISTANCE" DOUBLE);
CREATE TYPE PAL_KMEANS_DATA_T AS TABLE(
"ID" INT,
"V000" DOUBLE,
"V001" DOUBLE,
primary key("ID"));
CREATE TYPE PAL_KMEANS_CENTERS_T AS TABLE(
"CENTER_ID" INT,
"V000" DOUBLE,
"V001" DOUBLE);
CREATE TYPE PAL_CONTROL_T AS TABLE(
"NAME" VARCHAR (50),
"INTARGS" INTEGER,
"DOUBLEARGS" DOUBLE,
"STRINGARGS" VARCHAR (100));
-- Define interface
DROP TABLE PDATA;
CREATE COLUMN TABLE PDATA(
"ID" INT,
"TYPENAME" VARCHAR(100),
"DIRECTION" VARCHAR(100) );
INSERT INTO PDATA VALUES (1,'_SYS_AFL.PAL_KMEANS_DATA_T','in');
INSERT INTO PDATA VALUES (2,'_SYS_AFL.PAL_CONTROL_T','in');
INSERT INTO PDATA VALUES (3,'_SYS_AFL.PAL_KMEANS_ASSIGNED_T',
'out');
INSERT INTO PDATA VALUES (4, '_SYS_AFL.PAL_KMEANS_CENTERS_T',
'out');
-- Generate K-means function
call SYSTEM.AFL_WRAPPER_GENERATOR ('PAL_KMEANS', 'AFLPAL',
'KMEANS', PDATA);
Listing 12.3Generation of an Interface for the K-Means Function
To test the generated function via the SQL console in SAP HANA Studio, you can execute the statements in Listing 12.4.
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_CONTROL_TBL like
PAL_CONTROL_T;
INSERT INTO #PAL_CONTROL_TBL VALUES ('GROUP_NUMBER', 4, null, null);
; Create data table and insert sample values
CREATE COLUMN TABLE PAL_KMEANS_DATA_TBL LIKE PAL_KMEANS_DATA_T;
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (0, 0.5, 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (1, 0.5, 1);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (2, 1, 0.5);
INSERT INTO PAL_KMEANS_DATA_TBL VALUES (3, 1, 1);
; Create output tables
CREATE COLUMN TABLE PAL_KMEANS_ASSIGNED_TBL LIKE PAL_KMEANS_ASSIGNED_T;
CREATE COLUMN TABLE PAL_KMEANS_CENTERS_TBL LIKE PAL_KMEANS_CENTERS_T;
; Call K-Means function
CALL PAL_KMEANS( PAL_KMEANS_DATA_TBL,
#PAL_CONTROL_TBL,?,?);
Listing 12.4Testing the K-Means Function via the SQL Console
The result of the calculation is shown in Figure 12.4. As expected, the four sample values were distributed to the four clusters, and the values form the center of the cluster.
Figure 12.4Calling the K-Means Function Using Sample Values
As you’ve seen, both the steps for generating and the parameterization of a PAL function are very comprehensive. By means of the AFM, you can define the generation and the call from Listing 12.3 and Listing 12.4 in graphical form to a great extent and even model complex call sequences.
12.3.2Using the Application Function Modeler
The Application Function Modeler (AFM) is a tool that supports the generation of AFL procedures. Published with SAP HANA SPS 7, AFM initially supported the generation of individual PAL or BFL functions only. The first version was some kind of graphical interface for the AFLLANG_WRAPPER_PROCEDURE_CREATE procedure, which we presented in the previous section. With SPS 9, the AFM has been revised significantly and now allows for modeling of entire flowgraphs, that is, an entire sequence of operations. This results in a database procedure that you can reuse in your own programs or other tools. In this section, we’ll present the basic usage of this tool. All detail options are described in the development guide available at http://help.sap.com/hana.
As an application example, we want to perform a segmentation of flight customers using the Kmeans function in a CUSTOMER_SEGMENTATION procedure, while considering the following input variables:
-
Total of booking prices (in EUR) in one year
-
Total of baggage weight (in KG) in one year
We create two CDS views to determine the required values. Initially, we convert the data types, currencies, and units in the first view (see Listing 12.5).
@EndUserText.label: Conversions and type casts'
define view Za4h_Cds_Booking
as select from sbook {
cast ( customid as abap.int4) as id,
cast ( unit_conversion(
quantity => luggweight,
source_unit => wunit,
target_unit => cast( 'KG' as abap.unit(3) ),
error_handling => 'SET_TO_NULL' )
as abap.fltp) as weight_kg,
cast ( currency_conversion(
amount => sbook.loccuram ,
source_currency => loccurkey,
target_currency => cast( 'EUR' as abap.cuky ),
exchange_rate_date =>
cast( '20150819' as abap.dats ),
error_handling => 'SET_TO_NULL' )
as abap.fltp) as price_eur
};
Listing 12.5Currency and Unit Conversion
In another view, we use this as the basis to perform aggregations (see Listing 12.6).
@ClientDependent: false
@EndUserText.label: 'Input for KMeans-Clustering'
define view Za4h_Cds_Kmeans as select from za4h_book_conv {
id,
sum ( weight_kg ) as weight,
sum ( price_eur ) as price
} group by id;
Listing 12.6Aggregation and Type Conversion
As a result of this application, we expect a division of flight passengers into groups (with gradations and combinations), of which passengers are more likely to be business (many flights, little baggage) or private (fewer flights, more baggage). From this information and its corresponding time-based development, an airline could design a bonus system that is tailored to the needs of these groups (e.g., higher baggage allowance for frequent travelers).
Before we describe the actual usage of the AFM, we first want to explain the associate diagram presentation (see Figure 12.5). You define the specific data sources and target structures as well as the mapping to the respective columns (in the AFM, you select the arrows to perform this configuration). For a single function call, the presentation simply corresponds to the interface of the function. However, in the AFM, you can also use the output of a function directly as an input for another call.
Figure 12.5Flow Chart of the K-Means Function
To use the AFM in SAP HANA Studio as a developer, you must first create an SAP HANA Extended Application Services (SAP HANA XS) project. We discussed this procedure in Chapter 4, Section 4.3. Now create an object of the flowgraph model type in your package, and enter a name (see Figure 12.6).
Figure 12.6Creating a Flowgraph Model
Select Flowgraph for Activation as Stored Procedure to generate a database procedure as the result of modeling. Alternatively, you can generate a task plan that can be configured for automatic execution within the scope of the package SAP HANA Enterprise Information Management.
Again, we consider the K-means function from the previous section as an example. Now we want to implement the generation via the AFM. Figure 12.7 shows a graphical model of the desired function. By means of the palette on the right-hand side 1, you add the K-means function from the Predictive Analysis Library area and a Data Source as well as two result structures (Data Sink [Template Table]) to the editor area 2. Now select the ZA4H_KMEANS_IN CDS view as the data source, and define tables in the _SYS_BIC schema as target tables, which the AFM automatically creates when you establish a connection with the right structure of the output tables. Under Properties 3, you can specify the parameters, for example, the number of clusters.
Figure 12.7Flowgraph Model for the K-Means Function
In the settings, choose 4 as the number of desired clusters (GROUP_NUMBER=4), and enforce a normalization of values (NORMALIZATION=1) because the values for weight and price have different value ranges, and the price would otherwise dominate the analysis too much. It’s necessary here to find the right calibration. In real life, however, this involves far more than playing with the values mentioned.
After you’ve concluded the definition with no errors found (indicated by green icons), you must activate the model (using the Activate SAP HANA Development Object button). The database procedure is then generated in the selected schema (e.g., _SYS_BIC). Next, you can test the procedure by selecting the Execute in SQL Editor entry from the top right selection area 4. This opens the SQL console, and you can query the values in the target tables.
[»]Advantages and Disadvantages of the AFM
As you’ve seen, the AFM provides much simpler access to using the AFL functions compared with manual generation. Because the models are defined as SAP HANA development objects, you can transport them with your application.
However, you must define both the data sources and the target structures as static tables or views. Moreover, if the AFM is used, the technical _SYS_REPO user requires specific authorizations for the objects involved.
There are advantages and disadvantages, but the AFM is the method of choice for most scenarios.
As in the previous section, we use an ABAP database procedure to integrate the procedure we’ve previously generated with ABAP (see Listing 12.7). We only want to assign the datasets to the clusters (and additionally join the input values). Note that we don’t use the OPTIONS READ-ONLY addition for this procedure because the K-means function writes data to the output tables.
PUBLIC CREATE PUBLIC.
PUBLIC SECTION.
INTERFACES: if_amdp_marker_hdb.
TYPES: BEGIN OF ty_center_assign,
id TYPE s_customer,
weight TYPE s_lugweigh,
price TYPE s_price,
center_assign TYPE i,
distance TYPE p LENGTH 6 DECIMALS 2,
END OF ty_center_assign.
TYPES tt_center_assign TYPE TABLE OF ty_center_assign.
METHODS: kmeans
EXPORTING
VALUE(et_center_assign) TYPE tt_center_assign.
PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.
CLASS zcl_a4h_chapter12_kmeans IMPLEMENTATION.
METHOD kmeans BY DATABASE PROCEDURE /FOR HDB
LANGUAGE SQLSCRIPT USING ZA4H_KMEANS_IN.
CALL "_SYS_BIC"."test.a4h.book.chapter12::kmeans"();
lt_center_assign = select * from
_sys_bic.za4h_cluster_assign;
et_center_assign = select r.id, i.weight, i.price,
r.center_assign, r.energy as distance
from :lt_center_assign as r
inner join ZA4H_KMEANS_IN as i on r.id = i.id
order by center_assign, distance;
ENDMETHOD.
ENDCLASS
Listing 12.7Calling the K-Means Function from an ABAP Database Procedure
We use a sample report to save the result in an ABAP table and then use the data display in SAP HANA Studio for an initial analysis of the entries (see Figure 12.8). We added the customer type (CUSTTYPE) and the customer discount (DISCOUNT) to the result structure to filter by these values. Of course, you can also use the results in ABAP applications, for example, using Web Dynpro ABAP, SAPUI5, and so on, to implement a graphical display.
Figure 12.8Result of the Segmentation of Flight Customers via K-Means
By means of the libraries presented in this chapter, you can access complex statistical functions to deploy individually or in combination with an ABAP program. For more detailed analyses, SAP HANA provides additional integration options (particularly with the open-source statistics software R).
[»]R-Integration
In addition to PAL, SAP HANA also contains an adapter for integrating the open-source software system R (www.r-project.org). You have an additional range of statistical operations available via this adapter. We won’t discuss R integration in detail within this book. However, it’s used in a similar manner to how the AFL functions are used via database procedures in SAP HANA. AFM also permits the integration of nodes that are implemented via an R script.
You must note, however, that the R server isn’t part of SAP HANA for licensing reasons, and it must be installed separately on a dedicated server. More information is available in the SAP HANA R Integration Guide at http://help.sap.com/hana .