Key Concepts Refresher

We’ve all heard of the website Google: You type in a few words, and the Google search engine returns a list of web pages that best meet your search query, ranked to your preferences. It is also used to indicate when you type a word incorrectly and suggests another spelling, with the question, Did you mean...? Now, however, the Google search engine has become so confident that it no longer suggests the correction with a qualifying question, but instead says Showing results for... . If you’re convinced you’re original entry is correct, it will allow you to Search instead for [whatever you entered]. Even more revolutionary is the search engine’s ability to predict what you want as you type.

Google search is a good example of what we can do with SAP HANA in information models: We can perform fuzzy text searches, like the “Did you mean” example; we can perform text mining to find the best document out of a large collection for a user; and we can perform predictive analysis.

In this chapter, we will also look at text and sentiment analysis and spatial and geo-spatial processing.

Text

First, let’s look at working with the text features in SAP HANA—namely, fault-tolerant text searches, text analysis, and text mining. However, before we can discuss each of these topics in more detail, you first need to create a full-text index.

Text Index

To enable the text features in SAP HANA, you need a full-text index. Depending on what you specify when you create the full-text index, it gives access to one or more text features. The full-text index will use additional memory to provide the requested text features.

Listing 9.1 provides an example of a SQL statement to create a full-text index. In this case, we create a full-text index with the name DEMO_TEXT_INDEX on a column called SOME_TEXT in a column table.

CREATE FULLTEXT INDEX DEMO_TEXT_INDEX ON 
"TEXT_INDEX"."DEMO_TEXT" ("SOME_TEXT")
LANGUAGE DETECTION ('EN')
SEARCH ONLY ON
FUZZY SEARCH INDEX ON
TEXT ANALYSIS ON
CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
TEXT MINING OFF;

Listing 9.1 SQL Statement to Create a Full-Text Index

Let’s examine the different text elements of the SQL statement in Listing 9.1:

[»] Note

You normally put the SQL statement in Listing 9.1 in a procedure (using a .hdbprocedure file). This way, you can deploy it later on the production system. The recommended way to create full-text indexes going forward is using Core Data Services (CDS). You can find the CDS syntax in the SAP HANA Search Developer Guide at http://help.sap.com/hana_options_adp.

You now can open the definition of the column table by double-clicking the table’s name in the Catalog area. In the Indexes tab, you’ll see the full-text index named DEMO_TEXT_INDEX that you just created. When you select this index, you’ll see the details of the full-text index as shown in Figure 9.1.

You can see from the checkboxes that we enabled Fuzzy Search and Text Analysis and disabled Text Mining in this index.

Figure 9.1   Full-Text Index DetailsFull-text index

Figure 9.1 Full-Text Index Details

When fuzzy search is enabled for text-type columns, the full-text index increases the memory size of the columns by approximately an additional 10 percent. When text analysis is enabled, the increase in the memory size due to the full-text index depends on many factors, but it can be as large as that of the column itself.

SAP HANA will automatically create full-text indexes for columns with data types TEXT, BINTEXT, and SHORTTEXT.

For other column types, you must manually create the full-text index. You can only manually create a full-text index for columns with the VARCHAR, NVARCHAR, ALPHANUM, CLOB, NCLOB, and BLOB data types.

When you manually create an index, SAP HANA attaches a hidden column to the column you indexed. This hidden column contains extracted textual data from the source column; the original source column remains unchanged. Search queries are performed on the hidden column, but the data returned is from the source column.

[+] Tip

Remember that a full-text index is created for a specific column, not for the entire table. You will have to create a full-text index for every column that you wish to search or analyze.

Text Search

When you’ve created a full-text index with the FUZZY SEARCH option enabled, you can do fault-tolerant searches on your text data. (These are also called fuzzy searches.) This means we can misspell words, and SAP HANA will still find the text for us. The real-world scenario at the beginning of the chapter illustrates this point in a somewhat exaggerated manner.

SAP HANA will find alternative (US and UK) spellings for words like color/colour, artifact/artefact, utilize/utilise, and check/cheque. Incorrectly entered words such as typing coat, will still return the correct word, which was probably cost.

In Chapter 8, we discussed the LIKE operator in SQL. You can search for a random pharmaceutical product name (like Guaiphenesin) via the following SQL statement:

SELECT productname, price from Products where productname LIKE 
‘%UAIPH%’;

This will find the name if you type the first few characters correctly. If you search for “%AUIPH%” instead, you won’t find anything.

CONTAINS Predicate

In SAP HANA, you can use the CONTAINS predicate to use the fuzzy text search feature. In this case, you use the following SQL statement:

SELECT SCORE(), productname, price from Products where 
CONTAINS(productname, ‘AUIPH’, FUZZY(0.8)) ORDER BY SCORE() DESC;

[+] Tip

Remember that your users will not type SQL statements; you’ll create these statements in a user-defined function (UDF) or a procedure and call them from your graphical calculation view or your SAP HANA XS application.

The CONTAINS predicate uses the column name you want to search on, the search string itself (even if misspelled), and how “fuzzy” you want the search to be. The fuzziness factor can range from zero (everything matches) to one (requires an exact match).

The SCORE() function helps sort the results from the most relevant to the least relevant.

If you want to search on all columns with a full-text search index, you can specify the CONTAINS predicate as follows:

CONTAINS (*,‘AUIPH’, FUZZY(0.8))

When you search on multiple columns at once, it’s called a freestyle search. In this case, the asterisk (*) includes all columns. However, you also can specify which columns you want to include or exclude in the list.

There are some special characters you can use in the search string:

SAP HANA Simple Info Access API

The SAP HANA simple info access (SINA) API is a client-side JavaScript API for developing your own web-based search interfaces. Figure 9.2 shows a demo text search application created with the SINA toolkit.

Figure 9.2   SAP HANA Simple Info Access Demo Search Application

Figure 9.2 SAP HANA Simple Info Access Demo Search Application

[»] Note

SAP HANA now includes an embedded search via the built-in sys.esh_search()procedure. With this procedure, you can use an existing ABAP SQL connection instead of an XS OData service for fuzzy text searches in SAP HANA from SAP business applications.

For more information on this new API, see the SAP HANA Search Developer Guide at http://help.sap.com/hana_options_adp.

Fuzzy Text Search in Attribute Views

To date, you can only use fuzzy text searches directly in attribute views; this is one of the few reasons that SAP HANA SPS 11 still uses attributes.

Figure 9.3 shows the Search Properties tab in which you can see an output column’s properties. Here, you can enable fuzzy text search for a column and with a particular Fuzziness Threshold.

Figure 9.3   Output Column Fuzzy Text Search Properties in an Attribute View

Figure 9.3 Output Column Fuzzy Text Search Properties in an Attribute View

Text Analysis

Once you’ve created a full-text index with the TEXT ANALYSIS option enabled, you can analyze text with the SAP HANA text analysis feature, both via entity analysis and sentiment analysis.

Working with Text Analysis

You create text analysis via the full-text index. The abbreviated statement is as follows:

CREATE FULLTEXT INDEX INDEX_NAME ON TABLE_NAME ("COLUMN_NAME")
TEXT ANALYSIS ON
CONFIGURATION 'CONFIG_NAME';

You need the TEXT ANALYSIS ON clause and a CONFIGURATION NAME; there are a few built-in configurations, but you can also create your own industry-specific configuration files.

In the earlier example shown in Figure 9.1, we created the full-text index (called DEMO_TEXT_INDEX) on the DEMO_TEXT table. When you look at the table in the Catalog area, you’ll notice that SAP HANA has created an additional table: $TA_DEMO_TEXT_INDEX (see Figure 9.4). This is the name of the full-text index we created, prepended with $TA_.

Figure 9.4   Source Table (DEMO_TEXT) and $TA_ Text Analysis Results Table

Figure 9.4 Source Table (DEMO_TEXT) and $TA_ Text Analysis Results Table

You’d expect $TA_ to denote text analysis—and that’s exactly right. You get the text analysis request results back in table $TA.

[+] Hint

Table $TA is not the full-text index; it contains the results of the text analysis.

The original table included the following values in the SOME_TEXT column:

Figure 9.5 shows the text analysis request results.

Figure 9.5   $TA_ Text AnalysisText analysisresults Results Table Content

Figure 9.5 $TA_ Text Analysis Results Table Content

The text analysis determined the following points:

With a different text analysis configuration file, we would have received different results.

Using the standard configuration files, you will be able to identify many of the attributes (master data) in your text data. Identifying measures is not always that easy, however.

SAP HANA Text Analysis XS JavaScript API

From SAP HANA SPS 10 on, the SAP HANA Text Analysis XS JavaScript API is available. With this API, you can call text analysis without creating a full-text index or reading the results in the $TA table.

Grammatical Role Analysis

SAP HANA SPS 11 makes the optional Grammatical Role Analysis (GRA) analyzer for English available. It identifies groups of three in the form of subject–verb–object expressions.

For an example, let’s look at a well-known phrase:

The [SUBJECT]quick brown fox[/SUBJECT] [VERB]jumps[/VERB] 
over the [DIRECTOBJECT]lazy dog[/DIRECTOBJECT]

Text Mining

Fuzzy searches and text analysis work with a word, sentence, or paragraph—but with text mining, you can work with entire documents. These documents can be web pages, Microsoft Word documents, or PDF files that are loaded into SAP HANA.

Text mining uses the full-text indexing and text analysis processes. When using text mining in SAP HANA, you can perform the following tasks:

You can call the text mining features from SQL functions or from web applications that can use the SAP HANA Text Mining XS JavaScript API.

[+] Tip

Remember that users will not work at the lower level that we have described in this chapter: They will want to use text features from a browser-based application interface. This is why SAP has focused on supplying various SAP HANA XS JavaScript APIs for SAP HANA modelers and developers.

Now that you know how to work with text data in SAP HANA and can use fault-tolerant text searches, text analysis, and text mining in SAP HANA modeling work, let’s move on to spatial processing.

Spatial

The next topic we’ll discuss is understanding spatial processing in SAP HANA.

[+] Tip

Geospatial processing is a subset of spatial processing related to geography. Spatial processing can include CAD models or even the human body and uses a three-dimensional coordinate system that differs from the geospatial coordinates of longitude, latitude, and height above sea level.

We first mentioned spatial processing in Chapter 4 when we discussed spatial joins. Now, let’s take a closer look at spatial joins in an SAP HANA system.

Spatial Data Types and Joins

Figure 9.6 shows two tables that contain spatial data. One table contains shop locations, and the other table contains city suburb shapes.

Figure 9.6   Two Tables Containing Spatial Data, with a Spatial Join

Figure 9.6 Two Tables Containing Spatial Data, with a Spatial Join

Here, we stored the shop locations in table GEO_SHOP, using the ST_POINT data type.

The following spatial data types are available:

You can join different data spatial types together in a spatial join. Figure 9.7 shows the details of a spatial join. Notice that the Spatial Properties tab is available for this join type. The Predicate dropdown list in this tab allows you to specify how to join the fields.

Figure 9.7   Spatial Join Details

Figure 9.7 Spatial Join Details

In the shop example, we will use the Contains option, because we want to see which suburb contains which shop addresses (locations).

[+] Tip

Please don’t confuse the Contains option in the spatial joins with the CONTAINS predicate we used when executing a fuzzy text search.

Importing and Exporting Spatial Data

In SAP HANA, you can import and export spatial data in the Well-Known Text (WKT) and the Well-Known Binary (WKB) formats. These formats are part of the Open Geospatial Consortium (OGC) standards. You can also import and export the extended versions of these formats, which are used by PostGIS.

In addition, SAP HANA can import ESRI shapefiles and export spatial data in the GeoJSON and Scalable Vector Graphic (SVG) file formats. The SVG file format is maintained by World Wide Web Consortium (W3C) and supported by all modern browsers.

Spatial Functions

You saw an example of a spatial function earlier when we asked for the area of a polygon to be calculated—for example, geometry1.ST_Area().

Referring back to Figure 9.7, you can ask if a shop’s location is in a suburb as follows:

SUBURB_SHAPE.ST_CONTAINS(SHOP_LOCATION1)

You can also see how far two shops are from each other as follows:

SHOP_LOCATION1. ST_WITHINDISTANCE(SHOP_LOCATION2)

Additional Resources

To learn more about spatial processing in SAP HANA, look at the examples in the SHINE demo content. Here, you can find examples like the one in Figure 9.8, in which you can work with shops on a map on the left of the screen and see their details in the area on the right.

Figure 9.8   Spatial Data Used in a Business Application (SHINE Demo Content)

Figure 9.8 Spatial Data Used in a Business Application (SHINE Demo Content)

You can find more information about spatial data in the SAP HANA Spatial Reference Guide at http://help.sap.com/hana_options_spatial.

Predictive

Now, let’s discuss predictive analysis. We’ll look at the SAP HANA predictive analysis libraries and how to use them to create a predictive analysis model.

In this section, we’ll look at predicting future probabilities and trends and how you can use this functionality in business. In the real-world scenario at the beginning of the chapter, we described a use case in which you might create a predictive model in SAP HANA to help sales reps with their sales forecasting by looking at seasonal patterns; for example, people use more pharmaceuticals for allergies in the spring.

There are many use cases for the predictive analysis libraries built into SAP HANA: For financial institutions such as banks, you can detect fraud by looking at abnormal transactions. For fleet management and car rental companies, you can predict which parts will break soon in a vehicle and replace them before they actually break.

Predictive Analysis Library

In SAP HANA, the predictive analysis functions are grouped together in the Predictive Analysis Library (PAL). PAL defines functions that you can call from SQLScript and use to perform various predictive analysis algorithms. The algorithms were chosen by looking at what SAP HANA applications need, what people commonly use, and what the market requests.

In PAL, predictive analysis algorithms are grouped in nine categories; we’ll take a quick look at the nine categories and discuss a few popular algorithms in some of these categories. Figure 9.9 provides an overview. The categories are as follows:

Figure 9.9 shows the PAL algorithms, per category, that are available in SAP HANA SPS 11.

Figure 9.9   Predictive Functions Available in SAP HANA SPS 11, Grouped by Category

Figure 9.9 Predictive Functions Available in SAP HANA SPS 11, Grouped by Category

You can find information about each of the PAL predictive functions in the SAP HANA Predictive Analysis Library (PAL) Reference Guide at http://help.sap.com/hana_platform/.

Installing PAL

PAL is part of the Application Function Library (AFL) that ships with all SAP HANA systems. AFL is an optional component; if you want to use the predictive analysis algorithms in PAL, you have to install AFL via SAP HANA Application Lifecycle Management. AFL also includes a Business Function Library (BFL).

After the installation, you have to ensure that the script server is started and make sure you have the AFL_SYS_AFL_AFLPAL_EXECUTE role assigned to your SAP HANA user. (Roles, and how to assign them, are discussed in Chapter 13.)

You can check if AFL is installed on a system by opening the AFL_FUNCTIONS view in the SYS schema. If you see lots of entries in this view, then AFL has been installed on that system.

Creating a Predictive Model

Once PAL is installed, you can create predictive models graphically via the application function modeler (AFM). In Chapter 5, we discussed the steps to create a flowgraph model:

  1. Create a .hdbflowgraph file in your project in the developer perspective.
  2. Give your predictive model a name, and choose Flowgraph for Activation as Stored Procedure.

[»] Note

You can also create a predictive model using SQLScript, which is the traditional way to build such models. Building the models graphically using the AFM only became available in the last few releases of SAP HANA.

If you find .aflpmml files, which are AFL models, you will have to convert these to AFM flowgraphs. SAP HANA SPS 11 can still run these files, but you cannot edit them in the AFM.

In the AFM, build a flowgraph model. Start on the left side and work towards the right side, as shown in Figure 9.10. First, drag a table with data into the left side of the work area. Next, select a PAL function from the palette on the right and drag it to the work area. Then, connect the table to the PAL function; the PAL functions are grouped into the categories that we discussed earlier.

In Figure 9.10, we started building the end-to-end scenario to predict segmentation of new customers for a supermarket, as described in the PAL reference guide. The guide describes how to do this by using SQLScript, but it’s a good exercise to try to do it via a graphical flowgraph model.

Figure 9.10   Creating a Flowgraph Model in the AFM Tool

Figure 9.10 Creating a Flowgraph Model in the AFM Tool

Sometimes, you might want to train your model, and then use it later to make predictions. Figure 9.11 shows such a model: It starts with some training data that you feed into a C4.5 decision tree. The data is used to train the classification model. You then feed the trained classification model into a prediction function and also give the prediction function some new data. The prediction function applies the trained model to the new data and provides the predicted results in an output table.

Figure 9.11   Predictive Model Created in AFMApplication Function Modeler (AFM)

Figure 9.11 Predictive Model Created in AFM

SAP HANA automatically converts the graphical flowgraph model into a SQLScript procedure, and you can then call this procedure when you want to use the predictive model as part of your other SAP HANA information models.