SAP PRESS :: SAP HANA Certification Guide - 0 Text, Spatial, and Predictive Modeling

Key Concepts Refresher

We’ve all heard of the website Google: You type in a few words, and the Google search engine returns a list of web pages that best meet your search query, ranked to your preferences. It is also used to indicate when you type a word incorrectly and suggests another spelling, with the question, Did you mean...? Now, however, the Google search engine has become so confident that it no longer suggests the correction with a qualifying question, but instead says Showing results for... . If you’re convinced you’re original entry is correct, it will allow you to Search instead for [whatever you entered]. Even more revolutionary is the search engine’s ability to predict what you want as you type.

Google search is a good example of what we can do with SAP HANA in information models: We can perform fuzzy text searches , like the “Did you mean” example; we can perform text mining to find the best document out of a large collection for a user; and we can perform predictive analysis.

In this chapter, we will also look at text and sentiment analysis and spatial and geo-spatial processing.

Text

First, let’s look at working with the text features in SAP HANA—namely, fault-tolerant text searches , text analysis , and text mining . However, before we can discuss each of these topics in more detail, you first need to create a full-text index.

Text Index

To enable the text features in SAP HANA, you need a full-text index . Depending on what you specify when you create the full-text index, it gives access to one or more text features. The full-text index will use additional memory to provide the requested text features.

Listing 9.1 provides an example of a SQL statement to create a full-text index. In this case, we create a full-text index with the name DEMO_TEXT_INDEX on a column called SOME_TEXT in a column table.

CREATE FULLTEXT INDEX DEMO_TEXT_INDEX ON 
   "TEXT_INDEX"."DEMO_TEXT" ("SOME_TEXT")
   LANGUAGE DETECTION ('EN')
   SEARCH ONLY ON
   FUZZY SEARCH INDEX ON
   TEXT ANALYSIS ON
      CONFIGURATION 'EXTRACTION_CORE_VOICEOFCUSTOMER'
   TEXT MINING OFF;

Listing 9.1 SQL Statement to Create a Full-Text Index

Let’s examine the different text elements of the SQL statement in Listing 9.1:

CREATE FULLTEXT INDEX
The first line creates a full-text index from the SOME_TEXT column in the DEMO_TEXT table.

LANGUAGE DETECTION
SAP HANA will attempt to detect the language of your text. You can specify on what languages it should focus with this clause. In this case, we indicated that our text is all in English.

SEARCH ONLY ON
This clause saves some memory by not copying the source text into the full-text index.

FUZZY SEARCH INDEX ON
Enables or disables the fault-tolerant search feature. This feature is also called fuzzy text search .

TEXT ANALYSIS ON
Enables or disables the text analysis feature.

CONFIGURATION
Name of the configuration. The text analysis feature needs a configuration file. Some standard configurations ship with SAP HANA, and you can also create your own.

TEXT MINING OFF
Enables or disables the text-mining feature.

[»] Note

You normally put the SQL statement in Listing 9.1 in a procedure (using a .hdbprocedure file). This way, you can deploy it later on the production system. The recommended way to create full-text indexes going forward is using Core Data Services (CDS). You can find the CDS syntax in the SAP HANA Search Developer Guide at http://help.sap.com/hana_options_adp.

You now can open the definition of the column table by double-clicking the table’s name in the Catalog area. In the Indexes tab, you’ll see the full-text index named DEMO_TEXT_INDEX that you just created. When you select this index, you’ll see the details of the full-text index as shown in Figure 9.1.

You can see from the checkboxes that we enabled Fuzzy Search and Text Analysis and disabled Text Mining in this index.

Figure 9.1 Full-Text Index Details

When fuzzy search is enabled for text-type columns, the full-text index increases the memory size of the columns by approximately an additional 10 percent. When text analysis is enabled, the increase in the memory size due to the full-text index depends on many factors, but it can be as large as that of the column itself.

SAP HANA will automatically create full-text indexes for columns with data types TEXT, BINTEXT, and SHORTTEXT.

For other column types, you must manually create the full-text index. You can only manually create a full-text index for columns with the VARCHAR, NVARCHAR, ALPHANUM, CLOB, NCLOB, and BLOB data types.

When you manually create an index, SAP HANA attaches a hidden column to the column you indexed. This hidden column contains extracted textual data from the source column; the original source column remains unchanged. Search queries are performed on the hidden column, but the data returned is from the source column.

[+] Tip

Remember that a full-text index is created for a specific column, not for the entire table. You will have to create a full-text index for every column that you wish to search or analyze.

Text Search

When you’ve created a full-text index with the FUZZY SEARCH option enabled, you can do fault-tolerant searches on your text data. (These are also called fuzzy searches .) This means we can misspell words, and SAP HANA will still find the text for us. The real-world scenario at the beginning of the chapter illustrates this point in a somewhat exaggerated manner.

SAP HANA will find alternative (US and UK) spellings for words like color/colour, artifact/artefact, utilize/utilise, and check/cheque. Incorrectly entered words such as typing coat, will still return the correct word, which was probably cost.

In Chapter 8, we discussed the LIKE operator in SQL. You can search for a random pharmaceutical product name (like Guaiphenesin) via the following SQL statement:

SELECT productname, price from Products where productname LIKE 
‘%UAIPH%’;

This will find the name if you type the first few characters correctly. If you search for “%AUIPH%” instead, you won’t find anything.

CONTAINS Predicate

In SAP HANA, you can use the CONTAINS predicate to use the fuzzy text search feature. In this case, you use the following SQL statement:

SELECT SCORE(), productname, price from Products where 
CONTAINS(productname, ‘AUIPH’, FUZZY(0.8)) ORDER BY SCORE() DESC;

[+] Tip

Remember that your users will not type SQL statements; you’ll create these statements in a user-defined function (UDF) or a procedure and call them from your graphical calculation view or your SAP HANA XS application.

The CONTAINS predicate uses the column name you want to search on, the search string itself (even if misspelled), and how “fuzzy” you want the search to be. The fuzziness factor can range from zero (everything matches) to one (requires an exact match).

The SCORE() function helps sort the results from the most relevant to the least relevant.

If you want to search on all columns with a full-text search index, you can specify the CONTAINS predicate as follows:

CONTAINS (*,‘AUIPH’, FUZZY(0.8))

When you search on multiple columns at once, it’s called a freestyle search . In this case, the asterisk (*) includes all columns. However, you also can specify which columns you want to include or exclude in the list.

There are some special characters you can use in the search string:

Double quotes (")
Everything between the double quotes is searched for exactly as it appears.
Asterisk (*)
Replaces zero or more characters in the search string; for example, cat* matches cats, catalogues, and so on.
Question mark (?)
Replaces a single character in the search string; for example, cat? matches cats, but not catalogue.

SAP HANA Simple Info Access API

The SAP HANA simple info access (SINA ) API is a client-side JavaScript API for developing your own web-based search interfaces. Figure 9.2 shows a demo text search application created with the SINA toolkit.

Figure 9.2 SAP HANA Simple Info Access Demo Search Application

[»] Note

SAP HANA now includes an embedded search via the built-in sys.esh_search()procedure. With this procedure, you can use an existing ABAP SQL connection instead of an XS OData service for fuzzy text searches in SAP HANA from SAP business applications.

For more information on this new API, see the SAP HANA Search Developer Guide at http://help.sap.com/hana_options_adp.

Fuzzy Text Search in Attribute Views

To date, you can only use fuzzy text searches directly in attribute views; this is one of the few reasons that SAP HANA SPS 11 still uses attributes .

Figure 9.3 shows the Search Properties tab in which you can see an output column’s properties. Here, you can enable fuzzy text search for a column and with a particular Fuzziness Threshold.

Figure 9.3 Output Column Fuzzy Text Search Properties in an Attribute View

Text Analysis

Once you’ve created a full-text index with the TEXT ANALYSIS option enabled, you can analyze text with the SAP HANA text analysis feature, both via entity analysis and sentiment analysis .

Working with Text Analysis

You create text analysis via the full-text index. The abbreviated statement is as follows:

CREATE FULLTEXT INDEX INDEX_NAME ON TABLE_NAME ("COLUMN_NAME")
   TEXT ANALYSIS ON
      CONFIGURATION 'CONFIG_NAME';

You need the TEXT ANALYSIS ON clause and a CONFIGURATION NAME; there are a few built-in configurations, but you can also create your own industry-specific configuration files.

In the earlier example shown in Figure 9.1, we created the full-text index (called DEMO_TEXT_INDEX) on the DEMO_TEXT table. When you look at the table in the Catalog area, you’ll notice that SAP HANA has created an additional table: $TA_DEMO_TEXT_INDEX (see Figure 9.4). This is the name of the full-text index we created, prepended with $TA_.

Figure 9.4 Source Table (DEMO_TEXT) and $TA_ Text Analysis Results Table

You’d expect $TA_ to denote text analysis—and that’s exactly right. You get the text analysis request results back in table $TA.

[+] Hint

Table $TA is not the full-text index; it contains the results of the text analysis.

The original table included the following values in the SOME_TEXT column:

Laura likes apples. Laura dislikes chocolate.
Colin loves running.
Danie works for SAP.

Figure 9.5 shows the text analysis request results.

Figure 9.5 $TA_ Text AnalysisText analysisresults Results Table Content

Figure 9.5 $TA_ Text Analysis Results Table Content

The text analysis determined the following points:

It identified Laura, Colin, and Danie as people.
It classified SAP as an organization.
It identified apples, chocolate, and running as topics.
It rated Laura liking apples as a weak positive sentiment.
It rated Laura disliking chocolate as a weak negative sentiment.
It determined that Colin loving running is a strong positive sentiment.

With a different text analysis configuration file, we would have received different results.

Using the standard configuration files, you will be able to identify many of the attributes (master data) in your text data. Identifying measures is not always that easy, however.

SAP HANA Text Analysis XS JavaScript API

From SAP HANA SPS 10 on, the SAP HANA Text Analysis XS JavaScript API is available. With this API, you can call text analysis without creating a full-text index or reading the results in the $TA table.

Grammatical Role Analysis

SAP HANA SPS 11 makes the optional Grammatical Role Analysis (GRA) analyzer for English available. It identifies groups of three in the form of subject–verb–object expressions.

For an example, let’s look at a well-known phrase:

The [SUBJECT]quick brown fox[/SUBJECT] [VERB]jumps[/VERB] 
over the [DIRECTOBJECT]lazy dog[/DIRECTOBJECT]

Text Mining

Fuzzy searches and text analysis work with a word, sentence, or paragraph—but with text mining, you can work with entire documents. These documents can be web pages, Microsoft Word documents, or PDF files that are loaded into SAP HANA.

Text mining uses the full-text indexing and text analysis processes. When using text mining in SAP HANA, you can perform the following tasks:

Categorize documents
Find top-ranked terms related to a particular term
Analyze a document to find key phrases that describe the document
See a list of documents related to the one you ask about
Access documents that best match your search terms

You can call the text mining features from SQL functions or from web applications that can use the SAP HANA Text Mining XS JavaScript API .

[+] Tip

Remember that users will not work at the lower level that we have described in this chapter: They will want to use text features from a browser-based application interface. This is why SAP has focused on supplying various SAP HANA XS JavaScript APIs for SAP HANA modelers and developers.

Now that you know how to work with text data in SAP HANA and can use fault-tolerant text searches, text analysis, and text mining in SAP HANA modeling work, let’s move on to spatial processing.

Spatial

The next topic we’ll discuss is understanding spatial processing in SAP HANA.

[+] Tip

Geospatial processing is a subset of spatial processing related to geography. Spatial processing can include CAD models or even the human body and uses a three-dimensional coordinate system that differs from the geospatial coordinates of longitude, latitude, and height above sea level.

We first mentioned spatial processing in Chapter 4 when we discussed spatial joins. Now, let’s take a closer look at spatial joins in an SAP HANA system.

Spatial Data Types and Joins

Figure 9.6 shows two tables that contain spatial data. One table contains shop locations, and the other table contains city suburb shapes.

Figure 9.6 Two Tables Containing Spatial Data, with a Spatial Join

Here, we stored the shop locations in table GEO_SHOP, using the ST_POINT data type.

The following spatial data types are available:

ST_Geometry
This the supertype for all the other spatial data types; you can store objects like linestrings, polygons, and circles in this data type. Then you can use the TREAT expression to specify which data type you want to work with. For example, TREAT( geometry1 AS ST_Polygon ).ST_Area() will take a geometry, treat it like a polygon, and ask the size of the polygon’s area.

ST_Point
A point is a single location in space with X and Y coordinates at a minimum. It can also have a Z coordinate for three-dimensional spatial data. Points are often used to represent addresses, as in the shop example in Figure 9.6.
ST_MultiPoint
A multipoint is a collection of points. For example, it could represent all the different locations (addresses) where a company has shops.

ST_LineString
A linestring is a line, which can be used to represent rivers, railroads, and roads. In the shop example, it could represent a delivery route.

ST_MultiLineString
A multilinestring is a collection of linestrings. In the shop example, this collection could include all the deliveries that a shop has to make on a certain day for which the driver goes to different customers or suppliers.

ST_Polygon
A polygon defines an area. In the shop example, this is used to represent the suburbs in the second table (Figure 9.6; most suburbs do not have regular square shapes).
ST_MultiPolygon
A multipolygon is a collection of polygons; for example, the suburbs could be combined to indicate the area of the city or town.

ST_CircularString
A circularstring is a connected sequence of circular arc segments, much like a linestring with circular arcs between points.

You can join different data spatial types together in a spatial join . Figure 9.7 shows the details of a spatial join. Notice that the Spatial Properties tab is available for this join type. The Predicate dropdown list in this tab allows you to specify how to join the fields.

Figure 9.7 Spatial Join Details

In the shop example, we will use the Contains option, because we want to see which suburb contains which shop addresses (locations).

[+] Tip

Please don’t confuse the Contains option in the spatial joins with the CONTAINS predicate we used when executing a fuzzy text search.

Importing and Exporting Spatial Data

In SAP HANA, you can import and export spatial data in the Well-Known Text (WKT) and the Well-Known Binary (WKB) formats. These formats are part of the Open Geospatial Consortium (OGC) standards . You can also import and export the extended versions of these formats, which are used by PostGIS .

In addition, SAP HANA can import ESRI shapefiles and export spatial data in the GeoJSON and Scalable Vector Graphic (SVG) file formats. The SVG file format is maintained by World Wide Web Consortium (W3C) and supported by all modern browsers.

Spatial Functions

You saw an example of a spatial function earlier when we asked for the area of a polygon to be calculated—for example, geometry1.ST_Area().

Referring back to Figure 9.7, you can ask if a shop’s location is in a suburb as follows:

SUBURB_SHAPE.ST_CONTAINS(SHOP_LOCATION1)

You can also see how far two shops are from each other as follows:

SHOP_LOCATION1. ST_WITHINDISTANCE(SHOP_LOCATION2)

Additional Resources

To learn more about spatial processing in SAP HANA, look at the examples in the SHINE demo content. Here, you can find examples like the one in Figure 9.8, in which you can work with shops on a map on the left of the screen and see their details in the area on the right.

Figure 9.8 Spatial Data Used in a Business Application (SHINE Demo Content)

You can find more information about spatial data in the SAP HANA Spatial Reference Guide at http://help.sap.com/hana_options_spatial.

Predictive

Now, let’s discuss predictive analysis. We’ll look at the SAP HANA predictive analysis libraries and how to use them to create a predictive analysis model.

In this section, we’ll look at predicting future probabilities and trends and how you can use this functionality in business. In the real-world scenario at the beginning of the chapter, we described a use case in which you might create a predictive model in SAP HANA to help sales reps with their sales forecasting by looking at seasonal patterns ; for example, people use more pharmaceuticals for allergies in the spring.

There are many use cases for the predictive analysis libraries built into SAP HANA: For financial institutions such as banks, you can detect fraud by looking at abnormal transactions. For fleet management and car rental companies, you can predict which parts will break soon in a vehicle and replace them before they actually break.

Predictive Analysis Library

In SAP HANA, the predictive analysis functions are grouped together in the Predictive Analysis Library (PAL ). PAL defines functions that you can call from SQLScript and use to perform various predictive analysis algorithms. The algorithms were chosen by looking at what SAP HANA applications need, what people commonly use, and what the market requests.

In PAL, predictive analysis algorithms are grouped in nine categories; we’ll take a quick look at the nine categories and discuss a few popular algorithms in some of these categories. Figure 9.9 provides an overview. The categories are as follows:

Clustering
With clustering, you divide data into groups (clusters) of items that have similar properties. For example, you can group servers with similar usage characteristics, like email servers that all have a lot of network traffic on port 25 during the day. You might use the popular K-means algorithm for this purpose. Another example is computer vision, in which you cluster pictures of human faces using the affinity propagation algorithm. Clustering is sometimes also called segmentation . The focus of clustering is the algorithm figuring out by itself what is the best way to group existing data together.

Classification
The process of classification takes historical data and trains predictive models—using, for example, decision trees . You can then send new data to the predictive model you’ve trained, and the model will attempt to predict which class the new data belongs to. This process is often referred to as supervised learning . The focus here is on getting accurate predictions for unknown data.

Regression
Regression plots the best possible line or curve through the historical data. You can then use that line or curve to predict future values. This could be used by the previously mentioned car rental company. The company has had 5,000 of these vehicles before, and so it has good information available to predict when certain parts in a similar vehicle will break.

Association
For an example of an association, think of Amazon’s suggestions: People who bought this item also bought the following items. The apriori algorithm can be used to analyze shopping baskets. It can identify products that are commonly purchased together and group them together in the shop—for example, flashlights and batteries. You could even offer a discount on the flashlights because you know that people will need to buy batteries for them, and you can make enough profit on the batteries. (Notice however that the inverse is not necessarily true; a discount on the batteries will not encourage people to buy flashlights.)

Time series
For an example of a time series, think of the real-life scenario at the beginning of the chapter, in which we built a predictive model to forecast sales by looking at seasonal patterns. You might also see, for example, a monthly pattern in shops of higher sales after paydays.

Preprocessing
This category of algorithms is used to prepare data before further processing. These algorithms may improve the accuracy of the predictive models by reducing the noise . You can also identify outliers, or invalid and missing values.

Statistics
This category includes some statistical algorithms to find out how closely variables are related to—or independent they are from—each other. One example is the Grubbs’ test algorithm, which is used to detect outliers . This could be used in the fraud detection use case we mentioned earlier, in which certain transactions are different from the others. (Often, we first use a clustering algorithm to group transactions together, and then detect outliers for each group/cluster after.)

Social network analysis
Currently, the only algorithm in this category is the link prediction algorithm , used to predict the likelihood of a future association between two people who have no association yet. This is used by companies like Facebook and LinkedIn.

Miscellaneous
Two algorithms fall under this category: The ABC analysis algorithm can be used to identify which 20 percent of our customers account for 80 percent of the revenue, and weighted score tables can be used to help you decide what to buy when some criteria are more important than others—for example, if you want to buy an SAP HANA server. The speed of the memory weighs more heavily in that decision than disk speeds.

Figure 9.9 shows the PAL algorithms, per category, that are available in SAP HANA SPS 11.

Figure 9.9 Predictive Functions Available in SAP HANA SPS 11, Grouped by Category

You can find information about each of the PAL predictive functions in the SAP HANA Predictive Analysis Library (PAL) Reference Guide at http://help.sap.com/hana_platform/.

Installing PAL

PAL is part of the Application Function Library (AFL) that ships with all SAP HANA systems. AFL is an optional component; if you want to use the predictive analysis algorithms in PAL, you have to install AFL via SAP HANA Application Lifecycle Management . AFL also includes a Business Function Library (BFL ).

After the installation, you have to ensure that the script server is started and make sure you have the AFL_SYS_AFL_AFLPAL_EXECUTE role assigned to your SAP HANA user. (Roles, and how to assign them, are discussed in Chapter 13.)

You can check if AFL is installed on a system by opening the AFL_FUNCTIONS view in the SYS schema. If you see lots of entries in this view, then AFL has been installed on that system.

Creating a Predictive Model

Once PAL is installed, you can create predictive models graphically via the application function modeler (AFM ). In Chapter 5, we discussed the steps to create a flowgraph model:

Create a .hdbflowgraph file in your project in the developer perspective.
Give your predictive model a name, and choose Flowgraph for Activation as Stored Procedure.

[»] Note

You can also create a predictive model using SQLScript, which is the traditional way to build such models. Building the models graphically using the AFM only became available in the last few releases of SAP HANA.

If you find .aflpmml files, which are AFL models, you will have to convert these to AFM flowgraphs. SAP HANA SPS 11 can still run these files, but you cannot edit them in the AFM.

In the AFM, build a flowgraph model. Start on the left side and work towards the right side, as shown in Figure 9.10. First, drag a table with data into the left side of the work area. Next, select a PAL function from the palette on the right and drag it to the work area. Then, connect the table to the PAL function; the PAL functions are grouped into the categories that we discussed earlier.

In Figure 9.10, we started building the end-to-end scenario to predict segmentation of new customers for a supermarket, as described in the PAL reference guide. The guide describes how to do this by using SQLScript, but it’s a good exercise to try to do it via a graphical flowgraph model .

Figure 9.10 Creating a Flowgraph Model in the AFM Tool

Sometimes, you might want to train your model, and then use it later to make predictions. Figure 9.11 shows such a model: It starts with some training data that you feed into a C4.5 decision tree. The data is used to train the classification model. You then feed the trained classification model into a prediction function and also give the prediction function some new data. The prediction function applies the trained model to the new data and provides the predicted results in an output table.

Figure 9.11 Predictive Model Created in AFM

SAP HANA automatically converts the graphical flowgraph model into a SQLScript procedure, and you can then call this procedure when you want to use the predictive model as part of your other SAP HANA information models.