SAP HANA offers a function for analyzing unstructured data. By leveraging this capability, you can considerably improve the user friendliness of search scenarios within business applications. In addition, you can gain further insight by recognizing patterns in existing datasets.
10 Text Search and Analysis of Unstructured Data
Hardly any other functionality has experienced as great a boost from the Internet in recent years as the search within large datasets—irrespective of whether you search through a product catalog, the telephone book, or the entire Internet. This chapter introduces options provided by SAP HANA to search and analyze texts and documents. These options open up many ways to employ the SAP HANA platform, particularly in business applications, which haven’t been extensively equipped with these kinds of functions until now.
Input helps represent a simple usage scenario for text searches in SAP HANA. SAP applications contain input helps in many different places. When using input helps, users sometimes search for an entry in a large dataset without knowing the details of the entry, or at least without having these details at hand. For example, you may be searching for a specific customer in Argentina who is based in Buenos Aires and works in the telecommunications industry. Because the customer number isn’t available to you, you enter information such as the company name, country, location, and industry into a complex input template. You often have to enter this data several times using wildcards such as the asterisk (*). In addition, if you mistype an entry or the data is stored in a different way in the database (e.g., if the name of the location was entered using the country-specific spelling), you usually won’t obtain any results.
The text search function in SAP HANA allows you to develop search helps that work similarly to modern Internet searches. They provide a certain error tolerance and are able to process multilingual terms and synonyms. In the preceding example, such a search help might consist of an input field that correctly interprets a user request such as “buenes eires tele”, despite the incorrect spelling and the search via multiple columns. However, users can’t always easily determine whether the returned result is the expected one in this type of error-tolerant search, also referred to as a fuzzy search. Have you ever asked yourself why you sometimes obtain unexpected results when performing a search on the Internet?
The recognition of patterns in texts and documents represents an entirely different kind of text-analysis function. You can use this feature in many different scenarios, some of which are presented in the following sections. For example, to avoid having duplicate business partners in your master data, you may want to check in the system whether a similar client already exists in the dataset prior to creating a new client, and, if so, notify the application user about it. In this context, being “similar” might mean that the last name and address of an existing and new client are almost identical. As it often happens that names and addresses in particular are entered with different kinds of spellings, a simple check for identical entries rarely returns satisfactory results.
The text analysis function in SAP HANA not only allows you to run searches within texts but also to extract additional information from the texts. For example, you can recognize relationships and even intentions or emotions within texts. Let’s suppose you run a web store that enables clients to order products online as well as to post comments about the products and the vendor. The sentiment analysis is part of the text engine functionality in SAP HANA and enables you to recognize patterns in these types of unstructured data. In the context of the online store, for instance, it would allow you to analyze whether a specific product evokes more positive or negative comments.
This chapter begins by introducing some of the basic technical principles and prerequisites for using the text search in SAP HANA. This is followed by a description of how to call the function using SQL and how to use it in ABAP, with a special focus on embedding the text search function in input helps. In addition to using the text search function directly, you’ll learn about several existing SAP components that support the implementation of complex searches. Moreover, the chapter contains practical examples of pattern recognition within texts. Finally, you’ll become acquainted with nonfunctional aspects such as resource consumption, performance, and error analysis.
The practical examples will be used to implement search runs across airline names (table SCARR), flight schedule data (airports and locations from tables SPFLI and SAIRPORT), and the flight passenger address data (name, address, town, and country from table SCUSTOM).
10.1 Basic Principles of the Text Search in SAP HANA
The main purpose of the text search function in SAP HANA is to make search interfaces as easy to use as possible. In addition to various features common in Internet search engines, this includes functions with special significance for business applications, such as industry-specific lists of synonyms.
This involves the following characteristics that are usually deployed in combination:
- Freestyle search
  The user doesn’t need to know the exact database columns in which the search is supposed to be carried out. For example, you can implement an address search across a single input field and include all technical characteristics such as street name, ZIP code, town, country, and so on.
- Error-tolerant search (fuzzy search)
  The user may vary the spelling slightly in his search requests.
- Linguistic search and synonym search
  Linguistic variants and synonymous terms are included.
- Value suggestions
  The system efficiently identifies probable search results while the user is typing and presents these to the user in real time.
- Results ranking
  The sequence of the search results is optimized so that results with the highest probability rate are presented at the top of the list.
- Search facets
  The search results are counted and grouped according to specific criteria. For example, when searching for airlines, you can view the distribution of the airlines per country.
- Text analysis (particularly sentiment analysis)
  Additional information is extracted from texts, which allows you to gain insights on semantical aspects.
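To make two of these characteristics more concrete, the following Python sketch shows results ranking and search facets applied to a handful of airline hits. It is only an illustration of the idea, not SAP HANA code: the `Hit` type, the function names, and the sample scores are all made up for this example.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    airline: str
    country: str
    score: float  # similarity between 0 and 1 (hypothetical values)


def rank(hits: list[Hit]) -> list[Hit]:
    """Results ranking: the most similar hits come first."""
    return sorted(hits, key=lambda hit: hit.score, reverse=True)


def facets(hits: list[Hit]) -> Counter:
    """Search facets: count the hits per country."""
    return Counter(hit.country for hit in hits)


# Made-up hits for illustration only.
hits = [
    Hit("United Airlines", "US", 0.62),
    Hit("Lufthansa", "DE", 0.95),
    Hit("American Airlines", "US", 0.71),
]

for hit in rank(hits):
    print(f"{hit.score:.2f}  {hit.airline}")
print(dict(facets(hits)))
```

In a real application, both the ranking and the grouping would typically be pushed down to the database rather than computed in the application layer.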
10.1.1 Technical Architecture
The following sections describe how you can use the text search and text analysis functions. To provide you with an idea of which components are involved in SAP HANA, Figure 10.1 shows the architecture of the text search functionality. The column store supports the data types and operations that are required for the search, which are described in further detail in Section 10.2 and Section 10.3. To perform complex text analyses and to extract information, the column store draws on the preprocessor server. In this context, the system uses the Document Analysis Toolkit.
Figure 10.1 Architecture of the Text Search Function in SAP HANA
Section 10.1.3 provides further details on other text search components.
[»] Heritage of the Fuzzy Search Component in SAP HANA
The fuzzy search function in SAP HANA represents the advanced development of a data-quality analysis solution initially developed by the German company Fuzzy Informatik AG. SAP adopted this solution indirectly through the acquisition of BusinessObjects. In addition to the genuine fuzzy search, this solution is particularly useful for recognizing duplicates, especially in sets of address data.
10.1.2 Error-Tolerant Search
The error-tolerant or fuzzy search involves the search for character strings (i.e., the search request) in text-based data, where the data doesn’t have to correspond exactly to the search request; this way, sufficiently similar entries are also included in the result set. This section provides an overview of the techniques used for the fuzzy search in SAP HANA.
Mathematical algorithms that form the basis of the fuzzy search determine the degree to which a data record corresponds to the search request. The result of the calculation is often a numerical value used to decide whether a data record is sufficiently similar to the search request. With regard to texts, the simplest type of such an algorithm consists of determining the minimum number of operations (such as replacing and moving characters) that are required to generate a section of the actual data record from the search request. In practice, it’s very complicated to determine the degree of similarity between texts, and it involves using variants and heuristics that all have their pros and cons depending on the scenario in which they are used.
The text search function in SAP HANA determines a value between 0 and 1 that marks the degree of similarity. As a programmer, you must define a threshold value (e.g., 0.8) from which a value of the dataset that has been searched is categorized as matching the search request.
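The idea of a similarity value with a threshold can be sketched in a few lines of Python. The example below uses the classic Levenshtein edit distance (insertions, deletions, and substitutions) and normalizes it to a score between 0 and 1. This is an assumption chosen for illustration; the algorithms SAP HANA actually uses are more elaborate, and the function names here are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(request: str, value: str) -> float:
    """Map the edit distance to a score between 0 (different) and 1 (identical)."""
    request, value = request.lower(), value.lower()
    longest = max(len(request), len(value))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(request, value) / longest


def fuzzy_matches(request: str, values: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only the values whose score reaches the threshold."""
    return [value for value in values if similarity(request, value) >= threshold]


airlines = ["Lufthansa", "United Airlines", "Qantas"]
print(fuzzy_matches("lufthanza", airlines))
```

With a threshold of 0.8, the misspelled request "lufthanza" still matches "Lufthansa" (one substitution in nine characters), while the other airline names fall well below the threshold.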
In addition, the functionality of the fuzzy search can be adapted for specific (semantic) data types. For example, the fuzzy search for a date can include date values that are several days before or after the specific date being searched. In this case, the similarity criterion is the period rather than the similarity of the character string (so, according to this criterion, the date 01/01/1909 isn’t similar to 01/01/1990, although the position of only one character has been changed).
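A minimal Python sketch of such a date-specific similarity criterion might look as follows. The linear scoring and the `max_days` parameter are assumptions made for this example, not the formula used by SAP HANA.

```python
from datetime import date


def date_similarity(requested: date, stored: date, max_days: int = 7) -> float:
    """Score dates by their distance in days rather than by their characters.

    Identical dates score 1.0; the score falls linearly and is 0.0 for
    dates more than max_days away from the requested date.
    """
    distance = abs((stored - requested).days)
    if distance > max_days:
        return 0.0
    return 1.0 - distance / (max_days + 1)


requested = date(1990, 1, 1)
print(date_similarity(requested, date(1990, 1, 3)))  # two days apart
print(date_similarity(requested, date(1909, 1, 1)))  # decades apart
```

Although "1909" and "1990" look alike as character strings, the second call yields no similarity because the two dates are decades apart.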
Another example involves the search for a town on the basis of a ZIP code. In most countries, ZIP codes are structured in such a way that a similarity of the code’s first digits tells more about geographical proximity than a similarity of the last digits.
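One simple way to express this in code is to weight each digit position, so that a mismatch at the start of the ZIP code costs more than a mismatch at the end. The weighting scheme below is purely an assumption for illustration, and the function name is hypothetical.

```python
def zip_similarity(requested: str, stored: str) -> float:
    """Position-weighted similarity: leading digits carry more weight."""
    length = max(len(requested), len(stored))
    if length == 0:
        return 1.0
    # Weights length, length - 1, ..., 1, so the first digit counts most.
    weights = range(length, 0, -1)
    matched = sum(
        weight
        for weight, digit_a, digit_b in zip(weights, requested, stored)
        if digit_a == digit_b
    )
    return matched / sum(weights)


print(zip_similarity("69190", "69191"))  # differs in the last digit
print(zip_similarity("69190", "79190"))  # differs in the first digit
```

Both candidates differ from the request in exactly one digit, but the one that deviates in the last position receives the higher score.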
When running a fuzzy search, you can use a set of simple expressions that enable an expert to formulate more precise search requests. For example, this includes the option to enforce an exact search for a specific portion of the search request or to use logical expressions. Table 10.1 contains some sample expressions of the SAP HANA text search based on the example of an airline search.
Search Request | Explanation
---|---
lufthansa OR united | Results that are similar either to “Lufthansa” or to “United”.
airline -united | Results that are similar to “airline,” but not to “united”.
“south air” | Results that are similar to the entire expression, “south air”, and not only to its components, “south” and “air”. In this example, “South African Airways” isn’t returned as a result.
Table 10.1 Using Expressions in the SAP HANA Text Search
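The behavior described in the table (alternatives, excluded terms, and whole phrases) can be mimicked with a small Python sketch. It makes several simplifying assumptions: alternatives are separated by an uppercase `OR`, excluded terms carry a leading minus sign, phrases are enclosed in straight double quotes, and terms are compared by plain substring containment instead of a fuzzy score. It is not the SAP HANA parser.

```python
import shlex


def term_matches(term: str, text: str) -> bool:
    """A leading minus excludes the term; otherwise the term must occur."""
    if term.startswith("-"):
        return term[1:] not in text
    return term in text


def matches(request: str, text: str) -> bool:
    """Evaluate a request consisting of terms, -terms, "phrases", and OR."""
    text = text.lower()
    alternatives: list[list[str]] = [[]]
    # shlex.split keeps a quoted phrase such as "south air" as one token.
    for token in shlex.split(request):
        if token == "OR":
            alternatives.append([])
        else:
            alternatives[-1].append(token.lower())
    return any(
        bool(terms) and all(term_matches(term, text) for term in terms)
        for terms in alternatives
    )


airlines = ["Lufthansa", "United Airlines", "South African Airways", "American Airlines"]
for request in ("lufthansa OR united", "airline -united", '"south air"'):
    print(request, "->", [name for name in airlines if matches(request, name)])
```

With this sample data, the quoted phrase returns no hits at all, because "South African Airways" contains "south" and "air" but not the contiguous expression "south air".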
To determine the degree of similarity, it’s also useful to include grammatical and other linguistic aspects. In this context, terms are reverted to their word stem so that word variants such as “house,” “houses,” “housing,” and so on, are recognized. In addition, the linguistic search provides opportunities for handling multilingual texts and search requests.
The fuzzy search can also be extended by lists of synonyms. In this context, you can store a list of terms that are equivalent to a specific term; the search request can then draw upon this list. For example, “notebook” might be regarded as a synonym of “laptop,” or “monitor” as a synonym of “screen.” This feature is particularly useful for industry-specific abbreviations and concepts.
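The effect of a synonym list can be illustrated with a short Python sketch in which the search request is expanded by all stored synonyms before the comparison. The dictionary layout and function names are assumptions for this example; SAP HANA maintains its synonym lists in its own format.

```python
# Hypothetical synonym list, stored in both directions.
SYNONYMS = {
    "notebook": {"laptop"},
    "laptop": {"notebook"},
    "monitor": {"screen"},
    "screen": {"monitor"},
}


def expand(request: str) -> set[str]:
    """Return the request terms together with all of their synonyms."""
    terms: set[str] = set()
    for word in request.lower().split():
        terms.add(word)
        terms |= SYNONYMS.get(word, set())
    return terms


def search(request: str, documents: list[str]) -> list[str]:
    """Return the documents that share at least one word with the expanded request."""
    terms = expand(request)
    return [doc for doc in documents if terms & set(doc.lower().split())]


products = ["Laptop bag", "Monitor stand", "Notebook sleeve"]
print(search("notebook", products))
```

A request for "notebook" now also finds the "Laptop bag", even though the two strings share no characters that a purely character-based fuzzy comparison could pick up.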
Another option to implement a more intelligent search is to familiarize the system with semantic characteristics of specific terms. In this context, it’s important to know that not every term in a search request has the same selectivity. For example, terms such as “Inc.” or “LLC” aren’t as selective as the actual company name when you search for a specific company. It’s therefore usually more important to enter a company name similar to the one you’re searching for than to enter that the search result is an “Inc.,” for example.
Likewise, in longer texts such as product descriptions, similarities in certain parts of speech such as articles or pronouns are less important than similarities in names within the text (e.g., in brand names). When you run a search request in SAP HANA, you can enter a list of so-called stop words (also referred to as noise words) that are considered less important than other words.
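How stop words shift the score can be sketched in Python by giving them a much lower weight than regular terms. The word list, the weight of 0.1, and the exact-word comparison are assumptions chosen for illustration only.

```python
# Hypothetical stop word list and weight.
STOP_WORDS = {"inc.", "llc", "the", "a", "of"}
STOP_WORD_WEIGHT = 0.1


def weighted_score(request: str, candidate: str) -> float:
    """Share of the request's total term weight that is found in the candidate."""
    candidate_terms = set(candidate.lower().split())
    total = 0.0
    matched = 0.0
    for term in request.lower().split():
        weight = STOP_WORD_WEIGHT if term in STOP_WORDS else 1.0
        total += weight
        if term in candidate_terms:
            matched += weight
    return matched / total if total else 0.0


print(weighted_score("Acme Inc.", "Acme LLC"))   # company name matches
print(weighted_score("Acme Inc.", "Apex Inc."))  # only the stop word matches
```

The candidate that shares the company name scores far higher than the candidate that shares only the legal form, which reflects the different selectivity of the two terms.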
Because the text search function is based on a number of rather complex algorithms, it may be necessary to create specific fuzzy search indexes to accelerate the search runs and thus optimize the system performance, particularly if large amounts of data are involved. However, these indexes require additional memory. Section 10.6 provides some recommendations on how to use them.
10.1.3 SAP Components and Products for Search
In Section 10.3, you’ll learn in detail how to access the search features of SAP HANA directly through SQL. In addition, SAP provides specific components and frameworks that support you in the creation of search runs, but because these aren’t the focus of this book, they are mentioned only briefly in the following paragraphs.
Since release 7.0, SAP NetWeaver AS ABAP has included Embedded Search. This component allows users to extract data for indexing via the TREX Search and Classification Engine, an SAP NetWeaver component that can be installed separately (standalone engine). Embedded Search provides interfaces that enable a more efficient search within the extracted data of an application.
However, Embedded Search is limited to searches within an SAP system. To run searches across different systems (e.g., in an application portal), you can use the SAP NetWeaver Enterprise Search solution. This is based on the capabilities of the local Embedded Search functionality in integrated systems.
Because SAP HANA supports most of the functions of the TREX engine, you can use these functions directly in SAP HANA and without a separate TREX installation. This means you can use existing Embedded Search models in SAP HANA, while, by default, the data continues to be extracted and replicated within SAP HANA. SAP currently plans to enable direct searches in tables via Embedded Search in SAP HANA without requiring the data to be replicated.
Since SPS 5, SAP HANA also provides the UI Toolkit for Information Access (InA), which allows you to create simple HTML5-based search interfaces. Based on attribute views, you can use HTML and JavaScript as well as the UI templates contained in InA to build a simple search application according to the modular design principle. This application employs SAP HANA Extended Application Services (SAP HANA XS).