SAP PRESS :: ABAP® Development for SAP HANA® - Text Search and Analysis of Unstructured Data

10.6Resource Consumption and Runtime Aspects of the Text Search

In this chapter, we’ve discussed the basic architecture and use of the text search and text analysis functions in SAP HANA. You’ve learned that the column store contains specific data types (TEXT and SHORTTEXT) that provide powerful functions for searching and analyzing unstructured data. For ABAP text types, you can use a full text index to create a virtual column of the TEXT type. If you employ an additional fuzzy search index, you also can accelerate a fuzzy search run.

This section provides essential background information on the functionality of the text data types, as well as recommendations concerning the use of indexes. In particular, you’ll learn how to use SQL queries to analyze the memory consumption.

Depending on the configuration, special dictionaries are created for text data types and full text indexes. These dictionaries store the fragmentation into tokens and linguistic information (e.g., word stems) in an efficient way. This process involves the same mechanisms and memory structures as other functions of the column store. If you want to learn more about the technical details of building and accessing such dictionary structures, you can find additional information in Appendix C.

As described in Section 10.2, texts are fragmented into words (tokens), and then normalized and stored in the dictionary vector of the column (word dictionary). In addition to this, word stems can optionally be stored in a second dictionary, where inflected verbs, for example, are reverted to their basic form, or umlauts are replaced. All this information isn’t persisted on the disk but generated only when the table is loaded into the main memory.

You can use additional memory structures (optional) to accelerate text searches further; however, this will have an impact on the required memory size. Currently, two options are available: an additional fuzzy search index, or an increase of the phrase-index ratio . Using a fuzzy search index means that certain data are precalculated instead of being determined at the start of a search request. Additionally, in the phrase index , frequently occurring word constellations (phrases) are stored in a separate dictionary. The higher the specified phrase-index ratio value, the more storage space is reserved in relation to the actual memory consumption of the column (currently, the default value of this ratio is 0.2, i.e., 1:5).

As you can see, many setting options are available. Using the search and analysis options described here will increase the memory requirements for the required columns, and usually ABAP-based text data require twice as much memory space. Therefore, it’s advisable to use the default settings first and to employ additional tuning options—such as fuzzy search indexes or changing the phrase-index ratio—only after you encounter performance problems.

For you to get a better picture of this, we’ll now describe how you can use monitoring views via the SQL console to obtain detailed information on the indexes and system memory consumption.

The FULLTEXT_INDEXES view in SAP HANA enables you to view the configuration of all full text indexes in the system. Figure 10.19 shows the full text indexes for the flight data model tables created in the preceding sections, as well as some other predefined indexes in the SAP HANA repository.

Figure 10.19FULLTEXT_INDEXES Monitoring View

Moreover, you can query the memory consumption of the special fuzzy search indexes separately using the M_FUZZY_SEARCH_INDEXES monitoring view. The memory consumption depends on various factors but predominantly on the number of different values within the column. The following SQL statement allows you to query the current memory consumption of all data structures available for the fuzzy search in the system:

SELECT * FROM m_heap_memory
WHERE category LIKE '%FuzzySearch%'

To conclude this chapter, we’ll briefly discuss the topic of write operations, especially in the context of tables, which are both frequently modified and can be used for text searches and analyses. Full text indexes can be updated synchronously and asynchronously. If an index is updated synchronously, write operations take slightly longer because the creation of dictionary and index structures is part of the write operation. Usually, the effects should be minimal with small data types (e.g., with character strings of a fixed length). For larger documents that are stored as large objects in the database (e.g., STRING), an asynchronous update can be advantageous.

In addition, when write operations are carried out in the column store, the data is first stored in the delta store and is automatically integrated into the main store only at specific merge times (see also Appendix C). The bigger the delta store gets in this process, the more costly the merging of results in SQL queries. This can significantly impact the system runtime, particularly in complex operations such as those described in this chapter. If you want to run text analyses across large datasets that are carried out asynchronously at fixed points in time, for example, it makes sense to implement delta merge manually, which is supposed to be executed on the relevant tables prior to the text analyses. For this purpose, you could, for example, use the MERGE DELTA OF <Table> SQL statement.