Advanced Strings

This chapter will cover more advanced string algorithms than the previous chapters have discussed. They should be easier to understand now that you have learned about some other data structures. Specifically, this chapter will focus on string searching algorithms.

Trie (Prefix Tree)

A trie is special type of tree used commonly for searching strings and matching on stored strings. At each level, nodes can branch off to form complete words. For example, Figure 18-1 shows a trie of the words: Sammie, Simran, Sia, Sam. Each ending node has a boolean flag: isCompleted. This indicates that the word ends in this path. For example, m in Sam has endOfWord set to true. Nodes with endOfWord set to true are shaded in Figure 18-1.

../images/465726_1_En_18_Chapter/465726_1_En_18_Fig1_HTML.jpg — Figure 18-1
Trie of Sammie, Simran, Sia, Sam

A trie is implemented using a nested object where each level has its direct children as keys. A trie node can be formed by using an object to store the children. The trie has a root node that is instantiated in the constructor of the Trie class, as shown in the following code block:

1 function TrieNode() {

2 this.children = {}; // table

3 this.endOfWord = false;

4 }

6 function Trie() {

7 this.root = new TrieNode();

8 }

To insert into the trie, the child trie node is created on the root if it does not exist already. For each character in the word being inserted, it creates a child node if the character does not exist, as shown in the following code block:

1 Trie.prototype.insert = function(word) {

2 var current = this.root;

3 for (var i = 0; i < word.length; i++) {

4 var ch = word.charAt(i);

5 var node = current.children[ch];

6 if (node == null) {

7 node = new TrieNode();

8 current.children[ch] = node;

9 }

10 current = node;

11 }

12 current.endOfWord = true; //mark the current nodes endOfWord as true

13 }

To search inside a trie, each character of the word must be checked. This is done by setting a temporary variable of current on the root. The current variable is updated as each character in the word is checked.

1 Trie.prototype.search = function(word) {

2 var current = this.root;

3 for (var i = 0; i < word.length; i++) {

4 var ch = word.charAt(i);

5 var node = current.children[ch];

6 if (node == null) {

7 return false; // node doesn't exist

8 }

9 current = node;

10 }

11 return current.endOfWord;

12 }

13 var trie = new Trie();

14 trie.insert("sammie");

15 trie.insert("simran");

16 trie.search("simran"); // true

17 trie.search("fake") // false

18 trie.search("sam") // false

To delete an element from a trie, the algorithm should traverse the root node until it reaches the last character of the word. Then, for each node that does not have any other children, the node should be deleted. For example, in a trie with sam and sim, when sim is deleted, the s node in the root stays intact, but i and m are removed. The recursive implementation in the following code block implements this algorithm:

1 Trie.prototype.delete = function(word) {

2 this.deleteRecursively(this.root, word, 0);

3 }

5 Trie.prototype.deleteRecursively = function(current, word, index) {

6 if (index == word.length) {

7 //when end of word is reached only delete if currrent.end Of Word is true.

8 if (!current.endOfWord) {

9 return false;

10 }

11 current.endOfWord = false;

12 //if current has no other mapping then return true

13 return Object.keys(current.children).length == 0;

14 }

15 var ch = word.charAt(index),

16 node = current.children[ch];

17 if (node == null) {

18 return false;

19 }

20 var shouldDeleteCurrentNode = this.deleteRecursively(node, word, index + 1);

22 // if true is returned then

23 // delete the mapping of character and trienode reference from map.

24 if (shouldDeleteCurrentNode) {

25 delete current.children[ch];

26 //return true if no mappings are left in the map.

27 return Object.keys(current.children).length == 0;

28 }

29 return false;

30 }

31 var trie1 = new Trie();

32 trie1.insert("sammie");

33 trie1.insert("simran");

34 trie1.search("simran"); // true

35 trie1.delete("sammie");

36 trie1.delete("simran");

37 trie1.search("sammie"); // false

38 trie1.search("simran"); // false

Time Complexity: O(W )

Space Complexity: O(N*M)

Time complexity is O(W) for all operations (insert, search, delete), where W is the length of the string being searched because each character in the string is checked. The space complexity is O(N*M), where N is the number of words inserted into the trie and M is the length of the longest character. Hence, a trie is an efficient data structure when there are multiple strings with common prefixes. For searching one specific string pattern in one specific string, a trie is not efficient because of the additional memory required to store the strings in the tree-like structure.

For a pattern search in a single target string, the Boyer–Moore algorithm and the Knuth–Morris–Pratt (KMP) algorithm are useful and are covered later in this chapter.

Boyer–Moore String Search

The Boyer–Moore string search algorithm is used to power the “find” tool used in text editor applications and web browsers, like the one in Figure 18-2.

../images/465726_1_En_18_Chapter/465726_1_En_18_Fig2_HTML.jpg — Figure 18-2
Find tool commonly seen in many applications

The Boyer–Moore string search algorithm allows linear time in search by skipping indices when searching inside a string for a pattern. For example, for the pattern jam and the string jellyjam, visualization of brute-force comparison is shown in Figure 18-3. It should be noted that in the fourth iteration when j is compared with m, since j is shown in the pattern, skipping ahead by 2 would be valid. Figure 18-4 shows an optimized iteration cycle where the number of string comparisons is limited by skipping ahead when the string at the index exists in the pattern.

../images/465726_1_En_18_Chapter/465726_1_En_18_Fig3_HTML.jpg — Figure 18-3
Brute-force pattern match iterations

../images/465726_1_En_18_Chapter/465726_1_En_18_Fig4_HTML.jpg — Figure 18-4
Boyer–Moore Skipping Indices

To implement this skip rule, you can build a “bad match table” structure. The bad match table indicates how much to skip for a given character of a pattern. Some examples of various patterns and its corresponding bad match table are shown here:

Pattern	Bad Match Table
jam	{j: 2, a: 1, m: 3}
data	{d: 3, a: 2, t: 1}
struct	{s: 5, t: 4, r: 3, u: 2, c: 1}
roi	{r: 2, o: 1, i: 3}

For the roi example, r:2 indicates that if r is not found in the string, the index should skip by 2. This bad match table can be implemented with the following code block:

function buildBadMatchTable(str) {

var tableObj = {},

strLength = str.length;

for (var i = 0; i < strLength - 1; i++) {

tableObj[str[i]] = strLength - 1 - i;

}

if (tableObj[str[strLength-1]] == undefined) {

tableObj[str[strLength-1]] = strLength;

}

return tableObj;

}

buildBadMatchTable('data'); // {d: 3, a: 2, t: 1}

buildBadMatchTable('struct'); // {s: 5, t: 4, r: 3, u: 2, c: 1}

buildBadMatchTable('roi'); // {r: 2, o: 1, i: 3}

buildBadMatchTable('jam'); // {j: 2, a: 1, m: 3}

Using this bad match table, the Boyer–Moore string search algorithm can be implemented. When scanning the input string for the pattern, if the current string being looked at exists in the bad match table, it skips over by the bad match table value associated with the current string. Otherwise, it is incremented by 1. This continues until either the string is found or the index is greater than the difference of the pattern and string lengths. This is implemented in the following code block:

function boyerMoore(str, pattern) {

var badMatchTable = buildBadMatchTable(pattern),

offset = 0,

patternLastIndex = pattern.length - 1,

scanIndex = patternLastIndex,

maxOffset = str.length - pattern.length;

// if the offset is bigger than maxOffset, cannot be found

while (offset <= maxOffset) {

scanIndex = 0;

while (pattern[scanIndex] == str[scanIndex + offset]) {

if (scanIndex == patternLastIndex) {

// found at this index

return offset;

}

scanIndex++;

}

var badMatchString = str[offset + patternLastIndex];

if (badMatchTable[badMatchString]) {

// increase the offset if it exists

offset += badMatchTable[badMatchString]

} else {

offset += 1;

}

return -1;

}

boyerMoore('jellyjam','jelly'); // 5. indicates that the pattern starts at index 5

boyerMoore('jellyjam','jelly'); // 0. indicates that the pattern starts at index 0

boyerMoore('jellyjam','sam'); // -1. indicates that the pattern does not exist

Best Case:

In the best case, all the characters in the pattern are the same, and this consistently produces shifts by T, where T is the length of the pattern. Hence, O(W/T) is the best time complexity where W is the string where the pattern is being searched. The space complexity is O(1) since only 1 value is stored into the bad match table.

Time Complexity: O(T/W)

Space Complexity: O(1)

Worst Case:

In the worst case, the string has the pattern at the end of the string, and the preceding part is all unique characters. An example of this is a string of abcdefgxyz and pattern of xyz. In this case, T*W string comparisons are done.

Time Complexity: O(T*W)

Space Complexity: O(T)

All the characters in the pattern and the string are the same. An example of such a case is the string bbbbbb and the pattern bbb. This case cannot use the skip mechanism to its fullest because the index will always be incremented by 1. Space complexity in this case is T because the pattern could have all unique characters.

Knuth–Morris–Pratt String Search

Chapter 4 discussed the native String.prototype.indexOf function. A naive implementation of the String.prototype.indexOf function was included as an exercise for that chapter. A better (faster) implementation uses the Knuth–Morris–Pratt (KMP) string search algorithm. The following implementation of the KMP algorithm returns all indices where the pattern is present.

The KMP string searching algorithm searches for occurrences of the “word” W within an input “text,” which is T, by utilizing the observation that the occurrence of a mismatch contains sufficient information about where the next match could begin. This helps to skip re-examination of previously matched characters. A prefix array has to be built to indicate how many indices it has to backtrack to get the same prefix. For the string ababaca, the prefix building looks like the following:

At current index 0, there is no string to compare to, and the prefix array value is initialized to 0.

array index 0 1 2 3 4 5 6
character a b a b a c a
prefix array 0

At current index 1:

The character is b.
The previous prefix array value, prefix[0], is 0.

Compare index 0 to the current index: a (at index = 0) and b (at index = 1) mismatch.

prefix[1] is set to 0:

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0

At current index 2:

The character is a.
The previous prefix array value, prefix[1], is 0.

Compare the index and to the current index: a (at index = 0) and a (at index = 2) match.

prefix[2] is set to 1 (incremented from prefix[1]):

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0 1

At current index 3:

The character is b.
The previous prefix array value, prefix[2], is 1.

Compare index 1 and the current index: b (at index = 1) and b (at index = 3) match.

prefix[3] is set to 2 (incremented from prefix[2]):

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0 1 2

At current index 4:

The character is a.
The previous prefix array value, prefix[3], is 2.

Compare index 2 and the current index: a (at index = 2) and a (at index = 4) match.

prefix[4] is set to 3 (incremented from prefix[3]):

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0 1 2 3

At current index 5:

The character is c.
The previous prefix array value, prefix[4], is 3.

Compare index 3 and the current index: b (at index = 3) and c (at index = 5) mismatch.

prefix[5] is set to 0:

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0 1 2 3 0

At current index 6:

The character is c.
The previous prefix array value, prefix[5], is 0.

Compare from index 0 and current index: a (at index = 0) and a (at index = 5) match.

prefix[6] is set to 1 (incremented from prefix[5]):

Array index 0 1 2 3 4 5 6
Character a b a b a c a
Prefix array 0 0 1 2 3 0 1

The function in the following code block illustrates this algorithm to build a prefix table:

function longestPrefix(str) {

// prefix array is created

var prefix = new Array(str.length);

var maxPrefix = 0;

// start the prefix at 0

prefix[0] = 0;

for (var i = 1; i < str.length; i++) {

// decrement the prefix value as long as there are mismatches

while (str.charAt(i) !== str.charAt(maxPrefix) && maxPrefix > 0) {

maxPrefix = prefix[maxPrefix - 1];

}

// strings match, can update it

if (str.charAt(maxPrefix) === str.charAt(i)) {

maxPrefix++;

}

// set the prefix

prefix[i] = maxPrefix;

}

return prefix;

}

console.log(longestPrefix('ababaca')); // [0, 0, 1, 2, 3, 0, 1]

With this prefix table now, KMP can be implemented. KMP search iterates through the string and the pattern to be searched for index by index. Whenever there is a mismatch, it can use the prefix table to compute a new index to try. When the pattern’s index reaches the length of the pattern, the string is found. This is implemented in detail in the following code block:

function KMP(str, pattern) {

// build the prefix table

var prefixTable = longestPrefix(pattern),

patternIndex = 0,

strIndex = 0;

while (strIndex < str.length) {

if (str.charAt(strIndex) != pattern.charAt(patternIndex)) {

// Case 1: the characters are different

if (patternIndex != 0) {

// use the prefix table if possible

patternIndex = prefixTable[patternIndex - 1];

} else {

// increment the str index to next character

strIndex++;

}

} else if (str.charAt(strIndex) == pattern.charAt(patternIndex)) {

// Case 2: the characters are same

strIndex++;

patternIndex++;

}

// found the pattern

if (patternIndex == pattern.length) {

return true

}

return false;

}

KMP('ababacaababacaababacaababaca', 'ababaca'); // true

KMP('sammiebae', 'bae'); // true

KMP('sammiebae', 'sammie'); // true

KMP('sammiebae', 'sammiebaee'); // false

Time Complexity: O(W)

Space Complexity: O(W )

Preprocessing a word of length W requires both O(W) time and space complexity.

Time Complexity: O(W + T )

Here, W is the “word” in the T (the main string being searched).

Rabin–Karp Search

The Rabin–Karp algorithm is based on hashing to find the specified pattern in text. While KMP is optimized to skip redundant checks during the search, Rabin–Karp seeks to speed up the equality of the pattern of the substring via a hash function. To do this efficiently, the hash function must be O(1). Specifically for the Rabin-Karp search, the Rabin fingerprint hashing technique is used.

The Rabin Fingerprint

The Rabin fingerprint is calculated via the following equation: f(x) = m₀ + m₁x + … + m_n-1 x^n-1 where n is the number of characters being hashed and x is some prime number.

This is a simple implementation, as shown in the following code block. An arbitrary prime number of 101 was set for this example. Any high prime number should work well in this case. However, be aware that if the x is too high, it could cause integer overflow because x^n-1 grows quickly. The endLength argument indicates to what string index the hash should be calculated. It should be defaulted to the length of str if the argument is not passed.

1 function RabinKarpSearch() {

2 this.prime = 101;

3 }

4 RabinKarpSearch.prototype.rabinkarpFingerprintHash = function (str, endLength) {

5 if (endLength == null) endLength = str.length;

6 var hashInt = 0;

7 for (var i=0; i < endLength; i++) {

8 hashInt += str.charCodeAt(i) * Math.pow(this.prime, i);

9 }

10 return hashInt;

11 }

12 var rks = new RabinKarpSearch();

13 rks.rabinkarpFingerprintHash("sammie"); // 1072559917336

14 rks.rabinkarpFingerprintHash("zammie"); // 1072559917343

As shown in the previous code block result, the hashes from sammie and zammie are unique because they are two different strings. The hash value allows you to quickly, in constant time, check whether two strings are the same. As an example, let’s look for am inside same. Since am is only two characters long, when you scan the text, sa, am, and me are formed from same and compute the hash as shown here:

1 rks.rabinkarpFingerprintHash("sa"); // 9912

2 rks.rabinkarpFingerprintHash("am"); // 11106

3 rks.rabinkarpFingerprintHash("me"); // 10310

This is a sliding hash calculation. How can this be done efficiently? Let’s analyze it mathematically. Recall that for this example the x is 101. In addition, the character code for s, a, m, and e are 115, 97, 109, and 101, respectively.

sa: f(x) = m₀ + m₁x = 115 + (97)*(101) = 9912
am: f(x) = m₀ + m₁x = 97 + (109)*(101) = 11106
me: f(x) = m₀ + m₁x = 109 + (101)*(101) = 10310

To get the hash value from sa to am, you must subtract the first term, divide the remaining by the prime number, and then add the new term. This recalculation algorithm is implemented in the following code block:

1 RabinKarpSearch.prototype.recalculateHash = function (str, oldIndex, newIndex, oldHash, patternLength) {

2 if (patternLength == null) patternLength = str.length;

3 var newHash = oldHash - str.charCodeAt(oldIndex);

4 newHash = Math.floor(newHash/this.prime);

5 newHash += str.charCodeAt(newIndex) * Math.pow(this.prime, patternLength - 1);

6 return newHash;

7 }

8 var oldHash = rks.rabinkarpFingerprintHash("sa"); // 9912

9 rks.recalculateHash("same", 0, 2, oldHash, "sa".length); // 11106

Lastly, two different strings can still have the same hash value although it’s unlikely. Therefore, there needs to be a function to check that two strings are equal given the start index and end index for both strings. This is implemented in the following code block:

1 RabinKarpSearch.prototype.strEquals = function (str1, startIndex1, endIndex1,

2 str2, startIndex2, endIndex2) {

3 if (endIndex1 - startIndex1 != endIndex2 - startIndex2) {

4 return false;

5 }

6 while ( startIndex1 <= endIndex1

7 && startIndex2 <= endIndex2) {

8 if (str1[startIndex1] != str2[startIndex2]) {

9 return false;

10 }

11 startIndex1++;

12 startIndex2++;

13 }

14 return true;

15 }

Then, the main Rabin–Karp search function is implemented by calculating the starting hash and then recalculating the hashes in a sliding manner until the pattern is found or the end of the string is reached.

1 RabinKarpSearch.prototype.rabinkarpSearch = function (str, pattern) {

2 var T = str.length,

3 W = pattern.length,

4 patternHash = this.rabinkarpFingerprintHash(pattern, W),

5 textHash = this.rabinkarpFingerprintHash(str, W);

7 for (var i = 1; i <= T - W + 1; i++) {

8 if (patternHash == textHash &&

9 this.strEquals(str, i - 1, i + W - 2, pattern, 0, W - 1)) {

10 return i - 1;

11 }

12 if (i < T - W + 1) {

13 textHash = this.recalculateHash(str, i - 1, i + W - 1, textHash, W);

14 }

15 }

17 return -1;

18 }

20 var rks = new RabinKarpSearch();

21 rks.rabinkarpSearch("SammieBae", "as"); // -1

22 rks.rabinkarpSearch("SammieBae", "Bae"); // 6

23 rks.rabinkarpSearch("SammieBae", "Sam"); // 0

Preprocessing Time Complexity: O(W )

The preprocessing time complexity W is the length of the “word.”

Matching Time Complexity: O(W + T )

At most, this algorithm iterates through the sum of length T and length W, where T is the string being searched for.

Applications in Real Life

The Rabin–Karp algorithm can be used for detecting plagiarism. With a source material, this algorithm can search through a paper submission for instances of phrases and sentences from the source material (and ignoring grammar details like punctuation by omitting punctuation characters during the preprocessing phase). This problem is impractical for single-search algorithms because of the large set of sought (input) phrases and sentences. The Rabin–Karp algorithm is also used in other string matching applications such as looking for a specific sequence in large DNA data.

Summary

This chapter returned to the topic of strings and looked at more advanced examples and searching on string patterns. The chapter discussed several different types.

Trie is great for multiple searches and prefix pattern matching.
Boyer–Moore, with assumption that the absence of a match at the end means no need to match the beginning, tries to match the last character of the pattern instead of the first; this allows large “jumps” (spaces between indexes) and works better when the text is larger.
The KMP algorithm searches for occurrences of the pattern in a string by observing that when a mismatch occurs, the pattern itself has sufficient information to determine the index in the string where the next match could begin. Hence, the KMP algorithm is better for small sets.

Table 18-1 summarizes the different search algorithms.

Table 18-1

Single String Search Summary

Algorithm	Preprocessing Time Complexity	Matching Time Complexity	Space Complexity
Naive	None	O(W ∗ T )	None
Boyer–Moore	O(W + T )	O(T /W ) best case O(W ∗ T) worst case	O(1)
KMP	O(W )	O(T )	O(W )
Rabin–Karp	O(W )	O(W + T )	O(1)

18. Advanced Strings

Trie (Prefix Tree)

Boyer–Moore String Search

Knuth–Morris–Pratt String Search

Rabin–Karp Search

The Rabin Fingerprint

Applications in Real Life

Summary