Findset, Find Min, and Find Word
In this chapter, we explore widely used data structures that answer point search queries. In the first section we discuss the disjoint-set data structure, which is well known for its applications in network algorithms and supports findset queries. Sections 4.2 and 4.3 cover binomial heaps and Fibonacci heaps; both are known for their roles in implementing priority queues. Binomial heaps efficiently perform deletemin queries, while Fibonacci heaps efficiently support the decrease key operation. Section 4.4 covers tries and their variants, which are designed for membership queries. The final section covers the use of inverted indices for handling keyword queries. The chapter ends with a set of exercises.
4.1 Disjoint-set data structure
The disjoint-set data structure, also known as a union-find data structure, was developed to represent a disjoint collection of sets. It allows simple operations of set theory such as creating singleton sets, taking the union of two sets, and, most importantly, determining to which set an element belongs. This data structure is well known for its wide applications in network algorithms.
Definition 4.1.1 A disjoint-set data structure organizes a collection, S = S1, S2, …, Sk of mutually disjoint sets, such that each element of any set is represented as an object and each set has a representative element inside the set. Moreover, it is possible to unite two disjoint sets Si and Sj to form a new set
4.1.1 Operations on disjoint-set data structure
The disjoint-set data structure is characterized by its unique operations and provides a way to work with data that is always arranged as a disjoint collection of sets. The three operations are described below.
makeset(x): The makeset(x) operation takes an element x as input and returns a new singleton set {x}, represented by x itself.
union(x,y): The union(x,y) operation takes the two sets represented by the elements x and y as inputs and unites them, returning a new set consisting of the elements of both sets.
findset(x): The findset(x) operation takes an element as input and returns the representative of the set to which it belongs.
▪ Example 4.1 — Find connected components of a graph G(V, E). In this example, we present an algorithm to compute connected components of a given graph. The algorithm is very straightforward and given below. ▪
Algorithm 15 Algorithm to determine connected components of a graph G
1: procedure CONNECTED-COMPONENTS(G(V, E)) ⊳ computes connected components of a graph
2: for ∀ v ∈ V do
3: makeset(v) ⊳ creates |V| singleton sets
4: remember the representatives
5: for ∀ (u, v) ∈ E do
6: if findset(u) ≠ findset(v) then
7: union(u, v) ⊳ unite if in different sets
8: reduce the number of representatives by one
9: return representatives ⊳ one representative for each component
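To make Algorithm 15 concrete, the following is a minimal Python sketch, assuming a dictionary-based union-find (the forest representation of Section 4.1.4, with path compression); the function name connected_components, the vertex/edge input format, and the returned grouping are illustrative choices rather than part of the original algorithm.

```python
def connected_components(vertices, edges):
    """Sketch of Algorithm 15: group the vertices of G(V, E) by component."""
    parent = {v: v for v in vertices}          # makeset(v) for every vertex

    def findset(x):                            # follow parents to the root,
        while parent[x] != x:                  # halving the path as we go
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):                           # hang one root under the other
        parent[findset(x)] = findset(y)

    for u, v in edges:
        if findset(u) != findset(v):
            union(u, v)

    components = {}                            # one representative per component
    for v in vertices:
        components.setdefault(findset(v), []).append(v)
    return components


print(connected_components("abcde", [("a", "b"), ("b", "c"), ("d", "e")]))
# {'c': ['a', 'b', 'c'], 'e': ['d', 'e']}  (representatives may differ)
```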
4.1.2 Representations of disjoint sets
Given this abstraction for disjoint sets, the next task is to implement the data structure in an efficient manner. A trivial implementation uses a list representation, i.e., each set is a linked list on which the disjoint-set operations are defined. Another implementation uses a disjoint-set forest, which is a collection of trees. Both representations are discussed below.
4.1.3 Linked-list representation of disjoint sets
In a linked-list representation of disjoint sets (Figure 4.1), every set is realized as a linked list: the nodes of the list are the elements of the set, the representative element appears at the first location of the list, and every node has two pointers instead of one. One pointer points to the next element, as in an ordinary linked list; the second pointer points to the representative element of the set (the first node).
Figure 4.1: Disjoint sets with link representation
In this representation, makeset(x) creates a single-node list whose node refers to itself. There is no operation to create a non-singleton set; larger sets are created via union operations. The union(x, y) operation joins two lists; generally the larger one appears first, to reduce the cost. The findset operation is performed by following the representative pointer associated with the element.
Definition 4.1.2 — Weighted union heuristic. Let |S1| = n and |S2| = m be the cardinalities of two disjoint sets, where m ≤ n. A union operation appends the smaller list S2 at the end of S1, so that only the O(m) representative pointers of the smaller list need to be updated.
Complexity 4.1.1 — Disjoint sets with link representation.
Complexity of makeset operation is O(1).
Complexity of findset operation is O(1).
Complexity of union operation is O(m + n).
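A minimal Python sketch of the linked-list representation with the weighted-union heuristic; the class names LLSet and Element and the use of a Python list for the members are illustrative assumptions.

```python
class LLSet:
    """One disjoint set: a list of members; the head is the representative."""
    def __init__(self, x):
        self.members = [x]

class Element:
    def __init__(self, value):
        self.value = value
        self.set = None          # pointer to the LLSet containing this element

def makeset(x):
    x.set = LLSet(x)             # singleton set; x is its own representative
    return x.set

def findset(x):
    return x.set.members[0]      # representative = first member, O(1)

def union(x, y):
    sx, sy = x.set, y.set
    if sx is sy:
        return sx
    if len(sx.members) < len(sy.members):   # weighted-union heuristic:
        sx, sy = sy, sx                     # append the smaller list
    for e in sy.members:                    # O(min(m, n)) pointer updates
        e.set = sx
    sx.members.extend(sy.members)
    return sx

a, b, c = Element("a"), Element("b"), Element("c")
for e in (a, b, c):
    makeset(e)
union(a, b)
union(b, c)
print(findset(c).value)          # "a": all three now share one representative
```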
Theorem 4.1.1 A union operation takes Ω(n) time in the worst case, but a sequence of m disjoint-set operations on a collection of n elements requires only O(m + n log n) time.
Proof. If all operations are makeset or findset, the total cost is at most O(m), since each costs O(1). At most (n − 1) union operations are possible. With the weighted-union heuristic, an element's representative pointer is updated only when it belongs to the smaller of the two lists, so the size of its set at least doubles with every update; hence each element's pointer is updated at most log n times, and all union operations together cost O(n log n). The worst case occurs when the united sets always have equal sizes, analogous to merge sort. The total cost is therefore O(m + n log n). ▪
4.1.4 Forest representations of disjoint sets
Objective 4.1 — Disjoint forest representation. In the linked-list representation of disjoint sets, makeset and findset run in constant time whereas union is almost linear. However, findset and union are called roughly equally often in many applications. Balancing the cost of union against findset is a good trade-off for minimizing the total cost of an application, and this is the main motivation for the disjoint forest representation.
Realizing the disjoint-set data structure with a collection of trees was first proposed to make proper use of pointers (Figures 4.2 and 4.3). In this representation sets are trees with reverse pointers, i.e., every node carries a parent pointer instead of child pointers; a root points to itself and represents the set. The makeset operation creates a root from the input element. The findset operation follows successive parent pointers until it reaches the root, which points to itself and represents the set. The union operation joins two trees: the root of the smaller-rank tree becomes a child of the root of the higher-rank tree.
Figure 4.2: Disjoint sets with forest representation
Figure 4.3: Union by rank in disjoint sets
Definition 4.1.3 — Union by rank heuristic. In the union(x,y) operation on two disjoint sets represented by trees containing the elements x and y respectively, the pointer associated with the root of the smaller-rank tree is updated to point to the root of the higher-rank tree, as illustrated in Figure 4.3. The rank of the resulting tree equals the rank of the larger tree, unless the two ranks are equal, in which case the rank of the new root increases by one.
Definition 4.1.4 — Path compression heuristic. The findset operation searches for the root by following parent pointers. It then detaches every node appearing on the path and makes it a direct child of the root.
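The two heuristics of Definitions 4.1.3 and 4.1.4 can be sketched as follows (a minimal Python version of the forest representation; the dictionary-based layout is an illustrative choice):

```python
parent = {}
rank = {}

def makeset(x):
    parent[x] = x                # a root points to itself and represents the set
    rank[x] = 0

def findset(x):
    if parent[x] != x:
        parent[x] = findset(parent[x])   # path compression: hang x directly
    return parent[x]                     # under the root

def union(x, y):
    rx, ry = findset(x), findset(y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:              # union by rank: the smaller-rank root
        rx, ry = ry, rx                  # becomes a child of the larger-rank root
    parent[ry] = rx
    if rank[rx] == rank[ry]:             # equal ranks: the new root's rank grows
        rank[rx] += 1
```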
Theorem 4.1.2 For a sequence of m disjoint-set operations on a collection of n elements, represented as a disjoint-set forest supported with union by rank and path compression, where m ≥ n, the total cost is O(m · α(n)), where α(n) = min{k : Ak(1) ≥ n} is the inverse of a variant of the Ackermann function Ak(i).
Note that α(n) reaches 4 only when n exceeds the number of atoms in the universe, so for all practical purposes α(n) can be treated as a small constant.
Proof. Note that although a single findset may cost O(n) in the worst case, this can happen only rarely, because the path compression heuristic flattens every path it traverses; makeset and union (apart from their embedded findset calls) take constant time. If all operations are makeset or cheap findset calls, the total is at most O(m); an amortized analysis charges the expensive findset calls against the flattening they perform, yielding the O(m · α(n)) bound. See Figure 4.4.
Figure 4.4: Disjoint-sets path compression during findset operation
4.2 Binomial heaps
Heaps are common data structures used mainly to implement priority queue operations. Binary heaps are taught in basic data structure courses; they are binary trees supporting moderately complex operations like findmin in O(1), deletemin in O(log n), and buildheap in O(n). However, binary heaps are not suitable for dynamic update operations.
Objective 4.2 Binomial heap data structures provide findmin in amortized constant time and deletemin in O(log n) time, and they efficiently perform insert and meld operations.
A binomial heap is a specific implementation of the heap data structure using a collection of binomial trees whose roots are connected by a circular linked list. Each tree is a min heap (a max variant is possible as well), i.e., the minimum element is at the root. The number of nodes in a binomial tree is always a power of two: 1, 2, 4, 8, etc. To create a binomial heap of any size we represent the chosen size as a sum of powers of two and create trees accordingly.
4.2.1 Creation and updates of binomial heap
Binomial heaps are created as a collection of binomial trees whose roots are connected by a circular linked list. The sizes of the binomial trees are determined by the binary representation of the number of elements: a heap with n elements contains one binomial tree Bk for every 1 bit at position k.
Binomial trees demonstrate remarkable properties because their shapes are governed by the binomial numbers: the number of nodes at each depth of Bk is a binomial coefficient. See Figures 4.5 and 4.6.
1. For every n ≥ 0, a binomial heap with n nodes consists of the binomial trees Bk that correspond to the 1 bits in the binary representation of n.
2. A binomial heap H1 of size 13 = 2^3 + 2^2 + 2^0 can be written as H1 = B3 + B2 + B0.
3. Similarly, a binomial heap H2 of size 23 = 2^4 + 2^2 + 2^1 + 2^0 can be written as H2 = B4 + B2 + B1 + B0.
4. H3 = meld(H1, H2) will have 36 = 2^5 + 2^2 nodes and can be written as H3 = B5 + B2.
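The decomposition used in this list is simply the binary representation of the heap size; a small sketch (the helper name binomial_tree_orders is illustrative):

```python
def binomial_tree_orders(n):
    """Orders k of the binomial trees B_k making up a heap of n nodes,
    i.e., the positions of the 1 bits in the binary representation of n."""
    return [k for k in range(n.bit_length()) if (n >> k) & 1]

print(binomial_tree_orders(13))   # [0, 2, 3]     -> B0 + B2 + B3
print(binomial_tree_orders(23))   # [0, 1, 2, 4]  -> B0 + B1 + B2 + B4
print(binomial_tree_orders(36))   # [2, 5]        -> B2 + B5
```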
Figure 4.5: Binomial trees
Figure 4.6: Merging two binomial trees of same order
4.2.2 Operations of Binomial Heap
Binomial trees are constructed to maintain the heap property, i.e., the value of a child is never smaller than the value of its parent. See Figure 4.7. Creation, update and finding the minimum value are generally done through the operations create(B0), meld(H1, H2), findmin(), deletemin(), deletenode(x) and decreasekey(x, Δx).
Figure 4.7: Merging binomial trees of lower order
4.2.2.1 create(B0)
The create(B0) operation creates a binomial tree of size 1; it contains only one node, the root, as shown in Algorithm 16.
Algorithm 16 Algorithm to perform create (B0) operation
1: procedure CREATE(B0, x) ⊳ Create a B0 from the element x
2: Make x the root
3: Add key value of x
4: return B0
4.2.2.2 meld(H1, H2)
One useful operation of a binomial heap is meld(H1, H2). Other operations like deletemin and deletenode depend on the meld operation and their complexities are determined by the complexity of the meld. Algorithm 17 details the meld operation.
4.2.2.3 findmin()
To search for and return the minimum value of a binomial heap we execute the findmin() operation, which traverses the list of tree roots; by the heap property, one of these roots must hold the minimum value. This can be done in O(1) amortized complexity, though the worst case is O(log n). Algorithm 18 depicts the findmin function.
Algorithm 17 Algorithm to perform meld(H1, H2) operation
1: procedure H = MELD(H1, H2) ⊳ Join two binomial heaps
2: for k = 0 → max{log |H1|, log |H2|} do
3: if more than one Bk then join roots to create a Bk+1
4: Add Bk+1 to the list of H
5: return H
Algorithm 18 Algorithm to perform findmin() operation
1: procedure FINDMIN (H) ⊳ Find and return the minimum value of heap
2: Find minimum of link list
3: return minimum value
4.2.2.4 deletemin()
In the deletemin() operation we first call findmin(H) to find the minimum root in the collection of binomial trees associated with the binomial heap H. We then delete that root; its subtrees are binomial trees of lower order, which are melded back into the remaining heap. For example, if the minimum of a binomial heap lies in the root of B3, deleting it leaves B2, B1 and B0, which are then melded with the other trees of the heap (Algorithm 19).
Algorithm 19 Algorithm to perform deletemin() operation
1: procedure DELETEMIN (H) ⊳ Delete the minimum element of the heap H
2: root(Bi) = findmin(H)
3: remove root of Bi
4: return meld(H \ Bi, Bi−1 + Bi−2 + … + B0)
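A compact Python sketch of these operations, assuming a heap is represented as a plain list of tree roots and each root records its order; meld links equal-order trees in the spirit of Algorithm 17, mimicking binary addition. The names are illustrative, and decreasekey/deletenode are omitted because they require parent pointers.

```python
class BNode:
    def __init__(self, key):
        self.key = key
        self.order = 0           # this node is the root of a B_order tree
        self.children = []       # roots of B_0 .. B_{order-1}

def link(a, b):
    """Join two binomial trees of the same order into one of the next order,
    keeping the min-heap property (the smaller key stays on top)."""
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    a.order += 1
    return a

def meld(h1, h2):
    """Meld two heaps (lists of roots); equal-order trees are linked like carries."""
    out = []
    for t in sorted(h1 + h2, key=lambda r: r.order):
        out.append(t)
        while len(out) >= 2 and out[-1].order == out[-2].order:
            b, a = out.pop(), out.pop()
            out.append(link(a, b))
    return out

def insert(heap, key):
    return meld(heap, [BNode(key)])

def findmin(heap):
    return min(heap, key=lambda r: r.key)

def deletemin(heap):
    r = findmin(heap)
    rest = [t for t in heap if t is not r]
    return r.key, meld(rest, r.children)   # children are B_{i-1}, ..., B_0

heap = []
for k in [7, 3, 9, 1, 5]:
    heap = insert(heap, k)
m, heap = deletemin(heap)
print(m, findmin(heap).key)     # 1 3
```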
4.2.2.5 decreasekey(x, Δx)
Assume we want to decrease the value of a node x using the decreasekey(x, Δx) operation. Decreasing an intermediate key may violate the heap property, so we compare the node with its parent and swap them, repeating the process up to the root or stopping as soon as the heap property is restored. The worst case complexity of the decreasekey operation is O(log n), but this is an overestimate: it occurs only when the operation is called on a leaf node of the binomial tree containing the most elements, which is a rare situation. Algorithm 20 details the decreasekey operation.
Algorithm 20 Algorithm to perform decreasekey(x, Δx) operation
1: procedure DECREASEKEY (x, Δx) ⊳ Decrease the value of a particular node
2: Decrease x by Δx
3: while parent(x) > current node do swap with parent
4: Update current node to parent
5: return New tree
4.2.2.6 deletenode(x)
To delete a particular node from a binomial heap, which resides in one of its binomial trees, the following steps are performed:
1) Set the value of that node sufficiently small (for example 0, a negative number, or −∞).
2) Bubble the node up to the root to maintain the heap property.
3) A deletemin() operation then deletes the node and restores the heap.
The worst case complexity of the deletenode operation is O(log n), but it is an overestimate for the same reason as decreasekey. Algorithm 21 shows the deletenode operation.
Algorithm 21 Algorithm to perform deletenode(x) operation
1: procedure DELETENODE (x) ⊳ Delete an arbitrary node in binomial heap.
2: decreasekey(x,∞)
3: deletemin()
4: return New Heap
The complexities of the six binomial heap operations are summarized below.
1. Complexity of create(B0) is O(1).
2. Complexity of meld(H1, H2) is O(log n) in the worst case (an overestimate).
3. Complexity of findmin() is O(1) amortized.
4. Complexity of deletemin() is O(log n) (an overestimate).
5. Complexity of deletenode(x) is O(log n) (an overestimate).
6. Complexity of decreasekey(x, Δx) is O(log n) (an overestimate).
▪ Example 4.2 — Prim’s algorithm to find a Minimal Spanning Tree. Prim’s algorithm uses a binomial heap to store the key values of the unexplored vertices. In every iteration the algorithm looks for a minimum-weight edge between the explored and unexplored vertices, determined by the minimum value in the priority queue. The iterations continue until the priority queue is empty. If the graph under consideration is G(V, E), then the maximum size of the priority queue is |V|, so the number of findmin operations is |V| and the number of deletemin operations is also |V|. However, the number of decreasekey operations may be O(|E|), which may cost up to O(|E| log |V|) without amortized analysis. ▪
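A sketch of the same idea using Python's heapq module as a stand-in priority queue: instead of decreasekey, stale entries are simply skipped when popped (lazy deletion), which keeps the example short; with a binomial or Fibonacci heap the decreasekey calls would be performed explicitly. The adjacency-dictionary graph format is an illustrative assumption.

```python
import heapq

def prim_mst_weight(graph, start):
    """Total weight of a minimum spanning tree of a connected, undirected,
    weighted graph given as {vertex: [(neighbor, weight), ...]}."""
    explored = set()
    pq = [(0, start)]                 # (key value, vertex)
    total = 0
    while pq:
        w, u = heapq.heappop(pq)      # deletemin
        if u in explored:
            continue                  # stale entry: a smaller key was used already
        explored.add(u)
        total += w
        for v, weight in graph[u]:
            if v not in explored:     # "decreasekey" by pushing a smaller key
                heapq.heappush(pq, (weight, v))
    return total

graph = {
    "a": [("b", 2), ("c", 3)],
    "b": [("a", 2), ("c", 1)],
    "c": [("a", 3), ("b", 1)],
}
print(prim_mst_weight(graph, "a"))    # 3  (edges a-b and b-c)
```

With a Fibonacci heap the same loop runs in O(|E| + |V| log |V|), since each decreasekey is O(1) amortized.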
4.3 Fibonacci heaps
Objective 4.3 The priority queue is a ubiquitous data structure in theoretical computer science. Fibonacci heaps provide a fast and efficient solution for the decrease key operation, thus making network optimization algorithms faster.
The Fibonacci heap data structure, developed by Fredman and Tarjan in 1984, gives an efficient way to implement the decrease key operation of priority queues [19]. See Figure 4.8. The goal is to minimize the number of operations needed to compute the Minimal Spanning Tree or Shortest Path tree; the kinds of operations that we are interested in are insert, decrease-key, merge, and delete-min. We can achieve this minimization goal by using lazy operations to reduce amortized complexity.
Figure 4.8: Fibonacci trees of different ranks and sizes
Fibonacci heaps make use of heap-ordered trees, but use Fibonacci numbers instead of the binomial numbers used in binomial heaps.
4.3.1 Properties of a Fibonacci heap
1.The roots of these trees are kept in a doubly-linked list (the “root list” of H);
2.The root of each tree contains the minimum element in that tree (this follows from being a heap-ordered tree);
3.We access the heap by a pointer to the tree root with the overall minimum key;
4.For each node x, we keep track of the rank (also known as the order or degree) of x, which is just the number of children x has; we also keep track of the mark of x, which is a Boolean value whose role will be explained later.
For each node, we have at most four pointers that respectively point to the node’s parent, to one of its children, and to two of its siblings. The sibling pointers are arranged in a doubly-linked list (the “child list” of the parent node).
4.3.2 Inserting, merging, cutting, and marking
Inserting a node x. We create a new tree containing only x and insert it into the root list of H; this is clearly an O(1) operation.
Merging two trees. Let x and y be the roots of the two trees we want to merge; then if the key in x is no less than the key in y, we make x the child of y; otherwise, we make y the child of x. We update the appropriate node’s rank and the appropriate child list; this takes O(1) operations.
Cutting a node. If x is a root in H, we are done. If x is not a root in H, we remove x from the child list of its parent, and insert it into the root list of H, updating the appropriate variables (the rank of the parent of x is decremented, etc.). Again, this takes O(1) operations. (We assume that when we want to find a node, we have a pointer hanging around that accesses it directly, so actually finding the node takes O(1) time.)
Marking. We say that x is marked if its mark is set to “true”, and that it is unmarked if its mark is set to “false”. A root is always unmarked. We mark x if it is not a root and it loses a child (i.e., one of its children is cut and put into the root-list). We unmark x whenever it becomes a root. We will make sure later that no marked node loses another child before it is cut and reverts to unmarked status.
4.3.3 Decreasing keys and delete-min operation
At first, decrease-key does not appear to be any different than merge or insert; we simply need to find the node and cut it off from its parent, then insert the node into the root list with a new key. This requires removing it from its parent’s child list, adding it to the root list, updating the parent’s rank, and (if necessary) the pointer to the root of smallest key. This takes O(1) operations.
The delete-min operation works in the same way as decrease-key: our pointer into the Fibonacci heap is a pointer to the minimum keyed node, so we can find it in one step. We remove this root of smallest key, add its children to the root-list, and scan through the linked list of all the root nodes to find the new root of minimum key. Therefore, the cost of a delete-min operation is proportional to the number of roots in the root list; the consolidation step described below keeps this number small.
4.3.4 Algorithm for Fibonacci heaps
Maintain a list of heap-ordered trees.
4.3.4.1 insert
Add a degree 0 tree to the list.
4.3.4.2 delete-min
We can find the node we wish to delete immediately since our handle to the entire data structure is a pointer to the root with minimum key. Remove the smallest root, and add its children to the list of roots. Scan the roots to find the next minimum. Then consolidate all the trees (merging trees of equal rank) until there is ≤ 1 of each rank.
Assuming that we have achieved the property that the number of descendants is exponential in the number of children for any node, as we did in the binomial trees, no node has rank > c log n for some constant c. From this assumption, we can conclude that every root (including the smallest) has O(log n) children, and that consolidation leaves us with O(log n) roots.
The consolidation is performed by allocating buckets of sizes up to the maximum possible rank for any root node, which we just showed to be O(log n). We put each node into the appropriate bucket, at a cost of O(log n) plus the number of roots; we then walk through the buckets, merging trees of equal rank until at most one tree of each rank remains.
4.3.4.3 decrease-key
Cut the node, change its key, and insert it into the root list as before. Additionally, if the parent of the node was unmarked, mark it. If the parent of the node was marked, cut it off also. Recursively do this until we get up to an unmarked node, and mark it.
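A minimal sketch of decrease-key with cascading cuts, assuming each node stores its parent, a list of children, and a mark bit, and that the roots live in a plain Python list. A real implementation keeps children in a doubly linked list so that removing one is O(1) and also maintains the pointer to the minimum root; those details are omitted here.

```python
class FibNode:
    def __init__(self, key):
        self.key = key
        self.parent = None
        self.children = []
        self.mark = False

def cut(node, root_list):
    """Detach node from its parent and move it to the root list."""
    parent = node.parent
    parent.children.remove(node)   # O(1) with a doubly linked child list
    node.parent = None
    node.mark = False              # a root is always unmarked
    root_list.append(node)
    return parent

def decrease_key(node, delta, root_list):
    node.key -= delta
    parent = node.parent
    if parent is None or parent.key <= node.key:
        return                     # heap order still holds, nothing to do
    parent = cut(node, root_list)  # cut the violating node
    while parent.parent is not None:       # cascade through marked ancestors
        if not parent.mark:
            parent.mark = True     # first lost child: just mark the parent
            break
        parent = cut(parent, root_list)
```

In a complete implementation, decrease_key would also update the heap's pointer to the minimum root whenever the decreased key drops below it.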
4.3.5 Amortized analysis for Fibonacci heaps
Define Φ(DS) = k · (# of roots in DS + 2 · # marked bits in DS). Note that insert and delete-min do not ever cause nodes to be marked - we can analyze their behaviour without reference to marked and unmarked bits. The parameter k is a constant that we will conveniently specify in the analysis below. We now analyze the costs of the operations in terms of their amortized costs (defined to be the real costs plus the changes in the potential function).
4.3.5.1 insert
The amortized cost is O(1) which represents actual work plus k change in potential for adding a new root. Since k is a constant, O(1) + k = O(1) total amortized cost.
4.3.5.2 delete-min
For every node that we put into the root list (the children of the node we have deleted), plus every node that is already in the root list, we do constant work putting that node into a bucket corresponding to its rank and constant work whenever we merge the node.
Let r be the number of roots on the root list before we add the minimum root’s children. From our assumption that the size of a tree is exponential in its rank, we know that the number of children added is O(log n). Thus, our real costs are putting the roots into buckets (O(r) + O(log n)), walking through the buckets (O(log n)), and doing the consolidating tree merges (O(r) + O(log n)). On the other hand, our change in potential is k · (O(log n) − r) (since there are at most O(log n) roots after consolidation). Thus, total amortized cost is O(r) + O(log n) + k · (O(log n) − r) = O(log n) + (O(r) − k · r). Now, we see that if we set k to be greater than the constant term hidden in the O(r) notation, that term disappears, and the amortized cost becomes O(log n).
4.3.5.3 decrease-key
The real cost is O(1) for each individual cut, key decrease and re-insertion. The only problematic issue is the possibility of a “cascading cut” - a name we give to a cut that causes the node above it to be cut because it was already marked, which in turn causes the node above it to be cut since it too was already marked, and so on. This can increase the actual cost of the operation to O(# of nodes already marked). Every cost we incur from having to update pointers due to a marked node that was cut is offset by the decrease in the potential function when that previously marked node is now left unmarked in the root list.
More formally, let c be the number of nodes cut. The amount of real work done is O(c). We clear c − 1 mark bits as we cascade up, but the parent of the last node that we cut may be marked. Also, we added c nodes to the root list. Thus, the change in potential is at most k · (c + 2( − (c − 1) + 1)) = −k · (c − 4), and the amortized cost is (O(c) − k · c) + O(1). As with delete-min, if we set k to be greater than the constant term hidden in the O(c) notation, we’re left with O(1). This analysis also explains why we needed to use twice the number of marked bits in our potential function to cancel the addition of the nodes onto the root list.
The only thing left to prove is that for every node in every tree in our Fibonacci heap, the number of descendants of that node is exponential in the number of children of that node, and that this remains true despite the cut rule for marked nodes. We must prove this in order to substantiate our earlier assertion that all nodes have degree O(log n).
Consider the children of some node x in the order in which they were added to x.
Theorem 4.3.1 The ith child to be added to x has rank at least i − 2.
Proof. Let y be the ith child to be added to x. When it was added, y had at least i − 1 children. This is true because we can currently see i − 1 children that were added earlier, so they were there at the time of y’s addition. This means that y had at least i − 1 children at the time of its merger, because we only merge equal ranked nodes. Since a node cannot lose more than one child without being cut, y must have at least i − 2 children (i − 1 from when it was added, and no more than one subsequently lost). ▪
If we had been working with a binomial tree, the appropriate lemma would have been rank = i − 1, not rank ≥ i − 2.
Let Sk be the minimum number of descendants (including itself) of a node with k children. We have S0 = 1, S1 = 2 and, by Theorem 4.3.1, Sk ≥ 2 + S0 + S1 + … + Sk−2 for k ≥ 2: the node itself, its first child, and at least Si−2 descendants below the ith child for each i ≥ 2.
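A brief sketch of how this recurrence is solved (the standard argument, included for completeness): by induction with the hypothesis S_i ≥ F_{i+2} (base cases S_0 = 1 = F_2 and S_1 = 2 = F_3) and the identity Σ_{j=2}^{k} F_j = F_{k+2} − 2,

```latex
S_k \;\ge\; 2 + \sum_{i=0}^{k-2} S_i
    \;\ge\; 2 + \sum_{i=0}^{k-2} F_{i+2}
    \;=\; 2 + \bigl(F_{k+2} - 2\bigr)
    \;=\; F_{k+2} \;\ge\; \varphi^{k},
\qquad \varphi = \tfrac{1+\sqrt{5}}{2},
```

where F_j is the jth Fibonacci number and φ is the golden ratio. Thus a node with k children has at least φ^k descendants, so every rank is O(log n), as assumed above; the appearance of the Fibonacci numbers here gives the data structure its name.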
4.4 Tries
The trie is one of the oldest known data structures; it implements a set of words and reports set membership queries very efficiently. Besides string membership search, the trie data structure is widely used in modern applications such as lexicon implementation, spelling correction, auto-complete operations, and routing tables. Figure 4.9 depicts a trie.
Figure 4.9: Trie
Consider a set
Table 4.1: Comparison of data structures for membership query
Set membership verification from a collection of data items is a classical problem in computer science, and many data structures have been developed with the common objective of reducing the query reporting time. Finding a word in a collection of words can be done using a special data structure popularly known as a TRIE, or simply trie. Though traditional data structures like the BST and hashing are very good for searching, in the case of strings a trie performs better; a comparison is shown in Table 4.1.
The following are the main advantages of tries over binary search trees (BSTs). If we store keys in a binary search tree, a well balanced BST will need time proportional to m log n, where m is the maximum string length and n is the number of keys in the tree. Using a trie, we can search for the key in O(m) time, i.e., lookup is faster: looking up a key of length m takes O(m) time even in the worst case. A BST performs O(log n) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced; since each comparison may examine up to m characters, in the worst case a BST takes O(m log n) time. Moreover, in the worst case log n will approach m.
Tries are more space-efficient when they contain a large number of short keys, since nodes are shared between keys with common initial subsequences. Tries facilitate longest-prefix matching. The following are the main advantages of tries over hash tables: Tries tend to be faster on average at insertion than hash tables, because hash tables must rebuild their indices when they become full, a very expensive operation. Tries therefore have much better bounded worst-case time costs, which is important for latency-sensitive programs. Tries support ordered iteration, whereas iteration over a hash table results in a pseudo-random order given by the hash function; moreover, tries have no problem of collisions. Tries facilitate longest-prefix matching, but hashing does not.
The word trie is an infix of the word “retrieval”: a trie can find a single word in a collection of words given only a prefix of the word. The main idea of the trie data structure consists of the following parts: the trie is a tree where each vertex represents a single word or a prefix. The root represents the empty string (""), the vertices that are direct children of the root represent prefixes of length 1, the vertices that are 2 edges distant from the root represent prefixes of length 2, the vertices that are 3 edges distant from the root represent prefixes of length 3, and so on. In other words, a vertex at distance k edges from the root has an associated prefix of length k.
Inserting a key into a trie is simple. Every character of the input key is inserted as an individual trie node. Note that the children constitute an array of pointers to next-level trie nodes, and the key character acts as an index into the array of children. If the input key is new or an extension of an existing key, we construct the non-existing nodes of the key and mark the last node as a leaf. If the input key is a prefix of an existing key in the trie, we simply mark the last node of the key as a leaf. The key length determines the trie depth. Algorithm 22 details the steps of inserting a word in a trie.
Algorithm 22 Algorithm to insert word in trie
1: procedure TRIE-INSERT (word) ⊳ Insert a word in trie.
2: Set pointer to root
3: while word length ≥ 0 do repeat over character
4: if character does not exist in children then insert new child for the character
5: move to next character
6: move to child
7: Set end flag
8: return New trie
Searching for a key is similar to insert operation, however we only compare the characters and move down. The search can terminate due to end of string or lack of key in trie. In the former case, if the value field of last node is non-zero, then the key exists in trie. In the second case, the search terminates without examining all the characters of key, since the key is not present in trie. Algorithm 23 lists the search steps.
Algorithm 23 Algorithm to search word in trie
1: procedure TRIE-SEARCH(word) ⊳ Search a word in a trie.
2: Set pointer to root
3: while word length ≥ 0 do repeat over character
4: if character does not exist in children then return not exist
5: else move to next character
6: move to child
7: if current node is marked with end flag then return word found
8: else return not exist
During delete operation we delete the key in two phases using a stack. In the first phase we search for the word to be deleted and store the characters appearing in the search path. If the search is successful, we delete characters up to the previous end flag (Algorithm 24).
Algorithm 24 Algorithm to delete word from a trie
1: procedure TRIE-DELETE(word) ⊳ Delete a word from a trie.
2: Set pointer to root
3: while word length ≥ 0 do repeat over character
4: push character in a stack
5: if character does not exist in children then return not exist
6: else move to next character
7: move to child
8: if current node is marked with end flag then stop pushing to the stack
9: while character ≠ end flag do pop from stack
10: remove character from trie
11: return deleted
12: else return not exist
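Algorithms 22–24 can be sketched compactly in Python. Here each node keeps a dictionary of children and an end flag, and delete follows the two-phase idea above: walk down remembering the path, then prune nodes that are no longer needed on the way back (stopping at any node that still has children or ends another word). The class and method names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_end = False     # "end flag": a word terminates at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:          # Algorithm 22, step 4
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True                       # set end flag

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:          # Algorithm 23, step 4
                return False
            node = node.children[ch]
        return node.is_end                       # found only if end flag is set

    def delete(self, word):
        # Phase 1: walk down, remembering the path on a stack.
        node, path = self.root, []
        for ch in word:
            if ch not in node.children:
                return False                     # word not present
            path.append((node, ch))
            node = node.children[ch]
        if not node.is_end:
            return False
        node.is_end = False
        # Phase 2: pop the stack, pruning nodes that are no longer needed.
        while path and not node.children and not node.is_end:
            parent, ch = path.pop()
            del parent.children[ch]
            node = parent
        return True

t = Trie()
for w in ["tea", "ten", "to"]:
    t.insert(w)
print(t.search("ten"), t.search("te"))    # True False
t.delete("ten")
print(t.search("ten"), t.search("tea"))   # False True
```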
Insert and search operations cost O(key length). The memory requirement of a trie is O(alphabet size × key length × N), where N is the number of keys in the trie. There are efficient representations of trie nodes (e.g., the compressed trie, the ternary search tree, etc.) that minimize memory requirements. What makes the trie really perform well is the fixed cost of looking up a word or prefix: the cost depends only on the number of characters in the word, not on the size of the vocabulary.
We must take into account the worst case timing first and later convince ourselves of the practical timings. For every node in a trie, we have a collection of children that may be either a set or a list. If we choose the set option, moving to a child takes O(1) time. If we use a linked list, the worst number of comparisons will be 26 (the number of letters in the English alphabet); to move from one node to another, at most 26 comparisons are required at each step.
Complexity 4.4.1 — Lexicon. To insert a word of length k, we need k × 26 comparisons. Applying the O notation yields O(k), which can be regarded as O(1). Insert operations are performed in constant time regardless of the length of the input string (this might look like an overstatement, but if we bound the length of the input string by a worst case maximum, this sentence holds true). The same holds true for the search operation as well: the search operation performs exactly the way the insert does and its order is O(k × 26) = O(1).
Complexity 4.4.2 — Space requirement. We will use an English dictionary as an example to illustrate complexity. Considering space requirements (recall that M indicates the byte size of the dictionary), a trie could have M nodes in the worst case, if no two strings shared a prefix. A lot of compression can be achieved if we observe a large amount of redundancy in the dictionary. The English dictionary used in the example contains 935,017 bytes and requires 250,264 nodes, a compression of about 73%. However, despite the compression, a trie will usually require more memory than a tree or an array, because each node requires at least 26 times the size of a pointer plus overhead. On a 64-bit machine, each node requires more than 200 bytes, whereas a string character requires a single byte, or two if we consider UTF strings.
Consider a trie that is mostly static (all insertions or deletions of keys from a prefilled trie are disabled) and where only lookups are needed. It is possible to compress the representation by merging common branches. We can also improve the time and space performance of a trie by eliminating all branch nodes that have only one child. The result is called a compressed trie. When branch nodes with a single child are removed from a trie, we need to keep additional information so that dictionary operations may be performed correctly.
A Patricia is a compressed trie. Instead of storing all data at the edges and having empty internal nodes, Patricia stores data in every node. All operations are performed in O(K) time in the worst case, where K is the number of bits in the largest item in the tree. In practice, operations actually take O(A(K)) time, where A(K) is the average number of bits of all items in the tree. Most importantly, Patricia requires very few comparisons of keys during any operation. During a lookup, each comparison (K at most) performs a single bit comparison against the given key, instead of comparing the entire key to another key. In Patricia, an edge label (from, to) of the compressed trie is replaced by (T[from], to − from + 1), i.e., by the first character of the label and the label's length. To implement this idea we make a copy of each string s that is inserted into T; to label an edge with some substring of s, we use two pointers to the first and last characters of the label, thus avoiding the cost of storing extra strings. The size of each edge label is constant. Additional Patricia operations include (1) finding all strings with a common prefix: the system returns an array of strings that begin with the same prefix; (2) finding predecessors: locating the largest string that is smaller than the given string in lexicographic order; and (3) finding successors: locating the smallest string larger than the given string, also in lexicographic order.
Let S denote a string of length n, and let S[i, j] denote the substring of S from position i to position j. Before constructing the suffix tree, we concatenate a new character, $, to S. The importance of this character is twofold. First, by adding it to the string, we avoid the undesirable situation in which a suffix is a prefix of another suffix. Second, the generalization is also made easier by this operation. Now we define the suffix tree of a string; we consider only fixed-size alphabets, and unbounded alphabets are not discussed. A suffix tree is a rooted, directed tree (see Figure 4.10). It has n leaves labelled from 1 to n, and its edges are labelled by characters of the alphabet. The label of an edge e is denoted by l(e). On the path from the root to the leaf j one can read the suffix S[j, n] of the string followed by a $ sign. Such a path is denoted by P(j) and its label is referred to as the path label of j and denoted by L(j). We call a leaf w reachable from the node v if there is a directed path from v to w.
Figure 4.10: Suffix tree example (in reverse dictionary order)
Suffix trees can use quite a lot of space. There are long branches which could be compressed to achieve a compact suffix tree of a string. Formally, a compact suffix tree of S is a rooted directed tree with n leaves. Each internal node has at least two children (the root is not an internal node). Each edge has a label with the property that if uv and uw are edges, then the first characters of the label of uv and of uw are distinct. The label of a path is the concatenation of the labels on its edges.
4.5 Inverted indices
An inverted index data structure stores a mapping from content, such as words or numbers, to its locations in a database, document, or set of documents (unlike a forward index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems employed on a large scale, for example in search engines.
In its basic form, an inverted index consists of postings lists, one associated with each term that appears in the collection. A postings list is comprised of individual postings, each of which consists of a document identification and payload information about occurrences of the term in the document. The simplest payload is nothing! For simple Boolean retrieval, no additional information is needed in the posting other than the document identification; the existence of the posting indicates the presence of the term in the document.
▪ Example 4.3 Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting comprises a document identification and a payload, denoted by p. An inverted index provides quick access to the document identifications that contain a term.
•term1 occurs in d1, d5, d6, d11, …,
•term2 occurs in d11, d23, d59, d84, …, and
•term3 occurs in d1, d4, d11, d19, …
In an actual implementation, we assume that documents can be identified by a unique integer ranging from 1 to n, where n is the total number of documents. Generally, postings are sorted by document identification.
terms | postings
term1 | d1 p, d5 p, d6 p, d11 p, …
term2 | d11 p, d23 p, d59 p, d84 p, …
term3 | d1 p, d4 p, d11 p, d19 p, …
▪
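A toy sketch of building such an index and answering a Boolean AND query with it, assuming documents are identified by integers and using within-document positions as the payload p; the tokenization (lowercased whitespace split) and all names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a simple inverted index: term -> postings list sorted by document
    identification, where each posting is (doc_id, [positions]) and the
    position list plays the role of the payload p."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(postings.items()) for term, postings in index.items()}

def boolean_and(index, t1, t2):
    """Documents containing both terms: intersect the two postings lists."""
    d1 = {doc for doc, _ in index.get(t1, [])}
    d2 = {doc for doc, _ in index.get(t2, [])}
    return sorted(d1 & d2)

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
}
index = build_inverted_index(docs)
print(index["home"])                        # [(1, [1]), (2, [0])]
print(boolean_and(index, "home", "sales"))  # [1, 2]
```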
The size of an inverted index varies, depending on the payload stored in each posting. If only term frequency is stored, a well-optimized inverted index can be a tenth of the size of the original document collection. An inverted index that stores positional information would easily be several times larger than one that does not. Generally, it is possible to hold the entire vocabulary (i.e., dictionary of all the terms) in memory, especially with techniques such as front-coding. However, with the exception of well-resourced, commercial web search engines, postings lists are usually too large to store in memory and must be held on disk, usually in compressed form; this involves random disk access and decoding of postings. One important aspect of the retrieval problem is to organize disk operations such that random seeks are minimized.
We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a “toy” indexer and one that works on real-world collections. Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk.
▪ Example 4.4 Consider the following scenario for a full text, Internet search engine.
•An estimated 2,000,000,000 different web pages existed as of the year 2000.
•Suppose there are 250 words on each webpage (based on the assumption that they are similar to the pages of a novel).
•It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character.
•The average number of characters in any given word on a page may be estimated at 5.
•The average personal computer comes with 100 to 250 gigabytes of usable space.
Given this scenario, an uncompressed index (assuming a non-conflated, simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. ▪
The point of using an index is to increase the speed and efficiency of searches of the document collection. Without some sort of index, a user's query must sequentially scan the complete document collection, finding those documents containing the search terms. Consider the “Find” operation in Windows: a user search is initiated and a search starts through each file on the hard disk; when a directory is encountered, the search continues through each directory. With only a few thousand files on a typical laptop, a typical find operation takes a minute or longer. Currently, a web search covers at least one billion documents, so a sequential scan is simply not feasible. Within the search engine domain, data are searched far more frequently than they are updated. An inverted index is able to do many accesses in O(1) time at the price of a significantly longer update time, in the worst case O(n), whereas other data structures require a minimum of O(log n) time to perform any operation; the best results were produced by balanced trees developed to improve database access. For many systems, the inverted index can be compressed to around 10% of the original document collection.
Exercise 4.1 Consider an algorithm that computes the connected components of an undirected graph G using a forest of trees and union-find: Start with a partition of the n vertices into a forest of n trees, each consisting of a single vertex. Then, for each edge (i, j) in the graph G, apply union(i, j). Prove that this algorithm is correct, in that it indeed computes the connected components of G. ▪
Exercise 4.2 Consider the set of all trees of height h that can be constructed by a sequence of union-by-height operations. How many such trees are there? ▪
Exercise 4.3 Consider an arbitrary sequence of m makeset operations, followed by u union operations, followed by f find operations, and let n = m + u + f. Prove that if we use union by rank and find with path compression, all n operations are executed in O(n) time. ▪
Exercise 4.4 Give an example of two binary heaps with n elements each such that build-heap takes Ω(n) time on the concatenation of their arrays. ▪
Exercise 4.5 Discuss the relationship between inserting into a binomial heap and incrementing a binary number and the relationship between uniting two binomial heaps and adding two binary numbers. ▪
Exercise 4.6 Show that if only the mergeable-heap operations are supported, the maximum degree D(n) in an n-node Fibonacci heap is at most ⌊log n⌋. ▪
Exercise 4.7 Given two strings S1 and S2, and a positive integer k, find the number of substrings of S1 of length at least k that occur in S2. Develop and analyze an algorithm to solve this problem in O(|S1| + |S2| + sort(Σ)) time. ▪
Exercise 4.8 Create a suffix tree for Hizbizbiz. ▪
Exercise 4.9 Create a trie, a compact trie and a Patricia for word searching from a file containing the text “Education news covers the latest national and international education news”. ▪
Exercise 4.10 Describe inverted index data structure for text searching from a collection of documents. Compute construction cost and search cost of an inverted index. Construct an inverted index for the following collection.
•Doc1: breakthrough drug for schizophrenia
•Doc2: new schizophrenia drug
•Doc3: new approaches for treatment of schizophrenia
•Doc4: new hopes for schizophrenia patients ▪