This book emphasises recent theoretical developments, evolving data structures for different paradigms of computation, and important applications to research domains of computer science. The book is divided into three parts. As an introduction, the first chapter presents the need for advanced data structures and some basic concepts pertaining to the amortized analysis of algorithms.
Part I details advancements on basic data structures, for example, cuckoo hashing, skiplists, tango trees, Fibonacci heaps, and index files. Part II details data structures for different evolving data domains: spatial, temporal, external-memory, distributed, and streaming data structures. Part III elucidates the applications of these data structures in different areas of computer science such as networks, the World Wide Web, database management systems (DBMSs), cryptography, and graphics, to name a few. The concepts and techniques behind each data structure and their applications are explained. Illustrative problems, review questions, and programming assignments are included to enable students to comprehend, implement, and appreciate advanced data structures. In addition to illustrative problems, each chapter includes a detailed summary of the technical content and a list of review questions to reinforce comprehension of the material.
Objective 1.1 — Advanced data structures. The rapid development of different domains of computer science (e.g. data science, the Internet of Things (IoT), artificial intelligence (AI), machine learning (ML), cloud computing, and others) poses new challenges to the field of data structures and algorithms. Traditional data structures were mainly developed to perform sequential point searches on RAM models, and assessment based on computational complexity alone is insufficient to meet the requirements of modern computer science. In this book we have incorporated important data-structural concepts to be taught as an advanced course on data structures intended to meet contemporary computational challenges.
We assume students understand basic data structures such as arrays, linked lists, stacks, queues, trees, and hashing, along with other relevant concepts of data structures and algorithms. That material is not repeated in this book, but its advancements are demonstrated along with the other advanced topics.
In this chapter, we present the need for advanced data structures and some basic concepts pertaining to the design of data structures, the amortized analysis of queries over data structures, and the use of data structures to solve computational problems. The last section of the chapter describes the organization of the book.
A data structure is a particular way of organizing data in a computer so that it can be accessed efficiently [1].
Definition 1.1.1 — Data structure. A data structure is an organization of data values, the relationships among them, and the functions to answer particular queries on that data.
A data structure can implement one or many abstract data types (ADTs), which specify the operations that can be performed on that data structure. Moreover, the computational complexities of those operations, as described in the ADT, are adopted by the data structure. In this sense, a data structure is a concrete implementation of the theoretical specification abstracted in an ADT.
Definition 1.1.2 — Abstract data type. An abstract data type (ADT) is a mathematical abstraction about data and its functions.
Different kinds of data structures are generally suited to different problems. A data structure implements an abstract data type and adopts its specification of operations and complexities. Application tasks are often solved using various types of data structures, some of which are highly specialized for specific tasks. Data structures provide mechanisms to organize and manage large amounts of data efficiently for many uses in most domains of computer science. More specifically, efficient data structures are the key to designing efficient algorithms, whatever the domain may be. Some formal design paradigms and programming languages put more emphasis on data structures than on algorithms, arguing that data structures are the key organizing factor in software design. Data structures describe the access patterns of data stored in both main and secondary memory. To implement a data structure, we usually need to write a set of procedures that access and organize data instances of that structure. The efficiency of a data structure can be analyzed by counting the number of times each particular operation is performed; this is the underlying theoretical concept of an abstract data type. A data structure can alternatively be defined by the design of its access mechanisms and operation sequences.
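To make the distinction concrete, the following minimal sketch in Python (the language is chosen only for illustration; the class and method names are ours, not drawn from any standard library) uses an abstract class in the role of the ADT, fixing which operations exist, while the concrete class is a data structure implementing them on top of a dynamic array.

from abc import ABC, abstractmethod

class StackADT(ABC):
    """Abstract data type: specifies the operations, not their implementation."""

    @abstractmethod
    def push(self, item): ...

    @abstractmethod
    def pop(self): ...

    @abstractmethod
    def is_empty(self) -> bool: ...

class ArrayStack(StackADT):
    """A concrete data structure implementing the stack ADT on a dynamic array."""

    def __init__(self):
        self._items = []          # underlying storage

    def push(self, item):
        self._items.append(item)  # amortized O(1)

    def pop(self):
        if self.is_empty():
            raise IndexError("pop from empty stack")
        return self._items.pop()  # O(1)

    def is_empty(self) -> bool:
        return len(self._items) == 0

Any other implementation satisfying the same interface (for example, one based on a linked list) would realize the same ADT with possibly different costs.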
This section focuses on data structure design. Practitioners understand that data should be specified on two levels: the abstract, user-oriented information structure and the concrete, machine-oriented storage structure. The design methodology of a data structure is based on several views, such as data reality, data abstraction, information structure, storage structure, and machine encoding, to name a few. The design of a data structure should proceed through successive levels, specifying the important aspects within each level and binding the different aspects across levels. To match a data structure to an application algorithm, users must be able to consider all aspects of the data and recognize uses for commonly studied data structures. In order to do this, they must be aware of the various data structures which could be used to solve a problem, the particular processes required, the trade-offs that may occur in selecting one option over another, and how to make a reasonable choice given all of these considerations. In trying to get students to match a data structure to an application, some small activities can be used to provide a flavor of what is involved in data structure selection.
For example, within a programming assignment which requires that students keep track of passengers on a limited number of airline flights, one requirement is that passengers who wish to book a full flight be kept on a waiting list.
Another task that can be included in an assignment on binary trees is to design an algorithm to print the tree in graphic form.
Both of these activities require that a student recognize the particular properties of a problem and match the processing needs to a simple data structure.
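As a rough sketch of the first activity, the waiting list can be kept in a first-in, first-out queue; the capacity value and the class and method names below are illustrative assumptions, not part of the assignment statement.

from collections import deque

class Flight:
    """Keeps confirmed passengers up to a fixed capacity; others wait in FIFO order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.passengers = []        # confirmed bookings
        self.waiting = deque()      # waiting list (first come, first served)

    def book(self, name):
        if len(self.passengers) < self.capacity:
            self.passengers.append(name)
            return "confirmed"
        self.waiting.append(name)   # flight is full: join the waiting list
        return "wait-listed"

    def cancel(self, name):
        if name in self.passengers:
            self.passengers.remove(name)
            if self.waiting:        # promote the earliest waiting passenger
                self.passengers.append(self.waiting.popleft())

Recognizing that the waiting list is a queue (and not, say, a stack) is exactly the kind of matching of processing needs to a data structure that such activities are meant to exercise.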
1.3 Analysis of Data Structures
In this section, we briefly discuss the common practice and widely used notions for analyzing a data structure, which is designed to handle a particular type of data (generally known as the data domain) and to accept specific queries drawn from a domain of possible queries. Let D be the domain of data and Q the domain of possible queries. In the case of a static data structure, let f be a function defined as f: Q × D → A, i.e. a = f(x, y) is the answer to query x about data y. For example, if the function returns true or false, then A is Boolean, but it may be something else [2]. To understand the complexity of the process we need three more parameters s, b, and t, where s denotes the number of memory cells used, each cell containing b bits, and t is the number of cell accesses needed to answer a query on a random access machine. In general, our goal is to optimize s and t, while b is a predefined parameter of the model, usually set to O(log |Q|) or O(poly(log |Q|)). The most widely used model for proving lower bounds on data structures is the “cell-probe model”, introduced by Yao [3] before the introduction of the notion of communication complexity [4].
Minimizing the cost per query is usually trivial for static data structures: we can always pre-compute the answers to all queries and store each answer in a memory location corresponding to its query [5]. However, this solution is undesirable because it requires a large amount of preprocessing as well as a large amount of memory. Therefore, for static data structures, a good understanding of query time, preprocessing time, and space requirement is necessary to implement a solution.
For dynamic data structures, the problem of minimizing time per operation often becomes nontrivial, even without considering the cost of preprocessing or space. If we try to minimize query cost as in the static case by maintaining the answer to each query in memory, each update may change the answers to many queries and thus require many memory locations to be rewritten [6]. On the other hand, the cost of updates can be minimized by simply recording each update in a list without doing any processing, but then answering a query is very expensive because of the need to scan the entire history. The whole difficulty of dynamic data structures lies in balancing the number of memory locations that must be rewritten to record updates against the number of memory locations that must be accessed to perform queries.
The idea of amortized analysis is to compute the time complexity of a sequence of operations rather than of a single worst-case operation, whose cost may be very high but which occurs only rarely within the sequence. As an illustration, consider the monthly expenditure of a student. A worst-case style estimate would find the day with maximum expenditure (analogous to worst-case complexity) and multiply it by 30; this badly overestimates the monthly expense if the most expensive day is the one on which the student pays rent, since rent is paid only once a month. The actual daily expenditure is the total spending over the month divided by thirty, not the spending on the most expensive day. The same reasoning underlies the notion of amortized complexity [7,8].
Objective 1.2 — Amortized analysis. Amortized analysis is designed to bound the total complexity of a sequence of operations, which in practice is often overestimated as the worst-case running time of a single operation multiplied by the length of the sequence. In typical algorithms, certain operations may be very costly, whereas others are not. In amortized analysis we consider both the costly and the cheap operations together over a sequential run of the algorithm and compute their average, which often provides a tighter bound than the naive worst case.
The basic idea is that a worst-case operation generally changes the state of the data structure, and such costly operations occur infrequently; their cost can therefore be spread over the cheaper operations while still yielding a worst-case guarantee for the whole sequence. The three common methods of amortized analysis are:
1. Aggregate method: determine an upper bound T(n) on the total cost of a sequence of n operations; the amortized cost per operation is then T(n)/n.
2. Accounting method: assign each operation an individual amortized cost; cheap operations store credit, and the account is debited when a costly operation occurs. A suitable assignment of amortized costs ensures that the total credit never becomes negative, so the assigned costs bound the actual total cost averaged over the sequence of operations.
3. Potential method: similar to the accounting method, but the credit is associated with the data structure as a whole through a potential function; the amortized cost of an operation is its actual cost plus the change in potential it causes. The potential difference for an individual operation may be negative, but the total potential must remain non-negative.
▪ Example 1.1 — Amortized analysis of a dynamic array. A dynamic array grows in size as more elements are inserted; a typical example is the ArrayList in Java. Consider a dynamic array with a capacity of five elements; while there is spare capacity, an insertion takes constant time. Inserting a sixth element takes more time because the dynamic array must double its capacity, copy the old elements into the larger array, and then insert the new element. The next several insertions again take constant time, until the array fills up and another costly doubling is required. In general, for an array of current size n, insertions take constant time except for the one that triggers a doubling, which takes O(n) time; averaged over the whole sequence, the cost per insertion is constant. ▪
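The following sketch, written in Python purely for illustration (the class and counter names are ours), simulates this behavior by doubling the capacity whenever the array is full and counting how many element copies are made; dividing the total number of copies by the number of insertions exposes the amortized constant cost.

class DynamicArray:
    """Doubles capacity when full; counts element copies to expose the amortized cost."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.copies = 0                      # total elements moved during resizes

    def append(self, x):
        if self.size == self.capacity:       # costly case: double and copy everything
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):
                new_data[i] = self.data[i]
                self.copies += 1
            self.data = new_data
        self.data[self.size] = x             # cheap case: a single write
        self.size += 1

arr = DynamicArray()
n = 1000
for i in range(n):
    arr.append(i)
print(arr.copies / n)   # about one copy per insertion on average, i.e. amortized O(1)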
▪ Example 1.2 — Amortized analysis of red-black trees. A red-black tree is a very popular balanced binary search tree that uses a node-coloring convention and a set of rules to keep the structure balanced. During dynamic updates by insertions and deletions, the resulting tree may temporarily violate these rules, and rotations and re-colorings are applied, possibly propagating up the tree, to re-establish the balanced structure. In a red-black tree every operation costs O(log n) in the worst case. Consider inserting n more elements into a red-black tree of size n. A conventional calculation bounds the restructuring cost by O(n log n), but amortized analysis shows that only O(n) restructuring work is performed over any n consecutive insertions. The analysis takes into account the fact that expensive O(log n) restructurings are infrequent enough that their total is O(n), so the per-operation restructuring cost is only O(1). The analysis can be carried out by valuing black nodes with no red children as 0, black nodes with one red child as 1, and black nodes with two red children as 2, and using these values to define the potential. ▪
Computational complexity is examined in both concrete and abstract terms. Concrete analysis of computational limits is done using models that capture the trade-off between space and time. Although many computational models exist, we mention only the three that are most widely used for proving bounds on data structures.
The random access machine (RAM) is the most popular computation model. It assumes that all simple operations are equally costly and take one unit of time, that loops and subroutines are not simple operations, and that every memory access also takes one unit of time, which facilitates easy calculation for algorithm designers. In reality, however, the cost of an addition and the cost of a multiplication, which involves several additions, are not the same, and accesses to cache and to disk are not the same: cache access is far faster (possibly 1,000 times or more) than disk access [1]. Although the RAM model makes seemingly absurd assumptions, in practice it is most useful, probably because of its robustness. The model does not depend on the configuration of a particular machine and it clearly indicates how an algorithm will behave. It also allows algorithms designed to perform the same task to be compared, enabling users to determine which will perform better asymptotically.
The word RAM model is a modified abstraction of a random access machine with some additional capabilities. It handles words of up to w bits and can store integers of value up to 2^w − 1 in a single word; standard arithmetic, logical, and bitwise operations on whole words are assumed to take constant time, and w is usually assumed to be at least log n so that an index into the input fits in one word.
1.5.3 Cell-probe model of computation
The cell-probe model is a minor modification of the random access machine (itself a modification of the counter machine model) in which all operations except memory access are free. The model is generally used to prove lower bounds for data structure operations: computational cost is assigned only to accessing units of memory (cells). Computation is treated as a problem of querying a set of memory cells, and every problem has two phases, preprocessing and querying. In the preprocessing phase, the input data set is stored as a structure of memory cells. In the query phase, the input is a query element, and the structure is probed to determine the answer, for instance whether the query element is contained in the stored set. All operations other than memory cell accesses are free [3].
This model is useful in the analysis of lower bounds for data structures, since it isolates the minimum number of accesses to the stored data needed to answer the queries.
▪ Example 1.3 — Dynamic partial sums. The dynamic partial sums problem maintains an array A[1..n] and defines two operations: Update(i, v), which changes the value of A[i] to v, and Sum(i), which returns the prefix sum A[1] + A[2] + ... + A[i]. If the values are stored as the leaves of a balanced binary tree and each inner node stores the sum of the subtree rooted at that node, then Update needs to refresh only the nodes on one root-to-leaf path, and Sum can likewise be answered by visiting a single root-to-leaf path, so both operations take O(log n) time. ▪
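A minimal sketch of such a tree, assuming zero-based indices and a capacity padded to a power of two (the class and method names are illustrative, not a standard API), stores the values at the leaves and subtree sums at the inner nodes; both operations walk a single root-to-leaf path.

class PartialSums:
    """Tree of subtree sums: point update and prefix-sum query in O(log n)."""

    def __init__(self, n):
        size = 1
        while size < n:
            size *= 2
        self.size = size
        self.tree = [0] * (2 * size)         # tree[1] is the root; leaves start at index size

    def update(self, i, value):
        """Set the i-th value (0-based) to `value`."""
        pos = self.size + i
        self.tree[pos] = value
        pos //= 2
        while pos >= 1:                      # walk up, refreshing subtree sums
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def prefix_sum(self, i):
        """Return the sum of the values at positions 0..i."""
        lo, hi = self.size, self.size + i
        total = 0
        while lo <= hi:
            if lo % 2 == 1:                  # lo is a right child: take it and move right
                total += self.tree[lo]
                lo += 1
            if hi % 2 == 0:                  # hi is a left child: take it and move left
                total += self.tree[hi]
                hi -= 1
            lo //= 2
            hi //= 2
        return total

For instance, with n = 8, after update(2, 5) and update(6, 3), prefix_sum(6) returns 8, and each call touches only O(log n) tree nodes.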
1.6 Bounds of Fundamental Data Structures
In this section, we discuss fundamental data structures such as arrays, lists, trees, and queues, explain how they work, and summarize the achievable bounds. The most common structures used for searching, unordered sequential lists of items, are usually easy to use but less efficient: finding a query item in such a list requires a number of operations proportional to the number of items, in both the worst and the average case, using linear search. Data structures designed specifically for searching provide faster retrieval, but generally at increased overhead, because their creation and maintenance are costly; the cost of building such a search structure is at least proportional to the number of elements and often exceeds it. Static search data structures are built once to answer many queries on a fixed data set, whereas dynamic structures additionally support insertion, deletion, and modification of data items alongside query retrieval. For a dynamic data structure, the cost of maintaining the search structure under updates must also be accounted for.
An approximate summary of the basic data structures developed to perform specific queries appears in Table 1.1. There are special situations and variants of these data structures that involve further costs, and it is also possible to combine data structures to obtain lower costs [12,13].
Table 1.1: Comparison of fundamental data structures
An alternative to the standard deletion strategy is known as lazy deletion. When deleting elements from a singly linked list, we delete them logically but not physically, by marking each node as deleted (using a Boolean flag). The numbers of deleted and non-deleted elements in the list are kept as part of the list. If at some point the number of deleted elements equals the number of non-deleted elements, we traverse the list and physically remove all “lazily deleted” nodes. The technique has advantages and disadvantages. The main advantage is that marking a node as deleted takes less time than changing pointers. However, a lazily deleted node must still be traversed while searching for other nodes, and removing all lazily deleted elements at once requires a full traversal of the list [14].
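A minimal sketch of lazy deletion on a singly linked list, with the purge triggered when the deleted elements become as numerous as the live ones as described above (class and method names are our own), might look as follows.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.deleted = False          # lazy-deletion flag

class LazyList:
    def __init__(self):
        self.head = None
        self.live = 0
        self.dead = 0

    def insert(self, value):
        node = Node(value)
        node.next = self.head
        self.head = node
        self.live += 1

    def delete(self, value):
        node = self.head
        while node:
            if node.value == value and not node.deleted:
                node.deleted = True            # logical delete: no pointer changes
                self.live -= 1
                self.dead += 1
                break
            node = node.next
        if self.dead >= self.live:             # too many tombstones: purge them all
            self._purge()

    def _purge(self):
        dummy = Node(None)
        dummy.next = self.head
        prev, node = dummy, self.head
        while node:
            if node.deleted:
                prev.next = node.next          # physical removal
            else:
                prev = node
            node = node.next
        self.head = dummy.next
        self.dead = 0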
▪ Example 1.4 — Hashing with lazy deletion. In hashing we index the elements of a set or collection into a table of larger size using a function that maps keys into the table range, which is subject to collisions. In collision resolution by probing, a colliding element is placed in some other empty slot chosen by a predefined probing strategy. If we then physically remove an element that was inserted early, and later elements were shifted further along the probe sequence because of collisions with it, those later elements become inaccessible, since a search stops at the now-empty slot. A lazy delete is a good temporary solution to this problem without the burden of re-indexing, although it uses more memory: a delete flag ensures that the search process continues past deleted slots and answers queries correctly. ▪
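A minimal sketch of this idea, assuming linear probing and a fixed table size (all names are illustrative, not a standard API), uses a DELETED marker so that searches continue past removed slots; without the marker, contains() would stop at the emptied slot and miss keys placed beyond it.

EMPTY, DELETED = object(), object()   # sentinel markers for unused and tombstoned slots

class LazyHashTable:
    """Open addressing with linear probing and lazy deletion (tombstones)."""

    def __init__(self, size=11):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) % len(self.slots)   # linear probing

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key         # reuse empty or tombstoned slots
                return

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:      # a truly empty slot ends the probe sequence
                return False
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return True
        return False

    def delete(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return
            if self.slots[i] is not DELETED and self.slots[i] == key:
                self.slots[i] = DELETED     # tombstone keeps later keys reachable
                return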
Part I of this book (Chapters 1 through 4) discusses theoretical advancements in the basic data structure field, specifically in the areas of hashing, trees and lists. This chapter introduced the uses of advanced data structures, basic design concepts, amortized analysis of queries, and capabilities of data structures to solve computational problems.
Advanced hashing techniques are covered in Chapter 2. The chapter first covers definitions and concepts, then proceeds to discuss collisions, the birthday paradox, load factors, chaining, and probing. It explains perfect hashing, universal hashing, cuckoo hashing, Bloom filters, and locality-sensitive hashing. Most of these topics were developed recently and are not covered in well-known texts. Bipartite graph analysis of cuckoo hashing and the probability of false positives in Bloom filters are also included.
Chapter 3 discusses balanced binary search trees (BSTs) in depth. The chapter starts with a brief review and progresses to discussions of reducing the log(n) lower bound, splay trees, tango trees, and skiplists. The final section covers static and dynamic optimality.
Chapter 4 explains some important data structures designed to answer complex point search queries. Disjoint sets and binomial heaps, both of which play important roles in major network algorithms, are detailed, along with Fibonacci heaps. Tries and inverted indices that handle keyword queries are detailed in the final section.
Exercise 1.1 What is the difference between an array and a linked list? When might you use either data structure? ▪
Exercise 1.2 There is an array of numbers with all the elements appearing twice, and one element appearing once. How can you remove that element efficiently? ▪
Exercise 1.3 Explain how to determine whether a given binary tree is or is not a BST. ▪
Exercise 1.4 Find the k closest points to a target. ▪
Exercise 1.5 Given a binary tree, find the greatest possible sum over all of its subtrees. ▪
Exercise 1.6 Remove duplicates from an unsorted linked list. ▪
Exercise 1.7 Find the k-th largest element in an array. ▪
Exercise 1.8 During the execution of a task sequence, assume that the i-th task takes O(1) time when i is not a power of two and takes i time otherwise. Show that the amortized complexity of each task is O(1). ▪
Exercise 1.9 Build a heap from an unsorted array of n elements and analyze the time complexity of the construction. ▪