Evolving Paradigms of Data Structures
Part II presents data structures for various data domains in a problem-specific arrangement. Not all data structures work in all domains because data is domain-specific and query requirements differ. The chapters of Part II (5 through 10) describe spatial, temporal, external memory, distributed, and synopsis data structures. All six chapters conclude with exercises.
This chapter details evolving paradigms and discusses topics of current interest. It is relevant because the number, size, and complexity of data structures continue to increase. The chapter focuses on geometric queries, the complexities of input/output operations, and communication; the final section tackles large data problems.
Chapter 6 introduces data structures designed to handle spatial (multidimensional) data. It starts by explaining range queries and presents four popular solutions: range search trees, k-d trees, quadtrees, and R-trees. Theoretical bounds, examples, and construction details are explained.
Chapter 7 explores lesser-known temporal data structures: partially, fully, confluently, and functionally persistent types. Retroactivity issues are also covered. Chapter 8 describes external memory models designed to reduce the input/output operations required by queries. Cache-aware and cache-oblivious models are explained, as are (a,b)-trees, B-trees, B+-trees, and buffer trees. The chapter includes theorems and examples.
Chapter 9 focuses on distributed data structures and the difficulties of implementing them, then explains distributed hashing, distributed lists and trees, and skip graphs. Theoretical details and recent developments are described. Chapter 10 covers synopsis data structures designed to deal with large amounts of streaming data. It includes definitions and notations, with specific sections on sampling, sketching, fingerprints, and wavelets.
5.1 Geometric Queries
All the data structures we studied in previous chapters were developed to handle one-dimensional data points. However, data points are not always one-dimensional in practice. The ability to partition the search space is a common feature of most data structures: most are designed to answer point search queries in linear (one-dimensional) space, and the common partitioning strategy is to divide the line around a point into two halves. Instead of designing entirely new data structures for higher-dimensional spaces, we generalize this space-partitioning idea from the line to two or more dimensions. Note that curves and other geometric shapes can also be used for space partitioning, but they are difficult to work with.
Orthogonal range queries are the most basic geometric queries. They differ from point search queries and are helpful for developing range search solutions and more advanced databases. Let S ⊆ R^d, where |S| = n and R^d is the d-dimensional real space with a Cartesian coordinate system; i.e., S is a set of n points in d-dimensional space. Assume that Q is a query box (hyper-rectangle) in the same d-dimensional space whose sides are parallel to the axes of R^d. In the range search problem, we are supposed to return the points of S that lie inside Q, i.e., the set S ∩ Q.
▪ Example 5.1 Range searching arises in many applications.
• A university administrator may wish to know students whose ages are between 21 and 24 years and whose grade point averages are greater than 3.5.
• In a geographic database of cities one might seek a list of cities whose latitude lies between 37° and 41° and longitude between 102° and 109°.
• In data analysis it is often useful to do separate analyses on sets of data lying in different regions (ranges) of the observation space and then compare (or contrast) the respective results.
• In statistics, range searching can be employed to determine the empirical probability content of a hyperrectangle, to determine empirical cumulative distributions, and to perform density estimation.
▪
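To make the range search problem concrete, here is a minimal brute-force sketch in Python (the function and variable names are ours, not the book's). It answers an orthogonal range query by testing every point against the box, which costs O(nd) time per query with no preprocessing; the structures of Chapter 6 (range search trees, k-d trees, quadtrees, R-trees) exist precisely to beat this bound.

```python
# Brute-force orthogonal range search: the query box Q is given as
# per-dimension (low, high) intervals, and every point of S is tested.

def range_search(points, box):
    """Return all points p in S with box[i][0] <= p[i] <= box[i][1]
    in every dimension i. O(n * d) time, no preprocessing."""
    return [p for p in points
            if all(lo <= x <= hi for x, (lo, hi) in zip(p, box))]

# Example: 2-D points, query box [1, 4] x [2, 5].
S = [(1, 3), (2, 6), (4, 4), (5, 1)]
print(range_search(S, [(1, 4), (2, 5)]))   # [(1, 3), (4, 4)]
```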
5.2 I/O Complexities
The efficiencies of algorithms and data structures are generally measured as a number of operations, based on the assumption that the whole data structure is stored in main memory. But for large data structures not all of the data fits in memory; portions of the data structure must be stored in external memory, which is usually a disk. Accessing data on disk is far slower than accessing data in memory and is measured in units of time. Our objective is to minimize the number of disk transfers required to create, update, and query the data structure. Since accessing external memory is so much slower than accessing RAM, only the number of block transfers between internal and external memory is counted, and the computations performed within internal memory are assumed to be free. Conventional in-memory complexity measures remain meaningful, but disk access is slow enough to dominate the running time and make a program unusable. Analyzing the performance of an ADT by counting the number of disk transfers is called I/O complexity.
▪ Example 5.2 — System log file. A log file holds a tremendous number of system commands; every record stores detailed information about a program and its execution behavior. The huge number of records is arranged into a large file stored on disk, and disk writes are performed block by block. When a new record must be stored, the last used block is checked for empty slots; a new block is initiated if the last block is full. To access a stored record we must load its block into memory. When the records form an unordered list, O(n/B) blocks must be searched, where n is the number of records and B is the block size; the I/O complexity of a search is therefore O(n/B). Using an ordered list reduces the I/O complexity of searching to O(log(n/B)), but insertion becomes O(n/B). We can observe that sequential access is not a good option for data stored on disk. ▪
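The following toy model sketches the I/O counts of Example 5.2 (the code and its names are our own illustration, not from the text): records are grouped into blocks of B records, only block loads are counted, and in-memory work is treated as free.

```python
# Count block transfers for searching an unordered vs. a sorted list.

def blocks(records, B):
    """Split the record list into disk blocks of size B."""
    return [records[i:i + B] for i in range(0, len(records), B)]

def scan_search(blks, key):
    """Unordered list: load blocks left to right -> O(n/B) I/Os."""
    ios = 0
    for blk in blks:                 # each iteration = one disk transfer
        ios += 1
        if key in blk:
            return ios
    return ios

def binary_search_blocks(blks, key):
    """Sorted list: binary search over blocks -> O(log(n/B)) I/Os."""
    ios, lo, hi = 0, 0, len(blks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        blk = blks[mid]              # one disk transfer
        ios += 1
        if key < blk[0]:
            hi = mid - 1
        elif key > blk[-1]:
            lo = mid + 1
        else:
            return ios               # key, if present, is in this block
    return ios

n, B = 10_000, 100
data = list(range(n))                # already sorted for the second search
print(scan_search(blocks(data, B), 9_999))          # ~n/B  = 100 I/Os
print(binary_search_blocks(blocks(data, B), 9_999)) # ~log2(n/B) = 7 I/Os
```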
5.3 Communication Complexities
Definition 5.3.1 — Communication complexity. Communication complexity is a measure that quantifies the total amount of communication made by an algorithm executing on different systems over a network.
Communication complexity was introduced by Yao in 1979 (4), who described complexity as communication between multiple parties. Assume A and B are two parties; A possesses data D1 and B possesses data D2. Our task is to compute a function of both data sets, f(D1, D2). The computation can be done at the location of either A or B; we choose to compute at A. If B sends all its data, A can compute f(D1, D2) trivially, but the goal of communication complexity is to find an efficient solution: to compute the task with the least amount of communication between the parties.
▪ Example 5.3 — Simple statistics. Assume that we need to compute the average of a large collection D of n integers. The data resides with A and B: A possesses D1 consisting of n1 integers and B possesses D2 consisting of n2 integers, where n1 + n2 = n. Assume further that the task will be computed at A. A trivial solution sends all of B's data to A, requiring (size of D2)/(packet size) communications, and computes the average at A. A better solution chooses the computation point wisely, i.e., the party with less data sends. An even better solution has B send only the average and count of its integers to A, letting A compute the joint average. ▪
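A sketch of the cheapest protocol in Example 5.3 (names are illustrative): instead of shipping all of D2, B sends just the pair (average, count), and A combines it with its own local aggregate using the weighted mean, for O(1) communication.

```python
# Combine two partial averages into the exact joint average.

def combined_average(avg1, n1, avg2, n2):
    """Joint mean of two disjoint collections from their partial aggregates."""
    return (avg1 * n1 + avg2 * n2) / (n1 + n2)

D1 = [10, 20, 30]              # data held by A
msg = (sum([40, 50]) / 2, 2)   # B's single message: (average of D2, |D2|)
print(combined_average(sum(D1) / len(D1), len(D1), *msg))  # 30.0
```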
5.4 Large Data Problems
The rapid growth of data in various application domains demands more efficient data structures capable of processing petabytes of stored data and continuous sensor streams. Large volumes of data that reside on disk or arrive continuously over a network cannot be accessed randomly or in multiple passes. A data structure for such data must let the user scan the data, or part of it, only once through a small window. The main research challenge is to minimize access to the data while still allowing the data structure to answer the desired queries with some degree of guarantee. For static data S residing on disk and a class of queries Q, the goal is to develop a data structure for the query class Q that minimizes reporting time while maximizing the accuracy and confidence of the answers. In the dynamic case, data arriving online are stored on disk and then used to create and update the data structure.
To address the above problem we need data structures with small footprints, popularly known as synopsis data structures, which are substantially smaller than their base data sets. These are data structures for supporting queries to massive data sets while minimizing or avoiding disk access. They have the following advantages:
1. Fast processing: May reside in memory, providing fast processing of queries and updates.
2. Fast transfer: Resides on disk; can be swapped in and out of memory with minimal disk access.
3. Lower cost: Has minimal impact on the space requirements of the data set and its supporting data structures.
4. Small surrogate: Can serve as a surrogate for the data set when the data set itself is expensive or impossible to access.
Since synopsis data structures are too small to maintain a full characterization of their base data sets, they must summarize the data set, and the responses they provide to queries will typically be approximate.
▪ Example 5.4 An important application domain for synopsis data structures is approximate answering of ad hoc queries against large data warehouses. In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to complex decision support queries. The goal is to provide an estimated response in far less time than is required to compute an exact answer, by minimizing or eliminating accesses to the base data. ▪
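As a minimal illustration of a synopsis (our own sketch, not from the text), reservoir sampling keeps a uniform random sample of k items from a stream of unknown length in O(k) memory, so approximate queries such as an estimated average never touch the base data again. Sampling is one of the synopsis techniques covered in Chapter 10.

```python
import random

def reservoir_sample(stream, k):
    """Return k items chosen uniformly at random from the stream (one pass)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = random.randint(0, i) # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100)
print(sum(sample) / len(sample))     # fast, approximate average of the stream
```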
Exercise 5.1 How can you search for all the points lying inside a circle? ▪
Exercise 5.2 How can you retrieve a previous edit in an image? ▪
Exercise 5.3 How do you multiply two matrices, each of which is larger than the computer memory? ▪
Exercise 5.4 Describe the procedure for implementing a stack in a distributed environment. ▪
Exercise 5.5 How would you determine the most frequent caller on your phone since the day you purchased it? ▪