Distributed Data Structures (DDSs)
In this chapter, different types of distributed data structures are presented. Distributed architecture and the difficulties of implementation in that environment are described in the first section. Later sections discuss distributed hashing, lists, trees, and skiplists. The chapter also covers theoretical details, current state of the art, and open questions.
The evolution of the Internet and other advanced technologies led to the development of network computing. In this framework, powerful, low-cost workstations connected by fast networks collectively offer terabytes of memory, petabytes of disk space, and large pools of processing units. Every node in such a network is either a client or a server, depending on whether it accesses or manages data.
Every server provides storage space in the form of buckets, to keep a portion of the file under maintenance. Servers communicate by sending and receiving point-to-point messages. The network performing the communications is assumed to be error-free. The efficiency of the system is judged by the number of messages communicated by the servers regardless of message length or network topology.
The algorithms and data structures operating in this setting are called distributed algorithms and distributed data structures, respectively. Distributed data structures are designed and implemented so that new servers can be added easily to balance the load on any particular server, and their access and maintenance operations must never require atomic updates to multiple machines. A data structure that meets these constraints is generally known as a Distributed Data Structure (DDS).
9.1 Descriptions of Structures
A DDS is a self-managing storage layer developed to run on a collection of workstations interconnected by an underlying network. A DDS is designed to provide high throughput, high concurrency, availability, incremental scalability, and data consistency. Users see the interface of a distributed data structure as that of a conventional data structure, such as a hash table, a tree, or a list. The DDS interface hides all of the mechanisms used to access, partition, replicate, scale, and recover data from the user, whose only concern is a consistent service meeting his or her specific requirements; the user expects all management difficulties to be handled behind the DDS interface. Databases and file systems have provided managed storage layers and other durable components for many years. The advantage of a DDS is the level of abstraction it provides to users: a DDS handles access behavior (concurrency and throughput demands) and other requirements based on its expected runtime environment and the types of failures it is able to recover from.
Objective 9.1 A distributed data structure provides better service to applications delivered over the Internet by means of a new persistent storage layer that is durable, available, consistent, and independent of service constraints. A DDS automates replication, partitions data, and distributes it over servers so as to ensure high availability and automatic recovery.
Strict consistency of DDS: All operations on the elements of a DDS are atomic; they either complete or are rejected. DDS data elements are replicated among servers, yet a client sees a single copy of each logical data item. A two-phase procedure keeps the replicas coherent, providing a single-copy interface to clients. A DDS does not typically support multi-element access or multi-operation transactions; its protocol may deliberately deny such access to preserve efficiency and simplicity. Atomic single-element updates together with replica coherence are sufficient to support a wide range of applications.
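As an illustration, the following minimal Python sketch shows a two-phase, single-element update over a set of replicas in the spirit described above; the Replica class and the atomic_put function are hypothetical, and failures and timeouts are ignored.

# Minimal sketch of a two-phase, single-element update over replicas.
# The Replica class and atomic_put are illustrative only.

class Replica:
    def __init__(self):
        self.store = {}       # committed (key, value) pairs
        self.pending = {}     # values prepared (locked) but not yet committed

    def prepare(self, key, value):
        if key in self.pending:       # element already locked by another update
            return False
        self.pending[key] = value
        return True

    def commit(self, key):
        self.store[key] = self.pending.pop(key)

    def abort(self, key):
        self.pending.pop(key, None)


def atomic_put(replicas, key, value):
    """Phase 1: prepare on every replica; phase 2: commit everywhere or abort."""
    prepared = [r for r in replicas if r.prepare(key, value)]
    if len(prepared) == len(replicas):
        for r in replicas:
            r.commit(key)
        return True                   # the update completed on every replica
    for r in prepared:
        r.abort(key)                  # roll back: no replica exposes the new value
    return False                      # the update is rejected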
Fault tolerance of DDS: A distributed system should remain operational even if an individual server fails. In a large distributed system server failures are expected from time to time, yet a DDS must tolerate limited hardware failures, support access to data elements at all times, and provide data availability, generally by using efficient data replication. A server must immediately detect an operational failure or crash of a connected server; this is a reasonable assumption for local networks, but it is unrealistic for globally distributed databases.
9.2 Distributed Hashing
A distributed hash table (DHT) is a class of decentralized distributed system that provides a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures. DHTs form an infrastructure that can be used to build more complex services, such as cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems. Notable distributed networks that use DHTs include BitTorrent’s distributed tracker, the Coral Content Distribution Network, the Kad network, the Storm botnet, the Tox instant messenger, Freenet and the YaCy search engine.
9.2.1 Structure of distributed hashing
The structure of a DHT can be decomposed into several main components (Figure 9.1). The foundation is an abstract keyspace, such as the set of 160-bit strings. A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes. An overlay network then connects the nodes, allowing them to find the owner of any given key in the keyspace. The keyspace partitioning and overlay network components are described below with the goal of capturing the principal ideas common to most DHTs; many designs differ in the details.
Figure 9.1: Distributed hashing
9.2.1.1 Keyspace partitioning
Most DHTs use some variant of consistent hashing or rendezvous hashing to map keys to nodes. The two algorithms appear to have been devised independently and simultaneously to solve the distributed hash table problem. Both consistent hashing and rendezvous hashing have the essential property that removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected. Contrast this with a traditional hash table in which addition or removal of one bucket causes nearly the entire keyspace to be remapped.
9.2.1.2 Consistent hashing
Consistent hashing employs a distance function δ(k1, k2) between two keys k1 and k2, which is independent of geographical distance or network latency. Each node is assigned a single key called its identifier. A node with identifier i owns all the keys for which i is the closest identifier, measured according to the distance function. For example, the Chord DHT uses consistent hashing in which keys are treated as points on a circle and δ(k1, k2) is the distance travelled clockwise around the circle from k1 to k2. Thus, the circular keyspace is split into contiguous segments whose endpoints are the node identifiers.
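As an illustration, the following minimal Python sketch maps node identifiers and keys onto a circular 160-bit keyspace using SHA-1 and assigns each key to the first node encountered clockwise from its hash; the class and server names are illustrative only.

# Consistent hashing sketch: node identifiers and keys share a circular
# 160-bit keyspace; a key is owned by the first node clockwise from its hash.
import hashlib
from bisect import bisect_right

def h(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)   # (identifier, node) pairs

    def owner(self, key):
        ids = [node_id for node_id, _ in self.ring]
        i = bisect_right(ids, h(key)) % len(self.ring) # wrap around the circle
        return self.ring[i][1]

ring = ConsistentHashRing(["serverA", "serverB", "serverC"])
print(ring.owner("my-file.txt"))
# If a node joins or leaves, only the keys in the segment adjacent to it move.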
9.2.1.3 Rendezvous hashing
Rendezvous hashing is a distributed technique in which all clients use a common predefined hash function to map a key to one server chosen from a common list of server identifiers; typically, the server maximizing the hash value of the (key, server) pair is selected, so every client independently arrives at the same choice.
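A minimal Python sketch of rendezvous (highest-random-weight) hashing follows; all names are illustrative. Every client hashes the (key, server) pair for each server in the shared list and picks the maximum, so all clients independently agree on the same server.

# Rendezvous hashing sketch: the chosen server is the one with the highest
# hash weight for the given key, computed from a common list of identifiers.
import hashlib

def weight(key, server):
    return int(hashlib.sha1(f"{key}:{server}".encode()).hexdigest(), 16)

def rendezvous_owner(key, servers):
    return max(servers, key=lambda s: weight(key, s))

servers = ["s1", "s2", "s3", "s4"]
print(rendezvous_owner("my-file.txt", servers))
# Removing one server reassigns only the keys that server owned.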
9.2.1.4 Overlay network
Each node in the overlay network (e.g., a peer-to-peer network) maintains a set of links to its neighbors, the nodes in its routing table. The network formed by these links is known as an overlay network. The DHT determines the topological structure: a node either owns the key k or has a link to a node that is closer to k. Data sharing in the overlay network then becomes easy via the DHT, since at every step a message is forwarded to the neighbor that is closest to the key k. When the message arrives at a node that has no neighbor closer to k, that node owns the key. This approach is known as the greedy approach, or key-based routing, using the DHT.
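The following minimal Python sketch illustrates this greedy, key-based routing; the Node class and its in-memory neighbor links are illustrative, and the point-to-point messages of a real DHT are replaced by recursive calls.

# Greedy key-based routing sketch: each node forwards a lookup to the neighbor
# closest to the target key and answers itself when no neighbor is closer.

def clockwise(a, b, space=2**160):
    return (b - a) % space            # clockwise distance from a to b on the circle

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []           # overlay links (routing-table entries)

    def lookup(self, key):
        # among this node and its neighbors, find the node closest to the key
        best = min([self] + self.neighbors,
                   key=lambda n: clockwise(key, n.node_id))
        if best is self:              # no neighbor is closer: this node owns the key
            return self
        return best.lookup(key)       # one greedy forwarding hop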
9.2.1.5 DHT implementation
Notable differences encountered in practical instances of DHT implementations include the following:
•The address space is a parameter of the DHT; various real-world DHTs use either a 128-bit or a 160-bit keyspace. Some real-world DHTs use hash functions other than SHA-1. In the real world, a DHT key is typically the hash of the file's content rather than of its name, so renaming the file does not affect the search operation.
•Some real-world DHTs may also publish objects of different types. For example, the key could be a node ID and the associated data could describe how to contact this node. This allows publication of presence information and is often used in instant messaging (IM) technology. In the simplest case, the node ID is just a random number that is directly used as a key (so in a 128-bit DHT, the node ID will be a 128-bit number chosen at random).
•To improve reliability, redundancy can be added to the DHT: the (key, value) pair can be stored on more than one node responsible for the key. Usually, rather than selecting just one node, real-world DHT algorithms select the n most suitable nodes, with n being an implementation-specific parameter of the DHT (a sketch of this replication scheme is given after this list).
•Some advanced DHTs like Kademlia perform iterative lookups through the DHT first in order to select a set of suitable nodes and send key and data messages only to those nodes. This drastically reduces useless traffic since published messages are sent only to nodes that seem suitable for storing the key. Furthermore, iterative lookups cover only a small set of nodes rather than the entire DHT, thus reducing useless forwarding.
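As an illustration of the redundancy mentioned in the list above, the following minimal Python sketch stores a (key, value) pair on the n nodes whose identifiers are closest to the key under a Kademlia-style XOR distance; all names are illustrative and the node stores are plain in-memory dictionaries.

# Redundancy sketch: the (key, value) pair is published to the n closest nodes.
import hashlib

def h(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

def closest_nodes(key, node_ids, n=3):
    # Kademlia-style XOR distance between the key's hash and each node id
    return sorted(node_ids, key=lambda i: i ^ h(key))[:n]

def replicated_store(key, value, nodes, n=3):
    """nodes: dict mapping node id -> that node's local (key, value) store."""
    for node_id in closest_nodes(key, nodes.keys(), n):
        nodes[node_id][h(key)] = value    # publish only to the n suitable nodes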
9.2.1.6 Security in DHT
There are three main security issues related to DHTs mentioned in the literature: Sybil attacks, Eclipse attacks, and routing and storage attacks. In a Sybil attack, an attacker generates a large number of nodes in the network in order to subvert reputation systems or mechanisms based on redundancy.
An Eclipse attack is based on poisoning the routing tables of honest nodes. Since many nodes join and leave the DHT all the time, nodes need to actively update and synchronize their routing tables with their neighbors in order to keep the lookup system functional.
In a routing and storage attack, a single node does not follow the protocol: instead of forwarding lookup requests, it may drop the messages or pretend that it is responsible for the key, and it may then provide corrupted or malicious data, such as viruses or Trojan horses, as a response. Sybil and Eclipse attacks, in contrast, do not directly break the DHT or damage the other peers.
Security Mechanisms: Some practical examples of security solutions in DHTs are listed below.
1. Sybil attacks: As a defense against Sybil attacks, there are several different approaches. Borisov [202] proposes a challenge-response protocol based on computational puzzles. The idea is that every node should periodically send computational puzzles to its neighbors. Solving a puzzle proves that the node is honest and trustworthy, but it also requires CPU cycles. The goal is to make organizing Sybil attacks more difficult: running one peer client does not require much CPU power, but running thousands of active virtual nodes is computationally infeasible.
2. Eclipse attacks: An obvious way to shield against Eclipse attacks is to add some redundancy in routing. This approach is utilized by Castro [203], who proposed two routing tables: an optimized routing table and a verified routing table.
3. Routing and storage attacks: Ganesh and Zhao [204] proposed having nodes sign proof-of-life certificates that are distributed to randomly chosen proof managers. A node making a lookup request can request the certificates from the managers and detect possibly malicious nodes.
9.3 Distributed Binary Search Trees
This section discusses a distributed solution for the searching problem: the optimal binary search tree (BST) is presented and analysed. Implemented as a VLSI array, the algorithm for building the optimal BST uses O(n²) processors and has parallel time complexity O(n); a search is solved in O(log n) time. Every site in a network is either a server managing data or a client requesting access to data. Every server provides a storage space of b data elements (a bucket) to accommodate a part of the file under maintenance. Sites communicate by sending and receiving point-to-point messages. The distributed algorithms and data structures in such an environment must be designed and implemented so that (a) they expand to new servers gracefully, and only when the servers already in use are efficiently loaded, and (b) their access and maintenance operations never require atomic updates to multiple clients. If all the hypotheses used to efficiently manage search structures in the single-processor case are carried over to a distributed environment, then a tight lower bound of Ω(n) holds for the height of balanced search trees.
9.3.1 Construction of distributed BST
Let T be any binary search tree with n leaves and n − 1 internal nodes. Let f1, f2, ..., fn be the leaves and t1, t2, ..., tn−1 be the internal nodes. A bucket capacity of b storage elements is associated with each leaf. Let s1, s2, ..., sn be the n servers managing the search tree. We define a leaf association as a pair (f, s), where s is the server managing leaf f, and a node association as a pair (t, s), where s is the server managing internal node t. Equivalently, we denote the two associations by the functions
1) t(sj) = ti, where (ti, sj) is a node association,
2) f(sj) = fi, where (fi, sj) is a leaf association.
In a search tree, every node represents an interval of the data domain. Moreover, the overall data organization satisfies the invariant that the interval managed by a child node lies inside its father node’s interval; hence the search process visits a child node only if the searched key lies inside the father node’s interval. It is not possible, in the distributed case, to directly make use of rotations for balancing a distributed search tree while guaranteeing the straight guiding property: a lower bound of Ω(n) holds for the height of balanced search trees if the straight guiding property has to be satisfied. The relaxed balanced search tree (RBST), by accepting a violation of the straight guiding property, keeps the tree height logarithmic. All update operations have logarithmic cost, but the upper bound on the complexity of the search process is O(log² n).
The distributed data structure we focus on is a binary search tree where data are stored in the leaves and internal nodes contain only routing information. Every node has either zero or two children. For a binary search tree T we denote by h(T) the height of T, that is, the number of internal nodes on a longest path from the root to a leaf. Every server s but one, with node association (t, s) and leaf association (f, s), records at least the following information:
1.An internal node t = t(s) and the associated interval I(t) of the key domain.
2.The server p(s) managing the parent node pn(t) of t, if t is not the root node.
3.The server l(s) (resp., r(s)) managing the left child ls(t) (resp., right child rs(t)) of t, and the associated interval I(ls(t)) (resp., I(rs(t))).
4.A leaf f = f(s) and the associated interval I(f) of the key domain.
5.The server pf(s) managing the father node pn(f) of f, if f is not the unique node of the global tree (the initial situation).
This information constitutes the local tree lt(s) of server s. Since in a global tree with n leaves there are n − 1 internal nodes, there is exactly one server s′ managing only a leaf association; hence lt(s′) is made up of only the last two pieces of information in the above list. We say a server s is pertinent for a key k if s manages the bucket to which k belongs, in our case if k ∈ I(f(s)). Moreover, we say a server s is logically pertinent for a key k if k is in the key interval of the internal node associated to s, that is, if k ∈ I(t(s)).
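As an illustration, the local tree lt(s) described by the five items above can be sketched in Python as follows; the Interval and LocalTree names are illustrative, and intervals are represented as half-open numeric ranges for simplicity.

# Sketch of the local tree lt(s) held by a server s.
from dataclasses import dataclass, field

@dataclass
class Interval:
    low: float
    high: float
    def contains(self, key):
        return self.low <= key < self.high

@dataclass
class LocalTree:
    node_interval: Interval = None    # I(t(s)); None for the server s' with no internal node
    parent: str = None                # p(s): server managing the parent node pn(t)
    left_child: str = None            # l(s): server managing ls(t)
    right_child: str = None           # r(s): server managing rs(t)
    leaf_interval: Interval = None    # I(f(s))
    leaf_parent: str = None           # pf(s): server managing the father node pn(f)
    bucket: list = field(default_factory=list)   # at most b keys

    def pertinent(self, key):             # s manages the bucket to which key belongs
        return self.leaf_interval.contains(key)

    def logically_pertinent(self, key):   # key lies in the interval of t(s)
        return self.node_interval is not None and self.node_interval.contains(key)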
Step 1: Insert: We search for the leaf where the new key has to be inserted and insert it there. We assume that this insertion generates an overflow, that is, the inserted key is the (b + 1)-th key assigned to that bucket.
Step 2: Manage the overflow: Leaf f, managed by server s, goes into overflow. In this case we have to decide whether f has to be split or whether it is possible to transfer its keys to adjacent nodes; details of this decision have been discussed in the previous section. Assume the decision was to split the node. Then s must perform a function called split: leaf f splits into two new leaves f1 and f2, and a new internal node t + 1 replaces f in the tree. A new server s + 1 is called in to manage the new internal node and one of the new leaves, while server s releases the old leaf f and manages the other new leaf. In summary, we delete the leaf association (f, s) and add two leaf associations (f1, s) and (f2, s + 1) and one node association (t + 1, s + 1) (Figure 9.2). The old interval I(f) is divided into the new intervals I(f1) and I(f2), such that I(f1) ∪ I(f2) = I(f).
Step 3: Balance the distributed BST by starting the balancing function from t + 1.
Figure 9.2: Distributed BST
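The split of Step 2 can be sketched in Python as follows, reusing the Interval and LocalTree types from the sketch above; the recruit_server callback, which stands for calling in the new server s + 1, is hypothetical, and the message informing the father of f that its child is now t + 1 is only indicated by a comment.

# Sketch of the split performed on overflow: leaf f splits into f1 and f2,
# a new internal node t+1 replaces f, and a new server manages t+1 and f2.

def split(server, local, recruit_server):
    """local: LocalTree of `server`, whose leaf bucket has overflowed (b + 1 keys)."""
    keys = sorted(local.bucket)
    mid = keys[len(keys) // 2]                    # split point of the interval I(f)
    low, high = local.leaf_interval.low, local.leaf_interval.high
    old_parent = local.leaf_parent                # server managing the old father of f

    new_server = recruit_server()                 # server s+1 joins the structure

    # Server s keeps its node association (if any) and takes leaf f1 = [low, mid).
    local.leaf_interval = Interval(low, mid)
    local.leaf_parent = new_server                # the father of f1 is the new node t+1
    local.bucket = [k for k in keys if k < mid]

    # Server s+1 manages the new internal node t+1 (covering I(f)) and leaf f2.
    new_local = LocalTree(node_interval=Interval(low, high),
                          parent=old_parent,
                          left_child=server, right_child=new_server,
                          leaf_interval=Interval(mid, high),
                          leaf_parent=new_server,
                          bucket=[k for k in keys if k >= mid])
    # (old_parent must also be informed that its child f has been replaced by t+1.)
    return new_server, new_local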
The delete operation proceeds analogously. Step 1: Delete: We search for the leaf from which the key has to be deleted and delete it. We assume that this generates an underflow, that is, by deleting that key the bucket has fewer than two keys.
Step 2: Manage the underflow: The leaf f, managed by server s, goes into underflow. In this case we have to decide whether s has to be released or whether it is possible to transfer keys from the adjacent leaves without releasing s. Details on this aspect have been discussed in Section 9.3.4. Assume the decision was to release s. Then s performs a function called merge. If f is the root, the distributed BST is composed of one node and no action is performed. If the distributed BST is composed of the root r and two leaves f and x, there are only two servers, s and s′; then s is released and, after the communication to s′ and the deletion of r, x becomes the root and all the keys of f are sent to x. In the general case, f is the leaf shown in Figure 9.2 (the case with f as left child is analogous). We denote by a the server managing the sibling node t(a) of f, by b the server such that t(b) is the father node of f(s), and by c the server such that t(c) is the father node of t(b). Note that t(a) can be a leaf or an internal node. The steps to perform are
1.Release server s and delete leaf f = f (s).
2.Since node t(b) now has only one child, delete t(b) and replace it with t(a) as the child of t(c).
3.If s managed an internal node t = t (s), then from now on t is managed by server b (note that b has just released its internal node t (b)).
Step 3: Balance the distributed BST by starting the balancing function at t(c).
Rotations in a distributed environment are performed via message exchanges between servers. Since we are in a concurrent setting, in the sense that various clients independently manipulate the structure, each rotation must be preceded by locking the servers involved. We can show that the cost of one rotation is constant; hence, if a balancing strategy uses a logarithmic number of rotations per operation, the overall cost is kept logarithmic. Suppose that node a must rotate with node b.
1.a sends messages to (client) nodes A, B and to (server) node b, to notify that a lock must be created. After having received these messages, nodes A, B, and b stop routing messages towards a and send a lock acknowledgement to a.
2.b sends messages to (client) node C and to (server) node c, to notify that a lock must be created and that acknowledgement must be sent to a. After this message, nodes C and c stop routing messages towards b.
3.Every server answers to a (see second sketch of Figure 9.2) to acknowledge the lock state.
4.a notifies all servers involved in the rotation of the modifications that are needed and, after all servers have confirmed, a releases all locks (Figure 9.2, third sketch).
5.When the locks are released the situation is as shown in the rightmost sketch of Figure 9.2 and all servers resume routing messages.
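The five-step lock protocol above can be sketched in Python as follows, with point-to-point messages replaced by direct method calls for brevity; the RotNode class and the apply_rotation callback are illustrative only.

# Sketch of the rotation lock protocol: a locks A, B and b; b locks C and c on
# behalf of a; once all acknowledgements reach a, the rotation is applied and
# the locks are released so routing can resume.

class RotNode:
    def __init__(self, name):
        self.name = name
        self.routing_paused_for = set()   # nodes towards which routing is stopped

    def lock_request(self, target):
        self.routing_paused_for.add(target)   # stop routing messages towards target
        return ("ack", self.name)             # lock acknowledgement sent back to a

    def unlock(self):
        self.routing_paused_for.clear()       # resume routing

def rotate(a, b, c, A, B, C, apply_rotation):
    acks = [n.lock_request(a) for n in (A, B, b)]    # step 1: a locks A, B and b
    acks += [n.lock_request(b) for n in (C, c)]      # step 2: b locks C and c for a
    assert len(acks) == 5                            # step 3: a has every acknowledgement
    apply_rotation(a, b, c, A, B, C)                 # step 4: a announces the modifications
    for n in (A, B, b, C, c):                        # step 5: release locks, resume routing
        n.unlock()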
9.4 Distributed Skip Graphs
Skip graphs are a novel distributed data structure, based on skiplists, that provides the full functionality of a balanced tree in a distributed system where resources are stored in separate nodes that may fail at any time. Skip graphs are designed for use in searching peer-to-peer systems. By providing the ability to perform queries based on key ordering, they improve on existing search tools that provide only hash table functionality. Unlike skiplists or other tree data structures, skip graphs are highly resilient, tolerating a large fraction of failed nodes without losing connectivity. In addition, simple algorithms can be used to construct a skip graph, insert new nodes into it, search it, and detect and repair errors caused by node failures.
As in a skiplist, each of the n nodes in a skip graph (Figure 9.3) is a member of multiple linked lists. The level 0 list consists of all nodes in sequence. Where a skip graph is distinguished from a skiplist is that there may be many lists at level i, and every node participates in one of these lists, until the nodes are splintered into singletons after O(log n) levels on average. A skip graph supports search, insert, and delete operations analogous to the corresponding operations for skiplists; indeed, algorithms for skiplists can be applied directly to skip graphs, as a skip graph is equivalent to a collection of n skiplists that happen to share some of their lower levels.
Figure 9.3: Distributed skip graph
Because there are many lists at each level, the chance that any individual node participates in some particular search is small, eliminating both single points of failure and hot spots. Furthermore, each node has Ω(log n) neighbors on average, and with high probability (at least 1 − p^Ω(log n), where p is the probability of node failure) no node is isolated. In Section 9.4.5 we observe that skip graphs are resilient to node failures and have an expansion ratio of Ω(1/log n) with n nodes in the graph.
In addition to providing fault tolerance, having an Ω(log n) degree to support O(log n) search time appears to be necessary for distributed data structures based on nodes in a one-dimensional space linked by random connections satisfying certain uniformity conditions. While this lower bound requires some independence assumptions that are not satisfied by skip graphs, there is enough similarity between skip graphs and the class of models considered in the bound that an Ω(log n) average degree is not surprising.
We now give a formal definition of a skip graph. Precisely which lists a node x belongs to is controlled by a membership vector m(x). We think of m(x) as an infinite random word over some fixed alphabet, although in practice only an O(log n)-length prefix of m(x) needs to be generated on average. The idea of the membership vector is that every linked list in the skip graph is labeled by some finite word w, and a node x is in the list labeled by w if and only if w is a prefix of m(x).
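As an illustration, the following minimal Python sketch generates a membership vector lazily over a binary alphabet and checks list membership by prefix; the class name is illustrative.

# Membership-vector sketch: node x is in the list labeled by the word w
# exactly when w is a prefix of m(x); letters are drawn lazily on demand.
import random

class MembershipVector:
    def __init__(self, alphabet="01"):
        self.alphabet = alphabet
        self.prefix = []                   # letters generated so far

    def letter(self, i):
        while len(self.prefix) <= i:       # extend the finite prefix lazily
            self.prefix.append(random.choice(self.alphabet))
        return self.prefix[i]

    def in_list(self, word):
        return all(self.letter(i) == ch for i, ch in enumerate(word))

m = MembershipVector()
print(m.in_list(""))             # every node belongs to the single level 0 list
print(m.in_list(m.letter(0)))    # ... and to exactly one of the level 1 lists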
The search operation is identical to the search in a skiplist, with only minor adaptations to run in a distributed system. The search is started at the topmost level of the node seeking a key and proceeds along each level without overshooting the key, continuing at a lower level if required, until it reaches level 0. Either the address of the node storing the search key, if it exists, or the address of the node storing the largest key smaller than (or the smallest key larger than) the search key is returned. Skip graphs can support range queries to find a key ≥ x, a key ≤ x, the largest key < x, the least key > x, some key in the interval [x, y], all keys in [x, y], and so forth. For most of these queries, the procedure is an obvious modification of the search and runs in O(log n) time with O(log n) messages. For finding all nodes in an interval, we can use a search operation to find a single element of the interval (which takes O(log n) time and O(log n) messages). With r nodes in the interval, we can then broadcast the query through all of them (which takes O(log r) time and O(r log n) messages). If the originator of the query is capable of processing r simultaneous responses, the entire operation still takes O(log n) time.
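A minimal Python sketch of the search over an already-built skip graph is given below; the SGNode class, which stores a node's left and right neighbors at each level, is illustrative, and how the structure is constructed is not shown.

# Skip graph search sketch: start at the topmost level of the starting node and
# move along each level without overshooting the key, dropping a level when no
# further progress is possible, until level 0 is reached.

class SGNode:
    def __init__(self, key, max_level):
        self.key = key
        self.right = [None] * (max_level + 1)   # right neighbor at each level
        self.left = [None] * (max_level + 1)    # left neighbor at each level

def search(start, key):
    """Return the node storing `key` if present, else a level-0 neighbor of the
    position where `key` would be (largest smaller or smallest larger key)."""
    node, level = start, len(start.right) - 1
    while level >= 0:
        if key >= node.key:
            # move right while the right neighbor does not overshoot the key
            while node.right[level] is not None and node.right[level].key <= key:
                node = node.right[level]
        else:
            # move left while the left neighbor is still at or beyond the key
            while node.left[level] is not None and node.left[level].key >= key:
                node = node.left[level]
        level -= 1                               # continue at the next lower level
    return node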
A new node u knows some introducing node v already in the network that will help it to join the network. Node u inserts itself into one linked list at each level until it finds itself in a singleton list at the topmost level. The insert operation consists of two stages: (1) node u starts a search for itself from v to find its neighbors at level 0, and links to them; (2) for each level ℓ ≥ 0, node u finds the closest nodes s and y at level ℓ, with s < u < y, whose membership vectors agree with m(u) on the first ℓ + 1 letters, if they exist, and links to them at level ℓ + 1. Because an existing node v does not need the (ℓ + 1)-th letter of m(v) unless another node u arrives whose membership vector agrees with m(v) on the first ℓ + 1 letters, it can delay determining that letter's value until a new node arrives asking for it; thus at any given time only a finite prefix of the membership vector of any node needs to be generated.
In the absence of concurrency, the insert operation in a skip graph S with n nodes takes expected O(log n) messages and O(log n) time.
With concurrent inserts, the cost of a particular insert may be arbitrarily large. In addition to any costs imposed by the underlying doubly-linked list implementation, there is the possibility of starvation in line 5 as new nodes are inserted faster than u can skip over them. We do not expect this problem to be common in practice. With m machines and n resources in the system, most DHTs such as CAN, Pastry and Tapestry take O(log m) time for insertion; an exception is Chord, which takes O(log² m) time. An O(log m) time bound improves on the O(log n) bound for skip graphs when m is much smaller than n. However, the cost of this improvement is losing support for complex queries and spatial locality, and the improvement is only a constant factor on machines not capable of storing superpolynomial numbers of resources.
The delete operation is very simple. When node u wants to leave the network, it deletes itself in parallel from all lists above level 0 and then deletes itself from level 0. In the absence of concurrency, the delete operation in a skip graph S with n nodes takes expected O(log n) messages and O(1) time. We have assumed that each linked-list delete operation takes O(1) messages and O(1) time starting from u; since all but one of these operations proceed in parallel the total time is O(1) while the total messages (summing over all O(log n) expected levels) is O(log n). The performance of a delete operation in the presence of concurrency depends on the costs of the underlying linked-list delete operation.
9.4.5 Correctness and concurrency
In this subsection, we prove the correctness of the search, insert and delete algorithms described above. We show that both insert and delete maintain the following simple invariant, and then use this invariant to argue that search operations eventually find their target node or correctly report that it is not present in the skip graph.
It can be observed that, at any time during the execution of any number of concurrent insert or delete operations, the set of nodes in any list at a higher level is a subset of the set of nodes in the level 0 list.
Consider a search operation with target x. If the search operation returns a node with key x, then some such node existed in the graph at some time during the execution of the search. If the search operation returns notFound, then there is a time during its execution at which no such node is present. The reason is that x is present if and only if it appears in the level 0 list. In order to return notFound, the search algorithm must reach the bottom level without finding x. Thus, at the time it queries x’s level-0 predecessor and finds a successor beyond x (or vice versa), x is not present in the graph. If the search operation finds x, it must send a “found” message back to the source node. This can only occur if x has previously received a “found” message, which must have been sent by x’s successor or predecessor in some list. At the time this message is sent, x is still an element of that list and thus present in the graph. This implies that any search operation can be linearized with respect to inserts and deletes. In effect, the skip graph inherits the atomicity properties of its bottom layer, with the upper layers serving only to provide increased efficiency.
Exercise 9.1 Assume that the following solution to the problem of distributing a hash directory is proposed: each node maintains the hash value and location of its successor. Discuss the advantages and disadvantages of this solution, and describe the cost of the dictionary operation (insert, delete, search) and network maintenance operations (join and leave). ▪
Exercise 9.2 Compare index structures presented in this chapter and identify, beyond their differences, some of the common design principles adopted to cope with the distribution problem. ▪
Exercise 9.3 Describe possible solutions to perform distributed range queries. ▪
Exercise 9.4 How can we create a distributed priority queue? ▪
Exercise 9.5 How can we create a distributed stack? ▪