Objective 6.1 All data structures we studied so far deal with one-dimensional data points, but real-life data are not always one-dimensional. Our primary concern in this chapter is developing efficient data structures to handle higher dimensional data points.
Partitioning search space is a common feature in most data structures designed to perform point search (linear space) queries. A common strategy s to divide the linear space into two halves. Rather than designing new logical steps for higher dimensional spaces, we utilize a space partitioning separator to handle two-dimensional or greater situations. Curves and other geometric shapes can partition space but they are difficult to understand.
Definition 6.0.1 — Range queries. Orthogonal range queries are basic geometric queries, unlike point search queries. They represent a fundamental step toward developing the requirements for solving range search problems. Let S ⊆ Rd, where |S| = n and Rd is a d dimensional real space having Cartesian coordinates. S is a set of n points of d dimensional space and Q is a query box (hyper cube) in the same d dimensional space whose sides are parallel to the axis of Rd. In the range search problem, we should return the points of
A range search tree on a set S of d dimensional points, is an ordered tree data structure generally implemented with balanced binary search trees in each dimension, where the data values are stored in the leaves and internal nodes keep values for comparison purposes, usually the largest value of its left subtree. A range search tree can be realized as a multi-level binary search tree on a set of points in d-dimensions, defined recursively over the dimensions as level. The first level is a binary search tree consisting of the first coordinate of each point of S. Each node x of this top level tree contains an associated range search tree data dimension created with the remaining (d − 1) coordinates of the points of S stored in the subtree of x.
Definition 6.1.1 A range search tree is a binary tree data structure designed to report all points within a given range of multi-dimensional space.
Bentley introduced range trees in 1979 [50] to provide
A one-dimensional range search tree of n points is a simple binary search procedure that takes
Figure 6.1: Range search tree
6.1.1.1 Two-dimensional range search tree in O(n log n)
Let S ⊆ R2.
If S is singleton then return a leaf with value of that point.
Otherwise, construct a one-dimensional range tree based on the y coordinates for (x, y) ∈ S.
Let xm = {median of all x}.
Let Sl ⊆ S s.t.
Let Sr ⊆ S s.t.
Repeat above steps for Sl.
Repeat above steps for Sr.
Recursive equation of complexity is T(n) = 2T(n/2) + n, which resolves into
A range query Q on a range search tree TR containing points of S ⊂ Rd, reports the set of points
Let x12 be the last node in the common search path of x1 and x2.
For every node x after x12 in the search path of x1 do.
If x ≥ x1, report every point in the right-subtree of x including x.
For every node x′ after x12 in the search path of x2 do.
If x′ ≤ x2, report every point in the left-subtree of x′ including x′.
Return union of points.
A range search tree is a always a balanced binary search tree, so the maximum length of search paths for any point is
A range query in d-dimensions performs in a recursive matter on the remaining structure and stops recursion via the above steps when the remaining dimension is 1. The recursion works as a d-dimensional range query reducing to
This complexity can be reduced to
Definition 6.2.1 A k-dimensional tree, popularly known as KD tree is a geometric data structure widely used for organizing multi-dimensional points into an upgraded binary search tree capable of searching over dimensions alternatively while processing spatial search queries.
The KD tree is the most popular spatial data structure for performing range search queries and nearest neighbor queries. Each level of the tree partitions all children along a specific dimension and rotates the dimension over levels using a hyperplane perpendicular to the axis of consideration. At the root of the tree all points will be partitioned based on the first coordinate of the root node, i.e. all points of the right subtree have larger first coordinates than the roots have. Each level down the tree partitions the search space on the basis of next dimension with a periodical repetition, i.e. returning to the first dimension when the last one is finished. The efficient way to build a static KD tree is to use a partition method that places the median point; all data with smaller one-dimensional values moves to the right, and larger amounts move to the right. We then repeat this procedure recursively on both the left and right subtrees until single element remains. See Figure 6.2.
Figure 6.2: Simple KD tree example
Search space partitioning strategy used here is uses line separators in two dimensions and hyperplane separators for hyperspace, partitioning dimensions one after one alternatively.
Since there are many possible ways to choose axis-aligned splitting planes, there are many different ways to construct KD trees. First, we will understand the creation method of a static KD tree before proceeding to dynamic updates.
6.2.1.1 Construction of static KD tree
Step 1: Let S ⊂ Rd be a set of n points of d dimensional space, i.e.
Step 2: The first partition of the data space takes place at the root of static KD tree using the hyperplane
Step 3: The left subtree of the root will contain all the points on the left side of the hyperplane
Step 4: Similarly, right subtree of the root will contain all the points lying to the right of the hyperplane
Step 5: Now, we have two disjoint subsets of point set,
Step 6: The next level partitions of the data space take place at the children of the roots using the hyperplanes
Step 7: The process continues until the subsets become singletons and are inserted into the corresponding leaves when further partition is no longer required.
The above method creates a balanced static KD tree.
Complexity 6.2.1 The above recursive algorithm will generate the recursive equation T(n) = 2T(n/2) + n log n. Solving this recursion we obtain T(n) = Ω(n log n).
6.2.1.2 Point query in a KD tree
The point of searching with a KD tree is analogous to searching in other types of trees. For searching a query point
6.2.1.3 Dynamic updates in a KD tree
Inserting a node in an empty KD tree is trivial, as it is for other data structures: create a root and insert the node. Inserting a new node in an existing KD tree is also similar; the first step is running a point search to find the location. The search process is coordinate wise of the point as well as level wise of the tree, as explained in the above paragraph. Once the search operation reach the leaf and element not already present in the tree, the new node is inserted based on comparison of next coordinates of current node and new node either as left or right child depending on the result of comparison.
Deleting a given node is similar to deletion processes for other tree structures. The first step is determining the element to be deleted. If the node is a leaf, simply delete it. If the element to be deleted is an internal node, the operation is more complex.
Spatial data structures are generally created for large numbers of data points; lazy deletes are preferred to sequential deletes because they perform more efficiently.
What about maintaining a balanced KD tree?
Dynamic updates may destroy the balanced structures of KD trees. Rebalancing is not a preferred option even though an unbalanced tree make increase search cost. A randomized KD tree is a logical alternative but it requires random shuffling during insertion.
Complexity 6.2.2 Inserting a new element in a balanced kd-tree takes O(log n) time. Removing a given element from a balanced kd-tree also takes O(log n) time. Querying a range parallel to the axes in a balanced kd-tree takes O(n1−1/k + m) time, where m is the number of the points to be reported in
Range searches in KD trees find all points in the circle of a radius R of a query q, typically achieved by using a stack. Starting from a root, we put all the nodes and their children arriving in the search path into the stack for further processing. When the search operation reaches the leaf, we compute the distances of query points in the stack by removing them from the top down. If the computed distance is less than R the point is accepted for range reporting.
What to do with the node lying on the boundary?
We must consider all the points of the subtree rooted at that particular child node for inclusion in the stack followed by a range verification for a possible reporting.
Inclusion of a child node in a subtree is very expensive but the step must be performed to ensure accurate range reporting.
If a KD tree uses split hyperlanes parallel to an axis, inclusion of a subtree with a child node is made easier by subtraction of splitting coordinate of child node and query point.
Complexity 6.2.3 The worst case time complexity for range search query in a k-d tree containing n nodes is
6.2.3 Nearest neighbor search in KD tree
A nearest neighbor search in a KD tree is similar to a range search except that the radius R is treated as a current minimum and rejection criterion rather than as an accepted condition. The nearest neighbor search is performed by using a stack. We start from the root and put all nodes and their children in the search path into the stack for selection as possible nearest neighbors sought by the query. After the search reaches the leaf, we compute the distances of query points starting by removing the points in the stack from the top down. The first distance computed between leaf and query point is considered as current minimum R and the leaf is considered the current nearest neighbor. We then compute the distances from the query point for all other points extracted from the top of the stack. If the computed distance is less than R the point updated as current nearest neighbor and the value of R are reduced to newly computed distance; otherwise the old parameters will remain. When the stack is empty the current nearest neighbor is reported.
Nearest neighbor search is very useful for real-time applications such as finding a nearby school or the nearest bus stop of restaurant.
A quadtree is a tree based spatial data structure in which each internal node has exactly four children. Quadtrees perform the two-dimensional space partition in each node dividing a two-dimensional point into four quadrants or regions; this process is similar to what binary search trees do in one dimension. Various methods can associate data with leaf nodes based on the requirements of the application.
The subdivided regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974 [51].
A quadtree is a hierarchical data structure that represents spatial data based on the principle of recursive decomposition (similar to divide-and-conquer methods). Hierarchical data structures are useful because of their ability to focus on the interesting subsets of the data. This focusing results in an efficient representation and improved execution times. Thus they are particularly convenient for performing set operations. Data structures are also attractive because of their structural clarity and ease of implementation. The hierarchical representation using a tree data structure is the traditional also a way to represent quadtrees but a linear representation is more efficient and saves image storage space.
6.3.1 Inserting data into a quadtree
Starting at the root, determine which quadrant your point occupies. Recurse to that node and repeat, until you find a leaf node. Then, add your point to that node’s list of points. If the list exceeds some predetermined maximum number of elements, split the node and move the points into the correct subnodes. To query a quadtree, starting at the root, examine each child node, and check if it intersects the area queried. If it does, recurse into that child node. Whenever you encounter a leaf node, examine each entry to see if it intersects with the query area, and return it if it does.
Theorem 6.3.1 Let T be a quadtree with m nodes. Then the balanced version of T has O(m) nodes and can be constructed in O((d + 1)m) time.
Quadtrees are hierarchical data structures whose common property is that they are based on the principle of decomposition of space (Figure 6.3). Their characterization is based on (1) the type of data they represent; (2) the principle guiding the decomposition process; and (3) whether the resolution is or is not variable. Currently quadtree is used for point data, areas, curves, surfaces and volumes. The prime motivation for the development of the quadtree is the desire to reduce the space necessary to store data through the use of aggregation of homogeneous blocks. An important by-product of this aggregation is the reduction of operating times of certain operations (e.g. connected component labeling and component counting).
Figure 6.3: Example of quadtree
Quadtrees may be classified according to the type of data they represent, including areas, points, lines and curves. Quadtrees may also be classified by whether the shape of the tree is independent of the order in which data is processed.
The region quadtree represents a partition of space in two dimensions by decomposing the region into four equal quadrants, subquadrants, and so on with each leaf node containing data corresponding to a specific subregion. Each node in the tree either has exactly four children, or has no children (a leaf node). The height of a quadtree that follows this decomposition strategy (i.e. subdividing subquadrants as long as there is interesting data in the subquadrant for which more refinement is desired) is sensitive to and dependent on the spatial distribution of interesting areas in the space being decomposed. The region quadtree is a type of trie.
A region quadtree with a depth of n may be used to represent an image consisting of 2n × 2n pixels, where each pixel value is 0 or 1. The root node represents the entire image region. If the pixels in any region are not entirely 0s or 1s, it is subdivided. In this application, each leaf node represents a block of pixels that are all 0s or all 1s. Note the potential savings in terms of space when these trees are used for storing images; images often have many regions of considerable size that have the same colour value throughout. Rather than store a big two-dimensional array of every pixel in the image, a quadtree can capture the same information potentially several divisive levels higher than the pixel-resolution sized cells that we would otherwise require. The tree resolution and overall size are bounded by the pixel and image sizes.
A region quadtree may also be used as a variable resolution representation of a data field. For example, the temperatures in an area may be stored as a quadtree, with each leaf node storing the average temperature over the subregion it represents.
If a region quadtree is used to represent a set of point data (such as the latitude and longitude of a set of cities), regions are subdivided until each leaf contains at most a single point.
The point quadtree is an adaptation of a binary tree used to represent two-dimensional point data. It shares the features of all quadtrees but is a true tree as the center of a subdivision is always on a point. It is often very efficient in comparing two-dimensional, ordered data points, usually operating in O(log n) time. Point quadtrees are worth mentioning for completeness, but they have been surpassed by KD trees as tools for generalized binary search.
Point quadtrees are constructed as follows. Given the next point to insert, we find the cell in which it lies and add it to the tree. The new point is added such that the cell that contains it is divided into quadrants by the vertical and horizontal lines that run through the point. Consequently, cells are rectangular but not necessarily square. In these trees, each node contains one of the input points.
Since the division of the plane is decided by the order of point-insertion, the tree’s height is sensitive to and dependent on insertion order. Inserting in a “bad” order can lead to a tree of height linear in the number of input points (at which point it becomes a linked-list). If the point-set is static, pre-processing can be done to create a tree of balanced height.
A node of a point quadtree is similar to a node of a binary tree, with the major difference being that it has four pointers (one for each quadrant) instead of two (left and right) as in an ordinary binary tree. Also a key is usually decomposed into two parts, referring to x and y coordinates. Therefore, a node contains the following information: four pointers: quad[NW], quad[NE], quad[SW], and quad[SE] point; which in turn contains: key; usually expressed as x, y coordinates value.
▪ Example 6.1 — Connected Component Labelling. Consider two neighbouring black pixels in a binary image. They are adjacent if they share a bounding horizontal or vertical edge. In general, two black pixels are connected if one can be reached from the other by moving only to adjacent pixels (i.e. there is a path of black pixels between them where each consecutive pair is adjacent). Each maximal set of connected black pixels is a connected component. Using the quadtree representation of images, we can find and label these connected components in time proportional to the size of the quadtree. This algorithm can also be used for polygon colouring.
The algorithm works in three steps:
1.Establish the adjacency relationships between black pixels.
2.Process the equivalence relations from the first step to obtain one unique label for each connected component.
3.Label the black pixels with the label associated with their connected component.
To simplify the discussion, let us assume the children of a node in the quadtree follow the Z-order (SW, NW, SE, NE). Since we can count on this structure, for any cell we know how to navigate the quadtree to find the adjacent cells in the different levels of the hierarchy.
Step 1 is accomplished with a post-order traversal of the quadtree. For each black leaf v we look at the node or nodes representing cells that are northern neighbours and eastern neighbours (i.e. the northern and eastern cells that share edges with the cell of v). Since the tree is organized in Z-order, we have the invariant that the southern and western neighbours have already been taken care of and accounted for. Let the northern or eastern neighbour currently under consideration be u. If u represents black pixels:
•If only one of u or v has a label, assign that label to the other cell.
•If neither of them have labels, create one and assign it to both of them.
•If u and v have different labels, record this label equivalence and move on.
Step 2 can be accomplished using the union-find data structure. We start with each unique label as a separate set. For every equivalence relation noted in the first step, we union the corresponding sets. Afterwards, each distinct remaining set will be associated with a distinct connected component in the image.
Step 3 performs another post-order traversal. This time, for each black node v we use the union-find’s find operation (with the old label of v to find and assign v its new label (associated with the connected component of which v is part).
Objective 6.2 So far we have described many spatial data structures to report geometric queries, but all these data structures store multidimensional points, whereas R trees index special objects such as regions and answer geometric queries as objects or collections of objects.
An R tree is a spatial data structure developed for indexing multi-dimensional batch processing such as geographical information or graphics design. The R tree was invented by Antonin Guttman in 1984 and is very popular for storing spatial objects, the data structure provides solid theoretical foundation as well as useful applications.
Definition 6.4.1 R tree, or rectangle tree, is a highly balanced multiple-use tree data structure developed to index spatial objects using their geometric location. It is created in bottom up manner and very efficient for performing object, range object, and nearest object searches.
In R tree data structure, we group nearby objects and organize them with their minimum bounding rectangle (MBR) in the upper level of the tree. The whole tree is a hierarchical structure of inclusive MBRs except the leaves containing the objects.
Like other spatial data structures, R trees also partition search spaces in terms of MBRs. They create overlapping space partitioning and use optimization to minimize overlap instead of utilizing disjoint space partitioning.
6.4.1 Indexing structure of R tree
R tree is a multilevel height balance tree. Each node contains several keys and pointers for child nodes up to a determined maximum of m. A leaf level node contains information about objects. Internal nodes contain rectangles corresponding to children lying inside their parent nodes. Figure 6.4 shows the indexing and hierarchical bounding boxes of an R tree.
Figure 6.4: Construction of minimum bounding region of R tree.
6.4.1.1 Creation of R tree
Let,
Figure 6.5: Example of R tree
To search a given query object Oq, first we need to create MBR of Oq. Let Iq be the corresponding identifier. Start the search process by matching the region of Iq with MBRs of all entries of internodes, starting from the root. A successful match in internal node means inclusion of Iq within some MBR, (gets smaller at every match) and finally query Oq matches at leaf.
For range search we need to report all objects inside the range. The range search process follows a path from root to leaf and matches the region of query with MBRs of all entries of the internal nodes. The objective is to find the smallest MBR which contains the whole range and reports all the objects under the subtree rooted at that node.
In the case of a nearest neighbor search for a given point or region, the search process runs with the help of a priority queue. Starting from the root the algorithm inserts the nodes of the search path and their children into the priority queue. The priority values are decided by computing distance from the query point to the identifier. The find-min-return-by function at the end is the desired answer for nearest neighbor search in R tree.
6.4.3 Dynamic update of R tree
R trees can be updated dynamically through insert and delete operations. R trees are height balanced; the two-phase update method handles height without performing rotation or other external balancing steps.
6.4.3.1 Insertion in R tree
Insertion of a new object in the R tree is done by two phase traversal started from the root up to the leaf node of the search path. The first phase is similar to search and we create a new object and its identifier. In the second phase we traverse from leaf to upper level node until the identifier is included into a top level MBR. This process may require redefinition of the MBR and updating data in a reverse path up to the root. In some situations, we may need to split the node to maintain the upper bound M for number of entries. However, sharing of entry with the less crowded sibling is also possible.
Tree height may get increased in the second phase of insertion, when a split occurs at the root of the R tree but the system maintains height balance by increasing the heights of all the paths get increased.
6.4.3.2 Deletion in R tree
Deletion in R tree is similar to insertion and performed in two phases. In the first phase, a search operation is executed to find the object to be deleted. The second phase may be required only if the object if found and deleted. In this case, a bottom-up date of the MBRs in the affected path is required. In certain situations, nodes may be underfilled and may be joined with a sibling or take a sibling from a crowded node.
In the second phase of any update operation, MBRs must be updated optimally. The choice to share or merge siblings should be performed to maintain minimum overlaps between higher level MBRs.
Exercise 6.1 Describe how to construct a d-dimensional range tree on n given points in O(nlgd−1 n) time. ▪
Exercise 6.2 Describe how to construct a d-dimensional layered range tree on n given points in O(nlgd−1 n) time. ▪
Exercise 6.3 What is a spatial data structure? Describe the applicability of the following data structures for various types of queries of special data.
1.Multi-dimensional range search tree
2.KD tree
3.Quadratic tree
4.R tree ▪
Exercise 6.4 What is nearest neighbour search problem? Write an algorithm for an approximate nearest neighbour search using the follow data structures:
1.Multi-dimensional range search tree
2.KD tree
3.Quadratic tree
4.R tree ▪
Exercise 6.5 What is an R tree? Explain the construction of R tree for special objects. How would you construct an internal MBR? Discuss the optimality issues for insertion in an R tree in the case of a split. Discuss the optimality issues for deletion in an R tree in the case of a join. ▪