Synopsis data structures are substantially smaller than their base data sets. They support queries to massive data sets while minimizing or avoiding disk accesses, and they offer the following advantages:
1. Fast processing: May reside in main memory, providing fast processing of queries and of data structure updates.
2. Fast swap/transfer: A synopsis data structure that resides on disk can be swapped in and out of memory with minimal disk accesses.
3. Lower cost: Has minimal impact on the space requirements of the data set and its supporting data structures.
4. Small surrogate: Can serve in place of the data set when the data set itself is currently expensive or impossible to access.
Since synopsis data structures are too small to maintain a full characterization of their base data sets, they must summarize the data set, and the responses they provide to queries will typically be approximate ones.
What synopsis of the full data must be kept in the limited space to maximize the accuracy of responses? How can we efficiently compute a synopsis and maintain it as the data set is updated?
Definition 10.1.1 An f(n) synopsis data structure for a class Q of queries is a data structure for providing answers to queries from Q that uses f(n) = O(n^ε) space, for some constant ε < 1, for a data set of size n.
A synopsis data structure can be evaluated according to five metrics:
1. Coverage: the range and importance of the queries in Q
2. Answer quality: the accuracy and confidence of its answers to queries in Q
3. Footprint: its space bound
4. Query time: the time needed to answer a query from Q
5. Computation time: the time needed to build and maintain the synopsis as the data set changes
10.1.1.1 Sampling methods
Sampling methods are among the simplest methods for synopsis construction in data streams, in that they use the same multidimensional representation as the original data points.
10.1.1.2 Sketches
Sketch-based methods derive their inspiration from wavelet techniques. In fact, sketch based methods can be considered randomized versions of wavelet techniques, and are among the most space efficient of all methods. However, because of the difficulty of intuitive interpretations of sketch based representations, they are sometimes difficult to apply to arbitrary applications.
10.1.1.3 Histograms
Histogram-based methods are widely used for static data sets. However, most traditional algorithms for static data sets require super-linear time and space because they rely on dynamic programming techniques for optimal histogram construction. Their extension to the data stream case is a challenging task.
10.1.1.4 Wavelets
Wavelets have traditionally been used in a variety of image and query processing applications. In the streaming setting, the dynamic maintenance of the dominant coefficients of the wavelet representation requires some novel algorithmic techniques.
10.1.2.1 Approximate query estimation
Query estimation is possibly the most widely used application of synopsis structures. The problem is particularly important from an efficiency point of view, since queries usually have to be resolved in online time. Therefore, most synopsis methods such as sampling, histograms, wavelets and sketches are usually designed to solve the query estimation problem.
10.1.2.2 Approximate join estimation
The efficient estimation of join size is a particularly challenging problem in streams when the domain of the join attributes is particularly large.
10.1.2.3 Computing aggregates
In many data stream computation problems, it may be desirable to compute aggregate statistics over data streams. Some applications include the estimation of frequency counts and quantiles.
10.1.2.4 Data mining applications
A variety of data mining applications such as change detection do not require the use of individual data points; they only require a temporal synopsis that provides an overview of the behavior of the stream. Methods such as clustering and sketches can be used for effective change detection in data streams.
Sampling is a utility task used in diverse applications such as data mining, query processing and sensor data management. There are many algorithms dedicated to efficiently building a random sample of a small fixed size, whose efficiency varies depending on how the data is stored and accessed. It is, however, less obvious how to incrementally build a sample in a single pass or, more generally, how to maintain a random sample when the dataset is updated. This is, for instance, the case when dealing with data streams.
The right choice of elements of a sample is necessary to achieve good representation of a population. The following issues should be clearly defined.
1.The selection method for the elements of the population (sampling method to be used).
2.Sample size.
3.Reliability of the conclusions and estimates of probable errors.
How can we classify the different ways of choosing a sample?
10.2.1.1 Probability based sampling
In probability sampling, every unit in a population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.
▪ Example 10.1 We want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household. For example, we can allocate each person a random number, generated from a uniform distribution between 0 and 1, and select the person with the highest number in each household. We then interview the selected person and determine his or her income. People living on their own are certain to be selected, so we simply add their income to our estimate of the total. But a person living in a household of two adults has only a one-in-two chance of selection. To reflect this, we would count the selected person’s income twice towards the total. The person who is selected from that household can be loosely viewed as also representing the person who isn’t selected. ▪
In the above example, not everybody has the same probability of selection; what makes it a probability sample is the fact that each person’s probability is known. When every element in the population has the same probability of selection, this is known as an equal probability of selection (EPS) design. Such designs are also referred to as self-weighting because all sampled units are given the same weight. Probability sampling includes simple random sampling, systematic sampling, stratified sampling, probability-proportional-to-size sampling, and cluster or multistage sampling. These methods have two common characteristics:
1.Every element has a known nonzero probability of being sampled.
2.Random selection occurs at some point.
10.2.1.2 Non-probability-based sampling
Non-probability-based sampling is any method in which some elements of a population have no chance of selection (these are sometimes referred to as out of coverage or undercovered elements), or where the probability of selection can’t be accurately determined. Sampling involves the selection of elements based on assumptions regarding the population of interest, which forms the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.
▪ Example 10.2 We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a non-probability sample, because some people are more likely to answer the door (e.g. an unemployed person who spends most of his or her time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it’s not practical to calculate these probabilities. ▪
Probability sampling is useful because it assures that a sample is representative and we can estimate the errors for the sampling.
In this subsection, we discuss the simple reservoir sampling algorithm (internal memory), which works as follows: Let t denote the number of data items in the dataset that have been processed. If t + 1 ≤ n, the (t + 1)th item is inserted directly into the reservoir. Otherwise, the item becomes a candidate and replaces one of the old candidates in the reservoir with probability n/(t + 1). The replaced item is selected uniformly at random from the reservoir.
Algorithm 25 Reservoir sampling algorithm
1: procedure RESERVOIR(k, S) ⊳ take a random sample of size k from the dataset S
2: initialize reservoir R of size k
3: for i = 1 to |S| do
4: old = ith item of S
5: if i ≤ k then R[i] = old
6: else generate a random integer x from 1 to i
7: if x ≤ k then R[x] = old
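For concreteness, a minimal runnable version of Algorithm 25 might look as follows in Python; the function name and the use of random.randint are our own choices rather than part of the original text.

import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable stream.

    Mirrors Algorithm 25: the first k items fill the reservoir; the ith item
    (i > k) replaces a uniformly chosen reservoir slot with probability k/i.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            x = rng.randint(1, i)      # random integer in [1, i]
            if x <= k:
                reservoir[x - 1] = item
    return reservoir

# Example: sample 5 items from a stream of 10,000 integers.
print(reservoir_sample(range(10000), 5))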
The basic idea behind reservoir algorithms is to select a sample of size greater than or equal to n, from which a random sample of size n can be generated. A reservoir algorithm is defined as follows:
Definition 10.2.1 — Reservoir. The first step of any reservoir algorithm is to put the first n records of the file into a reservoir. The rest of the records are processed sequentially; records can be selected for the reservoir only as they are processed.
An algorithm is a reservoir algorithm if it maintains the invariant that after each record is processed a true random sample of size n can be extracted from the current state of the reservoir. At the end of the sequential pass through the file, the final random sample must be extracted from the reservoir. The reservoir might be rather large, and so this process could be expensive. The most efficient reservoir algorithms avoid this step by always maintaining a set of n designated candidates in the reservoir, which form a true random sample of the records processed so far. When a record is chosen for the reservoir, it becomes a candidate and replaces one of the former candidates; at the end of the sequential pass through the file, the current set of n candidates is output as the final sample.
If the internal memory is not large enough to store the n candidates, the algorithm can be modified as follows: The reservoir is stored sequentially on secondary storage; pointers to the current candidates are stored in internal memory in an array, which we call I. (We assume that there is enough space to store n pointers.) Suppose during the algorithm that record R′ is chosen as a candidate to replace record R, which is pointed to by I[k]. Record R′ is written sequentially on secondary storage, and I[k] is set to point to R′. The above algorithm can thus be modified by replacing the initial for loop, as shown in Algorithm 26.
Algorithm 26 Reservoir sampling algorithm (external memory)
1: procedure RESERVOIR-EXTERNAL(n, S) ⊳ take a random sample of size n (external memory) from the dataset S
2: for j = 1 to n do
3: Copy the jth record onto secondary storage
4: Set I[j] to point to the jth record
5: Whenever a later record is chosen as a candidate to replace the candidate pointed to by I[k]:
6: Copy the new record onto secondary storage
7: Set I[k] to point to that record ⊳ 1 ≤ k ≤ n is the slot of the replaced candidate
Retrieval of the final sample from secondary storage can be sped up by retrieving the records in sequential order. The sort should be very fast because it can be done in internal memory. A further improvement is to skip over data items that will not be selected and instead compute the index of the next selected item directly. A random variable ϕ(n, t) is defined to be the number of data items that are skipped over before the next item is chosen for the reservoir, where n is the size of the sample and t is the number of data items processed so far. This technique reduces the number of data items that must be processed individually and thus the number of calls to RANDOM (RANDOM is a function that generates a uniform random variable between 0 and 1).
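As an illustration of how ϕ(n, t) can be sampled with a single uniform draw, the following Python sketch searches for the smallest skip consistent with that draw; the function name and loop structure are illustrative assumptions rather than the book's algorithm.

import random

def next_skip(n, t, rng=None):
    """Sample phi(n, t): how many items to skip before the next item enters
    a reservoir of size n, given that t items have been processed so far.
    Uses one uniform draw instead of one draw per item."""
    rng = rng or random.Random()
    v = rng.random()
    s = 0
    quot = (t + 1 - n) / (t + 1)          # Pr(item t+1 is NOT selected)
    while quot > v:
        s += 1
        quot *= (t + s + 1 - n) / (t + s + 1)
    return s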
Complexity 10.2.1 — Reservoir sampling algorithm. It is clear that Algorithm 26 runs in time O(N), because the entire file must be scanned and each record can be processed in constant time. The algorithm can be reformulated easily using the skipping technique described above, so that the expected I/O time is reduced from O(N) to O(n(1 + log(N/n))).
As we discussed above, the reservoir algorithm can only produce samples over insertion-only datasets. Deletions are supported by two algorithms: random pairing (RP) and resizing samples (RS).
10.2.3.1 Random pairing
The basic idea behind random pairing is to avoid accessing the base data set by treating each new insertion as a compensation for a previous deletion; in the long run, every deletion from the data set is eventually compensated by a subsequent insertion. The algorithm maintains two counters c1 and c2, which respectively denote the numbers of uncompensated deletions that affected the sample S and those that affected only the base data set R. Initially c1 and c2 are both set to 0. If c1 + c2 = 0, there are no uncompensated deletions and the standard reservoir algorithm is applied to the new item. Otherwise, the new item compensates a previous deletion: it is included in S with probability c1/(c1 + c2) and excluded otherwise, and c1 or c2 is decremented accordingly.
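A compact Python sketch of the random-pairing idea described above follows; the class and method names are our own, and it assumes that deleted items currently belong to the base data set.

import random

class RandomPairingSample:
    """Bounded uniform sample (size <= k) over a stream of insertions and
    deletions: new insertions first compensate earlier uncompensated deletions."""
    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.sample = []     # current sample S
        self.t = 0           # current size of the base data set R
        self.c1 = 0          # uncompensated deletions that removed a sample item
        self.c2 = 0          # uncompensated deletions of non-sample items

    def insert(self, item):
        self.t += 1
        if self.c1 + self.c2 == 0:
            # No pending deletions: ordinary reservoir step.
            if len(self.sample) < self.k:
                self.sample.append(item)
            elif self.rng.random() < self.k / self.t:
                self.sample[self.rng.randrange(self.k)] = item
        else:
            # The insertion compensates one earlier deletion.
            if self.rng.random() < self.c1 / (self.c1 + self.c2):
                self.sample.append(item)
                self.c1 -= 1
            else:
                self.c2 -= 1

    def delete(self, item):
        self.t -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1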
10.2.3.2 Resizing samples
The general idea of any resizing algorithm is to generate a sample S of size at most n from the initial dataset R and, after some finite sequence of insertion and deletion transactions, produce a sample S′ of size n′ from the new base dataset R′, where n < n′ < |R′|. The proposed algorithm follows this general idea by using a random variable based on the binomial distribution.
10.2.4 Sliding window sampling
In this subsection, we focus on algorithms that generate a sample of size n from a window of size w. Two types of sliding windows are defined: (i) the sequence-based window and (ii) the timestamp-based window.
10.2.4.1 Simple algorithm
The simple algorithm first generates a sample of size n from the first w data items using the reservoir algorithm. Then, as the window moves, the sample is maintained until newly arriving data causes old data in the sample to expire. The new data is then inserted into the sample and the expired data is discarded. This algorithm can efficiently maintain a uniform random sample of the sliding window. However, the sample design is reproduced for every tumbling of the window: if the ith data item is in the sample for the current window, the (i + cw)th data item is guaranteed to be included in the sample at some time in the future, where c is an arbitrary integer constant.
10.2.4.2 Chain-sample algorithm
The chain-sample algorithm generates a sample of size 1 for each chain, so in order to get a sample of size n, n independent chains need to be maintained. When the ith data item enters the window, it is selected as the sample with probability 1/min(i, w). If the item is selected, the index of the item that will replace it when it expires is chosen uniformly from i + 1 to i + w. When the item with the selected index arrives, the algorithm stores it in the chain and calculates a new replacement index. Thus, a chain of elements that can replace the outdated data is built.
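The following Python sketch maintains a single chain (a sample of size 1) over a sequence-based window of size w; the class and variable names are illustrative, and n independent instances would be kept to obtain a sample of size n.

import random
from collections import deque

class ChainSample:
    """One chain of the chain-sample method: a uniform sample of size 1
    from the last w items of a stream."""
    def __init__(self, w, rng=None):
        self.w = w
        self.rng = rng or random.Random()
        self.i = 0                  # number of items seen so far
        self.chain = deque()        # (index, value); chain[0] is the current sample
        self.next_idx = None        # index whose arrival extends the chain

    def process(self, value):
        self.i += 1
        i = self.i
        # Drop the current sample (and any stale chain head) once it expires.
        while self.chain and self.chain[0][0] <= i - self.w:
            self.chain.popleft()
        if self.rng.random() < 1.0 / min(i, self.w):
            # Item i becomes the new sample; restart the chain from it.
            self.chain = deque([(i, value)])
            self.next_idx = self.rng.randint(i + 1, i + self.w)
        elif self.next_idx == i:
            # Item i is the designated replacement; extend the chain.
            self.chain.append((i, value))
            self.next_idx = self.rng.randint(i + 1, i + self.w)

    def sample(self):
        return self.chain[0][1] if self.chain else None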
In recent years, significant research has focused on developing compact data structures, called sketches, that summarize massive data streams and allow statistics to be computed from the summaries. One statistic that is commonly sought is the total count associated with a key in a data stream, i.e., the sum of the values recorded for that key. A sketch can estimate the count associated with a key, a process known as answering point queries. In networking, sketches have known applications in estimating the size of the largest traffic flows in routers, in detecting changes in traffic streams, in adaptive traffic-sampling techniques, and in worm fingerprinting. More generally, sketches find applications in systems that require online processing of large data sets [145, 146, 144]. Sketches use the same underlying hashing scheme for summarizing data; they differ in how they update hash buckets and how they use the hashed data to derive estimates. Among the different sketches, the one with the best time and space bounds is the count-min sketch.
Definition 10.3.1 The count-min (CM) sketch is a compact summary data structure capable of representing high-dimensional vectors and answering queries on such vectors, in particular point queries and dot product queries, with strong accuracy guarantees. Figure 10.1 depicts the structure.
Figure 10.1: Count-min sketch
Point queries are at the core of many computations, so the structure can be used in order to answer a variety of other queries, such as frequent items (heavy hitters), quantile finding, join size estimation, and more. Since the data structure can easily process updates in the form of additions or subtractions to dimensions of the vector (which may correspond to insertions or deletions, or other transactions), it is capable of working over streams of updates, at high rates.
The data structure maintains the linear projection of the vector with a number of other random vectors. These vectors are defined implicitly by simple hash functions. Increasing the range of the hash functions increases the accuracy of the summary, and increasing the number of hash functions decreases the probability of a bad estimate. These tradeoffs are quantified precisely below. Because of this linearity, CM sketches can be scaled, added and subtracted, to produce summaries of the corresponding scaled and combined vectors. The CM sketch was first proposed in 2003 as an alternative to several other sketch techniques, such as the count sketch and the AMS sketch [145]. The goal was to provide a simple sketch data structure with a precise characterization of the dependence on the input parameters. The sketch has also been viewed as a multistage filter, which requires only limited-independence randomness to show strong, provable guarantees. The simplicity of creating and probing the sketch has led to its wide use in many areas [145, 146, 143].
The CM sketch is simply a two-dimensional array of counters of width w and depth d: C[1, 1], ..., C[d, w]. Every entry of the array is initialized to zero. In addition, d hash functions h1, ..., hd : {1, ..., n} → {1, ..., w} are chosen uniformly at random from a pairwise-independent family, one for each row of the array.
10.3.1.1 Update procedure
Consider an implicit and incremental vector a of dimension n. Assume that its current state at time t is a(t) = (a1(t), a2(t), ..., an(t)).
Generally, for convenience of reference, t is dropped, and the current state of the vector is written as just a.
When an update (it, ct) arrives, ct is added to one counter in each row of the CM sketch; in row j the counter is determined by hj, i.e., C[j, hj(it)] is increased by ct.
Complexity 10.3.1 Computing each hash function takes O(1) (constant) time and the total time to perform an update is O(d), independent of w. Since d is typically small in practice (often less than 10), updates can be processed at high speed.
10.3.1.2 Estimation procedure
The process for estimating ai (answering a point query for item i) is similar. For each row, the corresponding hash function is applied to i to look up one of the counters. Across all rows, the estimate is found as the minimum of all the probed counters. Each of these counters has added up all the updates that were mapped to it, so each is an overestimate, and the smallest of them is returned.
Algorithm 27 Initializing the array C of d × w counters to 0, and picking values for the hash functions based on a prime p
1: procedure CountMinInit(w, d, p) ⊳ Initialization of CM sketch
2: C[1..d, 1..w] = 0
3: for j = 1 to d do
4: Choose aj and bj at random ⊳ the prime p may be 2^31 − 1 or another large prime; a different p may be used for each j
5: Total count N = 0
Algorithm 28 Updating on an arrival (i, c): add c to the total count N and to one counter in each row
1: procedure CountMinUpdate(i, c) ⊳ Update of CM sketch
2: N = N + c
3: for j = 1 to d do
4: hj(i) = ((aj × i + bj) mod p) mod w ⊳ a different prime p may be used for each j
5: C[j, hj(i)] = C[j, hj(i)] + c
NB: The loop hashes i to one counter in each row and adds c to that counter.
Algorithm 29 Estimating ai for a given i by hashing i in each row and tracking the minimum value of C[j, hj(i)] over the d values of j
1: procedure CountMinEstimate(i) ⊳ Estimation of CM sketch
2: e ← ∞
3: for j = 1 to d do
4: hj(i) = ((aj × i + bj) mod p) mod w ⊳ a different prime p may be used for each j
5: e = min(e, C[j, hj(i)])
6: return e
To speed up the implementation, mod p can be replaced with a shift and a few additions when p is a Mersenne prime such as 2^31 − 1, and mod w can be replaced with a bitmask operation when w is a power of two.
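Putting Algorithms 27-29 together, a minimal Python implementation might look like the following; the class name, the choice of a single prime p = 2^31 − 1 shared by all rows, and the use of random.randrange are our own simplifications.

import random

class CountMinSketch:
    """Count-min sketch of width w and depth d (Algorithms 27-29)."""
    P = (1 << 31) - 1                      # Mersenne prime used by the hash functions

    def __init__(self, w, d, rng=None):
        rng = rng or random.Random()
        self.w, self.d = w, d
        self.N = 0                         # total count of all updates
        self.C = [[0] * w for _ in range(d)]
        # One hash h_j(i) = ((a_j * i + b_j) mod p) mod w per row.
        self.ab = [(rng.randrange(1, self.P), rng.randrange(0, self.P)) for _ in range(d)]

    def _hash(self, j, i):
        a, b = self.ab[j]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c=1):
        self.N += c
        for j in range(self.d):
            self.C[j][self._hash(j, i)] += c

    def estimate(self, i):
        return min(self.C[j][self._hash(j, i)] for j in range(self.d))

# Usage: estimate the count of key 42 after a stream of updates.
cms = CountMinSketch(w=2000, d=10)
for key in [42] * 500 + [7] * 300:
    cms.update(key)
print(cms.estimate(42))   # never below 500, and close to it with high probability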
Theorem 10.3.1 In a sketch of size w × d with total count N, any estimate has error at most 2N/w, with error probability at most (1/2)^d (i.e., the bound holds with probability at least 1 − (1/2)^d).
Proof. For each row, the expected excess in the probed counter due to collisions is at most N/w, so by Markov's inequality a single row exceeds 2N/w with probability at most 1/2; since the d rows use independent hash functions, all of them exceed 2N/w with probability at most (1/2)^d. A detailed proof can be found in Cormode et al. [144].
By choosing w and d sufficiently large, we can achieve high accuracy in comparatively little space.
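As a numerical illustration (our own numbers, not from the original text): choosing w = 2000 and d = 10 requires only 20,000 counters, yet every point estimate is within 2N/2000 = N/1000 of the true count with probability at least 1 − (1/2)^10 ≈ 0.999.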
This section describes the fingerprint, a synopsis data structure that maintains a small footprint while representing the data accurately.
Objective 10.1 — Fingerprint. Analysis of huge groups of complex objects is now common practice. The fingerprint technique sidesteps the “curse of dimensionality” in costly object comparisons by comparing the small fingerprints first, so that full objects are compared only when their fingerprints match.
Definition 10.4.1 — Fingerprint. Small fingerprints effectively represent large data objects, based on the requirement that the probability that two distinct objects have the same fingerprint is very small.
A fingerprinting scheme is a collection of functions F = {f : Δ → {0, 1}^k}, where Δ is the set of all possible objects of interest and k is the length of the fingerprint.
Complexity 10.4.1 A fingerprint of length k occupies only k bits, so there are 2^k possible fingerprint values.
10.4.1 Fingerprinting scheme of Rabin
Rabin’s scheme [152] yields a simple real-time string matching algorithm and a method for securing files against unauthorized changes. The scheme is based on arithmetic modulo an irreducible polynomial with coefficients in Z2. Let A = (a1, ..., am) be a binary string, and associate with it the polynomial A(x) = a1 x^(m−1) + a2 x^(m−2) + ... + am with coefficients in Z2. Choose an irreducible polynomial P(x) of degree k over Z2 uniformly at random; the fingerprint of A is f(A) = A(x) mod P(x).
Definition 10.4.2 — Irreducible polynomial. A polynomial P(x) over the finite field Z2 is irreducible if it cannot be written as the product of two polynomials over Z2, each of degree lower than that of P(x).
What should be the value of k, the degree of the irreducible polynomial P(x)?
Any value of k can be used, but the implementation is more convenient if we choose k to be a prime number.
Theorem 10.4.1 Let S be a set of n strings, each of length m. For a randomly chosen irreducible polynomial P(x) of degree k, the probability that some pair of distinct strings in S has the same fingerprint is at most n^2 m / (2^k − 2^(k/2)).
Proof. Let ΦS(x) = ∏A≠B∈S (A(x) − B(x)). By the equivalence of strings and polynomials, a collision between some pair of distinct strings implies ΦS(x) ≡ 0 mod P(x), i.e., P(x) is a factor of ΦS(x). The degree of ΦS(x) is at most n^2 m, so ΦS(x) has at most n^2 m / k irreducible factors of degree k. On the other hand, the number of irreducible polynomials of degree k over Z2 is greater than (2^k − 2^(k/2))/k. Therefore the probability that a random irreducible P(x) divides ΦS(x) is at most (n^2 m / k) / ((2^k − 2^(k/2))/k) = n^2 m / (2^k − 2^(k/2)).
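As a rough numerical illustration (our own numbers, not from the original text): for n = 1000 strings of m = 10^6 bits each and fingerprints of degree k = 64, the bound is approximately n^2 m / 2^k = 10^6 × 10^6 / 2^64 ≈ 5 × 10^(-8), so collisions are extremely unlikely even for large collections.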
10.4.1.1 Useful properties for application development
1. If A ≠ B, then f(A) ≠ f(B) with high probability.
2. Pr(f(A) = f(B) | A ≠ B) is very small.
3. The string A is identified with the polynomial A(x) with coefficients over Z2.
4. Fingerprinting is distributive over addition (in Z2): f(A + B) = f(A) + f(B).
5. Fingerprints can be computed in linear time.
6. The fingerprint of a concatenation can be computed as f(concat(A, B)) = f(concat(f(A), B)).
7. With n = length(B), we have f(concat(A, B)) = (A(x) × x^n + B(x)) mod P(x) = (f(A) × x^n + B(x)) mod P(x).
The operations on polynomials have simple implementations: addition is equivalent to bit-wise exclusive OR, and multiplication by x is equivalent to a one-bit left shift.
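As a rough sketch of how these bit operations combine, the following Python function computes a Rabin-style fingerprint of a byte string; the fixed degree-31 polynomial x^31 + x^3 + 1 is an assumption made for illustration, whereas Rabin's scheme requires P(x) to be chosen at random from the irreducible polynomials of degree k.

# Degree k = 31; bit pattern of P(x) = x^31 + x^3 + 1 (believed irreducible; in
# Rabin's scheme P(x) should be drawn at random from all irreducible degree-k polynomials).
K = 31
P = (1 << 31) | (1 << 3) | 1

def rabin_fingerprint(data: bytes) -> int:
    """Return A(x) mod P(x), processing the message one bit at a time."""
    f = 0
    for byte in data:
        for bit in range(7, -1, -1):
            f = (f << 1) | ((byte >> bit) & 1)   # multiply by x, then add the next bit
            if f >> K:                           # degree reached k: reduce modulo P(x)
                f ^= P
    return f

# Equal strings always collide; distinct strings collide only with tiny probability.
print(hex(rabin_fingerprint(b"massive data")))
print(rabin_fingerprint(b"massive data") == rabin_fingerprint(b"massive data!"))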
Table 10.1: Comparison of hashing and fingerprinting
▪ Example 10.3 — Duplicate detection in large document corpus. Collections of documents are often very large. Duplicate detection using naive search is very costly. Fingerprinting is a good alternative. It matches fingerprints instead of documents and checks documents for possible duplications only when fingerprint matches are found. ▪
Wavelets are common mathematical functions often used by database researchers for summarization and decomposition. Wavelet algorithms are capable of processing data at various scales and resolution levels.
Objective 10.2 — Wavelet as synopsis data structure. Wavelets are used to capture crucial information about data, such as broad trends and local characteristics, in higher- and lower-order coefficients. The retained coefficients are used to answer queries about the data within a bounded space budget.
We rely on large windows to observe the gross characteristics of a signal and on small windows to examine its fine characteristics. This section discusses the use of Haar wavelets for simple hierarchical decompositions of streaming and other large data collections.
Wavelet decompositions are special mathematical transforms for capturing data trends in numerical sequences (Figure 10.2). For typical data sets only a few wavelet coefficients are significant, so only a small number of significant coefficients need to be stored to approximate a data sequence. Although the wavelet literature covers a vast range of transforms, we generally use only the basic wavelet decomposition for synopsis data structures.
Figure 10.2: Wavelet decomposition
1. The decomposition is computed by convolving the signal with a low-pass filter (pairwise averaging) and a high-pass filter (pairwise differencing).
2. In the discrete case, if there are N values in the array, this process yields N/2 averages and N/2 differences (the averages and differences are scaled by suitable scaling factors).
3. We store the differences as the wavelet coefficients at this level, and repeat the computation of averages and differences on the N/2 averages until we are left with only one average and N − 1 differences over log N scales or resolutions.
4. The total collection of all the differences over the log N scales, together with the final average, gives the Haar wavelet transform of the input signal.
The entire computation can be naturally represented by a binary tree over the signal array, with each node storing the average of the values below it and the difference between its left and right children.
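A minimal Python sketch of the (unnormalized) Haar decomposition described above, under the assumption that the input length is a power of two; keeping only the largest coefficients yields the synopsis.

def haar_transform(values):
    """Return (overall average, list of detail coefficients) for a sequence
    whose length is a power of two, using pairwise averages and differences
    as described above."""
    data = list(values)
    details = []
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = diffs + details      # coarser-level coefficients go in front
        data = averages
    return data[0], details

# Example: 8 values -> 1 overall average and 7 detail coefficients.
avg, coeffs = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(avg, coeffs)
# A synopsis keeps only the few coefficients with the largest magnitude.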
▪ Example 10.4 — Computer vision. Marr’s theory was that image processing in the human visual system has a complicated hierarchical structure that involves several layers of processing. At each processing level, the retinal system provides a visual representation that scales progressively in a geometrical manner. His arguments hinged on the detection of intensity changes. He theorized that intensity changes occur at different scales in an image, so that their optimal detection requires the use of operators of different sizes. He also theorized that sudden intensity changes produce a peak or trough in the first derivative of the image. These two hypotheses require that a vision filter have two characteristics:
1.It should be a differential operator.
2.It should be capable of tuning to act at any desired scale. Marr’s operator is now referred to as the Marr wavelet. ▪
▪ Example 10.5 — Wavelet tree for text compression. The wavelet tree data structure was initially proposed for text compression applications and for other uses in text indexing and retrieval. A wavelet tree can store texts and provide indexing and compression mechanisms, and it can also store auxiliary information for compressed-text algorithms. Another use is as a general tool that reduces the problem of compressing a string over an arbitrary alphabet to that of compressing binary strings.
Wavelet trees can be used to reorganize natural text that has been word-compressed in order to guarantee self-synchronization, even for compression codes that are not self-synchronized. Self-synchronization of the compressed text permits fast search and random access, so all the advantages (good compression, fast search, and random access) can be gained simultaneously. The first step is to compress the text; the bytes of all code words are then reorganized in the order in which they appear in the nodes of a structure closely resembling a wavelet tree. ▪
Exercise 10.1 What is a synopsis data structure? What is the need for using synopsis data structures? ▪
Exercise 10.2 Describe the window sampling method and its uses on massive data. ▪
Exercise 10.3 Describe the random sampling method and its uses on massive data. ▪
Exercise 10.4 Describe the reservoir sampling method and its uses on massive data. ▪
Exercise 10.5 Describe the random sketches method and its uses on massive data. ▪
Exercise 10.6 Describe the sketching of frequency moments method and its uses on massive data. ▪
Exercise 10.7 Describe the fingerprint method and its uses on massive data. ▪
Exercise 10.8 Describe the histogram method and its uses on massive data. ▪