Data science underpins the smart world and artificial intelligence of the future. It blends data inference, algorithm development, and advances in computing to solve complex problems and add value across government and commerce.
Objective 14.1 This chapter covers applications of synopsis data structures and various hashing techniques for analyzing large amounts of data in reasonable time.
14.1 Heavy Hitters and Count-Min Structures
A paper by Cormode et al. [186] describes the count-min sketch, a data structure that does more than approximate data distributions. Its main uses are described below.
•A heavy hitter query (HH(k) request) seeks the set of high-frequency elements (say, all elements whose frequency exceeds a given fraction of the total count).
•Finding heavy hitters is also of interest in the context of signal processing. Since the data distribution is defined by a signal, recovering the heavy hitters is the key to finding the best sparse approximation of the signal. Consequently, the count-min sketch can be used in a compressed sensing paradigm to recover an approximation of the original signal [187].
•If
•Applications like natural language processing (NLP) generate large amounts of data. Therefore, it is important to store statistics on word combinations, such as pairs or triples of words that appear in sequence. In one experiment, researchers compacted 90 GB of such data into 8 GB of count-min sketches.
•Another use of a count-min sketch is in password design. The count-min structure can be used to track how often each password is chosen (see http://www.youtube.com/watch?v=qo1cOJFEF0U). A good feature of this approach is that the impact of a false positive is small: at worst, an uncommon password is treated as if it were common.
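As a concrete illustration of the count-min idea, the following Python sketch (a minimal, self-contained version, not the exact structure or parameters from [186]) maintains a small table of counters, answers point queries, and reports heavy hitters among a supplied set of candidate items.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width=2000, depth=5):
        self.width = width          # counters per row (controls the additive error)
        self.depth = depth          # number of rows (controls the failure probability)
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0              # total count of all updates seen

    def _index(self, item, row):
        # Derive a per-row hash by salting the item with the row number.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def update(self, item, count=1):
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Point query: the minimum over rows never underestimates the true count
        # and overestimates it only slightly with high probability.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

    def heavy_hitters(self, candidates, phi=0.01):
        # Report candidate items whose estimated frequency exceeds phi * total.
        return [x for x in candidates if self.estimate(x) >= phi * self.total]


# Usage: feed a stream of words and query the frequent ones.
cms = CountMinSketch()
stream = ["apple", "banana", "apple", "cherry", "apple", "banana"]
for w in stream:
    cms.update(w)
print(cms.estimate("apple"))                     # at least 3, usually exactly 3
print(cms.heavy_hitters(set(stream), phi=0.3))   # items above 30% of the stream
```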
14.2 Approximate Nearest Neighbor Searches
The nearest neighbor (NN) search is an important step in structural analysis and in processing other types of queries. NN search is very useful for dealing with massive data sets, but it suffers from the curse of dimensionality [34, 36]. NN searches were previously used mainly for low-dimensional data. A recent surge of results shows that NN search is also useful for analyzing large data collections if a suitable data structure (KD tree, quadtree, R tree, metric tree, locality-sensitive hashing (LSH)) is used [42, 41, 40, 38]. One more advantage of using NN search for large data analysis is the availability of efficient approximation schemes, which provide almost the same results in much less time [49, 37].
14.2.1 Approximate nearest neighbor
Consider a metric space (S, d) and a finite subset SD ⊂ S of data points on which nearest neighbor queries are to be made. The aim of approximate NN search is to organize SD so that NN queries can be answered efficiently. For any q ∈ S, the NN problem consists of finding a point p ∈ SD such that d(p, q) is minimal over all points of SD. We denote this by p = NN(q, SD).
An ε-approximate NN of q ∈ S is a point p ∈ SD such that d(p, q) ≤ (1 + ε) d(p*, q), where p* = NN(q, SD) is the exact nearest neighbor.
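As a brute-force baseline (not an efficient index), the snippet below computes the exact nearest neighbor by linear scan and checks whether a candidate point satisfies the ε-approximate condition; the 2-D points and the value of ε are illustrative.

```python
import math

def dist(a, b):
    # Euclidean metric d(., .) on points given as coordinate tuples.
    return math.dist(a, b)

def nearest_neighbor(q, SD):
    # Exact NN(q, SD) by linear scan: O(|SD|) distance evaluations.
    return min(SD, key=lambda p: dist(p, q))

def is_eps_approx_nn(p, q, SD, eps):
    # p is an eps-approximate NN of q iff d(p, q) <= (1 + eps) * d(p*, q).
    p_star = nearest_neighbor(q, SD)
    return dist(p, q) <= (1 + eps) * dist(p_star, q)

SD = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
q = (0.9, 0.9)
print(nearest_neighbor(q, SD))                        # (1.0, 1.0)
print(is_eps_approx_nn((2.0, 0.5), q, SD, eps=0.5))   # False: far from q
```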
14.2.2 Locality-sensitive hashing (LSH)
Several methods to answer nearest neighbor queries are described in the literature, and locality-sensitive hashing (LSH) is the most popular because of its dimension-independent running time [46, 45]. In locality-sensitive hashing, the hash function has the property that close points are hashed into the same bucket with high probability, while distant points are hashed into the same bucket with low probability. Mathematically, a family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive if for any p, q ∈ S
•if p ∈ B(q, r1) then PrH[h(q) = h(p)] ≥ p1
•if p ∉ B(q, r2) then PrH[h(q) = h(p)] ≤ p2
where B(q, r) denotes the hypersphere of radius r centered at q. For a locality-sensitive family to be useful, it has to satisfy p1 > p2 and r1 < r2 when d is a distance, or p1 > p2 and r1 > r2 when d is a similarity measure [27, 29]. The value δ = log(1/p1)/log(1/p2) determines the search performance of LSH. Using an (r, r(1 + ε), p1, p2)-sensitive family, the (1 + ε)-approximate NN problem can be solved by hashing the points and searching within the buckets [44, 43, 40].
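The sketch below illustrates one well-known locality-sensitive family, random-hyperplane (SimHash) hashing for angular distance; it is not the specific construction analyzed in [44, 43, 40], and the dimension, number of bits, and data points are illustrative.

```python
import random

def make_hyperplane_hash(dim, num_bits, seed=0):
    """Random-hyperplane (SimHash) LSH for angular distance on R^dim.

    Each bit is sign(r . x) for a random Gaussian vector r; vectors separated
    by angle theta agree on each bit with probability 1 - theta/pi, so close
    points land in the same bucket far more often than distant ones.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    def h(x):
        bits = []
        for r in planes:
            dot = sum(ri * xi for ri, xi in zip(r, x))
            bits.append("1" if dot >= 0 else "0")
        return "".join(bits)        # bucket key

    return h

# Build buckets for a small dataset, then probe with a query point.
h = make_hyperplane_hash(dim=3, num_bits=8)
data = [(1.0, 0.9, 0.1), (0.9, 1.0, 0.0), (-1.0, 0.2, 0.8)]
buckets = {}
for p in data:
    buckets.setdefault(h(p), []).append(p)

q = (1.0, 1.0, 0.05)
candidates = buckets.get(h(q), [])   # search only within q's bucket
print(candidates)
```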
14.3 Low Rank Approximation by Sampling
Traditionally in information retrieval and machine learning, data are represented in the form of vectors. These vector collections are then stored in a single matrix A ∈ Rn×m, where each column of A corresponds to a vector in the n-dimensional space. The advantage of using a vector space model is that its algebraic structure can be exploited [205]. However, the essential information contained in the data must be preserved while its size is reduced. The widely used method for this purpose is to approximate the data matrix A with a lower-rank matrix. Mathematically, according to the Frobenius criterion, the optimal rank-k estimate of the matrix A is computed as follows: find a matrix B ∈ Rn×m with rank(B) = k such that ||A − B||F is minimum. The matrix B can be readily obtained by computing the singular value decomposition (SVD) of A, as stated by Golub and Van Loan [206]. For any approximation M of A, we call ||A − M||F the reconstruction error of the approximation.
SVD computation is interesting from a theoretical point of view because it provides the closest matrix of a given rank. For many applications where the data matrix is large, however, computing the SVD can be impractical because it requires a large number of operations and a large amount of memory. Recent studies have therefore focused on algorithms that are not optimal, in the sense that the low-rank matrix they compute is only close to, rather than exactly, the best approximation of the original matrix. Reported work shows that they have an advantage over SVD-based algorithms because they require less time and memory. Low-rank approximations have various applications such as latent semantic indexing, support vector machine training, machine learning, computer vision, and web search models. In some of these applications, the data consist of a matrix of pairwise distances between the nodes of a complex network, which is approximated by a low-rank matrix for fast community detection using a distance-based partitioning algorithm. Computing such a low-rank approximation can reveal the underlying structure of the data and allow for fast computations.
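For concreteness, the following NumPy snippet (with a random matrix standing in for real data) computes the optimal rank-k approximation via truncated SVD and reports its Frobenius reconstruction error.

```python
import numpy as np

def best_rank_k(A, k):
    """Optimal rank-k approximation of A in the Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
B = best_rank_k(A, k=5)
print(np.linalg.matrix_rank(B))          # 5
print(np.linalg.norm(A - B, "fro"))      # reconstruction error ||A - B||_F
```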
Any symmetric positive semi-definite (SPSD) matrix can be approximated from a subset of its columns using Nyström methods. Specifically, given an m × m SPSD matrix A, the method samples c (< m) columns of A to construct an m × c matrix C. Without loss of generality, we assume that C consists of the first c columns of A. We partition A and C as
A = \begin{bmatrix} W & A_{21}^{T} \\ A_{21} & A_{22} \end{bmatrix}, \qquad C = \begin{bmatrix} W \\ A_{21} \end{bmatrix},
where W and A21 are of size c × c and (m − c) × c, respectively [31].
Definition 14.3.1 The standard Nyström approximation of A is Ãk = C Wk⁺ Cᵀ, where Wk is the best rank-k approximation of W and Wk⁺ denotes its Moore–Penrose pseudoinverse.
Since the running time of the rank-k SVD of W is O(kc²) and the multiplication with C takes O(kcm), the total complexity of computing the Nyström approximation is O(kcm) [32, 33].
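A minimal NumPy sketch of the standard Nyström approximation is shown below; it assumes, as in the text, that the sample consists of the first c columns, and the test matrix and parameter values are illustrative.

```python
import numpy as np

def nystrom(A, c, k=None):
    """Nystrom approximation of an m x m SPSD matrix A from its first c columns:
    C = A[:, :c], W = A[:c, :c], approximation = C W^+ C^T (optionally rank-k)."""
    C = A[:, :c]                       # m x c column sample
    W = A[:c, :c]                      # c x c intersection block
    if k is not None:
        # Rank-k pseudoinverse of W via its eigendecomposition (W is SPSD).
        vals, vecs = np.linalg.eigh(W)
        idx = np.argsort(vals)[::-1][:k]
        inv_vals = np.array([1.0 / v if v > 1e-12 else 0.0 for v in vals[idx]])
        W_pinv = vecs[:, idx] @ np.diag(inv_vals) @ vecs[:, idx].T
    else:
        W_pinv = np.linalg.pinv(W)
    return C @ W_pinv @ C.T            # m x m approximation of A

# Example on a random SPSD matrix (a kernel-like Gram matrix).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
A = X @ X.T                            # SPSD by construction
A_tilde = nystrom(A, c=30, k=10)
print(np.linalg.norm(A - A_tilde, "fro") / np.linalg.norm(A, "fro"))
```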
In random sketching, a relation is modelled as a vector or matrix. The sketch is formed by multiplying the data by a random vector or matrix [21]. The resulting sketch vector forms a synopsis of the data that is much smaller than the original data, and data mining algorithms can then be applied to the sketch vector.
Definition 14.3.2 Johnson–Lindenstrauss Lemma 1. For any set X of n points in Rd, there is a (1 + ε)-distortion embedding of X into Rk (k < d), for k = O(ε⁻² log n).
Definition 14.3.3 Johnson–Lindenstrauss Lemma 2. There is a distribution over random linear mappings A: Rd → Rk (k < d) such that, for any vector x, ||Ax|| = (1 ± ε)||x|| with probability at least 1 − δ, provided k = O(ε⁻² log(1/δ)).
The Johnson–Lindenstrauss theorems [30] for sketches ensure that pairwise distances are preserved under the lower-dimensional approximation and give bounds on the dimension required for a given approximation quality.
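The snippet below gives a minimal NumPy illustration of such a random linear mapping; the constant used to set the target dimension k is illustrative rather than a tight bound from [30].

```python
import numpy as np

def jl_project(X, eps=0.2, seed=0):
    """Random linear map R^d -> R^k with k = O(log n / eps^2), scaled Gaussian entries."""
    n, d = X.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))      # illustrative constant, k < d here
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, k)) / np.sqrt(k)  # E[||xA||^2] = ||x||^2
    return X @ A

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 1000))              # 500 points in R^1000
Y = jl_project(X, eps=0.2)

# Check the distortion of one pairwise distance.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(proj / orig)                                # close to 1 with high probability
```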
We have used random sketches for low-rank approximations of a dataset. The procedure is as follows:
•Step 1: Count the number of attributes in the dataset.
•Step 2: Generate a sketch matrix whose number of rows equals the number of attributes in the dataset. The sketch matrix can be filled with three kinds of values: (1) random values between 0 and 1, (2) random 0/1 values, or (3) random values between the minimum and maximum values of the dataset.
•Step 3: Multiply each row of the dataset by the sketch matrix to generate a sketch vector.
After the sketch vectors are generated, machine learning algorithms are applied to the sketched data.
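A minimal NumPy interpretation of these three steps is sketched below; the sketch width, fill modes, and random data are illustrative choices, not prescribed by the text.

```python
import numpy as np

def sketch_dataset(data, sketch_cols=10, mode="uniform01", seed=0):
    """Count the attributes, build a sketch matrix, and multiply each data row by it."""
    rng = np.random.default_rng(seed)
    n_rows, n_attrs = data.shape                   # Step 1: number of attributes
    if mode == "uniform01":                        # Step 2, option (1): values in [0, 1]
        S = rng.random((n_attrs, sketch_cols))
    elif mode == "binary":                         # option (2): 0/1 values
        S = rng.integers(0, 2, (n_attrs, sketch_cols)).astype(float)
    else:                                          # option (3): between data min and max
        S = rng.uniform(data.min(), data.max(), (n_attrs, sketch_cols))
    return data @ S                                # Step 3: one sketch vector per row

data = np.random.default_rng(3).standard_normal((1000, 50))
sketched = sketch_dataset(data, sketch_cols=10)
print(sketched.shape)                              # (1000, 10): ready for ML algorithms
```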
14.4 Near-Duplicate Detection by Min Hashing
The MinHash technique invented by Andrei Broder can quickly estimate similarities between two sets [188, 189]. Initially, it was used in the AltaVista search engine to detect duplicate web pages, clustering documents by the similarities of their sets of words.
MinHash combines the Jaccard similarity coefficient with hash functions. The Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| measures the similarity of two sets A and B: a value of 0 indicates that they have no elements in common, and a value of 1 indicates that both sets contain the same elements.
MinHash methodology can be classified into two types: variants with many hash functions and variants with only one.
In the many-hash-functions variant, the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values hmin(S) for these functions. To estimate the Jaccard similarity coefficient J(A, B), let y be the number of hash functions for which hmin(A) = hmin(B), and use y/k as the estimate.
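A minimal Python sketch of the many-hash-functions variant is shown below; the linear hash family and the choice k = 200 are illustrative and not part of Broder's original scheme.

```python
import random

def make_hashes(k, seed=0):
    # k independent hash functions of the form h(x) = (a * x + b) mod p.
    rng = random.Random(seed)
    p = 2_147_483_647                               # a large prime
    return [(rng.randrange(1, p), rng.randrange(0, p), p) for _ in range(k)]

def minhash_signature(items, hashes):
    # h_min(S) for each hash function: the minimum hash value over the set.
    return [min((a * hash(x) + b) % p for x in items) for (a, b, p) in hashes]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of hash functions on which the two sets share the same minimum.
    y = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return y / len(sig_a)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox sleeps under the lazy dog".split())
hashes = make_hashes(k=200)
print(estimate_jaccard(minhash_signature(A, hashes), minhash_signature(B, hashes)))
print(len(A & B) / len(A | B))                      # exact Jaccard (0.6) for comparison
```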
Project 14.1 — Crime Rate Prediction Using K-Means. The rate of crime has increased in many countries in the past few years, and high crime rates worldwide require effective protection. Create a crime prediction system that uses historical crime data to estimate future crime rates. Use the K-means algorithm and a suitable data structure to identify and store different crime patterns, hidden links, link predictions, and statistical analyses of crime data. Administrators should be able to view historical criminal data. Crime prediction relies on historical crime statistics along with geospatial and demographic data.
Project 14.2 — Online Test System Using a Clustering Algorithm. Create a user-friendly online test system based on an interactive web application. The system should provide fast access and information retrieval, and the test results should show rankings and weights. [Hint: use any clustering algorithm for classification of question papers.]
Project 14.3 — Information Leak Detection System. Consider a scenario in which a sender wants to transmit confidential data to a number of receivers. Due to an accident, the information is leaked. The sender wants to determine whether the leaked data came from one or more of his receivers. Design an information leak detection system that supports data allocation; treat the data allocation as an input that helps identify leaks. You can also insert realistic but fake data records to further improve the chances of identifying unknown data leakages and the receiver responsible for them.