Distances

In the field of data mining, we often need to determine which members of a training set are closest to unknown test instances. A good assortment of distance functions is essential for any algorithm that performs this search, and SciPy provides, for this purpose, a large collection of efficiently coded functions in the distance submodule of the scipy.spatial module. The list is long; besides Euclidean, squared Euclidean, and standardized Euclidean, we have many more: Bray-Curtis, Canberra, Chebyshev, Manhattan, correlation distance, cosine distance, Dice dissimilarity, Hamming, Jaccard-Needham, Kulsinski, Mahalanobis, and so on. In most cases, the syntax is simple:

distance_function(first_vector, second_vector)
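For instance, with two small illustrative vectors (the values here are made up), a few of these functions can be called as follows:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev

# Two illustrative vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(euclidean(u, v))   # sqrt(3**2 + 4**2 + 0**2) = 5.0
print(cityblock(u, v))   # |3| + |4| + |0| = 7.0  (Manhattan distance)
print(chebyshev(u, v))   # max(3, 4, 0) = 4.0
```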

The only three cases in which the syntax is different are the Minkowski, Mahalanobis, and standardized Euclidean distances, in which the distance function requires a third argument: the order p of the norm in the definition of the Minkowski distance, the inverse of the covariance matrix for the Mahalanobis case, or a one-dimensional array of component variances to standardize the Euclidean distance.
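A sketch of these three signatures follows; the vectors and the covariance matrix are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import minkowski, mahalanobis, seuclidean

u = np.array([1.0, 0.0, 2.0])
v = np.array([0.0, 1.0, 1.0])

# Minkowski: the third argument is the order p of the norm
d1 = minkowski(u, v, 3)

# Mahalanobis: the third argument is the inverse of the covariance matrix
C = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])   # an invented covariance matrix
d2 = mahalanobis(u, v, np.linalg.inv(C))

# Standardized Euclidean: the third argument is a 1-D array of variances
V = np.array([0.5, 1.0, 2.0])
d3 = seuclidean(u, v, V)
```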

Let us now look at a fun exercise to visualize the unit balls of the Minkowski metric:

>>> import numpy
>>> from scipy.spatial.distance import minkowski
>>> Square = numpy.mgrid[-1.1:1.1:512j, -1.1:1.1:512j]
>>> X = Square[0]; Y = Square[1]
>>> f = lambda x, y, p: minkowski([x, y], [0.0, 0.0], p) <= 1.0
>>> Ball = lambda p: numpy.vectorize(f)(X, Y, p)

We have created a function, Ball, which creates a 512 x 512 grid of Boolean values. The grid represents a square of side length 2.2 centered at the origin, with sides parallel to the coordinate axes, and the True values on it represent all those points of the grid inside the unit ball of the Minkowski metric with parameter p. All we have to do is display it graphically, as in the following example:

>>> import matplotlib.pyplot as plt
>>> plt.imshow(Ball(3), cmap=plt.cm.gray)
>>> plt.axis('off')
>>> plt.subplots_adjust(left=0.0127, bottom=0.0164,
...     right=0.987, top=0.984)
>>> plt.show()

This produces the following, where Ball(3) is a unit ball in the Minkowski metric with parameter p = 3:


We feel the need to issue the following four important warnings:

This means that, among all the points in the dataset, the one closest to the origin in the Euclidean distance is the fifth one (index 4), and that distance is approximately 4.6 units.

We can input more than one point; the output will still be a tuple, where the first entry is an array containing the smallest distance to each of the input points, and the second entry is another array containing the indices of the nearest neighbors.
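The behavior described above matches a k-d tree nearest-neighbor query in scipy.spatial. A minimal sketch follows; the dataset is invented so that the fifth point (index 4) happens to be nearest the origin, at roughly 4.6 units, mirroring the scenario above:

```python
import numpy as np
from scipy.spatial import cKDTree

# An invented dataset of 2-D points
data = np.array([[4.0, 3.0],
                 [5.0, 5.0],
                 [2.0, 8.0],
                 [9.0, 0.5],
                 [3.0, 3.5]])

tree = cKDTree(data)

# A single query point: returns the scalar distance and index of the
# nearest neighbor
d, i = tree.query([0.0, 0.0])
print(d, i)          # about 4.61, and index 4

# Several query points at once: each entry of the tuple is an array
dists, idxs = tree.query([[0.0, 0.0], [4.0, 4.0]])
print(dists, idxs)   # nearest distances and indices, one per query point
```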