Independent component analysis

We have seen that the factors extracted by a PCA are decorrelated, but not independent. A classic example is the cocktail party: we have a recording of many overlapping voices and we would like to separate them. Every single voice can be modeled as a random process, and it's possible to assume that they are statistically independent (this means that the joint probability can be factorized using the marginal probabilities of each source). Using FA or PCA, we are able to find uncorrelated factors, but there's no way to assess whether they are also independent (normally, they aren't). In this section, we are going to study a model that is able to produce sparse representations (when the dictionary isn't under-complete) with a set of statistically independent components.
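As a quick preview of the technique discussed in the rest of this section, the following is a minimal sketch of the cocktail-party scenario based on the FastICA implementation provided by scikit-learn (the signal shapes and the mixing matrix are arbitrary choices made only for illustration):

    import numpy as np
    from sklearn.decomposition import FastICA

    # Two independent, non-Gaussian sources ("voices"): a sine wave and a square wave
    t = np.linspace(0.0, 8.0, 2000)
    S = np.c_[np.sin(2.0 * t), np.sign(np.sin(3.0 * t))]

    # Arbitrary mixing matrix A: each "microphone" records a linear mixture
    A = np.array([[1.0, 0.5],
                  [0.4, 1.0]])
    X = S @ A.T

    # Recover the independent components (up to permutation and scaling)
    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)
    print(S_hat.shape)  # (2000, 2): the two estimated sources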

Let's assume we have a zero-centered and whitened dataset X sampled from N(0, I) and a noiseless linear transformation of a set of latent factors z:

    x = Az

In this case, the prior over z is modeled as a product of independent variables (α is the normalization factor), each of them represented as a generic exponential, where the function f_k(z) must be non-quadratic, that is, p(z; θ) cannot be Gaussian:

    p(z; θ) = α ∏_k exp(f_k(z_k))

Furthermore, we assume that the variance of z_i is equal to 1; therefore, p(x|z; θ) ∼ N(Az, AA^T). The joint probability p(X, z; θ) = p(X|z; θ)p(z; θ) is equal to the following:

    p(X, z; θ) = α N(Az, AA^T) ∏_k exp(f_k(z_k))

If X has been whitened, A is orthogonal (the proof is straightforward); hence, the previous expression can be simplified. However, applying the EM algorithm requires determining p(z|X; θ) and this is quite difficult. The process could be easier after choosing a suitable prior distribution for z, that is, f_k(z), but as we discussed at the beginning of the chapter, this assumption can have dramatic consequences if the real factors are distributed differently. For these reasons, other strategies have been studied.
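For completeness, the orthogonality of A follows in one line from the assumptions made above: the factors are independent with unit variance, so E[zz^T] = I, and the whitening of X gives the following:

    E[xx^T] = E[Azz^T A^T] = A E[zz^T] A^T = AA^T = I

Hence, A^{-1} = A^T.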

The main concept that we need to enforce is having a non-Gaussian distribution of the factors. In particular, we'd like to have a peaked distribution (inducing sparseness) with heavy tails. From the theory, we know that the standardized fourth moment (also called kurtosis) is a natural measure:

    Kurt[X] = E[X^4] / E[X^2]^2

For a Gaussian distribution, Kurt[X] is equal to 3 (which is often considered as the reference point, determining the so-called excess kurtosis = kurtosis - 3), while it's larger for a family of distributions called leptokurtic or super-Gaussian, which are peaked and heavy-tailed (the distributions with Kurt[X] < 3, called platykurtic or sub-Gaussian, can also be good candidates, but they are less peaked and normally only the super-Gaussian distributions are taken into account). However, even if accurate, this measure is very sensitive to outliers because of the fourth power. For example, if x ∼ N(0, 1) and z = x + ν, where ν is a noise term that alters a few samples, pushing them towards much larger values, the result can be a super-Gaussian (Kurt[z] > 3) even if, after filtering the outliers out, the distribution has Kurt[x] = 3 (Gaussian).
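The sensitivity to outliers described above is easy to verify numerically. In the following sketch (the sample size, the contamination level, and the outlier value are arbitrary choices; a fairly extreme value is used to make the effect evident), altering 1% of the points is enough to push the sample kurtosis of an otherwise Gaussian dataset well above 3:

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.RandomState(1000)
    x = rng.normal(0.0, 1.0, size=10000)
    print(kurtosis(x, fisher=False))  # ~3.0: the Gaussian reference value

    # Contaminate 1% of the samples with a large constant value
    z = x.copy()
    z[rng.choice(len(z), size=100, replace=False)] = 5.0
    print(kurtosis(z, fisher=False))  # >> 3: the sample now looks super-Gaussian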

To overcome this problem, Hyvärinen and Oja (Independent Component Analysis: Algorithms and Applications, Hyvarinen A., Oja E., Neural Networks 13/2000) proposed a solution based on another measure, the negentropy. We know that the entropy of a distribution grows with its variance and that, given the variance, the Gaussian distribution has the maximum entropy (for further information, read Mathematical Foundations of Information Theory, Khinchin A. I., Dover Publications); therefore, we can define the following measure:

    H_N(X) = H(X_Gaussian) - H(X)

Formally, the negentropy of X is the difference between the entropy of a Gaussian distribution with the same covariance and the entropy of X (we are assuming both zero-centered). It's immediately possible to understand that H_N(X) ≥ 0; hence, the only way to maximize it is by reducing H(X). In this way, X becomes less random, concentrating the probability around the mean (in other words, it becomes super-Gaussian). However, the previous expression cannot be easily adapted to closed-form solutions, because H(X) needs to be computed over the entire distribution of X, which must be estimated. For this reason, the same authors proposed an approximation based on non-quadratic functions (remember that, in the context of ICA, a quadratic function can never be employed because it would lead to a Gaussian distribution) that is useful for deriving a fixed-point iterative algorithm called FastICA (which is indeed much faster than EM).
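As a reference for the previous claims, the differential entropy of a univariate Gaussian distribution depends only on its variance:

    H(N(μ, σ^2)) = (1/2) log(2πeσ^2)

and, for a fixed variance, no other distribution can reach a higher entropy; this is the reason why H_N(X) can never become negative.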

Using k non-quadratic functions f_k(x), the approximation becomes as follows:

    H_N(X) ≈ Σ_k c_k (E[f_k(X)] - E[f_k(ν)])^2

Here, the c_k terms are positive constants and ν is a standard normal variable, ν ∼ N(0, 1).

In many real-life scenarios, a single function is enough to achieve a reasonable accuracy, and one of the most common choices for f(x) is as follows:

    f(x) = (1/a) log cosh(ax)    with 1 ≤ a ≤ 2

In the aforementioned paper, the reader can find some alternatives that can be employed when this function fails to enforce statistical independence between the components.
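A minimal numerical check of the approximation (using only this function with a = 1 and dropping the proportionality constant, which is an arbitrary simplification) shows that a peaked, heavy-tailed distribution such as the Laplace one yields a much larger value than a Gaussian sample with the same variance:

    import numpy as np

    rng = np.random.RandomState(1000)

    def negentropy_approx(x, a=1.0):
        # H_N(x) is proportional to (E[f(x)] - E[f(nu)])^2, with nu ~ N(0, 1)
        f = lambda v: np.log(np.cosh(a * v)) / a
        nu = rng.normal(0.0, 1.0, size=len(x))
        return (np.mean(f(x)) - np.mean(f(nu))) ** 2

    # Both samples are zero-centered with unit variance, so only the shape matters
    gaussian = rng.normal(0.0, 1.0, size=100000)
    laplace = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=100000)

    print(negentropy_approx(gaussian))  # very close to 0
    print(negentropy_approx(laplace))   # a few orders of magnitude larger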

If we invert the model, we get z = Wx, with W = A^{-1}; therefore, considering a single sample and a single weight vector w (a row of W), the approximation becomes as follows:

    H_N(z) ≈ (E[f(w^T x)] - E[f(ν)])^2

Clearly, the second term doesn't depend on w (in fact, it's only a reference) and can be excluded from the optimization. Moreover, considering the initial assumptions, E[Z^T Z] = W E[X^T X] W^T = I; therefore, WW^T = I, that is, ||w||_2 = 1. Hence, our goal is to find the following:

    w = argmax_w E[f(w^T x)]    subject to ||w||_2 = 1

In this way, we are forcing the matrix W to transform the input vector x so that z has the lowest possible entropy; therefore, it's super-Gaussian. The maximization process is based on convex optimization techniques that are beyond the scope of this book (the reader can find all the details of Lagrange theorems in Luenberger D. G., Optimization by Vector Space Methods, Wiley); therefore, we directly provide the iterative step that must be performed:

    w_{t+1} = E[x f'(w_t^T x)] - E[f''(w_t^T x)] w_t

Of course, to ensure ||w||_2 = 1, after each step, the weight vector w must be normalized (w_{t+1} = w_{t+1} / ||w_{t+1}||).
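With the common choice f(x) = log cosh(x) (so that f'(x) = tanh(x) and f''(x) = 1 - tanh^2(x)), a single fixed-point update followed by the normalization can be sketched as follows, assuming that X is a whitened array with one sample per row and w is the current weight vector:

    import numpy as np

    def fastica_step(w, X):
        # w_{t+1} = E[x f'(w_t^T x)] - E[f''(w_t^T x)] w_t, with f(x) = log cosh(x)
        wx = X @ w
        w_new = np.mean(X * np.tanh(wx)[:, np.newaxis], axis=0) - \
                np.mean(1.0 - np.tanh(wx) ** 2) * w
        return w_new / np.linalg.norm(w_new)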

In a more general context, the matrix W contains more than one weight vector and, if we apply the previous rule to find out the independent factors, it can happen that some elements, w_i^T x, are correlated. A strategy to avoid this problem is based on the Gram-Schmidt orthonormalization process, which decorrelates the components one by one, subtracting the projections of the current component (w_n) onto all the previous ones (w_1, w_2, ..., w_{n-1}) from w_n. In this way, w_n is forced to be orthogonal to all the other components.
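A sketch of this deflation step, assuming that the rows of W_prev contain the (unit-norm, already orthogonalized) components found so far, could be as follows:

    import numpy as np

    def deflate(w_n, W_prev):
        # Subtract from w_n its projections onto the previously found components
        w_n = w_n - W_prev.T @ (W_prev @ w_n)
        return w_n / np.linalg.norm(w_n)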

Even if this method is simple and doesn't require much effort, a global approach that can work directly with the matrix W at the end of an iteration (so that the order of the weights is not fixed) is preferable. As explained in Fast and robust fixed-point algorithms for independent component analysis, Hyvarinen A., IEEE Transactions on Neural Networks, this result can be achieved with a simple sub-algorithm that we are including in the final FastICA algorithm:

  1. Set random initial values for W_0
  2. Set a threshold Thr (for example, 0.001)
  3. Independent component extraction. For each w in W:
    1. While ||w_{t+1} - w_t|| > Thr:
      1. Compute w_{t+1} = E[x f'(w_t^T x)] - E[f''(w_t^T x)] w_t
      2. Normalize: w_{t+1} = w_{t+1} / ||w_{t+1}||
  4. Orthonormalization. While ||W_{t+1} - W_t||_F > Thr:
    1. W_t = W_t / sqrt(||W_t W_t^T||)
    2. W_{t+1} = (3/2) W_t - (1/2) W_t W_t^T W_t

This process can also be iterated for a fixed number of times, but the best approach is based on using both a threshold and a maximum number of iterations.
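A compact NumPy sketch of the whole procedure could be structured as follows. It assumes that X is already zero-centered and whitened (one sample per row), uses f(x) = log cosh(x), and, as done by most practical implementations, applies the orthonormalization sub-algorithm after every sweep over the rows of W instead of only once at the end; both a threshold and a maximum number of iterations are used:

    import numpy as np

    def symmetric_decorrelation(W, thr=0.001, max_iter=100):
        # W = W / sqrt(||W W^T||), then iterate W = (3/2)W - (1/2)W W^T W
        W = W / np.sqrt(np.linalg.norm(W @ W.T, 2))
        for _ in range(max_iter):
            W_new = 1.5 * W - 0.5 * (W @ W.T @ W)
            if np.linalg.norm(W_new - W, ord='fro') < thr:
                return W_new
            W = W_new
        return W

    def fast_ica(X, n_components, thr=0.001, max_iter=200, seed=1000):
        # X: zero-centered, whitened dataset with shape (n_samples, n_features)
        rng = np.random.RandomState(seed)
        W = symmetric_decorrelation(rng.normal(size=(n_components, X.shape[1])))

        for _ in range(max_iter):
            W_new = np.empty_like(W)
            for i, w in enumerate(W):
                # w_{t+1} = E[x f'(w_t^T x)] - E[f''(w_t^T x)] w_t, f(x) = log cosh(x)
                wx = X @ w
                W_new[i] = np.mean(X * np.tanh(wx)[:, np.newaxis], axis=0) - \
                           np.mean(1.0 - np.tanh(wx) ** 2) * w
                W_new[i] /= np.linalg.norm(W_new[i])
            W_new = symmetric_decorrelation(W_new)
            if np.linalg.norm(W_new - W, ord='fro') < thr:
                W = W_new
                break
            W = W_new

        # Rows of W estimate the rows of A^(-1); Z = X W^T contains the factors
        return W, X @ W.T

As with any ICA formulation, the recovered components are defined only up to a permutation and a sign flip, so the output should be compared with the original sources accordingly.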