Penalized auto-encoders

As we have seen in previous chapters, one approach to preventing overfitting is to use penalties, that is, regularization. In general, our goal is to minimize the reconstruction error. If we have an objective function, F, we may optimize F(y, f(x)), where f() maps the raw data inputs to predicted or expected y values. For auto-encoders, we have F(x, g(f(x))), so that the machine learns the weights and functional forms of f() and g() that minimize the discrepancy between x and its reconstruction, g(f(x)). If we want to use an overcomplete auto-encoder, where the latent representation has a higher dimension than the input, we need to introduce some form of regularization to force the machine to learn a representation that does not simply mirror the input. For example, we might add a penalty based on complexity, so that instead of optimizing F(x, g(f(x))), we optimize F(x, g(f(x))) + P(f(x)), where the penalty function, P, depends on the encoding of the raw inputs, f(x).
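To make the pieces concrete, the following is a minimal sketch of this penalized objective in PyTorch (an assumed framework choice; the layer sizes, the L1 form of the penalty, and the weight lam are illustrative values, not taken from the text):

```python
import torch
import torch.nn as nn

# Illustrative sketch of loss = F(x, g(f(x))) + P(f(x)).
# n_hidden > n_inputs makes the auto-encoder overcomplete.
n_inputs, n_hidden, lam = 784, 1024, 1e-3

f = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid())  # encoder f()
g = nn.Linear(n_hidden, n_inputs)                               # decoder g()

def penalized_loss(x):
    h = f(x)                            # latent code f(x)
    x_hat = g(h)                        # reconstruction g(f(x))
    recon = ((x - x_hat) ** 2).mean()   # F(x, g(f(x))): mean squared error
    penalty = lam * h.abs().mean()      # P(f(x)): an L1 penalty on the code
    return recon + penalty

x = torch.rand(32, n_inputs)            # dummy mini-batch
loss = penalized_loss(x)
loss.backward()
```

Without the penalty term, an overcomplete network like this could drive the reconstruction error to zero by copying the input; the penalty is what forces a more structured code.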

Such penalties differ from those we have seen before, in that the penalty is designed to induce sparseness not of the parameters (the weights) but of the latent variables, H, which are the encoded representations of the raw data. The goal is to learn a latent representation that captures the essential features of the data.
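One common way to implement such a sparsity penalty on the latent variables is a KL-divergence term that pushes the average activation of each hidden unit toward a small target value. This is a sketch of one standard formulation, not the only option; rho and beta are illustrative hyperparameters:

```python
import torch

def sparsity_penalty(h, rho=0.05, beta=1.0):
    # h: (batch, n_hidden) sigmoid activations in (0, 1)
    # rho_hat: average activation of each hidden unit over the batch
    rho_hat = h.mean(dim=0).clamp(1e-7, 1 - 1e-7)
    # KL divergence between the target rate rho and the observed rate rho_hat;
    # it is zero when rho_hat == rho and grows as units activate too often
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()
```

Adding sparsity_penalty(h) to the reconstruction error means most latent units stay near zero for any given input, so each input is described by a small set of active features.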

Another type of penalty that can be used to provide regularization is one based on the derivative of the encoding. Whereas sparse auto-encoders use a penalty that induces sparseness of the latent variables, penalizing the derivatives leads the model to learn a form of f() that is relatively insensitive to minor perturbations of the raw input data, x. In other words, the penalty is large for functions whose encoding changes sharply with small changes in x, so the model prefers regions where the gradient of the encoding with respect to x is relatively flat. This is the idea behind contractive auto-encoders.
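As a sketch of how such a derivative penalty can be computed, assume a single sigmoid encoding layer, h = sigmoid(Wx + b), for which the Jacobian of f() with respect to x has the closed form dh_j/dx_i = h_j(1 - h_j)W_ji. The penalty below is the squared Frobenius norm of that Jacobian; the layer sizes and lam are illustrative:

```python
import torch
import torch.nn as nn

enc = nn.Linear(784, 128)   # weights W and bias b of the encoder

def contractive_penalty(x, lam=1e-4):
    h = torch.sigmoid(enc(x))                    # latent code f(x)
    dh = h * (1 - h)                             # sigmoid derivative per unit
    w_sq = (enc.weight ** 2).sum(dim=1)          # sum_i W_ji^2 for each unit j
    jac_frob_sq = ((dh ** 2) * w_sq).sum(dim=1)  # ||J_f(x)||_F^2 per example
    return lam * jac_frob_sq.mean()
```

Adding this term to a reconstruction loss, as in the earlier sketch, penalizes encodings that vary greatly for small changes in x while leaving the reconstruction term to ensure the code still carries enough information about the input.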