
8. Support Vector Machines

Xian-Da Zhang (deceased)
Department of Automation, Tsinghua University, Beijing, China

Supervised regression/classification methods learn a model of the relation between the targets $\{y_i\}_{i=1}^N$ and the corresponding input vectors $\{\mathbf{x}_i\}_{i=1}^N$ of N training samples, and use this model to predict/classify target values for previously unseen inputs.

In real-world data, the presence of noise (in regression) and class overlap (in classification) implies that the principal modeling challenge is to avoid overfitting the training set.

Traditional neural network approaches can suffer from poor generalization, producing models that overfit the data. In contrast, support vector machines (SVMs) [43] are a popular machine learning approach for classification, regression, and other learning tasks, grounded in statistical learning theory (Vapnik–Chervonenkis or VC theory).

In regression and classification applications, SVMs draw on two main practical observations [5, pp. 12–13]:
  • At a sufficiently high dimension, patterns are orthogonal to each other, and thus it is easier to find a separating hyperplane for data in a high-dimensional space.

  • Not all patterns are necessary for finding a separating hyperplane. In fact, it is sufficient to use only those points that are near the boundary between groups to construct the boundary.

This chapter presents SVMs, a form of generalization intelligence, together with their applications in regression and classification.

8.1 Support Vector Machines: Basic Theory

Support vector machines comprise a new class of learning algorithms, originally developed for pattern recognition [3, 46], and motivated by results of statistical learning theory [43].

The problem of empirical data modeling is germane to many engineering applications. In empirical data modeling, a process of induction is used to build a model of the system under study, which is then used to deduce responses of the system that have yet to be observed.

The SVMs aim at minimizing an upper bound of the generalization error by maximizing the margin between the separating hyperplane1 and the data. What makes SVMs attractive is the property of condensing information in the training data and providing a sparse representation by using a very small number of data points (support vectors, SVs) [17].

8.1.1 Statistical Learning Theory

This subsection is a very brief introduction to statistical learning theory [42, 44].

Modeling based on statistical learning theory aims at choosing, from the hypothesis space, the model that is closest (with respect to some error measure) to the underlying function in the target space.

Errors in SVM modeling arise from two sources [19]:
  • Approximation error is a consequence of choosing a hypothesis space smaller than the target space; a poor choice of hypothesis space results in a large approximation error, referred to as model mismatch.

  • Estimation error is the error due to the learning procedure causing the technique to select a non-optimal model from the hypothesis space.

In order to choose the best available approximation to the supervisor’s response, it is important to measure the loss or discrepancy L(y, f(w, x)) between the response y of the supervisor to a given input x and the response f(w, x) provided by the learning machine.

Given a training set of N independent observations $(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)$, we would like to find the function f such that the expected value of the loss, called the risk function, is minimized:

$$\displaystyle \begin{aligned} R(\mathbf{w})=\int L(y,f(\mathbf{w},\mathbf{x}))\mathrm{d}P(\mathbf{x},y), \end{aligned} $$
(8.1.1)
where P(x, y) is the joint probability distribution. The goal of SVM is to minimize the risk function R(w) over the class of functions f(w, x), w ∈ W.

The problem is that the joint probability distribution P(x, y) = P(y|x)P(x) is unknown, and the only available information is contained in the training set $\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)\}$.

In order to solve this problem, the incalculable risk function R(w) should be approximated by the empirical risk function

$$\displaystyle \begin{aligned} R_{\mathrm{emp}}(\mathbf{w})=\frac 1N\sum_{i=1}^N L(y_i^{\,},f(\mathbf{w},{\mathbf{x}}_i)) \end{aligned} $$
(8.1.2)
which can be constructed by using the training set $\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)\}$. In the above equation, the Boolean function f(w, x_i) on the input feature x_i and the set of Boolean functions on the input features {f(w, x_i), w ∈ W, x_i ∈ X} are known as the hypothesis and the hypothesis space, respectively.

The empirical risk minimization (ERM) principle is to minimize $R_{\mathrm{emp}}(\mathbf{w})$ over the set w ∈ W, which results in a risk $R(\mathbf{w}^*)$ that is close to its minimum.

The evaluation of the soundness of the ERM principle requires answers to the following two questions [42]:
  1. Does the empirical risk $R_{\mathrm{emp}}(\mathbf{w})$ converge uniformly to the actual risk R(w) over the full set f(w, x), w ∈ W, namely

$$\displaystyle \begin{aligned} \mathrm{Prob}\Big\{ \sup_{\mathbf{w}\in W} |R(\mathbf{w})-R_{\mathrm{emp}}(\mathbf{w})|>\varepsilon\Big\} \rightarrow 0\quad  \mathrm{as}~ N\to\infty . \end{aligned} $$
    (8.1.3)

  2. What is the rate of convergence?

The theory of uniform convergence of the empirical risk to the actual risk includes necessary and sufficient conditions as well as bounds for the rate of convergence. These bounds, which are independent of the distribution function P(x, y), are based on the Vapnik–Chervonenkis (VC) dimension of the set of functions f(w, x) implemented by the learning machine.

For a set of functions {f(w)}, w denotes a generic set of parameters: a choice of w specifies a particular function. The VC-dimension can be defined for various classes of functions f.

Let us consider only the functions corresponding to the two-class pattern recognition case, so that f(w, x) ∈{−1, 1}, ∀w, x.

If a given set of N data points can be labeled in all possible $2^N$ ways, and for each labeling a member of the set {f(w)} can be found that correctly assigns those labels, then the set of points is said to be shattered by that set of functions.

Suppose that the space in which the data live is $\mathbb{R}^2$, and the set {f(w)} consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1 and all points on the other side the class −1, as shown in Fig. 8.1.
Fig. 8.1 Three points in $\mathbb{R}^2$, shattered by oriented lines. The arrow points to the side on which points are labeled black.

One can find at least one set of 3 points in $\mathbb{R}^2$ all of whose $2^3=8$ possible labelings can be separated by oriented lines. The orientation is shown by an arrow, specifying on which side of the line points are assigned the label 1. In contrast, no set of 4 points can have all of its $2^4=16$ possible labelings separated by oriented lines.

Suppose we have a data set containing N points. These N points can be labeled in $2^N$ ways as positive "+" and negative "−". Therefore, $2^N$ different learning problems can be defined by N data points. If for any of these problems we can find a hypothesis $h\in\mathbb{H}$ that separates the positive examples from the negative ones, then we say that $\mathbb{H}$ shatters the N points. That is, any learning problem definable by the N examples can be learned without error by a hypothesis drawn from $\mathbb{H}$.

Definition 8.1 (Indicator Function [12])
Let $\mathbb{D}$ be a $2^n$ full factorial design with levels −1 and +1, and let a fractional factorial design $\mathbb{F}$ be a subset of $\mathbb{D}$. The indicator function F of the fraction $\mathbb{F}$ is a function defined on $\mathbb{D}$ as follows:

$$\displaystyle \begin{aligned} F(\mathbf{x})=\left\{ \begin{aligned} +1, &\quad  \mathrm{if}~\mathbf{x}\in \mathbb{F};\\ -1, &\quad \mathrm{if}~\mathbf{x}\in\mathbb{D}\setminus\mathbb{F}.\end{aligned}\right. \end{aligned} $$
(8.1.4)
Definition 8.2 (VC-Dimension [42])

The VC-dimension of a set of indicator functions f(w, x), w ∈ W, is the maximal number h of vectors which can be shattered in all possible $2^h$ ways by f(w, x), w ∈ W.

For example, h = n + 1 for linear decision rules in n-dimensional space, since they can shatter at most n + 1 points.

It is well-known [45] that the finiteness of the VC-dimension of the set of indicator functions implemented by the learning machine forms the necessary and sufficient condition for consistency of the ERM method independent of probability measure. Finiteness of VC-dimension also implies fast convergence.

Theorem 8.1 ([6])

Consider some set of m points in $\mathbb{R}^n$. Choose any one of the points as the origin. Then the m points can be shattered by oriented hyperplanes if and only if the position vectors of the remaining points are linearly independent.

Corollary 8.1 ([6])

The VC-dimension of the set of oriented hyperplanes in $\mathbb{R}^n$ is n + 1, since we can always choose n + 1 points, and then choose one of the points as the origin, such that the position vectors of the remaining n points are linearly independent, but we can never choose n + 2 such points (since no n + 1 vectors in $\mathbb{R}^n$ can be linearly independent).
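
To make the shattering argument concrete, the following hedged Python sketch tests whether every ±1 labeling of a small point set in $\mathbb{R}^2$ can be realized by a linear classifier. NumPy, scikit-learn, and the particular point sets are assumptions used only for illustration: a linear SVM with a very large C is used as an approximate separability test, not an exact certificate. Three non-collinear points pass the test, while the four corners of a square fail on the XOR labeling, in agreement with Corollary 8.1.

```python
# Numerical illustration of shattering in R^2 (a sketch; the separability test
# via a large-C linear SVM is an approximation, not an exact certificate).
from itertools import product

import numpy as np
from sklearn.svm import SVC


def is_shattered(points: np.ndarray) -> bool:
    """Return True if every +/-1 labeling of `points` is linearly separable."""
    n = len(points)
    for labels in product([-1, 1], repeat=n):
        if len(set(labels)) == 1:          # single-class labelings are trivially separable
            continue
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6)  # very large C approximates the hard margin
        clf.fit(points, y)
        if clf.score(points, y) < 1.0:     # this labeling cannot be realized by a line
            return False
    return True


three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print("3 points shattered:", is_shattered(three_points))   # expected: True
print("4 points shattered:", is_shattered(four_points))    # expected: False (XOR labeling)
```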

8.1.2 Linear Support Vector Machines

We discuss linear SVMs in two cases: the separable case and the nonseparable case.

The Separable Case

Consider first the simplest case: linear machines trained on separable data. We are given the labeled training data $\{\mathbf{x}_i, y_i\}$ with $i=1,\ldots,N$, $y_i\in\{-1,+1\}$, and $\mathbf{x}_i\in\mathbb{R}^n$. Suppose we have a "separating hyperplane" which separates the positive from the negative data samples. The points x on the hyperplane satisfy $\mathbf{w}^T\mathbf{x}+b=0$, where w is normal to the hyperplane, $|b|/\|\mathbf{w}\|_2$ is the perpendicular distance from the hyperplane to the origin, and $\|\mathbf{w}\|_2$ is the Euclidean norm of w.

Lemma 8.1 ([6])

Two sets of points in $\mathbb{R}^n$ may be separated by a hyperplane if and only if the intersection of their convex hulls is empty.

Let $d_+$ and $d_-$ be the shortest distances from the separating hyperplane to the closest positive and negative data samples, respectively. Then the "margin" of a separating hyperplane can be defined as $d_+ + d_-$.

For the linearly separable case, the support vector algorithm seeks the separating hyperplane with the largest margin $d_+ + d_-$. To this end, all the training data need to satisfy the following constraints:

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b&\geq +1,~\mathrm{for}~y_i^{\,} =+1,{} \end{aligned} $$
(8.1.5)

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b&\leq -1,~\mathrm{for}~y_i^{\,} =-1.{} \end{aligned} $$
(8.1.6)
The above two constraints can be combined into one set of inequality constraints:

$$\displaystyle \begin{aligned} y_i^{\,} ({\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b)-1\geq 0,\quad \forall~i=1,\ldots ,N.{} \end{aligned} $$
(8.1.7)

Clearly, the points satisfying the equality in (8.1.5) lie on the hyperplane $H_1: \mathbf{w}^T\mathbf{x}+b=1$ with normal w and perpendicular distance from the origin $|1-b|/\|\mathbf{w}\|_2$. Similarly, the points satisfying the equality in (8.1.6) lie on the hyperplane $H_2: \mathbf{w}^T\mathbf{x}+b=-1$, again with normal w and perpendicular distance from the origin $|-1-b|/\|\mathbf{w}\|_2$. Then $d_+ = d_- = 1/\|\mathbf{w}\|_2$ and hence the margin is simply $2/\|\mathbf{w}\|_2$.

Due to having the same normal, $H_1$ and $H_2$ are parallel, and no training points fall between them. Thus the pair of hyperplanes $(H_1, H_2)$ which gives the maximum margin can be found by minimizing $\frac 12\|\mathbf{w}\|_2^2$ subject to the constraints (8.1.7); namely, we have the constrained optimization problem:

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},b}~~ \frac 12\|\mathbf{w}\|{}_2^2, \end{aligned} $$
(8.1.8)

$$\displaystyle \begin{aligned} &\text{subject to}~~ y_i^{\,} ({\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b)-1\geq 0,\quad \forall~i=1,\ldots ,N. \end{aligned} $$
(8.1.9)
For the above inequality constraints, the Lagrange multiplier method gives the following unconstrained primal optimization problem:

$$\displaystyle \begin{aligned} \min_{\mathbf{w},b,{\boldsymbol \alpha}}\ \mathbb{L}_P (\mathbf{w},b,{\boldsymbol \alpha})=\frac 12\|\mathbf{w}\|{}_2^2-\sum_{i=1}^N\alpha_i^{\,} y_i^{\,} (\mathbf{ w}^T {\mathbf{x}}_i^{\,} +b)+\sum_{i=1}^N\alpha_i^{\,} {} \end{aligned} $$
(8.1.10)
with nonnegative Lagrange multipliers $\alpha_i\geq 0$, $\forall\, i=1,\ldots,N$.
From the first-order optimality conditions we have

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \mathbf{w}}=0\quad  &\Rightarrow\quad  \mathbf{w}=\sum_{i=1}^N \alpha_i^{\,} y_i^{\,} {\mathbf{x}}_i^{\,} ,{} \end{aligned} $$
(8.1.11)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_P}{\partial b}=0\quad  &\Rightarrow\quad  \sum_{i=1}^N \alpha_i^{\,} y_i^{\,} =0,{} \end{aligned} $$
(8.1.12)
Substituting (8.1.11) and (8.1.12) into (8.1.10) eliminates w and b, and thus yields the dual objective function

$$\displaystyle \begin{aligned} \mathbb{L}_D ({\boldsymbol \alpha})=\sum_{i=1}^N\alpha_i^{\,} -\frac 12\sum_{i=1}^N\sum_{j=1}^N \alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} {\mathbf{x}}_i^T {\mathbf{x}}_j^{\,} .{} \end{aligned} $$
(8.1.13)
Combining (8.1.13) with (8.1.12), we get the Wolfe dual optimization problem corresponding to (8.1.10) as follows:

$$\displaystyle \begin{aligned} &\max_{{\boldsymbol \alpha}}\ \mathbb{W}({\boldsymbol \alpha})=\sum_{i=1}^N\alpha_i^{\,} -\frac 12\sum_{i=1}^N\sum_{j=1}^N\alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} {\mathbf{x}}_i^T {\mathbf{x}}_j^{\,} , \end{aligned} $$
(8.1.14)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  \sum_{i=1}^N \alpha_i^{\,} y_i^{\,} =0, \end{aligned} $$
(8.1.15)
for the separable case.
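
As a concrete illustration of the separable case, the following hedged sketch (scikit-learn and the toy data are assumptions, not part of the book's presentation) fits a linear SVM with a very large C to approximate the hard-margin problem and then reconstructs w from the support vectors according to (8.1.11); the dual coefficients returned by the library are the products $\alpha_i y_i$.

```python
# Hard-margin linear SVM on separable toy data (a minimal sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[+2.0, +2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

# Reconstruct w via (8.1.11): w = sum_i alpha_i y_i x_i over the support vectors.
# clf.dual_coef_ stores alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("w (library):       ", clf.coef_)
print("w (from (8.1.11)): ", w_from_dual)
print("margin 2/||w||_2 = ", 2.0 / np.linalg.norm(clf.coef_))
```
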
The Nonseparable Case

When applied to nonseparable data, the above optimization algorithm for the separable case has no feasible solution. Hence, the inequality constraints (8.1.5) and (8.1.6) must be relaxed by introducing positive slack variables $\xi_i$, $i=1,\ldots,N$:

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b\geq +1-\xi_i^{\,} ,\quad  \mathrm{for}~y_i^{\,} =+1, \end{aligned} $$
(8.1.16)
and

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b&\leq -1+\xi_i^{\,} ,\quad  \mathrm{for}~y_i^{\,} =-1, \end{aligned} $$
(8.1.17)

$$\displaystyle \begin{aligned} \xi_i^{\,} &\geq 0,\quad  \forall~i=1,\ldots ,N. \end{aligned} $$
(8.1.18)

For an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i\xi_i$ is an upper bound on the number of training errors. Hence, a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\|\mathbf{w}\|_2^2/2$ to $\|\mathbf{w}\|_2^2/2+C\sum_i\xi_i$, where C is a parameter to be chosen by the user; a larger C corresponds to assigning a higher penalty to errors.

Consider the primal minimization problem

$$\displaystyle \begin{aligned} \min_{\mathbf{w},b,{\boldsymbol \alpha},C}\ \mathbb{L}_P (\mathbf{w},b,{\boldsymbol \alpha},C)&=\frac 12\|\mathbf{w}\|{}_2^2-\sum_{i=1}^N \alpha_i^{\,} y_i^{\,} ({\mathbf{w}}^T{\mathbf{x}}_i^{\,} +b)\\ &\ +C\sum_{i=1}^N \xi_i^{\,} +\sum_{i=1}^N\alpha_i^{\,} ,{} \end{aligned} $$
(8.1.19)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  0\leq \alpha_i^{\,}\leq C,\quad  i=1,\ldots ,N.{} \end{aligned} $$
(8.1.20)
By the first-order optimality conditions it follows that

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_P}{\partial \mathbf{w}}=0\quad  &\Rightarrow\quad  \mathbf{w}=\sum_{i=1}^N \alpha_i^{\,} y_i^{\,}{\mathbf{x}}_i^{\,},{} \end{aligned} $$
(8.1.21)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_P}{\partial C}=0\quad  &\Rightarrow\quad  \sum_{i=1}^N\xi_i^{\,} =0,{} \end{aligned} $$
(8.1.22)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_P}{\partial b}=0\quad  &\Rightarrow\quad  \sum_{i=1}^N\alpha_i^{\,} y_i^{\,} =0.{} \end{aligned} $$
(8.1.23)
By substituting (8.1.21) and (8.1.22) into (8.1.19), and using the constraints (8.1.20) and (8.1.23), we can eliminate 
$$\xi _i^{\,}$$
and C in the loss to get

$$\displaystyle \begin{aligned} &\max_{{\boldsymbol \alpha}}\left\{\sum_{i=1}^N\alpha_i^{\,} -\frac 12\sum_{i=1}^N\sum_{j=1}^N\alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} {\mathbf{x}}_i^T {\mathbf{x}}_j^{\,}\right\} , \end{aligned} $$
(8.1.24)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  0\leq \alpha_i^{\,} \leq C,\quad  i=1,\ldots ,N, \end{aligned} $$
(8.1.25)

$$\displaystyle \begin{aligned} &\hskip 17mm \sum_{i=1}^N\alpha_i^{\,} y_i^{\,} =0, \end{aligned} $$
(8.1.26)
which is called the Wolfe dual optimization problem for finding α.
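
To show how the Wolfe dual (8.1.24)–(8.1.26) can be solved numerically, here is a hedged sketch that uses SciPy's general-purpose SLSQP solver on a tiny toy problem. The data, the choice of C, and the use of scipy.optimize.minimize are illustrative assumptions; a dedicated QP solver, as used inside SVM libraries, would be preferable in practice.

```python
# Solve the soft-margin Wolfe dual for a small 2-D data set (a sketch only).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [-2.0, -2.0], [-1.0, -2.5], [-2.5, -1.5], [0.2, -0.1]])
y = np.array([+1, +1, +1, -1, -1, -1, +1])   # the last point makes the set noisy
N, C = len(y), 1.0

G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # negative of the dual objective (8.1.24), since `minimize` minimizes
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, C)] * N                                  # 0 <= alpha_i <= C

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X                                      # Eq. (8.1.21)
print("alpha:", np.round(alpha, 3))
print("w:", w)
```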

8.2 Kernel Regression Methods

Following the development of support vector machines, positive definite kernels have recently attracted considerable attention in the machine learning community.

8.2.1 Reproducing Kernel and Mercer Kernel

Consider how a linear SVM can be generalized to the case where the decision function f(x), whose sign represents the class assigned to data point x, is not a linear function of the data x.

Definition 8.3 (Reproducing Kernel Hilbert Space [47])
A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathbb{H}$ of real-valued functions on a compact domain X with the reproducing property that for each $f\in\mathbb{H}$ and each x ∈ X, there exists $M_x$, not depending on f, such that

$$\displaystyle \begin{aligned} \mathbb{F}_x[f]=|f(\mathbf{x})|=\langle f(\cdot ),K_x (\cdot )\rangle\leq M_x\|f\|,\quad  \forall~ f\in \mathbb{H}, \end{aligned} $$
(8.2.1)
where $\mathbb{F}_x[f]$ is called the evaluation functional of f. That is to say, $\mathbb{F}_x[f]$ is a bounded linear functional.

The reproducing property above states that the inner product of a function f and a kernel function $K_x$, $\langle f(\cdot),K_x(\cdot)\rangle$, equals f(x) in the RKHS. Because $K_x$ reproduces the function f in this way, the function $K_x(\cdot)$ is called the reproducing kernel of f for the RKHS $\mathbb{H}$.

A natural question to ask is: how to choose a function K x which is a reproducing kernel? To answer this question, we define K x as a two-variable function

$$\displaystyle \begin{aligned} K(\mathbf{x},\mathbf{y})=K_x (\mathbf{y}). \end{aligned} $$
(8.2.2)
Definition 8.4 (Gram Matrix [35])
Given a kernel K(x, y) and patterns $\mathbf{x}_1,\ldots,\mathbf{x}_N\in X$, where X is a closed subset of $\mathbb{R}^n$, the N × N matrix K with entries

$$\displaystyle \begin{aligned} K_{ij}=K({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} )\quad \text{and}\quad  K_{ij}=K_{ji} \end{aligned} $$
(8.2.3)
is called the Gram matrix of the kernel K with respect to $\mathbf{x}_1,\ldots,\mathbf{x}_N$.
Definition 8.5 (Positive Definite Kernel [47])
Let X be a closed subset of $\mathbb{R}^n$, and let $K: X\times X\to\mathbb{R}$ be a symmetric function satisfying: for any finite set of points $\{\mathbf{x}_i\}_{i=1}^N$ in X and real numbers $\{a_i\}_{i=1}^N$,

$$\displaystyle \begin{aligned} \sum_{i=1}^N\sum_{j=1}^N a_i^{\,} a_j^{\,} K({\mathbf{x}}_i,{\mathbf{x}}_j^{\,} )\geq 0. \end{aligned} $$
(8.2.4)
Then K is said to be a positive definite kernel on X.
Denote kernels $K_i: X\times X\to\mathbb{R}$; then positive definite (p.d.) kernels have the following general properties:
  1. If $K_1,\ldots,K_n$ are p.d. kernels and $\lambda_1,\ldots,\lambda_n\geq 0$, then the sum $\sum_{i=1}^n\lambda_i K_i$ is p.d.

  2. If $K_1,\ldots,K_n$ are p.d. kernels and $\alpha_1,\ldots,\alpha_n\in\{1,\ldots,N\}$, then their product $\prod_{i=1}^n K_i^{\alpha_i}$ is p.d.

  3. For a sequence of p.d. kernels, the limit $K=\lim_{n\to\infty}K_n$ is p.d., if the limit exists.

  4. If $X_0\subseteq X$, then the restriction $K_0: X_0\times X_0\to\mathbb{R}$ is also a p.d. kernel.

  5. If $K_i: X_i\times X_i\to\mathbb{R}$, $i=1,\ldots,n$, are p.d. kernels, then
    • $K((\mathbf{x}_1,\ldots,\mathbf{x}_n),(\mathbf{y}_1,\ldots,\mathbf{y}_n))=\prod_{i=1}^n K_i(\mathbf{x}_i,\mathbf{y}_i)$ is a p.d. kernel on $X_1\times\cdots\times X_n$.

    • $K((\mathbf{x}_1,\ldots,\mathbf{x}_n),(\mathbf{y}_1,\ldots,\mathbf{y}_n))=\sum_{i=1}^n K_i(\mathbf{x}_i,\mathbf{y}_i)$ is a p.d. kernel on $X_1\times\cdots\times X_n$.

Definition 8.6 (Reproducing Kernel [1])
Let F be a class of functions f(x) defined on a set E, forming a Hilbert space (complex or real). The function K(x, y) of x and y in E is called a reproducing kernel of F if
  1. For every y, K(x, y) as a function of x belongs to F.

  2. The reproducing property holds: for every y ∈ E and every f ∈ F,

$$\displaystyle \begin{aligned} f(\mathbf{y})=\langle f(\mathbf{x}),K(\mathbf{x},\mathbf{y})\rangle_{x}. \end{aligned} $$
    (8.2.5)
    Here the subscript x on the scalar product indicates that the scalar product applies to functions of x.

Theorem 8.2 (Moore–Aronszajn Theorem [1])
Let x, y ∈ X and let $K(\mathbf{x},\mathbf{y}): X\times X\to\mathbb{R}$ be positive definite. Then there exists a unique RKHS $\mathbb{H}\subset\mathbb{R}^X$ whose reproducing kernel is K(x, y). Moreover, if the space $\mathbb{H}_0=\mathrm{span}(\{K(\cdot ,\mathbf{x})\}_{\mathbf{x}\in X})$ has the inner product

$$\displaystyle \begin{aligned} \langle f,g\rangle_{\mathbb{H}_0}=\sum_{i=1}^n\sum_{j=1}^n \alpha_i\beta_j K({\mathbf{x}}_i,{\mathbf{x}}_j), \end{aligned} $$
(8.2.6)

where $f=\sum_{i=1}^n\alpha_i K(\cdot,\mathbf{x}_i)$ and $g=\sum_{j=1}^n\beta_j K(\cdot,\mathbf{x}_j)$, then $\mathbb{H}_0$ is an effective RKHS.

The Moore–Aronszajn theorem states that there is a one-to-one correspondence between a positive definite kernel function K(x, y) and an RKHS $\mathbb{H}$. This is why the kernel function is restricted to a reproducing kernel Hilbert space $\mathbb{H}$.

The reproducing kernel has the following basic properties [1]:
  • Uniqueness. If a reproducing kernel K(x, y) exists, then it is unique.

  • Existence. For the existence of a reproducing kernel K(x, y) it is necessary and sufficient that for every y of the set E, f(y) be a continuous functional of f running through the Hilbert space F.

  • Positiveness. $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{y}_j)]_{i,j=1}^{n,n}$ is a positive matrix in the sense of E. H. Moore.

  • One-to-one correspondence. To every positive matrix 
$$\mathbf {K}=[K({\mathbf {x}}_i,{\mathbf {y}}_j)]_{{i,j}=1}^{n,n}$$
, there corresponds one and only one class of functions with a uniquely determined quadratic form in it, forming a Hilbert space and admitting K(x, y) as a reproducing kernel.

  • Convergence. If the class F possesses a reproducing kernel K(x, y), every sequence of functions {f n} which converges strongly to a function f in the Hilbert space F, converges also at every point in the ordinary sense, 
$$\lim f_n(\mathbf {x})=f(\mathbf {x})$$
.

The following Mercer’s theorem provides a construction method of the nonlinear reproducing kernel function.

Theorem 8.3 (Mercer’s Theorem)
Each positive definite kernel K(x, y) defined on a compact domain X × X can be written in the form

$$\displaystyle \begin{aligned} K(\mathbf{x},\mathbf{y})=\sum_{i=1}^M \lambda_i^{\,}\phi_i (\mathbf{x})\phi_i (\mathbf{y}),\quad  M\leq \infty . \end{aligned} $$
(8.2.7)
The kernel satisfying Mercer's theorem is called the Mercer kernel. According to Mercer's theorem, the nonlinear kernel function is usually constructed as

$$\displaystyle \begin{aligned} K({\mathbf{x}}_i,{\mathbf{y}}_i)={\boldsymbol \phi}^T({\mathbf{x}}_i^{\,} ){\boldsymbol \phi}({\mathbf{y}}_i^{\,}), \end{aligned} $$
(8.2.8)
where $\boldsymbol\phi(\mathbf{x})=[\phi_1(\mathbf{x}),\ldots,\phi_M(\mathbf{x})]^T$ is a nonlinear function.
A widely used Mercer kernel is the Gaussian radial basis function (GRBF) kernel:

$$\displaystyle \begin{aligned}K({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} ) = \exp \left (-\frac{\|{\mathbf{x}}_i^{\,} -{\mathbf{x}}_j^{\,} \|{}_2^2}{2\sigma^2}\right ) \end{aligned} $$
(8.2.9)
or the exponential radial basis function kernel

$$\displaystyle \begin{aligned} K({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} ) = \exp \left (-\frac{\|{\mathbf{x}}_i^{\,} -{\mathbf{x}}_j^{\,} \|{}_2^{\,} }{2\sigma^2}\right ). \end{aligned} $$
(8.2.10)
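
As a quick numerical check of Definition 8.5 and the Gram matrix of Definition 8.4, the following hedged sketch (NumPy and random toy data are assumptions) builds the Gaussian RBF Gram matrix (8.2.9) for random points and verifies that its eigenvalues and quadratic forms are nonnegative, i.e., that the matrix is positive semi-definite.

```python
# Build the Gaussian RBF Gram matrix (8.2.9) and check positive semi-definiteness
# (a sketch; NumPy and the random toy data are assumptions).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # 50 points in R^3
sigma = 1.0

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))

print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
for _ in range(3):                    # the quadratic form (8.2.4) is nonnegative
    a = rng.normal(size=50)
    assert a @ K @ a >= -1e-10
print("quadratic-form check passed")
```
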
For a radial basis function (RBF) kernel one has the following approximation [16]:

$$\displaystyle \begin{aligned} \int_{\mathbf{x}}p(\mathbf{x})^2\mathrm{d}\mathbf{x}\approx \frac 1{N^2}\sum_{i=1}^N\sum_{j=1}^N K_{ij}=\frac 1{N^2}{\mathbf{1}}_N^T\mathbf{K1}_N^{\,} , \end{aligned} $$
(8.2.11)
where $\mathbf{1}_N$ is an N × 1 summation vector with all entries equal to 1.
If the N × N kernel matrix K has the eigenvalue decomposition K = U ΛU T, then

$$\displaystyle \begin{aligned} {\mathbf{1}}_N^T \mathbf{K1}_N ={\mathbf{1}}_N^T\left (\sum_{i=1}^N\lambda_i{\mathbf{u}}_i{\mathbf{u}}_i^T\right ){\mathbf{1}}_N=\sum_{i=1}^N\lambda_i\left ({\mathbf{1}}_N^T {\mathbf{u}}_i^{\,} \right )^2. \end{aligned} $$
(8.2.12)
This result has two important applications.
  1. Kernel PCA: For given data vectors $\mathbf{x}_1,\ldots,\mathbf{x}_N$ and an RBF kernel, the K dominant eigenvalues and their corresponding eigenvectors of the kernel matrix K give the kernel PCA, after the principal components in standard PCA are replaced by the principal components of the kernel matrix K. Kernel PCA was originally proposed by Schölkopf et al. [33].

  2. Kernel K-means clustering: If there are K distinct clustered regions within the N data samples, then there will be K dominant terms $\lambda_i\left(\mathbf{1}_N^T\mathbf{u}_i\right)^2$ that provide a means of estimating the possible number of clusters within the data samples in kernel-based K-means clustering [16].


Kernel PCA is a nonlinear generalization of PCA in the sense that it performs PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality. If the kernels $K(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i^T\mathbf{x}_j$ are used, then kernel PCA reduces to standard PCA. Compared to standard PCA, kernel PCA has the main advantage that no nonlinear optimization is involved; it is essentially linear algebra, as simple as standard PCA.
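
The following hedged sketch implements the kernel PCA idea described above directly from the Gram matrix: the kernel matrix is centered in feature space and its leading eigenvectors are taken as nonlinear principal components. The centering formula, NumPy, and the toy data are standard choices stated here as assumptions rather than as the book's own derivation.

```python
# Minimal kernel PCA sketch: eigendecompose the centered RBF Gram matrix and
# project the data onto the leading eigenvectors.
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

K = rbf_gram(X)
N = K.shape[0]
one = np.full((N, N), 1.0 / N)
K_centered = K - one @ K - K @ one + one @ K @ one   # center in feature space

eigvals, eigvecs = np.linalg.eigh(K_centered)        # eigenvalues in ascending order
idx = np.argsort(eigvals)[::-1][:2]                  # two dominant components
alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
scores = K_centered @ alphas                         # nonlinear principal components
print(scores.shape)                                  # (100, 2)
```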

8.2.2 Representer Theorem and Kernel Regression

We now focus on applications of nonlinear kernel functions in SVMs. The following are widely used types of kernel functions in SVMs [19, 36]:
  1. Linear SVM uses the linear kernel function
    
$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=\langle \mathbf{x},{\mathbf{x}}_k\rangle ={\mathbf{x}}^T{\mathbf{x}}_k. \end{aligned} $$
    (8.2.13)
     
  2. Polynomial SVM uses the polynomial kernel of degree d
    
$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=(\langle \mathbf{x},{\mathbf{x}}_k\rangle +1)^d . \end{aligned} $$
    (8.2.14)
     
  3. Radial basis function (RBF) SVM uses the Gaussian radial basis function

$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=\exp \left (-\frac{\|\mathbf{x}-{\mathbf{x}}_k\|{}^2}{2\sigma^2}\right ) \end{aligned} $$
    (8.2.15)
    or the exponential radial basis function

$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=\exp \left (-\frac{\|\mathbf{x}-{\mathbf{x}}_k\|}{2\sigma^2}\right ). \end{aligned} $$
    (8.2.16)
     
  4. Multilayer SVM uses the multilayer perceptron kernel function
    
$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=\tanh (\rho \langle \mathbf{x},{\mathbf{x}}_k\rangle +\vartheta )=\tanh (\rho {\mathbf{x}}^T{\mathbf{x}}_k +\vartheta ) \end{aligned} $$
    (8.2.17)

    for certain values of the scale ρ and offset 𝜗 parameters. Here the support vector (SV) corresponds to the first layer and the Lagrange multipliers to the weights.

     
  5. B-splines SVM uses the B-spline kernel function of order 2M + 1:
    
$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=B_{2M+1}(\|\mathbf{x}-{\mathbf{x}}_k\|) \end{aligned} $$
    (8.2.18)
    where $B_{2M+1}(\cdot)$ is the B-spline of order 2M + 1.
     
  6. Summing SVM uses the additive kernel function
    
$$\displaystyle \begin{aligned} K(\mathbf{x},{\mathbf{x}}_k)=\sum_i K_i(\mathbf{x},{\mathbf{x}}_k) \end{aligned} $$
    (8.2.20)

    which is obtained by summing kernels, since the sum of two positive definite functions is positive definite.

     
To address the poor generalization properties of SVMs, regularization networks, Gaussian processes, and spline methods, a popular technique is to solve a regularized optimization problem in a reproducing kernel Hilbert space $\mathbb{H}$:

$$\displaystyle \begin{aligned} {\mathbf{f}}^{\,*}=\operatorname*{\text{arg}\ \text{min}}_{\mathbf{f}\in \mathbb{H}}\frac 1l\sum_{i=1}^l V({\mathbf{x}}_i,y_i^{\,} ,\mathbf{f})+\frac 12\gamma \|\mathbf{ f}\|{}_{\mathbb{H}}^2, \end{aligned} $$
(8.2.21)
where f is the regressor or classifier.

The solution to the problem (8.2.21) was given by Kimeldorf and Wahba in 1971 [24], known as the representer theorem.

Theorem 8.4 (Representer Theorem for Supervised Learning [24, 47])
Given a set of l labeled examples $\{(\mathbf{x}_i,y_i)\}_{i=1}^l$, any solution to the optimization problem (8.2.21) for SVMs has a representation of the form

$$\displaystyle \begin{aligned}f(\mathbf{x})=\sum_{j=1}^l \alpha_j^{\,} K(\mathbf{x},{\mathbf{x}}_j^{\,} ),{} \end{aligned} $$
(8.2.22)

where $\alpha_j\in\mathbb{R}$, $j=1,\ldots,l$.

Denoting $f_i=f(\mathbf{x}_i)=\sum_{j=1}^l \alpha_j K(\mathbf{x}_i,\mathbf{x}_j)$ and collecting these values into a vector, we then have

$$\displaystyle \begin{aligned} \mathbf{f}=[f_1^{\,},\ldots ,f_l^{\,}]^T=\mathbf{K}{\boldsymbol \alpha}\qquad \text{and}\qquad \|\mathbf{f}\|{}_{\mathbb{H}}^2={\boldsymbol \alpha}^T\mathbf{K}{\boldsymbol \alpha}, \end{aligned} $$
(8.2.23)
where $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]$ is the Gram matrix of the labeled examples.
Letting $\sum_{i=1}^l V(\mathbf{x}_i,y_i,\mathbf{f})=\sum_{i=1}^l (y_i-\hat f_i)^2=\|\mathbf{y}-\mathbf{K}\boldsymbol\alpha\|_2^2$, substituting (8.2.22) into (8.2.21), and using (8.2.23), we get the loss expressed in terms of α:

$$\displaystyle \begin{aligned} L({\boldsymbol \alpha})=(\mathbf{y}-\mathbf{K}{\boldsymbol \alpha})^T(\mathbf{y}-\mathbf{K}{\boldsymbol \alpha})+\frac 12\gamma_A^{\,} {\boldsymbol \alpha}^T\mathbf{ K}{\boldsymbol \alpha}. \end{aligned} $$
(8.2.24)
From $\frac{\partial L({\boldsymbol\alpha})}{\partial{\boldsymbol\alpha}}=\mathbf{0}$ it follows that $(-\mathbf{K})(\mathbf{y}-\mathbf{K}{\boldsymbol\alpha})+\gamma_A\mathbf{K}{\boldsymbol\alpha}=\mathbf{0}$, which yields the optimal solution

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^* =(\mathbf{K}+\gamma_A^{\,} \mathbf{I})^{-1}\mathbf{y}. \end{aligned} $$
(8.2.25)
This is just the Tikhonov regularization least squares solution.
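
The closed form (8.2.25) is easy to implement directly. The following hedged sketch (NumPy and a synthetic one-dimensional data set are assumptions) fits the representer-theorem expansion (8.2.22) with an RBF kernel by solving $(\mathbf{K}+\gamma_A\mathbf{I})\boldsymbol\alpha=\mathbf{y}$ and then predicts at new inputs.

```python
# Kernel ridge regression via the representer theorem: alpha = (K + gamma I)^{-1} y.
import numpy as np

def rbf(A, B, sigma=0.5):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 2 * np.pi, size=(40, 1)), axis=0)
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=40)

gamma_A = 0.1
K = rbf(x_train, x_train)
alpha = np.linalg.solve(K + gamma_A * np.eye(len(K)), y_train)   # Eq. (8.2.25)

x_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_pred = rbf(x_test, x_train) @ alpha                            # Eq. (8.2.22)
print("train MSE:", np.mean((K @ alpha - y_train) ** 2))
```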

8.2.3 Semi-Supervised and Graph Regression

The representer theorem for supervised learning (Theorem 8.4) can be extended to semi-supervised learning and/or graph signals.

Theorem 8.5 (Representer Theorem for Semi-Supervised Learning [2])
Given a set of l labeled examples $\{(\mathbf{x}_i,y_i)\}_{i=1}^l$, a set of u unlabeled examples $\{\mathbf{x}_j\}_{j=l+1}^{l+u}$, and the graph Laplacian L, the minimizer of the semi-supervised/graph optimization problem

$$\displaystyle \begin{aligned} {\mathbf{f}}^{\,*}=\operatorname*{\mathit{\text{arg}}\ \mathit{\text{min}}}_{\mathbf{f}}\left\{\frac 1l\sum_{i=1}^l V({\mathbf{x}}_i,y_i,\mathbf{f})+\frac 12\gamma_A^{\,} \|\mathbf{f}\|{}_{\mathbb{H}}^2+ \frac{\gamma_l^{\,}}{(u+l)^2}{\mathbf{f}}^T\mathbf{Lf}\right\},{} \end{aligned} $$
(8.2.26)
admits an expansion

$$\displaystyle \begin{aligned} {\mathbf{f}}^{\,*}(\mathbf{x})=\sum^{l+u}_{i=1}\alpha_i^{\,} K(\mathbf{x},{\mathbf{x}}_i){} \end{aligned} $$
(8.2.27)

in terms of the labeled and unlabeled examples.

Note that the above representer theorem [2] is different from the generalized representer theorem in [35]. When the graph Laplacian L = I, Theorem 8.5 reduces to the representer theorem for semi-supervised learning. If we further let u = 0, then Theorem 8.5 reduces to Theorem 8.4, the representer theorem for supervised learning.

Theorem 8.5 shows that the optimization problem (8.2.26) is equivalent to finding the optimal coefficient vector $\boldsymbol\alpha^*$.

Letting the loss $V({\boldsymbol\alpha})=\sum_{i=1}^l (y_i-\hat f_i({\boldsymbol\alpha}))^2=\|\mathbf{y}-\mathbf{JK}{\boldsymbol\alpha}\|_2^2$ and using (8.2.23), the loss in (8.2.26) can be represented as [2]:

$$\displaystyle \begin{aligned} L({\boldsymbol \alpha})=\frac 1l(\mathbf{y}-\mathbf{JK}{\boldsymbol \alpha})^T(\mathbf{y}-\mathbf{JK}{\boldsymbol \alpha})+\frac 12\gamma_A^{\,} {\boldsymbol \alpha}^T\mathbf{K}{\boldsymbol \alpha}+ \frac {\gamma_l^{\,} }{2(u+l)^2}{\boldsymbol \alpha}^T\mathbf{KLK}{\boldsymbol \alpha},{} \end{aligned} $$
(8.2.28)
where K is the (l + u) × (l + u) Gram matrix with $K_{ij}=K(\mathbf{x}_i,\mathbf{x}_j)$, $\mathbf{y}=[y_1,\ldots,y_l,0,\ldots,0]^T$ is the (l + u)-dimensional label vector, and J = Diag(1, …, 1, 0, …, 0) is an (l + u) × (l + u) diagonal matrix whose first l diagonal entries are 1 and the rest 0.
From 
$$\frac {\partial L({\boldsymbol \alpha })}{\partial {\boldsymbol \alpha }}=\mathbf {0}$$
, one has the first-order optimization condition

$$\displaystyle \begin{aligned} \frac 1l \left (-(\mathbf{JK})^T\right )(\mathbf{y}-\mathbf{JK}{\boldsymbol \alpha})+\left (\gamma_A^{\,} \mathbf{K}+\frac{\gamma_l^{\,} }{(u+l)^2}\mathbf{ KLK}\right ) {\boldsymbol \alpha}=\mathbf{0}.{} \end{aligned} $$
(8.2.29)
Due to K TJ T = KJ, JJ = J, and Jy −JJKα = Jy −JKα, the optimization condition (8.2.29) gives the optimal solution

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^*=\left (\mathbf{JK}+\gamma_A l\mathbf{I}+\frac{\gamma_l^{\,} l}{(u+l)^2}\mathbf{LK}\right )^{-1}\mathbf{Jy}.{} \end{aligned} $$
(8.2.30)
This solution is known as Laplacian regularized least squares (LapRLS) solution [2].
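
The LapRLS solution (8.2.30) can be computed directly once the Gram matrix and a graph Laplacian are available. The sketch below is hedged: NumPy, the hand-built symmetrized nearest-neighbor graph, the RBF kernel, and the two-cluster toy data are all assumptions used only for illustration. It labels a few samples, builds a simple unnormalized Laplacian, and evaluates (8.2.30).

```python
# Laplacian regularized least squares, Eq. (8.2.30): a hedged sketch.
import numpy as np

def rbf(A, B, sigma=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
pos = rng.normal(+1.5, 0.4, size=(30, 2))
neg = rng.normal(-1.5, 0.4, size=(30, 2))
# Three labeled points from each class come first, then the 54 unlabeled points.
X = np.vstack([pos[:3], neg[:3], pos[3:], neg[3:]])
l, u = 6, 54
y = np.r_[np.ones(3), -np.ones(3), np.zeros(u)]      # zeros for unlabeled samples
true = np.r_[np.ones(3), -np.ones(3), np.ones(27), -np.ones(27)]

# Unnormalized graph Laplacian L = D - W from a symmetrized 5-nearest-neighbor graph.
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.zeros_like(d2)
for i in range(l + u):
    W[i, np.argsort(d2[i])[1:6]] = 1.0
W = np.maximum(W, W.T)
L = np.diag(W.sum(axis=1)) - W

K = rbf(X, X)
J = np.diag(np.r_[np.ones(l), np.zeros(u)])
gamma_A, gamma_l = 0.05, 0.05
A = J @ K + gamma_A * l * np.eye(l + u) + (gamma_l * l / (u + l) ** 2) * (L @ K)
alpha = np.linalg.solve(A, J @ y)                    # Eq. (8.2.30)
pred = np.sign(K @ alpha)                            # predicted labels for all samples
print("agreement with the true clusters:", np.mean(pred == true))
```
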
Interestingly, LapRLS is also applicable to graph supervised regression and classification. In this setting, the number u of unlabeled samples is equal to zero, resulting in the l × l matrix J = I, and thus (8.2.30) becomes

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^* =\left (\mathbf{K}+\gamma_A^{\,} l\mathbf{I}+\frac{\gamma_l^{\,}}l\mathbf{LK}\right )^{-1}\mathbf{y}.{} \end{aligned} $$
(8.2.31)
This is just LapRLS solution for supervised regression and classification.
Another interesting fact is that LapRLS contains the regularized least squares (RLS) as a special case. The regularized least squares algorithm for non-graph signals is a fully supervised method whose optimization problem is to find the solution

$$\displaystyle \begin{aligned} {\mathbf{f}}^{\,*}=\operatorname*{\text{arg}\ \text{min}}_{\mathbf{f}}\frac 1l\sum_{i=1}^l (y_i^{\,} -f({\mathbf{x}}_i^{\,} ))^2 +\gamma_A^{\,}\|\mathbf{ f}\|{}_K^2.{} \end{aligned} $$
(8.2.32)
By the classical representer theorem, this solution is given by

$$\displaystyle \begin{aligned} {\mathbf{f}}^{\,*}(\mathbf{x})=\sum_{i=1}^l \alpha_i^*K(\mathbf{x},{\mathbf{x}}_i).{} \end{aligned} $$
(8.2.33)
Substituting (8.2.33) into (8.2.32) gives

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^*=\operatorname*{\text{arg}\ \text{min}}_{\boldsymbol \alpha}\left\{\frac 1l(\mathbf{y}-\mathbf{K}{\boldsymbol \alpha})^T(\mathbf{y}-\mathbf{ K}{\boldsymbol \alpha}) +\gamma_A^{\,} {\boldsymbol \alpha}^T\mathbf{K}{\boldsymbol \alpha}\right \}. \end{aligned} $$
(8.2.34)
From the first-order optimization condition 
$$\frac {\partial V({\boldsymbol \alpha })}{\partial {\boldsymbol \alpha }}=\mathbf {0}$$
, it is easy to get

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^*=\left (\mathbf{K}+\gamma_A^{\,} l\mathbf{I}\right )^{-1}\mathbf{y}, \end{aligned} $$
(8.2.35)
which is just the Tikhonov regularization least squares (RLS) solution. Clearly, the Tikhonov RLS solution is a special example of LapRLS solution (8.2.31) for supervised machine learning when the graph Laplacian L is a null matrix.

8.2.4 Kernel Partial Least Squares Regression

Given two data blocks X and Y, consider kernel methods for partial least squares (PLS) regression. This type of method is known as kernel partial least squares regression.

Kernel PLS regression is a natural extension of PLS regression discussed in Sect. 6.​9.​2.

The key steps of PLS regression are given by
  1. $\mathbf{w}={\mathbf{X}}^T\mathbf{u}/({\mathbf{u}}^T\mathbf{u})$.

  2. $\mathbf{t}=\mathbf{Xw}$.

  3. $\mathbf{c}={\mathbf{Y}}^T\mathbf{t}/({\mathbf{t}}^T\mathbf{t})$.

  4. $\mathbf{u}=\mathbf{Yc}/({\mathbf{c}}^T\mathbf{c})$.

  5. $\mathbf{p}={\mathbf{X}}^T\mathbf{t}/({\mathbf{t}}^T\mathbf{t})$.

  6. $\mathbf{q}={\mathbf{Y}}^T\mathbf{u}/({\mathbf{u}}^T\mathbf{u})$.

  7. $\mathbf{X}=\mathbf{X}-\mathbf{tp}^T$.

  8. $\mathbf{Y}=\mathbf{Y}-\mathbf{tc}^T$.
Then, we have

$$\displaystyle \begin{aligned} \mathbf{t}&=\mathbf{XX}^T\mathbf{u}/({\mathbf{u}}^T\mathbf{u}),{} \end{aligned} $$
(8.2.36)

$$\displaystyle \begin{aligned} \mathbf{c}&={\mathbf{Y}}^T\mathbf{t},{} \end{aligned} $$
(8.2.37)

$$\displaystyle \begin{aligned} \mathbf{u}&={\mathbf{Y}}^T\mathbf{u}/({\mathbf{u}}^T\mathbf{u}),{} \end{aligned} $$
(8.2.38)

$$\displaystyle \begin{aligned} \mathbf{X}&=\mathbf{X}-\mathbf{tt}^T\mathbf{X},{} \end{aligned} $$
(8.2.39)

$$\displaystyle \begin{aligned} \mathbf{Y}&=\mathbf{Y}-\mathbf{tc}^T{}. \end{aligned} $$
(8.2.40)
Using $\boldsymbol\Phi=\boldsymbol\Phi(\mathbf{X})$ in place of X, Eqs. (8.2.36) and (8.2.39) become

$$\displaystyle \begin{aligned} \mathbf{t}={\boldsymbol \Phi}{\boldsymbol \Phi}^T\mathbf{u}/({\mathbf{u}}^T\mathbf{u})\quad \text{and}\quad  {\boldsymbol \Phi}={\boldsymbol \Phi}-\mathbf{ tt}^T{\boldsymbol \Phi}. \end{aligned} $$
(8.2.41)
Therefore, the key steps of kernel nonlinear iterative partial least squares (NIPALS) regression are as follows [28, 32].
Given $\boldsymbol\Phi_0=\boldsymbol\Phi$ and the data block $\mathbf{Y}_0=\mathbf{Y}$:
  1. Randomly initialize u.

  2. $\mathbf{t}={\boldsymbol\Phi}{\boldsymbol\Phi}^T\mathbf{u}$, $\mathbf{t}\leftarrow\mathbf{t}/({\mathbf{t}}^T\mathbf{t})$.

  3. $\mathbf{c}={\mathbf{Y}}^T\mathbf{t}$.

  4. $\mathbf{u}=\mathbf{Yc}$, $\mathbf{u}\leftarrow\mathbf{u}/({\mathbf{u}}^T\mathbf{u})$.

  5. Repeat Steps 2–4 until convergence of t.

  6. Deflate the matrix: ${\boldsymbol\Phi}{\boldsymbol\Phi}^T=(\mathbf{I}-\mathbf{tt}^T){\boldsymbol\Phi}{\boldsymbol\Phi}^T(\mathbf{I}-\mathbf{tt}^T)^T$.

  7. Deflate the matrix: $\mathbf{Y}=\mathbf{Y}-\mathbf{tc}^T$.
The kernel NIPALS regression is an iterative process: after extraction of the first component $\mathbf{t}_1$, the algorithm starts again using the deflated matrices ${\boldsymbol\Phi}{\boldsymbol\Phi}^T$ and Y computed in Steps 6 and 7, and repeats Steps 2–7 until the deflated matrix ${\boldsymbol\Phi}{\boldsymbol\Phi}^T$ or Y becomes a null matrix.

Once the two matrices $\mathbf{T}=[\mathbf{t}_1,\ldots,\mathbf{t}_p]$ and $\mathbf{U}=[\mathbf{u}_1,\ldots,\mathbf{u}_p]$ have been found by the kernel NIPALS regression algorithm, the matrix of regression coefficients B can be computed in a form similar to (6.9.26):

$$\displaystyle \begin{aligned} \mathbf{B}={\boldsymbol \Phi}_0^T\mathbf{U}({\mathbf{T}}^T{\boldsymbol \Phi}_0{\boldsymbol \Phi}_0^T\mathbf{U})^{-1}{\mathbf{T}}^T{\mathbf{Y}}_0.\end{aligned} $$
(8.2.42)
Then, for a given new data block $\mathbf{X}_{\mathrm{new}}$ with ${\boldsymbol\Phi}_{\mathrm{new}}={\boldsymbol\Phi}(\mathbf{X}_{\mathrm{new}})$, the unknown Y-values can be predicted as

$$\displaystyle \begin{aligned} \hat{\mathbf{Y}}_{\mathrm{new}} ={\boldsymbol \Phi}_{\mathrm{new}}\mathbf{B}.\end{aligned} $$
(8.2.43)

MATLAB code for the kernel RLS algorithm can be found in [28, Appendix III].
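
The listed steps translate almost line for line into code. The following hedged sketch extracts a few kernel PLS components from a Gram matrix; it assumes NumPy, synthetic data, the update $\mathbf{u}=\mathbf{Yc}$ followed by normalization to unit length (the form used in Rosipal and Trejo's kernel PLS), and it omits the coefficient matrix (8.2.42) for brevity.

```python
# Kernel NIPALS sketch: extract score vectors t from the Gram matrix K = Phi Phi^T
# by iterating Steps 2-7 above (toy data and unit-norm scaling are assumptions).
import numpy as np

def kernel_nipals(K, Y, n_comp=2, tol=1e-10, max_iter=500):
    K, Y = K.copy(), Y.astype(float).copy()
    N = K.shape[0]
    T, U = [], []
    for _ in range(n_comp):
        u = Y[:, [0]].copy()                  # Step 1: initialize u (here: a column of Y)
        t_old = np.zeros((N, 1))
        for _ in range(max_iter):
            t = K @ u
            t /= np.linalg.norm(t)            # Step 2 (unit length, so Step 6 is a projection)
            c = Y.T @ t                       # Step 3
            u = Y @ c
            u /= np.linalg.norm(u)            # Step 4 (update u = Yc assumed, then normalized)
            if np.linalg.norm(t - t_old) < tol:   # Step 5: convergence of t
                break
            t_old = t
        P = np.eye(N) - t @ t.T
        K = P @ K @ P.T                       # Step 6: deflate Phi Phi^T
        Y = Y - t @ c.T                       # Step 7: deflate Y
        T.append(t)
        U.append(u)
    return np.hstack(T), np.hstack(U)

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(60, 2))
K = X @ X.T                                   # linear kernel, for simplicity
T, U = kernel_nipals(K, Y)
print(T.shape, U.shape)                       # (60, 2) (60, 2)
```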

8.2.5 Laplacian Support Vector Machines

Consider support vector machines for graph signals. Given a set of l labeled graph examples $\{(\mathbf{x}_i,y_i)\}_{i=1}^l$ and a set of u unlabeled graph examples $\{\mathbf{x}_j\}_{j=l+1}^{l+u}$, let f be a semi-supervised SVM for the graph training samples. The graph or Laplacian support vector machines are obtained by solving the following optimization problem [2]:

$$\displaystyle \begin{aligned}{\mathbf{f}}^*=\operatorname*{\text{arg}\ \text{min}}_{\mathbf{f}}\left \{\frac 1l\sum_{i=1}^l \big (1-y_i f({\mathbf{x}}_i^{\,} )\big )_+ + \gamma_A^{\,}\|\mathbf{f}\|{}_K^2+\frac{\gamma_l^{\,}}{(u+l)^2}{\mathbf{f}}^T\mathbf{Lf}\right \}.{}\end{aligned} $$
(8.2.44)
By the extended representer theorem (Theorem 8.5), the solution to the problem (8.2.44) is given by

$$\displaystyle \begin{aligned} {\mathbf{f}}^* =\sum_{i=1}^{l+u}\alpha_i^* K(\mathbf{x},{\mathbf{x}}_i).{}\end{aligned} $$
(8.2.45)
Substituting (8.2.45) into (8.2.44) and adding an unregularized bias term b to $\mathbf{f}^*$, the primal problem for optimizing α can be written as

$$\displaystyle \begin{aligned} \operatorname*{\text{arg}\ \text{min}}_{{\boldsymbol \alpha}\in\mathbb{R}^{l+u},{\boldsymbol \xi}\in \mathbb{R}^l}&\left \{\frac 1l\sum_{i=1}^l\xi_i^{\,} +\gamma_A^{\,} {\boldsymbol \alpha}^T\mathbf{K}{\boldsymbol \alpha}+\frac{\gamma_l^{\,}}{(u+l)^2}{\boldsymbol \alpha}^T\mathbf{KLK}{\boldsymbol \alpha}\right \}, \end{aligned} $$
(8.2.46)

$$\displaystyle \begin{aligned} \text{subject to}\quad  &y_i\left (\sum_{j=1}^{l+u}\alpha_j^{\,} K({\mathbf{x}}_i^{\,},{\mathbf{x}}_j^{\,} )+b\right )\geq 1-\xi_i^{\,} ,\quad  i=1,\ldots ,l, \end{aligned} $$
(8.2.47)

$$\displaystyle \begin{aligned} &\xi_i^{\,} \geq 0,\quad  i=1,\ldots ,l. \end{aligned} $$
(8.2.48)
By using the augmented Lagrange multiplier method, the Lagrange function for dual unconstrained optimization is then given by

$$\displaystyle \begin{aligned} L({\boldsymbol \alpha},{\boldsymbol \xi},b,\beta ,\zeta )&=\frac 1l\sum_{i=1}^l \xi_i^{\,} +\frac 12 {\boldsymbol \alpha}^T \left (2\gamma_A^{\,}\mathbf{ K}+2\frac{\gamma_l^{\,}} {(l+u)^2}\mathbf{KLK}\right ){\boldsymbol \alpha}\\ &\quad  -\sum_{i=1}^l\beta_i^{\,} \left [y_i^{\,} \left (\sum_{j=1}^{l+u}\alpha_j^{\,} K({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} )+b\right )-1+\xi_i^{\,} \right ]- \sum_{i=1}^l\zeta_i^{\,}\xi_i^{\,} .{} \end{aligned} $$
(8.2.49)
From the first-order optimization conditions

$$\displaystyle \begin{aligned} \frac{\partial L}{\partial b}=0\quad  &\Rightarrow\quad  \sum_{i=1}^l \beta_i^{\,} y_i^{\,} =0,\\ \frac{\partial L}{\partial \xi_i^{\,}} =0\quad  &\Rightarrow\quad  \frac 1l -\beta_i^{\,} -\zeta_i^{\,} =0,\\ &\Rightarrow\quad  0\leq \beta_i^{\,} \leq \frac 1l\quad  (\zeta_i^{\,} ,\xi_i^{\,}~\text{are nonnegative}). \end{aligned} $$
By using the above identities, the Lagrange function in (8.2.49) can be reduced to

$$\displaystyle \begin{aligned} L^R ({\boldsymbol \alpha},{\boldsymbol \beta})&=\frac 12{\boldsymbol \alpha^T}\left (2\gamma_A^{\,}\mathbf{K}+2\frac{\gamma_l^{\,}}{(u+l)^2}\mathbf{KLK}\right ){\boldsymbol \alpha}\!-\! \sum_{i=1}^l\beta_i\left (y_i^{\,}\sum_{j=1}^{l+u}\alpha_j^{\,} K({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} )-1\right )\\ &=\frac 12{\boldsymbol \alpha^T}\left (2\gamma_A^{\,}\mathbf{K}+2\frac{\gamma_l^{\,}}{(u+l)^2}\mathbf{KLK}\right ){\boldsymbol \alpha}\!-\!{\boldsymbol \alpha}^T\mathbf{ KJ}^T\mathbf{Y} {\boldsymbol \beta}+\sum_{i=1}^l\beta_i^{\,} , \end{aligned} $$
(8.2.50)
where $\mathbf{J}=[\mathbf{I}\ \ \mathbf{0}]$ is an l × (l + u) matrix whose first l columns form the l × l identity matrix and whose remaining entries are 0, and $\mathbf{Y}=\mathbf{Diag}(y_1^{\,},\ldots,y_l^{\,})$.
From the first-order optimization condition

$$\displaystyle \begin{aligned} \frac{\partial L^R}{\partial {\boldsymbol \alpha}}=\left (2\gamma_A^{\,}\mathbf{K}+2\frac{\gamma_l^{\,}}{(u+l)^2}\mathbf{KLK}\right ){\boldsymbol \alpha}-\mathbf{ KJ}^T\mathbf{Y} {\boldsymbol \beta}=\mathbf{0} \end{aligned} $$
(8.2.51)
it follows that

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}^*=\left (2\gamma_A^{\,} \mathbf{I}+2\frac{\gamma_l^{\,}}{(u+l)^2}\mathbf{LK}\right )^{-1}{\mathbf{J}}^T\mathbf{Y}{\boldsymbol \beta}^*. \end{aligned} $$
(8.2.52)

8.3 Support Vector Machine Regression

The support vector machine regression is an algorithm that looks for an optimal hyperplane as a decision function in a high-dimensional space.

8.3.1 Support Vector Machine Regressor

Suppose we are given a training set of N data points $\{\mathbf{x}_k, y_k\}$, $k=1,\ldots,N$, where $\mathbf{x}_k\in\mathbb{R}^n$ is the kth input pattern and $y_k\in\mathbb{R}$ is the kth associated "truth".

Let $\boldsymbol\phi: I\subseteq\mathbb{R}^n\to F\subseteq\mathbb{R}^N$ be a mapping from the input space $I\subseteq\mathbb{R}^n$ to the feature space F. Here, $\boldsymbol\phi(\mathbf{x}_i)$ is the extracted feature of the input $\mathbf{x}_i$.

The SVM learning algorithm aims at finding a hyperplane (w, b) through the constrained optimization

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},b} \left \{ f(\mathbf{w},b)=\|\mathbf{w}\|{}_2^2\right \},{} \end{aligned} $$
(8.3.1)

$$\displaystyle \begin{aligned} &\text{subject to}~~ \sum_{i=1}^N y_i^{\,} \left [{\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_i^{\,} )-b\right ]\geq 0, \end{aligned} $$
(8.3.2)
where the quantity $\langle\mathbf{w},\boldsymbol\phi(\mathbf{x}_i)\rangle-b$ corresponds to the distance between the point $\mathbf{x}_i$ and the decision boundary, and the quantity

$$\displaystyle \begin{aligned} \gamma =\sum_{i=1}^N y_i^{\,}[\langle \mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_i^{\,} )\rangle -b] \end{aligned} $$
(8.3.3)
is called the margin.
The constrained optimization problem (8.3.1) can be rewritten as an unconstrained optimization in Lagrangian form:

$$\displaystyle \begin{aligned} \min_{\mathbf{w},b}\left \{ L(\mathbf{w},b)=\|\mathbf{w}\|{}_2^2 -\sum_{i=1}^N\alpha_i y_i\left [{\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_i^{\,} )-b\right ]\right\}, \end{aligned} $$
(8.3.4)
where the Lagrange multipliers $\alpha_i$ are nonnegative.
From the optimization conditions, we have

$$\displaystyle \begin{aligned} &\frac{\partial L(\mathbf{w},b)}{\partial \mathbf{w}}=\mathbf{w}-\sum_{i=1}^N \alpha_i^{\,} y_i^{\,} {\boldsymbol \phi}({\mathbf{x}}_i^{\,} )=0~\Rightarrow~ \mathbf{w}=\sum_{i=1}^N\alpha_i^{\,} y_i^{\,}{\boldsymbol \phi}({\mathbf{x}}_i^{\,} ),{} \end{aligned} $$
(8.3.5)

$$\displaystyle \begin{aligned} &\frac{\partial L(\mathbf{w},b)}{\partial b}=\sum_{i=1}^N \alpha_i^{\,} y_i^{\,} =0. \end{aligned} $$
(8.3.6)
Substituting these two results into (8.3.1) gives the following constrained optimization with respect to α:

$$\displaystyle \begin{aligned} \min_{\boldsymbol \alpha} \left \{ J_1({\boldsymbol \alpha})=\sum_{i=1}^N\sum_{j=1}^N \alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} K({\mathbf{x}}_i^{\,} ,\mathbf{ x}_j^{\,} ) -\sum_{i=1}^N \alpha_i^{\,}\right \}, \end{aligned} $$
(8.3.7)
or

$$\displaystyle \begin{aligned} \max_{\boldsymbol \alpha} \left \{ J_2({\boldsymbol \alpha})=\sum_{i=1}^N \alpha_i^{\,} -\sum_{i=1}^N\sum_{j=1}^N \alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} K( {\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} )\right \}{} \end{aligned} $$
(8.3.8)
subject to

$$\displaystyle \begin{aligned} \sum_{i=1}^N \alpha_i^{\,} y_i^{\,} =0\quad \text{and}\quad  \alpha_i^{\,} >0,\quad  i=1,\ldots ,N,{} \end{aligned} $$
(8.3.9)
where $K(\mathbf{x}_i,\mathbf{x}_j)=\langle\boldsymbol\phi(\mathbf{x}_i),\boldsymbol\phi(\mathbf{x}_j)\rangle=\boldsymbol\phi^T(\mathbf{x}_i)\boldsymbol\phi(\mathbf{x}_j)$.
The support vector machine regression algorithm consists of the following steps:
  • solve the maximization problem (8.3.8) and (8.3.9) for the Lagrange multipliers $\alpha_i$, i = 1, …, N;

  • update the bias via $b\leftarrow b-\eta\sum_{i=1}^N\alpha_i y_i$;

  • use (8.3.5) to calculate the support vector regressor w.

8.3.2 𝜖-Support Vector Regression

The basic idea of 𝜖-support vector (SV) regression is to find a function $f(\mathbf{x})=\mathbf{w}^T\boldsymbol\phi(\mathbf{x})+b$ that has at most 𝜖 deviation from the actually obtained targets $y_i$ for all the training data $\mathbf{x}_1,\ldots,\mathbf{x}_N$, namely $|f(\mathbf{x}_i)-y_i|\leq\epsilon$ for i = 1, …, N, and at the same time w is as flat as possible. In other words, we do not count errors as long as they are less than 𝜖, but we will not accept any deviation larger than this [36].

One way to ensure the flatness of w is to minimize its norm $\|\mathbf{w}\|_2^2=\langle\mathbf{w},\mathbf{w}\rangle$. Hence, the basic form of 𝜖-support vector regression can be written as a convex optimization problem [36, 43]:

$$\displaystyle \begin{aligned} &\min_{\mathbf{w}}\quad  \frac 12\|\mathbf{w}\|{}_2^2,{} \end{aligned} $$
(8.3.10)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  \left\{\begin{aligned} y_k^{\,}-\langle \mathbf{w},\mathbf{\phi}({\mathbf{x}}_k)\rangle -b\leq\epsilon ,\\ \langle \mathbf{w},\mathbf{\phi}({\mathbf{x}}_k)\rangle +b-y_k^{\,}\leq\epsilon ,\end{aligned}\right.{}\end{aligned} $$
(8.3.11)
where 𝜖 > 0 is the allowed regression error.
In order to avoid violating the constraints in (8.3.11), one can introduce slack variables $(\xi_k,\xi_k^*)$ for given parameters C > 0 and 𝜖 > 0. Hence we have the standard form of support vector regression given by Vapnik [44]:

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},b,\xi_k,\xi_k^*}\left\{\frac 12\langle \mathbf{w},\mathbf{w}\rangle +C\sum_{k=1}^N (\xi_k+\xi_k^*) \right\}{} \end{aligned} $$
(8.3.12)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  y_k^{\,}-(\langle\mathbf{w},{\boldsymbol\phi}({\mathbf{x}}_k)\rangle +b)\leq \epsilon +\xi_k,{} \end{aligned} $$
(8.3.13)

$$\displaystyle \begin{aligned} &\hskip 17mm (\langle\mathbf{w},{\boldsymbol\phi}({\mathbf{x}}_k)\rangle +b)-y_k^{\,}\leq \epsilon +\xi_k^*,{} \end{aligned} $$
(8.3.14)

$$\displaystyle \begin{aligned} &\hskip 17mm\xi_k,\xi_k^*\geq 0,\quad  k=1,\ldots ,N.{}\end{aligned} $$
(8.3.15)
Here C > 0 is a pre-specified regularization parameter that controls the SVM's tolerance to training errors. $\xi_k$ represents the upper training error, and $\xi_k^*$ the lower training error, with respect to the 𝜖-insensitive tube $|y_k-(\langle\mathbf{w},\boldsymbol\phi(\mathbf{x}_k)\rangle +b)|\leq\epsilon$.
To solve the above constrained optimization problem, one can define the Lagrange function (or Lagrangian) as follows [36]:

$$\displaystyle \begin{aligned} \mathbb{L}=&\,\mathbb{L}(\mathbf{w},b,\xi_k,\xi_k^*)\\ =&\,\frac 12\|\mathbf{w}\|{}_2^2+C\sum_{k=1}^N (\xi_k+\xi_k^* )-\sum_{k=1}^N (\eta_k\xi_k +\eta_k^*\xi_k^*)\\ &\,-\sum_{k=1}^N \alpha_k \left (\epsilon +\xi_k-y_k^{\,}+\langle \mathbf{w},{\boldsymbol\phi}({\mathbf{x}}_k)\rangle +b\right )\\ &\,-\sum_{k=1}^N \alpha_k^* \left (\epsilon +\xi_k^*+y_k^{\,}-\langle \mathbf{w},{\boldsymbol\phi}({\mathbf{x}}_k)\rangle -b\right ),{}\end{aligned} $$
(8.3.16)
where $\eta_k,\eta_k^*,\alpha_k,\alpha_k^*$ are Lagrange multipliers which have to satisfy the positivity constraints:

$$\displaystyle \begin{aligned} \eta_k^{(*)},\alpha_k^{(*)}\geq 0,{}\end{aligned} $$
(8.3.17)
where $\eta_k^{(*)}=\{\eta_k,\eta_k^*\}$ and $\alpha_k^{(*)}=\{\alpha_k,\alpha_k^*\}$.
From the first-order optimality conditions it follows that

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \mathbf{w}}=0\quad  &\Rightarrow\quad  \mathbf{w}=\sum_{k=1}^N (\alpha_k^{\,} -\alpha_k^*){\boldsymbol \phi} ({\mathbf{x}}_k^{\,} ),{} \end{aligned} $$
(8.3.18)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial b}=0\quad  &\Rightarrow\quad  \sum_{k=1}^N (\alpha_k^{\,} -\alpha_k^* )=0,{} \end{aligned} $$
(8.3.19)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \xi_k^{\,}}=0\quad  &\Rightarrow\quad  \eta_k^{\,} +\alpha_k^{\,} =C,{} \end{aligned} $$
(8.3.20)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \xi_k^*}=0\quad  &\Rightarrow\quad  \eta_k^* +\alpha_k^* =C.{} \end{aligned} $$
(8.3.21)
Equation (8.3.18) is the so-called support vector expansion, i.e., w can be completely described as a linear combination of the training patterns x i.
Substituting (8.3.18) and (8.3.19) into (8.3.16) eliminates the dual variables $\eta_i,\eta_i^*$ and gives the dual function

$$\displaystyle \begin{aligned} \mathbb{L}_D ({\boldsymbol \alpha},{\boldsymbol \alpha}^*)&=-\frac 12\sum_{i=1}^N\sum_{j=1}^N (\alpha_i^{\,} -\alpha_i^*)(\alpha_j^{\,} -\alpha_j^* )\langle {\boldsymbol \phi}({\mathbf{x}}_i),{\boldsymbol \phi}({\mathbf{x}}_j^{\,} )\rangle\\ &\quad  -\epsilon \sum_{i=1}^N (\alpha_i^{\,} +\alpha_i^*)+\sum_{i=1}^N y_i^{\,} (\alpha_i^{\,} -\alpha_i^*).{} \end{aligned} $$
(8.3.22)
On the other hand, from (8.3.17), (8.3.20), and (8.3.21) it is known that

$$\displaystyle \begin{aligned} 0\leq \alpha_i^{\,} ,\alpha_i^*\leq C,\quad  i=1,\ldots ,N.{} \end{aligned} $$
(8.3.23)
Equation (8.3.22) together with the constraints (8.3.19) and (8.3.23) constitutes the Wolfe dual 𝜖-support vector machine regression problem:

$$\displaystyle \begin{aligned} &\max_{{\boldsymbol \alpha},{\boldsymbol \alpha}^*}\left \{-\frac 12 ({\boldsymbol \alpha}-{\boldsymbol \alpha}^*)^T \mathbf{Q}({\boldsymbol \alpha}-{\boldsymbol \alpha}^*)-\epsilon \sum_{i=1}^N (\alpha_i^{\,} +\alpha_i^*)+\sum_{i=1}^N y_i^{\,} (\alpha_i^{\,} -\alpha_i^*)\right \},{} \end{aligned} $$
(8.3.24)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  \sum_{k=1}^N (\alpha_k^{\,} -\alpha_k^* )=0, \end{aligned} $$
(8.3.25)

$$\displaystyle \begin{aligned} &\hskip 17mm 0\leq \alpha_i^{\,} ,\alpha_i^*\leq C,\quad  i=1,\ldots ,N, \end{aligned} $$
(8.3.26)
where $\mathbf{Q}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{i,j=1}^{N,N}$ is an N × N positive semi-definite matrix with $K(\mathbf{x}_i,\mathbf{x}_j)=\langle\boldsymbol\phi(\mathbf{x}_i),\boldsymbol\phi(\mathbf{x}_j)\rangle$ being the kernel function of the SVM.
Then the regression function is given by

$$\displaystyle \begin{aligned} f(\mathbf{x})={\mathbf{w}}^T{\boldsymbol \phi}(\mathbf{x})+b=\sum_{i=1}^N(\alpha_i^{\,} -\alpha_i^* )K({\mathbf{x}}_i^{\,} ,\mathbf{x})+b,\end{aligned} $$
(8.3.27)
where w is described by the support vector expansion Eq. (8.3.18).
The KKT conditions for the dual constrained optimization problem (8.3.24) are given by [36]

$$\displaystyle \begin{aligned} \alpha_i(\epsilon +\xi_i-y_i+\langle \mathbf{w},{\mathbf{x}}_i\rangle +b)&=0, \end{aligned} $$
(8.3.28)

$$\displaystyle \begin{aligned} \alpha_i^*(\epsilon +\xi_i^*+y_i-\langle \mathbf{w},{\mathbf{x}}_i\rangle -b)&=0, \end{aligned} $$
(8.3.29)

$$\displaystyle \begin{aligned} (C-\alpha_i )\xi_i &=0. \end{aligned} $$
(8.3.30)

$$\displaystyle \begin{aligned} (C-\alpha_i^* )\xi_i^* &=0.\end{aligned} $$
(8.3.31)
From the above conditions it follows that

$$\displaystyle \begin{aligned} \epsilon - y_i +\langle \mathbf{w},{\mathbf{x}}_i\rangle +b&\geq 0\quad \text{and}\quad  \xi_i =0 \text{ if }\alpha_i < C, \end{aligned} $$
(8.3.32)

$$\displaystyle \begin{aligned} \epsilon - y_i +\langle \mathbf{w},{\mathbf{x}}_i\rangle +b&\leq 0\qquad \qquad  \text{if }\alpha_i>0, \end{aligned} $$
(8.3.33)

$$\displaystyle \begin{aligned} \epsilon + y_i -\langle \mathbf{w},{\mathbf{x}}_i\rangle -b&\geq 0\quad \text{and}\quad  \xi_i^* =0 \text{ if }\alpha_i^* < C, \end{aligned} $$
(8.3.34)

$$\displaystyle \begin{aligned} \epsilon + y_i -\langle \mathbf{w},{\mathbf{x}}_i\rangle -b&\leq 0\qquad \qquad  \text{if }\alpha_i^*>0.\end{aligned} $$
(8.3.35)
Then the computation of b is given by Smola and Schölkopf [36]:

$$\displaystyle \begin{aligned} &\max\{-\epsilon + y_i-\langle \mathbf{w},{\mathbf{x}}_i\rangle |\alpha_i < C\text{ or } \alpha_i^* > 0\}\leq b\leq\\ &\min\{-\epsilon + y_i-\langle \mathbf{w},{\mathbf{x}}_i\rangle |\alpha_i> 0\text{ or } \alpha_i^* <C\}.\end{aligned} $$
(8.3.36)
If some $\alpha_i$ or $\alpha_i^*\in (0,C)$, then the corresponding inequalities become equalities.
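
In practice the dual (8.3.24)–(8.3.26) is solved by specialized software. The hedged sketch below (scikit-learn, an RBF kernel, and a noisy sine curve are assumptions) fits an 𝜖-SVR model and reports how many training points become support vectors, i.e., have a nonzero $\alpha_i-\alpha_i^*$ in the expansion (8.3.27).

```python
# epsilon-SVR on a noisy sine curve (a minimal sketch; the synthetic data and
# hyperparameters are assumptions, not part of the book's presentation).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)

# model.dual_coef_ holds the nonzero (alpha_i - alpha_i^*) of the expansion (8.3.27);
# only points outside (or on) the epsilon-tube become support vectors.
print("number of support vectors:", len(model.support_))
print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```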

8.3.3 ν-Support Vector Machine Regression

In the standard support vector machine regression described above, an error of 𝜖 is allowed at each point $\mathbf{x}_i$. By introducing a constant ν ≥ 0, the size of 𝜖 can be traded off against model complexity and slack variables. This is the basic idea of the following ν-support vector machine regression developed by Schölkopf et al. [34]:

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},\xi_i,{\boldsymbol \xi}_i^*,\epsilon}\left \{ \frac 12\|\mathbf{w}\|{}_2^2+C\bigg (\nu \epsilon+\frac 1N\sum_{i=1}^N(\xi_i^{\,}+\xi_i^*)\bigg )\right \}{} \end{aligned} $$
(8.3.37)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  ({\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b)-y_i^{\,}\leq \epsilon +\xi_i^{\,},{} \end{aligned} $$
(8.3.38)

$$\displaystyle \begin{aligned} &\hskip 17mm y_i^{\,}-({\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b)\leq \epsilon +\xi_i^*,{} \end{aligned} $$
(8.3.39)

$$\displaystyle \begin{aligned} &\hskip 17mm\xi_i^{\,}\geq 0,\,\xi_i^*\geq 0,\quad  i=1,\ldots ,N.{} \end{aligned} $$
(8.3.40)
By introducing the Lagrange multipliers 
$$\alpha _i^{\,},\alpha _i^*,\eta _i^{\,},\eta _i^*,\beta \geq 0$$
, one has the Lagrange function [34]:

$$\displaystyle \begin{aligned} \mathbb{L}&\left (\mathbf{w},b,{\boldsymbol \xi}^*,\epsilon ,{\boldsymbol \alpha}^*,\beta ,{\boldsymbol \eta}^*\right )\\ &\quad  =\frac 12\|\mathbf{w}\|{}_2^2+C\nu\epsilon +\frac CN\sum_{i=1}^N (\xi_i^{\,}+\xi_i^*)-\beta\epsilon -\sum_{i=1}^N (\eta_i^{\,} \xi_i^{\,}+\eta_i^*\xi_i^*)\\ &\qquad  -\sum_{i=1}^N\alpha_i^{\,}\left (\xi_i^{\,}+y_i^{\,}-{\mathbf{w}}^T{\mathbf{x}}_i^{\,}-b+\epsilon\right )\\ &\qquad  -\sum_{i=1}^N \alpha_i^*\left (\xi_i^*+{\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b-y_i^{\,}+\epsilon\right ).{} \end{aligned} $$
(8.3.41)
The first-order optimality conditions give the results

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \mathbf{w}}=0&\quad \Rightarrow\quad  \mathbf{w}=\sum_{i=1}^N (\alpha_i^*-\alpha_i^{\,}){\mathbf{x}}_i,{} \end{aligned} $$
(8.3.42)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \epsilon}=0&\quad \Rightarrow\quad  C\nu -\sum_{i=1}^N (\alpha_i^{\,}+\alpha_i^*)-\beta =0,{} \end{aligned} $$
(8.3.43)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial b}=0&\quad \Rightarrow\quad  \sum_{i=1}^N (\alpha_i^{\,}-\alpha_i^* )=0,{} \end{aligned} $$
(8.3.44)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \xi_i^*}=0&\quad \Rightarrow\quad  \frac CN-\alpha_i^*-\eta_i^* =0,\quad  i=1,\ldots ,N;{} \end{aligned} $$
(8.3.45)
for minimizing over the primal variables 
$$\mathbf {w},\epsilon , b, \xi _i^*$$
, while maximizing over the dual variables 
$$\alpha _i^*, \eta _i^*, \beta $$
yields

$$\displaystyle \begin{aligned}\frac{\partial \mathbb{L}}{\partial \alpha_i^*}=0\quad \Rightarrow\quad  \xi_i^*+{\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b-y_i^{\,}+\epsilon =0, \end{aligned}$$
and

$$\displaystyle \begin{aligned}\frac{\partial \mathbb{L}}{\partial \eta_i^*}=0\quad \Rightarrow\quad  \xi_i^* =0,\quad \frac{\partial \mathbb{L}}{\partial \beta}=0\quad \Rightarrow\quad  \epsilon =0. \end{aligned}$$
Hence, the regression result is given by

$$\displaystyle \begin{aligned} y_i^{\,}={\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b. \end{aligned} $$
(8.3.46)
Substituting the four constraints in (8.3.42)–(8.3.45) into the Lagrange function 
$$\mathbb {L}$$
defined in (8.3.41) leads to the optimization problem called the Wolfe dual. The Wolfe dual ν-support vector machine regression problem can be written as [34]

$$\displaystyle \begin{aligned} &\max\left\{ \mathbb{W}({\boldsymbol \alpha}^*)=\sum_{i=1}^N (\alpha_i^*-\alpha_i^{\,})y_i^{\,}-\frac 12\sum_{i=1}^N\sum_{j=1}^N (\alpha_i^*-\alpha_i^{\,})(\alpha_j^*-\alpha_j^{\,}) K({\mathbf{x}}_i^{\,},{\mathbf{x}}_j^{\,})\right\} \end{aligned} $$
(8.3.47)

$$\displaystyle \begin{aligned} &\text{subject to}~~\sum_{i=1}^N (\alpha_i^{\,}-\alpha_i^* )=0,~~\alpha_i^*\in \left [ 0,\frac CN\right ],~~\sum_{i=1}^N (\alpha_i^{\,}+\alpha_i^* )\leq C\nu . \end{aligned} $$
(8.3.48)
After solving the Wolfe dual problem, the regression value of the new instance x is given by

$$\displaystyle \begin{aligned} f(\mathbf{x})=\sum_{i=1}^N (\alpha_i^*-\alpha_i^{\,})K(\mathbf{x},{\mathbf{x}}_i^{\,})+b. \end{aligned} $$
(8.3.49)
Proposition 8.1 ([34])
Suppose ν-SVR is applied to some data set, and the resulting 𝜖 is nonzero. The following statements hold:
  • ν is an upper bound on the fraction of errors.

  • ν is a lower bound on the fraction of SVs.

The following function is a popular choice to illustrate SVR for regression in the literature:

$$\displaystyle \begin{aligned} y(x)=\left\{\begin{aligned} &\sin (x)/x,&x\neq 0;\\ &0,&x=0.\end{aligned}\right. \end{aligned} $$
(8.3.50)
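As an illustration of Proposition 8.1, the following sketch (the sample size, noise level, and the choices ν = 0.3, C, γ are arbitrary assumptions) fits scikit-learn's NuSVR, which implements the ν-SVR above, to noisy samples of the function (8.3.50) and checks that the fraction of support vectors is at least ν:

import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
x = rng.uniform(-10.0, 10.0, size=200)
# Eq. (8.3.50); the inner np.where guards the division at x = 0
y_clean = np.where(x == 0.0, 0.0, np.sin(x) / np.where(x == 0.0, 1.0, x))
y = y_clean + 0.1 * rng.standard_normal(x.size)

model = NuSVR(kernel="rbf", nu=0.3, C=10.0, gamma=0.5).fit(x.reshape(-1, 1), y)
frac_sv = model.support_.size / x.size
print(f"fraction of support vectors: {frac_sv:.2f} (lower-bounded by nu = 0.3)")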

8.4 Support Vector Machine Binary Classification

A basic task in data analysis and pattern recognition is classification, which requires the construction of a classifier, namely a function that assigns a class label to instances described by a set of attributes.

In this section we discuss SVM binary classification problems with two classes.

8.4.1 Support Vector Machine Binary Classifier

Consider a training set of N data points 
$$\{{\mathbf {x}}_k,y_k^{\,}\}, k=1,\ldots ,N$$
, where x k is the kth input pattern and 
$$y_k^{\,}$$
is the kth output pattern. Denote the set of data

$$\displaystyle \begin{aligned} \mathbb{D}=\left\{ ({\mathbf{x}}_1,y_1^{\,}),\ldots ,({\mathbf{x}}_N,y_N^{\,})\right \},\quad  {\mathbf{x}}_k\in\mathbb{R}^n,\,y_k^{\,} \in\{-1,+1\}{}\end{aligned} $$
(8.4.1)

An SVM classifier finds the hyperplane that separates the data with the largest distance (called the margin) between the hyperplane and the closest data points.

A linear separating hyperplane is a decision function in the form

$$\displaystyle \begin{aligned} f(\mathbf{x})=\mathrm{sign}({\mathbf{w}}^T\mathbf{x}+b),{}\end{aligned} $$
(8.4.2)
where x is the input pattern, w is the weight vector, and b is a bias term. Hence, we have

$$\displaystyle \begin{aligned} {\mathbf{w}}^T\mathbf{x} +b\left \{\begin{aligned} >0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_+;\\ <0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_-;\\ =0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_+~\mathrm{or}~\mathbf{x}\in \mathrm{class}~S_-.\end{aligned}\right .\end{aligned} $$
(8.4.3)
Similarly, for a nonlinear classifier w with the nonlinear mapping ϕ : x →ϕ(x), the test output is y = f(x) = w Tϕ(x) + b, where b denotes a bias term. Then, we have

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\boldsymbol \phi}(\mathbf{x}) +b\left \{\begin{aligned} >0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_+;\\ <0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_-;\\ =0,&~~~\mathrm{then}~\mathbf{x}\in \mathrm{class}~S_+~\mathrm{or}~\mathbf{x}\in \mathrm{class}~S_-.\end{aligned}\right .\end{aligned} $$
(8.4.4)
Therefore the decision function of a nonlinear classifier is given by

$$\displaystyle \begin{aligned} \text{class of }\mathbf{x}=\mathrm{sign}\left ({\mathbf{w}}^T{\boldsymbol \phi}(\mathbf{x})+b\right ).{}\end{aligned} $$
(8.4.5)
For an SVM with the kernel function 
$$K(\mathbf {x},{\mathbf {x}}_i^{\,} )={\boldsymbol \phi }^T(\mathbf {x}){\boldsymbol \phi }({\mathbf {x}}_i^{\,} )$$
, the SVM classifier w is usually designed as

$$\displaystyle \begin{aligned} \mathbf{w}=\sum_{i=1}^N\alpha_i^{\,} y_i^{\,} {\boldsymbol \phi}({\mathbf{x}}_i).{}\end{aligned} $$
(8.4.6)
Substituting (8.4.6) into (8.4.5), we directly get the decision function of any SVM classifier as follows:

$$\displaystyle \begin{aligned} \text{class of }\mathbf{x}=\mathrm{sign}\left (\sum_{i=1}^N\alpha_i^{\,} y_i^{\,} K(\mathbf{x},{\mathbf{x}}_i^{\,} )+b\right ).{} \end{aligned} $$
(8.4.7)
Hence, when designing any SVM classifier, its weight vector w must have the form in (8.4.6).

The set of vectors {x 1, …, x N} is said to be optimally separated by the hyperplane if it is separated without error and the distance from the hyperplane to the closest vector is maximal.

For designing the classifier w, we assume that

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b&\geq 1,\quad \,\mathrm{if}~y_k^{\,}=+1, \end{aligned} $$
(8.4.8)

$$\displaystyle \begin{aligned} {\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b&\leq -1,~\mathrm{if}~y_k^{\,}=-1. \end{aligned} $$
(8.4.9)
That is to say, if the data is to be classified correctly, this hyperplane should ensure that

$$\displaystyle \begin{aligned} y_k^{\,} \left ( {\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b\right )>0,~\text{for all }k=1,\ldots ,N,{} \end{aligned} $$
(8.4.10)
assuming that y ∈{−1, +1}. Here ϕ(x k) is a nonlinear vector function mapping the input space 
$$\mathbb {R}^n$$
into a higher-dimensional space, but this function is not explicitly constructed.
The distance of a point x k from the hyperplane, denoted as d(w, b;x k), is defined as

$$\displaystyle \begin{aligned} d(\mathbf{w},b;{\mathbf{x}}_k)=\frac{\left |\langle \mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_k)\rangle +b\right |}{\|\mathbf{w}\|}. \end{aligned} $$
(8.4.11)
By the constraint condition 
$$y_k^{\,}\left (\langle \mathbf {w},{\boldsymbol \phi }({\mathbf {x}}_k)\rangle + b\right )\geq 1$$
in (8.4.10), the margin of a classifier is given by

$$\displaystyle \begin{aligned} \rho (\mathbf{w},b)&=\min_{{\mathbf{x}}_k:y_k^{\,}=-1}d(\mathbf{w},b;{\mathbf{x}}_k)+\min_{{\mathbf{x}}_k:y_k^{\,}=1}d(\mathbf{w},b;{\mathbf{x}}_k)\\ &=\min_{{\mathbf{x}}_k:y_k^{\,}=-1}\frac{\left |\langle \mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_k)\rangle +b\right |}{\|\mathbf{w}\|}+\min_{{\mathbf{x}}_k:y_k^{\,}=1} \frac{\left |\langle \mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_k)\rangle +b\right |}{\|\mathbf{w}\|}\\ &=\frac 1{\|\mathbf{w}\|}\left (\min_{{\mathbf{x}}_k:y_k^{\,}=-1}\left |\langle\mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_k)\rangle +b\right |+\min_{\mathbf{ x}_k:y_k^{\,}=1} \left |\langle\mathbf{w},{\boldsymbol \phi}({\mathbf{x}}_k)\rangle +b\right |\right )\\ &=\frac 2{\|\mathbf{w}\|}. \end{aligned} $$
(8.4.12)
The optimal hyperplane w opt is given by maximizing the above margin, namely

$$\displaystyle \begin{aligned} \mathbf{w}=\operatorname*{\text{arg}\ \text{max}}_{\mathbf{w}}\left \{\rho (\mathbf{w},b)=\frac 2{\|\mathbf{w}\|}\right \} \end{aligned} $$
(8.4.13)
which is equivalent to

$$\displaystyle \begin{aligned} \mathbf{w}=\operatorname*{\text{arg}\ \text{min}}_{\mathbf{w}}~\frac 12\|\mathbf{w}\|. \end{aligned} $$
(8.4.14)
To allow the constraint (8.4.10) to be violated when the data are not perfectly separable, slack variables ξ k need to be introduced such that

$$\displaystyle \begin{aligned} y_k^{\,} \left ( {\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b\right )&\geq 1-\xi_k,\quad  k=1,\ldots ,N, {} \end{aligned} $$
(8.4.15)

$$\displaystyle \begin{aligned} \xi_k &\geq 0,\quad  k=1,\ldots ,N.{} \end{aligned} $$
(8.4.16)
According to the structural risk minimization principle, the risk bound is minimized by formulating the optimization problem

$$\displaystyle \begin{aligned} \min\left \{ \frac 12\|\mathbf{w}\|{}^2+C\sum_{k=1}^N\xi_k\right \}{} \end{aligned} $$
(8.4.17)
subject to (8.4.15) and (8.4.16), where C is a user-specified parameter providing a trade-off between the width of the separating margin and the training error.
Therefore the primal problem for the SVM binary classifier is a constrained optimization problem:

$$\displaystyle \begin{aligned} &\min \left \{\mathbb{L}_{\mathrm{P}_{\mathrm{SVM}}}^{\,}=\frac 12 \|\mathbf{w}\|{}_2^2+C\sum_{i=1}^N \xi_i\right \}{} \end{aligned} $$
(8.4.18)

$$\displaystyle \begin{aligned} &\text{subject to}~~ y_i\left ({\mathbf{w}}^T {\boldsymbol \phi}({\mathbf{x}}_i)+b\right )\geq 1-\xi_i,~~\xi_i\geq 0~~(i=1,\ldots ,N).{} \end{aligned} $$
(8.4.19)
The dual form of the above primal optimization problem is given by

$$\displaystyle \begin{aligned} &\min \Bigg \{ \mathbb{L}_{\mathrm{D}_{\mathrm{SVM}}}^{\,}=\frac 12\sum_{i=1}^N\sum_{j=1}^Ny_i^{\,}y_j^{\,}\alpha_i^{\,} \alpha_j^{\,}\langle {\boldsymbol \phi}({\mathbf{x}}_i^{\,}),{\boldsymbol \phi}({\mathbf{x}}_j^{\,})\rangle -\sum_{i=1}^N\alpha_i^{\,}\Bigg \}{} \end{aligned} $$
(8.4.20)

$$\displaystyle \begin{aligned} &\text{subject to}~~\sum_{i=1}^N \alpha_i^{\,}y_i^{\,}=0,~~0\leq \alpha_i^{\,}\leq C~~(i=1,\ldots ,N),{} \end{aligned} $$
(8.4.21)
where 
$$\alpha _i^{\,}$$
is the Lagrange multiplier corresponding to ith training sample 
$$({\mathbf {x}}_i^{\,},y_i^{\,})$$
, and vectors 
$${\mathbf {x}}_i^{\,}$$
satisfying 
$$y_i^{\,}({\mathbf {w}}^T{\boldsymbol \phi }({\mathbf {x}}_i^{\,})+b)=1$$
are termed support vectors.
In SVM learning algorithms, kernel functions K(u, v) = 〈ϕ(u), ϕ(v)〉 are usually used, and the dual optimization problem for SVM binary classifier is represented as

$$\displaystyle \begin{aligned} &\min \Bigg \{\mathbb{L}_{\mathrm{D}_{\mathrm{SVM}}}^{\,}=\frac 12\sum_{i=1}^N\sum_{j=1}^Ny_i^{\,}y_j^{\,}\alpha_i^{\,} \alpha_j^{\,}K({\mathbf{x}}_i^{\,},{\mathbf{x}}_j^{\,})-\sum_{i=1}^N\alpha_i^{\,}\Bigg \}{} \end{aligned} $$
(8.4.22)

$$\displaystyle \begin{aligned} &\text{subject to}~~ \sum_{i=1}^N \alpha_i^{\,}y_i^{\,}=0,~~ 0\leq \alpha_i^{\,}\leq C~~(i=1,\ldots ,N).{}\end{aligned} $$
(8.4.23)
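The dual (8.4.22)–(8.4.23) is a standard quadratic program and can be handed to a generic QP solver. The sketch below is only an illustration under stated assumptions: the helper name svm_dual_fit, the RBF kernel, the parameter values, and the availability of the cvxopt package are all assumptions, not part of the text. It builds P ij = y iy jK(x i, x j), solves for α, and recovers the bias b from a support vector with 0 < α i < C:

import numpy as np
from cvxopt import matrix, solvers
from sklearn.metrics.pairwise import rbf_kernel

def svm_dual_fit(X, y, C=1.0, gamma=0.5):
    """Solve the dual (8.4.22)-(8.4.23): min (1/2) a^T P a - 1^T a, 0 <= a <= C, y^T a = 0."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma=gamma)
    P = matrix(np.outer(y, y) * K)                       # P_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones((N, 1)))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))       # encodes 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]).reshape(-1, 1))
    A = matrix(y.reshape(1, -1).astype(float))           # equality constraint y^T alpha = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
    # recover the bias from an in-margin support vector with 0 < alpha_i < C
    # (the sketch assumes at least one such support vector exists)
    sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    i = int(np.argmax(sv))
    bias = y[i] - np.sum(alpha * y * K[:, i])
    return alpha, bias

For y given as a NumPy array of ±1 labels, the resulting (α, b) can be plugged into the decision function (8.4.7).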
According to Bousquet et al. [5, pp. 14–15], several important practical points should be taken into account when designing classifiers.
  1.

    In order to reduce the likelihood of overfitting the classifier to the training data, the ratio of the number of training examples to the number of features should be at least 10:1. For the same reason the ratio of the number of training examples to the number of unknown parameters should be at least 10:1.

     
  2.

    Importantly, proper error-estimation methods should be used, especially when selecting parameters for the classifier.

     
  3.

    Some algorithms, such as those that form some kind of weighted average of the inputs, require the input features to be scaled to similar ranges.

     
  4.

    There is no single best classification algorithm!

     

8.4.2 ν-Support Vector Machine Binary Classifier

Similar to ν-SVR, by introducing a new parameter ν ∈ (0, 1] in SVM classification, Schölkopf et al. [34] presented ν-support vector classifier (ν-SVC).

The primal optimization problem of ν-SVC is described by

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},{\boldsymbol \xi},b,\rho}~\Bigg \{\frac 12\|\mathbf{w}\|{}_2^2-\nu\rho +\frac 1N\sum_{i=1}^N \xi_i^{\,}\Bigg\} \end{aligned} $$
(8.4.24)

$$\displaystyle \begin{aligned} &\text{subject to}~~ y_i^{\,}({\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_i)+b)\geq\rho -\xi_i^{\,},~~ \xi_i^{\,}\geq 0~~(i=1,\ldots ,N),~~\rho \geq 0.\end{aligned} $$
(8.4.25)
Let Lagrange multipliers 
$$\alpha _i^{\,},\beta _i,\delta \geq 0$$
, and consider the Lagrange function

$$\displaystyle \begin{aligned} \mathbb{L}(\mathbf{w},{\boldsymbol \xi},b,\rho ,{\boldsymbol \alpha},\beta ,\delta )&=\frac 12\|\mathbf{w}\|{}_2^2-\nu\rho +\frac 1N \sum_{i=1}^N \xi_i^{\,}-\delta\rho \\ &\quad  -\sum_{i=1}^N\left (\alpha_i^{\,}\left [y_i^{\,}({\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b)-\rho +\xi_i^{\,}\right ]+ \beta_i^{\,}\xi_i^{\,}\right ).{} \end{aligned} $$
(8.4.26)
This function needs to be minimized with respect to the primal variables w, ξ, b, ρ and maximized over the dual variables α, β, δ.
The first-order optimization conditions for minimization give the following results:

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \mathbf{w}}=0\quad &\Rightarrow\quad  \mathbf{w}=\sum_{i=1}^N\alpha_i^{\,}y_i^{\,} {\mathbf{x}}_i^{\,},{} \end{aligned} $$
(8.4.27)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \xi_i^{\,}}=0\quad &\Rightarrow\quad  \alpha_i^{\,}+\beta_i^{\,}=1/N,\quad  i=1,\ldots ,N, \end{aligned} $$
(8.4.28)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial b}=0\quad &\Rightarrow\quad  \sum_{i=1}^N\alpha_i^{\,}y_i^{\,}=0, \end{aligned} $$
(8.4.29)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}}{\partial \rho}=0\quad &\Rightarrow\quad  \sum_{i=1}^N \alpha_i^{\,}-\delta =\nu .{} \end{aligned} $$
(8.4.30)
Substituting (8.4.27)–(8.4.30) into the Lagrange function 
$$\mathbb {L}$$
, using 
$$\alpha _i^{\,},\beta _i^{\,}, \delta \geq 0$$
and incorporating the kernel function K(x, x i) instead of 
$${\mathbf {x}}_i^{\,}$$
in the dot products, one obtains the Wolfe dual optimization problem for ν-SVC as follows:

$$\displaystyle \begin{aligned} &\max_{\boldsymbol \alpha}\left\{ \mathbb{W}({\boldsymbol \alpha})=-\frac 12\sum_{i=1}^N\sum_{j=1}^N\alpha_i^{\,}\alpha_j^{\,}y_i^{\,}y_j^{\,} K({\mathbf{x}}_i^{\,},{\mathbf{x}}_j^{\,})\right\} \end{aligned} $$
(8.4.31)

$$\displaystyle \begin{aligned} &\text{subject to}~~ 0\leq \alpha_i^{\,}\leq 1/N~~(i=1,\ldots ,N);~~\sum_{i=1}^N \alpha_i^{\,}y_i^{\,}=0;~~\sum_{i=1}^N\alpha_i^{\,}\geq \nu . \end{aligned} $$
(8.4.32)

To compute the parameters b and ρ in the primal ν-SVC optimization problem, consider two sets S + and S −, containing support vectors x i with 
$$0<\alpha _i^{\,}<1/N$$
and 
$$y_i^{\,}=\pm 1$$
.

Let

$$\displaystyle \begin{aligned} s_1^{\,}&=|S_+|=|\{i|0<\alpha_i^{\,} <1/N,y_i^{\,}=1\}|, \end{aligned} $$
(8.4.33)

$$\displaystyle \begin{aligned} s_2^{\,}&=|S_-|=|\{i|0<\alpha_i^{\,} <1/N,y_i^{\,}=-1\}| \end{aligned} $$
(8.4.34)
be the sizes of the sets S + and S −, respectively.
If defining

$$\displaystyle \begin{aligned} r_1^{\,}&=\frac 1{s_1^{\,}}\sum_{\mathbf{x}\in S_+}\sum_{j=1}^N\alpha_j^{\,}y_j^{\,}K(\mathbf{x},{\mathbf{x}}_j^{\,}), \end{aligned} $$
(8.4.35)

$$\displaystyle \begin{aligned} r_2^{\,}&=-\frac 1{s_2^{\,}}\sum_{\mathbf{x}\in S_-}\sum_{j=1}^N\alpha_j^{\,}y_j^{\,}K(\mathbf{x},{\mathbf{x}}_j^{\,}), \end{aligned} $$
(8.4.36)
then one has [7, 34]

$$\displaystyle \begin{aligned} b=-\frac{r_1^{\,}-r_2^{\,}}2\quad \text{and}\quad  \rho =\frac{r_1^{\,}+r_2^{\,}}2. \end{aligned} $$
(8.4.37)
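In practice the ν-SVC is available directly, e.g., as NuSVC in scikit-learn, which implements the formulation above. The toy example below (the two-Gaussian data and the choices ν = 0.3, γ are arbitrary assumptions) again shows ν as a lower bound on the fraction of support vectors:

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(+1.0, 1.0, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

clf = NuSVC(nu=0.3, kernel="rbf", gamma=1.0).fit(X, y)
print("fraction of support vectors:", clf.support_.size / X.shape[0])  # roughly >= nu
print("training accuracy:", clf.score(X, y))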

8.4.3 Least Squares SVM Binary Classifier

Standard SVMs are powerful tools for data classification: they assign data points to one of two disjoint halfspaces, either in the original input space for linear classifiers or in a higher-dimensional feature space for nonlinear classifiers. The least squares SVMs (LS-SVMs) developed in [37] and the proximal SVMs (PSVMs) presented in [14, 15] are two much simpler classifiers, in which each class of points is assigned to the closest of two parallel planes (in input or feature space) that are pushed apart as far as possible.

The LS-SVMs formulate the binary classification problem as

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},b,e}\Bigg \{\frac 12 \|\mathbf{w}\|{}_2^2+\frac{C}2\sum_{k=1}^N e_k^2\Bigg \} \end{aligned} $$
(8.4.38)

$$\displaystyle \begin{aligned} &\text{subject to}~~ y_k^{\,}\left ({\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b\right )=1-e_k,\quad  k=1,\ldots ,N,{} \end{aligned} $$
(8.4.39)
for the binary classification 
$$y_k^{\,}\in \{-1,1\}$$
.
Define the Lagrange function (Lagrangian)

$$\displaystyle \begin{aligned} \mathbb{L}&=\mathbb{L}(\mathbf{w},b,\mathbf{e};{\boldsymbol \alpha})\\ &=\frac 12 \|\mathbf{w}\|{}_2^2+\frac{C}2\sum_{k=1}^N e_k^2-\sum_{k=1}^N \alpha_k \left [y_k^{\,}\left ({\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b\right )-1+e_k\right ],{} \end{aligned} $$
(8.4.40)
where α k are Lagrange multipliers. Different from Lagrange multipliers in SVM with inequality constraints, in LS-SVM, Lagrange multipliers α k can be either positive or negative due to the equality constraints used.
Based on the KKT conditions, one has the optimality conditions of (8.4.40) as follows [11]:

$$\displaystyle \begin{aligned} \nabla_{\mathbf{w}}\mathbb{L}=\frac{\partial \mathbb{L}}{\partial \mathbf{w}}=\mathbf{0}&~~\Rightarrow~~ \mathbf{w}=\sum_{k=1}^N \alpha_ky_k^{\,}{\boldsymbol \phi}({\mathbf{x}}_k) \Rightarrow \mathbf{w}=\mathbf{Z}{\boldsymbol \alpha},{} \end{aligned} $$
(8.4.41)

$$\displaystyle \begin{aligned} \nabla_b \mathbb{L}=\frac{\partial \mathbb{L}}{\partial b}=0&~~\Rightarrow~~ \sum_{k=1}^N\alpha_k y_k^{\,}=0\Rightarrow {\mathbf{y}}^T{\boldsymbol \alpha}=0,{} \end{aligned} $$
(8.4.42)

$$\displaystyle \begin{aligned} \nabla_e \mathbb{L}=\frac{\partial \mathbb{L}}{\partial e_k^{\,}}=0&~~\Rightarrow~~ \alpha_k=Ce_k, k=1,\ldots ,N\Rightarrow {\boldsymbol \alpha}=C\mathbf{ e},{} \end{aligned} $$
(8.4.43)

$$\displaystyle \begin{aligned} \nabla_{\alpha_k}\mathbb{L}=\frac{\partial \mathbb{L}}{\partial \alpha_k}=0&~~\Rightarrow~~ y_k^{\,}\left ({\mathbf{w}}^T{\boldsymbol \phi}({\mathbf{x}}_k)+b\right ) -1+e_k=0, k=1,\ldots ,N,\\ &~~\Rightarrow~~ {\mathbf{Z}}^T\mathbf{w}+b\mathbf{y}+\mathbf{e}=\mathbf{1},{} \end{aligned} $$
(8.4.44)
where 
$$\mathbf {Z}=[y_1^{\,}{\boldsymbol \phi }({\mathbf {x}}_1),\ldots ,y_N^{\,}{\boldsymbol \phi }({\mathbf {x}}_N)]\hskip -0.35mm \in \hskip -0.35mm \mathbb {R}^{m\times N}$$
, 
$$\mathbf {y}=[y_1^{\,},\ldots ,y_N^{\,}]^T\hskip -0.25mm ,$$
1 = [1, …, 1]T , e = [e 1, …, e N]T and α = [α 1, …, α N]T.
The KKT equations (8.4.41)–(8.4.44) can be written as the following matrix  equation form [11]:
$$\displaystyle \begin{aligned} \begin{bmatrix} \mathbf{I} & \mathbf{0} & \mathbf{0} & -\mathbf{Z}\\ {\mathbf{0}}^T & 0 & {\mathbf{0}}^T & {\mathbf{y}}^T\\ \mathbf{0} & \mathbf{0} & C\mathbf{I} & -\mathbf{I}\\ {\mathbf{Z}}^T & \mathbf{y} & \mathbf{I} & \mathbf{0}\end{bmatrix}\begin{bmatrix} \mathbf{w}\\ b\\ \mathbf{e}\\ {\boldsymbol \alpha}\end{bmatrix}=\begin{bmatrix} \mathbf{0}\\ 0\\ \mathbf{0}\\ \mathbf{1}\end{bmatrix}. \end{aligned} $$
(8.4.45)
By eliminating w and e, the above KKT equations are simplified as
$$\displaystyle \begin{aligned} \begin{bmatrix} 0 & {\mathbf{y}}^T\\ \mathbf{y} & {\mathbf{Z}}^T\mathbf{Z}+C^{-1}\mathbf{I}\end{bmatrix}\begin{bmatrix} b\\ {\boldsymbol \alpha}\end{bmatrix}=\begin{bmatrix} 0\\ \mathbf{1}\end{bmatrix}. \end{aligned} $$
(8.4.46)
Mercer’s condition can be applied again to the matrix Z TZ to get [37]

$$\displaystyle \begin{aligned}{}[{\mathbf{Z}}^T\mathbf{Z}]_{ij}=y_i^{\,}y_j^{\,}{\boldsymbol \phi}^T ({\mathbf{x}}_i){\boldsymbol \phi}({\mathbf{x}}_j)=y_i^{\,}y_j^{\,}K( {\mathbf{x}}_i,{\mathbf{x}}_j). \end{aligned} $$
(8.4.47)
Given a training set 
$$\{({\mathbf {x}}_i^{\,} ,y_i^{\,} )|{\mathbf {x}}_i^{\,}\in \mathbb {R}^n,\,y_i^{\,} \in \{-1,1\},\, i=1,\ldots ,N\}$$
, constant C > 0, and the kernel function K(x i, x j), the LS-SVM binary classification algorithm performs the following learning step:
  • Construct the N × N matrix 
$$[{\mathbf {Z}}^T\mathbf {Z}]_{ij}=y_i^{\,}y_j^{\,}{\boldsymbol \phi }^T ({\mathbf {x}}_i){\boldsymbol \phi }(\mathbf { x}_j)=y_i^{\,}y_j^{\,}K( {\mathbf {x}}_i,{\mathbf {x}}_j)$$
.

  • Solve the KKT matrix equation (8.4.46) for 
$${\boldsymbol \alpha }=[\alpha _1^{\,} ,\ldots ,\alpha _N^{\,} ]^T$$
and b.

In the testing step, for a given test sample 
$$\mathbf {x}\in \mathbb {R}^n$$
, its decision is given by

$$\displaystyle \begin{aligned} \text{class of }\mathbf{x}=\mathrm{sign}\left (\sum_{j=1}^N\alpha_j y_j K(\mathbf{x},{\mathbf{x}}_j )+b\right ). \end{aligned} $$
(8.4.48)
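Because LS-SVM training reduces to the linear KKT system (8.4.46), it can be coded in a few lines. The following NumPy sketch is illustrative only: the function names, the RBF kernel, and the parameter values C and γ are assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def lssvm_train(X, y, C=10.0, gamma=0.5):
    # Omega_ij = y_i y_j K(x_i, x_j) + delta_ij / C, cf. (8.4.47)
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf_kernel(X, X, gamma=gamma) + np.eye(N) / C
    # KKT system (8.4.46): [[0, y^T], [y, Omega]] [b; alpha] = [0; 1]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_predict(X_train, y_train, alpha, b, X_test, gamma=0.5):
    # decision (8.4.48): sign(sum_j alpha_j y_j K(x, x_j) + b)
    K = rbf_kernel(X_test, X_train, gamma=gamma)
    return np.sign(K @ (alpha * y_train) + b)

Since the equality constraints make every α k proportional to an error e k, essentially all training points act as support vectors, in contrast with the sparse standard SVM.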

8.4.4 Proximal Support Vector Machine Binary Classifier

Due to the simplicity of their implementations, the least squares SVM (LS-SVM) and the proximal support vector machine (PSVM) have been widely used in binary classification applications.

In the standard SVM binary classifier, a plane midway between two parallel bounding planes is used to separate the two disjoint halfspaces, each of which contains points mostly of class 1 or 2. Similar to the LS-SVM, the key idea of PSVM [14] is that the separating hyperplanes are “proximal” planes rather than bounding planes. Proximal planes classify data points according to their proximity to one of the two separating planes, and the aim is to push the two proximal planes as far apart as possible. Different from LS-SVM, PSVM uses 
$$(\|\mathbf {w}\|{ }_2^2+b^2)$$
instead of 
$$\|\mathbf {w}\|{ }_2^2$$
as the objective function, which makes the optimization problem strongly convex and has little or no effect on the original optimization problem [22].

The primal optimization problem of linear PSVM is described by

$$\displaystyle \begin{aligned} &\min \Bigg \{ \mathbb{L}_{\mathrm{P}_{\mathrm{PSVM}}}(\mathbf{w},b,\xi_i^{\,})=\frac 12(\|\mathbf{w}\|{}_2^2+b^2)+\frac C2\sum_{i=1}^N\xi_i^2\Bigg \},{} \end{aligned} $$
(8.4.49)

$$\displaystyle \begin{aligned} &\text{subject to}~~ y_i^{\,}({\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b)=1-\xi_i,\quad  i=1,\ldots ,N. \end{aligned} $$
(8.4.50)

By Fung and Mangasarian [15], the PSVM formulation (8.4.49) can also be interpreted as a regularized least squares solution [38] of the system of linear equations 
$$y_i^{\,}({\mathbf {x}}_i^T\mathbf {w}+b)=1,i=1,\ldots ,N$$
, that is, finding an approximate solution (w, b) to 
$$y_i^{\,}({\mathbf {x}}_i^T\mathbf {w}+b)=1$$
with the least 2-norm of (w, b).

The corresponding dual unconstrained optimization problem is given by

$$\displaystyle \begin{aligned} \min~~ &\mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}(\mathbf{w},b,\xi_i^{\,},\alpha_i^{\,})\\ &=\frac 12(\|\mathbf{w}\|{}_2^2+b^2)+\frac C2\sum_{i=1}^N\xi_i^2-\sum_{i=1}^N \alpha_i^{\,}\left (y_i^{\,}({\mathbf{w}}^T{\mathbf{x}}_i^{\,}+b)-1+\xi_i^{\,}\right ).{} \end{aligned} $$
(8.4.51)
Using the first-order optimality conditions 
$$\frac {\partial \mathbb {L}}{\partial \mathbf {w}}=\mathbf {0},\, \frac {\partial \mathbb {L}}{\partial b}=0,\,\frac {\partial \mathbb {L}}{\partial \xi _i^{\,}}=0$$
, and 
$$\frac {\partial \mathbb {L}} {\partial \alpha _i^{\,}}=0$$
, and eliminating w and 
$$\xi _i^{\,}$$
, one gets the KKT equation of linear PSVM classifier in matrix form:

$$\displaystyle \begin{aligned} \left (C^{-1}\mathbf{I}+{\mathbf{Z}}^T\mathbf{Z} +\mathbf{yy}^T\right ){\boldsymbol \alpha}=\mathbf{1},{} \end{aligned} $$
(8.4.52)
and

$$\displaystyle \begin{aligned} b=\sum_{i=1}^N\alpha_i^{\,} y_i^{\,} .{}\end{aligned} $$
(8.4.53)
Here

$$\displaystyle \begin{aligned} \mathbf{Z}&=\left [y_1^{\,}{\mathbf{x}}_1^{\,},\ldots ,y_N^{\,}{\mathbf{x}}_N^{\,}\right ],{} \end{aligned} $$
(8.4.54)

$$\displaystyle \begin{aligned} \mathbf{y}&=[y_1^{\,},\ldots ,y_N^{\,}]^T,{} \end{aligned} $$
(8.4.55)

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}&=[\alpha_1^{\,},\ldots ,\alpha_N^{\,}]^T.{}\end{aligned} $$
(8.4.56)

Similar to LS-SVM, the training data x can be mapped from the input space 
$$\mathbb {R}^n$$
into the feature space via ϕ : x →ϕ(x). Hence, the nonlinear PSVM classifier still satisfies the KKT equation (8.4.52), now with 
$$\mathbf {Z}=[y_1^{\,}{\boldsymbol \phi }({\mathbf {x}}_1^{\,}),\ldots ,y_N^{\,}{\boldsymbol \phi } ({\mathbf {x}}_N^{\,})]$$
in place of Eq. (8.4.54).

Algorithm 8.1 shows the PSVM binary classification algorithm.

Algorithm 8.1 PSVM binary classification algorithm
(Training: form the N × N matrix with entries y iy jK(x i, x j), solve the KKT equation (8.4.52) for α, and compute b from (8.4.53); testing: classify a new sample x with the decision function (8.4.57).)
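As an illustration of these training and testing steps (a sketch with assumed function names, an assumed RBF kernel, and arbitrary parameters, not a reproduction of the algorithm listing), the PSVM binary classifier can be coded as:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def psvm_train(X, y, C=10.0, gamma=0.5):
    N = X.shape[0]
    ZtZ = np.outer(y, y) * rbf_kernel(X, X, gamma=gamma)   # [Z^T Z]_ij = y_i y_j K(x_i, x_j)
    M = np.eye(N) / C + ZtZ + np.outer(y, y)               # Eq. (8.4.52)
    alpha = np.linalg.solve(M, np.ones(N))
    b = alpha @ y                                          # Eq. (8.4.53)
    return alpha, b

def psvm_predict(X_train, y_train, alpha, b, X_test, gamma=0.5):
    K = rbf_kernel(X_test, X_train, gamma=gamma)
    return np.sign(K @ (alpha * y_train) + b)              # Eq. (8.4.57)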
The decision functions of binary SVM, LS-SVM, and PSVM classifiers have the same form

$$\displaystyle \begin{aligned} f(\mathbf{x})=\mathrm{sign}\left (\sum_{i=1}^N \alpha_i y_i K(\mathbf{x},{\mathbf{x}}_i)+b\right ),\end{aligned} $$
(8.4.57)
where y i is the corresponding target class label of the training data x i, α i is the Lagrange multiplier to be computed by the learning machines, and K(x, x i) is a suitable kernel function to be given by users.
The following points compare SVM, LS-SVM, and PSVM.
  • SVM, LS-SVM, and PSVM are originally proposed for binary classification.

  • LS-SVM and PSVM provide fast implementations of the traditional SVM. Both LS-SVM and PSVM use equality constraints instead of the inequality constraints of the traditional SVM, which results in a direct least squares solution and avoids quadratic programming.

It should be noted [22] that the Lagrange multipliers 
$$\alpha _i^{\,}$$
are proportional to the training errors 
$$\xi _i^{\,}$$
in LS-SVM, while in the conventional SVM, many Lagrange multipliers 
$$\alpha _i^{\,}$$
are typically equal to zero. As compared to the conventional SVM, sparsity is lost in LS-SVM; this is true for PSVM as well.

8.4.5 SVM-Recursive Feature Elimination

Decision functions that are simple weighted sums of the training patterns plus a bias are called linear discriminant functions [9]:

$$\displaystyle \begin{aligned} D(\mathbf{x})=\langle \mathbf{w},\mathbf{x}\rangle+b, \end{aligned} $$
(8.4.58)
where w is the weight vector and b is a bias value.
Construct the optimal hyperplane (w, b) such that

$$\displaystyle \begin{aligned} {\mathbf{w}}_{\mathrm{opt}}^T\mathbf{x}+b_{\mathrm{opt}}=0 \end{aligned} $$
(8.4.59)
which separates a set of training data (x 1, y 1), …, (x n, y n). Vectors x i such that y i(w Tx i + b) = 1 are termed support vectors. Hence, the constrained optimization problem for finding the optimal weight vector w can be described as

$$\displaystyle \begin{aligned} &\min_{\mathbf{w},b}\left \{ f(\mathbf{w},b)=\frac 12\|\mathbf{w}\|{}_2^2\right \}, \end{aligned} $$
(8.4.60)

$$\displaystyle \begin{aligned} &\text{subject to }y_i^{\,} ({\mathbf{x}}_i^T\mathbf{w}+b)\geq 1,\quad  i=1,\ldots ,n, \end{aligned} $$
(8.4.61)
where the inequality constraints ensure that every training point is classified correctly with functional margin at least 1.
The above constrained optimization can be rewritten as the unconstrained optimization in Lagrangian form:

$$\displaystyle \begin{aligned} \min_{\mathbf{w},b,{\boldsymbol \alpha}}\Bigg \{L(\mathbf{w},b,{\boldsymbol \alpha})=\frac 12\|\mathbf{w}\|{}_2^2 -\sum_{i=1}^n\alpha_i[y_i^{\,} ({\mathbf{x}}_i^T\mathbf{ w}+b)-1]\Bigg \}, \end{aligned} $$
(8.4.62)
where α = [α 1, …, α n]T is the vector of nonnegative Lagrange multipliers α i ≥ 0, i = 1, …, n.
The optimization conditions with respect to w and b are given by

$$\displaystyle \begin{aligned} &\frac{\partial L(\mathbf{w},b,{\boldsymbol \alpha})}{\partial \mathbf{w}}=\Bigg (\mathbf{w}-\sum_{i=1}^n \alpha_i^{\,} y_i^{\,} {\mathbf{x}}_i\Bigg )=0\Rightarrow \mathbf{w}=\sum_{i=1}^n \alpha_i^{\,} y_i^{\,} {\mathbf{x}}_i^{\,} , \end{aligned} $$
(8.4.63)

$$\displaystyle \begin{aligned} &\frac{\partial L(\mathbf{x},b,{\boldsymbol \alpha})}{\partial b}=\sum_{i=1}^n \alpha_i^{\,} y_i^{\,} =0, \end{aligned} $$
(8.4.64)
Substituting the above two equations into (8.4.62) yields [8]

$$\displaystyle \begin{aligned} &\min_{\boldsymbol \alpha}\Bigg \{ J({\boldsymbol \alpha})=\frac 12\sum_{i=1}^n\sum_{j=1}^n \alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} {\mathbf{x}}_i^T{\mathbf{x}}_j -\sum_{i=1}^n\alpha_i^{\,} \Bigg \} \\ &\text{subject to}~~~ \mathbf{0}\leq {\boldsymbol \alpha}\leq C\mathbf{1}\quad \text{and}\quad  {\boldsymbol \alpha}^T\mathbf{y}=0.{} \end{aligned} $$
(8.4.65)

In conclusion, when used for classification, SVMs separate a given set of binary labeled training data (x i, y i) with hyperplane (w, b) that is maximally distant from them. Such a hyperplane is known as “the maximal margin hyperplane.”

In nonlinear binary classification cases, Eq. (8.4.65) becomes

$$\displaystyle \begin{aligned} &\min_{\boldsymbol \alpha}\Bigg \{\frac 12\sum_{i=1}^n\sum_{j=1}^n \alpha_i^{\,}\alpha_j^{\,} y_i^{\,} y_j^{\,} {\boldsymbol \phi}^T({\mathbf{x}}_i){\boldsymbol \phi}({\mathbf{x}}_j )-\sum_{i=1}^n \alpha_i^{\,} \Bigg \}\\ &\text{subject to}~~~ \mathbf{0}\leq {\boldsymbol \alpha}\leq C\mathbf{1}\quad \text{and}\quad  {\boldsymbol \alpha}^T\mathbf{y}=0.{} \end{aligned} $$
(8.4.66)
The goal of the SVM-recursive feature elimination (SVM-RFE) algorithm, proposed by Guyon et al. [20], is to find a subset of size r among d variables (r < d) which maximizes the performance of the predictor, based on a backward sequential selection. Starting with all the features, one feature is removed at a time. The removed ith feature is the one whose removal minimizes the variation of 
$$\|\mathbf {w}\|{ }_2^2$$
[31]:

$$\displaystyle \begin{aligned} &\big |\|\mathbf{w}\|{}_2^2-\|{\mathbf{w}}^{(i)}\|{}_2^2\big |\\ &\quad  =\frac 12 \Bigg |\sum_{j=1}^d\sum_{k=1}^d \alpha_j^{\,}\alpha_k^{\,} y_j^{\,} y_k^{\,} K({\mathbf{x}}_j^{\,} ,{\mathbf{x}}_k^{\,} ) -\sum_{j=1}^d\sum_{k=1}^d \alpha_j^{(i)}\alpha_k^{(i)}y_j^{\,} y_k^{\,} K^{(i)}({\mathbf{x}}_j^{\,} ,{\mathbf{x}}_k)\Bigg |, \end{aligned} $$
(8.4.67)
where 
$$[{\mathbf {K}}^{(i)}]_{jk}=K_{jk}^{(i)}=\langle {\boldsymbol \phi }({\mathbf {x}}_j^{(i)}),{\boldsymbol \phi }({\mathbf {x}}_k^{(i)})\rangle $$
is the (j, k)th entry of the Gram matrix K (i)of the training data when variable i is removed, and 
$$\alpha _j^{(i)}$$
is the corresponding solution of (8.4.66). For the sake of simplicity, 
$$\alpha _j^{(i)}=\alpha _j$$
is usually taken even if a variable has been removed in order to reduce the computational complexity of the SVM-RFE algorithm.

Algorithm 8.2 shows the SVM-RFE algorithm that is an application of RFE using the weight magnitude as ranking criterion.

Algorithm 8.2 SVM-recursive feature elimination (SVM-RFE) algorithm [20]
(Starting with all the features, repeatedly train the SVM (8.4.66) on the surviving features, compute the ranking criterion (8.4.67) for each surviving feature, remove the feature with the smallest criterion, and record the order of elimination.)
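For a linear kernel the ranking criterion reduces to the weight magnitudes w i 2, so the recursive elimination can be sketched as follows (an illustrative sketch only; the function name and the use of scikit-learn's linear-kernel SVC and the value of C are assumptions):

import numpy as np
from sklearn.svm import SVC

def svm_rfe_ranking(X, y, C=1.0):
    """Return feature indices in the order they are eliminated (least important first)."""
    remaining = list(range(X.shape[1]))
    ranking = []
    while remaining:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()            # weight vector of the linear SVM
        worst = int(np.argmin(w ** 2))   # smallest w_i^2, i.e., smallest ranking criterion
        ranking.append(remaining.pop(worst))
    return ranking

For convenience, scikit-learn's sklearn.feature_selection.RFE with a linear-kernel SVC as estimator automates the same backward elimination.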

8.5 Support Vector Machine Multiclass Classification

A multiclass classifier is a function 
$$H:\mathbb {X}\to \mathbb {Y}$$
that maps an instance 
$$\mathbf {x}\in \mathbb {X} \ \left (\text{for example}, \mathbb {X}=\mathbb {R}^n\right )$$
into an element y of 
$$\mathbb {Y} \ \left (\text{for example}, y\in \{1,\ldots ,k\}\right )$$
.

8.5.1 Decomposition Methods for Multiclass Classification

A popular way to solve a k-class problem is to decompose it into a set of L binary classification problems. One-versus-one, one-versus-rest, and directed acyclic graph SVM methods are three of the most common decomposition approaches [21, 41, 48].

One-Against-All Method

The standard method for k-class SVM classification is to construct L = k binary classifiers 
$$\mathbb {C}_m,m=1,\ldots ,k$$
. The mth SVM is trained with all of the examples in the mth class as positive examples and all other examples as negative examples. Then, the weight vector w m for the mth model can be generated by any linear classifier. Such an SVM classification method is referred to as the one-against-all (OAA) or one-versus-rest (OVR) method [4, 27].

Let 
$$S=\{({\mathbf {x}}_1^{\,},y_1^{\,}),\ldots ,({\mathbf {x}}_N^{\,},y_N^{\,})\}$$
be a set of N training examples, and each example 
$${\mathbf {x}}_i^{\,}$$
be drawn from a domain 
$$\mathbb {X}\subseteq \mathbb {R}^n$$
and 
$$y_i^{\,}\in \{1,\ldots ,k\}$$
is the class of 
$${\mathbf {x}}_i^{\,}$$
.

The mth one-against-all classifier solves the following constrained optimization problem:

$$\displaystyle \begin{aligned} &\min_{{\mathbf{w}}_m,b_m,\xi_m}~ \Bigg \{\frac 12\|{\mathbf{w}}_m\|{}_2^2 +C\sum_{i=1}^N \xi_{m,i}\Bigg \} ,{} \end{aligned} $$
(8.5.1)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  {\mathbf{w}}_m^T{\boldsymbol \phi}({\mathbf{x}}_i^{\,})+b_m^{\,}\geq 1-\xi_{m,i}^{\,},~\mathrm{if}~y_i^{\,}=m,\\ &\qquad \quad \qquad  {\mathbf{w}}_m^T{\boldsymbol \phi}({\mathbf{x}}_i^{\,})+b_m^{\,}\leq -1+\xi_{m,i}^{\,},~\mathrm{if}~y_i^{\,}\neq m,\\ &\qquad \qquad \quad  \xi_{m,i}^{\,}\geq 0,\quad  i=1,\ldots ,N, \end{aligned} $$
(8.5.2)
where 
$${\mathbf {w}}_m^{\,}=[w_{m,1},\cdots ,w_{m,n}]^T,\,m=1,\ldots ,k$$
is the weight vector of the mth classifier, C is the penalty parameter, and 
$$\xi _{m,i}^{\,}$$
is the training error corresponding to the mth class and the ith training data 
$${\mathbf {x}}_i^{\,}$$
that is mapped to a higher-dimensional space by the function 
$${\boldsymbol \phi }({\mathbf {x}}_i^{\,})$$
.

Minimizing 
$$\frac 12\|{\mathbf {w}}_m^{\,}\|{ }_2^2$$
means maximizing the margin 
$$2/\|{\mathbf {w}}_m^{\,}\|$$
between two groups of data. When data are not linear separable, there is a penalty term 
$$C\sum _{i=1}^N\xi _{m,i}$$
which can reduce the training errors in the mth classifier. The basic concept behind one-against-all SVM is to search for a balance between the regularization term 
$$\frac 12\|{\mathbf {w}}_m^{\,}\|{ }_2^2$$
and the training error 
$$C\sum _{i=1}^N\xi _{m,i}$$
for the m classifiers.

By a “decision function” we mean a function f(x) whose sign represents the class assigned to data point x.

The solution of (8.5.1) gives k decision functions 
$${\mathbf {w}}_m^T{\boldsymbol \phi }(\mathbf {x})+b_m^{\,},m=1,\ldots ,k$$
. Hence, a new testing instance x is said to be in the class which has the largest value among k decision functions:

$$\displaystyle \begin{aligned} \text{class of }\mathbf{x}=\operatorname*{\text{arg}\ \text{max}}_{m=1,\ldots ,k}\left \{{\mathbf{w}}_m^T{\boldsymbol \phi}(\mathbf{x})+b_m\right \}. \end{aligned} $$
(8.5.3)
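A one-against-all classifier can be assembled from k binary SVMs as in (8.5.1)–(8.5.3). The sketch below is an illustration only: the function names, the RBF kernel, the parameter values, and the decision by the largest value of the k decision functions are taken directly from the description above, while classes is assumed to be the list of distinct labels appearing in y.

import numpy as np
from sklearn.svm import SVC

def ova_train(X, y, classes, C=1.0, gamma=0.5):
    # one binary SVM per class m: class m gets label +1, all other classes get -1
    return {m: SVC(kernel="rbf", C=C, gamma=gamma).fit(X, np.where(y == m, 1, -1))
            for m in classes}

def ova_predict(models, X_test):
    # Eq. (8.5.3): pick the class whose decision value w_m^T phi(x) + b_m is largest
    scores = np.column_stack([models[m].decision_function(X_test) for m in models])
    labels = np.array(list(models))
    return labels[np.argmax(scores, axis=1)]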
One-Against-One Method

The one-against-one (OAO) method [13, 25] is also known as the one-versus-one (OVO) method. This method constructs L = k(k − 1)∕2 binary classifiers by solving k(k − 1)∕2 binary classification problems [25]. Each binary classifier constructs a model with data from one class as positive and another class as negative. Since there are k(k − 1)∕2 combinations of two classes, one needs to construct k(k − 1)∕2 weight vectors: w 1,2, …, w 1,k;w 2,3, …, w 2,k;…;w k−1,k.

For training data from the ith and the jth classes, one solves the following binary classification problem:

$$\displaystyle \begin{aligned} &\min_{{\mathbf{w}}^{ij},b^{ij},\xi^{ij}}\left \{\frac 12({\mathbf{w}}^{ij})^T{\mathbf{w}}^{ij}+C\sum_{n=1}^N \xi_n^{ij}\right \}, \end{aligned} $$
(8.5.4)

$$\displaystyle \begin{aligned} &\text{subject to}\quad  ({\mathbf{w}}^{ij})^T{\boldsymbol \phi}({\mathbf{x}}_n)+b^{ij}\geq 1-\xi_n^{ij},\quad  \mathrm{if}~y_n^{\,} =i, \end{aligned} $$
(8.5.5)

$$\displaystyle \begin{aligned} &\hskip 17mm ({\mathbf{w}}^{ij})^T{\boldsymbol \phi}({\mathbf{x}}_n)+b^{ij}\leq -1+\xi_n^{ij},\quad  \mathrm{if}~y_n^{\,} =j, \end{aligned} $$
(8.5.6)

$$\displaystyle \begin{aligned} &\hskip 17mm \xi_n^{ij}\geq 0,\quad  n=1,\ldots ,N, \end{aligned} $$
(8.5.7)
where w ij is the binary classifier for the ith and the jth classes, b ij is a real parameter, and ξ ij is the training error corresponding to the ith and the jth classes.

After all k(k − 1)∕2 binary classifiers are constructed by solving the above binary problems for i, j = 1, …, k, the following voting strategy suggested in [25] can be used: if the decision function sign((w ij)Tϕ(x) + b ij) determines that x is in the ith class, then the vote for the ith class is increased by one; otherwise, the vote for the jth class is increased by one. Finally, x is predicted to be in the class with the largest vote. This voting approach is also called the “Max Wins” strategy. In case two classes have identical votes, though it may not be a good strategy, we simply select the one with the smaller index [21].
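The pairwise training and the “Max Wins” voting can be sketched as follows (an illustration with assumed names, kernel, and parameters; ties are broken toward the first, i.e., smaller-indexed, class by argmax):

import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def ovo_train(X, y, classes, C=1.0, gamma=0.5):
    models = {}
    for i, j in combinations(sorted(classes), 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = SVC(kernel="rbf", C=C, gamma=gamma).fit(
            X[mask], np.where(y[mask] == i, 1, -1))     # class i -> +1, class j -> -1
    return models

def ovo_predict(models, X_test, classes):
    classes = sorted(classes)
    votes = {m: np.zeros(len(X_test), dtype=int) for m in classes}
    for (i, j), clf in models.items():
        pred = clf.predict(X_test)
        votes[i] += (pred == 1)        # vote for class i
        votes[j] += (pred == -1)       # vote for class j
    tally = np.column_stack([votes[m] for m in classes])
    return np.array(classes)[np.argmax(tally, axis=1)]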

DAGSVM Method

The directed acyclic graph SVM method [27], simply called the DAGSVM method, has the same training phase as the one-against-one method (solving binary SVMs). However, in the testing phase, it uses a rooted binary directed acyclic graph which has internal nodes and leaves.

That is, the training phase of the DAGSVM method is the same as in the one-against-one method, solving k(k − 1)∕2 binary SVM classification problems, but the testing phase uses a different strategy: a rooted binary directed acyclic graph with k(k − 1)∕2 internal nodes and k leaves. Each internal node is a binary SVM classifier of the ith and jth classes.

A directed acyclic graph (DAG) is a graph whose edges have an orientation and no cycles.

Definition 8.7 (Decision Directed Acyclic Graph [27])

Given a space X and a set of Boolean functions 
$$\mathbb {F}=\{f: X\to \{0, 1\}\}$$
, the decision directed acyclic graphs (DDAGs) on k classes over 
$$\mathbb {F}$$
are functions which can be implemented using a rooted binary DAG with k leaves labeled by the classes where each of the L = k(k − 1)∕2 internal nodes is labeled with an element of 
$$\mathbb {F}$$
. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of k leaves. The ith node in layer j < k is connected to the ith and (i + 1)th node in the (j + 1)st layer.

For k-class classification, a rooted binary directed acyclic graph has k layers: the top layer has a single root node, the second layer has two nodes, and so on until the kth (i.e., final) layer has k nodes, so a rooted binary DAG has k leaves and 1 + 2 + ⋯ + (k − 1) = k(k − 1)∕2 internal nodes. Each internal node is a binary SVM of the ith and jth classes.

The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with k classes, k − 1 decision nodes will be evaluated in order to derive an answer.

Example 8.1

Given N test samples 
$$\{{\mathbf {x}}_i^{\,} ,y_i^{\,}\} ,i=1,\ldots ,N$$
with 
$${\mathbf {x}}_i^{\,}\in \mathbb {R}^n$$
and 
$$y_i^{\,}\in \{1,2,3,4\}$$
. The DDAG operates on the list {1, 2, 3, 4}, where each node eliminates one class from the list. Starting at the root node (1 versus 4) at the top layer, the binary decision function is evaluated. If the output is not Class 1, then Class 1 is eliminated from the list to get a new list {2, 3, 4}, and the binary decision of the two classes 2 versus 4 is made next. If the root node prefers Class 1, then Class 4 is removed from the list to yield a new list {1, 2, 3}, and the binary decision is made for the two classes 1 versus 3. Thus, the second layer has two nodes (2 vs 4) and (1 vs 3). Therefore, we go through such a path until the DDAG terminates, when only one class remains in the list.

Figure 8.2 shows the decision directed acyclic graphs (DDAG) of the above four classes [27].
Fig. 8.2

The decision directed acyclic graphs (DDAG) for finding the best class out of four classes

DDAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths [27].
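The DDAG testing phase is exactly the list-elimination procedure described above. Reusing the pairwise classifiers produced by the ovo_train sketch earlier in this section (with class labels assumed sorted in increasing order), it can be written as:

def ddag_predict_one(models, classes, x):
    """DDAG test phase for a single test point x (a 1-D NumPy array)."""
    remaining = sorted(classes)                    # e.g. [1, 2, 3, 4]
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]         # evaluate the node "i versus j"
        pred = models[(i, j)].predict(x.reshape(1, -1))[0]
        # the node prefers one of the two classes; the other is eliminated
        remaining.remove(j if pred == 1 else i)    # +1 was the training label of class i
    return remaining[0]

Exactly k − 1 nodes are evaluated per test point, in agreement with the discussion above.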

8.5.2 Least Squares SVM Multiclass Classifier

The previous subsection discussed the support vector machine binary classifier. In this subsection we discuss multi-class classification cases.

For the multiclass case with k labels, LS-SVM uses k output nodes in order to encode multiple classes, where y i,j denotes the output value of the jth output node for the training data x i [41]. The k outputs can be used to encode up to 2^k different classes. For the multiclass case, the primal optimization problem of LS-SVM can be represented as [41]
$$\displaystyle \begin{aligned} &\min_{{\mathbf{w}}_m,b_m,\xi_{m,i}}~\Bigg \{\frac 12\sum_{m=1}^k\|{\mathbf{w}}_m^{\,}\|{}_2^2+\frac C2\sum_{i=1}^N\sum_{m=1}^k\xi_{m,i}^2\Bigg \}\\ &\text{subject to}~~ y_i^{(m)}\left ({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,}({\mathbf{x}}_i^{\,})+b_m^{\,}\right )=1-\xi_{m,i}^{\,},\quad  m=1,\ldots ,k, \end{aligned} $$
(8.5.8)
for i = 1, …, N.
The Lagrange function in dual LS-SVM multiclass classifier is given by

$$\displaystyle \begin{aligned} \mathbb{L}_{\mathrm{D}}&=\frac 12\sum_{m=1}^k \|{\mathbf{w}}_m^{\,}\|{}^2+\frac C2\sum_{i=1}^N\sum_{m=1}^k \xi_{m,i}^2 \end{aligned} $$
(8.5.9)

$$\displaystyle \begin{aligned} &\quad  -\sum_{i=1}^N\sum_{m=1}^k \alpha_{m,i}\left [y_i^{(m)}\left ({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,}({\mathbf{x}}_i)+b_m^{\,} \right )-1+\xi_{m,i}^{\,}\right ] \end{aligned} $$
(8.5.10)
Similar to the LS-SVM solution to the binary classification, the conditions for optimality are given by

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}}}{\partial {\mathbf{w}}_m}=\mathbf{0}&~~\Rightarrow~~ {\mathbf{w}}_m=\sum_{i=1}^N \alpha_{m,i}y_i^{(m)} {\boldsymbol \phi}_m^{\,}({\mathbf{x}}_i)~~\Rightarrow~~ {\mathbf{w}}_m^{\,} ={\mathbf{Z}}_m^{\,}{\boldsymbol \alpha}_m^{\,},{} \end{aligned} $$
(8.5.11)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}}}{\partial b_m^{\,}}=0&~~\Rightarrow~~ \sum_{i=1}^N\alpha_{m,i} y_i^{(m)}=0~~\Rightarrow~~ {\boldsymbol \alpha}_m^T{\mathbf{y}}^{(m)}=0,{} \end{aligned} $$
(8.5.12)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}}}{\partial \xi_{m,i}^{\,}}=0&~~\Rightarrow~~ \alpha_{m,i}^{\,}=C\xi_{m,i}~~\Rightarrow~~ {\boldsymbol \alpha}_m^{\,} =C{\boldsymbol \xi}_m^{\,} ,{} \end{aligned} $$
(8.5.13)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}}}{\partial \alpha_{m,i}^{\,}}=0&~~\Rightarrow~~ y_i^{(m)}\left ({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,}({\mathbf{x}}_i) +b_m^{\,}\right )-1+\xi_{m,i}=0, \\ &~~\Rightarrow~~ {\mathbf{Z}}_m^T{\mathbf{w}}_m+b_m{\mathbf{y}}^{(m)}+{\boldsymbol \xi}_m^{\,} =\mathbf{1}.{} \end{aligned} $$
(8.5.14)
In the above equations:

$$\displaystyle \begin{aligned} {\mathbf{Z}}_m^{\,} &=[y_1^{(m)}{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_1^{\,} ),\ldots ,y_N^{(m)}{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_N^{\,} )],{} \end{aligned} $$
(8.5.15)

$$\displaystyle \begin{aligned} {\mathbf{w}}_m^{\,} &=[w_{m,1}^{\,} ,\ldots ,w_{m,N}^{\,}]^T, \end{aligned} $$
(8.5.16)

$$\displaystyle \begin{aligned} {\mathbf{y}}^{(m)}&=[y_1^{(m)},\ldots ,y_N^{(m)}]^T, \end{aligned} $$
(8.5.17)

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}_m^{\,} &=[\alpha_{m,1}^{\,},\ldots ,\alpha_{m,N}]^T, \end{aligned} $$
(8.5.18)

$$\displaystyle \begin{aligned} {\boldsymbol \xi}_m^{\,} &=[\xi_{m,1}^{\,} ,\ldots ,\xi_{m,N}^{\,} ]^T.{} \end{aligned} $$
(8.5.19)
Since each classifier is binary, for given 
$$y_i^{\,}\in \{1,\ldots ,k\}$$
, one has

$$\displaystyle \begin{aligned} y_i^{(m)}=\bigg \{\begin{array}{ll} +1,&~~y_i^{\,} =m;\\  &{}\\ -1,&~~y_i^{\,}\neq m.\end{array} \quad  (m=1,\ldots ,k;i=1,\ldots ,N)\end{aligned} $$
(8.5.20)
Equations (8.5.11)–(8.5.14), together with the definitions (8.5.15)–(8.5.19), can be rewritten as the KKT equation in matrix form:
$$\displaystyle \begin{aligned} \begin{bmatrix} 0 & ({\mathbf{y}}^{(m)})^T\\ {\mathbf{y}}^{(m)} & {\boldsymbol \varOmega}^{(m)}\end{bmatrix}\begin{bmatrix} b_m^{\,}\\ {\boldsymbol \alpha}_m^{\,}\end{bmatrix}=\begin{bmatrix} 0\\ \mathbf{1}\end{bmatrix}, \end{aligned} $$
(8.5.21)
where 
$${\boldsymbol \varOmega }^{(m)}=\left ({\mathbf {Z}}_m^T{\mathbf {Z}}_m+C^{-1}\mathbf {I}\right )$$
with the (i, j)th entries

$$\displaystyle \begin{aligned} {\boldsymbol \varOmega}_{ij}^{(m)}=y_i^{(m)}y_j^{(m)}K_m ({\mathbf{x}}_i^{\,} ,{\mathbf{x}}_j^{\,} )+C^{-1}\delta_{ij},\quad  i,j=1,\ldots ,N, \end{aligned} $$
(8.5.22)
where 
$$\delta _{ij}^{\,} =1$$
(if i = j) or 0 (otherwise); and 
$$K_m ({\mathbf {x}}_i^{\,} ,{\mathbf {x}}_j^{\,} )={\boldsymbol \phi }_m^T ({\mathbf {x}}_i^{\,} ) {\boldsymbol \phi }_m^{\,} ({\mathbf {x}}_j^{\,} )$$
is the kernel function of the mth SVM for multiclass classification.

Algorithm 8.3 shows the LS-SVM multiclass classification algorithm.

Algorithm 8.3 LS-SVM multiclass classification algorithm [41]
(For each class m = 1, …, k: form the matrix Ω (m) from (8.5.22) and solve the KKT equation (8.5.21) for α m and b m.)
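A compact NumPy sketch of this per-class training via (8.5.21)–(8.5.22) is given below. The function names, the RBF kernel, and the parameter values are illustrative assumptions, and the winner-takes-all decision over the k class outputs is an additional assumption of the sketch (analogous to (8.5.3)).

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def lssvm_multiclass_train(X, y, classes, C=10.0, gamma=0.5):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma=gamma)
    models = {}
    for m in classes:
        ym = np.where(y == m, 1.0, -1.0)                 # encoding (8.5.20)
        Omega = np.outer(ym, ym) * K + np.eye(N) / C     # Eq. (8.5.22)
        A = np.zeros((N + 1, N + 1))
        A[0, 1:], A[1:, 0], A[1:, 1:] = ym, ym, Omega    # KKT system (8.5.21)
        sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
        models[m] = (sol[0], sol[1:], ym)                # (b_m, alpha_m, y^(m))
    return models

def lssvm_multiclass_predict(models, X_train, X_test, gamma=0.5):
    Kt = rbf_kernel(X_test, X_train, gamma=gamma)
    scores = np.column_stack([Kt @ (alpha * ym) + b for (b, alpha, ym) in models.values()])
    return np.array(list(models))[np.argmax(scores, axis=1)]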

8.5.3 Proximal Support Vector Machine Multiclass Classifier

The primal constrained optimization problem of PSVM multiclass classifier is described by

$$\displaystyle \begin{aligned} &\min\Bigg \{\mathbb{L}_{\mathrm{P}_{\mathrm{PSVM}}}^{(m)}({\mathbf{w}}_m^{\,} ,b_m^{\,} ,\xi_{m,i}^{\,})=\frac 12(\|{\mathbf{w}}_m^{\,}\|{}_2^2+b_m^2)+ \frac C2\sum_{i=1}^N\xi_{m,i}^2\Bigg \}{} \end{aligned} $$
(8.5.23)
$$\displaystyle \begin{aligned} &\text{subject to}~~ y_i^{(m)}\left ({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,}({\mathbf{x}}_i^{\,})+b_m^{\,}\right )=1-\xi_{m,i}^{\,},\quad  i=1,\ldots ,N. \end{aligned} $$
(8.5.24)
The corresponding dual unconstrained optimization problem is given by

$$\displaystyle \begin{aligned} \min\quad  \mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}^{(m)}&=\frac 12(\|{\mathbf{w}}_m^{\,} \|{}_2^2+b_m^2)+\frac C2\sum_{i=1}^N\xi_{m,i}^2\\ &\quad  -\sum_{i=1}^N\alpha_{m,i}^{\,}\left (y_i^{(m)}({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_i^{\,})+b_m^{\,} )-1+\xi_{m,i}^{\,}\right ). {} \end{aligned} $$
(8.5.25)
From the optimality conditions we have

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}^{(m)}}{\partial {\mathbf{w}}_m^{\,}}=\mathbf{0}\quad &\Rightarrow\quad  {\mathbf{w}}_m^{\,} =\sum_{i=1}^N\alpha_{m,i}^{\,} y_i^{(m)}{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_i^{\,} ), \end{aligned} $$
(8.5.26)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}^{(m)}}{\partial b_m^{\,}} =0\quad &\Rightarrow\quad  b_m^{\,} =\sum_{i=1}^N\alpha_{m,i}^{\,} y_i^{(m)}, \end{aligned} $$
(8.5.27)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}^{(m)}}{\partial \xi_{m,i}}=0\quad &\Rightarrow\quad  \xi_{m,i}^{\,} =C^{-1}\alpha_{m,i}^{\,} , \end{aligned} $$
(8.5.28)

$$\displaystyle \begin{aligned} \frac{\partial \mathbb{L}_{\mathrm{D}_{\mathrm{PSVM}}}^{(m)}}{\partial \alpha_{m,i}}=0\quad &\Rightarrow\quad  y_i^{(m)}({\mathbf{w}}_m^T{\boldsymbol \phi}_m^{\,} (\mathbf{ x}_i^{\,}) +b_m^{\,} )-1+\xi_{m,i}^{\,}=0, \end{aligned} $$
(8.5.29)
for m = 1, …, k.
Eliminating 
$${\mathbf {w}}_m^{\,}$$
and 
$$\xi _{m,i}^{\,}$$
in the above equations yields the KKT equations

$$\displaystyle \begin{aligned} b_m^{\,} =\sum_{i=1}^N\alpha_{m,i}^{\,} y_i^{(m)}={\boldsymbol \alpha}_m^T{\mathbf{y}}_m^{\,} ,{} \end{aligned} $$
(8.5.30)
and

$$\displaystyle \begin{aligned} (C^{-1}\mathbf{I}+{\mathbf{Z}}_m^T{\mathbf{Z}}_m+{\mathbf{y}}_m^{\,}{\mathbf{y}}_m^T ){\boldsymbol \alpha}_m^{\,} =\mathbf{1},{} \end{aligned} $$
(8.5.31)
where

$$\displaystyle \begin{aligned} {\mathbf{Z}}_m&=\left [y_1^{(m)}{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_1^{\,} ),\ldots ,y_N^{(m)}{\boldsymbol \phi}_m^{\,} ({\mathbf{x}}_N^{\,} ) \right ],{} \end{aligned} $$
(8.5.32)

$$\displaystyle \begin{aligned} {\mathbf{y}}_m^{\,} &=[y_1^{(m)},\ldots ,y_N^{(m)}]^T,{} \end{aligned} $$
(8.5.33)

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}_m^{\,} &=[\alpha_{m,1}^{\,} ,\ldots ,\alpha_{m,N}^{\,} ]^T.{}\end{aligned} $$
(8.5.34)

Algorithm 8.4 shows the PSVM multiclass classification algorithm.

Algorithm 8.4 PSVM multiclass classification algorithm
(For each class m = 1, …, k: solve (8.5.31) for α m and compute b m from (8.5.30).)
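A corresponding sketch for the PSVM multiclass training via (8.5.30)–(8.5.31) follows, under the same assumptions as the LS-SVM multiclass sketch above (illustrative names, an assumed RBF kernel, and an assumed winner-takes-all prediction over the k class outputs):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def psvm_multiclass_train(X, y, classes, C=10.0, gamma=0.5):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma=gamma)
    models = {}
    for m in classes:
        ym = np.where(y == m, 1.0, -1.0)
        M = np.eye(N) / C + np.outer(ym, ym) * K + np.outer(ym, ym)   # Eq. (8.5.31)
        alpha = np.linalg.solve(M, np.ones(N))
        models[m] = (alpha, alpha @ ym, ym)                           # b_m = alpha_m^T y_m, Eq. (8.5.30)
    return models

def psvm_multiclass_predict(models, X_train, X_test, gamma=0.5):
    Kt = rbf_kernel(X_test, X_train, gamma=gamma)
    scores = np.column_stack([Kt @ (a * ym) + b for (a, b, ym) in models.values()])
    return np.array(list(models))[np.argmax(scores, axis=1)]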

8.6 Gaussian Process for Regression and Classification

In Chap. 6 we discussed machine learning based on traditional parametric models. The parametric models have a possible advantage in ease of interpretability, but for complex data sets, simple parametric models may lack expressive power, and their more complex counterparts (such as feedforward neural networks) may not be easy to work with in practice [29, 30]. The advent of kernel machines, such as Gaussian processes, sparse Bayesian learning, and the relevance vector machine, has opened the possibility of flexible models which are practical to work with.

In this section we deal with Gaussian process methods for regression and classification problems.

8.6.1 Joint, Marginal, and Conditional Probabilities

Let the n (discrete or continuous) random variables y 1, …, y n have a joint probability p(y 1, …, y n), or p(y) for short. Technically, one ought to distinguish between probabilities (for discrete variables) and probability densities for continuous variables. Throughout the book we commonly use the term “probability” to refer to both. Let us partition the variables in y into two groups, 
$${\mathbf {y}}_A^{\,}$$
and 
$${\mathbf {y}}_B^{\,}$$
, where A ∪ B = {1, …, n} and A ∩ B = ∅, so that 
$$p(\mathbf {y})=p({\mathbf {y}}_A^{\,} ,{\mathbf {y}}_B^{\,} )$$
. Each group may contain one or more variables. The marginal probability of 
$${\mathbf {y}}_A^{\,}$$
, denoted by 
$$p({\mathbf {y}}_A^{\,} )$$
, is given by

$$\displaystyle \begin{aligned} p({\mathbf{y}}_A^{\,} )=\int p({\mathbf{y}}_A^{\,} ,{\mathbf{y}}_B^{\,} )d {\mathbf{y}}_B^{\,} . \end{aligned} $$
(8.6.1)
The integral is replaced by a sum if the variables are discrete valued. Notice that if the set A contains more than one variable, then the marginal probability is itself a joint probability of these variables. If the joint distribution is equal to the product of the marginals, then the variables are said to be independent, otherwise they are dependent.
The conditional probability is defined as

$$\displaystyle \begin{aligned} p({\mathbf{y}}_A^{\,} |{\mathbf{y}}_B^{\,} )=\frac{p({\mathbf{y}}_A^{\,} ,{\mathbf{y}}_B^{\,} )}{p({\mathbf{y}}_B^{\,} )} \end{aligned} $$
(8.6.2)
defined for 
$$p({\mathbf {y}}_B^{\,} )>0$$
, as it is not meaningful to condition on an impossible event. If 
$${\mathbf {y}}_A^{\,} $$
and 
$${\mathbf {y}}_B^{\,} $$
are independent, then the marginal 
$$p({\mathbf {y}}_A^{\,} )$$
and the conditional 
$$p({\mathbf {y}}_A^{\,} |{\mathbf {y}}_B^{\,} )$$
are equal.
Using the definitions of both 
$$p({\mathbf {y}}_A^{\,} |{\mathbf {y}}_B^{\,} )$$
and 
$$p({\mathbf {y}}_B^{\,} |{\mathbf {y}}_A^{\,} )$$
we obtain Bayes theorem

$$\displaystyle \begin{aligned} p({\mathbf{y}}_A^{\,} |{\mathbf{y}}_B^{\,} )=\frac{p({\mathbf{y}}_A^{\,} )p({\mathbf{y}}_B^{\,} |{\mathbf{y}}_A^{\,} )}{p({\mathbf{y}}_B^{\,} )}.{} \end{aligned} $$
(8.6.3)

8.6.2 Gaussian Process

A random variable x with the normal distribution 
$$p(x)=(2\pi \sigma ^2)^{-1/2}\exp \left (-\frac {|x-\mu |{ }^2}{2\sigma ^2}\right )$$
is called the univariate Gaussian distribution, and is denoted as x ∼ N(μ, σ 2), where μ = E{x} and σ 2 = var(x) are its mean and variance, respectively.

A multivariate Gaussian (or normal) distribution has a joint probability density given by

$$\displaystyle \begin{aligned} p(\mathbf{x}|{\boldsymbol \mu},{\boldsymbol \Sigma})=(2\pi )^{-N/2}|{\boldsymbol \Sigma}|{}^{-1/2} \exp \left (-\frac 12 (\mathbf{x}-{\boldsymbol \mu})^T{\boldsymbol \Sigma}^{-1} (\mathbf{x}-{\boldsymbol \mu})\right ), \end{aligned} $$
(8.6.4)
where 
$${\boldsymbol \mu }=[\mu _1,\ldots ,\mu _N^{\,} ]^T$$
with μ i = E{x i} is the mean vector (of length N) of x and Σ is the N × N (symmetric, positive definite) covariance matrix of x. The multivariate Gaussian distribution is denoted simply as x ∼ N(μ, Σ).
If letting x and y be jointly Gaussian random vectors
$$\displaystyle \begin{aligned} \begin{bmatrix} \mathbf{x}\\ \mathbf{y}\end{bmatrix}\sim N\left (\begin{bmatrix} {\boldsymbol \mu}_x\\ {\boldsymbol \mu}_y\end{bmatrix},\begin{bmatrix} \mathbf{A} & \mathbf{C}\\ {\mathbf{C}}^T & \mathbf{B}\end{bmatrix}\right )=N\left (\begin{bmatrix} {\boldsymbol \mu}_x\\ {\boldsymbol \mu}_y\end{bmatrix},\begin{bmatrix} \tilde{\mathbf{A}} & \tilde{\mathbf{C}}\\ \tilde{\mathbf{C}}^T & \tilde{\mathbf{B}}\end{bmatrix}^{-1}\right ), \end{aligned} $$
(8.6.5)
then the marginal distribution of x and the conditional distribution of x given y are
$$\displaystyle \begin{aligned} \mathbf{x}\sim N({\boldsymbol \mu}_x,\mathbf{A})\quad \text{and}\quad \mathbf{x}|\mathbf{y}\sim N\left ({\boldsymbol \mu}_x+\mathbf{C}{\mathbf{B}}^{-1}(\mathbf{y}-{\boldsymbol \mu}_y),\,\mathbf{A}-\mathbf{C}{\mathbf{B}}^{-1}{\mathbf{C}}^T\right ), \end{aligned} $$
(8.6.6)
or
$$\displaystyle \begin{aligned} \mathbf{x}|\mathbf{y}\sim N\left ({\boldsymbol \mu}_x-\tilde{\mathbf{A}}^{-1}\tilde{\mathbf{C}}(\mathbf{y}-{\boldsymbol \mu}_y),\,\tilde{\mathbf{A}}^{-1}\right ). \end{aligned} $$
(8.6.7)

A Gaussian process is a natural generalization of multivariate Gaussian distribution.

Definition 8.8 (Gaussian Process [29, 30])

A Gaussian process f(x) is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions.

Clearly, the Gaussian distribution is over vectors, whereas the Gaussian process is over functions f(x) of the input vector x.

A scalar Gaussian process f(x) can be represented as

$$\displaystyle \begin{aligned} f~\sim~\mathrm{GP}(\mu ,K),{} \end{aligned} $$
(8.6.8)
where μ(x) = E{f(x)} is the mean function and K(x, x′) is the covariance function of f(x). Equation (8.6.8) means [29]: “the function f is distributed as a GP with mean function μ and covariance function K.”
A vector-valued Gaussian process f(x) is completely specified by its mean function and covariance function

$$\displaystyle \begin{aligned} {\boldsymbol \mu}(\mathbf{x})&=E\{\mathbf{f}(\mathbf{x})\}, \end{aligned} $$
(8.6.9)

$$\displaystyle \begin{aligned} K({\mathbf{x}}_i,{\mathbf{x}}_j)&=E\{(\mathbf{f}({\mathbf{x}}_i)-{\boldsymbol \mu}({\mathbf{x}}_i))^T(\mathbf{f}({\mathbf{x}}_j)-{\boldsymbol \mu}({\mathbf{x}}_j))\}, \end{aligned} $$
(8.6.10)
and the Gaussian process is expressed as

$$\displaystyle \begin{aligned} \mathbf{f}(\mathbf{x})\,\sim\,\mathrm{GP}({\boldsymbol \mu},{\boldsymbol \Sigma}),{} \end{aligned} $$
(8.6.11)
where the (i, j)th elements of the covariance matrix Σ are defined as Σij = K(x i, x j).

The covariance function K(x i, x j) (also known as kernel, kernel function, or covariance kernel) is the driving factor in Gaussian processes for regression and/or classification. Actually, the kernel represents the particular structure present in the data 
$${\mathbf {x}}_1,\ldots ,{\mathbf {x}}_N^{\,}$$
being modeled. One of the main difficulties in applying Gaussian processes is to construct such a kernel.

According to Mercer’s theorem (Theorem 8.3), the nonlinear kernel function is usually constructed by

$$\displaystyle \begin{aligned} K({\mathbf{x}}_i,{\mathbf{x}}_j)={\boldsymbol \phi}^T({\mathbf{x}}_i^{\,} ){\boldsymbol \phi}({\mathbf{x}}_j^{\,} ). \end{aligned} $$
(8.6.12)
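As a concrete (and assumed) choice of covariance function, the sketch below builds the squared-exponential kernel and draws sample functions from the zero-mean GP prior; the kernel choice, the length-scale, and the helper name se_kernel are illustrative assumptions, since the text leaves K generic.

import numpy as np

def se_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    # squared-exponential covariance K(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 ell^2))
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

x = np.linspace(-5.0, 5.0, 100).reshape(-1, 1)
K = se_kernel(x, x)
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))   # small jitter keeps K positive definite
rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(x), 3))       # three draws from f ~ GP(0, K)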

8.6.3 Gaussian Process Regression

Let {(x n, f n)| n = 1, …, N} be the training samples, where f n, n = 1, …, N, are the noise-free training outputs for x n, typically continuous values for regression and discrete values for classification.

Denote the noise-free training output vector 
$$\mathbf {f}=[f_1^{\,} ,\ldots ,f_N^{\,} ]^T$$
and the test output vector 
$${\mathbf {f}}_*=[f_1^*,\ldots , f_{N_*}^*]^T$$
given the test samples 
$${\mathbf {x}}_1^*,\ldots ,{\mathbf {x}}_{N_*}^*$$
. Under the assumption of Gaussian distributions f ∼ N(0, K(X, X)) and f  ∼ N(0, K(X , X )), the joint distribution of f and f is given by
$$\displaystyle \begin{aligned} \begin{bmatrix} \mathbf{f}\\ {\mathbf{f}}_*\end{bmatrix}\sim N\left (\mathbf{0},\begin{bmatrix} \mathbf{K}(\mathbf{X},\mathbf{X}) & \mathbf{K}(\mathbf{X},{\mathbf{X}}_*)\\ \mathbf{K}({\mathbf{X}}_*,\mathbf{X}) & \mathbf{K}({\mathbf{X}}_*,{\mathbf{X}}_*)\end{bmatrix}\right ), \end{aligned} $$
(8.6.13)
where 
$$\mathbf {X}=[{\mathbf {x}}_1^{\,} ,\ldots ,{\mathbf {x}}_N^{\,} ]$$
and 
$${\mathbf {X}}_*=[{\mathbf {x}}_1^*,\ldots ,{\mathbf {x}}_{N_*}^*]$$
are the training sample matrix (of size n × N) and the test sample matrix (of size n × N ∗), respectively; and 
$$\mathbf {K}(\mathbf {X},\mathbf {X})=\mathrm {cov}(\mathbf {X})\in \mathbb {R}^{N \times N}$$
, 
$$\mathbf {K}(\mathbf {X},{\mathbf {X}}_*)=\mathrm {cov}(\mathbf {X},{\mathbf {X}}_*)\in \mathbb {R}^{N\times N_*}$$
, 
$$\mathbf {K}({\mathbf {X}}_*,\mathbf {X})=\mathrm {cov} ({\mathbf {X}}_*,\mathbf {X})={\mathbf {K}}^T (\mathbf {X},{\mathbf {X}}_*)\in \mathbb {R}^{N_*\times N}$$
, and 
$$\mathbf {K}({\mathbf {X}}_*,{\mathbf {X}}_*)=\mathrm {cov}(\mathbf { X}_*)\in \mathbb {R}^{N_*\times N_*}$$
are the covariance function matrices, respectively.
From Eq. (8.6.6) it is known that the conditional distribution of f ∗ given f can be expressed as

$$\displaystyle \begin{aligned} {\mathbf{f}}_*|\mathbf{f}~\sim~ N\big (\mathbf{K}({\mathbf{X}}_*,\mathbf{X}){\mathbf{K}}^{-1}(\mathbf{X},\mathbf{X})\mathbf{f},\mathbf{K}({\mathbf{X}}_*,{\mathbf{X}}_*)-\mathbf{K}({\mathbf{X}}_*,\mathbf{X}) {\mathbf{K}}^{-1}(\mathbf{X},\mathbf{X})\mathbf{K}(\mathbf{X},{\mathbf{X}}_*)\big ). \end{aligned} $$
(8.6.14)
This is just the posterior distribution for a specific set of test cases.
The goal of Gaussian process regression is to find f ∗ by maximizing the posterior probability p(f ∗|f). However, f is unobservable. In practical applications, the noisy training output vector y = f + e is observed, with the additive Gaussian white noise 
$$\mathbf {e}\sim N(\mathbf {0},\sigma _n^2\mathbf {I})$$
. In this case, y ∼ N(0, cov(y)) with

$$\displaystyle \begin{aligned} \mathrm{cov}(y_i,y_j)=K({\mathbf{x}}_i,{\mathbf{x}}_j)+\sigma_n^2\delta_{ij}\quad \text{or}\quad  \mathbf{y}\sim N\big (\mathbf{0},\mathbf{K}(\mathbf{X},\mathbf{X})+ \sigma_n^2\mathbf{I}\big ). \end{aligned} $$
(8.6.15)
Then, the joint distribution of y and f_* according to the prior is given by
$$\displaystyle \begin{aligned} \begin{bmatrix}\mathbf{y}\\ {\mathbf{f}}_*\end{bmatrix}~\sim~ N\left (\mathbf{0},\begin{bmatrix}\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I} &amp; \mathbf{K}(\mathbf{X},{\mathbf{X}}_*)\\ \mathbf{K}({\mathbf{X}}_*,\mathbf{X}) &amp; \mathbf{K}({\mathbf{X}}_*,{\mathbf{X}}_*)\end{bmatrix}\right ). \end{aligned} $$
(8.6.16)
From Eq. (8.6.6) we get the conditional distribution of f_* given y below:

$$\displaystyle \begin{aligned} {\mathbf{f}}_*|\mathbf{y}~\sim~ N\big (\bar{\mathbf{f}}_*,\mathrm{cov}({\mathbf{f}}_*)\big ), \end{aligned} $$
(8.6.17)
where

$$\displaystyle \begin{aligned} \bar{\mathbf{f}}_*&amp;=E\{{\mathbf{f}}_*|\mathbf{y}\}=\mathbf{K}({\mathbf{X}}_*,\mathbf{X})\big [\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{I}\big ]^{-1}\mathbf{ y},{} \end{aligned} $$
(8.6.18)

$$\displaystyle \begin{aligned} \mathrm{cov}({\mathbf{f}}_*)&amp;=\mathbf{K}({\mathbf{X}}_*,{\mathbf{X}}_*)-\mathbf{K}({\mathbf{X}}_*,\mathbf{X})\big [\mathbf{K}(\mathbf{X},\mathbf{X})+\sigma_n^2\mathbf{ I}\big ]^{-1} \mathbf{K}(\mathbf{X},{\mathbf{X}}_*).{} \end{aligned} $$
(8.6.19)
Here, 
$$\bar {\mathbf {f}}_*$$
is the mean vector over the N_* test points, i.e., 
$$\bar {\mathbf {f}}_*=\bar {\mathbf {f}}({\mathbf {x}}_1^*,\ldots ,{\mathbf {x}}_{N_*}^{*} )$$
.
If there is only one test point x_*, i.e., X_* = x_*, then K(X, X_*) reduces to the vector k_* = k(x_*) = K(X, x_*), which denotes the vector of covariances between the test point x_* and the N training points 
$${\mathbf {x}}_1^{\,} , \ldots ,{\mathbf {x}}_N^{\,}$$
in X. In this case, Eqs. (8.6.18) and (8.6.19) reduce to

$$\displaystyle \begin{aligned} \bar{\mathbf{f}}_*&amp;={\mathbf{k}}_*^T (\mathbf{K}+\sigma_n^2\mathbf{I})^{-1}\mathbf{y},{} \end{aligned} $$
(8.6.20)

$$\displaystyle \begin{aligned} \mathrm{cov}({\mathbf{f}}_*)&amp;=K({\mathbf{x}}_*,{\mathbf{x}}_*)-{\mathbf{k}}_*^T (\mathbf{K}+\sigma_n^2\mathbf{I})^{-1}{\mathbf{k}}_*.{} \end{aligned} $$
(8.6.21)
Letting 
$${\boldsymbol \alpha }=(\mathbf {K}+\sigma _n^2\mathbf {I})^{-1}\mathbf {y}$$
, then the mean centered on a test point x_*, denoted by 
$$\bar {\mathbf {f}}_* =\bar {\mathbf {f}}({\mathbf {x}}_*)$$
, is given by

$$\displaystyle \begin{aligned} \bar{\mathbf{f}}({\mathbf{x}}_*)=\sum_{i=1}^N \alpha_i K({\mathbf{x}}_i,{\mathbf{x}}_*)={\mathbf{k}}_*^T {\boldsymbol \alpha}.{} \end{aligned} $$
(8.6.22)

When using Eqs. (8.6.20) and (8.6.21) to compute directly 
$$\bar {\mathbf {f}}_*$$
and var(f_*), one needs to invert the matrix 
$$\mathbf {K}+\sigma _n^2\mathbf {I}$$
. A faster and numerically more stable approach is to use the Cholesky decomposition.

Let the Cholesky decomposition of 
$$\mathbf {K}+\sigma _n^2\mathbf {I}$$
be given by 
$$(\mathbf {K}+\sigma _n^2\mathbf {I})=\mathbf {LL}^T$$
, where L is a lower triangular matrix. Then, 
$${\boldsymbol \alpha }=(\mathbf {K}+\sigma _n^2\mathbf {I})^{-1}\mathbf {y}$$
becomes α = (L T)−1L −1y. Denoting z = L −1y and α = (L T)−1z, then

$$\displaystyle \begin{aligned} \mathbf{z}={\mathbf{L}}^{-1}\mathbf{y}~&amp;\Leftrightarrow~\mathbf{L}\mathbf{z}=\mathbf{y}, \end{aligned} $$
(8.6.23)

$$\displaystyle \begin{aligned} {\boldsymbol \alpha}=({\mathbf{L}}^T)^{-1}\mathbf{z}~&amp;\Leftrightarrow~{\mathbf{L}}^T{\boldsymbol \alpha}=\mathbf{z}. \end{aligned} $$
(8.6.24)
The above two equations suggest an efficient algorithm for finding α given K and y as follows:
  1. 1.

    Make Cholesky decomposition: 
$$(\mathbf {K}+\sigma _n^2\mathbf {I})=\mathbf {LL}^T$$
.

     
  2. 2.

    Solve the triangular system Lz = y for z by forward substitution.

     
  3. 3.

    Solve the triangular system L Tα = z for α by back substitution.

     

Similarly, we have 
$${\mathbf {k}}_*^T (\mathbf {K}+\sigma _n^2\mathbf {I})^{-1}{\mathbf {k}}_*={\mathbf {k}}_*^T({\mathbf {L}}^T)^{-1}{\mathbf {L}}^{-1}{\mathbf {k}}_*={\mathbf {v}}^T\mathbf {v}$$
, where v = L −1k_*, i.e., Lv = k_*. This implies that v is the solution of the triangular system Lv = k_*.

Once α and v are found, one can use Eqs. (8.6.20) and (8.6.21) to get 
$$\bar {\mathbf {f}}({\mathbf {x}}_*)=\mathbf { k}_*^T {\boldsymbol \alpha }$$
and var(f_*) = K(x_*, x_*) − v Tv, respectively.
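As a small illustration of these three steps, a sketch using SciPy's triangular solvers might look as follows; the function name and argument conventions are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_via_cholesky(K, y, k_star, sigma_n2):
    """Steps 1-3 above: obtain alpha = (K + sigma_n^2 I)^{-1} y and v = L^{-1} k_*
    by triangular solves instead of an explicit matrix inversion (sketch only)."""
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(len(y)))   # K + sigma_n^2 I = L L^T
    z = solve_triangular(L, y, lower=True)          # forward substitution: L z = y
    alpha = solve_triangular(L.T, z, lower=False)   # back substitution: L^T alpha = z
    v = solve_triangular(L, k_star, lower=True)     # forward substitution: L v = k_*
    return alpha, v
```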

The marginal likelihood is the integral of the likelihood times the prior:

$$\displaystyle \begin{aligned} p(\mathbf{y}|\mathbf{X})=\int p(\mathbf{y}|\mathbf{f},\mathbf{X})p(\mathbf{f}|\mathbf{X})d\mathbf{f}. \end{aligned} $$
(8.6.25)
The term “marginal likelihood” refers to the marginalization over the function values f. Under the Gaussian process model, the prior is Gaussian, f|X ∼ N(0, K(X, X)), or

$$\displaystyle \begin{aligned} \log p(\mathbf{f}|\mathbf{X})=-\frac 12{\mathbf{f}}^T{\mathbf{K}}^{-1}\mathbf{f}-\frac 12 \log |\mathbf{K}|-\frac N2 \log (2\pi ).{} \end{aligned} $$
(8.6.26)
Because y = f + e with 
$$\mathbf {e}\sim N(\mathbf {0},\sigma _n^2\mathbf {I})$$
, the log marginal likelihood is

$$\displaystyle \begin{aligned} \log p(\mathbf{y}|\mathbf{X})=-\frac 12{\mathbf{y}}^T\big [\mathbf{K}+\sigma_n^2\mathbf{I}\big ]^{-1}\mathbf{y}-\frac 12 \log \big |\mathbf{K}+\sigma_n^2\mathbf{ I}\big | -\frac N2 \log (2\pi ). \end{aligned} $$
(8.6.27)
Here, the first term is a data fitting term as it is the only one involving the observed targets y. The second term 
$$\log |\mathbf {K}+ \sigma _n^2\mathbf {I}|/2$$
is a model complexity penalty depending only on the covariance function and the inputs. The last term 
$$N\log (2\pi )/2$$
is just a normalization constant.

Algorithm 8.5 summarizes the Gaussian process regression (GPR) algorithm of [30].

Algorithm 8.5 Gaussian process regression algorithm [30]
(Inputs: X, y, k(·, ·), σ_n^2, x_*. Compute the Cholesky factorization K + σ_n^2I = LL^T; solve Lz = y and L^Tα = z for α and Lv = k_* for v; return the predictive mean k_*^Tα, the predictive variance K(x_*, x_*) − v^Tv, and the log marginal likelihood (8.6.27).)
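Based on Eqs. (8.6.20), (8.6.21), and (8.6.27), a minimal Python sketch of this prediction procedure for a single test point might look as follows; the function name gpr_predict, the row-wise storage of the training inputs, and the kernel argument are illustrative assumptions rather than part of the original algorithm listing:

```python
import numpy as np
from scipy.linalg import solve_triangular

def gpr_predict(X, y, x_star, kernel, sigma_n2):
    """GPR prediction at a single test point x_star via the Cholesky factorization,
    following Eqs. (8.6.20), (8.6.21), and (8.6.27).  X is assumed to be an
    (N, d) array whose rows are the training inputs (a storage convention only)."""
    N = len(y)
    K = kernel(X, X)                                # K(X, X)
    k_star = kernel(X, x_star[None, :]).ravel()     # k_*: covariances with x_*
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(N))
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
    v = solve_triangular(L, k_star, lower=True)
    mean = k_star @ alpha                           # Eq. (8.6.20)
    var = kernel(x_star[None, :], x_star[None, :])[0, 0] - v @ v   # Eq. (8.6.21)
    # Eq. (8.6.27), using log|K + sigma_n^2 I| = 2 * sum(log diag(L))
    log_ml = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
    return mean, var, log_ml
```

For instance, gpr_predict(X, y, x_star, rbf_kernel, 0.1) could be called with the RBF kernel sketched earlier and an assumed noise variance of 0.1.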

8.6.4 Gaussian Process Classification

Both regression and classification can be viewed as function approximation problems, but their solutions are rather different. This is because in regression the targets are continuous and the likelihood function is Gaussian; a Gaussian process prior combined with a Gaussian likelihood gives rise to a posterior Gaussian process over functions, and everything remains analytically tractable [30]. In classification models, however, the targets are discrete class labels, so a Gaussian likelihood is no longer appropriate; exact inference is not feasible, and only approximate inference is available.

Inference is naturally divided into two steps [30]:
  1. 1.
    Computing the distribution of the latent variable corresponding to a test case:
    
$$\displaystyle \begin{aligned} p({\mathbf{f}}_*|\mathbf{X},\mathbf{y},{\mathbf{x}}_*)=\int p({\mathbf{f}}_*|\mathbf{X},{\mathbf{x}}_*,\mathbf{f})p(\mathbf{f}|\mathbf{X},\mathbf{y})d\mathbf{ f},{}\end{aligned} $$
    (8.6.28)

    where p(f|X, y) = p(y|f)p(f|X)∕p(y|X) is the posterior over the latent variables.

     
  2. 2.
    Using this distribution over the latent f_* to produce a probabilistic prediction
    
$$\displaystyle \begin{aligned} \bar{\pi}_*=p({\mathbf{y}}_*=+1|\mathbf{X},\mathbf{y},{\mathbf{x}}_*)=\int \sigma ({\mathbf{f}}_*)p({\mathbf{f}}_*|\mathbf{X},\mathbf{y},{\mathbf{x}}_*)d\mathbf{ f}_*.{}\end{aligned} $$
    (8.6.29)
     

In classification, due to the discrete class labels, the posterior p(f|X, y) in (8.6.28) is non-Gaussian, which makes the integral in (8.6.28) analytically intractable. Similarly, Eq. (8.6.29) can be analytically intractable for certain sigmoid functions.

The non-Gaussian joint posterior p(f|X, y) can be approximated analytically with a Gaussian one. To this end, consider Laplace’s method that utilizes a Gaussian approximation

$$\displaystyle \begin{aligned} q(\mathbf{f}|\mathbf{X},\mathbf{y})= N(\mathbf{f}|\hat{\mathbf{f}},{\mathbf{A}}^{-1})~\propto~\exp\left (-\frac 12(\mathbf{f}-\hat{\mathbf{f}})^T\mathbf{A}(\mathbf{f}- \hat{\mathbf{f}})\right ) \end{aligned} $$
(8.6.30)
to the non-Gaussian posterior p(f|X, y) in the integral (8.6.28). In the above equation, 
$$\hat {\mathbf {f}}={\mathrm {arg}\ \mathrm {max}}_{\mathbf { f}} \,p(\mathbf {f}|\mathbf {X},\mathbf {y})$$
, and 
$$\mathbf {A}=-\nabla ^2\log p(\mathbf {f}|\mathbf {X},\mathbf {y})|{ }_{\mathbf {f}=\hat {\mathbf {f}}}$$
is the Hessian matrix of the negative log posterior at that point.
By Bayes’ rule the posterior over the latent variables is given by p(f|X, y) = p(y|f)p(f|X)∕p(y|X), but as p(y|X) is independent of f, we need to consider only the unnormalized posterior when maximizing with respect to f, namely p(f|X, y) ∝ p(y|f)p(f|X). Taking its logarithm and using Eq. (8.6.26), we obtain

$$\displaystyle \begin{aligned} {\Psi}(\mathbf{f})&amp;=\log p(\mathbf{y}|\mathbf{f})+\log p(\mathbf{f}|\mathbf{X})\\ &amp;=\log p(\mathbf{y}|\mathbf{f})-\frac 12 {\mathbf{f}}^T{\mathbf{K}}^{-1}\mathbf{f}-\frac 12 \log |\mathbf{K}|-\frac N2\log (2\pi ) \end{aligned} $$
(8.6.31)
whose first and second derivatives with respect to f are, respectively, given by

$$\displaystyle \begin{aligned} \nabla {\Psi}(\mathbf{f})&amp;=\nabla \log p(\mathbf{y}|\mathbf{f})-{\mathbf{K}}^{-1}\mathbf{f},\\ \nabla^2 {\Psi}(\mathbf{f})&amp;=\nabla^2\log p(\mathbf{y}|\mathbf{f})-{\mathbf{K}}^{-1}. \end{aligned} $$
From ∇ Ψ(f) = 0 it follows that

$$\displaystyle \begin{aligned} \mathbf{f}=\mathbf{Ka},\quad  \mathbf{a}=\nabla \log p(\mathbf{y}|\mathbf{f}). \end{aligned} $$
(8.6.32)
The Hessian matrix can be written as

$$\displaystyle \begin{aligned} \nabla^2 \Psi (\mathbf{f})=-\mathbf{W}-{\mathbf{K}}^{-1},\quad  \mathbf{W}=-\nabla^2 \log p(\mathbf{y}|\mathbf{f}). \end{aligned} $$
(8.6.33)
Here, W is a diagonal matrix, since the distribution for y i depends only on f i, not on f j for j ≠ i.
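As an illustration of this Laplace approximation, the mode f̂ can be found by the Newton iteration f ← (K −1 + W)−1(Wf + ∇ log p(y|f)) that follows from the two derivatives above. The sketch below assumes a binary logistic likelihood with targets coded as t ∈ {0, 1} and uses explicit inverses for readability, rather than the numerically more careful formulation of [30]:

```python
import numpy as np
from scipy.special import expit   # logistic sigmoid

def laplace_mode(K, t, n_iter=20):
    """Newton iteration for the posterior mode f_hat under a logistic likelihood
    p(t_i = 1 | f_i) = sigmoid(f_i), targets t_i in {0, 1} (the +/-1 labels of the
    text map via t = (y + 1) / 2).  Plain sketch only."""
    N = len(t)
    f = np.zeros(N)
    K_inv = np.linalg.inv(K)                  # explicit inverse, for readability only
    W = np.zeros((N, N))
    for _ in range(n_iter):
        pi = expit(f)                         # sigmoid(f_i)
        grad = t - pi                         # nabla log p(t | f)
        W = np.diag(pi * (1.0 - pi))          # W = -nabla^2 log p(t | f), diagonal
        f = np.linalg.solve(K_inv + W, W @ f + grad)   # Newton step
    A = W + K_inv                             # Hessian of the negative log posterior
    return f, A                               # q(f | X, t) = N(f | f_hat, A^{-1})
```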

Gaussian processes are a main mathematical tool for sparse Bayesian learning and hence for the relevance vector machine discussed in the next section.

8.7 Relevance Vector Machine

In real-world data, the presence of noise (in regression) and class overlap (in classification) implies that the principal modeling challenge is to avoid “overfitting” of the training set. To avoid the overfitting of SVMs (caused by too many support vectors), Tipping [39] proposed a learning machine based on relevance vectors, called the relevance vector machine (RVM). “Sparse Bayesian learning” is the basis for the RVM.

8.7.1 Sparse Bayesian Regression

We are given a set of data 
$$\{{\mathbf {x}}_n^{\,} ,t_n^{\,}\}_{n=1}^N$$
, where the “target” samples 
$$t_n^{\,} = y({\mathbf {x}}_n^{\,} )+\epsilon _n^{\,}$$
are conventionally considered to be realizations of a deterministic function y that is corrupted by some additive noise process 
$$\{\epsilon _n^{\,}\}$$
. This function y is usually modeled by a linearly weighted sum of M fixed basis functions 
$$\{\phi _m^{\,} (\mathbf {x})\}_{m=1}^M$$
:

$$\displaystyle \begin{aligned} \hat y (\mathbf{x})=\sum_{m=1}^M w_m^{\,} \phi_m^{\,} (\mathbf{x}). \end{aligned} $$
(8.7.1)
Our objective is to infer values of the parameters/weights 
$$w_m^{\,} ,m=1,\ldots ,M$$
such that 
$$\hat y (\mathbf {x})$$
is a “good” approximation of y(x).

The function approximation has two important desiderata: accuracy and “sparsity.” By sparsity, we mean that a sparse learning algorithm can set a significant number of the parameters 
$$w_m^{\,}$$
to zero.

The support vector machine (SVM) makes predictions based on the function:

$$\displaystyle \begin{aligned} y(\mathbf{x};\mathbf{w})=\sum_{i=1}^N w_i K(\mathbf{x};{\mathbf{x}}_i)+w_0,{} \end{aligned} $$
(8.7.2)
where 
$$K(\mathbf {x};{\mathbf {x}}_i^{\,} )$$
is a kernel function, effectively defining one basis function for each example in the training set.
There are two key features of the SVM:
  • This can avoid overfitting, which leads to good generalization.

  • This furthermore results in a sparse model dependent only on a subset of kernel functions.

However, there are a number of significant and practical disadvantages of the support vector learning methodology [39]:
  1. 1.

    Although relatively sparse, SVMs make unnecessarily liberal use of basis functions since the number of support vectors required typically grows linearly with the size of the training set. In order to reduce computational complexity, some form of post-processing is often required.

     
  2. 2.

    Predictions are not probabilistic. The SVM outputs a point estimate in regression, and a “hard” binary decision in classification. Ideally, it is desired to estimate the conditional distribution p(t|x) in order to capture uncertainty in prediction. Although posterior probability estimates can be coerced from SVMs via post-processing, these estimates are unreliable.

     
  3. 3.

    To satisfy Mercer’s condition, the kernel function K(x;x i) must be the continuous symmetric kernel of a positive integral operator.

     
To avoid the above limitations, Tipping [39] proposed the “relevance vector machine” (RVM) as a Bayesian alternative to the SVM model (8.7.2):

$$\displaystyle \begin{aligned} y(\mathbf{x};\mathbf{w})=\sum_{i=1}^M w_i\phi_i (\mathbf{x})={\mathbf{w}}^T{\boldsymbol \phi}(\mathbf{x}), \end{aligned} $$
(8.7.3)
where the output is a linearly weighted sum of M generally nonlinear and fixed basis functions ϕ(x) = [ϕ 1(x), …, ϕ M(x)]T. The regression process is to determine (or learn) the adjustable parameters (or “weights”) w = [w 1, …, w M]T by using a Bayesian learning framework.

The key feature of this learning approach is that in order to offer good generalization performance, the inferred predictors are exceedingly sparse in that they contain relatively few nonzero w i parameters. Since the majority of parameters are automatically set to zero during the learning process, this learning approach is known as the sparse Bayesian learning for regression [39].

The following outlines the framework of sparse Bayesian learning for regression.

Model Specification
Consider a data set of input-target pairs 
$$\{{\mathbf {x}}_n,t_n\}_{n=1}^N$$
, where 
$${\mathbf {x}}_n\in \mathbb {R}^d,t_n\in \mathbb {R}$$
. It is usually assumed that the target values are samples from the model with additive noise 𝜖 n. Under this assumption, the standard linear regression model with additive Gaussian noise is given by

$$\displaystyle \begin{aligned} t_n=y_n+\epsilon_n =y({\mathbf{x}}_n;\mathbf{w})+\epsilon_n, \end{aligned} $$
(8.7.4)
where y n = y(x n;w) are approximation signals of the target signal t n, and 𝜖 n are independent samples from some Gaussian noise N(0, σ 2) with mean-zero and variance σ 2. If letting 
$$\mathbf {t}=[t_1^{\,} ,\ldots ,t_N^{\,} ]^T$$
be the target vector and 
$$\mathbf {y}=[y_1^{\,} , \ldots ,y_N^{\,} ]^T$$
be the approximation vector, then under the assumption of independence of the target vector t and the Gaussian noise e = t −y, the likelihood of the complete data set can be written as

$$\displaystyle \begin{aligned} p(\mathbf{t}|\mathbf{w},\sigma^2)=(2\pi )^{-N/2}\sigma^{-N}\exp \left ( -\frac 1{2\sigma^2}\|\mathbf{t}-\mathbf{y}\|{}_2^2\right ) \end{aligned} $$
(8.7.5)
In different regression cases, the above likelihood has different forms.
  • The linear neural network makes predictions based on the function
    
$$\displaystyle \begin{aligned} y_n=y({\mathbf{x}}_n;\mathbf{w})={\mathbf{x}}_n^T\mathbf{w}\quad \text{or}\quad  \mathbf{y}={\mathbf{X}}^T\mathbf{w}, \end{aligned} $$
    (8.7.6)
    where 
$$\mathbf {w}=[w_1^{\,} ,\ldots ,w_d^{\,} ]^T$$
is the weight vector, 
$$\mathbf {X}=[{\mathbf {x}}_1^{\,} ,\ldots ,{\mathbf {x}}_N^{\,} ]$$
is the data matrix, and thus the likelihood (because of the independence assumption) is given by
    
$$\displaystyle \begin{aligned} p(\mathbf{t}|\mathbf{w},\sigma^2)&amp;=\prod_{i=1}^N p(t_i|\mathbf{w},\sigma^2)\\ &amp;=\prod_{i=1}^N \frac 1{\sqrt{2\pi}\sigma}\exp \left (-\frac{\|t_i- {\mathbf{x}}_i^T\mathbf{w}\|{}_2^2}{2\sigma^2}\right )\\ &amp;=(2\pi )^{-N/2}\sigma^{-N}\exp \left ( -\frac 1{2\sigma^2}\|\mathbf{t}-{\mathbf{X}}^T\mathbf{w}\|{}_2^2\right ).\end{aligned} $$
    (8.7.7)
  • The SVM makes predictions
    
$$\displaystyle \begin{aligned} y_n=\sum_{i=1}^N w_i K({\mathbf{x}}_n;{\mathbf{x}}_i)+w_0\quad \text{or}\quad  \mathbf{y}=\mathbf{Kw},\end{aligned} $$
    (8.7.8)
    where 
$$\mathbf {K}=[\mathbf {k}({\mathbf {x}}_1)^{\,} ,\ldots ,\mathbf {k}({\mathbf {x}}_N^{\,} )]^T$$
is the N × (N + 1) kernel matrix with 
$$\mathbf {k}(\mathbf { x}_n)= [1,K({\mathbf {x}}_n,{\mathbf {x}}_1),\ldots ,K({\mathbf {x}}_n^{\,} ,{\mathbf {x}}_N^{\,} )]^T$$
, and 
$$\mathbf {w}=[w_0^{\,} ,w_1^{\,} ,\ldots ,w_N^{\,} ]^T$$
. So under the independence assumption, the likelihood is given by
    
$$\displaystyle \begin{aligned} p(\mathbf{t}|\mathbf{w},\sigma^2)=(2\pi )^{-N/2}\sigma^{-N}\exp \left ( -\frac 1{2\sigma^2}\|\mathbf{t}-\mathbf{Kw}\|{}_2^2\right ).\end{aligned} $$
    (8.7.9)
  • The RVM makes predictions
    
$$\displaystyle \begin{aligned} y_n={\boldsymbol \phi}^T({\mathbf{x}}_n)\mathbf{w}=\sum_{i=1}^N w_i\phi_i({\mathbf{x}}_n)+w_0^{\,} \quad \text{or}\quad  \mathbf{y}={\boldsymbol \Phi}\mathbf{w},\end{aligned} $$
    (8.7.10)
    where 
$${\boldsymbol \Phi }=[{\boldsymbol \phi }({\mathbf {x}}_1^{\,} ),\ldots ,{\boldsymbol \phi }({\mathbf {x}}_N^{\,} )]^T$$
is the N × (N + 1) “design” matrix with 
$${\boldsymbol \phi }({\mathbf {x}}_n) =[1,\phi _1({\mathbf {x}}_n),\ldots ,\phi _N^{\,} ({\mathbf {x}}_n)]^T$$
, and 
$$\mathbf {w}=[w_0^{\,} ,w_1^{\,} ,\ldots ,w_N^{\,} ]^T$$
. For example, ϕ(x n) = [1, 
$$K({\mathbf {x}}_n,{\mathbf {x}}_1),K({\mathbf {x}}_n,{\mathbf {x}}_2),\ldots ,K({\mathbf {x}}_n,{\mathbf {x}}_N^{\,} )]^T$$
. Therefore, the likelihood (because of the independence assumption) is given by
    
$$\displaystyle \begin{aligned} p(\mathbf{t}|\mathbf{w},\sigma^2)=(2\pi )^{-N/2}\sigma^{-N}\exp \left ( -\frac 1{2\sigma^2}\|\mathbf{t}-{\boldsymbol \Phi}\mathbf{w}\|{}_2^2\right ).\end{aligned} $$
    (8.7.11)
Parameter Estimation
Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes’ rule [30]:

$$\displaystyle \begin{aligned} \mathrm{posterior}=\frac{\mathrm{likelihood}\times \mathrm{prior}}{\mathrm{marginal~likelihood}}. \end{aligned} $$
(8.7.12)
Hence, the posterior distribution over the weight vector w is given by Tipping [39]

$$\displaystyle \begin{aligned} p(\mathbf{w}|\mathbf{t},{\boldsymbol\alpha},\sigma^2)&amp;=\frac{p(\mathbf{t}|\mathbf{w},\sigma^2)p(\mathbf{w}|{\boldsymbol \alpha})}{p(\mathbf{t}|{\boldsymbol \alpha},\sigma^2)}\\ &amp;=(2\pi )^{-(N+1)/2}|{\boldsymbol\Sigma}|{}^{-1/2}\exp\left (-\frac 12(\mathbf{w}-{\boldsymbol \mu})^T {\boldsymbol\Sigma}^{-1}(\mathbf{w}-{\boldsymbol \mu}) \right ), \end{aligned} $$
(8.7.13)
where the posterior covariance and mean are, respectively, as follows [26]:

$$\displaystyle \begin{aligned} {\boldsymbol\Sigma}&amp;=(\sigma^{-2}{\boldsymbol\Phi}^T{\boldsymbol\Phi}+\mathbf{A})^{-1}, \end{aligned} $$
(8.7.14)

$$\displaystyle \begin{aligned} {\boldsymbol\mu}&amp;=\sigma^{-2}{\boldsymbol\Sigma}{\boldsymbol\Phi}^T\mathbf{t}, \end{aligned} $$
(8.7.15)
with A = Diag(α 0, α 1, …, α N).
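A minimal sketch of Eqs. (8.7.14)–(8.7.15) is given below: the design matrix Φ is built from a kernel as in (8.7.10), and the posterior covariance Σ and mean μ are computed for given hyperparameters α and noise variance σ². The helper names and the kernel argument are illustrative assumptions.

```python
import numpy as np

def design_matrix(X, kernel):
    """N x (N+1) design matrix Phi with rows phi(x_n) = [1, K(x_n, x_1), ..., K(x_n, x_N)]."""
    return np.hstack([np.ones((len(X), 1)), kernel(X, X)])

def rvm_posterior(Phi, t, alpha, sigma2):
    """Posterior covariance and mean of the weights, Eqs. (8.7.14)-(8.7.15);
    alpha holds the hyperparameters (alpha_0, ..., alpha_N), so A = Diag(alpha)."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))   # Eq. (8.7.14)
    mu = Sigma @ Phi.T @ t / sigma2                                # Eq. (8.7.15)
    return Sigma, mu
```

For example, Phi = design_matrix(X, rbf_kernel) with the RBF kernel sketched earlier, followed by rvm_posterior(Phi, t, alpha, sigma2).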
Sparse Bayesian “Learning”
The procedure which chooses a model with the maximum marginal likelihood is called the type II maximum likelihood procedure by Good [18]. By this procedure, sparse Bayesian learning is formulated as the (local) maximization with respect to α of the marginal likelihood, or equivalently of its logarithm; the log marginal likelihood is given by Tipping [39]:

$$\displaystyle \begin{aligned} \mathbb{L}({\boldsymbol \alpha})&amp;=\log p(\mathbf{t}|{\boldsymbol \alpha},\sigma^2)=\log \int_{-\infty}^\infty p(\mathbf{t}|\mathbf{w},\sigma^2)p(\mathbf{ w}|{\boldsymbol \alpha}) \mathrm{d}\mathbf{w}\\ &amp;=-\frac 12\Big (N\log (2\pi )+ \log |\mathbf{C}|+{\mathbf{t}}^T{\mathbf{C}}^{-1}\mathbf{t}\Big ){} \end{aligned} $$
(8.7.16)
with

$$\displaystyle \begin{aligned} \mathbf{C}=\sigma^2\mathbf{I} +{\boldsymbol \Phi}{\mathbf{A}}^{-1}{\boldsymbol \Phi}^T. \end{aligned} $$
(8.7.17)
Here, the term “marginal” emphasizes that the weights w have been integrated out, i.e., a non-parametric model.

The Bayesian learning problem, in the context of the RVM, thus becomes the search for the hyperparameters α. In the RVM, these hyperparameters are estimated by maximizing 
$$\mathbb {L}({\boldsymbol \alpha })$$
. This estimation problem will be discussed in Sect. 8.7.3.

As a typical application of RVMs, consider the compressive sensing (CS) measurements g represented as [23]:

$$\displaystyle \begin{aligned} \mathbf{g}={\boldsymbol \Phi}\mathbf{w}, \end{aligned} $$
(8.7.18)
where 
$${\boldsymbol \Phi }=[{\mathbf {r}}_1^{\,} ,\ldots ,{\mathbf {r}}_K^{\,} ]^T$$
is a K × N matrix whose rows are the random measurement vectors, assuming random CS measurements are made.
A typical means of solving the CS problem is ℓ1 regularization:

$$\displaystyle \begin{aligned} {\mathbf{w}}^*=\operatorname*{\text{arg}\ \text{min}}_{\mathbf{w}}\big\{\|\mathbf{g}-{\boldsymbol \Phi}\mathbf{w}\|{}_2^2+\gamma\|\mathbf{w}\|{}_1^{\,}\big\}.\end{aligned} $$
(8.7.19)
Under the assumption that the additive noise n is zero-mean Gaussian with distribution N(0, σ 2I), the Gaussian likelihood model is described as

$$\displaystyle \begin{aligned} p(\mathbf{g}|\mathbf{w},\sigma^2)=(2\pi )^{-K/2}\sigma^{-K}\exp \left (-\frac{\|\mathbf{g}-{\boldsymbol \Phi}\mathbf{w}\|{}_2^2}{2\sigma^2}\right ).\end{aligned} $$
(8.7.20)
Assuming the hyperparameters α and α 0 are known, given the CS measurements g and the projection matrix Φ, the posterior for w can be expressed analytically as a multivariate Gaussian distribution with mean and covariance

$$\displaystyle \begin{aligned} {\boldsymbol\mu}=\alpha_0{\boldsymbol\Sigma}{\boldsymbol\Phi}^T\mathbf{g}\quad \text{and}\quad  {\boldsymbol\Sigma}=(\alpha_0{\boldsymbol\Phi}^T{\boldsymbol\Phi}+\mathbf{A})^{-1}.\end{aligned} $$
(8.7.21)
Then the Bayesian compressive sensing problem becomes a sparse Bayesian regression problem.
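To make this connection concrete, the following sketch evaluates the posterior (8.7.21) for a random Gaussian measurement matrix; the measurement matrix, the sparse test signal, and the hyperparameter values are illustrative assumptions only.

```python
import numpy as np

def bayesian_cs_posterior(Phi, g, alpha, alpha0):
    """Posterior mean and covariance of w in Eq. (8.7.21); alpha0 = 1/sigma^2."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(alpha0 * Phi.T @ Phi + A)
    mu = alpha0 * Sigma @ Phi.T @ g
    return mu, Sigma

# Illustrative random-measurement example (all numbers are made up)
K_meas, N = 20, 50
Phi = np.random.randn(K_meas, N) / np.sqrt(K_meas)     # random K x N measurement matrix
w_true = np.zeros(N); w_true[[3, 17, 31]] = [1.0, -2.0, 0.5]   # sparse signal
g = Phi @ w_true                                       # CS measurements, Eq. (8.7.18)
mu, Sigma = bayesian_cs_posterior(Phi, g, np.ones(N), alpha0=1e2)
```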

8.7.2 Sparse Bayesian Classification

Sparse Bayesian classification follows an essentially identical framework to the regression case above, but uses a Bernoulli likelihood and a sigmoidal link function to account for the change in the target quantities [39].

For an input vector x, an RVM classifier models the probability distribution of its class label using the logistic sigmoid link function σ(y) = 1∕(1 + e^{−y}) as

$$\displaystyle \begin{aligned} p(d=1|\mathbf{x})=\frac 1{1+\exp (-f_{\mathrm{RVM}}(\mathbf{x}))},\end{aligned} $$
(8.7.22)
where f RVM(x), called the RVM classifier function, is given by

$$\displaystyle \begin{aligned} f_{\mathrm{RVM}}(\mathbf{x})=\sum_{i=1}^N \alpha_iK(\mathbf{x},{\mathbf{x}}_i),\end{aligned} $$
(8.7.23)
where K(x, x i) is a kernel function, and x i, i = 1, …, N are the training samples.
By adopting the Bernoulli distribution for P(t|x), the likelihood can be written as

$$\displaystyle \begin{aligned} P(\mathbf{t}|\mathbf{w})=\prod_{n=1}^N \sigma\big (y({\mathbf{x}}_n,\mathbf{w})\big )^{t_n}\left (1-\sigma \big (y({\mathbf{x}}_n,\mathbf{w})\big )\right )^{1-t_n}\end{aligned} $$
(8.7.24)
where the target t n ∈{0, 1}.
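Given already-learned expansion coefficients (written α i in (8.7.23)) and a kernel, the class-1 probability (8.7.22) can be sketched as follows; the function name and its arguments are illustrative assumptions.

```python
import numpy as np
from scipy.special import expit   # logistic sigmoid 1 / (1 + exp(-x))

def rvm_class_prob(x, X_train, weights, kernel):
    """p(d = 1 | x) from Eqs. (8.7.22)-(8.7.23); 'weights' are the expansion
    coefficients written as alpha_i in (8.7.23), assumed to be already learned."""
    k = kernel(X_train, x[None, :]).ravel()   # K(x, x_i), i = 1, ..., N
    return expit(weights @ k)                 # sigmoid of f_RVM(x)
```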

In sparse Bayesian regression, the weights w can be integrated out analytically. Unlike the regression case, in sparse Bayesian classification the weights cannot be integrated out analytically, precluding closed-form expressions for either the weight posterior p(w|t, α) or the marginal likelihood P(t|α). Thus, the Laplace approximation procedure is utilized.

In the statistics and machine learning literature, the Laplace approximation refers to the evaluation of the marginal likelihood or free energy using Laplace’s method. This is equivalent to a local Gaussian approximation of P(t|w) around a maximum a posteriori (MAP) estimate [40].

8.7.3 Fast Marginal Likelihood Maximization

To consider the dependence of 
$$\mathbb {L}({\boldsymbol \alpha })$$
on a single hyperparameter α m, m ∈{1, …, M}, the matrix C in Eq. (8.7.16) can be decomposed as

$$\displaystyle \begin{aligned} \mathbf{C}&amp;=\sigma^2\mathbf{I}+\sum_{i\neq m}\alpha_i^{-1}{\boldsymbol \phi}_i{\boldsymbol \phi}_i^T+\alpha_m^{-1}{\boldsymbol \phi}_m{\boldsymbol \phi}_m^T\\ &amp;={\mathbf{C}}_{-m}+\alpha_m^{-1}{\boldsymbol \phi}_m{\boldsymbol \phi}_m^T, \end{aligned} $$
(8.7.25)
where C −m = C −α m −1ϕ mϕ mT is C with the contribution of the mth basis vector ϕ m removed (cf. (8.7.25)). Thus, by applying the determinant identity and the matrix inversion lemma, the terms of interest in 
$$\mathbb {L}$$
can be written as

$$\displaystyle \begin{aligned} |\mathbf{C}|&amp;=|{\mathbf{C}}_{-m}|\cdot |1+\alpha_m^{-1}{\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m|, \end{aligned} $$
(8.7.26)

$$\displaystyle \begin{aligned} {\mathbf{C}}^{-1}&amp;={\mathbf{C}}_{-m}^{-1}-\frac{{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m {\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}}{\alpha_m+{\boldsymbol \phi}_m^T {\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m}.{} \end{aligned} $$
(8.7.27)
Then, 
$$\mathbb {L}({\boldsymbol \alpha })$$
can be rewritten as

$$\displaystyle \begin{aligned} \mathbb{L}({\boldsymbol \alpha})&amp;=-\frac 12\Bigg (N\log (2\pi )+\log |{\mathbf{C}}_{-m}|+{\mathbf{t}}^T{\mathbf{C}}_{-m}^{-1}\mathbf{t}\\ &amp;\quad  -\log \alpha_m+\log (\alpha_m+{\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m)-\frac{({\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}\mathbf{ t})^2} {\alpha_m + {\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m}\Bigg )\\ &amp;=\mathbb{L}({\boldsymbol \alpha}_{-m})+\frac 12 \left ( \log \alpha_m-\log (\alpha_m+s_m)+\frac{q_m^2}{\alpha_m+s_m}\right )\\ &amp;=\mathbb{L}({\boldsymbol \alpha}_{-m})+\ell (\alpha_m),{} \end{aligned} $$
(8.7.28)
where

$$\displaystyle \begin{aligned} s_m={\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m\quad \text{and}\quad  q_m={\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}\mathbf{ t},\quad \text{for}~m=1,\ldots ,M. \end{aligned} $$
(8.7.29)
The “sparsity factor” s m can be seen to be a measure of the extent that basis vector ϕ m “overlaps” those already present in the model, while the “quality factor” q m is a measure of the alignment of ϕ m with the error of the model with that vector excluded.
From (8.7.28) and the optimization condition 
$$\frac {\partial \mathbb {L}({\boldsymbol \alpha })}{\partial \alpha _m}=\frac {\partial \ell (\alpha _m)} {\partial \alpha _m}=0$$
we have the result

$$\displaystyle \begin{aligned} \frac 12 \left (\frac 1{\alpha_m}-\frac 1{\alpha_m+s_m}-\frac{q_m^2}{(\alpha_m+s_m)^2}\right )=0\quad \text{or}\quad  \alpha_m=\frac{s_m^2} {q_m^2-s_m}. \end{aligned} $$
(8.7.30)
Hence, 
$$\mathbb {L}({\boldsymbol \alpha })$$
has a unique maximum with respect to α m [10]:

$$\displaystyle \begin{aligned} \alpha_m =\bigg \{\begin{array}{ll} \frac{s_m^2}{q_m^2-s_m},&amp;~~\text{if } q_m^2&gt;s_m,\\  &amp;{}\\ \infty ,&amp;~~\text{otherwise},\end{array}{} \end{aligned} $$
(8.7.31)
for m = 1, …, M.
Equation (8.7.31) shows that:
  • If ϕ m is “in the model” (i.e., α m < ∞) yet 
$$q_m^2\leq s_m$$
, then ϕ m may be deleted (i.e., α m set to ∞).

  • If ϕ m is excluded from the model (α m = ∞) and 
$$q_m^2 &gt; s_m$$
, then ϕ m may be added (i.e., α m is set to some optimal finite value).

It is obvious that as compared with 
$${\mathbf {C}}_{-m}^{-1},m=1,\ldots ,M$$
, it is easier to maintain and compute the value of C −1. Let

$$\displaystyle \begin{aligned} S_m ={\boldsymbol \phi}_m^T{\mathbf{C}}^{-1}{\boldsymbol \phi}_m\quad \text{and}\quad  Q_m={\boldsymbol \phi}_m^T{\mathbf{C}}^{-1}\mathbf{t},\quad  m=1,\ldots ,M.{} \end{aligned} $$
(8.7.32)
By premultiplying 
$${\boldsymbol \phi }_m^T$$
, postmultiplying ϕ m, and using (8.7.29), Eq. (8.7.27) yields

$$\displaystyle \begin{aligned} S_m&amp;={\boldsymbol \phi}_m^T {\mathbf{C}}^{-1}{\boldsymbol \phi}_m ={\boldsymbol \phi}_m^T\left ({\mathbf{C}}_{-m}^{-1}-\frac{{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m {\boldsymbol \phi}_m^T {\mathbf{C}}_{-m}^{-1}}{\alpha_m+{\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m}\right ){\boldsymbol \phi}_m\\ &amp;=s_m-\frac{s_m^2}{\alpha_m+s_m},\quad  m=1,\ldots ,M.{} \end{aligned} $$
(8.7.33)
Similarly, we have

$$\displaystyle \begin{aligned} Q_m&amp;={\boldsymbol \phi}_m^T {\mathbf{C}}^{-1}\mathbf{t}={\boldsymbol \phi}_m^T\left ({\mathbf{C}}_{-m}^{-1}-\frac{{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m {\boldsymbol \phi}_m^T {\mathbf{C}}_{-m}^{-1}}{\alpha_m+{\boldsymbol \phi}_m^T{\mathbf{C}}_{-m}^{-1}{\boldsymbol \phi}_m}\right )\mathbf{t}\\ &amp;=q_m-\frac{s_m q_m}{\alpha_m+s_m},\quad  m=1,\ldots ,M.{} \end{aligned} $$
(8.7.34)
From (8.7.33) and (8.7.34) it follows that

$$\displaystyle \begin{aligned} s_m=\frac{\alpha_m S_m}{\alpha_m -S_m}\quad \text{and}\quad  q_m =\frac{\alpha_mQ_m}{\alpha_m-S_m}{} \end{aligned} $$
(8.7.35)
for m = 1, …, M. Note that when α m = ∞, s m = S m and q m = Q m.
By the Duncan–Guttman inversion formula (1.6.5)

$$\displaystyle \begin{aligned} (\mathbf{A}+\mathbf{UD}^{-1}\mathbf{V} )^{-1}={\mathbf{A}}^{-1}-{\mathbf{A}}^{-1}\mathbf{U}(\mathbf{D}+\mathbf{VA}^{-1}\mathbf{U})^{-1} \mathbf{VA}^{-1}, \end{aligned} $$
(8.7.36)
we have

$$\displaystyle \begin{aligned} {\mathbf{C}}^{-1}&amp;=(\sigma^2\mathbf{I}+{\boldsymbol \Phi}{\mathbf{A}}^{-1}{\boldsymbol \Phi}^T)^{-1}\\ &amp;=\mathbf{B}-\mathbf{B}{\boldsymbol \Phi}(\mathbf{A}+{\boldsymbol \Phi}^T \mathbf{B}{\boldsymbol \Phi})^{-1}{\boldsymbol \Phi}^T\mathbf{B}\\ &amp;=\mathbf{B}-\mathbf{B}{\boldsymbol \Phi}{\boldsymbol \Sigma}{\boldsymbol \Phi}^T\mathbf{B},{} \end{aligned} $$
(8.7.37)
where

$$\displaystyle \begin{aligned} \mathbf{B}=\sigma^{-2}\mathbf{I}\quad \text{and}\quad  {\boldsymbol \Sigma}=(\mathbf{A}+{\boldsymbol \Phi}^T\mathbf{B}{\boldsymbol \Phi})^{-1}. \end{aligned} $$
(8.7.38)
Substituting (8.7.37) into (8.7.32) yields

$$\displaystyle \begin{aligned} S_m &amp;={\boldsymbol \phi}_m^T\mathbf{B}{\boldsymbol \phi}_m-{\boldsymbol \phi}_m^T\mathbf{B}{\boldsymbol \Phi}{\boldsymbol \Sigma}{\boldsymbol \Phi}^T\mathbf{B}{\boldsymbol \phi}_m,{} \end{aligned} $$
(8.7.39)

$$\displaystyle \begin{aligned} Q_m&amp;={\boldsymbol \phi}_m^T\mathbf{B}\hat{\mathbf{t}}-{\boldsymbol \phi}_m^T\mathbf{B}{\boldsymbol \Phi}{\boldsymbol \Sigma}{\boldsymbol \Phi}^T\mathbf{B}\hat{\mathbf{t}}.{} \end{aligned} $$
(8.7.40)
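A minimal sketch of Eqs. (8.7.31), (8.7.35), and (8.7.37)–(8.7.40) for the regression case (with t̂ taken equal to t) is given below; the arrays Phi_all (all candidate basis vectors) and Phi_in (the basis vectors currently in the model) and the helper names are illustrative assumptions.

```python
import numpy as np

def sq_factors(Phi_all, Phi_in, alpha_in, t_hat, sigma2):
    """S_m and Q_m via Eqs. (8.7.37)-(8.7.40) with B = I / sigma^2 (sketch only).
    Phi_all: N x M matrix of all candidate bases phi_m; Phi_in: N x M_in matrix of
    bases currently in the model, with hyperparameters alpha_in."""
    beta = 1.0 / sigma2
    Sigma = np.linalg.inv(np.diag(alpha_in) + beta * Phi_in.T @ Phi_in)   # Eq. (8.7.38)
    G = Phi_in.T @ Phi_all                          # columns are Phi^T phi_m
    S = beta * np.einsum('nm,nm->m', Phi_all, Phi_all) \
        - beta**2 * np.einsum('im,im->m', G, Sigma @ G)                   # Eq. (8.7.39)
    Q = beta * (Phi_all.T @ t_hat) \
        - beta**2 * (G.T @ (Sigma @ (Phi_in.T @ t_hat)))                  # Eq. (8.7.40)
    return S, Q

def small_sq(S, Q, alpha):
    """s_m, q_m from S_m, Q_m via Eq. (8.7.35); for alpha_m = inf, s_m = S_m, q_m = Q_m.
    Assumes alpha_m != S_m for the finite entries."""
    finite = np.isfinite(alpha)
    s, q = np.array(S, dtype=float), np.array(Q, dtype=float)
    s[finite] = alpha[finite] * S[finite] / (alpha[finite] - S[finite])
    q[finite] = alpha[finite] * Q[finite] / (alpha[finite] - S[finite])
    return s, q

def update_alpha(s, q):
    """Eq. (8.7.31): the unique maximizer of the marginal likelihood w.r.t. alpha_m."""
    alpha = np.full(len(s), np.inf)
    keep = q**2 > s
    alpha[keep] = s[keep]**2 / (q[keep]**2 - s[keep])
    return alpha
```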
The following is the marginal likelihood maximization algorithm for sequential sparse Bayesian learning proposed by Tipping and Faul [40]:
  1. 1.

    Initialize σ 2 to some sensible value (e.g., var[t] × 0.1).

     
  2. 2.
    Initialize with a single basis vector ϕ i, setting, from (8.7.31):
    
$$\displaystyle \begin{aligned} \alpha_i = \frac{\|{\boldsymbol \phi}_i\|{}_2^2}{\|{\boldsymbol \phi}_i^T\mathbf{t}\|{}_2^2/\|{\boldsymbol \phi}_i\|{}_2^2-\sigma^2}. \end{aligned} $$
    (8.7.41)

    All other α m are notionally set to infinity.

     
  3. 3.

    Explicitly compute Σ and μ (which are scalars initially), along with initial values of s m and q m for all M bases ϕ m.

     
  4. 4.

    Select a candidate basis vector ϕ i from the set of all M.

     
  5. 5.

    Compute 
$$\theta _i=q_i^2-s_i$$
.

     
  6. 6.
    If θ i > 0 and α i < ∞ (i.e., ϕ i is in the model), re-estimate α i. Defining 
$$\kappa _j=(\varSigma _{jj} +(\bar {\alpha }_i -\alpha _i)^{-1})^{-1}$$
and Σ j as the jth column of Σ:
    
$$\displaystyle \begin{aligned} 2\Delta \mathbb{L}&amp;=\frac{Q_i^2}{S_i + (\bar{\alpha}_i^{-1}-\alpha_i^{-1})^{-1}}-\log \left (1+S_i(\bar{\alpha}_i^{-1}-\alpha_i^{-1})\right ), \end{aligned} $$
    (8.7.42)
    
$$\displaystyle \begin{aligned} \bar{\Sigma}&amp;={\boldsymbol \Sigma}-\kappa_j{\boldsymbol \Sigma}_j{\boldsymbol \Sigma}_j^T, \end{aligned} $$
    (8.7.43)
    
$$\displaystyle \begin{aligned} \bar{\boldsymbol \mu}&amp;={\boldsymbol \mu}-\kappa_j\mu_j{\boldsymbol \Sigma}_j, \end{aligned} $$
    (8.7.44)
    
$$\displaystyle \begin{aligned} \bar S_m&amp;=S_m +\kappa_j(\beta{\boldsymbol \Sigma}_j^T{\boldsymbol \Phi}^T{\boldsymbol \phi}_m)^2, \end{aligned} $$
    (8.7.45)
    
$$\displaystyle \begin{aligned} \bar Q_m&amp;=Q_m +\kappa_j\mu_j (\beta{\boldsymbol \Sigma}_j^T{\boldsymbol \Phi}^T{\boldsymbol \phi}_m). \end{aligned} $$
    (8.7.46)
     
  7. 7.
    If θ i > 0 and α i = ∞, add ϕ i to the model with updated α i:
    
$$\displaystyle \begin{aligned} 2\Delta \mathbb{L}&amp;=\frac{Q_i^2-S_i}{S_i}+\log \frac{S_i}{Q_i^2}, \end{aligned} $$
    (8.7.47)
$$\displaystyle \begin{aligned} \bar{\boldsymbol \Sigma}&amp;=\begin{bmatrix}{\boldsymbol \Sigma}+\beta^2\varSigma_{ii}{\boldsymbol \Sigma}{\boldsymbol \Phi}^T{\boldsymbol \phi}_i{\boldsymbol \phi}_i^T{\boldsymbol \Phi}{\boldsymbol \Sigma} &amp; -\beta\varSigma_{ii}{\boldsymbol \Sigma}{\boldsymbol \Phi}^T{\boldsymbol \phi}_i\\ -\beta\varSigma_{ii}{\boldsymbol \phi}_i^T{\boldsymbol \Phi}{\boldsymbol \Sigma} &amp; \varSigma_{ii}\end{bmatrix}, \end{aligned} $$
    (8.7.48)
$$\displaystyle \begin{aligned} \bar{\boldsymbol \mu}&amp;=\begin{bmatrix}{\boldsymbol \mu}-\mu_i\beta{\boldsymbol \Sigma}{\boldsymbol \Phi}^T{\boldsymbol \phi}_i\\ \mu_i\end{bmatrix}, \end{aligned} $$
    (8.7.49)
    
$$\displaystyle \begin{aligned} \bar S_m&amp;=S_m -\varSigma_{ii}(\beta{\boldsymbol \phi}_m^T{\mathbf{e}}_i)^2, \end{aligned} $$
    (8.7.50)
    
$$\displaystyle \begin{aligned} \bar Q_m&amp;=Q_m-\mu_i (\beta {\boldsymbol \phi}_m^T{\mathbf{e}}_i), \end{aligned} $$
    (8.7.51)

    where Σ ii = (α i + S i)−1, μ i = Σ iiQ i, and e i = ϕ i − β Φ Σ Φ Tϕ i.

     
  8. 8.
    If θ i ≤ 0 and α i < ∞, then delete ϕ i from the model and set α i = ∞:
    
$$\displaystyle \begin{aligned} 2\Delta \mathbb{L}&amp;=\frac{Q_i^2}{S_i -\alpha_i}-\log \left ( 1-\frac{S_i}{\alpha_i}\right ), \end{aligned} $$
    (8.7.52)
    
$$\displaystyle \begin{aligned} \bar{\Sigma}&amp;={\boldsymbol \Sigma}-\frac 1{\varSigma_{jj}}{\boldsymbol \Sigma}_j{\boldsymbol \Sigma}_j^T,{} \end{aligned} $$
    (8.7.53)
    
$$\displaystyle \begin{aligned} \bar{\boldsymbol \mu}&amp;={\boldsymbol \mu}-\frac{\mu_j}{\varSigma_{jj}}{\boldsymbol \Sigma}_j,{} \end{aligned} $$
    (8.7.54)
    
$$\displaystyle \begin{aligned} \bar S_m&amp;=S_m +\frac 1{\varSigma_{jj}}(\beta{\boldsymbol \Sigma}_j^T{\boldsymbol \Phi}^T{\boldsymbol \phi}_m)^2, \end{aligned} $$
    (8.7.55)
    
$$\displaystyle \begin{aligned} \bar Q_m&amp;=Q_m+\frac{\mu_j}{\varSigma_{jj}}(\beta{\boldsymbol \Sigma}_j^T{\boldsymbol \Phi}^T{\boldsymbol \phi}_m) \end{aligned} $$
    (8.7.56)

    Following updates (8.7.53) and (8.7.54), the appropriate row and/or column j is removed from 
$$\bar {\boldsymbol \Sigma }$$
and 
$$\bar {\boldsymbol \mu }$$
.

     
  9. 9.

    If it is a regression model and estimating the noise level, update 
$$\sigma ^2=\|\mathbf {t}-\mathbf {y}\|{ }_2^2/ (N-M+\sum _m \alpha _m\varSigma _{mm})$$
.

     
  10. 10.

    Recompute/update Σ, μ (using the Laplace approximation procedure in classification) and all s m and q m using Eqs. (8.7.35)–(8.7.40).

     
  11. 11.

    If converged, terminate; otherwise go to step 4.

     
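A compact sketch of this loop for the regression case is given below, reusing the rvm_posterior, sq_factors, and small_sq helpers sketched earlier. For readability it recomputes Σ, μ, S m, and Q m from scratch at every iteration instead of applying the incremental updates (8.7.42)–(8.7.56), and it simply cycles through the candidate basis vectors in step 4; all names and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def sequential_rvm(Phi, t, sigma2, n_iter=200):
    """Sequential sparse Bayesian regression in the spirit of [40] (sketch only).
    Phi: N x M matrix of candidate basis vectors; t: target vector."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)
    # Step 2: start from the single basis vector best aligned with t, Eq. (8.7.41)
    # (assumes the chosen basis explains more than the noise level)
    i0 = np.argmax((Phi.T @ t)**2 / np.sum(Phi**2, axis=0))
    phi0 = Phi[:, i0]
    alpha[i0] = (phi0 @ phi0) / ((phi0 @ t)**2 / (phi0 @ phi0) - sigma2)
    for it in range(n_iter):
        in_model = np.isfinite(alpha)
        S, Q = sq_factors(Phi, Phi[:, in_model], alpha[in_model], t, sigma2)
        s, q = small_sq(S, Q, alpha)
        i = it % M                           # step 4: cycle through candidate bases
        theta = q[i]**2 - s[i]               # step 5
        if theta > 0:                        # steps 6-7: re-estimate or add phi_i
            alpha[i] = s[i]**2 / theta       # Eq. (8.7.31)
        elif np.isfinite(alpha[i]):          # step 8: delete phi_i
            alpha[i] = np.inf
        in_model = np.isfinite(alpha)        # step 9: re-estimate the noise level
        Sigma, mu = rvm_posterior(Phi[:, in_model], t, alpha[in_model], sigma2)
        resid = t - Phi[:, in_model] @ mu
        gamma = 1.0 - alpha[in_model] * np.diag(Sigma)
        sigma2 = (resid @ resid) / max(N - gamma.sum(), 1e-12)
    return alpha, sigma2
```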
Brief Summary of This Chapter
  • SVMs have high classification accuracy because they maximize the margin.

  • When the sample size is small, SVMs can also classify accurately and have good generalization ability.

  • The kernel function can be used to solve nonlinear problems.

  • SVMs can solve classification and regression problems with high-dimensional features.