1.1 Examples of Optimization Problems in Machine Learning
Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning.
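In symbols, these two examples can be written in the following generic forms (one common formulation in our own notation, with λ ≥ 0 a trade-off parameter; concrete instances may absorb λ into the regularizer or drop the averaging):

```latex
% Classification/regression: loss \ell, prediction function f,
% regularizer R on the parameters W, over n training pairs (x_i, y_i).
\min_{W}\ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i; W),\, y_i\bigr) + \lambda R(W)

% Low-rank learning, e.g., nuclear-norm-regularized matrix completion
% over the set \Omega of observed entries of a matrix M.
\min_{X}\ \frac{1}{2}\sum_{(i,j)\in\Omega} \bigl(X_{ij} - M_{ij}\bigr)^{2} + \lambda \|X\|_{*}
```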
The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, hinge loss, linear classification function, and ℓ2 regularizer give the support vector machine (SVM) problem [21]; logistic loss, linear regression function, and ℓ2 regularizer give the regularized logistic regression problem [10]; square loss, forward propagation function, and R(W) = 0 give the multi-layer perceptron [33]; and square loss, linear regression function, and ℓ1 regularizer give the LASSO problem [68].
For more examples of optimization problems in machine learning, one may refer to the survey paper written by Gambella, Ghaddar, and Naoum-Sawaya in 2019 [28].
1.2 First-Order Algorithms
In most machine learning models, a moderate numerical precision of the parameters already suffices. Moreover, each iteration needs to finish in a reasonable amount of time. Thus, first-order optimization methods are the mainstream algorithms in the machine learning community. While “first-order” has a rigorous definition in the complexity theory of optimization, based on an oracle that returns only f(x_k) and ∇f(x_k) when queried with x_k, here we adopt a much more general sense: higher-order derivatives of the objective function are not used (thus allowing, e.g., closed-form solutions of subproblems and the use of the proximal mapping (Definition A.19)). However, we do not attempt to write a book on all first-order algorithms that are commonly used or actively investigated in the machine learning community; that is clearly beyond our capability given the sheer volume of the literature. Some excellent reference books, preprints, and surveys include [7, 12–14, 34, 35, 37, 58, 60, 66]. Rather, we focus only on accelerated first-order methods, where “accelerated” means that the convergence rate is improved without making much stronger assumptions, the key techniques being essentially exquisite interpolation and extrapolation.
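To make this broader sense of “first-order” concrete, here is a minimal sketch (our own toy example, not an algorithm from this book) of proximal gradient steps for an ℓ1-regularized least squares problem: only ∇f of the smooth part is queried, while the nonsmooth part is handled through the closed-form proximal mapping, i.e., soft-thresholding.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1: the closed-form solution of
    argmin_x 0.5 * ||x - v||^2 + tau * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_step(x, grad_f, step, lam):
    """One proximal gradient step on f(x) + lam * ||x||_1. Only the
    gradient of the smooth part f is used, plus the proximal mapping
    of the nonsmooth part -- no higher-order derivatives."""
    return soft_threshold(x - step * grad_f(x), step * lam)

# Toy usage: f(x) = 0.5 * ||A x - b||^2 (smooth part) plus lam * ||x||_1.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad_f
x = np.zeros(5)
for _ in range(200):
    x = proximal_gradient_step(x, grad_f, 1.0 / L, lam=0.1)
```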
1.3 Sketch of Representative Works on Accelerated Algorithms
In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak’s heavy-ball method [61]. Consider a problem with an L-smooth (Definition A.12) and μ-strongly convex (Definition A.10) objective, and let ε be the error to the optimal solution. The heavy-ball method reduces the complexity of the usual gradient descent from O((L/μ) log(1/ε)) to O(√(L/μ) log(1/ε)). In 1983, Nesterov proposed his accelerated gradient descent (AGD) for L-smooth objective functions, where the complexity is reduced to O(√(L/ε)), as compared with the O(L/ε) complexity of the usual gradient descent. Nesterov further proposed another accelerated algorithm for L-smooth objective functions in 1988 [53], smoothing techniques for nonsmooth functions with acceleration tricks in 2005 [54], and an accelerated algorithm for composite functions in 2007 [55] (whose formal publication is [57]). Nesterov’s seminal work did not catch much attention in the machine learning community at first, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle [8], formally published in 2009, which extends [53] and is simpler than [55], somehow gained great interest in the machine learning community, as it fits well the sparse and low-rank models that were hot topics at that time. Tseng further provided a unified analysis of existing acceleration techniques [70], and Bubeck proposed a near-optimal method for highly smooth convex optimization [16].
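To illustrate the interpolation/extrapolation flavor of these methods, below is a minimal sketch of an AGD-style iteration in the smooth unconstrained case (the t_k momentum schedule shown is the one popularized by [8]; the function names are ours). Each iterate takes a plain gradient step, but at an extrapolated point y, which is what improves O(L/ε) to O(√(L/ε)).

```python
import numpy as np

def agd(grad_f, x0, L, iters=100):
    """Minimal Nesterov/FISTA-style accelerated gradient descent for an
    L-smooth convex f: a gradient step taken at an extrapolated point y."""
    x = np.asarray(x0, dtype=float)
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x_next = y - (1.0 / L) * grad_f(y)           # gradient step at y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # extrapolation
        x, t = x_next, t_next
    return x
```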
Nesterov’s AGD is not quite intuitive, and there have been several efforts to interpret it. Su et al. gave an interpretation from the viewpoint of differential equations [67], and Wibisono et al. further extended it to higher-order AGD [71]. Fazlyab et al. proposed a Linear Matrix Inequality (LMI), built from the Integral Quadratic Constraints (IQCs) of robust control theory, to interpret AGD [42]. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique [6]. On the other hand, some researchers have worked on designing other interpretable accelerated algorithms. Via the Performance Estimation Problem approach, Kim and Fessler designed an optimized first-order algorithm whose complexity is only one half of that of Nesterov’s accelerated gradient method [40]. Bubeck proposed a geometric descent method inspired by the ellipsoid method [15], and Drusvyatskiy et al. showed that the same iterate sequence can be generated by computing an optimal average of quadratic lower models of the function [24].
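For instance, the interpretation of [67] shows that, as the step size vanishes, Nesterov’s AGD tracks the second-order ODE below, along which f(X(t)) − f* decays at the rate O(1/t²), matching AGD’s O(1/k²) rate (we state it here only as a pointer):

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
\qquad X(0) = x_0,\quad \dot{X}(0) = 0.
```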
For linearly constrained convex problems, different from the unconstrained case, both the error in the objective function value and the violation of the constraint should be taken care of. Ideally, both errors should decrease at the same rate. A straightforward way to extend Nesterov’s acceleration technique to constrained optimization is to solve the dual problem (Definition A.24) by AGD directly, which leads to the accelerated dual ascent [9] and the accelerated augmented Lagrange multiplier method [36], both with the optimal convergence rate in the dual space. Lu [51] and Li [44] further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of dual-based methods is the need to solve a subproblem at each iteration. Linearization is an effective way to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the update of the variable [45], and Xu proposed an accelerated linearized augmented Lagrangian method [72]. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, were also accelerated in [59] and [20] for generally convex (Definition A.10) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method have faster convergence rates even without acceleration techniques [19, 72].
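For concreteness, the model problem in this paragraph and the augmented Lagrangian on which the (accelerated) multiplier methods are built can be written as follows (standard notation, ours; λ is the Lagrange multiplier and β > 0 the penalty parameter):

```latex
\min_{x}\ f(x)\quad \text{s.t.}\quad Ax = b,
\qquad
L_{\beta}(x, \lambda) = f(x) + \langle \lambda,\, Ax - b\rangle
  + \frac{\beta}{2}\,\|Ax - b\|^{2}.
```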
Nesterov’s AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization appeared in [31], which minimizes a composite objective with a smooth (Definition A.11) nonconvex part and a nonsmooth convex (Definition A.7) part. Inspired by [31], Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part [43]. Both [31] and [43] studied convergence to a first-order critical point (Definition A.34). Carmon et al. further gave an O(ε^(−7/4) log(1/ε)) complexity analysis [17]. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30] and the only task is to escape strict saddle points (Definition A.29). The first accelerated method to find a second-order critical point appeared in [18]; it alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method [38]. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion [1]. The complexities established in [1, 18, 38] are all Õ(ε^(−7/4)).
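As a reference for this paragraph, a common way to quantify approximate critical points of an f with ρ-Lipschitz Hessian is the following (the exact constants vary across [1, 18, 38]):

```latex
\text{first-order:}\quad \|\nabla f(x)\| \le \varepsilon;
\qquad
\text{second-order:}\quad \|\nabla f(x)\| \le \varepsilon
\ \text{ and }\
\lambda_{\min}\bigl(\nabla^{2} f(x)\bigr) \ge -\sqrt{\rho\,\varepsilon}.
```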
As for stochastic algorithms, compared with deterministic algorithms the main challenge is that the noise in the gradient does not vanish along the iterations, which makes the famous stochastic gradient descent (SGD) converge only at a sublinear rate even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique to reduce the negative effect of this noise [22, 39, 52, 63]. Combining VR with the momentum technique, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha [2], which works in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, where techniques such as stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74] are available. On the other hand, in 2015 Lin et al. proposed a generic framework, called Catalyst [49], which minimizes a convex objective function via an accelerated proximal point method and thereby gains acceleration; the idea previously appeared in [65]. Stochastic nonconvex optimization is also an important topic, and some excellent works include [3–5, 29, 62, 69, 73]. In particular, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained a near-optimal convergence rate under certain conditions [26].
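To illustrate the variance reduction idea, here is a minimal sketch of an SVRG-style inner loop in the spirit of [39] (the names and structure are ours): a full gradient computed at a snapshot point anchors each stochastic gradient, so the estimator stays unbiased while its variance shrinks as the iterate and the snapshot both approach the optimum.

```python
import numpy as np

def svrg_inner_loop(grad_i, full_grad, snapshot, n, step, m, rng):
    """One SVRG epoch. grad_i(x, i) is the gradient of the i-th component
    function; full_grad(x) averages all n of them. The corrected estimator
    g is unbiased, and its variance vanishes near the optimum, which is
    what restores fast convergence for strongly convex problems."""
    mu = full_grad(snapshot)                 # full gradient at the anchor
    x = snapshot.copy()
    for _ in range(m):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(snapshot, i) + mu   # variance-reduced
        x = x - step * g
    return x
```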
The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. In asynchronous updates, no machine needs to wait for the others to finish their computations. Representative works include asynchronous accelerated gradient descent (AAGD) [25] and asynchronous accelerated coordinate descent (AACD) [32]. Depending on the topology, synchronous algorithms can be divided into centralized and decentralized distributed methods. Typical works for the former include the distributed ADMM [13], distributed dual coordinate ascent [75], and their extensions. One bottleneck of the centralized topology lies in the high communication cost at the central node [47]. Although decentralized algorithms have been widely studied in the control community, the lower bound was not established until 2017 [64], where a distributed dual ascent with a matching upper bound was also given. Motivated by this lower bound, Li et al. further analyzed a distributed accelerated gradient descent with both optimal communication and computation complexities, up to a log factor [46].
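As a small illustration of the decentralized setting, the classical (non-accelerated) decentralized gradient step below mixes each node’s iterate with its neighbors’ through a doubly stochastic matrix W supported on the network graph, then takes a local gradient step; the accelerated methods in [46, 64] refine this template (the code and names are our own sketch):

```python
import numpy as np

def decentralized_gd_step(X, W, local_grads, step):
    """X stacks one iterate per node (one row each); W is a doubly
    stochastic mixing matrix that is nonzero only on network edges.
    Each node first averages with its neighbors (communication),
    then moves along its own local gradient (computation)."""
    G = np.vstack([g(x) for g, x in zip(local_grads, X)])
    return W @ X - step * G
```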
1.4 About the Book
In the previous section, we briefly introduced representative works on accelerated first-order algorithms. Due to limited time, however, we do not give details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of some of them, chosen according to our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. 2), constrained convex problems (Chap. 3), and (unconstrained) nonconvex problems (Chap. 4), as well as stochastic algorithms for centralized optimization (Chap. 5) and distributed optimization (Chap. 6). To make the book self-contained, we give detailed proofs for each introduced algorithm. This book serves as a reference for part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Note that the proofs for achieving critical points (Sect. 4.2), escaping saddle points (Sect. 4.3), and the decentralized topology (Sect. 6.2.2) are highly non-trivial; readers who are not interested may skip them.