1.1 Examples of Optimization Problems in Machine Learning
Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning.
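In symbols, these two examples can be written in the following generic forms (one common formulation in our own notation, with λ ≥ 0 a trade-off parameter; concrete instances may absorb λ into the regularizer or drop the averaging):

```latex
% Classification/regression: loss \ell, prediction function f,
% regularizer R on the parameters W, over n training pairs (x_i, y_i).
\min_{W}\ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i; W),\, y_i\bigr) + \lambda R(W)

% Low-rank learning, e.g., nuclear-norm-regularized matrix completion
% over the set \Omega of observed entries of a matrix M.
\min_{X}\ \frac{1}{2}\sum_{(i,j)\in\Omega} \bigl(X_{ij} - M_{ij}\bigr)^{2} + \lambda \|X\|_{*}
```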
The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, hinge loss, linear classification function, and ℓ2 regularizer give the support vector machine (SVM) problem [21]; logistic loss, linear regression function, and ℓ2 regularizer give the regularized logistic regression problem [10]; square loss, forward propagation function, and R(W) = 0 give the multi-layer perceptron [33]; and square loss, linear regression function, and ℓ1 regularizer give the LASSO problem [68].
For more examples of optimization problems in machine learning, one may refer to the survey paper written by Gambella, Ghaddar, and Naoum-Sawaya in 2019 [28].
1.2 First-Order Algorithms
In most machine learning models, a moderate numerical precision of the parameters already suffices. Moreover, each iteration needs to finish in a reasonable amount of time. Thus, first-order optimization methods are the mainstream algorithms in the machine learning community. While “first-order” has a rigorous definition in the complexity theory of optimization, based on an oracle that returns only f(x_k) and ∇f(x_k) when queried with x_k, here we adopt a much more general sense: higher-order derivatives of the objective function are not used (thus allowing, e.g., closed-form solutions of subproblems and the use of the proximal mapping (Definition A.19)). However, we do not attempt to write a book on all first-order algorithms that are commonly used or actively investigated in the machine learning community; that is clearly beyond our capability given the sheer volume of the literature. Some excellent reference books, preprints, and surveys include [7, 12–14, 34, 35, 37, 58, 60, 66]. Rather, we focus only on accelerated first-order methods, where “accelerated” means that the convergence rate is improved without making much stronger assumptions, the key techniques being essentially exquisite interpolation and extrapolation.
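To make this broader sense of “first-order” concrete, here is a minimal sketch (our own toy example, not an algorithm from this book) of proximal gradient steps for an ℓ1-regularized least squares problem: only ∇f of the smooth part is queried, while the nonsmooth part is handled through the closed-form proximal mapping, i.e., soft-thresholding.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1: the closed-form solution of
    argmin_x 0.5 * ||x - v||^2 + tau * ||x||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_step(x, grad_f, step, lam):
    """One proximal gradient step on f(x) + lam * ||x||_1. Only the
    gradient of the smooth part f is used, plus the proximal mapping
    of the nonsmooth part -- no higher-order derivatives."""
    return soft_threshold(x - step * grad_f(x), step * lam)

# Toy usage: f(x) = 0.5 * ||A x - b||^2 (smooth part) plus lam * ||x||_1.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad_f
x = np.zeros(5)
for _ in range(200):
    x = proximal_gradient_step(x, grad_f, 1.0 / L, lam=0.1)
```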
1.3 Sketch of Representative Works on Accelerated Algorithms
In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak’s heavy-ball method [61]. Consider a problem with an L-smooth (Definition A.12) and μ-strongly convex (Definition A.10) objective, and let ε be the error to the optimal solution. The heavy-ball method reduces the complexity of the usual gradient descent from O((L/μ) log(1/ε)) to O(√(L/μ) log(1/ε)). In 1983, Nesterov proposed his accelerated gradient descent (AGD) for L-smooth objective functions, where the complexity is reduced to O(√(L/ε)), as compared with the O(L/ε) complexity of the usual gradient descent. Nesterov further proposed another accelerated algorithm for L-smooth objective functions in 1988 [53], smoothing techniques for nonsmooth functions with acceleration tricks in 2005 [54], and an accelerated algorithm for composite functions in 2007 [55] (whose formal publication is [57]). Nesterov’s seminal work did not catch much attention in the machine learning community at first, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle [8], formally published in 2009, which extends [53] and is simpler than [55], somehow gained great interest in the machine learning community, as it fits well the sparse and low-rank models that were hot topics at that time. Tseng further provided a unified analysis of existing acceleration techniques [70], and Bubeck proposed a near-optimal method for highly smooth convex optimization [16].
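To illustrate the interpolation/extrapolation flavor of these methods, below is a minimal sketch of an AGD-style iteration in the smooth unconstrained case (the t_k momentum schedule shown is the one popularized by [8]; the function names are ours). Each iterate takes a plain gradient step, but at an extrapolated point y, which is what improves O(L/ε) to O(√(L/ε)).

```python
import numpy as np

def agd(grad_f, x0, L, iters=100):
    """Minimal Nesterov/FISTA-style accelerated gradient descent for an
    L-smooth convex f: a gradient step taken at an extrapolated point y."""
    x = np.asarray(x0, dtype=float)
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x_next = y - (1.0 / L) * grad_f(y)           # gradient step at y
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # extrapolation
        x, t = x_next, t_next
    return x
```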
Nesterov’s AGD is not quite intuitive, and there have been several efforts to interpret it. Su et al. gave an interpretation from the viewpoint of differential equations [67], and Wibisono et al. further extended it to higher-order AGD [71]. Fazlyab et al. proposed a Linear Matrix Inequality (LMI), built from the Integral Quadratic Constraints (IQCs) of robust control theory, to interpret AGD [42]. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique [6]. On the other hand, some researchers have worked on designing other interpretable accelerated algorithms. Via the Performance Estimation Problem approach, Kim and Fessler designed an optimized first-order algorithm whose complexity is only one half of that of Nesterov’s accelerated gradient method [40]. Bubeck proposed a geometric descent method inspired by the ellipsoid method [15], and Drusvyatskiy et al. showed that the same iterate sequence can be generated by computing an optimal average of quadratic lower models of the function [24].
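For instance, the interpretation of [67] shows that, as the step size vanishes, Nesterov’s AGD tracks the second-order ODE below, along which f(X(t)) − f* decays at the rate O(1/t²), matching AGD’s O(1/k²) rate (we state it here only as a pointer):

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
\qquad X(0) = x_0,\quad \dot{X}(0) = 0.
```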
For linearly constrained convex problems, different from the unconstrained case, both the error in the objective function value and the violation of the constraint should be taken care of. Ideally, both errors should decrease at the same rate. A straightforward way to extend Nesterov’s acceleration technique to constrained optimization is to solve the dual problem (Definition A.24) by AGD directly, which leads to the accelerated dual ascent [9] and the accelerated augmented Lagrange multiplier method [36], both with the optimal convergence rate in the dual space. Lu [51] and Li [44] further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of dual-based methods is the need to solve a subproblem at each iteration. Linearization is an effective way to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the update of the variable [45], and Xu proposed an accelerated linearized augmented Lagrangian method [72]. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, were also accelerated in [59] and [20] for generally convex (Definition A.10) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method have faster convergence rates even without acceleration techniques [19, 72].
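For concreteness, the model problem in this paragraph and the augmented Lagrangian on which the (accelerated) multiplier methods are built can be written as follows (standard notation, ours; λ is the Lagrange multiplier and β > 0 the penalty parameter):

```latex
\min_{x}\ f(x)\quad \text{s.t.}\quad Ax = b,
\qquad
L_{\beta}(x, \lambda) = f(x) + \langle \lambda,\, Ax - b\rangle
  + \frac{\beta}{2}\,\|Ax - b\|^{2}.
```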
Nesterov’s AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization appeared in [31], which minimizes a composite objective with a smooth (Definition A.11) nonconvex part and a nonsmooth convex (Definition A.7) part. Inspired by [31], Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part [43]. Both [31] and [43] studied convergence to a first-order critical point (Definition A.34). Carmon et al. further gave an O(ε^(−7/4) log(1/ε)) complexity analysis [17]. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30] and the only task is to escape strict saddle points (Definition A.29). The first accelerated method to find a second-order critical point appeared in [18]; it alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method [38]. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion [1]. The complexities established in [1, 18, 38] are all Õ(ε^(−7/4)).
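As a reference for this paragraph, a common way to quantify approximate critical points of an f with ρ-Lipschitz Hessian is the following (the exact constants vary across [1, 18, 38]):

```latex
\text{first-order:}\quad \|\nabla f(x)\| \le \varepsilon;
\qquad
\text{second-order:}\quad \|\nabla f(x)\| \le \varepsilon
\ \text{ and }\
\lambda_{\min}\bigl(\nabla^{2} f(x)\bigr) \ge -\sqrt{\rho\,\varepsilon}.
```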
As for stochastic algorithms, compared with deterministic algorithms the main challenge is that the noise in the gradient does not vanish along the iterations, which makes the famous stochastic gradient descent (SGD) converge only at a sublinear rate even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique to reduce the negative effect of this noise [22, 39, 52, 63]. Combining VR with the momentum technique, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha [2], which works in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, where techniques such as stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74] are available. On the other hand, in 2015 Lin et al. proposed a generic framework, called Catalyst [49], which minimizes a convex objective function via an accelerated proximal point method and thereby gains acceleration; the idea previously appeared in [65]. Stochastic nonconvex optimization is also an important topic, and some excellent works include [3–5, 29, 62, 69, 73]. In particular, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained a near-optimal convergence rate under certain conditions [26].
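To illustrate the variance reduction idea, here is a minimal sketch of an SVRG-style inner loop in the spirit of [39] (the names and structure are ours): a full gradient computed at a snapshot point anchors each stochastic gradient, so the estimator stays unbiased while its variance shrinks as the iterate and the snapshot both approach the optimum.

```python
import numpy as np

def svrg_inner_loop(grad_i, full_grad, snapshot, n, step, m, rng):
    """One SVRG epoch. grad_i(x, i) is the gradient of the i-th component
    function; full_grad(x) averages all n of them. The corrected estimator
    g is unbiased, and its variance vanishes near the optimum, which is
    what restores fast convergence for strongly convex problems."""
    mu = full_grad(snapshot)                 # full gradient at the anchor
    x = snapshot.copy()
    for _ in range(m):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(snapshot, i) + mu   # variance-reduced
        x = x - step * g
    return x
```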
The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. In asynchronous updates, no machine needs to wait for the others to finish their computations. Representative works include asynchronous accelerated gradient descent (AAGD) [25] and asynchronous accelerated coordinate descent (AACD) [32]. Depending on the topology, synchronous algorithms can be divided into centralized and decentralized distributed methods. Typical works for the former include the distributed ADMM [13], distributed dual coordinate ascent [75], and their extensions. One bottleneck of the centralized topology lies in the high communication cost at the central node [47]. Although decentralized algorithms have been widely studied in the control community, the lower bound was not established until 2017 [64], where a distributed dual ascent with a matching upper bound was also given. Motivated by this lower bound, Li et al. further analyzed a distributed accelerated gradient descent with both optimal communication and computation complexities, up to a log factor [46].
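As a small illustration of the decentralized setting, the classical (non-accelerated) decentralized gradient step below mixes each node’s iterate with its neighbors’ through a doubly stochastic matrix W supported on the network graph, then takes a local gradient step; the accelerated methods in [46, 64] refine this template (the code and names are our own sketch):

```python
import numpy as np

def decentralized_gd_step(X, W, local_grads, step):
    """X stacks one iterate per node (one row each); W is a doubly
    stochastic mixing matrix that is nonzero only on network edges.
    Each node first averages with its neighbors (communication),
    then moves along its own local gradient (computation)."""
    G = np.vstack([g(x) for g, x in zip(local_grads, X)])
    return W @ X - step * G
```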
1.4 About the Book
In the previous section, we briefly introduced representative works on accelerated first-order algorithms. Due to limited time, however, we do not give details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of some of them, chosen according to our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. 2), constrained convex problems (Chap. 3), and (unconstrained) nonconvex problems (Chap. 4), as well as stochastic algorithms for centralized optimization (Chap. 5) and distributed optimization (Chap. 6). To make the book self-contained, we give detailed proofs for each introduced algorithm. This book serves as a reference for part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Note that the proofs for achieving critical points (Sect. 4.2), escaping saddle points (Sect. 4.3), and the decentralized topology (Sect. 6.2.2) are highly non-trivial; readers who are not interested may skip them.