1. In theory, most convergence properties are studied in the form of expectation (concentration);
2. in real experiments, the algorithms are often much faster than the batch (deterministic) ones.
Because the updates access the gradients individually, the time complexity to reach a given accuracy can be measured by the total number of calls to the individual functions. Formally, we refer to this count as the number of Incremental First-order Oracle (IFO) calls, defined as follows:
For problem (5.2), an IFO takes an index i ∈ [n] and a point x, and returns the pair (f_i(x), ∇f_i(x)).
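To make this accounting concrete, here is a minimal Python sketch (our own illustration; the class name `IFO` and the least-squares instantiation are not from the text) of an oracle that serves individual values and gradients while counting calls:

```python
import numpy as np

class IFO:
    """Incremental First-order Oracle for f(x) = (1/n) * sum_i f_i(x),
    instantiated here with least squares: f_i(x) = 0.5 * (a_i^T x - b_i)^2."""

    def __init__(self, A, b):
        self.A, self.b = A, b
        self.calls = 0                      # the complexity measure of this chapter

    def query(self, i, x):
        """Take an index i and a point x; return (f_i(x), grad f_i(x))."""
        self.calls += 1
        r = self.A[i] @ x - self.b[i]       # residual of the i-th example
        return 0.5 * r * r, r * self.A[i]   # value and gradient of f_i
```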
As discussed in Chap. 2, the momentum (acceleration) technique ensures a theoretically faster convergence rate for deterministic algorithms. It is natural to ask whether the momentum technique can also accelerate stochastic algorithms. Before answering this question, we first turn our attention to the stochastic algorithms themselves. What is the main challenge in analyzing stochastic algorithms? It is undoubtedly the noise in the gradients. Specifically, the variance of the noisy gradient does not go to zero through the updates, which fundamentally slows down the convergence rate. A more pointed question is therefore whether the momentum technique can reduce the negative effect of the noise. Unfortunately, the existing results answer this question with "No": the momentum technique cannot reduce the variance; instead, it accumulates the noise. What really reduces the negative effect of noise is the technique called variance reduction (VR) [10]. With the VR technique, the algorithms behave like deterministic algorithms and can then be fused with momentum. Now we are able to answer our first question, and the answer leans toward the positive: in some cases, though not all (e.g., not when n is large), the momentum technique fused with VR ensures a provably faster rate! Besides, another effect of the momentum technique is that it accumulates the noise, so one can reduce the variance after it has been aggregated by the momentum. In this way, the mini-batch size can be increased, which is very helpful in distributed optimization (see Chap. 6).
1. transforming the algorithm into a "near-deterministic" one by VR, and
2. fusing in the momentum trick to achieve a faster rate of convergence.
1. Ensure faster convergence rates (by order) when n is sufficiently small.
2. Ensure larger mini-batch sizes for distributed optimization.
1. Each f_i(x) is convex, which we call the individually convex (IC) case.
2. Each f_i(x) can be nonconvex but f(x) is convex, which we call the individually nonconvex (INC) case.
3. f(x) is nonconvex (NC).
5.1 The Individually Convex Case
5.1.1 Accelerated Stochastic Coordinate Descent
We give the convergence result below. The proof is taken from [8].
where 𝔼_{i_k} denotes that the expectation is taken only over the random index i_k, under the condition that x_k and z_k are known.
where in the first step we use e_{k+1,k+1} = nθ_k, in the second step we use e_{k+1,k} = n(1 − θ_k)θ_{k−1} + θ_k − nθ_k, and in the third step we use e_{k+1,i} = (1 − θ_k)e_{k,i} for i ≤ k − 1 and e_{k,k} = nθ_{k−1}. □
Taking the full expectation, telescoping the result down to k = 0, then using (5.19) and ∥z_{k+1} − x*∥² ≥ 0, we obtain (5.8). □
5.1.2 Background for Variance Reduction Methods
For stochastic gradient descent, due to the nonzero variance of the gradient, a constant step size cannot guarantee the convergence of the algorithm. Instead, it only guarantees a sublinear convergence rate on strongly convex and L-smooth objective functions. For finite-sum objective functions, the remedy is called variance reduction (VR) [10], which drives the variance to zero through the updates. The convergence rate can then be accelerated to linear for strongly convex and L-smooth objective functions. The first VR method is perhaps SAG [17], which uses the sum of the latest individual gradients as an estimator. It requires O(nd) memory storage and uses a biased gradient estimator. SDCA [18] also achieves a linear convergence rate; in the primal space, the algorithm is known as MISO [13], which is a majorization-minimization VR algorithm. SVRG [10] is a follow-up work of SAG [17] that reduces the memory cost to O(d) and uses an unbiased gradient estimator. The main technique of SVRG [10] is periodically pre-storing a snapshot vector and bounding the variance by the distance between the snapshot vector and the latest iterate. Later, SAGA [6] improved SAG by using unbiased updates via the technique of SVRG [10].
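The following minimal sketch shows the SVRG estimator on the least-squares finite sum used above; the epoch length m = n and the step-size rule are our own illustrative choices, not prescriptions from the text.

```python
import numpy as np

def svrg(A, b, epochs=30):
    """Minimal SVRG for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.

    At each snapshot x_tilde we store the full gradient mu. The estimator
    grad_i(x) - grad_i(x_tilde) + mu is unbiased, and its variance shrinks
    as x and x_tilde both approach the minimizer, so a constant step size
    yields linear convergence in the strongly convex case."""
    n, d = A.shape
    L = np.max(np.sum(A * A, axis=1))        # bound on individual smoothness
    eta = 0.1 / L                            # constant step size
    grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
    x = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        x_tilde = x.copy()
        mu = A.T @ (A @ x_tilde - b) / n     # full gradient at the snapshot
        for _ in range(n):                   # inner loop of length m = n
            i = rng.integers(n)
            v = grad_i(i, x) - grad_i(i, x_tilde) + mu   # VR estimator
            x = x - eta * v
    return x
```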
In other words, the number of IFO calls to achieve an 𝜖-accuracy solution is O((n + L∕μ) log(1∕𝜖)).
The proof is mainly taken from [10] and [4].
Taking the full expectation of (5.35) and substituting (5.36) into it, we obtain the claimed bound. □
5.1.3 Accelerated Stochastic Variance Reduction Method
With the VR technique in hand, we can fuse in the momentum technique to achieve a faster rate. We show that the convergence rate can be improved to O((n + √(nL∕μ)) log(1∕𝜖)) in the IC case. We introduce Katyusha [1], which is the first truly accelerated stochastic algorithm. The main technique in Katyusha [1] is a "negative momentum" which restricts the extrapolation term from moving far away from x̃, the snapshot vector introduced by SVRG [10]. The algorithm is shown in Algorithm 5.3.
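To make the negative momentum concrete, here is a minimal Python sketch in the spirit of Algorithm 5.3, on the least-squares finite sum of the earlier snippets and assuming a given strong convexity parameter mu; the parameter pattern (τ₂ = 1∕2, τ₁, α) follows [1], while the inner-loop length and the plain averaging for the new snapshot are simplifications of ours.

```python
import numpy as np

def katyusha(A, b, mu, epochs=30):
    """Sketch of Katyusha for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2,
    assumed mu-strongly convex. The tau2 * x_tilde term is the "negative
    momentum": it pulls the extrapolation point back toward the snapshot."""
    n, d = A.shape
    L = np.max(np.sum(A * A, axis=1))            # individual smoothness bound
    tau2 = 0.5
    tau1 = min(np.sqrt(n * mu / (3 * L)), 0.5)
    alpha = 1 / (3 * tau1 * L)
    grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
    x_tilde = np.zeros(d)
    y, z = x_tilde.copy(), x_tilde.copy()
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        mu_full = A.T @ (A @ x_tilde - b) / n            # snapshot gradient
        ys = []
        for _ in range(2 * n):                           # inner loop, m = 2n
            xk = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = rng.integers(n)
            v = grad_i(i, xk) - grad_i(i, x_tilde) + mu_full   # VR estimator
            z = z - alpha * v                            # momentum (mirror) step
            y = xk - v / (3 * L)                         # gradient step
            ys.append(y)
        x_tilde = np.mean(ys, axis=0)   # new snapshot (uniform average here;
                                        # [1] uses a weighted average)
    return x_tilde
```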
The proof is taken from [1]. We have the following lemma to bound the variance:
where the expectation is taken over the random index i_k, under the condition that u and the snapshot vector x̃ are known.
5.1.4 Black-Box Acceleration
1. The black-box methods make acceleration easier, because we only need to consider the method for solving the subproblems. Most of the time, the subproblems have a good condition number and so are easy to solve. Specifically, for the general problem (5.3), the subproblems can be solved by any vanilla VR method. For specific forms of (5.3), one is free to design appropriate methods, tailored to the characteristics of the functions, to solve them without considering acceleration techniques.
2. The black-box methods make the acceleration technique more general. For different properties of the objectives, no matter strongly convex or not, smooth or not, the black-box methods give a universal accelerated algorithm.
The first stochastic black-box acceleration might be Acc-SDCA [19], whose convergence rate for the IC case is already of the accelerated order, up to additional logarithmic factors. Later, Lin et al. proposed a generic acceleration framework, called Catalyst [12], which achieves a convergence rate of O((n + √(nL∕μ)) log(L∕μ) log(1∕𝜖)), outperforming Acc-SDCA by a logarithmic factor. Allen-Zhu et al. [2] designed a black-box acceleration that gradually decreases the condition number of the subproblem, achieving a faster rate than Catalyst [12] on some general objective functions. In the following, we introduce Catalyst [12] as an example. The algorithm of Catalyst is shown in Algorithm 5.4. The main theorem for Algorithm 5.4 is as follows:
We leave the proof to Sect. 5.2 and first introduce how to use it to obtain an accelerated algorithm. We have the following theorem:
For (5.3), assume that each f_i(x) is convex and L-smooth and h(x) is μ-strongly convex with μ ≤ L∕n. For Algorithm 5.4, if we choose the parameters as suggested in [12] and solve Step 3 by SVRG [10], then one can obtain an 𝜖-accuracy solution satisfying F(x) − F(x*) ≤ 𝜖 with O(√(nL∕μ) log(L∕μ) log(1∕𝜖)) IFO calls.
The subproblem in Step 3 is (μ + κ̂)-strongly convex and (L + κ̂)-smooth, where κ̂ is the weight of the proximal term added in Step 3, so its condition number is O(n) under the parameter setting above. From Theorem 5.2, applying SVRG to solve the subproblem needs O(n log(L∕μ)) IFO calls. So the total complexity is O(√(nL∕μ) log(L∕μ) log(1∕𝜖)). □
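In symbols, and under the assumption (ours, following the usual Catalyst analysis) that the proximal weight is chosen as κ̂ = L∕n so that the subproblem's condition number is O(n), the count multiplies out as:

```latex
\[
\underbrace{O\!\Big(\sqrt{\tfrac{\mu+\hat{\kappa}}{\mu}}\,\log\tfrac{1}{\epsilon}\Big)}_{\text{outer iterations}}
\;\times\;
\underbrace{O\!\Big(n\log\tfrac{L}{\mu}\Big)}_{\text{SVRG cost per subproblem}}
\;=\;
O\!\Big(\sqrt{\tfrac{nL}{\mu}}\,\log\tfrac{L}{\mu}\,\log\tfrac{1}{\epsilon}\Big),
\qquad \hat{\kappa}=\tfrac{L}{n},\ \ \mu\le\tfrac{L}{n}.
\]
```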
The black-box algorithms, e.g., Catalyst [12], were proposed earlier than Katyusha [1], discussed in Sect. 5.1.3. From the convergence results, Catalyst [12] is log(L∕μ) times slower than Katyusha [1]. However, the black-box algorithms are more flexible, and it is easier for them to achieve an accelerated rate. For example, in the next section we apply Catalyst to the INC case.
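To fix ideas, here is a schematic Catalyst-style outer loop (a sketch in the spirit of Algorithm 5.4, not a faithful reproduction); `inner_solver` stands for any linearly convergent method, e.g., the SVRG sketch above applied to the regularized surrogate, and the warm start and stopping criterion required by the theory of [12] are omitted.

```python
import numpy as np

def catalyst(inner_solver, x0, mu, kappa_hat, outer_iters=50):
    """Catalyst-style black-box acceleration (sketch).

    inner_solver(y) should approximately minimize the better-conditioned
    surrogate  F(x) + (kappa_hat / 2) * ||x - y||^2 ; the outer loop then
    applies Nesterov-type extrapolation to the returned points."""
    q = mu / (mu + kappa_hat)                    # inverse condition number
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))   # momentum weight (SC case)
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(outer_iters):
        x = inner_solver(y)                      # Step 3: solve the subproblem
        y = x + beta * (x - x_prev)              # extrapolation (momentum) step
        x_prev = x
    return x_prev
```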
5.2 The Individually Nonconvex Case
In this section, we consider solving (5.3) while allowing each f_i(x) to be nonconvex, so long as f(x), the sum of the f_i(x), is still convex. One important application is principal component analysis [9]. It is also the core technique for obtaining a faster rate in the NC case.
Notice that the convexity of f(x) guarantees an achievable global minimum of (5.3). As shown in Theorem 5.2, to reach a minimizer with F(x) − F* ≤ 𝜖, vanilla SVRG needs O((n + n^{2∕3}L∕μ) log(1∕𝜖)) IFO calls. We will show that the convergence rate can be improved to O((n + n^{3∕4}√(L∕μ)) log(1∕𝜖)) by acceleration. Compared with the computation cost in the IC case, the cost of the INC case is Ω(n^{1∕4}) times larger, which is caused by the nonconvexity of the individual functions. We still use Catalyst [12] to obtain the result. We first prove Theorem 5.4 in the context of INC. The proof is directly taken from [12]; it is based on the estimate sequence in [14] (see Sect. 2.1) and further takes the error of the inexact solver into account.
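The Ω(n^{1∕4}) factor can be read off by comparing the dominant terms of the two accelerated rates stated above (with κ = L∕μ):

```latex
\[
\frac{n^{3/4}\sqrt{\kappa}}{\,n^{1/2}\sqrt{\kappa}\,} \;=\; n^{1/4},
\qquad \kappa = \frac{L}{\mu}.
\]
```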
1. the base case of the estimate sequence;
2. for k ≥ 0, the recursion that generates the next estimate function.
Plugging (5.66) into (5.63) with k − 1 being replaced by k, we obtain (5.65).
With a special setting of the parameters for Algorithm 5.4, we are able to give the convergence result for the INC case.
5.3 The Nonconvex Case
In this section, we consider the hard case in which f(x) is nonconvex. We only consider the problem with h(x) = 0 and focus on the IFO complexity to achieve an approximate first-order stationary point satisfying ∥∇f(x)∥ ≤ 𝜖. In the IC case, SVRG [10] already achieves a nearly optimal rate (ignoring some constants) when n is sufficiently large. However, for the NC case SVRG is not optimal. We will introduce SPIDER [7], which finds an approximate first-order stationary point at an optimal rate (ignoring some constants). Then we will show that if we further assume that the objective function has Lipschitz continuous Hessians, the momentum technique ensures a faster rate when n is much smaller than κ = L∕μ.
5.3.1 SPIDER
Proposition 5.1 can be easily proven using the property of square-integrable martingales.
The algorithm using SPIDER to solve (5.2) is shown in Algorithm 5.5. We have the following theorem.
where Δ = f(x_0) − f* and f* = inf_x f(x). The gradient cost is bounded by O(LΔσ𝜖^{−3} + σ²𝜖^{−2}) for any choice of n_0 ∈ [1, 2σ∕𝜖]. Treating Δ, L, and σ as positive constants, the stochastic gradient complexity is O(𝜖^{−3}).
To prove Theorem 5.7, we first prepare the following lemmas.
where 𝔼_k denotes the conditional expectation over the randomness of the k-th iteration, with all earlier iterates given.
Now, we are ready to prove Theorem 5.7.
The gradient cost is bounded by O(n + LΔn^{1∕2}𝜖^{−2}) for any choice of n_0 ∈ [1, n^{1∕2}]. Treating Δ, L, and σ as positive constants, the stochastic gradient complexity is O(n + n^{1∕2}𝜖^{−2}).
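For intuition, below is a minimal finite-sum SPIDER-style sketch on the least-squares objective of the earlier snippets; the reset period q, the mini-batch size S = √n, and the normalized step are modeled on [7], but the specific constants are our own choices.

```python
import numpy as np

def spider(A, b, eps=1e-3, epochs=30):
    """Finite-sum SPIDER sketch for f(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2.

    The estimator v is reset to the full gradient every q steps; in between
    it is updated recursively with averaged gradient *differences*, so the
    accumulated error is a martingale whose variance stays controlled."""
    n, d = A.shape
    L = np.max(np.sum(A * A, axis=1))
    q = S = int(np.ceil(np.sqrt(n)))       # reset period and mini-batch size
    x_old = np.zeros(d)
    x = np.zeros(d)
    v = np.zeros(d)
    rng = np.random.default_rng(0)

    def grad_batch(idx, w):
        """Average gradient over the mini-batch idx at the point w."""
        return (A[idx] @ w - b[idx]) @ A[idx] / len(idx)

    for k in range(epochs * q):
        if k % q == 0:
            v = A.T @ (A @ x - b) / n      # periodic full-gradient reset
        else:
            idx = rng.integers(n, size=S)
            v = grad_batch(idx, x) - grad_batch(idx, x_old) + v  # recursion
        x_old = x
        # normalized step, following SPIDER's step-size pattern
        x = x - (eps / L) * v / max(np.linalg.norm(v), eps)
    return x
```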
5.3.2 Momentum Acceleration
When computing a first-order stationary point, SPIDER is in fact (nearly) optimal in certain regimes if only the gradient-smoothness condition is assumed. Thus, with smoothness of the gradient alone, it is hard to apply momentum techniques to accelerate the algorithms. However, one can obtain a faster rate with an additional assumption on the Hessians:
Each f_i(x) has a ρ-Lipschitz continuous Hessian (Definition A.14).
Suppose we solve NC-Search by [9] and the INC blocks by Catalyst as described in Theorem 5.6. Then, treating L, ρ, and Δ as constants, the total stochastic gradient complexity to achieve an 𝜖-accuracy solution satisfying ∥∇f(x_k)∥ ≤ 𝜖 is O(n𝜖^{−3∕2} + n^{3∕4}𝜖^{−7∕4}) up to logarithmic factors.
The proof of Theorem 5.9 is lengthy, so we omit it here. We mention that when n is large, e.g., n ≥ 𝜖^{−1}, the above method might not be faster than SPIDER [7]. Thus the lowest existing complexity to find a first-order stationary point is O(min(n + n^{1∕2}𝜖^{−2}, n𝜖^{−3∕2} + n^{3∕4}𝜖^{−7∕4})) up to logarithmic factors. In fact, for the problem of searching for a stationary point in nonconvex (stochastic) optimization, neither the upper nor the lower bounds have been settled so far. However, it is currently a very hot topic and has attracted a lot of attention in both the optimization and the machine learning communities, due to the practical value of nonconvex models. Interested readers may refer to [3, 7, 9] for some of the latest developments.
5.4 Constrained Problem
Table 5.1 Notations and variables

| Notation | Meaning | Variable | Meaning |
|---|---|---|---|
| 〈x, y〉_G, ∥x∥_G | 〈Gx, y〉, √〈Gx, x〉 | | Extrapolation variables |
| F_i(x_i) | h_i(x_i) + f_i(x_i) | | Primal variables |
| x | (x_1^T, x_2^T)^T | | Dual and temporary variables |
| y | | | Snapshot vectors used for VR |
| F(x) | F_1(x_1) + F_2(x_2) | | |
| A | [A_1, A_2] | | KKT point of (5.98) |
| | Mini-batch indices | b | Batch size |
Now, we give the convergence result. The main property of Acc-SADMM (Algorithm 5.7) in the inner loop is shown below.
where the expectation is taken over the random samples in the mini-batch, and L(x_1, x_2, λ) = F_1(x_1) + F_2(x_2) + 〈λ, A_1x_1 + A_2x_2 − b〉 is the Lagrangian function. Other notations can be found in Table 5.1.
With Theorem 5.10 in hand, we can check that the convergence rate of Acc-SADMM for solving problem (5.98) is exactly O(1∕(mS)). For non-accelerated methods, however, the convergence rate of ADMM in the non-ergodic sense is O(1∕√(mS)) [5].
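For orientation, the snippet below is a plain, deterministic two-block ADMM skeleton for the constraint A₁x₁ + A₂x₂ = b, with quadratic F_i chosen by us so that each block update is a closed-form linear solve; Acc-SADMM (Algorithm 5.7) replaces the exact block minimizations with extrapolated, variance-reduced stochastic steps and the snapshot vectors of Table 5.1.

```python
import numpy as np

def admm_two_block(A1, A2, b, beta=1.0, iters=200):
    """Two-block ADMM for min 0.5*||x1||^2 + 0.5*||x2||^2
    s.t. A1 x1 + A2 x2 = b.  Each block update minimizes the augmented
    Lagrangian L + (beta/2)*||A1 x1 + A2 x2 - b||^2 in one block."""
    m, d1 = A1.shape
    d2 = A2.shape[1]
    x1, x2, lam = np.zeros(d1), np.zeros(d2), np.zeros(m)
    I1, I2 = np.eye(d1), np.eye(d2)
    for _ in range(iters):
        # x1-update: (I + beta*A1^T A1) x1 = -A1^T (lam + beta*(A2 x2 - b))
        x1 = np.linalg.solve(I1 + beta * A1.T @ A1,
                             -A1.T @ (lam + beta * (A2 @ x2 - b)))
        # x2-update: same form, using the fresh x1
        x2 = np.linalg.solve(I2 + beta * A2.T @ A2,
                             -A2.T @ (lam + beta * (A1 @ x1 - b)))
        lam += beta * (A1 @ x1 + A2 @ x2 - b)   # dual ascent on the constraint
    return x1, x2, lam
```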
5.5 The Infinite Case
We apply mini-batch AGD to solve (5.149). The algorithm is shown in Algorithm 5.8.
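As a sketch of the idea behind Algorithm 5.8 (not a faithful reproduction), the following snippet runs Nesterov's acceleration with a mini-batch gradient estimator; `grad_batch` is a user-supplied function, assumed to return an average of `batch_size` i.i.d. stochastic gradients at the query point.

```python
import numpy as np

def minibatch_agd(grad_batch, x0, L, batch_size, iters):
    """Mini-batch accelerated gradient descent for min_x E_xi[F(x; xi)].

    Averaging over a large mini-batch shrinks the gradient noise enough
    that the deterministic AGD analysis (and its accelerated rate) applies."""
    x = x0.copy()
    y = x0.copy()
    for k in range(1, iters + 1):
        x_new = y - grad_batch(y, batch_size) / L     # gradient step at y
        y = x_new + (k - 1) / (k + 2) * (x_new - x)   # Nesterov extrapolation
        x = x_new
    return x
```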
Theorem 5.11 indicates that the total number of stochastic gradient calls is O(σ²∕𝜖²). Moreover, the algorithm converges in O(√(L∕𝜖)) steps, which matches the fastest rate even among deterministic algorithms. So the momentum technique enlarges the mini-batch size by O(√(L∕𝜖)) times.