Z. Lin et al.Accelerated Optimization for Machine Learning https://doi.org/10.1007/978-981-15-2910-8_5

5. Accelerated Stochastic Algorithms

Zhouchen Lin¹, Huan Li² and Cong Fang³

(1)

Key Lab. of Machine Perception School of EECS, Peking University, Beijing, Beijing, China

(2)

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China

(3)

School of Engineering and Applied Science, Princeton University, Princeton, NJ, USA

Keywords

Stochastic algorithmsIndividually convex problemIndividually nonconvex problemNonconvex problemConstrained problemAccelerated stochastic coordinate descentAccelerated variance reductionBlack-box acceleration

In machine learning and statistics communities, lots of large-scale problems can be formulated as the following optimization problem:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x}} f(\mathbf{x})\equiv \mathbb{E} [F(\mathbf{x};\xi)], \end{array} \end{aligned}$

(5.1)

where F(x;ξ) is a stochastic component indexed by a random number ξ. A special case that is of central interest is that f(x) can be written as a sum of functions. If we denote each component function as f _i(x), then (5.1) can be restated as

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x}} f(\mathbf{x}) \equiv \frac{1}{n}\sum_{i=1}^nf_i(\mathbf{x}), \end{array} \end{aligned}$

(5.2)

where n is the number of individual functions. When n is finite, (5.2) is an offline problem, with typical examples including empirical risk minimization (ERM). n can also go to infinity and we refer to this case as an online (streaming) problem.

Obtaining the full gradient of (5.2) might be expensive when n is large and even inaccessible when n = ∞. Instead, a standard manner is to estimate the full gradient via one or several randomly sampled counterparts from individual functions. The obtained algorithms to solve problem (5.2) are referred to as stochastic algorithms, involving the following characteristics:

1.
in theory, most convergence properties are studied in the forms of expectation (concentration);
2.
in real experiments, the algorithms are often much faster than the batch (deterministic) ones.

Because the updates access the gradient individually, the time complexity to reach a tolerable accuracy can be evaluated by the total number of calls for individual functions. Formally, we refer to it as the Incremental First-order Oracle (IFO) calls, with definition as follows:

Definition 5.1

For problem (5.2), an IFO takes an index i ∈ [n] and a point $\mathbf {x}\in \mathbb {R}^d$ , and returns the pair (f _i(x), ∇f _i(x)).

As has been discussed in Chap. 2, the momentum (acceleration) technique ensures a theoretically faster convergence rate for deterministic algorithms. We might ask whether the momentum technique can accelerate stochastic algorithms. Before we answer the question, we first put our attention on the stochastic algorithms itself. What is the main challenge in analyzing the stochastic algorithms? Definitely, it is the noise of the gradients. Specifically, the variance of the noisy gradient will not go to zero through the updates, which fundamentally slows down the convergence rate. So a more involved question is to ask whether the momentum technique can reduce the negative effect of noise? Unfortunately, the existing results answer the question with “No.” The momentum technique cannot reduce the variance but instead can accumulate the noise. What really reduces the negative effect of noise is the technique called variance reduction (VR) [10]. When applying the VR technique, the algorithms are transformed to act like a deterministic algorithm and then can be further fused with momentum . Now we are able to answer our first question. The answer is towards positivity: in some cases, however, not all (e.g., when n is large), the momentum technique fused with VR ensures a provably faster rate! Besides, another effect of the momentum technique is that it can accumulate the noise. Thus one can reduce the variance together after it is aggregated by the momentum . By doing this, the mini-batch sampling size is increased, which is very helpful in distributed optimization (see Chap. 6).

From a high level view, we summarize the way to accelerate stochastic methods into two steps:

1.
transforming the algorithm into a “near deterministic” one by VR and
2.
fusing the momentum trick to achieve a faster rate of convergence.

We conclude the advantages of momentum technique for stochastic algorithms below:

Ensure faster convergence rates (by order) when n is sufficiently small.
Ensure larger mini-batch sizes for distributed optimization.

In the following sections, we will concretely introduce how to use the momentum technique to accelerate algorithms according to different properties of f(x). In particular, we roughly split them into three cases:

each f _i(x) is convex, which we call the individually convex (IC) case.
each f _i(x) can be nonconvex but f(x) is convex, which we call the individually nonconvex (INC) case.
f(x) is nonconvex (NC).

At last, we extend some results to linearly constrained problems.

5.1 The Individually Convex Case

We consider a more general form of (5.2):

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x}} F(\mathbf{x}) \equiv h(\mathbf{x})+ \frac{1}{n}\sum_{i=1}^nf_i(\mathbf{x}), \end{array} \end{aligned}$

(5.3)

where h(x) and f _i(x) with i ∈ [n] are convex, the proximal mapping of h can be computed efficiently, and n is finite.

5.1.1 Accelerated Stochastic Coordinate Descent

Algorithm 5.1 Accelerated stochastic coordinate descent (ASCD) [8]

../images/487003_1_En_5_Chapter/487003_1_En_5_Figa_HTML.png

To use the finiteness of n, we can solve (5.3) in its dual space in which n becomes the dimension of the dual variable and one coordinate update of the dual variable corresponds to one time of accessing the individual function. We first introduce accelerated stochastic coordinate descent (ASCD) [8, 11, 16] and later illustrate how to apply it to solve (5.3). ASCD is first proposed in [16], and later proximal versions are discovered in [8, 11]. With a little abuse of notation, without hindering the readers to understand the momentum technique, we still write the optimization model for ASCD as follows:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x} \in \mathbb{R}^n} F(\mathbf{x}) \equiv h(\mathbf{x})+f(\mathbf{x}), \end{array} \end{aligned}$

(5.4)

where f(x) has L _c-coordinate Lipschitz continuous gradients (Definition A.13), h(x) has coordinate separable structure, i.e., $h(\mathbf {x}) = \sum _{i=1}^n h_i({\mathbf {x}}_i)$ with $\mathbf {x}=({\mathbf {x}}_1^T,\cdots ,{\mathbf {x}}_n^T)^T$ , and f(x) and h _i(x) are convex.

Stochastic coordinate descent algorithms are efficient to solve (5.4). In each update, one random coordinate ${\mathbf {x}}_{i_k}$ is chosen to sufficiently reduce the objective value while keeping other coordinates fixed, reducing the per-iteration cost. More specifically, the following types of proximal subproblem are solved:

$\displaystyle \begin{aligned} \begin{array}{rcl} \boldsymbol{\delta} = \operatorname*{\mathrm{argmin}}_{\boldsymbol{\delta}} \left(h_{i_k}({\mathbf{x}}^k_{i_k}+\boldsymbol{\delta} ) + \left\langle\nabla_{i_k} f({\mathbf{x}}^k), \boldsymbol{\delta}\right\rangle +\frac{\theta_k }{2\gamma}\boldsymbol{\delta}^2\right), \end{array} \end{aligned}$

where $\nabla _{i_k} f(\mathbf {x})$ denotes the partial gradient of f with respect to ${\mathbf {x}}_{i_k}$ . Fusing with the momentum technique, ASCD is shown in Algorithm 5.1.

We give the convergence result below. The proof is taken from [8].

Lemma 5.1

If $\theta _0 \leq \frac {1}{n}$ and for all k ≥ 0, θ _k ≥ 0 and is monotonically non-increasing, then x ^k is a convex combination of z ⁰, ⋯ , z ^k, i.e., we have ${\mathbf {x}}^k=\sum _{i=0}^k e_{k,i}{\mathbf {z}}^i$ , where e _0,0 = 1, e _1,0 = 1 − nθ ₀, e _1,1 = nθ, and for k > 1, we have

$\displaystyle \begin{aligned} e_{k+1,i}= \left\{ \begin{array}{rl} (1-\theta_k)e_{k,i},&\quad i\leq k-1,\\ n(1-\theta_k)\theta_{k-1}+\theta_k-n\theta_k,&\quad i=k,\\ n\theta_k,& \quad i = k+1. \end{array} \right. \end{aligned}$

(5.5)

Defining $\hat {h}_{k} = \sum _{i=0}^k e_{k,i} h({\mathbf {z}}^i)$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{i_k} (\hat{h}_{k+1}) = (1-\theta_k)\hat{h}_k +\theta_k\sum_{i_k=1}^n h_{i_k}({\mathbf{z}}^{k+1}_{i_k}), \end{array} \end{aligned}$

(5.6)

where $\mathbb {E}_{i_k}$ denotes that the expectation is taken only on the random number i _kunder the condition thatx ^kandz ^kare known.

Proof

We prove e _k+1,i first. When k = 0 and 1, we can check that it is right. We then prove (5.5). Since

$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{x}}^{k+1} &\displaystyle \overset{a}=&\displaystyle (1-\theta_k){\mathbf{x}}^k +\theta_k {\mathbf{z}}^k +n\theta_k ({\mathbf{z}}^{k+1}-{\mathbf{z}}^k)\\ &\displaystyle =&\displaystyle (1-\theta_k)\sum_{i=0}^k e_{k,i}{\mathbf{z}}^i +\theta_k {\mathbf{z}}^k +n\theta_k({\mathbf{z}}^{k+1}-{\mathbf{z}}^{k})\\ &\displaystyle =&\displaystyle (1-\theta_k)\sum_{i=0}^{k-1} e_{k,i}{\mathbf{z}}^i +\left[(1-\theta_k)e_{k,k}+\theta_k -n\theta_k\right]{\mathbf{z}}^k +n\theta_k {\mathbf{z}}^{k+1}, \end{array} \end{aligned}$

where $\overset {a}=$ uses Step 5 of Algorithm 5.1. Comparing the results, we obtain (5.5). Next, we prove that the above is a convex combination. It is easy to prove that the weights sum to 1 (by induction), 0 ≤ (1 − θ _k)e _k,j ≤ 1, and 0 ≤ nθ _k ≤ 1. So (1 − θ _k)e _k,k + θ _k − nθ _k = n(1 − θ _k)θ _k−1 + θ _k − nθ _k ≤ 1. On the other hand, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} n(1-\theta_k)\theta_{k-1}+\theta_k-n\theta_k \geq n(1-\theta_k)\theta_k+\theta_k-n\theta_k =\theta_k(1-n\theta_k)\geq0. \end{array} \end{aligned}$

For (5.6), we have

$\displaystyle \begin{aligned} \begin{aligned} &\mathbb{E}_{i_k}\hat{h}_{k+1}\\ &\quad \overset{a}= \sum_{i=0}^k e_{k+1,i} h({\mathbf{z}}^i) +\mathbb{E}_{i_k} \left[n\theta_k h({\mathbf{z}}^{k+1})\right]\\ &\quad = \sum_{i=0}^ke_{k+1,i} h({\mathbf{z}}^i) +\frac{1}{n}\sum_{i_k} n\theta_k \left(h_{i_k}({\mathbf{z}}^{k+1}_{i_k})+\sum_{j\neq i_k}h_j({\mathbf{z}}^k_j) \right)\\ &\quad = \sum_{i=0}^ke_{k+1,i} h ({\mathbf{z}}^i) +\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k})+(n-1)\theta_k h({\mathbf{z}}^k) \\ &\quad \overset{b}= \sum_{i=0}^{k-1} e_{k+1,i} h({\mathbf{z}}^i) +[n(1-\theta_k)\theta_{k-1}+\theta_k-n\theta_k]h({\mathbf{z}}^k)+(n-1)\theta_kh({\mathbf{z}}^k) \\ &\qquad +\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k})\\ &\quad = \sum_{i=0}^{k-1}e_{k+1,i} h({\mathbf{z}}^i)+n(1-\theta_k)\theta_{k-1}h({\mathbf{z}}^k)+\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k})\\ \end{aligned} \end{aligned}$

$\displaystyle \begin{aligned} \begin{aligned} &\quad \overset{c}= \sum_{i=0}^{k-1}e_{k,i} (1-\theta_k) h({\mathbf{z}}^i) +(1-\theta_k)e_{k,k}h({\mathbf{z}}^k)+\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k})\\ &\quad = \sum_{i=0}^{k}e_{k,i} (1-\theta_k) h({\mathbf{z}}^i) +\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k})\\ &\quad = (1 - \theta_k)\hat{h}_k +\theta_k\sum_{i_k} h_{i_k}({\mathbf{z}}^{k+1}_{i_k}), \end{aligned} \end{aligned}$

where in $\overset {a}=$ we use e _k+1,k+1 = nθ _k, in $\overset {b}=$ we use e _k+1,k = n(1 − θ _k)θ _k−1 + θ _k − nθ _k, and in $\overset {c}=$ we use e _k+1,i = (1 − θ _k)e _k,i for i ≤ k − 1, and e _k,k = nθ _k−1. □

Theorem 5.1

For Algorithm 5.1, setting $\theta _k=\frac {2}{2n+k}$ , we have that x ^k is a convex combination of z ⁰, ⋯ , z ^k, and

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{\mathbb{E} [F({\mathbf{x}}^{K+1})] -F({\mathbf{x}}^*)}{\theta_K^2} +\frac{n^2L_c}{2}\mathbb{E}\|{\mathbf{z}}^{K+1} -{\mathbf{x}}^* \|{}^2\\ &\displaystyle &\displaystyle \quad \leq\frac{F({\mathbf{x}}^0)-F({\mathbf{x}}^*)}{\theta_{-1}^2} +\frac{n^2L_c}{2}\|{\mathbf{z}}^0 -{\mathbf{x}}^*\|{}^2. \end{array} \end{aligned}$

(5.7)

When h(x) is strongly convex with modulus 0 ≤ μ ≤ L _c, setting $\theta _k = \frac {-\frac {\mu }{L_c}+\sqrt {\mu ^2/L_c^2+4\mu /L_c}}{2n} \sim O\left (\frac {\sqrt {\mu /L_c}}{n}\right )$ which is denoted as θ instead, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E} [F({\mathbf{x}}^{K+1})] -F({\mathbf{x}}^*)\\ &\displaystyle &\displaystyle \quad \leq (1-\theta)^{K+1} \left(F({\mathbf{x}}^0) -F({\mathbf{x}}^*)+ \frac{n^2\theta^2L_c+n\theta\mu}{2}\left\|{\mathbf{z}}^{0} -{\mathbf{x}}^* \right\|{}^2 \right). \end{array} \end{aligned}$

(5.8)

Proof

We can check that the setting of θ _k satisfies the assumptions in Lemma 5.1. We first consider the function value. By the optimality of ${\mathbf {z}}^{k+1}_{i_k}$ in Step 4, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} n\theta_k({\mathbf{z}}^{k+1}_{i_k} - {\mathbf{z}}^k_{i_k}) + \gamma \nabla_{i_k} f({\mathbf{y}}^{k})+\gamma \boldsymbol{\xi}^k_{i_k}=0, \end{array} \end{aligned}$

(5.9)

where $\boldsymbol {\xi }^k_{i_k} \in \partial h_{i_k}({\mathbf {z}}^{k+1}_{i_k})$ . By Steps 1 and 5, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{x}}^{k+1} ={\mathbf{y}}^k +n\theta_k({\mathbf{z}}^{k+1}-{\mathbf{z}}^k). \end{array} \end{aligned}$

(5.10)

Substituting (5.10) into (5.9), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} {} {\mathbf{x}}^{k+1}_{i_k} - {\mathbf{y}}^k_{i_k} + \gamma \nabla_{i_k} f({\mathbf{y}}^{k})+\gamma \boldsymbol{\xi}_{i_k}^k=0. \end{array} \end{aligned}$

(5.11)

Since f has L _c-coordinate Lipschitz continuous gradients on coordinate i _k (see (A.4)) and x ^k+1 and y ^k only differ at the i _k-th entry, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle f({\mathbf{x}}^{k+1})\\ &\displaystyle &\displaystyle \quad \overset{a}\leq f({\mathbf{y}}^k) +\left\langle\nabla_{i_k} f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^{k}_{i_k}\right\rangle +\frac{L_c}{2}\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^{k}_{i_k} \right)^2 \\ &\displaystyle &\displaystyle \quad \overset{b}=f({\mathbf{y}}^k) -\gamma\left\langle\nabla_{i_k} f({\mathbf{y}}^k),\nabla_{i_k} f({\mathbf{y}}^k)+ \boldsymbol{\xi}^k_{i_k}\right\rangle+\frac{L_c}{2}\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^{k}_{i_k}\right )^2 \\ &\displaystyle &\displaystyle \quad \overset{c}=f({\mathbf{y}}^k) -\gamma\left\langle\nabla_{i_k} f({\mathbf{y}}^k)+ \boldsymbol{\xi}^k_{i_k},\nabla_{i_k} f({\mathbf{y}}^k)+ \boldsymbol{\xi}^k_{i_k}\right\rangle+\frac{L_c}{2}\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^{k}_{i_k} \right)^2 \\ &\displaystyle &\displaystyle \qquad + \gamma \left\langle \boldsymbol{\xi}^k_{i_k}, \nabla_{i_k} f({\mathbf{y}}^k)+ \boldsymbol{\xi}^k_{i_k} \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{d}=f({\mathbf{y}}^k)-\frac{\gamma}{2}\left(\frac{{\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^{k}_{i_k}}{\gamma}\right )^2-\left\langle \boldsymbol{\xi}_{i_k}^k, {\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k}\right\rangle, \end{array} \end{aligned}$

(5.12)

where $\overset {a}\leq$ uses Proposition A.7, in $\overset {b}=$ we use (5.11), in $\overset {c}=$ we insert $\gamma \big \langle \boldsymbol {\xi }^k_{i_k},\nabla _{i_k} f({\mathbf {y}}^k)+\boldsymbol {\xi }^k_{i_k}\big \rangle$ , and $\overset {d}=$ uses $\gamma = \frac {1}{L_c}$ .

Then we analyze $\left \|{\mathbf {z}}^{k+1}-{\mathbf {x}}^*\right \|{ }^2$ . We have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{n^2}{2\gamma}\left\|\theta_k {\mathbf{z}}^{k+1} -\theta_k {\mathbf{x}}^*\right \|{}^2\\ &\displaystyle &\displaystyle \quad = \frac{n^2}{2\gamma}\left\| \theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^*+ \theta_k {\mathbf{z}}^{k+1}-\theta_k{\mathbf{z}}^k \right \|{}^2\\ &\displaystyle &\displaystyle \quad =\frac{n^2}{2\gamma} \left\| \theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^*\right\|{}^2 +\frac{n^2}{2\gamma}\left( \theta_k {\mathbf{z}}^{k+1}_{i_k}-\theta_k{\mathbf{z}}^k_{i_k} \right)^2\\ &\displaystyle &\displaystyle \qquad +\frac{n^2}{\gamma}\left\langle \theta_k \left({\mathbf{z}}^{k+1}_{i_k}-{\mathbf{z}}^k_{i_k}\right) , \theta_k{\mathbf{z}}^{k}_{i_k}-\theta_k{\mathbf{x}}^*_{i_k} \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}=\frac{n^2}{2\gamma}\left\| \theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^*\right\|{}^2 +\frac{1}{2\gamma }\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k} \right)^2\\ &\displaystyle &\displaystyle \qquad -n\left\langle \nabla_{i_k} f({\mathbf{y}}^k)+\boldsymbol{\xi}^k_{i_k}, \theta_k{\mathbf{z}}_{i_k}^{k}-\theta_k{\mathbf{x}}_{i_k}^* \right\rangle, \end{array} \end{aligned}$

(5.13)

where $\overset {a}=$ uses (5.9) and (5.10).

Then by taking expectation on (5.13), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{n^2}{2\gamma}\mathbb{E}_{i_k}\left\|\theta_k {\mathbf{z}}^{k+1} -\theta_k {\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad =\frac{n^2}{2\gamma}\left\|\theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^* \right\|{}^2+\frac{1}{2\gamma n}\sum_{i_k=1}^n\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k}\right )^2-\left\langle \nabla f({\mathbf{y}}^k), \theta_k{\mathbf{z}}^{k}-\theta_k{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \qquad -\sum_{i_k=1}^n\left\langle \boldsymbol{\xi}^k_{i_k}, \theta_k{\mathbf{z}}_{i_k}^{k}-\theta_k{\mathbf{x}}_{i_k}^* \right\rangle. \end{array} \end{aligned}$

(5.14)

By Step 1, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} -\left\langle \nabla f({\mathbf{y}}^k), \theta_k{\mathbf{z}}^{k}-\theta_k{\mathbf{x}}^* \right\rangle &\displaystyle =&\displaystyle \left\langle \nabla f({\mathbf{y}}^k), (1-\theta_k){\mathbf{x}}^k+\theta_k{\mathbf{x}}^*-{\mathbf{y}}^k \right\rangle\\ &\displaystyle \overset{a}\leq&\displaystyle (1-\theta_k)f({\mathbf{x}}^k)+\theta_k f({\mathbf{x}}^*) -f({\mathbf{y}}^k), \end{array} \end{aligned}$

(5.15)

where $\overset {a}\leq$ uses the convexity of f. Taking expectation on (5.12) and adding (5.14) and (5.15), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{i_k} f({\mathbf{x}}^{k+1}) &\displaystyle \leq&\displaystyle (1-\theta_k)f({\mathbf{x}}^k)+\theta_k f({\mathbf{x}}^*) \\ &\displaystyle &\displaystyle -\sum_{i_k=1}^n\left\langle \boldsymbol{\xi}^k_{i_k}, \theta_k{\mathbf{z}}_{i_k}^{k}-\theta_k{\mathbf{x}}_{i_k}^*+\frac{1}{n} \left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k}\right) \right\rangle\\ &\displaystyle &\displaystyle +\frac{n^2}{2\gamma}\left\|\theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^*\right \|{}^2-\frac{n^2}{2\gamma}\mathbb{E}_{i_k}\left\|\theta_k {\mathbf{z}}^{k+1} -\theta_k {\mathbf{x}}^*\right \|{}^2\\ &\displaystyle \overset{a}=&\displaystyle (1-\theta_k)f({\mathbf{x}}^k)+\theta_k f({\mathbf{x}}^*) -\sum_{i_k=1}^n \left\langle \boldsymbol{\xi}^k_{i_k}, \theta_k{\mathbf{z}}_{i_k}^{k+1}-\theta_k{\mathbf{x}}_{i_k}^*\right\rangle\\ &\displaystyle &\displaystyle +\frac{n^2}{2\gamma}\left\|\theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^* \right\|{}^2-\frac{n^2}{2\gamma}\mathbb{E}_{i_k}\left\|\theta_k {\mathbf{z}}^{k+1} -\theta_k {\mathbf{x}}^*\right \|{}^2, \end{array} \end{aligned}$

where in $\overset {a}=$ we use (5.10).

Using the strong convexity of $h_{i_k}$ (μ ≥ 0), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \theta_k\left\langle \boldsymbol{\xi}^k_{i_k} ,{\mathbf{x}}^*_{i_k}-{\mathbf{z}}^{k+1}_{i_k}\right\rangle \leq \theta_k h_{i_k}({\mathbf{x}}^*) -\theta_k h_{i_k}({\mathbf{z}}^{k+1}_{i_k})-\frac{\mu\theta_k}{2}\left({\mathbf{z}}_{i_k}^{k+1}-{\mathbf{x}}_{i_k}^* \right)^2. \end{array} \end{aligned}$

(5.16)

On the other hand, by analyzing the expectation we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{i_k} \left\|{\mathbf{z}}^{k+1}-{\mathbf{x}}^* \right \|{}^2 &\displaystyle =&\displaystyle \frac{1}{n}\sum_{i_k=1}^n\left[ \left ({\mathbf{z}}^{k+1}_{i_k}-{\mathbf{x}}^*_{i_k} \right)^2 + \sum_{j\neq i_k} \left({\mathbf{z}}^{k+1}_{j}-{\mathbf{x}}^*_{j} \right)^2 \right] \\ &\displaystyle \overset{a}=&\displaystyle \frac{1}{n}\sum_{i_k=1}^n\left[ \left ({\mathbf{z}}^{k+1}_{i_k}-{\mathbf{x}}^*_{i_k} \right)^2 + \sum_{j\neq i_k} \left({\mathbf{z}}^{k}_{j}-{\mathbf{x}}^*_{j} \right)^2 \right] \\ &\displaystyle =&\displaystyle \frac{1}{n}\sum_{i_k=1}^n\left({\mathbf{z}}^{k+1}_{i_k}-{\mathbf{x}}^*_{i_k} \right)^2+\frac{n-1}{n} \left\|{\mathbf{z}}^{k}-{\mathbf{x}}^*\right \|{}^2, \end{array} \end{aligned}$

(5.17)

where in $\overset {a}=$ we use ${\mathbf {z}}^{k+1}_{j}={\mathbf {z}}^{k}_{j}$ , j ≠ i _k. Similar to (5.17), we also have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{i_k}\left \|{\mathbf{x}}^{k+1}-{\mathbf{y}}^k \right\|{}^2 &\displaystyle =&\displaystyle \frac{1}{n}\sum_{i_k=1}^n\left[ \left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k} \right)^2 + \sum_{j\neq i_k} \left({\mathbf{x}}^{k+1}_{j}-{\mathbf{y}}^k_{j} \right)^2 \right] \\ &\displaystyle \overset{a}=&\displaystyle \frac{1}{n}\sum_{i_k=1}^n\left({\mathbf{x}}^{k+1}_{i_k}-{\mathbf{y}}^k_{i_k}\right)^2, \end{array} \end{aligned}$

where in $\overset {a}=$ we use ${\mathbf {x}}^{k+1}_{j}={\mathbf {y}}^{k}_{j}$ , j ≠ i _k. Then we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_{i_k} f({\mathbf{x}}^{k+1})\\ &\displaystyle &\displaystyle \quad \overset{a}{\leq}(1-\theta_k)f({\mathbf{x}}^k)+\theta_k F({\mathbf{x}}^*)-\theta_k\sum_{i_k=1}^n h_{i_k}({\mathbf{z}}^{k+1}_{i_k}) -\sum_{i_k=1}^n\frac{\mu\theta_k}{2}\left({\mathbf{z}}_{i_k}^{k+1}-{\mathbf{x}}_{i_k}^* \right)^2\\ &\displaystyle &\displaystyle \qquad +\frac{n^2}{2\gamma}\left\|\theta_k {\mathbf{z}}^{k} -\theta_k {\mathbf{x}}^* \right\|{}^2-\frac{n^2}{2\gamma}\mathbb{E}_{i_k}\left\|\theta_k {\mathbf{z}}^{k+1} -\theta_k {\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{b}=(1-\theta_k)f({\mathbf{x}}^k)+\theta_k F({\mathbf{x}}^*)-\theta_k\sum_{i_k=1}^n h_{i_k}({\mathbf{z}}^{k+1}_{i_k}) \\ &\displaystyle &\displaystyle \qquad +\frac{n^2\theta_{k}^2+(n-1)\theta_k\mu\gamma}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^* \right\|{}^2-\frac{n^2\theta_{k}^2+n\theta_k\mu\gamma}{2\gamma}\mathbb{E}_{i_k}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{c}=(1-\theta_k)f({\mathbf{x}}^k)+\theta_k F({\mathbf{x}}^*)+(1-\theta_k)\hat{h}_k-\mathbb{E}_{i_k}\hat{h}_{k+1} \\ &\displaystyle &\displaystyle \qquad +\frac{n^2\theta_{k}^2+(n-1)\theta_k\mu\gamma}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^* \right\|{}^2-\frac{n^2\theta_{k}^2+n\theta_k\mu\gamma}{2\gamma}\mathbb{E}_{i_k}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2, \end{array} \end{aligned}$

(5.18)

where $\overset {a}{\leq }$ uses (5.16), $\overset {b}=$ uses (5.17), and $\overset {c}=$ uses Lemma 5.1.

For the generally convex case (μ = 0), rearranging (5.18) and dividing both sides with $\theta _k^2$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \frac{\mathbb{E}_{i_k} f({\mathbf{x}}^{k+1})+\mathbb{E}_{i_k} \hat{h}_{k+1} -F({\mathbf{x}}^*)}{\theta_k^2} +\frac{n^2}{2\gamma}\mathbb{E}_{i_k}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq\frac{1-\theta_k}{\theta_k^2} \left(f({\mathbf{x}}^k)+ \hat{h}_{k} -F({\mathbf{x}}^*) \right) +\frac{n^2}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}\leq\frac{1}{\theta_{k-1}^2} \left(f({\mathbf{x}}^k)+ \hat{h}_{k} -F({\mathbf{x}}^*) \right) +\frac{n^2}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^*\right \|{}^2, \end{array} \end{aligned}$

where in $\overset {a}\leq$ we use $\frac {1-\theta _k}{\theta _k^2}\leq \frac {1}{\theta _{k-1}^2}$ when k ≥−1. Taking full expectation, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \frac{\mathbb{E} f({\mathbf{x}}^{k+1})+\mathbb{E} \hat{h}_{k+1} -F({\mathbf{x}}^*)}{\theta_k^2} +\frac{n^2}{2\gamma}\mathbb{E}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq \frac{f({\mathbf{x}}^0)+\hat{h}_0 - F({\mathbf{x}}^*)}{\theta_{-1}^2} +\frac{n^2}{2\gamma}\left\|{\mathbf{z}}^{0} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}=\frac{F({\mathbf{x}}^0)-F({\mathbf{x}}^*)}{\theta_{-1}^2} +\frac{n^2}{2\gamma}\|{\mathbf{z}}^0 -{\mathbf{x}}^* \|{}^2, \end{array} \end{aligned}$

where in $\overset {a}=$ we use $\hat {h}_0=h({\mathbf {x}}^0)$ . Then using the convexity of h(x) and that x ^k+1 is a convex combination of z ⁰, ⋯ , z ^k+1, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} h({\mathbf{x}}^{k+1}) \leq \sum_{i=0}^{k+1} e_{k+1,i}h({\mathbf{z}}^i) \overset{a}{=} \hat{h}_{k+1}, \end{array} \end{aligned}$

(5.19)

where $\overset {a}{=}$ uses Lemma 5.1. So we obtain (5.7).

For the strongly convex case (μ > 0), (5.18) gives

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}_{i_k} f({\mathbf{x}}^{k+1})+\mathbb{E} \hat{h}_{k+1}-F({\mathbf{x}}^*)+\frac{n^2\theta^2+n\theta\mu\gamma}{2\gamma}\mathbb{E}_{i_k}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq(1-\theta)\left(f({\mathbf{x}}^k)+\mathbb{E} \hat{h}_{k}-F({\mathbf{x}}^*)\right) +\frac{n^2\theta^2+(n-1)\theta\mu\gamma}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^* \right\|{}^2. \end{array} \end{aligned}$

For the setting of θ, we have

$\displaystyle \begin{aligned}\theta^2+(n-1)\theta \mu \gamma = (1-\theta)(n^2\theta^2+n\theta \mu\gamma).\end{aligned}$

Thus

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}_{i_k} f({\mathbf{x}}^{k+1})+\mathbb{E} \hat{h}_{k+1}-F({\mathbf{x}}^*)+\frac{n^2\theta^2+n\theta\mu\gamma}{2\gamma}\mathbb{E}_{i_k}\left\|{\mathbf{z}}^{k+1} -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq(1-\theta)\left(f({\mathbf{x}}^k)+\mathbb{E} \hat{h}_{k}-F({\mathbf{x}}^*) +\frac{n^2\theta^2+n\theta\mu\gamma}{2\gamma}\left\|{\mathbf{z}}^{k} - {\mathbf{x}}^* \right\|{}^2\right). \end{array} \end{aligned}$

Taking full expectation, expanding the result to k = 0, then using $h({\mathbf {x}}^{k+1})\leq \hat {h}_{k+1}$ (see (5.19)), $h({\mathbf {x}}^{0})= \hat {h}_{0}$ , and ∥z ^k+1 −x ^∗∥² ≥ 0, we obtain (5.8). □

An important application of ASCD is to solve ERM problems in the following form:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x} \in \mathbb{R}^d} P(\mathbf{x}) \equiv \frac{1}{n} \sum_{i = 1}^n \phi_i({\mathbf{A}}^T_i \mathbf{x}) +\frac{\lambda}{2}\| \mathbf{x}\|{}^2, \end{array} \end{aligned}$

(5.20)

where λ > 0 and $\phi _i({\mathbf {A}}^T_i \mathbf {x})$ , i = 1, ⋯ , n, are loss functions over training samples. Lots of machine learning problems can be formulated as (5.20), such as linear SVM, ridge regression, and logistic regression. For ASCD , we can solve (5.20) via its dual problem :

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{a} \in \mathbb{R}^n} D(\mathbf{a}) = \frac{1}{n} \sum_{i = 1}^n \phi^*_i(-{\mathbf{a}}_i) + \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\mathbf{A}\mathbf{a}\right\|{}^2, \end{array} \end{aligned}$

(5.21)

where $\phi ^*_i(\cdot )$ is the conjugate function (Definition A.21) of ϕ _i(⋅) and A = [A ₁, ⋯ , A _n]. Then (5.21) can be solved by Algorithm 5.1.

5.1.2 Background for Variance Reduction Methods

For stochastic gradient descent , due to the nonzero variance of the gradient, using a constant step size cannot guarantee the convergence of algorithms. Instead, it only guarantees a sublinear convergence rate on the strongly convex and L-smooth objective functions. For finite-sum objective functions, the way to solve the problem is called variance reduction (VR) [10] which reduces the variance to zero through the updates. The convergence rate can be accelerated to be linear for strongly convex and L-smooth objective functions. The first VR method might be SAG [17], which uses the sum of the latest individual gradients as an estimator. It requires O(nd) memory storage and uses a biased gradient estimator. SDCA [18] also achieves a linear convergence rate. In the primal space, the algorithm is known as MISO [13], which is a majorization-minimization VR algorithm. SVRG [10] is a follow-up work of SAG [17] which reduces the memory costs to O(d) and uses an unbiased gradient estimator. The main technique of SVRG [10] is frequently pre-storing a snapshot vector and bounding the variance by the distance from the snapshot vector and the latest variable. Later, SAGA [6] improves SAG by using an unbiased updates via the technique of SVRG [10].

Algorithm 5.2 Stochastic variance reduced gradient (SVRG) [10]

../images/487003_1_En_5_Chapter/487003_1_En_5_Figb_HTML.png

We introduce SVRG [10] as an example. We solve the following problem:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x}\in \mathbb{R}^d} f(\mathbf{x}) \equiv \frac{1}{n}\sum_{i=1}^nf_i(\mathbf{x}). \end{array} \end{aligned}$

(5.22)

The proximal version can be obtained straightforwardly in Sect. 5.1.3. The algorithm of SVRG is Algorithm 5.2. We have the following theorem:

Theorem 5.2

For Algorithm 5.2, if each f _i(x) is μ-strongly convex and L-smooth and $\eta < \frac {1}{2L}$ , then we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \begin{aligned} &\displaystyle \mathbb{E} f(\tilde{\mathbf{x}}^s) -f({\mathbf{x}}^*) \\ &\displaystyle \quad \leq \left( \frac{1}{\mu \eta (1-2L\eta)m} + \frac{2L\eta}{1-2L\eta} \right) \left(\mathbb{E} f(\tilde{\mathbf{x}}^{s-1}) -f({\mathbf{x}}^*) \right),~s\geq1, \end{aligned} \end{array} \end{aligned}$

(5.23)

where ${\mathbf {x}}^* = \operatorname *{\mathrm {argmin}}_{\mathbf {x}\in \mathbb {R}^d} f(\mathbf {x})$ . In other words, by setting $m = O(\frac {L}{\mu })$ and $\eta = O(\frac {1}{L})$ , the IFO calls (Definition 5.1) to achieve an 𝜖-accuracy solution is $O\left (\left (n+\frac {L}{\mu }\right )\log (1/\epsilon )\right )$ . If each f _i(x) is L-smooth (but might be nonconvex) and f(x) is μ-strongly convex, by setting $\eta \leq \frac {\mu }{8(1+e)L^2}$ and m = −ln ⁻¹(1 − ημ∕2) ∼ O(η ⁻¹μ ⁻¹), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2 \leq (1 - \eta\mu/2)^{sm+k} \|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2. \end{array} \end{aligned}$

In other words, the IFO calls to achieve an 𝜖- accuracy solution is $O\Big (\Big (n+\frac {L^2}{\mu ^2}\Big )\log (1/\epsilon )\Big )$ .

The proof is mainly taken from [10] and [4].

Proof

Let $\mathbb {E}_k$ denote the expectation taken only on the random number i _k,s conditioned on ${\mathbf {x}}^s_k$ . Then we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_k \left( \tilde{\nabla} f_{i_{k,s}}({\mathbf{x}}^s_k) \right) &\displaystyle =&\displaystyle \mathbb{E}_k \nabla f_{i_{k,s}} ({\mathbf{x}}^s_{k}) - \mathbb{E}_k \left( \nabla f_{i_{k,s}} (\tilde{\mathbf{x}}^s)-\frac{1}{n}\sum_{i=1}^n \nabla f_{i}(\tilde{\mathbf{x}}^s)\right) \\ &\displaystyle =&\displaystyle \nabla f ({\mathbf{x}}^s_{k}). \end{array} \end{aligned}$

(5.24)

So $\tilde {\nabla } f_{i_{k,s}}({\mathbf {x}}^s_k)$ is an unbiased estimator of $\nabla f({\mathbf {x}}^s_k)$ . Then

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k\| \tilde{\nabla} f({\mathbf{x}}^s_k) \|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}\leq2\mathbb{E}_k\|\nabla f_{i_{k,s}}({\mathbf{x}}^s_k) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) \|{}^2 + 2 \mathbb{E}_k\| \nabla f_{i_{k,s}}(\tilde{\mathbf{x}}^s) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) - \nabla f(\tilde{\mathbf{x}}^s)\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{b}\leq2\mathbb{E}_k\| \nabla f_{i_{k,s}}({\mathbf{x}}^s_k) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) \|{}^2+ 2\mathbb{E}_k\|\nabla f_{i_{k,s}}(\tilde{\mathbf{x}}^s) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) \|{}^2, \end{array} \end{aligned}$

(5.25)

where in $\overset {a}\leq$ we use ∥a −b∥² ≤ 2∥a∥² + 2∥b∥² and in $\overset {b}\leq$ we use Proposition A.2.

We first consider the case when each f _i(x) is strongly convex. Since each f _i(x) is convex and L-smooth, from (A.7) we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \left\| \nabla f_i(\mathbf{x}) -\nabla f_i(\mathbf{y}) \right \|{}^2 \leq 2 L \left( f_i(\mathbf{x}) -f_i(\mathbf{y})+\langle\nabla f_i(\mathbf{y}), \mathbf{y}-\mathbf{x} \rangle \right). \end{array} \end{aligned}$

(5.26)

Letting $\mathbf {x} = {\mathbf {x}}^s_k$ and y = x ^∗ in (5.26) and summing the result with i = 1 to n, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_k\left\| \nabla f_{i_{k,s}}({\mathbf{x}}^s_k) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) \right\|{}^2 \leq 2L \left( f({\mathbf{x}}^s_k) - f({\mathbf{x}}^*)\right), \end{array} \end{aligned}$

(5.27)

where we use ∇f(x ^∗) = 0. In the same way, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_k\left\| \nabla f_{i_{k,s}}(\tilde{\mathbf{x}}^s) - \nabla f_{i_{k,s}}({\mathbf{x}}^*) \right\|{}^2 \leq 2L \left( f(\tilde{\mathbf{x}}^s) - f({\mathbf{x}}^*)\right). \end{array} \end{aligned}$

(5.28)

Plugging (5.27) and (5.28) into (5.25), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_k\left\| \tilde{\nabla} f({\mathbf{x}}^s_k) \right\|{}^2\leq 4L \left[ (f({\mathbf{x}}^s_k) - f({\mathbf{x}}^*)) + (f(\tilde{\mathbf{x}}^s) - f({\mathbf{x}}^*))\right]. \end{array} \end{aligned}$

(5.29)

On the other hand,

$\displaystyle \begin{aligned} &\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \|{}^2 \\ &\quad = \|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2+ 2\mathbb{E}_k\left\langle {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_{k} , {\mathbf{x}}^s_{k} - {\mathbf{x}}^* \right\rangle +\mathbb{E}_k \|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_{k}\|{}^2\\ &\quad =\|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2 - 2\eta\mathbb{E}_k\left\langle \tilde{\nabla} f({\mathbf{x}}_k^s), {\mathbf{x}}_k^s - {\mathbf{x}}^*\right\rangle + \eta^2\mathbb{E}_k\| \tilde{\nabla} f({\mathbf{x}}^s_k) \|{}^2\\ &\quad \overset{a} \leq \|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2 - 2\eta \left\langle \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}_k^s - {\mathbf{x}}^* \right\rangle \\ &\qquad + 4L\eta^2\left[ (f({\mathbf{x}}_k^s) - f({\mathbf{x}}^*))+(f(\tilde{\mathbf{x}}^s) - f({\mathbf{x}}^*))\right]\\ &\quad \leq\|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2 - 2 \eta (1 - 2 L \eta)[ f({\mathbf{x}}^{s}_k) - f({\mathbf{x}}^*)] + 4L\eta^2[f(\tilde{\mathbf{x}}^s) - f({\mathbf{x}}^*)], \end{aligned}$

(5.30)

where $\overset {a} \leq$ uses (5.24) and (5.29).

Suppose we choose Option I. By taking full expectation on (5.30) and telescoping the result with k = 1 to m, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E} \|{\mathbf{x}}^s_m - {\mathbf{x}}^* \|{}^2 + 2\eta (1- 2L\eta)m\mathbb{E} (f(\tilde{\mathbf{x}}^{s+1}) - f({\mathbf{x}}^*))\\ &\displaystyle &\displaystyle \quad \overset{a}{\leq} \mathbb{E} \|{\mathbf{x}}^s_m - {\mathbf{x}}^* \|{}^2 + 2\eta (1- 2L\eta)\sum_{k=0}^{m-1}\mathbb{E} (f({\mathbf{x}}^{s}_k) - f({\mathbf{x}}^*))\\ &\displaystyle &\displaystyle \quad \overset{b}{\leq} \mathbb{E} \|{\mathbf{x}}^s_0 - {\mathbf{x}}^* \|{}^2+ 4L m \eta^2 \mathbb{E}\left(f(\tilde{\mathbf{x}}^{s}) - f({\mathbf{x}}^*)\right)\\ &\displaystyle &\displaystyle \quad \leq 2 \left( \mu^{-1}+ 2Lm\eta^2 \right) \mathbb{E}\left(f(\tilde{\mathbf{x}}^{s} - f({\mathbf{x}}^*)\right), \end{array} \end{aligned}$

where $\overset {a}{\leq }$ uses Option I of Algorithm 5.2 and $\overset {b}{\leq }$ is by telescoping (5.30). Using $\|{\mathbf {x}}^s_m - {\mathbf {x}}^* \|{ }^2\geq 0$ , we can obtain (5.23).

Then we consider the case when f(x) is strongly convex. From f(⋅) being L-smooth, by (A.5) we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{x}}^s_{k+1}) \leq f({\mathbf{x}}^s_k)+ \left\langle \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k \right\rangle + \frac{L}{2}\|{\mathbf{x}}^s_{k+1}-{\mathbf{x}}^s_k \|{}^2, \end{array} \end{aligned}$

(5.31)

and from f(⋅) being μ-strongly convex, by (A.9) we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{x}}^s_k) \leq f({\mathbf{x}}^*) + \left\langle \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_k - {\mathbf{x}}^* \right\rangle - \frac{\mu}{2}\|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2. \end{array} \end{aligned}$

(5.32)

Adding (5.31) and (5.32), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle f({\mathbf{x}}^s_{k+1})\\ &\displaystyle &\displaystyle \quad \leq f({\mathbf{x}}^*) + \left\langle \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle+ \frac{L}{2}\|{\mathbf{x}}^s_{k+1}-{\mathbf{x}}^s_k \|{}^2 - \frac{\mu}{2}\|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2\\ &\displaystyle &\displaystyle \quad = f({\mathbf{x}}^*)- \frac{1}{\eta}\left\langle {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k , {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle+ \frac{L}{2}\|{\mathbf{x}}^s_{k+1}-{\mathbf{x}}^s_k \|{}^2\\ &\displaystyle &\displaystyle \qquad - \frac{\mu}{2}\|{\mathbf{x}}^s_{k}-{\mathbf{x}}^*\|{}^2 -\left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k}) - \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \quad =f({\mathbf{x}}^*) - \frac{1}{2\eta}\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \|{}^2 + \frac{1}{2\eta}\|{\mathbf{x}}^s_k - {\mathbf{x}}^* \|{}^2 - \left( \frac{1}{2\eta} - \frac{L}{2} \right)\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\|{}^2 \\ &\displaystyle &\displaystyle \qquad - \frac{\mu}{2}\|{\mathbf{x}}^s_k - {\mathbf{x}}^* \|{}^2 - \left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k}) - \nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle. \end{array} \end{aligned}$

Then by rearranging terms and using $f({\mathbf {x}}^s_{k+1}) - f({\mathbf {x}}^*)\geq 0$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{1}{2}\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \|{}^2\\ &\displaystyle &\displaystyle \quad \leq \frac{1- \eta\mu}{2}\|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2 - \eta\left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k})-\nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle \\ &\displaystyle &\displaystyle \qquad - \left( \frac{1}{2} - \frac{L \eta}{2} \right)\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_{k} \|{}^2. \end{array} \end{aligned}$

(5.33)

Considering the expectation taken on the random number of i _k,s, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle -\eta \mathbb{E}_k \left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k})-\nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}{=} -\eta \mathbb{E}_k \left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k})-\nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1}\right\rangle\\ &\displaystyle &\displaystyle \quad \overset{b}{=} -\eta \mathbb{E}_k \left\langle\tilde{\nabla} f({\mathbf{x}}^s_{k})-\nabla f({\mathbf{x}}^s_k), {\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\right\rangle\\ &\displaystyle &\displaystyle \quad \leq \eta^2\mathbb{E}_k\|\tilde{\nabla} f({\mathbf{x}}^s_{k})-\nabla f({\mathbf{x}}^s_k) \|{}^2 + \frac{1}{4}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\|{}^2\\ &\displaystyle &\displaystyle \quad \leq \eta^2\mathbb{E}_k\left\|\nabla f_{i_{k,s}}({\mathbf{x}}^s_k) - \nabla f_{i_{k,s}}(\tilde{\mathbf{x}}^s) - (\nabla f({\mathbf{x}}^s_k) {-} \nabla f(\tilde{\mathbf{x}}^s) ) \right\|{}^2 + \frac{1}{4}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} {-} {\mathbf{x}}^s_k\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{c}\leq \eta^2\mathbb{E}_k\left\|\nabla f_{i_{k,s}}({\mathbf{x}}^s_k) - \nabla f_{i_{k,s}}(\tilde{\mathbf{x}}^s) \right\|{}^2 + \frac{1}{4}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\|{}^2\\ &\displaystyle &\displaystyle \quad \leq\eta^2 L^2 \|{\mathbf{x}}^s_k - \tilde{\mathbf{x}}^s \|{}^2 + \frac{1}{4}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\|{}^2\\ &\displaystyle &\displaystyle \quad \leq2\eta^2 L^2 \|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2 + 2\eta^2L^2\| \tilde{\mathbf{x}}^s - {\mathbf{x}}^*\|{}^2 + \frac{1}{4}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^s_k\|{}^2, \end{array} \end{aligned}$

(5.34)

where both $\overset {a}{=}$ and $\overset {b}{=}$ use (5.24) and $\overset {c}\leq$ uses Proposition A.2.

From the setting of η, we have $L\eta \leq \frac {1}{2}$ . Substituting (5.34) into (5.33) after taking expectation on i _k, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{1}{2}\mathbb{E}_k\|{\mathbf{x}}^s_{k+1} - {\mathbf{x}}^* \|{}^2 \\ &\displaystyle &\displaystyle \quad \leq \frac{1- \eta\mu}{2}\|{\mathbf{x}}^s_{k} - {\mathbf{x}}^* \|{}^2 + 2\eta^2 L^2 \|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2 + 2\eta^2L^2\| \tilde{\mathbf{x}}^s - {\mathbf{x}}^*\|{}^2. \end{array} \end{aligned}$

(5.35)

Now we use induction to prove

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\|{\mathbf{x}}^s_k - {\mathbf{x}}^* \|{}^2 \leq (1 - \eta\mu/2)^{sm+k}\|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2. \end{array} \end{aligned}$

When k = 0, it is right. Suppose that at iteration k ≥ 0, we have $\mathbb {E}\|{\mathbf {x}}^s_k - {\mathbf {x}}^* \|{ }^2 \leq (1 - \eta \mu /2)^{sm+k}\|{\mathbf {x}}^0_0 - {\mathbf {x}}^*\|{ }^2$ . Now we consider k + 1. We consider Option II in Algorithm 5.2 and have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\| \tilde{\mathbf{x}}^s - {\mathbf{x}}^*\|{}^2 =\mathbb{E}\|{\mathbf{x}}_0^s - {\mathbf{x}}^*\|{}^2 &\displaystyle \leq&\displaystyle (1 - \eta\mu/2)^{sm}\|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2\\ &\displaystyle \overset{a }{\leq} &\displaystyle e(1 - \eta\mu/2)^{sm+k}\|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2, \end{array} \end{aligned}$

where $\overset {a }{\leq }$ uses k ≤ m = −ln⁻¹(1 − ημ∕2). Then we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle 4\eta^2 L^2 \mathbb{E} \|{\mathbf{x}}^s_k - {\mathbf{x}}^*\|{}^2 + 4\eta^2L^2\mathbb{E}\| \tilde{\mathbf{x}}^s - {\mathbf{x}}^*\|{}^2\\ &\displaystyle &\displaystyle \quad \leq 4\eta^2 L^2 (1+ e) (1 - \eta\mu/2)^{sm+k}\|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}{\leq} \eta\mu/2 (1 - \eta\mu/2)^{sm+k}\|{\mathbf{x}}^0_0 - {\mathbf{x}}^*\|{}^2, \end{array} \end{aligned}$

(5.36)

where $\overset {a}{\leq }$ uses $\eta \leq \frac {\mu }{8(1+e)L^2}$ .

Taking full expectation on (5.35) and substituting (5.36) into it, we can obtain that $\mathbb {E}\|{\mathbf {x}}^s_{k+1} - {\mathbf {x}}^* \|{ }^2 \leq (1 - \eta \mu /2)^{sm+k+1}\|{\mathbf {x}}^0_0 - {\mathbf {x}}^*\|{ }^2$ . □

5.1.3 Accelerated Stochastic Variance Reduction Method

With the VR technique, we can fuse the momentum technique to achieve a faster rate. We show that the convergence rate can be improved to $O\left ((n+\sqrt {n\kappa })\log (1/\epsilon )\right )$ in the IC case, where $\kappa = \frac {L}{\mu }$ . We introduce Katyusha [1], which is the first truly accelerated stochastic algorithm. The main technique in Katyusha [1] is introducing a “negative momentum ” which restricts the extrapolation term to be not far from $\tilde {\mathbf {x}}$ , the snapshot vector introduced by SVRG [10]. The algorithm is shown in Algorithm 5.3.

Algorithm 5.3 Katyusha [1]

../images/487003_1_En_5_Chapter/487003_1_En_5_Figc_HTML.png

The proof is taken from [1]. We have the following lemma to bound the variance:

Lemma 5.2

For $f(\mathbf {x})=\frac {1}{n}\sum _{i=1}^n f_i(\mathbf {x})$ , with each f _i with i ∈ [n] being convex and L-smooth. For any u and $\tilde {\mathbf {x}}$ , defining

$\displaystyle \begin{aligned} \begin{array}{rcl} \tilde{\nabla} f(\mathbf{u}) = \nabla f_{k}(\mathbf{u}) -\nabla f_k(\tilde{\mathbf{x}})+ \frac{1}{n}\sum_{i=1}^n \nabla f_i(\tilde{\mathbf{x}}), \end{array} \end{aligned}$

we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}\left\|\tilde{\nabla} f(\mathbf{u}) - \nabla f(\mathbf{u})\right\|{}^2 \leq 2L\left(f(\tilde{\mathbf{x}})-f(\mathbf{u})+\langle \nabla f(\mathbf{u}), \mathbf{u}-\tilde{\mathbf{x}} \rangle\right), \end{array} \end{aligned}$

(5.37)

where the expectation is taken on the random number k under the condition thatuand $\tilde {\mathbf {x}}$ are known.

Proof

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}\left\|\tilde{\nabla} f(\mathbf{u}) - \nabla f(\mathbf{u})\right\|{}^2\\ &\displaystyle &\displaystyle \quad = \mathbb{E} \left(\left\|\nabla f_k(\mathbf{u}) -\nabla f_k(\tilde{\mathbf{x}}) - \left( \nabla f(\mathbf{u}) - \nabla f(\tilde{\mathbf{x}})\right)\right\|{}^2 \right)\\ &\displaystyle &\displaystyle \quad \overset{a}\leq \mathbb{E} \left\|\nabla f_k(\mathbf{u}) -\nabla f_k(\tilde{\mathbf{x}})\right\|{}^2, \end{array} \end{aligned}$

where in inequality $\overset {a}\leq$ we use

$\displaystyle \begin{aligned}\mathbb{E}\left(\nabla f_k(\mathbf{u}) -\nabla f_k(\tilde{\mathbf{x}}) \right) = \nabla f(\mathbf{u}) - \nabla f(\tilde{\mathbf{x}})\end{aligned}$

and Proposition A.2. Then by directly applying (A.7) we obtain (5.37). □

Theorem 5.3

Suppose that h(x) is μ-strongly convex and $n\leq \frac {L}{4\mu }$ . For Algorithm 5.3, if the step size $\gamma = \frac {1}{3L}$ , $\theta _1= \sqrt {\frac {n\mu }{L}}$ , $\theta _3 = 1+\frac {\mu \gamma }{\theta _1}$ , and m = n, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} F(\tilde{\mathbf{x}}^{S+1})-F({\mathbf{x}}^*) \leq \theta_3^{-Sn}\left[\frac{1}{4n\gamma}\|{\mathbf{z}}^0_0-{\mathbf{x}}^* \|{}^2 + \left(1+\frac{1}{n}\right) \left(F({\mathbf{x}}^0_0)-F({\mathbf{x}}^*) \right) \right]. \end{array} \end{aligned}$

Proof

Because $n\leq \frac {L}{4\mu }$ and $\theta _1= \sqrt {\frac {n\mu }{L}}$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \theta_1&\displaystyle \leq&\displaystyle \frac{1}{2},{}\\ 1-\theta_1 -\theta_2&\displaystyle \geq&\displaystyle 0. \end{array} \end{aligned}$

(5.38)

For Step 1,

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{y}}^s_k = \theta_1{\mathbf{z}}^s_k +\theta_2 \tilde{\mathbf{x}}^s +(1 - \theta_1 -\theta_2 ) {\mathbf{x}}^s_k. \end{array} \end{aligned}$

(5.39)

Together with Step 6, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{x}}^s_{k+1} = {\mathbf{y}}^s_k + \theta_1 ({\mathbf{z}}^s_{k+1}-{\mathbf{z}}^s_{k}). \end{array} \end{aligned}$

(5.40)

Through the optimality of ${\mathbf {z}}^s_{k+1}$ in Step 4 of Algorithm 5.3, there exists $\boldsymbol {\xi }^{s}_{k+1}\in \partial h({\mathbf {z}}^s_{k+1})$ satisfying

$\displaystyle \begin{aligned} \begin{array}{rcl} \theta_1({\mathbf{z}}^s_{k+1}-{\mathbf{z}}^s_{k}) +\gamma \tilde{\nabla}^s_k +\gamma \boldsymbol{\xi}^s_{k+1} = \mathbf{0}. \end{array} \end{aligned}$

From (5.39) and (5.40), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_{k} +\gamma \tilde{\nabla}^s_k +\gamma \boldsymbol{\xi}^s_{k+1} = \mathbf{0}. \end{array} \end{aligned}$

(5.41)

For f(⋅) is L-smooth, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{x}}^s_{k+1}) &\displaystyle \leq&\displaystyle f({\mathbf{y}}^s_k) + \langle \nabla f({\mathbf{y}}^s_k), {\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k\rangle + \frac{L}{2}\left\|{\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k\right\|{}^2 \\ &\displaystyle =&\displaystyle f({\mathbf{y}}^s_k) - \gamma \langle \nabla f({\mathbf{y}}^s_k), \tilde{\nabla}^s_k +\boldsymbol{\xi}^{s}_{k+1} \rangle + \frac{L}{2}\left\|{\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k \right\|{}^2\\ &\displaystyle \overset{a}=&\displaystyle f({\mathbf{y}}^s_k) - \gamma \langle \tilde{\nabla}^s_k + \boldsymbol{\xi}^{s}_{k+1}, \tilde{\nabla}^s_k +\boldsymbol{\xi}^{s}_{k+1} \rangle\\ &\displaystyle &\displaystyle + \frac{L}{2}\left\|{\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k \right\|{}^2 + \gamma \langle \tilde{\nabla}^s_k + \boldsymbol{\xi}^{s}_{k+1}-\nabla f({\mathbf{y}}^s_k), \tilde{\nabla}^s_k+\boldsymbol{\xi}^{s}_{k+1} \rangle \\ &\displaystyle \overset{b}=&\displaystyle f({\mathbf{y}}^s_k)-\gamma\left( 1-\frac{\gamma L}{2}\right)\left\|\frac{1}{\gamma}\left({\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k\right) \right\|{}^2{-} \langle \tilde{\nabla}^s_k -\nabla f({\mathbf{y}}^s_k), {\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k \rangle \\ &\displaystyle &\displaystyle -\langle {\boldsymbol{\xi}}^s_{k+1}, {\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k\rangle, \end{array} \end{aligned}$

(5.42)

where in $\overset {a}=$ we add and subtract the term $\gamma \langle \tilde {\nabla }^s_k+\boldsymbol {\xi }^s_{k+1}, \tilde {\nabla }^s_k+\boldsymbol {\xi }^s_{k+1} \rangle$ and $\overset {b}=$ uses the equality (5.41).

For the last but one term of (5.42), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k \langle \tilde{\nabla}^s_k-\nabla f({\mathbf{y}}^s_{k}) ,{\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1} \rangle\\ &\displaystyle &\displaystyle \quad \overset{a}\leq \frac{\gamma}{2C_3}\mathbb{E}_k\left\| \tilde{\nabla}^s_k-\nabla f({\mathbf{y}}^s_{k}) \right\|{}^2 +\frac{\gamma C_3}{2}\mathbb{E}_k\left\|\frac{1}{\gamma}\left({\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_{k}\right)\right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{b}\leq \frac{\gamma L }{C_3} \left( f(\tilde{\mathbf{x}}^s)-f({\mathbf{y}}^s_{k})+\langle\nabla f({\mathbf{y}}^s_{k}), {\mathbf{y}}^s_{k} -\tilde{\mathbf{x}}^s \rangle \right) +\frac{\gamma C_3}{2}\mathbb{E}_k\left\|\frac{1}{\gamma}\left({\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_{k}\right)\right\|{}^2,\\ \end{array} \end{aligned}$

(5.43)

where we use $\mathbb {E}_k$ to denote that expectation is taken on the random number $i^s_{k}$ (step k and epoch s) under the condition that ${\mathbf {y}}^s_k$ is known, in $\overset {a}\leq$ we use the Cauchy–Schwartz inequality, and $\overset {b}\leq$ uses (5.37). C ₃ is an absolute constant determined later.

Taking expectation for (5.42) on the random number $i^s_{k}$ and adding (5.43), we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k f({\mathbf{x}}^s_{k+1})\\ &\displaystyle &\displaystyle \quad \leq f({\mathbf{y}}^s_k) - \gamma\left(1-\frac{\gamma L}{2}- \frac{C_3}{2} \right)\mathbb{E}_k\left\|\frac{1}{\gamma}\left({\mathbf{x}}^s_{k+1} - {\mathbf{y}}^s_{k}\right)\right\|{}^2 \\ &\displaystyle &\displaystyle \qquad -\mathbb{E}_k\langle \boldsymbol{\xi}^s_{k+1}, {\mathbf{x}}^s_{k+1}-{\mathbf{y}}^s_k\rangle+ \frac{\gamma L }{C_3} \left( f(\tilde{\mathbf{x}}^s)-f({\mathbf{y}}^s_{k})+\langle\nabla f({\mathbf{y}}^s_{k}), {\mathbf{y}}^s_{k} -\tilde{\mathbf{x}}^s \rangle \right). \end{array} \end{aligned}$

(5.44)

On the other hand, we analyze $\|{\mathbf {z}}^s_k - {\mathbf {x}}^*\|{ }^2$ . Setting a = 1 − θ ₁ − θ ₂, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \left\|\theta_1{\mathbf{z}}^s_{k+1}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad =\left\|{\mathbf{x}}^s_{k+1}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad =\left\|{\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* - ({\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1})\right\|{}^2\\ &\displaystyle &\displaystyle \quad =\left\|{\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\|{}^2 + \left\|{\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1}\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -2\gamma\langle \boldsymbol{\xi}^s_{k+1} + \tilde{\nabla}_k^s, {\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \rangle\\ &\displaystyle &\displaystyle \quad \overset{a}=\left\| \theta_1{\mathbf{z}}^s_{k}-\theta_1{\mathbf{x}}^* \right \|{}^2 +\left \|{\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1}\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -2\gamma\left\langle \boldsymbol{\xi}^s_{k+1} + \tilde{\nabla}_k^s, {\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\rangle, \end{array} \end{aligned}$

(5.45)

where $\overset {a}=$ uses Step 1 of Algorithm 5.3.

For the last term of (5.45), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k\left\langle \tilde{\nabla}^s_k , a{\mathbf{x}}^s_{k}+\theta_{2}\tilde{\mathbf{x}}^s+\theta_1{\mathbf{x}}^*-{\mathbf{y}}^s_k \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}=\left\langle \nabla f({\mathbf{y}}_{k}^s), a{\mathbf{x}}^s_{k}+\theta_1{\mathbf{x}}^* - (1-\theta_2){\mathbf{y}}^s_{k}\right\rangle + \theta_2\left\langle \nabla f({\mathbf{y}}^s_k), \tilde{\mathbf{x}}^s - {\mathbf{y}}^s_k\right\rangle\\ &\displaystyle &\displaystyle \quad \overset{b}\leq af({\mathbf{x}}^s_k)+\theta_1f({\mathbf{x}}^*) - (1-\theta_2) f({\mathbf{y}}^s_{k})+ \theta_2\left\langle \nabla f({\mathbf{y}}^s_k), \tilde{\mathbf{x}}^s - {\mathbf{y}}^s_k\right\rangle, \end{array} \end{aligned}$

(5.46)

where in $\overset {a}\leq$ we use $\mathbb {E}_k (\tilde {\nabla }^s_k) = \nabla f({\mathbf {y}}_{k}^s)$ and that $a{\mathbf {x}}^s_{k}+\theta _{2}\tilde {\mathbf {x}}^s+\theta _1{\mathbf {x}}^*-{\mathbf {y}}^s_k$ is constant, because the expectation is taken only on the random number i _k,s, and in inequality $\overset {b}\leq$ we use the convexity of f(⋅) and so for any vector u,

$\displaystyle \begin{aligned}\langle\nabla f({\mathbf{y}}_{k}^s), \mathbf{u} -{\mathbf{y}}^s_{k}\rangle \leq f(\mathbf{u})-f({\mathbf{y}}^s_{k}).\end{aligned}$

Then dividing (5.45) by 2γ and taking expectation on the random number $i^s_{k}$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{1}{2\gamma}\mathbb{E}_k\left\|\theta_1{\mathbf{z}}^{s}_{k+1}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq\frac{1}{2\gamma}\left\| \theta_1{\mathbf{z}}^{s}_{k}-\theta_1{\mathbf{x}}^*\right \|{}^2 + \frac{\gamma}{2}\mathbb{E}_k\left\| \frac{1}{\gamma}\left({\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1}\right)\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\mathbb{E}_k\left\langle \boldsymbol{\xi}^s_{k+1} + \tilde{\nabla}^k_s, {\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}\leq\frac{1}{2\gamma}\left\|\theta_1{\mathbf{z}}^s_k -\theta_1{\mathbf{x}}^* \right\|{}^2 + \frac{\gamma}{2}\mathbb{E}_k\left\| \frac{1}{\gamma}\left({\mathbf{y}}^s_{k}-{\mathbf{x}}^s_{k+1}\right)\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\mathbb{E}_k\left\langle \boldsymbol{\xi}^s_{k+1}, {\mathbf{y}}^s_{k}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \qquad +a f({\mathbf{x}}^s_k)+\theta_1f({\mathbf{x}}^*) - (1-\theta_2)f({\mathbf{y}}^s_k) +\theta_2\left\langle\nabla f({\mathbf{y}}^s_k), \tilde{\mathbf{x}}^s -{\mathbf{y}}^s_k \right\rangle,\quad \end{array} \end{aligned}$

(5.47)

where $\overset {a}\leq$ uses (5.46). Adding (5.47) and (5.44), we obtain that

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k f({\mathbf{x}}^s_{k+1})+\frac{1}{2\gamma}\mathbb{E}_k\left\|\theta_1{\mathbf{z}}^{s}_{k+1}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq af({\mathbf{x}}^s_k)+\theta_1f({\mathbf{x}}^*) + \theta_2f({\mathbf{y}}_{k}^s)+\theta_2\langle \nabla f({\mathbf{y}}^s_{k}),\tilde{\mathbf{x}}^s -{\mathbf{y}}^s_{k} \rangle\\ &\displaystyle &\displaystyle \qquad - \gamma \left(\frac{1}{2} - \frac{\gamma L}{2} - \frac{C_3}{2} \right)\mathbb{E}_k\left\| \frac{1}{\gamma}\left({\mathbf{y}}^s_k - {\mathbf{x}}^s_{k+1}\right)\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\mathbb{E}_k\langle \boldsymbol{\xi}^s_{k+1}, {\mathbf{x}}^s_{k+1}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^*\rangle \\ &\displaystyle &\displaystyle \qquad + \frac{\gamma L}{C_3}\left(f(\tilde{\mathbf{x}}^s) - f({\mathbf{y}}^s_k)+\left\langle \nabla f({\mathbf{y}}^s_k), {\mathbf{y}}^s_k-\tilde{\mathbf{x}}^s \right\rangle \right)+\frac{1}{2\gamma}\left\|\theta_1{\mathbf{z}}^{s}_{k}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}{\leq}a f({\mathbf{x}}^s_k)+\theta_1f({\mathbf{x}}^*) + \theta_2 f(\tilde{\mathbf{x}}^s)+\frac{1}{2\gamma}\left\|\theta_1{\mathbf{z}}^{s}_{k}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\mathbb{E}_k\left\langle \boldsymbol{\xi}^s_{k+1} , {\mathbf{x}}^s_{k+1}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\rangle, \end{array} \end{aligned}$

(5.48)

where in $\overset {a}{\leq }$ we set $C_3=\frac {\gamma L}{\theta _2}$ . For the last term of (5.48), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle -\left\langle \boldsymbol{\xi}^s_{k+1} , {\mathbf{x}}^s_{k+1}-a{\mathbf{x}}^s_{k}-\theta_{2}\tilde{\mathbf{x}}^s-\theta_1{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \quad = -\left\langle \boldsymbol{\xi}^s_{k+1} , \theta_1 {\mathbf{z}}^s_{k+1}-\theta_1{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \quad \leq \theta_1 h({\mathbf{x}}^*)-\theta_1 h({\mathbf{z}}^s_{k+1})-\frac{\mu\theta_1}{2}\left\|{\mathbf{z}}^s_{k+1}-{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}\leq \theta_1 h({\mathbf{x}}^*)- h({\mathbf{x}}^s_{k+1}) + \theta_2 h(\tilde{\mathbf{x}}^s)+ a h({\mathbf{x}}^s_k) -\frac{\mu \theta_1}{2}\left\|{\mathbf{z}}^s_{k+1}-{\mathbf{x}}^* \right\|{}^2,\quad \end{array} \end{aligned}$

(5.49)

where in $\overset {a}\leq$ we use ${\mathbf {x}}^s_{k+1}=a{\mathbf {x}}^s_{k}+\theta _{2}\tilde {\mathbf {x}}^s+\theta _1{\mathbf {z}}^s_{k+1}$ and the convexity of h(⋅). Substituting (5.49) into (5.48), we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k F({\mathbf{x}}^s_{k+1})+\frac{1+\frac{\mu \gamma}{\theta_1}}{2\gamma}\left\|\theta_1{\mathbf{z}}^{s}_{k+1}-\theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq aF({\mathbf{x}}^s_k)+\theta_1F({\mathbf{x}}^*)+ \theta_2 F(\tilde{\mathbf{x}}^s)+\frac{1}{2\gamma}\left\|\theta_1{\mathbf{z}}^{s}_{k}-\theta_1{\mathbf{x}}^*\right\|{}^2. \end{array} \end{aligned}$

(5.50)

Set $\theta _1= \sqrt {\frac {n\mu }{L}}$ and $\theta _3 = \frac {\mu \gamma }{\theta _1}+1 \leq \frac {L\gamma }{\theta _1}+1 \leq \frac {1}{3}\sqrt {\frac {\mu }{Ln}}+1$ . Taking expectation on (5.50) for the first k − 1 iterations, then multiplying it with $\theta _3^k$ , and telescoping the results with k from 0 to m − 1, we have

$\displaystyle \begin{aligned} &\sum_{k=1}^{m} \theta_3^{k-1} \left(F({\mathbf{x}}^s_{k})-F({\mathbf{x}}^*) \right) - a\sum_{k=0}^{m-1} \theta_3^k \left(F({\mathbf{x}}^s_{k})-F({\mathbf{x}}^*) \right)\\ &\quad - \theta_2\sum_{k=0}^{m-1} \left[\theta_3^k \left(F(\tilde{\mathbf{x}}^s)-F({\mathbf{x}}^*) \right)\right] + \frac{\theta_3^{m}}{2\gamma}\left\| \theta_1{\mathbf{z}}^s_m- \theta_1{\mathbf{x}}^*\right\|{}^2 \leq \frac{1}{2\gamma}\left\| \theta_1 {\mathbf{z}}^s_0 - \theta_1{\mathbf{x}}^*\right\|{}^2. \end{aligned}$

(5.51)

By rearranging the terms of (5.51), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle [\theta_1+\theta_{2} -(1-1/\theta_3)]\sum_{k=1}^{m} \theta_3^k \left(F({\mathbf{x}}^s_{k})-F({\mathbf{x}}^*) \right) + \theta_3 ^ma \left(F({\mathbf{x}}^s_{m})-F({\mathbf{x}}^*)\right)\\ &\displaystyle &\displaystyle \quad +\frac{\theta_3^{m}}{2\gamma}\left\| \theta_1{\mathbf{z}}^s_m- \theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \qquad \leq \theta_2\sum_{k=0}^{m-1} \left[\theta_3^k \left(F(\tilde{\mathbf{x}}^s)-F({\mathbf{x}}^*) \right)\right] +a \left(F({\mathbf{x}}^s_{0}){-}F({\mathbf{x}}^*)\right)+ \frac{1}{2\gamma}\left\| \theta_1 {\mathbf{z}}^s_0 {-} \theta_1{\mathbf{x}}^*\right\|{}^2. \end{array} \end{aligned}$

From the definition $\tilde {\mathbf {x}}^{s+1} = \left (\sum _{j=0}^{m-1}\theta _3^j\right )^{-1}\sum _{j=0}^{m-1}\theta _3^j {\mathbf {x}}^s_j$ , we have

$\displaystyle \begin{aligned} & [\theta_1+\theta_{2} -(1-1/\theta_3)]\theta_3 \left(\sum_{k=0}^{m-1} \theta_3^k\right) \left(F(\tilde{\mathbf{x}}^{s+1})-F({\mathbf{x}}^*) \right) + \theta_3 ^ma \left(F({\mathbf{x}}^s_{m})-F({\mathbf{x}}^*)\right)\\ &\quad +\frac{\theta_3^{m}}{2\gamma}\left\| \theta_1{\mathbf{z}}^s_m- \theta_1{\mathbf{x}}^*\right\|{}^2\\ &\qquad \leq \theta_2\left(\sum_{k=0}^{m-1} \theta_3^k\right) \left(F(\tilde{\mathbf{x}}^s)-F({\mathbf{x}}^*) \right) +a \left(F({\mathbf{x}}^s_{0})-F({\mathbf{x}}^*)\right)\\ &\qquad \quad + \frac{1}{2\gamma}\left\| \theta_1 {\mathbf{z}}^s_0 - \theta_1{\mathbf{x}}^*\right\|{}^2.{} \end{aligned}$

(5.52)

Since

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \theta_2 \Big(\theta_3^{m-1}-1\Big)+ (1-1/\theta_3) \\ &\displaystyle &\displaystyle \quad \leq\frac{1}{2}\left[ \left(1+ \frac{1}{3}\sqrt{\frac{\mu}{nL}}\right)^{m-1}-1 \right] + \frac{\frac{1}{3}\sqrt{\frac{\mu}{nL}}}{\theta_3}\\ &\displaystyle &\displaystyle \quad \overset{a}\leq \frac{1}{2}\frac{1}{2} \sqrt{\frac{n\mu}{L}} +\frac{\frac{1}{3}\sqrt{\frac{\mu}{nL}}}{\theta_3}\overset{b}{\leq}\frac{7}{12}\sqrt{\frac{n\mu}{L}} \leq \theta_1, \end{array} \end{aligned}$

(5.53)

where in $\overset {a}\leq$ we use μn ≤ L∕4 < L, m − 1 = n − 1 ≤ n, and the fact that

$\displaystyle \begin{aligned} g(x)=(1+x)^c \leq 1+\frac{3}{2}cx\mbox{ when }c\geq 1\mbox{ and }x\leq\frac{1}{c}, \end{aligned}$

(5.54)

and in $\overset {b}{\leq }$ we use θ ₃n ≥ θ ₃ ≥ 1. To prove (5.54), we can use Taylor expansion at point x = 0 to obtain

$\displaystyle \begin{aligned} \begin{array}{rcl} (1+x)^c = 1 + cx +\frac{c(c-1)}{2} \xi^2 \leq 1 + cx +\frac{c(c-1)}{2} \frac{1}{c} x \leq 1+\frac{3}{2}cx, \end{array} \end{aligned}$

where ξ ∈ [0, x].

Equation (5.53) indicates that $\theta _1+\theta _{2} -(1-1/\theta _3) \geq \theta _2\theta _3^{m-1}$ . This and (5.52) gives

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \theta_2 \theta_3^m \left( \sum_{k=0}^{m-1} \theta_3^k \right)\left(F(\tilde{\mathbf{x}}^{s+1})-F({\mathbf{x}}^*) \right) + \theta_3^m a \left(F({\mathbf{x}}^s_{m})-F({\mathbf{x}}^*)\right)\\ &\displaystyle &\displaystyle \quad +\frac{\theta_3^{m}}{2\gamma}\left\| \theta_1{\mathbf{z}}^s_m- \theta_1{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \qquad \leq \theta_2\left(\sum_{k=0}^{m-1} \theta_3^k\right) \left(F(\tilde{\mathbf{x}}^s)-F({\mathbf{x}}^*) \right) +a \left(F({\mathbf{x}}^s_{0})-F({\mathbf{x}}^*)\right)\\ &\displaystyle &\displaystyle \qquad \quad + \frac{1}{2\gamma}\left\| \theta_1 {\mathbf{z}}^s_0 - \theta_1{\mathbf{x}}^*\right\|{}^2. \end{array} \end{aligned}$

By expanding the above inequality from s = S, ⋯ , 0, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \theta_2 \left(\sum_{k=0}^{m-1} \theta_3^k \right)\left(F(\tilde{\mathbf{x}}^{S+1})-F({\mathbf{x}}^*) \right) + (1-\theta_1-\theta_2) \left(F({\mathbf{x}}^S_{m})-F({\mathbf{x}}^*)\right)\\ &\displaystyle &\displaystyle \quad + \frac{\theta_1^2}{2\gamma}\left\|{\mathbf{z}}^{S+1}_0 - {\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \qquad \leq \theta_3^{-Sm}\left\{\left[\theta_2\left(\sum_{k=0}^{m-1} \theta_3^k\right) + (1-\theta_1-\theta_2)\right] \left(F({\mathbf{x}}^0_0)-F({\mathbf{x}}^*) \right) \right.\\ &\displaystyle &\displaystyle \qquad \quad \left. +\frac{\theta_1^2}{2\gamma}\left\|{\mathbf{z}}^0_0-{\mathbf{x}}^* \right\|{}^2 \right\}. \end{array} \end{aligned}$

Since $\theta _3^k \geq 1$ , we have $\sum _{k=0}^{m-1}\theta ^k_3 \geq n$ . Then using $\theta _2 = \frac {1}{2}$ and $\theta _1\leq \frac {1}{2}$ (see (5.38)), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} F(\tilde{\mathbf{x}}^{S+1})-F({\mathbf{x}}^*) \leq \theta_3^{-Sn} \left[ \left(1+\frac{1}{n}\right) \left(F({\mathbf{x}}^0_0)-F({\mathbf{x}}^*) \right) +\frac{1}{4n\gamma}\left\|{\mathbf{z}}^0_0-{\mathbf{x}}^* \right\|{}^2 \right]. \end{array} \end{aligned}$

This ends the proof. □

5.1.4 Black-Box Acceleration

In this section, we introduce black-box acceleration methods. In general, the methods solve (5.3) by constructing a series of subproblems known as “mediator” [15], which can be solved efficiently to a high accuracy. The main advantages of these methods are listed as follows:

1.
The black-box methods make acceleration easier, because we only need to concern the method to solve subproblems. In most time, the subproblems have a good condition number and so are easy to be solved. Specifically, for general problem of (5.3), the subproblems can be solved by arbitrary vanilla VR methods. For specific forms of (5.3), one is allowed to design appropriate methods according to the characteristic of functions to solve them without considering acceleration techniques.
2.
The black-box methods make the acceleration technique more general. For different properties of objectives, no matter strongly convex or not and smooth or not, the black-box methods are able to give a universal accelerated algorithm.

The first stochastic black-box acceleration might be Acc-SDCA [19]. Its convergence rate for the IC case is $O\left ((n + \sqrt {n\kappa })\log (\kappa ) \log ^2(1/\epsilon )\right )$ . Later, Lin et al. proposed a generic acceleration, called Catalyst [12], which achieved a convergence rate $O\left ((n + \sqrt {n\kappa })\log ^2(1/\epsilon )\right )$ , outperforming Acc-SDCA by a factor of $\log (\kappa )$ . Allen-Zhu et al. [2] designed a black-box acceleration by gradually decreasing the condition number of the subproblem, achieving $O(\log (1/\epsilon ))$ faster rate than Catalyst [12] on some general objective functions. In the following, we introduce Catalyst [12] as an example. The algorithm of Catalyst is shown in Algorithm 5.4. The main theorem for Algorithm 5.4 is as follows:

Theorem 5.4

For problem (5.3), suppose that F(x) is μ-strongly convex and set $\alpha _0 = \sqrt {q}$ with $q= \frac {\mu }{\mu +\kappa }$ and

$\displaystyle \begin{aligned}\epsilon_k = \frac{2}{9} (F({\mathbf{x}}_0) - F^*)(1-\rho)^k\quad \mathit{\text{with }} \rho\leq \sqrt{q}.\end{aligned}$

Then Algorithm 5.4 generates iterates {x _k}_k≥0 such that

$\displaystyle \begin{aligned}F({\mathbf{x}}_k) - F^*\leq C (1-\rho)^{k+1} (F({\mathbf{x}}_0) - F^*) \quad \mathit{\text{with }} C = \frac{8}{(\sqrt{q}-\rho)^2}.\end{aligned}$

We leave the proof to Sect. 5.2 and first introduce how to use it to obtain an accelerated algorithm. We have the following theorem:

Theorem 5.5

For (5.3), assume that each f _i(x) is convex and L-smooth and h(x) is μ-strongly convex satisfying μ ≤ L∕n. ¹ For Algorithm 5.4, if setting $\kappa = \frac {L}{n-1}$ and solving Step 3 by SVRG [ 10], then one can obtain an 𝜖-accuracy solution satisfying F(x) − F(x ^∗) ≤ 𝜖 with IFO calls of $O\left (\sqrt {nL/\mu }\log ^2(1/\epsilon )\right )$ .

Proof

The subproblem in Step 3 is $\left (\mu +\frac {L}{n-1}\right )$ -strongly convex and $\left (L + \frac {L}{n-1}\right )$ -smooth. From Theorem 5.2, applying SVRG to solve the subproblem needs $O\left (\left (n +\frac {L + \frac {L}{n-1}}{\mu +\frac {L}{n-1}}\right )\log (1/\epsilon )\right ) = O\left ( n\log (1/\epsilon )\right )$ IFOs . So the total complexity is $O\left (\sqrt {nL/\mu }\right .$ $\left .\log ^2(1/\epsilon )\right )$ . □

Algorithm 5.4 Catalyst [12]

../images/487003_1_En_5_Chapter/487003_1_En_5_Figd_HTML.png

The black-box algorithms, e.g., Catalyst [12], are proposed earlier than Katyusha [1], discussed in Sect. 5.1.3. From the convergence results, Catalyst [12] is $O(\log (1/\epsilon ))$ times lower than Katyusha [1]. However, the black-box algorithms are more flexible and easier to obtain an accelerated rate. For example, in the next section we apply Catalyst to the INC case.

5.2 The Individually Nonconvex Case

In this section, we consider solving (5.3) by allowing nonconvexity of f _i(x) but f(x), the sum of f _i(x), is still convex. One important application of it is principal component analysis [9]. It is also the core technique to obtain faster rate in the NC case.

Notice that the convexity of f(x) guarantees an achievable global minimum of (5.3). As shown in Theorem 5.2, to reach a minimizer of F(x) − F ^∗≤ 𝜖, vanilla SVRG needs IFO calls of $O\left (\left (n+\frac {L^2}{\mu ^2}\right )\log ^2(1/\epsilon )\right )$ . We will show that the convergence rate can be improved to be $O\left ( \left (n+n^{3/4}\sqrt {\frac {L}{\mu }}\right )\log ^2(1/\epsilon )\right )$ by acceleration. Compared with the computation costs in the IC case, the costs of NC cases are Ω(n ^1∕4) times larger, which is caused by the nonconvexity of individuals. We still use Catalyst [12] to obtain the result. We first prove Theorem 5.4 in the context of INC. The proof is directly taken from [12], which is based on the estimate sequence in [14] (see Sect. 2.1) and further takes the error of inexact solver into account.

Proof of Theorem 5.4 in the Context of INC

Define the estimate sequence as follows:

1.
$\phi _0(\mathbf {x}) \equiv F({\mathbf {x}}_0) +\frac {\gamma _0}{2}\|\mathbf {x} - {\mathbf {x}}_0 \|{ }^2$ ;
2.
For k ≥ 0,
$\displaystyle \begin{aligned}\phi_{k+1}(\mathbf{x}) &= (1-\alpha_{k})\phi_{k}(\mathbf{x}) +\alpha_{k}\left[F({\mathbf{x}}_{k+1})+\left\langle\kappa ({\mathbf{y}}_{k} - {\mathbf{x}}_{k+1}), \mathbf{x} - {\mathbf{x}}_{k+1} \right\rangle\right] \\ &\quad +\frac{\mu}{2}\|\mathbf{x} - {\mathbf{x}}_{k+1} \|{}^2,\end{aligned}$

where γ ₀ ≥ 0 will be defined in Step 3. One can find that the main difference of estimate sequence defined here with the original one in [15] is replacing ∇f(y _k) with κ(y _k −x _k+1) (see (2.5)).

Step 1: For all k ≥ 0, we can have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \phi_k(\mathbf{x}) = \phi_k^* +\frac{\gamma_k}{2}\|\mathbf{x} - {\mathbf{v}}_k \|{}^2, \end{array} \end{aligned}$

(5.55)

where $\phi ^*_0=F({\mathbf {x}}^0)$ and v ₀ = x ₀ when k = 0, and γ _k, v _k, and $\phi ^*_k$ satisfy:

$\displaystyle \begin{aligned} \gamma_k &= (1-\alpha_{k-1})\gamma_{k-1} +\alpha_{k-1}\mu,{} \end{aligned}$

(5.56)

$\displaystyle \begin{aligned} {\mathbf{v}}_k &= \frac{1}{\gamma_k}\left[ (1 - \alpha_{k-1})\gamma_{k-1}{\mathbf{v}}_{k-1} +\alpha_{k-1}\mu {\mathbf{x}}_k- \alpha_{k-1}\kappa({\mathbf{y}}_{k-1}-{\mathbf{x}}_k) \right],{} \end{aligned}$

(5.57)

$\displaystyle \begin{aligned} \phi^*_k &= (1-\alpha_{k-1}) \phi_{k-1}^* + \alpha_{k-1} F({\mathbf{x}}_{k}) - \frac{\alpha_{k-1}^2}{2\gamma_{k}} \Vert \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}) \|{}^2 \\ &\quad + \frac{\alpha_{k-1} (1-\alpha_{k-1}) \gamma_{k-1}}{\gamma_{k}}\left (\frac{\mu}{2} \|{\mathbf{x}}_{k} - {\mathbf{v}}_{k-1} \|{}^2+\langle \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}), {\mathbf{v}}_{k-1}- {\mathbf{x}}_{k} \rangle \right ), {} \end{aligned}$

(5.58)

when k > 0.

Proof of Step 1: we need to check

$\displaystyle \begin{aligned} \begin{array}{rcl} \begin{aligned} &\displaystyle \phi_k^* +\frac{\gamma_k}{2}\| \mathbf{x}-{\mathbf{v}}_k\|{}^2\\ &\displaystyle \quad = (1-\alpha_{k-1})\phi_{k-1}(\mathbf{x})\\ &\displaystyle \qquad +\alpha_{k-1}\left(F({\mathbf{x}}_k)+\left\langle\kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_k), \mathbf{x} - {\mathbf{x}}_k \right\rangle] +\frac{\mu}{2}\|\mathbf{x} - {\mathbf{x}}_k \|{}^2\right). \end{aligned} \end{array} \end{aligned}$

Suppose that at iteration k − 1, (5.55) is right. Then we need to prove

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \phi_k^* +\frac{\gamma_k}{2}\| \mathbf{x}-{\mathbf{v}}_k\|{}^2 \\ &\displaystyle &\displaystyle \quad = (1-\alpha_{k-1})\left( \phi^*_{k-1} +\frac{\gamma_{k-1}}{2}\|\mathbf{x} - {\mathbf{v}}_{k-1} \|{}^2 \right )\\ &\displaystyle &\displaystyle \qquad +\alpha_{k-1}\left(F({\mathbf{x}}_k)+\left\langle\kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_k), \mathbf{x} - {\mathbf{x}}_k \right\rangle +\frac{\mu}{2}\|\mathbf{x} - {\mathbf{x}}_k \|{}^2\right). \end{array} \end{aligned}$

(5.59)

Both sides of the equation are simple quadratic forms. By comparing the coefficient of ∥x∥², we have γ _k = (1 − α _k−1)γ _k−1 + α _k−1μ. Then by computing the gradient at x = 0 on both sides of (5.59), we can obtain

$\displaystyle \begin{aligned} \begin{array}{rcl} \gamma_k {\mathbf{v}}_k = (1-\alpha_{k-1})\gamma_{k-1}{\mathbf{v}}_{k-1}-\alpha_{k-1}\kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_k) + \alpha_{k-1}\mu {\mathbf{x}}_k. \end{array} \end{aligned}$

For $\phi ^*_k$ , we can set x = x _k in (5.59) and obtain:

$\displaystyle \begin{aligned} \begin{array}{rcl} \phi^*_k = - \frac{\gamma_k}{2}\|{\mathbf{x}}_k - {\mathbf{v}}_k\|{}^2 + (1-\alpha_{k-1})\left( \phi_{k-1}^* +\frac{\gamma_{k-1}}{2}\|{\mathbf{x}}_k - {\mathbf{v}}_{k-1} \|{}^2\right) + \alpha_{k-1} F({\mathbf{x}}_k). \end{array} \end{aligned}$

Then by substituting (5.56) and (5.57) to the above, we can obtain (5.58).

Step 2: For Algorithm 5.4, we can have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} F({\mathbf{x}}_k) \leq \phi_k^*+\xi_k, \end{array} \end{aligned}$

(5.60)

where ξ ₀ = 0 and $\xi _k = \left (1-\alpha _{k-1}\right )\left (\xi _{k-1} +\epsilon _k -(\kappa +\mu )\left \langle {\mathbf {x}}_k - {\mathbf {x}}_k^*, {\mathbf {x}}_{k-1} - {\mathbf {x}}_k \right \rangle \right ).$

Proof of Step 2: Suppose that ${\mathbf {x}}^*_k$ is the optimal solution in Step 3 of Algorithm 5.4. By the (μ + κ)-strongly convexity of G _k(x), we have

$\displaystyle \begin{aligned}G_k(\mathbf{x}) \geqslant G_k^* + \frac{\kappa+\mu}{2} \| \mathbf{x} - {\mathbf{x}}_{k}^* \|{}^2.\end{aligned}$

Then

$\displaystyle \begin{aligned} \begin{array}{rcl}{} F(\mathbf{x}) &\displaystyle \geq&\displaystyle G_k^* + \frac{\kappa+\mu}{2}\| \mathbf{x} - {\mathbf{x}}_{k}^*\|{}^2 - \frac{\kappa}{2} \|\mathbf{x} - {\mathbf{y}}_{k-1} \|{}^2 \\ &\displaystyle \overset{a}=&\displaystyle G_k({\mathbf{x}}_{k}) - \epsilon_k + \frac{\kappa + \mu}{2} \| (\mathbf{x} - {\mathbf{x}}_{k}) + ({\mathbf{x}}_{k} -{\mathbf{x}}_{k}^*) \|{}^2 - \frac{\kappa}{2} \| \mathbf{x} - {\mathbf{y}}_{k-1}\|{}^2 \\ &\displaystyle \geq &\displaystyle F({\mathbf{x}}_{k}) + \frac{\kappa}{2}\|{\mathbf{x}}_k - {\mathbf{y}}_{k-1} \|{}^2 - \epsilon_k + \frac{\kappa + \mu}{2} \| \mathbf{x} - {\mathbf{x}}_{k} \|{}^2 - \frac{\kappa}{2} \| \mathbf{x} - {\mathbf{y}}_{k-1} \|{}^2\\ &\displaystyle &\displaystyle + (\kappa+\mu) \langle {\mathbf{x}}_{k} - {\mathbf{x}}_{k}^*, \mathbf{x}-{\mathbf{x}}_{k} \rangle\\ &\displaystyle = &\displaystyle F({\mathbf{x}}_{k}) +\kappa \left\langle {\mathbf{y}}_{k-1}-{\mathbf{x}}_k, \mathbf{x}-{\mathbf{x}}_k \right\rangle - \epsilon_k + \frac{\mu}{2} \| \mathbf{x} - {\mathbf{x}}_{k} \|{}^2 \\ &\displaystyle &\displaystyle + (\kappa+\mu) \langle {\mathbf{x}}_{k} - {\mathbf{x}}_{k}^*, \mathbf{x}-{\mathbf{x}}_{k} \rangle, \end{array} \end{aligned}$

(5.61)

where in $\overset {a}=$ we use that G _k(x _k) is an 𝜖-accuracy solution of Step 3.

Now we prove (5.60). When k = 0, we have $F({\mathbf {x}}_0) = \phi _0^*$ . Suppose that for k − 1 with k ≥ 1, (5.60) is right, i.e., $F({\mathbf {x}}_{k-1})\leq \phi ^*_{k-1}+\xi _{k-1}$ . Then

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \phi^*_{k-1} &\displaystyle \geq&\displaystyle F({\mathbf{x}}_{k-1}) - \xi_{k-1} \\ &\displaystyle \overset{a}{\geq}&\displaystyle F({\mathbf{x}}_{k}) +\left\langle \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_k), {\mathbf{x}}_{k-1} - {\mathbf{x}}_k \right\rangle + (\kappa +\mu)\left\langle {\mathbf{x}}_{k} - {\mathbf{x}}^*_k, {\mathbf{x}}_{k-1}-{\mathbf{x}}_k\right\rangle\\ &\displaystyle &\displaystyle -\epsilon_k - \xi_{k-1}\\ &\displaystyle =&\displaystyle F({\mathbf{x}}_{k}) +\left\langle \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_k), {\mathbf{x}}_{k-1} - {\mathbf{x}}_k \right\rangle -\xi_k/(1-\alpha_{k-1}), \end{array} \end{aligned}$

(5.62)

where $\overset {a}{\geq }$ uses (5.61). Then from (5.58), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \phi_k^* &\displaystyle =&\displaystyle (1-\alpha_{k-1}) \phi_{k-1}^* + \alpha_{k-1} F({\mathbf{x}}_{k}) - \frac{\alpha_{k-1}^2}{2\gamma_{k}} \Vert \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}) \|{}^2\\ &\displaystyle &\displaystyle +\frac{\alpha_{k-1} (1-\alpha_{k-1}) \gamma_{k-1}}{\gamma_{k}} \left(\frac{\mu}{2}\|{\mathbf{x}}_k - {\mathbf{v}}_{k-1} \|{}^2+\langle \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}), {\mathbf{v}}_{k-1} - {\mathbf{x}}_{k} \rangle\right)\\ &\displaystyle \overset{a}\geq&\displaystyle (1-\alpha_{k-1})F({\mathbf{x}}_{k}) +(1-\alpha_{k-1})\left\langle\kappa({\mathbf{y}}_{k-1} - {\mathbf{x}}_{k}),{\mathbf{x}}_{k-1}-{\mathbf{x}}_k \right\rangle- \xi_k+\alpha_{k-1}F({\mathbf{x}}_{k}) \\ &\displaystyle &\displaystyle - \frac{\alpha_{k-1}^2}{2\gamma_{k}} \Vert \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}) \|{}^2+ \frac{\alpha_{k-1} (1-\alpha_{k-1}) \gamma_{k-1}}{\gamma_{k}} \langle \kappa ({\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}), {\mathbf{v}}_{k-1} - {\mathbf{x}}_{k} \rangle\\ &\displaystyle =&\displaystyle F({\mathbf{x}}_{k}) + (1-\alpha_{k-1}) \left\langle \kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_{k}), {\mathbf{x}}_{k-1}- {\mathbf{x}}_{k}+\frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_{k}} ( {\mathbf{v}}_{k-1} - {\mathbf{x}}_{k}) \right\rangle \\ &\displaystyle &\displaystyle - \frac{\alpha^2_{k-1}}{2\gamma_k}\| \kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_k)\|{}^2 -\xi_k\\ &\displaystyle \overset{b}{=}&\displaystyle F({\mathbf{x}}_{k}) + (1-\alpha_{k-1}) \left\langle \kappa ({\mathbf{y}}_{k-1} - {\mathbf{x}}_{k}), {\mathbf{x}}_{k-1}- {\mathbf{y}}_{k-1}+\frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_{k}} ( {\mathbf{v}}_{k-1} - {\mathbf{y}}_{k-1}) \right\rangle \\ &\displaystyle &\displaystyle + \left (1 -\frac{ (\kappa+2\mu) \alpha_{k-1}^2}{2\gamma_{k}} \right) \kappa \|{\mathbf{y}}_{k-1} -{\mathbf{x}}_{k}\|{}^2 -\xi_k, \end{array} \end{aligned}$

where $\overset {a}\geq$ uses (5.62) and $\overset {b}{=}$ uses (5.56).

Step 3: Set

$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{x}}_{k-1}- {\mathbf{y}}_{k-1}+\frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_{k}} ( {\mathbf{v}}_{k-1} - {\mathbf{y}}_{k-1})&\displaystyle =&\displaystyle \mathbf{0},{} \end{array} \end{aligned}$

(5.63)

$\displaystyle \begin{aligned} \begin{array}{rcl} \gamma_0 &\displaystyle =&\displaystyle \frac{\alpha_0[(\kappa+\mu)\alpha_0 -\mu]}{1 - \alpha_0},\\ \gamma_k &\displaystyle =&\displaystyle (\kappa +\mu)\alpha_{k-1}^2,\quad k\geq 1.\quad {} \end{array} \end{aligned}$

(5.64)

By Step 4 of Algorithm 5.4, we can have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{y}}_k = {\mathbf{x}}_k + \frac{\alpha_{k-1}(1-\alpha_{k-1})}{\alpha^2_{k-1}+ \alpha_k} ({\mathbf{x}}_k - {\mathbf{x}}_{k-1}). \end{array} \end{aligned}$

(5.65)

Proof of Step 3: Suppose that at iteration k − 1, (5.65) is right, then from (5.57) in Step 1, we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ93_HTML.png

(5.66)

where $\overset {a}=$ uses (5.63) and in $\overset {b} =$ we use

$\displaystyle \begin{aligned} \begin{array}{rcl} (1-\alpha_{k-1}) (\gamma_k + \alpha_{k-1}\gamma_{k-1}) &\displaystyle \overset{c}{=}&\displaystyle (1-\alpha_{k-1})(\gamma_{k-1} + \alpha_{k-1}\mu )\\ &\displaystyle \overset{d}{=}&\displaystyle \gamma_k - \mu \alpha_{k-1}^2 \overset{e}{=} \kappa \alpha_{k-1}^2, \end{array} \end{aligned}$

in which both $\overset {c}{=}$ and $\overset {d}{=}$ use (5.56) and $\overset {e}{=}$ uses (5.64).

Plugging (5.66) into (5.63) with k − 1 being replaced by k, we obtain (5.65).

Step 4: Set $\lambda _k = \prod _{i=0}^{k-1} (1-\alpha _i)$ , we can have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{1}{\lambda_{k}} \left(F({\mathbf{x}}_{k}) - F^* + \frac{\gamma_{k}}{2} \|{\mathbf{x}}^* - {\mathbf{v}}_{k} \|{}^2\right) \\ &\displaystyle &\displaystyle \quad \leq \phi_{0} ({\mathbf{x}}^*) -F^*+ \sum_{i=1}^{k}\frac{\epsilon_i}{\lambda_{i}} + \sum_{i=1}^{k} \frac{ \sqrt{2 \epsilon_i \gamma_{i} }} {\lambda_{i}} \|{\mathbf{x}}^* - {\mathbf{v}}_{i} \|. \end{array} \end{aligned}$

(5.67)

Proof of Step 4: From the definition of ϕ _k(x), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \phi_k({\mathbf{x}}^*) &\displaystyle =&\displaystyle (1-\alpha_{k-1})\phi_{k-1}({\mathbf{x}}^*)\\ &\displaystyle &\displaystyle +\alpha_{k-1}\left(F({\mathbf{x}}_k)+\left\langle\kappa({\mathbf{y}}_{k-1}-{\mathbf{x}}_k), {\mathbf{x}}^* - {\mathbf{x}}_k\right\rangle +\frac{\mu}{2}\left\|{\mathbf{x}}^* - {\mathbf{x}}_k\right\|{}^2 \right)\\ &\displaystyle \overset{a}{\leq}&\displaystyle (1-\alpha_{k-1})\phi_{k-1}({\mathbf{x}}^*)\\ &\displaystyle &\displaystyle + \alpha_{k-1}\left[F({\mathbf{x}}^*)+ \epsilon_k -(\kappa+\mu)\left\langle {\mathbf{x}}_k - {\mathbf{x}}_k^*, {\mathbf{x}}^* - {\mathbf{x}}_k \right\rangle\right], \end{array} \end{aligned}$

where $\overset {a}{\leq }$ uses (5.61). Rearranging terms and using the definition of $\xi _k = (1-\alpha _{k-1})\left (\xi _{k-1}+\epsilon _k - (\kappa +\mu )\left \langle {\mathbf {x}}_k -{\mathbf {x}}_k^*, {\mathbf {x}}_{k-1}-{\mathbf {x}}_k\right \rangle \right )$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \phi_k({\mathbf{x}}^*)+ \xi_k - F^*\\ &\displaystyle &\displaystyle \quad \leq(1-\alpha_{k-1})(\phi_{k-1}({\mathbf{x}}^*)+ \xi_{k-1} - F^*) + \epsilon_k \\ &\displaystyle &\displaystyle \qquad - (\kappa+\mu)\left\langle {\mathbf{x}}_k - {\mathbf{x}}_k^*, (1-\alpha_{k-1}){\mathbf{x}}_{k-1}+\alpha_{k-1}{\mathbf{x}}^* - {\mathbf{x}}_k \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}\leq(1-\alpha_{k-1})(\phi_{k-1}({\mathbf{x}}^*)+ \xi_{k-1} - F^*) + \epsilon_k + \sqrt{2\epsilon_k \gamma_k}\|{\mathbf{x}}^* - {\mathbf{v}}_k\|,\quad \end{array} \end{aligned}$

(5.68)

where F ^∗ = F(x ^∗) and in $\overset {a} =$ we use

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle - (\kappa+\mu)\left\langle {\mathbf{x}}_k - {\mathbf{x}}_k^*, (1-\alpha_{k-1}){\mathbf{x}}_{k-1}+\alpha_{k-1}{\mathbf{x}}^* - {\mathbf{x}}_k \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}{=} - \alpha_{k-1}(\kappa+\mu)\left\langle {\mathbf{x}}_k - {\mathbf{x}}_k^*, {\mathbf{x}}^* -{\mathbf{v}}_k\right\rangle\\ &\displaystyle &\displaystyle \quad \leq \alpha_{k-1} (\kappa+\mu) \|{\mathbf{x}}_k - {\mathbf{x}}_k^* \| \|{\mathbf{x}}^* -{\mathbf{v}}_k\| \\ &\displaystyle &\displaystyle \quad \overset{b}{\leq} \alpha_{k-1}\sqrt{2(\kappa+\mu)\epsilon_k}\|{\mathbf{x}}^* -{\mathbf{v}}_k \| \overset{c}= \sqrt{2\epsilon_k \gamma_k}\|{\mathbf{x}}^* -{\mathbf{v}}_k \|, \end{array} \end{aligned}$

where $\overset {a}{=}$ uses (5.66), $\overset {b}\leq$ uses (A.10), and $\overset {c}=$ uses (5.64).

Then dividing λ _k on both sides of (5.68) and telescoping the result with k = 1 to k, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{1}{\lambda_k}\left( \phi_{k}({\mathbf{x}}^*) + \xi_k - F^* \right)\leq \phi_0({\mathbf{x}}^*) - F^* + \sum_{i=1}^k\frac{\epsilon_i}{\lambda_i} + \sum_{i=1}^k\frac{\sqrt{2\epsilon_i \gamma_i}}{\lambda_i}\|{\mathbf{x}}^* - {\mathbf{v}}_i\|. \end{array} \end{aligned}$

Using the definition of ϕ _k(x ^∗), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \phi_{k}({\mathbf{x}}^*) + \xi_k - F^* &\displaystyle =&\displaystyle \phi_{k}^* + \xi_k - F^* + \frac{\gamma_k}{2}\|{\mathbf{x}}^* - {\mathbf{v}}_k\|{}^2\\ &\displaystyle \overset{a}{\geq}&\displaystyle F({\mathbf{x}}_k) - F^* + \frac{\gamma_k}{2}\|{\mathbf{x}}^* - {\mathbf{v}}_k \|{}^2, \end{array} \end{aligned}$

where $\overset {a}{\geq }$ uses (5.60). So we obtain (5.67).

Step 5: By setting $u_i = \sqrt {\frac {\gamma _i}{2\lambda _i}}\|{\mathbf {x}}^* - {\mathbf {v}}_i\|$ , $\alpha _i = 2\sqrt {\frac {\epsilon _i}{\lambda _i}}$ , and $S_k = \phi _0({\mathbf {x}}^*) - F^* + \sum _{i=1}^k \frac {\epsilon _i}{\lambda _i}$ , (2.37) of Lemma 2.8 is validated due to (5.67) and F(x) − F ^∗≥ 0. So by Lemma 2.8 we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} F({\mathbf{x}}_k) - F^* \overset{a}{\leq} \lambda_k \left(S_k + \sum_{i=1}^k\alpha_i u_i\right) \leq \lambda_k\left( \sqrt{S_k} + 2\sum_{i=1}^k \sqrt{\frac{\epsilon_i}{\lambda_i}} \right)^2, \end{array} \end{aligned}$

(5.69)

where $\overset {a}{\leq }$ uses (5.67).

Step 6: We prove Theorem 5.4. By setting $\alpha _0 =\sqrt {q} = \sqrt {\frac {\mu }{\mu + \kappa }}$ , we can have $\alpha _k = \sqrt {q}$ for k ≥ 0, and then $\lambda _k = (1 - \sqrt {q})^k$ and γ ₀ = μ. Thus by the μ-strong convexity of F(⋅), we have $\frac {\gamma _0}{2}\|{\mathbf {x}}_0 - {\mathbf {x}}^*\|{ }^2 \leq F({\mathbf {x}}_0) - F^*$ . Then

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \sqrt{S_k} + 2\sum_{i=1}^k\sqrt{\frac{\epsilon_i}{\lambda_i}}\\ &\displaystyle &\displaystyle \quad =\sqrt{F({\mathbf{x}}_0) - F^* + \frac{\gamma_0}{2}\|{\mathbf{x}}_0 - {\mathbf{x}}_*\|{}^2 + \sum_{i=1}^k\frac{\epsilon_i}{\lambda_i}} + 2\sum_{i=1}^k\sqrt{\frac{\epsilon_i}{\lambda_i}}\\ &\displaystyle &\displaystyle \quad \leq\sqrt{ F({\mathbf{x}}_0) - F^* + \frac{\gamma_0}{2}\|{\mathbf{x}}_0 - {\mathbf{x}}_*\|{}^2} + 3\sum_{i=1}^k\sqrt{\frac{\epsilon_i}{\lambda_i}}\\ &\displaystyle &\displaystyle \quad \leq\sqrt{ 2 (F({\mathbf{x}}_0) - F^*) } + 3\sum_{i=1}^k\sqrt{\frac{\epsilon_i}{\lambda_i}}\\ &\displaystyle &\displaystyle \quad =\sqrt{ 2 (F({\mathbf{x}}_0) - F^*) } \left[ 1+ \sum_{i=1}^k\left( \sqrt{\frac{1-\rho}{1-\sqrt{q}}} \right)^i\right]\\ &\displaystyle &\displaystyle \quad = \sqrt{ 2 (F({\mathbf{x}}_0) - F^*) } \frac{\eta^{k+1}-1}{\eta - 1}\\ &\displaystyle &\displaystyle \quad \leq \sqrt{ 2 (F({\mathbf{x}}_0) - F^*) } \frac{\eta^{k+1}}{\eta - 1}, \end{array} \end{aligned}$

where we set $\eta = \sqrt {\frac {1-\rho }{1-\sqrt {q}}}$ . Then from (5.69), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle F({\mathbf{x}}_k) - F^*\\ &\displaystyle &\displaystyle \quad \leq 2\lambda_k (F({\mathbf{x}}_0) - F^*)\left(\frac{\eta^{k+1}}{\eta-1}\right)^2\\ &\displaystyle &\displaystyle \quad \overset{a}\leq2\left(\frac{\eta}{\eta -1} \right)^2 (1-\rho)^k (F({\mathbf{x}}_0) - F^*)\\ &\displaystyle &\displaystyle \quad =2\left( \frac{\sqrt{1-\rho}}{\sqrt{1-\rho} - \sqrt{1-\sqrt{q}}}\right)^2 (1-\rho)^k(F({\mathbf{x}}_0) - F^*)\\ &\displaystyle &\displaystyle \quad =2\left( \frac{1}{\sqrt{1-\rho} - \sqrt{1-\sqrt{q}}}\right)^2 (1-\rho)^{k+1}(F({\mathbf{x}}_0) - F^*), \end{array} \end{aligned}$

where in $\overset {a}\leq$ we use $\lambda _k = \prod _{i=0}^{k-1} (1-\alpha _i)\leq \left (1-\sqrt {q}\right )^k$ . Using that $\sqrt {1-x} +\frac {x}{2}$ is monotonically decreasing, we have $\sqrt {1-\rho }+ \frac {\rho }{2} \geq \sqrt {1 - \sqrt {q}} + \frac {\sqrt {q}}{2}$ . So we have

$\displaystyle \begin{aligned} \begin{array}{rcl} F({\mathbf{x}}_k) - F^* \leq \frac{8}{(\sqrt{q}-\rho)^2}(1-\rho)^{k+1}\left( F({\mathbf{x}}_0) - F^*\right). \end{array} \end{aligned}$

(5.70)

This ends the proof. □

With a special setting of the parameters for Algorithm 5.4, we are able to give the convergence result for the INC case.

Theorem 5.6

For (5.3), assume that each f _i(x) is L-smooth, f(x) is convex, and h(x) is μ-strongly convex, satisfying $\frac {L}{\mu }\geq n^{1/2}$ , then for Algorithm 5.4, by setting $\kappa = \frac {L}{\sqrt {n}-1}$ and solving Step 3 by SVRG [ 10] by running $O(n\log (1/\epsilon ))$ steps, one can obtain an 𝜖-accuracy solution satisfying F(x) − F ^∗≤ 𝜖 with IFO calls of $O(n^{3/4}\sqrt {L/\mu }\log ^2(1/\epsilon ))$ .

5.3 The Nonconvex Case

In this section, we consider a hard case when f(x) is nonconvex. We only consider the problem where h(x) = 0 and focus on the IFO complexity to achieve an approximate first-order stationary point satisfying ∥∇f(x)∥≤ 𝜖. In the IC case, SVRG [10] has already achieved nearly optimal rate (ignoring some constants) when n is sufficiently large ( $n\geq \frac {L}{\mu }$ ). However, for the NC case SVRG is not the optimal. We will introduce SPIDER [7] which can find an approximate first-order stationary point in an optimal (ignoring some constants) $O\left (\frac {n^{1/2}}{\epsilon ^2}\right )$ rate. Next, we will show that if further assume that the objective function has Lipschitz continuous Hessians, the momentum technique can ensure a faster rate when n is much smaller than κ = L∕μ.

5.3.1 SPIDER

The Stochastic Path-Integrated Differential Estimator (SPIDER) [7] technique is a radical VR method which is used to track quantities using reduced stochastic oracles. Let us consider an arbitrary deterministic vector quantity Q(x). Assume that we observe a sequence $\hat {\mathbf {x}}_{0:K}$ and we want to dynamically track $Q(\hat {\mathbf {x}}^k )$ for k = 0, 1, ⋯ , K. Further assume that we have an initial estimate $\tilde {Q}(\hat {\mathbf {x}}^0 ) \approx Q(\hat {\mathbf {x}}^0 )$ and an unbiased estimate $\boldsymbol {\xi } _k ( \hat {\mathbf {x}}_{0:k})$ of $Q(\hat {\mathbf {x}}^k ) - Q( \hat {\mathbf {x}}^{ k-1} )$ such that for each k = 1, ⋯ , K,

$\displaystyle \begin{aligned} \mathbb{E}\left[ \boldsymbol{\xi} _k ( \hat{\mathbf{x}}_{0:k}) \mid \hat{\mathbf{x}}_{0:k} \right] = Q(\hat{\mathbf{x}}^k ) - Q(\hat{\mathbf{x}}^{k-1} ) . \end{aligned}$

Then we can integrate (in the discrete sense) the stochastic differential estimate as

$\displaystyle \begin{aligned} \tilde{Q}(\hat{\mathbf{x}}_{0:K} ) \equiv \tilde{Q}(\hat{\mathbf{x}}^0 ) + \sum_{k=1}^K \boldsymbol{\xi} _k (\hat{\mathbf{x}}_{0:k} ) . \end{aligned}$

(5.71)

We call estimator $\tilde {Q}(\hat {\mathbf {x}}_{0:K})$ the Stochastic Path-Integrated Differential Estimator , or SPIDER for brevity. We have

Proposition 5.1

The martingale (Definition A.4 ) variance bound has

$\displaystyle \begin{aligned} \mathbb{E} \left\|\tilde{Q}(\hat{\mathbf{x}}_{0:K} ) - Q(\hat{\mathbf{x}}^K )\right\|{}^2 &= \mathbb{E} \left\|\tilde{Q}(\hat{\mathbf{x}}^0 ) - Q(\hat{\mathbf{x}}^ 0 )\right\|{}^2\\ &\quad + \sum_{k=1}^K \mathbb{E} \left\| \boldsymbol{\xi}_ k (\hat{\mathbf{x}}_{0:k} ) - (Q(\hat{\mathbf{x}}^k ) - Q(\hat{\mathbf{x}}^{k-1} )) \right\|{}^2 . \end{aligned}$

(5.72)

Proposition 5.1 can be easily proven using the property of square-integrable martingales.

Now, let $\mathbb {B}_i$ map any $\mathbf {x}\in \mathbb {R}^d$ to a random estimate $\mathbb {B}_i(\mathbf {x})$ , where $\mathbb {B}(\mathbf {x})$ is the true value to be estimated. At each step k, let S _∗ be a subset that samples |S _∗| elements in [n] with replacement and let the stochastic estimator $\mathbb {B}_{S_*} = (1/|S_*|) \sum _{i\in S_*} \mathbb {B}_i$ satisfy

$\displaystyle \begin{aligned} \mathbb{E} \left\|\mathbb{B}_i (\mathbf{x}) -\mathbb{B}_i (\mathbf{y})\right\|{}^2 \leq L_{\mathbb{B}}^2 \|\mathbf{x} -\mathbf{y} \|{}^2 , \end{aligned}$

(5.73)

and $\left \|{\mathbf {x}}^{k}-{\mathbf {x}}^{k-1} \right \| \leq \epsilon _1$ for all k = 1, ⋯ , K. Finally, we set our estimator $\mathbb {V}^{k}$ of $\mathbb {B}({\mathbf {x}}^{k})$ as

$\displaystyle \begin{aligned} \mathbb{V}^{k} = \mathbb{B}_{S_*}({\mathbf{x}}^{k}) - \mathbb{B}_{S_*}({\mathbf{x}}^{k-1}) + \mathbb{V}^{k-1} . \end{aligned}$

Applying Proposition 5.1 immediately concludes the following lemma, which gives an error bound of the estimator $\mathbb {V}^{k}$ in terms of the second moment of $\left \| \mathbb {V}^{k} - \mathbb {B}({\mathbf {x}}^k) \right \|$ :

Lemma 5.3

Under the condition (5.73) we have that for all k = 1, ⋯ , K,

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\left\| \mathbb{V}^{k} - \mathbb{B}({\mathbf{x}}^k)\right \|{}^2 \leq \frac{k L_{\mathbb{B}}^2 \epsilon_1^2}{|S_*|} + \mathbb{E} \left\|\mathbb{V}^0 - \mathbb{B}({\mathbf{x}}^0) \right\|{}^2. \end{array} \end{aligned}$

(5.74)

Proof

For any k > 0, we have from Proposition 5.1 (by applying $\tilde {Q} = \mathbb {V}$ )

$\displaystyle \begin{aligned} \mathbb{E}_k \left\| \mathbb{V}^{k} - \mathbb{B}({\mathbf{x}}^k) \right\|{}^2 &= \mathbb{E}_k \left\| \mathbb{B}_{S_*}({\mathbf{x}}^{k}) - \mathbb{B}({\mathbf{x}}^k) - \mathbb{B}_{S_*}({\mathbf{x}}^{k-1}) + \mathbb{B}({\mathbf{x}}^{k-1})\right\|{}^2 \\ &\quad + \left\| \mathbb{V}^{k-1} - \mathbb{B}({\mathbf{x}}^{k-1}) \right\|{}^2. \end{aligned}$

(5.75)

Then

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_k \left\| \mathbb{B}_{S_*}({\mathbf{x}}^{k}) - \mathbb{B}({\mathbf{x}}^k) - \mathbb{B}_{S_*}({\mathbf{x}}^{k-1}) + \mathbb{B}({\mathbf{x}}^{k-1})\right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}=\frac{1}{|S_*|}\mathbb{E} \left\| \mathbb{B}_{i}({\mathbf{x}}^{k}) - \mathbb{B}({\mathbf{x}}^k) - \mathbb{B}_{i}({\mathbf{x}}^{k-1}) + \mathbb{B}({\mathbf{x}}^{k-1})\right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{b}\leq\frac{1}{|S_*|}\mathbb{E} \left\| \mathbb{B}_{i}({\mathbf{x}}^{k}) - \mathbb{B}_{i}({\mathbf{x}}^{k-1})\right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{c}{\leq}\frac{1}{|S_*|}L_{\mathbb{B}}^2\mathbb{E} \left\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k-1}\right\|{}^2\leq \frac{L_{\mathbb{B}}^2 \epsilon_1^2}{|S_*|}, \end{array} \end{aligned}$

(5.76)

where in $\overset {a}=$ we use that S _∗ is randomly sampled from [n] with replacement, so the variance reduces by $\frac {1}{|S_*|}$ times. In $\overset {b}\leq$ and $\overset {c}{\leq }$ we use Proposition A.2 and (5.73), respectively.

Combining (5.75) and (5.76), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}_k \left\| \mathbb{V}^{k} - \mathbb{B}({\mathbf{x}}^k) \right\|{}^2 \leq \frac{L_{\mathbb{B}}^2 \epsilon_1^2}{|S_*|} + \left\|\mathbb{V}^{k-1} - \mathbb{B}({\mathbf{x}}^{k-1})\right\|{}^2. \end{array} \end{aligned}$

(5.77)

Telescoping the above display for k′ = k − 1, ⋯ , 0 and using the iterated law of expectation (Proposition A.4), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E} \left\| \mathbb{V}^{k} - \mathbb{B}({\mathbf{x}}^k) \right\|{}^2 \leq \frac{ k L_{\mathbb{B}}^2 \epsilon_1^2}{|S_*|} + \mathbb{E}\left\| \mathbb{V}^0 - \mathbb{B}({\mathbf{x}}^0) \right\|{}^2. \end{array} \end{aligned}$

(5.78)

□

Algorithm 5.5 SPIDER for searching first-order stationary point (SPIDER-SFO)

../images/487003_1_En_5_Chapter/487003_1_En_5_Fige_HTML.png

The algorithm using SPIDER to solve (5.2) is shown in Algorithm 5.5. We have the following theorem.

Theorem 5.7

For the optimization problem (5.22) in the online case (n = ∞), assume that each f _i(x) is L-smooth and $\mathbb {E}\| \nabla f_i(\mathbf {x}) - \nabla f(\mathbf {x})\|{ }^2\leq \sigma ^2$ . Set the parameters S ₁, S ₂, η, and q as

$\displaystyle \begin{aligned} S_1 = \frac{2\sigma^2}{\epsilon^2} ,~ S_2 = \frac{2\sigma}{\epsilon n_0} ,~ \eta = \frac{\epsilon}{Ln_0} ,~ \eta_k = \min \left( \frac{\epsilon}{Ln_0\|{\mathbf{v}}^k\|}, \frac{1}{2L n_0} \right) ,~ q = \frac{ \sigma n_0}{\epsilon} , \end{aligned}$

(5.79)

and set $K = \left \lfloor (4L \Delta n_0)\epsilon ^{-2}\right \rfloor +1$ . Then running Algorithm 5.5with OPTION II for K iterations outputs an $\tilde {\mathbf {x}}$ satisfying

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E} \|\nabla f(\tilde{\mathbf{x}}) \|\leq 5\epsilon, \end{array} \end{aligned}$

(5.80)

where Δ = f(x ⁰) − f ^∗ (f ^∗ =inf_xf(x)). The gradient cost is bounded by $24 L\Delta \sigma \cdot \epsilon ^{-3} +2\sigma ^2 \epsilon ^{-2} + 4\sigma n_0^{-1} \epsilon ^{-1}$ for any choice of n ₀ ∈ [1, 2σ∕𝜖]. Treating Δ, L, and σ as positive constants, the stochastic gradient complexity is O(𝜖 ⁻³).

To prove Theorem 5.7, we first prepare the following lemmas.

Lemma 5.4

Set the parameters S ₁, S ₂, η, and q as in (5.79), and k ₀ = ⌊k∕q⌋⋅ q. We have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{k_0} \left[\left. \left\|{\mathbf{v}}^k -\nabla f({\mathbf{x}}^k) \right \|{}^2 \right| {\mathbf{x}}_{0:k_0} \right] \leq \epsilon^2, \end{array} \end{aligned}$

(5.81)

where $\mathbb {E}_{k_0}$ denotes the conditional expectation over the randomness of ${\mathbf {x}}_{(k_0+1): k}$ .

Proof

For k = k ₀, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{k_0} \left\|{\mathbf{v}}^{k_0} - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2 = \mathbb{E}_{k_0}\left\| \nabla f_{S_1}({\mathbf{x}}^{k_0}) - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2 \le \frac{\sigma^2}{S_1} = \frac{\epsilon^2}{2}.\quad \end{array} \end{aligned}$

(5.82)

From Line 15 of Algorithm 5.5 we have that for all k ≥ 0,

$\displaystyle \begin{aligned} \left\|{\mathbf{x}}^{k+1} - {\mathbf{x}}^k\right\| = \min\left(\frac{\epsilon}{Ln_0\|{\mathbf{v}}^k\|} ,\frac{1}{2Ln_0}\right) \|{\mathbf{v}}^k\|\leq \dfrac{\epsilon}{Ln_0} . \end{aligned}$

(5.83)

Applying Lemma 5.3 with 𝜖 ₁ = 𝜖∕(Ln ₀), S ₂ = 2σ∕(𝜖n ₀), and K = k − k ₀ ≤ q = σn ₀∕𝜖, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{k_0}\left\|{\mathbf{v}}^{k} - \nabla f({\mathbf{x}}^{k}) \right\|{}^2 \leq \frac{ \sigma n_0 }{\epsilon}\cdot L^2 \cdot \left(\frac{\epsilon}{L n_0}\right)^2 \cdot \frac{\epsilon n_0}{2\sigma} + \mathbb{E}_{k_0} \left\|{\mathbf{v}}^{k_0} - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2 \overset{a}{\leq} \epsilon^2 , \end{array} \end{aligned}$

where $\overset {a}{\leq }$ uses (5.82). The proof is completed. □

Lemma 5.5

Setting k ₀ = ⌊k∕q⌋⋅ q, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{k_0} \left[ f({\mathbf{x}}^{k+1}) - f({\mathbf{x}}^k) \right] \leq -\frac{\epsilon }{4Ln_0} \mathbb{E}_{k_0} \left\|{\mathbf{v}}^k\right\| + \frac{3\epsilon^2}{4n_0 L }. \end{array} \end{aligned}$

(5.84)

Proof

We have

$\displaystyle \begin{aligned} \begin{array}{rcl} \begin{aligned} \| \nabla f(\mathbf{x}) - \nabla f(\mathbf{y}) \|{}^2 = \left\| \mathbb{E}_i \left(\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\right) \right\|{}^2&\displaystyle \leq \mathbb{E}_i \|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y}) \|{}^2\\ &\displaystyle \leq L^2 \|\mathbf{x} - \mathbf{y}\|{}^2. \end{aligned} \end{array} \end{aligned}$

So f(x) is L-smooth, then

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{x}}^{k+1}) &\displaystyle \leq&\displaystyle f({\mathbf{x}}^k)+\left\langle\nabla f({\mathbf{x}}^k), {\mathbf{x}}^{k+1}-{\mathbf{x}}^k \right\rangle+\frac{L}{2}\left\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^k\right\|{}^2\\ &\displaystyle =&\displaystyle f({\mathbf{x}}^k) - \eta_k \left\langle\nabla f({\mathbf{x}}^k), {\mathbf{v}}^k \right\rangle +\frac{L \eta_k^2}{2}\left\|{\mathbf{v}}^k\right\|{}^2\\ &\displaystyle =&\displaystyle f({\mathbf{x}}^k) -\eta_k \left(1-\frac{\eta_k L}{2}\right)\left\|{\mathbf{v}}^k\right\|{}^2- \eta_k\left\langle \nabla f({\mathbf{x}}^k)-{\mathbf{v}}^k,{\mathbf{v}}^k\right\rangle\\ &\displaystyle \overset{a}{\leq}&\displaystyle f({\mathbf{x}}^k) -\eta_k\left(\frac{1}{2}-\frac{\eta_k L}{2}\right)\left\|{\mathbf{v}}^k \right\|{}^2+\frac{\eta_k}{2}\left\|{\mathbf{v}}^k - \nabla f({\mathbf{x}}^k)\right\|{}^2,\quad \end{array} \end{aligned}$

(5.85)

where in $\overset {a}\leq$ we apply the Cauchy–Schwartz inequality. Since

$\displaystyle \begin{aligned}\eta_k= \min\left(\frac{\epsilon}{Ln_0\|{\mathbf{v}}^k\|} ,\frac{1}{2Ln_0}\right) \leq \frac{1}{2Ln_0}\leq \frac{1}{2L},\end{aligned}$

we have

$\displaystyle \begin{aligned} \eta_k\left(\frac{1}{2}-\frac{\eta_k L}{2}\right)\left\|{\mathbf{v}}^k \right\|{}^2 &\geq \frac{1}{4} \eta_k \left\|{\mathbf{v}}^k\right\|{}^2\\ &= \frac{\epsilon^2}{8n_0L } \min\left( 2\left\| \frac{{\mathbf{v}}^k} {\epsilon}\right\|, \left\| \frac{{\mathbf{v}}^k}{\epsilon}\right\|{}^2 \right) \overset{a}{\geq} \frac{\epsilon\|{\mathbf{v}}^k\|- 2\epsilon^2}{4n_0L}, \end{aligned}$

where in $\overset {a}\geq$ we use $V(x) = \min \left (| x|, \frac {x^2}{2}\right ) \geq |x | - 2$ for all x. Hence

$\displaystyle \begin{aligned} \begin{array}{rcl} f({\mathbf{x}}^{k+1}) &\displaystyle \leq&\displaystyle f({\mathbf{x}}^k) - \frac{\epsilon\|{\mathbf{v}}^k\|}{4Ln_0} + \frac{\epsilon^2}{2n_0L} + \frac{\eta_k}{2}\left\|{\mathbf{v}}^k - \nabla f({\mathbf{x}}^k)\right\|{}^2\\ &\displaystyle \overset{a}{\leq}&\displaystyle f({\mathbf{x}}^k) - \frac{\epsilon\|{\mathbf{v}}^k\|}{4Ln_0} + \frac{\epsilon^2}{2n_0L} + \frac{1}{4Ln_0}\left\|{\mathbf{v}}^k - \nabla f({\mathbf{x}}^k)\right\|{}^2, \end{array} \end{aligned}$

(5.86)

where $\overset {a}{\leq }$ uses $\eta _k\leq \frac {1}{2Ln_0}$ .

Taking expectation on the above display and using Lemma 5.4, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{k_0} f({\mathbf{x}}^{k+1}) - \mathbb{E}_{k_0} f({\mathbf{x}}^k) \leq -\frac{\epsilon }{4Ln_0} \mathbb{E}_{k_0} \left\|{\mathbf{v}}^k\right\|+\frac{3\epsilon^2}{4Ln_0}. \end{array} \end{aligned}$

(5.87)

□

Lemma 5.6

For all k ≥ 0, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E} \left\|\nabla f({\mathbf{x}}^k)\right\| \leq \mathbb{E} \left\|{\mathbf{v}}^k \right\|+ \epsilon. \end{array} \end{aligned}$

(5.88)

Proof

By taking the total expectation on (5.81), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\left\|{\mathbf{v}}^{k} -\nabla f({\mathbf{x}}^k) \right\|{}^2 \leq \epsilon^2. \end{array} \end{aligned}$

(5.89)

Then by Jensen’s inequality (Proposition A.3),

$\displaystyle \begin{aligned}\left( \mathbb{E} \left\|{\mathbf{v}}^{k} -\nabla f({\mathbf{x}}^k)\right\| \right)^2 \le \mathbb{E} \left\|{\mathbf{v}}^{k} -\nabla f({\mathbf{x}}^k) \right\|{}^2 \leq \epsilon^2 . \end{aligned}$

So using the triangle inequality,

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\left\| \nabla f({\mathbf{x}}^k)\right \| &\displaystyle =&\displaystyle \mathbb{E} \left\|{\mathbf{v}}^{k} - ({\mathbf{v}}^{k} -\nabla f({\mathbf{x}}^k) ) \right \|\\ &\displaystyle \leq&\displaystyle \mathbb{E} \left\|{\mathbf{v}}^{k} \right\| + \mathbb{E} \left\|{\mathbf{v}}^{k} -\nabla f({\mathbf{x}}^k) \right\|\leq \mathbb{E} \left\|{\mathbf{v}}^{k} \right\| + \epsilon. \end{array} \end{aligned}$

(5.90)

This completes our proof. □

Now, we are ready to prove Theorem 5.7.

Proof of Theorem 5.7

Taking full expectation on (5.84) and telescoping the results from k = 0 to K − 1, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \frac{\epsilon}{4Ln_0}\sum_{k=0}^{K-1}\mathbb{E} \left\|{\mathbf{v}}^k \right\| \leq f({\mathbf{x}}^0) - \mathbb{E} f({\mathbf{x}}^K) + \frac{3K\epsilon^2}{4Ln_0}\overset{a}{\leq} \Delta + \frac{3K\epsilon^2}{4Ln_0}, \end{array} \end{aligned}$

(5.91)

where $\overset {a}{\leq }$ uses $\mathbb {E} f({\mathbf {x}}^K)\geq f^*$ .

Dividing both sides of (5.91) by $\frac {\epsilon }{4Ln_0}K$ and using $K = \lfloor \frac {4L\Delta n_0}{\epsilon ^2}\rfloor +1 \geq \frac {4L\Delta n_0}{\epsilon ^2}$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E} \left\|{\mathbf{v}}^k\right \| \leq \Delta \cdot \frac{4Ln_0}{\epsilon}\frac{1}{K} +3\epsilon\leq 4\epsilon. \end{array} \end{aligned}$

(5.92)

Then from the choice of $\tilde {\mathbf {x}}$ in Line 17 of Algorithm 5.5, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E} \| \nabla f(\tilde{\mathbf{x}})\| = \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E} \left\|\nabla f({\mathbf{x}}^k) \right\| \overset{a}{\leq} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E} \left\|{\mathbf{v}}^k \right\| + \epsilon \overset{b}{\leq} 5\epsilon, \end{array} \end{aligned}$

(5.93)

where $\overset {a}{\leq }$ and $\overset {b}{\leq }$ use (5.88) and (5.92), respectively.

To compute the gradient cost, note that in each q iteration we access for one time of S ₁ stochastic gradients and for q times of 2S ₂ stochastic gradients, hence the cost is

$\displaystyle \begin{aligned} \begin{array}{rcl} \left\lceil K \cdot \frac{1}{q}\right\rceil S_1 + 2K S_2 &\displaystyle \overset{a}{\le}&\displaystyle 3K \cdot S_2 + S_1 \\ &\displaystyle \leq&\displaystyle \left[3\left(\frac{4Ln_0\Delta}{\epsilon^2} \right)+2\right] \frac{2\sigma}{\epsilon n_0} + \frac{2\sigma^2}{\epsilon^2}\\ &\displaystyle =&\displaystyle \frac{24L\sigma \Delta}{\epsilon^3}+ \frac{4\sigma}{n_0\epsilon} +\frac{2\sigma^2}{\epsilon^2}, \end{array} \end{aligned}$

(5.94)

where $\overset {a}{\le }$ uses S ₁ = qS ₂. This concludes a gradient cost of $24L\Delta \sigma \epsilon ^{-3}+2\sigma ^2\epsilon ^{-2}+ 4\sigma n_0^{-1}\epsilon ^{-1}$ . □

Theorem 5.8

For the optimization problem (5.22) in the finite-sum case (n < ∞), assume that each f _i(x) is L-smooth, set the parameters S ₂, η _k, and q as

$\displaystyle \begin{aligned} \begin{array}{rcl}{} S_2 = \frac{n^{1/2}}{n_0} ,~ \eta = \frac{\epsilon}{Ln_0} ,~ \eta_k = \min \left( \frac{\epsilon}{Ln_0\|{\mathbf{v}}^k\|}, \frac{1}{2L n_0} \right) ,~ q = n_0n^{1/2} , \end{array} \end{aligned}$

(5.95)

set $K = \left \lfloor (4L \Delta n_0)\epsilon ^{-2} \right \rfloor +1$ , and let S ₁ = n, i.e., we obtain the full gradient in Line 4. Then running Algorithm 5.5 with OPTION II for K iterations outputs an $\tilde {\mathbf {x}}$ satisfying

$\displaystyle \begin{aligned} \mathbb{E}\| \nabla f(\tilde{\mathbf{x}}) \|\leq 5\epsilon . \end{aligned}$

The gradient cost is bounded by $n+12(L\Delta ) \cdot n^{1/2} \epsilon ^{-2}+ 2n_0^{-1} n^{1/2}$ for any choice of n ₀ ∈ [1, n ^1∕2]. Treating Δ, L, and σ as positive constants, the stochastic gradient complexity is O(n + n ^1∕2𝜖 ⁻²).

Proof

For k = k ₀, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \mathbb{E}_{k_0} \left\|{\mathbf{v}}^{k_0} - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2 = \mathbb{E}_{k_0}\left\| \nabla f({\mathbf{x}}^{k_0}) - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2 = 0 . \end{array} \end{aligned}$

(5.96)

For k ≠ k ₀, applying Lemma 5.3 with $\epsilon _1 =\frac {\epsilon }{L n_0}$ , $S_2 = \frac {n^{1/2}}{\epsilon n_0}$ , and K = k − k ₀ ≤ q = n ₀n ^1∕2, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{k_0}\left\|{\mathbf{v}}^{k} - \nabla f({\mathbf{x}}^{k}) \right\|{}^2 &\displaystyle \leq&\displaystyle n_0n^{1/2}\cdot L^2 \cdot\left(\frac{\epsilon}{L n_0}\right)^2\cdot\frac{ n_0}{n^{1/2}}+ \mathbb{E}_{k_0} \left\|{\mathbf{v}}^{k_0} - \nabla f({\mathbf{x}}^{k_0})\right\|{}^2\\ &\displaystyle \overset{a}{=}&\displaystyle \epsilon^2, \end{array} \end{aligned}$

where $\overset {a}{=}$ uses (5.96). So Lemma 5.4 holds for all k. Then from the same technique of the online case (n = ∞), we can also obtain (5.83), (5.84), and (5.93). The gradient cost analysis is computed as

$\displaystyle \begin{aligned} \begin{array}{rcl} \left\lceil K \cdot \frac{1}{q}\right\rceil S_1 + 2K S_2 &\displaystyle \overset{a}{\le}&\displaystyle 3K \cdot S_2 + S_1 \\ &\displaystyle \leq&\displaystyle \left[3\left(\frac{4Ln_0\Delta}{\epsilon^2} \right)+2\right] \frac{n^{1/2}}{n_0} + n \\ &\displaystyle =&\displaystyle \frac{12(L\Delta) \cdot n^{1/2} }{\epsilon^2}+ \frac{2n^{1/2}}{n_0} + n, \end{array} \end{aligned}$

(5.97)

where $\overset {a}{\le }$ uses S ₁ = qS ₂. This concludes a gradient cost of $n+12(L\Delta ) \cdot n^{1/2} \epsilon ^{-2}+ 2n^{-1}_0 n^{1/2}$ . □

5.3.2 Momentum Acceleration

When computing a first-order stationary point, SPIDER is actually (nearly) optimal, if only with the gradient-smoothness condition under certain regimes. Thus only with the condition of smoothness of the gradient, it is hard to apply the momentum techniques to accelerate algorithms. However, one can obtain a faster rate with an additional assumption on Hessian:

Assumption 5.1

Each f _i(x) has ρ-Lipschitz continuous Hessians (Definition A.14 ).

The technique to accelerate nonconvex algorithms is briefly described as follows:

Run an efficient Negative Curvature Search (NC-Search) iteration to find an δ-approximate negative Hessian direction w ₁ using stochastic gradients,² e.g., the shift-and-invert technique in [9].
If NC-Search finds a w ₁, update x ^k+1 ←x ^k ± (δ∕ρ)w ₁.
If not, solve the INC problem:
$\displaystyle \begin{aligned}{\mathbf{x}}^{k+1} =\operatorname*{\mathrm{argmin}}_{\mathbf{x}}\left(f(\mathbf{x}) +\frac{\Omega(\delta)}{2}\|\mathbf{x}-{\mathbf{x}}^k\|{}^2\right),\end{aligned}$
using a momentum acceleration technique, e.g., Catalyst [12] described in Sect. 5.1.4. If ∥x ^k+1 −x ^k∥≥ Ω(δ), return to Step 1, otherwise output x ^k+1.

We informally list the convergence result as follows:

Theorem 5.9

Suppose solving NC-Search by [ 9] and INC blocks by Catalyst described in Theorem 5.6, the total stochastic gradient complexity to achieve an 𝜖-accuracy solution satisfying ∥∇f(x _k)∥≤ 𝜖 and $\lambda _{\min }(\nabla ^2 f({\mathbf {x}}_k) )\geq -\sqrt {\epsilon }$ is $\tilde {O}(n^{3/4}\epsilon ^{-1.75})$ .

The proof of Theorem 5.9 is lengthy, so we omit the proof here. We mention that when n is large, e.g., n ≥ 𝜖 ⁻¹, the above method might not be faster than SPIDER [7]. Thus the existing lowest complexity to find a first-order stationary point is $\tilde {O}(\min (n^{3/4}\epsilon ^{-1.75}, n^{1/2}\varepsilon ^{-2}, \varepsilon ^{-3}))$ . In fact, for the problem of searching a stationary point in nonconvex (stochastic) optimization, neither the upper nor the lower bounds have been well studied up to now. However, it is a very hot topic recently and has aroused a lot of attention in both the optimization and the machine learning communities due to the empirical practicability for nonconvex models. Interested readers may refer to [3, 7, 9] for some latest developments.

5.4 Constrained Problem

Algorithm 5.6 Inner loop of Acc-SADMM

../images/487003_1_En_5_Chapter/487003_1_En_5_Figf_HTML.png

We extend the stochastic acceleration methods to solve the constrained problem in this section. As a simple example, we consider the convex finite-sum problem with linear constraints:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{{\mathbf{x}}_1,{\mathbf{x}}_2} &\displaystyle &\displaystyle h_1({\mathbf{x}}_1)+f_1({\mathbf{x}}_1) +h_2({\mathbf{x}}_2)+\frac{1}{n}\sum_{i=1}^n f_{2,i}({\mathbf{x}}_2),\\ s.t. &\displaystyle &\displaystyle {\mathbf{A}}_1 {\mathbf{x}}_1 +{\mathbf{A}}_2 {\mathbf{x}}_2 = {\mathbf{b}}, \end{array} \end{aligned}$

(5.98)

where f ₁(x ₁) and f _2,i(x ₂) with i ∈ [n] are convex and have Lipschitz continuous gradients, and h ₁(x ₁) and h ₂(x ₂) are also convex and their proximal mappings can be solved efficiently. We use L ₁ to denote the Lipschitz constant of f ₁(x ₁), and L ₂ to denote the Lipschitz constant of f _2,i(x ₂) with i ∈ [n], and $f_2(\mathbf {x})=\frac {1}{n}\sum _{i=1}^n f_{2,i}(\mathbf {x})$ . We show that by fusing the VR technique and momentum , the convergence rate can be improved to be non-ergodic O(1∕K).

We list the notations and variables in Table 5.1. The algorithm has double loops: in the inner loop, we update primal variables ${\mathbf {x}}^k_{s,1}$ and ${\mathbf {x}}^k_{s,2}$ through extrapolation terms ${\mathbf {y}}_{s,1}^{k}$ and ${\mathbf {y}}_{s,2}^{k}$ and the dual variable $\boldsymbol \lambda _s^k$ ; in the outer loop, we maintain snapshot vectors $\tilde {\mathbf {x}}_{s+1,1}$ , $\tilde {\mathbf {x}}_{s+1,2}$ , and $\tilde {{\mathbf {b}}}_{s+1}$ , and then assign the initial value to the extrapolation terms ${\mathbf {y}}^0_{s+1,1}$ and ${\mathbf {y}}^0_{s+1,2}$ . The whole algorithm is shown in Algorithm 5.7. In the process of solving primal variables, we linearize both the smooth term f _i(x _i) and the augmented term $\frac {\beta }{2}\|{\mathbf {A}}_1{\mathbf {x}}_1+{\mathbf {A}}_2{\mathbf {x}}_2-{\mathbf {b}}+\frac {\boldsymbol \lambda }{\beta } \|{ }^2$ . The update rules of x ₁ and x ₂ can be written as

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{x}}^{k+1}_{s,1} &\displaystyle =&\displaystyle \operatorname*{\mathrm{argmin}}_{{\mathbf{x}}_1} h_1({\mathbf{x}}_1)+\left\langle \nabla f_1({\mathbf{y}}^{k}_{s,1}),{\mathbf{x}}_1\right\rangle \\ &\displaystyle &\displaystyle + \left\langle \frac{\beta}{\theta_{1,s}}\left({\mathbf{A}}_1{\mathbf{y}}_{s,1}^k+{\mathbf{A}}_2{\mathbf{y}}_{s,2}^k-{\mathbf{b}}\right)+\boldsymbol\lambda^k_s,{\mathbf{A}}_1{\mathbf{x}}_1 \right\rangle \\ &\displaystyle &\displaystyle +\left(\frac{L_1}{2}+\frac{\beta \left\|{\mathbf{A}}_1^T{\mathbf{A}}_1\right\|}{2\theta_{1,s}}\right)\left\|{\mathbf{x}}_1-{\mathbf{y}}^{k}_{s,1}\right\|{}^2 \end{array} \end{aligned}$

(5.99)

and

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{x}}^{k+1}_{s,2} &\displaystyle =&\displaystyle \operatorname*{\mathrm{argmin}}_{{\mathbf{x}}_2} h_2({\mathbf{x}}_2)+\left\langle \tilde{\nabla} f_2({\mathbf{y}}^{k}_{s,1}),{\mathbf{x}}_2 \right\rangle\\ &\displaystyle &\displaystyle + \left\langle \frac{\beta}{\theta_{1,s}}\left({\mathbf{A}}_1{\mathbf{x}}_{s,1}^{k+1}+{\mathbf{A}}_2{\mathbf{y}}_{s,2}^k-{\mathbf{b}}\right) +\boldsymbol\lambda^k_s,{\mathbf{A}}_2{\mathbf{x}}_2 \right\rangle\\ &\displaystyle &\displaystyle +\left(\frac{\left(1+\frac{1}{b\theta_2}\right)L_2}{2}+\frac{\beta \left\|{\mathbf{A}}_2^T{\mathbf{A}}_2\right\|}{2\theta_{1,s}}\right)\left\|{\mathbf{x}}_2-{\mathbf{y}}^{k}_{s,2}\right\|{}^2, \end{array} \end{aligned}$

(5.100)

where $\tilde {\nabla } f_2({\mathbf {y}}_{s,2}^k)$ is defined as

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \tilde{\nabla} f_2({\mathbf{y}}_{s,2}^k)=\frac{1}{b}\sum_{i_{k,s}\in \mathbb{I}_{k,s}}\left(\nabla f_{2,i_{k,s}}({\mathbf{y}}_{s,2}^k)- \nabla f_{2,i_{k,s}}(\tilde{\mathbf{x}}_{s,2}) +\nabla f_2(\tilde{\mathbf{x}}_{s,2})\right), \end{array} \end{aligned}$

in which $\mathbb {I}_{k,s}$ is a mini-batch of indices randomly chosen from [n] with a size of b.

Table 5.1

Notations and variables

Notation	Meaning	Variable	Meaning
〈x, y〉_G, ∥x∥_G	${\mathbf {x}}^T \mathbf {G} \mathbf {y},\sqrt {{\mathbf {x}}^T \mathbf {G} \mathbf {x}}$	${\mathbf {y}}^k_{s,1},{\mathbf {y}}^k_{s,2}$	Extrapolation variables
F _i(x _i)	h _i(x _i) + f _i(x _i)	${\mathbf {x}}^k_{s,1},{\mathbf {x}}^k_{s,2}$	Primal variables
x	$({\mathbf {x}}_1^T,{\mathbf {x}}_2^T)^T$	$\tilde {\boldsymbol {\lambda }}^k_{s},\boldsymbol {\lambda }^k_{s},\hat {\boldsymbol {\lambda }}^k$	Dual and temporary variables
y	$({\mathbf {y}}_1^T,{\mathbf {y}}_2^T)^T$	$\tilde {\mathbf {x}}_{s,1},\tilde {\mathbf {x}}_{s,2},\tilde {{\mathbf {b}}}_s$	Snapshot vectors
F(x)	F ₁(x ₁) + F ₂(x ₂)		used for VR
A	[A ₁, A ₂]	$({\mathbf {x}}_1^, {\mathbf {x}}_2^, \boldsymbol {\lambda }^*)$	KKT point of (5.98)
$\mathbb {I}_{k,s}$	Mini-batch indices	b	Batch size

Algorithm 5.7 Accelerated stochastic alternating direction method of multiplier (Acc-SADMM)

../images/487003_1_En_5_Chapter/487003_1_En_5_Figg_HTML.png

Now, we give the convergence result. The main property of Acc-SADMM (Algorithm 5.7) in the inner loop is shown below.

Lemma 5.7

For Algorithm 5.6, in any epoch with fixed s (for simplicity we drop the subscript s throughout the proof unless necessary), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}_{i_k}L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \theta_2 L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -(1-\theta_2 - \theta_1)L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\displaystyle &\displaystyle \quad \leq \frac{\theta_1}{2\beta}\left(\left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \mathbb{E}_{i_k}\left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2\right)\\ &\displaystyle &\displaystyle \qquad +\frac{1}{2}\left\|{\mathbf{y}}_1^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\displaystyle &\displaystyle \qquad -\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}_1^{k+1}-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\displaystyle &\displaystyle \qquad + \frac{1}{2} \left\|{\mathbf{y}}_2^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^*\right\|{}^2_{{\mathbf{G}}_2}\\ &\displaystyle &\displaystyle \qquad -\frac{1}{2} \mathbb{E}_{i_k} \left\|{\mathbf{x}}_2^{k+1}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^*\right\|{}^2_{{\mathbf{G}}_2},{} \end{array} \end{aligned}$

(5.102)

where $\mathbb {E}_{i_k}$ denotes that the expectation is taken over the random samples in the mini-batch $\mathbb {I}_{k,s}$ , L(x ₁, x ₂, λ) = F ₁(x ₁) + F ₂(x ₂) + 〈λ, A ₁x ₁ + A ₂x ₂ −b〉 is the Lagrangian function , $\hat {\boldsymbol \lambda }^k=\tilde {\boldsymbol \lambda }^{k} +\frac {\beta (1-\theta _{1})}{ \theta _{1}}(\mathbf {A}{\mathbf {x}}^k-{\mathbf {b}}),$ ${\mathbf {G}}_1=\left (L_1+\frac {\beta \left \|{\mathbf {A}}_1^T {\mathbf {A}}_1\right \|}{\theta _1}\right )\mathbf {I}-\frac {\beta {\mathbf {A}}_1^T{\mathbf {A}}_1}{\theta _1}$ , and ${\mathbf {G}}_2=\left (\left (1+\frac {1}{b\theta _2}\right ) L_2+\frac {\beta \left \|{\mathbf {A}}_2^T {\mathbf {A}}_2\right \|}{\theta _1}\right )\mathbf {I}$ . Other notations can be found in Table 5.1.

Proof

Step 1: We first analyze x ₁. By the optimality of ${\mathbf {x}}_1^{k+1}$ in (5.99) and the convexity of F ₁(⋅), we can obtain

(5.103)

We prove (5.103) below.

By the same proof technique of Lemma 5.2 for Katyusha (Algorithm 5.3), we can bound the variance through

$\displaystyle \begin{aligned} \mathbb{E}_{i_k}\left\|\nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}_2^k)\right\|{}^2\leq \frac{2L_2}{b} \left[ f_2(\tilde{\mathbf{x}}_2) - f_2({\mathbf{y}}^k_2) - \langle \nabla f_2({\mathbf{y}}^k_2), \tilde{\mathbf{x}}_2-{\mathbf{y}}^k_2 \rangle \right]\!. \end{aligned}$

(5.104)

Set

$\displaystyle \begin{aligned} {\bar{\boldsymbol\lambda}}({\mathbf{x}}_1,{\mathbf{x}}_2) = \boldsymbol\lambda^k+\frac{\beta}{\theta_1}\left({\mathbf{A}}_1 {\mathbf{x}}_1 +{\mathbf{A}}_2 {\mathbf{x}}_2-{\mathbf{b}}\right). \end{aligned}$

For the optimality solution of ${\mathbf {x}}_1^{k+1}$ in (5.99), we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ142_HTML.png

(5.105)

Since f ₁ is L ₁-smooth, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} f_1({\mathbf{x}}^{k+1}_1) &\displaystyle \leq&\displaystyle f_1({\mathbf{y}}^{k}_1) + \left\langle \nabla f_1({\mathbf{y}}^{k}_1), {\mathbf{x}}^{k+1}_1 -{\mathbf{y}}^{k}_1 \right\rangle +\frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1 -{\mathbf{y}}^{k}_1 \right \|{}^2\\ &\displaystyle \overset{a}\leq&\displaystyle f_1({\mathbf{u}}_1) + \left\langle \nabla f_1({\mathbf{y}}^{k}_1), {\mathbf{x}}^{k+1}_1 -{\mathbf{u}}_1 \right\rangle + \frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1 -{\mathbf{y}}^{k}_1 \right \|{}^2 \\ &\displaystyle \overset{b}\leq&\displaystyle f_1({\mathbf{u}}_1) - \left\langle \partial h_1({\mathbf{x}}^{k+1}_1) , {\mathbf{x}}^{k+1}_1-{\mathbf{u}}_1\rangle-\langle{\mathbf{A}}_1^T{\bar{\boldsymbol\lambda}}({\mathbf{y}}^k_1,{\mathbf{y}}^k_2) ,{\mathbf{x}}^{k+1}_1-{\mathbf{u}}_1 \right\rangle\\ &\displaystyle &\displaystyle -\left(L_1+\frac{\beta \left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right) \left\langle {\mathbf{x}}^{k+1}_1-{\mathbf{y}}^{k}_1, {\mathbf{x}}^{k+1}_1-{\mathbf{u}}_1\right\rangle+\frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right\|{}^2, \end{array} \end{aligned}$

where u ₁ is an arbitrary variable. In the inequality $\overset {a}\leq$ we use the fact that f ₁(⋅) is convex and so $f_1({\mathbf {y}}^k_1)\leq f_1({\mathbf {u}}_1)+ \langle \nabla f_1({\mathbf {y}}^k_1), {\mathbf {y}}_1^{k}-{\mathbf {u}}_1\rangle$ . The inequality $\overset {b}\leq$ uses (5.105). Then the convexity of h ₁(⋅) gives $h_1({\mathbf {x}}^{k+1}_1)\leq h_1({\mathbf {u}}_1)+ \langle \partial h_1({\mathbf {x}}_1^{k+1}), {\mathbf {x}}^{k+1}_1-{\mathbf {u}}_1\rangle$ . So we have

$\displaystyle \begin{aligned} \begin{array}{rcl} F_1({\mathbf{x}}^{k+1}_1)&\displaystyle \leq&\displaystyle F_1({\mathbf{u}}_1) -\left\langle{\mathbf{A}}_1^T{\bar{\boldsymbol\lambda}}({\mathbf{y}}^k_1,{\mathbf{y}}^k_2) ,{\mathbf{x}}^{k+1}_1-{\mathbf{u}}_1 \right \rangle+\frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right \|{}^2\\ &\displaystyle &\displaystyle -\left(L_1+\frac{\beta \left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right) \left\langle {\mathbf{x}}^{k+1}_1-{\mathbf{y}}^{k}_1, {\mathbf{x}}^{k+1}_1-{\mathbf{u}}_1\right\rangle. \end{array} \end{aligned}$

Setting u ₁ be ${\mathbf {x}}_1^{k}$ , $\tilde {\mathbf {x}}_1$ , and ${\mathbf {x}}_1^*$ , respectively, then multiplying the three inequalities by (1 − θ ₁ − θ ₂), θ ₂, and θ ₁, respectively, and adding them, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle F_1({\mathbf{x}}^{k+1}_1)\\ &\displaystyle &\displaystyle \quad \leq (1-\theta_1-\theta_2)F_1({\mathbf{x}}^k_1)+ \theta_2 F_1(\tilde{\mathbf{x}}_1)+\theta_1 F_1({\mathbf{x}}_1^*)+\frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\left\langle {\mathbf{A}}_1^T{\bar{\boldsymbol\lambda}}({\mathbf{y}}^k_1,{\mathbf{y}}^k_2) ,{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2\tilde{\mathbf{x}}_1 -\theta_1{\mathbf{x}}_1^* \right\rangle \\ &\displaystyle &\displaystyle \qquad -\left(L_1+\frac{\beta \left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right) \left\langle {\mathbf{x}}^{k+1}_1-{\mathbf{y}}^{k}_1, {\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2 \tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}= (1-\theta_1-\theta_2)F_1({\mathbf{x}}^k_1)+ \theta_2 F_1(\tilde{\mathbf{x}}_1)+\theta_1 F_1({\mathbf{x}}_1^*)+\frac{L_1}{2}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right\|{}^2\\ &\displaystyle &\displaystyle \qquad -\left\langle {\mathbf{A}}_1^T{\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{y}}^k_2) ,{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2 \tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^* \right\rangle \\ &\displaystyle &\displaystyle \qquad - \left\langle {\mathbf{x}}^{k+1}_1-{\mathbf{y}}^{k}_1, {\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}^k_1-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}^*\right\rangle_{{\mathbf{G}}_1}, \end{array} \end{aligned}$

(5.106)

where in the equality $\overset {a}\leq$ we replace ${\mathbf {A}}_1^T{\bar {\boldsymbol \lambda }}({\mathbf {y}}^k_1,{\mathbf {y}}^k_2)$ with ${\mathbf {A}}_1^T{\bar {\boldsymbol \lambda }}({\mathbf {x}}^{k+1}_1,{\mathbf {y}}^k_2)-\frac {\beta {\mathbf {A}}^T_1{\mathbf {A}}_1}{\theta _1}({\mathbf {x}}^{k+1}_1-{\mathbf {y}}^k_1)$ .

Step 2: We next analyze x ₂. By the optimality of ${\mathbf {x}}_2^{k+1}$ in (5.100) and the convexity of F ₂(⋅), we can obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ146_HTML.png

(5.107)

We prove (5.107) below.

For the optimality of ${\mathbf {x}}_2^{k+1}$ in (5.100), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \left(\alpha L_2+\frac{\beta \left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1} \right) \left({\mathbf{x}}^{k+1}_2-{\mathbf{y}}^{k}_2\right) + \tilde{\nabla} f_2({\mathbf{y}}^k_2) +{\mathbf{A}}_2^T{\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{y}}^k_2)\\ &\displaystyle &\displaystyle \quad \in-\partial h_2({\mathbf{x}}^{k+1}_2), \end{array} \end{aligned}$

(5.108)

where we set $\alpha = 1+\frac {1}{b\theta _2}$ . Since f ₂ is L ₂-smooth, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f_2({\mathbf{x}}^{k+1}_2)&\displaystyle \leq&\displaystyle f_2({\mathbf{y}}^k_2) + \left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{x}}^{k+1}_2 -{\mathbf{y}}^k_2 \right\rangle+\frac{L_2}{2}\left\|{\mathbf{x}}^{k+1}_2 -{\mathbf{y}}^k_2 \right\|{}^2. \end{array} \end{aligned}$

(5.109)

We first consider $\langle \nabla f_2({\mathbf {y}}^k_2), {\mathbf {x}}^{k+1}_2 -{\mathbf {y}}^k_2 \rangle$ and have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2 \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}=\left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{u}}_2-{\mathbf{y}}^k_2+{\mathbf{x}}^{k+1}_2-{\mathbf{u}}_2 \right\rangle \\ &\displaystyle &\displaystyle \quad \overset{b}= \left\langle \nabla f_2({\mathbf{y}}^k_2),{\mathbf{u}}_2-{\mathbf{y}}^k_2\right\rangle -\theta_3\left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{y}}^k_2-\tilde{\mathbf{x}}_2 \right\rangle +\left\langle \nabla f_2({\mathbf{y}}^k_2),{\mathbf{z}}^{k+1}-{\mathbf{u}}_2\right\rangle\\ &\displaystyle &\displaystyle \quad = \left\langle \nabla f_2({\mathbf{y}}^k_2),{\mathbf{u}}_2-{\mathbf{y}}^k_2\right\rangle -\theta_3\left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{y}}^k_2-\tilde{\mathbf{x}}_2 \right\rangle \\ &\displaystyle &\displaystyle \qquad + \left\langle \tilde{\nabla} f_2({\mathbf{y}}^k_2),{\mathbf{z}}^{k+1}-{\mathbf{u}}_2\right\rangle+ \left\langle \nabla f_2({\mathbf{y}}^k_2)-\tilde{\nabla} f_2({\mathbf{y}}^k_2),{\mathbf{z}}^{k+1}-{\mathbf{u}}_2\right\rangle, \end{array} \end{aligned}$

(5.110)

where in the equality $\overset {a}=$ we introduce an arbitrary variable u ₂ (we will set it to be ${\mathbf {x}}_2^{k}$ , $\tilde {\mathbf {x}}_2$ , and ${\mathbf {x}}_2^*$ ) and in the equality $\overset {b}=$ we set

$\displaystyle \begin{aligned} \begin{array}{rcl}{} {\mathbf{z}}^{k+1} = {\mathbf{x}}^{k+1}_2 +\theta_3 ( {\mathbf{y}}^k_2 - \tilde{\mathbf{x}}_2 ), \end{array} \end{aligned}$

(5.111)

in which θ ₃ is an absolute constant determined later. For $\langle \tilde {\nabla } f_2({\mathbf {y}}_2^k),{\mathbf {z}}^{k+1}-{\mathbf {u}}_2\rangle$ , we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ151_HTML.png

(5.112)

where in the equalities $\overset {a}=$ and $\overset {b}=$ , we use (5.108) and (5.111), respectively. The inequality $\overset {d}=$ uses (5.108) again. The inequality $\overset {c}\leq$ uses the convexity of h ₂:

$\displaystyle \begin{aligned}\left\langle\partial h_2({\mathbf{x}}^{k+1}_2), \mathbf{w}-{\mathbf{x}}^{k+1}_2\right\rangle \leq h_2(\mathbf{w})-h_2({\mathbf{x}}_2^{k+1}), \quad \mathbf{w}={\mathbf{u}}_2,\tilde{\mathbf{x}}_2.\end{aligned}$

Rearranging terms in (5.112) and using $\tilde {\nabla } f_2({\mathbf {y}}_2^k) {=}\nabla f_2({\mathbf {y}}_2^k) +\left (\tilde {\nabla } f_2({\mathbf {y}}_2^k){-} \nabla f_2({\mathbf {y}}_2^k)\right )$ , we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ152_HTML.png

(5.113)

Substituting (5.113) in (5.110), we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle (1+\theta_3) \left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2 \right\rangle\\ &\displaystyle &\displaystyle \quad = \left\langle \nabla f_2({\mathbf{y}}^k_2),{\mathbf{u}}_2-{\mathbf{y}}^k_2\right\rangle -\theta_3\left\langle \nabla f_2({\mathbf{y}}^k_2), {\mathbf{y}}^k_2-\tilde{\mathbf{x}}_2 \right\rangle+h_2({\mathbf{u}}_2)-h_2({\mathbf{x}}_2^{k+1})\\ &\displaystyle &\displaystyle \qquad + \theta_3h_2(\tilde{\mathbf{x}}_2)-\theta_3h_2({\mathbf{x}}_2^{k+1})\\ &\displaystyle &\displaystyle \qquad - \left\langle {\mathbf{A}}_2^T{\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{y}}^k_2)+ \left(\alpha L_2+\frac{\beta \left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1} \right)\left({\mathbf{x}}^{k+1}_2-{\mathbf{y}}^{k}_2\right),\right.\\ &\displaystyle &\displaystyle \qquad \quad {\mathbf{z}}^{k+1}-{\mathbf{u}}_2 +\theta_3( {\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2) \Bigg\rangle\\ &\displaystyle &\displaystyle \qquad +\left\langle \nabla f_2({\mathbf{y}}^k_2)-\tilde{\nabla} f_2({\mathbf{y}}^k_2),\theta_3({\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2)+{\mathbf{z}}^{k+1}-{\mathbf{u}}_2\right\rangle. \end{array} \end{aligned}$

(5.114)

Multiplying (5.109) by (1 + θ ₃) and then adding (5.114), we can eliminate the term $\langle \nabla f_2({\mathbf {y}}^k_2), {\mathbf {x}}^{k+1}_2-{\mathbf {y}}^k_2 \rangle$ and obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ154_HTML.png

(5.115)

where the inequality $\overset {a}\leq$ uses the convexity of f ₂: $\langle \nabla f_2({\mathbf {y}}^k_2), {\mathbf {u}}_2-{\mathbf {y}}^k_2\rangle \leq f_2({\mathbf {u}}_2)-f_2({\mathbf {y}}^k_2).$

We now consider the term $\left \langle \nabla f_2({\mathbf {y}}^k_2)-\tilde {\nabla } f_2({\mathbf {y}}^k_2),\theta _3({\mathbf {x}}^{k+1}_2-{\mathbf {y}}^k_2)+{\mathbf {z}}^{k+1}-{\mathbf {u}}_2\right \rangle$ . We will set u ₂ to be ${\mathbf {x}}_2^k$ and ${\mathbf {x}}^*_2$ , which do not depend on $\mathbb {I}_{k,s}$ . So we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_{i_{k}}\left\langle \nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}^k), \theta_3({\mathbf{x}}_2^{k+1}-{\mathbf{y}}_2^k)+{\mathbf{z}}^{k+1}-{\mathbf{u}}_2 \right\rangle \\ &\displaystyle &\displaystyle \quad = \mathbb{E}_{i_{k}}\left\langle \nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}_2^k), \theta_3{\mathbf{z}}^{k+1}+ {\mathbf{z}}^{k+1}\right\rangle\\ &\displaystyle &\displaystyle \qquad - \mathbb{E}_{i_{k}}\left\langle \nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}_2^k), \theta^2_3({\mathbf{y}}_2^k-\tilde{\mathbf{x}}_2)+\theta_3{\mathbf{y}}_2^k+{\mathbf{u}}_2 \right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}= (1+\theta_3)\mathbb{E}_{i_{k}}\langle \nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}_2^k), {\mathbf{z}}^{k+1}\rangle \\ &\displaystyle &\displaystyle \quad \overset{b}= (1+\theta_3)\mathbb{E}_{i_{k}}\langle \nabla f_2({\mathbf{y}}_2^k) - \tilde{\nabla} f_2({\mathbf{y}}_2^k), {\mathbf{x}}_2^{k+1}\rangle \\ &\displaystyle &\displaystyle \quad \overset{c}= (1+\theta_3)\mathbb{E}_{i_{k}}\langle \nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2), {\mathbf{x}}_2^{k+1}-{\mathbf{y}}_2^{k}\rangle \\ &\displaystyle &\displaystyle \quad \overset{d}\leq \mathbb{E}_{i_{k}}\left( \frac{\theta_3b}{2 L_2}\left\|\nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2)\right\|{}^2 \right)+ \mathbb{E}_{i_{k}}\left( \frac{(1+\theta_3)^2 L_2}{2\theta_3 b}\left\|{\mathbf{x}}_2^{k+1}-{\mathbf{y}}_2^{k}\right\|{}^2 \right)\\ &\displaystyle &\displaystyle \quad \overset{e}\leq \theta_3\left(f_2(\tilde{\mathbf{x}}_2) - f_2({\mathbf{y}}^k_2) - \langle \nabla f_2({\mathbf{y}}^k_2), \tilde{\mathbf{x}}_2-{\mathbf{y}}^k_2\rangle\right) \\ &\displaystyle &\displaystyle \qquad +\mathbb{E}_{i_{k}}\left( \frac{(1+\theta_3)^2 L_2}{2\theta_3 b}\left\|{\mathbf{x}}_2^{k+1}-{\mathbf{y}}_2^{k}\right\|{}^2 \right), \end{array} \end{aligned}$

(5.116)

where in the equality $\overset {a}=$ we use the fact that

$\displaystyle \begin{aligned}\mathbb{E}_{i_k} \left(\nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2) \right)=\mathbf{0},\end{aligned}$

and ${\mathbf {x}}^k_2$ , ${\mathbf {y}}^k_2$ , $\tilde {\mathbf {x}}_2$ , and u ₂ are independent of i _k,s (they are known), so

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}_{i_k} \langle \nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2),{\mathbf{y}}^k_2 \rangle &\displaystyle =&\displaystyle 0,\\ \mathbb{E}_{i_k} \langle \nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2),\tilde{\mathbf{x}}_2 \rangle &\displaystyle =&\displaystyle 0,\\ \mathbb{E}_{i_k} \langle \nabla f_2({\mathbf{y}}^k_2) - \tilde{\nabla} f_2({\mathbf{y}}^k_2), {\mathbf{u}}_2 \rangle &\displaystyle =&\displaystyle 0; \end{array} \end{aligned}$

the equalities $\overset {b}=$ and $\overset {c}=$ hold similarly; the inequality $\overset {d}\leq$ uses the Cauchy–Schwartz inequality; and $\overset {e}\leq$ uses (5.104). Taking expectation on (5.115) and adding (5.116), we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equt_HTML.png

where in equality $\overset {a}=$ we use (5.111) and set θ ₃ satisfying $\theta _2 = \frac {\theta _3}{1+\theta _3}$ . Setting u ₂ to be ${\mathbf {x}}_2^{k}$ and ${\mathbf {x}}_2^*$ , respectively, then multiplying the two inequalities by 1 − θ ₁(1 + θ ₃) and θ ₁(1 + θ ₃), respectively, and adding them, we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ157_HTML.png

(5.117)

Dividing (5.117) by (1 + θ ₃), we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ158_HTML.png

(5.118)

where we use $\theta _2=\frac {\theta _3}{1+\theta _3}$ and so $\frac {1-\theta _1(1+\theta _3)}{1+\theta _3}=1-\theta _2-\theta _1$ . We obtain (5.107).

Step 3: Setting

$\displaystyle \begin{aligned} \hat{\boldsymbol\lambda}^{k} = \tilde{\boldsymbol\lambda}^{k} +\frac{\beta(1-\theta_1)}{\theta_1}({\mathbf{A}}_1{\mathbf{x}}^{k}_1+{\mathbf{A}}_2{\mathbf{x}}^{k}_2-{\mathbf{b}}), \end{aligned}$

(5.119)

we prove that it has the following properties:

$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{\boldsymbol\lambda}^{k+1}&\displaystyle =&\displaystyle {\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2),{} \end{array} \end{aligned}$

(5.120)

$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{\boldsymbol\lambda}^{k+1}-\hat{\boldsymbol\lambda}^{k}&\displaystyle =&\displaystyle \frac{\beta }{\theta_1}{\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k} -\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right]\\ &\displaystyle &\displaystyle + \frac{\beta }{\theta_1}{\mathbf{A}}_2\left[{\mathbf{x}}^{k+1}_2-(1-\theta_1-\theta_2){\mathbf{x}}_2^{k} -\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^* \right],{} \end{array} \end{aligned}$

(5.121)

$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{\boldsymbol\lambda}^0_{s} &\displaystyle =&\displaystyle \hat{\boldsymbol\lambda}^m_{s-1}, \quad s\geq 1. {} \end{array} \end{aligned}$

(5.122)

Indeed, for Algorithm 5.6 we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \boldsymbol\lambda^{k} = \tilde{\boldsymbol\lambda}^{k} +\frac{\beta \theta_2}{\theta_{1}}\left({\mathbf{A}}_1{\mathbf{x}}^{k}_{1}+{\mathbf{A}}_2{\mathbf{x}}^{k}_{2}-\tilde{{\mathbf{b}}}\right) \end{array} \end{aligned}$

(5.123)

and

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \tilde{\boldsymbol\lambda}^{k+1} = \boldsymbol\lambda^{k} + \beta \left({\mathbf{A}}_1{\mathbf{x}}_{1}^{k+1}+{\mathbf{A}}_2{\mathbf{x}}_{2}^{k+1}-{\mathbf{b}} \right). \end{array} \end{aligned}$

(5.124)

With (5.119) we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{\boldsymbol\lambda}^{k+1} &\displaystyle =&\displaystyle \tilde{\boldsymbol\lambda}^{k+1} +\beta \left(\frac{1}{\theta_1}-1\right)({\mathbf{A}}_1{\mathbf{x}}^{k+1}_1+{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{b}})\\ &\displaystyle \overset{a}=&\displaystyle \boldsymbol\lambda^{k} +\frac{\beta }{\theta_1}({\mathbf{A}}_1{\mathbf{x}}^{k+1}_1+{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{b}}){}\\ &\displaystyle \overset{b}=&\displaystyle \tilde{\boldsymbol\lambda}^k +\frac{\beta}{\theta_1}\left\{ {\mathbf{A}}_1{\mathbf{x}}^{k+1}_1+{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{b}} +\theta_2\left[ {\mathbf{A}}_1({\mathbf{x}}^{k}_2-\tilde{\mathbf{x}}_1) + {\mathbf{A}}_2({\mathbf{x}}^{k}_2-\tilde{\mathbf{x}}_2)\right] \right\}, \end{array} \end{aligned}$

(5.125)

where in equality $\overset {a}=$ we use (5.124) and the equality $\overset {b}=$ is obtained by (5.123) and $\tilde {{\mathbf {b}}}={\mathbf {A}}_1\tilde {\mathbf {x}}_1+{\mathbf {A}}_2\tilde {\mathbf {x}}_2$ (see Algorithm 5.7). Together with (5.119) we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl} \hat{\boldsymbol\lambda}^{k+1}-\hat{\boldsymbol\lambda}^{k} &\displaystyle =&\displaystyle \frac{\beta }{\theta_1}{\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1){\mathbf{x}}_1^{k}-\theta_1{\mathbf{x}}_1^* +\theta_2({\mathbf{x}}^k_1-\tilde{\mathbf{x}}_1)\right]\\ &\displaystyle &\displaystyle + \frac{\beta }{\theta_1}{\mathbf{A}}_2\left[{\mathbf{x}}^{k+1}_2-(1-\theta_1){\mathbf{x}}_2^{k}-\theta_1{\mathbf{x}}_2^* +\theta_2({\mathbf{x}}^k_2-\tilde{\mathbf{x}}_2) \right], \end{array} \end{aligned}$

where we use the fact that ${\mathbf {A}}_1{\mathbf {x}}_1^*+{\mathbf {A}}_2{\mathbf {x}}_2^*={\mathbf {b}}$ . So (5.121) is proven.

Since (5.125) equals ${\bar {\boldsymbol \lambda }}({\mathbf {x}}^{k+1}_1,{\mathbf {x}}^{k+1}_2)$ , we obtain (5.120). Now we prove $\hat {\boldsymbol \lambda }^m_{s-1} = \hat {\boldsymbol \lambda }^0_{s}$ when s ≥ 1.

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat{\boldsymbol\lambda}^0_{s} &\displaystyle \overset{a}=&\displaystyle \tilde{\boldsymbol\lambda}^0_s+ \frac{\beta(1-\theta_{1,s})}{\theta_{1,s}}\left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right)\\ &\displaystyle \overset{b}=&\displaystyle \tilde{\boldsymbol\lambda}^0_s+ \beta \left(\frac{1}{\theta_{1,{s-1}}}+\tau-1\right)\left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right)\\ &\displaystyle \overset{c}=&\displaystyle \boldsymbol\lambda^{m-1}_{s-1} -\beta (\tau-1) \left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right)\\ &\displaystyle &\displaystyle +\beta \left(\frac{1}{\theta_{1,s-1}}+\tau-1\right) \left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right)\\ &\displaystyle =&\displaystyle \boldsymbol\lambda^{m-1}_{s-1} +\frac{\beta}{\theta_{1,s-1}} \left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right)\\ &\displaystyle \overset{d}=&\displaystyle \tilde{\boldsymbol\lambda}^{m}_{s-1} -\left(\beta-\frac{\beta}{\theta_{1,s-1}}\right)\left( {\mathbf{A}}_1{\mathbf{x}}^m_{s,1} +{\mathbf{A}}_2{\mathbf{x}}^{m}_{s,2}-{\mathbf{b}}\right) \\ &\displaystyle =&\displaystyle \hat{\boldsymbol\lambda}^m_{s-1}, \end{array} \end{aligned}$

(5.126)

where $\overset {a}=$ uses (5.119), $\overset {b}=$ uses the fact that $\frac {1}{\theta _{1,s}}=\frac {1}{\theta _{1,s-1}}+\tau$ , $\overset {c}=$ uses $\tilde {\boldsymbol \lambda }_{s+1}^{0} = \boldsymbol \lambda _s^{m-1} + \beta (1-\tau ) ({\mathbf {A}}_1{\mathbf {x}}_{s,1}^{m}+{\mathbf {A}}_2{\mathbf {x}}_{s,2}^{m}-{\mathbf {b}})$ in Algorithm 5.7, and $\overset {d}=$ uses (5.124).

Step 4: We now are ready to prove (5.102). Define $L({\mathbf {x}}_1,{\mathbf {x}}_2,\boldsymbol \lambda ) = F_1({\mathbf {x}}_1) - F_1({\mathbf {x}}^*_1)+ F_2({\mathbf {x}}_2)-F_2({\mathbf {x}}^*_2) + \langle \boldsymbol \lambda , {\mathbf {A}}_1{\mathbf {x}}_1+{\mathbf {A}}_2{\mathbf {x}}_2-{\mathbf {b}} \rangle$ . We have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \theta_2 L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -(1-\theta_1 - \theta_2)L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\displaystyle &\displaystyle \quad = F_1({\mathbf{x}}^{k+1}_1)-(1-\theta_2-\theta_1)F_1({\mathbf{x}}^{k}_1) -\theta_1F_1({\mathbf{x}}^*_1) -\theta_2 F_1(\tilde{\mathbf{x}}_1)\\ &\displaystyle &\displaystyle \qquad +F_2({\mathbf{x}}^{k+1}_2)-(1-\theta_2-\theta_1)F_2({\mathbf{x}}^{k}_2) -\theta_1F_2({\mathbf{x}}^*_2) -\theta_2 F_2(\tilde{\mathbf{x}}_2)\\ &\displaystyle &\displaystyle \qquad + \left\langle \boldsymbol\lambda^*, {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1 {\mathbf{x}}_1^*\right] \right\rangle\\ &\displaystyle &\displaystyle \qquad + \left\langle \boldsymbol\lambda^*, {\mathbf{A}}_2\left[{\mathbf{x}}^{k+1}_2-(1-\theta_1-\theta_2){\mathbf{x}}_2^{k}-\theta_2\tilde{\mathbf{x}}_2-\theta_1 {\mathbf{x}}_2^*\right] \right\rangle. \end{array} \end{aligned}$

Plugging (5.106) and (5.118) into the above, we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ169_HTML.png

(5.127)

where in the equality $\overset {a}=$ we change the term ${\bar {\boldsymbol \lambda }}({\mathbf {x}}^{k+1}_1,{\mathbf {y}}_2^k)$ to ${\bar {\boldsymbol \lambda }}({\mathbf {x}}^{k+1}_1,{\mathbf {x}}_2^{k+1})-\frac {\beta }{\theta _1}{\mathbf {A}}_2({\mathbf {x}}^{k+1}_2-{\mathbf {y}}^k_2)$ . For the first two terms in the right-hand side of (5.127), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \left\langle \boldsymbol\lambda^*-{\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2), {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right\rangle\\ &\displaystyle &\displaystyle \quad +\left\langle \boldsymbol\lambda^*-{\bar{\boldsymbol\lambda}}({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2), {\mathbf{A}}_2\left[{\mathbf{x}}^{k+1}_2-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_2-\theta_1 {\mathbf{x}}_2^*\right] \right\rangle\\ &\displaystyle &\displaystyle \qquad \overset{a}= \frac{\theta_1}{\beta} \langle \boldsymbol\lambda^*-\hat{\boldsymbol\lambda}^{k+1}, \hat{\boldsymbol\lambda}^{k+1}-\hat{\boldsymbol\lambda}^{k}\rangle\\ &\displaystyle &\displaystyle \qquad \overset{b}= \frac{\theta_1}{2\beta}\left(\left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2 -\left\| \hat{\boldsymbol\lambda}^{k+1} -\hat{\boldsymbol\lambda}^{k}\right\|{}^2\right), \end{array} \end{aligned}$

(5.128)

where $\overset {a}=$ uses (5.120) and (5.121) and $\overset {b}=$ uses (A.2).

Substituting (5.128) into (5.127), we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ171_HTML.png

(5.129)

Then applying identity (A.1) to the second and the third terms in the right-hand side of (5.129) and rearranging, we have

$\displaystyle \begin{aligned} & \mathbb{E}_{i_k}L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \theta_2 L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -(1-\theta_2 - \theta_1)L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\quad \leq \frac{\theta_1}{2\beta}\left(\left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \mathbb{E}_{i_k}\left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2 -\mathbb{E}_{i_k}\left\| \hat{\boldsymbol\lambda}^{k+1} -\hat{\boldsymbol\lambda}^{k}\right\|{}^2\right)\\ &\qquad +\frac{1}{2}\left\|{\mathbf{y}}_1^{k}-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\qquad -\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\qquad + \frac{1}{2} \left\|{\mathbf{y}}_2^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_2^T{\mathbf{A}}_2}{\theta_1}}\\ &\qquad -\frac{1}{2} \mathbb{E}_{i_k} \left\|{\mathbf{x}}_2^{k+1}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_2^T{\mathbf{A}}_2}{\theta_1}}\\ &\qquad -\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right \|{}^2_{\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }-\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2\right\|{}^2_{\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\mathbf{I}-\frac{\beta{\mathbf{A}}_2^T{\mathbf{A}}_2}{\theta_1} } \\ &\qquad +\frac{\beta}{\theta_1}\mathbb{E}_{i_k}\left\langle {\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{A}}_2{\mathbf{y}}^{k}_2,{\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1 {\mathbf{x}}_1^*\right] \right\rangle\!. \end{aligned}$

(5.130)

For the last term in the right-hand side of (5.130), we have

$\displaystyle \begin{aligned} &\frac{\beta}{\theta_1}\left\langle {\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{A}}_2{\mathbf{y}}^{k}_2,{\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right\rangle\\ &\quad \overset{a}=\frac{\beta}{\theta_1}\left\langle {\mathbf{A}}_2{\mathbf{x}}^{k+1}_2\!\!-\!{\mathbf{A}}_2\mathbf{v} -({\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2\mathbf{v}),\right. \\ &\qquad \left.{\mathbf{A}}_1\!\!\left[{\mathbf{x}}^{k+1}_1\!\!-(1\!-\!\theta_1-\!\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \!\!-\!\mathbf{0} \right\rangle\\ &\quad \overset{b}=\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2 -{\mathbf{A}}_2 \mathbf{v}+ {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right\|{}^2\\ &\qquad -\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{A}}_2\mathbf{v} \right\|{}^2+\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2\mathbf{v} \right\|{}^2\\ &\qquad -\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2 \mathbf{v} + {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right\|{}^2\\ &\quad \overset{c}= \frac{\theta_1}{2\beta}\left\| \hat{\boldsymbol\lambda}^{k+1} -\hat{\boldsymbol\lambda}^{k} \right\|{}^2 -\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{x}}^{k+1}_2-{\mathbf{A}}_2\mathbf{v} \right\|{}^2+\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2\mathbf{v} \right\|{}^2\\ &\qquad -\frac{\beta}{2\theta_1}\left\|{\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2 \mathbf{v} + {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right \|{}^2,\end{aligned}$

(5.131)

where in $\overset {a}=$ we set $\mathbf {v}=(1-\theta _1-\theta _2){\mathbf {x}}_2^{k}+\theta _2\tilde {\mathbf {x}}_2+\theta _1 {\mathbf {x}}_2^*$ , $\overset {b}=$ uses (A.3), and $\overset {c}=$ uses (5.121). Substituting (5.131) into (5.130), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \mathbb{E}_{i_k}L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \theta_2 L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -(1-\theta_2 - \theta_1)L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\displaystyle &\displaystyle \quad \leq \frac{\theta_1}{2\beta}\left( \left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \mathbb{E}_{i_k}\left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2 \right)\\ &\displaystyle &\displaystyle \qquad +\frac{1}{2}\left\|{\mathbf{y}}_1^{k}-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\displaystyle &\displaystyle \qquad -\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right\|{}^2_{{\mathbf{G}}_1}\\ &\displaystyle &\displaystyle \qquad + \frac{1}{2} \left\|{\mathbf{y}}_2^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2-\theta_1{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}}\\ &\displaystyle &\displaystyle \qquad - \frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}_2^{k+1} - (1-\theta_1 - \theta_2){\mathbf{x}}^k_2-\theta_2 \tilde{\mathbf{x}}_2 - \theta_1 {\mathbf{x}}^*_2 \right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}}\\ &\displaystyle &\displaystyle \qquad -\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_1-{\mathbf{y}}^k_1 \right\|{}^2_{\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }-\frac{1}{2}\mathbb{E}_{i_k}\left\|{\mathbf{x}}^{k+1}_2-{\mathbf{y}}^k_2\right\|{}^2_{\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\mathbf{I}-\frac{\beta{\mathbf{A}}_2^T{\mathbf{A}}_2}{\theta_1} } \\ &\displaystyle &\displaystyle \qquad -\frac{\beta}{2\theta_1}\mathbb{E}_{i_k}\left\|{\mathbf{A}}_2{\mathbf{y}}^{k}_2-{\mathbf{A}}_2 \mathbf{v} + {\mathbf{A}}_1\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1-\theta_1{\mathbf{x}}_1^*\right] \right \|{}^2.\\ \end{array} \end{aligned}$

(5.132)

Since the last three terms in the right-hand side of (5.132) are nonpositive, we obtain (5.102). □

Theorem 5.10

If the conditions in Lemma 5.7hold, then we have

$\displaystyle \begin{aligned} &\mathbb{E}\left(\frac{1}{2\beta}\left\|\frac{\beta (m-1)(\theta_2+\theta_{1,S})+\beta}{\theta_{1,{S}}}\left(\mathbf{A}\hat{\mathbf{x}}_{S}\!-\!{\mathbf{b}}\right)\right.\right.\\ &\quad \left.\left.-\frac{\beta(m\!-\!1)\theta_{2}}{\theta_{1,{0}}}\left(\mathbf{A}{\mathbf{x}}^0_{0}-{\mathbf{b}}\right) +\tilde{\boldsymbol\lambda}^0_0-\boldsymbol\lambda^*\right \|{}^2 \right)\\ &\quad +\mathbb{E}\left(\frac{(m-1)(\theta_2+\theta_{1,S})+1}{\theta_{1,{S}}}\left(F(\hat{\mathbf{x}}_{S})-F({\mathbf{x}}^*) +\langle \boldsymbol\lambda^*, \mathbf{A}\hat{\mathbf{x}}_S -{\mathbf{b}}\rangle\right)\right)\\ &\qquad \leq C_3\left(F({\mathbf{x}}_{0}^0)-F({\mathbf{x}}^*) +\langle \boldsymbol\lambda^*, \mathbf{A}{\mathbf{x}}^0_0 -{\mathbf{b}}\rangle\right)\\ &\qquad \quad +\frac{1}{2\beta}\left\|\tilde{\boldsymbol\lambda}^0_0 +\frac{\beta(1-\theta_{1,{0}})}{\theta_{1,{0}}}(\mathbf{A}{\mathbf{x}}_0^0-{\mathbf{b}}) -\boldsymbol\lambda^* \right\|{}^2\\ &\qquad \quad +\frac{1}{2}\left\|{\mathbf{x}}^0_{0,1}-{\mathbf{x}}^*_1 \right \|{}^2_{\left(\theta_{1,0}L_1+\beta\|{\mathbf{A}}_1^T {\mathbf{A}}_1\|\right)\mathbf{I}-\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}\\ &\qquad \quad +\frac{1}{2}\left\|{\mathbf{x}}^0_{0,2}-{\mathbf{x}}^*_2 \right\|{}^2_{\left(\left(1+\frac{1}{b\theta_2}\right) \theta_{1,0}L_2+\beta\|{\mathbf{A}}_2^T {\mathbf{A}}_2\|\right)\mathbf{I}}, \end{aligned}$

(5.133)

where $C_3= \frac {1-\theta _{1,0}+(m-1)\theta _2}{\theta _{1,0}}$ .

Proof

Taking expectation over the first k iterations for (5.102) and dividing θ ₁ on both sides of it, we obtain

$\displaystyle \begin{aligned} &\frac{1}{\theta_{1}}\mathbb{E} L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \frac{\theta_2}{\theta_{1}} L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -\frac{1-\theta_2 - \theta_1}{\theta_{1}}L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\quad \leq \frac{1}{2\beta}\left(\left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \mathbb{E}\left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2 \right) \\ &\qquad +\frac{\theta_{1}}{2}\left\|\frac{1}{\theta_1}\left[{\mathbf{y}}_1^{k}-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1\right]-{\mathbf{x}}_1^*\right\|{}^2_{\left(L_1+\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }\\ &\qquad -\frac{\theta_{1}}{2}\mathbb{E}\left\|\frac{1}{\theta_1}\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1\right]-{\mathbf{x}}_1^*\right\|{}^2_{\left(L_1+\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }\\ &\qquad + \frac{\theta_{1}}{2} \left\|\frac{1}{\theta_1}\left[{\mathbf{y}}_2^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2\right]-{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}}\\ &\qquad -\frac{\theta_{1}}{2} \mathbb{E} \left\|\frac{1}{\theta_1}\left[{\mathbf{x}}_2^{k+1}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2\right]-{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}}, \end{aligned}$

(5.134)

where the expectation is taken under the condition that the randomness under the first s epochs is fixed. Since

$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{y}}^{k} = {\mathbf{x}}^{k} + (1-\theta_{1}-\theta_2)({\mathbf{x}}^{k} - {\mathbf{x}}^{k-1}), \quad k\geq 1 , \end{array} \end{aligned}$

we obtain

$\displaystyle \begin{aligned} &\frac{1}{\theta_{1}}\mathbb{E} L({\mathbf{x}}^{k+1}_1,{\mathbf{x}}^{k+1}_2,\boldsymbol\lambda^*) - \frac{\theta_2}{\theta_{1}} L(\tilde{\mathbf{x}}_1,\tilde{\mathbf{x}}_2,\boldsymbol\lambda^*) -\frac{1-\theta_2 - \theta_1}{\theta_{1}}L({\mathbf{x}}^{k}_1,{\mathbf{x}}^{k}_2,\boldsymbol\lambda^*)\\ &\quad \leq \frac{1}{2\beta}\left(\left\|\hat{\boldsymbol\lambda}^k-\boldsymbol\lambda^*\right\|{}^2- \mathbb{E}\left\|\hat{\boldsymbol\lambda}^{k+1}-\boldsymbol\lambda^*\right\|{}^2 \right) \\ &\qquad +\frac{\theta_{1}}{2}\left\|\frac{1}{\theta_1}\left[{\mathbf{x}}_1^{k}-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k-1}-\theta_2\tilde{\mathbf{x}}_1\right]-{\mathbf{x}}_1^*\right\|{}^2_{\left(L_1+\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }\\ &\qquad -\frac{\theta_{1}}{2}\mathbb{E}\left\|\frac{1}{\theta_1}\left[{\mathbf{x}}^{k+1}_1-(1-\theta_1-\theta_2){\mathbf{x}}_1^{k}-\theta_2\tilde{\mathbf{x}}_1\right]-{\mathbf{x}}_1^*\right\|{}^2_{\left(L_1+\frac{\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|}{\theta_1}\right)\mathbf{I}-\frac{\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}{\theta_1} }\\ &\qquad + \frac{\theta_{1}}{2} \left\|\frac{1}{\theta_1}\left[{\mathbf{x}}_2^{k}-(1-\theta_1-\theta_2){\mathbf{x}}^{k-1}_2-\theta_2\tilde{\mathbf{x}}_2\right]-{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}}\\ &\qquad -\frac{\theta_{1}}{2} \mathbb{E} \left\|\frac{1}{\theta_1}\left[{\mathbf{x}}_2^{k+1}-(1-\theta_1-\theta_2){\mathbf{x}}^k_2-\theta_2\tilde{\mathbf{x}}_2\right]-{\mathbf{x}}_2^*\right\|{}^2_{\left(\alpha L_2+\frac{\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|}{\theta_1}\right)\mathbf{I}},\\ &\qquad \quad k\geq 1 . \end{aligned}$

(5.135)

Adding back the subscript s, taking expectation on the first s epoches, and then summing (5.134) with k from 0 to m − 1 (for k ≥ 1, using (5.135)), we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ179_HTML.png

(5.136)

where we use $L({\mathbf {x}}^k_s,\boldsymbol \lambda ^*)$ and $L(\tilde {\mathbf {x}}_s,\boldsymbol \lambda ^*)$ to denote $L({\mathbf {x}}^k_{s,1}, {\mathbf {x}}^k_{s,2},\boldsymbol \lambda ^*)$ and $L(\tilde {\mathbf {x}}_{s,1},\tilde {\mathbf {x}}_{s,2}, \boldsymbol \lambda ^*)$ , respectively. Since L(x, λ ^∗) is convex for x, we have

$\displaystyle \begin{aligned} &mL(\tilde{\mathbf{x}}_s,\boldsymbol\lambda^*)\\ &\quad =mL\left( \frac{1}{m}\left[\left(1-\frac{(\tau-1)\theta_{1,{s}}}{\theta_2} \right){\mathbf{x}}^m_{s-1}+\left(1+\frac{(\tau-1)\theta_{1,{s}}}{(m-1)\theta_2}\right)\sum_{k=1}^{m-1}{\mathbf{x}}^k_{s-1}\right],\boldsymbol\lambda^*\right)\\ &\quad \leq\left[1-\frac{(\tau-1)\theta_{1,{s}}}{\theta_2} \right]L({\mathbf{x}}^m_{s-1},\boldsymbol\lambda^*)+\left[1+\frac{(\tau-1)\theta_{1,{s}}}{(m-1)\theta_2}\right]\sum_{k=1}^{m-1}L({\mathbf{x}}^k_{s-1},\boldsymbol\lambda^*). \end{aligned}$

(5.137)

Substituting (5.137) into (5.136), and using ${\mathbf {x}}^m_{s-1}={\mathbf {x}}^0_s$ , we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ181_HTML.png

(5.138)

Then from the setting of $\theta _{1,s}=\frac {1}{2+\tau s}$ and $\theta _2=\frac {m-\tau }{\tau (m-1)}$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \frac{1}{\theta_{1,{s}}}=\frac{1-\tau\theta_{1,{s+1}}}{\theta_{1,{s+1}}}, \quad s\geq 0, \end{array} \end{aligned}$

(5.139)

and

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \frac{\theta_2+\theta_{1,{s}}}{\theta_{1,{s}}}=\frac{\theta_2}{\theta_{1,{s+1}}}-\tau\theta_2+1=\frac{\theta_{2}+\frac{\tau-1}{m-1}\theta_{1,{s+1}}}{\theta_{1,s+1}}, \quad s\geq 0. \end{array} \end{aligned}$

(5.140)

Substituting (5.139) into the first term and (5.140) into the second term in the right-hand side of (5.138), we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ184_HTML.png

(5.141)

When k = 0, for

$\displaystyle \begin{aligned} \begin{array}{rcl} {\mathbf{y}}_{s+1}^0&\displaystyle =&\displaystyle (1-\theta_2){\mathbf{x}}^m_s +\theta_2 \tilde{\mathbf{x}}_{s+1} \\ &\displaystyle &\displaystyle +\frac{\theta_{1,s+1}}{\theta_{1,s}}\left[(1-\theta_{1,s}){\mathbf{x}}^m_s-(1-\theta_{1,s}-\theta_2){\mathbf{x}}^{m-1}_s -\theta_2 \tilde{\mathbf{x}}_s\right], \end{array} \end{aligned}$

we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \frac{1}{\theta_{1,s}}\left[{\mathbf{x}}^{m}_s-\theta_2\tilde{\mathbf{x}}_s-(1-\theta_{1,s}-\theta_{2}){\mathbf{x}}^{m-1}_s\right]\\ &\displaystyle &\displaystyle \quad =\frac{1}{\theta_{1,s+1}}\left[{\mathbf{y}}^{0}_{s+1}-\theta_2\tilde{\mathbf{x}}_{s+1}-(1-\theta_{1,{s+1}}-\theta_{2}){\mathbf{x}}^{0}_{s+1}\right]. \end{array} \end{aligned}$

(5.142)

Substituting (5.142) into the third and the fifth terms in the right-hand side of (5.141) and substituting (5.122) into the last term in the right-hand side of (5.141), we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equu_HTML.png

For θ _1,s−1 ≥ θ _1,s, we have $\|\mathbf {x}\|{ }^2_{\theta _{1,s-1}L}\geq \|\mathbf {x}\|{ }^2_{\theta _{1,s}L}$ , thus

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ187_HTML.png

(5.143)

When s = 0, via (5.136) and using ${\mathbf {y}}^0_{0,1}=\tilde {\mathbf {x}}_{0,1}= {\mathbf {x}}^0_{0,1}$ , ${\mathbf {y}}^0_{0,2}=\tilde {\mathbf {x}}_{0,2} = {\mathbf {x}}^0_{0,2}$ , and θ _1,0 ≥ θ _1,1, we obtain

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ188_HTML.png

(5.144)

where we use θ _1,0 ≥ θ _1,1 in the fourth and the sixth lines.

Summing (5.143) with s from 1 to S − 1 and adding (5.144), we have

../images/487003_1_En_5_Chapter/487003_1_En_5_Equ189_HTML.png

(5.145)

Now we analyze $\|\hat {\boldsymbol \lambda }^{m}_S-\boldsymbol \lambda ^*\|{ }^2$ . From (5.126), for s ≥ 1 we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} &\displaystyle &\displaystyle \hat{\boldsymbol\lambda}^{m}_s-\hat{\boldsymbol\lambda}^{m}_{s-1}=\hat{\boldsymbol\lambda}^{m}_s-\hat{\boldsymbol\lambda}^{0}_{s}= \sum_{k=1}^m \left(\hat{\boldsymbol\lambda}^{k}_s-\hat{\boldsymbol\lambda}^{k-1}_{s}\right)\\ &\displaystyle &\displaystyle \quad \overset{a}= \beta\sum_{k=1}^m \left[ \frac{1}{\theta_{1,{s}}} \left(\mathbf{A}{\mathbf{x}}^k_s-{\mathbf{b}}\right) - \frac{1-\theta_{1,{s}}-\theta_2}{\theta_{1,{s}}} \left(\mathbf{A}{\mathbf{x}}^{k-1}_s-{\mathbf{b}}\right) - \frac{\theta_2}{\theta_{1,{s}}} \left(\mathbf{A}\tilde{\mathbf{x}}_s-{\mathbf{b}}\right) \right]\\ &\displaystyle &\displaystyle \quad = \frac{\beta}{\theta_{1,s}}\left(\mathbf{A}{\mathbf{x}}^m_s-{\mathbf{b}}\right)+ \frac{\beta(\theta_2+\theta_{1,{s}})}{\theta_{1,s}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_s-{\mathbf{b}}\right)\\ &\displaystyle &\displaystyle \qquad -\frac{\beta(1-\theta_{1,s}-\theta_2)}{\theta_{1,s}}\left(\mathbf{A}{\mathbf{x}}^m_{s-1}-{\mathbf{b}}\right)- \frac{m\beta\theta_2}{\theta_{1,s}} \left(\mathbf{A}\tilde{\mathbf{x}}_{s}-{\mathbf{b}}\right)\\ &\displaystyle &\displaystyle \quad \overset{b}= \frac{\beta}{\theta_{1,s}}\left(\mathbf{A}{\mathbf{x}}^m_s-{\mathbf{b}}\right)+ \frac{\beta(\theta_2+\theta_{1,{s}})}{\theta_{1,s}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_s-{\mathbf{b}}\right)\\ &\displaystyle &\displaystyle \qquad -\beta\left[ \frac{1-\theta_{1,s}-(\tau-1)\theta_{1,{s}}}{\theta_{1,s}}\left(\mathbf{A}{\mathbf{x}}^m_{s-1}-{\mathbf{b}}\right)\right.\\ &\displaystyle &\displaystyle \qquad \left.+ \frac{\theta_{2}+\frac{\tau-1}{m-1}\theta_{1,{s}}}{\theta_{1,s}}\sum_{k=1}^{m-1}\left(\mathbf{A}{\mathbf{x}}^k_{s-1}-{\mathbf{b}}\right)\right]\\ &\displaystyle &\displaystyle \quad \overset{c}= \frac{\beta}{\theta_{1,s}}\left(\mathbf{A}{\mathbf{x}}^m_s-{\mathbf{b}}\right)+ \frac{\beta(\theta_2+\theta_{1,{s}})}{\theta_{1,s}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_s-{\mathbf{b}}\right)\\ &\displaystyle &\displaystyle \qquad -\frac{\beta}{\theta_{1,s-1}}\left(\mathbf{A}{\mathbf{x}}^m_{s-1}-{\mathbf{b}}\right)- \frac{\beta(\theta_2+\theta_{1,{s-1}})}{\theta_{1,s-1}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_{s-1}-{\mathbf{b}}\right), \end{array} \end{aligned}$

(5.146)

where $\overset {a}=$ uses (5.121), $\overset {b}=$ uses the definition of $\tilde {\mathbf {x}}_s$ , and $\overset {c}=$ uses (5.139) and (5.140). When s = 0, we can obtain

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat{\boldsymbol\lambda}^{m}_0-\hat{\boldsymbol\lambda}^{0}_{0}&\displaystyle =&\displaystyle \sum_{k=1}^m \left(\hat{\boldsymbol\lambda}^{k}_0-\hat{\boldsymbol\lambda}^{k-1}_{0}\right)\\ &\displaystyle =&\displaystyle \sum_{k=1}^m \left[ \frac{\beta}{\theta_{1,{0}}} \left(\mathbf{A}{\mathbf{x}}^k_0-{\mathbf{b}}\right) - \frac{\beta(1-\theta_{1,{0}}-\theta_2)}{\theta_{1,{0}}} \left(\mathbf{A}{\mathbf{x}}^{k-1}_0-{\mathbf{b}}\right)\right.\\ &\displaystyle &\displaystyle \left. - \frac{\theta_2\beta}{\theta_{1,{0}}} \left(\mathbf{A}{\mathbf{x}}^0_0-{\mathbf{b}}\right) \right]\\ &\displaystyle =&\displaystyle \frac{\beta}{\theta_{1,0}}\left(\mathbf{A}{\mathbf{x}}^m_0-{\mathbf{b}}\right)+ \frac{\beta(\theta_2+\theta_{1,{0}})}{\theta_{1,0}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_0-{\mathbf{b}}\right) \\ &\displaystyle &\displaystyle - \frac{\beta[1-\theta_{1,0}+(m-1)\theta_2]}{\theta_{1,0}}\left(\mathbf{A}{\mathbf{x}}^0_0-{\mathbf{b}} \right). \end{array} \end{aligned}$

(5.147)

Summing (5.146) with s from 1 to S − 1 and adding (5.147), we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \hat{\boldsymbol\lambda}^{m}_S-\boldsymbol\lambda^* &\displaystyle =&\displaystyle \hat{\boldsymbol\lambda}^{m}_S-\hat{\boldsymbol\lambda}^0_0 + \hat{\boldsymbol\lambda}^0_0 - \boldsymbol\lambda^*\\ &\displaystyle =&\displaystyle \frac{\beta}{\theta_{1,S}}\left(\mathbf{A}{\mathbf{x}}^m_S-{\mathbf{b}}\right)+ \frac{\beta(\theta_2+\theta_{1,{S}})}{\theta_{1,S}}\sum_{k=1}^{m-1} \left(\mathbf{A}{\mathbf{x}}^k_S-{\mathbf{b}}\right) \\ &\displaystyle &\displaystyle - \frac{\beta\left[1-\theta_{1,0}+(m-1)\theta_2\right]}{\theta_{1,0}}\left(\mathbf{A}{\mathbf{x}}^0_0-{\mathbf{b}} \right)\\ &\displaystyle &\displaystyle +\tilde{\boldsymbol\lambda}^{0}_0 +\frac{\beta(1-\theta_{1,{0}})}{\theta_{1,{0}}}\left(\mathbf{A}{\mathbf{x}}^0_{0}-{\mathbf{b}} \right) - \boldsymbol\lambda^*\\ &\displaystyle \overset{a}=&\displaystyle \frac{(m-1)(\theta_2+\theta_{1,S})\beta +\beta}{\theta_{1,{S}}}\left(\mathbf{A}\hat{\mathbf{x}}_S-{\mathbf{b}} \right) \\ &\displaystyle &\displaystyle +\tilde{\boldsymbol\lambda}^0_0 -\frac{\beta(m-1)\theta_2}{\theta_{1,{0}}}\left( \mathbf{A}{\mathbf{x}}^0_0 -{\mathbf{b}}\right) -\boldsymbol\lambda^*, \end{array} \end{aligned}$

(5.148)

where the equality $\overset {a}=$ uses the definition of $\hat {\mathbf {x}}_S$ . Substituting (5.148) into (5.145) and using that L(x, λ) is convex in x, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}\left(\frac{1}{2\beta}\left\|\frac{\beta (m-1)(\theta_2+\theta_{1,S})+\beta}{\theta_{1,{S}}}\left(\mathbf{A}\hat{\mathbf{x}}_{S}\!-\!{\mathbf{b}}\right)\right.\right.\\ &\displaystyle &\displaystyle \quad \left.\left.-\frac{\beta(m\!-\!1)\theta_{2}}{\theta_{1,{0}}}\left(\mathbf{A}{\mathbf{x}}^0_{0}-{\mathbf{b}}\right) +\tilde{\boldsymbol\lambda}^0_0-\boldsymbol\lambda^*\right \|{}^2 \right)\\ &\displaystyle &\displaystyle \quad +\mathbb{E}\left(\frac{(m-1)(\theta_2+\theta_{1,S})+1}{\theta_{1,{S}}}\left(L(\hat{\mathbf{x}}_{S}, \boldsymbol\lambda^*)-L({\mathbf{x}}^*,\boldsymbol\lambda^*)\right)\right)\\ &\displaystyle &\displaystyle \qquad \leq \frac{1-\theta_{1,0}+(m-1)\theta_2}{\theta_{1,0}}\left(L({\mathbf{x}}^0_{0},\boldsymbol\lambda^*)-L({\mathbf{x}}^*,\boldsymbol\lambda^*)\right)+ \frac{1}{2\beta}\left\|\hat{\boldsymbol\lambda}^0_{0}-\boldsymbol\lambda^*\right\|{}^2\\ &\displaystyle &\displaystyle \qquad \quad +\frac{1}{2}\left\|{\mathbf{x}}^0_{0,1}-{\mathbf{x}}^*_1 \right \|{}^2_{\left(\theta_{1,0}L_1+\beta\left\|{\mathbf{A}}_1^T {\mathbf{A}}_1\right\|\right)\mathbf{I}-\beta{\mathbf{A}}_1^T{\mathbf{A}}_1}\\ &\displaystyle &\displaystyle \qquad \quad +\frac{1}{2}\left\|{\mathbf{x}}^0_{0,2}-{\mathbf{x}}^*_2 \right \|{}^2_{\left(\alpha \theta_{1,0}L_2+\beta\left\|{\mathbf{A}}_2^T {\mathbf{A}}_2\right\|\right)\mathbf{I}}. \end{array} \end{aligned}$

By the definitions of L(x, λ) and $\hat {\boldsymbol \lambda }^0_{0}$ we can obtain (5.133). □

With Theorem 5.10 in hand, we can check that the convergence rate of Acc-SADMM to solve problem (5.98) is exactly O(1∕(mS)). However, for non-accelerated methods, the convergence rate for ADMM in the non-ergodic sense is $O(1/\sqrt {mS})$ [5].

5.5 The Infinite Case

In the previous section, the momentum technique is used to achieve faster convergence rate. We introduce the other benefit: the momentum technique can increase the mini-batch size. Unlike the previous sections, here we focus on the infinite case which differs from the finite case, since the true gradient for the infinite case is unavailable. We show that by using the momentum technique, the mini-batch size can be largely increased. We consider the problem:

$\displaystyle \begin{aligned} \begin{array}{rcl}{} \min_{\mathbf{x}} f(\mathbf{x})\equiv \mathbb{E} [F(\mathbf{x};\xi)], \end{array} \end{aligned}$

(5.149)

where f(x) is μ-strongly convex and L-smooth and its stochastic gradient has finite variance bound σ, i.e.,

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\| \nabla F(\mathbf{x};\xi) - \nabla f(\mathbf{x}) \|{}^2\leq \sigma^2. \end{array} \end{aligned}$

We apply mini-batch AGD to solve (5.149). The algorithm is shown in Algorithm 5.8.

Algorithm 5.8 Stochastic accelerated gradient descent (SAGD)

../images/487003_1_En_5_Chapter/487003_1_En_5_Figh_HTML.png

Theorem 5.11

For Algorithm 5.8, set the mini-batch size $b=\frac {\sigma ^2}{2L\theta \epsilon }$ , where $\theta =\sqrt {\mu /(2L)}$ . After running $K = \log _{1-\sqrt {\frac {\mu }{2L}}}\left ( f({\mathbf {x}}^0) -f({\mathbf {x}}^*) + L \|{\mathbf {x}}^0 - {\mathbf {x}}^* \|{ }^2\right )- \log _{1-\sqrt {\frac {\mu }{2L}}}(\epsilon )\sim O\left ( \sqrt {\frac {L}{\mu }} \log (L \|{\mathbf {x}}^0 - {\mathbf {x}}^* \|{ }^2/\epsilon )\right )$ iterations, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E} f({\mathbf{x}}^{K+1}) - f({\mathbf{x}}^*) +L \mathbb{E}\left\|{\mathbf{x}}^{K+1} - (1-\theta){\mathbf{x}}^K -\theta {\mathbf{x}}^*\right\|{}^2\leq 2\epsilon. \end{array} \end{aligned}$

Proof

For f(x) is L-smooth, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{x}}^{k+1}) \leq f({\mathbf{y}}^k) +\langle \nabla f({\mathbf{y}}^k), {\mathbf{x}}^{k+1} -{\mathbf{y}}^k\rangle +\frac{L}{2}\left\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^k\right\|{}^2. \end{array} \end{aligned}$

(5.150)

For f(y) is μ-strongly convex, we have

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{y}}^k) \leq f({\mathbf{x}}^*) +\left\langle \nabla f({\mathbf{y}}^k), {\mathbf{y}}^k - {\mathbf{x}}^* \right\rangle - \frac{\mu }{2}\left\|{\mathbf{y}}^k - {\mathbf{x}}^* \right\|{}^2, \end{array} \end{aligned}$

(5.151)

and

$\displaystyle \begin{aligned} \begin{array}{rcl}{} f({\mathbf{y}}^k) \leq f({\mathbf{x}}^k) +\left\langle \nabla f({\mathbf{y}}^k), {\mathbf{y}}^k - {\mathbf{x}}^k \right\rangle. \end{array} \end{aligned}$

(5.152)

Multiplying (5.151) by θ and (5.152) by 1 − θ and summing the results with (5.150), we have

$\displaystyle \begin{aligned} \begin{array}{rcl} f({\mathbf{x}}^{k+1}) &\displaystyle \leq&\displaystyle \theta f({\mathbf{x}}^*) + (1-\theta) f({\mathbf{x}}^k) + \left\langle \nabla f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}-(1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle +\frac{L}{2}\left\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^k\right\|{}^2-\frac{\mu\theta}{2}\left\|{\mathbf{y}}^k -{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle =&\displaystyle \theta f({\mathbf{x}}^*) + (1-\theta) f({\mathbf{x}}^k)\\ &\displaystyle &\displaystyle + \left\langle \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}-(1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right\rangle \\ &\displaystyle &\displaystyle +\frac{L}{2}\left\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^k\right\|{}^2-\frac{\mu\theta}{2}\left\|{\mathbf{y}}^k -{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle +\left\langle \tilde{\nabla} f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}-(1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right\rangle, \end{array} \end{aligned}$

where we will set $\theta = \sqrt {\mu /(2L)}$ . By taking the expectation on the random number in $\mathbb {I}^k$ , we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}{{}_k}\left\langle \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}-(1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^*\right\rangle\\ &\displaystyle &\displaystyle \quad \overset{a}= \mathbb{E}{{}_k}\left\langle \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k), {\mathbf{x}}^{k+1}-{\mathbf{y}}^k\right\rangle\\ &\displaystyle &\displaystyle \quad \leq \frac{1}{2L} \mathbb{E}{{}_k}\left\| \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k)\right\|{}^2+\frac{L}{2}\mathbb{E}_k\left\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^k\right\|{}^2, \end{array} \end{aligned}$

where in $\overset {a}=$ we use $\mathbb {E}_k \left (\nabla f({\mathbf {y}}^k) - \tilde {\nabla } f({\mathbf {y}}^k)\right ) = \mathbf {0}$ . Thus we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}{{}_k}f({\mathbf{x}}^{k+1}) \\ &\displaystyle &\displaystyle \quad \leq \theta f({\mathbf{x}}^*) + (1-\theta) f({\mathbf{x}}^k) - 2L\mathbb{E}_k\left\langle {\mathbf{x}}^{k+1} - {\mathbf{y}}^k, {\mathbf{x}}^{k+1}-(1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right\rangle\\ &\displaystyle &\displaystyle \qquad + \frac{L}{2}\mathbb{E}_k\left\|{\mathbf{x}}^{k+1} - {\mathbf{y}}^k \right\|{}^2 - \frac{\mu\theta}{2}\left\|{\mathbf{y}}^k - {\mathbf{x}}^*\right\|{}^2 +\frac{L}{2}\mathbb{E}_k\left\|{\mathbf{x}}^{k+1} - {\mathbf{y}}^k \right\|{}^2\\ &\displaystyle &\displaystyle \qquad +\frac{1}{2L}\mathbb{E}_k\left\| \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k)\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq\theta f({\mathbf{x}}^*) + (1-\theta) f({\mathbf{x}}^k) + \frac{1}{2L}\mathbb{E}_k\left\|\nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k)\right\|{}^2\\ &\displaystyle &\displaystyle \qquad + L\left\|{\mathbf{y}}^k - (1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right\|{}^2 -L\mathbb{E}_k\left\|{\mathbf{x}}^{k+1} - (1-\theta){\mathbf{x}}^k -\theta{\mathbf{x}}^* \right \|{}^2\\ &\displaystyle &\displaystyle \qquad -\frac{\mu\theta }{2}\left\|{\mathbf{y}}^k -{\mathbf{x}}^*\right\|{}^2. \end{array} \end{aligned}$

Then using

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle L\left\|{\mathbf{y}}^k -(1-\theta){\mathbf{x}}^k-\theta{\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad =\theta^2L\left\|\frac{1}{\theta}{\mathbf{y}}^k-\theta {\mathbf{x}}^*- \frac{1-\theta}{\theta}{\mathbf{x}}^k- (1-\theta){\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad =\theta^2L\left\|\theta ({\mathbf{y}}^k -{\mathbf{x}}^*) + \left(\frac{1}{\theta}-\theta\right){\mathbf{y}}^k - \frac{1-\theta}{\theta}{\mathbf{x}}^k- (1-\theta){\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad =\theta^2L\left\|\theta ({\mathbf{y}}^k -{\mathbf{x}}^*) + (1-\theta)\left[ \left(1+\frac{1}{\theta}\right){\mathbf{y}}^k -\frac{1}{\theta}{\mathbf{x}}^k -{\mathbf{x}}^* \right] \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{a}\leq \theta^3L\left\|{\mathbf{y}}^k-{\mathbf{x}}^*\right\|{}^2 + \theta^2(1-\theta)L\left\| \left(1+\frac{1}{\theta}\right){\mathbf{y}}^k -\frac{1}{\theta}{\mathbf{x}}^k -{\mathbf{x}}^* \right\|{}^2\\ &\displaystyle &\displaystyle \quad \overset{b}=\frac{\theta\mu}{2}\left\|{\mathbf{y}}^k-{\mathbf{x}}^*\right\|{}^2 + \theta^2(1-\theta)L\left\| \left(1+\frac{1}{\theta}\right){\mathbf{y}}^k -\frac{1}{\theta}{\mathbf{x}}^k -{\mathbf{x}}^* \right\|{}^2, \end{array} \end{aligned}$

where in $\overset {a}\leq$ we use the convexity of ∥⋅∥² and in $\overset {b}=$ we use $\theta = \sqrt {\mu /(2L)}$ . Then from the update rule, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{1}{\theta}\Big[{\mathbf{x}}^{k}-(1-\theta){\mathbf{x}}^{k-1}\Big] = \left(1+\frac{1}{\theta}\right){\mathbf{y}}^k -\frac{1}{\theta}{\mathbf{x}}^k. \end{array} \end{aligned}$

So we obtain

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E}{{}_k}f({\mathbf{x}}^{k+1}) - f({\mathbf{x}}^*) + L \mathbb{E}_k \left\|{\mathbf{x}}^{k+1} - (1-\theta){\mathbf{x}}^k -\theta {\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq (1-\theta)\left( f({\mathbf{x}}^{k}) - f({\mathbf{x}}^*) + L \left\|{\mathbf{x}}^k - (1-\theta){\mathbf{x}}^{k-1} - \theta {\mathbf{x}}^* \right\|{}^2 \right)\\ &\displaystyle &\displaystyle \qquad + \frac{1}{2L}\mathbb{E}_k\left\|\nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k) \right\|{}^2. \end{array} \end{aligned}$

Setting $K = \log _{1-\theta }\left ( f({\mathbf {x}}^0) -f({\mathbf {x}}^*) + L \|{\mathbf {x}}^0 - {\mathbf {x}}^* \|{ }^2\right )+ \log _{1-\theta }( \epsilon ^{-1}) \sim O\left (\sqrt {\frac {L}{\mu }}\log (1/\epsilon )\right )$ and taking the first K iterations into account, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle &\displaystyle \mathbb{E} f({\mathbf{x}}^{K+1}) - f({\mathbf{x}}^*) +L \mathbb{E}\left\|{\mathbf{x}}^{K+1} - (1-\theta){\mathbf{x}}^K -\theta {\mathbf{x}}^*\right\|{}^2\\ &\displaystyle &\displaystyle \quad \leq \epsilon + \frac{1}{2L}\sum_{i = 0}^K (1-\theta)^{K-i}\mathbb{E}\left\| \nabla f({\mathbf{y}}^i) - \tilde{\nabla} f({\mathbf{y}}^i)\right \|{}^2. \end{array} \end{aligned}$

By the setting of $b=\frac {\sigma ^2}{2L\theta \epsilon }$ mini-batch size, we have

$\displaystyle \begin{aligned} \begin{array}{rcl} \mathbb{E}\left\| \nabla f({\mathbf{y}}^k) - \tilde{\nabla} f({\mathbf{y}}^k) \right\|{}^2 \leq 2L \theta \epsilon, \quad k\geq0. \end{array} \end{aligned}$

So we have

□

Theorem 5.11 indicates that the total stochastic gradient calls is $\tilde {O}\left (\frac {\sigma ^2}{\mu \epsilon }\right )$ . However, the algorithm converges in $\tilde {O}\left (\sqrt {\frac {L}{\mu }}\right )$ steps, which is the fastest rate even in deterministic algorithms. So the momentum technique enlarges the mini-batch size by $\Omega \left (\sqrt {\frac {L}{\mu }}\right )$ times.

References

1.
Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, (2017), pp. 1200–1206
2.
Z. Allen-Zhu, E. Hazan, Optimal black-box reductions between optimization objectives, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 1614–1622
3.
Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 3716–3726
4.
Z. Allen-Zhu, Y. Yuan, Improved SVRG for non-strongly-convex or sum-of-non-convex objectives, in Proceedings of the 33th International Conference on Machine Learning, New York, (2016), pp. 1080–1089
5.
D. Davis, W. Yin, Convergence rate analysis of several splitting schemes, in Splitting Methods in Communication, Imaging, Science, and Engineering (Springer, New York, 2016), pp. 115–163zbMATH
6.
A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 1646–1654
7.
C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699
8.
O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)MathSciNetzbMATH
9.
D. Garber, E. Hazan, C. Jin, S.M. Kakade, C. Musco, P. Netrapalli, A. Sidford, Faster eigenvector computation via shift-and-invert preconditioning, in Proceedings of the 33th International Conference on Machine Learning, New York, (2016), pp. 2626–2634
10.
R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, Lake Tahoe, vol. 26 (2013), pp. 315–323
11.
Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067
12.
H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 3384–3392
13.
J. Mairal, Optimization with first-order surrogate functions, in Proceedings of the 30th International Conference on Machine Learning, Atlanta, (2013), pp. 783–791
14.
Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1∕k ²). Sov. Math. Dokl. 27(2), 372–376 (1983)zbMATH
15.
Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer, New York, 2004)zbMATH
16.
Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)MathSciNetzbMATH
17.
M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)MathSciNetzbMATH
18.
S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)MathSciNetzbMATH
19.
S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proceedings of the 31th International Conference on Machine Learning, Beijing, (2014), pp. 64–72