Zhouchen Lin, Huan Li and Cong Fang

Accelerated Optimization for Machine Learning

First-Order Algorithms

Zhouchen Lin
Key Lab. of Machine Perception, School of EECS, Peking University, Beijing, China
Huan Li
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China
Cong Fang
School of Engineering and Applied Science, Princeton University, Princeton, NJ, USA
ISBN 978-981-15-2909-2    e-ISBN 978-981-15-2910-8
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.

The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

To our families. Without your great support, this book would not exist and even our careers would be meaningless.

Foreword by Michael I. Jordan

Optimization algorithms have been the engine that has powered the recent rise of machine learning. The needs of machine learning are different from those of other disciplines that have made use of the optimization toolbox; most notably, the parameter spaces are of high dimensionality, and the functions being optimized are often sums of millions of terms. In such settings, gradient-based methods are preferred over higher-order methods, and given that the computation of a full gradient can be infeasible, stochastic gradient methods are the coin of the realm. Putting such specifications together with the need to solve nonconvex optimization problems, to control the variance induced by stochastic sampling, and to develop algorithms that run on distributed platforms, one faces a new set of challenges for optimization. Surprisingly, many of these challenges have been addressed within the past decade.

The book by Lin, Li, and Fang is one of the first book-length treatments of this emerging field. The book covers gradient-based algorithms in detail, with a focus on the concept of acceleration. Acceleration is a key concept in modern optimization, supplying new algorithms and providing insight into achievable convergence rates. The book also covers stochastic methods, including variance control, and it includes material on asynchronous distributed implementations.

Any researcher wishing to work in the machine learning field should have a foundational understanding of the disciplines of statistics and optimization. The current book is an excellent place to obtain the latter and to begin one’s adventure in machine learning.

Michael I. Jordan
October 2019
Foreword by Zongben Xu

Optimization is one of the core topics in machine learning. While benefiting from advances in the native optimization community, optimization for machine learning has its own flavor. One remarkable phenomenon is that first-order algorithms more or less dominate the optimization methods in machine learning. While there have been some books or preprints that introduce the major optimization algorithms used in machine learning, either partially or thoroughly, this book focuses on a notable stream in recent machine learning optimization, namely the accelerated first-order methods. Originating from Polyak's heavy-ball method and triggered by Nesterov's series of works, accelerated first-order methods have become a hot topic in both the optimization and the machine learning communities and have yielded fruitful results. These results extend significantly beyond the traditional scope of unconstrained (and deterministic) convex optimization. New results include acceleration for constrained convex optimization and nonconvex optimization, stochastic algorithms, and general acceleration frameworks such as Katyusha and Catalyst. Some of them even attain nearly optimal convergence rates. Unfortunately, the existing literature is scattered across diverse and extensive publications. Mastering the basic techniques and gaining a global picture of this dynamic field is thus very difficult.

Fortunately, this monograph, coauthored by Zhouchen Lin, Huan Li, and Cong Fang, meets the need for a timely and quick education on accelerated first-order algorithms. The book first gives an overview of the development of accelerated first-order algorithms, which is extremely informative despite being brief. It then introduces the representative works in different categories, with detailed proofs that greatly facilitate the understanding of the underlying ideas and the mastering of the basic techniques. Without doubt, this book is a vital reference for those who want to learn the state of the art of machine learning optimization.

I have known Dr. Zhouchen Lin for a long time. He impresses me with his solid work, deep insights, and careful analysis of the problems arising from his diverse research fields. Sharing many research interests with him, one of which is learning-based optimization, I am delighted to see this book finally published after such painstaking writing.

Zongben Xu
October 2019
Foreword by Zhi-Quan Luo

First-order optimization methods have been the main workhorse in machine learning, signal processing, and artificial intelligence involving big data. These methods, while conceptually simple, require careful analysis and a good understanding to be deployed effectively. Issues such as acceleration, nonsmoothness, nonconvexity, and parallel and distributed implementation are critical due to their great impact on an algorithm's convergence behavior and running time.

This research monograph gives an excellent introduction to the algorithmic aspects of first-order optimization methods, focusing on algorithm design and convergence analysis. It treats in depth the issues of acceleration, nonconvexity, constraints, and asynchronous implementation. The topics covered and the results given in the monograph are very timely and strongly relevant to both researchers and practitioners of machine learning, signal processing, and artificial intelligence. The theoretical issues of lower bounds on complexity are purposely set aside to give way to algorithm design and convergence analysis. Overall, the treatment of the subject is quite balanced, and many useful insights are provided throughout the monograph.

The authors of this monograph are experienced researchers at the interface of machine learning and optimization. The monograph is very well written and makes an excellent read. It should be an important reference book for everyone interested in the optimization aspects of machine learning.

Zhi-Quan Luo
October 2019
Preface

While I was preparing advanced materials for the optimization course taught at Peking University, I found that accelerated algorithms are the most attractive and practical topic for students in engineering. This is also a hot topic at current machine learning conferences. While some books, such as [1–3], have introduced accelerated algorithms, their coverage is nevertheless incomplete, unsystematic, or not up to date. Thus, in early 2018, I decided to write a monograph on accelerated algorithms. My goal was to produce a book that is organized and self-contained, with sufficient preliminary materials and detailed proofs, so that readers need not consult scattered literature, be plagued by inconsistent notation, or be carried away from the central ideas by non-essential content. Luckily, my two Ph.D. students, Huan Li and Cong Fang, were happy to join this work.

This task turned out to be very hard, as we had to work around our busy schedules. Eventually, we managed to produce the first complete yet crude draft right before Huan Li and Cong Fang graduated. Polishing the book and correcting various errors took us another four months. Finally, we were truly honored to receive forewords from Prof. Michael I. Jordan, Prof. Zongben Xu, and Prof. Zhi-Quan Luo. While this book has deprived us of nearly all our leisure time over the past two years, we still feel that our endeavor has paid off now that every part of the book is ready.

We hope this book serves as a valuable reference for the machine learning and optimization communities. That would be the highest praise for our work.

References

  1. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
  2. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
  3. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)
Zhouchen Lin
Beijing, China
November 2019
Acknowledgements

The authors would like to thank all our collaborators and friends, especially: Bingsheng He, Junchi Li, Qing Ling, Guangcan Liu, Risheng Liu, Yuanyuan Liu, Canyi Lu, Zhiquan Luo, Yi Ma, Fanhua Shang, Zaiwen Wen, Xingyu Xie, Chen Xu, Shuicheng Yan, Wotao Yin, Xiaoming Yuan, Yaxiang Yuan, and Tong Zhang. The authors also thank Yuqing Hou, Jia Li, Shiping Wang, Jianlong Wu, Hongyang Zhang, and Pan Zhou for careful proofreading, and Celine Chang from Springer, who offered much assistance during the production of the book. This monograph is supported by the National Natural Science Foundation of China under Grant Nos. 61625301 and 61731018, and by the Beijing Academy of Artificial Intelligence.

Acronyms
AAAI  Association for the Advancement of Artificial Intelligence
AACD  Asynchronous Accelerated Coordinate Descent
AAGD  Asynchronous Accelerated Gradient Descent
AASCD  Asynchronous Accelerated Stochastic Coordinate Descent
AC-AGD  Almost Convex Accelerated Gradient Descent
Acc-ADMM  Accelerated Alternating Direction Method of Multipliers
Acc-SADMM  Accelerated Stochastic Alternating Direction Method of Multipliers
Acc-SDCA  Accelerated Stochastic Dual Coordinate Ascent
ADMM  Alternating Direction Method of Multipliers
AGD  Accelerated Gradient Descent
APG  Accelerated Proximal Gradient
ASCD  Accelerated Stochastic Coordinate Descent
ASGD  Asynchronous Stochastic Gradient Descent
ASVRG  Asynchronous Stochastic Variance Reduced Gradient
DSCAD  Distributed Stochastic Communication Accelerated Dual
ERM  Empirical Risk Minimization
EXTRA  EXact firsT-ordeR Algorithm
GIST  General Iterative Shrinkage and Thresholding
IC  Individually Convex
IFO  Incremental First-order Oracle
INC  Individually Nonconvex
iPiano  Inertial Proximal Algorithms for Nonconvex Optimization
IQC  Integral Quadratic Constraint
KKT  Karush–Kuhn–Tucker
KŁ  Kurdyka–Łojasiewicz
LASSO  Least Absolute Shrinkage and Selection Operator
LMI  Linear Matrix Inequality
MISO  Minimization by Incremental Surrogate Optimization
NC  Negative Curvature/Nonconvex
NCD  Negative Curvature Descent
PCA  Principal Component Analysis
PG  Proximal Gradient
SAG  Stochastic Average Gradient
SAGD  Stochastic Accelerated Gradient Descent
SCD  Stochastic Coordinate Descent
SDCA  Stochastic Dual Coordinate Ascent
SGD  Stochastic Gradient Descent
SPIDER  Stochastic Path-Integrated Differential Estimator
SVD  Singular Value Decomposition
SVM  Support Vector Machine
SVRG  Stochastic Variance Reduced Gradient
SVT  Singular Value Thresholding
VR  Variance Reduction

Contents
Index