Zhouchen Lin, Huan Li and Cong Fang

Accelerated Optimization for Machine Learning

First-Order Algorithms

Zhouchen Lin
Key Lab. of Machine Perception, School of EECS, Peking University, Beijing, China
Huan Li
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China
Cong Fang
School of Engineering and Applied Science, Princeton University, Princeton, NJ, USA
ISBN 978-981-15-2909-2    e-ISBN 978-981-15-2910-8
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.

The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

To our families. Without your great support, this book would not exist and even our careers would be meaningless.

Foreword by Michael I. Jordan

Optimization algorithms have been the engine that has powered the recent rise of machine learning. The needs of machine learning are different from those of other disciplines that have made use of the optimization toolbox; most notably, the parameter spaces are of high dimensionality, and the functions being optimized are often sums of millions of terms. In such settings, gradient-based methods are preferred over higher-order methods, and given that the computation of a full gradient can be infeasible, stochastic gradient methods are the coin of the realm. Putting such specifications together with the need to solve nonconvex optimization problems, to control the variance induced by stochastic sampling, and to develop algorithms that run on distributed platforms, one faces a new set of challenges for optimization. Surprisingly, many of these challenges have been addressed within the past decade.

The book by Lin, Li, and Fang is one of the first book-length treatments of this emerging field. The book covers gradient-based algorithms in detail, with a focus on the concept of acceleration. Acceleration is a key concept in modern optimization, supplying new algorithms and providing insight into achievable convergence rates. The book also covers stochastic methods, including variance control, and it includes material on asynchronous distributed implementations.

Any researcher wishing to work in the machine learning field should have a foundational understanding of the disciplines of statistics and optimization. The current book is an excellent place to obtain the latter and to begin one’s adventure in machine learning.

Michael I. Jordan
October 2019
Foreword by Zongben Xu

Optimization is one of the core topics in machine learning. While benefiting from advances in the native optimization community, optimization for machine learning has its own flavor. One remarkable phenomenon is that first-order algorithms more or less dominate the optimization methods in machine learning. While there have been some books or preprints that introduce the major optimization algorithms used in machine learning, either partially or thoroughly, this book focuses on a notable stream in recent machine learning optimization, namely the accelerated first-order methods. Originating from Polyak's heavy-ball method and triggered by Nesterov's series of works, accelerated first-order methods have become a hot topic in both the optimization and the machine learning communities and have yielded fruitful results. These results extend significantly beyond the traditional scope of unconstrained (and deterministic) convex optimization. New results include acceleration for constrained convex optimization and nonconvex optimization, stochastic algorithms, and general acceleration frameworks such as Katyusha and Catalyst. Some of them even attain nearly optimal convergence rates. Unfortunately, the existing literature is scattered across diverse and extensive publications. Mastering the basic techniques and gaining a global picture of this dynamic field is thus very difficult.

Fortunately, this monograph, coauthored by Zhouchen Lin, Huan Li, and Cong Fang, meets the need for a timely and quick education on accelerated first-order algorithms. The book first gives an overview of the development of accelerated first-order algorithms, which is extremely informative despite being brief. It then introduces the representative works in different categories, with detailed proofs that greatly facilitate the understanding of the underlying ideas and the mastering of the basic techniques. Without doubt, this book is a vital reference for those who want to learn the state of the art of machine learning optimization.

I have known Dr. Zhouchen Lin for a long time. He impresses me with his solid work, deep insights, and careful analysis of the problems arising from his diverse research fields. Sharing many research interests with him, one of which is learning-based optimization, I am delighted to see this book finally published after such painstaking writing.

Zongben Xu
October 2019
Foreword by Zhi-Quan Luo

First-order optimization methods have been the main workhorse in machine learning, signal processing, and artificial intelligence involving big data. These methods, while conceptually simple, require careful analysis and a good understanding to be deployed effectively. Issues such as acceleration, nonsmoothness, nonconvexity, and parallel and distributed implementation are critical due to their great impact on an algorithm's convergence behavior and running time.

This research monograph gives an excellent introduction to the algorithmic aspects of first-order optimization methods, focusing on algorithm design and convergence analysis. It treats in depth the issues of acceleration, nonconvexity, constraints, and asynchronous implementation. The topics covered and the results given in the monograph are very timely and strongly relevant to both researchers and practitioners of machine learning, signal processing, and artificial intelligence. The theoretical issues of lower bounds on complexity are purposely set aside to give way to algorithm design and convergence analysis. Overall, the treatment of the subject is quite balanced, and many useful insights are provided throughout the monograph.

The authors of this monograph are experienced researchers at the interface of machine learning and optimization. The monograph is very well written and makes an excellent read. It should be an important reference book for everyone interested in the optimization aspects of machine learning.

Zhi-Quan Luo
October 2019
Preface

While I was preparing advanced materials for the optimization course taught at Peking University, I found that accelerated algorithms are the most attractive and practical topic for students in engineering. This is also a hot topic at current machine learning conferences. While some books, such as [1–3], have introduced accelerated algorithms, their coverage is nevertheless incomplete, unsystematic, or not up to date. Thus, in early 2018, I decided to write a monograph on accelerated algorithms. My goal was to produce a book that is organized and self-contained, with sufficient preliminary materials and detailed proofs, so that readers need not consult scattered literature, be plagued by inconsistent notation, or be carried away from the central ideas by non-essential content. Luckily, my two Ph.D. students, Huan Li and Cong Fang, were happy to join this work.

This task turned out to be very hard, as we had to work around our busy schedules. Eventually, we managed to produce the first complete yet crude draft right before Huan Li and Cong Fang graduated. Polishing the book and correcting various errors took us another four months. Finally, we were truly honored to receive forewords from Prof. Michael I. Jordan, Prof. Zongben Xu, and Prof. Zhi-Quan Luo. While this book has deprived us of nearly all our leisure time over the past two years, we still feel that our endeavor has paid off now that every part of the book is ready.

We hope this book serves as a valuable reference for the machine learning and optimization communities. That would be the highest praise for our work.

References

  1. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
  2. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
  3. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)
Zhouchen Lin
Beijing, China
November 2019
Acknowledgements

The authors would like to thank all our collaborators and friends, especially: Bingsheng He, Junchi Li, Qing Ling, Guangcan Liu, Risheng Liu, Yuanyuan Liu, Canyi Lu, Zhiquan Luo, Yi Ma, Fanhua Shang, Zaiwen Wen, Xingyu Xie, Chen Xu, Shuicheng Yan, Wotao Yin, Xiaoming Yuan, Yaxiang Yuan, and Tong Zhang. The authors also thank Yuqing Hou, Jia Li, Shiping Wang, Jianlong Wu, Hongyang Zhang, and Pan Zhou for careful proofreading, and Celine Chang from Springer, who offered much assistance during the production of the book. This monograph is supported by the National Natural Science Foundation of China under Grant Nos. 61625301 and 61731018, and by the Beijing Academy of Artificial Intelligence.

Acronyms
AAAI  Association for the Advancement of Artificial Intelligence
AACD  Asynchronous Accelerated Coordinate Descent
AAGD  Asynchronous Accelerated Gradient Descent
AASCD  Asynchronous Accelerated Stochastic Coordinate Descent
AC-AGD  Almost Convex Accelerated Gradient Descent
Acc-ADMM  Accelerated Alternating Direction Method of Multipliers
Acc-SADMM  Accelerated Stochastic Alternating Direction Method of Multipliers
Acc-SDCA  Accelerated Stochastic Dual Coordinate Ascent
ADMM  Alternating Direction Method of Multipliers
AGD  Accelerated Gradient Descent
APG  Accelerated Proximal Gradient
ASCD  Accelerated Stochastic Coordinate Descent
ASGD  Asynchronous Stochastic Gradient Descent
ASVRG  Asynchronous Stochastic Variance Reduced Gradient
DSCAD  Distributed Stochastic Communication Accelerated Dual
ERM  Empirical Risk Minimization
EXTRA  EXact firsT-ordeR Algorithm
GIST  General Iterative Shrinkage and Thresholding
IC  Individually Convex
IFO  Incremental First-order Oracle
INC  Individually Nonconvex
iPiano  Inertial Proximal Algorithms for Nonconvex Optimization
IQC  Integral Quadratic Constraint
KKT  Karush–Kuhn–Tucker
KŁ  Kurdyka–Łojasiewicz
LASSO  Least Absolute Shrinkage and Selection Operator
LMI  Linear Matrix Inequality
MISO  Minimization by Incremental Surrogate Optimization
NC  Negative Curvature/Nonconvex
NCD  Negative Curvature Descent
PCA  Principal Component Analysis
PG  Proximal Gradient
SAG  Stochastic Average Gradient
SAGD  Stochastic Accelerated Gradient Descent
SCD  Stochastic Coordinate Descent
SDCA  Stochastic Dual Coordinate Ascent
SGD  Stochastic Gradient Descent
SPIDER  Stochastic Path-Integrated Differential Estimator
SVD  Singular Value Decomposition
SVM  Support Vector Machine
SVRG  Stochastic Variance Reduced Gradient
SVT  Singular Value Thresholding
VR  Variance Reduction

Contents
Index