RANDPROX: PRIMAL-DUAL OPTIMIZATION ALGORITHMS WITH RANDOMIZED PROXIMAL UPDATES

Abstract

Proximal splitting algorithms are well suited to solving large-scale nonsmooth optimization problems, in particular those arising in machine learning. We propose a new primal-dual algorithm in which the dual update is randomized; equivalently, the proximity operator of one of the functions in the problem is replaced by a stochastic oracle. For instance, some randomly chosen dual variables, instead of all, are updated at each iteration. Alternatively, the proximity operator of a function is called only with some small probability. A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem involving smooth and nonsmooth functions, possibly composed with linear operators. We derive linear convergence results in the presence of strong convexity; these results are new even in the deterministic case, when our algorithm reverts to the recently proposed Primal-Dual Davis-Yin algorithm. Some randomized algorithms from the literature are also recovered as particular cases (e.g., Point-SAGA). However, our randomization technique is general and encompasses many unbiased mechanisms beyond sampling and probabilistic updates, including compression. Since the convergence speed depends on the slowest among the primal and dual contraction mechanisms, the iteration complexity might remain the same when randomness is used. On the other hand, the computational complexity can be significantly reduced. Overall, randomness helps to obtain faster algorithms. This has long been known for stochastic-gradient-type algorithms, and our work shows that it fully applies in the more general primal-dual setting as well.

1. INTRODUCTION

Optimization problems arise in virtually all quantitative fields, including machine learning, data science, statistics, and many other areas (Palomar & Eldar, 2009; Sra et al., 2011; Bach et al., 2012; Cevher et al., 2014; Polson et al., 2015; Bubeck, 2015; Glowinski et al., 2016; Chambolle & Pock, 2016; Stathopoulos et al., 2016). In the big data era, they tend to be very high-dimensional, and first-order methods are particularly appropriate to solve them. When a function is smooth, an optimization algorithm typically makes calls to its gradient, whereas for a nonsmooth function, its proximity operator is called instead. Iterative optimization algorithms making use of proximity operators are called proximal (splitting) algorithms (Parikh & Boyd, 2014). Over the past 10 years or so, primal-dual proximal algorithms have been developed, and they are well suited to a broad class of large-scale optimization problems involving several functions, possibly composed with linear operators (Combettes & Pesquet, 2010; Boţ et al., 2014; Parikh & Boyd, 2014; Komodakis & Pesquet, 2015; Beck, 2017; Condat et al., 2023a; Combettes & Pesquet, 2021; Condat et al., 2022c). However, in many situations, these deterministic algorithms are too slow, and this is where randomized algorithms come to the rescue; they are variants of the deterministic algorithms with a cheaper iteration complexity, obtained by calling a random subset of the operators, instead of all of them, or by updating a random subset of the variables, instead of all of them, at every iteration. Stochastic Gradient Descent (SGD)-type methods (Robbins & Monro, 1951; Nemirovski et al., 2009; Bottou, 2012; Gower et al., 2020; Gorbunov et al., 2020; Khaled et al., 2020b) are a prominent example, with the huge success we all know. They consist in replacing a call to the gradient of a function, which

