RANDPROX: PRIMAL-DUAL OPTIMIZATION ALGORITHMS WITH RANDOMIZED PROXIMAL UPDATES

Abstract

Proximal splitting algorithms are well suited to solving large-scale nonsmooth optimization problems, in particular those arising in machine learning. We propose a new primal-dual algorithm, in which the dual update is randomized; equivalently, the proximity operator of one of the functions in the problem is replaced by a stochastic oracle. For instance, at every iteration, some randomly chosen dual variables are updated instead of all of them; or the proximity operator of a function is called only with some small probability. A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem involving smooth and nonsmooth functions, possibly composed with linear operators. We derive linear convergence results in the presence of strong convexity; these results are new even in the deterministic case, in which our algorithm reverts to the recently proposed Primal-Dual Davis-Yin algorithm. Some randomized algorithms from the literature are also recovered as particular cases (e.g., Point-SAGA). Our randomization technique is general, however, and encompasses many unbiased mechanisms beyond sampling and probabilistic updates, including compression. Since the convergence speed depends on the slowest among the primal and dual contraction mechanisms, the iteration complexity might remain the same when randomness is used. On the other hand, the computational complexity can be significantly reduced, so that, overall, randomness helps to obtain faster algorithms. This has long been known for stochastic-gradient-type algorithms, and our work shows that it fully applies in the more general primal-dual setting as well.

1. INTRODUCTION

Optimization problems arise in virtually all quantitative fields, including machine learning, data science, statistics, and many other areas (Palomar & Eldar, 2009; Sra et al., 2011; Bach et al., 2012; Cevher et al., 2014; Polson et al., 2015; Bubeck, 2015; Glowinski et al., 2016; Chambolle & Pock, 2016; Stathopoulos et al., 2016). In the big data era, these problems tend to be very high-dimensional, and first-order methods are particularly appropriate to solve them. When a function is smooth, an optimization algorithm typically makes calls to its gradient, whereas for a nonsmooth function, its proximity operator is called instead. Iterative optimization algorithms making use of proximity operators are called proximal (splitting) algorithms (Parikh & Boyd, 2014). Over the past decade or so, primal-dual proximal algorithms have been developed; they are well suited to a broad class of large-scale optimization problems involving several functions, possibly composed with linear operators (Combettes & Pesquet, 2010; Boţ et al., 2014; Parikh & Boyd, 2014; Komodakis & Pesquet, 2015; Beck, 2017; Condat et al., 2023a; Combettes & Pesquet, 2021; Condat et al., 2022c). However, in many situations, these deterministic algorithms are too slow, and this is where randomized algorithms come to the rescue: they are variants of the deterministic algorithms with a cheaper iteration complexity, obtained by calling a random subset of the operators, or updating a random subset of the variables, instead of all of them, at every iteration. Stochastic Gradient Descent (SGD)-type methods (Robbins & Monro, 1951; Nemirovski et al., 2009; Bottou, 2012; Gower et al., 2020; Gorbunov et al., 2020; Khaled et al., 2020b) are a prominent example, with the huge success we all know. They consist in replacing a call to the gradient of a function, which can itself be a sum or expectation of several functions, by a cheaper stochastic gradient estimate.
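To make this principle concrete (this illustrates generic SGD, not the algorithm of this paper), the sketch below minimizes a toy finite-sum objective by sampling one component gradient per iteration instead of computing all of them; the quadratic losses, data, and step size are arbitrary choices for the example:

```python
import random

# Toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * (a[i]*x - b[i])**2
a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 3.0]
n = len(a)

def full_gradient(x):
    # Deterministic gradient: the average of the n component gradients.
    return sum(a[i] * (a[i] * x - b[i]) for i in range(n)) / n

def stochastic_gradient(x, i):
    # Unbiased estimate: averaging over i recovers full_gradient(x).
    return a[i] * (a[i] * x - b[i])

random.seed(0)
x = 0.0
step = 0.05
for _ in range(2000):
    i = random.randrange(n)          # sample one component instead of all n
    x -= step * stochastic_gradient(x, i)
```

With a constant step size, plain SGD only fluctuates around the minimizer; removing this residual noise is precisely what variance-reduced methods achieve.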
By contrast, replacing the proximity operator of a possibly nonsmooth function by a stochastic proximity operator estimate is nearly virgin territory. This is an important challenge, because many functions of practical interest have a proximity operator that is expensive to compute. We can mention the nuclear norm of matrices, which requires singular value decompositions, indicator functions of sets onto which it is difficult to project, or optimal transport costs (Peyré & Cuturi, 2019). In this paper, we propose RandProx (Algorithm 2), a randomized version of the Primal-Dual Davis-Yin (PDDY) method (Algorithm 1), a proximal algorithm proposed recently (Salim et al., 2022b) and further analyzed in Condat et al. (2022c). In RandProx, one proximity operator appearing in the PDDY algorithm is replaced by a stochastic estimate. RandProx is variance-reduced (Hanzely & Richtárik, 2019; Gorbunov et al., 2020; Gower et al., 2020); that is, through the use of control variates, the random noise is mitigated and eventually vanishes, so that the algorithm converges to an exact solution, just like its deterministic counterpart. Algorithms with stochastic errors in the computation of proximity operators have been studied, for instance in Combettes & Pesquet (2016), but the errors are typically assumed to decay along the iterations at a certain rate, or some stepsizes are made decreasing. By contrast, in variance-reduced algorithms such as RandProx, which has fixed stepsizes, error compensation is automatic. We analyze RandProx and prove its linear convergence in the strongly convex setting, with additional results in the convex setting; we leave the nonconvex case, which requires different proof techniques, for future work. We mention relationships between our results and related works in the literature throughout the paper. In special cases, RandProx reduces to Point-SAGA (Defazio, 2016), the Stochastic Decoupling Method (Mishchenko & Richtárik, 2019), ProxSkip, SplitSkip and Scaffnew (Mishchenko et al., 2022), and randomized versions of the PAPC (Drori et al., 2015), PDHG (Chambolle & Pock, 2011) and ADMM (Boyd et al., 2011) algorithms. They are all generalized and unified within our new framework. Thus, RandProx paves the way to the design of proximal counterparts of variance-reduced SGD-type algorithms, just like Point-SAGA (Defazio, 2016) is the proximal counterpart of SAGA (Defazio et al., 2014).

2. PROBLEM FORMULATION

Let $\mathcal{X}$ and $\mathcal{U}$ be finite-dimensional real Hilbert spaces. We consider the generic convex optimization problem:

Find $x^\star \in \arg\min_{x \in \mathcal{X}} \; f(x) + g(x) + h(Kx)$,

where $K : \mathcal{X} \to \mathcal{U}$ is a nonzero linear operator; $f$ is a convex $L_f$-smooth function, for some $L_f > 0$; that is, its gradient $\nabla f$ is $L_f$-Lipschitz continuous (Bauschke & Combettes, 2017, Definition 1.47); and $g : \mathcal{X} \to \mathbb{R} \cup \{+\infty\}$ and $h : \mathcal{U} \to \mathbb{R} \cup \{+\infty\}$ are proper closed convex functions whose proximity operators are easy to compute. We will assume strong convexity of some functions: a convex function $\varphi$ is said to be $\mu_\varphi$-strongly convex, for some $\mu_\varphi \geq 0$, if $\varphi - \frac{\mu_\varphi}{2}\|\cdot\|^2$ is convex. This covers the case $\mu_\varphi = 0$, in which $\varphi$ is merely convex.

2.1. PROXIMITY OPERATORS AND PROXIMAL ALGORITHMS

We recall that for any function $\varphi$ and parameter $\gamma > 0$, the proximity operator of $\gamma\varphi$ is (Bauschke & Combettes, 2017):

$\mathrm{prox}_{\gamma\varphi} : x \in \mathcal{X} \mapsto \arg\min_{x' \in \mathcal{X}} \; \gamma\varphi(x') + \frac{1}{2}\|x' - x\|^2$.

This operator has a closed form for many functions of practical interest (Parikh & Boyd, 2014; Pustelnik & Condat, 2017; Gheche et al., 2018); see also the website http://proximity-operator.net. In addition, the Moreau identity holds:

$\mathrm{prox}_{\gamma\varphi^*}(x) = x - \gamma\,\mathrm{prox}_{\varphi/\gamma}(x/\gamma)$,

where $\varphi^* : x \in \mathcal{X} \mapsto \sup_{x' \in \mathcal{X}} \langle x, x' \rangle - \varphi(x')$ denotes the conjugate function of $\varphi$ (Bauschke & Combettes, 2017). Thus, one can compute the proximity operator of $\varphi$ from that of $\varphi^*$, and conversely.
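As a concrete instance of these definitions (an illustration, not taken from the paper), for $\varphi = \|\cdot\|_1$ the proximity operator is componentwise soft-thresholding, its conjugate $\varphi^*$ is the indicator of the $\ell_\infty$ unit ball, whose proximity operator is the projection onto that ball, and the Moreau identity connects the two. The sketch below, with hypothetical helper names, checks the identity numerically:

```python
def prox_l1(x, gamma):
    # prox of gamma * ||.||_1 : componentwise soft-thresholding.
    return [max(abs(v) - gamma, 0.0) * (1.0 if v >= 0 else -1.0) for v in x]

def prox_l1_conjugate(x, gamma):
    # The conjugate of ||.||_1 is the indicator of the l-infinity unit ball;
    # its prox (for any gamma > 0) is the projection onto that ball.
    return [min(max(v, -1.0), 1.0) for v in x]

def moreau_rhs(x, gamma):
    # Right-hand side of the Moreau identity:
    # prox_{gamma*phi*}(x) = x - gamma * prox_{phi/gamma}(x / gamma).
    # For phi = ||.||_1, prox_{phi/gamma} is soft-thresholding with threshold 1/gamma.
    inner = prox_l1([v / gamma for v in x], 1.0 / gamma)
    return [v - gamma * w for v, w in zip(x, inner)]

x = [2.5, -0.3, 1.0, -4.0]
gamma = 0.7
lhs = prox_l1_conjugate(x, gamma)   # direct projection onto the l-infinity ball
rhs = moreau_rhs(x, gamma)          # same point, obtained via the Moreau identity
```

Both routes give the same point, which is why, in practice, one only needs a cheap proximity operator for either a function or its conjugate.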

