RANDPROX: PRIMAL-DUAL OPTIMIZATION ALGORITHMS WITH RANDOMIZED PROXIMAL UPDATES

Abstract

Proximal splitting algorithms are well suited to solving large-scale nonsmooth optimization problems, in particular those arising in machine learning. We propose a new primal-dual algorithm, in which the dual update is randomized; equivalently, the proximity operator of one of the functions in the problem is replaced by a stochastic oracle. For instance, some randomly chosen dual variables, instead of all of them, are updated at each iteration; or the proximity operator of a function is called only with some small probability. A nonsmooth variance-reduction technique is implemented so that the algorithm finds an exact minimizer of the general problem involving smooth and nonsmooth functions, possibly composed with linear operators. We derive linear convergence results in the presence of strong convexity; these results are new even in the deterministic case, in which our algorithm reverts to the recently proposed Primal-Dual Davis-Yin (PDDY) algorithm. Some randomized algorithms of the literature are also recovered as particular cases (e.g., Point-SAGA). But our randomization technique is general and encompasses many unbiased mechanisms beyond sampling and probabilistic updates, including compression. Since the convergence speed depends on the slowest among the primal and dual contraction mechanisms, the iteration complexity might remain the same when randomness is used. On the other hand, the computational complexity can be significantly reduced. Overall, randomness helps to obtain faster algorithms. This has long been known for stochastic-gradient-type algorithms, and our work shows that it fully applies in the more general primal-dual setting as well.

Notation. We recall that for any function ϕ and parameter γ > 0, the proximity operator of γϕ is (Bauschke & Combettes, 2017): prox_{γϕ} : x ∈ X ↦ arg min_{x′∈X} γϕ(x′) + (1/2)∥x′ − x∥². This operator has a closed form for many functions of practical interest (Parikh & Boyd, 2014; Pustelnik & Condat, 2017; Gheche et al., 2018); see also the website http://proximity-operator.net. In addition, the Moreau identity holds: x = prox_{γϕ}(x) + γ prox_{γ^{-1}ϕ*}(x/γ), where ϕ* : x ∈ X ↦ sup_{x′∈X} ⟨x, x′⟩ − ϕ(x′) denotes the conjugate function of ϕ (Bauschke & Combettes, 2017). Thus, one can compute the proximity operator of ϕ from the one of ϕ*, and conversely.
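To make the proximity operator and the Moreau identity concrete, here is a minimal Python sketch for ϕ = ∥·∥₁, whose proximity operator is the standard soft-thresholding map; the helper names are ours, not from the paper, and the numbers are illustrative:

```python
import numpy as np

def prox_l1(x, gamma):
    # prox of gamma*||.||_1 is elementwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_l1_conj(x, gamma):
    # The conjugate of ||.||_1 is the indicator of the unit ell-infinity ball,
    # whose prox (for any scaling gamma) is the projection onto that ball.
    return np.clip(x, -1.0, 1.0)

# Moreau identity: x = prox_{gamma*phi}(x) + gamma * prox_{phi*/gamma}(x/gamma)
x = np.array([-2.0, 0.3, 1.5])
gamma = 0.7
lhs = x
rhs = prox_l1(x, gamma) + gamma * prox_l1_conj(x / gamma, 1.0 / gamma)
assert np.allclose(lhs, rhs)
```

In practice this is how the proximity operator of ϕ* is obtained from the one of ϕ, and conversely, as stated above.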

1. INTRODUCTION

Optimization problems arise in virtually all quantitative fields, including machine learning, data science, statistics, and many other areas (Palomar & Eldar, 2009; Sra et al., 2011; Bach et al., 2012; Cevher et al., 2014; Polson et al., 2015; Bubeck, 2015; Glowinski et al., 2016; Chambolle & Pock, 2016; Stathopoulos et al., 2016). In the big data era, they tend to be very high-dimensional, and first-order methods are particularly appropriate for solving them. When a function is smooth, an optimization algorithm typically makes calls to its gradient, whereas for a nonsmooth function, its proximity operator is called instead. Iterative optimization algorithms making use of proximity operators are called proximal (splitting) algorithms (Parikh & Boyd, 2014). Over the past ten years or so, primal-dual proximal algorithms have been developed that are well suited to a broad class of large-scale optimization problems involving several functions, possibly composed with linear operators (Combettes & Pesquet, 2010; Boţ et al., 2014; Parikh & Boyd, 2014; Komodakis & Pesquet, 2015; Beck, 2017; Condat et al., 2023a; Combettes & Pesquet, 2021; Condat et al., 2022c). However, in many situations these deterministic algorithms are too slow, and this is where randomized algorithms come to the rescue: they are variants of the deterministic algorithms with a cheaper iteration complexity, obtained by calling a random subset, instead of all, of the operators, or updating a random subset, instead of all, of the variables, at every iteration. Stochastic Gradient Descent (SGD)-type methods (Robbins & Monro, 1951; Nemirovski et al., 2009; Bottou, 2012; Gower et al., 2020; Gorbunov et al., 2020; Khaled et al., 2020b) are a prominent example, with well-known, immense success. They consist of replacing a call to the gradient of a function, which can itself be a sum or expectation of several functions, by a cheaper stochastic gradient estimate.
By contrast, replacing the proximity operator of a possibly nonsmooth function by a stochastic proximity operator estimate is nearly virgin territory. This is an important challenge, because many functions of practical interest have a proximity operator that is expensive to compute. We can mention the nuclear norm of matrices, which requires singular value decompositions; indicator functions of sets onto which it is difficult to project; or optimal transport costs (Peyré & Cuturi, 2019). In this paper, we propose RandProx (Algorithm 2), a randomized version of the Primal-Dual Davis-Yin (PDDY) method (Algorithm 1), a proximal algorithm proposed recently (Salim et al., 2022b) and further analyzed in Condat et al. (2022c). In RandProx, one proximity operator that appears in the PDDY algorithm is replaced by a stochastic estimate. RandProx is variance-reduced (Hanzely & Richtárik, 2019; Gorbunov et al., 2020; Gower et al., 2020); that is, through the use of control variates, the random noise is mitigated and eventually vanishes, so that the algorithm converges to an exact solution, just like its deterministic counterpart. Algorithms with stochastic errors in the computation of proximity operators have been studied, for instance in Combettes & Pesquet (2016), but there the errors are typically assumed to decay along the iterations at a certain rate, or the stepsizes are made decreasing. By contrast, in variance-reduced algorithms such as RandProx, which has fixed stepsizes, error compensation is automatic. We analyze RandProx and prove its linear convergence in the strongly convex setting, with additional results in the convex setting; we leave the nonconvex case, which requires different proof techniques, for future work. We mention relationships between our results and related works in the literature throughout the paper.
In special cases, RandProx reduces to Point-SAGA (Defazio, 2016) , the Stochastic Decoupling Method (Mishchenko & Richtárik, 2019) , ProxSkip, SplitSkip and Scaffnew (Mishchenko et al., 2022) , and randomized versions of the PAPC (Drori et al., 2015) , PDHG (Chambolle & Pock, 2011) and ADMM (Boyd et al., 2011) algorithms. They are all generalized and unified within our new framework. Thus, RandProx paves the way to the design of proximal counterparts of variance-reduced SGD-type algorithms, just like Point-SAGA (Defazio, 2016) is the proximal counterpart of SAGA (Defazio et al., 2014) .

2. PROBLEM FORMULATION

Let X and U be finite-dimensional real Hilbert spaces. We consider the generic convex optimization problem:

Find x⋆ ∈ arg min_{x∈X} f(x) + g(x) + h(Kx),   (1)

where K : X → U is a nonzero linear operator; f is a convex L_f-smooth function, for some L_f > 0; that is, its gradient ∇f is L_f-Lipschitz continuous (Bauschke & Combettes, 2017, Definition 1.47); and g : X → R ∪ {+∞} and h : U → R ∪ {+∞} are proper closed convex functions whose proximity operators are easy to compute. We will assume strong convexity of some functions: a convex function ϕ is said to be µ_ϕ-strongly convex, for some µ_ϕ ≥ 0, if ϕ − (µ_ϕ/2)∥·∥² is convex; this covers the case µ_ϕ = 0, in which ϕ is merely convex.

Proximal splitting algorithms, such as the forward-backward and the Douglas-Rachford algorithms (Bauschke & Combettes, 2017), are well suited to minimizing the sum of two functions, f + g or g + h in our notation. However, many problems take the form (1) with K ≠ Id, where Id denotes the identity, and the proximity operator of h ∘ K is intractable in most cases. A classical example is the total variation, widely used in image processing (Rudin et al., 1992; Caselles et al., 2011; Condat, 2014; 2017) or for regularization on graphs (Couprie et al., 2013), where h is some variant of the ℓ1 norm and K takes differences between adjacent values. Another example is when h is the indicator function of some nonempty closed convex set Ω ⊂ U; that is, h(u) = (0 if u ∈ Ω, +∞ otherwise), in which case the problem (1) can be rewritten as

Find x⋆ ∈ arg min_{x∈X} f(x) + g(x)  s.t.  Kx ∈ Ω.

If g = 0 and Ω = {b} for some b ∈ ran(K), where ran denotes the range, the problem can be further rewritten as the linearly constrained smooth minimization problem

Find x⋆ ∈ arg min_{x∈X} f(x)  s.t.  Kx = b.

This last problem has applications in decentralized optimization, for instance (Xin et al., 2020; Kovalev et al., 2020; Salim et al., 2022a).
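The total-variation example can be made concrete with a small sketch of our own (not from the paper): K is the 1D finite-difference operator and h = λ∥·∥₁, so that h(Kx) penalizes jumps between adjacent entries; prox_h is simple soft-thresholding, whereas prox_{h∘K} has no elementwise closed form.

```python
import numpy as np

d = 6
# K takes differences between adjacent entries: (Kx)_i = x_{i+1} - x_i
K = np.zeros((d - 1, d))
for i in range(d - 1):
    K[i, i], K[i, i + 1] = -1.0, 1.0

lam = 0.5
h = lambda u: lam * np.abs(u).sum()   # h = lam*||.||_1; its prox is easy
tv = lambda x: h(K @ x)               # h(Kx): the prox of h∘K is intractable elementwise

x = np.array([0.0, 1.0, 1.0, 3.0, 2.0, 2.0])
assert np.isclose(tv(x), lam * (1 + 0 + 2 + 1 + 0))   # sum of absolute jumps
```

Note that constant signals are in the kernel of K, which is why total variation penalizes oscillations but not offsets.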
Thus, the template problem (1) covers a wide range of optimization problems met in machine learning (Bach et al., 2012; Polson et al., 2015) , signal and image processing (Combettes & Pesquet, 2010; Chambolle & Pock, 2016) , control (Stathopoulos et al., 2016) , and many other fields. Examples include compressed sensing (Candès et al., 2006) , object discovery in computer vision (Vo et al., 2019) , ℓ 1 trend filtering (Kim et al., 2009) , group lasso (Yuan & Lin, 2006) , square-root lasso (Belloni et al., 2011) , Dantzig selector (Candès & Tao, 2007) , and support-vector machines (Cortes & Vapnik, 1995) .

2.2. THE DUAL PROBLEM, SADDLE-POINT REFORMULATION, AND OPTIMALITY CONDITIONS

In order to analyze algorithms solving such problems, we introduce the dual problem to (1):

Find u⋆ ∈ arg min_{u∈U} (f + g)*(−K*u) + h*(u),   (2)

where K* : U → X is the adjoint operator of K. We can also express the primal and dual problems as a combined saddle-point problem:

Find (x⋆, u⋆) ∈ arg min_{x∈X} max_{u∈U} f(x) + g(x) + ⟨Kx, u⟩ − h*(u).   (3)

For these problems to be well posed, we suppose that there exists x⋆ ∈ X such that

0 ∈ ∇f(x⋆) + ∂g(x⋆) + K*∂h(Kx⋆),   (4)

where ∂(·) denotes the subdifferential (Bauschke & Combettes, 2017). By Fermat's rule, every x⋆ satisfying (4) is a solution to (1). Equivalently to (4), we suppose that there exists (x⋆, u⋆) ∈ X × U such that

0 ∈ ∇f(x⋆) + ∂g(x⋆) + K*u⋆,   0 ∈ −Kx⋆ + ∂h*(u⋆).   (5)

Every (x⋆, u⋆) satisfying (5) is a primal-dual solution pair; that is, x⋆ is a solution to (1), u⋆ is a solution to (2), and (x⋆, u⋆) is a solution to (3).
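The two inclusions in (5) can be verified by hand on a scalar toy instance; the following sketch (our own example, not from the paper) checks a candidate primal-dual pair for f(x) = (1/2)(x − 3)², g = 0, h = |·|, and K = 2:

```python
# Tiny scalar instance: f(x) = 0.5*(x-3)**2, g = 0, h = |.|, K = 2.
# Candidate primal-dual pair worked out by hand: x_star = 1, u_star = 1.
K = 2.0
grad_f = lambda x: x - 3.0
x_star, u_star = 1.0, 1.0

# First inclusion in (5): 0 = grad f(x_star) + K* u_star   (g = 0, so dg = {0})
assert abs(grad_f(x_star) + K * u_star) < 1e-12

# Second inclusion: K x_star must lie in the subdifferential of h* at u_star.
# Here h* is the indicator of [-1, 1], so its subdifferential at the boundary
# point u_star = 1 is the normal cone [0, +inf); K*x_star = 2 belongs to it.
assert -1.0 <= u_star <= 1.0 and K * x_star >= 0.0
```

Since both inclusions hold, (x⋆, u⋆) = (1, 1) is a primal-dual solution pair in the sense above.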

3. PROPOSED ALGORITHM: RandProx

There exist several deterministic algorithms for solving the problem (1); see Condat et al. (2023a) for a recent overview. In this work, we focus on the PDDY algorithm (Algorithm 1) (Salim et al., 2022b; Condat et al., 2022c). In particular, our new algorithm RandProx (Algorithm 2) generalizes the PDDY algorithm with a stochastic estimate of the proximity operator of h*.

Algorithm 1 PDDY algorithm (Salim et al., 2022b)
input: initial points x_0 ∈ X, u_0 ∈ U; stepsizes γ > 0, τ > 0
v_0 := K*u_0
for t = 0, 1, ... do
    x̂_t := prox_{γg}(x_t − γ∇f(x_t) − γv_t)
    u_{t+1} := prox_{τh*}(u_t + τK x̂_t)
    v_{t+1} := K*u_{t+1}
    x_{t+1} := x̂_t − γ(v_{t+1} − v_t)
end for

Algorithm 2 RandProx [new]
input: initial points x_0 ∈ X, u_0 ∈ U; stepsizes γ > 0, τ > 0; ω ≥ 0
v_0 := K*u_0
for t = 0, 1, ... do
    x̂_t := prox_{γg}(x_t − γ∇f(x_t) − γv_t)
    u_{t+1} := u_t + (1/(1 + ω)) R_t( prox_{τh*}(u_t + τK x̂_t) − u_t )
    v_{t+1} := K*u_{t+1}
    x_{t+1} := x̂_t − γ(1 + ω)(v_{t+1} − v_t)
end for
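As a sanity check, the PDDY iteration can be run on a tiny total-variation denoising instance in R²; this is our own toy setup with a hand-derived solution, not an experiment from the paper:

```python
import numpy as np

# PDDY sketch on a tiny instance of (1): f(x) = 0.5*||x - a||^2, g = 0,
# h = lam*|.|, K = [-1, 1] (one finite difference), i.e. 1D total-variation
# denoising in R^2. For |a2 - a1| > 2*lam the solution is known by hand:
# x_star = (a1 + lam*s, a2 - lam*s) with s = sign(a2 - a1).
a = np.array([0.0, 3.0])
lam = 0.5
K = np.array([[-1.0, 1.0]])
grad_f = lambda x: x - a                        # L_f = 1
prox_h_conj = lambda u: np.clip(u, -lam, lam)   # h* = indicator of [-lam, lam]

gamma = 1.0                                     # gamma < 2/L_f = 2
tau = 1.0 / (gamma * 2.0)                       # tau*gamma*||K||^2 = 1
x, u = np.zeros(2), np.zeros(1)
v = K.T @ u
for t in range(500):
    x_hat = x - gamma * grad_f(x) - gamma * v   # prox_{gamma g} = Id since g = 0
    u = prox_h_conj(u + tau * (K @ x_hat))
    v_new = K.T @ u
    x = x_hat - gamma * (v_new - v)
    v = v_new

s = np.sign(a[1] - a[0])
x_star = np.array([a[0] + lam * s, a[1] - lam * s])   # (0.5, 2.5)
assert np.allclose(x, x_star, atol=1e-8)
```

The dual iterate settles at u = λ·sign(a₂ − a₁), which is consistent with the optimality conditions (5) for this instance.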

3.1. THE PDDY ALGORITHM

We recall the general convergence result for the PDDY algorithm (Condat et al., 2022c, Theorem 2): if γ ∈ (0, 2/L_f), τ > 0, and τγ∥K∥² ≤ 1, then (x_t)_{t∈N} converges to a primal solution x⋆ of (1) and (u_t)_{t∈N} converges to a dual solution u⋆ of (2). The PDDY algorithm is similar and closely related to the PD3O algorithm (Yan, 2018), as discussed in Salim et al. (2022b) and Condat et al. (2022c). It is also an instance (Algorithm 5) of the Asymmetric Forward-Backward Adjoint (AFBA) framework of Latafat & Patrinos (2017). We note that the popular Condat-Vũ algorithm (Condat, 2013; Vũ, 2013) can solve the same problem but has more restrictive conditions on γ and τ. In the PDDY algorithm, the full gradient ∇f can be replaced by a stochastic estimator, which is typically cheaper to compute (Salim et al., 2022b). Convergence rates and accelerations of the PDDY algorithm, as well as distributed versions of the algorithm, have been derived in Condat et al. (2022c). In particular, if µ_f > 0 or µ_g > 0, the primal problem (1) is strongly convex. In this case, a varying stepsize strategy accelerates the algorithm, with an O(1/t²) decay of ∥x_t − x⋆∥², where x⋆ is the unique solution to (1). But strong convexity of the primal problem is not sufficient for the PDDY algorithm to converge linearly, and additional assumptions on h and K are needed. We will prove linear convergence when both the primal and dual problems are strongly convex; this is a natural condition for primal-dual algorithms. We note that h is L_h-smooth, for some L_h > 0, if and only if h* is µ_{h*}-strongly convex, for some µ_{h*} > 0, with µ_{h*} = 1/L_h. In that case, the dual problem (2) is strongly convex.

3.2. RANDOMIZATION MECHANISM FOR THE PROXIMITY OPERATOR OF h *

We propose RandProx (Algorithm 2), a generalization of the PDDY algorithm (Algorithm 1) with a randomized update of the dual variable u. Let us formalize the random operations using random variables and stochastic processes. We introduce the underlying probability space (S, F, P). Given a real Hilbert space H, an H-valued random variable is a measurable map from (S, F) to (H, B), where B is the Borel σ-algebra of H. Formally, randomizing some steps in the PDDY algorithm amounts to defining (x_t, u_t)_{t∈N} as a stochastic process, with x_t an X-valued random variable and u_t a U-valued random variable, for every t ≥ 0. We use light notation and write our randomized algorithm RandProx using stochastic operators R_t on U; that is, for every t ≥ 0 and any r_t ∈ U, R_t(r_t) is a U-valued random variable, which can be interpreted as r_t plus 'random noise' (formally, r_t is itself a U-valued random variable, but algorithmically, R_t is applied to a particular outcome in U, hence the notation as an operator on U). To fix ideas, let us give two examples.

Example 1. The first example is compression (Alistarh et al., 2017; 2018; Horváth et al., 2022; Mishchenko et al., 2019; Albasyoni et al., 2020; Beznosikov et al., 2020; Condat et al., 2022b): U = R^d for some d ≥ 1 and R_t is the well-known rand-k compressor, or sparsifier, with 1 ≤ k < d: R_t multiplies k coordinates of the vector r_t, chosen uniformly at random, by d/k and sets the other ones to zero. An application to compressed communication is discussed in Section A.3.

Example 2. The second example, discussed in Section A.1, is the Bernoulli, or coin flip, operator

R_t : r_t ↦ ( (1/p) r_t with probability p, 0 with probability 1 − p ),   (6)

for some p > 0. In that case, with probability 1 − p, the outcome of R_t(r_t) is 0 and r_t does not need to be calculated; in particular, in the RandProx algorithm, prox_{τh*} is not called, and this is why one can expect the iteration complexity of RandProx to decrease.
Thus, in this example, R_t(r_t) does not really consist of applying an operator R_t to r_t; in general, the notation R_t(r_t) simply denotes a stochastic estimate of r_t.

Example 3. The third example, discussed in Section A.2, is sampling, which makes it possible to solve problems involving a sum Σ_{i=1}^n h_i of functions, by calling the proximity operator of only one randomly chosen function h_i, instead of all of them, at every iteration. The Point-SAGA algorithm (Defazio, 2016) is recovered as a particular case of RandProx in this setting.

Hereafter, we denote by F_t the σ-algebra generated by the collection of (X × U)-valued random variables (x_0, u_0), ..., (x_t, u_t), for every t ≥ 0. In this work, we consider unbiased random estimates: for every t ≥ 0,

E[R_t(r_t) | F_t] = r_t,

where E[·] denotes the expectation, here conditional on F_t, and r_t is the random variable r_t := prox_{τh*}(u_t + τK x̂_t) − u_t, as defined by RandProx. Note that our framework is general in that for t ≠ t′, R_t and R_{t′} need not be independent nor have the same law. In simple words, at every iteration the randomness is new but can have a different form and depend on the past, so that the operators R_t can be defined dynamically, on the fly, in RandProx. We characterize the operators R_t by their relative variance ω ≥ 0 such that, for every t ≥ 0,

E[ ∥R_t(r_t) − r_t∥² | F_t ] ≤ ω ∥r_t∥².   (7)

This assumption is satisfied by a large class of randomization strategies, which are widely used to define unbiased stochastic gradient estimates; we refer to Beznosikov et al. (2020), Table 1 in Safaryan et al. (2021), Zhang et al. (2023), and Szlendak et al. (2022) for examples. In Example 1 above (rand-k), ω = d/k − 1. In Example 2, ω = 1/p − 1. In Example 3, ω = n − 1. The value of ω is supposed known and is used in the RandProx algorithm.
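The rand-k operator of Example 1 and its relative variance ω = d/k − 1 can be checked exactly by enumerating all k-subsets instead of sampling; a sketch of our own:

```python
import itertools
import numpy as np

def rand_k(r, k, rng):
    # rand-k sparsifier: keep k coordinates chosen uniformly at random,
    # scaled by d/k so that the estimate is unbiased; zero out the rest
    d = r.size
    out = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * r[idx]
    return out

# Exact check of unbiasedness and of the relative variance omega = d/k - 1,
# by averaging over all C(d, k) subsets instead of drawing random samples.
d, k = 5, 2
r = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
subsets = list(itertools.combinations(range(d), k))
outs = []
for S in subsets:
    out = np.zeros(d)
    out[list(S)] = (d / k) * r[list(S)]
    outs.append(out)
mean = sum(outs) / len(outs)
second = sum(np.sum((o - r) ** 2) for o in outs) / len(outs)
assert np.allclose(mean, r)                               # E[R(r)] = r, i.e. (6)-style unbiasedness
assert np.isclose(second, (d / k - 1) * np.sum(r ** 2))   # equality in (7) with omega = d/k - 1
```

For rand-k the bound (7) actually holds with equality, which is why ω = d/k − 1 is the exact relative variance rather than just an upper bound.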
Note that ω = 0 if and only if R_t = Id, in which case there is no randomness and RandProx reverts to the original deterministic PDDY algorithm. Thus, R_t(r_t) = r_t + e_t, with the variance of the error e_t proportional to ∥r_t∥². In particular, if r_t = 0, there is no error and R_t(0) = 0. The stochastic operators R_t will be applied to a sequence of random vectors that converges to zero, and hence the error will converge to zero as well, by the relative variance property (7). RandProx is therefore a variance-reduced method (Hanzely & Richtárik, 2019; Gorbunov et al., 2020; Gower et al., 2020): the random errors vanish along the iterations and the algorithm converges to an exact solution of the problem. To characterize how the error on the dual variable propagates to the primal variable after applying K*, we also introduce the relative variance ω_ran ≥ 0 in the range of K* and the offset ζ ∈ [0, 1] such that, for every t ≥ 0,

E[ ∥K*(R_t(r_t) − r_t)∥² | F_t ] ≤ ω_ran ∥r_t∥² − ζ ∥K*r_t∥².   (8)

It is easy to see that (8) holds with ω_ran = ∥K∥²ω and ζ = 0, so this is the default choice without particular knowledge of K*. But in some situations, e.g., sampling as in Section A.2, a much smaller value of ω_ran and a positive value of ζ can be derived.

3.3. DESCRIPTION OF THE ALGORITHM

Let us now describe how the PDDY and RandProx algorithms work. An iteration consists of three steps:
1. Given x_t and u_t, the updated value of the primal variable is predicted to be x̂_t.
2. The points x̂_t and u_t are used to update the dual variable to its new value u_{t+1}.
3. The primal variable is corrected from x̂_t to x_{t+1}, by back-propagating the difference u_{t+1} − u_t through K*.

In RandProx, randomization takes place in Step 2. On average, this decreases the progress from u_t to u_{t+1}, and in turn from x̂_t to x_{t+1} in Step 3, but the progress from x_t to x̂_t, due to the unaltered proximal gradient descent step in Step 1, is kept. Therefore, randomization can be used to balance the progress speed on the primal and dual variables, depending on the relative computational complexity of the gradient and proximity operators. The random errors are kept under control and convergence is ensured using underrelaxation: let us define, for every t ≥ 0, û_{t+1} := prox_{τh*}(u_t + τK x̂_t). The PDDY algorithm updates the dual variable by setting u_{t+1} := û_{t+1}. In RandProx, let us define ũ_{t+1} := u_t + R_t(û_{t+1} − u_t) = û_{t+1} + e_t for some zero-mean random error e_t, keeping in mind that ũ_{t+1} is typically cheaper to compute than û_{t+1}. Then underrelaxation is applied: we set u_{t+1} := ρũ_{t+1} + (1 − ρ)u_t for some relaxation parameter ρ ∈ (0, 1]; we use ρ = 1/(1 + ω) in the algorithm. That is, the update of the dual variable is a convex combination of the old estimate u_t and the new, better in expectation but noisy, estimate ũ_{t+1}. Noise is mitigated by underrelaxation, because the error e_t is multiplied by ρ, so that its variance is multiplied by ρ². So, even if ω is arbitrarily large, ωρ² is kept small. Underrelaxation slows down the progress of the dual variable towards the solution, but if the iterations become faster, this is beneficial overall.
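The equivalence between the dual update written in Algorithm 2 and the underrelaxation view u_{t+1} = ρũ_{t+1} + (1 − ρ)u_t can be checked mechanically; a small sketch of our own, using the Bernoulli operator of Example 2 (for which ρ = 1/(1 + ω) = p):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25
omega = 1.0 / p - 1.0
rho = 1.0 / (1.0 + omega)     # underrelaxation parameter used in RandProx; here rho = p

u = rng.standard_normal(4)
u_hat = rng.standard_normal(4)            # stands for prox_{tau h*}(u_t + tau*K x_hat_t)
r = u_hat - u                             # the vector fed to R_t in Algorithm 2

for flip in (True, False):                # both outcomes of the Bernoulli coin
    R_r = r / p if flip else np.zeros(4)  # Bernoulli operator (6), unbiased: E[R(r)] = r
    u_tilde = u + R_r                     # noisy unbiased estimate of u_hat
    direct = u + rho * R_r                # dual update as written in Algorithm 2
    relaxed = rho * u_tilde + (1 - rho) * u
    assert np.allclose(direct, relaxed)
```

Both writings coincide for every outcome of R_t, so the scaling 1/(1 + ω) in Algorithm 2 is exactly the underrelaxation described above.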

4. CONVERGENCE ANALYSIS OF RandProx

Our most general result, whose proof is in the Appendix, is the following:

Theorem 1. Suppose that µ_f > 0 or µ_g > 0, and that µ_{h*} > 0. In RandProx, suppose that 0 < γ < 2/L_f, τ > 0, and γτ((1 − ζ)∥K∥² + ω_ran) ≤ 1, where ω_ran and ζ are defined in (8). For every t ≥ 0, define the Lyapunov function

Ψ_t := (1/γ)∥x_t − x⋆∥² + (1 + ω)(1/τ + 2µ_{h*})∥u_t − u⋆∥²,   (11)

where x⋆ and u⋆ are the unique solutions to (1) and (2), respectively. Then RandProx converges linearly: for every t ≥ 0, E[Ψ_t] ≤ c^t Ψ_0, where

c := max( (1 − γµ_f)²/(1 + γµ_g), (γL_f − 1)²/(1 + γµ_g), 1 − 2τµ_{h*}/((1 + ω)(1 + 2τµ_{h*})) ) < 1.   (13)

Also, (x_t)_{t∈N} and (x̂_t)_{t∈N} both converge to x⋆ and (u_t)_{t∈N} converges to u⋆, almost surely.

In Theorem 1, if γ ≤ 2/(L_f + µ_f), we have max(1 − γµ_f, γL_f − 1)² = (1 − γµ_f)² ≤ 1 − γµ_f, so that in that case the rate c in (13) satisfies

c ≤ 1 − min( γ(µ_f + µ_g)/(1 + γµ_g), 2τµ_{h*}/((1 + ω)(1 + 2τµ_{h*})) ) < 1.

Table 1: The different particular cases of the problem (1) for which we derive an instance of RandProx, with the number of the theorem where its linear convergence is stated, and the corresponding condition on h and K. λ is a shorthand notation for λ_min(KK*) and ı_{b} : x ↦ (0 if x = b, +∞ otherwise).

In the rest of this section, we discuss some particular cases of (1), for which we derive stronger convergence guarantees than in Theorem 1 for RandProx. Other particular cases are studied in the Appendix; for instance, an instance of RandProx, called RandProx-ADMM, is a randomized version of the popular ADMM (Boyd et al., 2011). The different particular cases are summarized in Table 1.
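For intuition about the rate in (13), one can evaluate c numerically; the parameter values below are illustrative choices of ours, not values from the paper, and satisfy the stepsize condition of Theorem 1:

```python
# Evaluate the contraction factor c of Theorem 1 for sample parameters.
def rate_c(gamma, tau, mu_f, mu_g, L_f, mu_h_conj, omega):
    # first two terms: primal contraction; third term: dual contraction,
    # slowed down by the relative variance omega of the stochastic operator
    primal = max((1 - gamma * mu_f) ** 2, (gamma * L_f - 1) ** 2) / (1 + gamma * mu_g)
    dual = 1 - 2 * tau * mu_h_conj / ((1 + omega) * (1 + 2 * tau * mu_h_conj))
    return max(primal, dual)

c0 = rate_c(gamma=0.1, tau=0.5, mu_f=1.0, mu_g=0.0, L_f=10.0, mu_h_conj=2.0, omega=0.0)
c3 = rate_c(gamma=0.1, tau=0.5, mu_f=1.0, mu_g=0.0, L_f=10.0, mu_h_conj=2.0, omega=3.0)
assert 0 < c0 < 1 and c0 <= c3 < 1   # larger omega slows the dual contraction only
```

This also illustrates the remark in the abstract: since c is the maximum of primal and dual factors, increasing ω may leave c, and hence the iteration complexity, unchanged whenever the primal factor dominates.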

4.1. PARTICULAR CASE g = 0

In this section, we assume that g = 0. Then the PDDY algorithm becomes an algorithm proposed for least-squares problems (Loris & Verhoeven, 2011) and rediscovered independently as the PDFP2O algorithm (Chen et al., 2013) and as the Proximal Alternating Predictor-Corrector (PAPC) algorithm (Drori et al., 2015); let us call it the PAPC algorithm. It has been shown to have a primal-dual forward-backward structure (Combettes et al., 2014). Thus, when g = 0, RandProx is a randomized version of the PAPC algorithm. We note that f* is strongly convex, which is not the case of (f + g)* in general. Let us define λ_min(KK*) as the smallest eigenvalue of KK*. λ_min(KK*) > 0 if and only if ker(K*) = {0}, where ker denotes the kernel. If λ_min(KK*) > 0, f*(−K*·) is strongly convex. Thus, when g = 0, λ_min(KK*) > 0 and µ_{h*} > 0 are two sufficient conditions for the dual problem (2) to be strongly convex. We indeed get linear convergence of RandProx in that case:

Theorem 2. Suppose that g = 0, µ_f > 0, and that λ_min(KK*) > 0 or µ_{h*} > 0. In RandProx, suppose that 0 < γ < 2/L_f, τ > 0, and γτ((1 − ζ)∥K∥² + ω_ran) ≤ 1. Then RandProx converges linearly: for every t ≥ 0, E[Ψ_t] ≤ c^t Ψ_0, where the Lyapunov function Ψ_t is defined in (11) and

c := max( (1 − γµ_f)², (γL_f − 1)², 1 − (2τµ_{h*} + γτλ_min(KK*))/((1 + ω)(1 + 2τµ_{h*})) ) < 1.   (14)

Also, (x_t)_{t∈N} and (x̂_t)_{t∈N} both converge to x⋆ and (u_t)_{t∈N} converges to u⋆, almost surely.

When R_t = Id and ω = ω_ran = 0, RandProx reverts to the PAPC algorithm. Even in this particular case, Theorem 2 proves linear convergence of the PAPC algorithm and is new. In Chen et al. (2013, Theorem 3.7), the authors proved linear convergence of an underrelaxed version of the algorithm; underrelaxation slows down convergence. In Luke & Shefi (2018), Theorem 3.1 is wrong, since it is based on the false assumption that if λ_min(K_i K_i*) > 0 for linear operators K_i, i = 1, ..., p, then λ_min(KK*) > 0, where K : x ↦ (K_1 x, ..., K_p x). Their theorem remains valid when p = 1, but their rate is complicated and worse than ours.

We now consider the even more particular case of g = 0 and K = Id. Then the problems (1) and (2) consist in minimizing f(x) + h(x) and f*(−u) + h*(u), respectively. The dual problem is strongly convex and has a unique solution u⋆ = −∇f(x⋆), for any primal solution x⋆. By setting τ := 1/γ in the PAPC algorithm, we obtain the classical proximal gradient, a.k.a. forward-backward (FB), algorithm, which iterates x_{t+1} := prox_{γh}(x_t − γ∇f(x_t)). Thus, when randomness is introduced, we set ω_ran := ω, ζ := 0 and, according to Remark 1, τ := 1/(γ(1 + ω)) in RandProx. By noting that, for every a > 0, the abstract operators R_t and a R_t((1/a)·) have the same properties, we can put the constant γ(1 + ω) outside R_t to simplify the algorithm, and rewrite RandProx as RandProx-FB, shown below.

Algorithm 3 RandProx-FB [new]
input: initial points x_0 ∈ X, u_0 ∈ X; stepsize γ > 0; ω ≥ 0
for t = 0, 1, ... do
    x̂_t := x_t − γ∇f(x_t) − γu_t
    d_t := R_t( x̂_t − prox_{γ(1+ω)h}(x̂_t + γ(1 + ω)u_t) )
    u_{t+1} := u_t + (1/(γ(1 + ω)²)) d_t
    x_{t+1} := x̂_t − (1/(1 + ω)) d_t
end for

Algorithm 4 RandProx-LC [new]
input: initial points x_0 ∈ X, u_0 ∈ U; stepsizes γ > 0, τ > 0; ω ≥ 0
v_0 := K*u_0
for t = 0, 1, ... do
    x̂_t := x_t − γ∇f(x_t) − γv_t
    u_{t+1} := u_t + (τ/(1 + ω)) R_t(K x̂_t − b)
    v_{t+1} := K*u_{t+1}
    x_{t+1} := x̂_t − γ(1 + ω)(v_{t+1} − v_t)
end for

As a corollary of Theorem 2, we have:

Theorem 3. Suppose that µ_f > 0. In RandProx-FB, suppose that 0 < γ < 2/L_f. For every t ≥ 0, define the Lyapunov function

Ψ_t := (1/γ)∥x_t − x⋆∥² + (1 + ω)(γ(1 + ω) + 2µ_{h*})∥u_t − u⋆∥²,   (15)

where x⋆ is the unique minimizer of f + h and u⋆ = −∇f(x⋆) is the unique minimizer of f*(−·) + h*. Then RandProx-FB converges linearly: for every t ≥ 0, E[Ψ_t] ≤ c^t Ψ_0, where

c := max( (1 − γµ_f)², (γL_f − 1)², 1 − (1 + (2/γ)µ_{h*})/((1 + ω)(1 + ω + (2/γ)µ_{h*})) ) < 1.   (16)

Also, (x_t)_{t∈N} and (x̂_t)_{t∈N} both converge to x⋆ and (u_t)_{t∈N} converges to u⋆, almost surely.

It is important to note that it is not necessary to have µ_{h*} > 0 in Theorem 3. If we ignore the properties of h*, the third factor in (16) can be replaced by its upper bound 1 − 1/(1 + ω)².
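A runnable sketch of RandProx-FB with the Bernoulli operator R_t, on a toy lasso-type problem whose solution is soft-thresholding; this is our own example with illustrative parameters, not an experiment from the paper:

```python
import numpy as np

# RandProx-FB sketch on min_x 0.5*||x - a||^2 + lam*||x||_1, whose solution
# is x_star = prox_{lam ||.||_1}(a) (soft-thresholding). R_t is the Bernoulli
# (coin flip) operator with probability p, so omega = 1/p - 1.
soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

a = np.array([3.0, -0.2, 1.0])
lam, gamma, p = 0.5, 1.0, 0.5          # gamma < 2/L_f with L_f = 1
omega = 1.0 / p - 1.0
rng = np.random.default_rng(42)

x = np.zeros(3)
u = np.zeros(3)
for t in range(600):
    x_hat = x - gamma * (x - a) - gamma * u          # grad f(x) = x - a
    s = x_hat - soft(x_hat + gamma * (1 + omega) * u, gamma * (1 + omega) * lam)
    d = s / p if rng.random() < p else np.zeros(3)   # Bernoulli R_t, unbiased
    u = u + d / (gamma * (1 + omega) ** 2)
    x = x_hat - d / (1 + omega)

x_star = soft(a, lam)                                 # (2.5, 0.0, 0.5)
assert np.allclose(x, x_star, atol=1e-6)
```

On the iterations where the coin comes up tails, d_t = 0 and the proximity operator is not evaluated at all, which is the computational saving discussed above; variance reduction still drives the iterates to the exact solution.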

4.2. LINEARLY CONSTRAINED SMOOTH MINIMIZATION

Let b ∈ ran(K). In this section, we consider the linearly constrained (LC) minimization problem

Find x⋆ ∈ arg min_{x∈X} f(x)  s.t.  Kx = b,   (17)

which is a particular case of (1) with g = 0 and h : u ∈ U ↦ (0 if u = b, +∞ otherwise). We have h* : u ∈ U ↦ ⟨u, b⟩ and prox_{τh*} : u ∈ U ↦ u − τb. The dual problem to (17) is

Find u⋆ ∈ arg min_{u∈U} f*(−K*u) + ⟨u, b⟩.   (18)

We denote by u⋆_0 the unique solution to (18) in ran(K). Then the set of solutions of (18) is the affine subspace u⋆_0 + ker(K*). Thus, the dual problem is not strongly convex, unless ker(K*) = {0}. Yet, we will see that strong convexity of f is sufficient for linear convergence of RandProx, without any condition on K. We rewrite RandProx in this setting as RandProx-LC (Algorithm 4), shown above. We observe that u_t no longer appears in the argument of R_t, so that the iteration can be rewritten in terms of the variable v_t = K*u_t, and u_t can be removed if we are not interested in estimating a dual solution. In any case, we denote by P_{ran(K)} the orthogonal projector onto ran(K) and by λ⁺_min(KK*) > 0 the smallest nonzero eigenvalue of KK*. Then:

Theorem 4. In the setup (17)-(18), suppose that µ_f > 0. In RandProx-LC, suppose that 0 < γ < 2/L_f, τ > 0, and γτ((1 − ζ)∥K∥² + ω_ran) ≤ 1. Define the Lyapunov function, for every t ≥ 0,

Ψ_t := (1/γ)∥x_t − x⋆∥² + ((1 + ω)/τ)∥u_t^0 − u⋆_0∥²,   (19)

where u_t^0 := P_{ran(K)}(u_t) is also the unique element of ran(K) such that v_t = K*u_t^0, x⋆ is the unique solution of (17), and u⋆_0 is the unique solution in ran(K) of (18). Then RandProx-LC converges linearly: for every t ≥ 0, E[Ψ_t] ≤ c^t Ψ_0, where

c := max( (1 − γµ_f)², (γL_f − 1)², 1 − γτλ⁺_min(KK*)/(1 + ω) ) < 1.   (20)

Also, (x_t)_{t∈N} and (x̂_t)_{t∈N} both converge to x⋆ and (u_t^0)_{t∈N} converges to u⋆_0, almost surely.

Theorem 4 is new even for the PAPC algorithm, when ω = 0: its linear convergence under the stronger condition γτ∥K∥² < 1 has been shown in Salim et al. (2022b, Theorem 6.2), but our rate in (20) is better. We further discuss RandProx-LC, which can be used for decentralized optimization, in the Appendix. Another example of application is when X = R^d, for some d ≥ 1, and K is a matrix; one can solve (17) by activating one row of K, chosen uniformly at random, at every iteration.
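A sketch of RandProx-LC with this row sampling, on a toy instance of our own; we use the conservative default ω_ran = ∥K∥²ω, ζ = 0 for the stepsize condition, and compare against the KKT solution computed directly:

```python
import numpy as np

# RandProx-LC sketch for min f(x) s.t. Kx = b with f(x) = 0.5*||x - a||^2,
# activating one row of K uniformly at random per iteration (rand-1 sampling
# of the dual coordinates, omega = n - 1 for n rows).
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0])
n = K.shape[0]
omega = n - 1.0
gamma = 1.0                                  # gamma < 2/L_f = 2
tau = 1.0 / (gamma * 2.0 * (1 + omega))      # ||K||^2 = 2 here; conservative choice
rng = np.random.default_rng(0)

x, u = np.zeros(3), np.zeros(n)
v = K.T @ u
for t in range(2000):
    x_hat = x - gamma * (x - a) - gamma * v
    r = K @ x_hat - b
    i = rng.integers(n)                      # rand-1: keep row i, scale by n
    R_r = np.zeros(n)
    R_r[i] = n * r[i]
    u = u + (tau / (1 + omega)) * R_r
    v_new = K.T @ u
    x = x_hat - gamma * (1 + omega) * (v_new - v)
    v = v_new

x_star = a - K.T @ np.linalg.solve(K @ K.T, K @ a - b)   # KKT solution
assert np.allclose(x, x_star, atol=1e-6)
assert np.allclose(K @ x, b, atol=1e-6)
```

Only one constraint row is touched per iteration, yet the iterates converge to the exact constrained minimizer, as Theorem 4 predicts for strongly convex f.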

5. CONVERGENCE IN THE MERELY CONVEX CASE

In all theorems, strong convexity of f or g is assumed; that is, µ f > 0 or µ g > 0. In this section, we remove this hypothesis, so that the primal problem is not necessarily strongly convex any more. But ∇f (x ⋆ ) is the same for every solution x ⋆ of (1), and we denote by ∇f (x ⋆ ) this element. We define the Bregman divergence of f at points (x, x ′ ) ∈ X 2 as D f (x, x ′ ) := f (x) -f (x ′ ) -⟨∇f (x ′ ), x -x ′ ⟩ ≥ 0. For every t ≥ 0, D f (x t , x ⋆ ) is the same for every solution x ⋆ of (1), and we denote by D f (x t , x ⋆ ) this element. D f (x t , x ⋆ ) can be viewed as a generalization of the objective gap f (x t ) -f (x ⋆ ) to the case when ∇f (x ⋆ ) ̸ = 0. D f (x t , x ⋆ ) is a loose kind of distance between x t and the solution set, but under some additional assumptions on f , for instance strict convexity, D f (x t , x ⋆ ) → 0 implies that the distance from x t to the solution set tends to zero. Also, D f (x t , x ⋆ ) ≥ 1 2L f ∥∇f (x t ) -∇f (x ⋆ )∥ 2 , so that D f (x t , x ⋆ ) → 0 implies that ∇f (x t ) t∈N converges to ∇f (x ⋆ ). Theorem 11. In RandProx, suppose that 0 < γ < 2 L f , τ > 0, and γτ (1 -ζ)∥K∥ 2 + ω ran ≤ 1. Then D f (x t , x ⋆ ) → 0, almost surely and in quadratic mean. Moreover, for every t ≥ 0, we define xt := 1 t+1 t i=0 x i . Then, for every t ≥ 0, E D f (x t , x ⋆ ) ≤ Ψ 0 (2 -γL f )(t + 1) = O(1/t). If, in addition, µ h * > 0, there is a unique dual solution u ⋆ to (2) and (u t ) t∈N converges to u ⋆ , in quadratic mean. We can derive counterparts of the other theorems in the same way. These theorems apply to all algorithms presented in the paper. For instance, Theorem 11 applies to Scaffnew (Mishchenko et al., 2022) , a particular case of RandProx-FL seen in Section A.3, and provides for it the first convergence results in the non-strongly convex case. Algorithm 5 RandProx-Skip [new] input: initial points x 0 ∈ X , u 0 ∈ U; stepsizes γ > 0, τ > 0; p ∈ (0, 1] v 0 := K * u 0 for t = 0, 1, . . . 
do xt := prox γg x t -γ∇f (x t ) -γv t Flip a coin θ t = (1 with probability p, 0 else) if θ t = 1 then u t+1 := prox τ h * (u t + τ K xt ) v t+1 := K * u t+1 x t+1 := xt -γ p (v t+1 -v t ) else u t+1 := u t , v t+1 := v t , x t+1 := xt end if end for Algorithm 6 RandProx-Minibatch [new] input: initial points x 0 ∈ X , (u 0 i ) n i=1 ∈ X n ; stepsize γ > 0; k ∈ {1, . . . , n} v 0 := n i=1 u 0 i for t = 0, 1, . . . do xt := prox γg x t -γ∇f (x t ) -γv t pick Ω t ⊂ {1, . . . , n} of size k unif. at random for i ∈ Ω t do u t+1 i := prox 1 γn h * i (u t i + 1 γn xt ) end for for i ∈ {1, . . . , n}\Ω t do u t+1 i := u t i end for v t+1 := n i=1 u t+1 i x t+1 := xt -γn k (v t+1 -v t ) end for Appendix A EXAMPLES A.1 SKIPPING THE PROXIMITY OPERATOR In this section, we consider the case of Bernoulli operators R t defined in (6), which compute and return their argument only with probability p > 0. RandProx becomes RandProx-Skip, shown above. Then ω = 1 p -1, ω ran = ∥K∥ 2 ω, and ζ = 0. If g = 0, RandProx-Skip reverts to the SplitSkip algorithm proposed recently (Mishchenko et al., 2022) . Our Theorems 1 and 4 recover the same rate as given for SplitSkip in Mishchenko et al. (2022, Theorem D.1) , if smoothness of h is ignored. If in addition K = Id and τ = 1 γ(1+ω) = p γ , RandProx-Skip reverts to ProxSkip, a particular case of SplitSkip (Mishchenko et al., 2022) . Our Theorem 3 applies to this case and allows us to exploit the possible smoothness of h in RandProx-Skip = ProxSkip, which is not the case of the results of (Mishchenko et al., 2022) . As a practical application of our new results, let us consider personalized federated learning (FL) (Hanzely et al., 2020) : given a client-server architecture with a master and n ≥ 1 users, each with local cost function f i , i = 1, . . . , n, the goal is to minimize (xi) n i=1 ∈(R d ) n n i=1 f i (x i ) + λ 2 n i=1 ∥x i -x∥ 2 , ( ) where x := 1 n n i=1 x i . Each f i is supposed L f -smooth and µ f -strongly convex. 
We set X := (R^d)^n, f : x = (x_i)_{i=1}^n ↦ Σ_{i=1}^n f_i(x_i), and h : x ↦ (λ/2) Σ_{i=1}^n ∥x_i − x̄∥². Then f is L_f-smooth and µ_f-strongly convex, and h is λ-smooth, so that µ_{h*} = 1/λ. Thus, with γ = 1/L_f, we have in (16):

c ≤ 1 − min{ µ_f/L_f , (1 + 2L_f/λ) / ( (1/p)(1/p + 2L_f/λ) ) } < 1.

Hence, with p = √(µ_f min(L_f, λ))/L_f = √( (µ_f/L_f) min(λ/L_f, 1) ), the communication complexity, in terms of the expected number of communication rounds to reach ϵ-accuracy, is O( (√(min(L_f, λ)/µ_f) + 1) log(1/ϵ) ), which, up to the '+1' term and the log factor, is optimal (Hanzely et al., 2020). This shows that in personalized FL with λ < L_f, the complexity can be decreased in comparison with non-personalized FL, which corresponds to λ = +∞. This is achieved by properly setting p in ProxSkip, according to our new theory, which exploits the smoothness of h.

A.2 SAMPLING AMONG SEVERAL FUNCTIONS

We first remark that we can extend Problem (1) with the term h(Kx) replaced by the sum Σ_{i=1}^n h_i(K_i x) of n ≥ 2 proper closed convex functions h_i composed with linear operators K_i : X → U_i, for some real Hilbert spaces U_i, by using the classical product-space trick: defining U := U_1 × · · · × U_n, h : u = (u_i)_{i=1}^n ∈ U ↦ Σ_{i=1}^n h_i(u_i), and K : x ∈ X ↦ (K_i x)_{i=1}^n ∈ U, we have h(Kx) = Σ_{i=1}^n h_i(K_i x). In particular, by setting K_i := Id and U_i := X, we consider in this section the problem:

Find x⋆ ∈ arg min_{x∈X} ( f(x) + g(x) + Σ_{i=1}^n h_i(x) ).   (23)

We have h* : (u_i)_{i=1}^n ∈ X^n ↦ Σ_{i=1}^n h_i*(u_i), and we suppose that every function h_i* is µ_{h*}-strongly convex, for some µ_{h*} ≥ 0; then h* is µ_{h*}-strongly convex. Thus, the dual problem to (23) is

Find (u_i⋆)_{i=1}^n ∈ arg min_{(u_i)_{i=1}^n ∈ X^n} ( (f + g)*(−Σ_{i=1}^n u_i) + Σ_{i=1}^n h_i*(u_i) ).   (24)

Since K*K = nId, ∥K∥² = n. Now, we choose R^t as the rand-k sampling operator, for some k ∈ {1, . . . , n}: R^t multiplies k elements out of the n of its argument sequence, chosen uniformly at random, by n/k and sets the other ones to zero. It is known (Condat & Richtárik, 2022, Proposition 1) that we can set ω := n/k − 1, ω_ran := n(n − k)/(k(n − 1)), ζ := (n − k)/(k(n − 1)). Note that this value of ω_ran is n − 1 times smaller than the naive bound ∥K∥²ω = n(n − k)/k. We have (1 − ζ)∥K∥² + ω_ran = n. RandProx in this setting, with τ := 1/(γn), becomes RandProx-Minibatch, shown above, and Theorem 1 yields:

Theorem 5. Suppose that µ_f > 0 or µ_g > 0, and that µ_{h*} > 0. In RandProx-Minibatch, suppose that 0 < γ < 2/L_f. Define the Lyapunov function, for every t ≥ 0,

Ψ^t := (1/γ) ∥x^t − x⋆∥² + (n/k)(γn + 2µ_{h*}) Σ_{i=1}^n ∥u_i^t − u_i⋆∥²,

where x⋆ and (u_i⋆)_{i=1}^n are the unique solutions to (23) and (24), respectively. Then RandProx-Minibatch converges linearly: for every t ≥ 0, E[Ψ^t] ≤ c^t Ψ^0, where

c := max( (1 − γµ_f)²/(1 + γµ_g), (γL_f − 1)²/(1 + γµ_g), 1 − 2kµ_{h*}/(n(γn + 2µ_{h*})) ).
Also, (x^t)_{t∈N} and (x̂^t)_{t∈N} both converge to x⋆, and (u_i^t)_{t∈N} converges to u_i⋆ for every i, almost surely.

RandProx-Minibatch with k = 1 becomes the Stochastic Decoupling Method (SDM) proposed in Mishchenko & Richtárik (2019), where strong convexity of g is not exploited, but similar guarantees are derived as in Theorem 5 if µ_g = 0. Linear convergence of SDM is also proved in Mishchenko & Richtárik (2019) under conditions related to ours in Theorems 2 and 4. Thus, RandProx-Minibatch extends SDM to larger minibatch size k and exploits possible strong convexity of g. When f = 0 and g = 0, SDM further simplifies to Point-SAGA (Defazio, 2016). In that case, our results do not apply directly, since there is no strong convexity in f and g any more. But when minimizing the average of functions h_i, with each function supposed L-smooth and µ-strongly convex, for some L ≥ µ > 0, we can transfer the strong convexity to g by subtracting µ/2 ∥·∥² from each h_i and setting g = µ/2 ∥·∥². This changes neither the problem nor the algorithm, but our Theorem 5 now applies, and with the right choice of γ, we recover the result in Defazio (2016) that the asymptotic complexity of Point-SAGA to reach ϵ-accuracy is O((n + √(nL/µ)) log(1/ϵ)), which is conjectured to be optimal. Thus, RandProx-Minibatch extends Point-SAGA to larger minibatch sizes and to the more general problem (23) with nonzero f or g. When n = 1, there is no randomness and SDM reverts to the DY algorithm discussed in Appendix G.

Algorithm 7 SDM (Mishchenko & Richtárik, 2019)
input: initial points x^0 ∈ X, (u_i^0)_{i=1}^n ∈ X^n; stepsize γ > 0
v^0 := Σ_{i=1}^n u_i^0
for t = 0, 1, . . . do
  x̂^t := prox_{γg}(x^t − γ∇f(x^t) − γv^t)
  pick i_t ∈ {1, . . . , n} uniformly at random
  x^{t+1} := prox_{γn h_{i_t}}(γn u_{i_t}^t + x̂^t)
  u_{i_t}^{t+1} := u_{i_t}^t + (1/(γn))(x̂^t − x^{t+1})
  for every i ∈ {1, . . . , n}\{i_t}, u_i^{t+1} := u_i^t
  v^{t+1} := Σ_{i=1}^n u_i^{t+1}   // = v^t + u_{i_t}^{t+1} − u_{i_t}^t
end for

Algorithm 8 Point-SAGA (Defazio, 2016)
input: initial points x^0 ∈ X, (u_i^0)_{i=1}^n ∈ X^n; stepsize γ > 0
v^0 := Σ_{i=1}^n u_i^0
for t = 0, 1, . . . do
  x̂^t := x^t − γv^t
  pick i_t ∈ {1, . . . , n} uniformly at random
  x^{t+1} := prox_{γn h_{i_t}}(γn u_{i_t}^t + x̂^t)
  u_{i_t}^{t+1} := u_{i_t}^t + (1/(γn))(x̂^t − x^{t+1})
  for every i ∈ {1, . . . , n}\{i_t}, u_i^{t+1} := u_i^t
  v^{t+1} := Σ_{i=1}^n u_i^{t+1}   // = v^t + u_{i_t}^{t+1} − u_{i_t}^t
end for

Algorithm 9 RandProx-FL [new]
input: initial estimates (x_i^0)_{i=1}^n ∈ X^n, (u_i^0)_{i=1}^n ∈ X^n such that Σ_{i=1}^n u_i^0 = 0; stepsize γ > 0; ω ≥ 0
for t = 0, 1, . . . do
  for i = 1, . . . , n at nodes in parallel do
    x̂_i^t := x_i^t − γ∇f_i(x_i^t) − γu_i^t
    a_i^t := R^t(x̂_i^t)   // send compressed vector a_i^t to master
  end for
  ā^t := (1/n) Σ_{i=1}^n a_i^t   // aggregation at master
  // broadcast ā^t to all nodes
  for i = 1, . . . , n at nodes in parallel do
    d_i^t := a_i^t − ā^t
    u_i^{t+1} := u_i^t + (1/(γ(1+ω)²)) d_i^t
    x_i^{t+1} := x̂_i^t − (1/(1+ω)) d_i^t
  end for
end for
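To make the Point-SAGA listing (Algorithm 8) concrete, here is a small NumPy sketch (ours, not from the paper) on the toy problem min_x Σ_i h_i(x) with h_i(x) = ½∥x − b_i∥², whose solution is the mean of the b_i and whose proximity operators have the closed form prox_{c h_i}(w) = (w + c b_i)/(1 + c); the stepsize value is of the order suggested by Defazio's analysis for L = µ = 1 and is an illustrative choice here.

```python
import numpy as np

# Point-SAGA on min_x sum_i h_i(x), h_i(x) = 0.5||x - b_i||^2.
# Solution: mean of the b_i. prox_{c h_i}(w) = (w + c b_i) / (1 + c).
rng = np.random.default_rng(2)
n, d = 5, 3
b = rng.standard_normal((n, d))

gamma = 0.2                       # illustrative stepsize for L = mu = 1
x = np.zeros(d)
u = np.zeros((n, d))              # one dual point per function; v = sum_i u_i
for _ in range(4000):
    x_hat = x - gamma * u.sum(axis=0)
    i = rng.integers(n)                                 # sample one function
    w = gamma * n * u[i] + x_hat
    x_new = (w + gamma * n * b[i]) / (1 + gamma * n)    # prox_{gamma*n h_i}
    u[i] = u[i] + (x_hat - x_new) / (gamma * n)         # = grad h_i(x_new)
    x = x_new

print(np.max(np.abs(x - b.mean(axis=0))))   # distance to the solution (tiny)
```

Note that the dual update leaves u_i equal to the (sub)gradient of h_i at the new point, which is exactly the variance-reduction mechanism described in the text.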

A.3 DISTRIBUTED AND FEDERATED LEARNING WITH COMPRESSION

We consider in this section distributed optimization within the client-server model, with a master node communicating back and forth with n ≥ 1 parallel workers. This is particularly relevant for federated learning (FL) (Konečný et al., 2016; McMahan et al., 2017; Kairouz et al., 2021; Li et al., 2020) , where a potentially huge number of devices, with their owners' data stored on each of them, are involved in the collaborative process of training a global machine learning model. The goal is to exploit the wealth of useful information lying in the heterogeneous data stored across the devices. Communication between the devices and the distant server, which can be costly and slow, is the main bottleneck in this framework. So, it is of primary importance to devise novel algorithmic strategies, which are efficient in terms of computation and communication complexities. A natural and widely used idea is to make use of (lossy) compression, to reduce the size of the communicated message (Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018; Khaled & Richtárik, 2019; Albasyoni et al., 2020; Basu et al., 2020; Dutta et al., 2020; Sattler et al., 2020; Xu et al., 2021) . Another popular idea is to make use of local steps (McMahan et al., 2017; Khaled et al., 2019; Stich, 2019; Khaled et al., 2020a; Malinovsky et al., 2020; Woodworth et al., 2020; Karimireddy et al., 2020; Gorbunov et al., 2021; Mishchenko et al., 2022) ; that is, communication with the server does not occur at every iteration but only every few iterations, for instance communication is triggered randomly with a small probability at every iteration. Between communication rounds, the workers perform multiple local steps independently, based on their local objectives. Our proposed algorithm RandProx-FL unifies the two strategies, in the sense that depending on the choice of the randomization process R t , we obtain a method with local steps or with compression, or both. 
The combination of local training and compression has been further investigated in our follow-up work (Condat et al., 2022a), and partial participation in Condat et al. (2023b). Thus, we consider the problem

Find x⋆ ∈ arg min_{x∈R^d} Σ_{i=1}^n f_i(x),   (27)

where d ≥ 1 is the model dimension and n ≥ 1 is the number of parallel workers, each having its own objective function f_i. Every function f_i : R^d → R is µ-strongly convex and L-smooth, for some L ≥ µ > 0. We define κ := L/µ. We can observe that (27) can be recast as (1) with K = Id, U = X, g = 0; that is, as the minimization of f + h, as studied in Section 4.1, with X = (R^d)^n, f : x = (x_i)_{i=1}^n ↦ Σ_{i=1}^n f_i(x_i), and h : x = (x_i)_{i=1}^n ↦ (0 if x_1 = · · · = x_n, +∞ otherwise). We note that f is µ-strongly convex and L-smooth, and µ_{h*} = 0. Making these substitutions in RandProx-FB yields RandProx-FL, a distributed algorithm well suited to FL, shown above. In RandProx-FL, randomization takes the form of linear random unbiased operators R^t applied to the vectors sent to the server. Note that at every iteration, the same operator R^t is applied at every node; that is, its randomness is shared. We can easily check that RandProx-FL is an instance of RandProx-FB, because of the linearity of the R^t and because the property Σ_{i=1}^n u_i^t = 0 is maintained at every iteration. Formally, the operator R^t applied as a whole in RandProx-FB consists of n copies of R^t applied individually at every node in RandProx-FL, which is why we keep the same notation; in particular, the value of ω is the same in both interpretations. Interestingly, in RandProx-FL, information about the functions f_i or their gradients is never communicated and is exploited completely locally. This is ideal in terms of privacy. As an application of Theorem 3, we obtain:

Theorem 10. In RandProx-FL, suppose that 0 < γ < 2/L_f.
Define the Lyapunov function, for every t ≥ 0,

Ψ^t := Σ_{i=1}^n ( (1/γ) ∥x_i^t − x⋆∥² + γ(1+ω)² ∥u_i^t − u_i⋆∥² ),

where x⋆ is the unique solution of (27) and u_i⋆ := −∇f_i(x⋆). Then RandProx-FL converges linearly: for every t ≥ 0, E[Ψ^t] ≤ c^t Ψ^0, where

c := max( (1 − γµ_f)², (γL_f − 1)², 1 − 1/(1+ω)² ) < 1.

Also, the (x_i^t)_{t∈N} and (x̂_i^t)_{t∈N} all converge to x⋆, and every (u_i^t)_{t∈N} converges to u_i⋆, almost surely.

If R^t is the Bernoulli compressor we have seen before in (6) and in Section A.1, RandProx-FL reverts to the Scaffnew algorithm proposed in Mishchenko et al. (2022), which communicates at every iteration with probability p ∈ (0, 1] and performs on average 1/p local steps between successive communication rounds. We have ω = 1/p − 1. The analysis of Scaffnew in Theorem 10 is the same as in Mishchenko et al. (2022). With γ = 1/L, the iteration complexity of Scaffnew is O((κ + 1/p²) log(1/ϵ)), and since the algorithm communicates with probability p, its average communication complexity is O((pκ + 1/p) log(1/ϵ)). In particular, with p = 1/√κ, the average communication complexity of Scaffnew is O(√κ log(1/ϵ)).

We now propose a new algorithm with compressed communication: in RandProx-FL we choose, for every t ≥ 0, R^t as the well-known rand-k compressor, for some k ∈ {1, . . . , d}: R^t multiplies k coordinates of its vector argument, chosen uniformly at random, by d/k and sets the other ones to zero. We have ω = d/k − 1. The iteration complexity with γ = 1/L is O((κ + d²/k²) log(1/ϵ)).

B CONTRACTION OF GRADIENT DESCENT

Lemma 1. For every γ > 0, the gradient descent operator Id − γ∇f is c_γ-Lipschitz continuous, with c_γ := max(1 − γµ_f, γL_f − 1); that is, for every (x, x′) ∈ X²,

∥(Id − γ∇f)x − (Id − γ∇f)x′∥ ≤ c_γ ∥x − x′∥.

Proof. Let (x, x′) ∈ X². By cocoercivity of ∇f − µ_f Id, we have (Bubeck, 2015, Lemma 3.11)

⟨∇f(x) − ∇f(x′), x − x′⟩ ≥ (L_f µ_f/(L_f + µ_f)) ∥x − x′∥² + (1/(L_f + µ_f)) ∥∇f(x) − ∇f(x′)∥².

Hence,

∥(Id − γ∇f)x − (Id − γ∇f)x′∥² ≤ (1 − 2γL_f µ_f/(L_f + µ_f)) ∥x − x′∥² + (γ² − 2γ/(L_f + µ_f)) ∥∇f(x) − ∇f(x′)∥².

Thus, if γ ≤ 2/(L_f + µ_f), the second coefficient is nonpositive and ∥∇f(x) − ∇f(x′)∥ ≥ µ_f ∥x − x′∥, so that

∥(Id − γ∇f)x − (Id − γ∇f)x′∥² ≤ (1 − 2γL_f µ_f/(L_f + µ_f) + (γ² − 2γ/(L_f + µ_f)) µ_f²) ∥x − x′∥² = (1 − γµ_f)² ∥x − x′∥².

On the other hand, if γ ≥ 2/(L_f + µ_f), since ∥∇f(x) − ∇f(x′)∥ ≤ L_f ∥x − x′∥,

∥(Id − γ∇f)x − (Id − γ∇f)x′∥² ≤ (1 − 2γL_f µ_f/(L_f + µ_f) + (γ² − 2γ/(L_f + µ_f)) L_f²) ∥x − x′∥² = (γL_f − 1)² ∥x − x′∥².

Since max(1 − γµ_f, γL_f − 1) = (1 − γµ_f if γ ≤ 2/(L_f + µ_f), γL_f − 1 otherwise) ≥ 0, we arrive at the stated expression of c_γ. □

We note that if γ < 2/L_f and µ_f > 0, then c_γ < 1.
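Lemma 1 can be checked numerically; the sketch below (ours, not from the paper) uses a quadratic f(x) = ½ xᵀAx, for which the Lipschitz constant of Id − γ∇f is exactly the operator norm of I − γA, and compares it with the claimed c_γ = max(1 − γµ_f, γL_f − 1).

```python
import numpy as np

# Check Lemma 1 on f(x) = 0.5 x^T A x, where mu_f and L_f are the extreme
# eigenvalues of A; then ||(I - gamma*A)||_2 should equal max(1-g*mu, g*L-1).
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
eigs = np.array([0.5, 1.0, 2.0, 4.0])        # mu_f = 0.5, L_f = 4.0
A = Q @ np.diag(eigs) @ Q.T
mu, L = eigs.min(), eigs.max()

for gamma in [0.1, 2 / (mu + L), 0.45]:      # all with gamma < 2 / L_f
    c = max(1 - gamma * mu, gamma * L - 1)   # Lipschitz constant from Lemma 1
    op = np.linalg.norm(np.eye(4) - gamma * A, 2)  # true operator norm
    print(gamma, round(op, 6), round(c, 6))
    assert op <= c + 1e-12
```

For quadratics the bound is tight (op equals c_γ up to rounding), and γ = 2/(µ_f + L_f) gives the smallest contraction factor, as the case split in the proof suggests.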

C PROOF OF THEOREM 1

Let t ∈ N. Let p t ∈ ∂g(x t ) be such that xt = x t -γ∇f (x t )-γp t -γK * u t ; p t exists and is unique, by properties of the proximity operator. We also define p ⋆ := -∇f (x ⋆ ) -K * u ⋆ ; we have p ⋆ ∈ ∂g(x ⋆ ). Let q t := p t -µ g xt and q ⋆ := p ⋆ -µ g x ⋆ . We have (1 + γµ g )x t = x t -γ∇f (x t ) -γq t -γK * u t . Let w t := x t -γ∇f (x t ) and w ⋆ := x ⋆ -γ∇f (x ⋆ ). Using ût+1 defined in (9), we have E x t+1 -x ⋆ 2 | F t = E x t+1 | F t -x ⋆ 2 + E x t+1 -E x t+1 | F t 2 | F t ≤ xt -x ⋆ -γK * (û t+1 -u t ) 2 + γ 2 ω ran ût+1 -u t 2 -γ 2 ζ K * (û t+1 -u t ) 2 . Moreover, xt -x ⋆ -γK * (û t+1 -u t ) 2 = xt -x ⋆ 2 + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u t )⟩ ≤ (1 + γµ g ) xt -x ⋆ 2 + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩ + 2γ⟨x t -x ⋆ , K * (u t -u ⋆ )⟩ = ⟨w t -w ⋆ -γ(q t -q ⋆ ) -γK * (u t -u ⋆ ), xt -x ⋆ ⟩ + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩ + 2γ⟨x t -x ⋆ , K * (u t -u ⋆ )⟩ = -2γ⟨q t -q ⋆ , xt -x ⋆ ⟩ + ⟨w t -w ⋆ + γ(q t -q ⋆ ) + γK * (u t -u ⋆ ), xt -x ⋆ ⟩ + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩ = -2γ⟨q t -q ⋆ , xt -x ⋆ ⟩ + 1 1 + γµ g ⟨w t -w ⋆ + γ(q t -q ⋆ ) + γK * (u t -u ⋆ ), w t -w ⋆ -γ(q t -q ⋆ ) -γK * (u t -u ⋆ )⟩ + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩ = -2γ⟨q t -q ⋆ , xt -x ⋆ ⟩ + 1 1 + γµ g w t -w ⋆ 2 - γ 2 1 + γµ g q t -q ⋆ + K * (u t -u ⋆ ) 2 + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩. We have ⟨q t -q ⋆ , xt -x ⋆ ⟩ ≥ 0. Hence, xt -x ⋆ -γK * (û t+1 -u t ) 2 ≤ 1 1 + γµ g w t -w ⋆ 2 - γ 2 1 + γµ g q t -q ⋆ + K * (u t -u ⋆ ) 2 + γ 2 K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩, so that E x t+1 -x ⋆ 2 | F t ≤ 1 1 + γµ g w t -w ⋆ 2 - γ 2 1 + γµ g q t -q ⋆ + K * (u t -u ⋆ ) 2 + γ 2 (1 -ζ) K * (û t+1 -u t ) 2 -2γ⟨x t -x ⋆ , K * (û t+1 -u ⋆ )⟩ + γ 2 ω ran ût+1 -u t 2 . 
On the other hand, E u t+1 -u ⋆ 2 | F t ≤ u t -u ⋆ + 1 1 + ω ût+1 -u t 2 + ω (1 + ω) 2 ût+1 -u t 2 = ω 2 (1 + ω) 2 u t -u ⋆ 2 + 1 (1 + ω) 2 ût+1 -u ⋆ 2 + 2ω (1 + ω) 2 ⟨u t -u ⋆ , ût+1 -u ⋆ ⟩ + ω (1 + ω) 2 ût+1 -u ⋆ 2 + ω (1 + ω) 2 u t -u ⋆ 2 - 2ω (1 + ω) 2 ⟨u t -u ⋆ , ût+1 -u ⋆ ⟩ = 1 1 + ω ût+1 -u ⋆ 2 + ω 1 + ω u t -u ⋆ 2 . ( ) Ignoring the last term in (34), we obtain: E Ψ t+1 | F t ≤ max (1 -γµ f ) 2 1 + γµ g , (γL f -1) 2 1 + γµ g , 1 - 2τ µ h * (1 + ω)(1 + 2τ µ h * ) Ψ t . Using the tower rule, we can unroll the recursion in ( 35) to obtain the unconditional expectation of Ψ t+1 . Since E[Ψ t ] → 0, we have E ∥x t -x ⋆ ∥ 2 → 0 and E ∥u t -u ⋆ ∥ 2 → 0. Moreover, using classical results on supermartingale convergence (Bertsekas, 2015, Proposition A.4.5) Let us go back to (34). Since g = 0, we have q t = q ⋆ = 0 and µ g = 0, so that E Ψ t+1 | F t ≤ 1 γ max(1 -γµ f , γL f -1) 2 x t -x ⋆ 2 + 1 + ω τ + 2ωµ h * u t -u ⋆ 2 -γ K * (u t -u ⋆ ) 2 . We have ∥K * (u t -u ⋆ )∥ 2 ≥ λ min (KK * ) ∥u t -u ⋆ ∥ 2 . This yields E Ψ t+1 | F t ≤ 1 γ max(1 -γµ f , γL f -1) 2 x t -x ⋆ 2 + 1 + ω τ + 2ωµ h * -γλ min (KK * ) u t -u ⋆ 2 ≤ max (1 -γµ f ) 2 , (γL f -1) 2 , 1 - 2τ µ h * + γτ λ min (KK * ) (1 + ω)(1 + 2τ µ h * ) Ψ t . ( ) The end of the proof is the same as the one of Theorem 1. □ Let us add here a remark on the PAPC algorithm, which is the particular case of RandProx when ω = 0, in the conditions of Theorem 2: Remark 2 (PAPC vs. proximal gradient descent on the dual problem) If µ f > 0, f * is µ -1 - smooth and L -1 f -strongly convex. Then f * • -K * is µ -1 f ∥K∥ 2 -smooth and L -1 f λ min (KK * )strongly convex. So, if ∇f * is computable, one can apply the proximal gradient algorithm on the dual problem (2), which iterates u t+1 = prox τ h * u t + τ K∇f * (-K * u t ) , with τ ∈ 0, 2µ f ∥K∥ 2 . If λ min (KK * ) > 0, this algorithm converges linearly: ∥u t+1 -u ⋆ ∥ 2 ≤ c 2 ∥u t -u ⋆ ∥ 2 with c = max 1 -τ L -1 f λ min (KK * ), τ µ -1 f ∥K∥ 2 -1 . 
c is smallest with τ = 2/(µ_f^{-1}∥K∥² + L_f^{-1}λ_min(KK*)), in which case

c = (1 − (µ_f/L_f)(λ_min(KK*)/∥K∥²)) / (1 + (µ_f/L_f)(λ_min(KK*)/∥K∥²)).

This is much worse than the rate of the PAPC algorithm, since it involves the product of the condition numbers L_f/µ_f and ∥K∥²/λ_min(KK*), instead of their maximum. This is due to calling gradients of f* ∘ (−K*), whereas f and K are split, or decoupled, in the PAPC algorithm.

E PROOF OF THEOREM 4 AND FURTHER DISCUSSION

Algorithm 10 RandPriLiCo [new]
input: initial points x^0 ∈ X, v^0 ∈ ran(W); stepsizes γ > 0, τ > 0; ω ≥ 0
for t = 0, 1, . . . do
  x̂^t := x^t − γ∇f(x^t) − γv^t
  d^{t+1} := τ S^t(W x̂^t − a)
  v^{t+1} := v^t + (1/(1+ω)) d^{t+1}
  x^{t+1} := x̂^t − γ d^{t+1}
end for

We observe that in RandProx-LC and Theorem 4, it is as if the sequence (u_0^t)_{t∈N} had been computed by the following iteration, initialized with x^0 ∈ X and u_0^0 := P_{ran(K)}(u^0):
  x̂^t := x^t − γ∇f(x^t) − γv^t
  u_0^{t+1} := u_0^t + (1/(1+ω)) P_{ran(K)} R^t(τ(K x̂^t − b))
  v^{t+1} := K* u_0^{t+1}
  x^{t+1} := x̂^t − γ(1+ω)(v^{t+1} − v^t).
Then we remark that this is simply the iteration of RandProx, with R^t replaced by R̃^t := P_{ran(K)} ∘ R^t. Since its argument r^t = τ(K x̂^t − b) is always in ran(K), R̃^t is unbiased, and we have, for every t ≥ 0, E[∥R̃^t(r^t) − r^t∥² | F^t] ≤ E[∥R^t(r^t) − r^t∥² | F^t] ≤ ω∥r^t∥², where F^t is the σ-algebra generated by the collection of random variables (x^0, u_0^0), . . . , (x^t, u_0^t). Also, ω_ran is unchanged. Therefore, the analysis of RandProx in Theorem 2 applies, with u^t replaced by u_0^t and u⋆ by u_0⋆. Now, for every u ∈ ran(K), ∥K*u∥² ≥ λ_min^+(KK*) ∥u∥², and using this lower bound in the proof of Theorem 2, with µ_{h*} = 0, we obtain Theorem 4. □

Furthermore, the constraint Kx = b is equivalent to the constraint K*Kx = K*b; so, let us consider problems where we are given K*K and not K in the first place. Let W be a linear operator on X which is self-adjoint, i.e. W* = W, and positive, i.e. ⟨Wx, x⟩ ≥ 0 for every x ∈ X. Let a ∈ ran(W). We consider the linearly constrained minimization problem

Find x⋆ ∈ arg min_{x∈X} f(x) s.t. Wx = a.   (37)

Now, we let U := X and K = K* := √W, where √W is the unique positive self-adjoint linear operator on X such that √W √W = W. Also, b is defined as the unique element in ran(W) = ran(K) such that √W b = a. Then (37) is equivalent to (17) and the dual problem is (18). We consider the Randomized Primal Linearly Constrained minimization algorithm (RandPriLiCo), shown above. We suppose that the stochastic operators S^t in RandPriLiCo satisfy, for every t ≥ 0, E[S^t(r^t) | F^t] = r^t and E[∥S^t(r^t) − r^t∥² | F^t] ≤ ω∥r^t∥², for some ω ≥ 0, where r^t := τW x̂^t − τa. In addition, we suppose that the S^t commute with √W: for every t ≥ 0 and x ∈ X, √W S^t(x) = S^t(√W x). This is satisfied with the Bernoulli operators or some linear sketching operators, for instance. Then RandPriLiCo is equivalent to RandProx-LC, with S^t playing the role of R^t, ω_ran = ∥W∥ω, and ζ = 0. Applying Theorem 4 with these equivalences, we obtain:

Algorithm 11 CP algorithm (Chambolle & Pock, 2011)
input: initial points x^0 ∈ X, u^0 ∈ U; stepsizes γ > 0, τ > 0
x̂^0 := prox_{γg}(x^0 − γK*u^0)
for t = 0, 1, . . . do
  u^{t+1} := prox_{τh*}(u^t + τK x̂^t)   // x^{t+1} := x̂^t − γK*(u^{t+1} − u^t)
  x̂^{t+1} := prox_{γg}(x̂^t − γK*(2u^{t+1} − u^t))
end for

Algorithm 12 RandProx-CP [new]
input: initial points x^0 ∈ X, u^0 ∈ U; stepsizes γ > 0, τ > 0; ω ≥ 0
x̂^0 := prox_{γg}(x^0 − γK*u^0)
for t = 0, 1, . . . do
  d^t := R^t( prox_{τh*}(u^t + τK x̂^t) − u^t )
  u^{t+1} := u^t + (1/(1+ω)) d^t   // x^{t+1} := x̂^t − γK*d^t
  x̂^{t+1} := prox_{γg}(x̂^t − γK*(u^{t+1} + d^t))
end for

Theorem 6. In the setting of (37), suppose that µ_f > 0. In RandPriLiCo, suppose that 0 < γ < 2/L_f, τ > 0, and γτ∥W∥(1 + ω) ≤ 1. Define the Lyapunov function, for every t ≥ 0,

Ψ^t := (1/γ) ∥x^t − x⋆∥² + ((1+ω)/τ) ∥u_0^t − u_0⋆∥²,

where u_0^t is the unique element in ran(W) such that v^t = √W u_0^t, x⋆ is the unique solution of (37), and u_0⋆ is the unique element in ran(W) such that −∇f(x⋆) = √W u_0⋆. Then RandPriLiCo converges linearly: for every t ≥ 0, E[Ψ^t] ≤ c^t Ψ^0, where

c := max( (1 − γµ_f)², (γL_f − 1)², 1 − γτλ_min^+(W)/(1 + ω) ) < 1.

Also, (x^t)_{t∈N} and (x̂^t)_{t∈N} both converge to x⋆ almost surely.

F PARTICULAR CASE f = 0: RANDOMIZED CHAMBOLLE-POCK ALGORITHM

In this section, we suppose that f = 0.
The primal problem (1) becomes: Find x ⋆ ∈ arg min x∈X g(x) + h(Kx) , and the dual problem (2) becomes: Find u ⋆ ∈ arg min u∈U g * (-K * u) + h * (u) . The PDDY algorithm becomes the Chambolle-Pock (CP), a.k.a. PDHG, algorithm (Chambolle & Pock, 2011) , shown above. RandProx can be rewritten as RandProx-CP, shown above, too. In both algorithms, the variable x t is not needed any more and can be removed. Since f = 0, L f > 0 can be set arbitrarily close to zero, so that Theorem 1 can be rewritten as: Theorem 7. Suppose that µ g > 0 and µ h * > 0. In RandProx-CP, suppose that γ > 0, τ > 0, γτ (1 -ζ)∥K∥ 2 + ω ran ≤ 1. Define the Lyapunov function, for every t ≥ 0, Ψ t := 1 γ x t -x ⋆ 2 + (1 + ω) 1 τ + 2µ h * u t -u ⋆ 2 , ( ) where x ⋆ and u ⋆ are the unique solutions to (42) and (43), respectively. Then RandProx-CP converges linearly: for every t ≥ 0, E Ψ t ≤ c t Ψ 0 , Algorithm 13 ADMM input: initial points x 0 ∈ X , u 0 ∈ U; stepsize γ > 0 for t = 0, 1, . . . do xt := prox γg (x t -γu t ) x t+1 := prox γh (x t + γu t ) u t+1 := u t + 1 γ (x t -x t+1 ) end for Algorithm 14 RandProx-ADMM [new] input: initial points x 0 ∈ X , u 0 ∈ U; stepsize γ > 0; ω ≥ 0 for t = 0, 1, . It would be interesting to study whether the mechanism in the stochastic PDHG algorithm proposed in Chambolle et al. (2018) can be viewed as a particular case of RandProx-CP; we leave the analysis of this connection for future work. In any case, the strong convexity constants µ g and µ h * need to be known in the linearly converging version of the stochastic PDHG algorithm, which is not the case here; this is an important advantage of RandProx-CP. Now, let us look at the particular case K = Id in ( 42) and ( 43). The primal problem becomes:  When K = Id, the CP algorithm with τ = 1 γ reverts to the Douglas-Rachford algorithm, which is equivalent to the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011; Condat et al., 2023a) , shown above. 
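To illustrate the ADMM listing (Algorithm 13), here is a small NumPy sketch (ours, not from the paper) on the toy problem min_x g(x) + h(x) with g(x) = ½∥x − a∥² and h(x) = ½∥x − b∥², whose minimizer is (a + b)/2; following the Douglas–Rachford form, we read the second proximal step as prox_{γh}(x̂^t + γu^t), which is an interpretation of the transcription above.

```python
import numpy as np

# ADMM / Douglas-Rachford on min_x g(x) + h(x), with
# g(x) = 0.5||x - a||^2 and h(x) = 0.5||x - b||^2; minimizer is (a + b)/2.
# prox_{gamma g}(w) = (w + gamma*a) / (1 + gamma), and similarly for h.
a = np.array([3.0, -1.0])
b = np.array([1.0, 5.0])
gamma = 0.7
x = np.zeros(2)
u = np.zeros(2)
for _ in range(200):
    x_hat = (x - gamma * u + gamma * a) / (1 + gamma)      # prox_{gamma g}
    x_new = (x_hat + gamma * u + gamma * b) / (1 + gamma)  # prox_{gamma h}
    u = u + (x_hat - x_new) / gamma                        # dual update
    x = x_new

print(x)   # approaches (a + b) / 2
```

At a fixed point, x̂ = x = x⋆ and u = ∇h(x⋆) = −∇g(x⋆), which matches the optimality condition 0 ∈ ∂g(x⋆) + ∂h(x⋆); the method converges for any γ > 0.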
Therefore, in that case, with ω ran = ω, ζ = 0 and τ = 1 γ(1+ω) , RandProx-CP can be rewritten as RandProx-ADMM, shown above. Theorem 7 becomes: Theorem 8. Suppose that µ g > 0 and µ h * > 0. In RandProx-ADMM, suppose that γ > 0. For every t ≥ 0, define the Lyapunov function Ψ t := 1 γ x t -x ⋆ 2 + (1 + ω) γ(1 + ω) + 2µ h * u t -u ⋆ 2 , ( ) where x ⋆ and u ⋆ are the unique solutions to (48) and (49), respectively. Then RandProx-ADMM converges linearly: for every t ≥ 0, E Ψ t ≤ c t Ψ 0 , where c := max 1 1 + γµ g , 1 - 2τ µ h * (1 + ω)(1 + 2τ µ h * ) (52) = 1 -min γµ g 1 + γµ g , 2τ µ h * (1 + ω)(1 + 2τ µ h * ) < 1. Also, (x t ) t∈N and (x t ) t∈N both converge to x ⋆ and (u t ) t∈N converges to u ⋆ , almost surely. where the second inequality follows from cocoercivity of the gradient. Moreover, for every (x, x ′ ) ∈ X 2 , D f (x, x ′ ) ≤ ⟨∇f (x) -∇f (x ′ ), x -x ′ ⟩. Therefore, in the proof of Theorem 1, for every primal-dual solution (x ⋆ , u ⋆ ) and t ≥ 0, since ∥w t -w ⋆ ∥ 2 = ∥(Id -γ∇f )x t -(Id -γ∇f )x ⋆ ∥ 2 , (33) yields E Ψ t+1 | F t ≤ 1 γ x t -x ⋆ 2 -(2 -γL f )D f (x t , x ⋆ ) + 1 + ω τ + 2ωµ h * u t -u ⋆ 2 -γ q t -q ⋆ + K * (u t -u ⋆ ) 2 . Ignoring the last term, this yields E Ψ t+1 | F t ≤ 1 γ x t -x ⋆ 2 + c(1 + ω) 1 τ + 2µ h * u t -u ⋆ 2 (59) -(2 -γL f )D f (x t , x ⋆ ) ≤ Ψ t -(2 -γL f )D f (x t , x ⋆ ), with c = 1 -2τ µ h * (1+ω)(1+2τ µ h * ) in ( 59). Using classical results on supermartingale convergence (Bertsekas, 2015, Proposition A.4.5) , it follows from (60) that Ψ t converges almost surely to a random variable Ψ ∞ and that ∞ t=0 D f (x t , x ⋆ ) < +∞ almost surely. Hence, D f (x t , x ⋆ ) → 0 almost surely. Moreover, for every T ≥ 0, (2 -γL f ) T t=0 E D f (x t , x ⋆ ) ≤ Ψ 0 -E Ψ T +1 ≤ Ψ 0 and (2 -γL f ) ∞ t=0 E D f (x t , x ⋆ ) ≤ Ψ 0 . Therefore, E[D f (x t , x ⋆ )] → 0; that is, D f (x t , x ⋆ ) → 0 in quadratic mean. 
The Bregman divergence is convex in its first argument, so that for every T ≥ 0,

D_f(x̄^T, x⋆) ≤ (1/(T+1)) Σ_{t=0}^T D_f(x^t, x⋆).

Combining this last inequality with (61) yields (T + 1)(2 − γL_f) E[D_f(x̄^T, x⋆)] ≤ Ψ^0. Now, if µ_{h*} > 0, then c < 1 in (59), and since Ψ^t converges almost surely to Ψ^∞, it must be that E[∥u^t − u⋆∥²] → 0. □

The counterpart of Theorem 2 in the merely convex case is:

Theorem 12. Suppose that g = 0, and that λ_min(KK*) > 0 or µ_{h*} > 0. In RandProx, suppose that 0 < γ < 2/L_f, τ > 0, and γτ((1 − ζ)∥K∥² + ω_ran) ≤ 1. Then there is a unique dual solution u⋆ to (2) and (u^t)_{t∈N} converges to u⋆ in quadratic mean.

Proof of Theorem 12. Considering the proof of Theorem 2, the same arguments as in the proof of Theorem 11 apply, with c in (59) now equal to

c = 1 − (2τµ_{h*} + γτλ_min(KK*)) / ((1 + ω)(1 + 2τµ_{h*})) < 1.

Hence, E[∥u^t − u⋆∥²] → 0. □
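The two Bregman-divergence inequalities used in these proofs, D_f(x, x′) ≥ (1/(2L_f))∥∇f(x) − ∇f(x′)∥² and D_f(x, x′) ≤ ⟨∇f(x) − ∇f(x′), x − x′⟩, can be verified numerically; the sketch below (ours, not from the paper) uses f(x) = Σ_i log(1 + exp(x_i)), a convex function that is L_f-smooth with L_f = 1/4.

```python
import numpy as np

# Check, on random pairs of points, the two inequalities on the Bregman
# divergence of the convex (1/4)-smooth function f(x) = sum_i log(1 + e^{x_i}):
#   D_f(x, x') >= (1 / (2 L_f)) ||grad f(x) - grad f(x')||^2   (cocoercivity)
#   D_f(x, x') <= <grad f(x) - grad f(x'), x - x'>
f = lambda x: np.sum(np.logaddexp(0.0, x))
grad = lambda x: 1.0 / (1.0 + np.exp(-x))   # the sigmoid function
L_f = 0.25

rng = np.random.default_rng(4)
for _ in range(100):
    x, xp = rng.standard_normal(6), rng.standard_normal(6)
    D = f(x) - f(xp) - grad(xp) @ (x - xp)   # Bregman divergence D_f(x, xp)
    g_diff = grad(x) - grad(xp)
    assert D >= (g_diff @ g_diff) / (2 * L_f) - 1e-12
    assert D <= g_diff @ (x - xp) + 1e-12
print("both Bregman inequalities hold on 100 random pairs")
```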



The condition γ < 2/L_f is given for simplicity. Larger values of γ can be used when µ_g > 0, as long as c < 1 in (13).



With this choice of p, the communication complexity, in terms of the expected number of communication rounds to reach ϵ-accuracy, is O( (√(min(L_f, λ)/µ_f) + 1) log(1/ϵ) ).

This approach can be applied to decentralized optimization, as in Kovalev et al. (2020) and Salim et al. (2022a), but with randomized communication; we leave the detailed study of this setting for future work.

Find u ⋆ ∈ arg min u∈U g * (-u) + h * (u) .

Remark 1 (choice of τ). Given γ, the rate c in (13) is smallest when τ is largest. So there seems to be no reason to take γτ((1 − ζ)∥K∥² + ω_ran) < 1, and γτ((1 − ζ)∥K∥² + ω_ran) = 1 should be the best choice in most cases. Thus, one can set τ = 1/(γ((1 − ζ)∥K∥² + ω_ran)).

The communication complexity, in terms of the average number of floats sent by every worker to the master, is O((kκ + d²/k) log(1/ϵ)), since k floats are sent by every worker at every iteration. Thus, by choosing k = ⌈d/√κ⌉, as long as d ≥ √κ, the communication complexity in terms of floats is O(d√κ log(1/ϵ)); this is the same as that of Scaffnew with γ = 1/L and p = 1/√κ, but RandProx-FL with rand-k compressors removes the need to communicate full d-dimensional vectors periodically.
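The rand-k compressor used here is easy to implement and to check empirically; the sketch below (ours, not from the paper) verifies that it is unbiased and that its relative variance matches the constant ω = d/k − 1 invoked in the analysis.

```python
import numpy as np

# rand-k compression of a d-dimensional vector: keep k coordinates chosen
# uniformly at random, scale them by d/k, zero the rest. It is unbiased,
# and E||R(u) - u||^2 = omega * ||u||^2 with omega = d/k - 1.
def rand_k(u, k, rng):
    d = u.size
    out = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * u[idx]
    return out

rng = np.random.default_rng(5)
d, k = 10, 4
u = rng.standard_normal(d)
omega = d / k - 1                                   # = 1.5 here

samples = np.stack([rand_k(u, k, rng) for _ in range(40000)])
mean_err = np.max(np.abs(samples.mean(axis=0) - u))            # ~ 0: unbiased
var_ratio = np.mean(np.sum((samples - u) ** 2, axis=1)) / np.sum(u ** 2)
print(mean_err, var_ratio)   # var_ratio should be close to omega
```

The same computation, applied blockwise instead of coordinatewise, gives the rand-k sampling operator of Section A.2, whose shared randomness across blocks is what improves ω_ran below the naive bound ∥K∥²ω.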

, it follows from (35) that Ψ^t → 0 almost surely. Almost sure convergence of x^t and u^t follows. Finally, by Lipschitz continuity of ∇f, K*, and prox_{γg}, we can upper bound ∥x̂^t − x⋆∥² by a linear combination of ∥x^t − x⋆∥² and ∥u^t − u⋆∥². It follows that E[∥x̂^t − x⋆∥²] → 0 linearly with the same rate c, and that x̂^t → x⋆ almost surely as well. □

. . . do
  x̂^t := prox_{γg}(x^t − γu^t)
  d^t := R^t( x̂^t − prox_{γ(1+ω)h}(x̂^t + γ(1+ω)u^t) )
  x^{t+1} := x̂^t − (1/(1+ω)) d^t
  u^{t+1} := u^t + (1/(γ(1+ω)²)) d^t
end for

Also, (x^t)_{t∈N} and (x̂^t)_{t∈N} both converge to x⋆, and (u^t)_{t∈N} converges to u⋆, almost surely.

ACKNOWLEDGMENTS

The work of P. Richtárik was partially supported by the KAUST Baseline Research Fund Scheme and by the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence.


Let s^{t+1} ∈ ∂h*(û^{t+1}) be such that û^{t+1} = u^t + τK x̂^t − τs^{t+1}; s^{t+1} exists and is unique. We also define s⋆ := Kx⋆; we have s⋆ ∈ ∂h*(u⋆).

Algorithm 15 DY algorithm (Davis & Yin, 2017)
input: initial points x^0 ∈ X, u^0 ∈ X; stepsize γ > 0
for t = 0, 1, . . . do
  x̂^t := prox_{γg}(x^t − γ∇f(x^t) − γu^t)
  x^{t+1} := prox_{γh}(x̂^t + γu^t)
  u^{t+1} := u^t + (1/γ)(x̂^t − x^{t+1})
end for

Algorithm 16 RandProx-DY [new]
input: initial points x^0 ∈ X, u^0 ∈ X; stepsize γ > 0; ω ≥ 0
for t = 0, 1, . . .

After the particular case g = 0 discussed in Section 4.1 and the particular case f = 0 discussed in Section F, we discuss in this section the third particular case K = Id in (1) and (2). The primal problem becomes:

Find x⋆ ∈ arg min_{x∈X} ( f(x) + g(x) + h(x) ),   (54)

and the dual problem becomes:

Find u⋆ ∈ arg min_{u∈X} ( (f + g)*(−u) + h*(u) ).   (55)

When K = Id, the PDDY algorithm with τ = 1/γ reverts to the Davis–Yin (DY) algorithm (Davis & Yin, 2017), shown above. Therefore, in that case, with ω_ran = ω, ζ = 0 and τ = 1/(γ(1+ω)), RandProx can be rewritten as RandProx-DY, shown above too. When g = 0, RandProx-DY reverts to RandProx-FB, and when f = 0, RandProx-DY reverts to RandProx-ADMM; in other words, RandProx-DY generalizes RandProx-FB and RandProx-ADMM into a single algorithm. Theorem 1 yields:

Theorem 9. Suppose that µ_f > 0 or µ_g > 0, and that µ_{h*} > 0. In RandProx-DY, suppose that 0 < γ < 2/L_f. For every t ≥ 0, define the Lyapunov function Ψ^t, where x⋆ and u⋆ are the unique solutions to (54) and (55), respectively. Then RandProx-DY converges linearly: for every t ≥ 0, E[Ψ^t] ≤ c^t Ψ^0. Also, (x^t)_{t∈N} and (x̂^t)_{t∈N} both converge to x⋆, and (u^t)_{t∈N} converges to u⋆, almost surely.

We note that in Theorem 9, µ_{h*} > 0 is required. It is only in the case g = 0, when RandProx-DY reverts to RandProx-FB, that one can apply Theorem 3, which does not require strong convexity of h*.

H PROOF OF THEOREM 11

Proof of Theorem 11 We have, for every (x, x ′ ) ∈ X 2 ,

