DISTRIBUTED EXTRA-GRADIENT WITH OPTIMAL COMPLEXITY AND COMMUNICATION GUARANTEES

Abstract

We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local stochastic dual vectors. This setting covers a broad range of important problems, from distributed convex minimization to min-max problems and games. Extra-gradient, the de facto algorithm for monotone VI problems, was not designed to be communication-efficient. To this end, we propose quantized generalized extra-gradient (Q-GenX), an unbiased and adaptive compression method tailored to solving VIs. We provide an adaptive step-size rule that adapts to the respective noise profile at hand and achieves a fast rate of O(1/T) under relative noise and an order-optimal O(1/√T) under absolute noise, and we show that distributed training accelerates convergence. Finally, we validate our theoretical results with real-world experiments, training generative adversarial networks on multiple GPUs.

1. INTRODUCTION

The surge of deep learning across tasks beyond image classification has triggered a vast literature of optimization paradigms that transcend standard empirical risk minimization. For example, training generative adversarial networks (GANs) amounts to solving a more complicated zero-sum game between a generator and a discriminator (Goodfellow et al., 2020). This can become even more complex when the generator and the discriminator do not have completely antithetical objectives and, e.g., constitute a more general game-theoretic setup. A powerful unifying framework that includes these important problems as special cases is the monotone variational inequality (VI). Formally, given a monotone operator A : R^d → R^d, i.e., ⟨A(x) − A(x′), x − x′⟩ ≥ 0 for all x, x′ ∈ R^d, our goal is to find some x* ∈ R^d such that

⟨A(x*), x − x*⟩ ≥ 0 for all x ∈ R^d. (VI)

Several practical problems can be formulated as a (VI) problem, including those with convex-like structure, e.g., convex minimization, saddle-point problems, and games (Facchinei & Pang, 2003; Bauschke & Combettes, 2017; Antonakopoulos et al., 2021), with applications such as auction theory (Syrgkanis et al., 2015), multi-agent and robust reinforcement learning (Pinto et al., 2017), adversarially robust learning (Schmidt et al., 2018), and GANs. For various tasks, it is widely known that employing deep neural networks (DNNs) along with massive datasets leads to significant improvements in learning (Shalev-Shwartz & Ben-David, 2014). However, such DNNs can no longer be trained on a single machine. One common solution is to train on multi-GPU systems (Alistarh et al., 2017). Furthermore, in federated learning (FL), multiple clients, e.g., a few hospitals or several cellphones, learn a model collaboratively without sharing local data due to privacy risks (Kairouz et al., 2021).
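To make the reduction to (VI) concrete, the following minimal sketch (with a hypothetical quadratic objective and a hypothetical bilinear zero-sum game as examples) numerically spot-checks that the induced operators are monotone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex minimization as a VI: A(x) = ∇f(x) for f(x) = 0.5 x^T P x with P PSD.
P = rng.standard_normal((4, 4))
P = P @ P.T  # positive semidefinite by construction
def grad_op(x):
    return P @ x

# Bilinear zero-sum game min_x max_y x^T B y: A(x, y) = (B y, -B^T x).
B = rng.standard_normal((4, 4))
def saddle_op(z):
    x, y = z[:4], z[4:]
    return np.concatenate([B @ y, -B.T @ x])

def is_monotone(A, dim, trials=1000):
    """Spot-check <A(x) - A(x'), x - x'> >= 0 on random pairs."""
    for _ in range(trials):
        x, xp = rng.standard_normal(dim), rng.standard_normal(dim)
        if (A(x) - A(xp)) @ (x - xp) < -1e-9:
            return False
    return True

print(is_monotone(grad_op, 4), is_monotone(saddle_op, 8))  # True True
```

For the gradient operator, ⟨A(x) − A(x′), x − x′⟩ = (x − x′)ᵀP(x − x′) ≥ 0; for the bilinear game it is exactly zero, which is why min-max problems sit at the monotone boundary.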
To minimize a single empirical risk, SGD is the most popular algorithm due to its flexibility for parallel implementations and excellent generalization performance (Alistarh et al., 2017; Wilson et al., 2017). Data-parallel SGD has delivered tremendous success in terms of scalability (Zinkevich et al., 2010; Bekkerman et al., 2011; Recht et al., 2011; Dean et al., 2012; Coates et al., 2013; Chilimbi et al., 2014; Li et al., 2014; Duchi et al., 2015; Xing et al., 2015; Zhang et al., 2015; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021) and reduces computational costs significantly. However, the communication costs of broadcasting huge stochastic gradients are the main performance bottleneck in large-scale settings (Strom, 2015; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021). Several methods have been proposed to accelerate training for classical empirical risk minimization, such as gradient (or model update) compression, gradient sparsification, weight quantization/sparsification, and reducing the frequency of communication through local methods (Dean et al., 2012; Seide et al., 2014; Sa et al., 2015; Gupta et al., 2015; Abadi et al., 2016; Alistarh et al., 2017; Wen et al., 2017; Zhou et al., 2018; Bernstein et al., 2018; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021). In particular, unbiased gradient quantization is appealing because it enjoys strong theoretical guarantees while providing communication efficiency on the fly, i.e., convergence under the same hyperparameters tuned for the uncompressed variants, along with substantial savings in communication costs (Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021).
Unlike full-precision data-parallel SGD, where each processor is required to broadcast its local gradient in full precision, i.e., transmit and receive huge full-precision vectors at each iteration, unbiased quantization requires each processor to transmit only a few communication bits per iteration for each component of the stochastic gradient. In this work, we propose communication-efficient variants of a general first-order method that achieves the optimal rate of convergence with improved guarantees on the number of communication bits for monotone VIs, and we show that distributed training accelerates convergence. We employ an adaptive step-size and both adaptive and non-adaptive variants of unbiased quantization schemes tailored to VIs. There are three major challenges in tackling this problem: 1) how can adaptive variants of unbiased quantization schemes be modified to solve general VIs; 2) can we achieve the optimal rate of convergence without knowing the noise profile and show the benefits of distributed training; 3) can we validate improvements in scalability without compromising accuracy in large-scale settings? We address these challenges and answer all questions in the affirmative.

1.1. SUMMARY OF CONTRIBUTIONS

• We propose the quantized generalized extra-gradient (Q-GenX) family of algorithms, which employs unbiased compression methods tailored to general VI solvers. Our framework unifies distributed and communication-efficient variants of stochastic dual averaging, stochastic dual extrapolation, and stochastic optimistic dual averaging.
• Without prior knowledge of the noise profile, we provide an adaptive step-size rule for Q-GenX that achieves a fast rate of O(1/T) under relative noise and an order-optimal O(1/√T) in the absolute noise case, and we show that increasing the number of processors accelerates convergence for general monotone VIs.
• We validate our theoretical results by providing real-world experiments and training generative adversarial networks on multiple GPUs.

1.2. RELATED WORK

We overview a summary of related work; the complete related work is provided in Appendix B. Adaptive quantization has been used for speech communication (Cummiskey et al., 1973). In machine learning, adapting quantization levels (Faghri et al., 2020), adapting the communication frequency in local SGD (Wang & Joshi, 2019), adapting the number of quantization levels (communication budget) over the course of training (Guo et al., 2020; Agarwal et al., 2021), adapting a gradient sparsification scheme over the course of training (Khirirat et al., 2021), and adapting compression parameters across model layers and training iterations (Markov et al., 2022) have all been proposed for minimizing a single empirical risk. In this paper, we propose a communication-efficient generalized extra-gradient family of algorithms with adaptive quantization and an adaptive step-size for the general (VI) problem. In the VI literature, the benchmark method is EG, proposed by Korpelevich (1976), along with its variants including (Nemirovski, 2004; Nesterov, 2007). Rate interpolation between different noise profiles under an adaptive step-size has been explored by Antonakopoulos et al. (2021); however, their results are limited to centralized, single-GPU settings. Beznosikov et al. (2021) and Kovalev et al. (2022) have proposed communication-efficient algorithms for VIs with finite-sum structure and variance reduction in centralized settings and for (strongly) monotone VIs in decentralized settings, respectively. Unlike (Beznosikov et al., 2021; Kovalev et al., 2022), we achieve fast and order-optimal rates with an adaptive step-size and adaptive compression without requiring variance reduction or strong monotonicity, and we improve the variance and code-length bounds for unbiased and adaptive compression.

2. PROBLEM SETUP

Our objective throughout this paper is to solve (VI) with A : R^d → R^d being a monotone operator. Moreover, in order to avoid trivialities, we make the following mild assumption:

Assumption 1 (Existence). The set X* := {x* ∈ R^d : x* solves (VI)} is non-empty.

Let C denote a non-empty compact test domain. A popular performance measure for the evaluation of a candidate solution x̂ for (VI) is the so-called restricted gap function defined as:

Gap_C(x̂) = sup_{x∈C} ⟨A(x), x̂ − x⟩. (Gap)

Gap is used to measure x̂'s performance mainly because it characterizes the solutions of (VI) via its zeros. Mathematically speaking, we have the following proposition:

Proposition 1 (Nesterov 2009). Let C be a non-empty convex subset of R^d. Then the following holds: 1. Gap_C(x̂) ≥ 0 for all x̂ ∈ C; 2. If Gap_C(x̂) = 0 and C contains a neighbourhood of x̂, then x̂ is a solution of (VI).

Proposition 1 is an extension of an earlier characterization shown by Nesterov (2007); we refer the reader to (Antonakopoulos et al., 2019; Nesterov, 2009) and references therein. From an algorithmic perspective, we primarily consider the generic family of iterative methods that have access to a stochastic first-order oracle, i.e., a black-box feedback mechanism (Nesterov, 2004). The respective iterative algorithm can call the oracle over and over at a (possibly) random sequence of points x_0, x_1, . . . When called at x, the oracle draws an i.i.d. sample ω from a (complete) probability space (Ω, F, P) and returns a stochastic dual vector g(x; ω) given by

g(x; ω) = A(x) + U(x; ω) (2.1)

where U(x; ω) denotes the measurement error or noise. We consider two important noise models, i.e., the absolute and relative noise models, formally described in the following assumptions:

Assumption 2 (Absolute noise). Let x ∈ R^d and ω ∼ P.
The oracle g(x; ω) enjoys the following properties: 1) Almost sure boundedness: there exists some M > 0 such that ∥g(x; ω)∥_* ≤ M a.s.; 2) Unbiasedness: E[g(x; ω)] = A(x); 3) Bounded absolute variance: E[∥U(x; ω)∥²_*] ≤ σ². The conditions in Assumption 2 are mild and hold for standard oracles, in particular in the context of adaptive algorithms (Kavis et al., 2019; Levy et al., 2018; Bach & Levy, 2019; Antonakopoulos & Mertikopoulos, 2021), and typically guarantee a convergence rate of O(1/√T) (Nemirovski et al., 2009; Juditsky et al., 2011; Antonakopoulos et al., 2021). Alternatively, one may consider the relative noise model following (Polyak, 1987):

Assumption 3 (Relative noise). The oracle g(x; ω) satisfies: 1) Almost sure boundedness: there exists some M > 0 such that ∥g(x; ω)∥_* ≤ M a.s.; 2) Unbiasedness: E[g(x; ω)] = A(x); 3) Bounded relative variance: there exists some c > 0 such that E[∥U(x; ω)∥²_*] ≤ c∥A(x)∥²_*.

While Assumption 2 is enough for obtaining the typical O(1/√T) rate in stochastic settings, Assumption 3 may allow us to recover the well-known order-optimal rate of O(1/T) of deterministic settings. Intuitively, this improvement is explained by the fact that the noisy error measurements vanish as we approach a solution of (VI). In Appendix J, we highlight random coordinate descent and random player updating as popular examples motivating Assumption 3.

[Algorithm 1 (Q-GenX). Loops are executed in parallel on the processors. At the update steps t ∈ U, each processor computes sufficient statistics of a parametric distribution to estimate the distribution of dual vectors. Input: local data, parameter vectors (local copies) X_t, Y_t, learning rates {γ_t}, and the set of update steps U; the main loop runs for t = 1 to T and, if t ∈ U, the quantization levels are re-estimated across the K processors.]
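The two noise models can be sketched as follows; this is an illustrative construction (the operator A(x) = x and all constants are assumptions, not the paper's setup), showing that relative noise vanishes near the solution while absolute noise does not:

```python
import numpy as np

rng = np.random.default_rng(1)

def A(x):
    # A simple strongly monotone operator with solution x* = 0 (illustrative).
    return x

def oracle_absolute(x, sigma=0.1):
    # Assumption 2: unbiased, with error norm sigma no matter where we query.
    u = rng.standard_normal(x.shape)
    return A(x) + sigma * u / np.linalg.norm(u)

def oracle_relative(x, c=0.5):
    # Assumption 3: unbiased, with error norm sqrt(c)*||A(x)||, so the
    # noise vanishes as x approaches the solution.
    u = rng.standard_normal(x.shape)
    return A(x) + np.sqrt(c) * np.linalg.norm(A(x)) * u / np.linalg.norm(u)

x_near = 1e-6 * np.ones(8)  # a point very close to the solution x* = 0
rel_err = np.linalg.norm(oracle_relative(x_near) - A(x_near))
abs_err = np.linalg.norm(oracle_absolute(x_near) - A(x_near))
print(rel_err)  # on the order of 1e-6: shrinks with ||A(x)||
print(abs_err)  # 0.1: unchanged near the solution
```

Both oracles are unbiased because the direction u/∥u∥ is symmetric around zero; only the error magnitude differs between the two assumptions.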

3.1. SYSTEM MODEL AND PROPOSED ALGORITHM

We now describe the algorithmic framework unifying communication-efficient variants of the generalized extra-gradient (EG) family of algorithms. In particular, we consider a synchronous and distributed system with K processors, along the lines of, e.g., data-parallel SGD (Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021). These processors can be cellphones and hospitals in FL or GPU resources in a data center. In multi-GPU systems, processors partition a large dataset among themselves such that each processor keeps only a local copy of the current parameter vector and has access to independent and private stochastic dual vectors. At each iteration, each processor receives stochastic dual vectors from all other processors and aggregates them. To accelerate training, stochastic dual vectors are first compressed by each processor before broadcasting to the other peers and then decompressed before each aggregation step. We focus on unbiased compression where, in expectation, the output of the decompression of a compressed vector is the same as the original uncompressed vector. We use Q_{ℓ_t} to denote a random and adaptive quantization function whose quantization levels ℓ_t may change over time, and V_{k,t} to denote the original (uncompressed) stochastic dual vector computed by processor k at time t. Using multiple processors reduces computational costs significantly. However, the communication costs of broadcasting huge stochastic dual vectors are the main performance bottleneck in practice (Alistarh et al., 2017). In order to reduce communication costs and improve scalability, each processor receives and aggregates the compressed stochastic dual vectors from all peers to obtain the updated parameter vector. Let ℓ_t = (ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}) denote the sequence of s quantization levels optimized at iteration t, with 0 = ℓ_0 < ℓ^t_1 < · · · < ℓ^t_s < ℓ_{s+1} = 1.
We now define the quantization function Q_{ℓ_t}:

Definition 1 (Random quantization function). Let s ∈ Z_+ denote the number of quantization levels. Let u ∈ [0, 1] and let ℓ_t = (ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}) denote the sequence of s quantization levels at iteration t with 0 = ℓ_0 < ℓ^t_1 < · · · < ℓ^t_s < ℓ_{s+1} = 1. Let τ(u) denote the index of a level such that ℓ^t_{τ(u)} ≤ u < ℓ^t_{τ(u)+1}, and let ξ_t(u) = (u − ℓ^t_{τ(u)})/(ℓ^t_{τ(u)+1} − ℓ^t_{τ(u)}) be the relative distance of u to level τ(u) + 1. We define the random function q_{ℓ_t}(u) : [0, 1] → {ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}} such that q_{ℓ_t}(u) = ℓ^t_{τ(u)} with probability 1 − ξ_t(u) and q_{ℓ_t}(u) = ℓ^t_{τ(u)+1} with probability ξ_t(u). Let q ∈ Z_+ and v ∈ R^d. We define the random quantization of v as

Q_{ℓ_t}(v) := ∥v∥_q · s ⊙ [q_{ℓ_t}(u_1), . . . , q_{ℓ_t}(u_d)]^⊤

where ⊙ denotes the element-wise (Hadamard) product, s = [sgn(v_1), . . . , sgn(v_d)]^⊤ is the sign vector, and u_i = |v_i|/∥v∥_q.

Let V̂_{k,t} = Q_{ℓ_t}(V_{k,t}) and V̂_{k,t+1/2} = Q_{ℓ_t}(V_{k,t+1/2}) denote the unbiased and quantized stochastic dual vectors for k ∈ [K] and t ∈ [T]. We propose the quantized generalized extra-gradient (Q-GenX) family of algorithms with the update rule:

X_{t+1/2} = X_t − (γ_t/K) Σ_{k=1}^K V̂_{k,t}
Y_{t+1} = Y_t − (1/K) Σ_{k=1}^K V̂_{k,t+1/2}
X_{t+1} = γ_{t+1} Y_{t+1} (Q-GenX)

where (V̂_{k,0}, V̂_{k,1}, . . .) and (V̂_{k,1/2}, V̂_{k,3/2}, . . .) are the sequences of stochastic dual vectors computed and quantized by processor k ∈ [K]. Provided that V_{k,t} and V_{k,t+1/2} are stochastic dual vectors for k ∈ [K], the averages (1/K) Σ_{k=1}^K V̂_{k,t} and (1/K) Σ_{k=1}^K V̂_{k,t+1/2} remain unbiased stochastic dual vectors. Q-GenX is described in Algorithm 1. In general, the decoded stochastic dual vectors are likely to differ from the original locally computed stochastic dual vectors. A particularly appealing feature of the (Q-GenX) formulation is that it enables us to unify communication-efficient and distributed variants of a wide range of popular first-order methods for solving VIs.
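The stochastic rounding of Definition 1 admits a short sketch (level values and dimensions here are hypothetical); averaging many quantizations of the same vector recovers it, illustrating unbiasedness:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, levels, q=2):
    """Unbiased random quantization in the spirit of Definition 1 (a sketch).

    levels: increasing sequence with levels[0] = 0 and levels[-1] = 1.
    """
    levels = np.asarray(levels, dtype=float)
    norm = np.linalg.norm(v, ord=q)
    if norm == 0:
        return np.zeros_like(v)
    u = np.abs(v) / norm                      # normalized coordinates in [0, 1]
    idx = np.searchsorted(levels, u, side="right") - 1
    idx = np.minimum(idx, len(levels) - 2)    # keep u = 1 in the last cell
    lo, hi = levels[idx], levels[idx + 1]
    xi = (u - lo) / (hi - lo)                 # relative distance to upper level
    round_up = rng.random(u.shape) < xi       # stochastic rounding
    qu = np.where(round_up, hi, lo)
    return norm * np.sign(v) * qu

# Unbiasedness check: averaging many quantizations recovers v,
# since E[q(u)] = lo*(1 - xi) + hi*xi = u for every coordinate.
v = rng.standard_normal(8)
avg = np.mean([quantize(v, [0.0, 0.25, 0.5, 1.0]) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - v)))  # small
```

Each output coordinate lives on the grid {±∥v∥_q · ℓ_j}, so only a level index and a sign need to be communicated per coordinate, plus the norm once per vector.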
In particular, one may observe that under different choices of V̂_{k,t} and V̂_{k,t+1/2}, communication-efficient variants of stochastic dual averaging (Nesterov, 2009), stochastic dual extrapolation (Nesterov, 2007), and stochastic optimistic dual averaging (Popov, 1980; Rakhlin & Sridharan, 2013; Hsieh et al., 2021; 2022) in multi-GPU settings are special cases of Q-GenX:

Example 3.1. Distributed stochastic dual averaging: Consider the case where V̂_{k,t} ≡ 0 and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}). This setting yields X_{t+1/2} = X_t and hence ĝ_{k,t+1/2} = ĝ_{k,t} = V̂_{k,t+1/2}. Therefore, Q-GenX reduces to the communication-efficient stochastic dual averaging scheme:

Y_{t+1} = Y_t − (1/K) Σ_{k=1}^K ĝ_{k,t}
X_{t+1} = γ_{t+1} Y_{t+1} (Quantized DA)

Example 3.2. Distributed stochastic dual extrapolation: Consider the case where V̂_{k,t} ≡ ĝ_{k,t} = Q_{ℓ_t}(A(X_t) + U_{k,t}) and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}) are noisy oracle queries at X_t and X_{t+1/2}, respectively. Then Q-GenX provides the communication-efficient variant of Nesterov's stochastic dual extrapolation method (Nesterov, 2007):

X_{t+1/2} = X_t − (γ_t/K) Σ_{k=1}^K ĝ_{k,t}
Y_{t+1} = Y_t − (1/K) Σ_{k=1}^K ĝ_{k,t+1/2}
X_{t+1} = γ_{t+1} Y_{t+1} (Quantized DE)

Example 3.3. Distributed stochastic optimistic dual averaging: Consider the case where V̂_{k,t} ≡ ĝ_{k,t−1/2} = Q_{ℓ_t}(A(X_{t−1/2}) + U_{k,t−1/2}) and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}) are the noisy oracle feedback at X_{t−1/2} and X_{t+1/2}, respectively. We then obtain the communication-efficient stochastic optimistic dual averaging method:

X_{t+1/2} = X_t − (γ_t/K) Σ_{k=1}^K ĝ_{k,t−1/2}
Y_{t+1} = Y_t − (1/K) Σ_{k=1}^K ĝ_{k,t+1/2}
X_{t+1} = γ_{t+1} Y_{t+1} (Quantized OptDA)

This general formulation of Q-GenX allows us to bring these variants under one umbrella and provide theoretical guarantees for all of them in a unified manner.
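The dual-extrapolation template of Example 3.2 can be sketched in a few lines. This is a toy single-machine simulation under stated assumptions: the operator A(z) = z stands in for a strongly monotone problem with solution z* = 0, a noisy unbiased vector stands in for the quantized oracle, and a constant step-size replaces the paper's adaptive rule:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, sigma, gamma = 4, 3, 2000, 0.05, 0.1

def A(z):
    # Strongly monotone stand-in operator (gradient of 0.5*||z||^2); z* = 0.
    return z

def quantized_oracle(z):
    # Stand-in for Q(A(z) + U): any unbiased dual vector with bounded variance.
    return A(z) + sigma * rng.standard_normal(d)

# Quantized dual extrapolation (Example 3.2), averaging K oracle calls per
# step to mimic aggregation across K processors.
Y = rng.standard_normal(d) * 10.0
X = gamma * Y
for t in range(T):
    g = np.mean([quantized_oracle(X) for _ in range(K)], axis=0)
    X_half = X - gamma * g                     # extrapolation step
    g_half = np.mean([quantized_oracle(X_half) for _ in range(K)], axis=0)
    Y = Y - g_half                             # dual averaging update
    X = gamma * Y

print(np.linalg.norm(X) < 0.5)  # True: the iterate settles near z* = 0
```

The averaging over K oracle calls divides the noise variance by K, which is the mechanism behind the speedup from distributed training discussed in Section 4.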

3.2. ENCODING

To further reduce communication costs, we can apply information-theoretic coding schemes on top of quantization. Let q ∈ Z_+. We first note that a vector v ∈ R^d can be uniquely represented by a tuple (∥v∥_q, s, u), where ∥v∥_q is the L_q norm of v, s := [sgn(v_1), . . . , sgn(v_d)]^⊤ consists of the signs of the coordinates v_i, and u := [u_1, . . . , u_d]^⊤ with u_i = |v_i|/∥v∥_q are the normalized coordinates. Note that 0 ≤ u_i ≤ 1 for all i ∈ [d]. The overall encoding, i.e., the composition of coding and quantization, CODE ∘ Q(∥v∥_q, s, q_{ℓ_t}) : R_+ × {±1}^d × {ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}}^d → {0, 1}* in Algorithm 1 uses a standard floating-point encoding with C_b bits to represent the positive scalar ∥v∥_q, encodes the sign of each coordinate with one bit, and finally applies an integer encoding scheme Ψ : {ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}} → {0, 1}* to efficiently encode each quantized and normalized coordinate q_{ℓ_t}(u_i) with the minimum expected code-length. The overall decoding DEQ ∘ CODE : {0, 1}* → R^d first reads C_b bits to reconstruct ∥v∥_q. It then applies Ψ^{−1} : {0, 1}* → {ℓ_0, ℓ^t_1, . . . , ℓ^t_s, ℓ_{s+1}} to reconstruct the normalized coordinates. The encoding/decoding details are provided in Appendix K.
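To see why entropy coding of the level indices pays off, the following sketch (with a hypothetical level sequence; an ideal entropy code stands in for the paper's integer code Ψ) estimates the per-coordinate cost from the empirical index distribution:

```python
import math
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
levels = np.array([0.0, 0.05, 0.2, 1.0])  # hypothetical level sequence

def quantize_to_tuple(v, q=2):
    # Represent v as (norm, signs, stochastically rounded level indices).
    norm = np.linalg.norm(v, ord=q)
    u = np.abs(v) / norm
    idx = np.minimum(np.searchsorted(levels, u, side="right") - 1,
                     len(levels) - 2)
    xi = (u - levels[idx]) / (levels[idx + 1] - levels[idx])
    idx = idx + (rng.random(u.shape) < xi)
    return norm, np.sign(v), idx

norm, signs, idx = quantize_to_tuple(rng.standard_normal(10000))

# An ideal entropy code spends about H(p) bits per level index; with one sign
# bit per coordinate and C_b bits (shared once) for the norm, the cost is
# roughly 1 + H(p) bits per coordinate instead of 32 for full precision.
counts, n = Counter(idx.tolist()), idx.size
entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
print(1 + entropy < 32)  # True: far fewer bits per coordinate than FP32
```

The empirical entropy here is only a lower-bound proxy for the code-length of Ψ; the exact expected code-length bound is the subject of Theorem 2.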

3.3. ADAPTIVE QUANTIZATION

Instead of using a heuristically chosen sequence of quantization levels, adaptive quantization estimates the distribution of the uncompressed original vectors, i.e., the dual vectors, by computing sufficient statistics of a parametric distribution, optimizes the quantization levels to minimize the quantization error, and updates those levels adaptively throughout the course of training as the distribution changes. Let (Ω_Q, F_Q, P_Q) denote a complete probability space and let q_{ℓ_t} ∼ P_Q represent the d variables sampled independently for random quantization in Definition 1. Let v ∈ R^d denote a stochastic dual vector to be quantized. Given v, we measure the quantization error by the variance of vector quantization, which is the trace of the covariance matrix:

E_{q_{ℓ_t}}[∥Q_{ℓ_t}(v) − v∥²_2] = ∥v∥²_q Σ_{i=1}^d σ²_Q(u_i; ℓ_t) (3.1)

where u_i = |v_i|/∥v∥_q and σ²_Q(u; ℓ_t) = E_{q_{ℓ_t}}[(q_{ℓ_t}(u) − u)²] = (ℓ^t_{τ(u)+1} − u)(u − ℓ^t_{τ(u)}) is the variance for a normalized coordinate u. We optimize ℓ_t by minimizing the quantization variance:

min_{ℓ_t∈L} E_ω E_{q_{ℓ_t}}[∥Q_{ℓ_t}(g(x_t; ω)) − A(x_t)∥²_*]

where L = {ℓ : ℓ_j ≤ ℓ_{j+1} ∀j, ℓ_0 = 0, ℓ_{s+1} = 1} denotes the set of feasible solutions. Since random quantization and random samples are statistically independent, we can solve the following equivalent problem:

min_{ℓ_t∈L} E_ω E_{q_{ℓ_t}}[∥Q_{ℓ_t}(g(x_t; ω)) − g(x_t; ω)∥²_2] (MinVar)

To solve (MinVar), we first sample J stochastic dual vectors {g(x_t; ω_1), . . . , g(x_t; ω_J)}. Let F_j(u) denote the marginal cumulative distribution function (CDF) of the normalized coordinates conditioned on observing ∥g(x_t; ω_j)∥_q. By the law of total expectation, (MinVar) can be approximated by:

min_{ℓ_t∈L} Σ_{j=1}^J ∥g(x_t; ω_j)∥²_q Σ_{i=0}^s ∫_{ℓ_i}^{ℓ_{i+1}} σ²_Q(u; ℓ_t) dF_j(u) ≡ min_{ℓ_t∈L} Σ_{i=0}^s ∫_{ℓ_i}^{ℓ_{i+1}} σ²_Q(u; ℓ_t) dF̄(u) (QAda)

where F̄(u) = Σ_{j=1}^J λ_j F_j(u) is the weighted sum of the conditional CDFs with λ_j = ∥g(x_t; ω_j)∥²_q / Σ_{j'=1}^J ∥g(x_t; ω_{j'})∥²_q.
Finally, we solve (QAda) efficiently either by updating the levels one at a time or by gradient descent, along the lines of (Faghri et al., 2020).
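The gradient-descent route can be sketched as follows. This is a minimal illustration, not the paper's solver: it runs projected gradient descent on the empirical (QAda) objective over the interior levels, using samples of normalized coordinates in place of the CDFs, and keeps the best iterate seen:

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_variance(levels, u):
    # Empirical QAda objective: mean of (hi - u)(u - lo) over samples u.
    idx = np.minimum(np.searchsorted(levels, u, side="right") - 1,
                     len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    return np.mean((hi - u) * (u - lo))

def adapt_levels(u, levels, lr=0.05, steps=300):
    # Projected gradient descent over the interior levels (endpoints fixed
    # at 0 and 1); track the best iterate so the output never gets worse.
    levels = np.array(levels, dtype=float)
    best, best_var = levels.copy(), quant_variance(levels, u)
    for _ in range(steps):
        idx = np.minimum(np.searchsorted(levels, u, side="right") - 1,
                         len(levels) - 2)
        grad = np.zeros_like(levels)
        np.add.at(grad, idx, -(levels[idx + 1] - u))   # d/d(lower level)
        np.add.at(grad, idx + 1, u - levels[idx])      # d/d(upper level)
        levels[1:-1] -= lr * grad[1:-1] / len(u)
        levels[1:-1] = np.clip(np.sort(levels[1:-1]), 1e-6, 1 - 1e-6)
        var = quant_variance(levels, u)
        if var < best_var:
            best, best_var = levels.copy(), var
    return best

# Normalized coordinates of a Gaussian dual vector concentrate near zero,
# so the adapted levels beat uniformly spaced ones.
v = rng.standard_normal(4096)
u = np.abs(v) / np.linalg.norm(v)
uniform = np.linspace(0, 1, 5)
adapted = adapt_levels(u, uniform)
print(quant_variance(adapted, u) <= quant_variance(uniform, u))  # True
```

The per-sample gradients follow directly from σ²_Q(u; ℓ) = (hi − u)(u − lo): the derivative with respect to the lower level is −(hi − u) and with respect to the upper level is (u − lo).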

4. THEORETICAL GUARANTEES

We first establish a variance bound for a general unbiased and normalized compression scheme, together with a bound on the expected number of communication bits needed to encode Q_ℓ(v), i.e., the output of CODE ∘ Q introduced in Section 3.2. Detailed proofs are provided in the appendix. Let ℓ̄ := max_{1≤j≤s} ℓ_{j+1}/ℓ_j and d_th := (2/ℓ_1)^{min{q,2}}. Under general L_q normalization and a sequence of s quantization levels ℓ, we first establish an upper bound on the variance of quantization:

Theorem 1 (Variance bound). Let v ∈ R^d, q ∈ Z_+, and s ∈ Z_+. Let ℓ = (ℓ_0, . . . , ℓ_{s+1}) denote a sequence of s quantization levels as in Definition 1. The quantization of v in Definition 1 is unbiased, i.e., E_{q_ℓ}[Q_ℓ(v)] = v. Furthermore, we have

E_{q_ℓ}[∥Q_ℓ(v) − v∥²_2] ≤ ϵ_Q ∥v∥²_2, (4.1)

where ϵ_Q = (ℓ̄ + ℓ̄^{−1})/4 + (1/4) ℓ²_1 d^{2/min{q,2}} 1{d ≤ d_th} + ℓ_1 d^{1/min{q,2}} 1{d ≥ d_th} − 1/2 and 1 is the indicator function.

Theorem 1 implies that if g(x; ω) is an unbiased stochastic dual vector with a bounded absolute variance σ², then Q_ℓ(g(x; ω)) is an unbiased stochastic dual vector with variance upper bounded by ϵ_Q σ². Note that the dominant term (1/4) ℓ²_1 d^{2/min{q,2}} 1{d ≤ d_th} + ℓ_1 d^{1/min{q,2}} 1{d ≥ d_th} monotonically decreases as the number of quantization levels increases. Unlike (Alistarh et al., 2017, Theorem 3.2) and (Ramezani-Kebrya et al., 2021, Theorem 4), which hold only in the special cases of L_2 normalization with uniform and exponentially spaced levels, respectively, our bound in Theorem 1 holds under general L_q normalization and an arbitrary sequence of quantization levels. For the special case of L_2 normalization in the regime of large d, which is the case in practice, our bound in Theorem 1 is O(ℓ_1 √d), which can be arbitrarily smaller than the O(√d/s) and O(2^{−s} √d) bounds of (Alistarh et al., 2017, Theorem 3.2) and (Ramezani-Kebrya et al., 2021, Theorem 4), respectively, because ℓ_1 is adaptively designed to minimize the variance of quantization. Furthermore, unlike the bound in (Faghri et al., 2020, Theorem 2), which requires an inner optimization problem over an auxiliary parameter p, the bound in Theorem 1 is given in explicit form without an inner optimization problem, and it matches the known Ω(√d) lower bound.

Theorem 2 (Code-length bound). Let p_j denote the probability of occurrence of ℓ_j (the weight of symbol ℓ_j) for j ∈ [s]. Under the setting specified in Theorem 1, the expected number of bits E_ω E_{q_ℓ}[|CODE ∘ Q(Q_ℓ(g(x; ω)); ℓ)|] needed to encode Q_ℓ(g(x; ω)) satisfies

E_ω E_{q_ℓ}[|CODE ∘ Q(Q_ℓ(g(x; ω)); ℓ)|] = O((Σ_{j=1}^s p_j log(1/p_j) − p_0) d). (4.2)

Note that {p_0, . . . , p_{s+1}} can be computed efficiently from the weighted sum of the conditional CDFs in (QAda) and the quantization levels; we provide their expressions in Appendix E. Unlike (Alistarh et al., 2017, Theorem 3.4) and (Ramezani-Kebrya et al., 2021, Theorem 5), which hold only in the special case of L_2 normalization, our bound in Theorem 2 holds under general L_q normalization. For the special case of L_2 normalization with s = √d as in (Alistarh et al., 2017, Theorem 3.4), our bound in Theorem 2 can be arbitrarily smaller than those of (Alistarh et al., 2017, Theorem 3.4) and (Ramezani-Kebrya et al., 2021, Theorem 5), depending on {p_0, . . . , p_{s+1}}. Compared to (Faghri et al., 2020, Theorem 3), our bound in Theorem 2 does not have an additional n_{ℓ_1,d} term. We show that a total expected number of O(Kd/ϵ) bits suffices to reach an ϵ gap, which matches the lower bound developed for convex optimization problems with finite-sum structure (Tsitsiklis & Luo, 1987; Korhonen & Alistarh, 2021). We finally present the convergence guarantees for Q-GenX given access to stochastic dual vectors under both the absolute and relative noise models of Assumptions 2 and 3, respectively.

Theorem 3 (Q-GenX under absolute noise).
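The qualitative content of Theorem 1, that the variance factor ϵ_Q shrinks as levels are added and the smallest level ℓ_1 decreases, can be checked with a quick Monte Carlo sketch (level sequences here are hypothetical, and the measured ratio is an empirical stand-in for ϵ_Q):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, levels):
    # L2-normalized stochastic rounding as in Definition 1 (a sketch).
    levels = np.asarray(levels, dtype=float)
    norm = np.linalg.norm(v)
    u = np.abs(v) / norm
    idx = np.minimum(np.searchsorted(levels, u, side="right") - 1,
                     len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    qu = np.where(rng.random(u.shape) < (u - lo) / (hi - lo), hi, lo)
    return norm * np.sign(v) * qu

def relative_variance(levels, d=1024, trials=200):
    # Monte Carlo estimate of E||Q(v) - v||^2 / ||v||^2, i.e., the role
    # played by epsilon_Q in the bound (4.1).
    ratios = []
    for _ in range(trials):
        v = rng.standard_normal(d)
        qv = quantize(v, levels)
        ratios.append(np.sum((qv - v) ** 2) / np.sum(v ** 2))
    return np.mean(ratios)

coarse = relative_variance([0.0, 0.5, 1.0])
fine = relative_variance([0.0, 0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0])
print(fine < coarse)  # True: more levels, smaller variance factor
```

The finer sequence has a much smaller ℓ_1, which is exactly the quantity the adaptive scheme of Section 3.3 drives toward the mass of the normalized coordinates.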
Let C ⊂ R^d denote a compact neighborhood of a solution of (VI) and let D² := sup_{X∈C} ∥X − X_0∥². Suppose that the problem (VI) and the oracle satisfy Assumptions 1 and 2, respectively, Algorithm 1 is executed for T iterations on K processors with the adaptive step-size

γ_t = K (1 + Σ_{i=1}^{t−1} Σ_{k=1}^K ∥V̂_{k,i} − V̂_{k,i+1/2}∥²)^{−1/2},

and the quantization levels are updated J times, where ℓ_j, with variance bound ϵ_{Q,j} in (4.1) and code-length bound N_{Q,j} in (4.2), is used for T_j iterations with Σ_{j=1}^J T_j = T. Then we have

E[Gap_C((1/T) Σ_{t=1}^T X_{t+1/2})] = O(((Σ_{j=1}^J ϵ_{Q,j} T_j / T) M + σ) D² / √(TK)).

In addition, Algorithm 1 requires each processor to send at most (2/T) Σ_{j=1}^J T_j N_{Q,j} communication bits per iteration in expectation.

We now establish the fast rate of O(1/T) under relative noise and a mild regularity condition:

Assumption 4 (Co-coercivity). Let β > 0. We assume that the operator A is β-cocoercive:

⟨A(x) − A(x′), x − x′⟩ ≥ β ∥A(x) − A(x′)∥²_* for all x, x′ ∈ R^d. (4.3)

For a panoramic view of this class of operators, we refer the reader to (Bauschke & Combettes, 2017).

Remark 1. The order-optimal rate of O(1/√T) under absolute noise does not require co-coercivity. Our adaptive step-size also does not depend on the noise model or co-coercivity. Co-coercivity is required only to achieve the fast rate of O(1/T) in the case of relative noise.

Theorem 4 (Q-GenX under relative noise). Let C ⊂ R^d denote a compact neighborhood of a solution of (VI) and let D² := sup_{X∈C} ∥X − X_0∥². Suppose that the problem (VI) and the oracle satisfy Assumptions 1, 3, and 4, Algorithm 1 is executed for T iterations on K processors with the adaptive step-size

γ_t = K (1 + Σ_{i=1}^{t−1} Σ_{k=1}^K ∥V̂_{k,i} − V̂_{k,i+1/2}∥²)^{−1/2},

and the quantization levels are updated J times, where ℓ_j, with variance bound ϵ_{Q,j} in (4.1) and code-length bound N_{Q,j} in (4.2), is used for T_j iterations with Σ_{j=1}^J T_j = T. Then we have

E[Gap_C((1/T) Σ_{t=1}^T X_{t+1/2})] = O(((c + 1) Σ_{j=1}^J T_j ϵ_{Q,j} / T + c) D² / (KT)).
In addition, Algorithm 1 requires each processor to send at most (2/T) Σ_{j=1}^J T_j N_{Q,j} communication bits per iteration in expectation. To the best of our knowledge, our results in Theorems 3 and 4 are the first to prove that increasing the number of processors accelerates convergence for general monotone VIs under an adaptive step-size. Theorems 3 and 4 show that we can attain a fast rate of O(1/T) and an order-optimal rate of O(1/√T) without prior knowledge of the noise profile, while significantly reducing communication costs. For saddle-point problems, our rates are optimal, which can be verified against the lower bounds in (Beznosikov et al., 2020). For convex problems in deterministic settings, the rate can be improved to O(1/T²) via acceleration. However, it is known that in stochastic and distributed settings our rates cannot be improved even with acceleration; e.g., for convex and smooth problems under the absolute noise model, the lower bound of Ω(1/√(TK)) can be established from (Woodworth et al., 2021, Theorem 1) with the number of gradients per round set to one. In Appendix I, we build on Theorems 3 and 4 to capture the trade-off between the number of iterations to converge and the time per iteration, which includes the total time required to update the model on each GPU.
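The adaptive step-size rule shared by Theorems 3 and 4 can be sketched directly; the dual vectors below are random stand-ins (the real quantities come from the quantized oracle queries of Algorithm 1), and the point is that γ_t shrinks only as fast as the observed disagreement between the two oracle queries accumulates:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 4, 8, 1000

def adaptive_gamma(sum_sq, K):
    # gamma_t = K * (1 + sum_{i<t} sum_k ||V_{k,i} - V_{k,i+1/2}||^2)^(-1/2):
    # no prior knowledge of the noise profile is needed, since the rule only
    # uses observed quantities.
    return K / np.sqrt(1.0 + sum_sq)

sum_sq, gammas = 0.0, []
for t in range(T):
    # Stand-ins for the quantized dual vectors at X_t and X_{t+1/2}.
    V_t = rng.standard_normal((K, d))
    V_half = V_t + 0.1 * rng.standard_normal((K, d))
    gammas.append(adaptive_gamma(sum_sq, K))
    sum_sq += np.sum((V_t - V_half) ** 2)

print(gammas[0])   # K: the step starts large
print(gammas[-1])  # decays as disagreement accumulates
```

Under relative noise the two queries agree increasingly well near a solution, so the accumulated sum stalls and the step-size stays large, which is the mechanism behind the fast O(1/T) rate.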

5. EXPERIMENTAL EVALUATION

In order to validate our theoretical results, we build on the code base of Gidel et al. (2019) and run an instantiation of Q-GenX obtained by combining ExtraAdam with the compression offered by the torch_cgx PyTorch extension of Markov et al. (2022), and train a WGAN-GP (Arjovsky et al., 2017) on CIFAR10 (Krizhevsky, 2009). Since torch_cgx uses OpenMPI (Gabriel et al., 2004) as its communication backend, we use OpenMPI as the communication backend for the full gradient as well, for a fairer comparison. We deliberately do not tune any hyperparameters to fit the larger batch size since, similar to (Gidel et al., 2019), we do not claim to set a new SOTA with these experiments but simply want to show that our theory holds up in practice and can potentially lead to improvements. To this end, we present a basic experiment showing that even for a very small problem size and the heuristic base compression method of cgx, we achieve a noticeable speedup of around 8%. We expect further gains to be achievable for larger problems and more advanced compression methods. Given the differences in settings and the lack of any code, let alone an efficient implementation usable in a real-world setting (i.e., CUDA kernels integrated with networking), it is difficult, if not impossible, to conduct a fair comparison with Beznosikov et al. (2021). More details and a comparison with the QSGDA method of Beznosikov et al. (2022) are provided in Appendix H. We are not aware of any other existing method dealing with the same problem as in our paper without the cost associated with variance reduction. We follow exactly the setup of (Gidel et al., 2019) except that we share an effective batch size of 1024 across 3 nodes (strong scaling) connected via Ethernet, and use Layernorm (Ba et al., 2016) instead of Batchnorm (Ioffe & Szegedy, 2015), since Batchnorm is known to be challenging to work with in distributed training and to interact badly with the WGAN-GP penalty.
The results are shown in Figure 1 (left), which shows the evolution of the FID. We note that we do not scale the learning rate or any other hyperparameters to account for these two changes, so this experiment is not meant to claim SOTA performance, merely to illustrate that: 1. even with the simplest possible unbiased quantization on a relatively small-scale setup, we observe a speedup (about 8%); 2. this speedup does not drastically change the performance. We compare training using the full 32-bit gradient (FP32) to training with gradients compressed to 8 bits (UQ8) and 4 bits (UQ4) using a bucket size of 1024. Figure 1 (middle and right) shows a finer-grained breakdown of the time used for backpropagation (BP), where the network activity takes place. GenBP, DiscBP, and PenBP refer to the backpropagation for the generator, the discriminator, and the calculation of the gradient penalty, respectively; Total refers to the sum of these times.

6. CONCLUSIONS

We have considered monotone VIs in a synchronous and multi-GPU setting where multiple processors compute independent and private stochastic dual vectors in parallel. We proposed Q-GenX, which employs unbiased and adaptive compression methods tailored to a generic unifying framework for solving VIs. Without knowing the noise profile in advance, we have obtained an adaptive step-size rule that achieves a fast rate of O(1/T) under relative noise and an order-optimal O(1/√T) in the absolute noise case, along with improved guarantees on the expected number of communication bits. Our results show that increasing the number of processors accelerates convergence. Developing new VI solvers for asynchronous settings and establishing convergence guarantees under a relaxed co-coercivity assumption on the operator are interesting problems left for future work. Expected co-coercivity has been used as a more relaxed noise model (Loizou et al., 2021); studying monotone VIs under expected co-coercivity and adaptive step-sizes is an interesting direction for the future.

A APPENDIX

Notation. We use E[•], ∥ • ∥, ∥ • ∥_0, and ∥ • ∥_* to denote the expectation operator, the Euclidean norm, the number of nonzero elements of a vector, and the dual norm, respectively. We use | • | to denote the length of a binary string, the length of a vector, and the cardinality of a set. We use lower-case bold letters to denote vectors. Sets are typeset in a calligraphic font. The base-2 logarithm is denoted by log, and the set of binary strings is denoted by {0, 1}*. We use [n] to denote {1, . . . , n} for an integer n.

Content of the appendix. The appendix is organized as follows:
• Complete related work is discussed in Appendix B.
• Special cases of Q-GenX are provided in Appendix C.
• Theorem 1 (variance bound) is proved in Appendix D.
• Theorem 2 (code-length bound) is proved in Appendix E.
• Theorem 3 (Q-GenX under absolute noise) is proved in Appendix F.
• Theorem 4 (Q-GenX under relative noise) is proved in Appendix G.
• Additional experimental details are included in Appendix H.
• The trade-off between the number of iterations and the time per iteration is provided in Appendix I.
• Popular examples motivating Assumption 3 are provided in Appendix J.
• The encoding/decoding details are provided in Appendix K.

B FURTHER RELATED WORK

Unbiased compression. Seide et al. (2014) proposed SignSGD, an efficient heuristic scheme that reduces communication costs drastically by quantizing each gradient component to two values. (This scheme is sometimes termed 1bitSGD (Seide et al., 2014).) Bernstein et al. (2018) later provided convergence guarantees for a variant of SignSGD. Note that the quantization employed by SignSGD is not unbiased, so a new analysis was required. Alistarh et al. (2017) proposed quantized SGD (QSGD), focusing on uniform quantization of stochastic gradients normalized to have unit Euclidean norm. Their experiments illustrate that a similar quantization method, in which gradients are normalized to have unit L∞ norm, achieves better performance; we refer to this method as QSGDinf, or Qinf in short. Wen et al. (2017) proposed TernGrad, which can be viewed as a special case of QSGDinf with three quantization levels. Ramezani-Kebrya et al. (2021) proposed nonuniform quantization levels (NUQSGD) and demonstrated superior empirical results compared to QSGDinf. Recently, lattice-based quantization has been studied for distributed mean estimation and variance reduction (Davies et al., 2021). Adaptive quantization has been used for speech communication and storage (Cummiskey et al., 1973). In machine learning, several biased and unbiased schemes have been proposed to compress networks and gradients. In this work, we focus on unbiased, coordinate-wise schemes to compress gradients. Zhang et al. (2017) proposed ZipML, which is an optimal quantization method if all points to be quantized are known a priori. To find the optimal sequence of quantization levels, a dynamic program is solved whose computational and memory cost is quadratic in the number of points to be quantized, which in the case of gradients corresponds to their dimension. Faghri et al.
(2020) proposed two adaptive gradient compression schemes in which multiple processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. The adaptive quantization methods in (Faghri et al., 2020) are applicable only when minimizing a single empirical risk. In addition, the convergence guarantees in (Faghri et al., 2020) are established for smooth nonconvex optimization with a fixed step-size. In this paper, we propose communication-efficient variants of the generalized EG family of algorithms with an adaptive and unbiased quantization scheme for a general (VI) problem. Furthermore, we establish improved variance and code-length bounds and optimal convergence guarantees with an adaptive step-size. Adaptive gradient compression has been studied in other contexts when minimizing a single empirical risk, such as adapting the communication frequency in local SGD (Wang & Joshi, 2019), adapting the number of quantization levels (communication budget) over the course of training (Guo et al., 2020; Agarwal et al., 2021), adapting a gradient sparsification scheme over the course of training (Khirirat et al., 2021), and adapting compression parameters across model layers and training iterations (Markov et al., 2022). We focus on unbiased and normalized quantization and adapt the quantization levels to minimize the quantization error for the generalized EG family of algorithms, which has not been considered in the literature. First-order methods to solve VIs. In the VI literature, the benchmark method is extra-gradient (EG), proposed by Korpelevich (1976), along with its variants (Nemirovski, 2004; Nesterov, 2007). Furthermore, there has been a line of work focusing on establishing convergence guarantees under adaptive step-size policies. To that end, we review the most relevant works below.
For unconstrained problems with an operator that is locally (but not necessarily globally) Lipschitz continuous, the Golden Ratio Algorithm (GRAAL) of Malitsky (2020) achieves convergence without requiring prior knowledge of the problem's Lipschitz parameter. Moreover, such guarantees are provided for problems with a bounded domain by the Generalized Mirror Prox (GMP) algorithm of Stonyakin et al. (2018) under the umbrella of Hölder continuity. A more relevant method, which simultaneously achieves an O(1/√T) rate in non-smooth and/or stochastic problems and an O(1/T) rate in smooth ones, is the recent algorithm of Bach & Levy (2019). This algorithm employs an adaptive, AdaGrad-like step-size policy which allows the method to interpolate between these regimes. On the negative side, the algorithm requires a bounded domain with a (Bregman) diameter that is known in advance. In the optimization community, similar rate-interpolation guarantees between different noise profiles have been explored by Antonakopoulos et al. (2021); however, their results are limited to centralized, single-GPU settings. While all these algorithms concern standard single-GPU settings with access to full-precision stochastic dual vectors, we consider multi-GPU settings, which have not been considered before. Beznosikov et al. (2021) and Kovalev et al. (2022) have proposed communication-efficient algorithms for VIs with finite-sum structure and variance reduction in centralized settings and for (strongly) monotone VIs in decentralized settings, respectively. Unlike (Beznosikov et al., 2021; Kovalev et al., 2022), we achieve fast and order-optimal rates with an adaptive step-size and adaptive compression, without requiring variance reduction or strong monotonicity, and we improve the variance and code-length bounds for unbiased and adaptive compression. Detailed comparison with (Antonakopoulos et al., 2021; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021).
In this section, we elaborate and provide a detailed comparison with the most relevant related work (Antonakopoulos et al., 2021; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021). Although our results build upon the extra-gradient literature (Korpelevich, 1976), as do those of Antonakopoulos et al. (2021), in this paper we address monotone VI / convex-concave min-max problems in distributed and large-scale settings, which are not considered in Antonakopoulos et al. (2021), whose analysis refers strictly to a single-GPU, centralized setting. The distributed framework complicates the analysis significantly, since we have to treat two different types of randomness simultaneously: on one hand, the randomness associated with the compression scheme (which is necessary to achieve substantial communication savings in a distributed setup), and on the other hand, the different noisy feedback models stemming from inexact operator calculations (before any compression takes place), which together enable efficient implementations at each GPU. We show the benefits of distributed training in terms of accelerating convergence for general monotone VIs. Unlike (Alistarh et al., 2017, Theorem 3.2) and (Ramezani-Kebrya et al., 2021, Theorem 4), which hold under the special cases of L2 normalization with uniform and exponentially spaced levels, respectively, our bound in Theorem 1 holds under general Lq normalization and an arbitrary sequence of quantization levels. For the special case of L2 normalization in the regime of large d, which is the case in practice, our bound in Theorem 1 is O(ℓ_1 √d), which can be arbitrarily smaller than the O(√d/s) and O(2^{-s} √d) bounds of (Alistarh et al., 2017, Theorem 3.2) and (Ramezani-Kebrya et al., 2021, Theorem 4), respectively, because ℓ_1 is adaptively designed to minimize the variance of quantization.
Furthermore, unlike the bound in (Faghri et al., 2020, Theorem 2), which requires an inner optimization problem over an auxiliary parameter p, the bound in Theorem 1 is given in explicit form without an inner optimization problem, and it matches the lower bound. Unlike (Alistarh et al., 2017, Theorem 3.4) and (Ramezani-Kebrya et al., 2021, Theorem 5), which hold under the special case of L2 normalization, our code-length bound in Theorem 2 holds under general Lq normalization. For the special case of L2 normalization with s = √d, as in (Alistarh et al., 2017, Theorem 3.4), our bound in Theorem 2 can be arbitrarily smaller than those of (Alistarh et al., 2017, Theorem 3.4) and (Ramezani-Kebrya et al., 2021, Theorem 5), depending on {p_0, . . . , p_{s+1}}. Compared to (Faghri et al., 2020, Theorem 3), our bound in Theorem 2 does not have an additional n_{ℓ_1,d} term.
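For intuition on why extra-gradient is the benchmark for monotone problems, here is a minimal sketch (step size and iteration count are illustrative) comparing one EG iteration with a plain simultaneous gradient step on the bilinear game min_x max_y xy, whose VI operator is A(x, y) = (y, -x):

```python
import numpy as np

def eg_step(z, A, gamma):
    """One extra-gradient step (Korpelevich, 1976): evaluate the operator
    at an extrapolated half-step, then update from the original point."""
    z_half = z - gamma * A(z)
    return z - gamma * A(z_half)

# Bilinear min-max game min_x max_y x*y; its VI operator is a rotation.
A = lambda z: np.array([z[1], -z[0]])

z_eg = np.array([1.0, 1.0])
z_gda = np.array([1.0, 1.0])
for _ in range(200):
    z_eg = eg_step(z_eg, A, 0.2)
    z_gda = z_gda - 0.2 * A(z_gda)   # plain simultaneous gradient step

# EG contracts toward the solution (0, 0); plain gradient steps spiral out.
print(np.linalg.norm(z_eg), np.linalg.norm(z_gda))
```

The half-step lookahead is what lets EG handle the rotational component of monotone operators, which plain gradient steps amplify.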

C SPECIAL CASES OF Q-GENX

In this section, we show that under different choices of V̂_{k,t} and V̂_{k,t+1/2}, one can obtain communication-efficient variants of stochastic dual averaging (Nesterov, 2009), stochastic dual extrapolation (Nesterov, 2007), and stochastic optimistic dual averaging (Popov, 1980; Rakhlin & Sridharan, 2013; Hsieh et al., 2021; 2022) in multi-GPU settings as special cases of Q-GenX.

Example C.1 (Communication-efficient stochastic dual averaging). Consider the case where V̂_{k,t} ≡ 0 and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}). This setting yields X_{t+1/2} = X_t and hence ĝ_{k,t+1/2} = ĝ_{k,t} = V̂_{k,t+1/2}. Therefore, Q-GenX reduces to the communication-efficient stochastic dual averaging scheme:

Y_{t+1} = Y_t − (1/K) Σ_{k=1}^{K} ĝ_{k,t},  X_{t+1} = γ_{t+1} Y_{t+1}.  (Quantized DA)

Example C.2 (Communication-efficient stochastic dual extrapolation). Consider the case where V̂_{k,t} ≡ ĝ_{k,t} = Q_{ℓ_t}(A(X_t) + U_{k,t}) and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}) are noisy oracle queries at X_t and X_{t+1/2}, respectively. Then Q-GenX yields the communication-efficient variant of Nesterov's stochastic dual extrapolation method (Nesterov, 2007):

X_{t+1/2} = X_t − (γ_t/K) Σ_{k=1}^{K} ĝ_{k,t},  Y_{t+1} = Y_t − (1/K) Σ_{k=1}^{K} ĝ_{k,t+1/2},  X_{t+1} = γ_{t+1} Y_{t+1}.  (Quantized DE)

Example C.3 (Communication-efficient stochastic optimistic dual averaging). Consider the case where V̂_{k,t} ≡ ĝ_{k,t-1/2} = Q_{ℓ_t}(A(X_{t-1/2}) + U_{k,t-1/2}) and V̂_{k,t+1/2} ≡ ĝ_{k,t+1/2} = Q_{ℓ_t}(A(X_{t+1/2}) + U_{k,t+1/2}) are the noisy oracle feedback at X_{t-1/2} and X_{t+1/2}, respectively. We then obtain the communication-efficient stochastic optimistic dual averaging method:

X_{t+1/2} = X_t − (γ_t/K) Σ_{k=1}^{K} ĝ_{k,t-1/2},  Y_{t+1} = Y_t − (1/K) Σ_{k=1}^{K} ĝ_{k,t+1/2},  X_{t+1} = γ_{t+1} Y_{t+1}.  (Quantized OptDA)

D PROOF OF THEOREM 1 (VARIANCE BOUND)

Let u_j = |v_j|/∥v∥_q, B_0 := [0, ℓ_1], and B_j := [ℓ_j, ℓ_{j+1}] for j ∈ [s]. Let V_ℓ(v) = E_{q_ℓ}[∥Q_ℓ(v) − v∥²_2] denote the variance of the quantization in Eq. (3.1).
Then we have

V_ℓ(v) = ∥v∥²_q [ Σ_{u_i∈B_0} (ℓ_1 − u_i)u_i + Σ_{j=1}^{s} Σ_{u_i∈B_j} (ℓ_{j+1} − u_i)(u_i − ℓ_j) ].  (D.1)

We first find the minimum k_j that satisfies (ℓ_{j+1} − u)(u − ℓ_j) ≤ k_j u² for u ∈ B_j and j ∈ [s]. The minimum k_j can be obtained by the change of variable u = ℓ_j θ:

k_j = max_{1≤θ≤ℓ_{j+1}/ℓ_j} (ℓ_{j+1}/ℓ_j − θ)(θ − 1)/θ² = (ℓ_{j+1}/ℓ_j − 1)²/(4(ℓ_{j+1}/ℓ_j)).  (D.2)

We note that ℓ_{j+1}/ℓ_j > 1 and that (x − 1)²/(4x) is a monotonically increasing function of x for x > 1. Furthermore, note that Σ_{u_i∉B_0} u_i² ≤ ∥v∥²_2/∥v∥²_q.

Published as a conference paper at ICLR 2023

Substituting Eq. (D.2) into Eq. (D.1), an upper bound on V_ℓ(v) is given by

V_ℓ(v) ≤ ∥v∥²_q [ ((ℓ + ℓ^{-1})/4 − 1/2) ∥v∥²_2/∥v∥²_q + Σ_{u_i∈B_0} (ℓ_1 − u_i)u_i ].

In the rest of the proof, we use the following known lemma.

Lemma 1. Let v ∈ R^d. Then, for all 0 < p < q, we have ∥v∥_q ≤ ∥v∥_p ≤ d^{1/p − 1/q} ∥v∥_q.

We note that Lemma 1 holds even when q < 1 and ∥ • ∥_q is merely a seminorm. We now establish an upper bound on Σ_{u_i∈B_0} (ℓ_1 − u_i)u_i.

Lemma 2 (Ramezani-Kebrya et al. 2021, Lemma 15). Let p ∈ (0, 1) and u ∈ B_0. Then we have

u(ℓ_1 − u) ≤ K_p ℓ_1^{(2−p)} u^p  where  K_p = ((1/p)/(2/p − 1)) ((1/p − 1)/(2/p − 1))^{(1−p)}.  (D.3)

Let S_j denote the coordinates of vector v whose elements fall into the (j + 1)-th bin, i.e., S_j := {i : u_i ∈ B_j} for j ∈ [s]. For any 0 < p < 1 and q ≥ 2, we have

∥v∥²_q Σ_{u_i∈B_0} u_i^p = ∥v∥_q^{2−p} Σ_{i∈S_0} |v_i|^p ≤ ∥v∥_q^{2−p} ∥v∥_p^p ≤ ∥v∥_q^{2−p} ∥v∥_2^p d^{1−p/2} ≤ ∥v∥²_2 d^{1−p/2},

where the third inequality holds as ∥v∥_p ≤ ∥v∥_2 d^{1/p − 1/2} using Lemma 1 and the last inequality holds as ∥v∥_q ≤ ∥v∥_2 for q ≥ 2. Using Lemmas 1 and 2, we establish an upper bound on V_ℓ(v):

V_ℓ(v) ≤ ∥v∥²_2 [ (ℓ + ℓ^{-1})/4 − 1/2 + K_p ℓ_1^{(2−p)} d^{1−p/2} ].

For q ≥ 1, we note that ∥v∥_q^{2−p} ≤ ∥v∥_2^{2−p} d^{(2−p)/min{q,2} − (2−p)/2}, i.e.,

V_ℓ(v) ≤ ∥v∥²_2 [ (ℓ + ℓ^{-1})/4 − 1/2 + K_p ℓ_1^{(2−p)} d^{(2−p)/min{q,2}} ].  (D.4)

Note that the optimal p to minimize ϵ_Q is obtained by minimizing

λ(p) = ((1/p)/(2/p − 1)) ((1/p − 1)/(2/p − 1))^{(1−p)} δ^{(1−p)}  where  δ = ℓ_1 d^{1/min{q,2}}.
Taking the first-order derivative of λ(p), the optimal p* is given by

p* = (δ − 2)/(δ − 1) if δ ≥ 2, and p* = 0 if δ < 2.  (D.5)

Substituting (D.5) into (D.4) gives (4.1), which completes the proof.
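The per-bin constant k_j in Eq. (D.2) is easy to sanity-check numerically. The helper below (our own, for illustration) confirms that (ℓ_{j+1} − u)(u − ℓ_j) ≤ k_j u² on each bin and that the bound is attained at the harmonic-mean point u = 2ℓ_jℓ_{j+1}/(ℓ_j + ℓ_{j+1}):

```python
import numpy as np

def k_j(l_lo, l_hi):
    """Tightest constant with (l_hi - u)(u - l_lo) <= k_j * u**2 on
    [l_lo, l_hi]; from Eq. (D.2), k_j = (r - 1)**2 / (4r), r = l_hi/l_lo."""
    r = l_hi / l_lo
    return (r - 1.0) ** 2 / (4.0 * r)

for l_lo, l_hi in [(0.1, 0.2), (0.25, 1.0), (0.5, 0.7)]:  # illustrative bins
    u = np.linspace(l_lo, l_hi, 10001)
    lhs = (l_hi - u) * (u - l_lo)
    assert np.all(lhs <= k_j(l_lo, l_hi) * u ** 2 + 1e-12)
    # the maximizer of the ratio is the harmonic mean of the bin edges
    u_star = 2.0 * l_lo * l_hi / (l_lo + l_hi)
    assert np.isclose((l_hi - u_star) * (u_star - l_lo),
                      k_j(l_lo, l_hi) * u_star ** 2)
print("k_j bound verified on all test bins")
```

Since k_j depends only on the ratio ℓ_{j+1}/ℓ_j, this also illustrates why the proof tracks the maximum level ratio.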

E PROOF OF THEOREM 2 (CODE-LENGTH BOUND)

Let | • | denote the length of a binary string. In this section, we obtain an upper bound on E_ω E_{q_ℓ}[|CODE ∘ Q(Q_ℓ(g(x; ω)); ℓ)|], i.e., the expected number of communication bits per iteration. We recall from Section 3.2 that v is uniquely represented by the tuple (∥v∥_q, s, u). We first encode the norm ∥v∥_q using C_b bits; in practice, we use standard 32-bit floating-point encoding. We then use one bit to encode the sign of each nonzero entry of u. Let S_j := {i : u_i ∈ [ℓ_j, ℓ_{j+1}]} and N_j := |S_j| for j ∈ [s]. We now provide the expression for the probabilities associated with our symbols to be coded, i.e., {ℓ_0, ℓ_1, . . . , ℓ_{s+1}}. The associated probabilities can be computed using the weighted sum of the conditional CDFs of the normalized coordinates in QAda and the quantization levels:

Proposition 2. Let j ∈ [s]. The probability of occurrence of ℓ_j (the weight of symbol ℓ_j) is given by

p_j = Pr(ℓ_j) = ∫_{ℓ_{j-1}}^{ℓ_j} (u − ℓ_{j-1})/(ℓ_j − ℓ_{j-1}) dF(u) + ∫_{ℓ_j}^{ℓ_{j+1}} (ℓ_{j+1} − u)/(ℓ_{j+1} − ℓ_j) dF(u),

where F is the weighted sum of the conditional CDFs of the normalized coordinates in QAda. In addition, we have p_0 = Pr(ℓ_0 = 0) = ∫_0^{ℓ_1} (1 − u/ℓ_1) dF(u) and p_{s+1} = Pr(ℓ_{s+1} = 1) = ∫_{ℓ_s}^{1} (u − ℓ_s)/(1 − ℓ_s) dF(u).

We have an upper bound on the expected number of nonzero entries as follows:

Lemma 3. Let v ∈ R^d. The expected number of nonzeros in Q_ℓ(v) is given by E_{q_ℓ}[∥Q_ℓ(v)∥_0] = (1 − p_0)d. (E.1)

We then send the associated codeword to encode each coordinate of u. The optimal expected code-length for transmitting one random symbol is within one bit of the entropy of the source (Cover & Thomas, 2006). So the number of information bits required to transmit the entries of u is bounded above by d(H(L) + 1), where H(L) = −Σ_{j=1}^{s} p_j log(p_j) is the entropy in bits. Putting everything together, we have

E_ω E_{q_ℓ}[|CODE ∘ Q(Q_ℓ(g(x; ω)); ℓ)|] ≤ C_b + (1 − p_0)d + (H(L) + 1)d,

where C_b = O(1) is a universal constant. Finally, we note that the entropy of a source with n outcomes is upper bounded by log(n).
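The resulting bound is straightforward to evaluate numerically. The sketch below (helper name ours, for illustration) computes C_b + (1 − p_0)d + (H(L) + 1)d from a given symbol distribution and checks the final remark that the entropy is at most the log of the number of symbols:

```python
import math

def expected_bits_bound(p, d, c_b=32):
    """Upper bound on the expected code length in the style of Theorem 2:
    c_b bits for the norm, one bit per expected nonzero, and (H(L) + 1)
    bits per coordinate for the entropy-coded level indices."""
    h = -sum(q * math.log2(q) for q in p if q > 0)   # entropy H(L) in bits
    assert h <= math.log2(len(p)) + 1e-9             # H is at most log(#symbols)
    return c_b + (1.0 - p[0]) * d + (h + 1.0) * d

# e.g. four equiprobable symbols {l_0, ..., l_3} and d = 1000 coordinates
print(expected_bits_bound([0.25, 0.25, 0.25, 0.25], d=1000))  # -> 3782.0
```

A skewed symbol distribution (large p_0, i.e., many coordinates rounded to zero) shrinks both the nonzero term and the entropy term, which is what adapting the levels exploits.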

F PROOF OF THEOREM 3 (Q-GENX UNDER ABSOLUTE NOISE)

We note that the output of Algorithm 1 follows the iterates of (Q-GenX): X t+1/2 = X t - γ t K K k=1 Vk,t Y t+1 = Y t -K -1 K k=1 Vk,t+1/2 X t+1 = γ t+1 Y t+1 (Q-GenX) We first prove the following Template Inequality for (Q-GenX), which is a useful milestone to prove both Theorems 3 and 4. Proposition 3 (Template inequality). Let X ∈ R d . Suppose the iterates X t of (Q-GenX) are updated with some non-increasing step-size schedule γ t for t = 1, 1/2, . . . Then, we have T t=1 1 K K k=1 Vk,t+1/2 , X t+1/2 -X ≤ ∥X∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * - 1 2 T t=1 1 γ t ∥X t -X t+1/2 ∥ 2 * . (F.1) Proof. We first decompose 1 K ⟨ K k=1 Vk,t+1/2 , X t+1/2 -X⟩ into two terms and note: 1 K K k=1 Vk,t+1/2 , X t+1/2 -X = S A + S B . where S A = 1 K K k=1 Vk,t+1/2 , X t+1/2 -X t+1 and S B = 1 K K k=1 Vk,t+1/2 , X t+1 -X . Note that the update rule in (Q-GenX) implies: S B = ⟨Y t -Y t+1 , X t+1 -X⟩ = Y t - γ t+1 γ t Y t+1 , X t+1 -X + γ t+1 γ t Y t+1 -Y t+1 , X t+1 -X = 1 γ t ⟨γ t Y t -γ t+1 Y t+1 , X t+1 -X⟩ + 1 γ t+1 - 1 γ t ⟨-γ t+1 Y t+1 , X t+1 -X⟩ = 1 γ t ⟨X t -X t+1 , X t+1 -X⟩ + 1 γ t+1 - 1 γ t ⟨-X t+1 , X t+1 -X⟩. By algebraic manipulations, we can further show that S B = 1 γ t 1 2 ∥X t -X∥ 2 * - 1 2 ∥X t+1 -X∥ 2 * - 1 2 ∥X t+1 -X t ∥ 2 * + 1 γ t+1 - 1 γ t 1 2 ∥X∥ 2 * - 1 2 ∥X t+1 ∥ 2 * - 1 2 ∥X t+1 -X∥ 2 * ≤ 1 2γ t ∥X t -X∥ 2 * - 1 2γ t+1 ∥X t+1 -X∥ 2 * - 1 2γ t ∥X t -X t+1 ∥ 2 * + 1 2γ t+1 - 1 2γ t ∥X∥ 2 * where the last inequality holds by dropping -1 2 ∥X t+1 ∥ 2 * . Rearranging the terms in the above expression and substituting S B , we have 1 2γ t+1 ∥X t+1 -X∥ 2 * ≤ 1 2γ t ∥X t -X∥ 2 * - 1 2γ t ∥X t -X t+1 ∥ 2 * + 1 2γ t+1 - 1 2γ t ∥X∥ 2 * - 1 K K k=1 Vk,t+1/2 , X t+1 -X = 1 2γ t ∥X t -X∥ 2 * - 1 2γ t ∥X t -X t+1 ∥ 2 * + 1 2γ t+1 - 1 2γ t ∥X∥ 2 * + 1 K K k=1 Vk,t+1/2 , X t+1/2 -X t+1 - 1 K K k=1 Vk,t+1/2 , X t+1/2 -X .

(F.2)

On the other hand, we have γ t K K k=1 Vk,t , X t+1/2 -X = ⟨X t -X t+1/2 , X t+1/2 -X⟩ = 1 2 ∥X t -X∥ 2 * - 1 2 ∥X t -X t+1/2 ∥ 2 * - 1 2 ∥X t+1/2 -X∥ 2 * (F.3) Substituting X = X t+1 and dividing both sides of (F.3) by γ t , we have 1 K K k=1 Vk,t , X t+1/2 -X t+1 = 1 2γ t ∥X t -X t+1 ∥ 2 * - 1 2γ t ∥X t -X t+1/2 ∥ 2 * - 1 2γ t ∥X t+1/2 -X t+1 ∥ 2 * . (F.4) Combining (F.2) and (F.4), we have 1 K K k=1 Vk,t+1/2 , X t+1/2 -X ≤ 1 2γ t ∥X t -X∥ 2 * - 1 2γ t+1 ∥X t+1 -X∥ 2 * + 1 2γ t+1 - 1 2γ t ∥X∥ 2 * + 1 K K k=1 Vk,t+1/2 -Vk,t , X t+1/2 -X t+1 - 1 2γ t ∥X t -X t+1/2 ∥ 2 * - 1 2γ t ∥X t+1 -X t+1/2 ∥ 2 * . Summing the above for t = 1, . . . , T and telescoping, we have 1 K T t=1 ⟨ K k=1 Vk,t+1/2 , X t+1/2 -X⟩ ≤ 1 2γ1 ∥X1 -X∥ 2 * - 1 2γT +1 ∥XT +1 -X∥ 2 * + ( 1 2γT +1 - 1 2γ1 )∥X∥ 2 * + 1 K T t=1 ⟨ K k=1 Vk,t+1/2 -Vk,t , X t+1/2 -Xt+1⟩ - T t=1 1 2γt ∥Xt -X t+1/2 ∥ 2 * - T t=1 1 2γt ∥Xt+1 -X t+1/2 ∥ 2 * . Substituting X 1 = 0, we havefoot_2  1 K T t=1 ⟨ K k=1 Vk,t+1/2 , X t+1/2 -X⟩ ≤ 1 2γT +1 ∥X∥ 2 * + 1 K T t=1 ⟨ K k=1 Vk,t+1/2 -Vk,t , X t+1/2 -Xt+1⟩ - T t=1 1 2γt ∥Xt -X t+1/2 ∥ 2 * - T t=1 1 2γt ∥Xt+1 -X t+1/2 ∥ 2 * . (F.5) By applying Cauchy-Schwarz and triangle inequalities, we have 1 K ⟨ K k=1 Vk,t+1/2 -Vk,t , X t+1/2 -Xt+1⟩ ≤ K k=1 ∥ Vk,t+1/2 -Vk,t ∥ * ∥ 1 K (X t+1/2 -Xt+1)∥ * . (F.6) Furthermore, since ab ≤ γt 2K 2 a 2 + K 2 2γt b 2 , we have T t=1 1 K ⟨ K k=1 Vk,t+1/2 -Vk,t , X t+1/2 -X t+1 ⟩ ≤ T t=1 γ t 2K 2 K k=1 ∥( Vk,t+1/2 -Vk,t )∥ 2 * + T t=1 1 2γ t ∥X t+1/2 -X t+1 ∥ 2 * . (F.7) Substituting (F.7) into (F.5) and applying the convexity of ∥ • ∥ 2 * , we obtain (F.1), which completes the proof. ■ The following lemmas show how additional noise due to compression affects the upper bounds under absolute noise and relative noise models in Assumptions 2 and 3, respectively. Let q ℓ ∼ P Q represent d variables sampled i.i.d. for random quantization in Definition 1. We remind that q ℓ is independent of the random sample ω ∼ P in Eq. (2.1). 
We consider a general unbiased and normalized compression scheme that has a bounded variance as in Theorem 1 and a bound on the expected number of communication bits to encode Q_ℓ(v), i.e., the output of CODE ∘ Q introduced in Section 3.2, which satisfies Theorem 2. Then we have:

Lemma 4 (Unbiased compression under absolute noise). Let x ∈ R^d and ω ∼ P. Suppose the oracle g(x; ω) satisfies Assumption 2 and Q_ℓ satisfies Theorems 1 and 2. Then the compressed Q_ℓ(g(x; ω)) satisfies Assumption 2 with

E[∥Q_ℓ(g(x; ω)) − A(x)∥²_2] ≤ ϵ_Q M² + σ².  (F.8)

Furthermore, the number of bits to encode Q_ℓ(g(x; ω)) is bounded by the upper bound in (4.2).

Proof. The almost sure boundedness and unbiasedness follow immediately from the construction of the unbiased Q_ℓ. In particular, we note that the maximum additional norm when compressing Q_ℓ(g(x; ω)) arises when all normalized coordinates of g(x; ω), which we call u_1, . . . , u_d, are mapped to the upper level ℓ_{τ(u)+1} in Definition 1. The additional norm multiplier is bounded by Σ_{i=1}^{d} (ℓ_{τ(u_i)+1} − u_i)² ≤ Σ_{i=1}^{d} u_i² = 1. Then we have ∥Q_ℓ(g(x; ω))∥ ≤ 2M a.s., so the additional upper bound is constant. The final property also holds as follows:

E_ω E_{q_ℓ}[∥Q_ℓ(g(x; ω)) − A(x)∥²_2] = E_ω E_{q_ℓ}[∥Q_ℓ(g(x; ω)) ± g(x; ω) − A(x)∥²_2] = E_ω E_{q_ℓ}[∥Q_ℓ(g(x; ω)) − g(x; ω)∥²_2] + E_ω[∥U(x; ω)∥²_2] ≤ ϵ_Q E_ω[∥g(x; ω)∥²_2] + σ² ≤ ϵ_Q M² + σ²,

where the second step holds due to the unbiasedness of q_ℓ and the last inequality holds since ∥g(x; ω)∥_* ≤ M a.s. ■

Lemma 5 (Unbiased compression under relative noise). Let x ∈ R^d and ω ∼ P. Suppose the oracle g(x; ω) satisfies Assumption 3 and Q_ℓ satisfies Theorems 1 and 2. Then the compressed Q_ℓ(g(x; ω)) satisfies Assumption 3 with

E[∥Q_ℓ(g(x; ω)) − A(x)∥²_2] ≤ (ϵ_Q(c + 1) + c)∥A(x)∥²_2.  (F.9)

Furthermore, the number of bits to encode Q_ℓ(g(x; ω)) is bounded by the upper bound in (4.2).

Proof.
The almost sure boundedness and unbiasedness follow immediately from the construction of the unbiased Q_ℓ. The final property also holds as follows:

E_ω E_{q_ℓ}[∥Q_ℓ(g(x; ω)) − A(x)∥²_2] = E_ω E_{q_ℓ}[∥Q_ℓ(g(x; ω)) − g(x; ω)∥²_2] + E_ω[∥U(x; ω)∥²_2] ≤ ϵ_Q E_ω[∥g(x; ω)∥²_2] + c∥A(x)∥²_2 = ϵ_Q E_ω[∥U(x; ω) + A(x)∥²_2] + c∥A(x)∥²_2 = ϵ_Q (E_ω[∥U(x; ω)∥²_2] + ∥A(x)∥²_2) + c∥A(x)∥²_2 ≤ (ϵ_Q(c + 1) + c)∥A(x)∥²_2,

where the first and fourth steps hold due to the unbiasedness of q_ℓ and the noise model, respectively. ■

We now prove our main theorem:

Theorem 5 (Q-GenX under absolute noise). Let C ⊂ R^d denote a compact neighborhood of a solution of (VI) and let D² := sup_{X∈C} ∥X − X_0∥². Suppose that the oracle and the problem (VI) satisfy Assumptions 1 and 2, respectively, that Algorithm 1 is executed for T iterations on K processors with an adaptive step-size γ_t = K(1 + Σ_{i=1}^{t−1} Σ_{k=1}^{K} ∥V̂_{k,i} − V̂_{k,i+1/2}∥²)^{−1/2}, and that the quantization levels are updated J times, where ℓ^j with variance bound ϵ_{Q,j} in (4.1) and code-length bound N_{Q,j} in (4.2) is used for T_j iterations with Σ_{j=1}^{J} T_j = T. Then we have

E[Gap_C((1/T) Σ_{t=1}^{T} X_{t+1/2})] = O( ((Σ_{j=1}^{J} ϵ_{Q,j} T_j/T) M + σ) D² / √(TK) ).

In addition, Algorithm 1 requires each processor to send at most (2/T) Σ_{j=1}^{J} T_j N_{Q,j} communication bits per iteration in expectation.

In the following, we prove the result using the template inequality (Proposition 3) and the noise analysis in Lemma 4. As a preliminary step, we prove the proposition below for an adaptive step-size with a non-adaptive Q_ℓ satisfying Theorem 1.

Proposition 4 (Algorithm 1 under absolute noise and a fixed compression scheme). Under the setup described in Theorem 3, with an adaptive step-size γ_t = K(1 + Σ_{i=1}^{t−1} Σ_{k=1}^{K} ∥V̂_{k,i} − V̂_{k,i+1/2}∥²)^{−1/2} and a non-adaptive Q_ℓ satisfying Theorem 1, we have

E[Gap_C((1/T) Σ_{t=1}^{T} X_{t+1/2})] = O( (√(ϵ_Q) M + σ) D² / √(TK) ).

Proof. Suppose first that we do not apply compression, i.e., ϵ_Q = 0.
By the template inequality (Proposition 3), we have

Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨V̂_{k,t+1/2}, X_{t+1/2} − X⟩ ≤ ∥X∥²_*/(2γ_{T+1}) + (1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*.  (F.10)

Denote the LHS and RHS of (F.10) by S_A = Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨V̂_{k,t+1/2}, X_{t+1/2} − X⟩ and S_B = ∥X∥²_*/(2γ_{T+1}) + (1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*, respectively. Then, by the noise model (2.1) and the monotonicity of the operator A, we have

S_A = Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨A_k(X_{t+1/2}), X_{t+1/2} − X⟩ + Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨U_{k,t+1/2}, X_{t+1/2} − X⟩ ≥ Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨A_k(X), X_{t+1/2} − X⟩ + Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨U_{k,t+1/2}, X_{t+1/2} − X⟩ = (T/K) Σ_{k=1}^{K} ⟨A_k(X), X̄_{T+1/2} − X⟩ + Σ_{t=1}^{T} (1/K) Σ_{k=1}^{K} ⟨U_{k,t+1/2}, X_{t+1/2} − X⟩,

where A_k = A for k ∈ [K] and X̄_{T+1/2} := (1/T) Σ_{t=1}^{T} X_{t+1/2}. Therefore, rearranging the terms of Eq. (F.10) using the above inequality, we have

(T/K) Σ_{k=1}^{K} ⟨A_k(X), X̄_{T+1/2} − X⟩ ≤ − Σ_{t=1}^{T} ⟨(1/K) Σ_{k=1}^{K} U_{k,t+1/2}, X_{t+1/2} − X⟩ + ∥X∥²_*/(2γ_{T+1}) + (1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*.  (F.11)

By taking the supremum on both sides of Eq. (F.11), dividing by T, and taking expectations, we have

E[(1/K) Σ_{k=1}^{K} sup_X ⟨A_k(X), X̄_{T+1/2} − X⟩] ≤ (1/T)(S_1 + S_2 + S_3)  (F.12)

where S_1 = E[D²/(2γ_{T+1})], S_2 = E[(1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*], and S_3 = E[sup_X Σ_{t=1}^{T} ⟨(1/K) Σ_{k=1}^{K} U_{k,t+1/2}, X_{t+1/2} − X⟩]. We now bound S_1, S_2, and S_3 from above individually. For S_1, we have

S_1 = E[D²/(2γ_{T+1})] = (D²/(2K)) E[√(1 + Σ_{i=1}^{T−1} Σ_{k=1}^{K} ∥V̂_{k,i} − V̂_{k,i+1/2}∥²)] ≤ (D²/(2K)) √(1 + Σ_{i=1}^{T−1} Σ_{k=1}^{K} E[∥V̂_{k,i} − V̂_{k,i+1/2}∥²]) ≤ (D²/(2K)) √(1 + Σ_{i=1}^{T−1} Σ_{k=1}^{K} 2(E[∥V̂_{k,i}∥²] + E[∥V̂_{k,i+1/2}∥²])) ≤ (D²/(2K)) √(1 + 4KTσ²).
(F.13) We also have S2 = E 1 2K 2 T t=1 γt K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * = 1 2 E T t=1 γt K 2 - γt+1 K 2 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * + 1 2 E T t=1 γt+1 K 2 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ 2E T t=1 γt K 2 - γt+1 K 2 Kσ 2 + 1 2 E T t=1 γt+1 K 2 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ 2σ 2 + 1 2K E   T t=1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * 1 + T t=1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 *   ≤ 2σ 2 + 1 2K E   1 + T t=1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 *   ≤ 2σ 2 + 1 2K 1 + T t=1 K k=1 E ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ 2σ 2 + 1 2K 1 + 4σ 2 KT . (F.14) Finally we note that S 3 = E sup X T t=1 ⟨ 1 K K k=1 U k,t+1/2 , X t+1/2 -X⟩ = E sup X T t=1 ⟨ 1 K K k=1 U k,t+1/2 , X⟩ -E sup X T t=1 ⟨ 1 K K k=1 U k,t+1/2 , X t+1/2 ⟩ . (F.15) We bound the first term in the RHS of (F.15) using the following known lemma: Lemma 6 (Bach & Levy 2019) . Let C ∈ R d be a convex set and h : C → R be a 1-strongly convex w.r.t. a norm ∥ • ∥. Assume that h(x) -min x∈C h(x) ≤ D 2 /2 for all x ∈ C. Then, for any martingle difference (z t ) T t=1 ∈ R d and any x ∈ C, we have E T t=1 z t , x ≤ D 2 2 T t=1 E[∥z t ∥ 2 ]. Using Lemma 6, the first term in the RHS of (F.15) is bounded by 1 K E sup X T t=1 ⟨ K k=1 U k,t+1/2 , X⟩ ≤ D 2 2K E T t=1 K k=1 ∥U k,t+1/2 ∥ 2 ≤ D 2 σ √ T 2 √ K . (F.16) Similarly, we can bound the second term in the RHS of (F.15). Combining the results in Eq. (F.13), Eq. (F.14), and Eq. (F.16) and applying Lemma 4, we obtain the upper bound in Proposition 4. ■ Applying Proposition 4 with the scaled step size schedule in Theorem 3, we complete the proof for an adaptive compression scheme along the lines of (Faghri et al., 2020, Theorem 4) . G PROOF OF THEOREM 4 (Q-GENX UNDER RELATIVE NOISE) We first remind the theorem statement: Theorem 6 (Q-GenX under relative noise). Let C ⊂ R d denote a compact neighborhood of a solution for (VI) and let D 2 := sup X∈C ∥X -X 0 ∥ 2 . 
Suppose that the oracle and the problem (VI) satisfy Assumptions 1, 3, and 4, Algorithm 1 is executed for T iterations on K processors with an adaptive step-size γ t = K(1 + t-1 i=1 K k=1 ∥ Vk,i -Vk,i+1/2 ∥ 2 ) -1/2 , and quantization levels are updated J times where ℓ j with variance bound ϵ Q,j in (4.1) and code-length bound N Q,j in (4.2) is used for T j iterations with J j=1 T j = T . Then we have E Gap C 1 T T t=1 X t+1/2 = O (c + 1) J j=1 T j ϵ Q,j /T + c D 2 KT . In addition, Algorithm 1 requires each processor to send at most 2 T J j=1 T j N Q,j communication bits per iteration in expectation. Suppose that we do not apply compression, i.e., ϵ Q = 0. We first remind the template inequality in (F.1), which holds for any γ t and noise model: T t=1 1 K K k=1 Vk,t+1/2 , X t+1/2 -X ≤ ∥X∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * - 1 2 T t=1 1 γ t ∥X t -X t+1/2 ∥ 2 * . In the following proposition, we show that under γ t and relative noise model in Theorem 4, T t=1 E ∥A(X t+1/2 )∥ 2 * + ∥A(X t )∥ 2 * is summable in the sense that T t=1 E ∥A(X t+1/2 )∥ 2 * + ∥A(X t )∥ 2 * = O(1/γ T ). Proposition 5 (Sum operator output under relative noise). Let X * denote a solution of (VI). Under the setup described in Theorem 4, we have: T t=1 E ∥A(X t+1/2 )∥ 2 * + ∥A(X t )∥ 2 * ≤ E ∥X * ∥ 2 * 2γ T +1 . (G.1) Proof. Substituting X = X * into E 1 K K k=1 Vk,t+1/2 , X t+1/2 -X and applying the law of total expectation, we have: E 1 K K k=1 Vk,t+1/2 , X t+1/2 -X * = E 1 K K k=1 E[⟨ Vk,t+1/2 , X t+1/2 -X * ⟩|X t+1/2 ] = E 1 K K k=1 ⟨A k (X t+1/2 ), X t+1/2 -X * ⟩ = E ⟨A(X t+1/2 ), X t+1/2 -X * ⟩ ≥ E ⟨A(X t+1/2 ) -A(X * ), X t+1/2 -X * ⟩ ≥ βE[∥A(X t+1/2 )∥ 2 * ] (G.2) where the fourth and fifth inequalities hold due to the definition of the monotone operator Eq. (VI) and β-cocoecivity in Eq. ( 4.3), respectively. Applying the lower bound in Eq. (G.2) into Eq. 
(F.1), we obtain: T t=1 βE[∥A(X t+1/2 )∥ 2 * ] ≤ E ∥X * ∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * - 1 2 T t=1 1 γ t ∥X t -X t+1/2 ∥ 2 * . (G.3) Moreover, by lower bounding (LHS) of the above: T t=1 E[∥A(X t+1/2 )∥ 2 * = T t=1 E[1/K K k=1 ∥A(X t+1/2 )∥ 2 ] (G.4) ≥ 1 c T t=1 E[1/K K k=1 ∥ Vk,t+1/2 ∥ 2 * ] (G.5) with the second inequality being obtained by the relative noise condition. On the other hand, applying Cauchy-Schwarz and β-cocoecivity in Eq. ( 4.3) imply ∥X t -X t+1/2 ∥ 2 * ≥ β 2 ∥A(X t )-A(X t+1/2 )∥ 2 * . It follows that: 1 2 E T t=1 β∥A(X t+1/2 )∥ 2 * + T t=1 1 γ t ∥X t -X t+1/2 ∥ 2 * ≥ 1 2 E T t=1 β∥A(X t+1/2 )∥ 2 * + T t=1 β 2 γ t ∥A(X t ) -A(X t+1/2 )∥ 2 * ≥ 1 2 E T t=1 β∥A(X t+1/2 )∥ 2 * + T t=1 β 2 γ t ∥A(X t ) -A(X t+1/2 )∥ 2 * ≥ 1 2 min β, β 2 γ 0 T t=1 E ∥A(X t+1/2 )∥ 2 * + ∥A(X t ) -A(X t+1/2 )∥ 2 * ≥ 1 2 min β, β 2 γ 0 T t=1 E ∥A(X t )∥ 2 * ≥ 1 2 min β, β 2 γ 0 T t=1 E 1/K K k=1 ∥A(X t )∥ 2 * ≥ 1 2c min β, β 2 γ 0 T t=1 E 1/K K k=1 ∥ Vk,t ∥ 2 * (G.6) where the last inequality holds due to the relative noise condition. Combining the above inequalities we get the following: β c T t=1 E[1/K K k=1 ∥ Vk,t+1/2 ∥ 2 * ] ≤ E[ ∥X∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ] (G.7) and 1 2c min β, β 2 γ 0 T t=1 E 1/K K k=1 ∥ Vk,t ∥ 2 * ≤ E[ ∥X∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ] (G.8) Therefore, by adding the above inequalities we get: β c T t=1 E[1/K K k=1 ∥ Vk,t+1/2 ∥ 2 * ] + 1 2c min β, β 2 γ 0 T t=1 E 1/K K k=1 ∥ Vk,t ∥ 2 * ≤ 2E ∥X∥ 2 * 2γ T +1 + 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * (G.9) We now establish an upper bound on the R.H.S. of Eq. (G.9). 
We first note that: E 1 2K 2 T t=1 γ t K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * = E 1 2K 2 T t=1 (γ t -γ t+1 ) K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * + 1 2K 2 T t=1 γ t+1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ E 1 2K (4M 2 ) T t=1 (γ t -γ t+1 ) + 1 2K 2 T t=1 γ t+1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ E 2M 2 K γ 1 + 1 2K 2 T t=1 γ t+1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ E 2M 2 + 1 2K 2 T t=1 γ t+1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≤ E 2M 2 + 1 K 1 + T t=1 K k=1 ∥ Vk,t+1/2 -Vk,t ∥ 2 * ≲ E[ 1 γ T +1 ]. (G.10) Therefore, an upper bound on the R.H.S. of Eq. (G.9) is given by : β c T t=1 E[1/K K k=1 ∥ Vk,t+1/2 ∥ 2 * ] + 1 2c min β, β 2 γ 0 T t=1 E 1/K K k=1 ∥ Vk,t ∥ 2 * ≤ E ∥X * ∥ 2 * + 1 γ T +1 . (G.11)

■

To establish a lower bound on the L.H.S. of Eq. (G.9), we first note that:

E[(1/K²) Σ_{t=1}^{T} Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*] = E[(1/K²)(1 + Σ_{t=1}^{T} Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*)] − 1/K² = E[1/γ²_{T+1}] − 1/K².  (G.12)

We also note that:

(K/(4c)) min{β, β²/γ_0} E[(1/K²) Σ_{t=1}^{T} Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*] ≤ (β/(2c)) Σ_{t=1}^{T} E[(1/K) Σ_{k=1}^{K} ∥V̂_{k,t+1/2}∥²_*] + (1/(2c)) min{β, β²/γ_0} Σ_{t=1}^{T} E[(1/K) Σ_{k=1}^{K} ∥V̂_{k,t}∥²_*].

Combining with Eq. (G.12), we have

(K/(4c)) min{β, β²/γ_0} E[1/γ²_{T+1}] ≤ (∥X*∥² + 1) E[1/γ_{T+1}] ≤ (∥X*∥² + 1) √(E[1/γ²_{T+1}]),

with the last inequality being obtained by Jensen's inequality. So we have

E[1/γ_{T+1}] ≤ (4c/K) max{1/β, γ_0/β²}.  (G.14)

Similar to the proof of Theorem 3, we have

E[sup_X ⟨A(X), X̄_{T+1/2} − X⟩] ≤ (1/T)(S_1 + S_2 + S_3)  (G.15)

where S_1 = E[D²/(2γ_{T+1})], S_2 = E[(1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*], and S_3 = E[sup_X Σ_{t=1}^{T} ⟨(1/K) Σ_{k=1}^{K} U_{k,t+1/2}, X − X_{t+1/2}⟩]. By (G.10), we have

E[(1/(2K²)) Σ_{t=1}^{T} γ_t Σ_{k=1}^{K} ∥V̂_{k,t+1/2} − V̂_{k,t}∥²_*] ≲ E[1/γ_{T+1}].  (G.16)

We now decompose S_3 into two terms: S_3 = E[sup_X Σ_{t=1}^{T} ⟨(1/K) Σ_{k=1}^{K} U_{k,t+1/2}, X⟩] − E[Σ_{t=1}^{T} ⟨(1/K) Σ_{k=1}^{K} U_{k,t+1/2}, X_{t+1/2}⟩]. Let the supremum in the first term be attained by X = X_o. We can then establish an upper bound on the first term using Lemma 6:

(1/K) E[sup_X Σ_{t=1}^{T} ⟨Σ_{k=1}^{K} U_{k,t+1/2}, X⟩] = (1/K) E[⟨Σ_{t=1}^{T} Σ_{k=1}^{K} U_{k,t+1/2}, X_o⟩] ≤ √((D²/(2K)) E[∥Σ_{t=1}^{T} Σ_{k=1}^{K} U_{k,t+1/2}∥²_*]) ≤ √((D²/(2K)) c E[Σ_{t=1}^{T} ∥A(X_{t+1/2})∥²_*]) ≤ √((D² c/(2K)) E[∥X*∥²_*/(2γ_{T+1})]),  (G.17)

where the last inequality holds by Proposition 5. Finally, by the law of total expectation, we have:

E[Σ_{t=1}^{T} ⟨Σ_{k=1}^{K} U_{k,t+1/2}, X_{t+1/2}⟩] = E[Σ_{t=1}^{T} Σ_{k=1}^{K} E[⟨U_{k,t+1/2}, X_{t+1/2}⟩ | X_{t+1/2}]] = 0.  (G.18)

Substituting Eq. (G.14), Eq. (G.17), and Eq. (G.18) into Eq. (G.15), we have E[Gap_C(X̄_{T+1/2})] = O((1/T) E[D²/(2γ_{T+1})]).
Following a similar analysis to that of Proposition 4 and applying Lemma 5 with the scaled step-size schedule in Theorem 4, we complete the proof for an adaptive compression scheme along the lines of (Faghri et al., 2020, Theorem 4).

H EXPERIMENTAL DETAILS AND ADDITIONAL EXPERIMENTS

To validate our theoretical results, we build on the code base of Gidel et al. (2019) and run an instantiation of Q-GenX obtained by combining ExtraAdam with the compression offered by the torch_cgx PyTorch extension of Markov et al. (2022), training a WGAN-GP (Arjovsky et al., 2017) on CIFAR10 (Krizhevsky, 2009). Since torch_cgx uses OpenMPI (Gabriel et al., 2004) as its communication backend, we also use OpenMPI as the communication backend for the full-gradient baseline, for a fairer comparison. We deliberately do not tune any hyperparameters to fit the larger batch size: similar to Gidel et al. (2019), we do not claim to set a new state of the art with these experiments, but simply want to show that our theory holds up in practice and can lead to improvements. To this end, we present a basic experiment showing that even for a very small problem size and the heuristic base compression method of cgx, we achieve a noticeable speedup of around 8%. We expect further gains for larger problems and more advanced compression methods. Given the differences in settings and the lack of any code, let alone an efficient implementation usable in a real-world setting (i.e., CUDA kernels integrated with networking), a fair comparison with Beznosikov et al. (2021) is difficult, if not impossible. We follow exactly the setup of Gidel et al. (2019), except that we share an effective batch size of 1024 across 3 nodes (strong scaling) connected via Ethernet, and use Layernorm (Ba et al., 2016) instead of Batchnorm (Ioffe & Szegedy, 2015), since Batchnorm is known to be challenging in distributed training and to interact badly with the WGAN-GP penalty. The results are shown in Fig. 2a, showing the evolution of FID, and Fig. 2b, showing the accumulated total time spent backpropagating.
We note that we do not scale the learning rate or any other hyperparameters to account for these two changes, so this experiment is not meant to claim state-of-the-art performance, merely to illustrate that the speedup does not drastically change the final performance. We compare training using the full 32-bit gradient (FP32) to training with gradients compressed to 8 bits (UQ8) and 4 bits (UQ4) using a bucket size of 1024. Figure 3 shows a more fine-grained breakdown of the time used for backpropagation (BP), where the network activity takes place. GenBP, DiscBP, and PenBP refer to the backpropagation for the generator, the discriminator, and the calculation of the gradient penalty, respectively; Total refers to the sum of these times. We used Weights and Biases (Biewald, 2020) for all experiment tracking. Our time measurements are performed with Python's time.time() function, which has microsecond precision on Linux, measuring only backward-propagation times and total training time, excluding plotting, logging, etc. The experiments were performed on 3 Nvidia V100 GPUs (1 per node) using a Kubernetes cluster and an image built on the torch_cgx Docker image.

H.1 COMPARISON WITH QSGDA

Figure 4 compares Q-GenX with the QSGDA of Beznosikov et al. (2022), their only method without explicit variance reduction. Thanks to the extra-gradient template, Q-GenX is able to make steady progress without variance reduction.
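For intuition, the UQ8/UQ4 gradient compression used above can be mimicked by bucketed stochastic uniform quantization. The sketch below is our own illustration, not the torch_cgx implementation: `bucket_quantize` is a hypothetical helper that scales each bucket by its max-absolute value, snaps entries onto a uniform grid with 2^bits − 1 gaps, and rounds stochastically so that the dequantized result is unbiased.

```python
import numpy as np

def bucket_quantize(v, bits=4, bucket_size=1024, rng=None):
    """Stochastically quantize v bucket-by-bucket to `bits` bits per entry.

    Each bucket is scaled by its max-abs value, mapped onto 2**bits - 1
    uniform gaps in [-1, 1], and rounded stochastically so the result is
    unbiased: E[dequantized] = v.  (Illustrative sketch, not torch_cgx.)
    """
    rng = np.random.default_rng(rng)
    s = 2 ** bits - 1                       # number of grid gaps
    out = np.empty_like(v, dtype=np.float64)
    for start in range(0, len(v), bucket_size):
        b = v[start:start + bucket_size]
        scale = np.max(np.abs(b))
        if scale == 0.0:
            out[start:start + bucket_size] = 0.0
            continue
        u = b / scale                       # normalized to [-1, 1]
        x = (u + 1.0) / 2.0 * s             # position on the grid, in [0, s]
        low = np.floor(x)
        p = x - low                         # probability of rounding up
        q = low + (rng.random(b.shape) < p)
        out[start:start + bucket_size] = (q / s * 2.0 - 1.0) * scale
    return out
```

Fewer bits mean a coarser per-bucket grid and hence more quantization noise; this is exactly the variance term that is traded against communication time in the analysis of Appendix I.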

I TRADE-OFF BETWEEN NUMBER OF ITERATIONS AND TIME PER ITERATION

In this section, we build on our theoretical results in Theorems 3 and 4 to capture the trade-off between the number of iterations to converge and the time per iteration, which includes the total time required to update the model on each GPU.

We note that q_ℓ = {q_ℓ(u_i)}_{i=1,...,d} are independent random variables. The encoding CODE ∘ Q(∥v∥_q, s, q_ℓ) : R_+ × {±1}^d × {ℓ_0, ..., ℓ_{s+1}}^d → {0, 1}* uses a standard floating-point encoding with C_b bits to represent the positive scalar ∥v∥_q, encodes the sign of each coordinate with one bit, and finally applies an integer encoding scheme Ψ : {ℓ_0, ℓ_1, ..., ℓ_{s+1}} → {0, 1}* to encode each quantized normalized coordinate q_ℓ(u_i) with the minimum expected code-length. Depending on how much is known about the distribution of the discrete alphabet of levels, a particular lossless prefix code can be used to encode q_ℓ. In particular, if the distribution of the frequencies of the discrete alphabet {ℓ_0, ℓ_1, ..., ℓ_{s+1}} is unknown but smaller values are known to be more frequent than larger ones, Elias recursive coding (ERC) can be used (Elias, 1975). ERC is a universal lossless integer coding scheme with recursive and efficient encoding and decoding procedures, which assigns shorter codes to smaller values. If the distribution of the frequencies of the discrete alphabet {ℓ_0, ℓ_1, ..., ℓ_{s+1}} is known or can be estimated efficiently, we use Huffman coding, which has efficient encoding/decoding and achieves the minimum expected code-length among methods that encode symbols separately (Cover & Thomas, 2006). The decoding DEQ ∘ CODE : {0, 1}* → R^d in Algorithm 1 first reads C_b bits to reconstruct ∥v∥_q. It then applies Ψ^{-1} : {0, 1}* → {ℓ_0, ℓ_1, ..., ℓ_{s+1}} to decode the index of the first coordinate; depending on whether the decoded entry is zero or nonzero, it may read one bit indicating the sign, and then proceeds to decode its value.
It then decodes the next symbol. The decoding continues, mirroring the encoding scheme, and finishes when all quantized coordinates are decoded. Note that the decoding fully recovers Q_ℓ(v) because the coding scheme is lossless. One may slightly improve the coding efficiency in terms of expected code-length by encoding blocks of symbols, at the cost of increased encoding/decoding complexity. We focus on lossless prefix coding schemes that encode symbols separately, due to their encoding/decoding simplicity (Cover & Thomas, 2006). To implement an efficient Huffman code, we need to estimate the probabilities of the symbols in our discrete alphabet {ℓ_0, ℓ_1, ..., ℓ_{s+1}}. This discrete distribution can be estimated by properly estimating the marginal probability density function (PDF) of the normalized coordinates, along the lines of, e.g., (Faghri et al., 2020, Proposition 6). Given quantization levels ℓ_t and the marginal PDF of normalized coordinates, the K processors can construct the Huffman tree in parallel. A Huffman tree of a source with s + 2 symbols can be constructed in O(s) time once the symbols are sorted by their associated probabilities. It is well known that Huffman codes minimize the expected code-length:

Theorem 7 (Cover & Thomas 2006, Theorems 5.4.1 and 5.8.1). Let Z denote a random source with a discrete alphabet Z. The expected code-length of an optimal prefix code to compress Z is bounded by H(Z) ≤ E[L] ≤ H(Z) + 1, where H(Z) ≤ log₂(|Z|) is the entropy of Z in bits.
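To make Theorem 7 concrete, the sketch below computes the optimal prefix code lengths for a small, hypothetical level distribution (the frequencies are illustrative, not measured on any dataset) and checks the bound H(Z) ≤ E[L] ≤ H(Z) + 1:

```python
import heapq, math

def huffman_lengths(probs):
    """Return optimal prefix code lengths for a distribution via Huffman's algorithm."""
    heap = [(p, i, (i,)) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)                       # unique tie-breaker for equal probabilities
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1                # each merge adds one bit to all merged symbols
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

# hypothetical skewed distribution over s + 2 = 6 quantization levels,
# with smaller levels more frequent (the regime where ERC/Huffman pay off)
probs = [0.55, 0.2, 0.1, 0.08, 0.05, 0.02]
lengths = huffman_lengths(probs)
expected_len = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)
```

A heap is used here for simplicity; the O(s) construction mentioned above applies once the symbols are pre-sorted by probability.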



Notations are provided in Appendix A. A more fine-grained analysis can be done by considering two sequences of adaptive levels, one for the V_{k,t}'s and another for the V_{k,t+1/2}'s. While in this paper we quantize both sequences with the same quantization scheme, it is possible to further reduce quantization errors by considering two fine-grained quantization schemes, at the cost of additional computation at the processors. The substitution X_1 = 0 is just for notational simplicity and can be relaxed at the expense of a slightly more complicated expression.



Figure 1: FID evolution during training (left). We compare full-precision ExtraAdam with a simple instantiation of Q-GenX. FID stands for Fréchet inception distance, a standard GAN quality metric introduced in (Heusel et al., 2017). Fine-grained comparison of average .backward() times on the generator, the discriminator, and the gradient penalty, as well as total training time in seconds (middle and right). The .backward() function is where PyTorch DistributedDataParallel (DDP) handles the gradient exchange.


Figure 3: Fine-grained comparison of average .backward() times on the generator, the discriminator, and the gradient penalty, as well as total training time. The .backward() function is where PyTorch DDP handles the gradient exchange.

Figure 2: Comparing full-gradient ExtraAdam with a simple instantiation of Q-GenX. (a) FID evolution; FID stands for Fréchet inception distance, a standard GAN quality metric introduced in (Heusel et al., 2017). (b) Total time spent on backpropagation and gradient exchanges.

ACKNOWLEDGMENTS

The authors would like to thank Fartash Faghri, Yang Linyan, Ilia Markov, and Hamidreza Ramezanikebrya for helpful discussions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n° 725594 - time-data). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. The work of Ali Ramezani-Kebrya was in part supported by the Research Council of Norway, through its Centre for Research-based Innovation funding scheme (Visual Intelligence, grant no. 309439), and Consortium Partners. This work is licensed under a Creative Commons "Attribution 3.0 Unported" license.


Using the results of Theorems 3 and 4, we can obtain the minimum number of iterations required to guarantee an expected gap of ϵ, which is a measure of solution quality. In particular, under the absolute noise model and with average gradient variance bound ϵ_Q = Σ_{j=1}^J T_j ϵ_{Q,j}/T, the minimum number of iterations to guarantee an expected gap of ϵ is T(ϵ, ϵ_Q) = (ϵ_Q M² + σ²)² D⁴/ϵ². Now suppose that the time per iteration, which includes the overall computation, encoding, and communication time to compute, compress, send, receive, decompress, and update one iteration, is denoted by Δ. Decreasing the number of communication bits, i.e., compressing more aggressively, increases the sufficient number of iterations T(ϵ, ϵ_Q) but decreases the time per iteration Δ thanks to communication savings, which captures the trade-off. Theoretically, the best compression method is the one with the minimum overall wall-clock time, bounded by T(ϵ, ϵ_Q)Δ. The exact optimal point depends on the specific problem (dataset, loss, etc.) and the chosen hyperparameters (architecture, number of bits, etc.), which together determine ϵ_Q, as well as on the implementation details of the algorithm, networking, and compression, along with the cluster setup, which together determine Δ. We defer a more refined analysis of the optimal point to future work.
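As a toy illustration of this trade-off, the snippet below plugs hypothetical values of ϵ_Q and Δ (the numbers are made up for illustration; real values would come from profiling a specific cluster and model) into the wall-clock bound T(ϵ, ϵ_Q)Δ:

```python
def min_iterations(eps, eps_q, M=1.0, sigma=1.0, D=1.0):
    """T(eps, eps_q) = (eps_q*M^2 + sigma^2)^2 * D^4 / eps^2 (absolute noise model)."""
    return (eps_q * M**2 + sigma**2) ** 2 * D**4 / eps**2

# hypothetical compression profiles: more aggressive compression
# inflates eps_q (more iterations) but shrinks the per-iteration time delta
profiles = {
    "FP32": {"eps_q": 0.00, "delta": 1.00},   # no compression
    "UQ8":  {"eps_q": 0.05, "delta": 0.70},
    "UQ4":  {"eps_q": 0.20, "delta": 0.55},
}
eps = 1e-2
wallclock = {name: min_iterations(eps, p["eps_q"]) * p["delta"]
             for name, p in profiles.items()}
best = min(wallclock, key=wallclock.get)   # profile minimizing T(eps, eps_q) * delta
```

With these illustrative numbers, moderate compression wins: its extra iterations are more than paid for by the cheaper communication, while very aggressive compression starts losing ground to the inflated iteration count.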

J EXAMPLES MOTIVATING ASSUMPTION 3

In this section, we provide some popular examples that motivate Assumption 3.

Example J.1 (Random coordinate descent (RCD)). Consider a smooth convex function f over R^d. At iteration t, the RCD algorithm draws one coordinate i_t ∈ [d] uniformly at random and computes the partial derivative v_{i_t,t} = ∂f/∂x_{i_t}. Subsequently, the i_t-th coordinate is updated while all other coordinates remain unchanged. This update rule can be written in the abstract recursive form x⁺ = x − αg(x; μ), where g_i(x; μ) = d · (∂f/∂x_i) · μ_i and μ is drawn uniformly at random from the set of basis vectors {e_1, ..., e_d} ⊆ R^d. We note that E[g(x; μ)] = ∇f(x). Furthermore, since ∂f/∂x_i = 0 at the minima of f, we also have g(x*; μ) = 0 whenever x* is a minimizer of f, i.e., the variance of the random vector g(x; μ) vanishes at the minima of f. It is not difficult to show that E_μ[∥g(x; μ) − ∇f(x)∥²] = O(∥∇f(x)∥²), which satisfies Assumption 3 with A = ∇f.

Example J.2 (Random player updating). Consider an N-player convex game with loss functions f_i, i ∈ [N]. Suppose that, at each stage, player i is selected with probability p_i to play an action following its individual gradient rule X_{i,t+1} = X_{i,t} + (γ_t/p_i)V_{i,t}, where V_{i,t} = ∇_i f_i(X_t) denotes player i's individual gradient at the state X_t = (X_{1,t}, ..., X_{N,t}) and p_i is included for scaling reasons.

Note that E[V_t] = A(X_t); that is, V_t is an unbiased oracle for A. Since all individual components of A vanish at the game's Nash equilibria, it is also straightforward to verify that V_t satisfies Assumption 3.
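The RCD oracle of Example J.1 can be sanity-checked numerically. The sketch below uses a small quadratic f (our own toy choice) to verify both the unbiasedness E[g(x; μ)] = ∇f(x) and the vanishing of the oracle at the minimizer x* = 0:

```python
import numpy as np

# f(x) = 0.5 * x^T Q x, smooth and convex for positive-definite Q, with grad f(x) = Q x
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda x: Q @ x
d = 2

def g(x, i):
    """RCD oracle: d * (df/dx_i) * e_i for a coordinate i drawn uniformly from [d]."""
    v = np.zeros(d)
    v[i] = d * grad(x)[i]
    return v

x = np.array([1.0, -2.0])
# unbiasedness: averaging over all coordinates recovers the full gradient exactly
avg = sum(g(x, i) for i in range(d)) / d
# vanishing noise: at the minimizer x* = 0 every realization of the oracle is zero
zero_oracle = [g(np.zeros(d), i) for i in range(d)]
```

Since the variance of g(x; μ) is driven entirely by ∇f(x), it vanishes exactly where the gradient does, which is the relative-noise behavior that Assumption 3 formalizes.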

K ENCODING

To further reduce communication costs, we can apply information-theoretically inspired coding schemes on top of quantization. In this section, we provide an overview of our coding schemes along the lines of (Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021). Let q ∈ Z_+. We first note that a vector v ∈ R^d can be uniquely represented by a tuple (∥v∥_q, s, u), where ∥v∥_q is the L_q norm of v, s := [sgn(v_1), ..., sgn(v_d)]^⊤ consists of the signs of the coordinates v_i, and u := [u_1, ..., u_d]^⊤ with u_i = |v_i|/∥v∥_q contains the normalized coordinates. Note that 0 ≤ u_i ≤ 1 for all i ∈ [d]. We define a random quantization function as follows.

Definition 2 (Random quantization function). Let s ∈ Z_+ denote the number of quantization levels, let u ∈ [0, 1], and let ℓ = (ℓ_0, ..., ℓ_{s+1}) denote a sequence of quantization levels with 0 = ℓ_0 < ℓ_1 < ··· < ℓ_{s+1} = 1. Let τ(u) denote the index such that u ∈ [ℓ_{τ(u)}, ℓ_{τ(u)+1}), and let ξ(u) = (u − ℓ_{τ(u)})/(ℓ_{τ(u)+1} − ℓ_{τ(u)}) be the relative distance of u to level τ(u) + 1. We define the random function q_ℓ(u) : [0, 1] → {ℓ_0, ..., ℓ_{s+1}} such that q_ℓ(u) = ℓ_{τ(u)} with probability 1 − ξ(u) and q_ℓ(u) = ℓ_{τ(u)+1} with probability ξ(u). Let q ∈ Z_+ and v ∈ R^d. We define the random quantization of v as Q_ℓ(v) := ∥v∥_q · s ⊙ [q_ℓ(u_1), ..., q_ℓ(u_d)]^⊤, where ⊙ denotes the element-wise (Hadamard) product.
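A minimal sketch of Definition 2, assuming the levels form an increasing grid with ℓ_0 = 0 and ℓ_{s+1} = 1 (the function name and the vectorized style are ours, not the paper's implementation):

```python
import numpy as np

def quantize(v, levels, q=2, rng=None):
    """Random quantization Q_l(v) of Definition 2 (illustrative sketch).

    v is represented as (||v||_q, sign(v), u) with u_i = |v_i| / ||v||_q;
    each u_i is rounded to one of its two neighboring levels with
    probabilities chosen so that E[Q_l(v)] = v (unbiasedness).
    """
    rng = np.random.default_rng(rng)
    levels = np.asarray(levels, dtype=float)   # 0 = l_0 < ... < l_{s+1} = 1
    norm = np.linalg.norm(v, ord=q)
    if norm == 0.0:
        return np.zeros_like(v)
    u = np.abs(v) / norm
    # tau(u): index of the level interval containing each u_i
    tau = np.clip(np.searchsorted(levels, u, side="right") - 1, 0, len(levels) - 2)
    lo, hi = levels[tau], levels[tau + 1]
    xi = (u - lo) / (hi - lo)                  # relative distance to level tau + 1
    qu = np.where(rng.random(v.shape) < xi, hi, lo)
    return norm * np.sign(v) * qu
```

Each output coordinate is ∥v∥_q times a signed level, so only the norm, the signs, and the level indices need to be communicated, which is exactly what the CODE scheme of Appendix K exploits.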

