DISTRIBUTED EXTRA-GRADIENT WITH OPTIMAL COMPLEXITY AND COMMUNICATION GUARANTEES

Abstract

We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local stochastic dual vectors. This setting covers a broad range of important problems, from distributed convex minimization to min-max problems and games. Extra-gradient, the de facto algorithm for monotone VI problems, was not designed to be communication-efficient. To this end, we propose quantized generalized extra-gradient (Q-GenX), an unbiased and adaptive compression method tailored to solving VIs. We provide an adaptive step-size rule, which adapts to the respective noise profiles at hand, achieves a fast rate of O(1/T) under relative noise and an order-optimal O(1/√T) under absolute noise, and show that distributed training accelerates convergence. Finally, we validate our theoretical results with real-world experiments, training generative adversarial networks on multiple GPUs.

1. INTRODUCTION

The surge of deep learning across tasks beyond image classification has triggered a vast literature on optimization paradigms that transcend standard empirical risk minimization. For example, training generative adversarial networks (GANs) amounts to solving a more complicated zero-sum game between a generator and a discriminator (Goodfellow et al., 2020). This becomes even more complex when the generator and the discriminator do not have completely antithetical objectives and, e.g., constitute a more general game-theoretic setup. A powerful unifying framework that includes these important problems as special cases is the monotone variational inequality (VI). Formally, given a monotone operator A : R^d → R^d, i.e.,

⟨A(x) - A(x′), x - x′⟩ ≥ 0 for all x, x′ ∈ R^d,

our goal is to find some x* ∈ R^d such that

⟨A(x*), x - x*⟩ ≥ 0 for all x ∈ R^d. (VI)

Several practical problems can be formulated as a (VI) problem, including those with convex-like structures, e.g., convex minimization, as well as min-max problems and games.

For various tasks, it is widely known that employing deep neural networks (DNNs) along with massive datasets leads to significant improvements in learning (Shalev-Shwartz & Ben-David, 2014). However, such DNNs can no longer be trained on a single machine. One common solution is to train on multi-GPU systems (Alistarh et al., 2017). Furthermore, in federated learning (FL), multiple clients, e.g., a few hospitals or several cellphones, learn a model collaboratively without sharing local data due to privacy risks (Kairouz et al., 2021). To minimize a single empirical risk, SGD is the most popular algorithm due to its flexibility for parallel implementations and excellent generalization performance (Alistarh et al., 2017; Wilson et al., 2017). Data-parallel SGD has delivered tremendous success in terms of scalability (Zinkevich et al., 2010; Bekkerman et al., 2011; Recht et al., 2011; Dean et al., 2012; Coates et al., 2013; Chilimbi et al., 2014; Li et al., 2014; Duchi et al., 2015; Xing et al., 2015; Zhang et al., 2015; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021) and reduces computational costs significantly. However, the communication costs of broadcasting huge stochastic gradients are the main performance bottleneck in large-scale settings (Strom, 2015; Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021).
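As a toy illustration of the (VI) setup (our own sketch, not an example from the paper), the bilinear saddle-point problem min_x max_y x·y induces the monotone operator A(x, y) = (y, −x), and extra-gradient, the base method discussed above, drives the iterates to the unique solution (0, 0), whereas plain gradient steps spiral away:

```python
import numpy as np

def A(z):
    """Monotone operator of the bilinear game min_x max_y x*y: A(x, y) = (y, -x)."""
    x, y = z[:1], z[1:]
    return np.concatenate([y, -x])

def extra_gradient(z0, step=0.2, iters=500):
    """Deterministic extra-gradient: an extrapolation (leading) step followed
    by an update that reuses the operator at the leading point. Plain
    gradient steps z <- z - step * A(z) diverge on this problem."""
    z = z0.astype(float)
    for _ in range(iters):
        z_lead = z - step * A(z)   # extrapolation step
        z = z - step * A(z_lead)   # update with the leading operator value
    return z

z_star = extra_gradient(np.array([1.0, -2.0]))
print(np.linalg.norm(z_star))  # essentially 0: the solution of (VI)
```

The step size and iteration count here are illustrative choices for this 2-D problem, not the adaptive rule proposed in the paper.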
Several methods have been proposed to accelerate training for classical empirical risk minimization, such as gradient (or model-update) compression, gradient sparsification, weight quantization/sparsification, and reducing the frequency of communication through local methods (Dean et al., 2012; Seide et al., 2014; Sa et al., 2015; Gupta et al., 2015; Abadi et al., 2016; Alistarh et al., 2017; Wen et al., 2017; Zhou et al., 2018; Bernstein et al., 2018; Faghri et al., 2020; Ramezani-Kebrya et al., 2021; Kairouz et al., 2021). In particular, unbiased gradient quantization is attractive because it enjoys strong theoretical guarantees while providing communication efficiency on the fly, i.e., convergence under the same hyperparameters tuned for the uncompressed variants, together with substantial savings in communication costs (Alistarh et al., 2017; Faghri et al., 2020; Ramezani-Kebrya et al., 2021). Unlike full-precision data-parallel SGD, where each processor must broadcast its local gradient in full precision, i.e., transmit and receive huge full-precision vectors at each iteration, unbiased quantization requires each processor to transmit only a few bits per iteration for each component of the stochastic gradient. In this work, we propose communication-efficient variants of a general first-order method that achieves the optimal rate of convergence, with improved guarantees on the number of communication bits for monotone VIs, and show that distributed training accelerates convergence. We employ an adaptive step-size together with both adaptive and non-adaptive variants of unbiased quantization schemes tailored to VIs.
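To make the unbiased-quantization idea concrete, below is a minimal NumPy sketch of a QSGD-style unbiased stochastic quantizer (the function name, the level count, and the scheme itself are our illustrative choices; the adaptive scheme proposed in this paper is more refined). Because the quantizer is unbiased, averaging the quantized vectors reported by many workers recovers the original vector, which is also the mechanism behind the speed-up from adding processors:

```python
import numpy as np

def unbiased_quantize(v, num_levels=4, rng=None):
    """QSGD-style unbiased stochastic quantization (a sketch): each coordinate
    of |v|/||v|| is rounded to one of `num_levels` uniform levels, up or down
    with probabilities chosen so that E[q(v)] = v."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v.copy()
    scaled = np.abs(v) / norm * num_levels   # each entry lies in [0, num_levels]
    lower = np.floor(scaled)
    levels = lower + (rng.random(v.shape) < scaled - lower)  # stochastic rounding
    return np.sign(v) * norm * levels / num_levels

# Averaging quantized copies from many independent "workers" recovers v.
rng = np.random.default_rng(0)
v = np.array([0.3, -1.2, 0.7, 2.5])
avg = np.mean([unbiased_quantize(v, rng=rng) for _ in range(10000)], axis=0)
print(np.max(np.abs(avg - v)))  # small residual, shrinking with more workers
```

Each coordinate can then be encoded with roughly log2(num_levels) bits plus a sign bit instead of 32 bits, which is the source of the communication savings.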
Three major challenges arise in tackling this problem: 1) how to adapt unbiased quantization schemes, including their adaptive variants, to solve general VIs; 2) whether the optimal rate of convergence can be achieved without knowing the noise profile, while demonstrating the benefits of distributed training; 3) whether the scalability gains can be validated in large-scale settings without compromising accuracy. We address these challenges and answer all questions in the affirmative:

1.1 SUMMARY OF CONTRIBUTIONS

• We propose the quantized generalized extra-gradient (Q-GenX) family of algorithms, which employs unbiased compression methods tailored to general VI solvers. Our framework unifies distributed and communication-efficient variants of stochastic dual averaging, stochastic dual extrapolation, and stochastic optimistic dual averaging.

• Without prior knowledge of the noise profile, we provide an adaptive step-size rule for Q-GenX, achieve a fast rate of O(1/T) under relative noise and an order-optimal O(1/√T) under absolute noise, and show that increasing the number of processors accelerates convergence for general monotone VIs.



• We validate our theoretical results with real-world experiments, training generative adversarial networks on multiple GPUs.

(Wang & Joshi, 2019), adapting the number of quantization levels (communication budget) over the course of training (Guo et al., 2020; Agarwal et al., 2021), adapting a gradient sparsification scheme over the course of training (Khirirat et al., 2021), and adapting compression parameters across model layers and training iterations (Markov et al., 2022) have been proposed for minimizing a single empirical risk. In this paper, we propose a communication-efficient generalized extra-gradient family of algorithms with adaptive quantization and an adaptive step-size for the general (VI) problem. In the

