MITIGATING GRADIENT BIAS IN MULTI-OBJECTIVE LEARNING: A PROVABLY CONVERGENT APPROACH

Abstract

Machine learning problems with multiple objectives arise either i) in learning with multiple criteria, where learning has to trade off multiple performance metrics such as fairness, safety, and accuracy; or ii) in multi-task learning, where multiple tasks are optimized jointly and share inductive bias among them. These multiple-objective learning problems are often tackled within the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their recent variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient with Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it guarantees convergence without increasing the batch size, even in the nonconvex setting. Experiments on supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.

1. INTRODUCTION

Multi-objective optimization (MOO) involves optimizing multiple, potentially conflicting objectives simultaneously. Recently, MOO has gained attention in various application settings such as optimizing hydrocarbon production (You et al., 2020), tissue engineering (Shi et al., 2019), safe reinforcement learning (Thomas et al., 2021), and training neural networks for multiple tasks (Sener & Koltun, 2018). We consider the stochastic MOO problem

min_{x∈X} F(x) := (E_ξ[f_1(x, ξ)], E_ξ[f_2(x, ξ)], ..., E_ξ[f_M(x, ξ)]),

where X ⊆ R^d is the feasible set, and f_m : X → R with f_m(x) := E_ξ[f_m(x, ξ)] for m ∈ [M]. Here we denote [M] := {1, 2, ..., M}, and ξ is a random variable. In this setting, we are interested in optimizing all of the objective functions simultaneously without sacrificing any individual objective. Since we cannot always hope to find a common variable x that achieves the optima of all objectives simultaneously, a natural alternative is to find a so-termed Pareto stationary point x, at which no objective can be further improved without sacrificing another.

In this context, the multiple gradient descent algorithm (MGDA) has been developed to achieve this goal (Désidéri, 2012). The idea of MGDA is to iteratively update the variable x along a common descent direction for all objectives, formed as a time-varying convex combination of the individual objectives' gradients. Recently, various MGDA-based MOO algorithms have been proposed, especially for multi-task learning (MTL) (Sener & Koltun, 2018; Chen et al., 2018; Yu et al., 2020a; Liu et al., 2021a). While the deterministic MGDA algorithm and its variants are well understood in the literature, little theoretical study has been devoted to its stochastic counterpart. Recently, (Liu & Vicente, 2021) introduced the stochastic multi-gradient (SMG) method as a stochastic counterpart of MGDA (see Section 2.3 for details).
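As a concrete illustration of the MGDA subproblem, the following sketch (our own minimal implementation, not the paper's; the function names, iteration count, and step size are our assumptions) computes the common descent direction by projected gradient descent on the min-norm problem min_{λ∈simplex} ||Σ_m λ_m ∇f_m(x)||², given the full per-objective gradients:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (standard sort-based rule).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def mgda_direction(grads, iters=500, lr=0.1):
    # grads: (M, d) array holding the M per-objective gradients at the current x.
    # Solve min_{lam in simplex} ||lam^T grads||^2 by projected gradient descent,
    # then return the weights and the common descent direction -lam^T grads.
    M = grads.shape[0]
    lam = np.full(M, 1.0 / M)
    gram = grads @ grads.T
    for _ in range(iters):
        lam = project_simplex(lam - lr * 2.0 * gram @ lam)
    return lam, -lam @ grads
```

For two objectives with gradients (2, 0) and (0, 1), for instance, this yields weights λ = (0.2, 0.8) and common direction −(0.4, 0.8); a small step along this direction decreases both objectives to first order.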
To establish convergence, however, (Liu & Vicente, 2021) requires a strong assumption that the first moment of the gradient error decays quickly, which is enforced by linearly growing the batch size. While this enables an analysis of multi-objective optimization in the stochastic setting, growing batch sizes are impractical for many MTL tasks. Furthermore, the analysis in (Liu & Vicente, 2021) does not cover the important setting of non-convex objectives, which is prevalent in challenging MTL tasks. This leads us to a natural question:

Can we design a stochastic MOO algorithm that provably converges to a Pareto stationary point without growing the batch size, including in the nonconvex setting?

Our contributions. In this paper, we answer this question affirmatively by providing the first stochastic MOO algorithm that provably converges to a Pareto stationary point without growing the batch size. Specifically, we make the following major contributions:

C1) (Asymptotically unbiased multi-gradient). We introduce a new method for MOO that we call the stochastic Multi-objective gradient with Correction (MoCo) method. MoCo is a simple algorithm that addresses the convergence issues of stochastic MGDA and provably converges to a Pareto stationary point under several stochastic MOO settings. We use a toy example in Figure 1 to demonstrate the empirical benefit of our method. In this example, MoCo reaches the Pareto front from all initializations, while other MOO algorithms such as SMG, CAGrad, and PCGrad fail to find the Pareto front because they use a biased multi-gradient.

C2) (Unified non-asymptotic analysis). We generalize our MoCo method to the case where the individual objective functions have a nested structure, so that obtaining unbiased stochastic gradients is costly. We provide a unified convergence analysis of the nested MoCo algorithm in smooth non-convex and convex stochastic MOO settings. To the best of our knowledge, this is the first analysis of smooth non-convex stochastic gradient-based MOO.

C3) (Experiments on MTL applications). We provide an empirical evaluation of our method against existing state-of-the-art MTL algorithms in supervised learning and reinforcement learning (RL) settings, and show that our method can outperform prior methods such as stochastic MGDA, PCGrad, CAGrad, and GradDrop.

This work was supported by the National Science Foundation CAREER project 2047177 and the RPI-IBM Artificial Intelligence Research Collaboration. Correspondence to: Tianyi Chen (chentianyi19@gmail.com).
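To give intuition for the correction idea behind C1, here is a hedged toy sketch of gradient tracking (our own illustration under simplifying assumptions, not the paper's MoCo pseudocode: the quadratic objectives, step sizes alpha/beta/gamma, and variable names are all ours). Each objective's stochastic gradient is smoothed by a moving-average tracker before the min-norm weights are computed, so the noise averages out over iterations instead of biasing the common direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objectives f_m(x) = 0.5 * ||x - c_m||^2; their Pareto front is the
# segment between the two centers. Gradients are observed with Gaussian noise.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

def stoch_grad(m, x, noise=0.3):
    return (x - centers[m]) + noise * rng.standard_normal(x.shape)

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based rule).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1), 0.0)

M = 2
x = np.array([2.0, 2.0])
Y = np.zeros((M, 2))             # tracked estimates of the M gradients
lam = np.full(M, 1.0 / M)        # convex-combination weights on the simplex
alpha, beta, gamma = 0.02, 0.2, 0.1

for t in range(3000):
    # (i) track each objective's gradient with a moving average
    for m in range(M):
        Y[m] += beta * (stoch_grad(m, x) - Y[m])
    # (ii) one projected-gradient step on the min-norm weight subproblem
    lam = project_simplex(lam - gamma * (Y @ Y.T) @ lam)
    # (iii) descend along the tracked common direction
    x = x - alpha * lam @ Y
```

With these toy objectives, x drifts from its initialization toward the segment between the two centers (the Pareto front) despite the gradient noise; plugging the raw noisy gradients directly into the weight subproblem would instead bias the direction.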

2. BACKGROUND

In this section, we introduce the concepts of Pareto optimality and Pareto stationarity, and then discuss MGDA and its existing stochastic counterpart. We then motivate our proposed method by elaborating on the challenges in stochastic MOO. The notation used in the paper is summarized in Appendix A.



Figure 1: A toy example from (Liu et al., 2021a) with two objectives (Figures 1b and 1c) to show the impact of gradient bias. We use the mean objective as a reference when plotting the trajectories corresponding to each initialization (3 initializations in total). The starting points of the trajectories are denoted by a black •, and the trajectories fade from red (start) to yellow (end). The Pareto front is shown by the gray bar, and the black ⋆ denotes the point on the Pareto front corresponding to equal weights on the objectives. We implement the recent MOO algorithms SMG (Liu & Vicente, 2021), PCGrad (Yu et al., 2020a), and CAGrad (Liu et al., 2021a), as well as MGDA (Désidéri, 2012), alongside our method. Except for MGDA (Figure 1d), all the other algorithms only have access to gradients of each objective with added zero-mean Gaussian noise. It can be observed that SMG, CAGrad, and PCGrad fail to find the Pareto front from some initializations.
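The bias underlying this failure can be reproduced in a few lines. The following sketch (our own toy setup, not the paper's experiment; the gradient values and noise level are assumptions) averages the two-objective min-norm multi-gradient over many independent noisy draws: the average does not converge to the multi-gradient of the true gradients, because the min-norm mapping is nonlinear in its inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def min_norm_2(g1, g2):
    # Closed-form solution of min_{lam in [0,1]} ||lam*g1 + (1-lam)*g2||^2,
    # i.e., the two-objective MGDA multi-gradient.
    diff = g1 - g2
    lam = np.clip(-(g2 @ diff) / (diff @ diff), 0.0, 1.0)
    return lam * g1 + (1.0 - lam) * g2

g1, g2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])   # "true" gradients
true_mg = min_norm_2(g1, g2)                          # equals (0.4, 0.8)

# Average the multi-gradient computed from independently perturbed gradients.
samples = [
    min_norm_2(g1 + 0.5 * rng.standard_normal(2),
               g2 + 0.5 * rng.standard_normal(2))
    for _ in range(5000)
]
bias = np.linalg.norm(np.mean(samples, axis=0) - true_mg)
# `bias` stays bounded away from zero however many draws are averaged,
# because the min-norm mapping is nonlinear in the gradients.
```

This is exactly why averaging over more iterations does not rescue SMG-style methods: the bias persists at any noise level unless the gradients themselves are denoised first, e.g., via larger batches or a correction mechanism.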

