MITIGATING GRADIENT BIAS IN MULTI-OBJECTIVE LEARNING: A PROVABLY CONVERGENT APPROACH

Abstract

Machine learning problems with multiple objectives appear either i) in learning with multiple criteria, where learning has to make a trade-off between multiple performance metrics such as fairness, safety, and accuracy; or ii) in multi-task learning, where multiple tasks are optimized jointly, sharing inductive bias among them. These multi-objective learning problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their recent variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased gradient direction, which leads to degraded empirical performance. To address this, we develop a stochastic Multi-objective gradient with Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it guarantees convergence without increasing the batch size, even in the nonconvex setting. Simulations on supervised and reinforcement learning tasks demonstrate the effectiveness of our method relative to state-of-the-art methods.

1. INTRODUCTION

Multi-objective optimization (MOO) involves optimizing multiple, potentially conflicting objectives simultaneously. Recently, MOO has gained attention in various application settings such as optimizing hydrocarbon production (You et al., 2020), tissue engineering (Shi et al., 2019), safe reinforcement learning (Thomas et al., 2021), and training neural networks for multiple tasks (Sener & Koltun, 2018). We consider the stochastic MOO problem min_{x∈X} F(x) := (E_ξ[f_1(x, ξ)], E_ξ[f_2(x, ξ)], ..., E_ξ[f_M(x, ξ)]), where X ⊆ R^d is the feasible set, and f_m : X → R with f_m(x) := E_ξ[f_m(x, ξ)] for m ∈ [M]. Here we denote [M] := {1, 2, ..., M} and denote ξ as a random variable. In this setting, we are interested in optimizing all of the objective functions simultaneously without sacrificing any individual objective. Since we cannot always hope to find a common variable x that achieves the optimum of every function simultaneously, a natural alternative is to find a so-termed Pareto stationary point x, at which no objective can be further improved without sacrificing some other objective. In this context, the multiple gradient descent algorithm (MGDA) has been developed for achieving this goal (Désidéri, 2012). The idea of MGDA is to iteratively update the variable x via a common descent direction for all the objectives, obtained through a time-varying convex combination of the gradients of the individual objectives. Recently, various MGDA-based MOO algorithms have been proposed, especially for multi-task learning (MTL) (Sener & Koltun, 2018; Chen et al., 2018; Yu et al., 2020a; Liu et al., 2021a). While the deterministic MGDA algorithm and its variants are well understood in the literature, little theoretical study has been devoted to their stochastic counterparts. Recently, Liu & Vicente (2021) introduced the stochastic multi-gradient (SMG) method as a stochastic counterpart of MGDA (see Section 2.3 for details).
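To make the MGDA idea above concrete: the common descent direction is the minimum-norm point in the convex hull of the per-objective gradients, i.e., the convex combination of gradients with the smallest norm. The sketch below approximates it with a Frank-Wolfe loop, one standard solver choice for this simplex-constrained quadratic; this is a minimal illustration, not the paper's MoCo method, and the function name `min_norm_direction` is ours.

```python
import numpy as np

def min_norm_direction(grads, iters=100):
    """Approximate the MGDA common descent direction: the minimum-norm
    point in the convex hull of the objectives' gradients, found with a
    simple Frank-Wolfe loop (illustrative solver, not the MoCo update)."""
    M = len(grads)
    G = np.stack(grads)            # (M, d) matrix of per-objective gradients
    GG = G @ G.T                   # (M, M) Gram matrix of pairwise inner products
    lam = np.full(M, 1.0 / M)      # start from the uniform convex combination
    for t in range(iters):
        # Frank-Wolfe step: move toward the vertex (single gradient)
        # with the smallest directional derivative of ||lam @ G||^2.
        i = int(np.argmin(GG @ lam))
        step = 2.0 / (t + 2.0)
        vertex = np.zeros(M)
        vertex[i] = 1.0
        lam = (1.0 - step) * lam + step * vertex
    d = lam @ G                    # common direction; update is x <- x - eta * d
    return lam, d
```

For two orthogonal gradients, the weights converge to (1/2, 1/2), and the resulting direction has a positive inner product with every objective's gradient, so a small step along -d decreases all objectives at once.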
To establish convergence, however, (Liu & Vicente, 2021) requires a strong assumption that the first moment of the gradient error decays quickly, which is enforced by linearly growing the batch size. While this enables an analysis of multi-objective optimization in the stochastic setting, linearly growing batch sizes are impractical for many MTL tasks.

The work was supported by the National Science Foundation CAREER project 2047177 and the RPI-IBM Artificial Intelligence Research Collaboration. Correspondence to: Tianyi Chen (chentianyi19@gmail.com).

Furthermore, the analysis in (Liu & Vicente,

