ENHANCING META LEARNING VIA MULTI-OBJECTIVE SOFT IMPROVEMENT FUNCTIONS

Abstract

Meta-learning tries to leverage information from similar learning tasks. In the commonly-used bilevel optimization formulation, the shared parameter is learned in the outer loop by minimizing the average loss over all tasks. However, the converged solution may be compromised in that it focuses on optimizing only a small subset of tasks. To alleviate this problem, we consider meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. However, existing MOO solvers need to access all the objectives' gradients in each iteration, and cannot scale to the huge number of tasks in typical meta-learning settings. To alleviate this problem, we propose a scalable gradient-based solver with the use of mini-batches. We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution. Empirical studies on various machine learning settings demonstrate that the proposed method is efficient, and achieves better performance than the baselines, particularly in improving the performance of poorly-performing tasks and thus alleviating the compromising phenomenon.

1. INTRODUCTION

Meta-learning, also known as "learning to learn", aims to enable models to learn more effectively by leveraging information from many similar learning tasks (Hospedales et al., 2020). In recent years, meta-learning has received much attention for its fast adaptation to new learning scenarios with limited data (Kao et al., 2021; Finn et al., 2017; Snell et al., 2017; Lee et al., 2019; Nichol et al., 2018; Deleu et al., 2022; Rajeswaran et al., 2019; Vilalta & Drissi, 2002). It is usually formulated as a bi-level optimization problem (Franceschi et al., 2018; Hong et al., 2020), which finds task-specific parameters in the inner level and minimizes the average loss over tasks in the outer level. Recently, Wang et al. (2021) reformulate meta-learning as a multi-task learning problem. From this perspective, minimizing the average loss in the outer level using (stochastic) gradient descent may not always be desirable. Specifically, it may suffer from the compromising (or conflicting) phenomenon, in which the converged solution only focuses on minimizing the losses of a small subset of tasks while ignoring the others (Yu et al., 2020; Liu et al., 2021a; Sener & Koltun, 2018). This compromised solution may thus lead to poor performance. To alleviate this problem, we propose reformulating meta-learning as a multi-objective optimization (MOO) problem, in which each task is an objective. The performance of all tasks (objectives) is then considered during optimization (Emmerich & Deutz, 2018). A popular class of MOO solvers is the gradient-based approach (Liu et al., 2021a; Yu et al., 2020; Sener & Koltun, 2018; Navon et al., 2022; Liu et al., 2021b), with prominent examples such as the multiple-gradient descent algorithm (MGDA) (Désidéri, 2012; Sener & Koltun, 2018), PCGrad (Yu et al., 2020), and CAGrad (Liu et al., 2021a).
In each iteration, these methods find a common descent direction among all objective gradients, instead of simply optimizing the average performance over all objectives. Existing gradient-based MOO methods require using gradients from all the objectives. However, when formulating meta-learning as a MOO problem with each task being an objective, computing all these gradients in each iteration can become very expensive, as the number of objectives (i.e., tasks) can be huge. For example, in 5-way 1-shot classification on the miniImageNet data, the total number of meta-training tasks is $\binom{64}{5} \approx 7 \times 10^6$. To address this challenge, we propose a scalable MOO solver by using the improvement function (Miettinen & Mäkelä, 1995; Mäkelä et al., 2016; Montonen et al., 2018) with the help of mini-batches. On the other hand, we show that a trivial extension of existing gradient-based MOO methods with the use of mini-batches does not guarantee Pareto optimality and has poor performance in practice. Our main contributions are as follows: (i) To alleviate the compromising phenomenon, we reformulate meta-learning as a multi-objective optimization problem in which each task is an objective; (ii) To handle the possibly huge number of tasks, we propose a scalable gradient-based solver; (iii) We provide theoretical guarantees on the Pareto optimality or Pareto stationarity of the converged solution; (iv) Empirical studies on few-shot regression, few-shot classification, and reinforcement learning demonstrate that the proposed method achieves better performance, particularly in improving the performance of the poorly-performing tasks and thus alleviating the compromising phenomenon.
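As a toy numerical illustration of the compromising phenomenon discussed above (the two quadratic task losses below are invented for illustration and are not from the paper), a step along the negative gradient of the average loss can decrease the average while still increasing the loss of an individual task whose gradient conflicts with the others:

```python
import numpy as np

# Two hypothetical task losses with conflicting gradients at x0.
def f1(x): return x[0] ** 2
def f2(x): return (x[0] - 2.0) ** 2 + 10.0 * x[1] ** 2

def g1(x): return np.array([2.0 * x[0], 0.0])
def g2(x): return np.array([2.0 * (x[0] - 2.0), 20.0 * x[1]])

x0 = np.array([0.5, 0.0])
avg_grad = 0.5 * (g1(x0) + g2(x0))   # outer-level "average loss" direction
x1 = x0 - 0.1 * avg_grad             # one gradient-descent step

# The average loss decreases, yet task 1's loss increases:
print("f1:", f1(x0), "->", f1(x1))
print("f2:", f2(x0), "->", f2(x1))
```

Here the average-loss step favors task 2 at the expense of task 1, which is exactly the behavior the MOO reformulation is designed to avoid.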

2. BACKGROUND

Multi-Objective Optimization (MOO). In MOO (Marler & Arora, 2004), one aims to minimize (without loss of generality, we consider minimization in this paper) $m \ge 2$ objectives $f_1(x), \ldots, f_m(x)$:

$\min_x \; [f_1(x), \ldots, f_m(x)]. \quad (1)$

Definition 2.1 (Global Pareto optimality) (Miettinen, 2012; Mäkelä et al., 2016). $x^*$ is global Pareto optimal if there does not exist another $x$ such that $f_\tau(x^*) \ge f_\tau(x)$ for all $\tau \in \{1, \ldots, m\}$, and $f_{\tau'}(x^*) > f_{\tau'}(x)$ for at least one $\tau' \in \{1, \ldots, m\}$. The Pareto front (PF) is the set of multi-objective values of all global Pareto-optimal solutions.

Definition 2.2 (Pareto stationarity) (Miettinen, 2012; Désidéri, 2012). $x^*$ is Pareto-stationary if there exist $\{u_\tau\}_{\tau=1}^m$ such that $\|\sum_{\tau=1}^m u_\tau \nabla_x f_\tau(x^*)\| = 0$, $u_\tau \ge 0\ \forall \tau$, and $\sum_{\tau=1}^m u_\tau = 1$.

Note that global Pareto-optimal solutions are also Pareto stationary (Désidéri, 2012). Analogous to the extension from a stationary point to an $\epsilon$-stationary point (Lin et al., 2020), we extend Pareto stationarity to $\epsilon$-Pareto stationarity. Obviously, $0$-Pareto stationarity reduces to Pareto stationarity.

Definition 2.3 ($\epsilon$-Pareto stationarity). For a given $\epsilon$, $x$ is $\epsilon$-Pareto-stationary iff there exist $\{u_\tau\}_{\tau=1}^m$ such that $\|\sum_{\tau=1}^m u_\tau \nabla_x f_\tau(x)\| \le \epsilon$, $u_\tau \ge 0\ \forall \tau$, and $\sum_{\tau=1}^m u_\tau = 1$.

Definition 2.4 (Improvement function) (Montonen et al., 2018). The improvement function of problem (1) is: $H(x, x') = \max_{\tau=1,\ldots,m} \{f_\tau(x) - f_\tau(x')\}$. Note that $x^*$ satisfying $x^* = \arg\min_x H(x, x^*)$ (intuitively, $x^*$ cannot be further improved) is Pareto stationary (Montonen et al., 2018). To find $x^*$, one can perform steepest descent on $H$: $x_{s+1} = x_s + \beta d^*$, where $d^* = \arg\min_d H(x_s + d, x_s) + \frac{\lambda'}{2}\|d\|^2$, $x_s$ is the iterate at iteration $s$, $\beta$ is the learning rate satisfying $H(x_s + \beta d^*, x_s) < H(x_s, x_s)$, and $\lambda'$ is a hyper-parameter. It can be shown that as $s \to \infty$, $x_s$ converges to a Pareto-stationary point (Montonen et al., 2018).

In this paper, we focus on gradient-based MOO methods, including MGDA (Désidéri, 2012; Sener & Koltun, 2018), PCGrad (Yu et al., 2020), and CAGrad (Liu et al., 2021a). They assign a weight to each objective's gradient and find a common descent direction that decreases the losses of all objectives. For example, in each iteration MGDA finds the direction $g^*(x) = \sum_{\tau=1}^m \gamma^*_\tau \nabla_x f_\tau(x)$, where $\{\gamma^*_\tau\}_{\tau=1}^m = \arg\min_{\gamma_1, \ldots, \gamma_m} \|\sum_{\tau=1}^m \gamma_\tau \nabla_x f_\tau(x)\|^2$ subject to $\sum_{\tau=1}^m \gamma_\tau = 1$ and $\gamma_\tau \ge 0\ \forall \tau$.

Meta-Learning. Meta-learning aims to achieve good performance with limited data and computation (Hospedales et al., 2020). Most meta-learning methods are gradient-based (Nichol et al., 2018; Deleu et al., 2022; Rajeswaran et al., 2019; Zhou et al., 2019; Shu et al., 2019) or metric-based (Snell et al., 2017).
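The improvement function of Definition 2.4 can be evaluated directly. In the sketch below (the two toy objectives are hypothetical, chosen only to exercise the definition), $H(x, x) = 0$ for any $x$, and $H(x', x) < 0$ certifies that $x'$ strictly decreases every objective relative to $x$, i.e., it is a common improvement:

```python
import numpy as np

# Hypothetical toy objectives (m = 2) used only to exercise Definition 2.4.
fs = [lambda x: (x[0] - 1.0) ** 2 + x[1] ** 2,
      lambda x: (x[0] + 1.0) ** 2 + x[1] ** 2]

def improvement(x_new, x_old):
    # H(x, x') = max_tau { f_tau(x) - f_tau(x') }
    return max(f(x_new) - f(x_old) for f in fs)

x = np.array([0.0, 1.0])
print(improvement(x, x))            # 0 by construction: no change, no improvement
x_better = np.array([0.0, 0.5])     # moving toward both minima reduces both losses
print(improvement(x_better, x))     # negative: x_better improves every objective
```

Minimizing $H(x_s + d, x_s) + \frac{\lambda'}{2}\|d\|^2$ over $d$ then amounts to seeking the most negative worst-case objective change, regularized so the step stays small.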

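For $m = 2$, the MGDA min-norm subproblem above has a well-known closed form (Sener & Koltun, 2018): with gradients $g_1, g_2$, the optimal weight on $g_1$ is $\gamma^* = \mathrm{clip}\big((g_2 - g_1)^\top g_2 / \|g_1 - g_2\|^2,\, 0,\, 1\big)$. The sketch below (the gradient values are made up for illustration) computes $g^*$; note that $\|g^*\|$ is the smallest gradient-combination norm over the simplex, so $\|g^*\| \le \epsilon$ also certifies $\epsilon$-Pareto stationarity in the sense of Definition 2.3:

```python
import numpy as np

def mgda_two_task_direction(g1, g2):
    """Min-norm point in the convex hull of {g1, g2} (closed form for m = 2)."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        gamma = 0.5  # identical gradients: any convex weight gives the same point
    else:
        gamma = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2, gamma

# Made-up conflicting task gradients for illustration.
g1 = np.array([1.0, 0.0])
g2 = np.array([-1.0, 1.0])
g_star, gamma = mgda_two_task_direction(g1, g2)

# -g_star is a common descent direction: it has positive inner product
# with neither gradient reversed, i.e. g_star @ g1 > 0 and g_star @ g2 > 0.
print(gamma, g_star, np.linalg.norm(g_star))
```

Stepping along $-g^*$ decreases both objectives simultaneously, which is exactly the common descent direction the text describes; for general $m$, the same subproblem is a quadratic program over the simplex.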
