PARETO MANIFOLD LEARNING: TACKLING MULTIPLE TASKS VIA ENSEMBLES OF SINGLE-TASK MODELS

Abstract

In Multi-Task Learning, tasks may compete and limit each other's performance rather than guide the optimization trajectory to a common solution superior to its single-task counterparts. Often there is no single solution that is optimal for all tasks, leading practitioners to balance tradeoffs between tasks' performance and to resort to optimality in the Pareto sense. Current Multi-Task Learning methodologies either completely neglect this aspect of functional diversity, producing one solution in the Pareto Front predefined by their optimization schemes, or produce diverse but discrete solutions, each requiring a separate training run. In this paper, we conjecture that there exist Pareto Subspaces, i.e., weight subspaces where multiple optimal functional solutions lie. We propose Pareto Manifold Learning, an ensembling method in weight space that is able to discover such a parameterization and produces a continuous Pareto Front in a single training run, allowing practitioners to modulate the performance on each task during inference on the fly. We validate the proposed method on a diverse set of multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, and show that Pareto Manifold Learning outperforms baselines.

By operating in the convex hull of the members' weight space, each single-task model infuses and benefits from representational knowledge to and from the other members. During training, the losses are weighted in tandem with the interpolation, i.e., a monotonic relationship is imposed between the degree of a single-task predictor's participation and the weight of the corresponding task loss.
Consequently, the ensemble as a whole engenders a (weight) subspace that explicitly encodes tradeoffs and results in a continuous parameterization of the Pareto Front. We identify challenges in guiding the ensemble to such subspaces, designated Pareto Subspaces, and propose solutions: balancing the loss contributions, regularizing the Pareto properties of the subspaces, and adapting the interpolation sampling distribution. Experimental results validate that the proposed method is able to discover Pareto Subspaces and outperforms baselines on multiple benchmarks. Our training scheme offers two main advantages. First, enforcing low loss for all tasks on a linear subspace implicitly penalizes curvature, which has been linked to generalization (Chaudhari et al., 2017), benefiting all tasks' performance. Second, the algorithm produces a subspace of Pareto Optimal solutions, rather than a single model, enabling practitioners to handpick at inference the solution that offers the tradeoff best suiting their needs.
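The training scheme above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the two quadratic task losses, the member names, and the uniform sampling of the interpolation coefficient are all assumptions made for the example. It shows the two core ingredients: sampling a point in the convex hull of the members' weights, and weighting each task loss in tandem with the interpolation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic task losses; each task's optimum is a distinct target.
def task_loss(theta, target):
    return float(np.sum((theta - target) ** 2))

target_a = np.array([1.0, 0.0])   # optimum of task A
target_b = np.array([0.0, 1.0])   # optimum of task B

theta_a = rng.normal(size=2)      # member intended to specialize in task A
theta_b = rng.normal(size=2)      # member intended to specialize in task B

lr = 0.05
for _ in range(2000):
    alpha = rng.uniform()                              # interpolation coefficient
    theta = alpha * theta_a + (1 - alpha) * theta_b    # point in the convex hull
    # Losses weighted "in tandem" with the interpolation: the more a member
    # participates, the larger the weight of its task loss.
    # Gradient of alpha*L_A(theta) + (1-alpha)*L_B(theta) w.r.t. theta:
    grad = 2 * (alpha * (theta - target_a) + (1 - alpha) * (theta - target_b))
    # Backpropagate through the interpolation to both members.
    theta_a -= lr * alpha * grad
    theta_b -= lr * (1 - alpha) * grad

# At inference, sweeping alpha traces a continuous curve of tradeoff solutions.
front = [alpha * theta_a + (1 - alpha) * theta_b for alpha in np.linspace(0, 1, 5)]
```

In this toy setting the endpoint `theta_a` drifts toward the task-A optimum and `theta_b` toward the task-B optimum, so interpolating between them parameterizes the segment of tradeoff solutions between the two optima.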

1. INTRODUCTION

In Multi-Task Learning (MTL), multiple tasks are learned concurrently within a single model, striving to infuse inductive biases that help outperform the single-task baselines. Apart from the promise of superior performance and some theoretical benefits (Ruder, 2017), such as generalization properties of the learned representation, modeling multiple tasks jointly has practical benefits as well, e.g., lower inference times and memory requirements. However, building machine learning models presents a multifaceted host of decisions involving multiple and often competing objectives, such as model complexity, runtime, and generalization. Conflicts arise since optimizing for one metric often leads to the deterioration of others. A single solution that optimally satisfies all objectives rarely exists, and practitioners must balance the inherent trade-offs.

In contrast to single-task learning, where one metric governs the comparison between methods (e.g., top-1 accuracy on ImageNet), multiple models can be optimal in Multi-Task Learning; e.g., model X yields superior performance on task A compared to model Y, but the reverse holds true for task B;
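The notion that several models can simultaneously be optimal can be made concrete with the standard definition of Pareto dominance over per-task error vectors. A minimal sketch, with illustrative model names and error values (not taken from the paper):

```python
def dominates(a, b):
    """True if error vector `a` Pareto-dominates `b` (lower is better):
    no worse on every task and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Per-task errors (task A, task B) for hypothetical models X, Y, Z:
# X and Y trade off against each other, so both are Pareto optimal;
# Z is dominated by X (worse on both tasks).
errors = {"X": (0.1, 0.4), "Y": (0.3, 0.2), "Z": (0.2, 0.5)}
front = pareto_front(list(errors.values()))
```

Here `front` contains the error vectors of X and Y but not Z, mirroring the situation in the text: no total order exists between X and Y, and any choice between them is a tradeoff.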



Figure 1: Illustrative example following Yu et al. (2020); Navon et al. (2022). We present the optimization trajectories in loss space starting from different initializations (black bullets) and leading to final points (crosses). Color reflects the iteration at which the corresponding value is achieved. To highlight that our method (PML) operates on pairs of models, we use blue and red to differentiate them. Dashed lines show intermediate results of the discovered subspace. While baselines may not reach the Pareto Front or may be biased towards specific solutions, PML discovers the entire Pareto Front in a single run and shows superior functional diversity.


(weight-space continuous) Pareto Fronts. As a consequence, the memory requirements grow linearly


In this paper, we conjecture that we can actually produce a subspace with multiple Pareto stationary 

