PARETO MANIFOLD LEARNING: TACKLING MULTIPLE TASKS VIA ENSEMBLES OF SINGLE-TASK MODELS

Abstract

In Multi-Task Learning, tasks may compete and limit the performance achieved on each other rather than guiding the optimization trajectory to a common solution, superior to its single-task counterparts. There is often not a single solution that is optimal for all tasks, leading practitioners to balance tradeoffs between tasks' performance and to resort to optimality in the Pareto sense. Current Multi-Task Learning methodologies either completely neglect this aspect of functional diversity, and produce one solution in the Pareto Front predefined by their optimization schemes, or produce diverse but discrete solutions, each requiring a separate training run. In this paper, we conjecture that there exist Pareto Subspaces, i.e., weight subspaces where multiple optimal functional solutions lie. We propose Pareto Manifold Learning, an ensembling method in weight space that is able to discover such a parameterization and produces a continuous Pareto Front in a single training run, allowing practitioners to modulate the performance on each task during inference on the fly. We validate the proposed method on a diverse set of multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, and show that Pareto Manifold Learning outperforms state-of-the-art algorithms.

1. INTRODUCTION

In Multi-Task Learning (MTL), multiple tasks are learned concurrently within a single model, striving to infuse inductive bias that will help outperform the single-task baselines. Apart from the promise of superior performance and some theoretical benefits (Ruder, 2017), such as generalization properties for the learned representation, modeling multiple tasks jointly has practical benefits as well, e.g., lower inference times and memory requirements. However, building machine learning models presents a multifaceted host of decisions for multiple and often competing objectives, such as model complexity, runtime and generalization. Conflicts arise since optimizing for one metric often leads to the deterioration of others. A single solution optimally satisfying all objectives rarely exists, and practitioners must balance the inherent trade-offs. Contrary to single-task learning, where one metric governs the comparison between methods (e.g., top-1 accuracy on ImageNet), multiple models can be optimal in Multi-Task Learning; e.g., model X yields superior performance on task A compared to model Y, but the reverse holds true for task B; thus, there is not a single better model among the two. This notion of tradeoffs is formally defined as Pareto optimality. Intuitively, improvement on an individual task's performance can come only at the expense of another task. However, there exists no framework addressing the need for efficient construction of the Pareto Front, i.e., the set of all Pareto optimal solutions. Recent methods in Multi-Task Learning cast the problem through the lens of multi-objective optimization and introduced the concept of Pareto optimality, resulting in different mechanisms for computing the descent direction for the shared parameters. Specifically, Sener & Koltun (2018) produce a single solution that lies on the Pareto Front.
As an optimization scheme, however, it is biased towards the task with the smallest gradient magnitude, as argued in Liu et al. (2020). Lin et al. (2019) expand this idea and, by imposing additional constraints in objective space, produce multiple solutions on the Pareto Front, each corresponding to a different user-specified tradeoff. Finally, the work by Ma et al. (2020) proposes an orthogonal approach that can be applied after training: it starts with a discrete solution set and produces a continuous set (in weight space) around each solution, while the overall Pareto Front is continuous only in objective space as the union of the local (weight-space continuous) Pareto Fronts. As a consequence, the memory requirements grow linearly with the number of models stored. Navon et al. (2021); Lin et al. (2021) use hypernetworks to produce a Pareto Front in a single training run, but this approach has limited scalability and introduces additional design choices.

Figure 1: Illustrative example following Yu et al. (2020); Navon et al. (2022). We present the optimization trajectories in loss space starting from different initializations (black bullets) leading to final points (crosses). Color reflects the iteration number when the corresponding value is achieved. To highlight that our method (PML) deals in pairs of models, we use blue and red to differentiate them. Dashed lines show intermediate results of the discovered subspace. While baselines may not reach the Pareto Front or display bias towards specific solutions, PML discovers the entire Pareto Front in a single run and shows superior functional diversity.
In this paper, we conjecture that we can produce a subspace with multiple Pareto stationary points in the Multi-Task Learning setting, under the hypothesis that local optima (produced by different runs or by sharing training steps) can be found in close proximity and are connected by simple paths. This is motivated by recent advancements in single-task machine learning that have explored the geometry of the loss landscape and shown experimentally that local optima are connected by simple paths, even linear ones in some cases (Wortsman et al., 2021; Garipov et al., 2018; Frankle et al., 2020; Draxler et al., 2018). When the problem has multiple objectives, it acquires a new dimension relating to the number of tasks: there are multiple loss landscapes, and a solution that satisfies users' performance requirements must lie in the intersection of low-loss valleys for all tasks. Building upon our conjecture, we develop a novel method, Pareto Manifold Learning, which casts Multi-Task problems as learning an ensemble of single-task predictors by interpolating among (ensemble) members during training. By operating in the convex hull of the members' weight space, each single-task model infuses and benefits from representational knowledge to and from the other members. During training, the losses are weighted in tandem with the interpolation, i.e., a monotonic relationship is imposed between the degree of a single-task predictor's participation and the weight of the corresponding task loss. Consequently, the ensemble as a whole engenders a (weight) subspace that explicitly encodes tradeoffs and results in a continuous parameterization of the Pareto Front. We identify challenges in guiding the ensemble to such subspaces, designated Pareto subspaces, and propose solutions regarding balancing the loss contributions, regularizing the Pareto properties of the subspaces, and adapting the interpolation sampling distribution.
Experimental results validate that the proposed method is able to discover Pareto Subspaces, and outperforms baselines on multiple benchmarks. Our training scheme offers two main advantages. First, enforcing low loss for all tasks on a linear subspace implicitly penalizes curvature, which has been linked to generalization (Chaudhari et al., 2017) , benefitting all tasks' performance. Second, the algorithm produces a subspace of Pareto Optimal solutions, rather than a single model, enabling practitioners to handpick during inference the solution that offers the tradeoff that best suits their needs.

2. RELATED WORK

Multi-Task Learning. Learning multiple tasks in the Deep Learning setting (Ruder, 2017; Crawshaw, 2020) is usually approached by architectural methodologies (Misra et al., 2016; Ruder et al., 2019), where architectural modules are combined in several layers to govern the joint representation learning, or by optimization approaches (Cipolla et al., 2018; Chen et al., 2018), where the architecture is standardized to an encoder with task-specific decoders, for learning the joint and task-specific representations, respectively, and the focus shifts to the descent direction for the shared parameters. We focus on the more general track of optimization methodologies, fixing the architectural structure to Shared-Bottom (Caruana, 1997). The various approaches focus on finding a suitable descent direction for the shared parameters and can be broadly categorized into loss-balancing and gradient-balancing (Liu et al., 2020). For the former, the goal is to compute an appropriate weighting scheme for the losses, e.g., the losses can be weighted via task-dependent homoscedastic uncertainty (Cipolla et al., 2018) or by enforcing task gradient magnitudes to have close norms (Chen et al., 2018). The latter class of methodologies manipulates the gradients so that they satisfy certain conditions: projecting the gradient of a (random) task on the normal plane of another so that gradient conflict is avoided (Yu et al., 2020), enforcing the common descent direction to have equal projections on all task gradients (Liu et al., 2020), or casting the gradient combination as a bargaining game (Navon et al., 2022). Multi-Task Learning for Pareto Optimality.
The authors in (Sener & Koltun, 2018) were the first to view the search for a common descent direction under the Pareto optimality prism and to employ the Multiple Gradient Descent Algorithm (MGDA) (Désidéri, 2012) in the Deep Learning context. However, MGDA does not account for task preferences, and the solutions yielded from various initializations in a synthetic example resulted in similar points on the Pareto Front (Lin et al., 2019). Lin et al. (2019) solve a slightly different formulation of the multi-objective problem and are thereby able to systematically introduce task trade-offs and produce a discrete Pareto Front. However, this approach requires as many training runs as the stated preference combinations, and the optimization process for each training step of each run introduces a non-negligible overhead. The work in (Ma et al., 2020) proposes an orthogonal approach for Pareto stationary points: after a model is fitted with any Multi-Task Learning method and has converged to a point (seed) in parameter space, a separate phase seeks other Pareto stationary points in the vicinity of the seed. The convex hull of these points is guaranteed to lie in the Pareto Front. But training still needs to occur for every seed point, the separate phase's overhead grows linearly with the number of additional models, and the Pareto Front is not continuous across seed points in parameter space. Navon et al. (2021) and Lin et al. (2021) employ hypernetworks to continuously approximate the Pareto Front in a single run, which introduces additional design choices. Ruchte & Grabocka (2021) address the scalability issues of hypernetworks by augmenting the feature space with the preference vector. Raychaudhuri et al. (2022) employ a second hypernetwork to also modulate the architecture of the target network. Ensemble Learning and Mode Connectivity.
Apart from Multi-Task Learning, our algorithm is methodologically tied to prior work on the geometry of neural network optimization landscapes. The authors in (Garipov et al., 2018; Draxler et al., 2018) independently and concurrently showed that for two local optima θ*_1, θ*_2 produced by separate training runs there exist nonlinear paths, termed connectors by Wortsman et al. (2021), along which the loss remains low. The connectivity paths can even be linear when the training runs share part of the optimization trajectory (Frankle et al., 2020). These findings can be leveraged to train a neural network subspace by enforcing linear connectivity among the subspace endpoints (Wortsman et al., 2021). Appendix J discusses more related work regarding ensemble learning and flat minima.
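The loss-barrier probe underlying these connectivity results can be sketched in a few lines. The following is an illustrative stand-in (parameters as flat Python lists, a generic loss function), not the instrumentation used in the cited works: a barrier near zero along the segment indicates linear mode connectivity.

```python
def interpolate(theta_a, theta_b, alpha):
    """Convex combination alpha * theta_a + (1 - alpha) * theta_b, elementwise."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Max excess of the loss along the linear path over the linear
    interpolation of the endpoint losses; ~0 means the two optima are
    linearly mode-connected."""
    la, lb = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(num_points):
        t = i / (num_points - 1)
        mid = loss_fn(interpolate(theta_a, theta_b, 1 - t))  # (1 - t) * a + t * b
        barrier = max(barrier, mid - ((1 - t) * la + t * lb))
    return barrier
```

For a convex loss the barrier is exactly zero; for independently trained deep networks it is typically positive unless a connector is found.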

3. PROBLEM FORMULATION

Notation. We use bold font for vectors x, capital bold for matrices X and regular font for scalars x. T is the number of tasks and m is the number of ensemble members. Each task t ∈ [T] has a loss L_t. The overall multi-task loss is L = [L_1, ..., L_T]^T. w ∈ Δ_T ⊂ R^T is the weighting scheme for the tasks, i.e., the overall scalarized loss is w^T L = Σ_{t=1}^T w_t L_t. Each member k ∈ [m] is associated with parameters θ_k ∈ R^N and weighting w_k ∈ Δ_T.

Preliminaries. Our goal lies in solving an unconstrained vector optimization problem of minimizing L(y, ŷ) = [L_1(y_1, ŷ_1), ..., L_T(y_T, ŷ_T)]^T, where L_i corresponds to the objective function for the i-th task, e.g., cross-entropy loss in the case of classification. Constructing an optimal solution for all tasks is often unattainable due to competing objectives. Hence, an alternative notion of optimality is used, as described in Definition 1.

Definition 1 (Pareto Optimality). A point x dominates a point y if L_t(x) ≤ L_t(y) for all tasks t ∈ [T] and L(x) ≠ L(y). Then, a point x is called Pareto optimal if there exists no point y that dominates it. The set of Pareto optimal points forms the Pareto Front P_L.

Figure 2: Weight interpolation for m = 3 ensemble members with single-task weightings w_1 = [1, 0, 0]^T, w_2 = [0, 1, 0]^T, w_3 = [0, 0, 1]^T. For α = [0.6, 0.2, 0.2]^T, the interpolated network is θ = α^T Θ = 0.6·θ_1 + 0.2·θ_2 + 0.2·θ_3.

The vector loss function is scalarized by the vector w ∈ [0, 1]^T to form the overall objective w^T L. Without loss of generality, we assume that w lies in the T-dimensional simplex Δ_T by imposing the constraint ‖w‖_1 = Σ_{t=1}^T w_t = 1. This formulation permits thinking of the vector of weights as an encoding of task preferences, e.g., for two tasks letting w = [0.8, 0.2]^T results in attaching more importance to the first task.
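Definition 1 translates directly into code. The following sketch (loss vectors as plain tuples, exhaustive pairwise comparison) illustrates dominance and Pareto Front extraction for a finite set of candidate solutions:

```python
def dominates(lx, ly):
    """True if loss vector lx Pareto-dominates ly: no worse on every task
    and different overall (Definition 1, for minimization)."""
    return all(a <= b for a, b in zip(lx, ly)) and any(a < b for a, b in zip(lx, ly))

def pareto_front(points):
    """Return the subset of loss vectors not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For continuous problems the Pareto Front is of course not enumerable; this brute-force check is only meaningful for a finite set of evaluated solutions.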
Overall, the Multi-Task Learning problem can be formulated within the Empirical Risk Minimization (ERM) framework for a preference vector w and dataset D = {(x_i, y_i)}_{i=1}^n as:

min_θ E_{(x,y)∼D} [w^T L(y, f(x; θ))]

Our overall goal is to discover a low-dimensional parameterization in weight space that yields a (continuous) Pareto Front in functional space. This desideratum leads us to the following definition:

Definition 2 (Pareto Subspace). Let T be the number of tasks, X the input space, Y the multi-task output space, R ⊂ R^N the parameter space, f : X × R → Y the function implemented by a neural network, and L : Y × Y → R^T_{>0} the vector loss. Let {θ_t ∈ R :

4. METHOD

We seek to find a collection of m neural network models, of identical architecture, whose linear combination in weight space forms a continuous Pareto Front in objective space. Model i corresponds to a tuple of network parameters θ i and task weighting w i and implements the function f (•; θ i ). We impose connectivity among models by modeling the subspace in the convex hull of the ensemble members. Section 4.1 presents the core of the algorithm, and in Section 4.2 we discuss various improvements that address Multi-Task Learning challenges.

4.1. PARETO MANIFOLD LEARNING

Let Θ = [θ_1, θ_2, ..., θ_m]^T be an m × N matrix storing the parameters of all models, and W = [w_1, ..., w_m]^T an m × T matrix storing the task weightings of the ensemble members. By designing the subspace as a simplex, the objective becomes:

E_{(x,y)∼D} E_{α∼P} [(αW) L(y, f(x; αΘ))]     (2)

where P is the sampling distribution placed upon the simplex. In the case where the ensemble size matches the number of tasks (m = T) and W is the identity, using the same weighting α for both the losses and the ensemble interpolation explicitly associates models and task losses with a one-to-one correspondence, infusing preference towards one task rather than the other and guiding the learning trajectory to a subspace that encodes such tradeoffs. Algorithm 1 presents the full training procedure for this ensemble of neural networks, containing modifications discussed in subsequent sections; its core loop for one batch (x, y) is:

    V ← ∅
    for i ∈ {1, 2, ..., W} do
        sample α_i ∼ Dir(p)
        V ← V ∪ {α_i}
        θ_i ← α_i Θ    // construct network in convex hull of ensemble members
        L(α_i) = [L_1(α_i), ..., L_T(α_i)] ← criterion(f(x; θ_i), y)    // compute task losses

Figure 1 showcases the algorithm in a toy example. Concretely, at each training step a random α is sampled and the corresponding convex combination of the networks is constructed. This procedure is shown in Figure 2. Note that we have chosen a convex hull parameterization of the weight space, but there are other options, such as Bezier curves or other nonlinear paths (Wortsman et al., 2021; Draxler et al., 2018). However, the universal approximation theorem implies no loss of generality for our design choice.

Loss and gradient balancing schemes. A common challenge in Multi-Task Learning is the case where tasks have different loss scales, e.g., consider datasets with regression and classification tasks such as UTKFace. Then, using the same weighting α for both the losses and the weight ensembling, as presented in Equation 2, the easiest tasks are favored and the important property of scale invariance is neglected. To prevent this, the loss weighting needs to be adjusted.
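For intuition, a heavily simplified single step of this training scheme can be sketched as follows. The sketch assumes m = T single-task members, represents parameters as flat lists, uses toy quadratic stand-ins for the task losses, and replaces the log-sum-exp regularizer of Section 4.2 with a plain hinge penalty on ordering violations; it is illustrative, not our implementation:

```python
import random

def sample_dirichlet(p, dim):
    """Sample from a symmetric Dirichlet(p) via normalized Gamma draws."""
    g = [random.gammavariate(p, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def combine(alpha, members):
    """theta = alpha^T Theta: convex combination of member parameter vectors."""
    n = len(members[0])
    return [sum(a * m[i] for a, m in zip(alpha, members)) for i in range(n)]

def pml_step(members, task_losses, W=3, lam=5.0):
    """One multi-forward step: sample W interpolation weights, evaluate every
    task loss at each interpolated model, scalarize each vector loss with the
    same weights, and add a hinge penalty on monotonicity violations
    (more weight on a task should not yield a higher loss on it).
    Assumes len(members) == len(task_losses)."""
    T, m = len(task_losses), len(members)
    alphas = [sample_dirichlet(1.0, m) for _ in range(W)]
    losses = [[lt(combine(a, members)) for lt in task_losses] for a in alphas]
    total = sum(sum(at * lt for at, lt in zip(a, lvec))
                for a, lvec in zip(alphas, losses))
    penalty = 0.0
    for t in range(T):
        for i in range(W):
            for j in range(W):
                if alphas[i][t] > alphas[j][t]:  # i favors task t more than j
                    penalty += max(losses[i][t] - losses[j][t], 0.0)
    return total + lam * penalty
```

In practice the returned scalar would be backpropagated through the interpolated parameters to update all members jointly.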
Hence, we propose two simple balancing schemes, one for losses and one for gradients, whose effect is to warp the space of loss weightings. While gradient balancing is applied on the shared parameters, loss balancing also affects the task-specific decoders, rendering the two methodologies complementary. To avoid clutter, balancing schemes are not presented in Algorithm 1. For loss balancing, we use a lightweight scheme that adds to each loss term a normalization coefficient depending on past values. Concretely, let W ∈ Z_+ be a positive integer and L_t(τ_0) be the loss of task t at step τ_0. Then, the normalization coefficient is L̄_t(τ_0; W) = (1/W) Σ_{τ=1}^{W} L_t(τ_0 + 1 − τ) for τ_0 ≥ W, resulting in the overall loss L_total = Σ_{t=1}^{T} α_t L_t(τ_0) / L̄_t(τ_0; W). For gradient balancing, let g_t be the gradient of task t ∈ [T] w.r.t. the shared parameters. Previously, the update occurred with the overall gradient g_total = α^T G = α^T [g_1, ..., g_T]^T. We impose a unit ℓ_2-norm for the gradients and perform the update with g_total = α^T Ĝ = α^T [ĝ_1, ..., ĝ_T]^T, where ĝ_t = g_t / ‖g_t‖_2.

Improving stability by multi-forward batch regularization. Consider two different weightings α_1, α_2 ∈ Δ_T and, without loss of generality, let [α_1]_1 > [α_2]_1. Then, ideally, the interpolated model closer to the ensemble member for task 1 has the lowest loss on that task, i.e., we would want the ordering L_1(α_1) < L_1(α_2), and equivalently for the other tasks. Furthermore, if α = [1 − ε, ε/(T − 1), ..., ε/(T − 1)] for small ε > 0, only one member essentially reaps the benefits of the gradient update and moves the ensemble towards weight configurations more suitable for one task but, perhaps, deleterious for the remaining ones. Thus, we propose repeating the forward pass W times for different random weightings {α_i}_{i∈[W]}, allowing the advancement of all ensemble members concurrently in a coordinated way. By performing multiple forward passes for various weightings, we achieve a lower-discrepancy sequence and reduce the variance of such pernicious updates. We also include a regularization term, which penalizes wrong orderings and encourages the subspace to have Pareto properties. Let V be the set of interpolation weights sampled in the current batch, V = {α_w = (α_{w,1}, ..., α_{w,T}) ∈ Δ_T}_{w∈[W]}. Then each task defines the directed graph G_t = (V, E_t) with E_t = {(α_i, α_j) ∈ V × V : α_{i,t} < α_{j,t}}. The overall loss becomes:

L_total = Σ_{i=1}^{W} α_i^T L(α_i) + λ Σ_{t=1}^{T} log( (1/|E_t|) Σ_{(α_i, α_j) ∈ E_t} exp([L_t(α_j) − L_t(α_i)]_+) )

The current formulation of the edge set penalizes heavily the connections from vertices with low values. For this reason, we only keep one outgoing edge per node, defined by the task lexicographic order, resulting in the graph G_t^LEX = (V, E_t^LEX) with |E_t^LEX| = W − 1 for all t ∈ [T]. Note that the regularization term is convex as a sum of log-sum-exp terms. If no violations occur, the regularization term is zero.

Figure 3: The red points are not optimal and, therefore, the regularization term penalizes the violations of the monotonicity constraints for the appropriate task loss: α_2 and α_4 violate the L_1 and L_2 orderings w.r.t. α_3, since α_{2,1} > α_{3,1} but L_1(α_2) > L_1(α_3), and α_{4,2} > α_{3,2} but L_2(α_4) > L_2(α_3).

5. EXPERIMENTS

We experiment on numerous two-task datasets where plots convey the results succinctly, and present qualitative results on three-task datasets. The source code will be released after the review process.

Baselines. We explore various algorithms from the literature: 1. Single-Task Learning (STL),

5.1. EXPERIMENTS ON DATASETS WITH TWO CLASSIFICATION TASKS

In this section, we focus on datasets with two tasks, both classification. This setting allows for rich visualizations that we use to draw insights on the inner workings of the algorithms. MultiMNIST. We investigate the effectiveness of Pareto Manifold Learning on digit classification using a LeNet model with a shared-bottom architecture. The ensemble consists of two members with single-task weightings. To gauge the performance of the models lying in the linear segment between the nodes, we test the performance on the validation set for the ensemble members as well as for 9 models uniformly distributed across the edge, resulting in 11 models in total. We use this evaluation/plotting scheme throughout the experiments. We ablate the effect of multi-forward training in Appendix D; we use a grid search on window W ∈ {2, 3, 4, 5} and strength λ ∈ {0, 2, 5, 10} along with the base case of (W, λ) = (1, 0) and present in the main text the setting that achieves the best results. In contrast to the baselines, all Pareto Manifold Learning seeds find subspaces with diverse functional solutions. This is quantitatively reflected in higher HyperVolume compared to the baselines, shown in Table 4 of the appendix, and can be attributed to the observation that Equation 2 generalizes the Linear Scalarization method. Census. We explore the method on the tabular dataset Census (Kohavi, 1996) using a Multi-Layer Perceptron. We focus on the task combination of predicting age and education level, similar to Ma et al. (2020). We perform the same ablation study as before and present the results in Figure 4 for the best setting (W = 3 and λ = 10). In the case of MultiMNIST, there exists symmetry between the tasks: both digits are drawn from the same distribution and placed in the pixel grid in a symmetric way, resulting in equal-pace learning. In the case of Census, however, the tasks differ in statistics and, yet, the proposed method recovers a Pareto subspace with diverse solutions.
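For two tasks, the HyperVolume indicator reduces to a simple sweep over the nondominated points. A minimal sketch for losses under minimization (the reference point and inputs below are illustrative, not the values used in our tables):

```python
def hypervolume_2d(points, ref):
    """Area dominated by a set of 2D loss vectors relative to a reference
    point ref, assuming minimization on both tasks."""
    # keep only points that strictly improve on the reference point
    pts = sorted({(x, y) for x, y in points if x < ref[0] and y < ref[1]})
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:  # sweep left to right, adding vertical strips
        if y < prev_y:  # dominated points contribute nothing
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

A larger HyperVolume means the evaluated solution set covers more of the trade-off region below the reference point.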

5.2. BEYOND PAIRS OF CLASSIFICATION TASKS: MULTIMNIST-3 AND UTKFACE

We expand the experimental validation to triplets of tasks, consider regression, and adopt more complex architectures, graduating from MLPs and CNNs to ResNets (He et al., 2016). For three tasks, we create a 2D grid of equidistant points spanning the three single-task predictors. If n is the number of interpolated points between two (out of three) members, the grid has C(n+1, 2) = n(n+1)/2 points. We use n = 11, resulting in 66 points. For visual purposes, neighboring points are connected. For three tasks, presenting the discovered subspaces for multiple seeds and baselines would be visually cluttered; hence, we opt for a more qualitative discussion in this section and present quantitative findings in the appendix. MultiMNIST-3. First, we construct an equivalent of MultiMNIST for 3 tasks. Digits are placed top-left, top-right and bottom-centre. Figure 5a shows the results on MultiMNIST-3. As argued previously, MNIST variants are characterized by task symmetry and Figure 5a reflects this. For this reason, we do not employ any balancing scheme. The 3D plot, in conjunction with the simplices, reveals that the method has the effect of gradually transferring learned representations from one member to the other, and offers a succinct visual confirmation of Claim 3. UTKFace. The UTKFace dataset (Zhang et al., 2017) has more than 20,000 face images and three tasks: predicting age (modeled as regression using the Huber loss, similar to (Ma et al., 2020)), classifying gender and classifying ethnicity. The introduction of a regression task implies that losses have vastly different scales, which dictates the use of balancing schemes, as discussed in Section 4.2. We apply the proposed gradient-balancing scheme and present the results in Figure 5b. For visual unity and to remain in the theme of "higher is better", the negative Huber loss is plotted. Despite the increased complexity and the existence of a regression task, the proposed method discovers a Pareto Subspace.
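One way to generate such a grid is to enumerate barycentric weights on a fixed denominator; with 11 points per edge (a step of 1/10, an assumption consistent with the 66-point count above), the enumeration yields exactly 66 interpolation weights. A sketch:

```python
def simplex_grid(resolution=10):
    """All barycentric weights (i, j, k) / resolution with i + j + k ==
    resolution; each point mixes the three single-task members."""
    pts = []
    for i in range(resolution + 1):
        for j in range(resolution + 1 - i):
            k = resolution - i - j
            pts.append((i / resolution, j / resolution, k / resolution))
    return pts
```

Each returned tuple plays the role of α in Equation 2: it defines one interpolated model whose per-task metrics populate the triangular plots.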
Additional experiments and qualitative results are provided in Appendix G.

5.3. SCENE UNDERSTANDING

We also explore the applicability of Pareto Manifold Learning on CityScapes (Cordts et al., 2016), a scene understanding dataset containing high-resolution images of urban street scenes. On Depth Estimation, only MGDA is narrowly better than our method, while our performance is superior to the other algorithms. In Semantic Segmentation, our method outperforms MGDA, but is worse than other baselines. Overall, no multi-task method dominates Pareto Manifold Learning. It is remarkable that, despite our goal of discovering Pareto subspaces, the proposed method is on par in performance on Semantic Segmentation with the state-of-the-art algorithms, and better than the vast majority on Depth Estimation.

6. CONCLUSION

In this paper, we proposed a weight-ensembling method tailored to Multi-Task Learning; multiple single-task predictors are trained in conjunction to produce a subspace formed by their convex hull, endowed with desirable Pareto properties. We experimentally show on a diverse suite of benchmarks that the proposed method is successful in discovering Pareto subspaces and outperforms some state-of-the-art MTL methods. An interesting future direction is to perform a hierarchical weight ensembling, sharing progressively more of the lower layers, given that the features learned at low depth are similar across tasks. An alternative exploration venue is to connect our method to the challenge of task affinity (Fifty et al., 2021; Standley et al., 2020) via a geometrical lens of the loss landscape.

B EXPERIMENTAL DETAILS

MultiMNIST. MultiMNIST is a synthetic dataset derived from the samples of MNIST. Since there is no publicly available version, we create our own by the following procedure. For each MultiMNIST image, we sample (with replacement) two MNIST images (of size 28×28) and place them top-left and bottom-right on a 36×36 grid. This grid is then resized to 28×28 pixels. The procedure is repeated 60,000, 10,000 and 10,000 times for the training, validation and test sets, respectively. We use a LeNet shared-bottom architecture. Specifically, the encoder has two convolutional layers with 10 and 20 channels and a kernel size of 5, each followed by max-pooling and a ReLU nonlinearity. The final layer of the encoder is fully connected, producing an embedding with 50 features. The decoders are fully connected with two layers: one with 50 features, and the output layer with 10. We use the Adam optimizer (Kingma & Ba, 2015) with learning rate 10^-3, no scheduler, and a batch size of 256. Training lasts 10 epochs.
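The construction can be sketched as follows; the overlap rule (pixel-wise max) and the nearest-neighbour resize are assumptions for illustration, as any standard resize filter would serve:

```python
def make_multimnist_pair(img_a, img_b, canvas=36, out=28):
    """Overlay two 28x28 digits (lists of lists, values in [0, 1]): img_a at
    the top-left and img_b at the bottom-right of a canvas x canvas grid,
    merged with a pixel-wise max, then downscaled to out x out."""
    grid = [[0.0] * canvas for _ in range(canvas)]
    off = canvas - 28  # bottom-right offset (8 for a 36-pixel canvas)
    for r in range(28):
        for c in range(28):
            grid[r][c] = max(grid[r][c], img_a[r][c])
            grid[r + off][c + off] = max(grid[r + off][c + off], img_b[r][c])
    scale = canvas / out
    return [[grid[int(r * scale)][int(c * scale)] for c in range(out)]
            for r in range(out)]
```

The two source labels become the two task targets for the resulting image.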

Census

The original version of the Census (Kohavi, 1996) dataset has one task: predicting whether a person's income exceeds $50,000. The dataset becomes suitable for Multi-Task Learning by turning one or several features into tasks (Lin et al., 2019). We focus on the task combination of predicting age and education level, similar to Ma et al. (2020). The model has a Multi-Layer Perceptron shared-bottom architecture. The encoder has one layer with 256 neurons, followed by a ReLU nonlinearity, and there are two decoders with 2 output neurons each (since the tasks are binary classification). Training lasts 10 epochs. We use the Adam optimizer with a learning rate of 10^-3.
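A minimal forward pass of this shared-bottom architecture, written with plain Python lists for illustration (the He-style initialization and the 4-feature input below are assumptions, not the preprocessing used in the experiments):

```python
import random

def init_layer(n_in, n_out, rng):
    """Random Gaussian weights with He-style scale (an assumption) and zero bias."""
    w = [[rng.gauss(0.0, (2.0 / n_in) ** 0.5) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

def linear(x, layer):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def shared_bottom_forward(x, encoder, heads):
    """Shared-bottom forward pass: one shared ReLU encoder layer, then one
    linear decoder head of 2 logits per binary task."""
    h = [max(0.0, v) for v in linear(x, encoder)]
    return [linear(h, head) for head in heads]
```

In PML, two such networks with identical shapes form the ensemble, and the forward pass is run on their interpolated parameters.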

C DETAILS OF THE ILLUSTRATIVE EXAMPLE

The details of the illustrative example are provided in this section. We use the configuration presented by Navon et al. (2022), which was introduced with slight modifications by Liu et al. (2021) and Yu et al. (2020). Specifically, let θ = (θ_1, θ_2) ∈ R^2 be the parameter vector and L̃ = (ℓ̃_1, ℓ̃_2) be the vector objective defined as follows: ℓ̃_1(θ) = c_1(θ) f_1(θ) + c_2(θ) g_1(θ) and ℓ̃_2(θ) = c_1(θ) f_2(θ) + c_2(θ) g_2(θ), where

f_1(θ) = log(max(|0.5(−θ_1 − 7) − tanh(−θ_2)|, 5·10^-6)) + 6,
f_2(θ) = log(max(|0.5(−θ_1 + 3) − tanh(−θ_2) + 2|, 5·10^-6)) + 6,
g_1(θ) = ((−θ_1 + 7)^2 + 0.1·(−θ_2 − 8)^2) / 10 − 20,
g_2(θ) = ((−θ_1 − 7)^2 + 0.1·(−θ_2 − 8)^2) / 10 − 20,
c_1(θ) = max(tanh(0.5·θ_2), 0) and c_2(θ) = max(tanh(−0.5·θ_2), 0).

We use the experimental setting outlined by Navon et al. (2022) with minor modifications, i.e., the Adam optimizer with a learning rate of 2·10^-3, and training lasts for 50K iterations. The overall objectives are ℓ_1 = c·ℓ̃_1 and ℓ_2 = ℓ̃_2, where we explore two configurations for the scalar c, namely c ∈ {0.1, 1}. For c = 1, the two tasks have losses at the same scale. For c = 0.1, the difference in loss scales makes the problem more challenging and the algorithm used should be characterized by scale invariance in order to find diverse solutions spanning the entirety of the Pareto Front. The initialization points are drawn from the set {(−8.5, 7.5), (0.0, 0.0), (9.0, 9.0), (−7.5, −0.5), (9.0, −1.0)}. In the case of Pareto Manifold Learning with two ensemble members, there are 5^2 = 25 initialization pairs. In the main text we use the initialization pair with the worst initial objective values. Figure 6 presents the results for the case of different loss scales, i.e., c = 0.1. We plot various baselines and three versions of the proposed algorithm, Pareto Manifold Learning, or PML in short.
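These objectives are straightforward to implement; the following mirrors the definitions above (the grouping of the division by 10 in g_1, g_2 follows the cited configuration):

```python
import math

def toy_losses(theta1, theta2, c=1.0):
    """The two-task toy objective of the illustrative example; returns
    (l1, l2) with l1 = c * l1_tilde and l2 = l2_tilde."""
    f1 = math.log(max(abs(0.5 * (-theta1 - 7) - math.tanh(-theta2)), 5e-6)) + 6
    f2 = math.log(max(abs(0.5 * (-theta1 + 3) - math.tanh(-theta2) + 2), 5e-6)) + 6
    g1 = ((-theta1 + 7) ** 2 + 0.1 * (-theta2 - 8) ** 2) / 10 - 20
    g2 = ((-theta1 - 7) ** 2 + 0.1 * (-theta2 - 8) ** 2) / 10 - 20
    c1 = max(math.tanh(0.5 * theta2), 0.0)
    c2 = max(math.tanh(-0.5 * theta2), 0.0)
    return c * (c1 * f1 + c2 * g1), c1 * f2 + c2 * g2
```

Note that along θ_2 = 0 both gating terms c_1, c_2 vanish, so both losses are zero there regardless of c.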
We focus on the effect of the balancing schemes, introduced in Section 4.2, resulting in the use of no balancing scheme (denoted as PML), the use of gradient balancing (denoted as PML-gb) and the use of loss balancing (denoted as PML-lb). We dedicate two figures for each version of the algorithm and we present all 25 initialization pairs for completeness. For c = 1.0, the proposed method is able to retrieve the exact Pareto Front with no balancing scheme for most initialization pairs, as can be seen in Figure 8 . In three cases (out of 25), the method fails. In our experiments, we found that allowing longer training times or higher learning rates resolve the remaining cases.s For c = 0.1, the problem is more challenging and the vanilla version of the algorithm results in a subset of the analytical Pareto Front. Figure 10 shows that this subset is consistent across initialization pairs, excluding the ones the method fails, and focuses on the task with higher loss magnitude. Applying gradient balancing, shown in Figure 11 and Figure 12 , allows the method to retrieve (a superset of) the Pareto Front for all initialization pairs. Similarly, loss balancing, shown in Figure 13 and Figure 14 , results in the exact Pareto Front. Hence, the inclusion of balancing schemes endows scale invariance in the proposed algorithm. Balancing schemes are used for the more challenging datasets, such as CityScapes. Optimization trajectories in objective space for all initialization pairs in the case of equal loss scales (c = 1.0) and application of the proposed method with no balancing scheme. Blue and red markers show each ensemble member's loss value, dots and "X"s correspond to the initial and final step, accordingly. In all but four cases, Pareto Manifold Learning retrieves the entirety of the Pareto Front (can be sen clearly in Figure 8 ). Allowing longer training times or higher learning rates solves the remaining initialization pairs. Figure 8 : Illustrative example. 
Mapping in objective space of the weight subspace discovered by the proposed method with no balancing scheme, in the case of equal loss scales (c = 1.0). The analytic Pareto Front is plotted in light blue. In all but four cases, the dashed line (our method) coincides with the full analytic Pareto Front.

Figure 9: Illustrative example. Optimization trajectories in objective space for all initialization pairs in the case of unequal loss scales (c = 0.1) and application of the proposed method with no balancing scheme. Blue and red markers show each ensemble member's loss value; dots and "X"s correspond to the initial and final step, respectively. For the vast majority of initialization pairs, the lack of a balancing scheme guides the ensemble to a subset of the Pareto Front, influenced by the task with higher loss magnitude (seen clearly in Figure 10).

Figure 13: Illustrative example. Optimization trajectories in objective space for all initialization pairs in the case of unequal loss scales (c = 0.1) and application of the proposed method with the loss balancing scheme. Blue and red markers show each ensemble member's loss value; dots and "X"s correspond to the initial and final step, respectively. For all but five cases, the proposed method discovers a subspace whose mapping in objective space results in the exact Pareto Front.
This can be clearly seen in Figure 14.

Under review as a conference paper at ICLR 2023

D ABLATION ON MULTI-FORWARD REGULARIZATION

Multi-Forward regularization, introduced in Section 4.2, penalizes the ensemble if the interpolated models' losses (sampled within a batch) are not in accordance with the tradeoff imposed by the corresponding interpolation weights. Simply put, the closer we sample to the member corresponding to task 1, the lower the loss should be on task 1; the same applies to the other tasks. Equation 3 in the main text presents the case of two tasks, where the idea of the regularization is outlined in loss space. For completeness, we present the underlying graph construction for the cases of two and three tasks in Figure 15 and Figure 16, respectively. The nodes of the graphs are associated with the sampled weightings, and the edges of the graph G_t of task t are drawn w.r.t. the corresponding partial ordering. If the loss ordering is violated for a given edge, a penalty term is added. We ablate the effect that multi-forward training and the corresponding regularization have on performance. We explore the MultiMNIST and Census datasets using the same experimental configurations as in the main text. We are interested in two parameters:

• W: the number of α re-samplings per batch. This parameter is also referred to as the window.
• λ: the regularization strength, as presented in Algorithm 1. For λ = 0, no regularization is applied, but the subspace is still sampled W times and the total loss takes into account all the respective interpolated models.

Figure 17 and Table 2 present the results for MultiMNIST. Figure 18 and Table 3 present the results for Census. It is important to note that MultiMNIST is symmetric, while Census is not. As a result, the features learned for each single-task predictor are helpful to one another, and the case of λ = 0, i.e., no regularization and only multi-forward training, is beneficial for MultiMNIST but not for Census.
Intuitively, both digit classification tasks have the same difficulty and posterior distribution, which produces few violations of the monotonicity constraints and renders the regularization less applicable. On the other hand, severe regularization such as λ = 10 can be harmful and hinder training. More details are given in the table and figure captions. For each configuration, we track the HyperVolume across three random seeds and present the mean HV, max HV and standard deviation. We annotate the best per column in bold. In the main text, we report the best result in terms of mean HV, i.e., W = 2 and λ = 5.

Figure 17: MultiMNIST: Effect of the multi-forward window W and the regularization coefficient λ on the validation dataset. The case of no multi-forward (W = 1) is presented in the first row. Multi-forward regularization for higher W values is beneficial. Intuitively, attaching serious weight to the regularization (λ ∈ {5, 10}) while sampling few times (W ∈ {2, 3}) leads to suboptimal performance, since the update step focuses on an uninformed regularization term. The accompanying quantitative analysis appears in Table 2.

Figure 19: Visual explanation of the HyperVolume metric. The example showcases a two-task loss space where the perfect oracle lies at the origin. The point (1, 1) is used as the reference. Hence, higher hypervolume implies that the objective space is better explored/covered.
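A minimal sketch of the monotonicity penalty described above, for the two-task intuition "more weight on task t should mean lower loss on task t". The function name and the all-pairs formulation are ours: the paper builds a per-task graph over the sampled weightings and keeps only the edges respecting the partial ordering, so this pairwise hinge is a simplification of that construction.

```python
import itertools

def multiforward_penalty(alphas, task_losses):
    """Hinge penalty on violated orderings across sampled weightings.

    alphas: list of weight vectors alpha_i (each on the simplex), one per forward pass.
    task_losses: list of loss vectors [L_1(alpha_i), ..., L_T(alpha_i)].
    """
    T = len(alphas[0])
    penalty = 0.0
    for t in range(T):
        for i, j in itertools.combinations(range(len(alphas)), 2):
            # orient the pair so that sample i puts more weight on task t
            if alphas[i][t] < alphas[j][t]:
                i, j = j, i
            # violation: more weight on task t, yet a higher loss on task t
            penalty += max(0.0, task_losses[i][t] - task_losses[j][t])
    return penalty
```

A perfectly ordered set of samples yields zero penalty, while each violated pair contributes its loss gap.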

E HYPERVOLUME ANALYSIS ON MULTIMNIST AND CENSUS

HyperVolume is a metric widely used in multi-objective optimization that captures the quality of exploration. A visual explanation of the metric is given in Figure 19. Table 4 presents the results of Figure 4 of the main text in tabular form. We highlight the best three results per column (higher is better) to succinctly and visually show that all Pareto Manifold Learning seeds outperform the baselines.

Table 4: Tabular complement to Figure 4. Classification accuracy for both tasks and the HyperVolume (HV) metric (higher is better). Three random seeds per method. For baselines, we show the mean accuracy and HV (across seeds). For PML, we show the results per seed: the HV and max accuracies for the subspace yielded by that seed. We use underlined bold, solely bold and solely underlined font for the best, second best and third best results, respectively. We observe that the best results are concentrated in the rows concerning the proposed method (PML). Note that the use of three decimals leads to ties.
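As a sketch of how the metric is computed in the two-task minimization case of Figure 19 (reference point (1, 1), oracle at the origin), one can sweep the non-dominated points by the first objective and accumulate the dominated rectangles. The function name and interface below are illustrative, not the evaluation code of the paper.

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """HyperVolume of a 2-objective minimization front w.r.t. a reference point.

    Computes the area dominated by the points inside the box bounded by `ref`;
    higher is better (the oracle at the origin attains ref_x * ref_y).
    """
    # keep points strictly inside the reference box, sorted by the first objective
    pts = sorted({(x, y) for x, y in points if x < ref[0] and y < ref[1]})
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # non-dominated point: add its marginal rectangle
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

Dominated points add no area, so the metric depends only on the Pareto-optimal subset of the evaluated solutions.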


F MULTIMNIST-3 ADDITIONAL RESULTS

For the pixels where the initial digits overlap, the maximum value is selected. Finally, the image is resized to 28 × 28 pixels. Figure 20 shows some examples of the dataset, which consists of three digit classification tasks. Table 5 compares the performance of the baselines and the proposed method, while Figure 21 presents visually the performance achieved on the discovered subspace.

Table 5: MultiMNIST-3: Mean accuracy and standard deviation of accuracy (over 3 random seeds). For the proposed method (PML), we report the mean and standard deviation of the best performance among the interpolated models in the sampled subspace. No balancing schemes or regularization are applied. Bold is used for the best performing multi-task method.
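The overlap rule above (pixel-wise maximum where digits intersect) can be sketched as follows. Digit placement, canvas size and the function name are illustrative assumptions; the text only fixes the max rule and the final 28 × 28 resize, which is omitted here.

```python
import numpy as np

def compose_overlapping(digits, canvas_hw, offsets):
    """Overlay digit images on a blank canvas; overlapping pixels take the
    maximum value, as in the MultiMNIST-style construction described above.

    digits: list of 2D arrays; offsets: top-left (row, col) per digit.
    """
    canvas = np.zeros(canvas_hw, dtype=float)
    for img, (r, c) in zip(digits, offsets):
        h, w = img.shape
        region = canvas[r:r + h, c:c + w]
        np.maximum(region, img, out=region)  # writes through the view into canvas
    return canvas
```

A final resize to 28 × 28 (e.g., via an image library) would complete the construction.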

G UTKFACE ADDITIONAL RESULTS

This section serves as a supplement to Section 5.2. Table 6 compares the performance of the baselines and the proposed method. We experiment without balancing schemes and with gradient balancing, and present the results in Figure 22 and Figure 23, respectively. Together with the quantitative results, we observe that for datasets with varying task difficulties, scales, etc., the lack of balancing can be impeding. On the other hand, its inclusion makes the subspace functionally diverse and boosts overall performance; for instance, the Huber loss on the task of age prediction is significantly improved.

Figure 22: UTKFace results with Linear Scalarization for all three seeds. Each triangle shows the 66 points in the convex hull and color is used for the performance on the associated task. The 3D plot shows the mapping of the subspace to the multi-objective space. Applying no balancing scheme for datasets with different loss scales, e.g., regression and classification tasks, may lead to limited functional diversity, as for seed 1.

Figure 23: UTKFace results with the Gradient-Balancing Scheme for all three seeds. Each triangle shows the 66 points in the convex hull and color is used for the performance on the associated task. The 3D plot shows the mapping of the subspace to the multi-objective space. For datasets with tasks of varying loss scales, applying gradient balancing improves functional diversity and performance, as shown in Table 6.

This appendix expands on Section 4.2 and, specifically, presents in greater detail the intuition behind the sampling distribution's parameters. Let $p \in \mathbb{R}_+^T$ be the parameters of the Dirichlet distribution. Assuming no prior knowledge on the tasks, e.g., task difficulties or affinities, a symmetric distribution is used by setting $p = p\mathbf{1}_T$. This design choice results in three cases:

• p = 1: the distribution is uniform on the simplex.
Intuitively, this means that all tasks are equally important and we care about the diversity of solutions for all tradeoffs (reflected in the linear scalarization weights).

• p ∈ (0, 1): the distribution is more concentrated towards the ensemble members, as in the top row of Figure 24. Assume an extreme case of two tasks and p → 0. Then the distribution degenerates to a Bernoulli distribution. Effectively, at each iteration one of the ensemble members is selected and its weights are updated, which results in two separate and independent single-task predictors with no common representation infused about the other task. Linearly interpolating in weight space will then produce models with random predictions for both tasks, since the training procedure has not focused on retrieving a Pareto Subspace. For milder cases (e.g., p = 0.7), we observed that the models in the middle of the linear interpolation suffered in performance, which can be attributed to the fact that the sampling focused more on single-task rather than multi-task representations and performance.

• p > 1: the distribution is more concentrated towards the midpoint of the simplex, as in the bottom row of Figure 24. Assume an extreme case of two tasks and p → ∞. Then, the distribution becomes deterministic and outputs equal weights for all tasks. The randomly and independently initialized ensemble members will collapse to each other, resulting in duplicate ensemble members. Similarly, for very large values (e.g., p = 100), the functional diversity of the ensemble will suffer, since the weights produced by the distribution will be almost equal for all tasks, resulting in a milder version of the aforementioned phenomenon. In contrast, we found that small values such as p = 2 or p = 3 can help convergence, since they put more emphasis on common representations (compared to p = 1), but may limit functional diversity.
Figure 25 presents experimental results on MultiMNIST and Census for various concentration parameters p ∈ {0, 0.1, 100} of the Dirichlet distribution. Let θ₁ and θ₂ be the parameters of the ensemble members. For p = 0, the ensemble consists of two single-task predictors with no multi-task representational knowledge, since their interpolation meets a low-accuracy/high-loss barrier. We omit the case of p = 0 for Census for visual clarity. This lack of common representation is evident in the cosine similarities as well, where for p = 0 we have cos(θ₁, θ₂) ≈ 0. On the other hand, for p = 0.1, common representations are infused into the ensemble and the experimental results show that the test performance is characterized by diversity. However, this comes at the expense of the interpolated models at the middle of the line segment, where the performance is suboptimal compared to p = 100 for MultiMNIST. This behavior is also illustrated in the cosine similarities, where for p = 100 the ensemble weights α lie in an ε-ball around the midpoint, causing the independently initialized models to progressively collapse. For Census, we also observe that this collapsing leads to a very high cosine similarity cos(θ₁, θ₂) > 0.9 and the ensemble is suboptimal compared to p = 0.1.

In this section, we supplement our experimental findings on MultiMNIST with additional baselines, namely HPN-LN and HPN-EPO (Navon et al., 2021) and COSMOS (Ruchte & Grabocka, 2021). We use the hyperparameters of Ruchte & Grabocka (2021) for these methods. We provide two experimental settings:

• Setting I: 10 epochs and no learning rate scheduler, i.e., the setting used for all other methods in Figure 4.
• Setting II: the experimental setting used by Ruchte & Grabocka (2021), i.e., 100 epochs for COSMOS and 150 epochs for HPN-LN/HPN-EPO with a multi-step learning rate scheduler.
Figure 29 presents the results with the additional baselines, using three seeds each. We use dashed lines for Setting I and solid lines for Setting II, and group the three methods in various color shades (blue, green, red) for visual clarity. We observe that in the original setting of 10 epochs, all new baselines are suboptimal compared to all other methodologies. For Setting II, the hypernetwork methodologies are competitive with some baselines but remain suboptimal compared to the proposed method. For COSMOS, only one seed is competitive with the proposed method. Moreover, HPN-LN and HPN-EPO employ a hypernetwork of 1.6M parameters, while the target network has fewer than 50K parameters.



We use the open source implementation provided by Ruchte & Grabocka (2021), making minimal changes. Our implementation of the MultiMNIST dataset has images of size 28 × 28 rather than 36 × 36, resulting in slightly different models.



Figure 2: A representation of the encoding in parameter space for T = 3 tasks. Each node corresponds to a tuple of parameters and weighting scheme $(\theta_v, w_v) \in \mathbb{R}^N \times \Delta_T$. The blue dashed frame shows the model, e.g., a shared-bottom architecture, implemented by the parameters $\theta_v$ of each node. For each training step, we sample $\alpha \in \Delta_T$ and construct the weight combination $\theta = \alpha^\top \Theta = 0.6 \cdot \theta_1 + 0.2 \cdot \theta_2 + 0.2 \cdot \theta_3$.
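The weight combination $\theta = \alpha^\top \Theta$ in the caption above can be sketched as a convex combination of the members' parameter dictionaries. This is a minimal NumPy stand-in (dict-of-arrays for a model's parameters; names are ours, not the paper's implementation):

```python
import numpy as np

def interpolate_params(alpha, members):
    """Return theta = sum_t alpha_t * theta_t over the ensemble.

    members: list of dicts mapping parameter names to arrays, one dict per member.
    alpha: weights on the T-simplex (non-negative, summing to 1).
    """
    assert abs(sum(alpha) - 1.0) < 1e-8 and all(a >= 0 for a in alpha)
    keys = members[0].keys()
    return {k: sum(a * m[k] for a, m in zip(alpha, members)) for k in keys}
```

In a deep-learning framework, the same operation would be applied per parameter tensor of the shared architecture before the forward pass.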

Let $\{\theta_t \in \mathbb{R}^N : t \in [T]\}$ be a collection of network parameters and $S$ the corresponding convex envelope, i.e., $S = \left\{\sum_{t=1}^{T} \alpha_t \theta_t : \sum_{t=1}^{T} \alpha_t = 1 \text{ and } \alpha_t \geq 0, \forall t\right\}$. Consider the dataset $D = (D_X, D_Y)$. Then, the subspace $S$ is called Pareto if its mapping to functional space via the network architecture $f$ forms a Pareto Front $P = L(f(D_X; S), D_Y) = \{l : l = L(f(D_X; \theta), D_Y), \forall \theta \in S\}$.


The batch is forwarded through the constructed network and the vector loss is scalarized by α as well. The procedure is repeated W times at each batch (see Section 4.2) and a regularization term penalizing non-Pareto stationary points is added (line 11). Claim 3. Let $\{\theta_t^* \in \mathbb{R}^N : t \in [T]\}$ be the optimal ensemble parameters retrieved at the end of training by Algorithm 1 and let $S$ be their convex hull. Then $S$ is a Pareto Subspace.

Figure 3: Visual explanation of multi-forward regularization, presented in Equation 3. The subfigures depict the loss values for various weightings $\alpha_i = [\alpha_{i,1}, \alpha_{i,2}]$. The optimum lies at the origin. We assume that $\alpha_{1,1} > \dots > \alpha_{5,1}$. Green color corresponds to Pareto optimality. (Left) All sampled weightings are in the Pareto Front and the regularization term is zero. (Right) The red points are not optimal and, therefore, the regularization term penalizes the violations of the monotonicity constraints for the appropriate task loss: $\alpha_2$ and $\alpha_4$ violate the $L_1$ and $L_2$ orderings w.r.t. $\alpha_3$, since $\alpha_{2,1} > \alpha_{3,1} \nRightarrow L_1(\alpha_2) < L_1(\alpha_3)$ and $\alpha_{4,2} > \alpha_{3,2} \nRightarrow L_2(\alpha_4) < L_2(\alpha_3)$.

Figure 3 offers a visual explanation of the regularization term. The role of sampling. Another component of Algorithm 1 is the sampling imposed on the convex-hull parameterization. During training, the sampling distribution dictates the loss weighting used and, hence, modulates the degree of task learning. A natural choice is the Dirichlet distribution $\mathrm{Dir}(p)$, where $p \in \mathbb{R}_{>0}^T$ are the concentration parameters, since its support is the $T$-dimensional simplex $\Delta_T$. For $p = p\mathbf{1}_T$, the distribution is symmetric; for $p < 1$ the sampling is more concentrated near the ensemble members, for $p > 1$ it is near the centre, and for $p = 1$ it corresponds to the uniform distribution. In contrast, for $p_1 \neq p_2$ the distribution is skewed. In our experiments, we use symmetric Dirichlet distributions with $p \geq 1$ to guide the ensemble to representations best suited for Multi-Task Learning.
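The concentration behaviour can be checked empirically by sampling from symmetric Dirichlet distributions and measuring the average distance from the simplex midpoint. This small sketch (names ours) reproduces the three regimes described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def spread(p, T=2, n=2000):
    """Mean deviation of Dir(p * 1_T) samples from the simplex midpoint 1/T."""
    samples = rng.dirichlet([p] * T, size=n)
    return np.abs(samples - 1.0 / T).mean()

# p < 1: samples pile up near the vertices (the ensemble members),
# p = 1: uniform on the simplex,
# p > 1: samples concentrate at the midpoint (equal task weights).
```

Comparing `spread(0.1)`, `spread(1.0)` and `spread(100.0)` shows the monotone decrease in dispersion as p grows.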

Figure 4: Experimental results on MultiMNIST and Census. Top right is optimal. Three random seeds per method. Solid lines correspond to our method (PML) and thick lines to the Pareto Front. We have used a different color for each seed of PML. Baselines are shown in shades of gray: scatter plot for MTL baselines and dashed lines for single task. In both datasets, Pareto Manifold Learning discovers subspaces with diverse and Pareto-optimal solutions and outperforms the baselines.

2. Linear Scalarization (LS), which minimizes the average loss $\frac{1}{T}\sum_{t=1}^{T} L_t$,
3. Uncertainty Weighting (UW, Cipolla et al. 2018),
4. Multiple-Gradient Descent Algorithm (MGDA, Sener & Koltun 2018),
5. Dynamic Weight Averaging (DWA, Liu et al. 2019),
6. Projecting Conflicting Gradients (PCGrad, Yu et al. 2020),
7. Impartial Multi-Task Learning (IMTL, Liu et al. 2020),
8. Conflict-Averse Gradient Descent (CAGrad, Liu et al. 2021), and
9. Bargaining Multi-Task Learning (Nash-MTL, Navon et al. 2022).

MultiMNIST-3: Accuracy Heatmap and Pareto Front for all tasks. UTKFace: Objective Heatmap and Pareto Front for all tasks.

Figure 5: Application of Pareto Manifold Learning on datasets with 3 tasks. Each triangle depicts the performance on a task, using color, as a function of the interpolation weighting, i.e., each hexagon corresponds to a different weighting $\alpha = [\alpha_1, \alpha_2, \alpha_3] \in \Delta_3$. The closer the interpolated member is to a single-task predictor, the higher the performance on the corresponding task. The 3D plot, on the right, shows the performance of the model in the multi-objective space.

Our experimental configuration is drawn from Liu et al. (2019); Yu et al. (2020); Liu et al. (2021); Navon et al. (2022), with some modifications. Concretely, we address two tasks: semantic segmentation and depth regression. We use a SegNet architecture (Badrinarayanan et al., 2017) trained for 100 epochs with the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of $10^{-4}$, which is halved after 75 epochs. The images are resized to 128 × 256 pixels. In the initial training steps, any sampled α results in a random model, due to initialization, and the algorithm has a warmup period until the ensemble members have acquired meaningful representations. Hence, to reduce computational overhead and help convergence, the concentration parameter of the Dirichlet distribution is set to $p_0 = 5$. We use gradient balancing, window W = 3 and λ = 1. The results are presented in Table 1. In Depth Estimation and out of MTL methods, Pareto Manifold Learning is near-optimal with

The configuration of MultiMNIST is used. Now, the model has three decoders and training lasts 20 epochs.

UTKFace. The UTKFace dataset has more than 20,000 face images of dimensions 200 × 200 pixels and 3 color channels. The dataset has three tasks: predicting age (modeled as regression using the Huber loss, similar to Ma et al. (2020)), and classifying gender and ethnicity (modeled as classification tasks using the Cross-Entropy loss). Images are resized to 64 × 64 pixels, age is normalized and an 80/20 train/test split is used. We use a shared-bottom architecture; the encoder is a ResNet18 (He et al., 2016) model without the last fully connected layer. The decoders (task-specific layers) consist of one fully connected layer, where the output dimensions are 1, 2 and 5 for age (regression), gender (binary classification) and ethnicity (classification with 5 classes). Training lasts 100 epochs, the batch size is 256 and we use the Adam optimizer with a learning rate of $10^{-3}$. No scheduler is used.

CityScapes. Our experimental configuration is very similar to prior work, namely Liu et al. (2019); Yu et al. (2020); Liu et al. (2021); Navon et al. (2022). All images are resized to 128 × 256. The tasks used are coarse semantic segmentation and depth regression. The semantic segmentation task has 7 classes, whereas the original has 19. We use a SegNet architecture (Badrinarayanan et al., 2017) and train the model for 100 epochs with the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of $10^{-4}$. We employ a scheduler that halves the learning rate after 75 epochs.

Figure 6: Optimization trajectories in objective space in the case of different loss scales. Similar to Figure 1, 5 initializations are shown for the baselines and a pair of initializations for Pareto Manifold Learning (PML), in color for clarity. Dashed lines show the evolution of the mapping in loss space for the subspace at the current step. We also show the initial subspace (step = 0). All baselines, except Nash-MTL, and MGDA to a lesser degree, are characterized by trajectories focused on a subset of the Pareto Front, namely minimizing the task with high loss magnitude. The same observation applies to naïvely applying the proposed algorithm PML, because using the same weighting for both the interpolation and the losses attaches too much importance to the task with large loss magnitude. However, simple balancing schemes palliate this issue; gradient balancing (PML-gb) discovers a superset of the Pareto Front and loss balancing (PML-lb) discovers the exact Pareto Front.

Figure 9 and Figure 10 correspond to no balancing scheme, Figure 11 and Figure 12 to the use of gradient balancing, and Figure 13 and Figure 14 to the use of loss balancing. The first figure of each pair shows the trajectories for each initialization pair, with markers for the initial and final positions. The second figure of each pair dispenses with the visual clutter and focuses on the subspace discovered at the final step of training, which is plotted with dashed lines along with the analytical Pareto Front in solid light blue. Hence, it provides a succinct overview of whether the method was able to discover the (entire) Pareto Front.

Figure 7: Illustrative example. Optimization trajectories in objective space for all initialization pairs in the case of equal loss scales (c = 1.0) and application of the proposed method with no balancing scheme. Blue and red markers show each ensemble member's loss value; dots and "X"s correspond to the initial and final step, respectively. In all but four cases, Pareto Manifold Learning retrieves the entirety of the Pareto Front (seen clearly in Figure 8). Allowing longer training times or higher learning rates solves the remaining initialization pairs.

Figure 10: Illustrative example. Mapping in objective space of the weight subspace discovered by the proposed method with no balancing scheme, in the case of unequal loss scales (c = 0.1). The analytic Pareto Front is plotted in light blue. The lack of a balancing scheme renders optimization difficult; the method either completely fails or retrieves a narrow subset of the analytic Pareto Front. Applying balancing schemes resolves these issues.

Figure 11: Illustrative example. Optimization trajectories in objective space for all initialization pairs in the case of unequal loss scales (c = 0.1) and application of the proposed method with the gradient balancing scheme. Blue and red markers show each ensemble member's loss value; dots and "X"s correspond to the initial and final step, respectively. The proposed method discovers a subspace whose mapping in objective space results in a superset of the Pareto Front. This can be clearly seen in Figure 12.

Figure 12: Illustrative example. Mapping in objective space of the weight subspace discovered by the proposed method with gradient balancing scheme, in the case of unequal loss scales (c = 0.1). The analytic Pareto Front is plotted in light blue. The proposed method consistently finds the same subspace, which is a superset of the analytic Pareto Front.

Figure 14: Illustrative example. Mapping in objective space of the weight subspace discovered by the proposed method with the loss balancing scheme, in the case of unequal loss scales (c = 0.1). The analytic Pareto Front is plotted in light blue. Using loss balancing endows scale invariance and the solutions are more functionally diverse, in comparison with no balancing scheme in Figure 10. However, the same initialization pairs continue to be problematic as in the case of equal loss scales (see Figure 8). Allowing for longer training or higher learning rates solves the remaining initialization pairs.

Figure 15: Multi-Forward Graph: case of two tasks. We assume a window of W = 5. The nodes lie in the line segment α 2 + α 1 = 1, α 1 , α 2 ∈ [0, 1]. (Left) Full graph and dashed edges will be removed. (Right) Final graph.

Figure 16: Multi-Forward Graph for three tasks. Left, middle and right present the case of the first, second and third task, respectively. Each node is denoted by its weighting, summing up to 1. Edges are drawn if the two nodes obey the total ordering imposed by the task. Dashed edges are omitted from the final graph.

Figure 18: Census: Effect of the multi-forward window W and the regularization coefficient λ. The axes are shared across plots. Compared to MultiMNIST, applying multi-forward on the asymmetric Census dataset can improve accuracies and help significantly outperform the baselines. However, widening the window W (e.g., last row for W = 5) can be hindering, since larger regularization coefficients are needed. The accompanying quantitative analysis appears in Table 3.

Figure 20: Examples of samples and corresponding labels for the MultiMNIST-3 dataset.

Figure 21: MultiMNIST-3 results for all three seeds. Each triangle shows the 66 points in the convex hull and color is used for the performance on the associated task. The 3D plot shows the mapping of the subspace to the multi-objective space. No balancing scheme is used.

Figure 24: Dirichlet distribution in the case of two tasks. Top row: p < 1 and the distribution is more concentrated towards the ensemble members. Bottom row: p > 1 and the distribution focuses more on the midpoint, which corresponds to all tasks having the same weight. Right column: extreme choices p → 0 or p → ∞. Left column: milder choices.

Census: Cosine similarities of ensemble members.

Figure 25: Experimental results on MultiMNIST and Census, varying the concentration parameters $p = p\mathbf{1}_T$ of the sampling distribution. Three seeds, depicted in shades of the same color for each value of p.

Figure 27: Illustrative example: Alternate view of Figure 26. Refer to the text for details.

Test performance on CityScapes. 3 random seeds per method. For Pareto Manifold Learning, we report the mean (across seeds) best results from the final subspace.

MultiMNIST: Ablation on multi-forward training and regularization, presented in Section 4.2. Validation performance in terms of the HyperVolume (HV) metric. Higher is better, except for the standard deviation (std). The visual complement of the table appears in Figure 17. For each configuration, we track the HyperVolume across three random seeds and present the mean HV, max HV and standard deviation. We annotate the best per column in bold. In the main text, we report the best result in terms of mean HV, i.e., W = 4 and λ = 0.

Census: Ablation on multi-forward training and regularization, presented in Section 4.2.

UTKFace: Mean Accuracy and standard deviation of accuracy (over 3 random seeds). For the proposed method (PML), we report the mean and standard deviation of the best performance from the interpolated models in the sampled subspace. No multi-forward training is applied. We present Pareto Manifold Learning with no balancing scheme and with gradient balancing, denoted as gb. Bold is used for the best performing multi-task method.

A APPENDIX OVERVIEW

As a reference, we provide the following table of contents solely for the appendix.

In this section, we investigate the connection between the intersection of multiple loss landscapes, Pareto optimality and the effect of the proposed algorithm, Pareto Manifold Learning. We use the illustrative example presented in Figure 1. Let Θ be the parameter space of the model and $L_t : \Theta \to \mathbb{R}$, $t \in \{1, 2\}$, be the losses of the problem. For $\alpha \in [0, 1]$ and $\theta \in \Theta$, the overall objective is $L_\alpha(\theta) = \alpha L_1(\theta) + (1 - \alpha) L_2(\theta)$.

In this appendix, we further expand on prior work. Linear mode connectivity, as in Wortsman et al. (2021), encourages flatness and, therefore, is linked with methods explicitly enforcing flat minima (Chaudhari et al., 2017; Foret et al., 2021; Dinh et al., 2017; Jiang* et al., 2020). These approaches are applicable when designing a single objective, e.g., the average of losses in Multi-Task Learning, but do not allow for the infusion of Pareto properties and the inclusion of tradeoffs. Havasi et al. (2021) employ a multi-input multi-output network by accommodating independent subnetworks for each ensemble member and allowing a single-forward-pass ensemble prediction. However, this results in subnetworks with incompatible architectures, which does not allow for a continuous approximation of the Pareto Front.

