LEARNING THE PARETO FRONT WITH HYPERNETWORKS

Abstract

Multi-objective optimization (MOO) problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent MOO methods can target a specific desired ray in loss space; however, most approaches still face two serious limitations: (i) a separate model has to be trained for each point on the front; and (ii) the exact trade-off must be known before the optimization process. Here, we tackle the problem of learning the entire Pareto front, with the capability of selecting a desired operating point on the front after training. We call this new setup Pareto-Front Learning (PFL). We describe an approach to PFL implemented using hypernetworks, which we term Pareto HyperNetworks (PHNs). A PHN learns the entire Pareto front simultaneously using a single hypernetwork, which receives a desired preference vector as input and returns a Pareto-optimal model whose loss vector lies on the desired ray. The unified model is runtime efficient compared to training multiple models and generalizes to new operating points not used during training. We evaluate our method on a wide set of problems, from multi-task regression and classification to fairness. PHNs learn the entire Pareto front in roughly the same time it takes to learn a single point on the front, while at the same time reaching a better solution set. PFL opens the door to new applications where models are selected based on preferences that are only available at run time.

1. INTRODUCTION

Multi-objective optimization (MOO) aims to optimize several possibly conflicting objectives. MOO is abundant in machine learning problems, from multi-task learning (MTL), where the goal is to learn several tasks simultaneously, to constrained problems. In the latter, one aims to learn a single task while finding solutions that satisfy properties like fairness or privacy. It is common to optimize the main task while adding loss terms that encourage the learned model to obtain these properties. MOO problems have a set of optimal solutions, the Pareto front, each reflecting a different trade-off between objectives. Points on the Pareto front can be viewed as the intersection of the front with a specific direction in loss space (a ray; Figure 1). We refer to this direction as a preference vector, as it represents a single trade-off between objectives. When a direction is known in advance, it is possible to obtain the corresponding solution on the front (Mahapatra & Rajan, 2020). However, in many cases we are interested in more than one predefined direction, either because the trade-off is not known before training, or because there are many possible trade-offs of interest. Although several recent studies (Yang et al., 2019; Parisi et al., 2016) suggested learning the trade-off curve in MOO problems, there is no existing scalable and general-purpose MOO approach that provides Pareto-optimal solutions for numerous preferences in objective space. Classical approaches, like genetic algorithms, do not scale to modern high-dimensional problems. It is possible in principle to run a single-direction optimization multiple times, each for a different preference, but this approach faces two major drawbacks: (i) Scalability - the number of models to be trained to cover the objective space grows exponentially with the number of objectives; and (ii) Flexibility - the decision maker cannot switch freely between preferences unless all models are trained and stored in advance.
Here we put forward a new view of PFL as a problem of learning a conditional model, where the conditioning is over the preference direction. During training, a single unified model is trained to produce Pareto-optimal solutions while satisfying the given preferences. During inference, the model covers the Pareto front by varying the input preference vector. We further describe an architecture that implements this idea using hypernetworks. Specifically, we train a hypernetwork, termed Pareto HyperNetwork (PHN), that, given a preference vector as input, produces a deep network model tuned for that objective preference. Training is applied to preferences sampled from the m-dimensional simplex, where m is the number of objectives. We evaluate PHN on a wide set of problems, from multi-class classification, through fairness and image segmentation, to multi-task regression. We find that PHN can achieve superior overall solutions while being 10-50 times faster (see Figure 5). PHN addresses both scalability and flexibility: training a unified model allows using any objective preference at inference time. Finally, as PHN generates a continuous parametrization of the entire Pareto front, it could open new possibilities for analyzing Pareto-optimal solutions in large-scale neural networks. Our paper has the following contributions: (1) We define the Pareto-front learning problem - learn a model that, at inference time, can operate on any given preference vector, providing a Pareto-optimal solution for that specified objective trade-off. (2) We describe Pareto HyperNetworks (PHN), a unified architecture based on hypernetworks that addresses PFL, and show that it can be trained effectively. (3) Empirical evaluations on various tasks and datasets demonstrate the ability of PHNs to generate better objective-space coverage compared to multiple baseline models, with significant improvement in training time.
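To make the conditional-model idea concrete, the following is a minimal numpy sketch of the PHN training scheme on a toy two-objective problem. It is an illustration, not the paper's implementation: the hypernetwork is reduced to a single linear map, the target model is its parameter vector directly, and the preference-conditioned loss is a simple linear scalarization (the variant referred to below as PHN-LS, rather than the EPO-based solver); all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-objective problem: l1(theta) = ||theta - a||^2, l2(theta) = ||theta - b||^2.
# Its Pareto front is the segment between a and b, so the optimal parameters for a
# preference r on the 2-simplex are theta*(r) = r[0]*a + r[1]*b.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# "Hypernetwork": a linear map W from the preference vector r to target parameters.
W = rng.normal(size=(2, 2))

lr = 0.1
for _ in range(2000):
    # Sample a preference vector from the simplex (Dirichlet distribution).
    r = rng.dirichlet(np.ones(2))
    theta = W @ r
    # Linear scalarization of the losses: L = r[0]*l1(theta) + r[1]*l2(theta).
    grad_theta = 2 * (r[0] * (theta - a) + r[1] * (theta - b))
    W -= lr * np.outer(grad_theta, r)  # chain rule: dL/dW = (dL/dtheta) r^T

# After training, a single W maps any preference to (approximately) its
# Pareto-optimal solution, without retraining.
print(W @ np.array([1.0, 0.0]))  # close to a
print(W @ np.array([0.5, 0.5]))  # close to the midpoint of a and b
```

The key property illustrated is the one claimed above: one set of hypernetwork weights serves every trade-off, including preferences never sampled during training.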

2. MULTI-OBJECTIVE OPTIMIZATION

We start by formally defining the MOO problem. An MOO problem is defined by m loss functions ℓ_i : R^d → R_+, i = 1, ..., m, or, in vector form, ℓ : R^d → R^m_+. We define a partial ordering on the loss space: ℓ(θ) ⪯ ℓ(θ') if ℓ_i(θ) ≤ ℓ_i(θ') for all i = 1, ..., m.
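This partial ordering is the standard notion of Pareto dominance, which determines the Pareto front as the set of non-dominated points. A minimal sketch (ours, not the paper's code; function names are illustrative):

```python
import numpy as np

def dominates(l1, l2):
    """True if loss vector l1 Pareto-dominates l2: l1 is no worse in
    every objective and strictly better in at least one."""
    l1, l2 = np.asarray(l1), np.asarray(l2)
    return bool(np.all(l1 <= l2) and np.any(l1 < l2))

def pareto_front(losses):
    """Indices of the non-dominated loss vectors in a finite set."""
    losses = [np.asarray(l) for l in losses]
    return [i for i, li in enumerate(losses)
            if not any(dominates(lj, li)
                       for j, lj in enumerate(losses) if j != i)]

points = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0]]
print(pareto_front(points))  # [0, 1, 2] -- the point [3, 3] is dominated
```

Note that ⪯ is only a partial order: the vectors [1, 3] and [3, 1] are incomparable, which is exactly why the front contains multiple trade-off points rather than a single optimum.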



Figure 1: Illustrative example using the popular task of Fonseca (1995), demonstrating the relation between the Pareto front, preference rays, and solutions. The Pareto front (black solid line) is shown for a 2D loss space, together with several rays (colored dashed lines) that represent various possible preferences. Left: A single PHN-EPO model learns the entire Pareto front, mapping any given preference ray to its corresponding solution on the front. The data and task are detailed in Section 5.1.

