LEARNING A MULTIOBJECTIVE PROGRAM THROUGH ONLINE LEARNING

Abstract

We investigate the problem of learning the parameters (i.e., objective functions or constraints) of a multiobjective decision making model from a set of sequentially arriving decisions. In particular, these decisions might not be exact: they may carry measurement noise or be generated with the bounded rationality of decision makers. We propose a general online learning framework that addresses this learning problem through inverse multiobjective optimization, and prove that it converges at a rate of O(1/√T) under certain regularity conditions. More precisely, we develop two online learning algorithms with implicit update rules that can handle noisy data. Numerical results on both synthetic and real-world datasets show that both algorithms learn the parameters of a multiobjective program with high accuracy and are robust to noise.

1. INTRODUCTION

In this paper, we aim to learn the parameters (i.e., constraints and a set of objective functions) of a decision making problem with multiple objectives, instead of solving for its efficient (or Pareto) optimal solutions, which is the typical scenario. More precisely, we seek to learn θ given observations {y_i}_{i∈[N]} of efficient solutions of the multiobjective optimization problem (MOP)

  min_x {f_1(x, θ), f_2(x, θ), ..., f_p(x, θ)}  s.t.  x ∈ X(θ),

where θ is the true but unknown parameter of the MOP. We consider such learning problems in an online fashion, noting that observations are revealed sequentially in practical scenarios. Specifically, we cast the learning problem as an inverse multiobjective optimization problem (IMOP) dealing with noisy data, develop online learning algorithms to infer the parameters of the objective functions and constraints, and finally output an estimate of the distribution of weights (which, together with the objective functions, define individuals' utility functions) among human subjects. Learning human participants' decision making schemes is critical for an organization designing and providing services or products. Nevertheless, in most scenarios we can only observe their decisions or behaviors and cannot directly access their decision making schemes; indeed, participants often do not have exact information about their own decision making processes (Keshavarz et al., 2011). To bridge this discrepancy, we leverage the idea of inverse optimization, which has received significant attention in the optimization community: infer the missing information of the underlying decision models from observed data, assuming that human decision makers are making optimal decisions (Ahuja & Orlin, 2001; Iyengar & Kang, 2005; Schaefer, 2009; Wang, 2009; Keshavarz et al., 2011; Chan et al., 2014; Bertsimas et al., 2015; Aswani et al., 2018; Esfahani et al., 2018; Tan et al., 2020).
This subject naturally embodies the data-driven concept and becomes increasingly applicable as large amounts of data are generated and become readily available, especially from digital devices and online transactions.

1.1. RELATED WORK

Our work draws inspiration from inverse optimization with a single objective, which seeks parameter values such that the difference between the actual observation and the solution expected from the optimization model (populated with those inferred values) is minimized. Although complicated, an inverse optimization model can often be simplified for computation using the KKT conditions or strong duality of the decision making model, provided that it is convex. Extending from its initial form that considers only a single observation (Ahuja & Orlin, 2001; Iyengar & Kang, 2005; Schaefer, 2009; Wang, 2009), inverse optimization has been further developed and applied to handle many observations (Keshavarz et al., 2011; Bertsimas et al., 2015; Aswani et al., 2018; Esfahani et al., 2018). Nevertheless, a particular challenge, almost unavoidable for any large dataset, is that the data could be inconsistent due to measurement errors or decision makers' sub-optimality. To address this challenge, the assumption of the observations' optimality is weakened to integrate noisy data, and the KKT conditions or strong duality are relaxed to incorporate inexactness. Our work is most closely related to inverse multiobjective optimization, whose goal is to find multiple objective functions or constraints that explain the observed efficient solutions well. Several recent studies relate to the presented research. Chan et al. (2014) consider a single observation that is assumed to be an exact optimal solution; given a set of well-defined linear functions, an inverse optimization problem is formulated to learn their weights. Dong & Zeng (2020) propose a batch learning framework to infer utility functions or constraints from multiple noisy decisions through inverse multiobjective optimization.
That work performs inverse multiobjective optimization in the batch setting. Recently, Dong & Zeng (2021) extended Dong & Zeng (2020) to distributionally robust optimization by leveraging the prominent Wasserstein metric. In contrast, we perform inverse multiobjective optimization in the online setting, and the proposed online learning algorithms significantly accelerate the learning process with performance guarantees, allowing us to deal with more realistic and complex preference inference problems. Also related to our work is the line of research by Bärmann et al. (2017) and Dong et al. (2018), which develops online learning methods to infer utility functions or constraints from sequentially arriving observations. However, their approaches only handle inverse optimization with a single objective; that is, their methods apply to situations where observations are generated by decision making problems with a single objective function. In contrast, our approach does not make the single-objective assumption and only requires convexity of the underlying decision making problem with multiple objectives. Hence, we believe our work generalizes their methods and extends the applicability of online learning from single-objective programs to multiobjective programs.

1.2. OUR CONTRIBUTIONS

To the best of the authors' knowledge, we propose the first general online learning framework for inferring decision makers' objective functions or constraints using inverse multiobjective optimization. This framework can learn the parameters of any convex decision making problem and can explicitly handle noisy decisions. Moreover, we show that the online learning approach, which adopts an implicit update rule, has O(√T) regret under suitable regularity conditions when using the ideal loss function. We finally illustrate the performance of two algorithms on a multiobjective quadratic programming problem and a portfolio optimization problem. Results show that both algorithms learn parameters with high accuracy and are robust to noise, while the second algorithm significantly accelerates the learning process over the first.

2.1. DECISION MAKING PROBLEM WITH MULTIPLE OBJECTIVES

We consider a family of parametrized multiobjective decision making problems of the form

  min_{x∈R^n} {f_1(x, θ), f_2(x, θ), ..., f_p(x, θ)}  s.t.  x ∈ X(θ),

where p ≥ 2 and f_l(x, θ): R^n × R^{n_θ} → R for each l ∈ [p]. Assume the parameter θ ∈ Θ ⊆ R^{n_θ}. We denote the vector of objective functions by f(x, θ) = (f_1(x, θ), f_2(x, θ), ..., f_p(x, θ))^T. Assume X(θ) = {x ∈ R^n : g(x, θ) ≤ 0, x ∈ R^n_+}, where g(x, θ) = (g_1(x, θ), ..., g_q(x, θ))^T is another vector-valued function with g_k(x, θ): R^n × R^{n_θ} → R for each k ∈ [q].

Definition 2.1 (Efficiency). For fixed θ, a decision vector x* ∈ X(θ) is said to be efficient if there exists no other decision vector x ∈ X(θ) such that f_i(x, θ) ≤ f_i(x*, θ) for all i ∈ [p], and f_k(x, θ) < f_k(x*, θ) for at least one k ∈ [p].

In the study of multiobjective optimization, the set of all efficient solutions is denoted by X_E(θ) and called the efficient set. The weighting method is commonly used to obtain an efficient solution by solving the problem of weighted sum (PWS) (Gass & Saaty, 1955):

  min  w^T f(x, θ)  s.t.  x ∈ X(θ),   (PWS)

where w = (w_1, ..., w_p)^T. Without loss of generality, all possible weights are restricted to a simplex, denoted by W_p = {w ∈ R^p_+ : 1^T w = 1}; we also let W^+_p = {w ∈ R^p_{++} : 1^T w = 1}. We denote the set of optimal solutions of (PWS) by S(w, θ) = arg min_x {w^T f(x, θ) : x ∈ X(θ)}. Following Theorem 3.1.2 of Miettinen (2012), we have:

Proposition 2.1. If x ∈ S(w, θ) and w ∈ W^+_p, then x ∈ X_E(θ).

The next result, from Theorem 3.1.4 of Miettinen (2012), states that for a convex MOP all efficient solutions can be found by the weighting method.

Proposition 2.2. Assume that the MOP is convex. If x ∈ X(θ) is an efficient solution, then there exists a weighting vector w ∈ W_p such that x is an optimal solution of (PWS).

By Propositions 2.1-2.2, we can summarize the relationship between S(w, θ) and X_E(θ) as follows.

Corollary 2.2.1. For a convex MOP, ∪_{w∈W^+_p} S(w, θ) ⊆ X_E(θ) ⊆ ∪_{w∈W_p} S(w, θ).

In the following, we make a few assumptions to simplify the exposition; they are mild and appear often in the literature.

Assumption 2.1. The set Θ is convex and compact, and there exists D > 0 such that ∥θ∥_2 ≤ D for all θ ∈ Θ. In addition, for each θ ∈ Θ, both f(x, θ) and g(x, θ) are convex in x.
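To make the weighting method and Propositions 2.1-2.2 concrete, the following minimal Python sketch (a toy one-dimensional example of our own, not the paper's setup) evenly samples weights from W^+_2 for a bi-objective problem whose (PWS) solution has a closed form, and checks that every sampled solution is indeed efficient (non-dominated):

```python
# Toy bi-objective problem: f1(x) = x^2, f2(x) = (x - 1)^2 over X = [0, 1].
# For w = (w1, w2) on the simplex, S(w) has the closed form x* = w2
# (set the derivative of w1*x^2 + w2*(x - 1)^2 to zero).
def f(x):
    return (x ** 2, (x - 1.0) ** 2)

def solve_pws(w1, w2):
    """Optimal solution of min w1*f1 + w2*f2 over [0, 1] (closed form)."""
    return w2 / (w1 + w2)

def dominates(a, b):
    """True if objective vector a dominates b (<= in all, < in at least one)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

K = 11
# Evenly sample K weights from the open simplex W2+ (strictly positive entries).
weights = [((K - k) / (K + 1), (k + 1) / (K + 1)) for k in range(K)]
efficient = [solve_pws(w1, w2) for w1, w2 in weights]

# By Proposition 2.1, no sampled solution is dominated by another.
for xa in efficient:
    assert not any(dominates(f(xb), f(xa)) for xb in efficient)
```

Sampling K weights and collecting the corresponding (PWS) minimizers is exactly the approximation of X_E(θ) used throughout the paper.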

2.2. INVERSE MULTIOBJECTIVE OPTIMIZATION

Consider a learner who has access to decision makers' decisions but does not know their objective functions or constraints. In our model, the learner aims to learn the decision makers' multiple objective functions or constraints from observed noisy decisions only. We denote by y an observed noisy decision, which might carry measurement error or be generated with bounded rationality of the decision maker. We emphasize that this noisy setting of y reflects real-world situations rather than serving merely the regret analysis. Throughout the paper we assume that y is a random variable distributed according to an unknown distribution P_y supported on Y. As y is a noisy observation, it does not necessarily belong to X(θ); i.e., it might be either feasible or infeasible with respect to X(θ).

We next discuss the construction of an appropriate loss function for the inverse multiobjective optimization problem (Dong & Zeng, 2020; 2021). Ideally, given a noisy decision y and a hypothesis θ, the loss function can be defined as the minimum (squared) distance between y and the efficient set X_E(θ):

  l(y, θ) = min_{x∈X_E(θ)} ∥y − x∥²_2.   (loss function)

For a general MOP, however, there might exist no explicit way to characterize the efficient set X_E(θ), so an approximation is adopted. Following Corollary 2.2.1, we sample weights w_k ∈ W_p for each k ∈ [K] and approximate X_E(θ) by ∪_{k∈[K]} S(w_k, θ). The surrogate loss function is then defined as

  l_K(y, θ) = min_{x∈∪_{k∈[K]} S(w_k, θ)} ∥y − x∥²_2.   (surrogate loss)

Using binary variables, this surrogate loss can be converted into the surrogate loss problem

  l_K(y, θ) = min_{z_k∈{0,1}} ∥y − Σ_{k∈[K]} z_k x_k∥²_2  s.t.  Σ_{k∈[K]} z_k = 1,  x_k ∈ S(w_k, θ).   (1)

The constraint Σ_{k∈[K]} z_k = 1 ensures that exactly one of the efficient solutions is chosen to measure the distance to y. Hence, solving this optimization problem identifies the w_k, k ∈ [K], whose corresponding efficient solution S(w_k, θ) is closest to y.

Remark 2.1. No efficient solution would be excluded if all weight vectors in W_p were enumerated. As this is computationally intractable, we can control K to balance the tradeoff between approximation accuracy and computational efficiency. Certainly, with sufficient computational power, we suggest drawing a large number of weights evenly in W_p to avoid bias. In practice, for a general convex MOP, we evenly sample {w_k}_{k∈[K]} from W^+_p to ensure that S(w_k, θ) ⊆ X_E(θ). If f(x, θ) is known to be strictly convex, we can evenly sample {w_k}_{k∈[K]} from W_p, since then S(w_k, θ) ∈ X_E(θ) by Proposition 2.1.
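Once the K efficient solutions {x_k} have been computed, evaluating the surrogate loss is just a nearest-point scan; the argmin plays the role of the binary variables z_k in problem (1). A minimal Python sketch (toy data of our own, assuming the x_k are already available):

```python
# Surrogate loss l_K(y, theta): minimum squared distance from a noisy decision
# y to the sampled efficient solutions {x_k}, k in [K]. The index k_star is
# the single k with z_k = 1 in problem (1).
def surrogate_loss(y, efficient_points):
    """Return (l_K value, index of the closest efficient solution)."""
    dists = [sum((yi - xi) ** 2 for yi, xi in zip(y, x)) for x in efficient_points]
    k_star = min(range(len(dists)), key=dists.__getitem__)
    return dists[k_star], k_star

# Toy 2-D example: three sampled efficient solutions and one noisy decision.
xs = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
loss, k = surrogate_loss((0.6, 0.45), xs)  # nearest point is xs[1]
```

In the algorithms below, this assignment step is what Algorithm 2 uses to pick a single subproblem instead of solving all K of them.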

3. ONLINE LEARNING FOR IMOP

In our online learning setting, noisy decisions become available to the learner one by one. Hence, the learning algorithm produces a sequence of hypotheses (θ_1, ..., θ_{T+1}). Here, T is the total number of rounds, θ_1 is an arbitrary initial hypothesis, and θ_t for t > 1 is the hypothesis chosen after seeing the (t−1)th decision. Let l(y_t, θ_t) denote the loss the learning algorithm suffers when it tries to predict y_t based on the previously observed decisions {y_1, ..., y_{t−1}}. The goal of the learner is to minimize the regret, which is the cumulative loss Σ_{t=1}^T l(y_t, θ_t) against the best possible loss when the whole batch of decisions is available. Formally, the regret is defined as

  R_T = Σ_{t=1}^T l(y_t, θ_t) − min_{θ∈Θ} Σ_{t=1}^T l(y_t, θ).

Unlike most online learning problems, which assume the loss function to be smooth (Shalev-Shwartz, 2011; Hazan, 2016), neither l(y, θ) nor l_K(y, θ) is necessarily smooth in our setting, due to the structures of X_E(θ) and ∪_{k∈[K]} S(w_k, θ). Thus, the popular gradient-based online learning algorithms (Bottou, 1999; Kulis & Bartlett, 2010) fail, and our problem is significantly more difficult than most of them. To address this challenge, two online learning algorithms are developed in the next section.

3.1. ONLINE IMPLICIT UPDATES

Upon receiving the tth noisy decision y_t, the ideal way to update θ_{t+1} is to solve the following optimization problem with the ideal loss function:

  θ_{t+1} = arg min_{θ∈Θ} ½∥θ − θ_t∥²_2 + η_t l(y_t, θ),   (2)

where η_t is the learning rate in round t and l(y_t, θ) is defined in (loss function). As explained in the previous section, l(y_t, θ) might not be computable because no closed form of the efficient set X_E(θ) exists. Thus, we approximate update (2) by

  θ_{t+1} = arg min_{θ∈Θ} ½∥θ − θ_t∥²_2 + η_t l_K(y_t, θ),   (3)

where l_K(y_t, θ) is defined in (surrogate loss). Update (3) approximates (2) and seeks to balance the tradeoff between "conservativeness" and "correctiveness": the first term characterizes how conservative we are in maintaining the current estimate, while the second term indicates how corrective we would like to be with the new estimate. As no closed form for θ_{t+1} exists in general, this update method is an implicit approach.

Algorithm 1 Online Learning for IMOP
1: Input: noisy decisions {y_t}_{t∈[T]}, weights {w_k}_{k∈[K]}
2: Initialize θ_1 = 0
3: for t = 1 to T do
4:   receive y_t
5:   suffer loss l_K(y_t, θ_t)
6:   if l_K(y_t, θ_t) = 0 then
7:     θ_{t+1} ← θ_t
8:   else
9:     set learning rate η_t ∝ 1/√t
10:    update θ_{t+1} by solving (3) directly (or, equivalently, solving the K subproblems (4))
11:  end if
12: end for

Algorithm 2 Accelerated Online Learning for IMOP
1: Input: noisy decisions {y_t}_{t∈[T]}, weights {w_k}_{k∈[K]}
2: Initialize θ_1 = 0
3: for t = 1 to T do
4:   receive y_t
5:   suffer loss l_K(y_t, θ_t)
6:   let k* = arg min_{k∈[K]} ∥y_t − x_k∥²_2, where x_k ∈ S(w_k, θ_t) for k ∈ [K]
7:   if l_K(y_t, θ_t) = 0 then
8:     θ_{t+1} ← θ_t
9:   else
10:    set learning rate η_t ∝ 1/√t
11:    update θ_{t+1} by solving the single subproblem (4) associated with w_{k*}
12:  end if
13: end for

To solve (3), we can replace x_k ∈ S(w_k, θ) by its KKT conditions for each k ∈ [K]:

  min  ½∥θ − θ_t∥²_2 + η_t Σ_{k∈[K]} ∥y_t − ϑ_k∥²_2
  s.t.  θ ∈ Θ,
        g(x_k) ≤ 0, u_k ≥ 0, u_k^T g(x_k) = 0,  ∀k ∈ [K],
        ∇_{x_k} w_k^T f(x_k, θ) + u_k ∘ ∇_{x_k} g(x_k) = 0,  ∀k ∈ [K],
        0 ≤ ϑ_k ≤ M_k z_k,  ∀k ∈ [K],
        x_k − M_k(1 − z_k) ≤ ϑ_k ≤ x_k,  ∀k ∈ [K],
        Σ_{k∈[K]} z_k = 1,
        x_k ∈ R^n, u_k ∈ R^q_+, z_k ∈ {0,1},  ∀k ∈ [K],

where u_k is the dual variable for g(x_k, θ) ≤ 0, and M_k is a big number used to linearize the product z_k x_k. Alternatively, solving (3) is equivalent to solving the K independent programs defined below and taking the one with the least optimal value (breaking ties arbitrarily):

  min_{θ∈Θ}  ½∥θ − θ_t∥²_2 + η_t∥y_t − x∥²_2  s.t.  x ∈ S(w_k, θ).   (4)

Our application of the implicit update rule to learn an MOP proceeds as outlined in Algorithm 1.

Remark 3.1. (i) When choosing (4) to update θ_{t+1}, we can compute the K independent problems (4) in parallel, which dramatically improves computational efficiency. (ii) After the completion of Algorithm 1, we can allocate every y_t to the w_k that minimizes l_K(y_t, θ_{T+1}), which provides an inference on the distribution of weights of the component functions f_l(x, θ) over human subjects.

Acceleration of Algorithm 1: Note that update (3) determines θ and the weight sample assigned to y_t simultaneously; i.e., both θ and the weight index k are variables when solving (3). In other words, one needs to solve K subproblems (4) to obtain an optimal solution of (3). However, the increment of θ induced by (3) is typically small in each update. Consequently, the weight sample assigned to y_t using θ_{t+1} is roughly the same as that using the previous estimate θ_t. Hence, it is reasonable to approximate (3) by first assigning a weight sample to y_t based on the previous estimate. Then, instead of computing K problems of the form (4), we compute only the single one associated with the selected weight sample, which significantly eases the burden of solving (3). Our application of the accelerated implicit update rule proceeds as outlined in Algorithm 2.
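For intuition, the implicit update admits a closed form in simple cases. The following Python sketch (a toy one-dimensional instance of our own design, not the paper's experiments) runs Algorithm 2 on the bi-objective problem f_1(x, θ) = (x − θ)², f_2(x, θ) = (x − θ − 1)², whose efficient set is [θ, θ + 1] and whose (PWS) solution is S(w, θ) = θ + w_2; step 11 then reduces to minimizing ½(θ' − θ_t)² + η_t(y_t − θ' − w_2^{k*})² over θ', which is solvable in closed form:

```python
import random

random.seed(0)
theta_true, T, K = 2.0, 500, 21
# Evenly sampled weights from W2+; only the second component w2 matters here,
# since S(w, theta) = theta + w2 for this toy problem.
w2s = [(k + 1) / (K + 1) for k in range(K)]

theta = 0.0  # theta_1 = 0, as in Algorithms 1 and 2
for t in range(1, T + 1):
    # Noisy decision: an efficient point theta_true + w (w ~ U[0,1]) plus noise.
    y = theta_true + random.random() + random.uniform(-0.1, 0.1)
    # Step 6: assign y to the nearest sampled efficient solution under theta_t.
    k_star = min(range(K), key=lambda k: (y - theta - w2s[k]) ** 2)
    # Step 11: implicit update restricted to the k*-th subproblem. Minimizing
    # 0.5*(th - theta)^2 + eta*(y - th - w2)^2 over th gives the closed form:
    eta = 1.0 / t ** 0.5
    theta = (theta + 2 * eta * (y - w2s[k_star])) / (1 + 2 * eta)

# After T rounds the estimate should be close to theta_true despite the noise.
```

In the paper's general setting no such closed form exists, which is why each update solves a mixed-integer program (or K convex subproblems) instead.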

Mini-batches

We enhance online learning by considering multiple observations per update (Bottou & Cun, 2004). In online IMOP, this means computing θ_{t+1} using a batch N_t of |N_t| > 1 decisions:

  θ_{t+1} = arg min_{θ∈Θ} ½∥θ − θ_t∥²_2 + (η_t/|N_t|) Σ_{i∈N_t} l_K(y_i, θ).   (5)

However, we should point out that mini-batches might not be suitable here, as update (5) is drastically more difficult to compute, even for |N_t| = 2, than update (3) with a single observation: each decision in the batch carries its own assignment among the K weight samples, so the number of assignment combinations grows as K^{|N_t|}.

3.2. ANALYSIS OF CONVERGENCE

Note that the proposed online learning algorithms are generally applicable to learning the parameters of any convex MOP. In this section, we show that the average regret converges at a rate of O(1/√T) under certain regularity conditions based on the ideal loss function l(y, θ); namely, we consider the regret bound when using the ideal implicit update rule (2). We first introduce a few assumptions that are standard in the literature (Keshavarz et al., 2011).

Assumption 3.1. (a) For each θ ∈ Θ, the feasible region X(θ) is closed and bounded; namely, there exists B > 0 such that ∥x∥_2 ≤ B for all x ∈ X(θ). The support Y of the noisy decisions y is contained within a ball of radius R < ∞ almost surely; in other words, P(∥y∥_2 ≤ R) = 1. (b) Each function in f is strongly convex on R^n; that is, for each l ∈ [p], there exists λ_l > 0 such that for all x, y ∈ R^n,

  (∇f_l(y, θ_l) − ∇f_l(x, θ_l))^T (y − x) ≥ λ_l∥x − y∥²_2.

Regarding Assumption 3.1(a), assuming that the feasible region is closed and bounded is very common in inverse optimization, and the finite support of the observations is needed so that outliers do not have too much impact on the learning. Let λ = min_{l∈[p]} λ_l. It follows that w^T f(x, θ) is strongly convex with parameter λ for every w ∈ W_p; therefore, Assumption 3.1(b) ensures that S(w, θ) is a singleton for each w. The performance of the algorithm also depends on how changes in θ affect the objective values. For all w ∈ W_p and θ_1, θ_2 ∈ Θ, consider the function h(x, w, θ_1, θ_2) = w^T f(x, θ_1) − w^T f(x, θ_2).

Assumption 3.2. There exists κ > 0 such that for all w ∈ W_p, h(·, w, θ_1, θ_2) is Lipschitz continuous on Y with modulus κ∥θ_1 − θ_2∥_2; that is,

  |h(x, w, θ_1, θ_2) − h(y, w, θ_1, θ_2)| ≤ κ∥θ_1 − θ_2∥_2 ∥x − y∥_2,  ∀x, y ∈ Y.

Basically, this assumption says that the objective functions do not change much when either the parameter θ or the variable x is perturbed. It holds in many common situations, including multiobjective linear programs and multiobjective quadratic programs.
From now on, given any y ∈ Y and θ ∈ Θ, we denote by x(θ) the efficient point in X_E(θ) that is closest to y; namely, l(y, θ) = ∥y − x(θ)∥²_2.

Lemma 3.1. Under Assumptions 3.1-3.2, the loss function l(y, θ) is uniformly (4(B + R)κ/λ)-Lipschitz continuous in θ; that is, for all y ∈ Y and θ_1, θ_2 ∈ Θ,

  |l(y, θ_1) − l(y, θ_2)| ≤ (4(B + R)κ/λ)∥θ_1 − θ_2∥_2.

The key point in proving Lemma 3.1 is the observation that the perturbation of S(w, θ) due to θ is bounded by the perturbation of θ, by applying Proposition 6.1 in Bonnans & Shapiro (1998). Details of the proof are given in the Appendix.

Assumption 3.3. For the MOP, for all y ∈ Y, θ_1, θ_2 ∈ Θ, and α, β ≥ 0 with α + β = 1, either of the following holds: (a) if x_1 ∈ X_E(θ_1) and x_2 ∈ X_E(θ_2), then αx_1 + βx_2 ∈ X_E(αθ_1 + βθ_2); (b) ∥αx(θ_1) + βx(θ_2) − x(αθ_1 + βθ_2)∥_2 ≤ αβ∥x(θ_1) − x(θ_2)∥²_2 / (2(B + R)).

Here x(θ_1), x(θ_2), and x(αθ_1 + βθ_2) are as defined before Lemma 3.1. This assumption requires that the convex combination of x_1 ∈ X_E(θ_1) and x_2 ∈ X_E(θ_2) belong to X_E(αθ_1 + βθ_2), or that there exist an efficient point in X_E(αθ_1 + βθ_2) close to the convex combination of x(θ_1) and x(θ_2). Examples are given in the Appendix.

Let θ* be an optimal inference to min_{θ∈Θ} Σ_{t∈[T]} l(y_t, θ), i.e., an inference derived with the whole batch of observations available. The following theorem asserts that, under the above assumptions, the regret R_T = Σ_{t∈[T]} (l(y_t, θ_t) − l(y_t, θ*)) of the online learning algorithm is O(√T).

Theorem 3.2. Suppose Assumptions 3.1-3.3 hold. Then, choosing η_t = (Dλ / (2√2 (B + R)κ)) · (1/√t), we have

  R_T ≤ (4√2 (B + R)Dκ / λ) √T.

We establish the above regret bound by extending Theorem 3.2 in Kulis & Bartlett (2010). Our extension involves several critical and complicated analyses of the structure of the optimal solution set S(w, θ) as well as of the loss function, which are essential to our theoretical understanding.
Moreover, we relax the requirement of smoothness of the loss function to Lipschitz continuity through an argument similar to Lemma 1 of Wang et al. (2017) and Duchi et al. (2011).
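For the reader's convenience, the √T rate ultimately rests on the elementary estimate below, which the Appendix proof uses implicitly when summing the step sizes η_t ∝ 1/√t:

```latex
% Comparing the sum with the integral of t^{-1/2}:
\sum_{t=1}^{T} \frac{1}{\sqrt{t}}
  \;\le\; 1 + \int_{1}^{T} \frac{dt}{\sqrt{t}}
  \;=\; 2\sqrt{T} - 1
  \;\le\; 2\sqrt{T}.
```

Both the accumulated per-round terms and the telescoping term in the proof of Theorem 3.2 are controlled through this bound.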

4. EXPERIMENTS

In this section, we provide a multiobjective quadratic program (MQP) and a portfolio optimization problem to illustrate the performance of the proposed online learning Algorithms 1 and 2. The mixed integer second order conic programs (MISOCPs), which arise from using the KKT conditions in (3), are solved with Gurobi (Gurobi Optimization, 2016). All algorithms are programmed in Julia (Bezanson et al., 2017). The experiments were run on an Intel(R) Xeon(R) E5-1620 processor with a 3.60GHz CPU and 32 GB RAM.

4.1. SYNTHETIC DATA: LEARNING THE PREFERENCES AND RESTRICTIONS FOR AN MQP

Consider the following multiobjective quadratic optimization problem:

  min_{x∈R²_+}  { f_1(x) = ½ x^T Q_1 x + c_1^T x,  f_2(x) = ½ x^T Q_2 x + c_2^T x }  s.t.  Ax ≤ b,

where the parameters of the objective functions and constraints are provided in the Appendix. Suppose there are T decision makers; in each round, the learner receives one noisy decision. Her goal is to learn the objective functions or restrictions of these decision makers. In round t, we suppose that the decision maker derives an efficient solution x_t by solving (PWS) with a weight w_t drawn uniformly from W_2. The learner then receives the noisy decision y_t = x_t + ε_t, where the noise ε_t has a jointly uniform distribution with support [−0.5, 0.5]², i.e., each element of ε_t ∼ U(−0.5, 0.5).

Learning the objective functions. In the first set of experiments, the learner seeks to learn c_1 and c_2 given the noisy decisions that arrive sequentially over T rounds. We assume that c_1 lies in [1, 6]², c_2 lies in [−6, −1]², T = 1000 rounds of noisy decisions are generated, and K = 41 weights are evenly sampled from W_2. The learning rate is set to η_t = 5/√t. We then run Algorithms 1 and 2; at each round t, we solve (4) using parallel computing with 6 workers. To assess the performance of the algorithms statistically, we run 100 repetitions of the experiment. Figure 1a shows the total estimation error of c_1 and c_2 in each round over the 100 repetitions for the two algorithms, along with the average estimation error. As can be seen, convergence for both algorithms is quite fast. Moreover, the estimation errors across repetitions concentrate around the average, indicating that our algorithms are robust to noise. The estimation error in the last round is not zero because we use a finite K to approximate the efficient set.
We see in Figure 1b that Algorithm 2 is much faster than Algorithm 1, especially when K is large. To further illustrate the performance of the algorithms, we randomly pick one repetition of Algorithm 1 and plot the estimated efficient set in Figure 1c; it almost coincides with the real efficient set. Moreover, Figure 1d shows that IMOP in the online setting is drastically faster than in the batch setting, making the batch algorithms practically impossible to apply in real-world applications.

Learning the right-hand side. In the second set of experiments, the learner seeks to learn b given the noisy decisions that arrive sequentially over T rounds. We assume that b lies in [−10, 10]², T = 1000 rounds of noisy decisions are generated, and K = 81 weights are evenly sampled from W_2. The learning rate is set to η_t = 5/√t. We again run 100 repetitions of Algorithms 1 and 2. Figure 2a shows the estimation error of b in each round over the 100 repetitions, along with the average estimation error. As before, convergence for both algorithms is quite fast, and Figure 2b shows that Algorithm 2 is much faster than Algorithm 1.
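The data-generation process y_t = x_t + ε_t described above can be sketched in a few lines of Python. This is a hedged, simplified variant of our own: we drop the constraints Ax ≤ b and pick illustrative diagonal Q_l and c_l (not the paper's values), so that (PWS) reduces to a linear system:

```python
import random

random.seed(1)

# Toy unconstrained bi-objective QP: f_l(x) = 0.5 x^T Q_l x + c_l^T x, l = 1, 2.
# These particular Q_l, c_l are illustrative choices, not the paper's values.
Q1, c1 = [[2.0, 0.0], [0.0, 1.0]], [1.0, 2.0]
Q2, c2 = [[1.0, 0.0], [0.0, 2.0]], [-3.0, -1.0]

def solve_pws(w1, w2):
    """Efficient solution for weight (w1, w2): (w1*Q1 + w2*Q2) x = -(w1*c1 + w2*c2)."""
    # Q1, Q2 are diagonal here, so the 2x2 system decouples coordinate-wise.
    x = []
    for i in range(2):
        q = w1 * Q1[i][i] + w2 * Q2[i][i]
        c = w1 * c1[i] + w2 * c2[i]
        x.append(-c / q)
    return x

# Generate T noisy decisions y_t = x_t + eps_t, eps_t uniform on [-0.5, 0.5]^2.
T = 1000
decisions = []
for _ in range(T):
    w1 = random.random()          # w_t drawn uniformly from W_2 (w2 = 1 - w1)
    x = solve_pws(w1, 1.0 - w1)
    y = [xi + random.uniform(-0.5, 0.5) for xi in x]
    decisions.append(y)
```

With constraints present, the same pipeline would call a QP solver in place of the closed-form `solve_pws`.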

4.2. REAL-WORLD CASE: LEARNING EXPECTED RETURNS IN PORTFOLIO OPTIMIZATION

We next consider noisy decisions arising from different investors in a stock market. More precisely, we consider a portfolio selection problem, where investors need to determine the fraction of their wealth to invest in each security so as to maximize the total return and minimize the total risk. The process typically involves cooperation between an investor and a portfolio analyst: the analyst provides an efficient frontier on a certain set of securities, and the investor then selects a portfolio according to her preference over returns and risks. Analysts use the classical Markowitz mean-variance portfolio selection model (Markowitz, 1952):

  min  { f_1(x) = −r^T x,  f_2(x) = x^T Q x }
  s.t.  0 ≤ x_i ≤ b_i, ∀i ∈ [n],  Σ_{i=1}^n x_i = 1,

where r ∈ R^n_+ is the vector of individual securities' expected returns, Q ∈ R^{n×n} is the covariance matrix of the securities' returns, x is a portfolio specifying the proportions of capital to be invested in the different securities, and b_i is an upper bound on the proportion of security i, for all i ∈ [n].

Dataset: The dataset is derived from monthly total returns of 30 stocks from a blue-chip index that tracks the performance of the top 30 stocks in a market whose total investment universe consists of thousands of assets. The true expected returns and true return covariance matrix for the first 8 securities, as well as details for generating the portfolios, are given in the Appendix. The portfolios on the efficient frontier are plotted in Figure 3a. The learning rate is set to η_t = 5/√t, and at each round t we solve (4) using parallel computing. Table 1 lists the estimation error and estimated expected returns for different K. The estimation error decreases as K increases, indicating a better approximation of the efficient set with larger K. We also plot the efficient frontier estimated with the learned r for K = 41 in Figure 3a.
We can see that the estimated efficient frontier is very close to the real one, showing that our algorithm works quite well in learning expected returns in portfolio optimization. We also plot our estimate of the distribution of the weight of f_1(x) among the 1000 decision makers. As shown in Figure 3b, the distribution is roughly normal; we apply Chi-square goodness-of-fit tests to support this hypothesis. Table 1 (excerpt): ∥r − r_true∥_2 = 0.1270, 0.1270, 0.0420, 0.0091 across the increasing values of K.
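The weighted-sum scalarization of the Markowitz model can be traced explicitly in the two-asset case. The following Python sketch (illustrative returns and covariance of our own, not the paper's blue-chip data) substitutes x_2 = 1 − x_1, which turns each (PWS) instance into a one-dimensional quadratic with clipping, and sweeps the weight to trace the efficient frontier:

```python
# Weighted-sum scalarization of two-asset Markowitz mean-variance selection:
#   min over x:  -w1 * r^T x + w2 * x^T Q x,  x1 + x2 = 1,  0 <= x_i <= 1.
# Substituting x2 = 1 - x1 gives a 1-D quadratic a*x1^2 + b*x1 + const.
r = (0.10, 0.05)                      # expected returns (hypothetical)
Q = ((0.04, 0.01), (0.01, 0.02))      # return covariance (hypothetical)

def efficient_portfolio(w1):
    """Return (expected return, variance) of the (PWS) solution for weight w1."""
    w2 = 1.0 - w1
    a = w2 * (Q[0][0] - 2 * Q[0][1] + Q[1][1])
    b = -w1 * (r[0] - r[1]) + 2 * w2 * (Q[0][1] - Q[1][1])
    # Unconstrained minimizer -b/(2a), clipped to [0, 1]; w2 = 0 means pure
    # return maximization, which puts all weight on the higher-return asset.
    x1 = min(1.0, max(0.0, -b / (2 * a))) if a > 0 else 1.0
    x = (x1, 1.0 - x1)
    ret = r[0] * x[0] + r[1] * x[1]
    risk = sum(Q[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
    return ret, risk

# Sweeping w1 from risk-averse (0) to return-seeking (1) traces the frontier.
frontier = [efficient_portfolio(i / 20) for i in range(21)]
```

Along the sweep both the expected return and the risk are nondecreasing, which is exactly the shape of the efficient frontier in Figure 3a.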

5. CONCLUSION AND FUTURE WORK

In this paper, an online learning method to infer the parameters of multiobjective optimization problems from noisy observations is developed and implemented. We prove that this framework converges at a rate of O(1/√T) under suitable conditions. Nonetheless, as shown in multiple experiments on both synthetic and real-world datasets, even when these conditions are not satisfied, we still observe fast convergence and strong robustness to noisy observations. Thus, it would be interesting to analyze to what extent these conditions can be relaxed. We also note that our model naturally fits the contextual bandit setting: the decision y_t observed at time t is the context, the learner takes the action θ_t, and the loss is jointly determined by the context and the action. Compared to the vast majority of the literature surveyed in Zhou (2015), the main technical difficulty in our model is setting an appropriate reward (loss) given the context (y_t) and the action (θ_t); intuitively, we set the loss as the difference between the context y_t and another context generated by the action. Motivated by this observation, one direction for future work is to integrate classical contextual bandit algorithms into our model. In particular, we think that algorithms without the linear realizability assumption (that the reward is linear in the context), such as KernelUCB, might fit our problem well.

A APPENDIX

A.1 OMITTED MATHEMATICAL REFORMULATIONS

Before giving the reformulations, we first discuss the surrogate loss function. Note that

  l_K(y, θ) = min_{z_k∈{0,1}} ∥y − Σ_{k∈[K]} z_k x_k∥²_2 = min_{z_k∈{0,1}} Σ_{k∈[K]} ∥y − z_k x_k∥²_2 − (K − 1)∥y∥²_2,

where x_k ∈ S(w_k, θ) and Σ_{k∈[K]} z_k = 1. Since (K − 1)∥y∥²_2 is a constant, we can safely drop it and use the following as the surrogate loss function when solving the optimization program in the implicit update:

  l_K(y, θ) = min_{z_k∈{0,1}} Σ_{k∈[K]} ∥y − z_k x_k∥²_2,  where x_k ∈ S(w_k, θ) and Σ_{k∈[K]} z_k = 1.

A.1.1 SINGLE LEVEL REFORMULATION FOR THE INVERSE MULTIOBJECTIVE OPTIMIZATION PROBLEM

The parametrized multiobjective optimization problem is

  min_{x∈R^n} f(x, θ)  s.t.  g(x) ≤ 0,   (MOP)

where f(x, θ) = (f_1(x, θ), f_2(x, θ), ..., f_p(x, θ))^T and g(x) = (g_1(x), ..., g_q(x))^T. The single level reformulation of the implicit update (3) is

  min  ½∥θ − θ_t∥²_2 + η_t Σ_{k∈[K]} ∥y_t − ϑ_k∥²_2
  s.t.  θ ∈ Θ,
        g(x_k) ≤ 0, u_k ≥ 0, u_k^T g(x_k) = 0,  ∀k ∈ [K],
        ∇_{x_k} w_k^T f(x_k, θ) + u_k ∘ ∇_{x_k} g(x_k) = 0,  ∀k ∈ [K],
        0 ≤ ϑ_k ≤ M_k z_k,  ∀k ∈ [K],
        x_k − M_k(1 − z_k) ≤ ϑ_k ≤ x_k,  ∀k ∈ [K],
        Σ_{k∈[K]} z_k = 1,
        x_k ∈ R^n, u_k ∈ R^q_+, z_k ∈ {0,1},  ∀k ∈ [K].

A.1.2 SINGLE LEVEL REFORMULATION FOR THE INVERSE MULTIOBJECTIVE QUADRATIC PROBLEM

When the objective functions are quadratic and the feasible region is a polyhedron, the multiobjective optimization problem has the form

  min_{x∈R^n}  ( ½ x^T Q_1 x + c_1^T x, ..., ½ x^T Q_p x + c_p^T x )^T  s.t.  Ax ≥ b

MQP

where Q_l ∈ S^n_+ (the set of symmetric positive semidefinite matrices) for all l ∈ [p]. When learning {c_l}_{l∈[p]}, the single level reformulation of the implicit update is

  min  ½ Σ_{l∈[p]} ∥c_l − c_l^t∥²_2 + η_t Σ_{k∈[K]} ∥y_t − ϑ_k∥²_2
  s.t.  c_l ∈ C_l,  ∀l ∈ [p],
        Ax_k ≥ b, u_k ≥ 0,  ∀k ∈ [K],
        u_k ≤ M t_k,  Ax_k − b ≤ M(1 − t_k),  ∀k ∈ [K],
        (w_k^1 Q_1 + ··· + w_k^p Q_p) x_k + w_k^1 c_1 + ··· + w_k^p c_p − A^T u_k = 0,  ∀k ∈ [K],
        0 ≤ ϑ_k ≤ M_k z_k,  ∀k ∈ [K],
        x_k − M_k(1 − z_k) ≤ ϑ_k ≤ x_k,  ∀k ∈ [K],
        Σ_{k∈[K]} z_k = 1,
        x_k ∈ R^n, u_k ∈ R^m_+, t_k ∈ {0,1}^m, z_k ∈ {0,1},  ∀k ∈ [K],

where c_l^t is the estimate of c_l at round t, C_l is a convex set for each l ∈ [p], and the binary vector t_k linearizes the complementarity condition u_k^T(Ax_k − b) = 0. We have a similar single level reformulation when learning the right-hand side b. Clearly, this is a mixed integer second order cone program (MISOCP) whether learning c_l or b.
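The surrogate-loss identity used at the beginning of A.1 can be verified term by term; with exactly one z_{k*} = 1 and all other z_k = 0:

```latex
\sum_{k\in[K]} \bigl\lVert y - z_k x_k \bigr\rVert_2^2
  = \lVert y - x_{k^*} \rVert_2^2 + (K-1)\,\lVert y \rVert_2^2
  = \Bigl\lVert y - \sum_{k\in[K]} z_k x_k \Bigr\rVert_2^2 + (K-1)\,\lVert y \rVert_2^2 ,
```

so the two formulations of l_K(y, θ) differ only by the constant (K − 1)∥y∥²_2, which does not affect the argmin of the implicit update.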

A.2 OMITTED PROOFS

A.2.1 STRONG CONVEXITY OF $w^T f(x,\theta)$ AS STATED UNDER ASSUMPTION 3.1

Proof. By the definition of $\lambda$,
$$\big(\nabla w^T f(y,\theta) - \nabla w^T f(x,\theta)\big)^T (y-x) = \Big(\nabla \sum_{l=1}^p w_l f_l(y,\theta_l) - \nabla \sum_{l=1}^p w_l f_l(x,\theta_l)\Big)^T (y-x) = \sum_{l=1}^p w_l \big(\nabla f_l(y,\theta_l) - \nabla f_l(x,\theta_l)\big)^T (y-x) \ge \sum_{l=1}^p w_l \lambda_l \|x-y\|_2^2 \ge \lambda \|x-y\|_2^2 \sum_{l=1}^p w_l = \lambda \|x-y\|_2^2,$$
since $\lambda = \min_{l\in[p]}\lambda_l$ and $\sum_{l=1}^p w_l = 1$. Thus, $w^T f(x,\theta)$ is strongly convex in $x$ on $\mathbb{R}^n$.

A.2.2 PROOF OF LEMMA 3.1

Proof. By Assumption 3.1(b), $S(w,\theta)$ is a singleton for each $w \in W_p$. Thus, for all $y \in Y$ and $\theta_1, \theta_2 \in \Theta$, there exist $w_1, w_2 \in W_p$ such that $x(\theta_1) = S(w_1, \theta_1)$ and $x(\theta_2) = S(w_2, \theta_2)$. Without loss of generality, let $l_K(y,\theta_1) \ge l_K(y,\theta_2)$. Then,
$$|l_K(y,\theta_1) - l_K(y,\theta_2)| = l_K(y,\theta_1) - l_K(y,\theta_2) = \|y - x(\theta_1)\|_2^2 - \|y - x(\theta_2)\|_2^2 = \|y - S(w_1,\theta_1)\|_2^2 - \|y - S(w_2,\theta_2)\|_2^2 \le \|y - S(w_2,\theta_1)\|_2^2 - \|y - S(w_2,\theta_2)\|_2^2 = \big\langle S(w_2,\theta_2) - S(w_2,\theta_1),\; 2y - S(w_2,\theta_1) - S(w_2,\theta_2)\big\rangle \le 2(B+R)\,\|S(w_2,\theta_2) - S(w_2,\theta_1)\|_2. \tag{6}$$
The last inequality follows from the Cauchy–Schwarz inequality and Assumption 3.1(a), which give
$$\|2y - S(w_2,\theta_1) - S(w_2,\theta_2)\|_2 \le 2(B+R). \tag{7}$$
Next, we apply Proposition 6.1 in Bonnans & Shapiro (1998) to bound $\|S(w_2,\theta_2) - S(w_2,\theta_1)\|_2$. Under Assumptions 3.1–3.2, the conditions of that proposition are satisfied. Therefore,
$$\|S(w_2,\theta_2) - S(w_2,\theta_1)\|_2 \le \frac{2\kappa}{\lambda}\,\|\theta_1 - \theta_2\|_2. \tag{8}$$
Plugging equation 7 and equation 8 into equation 6 yields the claim.

A.2.3 PROOF OF THEOREM 3.2

Proof. We extend Theorem 3.2 in Kulis & Bartlett (2010) to prove our theorem. Let $G_t(\theta) = \frac{1}{2}\|\theta - \theta_t\|_2^2 + \eta_t\, l(y_t,\theta)$. The first step is to show that if Assumption 3.3 holds, then the loss function $l(y,\theta)$ is convex in $\theta$. First, suppose Assumption 3.3(a) holds.
Then, for $\alpha, \beta \ge 0$ with $\alpha + \beta = 1$,
$$\alpha l(y,\theta_1) + \beta l(y,\theta_2) - l(y, \alpha\theta_1 + \beta\theta_2) = \alpha\|y - x(\theta_1)\|_2^2 + \beta\|y - x(\theta_2)\|_2^2 - \|y - x(\alpha\theta_1+\beta\theta_2)\|_2^2 \ge \alpha\|y - x(\theta_1)\|_2^2 + \beta\|y - x(\theta_2)\|_2^2 - \|y - \alpha x(\theta_1) - \beta x(\theta_2)\|_2^2 \quad \text{(by Assumption 3.3(a))} = \alpha\beta\,\|x(\theta_1) - x(\theta_2)\|_2^2 \ge 0.$$
Second, suppose Assumption 3.3(b) holds. Then,
$$\begin{aligned}
\alpha l(y,\theta_1) + \beta l(y,\theta_2) - l(y,\alpha\theta_1+\beta\theta_2) &= \alpha\|y-x(\theta_1)\|_2^2 + \beta\|y-x(\theta_2)\|_2^2 - \|y - \alpha x(\theta_1) - \beta x(\theta_2)\|_2^2 \\
&\quad + \|y - \alpha x(\theta_1) - \beta x(\theta_2)\|_2^2 - \|y - x(\alpha\theta_1+\beta\theta_2)\|_2^2 \\
&= \alpha\beta\,\|x(\theta_1)-x(\theta_2)\|_2^2 - \big\langle \alpha x(\theta_1)+\beta x(\theta_2) - x(\alpha\theta_1+\beta\theta_2),\; 2y - x(\alpha\theta_1+\beta\theta_2) - \alpha x(\theta_1) - \beta x(\theta_2)\big\rangle \\
&\ge \alpha\beta\,\|x(\theta_1)-x(\theta_2)\|_2^2 - \|\alpha x(\theta_1)+\beta x(\theta_2) - x(\alpha\theta_1+\beta\theta_2)\|_2\,\|2y - x(\alpha\theta_1+\beta\theta_2) - \alpha x(\theta_1) - \beta x(\theta_2)\|_2. \tag{10}
\end{aligned}$$
The last inequality is by the Cauchy–Schwarz inequality. Note that
$$\|\alpha x(\theta_1)+\beta x(\theta_2) - x(\alpha\theta_1+\beta\theta_2)\|_2\,\|2y - x(\alpha\theta_1+\beta\theta_2) - \alpha x(\theta_1) - \beta x(\theta_2)\|_2 \le 2(B+R)\,\|\alpha x(\theta_1)+\beta x(\theta_2) - x(\alpha\theta_1+\beta\theta_2)\|_2 \le \alpha\beta\,\|x(\theta_1)-x(\theta_2)\|_2^2 \quad \text{(by Assumption 3.3(b))}. \tag{11}$$
Plugging equation 11 into equation 10 yields the result.

Using Theorem 3.2 in Kulis & Bartlett (2010), for $\alpha_t \le \frac{G_t(\theta_{t+1})}{G_t(\theta_t)}$, we have
$$R_T \le \sum_{t=1}^T \Big[ (1-\alpha_t)\, l(y_t,\theta_t) + \frac{1}{2\eta_t}\big(\|\theta_t - \theta^*\|_2^2 - \|\theta_{t+1} - \theta^*\|_2^2\big) \Big]. \tag{12}$$
Notice that
$$G_t(\theta_t) - G_t(\theta_{t+1}) = \eta_t\big(l(y_t,\theta_t) - l(y_t,\theta_{t+1})\big) - \frac{1}{2}\|\theta_t - \theta_{t+1}\|_2^2 \le \frac{4(B+R)\kappa\eta_t}{\lambda}\,\|\theta_t - \theta_{t+1}\|_2 - \frac{1}{2}\|\theta_t - \theta_{t+1}\|_2^2 \le \frac{8(B+R)^2\kappa^2\eta_t^2}{\lambda^2}. \tag{13}$$
The first inequality follows by applying Lemma 3.1. Let $\alpha_t = \frac{G_t(\theta_{t+1})}{G_t(\theta_t)}$. Using equation 13, we have
$$(1-\alpha_t)\,\eta_t\, l(y_t,\theta_t) = (1-\alpha_t)\, G_t(\theta_t) = G_t(\theta_t) - G_t(\theta_{t+1}) \le \frac{8(B+R)^2\kappa^2\eta_t^2}{\lambda^2}. \tag{14}$$
Plugging equation 14 into equation 12 and noting the telescoping sum,
$$R_T \le \sum_{t=1}^T \frac{8(B+R)^2\kappa^2\eta_t}{\lambda^2} + \sum_{t=1}^T \frac{1}{2\eta_t}\big(\|\theta_t-\theta^*\|_2^2 - \|\theta_{t+1}-\theta^*\|_2^2\big).$$
Setting $\eta_t = \frac{D\lambda}{2(B+R)\kappa\sqrt{2t}}$, the second summation is bounded by $\frac{\sqrt{2}\,D(B+R)\kappa}{\lambda}\sqrt{T}$, since the sum telescopes, $\theta_1 = 0$, and $\|\theta^*\|_2 \le D$.
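With $\eta_t \propto 1/\sqrt{t}$, the remaining summation is a $\sum_t 1/\sqrt{t}$ series, which is controlled by the elementary bound $\sum_{t=1}^T 1/\sqrt{t} \le 2\sqrt{T} - 1$ (an integral comparison). A quick numerical check of that elementary bound (illustration only, not from the paper):

```python
import math

def harmonic_sqrt_sum(T):
    """Compute sum_{t=1}^T 1/sqrt(t), which appears when summing the
    step sizes eta_t ~ 1/sqrt(t) in the regret bound."""
    return sum(1.0 / math.sqrt(t) for t in range(1, T + 1))

# The integral comparison gives sum_{t=1}^T 1/sqrt(t) <= 1 + int_1^T dt/sqrt(t)
#                                                      = 2*sqrt(T) - 1.
for T in (1, 10, 100, 1000):
    assert harmonic_sqrt_sum(T) <= 2 * math.sqrt(T) - 1
```

This is why the first summation contributes $O(\sqrt{T})$ overall.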
The first sum simplifies using $\sum_{t=1}^T \frac{1}{\sqrt{t}} \le 2\sqrt{T} - 1$, which yields the result
$$R_T \le \frac{4\sqrt{2}(B+R)D\kappa}{\lambda}\,\sqrt{T}.$$

A.3 OMITTED EXAMPLES

A.3.1 EXAMPLES FOR WHICH ASSUMPTION 3.3 HOLDS

Consider for example the following quadratic program:
$$\min_{x\in\mathbb{R}^n} \; \big( x^T x - 2\theta_1^T x, \;\; x^T x - 2\theta_2^T x \big) \quad \text{s.t. } 0 \le x \le 10.$$
One can check that Assumption 3.3(a) is indeed satisfied. For example, let $n = 1$ and, without loss of generality, $\theta_1 \le \theta_2$. Then $X_E(\theta) = [\theta_1, \theta_2]$. Consider two parameters $\theta^1 = (\theta_1^1, \theta_2^1)$ and $\theta^2 = (\theta_1^2, \theta_2^2)$ in $[0,10]^2$. For all $\alpha \in [0,1]$,
$$X_E(\alpha\theta^1 + (1-\alpha)\theta^2) = \big[\alpha\theta_1^1 + (1-\alpha)\theta_1^2,\;\; \alpha\theta_2^1 + (1-\alpha)\theta_2^2\big].$$
Although tedious, one can check that Assumption 3.3(a) also holds for general $n$.

Note that the rate in Theorem A.1 exhibits a dependence on the number of objective functions $p$. As $p$ increases, we may require (approximately) exponentially more weight samples $\{w_k\}_{k\in[K]}$ to achieve a given approximation accuracy. This phenomenon is a reflection of the curse of dimensionality (Hastie et al., 2001), the principle that estimation becomes exponentially harder as the dimension increases; here, the dimension is the number of objective functions $p$. Naturally, one way to deal with the curse of dimensionality is to employ dimension-reduction techniques from statistics to find a low-dimensional representation of the objective functions.

Example A.1. When $p = 2$, MOP is a bi-objective decision making problem. Theorem A.1 then shows that $l_K(y,\theta) - l(y,\theta)$ is $O(1/K)$; that is, $l_K(y,\theta)$ converges to $l(y,\theta)$ sublinearly.

The evenly spaced weight samples satisfy
$$\min_{k\in[K]} \|w - w_k\|_2 \le \frac{\sqrt{2}}{\Lambda - 1} \quad \forall w \in W_p. \tag{16}$$
Here, $\Lambda$ is the number of evenly spaced weight samples between any two extreme points of $W_p$. Note that $\{w_k\}_{k\in[K]}$ are evenly sampled from $W_p$, and that the distance between any two extreme points of $W_p$ equals $\sqrt{2}$. Hence, the distances between any two neighboring weight samples are equal, and can be calculated as the distance between two extreme points of $W_p$ divided by $\Lambda - 1$.
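The bi-objective quadratic program of Example A.1 with $n = 1$ can be probed numerically: scalarizing with weight $w$ minimizes $w(x^2 - 2\theta_1 x) + (1-w)(x^2 - 2\theta_2 x)$, whose minimizer is $x^* = w\theta_1 + (1-w)\theta_2$, so sweeping $w$ traces out $X_E(\theta) = [\theta_1, \theta_2]$. A minimal sketch (the weight sweep and parameter values are illustrative, not from the paper):

```python
def weighted_solution(w, theta1, theta2):
    """argmin over [0, 10] of w*(x^2 - 2*theta1*x) + (1-w)*(x^2 - 2*theta2*x);
    the unconstrained minimizer is w*theta1 + (1-w)*theta2."""
    x = w * theta1 + (1 - w) * theta2
    return min(max(x, 0.0), 10.0)  # clamp to the feasible interval

def efficient_set(theta1, theta2, samples=101):
    """Approximate X_E(theta) by sweeping the scalarization weight w."""
    pts = [weighted_solution(k / (samples - 1), theta1, theta2)
           for k in range(samples)]
    return min(pts), max(pts)

# For theta1 <= theta2 in [0, 10], the efficient set is [theta1, theta2].
assert efficient_set(2.0, 6.0) == (2.0, 6.0)
```

Because the endpoints of $X_E$ are linear in $(\theta_1, \theta_2)$, the convex-combination identity required by Assumption 3.3(a) holds for this example.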
The proof of equation 16 follows by further noticing that the distance between any $w$ and its closest sample in $\{w_k\}_{k\in[K]}$ is upper bounded by the distance between two neighboring weight samples. Combining equation 15 and equation 16 yields
$$0 \le l_K(y,\theta) - l(y,\theta) \le \frac{4(B+R)\zeta}{\lambda} \cdot \frac{\sqrt{2p}}{\Lambda - 1}. \tag{17}$$
Next, we show that the total number of weight samples $K$ and $\Lambda$ satisfy
$$K = \binom{\Lambda + p - 2}{p - 1}. \tag{18}$$
The proof of equation 18 proceeds by induction on $p$. Equation 18 clearly holds when $p = 2$, since then $K = \Lambda$. Assume equation 18 holds for all cases up to $p - 1$. For ease of notation, denote
$$K_p^{\Lambda} = \binom{\Lambda + p - 2}{p - 1}.$$
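Equation 18 can be verified directly by enumeration: the evenly spaced weight samples are the vectors $w = m/(\Lambda-1)$ with $m \in \mathbb{Z}_{\ge 0}^p$ and $\sum_l m_l = \Lambda - 1$, and counting them is a stars-and-bars computation. A small sanity check (illustration only, not from the paper):

```python
from itertools import combinations_with_replacement
from math import comb

def count_grid_weights(p, lam):
    """Count weight samples w = m/(lam-1) on the simplex W_p, where m is a
    nonnegative integer vector with sum(m) = lam-1. Each sample corresponds
    to a multiset of size lam-1 drawn from the p coordinates."""
    return sum(1 for _ in combinations_with_replacement(range(p), lam - 1))

# Equation 18: K = C(lam + p - 2, p - 1)
for p in (2, 3, 4):
    for lam in (2, 3, 5, 8):
        assert count_grid_weights(p, lam) == comb(lam + p - 2, p - 1)
```

For $p = 2$ this reduces to $K = \Lambda$, matching the base case of the induction.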




Recall Assumption 3.1(a): $X(\theta)$ is closed and bounded, and has a nonempty relative interior.

Figure 1: Learning objective functions of an MQP over T = 1000 rounds. We run 100 repetitions of the experiment. Let c = [c_1, c_2]. (a) Estimation errors at each round t for all 100 repetitions, together with the average estimation error, for K = 41. (b) Blue and yellow bars indicate the average running times and standard deviations for each K using Algorithms 1 and 2, respectively. (c) For one randomly chosen repetition, the estimated efficient set after T = 1000 rounds (red line) against the real efficient set (yellow line). (d) The dotted brown line is the error-bar plot of the running time over 10 repetitions in the batch setting; the blue line is the error-bar plot of the running time over 100 repetitions in the online setting using Algorithm 1.

Figure 2: Learning the right-hand side of an MQP over T = 1000 rounds. We run 100 repetitions of the experiment. (a) Estimation errors at each round t for all 100 repetitions, together with the average estimation error, for K = 41. (b) Blue and yellow bars indicate the average running times and standard deviations for each K using Algorithms 1 and 2, respectively.

Figure 3: Learning the expected return of a portfolio optimization problem over T = 1000 rounds with K = 41. (a) The red line indicates the real efficient frontier; the blue dots indicate the efficient frontier estimated using the learned expected returns. (b) Each bar represents the proportion of the 1000 decision makers that have the corresponding weight on f_1(x).

Theorem A.1. Under Assumption 3.1, we have that $\forall y \in Y$, $\forall \theta \in \Theta$,
$$0 \le l_K(y,\theta) - l(y,\theta) \le \frac{4(B+R)\zeta}{\lambda} \cdot \frac{\sqrt{2p}}{\Lambda - 1},$$
where $K = \frac{(\Lambda + p - 2)!}{(\Lambda - 1)!\,(p-1)!}$ and $\zeta = \max_{l\in[p],\, x\in X(\theta),\, \theta\in\Theta} |f_l(x,\theta)|$. Furthermore, when $\Lambda \ge p$,
$$0 \le l_K(y,\theta) - l(y,\theta) \le \frac{16e(B+R)\zeta}{\lambda} \cdot \frac{1}{K^{1/(p-1)}};$$
that is, the surrogate loss function uniformly converges to the loss function at the rate $O(1/K^{\frac{1}{p-1}})$.

Proof. Let $w_k^y$ be a weight sample attaining $l_K(y,\theta) = \min_{k\in[K]} \min_{x\in S(w_k,\theta)} \|y - x\|_2^2$, and let $w^y$ be such that $\|y - S(w^y,\theta)\|_2^2 = \min_{x\in X_E(\theta)} \|y - x\|_2^2 = l(y,\theta)$. Let $w_{k'}^y$ be the closest weight sample among $\{w_k\}_{k\in[K]}$ to $w^y$. Then,
$$l_K(y,\theta) - l(y,\theta) = \|y - S(w_k^y,\theta)\|_2^2 - \|y - S(w^y,\theta)\|_2^2 \le \|y - S(w_{k'}^y,\theta)\|_2^2 - \|y - S(w^y,\theta)\|_2^2 = \big(2y - S(w_{k'}^y,\theta) - S(w^y,\theta)\big)^T \big(S(w^y,\theta) - S(w_{k'}^y,\theta)\big) \le \|2y - S(w_{k'}^y,\theta) - S(w^y,\theta)\|_2\, \|S(w^y,\theta) - S(w_{k'}^y,\theta)\|_2 \le 2(B+R)\, \|S(w^y,\theta) - S(w_{k'}^y,\theta)\|_2 \le \frac{4(B+R)\sqrt{p}\,\zeta}{\lambda}\, \|w^y - w_{k'}^y\|_2, \tag{15}$$
where $\zeta = \max_{l\in[p],\, x\in X(\theta),\, \theta\in\Theta} |f_l(x,\theta)|$. The third inequality is due to the Cauchy–Schwarz inequality, and under Assumption 3.1 we can apply Lemma 4 in Dong & Zeng (2018) to obtain the last inequality. Next, we show that for every $w \in W_p$, the distance between $w$ and its closest weight sample among $\{w_k\}_{k\in[K]}$ is upper bounded by a function of $K$ and $p$ only; more precisely, we establish equation 16.
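The covering property of the weight grid in equation 16 can be checked numerically for the bi-objective case $p = 2$, where the samples are $w_k = (k/(\Lambda-1),\, 1 - k/(\Lambda-1))$ and any $w$ is within $\sqrt{2}/(\Lambda-1)$ of some sample. A small sketch (random probe points are illustrative, not from the paper):

```python
import math
import random

def nearest_sample_dist(w1, lam):
    """Distance from w = (w1, 1-w1) on the 2-simplex to its nearest
    evenly spaced sample (k/(lam-1), 1 - k/(lam-1)), k = 0..lam-1."""
    best = float("inf")
    for k in range(lam):
        g1 = k / (lam - 1)
        best = min(best, math.hypot(w1 - g1, (1 - w1) - (1 - g1)))
    return best

random.seed(0)
lam = 11  # Lambda samples between the two extreme points of W_2
for _ in range(100):
    w1 = random.random()
    # Equation 16 for p = 2: distance at most sqrt(2)/(Lambda - 1)
    assert nearest_sample_dist(w1, lam) <= math.sqrt(2) / (lam - 1)
```

In fact the nearest sample is within half the neighboring spacing, so the bound in equation 16 is conservative by a factor of two.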


A.4 DATA FOR THE PORTFOLIO OPTIMIZATION PROBLEM

True expected returns used in the portfolio optimization problem (table omitted).


Then, for the case of $p$ objectives, note that the weight samples can be classified into two categories: those with $w_p = 0$ and those with $w_p > 0$. For $w_p = 0$, the number of weight samples is simply $K_{p-1}^{\Lambda}$; for $w_p > 0$, the number of weight samples is $K_p^{\Lambda-1}$. Thus,
$$K_p^{\Lambda} = K_{p-1}^{\Lambda} + K_p^{\Lambda-1}.$$
Iteratively expanding $K_p^{\Lambda-1}$ through the same argument and using Pascal's rule,
$$\binom{\Lambda+p-3}{p-2} + \binom{\Lambda+p-3}{p-1} = \binom{\Lambda+p-2}{p-1},$$
we have $K_p^{\Lambda} = \binom{\Lambda+p-2}{p-1}$. This completes the proof of equation 18. Furthermore, when $\Lambda \ge p$ (equivalently $K \ge 2^{p-1}$), a simple algebraic calculation gives
$$\frac{p}{\Lambda - 1} \le \frac{4e}{K^{1/(p-1)}}. \tag{21}$$
We complete the proof by combining equation 17 and equation 21 and noticing that $\sqrt{2p} \le p$ for $p \ge 2$.
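The recursion underlying the induction, $K_p^{\Lambda} = K_{p-1}^{\Lambda} + K_p^{\Lambda-1}$, is exactly Pascal's rule for the binomial coefficient in equation 18; a quick numerical check (illustration only, not from the paper):

```python
from math import comb

def K(p, lam):
    """Closed form of equation 18: K_p^Lambda = C(Lambda + p - 2, p - 1)."""
    return comb(lam + p - 2, p - 1)

# Recursion used in the induction: split samples by w_p = 0 vs. w_p > 0.
for p in range(3, 7):
    for lam in range(2, 10):
        assert K(p, lam) == K(p - 1, lam) + K(p, lam - 1)

# Base case p = 2: K = Lambda.
assert all(K(2, lam) == lam for lam in range(2, 10))
```

The closed form thus follows from repeated application of Pascal's rule to the two-category split of the weight samples.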

