MIN-MAX MULTI-OBJECTIVE BILEVEL OPTIMIZATION WITH APPLICATIONS IN ROBUST MACHINE LEARNING

Abstract

We consider a generic min-max multi-objective bilevel optimization problem with applications in robust machine learning such as representation learning and hyperparameter optimization. We design MORBiT, a novel single-loop gradient descent-ascent bilevel optimization algorithm, to solve the generic problem and present a novel analysis showing that MORBiT converges to a first-order stationary point at a rate of $O(n^{1/2} K^{-2/5})$ for a class of weakly convex problems with $n$ objectives upon $K$ iterations of the algorithm. Our analysis utilizes novel results to handle the non-smooth min-max multi-objective setup and to obtain a sublinear dependence on the number of objectives $n$. Experimental results on robust representation learning and robust hyperparameter optimization showcase (i) the advantages of considering the min-max multi-objective setup, and (ii) the convergence properties of the proposed MORBiT. Our code is at https://github.com/minimario/MORBiT.

1. INTRODUCTION

We begin by examining the classic bilevel optimization (BLO) problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}} f(x, y^\star(x)) \quad \text{subject to} \quad y^\star(x) \in \arg\min_{y \in \mathcal{Y} = \mathbb{R}^{d_y}} g(x, y), \tag{1}$$

where $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is the upper-level (UL) objective function and $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is the lower-level (LL) objective function. $\mathcal{X}$ and $\mathcal{Y}$, respectively, denote the domains for the UL and LL optimization variables $x$ and $y$, incorporating any respective constraints. Equation 1 is called BLO because the UL objective $f$ depends on both $x$ and the solution $y^\star(x)$ of the LL objective $g$. BLO is well-studied in the optimization literature (Bard, 2013; Dempe, 2002). Recently, stochastic BLO has found various applications in machine learning (Liu et al., 2021; Chen et al., 2022a), such as hyperparameter optimization (Franceschi et al., 2018), reinforcement learning or RL (Hong et al., 2020), multi-task representation learning (Arora et al., 2020), model compression (Zhang et al., 2022), adversarial attack generation (Zhao et al., 2022) and invariant risk minimization (Zhang et al., 2023).

In this work, we focus on a robust generalization of equation 1 to the multi-objective setting, where there are $n$ different objective function pairs $(f_i, g_i)$. Let $[n] \triangleq \{1, 2, \cdots, n\}$ and let $f_i : \mathcal{X} \times \mathcal{Y}_i \to \mathbb{R}$, $g_i : \mathcal{X} \times \mathcal{Y}_i \to \mathbb{R}$ denote the $i$-th UL and LL objectives respectively. We study the following problem:

$$\min_{x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}} \max_{i \in [n]} f_i(x, y_i^\star(x)) \quad \text{subject to} \quad y_i^\star(x) \in \arg\min_{y_i \in \mathcal{Y}_i = \mathbb{R}^{d_{y_i}}} g_i(x, y_i), \ \forall i \in [n]. \tag{2}$$

Here, the optimization variable $x$ is shared across all objectives $f_i, g_i$, $i \in [n]$, while the variables $y_i$, $i \in [n]$ are only involved in their corresponding objectives $f_i, g_i$. The goal is to find a robust solution $x \in \mathcal{X}$ such that the worst case across all objectives is minimized. This is a generic problem which reduces to equation 1 if we have a single objective pair, that is $n = 1$. Such a robust optimization problem is useful in various applications, and especially necessary in safety-critical ones.
For example, in decision optimization, the different objectives $(f_i, g_i)$ can correspond to different "scenarios" (such as plans for different scenarios), with $x$ being the shared decision variable and the $y_i$'s being scenario-specific decision variables. The goal of equation 2 is to find the shared decision $x$ which provides robust performance across all the $n$ considered scenarios, so that such a robust assignment of decision variables will generalize well to other scenarios. In machine learning, robust representation learning is important in object recognition and facial recognition where we desire robust worst-case performance across different groups of objects or different population demographics. In RL applications with multiple agents (Busoniu et al., 2006; Li et al., 2019; Gronauer & Diepold, 2022), our robust formulation in equation 2 would generate a shared model of the world (the UL variable $x$) such that the worst-case utility, $\max_i f_i(x, y_i^\star(x))$, of the agent-specific optimal action (the LL variable $y_i^\star(x)$) is optimized, ensuring robust performance across all agents.

An additional technical advantage of the general multi-objective problem in equation 2 is that it allows the objective-specific variables $y_i \in \mathcal{Y}_i$ to come from different domains, that is, $\mathcal{Y}_i \neq \mathcal{Y}_j$, $i, j \in [n]$; as stated in equation 2, this implies that the dimensionality $d_{y_i}$ of the per-objective $y_i$ need not be the same across all objectives. This allows for a larger class of problems where each objective can have a different number of objective-specific variables while we still require a robust shared variable $x$. For example, in multi-agent RL, different agents can have different action spaces because they need to operate in different mediums (land, water, air, etc.).
Focusing on stochastic objectives common in ML, the main contributions of this work are as follows:
▶ (New algorithm design) We present a single-loop Multi-Objective Robust Bilevel Two-timescale optimization algorithm, MORBiT, which uses (i) SGD for the unconstrained strongly convex LL problem, and (ii) projected SGD for the constrained weakly convex UL problem.
▶ (Theoretical convergence guarantees) We demonstrate that, under standard smoothness and regularity conditions, MORBiT with $n$ objectives converges to an $O(n^{1/2} K^{-2/5})$-stationary point with $K$ iterations, matching the best convergence rate for single-loop single-objective ($n = 1$) BLO algorithms with the constrained UL problem while using vanilla SGD for the LL problem, and providing a sublinear $n^{1/2}$-dependence on the number of objective pairs $n$.
▶ (Two sets of applications) We present two applications involving min-max multi-objective bilevel problems, robust representation learning and robust hyperparameter optimization (HPO), and demonstrate the effectiveness of our proposed algorithm MORBiT.

Paper Outline. In section 2, we further discuss the different aspects of the problem in equation 2 and compare it to the problems and solutions considered in existing literature. We present our novel algorithm, MORBiT, and analyse its convergence properties in section 3, and empirically evaluate it in section 4. We conclude with future directions in section 5.

2. PROBLEM AND RELATED WORK

We first discuss the different aspects of the robust multi-objective BLO problem with constrained UL in equation 2. While BLO is used in machine learning (Liu et al., 2021; Chen et al., 2022a), multi-objective BLO has not received much attention. In multi-task learning (MTL), the optimization problem is multi-objective in nature, but is usually solved by summing the objectives and using a single-objective solver, that is, optimizing the objective $\sum_i f_i$. The robust min-max extensions of MTL (Mehta et al., 2012; Collins et al., 2020) and RL (Li et al., 2019) have been shown to improve generalization performance, supporting the need for a more complex multi-objective optimization problem that replaces the objective $\sum_i f_i$ with the objective $\max_i f_i$.

For SGD-based solutions to stochastic BLO, one critical aspect is whether the algorithm is single-loop (a single update for both $x$ and $y$ in each iteration) or double-loop (multiple updates for the LL $y$ between each update of the UL $x$). Double-loop algorithms can have faster empirical convergence, but are more computationally intensive, and their performance is extremely sensitive to the step-sizes and termination criterion for the LL updates. Double-loop algorithms are not applicable when the (stochastic) gradients of the LL and UL problems are only provided sequentially, such as in logistics, motion planning and RL problems. Hence, we develop and analyse a single-loop algorithm.

A final aspect of BLO is the constrained UL problem. When the UL variable $x$ corresponds to some decision variable in a decision optimization problem or a hyperparameter in HPO, we must consider a constrained form, $x \in \mathcal{X} \subset \mathbb{R}^{d_x}$. To capture a more general form of the bilevel problem, we focus on the constrained UL setup. In the remainder of this section, we will review existing literature on single-objective and multi-objective BLO and robust optimization, especially in the context of machine learning.
Table 1 provides a snapshot of the properties of the problems and algorithms (with rigorous convergence analysis) studied in recent machine learning literature.

Table 1: The problem studied here relative to representative related work. If the studied problem is not a BLO, the notion of single-loop or constrained UL ($\mathcal{X} \subset \mathbb{R}^{d_x}$) is not applicable. The 1st row block lists general problems. The 2nd block lists algorithms with analyses. The final row is MORBiT. †: This problem has been viewed both as single-level and bilevel. □: The problem can be multi-objective but is treated as single-objective by summing the objectives. △: In bilevel adversarial learning, the UL is unconstrained but the LL is constrained.

| Problem/Method | Bilevel | Multi-objective | Min-max | Single-loop | $\mathcal{X} \subset \mathbb{R}^{d_x}$ |
| Distributionally Robust Learning | † | ✗ | ✓ | - | - |
| Adversarially Robust Learning | † | ✗ | ✓ | - | △ |
| Multi-task Learning (MTL) | † | □ | ✗ | - | - |
| Robust MTL (Mehta et al., 2012) | † | ✓ | ✓ | - | - |
| Meta-learning | † | □ | ✗ | - | - |
| BSA (Ghadimi & Wang, 2018) | ✓ | ✗ | ✗ | ✗ | ✓ |
| HiBSA (Lu et al., 2020) | ✗ | ✗ | ✓ | ✓ | ✓ |
| GDA (Lin et al., 2020) | ✗ | ✗ | ✓ | ✓ | ✗ |
TR-MAML (

2020) is a single-level one. In fact, we consider the bilevel form of the TR-MAML problem as one of our applications for empirical evaluation. In ARL, the minimum is over the loss and the maximum is over the worst-case perturbation to inputs (Madry et al., 2017; Wang et al., 2019). In both DRL and ARL, the min-max objective is in the form $\min_x \max_y f(x, y)$ with a single objective. In contrast, we study general robust multi-objective BLO where the UL objective is dependent on the LL solutions, and where the minimization is over the variable $x$ shared across all objectives, and the maximization is over the multiple objectives, ensuring that each individual objective converges fast.

Closely related and concurrent work.
Since our goals align with the properties of TTSA (Hong et al., 2020) -the single-loop nature and the ability to handle UL constraints -our proposed MORBiT is inspired by TTSA and can be viewed as a robust multi-objective version. Beyond this advancement, our contribution also lies in the convergence analysis of MORBiT, which significantly diverges from that of TTSA. After our MORBiT was developed and released (Gu et al., 2021) , STABLE (Chen et al., 2022b) was recently presented as an improvement of TTSA, and we wish to explore similar improvements to MORBiT in future work. A very recent work (Hu et al., 2022) studies a problem that appears to be quite similar to equation 2, with common elements such as bilevel and min-max, and proposes a single-loop multi-block min-max bilevel (MMB) algorithm. However, there are significant differences: (i) Firstly, in their setup, they consider an extension of a min-max single level problem to a min-max BLO, and min-max is not meant to provide "robustness" among objectives. 

3. ALGORITHM AND ANALYSIS

In this section, we propose a simple single-loop algorithm MORBiT to solve equation 2, and establish a rigorous convergence rate and sample complexity for this algorithm. For the theoretical results, we defer the precise assumptions, statements and proofs to Appendix A and present the high-level theoretical results and critical novel proof steps here. In the sequel, we will always use the subscript $i \in [n]$ to denote the objective index and the superscript $(k)$ to denote the iteration index, with $x^{(k)}$ denoting the $k$-th iterate of the shared variable $x \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $y_i^{(k)}$ denoting the $k$-th iterate of the $i$-th objective-specific variable $y_i \in \mathbb{R}^{d_{y_i}}$. We will also use the shorthand $y$ to denote all the per-objective variables $[y_1, y_2, \ldots, y_n]$, with $y^{(k)}$ denoting the $k$-th iterate of all the per-objective variables $[y_1^{(k)}, y_2^{(k)}, \ldots, y_n^{(k)}]$. Given our assumption that the LL objectives $g_i$ are strongly convex, we define $y_i^\star(x) \triangleq \arg\min_{y_i \in \mathbb{R}^{d_{y_i}}} g_i(x, y_i)$, and use the shorthand $\ell_i(x) \triangleq f_i(x, y_i^\star(x))$.

3.1. MORBiT ALGORITHM

We begin with a standard reformulation of robust min-max problems (Duchi et al., 2008). We can rewrite the non-smooth min-max problem in equation 2 as

$$\min_{x \in \mathcal{X}} \max_{\lambda \in \Delta_n} \sum_{i \in [n]} \lambda_i f_i(x, y_i^\star(x)) \quad \text{subject to} \quad y_i^\star(x) = \arg\min_{y_i \in \mathbb{R}^{d_{y_i}}} g_i(x, y_i), \ \forall i \in [n], \tag{3}$$

where $\Delta_n \subset \mathbb{R}^n_+$ is the $n$-simplex defined as $\Delta_n := \{\lambda \in \mathbb{R}^n_+ : \lambda_i \geq 0, \forall i \in [n], \sum_{i \in [n]} \lambda_i = 1\}$. This problem is equivalent to the min-max problem in equation 2, but allows us to solve the problem with (projected) gradient based methods. The gradient for $y_i$, $i \in [n]$ is the straightforward $\nabla_{y_i} g_i(x, y_i)$ and we denote $h_i$ as its stochastic estimate, with $h$ as the shorthand for the per-objective stochastic gradient estimates $[h_1, h_2, \ldots, h_n]$.

The gradient for the $x$-update is more involved because of the hierarchical structure of the BLO problem. We consider the following weighted objectives utilizing the simplex variable $\lambda \in \Delta_n$ to define the necessary gradients:

$$F(x, \lambda) = \sum_{i \in [n]} \lambda_i \ell_i(x), \qquad F(x, y, \lambda) = \sum_{i \in [n]} \lambda_i f_i(x, y_i). \tag{4}$$

Note that $F(x, \lambda)$ is the UL objective in equation 3, and the UL gradients can be defined as:

$$\nabla_x F(x, \lambda) = \sum_{i \in [n]} \lambda_i \nabla \ell_i(x), \qquad \nabla_\lambda F(x, \lambda) = [\ell_1(x), \cdots, \ell_n(x)]^\top, \tag{5}$$

where $\nabla \ell_i(x)$ for any $i \in [n]$ can be defined as follows utilizing the strong convexity of the LL problem and implicit gradients (Gould et al., 2016):

$$\nabla \ell_i(x) = \nabla_x f_i(x, y_i^\star(x)) - \nabla^2_{x y_i} g_i(x, y_i^\star(x)) \left[ \nabla^2_{y_i y_i} g_i(x, y_i^\star(x)) \right]^{-1} \nabla_{y_i} f_i(x, y_i^\star(x)). \tag{6}$$

Note that in general, $y_i^\star(x)$ cannot be computed exactly. Following Ghadimi & Wang (2018), we use an approximation of $\nabla \ell_i(x)$ as a surrogate, denoted by $\bar{\nabla}_x f_i(x, y_i)$, by replacing $y_i^\star(x)$ in equation 6 with any $y_i \in \mathbb{R}^{d_{y_i}}$ as follows:

$$\bar{\nabla}_x f_i(x, y_i) = \nabla_x f_i(x, y_i) - \nabla^2_{x y_i} g_i(x, y_i) \left[ \nabla^2_{y_i y_i} g_i(x, y_i) \right]^{-1} \nabla_{y_i} f_i(x, y_i). \tag{7}$$
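To make the implicit-gradient formula concrete, the following is a minimal sketch on a scalar toy problem that is not taken from the paper. It assumes the hypothetical objectives $g(x, y) = \frac{a}{2}(y - bx)^2$ and $f(x, y) = \frac{1}{2}(y - t)^2 + \frac{c}{2}x^2$, for which $y^\star(x) = bx$ is available in closed form, and checks that the surrogate evaluated at $y = y^\star(x)$ agrees with a finite-difference derivative of $\ell(x) = f(x, y^\star(x))$. All names and constants are illustrative assumptions.

```python
# Hypothetical scalar toy problem (an assumption for illustration only):
#   g(x, y) = 0.5 * a * (y - b * x)^2   ->  y_star(x) = b * x
#   f(x, y) = 0.5 * (y - t)^2 + 0.5 * c * x^2
A, B, C, T = 2.0, 3.0, 0.5, 1.0


def surrogate_grad(x, y):
    """Surrogate UL gradient: grad_x f - hess_xy g * (hess_yy g)^{-1} * grad_y f."""
    grad_x_f = C * x          # d f / d x
    grad_y_f = y - T          # d f / d y
    hess_xy_g = -A * B        # d/dx of (d g / d y) = d/dx [a * (y - b * x)]
    hess_yy_g = A             # d^2 g / d y^2
    return grad_x_f - hess_xy_g * (1.0 / hess_yy_g) * grad_y_f


def ell(x):
    """l(x) = f(x, y_star(x)) with the closed-form LL solution y_star(x) = b * x."""
    y_star = B * x
    return 0.5 * (y_star - T) ** 2 + 0.5 * C * x ** 2


x0 = 0.7
# Implicit gradient: the surrogate evaluated at the exact LL solution.
implicit = surrogate_grad(x0, B * x0)
# Central finite difference of l(x) as an independent check.
eps = 1e-6
finite_diff = (ell(x0 + eps) - ell(x0 - eps)) / (2 * eps)
print(implicit, finite_diff)
```

When $y$ is only an approximation of $y^\star(x)$, the same function returns a biased estimate of $\nabla \ell(x)$; in this toy problem the bias is $b\,(y - y^\star(x))$, which is why the analysis has to track how closely the LL iterates follow $y^\star$.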
Consequently, we define our approximate gradients for the UL variables $x$ (and $\lambda$) as:

$$\nabla_x F(x, y, \lambda) = \sum_{i \in [n]} \lambda_i \bar{\nabla}_x f_i(x, y_i), \qquad \nabla_\lambda F(x, y, \lambda) = [f_1(x, y_1), \cdots, f_n(x, y_n)]^\top. \tag{8}$$

We denote the (possibly biased) stochastic estimates of $\nabla_x F(x, y, \lambda)$ as $h_x$ and of $\nabla_\lambda F(x, y, \lambda)$ as $h_\lambda$.

Algorithm 1: MORBiT with learning rates $\alpha$, $\beta$ and $\gamma$ for $x$, $y$, $\lambda$ respectively
for $k = 1, 2, \cdots, K$ do
    $y^{(k+1)} \leftarrow y^{(k)} - \beta h^{(k)}$
    $x^{(k+1)} \leftarrow \mathrm{proj}_{\mathcal{X}}(x^{(k)} - \alpha h_x^{(k)})$
    $\lambda^{(k+1)} \leftarrow \mathrm{proj}_{\Delta_n}(\lambda^{(k)} + \gamma h_\lambda^{(k)})$
end
Sample $\tau \sim U(\{1, \cdots, K\})$
return $\bar{x} \leftarrow x^{(\tau)}$, $\bar{y}_i \leftarrow y_i^{(\tau+1)}$, $\bar{\lambda} \leftarrow \lambda^{(\tau)}$

Given the gradients and their stochastic estimates, we present our single-loop algorithm MORBiT in algorithm 1, where we utilize learning rates $\alpha, \beta, \gamma > 0$ for the UL variable $x$, the LL variables $y_i$, $i \in [n]$ and the simplex variable $\lambda$ respectively. The algorithm tracks three sets of variables $x^{(k)}$, $y^{(k)} = [y_1^{(k)}, y_2^{(k)}, \ldots, y_n^{(k)}]$ and $\lambda^{(k)}$ through a total of $K$ iterations. The per-iterate gradient estimate $h^{(k)}$ of the LL variables $y$ is defined as the collection of the per-objective gradient estimates $h_i^{(k)}$ evaluated at $(x^{(k)}, y_i^{(k)})$ for all $i \in [n]$. The gradient estimates $h_x^{(k)}$ and $h_\lambda^{(k)}$ of the UL variables $x$ and $\lambda$ are evaluated at $(x^{(k)}, y^{(k+1)}, \lambda^{(k)})$. We perform a standard gradient descent update for the objective-specific variables $y_i$, $i \in [n]$ from $y_i^{(k)}$ to $y_i^{(k+1)}$. For the shared UL variable $x$, we perform projected gradient descent to satisfy the UL constraints, where $\mathrm{proj}_{\mathcal{X}}(\cdot)$ denotes the projection operation onto the constraint set $\mathcal{X}$. We update the simplex variable $\lambda$ via projected gradient ascent, where we project the variable back onto the $n$-simplex after a gradient ascent step with $\mathrm{proj}_{\Delta_n}(\cdot)$. Given the learning rates $(\alpha, \beta, \gamma)$, MORBiT is quite straightforward to implement. When $n = 1$, the problem reduces to single-objective BLO, $\lambda^{(k)} = 1$, and MORBiT reduces to TTSA (Hong et al., 2020).
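The three updates of algorithm 1 can be sketched in a few lines of plain Python. The example below is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical scalar toy problem with $g_i(x, y_i) = \frac{1}{2}(y_i - x)^2$ and $f_i(x, y_i) = \frac{1}{2}(y_i - t_i)^2$ (so $y_i^\star(x) = x$ and the surrogate gradient is simply $y_i - t_i$), an interval constraint $\mathcal{X} = [-2, 2]$, additive Gaussian noise standing in for stochastic gradients, and a sort-based Euclidean projection onto the simplex. The function names, step sizes and noise level are all assumptions chosen for illustration, and iterates are stored with 0-based indices.

```python
import random


def project_simplex(v):
    """Euclidean projection of a list v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumsum += u_j
        t = (cumsum - 1.0) / j
        if u_j - t > 0:          # holds for a prefix of j; keep the last threshold
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]


def morbit_toy(targets, K=3000, alpha=0.05, beta=0.5, gamma=0.05,
               sigma=0.05, lo=-2.0, hi=2.0, seed=0):
    """Run the three-sequence updates on the toy problem; returns all iterates and tau."""
    rng = random.Random(seed)
    n = len(targets)
    x, y, lam = 0.0, [0.0] * n, [1.0 / n] * n
    xs, lams = [], []
    for _ in range(K):
        # LL step: noisy gradient of g_i(x, y_i) = 0.5 * (y_i - x)^2 is (y_i - x).
        y = [y_i - beta * ((y_i - x) + sigma * rng.gauss(0, 1)) for y_i in y]
        # UL estimates use the updated y and the current x and lambda.
        # Here grad_x f_i = 0, hess_xy g_i = -1, hess_yy g_i = 1, so the
        # per-objective surrogate gradient reduces to (y_i - t_i).
        h_x = sum(l * (y_i - t) for l, y_i, t in zip(lam, y, targets))
        h_x += sigma * rng.gauss(0, 1)
        h_lam = [0.5 * (y_i - t) ** 2 + sigma * rng.gauss(0, 1)
                 for y_i, t in zip(y, targets)]
        # Projected descent on x (clipping is the projection onto an interval).
        x = min(max(x - alpha * h_x, lo), hi)
        # Projected ascent on the simplex weights.
        lam = project_simplex([l + gamma * h for l, h in zip(lam, h_lam)])
        xs.append(x)
        lams.append(lam)
    tau = rng.randrange(K)       # uniformly sampled index of the returned iterate
    return xs, lams, tau


xs, lams, tau = morbit_toy([-1.0, 0.0, 2.0])
print("sampled iterate:", xs[tau], lams[tau])
print("last iterate:   ", xs[-1], lams[-1])
```

For the targets $\{-1, 0, 2\}$ used above, the exact min-max solution of the toy problem is the midpoint of the two extreme targets, $x = 0.5$, whereas the min-avg solution would be the mean of the targets, $x = 1/3$; the weights $\lambda$ are expected to concentrate on the two extreme objectives.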

3.2. ANALYSIS

Given the single-loop MORBiT, we establish conditions under which MORBiT has finite-horizon convergence. The coupling of the stochastic errors due to the sampling process makes the convergence analysis of this three-sequence-based algorithm much more challenging than that of existing BLO algorithms.

Assumptions. We summarize the following typical assumptions (detailed in Appendix A.1) for all objective pairs $(f_i, g_i)$, $i \in [n]$. Focusing on the smoothness and regularity properties of the objectives, we assume that (i) the LL objective $g_i$ is strongly convex in $y_i$, twice-differentiable, and has sufficiently smooth first and second order gradients (Assumption 2 in Appendix A.1), (ii) the UL objective $f_i$ has sufficiently smooth first order gradients, and (iii) the function $\ell_i(x) \triangleq f_i(x, y_i^\star(x))$ is weakly convex, bounded and has bounded first-order gradients (Assumption 1 in Appendix A.1, also see Appendix D.1). Regarding the quality of the gradient estimates $h_i^{(k)}$, $i \in [n]$, $h_x^{(k)}$ and $h_\lambda^{(k)}$, we assume that, for all $k > 0$, (i) $h_i^{(k)}$ is an unbiased estimate with bounded variance, (ii) $h_\lambda^{(k)}$ is an unbiased estimate, and (iii) $h_x^{(k)}$ has bounded variance, and can be a biased estimate of the $\nabla_x F(x^{(k)}, y^{(k+1)}, \lambda^{(k)})$ term defined in equation 8, but the bias norm at iteration $k$ is bounded by $b_k \geq 0$, with $\{b_k, k \geq 0\}$ forming a non-increasing sequence. These gradient estimate quality assumptions are detailed in Assumption 3 in Appendix A.1. The assumptions on $h_x^{(k)}$ can be satisfied when a Hessian inverse approximation (HIA) based mini-batch sampling strategy is adopted, which also avoids explicit matrix inversion by leveraging the Neumann series (Agarwal et al., 2017; Ghadimi & Wang, 2018; Hong et al., 2020).

Optimality and Stationarity of Solutions.
To quantify the convergence properties of the solutions $\bar{x}$, $\bar{y}_i$, $i \in [n]$, $\bar{\lambda}$ generated by MORBiT, we use the following optimality properties of the optimal solutions $x^\star$, $y_i^\star$, $i \in [n]$, $\lambda^\star$ of the problem in equation 3. (i) The per-objective optimal LL variable $y_i^\star = y_i^\star(x^\star) = \arg\min_{y_i \in \mathbb{R}^{d_{y_i}}} g_i(x^\star, y_i)$; (ii) The optimal simplex variable $\lambda^\star$: $F(x^\star, \lambda^\star) = \max_{\lambda \in \Delta_n} F(x^\star, \lambda)$. Given the constrained UL, the first-order stationarity condition is satisfied if $\langle \nabla_x F(x^\star, \lambda^\star), x - x^\star \rangle \geq 0 \ \forall x \in \mathcal{X}$. (iii) For establishing near-stationarity of the UL variable $x$, the proximal map $\hat{x}(z) \in \mathcal{X}$, defined below,

$$\hat{x}(z) \triangleq \arg\min_{x \in \mathcal{X}} \left\{ \frac{\rho}{2} \|x - z\|^2 + F(x, \lambda) \right\}, \quad \rho > 0 \text{ is a fixed constant}, \tag{9}$$

is employed (Davis & Drusvyatskiy, 2018; Hong et al., 2020) to quantify the convergence for a constrained variable $x$ in the stochastic setting. If $\|\hat{x}(x^{(k)}) - x^{(k)}\|^2$ is small, then near-stationarity of $x^{(k)}$ is achieved at iteration $k$. Therefore, we need to bound $\|\hat{x}(\bar{x}) - \bar{x}\|$ to guarantee the convergence of the UL solution $\bar{x}$ returned by MORBiT. Given the convergence of the UL $\bar{x}$, we also need to bound $\|\bar{y}_i - y_i^\star(\bar{x})\|^2$ for each $i \in [n]$ simultaneously to quantify the convergence of the LL variables. Finally, the convergence of $\lambda$ requires us to bound the difference between $F(\bar{x}, \bar{\lambda})$ and $\max_{\lambda \in \Delta_n} F(\bar{x}, \lambda)$.

Theoretical Convergence Rate. Now, we are ready to state our main theoretical result: a rigorous convergence rate for the solution returned by MORBiT (algorithm 1). We state an abbreviated version of the result, deferring details to Theorem 2 in Appendix A.2:

Theorem 1 (MORBiT convergence). Suppose that the previously stated assumptions hold and the learning rates are set as $\alpha = O(K^{-3/5})$, $\beta = O(K^{-2/5})$ and $\gamma = O(n^{-1/2} K^{-3/5})$.
Then, if $b_k^2 \leq \alpha$, the solutions $\bar{x}$, $\bar{y}_i$, $i \in [n]$, $\bar{\lambda}$ generated by algorithm 1 satisfy:

$$\mathbb{E}[\|\hat{x}(\bar{x}) - \bar{x}\|^2] \leq O(\sqrt{n} K^{-2/5}), \tag{10a}$$
$$\mathbb{E}\left[\max_{i \in [n]} \|\bar{y}_i - y_i^\star(\bar{x})\|^2\right] \leq O(\sqrt{n} K^{-2/5}), \tag{10b}$$
$$\max_{\lambda} \mathbb{E}[F(\bar{x}, \lambda)] - \mathbb{E}[F(\bar{x}, \bar{\lambda})] \leq O(\sqrt{n} K^{-2/5}), \tag{10c}$$

with expectation over the stochastic gradient estimates and the random index $\tau$ (algorithm 1, line 6).

This result establishes the $O(n^{1/2} K^{-2/5})$-stationarity achieved by $K$ iterations of MORBiT for both the UL and LL variables if all the assumptions are satisfied and the learning rates are selected appropriately. Note that, if the UL problem is unconstrained, that is $x \in \mathcal{X} = \mathbb{R}^{d_x}$, the definition of the proximal map (equation 9) implies that $\mathbb{E}\|\nabla_x F(\bar{x}, \bar{\lambda})\| \leq O(n^{1/2} K^{-2/5})$, so that $\bar{x}$ converges to an $O(n^{1/2} K^{-2/5})$-stationary point.

Comparison with Related Work. We would like to further highlight the differences between the convergence results of TTSA and MORBiT to emphasize the major novelties in our analyses and proof techniques. First, we consider a more general proximal map in equation 9 involving a weighted sum of weakly convex functions $\ell_i$ instead of a single weakly convex function as in TTSA, requiring a new construction of potential functions for establishing the convergence of the UL variable $x$ in equation 10a. Secondly, even though TTSA provides a convergence rate for a single LL variable (equivalent to bounding $\mathbb{E}[\|\bar{y}_i - y_i^\star(\bar{x})\|^2]$ for a single $i \in [n]$), we provide a much stronger result for multiple LL optimization objectives, in the sense that we simultaneously establish convergence for all LL variables in equation 10b by measuring the convergence rate of $\mathbb{E}[\max_{i \in [n]} \|\bar{y}_i - y_i^\star(\bar{x})\|^2]$. This is especially challenging since a bounded $\mathbb{E}[\|\bar{y}_i - y_i^\star(\bar{x})\|^2]$ for each $i \in [n]$ does not directly imply a bounded $\mathbb{E}[\max_{i \in [n]} \|\bar{y}_i - y_i^\star(\bar{x})\|^2]$; in fact this can be generally unbounded.
Finally, to satisfy the requirements of the min-max problem in equation 3, we have to additionally establish convergence for the simplex solution $\bar{\lambda}$ in equation 10c while TTSA does not have any such analysis.

Given the convergence rate, another related quantity of interest is the sample complexity, which pertains to the number of queries to the stochastic gradient oracle required to achieve a desired level of stationarity. For example, for an iterative algorithm that converges to an $O(K^{-\mu})$-stationary point with $K$ iterations for some $\mu > 0$, requiring $O(1)$ queries to the stochastic gradient oracle in each iteration, the sample complexity to find an $\epsilon$-optimal solution is $O(\epsilon^{-1/\mu})$, since $K^{-\mu} \leq \epsilon$ requires $K \geq \epsilon^{-1/\mu}$. The number of stochastic gradient oracle queries required is directly related to the conditions in the gradient estimate quality assumptions (Assumption 3 in Appendix A.1 in our case). While the conditions on the per-iterate gradient estimates h

Proof Sketch of Theorem 1. We now give a proof sketch of our main theorem, with constant terms abstracted away with $O$ notation. In order to show equation 10a of Theorem 1 (convergence of $x$), we will derive a descent lemma comparing successive iterates $x^{(k)}$ and $x^{(k+1)}$. Descent lemmas often contain a quadratic term $\|x^{(k+1)} - x^{(k)}\|^2$, so it is natural that we must bound $\|h_x^{(k)}\|^2$. In Lemma 1, we bound the expected squared norm of the stochastic gradient estimate:

Lemma 1. Under our regularity assumptions,

$$\mathbb{E}[\|h_x^{(k)}\|^2] \leq O\left( \sum_{i \in [n]} \lambda_i^{(k)} \|y_i^\star(x^{(k)}) - y_i^{(k+1)}\|^2 \right).$$

Turning to equation 10b of Theorem 1, we use a descent relation on $y_i^{(k)} - y_i^\star(x^{(k-1)})$. While ideally we would obtain a descent relation purely involving the $y_i^{(k)}$ terms themselves, the intricate coupling of the $x$ and $y_i$ terms results in an extra $x^{(k-1)} - x^{(k)}$ term. The resulting relation is shown in Lemma 2. Here, we have that $c_1 = 1 - \frac{\mu_g \beta}{2}$, $c_2 = \frac{2}{\mu_g \beta} - 1$, where $\mu_g \beta < 1$.

Lemma 2.
$$\mathbb{E}[\|y_i^{(k+1)} - y_i^\star(x^{(k)})\|^2] \leq O\left( (1 - c_1)\, \mathbb{E}[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2] + c_2\, \mathbb{E}[\|x^{(k-1)} - x^{(k)}\|^2] \right).$$

From this lemma, intuitively, we know that $\mathbb{E}[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2]$ is decreasing as $k$ increases, as long as the $\mathbb{E}[\|x^{(k-1)} - x^{(k)}\|^2]$ terms are not too large. Therefore, it is important to have another descent relation that upper bounds this quantity, which we do next in Lemma 3. The lemma naturally involves the objective $F(x^{(k)}, \lambda^{(k)})$, which will telescope. Here, we have $c_3 = \frac{1}{4\alpha} - \frac{L_f}{2}$, $c_4 = 4\alpha L^2$. Since $\alpha = O(K^{-3/5}) \to 0$ as $K \to \infty$, $c_3$ is positive for sufficiently large $K$.

Lemma 3. Let $\mathcal{L}^{(k)} \triangleq \mathbb{E}[F(x^{(k)}, \lambda^{(k)})] = \sum_{i \in [n]} \lambda_i^{(k)} \mathbb{E}[\ell_i(x^{(k)})]$. Then, $\mathcal{L}^{(k)}$ satisfies:

$$\mathcal{L}^{(k+1)} - \mathcal{L}^{(k)} \leq O\left( -c_3\, \mathbb{E}[\|x^{(k+1)} - x^{(k)}\|^2] + c_4 \max_{i \in [n]} \mathbb{E}[\|y_i^{(k+1)} - y_i^\star(x^{(k)})\|^2] + \sqrt{n}\,\gamma + \alpha \right).$$

Following the intuition previously described, we then use Lemmas 2 and 3 to show that the $\mathbb{E}[\|x^{(k-1)} - x^{(k)}\|^2]$ terms are small enough and that the $y_i^{(k)}$ iterates converge in Lemma 4.

Lemma 4.

$$\frac{1}{K} \sum_{k=1}^{K} \max_{i \in [n]} \mathbb{E}\left[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2\right] \leq O(\sqrt{n} K^{-2/5}).$$

Finally, we can use the convergence of $y_i$ to prove convergence of $\lambda$ and $x$. Theorem 1 then follows. A more detailed proof plan can be found in Appendix A.3.
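The Neumann-series idea behind the Hessian inverse approximation mentioned under the assumptions can be illustrated with a short deterministic sketch. It relies on the identity $H^{-1} v = \frac{1}{L} \sum_{j \geq 0} (I - H/L)^j v$, valid for a symmetric positive definite $H$ whose eigenvalues lie in $(0, L]$, and truncates the sum after a fixed number of terms so that only Hessian-vector products are needed. This is a simplified illustration rather than the estimator analysed in the paper, which works with stochastic Hessian samples; the function names, the 2×2 matrix and the bound $L = 3$ are assumptions made for the example.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * v_j for a, v_j in zip(row, v)) for row in A]


def neumann_inverse_vp(H, v, L, num_terms):
    """Approximate H^{-1} v by (1/L) * sum_{j < num_terms} (I - H/L)^j v.

    Assumes H is symmetric positive definite with largest eigenvalue <= L,
    so that the series converges. Only matrix-vector products are used.
    """
    term = list(v)               # current term (I - H/L)^j v, starting at j = 0
    total = list(v)              # running sum of the terms
    for _ in range(num_terms - 1):
        Hv = matvec(H, term)
        term = [t - hv / L for t, hv in zip(term, Hv)]
        total = [s + t for s, t in zip(total, term)]
    return [s / L for s in total]


# Hypothetical strongly convex LL Hessian and a vector standing in for grad_y f.
H = [[2.0, 0.5], [0.5, 1.0]]
v = [1.0, 1.0]
approx = neumann_inverse_vp(H, v, L=3.0, num_terms=100)
# Closed-form inverse of the 2x2 matrix for comparison.
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
exact = [(H[1][1] * v[0] - H[0][1] * v[1]) / det,
         (-H[1][0] * v[0] + H[0][0] * v[1]) / det]
print(approx, exact)
```

Truncating the series after fewer terms leaves a residual that shrinks geometrically with the number of terms, which is one way a bias of the kind bounded by $b_k$ can arise in the estimate $h_x^{(k)}$.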

4. EXPERIMENTAL RESULTS

In this section, we consider two applications where the min-max multi-objective bilevel formulation in equation 2 enhances robustness: multi-task representation learning and hyperparameter optimization. We will highlight the advantage of the min-max formulation and the convergence of MORBiT on these applications. We use PyTorch (Paszke et al., 2019), and implementation details are in Appendix C. All results are aggregated over 10 trials.

Representation Learning. In this setup, each objective pair corresponds to a learning "task" $i \in [n]$, with its own training and validation dataset pair $D_i^t, D_i^v$. We consider a shared representation network $\phi_x$ with ReLU nonlinearity (making the UL problem weakly convex) parameterized with $x \in \mathbb{R}^{d_x}$ and a per-task linear model $w_{y_i}$ parameterized with $y_i \in \mathbb{R}^{d_{y_i}}$. Here the UL is unconstrained. Using $L(f, D)$ to denote the loss of a model $f$ on data $D$, we consider the problem in equation 2 with

$$f_i(x, y_i) \triangleq L(w_{y_i} \circ \phi_x, D_i^v), \qquad g_i(x, y_i) \triangleq L(w_{y_i} \circ \phi_x, D_i^t) + \rho \|w_{y_i}\|_2^2, \tag{11}$$

with $\rho > 0$ as a regularization penalty (ensuring that the LL problem is strongly convex). We first consider a multi-task setup with $n = 10$ binary classification tasks from the FashionMNIST dataset (Xiao et al., 2017). The goal is to learn a shared representation and per-task models so that each of the tasks generalizes well. Usually, this problem is solved as a single-objective BLO by minimizing $\frac{1}{n} \sum_i f_i$; we call this min-avg. We theoretically show that solving the min-max multi-objective BLO in equation 2 provides a tighter generalization guarantee (Proposition 2, Appendix B). Here we demonstrate the same in figure 1a: we plot the worst-case UL objective (the validation loss) and the worst-case generalization loss across all tasks/objectives throughout the optimization trajectory, comparing the behaviour of the solution of the min-avg problem to that of the min-max problem.
The results indicate that solving the min-max problem significantly reduces the worst-case validation loss and that this also results in a significant reduction of the worst-case generalization loss, highlighting the utility of solving the min-max multi-objective bilevel problem in equation 2. We study the convergence of the UL variable for the min-max problem in the form of the trajectory of the (stochastic) gradient norm $\|\nabla_x\|^2$ in figure 1b, comparing it to the theoretical $O(K^{-2/5})$ rate. We see that the empirical trajectory of the gradient norm closely tracks the theoretical rate.

We also consider a bilevel extension of the robust meta-learning application (Collins et al., 2020) for a sinusoid regression task, a common meta-learning application introduced by Finn et al. (2017). Here the goal of solving the problem in equation 2 with the objectives defined in equation 11 would be to learn a robust representation network such that we not only improve generalization on tasks seen during the optimization but also improve generalization for related unseen tasks. We theoretically show that solving the min-max multi-objective bilevel problem in equation 2 also provides a tighter generalization guarantee for the unseen tasks (Proposition 1c support this, showing that solving the min-max problem not only improves the generalization on seen tasks, but significantly improves the generalization on unseen tasks when compared to solving the min-avg problem. These results are also consistent with the results for robust MTL in figure 1a.

Hyperparameter Optimization. In this setup, each objective pair again corresponds to a learning "task" $i \in [n]$, each with its own $d$-dimensional training/validation dataset pair $D_i^t, D_i^v$.
We consider a shared hyperparameter optimization problem for kernel logistic regression (Zhu & Hastie, 2001) with $K$ random Fourier features (RFFs) (Rahimi & Recht, 2007), where $x = \{x_\rho \in \mathbb{R}^{2K}_+, x_\sigma \in \mathbb{R}^d_+\}$ are the regularization penalty and the bandwidth hyperparameters respectively, with $\phi_{x_\sigma}$ denoting the RFF. The per-task linear model $w_{y_i}$ on top of the RFFs is parameterized with $y_i \in \mathbb{R}^{2K}$. In this setup, we have a weakly convex constrained UL problem (the hyperparameters need to be positive), and an unconstrained strongly convex LL problem. Again using $L(f, D)$ to denote the learning loss of a model $f$ on a dataset $D$, we consider the problem in equation 2 with

$$f_i(x, y_i) \triangleq L(w_{y_i} \circ \phi_{x_\sigma}, D_i^v), \qquad g_i(x, y_i) \triangleq L(w_{y_i} \circ \phi_{x_\sigma}, D_i^t) + \|x_\rho \odot w_{y_i}\|_2^2, \tag{12}$$

where $\odot$ denotes the elementwise vector multiplication, and we consider a weighted regression penalty. We generate $n = 16$ binary classification tasks from the Letter dataset (Frey & Slate, 1991) and compare the generalization of the min-max solution of equation 2 to that of the min-avg. The results in figure 2a indicate that the solution of equation 2 provides a robust solution $x$ (hyperparameters), significantly improving not only the worst-case validation loss but also the worst-case generalization loss for the supervised learning problems. This result highlights the advantage of

We study the effect of the number of objective pairs $n$ on the convergence. We consider $n \in \{4, 16, 64\}$, increasing $n$ by a factor of 4 at each step (implying a theoretical convergence slowdown by a factor of $\sqrt{4} = 2$) to check how the convergence matches the $\sqrt{n}$-dependence in our theoretical result. In this case, we consider the trajectory of the (stochastic) gradient norm $\|\nabla_x\|^2$ (as in figure 1b). The results in figure 2b display such a behaviour: for a fixed $K$ (outer iterations), as the number of tasks is increased 4-fold, the gradient norm approximately increases 2-fold (note the $\log_2$-scale on the vertical axis).
This validates our theoretical dependence on the number of objective pairs n. We also study the effect of the batch size on the generalization performance of the min-max solution. In the previous experiments, we considered a batch size of 8 for both the UL and LL stochastic gradients. Here, we will consider batch sizes from {8, 32, 128}, using the same batch size for gradients of both levels and variables. Note that, in this problem, each of the 16 learning tasks (and hence, objective pairs) has a training set size of around 900 samples (for the LL loss), with 300 samples each for the UL loss and for computing the generalization loss. Unlike figures 1a and 2a, we only show the generalization loss (dropping the validation loss) in figure 2c . The results indicate that increasing the batch size improves the stability and reduces the variance of the overall generalization. However, the convergence follows a similar trend for all batch sizes, and converges to a very similar level of generalization, supporting the O(1) batch size requirement for convergence.

Empirical conclusion

The empirical evaluations highlight that considering the more robust min-max problem in equation 2 does provide improved generalization in multiple applications (representation learning for MTL and meta-learning, and for hyperparameter optimization). The results also highlight the validity of our theoretical convergence analysis both in terms of the number of iterations K and the number of objective pairs n.

5. CONCLUDING REMARKS

Motivated by the desiderata of robustness in bilevel learning applications, we study a new min-max multi-objective BLO framework (equation 2) that provides full flexibility and generality. We propose MORBiT (algorithm 1), a single-loop gradient descent-ascent based algorithm for finding a solution to our proposed min-max multi-objective framework. We establish its convergence rate (Theorem 1) and sample complexity (Corollary 1), and demonstrate both the advantage of the min-max multi-objective BLO framework and the validity of our theoretical analyses on robust representation learning and hyperparameter optimization applications. We wish to explore further applications where robustness would be beneficial such as in RL, federated learning and domain generalization. On the theoretical side, we wish to develop single-loop algorithms with improved convergence rates (for example, exploring techniques in Chen et al. (2022b)) and double-loop algorithms with convergence guarantees for applications where a single-loop algorithm is not feasible (e.g., federated learning). Finally, we also wish to develop algorithms for large $n$ (the number of objective pairs) or even $n \to \infty$ where MORBiT is not computationally feasible.

REPRODUCIBILITY STATEMENT

The formal definitions, assumptions, precise theorem statements, high-level proof outline and detailed proofs for our main theoretical results are presented in Appendix A. We provide appropriate citations for the datasets used in our experiments, and the experimental setup and details are presented in Appendix C. Our implementation is available at https://github.com/minimario/MORBiT.

A CONVERGENCE ANALYSIS OF MORBiT

A.1 ASSUMPTIONS

First, we begin by listing the assumptions we make:

Assumption 1 (Regularity of the outer functions). For all $i \in [n]$, assume that the outer functions $f_i(x, y)$ and $\ell_i(x) = f_i(x, y_i^\star(x))$ satisfy the following properties:

▶ For any $x \in \mathcal{X}$, $f_i(x, \cdot)$ is Lipschitz (w.r.t. $y$) with constant $G_f > 0$.
▶ For any $x \in \mathcal{X}$, $\nabla_x f_i(x, \cdot)$ and $\nabla_{y_i} f_i(x, \cdot)$ are Lipschitz continuous (w.r.t. $y_i$) with constants $L_{fx} > 0$ and $L_{fy} > 0$, respectively.
▶ For any $y_i \in \mathbb{R}^{d_{y_i}}$, $\nabla_{y_i} f_i(\cdot, y_i)$ is Lipschitz continuous (w.r.t. $x$) with constant $\bar{L}_{fy} > 0$.
▶ For any $x \in \mathcal{X}$, $y_i \in \mathbb{R}^{d_{y_i}}$, we have $\|\nabla_{y_i} f_i(x, y_i)\| \le C_{fy}$ for $C_{fy} > 0$.
▶ The function $\ell_i(\cdot)$ is $\mu_\ell$ weakly convex (in $x$), so that for all $v, w \in \mathcal{X}$,
$$\ell_i(w) \ge \ell_i(v) + \langle \nabla \ell_i(v), w - v \rangle - \mu_\ell \|w - v\|^2.$$
▶ For all $x \in \mathcal{X}$, $\|\ell_i(x)\| \le B_\ell$ for $B_\ell > 0$.
▶ For all $x \in \mathcal{X}$, $\|\nabla \ell_i(x)\| \le C_\ell$ for $C_\ell > 0$.

Assumption 2 (Regularity of the inner functions). Assume that the inner functions $g_i(x, y_i)$, $\forall i \in [n]$, satisfy:

▶ For any $x \in \mathcal{X}$ and $y_i \in \mathbb{R}^{d_{y_i}}$, $g_i(x, y_i)$ is twice continuously differentiable in $(x, y_i)$.
▶ For any $x \in \mathcal{X}$, $\nabla_{y_i} g_i(x, \cdot)$ is Lipschitz continuous (w.r.t. $y_i$) with constant $L_g$.
▶ For any $x \in \mathcal{X}$, $g_i(x, \cdot)$ is $\mu_g$-strongly convex in $y_i$, so that for all $v, w \in \mathbb{R}^{d_{y_i}}$,
$$g_i(x, w) \ge g_i(x, v) + \langle \nabla_v g_i(x, v), w - v \rangle + \mu_g \|w - v\|^2.$$
▶ For any $x \in \mathcal{X}$, $\nabla^2_{x y_i} g_i(x, \cdot)$ and $\nabla^2_{y_i} g_i(x, \cdot)$ are Lipschitz continuous (w.r.t. $y_i$) with constants $L_{gxy} > 0$ and $L_{gyy} > 0$, respectively.
▶ For any $x \in \mathcal{X}$ and $y_i \in \mathbb{R}^{d_{y_i}}$, we have $\|\nabla^2_{x y_i} g_i(x, y_i)\| \le C_{gxy}$ for some $C_{gxy} > 0$.
▶ For any $y_i \in \mathbb{R}^{d_{y_i}}$, $\nabla^2_{x y_i} g_i(\cdot, y_i)$ and $\nabla^2_{y_i} g_i(\cdot, y_i)$ are Lipschitz continuous (w.r.t. $x$) with constants $\bar{L}_{gxy} > 0$ and $\bar{L}_{gyy} > 0$, respectively.

From these assumptions, we can show a few additional regularity-type conditions. Since these conditions can also be found in Ghadimi & Wang (2018, Lemma 2.2) and Hong et al. (2020, Lemma 2), we state these results without proof.
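As a concrete instance of Assumption 2, consider a quadratic LL objective, for which the constants can be read off directly. The sketch below is an illustration under stated assumptions, not part of the analysis: the objective g(x, y) = 0.5 yᵀHy − yᵀBx, the matrices `H` and `B`, and the dimensions are all hypothetical. Here the LL solution is y⋆(x) = H⁻¹Bx, so its Lipschitz constant is at most (bound on the mixed second derivative) / (smallest curvature of g in y), which has the same form as the constant G_y in Lemma 5. How the smallest Hessian eigenvalue maps to µ_g depends on the normalization used in the strong-convexity inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y = 3, 4

# Hypothetical quadratic LL objective: g(x, y) = 0.5 * y^T H y - y^T B x.
M = rng.normal(size=(d_y, d_y))
H = M @ M.T + np.eye(d_y)        # Hessian of g in y (positive definite)
B = rng.normal(size=(d_y, d_x))  # mixed second derivative of g is -B

eigs = np.linalg.eigvalsh(H)
hess_min, hess_max = eigs[0], eigs[-1]  # curvature bounds of g(x, .)
C_gxy = np.linalg.norm(B, 2)            # spectral norm of the mixed derivative


def y_star(x):
    """Closed-form LL solution: solves H y = B x."""
    return np.linalg.solve(H, B @ x)


# Empirically estimate the Lipschitz constant of y_star on random pairs.
ratios = []
for _ in range(1000):
    x1, x2 = rng.normal(size=d_x), rng.normal(size=d_x)
    ratios.append(np.linalg.norm(y_star(x1) - y_star(x2))
                  / np.linalg.norm(x1 - x2))

lipschitz_bound = C_gxy / hess_min
print("largest observed ratio:", max(ratios))
print("bound C_gxy / hess_min:", lipschitz_bound)
```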
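Lemma 5 (Corollary of Assumptions). Under Assumptions 1 and 2 stated above, for all $x, x_1, x_2 \in \mathcal{X} \subseteq \mathbb{R}^{d_x}$, $y \in \mathbb{R}^{d_{y_i}}$, $i \in [n]$, we have
$$\|\nabla_x f_i(x, y) - \nabla \ell_i(x)\| \le L \|y_i^\star(x) - y\|, \tag{15a}$$
$$\|y_i^\star(x_1) - y_i^\star(x_2)\| \le G_y \|x_1 - x_2\|, \tag{15b}$$
$$\|\nabla \ell_i(x_1) - \nabla \ell_i(x_2)\| \le L_f \|x_1 - x_2\|, \tag{15c}$$
where we define
$$L \triangleq L_{fx} + \frac{L_{fy} C_{gxy}}{\mu_g} + C_{fy}\left(\frac{L_{gxy}}{\mu_g} + \frac{L_{gyy} C_{gxy}}{\mu_g^2}\right), \quad L_f \triangleq L_{fx} + \frac{(\bar{L}_{fy} + L) C_{gxy}}{\mu_g} + C_{fy}\left(\frac{\bar{L}_{gxy}}{\mu_g} + \frac{\bar{L}_{gyy} C_{gxy}}{\mu_g^2}\right), \quad G_y \triangleq \frac{C_g}{\mu_g}. \tag{16}$$

Assumption 3 (Quality of stochastic gradient estimates). For any iteration $k > 0$ and all $i \in [n]$, the gradient estimates $h_i^{(k)}$ satisfy:
$$\mathbb{E}[h_i^{(k)}] = \nabla_{y_i} g_i(x^{(k)}, y_i^{(k)}), \tag{17}$$
$$\mathbb{E}\left[\|h_i^{(k)} - \nabla_{y_i} g_i(x^{(k)}, y_i^{(k)})\|^2\right] \le \sigma_g^2 \left(1 + \|\nabla_{y_i} g_i(x^{(k)}, y_i^{(k)})\|^2\right). \tag{18}$$
For any iteration $k > 0$, the gradient estimate $h_\lambda^{(k)}$ for the simplex variable $\lambda$ satisfies:
$$\mathbb{E}[h_\lambda^{(k)}] = \nabla_\lambda F(x^{(k)}, y^{(k+1)}, \lambda^{(k)}) = \left[f_1(x^{(k)}, y_1^{(k+1)}), \cdots, f_n(x^{(k)}, y_n^{(k+1)})\right]^\top. \tag{19}$$
For any $k \ge 0$ and a $\sigma_f > 0$, we assume that there exists a non-increasing sequence $\{b_k\}_{k \ge 0}$ such that
$$\mathbb{E}[h_x^{(k)}] = \nabla_x F(x^{(k)}, y^{(k+1)}, \lambda^{(k)}) + B_k, \quad \|B_k\| \le b_k, \tag{20}$$
$$\mathbb{E}\left[\|h_x^{(k)} - \mathbb{E}[h_x^{(k)}]\|^2\right] \le \sigma_f^2. \tag{21}$$

A.2 MAIN THEOREM AND REMARKS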

We now state our main theorem, in full. Recall our notation from equation 9: for a fixed constant $\rho > 0$, we defined the proximal map to be $\hat{x}(z) \triangleq \arg\min_{x \in \mathcal{X}} \frac{\rho}{2}\|x - z\|^2 + F(x, \lambda)$. We also define the Moreau envelope as $\Phi_{1/\rho}(z) \triangleq \min_x \frac{\rho}{2}\|x - z\|^2 + \sum_{i=1}^n \lambda_i \ell_i(x)$. In addition, we use the notation $\Delta_{y_i}^{(k)} \triangleq \mathbb{E}[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2]$ and $i^{(k)} \triangleq \arg\max_{i \in [n]} \Delta_{y_i}^{(k)}$. Finally, we define $\tilde{\sigma}_f^2 = \sigma_f^2 + 3C_\ell^2$ (see Lemma 12). Now, we are ready to state the full version of our main theorem:

Theorem 2 (Convergence of MORBiT). Under Assumptions 1, 2 and 3, and the terms defined in Lemma 5, when step sizes are chosen as
$$\alpha = \min\left\{\frac{\mu_g}{16 G_y L}\,\nu,\ \frac{K^{-3/5}}{4 G_y L}\right\}, \quad \beta = \min\left\{\nu,\ \frac{4K^{-2/5}}{\mu_g}\right\}, \quad \gamma = \frac{2K^{-3/5}}{B_\ell\, n^{1/2}}, \tag{26}$$
where $\nu = \min\left(\frac{\mu_g}{L_g^2 (1 + \sigma_g)}, \frac{1}{\mu_g}\right)$, Algorithm 1 produces $\bar{x}, \bar{\lambda}, \bar{y}_i, i \in [n]$ satisfying:
$$\mathbb{E}\left[\max_{i \in [n]} \|\bar{y}_i - y_i^\star(\bar{x})\|^2\right] \le A, \tag{27}$$
$$\max_\lambda \mathbb{E}[F(\bar{x}, \lambda)] - \mathbb{E}[F(\bar{x}, \bar{\lambda})] \le \sqrt{2} B_\ell \sqrt{n} K^{-2/5} + G_f A, \tag{28}$$
$$\mathbb{E}[\|\hat{x}(\bar{x}) - \bar{x}\|^2] \le \frac{16\, \Phi_{1/\rho}(x^{(0)})\, G_y L}{(-\mu_\ell + \rho)\rho K^{2/5}} + \frac{8(b_0^2 + L^2 A)}{(-\mu_\ell + \rho)^2} + \frac{2\alpha(\tilde{\sigma}_f^2 + 3b_0^2 + 3L^2 A)}{-\mu_\ell + \rho}, \tag{29}$$
where
$$A = \frac{\Delta^{(0)}_{y_{i^{(0)}}}/\mu_g}{K^{3/5}} + \frac{16\sigma_g^2/\mu_g^2}{K^{7/5}} + \frac{G_y/L}{K^{4/5}} + \frac{2\sqrt{n} B_\ell G_y/L}{K^{2/5}} + \frac{(b_0^2 + \frac{1}{2}\sigma_f^2)/L^2}{K^{2/5}} + \frac{16\sigma_g^2/\mu_g^2}{K^{2/5}}. \tag{30}$$

We also discuss the convergence rate. Note that $A$ in equation 30 is dominated by the fourth term, $\sqrt{n}/K^{2/5}$, so it is clear that equation 27 and equation 28 converge at a rate of $O(\sqrt{n}K^{-2/5})$. We give special attention to equation 29. Apart from the $8b_0^2/(-\mu_\ell + \rho)^2$ term, we see that the RHS of equation 29 converges at a rate of $O(\sqrt{n}K^{-2/5})$. To understand the convergence of this term, we turn to (20) from Assumption 3. As discussed in section 3.2, $b_k$ can be made arbitrarily small by running more iterations of the subroutine for estimating $h_x^{(k)}$.

In order to show equation 10a of Theorem 1 (the convergence of $x$), we will derive a descent lemma comparing successive iterates $x^{(k)}$ and $x^{(k+1)}$. Descent lemmas often contain a quadratic term $\|x^{(k+1)} - x^{(k)}\|^2$, so it is natural that we will have to bound $\|h_x^{(k)}\|^2$. As such, in Lemma 6, we bound the averaged squared norm of the stochastic gradient estimate, $\mathbb{E}[\|h_x^{(k)}\|^2]$:

Lemma 6. Under our regularity assumptions, the average squared norm of $h_x^{(k)}$ can be bounded as follows, where the expectation is over the filtration $\mathcal{F}'_i \triangleq \{y_i^{(0)}, x^{(0)}, \cdots, y_i^{(k)}, x^{(k)}, y_i^{(k+1)}\}$:
$$\mathbb{E}[\|h_x^{(k)}\|^2] \le O\left(\sum_{i=1}^n \lambda_i^{(k)} \|y_i^\star(x^{(k)}) - y_i^{(k+1)}\|^2\right).$$

Turning to equation 10a of Theorem 1, we'll use a descent relation on $y_i^{(k)} - y_i^\star(x^{(k-1)})$. While ideally we would obtain a descent relation purely involving the $y_i^{(k)}$ terms themselves, the intricate coupling of the $x$ and $y$ terms results in an extra $x^{(k-1)} - x^{(k)}$ term. The resulting relation is shown in Lemma 7. Here, we have that $c_1 = 1 - \frac{\mu_g \beta}{2}$, $c_2 = \frac{2}{\mu_g \beta} - 1$, where $\mu_g \beta < 1$.

Lemma 7. The distance between the algorithm's iterates $y_i^{(k)}$ and the true inner optimum $y_i^\star(x^{(k)})$ satisfies the following descent equation,
$$\mathbb{E}[\|y_i^{(k+1)} - y_i^\star(x^{(k)})\|^2] \le O\left((1 - c_1)\,\mathbb{E}[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2] + c_2\, \mathbb{E}[\|x^{(k-1)} - x^{(k)}\|^2]\right).$$

From this lemma, intuitively, we know that $\mathbb{E}[\|y_i^{(k)} - y_i^\star(x^{(k-1)})\|^2]$ is decreasing as $k$ increases, as long as the $\mathbb{E}[\|x^{(k-1)} - x^{(k)}\|^2]$'s are not too large. Therefore, it is important to have another descent relation that upper bounds this quantity, which we do next in Lemma 8. The lemma naturally involves the objective $L^{(k)} = F(x^{(k)}, \lambda^{(k)})$, which will telescope. Here, we have $c_3 = \frac{1}{4\alpha} - \frac{L_f}{2}$, $c_4 = 4\alpha L^2$. As $k \to \infty$, $\alpha \to 0$, and $c_3$ is positive.

Lemma 8. Let $L^{(k)} \triangleq \mathbb{E}[F(x^{(k)}, \lambda^{(k)})] = \sum_{i=1}^n \lambda_i^{(k)} \mathbb{E}[\ell_i(x^{(k)})]$.
Then, the L (k) satisfies the descent equation L (k+1) -L (k) ≤ O -c 3 E[∥x (k+1) -x (k) ∥ 2 ] + c 4 max i∈[n] E[∥y (k+1) i -y ⋆ i (x (k) )∥ 2 ] + √ nγ + α . Following the intuition previously described, we then use Lemmas 7 and 8 to show that the E[∥x (k-1) -x (k) ∥ 2 ] terms are small enough and that the y (k) i iterates converge: Lemma 9 (Informal, see Appendix A.7 for precise statement). 1 K K k=1 max i∈[n] E ∥y (k) i -y ⋆ i (x (k-1) )∥ 2 ≤ O( √ nK -2/5 ). Lemma 10 then leverages the convergence of y i , i ∈ [n] to bound the convergence of λ, and Lemma 11 shows the bound on x. By plugging in our step-sizes into Lemmas 9, 10, and 11, Theorem 1 directly follows. Lemma 10. For any λ ∈ ∆ n , the iterates of Algorithm 1 satisfy 1 K E K k=1 F (x (k) , λ) -F (x (k) , λ (k) ) ≤ O( √ nK -2/5 ). ( ) Lemma 11. The iterates of Algorithm 1 satisfy 1 K K k=1 E[∥x(x (k) ) -x (k) ∥ 2 ] ≤ O( √ nK -2/5 ). A.4 PROOF OF LEMMA 1 (LEMMA 6) Stating Lemma 1 more precisely: Lemma 12. Under Assumptions 1, 2 and 3, the average squared norm of h x can be bounded as follows, where σ2 f = σ 2 f + 3C 2 ℓ : E[∥h (k) x ∥ 2 ] ≤ σ2 f + 3b 2 k + 3L 2 n i=1 λ (k) i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 , ( ) where the expectation is over the filtration F ′ i ≜ {y (0) i , x (0) , • • • , y (k) i , x (k) , y }. Note that here the expectation is over F ′ i ≜ {y (0) i , x (0) , • • • , y (k) i , x (k) , y }, so no expectation is needed in the last term. Proof. We can derive the following: E[∥h (k) x ∥ 2 ] (1) = E[∥h (k) x -E[h (k) x ]∥ 2 ] + ∥E[h (k) x ]∥ 2 (38) (2) = E[∥h (k) x -E[h (k) x ]∥ 2 ] + ∥∇ x F (x (k) , y (k+1) , λ (k) ) + B k ∥ 2 (39) ≤ σ 2 f + ∥∇ x F (x (k) , y (k+1) , λ (k) ) + B k ∥ 2 (40) (4) = σ 2 f + n i=1 λ (k) i ∇ x f i (x (k) , y i ) + B k 2 (41) (5) ≤ σ 2 f + 3b 2 k + 3 2 n i=1 λ (k) i ∇ x f i (x (k) , y (k+1) i ) 2 . 
( ) (1) is true because E[∥h (k) x -E[h (k) x ]∥ 2 ] + ∥E[h (k) x ]∥ 2 = E[∥h (k) x ∥ 2 ] + ∥E[h (k) x ]∥ 2 -2E⟨h (k) x , E[h (k) x ]⟩ + ∥E[h (k) x ]∥ 2 (43) = E[∥h (k) x ∥ 2 ] + ∥E[h (k) x ]∥ 2 -⟨E[h (k) x ], E[h (k) x ]⟩ + ∥E[h (k) x ]∥ 2 (44) = E[∥h (k) x ∥ 2 ]. (2) follows from (20), (3) follows from (21), (4) follows from definition of ∇ x , and (5) follows from Young's inequality, ∥a + b∥ 2 ≤ 3∥a∥ 2 + 3 2 ∥b∥ 2 . Next, we bound the last term in (42). We start by using the fact that n i=1 λ (k) i ∇ x f i (x (k) , y (k+1) i ) 2 (1) ≤ 2 n i=1 λ (k) i (∇ x f i (x (k) , y (k+1) i ) -∇ℓ i (x (k) )) 2 + 2 n i=1 λ (k) i ∇ℓ i (x (k) ) 2 (46) (2) ≤ 2 n i=1 λ (k) i ∥∇ x f i (x (k) , y (k+1) i ) -∇ℓ i (x (k) )∥ 2 + 2 n i=1 λ (k) i ∥∇ℓ i (x (k) )∥ 2 , where (1) follows from ∥a + b∥ 2 ≤ 2∥a∥ 2 + 2∥b∥ 2 , and (2) follows from n i=1 p i a i 2 ≤ n i=1 p i ∥a i ∥ 2 . Next, we bound the first term in (47). From Lemma 5, we have ∥∇ x f i (x (k) , y (k+1) i ) -∇ℓ i (x (k) )∥ 2 ≤ L∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 . ( ) Therefore, we can obtain E[∥h (k) x ∥ 2 ] (1) ≤ σ 2 f + 3b 2 k + 3L 2 n i=1 λ (k) i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 + 3 n i=1 λ (k) i ∥∇ℓ i (x (k) )∥ 2 (49) ≤ σ2 f + 3b 2 k + 3L 2 n i=1 λ (k) i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 , where (1) comes from plugging (48) into ( 47) and ( 47) into (42), ( 2) comes from the definition of σ2 f using ∥∇ℓ i (x (k) )∥ 2 ≤ C 2 ℓ and λ (k) ∈ ∆ n . A.5 PROOF OF LEMMA 2 (LEMMA 7) We state the precise version of Lemma 2 here: Lemma 13. Under Assumptions 1, 2, and 3, when β 2 (1 + σ 2 g )L 2 g ≤ βµ g and µ g β < 1, the iterates y (k) i satisfy the descent equation: E[∥y (k+1) i -y ⋆ i (x (k) )∥ 2 ] (51) ≤ 1 - µ g β 2 ∥y (k) i -y ⋆ i (x (k-1) )∥ 2 + 2 µ g β -1 G 2 y ∥x (k-1) -x (k) ∥ 2 + β 2 σ 2 g . Proof. 
For a particular (fixed) realization of the iterates x (1) , • • • , x (k) , y i , • • • , y (k) i for some i ∈ [n], we have E[∥h (k) i ∥ 2 ] (1) = E[∥h (k) i -E[h (k) i ]∥ 2 ] + ∥E[h (k) i ]∥ 2 (52) = E[∥h (k) i -∇ yi g i (x (k) , y (k) i )∥ 2 ] + ∥∇ yi g i (x (k) , y (k) i )∥ 2 (53) ≤ σ 2 g + (1 + σ 2 g )∥∇ yi g i (x (k) , y (k) i )∥ 2 (54) (4) ≤ σ 2 g + (1 + σ 2 g )∥∇ yi g i (x (k) , y (k) i ) -∇ yi g i (x (k) , y ⋆ i (x (k) ))∥ 2 (5) ≤ σ 2 g + (1 + σ 2 g )L 2 g ∥y (k) i -y ⋆ i (x (k) )∥ 2 , where (1) follows from algebra, (2) follows from E[h (k) i ] = ∇ yi g i (x (k) , y i ) in equation 17 in Assumption 3, (3) is from equation 18 in Assumption 3, (4) is from ∇ yi g i (x (k) , y ⋆ i (x (k) )) = 0 due to the optimality of y ⋆ i (x (k) ), and (5) is due to the L g -Lipschitz continuity of ∇ yi g i (x, •). Next, we can bound the difference between y (k+1) i and y ⋆ i (x (k) ) as the following, where again we assume that x (1) , • • • , x (k) , y (1) , • • • , y (k) i is fixed and the expectation is over the stochasticity of the gradient estimates: E[∥y (k+1) i -y ⋆ i (x (k) )∥ 2 ] = E[∥y (k) i -βh (k) i -y ⋆ i (x (k) )∥ 2 ] ( ) (2) = ∥y (k) i -y ⋆ i (x (k) )∥ 2 + β 2 E[∥h (k) i ∥] 2 -2β⟨y (k) i -y ⋆ i (x (k) ), ∇ yi g i (x (k) , y (k) i )⟩ (58) ≤ (1 -2βµ g )∥y (k) i -y ⋆ i (x (k) )∥ 2 + β 2 E[∥h (k) i ∥ 2 ] (59) ≤ (1 -2βµ g )∥y (k) i -y ⋆ i (x (k) )∥ 2 + β 2 σ 2 g + β 2 (1 + σ 2 g )L 2 g ∥y (k) i -y ⋆ i (x (k) )∥ 2 (5) ≤ (1 -βµ g )∥y (k) i -y ⋆ i (x (k) )∥ 2 + β 2 σ 2 g (61) ≤ (1 -βµ g ) (1 + c)∥y (k) i -y ⋆ i (x (k-1) )∥ 2 + 1 + 1 c ∥y ⋆ i (x (k-1) ) -y ⋆ i (x (k) )∥ 2 + β 2 σ 2 g (62) ≤ (1 -βµ g ) (1 + c)∥y (k) i -y ⋆ i (x (k-1) )∥ 2 + 1 + 1 c G 2 y ∥x (k-1) -x (k) ∥ 2 + β 2 σ 2 g , where ( 1) is true by definition, and (2) holds by direct algebra and the unbiasedness assumption E[h 4) is from equation 56, (5) is from the assumption β 2 (1 + σ 2 g )L 2 g ≤ βµ g , ( 6) is from the inequality ∥a + b∥ 2 ≤ (1 + 1/c)∥a∥ 2 + (1 + c)∥b∥ 2 , and ( 7) is from the G y 
-lipschitzness of y ⋆ i (•) in Lemma 5. (k) i ] = ∇ yi g i (x (k) , y (k) i ) in equation 17 in Assumption 3, (3) is from strong convex- ity, β ∇ yi g i (x (k) , y (k) i ), y (k) i -y ⋆ i (x (k) ) ≥ βµ g ∥y (k) i -y ⋆ i (x (k) )∥ 2 , ( Then, we choose c = µgβ 2(1-µgβ) , so that (1 -βµ g )(1 + c) = 1 -βµ g /2 and 1 + 1/c = 2 µgβ -1. We have c > 0 because µ g β < 1. Plugging these expressions into (63), we get ∥y (k+1) i -y ⋆ i (x (k) )∥ 2 (64) ≤ 1 - µ g β 2 ∥y (k) i -y ⋆ i (x (k-1) )∥ 2 + 2 µ g β -1 G 2 y ∥x (k-1) -x (k) ∥ 2 + β 2 σ 2 g , which completes the proof. A.6 PROOF OF LEMMA 3 (LEMMA 8) We state the precise version of Lemma 3 here: Lemma 14. Let L (k) ≜ E[F (x (k) , λ (k) )] = n i=1 λ (k) i E[ℓ i (x (k) )]. Under Assumptions 1, 2 and 3, assume that the iterates {x (k) , y k) , ∀k} are generated by MORBiT, then, L (k) satisfies the descent equation (k) i , i ∈ [n], λ L (k+1) -L (k) ≤ 4αL 2 max i∈[n] ∆ (k+1) yi + L f 2 - 1 4α E[∥x (k+1) -x (k) ∥ 2 ] + γnB 2 ℓ + 4αb 2 0 + 2ασ 2 f . Proof. First, since ℓ i is L f -smooth, we know that for all i ∈ [n], ℓ i (x (k+1) ) ≤ ℓ i (x (k) ) + ⟨x (k+1) -x (k) , ∇ℓ i (x (k) )⟩ + L f 2 ∥x (k+1) -x (k) ∥ 2 . ( ) Taking λ (k) i times the equation for i in (66) and summing, we can get n i=1 λ (k) i ℓ i (x (k+1) ) ≤ n i=1 λ (k) i ℓ i (x (k) ) + x (k+1) -x (k) , n i=1 λ (k) i ∇ℓ i (x (k) ) + L f 2 ∥x (k+1) -x (k) ∥ 2 . (67) Therefore, we have n i=1 λ (k+1) i ℓ i (x (k+1) ) - n i=1 λ (k) i ℓ i (x (k) ) ≤ n i=1 λ (k+1) i ℓ i (x (k+1) ) - n i=1 λ (k) i ℓ i (x (k+1) ) ≜(A) + ⟨x (k+1) -x (k) , n i=1 λ (k) i ∇ℓ i (x (k) )⟩ ≜(B) + L f 2 ∥x (k+1) -x (k) ∥ 2 . Next, we bound (A) and (B) respectively as follows. First, we upper bound term (A). First, from the non-expansiveness of projections and λ (k+1) = proj ∆n (λ (k) + γh (k) λ ), we have ∥λ (k+1) - λ (k) ∥ ≤ ∥γh (k) λ ∥. Since λ (k+1) , λ (k) ∈ ∆ n , ∥λ (k+1) -λ (k) ∥ ≤ √ 2. Therefore, we know that ∥λ (k+1) -λ (k) ∥ ≤ Λ ≜ min{ √ 2, γ∥h λ ∥}. 
Based on these facts, we can have (A) = n i=1 λ (k+1) i ℓ i (x (k+1) ) - n i=1 λ (k) i ℓ i (x (k+1) ) (1) = n i=1 λ (k+1) i -λ (k) i ℓ i (x (k+1) ) (2) ≤ λ (k+1) -λ (k) ℓ 1 (x (k+1) ), ℓ 2 (x (k+1) ), • • • , ℓ n (x (k+1) ) ⊤ (71) ≤ ΛB ℓ √ n (4) ≤ √ n min{ √ 2, ∥γh (k) λ ∥}B ℓ (5) ≤ √ nγ∥h (k) λ ∥B ℓ , where ( 1) is straightforward, (2) follows from Cauchy-Schwarz, (3) follows from the update rule for λ and the fact that |ℓ i (•)| ≤ B ℓ from Assumption 1, (4) is from plugging in the definition of Λ, and (5) follows from γ∥h k λ ∥ ≤ √ 2. Then, we upper bound (B). First, from the non-expansiveness of projection and the update rule x (k+1) = proj X (x (k) -αh (k) x ), we know that ∥x (k+1) -x (k) + αh (k) x ∥ 2 ≤ ∥ -αh (k) x ∥ 2 , ( ) ⇒ ∥x (k+1) -x (k) ∥ 2 + 2α⟨x (k+1) -x (k) , h (k) x ⟩ ≤ 0, ⇒ 1 2α ∥x (k+1) -x (k) ∥ 2 + ⟨x (k+1) -x (k) , h (k) x ⟩ ≤ 0. ( ) Therefore, we can have (B) = n i=1 λ (k) i ∇ℓ i (x (k) ) , x (k+1) -x (k) (1) = ∇ x F (x (k) , λ (k) ), x (k+1) -x (k) ( ) (2) = ∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k , x (k+1) -x (k) + ∇ x F (x (k) , y (k+1) , λ (k) ) + B k , x (k+1) -x (k) ( ) (3) ≤ ∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k , x (k+1) -x (k) (78) + ∇ x F (x (k) , y (k+1) , λ (k) ) + B k -h (k) x , x (k+1) -x (k) - 1 2α ∥x (k+1) -x (k) ∥ 2 (4) ≤ 1 2c ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 + c 2 ∥x (k+1) -x (k) ∥ 2 + 1 2d ∥∇ x F (x (k) , y (k+1) , λ (k) ) + B k -h (k) x ∥ 2 + d 2 ∥x (k+1) -x (k) ∥ 2 - 1 2α ∥x (k+1) -x (k) ∥ 2 (5) ≤ 1 2c ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 + c 2 ∥x (k+1) -x (k) ∥ 2 + σ 2 f 2d + d 2 ∥x (k+1) -x (k) ∥ 2 - 1 2α ∥x (k+1) -x (k) ∥ 2 (80) (6) = 1 2c ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 + c + d 2 - 1 2α ∥x (k+1) -x (k) ∥ 2 + σ 2 f 2d , where ( 1) is by definition of F (x (k) , λ (k) ), (2) is from adding and subtracting 3) is from adding (75) to the previous inequality, ( 4) is from applying 
the inequality ⟨a, b⟩ ≤ 1 2c ∥a∥ 2 + c 2 ∥b∥ 2 to both inner product terms, (5) is from equation 20 and equation 21 in Assumption 3, and ( 6) is from algebra. ∇ x F (x (k) , y (k+1) , λ (k) ) + B k , x (k+1) -x (k) , ( Plugging in our expressions for (A) and (B) into (68), we get n i=1 λ (k+1) i ℓ i (x (k+1) ) - n i=1 λ (k) i ℓ i (x (k) ) ≤ γ∥h (k) λ ∥ √ nB ℓ + 1 2c ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 + c + d + L f 2 - 1 2α ∥x (k+1) -x (k) ∥ 2 + σ 2 f 2d . ( ) Next, we work on bounding ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 in equation 81. Observe that ∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) ) -B k ∥ 2 (1) ≤ 2∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) )∥ 2 + 2∥B k ∥ 2 (2) ≤ 2∥∇ x F (x (k) , λ (k) ) -∇ x F (x (k) , y (k+1) , λ (k) )∥ 2 + 2b 2 k ( ) (3) ≤ 2 n i=1 λ (k) i ∇ℓ i (x (k) ) -∇ x f i (x (k) , y (k+1) i ) 2 + 2b 2 k ( ) (4) ≤ 2 n i=1 λ (k) i L∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 + 2b 2 k ( ) (5) ≤ 2L 2 max i∈[n] ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 + 2b 2 k , where ( 1) comes from ∥a 4) is from Lemma 5. Therefore, plugging in equation 87 into equation 82, using c = d = 1 4α , and taking expectation over the stochasticity of the gradient estimates, we get: + b∥ 2 ≤ 2∥a∥ 2 + 2∥b∥ 2 , (2) comes from ∥B k ∥ ≤ b k in Assumption 3, ( ) is from expanding the definitions of ∇ x F (•, •) and ∇ x F (•, •, •), ( n i=1 λ (k+1) i ℓ i (x (k+1) ) - n i=1 λ (k) i ℓ i (x (k) ) ≤ √ nγ∥h (k) λ ∥B ℓ + 4αL 2 max i∈[n] E∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 + 4αb 2 k + L f 2 - 1 4α E[∥x (k+1) -x (k) ∥ 2 ] + 2ασ 2 f . ( ) Now, observe that the LHS looks like a telescoping sum. To make this more apparent, define L (k) = n i=1 λ (k) i E[ℓ i (x (k) )] and ∆ (k) yi = E[∥y (k) i -y ⋆ i (x (k-1) )∥ 2 ] . 
Therefore, with the assumption that b k ≤ b 0 , we have L (k+1) -L (k) ≤ 4αL 2 max i∈[n] ∆ (k+1) yi + L f 2 - 1 4α E[∥x (k+1) -x (k) ∥ 2 ] + √ nγ∥h (k) λ ∥B ℓ + 4αb 2 0 + 2ασ 2 f ≤ 4αL 2 max i∈[n] ∆ (k+1) yi + L f 2 - 1 4α E[∥x (k+1) -x (k) ∥ 2 ] + γnB 2 ℓ + 4αb 2 0 + 2ασ 2 f . A.7 PROOF OF LEMMA 9 We restate Lemma 9 in more general terms here: Lemma 15. Assume that Ω (k) , Θ (k) , Υ (k) i , λ i , c 0 , c 1 , c 2 , d 0 , d 1 , d 2 are real numbers such that for all 0 ≤ k ≤ K -1, Ω (k+1) ≤ Ω (k) -c 0 Θ (k+1) + c 1 max i∈[n] Υ (k+1) i + c 2 (90) and also for all 1 ≤ k ≤ K, 1 ≤ i ≤ N , Υ (k+1) i ≤ (1 -d 0 )Υ (k) i + d 1 Θ (k) + d 2 . ( ) In addition, assume that 1 - d 0 > 0, d 0 -d 1 c 1 c -1 0 > 0 and c 0 -c 1 d 1 d -1 0 , and that Υ (k) i , Ω (k) ≥ 0 for all k, i ∈ [n]. Then, if i (0) = arg max i∈[n] Υ (0) i , we have 1 K K k=1 max i∈[n] Υ (k) i ≤ (d 0 -d 1 c -1 0 c 1 ) -1 Υ (0) i (0) + d 1 Θ (0) + d 2 + d 1 c -1 0 Ω (0) K + d 1 c -1 0 c 2 + d 2 . (92) Proof. First, let i (k) ≜ arg max i∈[n] Υ (k) i , so that Υ (k) i (k) = max i∈[n] Υ (k) i . Summing (90) from k = 0, 1, • • • , K -1, we get: c 0 K k=1 Θ (k) ≤ Ω (0) -Ω (k) + c 1 K k=1 Υ (k) i (k) + c 2 K. ( ) Next, we apply (91) for i = i (k+1) . Noting that 1 -d 0 > 0 and Υ (k) i (k+1) ≤ Υ (k) i (k) by definition of i (k) , we have Υ (k+1) i (k+1) ≤ (1 -d 0 )Υ (k) i (k+1) + d 1 Θ (k) + d 2 (94) ≤ (1 -d 0 )Υ (k) i (k) + d 1 Θ (k) + d 2 . (95) Then summing for k = 1 to K, we get d 0 K k=1 Υ (k) i (k) ≤ Υ (1) i (1) -Υ (K+1) i (K+1) + d 1 K k=1 Θ (k) + d 2 K. Now, we have d 0 K k=1 Υ (k) i (k) ≤ Υ (1) i (1) -Υ (K+1) i (K+1) + d 1 c -1 0 Ω (0) -Ω (k) + c 1 K k=1 Υ (k) i (k) + c 2 K + d 2 K (97) ≤ Υ (1) i (1) + d 1 c -1 0 Ω (0) + c 1 K k=1 Υ (k) i (k) + c 2 K + d 2 K (98) ≤ Υ (1) i (1) + d 1 c -1 0 Ω (0) + d 1 c -1 0 c 1 K k=1 Υ (k) i (k) + d 1 c -1 0 c 2 K + d 2 K, where (1) holds from plugging (93) into (96), ( 2) is true because Υ, Ω ≥ 0, and (3) follows from the distributive property. 
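We can rewrite this equation as
$$(d_0 - d_1 c_0^{-1} c_1) \sum_{k=1}^K \Upsilon^{(k)}_{i^{(k)}} \le \Upsilon^{(1)}_{i^{(1)}} + d_1 c_0^{-1} \Omega^{(0)} + d_1 c_0^{-1} c_2 K + d_2 K \tag{100}$$
$$\Rightarrow \frac{1}{K} \sum_{k=1}^K \Upsilon^{(k)}_{i^{(k)}} \le (d_0 - d_1 c_0^{-1} c_1)^{-1} \left[\frac{\Upsilon^{(1)}_{i^{(1)}} + d_1 c_0^{-1} \Omega^{(0)}}{K} + d_1 c_0^{-1} c_2 + d_2\right]. \tag{101}$$
By plugging in $\Upsilon^{(1)}_{i^{(1)}} = \Upsilon^{(0)}_{i^{(0)}} + d_1 \Theta^{(0)} + d_2$ into (101), we get the statement of the lemma.

Plugging the following values into Lemma 15 and utilizing Lemmas 13 and 14 and the learning rates $\alpha, \beta, \gamma$ from Theorem 2, we get the result in Lemma 9 in precise terms:
$$\Omega^{(k)} = L^{(k)}, \quad \Theta^{(k)} = \mathbb{E}[\|x^{(k)} - x^{(k-1)}\|^2], \quad \Upsilon_i^{(k)} = \Delta_{y_i}^{(k)},$$
$$c_0 = \frac{1}{4\alpha} - \frac{L_f}{2}, \quad c_1 = 4\alpha L^2, \quad c_2 = \gamma n B_\ell^2 + 4\alpha b_0^2 + 2\alpha \sigma_f^2,$$
$$d_0 = \frac{\mu_g \beta}{2}, \quad d_1 = \left(\frac{2}{\mu_g \beta} - 1\right) G_y^2, \quad d_2 = \beta^2 \sigma_g^2.$$

Now, we can bound the optimality of $y$ by bounding the maximum difference $\Delta_{y_i}^{(k)}$:
$$\frac{1}{K} \sum_{k=1}^K \max_{i \in [n]} \Delta_{y_i}^{(k)} \overset{(1)}{\le} (d_0 - d_1 c_0^{-1} c_1)^{-1} \left[\frac{\Upsilon^{(0)}_{i^{(0)}} + d_1 \Theta^{(0)} + d_2 + d_1 c_0^{-1} \Omega^{(0)}}{K} + d_1 c_0^{-1} c_2 + d_2\right] \tag{106}$$
$$\overset{(2)}{=} \frac{4}{\mu_g \beta} \left[\frac{\Upsilon^{(0)}_{i^{(0)}} + d_1 \Theta^{(0)} + d_2 + d_1 c_0^{-1} \Omega^{(0)}}{K} + d_1 c_0^{-1} c_2 + d_2\right] \tag{107}$$
$$\overset{(3)}{\le} \frac{4}{\mu_g \beta} \left[\frac{\Delta^{(0)}_{y_{i^{(0)}}} + \beta^2 \sigma_g^2 + \frac{2G_y^2}{\mu_g \beta}(8\alpha) L^{(0)}}{K} + \frac{2G_y^2}{\mu_g \beta}(8\alpha)\left(n\gamma B_\ell^2 + 4\alpha b_0^2 + 2\alpha \sigma_f^2\right) + \beta^2 \sigma_g^2\right] \tag{108}$$
$$\overset{(4)}{=} \frac{\frac{4\Delta^{(0)}_{y_{i^{(0)}}}}{\mu_g \beta} + \frac{4\beta \sigma_g^2}{\mu_g} + \frac{64 G_y^2 \alpha}{\mu_g^2 \beta^2}}{K} + \frac{64 G_y^2 \alpha}{\mu_g^2 \beta^2}\left(n\gamma B_\ell^2 + 4\alpha b_0^2 + 2\alpha \sigma_f^2\right) + \frac{4\beta \sigma_g^2}{\mu_g} \tag{109}$$
$$\overset{(5)}{=} \frac{4\Delta^{(0)}_{y_{i^{(0)}}}}{\mu_g} \cdot \frac{1}{\beta K} + \frac{4\sigma_g^2}{\mu_g} \cdot \frac{\beta}{K} + \frac{64 G_y^2}{\mu_g^2} \cdot \frac{\alpha}{\beta^2 K} + \frac{64 G_y^2}{\mu_g^2} \cdot \frac{\gamma B_\ell^2 n \alpha}{\beta^2} + \frac{64 G_y^2 (4b_0^2 + 2\sigma_f^2)}{\mu_g^2} \cdot \frac{\alpha^2}{\beta^2} + \frac{4\sigma_g^2}{\mu_g} \cdot \beta \tag{110}$$
$$\overset{(6)}{\le} \frac{\Delta^{(0)}_{y_{i^{(0)}}}/\mu_g}{K^{3/5}} + \frac{16\sigma_g^2/\mu_g^2}{K^{7/5}} + \frac{G_y/L}{K^{4/5}} + \frac{2\sqrt{n} B_\ell G_y/L}{K^{2/5}} + \frac{(b_0^2 + \frac{1}{2}\sigma_f^2)/L^2}{K^{2/5}} + \frac{16\sigma_g^2/\mu_g^2}{K^{2/5}}. \tag{111}$$
Here, (1) follows directly from plugging $\Upsilon_i^{(k)}$ from (102) into Lemma 4, (2) follows from (104), (3) comes from plugging in the rest of (102), (4) is direct algebraic manipulation, (5) separates the step sizes and $n, K$ factors from the rest of the constants, and (6) applies the definition of the step sizes. This gives us the $O(\sqrt{n}K^{-2/5})$ bound in Lemma 9.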
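To see how the step sizes of Theorem 2 and the resulting bound scale with K and n, the formulas can be evaluated numerically. This is a minimal sketch: the function names and all default constant values (set to 1) are hypothetical placeholders rather than values from the paper, and the formulas are transcribed as they are printed in Theorem 2 and in the final bound of this proof.

```python
import math


def morbit_step_sizes(K, n, mu_g=1.0, L_g=1.0, sigma_g=1.0,
                      G_y=1.0, L=1.0, B_l=1.0):
    """Step sizes (alpha, beta, gamma) as printed in Theorem 2."""
    nu = min(mu_g / (L_g ** 2 * (1.0 + sigma_g)), 1.0 / mu_g)
    alpha = min(mu_g * nu / (16.0 * G_y * L),
                K ** (-3 / 5) / (4.0 * G_y * L))
    beta = min(nu, 4.0 * K ** (-2 / 5) / mu_g)
    gamma = 2.0 * K ** (-3 / 5) / (B_l * math.sqrt(n))
    return alpha, beta, gamma


def bound_A(K, n, delta0=1.0, mu_g=1.0, sigma_g=1.0, sigma_f=1.0,
            G_y=1.0, L=1.0, B_l=1.0, b0=1.0):
    """The six-term bound on the LL error; delta0 is the initial error."""
    return (
        (delta0 / mu_g) / K ** (3 / 5)
        + (16.0 * sigma_g ** 2 / mu_g ** 2) / K ** (7 / 5)
        + (G_y / L) / K ** (4 / 5)
        + (2.0 * math.sqrt(n) * B_l * G_y / L) / K ** (2 / 5)
        + ((b0 ** 2 + 0.5 * sigma_f ** 2) / L ** 2) / K ** (2 / 5)
        + (16.0 * sigma_g ** 2 / mu_g ** 2) / K ** (2 / 5)
    )


for K in (10 ** 3, 10 ** 4, 10 ** 5):
    alpha, beta, gamma = morbit_step_sizes(K, n=16)
    print(f"K={K:>6d}  alpha={alpha:.2e}  beta={beta:.2e}  gamma={gamma:.2e}  "
          f"A={bound_A(K, n=16):.3f}  A*K^(2/5)={bound_A(K, n=16) * K ** 0.4:.3f}")
```

With these placeholder constants, the product A·K^(2/5) levels off as K grows, and only the fourth term depends on n, through √n, which is the sublinear dependence on the number of objectives discussed above.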

A.8 PROOF OF LEMMA 10

We present a precise form of Lemma 10 here: Lemma 16. For any λ ∈ ∆ n , under Assumptions 1, 2, and 3, assume that the iterates {x (k) , y i , i ∈ [n]λ (k) , ∀k} generated by MORBiT, then we have 1 K E K k=1 F (x (k) , λ) -F (x (k) , λ (k) ) ≤ 1 √ 2 B ℓ √ nK -3/5 + B ℓ √ nK -2/5 + G f K K k=1 max i∈[n] ∆ (k) yi . Proof. Recall that we defined F (x, λ) := n i=1 λ i f i (x, y ⋆ i (x)) = n i=1 λ i ℓ i (x). For a fixed realization of x (1) , • • • , x (k) , y i , • • • , y (k) i , i ∈ [n], we have F (x (k) , λ) -F (x (k) , λ (k) ) (1) = n i=1 (λ i -λ (k) i )f i (x (k) , y ⋆ i (x (k) )) (2) = n i=1 (λ i -λ (k) i )(f i (x (k) , y ⋆ i (x (k) )) -f i (x (k) , y ) + f i (x (k) , y ))) = λ -λ (k) , f 1 (x (k) , y ), • • • , f n (x (k) , y (k+1) n ) ⊤ + n i=1 (λ i -λ (k) i )(f i (x (k) , y ⋆ i (x (k) )) -f i (x (k) , y )) = λ -λ (k) , h (k) λ + n i=1 (λ i -λ (k) i )(f i (x (k) , y ⋆ i (x (k) )) -f i (x (k) , y )) (5) = ∥λ -λ (k) ∥ 2 + γ 2 ∥h (k) λ ∥ 2 -∥λ -λ (k) -γh (k) λ ∥ 2 2γ + n i=1 (λ i -λ (k) i )(f i (x (k) , y ⋆ i (x (k) )) -f i (x (k) , y )) ≤ ∥λ -λ (k) ∥ 2 + γ 2 ∥h (k) λ ∥ 2 -∥λ -λ (k+1) ∥ 2 2γ + n i=1 (λ -λ (k) ) i (f i (x (k) , y ⋆ i (x (k) )) -f i (x (k) , y )) ≤ ∥λ -λ (k) ∥ 2 + γ 2 ∥h (k) λ ∥ 2 -∥λ -λ (k+1) ∥ 2 2γ + G f n i=1 (λ i -λ (k) i )(y ⋆ i (x (k) ) -y (k+1) i ), where (1) comes from the definition of F , (2) follows from adding and subtracting (λ iλ (k) i )f i (x (k) , y ) terms, (3) follows from splitting the preceding sum and writing the first term as a dot product, (4) follows from definition of h (k) λ , (5) uses ⟨a, b⟩ = ∥a∥ 2 +∥b∥ 2 -∥a-b∥ 2 2 , follows from the update λ (k+1) = proj ∆n (λ (k) -γh (k) λ ) and the projection property, and (7) follows from Lipschitzness of f . Therefore, applying the telescoping sum by adding the preceding inequality over k = 1, 2, . . . 
, K, and taking expectation, we get: E K k=1 F (x (k) , λ) -F (x (k) , λ (k) ) (1) ≤ γ 2 K k=1 E∥h (k) λ ∥ 2 + E[∥λ -λ (1) ∥ 2 ] 2γ + G f K k=1 n i=1 (λ i -λ (k) i )∆ (k) yi (121) ≤ γ 2 K k=1 E∥h (k) λ ∥ 2 + 1 γ + G f K k=1 max i∈[n] ∆ (k) yi (122) ≤ nKB 2 γ 2 + 1 γ + G f K k=1 max i∈[n] ∆ (k) yi ( ) (4) ≤ √ 2 2 B ℓ √ nK 2/5 + B ℓ √ nK 3/5 + G f K k=1 max i∈[n] ∆ (k) yi where (1) follows directly from (120) and the telescoping sum, (2) follows from λ ∈ ∆ n and ∥λ - λ (1) ∥ 2 ≤ 2, (3) follows from ∥h (k) λ ∥ 2 ≤ nB 2 , and (4) follows from selecting γ = √ 2 B ℓ √ nK 3/5 . Therefore, we obtain 1 K E K k=1 F (x (k) , λ) -F (x (k) , λ (k) ) ≤ 1 √ 2 B ℓ √ nK -3/5 + B ℓ √ nK -2/5 + G f O( √ nK -2/5 ). A.9 PROOF OF LEMMA 11 We state Lemma 11 here in precise terms: Lemma 17. Under Assumptions 1, 2, and 3 with the iterates {x (k) , y k) , ∀k} generated by MORBiT, then we have (k) i , i ∈ [n], λ 1 K K k=1 E[∥x(x (k) ) -x (k) ∥ 2 ] ≤ 4 -µ ℓ + ρ Φ 1/ρ (x 0 ) ραK + 2 -µ ℓ + ρ b 2 0 + 2L 2 -µ ℓ + ρ + 3L 2 α 2 1 K K k=1 max i∈[n] ∆ (k) yi + α 2 σ2 f + 3b 2 0 . Proof. Recall that we defined the Moreau envelope and proximal map as follows: Φ 1/ρ (z) ≜ min x ρ 2 ∥x -z∥ 2 + n i=1 λ i ℓ i (x), x(z) ≜ arg min x ρ 2 ∥x -z∥ 2 + n i=1 λ i ℓ i (x). Therefore, we have Φ 1/ρ (x (k+1) ) = i λ i ℓ i (x(x (k+1) )) + ρ 2 ∥x (k+1) -x(x (k+1) )∥ 2 (128) ≤ i λ i ℓ i (x(x (k) )) + ρ 2 ∥x (k+1) -x(x (k) )∥ 2 = i λ i ℓ i (x(x (k) )) + ρ 2 ∥x (k+1) -x (k) + x (k) -x(x (k) )∥ 2 (130) = i λ i ℓ i (x(x (k) )) + ρ 2 ∥x (k+1) -x (k) ∥ 2 + ρ 2 ∥x (k) -x(x (k) )∥ 2 + ρ x (k+1) -x (k) , x (k) -x(x (k) ) where ( 1) is by definition of the proximal map, (2) comes from the optimality of the Moreau envelope, (3) is by adding and subtracting x (k) , and ( 4) is from expanding out ∥a+b∥ 2 into ∥a∥ 2 +∥b∥ 2 +2⟨a, b⟩. 
Published as a conference paper at ICLR 2023 Next, from the optimality condition of the update x (k+1) = proj ∆n (x (k) -αh (k) x ), we have ⟨x (k+1) -x(x (k) ), x (k+1) -x (k) + αh (k) x ⟩ ≤ 0 (132) ⇒ ⟨x (k+1) - x (k) + x (k) -x(x (k) )x (k+1) -x (k) + αh (k) x ⟩ ≤ 0 (133) ⇒ ⟨x (k+1) -x (k) , x (k) -x(x (k) )⟩ ≤ -∥x (k+1) -x (k) ∥ 2 -α⟨h (k) x , x (k+1) -x (k) ⟩ -α⟨h (k) x , x (k) -x(x (k) )⟩ ⇒ ρ⟨x (k+1) -x (k) , x (k) -x(x (k) )⟩ ≤ -ρ∥x (k+1) -x (k) ∥ 2 -ρα⟨h (k) x , x (k+1) -x (k) ⟩ -ρα⟨h (k) x , x (k) -x(x (k) )⟩ ⇒ ρ⟨x (k+1) - x (k) , x (k) -x(x (k) )⟩ ≤ -ρ∥x (k+1) -x (k) ∥ 2 + ρ⟨αh (k) x , x (k) -x (k+1) ⟩ + ρα⟨h (k) x , x(x (k) ) -x (k) ⟩ (5) ⇒ ρ⟨x (k+1) -x (k) , x (k) -x(x (k) )⟩ ≤ -ρ∥x (k+1) -x (k) ∥ 2 + ρ 2 ∥αh (k) x ∥ 2 + ∥x (k) -x (k+1) ∥ 2 + ρα⟨h (k) x , x(x (k) ) -x (k) ⟩ (137) ⇒ ρ⟨x (k+1) - x (k) , x (k) -x(x (k) )⟩ ≤ - ρ 2 ∥x (k+1) -x (k) ∥ 2 + ρα 2 2 ∥h (k) x ∥ 2 + ρα⟨h (k) x , x(x (k) ) -x (k) ⟩ where ( 1) is by adding and subtracting x (k) , ( 2) is from distributing the inner product ⟨(x (k+1) - x (k) ) + (x (k) -x(x (k) )), (x (k+1) -x (k) ) + αh (k) x ⟩, is from multiplying both sides by ρ, (4) is from simple algebra, (5) is from rewriting ⟨a, b⟩ ≤ ∥a∥ 2 +∥b∥ 2

2

, and ( 6) is from combining terms. Therefore, substituting (138) into (131), we get Φ 1/ρ (x (k+1) ) ≤ i λ i ℓ i (x(x (k) )) + ρ 2 ∥x (k) -x(x (k) )∥ 2 + ρα 2 2 ∥h (k) x ∥ 2 + ρα⟨h (k) x , x(x (k) ) -x (k) ⟩ (139) = Φ 1/ρ (x (k) ) + ρα 2 2 ∥h (k) x ∥ 2 + ρα⟨h (k) x , x(x (k) ) -x (k) ⟩ where the second equality is by definition of the Moreau envelope. Now, we bound the last term in (140): ⟨h (k) x , x(x (k) ) -x (k) ⟩ (1) = x(x (k) ) -x (k) , h (k) x -∇ x F (x (k) , y (k+1) , λ (k) ) + x(x (k) ) -x (k) , ∇ x F (x (k) , y (k+1) , λ (k) ) -∇ x F (x (k) , λ (k) ) + x(x (k) ) -x (k) , ∇ x F (x (k) , λ (k) ) (2) = ⟨x(x (k) ) -x (k) , B k ⟩ (A) + ⟨x(x (k) ) -x (k) , ∇ x F (x (k) , y (k+1) , λ (k) ) -∇ x F (x (k) , λ (k) )⟩ (B) + ⟨x(x (k) ) -x (k) , ∇ x F (x (k) , λ (k) )⟩ (C) where (1) follows from adding and subtracting ∇ x F (x (k) , y (k+1) , λ (k) ), ∇ x F (x (k) , λ (k) ) terms and ( 2) is from splitting the inner product and applying h (k) x -∇ x F (x (k) , y (k+1) , λ (k) ) = B k from equation 20 in Assumption 3. To bound (A) and (B), we simply apply ⟨a, b⟩ ≤ c 4 ∥a∥ 2 + 1 c ∥b∥ 2 to both inner products: (A) = ⟨x(x (k) ) -x (k) , B k ⟩ ≤ c 4 ∥x(x (k) ) -x (k) ∥ 2 + 1 c b 2 k (143) (B) = x(x (k) ) -x (k) , ∇ x F (x (k) , y (k+1) , λ (k) ) -∇ x F (x (k) , λ (k) ) ≤ 1 c ∥∇ x F (x (k) , y (k+1) , λ (k) ) -∇ x F (x (k) , λ (k) )∥ 2 + c 4 ∥x(x (k) ) -x (k) ∥ 2 . ( ) We proceed to bound (C). First, from weak convexity of µ ℓ , we have that for all i ∈ [n], ℓ i (x(x (k) )) ≥ ℓ i (x (k) ) + ⟨∇ℓ(x (k) ), x(x (k) ) -x (k) ⟩ - µ ℓ 2 ∥x(x (k) ) -x (k) ∥ 2 . ( ) Taking λ i times the ith of these equations, we get n i=1 λ i ℓ i (x(x (k) )) ≥ n i=1 λ i ℓ i (x (k) ) + n i=1 ∇ℓ i (x (k) ), x(x (k) ) -x (k) - µ ℓ 2 ∥x(x (k) ) -x (k) ∥ 2 . (146) By definition of the Moreau envelope, we also have n i=1 λ i ℓ i (x (k) ) ≥ n i=1 λ i ℓ i (x(x (k) )) + ρ 2 ∥x(x (k) ) -x (k) ∥ 2 . 
Adding ( 146) and (147), we have µ ℓ -ρ 2 ∥x(x (k) ) -x (k) ∥ 2 ≥ ⟨∇ x F (x (k) , λ (k) ), x(x (k) ) -x (k) ⟩. If we let c = -µ ℓ +ρ 2 in ( 143) and (144), we can rewrite (142) as ⟨h (k) x , x(x (k) ) -x (k) ⟩ ≤ c 2 ∥x(x (k) ) -x (k) ∥ 2 + 1 c b 2 k + 1 c ∥∇ x F (x (k) , y (k+1) , λ (k) ) -∇ x F (x (k) , λ (k) )∥ 2 - -µ ℓ + ρ 2 ∥x(x (k) ) -x (k) ∥ 2 (149) ≤ c 2 ∥x(x (k) ) -x (k) ∥ 2 + 1 c b 2 k + L 2 c n i=1 λ i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 - -µ ℓ + ρ 2 ∥x(x (k) ) -x (k) ∥ 2 (150) ≤ 2 -µ ℓ + ρ b 2 k + 2L 2 -µ ℓ + ρ n i=1 λ i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 - -µ ℓ + ρ 4 ∥x(x (k) ) -x (k) ∥ 2 . Taking the full expectation F i ≜ {y (0) i , x (0) , • • • , y (k) i , x (k) }, we have E[⟨h (k) x , x(x (k) ) -x (k) ⟩] ≤ 2 -µ ℓ + ρ b 2 k + 2L 2 -µ ℓ + ρ E n i=1 λ i ∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 - -µ ℓ + ρ 4 ∥x(x (k) ) -x (k) ∥ 2 (152) = 2 -µ ℓ + ρ b 2 k + 2L 2 -µ ℓ + ρ n i=1 λ i E[∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 ] - -µ ℓ + ρ 4 ∥x(x (k) ) -x (k) ∥ 2 (153) ≤ 2 -µ ℓ + ρ b 2 k + 2L 2 -µ ℓ + ρ max i∈[n] E[∥y ⋆ i (x (k) ) -y (k+1) i ∥ 2 ] - -µ ℓ + ρ 4 ∥x(x (k) ) -x (k) ∥ 2 (154) = 2 -µ ℓ + ρ b 2 k + 2L 2 -µ ℓ + ρ max i∈[n] ∆ (k) yi - -µ ℓ + ρ 4 ∥x(x (k) ) -x (k) ∥ 2 Therefore, rewriting everything into (146) and taking the full expectation over F i , we have (recall that the definition of ∆ (k) yi includes an expectation): E[Φ 1/ρ (x (k+1) )] ≤ E[Φ 1/ρ (x (k) )] + ρα 2 2 E[∥h (k) x ∥ 2 ] + ραE[⟨h (k) x , x(x (k) ) -x (k) ⟩] (2) = E[Φ 1/ρ (x (k) )] + 2ρα -µ ℓ + ρ b 2 k + 2L 2 ρα -µ ℓ + ρ max i∈[n] ∆ (k) yi - ρα(-µ ℓ + ρ) 4 E[∥x(x (k) ) -x (k) ∥ 2 ] + ρα 2 2 σ2 f + 3b 2 k + 3L 2 max i∈[n] ∆ (k) yi ( ) (3) = E[Φ 1/ρ (x (k) )] + 2ρα -µ ℓ + ρ b 2 k + 2L 2 ρα -µ ℓ + ρ + 3L 2 ρα 2 2 max i∈[n] ∆ (k) yi + ρα(µ ℓ -ρ) 4 E[∥x(x (k) ) -x (k) ∥ 2 ] + ρα 2 2 (σ 2 f + 3b 2 k ), where ( 1) is a copy of ( 140) and ( 2) is from (158), plugging in ∥h (k) x ∥ 2 from Lemma 1, and doing the same expectation calculation from (152) to (155). Finally, (3) is combining terms via algebra. 
Summing up from k = 0, 1, • • • , K -1, we get the following bound: 1 K K k=1 E[∥x(x (k) ) -x (k) ∥ 2 ] ≤ 4 -µ ℓ + ρ Φ 1/ρ (x 0 ) ραK + 2 -µ ℓ + ρ b 2 0 + 2L 2 -µ ℓ + ρ + 3L 2 α 2 1 K K k=1 max i∈[n] ∆ (k) yi + α 2 σ2 f + 3b 2 0 (159) = 4 -µ ℓ + ρ     Φ 1/ρ (x 0 ) ραK K -2/5 + 2 -µ ℓ + ρ b 2 0 + 2L 2 -µ ℓ + ρ 1 K K k=1 max i∈[n] ∆ (k) yi √ nK -2/5 + α 2 σ2 f + 3b 2 0 + 3L 2 K K k=1 max i∈[n] ∆ (k) yi K -3/5 + √ nK -1       . B GENERALIZATION BOUNDS  F i = f i x, ŷ⋆ i (x; D t i ), D v i , x ∈ X . We use f and g to denote the empirical function values evaluated at the points in D t i and D v i , and so the empirical Rademacher complexity of F i on m i samples D i ≜ {(D t i,j , D v i,j } mi j=1 ∼ (D i ) mi is R i mi (F i ) = E Di E ϵj   sup x∈X 1 m i mi j=1 ϵ j f i x, y ⋆ i (x; D t i,j ); D v i,j   , where ϵ j s are Rademacher random variables (± 1 /2 with equal probability). The empirical loss for fixed samples D t i,j and D v i,j , Fi (x) ≜ 1 m i mi j=1 f i x, y ⋆ i (x; D t i,j ); D v i,j . Similarly, define  F i (x) ≜ E Di [ Fi (x)]. F i (x) ≤ Fi (x) + 2R i mi (F i ) + B ℓ log 1/δ 2m i . This extends to the following in a straightforward manner, providing a guarantee for the worst-case generalization for any learning task i: Proposition 2. Assume the regularity assumptions given in Section 5, specifically that the function ℓ is B ℓ -bounded. Then, with probability at least 1 -δ, we have max i∈[n] F i (x) ≤ max i∈[n] Fi (x) + 2R m (F) + B ℓ log n/δ 2m . ( ) Here we assume that F i = F and m i = m for all i ∈ [n] and hence R i mi (F i ) = R m (F) for all i ∈ [n]. Next, we proceed to bound the generalization on unseen tasks. Consider a new task with distribution D n+1 = n i=1 a i D i , for some a ∈ ∆ n , meaning that the distribution of the new task is anywhere in the convex hull of the distribution of the old tasks. 
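We make this assumption because if the new task is very dissimilar to the existing tasks, there is no reason to expect good generalization in the first place. We then show the following proposition:

Proposition 3. For all $x \in \mathcal{X}$, with probability at least $1 - \delta$, we have

$$
F_{n+1}(x) \le \max_{p\in\Delta_n} \sum_{i=1}^{n} p_i \hat{F}_i(x) + 2\sum_{i=1}^{n} a_i \mathcal{R}^i_{m_i}(\mathcal{F}_i) + \sum_{i=1}^{n} a_i B_\ell\sqrt{\frac{\log(n/\delta)}{2m_i}}.
$$

Notice that while Proposition 3 holds for all $x$, the tightest upper bound is obtained when $x$ minimizes $\max_{p\in\Delta_n} \sum_{i=1}^{n} p_i \hat{F}_i(x)$, which is precisely when $x = x^\star$, the optimal solution to the problem we study in equation 2. This highlights another advantage of our formulation over TTSA: when we use the solution $x^\star_{\text{min-avg}}$ obtained from the single averaged objective $\sum_i f_i$ in equation 1, we have a looser upper bound for $F_{n+1}(x^\star_{\text{min-avg}})$ compared to $F_{n+1}(x^\star)$. Our formulation therefore gives tighter robust (worst-case) generalization guarantees, and this behaviour has been demonstrated empirically in section 4.

Proof. (of Proposition 3) First, by definition, for all $x$, we have that

$$
\begin{aligned}
F_{n+1}(x) &= \mathbb{E}_{(D^{\text{train}}_{n+1,j}, D^{\text{test}}_{n+1,j}) \sim \mathcal{D}_{n+1}}\Bigg[\hat{f}_i\Big(x, \arg\min_{y_i} \sum_{j'=1}^{m_i} \hat{g}_i(x, y_i, D^{\text{train}}_{i,j'}), D^{\text{test}}_{i,j}\Big)\Bigg] && (164)\\
&= \sum_{i=1}^{n} a_i\, \mathbb{E}_{(D^{\text{train}}_{n+1,j}, D^{\text{test}}_{n+1,j}) \sim \mathcal{D}_{i}}\Bigg[\hat{f}_i\Big(x, \arg\min_{y_i} \sum_{j'=1}^{m_i} \hat{g}_i(x, y_i, D^{\text{train}}_{i,j'}), D^{\text{test}}_{i,j}\Big)\Bigg] && (165)\\
&= \sum_{i=1}^{n} a_i F_i(x).
\end{aligned}
$$

Therefore, for all $x$, using a union bound over $1 \le i \le n$, we get that with probability at least $1 - n\delta'$,

$$
F_{n+1}(x) = \sum_{i=1}^{n} a_i F_i(x) \le \sum_{i=1}^{n} a_i \hat{F}_i(x) + 2\sum_{i=1}^{n} a_i \mathcal{R}^i_{m_i}(\mathcal{F}_i) + \sum_{i=1}^{n} a_i B_\ell\sqrt{\frac{\log(1/\delta')}{2m_i}}. \qquad (167)
$$

Letting $\delta = n\delta'$ in (167) and noting that $\sum_{i=1}^{n} a_i \hat{F}_i(x) \le \max_{p\in\Delta_n} \sum_{i=1}^{n} p_i \hat{F}_i(x)$ because $a \in \Delta_n$, we have

$$
F_{n+1}(x) \le \max_{p\in\Delta_n} \sum_{i=1}^{n} p_i \hat{F}_i(x) + 2\sum_{i=1}^{n} a_i \mathcal{R}^i_{m_i}(\mathcal{F}_i) + \sum_{i=1}^{n} a_i B_\ell\sqrt{\frac{\log(n/\delta)}{2m_i}}.
$$

Therefore, plugging in $x^\star$ gives us

$$
F_{n+1}(x^\star) \le \min_{x\in\mathcal{X}} \max_{p\in\Delta_n} \sum_{i=1}^{n} p_i \hat{F}_i(x) + 2\sum_{i=1}^{n} a_i \mathcal{R}^i_{m_i}(\mathcal{F}_i) + \sum_{i=1}^{n} a_i B_\ell\sqrt{\frac{\log(n/\delta)}{2m_i}}.
$$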
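To give a feel for how the confidence term in the worst-case bound of Appendix B behaves, the following is a minimal Python sketch that evaluates $B_\ell\sqrt{\log(n/\delta)/(2m)}$ and the resulting right-hand side for equal $m_i = m$. The function names and all numeric inputs are hypothetical, and the Rademacher complexity is treated as a given number rather than computed.

```python
import math


def confidence_term(bound, num_batches, delta, num_tasks=1):
    """B_l * sqrt(log(n / delta) / (2 m)); num_tasks=1 gives the single-task term."""
    return bound * math.sqrt(math.log(num_tasks / delta) / (2.0 * num_batches))


def worst_case_bound(empirical_losses, rademacher, bound, num_batches, delta):
    """Right-hand side of the worst-case bound when every task has m batches.

    empirical_losses: hypothetical per-task empirical losses hat{F}_i(x).
    rademacher: assumed value of the shared Rademacher complexity R_m(F).
    """
    n = len(empirical_losses)
    return (max(empirical_losses)
            + 2.0 * rademacher
            + confidence_term(bound, num_batches, delta, num_tasks=n))


if __name__ == "__main__":
    # Illustrative numbers only: the union bound over n tasks costs a log(n) factor.
    for n in (1, 10, 100):
        print(n, round(confidence_term(1.0, 50, 0.05, num_tasks=n), 4))
```

The union bound only inflates the confidence term logarithmically in the number of tasks, which is why the worst-case guarantee stays close to the single-task one.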

C IMPLEMENTATION DETAILS AND COMPUTE RESOURCES

We perform our experiments in Python 3.7.10 and PyTorch 1.8.1 on an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz. The code is available at our repository https://github.com/minimario/bilevel. For our empirical evaluation, we first select α, β that give good performance/convergence for the min-avg problem (the baseline). We do a hyperparameter search to choose these parameters, specifically the initial learning rates. Then we fix α, β and only select γ that provides good convergence for the min-max problem (our proposed scheme).

Hypergradient computation. We would like to note that the analysis does not require the actual hypergradient but rather a stochastic estimate with bounded bias. The standard Hessian inverse approximation using the Neumann series (

C.1 SINUSOID REGRESSION TASK

We consider the sinusoid regression experiment (Finn et al., 2017), a multi-task representation learning problem where each task T_i is a regression problem y = t_i(x) = a_i sin(x − ϕ_i). We uniformly sample the amplitude a_i ∈ [0.1, 5] and phase ϕ_i ∈ [0, π] for each task. We use n = 3 training tasks and 3 testing tasks, with two "easy" tasks (a_i ∈ [0.1, 1.05]) and one "hard" task (a_i ∈ [4.95, 5]) in each set. We use easy and hard tasks following the setup in Collins et al. (2020). During training, for each task i, the learner is given samples (x, y), x ∈ [−5, 5]. The goal is to learn a function that approximates t_i as well as possible in the mean squared error sense.

As described in Section 2, we use a neural network divided into two pieces, i.e., an embedding network and a task-specific network. The embedding network f : R → R^10 consists of two hidden ReLU layers of size 80 and a final fully connected layer of size 10. Each task-specific network g_i : R^10 → R, i ∈ [n], is a single linear layer. Therefore, the loss on an input x ∈ R and y = t_i(x) for task i is (g_i(f(x)) − y)^2, and the true loss of the network with parameters f, g_i is ℓ_i(f; g_i) = E_(x,y)[(g_i(f(x)) − y)^2]. The embedding network is as follows: Input (R) → Linear FC Layer (output in R^80) → ReLU → Linear FC Layer (output in R^80) → ReLU → Linear FC Layer (output in R^10).

Loss curves: To approximate the true loss for measurement purposes, we use 100 equally-spaced samples from [−5, 5]. After each iteration, we calculate the maximum loss among all the tasks. In Figure 1c, we show the minimum of these maximum losses up until each epoch.

Results with more tasks: Finally, we show another figure (Figure 3) similar to Figure 1, but with 20 training tasks and 20 test tasks. It can be observed that both the task-robust training loss and the task-robust testing loss greatly outperform their respective standard losses.
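The task construction and the loss-curve measurement described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`make_task`, `sample_task`, `grid`, `approx_true_loss`, `running_min_of_max`) are hypothetical and are not taken from the released code, and the predictor is an arbitrary callable rather than the actual embedding and head networks.

```python
import math
import random


def make_task(amplitude, phase):
    """Target function t_i(x) = a_i * sin(x - phi_i)."""
    return lambda x: amplitude * math.sin(x - phase)


def sample_task(rng, amp_range=(0.1, 5.0), phase_range=(0.0, math.pi)):
    """Draw amplitude and phase uniformly; narrower amp_range gives easy/hard tasks."""
    return make_task(rng.uniform(*amp_range), rng.uniform(*phase_range))


def grid(lo=-5.0, hi=5.0, num=100):
    """Equally spaced evaluation points on [lo, hi]."""
    step = (hi - lo) / (num - 1)
    return [lo + j * step for j in range(num)]


def approx_true_loss(predict, target, points):
    """Mean squared error of `predict` against `target` on the evaluation grid."""
    return sum((predict(x) - target(x)) ** 2 for x in points) / len(points)


def running_min_of_max(per_iteration_task_losses):
    """For each iteration, the smallest worst-task loss seen so far."""
    best, curve = float("inf"), []
    for losses in per_iteration_task_losses:
        best = min(best, max(losses))
        curve.append(best)
    return curve


if __name__ == "__main__":
    rng = random.Random(0)
    # Two easy tasks and one hard task, mirroring the amplitude ranges in the text.
    tasks = [sample_task(rng, amp_range=(0.1, 1.05)) for _ in range(2)]
    tasks.append(sample_task(rng, amp_range=(4.95, 5.0)))
    zero_predictor = lambda x: 0.0  # placeholder model for illustration
    print([round(approx_true_loss(zero_predictor, t, grid()), 3) for t in tasks])
```

With a trivial predictor, the hard task dominates the maximum loss, which is the quantity the min-max formulation targets.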
We consider binary classification tasks generated from the FashionMNIST data set, where we select 8 "easy" tasks (lowest log loss ∼ 0.3 from independent training) and 2 "hard" tasks (lowest loss ∼ 0.45 from independent training). We learn a shared representation network that maps the 784-dimensional inputs (vectorized 28×28 images) to a 100-dimensional space. Each task then learns a binary classifier on top of this representation. The task-specific objective g_i for task i corresponds to the cross-entropy loss on the training set, while the upper-level objective f_i corresponds to the loss of y⋆_i(x) with the learned representation x on a validation set. We also maintain a held-out test set, which we use to evaluate the generalization of the learned representation and per-task models. For our data, we had x ∈ R^(784×100) and y ∈ R^(100×2). We used step sizes α = 0.01, β = 0.01, and γ = 0.3. We used batch sizes of 8 and 128 to compute g_i for each inner step and f_i for each outer iteration, respectively. In addition, we included ℓ2-regularization of y with regularization penalty 0.0005. We used vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with a patience of 10. Each optimization was executed for 10000 outer iterations. The results are aggregated over runs with 10 different seeds.

C.3 HYPERPARAMETER OPTIMIZATION

In this application, we use learning rates α = 0.0001, β = 0.001, γ = 0.001 and 20000 outer iterations. We use a batch size of 8 for both the inner and outer steps for each i ∈ [16] for the initial experiment in Figure 2a. The optimizer was vanilla SGD with a learning rate scheduler (ReduceLROnPlateau), invoked every 100 outer iterations, with a patience of 30. The results are generated by aggregating over 10 runs with different seeds. For the other HPO experiments, the number of tasks n and the batch sizes are discussed in the main text.

D ADDITIONAL TECHNICAL DETAILS

Here we provide further discussion on some technical aspects of the problem we are studying in this paper.

D.1 WEAK-CONVEXITY AND NON-CONVEXITY

We consider a weakly convex UL objective, and here we discuss how it is related to non-convexity. Weak convexity captures a class of non-convex problems. Weakly convex functions are not convex; note the difference in the following definitions (also in Appendix A.1, Assumptions 1 and 2). For any convex function $\tau$, there exists a $\mu \ge 0$ such that, for any $x, x'$ ($x \ne x'$),

$$
\tau(x') \ge \tau(x) + \langle \nabla_x \tau(x), x' - x \rangle + \mu\|x' - x\|^2,
$$

whereas, for a weakly convex function $\kappa$, there exists $\nu > 0$ such that, for any $x, x'$ ($x \ne x'$),

$$
\kappa(x') \ge \kappa(x) + \langle \nabla_x \kappa(x), x' - x \rangle - \nu\|x' - x\|^2.
$$

Note the "$-\nu$" for a weakly convex $\kappa(\cdot)$ instead of the "$+\mu$" for a convex $\tau(\cdot)$ in the third term on the right-hand side of the above two inequalities. So $\kappa(\cdot)$ is not necessarily convex. Moreover, note that the $\|x' - x\|^2$ term on the right-hand side of the inequality for the weakly convex function is strictly positive, implying that, for large enough $\nu$, the inequality will be true for any function. We provide convergence results which depend on the coefficient of weak convexity (for our UL function in question, it is denoted as $\mu_\ell$), with slower rates for larger coefficients.

Then they make it bilevel to $\min_x \max_\alpha f(x, y^\star(x), \alpha)$ subject to $y^\star(x) = \arg\min_y g(x, y, \alpha)$ by splitting the $x$ variable, and then further split it into multi-block form as $\min_x \max_{\alpha_i, i\in[n]} \sum_i f_i(x, y_i^\star(x), \alpha_i)$. Here, each $f_i$ is strongly concave in $\alpha_i$. This is a different problem setup than ours and does not include our problem formulation. Therefore, the crucial difference is this: they study a single-objective problem $\min_x \max_\alpha f(x, \alpha)$, and we consider the robust multi-objective bilevel problem $\min_x \max_i f_i(x, y_i^\star(x))$. Their approach seems similar at first glance because they are solving the single-objective problem in a bilevel, multi-block way, but their problem class does not encompass the multi-objective one we consider.
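As a small numerical illustration of the weak-convexity inequality from Appendix D.1 (not of the comparison with Hu et al.), the sketch below checks the lower bound for the concave function κ(x) = −x², chosen here purely as an example. For this κ the inequality holds with ν = 1 (with equality) and fails for smaller ν, including ν = 0, which would correspond to ordinary convexity. All names are hypothetical.

```python
def kappa(x):
    """Example non-convex (concave) function, chosen only for illustration."""
    return -x * x


def kappa_grad(x):
    """Derivative of kappa."""
    return -2.0 * x


def satisfies_lower_bound(fn, grad, nu, points, tol=1e-12):
    """Check fn(x') >= fn(x) + grad(x) * (x' - x) - nu * (x' - x)^2 on all pairs."""
    return all(
        fn(xp) >= fn(x) + grad(x) * (xp - x) - nu * (xp - x) ** 2 - tol
        for x in points
        for xp in points
        if x != xp
    )


if __name__ == "__main__":
    points = [-2.0 + 0.5 * j for j in range(9)]  # grid on [-2, 2]
    for nu in (0.0, 0.5, 1.0, 2.0):
        print(nu, satisfies_lower_bound(kappa, kappa_grad, nu, points))
```

The check passes exactly from ν = 1 upward, matching the remark that a large enough coefficient makes the inequality hold while larger coefficients correspond to "less convex" functions.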

D.3 IMPROVING THE SAMPLE COMPLEXITY OF MORBiT

There is potential room for improvement in the sample complexity of MORBiT. In the n = 1 case, our algorithm builds on TTSA (Hong et al., 2020), which has O(1/ϵ^2.5) complexity. The only existing work in the n = 1 case with a better sample complexity in a single-loop constrained UL case is the extremely recent STABLE (Chen et al., 2022b), achieving O(1/ϵ^2). STABLE has a much more complex LL update than TTSA, using variance reduction techniques. We are optimistic that more complex algorithms like STABLE can be extended to the robust multi-objective bilevel optimization setting with improved sample complexity.

D.4 WHY ROBUST min max INSTEAD OF PARETO MULTI-OBJECTIVE OPTIMIZATION?

Bilevel optimization problems are ubiquitous in machine learning applications such as representation learning and hyperparameter optimization, which are difficult to formulate as single-level problems. We consider standard stochastic bilevel problems such as these, formulating a natural robust multi-objective version of these problems inspired by the benefits of robust multi-objective learning highlighted in Mehta et al. (2012) and Collins et al. (2020). These papers consider the robust multi-objective view but do not study stochastic bilevel learning problems, which we do. Existing bilevel optimization problems, however, are all single-objective rather than multi-objective.

The advantages of taking single-objective problems and formulating them as robust multi-objective problems have been highlighted in various works; see the literature cited in section 2 (Min-max Robust Optimization in Machine Learning). To summarize, the main advantage is that we can get guarantees on the worst-case performance instead of the usual average-case performance (see, for example, our generalization guarantees in Appendix B). If we just summed the objectives and solved a single-objective problem, we would only be able to establish guarantees for the average-case performance: we might find a solution that is good for most tasks but does extremely poorly on some. Moreover, at the lower level (LL), there are different objectives for the learners, as the individual problem structures and data distributions are different, again forming a natural multi-objective optimization (MOO) problem.

Much like the existing literature on robust multi-objective learning that motivates us, we focus on a single robust solution instead of a set of Pareto optimal solutions since, in various applications, we ultimately need to select a single solution, and the robust (min max) solution provides stronger worst-case guarantees than any Pareto-optimal solution, which is our main motivation.
Pareto frontiers can be very useful and informative, potentially allowing us to understand the tradeoff between the multiple objectives. However, we would like to note that there are various forms of solutions in multi-objective optimization. There are Pareto optimal solutions, but also "possibly optimal" solutions (Wilson et al., 2015), convex coverage sets of solutions (Yang et al., 2019), and min max robust solutions (which we consider). The appropriate form of solution(s) depends on the application, and we focus on min max applications, motivated by existing work such as Mehta et al. (2012) and Collins et al. (2020).

Furthermore, while the Pareto frontier can be more informative and Pareto curves better demonstrate the tradeoff between the objectives, it is important to note that this curve is mostly intuitive, with obvious tradeoffs, for n = 2 objectives. With n > 3 objectives, the Pareto frontier cannot even be visualized, and one has to resort to pairwise comparisons, making it hard to reason about the tradeoffs between objectives even for moderately high n (for example n ∼ O(10)), since we would have to consider $\binom{n}{2}$ such comparisons. Therefore, given a Pareto front of solutions, it is not clear which of the Pareto optimal solutions we should select.

One advantage of the min max formulation (equation 2) is that it seeks a single solution instead of a set of solutions. For a new related problem (such as a new related task in the representation learning application, or the hyperparameter optimization application in Franceschi et al. (2018)), we can therefore use the robust min max solution directly: we select the robust solution for the shared UL variable x (the representation network or the hyperparameter configuration). With a Pareto front, it is not clear which solution to pick for a new task, since we would have a set of solutions without knowing which one would be useful for a new task/objective.
Furthermore, while a solution on the Pareto frontier implies that there is no other solution that "dominates" it, to the best of our knowledge, there is no guarantee that some solution on the obtained Pareto frontier achieves the optimal value for the robust min max objective $\min_x \max_i f_i(x)$ unless the Pareto frontier is completely dense, which is never the case. Multi-objective optimizers can return a set of solutions on the Pareto frontier, but even uniformly covering the Pareto frontier requires the size of the solution set to grow exponentially in the number of objectives n.

2)) and references therein) is measured on the size of the weighted average of the gradients. The weighting vector in both of these two notions is optimized over a simplex. Therefore, the stationarity condition of our proposed min max formulation can be considered as one variant of Pareto stationarity for nonconvex problems.



We use the formulation of Raghu et al. (2020) to separate the representation and the model parameters.

For a $d$-dimensional point $p$, the RFF is $\phi_{x_\sigma}(p) = [\sin(W(x_\sigma \odot p))^\top, \cos(W(x_\sigma \odot p))^\top]^\top \in \mathbb{R}^{2K}$, where $W \in \mathbb{R}^{K\times d}$ is a random normal matrix and $\sin(\cdot)$ and $\cos(\cdot)$ are applied elementwise.

The weighted regression penalty mitigates bias, especially in the high-dimensional learning setting (Candes et al., 2008; Gasso et al., 2009; Šehić et al., 2022), which is common when using RFFs.
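The RFF map above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical, the standard-normal sampling of `W` follows the "random normal matrix" description, and the sample point and scales are made up.

```python
import math
import random


def sample_normal_matrix(rng, num_features, dim):
    """K x d matrix with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def random_fourier_features(p, x_sigma, W):
    """phi(p) = [sin(W (x_sigma * p)), cos(W (x_sigma * p))], a vector in R^{2K}.

    p: input point of dimension d.
    x_sigma: per-dimension scales (elementwise product with p).
    W: K x d random matrix.
    """
    scaled = [s * v for s, v in zip(x_sigma, p)]
    projections = [sum(w * v for w, v in zip(row, scaled)) for row in W]
    return [math.sin(z) for z in projections] + [math.cos(z) for z in projections]


if __name__ == "__main__":
    rng = random.Random(0)
    W = sample_normal_matrix(rng, num_features=4, dim=3)
    print(random_fourier_features([0.5, -1.0, 2.0], [1.0, 0.1, 2.0], W))
```

Setting an entry of `x_sigma` to zero removes the corresponding input dimension from the features, which is one way to read the role of the per-dimension scales.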



are standard (Hong et al., 2020; Lu et al., 2022), the assumption on h (k)

per-objective LL variables) and $h_\lambda^{(k)}$ (for the simplex variable $\lambda$) both only require $O(1)$ stochastic gradient oracle queries from each of the $n$ objective pairs in each iteration, the condition $b_k^2 \le \alpha$ on the non-increasing squared norm of the per-iterate bias of the gradient estimate $h_x^{(k)}$ (for the UL variable) requires $O(\log K)$ stochastic gradient oracle queries for each of the $n$ objective pairs, leveraging the HIA sampling techniques in Ghadimi & Wang (2018) and Hong et al. (2020) using the Neumann series (Agarwal et al., 2017). This gives us the following sample complexity bound for MORBiT (see Appendix D.3 on potential improvements):

Corollary 1. Under the conditions of Theorem 1, MORBiT converges to $\epsilon$-(near)-stationarity with $O(n^{5/4}\epsilon^{-5/2}\log(1/\epsilon))$ queries to the stochastic gradient oracle for each of the $n$ objective pairs.
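As a rough consistency check of Corollary 1 (a sketch only, suppressing constants, assuming that $\epsilon$-stationarity refers to the quantity bounded at rate $O(\sqrt{n}K^{-2/5})$, and treating $\log n$ as dominated by $\log(1/\epsilon)$), the iteration count and per-pair query count can be related as follows:

```latex
\begin{align*}
\sqrt{n}\,K^{-2/5} \le \epsilon
  \;&\Longleftrightarrow\; K^{2/5} \ge \frac{\sqrt{n}}{\epsilon}
  \;\Longleftrightarrow\; K \ge n^{5/4}\,\epsilon^{-5/2}, \\
\text{queries per objective pair}
  \;&=\; K \cdot O(\log K)
  \;=\; O\!\left(n^{5/4}\,\epsilon^{-5/2}\,\log(1/\epsilon)\right).
\end{align*}
```

For $n = 1$ this reduces, up to the logarithmic factor, to the $O(1/\epsilon^{2.5})$ complexity of the single-objective case.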

(a) Quality of min-avg vs min-max. (b) Convergence of ∥∇x∥ 2 . (c) Task-gen. of min-avg vs min-max.

Figure 1: Numerical results for representation learning application.

(a) Quality of min-avg vs min-max. (b) Effect of n on convergence. (c) Effect of batch size.

Figure 2: Numerical results for hyperparameter optimization application.

solving the min-max problem in equation 2 and the ability of the single-loop MORBiT to handle a weakly convex constrained UL problem.

the LL variable y i satisfy the following for some σ g > 0 (Hong et al., 2020; Lu et al., 2022):

Connection to TTSA (Hong et al., 2020). As we generalize Hong et al. (2020), our proof follows a similar structure. In particular, Lemma 6 is a generalization of Hong et al. (2020, Equation (14)), and our Lemma 7 combines Hong et al. (2020, Lemmas 3 and 4). Lemmas 8, 9, and 11 in our work parallel Hong et al. (2020, Lemmas 6, 5, and 7), respectively. Lemma 10 in our work deals with the maximization problem w.r.t. λ, so there is no analogue in Hong et al. (2020). However, it borrows techniques from Collins et al. (2020, Theorem 1).

(for example utilizing the HIA sampling scheme (Agarwal et al., 2017; Ghadimi & Wang, 2018; Hong et al., 2020)). Therefore, as long as we run enough iterations ($O(\log K)$ for HIA) such that $b_k^2 \le \alpha$, (29) will also converge at a rate of $O(\sqrt{n}K^{-2/5})$.

A.3 PROOF PLAN

Overall Roadmap: In what follows, we give a proof sketch of our main theorem, with constant terms abstracted away with $O$ notation. $c_1, c_2, c_3, c_4$ are positive constants depending on $L_f$ (defined in Lemma 5), $\mu_g$ (the LL objective convexity defined in Assumption 2) and the learning rates $\alpha, \beta$ (in algorithm 1).

First, from classical generalization results such as Shalev-Shwartz & Ben-David (2014, Theorem 26.5) or Mohri et al. (2018, Theorem 3.3), we directly conclude the following proposition, which bounds the true loss of the classifier as a function of the empirical loss. Proposition 1. Assume the regularity assumptions considered in Appendix 3, specifically that the function ℓ is B ℓ -bounded. Then, with probability at least 1 -δ, we have

Training: At each iteration, we first perform the inner-loop optimization step (meta-training) by sampling 10 shots from each of the tasks in order to update each of the task-specific network weights. We use just 1 inner-loop step. Pseudocode for the inner loop is shown below in PyTorch-style:

    for task_id in task_list:
        xs, ys = sample_batch(task_id, n_shots)
        embedding = embedding_network(xs)
        for _ in range(n_inner):
            head_optimizers[task_id].zero_grad()
            total_loss = get_loss(task_id, xs, ys)
            total_loss.backward()
            head_optimizers[task_id].step()

For each outer-loop optimization, we run a meta-validation batch, again containing 10 shots from each of the tasks. We then take an outer-loop step, optimizing the embedding weights using the results of the meta-validation batch. The meta-validation batch is sampled in exactly the same way as the meta-training batch shown above.

Regularization: First, as in Ji et al. (2020), we add weight regularization during inner-loop training of the form $\epsilon_w \sum_{w\in W} \|w\|$, where $W$ denotes the set of weight parameters and $\epsilon_w = 0.01$. In PyTorch, this is expressed as

    l2_reg_constant = 0.01
    l2_reg = 0.0  # accumulator for the summed parameter norms
    for p in heads[task].parameters():
        l2_reg += p.norm(2)
    total_loss += l2_reg * l2_reg_constant

Next, for the inner λ updates, we add a regularization term to the overall loss, $-\epsilon_\lambda \sum_{i=1}^{n} (\lambda_i - \frac{1}{n})^2$, where $\epsilon_\lambda = 3$, which pulls the λs closer to uniform. In PyTorch, the λ update is expressed as

    # ascent direction for lambda: the task loss plus the pull toward uniform weights
    task_gradient = task_losses[task]
    reg_gradient = -mu_lambda * (lambdas[task] - 1 / n)
    lambdas[task] += (task_gradient + reg_gradient) * gamma

Parameters: For the Task-Robust version of the algorithm, we use α = 0.007, β = 0.005, γ = 0.003. For the standard version of the algorithm, we use α = 0.007, β = 0.011, γ = 0.003.
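For readers who want to experiment with the regularized λ ascent step in isolation, here is a self-contained plain-Python sketch. It is not the released implementation: the function names are hypothetical, the default values simply reuse γ = 0.003 and the regularization strength 3 quoted in the text, and the sort-based Euclidean projection onto the simplex is an added assumption used here to keep λ a valid weight vector.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:  # the condition holds for a prefix; keep the last valid shift
            theta = t
    return [max(x - theta, 0.0) for x in v]


def lambda_step(lambdas, task_losses, gamma=0.003, mu_lambda=3.0):
    """One ascent step on lambda with a pull toward uniform weights.

    Each coordinate moves along (task loss - mu_lambda * (lambda_i - 1/n)),
    and the result is projected back onto the simplex (an assumption of this sketch).
    """
    n = len(lambdas)
    stepped = [
        lam + gamma * (loss - mu_lambda * (lam - 1.0 / n))
        for lam, loss in zip(lambdas, task_losses)
    ]
    return project_to_simplex(stepped)


if __name__ == "__main__":
    weights = [1.0 / 3] * 3
    # Hypothetical per-task losses: the third task is the hardest.
    for _ in range(5):
        weights = lambda_step(weights, [1.0, 1.0, 4.0])
    print([round(w, 4) for w in weights])
```

Repeated steps shift weight toward the task with the largest loss, while the regularizer slows the drift away from uniform weights.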

Figure 3: Comparison of standard (min-avg) training and robust (min-max) training using 20 tasks

D.2 COMPARISON WITH HU ET AL. (2022)

Hu et al. (2022) may appear similar to our work at a glance, but we would like to clarify that the differences are nontrivial as we are solving a different problem. We address this briefly in section 2 (Closely related and Concurrent Work), but we will elaborate further here to make the distinction clearer. At a high level, the problem in Hu et al. (2022) is not multi-objective: the authors explicitly call it multi-block. They are still solving the single-objective min-max problem min_x max_α f(x, α). Hence the problem setup in Hu et al. (2022) cannot solve standard bilevel learning applications such as representation learning and HPO; they choose AUC maximization as their motivating example instead. Now, we explain what may be a source of confusion: why it seems like they are solving a multi-objective problem. Hu et al. (2022) start with the min-max problem min_x max_α f(x, α) with strong concavity in α, such as in AUC maximization.

Mehta et al. (2012) and Collins et al. (2020), since a min max solution can be shown to have good generalization guarantees (as we have also shown in Appendix B).

objective functions, the Pareto frontier refers to the Pareto stationarity rather than Pareto optimality. Our considered first-order stationarity condition is defined on the weighted average of the objective value, while the classical Pareto stationarity (please see Fernando et al. (2023, equation (

The applications in Hu et al. (2022) are restricted to problems such as multi-task AUC maximization instead of the common bilevel applications of representation learning and HPO. (ii) Also, Hu et al. (2022) do not consider a constrained UL problem. The problem in equation 2 is not a generalization of their problem: both our work and theirs consider different setups with high-level commonalities. For more details, see Appendix D.2. Table 1 shows how our setup compares to existing literature. To the best of our knowledge, the precise problem in equation 2 has not been studied in ML literature.

In addition to convergence, we also show the generalization abilities of the bilevel optimizer. The theorem in this section is inspired by Collins et al. (2020, Theorem 4), but our results hold for the fully general min-max multi-objective BLO setup, while Collins et al. (2020) study a min-max multi-objective single-level problem. Assume that for a learning task i, we observe m i batches of train/test data, D t i,j and D v i,j , j ∈ [m i ]. Assume that each train and test batch has K and J input-output pairs, respectively, so that the sets D t i,j , D v i,j are drawn from a common distribution D i . =1 g i (x, y i , D t i,j ′ ) be the value of y i that minimizes the empirical inner loss on some dataset D t i . For outer and inner objectives f i , g i , we consider the following function class F i , where x ∈ X is the set of optimization parameters introduced in equation 2 and (D t i , D v i ) are any train and test datasets sampled from D i :

Agarwal et al., 2017; Ghadimi & Wang, 2018; Hong et al., 2020) is one way of computing this estimate (as we have discussed in section 3.2 preceding Corollary 1). Since we are considering a single-loop algorithm, even a straightforward iterative differentiation (Ji et al., 2021) can provide a sufficiently useful estimate of the hypergradient. We utilize this for our empirical evaluations.

ACKNOWLEDGEMENTS

A.G. is supported by the National Science Foundation (NSF) Graduate Research Fellowship under Grant No. 2141064, and T.-W. Weng is supported by NSF under Grant No. 2107189. We would like to thank the MIT-IBM Watson AI Lab (https://mitibmwatsonailab.mit.edu/) and the MIT-UROP program (https://urop.mit.edu/) for their support. We would also like to thank the organizers of the "Beyond First-order Methods in ML Systems" workshop at ICML'21 and the "Bilevel Stochastic Methods for Optimization and Learning" session at INFORMS'22 for giving us the opportunity to present various iterations of our work (Gu et al., 2021; 2022). Finally, we would like to thank Soumyadip Ghosh and Mark Squillante for some insightful discussions.


Published as a conference paper at ICLR 2023

Next, recall that our step sizes were where ν = min(g ) , 1 µg ). Note that the choice of ν was motivated by the conditions of Lemma 5. First, observe that 1 -completing the set of conditions in Lemma 5. By direct algebraic manipulation, we have Similarly, we also have

