BETTY: AN AUTOMATIC DIFFERENTIATION LIBRARY FOR MULTILEVEL OPTIMIZATION

Abstract

Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and memory/compute intensive. We take an initial step towards closing this gap by introducing BETTY, a software library for large-scale MLO. At its core, we devise a novel dataflow graph for MLO, which allows us to (1) develop efficient automatic differentiation for MLO that reduces the computational complexity from O(d³) to O(d²), (2) incorporate systems support such as mixed-precision and data-parallel training for scalability, and (3) facilitate implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. We empirically demonstrate that BETTY can be used to implement an array of MLO programs, while also observing up to an 11% increase in test accuracy, a 14% decrease in GPU memory usage, and a 20% decrease in training wall time over existing implementations on multiple benchmarks. We also showcase that BETTY enables scaling MLO to models with hundreds of millions of parameters.

1. INTRODUCTION

Multilevel optimization (MLO) addresses nested optimization scenarios, where upper-level optimization problems are constrained by lower-level optimization problems following an underlying hierarchical dependency. MLO has gained considerable attention as a unified mathematical framework for studying diverse problems including meta-learning (Finn et al., 2017; Rajeswaran et al., 2019), hyperparameter optimization (Franceschi et al., 2017), neural architecture search (Liu et al., 2019), and reinforcement learning (Konda & Tsitsiklis, 1999; Rajeswaran et al., 2020). While the majority of existing work is built upon bilevel optimization, the simplest case of MLO, there have been recent efforts that go beyond this two-level hierarchy. For example, Raghu et al. (2021) proposed trilevel optimization that combines hyperparameter optimization with two-level pretraining and finetuning. More generally, conducting joint optimization over machine learning pipelines consisting of multiple models and hyperparameter sets can be approached as deeper instances of MLO (Garg et al., 2022; Raghu et al., 2021; Somayajula et al., 2022; Such et al., 2020). Following its increasing popularity, a multitude of optimization algorithms have been proposed to solve MLO. Among them, gradient-based (or first-order) approaches (Pearlmutter & Siskind, 2008; Lorraine et al., 2020; Raghu et al., 2021; Sato et al., 2021) have recently received particular attention from the machine learning community, due to their ability to carry out efficient high-dimensional optimization, under which all of the above-listed applications fall. Nevertheless, research in gradient-based MLO has been largely impeded by two major bottlenecks. First, implementing gradients in multilevel optimization, which is achieved by composing best-response Jacobians via the chain rule, requires both programming and mathematical proficiency.
Second, algorithms for best-response Jacobian calculation, such as iterative differentiation (ITD) or approximate implicit differentiation (AID) (Grazzi et al., 2020), are memory and compute intensive, as they require multiple forward/backward computations and oftentimes second-order gradient (i.e. Hessian) information. In recent years, there has been some work originating in the meta-learning community on developing software libraries that target some aspects of gradient-based MLO (Blondel et al., 2021; Deleu et al., 2019; Grefenstette et al., 2019). For example, JAXopt (Blondel et al., 2021) provides efficient and modular implementations of AID algorithms by letting the user define a function capturing the optimality conditions of the problem to be differentiated. However, JAXopt fails to combine the chain rule with AID to support general MLO programs beyond a two-level hierarchy. Similarly, higher (Grefenstette et al., 2019) provides several basic primitives (e.g. making PyTorch's (Paszke et al., 2019) native optimizers differentiable) for implementing ITD/AID algorithms, but users still need to manually implement the complicated internal mechanisms of these algorithms, as well as the chain rule, to implement a given instance of MLO. Furthermore, most existing libraries do not have systems support, such as mixed-precision and data-parallel training, that could mitigate memory and computation bottlenecks. As a result, gradient-based MLO research built upon these libraries has been largely limited to simple bilevel optimization and small-scale setups.

In this paper, we attempt to bridge this gap between research and software systems by introducing BETTY, an easy-to-use and modular automatic differentiation library with various systems support for large-scale MLO. The main contributions of this paper are as follows:

1. We develop an efficient automatic differentiation technique for MLO based on a novel interpretation of MLO as a special type of dataflow graph (Section 3). In detail, gradient calculation for each optimization problem is automatically carried out by iteratively multiplying best-response Jacobians (defined in Section 2) through the chain rule while reverse-traversing specific paths of this dataflow graph. This reverse-traversing procedure is crucial for efficiency, as it reduces the computational complexity of our automatic differentiation technique from O(d³) to O(d²), where d is the dimension of the largest optimization problem in the MLO program.

2. We introduce BETTY, a software library for MLO built upon the above automatic differentiation technique. Our software design (Section 4), motivated by the dataflow graph interpretation, provides two major benefits: (1) it allows for incorporating various systems support, such as mixed-precision and data-parallel training, for large-scale MLO, and (2) it facilitates implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. The overall software architecture of BETTY is presented in Figure 1.

3. We empirically demonstrate that BETTY can be used to implement an array of MLO applications with varying scales and complexities (Section 5). Interestingly, we observe that trying out different best-response Jacobian algorithms with our modular interface (which only requires changing one line of code) can lead to up to an 11% increase in test accuracy, a 14% decrease in GPU memory usage, and a 20% decrease in training wall time on various benchmarks, compared with the original papers' implementations. Finally, we showcase the scalability of BETTY to models with hundreds of millions of parameters by performing MLO on the BERT-base model with the help of BETTY's systems support, which would otherwise be infeasible.

2. BACKGROUND: GRADIENT-BASED MULTILEVEL OPTIMIZATION

To introduce MLO, we first define an important concept known as a "constrained problem" (Vicente & Calamai, 1994).

Definition 1. An optimization problem P is said to be constrained by λ when its cost function C has λ as an argument in addition to the optimization parameter θ (i.e. P : argmin_θ C(θ, λ, ⋯)).

Multilevel optimization (Migdalas et al., 1998) refers to a field of study that aims to solve a nested set of optimization problems defined on a sequence of so-called levels, which satisfy two main criteria: A1) upper-level problems are constrained by the optimal parameters of lower-level problems, while A2) lower-level problems are constrained by the nonoptimal parameters of upper-level problems. Formally, an n-level MLO program can be written as:

    P_n:        θ_n* = argmin_{θ_n} C_n(θ_n, U_n, L_n; D_n)        ▷ Level n problem
                ⋮
    P_k:  s.t.  θ_k* = argmin_{θ_k} C_k(θ_k, U_k, L_k; D_k)        ▷ Level k ∈ {2, …, n−1} problem
                ⋮
    P_1:  s.t.  θ_1* = argmin_{θ_1} C_1(θ_1, U_1, L_1; D_1)        ▷ Level 1 problem

where P_k stands for the level-k problem, θ_k / θ_k* for the corresponding nonoptimal / optimal parameters, and U_k / L_k for the sets of constraining parameters from upper / lower level problems. Here, D_k is the training dataset, and C_k indicates the cost function. Due to criteria A1 & A2, constraining parameters from upper-level problems should be nonoptimal (i.e. U_k ⊆ {θ_{k+1}, ⋯, θ_n}) while constraining parameters from lower-level problems should be optimal (i.e. L_k ⊆ {θ_1*, ⋯, θ_{k−1}*}). Although we denote only one optimization problem per level in the above formulation, each level could in fact have multiple problems. Therefore, we henceforth discard the concept of level, and rather assume that problems {P_1, P_2, ⋯, P_n} of a general MLO program are topologically sorted in a "reverse" order (i.e. P_n / P_1 denote the uppermost / lowermost problems).
For example, in hyperparameter optimization formulated as bilevel optimization, hyperparameters and network parameters (weights) correspond to upper- and lower-level parameters (θ_2 and θ_1). Train / validation losses correspond to C_1 / C_2, and the validation loss depends on the optimal network parameters θ_1* obtained given θ_2. Thus, the constraining sets for each level are U_1 = {θ_2} and L_2 = {θ_1*}.

In this paper, we focus in particular on gradient-based MLO, rather than zeroth-order methods like Bayesian optimization (Cui & Bai, 2019), in order to efficiently scale to high-dimensional problems. Essentially, gradient-based MLO calculates gradients of the cost function C_k(θ_k, U_k, L_k) with respect to the corresponding parameter θ_k, with which gradient descent is performed to solve for the optimal parameters θ_k* of every problem P_k. Since optimal parameters from lower-level problems (i.e. θ_l* ∈ L_k) can be functions of θ_k (criterion A2), dC_k/dθ_k can be expanded using the chain rule as follows:

    dC_k/dθ_k = ∂C_k/∂θ_k + Σ_{θ_l* ∈ L_k} (dθ_l*/dθ_k) × (∂C_k/∂θ_l*)        (1)

where ∂C_k/∂θ_k and ∂C_k/∂θ_l* are direct gradients and dθ_l*/dθ_k is the best-response Jacobian. While calculating direct gradients is straightforward with existing automatic differentiation engines like PyTorch (Paszke et al., 2019), a major difficulty in gradient-based MLO lies in the calculation of the best-response Jacobian, which will be discussed in depth in Section 3. Once gradient calculation for each level k is enabled via Equation (1), gradient-based optimization is executed from lower- to upper-level problems in a topologically reverse order, reflecting the underlying hierarchies.
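To make Equation (1) concrete, the following is a scalar toy bilevel program (an illustrative sketch, not BETTY code) in which the best-response Jacobian has a closed form, checked against a finite-difference derivative of the upper-level cost:

```python
# Toy bilevel program illustrating Equation (1) on scalars.
# Lower level:  w*(lam) = argmin_w 0.5*(w - lam)**2   =>   w*(lam) = lam
# Upper level:  C2(lam) = 0.5*(w*(lam) - t)**2
# Chain rule:   dC2/dlam = (dw*/dlam) * (dC2/dw*) = 1 * (w*(lam) - t)

def w_star(lam):
    return lam  # closed-form solution of the lower-level problem

def upper_cost(lam, t=3.0):
    return 0.5 * (w_star(lam) - t) ** 2

def hypergradient(lam, t=3.0):
    # best-response Jacobian dw*/dlam = 1, direct gradient dC2/dw* = w* - t
    return 1.0 * (w_star(lam) - t)

# Finite-difference check of the analytic hypergradient
lam, eps = 1.0, 1e-5
fd = (upper_cost(lam + eps) - upper_cost(lam - eps)) / (2 * eps)
print(hypergradient(lam), fd)  # both ≈ -2.0
```

Here the direct gradient ∂C_2/∂θ_2 is zero because the hyperparameter enters the upper cost only through w*, so the entire hypergradient flows through the best-response Jacobian.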

3. AUTOMATIC DIFFERENTIATION FOR MULTILEVEL OPTIMIZATION

While Equation (1) serves as a mathematical basis for gradient-based multilevel optimization, how to automatically and efficiently carry out such gradient calculation has not been extensively studied and incorporated into a software system that can support MLO programs involving many problems with complex dependencies. In this section, we discuss the challenges in building an automatic differentiation library for MLO, and provide solutions to address these challenges.

3.1. DATAFLOW GRAPH FOR MULTILEVEL OPTIMIZATION

One may observe that the best-response Jacobian term in Equation (1) is expressed with a total derivative instead of a partial derivative. This is because θ_k can affect θ_l* not only through a direct interaction, but also through multiple indirect interactions via other lower-level optimal parameters. For example, consider the four-problem MLO program illustrated in Figure 2. Here, the parameter of Problem 4 (θ_{p4}) affects the optimal parameter of Problem 3 (θ_{p3}*) in two different ways: 1) θ_{p4} → θ_{p3}* and 2) θ_{p4} → θ_{p1}* → θ_{p3}*. In general, we can expand the best-response Jacobian dθ_l*/dθ_k in Equation (1) by applying the chain rule over all paths from θ_k to θ_l*:

    dC_k/dθ_k = ∂C_k/∂θ_k + Σ_{θ_l* ∈ L_k} Σ_{q ∈ Q_{k,l}} (∂θ_{q(1)}*/∂θ_k) × ( Π_{i=1}^{len(q)−1} ∂θ_{q(i+1)}*/∂θ_{q(i)}* ) × ∂C_k/∂θ_l*        (2)

where Q_{k,l} is the set of paths from θ_k to θ_l*, and q(i) refers to the index of the i-th problem in path q, with the last point being θ_l*. The first factor (∂θ_{q(1)}*/∂θ_k) corresponds to an upper-to-lower dependency, and each factor in the product to a lower-to-upper dependency. Replacing the total derivative term in Equation (1) with a product of partial derivative terms via the chain rule allows us to ignore indirect interactions between problems, and to deal only with direct interactions.

To formalize the path-finding problem, we develop a novel dataflow graph for MLO. Unlike traditional dataflow graphs with no predefined hierarchy among nodes, a dataflow graph for multilevel optimization has two different types of directed edges stemming from criteria A1 & A2: lower-to-upper and upper-to-lower. These directed edges are respectively depicted with green and red arrows in Figure 2. Essentially, a lower-to-upper edge represents the directed dependency between two optimal parameters (i.e. θ_{P_i}* → θ_{P_j}* with i < j), while an upper-to-lower edge represents the directed dependency between nonoptimal and optimal parameters (i.e. θ_{P_i} → θ_{P_j}* with i > j).
Since we need to find paths from the nonoptimal parameter θ_k to the optimal parameter θ_l*, the first directed edge must be an upper-to-lower edge (red), which connects θ_k to some lower-level optimal parameter. Once the path reaches an optimal parameter, it can only move between optimal parameters via lower-to-upper edges (green) in the dataflow graph. Therefore, every valid path from θ_k to θ_l* starts with an upper-to-lower edge and then reaches the destination only via lower-to-upper edges. Each edge traversed in the dataflow graph corresponds to one best-response Jacobian factor in Equation (2). We implement the above path-finding mechanism in BETTY with a modified depth-first search algorithm.
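The path-finding rule above can be sketched with a short depth-first search; the edge dictionaries mirror the four-problem Figure 2 example, and the problem indices are illustrative rather than BETTY's internal representation:

```python
# Valid paths start with exactly one upper-to-lower (red) edge, then follow
# only lower-to-upper (green) edges until reaching the target problem.

def find_paths(src, dst, u2l, l2u):
    """All paths from nonoptimal theta_src to optimal theta*_dst."""
    paths = []

    def dfs(node, path):
        if node == dst:
            paths.append(path)
            return
        for nxt in l2u.get(node, []):    # after the first hop: green edges only
            if nxt not in path:          # avoid revisiting a problem
                dfs(nxt, path + [nxt])

    for first in u2l.get(src, []):       # the first hop must be a red edge
        dfs(first, [src, first])
    return paths

u2l = {4: [3, 1]}          # red edges: nonoptimal -> optimal (upper to lower)
l2u = {1: [3], 3: [4]}     # green edges: optimal -> optimal (lower to upper)
print(find_paths(4, 3, u2l, l2u))  # [[4, 3], [4, 1, 3]]
```

The two returned paths correspond exactly to the direct route θ_{p4} → θ_{p3}* and the indirect route θ_{p4} → θ_{p1}* → θ_{p3}* discussed above.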

3.2. GRADIENT CALCULATION WITH BEST-RESPONSE JACOBIANS

Automatic differentiation for MLO can be realized by calculating Equation (2) for each problem P_k (k = 1, ⋯, n). However, naively calculating Equation (2) could be computationally onerous, as it involves multiple matrix multiplications with best-response Jacobians, whose computational complexity is O(d³), where d is the dimension of the largest optimization problem in the MLO program. To alleviate this issue, we observe that the rightmost term in Equation (2) is a vector, which allows us to reduce the computational complexity of Equation (2) to O(d²) by iteratively performing matrix-vector multiplication from right to left (or, equivalently, reverse-traversing a path q in the dataflow graph). As such, matrix-vector multiplication between a best-response Jacobian and a vector serves as the base operation of efficient automatic differentiation for MLO. Mathematically, this problem can be written as follows:

    Calculate  (∂w*(λ)/∂λ) × v,  given  w*(λ) = argmin_w C(w, λ).        (3)

The two major challenges in the above problem are: 1) approximating the solution of the optimization problem (i.e. w*(λ)), and 2) differentiating through the (approximated) solution. In practice, an approximation of w*(λ) is typically obtained by unrolling a small number of gradient steps, which can significantly reduce the computational cost (Franceschi et al., 2017). While we could potentially obtain a better approximation of w*(λ) by running gradient steps until convergence, this procedure alone can take days (or even weeks) when the underlying optimization problem is large-scale (Deng et al., 2009; Devlin et al., 2018). Once w*(λ) is approximated, the matrix-vector product between the best-response Jacobian dw*(λ)/dλ and a vector v is popularly obtained by either iterative differentiation (ITD) or approximate implicit differentiation (AID) (Grazzi et al., 2020).
This problem has been extensively studied in bilevel optimization literature (Finn et al., 2017; Franceschi et al., 2017; Lorraine et al., 2020) , and we direct interested readers to the original papers, as studying these algorithms is not the focus of this paper. In BETTY, we provide implementations of several popular ITD/AID algorithms which users can easily plug-and-play for their MLO applications. Currently available algorithms within BETTY include ITD with reverse-mode automatic differentiation (ITD-RMAD) (Finn et al., 2017) , AID with Neumann series (AID-NMN) (Lorraine et al., 2020) , AID with conjugate gradient (AID-CG) (Rajeswaran et al., 2019) , and AID with finite difference (AID-FD) (Liu et al., 2019) .
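The computational core of AID methods such as AID-NMN is an inverse-Hessian-vector product, which the Neumann series H⁻¹ = α Σ_{k≥0} (I − αH)^k approximates without ever materializing H⁻¹ (valid when the spectral radius of I − αH is below 1). A minimal NumPy sketch on a fixed 2×2 Hessian with illustrative values, not BETTY's implementation:

```python
import numpy as np

# H^{-1} v via truncated Neumann series: H^{-1} v ≈ alpha * sum_k (I - alpha*H)^k v.
# Only Hessian-vector products (hvp) are needed, never the full inverse.
def neumann_inverse_hvp(hvp, v, alpha=0.1, k=100):
    p = v.copy()       # running term (I - alpha*H)^j v
    acc = v.copy()     # partial sum of the series
    for _ in range(k):
        p = p - alpha * hvp(p)
        acc = acc + p
    return alpha * acc

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # toy positive-definite Hessian
v = np.array([1.0, -1.0])
approx = neumann_inverse_hvp(lambda x: H @ x, v, alpha=0.2, k=200)
exact = np.linalg.solve(H, v)
print(np.allclose(approx, exact, atol=1e-4))  # True
```

In a real MLO setting the `hvp` callback would be supplied by reverse-mode autodiff (e.g. a double-backward pass), so the O(d²) Hessian is never formed explicitly.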

3.3. EXECUTION OF MULTILEVEL OPTIMIZATION

In MLO, the optimization of each problem should be performed in a topologically reverse order, as upper-level optimization is constrained by the result of lower-level optimization. To ease MLO implementation, we also automate this execution order with the dataflow graph developed in Section 3.1. Specifically, let us assume that there is a lower-to-upper edge between problems P_i and P_j (i.e. θ_i* → θ_j*). When the optimization process (i.e. a small number of gradient steps) of problem P_i is complete, it calls problem P_j through the lower-to-upper edge to initiate P_j's one-step gradient descent update. Problem P_j waits until all lower-level problems in L_j have sent their calls, and performs its one-step gradient descent update only once all of these calls have been received. Hence, to achieve the full execution of gradient-based MLO, we only need to call the one-step gradient descent processes of the lowermost problems, as the optimization processes of upper problems will be automatically triggered by lower problems via lower-to-upper edges. To summarize, automatic differentiation for MLO is accomplished by performing gradient updates of multiple optimization problems in a topologically reverse order based on the lower-to-upper edges (Sec. 3.3), where gradients for each problem are calculated by iteratively multiplying best-response Jacobians obtained with ITD/AID (Sec. 3.2) while reverse-traversing the dataflow graph (Sec. 3.1).
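The call-based execution order can be sketched as follows; the problem names and counting mechanism are illustrative stand-ins for BETTY's internals:

```python
# Each problem steps only after every lower-level problem in L_j has reported
# completion; the driver only ever calls the lowermost problems directly.

class Node:
    def __init__(self, name, n_lowers=0):
        self.name = name
        self.n_lowers = n_lowers
        self.uppers = []
        self.received = 0

    def step(self, order):
        order.append(self.name)              # gradient update would happen here
        for up in self.uppers:
            up.notify(order)

    def notify(self, order):
        self.received += 1
        if self.received == self.n_lowers:   # wait for all lower-level calls
            self.received = 0
            self.step(order)

# Three-problem program: p1 and p2 are lowermost, p3 depends on both.
p3 = Node("p3", n_lowers=2)
p1, p2 = Node("p1"), Node("p2")
p1.uppers = [p3]
p2.uppers = [p3]

order = []
p1.step(order)   # the driver only calls the lowermost problems
p2.step(order)
print(order)     # ['p1', 'p2', 'p3']
```

Note that p3 steps exactly once, and only after both of its lower-level problems have finished, reproducing the topologically reverse order described above.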

4. SOFTWARE DESIGN

On top of the automatic differentiation technique developed in Section 3, we build an easy-to-use and modular software library, BETTY, with various systems support for large-scale gradient-based MLO. In detail, we break down MLO into two high-level concepts, namely 1) optimization problems and 2) hierarchical dependencies among problems, and design abstract Python classes for both of them. This abstraction is also motivated by our dataflow graph interpretation, as these two concepts respectively correspond to nodes and edges. The architecture of BETTY is shown in Figure 1.

Problem. Each optimization problem P_k in MLO is defined by the parameter (or module) θ_k, the sets of upper and lower constraining problems U_k & L_k, the dataset D_k, the cost function C_k, the optimizer, and other optimization configurations (e.g. best-response Jacobian calculation algorithm, number of unrolling steps). The Problem class is an interface through which users can provide each of these components to define the optimization problem. In detail, every component except the cost function C_k and the constraining problems U_k & L_k can be provided through the class constructor; the cost function is defined through a "training step" method, and the constraining problems are automatically provided by Engine. Abstracting an optimization problem by encapsulating the module, optimizer, and data loader together additionally allows us to implement various systems support, including mixed-precision, data-parallel training, and gradient accumulation, within the abstract Problem class. A similar strategy has also been adopted in popular frameworks for large-scale deep learning such as DeepSpeed (Rajbhandari et al., 2020).
Since implementations of such systems support as well as best-response Jacobian algorithms are abstracted away, users can easily plug-and-play different algorithmic and systems design choices, such as unrolling steps or mixed-precision training, via Config in a modular fashion. An example usage of Problem is shown in Listing 1, and a full list of supported features in Config is provided in Appendix F.

    class MyProblem(Problem):
        def training_step(self, batch):
            # Users define the cost function here
            return cost_fn(batch, self.module, self.other_probs, ...)

    config = Config(type="darts", unroll_steps=10, fp16=True, gradient_accumulation=4)
    prob = MyProblem("myproblem", config, module, optimizer, data_loader)

Listing 1: Problem class example.

Engine. While Problem manages each optimization problem, Engine handles the hierarchical dependencies among problems in the dataflow graph. As discussed in Section 3.1, a dataflow graph for MLO has upper-to-lower and lower-to-upper directed edges. We allow users to define two separate graphs, one for each edge type, using Python dictionaries in which keys/values respectively represent the start/end nodes of each edge. When the user-defined dependency graphs are provided, Engine compiles them and finds all paths required for automatic differentiation with a modified depth-first search algorithm. Moreover, Engine sets the constraining problem sets for each problem based on the dependency graphs, as mentioned above. Once all initialization processes are done, users can run a full MLO program by calling Engine's run method, which repeatedly calls the one-step gradient descent procedure of the lowermost problems. An example usage of Engine is provided in Listing 2.
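The compilation step that derives each problem's constraining sets U_k / L_k from the two user-supplied dependency dictionaries can be sketched as below; the function and variable names are illustrative, not BETTY's verified internals:

```python
# Invert the edge dictionaries: an edge src -> dst means src constrains dst.

def compile_constraints(u2l, l2u):
    uppers = {}   # U_k: problems whose nonoptimal params constrain problem k
    lowers = {}   # L_k: problems whose optimal params constrain problem k
    for src, dsts in u2l.items():
        for dst in dsts:
            uppers.setdefault(dst, set()).add(src)
    for src, dsts in l2u.items():
        for dst in dsts:
            lowers.setdefault(dst, set()).add(src)
    return uppers, lowers

u2l = {"reweight": ["classify"]}   # red edges: upper to lower
l2u = {"classify": ["reweight"]}   # green edges: lower to upper
U, L = compile_constraints(u2l, l2u)
print(U)  # {'classify': {'reweight'}}
print(L)  # {'reweight': {'classify'}}
```

With these sets in hand, Engine can both inject the constraining problems into each Problem instance and run the path-finding DFS over the same dictionaries.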

5. EXPERIMENTS

To showcase the general applicability of BETTY, we implement three MLO benchmarks with varying complexities and scales: data reweighting for class imbalance (Sec. 5.1), correcting and reweighting corrupted labels (Sec. 5.2), and domain adaptation for a pretraining/finetuning framework (Sec. 5.3). Furthermore, we analyze the effect of different best-response Jacobian algorithms and system features by reporting GPU memory usage and training wall time. Last but not least, in the Appendix, we include an additional MLO benchmark experiment on differentiable neural architecture search (Appendix A), code examples (Appendix B), training details such as hyperparameters (Appendix C), and analyses of various algorithmic and systems design choices (Appendices D and E).

5.1. DATA REWEIGHTING FOR CLASS IMBALANCE

Many real-world datasets suffer from class imbalance due to underlying long-tailed data distributions. Meta-Weight-Net (MWN) (Shu et al., 2019) proposes to alleviate the class imbalance issue with a data reweighting scheme, where it learns to assign higher/lower weights to data from rarer/more common classes. In detail, MWN formulates data reweighting as the following bilevel optimization:

    θ* = argmin_θ L_val(w*(θ))                                                          ▷ Reweighting
    s.t. w*(θ) = argmin_w (1/N) Σ_{i=1}^{N} R(L_train^i; θ) · L_train^i(f(x_i; w), y_i)  ▷ Classification

where w denotes the network parameters, L_train^i is the training loss for the i-th training sample, and θ denotes the parameters of the MWN R, which reweights each training sample given its training loss L_train^i. Following the original paper, we artificially inject class imbalance into the CIFAR-10 dataset by geometrically decreasing the number of data samples for each class, as per an imbalance factor. While the official implementation, which is built upon Torchmeta (Deleu et al., 2019), only adopts ITD-RMAD for best-response Jacobian calculation, we re-implement MWN with multiple best-response Jacobian algorithms, which only require one-line changes using BETTY, to study their effect on test accuracy, memory efficiency, and training wall time. The experiment results are given in the accompanying tables. To further test scalability, we additionally scale this MLO program to the BERT-base model. As shown, default full-precision training fails due to the CUDA out-of-memory error, while mixed-precision training, which only requires a one-line change in Config, avoids this issue while also providing consistent improvements in test accuracy over the BERT baseline. This demonstrates that our system features are indeed effective in scaling MLO to large models. We include more analyses of our systems support in Appendix E.
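The lower-level objective above can be illustrated with a tiny NumPy sketch; the sigmoid weighting function below is a stand-in for MWN's small MLP, and all values are illustrative:

```python
import numpy as np

# Per-sample training losses are fed to a weighting function R(.; theta),
# and the classification objective is the weighted average of the losses.

def R(losses, theta):
    # stand-in for MWN's MLP: a monotone map from loss to weight in (0, 1)
    return 1.0 / (1.0 + np.exp(-(theta[0] * losses + theta[1])))

per_sample_loss = np.array([0.2, 1.5, 3.0])  # rare classes tend to have higher loss
theta = np.array([1.0, -1.0])                # learned by the upper-level problem
weights = R(per_sample_loss, theta)
weighted_loss = np.mean(weights * per_sample_loss)
print(weights.round(3), round(weighted_loss, 3))
```

With these (illustrative) parameters, higher-loss samples receive larger weights; in the full bilevel program, θ itself is optimized so that the weighting minimizes the validation loss.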

5.2. CORRECTING & REWEIGHTING CORRUPTED LABELS

Another common pathology in real-world data science is the issue of label corruption, stemming from noisy data preparation processes (e.g. Amazon MTurk). One prominent example of this is in weak supervision (Ratner et al., 2016), where users create labels for large training sets by leveraging multiple weak/noisy labeling sources such as heuristics and knowledge bases. Due to the nature of weak supervision, the generated labels are generally noisy, and consequently lead to a significant performance degradation. In this example, we aim to mitigate this issue by 1) correcting and 2) reweighting potentially corrupted labels. More concretely, this problem can be formulated as an extended bilevel optimization problem, as, unlike the MWN example, we have two optimization problems (correcting and reweighting) in the upper level, as opposed to one. The mathematical formulation of this MLO program is as follows:

    θ* = argmin_θ L_val(w*(θ, α)),   α* = argmin_α L′_val(w*(θ, α))                                   ▷ RWT & CRT
    s.t. w*(θ, α) = argmin_w (1/N) Σ_{i=1}^{N} R(L_train^i; θ) · L_train^i(f(x_i; w), g(x_i, y_i; α))  ▷ Classification

where α is the parameter of the label correction network g, and L′_val is augmented with the classification loss of the correction network in addition to that of the main classification network f on the clean validation set. We test our framework on the WRENCH benchmark (Zhang et al., 2021a), which contains multiple weak supervision datasets. In detail, we use a 2-layer MLP as our classifier, AID-FD as our best-response Jacobian algorithm, and Snorkel Data Programming (Ratner et al., 2016) to generate the weak labels. We observe that simultaneously applying label correction and reweighting significantly improves the test accuracy over the baseline and the reweighting-only scheme in almost all tasks.
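The distinctive piece of the lower-level objective here is that each sample's loss is computed against a corrected soft label g(x_i, y_i; α) and then scaled by a weight R(L_i; θ). A minimal sketch with stand-in correction and weighting functions (the blend-toward-uniform correction is illustrative, not the paper's network g):

```python
import numpy as np

# Corrected soft label plus per-sample weight, as in the Section 5.2 objective.

def correct_label(noisy_onehot, alpha):
    # stand-in for g(x, y; alpha): blend the noisy label toward uniform,
    # where alpha in [0, 1] plays the role of a learned trust factor
    k = len(noisy_onehot)
    return alpha * noisy_onehot + (1 - alpha) * np.full(k, 1.0 / k)

def cross_entropy(probs, target):
    return -np.sum(target * np.log(probs + 1e-12))

probs = np.array([0.7, 0.2, 0.1])   # classifier output f(x; w)
noisy = np.array([0.0, 1.0, 0.0])   # possibly corrupted one-hot label
target = correct_label(noisy, alpha=0.6)
loss = cross_entropy(probs, target)
weight = 0.8                        # stand-in value for R(loss; theta)
print(round(weight * loss, 3))
```

In the full MLO program, α and θ are trained by the two upper-level problems against clean validation data, so the correction and weighting adapt jointly rather than being fixed as above.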
Thanks to BETTY, adding label correction in the upper-level on top of the existing reweighting scheme only requires defining one more Problem class, and accordingly updating the problem dependency in Engine (code examples can be found in Appendix B).

5.3. DOMAIN ADAPTATION FOR PRETRAINING & FINETUNING

Pretraining/finetuning paradigms are increasingly adopted with recent advances in self-supervised learning (Devlin et al., 2018; He et al., 2020). However, the data for pretraining are oftentimes from a different distribution than the data for finetuning, which could potentially cause negative transfer. Thus, domain adaptation emerges as a natural solution to mitigate this issue. As a domain adaptation strategy, Raghu et al. (2021) propose to combine data reweighting with a pretraining/finetuning framework to automatically decrease/increase the weight of pretraining samples that cause negative/positive transfer. In contrast with the above two benchmarks, this problem can be formulated as trilevel optimization as follows:

    θ* = argmin_θ L_FT(v*(w*(θ)))                                        ▷ Reweighting
    s.t. v*(w*(θ)) = argmin_v L_FT(v) + λ‖v − w*(θ)‖₂²                    ▷ Finetuning
         w*(θ) = argmin_w (1/N) Σ_{i=1}^{N} R(x_i; θ) · L_PT^i(w)         ▷ Pretraining

where x_i / L_PT^i stand for the i-th pretraining sample/loss, R for the network that reweights the importance of each pretraining sample x_i, and λ for the proximal regularization parameter. Additionally, w, v, and θ are respectively the parameters of the pretraining, finetuning, and reweighting networks. We conduct an experiment on the OfficeHome dataset (Venkateswara et al., 2017), which consists of 15,500 images from 65 classes and 4 domains: Art (Ar), Clipart (Cl), Product (Pr), and Real World (RW). Specifically, we randomly choose 2 domains and use one of them as the pretraining task and the other as the finetuning task. ResNet-18 (He et al., 2016) is used for all pretraining/finetuning/reweighting networks, and AID-FD with an unrolling step of 1 is used as our best-response Jacobian algorithm. Following (Bai et al., 2021), the finetuning and the reweighting stages share the same training dataset. We adopt a normal pretraining/finetuning framework without the reweighting stage as our baseline, and the result is presented in Table 4.
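The middle-level proximal objective L_FT(v) + λ‖v − w*‖² can be illustrated with a toy gradient-descent sketch; the quadratic L_FT below is a stand-in for the real finetuning loss, and all values are illustrative:

```python
import numpy as np

# One gradient step on the proximal finetuning objective from the trilevel
# formulation: the second term pulls v toward the pretrained solution w*.

def finetune_step(v, w_star, lam, lr, grad_ft):
    grad = grad_ft(v) + 2.0 * lam * (v - w_star)  # proximal pull toward w*
    return v - lr * grad

w_star = np.array([1.0, -1.0])                  # pretrained (reweighted) solution
grad_ft = lambda v: v - np.array([2.0, 0.0])    # grad of 0.5*||v - [2, 0]||^2

v = np.zeros(2)
for _ in range(500):
    v = finetune_step(v, w_star, lam=1.0, lr=0.1, grad_ft=grad_ft)
print(v.round(3))  # settles between the finetuning optimum [2, 0] and w*
```

The fixed point lies strictly between the finetuning optimum and w*(θ), which is exactly the role of λ: it trades off adapting to the finetuning task against staying close to the (reweighted) pretrained weights.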
Our trilevel optimization framework achieves consistent improvements over the baseline for every task combination, at the cost of additional memory usage and wall time, which demonstrates the empirical usefulness of multilevel optimization beyond a two-level hierarchy. Finally, we provide a (simplified) example of the code for this experiment in Appendix B to showcase the usability of our library for a general MLO program. We note that Baseline is a two-level, and Baseline + Reweight a three-level, MLO program.

6. RELATED WORK

Bilevel & Multilevel Optimization. There are a myriad of machine learning applications that are built upon bilevel optimization (BLO), the simplest case of multilevel optimization with a two-level hierarchy. For example, neural architecture search (Liu et al., 2019; Zhang et al., 2021b), hyperparameter optimization (Franceschi et al., 2017; Lorraine et al., 2020; Maclaurin et al., 2015), reinforcement learning (Hong et al., 2020; Konda & Tsitsiklis, 1999), data valuation (Ren et al., 2020; Wang et al., 2020), meta-learning (Finn et al., 2017; Rajeswaran et al., 2019), and label correction (Zheng et al., 2019) are formulated as BLO. In addition to applying BLO to machine learning tasks, a variety of optimization techniques (Couellan & Wang, 2016; Grazzi et al., 2020; Ji et al., 2021; Liu et al., 2021) have been developed for solving BLO. Following the popularity of BLO, MLO with more than a two-level hierarchy has also attracted increasing attention recently (Raghu et al., 2021; Somayajula et al., 2022; Such et al., 2020; Xie & Du, 2022). In general, these works construct complex multi-stage ML pipelines, and optimize the pipelines in an end-to-end fashion with MLO. For instance, Garg et al. (2022) construct the pipeline of (data generation)-(architecture search)-(classification) and He et al. (2021) that of (data reweighting)-(finetuning)-(pretraining), both of which are solved with MLO. Furthermore, Sato et al. (2021) study gradient-based methods for solving MLO with theoretical guarantees.

Multilevel Optimization Software. There are several software libraries that are frequently used for implementing MLO programs. Most notably, JAXopt (Blondel et al., 2021) proposes an efficient and modular approach for AID by leveraging JAX's native autodiff of the optimality conditions.
Despite its easy-to-use programming interface for AID, it fails to support combining the chain rule with AID as in Equation (2), because it overrides the default behavior of JAX's automatic differentiation, which takes care of the chain rule. Therefore, it cannot be used to implement MLO beyond a two-level hierarchy without major changes to the source code and the software design. Alternatively, higher (Grefenstette et al., 2019) provides two major primitives, making 1) stateful PyTorch modules stateless and 2) PyTorch optimizers differentiable, to ease the implementation of AID/ITD. However, users still need to manually implement the complicated internal mechanisms of these algorithms, as well as the chain rule, with the provided primitives. Torchmeta (Deleu et al., 2019) also provides similar functionalities as higher, but it requires users to use its own stateless modules implemented in the library rather than patching general modules as in higher. Thus, it lacks support for users' custom modules, limiting its applicability. learn2learn (Arnold et al., 2020) focuses on supporting meta-learning; however, since meta-learning is strictly a bilevel problem, extending it beyond a two-level hierarchy is not straightforward. Finally, most existing libraries do not have systems support, such as data-parallel training, that could mitigate memory/compute bottlenecks.

7. CONCLUSION

In this paper, we aimed to help establish both mathematical and systems foundations for automatic differentiation in MLO. To this end, we devised a novel dataflow graph for MLO, upon which an automatic differentiation procedure is built, and additionally introduced BETTY, a software library with various systems support that allows for easy, modular programming of a wide range of MLO applications. We showed that BETTY scales both to larger models with many parameters and to MLO programs with multiple dependent problems. As future work, we plan to extend BETTY with additional algorithmic and systems features, such as best-response Jacobian algorithms for non-differentiable processes, and advanced memory optimization techniques like model-parallel training and CPU offloading.

ETHICS STATEMENT

Multilevel optimization has the power to be a double-edged sword with both positive and negative societal impacts. For example, both 1) defense or attack in an adversarial game and 2) decreasing or increasing bias in machine learning models can be formulated as MLO programs, depending on the goal of the uppermost optimization problem, which is defined by users. Thus, research into preventing malicious use of MLO is of high importance.

REPRODUCIBILITY STATEMENT

As one of the main contributions of this work is a new software library for scalable multilevel optimization, all of the source code for the library and examples will be released open source with an Apache-2.0 License, including a full implementation of all MLO programs and experiments described in this paper. In addition, for reviewing purposes, we include our source code and easily runnable scripts for all experiments in the supplemental material of this submission.

A ADDITIONAL MULTILEVEL OPTIMIZATION BENCHMARKS

A.1 DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH

A neural network's architecture plays a significant role in deep learning research. However, the search space of neural architectures is so large that manual search is practically impossible. To overcome this issue, DARTS (Liu et al., 2019) proposes an efficient gradient-based neural architecture search method based on the bilevel optimization formulation:

α* = argmin_α L_val(w*(α), α)    ▷ Architecture Search
s.t. w*(α) = argmin_w L_train(w; α)    ▷ Classification

where α is the architecture weight and w is the network weight. The original paper uses implicit differentiation with a finite difference method as its best-response Jacobian algorithm to solve the above MLO program. We follow the training configurations from the original paper's CIFAR-10 experiment, with a few minor changes. While the original paper performs the finite difference method on the initial network weights, we perform it on the unrolled network weights. This is because we view their best-response Jacobian calculation from the implicit differentiation perspective, where the second-order derivative is calculated at the unrolled weights. This allows us to unroll the lower-level optimization for more than one step, as opposed to the strict one-step unrolled gradient descent of the original paper. A similar idea was also proposed in iDARTS (Zhang et al., 2021b). Specifically, we re-implement DARTS with implicit differentiation and finite difference using 1 and 3 unrolling steps. The results are provided in Table 5.

Table 5: DARTS re-implementation results. AID-FD refers to implicit differentiation with a finite difference method, and * indicates the difference in the implementation of AID-FD explained above.

Our re-implementation with different unrolling steps achieves performance similar to the original paper. We also notice that our re-implementation achieves slightly lower GPU memory usage and wall time.
This is because the original implementation calculates gradients for the architecture weights (upper-level parameters) while running lower-level optimization, while ours only calculates gradients of the parameters for the corresponding optimization stage.
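The AID-FD scheme discussed above can be illustrated on a toy problem. The sketch below is a minimal NumPy illustration, not BETTY's implementation: the lower-level objective L_train(w; α) = 0.5(w − α)² and upper-level objective L_val(w) = 0.5(w − 1)² are assumptions chosen for demonstration. It unrolls the lower problem for one step and approximates the mixed second-order term with a DARTS-style central finite difference around the unrolled weight.

```python
# Toy bilevel problem (illustration only, not BETTY's API):
#   upper: min_a L_val(w'(a)) = 0.5 * (w'(a) - 1)^2
#   lower: w'(a) = one SGD step on L_train(w; a) = 0.5 * (w - a)^2
def grad_w_train(w, a):   # dL_train/dw
    return w - a

def grad_a_train(w, a):   # dL_train/da
    return -(w - a)

def grad_w_val(w):        # dL_val/dw
    return w - 1.0

def darts_fd_hypergradient(w, a, lr=0.1, eps=1e-3):
    """DARTS-style hypergradient: unroll one lower-level step, then
    approximate the mixed second derivative with a central finite
    difference around the unrolled weight."""
    w_unrolled = w - lr * grad_w_train(w, a)   # one-step lower update
    v = grad_w_val(w_unrolled)                 # upper gradient w.r.t. w'
    # (d^2 L_train / da dw) @ v  ~  (g(w + eps*v) - g(w - eps*v)) / (2*eps)
    fd = (grad_a_train(w_unrolled + eps * v, a)
          - grad_a_train(w_unrolled - eps * v, a)) / (2 * eps)
    # the direct term dL_val/da is zero for this toy problem
    return -lr * fd

w, a = 0.3, 0.7
approx = darts_fd_hypergradient(w, a)
# Analytic one-step-unrolled hypergradient: dw'/da = lr, so the exact
# value is lr * (w' - 1) for this toy problem.
w_unrolled = w - 0.1 * (w - a)
exact = 0.1 * (w_unrolled - 1.0)
print(abs(approx - exact))
```

Because the toy gradients are linear in w, the central finite difference recovers the analytic one-step-unrolled hypergradient up to floating-point error.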

A.2 CORRECTING & REWEIGHTING CORRUPTED LABELS (EXTENDED)

To further demonstrate the general applicability of BETTY to different datasets and scales, we performed the experiments from Section 5.2 in two additional settings.

Clothing-1M + ResNet-50  Clothing-1M (Xiao et al., 2015) is a real-world noisy dataset that consists of 1 million fashion images collected from various online shopping websites, with an approximate noise ratio of 38.5%. Following the standard protocol, we use ResNet-50 as our backbone model and correct and reweight noisy labels with extended bilevel optimization. In this experiment, we empirically show that the MLO application implemented with BETTY works well with a large-scale dataset.

Wrench + BERT-base  In recent years, finetuning a pretrained large language model has become the standard for text classification. As the Wrench benchmark mostly consists of text classification datasets, we further applied our "correcting and reweighting corrupted labels" framework to the BERT-base model. In this experiment, we empirically show that the MLO application implemented with BETTY works well with a large model.

B CODE EXAMPLE

Here, we provide simplified code for our experiments from Section 5. Note that every experiment shares a similar code structure when implemented with BETTY.

Pretraining Network  We use ResNet18 (He et al., 2016) pretrained on the ImageNet dataset (Deng et al., 2009) for our pretraining network. Following the popular transfer learning strategy, we split the network into two parts, namely the feature (convolutional) part and the classifier (fully-connected) part, and train each part with a different learning rate. Specifically, the learning rates for the feature and classifier parts are set to 0.001 and 0.0001, respectively, with the Adam optimizer. They share the same weight decay value of 0.0005 and momentum values of (0.9, 0.999). Furthermore, we encourage the network weights to stay close to the pretrained weights by introducing an additional proximal regularization term with a regularization value of 0.001. Training is performed for 1,000 iterations, and the learning rate is decayed by a factor of 10 at iterations 400 and 800.

Finetuning Network  The same architecture and optimization configurations as the pretraining network are used for the finetuning network. The proximal regularization parameter, which encourages the finetuning network parameters to stay close to the pretraining network parameters, is set to 0.007.
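The proximal regularization mentioned above, a penalty keeping network weights near a frozen reference, can be sketched as follows. This is an illustrative NumPy snippet with a hypothetical `proximal_penalty` helper (the name is ours, not the training code's), using the regularization value 0.001 quoted in the text.

```python
import numpy as np

def proximal_penalty(params, ref_params, lam=0.001):
    """Penalty lam * sum_p ||p - p_ref||^2 that keeps `params` close to
    the (frozen) pretrained reference weights."""
    return lam * sum(np.sum((p - p0) ** 2)
                     for p, p0 in zip(params, ref_params))

# Toy parameter lists standing in for network weights.
params = [np.array([1.0, 2.0]), np.array([[0.5]])]
ref = [np.array([1.0, 1.0]), np.array([[0.0]])]
total = proximal_penalty(params, ref)
print(total)  # 0.001 * (1.0 + 0.25) = 0.00125
```

In the actual training loop, this scalar would simply be added to the task loss before backpropagation.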

Reweighting Network

The same architecture and optimization configurations as the pretraining network are used for the reweighting network, except that no proximal regularization is applied to the reweighting network.

C.4 DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH

Dataset  Following the original paper (Liu et al., 2019), we use the first half of the CIFAR-10 training dataset as our inner-level training dataset (i.e., for the classification network) and the other half as the outer-level training dataset (i.e., for the architecture network). Training accuracy reported in the main text is measured on the CIFAR-10 validation dataset.

Architecture Network  We adopt the same architecture search space as in the original paper (Liu et al., 2019), with 8 operations and 7 nodes per convolutional cell. The architecture parameters are initialized to zero to ensure equal softmax values, and trained with the Adam optimizer (Kingma & Ba, 2014) with the learning rate fixed to 0.0003, momentum values of (0.5, 0.999), and a weight decay value of 0.001 throughout training. Training is performed for 50 epochs.

Classification Network  Given the above architecture parameters, we set our classification network to have 8 cells and an initial channel count of 16. The network is trained with the SGD optimizer with an initial learning rate of 0.025, momentum of 0.9, and a weight decay value of 0.0003. Training is performed for 50 epochs, and the learning rate is decayed following the cosine annealing schedule without restart to the minimum learning rate of 0.001 by the end of training.
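For concreteness, the cosine annealing schedule without restart used for the classification network follows lr(t) = lr_min + 0.5 · (lr_max − lr_min) · (1 + cos(π · t/T)). A minimal sketch with the values quoted above (0.025 decayed to 0.001 over 50 epochs); the function name is ours:

```python
import math

def cosine_lr(step, total_steps, lr_max=0.025, lr_min=0.001):
    """Cosine annealing without restart: decays lr_max -> lr_min over
    total_steps, following half a cosine period."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0, 50))    # 0.025 at the start
print(cosine_lr(50, 50))   # 0.001 at the end
print(cosine_lr(25, 50))   # midpoint: (0.025 + 0.001) / 2 = 0.013
```

PyTorch's built-in CosineAnnealingLR implements the same formula.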



We abuse the term "Jacobian" for a total derivative here, although it originally denotes a matrix of partial derivatives.



Figure 1: In Engine (left), users define their MLO program as a hierarchy/graph of optimization problems. In Problem (middle), users define an optimization problem with a data loader, cost function, module, and optimizer, while upper/lower level constraint problems (i.e. U k , L k ) are injected by Engine. The "step" function in Problem serves as the base of gradient-based optimization, abstracting the one-step gradient descent update process. Finally, users can easily try out different best-response Jacobian algorithms & system features (right) via Config in a modular manner.

Figure 2: An example dataflow graph for MLO.

class MyProblem(Problem):
    def training_step(self, batch):
        ...

prob1 = MyProblem1(...)
prob2 = MyProblem2(...)
dependency = {"u2l": {prob1: [prob2]}, "l2u": {prob1: [prob2]}}
engine = Engine(problems=[prob1, prob2], dependencies=dependency)
engine.run()

Listing 2: Engine class example.

    def training_step(self, batch):
        inputs, labels = batch
        outputs = self.module(inputs)
        loss = F.cross_entropy(outputs, labels, reduction="none")
        loss_reshape = torch.reshape(loss, (-1, 1))
        # Reweighting
        weight = self.reweight(loss_reshape.detach())
        return torch.mean(weight * loss_reshape)

upper_config = Config(type="darts", retain_graph=True)
lower_config = Config(type="default", unroll_steps=5)
reweight = Reweight(name="reweight", ...)

# Level 2
class Correct(ImplicitProblem):
    def training_step(self, batch):
        inputs, labels = batch
        outputs, embeds = self.classifier(inputs, return_embeds=True)
        correct_outputs = self.module(embeds, test=True)
        ce_loss = F.cross_entropy(outputs, labels)
        aux_loss = F.cross_entropy(correct_outputs, labels)
        return ce_loss + aux_loss

# Level 2
class Reweight(ImplicitProblem):
    def training_step(self, batch):
        inputs, labels = batch
        outputs = self.classifier(inputs)
        return F.cross_entropy(outputs, labels)

# Level 1
class Classifier(ImplicitProblem):
    def training_step(self, batch):
        inputs, labels = batch
        outputs, embeds = self.module(inputs, return_embeds=True)
        # Correcting
        new_labels = self.correct(embeds, labels)
        log_softmax = F.log_softmax(outputs, dim=-1)
        loss = torch.sum(-log_softmax * new_labels, dim=-1)
        loss_reshape = torch.reshape(loss, (-1, 1))
        # Reweighting
        weight = self.reweight(loss_reshape.detach())
        return torch.mean(weight * loss_reshape)

upper_config = Config(type="darts", retain_graph=True)
lower_config = Config(type="default", unroll_steps=5)
correct = Correct(name="correct", ...)
probs = [correct, reweight, classifier]
u2l = {correct: [classifier], reweight: [classifier]}
l2u = {classifier: [correct, reweight]}
depends = {"l2u": l2u, "u2l": u2l}
engine = Engine(problems=probs, dependencies=depends)
engine.run()

Listing 4: Simplified code of "Correcting & Reweighting Corrupted Labels"

Published as a conference paper at ICLR 2023

B.3 DOMAIN ADAPTATION FOR PRETRAINING & FINETUNING

# Get module, optimizer, lr_scheduler, data loader for each problem
pt_module, pt_optimizer, pt_scheduler, pt_loader = setup_pretrain()
ft_module, ft_optimizer, ft_scheduler, ft_loader = setup_finetune()
rw_module, rw_optimizer, rw_scheduler, rw_loader = setup_reweight()

# Level 1
class Pretrain(ImplicitProblem):
    def training_step(self, batch):
        inputs, targets = batch
        outs = self.module(inputs)
        loss_raw = F.cross_entropy(outs, targets, reduction="none")
        logit = self.reweight(inputs)
        weight = torch.sigmoid(logit)
        return torch.mean(loss_raw * weight)

# Level 2
class Finetune(ImplicitProblem):
    def training_step(self, batch):
        inputs, targets = batch
        outs = self.module(inputs)
        ce_loss = F.cross_entropy(outs, targets, reduction="none")
        loss = torch.mean(ce_loss)
        # Proximal regularization
        for (n1, p1), p2 in zip(self.module.named_parameters(),
                                self.pretrain.module.parameters()):
            lam = 0 if "fc" in n1 else args.lam
            loss += lam * (p1 - p2).pow(2).sum()
        return loss

# Level 3
class Reweight(ImplicitProblem):
    def training_step(self, batch):
        inputs, targets = batch
        outs = self.finetune(inputs)
        return F.cross_entropy(outs, targets)

# Define optimization configurations
reweight_config = Config(type="darts", step=1, retain_graph=True)
finetune_config = Config(type="default", step=1)
pretrain_config = Config(type="default", step=1)

pretrain = Pretrain("pretrain", pretrain_config, pt_module, pt_optimizer,
                    pt_scheduler, pt_loader)
finetune = Finetune("finetune", finetune_config, ft_module, ft_optimizer,
                    ft_scheduler, ft_loader)
reweight = Reweight("reweight", reweight_config, rw_module, rw_optimizer,
                    rw_scheduler, rw_loader)

probs = [reweight, finetune, pretrain]
u2l = {reweight: [pretrain]}
l2u = {pretrain: [finetune], finetune: [reweight]}
depends = {"u2l": u2l, "l2u": l2u}
engine = Engine(problems=probs, dependencies=depends)
engine.run()

Listing 5: Simplified code of "Domain Adaptation for Pretraining & Finetuning"

B.4 DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH

train_loader, valid_loader = setup_dataloader()
arch_module, arch_optimizer = setup_architecture()
cls_module, cls_optimizer, cls_scheduler = setup_classifier()

class Classifier(ImplicitProblem):
    def training_step(self, batch):
        x, target = batch
        alphas = self.architecture()
        return self.module.loss(x, alphas, target)

arch_config = Config(type="darts", ...)
probs = [architecture, classifier]
u2l = {architecture: [classifier]}
l2u = {classifier: [architecture]}
depends = {"l2u": l2u, "u2l": u2l}
engine = Engine(problems=probs, dependencies=depends)
engine.run()

Listing 6: Simplified code of "Differentiable Neural Architecture Search"

Figure 4: Convergence analysis of different best-response Jacobian algorithms on the data reweighting task .

Figure 5: Dataflow graphs for all our experiments



MWN experiment results. IF denotes an imbalance factor. AID-CG/NMN/FD respectively stand for implicit differentiation with conjugate gradient/Neumann series/finite difference.

We observe that different best-response Jacobian algorithms lead to vastly different test accuracy, memory efficiency, and training wall time. Interestingly, we notice that AID-FD with unrolling steps of both 1 and 5 consistently achieves better test accuracy (close to SoTA (Tang et al., 2020)) and memory efficiency than other methods. This demonstrates that, while BETTY is developed to support large and general MLO programs, it is still useful for simpler bilevel optimization tasks as well. An additional analysis of the effect of the best-response Jacobian can be found in Appendix D. Furthermore, to demonstrate the scalability of BETTY to large-scale MLO, we applied MWN to sentence classification with the BERT-base model (Devlin et al., 2018) with 110M parameters. Similarly, we artificially inject class imbalance into the SST dataset, and use AID-FD as our best-response Jacobian calculation algorithm. The experiment results are provided in Table 2.

Table 2: MWN+BERT experiment results. fp32 and fp16 respectively stand for full-precision and mixed-precision training.

as our weak supervision algorithm for generating training labels. The experiment results are provided in Table 3.

Table 3: Wrench results. RWT stands for reweighting and CRT for correction.

Domain Adaptation for Pretraining & Finetuning results. Reported numbers are classification accuracy on the target domain (right of arrow), after pretraining on the source domain (left of arrow).



Clothing-1M + ResNet-50 results.

Wrench + BERT-base results.

ACKNOWLEDGEMENTS

We thank all the reviewers for invaluable comments and feedback. EX acknowledges the support of NSF IIS1563887, NSF CCF1629559, NSF IIS1617583, NGA HM04762010002, NIGMS R01GM140467, NSF IIS1955532, NSF CNS2008248, NSF IIS2123952, and NSF BCS2040381. WN was supported in part by NSF (1651565), AFOSR (FA95501910024), ARO (W911NF-21-1-0125), CZ Biohub, Sloan Fellowship, and U.S. Department of Energy Office of Science under Contract No. DE-AC02-76SF00515.

AVAILABILITY

//github.com/leopard

C EXPERIMENT DETAILS

In this section, we provide further training details (e.g. hyperparameters) of each experiment.

C.1 DATA REWEIGHTING FOR CLASS IMBALANCE

Dataset  We reuse the long-tailed CIFAR-10 dataset from the original paper (Shu et al., 2019) as our inner-level training dataset. More specifically, the imbalance factor is defined as the ratio between the number of training samples of the most common class and that of the rarest class. The numbers of training samples of the other classes are defined by geometrically interpolating between the sample counts of the most common and the rarest classes. We randomly select 100 samples from the validation set to construct the upper-level (or meta) training dataset, and use the rest as the validation dataset, on which classification accuracy is reported in the main text.

Meta-Weight-Network  We adopt an MLP with one hidden layer of 100 neurons (i.e., 1-100-1) as our Meta-Weight-Network (MWN). It is trained for 10,000 iterations with the Adam optimizer (Kingma & Ba, 2014), with the learning rate fixed to 0.00001 throughout training, momentum values of (0.9, 0.999), and a weight decay value of 0.

Classification Network  Following the original MWN work (Shu et al., 2019), we use ResNet32 (He et al., 2016) as our classification network. It is trained with the SGD optimizer with an initial learning rate of 0.1, a momentum value of 0.9, and a weight decay value of 0.0005. Training is performed for 10,000 iterations, and we decay the learning rate by a factor of 10 at iterations 5,000 and 7,500.
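The geometric interpolation of per-class sample counts described above can be sketched as follows. This assumes the standard long-tailed construction n_k = n_max · IF^(−k/(K−1)) and CIFAR-10's 5,000 training images per class; the helper name is ours.

```python
def long_tailed_counts(n_max, imbalance_factor, num_classes):
    """Per-class training-set sizes, geometrically interpolated so that
    class 0 keeps n_max samples and the rarest class keeps
    n_max / imbalance_factor samples."""
    return [int(n_max * imbalance_factor ** (-k / (num_classes - 1)))
            for k in range(num_classes)]

# CIFAR-10 with imbalance factor 100: 5000 samples for the most common
# class down to 50 for the rarest.
counts = long_tailed_counts(n_max=5000, imbalance_factor=100, num_classes=10)
print(counts[0], counts[-1])  # 5000 50
```

Each intermediate class count shrinks by the same multiplicative ratio, which is what "geometric interpolation" means here.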

C.2 CORRECTING & REWEIGHTING CORRUPTED LABELS

Dataset  We directly use the TREC, AGNews, IMDB, SemEval, ChemProt, and YouTube text classification datasets from the Wrench benchmark (Zhang et al., 2021a). More specifically, we use the training split of each dataset for training the classification network, and the validation split for training the correcting and reweighting networks. Test accuracy is measured on the test split.

Correct Network  Our correct network takes the penultimate activation from the classification network and outputs soft labels through a linear layer and a softmax layer. These new soft labels are interpolated with the original labels via the reweighting scheme, which is implemented with a 2-layer MLP. Like our reweighting network, the correct network is trained with the Adam optimizer with a learning rate of 0.00001, momentum values of (0.9, 0.999), and a weight decay value of 0.

Reweighting Network  For our reweighting network, we reuse the Meta-Weight-Net from the "Data Reweighting for Class Imbalance" experiment, following all of its training details.

Classification Network  As our classification network, we adopt a 2-layer MLP with a hidden size of 100. The classification network is trained for 30,000 iterations with the SGD optimizer with a learning rate of 0.003, momentum of 0.9, and weight decay of 0.0001. The learning rate is decayed to 0 with the cosine annealing schedule during training.
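The correct network's label interpolation described above can be sketched as follows. This is an illustrative NumPy version, not the library's exact code: the function names and the scalar mixing weight `alpha` are assumptions for demonstration (in the actual system, the interpolation weight comes from the learned reweighting scheme).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def correct_labels(embeds, W, b, orig_onehot, alpha):
    """Interpolate predicted soft labels with the original (possibly
    noisy) labels: new = alpha * softmax(embeds @ W + b) + (1 - alpha) * orig."""
    soft = softmax(embeds @ W + b)  # linear + softmax head on activations
    return alpha * soft + (1.0 - alpha) * orig_onehot

rng = np.random.default_rng(0)
embeds = rng.normal(size=(4, 8))   # penultimate activations (batch of 4)
W, b = rng.normal(size=(8, 3)), np.zeros(3)
orig = np.eye(3)[[0, 1, 2, 0]]     # original one-hot labels
new_labels = correct_labels(embeds, W, b, orig, alpha=0.3)
print(new_labels.sum(axis=1))      # each row still sums to 1
```

Since both the softmax output and the one-hot labels are valid distributions, their convex combination remains a valid soft label.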

C.3 DOMAIN ADAPTATION FOR PRETRAINING & FINETUNING

Dataset We split each domain of the OfficeHome dataset (Venkateswara et al., 2017) into training/validation/test datasets with a ratio of 5:3:2. The pretraining network is trained on the training set of the source domain. Finetuning and reweighting networks are both trained on the training set of the target domain following the strategy proposed in (Bai et al., 2021) . The final performance is measured by the classification accuracy of the finetuning network on the test dataset of the target domain.

D DESIGN CHOICE ANALYSIS

In this section, we visually compare the convergence speed of different best-response Jacobian algorithms with the loss convergence graphs on the synthetic hyperparameter optimization task and the data reweighting task (Section 5.1). Specifically, we analyze the convergence speed in terms of both 1) the number of steps and 2) training time, as the per-step computational cost differs for each algorithm.

D.1 SYNTHETIC HYPERPARAMETER OPTIMIZATION

Following (Grazzi et al., 2020), we constructed a synthetic hyperparameter optimization task where we optimize the weight decay value for every parameter in simple binary logistic regression. Mathematically, this problem can be formulated as the following bilevel optimization:

λ* = argmin_λ L(w*(λ); x_u, y_u)
s.t. w*(λ) = argmin_w L(w; x_l, y_l) + (1/2) Σ_{i=1}^{d} λ_i w_i²

where L denotes the logistic loss, and (x_l, y_l) and (x_u, y_u) are respectively the training datasets for the lower- and upper-level problems, with x ∈ R^{n×d} and y ∈ R^{n×1}. Here, n is the number of training data in each dataset and d is the dimension of the feature vector. w ∈ R^{d×1} is the logistic regression parameter, and λ ∈ R^{d×1} is the hyperparameter (i.e., the per-parameter weight decay value).

Given the above setup, we compared four different best-response Jacobian algorithms: 1) ITD-RMAD, 2) AID-FD, 3) AID-CG, and 4) AID-Neumann. For a fair comparison, we fixed the unrolling step to 100 for all algorithms. The experiment result is presented below:
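The hypergradient computation for this kind of task can be illustrated compactly. The sketch below is an assumption-laden NumPy simplification rather than the benchmark's exact code: it swaps the logistic loss for a squared loss so the lower-level problem has a closed-form solution, and computes the hypergradient with AID via the implicit function theorem, checking it against finite differences of the full bilevel objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
x_l, y_l = rng.normal(size=(n, d)), rng.normal(size=n)  # lower-level data
x_u, y_u = rng.normal(size=(n, d)), rng.normal(size=n)  # upper-level data
lam = np.full(d, 0.5)                # per-parameter weight decay

def inner_solution(lam):
    # w*(lam) = argmin_w 0.5 * ||x_l w - y_l||^2 + 0.5 * sum_i lam_i w_i^2
    return np.linalg.solve(x_l.T @ x_l + np.diag(lam), x_l.T @ y_l)

def val_loss(w):
    return 0.5 * np.sum((x_u @ w - y_u) ** 2)

def aid_hypergradient(lam):
    w = inner_solution(lam)
    H = x_l.T @ x_l + np.diag(lam)   # lower-level Hessian
    v = x_u.T @ (x_u @ w - y_u)      # grad of val loss w.r.t. w
    q = np.linalg.solve(H, v)        # implicit-function-theorem solve
    # cross derivative d^2 L_train / (dlam_i dw_j) = delta_ij * w_j
    return -w * q

# Sanity check against central finite differences of the full objective.
g = aid_hypergradient(lam)
eps = 1e-5
fd = np.array([(val_loss(inner_solution(lam + eps * np.eye(d)[i]))
                - val_loss(inner_solution(lam - eps * np.eye(d)[i]))) / (2 * eps)
               for i in range(d)])
print(np.max(np.abs(g - fd)))  # agrees up to finite-difference error
```

In the actual benchmark the inner solve is replaced by unrolled gradient steps and the linear solve by conjugate gradient, Neumann series, or finite differences, which is exactly the design axis the comparison above varies.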

D.2 DATA REWEIGHTING

To study how different best-response Jacobian algorithms perform on more complex tasks, we repeated the above experiment on the data reweighting task from Section 5.1. Again, for a fair comparison, we used the same unrolling step of 1 for all algorithms. The experiment result is provided in Figure 4. The above two experiments follow the no-free-lunch principle: the optimal design choice can vary across tasks without golden rules. However, thanks to the modular interface for switching between different design choices (in Config), only minimal programming effort is needed with BETTY, expediting the research cycle.

E SYSTEMS SUPPORT

In this section, we perform additional analyses on the memory saving effects of our system features with two benchmarks: (1) differentiable neural architecture search and (2) data reweighting for class imbalance.

E.1 DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH

Table 8: GPU memory usage analysis for DARTS.

                   Baseline   + mixed-precision
GPU Memory Usage   9867MiB    5759MiB

E.2 DATA REWEIGHTING FOR CLASS IMBALANCE

In this experiment, we use ResNet50 (He et al., 2016) instead of ResNet32 to better study the memory reduction from our system features when a larger model is used. Importantly, we also test the data-parallel training feature in addition to the mixed-precision training feature.

Table 9: GPU memory usage analysis for MWN with ResNet-50.

                   Baseline   + mixed-precision   + data-parallel (2 GPUs)
GPU Memory Usage   6817MiB    4397MiB             3185/3077MiB (GPU0/1)

As shown above, we observe more reduction in memory usage as we add more system features.

F SUPPORTED FEATURES

Here, we summarize the supported features within BETTY.

Category Features

Best-response Jacobian algorithms 

