TARGETED HYPERPARAMETER OPTIMIZATION WITH LEXICOGRAPHIC PREFERENCES OVER MULTIPLE OB-JECTIVES

Abstract

Motivated by various practical applications, we propose a novel and general formulation of targeted multi-objective hyperparameter optimization. Our formulation allows a clear specification of an automatable optimization goal using lexicographic preference over multiple objectives. We then propose a randomized directed search method named LexiFlow to solve this problem. We demonstrate the strong empirical performance of the proposed algorithm in multiple hyperparameter optimization tasks. With the inputs F, T, C specified above, ∀k ∈ [K], we can define a lexi-target value z k * . The lexitarget value z k * of the k-th objective is defined recursively based on the lexi-target values of the more important objectives, i.e., the first (k -1) objectives. We define X 0 * = X and for k ∈ [K], z k * :=

1. INTRODUCTION

Hyperparameter optimization (HPO) of machine learning models, as a core component of AutoML, is a process of finding a good choice of hyperparameter configuration that optimizes the model "performance". In the context of practical ML systems, there are typically more than one metrics to evaluate the model "performance" on which one desires to optimize. For instance, latency (He et al., 2018) , fairness (Brookhouse & Freitas, 2022) , and explainability (Gonzalez et al., 2021) are important complementary metrics of interest in addition to prediction accuracy in many application scenarios. Typical multi-objective HPO (MO-HPO) approaches (Knowles, 2006; Daulton et al., 2020) seek to find wide-spread Pareto frontiers for users to choose from. This type of method can only establish a partial ordering of the configurations. The final choice on which Pareto frontier to use is typically done manually and is opaque to the optimization algorithm. We call such optimization "untargeted". An automated approach is desirable, especially in repetitive tuning scenarios such as continuous integration and delivery (CI/CD) of machine learning models or MLOps in general (Garg et al., 2021; Mäkinen et al., 2021; Symeonidis et al., 2022) . This automation is possible if the criteria for selecting the final choice is specified explicitly. In this scenario, untargeted HPO can be inefficient as the optimization algorithm may waste resources on finding Pareto frontiers that are far from the desired final choice, i.e., the target. In this work, we consider a targeted HPO scenario: practitioners have a priority order over the objectives, which enables a total ordering of all the configurations. We formalize a general notion of priority order rigorously as a lexicographic preference (Fishburn, 1975) over multiple objectives in an HPO task. It allows users to specify a clear optimization target across multiple objectives before the optimization starts and removes the need for manual post hoc selection. Such a priority structure is found in HPO tasks from various application domains. For example, in many bioinformatics applications, besides the primary objective of finding model hyperparameter configurations with low prediction error, minimizing feature numbers via a feature selection step is found to be helpful in avoiding overfitting and discovering relevant features for domain experts and thus is suggested to be used as an auxiliary objective in HPO (Bommert et al., 2017; Gonzalez et al., 2021) . When both objectives are included, the auxiliary objective is considered less important than the minimization of the prediction error, which naturally forms a lexicographic structure. Despite its appealing practical importance, we find this type of targeted HPO problem remarkably under-explored. In this work, we first provide a rigorous problem formulation for the targeted HPO Figure 1 : Results in an XGBoost tuning task to find accurate and fair models, in which validation loss is specified as an objective of a higher priority, and DSP (fairness-related objective) of a lower priority. Lexi-Target * = l * +0.05, in which l * is the optimal loss value which is unknown before the optimization starts and 0.05 is the user-specified tolerance onloss degradation. The circles represent proposed configurations from different methods in the objective space (darker color indicates later iteration). Both objectives are smaller the better. SO-HPO easily achieves the target loss value but performs poorly regarding fairness. MO-HPO is able to achieve a better performance in terms of fairness compared to SO-HPO. However, it also wastes resources on finding points outside the desired loss target as it seeks a wide spread of the Pareto front. Compared to MO-HPO, a larger fraction of the configurations in LexiFlow are within the loss target. This allows LexiFlow to try more configurations within the desired loss range and achieves better fairness performance. task with lexicographic preference over multiple objectives. This formulation provides a general and flexible way for the users to specify customized targets expressed via a priority order on the objectives and a list of optional goals and tolerances on the objectives. Based on the problem formulation, we propose an algorithm named LexiFlow as a general solution. Specifically, LexiFlow conducts the optimization by leveraging pairwise comparisons between hyperparameter configurations in a randomized direct search framework. The pairwise comparisons are supported by a suite of targeted lexicographic relations, which allow us to navigate toward the more promising region of the search space considering the lexicographic structure in the objective space. By doing so, the algorithm is able to efficiently optimize the objectives with a strong any-time performance. We perform extensive empirical evaluation on four different machine learning model tuning tasks, including a tuning task to find accurate and fast/small neural network models, a tuning task to find accurate and fair Xgboost models, a tuning task on random forest combined with feature selection for gene expression prediction, and a tuning task to mitigate overfitting. Our method has promising performance on all the evaluated tasks. The good empirical performance verified the unique advantages of our proposed algorithm. We demonstrate different performance patterns of methods including our method LexiFlow, a single objective method (SO-HPO), and a multiple objective method (MO-HPO) in Figure 1 .

1.1. RELATED WORK

There are a number of works trying to address MO-HPO tasks on machine learning models, including evolutionary algorithms (Deb et al., 2002; Srinivas & Deb, 1994; Zhang & Li, 2007; Binder et al., 2020) , Bayesian optimization (Knowles, 2006; Daulton et al., 2020; Emmerich & Klinkenberg, 2008; Hernández-Lobato et al., 2016) and multi-fidelity methods (Schmucker et al., 2020; 2021) . All the aforementioned methods treat all the objectives equally important and seek an approximation of the Pareto front in the objectives space. Some recent work proposes to incorporate different types of user preferences into MO-HPO. Paria et al. (2020) allow users to encode their preferences as a prior on a weight vector to scalarize multiple objectives. The prior will induce a posterior distribution over the set of Pareto optimal values. Setting a proper prior in practice is non-trivial as the relation between the prior and posterior is difficult, if not impossible, to derive for an average practitioner. Abdolshah et al. (2019) regard preferences as the stability of objectives, using a constrained Bayesian optimization method. An earlier multi-objective optimization method (Zitzler et al., 2008) incorporates preference information into the multiple objective evolutionary frameworks by defining relations between different populations. However, this preference information is defined upon different populations rather than individual configurations.

2. PROBLEM FORMULATION

In this section, we first provide a rigorous formulation of the targeted HPO problem with lexicographic preferences over multiple objectives together with a motivating example. We then analyze the challenges in approaching the problem.

2.1. PROBLEM FORMULATION

Throughout the paper, we use [K] to denote the integers from 1 to K. The studied problem needs the following inputs from the user, • A d-dimensional hyperparameter search space X. Any hyperparameter configuration x ∈ X is considered a feasible configuration. • A list of K (K > 1) objective functions regarding x ∈ X, denoted by F (x) = [f 1 (x), f 2 (x), ..., f K (x)]. Specifically, each f k (∀k ∈ [K] ) is a real-valued objective function of x ∈ X, and the superscript k in f k indicates the priority order (the smaller the index, the high the priority) of the objective. Without loss of generality, we assume minimization problems regarding all the objectives here and throughout this paper. • A K-dimensional tolerance vector on the objectives denoted by T = [τ 1 , τ 2 , ..., τ K ]. ∀k ∈ [K], each τ k is a non-negative number representing an optimality tolerance on the k-th objective. It can be intuitively considered as the amount of performance degradation the user is willing to compromise in order to find choices with better performance on the objectives of lower priorities. • C = [c 1 , c 2 , ..., c K ] a K-dimensional goal vector, in which c k ∈ R denotes a goal value on the k-th objective desired by the user. F (x ′ ) = l F (x) ⇔ f k (x ′ ) = f k (x) ∀k ∈ [K] F (x ′ ) ≺ l F (x) ⇔ ∃k ∈ [K] : f k (x ′ ) < f k (x) ∧ (∀k ′ < k, f k ′ (x ′ ) = f k ′ (x)) (3) F (x ′ ) ⪯ l F (x) ⇔ F (x ′ ) ≺ l F (x) ∨ F (x ′ ) = l F (x) Our aim is to find a lexi-optimal configuration, i.e., any element in X * = {x ∈ X K * |∀x ′ ̸ = x, F (x) ⪯ l F (x ′ )}, using minimal cost. We call this problem a Lexi-HPO problem in short. Remark 1 (Lexi-optimal vs Pareto frontiers). The lexi-optimal solutions in X * are always a subset of the Pareto frontiers in the K-dimensional objective space. The relative location of X K * in the Pareto frontiers depends on the lexicographic preferences and the relaxations imposed via tolerance and/or goals. In this way, the users could specify their own 'target' of interest in the Pareto frontiers in a straightforward way before the optimization starts. Remark 2 (On tolerances and goals). Note that the two inputs tolerances T and goals C are both optional. They are introduced to help impose customized needs on the targeted optimum in flexible ways. When there is no tolerance and no particular goal on a particular objective k, one can just set τ k = 0 and c k = -inf without affecting the nature of the problem. To facilitate understanding of the Lexi-HPO problem and better motivate the studied scenario, we include a concrete example for gene expression prediction from the bio-informatics domain below. Example: In the HPO task for gene expression prediction from (Bommert et al., 2017) , three minimization objectives are considered during the HPO process, including the model's prediction error, the resulting feature numbers after feature selection, and feature selection instability. The three objectives are in descending priority order. Despite the priority order, when trying to find the optimal choice, it is okay to consider the configurations whose loss does not exceed l * + 0.05, in which l * denotes the best loss one can achieve by any feasible hyperparameter configuration. In addition, a feature number of 500 is considered good enough for the second objective. Say in this example l * = 0.1 and x A , x B , x C , x D are whole Pareto frontiers with performance 13, 500, 0.2], and F (x D ) = [0.1, 300, 0.5]. This example can be formulated as a Lexi-HPO problem, where the objectives are F = [loss, feature number, instability], tolerances are T = [0.05, 0, 0], and goals are C = [inf, 500, -inf]. Under the Lexi-HPO problem, we have the following conclusions: (1) x C is the optimal choice among the whole frontiers; (2) Combining the first conclusion with Remark 1, we can also conclude that x C is also the lexi-optimal choice in the whole space. This example also shows that the tolerances T and goals C are, in general, intuitive to understand and easy to set by practitioners because they are directly about each single objective value. F (x A ) = [0.2, 100, 0.1], F (x B ) = [0.1, 600, 0.2], F (x C ) = [0.

2.2. CHALLENGES

We want to first remind the readers that in the context of HPO of ML models, the objectives are mostly black-box, and it is also expensive to get function evaluations because it involves training and validating an ML model. The black-box nature of this problem makes methods that require analytic forms of the objectives, e.g., gradient-based methods (Gong et al., 2021) inapplicable. We also urge methods that are efficient and cost-frugal because of the expensive function evaluations and practical resource limitations. One potential solution for the Lexi-HPO problem is to use MO-HPO approaches to find Pareto frontiers and add a post-processing step to narrow them down to the targeted optimal points. This type of method can be inefficient because they strive to find a wide spread of Pareto frontiers while the users are actually only interested in a subset of them in a particular region. Another potential solution is to perform single objective HPO or constrained single objective HPO regarding each objective sequentially following the preference order. More specifically, when optimizing the k-th objective, one can impose constraints of the form f i (x) ≤ z i * , ∀i ∈ [k -1] and estimate z k * according to the result. However, one tricky problem in this approach is how to allocate limited resources to the optimization of different objectives. The resource allocation directly affects how reliable the constraints are. For example, if we allocate insufficient resources to the first objective, we will not reach the optimal value on the first objective which loosens the constraints imposed on the first objective when trying to optimize objectives of lower priorities. If we allocate excessive resources to the first objective, the optimization of other objectives can be hindered due to a lack of resources. We include this approach as a baseline in the experiment section.

3. ALGORITHM

We propose an algorithm named LexiFlow to solve the Lexi-HPO problem. We adopt a randomized direct search framework that could direct the search to the targeted optimum based on lexicographic comparisons over pairs of configurations. During the iterative optimization process, the algorithm automatically adjusts its focus on the different objectives based on their priorities and room for improvement. For example, when there is no room for improvement on the first objective in a particular iteration, the algorithm will direct the search toward the region where the second objective could improve without hurting the optimality of the first objective. Since such adjustment is made at every iteration, the algorithm keeps the chance to improve every objective when the optimality has not been reached. By doing so, the algorithm is able to adaptively allocate optimization resources over multiple objectives in a flexible order while respecting the priority. This strategy can help the algorithm achieve a good any-time performance. LexiFlow is presented in Alg. 1. It takes as inputs the objectives F , a goal vector C (optionally), and a tolerance vector T (optionally). When C and T are not provided, we just set c k to -∞, and τ k to 0, ∀k ∈ [K]. After an initialization step, which sets an initial hyperparameter configuration x 0 and an initial stepsize δ init , the algorithm proceeds as follows: At each iteration i, LexiFlow maintains an 'incumbent' point, denoted by x i . The 'incumbent' point is the point in whose neighboring area the algorithm samples randomized directions to get new points to try next. More specifically, a new direction u is sampled uniformly at random from a unit sphere (Line 4 in Alg. 1), which leads to two newly proposed points x i ± δu (considering both the sampled direction and its opposite direction). After evaluating the newly proposed point(s), i.e., obtaining F (x i + δu) and potentially also F (x i -δu), it decides whether to update the incumbent with one of the newly proposed points, i.e., x i + δu or x i -δu. In LexiFlow, this decision is made via a carefully designed Update function. Following the same spirit of an existing randomized direct search-based single-objective HPO method (Wu et al., 2021) , LexiFlow includes the restart and dynamic stepsize techniques to free the algorithm from local optimum and manual configuration of stepsize (line 9-11 of Alg. 1). We also adopt the same setting of the initial stepsize δ init and stepsize lower bound δ lower with the aforementioned work. The algorithm terminates if it runs out of resource budget and outputs a lexi-optimal configuration x * , which is maintained in a way such that it is lexi-optimal among all the evaluated configurations. Following the general principles of direct search (Kolda et al., 2003) , the incumbent point shall be maintained in a way such that it is the best or at least a very promising point among the historically evaluated points (within a local search thread without considering restart). This general principle, together with an iterative random sampling of new points around the incumbent, could gradually lead the search to more promising region (Kolda et al., 2003; Wu et al., 2021) . The algorithm makes a decision on whether to update the incumbent with a newly proposed point via the Update procedure attached at the end of Alg. 1 (line 12-18). Algorithm 1: LexiFlow Input: Objectives F (•), goals C (optional) and tolerances T (optional). Initialization: Initial configuration x0, i ′ = r = s = 0, δ = δinit; Obtain F (x0), and x * ← x0, H ← {x0}, ZH ← F (x0) while i = 0, 1, ... do Sample u uniformly from unit sphere S if Update(F (xi + δu), F (xi), ZH) then xi+1 ← xi + δu, i ′ ← i; else if Update(F (xi -δu), F (xi), ZH) then xi+1 ← xi -δu, i ′ ← i ; else xi+1 ← xi, s ← s + 1 ; H ← H ∪ {xi+1}, and update ZH according to Eq. ( 8) if s = 2 d-1 then s ← 0, δ ← δ (i ′ + 1)/(i + 1) ; if δ < δ lower then r ← r + 1, xi+1 ← N (x0, I), δ ← δinit + r // Random Restart Procedure Update(F (x ′ ), F (x), ZH): if F (x ′ ) ≺ (Z H ) F (x) Or F (x ′ ) = (Z H ) F (x) and F (x ′ ) ≺ l F (x ) then if F (x ′ ) ≺ (Z H ) F (x * ) Or F (x ′ ) = (Z H ) F (x * ) and F (x ′ ) ≺ l F (x * ) then x * ← x ′ // Using x * to keep track of the lexi-optimal solution Return True // Accept x ′ else Return False // Discard x ′ Output: A lexi-optimal configuration x * Update Function. Existing work on randomized direct search (Kolda et al., 2003; Wu et al., 2021) , especially the one about cost-frugal HPO (Wu et al., 2021) indicates the following desiderata for doing cost-efficient randomized direct search: (1) Whenever a newly proposed point is better than the incumbent, it should be accepted as the new incumbent. This property can help achieve a good anytime performance of the algorithm and avoids getting stuck into local optimum while maintaining a good convergence rate. (2) The algorithm should not accept a point that is strictly worse than the incumbent. This property can help control the evaluation cost in each individual trial according to theoretical analysis of the cost of a randomized direct search method (Proposition 4 and 5 in (Wu et al., 2021) ). Our Update function is designed following these desiderata. One seemingly straightforward idea for the Update function is to directly make comparisons between the incumbent and the newly proposed points according to the vanilla lexicographic relations defined in Eq. (2), Eq. (3) and Eq. ( 4). However, one problem with vanilla lexicographic relations is that they cannot accommodate user-specified tolerances and goals on the objectives. To address this challenge, we introduce the targeted lexicographic relations, which are parameterized by a Kdimensional real-valued target vector Z = [z 1 , z 2 , ..., z K ] and denoted by = (Z) , ≺ (Z) , ⪯ (Z) , F (x ′ ) = (Z) F (x) ⇔ f k (x ′ ) = f k (x) ∨ (f k (x ′ ) ≤ z k ∧ f k (x) ≤ z k ) ∀k ∈ [K] (5) F (x ′ ) ≺ (Z) F (x) ⇔ ∃k ∈ [K] : f k (x ′ ) < f k (x) ∧ f k (x) > z k ∧ F k-1 (x) = (Z) F k-1 (x ′ ) (6) F (x ′ ) ⪯ (Z) F (x) ⇔ F (x ′ ) ≺ (Z) F (x) ∨ F (x ′ ) = (Z) F (x) in which F k-1 (x) denotes a vector with the first k -1 dimensions of F (x), i.e., F k-1 (x) = [f 1 (x), ..., f k-1 (x)] (We define F 0 (x) = 0, ∀x ∈ X). It is easy to verify that the relation ⪯ (Z) is reflexive in X and we prove in Lemma 1 (in Appendix A) that it is also transitive in X. With these properties, it can be verified that when Z is set to the lexi-targets Z * = [z 1 * , z 2 * , ..., z K * ], the targeted lexicographic relations can be used to find points that are in X K * , i.e., the lexi-frontiers, through exhaustive pairwise comparisons of configurations from the search space X. Another caveat in this idea is that the comparisons depend on the knowledge of lexi-targets z k * ∀k ∈ [K], which are generally unknown during the optimization process. To address this difficulty, we propose to maintain an online approximate of the lexi-targets based on historical observations. Specifically, we maintain all historically evaluated points in H and introduce the following statistics of H based on Eq. ( 1): X (0) H = H and ∀k ∈ [K], z k H := max{c k , f k H + τ k }, f k H := min x∈X k-1 H f k (x), X k H := {x ∈ X k-1 H |f k (x) ≤ z k H } Here z k H , f k H , and X k H can be considered online versions of z k * , f k * , and X k * respectively, calculated based on historical observations in H. With the approximated lexi-targets Z H = [z 1 H , z H 2 , ..., z K H ], we decide whether to accept a newly proposed point x ′ comparing the the current incumbent x according to the Update procedure in Alg. 1. Specifically, we accept x ′ as the new incumbent in the following two cases: (1) When F (x ′ ) ≺ (Z H ) F (x) is true. This condition allows us to leverage the targeted lexicographic relation ≺ (Z H ) to direct the search toward the approximated lexifrontiers based on the approximated lexi-targets Z H . (2) When F (x ′ ) = (Z H ) F (x) and F (x ′ ) ≺ l F (x . This condition allows us to move toward lexi-optimal points (also approximated) when the proposed point is already an approximated lexi-frontier. The update rules on the one hand respect the lexicographic relations considering the existence of lexi-targets and on the other hand follow the established desiderata. In Alg. 1 we use x * to keep track of the lexi-optimal solution and update it whenever needed in a similar way in the Update procedure (line 13-15). Note that when there is no random start, x * is simply the current incumbent, i.e., x i at iteration. The extra update steps in line 13-15 are essential considering the potential random restarts.

4. EXPERIMENTS

We perform evaluations on multiple hyperparameter tuning tasks, which fall into the following two categories of use case scenarios: (1) To directly perform lexicographic optimization in cases where the users do have lexicographic preferences over multiple objectives. (2) To optimize a particular objective that is inaccessible during the optimization, by finding lexi-optimal solutions in terms of a list of proxy objectives. For the first type of use case, we include an efficient NN tuning task, an unfairness mitigation task, and a high-dimensional feature selection task in Section 4.1. For the second type of use case, we include a task about overfitting mitigation in Section 4.2. We tune three types of models including Xgboost, Random Forest and Neural Networks on different classification tasks. In all the experiments, if not otherwise specified, we show the results of the lexi-optimal solutions calculated based on the historical observations in each method at each particular concerned time point. All reported results are averaged over five runs with different random seeds.

4.1. HPO WITH LEXICOGRAPHIC PREFERENCES OVER MULTIPLE OBJECTIVES

Since there are no existing HPO methods that are directly applicable to the Lexi-HPO problem, we include three general HPO methods which could be extended to the Lexi-HPO problem: (1) A single objective HPO algorithm (SO-HPO), named CFO (Wu et al., 2021) , which only optimizes the objective of the highest priority. The method CFO is selected as it is a strong SO-HPO method showing superior performance over other alternatives, e.g., BO-based methods, multi-fidelity-based methods as reported in (Wu et al., 2021) . (2) A constrained optimization method (C-HPO) as described and constructed in Section 2.2. This method allocates even resources to different objectives. (3) A popular multiple objective HPO algorithm (MO-HPO) qEHVI (Daulton et al., 2020) . We choose qEHVI in this category because it is a state-of-the-art MO-HPO method with ready-to-use opensource implementation. We use both CFO and its constrained optimization version implementation from the AutoML library FLAML (Wang et al., 2021) , and use the implementation for qEHVI from the HPO library Optuna (Akiba et al., 2019) . In all the experiments, when presenting the anytime performance, we use a horizontal dashed line labeled Lexi-Target * to represent the lexi-target calculated based on user inputs and the configurations found by all the methods (unknown before the optimization) according to Eq. ( 8).

4.1.1. TUNING ACCURATE AND EFFICIENT NEURAL NETWORKS

Deploying neural networks in practical applications typically requires a balance between model accuracy and efficiency. Lexico-HPO can be a good way to find models achieving the desired balance. For example, as mentioned in (He et al., 2018) , for AI applications such as Google Photos, model accuracy is more important than latency, and it is desirable to find smaller models under the condition of not sacrificing accuracy. There is a natural lexicographic preference over the two objectives of improving model accuracy and reducing latency. In this experiment, we evaluate our method for the task of tuning accurate and efficient neural networks. More specifically, following a similar search space setting with (Abdolshah et al., 2019; Hernández-Lobato et al., 2016) , we tune neural network models on a subsampled Fashion MNIST dataset (Xiao et al., 2017) with two minimization objectives under lexicographic preference. The 1st objective is the error rate. The 2nd objective is efficiency-related objectives including FLOPs and parameter numbers in two different experiments respectively. For both experiments, we provide tolerance vector T = [τ 1 = 1/ √ S, τ 2 = 0.0], where S represents the number of validation data points since the error metric calculated from validation data can deviate from the true test error in the scale of 1/ √ S. We do not set a specific goal vector for this task, as mentioned in (He et al., 2018) , latency is typically not a hard constraint. We perform tuning for 2 hours with each method and show the anytime performance of the methods regarding the two objectives in Figure 2 . All four methods eventually achieve good performance in terms of the objective of the highest priority, i.e., the error rate: all of them are within the calculated lexi-target. LexiFlow, SO-HPO, and C-HPO achieve the target value earlier than MO-HPO because they put more priority on the first objective. In terms of the second objective, LexiFlow shows a significant performance improvement in both two experiments compared to all the baselines. MO-HPO finds models with lower FLOPs and lower parameter numbers than SO-HPO and C-HPO. But the FLOPs and parameter numbers in MO-HPO are still 2-4 times larger than that in LexiFlow when the error rate is within the tolerance range. 

4.1.2. TUNING ACCURATE AND FAIR XGBOOST

Due to increasing evidence on fairness-related harms (Angwin et al., 2016; Barocas & Selbst, 2016) in ML, there is an emerging need to build ML models that is not only accurate but also fair. One way to achieve this goal is to optimize the predictive performance of a model (1st objective) while minimizing the unfair bias (2nd objective) under a lexicographic preference in model tuning (Brookhouse & Freitas, 2022) . Following the same setting with (Brookhouse & Freitas, 2022) , we perform the Xgboost tuning task with two minimization objectives, including validation loss (1st objective) according to (Brookhouse & Freitas, 2022) , and DSP (2nd objective), which an unfairness-related metric calculated based on statistical disparity (Perrone et al., 2021) . We do the evaluation on datasets including Adult, Compas, and German, which are commonly used datasets for fairness-related tasks (Perrone et al., 2021) . We set a tolerance of 0.05 on the validation loss according to (Brookhouse & Freitas, 2022) , and we do not provide the goal value . We perform tuning for 1 hour with each method and show the final performance in Table 1 . The results show that LexiFlow can indeed achieve the best performance considering the lexicographic preference over the objectives compared to baselines: LexiFlow has a similar performance with baselines considering optimality tolerance (0.05) in terms of the 1st objective and achieves a better performance in terms of the 2nd objective across all three datasets. Regarding the performance on the 2nd objective, i.e., the DSP value, 0.05 is a commonly used reference value for good DSP (Perrone et al., 2021) . LexiFlow reaches 0.016 in Compas and 0.034 in German, where most of the other baselines got larger than 0.05. Table 1 : Results from tuning Xgboost on tasks that involve fairness considerations. All four methods are within the optimality tolerance range in 1st objective. In terms of the 2nd objective, LexiFlow achieves a significant performance improvement compared to baselines. In many applications of ML in bioinformatics, the primary goal is to find a model with good predictive performance. For high-dimension biological data, due to a large number of features, it is also critical to integrate feature selection into the model fitting process, to make the selected features few and stable (Bommert et al., 2017) , which could help provide insights for domain experts. In this experiment, we verify LexiFlow in a feature selection task following a similar setting with (Bommert et al., 2017) which implicitly formulates the optimization objectives as lexicographic preference. Specifically, we perform the Xgboost tuning on biological dataset AP colon Kidney with three minimizing objectives including error rate, feature number, and the instability of the selected features (with descending priority order). We set targets C = [c 1 = 0, c 2 = 500, c 3 = 0] and tolerances T = [τ 1 = 0.01, τ 2 = 0.0, τ 3 = 0.0] according to (Bommert et al., 2017) We use the filter feature selection method from Scikit-learn (Pedregosa et al., 2011) for feature selection. We perform tuning for 10 4 seconds and report the anytime performance regarding the three objectives in Figure 3 . From the results, we observe that regarding the 1st objective, all four methods are within the optimality tolerance range. In terms of the 2nd objective, only LexiFlow reaches the goal in the end. Moreover, LexiFlow achieves significant performance improvement compared to baselines in the 3rd objective. We also include results from all five random seeds in Appendix B.

4.2. LEXIFLOW AS AN OPTIMIZATION TOOL TO MITIGATE OVERFITTING

Although low loss on the final test data is the objective of interest in machine learning, validation loss is commonly used as a proxy metric for test loss since the final test data is inaccessible in the model building and tuning stage. It is well known that there may exist a gap between an ML model's performance on validation data and test data. Overfitting is a major factor that causes such a performance gap. Evidence from existing work show that optimizing machine learning models with fewer but key features could mitigate overfitting problems (Jović et al., 2015; Ying, 2019; Abdel-Aal, 2005) . These findings inspire us to explore the possibility of incorporating the objective of minimizing feature numbers as an additional proxy objective in ML model tuning. Specifically, we consider the commonly used validation loss as the primary objective and feature number as a secondary proxy objective during HPO. We tune Random Forest (RF) with training and validation data and evaluate the best model's test loss on a reserved test dataset as the final performance metric. More specifically, we include the following methods in comparison: (1) LexiFlow w/ FS, which is LexiFlow with two minimization objectives, including validation loss (1st), and feature number (2nd). ( 2) SO-HPO w/ FS, which is the single objective HPO algorithm CFO (Wu et al., 2021) with validation loss as its objective. (3) SO-HPO w/o FS, which is the single objective HPO algorithm CFO with validation loss as the objective and without a feature selection process. Both LexiFlow w/ FS and SO-HPO w/ FS include a feature selection step (indicated by w/ FS) using the same feature selection method as that in Section 4.1.3, in addition to RF model hyperparameter tuning. We include eight classification datasets, which are datasets from two previous feature selection studies (Gonzalez et al., 2021; Bommert et al., 2020) satisfying the following two conditions: feature number larger than 100 and available on Openml. We use 0.01 or 0.001 as the tolerance on validation loss according to empirical studies from (Gonzalez et al., 2021) : for each dataset, if the mean validation loss from a default RF model (from sklearn) is larger than 0.1, 0.01 is used as the tolerance, otherwise 0.001. No tolerance or target is imposed on the 2nd objective. We show the final test result from different methods in Table 2 . LexiFlow w/ FS achieves the best performance on most of the datasets (7/8). We also investigate the number of features selected in each method and find that LexiFlow w/ FS achieves the smallest reserved feature ratio (30.1%) compared with SO-HPO w/o FS (100%) and SO-HPO w/ FS (66.8%). There are two important takeaways from this experiment, considering the nature of the compared methods and their performance: (1) The good test performance of SO-HPO w/o FS and LexiFlow w/ FS (over SO-HPO w/o FS) indicates that reducing feature number can indeed help mitigate overfitting and thus improve test performance, which is consistent with findings from existing work mentioned; (2) The dominating performance of LexiFlow w/ FS over SO-HPO w/ FS indicates that leveraging feature number as a secondary objective in addition to validation loss during model tuning is an effective way to further mitigate overfitting. Our method serves as an ideal optimization tool in this important endeavor.

5. CONCLUSION

In this paper, we propose an HPO algorithm named LexiFlow, which could easily incorporate users' lexicographic preferences across multiple objectives in HPO tasks. LexiFlow is simple and effective, showing a strong empirical performance over a wide spectrum of tuning tasks and application domains. LexiFlow is of good practical importance especially considering the ubiquitous existence of potentially conflicting objectives in modern machine learning systems, such as accuracy, latency, model size, robustness, and ethics-related objectives. We also made an interesting finding on overfitting mitigation: tuning certain proxy objectives with a lexicographic structure could help find models that are less likely to overfit. The implementation of our method is available in the opensource AutoML library FLAMLfoot_0 . A PROOF Lemma 1. The binary relation ⪯ (Z) defined in Eq. ( 7) is transitive, i.e., ∀x, y, w ∈ X, F (x) ⪯ (Z) F (y) ∧ F (y) ⪯ (Z) F (w) ⇒ F (x) ⪯ (Z) F (w). Proof of Lemma 1. According to Eq. ( 7), F (x) ⪯ (Z) F (y) ∧ F (y) ⪯ (Z) F (w) ⇒ F (x) ≺ (Z) F (y) ∨ F (x) = (Z) F (y) ∧ F (y) ≺ (Z) F (w) ∨ F (y) = (Z) F (w) ⇒ F (x) = (Z) F (y) ∧ F (y) = (Z) F (w) i ∨ F (x) ≺ (Z) F (y) ∧ F (y) ≺ (Z) F (w) ii ∨ F (x) = (Z) F (y) ∧ F (y) ≺ (Z) F (w) iii ∨ F (x) ≺ (Z) F (y) ∧ F (y) = (Z) F (w) iv (9) Statement I: F (x) = (Z) F (y) ∧ F (y) = (Z) F (w) ⇒ F (x) = (Z) F (w). Proof of Statement I. From Eq. ( 5), we have, F (x) = (Z) F (y) ∧ F (y) = (Z) F (w) ⇒ f k (x) = f k (y) ∨ f k (x) ≤ z k ∧ f k (y) ≤ z k , ∀k ∈ [K] ∧ f k (y) = f k (w) ∨ f k (y) ≤ z k ∧ f k (w) ≤ z k , ∀k ∈ [K] ⇒ ∀k ∈ [K] : f k (x) = f k (y) = f k (w) ∨ f k (x) = f k (y) ∧ f k (y) ≤ z k ∧ f k (w) ≤ z k ∨ f k (x) ≤ z k ∧ f k (y) ≤ z k ∧ f k (y) = f k (w) ∨ f k (x) ≤ z k ∧ f k (y) ≤ z k ∧ f k (w) ≤ z k ⇒ f k (x) = f k (w) ∨ f k (x) ≤ z k ∧ f k (w) ≤ z k , ∀k ∈ [K] ⇒ F (x) = (Z) F (w) Statement II: F (x) ≺ (Z) F (y) ∧ F (y) ≺ (Z) F (w) ⇒ F (x) ≺ (Z) F (w) Proof of Statement II. According to Eq. ( 6), we have F (x) ≺ (Z) F (y) ⇔ ∃k 1 ∈ [K] : f k1 (x) < f k1 (y) ∧ f k1 (y) > z k1 ∧ F k1-1 (x) = (Z) F k1-1 (y) F (y) ≺ (Z) F (w) ⇔ ∃k 2 ∈ [K] : f k2 (y) < f k2 (w) ∧ f k2 (w) > z k2 ∧ F k2-1 (y) = (Z) F k2-1 (w) Let k ′ = min{k 1 , k 2 }, according to Statement I and Eq. ( 11), we have: F k ′ -1 (x) = (Z) F k ′ -1 (y) ∧ F k ′ -1 (y) = (Z) F k ′ -1 (w) ⇒ F k ′ -1 (x) = (Z) F k ′ -1 (w) According to Eq. ( 12), if k ′ = k 1 < k 2 , we have f k ′ (x) < f k ′ (y) = f k ′ (w) ∧ z k ′ < f k ′ (y) = f k ′ (w) If k ′ = k 2 < k 1 , we have f k ′ (x) = f k ′ (y) < f k ′ (w) ∧ z k ′ < f k ′ (w) Otherwise when k ′ = k 1 = k 2 , we have f k ′ (x) < f k ′ (y) < f k ′ (w) ∧ z k ′ < f k ′ (y) < f k ′ (w) Combing Eq. ( 12), Eq. ( 13), Eq. ( 14) and Eq. ( 15), we have found existing k = k ′ such that f k (x) < f k (w) ∧ f k (w) > z k ∧ F k-1 (x) = (Z) F k-1 (w) . Statement III: F (x) = (Z) F (y) ∧ F (y) ≺ (Z) F (w) ⇒ F (x) ≺ (Z) F (w). Proof of Statement III. According to the definitions of = (Z) and ≺ (Z) in Eq. ( 5) and Eq. ( 6), F (x) = (Z) F (y) ∧ F (y) ≺ (Z) F (w) ⇒ ∀k 1 ∈ [K] : f k1 (x) = f k1 (y) ∨ f k1 (x) ≤ z k1 ∧ f k1 (y) ≤ z k1 ∧ ∃k 2 ∈ [K] : f k2 (y) < f k2 (w) ∧ f k2 (w) > z k2 ∧ F k2-1 (y) = (Z) F k2-1 (w) Then, we find a k = k 2 , such that, f k (x) = f k (y) < f k (w) ∨ f k (x) ≤ z k < f k (w) ∧ z k < f k (w) ∧ F k-1 (x) = (Z) F k-1 (w) ⇒ f k (x) < f k (w) ∧ z k < f k (w) ∧ F k-1 (x) = (Z) F k-1 (w) in which indicates F (x) ≺ (Z) F (w). Statement IV: F (x) ≺ (Z) F (y) ∧ F (y) = (Z) F (w) ⇒ F (x) ≺ (Z) F (w). Proof of Statement IV. According to the definitions of = (Z) and ≺ (Z) in Eq. ( 5) and Eq. ( 6), F (x) ≺ (Z) F (y) ∧ F (y) = (Z) F (w) ⇒ ∃k 1 ∈ [K] : f k1 (x) < f k1 (y) ∧ f k1 (y) > z k1 ∧ F k1-1 (x) = (Z) F k1-1 (y) ∧ ∀k 2 ∈ [K] : f k2 (y) = f k2 (w) ∨ f k2 (y) ≤ z k2 ∧ f k2 (w) ≤ z k2 (18) Then, we can find a k = k 1 , such that, f k (x) < f k (y) = f k (w) ∧ z k < f k (w) = f k (y) ∧ F k-1 (x) = (Z) F k-1 (w) ⇒ f k (x) < f k (w) ∧ z k < f k (w) ∧ F k-1 (x) = (Z) F k-1 (w) ⇒ F (x) ≺ (Z) F (w) By substituting the conclusions in statement (I), (II), (III) and (IV) into (i), (ii), (iii), and (iv) in Eq. ( 9) respectively, we get: F (x) ⪯ (Z) F (y) ∧ F (y) ⪯ (Z) F (w) ⇒ F (x) ≺ (Z) F (w) ∨ F (x) = (Z) F (w) ⇒ F (x) ⪯ (Z) F (w) which concludes the proof.

B SUPPLEMENTARY RESULTS

We provide the results for the feature selection task in bio-informatics (Section 4.1.3) under different random seeds in Figure 4 . ( Brookhouse & Freitas, 2022) to calculate the 1st objective validation loss in the experiment section. More specifically, the accuracy metric geometric mean of Sensitivity and Specificity GM Sen×Spe is calculated by: Sensitivity = T P T P + F N , Specificity = T N T N + F P (21) GM Sen×Spe = Sensitivity × Specificity Where T P , T N , F N and F P represent the number of true positive instances, the number of true negative instances, the number of false negative instances and the number of false positive instances, respectively. The validaiton loss reported in Table 1 is calculated by 1 -GM Sen×Spe . More information of this metric could be found in (Brookhouse & Freitas, 2022) .



Link to the documentation page of LexiFlow in FLMAL: https://microsoft.github.io/ FLAML/docs/Use-Cases/Tune-User-Defined-Function#lexicographic-objectives. code example demonstrating the use of LexiFlow to find accurate and fast neural networks: https: //microsoft.github.io/FLAML/docs/Examples/Tune-Lexicographic-objectives.



Figure 2: Results from the neural networks tuning task on Fashion-MNIST. Each of the curves represents averaged result of a particular method from 5 runs with different random seeds. The shaded area corresponds to 95% confidence intervals.

Figure 3: The results in tuning Xgboost on biological dataset AP Colon Kidney.

Figure 4: The detailed results in tuning Xgboost on biological dataset AP Colon Kidney with different random seeds.

Test-time error rate (%) achieved by different methods on different datasets.

C EXPERIMENTATION DETAILS C.1 SEARCH SPACE

The detailed search space in tuning Neural Networks, Random Forest, and XGboost are shown in Table 3 , Table 4 and Table 5 , respectively. 

