BOPTFORMER: BEYOND TRANSFORMER FOR BLACK-BOX OPTIMIZATION

Abstract

We design a novel Transformer for continuous unconstrained black-box optimization, called BOptformer. Inspired by the similarity between Vision Transformer and evolutionary algorithms (EAs), we modify the Transformer's multi-head self-attention layer, feed-forward network, and residual connection to implement the functions of the crossover, mutation, and selection operators. Moreover, we devise an iterated mode to generate and select potential solutions as EAs do. BOptformer learns optimization strategies from the target task automatically without human intervention, which addresses the poor generalization of human-designed EAs when given a new task. Compared to baselines such as EAs, Bayesian optimization, and a learning-to-optimize (L2O) method, BOptformer shows the top performance on six black-box functions and two real-world applications. We also find that an untrained BOptformer can achieve good performance on simple tasks, and that deep BOptformers perform better than shallow ones. We bring a new and efficient Transformer-based black-box optimization framework to the L2O and EA communities.

1. INTRODUCTION

Many tasks, such as neural architecture search (Elsken et al., 2019) and hyperparameter optimization (Hutter et al., 2019; Golovin et al., 2017), can be abstracted as black-box optimization problems: although we can evaluate $f(x)$ for any $x \in X$, we have no access to any other information about $f$, such as its gradients or Hessian. A series of hand-designed algorithms, such as evolutionary algorithms (EAs) (Mitchell, 1998; Khadka & Tumer, 2018; Zhang & Li, 2007), Bayesian optimization (Snoek et al., 2012; Mutny & Krause, 2018; Li et al., 2017; Kandasamy et al., 2015; Balandat et al., 2020), and evolutionary strategies (ES) (Wierstra et al., 2014; Hansen & Ostermeier, 2001; Auger & Hansen, 2005; Salimans et al., 2017), have been designed to solve black-box optimization problems. Recently, the learning to optimize (L2O) framework (Chen et al., 2022) has offered a new perspective on optimization by using a recurrent neural network (RNN), a long short-term memory architecture (LSTM) (Chen et al., 2020; Andrychowicz et al., 2016; Chen et al., 2017; Li & Malik, 2016; Wichrowska et al., 2017; Bello et al., 2017), or a multilayer perceptron (MLP) (Metz et al., 2019) as the optimizer, aiming to reduce the laborious iterations of hand engineering (Sun et al., 2018; Vicol et al., 2021; Flennerhag et al., 2021; Li & Malik, 2016). Most of these methods do not focus on black-box optimization. The core of L2O is constructing a strong mapping from the initial solutions to the optimal solution. Although several efforts (Cao et al., 2019; Chen et al., 2017) have addressed black-box problems, their effectiveness may be hindered by the limited representational capabilities of RNNs, LSTMs, and MLPs. In EAs, the hand-designed crossover, mutation, and selection operators move the initial population toward the optimal solution. This update scheme has stood the test of time.
Because the evolutionary operators must be modified to maximize their performance on the target task, human-designed EAs generalize poorly to a new black-box problem. Most notably, expert knowledge is limited, so EA design makes little use of target function information, which makes it difficult to adapt to the target task. Learning the optimization strategies from the target task is the key step to overcome this limitation. This paper designs a novel L2O framework, termed BOptformer, based on the advantages of Vision Transformer (Dosovitskiy et al., 2021) and EAs to overcome the above limitations. Moreover, Transformer (Han et al., 2022) has strong representation ability, and there is currently no work that uses Transformer for optimization. Inspired by the similarity of EAs and Transformer (Zhang et al., 2021; 2022), BOptformer revises the critical parts of Transformer to realize the mapping from the random population to the optimal population. To generate potential individuals that approach the optimal solution, we first design a self-attention (SA)-based crossover module (SAC) to simulate the crossover operator of EAs; the output of this module is then fed into the proposed feed-forward network (FFN)-based mutation module (FM) to perform mutation. Moreover, the residual and selection module (RSSM) is designed to keep the fittest individuals. RSSM performs a pairwise fitness comparison between the input population and a combination of the input population and the outputs of SAC and FM. We design a BOptformer Block (OB) consisting of SAC, FM, and RSSM. Finally, we construct BOptformer by stacking OBs to simulate the generations of EAs. To cope with black-box optimization, we establish a function set to train BOptformer under an unsupervised mode. We construct a set of differentiable functions with properties similar to those of the targeted black-box optimization problems. This training set contains pairs of an initial population and a designed function.
Thus, we can use gradient-based methods to train BOptformer. We tested BOptformer on six standard functions, the protein docking (Cao & Shen, 2020) problem, and the planar mechanic arm problem (Wang et al., 2021) . The experimental results demonstrate the top rank of BOptformer and the strong representation compared with three population-based baselines, Bayesian optimization, and one learning-to-optimize method (Cao et al., 2019) . Moreover, we also analyze the effect of learning rate, deep structure, and weight sharing between OBs. The highlights of this paper are summarized as follows: 1) We propose a solid Transformer-based L2O framework addressing black-box problems to the L2O community. We have demonstrated its benefit when compared with standard black-box optimization methods, particularly for the L2O-based method. 2) BOptformer efficiently uses the target black-box function's information to aid in the development of the optimization strategy. Compared to the human-designed EA, BOptformer has a substantially greater degree of task fit.

2. RELATED WORK

Transformer Transformer structure achieves significant progress for machine translation task (Vaswani et al., 2017) , computer vision task (Dosovitskiy et al., 2021) , time series task (Zhou et al., 2021) , and so on. Many improved models are proposed and obtain great achievements (Han et al., 2022) . There are no Transformer-based efforts for handling optimization problems, which is crucial in the machine learning community. (Vaswani et al., 2017) proposed the meta-learning hyperparameter optimization framework with Transformers to learn both policy and function priors from data across different search spaces. However, the BOptformer proposed in this paper expands the application scope of Transformer and can effectively deal with this case. The basic modules of Transformer are shown in Appendix A.1. Evolutionary Algorithm Inspired by the evolution of species, EAs have provided surprising performance for black-box optimization (Mitchell, 1998) . The basic modules of EAs are shown in Appendix A.2. Many influential variants have been proposed to deal with different problems (Das & Suganthan, 2010; Wu & Liu, 2019 ), but at their core they are: 1) recombination and mutation, how to produce the excellent solution; 2) selection, how to choose the best individuals between the parents and offspring. Thus, many algorithmic components have been designed for different tasks. The performance of algorithms varies towards various tasks, as different optimization strategies may be required given diverse landscapes. Current methods manually adjust genetic operators' hyperparameters and design the combination between them (Kerschke et al., 2019; Tian et al., 2020) to map the random population to the optimal solution. We require an expert to design or choose the evolutionary operations when given a new black-box optimization task to maximize its performance on the target task, which negatively impacts generalization ability. 
Most notably, the limited use of target function information in EA design due to expert knowledge limitations makes it difficult to adapt to the target task. The suggested BOptformer uses a Transformer framework instead of the manually designed crossover, mutation, and selection operators. The genetic operator is then designed automatically by the built Transformer rather than by a human designer. BOptformer efficiently uses the target black-box function's information to aid in developing the optimization strategy. In comparison to the human-designed EA, BOptformer has a substantially greater degree of task fit.

3.1. PROBLEM DEFINITION

A black-box optimization problem can be formulated as a minimization problem, as shown in Equation (1), and constraints may exist for the corresponding solutions:

$$\min_{x} f(x), \quad \text{s.t. } x_i \in [l_i, u_i], \tag{1}$$

where $x = (x_1, x_2, \cdots, x_d)$ represents a solution of the optimization problem $f$, $l = (l_1, l_2, \cdots, l_d)$ and $u = (u_1, u_2, \cdots, u_d)$ are the lower and upper bounds, and $d$ is the dimension of $x$. Suppose the $n$ individuals of one population are $X_1 = (X_{1,1}, X_{1,2}, \cdots, X_{1,d})$, $X_2 = (X_{2,1}, X_{2,2}, \cdots, X_{2,d})$, $\cdots$, $X_n = (X_{n,1}, X_{n,2}, \cdots, X_{n,d})$; then BOptformer is required to find a population near the optimal solution $x$. We suppose that $X^0$ is the initial population and $X^t$ is the output population.

(Zhang et al., 2021). $W^c_i$ is a diagonal matrix. If $W^c_i$ is full of zeros, the $i$th individual makes no contribution. Suppose a population $X$ is arranged in non-descending order of fitness, and let $F \in \mathbb{R}^{n \times 1}$ be the fitness matrix of $X$. Then, this module can be represented as follows:

$$X^c = \mathrm{SAC}(X, F),$$

where $X^c$ is the output population of the proposed SAC module. Since the object processed by BOptformer is the population, and the order of individuals in the population does not affect the population distribution, SA does not require position coding. Standard SA projects the input sequence $X$ into a $d$-dimensional space via the queries ($Q$), keys ($K$), and values ($V$). These three mappings enable the SA module to better capture the characteristics of the problems encountered during training. In other words, they strengthen the ability of SA to focus on specific problems but do not necessarily give SA good transferability between different problems. Therefore, we remove these three mappings for enhanced transferability, which gives $X^c = AX$, where $A \in \mathbb{R}^{n \times n}$ is a self-attention matrix that can be learned to maximize inter-individual information interaction based on individual ranking information.
This is why the population needs to be sorted in non-descending order. However, designing crossover operations based solely on population ranking information is a coarse-grained approach, because it only considers the position of individuals in the population and not the fitness relationship between them. Therefore, we further introduce fitness information to assist in learning crossover operators:

$$A^F = \mathrm{SA}(F) = \mathrm{Softmax}\left(\frac{F W^Q (F W^K)^\top}{\sqrt{d_k}}\right).$$

Thus, $X^c = AX + A^F X$. To better balance the roles of $A$ and $A^F$, we introduce two learnable weights $W^c_1 \in \mathbb{R}^{n \times 1}$ and $W^c_2 \in \mathbb{R}^{n \times 1}$. The final crossover operation is therefore

$$X^c = \mathrm{tile}(W^c_1) \odot (AX) + \mathrm{tile}(W^c_2) \odot (A^F X),$$

where $X^c \in \mathbb{R}^{n \times d}$ is the population obtained from $X$ through the SAC module, $\odot$ represents the Hadamard product, and the tile copy function extends a vector to a matrix.
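The crossover operation above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `sac`, the shapes chosen for $W^Q$ and $W^K$ (here $1 \times d_k$, since $F$ has a single column), and the use of broadcasting in place of tile are all assumptions made for this example.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sac(X, F, A, W_Q, W_K, w_c1, w_c2):
    """SA-based crossover sketch.

    X: (n, d) population sorted by fitness, F: (n, 1) fitness column,
    A: (n, n) learnable rank-based attention matrix,
    W_Q, W_K: (1, d_k) projections of the fitness (assumed shapes),
    w_c1, w_c2: (n, 1) learnable balancing weights.
    """
    d_k = W_Q.shape[1]
    A_F = softmax((F @ W_Q) @ (F @ W_K).T / np.sqrt(d_k))  # (n, n)
    # Broadcasting an (n, 1) vector against (n, d) plays the role of tile(.).
    return w_c1 * (A @ X) + w_c2 * (A_F @ X)

rng = np.random.default_rng(0)
n, d, d_k = 6, 3, 4
X = rng.normal(size=(n, d))
F = np.sum(X ** 2, axis=1, keepdims=True)
X_c = sac(X, F, rng.normal(size=(n, n)),
          rng.normal(size=(1, d_k)), rng.normal(size=(1, d_k)),
          rng.uniform(size=(n, 1)), rng.uniform(size=(n, 1)))
```

With $A$ set to the identity, `w_c1` to ones, and `w_c2` to zeros, the module returns its input unchanged, which is a convenient sanity check.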

3.3. FFN-BASED MUTATION MODULE

The mutation operator brings random changes into the population. Specifically, an individual $X_i$ in the population goes through the mutation operator to form the new individual $\tilde{X}_i$, formulated as $\tilde{X}_i = X_i W^m_i$, where $W^m_i$ is a diagonal matrix. In Transformer, each patch embedding undergoes a directional feature transformation through the FFN module. We take one linear layer as an example: $X = XW^F$, where $W^F$ is the weight of the linear layer, which is applied to each embedding separately and identically. This equation has the same form as the mutation operator, which inspires us to design a learnable mutation module FM based on an FFN with the ReLU activation function:

$$X^m = \mathrm{FM}(X^c) = \mathrm{ReLU}(X^c W^F_1 + b_1) W^F_2 + b_2, \tag{4}$$

where $X^m$ is the population after the mutation of $X^c$. $W^F_1$ and $W^F_2$ are the weights of the first and second layers of the FFN, and $b_1$ and $b_2$ are the corresponding biases.
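Equation (4) is a standard two-layer feed-forward network applied to each individual separately. The following NumPy sketch illustrates it; the function name `fm` and the hidden width of 8 are assumptions made for this example.

```python
import numpy as np

def fm(X_c, W_F1, b_1, W_F2, b_2):
    """FFN-based mutation sketch: ReLU(X_c W_F1 + b_1) W_F2 + b_2.

    X_c: (n, d) population after crossover,
    W_F1: (d, h), b_1: (h,), W_F2: (h, d), b_2: (d,).
    """
    hidden = np.maximum(X_c @ W_F1 + b_1, 0.0)  # ReLU
    return hidden @ W_F2 + b_2

rng = np.random.default_rng(0)
n, d, h = 5, 3, 8  # h is an assumed hidden width
X_c = rng.normal(size=(n, d))
W_F1, b_1 = rng.normal(size=(d, h)), np.zeros(h)
W_F2, b_2 = rng.normal(size=(h, d)), np.zeros(d)
X_m = fm(X_c, W_F1, b_1, W_F2, b_2)
```

Because the same weights are applied to every row, mutating one individual alone gives the same result as mutating it as part of the population.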

3.4. SELECTION MODULE

The residual connection in the Transformer can be viewed as analogous to the selection operation in EAs (Zhang et al., 2021). We combine the residual structure and the selection module (SM) (Anonymous, 2023) to design a learnable selection module RSSM. The RSSM generates the offspring population according to the following equation:

$$X = \mathrm{RSSM}(X, X^c, X^m) = \mathrm{Sort}\big(\mathrm{SM}(X,\ \mathrm{tile}(W^s_1) \odot X + \mathrm{tile}(W^s_2) \odot X^c + \mathrm{tile}(W^s_3) \odot X^m)\big), \tag{5}$$

where the $X$ on the left-hand side is the fittest population for the next generation, and the learnable weights $W^s_1 \in \mathbb{R}^{n \times 1}$, $W^s_2 \in \mathbb{R}^{n \times 1}$, and $W^s_3 \in \mathbb{R}^{n \times 1}$ are the weights for $X$, $X^c$, and $X^m$, respectively. $\mathrm{Sort}(X)$ sorts $X$ in non-descending order of fitness. We use quicksort to sort the population. These three learnable weights realize a weighted summation of residual connections, thereby simulating a learnable selection strategy. Meanwhile, the residual structure also enhances the model's representation ability, enabling BOptformer to form a deep architecture.

SM updates individuals based on a pairwise fitness comparison between the offspring and the input population. Suppose that $X$ and $X'$ are the input populations of SM. We compare the quality of individuals from $X$ and $X'$ pairwise based on fitness. A binary mask indicating the selected individuals is obtained from the indicator function $\mathbb{1}_{x>0}(x)$, where $\mathbb{1}_{x>0}(x) = 1$ if $x > 0$ and $\mathbb{1}_{x>0}(x) = 0$ otherwise. SM forms a new population $X$ by employing Equation (6):

$$X = \mathrm{tile}\big(\mathbb{1}_{x>0}(M_{F'} - M_F)\big) \odot X + \mathrm{tile}\big(1 - \mathbb{1}_{x>0}(M_{F'} - M_F)\big) \odot X', \tag{6}$$

where the tile copy function extends the indicator vector to a matrix, and $M_F$ ($M_{F'}$) denotes the fitness matrix of $X$ ($X'$).
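Equations (5) and (6) can be sketched in NumPy as follows. The names `sm` and `rssm` and the objective argument `f` (a function mapping an $(n, d)$ population to $n$ fitness values, with lower being better) are assumptions made for this illustration; `np.argsort` stands in for the sorting step.

```python
import numpy as np

def sm(X, X_prime, f):
    """Pairwise selection (Equation 6): keep X_i where it is strictly fitter."""
    keep = (f(X_prime) - f(X) > 0).astype(float)[:, None]  # (n, 1) indicator
    return keep * X + (1.0 - keep) * X_prime

def rssm(X, X_c, X_m, w_s1, w_s2, w_s3, f):
    """Residual + selection (Equation 5) with (n, 1) learnable weights."""
    candidate = w_s1 * X + w_s2 * X_c + w_s3 * X_m  # weighted residual sum
    survivors = sm(X, candidate, f)
    return survivors[np.argsort(f(survivors))]  # non-descending fitness

def sphere(X):
    return np.sum(X ** 2, axis=1)

rng = np.random.default_rng(0)
n, d = 6, 3
X, X_c, X_m = (rng.normal(size=(n, d)) for _ in range(3))
w = rng.uniform(size=(3, n, 1))
X_next = rssm(X, X_c, X_m, w[0], w[1], w[2], sphere)
```

Since each slot keeps the better of the parent and the candidate, the mean fitness of `X_next` can never be worse than that of `X`, whatever the weights are.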

3.5. STRUCTURE OF BOPTFORMER

BOptformer comprises $t$ basic BOptformer blocks (OBs), whose parameters may or may not be shared among the $t$ OBs. The overall architecture of BOptformer and OB is shown in Figure 1. Each OB consists of SAC, FM, and RSSM. $X^0 \in \mathbb{R}^{n \times d}$ represents the initial population input into BOptformer, which needs to be sorted in non-descending order of fitness. In Equation (7), $X^{i-1}$ is fed into the $i$th OB to obtain $X^i$, where $i \in [1, t]$. BOptformer realizes the mapping from the random initial population to the target population by stacking $t$ OBs:

$$\begin{aligned} X^i &= \mathrm{OB}(X^{i-1}): \\ X^c &= \mathrm{SAC}(X^{i-1}, F); \\ X^m &= \mathrm{FM}(X^c); \\ X^i &= \mathrm{RSSM}(X^{i-1}, X^c, X^m). \end{aligned} \tag{7}$$
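The stacking in Equation (7) can be illustrated with a compact, untrained NumPy sketch. Everything below is an assumption made for illustration (the parameter dictionary layout, the initialization scale, the hidden width, and the sphere objective); it only shows how blocks are chained and how weight sharing differs from separate weights.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_block(rng, n, d, hidden=16, d_k=4, scale=0.1):
    # Random (untrained) parameters of one OB; shapes are illustrative.
    return {
        "A": softmax(rng.normal(size=(n, n))),
        "W_Q": scale * rng.normal(size=(1, d_k)),
        "W_K": scale * rng.normal(size=(1, d_k)),
        "w_c": rng.uniform(size=(2, n, 1)),
        "W_F1": scale * rng.normal(size=(d, hidden)), "b_1": np.zeros(hidden),
        "W_F2": scale * rng.normal(size=(hidden, d)), "b_2": np.zeros(d),
        "w_s": rng.uniform(size=(3, n, 1)),
    }

def ob(X, f, p):
    F = f(X)[:, None]
    scores = (F @ p["W_Q"]) @ (F @ p["W_K"]).T / np.sqrt(p["W_Q"].shape[1])
    X_c = p["w_c"][0] * (p["A"] @ X) + p["w_c"][1] * (softmax(scores) @ X)      # SAC
    X_m = np.maximum(X_c @ p["W_F1"] + p["b_1"], 0.0) @ p["W_F2"] + p["b_2"]   # FM
    cand = p["w_s"][0] * X + p["w_s"][1] * X_c + p["w_s"][2] * X_m              # RSSM
    keep = (f(cand) - f(X) > 0)[:, None]
    X_new = np.where(keep, X, cand)
    return X_new[np.argsort(f(X_new))]

def boptformer(X0, f, blocks):
    X = X0[np.argsort(f(X0))]  # the input population must be sorted
    for p in blocks:           # one OB per simulated generation
        X = ob(X, f, p)
    return X

def sphere(X):
    return np.sum(X ** 2, axis=1)

rng = np.random.default_rng(0)
n, d, t = 8, 3, 5
X0 = rng.uniform(-5, 5, size=(n, d))
shared = [init_block(rng, n, d)] * t                  # t OBs with WS
unshared = [init_block(rng, n, d) for _ in range(t)]  # t OBs without WS
X_ws = boptformer(X0, sphere, shared)
X_nows = boptformer(X0, sphere, unshared)
```

Passing the same parameter dictionary $t$ times corresponds to weight sharing, while a list of $t$ independent dictionaries corresponds to the variant without weight sharing.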

3.6. TRAINING OF BOPTFORMER

Training Dataset Before introducing the details of the training dataset, we define fidelity (Kandasamy et al., 2016) as follows. Suppose the differentiable surrogate functions $f_1, f_2, \cdots, f_m$ are the continuous exact approximations of the black-box function $f$. We call these approximations fidelities, which satisfy the following conditions: 1) $f_1, \cdots, f_i, \cdots, f_m$ approximate $f$, with $\|f - f_i\|_\infty \le \zeta_i$, where the fidelity bounds satisfy $\zeta_1 > \zeta_2 > \cdots > \zeta_m$. 2) Evaluating the approximation $f_i$ is cheaper than evaluating $f$. Suppose the query cost at fidelity $i$ is $\lambda_i$, with $\lambda_1 < \lambda_2 < \cdots < \lambda_m$.

Training data is a crucial factor beyond the objective functions. This paper establishes the training set by constructing a set of differentiable functions related to the optimization objective. The training dataset only contains pairs $(X^0, f_i(x|\omega))$, the initial population and the objective function, respectively. Varying $\omega$ shifts the landscape. The training dataset is designed as follows: 1) randomly initialize the input population $X^0$; 2) randomly produce a shifted objective function $f_i(x|\omega)$ by adjusting the parameter $\omega$; 3) evaluate $X^0$ by $f_i(x|\omega)$; 4) repeat Steps 1)-3) to generate the corresponding dataset. We show the designed training and testing datasets as follows:

$$F^{train} = \{f_1(x|\omega^{train}_{1,i}), \cdots, f_m(x|\omega^{train}_{m,i})\},$$

where $\omega^{train}_{m,i}$ represents the $i$th value of $\omega$ in the $m$th function $f_m$.

Loss Function BOptformer attempts to search for high-quality individuals based on the available information. The loss function tells how to adjust the parameters of BOptformer so that it generates individuals closer to the optimal solution, by maximizing the difference between the initial population and the output population of BOptformer. The following loss function is employed (Anonymous, 2023):

$$l_i(X^0, f_i(x|\omega)) = \frac{\frac{1}{|X^0|}\sum_{x \in X^0} f_i(x|\omega) - \frac{1}{|E_\theta(X^0)|}\sum_{x \in E_\theta(X^0)} f_i(x|\omega)}{\frac{1}{|X^0|}\sum_{x \in X^0} f_i(x|\omega)}, \tag{9}$$

where $\theta$ denotes the parameters of BOptformer ($E$). Equation (9) calculates the average fitness difference between the input and output, further normalized within $[0, 1]$. To encourage BOptformer to explore the fitness landscape, an additional term, for example the constructed Bayesian posterior distribution over the global optimum (Cao & Shen, 2020), can be added to Equation (9). Since the derivatives of the functions in the training dataset are available, we can obtain the gradient information of Equation (9) for the training process. We can also employ REINFORCE (Williams, 1992) to approximate these derivatives.

Training BOptformer We then train BOptformer under a supervised mode. Since the gradient is unnecessary during the test process, BOptformer can solve black-box optimization problems. To prepare BOptformer to learn a balanced performance across different optimization problems, we design a loss function formulated as follows:

$$l_\Omega = -\frac{1}{K}\sum_{X^0 \in \Omega} l_i(X^0, f_i(x|\omega^{train}_i)).$$

We employ the Adam (Kingma & Ba, 2014) method with a minibatch $\Omega$ to train BOptformer on the constructed training dataset.
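The two losses above can be evaluated with a few lines of NumPy. This sketch only computes the loss values; actual training would need an automatic differentiation framework to backpropagate through BOptformer. The names `relative_improvement` and `minibatch_loss`, and the stand-in `model` (any callable mapping a population to a population), are assumptions made for this example. The value lies in $[0, 1]$ only when the mean input fitness is positive and the output is no worse than the input.

```python
import numpy as np

def relative_improvement(f, X0, X_out):
    """Equation (9): normalized drop in mean fitness from X0 to X_out."""
    base = np.mean(f(X0))
    return (base - np.mean(f(X_out))) / base

def minibatch_loss(f, batch, model):
    """Minibatch loss l_Omega: negative mean improvement over K populations."""
    return -np.mean([relative_improvement(f, X0, model(X0)) for X0 in batch])

def sphere(X):
    return np.sum(X ** 2, axis=1)

# A stand-in "model" that halves every coordinate (not a trained BOptformer).
halve = lambda X: X / 2.0

batch = [np.full((4, 3), 2.0), np.full((4, 3), 6.0)]
loss = minibatch_loss(sphere, batch, halve)
```

For the sphere function, halving every coordinate divides the fitness by four, so each population improves by 0.75 and the minibatch loss is -0.75.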

Detailed Training Process

The goal of the training algorithm is to search for the parameters $\theta^*$ of BOptformer. Before training starts, BOptformer is randomly initialized to obtain the initial parameters $\theta$. The algorithm then performs the following three steps in a loop until the training termination condition is satisfied: Step 1, randomly initialize a minibatch $\Omega$ comprising $K$ populations $X^0$; Step 2, for each $f_i \in F^{train}$, given the training data $(X^0, f_i)$, update $\theta$ by minimizing $l_\Omega$; Step 3, given $X^0$, update $\theta$ by minimizing $-\frac{1}{m}\sum_i l_\Omega$, where $m$ is the number of functions in $F^{train}$. After completing the training process, the algorithm outputs $\theta^*$.

Here, BOptformer is trained on $F^{train}$, which is generated from the functions in Table 6, and the target functions are shown in Table 7 (Appendix). Here, $d \in \{10, 100\}$.
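Step 1 of the loop, together with the dataset construction steps from the Training Dataset paragraph, can be sketched as follows. The shifted sphere used here is only a stand-in for a training function $f_i(x|\omega)$, since the training functions themselves are listed in Table 6; the helper names and the sampling ranges are likewise assumptions made for this example.

```python
import numpy as np

def make_shifted_sphere(b):
    # Stand-in for f_i(x | omega): the shift b plays the role of omega.
    def f(X):
        Z = X - b
        return np.sum(Z ** 2, axis=1)
    return f

def sample_training_batch(rng, K, n, d, x_range, b_range):
    """Draw one shifted function and a minibatch of K evaluated populations."""
    b = rng.uniform(*b_range, size=d)        # 2) shift the landscape
    f = make_shifted_sphere(b)
    batch = []
    for _ in range(K):
        X0 = rng.uniform(*x_range, size=(n, d))  # 1) random initial population
        F0 = f(X0)                               # 3) evaluate it
        order = np.argsort(F0)                   # sorted input, as required
        batch.append((X0[order], F0[order]))
    return f, b, batch

rng = np.random.default_rng(0)
f, b, batch = sample_training_batch(rng, K=4, n=10, d=5,
                                    x_range=(-100, 100), b_range=(-50, 50))
```

Calling `sample_training_batch` once per epoch would regenerate both the shift and the initial populations, in line with the statement that the bias of the function is regenerated each epoch.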

4. EXPERIMENTS

Protein Docking We also handle the problem of Ab initio protein docking (Cao & Shen, 2020) , which optimizes a noisy and costly function in a high-dimensional conformational space. Mathematically, this problem is formulated as optimizing the Gibbs binding free energy f (x) for conformation x. We calculate the energy function in a CHARMM 19 force field as in (Moal & Bates, 2010) and shift it so that f (x) = 0 at the origin of the search space. f (x) is differentiable when we parameterize the search space as R 12 (Smith & Sternberg, 2002) . Here, only 100 interface atoms are considered. The details of this problem can be found in Appendix A.8.

Planar Mechanic Arm

We further evaluate the performance of the proposed scheme on the planar mechanic arm problem, which has been widely used to evaluate black-box optimization algorithms (Cully et al., 2015; Vassiliades et al., 2018; Vassiliades & Mouret, 2018; Mouret & Maguire, 2020). The optimization goal of this problem is to minimize the distance from the top of the mechanic arm to the target position by optimizing a set of lengths and angles. The detailed problem can be found in Appendix A.5. $r$ represents the distance from the target point to the origin of the mechanic arm, as shown in Fig. 4 (Appendix).

Protein Docking We also test the performance of BOptformer on the problem of protein docking. The experimental results are shown in Table 2. The performance of BOptformer exceeds that of L2O-swarm. During training, L2O-swarm does not converge. At the same time, we find that the initial population contains better solutions than those found by L2O-swarm; however, during testing, L2O-swarm loses these good solutions.

Planar Mechanic Arm The detailed experimental results are given in Table 3. As an example, BOptformer uses 5 OBs without weight sharing (WS), which evolves the population for only five generations. Untrained denotes the untrained BOptformer. DE, ES, and CMA-ES are tested with the maximum number of generations set to 100, so the EA baselines use 100/5 = 20 times as many function evaluations as BOptformer. Even in this unfair situation, BOptformer achieves the best results. We have observed that BOptformer can achieve better results with deeper architectures. However, it is currently difficult for us to train a deep BOptformer. As far as we know, the use of ES to optimize deep models has been studied extensively (Vicol et al., 2021), which makes it an important direction for future research. We also find an interesting phenomenon: 5 OBs without WS outperforms 3 OBs with WS in all cases. Our untrained deep architecture, 5 OBs without WS, can achieve good results on simple cases, which shows that BOptformer retains the advantages of the Transformer architecture and has strong generalization ability. We use the untrained 5 OBs with WS to test on the complex planar mechanic arm problem and find that it performs poorly.

We train BOptformer on the F1-F3 function set with different learning rates (lr) and then test it on the F4-F9 function set. The experimental results are shown in Table 9 (Appendix A.6). For 5 OBs without WS, lr = 0.01 achieves the best performance among the tested values. For 30 OBs with WS and 3 OBs with WS, lr = 0.0001 is a good choice.

4.4. ABLATION STUDY

This section considers the performance impact of the different parts of BOptformer. We take BOptformer with 3 OBs and weight sharing as an example, which is trained on F1-F3 and tested on F4-F9. We remove SAC, FM, RSSM, and the residual connection (RC) from BOptformer in turn, and denote the resulting models as Not SAC, Not FM, Not RSSM, and Not RC. The experimental results are shown in Table 5. Ranked from best to worst, the results are BOptformer > Not FM > Not RC ≈ Not SAC ≈ Not RSSM. The role of FM is slightly weaker than that of the other three modules. Taken as a whole, SAC, RSSM, and RC are of equal importance. The absence of any of these core components can seriously affect the performance of BOptformer, which also shows the effectiveness of the four proposed modules. Similarly, removing any one of the crossover, mutation, and selection operators degrades the performance of EAs. This shows that BOptformer implements a learnable EA framework that does not require human-designed parameters.

4.5. VISUALIZATION ANALYSIS

The tested model is 5 OBs with WS trained on F1-F3 with d = 100. The population size is 100. Visual Analysis of SAC The crossover strategies learned by the five SAC are shown in Fig. 2 . For the presentation, we select individuals with fitness rankings 1st, 50th, and 100th. The horizontal axis represents the fitness ranking of individuals, and the vertical axis represents the attention (weight when performing crossover) on these individuals. OB1 tends to crossover with lower-ranked individuals, showing a preference for exploration. From OB1 to OB5, the bias of SAC gradually changes from exploration to exploitation. Visual Analysis of FM We test 5 OBs with WS on F4 with d = 2. The mutation strategies learned by the five OBs are shown in Fig. 3 . Input and output represent the input and output populations of the FM module, respectively. OB1 tends to explore a broad solution space, and the next 4 OBs gradually shift from searching the vast space to searching the space near the input population. The strategies learned by FM and SAC modules show a common feature: the preference for generating solutions gradually shifts from exploration to exploitation as the population converges. 

5. CONCLUSIONS

We successfully designed a Transformer-based L2O framework for black-box optimization, which does not need hand-designed operators. Its better performance than that of the EA baselines, Bayesian optimization, and the L2O method demonstrates the effectiveness of BOptformer. Moreover, BOptformer adapts well to unseen black-box optimization problems. Meanwhile, we experimentally demonstrate that the proposed three modules have positive effects. BOptformer still has room for improvement. 1) Our scheme is not limited to black-box optimization. Similar to the LSTM architecture, our scheme can directly optimize differentiable functions. However, the architecture of BOptformer does not directly involve the gradient information of the optimization target, which makes BOptformer inferior to existing L2O schemes. In future work, we will design a new module that embeds the gradient information of the optimization target. 2) In the loss function, we did not effectively consider the diversity of the population, and the population can be regularized in the future. 3) The training set seriously affects the performance of BOptformer. If the similarity between the training set and the optimization objective is low, the performance of BOptformer degrades drastically. Building the dataset to be as relevant to the target as possible is essential.

Selection We introduce the binary tournament mating selection operator in Equation (13). The selection operator keeps individuals of higher quality for the next generation until the required number of individuals has been chosen.

$$p_i = \begin{cases} 1, & f(X_i) < f(X_k) \\ 0, & f(X_i) > f(X_k) \end{cases}, \quad (X_i, X_k) \in X, \tag{13}$$

where $p_i$ reflects the probability that $X_i$ is selected for the next generation, and $(X_i, X_k)$ in Equation (13) are randomly selected from the population $X \cup \tilde{X}^m$.

A.3 SYNTHETIC FUNCTIONS

F4: $\sum_{i} z_i^2$, $z_i = x_i - b_i$; $x \in [-100, 100]$, $b \in [-50, 50]$
F5: $\max\{|z_i|, 1 \le i \le D\}$, $z_i = x_i - b_i$; $x \in [-100, 100]$, $b \in [-50, 50]$
F6 (Rosenbrock): $\sum_{i=1}^{D-1} \big(100(z_i^2 - z_{i+1})^2 + (z_i - 1)^2\big)$, $z_i = x_i - b_i$; $x \in [-100, 100]$, $b \in [-50, 50]$
F7 (Rastrigin): $\sum_{i=1}^{D} \big(z_i^2 - 10\cos(2\pi z_i) + 10\big)$, $z_i = x_i - b_i$; $x \in [-5, 5]$, $b \in [-2.5, 2.5]$
F8 (Griewank): $\sum_{i=1}^{D} \frac{z_i^2}{4000} - \prod_{i=1}^{D} \cos\left(\frac{z_i}{\sqrt{i}}\right) + 1$, $z_i = x_i - b_i$; $x \in [-600, 600]$, $b \in [-300, 300]$
F9 (Ackley): $-20\exp\left(-0.2\sqrt{\frac{1}{D}\sum_{i=1}^{D} z_i^2}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D} \cos(2\pi z_i)\right) + 20 + \exp(1)$, $z_i = x_i - b_i$; $x \in [-32, 32]$, $b \in [-16, 16]$

A.4 PARAMETERS

BOptformer. For example, 30 OBs with WS contains 30 OBs, and each OB consists of 1 SAC, 1 FM, and 1 RSSM. In 30 OBs with WS, these 30 OBs share parameters. 5 OBs without WS has 5 OBs, and no parameters are shared among them. During the training process, BOptformer is iterated for 1000 epochs. The initial learning rate (lr) was set to 0.01, with lr = lr × 0.9 every 100 cycles. The 2-norm of the gradient is clipped so that it is not larger than 10. The bias of the function is regenerated each epoch, and a new batch of random initial populations is generated.

Baselines. The number of generations of the reference algorithms is set to 100. The population size of ES, DE, and CMA-ES is set to 100. For all cases, we choose the optimal hyperparameters. To ensure validity, all experimental results are averaged over 10 runs. All experiments were performed on a Ubuntu20.04 PC with Intel(R) Core I7 (TM) I3-8100 CPU at 3.60GHz and NVIDIA GeForce GTX 1060.
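For reference, the four named test functions of A.3 (Rosenbrock, Rastrigin, Griewank, and Ackley) can be written in NumPy as below. This is a sketch that assumes the standard textbook forms of these functions (in particular the square root in Ackley and the product in Griewank) with the shift $z = x - b$; the function signatures are assumptions made for this example.

```python
import numpy as np

def rosenbrock(x, b):
    z = x - b
    return np.sum(100.0 * (z[:-1] ** 2 - z[1:]) ** 2 + (z[:-1] - 1.0) ** 2)

def rastrigin(x, b):
    z = x - b
    return np.sum(z ** 2 - 10.0 * np.cos(2.0 * np.pi * z) + 10.0)

def griewank(x, b):
    z = x - b
    i = np.arange(1, z.size + 1)
    return np.sum(z ** 2) / 4000.0 - np.prod(np.cos(z / np.sqrt(i))) + 1.0

def ackley(x, b):
    z = x - b
    D = z.size
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.sum(z ** 2) / D))
            - np.exp(np.sum(np.cos(2.0 * np.pi * z)) / D) + 20.0 + np.e)

rng = np.random.default_rng(0)
D = 10
b = rng.uniform(-2.5, 2.5, size=D)  # the shift is redrawn to move the optimum
x = rng.uniform(-5.0, 5.0, size=D)
values = [fn(x, b) for fn in (rosenbrock, rastrigin, griewank, ackley)]
```

Rastrigin, Griewank, and Ackley attain their minimum of zero at $x = b$, while Rosenbrock attains it at $x = b + 1$, which gives a quick way to check an implementation.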

A.5 PLANAR MECHANIC ARM PROBLEM

The optimization goal of this problem is to search for a set of lengths $L = (L_1, L_2, \cdots, L_n)$ and a set of angles $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_n)$ so that the distance $f(L, \alpha, p)$ from the top of the mechanic arm to the target position $p$ is the smallest, where $n$ represents the number of segments of the mechanic arm, and $L_i \in (l_i, u_i)$ and $\alpha_i \in (-\pi, \pi)$ represent the length and angle of the $i$th segment, respectively. Typically, $f$ is calculated as follows:

$$f(L, \alpha, p) = \sqrt{\left(\sum_{i=1}^{n} \cos(\alpha_i) L_i - p_x\right)^2 + \left(\sum_{i=1}^{n} \sin(\alpha_i) L_i - p_y\right)^2}, \tag{14}$$

where $p_x$ and $p_y$ represent the x-coordinate and y-coordinate of the target point, respectively. Here, $n = 100$, $l_i = 0$, and $u_i = 10$. We design two groups of experiments. 1) Simple case. We fix the length of each segment as $L_i = 10$ and only search for the optimal $\alpha$. 2) Complex case. We search for $L$ and $\alpha$ simultaneously. We randomly select 600 target points within the range $r \le 1000$ to form a set $S$, where $r$ represents the distance from the target point to the origin of the mechanic arm, as shown in Fig. 4. During the training process of BOptformer, a sample point set $s$ is re-extracted from $S$ every $T$ training cycles. In the testing process, we extract 128 target points ($S^{test}$) in each of the ranges $r \le 100$, $r \le 300$, and $r \le 1000$. The purpose of testing in three different regions is to further explore the generalization performance of BOptformer. We evaluate the generalization ability of the algorithm by

5 OBs without WS and 30 OBs with WS perform poorly when the learning rate is 0.1, possibly because such a large learning rate hinders the convergence of BOptformer during training. For 5 OBs without WS, a learning rate of 0.01 achieves the best performance among the tested values. For 30 OBs with WS and 3 OBs with WS, a learning rate of 0.0001 is a good choice. However, our experiments are coarse-grained. The learning rate has a large impact on BOptformer, so using AutoML to search for the optimal hyperparameter combination of the model is expected to achieve better performance.
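The arm objective of Equation (14) in Appendix A.5 can be sketched in NumPy as follows. This is a minimal illustration that assumes $f$ is the Euclidean distance between the arm tip and the target, and that each angle $\alpha_i$ enters exactly as written in Equation (14), without accumulating the angles of earlier segments; the function name `arm_distance` is an assumption made for this example.

```python
import numpy as np

def arm_distance(L, alpha, p):
    """Distance from the arm tip to the target p = (p_x, p_y)."""
    tip_x = np.sum(np.cos(alpha) * L)
    tip_y = np.sum(np.sin(alpha) * L)
    return np.sqrt((tip_x - p[0]) ** 2 + (tip_y - p[1]) ** 2)

# Simple case from A.5: fixed lengths L_i = 10, search only over the angles.
n = 100
rng = np.random.default_rng(0)
L = np.full(n, 10.0)
alpha = rng.uniform(-np.pi, np.pi, size=n)
target = np.array([300.0, 0.0])
fitness = arm_distance(L, alpha, target)
```

In the complex case, the decision vector would hold both `L` and `alpha`, so the search space has $2n$ dimensions instead of $n$.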

A.7 CONVERGENCE OF BOPTFORMER

We plot the convergence curves of 30 OBs with WS, ES, DE, and CMA-ES on F7. BOptformer converges quickly and obtains better solutions: it needs only ten iterations to reach a better solution than the EA baselines. ES and DE converge at around 100 generations, and CMA-ES shows a slow convergence rate.



Figure 1: Overall architecture of BOptformer and OB. $N\times$ indicates that BOptformer is composed of $N$ stacked OBs. These OBs can either share weights with each other or use separate weights.

Synthetic Functions This paper first employs nine commonly used functions to show the effectiveness of the proposed BOptformer. The characteristics of these nine functions are shown in Tables 6 and 7 (Appendix).

Figure 2: Crossover Strategy learned by BOptformer.

Figure 3: Mutation strategy learned by BOptformer.

Figure 4: Planar Mechanical Arm.

DE, ES, and CMA-ES are tested when the maximum number of generations Maxgen is set to 10, 50, and 100, respectively. We find that BOptformer outperforms all baselines.

The compared results on six functions.

Synthetic Functions The results on six functions are provided in Table 1. BOptformer outperforms the three EA baselines, Dragonfly, and L2O-swarm in all cases except one: it loses to Dragonfly on F6 with d = 100. These cases also show the excellent generalization ability of BOptformer on tasks unseen during the training stage. We think the transferability of BOptformer is proportional to the fitness landscape similarity between the training set and the problem. Even when the problem has attributes that are not present in the training set, BOptformer can still perform better. However, this conclusion only holds when the similarity between the problem and the training dataset is high. We plot the convergence curves of BOptformer (10 OBs with WS), ES, DE, and CMA-ES on F7 (see Appendix A.7, Figure 5). BOptformer converges quickly and obtains better solutions: it needs only ten iterations to reach a better solution than the EA baselines. ES and DE converge at around 100 generations, and CMA-ES shows a slow convergence rate.

The results on the problem of protein docking.

The results of planar mechanical arm. Simple Case (SC): searching for different angles with the fixed lengths. Complex Case (CC): searching for different angles and lengths.

The performance of different BOptformer structures.

The results of ablation study. d = 10.

Qingfu Zhang and Hui Li. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712-731, 2007.

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11106-11115, 2021.

Training functions.

Testing functions.

The results of planar mechanical arm on searching for different angles with the fixed lengths.

Ablation study on learning rate.

A APPENDIX

A.1 VISION TRANSFORMER

We mainly introduce the core parts of Vision Transformer: the multi-head self-attention layer (MSA), feed-forward network (FFN), layer normalization (LN), and residual connection (RC).

MSA MSA fuses several SA operations to handle the queries ($Q$), keys ($K$), and values ($V$) that jointly attend to information from different representation subspaces. MSA is formulated as follows:

$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(H_1, H_2, \ldots, H_h)\, W^O,$$

where Concat means the concatenation operation. The head feature $H_i$ can be formulated as:

$$H_i = \mathrm{SA}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(K W_i^K)^{\top}}{\sqrt{d_k}}\right) V W_i^V = A\, V W_i^V,$$

where $W_i^Q \in \mathbb{R}^{d_m \times d_q}$, $W_i^K \in \mathbb{R}^{d_m \times d_k}$, and $W_i^V \in \mathbb{R}^{d_m \times d_v}$ are parameter matrices for queries, keys, and values, respectively; $W^O \in \mathbb{R}^{h d_v \times d_m}$ maps the concatenated head features $H_i$ to the output. Moreover, $d_m$ is the input dimension, while $d_q$, $d_k$, and $d_v$ are hidden dimensions of the corresponding projection subspaces; $h$ is the number of heads. $A \in \mathbb{R}^{l \times l}$ is the attention matrix of the $i$th head, and $l$ is the sequence length.

FFN FFN employs two cascaded linear transformations with a ReLU activation to handle $X$, which is shown as:

$$\mathrm{FFN}(X) = \max(0,\, X W_1 + b_1)\, W_2 + b_2,$$

where $W_1$ and $W_2$ are the weights of the two linear layers, and $b_1$ and $b_2$ are the corresponding biases.

LN LN is applied before each MSA and FFN layer. Together with the residual connection, the output of each layer is calculated by $X + [\mathrm{MSA} \mid \mathrm{FFN}](\mathrm{LN}(X))$.
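The components described in A.1 can be illustrated with a minimal standard-library Python sketch. It is not the implementation used in this paper: all function names are hypothetical, matrices are plain lists of rows, and the layer normalization omits the learned scale and shift for brevity.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(x, w_q, w_k, w_v):
    """One head: A = softmax(Q K^T / sqrt(d_k)), H = A V. Returns (H, A)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    a = softmax_rows(scores)
    return matmul(a, v), a

def msa(x, head_params, w_o):
    """Concatenate the head features along the feature axis, then project with W^O."""
    heads = [attention_head(x, *params)[0] for params in head_params]
    concat = [sum((h[r] for h in heads), []) for r in range(len(x))]
    return matmul(concat, w_o)

def ffn(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between: max(0, X W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (assumption: no learned parameters)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def pre_ln_residual(x, sublayer):
    """Pre-LN residual connection: X + sublayer(LN(X))."""
    y = sublayer(layer_norm(x))
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
```

Each row of the attention matrix returned by `attention_head` sums to one, so every output row is a convex combination of the value rows. This is the property that the analogy with crossover builds on.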

A.2 PRELIMINARY EAS

The crossover, mutation, and selection operators form the basic framework of EAs. An EA starts with a randomly generated initial population. Then, genetic operations such as crossover and mutation are carried out. After the fitness evaluation of all individuals in the population, a selection operation is performed to identify fitter individuals to undergo reproduction to generate offspring. This evolutionary process is repeated until specific predefined stopping criteria are satisfied.

Crossover The crossover operator generates a new individual $\hat{X}_i$ by Equation (11), where $cr$ is the probability of the crossover operator. This operator is commonly conducted on $n$ individuals. After an expression expansion, we re-formulate Equation (11) as $\sum_{i=1}^{n} X_i W_i^c$ (Zhang et al., 2021). If $W_i^c$ is full of zeros, the $i$th individual has no contribution.

Mutation The mutation operator brings random changes into the population. Specifically, an individual $X_i$ in the population goes through the mutation operator to form the new individual $\hat{X}_i$ by Equation (12), where $mr$ is the probability of the mutation operator. Similarly, Equation (12) can be re-formulated as $X_i W_i^m$, where $W_i^m$ is a diagonal matrix.
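The matrix re-formulations above can be checked with a small standard-library Python sketch. It rests on stated assumptions rather than on the paper's exact operators: the crossover picks one donor parent uniformly at random for each gene, the mutation rescales selected genes by a given factor, and all function names are hypothetical.

```python
import random

def diag(values):
    """Build a diagonal matrix (list of rows) from a list of diagonal entries."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def vec_mat(x, w):
    """Row vector times matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def crossover_selection_matrices(n, d, rng):
    """Pick one donor parent per gene (assumed uniform choice).

    W_i^c has a 1 at position (k, k) exactly when parent i donates gene k,
    so the n matrices sum to the identity.
    """
    donors = [rng.randrange(n) for _ in range(d)]
    mats = [diag([1.0 if donors[k] == i else 0.0 for k in range(d)]) for i in range(n)]
    return donors, mats

def crossover_matrix_form(parents, mats):
    """Child = sum_i X_i W_i^c."""
    child = [0.0] * len(parents[0])
    for x, w in zip(parents, mats):
        child = [c + v for c, v in zip(child, vec_mat(x, w))]
    return child

def mutation_matrix_form(x, scales):
    """Mutated individual = X_i W_i^m with W_i^m = diag(scales).

    A scale of 1.0 leaves a gene unchanged; any other value perturbs it.
    """
    return vec_mat(x, diag(scales))
```

With a gene-wise donor list `donors`, the child produced by `crossover_matrix_form` equals `[parents[donors[k]][k] for k in range(d)]`, and a parent whose matrix is all zeros contributes nothing to the child.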

A.8 THE DETAILS OF PROTEIN DOCKING

Protein Docking We also handle the problem of ab initio protein docking (Cao & Shen, 2020), which optimizes a noisy and costly function in a high-dimensional conformational space. Mathematically, this problem is formulated as optimizing the Gibbs binding free energy $f(x)$ for conformation $x$. We calculate the energy function in a CHARMM 19 force field as in (Moal & Bates, 2010) and shift it so that $f(x) = 0$ at the origin of the search space. $f(x)$ is differentiable when we parameterize the search space as $\mathbb{R}^{12}$ (Smith & Sternberg, 2002). Here, only 100 interface atoms are considered.

Training dataset. 25 protein-protein complexes (see Appendix A.8) from the protein docking benchmark set 4.0 (Hwang et al., 2010), each of which has 5 starting points (top-5 models from ZDOCK (Pierce et al., 2014)).

Testing dataset. Three complexes (with one starting model each) of different levels of docking difficulty are selected: 1ATN_7, 2JEL_1, and 7CEI_1.

25 Protein-protein Complexes The training dataset contains 25 protein-protein complexes from the protein docking benchmark set 4.0 (Hwang et al., 2010). The complexes are: 1ATN, 1AVX, 1AY7, 1BJ1, 1BVN, 1CGI, 1DFJ, 1EAW, 1EWY, 1EZU, 1GRN, 1IBR, 1IJK, 1IQD, 1JPS, 1KXQ, 1M10, 1MAH, 1N8O, 1PPE, 1R0R, 1XQS, 2B42, 2C0L, and 2HRK.

