DECN: EVOLUTION INSPIRED DEEP CONVOLUTION NETWORK FOR BLACK-BOX OPTIMIZATION

Abstract

We design a deep evolution convolution network (DECN) to overcome the poor generalization of evolutionary algorithms (EAs) in handling continuous black-box optimization. DECN is composed of two modules, a convolution-based reasoning module (CRM) and a selection module (SM), to move from hand-designed to learned optimization strategies. CRM produces a population closer to the optimal solution based on convolution operators, and SM removes poor solutions. We also design a proper loss function to train DECN so that it forces a random population to move near the optimal solution. Experimental results on one synthetic case and two real-world cases show the advantages of learned optimization strategies over human-designed black-box optimization baselines. DECN obtains good performance with a deep structure but encounters difficulties in training. In addition, DECN is friendly to acceleration with Graphics Processing Units (GPUs) and runs 102 times faster than an unaccelerated EA when evolving 32 populations, each containing 6400 individuals.

1. INTRODUCTION

Optimization is an old and essential research topic; many tasks in computer vision, machine learning, and natural language processing can be abstracted as optimization problems. Moreover, many of these problems are black-box, such as neural architecture search (Elsken et al., 2019) and hyperparameter optimization (Hutter et al., 2019). Various approaches have been proposed to deal with them, such as Bayesian optimization (Snoek et al., 2012) and evolutionary algorithms (EAs), including genetic algorithms (Jin et al., 2019; Khadka & Tumer, 2018; Zhang & Li, 2007; Such et al., 2017; Stanley et al., 2019) and evolution strategies (ES) (Wierstra et al., 2014; Vicol et al., 2021; Hansen & Ostermeier, 2001; Auger & Hansen, 2005; Salimans et al., 2017).

However, the generalization ability of EAs is poor. Faced with a new black-box optimization task, experts must redesign the EA's crossover, mutation, and selection operations to maximize its performance on the target task, resulting in a hand-designed EA with significant application limitations. Most importantly, due to the limits of expert knowledge, only a little information about the target function is used to assist the design of the EA, which makes it challenging to adapt to the target task. How to automatically design optimization strategies for new tasks is therefore crucial. An EA is a generative optimization model: it moves from a random population to an optimal solution by generating potential solutions and retaining good ones, so automatically designing an optimization strategy amounts to learning how to automatically generate and retain potential solutions. This paper makes a first attempt to develop a deep evolution convolution network (DECN) that learns to exploit structure in the problem of interest, so that DECN can automatically move a random population near the optimal solution for different black-box optimization tasks.
DECN uses the process of EAs to guide the design of this new learning-to-optimize architecture. Like EAs, DECN has two critical components to generate and select potential solutions: a convolution-based reasoning module (CRM) and a selection module (SM). For CRM, we must ensure the exchange of information between individuals in the population so that potential solutions can be generated. We design a lattice-like environment that organizes the population for modified convolution operators and then employ mirror padding (Goodfellow et al., 2016) to generate the potential offspring. SM needs to update the population so that the fittest solutions survive. We design SM as a pairwise comparison between the offspring and the input population regarding their fitness, implemented with a mask operator. We then build an evolution module (EM) from CRM and SM to simulate one generation of an EA, and construct DECN by stacking several EMs, which addresses the first issue of generating and selecting solutions. An untrained DECN does not handle black-box optimization well because it lacks information about the target black-box function. To optimize the target task, we need a training set containing objective-function information and a practical loss function to guide the training of DECN's parameters. The nature of black-box functions makes it difficult to obtain gradient information to assist training. To overcome this second issue, we must decide how to design a proper loss function and training dataset. We construct a set of differentiable surrogate functions of the target black-box function to capture its information; however, the optimal population is usually unknown.
The designed loss function maximizes the difference between the initial and output populations, training DECN to move toward the optimal solution; this loss can be optimized by back-propagation. We test the performance of DECN on six standard black-box functions, the protein docking problem, and the planar mechanical arm problem. Three population-based optimization baselines, Bayesian optimization (Kandasamy et al., 2020), and a learning-to-optimize method for black-box optimization (Cao et al., 2019) are employed as references. The results indicate that DECN can automatically learn efficient mappings for unconstrained continuous optimization on high-fidelity and low-fidelity training datasets. Finally, to verify that DECN is friendly to Graphics Processing Unit (GPU) acceleration, we compare the runtime of DECNs on one 1080Ti GPU with a standard EA.

2. RELATED WORK

There are many efforts that can handle black-box optimization, such as Bayesian optimization (Snoek et al., 2012) and EAs (Mitchell, 1998). Since DECN operates on populations, it has a strong relationship with EAs. Meanwhile, DECN is a new learning-to-optimize (L2O) framework. Appendix A.10 details our motivations.

EAs. EAs are inspired by the evolution of species and have provided acceptable performance for black-box optimization. There are two essential parts to an EA: 1) crossover and mutation: how to generate individuals with the potential to approach the optimal solution; 2) selection: how to discard individuals with inferior performance while maintaining those with superior performance. Over the past decades, many algorithmic components have been designed for different tasks. The performance of an algorithm varies across tasks, as different optimization strategies may be required for diverse landscapes. This paper focuses on two critical issues of EAs. 1) Poor generalization ability. Existing methods manually adjust genetic operators' hyperparameters and design their combinations (Kerschke et al., 2019; Tian et al., 2020); however, the crossover, mutation, and selection modules can only be designed manually based on expert knowledge and cannot effectively interact with the environment (the function); that is, they cannot change their elements in real time to adapt to new problems through feedback from the objective function. 2) Accelerating EAs with GPUs is challenging. Support for multiple subpopulations evolving simultaneously is of paramount significance in practical applications. Besides, many genetic operators are unfriendly to GPU acceleration, as GPUs are weak at logical operations. DECN overcomes both issues: it adapts to different optimization scenarios, based on which it automatically forms optimization strategies.

L2O.
The most related work is L2O (Chen et al., 2022). These methods employ the long short-term memory (LSTM) architecture (Chen et al., 2020; Andrychowicz et al., 2016; Chen et al., 2017; Li & Malik, 2016; Wichrowska et al., 2017; Bello et al., 2017) or a multilayer perceptron (MLP) (Metz et al., 2019) as the optimizer to achieve point-based optimization (Sun et al., 2018; Vicol et al., 2021; Flennerhag et al., 2021; Li & Malik, 2016). However, none of these methods can handle black-box optimization. The swarm-inspired meta-optimizer (Cao et al., 2019) learns in the algorithmic space of both point-based and population-based optimization algorithms, but it does not exploit the advantages of EAs and is a model-free method. Existing L2O techniques rarely focus on black-box optimization; although several efforts (Cao et al., 2019; Chen et al., 2017) address such problems, they all deal with small-scale settings. DECN is a new L2O framework that makes up for the performance disadvantage of current L2O architectures in black-box optimization, and this paper thereby makes an essential contribution to the L2O community.

3.1. PROBLEM DEFINITION

An unconstrained black-box optimization problem can be represented as a minimization problem, possibly with box constraints on the solutions: min f(s|ξ), s.t. x_i ∈ [d_i, u_i] for all x_i ∈ s, where s = (x_1, x_2, ..., x_D) is a solution of the optimization problem f, while d = (d_1, d_2, ..., d_D) and u = (u_1, u_2, ..., u_D) denote the lower and upper bounds of the solution's domain, respectively, and ξ denotes the known parameters of f. Because the objective function f is a black box without a closed-form formulation, only query-response access is available. Let the n individuals of a population S = {s_1, ..., s_n} be s_1 = (x_1^1, ..., x_D^1), s_2 = (x_1^2, ..., x_D^2), ..., s_n = (x_1^n, ..., x_D^n). This paper aims to move the initial population near the optimal solution. Let θ be the parameters (strategies) of G, where G is an abstract function denoting the optimization process, S_0 is the initial population, and S_t is the output population. The procedure of DECN can be formulated as S_t = G_θ(S_0, f(s|ξ)). Based on the optimized θ, DECN optimizes f(s|ξ) through G_θ.
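To make the query-response setting concrete, here is a minimal sketch of the problem interface. The names (`make_sphere`, `init_population`) and the shifted-sphere objective are illustrative assumptions, not part of the paper; the only contract is that the optimizer may evaluate f but never inspect its form.

```python
import random

def make_sphere(xi):
    """Toy black-box objective f(s | xi): a sphere shifted by known parameters xi.
    Only query-response access is assumed; the optimizer never reads f's form."""
    def f(s):
        return sum((x - b) ** 2 for x, b in zip(s, xi))
    return f

def init_population(n, D, lower, upper, seed=0):
    """Sample n individuals s = (x_1, ..., x_D) uniformly inside [lower, upper]^D."""
    rng = random.Random(seed)
    return [[rng.uniform(lower, upper) for _ in range(D)] for _ in range(n)]

f = make_sphere(xi=[0.5, -0.5])
S0 = init_population(n=4, D=2, lower=-1.0, upper=1.0)
fitness = [f(s) for s in S0]  # the only information the optimizer may use
```

The optimizer's job, in these terms, is a map S_0 → S_t whose individuals have lower f-values.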

3.2. CONVOLUTION-BASED REASONING MODULE

We design CRM to ensure that individuals in the population can exchange information to generate potential solutions near the optimal solution (similar to the recombination operator in EAs). A suitable modification of the convolution operator can achieve this goal, which motivates our convolution-based design. This part describes how CRM generates new solutions.

Organizing the population for convolution. We arrange all individuals in a lattice-like environment of size L × L, so the population can be represented by a tensor indexed by (i, j, d), where (i, j) locates an individual S(i, j) in the L × L lattice and d indexes the dimensions of this individual. Appendix A.2 illustrates this tensor. The individuals in the lattice are sorted in descending order of fitness to construct a population tensor with a consistent pattern (see Figure 5 in the Appendix). The input tensor has D + 1 channels, where D is the dimension of the optimization task and the fitness of individuals occupies one channel. The fitness channel does not participate in the convolution but is essential for information selection in the selection module.

Designing CRM. After organizing the population into a tensor of shape (L, L, D + 1), we modify the depthwise separable convolution (DSC) operator (Chollet, 2017) to generate new individuals by merging information across individuals, dimension by dimension. The DSC operator consists of a depthwise convolution followed by a pointwise convolution; the pointwise convolution maps the output channels of the depthwise convolution to a new channel space. For our task, we remove the pointwise convolution to avoid information interaction between channels. Eq. (2) details how offspring are reproduced, and one example is shown in Fig. 8 of the Appendix.
S'(i, j) = Σ_{k,l} w_{k,l} S(i + k, j + l),   (2)

where S'(i, j) denotes an individual in the output population, S(i, j) denotes an individual in the input population, and w_{k,l} are the parameters of the convolution kernel. Moreover, to adapt to optimization tasks with different dimensions, all channels share the same parameters. The parameters of the convolution kernels record the strategies learned by this module to reason over the available population for different tasks. Two critical issues remain. 1) Without a consistent pattern in the population, the gradient with respect to the parameters is unstable and divergent. We therefore design a fitness-sensitive convolution: CRM's attention to the available information should depend on the quality and diversity of the population, and w_{k,l} reflects the module's attention during reasoning, which is usually related to the fitness of individuals. This problem is then resolved simply by sorting the population in the lattice by fitness. 2) The other issue is the scale of the offspring. We pad the tensor before the convolution so that the offspring stay at the same scale as the input population. However, filling the population tensor with the constant '0', as is usually done in computer vision, is not proper. Instead, mirror padding copies individuals to maintain the same scale between the offspring and the input population; since recombination is an information interaction among individuals, copying individuals is better than extending tensors with a constant value. An implementation of mirror padding for the population is given in Appendix A.3. The size of the convolution kernels in CRM determines the number of individuals used in the reasoning for S'(i, j); we employ commonly used kernel sizes.
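The lattice sorting, mirror padding, and channel-shared depthwise convolution of Eq. (2) can be sketched in plain Python. This is a stand-in, not the paper's PyTorch implementation: the kernel `w` is a fixed illustrative 3×3 matrix rather than learned parameters, and the reflection index `ref` is one plausible reading of "copying border individuals".

```python
def crm_step(pop, fitness, L, w):
    """Sketch of CRM: sort into an L x L lattice, mirror-pad, and apply one
    3x3 depthwise convolution whose weights w are shared across all D channels.
    The fitness channel is excluded from the convolution, as in the paper."""
    # 1) sort individuals into the lattice in descending fitness order
    order = sorted(range(len(pop)), key=lambda i: fitness[i], reverse=True)
    grid = [[pop[order[r * L + c]] for c in range(L)] for r in range(L)]
    D = len(pop[0])
    # 2) mirror padding: out-of-lattice indices copy a border individual
    ref = lambda i: -i - 1 if i < 0 else (2 * L - 1 - i if i >= L else i)
    # 3) depthwise 3x3 convolution with channel-shared weights
    offspring = []
    for r in range(L):
        for c in range(L):
            child = [0.0] * D
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nb = grid[ref(r + dr)][ref(c + dc)]
                    for d in range(D):
                        child[d] += w[dr + 1][dc + 1] * nb[d]
            offspring.append(child)
    return offspring

# An averaging kernel whose weights sum to 1 keeps each offspring dimension
# inside the range spanned by its neighborhood.
w_avg = [[1.0 / 9.0] * 3 for _ in range(3)]
pop = [[1.0, 2.0]] * 4                      # four identical individuals, D = 2
children = crm_step(pop, [3.0, 1.0, 2.0, 0.0], L=2, w=w_avg)
```

With identical individuals and a weight-sum of 1, the offspring reproduce the parent exactly, which is a quick sanity check that the padding and weight sharing are wired correctly.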
Different convolution kernels produce corresponding output tensors, and the final offspring are obtained by averaging the outputs of multiple convolutions. The fitness of the final offspring is then evaluated. Given S_{i-1} and S'_{i-1}, we construct a binary mask tensor by copying and extending the mask matrix to the same shape as S_{i-1} and S'_{i-1}. The selected information forms a new tensor S_i via Eq. (3), as illustrated in Fig. 1.

3.3. SELECTION MODULE

S_i = tile(I_{x>0}(M_{F'} − M_F)) ∘ S_{i−1} + tile(1 − I_{x>0}(M_{F'} − M_F)) ∘ S'_{i−1},   (3)

where the tile function copies and extends the indicator matrix to a tensor of size (L, L, D), I_{x>0} is the elementwise indicator of positive entries, M_F (M_{F'}) denotes the fitness matrix of S_{i−1} (S'_{i−1}), and ∘ indicates pairwise (elementwise) multiplication between its inputs.
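Under our reading of Eq. (3), SM is a pairwise greedy comparison between each parent and the offspring at the same lattice position. A minimal sketch (with a hypothetical objective `sphere`; the real module operates on tensors, not Python lists):

```python
def selection_module(parents, offspring, f):
    """Sketch of SM for minimization: mask = 1 where the offspring's fitness
    exceeds the parent's, i.e. I_{x>0}(M_F' - M_F), so that
    S_i = mask * S_{i-1} + (1 - mask) * S'_{i-1}."""
    selected = []
    for p, o in zip(parents, offspring):
        mask = 1 if (f(o) - f(p)) > 0 else 0  # offspring worse -> keep parent
        selected.append([mask * xp + (1 - mask) * xo for xp, xo in zip(p, o)])
    return selected

sphere = lambda s: sum(x * x for x in s)
parents = [[0.1, 0.1], [2.0, 2.0]]
offspring = [[1.0, 1.0], [0.5, 0.5]]
S1 = selection_module(parents, offspring, sphere)
```

Expressing the choice as a multiplicative mask rather than an if/else over tensors is what keeps the module GPU-friendly and differentiable almost everywhere.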

3.4. THE STRUCTURE OF DECN

(Figure: each EM_i applies CRM_i followed by SM_i, mapping S_{i−1} to S_i; DECN stacks EM_1, ..., EM_t to map S_0 to S_t. The figure also walks through one EM on the toy problem below, showing the depthwise convolution, the mask and (1 − mask) matrices, and the fitness channel.)

We take f(s = {x_1, x_2}) = (x_1 − 0)^2 + (x_2 − 0)^2, x_i ∈ [−1, 1], as an example, with L set to 2. We first transfer the initial population of four individuals into the tensor, then sort and pad it into a new tensor with 16 lattice positions. The modified DSC operator generates the x_1, x_2, and fitness tensors; x_1 and x_2 are handled by parameters shared across channels within a 3 × 3 convolution kernel, and the fitness tensor is handled by Eq. (3). The new x_1, x_2, and fitness tensors are averaged to generate the output.
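The walkthrough above can be condensed into one executable EM step. This is a minimal sketch under stated assumptions, not the paper's implementation: the learned kernel is replaced by a uniform 3×3 average, selection is the pairwise greedy rule, and the starting population is made up for illustration.

```python
def em_step(pop, f, L):
    """One EM on an L x L lattice: uniform-average CRM (a stand-in for the
    learned kernel) followed by pairwise greedy selection against the parent."""
    order = sorted(range(len(pop)), key=lambda i: f(pop[i]), reverse=True)
    grid = [[pop[order[r * L + c]] for c in range(L)] for r in range(L)]
    D = len(pop[0])
    ref = lambda i: -i - 1 if i < 0 else (2 * L - 1 - i if i >= L else i)
    new_pop = []
    for r in range(L):
        for c in range(L):
            child = [sum(grid[ref(r + dr)][ref(c + dc)][d]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9.0
                     for d in range(D)]
            parent = grid[r][c]
            new_pop.append(parent if f(parent) <= f(child) else child)
    return new_pop

f = lambda s: s[0] ** 2 + s[1] ** 2          # the toy objective from the example
S0 = [[0.9, -0.8], [-0.5, 0.7], [0.3, -0.2], [-0.6, 0.1]]
S1 = em_step(S0, f, L=2)
```

Because every lattice position keeps the better of parent and child, one EM step can never make the best or the mean fitness worse; stacking t such steps is the DECN forward pass.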

3.5. TRAINING OF DECN

DECN with t EMs generates the offspring S_t from the input population S_0 and can be trained end-to-end. Given a proper loss function and training dataset, DECN can be trained by back-propagation to learn optimization strategies for the objective function f(s|ξ). We establish a function set F_train to train DECN.

Training dataset. We build the training set from a set of differentiable functions related to the optimization objective. This dataset only contains (S_0, f_i(s|ξ)), i.e., the initial population and the objective function, where f_i is the i-th function in the set. The training and testing datasets are

F_train = {f_1(s|ξ^train_{1,i}), ..., f_m(s|ξ^train_{m,i})},  F_test = {F_1(s|ξ^test_1)}.   (4)

Loss function. For the i-th function, the loss over a set Ω of K initial populations is

L_i = min_θ −(1/K) Σ_{S_0∈Ω} [ (1/|S_0|) Σ_{s∈S_0} f_i(s|ξ) − (1/|G_θ(S_0)|) Σ_{s∈G_θ(S_0)} f_i(s|ξ) ] / [ (1/|S_0|) Σ_{s∈S_0} f_i(s|ξ) ].   (5)

Eq. (5) maximizes the difference between the initial population and the output population of DECN, driving the population toward the optimal solution. Eq. (5) enables DECN to exploit well but does not strongly encourage it to explore the fitness landscape; however, many options exist to balance exploration and exploitation, e.g., adding the Bayesian posterior distribution over the global optimum (Cao & Shen, 2020) to Eq. (5). If the objective functions are hard to formulate and their derivatives are unavailable at training time, two strategies can be employed: 1) approximate the derivatives via REINFORCE (Williams, 1992); 2) train DECN with a neuro-evolution method (Such et al., 2017; Stanley et al., 2019). Algorithm 1 gives the training process. Once trained, DECN can solve black-box optimization problems directly, since gradients are unnecessary at test time.
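The loss value itself is easy to sketch. Note this encodes our reconstruction of the garbled Eq. (5) as a normalized fitness drop; `G` below is any stand-in for the DECN forward pass G_theta, and the sphere objective and mock "optimizer" are illustrative.

```python
def decn_loss(objective, populations, G):
    """Negative mean relative drop in average fitness from S_0 to G(S_0),
    averaged over the K initial populations in Omega (our reading of Eq. (5))."""
    total = 0.0
    for S0 in populations:
        St = G(S0)
        m0 = sum(objective(s) for s in S0) / len(S0)
        mt = sum(objective(s) for s in St) / len(St)
        total += (m0 - mt) / m0          # relative improvement for this S_0
    return -total / len(populations)     # minimized, so improvement is maximized

sphere = lambda s: sum(x * x for x in s)
pops = [[[1.0, 1.0], [2.0, 0.0]], [[0.0, 3.0], [1.0, 2.0]]]
halve = lambda S0: [[0.5 * x for x in s] for s in S0]  # mock "optimizer"
loss_identity = decn_loss(sphere, pops, lambda S0: S0)  # no progress -> 0
loss_halve = decn_loss(sphere, pops, halve)             # progress -> negative
```

In training, `objective` is a differentiable surrogate f_i and the loss is back-propagated through `G` into the convolution kernels; normalizing by the initial mean fitness keeps the loss scale comparable across functions.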
F_train = {F4(s|ξ^train_1), ..., F4(s|ξ^train_m)},  F_test = {F4(s|ξ^test)}   (6)

F_train and F_test comprise the same essential function but vary in the location of the optimum, obtained by setting different combinations of ξ (called b_i in Table 7). F_train can be considered a set of high-fidelity surrogate functions of F_test. We train DECN on F_train and then test it on F_test, where the values of ξ^test do not appear during training. Here, D ∈ {10, 100} and L = 10. DECN is compared with standard EA baselines (DE (DE/rand/1/bin) (Das & Suganthan, 2010), ES ((µ,λ)-ES), and CMA-ES), L2O-swarm (Cao et al., 2019) (a representative L2O method for black-box optimization), and Dragonfly (Kandasamy et al., 2020) (a state-of-the-art Bayesian optimization method). DECNws3 contains 3 EMs, and the parameters of its three convolution kernels are shared across EMs (weight sharing). Detailed parameters of these models can be found in Appendix A.11. The results are provided in Table 1: DECN outperforms the compared methods by a large margin. This is because we train DECN on a high-fidelity surrogate of the target black-box function, so the trained DECN contains an optimization strategy tailored to the task. DE, ES, CMA-ES, and Dragonfly do not use this information in their design; even with constant hyperparameter tuning, their results are unlikely to beat DECN.

F_train = {F1(s|ξ_{1,i}), F2(s|ξ_{2,i}), F3(s|ξ_{3,i})},  F_test = {F4(s|ξ^test)}

Meanwhile, we also test the impact of different architectures on DECN, including the number of layers and whether weights are shared between layers. We design three models: DECNws3, DECNn15, and DECNws30. DECNn15 does not share parameters across its 15 EMs; DECNws30 shares parameters across 30 EMs. Their parameters are shown in Table 9 (Appendix).
DECNws30 outperforms DECNws3 in all cases, demonstrating that deeper architectures have stronger representation capabilities and can build more accurate mappings between random populations and optimal solutions. DECNn15 outperforms DECNws3 and DECNws30 when D = 100, a more complex case than D = 10. Although DECNn15 has fewer layers than DECNws30, its representation ability is stronger because it does not share weights; however, as the number of layers grows, this architecture becomes harder to train. The transferability of DECN is proportional to the similarity between the fitness landscapes of the training set and the problem. When some attributes of a new problem are absent from the training set, DECN can still perform well; if extreme attributes are absent, though, DECN's performance on functions with those attributes can be less satisfactory. These results show that the optimization strategy learned by DECN has good generality and transfers to many unseen objective functions.

4.2. RESULTS ON PROTEIN DOCKING

Protein docking predicts the 3D structures of protein–protein complexes given the individual proteins' 3D structures or 1D sequences (Smith & Sternberg, 2002). We consider the ab initio protein docking problem, formulated as optimizing the Gibbs binding free energy for a conformation s: f(s) = ΔG(s). We calculate the energy function in a CHARMM 19 force field as in (Moal & Bates, 2010) and parameterize the search space as s ∈ R^12 as in (Cao & Shen, 2020), considering only 100 interface atoms. The training set includes 125 instances (see Appendix A.9) derived from 25 protein–protein complexes in protein docking benchmark set 4.0 (Hwang et al., 2010):

F_train = {f(s|ξ_1), ..., f(s|ξ_125)},  F_test = {f(s|ξ^test)},

where ξ represents different instances of protein–protein complexes. Since L2O-swarm has no elite-retention mechanism, its result is worse than the optimal value of the initial population.

4.3. RESULTS ON PLANAR MECHANICAL ARM

The planar mechanical arm has frequently been employed as a benchmark to assess black-box optimization algorithms (Cully et al., 2015; Vassiliades et al., 2018; Vassiliades & Mouret, 2018; Mouret & Maguire, 2020). The problem has two key parameter sets: the lengths L = (L_1, L_2, ..., L_n) and the angles α = (α_1, α_2, ..., α_n), where n is the number of segments of the arm, and L_i ∈ (0, 10) and α_i ∈ (−π, π) are the length and angle of the i-th segment, respectively. The goal is to find L and α such that the distance f(L, α, p) from the tip of the arm to the target position p is smallest:

f(L, α, p) = sqrt( (Σ_{i=1}^n L_i cos(α_i) − p_x)^2 + (Σ_{i=1}^n L_i sin(α_i) − p_y)^2 ),

where (p_x, p_y) are the target point's x- and y-coordinates. Here, n = 100. We design two groups of experiments. 1) Simple case (SC): we fix the length of each segment to ten and search only for the optimal α. We randomly select 600 target points within the range r ≤ 1000, where r is the distance from the target point to the origin of the arm, as shown in Fig. 10 (Appendix). For testing, we extract 128 target points in the ranges r ≤ 100 and r ≤ 300, respectively. The training and testing datasets are

F_train = {f(α|ξ_1), ..., f(α|ξ_600)},  F_test = {f(α|ξ^test_1), ..., f(α|ξ^test_128)},  ξ = (p_x, p_y).   (9)

2) Complex case (CC): we search for L and α simultaneously:

F_train = {f((L, α)|ξ_1), ..., f((L, α)|ξ_600)},  F_test = {f((L, α)|ξ^test_1), ..., f((L, α)|ξ^test_128)}.   (10)

We evaluate performance by Σ_{f∈F_test} f / 128. The experimental results are shown in Table 4.
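The arm objective above is simple forward kinematics and can be written directly. A sketch, assuming (as the formula states) that each segment's angle α_i is measured absolutely rather than relative to the previous segment:

```python
import math

def arm_distance(lengths, angles, target):
    """Planar-arm objective: the tip is the sum of the segment vectors
    (L_i cos a_i, L_i sin a_i), and f is its Euclidean distance to (p_x, p_y)."""
    tip_x = sum(l * math.cos(a) for l, a in zip(lengths, angles))
    tip_y = sum(l * math.sin(a) for l, a in zip(lengths, angles))
    return math.hypot(tip_x - target[0], tip_y - target[1])

# Simple case: lengths fixed to 10; a fully stretched 2-segment arm
# pointing along +x reaches (20, 0) exactly, so the objective is 0 there.
d = arm_distance([10.0, 10.0], [0.0, 0.0], (20.0, 0.0))
```

In the simple case only `angles` is optimized (D = n = 100); in the complex case `lengths` and `angles` are concatenated into one 200-dimensional search vector.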
Note that Dragonfly performs poorly due to the high dimensionality of this problem (D ∈ {100, 200}). In the simple case, DECNws3 outperforms all baselines. In the complex case, DECNws3 outperforms all baselines when r ≤ 100; when r ≤ 300, it outperforms ES and L2O-Swarm but is weaker than DE and CMA-ES. As shown in Table 2, DECNws3 performs worse than DECNws30 and DECNn15; when we use DECNn15 to optimize the complex case, its result is 0.54(0.26), which is better than all baselines.

4.4. ACCELERATING DECN WITH GPU

We show the performance of DECN with GPU-accelerated CRM and SM. To display the adaptability of DECN to GPUs, we report the average runtime (seconds) of DECN and an unaccelerated EA over three generations in Table 5; see Appendix A.6 for more results. DECN and the EA each optimize K = 32 populations, each containing L × L individuals (number of individuals: K × L × L). As a reference, we use the runtime of an EA with SBX crossover and Breeder mutation without acceleration. In the unified test environment, the function-evaluation time consumed by DECN and the EA is essentially the same. As L increases, the advantage of GPU acceleration becomes clear: DECN is around 102 times faster than the EA when D ∈ {10, 50, 500}. This indicates that DECN is well adapted to GPU acceleration. In contrast, the GPU cannot accelerate the EA's crossover, mutation, and selection modules, and with a large population these operators take up much of the running time.

4.5. VISUALIZATION

We take a two-dimensional F4 function as an example to verify that DECN indeed advances the optimization. In Fig. 4, as the iterations proceed, DECN gradually converges. Within the first EM, the population first passes through CRM: the offspring S'_{i−1} are widely distributed in the search space yet closer to the optimal solution, so CRM generates potential offspring rich in diversity. After the SM update, the resulting S_i concentrates around the optimal solution, showing that SM keeps good solutions and removes poor ones. The population distributions after the 2nd, 3rd, and 15th EMs show that DECN continuously moves the population to the vicinity of the optimal solution.

Figure 4: Visualization of the optimization process.

5. CONCLUSIONS

We successfully designed DECN to automatically learn optimization strategies for black-box optimization. Its better performance compared with human-designed methods demonstrates its effectiveness, and DECN adapts well to new black-box optimization tasks. Moreover, DECN is highly amenable to GPU acceleration because it is built from tensor operators. Limitations are discussed in Appendix A.8.

6. REPRODUCIBILITY STATEMENT

The PyTorch source code of DECN can be downloaded from the supplementary materials. The parameters of DECN are shown in Table 9 in the Appendix. Nine synthetic functions are shown in Appendix A.5 (Tables 6 and 7). The 25 protein–protein complexes used for training DECN are listed in Appendix A.9.

A APPENDIX

A.1 BACKGROUND

Recombination. Subtraction is applicable during the production of a new individual, as in DE (Das & Suganthan, 2010); the recombination operator in EAs is usually conducted on n individuals. DE can reproduce a new individual s* from s_1, s_2, ..., s_n based on

s* = s_k + Σ_{i=2}^{n−1} F_i (s_i − s_{i+1}),   (11)

where F_i is a scaling factor and s_k is the best solution or is selected from s_1, s_2, ..., s_n. After an expression expansion, this process can be summarized as a weighted recombination, as given in Eq. (12); such operators are manually designed with different parameters a_i:

s* = a_1 s_1 + a_2 s_2 + ... + a_n s_n = Σ_{i=1}^n a_i s_i.   (12)

Selection. Many selection operators exist, such as the binary tournament mating selection operator in Eq. (13). The selection operator retains individuals of higher quality for the next generation, which can be regarded as an information-selection process:

p_i = 1 if f(s_i) < f(s_k), 0 if f(s_i) > f(s_k),  (s_i, s_k) ∈ S,   (13)

where p_i is the probability that s_i is selected for the next generation, and (s_i, s_k) are randomly selected from the population S. The selection process is repeated until the required number of individuals has been chosen.
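Eqs. (11)-(13) can be illustrated with the most common special cases. The sketch below uses DE/rand/1 (one difference vector) for the recombination and a two-way draw for the tournament; the `sphere` objective is an illustrative assumption.

```python
import random

def de_rand_1(s1, s2, s3, F=0.5):
    """DE/rand/1 mutant: s* = s_1 + F * (s_2 - s_3), a special case of the
    weighted recombination s* = sum_i a_i s_i in Eq. (12)."""
    return [a + F * (b - c) for a, b, c in zip(s1, s2, s3)]

def binary_tournament(population, f, rng):
    """Eq. (13): draw two individuals at random; the one with lower fitness
    (minimization) wins the slot in the next generation."""
    a, b = rng.sample(population, 2)
    return a if f(a) < f(b) else b

sphere = lambda s: sum(x * x for x in s)
mutant = de_rand_1([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], F=0.5)  # -> [0.5, 0.5]
winner = binary_tournament([[0.0, 0.0], [2.0, 2.0]], sphere, random.Random(0))
```

Writing DE as a weighted sum of individuals is what motivates CRM in the main text: a convolution over a lattice of individuals is exactly such a weighted sum, with the weights learned instead of hand-set.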

A.2 HOW TO ORGANIZE A POPULATION INTO A TENSOR

As shown in Fig. 5, individuals in the lattice are sorted in descending order to construct a population tensor with a consistent pattern. Suppose a population S = {s_1, s_2, ..., s_{L×L}} with f(s_1) < f(s_2) < ... < f(s_{L×L}), where f(s) is a minimization task. Then s_1, s_2, ..., s_{L×L} are arranged in descending order within the L × L lattice.

Figure 5: Organizing a population into a tensor.

Figure 6 illustrates the tensor data. The number of channels of the input tensor is D + 1, where D is the dimension of the optimization task and the fitness of individuals occupies one channel. The fitness channel does not participate in the convolution but is essential for information selection in DECN's selection module.

Figure 6: Realizing functions similar to recombination operators with the convolution operator. Convolution kernels slide over the whole L × L lattice and conduct information interaction within the neighborhood of (i, j). In the left picture, the small red square with many channels represents S_i.

A.3 POPULATION ARRANGEMENT

The order of results output by different convolution kernels does not influence the training process. For example, a_1 Co_1 x_1^i + a_2 Co_2 x_2^i + a_3 Co_3 x_3^i ↔ Co'_1 x_1^i + Co'_2 x_2^i + Co'_3 x_3^i, where x_1^i, x_2^i, and x_3^i are input elements of s_i, a_1, a_2, and a_3 are constants, and Co denotes the convolution matrix. 3) How many convolution kernels should be used within CRM? Suppose there are three convolution kernels for x. The outcome a_1 Co^1_{3×3} x + a_2 Co^2_{3×3} x + a_3 Co^3_{3×3} x is equivalent to a' Co'_{3×3} x: the output of multiple convolution kernels can be replaced by one convolution kernel, so the number of convolution kernels of the same size has no apparent influence on DECN. 4) The impact of the neighborhood recombination operation.
The neighborhood recombination operation is commonly accepted in EAs to alleviate selection pressure and prevent premature convergence. Moreover, the receptive field of the convolution kernels expands as the number of layers increases, so DECN can learn efficient optimization strategies across generations.

Accelerating EAs with GPUs is challenging, and much research has contributed to this problem. Support for multiple subpopulations evolving simultaneously is of paramount significance in practical applications. The efforts of (Jin & Qin, 2017; Qin et al., 2012) accelerated the K-Means process within the brain storm optimization algorithm through GPUs and proposed an improved CUDA-based implementation of differential evolution on GPUs. Many other EAs have benefited from the computing performance of GPUs (Huang et al., 2021; Cheng & Gen, 2019). However, all of them merely parallelized existing EAs. Besides, many genetic operators are unfriendly to GPU acceleration, as GPUs are weak at logical operations. As both CRM and SM consist of operations on tensors, they can be fully accelerated by GPUs. The results are shown in Fig. 9. Here, D = 10.
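The kernel-merging claim in A.3 above (several same-size kernels collapse into one) is just linearity of convolution, and can be checked numerically. A minimal 1-D demonstration with made-up kernels and inputs:

```python
def conv1d(w, x):
    """Valid-mode 1-D correlation - enough to demonstrate linearity."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0]
w1, w2 = [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]
a1, a2 = 2.0, -1.0

# a1 * conv(w1, x) + a2 * conv(w2, x) ...
lhs = [a1 * u + a2 * v for u, v in zip(conv1d(w1, x), conv1d(w2, x))]
# ... equals conv(a1 * w1 + a2 * w2, x): one merged kernel suffices.
w_merged = [a1 * u + a2 * v for u, v in zip(w1, w2)]
rhs = conv1d(w_merged, x)
```

This is why adding more kernels of the same size to CRM adds no representational power: the averaged output is itself one convolution.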

A.5 NINE SYNTHETIC FUNCTIONS AND PARAMETERS

F6 (Rosenbrock): Σ_{i=1}^{D−1} (100(z_i^2 − z_{i+1})^2 + (z_i − 1)^2), z_i = x_i − b_i, x ∈ [−100, 100], b ∈ [−50, 50]
F7 (Rastrigin): Σ_{i=1}^{D} (z_i^2 − 10 cos(2πz_i) + 10), z_i = x_i − b_i, x ∈ [−5, 5], b ∈ [−2.5, 2.5]
F8 (Griewank): Σ_{i=1}^{D} z_i^2 / 4000 − Π_{i=1}^{D} cos(z_i / √i) + 1, z_i = x_i − b_i, x ∈ [−600, 600], b ∈ [−300, 300]
F9 (Ackley): −20 exp(−0.2 √((1/D) Σ_{i=1}^{D} z_i^2)) − exp((1/D) Σ_{i=1}^{D} cos(2πz_i)) + 20 + exp(1), z_i = x_i − b_i, x ∈ [−32, 32], b ∈ [−16, 16]

A.8 LIMITATIONS

DECN has several drawbacks that we hope to address in future work. 1) The designed loss function enables DECN to exploit well but does not strongly encourage it to explore the fitness landscape. However, many options exist to balance exploration and exploitation; for example, the Bayesian posterior distribution over the global optimum (Cao & Shen, 2020) can be added to Eq. 7. Besides adding exploration-focused terms to the loss function, new modules could be added to the EM to help DECN escape local optima. 2) DECN has no advantage if the constructed training dataset is utterly irrelevant to the optimization objective; establishing a suitable training dataset is therefore essential. 3) DECN only addresses continuous optimization problems without constraints. For problems such as expensive, combinatorial, constrained, and multi-objective optimization, DECN must be adjusted to the characteristics of the problem. We can think of DECN as a standard optimizer like vanilla DE, ES, GA, or PSO: to deal with different problem types, different modifications are needed. For example, expensive problems require surrogate models to assist DECN, and combinatorial optimization requires redesigning the CRM module to generate new feasible solutions.
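As a concrete instance of the shifted synthetic functions tabulated above, F7 (Rastrigin) can be written directly; the shift b relocates the global optimum while leaving its value at 0.

```python
import math

def rastrigin(x, b):
    """F7 from the table: sum_i (z_i^2 - 10 cos(2 pi z_i) + 10), z_i = x_i - b_i.
    The global minimum value 0 is attained at x = b."""
    z = [xi - bi for xi, bi in zip(x, b)]
    return sum(zi * zi - 10.0 * math.cos(2.0 * math.pi * zi) + 10.0 for zi in z)
```

Sampling the shift b per function instance is exactly how F_train and F_test in Eq. (6) differ while sharing the same essential landscape.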
For example, for TSP tasks, a GNN may be a feasible option for generating new solutions instead of CRM. For constrained optimization problems, we can redesign the CRM module to generate feasible solutions; the simplest approach is to use constraint violations together with the fitness function as the selection criterion in the SM module.

A.9 TRAINING DATASET FOR PROTEIN DOCKING

The training dataset contains 25 protein-protein complexes from the protein docking benchmark set 4.0 (Hwang et al., 2010). We construct a set of differentiable surrogate functions of the target black-box function to exploit information about it. The designed loss function maximizes the difference between the initial population and the output population of DECN, so that the output population moves close to the optimal solution. Few learning-to-optimize architectures (Chen et al., 2022) currently handle black-box optimization problems, and their performance is weak. The experimental results show that DECN makes up for the performance disadvantage of learning-to-optimize architectures on black-box optimization problems, and we believe this paper makes an essential contribution to the learning-to-optimize community.

Table 9: Experimental setup for DECNws30, DECNws3, and DECNnws15. In DECNws3, the parameters of the three convolution kernels are shared across different EMs (weight sharing). During training, the 2-norm of the gradients is clipped to at most 10, and the learning rate (lr = 0.01) shrinks by a factor of 0.9 every 100 epochs. The reference algorithms run for 100 generations, while DECNws3 evolves the population with only 3 EMs. Training lasts 5000 epochs. All experiments are performed on a Linux PC with an Intel Core i7-10700K CPU at 3.80 GHz and 32 GB RAM.



Figure 1: SM. An indication matrix is produced by subtracting the fitness channels ($c_1$) of the input and output populations; based on it, individuals from the input and output populations are extracted to form the offspring.
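A minimal sketch of the pairwise selection that SM performs, with a boolean mask playing the role of the indication matrix (minimization and a flat population layout are simplifying assumptions):

```python
import numpy as np

def select(parents, offspring, f):
    """Pairwise survivor selection between the input population and its offspring.

    parents, offspring: (N, D) arrays; f maps an (N, D) array to (N,) fitness.
    At each position, the fitter of the two individuals survives
    (minimization assumed).
    """
    keep_offspring = f(offspring) < f(parents)  # (N,) boolean indication mask
    return np.where(keep_offspring[:, None], offspring, parents)

sphere = lambda x: np.sum(x**2, axis=-1)
parents = np.array([[2.0, 2.0], [0.1, 0.1]])
offspring = np.array([[1.0, 1.0], [3.0, 3.0]])
survivors = select(parents, offspring, sphere)
print(survivors)  # row 0: offspring survives; row 1: parent survives
```

Because the comparison is a single elementwise tensor operation rather than a branching loop, it maps directly onto GPU execution.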

Figure 2: A general view of DECN and EM.

Figure 3: An example of the data flow in EM. Suppose $f(s = \{x_1, x_2\}, b = \{0, 0\}) = (x_1 - 0)^2 (x_2 - 0)^2$, $x_i \in [-1, 1]$, and L is set to 2. We first transfer the initial population of four individuals to a tensor, then sort and pad it into a new tensor with 16 channels. The modified DSC operator is employed to generate the $x_1$, $x_2$, and fitness tensors. $x_1$ and $x_2$ are handled by parameters shared across channels within a 3 × 3 convolution kernel; the fitness tensor is handled by Eq. (3). The new $x_1$, $x_2$, and fitness tensors are averaged to generate the output.

Figure 7 gives an example of population arrangement and padding for the problem min $f(s) = x_1^2$, where $s = \{x_1\}$, $x_1 \in [0, 10]$. The blue part marks the population arranged in a 10 × 10 lattice, while the gray region marks the mirror padding.

Figure 7: Population arrangement and padding.
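The arrangement and mirror padding described above can be sketched as follows; sorting by fitness before the row-by-row layout is a simplifying assumption about the exact ordering.

```python
import numpy as np

def arrange_and_pad(pop, fitness, L, pad=1):
    """Arrange a population of L*L individuals on an L x L lattice, then mirror-pad.

    pop: (L*L, D) array; fitness: (L*L,) array. Individuals are sorted by
    fitness before being laid out row by row (assumed layout).
    """
    order = np.argsort(fitness)           # best individual first
    grid = pop[order].reshape(L, L, -1)   # L x L lattice of D-dim individuals
    # Mirror padding replicates the blue lattice into the gray border region.
    return np.pad(grid, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")

rng = np.random.default_rng(1)
pop = rng.uniform(0.0, 10.0, size=(100, 1))   # f(s) = x1^2, x1 in [0, 10]
padded = arrange_and_pad(pop, pop[:, 0] ** 2, L=10)
print(padded.shape)  # (12, 12, 1)
```

With `pad=1`, a 10 × 10 lattice becomes 12 × 12, so a 3 × 3 convolution kernel can visit every lattice cell without shrinking the population.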

F1: $\ldots(x_i - b_i)|$, $x \in [-10, 10]$, $b \in [-10, 10]$
F2: $\sum_i |x_i - b_i|$, $x \in [-10, 10]$, $b \in [-10, 10]$
F3: $\sum_i |(x_i - b_i) - (x_{i+1} - b_{i+1})| + \sum_i |x_i - b_i|$, $x \in [-10, 10]$, $b \in [-10, 10]$

A.6 ACCELERATE DECN WITH GPU

Figure 9: The convergence of the loss function during training. (a) F4, (b) F5, (c) F6, (d) F7, (e) F8, and (f) F9.

We compare DECN with standard EA baselines (DE (DE/rand/1/bin) (Das & Suganthan, 2010), ES ((µ,λ)-ES), and CMA-ES), L2O-swarm (Cao et al., 2019) (a representative L2O method for black-box optimization), and Dragonfly (Kandasamy et al., 2020) (a state-of-the-art Bayesian optimization method). DE and ES are implemented with Geatpy (et.al., 2020), and CMA-ES with Pymoo (Blank & Deb, 2020). The parameters of DE, ES, CMA-ES, and Dragonfly are tuned to be optimal for each problem. L2O-swarm and DECN use the same training set and loss function. All algorithms are run ten times on each function. DECNws3 contains 3 EMs, and the parameters of the three convolution kernels are shared across the EMs (weight sharing). The population size of DE, ES, CMA-ES, and DECN is 100. DE, ES, and CMA-ES run for 100 generations, whereas the architecture of DECNws3 fixes its search to only three generations; DE, ES, and CMA-ES therefore consume 100/3 ≈ 33 times as many function evaluations as DECN, which places DECN at a considerable disadvantage. Both Dragonfly and L2O-Swarm are run to convergence.

µ = 0, σ = 0.5; 7 × 7: µ = 0, σ = 0.5

Eq. (1) is not employed in the training stage, and m is the number of functions in F_train. $\xi^{train}_{m,i}$ represents the ith value of $\xi$ in the mth function $f_m$, for any index pair. The initial population $S_0$ is always randomly generated before optimization. F_train is comprised of different functions and has landscapes diverse from those of F_test.

How to Train DECN. DECN attempts to search for high-quality individuals based on the available information. The loss function tells DECN how to adaptively adjust its parameters to generate individuals closer to the optimal solution. Following the Adam (Kingma & Ba, 2014) method, a minibatch Ω is sampled each epoch for training DECN on F_train, comprising K initialized populations $S_0$ for each $f_i$. We give the corresponding mean loss of minibatch Ω for $f_i$ in F_train,

Algorithm 1 Training of DECN
Input: Batch size for Adam, Ω; function set for training, F_train, with parameters $\xi_j$ to adjust each $f_j$ in F_train
Output: Parameters of DECN, θ
repeat
  Randomly initialize a minibatch Ω comprised of K populations $S_0$;
  for $f_j$ in F_train do
    Update θ by $L_j$ given training data ($S_0$, $f_j$);
  end for
  Update θ by minimizing $-\frac{1}{m}\sum_j L_j$;
  Re-initialize parameters $\xi_j$ of $f_j$ in F_train every T epochs;
until training is finished

The functions in F_train are generally differentiable based on the constructed training dataset.
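A toy sketch of this training loop, under strong simplifying assumptions: F_train contains a single differentiable surrogate $f(x) = \|x - b\|^2$, the "DECN" is reduced to one learnable shift θ applied to the whole population, plain gradient descent stands in for Adam, and the periodic re-initialization of ξ is omitted. The loss is the mean fitness of the output population.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 2, 16, 4            # dimension, population size, minibatch size
b = np.array([0.3, -0.2])     # shift xi of the differentiable surrogate (assumed fixed)
theta = np.zeros(D)           # stand-in "DECN" parameter: a single learned shift
lr = 0.05

for epoch in range(300):
    grad = np.zeros(D)
    for _ in range(K):        # minibatch of K random initial populations S0
        S0 = rng.uniform(-1.0, 1.0, size=(N, D))
        S_out = S0 + theta    # "DECN" output population
        # Loss = mean_i ||S_out_i - b||^2; its gradient w.r.t. theta is analytic:
        # d/dtheta = mean_i 2 * (S_out_i - b)
        grad += np.mean(2.0 * (S_out - b), axis=0) / K
    theta -= lr * grad        # gradient step on the minibatch mean loss

print(np.round(theta, 1))     # theta ends close to b, pulling S0 toward the optimum
```

The point of the sketch is the loss signal: minimizing the mean fitness of the output population drives the learned operator to move a random population toward the optimum, which is exactly what the full DECN loss does through the differentiable surrogates.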

The compared results on six functions. The value of the objective function is shown in the table, and the optimal solution is bolded. *(*) represents the mean and standard deviation of repeated experiments.

Results on High-fidelity Training Dataset. For each function in Appendix Table 7, we produce the training dataset as follows: 1) randomly initialize the input population $S_0$; 2) randomly produce a shifted objective function $f_i(s|\xi)$ by adjusting the corresponding location of the optimum, namely the parameter $\xi$; 3) evaluate $S_0$ by $f_i(s|\xi)$.
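The three-step dataset procedure above can be sketched as follows; `make_training_sample` and the shifted-Rastrigin surrogate are illustrative stand-ins for the paper's dataset code, with ranges taken from the F7 row of the function table.

```python
import numpy as np

def make_training_sample(f, dim, pop_size, x_range, b_range, rng):
    """One training sample: a random initial population S0, a randomly shifted
    instance f(. | xi) of the base function, and the fitness of S0 under it."""
    S0 = rng.uniform(*x_range, size=(pop_size, dim))  # 1) random initial population
    xi = rng.uniform(*b_range, size=dim)              # 2) random shift of the optimum
    fitness = f(S0, xi)                               # 3) evaluate S0 on f(. | xi)
    return S0, xi, fitness

# Shifted Rastrigin (F7): global minimum 0 at x = xi.
rastrigin = lambda x, b: np.sum(
    (x - b) ** 2 - 10 * np.cos(2 * np.pi * (x - b)) + 10, axis=-1
)

rng = np.random.default_rng(0)
S0, xi, fit = make_training_sample(rastrigin, dim=10, pop_size=100,
                                   x_range=(-5, 5), b_range=(-2.5, 2.5), rng=rng)
print(S0.shape, xi.shape, fit.shape)  # (100, 10) (10,) (100,)
```

Re-sampling ξ for each sample is what gives F_train a diversity of optimum locations, so the trained DECN cannot simply memorize one target point.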

The performance of different DECNs. This section tests the performance of DECNs trained on low-fidelity surrogate functions. The three functions in Table 6 are employed as low-fidelity surrogates for each function in Table 7. Here, all functions in Table 6 are employed as F_train to train one DECN, and the results on each function of Table 7 are shown in Table 2. For example, we show the designed training and testing datasets for the F4 function as follows:

Each complex has five starting points (the top-5 models from ZDOCK (Pierce et al., 2014)). The testing set includes three complexes (with one starting model each) at different levels of docking difficulty. 1ATN is a protein class that appeared during training; 1ATN_7 is the No. 7 instance of the 1ATN class and did not appear in the training process. 2JEL_1 and 7CEI_1 are the No. 1 instances of two classes of proteins that did not participate in the training process. For example, we show the designed training and testing datasets for 1ATN_7 as follows:

The compared results on the ab initio protein docking problem. D = 12.

The results on the planar mechanical arm task. gen is the number of generations for EAs.

DECN's calculation efficiency on one 1080Ti GPU.

Training functions.

Testing functions.

Investigation of DECN's calculation efficiency when accelerated on one 1080Ti GPU. The results in this table are the average time (seconds) for the algorithms to evolve 32 input populations for three generations.

DECN consists of operations on tensors and is easily accelerated by GPUs. Current distributed EA methods usually separate a population into multiple subpopulations that evolve simultaneously; such separation is a commonly accepted operation in many EAs. However, none of them can accelerate the genetic operators themselves. Here, we show the surprising performance of DECN with GPU-accelerated CRM and SM. Moreover, TensorFlow provides mature solutions for GPU acceleration, and DECN implemented in TensorFlow supports loading multiple populations as input. To show the adaptability of DECN to GPUs, we report the runtime of DECN and an unaccelerated EA in Table 8, in which both DECN and the EA optimize K = 32 populations, each containing L × L individuals (number of individuals: K × L × L). We employ the runtime of EASBX without acceleration as the reference in this experiment. As can be seen, the advantage of GPU-based acceleration grows with L. When the dimension D ∈ {2, 10}, DECN runs 10^3~10^4 times faster than the EA, and it is still around 10^2 times faster when D ∈ {30, 50, 100, 500}. This indicates that DECN adapts well to GPU acceleration. However, as D increases, function evaluations take a growing share of the runtime and ultimately weaken the acceleration advantage. These cases indicate the acceleration advantage of DECN when optimizing a larger population.

A.7 THE CONVERGENCE OF THE LOSS FUNCTION DURING TRAINING

This part shows the change curve of the loss function during the training of DECNnws15 on F4-F9.
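Because every step operates on whole tensors, K populations of L × L individuals can be evaluated in one batched call. The NumPy sketch below (a sphere objective is assumed for illustration) shows the (K, L, L, D) layout; the same expression maps one-to-one onto on-device tensors in TensorFlow or PyTorch.

```python
import numpy as np

def sphere_batch(pops, b):
    """Evaluate K populations of L x L individuals in one vectorized call.

    pops: (K, L, L, D) tensor of K populations; returns (K, L, L) fitness.
    """
    return np.sum((pops - b) ** 2, axis=-1)

K, L, D = 32, 80, 10  # 32 populations, each with 80 x 80 = 6400 individuals
rng = np.random.default_rng(0)
pops = rng.uniform(-5.0, 5.0, size=(K, L, L, D)).astype(np.float32)
b = np.zeros(D, dtype=np.float32)
fitness = sphere_batch(pops, b)
print(fitness.shape)  # (32, 80, 80)
```

A loop-based EA would touch each of the K × L × L individuals one at a time; the batched layout replaces that with a single elementwise kernel, which is where the reported speedup over the unaccelerated EA comes from.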


operations (including their hyperparameters) to maximize its performance on the target task, resulting in a hand-designed EA with limited applicability. Most importantly, due to the limits of expert knowledge, only little information about the target function is used to assist the design of the EA, which makes it challenging to adapt to the target task. How to automatically design optimization strategies for new tasks is therefore crucial. To the best of our knowledge, no existing work addresses this issue. We view an EA as a generative optimization model that realizes the mapping from a random population to an optimal solution through manually designed crossover, mutation, and selection operations, whose purpose is to generate potential solutions and retain good ones. The task of automatically designing an optimization strategy is thus learning how to automatically generate and retain potential solutions; this paper shows how DECN accomplishes this task.

By constructing a set of differentiable surrogate functions of the objective black-box function, DECN allows the designed CRM and SM to learn a strategy for optimizing the objective function. In this way, DECN effectively utilizes information about the target black-box function to assist the construction of the optimization strategy, and it fits the target task much better than a human-designed EA. The following statement may be one-sided: Bayesian optimization also suffers from poor generalization, for example in how to choose or design appropriate acquisition functions for different problems.

We use the process of an evolutionary algorithm to guide the design of DECN and realize the mapping from a random population to the optimal solution. First, we need a module that ensures the exchange of information between individuals in the population in order to generate potential solutions (similar to the recombination operators in an EA).
We can achieve this function by modifying the convolution operation accordingly, which is our motivation for a convolution-based design; the designed CRM module achieves this purpose (see Section 3.1). Second, to carry good individuals to the next layer of DECN, we design the selection module (SM) based on a pairwise comparison between the offspring and the input population regarding their fitness (see Section 3.3, Equation 3); we can clearly observe that Equation 3 indeed keeps the good individuals. Third, the untrained DECN does not handle black-box optimization well because it lacks information about the target black-box function. In order to better optimize the objective

