FOURIER REPRESENTATIONS FOR BLACK-BOX OPTIMIZATION OVER CATEGORICAL VARIABLES

Abstract

Optimization of real-world black-box functions defined over purely categorical variables is an active area of research. In particular, optimization and design of biological sequences with specific functional or structural properties have a profound impact in medicine, materials science, and biotechnology. Standalone acquisition methods, such as simulated annealing (SA) and Monte Carlo tree search (MCTS), are typically used for such optimization problems. In order to improve the performance and sample efficiency of such acquisition methods, we propose to use existing acquisition methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To this end, we present two different representations, a group-theoretic Fourier expansion and an abridged one-hot encoded Boolean Fourier expansion. To learn such models, characters of each representation are considered as experts and their respective coefficients are updated via an exponential weight update rule each time the black box is evaluated. Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods, which achieve competitive or superior performance compared to state-of-the-art counterparts, while improving the computational cost and/or sample efficiency substantially.

1. INTRODUCTION

A plethora of practical optimization problems involve black-box functions, with no simple analytical closed forms, that can be evaluated at any arbitrary point in the domain. Optimization of such black-box functions poses a unique challenge due to restrictions on the number of possible function evaluations, as evaluating functions of real-world complex processes is often expensive and time consuming. Efficient algorithms for global optimization of expensive black-box functions take past queries into account in order to select the next query to the black-box function more intelligently. While black-box optimization of real-world functions defined over integer, continuous, and mixed variables has been studied extensively in the literature, limited work has addressed incorporation of purely categorical type input variables. Categorical type variables are particularly challenging when compared to integer or continuous variables, as they do not have a natural ordering. However, many real-world functions are defined over categorical variables. One such problem, which is of wide interest, is the design of optimal chemical or biological (protein, RNA, and DNA) molecule sequences, which are constructed using a vocabulary of fixed size, e.g. 4 for DNA/RNA. Designing optimal molecular sequences with improved or novel structures and/or functionalities is of paramount importance in material science, drug and vaccine design, synthetic biology and many other applications (see Dixon et al. (2010) ; Ng et al. (2019) ; Hoshika et al. (2019) ; Yamagami et al. (2019) ). Design of optimal sequences is a difficult black-box optimization problem over a combinatorially large search space (Stephens et al. (2015) ), in which function evaluations often rely on either wet-lab experiments, physics-inspired simulators, or knowledge-based computational algorithms, which are slow and expensive in practice. Another problem of interest is the constrained design problem, e.g. 
find a sequence given a specific structure (or property), which is the inverse of the well-known folding problem discussed in Dill & MacCallum (2012). This problem is complex due to the strict structural constraints imposed on the sequence. In fact, one way to represent such a complex structural constraint is to constrain the next choice sequentially based on the sequence elements that have already been chosen. Therefore, we divide the black-box optimization problem into two settings, depending on the constraint set: (i) the Generic Black-Box Optimization (BBO) problem, referring to the unconstrained case, and (ii) the Design Problem, referring to the case with complex sequential constraints. Let $x_t$ be the $t$-th sequence evaluated by the black-box function $f$. The key question in both settings is the following: given prior queries $x_1, x_2, \ldots, x_t$ and their evaluations $f(x_1), \ldots, f(x_t)$, how should the next query $x_{t+1}$ be chosen? This acquisition must be devised so that, over a finite budget of black-box evaluations, one gets as close as possible to the minimizer in an expected sense over the acquisition randomness. In the literature, for design problems with sequential constraints, MCTS (Monte Carlo Tree Search) based acquisitions are often used with real function evaluations $f(x_t)$. For generic BBO problems in the unconstrained scenario, Simulated Annealing (SA) based techniques are typically used as acquisition functions. A key missing ingredient in the categorical domain is a surrogate model for the black-box evaluations that can interpolate between such evaluations and supply cost-free approximate evaluations internally (in acquisition functions), reducing the need to access real evaluations frequently. This leads to improved sample efficiency in acquisition functions.
Due to the lack of efficient interpolators in categorical domains, existing acquisition functions suffer under a finite budget constraint, since they rely only on real black-box evaluations.

Contributions: We address the above problem in our work. Our main contributions are as follows:

1. We present two representations for modeling real-valued combinatorial functions over categorical variables, which we then use to learn a surrogate model for the generic BBO problem and the design problem. The surrogate model is updated via a Hedge algorithm in which the basis functions of our representations act as experts. This update happens once for every real black-box evaluation. To the best of our knowledge, the representations and/or their use in black-box optimization of functions over categorical variables are novel to this work.

2. In the BBO problem, the proposed method uses a version of simulated annealing that utilizes the current surrogate model for many internal cost-free evaluations before producing the next black-box query.

3. In the design problem, the proposed method uses a version of MCTS in conjunction with the current surrogate model as the reward function of the terminal states during intermediate tree traversals/backups, in order to improve the sample efficiency of the search algorithm.

4. Numerical results over synthetic benchmarks as well as real-world biological (RNA) sequence optimization and design problems demonstrate the competitive or superior performance of the proposed methods over state-of-the-art counterparts, while substantially reducing the computation time and improving the sample efficiency, respectively.

… 2018)), and reinforcement learning that either performs a local search as in Eastman et al. (2018) or learns complete candidate solutions from scratch (Runge et al. (2018)).
In all these approaches, the assumption is that the algorithm has access to a large number of function evaluations, whereas we are interested in sample efficiency of each algorithm.

2. RELATED WORK

As an alternative to parameter-free search methods (such as SA), Swersky et al. (2020) suggest using a parameterized policy to generate candidates that maximize the acquisition function in Bayesian optimization over discrete search spaces. Our MCTS acquisition method is similar in concept to Swersky et al. (2020) in the sense that tabular value functions are constructed and maintained over different time steps. However, we maintain value functions rather than a policy network.

3. BLACK-BOX OPTIMIZATION OVER CATEGORICAL VARIABLES

Problem Setting: Given the combinatorial categorical domain $\mathcal{X} = [k]^n$ and a constraint set $\mathcal{C} \subseteq \mathcal{X}$, with $n$ variables each of cardinality $k$, the objective is to find $x^* = \arg\min_{x \in \mathcal{C}} f(x)$, where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued combinatorial function. We assume that $f$ is a black-box function, which is computationally expensive to evaluate. As such, we are interested in finding $x^*$ in as few evaluations as possible. We consider two variations of the problem depending on how the constraint set $\mathcal{C}$ is specified.

Generic BBO Problem: In this case, the constraint set is $\mathcal{C} = \mathcal{X}$. For example, the RNA sequence optimization problem, which searches for an RNA sequence that optimizes a specific property, falls within this category. A score for every RNA sequence, reflecting the property we wish to optimize, is evaluated by a black-box function.

Design Problem: The constraint set is complex and is only sequentially specified. For every partial sequence $x_1 x_2 \ldots x_i$ consisting of $i$ characters from the alphabet $[k]$, the choice of the next character $x_{i+1} \in \mathcal{C}(x_1 x_2 \ldots x_i) \subseteq [k]$ is specified by a constraint set function $\mathcal{C}(x_1 \ldots x_i)$. The RNA inverse folding problem in Runge et al. (2018) falls into this category, where the constraints on the RNA sequence are determined by the sequential choices one makes during the sequence design. The goal is to find the sequence that is optimal with respect to a pre-specified structure and that also obeys complex sequential constraints.

In the sequel, we propose two representations that can be used as surrogate models for black-box combinatorial functions over categorical variables. These representations serve as direct generalizations of the Boolean surrogate model based on the Fourier expansion proposed in Dadkhahi et al. (2020), in the sense that they reduce to the Fourier representation for real-valued Boolean functions when the cardinality of the categorical variables is two.
In addition, both approaches can be modified to address the more general case where different variables have different cardinalities. However, for ease of exposition, we assume that all the variables are of the same cardinality. Finally, we introduce two popular acquisition functions to be used in conjunction with the proposed surrogate models in order to propose new queries for subsequent black-box function evaluations.

3.1. REPRESENTATIONS FOR THE SURROGATE MODEL

We present two representations for combinatorial functions $f : [k]^n \to \mathbb{R}$ and an algorithm to update them from the black-box evaluations. The representations use the Fourier basis in two different ways.

Abridged One-Hot Encoded Boolean Fourier Representation: The one-hot encoding of each variable $x_i$ ($i \in [n]$) can be expressed as a $(k-1)$-tuple $(x_{i1}, x_{i2}, \ldots, x_{i(k-1)})$, where $x_{ij} \in \{-1, 1\}$ are Boolean variables with the constraint that at most one such variable can be equal to $-1$ for any given $x_i \in [k]$. We consider the following representation for the combinatorial function $f$:

$$f_\alpha(x) = \sum_{m=0}^{n} \sum_{I \in \binom{[n]}{m}} \sum_{J \in [k-1]^{|I|}} \alpha_{I,J}\, \psi_{I,J}(x), \tag{2}$$

where $[k-1]^{|I|}$ denotes the $|I|$-fold cartesian product of the set $[k-1] = \{1, 2, \ldots, k-1\}$, $\binom{[n]}{m}$ designates the set of $m$-subsets of the set $[n]$, and the monomials $\psi_{I,J}(x)$ can be written as

$$\psi_{I,J}(x) = \prod_{\{(i,j)\,:\, i = I_\ell,\ j = J_\ell,\ \ell \in [|I|]\}} x_{ij}. \tag{3}$$

A second order approximation (i.e. at $m = 2$) of the representation in (2) can be expanded in the following way:

$$f_\alpha(x) = \alpha_0 + \sum_{i \in [n]} \sum_{\ell \in [k-1]} \alpha_{i\ell}\, x_{i\ell} + \sum_{(i,j) \in \binom{[n]}{2}} \sum_{(p,q) \in [k-1]^2} \alpha_{ijpq}\, x_{ip}\, x_{jq}. \tag{4}$$

Example. For $n = 2$ variables $x_1$ and $x_2$, each with cardinality $k = 3$, we have the one-hot encodings $(x_{11}, x_{12})$ and $(x_{21}, x_{22})$, respectively. From Equation (4), the one-hot encoding factorization for this example can be written as $f(x) = \alpha_0 + \alpha_1 x$ … ($i \in [n]$), where $k_i$ is the cardinality of the latter variable. From the fundamental theorem of abelian groups (see Terras (1999)), there exists an abelian group $G$ which is isomorphic to the direct sum (a.k.a. direct product) of the cyclic groups $\mathbb{Z}/k_i\mathbb{Z}$ corresponding to the $n$ categorical variables:

$$G \cong \mathbb{Z}/k_1\mathbb{Z} \oplus \mathbb{Z}/k_2\mathbb{Z} \oplus \cdots \oplus \mathbb{Z}/k_n\mathbb{Z}, \tag{5}$$

where the latter group consists of all vectors $(a_1, a_2, \ldots, a_n)$ such that $a_i \in \mathbb{Z}/k_i\mathbb{Z}$, and $\cong$ denotes group isomorphism.
We assume that $k_i = k$ ($\forall i \in [n]$) for simplicity, but the following representation can be easily generalized to the case of arbitrary cardinalities for different variables. The Fourier representation of any complex-valued function $f(x)$ on the finite abelian group $G$ is given by (see Terras (1999))

$$f(x) = \sum_{I \in [k]^n} \alpha_I\, \psi_I(x), \tag{6}$$

where $\alpha_I$ are (in general complex) Fourier coefficients, $[k]^n$ is the $n$-fold cartesian product of the set $[k]$, and $\psi_I(x)$ are complex exponentials ($k$-th roots of unity) given by

$$\psi_I(x) = \exp\left(2\pi j \langle x, I \rangle / k\right).$$

Note that the latter complex exponentials are the characters of the representation, and reduce to the monomials (i.e. take values in $\{-1, 1\}$) when the cardinality of each variable is two. A second order approximation of the representation in (6) can be written as:

$$f_\alpha(x) = \alpha_0 + \sum_{i \in [n]} \sum_{\ell \in [k-1]} \alpha_{i\ell} \exp\left(2\pi j\, x_i \ell / k\right) + \sum_{(i,j) \in \binom{[n]}{2}} \sum_{(p,q) \in [k-1]^2} \alpha_{ijpq} \exp\left(2\pi j (x_i p + x_j q) / k\right).$$

For a real-valued function $f_\alpha(x)$ (which is of interest here), the representation in (6) reduces to

$$f_\alpha(x) = \Re\Big\{ \sum_{I \in [k]^n} \alpha_I\, \psi_I(x) \Big\} = \sum_{I \in [k]^n} \alpha_{r,I}\, \psi_{r,I}(x) - \sum_{I \in [k]^n} \alpha_{i,I}\, \psi_{i,I}(x),$$

where

$$\psi_{r,I}(x) = \cos\left(2\pi \langle x, I \rangle / k\right) \quad \text{and} \quad \psi_{i,I}(x) = \sin\left(2\pi \langle x, I \rangle / k\right), \qquad \alpha_{r,I} = \Re\{\alpha_I\} \quad \text{and} \quad \alpha_{i,I} = \Im\{\alpha_I\}. \tag{9}$$

We note that the number of characters utilized in this representation is almost twice the number of monomials used in the previous representation.
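To make the two feature maps concrete, the following is a minimal sketch that evaluates the second order monomials of the abridged one-hot representation and the second order real characters of the group-theoretic representation. The function names are hypothetical, and the code assumes categories are labeled 0, ..., k-1, with the last category mapped to the all-ones tuple; neither convention is taken from the text above.

```python
import itertools
import numpy as np

def onehot_features(x, k, order=2):
    """Monomials of the abridged one-hot encoding up to `order`.

    Assumption: categories are 0..k-1 and category k-1 is encoded as all +1.
    """
    n = len(x)
    # b[i, l] = -1 if x_i == l, else +1, for l in 0..k-2
    b = np.where(np.arange(k - 1)[None, :] == np.asarray(x)[:, None], -1.0, 1.0)
    feats = [1.0]  # constant term (m = 0)
    for m in range(1, order + 1):
        for I in itertools.combinations(range(n), m):
            for J in itertools.product(range(k - 1), repeat=m):
                feats.append(np.prod([b[i, j] for i, j in zip(I, J)]))
    return np.array(feats)

def group_features(x, k, order=2):
    """Real characters cos/sin(2*pi*<x, I>/k) for frequency vectors I
    with at most `order` non-zero entries."""
    n = len(x)
    feats = [1.0]  # character of the all-zero frequency
    for m in range(1, order + 1):
        for I in itertools.combinations(range(n), m):
            for J in itertools.product(range(1, k), repeat=m):
                phase = 2.0 * np.pi * sum(x[i] * j for i, j in zip(I, J)) / k
                feats.extend([np.cos(phase), np.sin(phase)])
    return np.array(feats)

# Example from the text: n = 2 variables of cardinality k = 3.
n, k = 2, 3
domain = list(itertools.product(range(k), repeat=n))
Phi_onehot = np.array([onehot_features(x, k) for x in domain])
Phi_group = np.array([group_features(x, k) for x in domain])
```

For n = 2 the second order expansion already contains every monomial, so `Phi_onehot` is a square 9 x 9 matrix, while `Phi_group` has 17 columns (one constant plus a cosine and a sine for each of the 8 non-trivial frequencies), in line with the remark that the second representation uses almost twice as many basis functions.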

Surrogate Model Learning:

We adopt the learning algorithm of combinatorial optimization with expert advice (Dadkhahi et al. (2020)) in the following way. We consider the monomials $\psi_{I,J}(x)$ in (3) and the characters $\psi_{\ell,I}(x)$, $\ell \in \{r, i\}$, in (10) as experts. For each surrogate model, we maintain a pool of such experts, the coefficients of which are refreshed sequentially via an exponential weight update rule. We refer to the proposed algorithm as Expert-Based Categorical Optimization (ECO); the two versions of the algorithm with the two proposed surrogate models are called ECO-F (based on the One-Hot Encoded Boolean Fourier Representation) and ECO-G (based on the Fourier Representation on Finite Abelian Groups), respectively. A summary of the algorithm is given in the Appendix.
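Since the update rule itself is deferred to the Appendix, the following is only an illustrative sketch of one exponential weight step over a pool of experts. It assumes the common signed-weights form of exponentiated gradient (a positive and a negative weight per expert, coefficients given by a scaled difference, and a squared-error loss); the function name `hedge_update`, the fixed learning rate `eta`, and the exact form of the exponent are assumptions, not the authors' stated algorithm.

```python
import math

def hedge_update(w_plus, w_minus, psi, y, lam=1.0, eta=0.1):
    """One exponential-weight update after observing y = f(x).

    psi holds the expert values (monomials or characters) at the queried x.
    Assumed form: alpha = lam * (w_plus - w_minus), squared-error loss.
    """
    alpha = [lam * (p - m) for p, m in zip(w_plus, w_minus)]
    pred = sum(a * v for a, v in zip(alpha, psi))
    err = pred - y  # signed prediction error of the current surrogate
    # Multiplicative update: experts that reduce the error gain weight.
    new_plus = [p * math.exp(-2.0 * eta * lam * err * v) for p, v in zip(w_plus, psi)]
    new_minus = [m * math.exp(+2.0 * eta * lam * err * v) for m, v in zip(w_minus, psi)]
    z = sum(new_plus) + sum(new_minus)  # renormalize to the simplex
    new_plus = [p / z for p in new_plus]
    new_minus = [m / z for m in new_minus]
    alpha = [lam * (p - m) for p, m in zip(new_plus, new_minus)]
    return new_plus, new_minus, alpha

# Toy usage: three experts, uniform initial weights, one observation y = 1.
d = 3
w_plus, w_minus = [1.0 / (2 * d)] * d, [1.0 / (2 * d)] * d
psi = [1.0, -1.0, 1.0]
w_plus, w_minus, alpha = hedge_update(w_plus, w_minus, psi, y=1.0)
pred_after = sum(a * v for a, v in zip(alpha, psi))
```

In this toy run the surrogate prediction at the queried point moves from 0 towards the observed value 1 after a single update, and the weights remain normalized.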

3.2. ACQUISITION FUNCTIONS

In this subsection, we discuss how two popular acquisition functions, namely Simulated Annealing (SA) and Monte Carlo Tree Search (MCTS), work with a surrogate model and use cost-free evaluations of the surrogate model to select the next query for the black-box evaluation. In the literature, SA has been used for generic BBO problems, whereas MCTS has been used for design problems.

SA as Acquisition Function: Our acquisition function is devised so as to minimize $f_\alpha(x)$, the current estimate of the surrogate model. A simple strategy to minimize $f_\alpha(x)$ is to iteratively switch each variable to the value that minimizes $f_\alpha$ given the values of all the remaining variables, until no more changes occur after a sweep through all the variables $x_i$ ($\forall i \in [n]$). A strategy to escape local minima in this context (Pincus (1970)) is to allow for occasional increases in $f_\alpha$ by generating a Markov Chain (MC) sample sequence (for $x$) whose stationary distribution is proportional to $\exp(-f_\alpha(x)/s)$, where $s$ is gradually reduced to zero. This optimization strategy was first applied by Kirkpatrick et al. (1983) in their Simulated Annealing algorithm to solve combinatorial optimization problems. We use the Gibbs sampler (Geman & Geman (1984)) to generate such an MC by sampling from the full-conditional distribution of the stationary distribution, which in our case is given by the softmax distribution over $\{-f_\alpha(x_i = \ell, x_{-i})/s\}_{\ell \in [k]}$ for each variable $x_i$, conditional on the values of the remaining variables $x_{-i}$. By decreasing $s$ from a high value to a low one, we allow the MC to first search at a coarse level, which avoids being trapped in local minima. Algorithm 1 presents our simulated annealing (SA) version for categorical domains, where $s(t)$ is an annealing schedule, which is a non-increasing function of $t$. We use the annealing schedule suggested in Spears (1993), which follows an exponential decay with decay parameter $\omega$, given by $s(t) = \exp(-\omega t / n)$.
In each step of the algorithm, we pick a variable $x_i$ ($i \in [n]$) uniformly at random and evaluate the surrogate model (possibly in parallel) $k$ times, once for each categorical value $\ell \in [k]$ of the chosen variable $x_i$, given the current values $x_{-i}$ of the remaining variables. We then update $x_i$ with a value in $[k]$ sampled from the corresponding softmax distribution.
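The Gibbs-sampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the 0-based category labels, the argument `omega` for the decay parameter, and the small temperature floor added for numerical safety are all assumptions. The defaults (3n iterations, decay parameter 3) mirror the values reported in the experiments.

```python
import math
import random

def simulated_annealing(surrogate, n, k, num_iters=None, omega=3.0, seed=0):
    """Gibbs-sampler SA over [k]^n using only surrogate evaluations."""
    rng = random.Random(seed)
    num_iters = 3 * n if num_iters is None else num_iters
    x = [rng.randrange(k) for _ in range(n)]  # random initial point
    for t in range(num_iters):
        # Exponentially decaying temperature; the floor is an assumption
        # that guards against division by zero for very long runs.
        s = max(math.exp(-omega * t / n), 1e-12)
        i = rng.randrange(n)  # variable chosen uniformly at random
        energies = []
        for value in range(k):  # k surrogate evaluations for variable i
            x[i] = value
            energies.append(surrogate(x))
        logits = [-e / s for e in energies]
        top = max(logits)  # subtract the max for a stable softmax
        weights = [math.exp(l - top) for l in logits]
        x[i] = rng.choices(range(k), weights=weights)[0]
    return x

# Toy surrogate (hypothetical): number of entries that differ from category 0.
toy_surrogate = lambda x: sum(1 for v in x if v != 0)
x_final = simulated_annealing(toy_surrogate, n=4, k=4, num_iters=400)
```

Each iteration costs k surrogate evaluations and no black-box evaluations; only the point returned at the end of the run would be sent to the black box.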

MCTS as Acquisition Function:

We formulate the design problem as an undiscounted Markov decision process $(\mathcal{S}, \mathcal{A}, T, R)$. Each state $s \in \mathcal{S}$ corresponds to a partial or full sequence of categorical variables, with length in $[0, n]$. The process in each episode starts with an empty sequence $s_0$, the initial state. Actions are selected from the set of permissible additions to the current state (sequence) $s_t$ at each time step $t$, $\mathcal{A}_t = \mathcal{A}(s_t) \subset \mathcal{A}$. The transition function $T$ is deterministic, and defines the sequence obtained from the juxtaposition of the current state $s_t$ with the action $a_t$, i.e. $s_{t+1} = T(s_t, a_t) = s_t \circ a_t$. Transitions leading to incomplete sequences yield a reward of zero. Complete sequences are considered terminal states, from which no further transitions (juxtapositions) can be made. Once the sequence is complete (i.e. at a terminal state), the reward is obtained from the current surrogate reward model $f_\alpha$. Thus, the reward function is defined as $R(s_t, a_t, s_{t+1}) = -f_\alpha(s_{t+1})$ if $s_{t+1}$ is terminal, and zero otherwise.

MCTS is a popular search algorithm for design problems. It is a rollout algorithm that keeps track of the value estimates obtained via Monte Carlo simulations in order to progressively make better selections. The UCT selection criterion (see Kocsis & Szepesvári (2006)) is typically used as the tree policy, where the action $a_t$ at state $s_t$ in the tree search is selected via

$$\pi_{\mathcal{T}}(s_t) = \arg\max_{a \in \mathcal{A}(s_t)} \; Q(s_t, a) + c \sqrt{\frac{\ln N(s_t)}{N(s_t, a)}},$$

where $\mathcal{T}$ is the search tree, $c$ is the exploration parameter, $Q(s, a)$ is the state-action value estimate, and $N(s)$ and $N(s, a)$ are the visit counts for the parent state node and the candidate state-action edge, respectively. For the selection of actions in states outside the search tree, a random default policy is used: $\pi_{RS}(s_t) = \mathrm{unif}(\mathcal{A}_t)$. A summary of the proposed algorithm is given in Algorithm 2.
Starting with an empty sequence $s_0$ at the root of the tree, we follow the tree policy until a leaf node of the search tree is reached (selection step). At this point, we append the state corresponding to the leaf node to the tree and initialize a value function estimate for its children (extension step). From the reached leaf node, we follow the default policy until a terminal state is reached. At this point, we plug the sequence corresponding to this terminal state into the surrogate reward model $-f_\alpha$ and observe the reward $r$. This reward is backed up from the leaf node to the root of the tree in order to update the value estimates $Q(s, a)$ via Monte Carlo (i.e. using the average reward) for all visited $(s, a)$ pairs along the path. This process is repeated until a stopping criterion (typically a maximum number of playouts) is met, at which point the sequence $s_{best}$ with maximum reward $r_{best}$ is returned as the output of the algorithm.

Algorithm 1: SA for Categorical Variables with Surrogate Model
    Inputs: surrogate model $f_\alpha$, annealing schedule $s(t)$, categorical domain $\mathcal{X}$
    Initialize $x \in \mathcal{X}$
    $t = 0$
    repeat
        $i \sim \mathrm{unif}([n])$
        $x_i \mid x_{-i} \sim \mathrm{Softmax}\left(\{-f_{\alpha_t}(x_i = \ell, x_{-i}) / s(t)\}_{\ell \in [k]}\right)$
        $t \leftarrow t +$ …

…
        $s_{leaf} \leftarrow \mathrm{Selection}(\pi_{\mathcal{T}})$
        $\mathcal{T} \leftarrow \mathcal{T} \cup \{s_{leaf}\}$
        $s_t \leftarrow \mathrm{Simulation}(\pi_{RS}, s_{leaf})$
        $r \leftarrow -f_\alpha(s_t)$
        $\mathrm{Backup}(s_{leaf}, r)$
        if $r >$ …

… $O(d) = O(k^{m-1} n^m)$, and is thus linear in the number of experts $d$. Moreover, the complexity of the simulated annealing algorithm (Algorithm 1) is in $O(k \cdot k^{m-1} n^{m-1} \cdot n) = O(kd)$, assuming that the number of iterations in each SA run is in $O(n)$. As a result, the overall complexity of the algorithm is in $O(kd)$. Finally, the computational complexity of each playout in Algorithm 2 is in $O(kn)$, leading to an overall complexity of $O(kd)$, assuming $O(d/n)$ playouts per time step.
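The selection, extension, simulation, and backup steps described above can be sketched as below. This is an illustrative sketch under simplifying assumptions, not the authors' Algorithm 2: states are tuples of actions, every action appends exactly one symbol, untried actions are visited before UCT is applied, and the names `mcts`, `actions_fn`, and `toy_surrogate` are hypothetical.

```python
import math
import random

def mcts(surrogate, n, actions_fn, num_playouts, c=0.5, seed=0):
    """MCTS with UCT tree policy and the surrogate as terminal reward."""
    rng = random.Random(seed)
    tree = set()               # states that have been added to the tree
    Q, N_sa, N_s = {}, {}, {}  # value estimates and visit counts
    best_seq, best_r = None, -math.inf
    for _ in range(num_playouts):
        state, path = (), []
        # Selection: follow the tree policy while inside the tree.
        while len(state) < n and state in tree:
            acts = actions_fn(state)
            untried = [a for a in acts if N_sa[(state, a)] == 0]
            if untried:
                action = rng.choice(untried)
            else:
                log_n = math.log(N_s[state])
                action = max(acts, key=lambda a: Q[(state, a)]
                             + c * math.sqrt(log_n / N_sa[(state, a)]))
            path.append((state, action))
            state = state + (action,)
        # Extension: add the reached leaf and initialize its children.
        if len(state) < n:
            tree.add(state)
            N_s[state] = 0
            for a in actions_fn(state):
                Q[(state, a)], N_sa[(state, a)] = 0.0, 0
        # Simulation: random default policy until the sequence is complete.
        seq = state
        while len(seq) < n:
            seq = seq + (rng.choice(actions_fn(seq)),)
        r = -surrogate(seq)  # reward from the surrogate, not the black box
        # Backup: running average of rewards along the visited path.
        for s, a in path:
            N_s[s] += 1
            N_sa[(s, a)] += 1
            Q[(s, a)] += (r - Q[(s, a)]) / N_sa[(s, a)]
        if r > best_r:
            best_seq, best_r = seq, r
    return best_seq, best_r

# Toy usage (hypothetical surrogate): count symbols that are not 'A'.
toy_surrogate = lambda seq: sum(1 for ch in seq if ch != "A")
best_seq, best_r = mcts(toy_surrogate, n=2,
                        actions_fn=lambda s: ["A", "C", "G", "U"],
                        num_playouts=200)
```

In a design problem, `actions_fn` would encode the sequential constraint set, returning only the additions permitted after the current partial sequence.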

4. EXPERIMENTS AND RESULTS

In this section, we measure the performance of the proposed representations when used as surrogate/reward models in conjunction with search algorithms (SA and MCTS) in BBO and design problems. The learning rate used in the exponential weight updates is selected via the anytime learning rate schedule suggested in Dadkhahi et al. (2020) and Gerchinovitz & Yu (2011) (see Appendix). The maximum degree of interactions used in our surrogate models is set to two for all the problems; increasing the maximum order improved the results only marginally. The sparsity parameter $\lambda$ in the exponential weight updates is set to 1 in all the experiments, following the same choice made in Dadkhahi et al. (2020). Experimentally, the learning algorithm is fairly insensitive to variations in the latter parameter. In each experiment, we report the results averaged over multiple runs (20 runs in BBO experiments and 10 runs in design experiments) ± one standard error of the mean. The experiments were run on machines with CPU cores from the Intel Xeon E5-2600 v3 family.

BBO Experiments: We compare the performance of our ECO algorithms in conjunction with SA against two baselines, random search (RS) and simulated annealing (SA), as well as a state-of-the-art Bayesian combinatorial optimization algorithm (COMBO) of Oh et al. (2019). In particular, we consider two synthetic benchmarks (the Latin square problem and the pest control problem) and a real-world sequence design problem in biology: RNA sequence optimization. In addition to the performance of the algorithms in terms of the best value of $f(x)$ observed until a given time step $t$, we measure the average computation time per time step of our algorithm versus that of COMBO. The decay parameter used in the annealing schedule of SA is set to 3 in all the experiments. In addition, the number of SA iterations in our surrogate models is set to $T = 3 \times n$. Intuitively, each of these parameters creates an exploration-exploitation trade-off.
The smaller (larger) the value of the decay parameter or $T$, the more exploratory (exploitative) the behavior of SA. The selected values seem to create a reasonable balance; tuning these parameters may improve the performance of the acquisition function.

… 2008) use a thermodynamic model (e.g. Zuker & Stiegler (1981)) and dynamic programming to estimate the MFE of a sequence. However, the $O(n^3)$ time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences (Gould et al. (2014)) and for exhaustively searching the space to identify the global free energy minimum, as the number of sequences grows exponentially as $4^n$. We formulate the RNA sequence optimization problem as follows: for a sequence of length $n$, find the RNA sequence which folds into a secondary structure with the lowest MFE. In our experiments, we initially set $n = 30$ and $k = 4$. We then use the popular RNAfold package (Lorenz et al. (2011)) to evaluate the MFE for a given sequence. The goal is to find the lowest-MFE sequence by calling the MFE evaluator the minimum number of times. As depicted in Figure 1, both ECO-F and particularly ECO-G outperform the baselines as well as COMBO by a considerable margin.

RNA Design Experiments:

The problem is to find a primary RNA sequence $\phi$ which folds into a target structure $\omega$, given a folding algorithm $F$. Such target structures can be represented as a sequence of dots (for unpaired bases) and brackets (for paired bases). In our algorithm, the action sets are defined as follows: for unpaired sites, $\mathcal{A}_t = \{A, G, C, U\}$, and for paired sites, $\mathcal{A}_t = \{GC, CG, AU, UA\}$. At the beginning of each run of our algorithm (ECO-F and ECO-G in conjunction with MCTS acquisition), we draw a random permutation for the order of locations to be selected in each level of the tree. The reward value offered by the environment (i.e. the black-box function) at any time step $t$ corresponds to the normalized Hamming distance between the target structure $\omega$ and the structure $y_t = F(x_t)$ of the sequence $x_t$ found by each algorithm, i.e. $d_H(\omega, y_t)$. We compare the performance of our algorithms against RS as a baseline, where random search is carried out over the given structure (i.e. using the default policy $\pi_{RS}$) rather than over unstructured random sequences. We also include two state-of-the-art algorithms in our experiments: MCTS-RNA of Yang et al. (2017) and LEARNA of Runge et al. (2019). MCTS-RNA has an exploration parameter, which we tune in advance (per sequence). LEARNA has a set of 14 hyper-parameters, tuned a priori using training data and provided by the authors of Runge et al. (2019). Note that the latter training phase (for LEARNA) as well as the former exploration parameter tuning (for MCTS-RNA) is offered to the respective algorithms as an advantage, whereas for our algorithm we use a global set of heuristic choices for the two hyper-parameters rather than attempting to tune them. In particular, we set the exploration parameter $c$ to 0.5 and the number of MCTS playouts at each time step to $30 \times h$, where $h$ is the height of the tree (i.e. the number of dots and bracket pairs).
The latter heuristic choice is made since the bigger the tree, the more playouts are needed to explore the space. We point out that the entire design pipeline in state-of-the-art algorithms typically also includes a local improvement step (as a post-processing step), which is either a rule-based search (e.g. in Yang et al. (2017)) or an exhaustive search (e.g. in Runge et al. (2019)) over the mismatched sites. We do not include the local improvement step in our experiments, since we are interested in measuring the sample efficiency of different algorithms. In other words, the question is the following: given a fixed and finite evaluation budget, which algorithm is able to get closer to the target structure? For our experiments, we focus on three puzzles from the Eterna-100 dataset (Anderson-Lee et al. (2016)). Two of the selected puzzles (#15 and #41, of lengths 30 and 35, respectively), despite their fairly small lengths, are challenging for many algorithms (see Anderson-Lee et al. (2016)). In both puzzles, our algorithms ECO-F and ECO-G (with MCTS acquisition) are able to significantly improve the performance of MCTS when only a limited number of black-box evaluations is available. All algorithms outperformed RS, as expected. Within the given budget of 500 evaluations, ECO-G, and especially ECO-F, are superior to LEARNA by a substantial margin (see Appendix). In puzzle number 41 (Figure 2), again both ECO-G and ECO-F significantly outperform LEARNA over the given number of evaluations. Interestingly, ECO-F is able to outperform LEARNA throughout the evaluation process, and on average finds a far better final solution than LEARNA. See Appendix for the third puzzle.
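As a small illustration of the quantities used in the design experiments, the sketch below parses a dot-bracket target structure into paired and unpaired sites, derives the tree height h (number of dots plus bracket pairs), and computes the normalized Hamming distance between two structures. The function names and the example structures are hypothetical; the folding algorithm itself is not included.

```python
def parse_dot_bracket(structure):
    """Return (pairs, unpaired) site indices of a dot-bracket string."""
    stack, pairs, unpaired = [], [], []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), pos))
        elif ch == ".":
            unpaired.append(pos)
        else:
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs), unpaired

def normalized_hamming(target, predicted):
    """Fraction of positions at which two equal-length structures differ."""
    if len(target) != len(predicted):
        raise ValueError("structures must have the same length")
    return sum(a != b for a, b in zip(target, predicted)) / len(target)

target = "((..))"
pairs, unpaired = parse_dot_bracket(target)
h = len(pairs) + len(unpaired)   # tree height: one level per site or pair
num_playouts = 30 * h            # heuristic playout budget from the text
distance = normalized_hamming(target, "(....)")
```

Here each paired site contributes one tree level with four possible actions (GC, CG, AU, UA) and each unpaired site contributes one level with four actions (A, G, C, U).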

5. CONCLUSIONS AND FUTURE WORK

In summary, we propose novel Fourier representations as surrogate models for black-box optimization over categorical variables and show performance improvements over existing baselines when combined with state-of-the-art acquisition methods. Considering the performance variability of the two surrogate model representations introduced in this paper across different problems, an important research avenue would be to incorporate an ensemble of surrogate models rather than a single one. Such an ensemble model would then update and explore both models simultaneously and draw samples from either individual models or a combination of both at any given time step. It would be interesting to see whether such an ensemble model would in fact be able to outperform both individual models over different combinatorial problems.

Our ECO algorithm incorporates an online estimator learnt via Hedge, rather than a Bayesian posterior mean function, which is commonly used in conjunction with Thompson Sampling (TS) and UCB for uncertainty quantification in Bayesian optimization algorithms. The Hedge algorithm has strong adversarial guarantees (see Dadkhahi et al. (2020) for theoretical results in the Boolean case), which can be shown to carry over to our setting as well. More precisely, given any additional black-box evaluation, it is guaranteed to move closer to the true black-box model. However, the exploration bonus used in the acquisition process via SA in our algorithm can be shown to be domain independent, i.e. sampled i.i.d. from the same distribution regardless of the query point. The terms that account for uncertainty in both TS and UCB are domain dependent and depend on the query point. As such, domain-dependent uncertainty incorporation would be an interesting next step, which is left for future work.

A. PROOF OF THEOREM 3.1

We first assume that the one-hot variables $x_{ij} \in \{0, 1\}$.
Plugging different choices of $x \in \mathcal{X} = [k]^n$ into Equation (2) leads to a system of linear equations with $k^n$ unknowns (the coefficients $\alpha_{I,J}$) and $k^n$ equations. We can express this system in matrix form as the product of the matrix of monomials (where each column $j$ corresponds to a monomial $\psi_j$, and each element $(i, j)$ corresponds to the evaluation of the monomial $\psi_j$ at the $i$-th choice for $x$) and the vector of unknown coefficients, which is set equal to the function values at the corresponding choices for $x$. We claim that there exists a permutation of the rows and columns of the matrix of monomials such that the latter becomes unit lower triangular, and is thereby full rank. As a result, the representation in (2) for $x_{ij} \in \{0, 1\}$ is complete and unique. To formally show that this permutation of the monomials' matrix exists, we use a construction by induction over the number of variables $n$ included in the representation. We denote the monomials' matrix over $n$ variables by $\Phi_n$, and define the one-hot variables $x_{ij}$ ($j \in [k-1]$) as the descendants of the parent categorical variable $x_i$ ($\forall i \in [n]$). It is easy to see that such a construction exists for the base case of only one variable, where the monomials' matrix is a $k \times k$ matrix. In this case, we use the following permutation of the monomials (columns): $(1, x_{11}, x_{12}, \ldots, x_{1(k-1)})$. We also use the following permutation of the $x_1$ values in rows: $(k, 1, \ldots, k-1)$. As a result, in this matrix only the elements of the first column and the main diagonal are non-zero and equal to one, and thus the matrix $\Phi_1$ is unit lower triangular. Assuming that the induction hypothesis holds for $n$ variables, we show that it also holds for $n + 1$ variables. Starting from a unit lower triangular matrix $\Phi_n$, we can construct the matrix $\Phi_{n+1}$ as follows.
Note that all the $k^n$ columns of $\Phi_n$ correspond to the monomials composed of the descendants of the first $n$ variables $\{x_{ij} : \forall i \in [n] \text{ and } \forall j \in [k-1]\}$, whereas each of the additional $k^{n+1} - k^n$ columns introduced in $\Phi_{n+1}$ involves exactly one term from $\{x_{(n+1)j} : \forall j \in [k-1]\}$ (possibly also containing factors from the previous $n$ variables). We can express $k^{n+1} - k^n$ using the following binomial expansion:

$$k^{n+1} - k^n = (k-1) \sum_{m=1}^{n+1} \binom{n}{m-1} (k-1)^{m-1}, \tag{11}$$

which follows from the binomial theorem, since $\sum_{m=1}^{n+1} \binom{n}{m-1} (k-1)^{m-1} = (1 + (k-1))^n = k^n$. In words, the additional $k^{n+1} - k^n$ columns can be considered as a collection of $\binom{n}{m-1} (k-1)^{m-1}$ $m$-th order monomials ($m \in [n+1]$), each of which includes one out of the $k-1$ descendant variables $x_{(n+1)j}$ ($j \in [k-1]$) together with $m-1$ variables from the descendants of the previous $n$ variables. Each of the latter $m-1$ variables can take values in $[k-1]$, whereas the remaining $n-m+1$ variables are set to $k$. Starting with $m = 1$, we have $(k-1)$ first order terms $(x_{(n+1)1}, x_{(n+1)2}, \ldots, x_{(n+1)(k-1)})$, which we assign to columns $(k^n+1, \ldots, k^n+(k-1))$. The $x$ values associated with rows $(k^n+1, \ldots, k^n+(k-1))$ are constructed by assuming that (i) $x_{n+1}$ takes values from $1$ to $k-1$ (in order), while (ii) all the remaining variables $x_j$ ($j \in [n]$) are set to $k$. As a consequence of (i), all the elements on the main diagonal are ones; also, as a consequence of (ii), all the higher degree monomials involving $x_{n+1}$ (which occupy the elements after the diagonal ones) are equal to zero. Thus the augmented matrix, until this point, remains unit lower triangular. We then consider the second degree terms ($m = 2$), where we have $n(k-1)$ terms (choice of one out of the other $n$ variables, each taking values from $[k-1]$) for each of the $k-1$ choices for the variable $x_{n+1}$. Starting with second order monomials involving $x_1$ and $x_{n+1}$, we again assume that all the remaining variables $x_i$ ($i \notin \{1, n+1\}$) are equal to $k$.
For any choice of $(j, \ell) \in [k-1]^2$ for $(x_1, x_{n+1})$, we add a new column corresponding to the monomial $x_{1j} x_{(n+1)\ell}$ as well as a new row in which $x_1 = j$ and $x_{n+1} = \ell$, whereas the remaining variables are set to $k$. As a result, we have that: (i) the diagonal element in the new row/column is equal to one, and (ii) all the elements in the future$^3$ columns are zeros, since any combination of one of the descendants of $x_{n+1}$ with the remaining variables (any variable except $x_1$ and $x_{n+1}$) is zero. We continue this construction strategy for the remaining variables until all the second degree terms involving $x_{n+1}$ and one out of the remaining $n-1$ variables are exhausted. We then repeat the same idea for terms with orders up to $n+1$, as defined by the binomial expansion in (11). As a result of this construction strategy, the monomial matrix $\Phi_{n+1}$ is unit lower triangular.

Now, we use this result to show the completeness and uniqueness of the representation with one-hot variables in $\{-1, 1\}$ in the following way.

Completeness: We showed that the representation (2) over $\{0, 1\}$ is complete, i.e. we can express any function using the representation in (2), where we have at most one descendant term from the same parent in each monomial. Now, we replace each $x_{ij}$ (from $\{0, 1\}$) in the latter representation with $(1 - x_{ij})/2$, where $x_{ij} \in \{1, -1\}$. The new representation can also be expressed via the expansion (2), since no two descendants from the same parent variable are multiplied with one another.

Uniqueness: Assume that the uniqueness condition is not satisfied. Then we have two distinct polynomial representations $f_1$ and $f_2$ that have the same value for every $x$. However, since $f_1$ and $f_2$ are distinct polynomials, $f_1(x) - f_2(x)$ is a polynomial $p(x)$ with at least one non-zero coefficient. By completeness, the $k^n$ monomials span the $k^n$-dimensional space of functions over $[k]^n$ and are therefore linearly independent, so $p$ is non-zero in at least one input $x^*$. This implies that $f_1(x^*) - f_2(x^*)$ is also non-zero, which is a contradiction, and the proof is complete.

Remark 1.
The group-theoretic Fourier representation defined in Equation (6) is unique and complete.

Proof. Let $\chi = [k]^n$ be the categorical domain. Let the true function be $f$. For generality, let us consider a complex-valued function $f : \chi \to \mathbb{C}$, where $\mathbb{C}$ is the field of complex numbers. The basis functions are $\psi_I(x)$. Now, one can view a function as a vector of length $k^n$, with one entry for the evaluation of the function at every point in the domain $\chi$. We denote the vector thus obtained for the function $f$ by $f^{\chi} \in \mathbb{C}^{k^n}$. Similarly, denote the vector of evaluations of the basis function $\psi_I$ by $\psi_I^{\chi} \in \mathbb{C}^{k^n}$. Let $A$ be the matrix created by stacking the vectors of all basis functions as its columns. Then, the Fourier representation can be written as $f^{\chi} = A\alpha$, where $\alpha$ is the vector of Fourier coefficients in our group-theoretic representation. Now, due to the use of complex exponentials, one can show that $\sum_{x \in [k]^n} \psi_I(x)\,\overline{\psi_{I'}(x)} = 0$ if $I \neq I'$. Therefore, the columns of the matrix $A$ are orthogonal. Hence, $A$ is a full rank matrix. Therefore, our representation merely expresses a vector in another full-rank orthogonal basis. Hence, it is unique and complete.
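Both arguments in this appendix can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: the helper names `monomial_matrix` and `character_matrix` are hypothetical, the one-hot variables are taken in $\{0, 1\}$ with category $k$ mapped to the all-zeros pattern, and the characters are assumed to take the form $\psi_I(x) = \exp(2\pi i \langle I, x \rangle / k)$, since Equation (6) is not reproduced in this appendix.

```python
import itertools
import numpy as np

def monomial_matrix(n, k):
    """Monomial matrix of the abridged one-hot representation over {0, 1}.

    Rows: all points x in [k]^n. Columns: all monomials, indexed by a tuple J
    with J_i = 0 (variable i absent) or J_i = j in [k-1] (descendant x_ij present).
    Assumption: x_ij = 1 iff x_i = j, so category k is the all-zeros pattern.
    """
    points = list(itertools.product(range(1, k + 1), repeat=n))
    monomials = list(itertools.product(range(k), repeat=n))
    phi = np.zeros((len(points), len(monomials)))
    for r, x in enumerate(points):
        for c, J in enumerate(monomials):
            # A product of one-hot indicators is 1 only if every present factor is 1.
            phi[r, c] = float(all(j == 0 or xi == j for xi, j in zip(x, J)))
    return phi

def character_matrix(n, k):
    """Matrix A whose columns are the (assumed) characters exp(2*pi*i*<I, x>/k)."""
    pts = np.array(list(itertools.product(range(k), repeat=n)))
    # Entry (x, I) holds exp(2*pi*i * <I, x> / k); x and I range over the same grid.
    return np.exp(2j * np.pi * (pts @ pts.T) / k)

if __name__ == "__main__":
    n, k = 2, 3
    phi = monomial_matrix(n, k)
    print("rank of monomial matrix:", np.linalg.matrix_rank(phi), "of", k ** n)
    A = character_matrix(n, k)
    gram = A.conj().T @ A
    print("columns orthogonal:", np.allclose(gram, k ** n * np.eye(k ** n)))
```

A determinant of magnitude one for the monomial matrix is what one would expect if a row and column permutation makes it unit lower triangular, and a diagonal Gram matrix is the orthogonality property used in the proof of Remark 1.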

B DESCRIPTION OF ALGORITHMS

Surrogate Model Learning Algorithm: Let $n$ and $k$ denote the number of variables and the cardinality of each variable, respectively. The surrogate models used in ECO-F and ECO-G correspond to approximations of the representations given in (2) and (8), respectively, where each approximation is obtained by restricting the maximum order of interactions among variables to $m$. We consider each term in the latter surrogate models, i.e. monomials $\psi_{I,J}$ from (3) in ECO-F and characters $\psi_{\beta,I}$ ($\beta \in \{r, i\}$) from (10) in ECO-G, as an expert, denoted by $\psi_i$ ($i \in [d]$). The number of such experts in ECO-F is $d = \sum_{i=0}^{m} \binom{n}{i} (k-1)^i$, which coincides with the dimensionality $k^n$ of the space when $m = n$, whereas the number of experts in ECO-G is equal to $d = 2\sum_{i=0}^{m} \binom{n}{i} (k-1)^i - 1$. The coefficient of each expert $\psi_i$ is designated by $\alpha_i$. Since the exponential weights, utilized to update the coefficients $\alpha_i$, are non-negative, we maintain two non-negative coefficients $\alpha_i^{+}$ and $\alpha_i^{-}$, which yield $\alpha_i = \alpha_i^{+} - \alpha_i^{-}$. We initialize all the coefficients with a uniform prior, i.e. $\alpha_i^{\gamma} = 1/(2d)$ ($\forall i \in [d]$ and $\gamma \in \{-, +\}$). In each time step $t$, we draw a sample $x_t$ via Algorithm 1 with respect to our current estimate for the surrogate model $\hat{f}_\alpha$. The latter sample is then plugged into the black-box function to obtain the evaluation $f(x_t)$. This leads to a mixture loss $\ell_t$ as the difference between the evaluations obtained by our surrogate model and the black-box function for query $x_t$. Using this mixture loss, we compute the individual loss $\ell_t^i$ for each expert $\psi_i$. Finally, we update each coefficient in the model via an exponential weight obtained according to its incurred individual loss. We repeat this process until stopping criteria are met.
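Algorithm 3 Expert Categorical Optimization
Inputs: sparsity $\lambda$, max model order $m$
$t \leftarrow 0$, $\forall \gamma \in \{-, +\}\ \forall i \in [d]$: $\alpha^{t}_{i,\gamma} \leftarrow \frac{1}{2d}$
repeat
    $x_t \sim \hat{f}_{\alpha^t}$ via Algorithm 1 or Algorithm 2
    Observe $f(x_t)$
    $\hat{f}_{\alpha^t}(x) \leftarrow \sum_{i \in [d]} \left(\alpha^{t}_{i,+} - \alpha^{t}_{i,-}\right) \psi_i(x)$
    $\ell^{t+1} \leftarrow \hat{f}_{\alpha^t}(x_t) - f(x_t)$
    for $i \in [d]$ and $\gamma \in \{-, +\}$ do
        $\ell^{t+1}_i \leftarrow 2\,\lambda\,\ell^{t+1}\,\psi_i(x_t)$
        $\alpha^{t+1}_{i,\gamma} \leftarrow \alpha^{t}_{i,\gamma} \exp\left(-\gamma\,\eta_t\,\ell^{t+1}_i\right)$
        $\alpha^{t+1}_{i,\gamma} \leftarrow \lambda \cdot \dfrac{\alpha^{t+1}_{i,\gamma}}{\sum_{\mu \in \{-,+\}} \sum_{j \in [d]} \alpha^{t+1}_{j,\mu}}$
    end for
    $t \leftarrow t + 1$
until Stopping Criteria
return $x^{*} = \arg\min_{\{x_i :\ \forall i \in [t]\}} f(x_i)$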
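One round of the coefficient update in Algorithm 3 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `eco_update` and the array layout are hypothetical, the signs $\gamma \in \{-, +\}$ are read as $\mp 1$, the learning rate is passed in as a constant instead of the anytime schedule, and the normalization is applied once after all experts have been updated.

```python
import numpy as np

def eco_update(alpha_pos, alpha_neg, psi_x, f_x, lam, eta):
    """One exponential-weight update of the surrogate coefficients.

    alpha_pos, alpha_neg : non-negative coefficient vectors (length d)
    psi_x                : expert evaluations psi_i(x_t) (length d)
    f_x                  : black-box evaluation f(x_t)
    lam                  : sparsity parameter (total mass of the coefficients)
    eta                  : learning rate (assumed constant in this sketch)
    """
    # Surrogate prediction with alpha_i = alpha_i^+ - alpha_i^-.
    f_hat = np.dot(alpha_pos - alpha_neg, psi_x)
    # Mixture loss: surrogate evaluation minus black-box evaluation.
    mixture_loss = f_hat - f_x
    # Individual loss of each expert.
    expert_loss = 2.0 * lam * mixture_loss * psi_x
    # Multiplicative update with opposite signs for the two copies.
    new_pos = alpha_pos * np.exp(-eta * expert_loss)
    new_neg = alpha_neg * np.exp(+eta * expert_loss)
    # Rescale so that all coefficients sum to lam.
    total = new_pos.sum() + new_neg.sum()
    return lam * new_pos / total, lam * new_neg / total

if __name__ == "__main__":
    d = 3
    a_pos = np.full(d, 1.0 / (2 * d))   # uniform prior 1/(2d)
    a_neg = np.full(d, 1.0 / (2 * d))
    psi = np.array([1.0, -1.0, 1.0])    # toy expert evaluations at x_t
    a_pos, a_neg = eco_update(a_pos, a_neg, psi, f_x=-1.0, lam=1.0, eta=0.5)
    print("updated prediction:", np.dot(a_pos - a_neg, psi))
```

Keeping the positive and negative copies separate is what lets a multiplicative update, which can only produce non-negative weights, represent signed coefficients.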

Number of Experts:

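The number of terms in the vanilla one-hot encoded Fourier representation is $2^{kn}$, whereas our abridged representation reduces this number to $k^n$, matching the space dimensionality, thereby making the algorithm computationally tractable and efficient. When a max degree of $m$ is used in the approximate representation, the number of terms in the abridged representation is equal to $d = \sum_{i=0}^{m} \binom{n}{i} (k-1)^i$.

Learning Rate: The anytime learning rate (at time step $t$) used in Algorithm 3 is given by Gerchinovitz & Yu (2011); Dadkhahi et al. (2020):

$$\eta_t = \min\left\{ \frac{1}{e_{t-1}},\ c\,\sqrt{\frac{\ln(2d)}{v_{t-1}}} \right\},$$

where $c \triangleq \sqrt{2(\sqrt{2}-1)/(\exp(1)-2)}$ and

$$z^{\gamma}_{j,t} \triangleq -2\,\gamma\,\lambda\,\ell_t\,\psi_j(x_t),$$

$$e_t \triangleq \inf_{k \in \mathbb{Z}} \left\{ 2^k : 2^k \geq \max_{s \in [t]}\ \max_{\substack{j,k \in [d] \\ \gamma,\mu \in \{-,+\}}} \left| z^{\gamma}_{j,s} - z^{\mu}_{k,s} \right| \right\},$$

$$v_t \triangleq \sum_{s \in [t]}\ \sum_{\substack{j \in [d] \\ \gamma \in \{-,+\}}} \alpha^{\gamma}_{j,s} \left( z^{\gamma}_{j,s} - \sum_{\substack{k \in [d] \\ \mu \in \{-,+\}}} \alpha^{\mu}_{k,s}\, z^{\mu}_{k,s} \right)^{2}.$$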
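To make the expert counts concrete, the following sketch evaluates them for given $n$, $k$ and $m$. The function names are hypothetical, the ECO-G count follows the expression $2\sum_{i=0}^{m}\binom{n}{i}(k-1)^i - 1$ from the description of the learning algorithm, and the order-$m$ vanilla one-hot count is taken to be the number of monomials of degree at most $m$ over $nk$ Boolean variables, which is our reading of the text.

```python
from math import comb

def abridged_terms(n, k, m):
    """Number of monomials of order <= m in the abridged one-hot representation."""
    return sum(comb(n, i) * (k - 1) ** i for i in range(m + 1))

def group_theoretic_terms(n, k, m):
    """Number of real and imaginary characters of order <= m (ECO-G experts)."""
    return 2 * abridged_terms(n, k, m) - 1

def vanilla_one_hot_terms(n, k, m):
    """Monomials of degree <= m over n*k Boolean variables (assumed reading)."""
    return sum(comb(n * k, i) for i in range(m + 1))

if __name__ == "__main__":
    n, k, m = 30, 4, 2   # setting of the RNA experiments with order-2 models
    print("ECO-F experts:", abridged_terms(n, k, m))
    print("ECO-G experts:", group_theoretic_terms(n, k, m))
    print("vanilla one-hot terms:", vanilla_one_hot_terms(n, k, m))
```

Setting $m = n$ in `abridged_terms` recovers $k^n$ by the binomial theorem, which matches the claim that the full abridged representation has as many terms as the domain has points.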

C CONTINUED RELATED WORK

A variety of discrete search algorithms and meta-heuristics have been studied in the literature for combinatorial optimization over categorical variables. Such algorithms, including Genetic Algorithms Holland & Reitman (1978), Simulated Annealing Spears (1993), and Particle Swarms Kennedy & Eberhart (1995), are generally inefficient in finding the global minima. In the context of biological sequence optimization, the most popular method is directed evolution Arnold (1998), which explores the space by only making small mutations to existing sequences. A recent promising approach to sequence optimization consists of fitting a neural network model to predict the black-box function and then applying gradient ascent on the latter model Bogard et al. (2019); Liu et al. (2020). This approach allows for a continuous relaxation of the discrete search space, making possible step-wise local improvements to the whole sequence at once based on a gradient direction. However, these methods have been shown to suffer from vanishing gradients Linder & Seelig (2020). Further, the projected sequences in the continuous relaxation space may not be recognized by the predictors, leading to poor convergence. Generative model-based optimization approaches aim to learn distributions whose expectation coincides with evaluations of the black box and try to maximize such expectation Gupta & Zou (2019); Brookes et al. (2019). However, such approaches require a pre-trained generative model for optimization.

D BBO EXPERIMENTS

Latin Square Problem: A Latin square of order $k$ is a $k \times k$ matrix of elements $x_{ij} \in [k]$, such that each number appears in each row and column exactly once. When $k = 5$, the problem of finding a Latin square has 161,280 solutions in a space of dimensionality $5^{25}$. We formulate the problem of finding a Latin square of order $k$ as a black-box function by imposing an additive penalty of one for any repetition of numbers in any row or column. As a result, function evaluations are in the range $[0, 2k(k-1)]$, and a function evaluation of zero corresponds to a Latin square of order $k$. We consider a noisy version of this problem, where additive Gaussian noise with zero mean and standard deviation $\sigma = 0.1$ is added to the function evaluations observed by each algorithm. Figure 3 demonstrates the performance of different algorithms, in terms of the best function value found until time $t$, over 500 time steps. Both ECO-F and ECO-G outperform the baselines by a considerable margin. In addition, both ECO-G and ECO-F match COMBO's performance closely until time step $t = 190$. At larger time steps, COMBO outperforms the other algorithms; however, this performance comes at the price of a far larger computation time. As demonstrated in Table 1, ECO-F and ECO-G offer a speed-up over COMBO by a factor of approximately 100 and 50, respectively.

Pest Control Problem: In the pest control problem, given $n$ stations and $k-1$ pesticide types, the idea is to contain the spread of the pest (with minimum cost), which propagates throughout the stations in an interactive and probabilistic fashion. The $k$-th category for each variable corresponds to the choice of no pesticide at all. Controlling the spread of the pest is carried out via the choice of the right type of pesticide, subject to a penalty proportional to its associated cost. A closed form definition of this problem is given in Oh et al. (2019). The results for different algorithms are shown in Figure 4.
Despite the fact that COMBO is able to find the minimum in fewer time steps (in ≈ 200 steps) than ECO-F (in ≈ 360 steps) on average, ECO-F outperforms COMBO during initial time steps (until $t \approx 180$). SA performs competitively, but is ultimately unable to find the optimal solution to this problem within the designated 500 steps. The poor performance of ECO-G can be explained by the interactive nature of the problem, where early mistakes are punished inordinately. Early mistakes made by ECO-G can also be attributed to the large number of experts (with noisy coefficients) in its model, which in turn promotes an early exploratory behavior.

RNA Sequence Optimization Problem: Structured RNA molecules play a critical role in many biological applications, ranging from control of gene expression to protein translation. The native secondary structure of an RNA molecule is usually the minimum free energy (MFE) structure. Consider an RNA sequence as a string $A = a_1 \ldots a_n$ of $n$ letters (nucleotides) over the alphabet $\Sigma = \{A, U, G, C\}$. A pair of complementary nucleotides $a_i$ and $a_j$, where $i < j$, can interact with each other and form a base pair (denoted by $(i, j)$), A-U, C-G and G-U being the energetically stable pairs. Thus, the secondary structure of an RNA can be represented by an ensemble of pairing bases. Finding the most stable RNA sequences has immediate uses in materials and biomedical applications Li et al. (2015). Studies show that by controlling the structure and free energy of an RNA molecule, one may modulate its translation rate and half-life in a cell Buchan & Stansfield (2007); Davis et al. (2008), which is important in the context of viral RNA. A number of RNA folding algorithms Lorenz et al. (2011); Markham & Zuker (2008) use a thermodynamic model (e.g. Zuker & Stiegler (1981)) and dynamic programming to estimate the MFE of a sequence. However, the $O(n^3)$ time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences Gould et al. (2014) and exhaustively searching the space to identify the global free energy minimum, as the number of sequences grows exponentially as $4^n$.
Here, we formulate the RNA sequence optimization problem as follows: For a sequence of length $n$, find the RNA sequence that will fold into the secondary structure with the lowest minimum free energy. In our experiments, we initially set $n = 30$ and $k = 4$. We then use the popular RNAfold package Lorenz et al. (2011) to evaluate the MFE for a given sequence. The goal is to find the lowest MFE sequence by calling the MFE evaluator a minimum number of times. The performance of different algorithms is depicted in Figure 1, where both ECO-F and particularly ECO-G outperform the baselines as well as COMBO by a considerable margin.

Energy-optimized RNA Structures: Sample RNA sequences obtained via ECO-G after 4000 time steps for $n = 30$ and $n = 60$ are shown in Figures 6 and 7, respectively. The resulting energy-optimized sequences (as obtained using the RNAfold service) have high (> 90%) GC content, which makes the strongest positive contribution to lowering MFE Trotta (2014), as pairings between G and C have three hydrogen bonds and are more stable than A and U pairings. For odd values of $n$, there is a loop with an odd number of residues or a single unpaired base at the end, but there is still a GC-rich double helix. In contrast, the structures generated by the under-performing algorithms do show unpaired bases and have lower GC content, leading to high energy structures (e.g. Figures 8 and 9 are obtained via SA after 4000 steps for $n = 30$ and $60$, respectively).
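The Latin square objective described earlier in this section can be sketched as follows. This is one possible reading of the penalty, under which every repeated occurrence of a number within a row or column costs one, so that a line with $q$ distinct values contributes $k - q$; this reading reproduces the stated range $[0, 2k(k-1)]$. The function names and the flat input layout are hypothetical.

```python
import numpy as np

def latin_square_penalty(x, k):
    """Number of repetitions over all rows and columns of a k x k assignment.

    x is a flat sequence of k*k categorical values in {1, ..., k}.
    A value of zero means that x is a Latin square of order k.
    """
    grid = np.asarray(x).reshape(k, k)
    penalty = 0
    for line in list(grid) + list(grid.T):       # k rows followed by k columns
        penalty += k - len(set(line.tolist()))   # repeated entries in this line
    return penalty

def noisy_latin_square(x, k, sigma=0.1, rng=None):
    """Penalty observed through additive zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    return latin_square_penalty(x, k) + rng.normal(0.0, sigma)

if __name__ == "__main__":
    k = 5
    cyclic = [(i + j) % k + 1 for i in range(k) for j in range(k)]
    print("cyclic square penalty:", latin_square_penalty(cyclic, k))
    print("constant grid penalty:", latin_square_penalty([1] * (k * k), k))
```

The cyclic construction gives a valid Latin square, while the constant grid attains the worst case with $k-1$ repetitions in each of the $2k$ rows and columns.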

E DESIGN EXPERIMENTS

For our experiments, we focus on three puzzles from the Eterna-100 dataset Anderson-Lee et al. (2016). Two of the selected sequences (puzzles 15 and 41 of lengths 30 and 35, resp.), despite their fairly small lengths, are challenging for many algorithms (see Anderson-Lee et al. (2016)). In both puzzles, our MCTS variants (ECO-F and ECO-G) are able to significantly improve the performance of MCTS when a limited number of true rewards are available. All algorithms outperform RS, as expected. Within the given budget of 500 evaluations, both ECO-G, and especially ECO-F, are superior to LEARNA by a substantial margin (see Figure 12). In puzzle number 41 (Figure 13), again both ECO-G and ECO-F significantly outperform LEARNA over the given number of evaluations. Interestingly, ECO-F is able to outperform LEARNA throughout the evaluation process, and on average finds a far better final solution than LEARNA. The final sequence is puzzle #70 of length 184. The results of different algorithms over the latter puzzle are shown in Figure 14. As we can see from this figure, MCTS-RNA and LEARNA perform very similarly over the given budget of 500 evaluations. ECO-F is able to outperform the remaining algorithms throughout the evaluation steps. Initially, ECO-G has a similar performance to those of 

F CHOICE OF THE ACQUISITION METHOD

Throughout the experiments, we designated SA and MCTS as acquisition methods for generic BBO and design problems, respectively. This choice was made in accordance with the literature, where SA is typically used as a baseline and method of choice for the generic BBO problem, whereas MCTS has been commonly used for the design problem. For instance, SA has been considered as a baseline and/or acquisition method in Dadkhahi et al. (2020), Ricardo Baptista (2018), and Oh et al. (2019) (albeit with a different algorithm than ours). On the other hand, MCTS (i.e. RNA-MCTS as well as its variations) is perhaps the most popular RNA design technique in the literature. Here, we point out that both SA and MCTS can be used for both generic BBO and design problems. In this section, we compare the performance of different acquisition methods in each problem. First, we consider the generic BBO problem of RNA sequence optimization with $n = 30$ (considered in Section 4). Figure 15 demonstrates the performance of ECO-F and ECO-G when SA or MCTS is used as the acquisition method. As we can see from this figure, the SA variants perform slightly better than their MCTS counterparts over 500 steps. In particular, although the performance gap is initially moderately large, it narrows over time. Next, we consider two design problems from Section 4: puzzles #15 and #41. Note that, when using SA as the acquisition method, we apply the softmax operator (in Algorithm 1) over the set {GC, CG, AU, UA} if the corresponding variable is part of a paired base. As we can see from Figure 16, for puzzle #15, ECO-F with MCTS outperforms the remaining algorithms. The rest of the algorithms have very similar performances, with ECO-F (MCTS) marginally surpassing the SA variants. As we can see from Figure 17, for puzzle #41, the MCTS variant of ECO-F slightly 

G ORDER OF THE SURROGATE MODEL

As mentioned in Section 4, we used $m = 2$ as the maximum order of the representations in all the experiments. In this section, we focus on the generic BBO problem of RNA optimization and investigate the impact of the model order on the performance of the proposed ECO algorithms. In particular, we compare the performance of the algorithms at $m = 3$ with that at $m = 2$. As we can see from Figure 18, at smaller evaluation budgets, the order 2 models moderately outperform their order 3 counterparts in both ECO-F and ECO-G. As we increase the number of samples, this performance gap becomes smaller. At a budget of 500 evaluations, ECO-G3 outperforms ECO-G2 by a small margin of 0.1. At the same evaluation budget, ECO-F3 is slightly inferior to ECO-F2 by a margin of 0.2. Considering the convergence behavior of the curves at order 3 versus those at order 2, we expect the former models to eventually outperform the latter models at higher numbers of evaluations. However, since sample efficiency is typically the main concern in BBO problems, it makes sense to use low-order approximations. We point out that a similar observation was made in Ricardo Baptista (2018) for the Boolean case, where higher order models suffer from a slower start due to the higher dimensionality of the parameter space. From our experiments in categorical problems, this behavior seems to be even more pronounced due to the higher dimensionality of the categorical domains. A summary of the computation times for ECO algorithms at different model orders is given in Table 2. Since the complexity of ECO is linear in the number of experts, which grows exponentially with the model order $m$, we observe an increase in the computational complexity of ECO-F3 (ECO-G3) versus that of ECO-F2 (ECO-G2) by a factor of 9.7 (16.3). 



$^1$ Note that in the general case of different cardinalities for different variables, $I \in [k_1] \times [k_2] \times \ldots \times [k_n]$, where $\times$ denotes the Cartesian product, and the exponent denominator in the complex exponential character is replaced by $k = \mathrm{LCM}(k_1, k_2, \ldots, k_n)$.

$^2$ Note that this assumption is necessary in order to ensure that monomials in future columns for the same row evaluate to zero; choices of $x$ for which this assumption is not valid are addressed in subsequent rows.

$^3$ Note that the elements in the previous columns in the same row corresponding to monomials involving the selected $m-1$ variables are non-zero as well.



Figure 1: RNA BBO Problem with n = 30

For a maximum degree of $m$, the number of terms in a vanilla one-hot encoded representation is equal to $\sum_{i=0}^{m} \binom{nk}{i}$. Finally, the numbers of terms in the full and order-$m$ group-theoretic Fourier expansions are equal to $2k^n - 1$ and $d = 2\sum_{i=0}^{m} \binom{n}{i} (k-1)^i - 1$, respectively.

MCTS Algorithm: For a complete version of Algorithm 2, see Algorithm 4.

$N(s, a) \leftarrow N(s, a) + 1$
$Q(s, a) \leftarrow Q(s, a) + \frac{1}{N(s, a)} \left(r - Q(s, a)\right)$
$s \leftarrow \text{parent}(s)$; $a \leftarrow$ visited action on $s$
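The surviving lines of the backup step amount to an incremental mean update of $Q(s, a)$ along the path from the evaluated node to the root. The sketch below illustrates this under simplifying assumptions: the `Node` class, its field names, and the convention that the loop stops at the root are hypothetical and are not taken from Algorithm 4.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional

@dataclass
class Node:
    """Hypothetical search-tree node storing visit counts and value estimates."""
    parent: Optional["Node"] = None
    action_from_parent: Optional[Hashable] = None
    N: Dict[Hashable, int] = field(default_factory=dict)     # N(s, a)
    Q: Dict[Hashable, float] = field(default_factory=dict)   # Q(s, a)

def backup(leaf: Node, reward: float) -> None:
    """Propagate a reward from a leaf to the root with running-average updates."""
    s, a = leaf.parent, leaf.action_from_parent
    while s is not None:
        s.N[a] = s.N.get(a, 0) + 1
        q = s.Q.get(a, 0.0)
        s.Q[a] = q + (reward - q) / s.N[a]    # incremental mean of the rewards
        # Move up: a becomes the action that led from the parent to s.
        s, a = s.parent, s.action_from_parent

if __name__ == "__main__":
    root = Node()
    child = Node(parent=root, action_from_parent="G")
    leaf = Node(parent=child, action_from_parent="C")
    backup(leaf, reward=1.0)
    backup(leaf, reward=0.0)
    print(root.N, root.Q)
```

The incremental form avoids storing the full reward history, since $Q(s, a)$ always equals the average of the rewards backed up through the pair $(s, a)$ so far.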

Figure 3: Best function evaluation seen so far for the Latin Square problem.

Figure 4: Best function evaluation seen so far for the pest control problem.

Figure 5: Best function evaluation seen so far for the RNA sequence optimization problem with n = 30.

Figure 6: RNA Structure via ECO-G for n = 30

Figure 7: RNA Structure via ECO-G for n = 60

Figure 8: RNA Structure via SA for n = 30

Figure 9: RNA Structure via SA for n = 60

Figure 10: RNA Structure via ECO-G for n = 31

Figure 11: RNA Structure via ECO-G for n = 31

Figure 14: Best function evaluation for RNA Design of puzzle #70 with n = 184.

Figure 16: Comparison of different acquisition methods for design puzzle #15 with n = 30.

BOCS and COMBO are hindered by associated high computational complexities, which grow polynomially with both the number of variables and the number of function evaluations. More recently, a computationally efficient black-box optimization algorithm (COMEX) (Dadkhahi et al. (2020)) was introduced to address the computational impediments of its Bayesian counterparts. COMEX adopts a Boolean Fourier representation as its surrogate model, which is updated via an exponential weight update rule. Nevertheless, COMEX is limited to functions over the Boolean hypercube. We generalize COMEX to handle functions over categorical variables by proposing two representations for modeling functions over categorical variables: an abridged one-hot encoded Boolean Fourier representation and a Fourier representation on finite Abelian groups. The utilization of the latter representation as a surrogate model in combinatorial optimization algorithms is novel to this work. Factorizations based on one-hot encoding have been previously (albeit briefly) suggested in Ricardo Baptista (2018) to enable black-box optimization algorithms designed for Boolean variables to address problems over categorical variables. Different from Ricardo Baptista (2018), we show that we can significantly reduce the number of additional variables introduced upon one-hot encoding, and that such a reduced representation is in fact complete and unique.

Our Techniques: In order to address this problem, we adopt a surrogate model-acquisition function based learning framework, where an estimate for the black-box function f (i.e. the surrogate model) is updated sequentially via black-box function evaluations observed until time step t. The selection of candidate points for black-box function evaluation is carried out via an acquisition function, which uses the surrogate model f as an inexpensive proxy (to make many internal calls) for the black-box function and produces the next candidate point to be evaluated. The sequence proceeds as follows:

… $x_{11} + \alpha_2 x_{12} + \alpha_3 x_{21} + \alpha_4 x_{22} + \alpha_5 x_{11} x_{21} + \alpha_6 x_{11} x_{22} + \alpha_7 x_{12} x_{21} + \alpha_8 x_{12} x_{22}$.

Fourier Representation on Finite Abelian Groups: We define a cyclic group structure $\mathbb{Z}/k_i\mathbb{Z}$ over the elements of each categorical variable $x_i$

Inputs: surrogate model $\hat{f}_\alpha$, search tree $\mathcal{T}$
Initialize $s_{\text{best}} = \{\}$, $r_{\text{best}} = -\infty$

r best then The computational complexity per time step associated with learning the surrogate model, for both representations introduced in 3.1, is in O

Killoran et al.

Inputs: surrogate reward model $\hat{f}_\alpha$, exploration parameter $c$, search tree $\mathcal{T}$
$s_{\text{best}} = \{\}$, $r_{\text{best}} = -\infty$
$a_t \leftarrow \pi_{\mathcal{T}}(s_t) = \arg\max_{a \in A_t} Q(s_t, a) + c\sqrt{\ln N(s_t) / N(s_t, a)}$

, which is important in the context of viral RNA. A number of RNA folding algorithms Lorenz et al. (2011); Markham & Zuker (2008) use a thermodynamic model (e.g. Zuker

Table 1: Average computation time per step (in seconds) over different problems and algorithms.

Table 2: Average computation time per step (in seconds) at different model orders.

