FOURIER REPRESENTATIONS FOR BLACK-BOX OPTIMIZATION OVER CATEGORICAL VARIABLES

Abstract

Optimization of real-world black-box functions defined over purely categorical variables is an active area of research. In particular, optimization and design of biological sequences with specific functional or structural properties have a profound impact in medicine, materials science, and biotechnology. Standalone acquisition methods, such as simulated annealing (SA) and Monte Carlo tree search (MCTS), are typically used for such optimization problems. In order to improve the performance and sample efficiency of such acquisition methods, we propose to use existing acquisition methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To this end, we present two different representations, a group-theoretic Fourier expansion and an abridged one-hot encoded Boolean Fourier expansion. To learn such models, characters of each representation are considered as experts and their respective coefficients are updated via an exponential weight update rule each time the black box is evaluated. Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods, which achieve competitive or superior performance compared to state-of-the-art counterparts, while improving the computational cost and/or sample efficiency substantially.

1. INTRODUCTION

A plethora of practical optimization problems involve black-box functions, with no simple analytical closed forms, that can be evaluated at any arbitrary point in the domain. Optimization of such black-box functions poses a unique challenge due to restrictions on the number of possible function evaluations, as evaluating functions of real-world complex processes is often expensive and time consuming. Efficient algorithms for global optimization of expensive black-box functions take past queries into account in order to select the next query to the black-box function more intelligently.

While black-box optimization of real-world functions defined over integer, continuous, and mixed variables has been studied extensively in the literature, limited work has addressed purely categorical input variables. Categorical variables are particularly challenging compared to integer or continuous variables, as they have no natural ordering. However, many real-world functions are defined over categorical variables. One such problem, which is of wide interest, is the design of optimal chemical or biological (protein, RNA, and DNA) molecule sequences, which are constructed using a vocabulary of fixed size, e.g. 4 for DNA/RNA. Designing optimal molecular sequences with improved or novel structures and/or functionalities is of paramount importance in materials science, drug and vaccine design, synthetic biology, and many other applications (see Dixon et al. (2010); Ng et al. (2019); Hoshika et al. (2019); Yamagami et al. (2019)). Design of optimal sequences is a difficult black-box optimization problem over a combinatorially large search space (Stephens et al. (2015)), in which function evaluations often rely on wet-lab experiments, physics-inspired simulators, or knowledge-based computational algorithms, all of which are slow and expensive in practice. Another problem of interest is the constrained design problem, e.g. finding a sequence given a specific structure (or property), which is the inverse of the well-known folding problem discussed in Dill & MacCallum (2012). This problem is complex due to the strict structural constraints imposed on the sequence. In fact, one way to represent such a complex structural constraint is to constrain each next choice sequentially, based on the sequence elements that have already been chosen. Therefore, we divide the black-box optimization problem into two settings, depending on the constraint set: (i) the Generic Black-Box Optimization (BBO) problem, referring to the unconstrained case, and (ii) the Design Problem, referring to the case with complex sequential constraints.

Let x_t be the t-th sequence evaluated by the black-box function f. The key question in both settings is the following: given prior queries x_1, x_2, ..., x_t and their evaluations f(x_1), ..., f(x_t), how should the next query x_{t+1} be chosen? The acquisition must be devised so that, over a finite budget of black-box evaluations, the best query found is as close as possible to the minimizer in expectation over the randomness of the acquisition. In the literature, for design problems with sequential constraints, acquisitions based on Monte Carlo tree search (MCTS) are often used with real function evaluations f(x_t). For generic BBO problems in the unconstrained scenario, techniques based on simulated annealing (SA) are typically used as acquisition functions. A key missing ingredient in the categorical domain is a surrogate model for the black-box evaluations that can interpolate between such evaluations. Acquisition functions can then use cost-free approximate evaluations from the surrogate model internally, which reduces the need for frequent real evaluations and improves sample efficiency.
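The role a surrogate plays inside the acquisition step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `propose_next_query`, its parameters, the geometric cooling schedule, and the toy Hamming-distance surrogate are all assumptions made for illustration. Simulated annealing explores the categorical space using only cheap surrogate evaluations, and the single sequence it returns would then be sent to the real black box.

```python
import math
import random


def propose_next_query(surrogate, x0, n_categories, n_steps=500,
                       t0=1.0, cooling=0.99, seed=0):
    """Simulated annealing over categorical vectors, scored only by `surrogate`.

    Hypothetical helper (not from the paper). Requires n_categories >= 2.
    Returns the lowest-surrogate-value vector visited, starting from x0.
    """
    rng = random.Random(seed)
    current = list(x0)
    current_val = surrogate(current)
    best, best_val = list(current), current_val
    temp = t0
    for _ in range(n_steps):
        # Propose a neighbor: change one position to a different category.
        cand = list(current)
        i = rng.randrange(len(cand))
        cand[i] = (cand[i] + rng.randrange(1, n_categories)) % n_categories
        cand_val = surrogate(cand)
        delta = cand_val - current_val
        # Metropolis acceptance rule for minimization.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = cand, cand_val
            if current_val < best_val:
                best, best_val = list(current), current_val
        temp *= cooling  # assumed geometric cooling schedule
    return best


# Toy stand-in for a learned surrogate: Hamming distance to a hidden target.
TARGET = [0, 1, 2, 3, 0, 1, 2, 3]


def toy_surrogate(x):
    return sum(a != b for a, b in zip(x, TARGET))


# Only this one returned sequence would be evaluated by the real black box.
x_next = propose_next_query(toy_surrogate, x0=[0] * 8, n_categories=4)
```

In this sketch the 500 internal steps cost nothing in terms of the evaluation budget, since only `x_next` is passed to the expensive function.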
Due to the lack of efficient interpolators in categorical domains, existing acquisition functions rely solely on real black-box evaluations and therefore suffer under a finite evaluation budget.

Contributions: We address the above problem in this work. Our main contributions are as follows:

1. We present two representations for modeling real-valued combinatorial functions over categorical variables, which we then use to learn a surrogate model for the generic BBO problem and the design problem. The surrogate model is updated via a hedge algorithm in which the basis functions of our representations act as experts. This update is performed once per real black-box evaluation. To the best of our knowledge, the representations and/or their use in black-box optimization of functions over categorical variables are novel to this work.

2. In the BBO problem, the proposed method uses a version of simulated annealing that utilizes the current surrogate model for many internal cost-free evaluations before producing the next black-box query.

3. In the design problem, the proposed method uses a version of MCTS in conjunction with the current surrogate model as the reward function of the terminal states during intermediate tree traversals/backups, in order to improve the sample efficiency of the search algorithm.

4. Numerical results over synthetic benchmarks as well as real-world biological (RNA) sequence optimization and design problems demonstrate the competitive or superior performance of the proposed methods over state-of-the-art counterparts, while substantially reducing computation time and improving sample efficiency, respectively.
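To make the "basis functions as experts" idea in the first contribution concrete, the following Python sketch shows a generic exponentiated-gradient style update with paired positive and negative weights. It is offered under stated assumptions and is not the paper's exact rule: the class `HedgeSurrogate`, the learning rate `eta`, the `scale` parameter, and the placeholder feature map `one_hot_features` are hypothetical and stand in for the representations and update defined in the paper itself.

```python
import numpy as np


def one_hot_features(x, n_categories):
    """Placeholder basis (an assumption): a constant term plus one +/-1
    indicator per (variable, category) pair."""
    x = np.asarray(x)
    indicators = np.where(x[:, None] == np.arange(n_categories), 1.0, -1.0)
    return np.concatenate(([1.0], indicators.ravel()))


class HedgeSurrogate:
    """Linear surrogate whose coefficients come from exponential weights.

    Each basis function owns a positive and a negative expert weight;
    its coefficient is scale * (w_pos - w_neg).
    """

    def __init__(self, n_features, eta=0.1, scale=1.0):
        self.eta = eta
        self.scale = scale
        # Uniform initial weights over all 2 * n_features experts.
        self.w_pos = np.full(n_features, 0.5 / n_features)
        self.w_neg = np.full(n_features, 0.5 / n_features)

    def predict(self, phi):
        return float(self.scale * (self.w_pos - self.w_neg) @ phi)

    def update(self, phi, y):
        """One multiplicative update after observing black-box value y."""
        err = self.predict(phi) - y
        grad = 2.0 * err * self.scale * phi  # d(squared error) / d(w_pos)
        self.w_pos *= np.exp(-self.eta * grad)
        self.w_neg *= np.exp(self.eta * grad)
        # Renormalize so all expert weights sum to one.
        total = self.w_pos.sum() + self.w_neg.sum()
        self.w_pos /= total
        self.w_neg /= total


# One update per real evaluation: here a single observation y = 0.5.
phi = one_hot_features([0, 1], n_categories=2)
model = HedgeSurrogate(n_features=phi.size)
error_before = abs(model.predict(phi) - 0.5)
model.update(phi, y=0.5)
error_after = abs(model.predict(phi) - 0.5)
```

Because the weights are normalized, the coefficient vector has L1 norm at most `scale`, so in this sketch `scale` bounds the range of values the surrogate can represent.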



Hutter et al. (2011) suggest a surrogate model based on random forests to address optimization problems over categorical variables. The proposed SMAC algorithm uses a randomized local search under the expected improvement acquisition criterion in order to obtain candidate points for black-box evaluations. Bergstra et al. (2011) suggest a tree-structured Parzen estimator (TPE) for approximating the surrogate model, and maximize the expected improvement criterion to find candidate points for evaluation. For optimization problems over Boolean variables, multilinear polynomials (Baptista & Poloczek (2018); Dadkhahi et al. (2020)) and Walsh functions (Leprêtre et al. (2019)) have been used in the literature. Several algorithms have been designed specifically for black-box functions over combinatorial domains. In particular, the BOCS algorithm (Baptista & Poloczek (2018)), primarily devised for Boolean functions, employs a sparse monomial representation to model the interactions among different variables, and uses a sparse Bayesian linear regression method to learn the model coefficients. The COMBO algorithm of Oh et al. (2019) uses the Graph Fourier Transform (GFT) over a combinatorial graph, constructed via the graph Cartesian product of variable subgraphs, to gauge the smoothness of the black-box function. However,

