USING SEMANTIC DISTANCE FOR DIVERSE AND SAMPLE EFFICIENT GENETIC PROGRAMMING

Abstract

Evolutionary methods, such as genetic programming, search a space of programs to find those with good fitness, often using mutations that manipulate the syntactic structure of programs without being aware of how they affect the semantics. For applications where the semantics are highly sensitive to small syntactic mutations, or where fitness evaluation is expensive, this can make learning programs intractable. We introduce a mutation operator that yields mutated programs that are semantically far from previously evaluated programs, while still being semantically close to their parent. For function regression, this leads to an algorithm that is one to two orders of magnitude more sample efficient than other gradient-free methods, such as genetic programming or learning the weights of a neural network using evolutionary strategies. We show how this method can be applied to learning architecture-specific and general-purpose neural network optimizers, and to learning reinforcement learning loss functions. The learnt components are simple, interpretable, and high-performing, and contain novel features, such as weight growth, not seen in prior work.

1. INTRODUCTION

A program is a discrete structure, such as a symbolic expression or a sequence of instructions. Given a fitness function, genetic programming is an evolutionary algorithm that searches the space of programs for one with high fitness. An example of a program is a machine learning (ML) component, such as an optimizer or loss function, specified using calls to a library such as TensorFlow or JAX. Such programs contain an ordered list of nodes, where each node has an operator (such as addition or multiplication) and a list of inputs, each of which is an external input, a constant, or the output of a previous node. These programs thus compactly describe a mathematical function that can be executed on hardware, or further transformed in non-trivial ways, such as being differentiated.

Can we learn such programs? An important lesson from the history of ML research is that methods that scale and leverage computation will, given enough compute, typically outperform hand-designed priors and heuristics (Sutton, 2019). This lesson of replacing hand-designed features with learnt ones has most often been applied to learning neural networks using gradient descent. Parts of the model such as the optimizer or loss function are important for controlling aspects such as how fast the model trains, but typically have less useful gradient information available for training (Metz et al., 2019) and are harder to specify using parameterized functions.

In this paper we hypothesize that one issue with learning via evolution is that mutation operators that naively manipulate the syntactic structure of a program can be highly sample inefficient, because they propose mutations that are either too close to a previously evaluated program, or too far from the parent of the mutation (and thus likely to be deleterious). We introduce a mutation operator that uses semantic distance information to mutate safely and with diversity. We demonstrate the effectiveness of this method in a variety of ways.
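To make the representation concrete, here is a minimal, hypothetical sketch of a program as described above: an ordered list of nodes, where each node has an operator and a list of inputs, and each input is an external input, a constant, or the output of a previous node. The class and function names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str       # operator name, e.g. "add" or "mul"
    inputs: list  # each entry: ("input", index), ("const", value), or ("node", index)

# Small operator library; a real system would expose many more operators.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def execute(program, external_inputs):
    """Evaluate nodes in order; the final node's value is the program's output."""
    values = []
    for node in program:
        args = []
        for kind, ref in node.inputs:
            if kind == "input":
                args.append(external_inputs[ref])  # external input by index
            elif kind == "const":
                args.append(ref)                   # embedded constant
            else:  # "node": reuse the output of a previous node
                args.append(values[ref])
        values.append(OPS[node.op](*args))
    return values[-1]

# Example program computing f(x) = (x + 2) * x.
prog = [Node("add", [("input", 0), ("const", 2.0)]),
        Node("mul", [("node", 0), ("input", 0)])]
```

Because the program is just data, it can be mutated syntactically (swap an operator, rewire an input) while its semantics are probed by executing it on sample inputs.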
For function regression, we show our method is one to two orders of magnitude more sample efficient than other gradient-free methods. We then learn architecture-specific and general-purpose optimizers, which perform well within the training distribution and have simple, interpretable, and novel features that can be used to generate research hypotheses. Finally, we learn reinforcement learning losses in grid-world environments, with novel interpretable features that transfer to Atari.

2. RELATED WORK

Semantic genetic programming. See Vanneschi et al. (2014) for a survey. This area includes point mutations and crossover mutations. Semantic point mutations replace a node in a program with a new one that is semantically close, but not too close (Nguyen et al., 2009). Semantic crossover mutations combine two programs by swapping two nodes which are either semantically non-equivalent, or again close, but not too close (Beadle & Johnson, 2008; Uy et al., 2009a;b). One can also cross over whole programs, for example by summing two programs and simplifying (Moraglio et al., 2012), which can suffer from exponential size blow-up; and, although not genetic programming per se, Gangwani & Peng (2017) cross over neural network policies via distillation. Our method is most similar to semantic point mutation, although crucially (1) we only use semantic information at the program level, which allows applications where per-node semantic information is not defined, and (2) we also use semantic information for diversity in the mutation objective.

Diversity. Diversity in evolution usually refers to population diversity, where a diverse range of programs is kept in the gene pool to avoid premature convergence (Burke et al., 2004). Methods include reducing fitness for crowded areas of the phenotype space (Goldberg et al., 1987), or grouping within the phenotype space (Mouret & Clune, 2015).
Our use of diversity is orthogonal to, and could be combined with, population diversity: we ensure mutations are diverse from previously evaluated programs before evaluating them, which reduces evaluations of expensive fitness functions. Ensuring programs are not semantically identical to previous programs has been done before (Alet et al., 2020; Real et al., 2020), but we are the first to use semantic distance in the mutation objective, which generalizes to continuous search spaces.

Learning reinforcement learning (RL) algorithms, optimizers, and loss functions. Previous work on learning loss functions and optimizers includes many ways of parameterizing these components (hyperparameters, neural networks, or symbolic computation graphs) and many ways of learning (genetic programming, evolutionary strategies, meta-gradients, and Bayesian methods). Learning RL algorithms includes using meta-gradients (Oh et al., 2018; 2020) and evolving programs (Co-Reyes et al., 2021; Faust et al., 2019). Genetic programming has been used to learn loss functions (Bengio et al., 1994; Trujillo & Olague, 2006; Gonzalez & Miikkulainen, 2020). Methods for learning optimizers include meta-gradients (Hochreiter et al., 2001; Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017; Metz et al., 2020), RL (Bello et al., 2017; Li & Malik, 2016), and evolutionary strategies (Houthooft et al., 2018). More generally, evolutionary techniques have most commonly been applied to neural architecture search (Stanley & Miikkulainen, 2002; Real et al., 2019; Liu et al., 2018; Elsken et al., 2019; Pham et al., 2018; So et al., 2019), and AutoML aims to automate the machine learning training process (Hutter et al., 2019; Real et al., 2020). Our method uses genetic programming to learn symbolic computation graphs: this has the advantage of interpretable results and of avoiding noisy gradient information (Metz et al., 2019). Co-Reyes et al. (2021) is most similar to our work on learning RL algorithms; in contrast, we use semantic information in the mutation operator. We do not directly compare empirical performance due to the complexity of reproducing the same environments and evolutionary setup.

3. METHOD

Let G be the space of programs. The method has two key ingredients: first, a distance function d between pairs of programs that captures semantic information; and second, given G ∈ G, a mutation M(G) that respects this distance function. This mutation can then be used in an evolutionary algorithm, for example hill climbing, where the best program found so far is repeatedly mutated and evaluated.
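As a sketch of how the mutation slots into such a loop, here is a minimal hill-climbing skeleton, assuming placeholder functions `mutate` (standing in for M(G), which would use the distance d against the parent and the history of evaluated programs) and `fitness`; neither is specified by the paper at this point.

```python
def hill_climb(initial, mutate, fitness, steps=100):
    """Repeatedly mutate the best program so far and keep improvements."""
    best, best_fit = initial, fitness(initial)
    history = [initial]  # previously evaluated programs (the set H)
    for _ in range(steps):
        candidate = mutate(best, history)  # M(G): may consult d and the history
        history.append(candidate)
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:            # keep the mutant only if it improves
            best, best_fit = candidate, cand_fit
    return best, best_fit

# Toy usage: "programs" are integers, fitness peaks at 10, mutation adds 1.
best, fit = hill_climb(0, lambda g, h: g + 1, lambda g: -abs(g - 10), steps=50)
```

The toy mutation ignores the history; the point of the paper's operator is precisely to make `mutate` depend on d so that candidates are far from `history` but close to `best`.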

3.1. THE MUTATION OBJECTIVE

Let H ⊂ G be the set of previously evaluated programs, let v(G) be a measure of the complexity of a program G (for example, its number of nodes), and let µ, β > 0 be constants. Our objective to minimize is

f(M(G)) = |d(M(G), G) − µ| − µ tanh(min_{H∈H} d(M(G), H)/µ) + µβ v(M(G)).    (1)

This objective contains three terms.
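The objective can be transcribed directly into code. The sketch below assumes a scalar semantic distance `d`, a complexity measure `v`, and the constants `mu` and `beta` from the text; the function names are illustrative, not from the paper.

```python
import math

def mutation_objective(mutant, parent, history, d, v, mu, beta):
    # Term 1: keep the mutant at semantic distance roughly mu from its parent.
    closeness = abs(d(mutant, parent) - mu)
    # Term 2: reward semantic distance from every previously evaluated program in H
    # (tanh saturates, so being "far enough" is as good as being very far).
    diversity = -mu * math.tanh(min(d(mutant, h) for h in history) / mu)
    # Term 3: penalize program complexity.
    complexity = mu * beta * v(mutant)
    return closeness + diversity + complexity
```

Minimizing this trades off staying near the parent, moving away from the history H, and keeping programs small, with µ setting the distance scale and β the complexity pressure.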

