LEARNING NOT TO LEARN: NATURE VERSUS NURTURE IN SILICO

Abstract

Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of memory-based meta-learning (or 'learning to learn') to answer the question of when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents' lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: one in which meta-learning yields a learning algorithm that implements task-dependent information integration, and a second in which meta-learning imprints a heuristic or 'hard-coded' behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be achieved quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.

1. INTRODUCTION

The 'nature versus nurture' debate (e.g., Mutti et al., 1996; Tabery, 2014), i.e., the question of which aspects of behavior are 'hard-coded' by evolution and which are learned from experience, is one of the oldest and most controversial debates in biology. Evolutionary principles prescribe that hard-coded behavioral routines should be those for which there is no benefit in adaptation. This is believed to be the case for behaviors whose evolutionary advantage varies little among individuals of a species. Mating instincts or flight reflexes are general solutions that rarely present an evolutionary disadvantage. On the other hand, features of the environment that vary substantially for individuals of a species potentially ask for adaptive behavior (Buss, 2015). Naturally, the same principles should not only apply to biological but also to artificial agents. But how can a reinforcement learning agent differentiate between these two behavioral regimes? A promising approach to automatically learn rules of adaptation that facilitate environment-specific specialization is meta-learning (Schmidhuber, 1987; Thrun & Pratt, 1998). At its core lies the idea of using generic optimization methods to learn inductive biases for a given ensemble of tasks. In this approach, the inductive bias usually has its own set of parameters (e.g., weights in a recurrent network; Hochreiter et al., 2001) that are optimized on the whole task ensemble, that is, on a long, 'evolutionary' time scale. These parameters in turn control how a different set of parameters (e.g., activities in the network) are updated on a much faster time scale. These rapidly adapting parameters then allow the system to adapt to a specific task at hand. Notably, the parameters of the system that are subject to 'nature', i.e., those that shape the inductive bias and are common across tasks, and those that are subject to 'nurture' are usually predefined from the start.
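The two-timescale structure described above admits a minimal sketch. The names (`slow_params` for the 'nature' parameters, `fast_state` for the rapidly adapting 'nurture' state), the toy target-matching task, and the random-search outer loop are all illustrative assumptions, not the specific setup analyzed in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(slow_params, task_target, lifetime=20):
    """Fast adaptation: update an activity-like state with a learning
    rate and initialization fixed by the slow ('nature') parameters."""
    lr = slow_params["lr"]            # inductive bias: how fast to adapt
    fast_state = slow_params["init"]  # inductive bias: where to start
    for _ in range(lifetime):
        # gradient step on a toy quadratic (target-matching) task
        fast_state -= lr * 2 * (fast_state - task_target)
    return (fast_state - task_target) ** 2  # final task loss

def outer_loop(task_targets, n_generations=200):
    """Slow 'evolutionary' optimization of the inductive bias by random
    search over the whole task ensemble."""
    best = {"lr": 0.01, "init": 0.0}
    best_loss = np.mean([inner_loop(best, t) for t in task_targets])
    for _ in range(n_generations):
        cand = {"lr": abs(best["lr"] + 0.05 * rng.normal()),
                "init": best["init"] + 0.5 * rng.normal()}
        loss = np.mean([inner_loop(cand, t) for t in task_targets])
        if loss < best_loss:  # keep mutations that help across all tasks
            best, best_loss = cand, loss
    return best, best_loss
```

Note the design choice this toy makes visible: if the task ensemble is narrow, the outer loop can drive `lr` toward zero and bake the solution into `init` (hard-coding); if targets vary widely, a nonzero `lr` (adaptivity) is favored.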
In this work, we use the memory-based meta-learning approach for a different goal, namely to acquire a qualitative understanding of which aspects of behavior should be hard-coded and which should be adaptive. Our hypothesis is that meta-learning can not only learn efficient learning algorithms, but can also decide not to be adaptive at all, and to instead apply a generic heuristic to the whole ensemble of tasks. Phrased in the language of biology, meta-learning can decide whether to hard-code a behavior or to render it adaptive, based on the range of environments the individuals of a species could encounter. We study the dependence of the meta-learned algorithm on three central features of the meta-reinforcement learning problem:

• Ecological uncertainty: How diverse is the range of tasks the agent could encounter?
• Task complexity: How long does it take to learn the optimal strategy for the task at hand? Note that this could differ from the time it takes to execute the optimal strategy.
• Expected lifetime: How much time can the agent spend on exploration and exploitation?

Using analytical and numerical analyses, we show that non-adaptive behaviors are optimal in two cases: when the optimal policy varies little across the tasks within the task ensemble, and when the time it takes to learn the optimal policy is too long to allow sufficient exploitation of the learned policy. Our results suggest that not only the design of the meta-task distribution, but also the lifetime of the agent can have strong effects on the meta-learned algorithm of RNN-based agents. In particular, we find highly nonlinear and potentially discontinuous effects of ecological uncertainty, task complexity and lifetime on the optimal algorithm. As a consequence, a meta-learned adaptation strategy that was optimized, e.g., for a given lifetime may not generalize well to other lifetimes.
This is essential for research questions that concern the resulting adaptation behavior, including curriculum design, safe exploration, and human-in-the-loop applications. Our work may provide a principled way of examining the constraint-dependence of meta-learned inductive biases. The remainder of this paper is structured as follows: First, we review the background on memory-based meta-reinforcement learning and contrast our approach with the related literature. Afterwards, we analyze a Gaussian multi-armed bandit setting, which allows us to analytically disentangle the behavioral impact of ecological uncertainty, task complexity and lifetime. Our derivation of the lifetime-dependent Bayes-optimal exploration reveals a highly non-linear interplay of these three factors. We show numerically that memory-based meta-learning reproduces our theoretical results and can learn not to learn. Furthermore, we extend our analysis to more complicated exploration problems. Throughout, we analyze the resulting recurrent dynamics of the network and the representations associated with learning and non-adaptive strategies.
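The trade-off between learning and hard-coding can be illustrated with a numerical sketch of a two-armed Gaussian bandit. This is only a toy stand-in for the setting analyzed later, under assumptions of our own: an explore-then-commit learner, a hard-coded policy that always pulls the arm favored by the prior, and a task ensemble whose reward gap is drawn from a Gaussian whose standard deviation plays the role of ecological uncertainty:

```python
import numpy as np

rng = np.random.default_rng(1)

def episode(lifetime, gap, n_explore, noise=1.0):
    """Two-armed Gaussian bandit with arm means (0, gap).
    n_explore = 0: hard-coded policy, always pull arm 1 (the arm
    favored by the prior). n_explore > 0: explore-then-commit, pull
    each arm n_explore times, then commit to the better estimate."""
    means = np.array([0.0, gap])
    est, total = np.zeros(2), 0.0
    for t in range(lifetime):
        if n_explore == 0:
            arm = 1                      # hard-coded behavior
        elif t < 2 * n_explore:
            arm = t % 2                  # alternate during exploration
        else:
            arm = int(np.argmax(est))    # commit to the better estimate
        r = means[arm] + noise * rng.normal()
        if n_explore > 0 and t < 2 * n_explore:
            est[arm] += r / n_explore    # running mean over explore pulls
        total += r
    return total

def expected_return(lifetime, n_explore, gap_mean, gap_std, n_tasks=3000):
    """Average return over a task ensemble with gap ~ N(gap_mean, gap_std);
    gap_std plays the role of ecological uncertainty."""
    gaps = gap_mean + gap_std * rng.normal(size=n_tasks)
    return float(np.mean([episode(lifetime, g, n_explore) for g in gaps]))
```

With a long lifetime and high ecological uncertainty, paying the exploration cost is worthwhile; with a short lifetime or a nearly deterministic ensemble, the hard-coded policy matches or beats the learner because the exploration cost can no longer be amortized.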

2. RELATED WORK & BACKGROUND

Meta-learning or 'learning to learn' (e.g., Schmidhuber, 1987; Thrun & Pratt, 1998; Hochreiter et al., 2001; Duan et al., 2016; Wang et al., 2016; Finn et al., 2017) has been proposed as a computational framework for acquiring task-distribution-specific learning rules. During a costly outer-loop optimization, an agent crafts a niche-specific adaptation strategy, which is applicable to an engineered task distribution. At inference time, the acquired inner-loop learning algorithm is executed for a fixed number of timesteps (the lifetime) on a test task. This framework has successfully been applied to a range of applications such as the meta-learning of optimization updates (Andrychowicz et al., 2016; Flennerhag et al., 2018; 2019), agent models (Rabinowitz et al., 2018), world models (Nagabandi et al., 2018) and explicit models of memory (Santoro et al., 2016; Bartunov et al., 2019). Already, early work by Schmidhuber (1987) suggested an evolutionary perspective on recursively learning the rules of learning. This perspective holds the promise of explaining the emergence of mechanisms underlying both natural and artificial behaviors. Furthermore, a similarity between the hidden activations of LSTM-based meta-learners and the recurrent activity of neurons in the prefrontal cortex has recently been suggested (Wang et al., 2018). Previous work has shown that LSTM-based meta-learning is capable of distilling a sequential integration algorithm akin to amortized Bayesian inference (Ortega et al., 2019; Rabinowitz, 2019; Mikulik et al., 2020). Here we investigate when the integration of information might not be the optimal strategy to meta-learn. We analytically characterize a task regime in which not adapting to sensory information is optimal. Furthermore, we study whether LSTM-based meta-learning is capable of inferring when to learn and when to execute a non-adaptive program. Rabinowitz (2019)

