AUXILIARY TASK DISCOVERY THROUGH GENERATE-AND-TEST

Abstract

In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary tasks' usefulness based on how useful the features induced by them are for the main task. Our discovery algorithm significantly outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks across a suite of environments.

1. INTRODUCTION

The discovery question, what should an agent learn about, remains an open challenge for AI research. In the context of reinforcement learning, multiple components define the scope of what the agent is learning about. The agent's behavior defines its focus and attention in terms of data collection. Related exploration methods based on intrinsic rewards define what the agent chooses to do outside of reward maximization. Most directly, the auxiliary learning objectives we build in, including macro actions or options, models, and representation learning objectives, force the agent to learn about things beyond a reward-maximizing policy. The primary question is: where do these auxiliary learning objectives come from? Classically, there are two approaches to defining auxiliary objectives, which sit at the extremes of a spectrum of possibilities. The most common approach is for people to build the auxiliary objectives in by pre-defining option policies, intrinsic rewards, and model learning objectives. Although the most empirically successful, this approach has obvious limitations, like the feature engineering of old. At the other extreme is end-to-end learning. The idea is to build in as little inductive bias as possible, including the inductive biases introduced by auxiliary learning objectives. Instead, we let the agent's neural network discover and adapt internal representations and algorithmic components (e.g., discovering objectives (Xu et al., 2020), update rules (Oh et al., 2020), and models (Silver et al., 2017)) just through trial-and-error interaction with the world. This approach remains challenging due to data efficiency concerns, and in some cases it shifts the difficulty from auxiliary objective design to loss function and curriculum design.
An alternative approach that sits somewhere between human design and end-to-end learning is to hand-design many tasks in the form of additional output heads on the network that must be optimized in addition to the primary learning signal. These tasks, called auxiliary tasks, exert pressure on the lower layers of the neural network during training, yielding agents that can learn faster (Mirowski et al., 2016; Shelhamer et al., 2016), produce better final performance (Jaderberg et al., 2016), and at times transfer to other related problems (Wang et al., 2022). This positive influence on neural network training is called the auxiliary task effect and is related to the emergence of the good internal representations we seek in end-to-end learning. The major weakness of auxiliary task learning is its dependence on people. Relying on people to design auxiliary tasks is not ideal because it is challenging to know in advance which auxiliary tasks will be useful and, as we will show later, poorly specified auxiliary tasks can significantly slow learning. There has been relatively little work on autonomously discovering auxiliary tasks. One approach is to use meta-learning. Meta-learning methods are higher-level learning methods that adapt the parameters of the base learning system, such as step-sizes, through gradient descent (Xu et al., 2018). This approach can be applied to learning auxiliary tasks defined via General Value Functions, or GVFs (Sutton et al., 2011), by adapting the parameters that define the goal (cumulant) and termination functions via gradient descent (Veeriah et al., 2019). Generally speaking, these meta-learning approaches require large amounts of training data and are notoriously difficult to tune (Antoniou et al., 2018). An exciting alternative is to augment these meta-learning approaches with generate-and-test mechanisms that can discover new auxiliary tasks, which can later be refined via meta-learning.
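The shared-network arrangement described above can be made concrete with a small sketch. The sizes and names below are illustrative assumptions, not the paper's architecture: a shared torso produces features that feed one main Q head and several auxiliary heads, so auxiliary losses exert pressure on the shared lower layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, HIDDEN_DIM, N_ACTIONS, N_AUX_TASKS = 8, 16, 4, 3

# Shared torso plus one main head and several auxiliary heads.
W_torso = rng.normal(scale=0.1, size=(OBS_DIM, HIDDEN_DIM))
W_main = rng.normal(scale=0.1, size=(HIDDEN_DIM, N_ACTIONS))
W_aux = [rng.normal(scale=0.1, size=(HIDDEN_DIM, N_ACTIONS))
         for _ in range(N_AUX_TASKS)]

def forward(obs):
    """All heads read the same shared representation."""
    features = np.maximum(obs @ W_torso, 0.0)   # ReLU torso
    q_main = features @ W_main                  # main task head
    q_aux = [features @ W for W in W_aux]       # auxiliary (e.g., GVF) heads
    return features, q_main, q_aux

obs = rng.normal(size=OBS_DIM)
features, q_main, q_aux = forward(obs)
print(q_main.shape, len(q_aux))  # (4,) 3
```

Because every head backpropagates through `W_torso`, gradients from the auxiliary heads shape the same features the main task uses; this is the mechanism behind the auxiliary task effect.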
This approach has produced promising results in representation learning, where simple generate-and-test significantly improves classification and regression performance when combined with backprop (Dohare et al., 2021). Before we can combine meta-learning and generate-and-test, we must first develop the generate-and-test approach to auxiliary task discovery so that their combination has the best chance for success. Such an effort is worthy of an entire study on its own, so in this paper we leave combining the two to future work and focus on the generate-and-test approach. Despite significant interest, it remains unclear what makes an auxiliary task good or bad. The meta-learning approaches do not generate human-interpretable tasks. Updating toward multiple previous policies, called the value improvement path (Dabney et al., 2020), can improve performance but is limited to historical tasks. The gradient alignment between auxiliary tasks and the main task has been proposed as a measure of auxiliary task usefulness (Lin et al., 2019; Du et al., 2018); however, the efficacy of this measure has not been thoroughly studied. Randomly generated auxiliary tasks can help avoid representation collapse (Lyle et al., 2021) and improve performance (Zheng et al., 2021), but can also generate significant interference, which degrades performance (Wang et al., 2022). In this paper we take a step toward understanding what makes auxiliary tasks useful, introducing a new generate-and-test method for autonomously generating new auxiliary tasks and a new measure of task usefulness to prune away bad ones. The proposed measure of task usefulness evaluates the auxiliary tasks based on how useful the features induced by them are for the main task. Our experimental results show that our measure of task usefulness successfully distinguishes between good and bad auxiliary tasks.
Moreover, our proposed generate-and-test method outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks.
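The generate-and-test idea can be sketched as a simple loop: maintain a fixed pool of candidate tasks, track a running estimate of each task's utility, and periodically replace the worst task with a freshly generated one. Everything here is a placeholder sketch, not the paper's method: `generate_task` and `evaluate_utility` stand in for the actual task generator and the feature-based usefulness measure.

```python
import random

random.seed(0)

def generate_task():
    # Hypothetical: a task is identified by a randomly chosen goal feature.
    return {"goal_feature": random.randrange(100), "utility": 0.0}

def evaluate_utility(task):
    # Placeholder for the measure of how useful the task's induced
    # features are for the main task.
    return random.random()

tasks = [generate_task() for _ in range(5)]
for step in range(20):
    for task in tasks:
        # Track utility as an exponential moving average over assessments.
        u = evaluate_utility(task)
        task["utility"] += 0.1 * (u - task["utility"])
    if step % 5 == 4:                      # periodically...
        tasks.sort(key=lambda t: t["utility"])
        tasks[0] = generate_task()         # ...replace the lowest-utility task

print(len(tasks))  # pool size stays fixed at 5
```

The moving average matters: a newly generated task starts with an unreliable utility estimate, so replacement should happen only after several assessment steps rather than on a single noisy evaluation.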

2. BACKGROUND

In this paper, we consider the interaction of an agent with its environment at discrete time steps $t = 1, 2, \ldots$. The current state is denoted by $S_t \in \mathcal{S}$. The agent's action $A_t \in \mathcal{A}$ is selected according to a policy $\pi : \mathcal{A} \times \mathcal{S} \to [0, 1]$, causing the environment to transition to the next state $S_{t+1}$, emitting a reward $R_{t+1} \in \mathbb{R}$. The goal of the agent is to find the policy $\pi$ with the highest state-action value function, defined as $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, where $G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ is called the return and $\gamma \in [0, 1)$ is the discount factor. To estimate the state-action value function, we use temporal-difference learning (Sutton, 1988). Specifically, we use Q-learning (Watkins & Dayan, 1992) to learn a parametric approximation $q(s, a; \mathbf{w})$ by updating a vector of parameters $\mathbf{w} \in \mathbb{R}^d$. The update is as follows: $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha \delta_t \nabla_{\mathbf{w}} q(S_t, A_t; \mathbf{w}_t)$, where $\delta_t \doteq R_{t+1} + \gamma \max_a q(S_{t+1}, a; \mathbf{w}_t) - q(S_t, A_t; \mathbf{w}_t)$ is the TD error, $\nabla_{\mathbf{w}} q(S_t, A_t; \mathbf{w}_t)$ is the gradient of the action-value function with respect to the parameters $\mathbf{w}_t$, and the scalar $\alpha$ denotes the step-size parameter. For action selection, Q-learning is commonly combined with an epsilon-greedy policy. We use neural networks for function approximation, and we integrate a replay buffer, a target network, and the RMSProp optimizer with Q-learning, as is commonly done to improve performance (Mnih et al., 2013).
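The Q-learning update above can be written out directly for the linear case, where the gradient of $q(s, a; \mathbf{w})$ with respect to the chosen action's weights is just the feature vector. This is a minimal sketch with made-up feature sizes and a single illustrative transition, not the paper's neural-network agent.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES, N_ACTIONS = 6, 3
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Linear Q-learning: q(s, a; w) = w[a] . x(s), one weight row per action.
w = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(x):
    return w @ x

def epsilon_greedy(x):
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(x)))

def q_learning_update(x, a, r, x_next):
    """w <- w + alpha * delta * grad_w q(s, a; w).
    For the linear case the gradient is x for the chosen action's
    weight row and zero for all other rows."""
    delta = r + GAMMA * np.max(q_values(x_next)) - q_values(x)[a]
    w[a] += ALPHA * delta * x
    return delta

# One illustrative transition with made-up features and reward.
x, x_next = rng.normal(size=N_FEATURES), rng.normal(size=N_FEATURES)
a = epsilon_greedy(x)
delta = q_learning_update(x, a, 1.0, x_next)
print(delta)  # -> 1.0 on this first update, since w starts at zero
```

With a neural network, `q_learning_update` becomes a gradient step on the squared TD error, with the bootstrap target computed from a periodically copied target network and transitions drawn from a replay buffer, as in the DQN setup cited above.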

