ADAPTIVE MULTI-MODEL FUSION LEARNING FOR SPARSE-REWARD REINFORCEMENT LEARNING

Anonymous

Abstract

In this paper, we consider intrinsic reward generation for sparse-reward reinforcement learning based on model prediction errors. In typical model-prediction-error-based intrinsic reward generation, the agent maintains a learning model of the underlying environment, and the intrinsic reward is designed as the error between the model prediction and the actual outcome of the environment, exploiting the fact that the learned model yields larger prediction errors for less-visited or non-visited states, which promotes exploration helpful for reinforcement learning. This paper generalizes this model-prediction-error-based intrinsic reward generation method to multiple prediction models. We propose a new adaptive fusion method for the multiple-model case, which learns the optimal prediction-error fusion across the learning phase to enhance overall learning performance. Numerical results show that on representative locomotion tasks, the proposed intrinsic reward generation method outperforms most previous methods, with significant gains on some tasks.

1. INTRODUCTION

Reinforcement learning (RL) with sparse reward is an active research area (Andrychowicz et al., 2017; Tang et al., 2017; de Abril & Kanai, 2018; Oh et al., 2018; Kim et al., 2019). In sparse-reward RL, the environment does not return a non-zero reward for every action of the agent but only when certain conditions are met. Such situations are encountered in many action-control problems (Houthooft et al., 2016; Andrychowicz et al., 2017; Oh et al., 2018). As in conventional RL, exploration is essential at the early stage of learning in sparse-reward RL, whereas a balance between exploration and exploitation is required later. Intrinsically-motivated RL has been studied to stimulate better exploration by having the agent itself generate an intrinsic reward for each action. Recently, many intrinsically-motivated RL algorithms have been devised specifically to deal with the sparsity of reward, e.g., based on the notions of curiosity (Houthooft et al., 2016; Pathak et al., 2017) and surprise (Achiam & Sastry, 2017). In essence, in these intrinsic reward generation methods, the agent has a learning model for the next state or the transition probability of the underlying environment, and the intrinsic reward is designed as the error between the model prediction and the actual outcome of the environment, based on the fact that for less-visited or non-visited states the learned model yields larger prediction errors, promoting exploration helpful for reinforcement learning. These previous methods typically use a single prediction model for the next state or the environment's transition probability. In this paper, we generalize this model-prediction-error-based approach to the case of multiple prediction models and propose a new framework for intrinsic reward generation based on the optimal adaptive fusion of the multiple values from the multiple models.
The use of multiple models increases the diversity of modeling-error values and hence the chance to design a better intrinsic reward from these values. The critical task is to learn an optimal fusion rule that maximizes performance across the entire learning phase. To devise such an optimal adaptive fusion algorithm, we adopt the α-mean with the scale-free property from the field of information geometry (Amari, 2016) and apply meta-gradient optimization to search for the optimal fusion at each stage of learning. Numerical results show that the proposed multi-model intrinsic reward generation combined with fusion learning significantly outperforms existing intrinsic reward generation methods.
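To make the α-mean concrete, the following is a minimal NumPy sketch (the function name and parameterization are ours, not the paper's implementation) of fusing per-model prediction errors with Amari's α-mean, i.e., the power mean with exponent (1 − α)/2. Setting α = −1 recovers the arithmetic mean and α = 1 the geometric mean, and the fusion is scale-free: scaling all errors by a constant scales the fused value by the same constant.

```python
import numpy as np

def alpha_mean(errors, alpha):
    """Fuse per-model prediction errors with Amari's alpha-mean.

    Uses the alpha-representation f(x) = x^((1 - alpha) / 2):
    alpha = -1 gives the arithmetic mean, alpha = 1 the geometric mean
    (the limiting case), and alpha = 3 the harmonic mean.
    """
    errors = np.asarray(errors, dtype=np.float64)
    if np.isclose(alpha, 1.0):
        # Limiting case: geometric mean.
        return float(np.exp(np.mean(np.log(errors))))
    p = (1.0 - alpha) / 2.0
    # Power mean: inverse-map the average of the mapped errors.
    return float(np.mean(errors ** p) ** (1.0 / p))
```

A useful sanity check of the scale-free property is that `alpha_mean(c * e, alpha) == c * alpha_mean(e, alpha)` for any positive scale `c`.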

2. RELATED WORK

Intrinsically-motivated RL and exploration methods can be classified mainly into two categories. One is to explicitly generate an intrinsic reward and train the agent with the sum of the extrinsic reward and an adequately scaled intrinsic reward. The other consists of indirect methods that do not explicitly generate an intrinsic reward. Our work belongs to the first category, and we conducted experiments using baselines from the first category. However, we also detail the second category in Appendix H for readers interested in further work in the intrinsically-motivated RL area. Pathak et al. (2019) interpreted the disagreement among multiple prediction models as the variance of the predicted next states and used this variance as the final differentiable intrinsic reward. Our method is a generalized version of their work, as our proposed fusion method can be applied to the multiple squared-error values between each predicted next state and the average of all predicted next states. Freirich et al. (2019) proposed generating intrinsic reward by applying a generative model with the Wasserstein-1 distance. With the concept of state-action embedding, Kim et al. (2019) adopted the Jensen-Shannon divergence (JSD) (Hjelm et al., 2019) to construct a new variational lower bound on the corresponding mutual information, guaranteeing numerical stability. Our work differs from these two works in that we adaptively fuse multiple intrinsic rewards at every timestep.
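The disagreement-based reward of Pathak et al. (2019) discussed above can be sketched as follows (a minimal NumPy illustration, not their implementation). Note that the variance over the ensemble's predictions is exactly the mean of the squared errors between each prediction and the ensemble average, which are the values our fusion method takes as inputs.

```python
import numpy as np

def disagreement_reward(predicted_next_states):
    """Intrinsic reward from ensemble disagreement.

    predicted_next_states: array of shape (n_models, state_dim).
    Returns the variance of the predicted next states across models,
    averaged over state dimensions; equivalently, the mean squared
    error between each prediction and the ensemble average.
    """
    preds = np.asarray(predicted_next_states, dtype=np.float64)
    return float(np.mean(np.var(preds, axis=0)))
```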

3.1. SETUP

We consider a discrete-time continuous-state Markov decision process (MDP), denoted by (S, A, P, r, ρ0, γ), where S and A are the sets of states and actions, respectively; P : S × A → Π(S) is the transition probability function, with Π(S) the space of probability distributions over S; r : S × A × S → R is the extrinsic reward function; ρ0 is the probability distribution of the initial state; and γ is the discount factor. A (stochastic) policy is represented by π : S → Π(A), where Π(A) is the space of probability distributions on A and π(a|s) is the probability of choosing action a ∈ A in state s ∈ S. In sparse-reward RL, the environment does not return a non-zero reward for every action but only when certain conditions are met by the current state, the action, and the next state (Houthooft et al., 2016; Andrychowicz et al., 2017; Oh et al., 2018). Our goal is to optimize the policy π to maximize the expected cumulative return η(π) by properly generating intrinsic reward in such sparse-reward environments. We assume that the true transition probability distribution P is unknown to the agent.
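As a concrete instance of the thresholding convention used in the sparse-reward settings above, the following minimal sketch (the function name and the choice of distance from the origin as the physical quantity are illustrative assumptions) turns a dense reward into a sparse one:

```python
import math

def sparsify_reward(next_state, dense_reward, threshold):
    """Thresholded sparse reward: return a non-zero reward only when a
    physical quantity of the next state (here its distance from the
    origin) exceeds a predefined threshold; otherwise return 0."""
    distance = math.sqrt(sum(x * x for x in next_state))
    return dense_reward if distance > threshold else 0.0
```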

3.2. INTRINSIC REWARD DESIGN BASED ON MODEL PREDICTION ERRORS

Intrinsically-motivated RL adds a properly designed intrinsic reward at every timestep t to the actual extrinsic reward so as to yield a non-zero total reward for training even when the extrinsic reward returned by the environment is zero (Pathak et al., 2017; Tang et al., 2017; de Abril & Kanai, 2018). In the model-prediction-error-based intrinsic reward design, the agent has a prediction model, parameterized by φ, for the next state s_{t+1} or the transition probability P(s_{t+1}|s_t, a_t), and the intrinsic reward is designed as the error between the model prediction and the actual outcome of the environment (Houthooft et al., 2016; Achiam & Sastry, 2017; Pathak et al., 2017; Burda et al., 2019; de Abril & Kanai, 2018). Thus, the intrinsic-reward-incorporated problem under this approach is, in most cases, to maximize the expected discounted sum of the total reward r_t + β r^i_t, where r_t is the extrinsic reward at timestep t, r^i_t is the model-prediction-error-based intrinsic reward, and β > 0 is a scaling coefficient.
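A minimal sketch of the single-model prediction-error reward described above (a linear predictor stands in for the learned model parameterized by φ; the function names and the weight matrix `W` are our illustrative assumptions):

```python
import numpy as np

def predict_next_state(state, action, W):
    """Stand-in for a learned forward model of the next state s_{t+1}."""
    return W @ np.concatenate([state, action])

def intrinsic_reward(state, action, next_state, W):
    """Intrinsic reward: squared error between the model's prediction
    and the actual next state returned by the environment."""
    prediction = predict_next_state(state, action, W)
    return float(np.sum((prediction - next_state) ** 2))
```

During training, this reward shrinks for frequently visited transitions as the model fits them, and stays large for novel ones, which is what drives the exploration bonus.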



Houthooft et al. (2016) used the information gain on the prediction model as an additional reward, based on the notion of curiosity. Tang et al. (2017) efficiently applied count-based exploration to high-dimensional state spaces by mapping the states' trained features into a hash table. The concept of surprise was exploited to yield intrinsic rewards by Achiam & Sastry (2017). Pathak et al. (2017) defined an intrinsic reward as the prediction error in a feature state space, and de Abril & Kanai (2018) enhanced Pathak et al. (2017)'s work with the idea of homeostasis from biology. Zheng et al. (2018) proposed, in a delayed-reward environment, training the module that generates the intrinsic reward separately from training the policy. This delayed-reward setting for sparse rewards differs from the previous sparse-reward setting based on thresholding (Houthooft et al., 2016), in which the agent receives a non-zero reward only when a specific physical quantity, such as the distance from the origin, exceeds a predefined threshold.

