ADAPTIVE MULTI-MODEL FUSION LEARNING FOR SPARSE-REWARD REINFORCEMENT LEARNING

Anonymous

Abstract

In this paper, we consider intrinsic reward generation for sparse-reward reinforcement learning based on model prediction errors. In typical model-prediction-error-based intrinsic reward generation, the agent learns a model of the underlying environment. The intrinsic reward is then designed as the error between the model prediction and the actual outcome of the environment, based on the fact that the learned model yields larger prediction errors for less-visited or non-visited states, which promotes exploration helpful for reinforcement learning. This paper generalizes this model-prediction-error-based intrinsic reward generation method to multiple prediction models. We propose a new adaptive fusion method for the multiple-model case, which learns the optimal fusion of prediction errors across the learning phase to enhance the overall learning performance. Numerical results show that for representative locomotion tasks, the proposed intrinsic reward generation method outperforms most of the previous methods, and the gain is significant in some tasks.

1. INTRODUCTION

Reinforcement learning (RL) with sparse reward is an active research area (Andrychowicz et al., 2017; Tang et al., 2017; de Abril & Kanai, 2018; Oh et al., 2018; Kim et al., 2019). In sparse-reward RL, the environment does not return a non-zero reward for every action of the agent but returns a non-zero reward only when certain conditions are met. Such situations are encountered in many action control problems (Houthooft et al., 2016; Andrychowicz et al., 2017; Oh et al., 2018). As in conventional RL, exploration is essential at the early stage of learning in sparse-reward RL, whereas a balance between exploration and exploitation is required later. Intrinsically motivated RL has been studied as a way to stimulate better exploration: the agent itself generates an intrinsic reward for each action. Recently, many intrinsically motivated RL algorithms have been devised specifically to deal with the sparsity of reward, e.g., based on the notion of curiosity (Houthooft et al., 2016; Pathak et al., 2017) or surprise (Achiam & Sastry, 2017). In essence, in these intrinsic reward generation methods, the agent learns a model of the next state or of the transition probability of the underlying environment, and the intrinsic reward is designed as the error between the model prediction and the actual outcome of the environment. This is based on the fact that the learned model yields larger prediction errors for less-visited or non-visited states, which promotes exploration helpful for reinforcement learning. These previous methods typically use a single prediction model for the next state or the environment's transition probability. In this paper, we generalize this model-prediction-error-based approach to the case of multiple prediction models and propose a new framework for intrinsic reward generation based on the optimal adaptive fusion of the multiple prediction-error values obtained from the multiple models.
The use of multiple models increases the diversity of the modeling error values and hence the chance of designing a better intrinsic reward from these values. The critical task is to learn an optimal fusion rule that maximizes the performance across the entire learning phase. To devise such an optimal adaptive fusion algorithm, we adopt the α-mean, which has the scale-free property, from the field of information geometry (Amari, 2016) and apply meta-gradient optimization to search for the optimal fusion at each stage of learning. Numerical results show that the proposed multi-model intrinsic reward generation combined with fusion learning significantly outperforms existing intrinsic reward generation methods.
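
As a rough illustration of the idea described above, the following Python sketch fuses the prediction errors of several models into one intrinsic reward using the α-mean. It assumes the commonly stated form of the α-mean for positive numbers, in which values are mapped through f(x) = x^((1-α)/2) (or log x when α = 1), averaged, and mapped back; the exact parameterization used later in this paper may differ. The names `alpha_mean`, `squared_error`, `fused_intrinsic_reward`, and the offset `eps` are hypothetical and chosen only for this example. Here α is a fixed input, whereas the proposed method learns it by meta-gradient optimization, which is not shown.

```python
import math


def alpha_mean(values, alpha):
    """Alpha-mean of strictly positive numbers (assumed standard form).

    alpha = -1 gives the arithmetic mean, alpha = 1 the geometric mean,
    and alpha = 3 the harmonic mean. The result is scale-free:
    alpha_mean(c * x, alpha) == c * alpha_mean(x, alpha) for c > 0.
    """
    if not values:
        raise ValueError("alpha-mean requires at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("alpha-mean requires strictly positive inputs")
    n = len(values)
    if alpha == 1.0:
        # f(x) = log x: average in the log domain, then exponentiate.
        return math.exp(sum(math.log(v) for v in values) / n)
    p = (1.0 - alpha) / 2.0
    # f(x) = x**p: average the transformed values, then invert f.
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def squared_error(predicted, actual):
    """Squared Euclidean distance between two state vectors."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def fused_intrinsic_reward(predictions, next_state, alpha, eps=1e-8):
    """Fuse per-model prediction errors into a single intrinsic reward.

    predictions: one predicted next-state vector per model (hypothetical API).
    eps: small offset (an assumption of this sketch) that keeps every error
         strictly positive so the alpha-mean is well defined.
    """
    errors = [squared_error(pred, next_state) + eps for pred in predictions]
    return alpha_mean(errors, alpha)


if __name__ == "__main__":
    # Three hypothetical models predicting a 2-dimensional next state.
    preds = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    observed = [1.0, 0.2]
    for a in (-1.0, 1.0, 3.0):
        print(a, fused_intrinsic_reward(preds, observed, a))
```

In this sketch, smaller values of α weight the largest model error more heavily and larger values of α weight the smallest error more heavily, which is why adapting α over the course of learning changes how the multiple errors are combined.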

