COMBINING IMITATION AND REINFORCEMENT LEARNING WITH FREE ENERGY PRINCIPLE

Anonymous authors
Paper under double-blind review

Abstract

Imitation Learning (IL) and Reinforcement Learning (RL) from high-dimensional sensory inputs are often treated as separate problems, but a more realistic setting is to merge the two techniques so that the agent can reduce exploration costs by partially imitating experts while maximizing its return. Even when the experts are suboptimal (e.g., experts trained halfway with other RL methods, or human-crafted experts), the agent is expected to outperform them. In this paper, we propose to address this issue by using and theoretically extending the Free Energy Principle, a unified brain theory that explains perception, action, and model learning in a Bayesian probabilistic way. We find that both IL and RL can be achieved from the same free energy objective function. Our results show that our approach is promising in visual control tasks, especially in sparse-reward environments.

1. INTRODUCTION

Imitation Learning (IL) is a framework for learning a policy that mimics expert trajectories. As the expert specifies model behaviors, there is no need for exploration or for designing complex reward functions. Reinforcement Learning (RL) lacks these features, so RL agents struggle to discover desired behaviors in sparse-reward settings, and even when RL succeeds in reward maximization, the policy does not necessarily achieve the behaviors that the reward designer expected. The key drawbacks of IL are that the policy never exceeds the suboptimal expert's performance and that the policy is vulnerable to distributional shift. Meanwhile, RL can achieve super-human performance and has the potential to transfer the policy to new tasks. As real-world applications often need high sample efficiency and little preparation (rough rewards and suboptimal experts), it is important to find a way to effectively combine IL and RL.

When the sensory inputs are high-dimensional images, as in the real world, behavior learning such as IL and RL would be difficult without representation or model learning. Free Energy Principle (FEP), a unified brain theory in computational neuroscience that explains perception, action, and model learning in a Bayesian probabilistic way (Friston et al., 2006; Friston, 2010), can handle behavior learning and model learning at the same time. In FEP, the brain has a generative model of the world and computes a quantity called Free Energy from the model's predictions and the sensory inputs to the brain. By minimizing the Free Energy, the brain achieves both model learning and behavior learning. Prior work on FEP has only dealt with limited situations where a part of the generative model is given and the task is very low-dimensional.
As FEP has much in common with variational inference in machine learning, recent advances in deep learning and latent variable models can be applied to scale up FEP agents to high-dimensional tasks. Recent work in model-based reinforcement learning succeeds in latent planning from high-dimensional image inputs by incorporating latent dynamics models. Behaviors can be derived either by imagined-reward maximization (Ha & Schmidhuber, 2018; Hafner et al., 2019a) or by online planning (Hafner et al., 2019b). Although solving high-dimensional visual control tasks with model-based methods is becoming feasible, prior methods have not attempted to combine them with imitation. In this paper, we propose Deep Free Energy Network (FENet), an agent that combines the advantages of IL and RL: the policy first learns roughly from suboptimal expert data without the need for exploration or detailed reward crafting, and then learns from sparsely specified reward functions to exceed the suboptimal expert's performance.
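To make the latent-planning idea concrete, the sketch below derives behavior by online planning in a latent dynamics model with both deterministic and stochastic parts. The dynamics, reward, and all names here are simplified, hand-specified stand-ins for a learned Recurrent State Space Model, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent dynamics standing in for a learned model:
# deterministic path h' = tanh(A h + B a), stochastic state s ~ N(h', noise^2).
# A, B, and GOAL are illustrative assumptions.
A = np.array([[0.9, 0.1], [-0.1, 0.9]])
B = np.array([[0.5], [0.2]])
GOAL = np.array([1.0, 0.0])

def rollout_return(h0, actions, noise_scale=0.1):
    """Imagine a trajectory purely in latent space and sum predicted rewards."""
    h, total = h0, 0.0
    for a in actions:
        h = np.tanh(A @ h + B @ np.array([a]))          # deterministic component
        s = h + noise_scale * rng.normal(size=h.shape)  # stochastic component
        total += -np.sum((s - GOAL) ** 2)               # imagined reward
    return total

def plan(h0, horizon=5, candidates=256):
    """Random-shooting planner: sample action sequences, keep the one with
    the best imagined return (the simplest form of online latent planning)."""
    best_seq, best_ret = None, -np.inf
    for _ in range(candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        ret = rollout_return(h0, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

first_action = plan(np.zeros(2))[0]  # execute only the first planned action
```

In practice the dynamics and reward predictor are learned networks and the planner is typically iterative (e.g. CEM), but the loop structure — imagine, score, select, replan — is the same.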


The key contributions of this work are summarized as follows:

• Extension of Free Energy Principle: We theoretically extend the Free Energy Principle, introducing a policy prior and a policy posterior to combine IL and RL. We implement the proposed method on top of the Recurrent State Space Model (Hafner et al., 2019b), a latent dynamics model with both deterministic and stochastic components.

• Visual control tasks in realistic problem settings: We solve the Cheetah-run, Walker-walk, and Quadruped-walk tasks from DeepMind Control Suite (Tassa et al., 2018). We use not only the default problem settings but also settings with sparse rewards and with suboptimal experts. We demonstrate that our agent outperforms model-based RL using the Recurrent State Space Model in sparse-reward settings. We also show that our agent can achieve higher returns than Behavioral Cloning (IL) trained on suboptimal experts.

2. BACKGROUND ON FREE ENERGY PRINCIPLE

Perceptual Inference. In FEP, the agent has a generative model consisting of a prior over hidden states p(s_t) and a likelihood p(o_t | s_t), and receives an observation o_t. Since we cannot compute the marginal likelihood p(o_t) = ∫ p(o_t | s_t) p(s_t) ds_t due to the intractable integral, we approximate the true posterior p(s_t | o_t) with a variational posterior q(s_t) by minimizing the KL divergence

    KL(q(s_t) || p(s_t | o_t)).    (3)

We define the Free Energy as

    F_t = E_{q(s_t)}[ln q(s_t) - ln p(o_t, s_t)] = KL(q(s_t) || p(s_t | o_t)) - ln p(o_t).    (4)

Since p(o_t) does not depend on s_t, we can minimize (eq. 3) w.r.t. the parameters of the variational posterior by minimizing the Free Energy. Thus, the agent can infer the hidden states of the observations by minimizing F_t. This process is called 'perceptual inference' in FEP.

Perceptual Learning. The Free Energy equals the negative Evidence Lower Bound (ELBO) often seen in variational inference in machine learning:

    F_t = KL(q(s_t) || p(s_t)) - E_{q(s_t)}[ln p(o_t | s_t)].    (5)

By minimizing F_t w.r.t. the parameters of the prior and the likelihood, the generative model learns to best explain the observations. This process is called 'perceptual learning' in FEP.

Active Inference. We can assume that the prior is conditioned on the hidden state and action at the previous time step:

    p(s_t) = p(s_t | s_{t-1}, a_{t-1}).    (6)
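As a concrete illustration of perceptual inference, the sketch below estimates the Free Energy F = KL(q(s) || p(s)) - E_q[ln p(o|s)] for a toy one-dimensional linear-Gaussian model. The model, its parameters, and the function names are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (hypothetical, for illustration only):
#   prior       p(s)   = N(0, 1)
#   likelihood  p(o|s) = N(s, sigma_o^2)
# Variational posterior q(s) = N(mu_q, sigma_q^2).
sigma_o = 0.5

def free_energy(mu_q, sigma_q, o, n_samples=10_000):
    """F = KL(q(s) || p(s)) - E_q[ln p(o|s)], with the KL computed
    analytically and the expected log-likelihood by Monte Carlo."""
    # Analytic KL between univariate Gaussians: KL(N(mu, s^2) || N(0, 1))
    kl = 0.5 * (sigma_q**2 + mu_q**2 - 1.0) - np.log(sigma_q)
    # Monte Carlo estimate of E_q[ln p(o|s)]
    s = rng.normal(mu_q, sigma_q, size=n_samples)
    log_lik = -0.5 * np.log(2 * np.pi * sigma_o**2) - (o - s) ** 2 / (2 * sigma_o**2)
    return kl - log_lik.mean()

# Perceptual inference: for this conjugate model the exact posterior is
# N(o / (1 + sigma_o^2), sigma_o^2 / (1 + sigma_o^2)), where F attains its
# minimum -ln p(o); any other q(s) yields a strictly higher free energy.
o = 1.0
mu_post = o / (1 + sigma_o**2)
var_post = sigma_o**2 / (1 + sigma_o**2)
f_good = free_energy(mu_post, np.sqrt(var_post), o)
f_bad = free_energy(-2.0, 1.0, o)
print(f_good < f_bad)  # the exact posterior gives the lower free energy
```

Perceptual learning corresponds to additionally taking gradients of the same F with respect to the prior and likelihood parameters (here, sigma_o), which is exactly how a VAE trains its decoder.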

