COMBINING IMITATION AND REINFORCEMENT LEARNING WITH FREE ENERGY PRINCIPLE

Anonymous authors
Paper under double-blind review

Abstract

Imitation Learning (IL) and Reinforcement Learning (RL) from high-dimensional sensory inputs are often introduced as separate problems, but a more realistic setting is to merge the two techniques so that the agent reduces exploration costs by partially imitating experts while maximizing its return. Even when the experts are suboptimal (e.g., experts trained halfway with other RL methods, or human-crafted experts), the agent is expected to outperform them. In this paper, we propose to address this problem by applying and theoretically extending the Free Energy Principle, a unified brain theory that explains perception, action, and model learning in a Bayesian probabilistic way. We find that both IL and RL can be achieved from the same free-energy objective function. Our results show that our approach is promising in visual control tasks, especially in sparse-reward environments.

1. INTRODUCTION

Imitation Learning (IL) is a framework for learning a policy that mimics expert trajectories. Because the expert specifies the desired behavior, there is no need for exploration or for designing complex reward functions. Reinforcement Learning (RL) lacks these features: in sparse-reward settings, RL agents have no guidance toward the desired behavior, and even when RL succeeds at reward maximization, the resulting policy does not necessarily behave as the reward designer expected. The key drawbacks of IL are that the policy never exceeds the suboptimal expert's performance and that it is vulnerable to distributional shift. Meanwhile, RL can achieve super-human performance and has the potential to transfer the policy to new tasks. As real-world applications often demand high sample efficiency with little preparation (rough rewards and suboptimal experts), it is important to find a way to effectively combine IL and RL. When the sensory inputs are high-dimensional images, as in the real world, behavior learning such as IL and RL would be difficult without representation or model learning. The Free Energy Principle (FEP), a unified brain theory in computational neuroscience that explains perception, action, and model learning in a Bayesian probabilistic way (Friston et al., 2006; Friston, 2010), can handle behavior learning and model learning at the same time. In FEP, the brain maintains a generative model of the world and computes a quantity called free energy from the model's predictions and the sensory inputs to the brain. By minimizing this free energy, the brain achieves both model learning and behavior learning. Prior work on FEP has dealt only with limited situations where part of the generative model is given and the task is very low-dimensional.
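As a brief sketch of the quantity involved (the notation here is ours, introduced for illustration): writing $o$ for sensory inputs, $z$ for latent causes, $p(o, z)$ for the brain's generative model, and $q(z)$ for a recognition density, the variational free energy takes the standard form

$$
F = \mathbb{E}_{q(z)}\big[\ln q(z) - \ln p(o, z)\big]
  = \underbrace{D_{\mathrm{KL}}\big[q(z)\,\|\,p(z \mid o)\big]}_{\ge 0} \; - \; \ln p(o).
$$

Since the KL term is non-negative, $F$ upper-bounds surprise $-\ln p(o)$; minimizing $F$ with respect to $q$ corresponds to perception (inference), while minimizing it with respect to the parameters of $p$ corresponds to model learning.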
As FEP has much in common with variational inference in machine learning, recent advances in deep learning and latent variable models can be applied to scale FEP agents up to high-dimensional tasks. Recent work in model-based reinforcement learning succeeds in latent planning from high-dimensional image inputs by incorporating latent dynamics models. Behaviors can be derived either by imagined-reward maximization (Ha & Schmidhuber, 2018; Hafner et al., 2019a) or by online planning (Hafner et al., 2019b). Although solving high-dimensional visual control tasks with model-based methods is becoming feasible, prior methods have not attempted to combine them with imitation. In this paper, we propose Deep Free Energy Network (FENet), an agent that combines the advantages of IL and RL: the policy first learns roughly from suboptimal expert data, without exploration or detailed reward crafting, and then learns from sparsely specified reward functions to exceed the suboptimal expert's performance.
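To make the connection between free energy and variational inference concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a Monte Carlo free-energy estimate for a Gaussian latent variable model: the recognition density $q(z \mid o)$ is a diagonal Gaussian, the prior $p(z)$ is standard normal, and a `decode` function maps latents to predicted observations with unit-variance Gaussian likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(x, mean, std):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(((x - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2), axis=-1)

def free_energy(obs, enc_mean, enc_std, decode, n_samples=256):
    """Monte Carlo estimate of variational free energy (negative ELBO):
    F = E_q[log q(z|o) - log p(z) - log p(o|z)]."""
    eps = rng.standard_normal((n_samples, enc_mean.shape[-1]))
    z = enc_mean + enc_std * eps                      # samples from q(z|o)
    log_q = gaussian_logpdf(z, enc_mean, enc_std)     # log q(z|o)
    log_prior = gaussian_logpdf(z, 0.0, 1.0)          # log p(z), standard normal
    log_lik = gaussian_logpdf(obs, decode(z), 1.0)    # log p(o|z), unit variance
    return np.mean(log_q - log_prior - log_lik)
```

For example, with an identity decoder, a zero observation, and the recognition density equal to the prior, the KL term vanishes and the estimate reduces to the expected negative log-likelihood, about $1 + \ln 2\pi$ per two latent dimensions. Minimizing this estimate with respect to `enc_mean`/`enc_std` is inference; minimizing it with respect to the decoder's parameters is model learning, mirroring the two roles free-energy minimization plays in FEP.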

