PRIOR PREFERENCE LEARNING FROM EXPERTS: DESIGNING A REWARD WITH ACTIVE INFERENCE

Anonymous

Abstract

Active inference may be defined as a Bayesian model of brain function that provides a biologically plausible account of the agent. Its primary idea rests on the free energy principle and the agent's prior preference: an agent chooses actions that lead toward its preferred future observations. In this paper, we claim that active inference can be interpreted through the lens of reinforcement learning (RL) algorithms and establish a theoretical connection between the two. We extend the concept of expected free energy (EFE), a core quantity in active inference, and show that the EFE can be treated as a negative value function. Motivated by the concept of prior preference and this theoretical connection, we propose a simple but novel method for learning a prior preference from experts. This illustrates that the inverse RL problem can be approached from the new perspective of active inference. Experimental results on prior preference learning show the feasibility of active inference with EFE-based rewards and its applicability to inverse RL problems.

1. INTRODUCTION

Active inference (Friston et al., 2009) is a theory emerging from cognitive science, built on Bayesian modeling of brain function (Friston et al., 2006; Friston, 2010; Friston et al., 2013; 2015), predictive coding (Friston et al., 2011; Lopez-Persem et al., 2016), and the free energy principle (Friston, 2012; Parr & Friston, 2019; Friston, 2019). It states that agents choose actions to minimize an expected future surprise (Friston et al., 2012; 2017a; b), a measure of the difference between the agent's prior preference and its expected future. Minimizing the expected future surprise can be achieved by minimizing the expected free energy (EFE), which is the core quantity of active inference. Although active inference and the EFE were inspired by and derived from cognitive science using a biologically plausible model of brain function, their usage in RL tasks is still limited owing to computational issues and the difficulty of prior preference design (Millidge, 2020; Fountas et al., 2020). First, the EFE incurs a heavy computational cost: an exact computation of the EFE averages over all possible policies, which quickly becomes intractable as the action space $\mathcal{A}$ and the time horizon $T$ grow. Several attempts have been made to calculate the EFE tractably, such as limiting the future time horizon from $t$ to $t + H$ (Tschantz et al., 2019) and applying Monte-Carlo sampling methods over candidate policies (Fountas et al., 2020; Çatal et al., 2020). Second, it is unclear how the prior preference should be set; this is the same question as how to design rewards in RL. In recent studies (Fountas et al., 2020; Çatal et al., 2020; Ueltzhöffer, 2018), the agent's prior preference is simply set to the final goal of the given environment at every time step. There are some environments in which such a time-independent prior preference suffices.
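The combinatorial growth behind the first issue can be made concrete with a minimal sketch (the action-space size and horizon below are illustrative, not taken from any cited work): exhaustive EFE evaluation must score every action sequence of length $H$, so the number of candidate policies is $|\mathcal{A}|^H$.

```python
from itertools import product

def count_policies(n_actions, horizon):
    # Enumerate every action sequence of the given length, i.e. every
    # deterministic policy an exhaustive EFE evaluation would have to score.
    return sum(1 for _ in product(range(n_actions), repeat=horizon))

# Even a tiny problem (|A| = 4 actions, horizon H = 8) already yields
# 4 ** 8 = 65,536 candidate sequences; the count is exponential in H.
n = count_policies(n_actions=4, horizon=8)
```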
However, most prior preferences in RL problems are neither simple nor easy to design, because preferences over short-sighted and long-sighted futures should generally be treated in different ways. In this paper, we first claim that there is a theoretical connection between active inference and RL algorithms. We then propose prior preference learning (PPL), a simple and novel method for learning the prior preference of an active inference agent from expert demonstrations. In Section 2, we briefly introduce the concept of active inference. Starting from the previous definition of the EFE of a deterministic policy, in Section 3 we extend the existing concepts of active inference and theoretically demonstrate that they can be analyzed from the viewpoint of RL. We extend this quantity to a stochastic policy network and define an action-conditioned EFE for a given action and a given policy network. Following Millidge (2020), using a bootstrapping argument, we show that the optimal distribution over the first-step action induced by active inference can be interpreted in terms of Q-learning. Consequently, we show that the EFE can be treated as a negative value function from an RL perspective. From this connection, in Section 4, we propose a novel inverse RL algorithm for designing EFE-based rewards by learning a prior preference from expert demonstrations. Through such demonstrations, an agent learns its prior preference given the observation needed to achieve a final goal, which can effectively handle the difference between local and global preferences. This extends the scope of active inference to inverse RL problems. Our experiments in Section 6 show the applicability of EFE-based active inference rewards to an inverse RL problem.

2. ACTIVE INFERENCE

The active inference environment rests on a partially observable Markov decision process with an observation $o_t$ that comes from sensory input and a hidden state $s_t$ that is encoded in the agent's latent space. We consider a continuous observation/hidden-state space, discrete time steps, and a discrete action space $\mathcal{A}$. At a current time $t < T$ with a given time horizon $T$, the agent receives an observation $o_t$. The agent encodes this observation into a hidden state $s_t$ in its internal generative model (i.e., a generative model of the given environment inside the agent), and then searches for the action sequence that minimizes the expected future surprise, based on the agent's prior preference $p(o_\tau)$ over a future observation $o_\tau$ with $\tau > t$. (In other words, the agent avoids actions that lead to unexpected and undesired future observations, which would make the agent surprised.) In detail, we can formally describe the active inference agent's process as follows: $s_t$ and $o_t$ are the hidden state and the observation at time $t$, respectively, and $\pi = (a_1, a_2, \ldots, a_T)$ is a sequence of actions. Let $p(o_{1:T}, s_{1:T})$ be the generative model of the agent with its transition model $p(s_{t+1} \mid s_t, a_t)$, and let $q(o_{1:T}, s_{1:T}, \pi)$ be a variational density. The distribution over policies $q(\pi)$ will be determined later. We parameterize these densities as trainable neural networks: $p(o_t \mid s_t)$ as a decoder, $q(s_t \mid o_t)$ as an encoder, and $p(s_{t+1} \mid s_t, a_t)$ as a transition network in our generative model.

First, we minimize the current surprise of the agent, which is defined as $-\log p(o_t)$. Its upper bound can be interpreted as the well-known negative ELBO term, frequently referred to in the active inference literature as the variational free energy $F_t$ at time $t$:
$$-\log p(o_t) \le \mathbb{E}_{q(s_t \mid o_t)}\left[\log q(s_t \mid o_t) - \log p(o_t, s_t)\right] = F_t \tag{1}$$

Minimizing $F_t$ tightens an upper bound on the current surprise and fits the networks of our generative model to the known observations of the environment and their encoded states. For future action selection, the total EFE $G(s_t)$ over all possible policies at the current state $s_t$ at time $t$ should be minimized.
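A Monte Carlo estimate of $F_t$ in Eq. (1) can be sketched for a one-dimensional Gaussian toy model. Everything below is an illustrative assumption, not the paper's architecture: the encoder $q(s \mid o)$ is a Gaussian with hand-picked parameters, the decoder is the linear map $p(o \mid s) = \mathcal{N}(o; 2s, 1)$, and the prior over states is standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, sigma):
    # Log-density of a univariate Gaussian N(x; mu, sigma^2).
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def free_energy(o, enc_mu, enc_sigma, decode, n_samples=20000):
    # F = E_{q(s|o)}[log q(s|o) - log p(o, s)], with p(o, s) = p(o|s) p(s),
    # estimated by Monte Carlo over samples drawn from the encoder q(s|o).
    s = rng.normal(enc_mu, enc_sigma, size=n_samples)   # s ~ q(s|o)
    log_q = log_gauss(s, enc_mu, enc_sigma)             # log q(s|o)
    log_p = log_gauss(o, decode(s), 1.0) + log_gauss(s, 0.0, 1.0)  # log p(o|s) + log p(s)
    return float(np.mean(log_q - log_p))

# Toy linear decoder p(o|s) = N(o; 2s, 1); encoder parameters chosen near the
# exact Gaussian posterior for this model, so F should sit close to -log p(o).
F = free_energy(o=1.0, enc_mu=0.4, enc_sigma=0.45, decode=lambda s: 2.0 * s)
```

For this toy model the marginal is $p(o) = \mathcal{N}(o; 0, 5)$, so $-\log p(1) \approx 1.82$; the estimate $F$ lands just above that bound, and a badly mis-specified encoder would loosen it.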

$$G(s_t) = \mathbb{E}_{q(s_{t+1:T},\, o_{t+1:T},\, \pi)}\left[\log \frac{q(s_{t+1:T}, \pi)}{p(s_{t+1:T}, o_{t+1:T})}\right]$$

Focusing on the distribution $q(\pi)$, it is known that the total EFE $G(s_t)$ is minimized when the distribution over policies $q(\pi)$ follows $\sigma(-G_\pi(s_t))$, where $\sigma(\cdot)$ is a softmax over the policies and $G_\pi(s_t)$ is the EFE at a given state $s_t$ for a fixed sequence of actions $\pi$ at time $t$ (Millidge et al., 2020):

$$G_\pi(s_t) = \sum_{\tau > t} G_\pi(\tau, s_t) = \sum_{\tau > t} \mathbb{E}_{q(s_\tau, o_\tau \mid \pi)}\left[\log \frac{q(s_\tau \mid \pi)}{p(o_\tau)\, q(s_\tau \mid o_\tau)}\right]$$

This means that if a lower EFE is obtained for a particular action sequence $\pi$, a lower future surprise is expected and the desired behavior $p(o)$ is more likely to be realized. Several active inference studies introduce a temperature parameter $\gamma > 0$ such that $q_\gamma(\pi) = \sigma(-\gamma G_\pi(s_t))$ to control the agent's trade-off between exploration and exploitation. Because the optimal distribution over policies is $q(\pi) = \sigma(-G_\pi(s_t))$, the action selection problem boils down to calculating the expected free energy $G_\pi(s_t)$ of a given action sequence $\pi$. The learning process of active inference thus contains two parts: (1) learning the agent's generative model, with its trainable neural networks $p(o_t \mid s_t)$, $q(s_t \mid o_t)$, and $p(s_{t+1} \mid s_t, a_t)$, to explain the current observations, and (2) learning to select actions that minimize the agent's expected future surprise by calculating the EFE of a given action sequence.
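The softmax policy distribution $q_\gamma(\pi) = \sigma(-\gamma G_\pi(s_t))$ can be sketched directly; the EFE values below are hypothetical placeholders standing in for computed $G_\pi(s_t)$, one per candidate action sequence.

```python
import numpy as np

def policy_distribution(efe, gamma=1.0):
    # q(pi) = softmax(-gamma * G_pi): policies with lower EFE get higher mass;
    # gamma trades off exploration (small gamma) against exploitation (large gamma).
    logits = -gamma * np.asarray(efe, dtype=float)
    logits -= logits.max()                 # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

G = [3.0, 1.0, 2.0]                        # hypothetical EFE per candidate policy
q_explore = policy_distribution(G, gamma=0.5)   # broad: retains exploration
q_exploit = policy_distribution(G, gamma=5.0)   # sharp: concentrates on argmin EFE
```

As $\gamma$ grows, the distribution concentrates on the policy with the lowest EFE, matching the exploitation limit described above.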

