PRIOR PREFERENCE LEARNING FROM EXPERTS: DESIGNING A REWARD WITH ACTIVE INFERENCE Anonymous

Abstract

Active inference is a Bayesian account of brain function that models the agent with a biologically plausible generative model. Its primary idea rests on the free energy principle and the agent's prior preference: an agent chooses actions that lead toward its prior preference over future observations. In this paper, we claim that active inference can be interpreted through reinforcement learning (RL) algorithms, and we establish a theoretical connection between them. We extend the concept of expected free energy (EFE), a core quantity in active inference, and claim that EFE can be treated as a negative value function. Motivated by the concept of prior preference and this theoretical connection, we propose a simple but novel method for learning a prior preference from experts. This shows that the inverse RL problem can be approached from the new perspective of active inference. Experimental results on prior preference learning show the feasibility of active inference with EFE-based rewards and its application to an inverse RL problem.

1. INTRODUCTION

Active inference (Friston et al., 2009) is a theory emerging from cognitive science that combines Bayesian modeling of brain function (Friston et al., 2006; Friston, 2010; Friston et al., 2013; 2015), predictive coding (Friston et al., 2011; Lopez-Persem et al., 2016), and the free energy principle (Friston, 2012; Parr & Friston, 2019; Friston, 2019). It states that agents choose actions to minimize the expected future surprise (Friston et al., 2012; 2017a; b), a measure of the difference between the agent's prior preference and its expected future. Minimizing expected future surprise can be achieved by minimizing the expected free energy (EFE), the core quantity of active inference. Although active inference and EFE were inspired by and derived from cognitive science using a biologically plausible model of brain function, their use in RL tasks is still limited owing to computational issues and the difficulty of designing a prior preference (Millidge, 2020; Fountas et al., 2020). First, computing the EFE is expensive: a precise computation theoretically averages over all possible policies, which is clearly intractable as the action space A and the time horizon T grow. Several attempts have been made to calculate the EFE tractably, such as limiting the future time horizon from t to t + H (Tschantz et al., 2019) and applying Monte Carlo sampling methods over policies (Fountas et al., 2020; Çatal et al., 2020). Second, it is unclear how the prior preference should be set; this is the same question as how to design rewards in RL. In recent studies (Fountas et al., 2020; Çatal et al., 2020; Ueltzhöffer, 2018), the agent's prior preference is simply set to the final goal of the environment for every time step, and there are some environments in which a time-independent prior preference of this kind suffices.
However, most prior preferences in RL problems are neither simple nor easy to design, because preferences over short- and long-horizon futures should generally be treated differently. In this paper, we first claim that there is a theoretical connection between active inference and RL algorithms. We then propose prior preference learning (PPL), a simple and novel method for learning the prior preference of an active inference agent from expert demonstrations. In Section 2, we briefly introduce the concept of active inference. Starting from the previous definition of the EFE of a deterministic policy, in Section 3 we extend the existing concepts of active inference and theoretically demonstrate that they can be analyzed from the viewpoint of RL. We extend this quantity to a stochastic policy network and define an action-conditioned EFE for a given action and policy network. Following the bootstrapping argument of Millidge (2020), we show that the optimal distribution over the first-step action induced by active inference can be interpreted through Q-learning. Consequently, EFE can be treated as a negative value function from an RL perspective. Building on this connection, in Section 4 we propose a novel inverse RL algorithm for designing EFE-based rewards by learning a prior preference from expert demonstrations. Through such demonstrations, an agent learns the prior preference, given the observation, needed to achieve a final goal, which effectively handles the difference between local and global preferences and extends the scope of active inference to inverse RL problems. Our experiments in Section 6 show the applicability of active-inference-based rewards using EFE to an inverse RL problem.

2. ACTIVE INFERENCE

Active inference rests on a partially observed Markov decision process setting, with an observation $o_t$ coming from sensory input and a hidden state $s_t$ encoded in the agent's latent space. We consider a continuous observation/hidden-state space, discrete time steps, and a discrete action space A. At the current time t < T, with a given time horizon T, the agent receives an observation $o_t$. The agent encodes this observation into a hidden state $s_t$ in its internal generative model (i.e., the agent's generative model of the environment), and then searches for the action sequence that minimizes the expected future surprise based on the agent's prior preference $p(o_\tau)$ over future observations $o_\tau$ with $\tau > t$. (That is, the agent avoids actions that lead to unexpected and undesired future observations, which would surprise the agent.) Formally, the active inference agent's process is as follows: $s_t$ and $o_t$ are the hidden state and the observation at time t, respectively, and $\pi = (a_1, a_2, \ldots, a_T)$ is a sequence of actions. Let $p(o_{1:T}, s_{1:T})$ be the generative model of the agent, with transition model $p(s_{t+1}|s_t, a_t)$, and let $q(o_{1:T}, s_{1:T}, \pi)$ be a variational density; the distribution over policies $q(\pi)$ will be determined later. We parameterize these densities as trainable neural networks, with $p(o_t|s_t)$ a decoder, $q(s_t|o_t)$ an encoder, and $p(s_{t+1}|s_t, a_t)$ a transition network in our generative model. First, we minimize the current surprise of the agent, defined as $-\log p(o_t)$. Its upper bound is the well-known negative ELBO, frequently referred to in the active inference literature as the variational free energy $F_t$ at time t.
$$-\log p(o_t) \le \mathbb{E}_{q(s_t|o_t)}\big[\log q(s_t|o_t) - \log p(o_t, s_t)\big] = F_t \qquad (1)$$

Minimizing $F_t$ tightens this upper bound on the current surprise and fits the networks of the generative model to the observed environment and its encoded states. For future action selection, the total EFE $G(s_t)$ over all possible policies at the current state $s_t$ should be minimized:

$$G(s_t) = \mathbb{E}_{q(s_{t+1:T},\, o_{t+1:T},\, \pi)}\left[\log \frac{q(s_{t+1:T}, \pi)}{p(s_{t+1:T}, o_{t+1:T})}\right]$$

Focusing on the distribution $q(\pi)$, it is known that the total EFE $G(s_t)$ is minimized when the distribution over policies follows $q(\pi) = \sigma(-G_\pi(s_t))$, where $\sigma(\cdot)$ is a softmax over policies and $G_\pi(s_t)$ is the EFE at state $s_t$ for a fixed action sequence $\pi$ (Millidge et al., 2020):

$$G_\pi(s_t) = \sum_{\tau > t} G_\pi(\tau, s_t) = \sum_{\tau > t} \mathbb{E}_{q(s_\tau, o_\tau|\pi)}\left[\log \frac{q(s_\tau|\pi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\right] \qquad (3)$$

Thus, if a particular action sequence $\pi$ attains a lower EFE, a lower future surprise is expected and behavior closer to the desired $p(o)$ is obtained. Several active inference studies introduce a temperature parameter $\gamma > 0$ such that $q_\gamma(\pi) = \sigma(-\gamma G_\pi(s_t))$ to control the agent's trade-off between exploration and exploitation. Because the optimal distribution over policies is $q(\pi) = \sigma(-G_\pi(s_t))$, action selection boils down to calculating the expected free energy $G_\pi(s_t)$ of a given action sequence $\pi$. The learning process of active inference thus has two parts: (1) learning the agent's generative model, with trainable networks $p(o_t|s_t)$, $q(s_t|o_t)$, and $p(s_{t+1}|s_t, a_t)$, to explain the current observations, and (2) learning to select actions that minimize the agent's expected future surprise by calculating the EFE of a given action sequence.
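As a concrete illustration, the variational free energy in (1) can be estimated by Monte Carlo once the encoder and decoder are specified. The sketch below is a minimal, hypothetical example: diagonal-Gaussian densities and toy linear "networks" stand in for the paper's trained encoder and decoder, and the state prior $p(s)$ is taken to be a standard normal so that $\log p(o_t, s_t) = \log p(o_t|s_t) + \log p(s_t)$.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2)
                        - (x - mean)**2 / (2 * std**2)))

def free_energy(o, encode, decode, n_samples=5000, seed=0):
    """Monte Carlo estimate of F_t = E_{q(s|o)}[log q(s|o) - log p(o, s)]."""
    rng = np.random.default_rng(seed)
    mu, sd = encode(o)                                      # parameters of q(s|o)
    total = 0.0
    for _ in range(n_samples):
        s = mu + sd * rng.standard_normal(mu.shape)         # sample s ~ q(s|o)
        log_q = gaussian_logpdf(s, mu, sd)
        dec_mu, dec_sd = decode(s)                          # parameters of p(o|s)
        log_joint = (gaussian_logpdf(o, dec_mu, dec_sd)     # log p(o|s)
                     + gaussian_logpdf(s, np.zeros_like(s), np.ones_like(s)))  # log p(s)
        total += log_q - log_joint
    return total / n_samples

# Toy linear "networks" standing in for the encoder q(s|o) and decoder p(o|s).
encode = lambda o: (0.5 * o, np.full_like(o, 0.3))
decode = lambda s: (2.0 * s, np.full_like(s, 0.5))
o_t = np.array([0.4, -0.2])
F_t = free_energy(o_t, encode, decode)   # an upper bound on -log p(o_t)
```

Under this linear-Gaussian toy model, $p(o)$ is Gaussian with covariance $(2^2 \cdot 1 + 0.5^2)I$, so the exact surprise $-\log p(o_t)$ can be computed in closed form and checked against the bound.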

3. EFE AS A NEGATIVE VALUE: BETWEEN RL AND ACTIVE INFERENCE

In this section, we first extend the definition of the EFE to a stochastic policy. We then propose an action-conditioned EFE that plays a role similar to a negative action-value function in RL. Based on these extensions, we repeat the arguments of active inference and claim that an RL algorithm with the EFE as a negative value is equivalent to controlling the policy network toward the ideal distribution of active inference. Calculating the expected free energy $G_\pi(s_t)$ for all possible deterministic policies $\pi$ is intractable even in a toy task, because the number of policies grows rapidly with the time horizon T and the number of actions |A|. Instead of searching over deterministic policies, we extend the concept of EFE to a stochastic policy given by a policy network $\phi = \phi(a_t|s_t)$. In this case, we also extend $q(s_\tau|\pi)$ to $q(s_\tau, a_{\tau-1}|\phi) := p(s_\tau|s_{\tau-1}, a_{\tau-1})\, \phi(a_{\tau-1}|s_{\tau-1})$, a distribution over states and actions in which each action $a_{\tau-1}$ depends only on the state $s_{\tau-1}$ through $\phi(a_{\tau-1}|s_{\tau-1})$. Plugging in a deterministic policy $\phi = \pi$ yields $q(s_\tau, a_{\tau-1}|\pi) = p(s_\tau|s_{\tau-1}, a')$ with $\pi(s_{\tau-1}) = a'$, which recovers the equation used in previous studies on active inference. Suppose we choose actions based on the current state $s_t$ and the policy network $\phi$. The corresponding expected free energy, which can be interpreted as the EFE of the sequence of policies $(\phi, \phi, \ldots, \phi)$ obtained by substituting our extensions for the probabilities in (3), is

$$G_\phi(s_t) = \mathbb{E}_{\prod_{\tau>t} q(s_\tau, o_\tau, a_{\tau-1}|\phi)}\left[\sum_{\tau>t} \log \frac{q(s_\tau, a_{\tau-1}|\phi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\right]$$

Note that the agent uses the same policy network $\phi(a_\tau|s_\tau)$ for all $\tau$, so this equation can be rewritten in a recursive form.
$$
\begin{aligned}
G_\phi(s_t) &= \mathbb{E}_{\prod_{\tau>t} q(s_\tau, o_\tau, a_{\tau-1}|\phi)}\Big[\sum_{\tau>t} \log \frac{q(s_\tau, a_{\tau-1}|\phi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\Big] \\
&= \mathbb{E}_{\prod_{\tau>t} q(s_\tau, o_\tau, a_{\tau-1}|\phi)}\Big[\log \frac{q(s_{t+1}, a_t|\phi)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + \sum_{\tau>t+1} \log \frac{q(s_\tau, a_{\tau-1}|\phi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\Big] \\
&= \mathbb{E}_{q(s_{t+1}, o_{t+1}, a_t|\phi)}\Big[\log \frac{q(s_{t+1}, a_t|\phi)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + \mathbb{E}_{\prod_{\tau>t+1} q(s_\tau, o_\tau, a_{\tau-1}|\phi)}\sum_{\tau>t+1} \log \frac{q(s_\tau, a_{\tau-1}|\phi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\Big] \\
&= \mathbb{E}_{q(s_{t+1}, o_{t+1}, a_t|\phi)}\Big[\log \frac{q(s_{t+1}, a_t|\phi)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + G_\phi(s_{t+1})\Big]
\end{aligned}
$$

Fixing the first action $a_t = a$, the action-conditioned EFE $G_\phi(s_t|a)$ of a given action $a$ and policy network $\phi$ is defined as

$$
\begin{aligned}
G_\phi(s_t|a) &:= \mathbb{E}_{q(s_{t+1}, o_{t+1}|a_t=a)}\Big[\log \frac{q(s_{t+1}|a_t=a)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + G_\phi(s_{t+1})\Big] \\
&= \mathbb{E}_{p(s_{t+1}|s_t, a_t=a)\, p(o_{t+1}|s_{t+1})}\Big[\log \frac{p(s_{t+1}|s_t, a_t=a)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + G_\phi(s_{t+1})\Big]
\end{aligned}
$$

Taking the expectation over $\phi(a|s_t)$, we obtain the relationship between $G_\phi(s_t)$ and $G_\phi(s_t|a)$:

$$\mathbb{E}_{\phi(a|s_t)}\big[G_\phi(s_t|a)\big] = G_\phi(s_t) + H\big[\phi(a|s_t)\big]$$

We may separate the distribution over the first-step action from $\phi$ to find an alternative distribution that minimizes the EFE. Substituting $q(a_t)$ for the distribution over the first-step action, we obtain the one-step substituted EFE, which can be interpreted as the EFE of the sequence of policies $(q(a_t), \phi, \ldots, \phi)$:

$$
\begin{aligned}
G_\phi^{1\text{-step}}(s_t) &= \mathbb{E}_{q(a_t)\, q(s_{t+1}, o_{t+1}|a_t)\, \prod_{\tau>t+1} q(s_\tau, o_\tau|\phi)}\Big[\log \frac{q(a_t)\, q(s_{t+1}|a_t)\, q(s_{t+2:T}, a_{t+1:T-1}|\phi)}{p(s_{t+1:T}, o_{t+1:T})}\Big] \\
&= \mathbb{E}_{q(a_t)\, q(s_{t+1}, o_{t+1}|a_t)\, \prod_{\tau>t+1} q(s_\tau, o_\tau|\phi)}\Big[\log q(a_t) + \log \frac{q(s_{t+1}|a_t)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + \sum_{\tau>t+1} \log \frac{q(s_\tau, a_{\tau-1}|\phi)}{p(o_\tau)\, q(s_\tau|o_\tau)}\Big] \\
&= \mathbb{E}_{q(a_t)}\big[\log q(a_t) + G_\phi(s_t|a_t)\big]
\end{aligned}
$$

Under a given $\phi$, this quantity depends only on the distribution $q(a_t)$. Minimizing it therefore naturally yields the distribution $q^*(a_t) = \sigma_a(-G_\phi(s_t|a_t))$, which matches the known active inference prescription in the $\gamma = 1$ case.
$$G_\phi^{1\text{-step}}(s_t) = \mathbb{E}_{q(a_t)}\big[\log q(a_t) + G_\phi(s_t|a_t)\big] = \mathrm{KL}\big(q(a_t)\,\|\,q^*(a_t)\big) - \log \sum_{a_t \in A} \exp\big(-G_\phi(s_t|a_t)\big)$$

Therefore $G_\phi^{1\text{-step}}(s_t) \ge -\log \sum_{a_t \in A} \exp(-G_\phi(s_t|a_t))$, with equality if and only if $q(a_t) = \sigma_a(-G_\phi(s_t|a_t))$. Let us further consider the temperature hyperparameter $\gamma > 0$ such that $q^*_\gamma(a_t) = \sigma_a(-\gamma G_\phi(s_t|a_t))$. Similarly to the above, we obtain

$$
\begin{aligned}
G_\phi^{1\text{-step}}(s_t) &= \frac{1}{\gamma}\,\mathbb{E}_{q(a_t)}\big[\gamma \log q(a_t) + \gamma G_\phi(s_t|a_t)\big] \\
&= \frac{1}{\gamma}\,\mathbb{E}_{q(a_t)}\big[(1-\gamma)(-\log q(a_t)) + \log q(a_t) - \log q^*_\gamma(a_t)\big] - \frac{1}{\gamma}\log \sum_{a_t \in A} \exp\big(-\gamma G_\phi(s_t|a_t)\big) \\
&= \Big(\frac{1}{\gamma}-1\Big) H\big(q(a_t)\big) + \frac{1}{\gamma}\,\mathrm{KL}\big(q(a_t)\,\|\,q^*_\gamma(a_t)\big) - \frac{1}{\gamma}\log \sum_{a_t \in A} \exp\big(-\gamma G_\phi(s_t|a_t)\big)
\end{aligned}
$$

From the arguments above, we conclude: (1) $G_\phi^{1\text{-step}}(s_t)$ is minimized by $q^*(a_t) = \sigma_a(-G_\phi(s_t|a_t))$, the 'natural case' $\gamma = 1$, which was set heuristically in the experiments of Millidge (2020). (2) Plugging $q^*_\gamma(a_t) = \sigma_a(-\gamma G_\phi(s_t|a_t))$ into $q(a_t)$ in the equation above, we obtain

$$G_\phi^{1\text{-step}}(s_t) = (1-\gamma)\,\mathbb{E}_{q^*_\gamma(a)}\big[G_\phi(s_t|a)\big] - \log D_\gamma \approx (1-\gamma)\,\mathbb{E}_{q^*_\gamma(a)}\big[G_\phi(s_t|a)\big] + \gamma \min_{a \in A} G_\phi(s_t|a)$$

where $D_\gamma = \sum_{a \in A} \exp(-\gamma G_\phi(s_t|a))$, and the last step uses the log-sum-exp smooth approximation to the maximum, which becomes accurate when the maximum dominates the other terms:

$$-\log D_\gamma = -\log \sum_{a \in A} \exp\big(-\gamma G_\phi(s_t|a)\big) \approx -\max_{a \in A}\big\{-\gamma G_\phi(s_t|a)\big\} = \gamma \min_{a \in A} G_\phi(s_t|a) \qquad (12)$$

When $\gamma = 1$, $G_\phi^{1\text{-step}}(s_t)$ is minimized by $q^*_1(a_t) = \sigma_a(-G_\phi(s_t|a_t))$, and its minimum value can be approximated as $-\log D_1 \approx \min_{a \in A} G_\phi(s_t|a)$. When $\gamma \to \infty$, $q^*_\gamma(a_t)$ can be considered a deterministic policy seeking the smallest EFE, which leads to

$$G_\phi^{1\text{-step}}(s_t) = \Big(\frac{1}{\gamma}-1\Big) H\big(q^*_\gamma(a_t)\big) - \frac{1}{\gamma}\log \sum_{a \in A} \exp\big(-\gamma G_\phi(s_t|a)\big) \approx \min_{a \in A} G_\phi(s_t|a)$$

Note that Q-learning with a negative EFE can be interpreted as $q^*_\gamma$ in the $\gamma = \infty$ case.
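The role of the temperature $\gamma$ can be seen numerically. The short sketch below, using made-up action-conditioned EFE values, computes $q^*_\gamma(a_t) = \sigma_a(-\gamma G_\phi(s_t|a_t))$ and checks the two limits discussed above, as well as the log-sum-exp approximation $-\log D_\gamma \approx \gamma \min_a G_\phi(s_t|a)$ in (12).

```python
import numpy as np

def q_star(G, gamma):
    """Softmax over negative EFE: q*_gamma(a) = sigma_a(-gamma * G(s_t|a))."""
    z = -gamma * G
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

G = np.array([1.0, 2.5, 0.2, 4.0])     # hypothetical action-conditioned EFEs

near_uniform = q_star(G, gamma=1e-8)   # gamma -> 0: maximal exploration
near_greedy  = q_star(G, gamma=50.0)   # gamma -> inf: deterministic argmin-EFE

# -log D_gamma via a stabilized log-sum-exp, compared against gamma * min_a G
gamma = 50.0
m = (-gamma * G).max()
log_D = m + np.log(np.sum(np.exp(-gamma * G - m)))
smooth_min = -log_D                    # close to gamma * G.min() when the min dominates
```

As expected, a tiny $\gamma$ yields a nearly uniform distribution over actions, a large $\gamma$ concentrates almost all mass on the minimum-EFE action, and the smooth-minimum approximation is essentially exact once the gap between the smallest EFE and the rest is large relative to $1/\gamma$.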
When $\gamma \to 0$, $q^*_\gamma(a_t)$ converges to the uniform distribution over the action space A, so a small temperature $\gamma$ motivates the agent to explore actions with a greater EFE; the weighted sum correspondingly converges to the average of the action-conditioned EFEs. Considering the optimal policy $\phi^*$ that always seeks an action with minimum EFE, we obtain

$$G_{\phi^*}(s_t) = \min_a G_{\phi^*}(s_t|a) = \min_a \mathbb{E}_{p(s_{t+1}|s_t, a_t=a)\, p(o_{t+1}|s_{t+1})}\Big[\log \frac{p(s_{t+1}|s_t, a_t=a)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} + G_{\phi^*}(s_{t+1})\Big] \qquad (14)$$

This equation closely resembles the Bellman optimality equation, with $G_\phi(s_t|a)$ and $G_\phi(s_{t+1})$ corresponding to the action-value function and the state-value function, respectively. Based on this similarity, we can regard the first term $\log \frac{p(s_{t+1}|s_t, a_t=a)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})}$ as a one-step negative reward (because active inference aims to minimize the expected free energy of the future) and the EFE as a negative value function.
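To make the Bellman analogy concrete, the sketch below runs value-iteration-style backups on the action-conditioned EFE of a hypothetical five-state chain. All of its ingredients are illustrative assumptions: transitions are deterministic and known (so the epistemic part of the one-step term vanishes and the cost reduces to $-\log p(o_{t+1})$), the prior preference is a Gaussian centered on the rightmost state, and a discount factor is added for convergence.

```python
import numpy as np

n_states, actions = 5, (-1, +1)
pref_mean, pref_std = 4.0, 1.0          # prior preference p(o) = N(4, 1) over states

def cost(s):
    """One-step cost -log p(o) of landing on state s (observation = state here)."""
    return 0.5 * np.log(2 * np.pi * pref_std**2) + (s - pref_mean)**2 / (2 * pref_std**2)

def step(s, a):
    return min(max(s + a, 0), n_states - 1)

# Bellman-style backups on the EFE:  G(s, a) = cost(s') + discount * min_a' G(s', a')
G = np.zeros((n_states, len(actions)))
discount = 0.9
for _ in range(200):
    new_G = np.empty_like(G)
    for s in range(n_states):
        for i, a in enumerate(actions):
            s_next = step(s, a)
            new_G[s, i] = cost(s_next) + discount * G[s_next].min()
    G = new_G

policy = [actions[int(G[s].argmin())] for s in range(n_states)]   # greedy min-EFE
```

On this chain the greedy minimum-EFE policy moves right toward the preferred state from every position, exactly as a greedy policy on the negated reward would.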

4. PPL: PRIOR PREFERENCE LEARNING FROM EXPERTS

The previous section established that using $\log \frac{p(s_{t+1}|s_t, a_t)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})}$ as a negative reward lets us handle the EFE with traditional RL methods. A simple calculation decomposes this term as

$$R_t := -\log \frac{p(s_{t+1}|s_t, a_t)}{p(o_{t+1})\, q(s_{t+1}|o_{t+1})} = \log p(o_{t+1}) + \Big(-\log \frac{p(s_{t+1}|s_t, a_t)}{q(s_{t+1}|o_{t+1})}\Big) = R_{t,i} + R_{t,e} \qquad (15)$$

The first term measures the similarity between the preferred future and the predicted future; it is an intuitive value. The second term is called the epistemic value, which encourages exploration (Friston et al., 2015). The epistemic value can be computed either from prior knowledge of the environment or with various algorithms for learning a generative model (Igl et al., 2018; Kaiser et al., 2019). The key to calculating the one-step reward and learning the EFE is the prior preference $p(o_t)$, which encodes the agent's preferred observations and goal. Setting the prior preference is a challenging point of active inference. A simple method used in recent studies (Ueltzhöffer, 2018; Çatal et al., 2020) is to set the prior preference to a Gaussian distribution whose mean is the goal position. However, using the same prior preference for all observations o is inefficient; we call this type of preference a global preference. Taking the MountainCar environment as an example, to reach the goal position the car must first move away from the goal in the early time steps, so $p(o_t)$ must contain information about local preference. It is also difficult to design the prior preference directly for some observation spaces, such as images. To solve these problems arising in active inference, we introduce prior preference learning from experts (PPL). Suppose that an agent can access expert simulations $S = \{(o_{i,1}, \ldots, o_{i,T})\}_{i=1}^N$, where N is the number of simulations.
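Under diagonal-Gaussian densities, the decomposition in (15) can be evaluated directly. The sketch below is a hypothetical example: the means and standard deviations stand in for the outputs of the transition network $p(s_{t+1}|s_t, a_t)$, the encoder $q(s_{t+1}|o_{t+1})$, and the prior preference $p(o_{t+1})$.

```python
import numpy as np

def logpdf(x, mean, std):
    """Diagonal-Gaussian log density, summed over dimensions."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2)
                        - (x - mean)**2 / (2 * std**2)))

def efe_reward(o_next, s_next, trans, post, pref):
    """R_t = R_{t,i} + R_{t,e} from (15), each density given as a (mean, std) pair.

    R_{t,i} = log p(o_{t+1})                                     (intuitive term)
    R_{t,e} = -log p(s_{t+1}|s_t, a_t) + log q(s_{t+1}|o_{t+1})  (epistemic term)
    """
    r_i = logpdf(o_next, *pref)
    r_e = -logpdf(s_next, *trans) + logpdf(s_next, *post)
    return r_i, r_e

# Hypothetical one-step quantities for a 1-D state/observation:
o_next, s_next = np.array([0.1]), np.array([0.0])
trans = (np.array([0.0]), np.array([1.0]))   # transition prediction p(s'|s, a)
post  = (np.array([0.0]), np.array([0.3]))   # posterior q(s'|o'), sharper than trans
pref  = (np.array([0.0]), np.array([0.5]))   # prior preference p(o')

r_i, r_e = efe_reward(o_next, s_next, trans, post, pref)
```

Here the posterior is sharper than the transition prediction at the sampled state, so the epistemic bonus $r_e$ is positive: observing $o_{t+1}$ reduced uncertainty about $s_{t+1}$, which is exactly the exploration signal the epistemic value rewards.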
From the expert simulations S, the agent learns an expert prior $p(o_{t+1}|o_t)$ conditioned on the current observation $o_t$. The model $p(o_{t+1}|o_t)$ captures the expert's local preference, which is more effective than a global preference. Note that this method requires only expert simulations and no knowledge of the forward dynamics of the environment. Once the prior preference $p(o_{t+1}|o_t)$ is learned, we can calculate $R_t$ in (15) and use any RL algorithm to approximate the EFE $G_{\phi^*}(s_t)$ in (14). Algorithm 1 gives our pseudo-code for PPL with Q-learning.
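A minimal sketch of the preference-learning step, assuming a simple conditional Gaussian $p(o_{t+1}|o_t) = N(A o_t + b, \sigma^2)$ fit by least squares; in the paper a neural density model $p_\theta(o_{t+1}|o_t)$ plays this role, and the synthetic "expert" below, which contracts the observation toward the origin, is purely illustrative.

```python
import numpy as np

def fit_prior_preference(pairs):
    """Fit p(o_{t+1}|o_t) = N(A o_t + b, diag(sigma^2)) to expert (o_t, o_{t+1}) pairs."""
    O_t  = np.array([p[0] for p in pairs])
    O_t1 = np.array([p[1] for p in pairs])
    X = np.hstack([O_t, np.ones((len(O_t), 1))])     # append a bias column
    W, *_ = np.linalg.lstsq(X, O_t1, rcond=None)     # least-squares fit of [A; b]
    sigma = (O_t1 - X @ W).std(axis=0) + 1e-6        # residual scale per dimension
    def log_prior(o_t, o_t1):
        mean = np.append(o_t, 1.0) @ W
        return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                            - (o_t1 - mean)**2 / (2 * sigma**2)))
    return log_prior

# Synthetic "expert": every transition contracts the observation toward the origin.
rng = np.random.default_rng(1)
pairs = [(o, 0.8 * o + 0.01 * rng.normal(size=2))
         for o in rng.normal(size=(500, 2))]
log_prior = fit_prior_preference(pairs)
```

An expert-like transition then scores a much higher log preference than an arbitrary one, e.g. `log_prior(o, 0.8 * o)` far exceeds `log_prior(o, -o)`; this learned, observation-conditioned score is precisely what replaces the global preference in $R_{t,i}$.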

5. RELATED WORKS

Active inference and RL. Our work builds on active inference and reinforcement learning. Active inference was first introduced by Friston et al. (2006), inspired by neuroscience and the free energy principle; it explains how a biological system maintains itself with a model of the brain. Friston et al. (2009) later treated the relation between active inference and reinforcement learning. Early studies (Friston, 2010; Friston et al., 2011; 2015; 2017a) dealt with tabular or model-based problems owing to the computational cost. Recently, Ueltzhöffer (2018) introduced deep active inference, which uses deep neural networks to approximate the observation and transition models on the MountainCar environment. This work used the EFE as an objective function, whose gradient propagates through the environment dynamics; to handle this, Ueltzhöffer (2018) used stochastic weights with an evolutionary strategy (Salimans et al., 2017), a method for gradient approximation, running about 10^4 environments in parallel for a stable approximation. Çatal et al. (2020) and Millidge (2020) introduced end-to-end differentiable models by including environment transition models, both requiring far less interaction with the environment than before: Çatal et al. (2020) used Monte Carlo sampling to approximate the EFE with a global prior preference, whereas Millidge (2020) used bootstrapping methods together with common RL rewards and model-driven values derived from the EFE, verifying that the model-driven values yield faster and better results on MountainCar. Furthermore, Tschantz et al. (2020) and Millidge et al. (2020) introduced a new KL objective called the free energy of the expected future (FEEF), which is related to probabilistic RL (Levine, 2018b; Rawlik, 2013; Kappen et al., 2012; Lee et al., 2019).


Our work extends the scope of active inference to inverse RL by learning preferred observations from experts. A common limitation of previous studies on active inference is the ambiguity of the prior preference distribution: previous work was done in environments where the prior preference can be expressed naturally, which is clearly not true in general. PPL is an active-inference-based approach to the inverse RL problem setting that learns a prior preference from expert simulations, broadening the scope of active inference and providing a new perspective on the connection between active inference and RL. Control as inference. Levine (2018a) proposed the control-as-inference framework, which interprets a control problem as probabilistic inference with an additional binary variable $O_t$ indicating whether a given action is optimal. Several related formulations of RL as an inference problem have also been proposed (Todorov, 2008; Kappen et al., 2009). Control as inference measures the probability that a given action is optimal from a given reward via $p(O_t = 1|s_t, a_t) = \exp(r(s_t, a_t))$: from the reward and the chosen probability model, it calculates the probability that the action $a_t$ in state $s_t$ is optimal. In contrast, active inference measures this probability as in (15) with a prior preference distribution p and constructs a reward, viewing the EFE as a negative value function. Given the theoretical similarity between control as inference and active-inference-as-RL (Section 3), based on the duality between control and inference, our proposed PPL can interpret the reward $r(s_t, a_t)$ and the message function $\beta(s_t)$ in Levine (2018a) as EFE-related quantities. Therefore, learning the expert prior preference p immediately yields an active-inference-based reward function, making the approach applicable to reward construction and several inverse RL problems.

6. EXPERIMENTS

In this section, we discuss and compare experimental results of our proposed PPL algorithm and other inverse RL algorithms on several classic control environments implemented in OpenAI Gym (Brockman et al., 2016). First, we compare the conventional global preference method to PPL; we expect PPL to be effective in environments where the local and global preferences differ. Second, we claim that our active-inference-based approach achieves results comparable to current inverse RL algorithms. Expert simulations were obtained from the OpenAI RL Baselines Zoo (Raffin, 2018).

6.1. PPL AND GLOBAL PREFERENCE

First, we compare our proposed algorithm (setting 1 in Table 1) and its variants. Table 1 contains four experimental settings, where setting 1 is our proposed method and the others are experimental groups for comparing the effects of the expert batch, the epistemic value $R_{t,e}$, and PPL. Expert batch. We use expert simulations for batch sampling when learning the EFE and the forward dynamics model. This allows the EFE network and the dynamics model to be trained even on states that are not reached in the early stage of learning. We use the Q-learning algorithm to learn the EFE network, together with commonly used RL techniques such as replay memory and a target network (Mnih et al., 2015). Reward. We observe how the epistemic value $R_{t,e}$ in (15) influences the learning process and performance. Global preference. The global preference is a distribution over states induced naturally from the agent's goal, whereas PPL learns a prior preference from expert simulations. Roughly, the global preference can be understood as a hard-coded prior preference based on prior knowledge of the environment, as in the 'goal-directed behavior' of Ueltzhöffer (2018). The hard-coded global preferences used in our experiments are detailed in the following sub-subsection.

6.1.1. ENVIRONMENTS AND EXPERIMENT DETAILS

We tested three classic control environments: Acrobot, CartPole, and MountainCar. We used 5000 pairs $(o_t, o_{t+1})$ to train the prior preference $p(o_{t+1}|o_t)$. We used the same neural network architecture for all environments, except for the number of input and output dimensions. During training, we clip the epistemic value to prevent gradient explosion in settings 1, 2, and 4, which use the epistemic value. Acrobot is an environment with a two-link pendulum in which the second link is actuated; the goal is to swing the end effector up to the horizontal line. The state space has six dimensions, $(\cos\theta_1, \sin\theta_1, \cos\theta_2, \sin\theta_2, \dot\theta_1, \dot\theta_2)$, where $\theta_1$ and $\theta_2$ are the two rotational joint angles. The action space consists of three actions: applying +1, 0, or −1 torque on the joint between the two links. We did not run setting 4 on this environment, because the global preference is ambiguous to define for Acrobot. CartPole is an environment with a cart and a pole on the cart; the goal is to balance the pole by controlling the cart. The state space has four dimensions, $(x, \dot x, \theta, \dot\theta)$, where x is the position of the cart and $\theta$ is the angle of the pole from the vertical. The action space consists of two actions: accelerating the cart to the left or right. The global preference is a Gaussian with mean (0, 0, 0, 0) and a fixed standard deviation of 0.1 in each dimension. MountainCar is an environment with a car in a valley; the goal is to make the car reach the goal position at the right peak of the mountain. The state space has two dimensions, $(x, \dot x)$, where x is the position of the car. The action space consists of three actions: accelerating the car to the left, accelerating it to the right, or doing nothing. The global preference is a Gaussian with mean (0.5, 0) and a fixed standard deviation of 0.1 in each dimension; note that (0.5, 0) is the state of the goal position with zero velocity.
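For concreteness, the hard-coded MountainCar global preference above can be written down directly. The short sketch below evaluates its log density at a few positions (the non-goal positions are illustrative, using Gym's valley bottom near x = −0.5) and shows why the global preference penalizes the swing-back states the expert must pass through.

```python
import numpy as np

# Global preference for MountainCar: N((0.5, 0), 0.1^2 I) over (position, velocity)
goal = np.array([0.5, 0.0])
std = 0.1

def log_global_pref(obs):
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2)
                        - (obs - goal)**2 / (2 * std**2)))

at_goal    = np.array([0.5, 0.0])    # goal position with zero velocity
start      = np.array([-0.5, 0.0])   # near the valley bottom
swing_back = np.array([-0.9, 0.0])   # the expert must first move here, away from the goal
```

The score decreases strictly with distance from the goal, so swinging back left is penalized even though it is a necessary part of every successful trajectory; this is exactly the local-versus-global mismatch that PPL's learned $p(o_{t+1}|o_t)$ addresses.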

6.1.2. RESULTS AND DISCUSSIONS

Figure 1 shows the experimental results on MountainCar, Acrobot, and CartPole; performance is compared and benchmarked with the default reward of each environment. PPL and global preference. Comparing setting 1 (blue line) and setting 2 (red line), PPL is more efficient than the conventional global preference, as expected. In particular, on MountainCar little learning is achieved with the global preference, which seems to be because the gap between the global and expert preferences is greatest in that environment; on CartPole, setting 2 learned more slowly than setting 1. Expert batch. Comparing setting 1 (blue line) and setting 3 (orange line), using the expert batch is helpful for certain tasks: on Acrobot and MountainCar, the expert batch performs better than the case without it. However, on CartPole the results without the expert batch are marginally better than those of setting 1. This is because a CartPole agent only moves near the initial position, so the expert batch is not needed to discover the dynamics of the generative model. Epistemic value. We found that the epistemic value in the EFE term does not significantly impact the training process. Comparing setting 1 (blue line) and setting 4 (green line), the results were similar regardless of whether the epistemic value was used. On Acrobot and MountainCar, the standard deviations were marginally smaller, but there were no significant differences. On CartPole-v1, setting 4, without the epistemic value term, learned faster than our proposed setting 1 at the beginning of training. We attribute this initial performance drop to the instability of the epistemic term: at the beginning of training the generative model has not yet been learned, so the associated epistemic term is unstable. We leave this issue to future work.

6.2. PPL AND INVERSE RL ALGORITHMS

Second, we check that our proposed PPL is comparable with traditional inverse RL algorithms. We compared PPL with behavioral cloning (BC; Pomerleau, 1989) and maximum entropy inverse RL (MaxEnt; Ziebart et al., 2008) as benchmark models, using setting 1 in Table 1 as our proposed PPL, on MountainCar-v0 and CartPole-v1. Note that BC does not need to interact with the environment, and that the state space was discretized to use the original MaxEnt algorithm. We also tried to run the experiment on Acrobot-v1, but we failed to make the agent learn with MaxEnt: the discretized state space for MaxEnt grows exponentially with the state dimension, and we attribute the failure to Acrobot's larger state dimension compared to the other environments. We therefore only report that PPL and BC give similar results on Acrobot-v1. We verified that PPL gives comparable results on MountainCar-v0 and CartPole-v1. PPL shows better results than MaxEnt on both environments (note that MaxEnt needs many more episodes to converge), and PPL obtained mean rewards nearly equal to BC on MountainCar-v0, whereas BC gives better results than PPL on CartPole-v1.

7. CONCLUSION

In this paper, we introduced active inference from the perspective of RL. Although active inference emerged from Bayesian models of cognitive processes, we showed that its concepts, especially the EFE, are closely related to RL through a bootstrapping argument; the only difference is that the value function of RL is based on a reward, while that of active inference is based on the prior preference. We also showed that active inference can provide insights for solving inverse RL problems: using expert simulations, an agent can learn a local prior preference, which is more effective than a global preference. Furthermore, our proposed active-inference-based reward, built from a prior preference and a generative model, frees previous inverse RL problems from their ill-posedness. Our work on active inference is complementary to RL, as it can be applied to model-based RL for reward design and to model-free RL for learning generative models.




Figure 1: Experiment results on three classical control environments: MountainCar-v0, Acrobot-v1, and CartPole-v1. The curves in the figure were averaged over 50 runs, and the standard deviation of 50 runs is given as a shaded area. Each policy was averaged out of 5 trials. All rewards of the environment follow the default settings of Open AI Gym.

Figure 2: Inverse RL experiment results on MountainCar-v0 (left) and CartPole-v1 (right). The curves in the figure were averaged over 50 runs, and the standard deviation of the 50 runs is given as a shaded area. Note that the black and green dashed lines on the right overlap. All rewards of the environment follow the default settings of OpenAI Gym. (BC: behavioral cloning, MaxEnt: maximum entropy inverse RL)

Algorithm 1: Inverse Q-Learning with Prior Preference Learning
Input: expert simulations S = {(o_{i,1}, ..., o_{i,T})}_{i=1}^N
Learn the prior preference p_θ(o_{t+1}|o_t) from the expert simulations;
Initialize the forward dynamics p_η(o|s), T_η(s_{t+1}|s_t, a_t), q_η(s|o), and the EFE network G_ξ(s_t, a_t);
for each episode do
    for each time step t do
        Select a_t from σ(−γ G_ξ(s_t, ·)), observe o_{t+1}, and compute the reward R_t in (15) with p_θ;
        Compute the dynamics loss L_d((p_η ∘ T_η ∘ q_η)(o_t), o_{t+1});
        Compute the EFE network loss L_efe(G_ξ(s_t, a_t), R_t + max_a G_ξ(s_{t+1}, a));
        Update η ← η − α∇L_d and ξ ← ξ − α∇L_efe;
    end
end
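Algorithm 1 can be sketched end-to-end in tabular form. The toy below is an assumption-laden miniature, not the paper's implementation: observations equal states on a seven-state chain, a synthetic "expert" always moves right, the prior preference p_θ(o_{t+1}|o_t) is a smoothed count table rather than a neural density model, and the epistemic term is omitted because the transitions are deterministic and known. The EFE table G_ξ is trained with the tabular analogue of the Q-learning update, treating the EFE as a negative value.

```python
import numpy as np

rng = np.random.default_rng(0)
N, A = 7, (-1, +1)                       # chain states 0..6, actions left/right
step = lambda s, a: min(max(s + a, 0), N - 1)

# (1) Learn a tabular prior preference p(o'|o) from expert transitions (expert: a = +1)
counts = np.full((N, N), 1e-3)           # small smoothing pseudo-count
for _ in range(300):
    s = int(rng.integers(N))
    counts[s, step(s, +1)] += 1.0
log_prior = np.log(counts / counts.sum(axis=1, keepdims=True))

# (2) Q-learning on the EFE table with reward R_t = log p(o_{t+1}|o_t)
G = np.zeros((N, len(A)))
alpha, discount, eps = 0.2, 0.9, 0.2
s = 0
for _ in range(5000):
    i = int(rng.integers(len(A))) if rng.random() < eps else int(G[s].argmin())
    s_next = step(s, A[i])
    R = log_prior[s, s_next]
    # EFE as a negative value: the target is -R_t plus the best (smallest) next EFE
    G[s, i] += alpha * (-R + discount * G[s_next].min() - G[s, i])
    s = s_next

policy = [A[int(G[s].argmin())] for s in range(N)]
```

The greedy minimum-EFE policy learns to move right, toward the expert-preferred end of the chain, from every state, which is the qualitative behavior Algorithm 1 aims for.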

Table 1: Experiment settings for PPL, the global preference, and their variants. Each setting specifies the preference (PPL or global), the batch composition (replay memory, with or without experts), and the reward terms (R_{t,i}, with or without R_{t,e}); for example, setting 4 uses the global preference, replay memory + experts, and R_{t,i} + R_{t,e}.

