CAT-SAC: SOFT ACTOR-CRITIC WITH CURIOSITY-AWARE ENTROPY TEMPERATURE

Abstract

The trade-off between exploration and exploitation has long been a crucial issue in reinforcement learning (RL). Most existing RL methods handle this problem by injecting stochasticity into the policy; for example, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b) introduces an entropy temperature to maximize both the external return and the entropy of the policy. However, this temperature is applied indiscriminately across all environment states, undermining the potential of exploration. In this paper, we argue that the agent should explore more in unfamiliar states and less in familiar ones, so as to understand the environment more efficiently. To this end, we propose Curiosity-Aware entropy Temperature for SAC (CAT-SAC), which utilizes a curiosity mechanism to develop an instance-level entropy temperature. CAT-SAC models curiosity with the state prediction error, since unfamiliar states generally yield large prediction errors. The curiosity is added to the target entropy, raising the entropy temperature for unfamiliar states and lowering it for familiar ones. By tuning the entropy adaptively per state, CAT-SAC is encouraged to explore when its curiosity is large and to exploit otherwise. Experimental results on the challenging MuJoCo benchmark show that the proposed CAT-SAC significantly improves sample efficiency, outperforming advanced model-based and model-free RL baselines.
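The mechanism summarized above can be illustrated with a minimal sketch. Here curiosity is measured as the prediction error between a trainable predictor and a fixed target feature map (random linear maps stand in for the networks); the base target entropy, the scale `beta`, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a fixed target network and a trainable predictor network.
W_target = rng.normal(size=(8, 4))
W_pred = rng.normal(size=(8, 4))

def curiosity(state):
    """Curiosity as the state-prediction error: large for unfamiliar states,
    small once the predictor has been trained on similar states."""
    err = W_pred @ state - W_target @ state
    return float(np.mean(err ** 2))

def adaptive_target_entropy(state, base_target=-4.0, beta=0.1):
    """CAT-SAC idea (sketch): add curiosity to the target entropy, so the
    entropy temperature is driven up in unfamiliar states (more exploration)
    and stays near the base value in familiar ones (more exploitation)."""
    return base_target + beta * curiosity(state)
```

Because the prediction error is non-negative, the adapted target entropy never falls below the base value; it shrinks back toward it as the predictor learns and curiosity decays.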

1. INTRODUCTION

Deep reinforcement learning (RL), with high-capacity deep neural networks (DNNs), has been applied to solve various complex decision-making problems, including video games (Mnih et al., 2015; Vinyals et al., 2019), chess games (Silver et al., 2017; 2018), and robotic manipulation (Kalashnikov et al., 2018). However, it is notorious for its sample inefficiency (Kaiser et al., 2019): even when solving a simple task, RL needs substantial interaction data to improve the policy. One of the major obstacles to sample efficiency is the difficulty of balancing exploration and exploitation. The RL agent needs to explore the environment to collect useful information as well as exploit the acquired knowledge to improve its policy. Most existing works either use intrinsic rewards (e.g., count-based bonuses (Bellemare et al., 2016) and state prediction error (Pathak et al., 2017; Burda et al., 2018b)) to strengthen exploration, or augment the action value with the policy entropy to control the proportion of exploration dynamically (Haarnoja et al., 2017; Ziebart et al., 2008). Among these approaches, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b) achieves superior performance on MuJoCo (Todorov et al., 2012), an OpenAI Gym (Brockman et al., 2016) benchmark with complex continuous control tasks. Specifically, SAC aims to maximize the expected return while also maximizing the entropy: $\pi^* = \arg\max_\pi \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\right]$, where $r_t$ is the environment reward at timestep $t$, $s_t$ is the state, $a_t$ is the action, $\rho_\pi$ is the trajectory distribution induced by policy $\pi$, $\mathcal{H}$ is the entropy of the policy, and $\alpha$ is the entropy temperature that weights the importance of the entropy term versus the external reward. The entropy temperature is critical in SAC since different values induce diverse patterns of agent behavior. A small $\alpha$ may lead the agent to over-optimize the state-action value and develop a greedy policy. Due to the lack of

