CAT-SAC: SOFT ACTOR-CRITIC WITH CURIOSITY-AWARE ENTROPY TEMPERATURE

Abstract

The trade-off between exploration and exploitation has long been a crucial issue in reinforcement learning (RL). Most existing RL methods handle this problem by adding action noise to the policy, such as Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b), which introduces an entropy temperature to maximize both the external value and the entropy of the policy. However, this temperature is applied indiscriminately to all environment states, which limits the potential for exploration. In this paper, we argue that the agent should explore more in an unfamiliar state and less in a familiar state, so as to understand the environment more efficiently. To this end, we propose Curiosity-Aware entropy Temperature for SAC (CAT-SAC), which utilizes the curiosity mechanism to develop an instance-level entropy temperature. CAT-SAC uses the state prediction error to model curiosity, because an unfamiliar state generally has a large prediction error. The curiosity is added to the target entropy, increasing the target entropy for unfamiliar states and decreasing it for familiar states. By tuning the target entropy per state and adaptively, CAT-SAC is encouraged to explore when its curiosity is large and to exploit otherwise. Experimental results on the difficult MuJoCo benchmark demonstrate that the proposed CAT-SAC significantly improves sample efficiency, outperforming advanced model-based and model-free RL baselines.

1. INTRODUCTION

Deep reinforcement learning (RL), with high-capacity deep neural networks (DNNs), has been applied to solve various complex decision-making problems, including video games (Mnih et al., 2015; Vinyals et al., 2019), chess games (Silver et al., 2017; 2018), and robotic manipulation (Kalashnikov et al., 2018). However, it is notorious for its sample inefficiency (Kaiser et al., 2019). Even when solving a simple task, RL needs substantial interaction data to improve the policy. One of the major obstacles to achieving sample efficiency is the difficulty of balancing exploration and exploitation. The RL agent needs to explore the environment to collect useful information as well as exploit the acquired knowledge to improve its policy. Most existing works either use intrinsic rewards (e.g., count-based bonuses (Bellemare et al., 2016) and state prediction error (Pathak et al., 2017; Burda et al., 2018b)) to strengthen exploration, or augment the action value with the entropy to control the proportion of exploration dynamically (Haarnoja et al., 2017; Ziebart et al., 2008). Among these approaches, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b) achieves superior performance on MuJoCo (Todorov et al., 2012), an OpenAI gym (Brockman et al., 2016) benchmark with complex continuous control tasks. Specifically, SAC aims to maximize the expected value while also maximizing the entropy: $\pi^* = \arg\max_\pi \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right]$, where $r_t$ is the environment reward at timestep $t$, $s_t$ is the state, $a_t$ is the action, $\rho_\pi$ is the trajectory distribution under policy $\pi$, $\mathcal{H}$ is the entropy of the policy, and $\alpha$ is the entropy temperature that weights the importance of the entropy term versus the external reward. The entropy temperature is critical in SAC since different values produce diverse patterns of agent behavior. A small $\alpha$ may lead the agent to over-optimize the state-action value and develop a greedy policy.
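The maximum-entropy objective above can be illustrated with a minimal sketch. The function below (a hypothetical helper, not from the paper) computes the entropy-augmented return of a finite trajectory, estimating the per-step entropy as $-\log \pi(a_t \mid s_t)$:

```python
import numpy as np

def soft_return(rewards, log_probs, alpha):
    """Entropy-augmented return: sum_t [r_t + alpha * H_t], where the
    per-step entropy H_t is estimated as -log pi(a_t | s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    entropies = -np.asarray(log_probs, dtype=float)
    return float(np.sum(rewards + alpha * entropies))

# A larger alpha weights the entropy bonus more heavily.
print(soft_return([1.0, 0.5], [-2.0, -1.0], alpha=0.2))  # 1.5 + 0.2 * 3.0 = 2.1
```

With `alpha=0`, this reduces to the ordinary return; increasing `alpha` trades external reward for policy randomness, which is exactly the exploration knob the paper targets.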
Due to the lack of exploration, the agent with a small α is likely to get stuck in a local optimum. On the other hand, over-optimizing the entropy with a large α causes the agent to act nearly uniformly and therefore hampers its ability to exploit the environment. To choose a reasonable temperature automatically, Haarnoja et al. (2018b) introduce a target entropy term and extend SAC to a constrained optimization problem, which performs empirically well on MuJoCo tasks (Brockman et al., 2016). However, the target entropy is applied to all transitions equally during training, which neglects the particularity of different states: some states require less exploration while others need more (Tokic, 2010). This particularity has long been recognized. One important aspect of it is the internal curiosity w.r.t. the environment states. Intuitively, when playing a new game, humans make many different attempts at the beginning, and after they understand the basic logic of the game, they do not spend extra time trying things they already know. This implies that a better exploration strategy is to explore when the agent is in unfamiliar states and exploit when it is in familiar states. Although Haarnoja et al. (2018b) also consider the association between curiosity and entropy, they do not apply any such guidance to their strategy for automatically tuning the entropy temperature. In order to dynamically adjust the exploration-exploitation strategy according to the curiosity about states, we make the first attempt to introduce curiosity into the target entropy term of SAC, so that SAC can actively enlarge its entropy at unfamiliar states while reducing its entropy at familiar states. We name our method CAT-SAC (SAC with Curiosity-Aware entropy Temperature), as it seamlessly introduces curiosity to augment the target entropy term.
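The automatic temperature tuning of Haarnoja et al. (2018b) can be sketched as a gradient step on the temperature loss $J(\alpha) = \mathbb{E}[-\alpha(\log \pi(a \mid s) + \bar{\mathcal{H}})]$, where $\bar{\mathcal{H}}$ is the global target entropy. The following is a minimal illustrative sketch (function names and the scalar-gradient setup are our own, not from SAC's reference implementation):

```python
import numpy as np

def alpha_update(log_alpha, log_probs, target_entropy, lr=1e-3):
    """One gradient step on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)],
    parameterized by log_alpha for positivity. Alpha grows when the policy
    entropy (-log pi) falls below the target, and shrinks when the policy
    is already more random than the target."""
    alpha = np.exp(log_alpha)
    # dJ/d(log_alpha) = -alpha * mean(log_pi + target_entropy)
    grad = -alpha * float(np.mean(np.asarray(log_probs, dtype=float) + target_entropy))
    return log_alpha - lr * grad

log_alpha = 0.0
# Policy entropy (1.0) is below the target (2.0): alpha is pushed up.
assert alpha_update(log_alpha, [-1.0], target_entropy=2.0) > log_alpha
```

Note that `target_entropy` here is a single global constant shared by every state, which is precisely the limitation CAT-SAC addresses.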
Specifically, CAT-SAC first augments the target entropy term by adding zero-mean curiosity. In this way, the agent is encouraged to explore more in an unfamiliar state, as its target entropy is large (positive curiosity), and to explore less in a familiar state, as its target entropy is small (negative curiosity). Then, CAT-SAC adopts an instance-level entropy temperature α(state) to replace the original global entropy temperature, so as to learn different entropy temperatures for different states. The instance-level entropy temperature is curiosity-aware, as it is supervised by the curiosity-augmented target entropy. As for the curiosity model, existing methods, e.g., RND (Burda et al., 2018b), may fail to model curiosity for feature inputs (e.g., position, velocity). One major reason is that states with dramatic visual differences may differ only slightly in feature space, so RND has difficulty distinguishing different states from feature inputs and fails to model curiosity. To cope with this problem, we propose a novel curiosity model, X-RND, which first synthesizes 'unvisited' feature states from the collected data by blending two visited states with random weights, entry by entry. With the visited and the 'unvisited' states, X-RND learns to separate the 'unvisited' from the visited via contrastive self-supervised learning (He et al., 2020). By doing so, X-RND prevents the curiosity estimates for 'unvisited' states from being contaminated by the visited states. In summary, the main contributions of this paper are as follows: we propose a novel CAT-SAC model, which enables the agent to better trade off exploration versus exploitation according to its curiosity about different states; to model curiosity for feature inputs, we propose a new curiosity model, X-RND, optimized by contrastive self-supervised learning.
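The two ingredients described above can be sketched together. The snippet below is an illustrative sketch under our own naming (none of these helpers come from the paper's implementation): entry-wise blending of two visited feature states in the style of X-RND, and a per-state target entropy built by centring curiosity to zero mean before adding it to the global target:

```python
import numpy as np

def blend_states(s1, s2, rng):
    """X-RND-style synthesis of an 'unvisited' state: mix two visited
    feature states entry by entry with independent random weights."""
    w = rng.uniform(0.0, 1.0, size=np.shape(s1))
    return w * np.asarray(s1, dtype=float) + (1.0 - w) * np.asarray(s2, dtype=float)

def curiosity_target_entropy(base_target, curiosities):
    """Per-state target entropy: curiosity is centred to zero mean, so
    unfamiliar states (positive centred curiosity) get a larger target
    while the batch-average target stays at the global base value."""
    c = np.asarray(curiosities, dtype=float)
    return base_target + (c - c.mean())

rng = np.random.default_rng(0)
s_new = blend_states(np.zeros(3), np.ones(3), rng)   # lies between the two states
targets = curiosity_target_entropy(-3.0, [0.1, 0.5, 0.3])
# The batch-mean target entropy is preserved at the global base value.
assert abs(targets.mean() - (-3.0)) < 1e-9
```

Each per-state target would then supervise its own instance-level temperature α(state), in place of the single global target in the alpha-tuning objective of SAC.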
Experimental results demonstrate that the proposed method significantly improves sample efficiency on complex and difficult continuous control tasks of the MuJoCo benchmark against state-of-the-art model-based and model-free methods.

2. RELATED WORK

Exploration is an essential issue for effective reinforcement learning (Kakade & Langford, 2002), and the problem of exploration has been widely studied (Wunder et al., 2010; Haarnoja et al., 2018a; Pathak et al., 2017; Lee et al., 2020; Zhang et al., 2020; Sekar et al., 2020; Stadie et al., 2015).



1 When predicting the dynamics, the curiosity model is prone to produce a large error w.r.t. noisy observations, resulting in a large curiosity.



Therefore, optimizing the policy with a global target entropy that ignores the particularity of different states may instead hamper the agent's exploration. Researchers have used internal curiosity as an auxiliary reward to encourage exploration (Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017). Among these methods, Burda et al. (2018b) propose Random Network Distillation (RND), which predicts the output of a fixed random network, to remedy the 'noisy TV' problem (Pathak et al., 2017) 1 ; RND demonstrates excellent exploration ability on games with image inputs.

