ROBUST POLICY OPTIMIZATION IN DEEP REINFORCEMENT LEARNING

Abstract

Entropy can play an essential role in policy optimization by promoting stochastic policies, which in turn helps the agent explore the environment in reinforcement learning (RL). Striking a proper balance between exploration and exploitation is challenging and may depend on the particular RL task. However, stochasticity often decreases as training progresses, making the policy less exploratory. Consequently, in many cases the policy converges to a sub-optimal solution due to a lack of representative data during training, an issue that can be especially severe in high-dimensional environments. This paper investigates whether maintaining a certain entropy threshold throughout training can help policy learning. In particular, we propose Robust Policy Optimization (RPO), an algorithm that leverages a perturbed Gaussian distribution to encourage high-entropy actions. We evaluate our method on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observe that in many settings, RPO increases policy entropy early in training and then maintains a certain level of entropy for the rest of training. As a result, our RPO agent shows consistently improved performance compared to PPO and to other techniques such as data augmentation and entropy regularization. Furthermore, in several settings our method remains robust in performance while the baseline mechanisms fail to improve and sometimes even worsen performance.

1. INTRODUCTION

Exploration in a high-dimensional environment is challenging due to the online nature of the task. In a reinforcement learning (RL) setup, the agent is responsible for collecting high-quality data: it must decide on actions that maximize future return. In deep reinforcement learning, the policy and value functions are often represented as neural networks due to their flexibility in representing complex functions over continuous action spaces. A well-explored policy is more likely to lead to better data collection and, thus, a better policy. However, in high-dimensional observation spaces the number of possible trajectories is larger, so obtaining representative data is challenging. Moreover, it has been observed that deep RL exhibits a primacy bias: the agent tends to rely heavily on earlier interactions and may ignore helpful interactions later in training (Nikishin et al., 2022). Maintaining stochasticity in the policy is considered beneficial, as it can encourage exploration (Mnih et al., 2016; Ahmed et al., 2019). Entropy measures the randomness of the actions; it is expected to decrease as training progresses, making the policy less stochastic. However, a lack of stochasticity can hamper exploration, especially in large environments (with high-dimensional state and action spaces), as the policy can prematurely converge to a suboptimal one. This scenario may result in low-quality data for agent training. In this paper, we are interested in the effect of maintaining a certain level of entropy throughout training, thereby encouraging exploration. We focus on policy gradient-based approaches with continuous action spaces. A common practice (Schulman et al., 2017; 2015) is to represent continuous actions with a Gaussian distribution and learn its parameters (µ and σ) conditioned on the state.
The policy can be represented as a neural network that takes the state as input and outputs the Gaussian parameters, one set per action dimension. The final action is then sampled from this distribution. This process inherently introduces
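As a concrete illustration of this parameterization, the sketch below shows a minimal Gaussian policy head in PyTorch, together with one plausible instantiation of the perturbed Gaussian idea from the abstract: before sampling, the predicted mean is shifted by uniform noise to keep the action distribution spread out. The network sizes, the perturbation width `alpha`, and the use of a state-independent log standard deviation are all assumptions for illustration, not the paper's exact architecture.

```python
import torch
from torch.distributions import Normal, Uniform

class GaussianPolicy(torch.nn.Module):
    """Hypothetical sketch: Gaussian policy with an optional perturbed mean."""

    def __init__(self, obs_dim, act_dim, alpha=0.5):
        super().__init__()
        # Small MLP mapping state -> mean of the action distribution (assumed sizes).
        self.mu_net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, act_dim),
        )
        # State-independent log standard deviation, one per action dimension.
        self.log_std = torch.nn.Parameter(torch.zeros(act_dim))
        self.alpha = alpha  # width of the uniform perturbation (assumed value)

    def forward(self, obs, perturb=True):
        mu = self.mu_net(obs)
        if perturb:
            # Shift the mean by z ~ Uniform(-alpha, alpha) so the effective
            # action distribution stays more stochastic during training.
            z = Uniform(-self.alpha, self.alpha).sample(mu.shape)
            mu = mu + z
        std = self.log_std.exp()
        return Normal(mu, std)

policy = GaussianPolicy(obs_dim=4, act_dim=2)
obs = torch.randn(1, 4)
dist = policy(obs)
action = dist.sample()                     # final action, one sample per dimension
log_prob = dist.log_prob(action).sum(-1)   # summed over action dims, as in PPO
```

With `perturb=False` this reduces to the standard PPO-style Gaussian head; the perturbation only changes where the distribution is centered, so the log-probability and entropy computations used by the policy gradient remain unchanged.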

