ROBUST POLICY OPTIMIZATION IN DEEP REINFORCEMENT LEARNING

Abstract

Entropy can play an essential role in policy optimization by encouraging a stochastic policy, which in turn helps the agent better explore the environment in reinforcement learning (RL). A proper balance between exploration and exploitation is challenging to achieve and may depend on the particular RL task. However, stochasticity often decreases as training progresses; thus, the policy becomes less exploratory. Consequently, in many cases, the policy can converge to a sub-optimal solution due to a lack of representative data during training. Moreover, this issue can be even more severe in high-dimensional environments. This paper investigates whether maintaining a certain entropy threshold throughout training can help better policy learning. In particular, we propose an algorithm, Robust Policy Optimization (RPO), which leverages a perturbed Gaussian distribution to encourage high-entropy actions. We evaluated our method on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observed that in many settings, RPO increases the policy entropy early in training and then maintains a certain level of entropy throughout the training period. Overall, our agent RPO shows consistently improved performance compared to PPO and other techniques such as data augmentation and entropy regularization. Furthermore, in several settings, our method remains robust in performance, while the baseline mechanisms fail to improve and can even worsen performance.

1. INTRODUCTION

Exploration in a high-dimensional environment is challenging due to the online nature of the task. In a reinforcement learning (RL) setup, the agent is responsible for collecting high-quality data. The agent has to decide on taking actions that maximize future return. In deep reinforcement learning, the policy and value functions are often represented as neural networks due to their flexibility in representing complex functions with continuous action spaces. If it explores well, the learned policy will more likely lead to better data collection and, thus, a better policy. However, in a high-dimensional observation space, the number of possible trajectories is larger; thus, collecting representative data is challenging. Moreover, it has been observed that deep RL agents exhibit a primacy bias, a tendency to rely heavily on earlier interactions and ignore helpful interactions in the later part of training (Nikishin et al., 2022). Maintaining stochasticity in the policy is considered beneficial, as it can encourage exploration (Mnih et al., 2016; Ahmed et al., 2019). Entropy measures the randomness of the actions; it is expected to decrease as training progresses, making the policy less stochastic. However, a lack of stochasticity might hamper exploration, especially in large environments (with high-dimensional state and action spaces), as the policy can prematurely converge to a sub-optimal one. This scenario might result in low-quality data for agent training. In this paper, we are interested in observing the effect of maintaining a certain level of entropy throughout training, thus encouraging exploration. We focus on a policy gradient-based approach with continuous action spaces. A common practice (Schulman et al., 2017; 2015) is to represent a continuous action as a Gaussian distribution and learn the parameters (µ and σ) conditioned on the state.
The policy can be represented as a neural network that takes the state as input and outputs one set of Gaussian parameters per action dimension. The final action is then sampled from this distribution. This process inherently introduces randomness in the action: every time the policy samples an action for the same state, the action value might differ. Although the action equals the mean of the distribution in expectation, each sample differs, which introduces randomness into the learning process. However, as training progresses, this randomness might reduce, and the policy becomes less stochastic. We first notice that when this method is used to train a PPO-based agent, the entropy of the policy starts to decline as the training progresses, as in Figure 1, even when the return performance is not good. Then we pose the question: what if the agent keeps the policy stochastic, i.e., maintains its entropy, throughout the training? The goal is to enable the agent to keep exploring even after it achieves a certain level of learning. This process might help, especially in high-dimensional environments where the state and action spaces often remain unexplored. We developed an algorithm called Robust Policy Optimization (RPO), which maintains stochasticity throughout the training. We notice a consistent improvement in the performance of our method in many continuous control environments compared to standard PPO. Viewing data augmentation through the lens of entropy, we observe empirically that it can help the policy achieve a higher entropy than without data augmentation. However, this process often requires prior knowledge about the environments and a preprocessing step on the agent's experience. Moreover, such methods might result in an uncontrolled increase in action entropy, eventually hampering the return performance (Raileanu et al., 2020; Rahman & Xue, 2022).
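The standard Gaussian parameterization described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual architecture: the linear "network," its weights, the state vector, and the dimensions are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy(state, W_mu, log_std):
    """Toy linear 'policy network': maps a state to one pair of Gaussian
    parameters (mu, sigma) per action dimension."""
    mu = state @ W_mu          # per-dimension mean, shape (action_dim,)
    sigma = np.exp(log_std)    # per-dimension std, kept positive via exp
    return mu, sigma

def sample_action(mu, sigma):
    # The action is a sample from N(mu, sigma): equal to mu in
    # expectation, but each draw differs, which is the source of
    # the policy's exploration noise.
    return mu + sigma * rng.standard_normal(mu.shape)

state = rng.standard_normal(4)        # hypothetical 4-d state
W_mu = rng.standard_normal((4, 2))    # hypothetical weights, 2-d action
log_std = np.zeros(2)                 # sigma = 1 initially

mu, sigma = gaussian_policy(state, W_mu, log_std)
action = sample_action(mu, sigma)
```

As `log_std` is driven down during training, `sigma` shrinks, samples concentrate around `mu`, and the policy's entropy falls, which is the behavior the paper aims to counteract.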
Another way to control entropy is to use an entropy regularizer (Mnih et al., 2016; Ahmed et al., 2019), which often shows beneficial effects. However, it has been observed that increasing entropy in this way has little effect in specific environments (Andrychowicz et al., 2020). These results show the difficulty of setting a proper entropy level throughout the agent's training. To this end, in this paper, we propose a mechanism for maintaining entropy throughout the training. We propose to use a new distribution to represent continuous actions instead of the standard Gaussian. In our setup, the policy network output is still the Gaussian distribution's mean and standard deviation. However, we add a random perturbation to the mean before taking an action sample. In particular, we add a random value z ∼ U(−α, α) to the mean µ to get a perturbed mean µ′ = µ + z. Finally, the action is sampled from the perturbed Gaussian distribution, a ∼ N(µ′, σ). The resulting distribution is shown in Figure 1. We see that it is flatter than the standard Gaussian; thus, the samples spread more widely around the mean than in a standard Gaussian, whose samples are concentrated near the mean. The uniform random number does not depend on states or policy parameters and thus can help the agent maintain a certain level of stochasticity throughout the training process. We name our approach Robust Policy Optimization (RPO) and compare it with the standard PPO algorithm and other entropy-controlled methods such as data augmentation and entropy regularization.
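The perturbation described above amounts to one extra step before sampling. A minimal NumPy sketch, with placeholder values for µ, σ, and the hyperparameter α (the paper's actual implementation details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def rpo_sample(mu, sigma, alpha):
    """RPO action sampling: perturb the mean with z ~ U(-alpha, alpha),
    then draw the action from N(mu + z, sigma)."""
    z = rng.uniform(-alpha, alpha, size=mu.shape)  # state-independent noise
    mu_prime = mu + z                              # perturbed mean
    return mu_prime + sigma * rng.standard_normal(mu.shape)

mu = np.zeros(2)     # policy-network mean (placeholder values)
sigma = np.ones(2)   # policy-network std
alpha = 0.5          # perturbation range; a tunable hyperparameter

# The effective action distribution is a Gaussian convolved with a
# uniform, so its variance is sigma^2 + alpha^2 / 3: flatter than the
# plain Gaussian, which we can confirm empirically.
samples = np.array([rpo_sample(mu, sigma, alpha) for _ in range(20000)])
```

Because z is independent of the state and of σ, this extra variance term α²/3 persists even if the learned σ shrinks, which is one way to see why the perturbation keeps the policy's entropy from collapsing during training.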



Figure 1: [Left] Standard Gaussian and corresponding perturbed Gaussian distribution (ours). We observe that the probability density is less centered around the mean in our distribution. [Middle] Policy entropy at different timesteps of training in the DeepMind Hopper Hop environment. Similar patterns are observed for the other evaluated environments. The PPO agent, which uses the standard Gaussian, starts at a particular entropy and becomes less stochastic as training progresses, reducing the policy entropy. In contrast, our agent RPO uses our perturbed Gaussian distribution, increases the entropy during the initial timesteps, and then maintains a certain level of entropy throughout the training. [Right] Our agent RPO shows over 3x improvement in normalized return compared to the base PPO agent on 12 DeepMind Control environments. The results are averaged over 10 random seeds.

