SPEEDING UP POLICY OPTIMIZATION WITH VANISHING HYPOTHESIS AND VARIABLE MINI-BATCH SIZE

Abstract

Reinforcement learning-based algorithms have been used extensively in recent years due to their flexibility and strong performance, and new variants continue to appear. However, the largest drawback of these techniques remains unsolved: it usually takes a long time for agents to learn how to solve a given problem. In this work, we outline a novel method that can drastically reduce the training time of current state-of-the-art algorithms such as Proximal Policy Optimization (PPO). We evaluate the performance of this approach in a unique environment where reinforcement learning is applied to a practical astronomical problem: where to place a fixed number of observatory stations in the Solar System so that space objects (e.g. asteroids) are observed for as large a fraction of time as possible. The reward in this scenario therefore corresponds to the total coverage of the trajectories of these objects. To speed up training, we use noisy evaluation when calculating the reward, a technique that has already been applied successfully in stochastic optimization. Specifically, we allow additional noise in the reward function in the form of a hypothesis term and a varying mini-batch size. In order to follow the theoretical guidelines, however, both are forced to vanish during training so that the noise converges to zero. Our experimental results show that this approach reduces training time remarkably, by as much as 75%.

1. INTRODUCTION

Reinforcement learning (RL) has been the focus of numerous studies in the last few years. As a result, many new algorithms have emerged, and continue to emerge, that improve on existing RL methods or combine traditional methods with modern deep learning techniques, such as deep Q-learning (Mnih et al., 2013), Proximal Policy Optimization (PPO) (Schulman et al., 2017), Asynchronous Advantage Actor-Critic (Mnih et al., 2016b), and Soft Actor-Critic (Haarnoja et al., 2018). Researchers have used these algorithms with great success to solve a range of challenging problems, from playing board games like Go (Silver et al., 2016; 2017) and Massively Multiplayer Online (MMO) games (Suarez et al., 2019) to solving complex problems in robotics (Plappert et al., 2018). Recent research (Reed et al., 2022) has also shown that a single RL agent can be trained to solve different tasks, indicating the strong generalization capabilities of these algorithms. However, one of the biggest problems of these algorithms -- especially on-policy methods -- remains: the optimization process, and hence the training of the agent, takes a very long time (Yarats et al., 2021; Yu, 2018). This is especially true for policy-based algorithms, which are notorious for being sample inefficient (Bastani, 2020). Although PPO, itself a policy-based method, has better sample complexity (Schulman et al., 2017) than the original policy gradient algorithm (Mnih et al., 2016a), it still suffers from this issue. Despite this drawback, PPO remains widely used because it converges to at least a local optimum, in contrast to value-based methods, where convergence is not necessarily guaranteed -- a phenomenon Sutton & Barto (2018) refer to as the "deadly triad".
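To make the idea of vanishing noise concrete, the following sketch illustrates two decay schedules of the kind described in the abstract: a zero-mean noise term added to the reward and a mini-batch size that grows toward its full value, so that both sources of noise vanish by the end of training. The linear schedules, the function names, and all parameter values (`sigma0`, `min_size`, `max_size`) are illustrative assumptions, not the exact schedules used in this work.

```python
import numpy as np

def vanishing_noise_scale(step, total_steps, sigma0=0.1):
    """Scale of the (hypothetical) hypothesis noise term:
    decays linearly from sigma0 to zero over training."""
    return sigma0 * max(0.0, 1.0 - step / total_steps)

def minibatch_size_schedule(step, total_steps, min_size=8, max_size=64):
    """Start with small (noisy) mini-batches and grow toward the full
    size, so the gradient-estimate noise vanishes by the end of training."""
    frac = min(1.0, step / total_steps)
    return int(round(min_size + frac * (max_size - min_size)))

def noisy_reward(true_reward, step, total_steps, rng):
    """Reward corrupted by a vanishing zero-mean Gaussian noise term."""
    sigma = vanishing_noise_scale(step, total_steps)
    return true_reward + rng.normal(0.0, sigma)
```

Because both schedules converge (the noise scale to zero, the mini-batch size to its maximum), the perturbed objective coincides with the true objective in the limit, which is what preserves the convergence guarantees discussed below.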
In this work, we show that the training time of a PPO agent can be reduced remarkably, while still preserving convergence and achieving the same cumulative rewards, by treating the loss as an energy function and injecting noise into the optimization process. In real-world applications it is quite common that the energy function cannot be evaluated precisely because of noise corruption. If the noise is too large then a meaningful optimization cannot

