SPEEDING UP POLICY OPTIMIZATION WITH VANISHING HYPOTHESIS AND VARIABLE MINI-BATCH SIZE

Abstract

Reinforcement learning-based algorithms have been used extensively in recent years thanks to their flexibility, strong performance, and the growing number of available methods. However, their largest drawback remains unsolved: it usually takes a long time for an agent to learn how to solve a given problem. In this work, we outline a novel method that can drastically reduce the training time of current state-of-the-art algorithms such as Proximal Policy Optimization (PPO). We evaluate the performance of this approach in a unique environment where we use reinforcement learning to help with a practical astronomical problem: where to place a fixed number of observatory stations in the Solar System so that space objects (e.g. asteroids) can be observed as continuously as possible. The reward in this scenario corresponds to the total coverage of the trajectories of these objects. To speed up training, we compute the reward via noisy evaluation, a technique that has already been applied efficiently in stochastic optimization. Namely, we allow some additional noise to be incorporated in the reward function, in the form of a hypothesis term and a varying mini-batch size. To follow the theoretical guidelines, however, both are forced to vanish during training so that the noise converges to zero. Our experimental results show that this approach reduces the training time remarkably, by as much as 75%.

1. INTRODUCTION

Reinforcement learning (RL) has been the focus of numerous studies in the last few years. As a result, plenty of new algorithms have emerged and continue to surface that improve on existing RL algorithms or combine traditional methods with modern deep learning techniques, such as deep Q-learning (Mnih et al., 2013), Proximal Policy Optimization (PPO) (Schulman et al., 2017), Asynchronous Advantage Actor-Critic (Mnih et al., 2016b), and Soft Actor-Critic (Haarnoja et al., 2018). Researchers have used these algorithms with great success to solve various challenging problems, ranging from playing board games like Go (Silver et al., 2016; 2017) and Massively Multiplayer Online (MMO) games (Suarez et al., 2019) to solving complex problems in robotics (Plappert et al., 2018). Recent research (Reed et al., 2022) has also shown that a single RL agent can be trained to solve different tasks, indicating the strong generalization capabilities of these types of algorithms. However, one of the biggest problems of these algorithms, especially on-policy methods, still remains: the optimization process, and hence the training of the agent, takes a very long time (Yarats et al., 2021; Yu, 2018). This is especially true for policy-based algorithms, which are notorious for being sample inefficient (Bastani, 2020). Although PPO, itself a policy-based method, has better sample complexity (Schulman et al., 2017) than the original policy gradient algorithm (Mnih et al., 2016a), it still suffers from this phenomenon. Despite this drawback, PPO is widely used because it converges to at least a local optimum, in contrast to value-based methods, where convergence is not necessarily guaranteed, a phenomenon referred to as the "deadly triad" by Sutton & Barto (2018).
In this work, we show that the training time of a PPO agent can be reduced remarkably, while still preserving its convergence and achieving the same cumulative rewards, by treating the loss as an energy function and injecting some noise into the optimization process. In real-world applications it is quite common that the energy function cannot be evaluated precisely because of certain noise corruptions. If the noise is too large, then no meaningful optimization can take place. On the other hand, we can also inject some corruption deliberately to speed up the optimization process via noisy evaluation (Gelfand & Mitter, 1989). The original idea was to perform only a rough evaluation, which can save a lot of computational time if the energy function is complex and its proper evaluation is laborious. It has already been shown that this approach is valid only when the corruption vanishes as the search progresses, so that the convergence behavior of the original (uncorrupted) model is preserved. Specifically, for simulated annealing (SA), Gelfand & Mitter (1989) have shown to what extent the noise should converge to 0 to preserve the search efficiency of the uncorrupted case. In this work, we focus on policy-based approaches and further refine the noisy evaluation idea to integrate it into the RL framework in different ways. The first integration is possible only in applications, including ours, where a batch size is used during the gradient descent or ascent update. This is true for policy gradient algorithms, which measure the goodness of a policy π as the cumulative reward achieved in a set time frame via r(π) = Σ_t r(s_t, a_t, s_{t+1}), summed over the timesteps t ∈ N of the given time frame, and use a gradient estimate to update π. In such cases we can apply the well-known mini-batch approach by including only a part of these timesteps in one update cycle.
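As a minimal sketch of this mini-batch idea (the function names and toy data below are our own illustrations, not the paper's implementation), one update cycle evaluates r(π) over the whole episode but samples only a subset of its timesteps for the gradient estimate:

```python
import numpy as np

def episode_return(rewards):
    # r(pi) = sum_t r(s_t, a_t, s_{t+1}): cumulative reward over the time frame
    return float(np.sum(rewards))

def sample_minibatch(num_timesteps, batch_size, rng):
    # Include only a subset of the episode's timesteps in one update cycle;
    # per-timestep gradient terms would be averaged over these indices.
    return rng.choice(num_timesteps, size=batch_size, replace=False)

rng = np.random.default_rng(seed=0)
rewards = np.ones(100)                        # toy 100-step episode, reward 1 per step
idx = sample_minibatch(len(rewards), 32, rng)
print(episode_return(rewards))                # 100.0
print(len(idx), len(set(idx.tolist())))       # 32 32 (distinct timesteps)
```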
However, in contrast to the traditional mini-batch approach that keeps the batch size constant, we also treat the mini-batch as a source of noise w.r.t. a full-batch evaluation. Consequently, we follow the above requirement that the noise must vanish, which can be achieved here by continuously increasing the batch size during training until it reaches the total number of observations in the episode. Beyond the variable mini-batch size, we also include a hypothesis term h(s′) for every subsequent state s′ in the reward function r(s, a, s′) as direct feedback for our agent. In light of the above summary, this procedure can also be interpreted as adding noise. Our intention with this addition is to speed up the training process and improve sample efficiency, as we believe that an appropriately formulated hypothesis can direct the search toward the optimal location and hence make convergence faster. However, we can never be sure that our hypothesis indeed has this behaviour, since we may be mistaken about it. Moreover, to take the above considerations into account, the noise term should tend to 0 to preserve the convergence behavior of the original RL model. Thus, the hypothesis term is added in such a way that it vanishes as the search progresses. As we will see, both the variable mini-batch and the vanishing hypothesis term are able to speed up the training process; their simultaneous integration, however, improves it further.

The rest of the paper is organized as follows. In section 2 we exhibit in detail how the hypothesis term and the variable mini-batch can be integrated into a reward function considered in policy optimization. Our special application domain, which originally motivated this research, is introduced in section 3. Section 4 then presents the implementation details. In section 5 we show our experimental results, which suggest a meaningful speed-up in finding the solution for our problem. Finally, some conclusions are drawn in section 6.
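Both noise sources can be made to vanish with simple schedules. The concrete shapes below (linear batch-size growth, exponential decay of the hypothesis weight) are illustrative assumptions on our part; the only requirement stated above is that both tend to their noise-free limits:

```python
import math

def batch_size_schedule(step, total_steps, b_min, b_full):
    # Grow the mini-batch linearly from b_min to the full batch b_full,
    # so the sub-sampling noise vanishes by the end of training.
    frac = min(step / total_steps, 1.0)
    return round(b_min + frac * (b_full - b_min))

def hypothesis_weight(step, decay=0.01):
    # Coefficient on the hypothesis term h(s'); it decays to 0 so the
    # injected reward noise converges to zero as the search progresses.
    return math.exp(-decay * step)

def shaped_reward(r, h_next, step, decay=0.01):
    # r(s, a, s') plus the vanishing hypothesis feedback for state s'
    return r + hypothesis_weight(step, decay) * h_next

print(batch_size_schedule(0, 1000, 64, 2048))     # 64
print(batch_size_schedule(1000, 1000, 64, 2048))  # 2048
print(shaped_reward(1.0, 0.5, step=0))            # 1.5
```

Any monotone schedules with the same limits would satisfy the vanishing-noise requirement; these were chosen only for simplicity.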

2. METHODOLOGY

Policy gradient methods like PPO use an estimation of the gradient to update their policy π parameterized by a set of weights θ (π_θ for short) so that the expected cumulative reward E_{π_θ}[Σ_t r_t], where r_t = r(s_t, a_t, s_{t+1}), increases. To this end, they use an estimation similar to the following expression (Schulman et al., 2017) for each timestep t and perform gradient ascent: g = Ê_t[∇_θ log π_θ(a_t | s_t) Â_t], where Â_t is the estimation of the advantage at timestep t. PPO also uses a clipped surrogate objective to prevent large updates during the optimization process. Although this facilitates training to a certain degree, making PPO converge faster than the vanilla policy gradient method (Mnih et al., 2016a) and achieve higher scores, as shown in Schulman et al. (2017), the training still takes a long time. Even more important is the fact that, as can also be observed in the PPO paper, in several environments the agents achieved really low scores for a significant chunk of the total training time before finally

