MAXIMUM REWARD FORMULATION IN REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, propose a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of synthesizable molecule generation that mimics a real-world drug discovery pipeline.

1. INTRODUCTION

Reinforcement learning (RL) algorithms typically maximize the cumulative finite-horizon undiscounted return, $R(\tau) = \sum_{t=0}^{T} r_t$, or the infinite-horizon discounted return, $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$. Here $r_t$ is the reward obtained at time step $t$, $\gamma \in [0, 1)$ is the discount factor, and $\tau$ is the agent's trajectory. A trajectory consists of actions $a$ sampled from the policy $\pi(\cdot \mid s)$ and states $s'$ sampled from the probability transition function $P(s' \mid s, a)$ of the underlying Markov Decision Process (MDP). The action-value function $Q^{\pi}(s, a)$ for a policy $\pi$ is given by

$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid (s_0, a_0) = (s, a) \right].$

The corresponding Bellman equation for $Q^{\pi}(s, a)$, with the return defined as $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, is

$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi(\cdot \mid s_{t+1})}\left[ r(s_t, a_t) + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \right].$

This Bellman equation has formed the foundation of RL. However, we argue that optimizing only for the maximum reward achieved in an episode is also an important goal. Reformulating the RL problem to achieve the largest reward in an episode is the focus of this paper, along with empirical demonstrations in one toy and one real-world domain. In the de novo drug design pipeline, molecule generation aims to maximize a given reward function. Existing methods optimize either for the expected cumulative return or for the reward at the end of the episode, and thus fail to optimize for the very-high-reward molecules that may be encountered in the middle of an episode. This limits the potential of several of these reinforcement-learning-based drug design algorithms.
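For reference, the standard Bellman equation above can be iterated as a tabular fixed-point computation. The following is a minimal sketch on a hypothetical two-state, two-action MDP with a uniform policy; the transition table, rewards, and discount factor are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Hypothetical deterministic MDP (illustrative only): T[s, a] gives the next
# state, r[s, a] the reward, pi[s, a] the policy probability pi(a | s).
T = np.array([[0, 1],
              [1, 0]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])  # uniform policy
gamma = 0.9

# Iterate the standard Bellman backup:
#   Q(s, a) <- r(s, a) + gamma * E_{a' ~ pi}[ Q(s', a') ]
Q = np.zeros((2, 2))
for _ in range(1000):
    V_next = (pi * Q).sum(axis=1)   # V(s') = sum_{a'} pi(a'|s') Q(s', a')
    Q = r + gamma * V_next[T]       # V_next[T] broadcasts V(s') to each (s, a)
```

Because the backup is a $\gamma$-contraction, the iterates converge to the unique fixed point $Q^{\pi}$ regardless of initialization.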
We overcome this limitation by proposing a novel functional form of the Bellman equation:

$Q^{\pi}_{\max}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi(\cdot \mid s_{t+1})}\left[ \max\left( r(s_t, a_t), \gamma Q^{\pi}_{\max}(s_{t+1}, a_{t+1}) \right) \right].$

Other use cases of this formulation, i.e., situations where the single best reward found, rather than the total reward, is what matters, include symbolic regression (Petersen, 2020; Udrescu & Tegmark, 2020), which seeks the single best model; active localization (Chaplot et al., 2018), which must find the robot's single most likely pose; green chemistry (Koch et al., 2019), which aims to identify the single best product formulation; and other domains that use RL for generative purposes.
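The max-reward backup can be iterated in tabular form exactly like the standard one, with the sum $r + \gamma Q$ replaced by $\max(r, \gamma Q)$ inside the expectation over the next action. A minimal sketch on the same kind of hypothetical two-state MDP (the MDP, uniform policy, and all numbers are illustrative assumptions, not from the paper):

```python
import numpy as np

# Hypothetical deterministic MDP (illustrative only), as before.
T = np.array([[0, 1],
              [1, 0]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])  # uniform policy
gamma = 0.9

# Iterate the max-reward Bellman backup:
#   Q_max(s, a) <- E_{a' ~ pi}[ max( r(s, a), gamma * Q_max(s', a') ) ]
# Note the max sits inside the expectation over a', so it cannot be pulled
# out as max(r, gamma * V(s')).
Qmax = np.zeros((2, 2))
for _ in range(1000):
    # inner[s, a, a'] = max( r(s, a), gamma * Q_max(T[s, a], a') )
    inner = np.maximum(r[:, :, None], gamma * Qmax[T])
    Qmax = (inner * pi[T]).sum(axis=-1)   # pi[T][s, a, a'] = pi(a' | s')
```

Since $\max(\cdot, \cdot)$ is non-expansive, this operator is still a $\gamma$-contraction and the iteration converges; at the fixed point, $Q^{\pi}_{\max}(s, a) \geq r(s, a)$ for every state-action pair, reflecting that the best reward along the trajectory is at least the immediate one.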

