MAXIMUM REWARD FORMULATION IN REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, propose a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of synthesizable molecule generation that mimics a real-world drug discovery pipeline.

1. INTRODUCTION

Reinforcement learning (RL) algorithms typically try to maximize the cumulative finite-horizon undiscounted return, R(\tau) = \sum_{t=0}^{T} r_t, or the infinite-horizon discounted return, R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t. Here r_t is the reward obtained at time step t, \gamma is the discount factor in the range [0, 1), and \tau is the agent's trajectory. \tau consists of actions a sampled from the policy \pi(\cdot \mid s) and states s' sampled from the transition probability function P(s' \mid s, a) of the underlying Markov Decision Process (MDP). The action-value function Q^\pi(s, a) for a policy \pi is given by

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid (s_0, a_0) = (s, a) \right].

The corresponding Bellman equation for Q^\pi(s, a), with the return defined as R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, is

Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t), \, a_{t+1} \sim \pi(\cdot \mid s_{t+1})}\left[ r(s_t, a_t) + \gamma Q^\pi(s_{t+1}, a_{t+1}) \right].

This Bellman equation has formed the foundation of RL. However, we argue that optimizing for only the maximum reward achieved in an episode is also an important goal. Reformulating the RL problem to achieve the largest reward in an episode is the focus of this paper, along with empirical demonstrations in one toy and one real-world domain.

In the de novo drug design pipeline, molecule generation tries to maximize a given reward function. Existing methods optimize either for the expected cumulative return or for the reward at the end of the episode, and thus fail to optimize for the very high-reward molecules that may be encountered in the middle of an episode. This limits the potential of several of these reinforcement learning based drug design algorithms.
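The gap between the standard objectives and the maximum-reward objective can be seen on a single trajectory. The sketch below uses a hypothetical reward sequence of our own choosing, not one from the paper's experiments:

```python
# Contrast the three objectives on one trajectory of rewards.
# The reward sequence below is purely illustrative.
rewards = [0.1, 0.9, 0.3, 0.2]   # note: the best reward occurs mid-episode
gamma = 0.9

# Finite-horizon undiscounted return: R(tau) = sum_t r_t
undiscounted = sum(rewards)

# Discounted return: R(tau) = sum_t gamma^t * r_t
discounted = sum(gamma ** t * r for t, r in enumerate(rewards))

# The objective of this paper: the maximum reward along the trajectory
max_reward = max(rewards)

print(undiscounted, discounted, max_reward)  # 1.5, ~1.2988, 0.9
```

A policy judged by either cumulative return can trade the 0.9 mid-episode peak for many small rewards, whereas the maximum-reward objective values the trajectory solely by that peak.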
We overcome this limitation by proposing a novel functional formulation of the Bellman equation:

Q^\pi_{\max}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t), \, a_{t+1} \sim \pi(\cdot \mid s_{t+1})}\left[ \max\left( r(s_t, a_t), \gamma Q^\pi_{\max}(s_{t+1}, a_{t+1}) \right) \right].

Other use cases of this formulation (i.e., situations where the single best reward found, rather than the total reward, is what matters) include symbolic regression (Petersen, 2020; Udrescu & Tegmark, 2020), which seeks the single best model; active localization (Chaplot et al., 2018), which must find the robot's single most likely pose; green chemistry (Koch et al., 2019), which aims to identify the single best product formulation; and other domains that use RL for generative purposes.

This paper's contributions are to:
• Derive a novel functional form of the Bellman equation, called max-Bellman, to optimize for the maximum reward in an episode.
• Introduce the corresponding evaluation and optimality operators, and prove the convergence of Q-learning with the max-Bellman formulation.
• Test on a toy environment and draw further insights through a comparison between Q-learning and Q-learning with our max-Bellman formulation.
• Use the max-Bellman formulation to generate synthesizable molecules in an environment that mimics the real drug discovery pipeline, and demonstrate significant improvements over existing state-of-the-art methods.
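In tabular form, the max-Bellman equation yields a Q-learning update whose target replaces the usual sum r + \gamma \max_{a'} Q(s', a') with \max(r, \gamma \max_{a'} Q(s', a')). The following is a minimal sketch of that update on a hypothetical 5-state chain MDP of our own construction (not the toy environment or implementation from the paper):

```python
import random

# Tabular Q-learning with the max-Bellman update,
#   Q(s,a) <- Q(s,a) + alpha * (max(r, gamma * max_a' Q(s',a')) - Q(s,a)),
# on a hypothetical 5-state chain. The largest reward (1.0) sits
# mid-trajectory at state 2; a smaller reward (0.2) sits at the terminal.

N_STATES = 5          # states 0..4; the episode ends on reaching state 4
ACTIONS = [0, 1]      # 0 = stay, 1 = move right
GAMMA = 0.95
ALPHA = 0.1

def step(s, a):
    """Deterministic illustrative transition and reward."""
    s_next = min(s + a, N_STATES - 1)
    r = 1.0 if s_next == 2 else (0.2 if s_next == N_STATES - 1 else 0.0)
    return s_next, r, s_next == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)   # fully random behavior policy
        s_next, r, done = step(s, a)
        # max-Bellman target: reward now vs. discounted best future value
        target = max(r, GAMMA * max(Q[(s_next, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next

# The start-state value reflects the single best reachable reward (1.0 at
# state 2), discounted by the steps needed to reach it, rather than any
# cumulative sum of rewards.
print(max(Q[(0, a)] for a in ACTIONS))
```

Note that under the standard Bellman target the start-state value would accumulate both the 1.0 and 0.2 rewards; under max-Bellman it converges to \gamma \cdot 1.0 = 0.95, driven only by the best reward along the trajectory.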

2. RELATED WORK

This section briefly introduces fundamental RL concepts and the paper's main application domain.

2.1. REINFORCEMENT LEARNING

Bellman's dynamic programming paper (Bellman, 1954) introduced the notions of optimality and convergence of functional equations, which have since been applied in many domains, from control theory to economics. The concept of an MDP was proposed in the book Dynamic Programming and Markov Processes (Howard, 1960), although some variants of this formulation already existed in the 1950s. These two concepts, the Bellman equation and the MDP, are the foundations of modern RL.

Q-learning was formally introduced by Watkins & Dayan (1992), and different convergence guarantees were further developed by Jaakkola et al. (1993) and Szepesvári (1997). Convergence of Q-learning to the optimal Q-value (Q*) has been proved under several important assumptions. One fundamental assumption is that the environment has finite (and discrete) state and action spaces and that each state and action can be visited infinitely often. The second important assumption concerns the learning rates: their sum over infinitely many updates must diverge to infinity, while the sum of their squares must remain finite (Tsitsiklis, 1994; Kamihigashi & Le Van, 2015). Under similar sets of assumptions, the on-policy version of Q-learning, known as Sarsa, has also been proven to converge to the optimal Q-value in the limit (Singh et al., 2000).

Recently, RL algorithms have seen large empirical successes as neural networks started being used as function approximators (Mnih et al., 2016). Tabular methods cannot be applied to large state and action spaces, as these methods are linear in the size of the state space and polynomial in the size of the action space in both time and memory. Deep reinforcement learning (DRL) methods, on the other hand, can approximate the Q-function or the policy using neural networks, parameterized by the weights of the corresponding networks.
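The two learning-rate conditions can be checked numerically for the classic schedule \alpha_t = 1/t, which satisfies both. This small sketch is ours, not part of the cited convergence analyses:

```python
# Numeric illustration of the learning-rate conditions for Q-learning
# convergence: sum_t alpha_t = infinity while sum_t alpha_t^2 < infinity.
# The schedule alpha_t = 1/t satisfies both.

def partial_sums(T):
    """Return (sum of alpha_t, sum of alpha_t^2) for t = 1..T with alpha_t = 1/t."""
    s, s_sq = 0.0, 0.0
    for t in range(1, T + 1):
        alpha = 1.0 / t
        s += alpha
        s_sq += alpha ** 2
    return s, s_sq

# The plain sum keeps growing without bound (like log T), while the sum
# of squares approaches pi^2 / 6 ~= 1.645.
for T in (10**2, 10**4, 10**6):
    print(T, partial_sums(T))
```

A constant learning rate violates the second condition (its squares also sum to infinity), while a schedule that decays too fast, such as \alpha_t = 1/t^2, violates the first; 1/t sits between the two.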
In this case, RL algorithms generalize easily across states, which improves the learning speed (time complexity) and sample efficiency of the algorithm. Some popular deep RL algorithms include DQN (Mnih et al., 2015) and PPO (Schulman et al., 2017). While the effectiveness of the molecules generated using these approaches has been demonstrated on standard benchmarks such as Guacamol (Brown et al., 2019), the issue of synthesizability remains a problem. All the above approaches generate molecules that optimize a given reward function, but they do not account for whether the molecules can actually be synthesized effectively, an important practical consideration. Gao & Coley (2020) further highlighted this issue of synthesizability by using a synthesis planning program to quantify how often the molecules generated by these existing approaches can be readily synthesized. To attempt to solve this issue, Bradshaw et al. (2019) used a

