EMPIRICAL SUFFICIENCY FEATURING REWARD DELAY CALIBRATION

Abstract

Appropriate credit assignment for delayed rewards is a fundamental challenge in many deep reinforcement learning tasks. To tackle this problem, we introduce a delayed-reward calibration paradigm inspired by a classification perspective. We hypothesize that when an agent's behavior satisfies an equivalent sufficient condition for being rewarded, well-represented state vectors should share similarities. To this end, we define an empirical sufficient distribution, where the state vectors within the distribution lead agents to environmental reward signals in subsequent steps. An overfitting classifier is then established to model this distribution and generate calibrated rewards. We examine the correctness of sufficient-state extraction by tracking the extraction in real time and by building hybrid reward functions in environments with different levels of reward latency. The results demonstrate that the classifier generates timely and accurate calibrated rewards, and that these rewards make training more efficient. Finally, we find that the sufficient states extracted by our model resonate with observations of human cognition.

1. INTRODUCTION

Reinforcement learning (RL) approaches have made incredible breakthroughs in various domains (Silver et al., 2016; Mnih et al., 2015; OpenAI et al., 2019; Vinyals et al., 2019), where performance exceeds people's expectations. The theoretical basis of reinforcement learning models sequential decision tasks as dynamic programming processes that maximize expected accumulated rewards. Given that environmental rewards generally cannot entirely reflect the contribution of each action at a step, existing approaches commit to distributing different credits to individual decisions, known as credit assignment (Sutton & Barto, 1998). Bellman equation-based architectures calculate the value of a state from the rewards gathered in the future, which at times assigns an unreasonable value to prior states. This problem becomes even more intractable when reward signals are extremely sparse or severely delayed. In this paper, we formulate an overfitting classification mechanism to extract empirical sufficient conditions for acquiring desired environmental signals. We refer to this extraction formulation as an Empirical Sufficient Condition Extractor (ESCE), which fairly assigns delayed rewards to the corresponding states. To do so, we first propose a classification mechanism to identify empirical sufficient states. To train a classifier with partially labeled data, we label the state vectors with matched environmental signals. We then train the classifier in two phases, wherein a novel overfitting training process is conducted. In addition to existing value-based estimation, the ESCE provides concrete predictions. We equip the ESCE with the Asynchronous Advantage Actor-Critic (A3C) algorithm (Mnih et al., 2016) and measure performance on six Atari games, most of which have delayed discrete rewards. We comprehensively examine the extraction correctness by formulating different reward functions, and further track the accuracy/recall of the ESCE on the fly.
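The labeling step described above, where state vectors are matched with environmental signals to obtain partially labeled training data, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `label_states` and the `horizon` parameter (how soon after a state a reward must arrive for the state to count as a positive candidate) are assumptions.

```python
import numpy as np

def label_states(states, env_rewards, horizon=1):
    """Label each state 1 (candidate 'empirically sufficient' state) if an
    environmental reward arrives within `horizon` steps, else 0.

    states:      array of shape (T, d) of state vectors (unused here,
                 kept to mirror the (state, label) pairing for training)
    env_rewards: array of shape (T,) of environmental reward signals
    """
    T = len(env_rewards)
    labels = np.zeros(T, dtype=np.int64)
    for t in range(T):
        # a state is a positive candidate when a reward signal is
        # observed within the lookahead window starting at t
        if env_rewards[t : t + horizon].sum() > 0:
            labels[t] = 1
    return labels
```

The resulting `(state, label)` pairs would then feed the two-phase classifier training; unlabeled states simply keep the default label 0.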
The results show that agents guided by our empirical sufficiency achieve significant improvements in convergence, especially in scenarios with delayed rewards. Furthermore, we constructively modify the environment to render rewards even more delayed, termed the hindsight reward setting. The results show that equal calibrated rewards lead agents to acquire well-learned target policies as if the rewards were not delayed. In addition to quantitative experiments, we screenshot the identified sufficient states, which show high similarity with human cognition. Our contributions can be summarized as follows:
• We introduce an overfitting classification model to extract empirical sufficient conditions, where the overfitting mechanism significantly reduces uncertainty.
• We formulate a calibrated reward signal in line with the environmental targets to tackle reward delay issues in reinforcement learning, such that rewards are provided when empirical sufficient conditions are satisfied.
• The experimental results show that reward-calibrated agents are able to learn decent target policies in scenarios where rewards are severely delayed. The identified sufficient conditions empirically resonate with the "true" environment targets.

2. RELATED WORK

2.1. INTRINSIC MOTIVATION

Intrinsic rewards (Singh et al., 2004; Ryan & Deci, 2000), inspired by intrinsic motivation, are primarily introduced to encourage exploration. This reward mechanism is largely independent of environmental rewards. Intrinsic rewards in the exploration-oriented mechanism are generally correlated with the novelty or informativeness of newly arrived states (Pathak et al., 2017; Burda et al., 2019; Houthooft et al., 2016; Zhang et al., 2019). Because the awarding mechanism does not depend on environmental rewards, in exchange, the policy may not align with the environmental target. In addition to exploration, intrinsic rewards can often be found in hierarchical frameworks (Kulkarni et al., 2016; Vezhnevets et al., 2017; Frans et al., 2018). Moreover, intrinsic rewards are also used to help agents more directly learn optimal or near-optimal policies (Wang et al., 2020; Zheng et al., 2018; 2019). Following this branch, the ESCE developed in this paper generates empirical intrinsic rewards to learn better policies, without encouraging exploration.

2.2. CREDIT ASSIGNMENT FOR DELAYED REWARDS

Most evaluation mechanisms in reinforcement learning rely on the Bellman equation, where environmental signals are passed across states in sequences (Lee et al., 2019; Arjona-Medina et al., 2019; Ng et al., 1999; Marom & Rosman, 2018). To make training more efficient, one effective direction is to build an extra mechanism that captures critical states and emphatically regresses on these states (Sutton et al., 2016; Ke et al., 2018; Hung et al., 2018). Notably, Irpan et al. (2019) introduce binary classification into evaluation, and positive-unlabeled learning (Kiryo et al., 2017) is adopted to distinguish promising and catastrophic states. Our work evaluates states by discriminating them in a simple binary classification problem without relying on the Bellman equation. Specifically, we accurately differentiate states between "sufficient for success" and "insufficient for success" by developing a new overfitting classifier (see Sections 3.3 and 3.4).
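As an illustration of the "sufficient for success" vs. "insufficient for success" discrimination, the sketch below trains a toy logistic classifier until it fits the labeled states and then flags states only under high confidence. This is a hypothetical stand-in: the names `train_overfit_classifier` and `predict_sufficient`, the logistic model, and the confidence threshold are assumptions, not the paper's ESCE architecture or its two-phase procedure.

```python
import numpy as np

def train_overfit_classifier(X, y, lr=0.5, epochs=2000):
    """Fit a logistic classifier by gradient descent on binary
    cross-entropy, run long enough to (over)fit the labeled states.
    X: (n, d) state vectors; y: (n,) labels in {0, 1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                            # dBCE/dlogit per sample
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_sufficient(X, w, b, threshold=0.9):
    """Flag a state as empirically sufficient only when the classifier
    is highly confident (probability above `threshold`)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return p > threshold
```

In this toy setting, deliberately fitting the labeled data tightly plays the role the paper assigns to overfitting: reducing uncertainty about which states belong to the empirical sufficient distribution.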

3.1. LEARNING WITH HYBRID REWARD FUNCTIONS

As an independent module, the proposed Empirical Sufficient Condition Extractor (ESCE) can be incorporated into multiple mainstream reinforcement learning frameworks. Calibrated rewards are provided by the ESCE when a state meets the empirical sufficient condition. We denote π(s_t; θ_P) as the learned policy, where θ_P is the set of parameters of the policy network; r_t^c is the calibrated reward generated by the ESCE and r_t^e is the environmental reward, at time step t. Both reward signals have the same scale. The total reward is synthesized from the calibrated and environmental signals, r_t = α r_t^c + β r_t^e, where α and β are the weight coefficients of the corresponding rewards. Our baseline, optimized by environmental rewards alone, has coefficients α = 0 and β = 1. The policy network is optimized to maximize the expected accumulated reward, as shown below:

max_{θ_P} E_{π(s_t; θ_P)} [ Σ_t r_t ].    (1)
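The hybrid reward above can be sketched directly; a minimal illustration, where the function names `hybrid_rewards` and `episode_return` are hypothetical and the per-step combination follows r_t = α r_t^c + β r_t^e:

```python
def hybrid_rewards(calibrated, environmental, alpha=0.5, beta=1.0):
    """Per-step total reward r_t = alpha * r^c_t + beta * r^e_t.
    alpha=0, beta=1 recovers the environmental-reward-only baseline."""
    return [alpha * rc + beta * re for rc, re in zip(calibrated, environmental)]

def episode_return(rewards, gamma=1.0):
    """Accumulated (optionally discounted) return, the quantity the
    policy network is optimized to maximize in expectation (Eq. 1)."""
    g = 0.0
    for r in reversed(rewards):  # backward pass: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g
```

For example, with calibrated signals [1, 0, 1], environmental signals [0, 0, 1], and α = 0.5, β = 1, the per-step totals are [0.5, 0, 1.5].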

