EMPIRICAL SUFFICIENCY FEATURING REWARD DELAY CALIBRATION

Abstract

Appropriate credit assignment for delayed rewards is a fundamental challenge in many deep reinforcement learning tasks. To tackle this problem, we introduce a delayed reward calibration paradigm inspired by a classification perspective. We hypothesize that when an agent's behavior satisfies an equivalent sufficient condition for receiving a reward, the well-represented state vectors should share similarities. To this end, we define an empirical sufficient distribution, in which the state vectors lead agents to environmental reward signals in subsequent steps. An overfitting classifier is then trained to model this distribution and generate calibrated rewards. We examine the correctness of sufficient state extraction by tracking the extraction in real time and by building hybrid reward functions in environments with different levels of reward latency. The results demonstrate that the classifier generates timely and accurate calibrated rewards, and that these rewards make training more efficient. Finally, we find that the sufficient states extracted by our model resonate with observations of human cognition.

1. INTRODUCTION

Reinforcement learning (RL) approaches have made incredible breakthroughs in various domains (Silver et al., 2016; Mnih et al., 2015; OpenAI et al., 2019; Vinyals et al., 2019), where performance exceeds people's expectations. Reinforcement learning theory models sequential decision tasks as dynamic programming processes that maximize expected accumulated rewards. Given that environmental rewards generally cannot entirely reflect the contribution of each action at every step, existing approaches commit to distributing different credits to individual decisions, known as credit assignment (Sutton & Barto, 1998). Bellman-equation-based architectures calculate the value of a state from rewards gathered in the future, which at times assigns unreasonable values to prior states. This problem becomes even more intractable when reward signals are extremely sparse or severely delayed. In this paper, we formulate an overfitting classification mechanism to extract empirical sufficient conditions for acquiring desired environmental signals. We refer to this extraction formulation as an Empirical Sufficient Condition Extractor (ESCE), which fairly assigns delayed rewards to the corresponding states. In so doing, we first propose a classification mechanism to identify empirical sufficient states. To train a classifier with partially labeled data, we label the state vectors that match environmental signals. We then train the classifier in two phases, wherein a novel overfitting training process is conducted. In addition to existing value-based estimation, the ESCE provides concrete predictions. We equip the ESCE with the Asynchronous Advantage Actor Critic (A3C) algorithm (Mnih et al., 2016) and measure its performance on six Atari games, most of which have delayed discrete rewards. We comprehensively examine the extraction correctness by formulating different reward functions, and further track the accuracy/recall of the ESCE on the fly.
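The labeling-and-classification idea above can be illustrated with a minimal sketch. The paper does not specify its architecture or hyper-parameters, so everything below is an assumption: a state is labeled positive if an environmental reward arrives within a hypothetical window of `k` steps, and a simple logistic classifier (a stand-in for the ESCE) is deliberately trained for many epochs so that only states close to the positive set are flagged and converted into calibrated rewards.

```python
import numpy as np

def label_states(states, env_rewards, k=5):
    """Label a state positive if an environmental reward occurs within
    the next k steps. The window size k is an illustrative assumption."""
    labels = np.zeros(len(states), dtype=int)
    for t in range(len(states)):
        if np.any(np.asarray(env_rewards[t:t + k]) > 0):
            labels[t] = 1
    return labels

class SufficientStateClassifier:
    """Logistic classifier used here as a hypothetical stand-in for the
    ESCE; the paper's actual model and training schedule are not given."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def _prob(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def fit(self, X, y, epochs=500):
        # Many epochs on purpose: training toward overfitting narrows
        # the positive region to states resembling rewarded ones.
        for _ in range(epochs):
            p = self._prob(X)
            self.w -= self.lr * (X.T @ (p - y)) / len(y)
            self.b -= self.lr * np.mean(p - y)

    def calibrated_reward(self, x, threshold=0.9, bonus=1.0):
        """Emit a calibrated reward when the classifier is confident the
        state belongs to the empirical sufficient distribution."""
        return bonus if self._prob(x) > threshold else 0.0
```

In a full pipeline, `calibrated_reward` would supplement the sparse environmental signal at each step, giving the policy-gradient learner (e.g., A3C) denser feedback; the threshold and bonus values here are placeholders.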
The results show that agents guided by our empirical sufficiency achieve significant improvements in convergence, especially in scenarios with delayed rewards. Furthermore, we constructively modify the environments so that rewards are delayed even further, termed the hindsight reward setting. The results show that equally calibrated rewards lead agents to acquire well-learned target policies as if the rewards were not delayed. In addition to quantitative experiments, we visualize screenshots of the identified sufficient states, which show high similarity with human cognition. Our contributions can be summarized as follows:

