CAN AGENTS RUN A RELAY RACE WITH STRANGERS? GENERALIZATION OF RL TO OUT-OF-DISTRIBUTION TRAJECTORIES

Abstract

In this paper, we evaluate and improve the generalization of reinforcement learning (RL) agents on the set of "controllable" states, i.e., states from which at least one good policy can still achieve the goal. An RL agent that truly masters a task should reach its goal starting from any controllable state of the environment instead of memorizing a small set of trajectories. To evaluate this type of generalization in practice, we propose relay evaluation, which starts the test agent from the middle of trajectories generated by other, independently well-trained "stranger" agents. With extensive experimental evaluation, we show that generalization failures on controllable states from stranger agents are prevalent. For example, in the Humanoid environment, we observed that a well-trained Proximal Policy Optimization (PPO) agent with only a 3.9% failure rate under regular testing failed on 81.6% of the states generated by well-trained stranger PPO agents. To improve "relay generalization," we propose a novel method called Self-Trajectory Augmentation (STA), which resets the environment during training to the agent's previously visited states, selected according to the Q function. After applying STA to the training procedure of Soft Actor-Critic (SAC), we reduced the failure rate of SAC under relay evaluation by more than three times in most settings, without degrading agent performance or increasing the required number of environment interactions. Our code is available at https://github.com/lan-lc/STA.

1. INTRODUCTION

Generalization is critical for deploying reinforcement learning (RL) agents in real-world applications. A well-trained RL agent that achieves high rewards under restricted settings may not be able to handle the enormous state space and complex environment variations of the real world. RL generalization has many different aspects. While many existing works study generalization under environment variations between training and testing (Kirk et al., 2021), in this paper we study the generalization problem in a simple and fixed environment, under an often overlooked notion of generalization: a well-generalized agent that masters a task should be able to start from any "controllable" state of the environment and still reach the goal. For example, a self-driving system may need to take over control from humans (or from other AIs trained for different objectives, such as speed, fuel efficiency, or comfort) in the middle of driving, and it must continue to drive the car safely. We can make few assumptions about which state the car is in when the take-over happens, so the self-driving agent must learn to drive generally. Although this problem may seem easy for simple MDPs (e.g., a 2-D maze), most real RL agents are trained on trajectories generated by their own policies, and it is hard to guarantee their behavior on all controllable states. Roughly speaking, in the setting of robotics (e.g., MuJoCo environments), we can define a state as controllable if there exists at least one policy that can produce a high-reward trajectory or reach the goal from this state. Unfortunately, most ordinary RL evaluation procedures do not take this into consideration: agents are often evaluated from a fixed starting point with only a very small amount of perturbation (Todorov et al., 2012; Brockman et al., 2016). In fact, even finding these controllable states for evaluation is difficult.
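The relay evaluation protocol described above can be made concrete with a short sketch. The code below is a minimal, self-contained illustration and not the paper's implementation: `ToyLineEnv`, `rollout`, and `relay_evaluate` are hypothetical stand-ins for a MuJoCo environment (whose internal state can be restored, e.g., via a `set_state` method) and for independently trained stranger and test policies.

```python
class ToyLineEnv:
    """Hypothetical 1-D stand-in for an RL environment: the agent moves
    along a line and succeeds by reaching position >= GOAL."""
    GOAL = 10.0

    def reset(self):
        self.pos = 0.0
        return self.pos

    def set_state(self, state):
        # Relay evaluation requires restoring arbitrary mid-trajectory states.
        self.pos = state

    def step(self, action):
        self.pos += action
        done = abs(self.pos) >= self.GOAL  # episode ends at either boundary
        return self.pos, done

    def reached_goal(self, state):
        return state >= self.GOAL


def rollout(env, policy, horizon):
    """Run `policy` from the environment's initial state, recording the
    visited states (the stranger agent's trajectory)."""
    state = env.reset()
    states = [state]
    for _ in range(horizon):
        state, done = env.step(policy(state))
        states.append(state)
        if done:
            break
    return states


def relay_evaluate(env, test_policy, stranger_states, handover_step, horizon):
    """Hand the test agent a state from the middle of a stranger agent's
    trajectory and report whether it still reaches the goal."""
    state = stranger_states[handover_step]
    env.set_state(state)  # take over mid-trajectory
    for _ in range(horizon):
        state, done = env.step(test_policy(state))
        if env.reached_goal(state):
            return True   # relay success
        if done:
            return False  # relay failure (e.g., the humanoid fell)
    return False
```

A test agent that generalizes well succeeds from the handed-over state; one that merely memorized its own trajectories may fail even though the state is controllable.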

