CAN AGENTS RUN RELAY RACE WITH STRANGERS? GENERALIZATION OF RL TO OUT-OF-DISTRIBUTION TRAJECTORIES

Abstract

In this paper, we evaluate and improve the generalization performance of reinforcement learning (RL) agents on the set of "controllable" states, i.e., states from which good policies exist to achieve the goal. An RL agent that genuinely masters a task should reach its goal starting from any controllable state of the environment instead of memorizing a small set of trajectories. To practically evaluate this type of generalization, we propose relay evaluation, which starts the test agent from the middle of trajectories generated by other, independently well-trained "stranger" agents. Extensive experiments show that generalization failures on controllable states from stranger agents are prevalent. For example, in the Humanoid environment, a well-trained Proximal Policy Optimization (PPO) agent with only a 3.9% failure rate during regular testing failed on 81.6% of the states generated by well-trained stranger PPO agents. To improve "relay generalization," we propose a novel method called Self-Trajectory Augmentation (STA), which during training resets the environment to the agent's old states, selected according to the Q function. After applying STA to the Soft Actor Critic (SAC) training procedure, we reduced the failure rate of SAC under relay evaluation by more than three times in most settings, without impacting agent performance or increasing the number of environment interactions needed. Our code is available at https://github.com/lan-lc/STA.

1. INTRODUCTION

Generalization is critical for deploying reinforcement learning (RL) agents in real-world applications. A well-trained RL agent that achieves high rewards under restricted settings may be unable to handle the enormous state space and complex environment variations of the real world. There are many aspects to the generalization of RL agents. While many existing works study RL generalization under environment variations between training and testing (Kirk et al., 2021), in this paper we study the generalization problem in a simple, fixed environment under an often overlooked notion of generalization: a well-generalized agent that masters a task should be able to start from any "controllable" state of the environment and still reach the goal. For example, a self-driving system may need to take over control from a human (or from other AIs trained for different objectives such as speed, fuel efficiency, or comfort) in the middle of driving and must continue to drive the car safely. We can make few assumptions about which state the car is in when the take-over happens, so the self-driving agent must learn to drive generally. Although this problem may look easy for simple MDPs (e.g., a 2-D maze), most real RL agents are trained on trajectories generated by their own policies, and it is hard to guarantee the behavior of an agent on all "controllable" states. Roughly speaking, in the setting of robotics (e.g., MuJoCo environments), we can define a state as controllable if there exists at least one policy that leads to a high-reward trajectory or reaches the goal from this state. Unfortunately, most ordinary RL evaluation procedures do not take this into consideration, and agents are often evaluated from a fixed starting point with a very small amount of perturbation (Todorov et al., 2012; Brockman et al., 2016). In fact, even finding these controllable states for evaluation is difficult.
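To make the notion of controllability concrete in the simple 2-D maze case mentioned above, the sketch below (a toy illustration of ours, not from the paper) computes the controllable set of a deterministic maze. Since maze moves are reversible, a cell is controllable exactly when the goal is reachable from it, which a plain breadth-first search from the goal finds.

```python
from collections import deque

def controllable_states(grid, goal):
    """Return the set of cells from which some policy reaches the goal.

    grid: list of strings, '#' = wall, '.' = free cell.
    goal: (row, col) of the goal cell.
    In a deterministic maze, moves are reversible, so BFS outward from
    the goal visits exactly the controllable cells.
    """
    rows, cols = len(grid), len(grid[0])
    seen = {goal}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == '.' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen
```

For a maze with a wall column sealing off the right side, the free cells behind the wall are not controllable even though they are valid states, which is exactly the distinction the definition above draws. In continuous robotics environments no such exhaustive search exists, which is why the evaluation procedure of the next paragraph is needed.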
The first contribution of this work is to propose relay evaluation, a proxy for evaluating an agent's generalization performance on controllable states in a fixed environment. Relay evaluation runs the test agent from middle states of other independently trained agents' high-reward trajectories. This is similar to running a relay race with a stranger agent: the stranger agent controls the robot first, and the test agent takes the reins later. It naturally yields a diverse set of controllable states for the test agent, because the sampled states come from high-reward trajectories of well-trained agents, and the stranger agents can be trained with a variety of RL algorithms, not just the one used for the test agent. Our extensive experiments on 160 agents, trained in four environments using four algorithms, show that many representative RL algorithms have unexpectedly high failure rates under relay evaluation. For example, in the Humanoid environment, Proximal Policy Optimization (PPO) agents and Soft Actor Critic (SAC) agents have average failure rates of 81.6% and 38.0% under relay evaluation even when the stranger agents were trained with the same algorithm (under different random seeds), which is surprisingly high compared to the original failure rates (PPO: 3.9%, SAC: 0.92%). Failing in this setting suggests that the agents do not genuinely understand the dynamics of the environment or learn general skills such as balancing the robot to avoid falling, but rather memorize a small set of actions for the limited set of states they encountered. In Figure 1a, we show a t-SNE plot of states from the trajectories of 6 agents trained with 3 algorithms and observe that, even when trained with the same algorithm, the states generated by different agents are quite distinct. Figure 1b shows that the SAC 1 agent only has a low failure rate on its own states, which correspond to the blue dots in Figure 1a.
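The evaluation loop described above can be sketched as follows. This is our own minimal rendering, not the paper's implementation: the function name, the `reset_to` interface, and the toy `LineEnv` are all illustrative, and in the MuJoCo setting `reset_to` would restore the full simulator state (positions and velocities) rather than an observation.

```python
import random

def relay_failure_rate(test_policy, env, stranger_trajectories,
                       num_trials=100, horizon=50, seed=0):
    """Estimate the relay-evaluation failure rate of `test_policy`.

    Each trial samples a state from the middle of a stranger agent's
    high-reward trajectory, resets the environment to it, and lets the
    test agent take over; a trial fails if the episode terminates
    before the horizon (e.g., the humanoid fell over).
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(num_trials):
        traj = rng.choice(stranger_trajectories)
        # Take over somewhere in the middle of the stranger's run.
        state = traj[rng.randrange(len(traj) // 4, 3 * len(traj) // 4)]
        obs = env.reset_to(state)
        for _ in range(horizon):
            obs, _, done = env.step(test_policy(obs))
            if done:
                failures += 1
                break
    return failures / num_trials

class LineEnv:
    """Toy 1-D environment: the agent fails if it drifts past +/-10."""
    def reset_to(self, state):
        self.pos = state
        return self.pos
    def step(self, action):
        self.pos += action
        done = abs(self.pos) > 10
        return self.pos, (0.0 if done else 1.0), done

# A policy that regulates toward 0 survives any controllable start;
# a policy that always moves right fails from every middle state.
stay = lambda obs: -1 if obs > 0 else (1 if obs < 0 else 0)
drift = lambda obs: 1
strangers = [list(range(8)), [-i for i in range(8)]]
```

The two toy policies make the point of relay evaluation visible: both behave acceptably from the default start state 0, but only the regulating policy handles the full set of controllable take-over states.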
Our second contribution is a novel training method called Self-Trajectory Augmentation (STA) that significantly improves agents' generalization without significantly increasing training costs. We first conduct a motivating experiment that augments the initial state set during training with states generated by a set of pretrained stranger agents, and we find that as the number of stranger agents increases, the relay evaluation performance of the agent improves significantly. However, pretraining additional models is time-consuming and may be impractical in complex environments. We therefore propose STA, which randomly resets the environment to states from the agent's own old trajectories. Since the distribution of visited states often varies during training, revisiting an agent's old trajectories can be beneficial for generalization. After applying STA to the standard SAC training procedure, the failure rates of SAC agents are reduced by more than three times in most settings without sacrificing performance under ordinary evaluation, and with minimal impact on convergence speed. As Figure 1c shows, our STA agent succeeds uniformly on most states from the 6 stranger agents.
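A minimal sketch of the STA idea as an environment wrapper is given below. This is our simplified reading, not the paper's implementation: the class name, interface, and the exact Q-based selection rule (lowest Q estimate among a few sampled candidates) are all illustrative assumptions.

```python
import random

class STAResetWrapper:
    """Sketch of Self-Trajectory Augmentation as a reset wrapper.

    Visited states are stored in a buffer. On reset, with probability
    `p` the environment is restored to a stored state instead of the
    usual initial state; among a few sampled candidates we pick the
    one with the lowest Q estimate, i.e., a state the agent is still
    expected to handle poorly.
    """
    def __init__(self, env, q_fn, p=0.5, buffer_size=10_000,
                 num_candidates=8, seed=0):
        self.env, self.q_fn, self.p = env, q_fn, p
        self.buffer, self.buffer_size = [], buffer_size
        self.num_candidates = num_candidates
        self.rng = random.Random(seed)

    def reset(self):
        if self.buffer and self.rng.random() < self.p:
            candidates = self.rng.sample(
                self.buffer, min(self.num_candidates, len(self.buffer)))
            state = min(candidates, key=self.q_fn)  # hardest candidate
            return self.env.reset_to(state)
        return self.env.reset()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        if len(self.buffer) >= self.buffer_size:  # evict at random when full
            self.buffer.pop(self.rng.randrange(len(self.buffer)))
        self.buffer.append(obs)
        return obs, reward, done

class ToyEnv:
    """Minimal environment used only to exercise the wrapper."""
    def reset(self):
        self.s = 0
        return self.s
    def reset_to(self, s):
        self.s = s
        return self.s
    def step(self, a):
        self.s += a
        return self.s, 1.0, False
```

Note that the toy wrapper stores observations; in MuJoCo one would store the full simulator state so the environment can actually be restored. Because the restored states come from the agent's own history, no extra environment interactions are spent generating them, consistent with the cost claim above.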

2. RELAY EVALUATION

In this section, we first define relay evaluation in Sec. 2.1. We then conduct extensive experiments evaluating the relay generalization of representative RL algorithms in Sec. 2.3.



Figure 1: t-SNE of states from trajectories of 6 Humanoid agents. (a) The states of 6 agents are almost non-overlapping even for the ones trained by the same algorithm; (b) SAC 1 agent in (a) performs badly on controllable states from other stranger agents, indicating that it may not learn to control the robot generally; (c) our STA agent performs uniformly well when starting from the same set of states.

