DOES ZERO-SHOT REINFORCEMENT LEARNING EXIST?

Abstract

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can at best solve families of related tasks, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and systematically test the viability of zero-shot RL schemes on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrices, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and most consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.

1. INTRODUCTION

For breadth of applications, reinforcement learning (RL) lags behind other fields of machine learning, such as vision or natural language processing, which have effectively adapted to a wide range of tasks, often in an almost zero-shot manner, using pretraining on large, unlabelled datasets (Brown et al., 2020). The RL paradigm itself may be partly to blame: RL agents are usually trained for a single reward function or a small family of related rewards. Instead, we would like to train "controllable" agents that can be given a description of any task (reward function) in their environment, and then immediately know what to do, reacting instantly to such commands as "fetch this object while avoiding that area".

The promise of zero-shot RL is to train without rewards or tasks, yet immediately perform well on any reward function given at test time, with no extra training, planning, or finetuning, and only a minimal amount of extra computation to process a task description (Section 2 gives the precise definition we use for zero-shot RL).

How far away are such zero-shot agents? In the RL paradigm, a new task (reward function) means re-training the agent from scratch, with many reward samples. Model-based RL trains a reward-free, task-independent world model, but still requires heavy planning when a new reward function is specified (e.g., Chua et al., 2018; Moerland et al., 2020). Model-free RL is reward-centric from the start, and produces specialized agents. Multi-task agents generalize only within a family of related tasks. Reward-free, unsupervised skill pre-training (e.g., Eysenbach et al., 2018) still requires substantial downstream task adaptation, such as training a hierarchical controller.

Is zero-shot RL possible? If one ignores practicality, zero-shot RL is easy: make a list of all possible rewards up to precision 𝜀, then pre-learn all the associated optimal policies.
Scalable zero-shot RL must somehow exploit the relationships between the policies for all tasks. Learning to go from 𝑎 to 𝑐 is not independent of going from 𝑎 to 𝑏 and from 𝑏 to 𝑐, and this produces rich, exploitable algebraic relationships (Blier et al., 2021; Schaul et al., 2015). Yet successor features (SFs) depend heavily on a choice of basic state features; to obtain a full zero-shot RL algorithm, a representation learning method must provide those. While SFs have been successfully applied to transfer between tasks, most of the time the basic features were handcrafted or learned using prior knowledge of the task class. Meanwhile, FB is a standalone method with no task prior and good theoretical backing, but testing has been limited to goal-reaching in a few environments. Here:

• We systematically assess SFs and FB for zero-shot RL, including many new models of SF basic features, and improved FB loss functions. We use 13 tasks from the Unsupervised RL benchmark (Laskin et al., 2021), repeated on several ExORL training replay buffers (Yarats et al., 2021) to assess robustness to the exploration method.

• We systematically study the influence of basic features for SFs, by testing SFs on features from ten RL representation learning methods, such as latent next-state prediction, inverse curiosity module, contrastive learning, diversity, and various spectral decompositions.

• We expose new mathematical links between SFs, FB, and other representations in RL.

• We discuss the implicit assumptions and limitations behind zero-shot RL approaches.
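To make concrete why successor features enable zero-shot evaluation, the toy sketch below computes tabular SFs 𝜓 for a fixed policy in a small random MDP, then evaluates an arbitrary new reward instantly: regress the reward on the features 𝜑 to get a task vector 𝑤, and read off 𝑄 = 𝜓𝑤. The random MDP, features, and policy are made-up stand-ins for illustration, not the setup studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 6, 3, 0.9

# Toy random MDP and a fixed deterministic policy (illustrative stand-ins).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = distribution over s'
pi = rng.integers(nA, size=nS)                  # pi[s] = action taken in state s

# Basic state features phi; d < nS, so reward regression is approximate.
d = 4
phi = rng.normal(size=(nS, d))

# Successor features psi(s, a) = E[ sum_t gamma^t phi(s_{t+1}) ].
# Bellman equation: psi = E[phi(s')] + gamma * P_pi psi, solved exactly here.
Ppi = np.zeros((nS * nA, nS * nA))              # (s,a) -> (s', pi(s')) transitions
for s in range(nS):
    for a in range(nA):
        for s2 in range(nS):
            Ppi[s * nA + a, s2 * nA + pi[s2]] = P[s, a, s2]
b = P.reshape(nS * nA, nS) @ phi                # E[phi(s') | s, a]
psi = np.linalg.solve(np.eye(nS * nA) - gamma * Ppi, b)   # shape (S*A, d)

# Zero-shot evaluation of a brand-new reward r(s'): fit w with least squares...
r = rng.normal(size=nS)
w, *_ = np.linalg.lstsq(phi, r, rcond=None)
Q_sf = (psi @ w).reshape(nS, nA)

# ...and check against exact policy evaluation for the projected reward phi @ w.
Q_exact = np.linalg.solve(np.eye(nS * nA) - gamma * Ppi,
                          P.reshape(nS * nA, nS) @ (phi @ w)).reshape(nS, nA)
assert np.allclose(Q_sf, Q_exact)
```

The catch illustrated here is the paper's point: 𝜓𝑤 equals the true 𝑄-function only for the *projected* reward 𝜑𝑤, so zero-shot quality hinges entirely on how well the basic features 𝜑 span the rewards of interest.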

2. PROBLEM AND NOTATION; DEFINING ZERO-SHOT RL

Let $\mathcal{M} = (S, A, P, \gamma)$ be a reward-free Markov decision process (MDP) with state space $S$, action space $A$, transition probabilities $P(s'|s,a)$ from state $s$ to $s'$ given action $a$, and discount factor $0 < \gamma < 1$ (Sutton & Barto, 2018). If $S$ and $A$ are finite, $P(s'|s,a)$ can be viewed as a stochastic matrix $P_{sas'} \in \mathbb{R}^{(|S| \times |A|) \times |S|}$; in general, for each $(s,a) \in S \times A$, $P(\mathrm{d}s'|s,a)$ is a probability measure on $s' \in S$. The notation $P(\mathrm{d}s'|s,a)$ covers all cases.

Given $(s_0, a_0) \in S \times A$ and a policy $\pi : S \to \mathrm{Prob}(A)$, we denote by $\Pr(\cdot\,|s_0, a_0, \pi)$ and $\mathbb{E}[\cdot\,|s_0, a_0, \pi]$ the probabilities and expectations under state-action sequences $(s_t, a_t)_{t \geq 0}$ starting at $(s_0, a_0)$ and following policy $\pi$ in the environment, defined by sampling $s_t \sim P(\mathrm{d}s_t|s_{t-1}, a_{t-1})$ and $a_t \sim \pi(\mathrm{d}a_t|s_t)$. We define $P_\pi(\mathrm{d}s', \mathrm{d}a'|s,a) := P(\mathrm{d}s'|s,a)\,\pi(\mathrm{d}a'|s')$ and $P_\pi(\mathrm{d}s'|s) := \int P(\mathrm{d}s'|s,a)\,\pi(\mathrm{d}a|s)$, the state-action transition probabilities and state transition probabilities induced by $\pi$. Given a reward function $r : S \to \mathbb{R}$, the $Q$-function of $\pi$ for $r$ is $Q^\pi_r(s_0, a_0) := \sum_{t \geq 0} \gamma^t\, \mathbb{E}[r(s_{t+1})\,|\,s_0, a_0, \pi]$. For simplicity, we assume the reward $r$ depends only on the next state $s_{t+1}$ instead of on the full triplet $(s_t, a_t, s_{t+1})$, but this is not essential.

We focus on offline unsupervised RL, where the agent cannot interact with the environment. The agent only has access to a static dataset of logged reward-free transitions in the environment, $\mathcal{D} = \{(s_i, a_i, s'_i)\}_{i \in \mathcal{I}}$ with $s'_i \sim P(\mathrm{d}s'_i|s_i, a_i)$. These can come from any exploration method or methods. The offline setting disentangles the effects of the exploration method from those of representation and policy learning: we test each zero-shot method on several training datasets produced by several exploration methods.
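The $Q$-function definition above can be checked numerically. In the toy deterministic chain below (a made-up illustration, not one of the paper's environments), a truncated rollout return $\sum_t \gamma^t r(s_{t+1})$ matches the fixed point of the Bellman equation $Q(s,a) = r(s') + \gamma\, Q(s', \pi(s'))$; note the reward is evaluated at the *next* state, following the convention used here.

```python
import numpy as np

gamma = 0.9
nS, nA = 5, 2

# Tiny deterministic MDP: action 0 moves right (capped at the end), action 1 stays.
def step(s, a):
    return min(s + 1, nS - 1) if a == 0 else s

r = np.linspace(0.0, 1.0, nS)   # reward r(s') depends on the next state only
pi = lambda s: 0                # fixed deterministic policy: always move right

# Q^pi_r(s0, a0) = sum_{t>=0} gamma^t r(s_{t+1}), via a truncated rollout.
def q_rollout(s0, a0, T=200):
    total, s, a = 0.0, s0, a0
    for t in range(T):
        s = step(s, a)          # s_{t+1}
        total += gamma**t * r[s]
        a = pi(s)
    return total

# The same value as the fixed point of Q(s,a) = r(s') + gamma * Q(s', pi(s')).
Q = np.zeros((nS, nA))
for _ in range(500):
    for s in range(nS):
        for a in range(nA):
            s2 = step(s, a)
            Q[s, a] = r[s2] + gamma * Q[s2, pi(s2)]

assert abs(q_rollout(2, 0) - Q[2, 0]) < 1e-6
```

The truncation error of the rollout is bounded by $\gamma^T \max|r| / (1-\gamma)$, negligible here; from the terminal state, $Q = r_{\max}/(1-\gamma) = 10$.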



Figure 1: Zero-shot scores of ten SF methods and FB, as a percentage of the supervised score of offline TD3 trained on the same replay buffer, averaged over tasks, environments, and replay buffers from the Unsupervised RL and ExORL benchmarks (Laskin et al., 2021; Yarats et al., 2021). FB and SFs with Laplacian eigenfunctions achieve zero-shot scores approaching supervised RL.

