LEARNABLE BEHAVIOR CONTROL: BREAKING ATARI HUMAN WORLD RECORDS VIA SAMPLE-EFFICIENT BEHAVIOR SELECTION

ABSTRACT

The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works attempt to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which in turn limits behavior diversity. In this paper, we propose a general framework called Learnable Behavior Control (LBC) to address this limitation, which a) enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies, and b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control by optimizing the selection of behavior mappings with bandit-based meta-controllers. Our agents have achieved a 10077.52% mean human normalized score and surpassed 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating significant state-of-the-art (SOTA) performance without degrading sample efficiency.

1. INTRODUCTION

Reinforcement learning (RL) has led to tremendous progress in a variety of domains ranging from video games (Mnih et al., 2015) to robotics (Schulman et al., 2015; 2017). However, efficient exploration remains one of the major challenges. Recent promising works handle this problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies (Badia et al., 2020b;a). Despite the significant performance improvements, these methods suffer from aggravated sample complexity due to the joint training of the whole population while maintaining its diversity. To acquire diverse behaviors, NGU (Badia et al., 2020b) uniformly selects policies from the population regardless of their contribution to the learning progress. As an improvement, Agent57 adopts an adaptive policy selection mechanism in which each behavior used for sampling is periodically selected from the population by a meta-controller (Badia et al., 2020a). Although Agent57 achieves significantly better results on the Arcade Learning Environment (ALE) benchmark, it costs tens of billions of environment interactions, as many as NGU. To overcome this drawback, GDI (Fan & Xiao, 2022) adaptively combines multiple advantage functions learned from a single policy, obtaining an enlarged behavior space without increasing the policy population size. However, the population-based setting with more than one learned policy has not been widely explored yet. Taking a further step from GDI, we enable a larger and non-degenerate behavior space by learning different combinations across a population of different learned policies.

In this paper, we aim to further improve the sample efficiency of population-based RL by addressing a more challenging setting: controlling behaviors in a significantly enlarged behavior space built from a population of different learned policies. Unlike existing works, where each behavior is derived from a single selected policy, we formulate the process of deriving behaviors from all learned policies as a hybrid behavior mapping, so that the behavior control problem is transformed into selecting appropriate mapping functions. By combining all policies, the behavior selection space grows exponentially with the population size; in the special case where the population size degrades to one, diverse behaviors can still be obtained by choosing different behavior mappings. This two-fold mechanism provides a tremendously larger space for behavior selection (a minimal formalization is sketched after the contribution list below). By properly parameterizing the mapping functions, our method yields a unified learnable process, and we call this general framework Learnable Behavior Control (LBC).

We use the Arcade Learning Environment (ALE) to evaluate the proposed methods; it is an important testing ground that requires a broad set of skills such as perception, exploration, and control (Badia et al., 2020a). Previous works summarize performance on ALE with the human normalized score and claim superhuman performance on that basis (Bellemare et al., 2013). However, the human baseline is far from representative of the best human players and thus greatly underestimates human ability. In this paper, we introduce a more challenging baseline, the human world record baseline (see Toromanoff et al. (2019); Hafner et al. (2021) for more information on Atari human world records). We count the number of games in which an agent surpasses the human world record (HWRB, see Fig. 1), inducing a more challenging and fair comparison with human intelligence before claiming real superhuman performance. Experimental results also show that the sample efficiency of our method surpasses the concurrent work MEME (Kapturowski et al., 2022), which is itself 200x faster than Agent57. In summary, our contributions are as follows:

1. A data-efficient RL framework named LBC. We propose a general framework called Learnable Behavior Control (LBC), which enables a significantly enlarged behavior selection space without increasing the policy population size by formulating a hybrid behavior mapping from all policies, and which constructs a unified learnable process for behavior selection.

2. A family of LBC-based RL algorithms. We provide a family of LBC-based algorithms by combining LBC with existing distributed off-policy RL algorithms, which shows the generality and scalability of the proposed method.

3. State-of-the-art performance with superior sample efficiency. As Fig. 1 shows, our method achieves a 10077.52% mean human normalized score (HNS) and surpasses 24 human world records within 1B training frames.
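To make the mechanism above concrete, here is a minimal formalization; the notation is ours, and the formal definitions appear in later sections. Given a population of N learned policies \pi_1, \dots, \pi_N, one simple instance of a hybrid behavior mapping is a weighted mixture

    \pi_\beta(a \mid s) = \sum_{i=1}^{N} w_i \, \pi_i(a \mid s), \qquad w \in \Delta^{N-1},

so that selecting a behavior amounts to selecting the mapping parameters w (together with any further parameters of the mapping, e.g., a temperature) rather than a single policy index i \in \{1, \dots, N\}. Discretizing each weight into K levels already yields on the order of K^N candidate behaviors, versus N for pure policy selection, which is the exponential growth of the selection space described above; with N = 1, varying the remaining mapping parameters still produces diverse behaviors, matching the degenerate case mentioned in the text.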



Figure 1: Performance on the 57 Atari games. Our method achieves the highest mean human normalized score (Badia et al., 2020a), is the first to break 24 human world records (Toromanoff et al., 2019), and demands the least training data.

Figure 2: A General Architecture of Our Algorithm.
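As a rough illustration of the bandit-based behavior selection depicted in Figure 2, the following Python sketch pairs a UCB-1 meta-controller with a discretized set of hybrid behavior mappings. This is a minimal sketch under our own assumptions, not the paper's implementation; all class and function names are hypothetical, and the toy policies and return signal stand in for learned networks and episodic returns.

# Minimal sketch (our assumptions, illustrative only): a UCB-1 bandit
# meta-controller scores a finite, discretized set of hybrid behavior
# mappings; each mapping mixes the policy population with fixed weights.
# All names are hypothetical.
import math
import random

class UCBMetaController:
    """UCB-1 bandit over a set of candidate behavior mappings (arms)."""
    def __init__(self, num_arms, c=2.0):
        self.c = c
        self.counts = [0] * num_arms
        self.values = [0.0] * num_arms   # running mean of returns per arm
        self.t = 0

    def select(self):
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:                   # play every arm once first
                return arm
        scores = [v + self.c * math.sqrt(math.log(self.t) / n)
                  for v, n in zip(self.values, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, ret):
        self.counts[arm] += 1
        self.values[arm] += (ret - self.values[arm]) / self.counts[arm]

def hybrid_behavior(policies, weights, state):
    """Weighted mixture of the population's action distributions."""
    num_actions = len(policies[0](state))
    return [sum(w * pi(state)[a] for w, pi in zip(weights, policies))
            for a in range(num_actions)]

if __name__ == "__main__":
    # Two toy "policies" over 3 actions stand in for learned networks.
    policies = [lambda s: [0.7, 0.2, 0.1], lambda s: [0.1, 0.2, 0.7]]
    # Discretized candidate weightings form the bandit's arms.
    arms = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
    meta = UCBMetaController(num_arms=len(arms))
    for episode in range(200):
        k = meta.select()
        probs = hybrid_behavior(policies, arms[k], state=None)
        action = random.choices(range(3), weights=probs)[0]
        ret = 1.0 if action == 2 else 0.0   # toy episodic return
        meta.update(k, ret)
    print("most-selected mapping:",
          arms[max(range(len(arms)), key=meta.counts.__getitem__)])

In the distributed setting described in the abstract, each actor would periodically query such a meta-controller for a mapping and report episodic returns back to it; since the policy population keeps learning, a nonstationary bandit (e.g., a sliding-window variant of UCB) would be the natural choice, but the selection loop has this shape.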


