LEARNABLE BEHAVIOR CONTROL: BREAKING ATARI HUMAN WORLD RECORDS VIA SAMPLE-EFFICIENT BEHAVIOR SELECTION

Abstract

The exploration problem is one of the main challenges in deep reinforcement learning (RL). Recent promising works have tried to handle the problem with population-based methods, which collect samples with diverse behaviors derived from a population of different exploratory policies. Adaptive policy selection has been adopted for behavior control. However, the behavior selection space is largely limited by the predefined policy population, which further limits behavior diversity. In this paper, we propose a general framework called Learnable Behavioral Control (LBC) to address this limitation, which a) enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies; b) constructs a unified learnable process for behavior selection. We introduce LBC into distributed off-policy actor-critic methods and achieve behavior control by optimizing the selection of behavior mappings with bandit-based meta-controllers. Our agents achieve a 10077.52% mean human normalized score and surpass 24 human world records within 1B training frames in the Arcade Learning Environment, demonstrating significant state-of-the-art (SOTA) performance without degrading sample efficiency.



Figure 1: Performance on the 57 Atari games. Our method achieves the highest mean human normalized score (Badia et al., 2020a), is the first to break 24 human world records (Toromanoff et al., 2019), and demands the least training data.

1. INTRODUCTION

Reinforcement learning (RL) has led to tremendous progress in a variety of domains, ranging from video games (Mnih et al., 2015) to robotics (Schulman et al., 2015; 2017). However, efficient exploration remains one of the significant challenges. Recent prominent works have tried to address the problem with population-based training (Jaderberg et al., 2017, PBT), wherein a population of policies with different degrees of exploration is jointly trained to preserve both long-term and short-term exploration capabilities throughout the learning process. A set of actors is created to acquire diverse behaviors derived from the policy population (Badia et al., 2020b;a). Despite significant performance improvements, these methods suffer from aggravated sample complexity because the whole population must be trained jointly while maintaining its diversity. To acquire diverse behaviors, NGU (Badia et al., 2020b) selects policies from the population uniformly, regardless of their contribution to learning progress. As an improvement, Agent57 (Badia et al., 2020a) adopts an adaptive policy selection mechanism in which each behavior used for sampling is periodically selected from the population by a meta-controller. Although Agent57 achieved significantly better results on the Arcade Learning Environment (ALE) benchmark, it costs tens of billions of environment interactions, as many as NGU. To address this drawback, GDI (Fan & Xiao, 2022) adaptively combines multiple advantage functions learned from a single policy to obtain an enlarged behavior space without increasing the policy population size. However, the population-based setting with more than one learned policy has not been widely explored yet. Taking a further step from GDI, we enable a larger and non-degenerate behavior space by learning different combinations across a population of different learned policies.
In this paper, we attempt to further improve the sample efficiency of population-based reinforcement learning methods by taking a step towards a more challenging setting: controlling behaviors within a significantly enlarged behavior space built from a population of different learned policies. Unlike existing works, where each behavior is derived from a single selected learned policy, we formulate the process of obtaining behaviors from all learned policies as a hybrid behavior mapping, and the behavior control problem is directly transformed into selecting appropriate mapping functions. By combining all policies, the behavior selection space grows exponentially with the population size. Even in the special case where the population size degrades to one, diverse behaviors can still be obtained by choosing different behavior mappings. This two-fold mechanism enables a tremendously larger space for behavior selection. By properly parameterizing the mapping functions, our method enables a unified learnable process, and we call this general framework Learnable Behavior Control. We use the Arcade Learning Environment (ALE) to evaluate the performance of the proposed methods; it is an important testing ground that requires a broad set of skills such as perception, exploration, and control (Badia et al., 2020a). Previous works use the normalized human score to summarize performance on ALE and claim superhuman performance (Bellemare et al., 2013). However, the human baseline is far from representative of the best human players, which greatly underestimates the ability of humanity. In this paper, we introduce a more challenging baseline, i.e., the human world records baseline (see Toromanoff et al. (2019); Hafner et al. (2021) for more information on Atari human world records). We summarize the number of games in which agents outperform the human world records (i.e., HWRB, see Fig. 1) to claim real superhuman performance in these games, inducing a more challenging and fair comparison with human intelligence. Experimental results show that the sample efficiency of our method also outperforms the concurrent work MEME (Kapturowski et al., 2022), which is 200x faster than Agent57. In summary, our contributions are as follows:

1. A data-efficient RL framework named LBC. We propose a general framework called Learnable Behavior Control (LBC), which enables a significantly enlarged behavior selection space without increasing the policy population size by formulating a hybrid behavior mapping from all policies, and constructs a unified learnable process for behavior selection.

2. A family of LBC-based RL algorithms. We provide a family of LBC-based algorithms by combining LBC with existing distributed off-policy RL algorithms, which shows the generality and scalability of the proposed method.

3.

d^π_{ρ_0}(s) = (1 − γ) E_{s_0∼ρ_0}[ Σ_{t=0}^∞ γ^t P(s_t = s | s_0) ]. Define the return G_t = Σ_{k=0}^∞ γ^k r_{t+k}, wherein γ ∈ (0, 1) is the discount factor. The goal of reinforcement learning is to find the optimal policy π* that maximizes the expected sum of discounted rewards G_t:

π* := argmax_π E_{s_t∼d^π_{ρ_0}} E_π[ G_t = Σ_{k=0}^∞ γ^k r_{t+k} | s_t ],     (1)

2.2. BEHAVIOR CONTROL FOR REINFORCEMENT LEARNING

In value-based methods, a behavior policy can be derived from a state-action value function Q^π_{θ,h}(s, a) via ϵ-greedy. In policy-based methods, a behavior policy can be derived from the policy logits Φ_{θ,h} (Li et al., 2018) via the Boltzmann operator. For convenience, we say that a behavior policy is derived from the learned policy model Φ_{θ,h} via a behavior mapping, which normally maps a single policy model to a behavior, e.g., ϵ-greedy(Φ_{θ,h}). In PBT-based methods, there is a set of policy models {Φ_{θ_1,h_1}, ..., Φ_{θ_N,h_N}}, each of which is parameterized by θ_i and trained under its own hyper-parameters h_i, wherein θ_i ∈ Θ = {θ_1, ..., θ_N} and h_i ∈ H = {h_1, ..., h_N}. Behavior control in population-based methods is normally achieved in two steps: i) selecting a policy model Φ_{θ,h} from the population; ii) applying a behavior mapping to the selected policy model. When the behavior mapping is rule-based for each actor (e.g., ϵ-greedy with a rule-based ϵ), behavior control reduces to policy model selection (see Proposition 1). Therefore, optimizing the selection of policy models becomes one of the critical problems in achieving effective behavior control. Following the PBRL literature, NGU adopts a uniform selection, which is unoptimized and inefficient. Built upon NGU, Agent57 adopts a meta-controller, implemented as a non-stationary multi-arm bandit algorithm, that adaptively selects a policy model from the population to generate the behavior for each actor. However, policy model selection requires maintaining a large number of different policy models, which is particularly data-consuming since each policy model in the population holds heterogeneous training objectives. To handle this problem, the recent notable work GDI-H³ (Fan & Xiao, 2022) obtains an enlarged behavior space by adaptively controlling the temperature of the softmax operation over the weighted advantage functions.
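The two individual behavior mappings mentioned above can be sketched as follows (a minimal illustration; the function names and array-based interface are ours, not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Individual behavior mapping for value-based methods: act uniformly
    with probability epsilon, otherwise act greedily w.r.t. Q(s, .)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Individual behavior mapping for policy-based methods: sample an
    action from a temperature-scaled softmax over the policy logits."""
    z = logits / temperature
    z = z - z.max()                       # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=probs))
```

In both cases the behavior is a function of a single policy model plus a mapping parameter (ϵ or the temperature), which is exactly the individual-mapping setting that the hybrid mapping below generalizes.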
However, since the advantage functions are derived from the same target policy under different reward scales, the distributions derived from them may tend to be similar (e.g., see App. N), which would degrade the behavior space. Unlike existing works, where each behavior is derived from a single selected learned policy, in this paper we handle this problem in three ways: i) we bridge the relationship between the learned policies and each behavior via a hybrid behavior mapping; ii) we propose a general way to build a non-degenerate large behavior space for population-based methods in Sec. 4.1; iii) we propose a way to optimize the hybrid behavior mappings over a population of different learned models in Proposition 2.

3. LEARNABLE BEHAVIOR CONTROL

In this section, we first formulate the behavior control problem and decouple it into two sub-problems: behavior space construction and behavior selection. Then, we discuss how to construct the behavior space and select behaviors based on the formulation. By integrating both, we can obtain a general framework to achieve behavior control in RL, called learnable behavior control (LBC).

3.1. BEHAVIOR CONTROL FORMULATION

Behavior Mapping Define a behavior mapping F as a mapping from one or more policy models to a behavior. In previous works, a behavior policy is typically obtained from a single policy model. In this paper, as a generalization, we define two kinds of F according to how many policy models they take as input. The first, an individual behavior mapping, maps a single model to a behavior and is widely used in prior works, e.g., ϵ-greedy and the Boltzmann strategy for discrete action spaces and the Gaussian strategy for continuous action spaces. The second, a hybrid behavior mapping, maps all policy models to a single behavior, i.e., F(Φ_{θ_1,h_1}, ..., Φ_{θ_N,h_N}). The hybrid behavior mapping enables us to obtain a hybrid behavior by combining all policies together, which provides a greater degree of freedom to acquire a larger behavior space. For any behavior mapping F_ψ parameterized by ψ, there exists a family of behavior mappings F_Ψ = {F_ψ | ψ ∈ Ψ} that share the same parametrization form, where Ψ ⊆ R^k is the set of all possible parameters ψ.

Behavior Formulation As described above, in our work a behavior can be acquired by applying a behavior mapping F_ψ to one or more policy models. For the individual behavior mapping case, a behavior can be formulated as µ_{θ,h,ψ} = F_ψ(Φ_{θ,h}), which is the most common case in previous works. For the hybrid behavior mapping case, a behavior is formulated as µ_{Θ,H,ψ} = F_ψ(Φ_{Θ,H}), wherein Φ_{Θ,H} = {Φ_{θ_1,h_1}, ..., Φ_{θ_N,h_N}} is the set of all policy models.

Behavior Control Formulation Behavior control can be decoupled into two sub-problems: 1) which behaviors can be selected for each actor at each training time, namely behavior space construction; 2) how to select proper behaviors, namely behavior selection. Based on the behavior formulation, we can formalize these sub-problems:

Definition 3.1 (Behavior Space Construction).
Consider an RL problem where behaviors µ are generated from one or more policy models. We can acquire a family of realizable behaviors by applying a family of behavior mappings F_Ψ to these policy models. Define the set containing all such realizable behaviors as the behavior space, which can be formulated as:

M_{Θ,H,Ψ} = {µ_{θ,h,ψ} = F_ψ(Φ_{θ,h}) | θ ∈ Θ, h ∈ H, ψ ∈ Ψ}, for individual behavior mappings;
M_{Θ,H,Ψ} = {µ_{Θ,H,ψ} = F_ψ(Φ_{Θ,H}) | ψ ∈ Ψ}, for hybrid behavior mappings.     (2)

Definition 3.2 (Behavior Selection). Behavior selection can be formulated as finding an optimal selection distribution P*_{M_{Θ,H,Ψ}} that selects behaviors µ from the behavior space M_{Θ,H,Ψ} and maximizes some optimization target L_P, wherein L_P is the optimization target of behavior selection:

P*_{M_{Θ,H,Ψ}} := argmax_{P_{M_{Θ,H,Ψ}}} L_P.     (3)

3.2. BEHAVIOR SPACE CONSTRUCTION

In this section, we further simplify equation 2 and discuss how to construct the behavior space.

Assumption 1. Assume all policy models share the same network structure, and h_i uniquely indexes a policy model Φ_{θ_i,h_i}. Then, Φ_{θ,h} can be abbreviated as Φ_h.

Unless otherwise specified, we assume Assumption 1 holds throughout this paper. Under Assumption 1, the behavior space defined in equation 2 simplifies to:

M_{H,Ψ} = {µ_{h,ψ} = F_ψ(Φ_h) | h ∈ H, ψ ∈ Ψ}, for individual behavior mappings;
M_{H,Ψ} = {µ_{H,ψ} = F_ψ(Φ_H) | ψ ∈ Ψ}, for hybrid behavior mappings.     (4)

According to equation 4, four core factors need to be considered when constructing a behavior space: the network structure Φ, the form of the behavior mapping F, the hyper-parameter set H, and the parameter set Ψ. Many notable representation learning approaches have explored how to design the network structure (Chen et al., 2021; Irie et al., 2021), but this is not the focus of our work. In this paper, we make no assumptions about the model structure, which means our method can be applied to any model structure. Hence, three factors remain, which are discussed below.
For cases where the behavior space is constructed with individual behavior mappings, two things must be considered when selecting a specific behavior from the behavior space: the policy model Φ_h and the behavior mapping F_ψ. Prior methods have tried to realize behavior control by selecting a policy model Φ_{h_i} from the population {Φ_{h_1}, ..., Φ_{h_N}} (see Proposition 1). The main drawback of this approach is that only one policy model is used to generate behavior, leaving the other policy models in the population unused. In this paper, we argue that we can tackle this problem via hybrid behavior mapping, wherein the hybrid behavior is generated from all policy models. We consider only the case where all N policy models are used for behavior generation, i.e., µ_{H,ψ} = F_ψ(Φ_H). Now only one thing remains to be considered, namely the behavior mapping function F_ψ, and the behavior control problem is transformed into the optimization of the behavior mapping (see Proposition 2). We also make no assumptions about the form of the mapping. For example, one could acquire a hybrid behavior from all policy models via network distillation, parameter fusion, mixture models, etc.

3.3. BEHAVIOR SELECTION

According to equation 4, each behavior can be indexed by h and ψ in the individual behavior mapping case, and when ψ is not learned for each actor, behavior selection can be cast as the selection of h (see Proposition 1). In the hybrid behavior mapping case, since each behavior can be indexed by ψ, behavior selection can be cast as the selection of ψ (see Proposition 2). Moreover, according to equation 3, there are two keys to behavior selection: 1) the optimization target L_P; 2) the optimization algorithm that learns the selection distribution P_{M_{H,Ψ}} to maximize L_P. In this section, we discuss them sequentially.

Optimization Target Two core factors have to be considered for the optimization target: a diversity-based measurement V^TD_µ (Eysenbach et al., 2019) and a value-based measurement V^TV_µ (Parker-Holder et al., 2020). Integrating both, the optimization target can be formulated as:

L_P = R_{µ∼P_{M_{H,Ψ}}} + c · D_{µ∼P_{M_{H,Ψ}}} = E_{µ∼P_{M_{H,Ψ}}}[ V^TV_µ + c · V^TD_µ ],     (5)

wherein R_{µ∼P_{M_{H,Ψ}}} and D_{µ∼P_{M_{H,Ψ}}} are the expectations of the value and the diversity of behavior µ over the selection distribution P_{M_{H,Ψ}}. When F_ψ is unlearned and deterministic for each actor, behavior selection for each actor simplifies to the selection of the policy model:

Proposition 1 (Policy Model Selection). When F_ψ is a deterministic, individual behavior mapping for each actor at each training step (wall-clock), e.g., as in Agent57, the behavior for each actor can be uniquely indexed by h, so equation 5 simplifies to L_P = E_{h∼P_H}[ V^TV_{µ_h} + c · V^TD_{µ_h} ], where P_H is a selection distribution over h ∈ H = {h_1, ..., h_N}. For each actor, the behavior is generated from a selected policy model Φ_{h_i} with a pre-defined behavior mapping F_ψ.

In Proposition 1, the behavior space size is controlled by the policy model population size (i.e., |H|). However, maintaining a large population of different policy models is data-consuming.
Hence, we instead control behaviors by optimizing the selection of behavior mappings:

Proposition 2 (Behavior Mapping Optimization). When all policy models are used to generate each behavior, e.g., µ_ψ = F_ψ(Φ_{θ,h}) in the single-policy-model case or µ_ψ = F_ψ(Φ_{θ_1,h_1}, ..., Φ_{θ_N,h_N}) in the N-policy-model case, each behavior can be uniquely indexed by F_ψ, and equation 5 simplifies to L_P = E_{ψ∼P_Ψ}[ V^TV_{µ_ψ} + c · V^TD_{µ_ψ} ], where P_Ψ is a selection distribution over ψ ∈ Ψ.

In Proposition 2, the behavior space is mainly controlled by |Ψ|, which can be a continuous parameter space; hence a larger behavior space is enabled.

Selection Distribution Optimization Given the optimization target L_P, we seek the optimal behavior selection distribution that maximizes L_P:

P*_{M_{H,Ψ}} := argmax_{P_{M_{H,Ψ}}} L_P = argmax_{P_H} L_P = argmax_{P_Ψ} L_P,

where the first and second equalities hold by Proposition 1 and Proposition 2, respectively. This optimization problem can be solved with existing optimizers, e.g., evolutionary algorithms (Jaderberg et al., 2017), multi-arm bandits (MAB) (Badia et al., 2020a), etc.
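As a concrete illustration of the target L_P above: over a finite, discretized set of candidate behaviors, the expectation of value plus weighted diversity reduces to a weighted sum (a minimal sketch; the function name and the toy numbers are ours, not from the paper):

```python
import numpy as np

def selection_target(p: np.ndarray, value: np.ndarray, diversity: np.ndarray, c: float) -> float:
    """L_P = E_{mu~P}[V^TV + c * V^TD] for a finite set of candidate
    behaviors: p[k] is the selection probability of behavior k, value[k]
    its value measurement, and diversity[k] its diversity measurement."""
    assert np.isclose(p.sum(), 1.0)
    return float(p @ (value + c * diversity))

# Toy numbers: behavior 0 is high-value, behavior 1 is high-diversity.
value = np.array([1.0, 0.2])
diversity = np.array([0.1, 0.9])
exploit = np.array([0.9, 0.1])   # selection distribution favoring value
explore = np.array([0.1, 0.9])   # selection distribution favoring diversity
```

With c = 0 the value-seeking distribution scores higher, while a large c favors the diversity-seeking one, which is the reward-diversity trade-off that the coefficient c controls.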

4. LBC-BM: A BOLTZMANN MIXTURE BASED IMPLEMENTATION FOR LBC

In this section, we provide an example of improving the behavior control of off-policy actor-critic methods (Espeholt et al., 2018) by optimizing the behavior mappings as in Proposition 2. We provide a practical design of the hybrid behavior mapping, inducing an implementation of LBC that we call Boltzmann Mixture based LBC, or LBC-BM. By choosing different H and Ψ, we can obtain a family of LBC-BM implementations with different behavior spaces (see Sec. 5.4).

4.1. BOLTZMANN MIXTURE BASED BEHAVIOR SPACE CONSTRUCTION

In this section, we provide a general hybrid behavior mapping design comprising three sub-processes.

Generalized Policy Selection In Agent57, behavior control is achieved by selecting a single policy from the policy population at each iteration. Following this idea, we generalize the method to the case where multiple policies can be selected. More specifically, we introduce an importance weight vector ω to describe how much each policy contributes to the generated behavior: ω = [ω_1, ..., ω_N], ω_i ≥ 0, Σ_{i=1}^N ω_i = 1, where ω_i represents the importance of the i-th policy in the population (i.e., Φ_{h_i}). In particular, if ω is a one-hot vector, i.e., ∃i ∈ {1, 2, ..., N} with ω_i = 1 and ω_j = 0 for all j ≠ i, then the policy selection reduces to single policy selection as in Proposition 1. This process can therefore be seen as a generalization of single policy selection, and we call it generalized policy selection.
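A small numerical check of this degeneracy (a sketch with hypothetical action distributions; `mix_policies` is our name, not the paper's): mixing with a one-hot ω reproduces exactly one policy's action distribution, while any interior ω yields a genuinely hybrid behavior that no single policy produces.

```python
import numpy as np

def mix_policies(policy_probs: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Generalized policy selection: weight the action distribution of each
    policy in the population by its importance omega_i and sum.  omega lies
    on the probability simplex (non-negative, sums to one)."""
    assert np.all(omega >= 0) and np.isclose(omega.sum(), 1.0)
    return omega @ policy_probs          # (N,) @ (N, A) -> (A,)

# Three policies over four actions (each row sums to 1).
pi = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.25, 0.25, 0.25, 0.25],
               [0.10, 0.10, 0.10, 0.70]])

one_hot = np.array([0.0, 1.0, 0.0])      # degenerates to single policy selection
hybrid  = np.array([0.5, 0.25, 0.25])    # a behavior outside the original population
```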

Policy-Wise Entropy Control

In our work, we propose to use entropy control (which is typically rule-based in previous works) to make a better trade-off between exploration and exploitation. For a policy model Φ_{h_i} from the population, we apply an entropy control function f_{τ_i}(·), i.e., π_{h_i,τ_i} = f_{τ_i}(Φ_{h_i}), where π_{h_i,τ_i} is the new policy after entropy control and f_{τ_i}(·) is parameterized by τ_i. Note that the entropy of each policy in the population is controlled in a policy-wise manner, so there is a set of entropy control functions to be considered, parameterized by τ = [τ_1, ..., τ_N].

Behavior Distillation from Multiple Policies Different from previous methods, where only one policy is used to generate the behavior, our approach combines the N policies [π_{h_1,τ_1}, ..., π_{h_N,τ_N}] together with their importance weights ω = [ω_1, ..., ω_N]. Specifically, in order to make full use of these policies according to their importance, we introduce a behavior distillation function g that takes both the policies and the importance weights as input, i.e., µ_{H,τ,ω} = g(π_{h_1,τ_1}, ..., π_{h_N,τ_N}, ω). The distillation function g(·, ω) can be implemented in different ways, e.g., knowledge distillation (supervised learning), parameter fusion, etc. In conclusion, the behavior space can be constructed as:

M_{H,Ψ} = {g( f_{τ_1}(Φ_{h_1}), ..., f_{τ_N}(Φ_{h_N}), ω_1, ..., ω_N ) | ψ ∈ Ψ},     (9)

wherein Ψ = {ψ = (τ_1, ..., τ_N, ω_1, ..., ω_N)} and H = {h_1, ..., h_N}. Note that this is a general approach that can be applied to different tasks and algorithms simply by selecting different entropy control functions f_{τ_i}(·) and behavior distillation functions g(·, ω). As an example, for the Atari task we model each policy as a Boltzmann distribution, i.e., π_{h_i,τ_i}(a|s) = e^{τ_i Φ_{h_i}(a|s)} / Σ_{a'} e^{τ_i Φ_{h_i}(a'|s)}, where τ_i ∈ (0, ∞). The entropy can thus be controlled through the temperature.
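For the Atari instantiation above, the Boltzmann-mixture mapping F_ψ with ψ = (τ, ω) can be sketched as follows (a minimal sketch; the array shapes and function names are ours): each policy's logits are scaled by its own temperature τ_i for policy-wise entropy control, and the resulting distributions are then mixed with the importance weights ω.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hybrid_behavior(logits: np.ndarray, tau: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Boltzmann-mixture behavior mapping: logits has shape (N, A), one row
    of logits Phi_{h_i}(.|s) per policy; tau_i > 0 sharpens (large tau) or
    flattens (small tau) policy i; omega mixes the N distributions."""
    assert np.all(tau > 0) and np.isclose(omega.sum(), 1.0)
    per_policy = softmax(logits * tau[:, None])   # pi_{h_i, tau_i}(.|s), shape (N, A)
    return omega @ per_policy                     # mu_{H,psi}(.|s), shape (A,)
```

A one-hot ω with a large τ_i approximates greedy behavior of a single policy, while small temperatures push toward uniform action selection, so a single parameter vector ψ spans behaviors from exploitative to exploratory.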
As for the behavior distillation function, we are inspired by the behavior design of GDI, which takes a weighted sum of two softmax distributions derived from two advantage functions. We extend this approach to a combination across different policies, i.e., µ_{H,τ,ω}(a|s) = Σ_{i=1}^N ω_i π_{h_i,τ_i}(a|s). This formula is in fact a mixture model, where the importance weights play the role of the mixture weights. The behavior space then becomes:

M_{H,Ψ} = {µ_{H,ψ} = Σ_{i=1}^N ω_i softmax_{τ_i}(Φ_{h_i}) | ψ ∈ Ψ}.     (10)

4.2. MAB BASED BEHAVIOR SELECTION

According to Proposition 2, behavior selection over the behavior space in equation 10 can be simplified to the selection of ψ. In this paper, we use a MAB-based meta-controller to select ψ ∈ Ψ. Since Ψ is a continuous multidimensional space, we discretize Ψ into K regions {Ψ_1, ..., Ψ_K}, each corresponding to an arm of the MAB. At the beginning of a trajectory i, the l-th actor uses the MAB to sample a region Ψ_k indexed by arm k according to P_Ψ = softmax(Score_{Ψ_k}) = e^{Score_{Ψ_k}} / Σ_j e^{Score_{Ψ_j}}. We adopt a UCB score, Score_{Ψ_k} = V_{Ψ_k} + c · log(1 + Σ_{j≠k} N_{Ψ_j}) / (1 + N_{Ψ_k}), to tackle the reward-diversity trade-off problem in equation 7 (Garivier & Moulines, 2011). N_{Ψ_k} denotes the visit count of Ψ_k, indexed by arm k. V_{Ψ_k} is the expectation of the undiscounted episodic returns, measuring the value of each Ψ_k, and the UCB term avoids selecting the same arm repeatedly, ensuring that sufficiently diverse behavior mappings can be selected to boost behavior diversity. After a region Ψ_k is sampled, a ψ is uniformly sampled from Ψ_k, corresponding to a behavior mapping F_ψ. With F_ψ, we obtain a behavior µ_ψ according to equation 10. The l-th actor then executes µ_ψ to obtain a trajectory τ_i and the undiscounted episodic return G_i, and G_i is used to update the reward model V_{Ψ_k} of region Ψ_k.
As for the non-stationarity problem, we take inspiration from GDI, which ensembles several MABs with different learning rates and discretization accuracies. We extend this by jointly training a population of bandits ranging from very exploratory to purely exploitative (i.e., with different c in the UCB term, similar to the policy population of Agent57). Moreover, we periodically replace members of the MAB population to further ease the non-stationarity problem. More implementation details of the MAB can be found in App. E. The mechanism of the UCB term for behavior control has not been widely studied in prior works, and we demonstrate how it boosts behavior diversity in App. K.3.
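The meta-controller loop can be sketched as follows (a simplified, assumption-laden version: the class name, the one-dimensional regions, and the softmax-of-scores sampling are ours; the paper's bandit is non-stationary and run as a population with different c values):

```python
import math, random

class RegionBandit:
    """MAB meta-controller sketch: each arm k is a discretized region Psi_k
    of the mapping-parameter space.  Arms are sampled via a softmax over
    UCB-style scores; pulling an arm returns a psi drawn uniformly from the
    region, and the arm's value estimate is updated with episodic return."""

    def __init__(self, regions, c=1.0, seed=0):
        self.regions = regions            # list of (low, high) intervals over psi
        self.c = c                        # exploration coefficient of the UCB term
        self.n = [0] * len(regions)       # visit counts N_{Psi_k}
        self.v = [0.0] * len(regions)     # running mean of episodic returns V_{Psi_k}
        self.rng = random.Random(seed)

    def score(self, k):
        # V_{Psi_k} plus a bonus that shrinks as arm k is visited more often.
        others = 1 + sum(self.n) - self.n[k]
        return self.v[k] + self.c * math.log(others) / (1 + self.n[k])

    def select(self):
        scores = [self.score(k) for k in range(len(self.regions))]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]   # softmax over scores
        k = self.rng.choices(range(len(self.regions)), weights)[0]
        low, high = self.regions[k]
        psi = self.rng.uniform(low, high)             # uniform sample inside Psi_k
        return k, psi

    def update(self, k, episodic_return):
        self.n[k] += 1
        self.v[k] += (episodic_return - self.v[k]) / self.n[k]
```

In the full algorithm, each actor would call `select()` at the start of a trajectory, build µ_ψ from the sampled ψ, and feed the resulting undiscounted episodic return back through `update()`.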

5. EXPERIMENT

In this section, we design our experiments to answer the following questions:

• Can our method outperform prior SOTA RL algorithms in both sample efficiency and final performance on the Atari 1B benchmark (see Sec. 5.2 and Figs. 3)?
• Can our method adaptively adjust the exploration-exploitation trade-off (see Figs. 4)?
• How can the behavior space be enlarged or narrowed, and how do methods with different behavior spaces perform (see Sec. 5.4)?
• How much does performance degrade without proper behavior selection (see Figs. 5)?

5.1. EXPERIMENTAL SETUP

5.1.1. EXPERIMENTAL DETAILS

We conduct our experiments in the ALE (Bellemare et al., 2013). The standard pre-processing settings of Atari are identical to those of Agent57 (Badia et al., 2020a), and the related parameters are summarized in App. I. We employ a separate evaluation process to record scores continuously. We record the undiscounted episodic returns averaged over five seeds, using a windowed mean over 32 episodes. To avoid any issues that aggregated metrics may have, App. J provides full learning curves for all games and detailed comparison tables of raw and normalized scores. Apart from the mean and median HNS, we also report how many human world records our agents have broken, to emphasize the superhuman performance of our methods. For more experimental details, see App. H.

5.1.2. IMPLEMENTATION DETAILS

We jointly train three policies, and each policy is indexed by the hyper-parameters h_i = (γ_i, RS_i), wherein RS_i is a reward shaping method (Badia et al., 2020a) and γ_i is the discount factor. Each policy model Φ_{h_i} adopts the dueling network structure (Wang et al., 2016), where Φ_{h_i} = A_{h_i} = Q_{h_i} − V_{h_i}. More details of the network structure can be found in App. L. To correct for the harmful discrepancy of off-policy learning, we adopt V-Trace (Espeholt et al., 2018) and ReTrace (Munos et al., 2016) to learn V_{h_i} and Q_{h_i}, respectively. The policy is learned by policy gradient (Schulman et al., 2017). Based on equation 10, we build a behavior space with a hybrid mapping as M_{H,Ψ} = {µ_{H,ψ} = Σ_{i=1}^3 ω_i softmax_{τ_i}(Φ_{h_i})}, wherein H = {h_1, h_2, h_3} and Ψ = {ψ = (τ_1, ω_1, τ_2, ω_2, τ_3, ω_3) | τ_i ∈ (0, τ⁺), Σ_{j=1}^3 ω_j = 1}. The behavior selection is achieved by the MAB described in Sec. 4.2; see App. E for more details. Finally, we obtain an implementation of LBC-BM, which is our main algorithm. Note that the target policies for A^π_1 and A^π_2 in GDI-H³ are the same, while in our work the target policy for A^{π_i}_i is π_i = softmax(A_i).

5.2. RESULTS ON ATARI BENCHMARK

The aggregated results across games are reported in Figs. 3. Among the algorithms with superb final performance, our agents achieve the best mean HNS and surpass the most human world records across the 57 games of the Atari benchmark with relatively minimal training frames, yielding the best learning efficiency. Note that Agent57 reported the maximum scores across training as the final score; if we report our performance in the same manner, our median is 1934%, which is higher than Agent57's and demonstrates our superior performance.

Discussion of Results

With LBC, we can understand the mechanisms underlying the performance of GDI-H³ more clearly: i) GDI-H³ has a high-capacity behavior space and a meta-controller to optimize behavior selection; ii) only a single target policy is learned, which enables stable learning and fast convergence (see the case study of KL divergence in App. N). Compared to GDI-H³, to ensure the behavior space does not degenerate, LBC maintains a population of diverse policies and, as a price, sacrifices some sample efficiency. Nevertheless, LBC can continuously maintain a significantly larger behavior space through hybrid behavior mapping, which enables RL agents to keep exploring and improving.

5.3. CASE STUDY: BEHAVIOR CONTROL

To further explore the mechanisms underlying the success of our method's behavior control, we present a case study of the behavior control process. As shown in Figs. 4, in most tasks, our agents prefer exploratory behaviors first (i.e., highly stochastic policies with high entropy), and, as training progresses, the agents shift to producing experience from more exploitative behaviors. On the verge of peaking, the entropy of the behaviors is maintained at a certain (task-wise) level instead of collapsing swiftly to zero, avoiding premature convergence to sub-optimal policies.

5.4. ABLATION STUDY

In this section, we investigate several properties of our method. For more details, see App. K.

Behavior Space Decomposition To explore the effect of different behavior spaces, we decompose the behavior space of our main algorithm by reducing H and Ψ. 1) Reducing H. When we set all the policy models of our main algorithm to be the same, the behavior space transforms from F(Φ_{h_1}, Φ_{h_2}, Φ_{h_3}) into F(Φ_{h_1}, Φ_{h_1}, Φ_{h_1}), and H degenerates from {h_1, h_2, h_3} into {h_1}. We thus obtain a control group with a smaller behavior space. 2) Reducing H and Ψ. Based on the control group with reduced H, we can further reduce Ψ to narrow down the behavior space. Specifically, we directly adopt an individual behavior mapping to build the behavior space as M_{H,Ψ} = {µ_ψ = softmax_τ(Φ_{h_1})}, where Ψ degenerates from {ω_1, ω_2, ω_3, τ_1, τ_2, τ_3} to {τ} and H = {h_1}. This yields a control group with the smallest behavior space. The performance of these methods is illustrated in Figs. 5; from left to right, the behavior spaces of the first three algorithms decrease in turn (according to Corollary 4 in App. C). It is evident that narrowing the behavior space by reducing H or Ψ degrades performance. Conversely, performance can be boosted by enlarging the behavior space, which could be a promising way to improve existing methods.

Behavior Selection To highlight the importance of appropriate behavior selection, we replace the meta-controller of our main algorithm with random selection. The ablation results are illustrated in Figs. 5, from which it is evident that, with the same behavior space, failing to learn an appropriate selection distribution over behaviors significantly degrades performance. We conduct a t-SNE analysis in App. K.3 to demonstrate that our method acquires more diverse behaviors than the control group with a pre-defined behavior mapping.
Another ablation study, which removes the UCB term, is conducted in App. K.3 to demonstrate that behavior diversity may be boosted by the UCB term, which encourages the agents to select more varied behavior mappings.

6. CONCLUSION

We present the first deep reinforcement learning agent to break 24 human world records in Atari using only 1B training frames. To achieve this, we propose a general framework called LBC, which enables a significantly enlarged behavior selection space by formulating a hybrid behavior mapping from all policies and constructs a unified learnable process for behavior selection. We introduce LBC into off-policy actor-critic methods and obtain a family of implementations. Extensive experiments on Atari demonstrate the effectiveness of our methods empirically. Apart from the full results, we conduct detailed ablation studies to examine the effectiveness of the proposed components. While there are many improvements and extensions to be explored going forward, we believe that LBC's ability to enhance the control process of behaviors provides a powerful platform for future research.



Figure 2: A General Architecture of Our Algorithm.

Figure 3: The learning curves in Atari. Curves are smoothed with a moving average over 5 points.

Figure 5: Ablation Results. All the results are scaled by the main algorithm to improve readability.


ACKNOWLEDGMENTS

This work is majorly supported by the National Key R&D Program of China (Grant Number 2021ZD0110400). In addition, this work is partly supported by the National Natural Science Foundation of China under Grant 62171248, and the R&D Program of Shenzhen under Grant JCYJ20220818101012025. We are very grateful for the careful reading and insightful reviews of the meta-reviewers and reviewers.

REPRODUCIBILITY STATEMENT

Open-sourced code will be implemented with Mindspore (MS, 2022) and released on our website.


* Work done as a research intern at Huawei Noah's Ark Lab.

