ERL-Re²: EFFICIENT EVOLUTIONARY REINFORCEMENT LEARNING WITH SHARED STATE REPRESENTATION AND INDIVIDUAL POLICY REPRESENTATION

Abstract

Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithms (EA) are two major paradigms of policy optimization with distinct learning principles, i.e., gradient-based vs. gradient-free. An appealing research direction is to integrate Deep RL and EA into new methods that fuse their complementary advantages. However, existing works on combining Deep RL and EA have two common drawbacks: 1) the RL agent and EA agents learn their policies individually, neglecting the efficient sharing of useful common knowledge; 2) parameter-level policy optimization guarantees no semantic level of behavior evolution on the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re²), a novel solution to these two drawbacks. The key idea of ERL-Re² is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned collectively by all the agents; the linear policy representation provides a favorable space for efficient policy optimization, where novel behavior-level crossover and mutation operations can be performed. Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of the Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. Experiments on a range of continuous control tasks show that ERL-Re² consistently outperforms strong baselines and achieves state-of-the-art (SOTA) performance. Our code is available at https://github.com/yeshenpy/ERL-Re2.

1. INTRODUCTION

Reinforcement learning (RL) has achieved many successes in robot control (Yuan et al., 2022), game AI (Hao et al., 2022; 2019), supply chains (Ni et al., 2021), and other domains (Hao et al., 2020). With function approximation such as deep neural networks, a policy can be learned efficiently by trial and error with reliable gradient updates. However, RL is widely known to be unstable, poor in exploration, and to struggle when the gradient signals are noisy and uninformative. By contrast, Evolutionary Algorithms (EA) (Bäck & Schwefel, 1993) are a class of black-box optimization methods that have been demonstrated to be competitive with RL (Such et al., 2017). EA model natural evolution by maintaining a population of individuals and searching for favorable solutions iteratively. In each iteration, individuals with high fitness are selected to produce offspring through inheritance and variation, while those with low fitness are eliminated. Unlike RL, EA are gradient-free and offer several strengths: strong exploration ability, robustness, and stable convergence (Sigaud, 2022). Despite these advantages, one major bottleneck of EA is low sample efficiency due to the iterative evaluation of the population. This issue becomes more severe when the policy space is large (Sigaud, 2022). Since EA and RL have distinct and complementary advantages, a natural idea is to combine these two heterogeneous policy optimization approaches into better policy optimization algorithms. Many recent efforts have been made in this direction (Khadka & Tumer, 2018; Khadka et al., 2019; Bodnar et al., 2020; Wang et al., 2022; Shen et al., 2020). One representative work is ERL (Khadka & Tumer, 2018), which combines the Genetic Algorithm (GA) (Mitchell, 1998) and DDPG (Lillicrap et al., 2016). ERL maintains both an evolutionary population and an RL agent.
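The select–recombine–mutate loop described above can be sketched generically. The following is a minimal, illustrative GA for real-valued genomes (all names and hyperparameters are ours, not ERL's actual implementation):

```python
import random

def evolve(population, fitness_fn, elite_frac=0.5, mut_std=0.1, generations=10):
    """One illustrative GA loop: evaluate, keep elites, recombine, mutate."""
    for _ in range(generations):
        # Rank individuals (lists of floats) by fitness; keep the top fraction.
        scored = sorted(population, key=fitness_fn, reverse=True)
        elites = scored[: max(2, int(elite_frac * len(scored)))]
        offspring = []
        while len(offspring) < len(population) - len(elites):
            p1, p2 = random.sample(elites, 2)
            # Uniform crossover followed by Gaussian mutation.
            child = [a if random.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [g + random.gauss(0.0, mut_std) for g in child]
            offspring.append(child)
        # Low-fitness individuals are eliminated; elites survive unchanged.
        population = elites + offspring
    return max(population, key=fitness_fn)
```

Because elites survive each generation unchanged, the best fitness in the population is non-decreasing over generations; this elitism is what makes plain GA stable but sample-hungry, since every new individual must be evaluated in the environment.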
The population and the RL agent interact within a coherent framework: the RL agent learns by DDPG from diverse off-policy experiences collected by the population, while the population periodically receives a copy of the RL agent, among which genetic evolution operates. In this way, EA and RL cooperate during policy optimization. Subsequently, many variants and improvements of ERL have been proposed, e.g., incorporating the Cross-Entropy Method (CEM) rather than GA (Pourchot & Sigaud, 2019), devising gradient-based genetic operators (Gangwani & Peng, 2018), and using multiple parallel RL agents (Khadka et al., 2019). However, we observe that most existing methods seldom break the performance ceiling of either their EA or RL components (e.g., Swimmer and Humanoid on MuJoCo are dominated by EA and RL, respectively). This indicates that the strengths of EA and RL are not sufficiently blended. We attribute this to two major drawbacks. First, each EA and RL agent learns its policy individually. The state representations learned by individuals are inevitably redundant yet specialized (Dabney et al., 2021), slowing down learning and limiting convergence performance. Second, typical evolutionary variation occurs at the parameter level (e.g., network weights). It guarantees no semantic level of evolution and may induce policy crash (Bodnar et al., 2020). In the literature on linear-approximation RL (Sutton & Barto, 1998) and state representation learning (Chung et al., 2019; Dabney et al., 2021; Kumar et al., 2021), a policy is usually understood as the composition of nonlinear state features and linear policy weights. Taking this inspiration, we propose a new approach named Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re²) to address the aforementioned two drawbacks.
ERL-Re² is devised based on a novel concept, two-scale representation: all EA and RL agents maintained in ERL-Re² are composed of a shared nonlinear state representation and an individual linear policy representation. The shared state representation is responsible for learning general and expressive features of the environment that are not specific to any single policy, e.g., common decision-related knowledge. In particular, it is learned by following a unifying update direction derived from value function maximization over all EA and RL agents collectively. Thanks to the expressivity of the shared state representation, the individual policy representation can take a simple linear form. This leads to a fundamental distinction of ERL-Re²: evolution and reinforcement occur in the linear policy representation space rather than in a nonlinear parameter space (e.g., policy network weights) as is conventional. Thus, policy optimization can be more efficient with ERL-Re². In addition, we propose novel behavior-level crossover and mutation operators that impose variations on designated dimensions of the action while incurring no interference on the others. Compared to parameter-level operators, our behavior-level operators have clear genetic semantics of behavior and are thus more effective and stable. Moreover, we further reduce the sample cost of EA by introducing a new surrogate of fitness, based on the convenient incorporation of the Policy-extended Value Function Approximator (PeVFA) enabled by the linear policy representations. Without loss of generality, we use GA and TD3 (and DDPG) as the concrete choices of EA and RL algorithms. Finally, we evaluate ERL-Re² on MuJoCo continuous control tasks against strong ERL baselines and typical RL algorithms, along with a comprehensive study including ablations and hyperparameter analysis.
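To make the two-scale decomposition concrete, here is a minimal NumPy sketch in which a shared nonlinear feature map stands in for the learned state representation, each agent is a linear weight matrix on top of it, and crossover/mutation act on whole action dimensions (rows of the weight matrix). This is an illustration under our own simplifying assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM, ACT_DIM = 8, 16, 3

# Shared nonlinear state representation phi(s).  Here it is a fixed random
# feature map; in ERL-Re² it would be a trained network shared by all agents.
W_shared = rng.standard_normal((FEAT_DIM, STATE_DIM))
def phi(s):
    return np.tanh(W_shared @ s)

# Each agent keeps only an individual linear policy representation W_i,
# so its action is a linear readout of the shared features.
def act(W_i, s):
    return np.tanh(W_i @ phi(s))

# Behavior-level crossover: swap the rows of two policies for a chosen
# subset of action dimensions, leaving all other dimensions untouched.
def behavior_crossover(W_a, W_b, dims):
    child_a, child_b = W_a.copy(), W_b.copy()
    child_a[dims], child_b[dims] = W_b[dims], W_a[dims]
    return child_a, child_b

# Behavior-level mutation: perturb only the designated action dimensions.
def behavior_mutation(W_i, dims, std=0.1):
    child = W_i.copy()
    child[dims] += std * rng.standard_normal(child[dims].shape)
    return child
```

Because each action dimension depends only on its own row of `W_i`, swapping or perturbing a row changes exactly that dimension of the behavior, which is the semantic guarantee that parameter-level operators on entangled network weights cannot provide.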
We summarize our major contributions below: 1) We propose ERL-Re², a novel approach to integrating EA and RL based on the concept of two-scale representation; 2) We devise behavior-level crossover and mutation operators with clear genetic semantics; 3) We empirically show that ERL-Re² outperforms related methods and achieves state-of-the-art performance.

2. BACKGROUND

Reinforcement Learning Consider a Markov decision process (MDP) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, T \rangle$. At each step $t$, the agent uses a policy $\pi$ to select an action $a_t \sim \pi(s_t) \in \mathcal{A}$ according to the state $s_t \in \mathcal{S}$; the environment transits to the next state $s_{t+1}$ according to the transition function $\mathcal{P}(s_t, a_t)$, and the agent receives a reward $r_t = \mathcal{R}(s_t, a_t)$. The return is defined as the discounted cumulative reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where $\gamma \in [0, 1)$ is the discount factor and $T$ is the maximum episode horizon. The goal of RL is to learn an optimal policy $\pi^*$ that maximizes the expected return. DDPG (Lillicrap et al., 2016) is a representative off-policy Actor-Critic algorithm, consisting of a deterministic policy $\pi_\omega$ (i.e., the actor) and a state-action value function approximation $Q_\psi$ (i.e., the critic), with parameters $\omega$ and $\psi$ respectively. The critic is optimized with the Temporal Difference (TD) loss (Sutton & Barto, 1998) and the actor is updated by maximizing the estimated Q value. The loss functions are defined as $L(\psi) = \mathbb{E}_D[(r + \gamma Q_{\psi'}(s', \pi_{\omega'}(s')) - Q_\psi(s, a))^2]$ and $L(\omega) = -\mathbb{E}_D[Q_\psi(s, \pi_\omega(s))]$, where the experiences $(s, a, r, s')$ are sampled from the replay buffer $D$, and $\psi'$ and $\omega'$ are the parameters of the target networks. TD3 (Fujimoto et al., 2018) improves DDPG by addressing the overestimation issue, mainly via clipped double-Q learning.
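The two DDPG losses above can be computed directly from a batch of transitions. The sketch below uses toy linear approximators in place of the actor and critic networks to keep the structure of the losses visible (the function names and shapes are illustrative only):

```python
import numpy as np

GAMMA = 0.99  # discount factor

# Toy linear stand-ins for the critic Q_psi(s, a) and the actor pi_omega(s).
def q_value(psi, s, a):
    return psi @ np.concatenate([s, a])

def policy(omega, s):
    return np.tanh(omega @ s)

def ddpg_losses(psi, omega, psi_targ, omega_targ, batch):
    """Mean critic TD loss and actor loss over a batch of (s, a, r, s') tuples."""
    critic_loss, actor_loss = 0.0, 0.0
    for s, a, r, s_next in batch:
        # Critic target uses the *target* networks: y = r + gamma * Q'(s', pi'(s')).
        y = r + GAMMA * q_value(psi_targ, s_next, policy(omega_targ, s_next))
        critic_loss += (y - q_value(psi, s, a)) ** 2
        # Actor maximizes Q(s, pi(s)), i.e., minimizes its negation.
        actor_loss += -q_value(psi, s, policy(omega, s))
    n = len(batch)
    return critic_loss / n, actor_loss / n
```

In an actual implementation, `psi` and `omega` would be network parameters updated by gradient descent on these losses, and the target parameters `psi_targ`, `omega_targ` would track them via slow (Polyak) averaging.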

