EXPLICITLY MAINTAINING DIVERSE PLAYING STYLES IN SELF-PLAY

Anonymous

Abstract

Self-play has proven to be an effective training scheme for obtaining a high-level agent in complex games by iteratively playing against an opponent drawn from the agent's historical versions. However, its training process may prevent it from generating a well-generalised policy, since the trained agent rarely encounters diversely-behaving opponents along its own historical path. In this paper, we aim to improve the generalisation of the policy by maintaining a population of agents with diverse playing styles and high skill levels throughout the training process. Specifically, we propose a bi-objective optimisation model to simultaneously optimise the agents' skill levels and playing styles. A feature of this model is that we do not regard the skill level and the playing style as two objectives to maximise directly, since they are not equally important (i.e., agents with diverse playing styles but low skill levels are meaningless). Instead, we create a meta bi-objective model that makes high-level agents with diverse playing styles more likely to be incomparable (i.e., Pareto non-dominated), so that they play against each other throughout the training process. We then present an evolutionary algorithm working with the proposed model. Experiments in the classic table tennis game Pong and the commercial role-playing game Justice Online show that our algorithm can learn a well-generalised policy and, at the same time, provide a set of high-level policies with various playing styles.

1. INTRODUCTION

Recent years have witnessed impressive results of self-play for Deep Reinforcement Learning (DRL) in sophisticated game environments such as various board games (Silver et al., 2016; 2017a; Jiang et al., 2019) and video games (Jaderberg et al., 2019; Vinyals et al., 2019; Berner et al., 2019). The idea behind self-play is to use a randomly initialised DRL agent to bootstrap itself to high-level intelligence by iteratively playing against an opponent drawn from its own historical versions (Silver et al., 2016). However, the training process of self-play may prevent it from obtaining a well-generalised policy, since the trained agent rarely encounters diversely-behaving opponents along its own historical path. This can easily be taken advantage of by human players. Taking OpenAI Five (Berner et al., 2019) as an example, it achieved a win rate of 99.4% in more than 7,000 open Dota 2 matches[1], but the replays show that 8 of the top 9 teams that defeated OpenAI Five are in fact the same team, and the policies they used in each game are very similar. This indicates that despite its remarkably high performance, the OpenAI Five agent still struggles in some circumstances, which can be found and further exploited by human players (e.g., through meta-policies). Generally, an agent can be characterised from two aspects: its skill level and its playing style (Mouret & Clune, 2015). These two aspects are crucial for learning a high-level agent in the self-play training process, because only playing against opponents with diverse playing styles and appropriate skill levels (i.e., not too low) can maximise the gains of learning. If one only considers opponents' skill levels, a catastrophic forgetting problem can arise in which the agent "forgets" how to play against a wide variety of opponents (Hernandez et al., 2019).
On the other hand, if one only considers playing styles, there will be many meaningless games in which the agent learns very little from far inferior opponents (Laterre et al., 2018). Unfortunately, it is intrinsically challenging to strike a good balance between skill levels and playing styles in self-play. During the training process, the network weights of the DRL agent are usually optimised by a gradient-based method, which progresses along a single path that depends on the random seeds of the network and the environment. At each iteration, the incumbent agent is the only response policy for its historical versions. Such single-path optimisation is very unlikely to experience sufficiently diverse opponents, especially in a sophisticated environment. This may make common self-play algorithms, which use a probability function to decide which opponents to play against (e.g., the latest version or past versions (Berner et al., 2019; Oh et al., 2019)), unable to generalise their policies, i.e., they struggle to cope with opponents that are very different from those they have encountered before. A viable way to introduce diverse playing styles into self-play is the population-based approach, where a population of agents/opponents is maintained during the training process, each potentially representing one playing style. The population-based approach has already been used frequently in DRL (Jung et al., 2020; Carroll et al., 2019; Parker-Holder et al., 2020; Zhao et al., 2021). For example, Population-Based Training (PBT) (Jaderberg et al., 2017; 2019; Li et al., 2019; Liu et al., 2019) optimises a population of networks at the same time, allowing the optimal hyperparameters and model to be found quickly.
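The opponent-selection mechanism mentioned above can be sketched as follows (a minimal illustration; the sampling ratio `p_latest` and the list-based snapshot pool are assumptions for exposition, not the scheme of any particular system):

```python
import random

def sample_opponent(history, p_latest=0.8, rng=random):
    """Sample a self-play opponent from the agent's own historical snapshots.

    With probability p_latest, return the most recent snapshot; otherwise
    return a uniformly random older snapshot. Real systems differ in the
    exact probabilities and in how past versions are weighted.
    """
    assert history, "need at least one historical snapshot"
    if len(history) == 1 or rng.random() < p_latest:
        return history[-1]
    return rng.choice(history[:-1])
```

Note that every entry of `history` comes from the same gradient path, so even uniform sampling over past versions diversifies skill level at best, not playing style, which is precisely the gap the population-based approach targets.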
Neuroevolution (Heidrich-Meisner & Igel, 2009; Such et al., 2017; Salimans et al., 2017; Stanley et al., 2019) uses population-based evolutionary search (e.g., genetic algorithms and evolution strategies) to generate the agents' network parameters and topology. In these population-based methods, an interesting idea for promoting diversity of the agents' behaviours is to proactively search for "novel" behaviours. This can be very useful since maintaining a population of behaviours does not necessarily mean diversifying them over the search space (Jaderberg et al., 2019). This is particularly true in sparse/deceptive reward problems (Salimans et al., 2017; Conti et al., 2018), where the reward function may provide useless/misleading feedback, leading the agent to get stuck and fail to learn properly (Lehman & Stanley, 2011; Ecoffet et al., 2019). Such proactive novelty-search techniques include novelty search (Conti et al., 2018; Lehman & Stanley, 2011), intrinsic motivation (Bellemare et al., 2016), count-based exploration (Ostrovski et al., 2017; Tang et al., 2017), variational information maximisation (Houthooft et al., 2016), curiosity-driven learning (Baranes & Oudeyer, 2013; Forestier et al., 2017), multi-behaviour search (Mouret & Doncieux, 2009; Shen et al., 2020) and quality-diversity (Cully & Demiris, 2018). Based on historical information from the environment, these techniques motivate the agent to visit unexplored states in order to accumulate higher rewards (Conti et al., 2018; Ecoffet et al., 2019; Guo & Brunskill, 2019). For example, quality-diversity algorithms use domain-dependent behaviour characterisations to abstractly describe the agent's behaviour trajectory and encourage the agent to uncover as many diverse behaviour niches as possible, with each niche being represented by its highest-level agent (Mouret & Clune, 2015; Pugh et al., 2016).
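To make the quality-diversity idea concrete, the niche-keeping step can be sketched as below (in the spirit of MAP-Elites; the 2-D behaviour characterisation in [0, 1] and the bin count are illustrative assumptions, not a specific algorithm's settings):

```python
def bc_to_niche(bc, bins=10):
    """Discretise a behaviour characterisation (each value in [0, 1]) into a niche key."""
    return tuple(min(int(v * bins), bins - 1) for v in bc)

def update_archive(archive, agent, fitness, bc, bins=10):
    """Keep only the highest-fitness agent (the 'elite') seen in each behaviour niche."""
    key = bc_to_niche(bc, bins)
    if key not in archive or fitness > archive[key][1]:
        archive[key] = (agent, fitness)
```

The archive grows one elite per discovered niche, which diversifies behaviour but says nothing about whether the elites of different niches have comparable skill levels.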
However, such proactive novelty search may not always be promising, since the novel behaviours we search for do not always come with high skill levels. When it comes to population-based self-play, a game can be meaningless when the difference between agents' skill levels is too big, even if their playing styles are very different. Indeed, what we effectively need is a population of high-level, diverse-style agents which play against each other throughout the training process. To this end, this paper proposes a novel Bi-Objective (BiO) optimisation model to optimise skill levels and playing styles. One feature of this model is that we do not regard these two aspects as objectives to maximise directly; rather, we create a meta bi-objective model that makes high-level agents with diverse playing styles more likely to be incomparable (i.e., Pareto non-dominated with respect to each other), and thus always kept in the training process. Specifically, in BiO each objective is composed of two components. The first component relates to the agent's skill level and is the same for both objectives, while the second component relates to the agent's playing style and is made completely conflicting between the two objectives. As such, the Pareto optimal solutions of BiO are typically those far away from each other in playing style but all with reasonably good skill levels (this will be explained in detail in Section 3). We propose an evolutionary algorithm to work with the proposed model. We follow the basic framework of multi-objective evolutionary algorithms, but with customised components for self-play.
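To illustrate the intended effect with an assumed instantiation (not necessarily the exact formulation of Section 3): suppose each agent has a scalar skill score Q and a signed style coordinate D, and the two objectives share the skill term while carrying the style term with opposite signs, e.g. f1 = Q + D and f2 = Q - D. Then two strong agents with opposite styles are mutually non-dominated, while a weak agent is dominated regardless of its style:

```python
def objectives(skill, style):
    """Meta bi-objective: a shared skill term plus a completely conflicting style term."""
    return (skill + style, skill - style)

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximisation)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

strong_aggressive = objectives(skill=10.0, style=+3.0)  # (13.0, 7.0)
strong_defensive = objectives(skill=10.0, style=-3.0)   # (7.0, 13.0)
weak_aggressive = objectives(skill=2.0, style=+3.0)     # (5.0, -1.0)
```

Neither strong agent dominates the other, so both survive Pareto-based selection, whereas the weak agent is dominated by the strong agent sharing its style and is discarded: precisely the high-level, diverse-style population the model is after.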



[1] https://arena.openai.com/#/results

