EXPLICITLY MAINTAINING DIVERSE PLAYING STYLES IN SELF-PLAY

Anonymous

Abstract

Self-play has proven to be an effective training scheme for obtaining a high-level agent in complex games by iteratively playing against opponents drawn from the agent's own historical versions. However, this training process may prevent it from generating a well-generalised policy, since the trained agent rarely encounters diversely-behaving opponents along its own historical path. In this paper, we aim to improve the generalisation of the policy by maintaining a population of agents with diverse playing styles and high skill levels throughout the training process. Specifically, we propose a bi-objective optimisation model that simultaneously optimises the agents' skill level and playing style. A feature of this model is that we do not treat skill level and playing style as two objectives to maximise directly, since they are not equally important (i.e., agents with diverse playing styles but low skill levels are meaningless). Instead, we create a meta bi-objective model that makes high-level agents with diverse playing styles more likely to be mutually incomparable (i.e., Pareto non-dominated), and hence to play against each other throughout the training process. We then present an evolutionary algorithm that works with the proposed model. Experiments on the classic table-tennis game Pong and the commercial role-playing game Justice Online show that our algorithm learns a well-generalised policy and at the same time provides a set of high-level policies with various playing styles.

1. INTRODUCTION

Recent years have witnessed impressive results of self-play for Deep Reinforcement Learning (DRL) in sophisticated game environments such as board games (Silver et al., 2016; 2017a; Jiang et al., 2019) and video games (Jaderberg et al., 2019; Vinyals et al., 2019; Berner et al., 2019). The idea behind self-play is to use a randomly initialised DRL agent to bootstrap itself to high-level intelligence by iteratively playing against opponents drawn from its historical versions (Silver et al., 2016). However, the training process of self-play may prevent it from obtaining a well-generalised policy, since the trained agent rarely encounters diversely-behaving opponents along its own historical path. This weakness can easily be exploited by human players. Taking OpenAI Five (Berner et al., 2019) as an example: it achieved a win rate of 99.4% in more than 7000 open Dota 2 matches¹, but the replays show that 8 of the top 9 teams that defeated OpenAI Five were in fact the same team, and the policies they used in each game were very similar. This indicates that, despite its remarkably high performance, the OpenAI Five agent still struggles in some circumstances, which human players can discover and further exploit (e.g., through meta-policies). Generally, an agent can be characterised from two aspects: skill level and playing style (Mouret & Clune, 2015). Both aspects are crucial for learning a high-level agent through self-play, because the learning gains are maximised only by playing against opponents with diverse playing styles and appropriate (i.e., not too low) skill levels. If one only considers opponents' skill levels, catastrophic forgetting can occur, in which the agent "forgets" how to play against a wide variety of opponents (Hernandez et al., 2019).
On the other hand, if one only considers playing styles, there will be many meaningless games from which the agent learns very little, because its opponents are far inferior (Laterre et al., 2018).
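To make the notion of incomparability used above concrete, the following is a minimal sketch of Pareto non-dominance between agents scored on two objectives (here, hypothetical skill and style-diversity scores chosen for illustration; the paper's actual model does not maximise these two objectives directly but a meta bi-objective derived from them):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximisation):
    a is at least as good in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(population):
    """Return the members of the population not dominated by any other member;
    these agents are mutually incomparable."""
    return [p for i, p in enumerate(population)
            if not any(dominates(q, p) for j, q in enumerate(population) if j != i)]

# Hypothetical (skill, style) scores for four agents.
agents = [(0.9, 0.2), (0.7, 0.8), (0.5, 0.5), (0.3, 0.9)]
print(non_dominated(agents))  # (0.5, 0.5) is dominated by (0.7, 0.8)
```

Under this definition, a high-skill agent with a narrow style and a slightly weaker agent with a distinctive style can both survive in the population, which is the mechanism the meta model exploits to keep diverse, high-level opponents available throughout training.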



¹ https://arena.openai.com/#/results

