ADVERSARIAL POLICIES BEAT SUPERHUMAN GO AIS

Abstract

We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo; in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://go-attack-iclr.netlify.app/.

1. INTRODUCTION

Reinforcement learning from self-play has achieved superhuman performance in a range of games including Go (Silver et al., 2016), chess and shogi (Silver et al., 2018), and Dota (OpenAI et al., 2019). Moreover, idealized versions of self-play provably converge to Nash equilibria (Brown, 1951; Heinrich et al., 2015). Although realistic versions of self-play may not always converge, the strong empirical performance of self-play suggests this is rarely an issue in practice. Nonetheless, prior work has found that seemingly highly capable continuous control policies trained via self-play can be exploited by adversarial policies (Gleave et al., 2020; Wu et al., 2021), suggesting that self-play may not be as robust as previously thought. However, although those victim agents are state-of-the-art for continuous control, they are still well below human performance. This raises the question: are adversarial policies a vulnerability of self-play policies in general, or simply an artifact of insufficiently capable policies?

To answer this, we study a domain where self-play has achieved very strong performance: Go. Specifically, we train adversarial policies end-to-end to attack KataGo (Wu, 2019), the strongest publicly available Go-playing AI system. Using less than 5% of the compute used to train KataGo, we obtain adversarial policies that win >99% of the time against KataGo with no search, and >77% of the time against KataGo with enough search to be superhuman. Critically, our adversaries do not win by learning a generally capable Go policy. Instead, they trick KataGo into making serious blunders that cost it the game (Figure 1). Despite being able to beat KataGo, our adversarial policies lose to even amateur Go players (see Appendix F.1). KataGo is thus in this instance less robust than human amateurs, despite having superhuman capabilities. This is a striking example of non-transitivity, illustrated in Figure 2.
Our adversaries have no special powers: they can only place stones, or pass, like a regular player. We do, however, give our adversaries access to the victim network they are attacking. In particular, we train our adversaries using an AlphaZero-style training process (Silver et al., 2018), similar to that of KataGo. The key differences are that we collect games with the adversary playing the victim, and that we use the victim network to select the victim's moves during the adversary's tree search.

KataGo is the strongest publicly available Go AI system at the time of writing. With search, KataGo is strongly superhuman, winning (Wu, 2019, Section 5.1) against ELF OpenGo (Tian et al., 2019) and Leela Zero (Pascutto, 2019), which are themselves superhuman. In Appendix D, we estimate that KataGo without search plays at the level of a top-100 European player, and that KataGo with 2048 visits per move of search is much stronger than any human.

Figure 1: Two randomly sampled games against the strongest policy network, Latest. (a) The adversary wins as black by tricking the victim (Latest, no search) into passing prematurely, ending the game. The adversary then passes in turn, ending the game with the adversary winning under the Tromp-Taylor ruleset for computer Go (Tromp, 2014) that KataGo was trained and configured to use (see Appendix A). The adversary gets points for its territory in the top-right corner (devoid of victim stones), whereas the victim does not get points for the territory in the bottom-left due to the presence of the adversary's stones. (b) A different adversarial policy wins as white against a superhuman-level victim (Latest_def, 2048 visits) that is immunized against the "passing trick": it lures the victim into letting a large group of victim stones (X) get captured by the adversary's next move (∆). Appendix F.3 has a more detailed description of this adversary's behavior.

Our paper makes three contributions. First, we propose a novel attack method, hybridizing the attack of Gleave et al. (2020) and AlphaZero-style training (Silver et al., 2018). Second, we demonstrate the existence of two distinct adversarial policies against the state-of-the-art Go AI system, KataGo. Finally, we provide a detailed empirical investigation into these adversarial policies, including a qualitative analysis of their game play. Our open-source implementation is available at -anonymized-.
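The key modification described above (using the frozen victim network to choose the victim's moves inside the adversary's tree search) can be illustrated with a minimal sketch: a toy Monte-Carlo tree search in which victim-to-move states are fast-forwarded by the victim's policy rather than searched. This is not KataGo's or our actual implementation; the ToyGame interface, the dict-returning policy stubs, and the greedy victim model are all assumptions made for the sketch, and the real systems additionally use value networks, exploration noise, and batched neural-network evaluation.

```python
import math

def victim_play_mcts(root_state, adversary_net, victim_net, game,
                     n_sims=200, c_puct=1.5):
    """AlphaZero-style search for the adversary, with one key change:
    at victim-to-move states we do not search, we advance the state with
    the frozen victim network's top move, treating the victim as part of
    the environment. All interfaces here (game.to_play, game.legal_moves,
    game.apply, game.terminal_value, and policies returning move->prior
    dicts) are hypothetical stand-ins for this sketch."""
    N, W, P = {}, {}, {}  # visit counts, total values, cached priors

    def victim_reply(s):
        # Victim moves come from the frozen victim policy, not from search.
        while game.terminal_value(s) is None and game.to_play(s) == "victim":
            priors = victim_net(s)
            s = game.apply(s, max(priors, key=priors.get))
        return s

    def simulate(s):
        s = victim_reply(s)
        v = game.terminal_value(s)
        if v is not None:
            return v                      # value from the adversary's view
        if s not in P:                    # leaf: expand with adversary net
            P[s] = adversary_net(s)
            for m in game.legal_moves(s):
                N[(s, m)], W[(s, m)] = 0, 0.0
            return 0.0                    # toy policies have no value head
        total = sum(N[(s, m)] for m in game.legal_moves(s))
        def puct(m):
            q = W[(s, m)] / N[(s, m)] if N[(s, m)] else 0.0
            u = c_puct * P[s].get(m, 0.0) * math.sqrt(total + 1) / (1 + N[(s, m)])
            return q + u
        m = max(game.legal_moves(s), key=puct)
        v = simulate(game.apply(s, m))
        N[(s, m)] += 1
        W[(s, m)] += v
        return v

    for _ in range(n_sims):
        simulate(root_state)
    return max(game.legal_moves(root_state),
               key=lambda m: N[(root_state, m)])

# Toy two-ply game: the adversary moves first, then the victim replies.
# The adversary wins (+1.0) only if the victim blunders by answering
# move 1 with move 0; the frozen victim policy makes exactly that blunder.
class ToyGame:
    def to_play(self, s):
        return "adversary" if len(s) % 2 == 0 else "victim"
    def legal_moves(self, s):
        return [0, 1]
    def apply(self, s, m):
        return s + (m,)
    def terminal_value(self, s):
        if len(s) < 2:
            return None
        return 1.0 if s == (1, 0) else -1.0

adversary_net = lambda s: {0: 0.5, 1: 0.5}  # uniform priors
victim_net = lambda s: {0: 0.9, 1: 0.1} if s == (1,) else {0: 0.1, 1: 0.9}

best = victim_play_mcts((), adversary_net, victim_net, ToyGame())
# The search concentrates visits on move 1, which steers play toward the
# state where the frozen victim policy blunders.
```

Run on this toy game, the search returns move 1: the adversary's preferred move is the one that leads the frozen victim into a losing reply, mirroring at small scale how the trained adversaries exploit KataGo.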

2. RELATED WORK

Our work is inspired by the presence of adversarial examples in a wide variety of models (Szegedy et al., 2014). Notably, many image classifiers reach or surpass human performance (Ho-Phuoc, 2018; Russakovsky et al., 2015; Shankar et al., 2020; Pham et al., 2021). Yet even these state-of-the-art image classifiers are vulnerable to adversarial examples (Carlini et al., 2019; Ren et al., 2020). This raises the question: could highly capable deep RL policies be similarly vulnerable?

One might hope that the adversarial nature of self-play training would naturally lead to robustness. An analogous strategy works for image classifiers, where adversarial training is an effective if computationally expensive defense (Madry et al., 2018; Ren et al., 2020). This view is further bolstered by the fact that idealized versions of self-play provably converge to a Nash equilibrium, which is unexploitable (Brown, 1951; Heinrich et al., 2015). However, our work finds that in practice even state-of-the-art, professional-level deep RL policies are still vulnerable to exploitation.

It is known that self-play may not converge in non-transitive games (Balduzzi et al., 2019). However, Czarnecki et al. (2020) have argued that real-world games like Go grow increasingly transitive as skill increases. This would imply that while self-play may struggle with non-transitivity early in training, comparisons involving highly capable policies such as KataGo should be mostly transitive. By contrast, we find a striking non-transitivity: our adversaries exploit KataGo agents that beat human professionals, yet even an amateur Go player can beat our adversaries (Appendix F.1).

