ADVERSARIAL POLICIES BEAT SUPERHUMAN GO AIS

Abstract

We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo; in fact, they are easily beaten by human amateurs. Instead, they win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://go-attack-iclr.netlify.app/.

1. INTRODUCTION

Reinforcement learning from self-play has achieved superhuman performance in a range of games including Go (Silver et al., 2016), chess and shogi (Silver et al., 2018), and Dota (OpenAI et al., 2019). Moreover, idealized versions of self-play provably converge to Nash equilibria (Brown, 1951; Heinrich et al., 2015). Although realistic versions of self-play may not always converge, the strong empirical performance of self-play suggests that this is rarely an issue in practice. Nonetheless, prior work has found that seemingly capable continuous-control policies trained via self-play can be exploited by adversarial policies (Gleave et al., 2020; Wu et al., 2021), suggesting that self-play may not be as robust as previously thought. However, although those victim agents were state-of-the-art for continuous control, they were still well below human performance. This raises the question: are adversarial policies a vulnerability of self-play policies in general, or simply an artifact of insufficiently capable policies?

To answer this, we study a domain where self-play has achieved very strong performance: Go. Specifically, we train adversarial policies end-to-end to attack KataGo (Wu, 2019), the strongest publicly available Go-playing AI system. Using less than 5% of the compute used to train KataGo, we obtain adversarial policies that win >99% of the time against KataGo with no search, and >77% of the time against KataGo with enough search to be superhuman. Critically, our adversaries do not win by learning a generally capable Go policy. Instead, they trick KataGo into making serious blunders that cost it the game (Figure 1). Despite being able to beat KataGo, our adversarial policies lose to even amateur Go players (see Appendix F.1). KataGo is thus less robust than human amateurs in this setting, despite having superhuman capabilities. This is a striking example of non-transitivity, illustrated in Figure 2.
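The core asymmetry can be caricatured with a toy example: because the victim is frozen, the adversary only needs to learn a best response to that single fixed policy, not to play the game well in general. The sketch below illustrates this with a deliberately trivial game; all names are hypothetical and nothing here reflects KataGo's actual interface or the paper's training method.

```python
import random

# Toy illustration of exploiting a *frozen* victim policy.
# The adversary estimates the victim's (fixed) action distribution
# from observed games, then plays a deterministic best response.

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # BEATS[a] beats a

def frozen_victim(rng):
    """A fixed, heavily biased victim policy (stands in for a frozen network)."""
    return rng.choices(ACTIONS, weights=[0.8, 0.1, 0.1])[0]

def train_adversary(num_games=5000, seed=0):
    """Fit the victim's empirical action frequencies; counter the modal action."""
    rng = random.Random(seed)
    counts = {a: 0 for a in ACTIONS}
    for _ in range(num_games):
        counts[frozen_victim(rng)] += 1
    most_likely = max(counts, key=counts.get)
    return BEATS[most_likely]

def win_rate(adversary_action, num_games=5000, seed=1):
    """Fraction of games the fixed adversary move beats the victim's move."""
    rng = random.Random(seed)
    wins = sum(BEATS[frozen_victim(rng)] == adversary_action
               for _ in range(num_games))
    return wins / num_games

best_response = train_adversary()
print(best_response, win_rate(best_response))
```

Note that this adversary has learned nothing about the game in general: any opponent who varies their strategy beats it trivially, mirroring the non-transitivity described above, where the adversary beats KataGo yet loses to human amateurs.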
Our adversaries have no special powers: like a regular player, they can only place stones or pass. We do, however, give our adversaries access to the victim network they are attacking. In particular, we train our adversaries with an AlphaZero-style training process (Silver et al., 2018), similar to that of KataGo. The key differences are that we collect games with the adversary playing the victim, and that we use the victim network to select the victim's moves during the adversary's tree search.

KataGo is the strongest publicly available Go AI system at the time of writing. With search, KataGo is strongly superhuman, winning against ELF OpenGo (Tian et al., 2019) and Leela Zero (Pascutto, 2019) (Wu, 2019, Section 5.1), which are themselves superhuman. In Appendix D, we estimate that KataGo without search plays at the level of a top-100 European player, and that KataGo with 2048 visits per move of search is much stronger than any human.

Our paper makes three contributions. First, we propose a novel attack method, hybridizing the attack of Gleave et al. (2020) and AlphaZero-style training (Silver et al., 2018). Second, we demon-

