L2E: LEARNING TO EXPLOIT YOUR OPPONENT

Abstract

Opponent modeling is essential for exploiting sub-optimal opponents in strategic interactions. One key challenge in opponent modeling is how to adapt quickly to opponents with diverse styles of strategies. Most previous works focus on building explicit models that predict the opponents' styles or strategies directly. However, these methods require a large amount of data to train the model and lack adaptability to new opponents of unknown styles. In this work, we propose a novel Learning to Exploit (L2E) framework for implicit opponent modeling. L2E acquires the ability to exploit opponents through a few interactions with different opponents during training, so that it can quickly adapt to new opponents with unknown styles during testing. We propose a novel Opponent Strategy Generation (OSG) algorithm that automatically produces effective opponents for training. By learning to exploit the challenging opponents generated by OSG through adversarial training, L2E gradually eliminates the weaknesses of its own strategy. Moreover, the generalization ability of L2E is significantly improved by training against diverse opponents, which OSG produces through diversity-regularized policy optimization. We evaluate the L2E framework on two poker games and one grid soccer game, which are commonly used benchmarks for opponent modeling. Comprehensive experimental results indicate that L2E quickly adapts to diverse styles of unknown opponents.

1. INTRODUCTION

One core research topic in modern artificial intelligence is creating agents that can interact effectively with their opponents in different scenarios. To achieve this goal, the agents should have the ability to reason about their opponents' behaviors, goals, and beliefs. Opponent modeling, which constructs models of the opponents to reason about them, has been extensively studied over the past decades (Albrecht & Stone, 2018). In general, an opponent model is a function that takes some interaction history as its input and predicts some property of interest of the opponent. Specifically, the interaction history may contain the past actions that the opponent took in various situations, and the properties of interest could be the actions that the opponent may take in the future, the style of the opponent (e.g., "defensive", "aggressive"), or its current goals. The resulting opponent model can inform the agent's decision-making: the model's predictions are incorporated into the agent's planning procedure to optimize its interactions with the opponent. Opponent modeling has already been used in many practical applications, such as dialogue systems (Grosz & Sidner, 1986), intelligent tutoring systems (McCalla et al., 2000), and security systems (Jarvis et al., 2005). Opponent modeling algorithms vary greatly in their underlying assumptions and methodology. For example, policy reconstruction based methods (Powers & Shoham, 2005; Banerjee & Sen, 2007) explicitly fit an opponent model to reflect the opponent's observed behaviors. Type reasoning based methods (Dekel et al., 2004; Nachbar, 2005) reuse pre-learned models of several known opponents by finding the one that most resembles the behavior of the current opponent. Classification based methods (Huynh et al., 2006; Sukthankar & Sycara, 2007) build models that predict the play style of the opponent and employ the counter-strategy that is effective against that particular style.
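To make the abstraction concrete, the "opponent model as a function from interaction history to a predicted property" can be sketched as a trivial frequency-count model. This toy example is purely illustrative and is not part of the paper; the situation and action names are hypothetical.

```python
from collections import Counter

def fit_opponent_model(history):
    """Fit a trivial frequency-based opponent model.

    `history` is a list of (situation, action) pairs observed from the
    opponent. The returned model predicts the action the opponent took
    most often in a given situation. This is a toy illustration of the
    general "history -> predicted property" interface, not the method
    proposed in the paper.
    """
    counts = {}
    for situation, action in history:
        counts.setdefault(situation, Counter())[action] += 1

    def predict(situation):
        if situation not in counts:
            return None  # no data for this situation: model is uninformative
        return counts[situation].most_common(1)[0][0]

    return predict

# A hypothetical opponent that usually raises with a strong hand:
model = fit_opponent_model([
    ("strong_hand", "raise"),
    ("strong_hand", "raise"),
    ("strong_hand", "call"),
    ("weak_hand", "fold"),
])
```

A planner could then query `model(situation)` to anticipate the opponent's most likely action; the explicit methods cited above differ mainly in how much richer the fitted model is.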
Some recent works combine opponent modeling with deep learning or reinforcement learning methods and propose a number of related algorithms (He et al., 2016; Foerster et al., 2018; Wen et al., 2018). Although these algorithms have achieved some success, they also have obvious disadvantages. First, constructing accurate opponent models requires a large amount of data, which is problematic since the agent does not have the time or opportunity to collect enough data about its opponent in

