L2E: LEARNING TO EXPLOIT YOUR OPPONENT

Abstract

Opponent modeling is essential for exploiting sub-optimal opponents in strategic interactions. One key challenge in opponent modeling is how to adapt quickly to opponents with diverse styles of strategies. Most previous works focus on building explicit models that directly predict the opponents' styles or strategies. However, these methods require a large amount of data to train the model and lack adaptability to new opponents of unknown styles. In this work, we propose a novel Learning to Exploit (L2E) framework for implicit opponent modeling. L2E acquires the ability to exploit opponents through a few interactions with different opponents during training, so that it can quickly adapt to new opponents of unknown styles during testing. We propose a novel Opponent Strategy Generation (OSG) algorithm that automatically produces effective opponents for training. By learning to exploit the challenging opponents generated by OSG through adversarial training, L2E gradually eliminates the weaknesses of its own strategy. Moreover, L2E's generalization ability is significantly improved by training against diverse opponents, which OSG produces through diversity-regularized policy optimization. We evaluate the L2E framework on two poker games and one grid soccer game, which are commonly used benchmarks for opponent modeling. Comprehensive experimental results indicate that L2E quickly adapts to unknown opponents with diverse styles.

1. INTRODUCTION

One core research topic in modern artificial intelligence is creating agents that can interact effectively with their opponents in different scenarios. To achieve this goal, agents should be able to reason about their opponents' behaviors, goals, and beliefs. Opponent modeling, which constructs models of the opponents to reason about them, has been studied extensively over the past decades (Albrecht & Stone, 2018). In general, an opponent model is a function that takes some interaction history as input and predicts some property of interest of the opponent. Specifically, the interaction history may contain the past actions the opponent took in various situations, and the properties of interest could be the actions the opponent may take in the future, the style of the opponent (e.g., "defensive", "aggressive"), or its current goals. The resulting opponent model can inform the agent's decision-making: the agent incorporates the model's predictions into its planning procedure to optimize its interactions with the opponent. Opponent modeling has already been used in many practical applications, such as dialogue systems (Grosz & Sidner, 1986), intelligent tutoring systems (McCalla et al., 2000), and security systems (Jarvis et al., 2005). Existing opponent modeling algorithms vary greatly in their underlying assumptions and methodology. For example, policy-reconstruction-based methods (Powers & Shoham, 2005; Banerjee & Sen, 2007) explicitly fit an opponent model to reflect the opponent's observed behaviors. Type-reasoning-based methods (Dekel et al., 2004; Nachbar, 2005) reuse pre-learned models of several known opponents, selecting the one that most resembles the behavior of the current opponent. Classification-based methods (Huynh et al., 2006; Sukthankar & Sycara, 2007) build models that predict the opponent's play style and employ the counter-strategy that is effective against that particular style.
Some recent works combine opponent modeling with deep learning or reinforcement learning methods (He et al., 2016; Foerster et al., 2018; Wen et al., 2018). Although these algorithms have achieved some success, they also have obvious disadvantages. First, constructing an accurate opponent model requires a lot of data, which is problematic since in most applications the agent does not have the time or opportunity to collect enough data about its opponent. Second, most of these algorithms perform well only when the opponents at test time are similar to those used for training, and they struggle to adapt quickly to opponents with new styles. More related work on opponent modeling is discussed in Appendix A.1. To overcome these shortcomings, we propose a novel Learning to Exploit (L2E) framework for implicit opponent modeling, which has two desirable advantages. First, L2E does not build an explicit model of the opponent, so it does not require a large amount of interaction data and simultaneously eliminates modeling errors. Second, L2E can quickly adapt to new opponents of unknown styles using only a few interactions with them. The key idea underlying L2E is to train a base policy against opponents of various styles, using only a few interactions with each during training, so that it acquires the ability to exploit different opponents quickly. After training, the base policy can quickly adapt to new opponents using only a few interactions at test time. In effect, our L2E framework optimizes for a base policy that is easy and fast to adapt; it can be seen as a particular case of learning to learn, i.e., meta-learning (Finn et al., 2017). Meta-learning algorithms (cf. Appendix A.2 for details) such as MAML (Finn et al., 2017) were initially designed for single-agent environments.
They require manual design of training tasks, and the final performance largely depends on the user-specified training task distribution. The L2E framework is designed explicitly for multi-agent competitive environments and generates effective training tasks (opponents) automatically (cf. Appendix A.3 for details). Some recent works have also begun to use meta-learning for opponent modeling. Unlike these works, which use meta-learning either to predict the opponent's behaviors (Rabinowitz et al., 2018) or to handle the non-stationarity problem in multi-agent reinforcement learning (Al-Shedivat et al., 2018), we focus on improving the agent's ability to adapt quickly to unknown opponents. In our L2E framework, the base policy is explicitly trained such that a few interactions with a new opponent produce an opponent-specific policy that effectively exploits this opponent; that is, the base policy is broadly adaptable to many opponents. Specifically, if the base policy is modeled by a deep neural network, the opponent-specific policy can be obtained by fine-tuning the parameters of the base policy's network on the interaction data with the new opponent. A critical step in L2E is generating effective opponents to train the base policy. The ideal training opponents should satisfy two desiderata. 1) The opponents need to be challenging enough (i.e., hard to exploit). By learning to exploit these challenging opponents, the base policy eliminates its weaknesses and learns a more robust strategy. 2) The opponents need to be sufficiently diverse. The more diverse the training opponents, the stronger the base policy's generalization ability, and the better it adapts to new opponents. To this end, we propose a novel Opponent Strategy Generation (OSG) algorithm that automatically produces challenging and diverse opponents.
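The adapt-then-optimize structure described above can be illustrated with a minimal MAML-style sketch. This is a toy numpy illustration, not the paper's implementation: the policy is a parameter vector, the "exploitation reward" against each opponent is a hypothetical quadratic stand-in for the expected return, and all function names, step sizes, and the simplified outer gradient are assumptions made for compactness.

```python
import numpy as np

def reward(theta, opp):
    # Toy exploitation reward: highest when the policy parameters
    # exactly counter this opponent's parameters (a hypothetical
    # stand-in for the expected return of `theta` against `opp`).
    return -np.sum((theta - opp) ** 2)

def reward_grad(theta, opp):
    # Analytic gradient of the toy reward with respect to theta.
    return -2.0 * (theta - opp)

def adapt(theta, opp, alpha=0.1, steps=3):
    # Inner loop: a few gradient steps on interactions with one
    # opponent yield an opponent-specific policy (MAML-style
    # task adaptation; here, one opponent = one task).
    for _ in range(steps):
        theta = theta + alpha * reward_grad(theta, opp)
    return theta

def meta_train(opponents, iters=200, alpha=0.1, beta=0.05):
    # Outer loop: update the base policy so that the *adapted*
    # policy performs well against every training opponent.
    theta = np.zeros(2)
    for _ in range(iters):
        meta_grad = np.zeros_like(theta)
        for opp in opponents:
            adapted = adapt(theta, opp, alpha)
            # For this quadratic toy, differentiating through the
            # inner updates only rescales the gradient, so the plain
            # gradient at the adapted point is a valid ascent
            # direction (a simplification; MAML proper backpropagates
            # through the inner loop).
            meta_grad += reward_grad(adapted, opp)
        theta = theta + beta * meta_grad / len(opponents)
    return theta
```

After meta-training, `adapt(theta, new_opp)` plays the role of the few-interaction fine-tuning step: the base parameters are chosen so that a handful of gradient steps against any one opponent already produces a strong counter-policy.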
We use adversarial training to generate challenging opponents. Previous works have also obtained more robust policies through adversarial training and showed that it improves generalization (Pinto et al., 2017; Pattanaik et al., 2018). From the perspective of the base policy, given an opponent, the base policy first adjusts itself to obtain an adapted policy; the base policy is then optimized to maximize the reward the adapted policy obtains when facing the opponent. Challenging opponents are in turn generated adversarially by minimizing the base policy's adaptability, i.e., by automatically producing opponents that are difficult to exploit. These hard-to-exploit opponents are trained such that even after the base policy adapts to them, the adapted policy cannot take advantage of them. Besides, our OSG algorithm can further produce diverse training opponents through a novel diversity-regularized policy optimization procedure. Specifically, we use the Maximum Mean Discrepancy (MMD) metric (Gretton et al., 2007) to measure the differences between policies. The MMD metric is then incorporated as a regularization term into the policy optimization process to obtain a diverse set of opponent policies. Training with these challenging and diverse opponents significantly improves the robustness and generalization ability of our L2E framework. To summarize, the main contributions of this work are as follows:
• We propose a novel Learning to Exploit (L2E) framework to exploit sub-optimal opponents without building explicit models of them. L2E can quickly adapt to a new opponent of unknown style using only a few interactions.
• We propose an adversarial training procedure to generate challenging opponents automatically. These hard-to-exploit opponents help L2E eliminate its weaknesses and effectively improve its robustness.
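The MMD term used for diversity regularization can be estimated from behavior samples. As a minimal sketch (not the paper's implementation), the following numpy function estimates the squared MMD between two policies from samples of their actions, assuming a Gaussian kernel; the bandwidth `sigma` and the choice of action vectors as the sample representation are assumptions made here for illustration.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets.

    X, Y: arrays of shape (n_samples, dim), e.g. action vectors
    (or action-probability vectors) collected from two opponent
    policies on the same batch of states. Gaussian kernel with
    bandwidth `sigma` (a free hyperparameter here).
    """
    def mean_kernel(A, B):
        # Pairwise squared distances via broadcasting, then the
        # average Gaussian kernel value over all pairs.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()

    # ||mean embedding of X - mean embedding of Y||^2 in the RKHS;
    # it approaches zero iff the two sample distributions match.
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)
```

A diversity-regularized objective would then add a term such as λ · min_j mmd2(S, S_j) to a candidate opponent's optimization target, where S is the candidate's behavior samples and the S_j come from previously generated opponents, rewarding behavior that is far from all opponents produced so far.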

