LOCAL INFORMATION OPPONENT MODELLING USING VARIATIONAL AUTOENCODERS

Abstract

Modelling the behaviours of other agents (opponents) is essential for understanding how agents interact and making effective decisions. Existing methods for opponent modelling commonly assume knowledge of the local observations and chosen actions of the modelled opponents, which can significantly limit their applicability. We propose a new modelling technique based on variational autoencoders, which are trained to reconstruct the local actions and observations of the opponent from embeddings which depend only on the local observations of the modelling agent (its observed world state, chosen actions, and received rewards). The embeddings are used to augment the modelling agent's decision policy, which is trained via deep reinforcement learning; thus the policy does not require access to opponent observations. We provide a comprehensive evaluation and ablation study in diverse multi-agent tasks, showing that our method achieves comparable performance to an ideal baseline which has full access to the opponent's information, and significantly higher returns than a baseline method which does not use the learned embeddings.

1. INTRODUCTION

An important aspect of autonomous decision-making agents is the ability to reason about the unknown intentions and behaviours of other agents. Much research has been devoted to this opponent modelling problem [2], with recent works focused on the use of deep learning architectures for opponent modelling and reinforcement learning (RL) [20, 34, 16, 33]. A common assumption in existing methods is that the modelling agent has access to the local trajectory of the modelled agents [2], which may include their local observations of the environment state, their past actions, and possibly their received rewards. While it is certainly desirable to be able to observe an agent's local context in order to reason about its past and future decisions, in practice such an assumption may be too restrictive. Agents may only have a limited view of their surroundings, communication with other agents may not be feasible or reliable [40], and knowledge of the perception system of other agents may not be available [13]. In such cases, an agent must reason with only locally available information.

We consider the question: Can effective opponent modelling be achieved using only the locally available information of the modelling agent during execution? A strength of deep learning techniques is their ability to identify informative features in data. Here, we use deep learning techniques to extract informative features from a stream of local observations for the purpose of opponent modelling. Specifically, we consider multi-agent settings in which we control a single agent which must learn to interact with a set of opponent agents (we use the term "opponent" in a neutral sense). We assume a given set of possible policies for opponent agents and that these policies are fixed (that is, other agents do not simultaneously learn, as they would in multi-agent RL [32]).
We propose an opponent modelling method which is able to extract a compact yet informative representation of opponents given only the local information of the controlled agent, which includes its local state observations, past actions, and rewards. To this end, we use an encoder-decoder architecture based on variational autoencoders (VAE) [26]. The VAE model is trained to replicate opponent actions and observations from the local information only. During training, the opponent's observations are utilised as reconstruction targets for the decoder; after training, only the encoder component is retained, which generates embeddings from the local observations of the controlled agent. The learned embeddings condition the policy of the controlled agent in addition to its local observation, and the policy and VAE model are optimised concurrently during the RL learning process. We evaluate our proposed method, called Local Information Opponent Modelling (LIOM), in two benchmark environments used in multi-agent systems research: the multi-agent particle environment [31, 28] and level-based foraging (LBF) [1]. Our results support the idea that effective opponent modelling can be achieved using only local information during execution: the same RL algorithm generally achieved higher average returns when combined with our opponent embeddings than without, and in some cases the average returns are comparable to those achieved by an ideal baseline which has full access to the opponent's trajectory. We also evaluate the method's ability to predict the opponent's actions, and provide an ablation study on the different types of local information used by the encoder.
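The encoder-decoder structure described above can be illustrated with a minimal numpy sketch. All dimensions, layer shapes, and names below are illustrative assumptions (the paper does not specify the architecture here); randomly initialised linear maps stand in for the trained networks, and a mean-squared-error reconstruction term stands in for the actual reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# the local input concatenates the agent's observation, last action, and reward.
LOCAL_DIM, EMBED_DIM, OPP_DIM = 12, 4, 6

# Randomly initialised linear layers stand in for the trained networks.
W_mu = rng.normal(0, 0.1, (LOCAL_DIM, EMBED_DIM))
W_logvar = rng.normal(0, 0.1, (LOCAL_DIM, EMBED_DIM))
W_dec = rng.normal(0, 0.1, (EMBED_DIM, OPP_DIM))

def encode(local_info):
    """Encoder: map the agent's local information to a Gaussian embedding."""
    return local_info @ W_mu, local_info @ W_logvar

def reparameterise(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterisation)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Decoder (used during training only): reconstruct opponent targets."""
    return z @ W_dec

def vae_loss(local_info, opponent_target):
    """Reconstruction loss plus KL divergence to the standard normal prior."""
    mu, logvar = encode(local_info)
    z = reparameterise(mu, logvar)
    recon_loss = np.mean((decode(z) - opponent_target) ** 2)
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
    return recon_loss + kl, z

local = rng.standard_normal((1, LOCAL_DIM))        # modelling agent's local info
opp = rng.standard_normal((1, OPP_DIM))            # opponent obs/actions (training only)
loss, embedding = vae_loss(local, opp)

# At execution time only the encoder is kept: the sampled embedding is
# concatenated with the local observation to form the policy input.
policy_input = np.concatenate([local, embedding], axis=1)
```

The key point the sketch makes is that the decoder (and hence the opponent's observations and actions) is needed only to form the training loss; the policy input at execution time is built purely from locally available information.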

2. RELATED WORK

Learning Opponent Models: We are interested in opponent modelling methods that use neural networks to learn representations of the opponents. He et al. [20] proposed a method which learns a modelling network to reconstruct an opponent's actions given the opponent's observations. Raileanu et al. [34] developed an algorithm for learning to infer an opponent's intentions using the policy of the controlled agent. Grover et al. [16] proposed an encoder-decoder method for modelling the opponent's policy. The encoder learns a point-based representation of different opponent trajectories, and the decoder learns to reconstruct the opponent's policy. In addition, the authors introduced an objective to separate embeddings of different agents into different clusters. Rabinowitz et al. [33] proposed the Theory of Mind network (ToMnet), which learns embedding-based representations of opponents for meta-learning. Tacchetti et al. [42] proposed relational forward models to model opponents using graph neural networks. A common assumption in these methods, which our work aims to eliminate, is that the modelling agent has full access to the opponent's local information during execution, including their observations, chosen actions, and received rewards. Opponent modelling from local information has been researched under the I-POMDP model [13] and in research on the Poker domain. In contrast to our work, I-POMDPs utilise recursive reasoning [2], which assumes knowledge of the observation models of the modelled agents (which is not available in our setting). In the Poker domain, Johanson et al. [24] proposed Restricted Nash Response (RNR) for computing robust counter-strategies to opponents. Additionally, they generate a mixture of expert counter-strategies to various opponents. During execution, the UCB1 algorithm [4] is used to adapt and select the appropriate counter-strategy out of the mixture against each specific and previously unknown opponent. Bowling et al. [7] proposed a method for online evaluation of an agent's strategy using importance sampling to reduce the variance of the estimation. Bard et al. [5] combined several ideas from the aforementioned works to build a complete Poker agent system. Their method creates mixture-of-experts strategies, and during execution they deploy Exp4 [3] for online adaptation (selection of the best strategy from the mixture) to each opponent. The aforementioned works do not require any access to the opponent's observations during execution. The main difference between our method and these works is that the latter require a number of online adaptation episodes (to select the best strategy) against each opponent, whereas our work uses a single episode for adaptation. The work of Zintgraf et al. [44] is closely related: the authors proposed a recurrent VAE model which receives as input the observations, actions, and rewards of the controlled agent and learns a variational distribution over tasks. Rakelly et al. [35] used representations from an encoder for off-policy meta-RL. Note that all of these methods were designed for learning representations of tasks or properties of the environment. In contrast, our approach focuses on learning representations of opponents.



Representation Learning in Reinforcement Learning: Another related topic which has received significant attention is representation learning in RL. Using unsupervised learning techniques to learn low-dimensional representations of the environment state has led to significant improvements in RL. Ha and Schmidhuber [18] proposed a VAE-based model and a forward model to learn state representations of the environment. Hausman et al. [19] learned task embeddings and interpolated them to solve more difficult tasks. Igl et al. [22] used a VAE model for learning state representations in partially-observable environments. Gupta et al. [17] proposed a model which learns Gaussian embeddings to represent different tasks during meta-training and manages to quickly adapt to new tasks during meta-testing. Gregor et al. [14] developed a VAE-based model for long-term state predictions. The work of

