LEARNING INTUITIVE POLICIES USING ACTION FEATURES

Abstract

An unaddressed challenge in multi-agent coordination is to enable AI agents to exploit the semantic relationships between the features of actions and the features of observations. Humans take advantage of these relationships in highly intuitive ways. For instance, in the absence of a shared language, we might point to the object we desire or hold up our fingers to indicate how many objects we want. To address this challenge, we investigate the effect of network architecture on the propensity of learning algorithms to exploit these semantic relationships. In a procedurally generated coordination task, we find that attention-based architectures that jointly process a featurized representation of observations and actions have a better inductive bias for learning intuitive policies. Through fine-grained evaluation and scenario analysis, we show that the resulting policies are human-interpretable. Moreover, such agents coordinate with people without training on any human data.

1. INTRODUCTION

Successful collaboration between agents requires coordination (Tomasello et al., 2005; Misyak et al., 2014; Kleiman-Weiner et al., 2016), which is challenging because coordinated strategies can be arbitrary (Lewis, 1969; Young, 1993; Lerer & Peysakhovich, 2018). A priori, one can neither deduce which side of the road to drive on nor which utterance to use to refer to ♡ (Pal et al., 2020). In these cases, coordination can arise from actors best responding to what others are already doing, i.e., following a convention. For example, Americans drive on the right side of the road and say "heart" to refer to ♡, while Japanese drive on the left and say "shinzo". Yet in many situations prior conventions may not be available, and agents may be faced with entirely novel situations or partners. In this work, we study ways that agents may learn to leverage semantic relations between observations and actions to coordinate with agents they have had no experience interacting with before.

Consider the shapes in Fig. 1. When asked to assign the names "Bouba" and "Kiki" to the two shapes, people name the jagged object "Kiki" and the curvy object "Bouba" (Köhler, 1929). This finding is robust across different linguistic communities and cultures and is even found in young children (Maurer et al., 2006). The causal explanation is that people match a "jaggedness" feature and a "curviness" feature in both the visual and auditory data. Across these cases, there seems to be a generalized mechanism for mapping the features of a person's action onto the features of the action that the person desires the other agent to take. In the absence of norms or conventions, people may minimize the distance between these features when making a choice. This basic form of feature utilization in humans predates verbal behavior (Tomasello et al., 2007), and this capability has been hypothesized as a key predecessor to more sophisticated language development and acquisition (Tomasello et al., 2005).
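This distance-minimization idea can be made concrete with a toy sketch. The feature vectors and the helper name below are illustrative assumptions (a two-dimensional "jaggedness"/"curviness" featurization), not values from our experiments:

```python
import numpy as np

def closest_action(obs_features: np.ndarray, action_features: np.ndarray) -> int:
    """Return the index of the action whose feature vector is nearest
    (in Euclidean distance) to the observed object's features."""
    dists = np.linalg.norm(action_features - obs_features, axis=1)
    return int(np.argmin(dists))

# Toy features: [jaggedness, curviness]
obs = np.array([0.9, 0.1])           # a jagged shape
actions = np.array([[0.8, 0.2],      # say "Kiki"  (jagged-sounding)
                    [0.1, 0.9]])     # say "Bouba" (curvy-sounding)

print(closest_action(obs, actions))  # → 0, i.e., "Kiki"
```

Under this hypothetical featurization, the agent names the jagged shape "Kiki" because that name's features lie closest to the observation's features.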
Modeling these capacities is key for building machines that can robustly coordinate with other agents and with people (Kleiman-Weiner et al., 2016; Dafoe et al., 2020). Might this general mechanism emerge through multi-agent reinforcement learning across a range of tasks? As we will show, reinforcement learning agents naively trained with self-play fail to learn to coordinate even in these obvious ways. Instead, they develop arbitrary private languages that are uninterpretable both to the same models trained with a different random seed and to human partners (Hu et al., 2020). For instance, in the examples above, they would be equally likely to wave a red hat to hint that they want strawberries as to indicate that they want blueberries. Unfortunately, developing an inductive bias that takes these correspondences into account is not straightforward, because the abstract knowledge these agents lack is difficult to describe in closed form. Rather than attempting to do so, we take a learning-based approach. Our aim is to build an agent with the capacity to develop these kinds of abstract correspondences during self-play, such that it can robustly succeed during cross-play, a setting in which a model is paired with a partner (human or AI) that it did not train with.

Toward this end, we extend the Dec-POMDP formalism to allow actions and observations to be represented using shared features, and we design a human-interpretable environment for studying coordination with these enrichments. Using this formalism, we examine the inductive bias of a collection of five network architectures, two feedforward-based and three attention-based, in procedurally generated coordination tasks. Our main contribution is finding that a self-attention architecture that takes both the actions and observations as input has a strong inductive bias toward using the relationship between actions and observations in intuitive ways.
This inductive bias manifests in: (1) high intra-algorithm cross-play scores compared both to other architectures and to algorithms specifically designed to maximize intra-algorithm cross play; (2) sophisticated human-like coordination patterns that exploit mutual exclusivity and implicature, two well-known phenomena studied in cognitive science (Markman & Wachtel, 1988; Grice, 1975); and (3) human-level performance at ad-hoc coordination with humans. We hypothesize that the success of this architecture can be attributed to the fact that it processes observation features and action features using the same weights. Our finding suggests that this kind of attention architecture is the most sensible starting point for learning intuitive policies in settings in which action features play an important role.
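The core idea of jointly processing observation and action features with shared weights can be sketched schematically. The following is not our actual architecture; the dimensions, random weights, and scoring rule are illustrative assumptions. What it shows is the structural point: observation tokens and action tokens are stacked into a single sequence and projected with the same weight matrices, so the same features mean the same thing regardless of whether they belong to an action or an observation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # shared feature dimension

# Hypothetical featurized inputs: one row per token.
obs_tokens = rng.normal(size=(3, d))    # 3 observation features
act_tokens = rng.normal(size=(5, d))    # 5 candidate actions

# A single set of projection weights processes BOTH kinds of tokens.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

tokens = np.vstack([obs_tokens, act_tokens])     # joint sequence of 8 tokens
Q, K = tokens @ W_q, tokens @ W_k
scores = Q @ K.T / np.sqrt(d)                    # self-attention logits
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)          # softmax over keys

# One illustrative readout: score each action by how strongly its token
# attends to the observation tokens, then act greedily.
action_logits = attn[len(obs_tokens):, :len(obs_tokens)].sum(axis=1)
chosen = int(np.argmax(action_logits))
```

Because the projections are shared across the joint sequence, an action whose features resemble the observation's features can receive high attention weight without any action-specific parameters, which is one way the architecture can exploit action-observation correspondences.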

2. BACKGROUND AND RELATED WORK

Dec-POMDPs. We start with decentralized partially observable Markov decision processes (Dec-POMDPs) to formalize our setting (Nair et al., 2003). In a Dec-POMDP, each player i receives an observation Ω^i(s) ∈ O^i generated by the underlying state s and takes an action a^i ∈ A^i. Players receive a common reward R(s, a), and the state transitions according to the function T(s, a). The historical trajectory is τ = (s_1, a_1, . . . , a_{t-1}, s_t). Player i's action-observation history (AOH) is denoted τ^i_t = (Ω^i(s_1), a^i_1, . . . , a^i_{t-1}, Ω^i(s_t)). The policy for player i takes an AOH as input and outputs a distribution over actions, denoted π^i(a^i | τ^i_t). The joint policy is denoted π.

MARL and Coordination. The standard paradigm for training multi-agent reinforcement learning (MARL) agents in Dec-POMDPs is self-play (SP). However, the failure of such policies to achieve high reward when evaluated in cross-play (XP) is well documented. Carroll et al. (2019) used grid-world MDPs to show that both SP and population-based training fail when paired with human collaborators. Bard et al. (2019) and Hu et al. (2020) showed that in Hanabi, agents perform significantly worse when paired with independently trained agents than they do at training time, even though the agents are trained under identical circumstances. This drop in XP performance directly results in poor human-AI coordination, as shown in Hu et al. (2020). Lanctot et al. (2017) found qualitatively similar XP results in a partially cooperative laser-tag game.

To address this issue, Hu et al. (2020) introduced a setting in which the goal is to maximize the XP returns of independently trained agents using the same algorithm. We call this setting intra-algorithm cross play (intra-AXP). Hu et al. (2020) argue that high intra-AXP is necessary for successful coordination with humans: if agents trained from independent runs or random seeds of the same algorithm cannot coordinate well with each other, it is unlikely they will coordinate with agents that have different model architectures, not to mention humans. However, while there has been recent progress in developing algorithms that achieve high intra-AXP scores in some settings (Hu et al., 2020; 2021), this success does not carry over to settings in which the correspondence between actions and observations is important for coordination, as we will show. Beyond performing well in intra-AXP, a more ambitious goal is to perform well with agents that are externally determined, such as humans, and not observed during training. This setting has been referred to both as ad-hoc coordination (Stone et al., 2010; Barrett et al., 2011) and zero-shot coordination.
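The Dec-POMDP quantities defined above can be sketched in code. The toy matching environment and policies below are hypothetical stand-ins for illustration, not the procedurally generated task studied in this paper:

```python
from dataclasses import dataclass, field

@dataclass
class AOH:
    """Player i's action-observation history τ^i_t."""
    steps: list = field(default_factory=list)   # [("obs", Ω^i(s_1)), ("act", a^i_1), ...]

class MatchingEnv:
    """Toy Dec-POMDP: two players earn a common reward for choosing the same action."""
    def reset(self):
        return 0                                 # a single dummy state
    def observe(self, i, state):
        return state                             # Ω^i(s): fully revealed here
    def step(self, joint_action):
        reward = 1.0 if joint_action[0] == joint_action[1] else 0.0
        return 0, reward                         # T(s, a), R(s, a)

def episode(env, policies, horizon=5):
    """Roll out a joint policy; all players share the common reward."""
    histories = [AOH() for _ in policies]
    state, total = env.reset(), 0.0
    for _ in range(horizon):
        joint = []
        for i, pi in enumerate(policies):
            histories[i].steps.append(("obs", env.observe(i, state)))
            a = pi(histories[i])                 # a^i ~ π^i(· | τ^i_t)
            histories[i].steps.append(("act", a))
            joint.append(a)
        state, r = env.step(joint)
        total += r
    return total

always_zero = lambda history: 0
print(episode(MatchingEnv(), [always_zero, always_zero]))   # → 5.0
```

Two copies of the same deterministic policy coordinate trivially; the SP-versus-XP gap discussed above arises when independently trained policies settle on different, mutually incompatible conventions for the same environment.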
Figure 1: The "Bouba" (right) and "Kiki" (left) effect.

