LEARNING INTUITIVE POLICIES USING ACTION FEATURES

Abstract

An unaddressed challenge in multi-agent coordination is to enable AI agents to exploit the semantic relationships between the features of actions and the features of observations. Humans take advantage of these relationships in highly intuitive ways. For instance, in the absence of a shared language, we might point to the object we desire or hold up our fingers to indicate how many objects we want. To address this challenge, we investigate the effect of network architecture on the propensity of learning algorithms to exploit these semantic relationships. In a procedurally generated coordination task, we find that attention-based architectures that jointly process a featurized representation of observations and actions have a better inductive bias for learning intuitive policies. Through fine-grained evaluation and scenario analysis, we show that the resulting policies are human-interpretable. Moreover, such agents coordinate with people without training on any human data.

1. INTRODUCTION

Successful collaboration between agents requires coordination (Tomasello et al., 2005; Misyak et al., 2014; Kleiman-Weiner et al., 2016), which is challenging because coordinated strategies can be arbitrary (Lewis, 1969; Young, 1993; Lerer & Peysakhovich, 2018). A priori, one can neither deduce which side of the road to drive on, nor what utterance to use to refer to ♡ (Pal et al., 2020). In these cases coordination can arise from actors best responding to what others are already doing, i.e., following a convention. For example, Americans drive on the right side of the road and say "heart" to refer to ♡, while Japanese drive on the left and say "shinzo". Yet in many situations prior conventions may not be available, and agents may face entirely novel situations or partners. In this work, we study how agents may learn to leverage semantic relations between observations and actions to coordinate with partners they have had no experience interacting with before.

Consider the shapes in Fig. 1. When asked to assign the names "Bouba" and "Kiki" to the two shapes, people name the jagged object "Kiki" and the curvy object "Bouba" (Köhler, 1929). This finding is robust across linguistic communities and cultures and is even found in young children (Maurer et al., 2006). The causal explanation is that people match a "jaggedness" feature and a "curviness" feature in both the visual and auditory data. Across the above cases, there seems to be a generalized mechanism for matching the features of a person's action with the features of the action that the person desires the other agent to take. In the absence of norms or conventions, people may minimize the distance between these features when making a choice. This basic form of feature matching in humans predates verbal behavior (Tomasello et al., 2007), and this capability has been hypothesized as a key precursor to more sophisticated language development and acquisition (Tomasello et al., 2005).
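The distance-minimization mechanism described above can be made concrete with a minimal sketch. The feature encodings below (a hypothetical [jaggedness, roundness] vector for shapes and names) are invented purely for illustration and are not from the paper; the point is only that an agent with featurized observations and actions can select the action closest to its observation in feature space.

```python
import numpy as np

def intuitive_choice(obs_features: np.ndarray, action_features: np.ndarray) -> int:
    """Return the index of the action whose feature vector is nearest
    (in Euclidean distance) to the observation's feature vector.

    obs_features: 1-D feature vector for the observed object.
    action_features: 2-D array with one row of features per action.
    """
    distances = np.linalg.norm(action_features - obs_features, axis=1)
    return int(np.argmin(distances))

# Toy "Bouba/Kiki" encoding (hypothetical): features = [jaggedness, roundness].
kiki_shape = np.array([0.9, 0.1])        # a jagged shape
name_actions = np.array([
    [0.1, 0.9],                          # action 0: say "Bouba" (round-sounding)
    [0.9, 0.1],                          # action 1: say "Kiki" (jagged-sounding)
])
print(intuitive_choice(kiki_shape, name_actions))  # → 1, i.e. name the jagged shape "Kiki"
```

Under this sketch, no convention is needed: the choice falls out of the shared feature space between observations and actions, which is the inductive bias the paper's attention-based architectures are meant to capture.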
Modeling these capacities is key for building machines that can robustly coordinate with other agents and with people (Kleiman-Weiner et al., 2016; Dafoe et al., 2020). Might this general mechanism emerge through multi-agent reinforcement learning across a range of tasks? As we will show, reinforcement learning agents naively trained with self-play fail to coordinate even in these obvious ways. Instead, they develop arbitrary private languages that are



Figure 1: The "Bouba" (right) and "Kiki" (left) effect.

