MOVING FORWARD BY MOVING BACKWARD: EMBEDDING ACTION IMPACT OVER ACTION SEMANTICS

Abstract

A common assumption when training embodied agents is that the impact of taking an action is stable; for instance, executing the "move ahead" action will always move the agent forward by a fixed distance, perhaps with some small amount of actuator-induced noise. This assumption is limiting; an agent may encounter settings that dramatically alter the impact of actions: a move ahead action on a wet floor may send the agent twice as far as it expects, and using the same action with a broken wheel might transform the expected translation into a rotation. Rather than assuming that the impact of an action stably reflects its pre-defined semantic meaning, we propose to model the impact of actions on-the-fly using latent embeddings. By combining these latent action embeddings with a novel, transformer-based, policy head, we design an Action Adaptive Policy (AAP). We evaluate our AAP on two challenging visual navigation tasks in the AI2-THOR and Habitat environments and show that our AAP is highly performant even when faced, at inference time, with missing actions and previously unseen, perturbed action spaces. Moreover, we observe significant improvement in robustness against such action perturbations when evaluating in real-world scenarios.

1. INTRODUCTION

Humans show a remarkable capacity for planning when faced with substantially constrained or augmented means by which they may interact with their environment. For instance, a human who begins to walk on ice will readily shorten their stride to prevent slipping. Likewise, a human will spare little mental effort in deciding to exert more force to lift their hand when it is weighed down by groceries. Even in these mundane tasks, we see that the effect of a human's actions can have significantly different outcomes depending on the setting: there is no predefined one-to-one mapping between actions and their impact. The same is true for embodied agents, where something as simple as attempting to move forward can result in radically different outcomes depending on the load the agent carries, the presence of surface debris, and the maintenance level of the agent's actuators (e.g., are any wheels broken?). Despite this, many existing tasks designed in the embodied AI community (Jain et al., 2019; Shridhar et al., 2020; Chen et al., 2020; Ku et al., 2020; Hall et al., 2020; Wani et al., 2020; Deitke et al., 2020; Batra et al., 2020a; Szot et al., 2021; Ehsani et al., 2021; Zeng et al., 2021; Li et al., 2021; Weihs et al., 2021; Gan et al., 2021; 2022; Padmakumar et al., 2022) make the simplifying assumption that, except for some minor actuator noise, the impact of taking a particular discrete action is functionally the same across trials. We call this the action-stability assumption (AS assumption). Artificial agents trained assuming action-stability are generally brittle, obtaining significantly worse performance when this assumption is violated at inference time (Chattopadhyay et al., 2021); unlike humans, these agents cannot adapt their behavior without additional training. In this work, we study how to design a reinforcement learning (RL) policy that allows an agent to adapt to significant changes in the impact of its actions at inference time.
Unlike work on training robust policies via domain randomization, which generally leads to learning conservative strategies (Kumar et al., 2021), we want our agent to fully exploit the actions it has available: philosophically, if a move ahead action now moves the agent twice as fast, our goal is not to have the agent take smaller steps to compensate but, instead, to reach the goal in half the time. While prior works have studied test-time adaptation of RL agents (Nagabandi et al., 2018; Wortsman et al., 2019; Yu et al., 2020; Kumar et al., 2021), the primary insight in this work is an action-centric approach which requires the agent to generate action embeddings from observations on-the-fly (i.e., with no pre-defined association between actions and their effects), where these embeddings can then be used to inform future action choices. In our approach, summarized in Fig. 1, an agent begins each episode with a set of unlabelled actions A = {a_0, ..., a_n}. Only when the agent takes one of these unlabelled actions a_i at time t does it observe, via its sensor readings, how that action changes the agent's state and the environment. Through the use of a recurrent action-impact encoder module, the agent then embeds the observations from just before (o_t) and just after (o_{t+1}) taking the action to produce an embedding of the action e_{i,t}.

Figure 1: An agent may encounter unexpected drifts during deployment due to changes in its internal state (e.g., a defective wheel) or environment (e.g., hardwood floor vs. carpet). Our proposed Action Adaptive Policy (AAP) introduces an action-impact encoder which takes state changes (e.g., o_t → o_{t+1}) caused by agent actions (e.g., a_t) as input and produces embeddings representing these actions' impact. Using these action embeddings, the AAP utilizes an Order-Invariant (OI) head to choose the action whose impact will allow it to most readily achieve its goal.
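To make this mechanism concrete, the recurrent action-impact encoder can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the class name, the GRU-style gated update, and all weight shapes are our assumptions; in practice the observations would be visual features from a learned encoder and all weights would be trained end-to-end with the policy.

```python
import numpy as np

rng = np.random.default_rng(0)

class ActionImpactEncoder:
    """Keeps one latent embedding per unlabelled action slot and updates
    the slot for the action just taken from the observed state change
    (o_t -> o_{t+1}). A GRU-style gate stands in for the paper's
    recurrent encoder; weights are random for illustration only."""

    def __init__(self, n_actions, obs_dim, embed_dim):
        self.embeddings = np.zeros((n_actions, embed_dim))
        # Hypothetical learned weights, randomly initialised for the sketch.
        self.W_in = rng.standard_normal((2 * obs_dim, embed_dim)) * 0.1
        self.W_z = rng.standard_normal((2 * embed_dim, embed_dim)) * 0.1

    def update(self, action_idx, o_t, o_t1):
        # Encode the observed transition (before/after observations).
        delta = np.concatenate([o_t, o_t1]) @ self.W_in
        h = self.embeddings[action_idx]
        # Gated recurrent update of this action's impact embedding.
        z = 1.0 / (1.0 + np.exp(-np.concatenate([h, delta]) @ self.W_z))
        self.embeddings[action_idx] = (1 - z) * h + z * np.tanh(delta)
        return self.embeddings[action_idx]
```

Note that only the slot of the action actually taken is updated, so embeddings for untried actions remain at their initial value until the agent observes their effect.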
At a future time step t′, the agent may then use these action-impact embeddings to choose which action it wishes to execute. In our initial experiments, we found that standard RL policy heads, which generally consist of a linear function acting on the agent's recurrent belief vector b_t, failed to use these action embeddings to their full potential. As we describe further in Sec. 4.3, we conjecture that this is because matrix multiplications impose an explicit ordering on their inputs, so that any linear actor-critic head must treat each of the n! potential action orderings separately. To this end, we introduce a novel, transformer-based, policy head which we call the Order-Invariant (OI) head. As suggested by the name, the OI head is invariant to the order of its inputs and processes the agent's belief jointly with the action embeddings, allowing the agent to choose the action whose impact will let it most readily achieve its goal. We call the above architecture, which combines our recurrent action-impact encoder module with our OI head, the Action Adaptive Policy (AAP). To evaluate the AAP, we train agents to complete two challenging visual navigation tasks within the AI2-THOR environment (Kolve et al., 2017): Point Navigation (PointNav) (Deitke et al., 2020) and Object Navigation (ObjectNav) (Deitke et al., 2020). For these tasks, we train models with moderate amounts of simulated actuator noise and, during evaluation, test with a range of modest to severe unseen action impacts. These include disabling actions, changing movement magnitudes, rotation degrees, etc.; we call these action augmentations drifts. We find that, even when compared to sophisticated baselines, including meta-learning (Wortsman et al., 2019), a model-based approach (Zeng et al., 2021), and RMA (Kumar et al., 2021), our AAP handily outperforms competing methods and can even succeed when faced with extreme changes in the effect of actions.
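The order-invariance property the OI head relies on can be illustrated with a minimal sketch: with single-query dot-product attention over the action embeddings and no positional encodings, permuting the embeddings simply permutes the resulting action logits, so no fixed action ordering is baked in. The function name and weight parameters below are hypothetical; the actual OI head is a full transformer processing the belief jointly with the embeddings.

```python
import numpy as np

def oi_head_logits(belief, action_embeddings, W_q, W_k):
    """Order-invariant scoring sketch: the belief vector forms a single
    attention query, each action embedding forms a key, and each action's
    logit is a scaled dot product. No positional encoding is applied, so
    reordering the embeddings reorders the logits in lockstep."""
    q = belief @ W_q                      # query from the agent's belief, (d,)
    k = action_embeddings @ W_k           # one key per action, (n, d)
    return k @ q / np.sqrt(q.shape[0])    # one logit per action, (n,)
```

Because the score of each action depends only on its own embedding and the shared belief, the head need not learn all n! orderings separately, unlike a linear head applied to a concatenation of the embeddings.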
Further analysis shows that our agent learns to execute desirable behavior at inference time; for instance, it quickly avoids using disabled actions more than once during an episode despite not being exposed to disabled actions during training. In addition, the experimental results in a real-world test scene from RoboTHOR on the Object Navigation task demonstrate that our AAP performs better than baselines against unseen drifts. In summary, our contributions include: (1) an action-centric perspective towards test-time adaptation, (2) an Action Adaptive Policy network consisting of an action-impact encoder module and a novel order-invariant policy head, and (3) extensive experimentation showing that our proposed approach outperforms existing adaptive methods.



We also show results in a modified PettingZoo environment (Terry et al., 2020) on Point Navigation and Object Push in Sec. F, and in a modified Habitat environment (Savva et al., 2019) on Point Navigation in Sec. G.

