MUG: INTERACTIVE MULTIMODAL GROUNDING ON USER INTERFACES

Abstract

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to it. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interaction: upon seeing the agent's responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset of 77,820 sequences of human user-agent interaction on mobile interfaces, in which 20% involve multiple rounds of interaction. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy comprises both human evaluation and automatic evaluation with simulators. Our experiments show that allowing iterative interaction improves absolute task completion by 18% over the entire test set and by 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.

1. INTRODUCTION

Natural language understanding on graphical user interfaces (GUIs) is crucial for realizing human-computer interaction and for assisting scenarios that involve accessibility difficulties (Sarsenbayeva, 2018). Specifically, interpreting user commands into executable actions has drawn increasing interest as it manifests rich research problems, including multimodal modeling and natural language grounding (e.g., Li et al., 2017; Gur et al., 2019; He et al., 2020; Li et al., 2020a; 2021). Prior works often consider UI grounding in a single-pass fashion, where the model predicts actions from a given instruction without looking backward to refine its prediction. However, in a realistic scenario, user instructions can be ambiguous or inaccurate, especially when the target action is difficult or inconvenient to articulate. Reasoning in such cases is inherently iterative. Therefore, it is important and beneficial to incorporate interaction for resilient grounding (Suhr et al., 2019; Chandu et al., 2021).

In this paper, we investigate interactive grounding on GUIs, which aligns multimodal input to actionable objects on a screen. We focus on single-screen interaction, which is the building block of UI reasoning. Specifically, we introduce the MUG (Multi-turn UI Grounding) task, in which the user iteratively guides the agent to select a desired UI object (see Fig. 1). Given a UI and a target object, the user instructs the agent via natural language, ranging from casual intents to more descriptive commands. The agent infers which UI object is intended by the user and highlights it. If the agent is correct, the user confirms the selection and the grounding is completed. Otherwise, the user issues further guidance, e.g., "Click the one below", for the agent to refine its selection. We collected the MUG dataset from live interaction sessions between pairs of human annotators: one acts as the user and the other as the agent.
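The turn-taking protocol described above can be sketched as a simple loop. The function and role interfaces below (`mug_session`, `toy_user`, `toy_agent`) are hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
# A minimal sketch of one MUG interaction session, under assumed interfaces.

def mug_session(screen, target, agent_fn, user_fn, max_turns=5):
    """Run one multi-turn grounding session on a single screen.

    user_fn(screen, target, history)   -> a natural-language command
    agent_fn(screen, command, history) -> the UI object the agent selects
    Returns (success, history), where history holds one
    (command, selection) pair per turn.
    """
    history = []
    for _ in range(max_turns):
        command = user_fn(screen, target, history)      # initial or follow-up guidance
        selection = agent_fn(screen, command, history)  # ground the command to an object
        history.append((command, selection))
        if selection == target:                         # user confirms; grounding done
            return True, history
    return False, history


# Toy roles: the agent mis-selects on the first turn, then follows
# the user's corrective command on the second.
def toy_user(screen, target, history):
    return f"select {target}" if not history else "no, the one below it"

def toy_agent(screen, command, history):
    return screen[0] if not history else screen[1]

ok, turns = mug_session(["Send button", "Cancel button"], "Cancel button",
                        toy_agent, toy_user)
# ok is True after two turns: a wrong first selection, then a corrected one.
```

The loop terminates either on user confirmation (the selection matches the target) or after a turn budget, mirroring how annotation sessions end.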
Our dataset has 77,820 examples, each recording the interaction history of a session. Notably, 20% of the examples are challenging in that their human commands need multiple rounds to ground, even for human agents. To establish the benchmark, we experiment with a range of variants to model the dynamics between the two roles. While the main goal of the task is to develop agent models for grounding, we also develop user models for online instruction simulation. We build our models upon a Transformer-based architecture.

To fully examine model performance, we evaluate the agent model with a spectrum of evaluation strategies, including both offline and online evaluation. For the online evaluation, we employ both automatic and human evaluation, which involve interactions between the agent and the user (either a human or the user model) and offer a comprehensive probe into model understanding. Our experiments show that incorporating interaction substantially improves UI grounding task completion by 18% on the entire dataset and 31% on the challenging set, both in absolute terms. Furthermore, our robustness measurements suggest that MUG, while a seemingly easy single-screen task, is actually difficult, since neural agents sometimes struggle to correct themselves, resulting in repeated wrong selections across multiple turns. This leaves large room for future improvement in grounding agents.

In summary, our key contributions are:

1. We introduce MUG, a novel interactive vision-language task focusing on multi-turn language grounding on a graphical UI screen, a challenging setting meant to improve language grounding in realistic UIs.

2. We create a rich dataset of 77,820 examples recorded from live sessions between pairs of human users and agents; 20% of the data are challenging for both human annotators and neural agents.

3. We experiment with a range of model variants and evaluation strategies, showing that iterative interaction significantly improves grounding accuracy by 18% and 31% on the entire and challenging test sets respectively, with automatic assistance from our user models.

Our work lays a solid foundation for future investigations of collaborative grounding.
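To make the offline/online distinction concrete, the sketch below contrasts the two regimes: offline evaluation scores one agent prediction per recorded command, while online evaluation replays full sessions with a user simulator and measures task completion. The function names, data format, and toy roles are illustrative assumptions, not the paper's actual code:

```python
def offline_accuracy(examples, agent_fn):
    """Single-pass scoring: one prediction per recorded first command."""
    hits = sum(
        agent_fn(ex["screen"], ex["commands"][0], []) == ex["target"]
        for ex in examples
    )
    return hits / len(examples)

def online_completion(examples, agent_fn, user_fn, max_turns=5):
    """Interactive scoring: a user simulator issues follow-up commands."""
    done = 0
    for ex in examples:
        history = []
        for _ in range(max_turns):
            command = user_fn(ex, history)
            selection = agent_fn(ex["screen"], command, history)
            history.append((command, selection))
            if selection == ex["target"]:  # task completed within the turn budget
                done += 1
                break
    return done / len(examples)

# Toy agent: selects the object literally named in the command.
def toy_agent(screen, command, history):
    for obj in screen:
        if obj in command.split():
            return obj
    return screen[0]

# Toy simulator: replays the recorded command, then corrects the agent.
def toy_user(ex, history):
    return ex["commands"][0] if not history else f"no, tap {ex['target']}"

examples = [
    {"screen": ["ok", "cancel"], "commands": ["tap cancel"], "target": "cancel"},
    {"screen": ["ok", "cancel"], "commands": ["tap ok"], "target": "cancel"},  # inaccurate first command
]
```

On this toy data the single-pass metric penalizes the inaccurate first command, while the interactive metric recovers it via a corrective second turn, which is the gap the paper's 18%/31% improvements quantify at scale.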

2. BACKGROUND

Multimodal modeling has a long history of research (e.g., Winograd, 1972; Barnard & Forsyth, 2001; Lavrenko et al., 2003; Plummer et al., 2015; Yu et al., 2016). One important area focuses on grounding objects in images, where natural language is used as an additional input (Chen et al., 2017; Yu et al., 2016; 2018; Fukui et al., 2016; Deng et al., 2021).

Interactive Multimodal Grounding. Prior works have formulated grounding as a multi-step reasoning task, e.g., navigation via multiple steps of grounding (e.g., Ku et al., 2020; Gur et al., 2019). Our work differs by focusing on the agent's ability to self-correct over synchronized turns of interaction.



The dataset and code for reproducing our experiments are at https://github.com/to-be-de-anonymized.



Figure 1: Two illustrative examples of the MUG task. Each example consists of two turns, and interactions happen within a single screen. User commands are shown above the screens; the target object and the agent's selections are marked on each screen.

