ADVERSARIAL ENVIRONMENT GENERATION FOR LEARNING TO NAVIGATE THE WEB

Anonymous authors
Paper under double-blind review

Abstract

Learning to autonomously navigate the web is a difficult sequential decision-making task. The state and action spaces are large and combinatorial in nature, and successful navigation may require traversing several partially-observed pages. One of the bottlenecks in training web navigation agents is providing a learnable curriculum of training environments that can cover the large variety of real-world websites. Therefore, we propose using Adversarial Environment Generation (AEG) to generate challenging web environments in which to train reinforcement learning (RL) agents. We introduce a new benchmarking environment, gMiniWoB, which enables an RL adversary to use compositional primitives to learn to generate complex websites. To train the adversary, we present a new decoder-like architecture that can directly control the difficulty of the environment, and a new training technique, Flexible b-PAIRED. Flexible b-PAIRED jointly trains the adversary and a population of navigator agents, and incentivizes the adversary to generate "just-the-right-challenge" environments by simultaneously learning two policies encoded in the adversary's architecture. First, for its choice of environment complexity (the difficulty budget), the adversary is rewarded with the performance of the best-performing agent in the population. Second, for selecting design elements, the adversary learns to maximize the regret, using the difference in capabilities among the navigator agents in the population (flexible regret). The results show that the navigator agent trained with Flexible b-PAIRED generalizes to new environments and significantly outperforms competitive automatic curriculum generation baselines, including a state-of-the-art RL web navigation approach and prior methods for minimax regret AEG, on a set of challenging unseen test environments that are an order of magnitude more complex than previous benchmarks.
The navigator agent achieves more than a 75% success rate on all tasks, a 4x higher success rate than the strongest baseline.

1. INTRODUCTION

Autonomous web navigation agents that complete tedious digital tasks, such as booking a flight or filling out forms, have the potential to significantly improve user experience and systems' accessibility. These agents could enable a user to issue requests such as "Buy me a plane ticket to Los Angeles leaving on Friday," and have the agent automatically handle the details of completing the task. However, the complexity and diversity of real-world websites make this a formidable challenge. General web navigation form-filling tasks such as these require an agent to navigate through a set of web pages, matching the user's information to the appropriate elements on each page. This is a highly challenging decision-making problem for several reasons. First, the observation space is large and partially observable, consisting of a single web page within a flow of several pages (e.g., the payment information page is only one part of a shopping task). Web pages are represented using the Document Object Model (DOM), a tree of web elements with hundreds of nodes. Second, the actions are all possible combinations of web elements (fill-in boxes, drop-downs, button clicks) and their possible values. For example, drop-down selection actions are only appropriate if a drop-down menu is present. Even if the agent is able to navigate the site to arrive at the correct page, and eventually select the correct element (e.g., the 'departure' field when booking a flight), there are many possible values it could insert (e.g., any part of the user's input). The action space is therefore discrete and prohibitively large, with the set of valid actions changing with the context. Finally, the same task, such as booking a flight, results in a very different experience and workflow depending on the website. The agent must be able to adapt and operate in the new environment to complete the task.
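The combinatorial structure described above can be illustrated with a toy sketch (our own illustration, not the paper's environment API): a page is a DOM tree, and the action space is the cross product of its interactable elements and the candidate values from the user's profile.

```python
# Hedged toy sketch: a DOM tree observation and the composite (element, value)
# action space described in the text. All names here are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DOMNode:
    tag: str                                        # e.g. "input", "select", "button"
    attributes: dict = field(default_factory=dict)
    children: List["DOMNode"] = field(default_factory=list)


def leaf_elements(node: DOMNode) -> List[DOMNode]:
    """Collect interactable leaves of the DOM tree."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(leaf_elements(child))
    return leaves


# A tiny page: a departure text field and a submit button.
page = DOMNode("form", children=[
    DOMNode("input", {"id": "departure"}),
    DOMNode("button", {"id": "submit"}),
])

user_profile = {"departure": "Los Angeles"}

# Composite actions: pair every element with every candidate value.
# The space grows combinatorially with (#elements x #values), and only a
# context-dependent subset of these pairs is actually valid.
actions = [(el.attributes.get("id"), value)
           for el in leaf_elements(page)
           for value in user_profile.values()]
print(actions)
```

Even this two-element page with a single profile value yields multiple candidate actions, only some of which are valid; real pages with hundreds of DOM nodes and many profile fields make exhaustive search intractable.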
Therefore, reinforcement learning (RL) agents should be capable of zero-shot generalization to new environments. Prior work made significant strides toward learning web navigation on a single website, yet the existing methods do not scale. Behavior cloning from expert demonstrations (Shi et al., 2017; Liu et al., 2018) shows promising results; however, it requires demonstrations for every single website. An RL agent trained using synthetic demonstrations created with a generative model (Gur et al., 2019) improves performance, yet this method still requires training a separate policy for every website, taking tens of thousands of interactions with each one. Lastly, the existing benchmarks (Shi et al., 2017; Liu et al., 2018) have limited complexity: their DOM trees are fixed and considerably smaller than real websites.

We aim to train RL agents to solve web navigation form-filling tasks by correctly entering relevant information into unknown websites. Successful generalization to new websites requires training an agent on a large distribution of possible tasks and environments. The question is how to create a distribution that not only covers most realistic tasks, but can also be presented in a curriculum that is learnable by the agent. Manually designing a pre-defined curriculum of hand-built websites is tedious and intractable. Another option would be to apply domain randomization (DR) (as in, e.g., Jakobi (1997); Sadeghi & Levine (2016); Tobin et al. (2017)) to randomize parameters of websites, or to automatically increase some parameter controlling the difficulty over time (as in Gur et al. (2019)). However, all of these approaches are likely to fail to cover important test cases, and cannot tailor the difficulty of the parameter configuration to the current ability of the agent.
Adversarial Environment Generation (AEG) trains a learning adversary to automatically generate a curriculum of training environments, enabling both increased complexity of training environments and generalization to new, unforeseen test environments. However, if we naively train a minimax adversary, i.e., an adversary that seeks to minimize the performance of the learning agent, the adversary is motivated to create the hardest possible website, preventing learning. Instead, PAIRED (Protagonist Antagonist Induced Regret Environment Design) (Dennis et al., 2020) trains the adversary to maximize the regret, estimated as the difference in performance between two navigation agents (the protagonist and the antagonist). While PAIRED shows exciting results, without explicit feedback on how skillful the antagonist is, or a mechanism to control the difficulty of the environment, the method is susceptible to local minima and has a hard time learning in complex environments where the regret is zero. We present Flexible b-PAIRED, which builds on the PAIRED framework and jointly trains an adversarial RL agent (the adversary) and a population of navigator agents. The Flexible b-PAIRED adversary learns to present "just-the-right-challenge" to the navigation agents. We enable the adversary to tailor the environment difficulty to the ability of the best-performing agent by introducing an explicit difficulty budgeting mechanism and a novel multi-objective loss function. The budgeting mechanism gives the adversary direct control over the difficulty of the generated environment. The
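The objectives described above can be sketched as follows. The notation is ours, and the exact flexible-regret estimator here is one plausible instantiation of "the difference in capabilities of navigator agents in the population," not necessarily the paper's precise definition.

```latex
% PAIRED: the adversary generating environment E maximizes the regret,
% the performance gap between antagonist and protagonist policies:
\mathrm{Regret}(E) = R_E(\pi_{\mathrm{antagonist}}) - R_E(\pi_{\mathrm{protagonist}})

% Flexible regret over a population \{\pi_1, \dots, \pi_K\}: one plausible
% form is the gap between the best-performing agent and the population mean:
\mathrm{Regret}_{\mathrm{flex}}(E) = \max_i R_E(\pi_i)
    - \frac{1}{K} \sum_{j=1}^{K} R_E(\pi_j)

% Budget objective: the adversary's difficulty-budget choice is rewarded
% with the performance of the best-performing agent:
r_{\mathrm{budget}}(E) = \max_i R_E(\pi_i)
```

Intuitively, the budget reward pushes the adversary to increase difficulty only as fast as the best agent can cope, while the flexible regret steers the choice of design elements toward environments that are solvable by some agent but not yet by all.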



Figure 1: Samples of generated web pages from selected websites, taken at different snapshots of training (a-c), and from the unseen test "Login" website (d). Over time, the number of pages in a website decreases, while the density of elements per page increases with more task-oriented elements. See Appendix A.11 for more samples.

