ADVERSARIAL ENVIRONMENT GENERATION FOR LEARNING TO NAVIGATE THE WEB

Anonymous authors
Paper under double-blind review

Abstract

Learning to autonomously navigate the web is a difficult sequential decision-making task. The state and action spaces are large and combinatorial in nature, and successful navigation may require traversing several partially-observed pages. One bottleneck in training web navigation agents is providing a learnable curriculum of training environments that covers the large variety of real-world websites. We therefore propose using Adversarial Environment Generation (AEG) to generate challenging web environments in which to train reinforcement learning (RL) agents. We introduce a new benchmarking environment, gMiniWoB, which enables an RL adversary to use compositional primitives to learn to generate complex websites. To train the adversary, we present a new decoder-like architecture that can directly control the difficulty of the environment, and a new training technique, Flexible b-PAIRED. Flexible b-PAIRED jointly trains the adversary and a population of navigator agents, and incentivizes the adversary to generate "just-the-right-challenge" environments by simultaneously learning two policies encoded in the adversary's architecture. First, for its choice of environment complexity (the difficulty budget), the adversary is rewarded with the performance of the best-performing agent in the population. Second, for selecting design elements, the adversary learns to maximize regret using the difference in capabilities between navigator agents in the population (flexible regret). Our results show that the navigator agent trained with Flexible b-PAIRED generalizes to new environments and significantly outperforms competitive automatic curriculum generation baselines (including a state-of-the-art RL web navigation approach and prior minimax-regret AEG methods) on a set of challenging unseen test environments that are an order of magnitude more complex than previous benchmarks. The navigator agent achieves more than a 75% success rate on all tasks, a 4x higher success rate than the strongest baseline.
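The abstract's two-policy reward split can be sketched as follows. This is an illustrative reading, not the paper's implementation: the function name `adversary_rewards` and the assumption that flexible regret is estimated as the gap between the best agent and the mean of the remaining agents are hypothetical simplifications.

```python
from typing import Sequence, Tuple

def adversary_rewards(agent_returns: Sequence[float]) -> Tuple[float, float]:
    """Illustrative reward split for the adversary's two policies.

    agent_returns: episode returns of each navigator agent in the
    population on the adversary-generated environment.

    Returns (budget_reward, design_reward):
      * budget_reward -- performance of the best agent, rewarding the
        difficulty-budget choice: harder sites only pay off once some
        agent in the population can actually solve them.
      * design_reward -- a flexible-regret estimate: the gap between the
        best agent and the mean of the rest, which is largest when the
        environment cleanly separates agent capabilities.
    """
    returns = sorted(agent_returns, reverse=True)
    best, rest = returns[0], returns[1:]
    budget_reward = best
    if rest:
        design_reward = best - sum(rest) / len(rest)
    else:
        design_reward = 0.0  # a single agent provides no capability gap
    return budget_reward, design_reward
```

Under this reading, an environment that every agent solves (or none solves) yields near-zero design reward, pushing the adversary toward environments at the frontier of the population's ability.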

1. INTRODUCTION

Autonomous web navigation agents that complete tedious digital tasks, such as booking a flight or filling out forms, have the potential to significantly improve user experience and systems' accessibility. Such agents could enable a user to issue requests such as, "Buy me a plane ticket to Los Angeles leaving on Friday", and have the agent automatically handle the details of completing the task. However, the complexity and diversity of real-world websites make this a formidable challenge.

General web navigation form-filling tasks such as these require an agent to navigate through a set of web pages, matching the user's information to the appropriate elements on each page. This is a highly challenging decision-making problem for several reasons. First, the observation space is large and partially observable, consisting of a single web page in a flow of several pages (e.g. the payment information page is only one part of a shopping task). Web pages are represented using the Document Object Model (DOM), a tree of web elements with hundreds of nodes. Second, the actions are all possible combinations of the web elements (fill-in boxes, drop-downs, button clicks) and their possible values. For example, drop-down selection actions are only appropriate if a drop-down menu is present. Even if the agent is able to navigate the site to arrive at the correct page, and eventually select the correct element (e.g. the 'departure' field for booking a flight), there are many possible values it can insert (e.g. any piece of the user's input). Therefore, the action space is discrete and prohibitively large, with the set of valid actions changing with the context. Finally, the same task, such as booking a flight, results in a very different experience and workflow depending on the website.
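To make the combinatorial action space concrete, the sketch below enumerates (element, value) actions over a toy DOM tree. The `DOMNode` fields and the user profile are illustrative assumptions, not the benchmark's actual observation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DOMNode:
    tag: str                                           # e.g. 'input', 'select', 'button'
    node_id: str = ""
    options: List[str] = field(default_factory=list)   # choices for drop-downs
    children: List["DOMNode"] = field(default_factory=list)

def valid_actions(root: DOMNode, profile: Dict[str, str]) -> List[Tuple[str, str]]:
    """Enumerate (element_id, value) pairs valid on the current page.

    The valid set depends on element type: any user-profile field can be
    typed into a text input, only the listed options fit a drop-down, and
    a button admits a single 'click' action.
    """
    actions: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == "input":
            actions += [(node.node_id, v) for v in profile.values()]
        elif node.tag == "select":
            actions += [(node.node_id, opt) for opt in node.options]
        elif node.tag == "button":
            actions.append((node.node_id, "click"))
        stack.extend(node.children)
    return actions
```

Even this toy page with one input, one drop-down, and one button yields `len(profile) + len(options) + 1` actions; real pages with hundreds of DOM nodes, spread across multiple pages of a workflow, multiply this out to the prohibitively large discrete action space described above.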

