UNDERSTANDING HTML WITH LARGE LANGUAGE MODELS

Abstract

Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding, i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval, have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192× less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.[1]

1. INTRODUCTION

Web crawling (Olston et al., 2010), form-filling (Diaz et al., 2013; Gur et al., 2021), and information-retrieving web agents (Nogueira & Cho, 2016) are important for both automating and assisting users in web-based tasks. These and similar applications rely on models that can search for specific content or controls on a web page as well as navigate a website autonomously. Since a web page in its raw form is represented as an HTML-based text sequence, the success of models for web-based tasks relies on their ability to understand HTML semantics, structure, and embedded interactions. The predominant approach to web automation and HTML understanding is to train specialized models, i.e., gathering application-specific datasets and designing neural network (NN) architectures that leverage inductive biases of HTML's structure; see, e.g., Liu et al. (2018); Toyama et al. (2021); Gur et al. (2021); Humphreys et al. (2022). However, both dataset collection and neural architecture design are expensive, time-consuming, and require highly-specialized, domain-specific knowledge.

Meanwhile, in the natural language processing (NLP) literature, large language models (LLMs) have emerged as a solution to the difficulties of dataset collection and specialized NN design (Kaplan et al., 2020; Bommasani et al., 2021). A popular paradigm in NLP is to take an off-the-shelf LLM, pretrained on a large text corpus via an unsupervised and task-agnostic learning objective, and either fine-tune or prompt it on a small task-specific dataset. This paradigm has shown exceptional performance on a variety of NLP tasks (Xue et al., 2020; Brown et al., 2020; Austin et al., 2021). Whether LLMs can be applied to HTML understanding, especially given the much larger context and sequence lengths involved, remains an under-explored question.
In this paper, we investigate whether LLMs can be applied to HTML understanding to produce better-performing, more sample-efficient HTML understanding models, without the need for custom NN architecture design. To that end, we present a suite of three benchmarking tasks for HTML understanding that capture the essence of these applications and require understanding both structure and content.

[Figure 1: (a) An example HTML page with a highlighted salient element, an element of interest (dashed box). All canonical tasks evaluate a distinct interaction with this element, either by classifying it as one of a set of categories, generating a text description of its purpose, or applying an action as part of a sequential navigation of a multi-page website. (b) Overview of the LLM architectures. Dashed boxes denote sub-modules that are specific to either encoder-only or encoder-decoder models. For encoder-only models, we add an extra classification layer. Decoder-only models (not in the diagram) are similar to encoder-decoder models; the main difference is that the HTML snippet is fed to the decoder and processed left-to-right.]
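To make the notion of a salient element concrete, the following is a minimal sketch (not the paper's implementation) of locating an element of interest in raw, unprocessed HTML using Python's standard-library parser. Following the Figure 1 example, we assume a bare `target` attribute marks the salient element; that marker convention is an assumption for illustration.

```python
from html.parser import HTMLParser

class SalientElementFinder(HTMLParser):
    """Collect start tags and flag the one carrying a bare `target`
    attribute (an assumed marker for the salient element)."""
    def __init__(self):
        super().__init__()
        self.elements = []
        self.salient = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # bare attributes parse with value None
        self.elements.append((tag, attrs))
        if "target" in attrs:
            self.salient = (tag, attrs)

# Raw HTML fragment modeled on Figure 1a
page = (
    '<div><label for="uName">Email Address</label></div>'
    '<div><input type="email" id="uName" target>'
    '<input type="password" id="pass"></div>'
)

finder = SalientElementFinder()
finder.feed(page)
tag, attrs = finder.salient
print(tag, attrs.get("id"))  # the element a model must classify, describe, or act on
```

Each of the three tasks below can then be posed as a question about this one element and its surrounding context.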
First, we devise Semantic Classification, a task that requires a model to classify a given HTML element into one of a set of categories, such as address, email, or password, with application to automated form-filling. Second, we present Description Generation, a label-extraction task in which a model is given an HTML snippet and is asked to produce a natural language description. For instance, for an email field, the description might be "Please enter your email address." Note that in the majority of web pages, this connection between input elements and description content is only implicit in the raw HTML code, and inferring such links is a prerequisite for higher-level navigation objectives. The third task is Autonomous Web Navigation (Shi et al., 2017): a model is presented with an HTML page paired with a natural language command and must apply appropriate actions on a sequence of HTML pages to satisfy the command. See Figure 1a for a simplified example of these tasks. With these benchmark tasks in hand, we evaluate the transfer capabilities of a variety of pretrained LLMs (Table 1), varying in architecture (encoder-only, encoder-decoder, or decoder-only), model size (from 24.6M to 62B parameters), and pretraining corpora (both including and excluding HTML data). While prior work universally pre-parses the HTML as input to the model (Gur et al., 2021; Liu et al., 2018; Nakano et al., 2021), ours is, to the best of our knowledge, the first work that uses raw, unprocessed HTML. Our results show that LLMs demonstrate a remarkable level of HTML understanding across all tasks, with up to 192× more sample-efficiency than models trained from scratch, achieving a new SoTA for supervised learning on the MiniWoB benchmark suite (Shi et al., 2017). The encoder-decoder architectures with bi-directional attention show the best performance across the board, even when their pretraining does not include HTML.
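Since all three tasks consume raw HTML, each can be cast as a text-to-text problem for a single pretrained encoder-decoder LLM. The sketch below illustrates one plausible framing; the prompt prefixes, label vocabulary, and action syntax are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical text-to-text framing of the three canonical tasks.
snippet = '<input type="email" id="uName">'

def make_example(task: str, html: str, command: str = "") -> dict:
    """Build an (input, target) pair; targets shown are illustrative labels."""
    if task == "classification":
        # Target is one category from a fixed set (address, email, password, ...)
        return {"input": f"classify element: {html}", "target": "email"}
    if task == "description":
        # Target is a natural language description of the element's purpose
        return {"input": f"describe element: {html}",
                "target": "Please enter your email address."}
    if task == "navigation":
        # Target is an action grounded in the current page
        return {"input": f"command: {command} page: {html}",
                "target": 'type(id="uName", "jane@example.com")'}
    raise ValueError(f"unknown task: {task}")

ex = make_example("classification", snippet)
print(ex["input"])
```

Framing all three tasks over the same raw-HTML input format is what lets a single pretrained LLM transfer across them without task-specific architectures.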
In addition, we show that performance scales sub-linearly with model size. The broader objective of this research is to advance the integration of LLMs with autonomous web agents. Only in the last year have researchers begun to utilize LLMs outside of NLP and integrate them as core capabilities in autonomy (Lu et al., 2021; Ahn et al., 2022). In this context, LLMs serve as reasoning engines for sequential decision-making agents interacting with environments. The present work is the first in the research literature to embed an LLM and train it as an agent for autonomous web navigation. This requires new implementations to adapt LLM training for behavior cloning, in addition to designing interfaces for integrating text generation into a perception-compute-action loop.
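Adapting LLM training for behavior cloning amounts to converting navigation demonstrations into supervised sequence-to-sequence pairs. The following is a minimal sketch under assumed field names and action syntax (not the paper's data format): each step of a demonstrated episode yields one (command + page, expert action) training pair.

```python
# Hypothetical demonstration episode for Autonomous Web Navigation.
episode = {
    "command": "Sign in with the given email and password",
    "steps": [
        {"html": '<input type="email" id="uName">',
         "action": 'type(id="uName", "a@b.c")'},
        {"html": '<input type="password" id="pass">',
         "action": 'type(id="pass", "hunter2")'},
        {"html": '<button type="submit">Sign In</button>',
         "action": 'click(id="submit")'},
    ],
}

def to_bc_pairs(episode):
    """One supervised pair per step: (command + current page) -> expert action."""
    pairs = []
    for step in episode["steps"]:
        src = f'command: {episode["command"]} page: {step["html"]}'
        pairs.append((src, step["action"]))
    return pairs

pairs = to_bc_pairs(episode)
print(len(pairs))
```

At inference time, the fine-tuned LLM generates the action string for the current page, the environment executes it, and the resulting page is fed back as the next input, closing the perception-compute-action loop.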



[1] See visualizations of the results at https://sites.google.com/view/llm4html/home.




