AUTOMATICALLY AUDITING LARGE LANGUAGE MODELS VIA DISCRETE OPTIMIZATION

Abstract

Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as a discrete optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find a non-toxic input that starts with "Barack Obama" and maps to a toxic output. Our optimization problem is difficult to solve as the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. To combat these challenges, we introduce a discrete optimization algorithm, ARCA, that is tailored to autoregressive language models. We demonstrate how our approach can: uncover derogatory completions about celebrities (e.g. "Barack Obama is a legalized unborn" → "child murderer"), produce French inputs that complete to English outputs, and find inputs that generate a specific name. Our work offers a promising new tool to uncover models' failure modes before deployment. Trigger Warning: This paper contains model behavior that can be offensive in nature.

1. INTRODUCTION

Autoregressive large language models (LLMs) are currently used to complete code (Chen et al., 2021; Li et al., 2022b), summarize books (Stiennon et al., 2020), and engage in dialog (Thoppilan et al., 2022; Bai et al., 2022), to name a few of their many capabilities. In order to deploy such models, we need auditing methods that test for examples of undesirable behaviors in the intended operating domain. For example, we might like to identify benign-sounding inputs that produce offensive outputs or false statements, or that reveal private information. In future systems, we might like to find instances of unsafe actions, e.g. deleting all computer files or emptying bank accounts. Finding instances of undesirable behavior helps practitioners decide whether to deploy a system, restrict its operating domain, or continue to improve it in-house.

In this work, we observe that mining for these diverse, undesired behaviors can often be framed as instances of an abstract optimization problem. Under this abstraction, the goal is to find a prompt x and output o with a high auditing objective value, ϕ(x, o), where o is the greedy completion of x under the LLM. Our auditing objective is designed to capture some target behavior; for instance, ϕ might measure whether the prompt is French and the output is English (i.e. a surprising, unhelpful completion), or whether the prompt is non-toxic and contains "Barack Obama" while the output is toxic (Table 1). This reduces auditing to solving a discrete optimization problem: find a prompt-output pair that maximizes the auditing objective, such that the prompt completes to the output.

Though our reduction makes the optimization problem clear, solving it is computationally challenging: the set of feasible points is sparse, the space is discrete, and the language model itself is non-linear and high-dimensional. In addition, even querying a language model once is expensive, so large numbers of sequential queries are prohibitive.
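The auditing-as-optimization framing above can be made concrete with a small sketch. The following is a minimal, hypothetical illustration: it replaces the LLM with a toy deterministic next-token table and replaces ϕ with a trivial objective, then brute-forces the search. None of these names (`NEXT`, `phi`, `audit`) come from the paper; a real audit would call an LLM's greedy decoder and a learned objective, and could not enumerate prompts exhaustively.

```python
# Toy sketch of auditing as optimization: maximize phi(prompt, output)
# subject to the output being the model's greedy completion of the prompt.
from itertools import product

# Hypothetical stand-in for an LLM: each token deterministically maps to
# its argmax next token.
NEXT = {"a": "b", "b": "c", "c": "a"}

def greedy_complete(prompt, n_out):
    """Greedily extend `prompt` by n_out tokens under the toy model."""
    seq = list(prompt)
    for _ in range(n_out):
        seq.append(NEXT[seq[-1]])
    return tuple(seq[len(prompt):])

def phi(prompt, output):
    """Hypothetical auditing objective: reward outputs containing 'c'."""
    return 1.0 if "c" in output else 0.0

def audit(vocab, prompt_len, out_len):
    """Brute-force search over prompts; feasibility (output = greedy
    completion) holds by construction since we decode each candidate."""
    best = None
    for prompt in product(vocab, repeat=prompt_len):
        output = greedy_complete(prompt, out_len)
        score = phi(prompt, output)
        if best is None or score > best[0]:
            best = (score, prompt, output)
    return best

print(audit(("a", "b", "c"), 1, 2))  # → (1.0, ('a',), ('b', 'c'))
```

The exponential cost of this enumeration is exactly why the paper turns to a tailored discrete optimization algorithm instead.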
To combat these challenges, we introduce an optimization algorithm, ARCA. ARCA builds on existing algorithms that navigate the discrete space of tokens using coordinate ascent (Ebrahimi et al., 2018; Wallace et al., 2019), and uses approximations of the objective to make variable updates efficient. ARCA approximates our auditing objective by decomposing it into two components: log probabilities that can be efficiently computed via a transformer forward pass, and terms that can be
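The coordinate-ascent scheme that ARCA builds on can be sketched as follows. This is a hypothetical toy: the `objective` below is a trivial stand-in, not the paper's approximated auditing objective, and real methods score candidate tokens with gradient-based approximations rather than exhaustive evaluation. It only illustrates the update pattern: re-optimize one token position at a time with the rest held fixed.

```python
# Toy coordinate ascent over a discrete token sequence, in the style of
# HotFlip-like attacks (Ebrahimi et al., 2018): each step swaps the token
# at one position for the best-scoring candidate, holding others fixed.
import random

VOCAB = ["a", "b", "c", "d"]

def objective(tokens):
    """Hypothetical objective: count occurrences of the token 'd'."""
    return tokens.count("d")

def coordinate_ascent(seq_len, n_rounds, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(VOCAB) for _ in range(seq_len)]
    for _ in range(n_rounds):
        for i in range(seq_len):
            # Score every candidate token at position i; keep the best.
            seq[i] = max(VOCAB,
                         key=lambda t: objective(seq[:i] + [t] + seq[i+1:]))
    return seq

print(coordinate_ascent(4, 2))  # → ['d', 'd', 'd', 'd']
```

Because each coordinate update here requires scoring every vocabulary token, the approximations ARCA introduces, such as terms computable in a single transformer forward pass, are what make such updates tractable at LLM vocabulary sizes.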

