AUTOMATICALLY AUDITING LARGE LANGUAGE MODELS VIA DISCRETE OPTIMIZATION

Abstract

Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as a discrete optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find a non-toxic input that starts with "Barack Obama" and maps to a toxic output. Our optimization problem is difficult to solve as the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. To combat these challenges, we introduce a discrete optimization algorithm, ARCA, that is tailored to autoregressive language models. We demonstrate how our approach can: uncover derogatory completions about celebrities (e.g. "Barack Obama is a legalized unborn" → "child murderer"), produce French inputs that complete to English outputs, and find inputs that generate a specific name. Our work offers a promising new tool to uncover models' failure modes before deployment. Trigger Warning: This paper contains model behavior that can be offensive in nature.

1. INTRODUCTION

Autoregressive large language models (LLMs) are currently used to complete code (Chen et al., 2021; Li et al., 2022b), summarize books (Stiennon et al., 2020), and engage in dialog (Thoppilan et al., 2022; Bai et al., 2022), to name a few of their many capabilities. In order to deploy such models, we need auditing methods that test for examples of undesirable behaviors in the intended operating domain. For example, we might like to identify benign-sounding inputs that produce offensive outputs or false statements, or that reveal private information. In future systems, we might like to find instances of unsafe actions, e.g. deleting all computer files or emptying bank accounts. Finding instances of undesirable behavior helps practitioners decide whether to deploy a system, restrict its operating domain, or continue to improve it in-house.

In this work, we observe that mining for these diverse, undesired behaviors can often be framed as instances of an abstract optimization problem. Under this abstraction, the goal is to find a prompt x and output o with a high auditing objective value, ϕ(x, o), where o is the greedy completion of x under the LLM. Our auditing objective is designed to capture some target behavior; for instance, ϕ might measure whether the prompt is French and the output is English (i.e. a surprising, unhelpful completion), or whether the prompt is non-toxic and contains "Barack Obama" while the output is toxic (Figure 1). This reduces auditing to solving a discrete optimization problem: find a prompt-output pair that maximizes the auditing objective, such that the prompt completes to the output.

Figure 1: Illustration of our framework. Given a target behavior to uncover, we specify an auditing objective over prompts and outputs that captures that behavior. We then use our optimization algorithm ARCA to maximize the objective, such that under a language model (GPT-2 large) the prompt completes to the output (arrow). We present some returned prompts (blue, first line) and outputs (red, second line) for each objective, where the optimization variables are bolded and italicized.

Though our reduction makes the optimization problem clear, solving it is computationally challenging: the set of feasible points is sparse, the space is discrete, and the language model itself is non-linear and high-dimensional. In addition, even querying a language model once is expensive, so large numbers of sequential queries are prohibitive. To combat these challenges, we introduce an optimization algorithm, ARCA. ARCA builds on existing algorithms that navigate the discrete space of tokens using coordinate ascent (Ebrahimi et al., 2018; Wallace et al., 2019) and use approximations of the objective to make variable updates efficient. ARCA approximates our auditing objective by decomposing it into two components: log probabilities that can be efficiently computed via a transformer forward pass, and terms that can be effectively approximated via a first-order approximation. Approximating the entire auditing objective via a first-order approximation, as existing algorithms would, loses important information about whether preceding tokens are likely to generate candidate tokens. In contrast, ARCA reliably finds solutions when jointly optimizing over prompts and outputs.
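To make this decomposition concrete, below is a minimal sketch (not the paper's implementation) of the kind of two-part candidate score just described, written for a HuggingFace-style causal LM that exposes an embedding matrix and accepts `inputs_embeds`. The names `rank_candidate_tokens` and `score_fn` are illustrative: `score_fn` stands in for a differentiable surrogate of the auditing objective plus the log-probability of tokens after the position being updated. The sketch also simplifies ARCA, e.g. it updates a single position and takes one gradient rather than averaging over random candidate tokens.

```python
import torch

def rank_candidate_tokens(model, emb_matrix, token_ids, pos, score_fn, k=32):
    """Approximately rank replacement tokens for position `pos` (assumed >= 1)."""
    # First-order term: linearize the differentiable part of the objective
    # around the current token embeddings, then score every vocabulary token
    # by a dot product with the resulting gradient.
    embeds = emb_matrix[token_ids].detach().clone().requires_grad_(True)
    score_fn(embeds).backward()                            # scalar surrogate objective
    first_order = emb_matrix.detach() @ embeds.grad[pos]   # (vocab_size,)

    # Exact term: log p(v | tokens before pos) for every vocabulary token v,
    # computed with a single forward pass over the prefix.
    with torch.no_grad():
        prefix = emb_matrix[token_ids[:pos]].unsqueeze(0)     # (1, pos, d_model)
        next_logits = model(inputs_embeds=prefix).logits[0, -1]
        log_probs = torch.log_softmax(next_logits, dim=-1)    # (vocab_size,)

    # Keep the top-k candidates; these few would then be re-scored exactly
    # under the true objective before committing to a token swap.
    return torch.topk(first_order + log_probs, k).indices
```

In a GPT-2 setting, `emb_matrix` could be `model.get_input_embeddings().weight`; the point of the two-part score is that the cheap linear term never has to stand in for the next-token log-probabilities, which the forward pass provides exactly.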
Using the 762M parameter GPT-2 as a case study (Radford et al., 2019), we find that ARCA reliably produces examples of target behaviors specified by the auditing objective. For example, we uncover prompts that generate toxic statements about celebrities (Barack Obama is a legalized unborn → child murder), completions that change languages (naissance duiciée → of the French), and associations that are factually inaccurate (Florida governor → Rick Scott) or offensive in context (billionaire Senator → Bernie Sanders), to name a few salient behaviors.

One challenge of our framework is specifying the auditing objective; while in our work we use unigram models, perplexity constraints, and specific prompt prefixes to produce natural text that is faithful to the target behavior, choosing the right objective in general remains an open problem (see the illustrative sketch below). Nonetheless, our results demonstrate that it is possible to produce meaningful solutions with our framework, and that auditing via discrete optimization can help preempt unsafe deployments.
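To make the notion of an auditing objective more concrete, here is a small illustrative sketch of a ϕ for the "non-toxic prompt, toxic output" behavior; it is not one of the paper's exact objectives. `toxicity_score` and `unigram_log_prob` stand in for whatever toxicity classifier and unigram language model an auditor has available, and the 0.1 weight is an arbitrary choice.

```python
def audit_objective(prompt_text, output_text, toxicity_score, unigram_log_prob):
    """Illustrative phi(x, o): reward toxic outputs from non-toxic, natural prompts."""
    return (
        toxicity_score(output_text)             # the completion should be toxic ...
        - toxicity_score(prompt_text)           # ... while the prompt itself is not
        + 0.1 * unigram_log_prob(prompt_text)   # naturalness term (unigram / perplexity-style)
    )
```

In practice, a ϕ like this would be combined with hard constraints, such as a fixed prompt prefix (e.g. "Barack Obama"), before being handed to the optimizer.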

2. RELATED WORK

Work on large language models. A wide body of recent work has introduced large, capable autoregressive language models on text (Radford et al., 2019; Brown et al., 2020; Wang & Komatsuzaki, 2021; Rae et al., 2021; Hoffmann et al., 2022) and code (Chen et al., 2021; Nijkamp et al., 2022; Li et al., 2022b), among other media. Such models have been applied to open-ended generation tasks like dialog (Ram et al., 2018; Thoppilan et al., 2022), long-form summarization (Stiennon et al., 2020; Rothe et al., 2020), and solving math problems (Tang et al., 2021; Lewkowycz et al., 2022).

LLM Failure Modes. There are many documented failure modes of large language models on generation tasks, including propagating biases and stereotypes (Sheng et al., 2019; Nadeem et al., 2020; Groenwold et al., 2020; Blodgett et al., 2021; Abid et al., 2021; Hemmatian & Varshney, 2022) and leaking private information (Carlini et al., 2020). See Bender et al. (2021); Bommasani et al. (2021); Weidinger et al. (2021) for surveys on additional failures.

Some prior work searches for model failure modes by testing manually written prompts (Ribeiro et al., 2020; Xu et al., 2021b), prompts scraped from a training set (Gehman et al., 2020), or prompts constructed from templates (Jia & Liang, 2017; Garg et al., 2019; Jones & Steinhardt, 2022). A more related line of work optimizes an objective to produce interesting behaviors. Wallace et al. (2019) find a universal trigger by optimizing a single prompt to produce toxic outputs, and find that this

