THINKSUM: PROBABILISTIC REASONING OVER SETS USING LARGE LANGUAGE MODELS

Abstract

Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the largest LLMs fail in scenarios that require reasoning over multiple objects or facts or making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, THINKSUM, that reasons over sets of objects or facts in a structured manner. In the first stage (THINK, 'fast' retrieval of associations), an LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (SUM, 'slow' probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the advantages of THINKSUM on the BIG-bench suite of evaluation tasks, achieving improvements over the state of the art using GPT-family models on ten difficult tasks, often with far smaller model variants. We compare and contrast THINKSUM with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. We argue that because the probabilistic inference in THINKSUM is performed outside of calls to the LLM, THINKSUM is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs.

1. INTRODUCTION

Large language models (LLMs) (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022) can recall a broad range of basic facts, recognize and mimic various forms in language, and efficiently extrapolate analogies in structure and meaning. These abilities allow LLMs to excel in zero-shot and few-shot tasks that are formulated as generation or selection of a likely completion of a prompt. This formulation requires LLMs to perform associative fast thinking, in which each token of text in the sequence making up the answer is generated or scored in one pass through the model and, other than that, no intermediate information is created or retained. Fast thinking is made possible by the compression in the LLM weights of information that is repeated in a variety of ways in large training datasets. However, it is increasingly evident that when reasoning, or slow thinking, is required, failure modes of LLMs are revealed. In our usage, reasoning is the sequential manipulation of concepts that can be expressed in language. Tasks that require iterative retrieval of rarely stated knowledge, uncertainties over multiple objects or facts, or multiple steps of deduction are difficult even for the most advanced LLMs. In a recently designed suite of evaluations, BIG-bench (Srivastava et al., 2022), some of the tasks where the gap between machine and human performance is large involve inference sequences with nested counterfactuals (LOGICAL DEDUCTION), concepts introduced through definitions (CONCEPTUAL COMBINATIONS), etc. (see Fig. A.1). These are tasks where a human solver's intuitive feeling of '(in)coherence' is not sufficient to produce the right answer: the solution is obtained by a sequence of thoughts that can be explained in words and may even require writing down intermediate results if working memory is insufficient.
We show on several examples in BIG-bench that such problems can be addressed by a two-component mechanism, which we name THINKSUM:

• THINK (fast thinking / association / knowledge retrieval step): creating an association of spans of text with sets of strings. This process may involve generation from a language model, as is the case in Fig. 1, where the novel word 'binne' is associated with a set of strings (e.g., 'cat', 'rabbit', 'mink') by prompting GPT-3 with the definition and asking for examples. However, it may also consist of scoring alone, in order to form a matrix of probabilities over which probabilistic inference is performed.

• SUM (slow thinking / SUMmarization / reasoning step): probabilistic inference that aggregates generated strings or probabilities to produce the final answer. The summarization typically involves, and often entirely consists of, summing the probabilities of strings (computed in the THINK step), as in Fig. 1, where the final word is assumed to be sampled from a mixture of possible substitutions of the words 'binne' and 'bam' into the input.

THINKSUM is named by analogy with other algorithms built from two basic operations that 'expand' and 'aggregate', such as MapReduce in distributed computing and the sum-product algorithm in graphical models. We discuss different ways to THINK and to SUM in §2, but we start with one example, illustrated in Fig. 1, motivated by the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task in BIG-bench (Srivastava et al., 2022). In this task, the LLM is provided with two invented words and their definitions in the input. The LLM is then asked to infer the most plausible sentence that uses a combination of the invented words.
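The THINK step above can be sketched as a set of independent LM queries, one per invented word, each mapping a definition to a set of example instances. The prompt wording, the `generate` interface, and the `think` helper below are illustrative assumptions rather than the paper's exact implementation; a canned stub stands in for a real GPT-3 call so the sketch runs offline.

```python
from typing import Callable, Dict, List

def think(definitions: Dict[str, str],
          generate: Callable[[str, int], List[str]],
          n_examples: int = 5) -> Dict[str, List[str]]:
    """THINK step: associate each invented word with a set of example
    strings by prompting the LM with its definition. The per-word
    queries are independent, so they can be issued in parallel."""
    word_sets = {}
    for word, definition in definitions.items():
        prompt = f"A {word} is {definition}. Examples of a {word} include:"
        word_sets[word] = generate(prompt, n_examples)
    return word_sets

# Stub LM: returns canned examples so the sketch runs without an API call.
def stub_generate(prompt: str, n: int) -> List[str]:
    canned = {"binne": ["cat", "rabbit", "mink", "ferret", "hamster"],
              "bam": ["cottage", "cabin", "shelter", "hut", "shed"]}
    for word, examples in canned.items():
        if word in prompt:
            return examples[:n]
    return []

word_sets = think({"binne": "a small furry animal",
                   "bam": "a simple dwelling"}, stub_generate)
print(word_sets["binne"])  # candidate substitutions for 'binne'
```

With a real model, `generate` would wrap an LLM sampling call, and the returned sets become the mixture components used in the SUM step.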
As the words are invented, they are not common or consistently used in the training set, and the LLM needs to understand and combine the definitions of the invented words to reason about the meaning of the combination. The LLM is queried to produce example instances of the invented words with the help of the definitions. These example instances can be substituted into the query in place of the invented words. In this way, by mapping individual spans of the text of interest to sets, we arrive at a mixture model (in this example, a mixture with 25 components, for 5 possible replacements of each of the words), which can be used in the same manner as the original LLM, either to score text or to generate it token by token. In this case, when we score all candidate completions using this mixture model and normalize over the four choices, the correct answer, that 'binne bams' are for animals and not people, becomes the most likely. An important difference between THINKSUM and existing chain-of-thought-like prompt engineering methods (Wei et al., 2022; Kojima et al., 2022) is that the reasoning step is not reduced to a generation problem for the LLM, but is performed as a probabilistic inference external to the LLM. This reduces its vulnerability to features of the prompt, such as accidental distraction of the LLM by spurious patterns. Instead, we engineer the slow thinking process to make parallel calls to the LLM to query for intermediate information, then possibly perform programmatic recombination of strings (THINK). In the final reasoning step (SUM), the likelihoods obtained from the LLM for the recombinations derived from earlier steps are combined outside the model to make the final prediction.
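The mixture-model scoring described above can be sketched as follows: each answer choice is scored under every substituted version of the query, the probabilities are averaged (a uniform mixture over the substitutions, e.g., 5 × 5 = 25 components for two words with five examples each), and the result is normalized over the choices. The `toy_prob` scorer is a stand-in assumption for per-token likelihoods from a real LLM, and the function names are ours, not the paper's.

```python
from itertools import product
from typing import Callable, Dict, List

def sum_step(template: str,
             word_sets: Dict[str, List[str]],
             choices: List[str],
             lm_prob: Callable[[str, str], float]) -> Dict[str, float]:
    """SUM step: score each answer choice under a uniform mixture of all
    substitutions of the invented words, then normalize over the choices."""
    words = list(word_sets)
    combos = list(product(*(word_sets[w] for w in words)))
    mixture = {}
    for choice in choices:
        # Average the completion probability over all mixture components.
        probs = [lm_prob(template.format(**dict(zip(words, combo))), choice)
                 for combo in combos]
        mixture[choice] = sum(probs) / len(probs)
    z = sum(mixture.values())
    return {c: p / z for c, p in mixture.items()}

# Toy scorer standing in for an LLM: substituted phrases like
# "A cat cottage is a place for" make 'animals.' the likelier ending.
def toy_prob(context: str, completion: str) -> float:
    return 0.9 if "animals" in completion else 0.1

word_sets = {"binne": ["cat", "rabbit", "mink"],
             "bam": ["cottage", "cabin", "shelter"]}
posterior = sum_step("A {binne} {bam} is a place for", word_sets,
                     ["animals.", "people."], toy_prob)
best = max(posterior, key=posterior.get)
print(best)  # 'animals.'
```

Because the aggregation is a plain sum of probabilities outside the model, it can be inspected directly, which is the source of the interpretability claim in the abstract.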



Figure 1: An example adapted from the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task, in which models must select the most likely completion of a phrase that includes nonce words whose definitions are given. Direct prompting evaluates completion likelihoods normalized over the answer choices.

