LARGE LANGUAGE MODELS CAN SELF-IMPROVE

Abstract

Large Language Models (LLMs) have achieved excellent performance on a variety of tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking, without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

1. INTRODUCTION

Scaling has enabled Large Language Models (LLMs) to achieve state-of-the-art performance on a range of Natural Language Processing (NLP) tasks (Wang et al., 2018; 2019; Rajpurkar et al., 2016). More importantly, new capabilities have emerged from LLMs as they are scaled to hundreds of billions of parameters (Wei et al., 2022a): in-context few-shot learning (Brown et al., 2020) makes it possible for an LLM to perform well on a task it was never trained on, given only a handful of examples; Chain-of-Thought (CoT) prompting (Wei et al., 2022b; Kojima et al., 2022) demonstrates strong reasoning ability of LLMs across diverse tasks with or without few-shot examples; self-consistency (Wang et al., 2022b) further improves performance by self-evaluating multiple reasoning paths. Despite these incredible capabilities of models trained on large text corpora (Brown et al., 2020; Chowdhery et al., 2022), fundamentally improving model performance beyond few-shot baselines still requires fine-tuning on extensive amounts of high-quality supervised data. FLAN (Wei et al., 2021; Chung et al., 2022) and T0 (Sanh et al., 2022) curated tens of benchmark NLP datasets to boost zero-shot performance on unseen tasks; InstructGPT (Ouyang et al., 2022) crowd-sourced many human answers for diverse sets of text instructions to better align their model to human instructions. While significant effort has been committed to collecting high-quality supervised datasets, the human brain, by contrast, is capable of the metacognition process (Dunlosky & Metcalfe, 2008), whereby we can refine our own reasoning ability without external inputs. In this paper, we study how an LLM capable of in-context few-shot learning and chain-of-thought reasoning can self-improve its reasoning ability without supervised data.
We show that using only input sequences (without ground truth output sequences) from multiple NLP task datasets, a pre-trained LLM can improve its performance on both in-domain and out-of-domain tasks. Our method is shown in Figure 1: we first sample multiple predictions using few-shot Chain-of-Thought (CoT) (Wei et al., 2022b) prompts, filter for "high-confidence" predictions using majority voting (Wang et al., 2022b), and finally fine-tune the LLM on these high-confidence predictions. The resulting model shows improved reasoning in both greedy and multi-path evaluations. We call a model fine-tuned in this way Language Model Self-Improved (LMSI). Note that LMSI depends on in-context few-shot learning and chain-of-thought reasoning abilities, which small language models do not necessarily have. We empirically verify LMSI using a pre-trained 540B LLM, where our method not only improves training task performance (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3), but also enhances out-of-domain (OOD) test tasks (AQUA, StrategyQA, MNLI), achieving state-of-the-art perfor-
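The core filtering step above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual implementation: `sample_fn` stands in for temperature-based sampling from the LLM, and the function names, the agreement `threshold`, and the toy sampler are all assumptions made for illustration. It keeps, for a given question, only the sampled (rationale, answer) pairs whose answer matches the majority vote; these pairs would then serve as self-training targets.

```python
from collections import Counter

def filter_high_confidence(question, sample_fn, num_paths=32, threshold=0.5):
    """Sample multiple CoT reasoning paths for one question, find the
    majority-voted answer, and return the (rationale, answer) pairs that
    agree with it. Returns [] if agreement is below `threshold`."""
    samples = [sample_fn(question) for _ in range(num_paths)]  # (rationale, answer) pairs
    votes = Counter(answer for _, answer in samples)
    top_answer, count = votes.most_common(1)[0]
    if count / num_paths < threshold:  # low self-consistency: skip this question
        return []
    return [(r, a) for r, a in samples if a == top_answer]

# Toy stand-in for LLM sampling: cycles through canned (rationale, answer) outputs.
_canned = [("3+4=7, then 7*2=14", "14"), ("3*4=12", "12"),
           ("7 doubled is 14", "14"), ("3+4=7; 2*7=14", "14")]
_i = 0
def fake_sampler(q):
    global _i
    out = _canned[_i % len(_canned)]
    _i += 1
    return out

kept = filter_high_confidence("What is (3+4)*2?", fake_sampler, num_paths=8)
print(len(kept))  # 6 of the 8 sampled paths agree on the answer "14"
```

In the paper's setting, the kept rationale-answer pairs across many unlabeled questions form the fine-tuning dataset; majority voting acts as a confidence filter so that fine-tuning does not reinforce inconsistent reasoning.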

