LARGE LANGUAGE MODELS CAN SELF-IMPROVE

Abstract

Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM on those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth labels. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

1. INTRODUCTION

Scaling has enabled Large Language Models (LLMs) to achieve state-of-the-art performance on a range of Natural Language Processing (NLP) tasks (Wang et al., 2018; 2019; Rajpurkar et al., 2016). More importantly, new capabilities have emerged as LLMs are scaled to hundreds of billions of parameters (Wei et al., 2022a): in-context few-shot learning (Brown et al., 2020) makes it possible for an LLM to perform well on a task it was never trained on, given only a handful of examples; Chain-of-Thought (CoT) prompting (Wei et al., 2022b; Kojima et al., 2022) demonstrates strong reasoning ability of LLMs across diverse tasks with or without few-shot examples; self-consistency (Wang et al., 2022b) further improves performance by self-evaluating multiple reasoning paths.

Despite these incredible capabilities of models trained on large text corpora (Brown et al., 2020; Chowdhery et al., 2022), fundamentally improving model performance beyond few-shot baselines still requires fine-tuning on extensive amounts of high-quality supervised data. FLAN (Wei et al., 2021; Chung et al., 2022) and T0 (Sanh et al., 2022) curated tens of benchmark NLP datasets to boost zero-shot performance on unseen tasks; InstructGPT (Ouyang et al., 2022) crowd-sourced many human answers to diverse sets of text instructions to better align the model with human instructions. While significant effort has been committed to collecting high-quality supervised datasets, the human brain, by contrast, is capable of the metacognition process (Dunlosky & Metcalfe, 2008), whereby we can refine our own reasoning ability without external inputs.

In this paper, we study how an LLM capable of in-context few-shot learning and chain-of-thought reasoning is able to self-improve its reasoning ability without supervised data. We show that, using only input sequences (without ground truth output sequences) from multiple NLP task datasets, a pre-trained LLM is able to improve performance on both in-domain and out-of-domain tasks. Our method is shown in Figure 1: we first sample multiple predictions using few-shot Chain-of-Thought (CoT) prompts (Wei et al., 2022b), filter "high-confidence" predictions using majority voting (Wang et al., 2022b), and finally fine-tune the LLM on these high-confidence predictions. The resulting model shows improved reasoning in both greedy and multi-path evaluations. We call the model fine-tuned in this way Language Model Self-Improved (LMSI). Note that LMSI depends on the in-context few-shot learning and chain-of-thought reasoning abilities that small language models do not necessarily have. We empirically verify LMSI using a pre-trained 540B LLM, where our method not only improves training task performance (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3), but also enhances out-of-domain (OOD) test tasks (AQUA, StrategyQA, MNLI), achieving state-of-the-art performance on many tasks without relying on supervised ground truth answers.
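To make the loop concrete, the following is a minimal Python sketch of the procedure described above, not the paper's actual implementation: sample_cot_paths, extract_answer, and finetune are hypothetical helpers standing in for the underlying LLM sampling and fine-tuning infrastructure.

```python
from collections import Counter

def self_improve(model, questions, cot_prompt, num_samples=32, temperature=0.7):
    """Sketch of the LMSI loop: sample multiple CoT reasoning paths per
    unlabeled question, majority-vote the answers (self-consistency), and
    fine-tune on the paths that agree with the majority answer."""
    training_examples = []
    for question in questions:  # unlabeled inputs only; no ground truth used
        # Sample diverse reasoning paths with temperature T > 0.
        paths = sample_cot_paths(model, cot_prompt, question,
                                 num_samples, temperature)   # hypothetical helper
        answers = [extract_answer(p) for p in paths]         # hypothetical helper
        majority_answer, _ = Counter(answers).most_common(1)[0]
        # Keep "high-confidence" paths: those leading to the majority answer.
        for path, answer in zip(paths, answers):
            if answer == majority_answer:
                training_examples.append((question, path))
    return finetune(model, training_examples)                # hypothetical helper
```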
Lastly, we conduct preliminary studies on self-generating additional input questions and few-shot CoT prompts, which could further reduce the amount of human effort required for model self-improvement, and ablation studies on important hyperparameters of our approach. We hope our simple approach and strong empirical results will encourage more future work by the community to investigate optimal performance of pre-trained LLMs without additional human supervision.

Our contributions are summarized as follows:

• We demonstrate that a large language model can self-improve on datasets without ground truth outputs, by leveraging CoT reasoning (Wei et al., 2022b) and self-consistency (Wang et al., 2022b), achieving competitive in-domain multi-task performance as well as out-of-domain generalization. We achieve state-of-the-art-level results on the ARC, OpenBookQA, and ANLI datasets.

2. RELATED WORK

Learning from explanations. Augmenting a machine learning model with explanations has been studied extensively in the existing literature. For example, in the supervised learning setting, a model can be fine-tuned using human-annotated rationales (Zaidan et al., 2007; Ling et al., 2017b; Narang et al., 2020; Camburu et al., 2018; Cobbe et al., 2021; Chung et al., 2022). A few works have also looked at how explanations can help models in various settings, e.g., in-context learning (Lampinen et al., 2022) and distillation (Pruthi et al., 2022). In this paper, we focus on the unsupervised learning setting, where we do not assume a rationale-augmented training dataset is available, since human-annotated rationales can be expensive.

Few-shot explanations improve reasoning in LLMs. Recently, much progress has been made towards improving LLMs' reasoning abilities via prompting or in-context learning. Wei et al.



[Figure 1 shows example few-shot CoT demonstrations, e.g., "Q: John buys 20 cards and 1/4 are uncommon. How many uncommon cards did he get? A: John gets 20 * 1/4 = 5 uncommon cards. The answer is 5.", followed by a test question "Q: Amy is 10. Jake is 8. Alex's age is right in the middle. How old is Alex? A: Let's think step-by-step."]

Figure 1: Overview of our method. With Chain-of-Thought (CoT) examples as demonstration (Wei et al., 2022b), the language model generates multiple CoT reasoning paths and answers (temperature T > 0) for each question. The most consistent answer is selected by majority voting (Wang et al., 2022b). The "high-confidence" CoT reasoning paths that lead to the majority answer are augmented by mixed formats as the final training samples to be fed back to the model for fine-tuning.
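To illustrate the prompting format depicted in the figure, here is a minimal sketch of few-shot CoT prompt construction in the style of Wei et al. (2022b); the exemplar is paraphrased from Figure 1, and build_cot_prompt is an illustrative helper of ours rather than code from the paper.

```python
# One worked (question, rationale + answer) demonstration, paraphrased from Figure 1.
COT_EXEMPLARS = [
    ("John buys 20 cards and 1/4 are uncommon. How many uncommon cards did he get?",
     "John gets 20 * 1/4 = 5 uncommon cards. The answer is 5."),
]

def build_cot_prompt(question, exemplars=COT_EXEMPLARS):
    """Prepend worked demonstrations so the model imitates step-by-step
    reasoning before answering the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```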

• We provide detailed ablation studies on training sample formatting and sampling temperature after fine-tuning, and identify critical design choices for the most successful self-improvement by LLMs.

• We study two other approaches to self-improvement, in which the model generates additional questions from a finite set of input questions and generates its own few-shot CoT prompt templates. The latter achieves 74.2% on GSM8K, the state-of-the-art zero-shot performance, compared with 43.0% by Kojima et al. (2022) (whose prompting scheme is sketched below) or 70.1% through its naive extension with Wang et al. (2022b).
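For reference, the zero-shot baseline of Kojima et al. (2022) cited in the last contribution elicits reasoning with a two-stage prompt. A minimal sketch under that paper's scheme follows; generate is a hypothetical stand-in for an LLM completion call.

```python
def zero_shot_cot(model, question):
    """Two-stage zero-shot CoT prompting (Kojima et al., 2022)."""
    # Stage 1: elicit a free-form rationale with no exemplars.
    rationale = generate(model, f"Q: {question}\nA: Let's think step by step.")  # hypothetical
    # Stage 2: append the rationale and ask for the final answer.
    answer = generate(
        model,
        f"Q: {question}\nA: Let's think step by step. {rationale}\n"
        f"Therefore, the answer is",
    )  # hypothetical
    return answer
```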

