LANGUAGE MODELS ARE MULTILINGUAL CHAIN-OF-THOUGHT REASONERS

Abstract

We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.



Figure 1: Correlation between language frequency and MGSM accuracy for PaLM-540B. The accuracy is surprisingly high, even for underrepresented languages like Swahili (SW) and Bengali (BN), which account for less than 0.01% of the pre-training dataset. (Axes: frequency of language in pre-training dataset (token percentage) vs. MGSM accuracy (%). Series: translate to English with Google Translate and solve with English intermediate steps; intermediate reasoning steps in the language of the question; intermediate reasoning steps in English.)

1. INTRODUCTION

Recent work has shown that presenting explicit reasoning steps (i.e., chains of thought; CoT) in English elicits multi-step reasoning abilities of large language models such as GPT-3 and PaLM (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022b, inter alia). Pretrained multilingual language models have also achieved impressive performance on various NLP tasks across typologically distinct languages (Conneau et al., 2020; Xue et al., 2021; Chowdhery et al., 2022; Clark et al., 2020; Hu et al., 2020; Ruder et al., 2021, inter alia). Tasks in existing multilingual benchmarks usually require only simple reasoning steps, so it is still unclear how well language models perform on tasks that require more complex reasoning in a multilingual setting.

In this work, we introduce the MGSM benchmark to bridge the gap between the progress on English-based chain-of-thought reasoning and multilingual NLP. We extend a subset of the English-language GSM8K dataset (Cobbe et al., 2021) to ten typologically diverse languages via manual translation of problems into target languages. To the best of our knowledge, this is the first multilingual benchmark to evaluate the arithmetic reasoning abilities of language models.

We evaluate two large language models, GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2022), on this benchmark. While both models solve less than 20% of problems with standard prompting, the 540-billion-parameter PaLM model in particular shows exceptional multilingual reasoning abilities with intermediate reasoning steps (Figure 1), solving more than 40% of the problems in any investigated language, including underrepresented languages such as Bengali and Swahili. In our best setting, PaLM achieves an average solve rate of 55% across languages. We find that intermediate reasoning steps in English consistently lead to competitive or better results than those written in the native language of the question, suggesting that English chain-of-thought prompting may be a useful baseline for future multilingual reasoning work.

We further demonstrate that the multilingual reasoning abilities of pretrained models extend to commonsense reasoning (Ponti et al., 2020) and word-in-context semantic judgment (Raganato et al., 2020). When presented with few-shot examples in different languages, PaLM sets a new state of the art (89.9%) on XCOPA (Ponti et al., 2020), outperforming prior approaches that require thousands of training examples.
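As a concrete reading of the solve rates quoted above, the following minimal sketch computes a per-language solve rate and its average across languages. It assumes exact string matching on final answers and an unweighted mean over languages; the function names and the toy inputs are hypothetical illustrations, not the evaluation code used in this work.

```python
from typing import Dict, List


def solve_rate(predictions: List[str], references: List[str]) -> float:
    """Fraction of problems whose predicted final answer equals the reference.

    Assumption: answers are compared as stripped strings (exact match).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def average_solve_rate(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language solve rates.

    Assumption: every language covers the same number of problems,
    so each language contributes equally to the average.
    """
    if not per_language:
        raise ValueError("at least one language is required")
    return sum(per_language.values()) / len(per_language)


if __name__ == "__main__":
    # Hypothetical toy values, not results from the paper.
    rates = {
        "bn": solve_rate(["18", "3"], ["18", "4"]),
        "sw": solve_rate(["7", "9"], ["7", "9"]),
    }
    print(rates, average_solve_rate(rates))
```

Because every language shares the same set of problems, the unweighted mean in this sketch coincides with the overall fraction of solved problems pooled across languages.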

2. THE MGSM BENCHMARK

In this section, we describe the collection process of Multilingual Grade School Math (MGSM), to our knowledge the first multilingual arithmetic reasoning benchmark.

Source data. We used GSM8K (Cobbe et al., 2021), an English-language human-annotated grade-school math problem dataset, as the base data source. For MGSM, we took the first 250 examples from the GSM8K official test example list. Each problem requires two to eight steps to solve according to the official solution (Figure 2). The answer for each question in GSM8K was written as an Arabic numeral, which we kept consistent across all languages to facilitate cross-lingual prediction (see footnote 1).

Target language selection. We selected a typologically diverse set of ten languages other than English (EN), spanning eight language families and different levels of representation in standard pretraining datasets such as mC4 (Xue et al., 2021): Bengali (BN), Chinese (ZH), French (FR), German (DE), Japanese (JA), Russian (RU), Spanish (ES), Swahili (SW), Telugu (TE), and Thai (TH).
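The source-data selection described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build MGSM: the helper names are hypothetical, each example is assumed to be a dictionary, and the reference solution is assumed to end with a line of the form `#### <number>`.

```python
import re
from typing import Dict, List, Optional

# Language codes as listed in the text: English plus the ten target languages.
MGSM_LANGUAGES: Dict[str, str] = {
    "en": "English", "bn": "Bengali", "zh": "Chinese", "fr": "French",
    "de": "German", "ja": "Japanese", "ru": "Russian", "es": "Spanish",
    "sw": "Swahili", "te": "Telugu", "th": "Thai",
}


def select_source_problems(test_examples: List[dict], n: int = 250) -> List[dict]:
    """Return the first n examples of the test list, preserving their order."""
    return test_examples[:n]


# Assumption: the final answer follows a "####" marker in the reference solution.
_FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the final numeric answer as a string without thousands separators,
    or None if no "####" marker is found."""
    match = _FINAL_ANSWER.search(solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")
```

Keeping the answer as a numeral string, independent of the question's language, means the same reference answer can be reused for every translated version of a problem.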



Footnote 1: Certain scripts such as Devanagari employ different numerals. We restrict the data to Arabic numerals for consistency, but future work may investigate cross-lingual numeracy by mapping Arabic numerals to those of the corresponding script (see Spithourakis & Riedel, 2018).
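The numeral mapping mentioned in this footnote could look like the minimal sketch below. It is not part of MGSM; the function name and the choice of scripts are illustrative assumptions. The mapping relies on the fact that each script's decimal digits occupy ten consecutive Unicode code points starting at its digit zero.

```python
# Code point of the digit zero in each script's Unicode decimal-digit block.
_ZERO_CODE_POINT = {
    "bengali": 0x09E6,
    "devanagari": 0x0966,
    "telugu": 0x0C66,
    "thai": 0x0E50,
}


def to_script_numerals(text: str, script: str) -> str:
    """Replace ASCII digits 0-9 in text with the digits of the given script.

    Characters other than ASCII digits are left unchanged.
    """
    zero = _ZERO_CODE_POINT[script]
    table = {ord(str(d)): chr(zero + d) for d in range(10)}
    return text.translate(table)


if __name__ == "__main__":
    converted = to_script_numerals("250", "bengali")
    # Python's int() accepts Unicode decimal digits, so the value round-trips.
    print(converted, int(converted))
```

Since `int()` parses any Unicode decimal digits, answers written in these scripts can still be compared numerically against Arabic-numeral references.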



Figure 2: MGSM problem distribution with respect to the number of reasoning steps in the standard solution.

