DocPrompting: GENERATING CODE BY RETRIEVING THE DOCS

Abstract

Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in their training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages code documentation by (1) retrieving the relevant documentation pieces given a natural language (NL) intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset, tldr, DocPrompting improves CodeT5 and GPT-Neo-1.3B by up to 6.9% absolute exact match.¹

1. INTRODUCTION

We address the task of natural language to code generation (NL→code): generating a code snippet, written in a general-purpose programming language such as Python or Bash, given a natural language intent. This task has grown sharply in popularity due to the emergence of large language models trained on vast amounts of natural language and code (Chen et al., 2021; Xu et al., 2022; Fried et al., 2022). NL→code models facilitate programming for both professional and inexperienced programmers, by allowing them to write code by only expressing their higher-level intent. Many existing code generation models either learn directly from input-output pairs provided as training data (Allamanis et al., 2015; Yin and Neubig, 2017; Iyer et al., 2018; Brockschmidt et al., 2019; Xu et al., 2020; Alon et al., 2020; Wang et al., 2021), or learn the mapping between input and output implicitly from naturally occurring corpora of intertwined natural language and code (Austin et al., 2021; Nijkamp et al., 2022). Nevertheless, all these works assume that all libraries and function calls were seen in the training data, and that at test time the trained model will need to generate only seen libraries and function calls. However, new functions and libraries are introduced all the time, and even a seen function call can have unseen arguments. Thus, these existing models inherently cannot generalize to generate such unseen usages. In contrast, human programmers frequently refer to manuals and documentation when writing code (Nykaza et al., 2002; Lethbridge et al., 2003). This allows humans to easily use functions and libraries they have never seen nor used before. Inspired by this ability, we propose DocPrompting: a code generation approach that learns to retrieve code documentation before generating the code. An overview of our approach is illustrated in Figure 1: first, a document retriever uses the NL intent n to retrieve relevant code documentation {d1, d2, d3} from a documentation pool D. Then, a code generator uses these docs in its prompt to generate the corresponding code c. The documentation pool serves as an external datastore that can be updated frequently with new contents (e.g., documentation of newly released libraries), without re-training any model component. This way, DocPrompting can leverage newly added documentation, and it can generate code containing unseen and unused functions and libraries. DocPrompting is general and applicable to any programming language and underlying base architecture. To the best of our knowledge, this is the first demonstration of leveraging documentation in models of code explicitly and effectively. We demonstrate the effectiveness of DocPrompting on two NL→code benchmarks and tasks, across two programming languages, and using several base models: GPT-Neo (Black et al., 2021), T5 (Raffel et al., 2020), CodeT5 (Wang et al., 2021), Fusion-in-Decoder (Izacard and Grave, 2021), and Codex (Chen et al., 2021). Further, we experiment with both sparse retrievers such as BM25 (Robertson and Jones, 1976) and dense retrieval models such as SimCSE (Gao et al., 2021). Finally, we introduce two new benchmarks for retrieval-based code generation: (a) in Bash, we curate a new benchmark by crawling the tldr repository and constructing the training/development/test splits without overlapping commands; (b) in Python, we re-split the popular CoNaLa benchmark (Yin et al., 2018) so that every test example contains at least one Python function that is not seen in the training data.
Models that use DocPrompting consistently outperform their base models, which generate code solely from the NL intents. DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on CoNaLa; on the new tldr dataset, it improves CodeT5 and GPT-Neo-1.3B by up to 6.9% absolute exact match. We release our new benchmarks, including annotations of oracle documents for each example and pools of documentation, to serve as a test bed for future retrieval-based code generation models.
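To make the retrieval step concrete, the following is a minimal sketch of the kind of sparse, BM25-style retriever mentioned above. The toy documentation pool, the `bm25_scores` function, and the whitespace tokenization are our own illustrative choices, not part of the released benchmarks.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency and IDF for each distinct query term.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t in tf:
                s += idf[t] * tf[t] * (k1 + 1) / (
                    tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy documentation pool (illustrative only).
pool = [
    "tar -x extract files from an archive",
    "tar -c create a new archive",
    "grep search for a pattern in files",
]
intent = "extract files from a tar archive"
docs = [d.lower().split() for d in pool]
scores = bm25_scores(intent.lower().split(), docs)
top = max(range(len(pool)), key=scores.__getitem__)
print(pool[top])  # the extraction doc scores highest
```

A dense retriever replaces the lexical scores with inner products of learned query and document embeddings, but the top-k selection over the pool is the same.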

2. CODE GENERATION BY READING THE DOCS

Our underlying assumption is that code documentation is the most exhaustive yet succinct resource for most libraries and programming languages (Roehm et al., 2012), and that documentation allows models to generalize effectively to unseen libraries and functions (Forward and Lethbridge, 2002). We follow the retrieve-then-generate paradigm (Lewis et al., 2020; Guu et al., 2020), focusing on retrieving documentation. In this section, we describe the general approach of DocPrompting; in §3 and §6.2, we elaborate on and experiment with practical implementations of DocPrompting.

Formulation. Given an NL intent n, our goal is to generate a corresponding code snippet c written in some programming language (PL) such as Python. We assume that a model has access to a collection of code documentation D. Each document d_i ∈ D describes the usage of a library, a function, or an



¹Data and code are available at https://github.com/shuyanzhou/docprompting.



Figure 1: DocPrompting: given an NL intent n, the retriever retrieves a set of relevant documentation {d1, d2, d3} from a documentation pool D. Then, the generator generates the code c based on the NL and the retrieved docs. DocPrompting allows the model to generalize to previously unseen usages by reading those docs. Italic blue highlights tokens shared between the NL and the docs; bold shows tokens shared between the docs and the code snippet.
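The second stage of the figure's pipeline, feeding retrieved docs to the generator, amounts to prompt construction. Below is a minimal sketch; the `build_prompt` name and the "# doc:"/"# intent:"/"# code:" markers are our own illustrative format (real prompt formats are tuned per base model), and the generator is any model that continues the prompt with code.

```python
def build_prompt(intent: str, retrieved_docs: list[str], max_docs: int = 3) -> str:
    """Concatenate the top retrieved docs with the NL intent into one prompt.

    A base code generator (e.g., CodeT5 or GPT-Neo) is then asked to
    continue this prompt with the predicted code snippet.
    """
    parts = [f"# doc: {d}" for d in retrieved_docs[:max_docs]]
    parts.append(f"# intent: {intent}")
    parts.append("# code:")
    return "\n".join(parts)

prompt = build_prompt(
    "extract files from a tar archive",
    ["tar -x: extract files from an archive",
     "tar -f: read the archive from the given file"],
)
print(prompt)
```

Because the docs enter only through the prompt, the documentation pool can be updated without retraining either component, which is what enables generalization to newly released libraries.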

