CODET5MIX: A PRETRAINED MIXTURE OF ENCODER-DECODER TRANSFORMERS FOR CODE UNDERSTANDING AND GENERATION

Abstract

Pretrained language models (LMs) trained on vast amounts of source code have achieved prominent progress in a wide range of code intelligence tasks. Despite their success, they either adopt specific network architectures (encoder-only or decoder-only) for different downstream tasks or rely on a single architecture (encoder-decoder or UniLM-style encoder) for all tasks; the latter approach usually results in sub-optimal performance on a subset of tasks. To address these limitations, we propose "CodeT5Mix", a mixture of encoder-decoder Transformers for code whose components can be flexibly combined to suit the target tasks during finetuning, while still enjoying the mutual benefits of joint pretraining. To endow the model with both code understanding and generation capabilities, we pretrain CodeT5Mix in a stage-wise manner on large-scale multilingual code corpora with a mixture of denoising, contrastive learning, matching, and causal language modeling (CLM) tasks. Additionally, we design a weight sharing strategy in the decoders for all components except the feedforward layers, which act as task-specific experts to reduce interference across tasks of various types. We extensively evaluate CodeT5Mix on seven code-related tasks over twenty datasets and show that it achieves state-of-the-art (SoTA) performance on most tasks, such as text-to-code retrieval, code completion and generation, and math programming. In particular, we demonstrate that CodeT5Mix can be used as a unified semi-parametric retrieval-augmented generator with SoTA code generation performance.

1. INTRODUCTION

Language model pretraining (Chen et al., 2021; Wang et al., 2021c; Feng et al., 2020) has recently demonstrated remarkable success in various downstream tasks in the code domain (Husain et al., 2019; Lu et al., 2021; Hendrycks et al., 2021). By pretraining large-scale language models on massive code-based data (e.g., public GitHub data), these models learn rich contextual representations that can be transferred to related downstream tasks. However, we found that many existing models are designed to perform well on only a subset of tasks (e.g., generative-only or retrieval-only tasks); on other tasks their performance is suboptimal, and the models often require substantial modifications to their architectures or learning objectives.

Existing models have two main limitations. First, current models follow either encoder-only (Feng et al., 2020; Guo et al., 2021) or decoder-only (Chen et al., 2021; Nijkamp et al., 2022) architectures, each suitable for only a subset of tasks. Specifically, encoder-only models are often used to facilitate retrieval-based tasks such as text-to-code retrieval (Lu et al., 2021), while decoder-only models are more appropriate for generative tasks such as code generation (Chen et al., 2021; Hendrycks et al., 2021). Several approaches have adopted encoder-decoder architectures to adapt to multiple types of tasks (Wang et al., 2021c; Ahmad et al., 2021). While these models can achieve good overall performance, they still fail to beat state-of-the-art encoder-only or decoder-only baselines on some tasks, e.g., retrieval and code completion respectively (Guo et al., 2022). Moreover, Li et al. (2022b) observe that encoder-decoder models do not perform well with in-context learning compared to GPT-style models like Codex (Chen et al., 2021). Second, current models are trained with self-supervised learning objectives that might not be appropriate for transferring the models to some downstream tasks.
For instance, T5-based models such as CodeT5 (Wang et al., 2021c) are often trained with a span denoising objective, whereas in downstream tasks such as code generation (Chen et al., 2021; Hendrycks et al., 2021), most state-of-the-art models are pretrained with a next-token prediction objective which auto-regressively predicts a program token by token. Furthermore, most models have no pretraining tasks dedicated to learning the sharp text/code representations that are vital for understanding tasks like text-to-code retrieval. Although recent attempts (Guo et al., 2022) introduce contrastive learning pretraining tasks to cope with this, their performance is still limited because they neglect the fine-grained cross-modal alignments.

To address the above issues, we introduce "CodeT5Mix", a new pretrained language framework for both code understanding and generation (see Fig. 1 for an overview). Specifically, CodeT5Mix includes the following contributions:

• A mixture of encoder-decoder Transformers: we introduce a new architectural design for multi-task pretraining and flexible finetuning on both code understanding and generation tasks. CodeT5Mix consists of multimodal encoder and decoder modules which, in downstream tasks, can be directly repurposed and combined to suit different functionalities.

• A mixture of self-supervised pretraining tasks: we adopt a diverse set of pretraining objectives to learn rich representations from both code and text data. We design a stage-wise pretraining strategy that first trains on code-only data with span denoising and causal language modeling (CLM) tasks, and then on text-code data with cross-modal contrastive learning, matching, and CLM tasks, where the matching task is crucial for capturing fine-grained text-code interactions.
• A weight sharing strategy through task-specific experts: to optimize multi-task learning while keeping the number of model parameters affordable, we propose task-specific experts, which are designed for different learning tasks while sharing the same backbone contextual representations.

• A unified model for semi-parametric retrieval-augmented generation: as CodeT5Mix is capable of both retrieval and generation tasks, we demonstrate that it can be seamlessly adopted as a semi-parametric retrieval-augmented generator to achieve SoTA code generation performance.

• Thorough evaluation and SoTA performance: our extensive evaluations show that CodeT5Mix yields significant performance gains on most downstream tasks compared to SoTA baselines, e.g., on 8 text-to-code retrieval tasks (+3.16 avg. MRR), 2 line-level code completion tasks (+2.56 avg. exact match), and 2 retrieval-augmented code generation tasks (+5.78 avg. BLEU-4).

• Open source: implementation code, data, and pretrained models will be made publicly available.
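To make the cross-modal contrastive pretraining task above concrete, the following is a minimal sketch of a symmetric text-code contrastive loss with in-batch negatives. It assumes cosine similarity, an InfoNCE-style formulation, and a fixed temperature; the function and variable names are our own illustration, not the paper's actual implementation.

```python
import numpy as np

def text_code_contrastive_loss(text_emb, code_emb, temperature=0.05):
    """Symmetric InfoNCE loss over a batch of paired (text, code) embeddings.

    text_emb, code_emb: (batch, dim) arrays where row i of each forms a
    matched text/code pair; all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)

    logits = t @ c.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])     # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # for numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the text-to-code and code-to-text retrieval directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The loss is small when each text embedding is closest to its paired code embedding and large otherwise, which is what pushes matched text/code pairs together and mismatched pairs apart during pretraining. The finer-grained matching task would additionally score each (text, code) pair with a binary classification head over their joint encoding, which this sketch does not cover.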

2. RELATED WORK

Typically, code-based language models (LMs) can be categorized into three architectures: encoder-only models like CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), and CodeMVP (Wang et al., 2022); decoder-only models like CodeGPT (Lu et al., 2021), Codex (Chen et al., 2021), InCoder (Fried et al., 2022), and CodeGen (Nijkamp et al., 2022); and encoder-decoder models like PLBART (Ahmad et al., 2021) and CodeT5 (Wang et al., 2021c).




