CONTRAGEN: EFFECTIVE CONTRASTIVE LEARNING FOR CAUSAL LANGUAGE MODEL

Abstract

Despite exciting progress in large-scale causal language models, the expressiveness of their representations is largely limited by the anisotropy issue, whereby the hidden representations are distributed in a narrow cone of the vector space. To resolve this problem, we present CONTRAGEN, a novel contrastive learning framework that operates at both the token level and the sequence level. We assess CONTRAGEN on a wide range of downstream tasks and show that it effectively enhances both the isotropy and the discriminative power of the representations. This leads to the desired improvements on various language understanding tasks, which helps bridge the gap with encoder-only models and makes causal language models better suited to tasks beyond language generation. Specifically, we attain a 44% relative improvement on Semantic Textual Similarity tasks and a 34% relative improvement on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of the representations, CONTRAGEN also boosts source code generation capability, with a 9% relative improvement in execution accuracy on the HumanEval benchmark.1

1. INTRODUCTION

Causal Language Models (CLM) have seen remarkable success in language generation, both in natural language (Radford et al., 2018; 2019; Brown et al., 2020) and in programming language (Chen et al., 2021; Nijkamp et al., 2022). However, a core limitation is the anisotropy issue, where the hidden representations, especially those output from the final layers of transformers, are squeezed into a tiny cone in the representation space. As observed by Ethayarajh (2019), the average cosine similarity between any two words from two randomly sampled sequences is almost one when evaluated at the last hidden layer of GPT-2 (Radford et al., 2019). Such anisotropy often causes a large performance gap (see Appendix D.1) with encoder-based models on discriminative tasks and hence limits the usage of CLM beyond language generation. Two main causes have been posited for these degenerate representations. Dong et al. (2021) theoretically prove that a key ingredient of language models, self-attention (Vaswani et al., 2017), possesses a strong inductive bias towards representation collapse. Another line of work argues that the autoregressive next-token prediction objective is the main cause of the non-isotropic representations. We thereby face a conundrum. On the one hand, both self-attention and next-token prediction are the driving force behind the widespread successes of CLM on language generation (Lewis et al., 2020; Radford et al., 2019; Chen et al., 2021; Chowdhery et al., 2022). On the other hand, as shown in Section 4 and Appendix D.1, the resulting anisotropic representations cause inferior performance on language understanding tasks, e.g., Semantic Textual Similarity (Agirre et al., 2012; 2013; 2014; Marelli et al., 2014; Agirre et al., 2015; 2016; Cer et al., 2017) and Code Search (Lu et al., 2021; Huang et al., 2021; Guo et al., 2022).
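As a rough illustration of this diagnostic (a minimal sketch, not the paper's evaluation code), anisotropy can be quantified as the average pairwise cosine similarity between hidden states of tokens drawn from different sequences; the arrays below are synthetic stand-ins for transformer hidden states, with the anisotropic case built by adding a shared dominant direction to every vector:

```python
import numpy as np

def avg_pairwise_cosine(reps: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows in reps (n, d).

    Values near 1 indicate a collapsed (anisotropic) representation space;
    values near 0 indicate an isotropic one.
    """
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = normed @ normed.T          # (n, n) cosine similarity matrix
    n = reps.shape[0]
    # Exclude the self-similarity entries on the diagonal.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

rng = np.random.default_rng(0)
# Isotropic case: directions spread uniformly, so similarity averages near 0.
isotropic = rng.normal(size=(512, 64))
# Anisotropic case: every vector shares one dominant direction (similarity near 1).
anisotropic = 0.1 * rng.normal(size=(512, 64)) + rng.normal(size=(1, 64))

print(avg_pairwise_cosine(isotropic))    # close to 0
print(avg_pairwise_cosine(anisotropic))  # close to 1
```

Reporting this statistic per layer reproduces the kind of measurement Ethayarajh (2019) describes for GPT-2, where the last hidden layer is the most anisotropic.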
We tackle this challenge by leveraging contrastive learning (Chen et al., 2020; He et al., 2020; Gao et al., 2021). Intuitively, by pushing instances apart from one another, contrastive learning produces more uniformly distributed representations. Previous studies (Wang & Isola, 2020; Wang & Liu, 2021) have shown that such uniformity is important for yielding discriminative representations. We thereby present CONTRAGEN, a novel contrastive learning framework at both the token level (CONTRAGEN-TOK) and the sequence level (CONTRAGEN-SEQ). Our goals are two-fold. First, an



1 We will release our code and checkpoints after the final decisions of acceptance are out.

