CONTRAGEN: EFFECTIVE CONTRASTIVE LEARNING FOR CAUSAL LANGUAGE MODEL

Abstract

Despite exciting progress in large-scale causal language models, the expressiveness of their representations is largely limited by the anisotropy issue, where the hidden representations are distributed into a narrow cone in the vector space. To resolve this problem, we present CONTRAGEN, a novel contrastive learning framework that operates at both the token level and the sequence level. We assess CONTRAGEN on a wide range of downstream tasks and show that it can effectively enhance both the isotropy and the discrimination of the representations. This leads to the desired improvement on various language understanding tasks, which helps bridge the gap with encoder-only models and makes causal language models better suited for tasks beyond language generation. Specifically, we attain a 44% relative improvement on the Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of the representations, CONTRAGEN also boosts source code generation capability, with a 9% relative improvement in execution accuracy on the HumanEval benchmark.¹

1. INTRODUCTION

Causal Language Models (CLM) have seen remarkable success in language generation, in both natural language (Radford et al., 2018; 2019; Brown et al., 2020) and programming language (Chen et al., 2021; Nijkamp et al., 2022). However, one limitation at their core is the anisotropy issue, where the hidden representations, especially those output from the final layers of transformers, are squeezed into a tiny cone in the representation space. As observed by Ethayarajh (2019), the average cosine similarity of any two words from two randomly sampled sequences is almost one when evaluated at the last hidden layer of GPT-2 (Radford et al., 2019). Such anisotropy often causes a large performance gap (see Appendix D.1) with encoder-based models on discriminative tasks and hence limits the wide usage of CLM beyond language generation. Two main reasons have been posited for the degenerated representations. Dong et al. (2021) theoretically prove that the key ingredient of language models, self-attention (Vaswani et al., 2017), possesses a strong inductive bias towards representation collapse. Another line of work argues that the autoregressive next-token prediction objective is the main cause of the non-isotropic representations. We thereby face a conundrum. On the one hand, both self-attention and next-token prediction are the driving force behind the widespread successes of CLM on language generation (Lewis et al., 2020; Radford et al., 2019; Chen et al., 2021; Chowdhery et al., 2022). On the other hand, as shown in Section 4 and Appendix D.1, the resulting anisotropic representations cause inferior performance on language understanding tasks, e.g., Semantic Textual Similarity (Agirre et al., 2012; 2013; 2014; Marelli et al., 2014; Agirre et al., 2015; 2016; Cer et al., 2017) and Code Search (Lu et al., 2021; Huang et al., 2021; Guo et al., 2022).
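The anisotropy described above is easy to reproduce geometrically. The sketch below uses purely synthetic vectors (not real GPT-2 hidden states) to show that once token representations share a single dominant direction, as final-layer CLM states tend to, their average pairwise cosine similarity approaches one:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # hidden size of GPT-2 small

def avg_pairwise_cosine(states: np.ndarray) -> float:
    """Mean off-diagonal cosine similarity between row vectors."""
    unit = states / np.linalg.norm(states, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(states)
    return float((sims.sum() - n) / (n * (n - 1)))

# Isotropic states: directions spread uniformly over the sphere.
isotropic = rng.normal(size=(100, dim))

# Anisotropic states: the same vectors pushed along one strong shared
# direction, squeezing them into a narrow cone.
shared = 10.0 * rng.normal(size=dim)
anisotropic = isotropic + shared

print(avg_pairwise_cosine(isotropic))    # close to 0
print(avg_pairwise_cosine(anisotropic))  # close to 1
```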
We tackle this challenge by leveraging contrastive learning (Chen et al., 2020; He et al., 2020; Gao et al., 2021). Intuitively, by separating instances from one another, contrastive learning can produce more uniformly distributed representations. Previous studies (Wang & Isola, 2020; Wang & Liu, 2021) have shown that uniformity is important for yielding discriminative representations. We thereby present CONTRAGEN, a novel contrastive learning framework at both the token level (CONTRAGEN-TOK) and the sequence level (CONTRAGEN-SEQ). Our goals are two-fold. First, an ideal CLM should be able to better leverage the representation space by dispersing semantically different tokens or sequences apart. Second, we aim to attain separable features by mapping tokens or sequences under similar contexts to distinct but comparatively closer locations in the vector space. We conduct extensive experiments and analyses to assess the effectiveness of our proposed framework, with particular focus on the following questions: (1) Does CONTRAGEN improve the discrimination ability of representations learned by CLM? (2) Do the enhanced representations lead to better performance on language generation tasks? (3) Is joint contrastive learning at both the token level and the sequence level necessary, and how do the two benefit from each other? (4) How does the impact of CONTRAGEN vary across language domains?

Figure 1: CONTRAGEN and its components enhance (Left) the representation discrimination, approximated as one minus the ratio of inter-similarity or uniformity (the cosine similarity between tokens from randomly sampled sequences) to intra-similarity (the cosine similarity between a sequence and its constituent tokens); a higher discrimination score is better; and (Right) the performance on discrimination (STS) and generation (HumanEval) tasks (details in Section 4).
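The discrimination score used in Figure 1 can be computed directly from hidden states. In the sketch below the sequence representation is mean-pooled from its token states, which is an assumption made for illustration; the score is one minus the ratio of inter-similarity to intra-similarity, so fully collapsed representations score near zero while well-spread ones score near one:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discrimination_score(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """seq_a, seq_b: (num_tokens, dim) token states of two random sequences.

    intra: cosine between a sequence (mean-pooled here, an assumption)
    and its own tokens; inter: cosine between tokens across sequences.
    """
    intra = np.mean([cosine(seq_a.mean(axis=0), tok) for tok in seq_a])
    inter = np.mean([cosine(ta, tb) for ta in seq_a for tb in seq_b])
    return 1.0 - inter / intra

rng = np.random.default_rng(1)
# Collapsed: every token maps to the same vector -> score near 0.
tok = rng.normal(size=64)
collapsed = np.tile(tok, (10, 1))
# Well-spread random states -> inter is near 0, so score near 1.
spread_a = rng.normal(size=(10, 64))
spread_b = rng.normal(size=(10, 64))

print(discrimination_score(collapsed, collapsed.copy()))  # ~0
print(discrimination_score(spread_a, spread_b))           # ~1
```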

2. RELATED WORK

Anisotropic Representation of Language Models  Despite the remarkable success achieved by large-scale language models (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2020; Lewis et al., 2020), they suffer from the anisotropy issue, where the representations are distributed into a tiny cone in the vector space (Gao et al., 2019; Ethayarajh, 2019; Li et al., 2020; Wang et al., 2020). In particular, Ethayarajh (2019) shows that the degeneration is much more severe in CLM, where the average cosine similarity between two words sampled from randomly selected sequences is almost one when evaluated on the output of the last hidden layer of GPT-2 (Radford et al., 2019). Inspired by previous findings (Arora et al., 2017; Mu & Viswanath, 2018), several efforts have focused on regularizing the representations toward an isotropic distribution, either via post-processing (Su et al., 2021; Li et al., 2020) or by directly optimizing various regularization terms during training (Gao et al., 2019; Wang et al., 2020). Although promising improvements have been observed, performance is still inadequate.

Contrastive Learning for Language Models  An alternative comes to light as popular contrastive learning techniques (Chen et al., 2020; He et al., 2020) have seen remarkable successes in Natural Language Processing (NLP). A large amount of research has focused on sentence representation learning for encoder-only models, with the main differences lying in how the augmentations are generated (Fang & Xie, 2020; Giorgi et al., 2021; Wu et al., 2020; Meng et al., 2021; Yan et al., 2021; Kim et al., 2021; Gao et al., 2021; Zhang et al., 2022). Recently, there has been emerging interest in developing effective contrastive learning approaches for text generation models. However, most existing work focuses on the encoder-decoder structure (Dong et al., 2019; Raffel et al., 2020; Lewis et al., 2020) by contrasting suboptimal model generations, obtained via diverse sampling (An et al., 2022) or by adding perturbations in the embedding space (Lee et al., 2021), against the ground truth. On the other hand, it is not intuitive, and even elusive, to develop an effective contrastive learning strategy for decoder-only models. A recent work (Su et al., 2022) proposes SimCTG, a token-level contrastive learning approach that aims to separate each token from the others within the same sequence by a prefixed distance. However, we find that our temperature-based token-level contrastive learning approach, CONTRAGEN-TOK, consistently outperforms SimCTG across different tasks. We conjecture that

¹ We will release our code and checkpoints after the final decisions of acceptance are out.
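The distinction between SimCTG's margin-based objective and a temperature-based one can be made concrete with a small InfoNCE-style sketch. The positive-pair construction below (two views of the same token states, e.g. from independent dropout passes) is an illustrative assumption, not necessarily CONTRAGEN-TOK's exact formulation:

```python
import numpy as np

def token_info_nce(view1: np.ndarray, view2: np.ndarray,
                   temperature: float = 0.05) -> float:
    """Temperature-based token-level contrastive loss.

    view1, view2: (num_tokens, dim) two views of the same token states;
    token i in view1 is the positive for token i in view2, and every
    other token serves as a negative.
    """
    v1 = view1 / np.linalg.norm(view1, axis=1, keepdims=True)
    v2 = view2 / np.linalg.norm(view2, axis=1, keepdims=True)
    logits = (v1 @ v2.T) / temperature            # (n, n) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives on the diagonal

rng = np.random.default_rng(0)
states = rng.normal(size=(8, 32))
aligned = token_info_nce(states, states + 0.01 * rng.normal(size=(8, 32)))
mismatched = token_info_nce(states, rng.normal(size=(8, 32)))
print(aligned < mismatched)  # True: matching views give a lower loss
```

Where SimCTG pushes distinct tokens apart until a hand-set distance margin is met, the temperature here plays the analogous role of controlling how strongly negatives are repelled, without fixing a margin in advance.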

