CONTRANORM: A CONTRASTIVE LEARNING PERSPECTIVE ON OVERSMOOTHING AND BEYOND

Abstract

Oversmoothing is a common phenomenon in a wide range of Graph Neural Networks (GNNs) and Transformers, where performance worsens as the number of layers increases. Instead of characterizing oversmoothing from the view of complete collapse, in which representations converge to a single point, we dive into the more general perspective of dimensional collapse, in which representations lie in a narrow cone. Accordingly, inspired by the effectiveness of contrastive learning in preventing dimensional collapse, we propose a novel normalization layer called ContraNorm. Intuitively, ContraNorm implicitly shatters representations in the embedding space, leading to a more uniform distribution and less severe dimensional collapse. Theoretically, we prove that ContraNorm can alleviate both complete collapse and dimensional collapse under certain conditions. Our proposed normalization layer can be easily integrated into GNNs and Transformers with negligible parameter overhead. Experiments on various real-world datasets demonstrate the effectiveness of our proposed ContraNorm. Our implementation is available online.

1. INTRODUCTION

Recently, the rise of Graph Neural Networks (GNNs) has enabled important breakthroughs in various fields of graph learning (Ying et al., 2018; Senior et al., 2020). Along another avenue, although they dispense with bespoke convolution operators, Transformers (Vaswani et al., 2017) also achieve phenomenal success in multiple natural language processing (NLP) tasks (Lan et al., 2020; Liu et al., 2019; Rajpurkar et al., 2018) and have been transferred successfully to the computer vision (CV) field (Dosovitskiy et al., 2021; Liu et al., 2021; Strudel et al., 2021). Despite their different model architectures, GNNs and Transformers are both hindered by the oversmoothing problem (Li et al., 2018; Tang et al., 2021), where deeply stacked layers give rise to indistinguishable representations and significant performance deterioration.

To get rid of oversmoothing, we need to dive into the modules inside and understand how oversmoothing happens in the first place. However, we notice that existing oversmoothing analyses fail to fully characterize the behavior of learned features. A canonical metric for oversmoothing is the average similarity (Zhou et al., 2021; Gong et al., 2021; Wang et al., 2022). The tendency of similarity converging to 1 indicates that representations shrink to a single point (complete collapse). However, this metric cannot depict a more general collapse case, where representations span a low-dimensional manifold in the embedding space and also sacrifice expressive power, which is called dimensional collapse (left figure in Figure 1). In such cases, the similarity metric fails to quantify the collapse level. Therefore, we need to go beyond existing measures and take this so-called dimensional collapse into consideration.
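The distinction can be made concrete on synthetic features. The sketch below (function names are ours; "effective rank" here is the common entropy-of-singular-values definition, and the toy data are illustrative) shows average pairwise cosine similarity staying near zero even when features collapse onto a 2-dimensional subspace, while the effective rank exposes the collapse:

```python
import numpy as np

def avg_cosine_similarity(H):
    """Mean pairwise cosine similarity between rows of H (n x d)."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    S = Hn @ Hn.T
    n = H.shape[0]
    return (S.sum() - n) / (n * (n - 1))  # exclude self-similarity

def effective_rank(H):
    """Effective rank: exp of the entropy of the normalized singular values."""
    s = np.linalg.svd(H - H.mean(0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Features spanning the full 64-dim space vs. a 2-dim subspace (dimensional collapse).
full = rng.normal(size=(256, 64))
collapsed = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 64))

# Average similarity is near 0 in both cases, so it cannot tell them apart...
print(avg_cosine_similarity(full), avg_cosine_similarity(collapsed))
# ...but effective rank drops to about 2 for the collapsed features.
print(effective_rank(full), effective_rank(collapsed))
```

Only the complete-collapse case (similarity near 1) is visible to the similarity metric; the effective rank captures both failure modes.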
Actually, this dimensional collapse behavior is widely discussed in the contrastive learning literature (Jing et al., 2022; Hua et al., 2021; Chen & He, 2021; Grill et al., 2020), which may hopefully help us characterize the oversmoothing problem of GNNs and Transformers. The main idea of contrastive learning is to maximize agreement between different augmented views of the same data example (i.e., positive pairs) via a contrastive loss. A common contrastive loss can be decoupled into an alignment loss and a uniformity loss (Wang & Isola, 2020). The two ingredients correspond to different objectives: the alignment loss pulls positive pairs closer, while the uniformity loss measures how well the embeddings are uniformly distributed. Training with only the alignment loss may lead to a trivial solution where all representations shrink to one single point. Fortunately, the uniformity loss naturally prevents this by drawing representations evenly distributed in the embedding space. Given the similarities between the oversmoothing problem and the representation collapse issue, we establish a connection between them.

Instead of directly adding the uniformity loss into model training, we design a normalization layer that can be used out of the box with almost no parameter overhead. To achieve this, we first transfer the uniformity loss used for training to a loss defined over graph node representations, so that it optimizes the representations themselves rather than the model parameters. Intuitively, this loss meets the need of drawing a uniform node distribution. Following recent research on combining optimization schemes and model architectures (Yang et al., 2021; Zhu et al., 2021; Xie et al., 2021; Chen et al., 2022), we use the transferred uniformity loss as an energy function underlying our proposed normalization layer, such that descent steps along it correspond to the forward pass.
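The idea of taking descent steps on a uniformity loss over the representations themselves can be illustrated with a toy sketch. This is not the actual ContraNorm update (which is derived in closed form, not via numerical gradients); the loss follows the Gaussian-potential uniformity of Wang & Isola (2020), and the finite-difference step, learning rate, and data are purely illustrative:

```python
import numpy as np

def uniformity_loss(H, t=2.0):
    """Uniformity loss: log-mean of the Gaussian potential exp(-t*||x - y||^2)
    over all pairs of L2-normalized representations. Lower = more uniform."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sq = ((Hn[:, None, :] - Hn[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    mask = ~np.eye(H.shape[0], dtype=bool)                 # drop self-pairs
    return float(np.log(np.exp(-t * sq[mask]).mean()))

def uniformity_descent_step(H, lr=0.5, eps=1e-3):
    """One gradient step on the uniformity loss w.r.t. the representations
    themselves (central finite differences for simplicity)."""
    grad = np.zeros_like(H)
    for idx in np.ndindex(*H.shape):
        E = np.zeros_like(H)
        E[idx] = eps
        grad[idx] = (uniformity_loss(H + E) - uniformity_loss(H - E)) / (2 * eps)
    return H - lr * grad

rng = np.random.default_rng(0)
H = 1.0 + 0.05 * rng.normal(size=(16, 8))  # nearly collapsed: all rows almost identical
before = uniformity_loss(H)
after = uniformity_loss(uniformity_descent_step(H))
print(before, after)  # the step lowers the loss, i.e. spreads the representations out
```

ContraNorm plays the role of such a descent step inside the forward pass, but unfolded analytically rather than computed numerically as above.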
By analyzing the unfolded iterations of the principled uniformity loss, we design a new normalization layer, ContraNorm. As a proof of concept, Figure 1 demonstrates that ContraNorm pushes the features away from each other, which eases the dimensional collapse. Theoretically, we prove that ContraNorm increases both the average variance and the effective rank of representations, thus alleviating complete collapse and dimensional collapse effectively. We also conduct a comprehensive evaluation of ContraNorm on various tasks. Specifically, ContraNorm boosts the average performance of BERT (Devlin et al., 2018) from 82.59% to 83.54% on the validation set of the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), and raises the test accuracy of DeiT (Touvron et al., 2021) with 24 blocks from 77.69% to 78.67% on the ImageNet-1K dataset (Russakovsky et al., 2015). For GNNs, experiments are conducted on fully supervised graph node classification tasks, and our proposed model outperforms the vanilla Graph Convolutional Network (GCN) (Kipf & Welling, 2017) at all depth settings. Our contributions are summarized as:

• We dissect the limitations of existing oversmoothing analyses, and highlight the importance of incorporating the dimensional collapse issue into consideration.

• Inspired by techniques from contrastive learning to measure and resolve oversmoothing, we propose ContraNorm as an optimization-induced normalization layer to prevent dimensional collapse.

• Experiments on a wide range of tasks show that ContraNorm can effectively mitigate dimensional collapse in various model variants, and demonstrate clear benefits across three different scenarios: ViT for image classification, BERT for natural language understanding, and GNNs for node classification.

Figure 1: An illustration of how our proposed ContraNorm solves the dimensional collapse. Left: features suffer from dimensional collapse. Right: with the help of ContraNorm, features become more uniform in the space, and the dimensional collapse is eased.

2. BACKGROUND

Message Passing in GNNs. Most GNNs follow a message-passing scheme (MP-GNNs), where each node representation is updated layer by layer as

$$h_i^{(l)} = \mathrm{UPDATE}\Big(h_i^{(l-1)},\ \mathrm{AGGREGATE}\big(h_i^{(l-1)},\ \{h_j^{(l-1)} \mid j \in \mathcal{N}(i)\}\big)\Big),$$

where $\mathcal{N}(i)$ denotes the neighborhood set of node $i$, $\mathrm{AGGREGATE}(\cdot)$ is the procedure by which nodes exchange messages, and $\mathrm{UPDATE}(\cdot)$ is often a multi-layer perceptron (MLP). A classical MP-GNN model is GCN (Kipf & Welling, 2017), which propagates messages between 1-hop neighbors using an adjacency matrix.

Self-Attention in Transformers. Transformers encode information in a global scheme with self-attention as the key ingredient (Vaswani et al., 2017). The self-attention module re-weights interme-

Availability: https://github.com/PKU-ML

