ATTENTION DE-SPARSIFICATION MATTERS: INDUCING DIVERSITY IN DIGITAL PATHOLOGY REPRESENTATION LEARNING

Abstract

In this work, we develop DiRL, a Diversity-inducing Representation Learning technique for histopathology image analysis. Self-supervised learning (SSL) techniques, such as contrastive and non-contrastive approaches, have been shown to learn rich and effective representations without any human supervision. Lately, computational pathology has also benefited from the resounding success of SSL. Here, we develop a novel prior-guided pre-training strategy based on SSL to enhance representation learning in digital pathology. Our analysis of the attention distribution of vanilla SSL-pretrained models reveals an insightful observation: sparsity in attention, i.e., the model tends to localize most of its attention to a few prominent patterns in the image. Although attention sparsity can be beneficial in natural images, where these prominent patterns are the object of interest itself, it can be sub-optimal in digital pathology; unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components. Inadequate diversification of attention in these complex images can result in crucial information loss. To address this, we first leverage cell segmentation to densely extract multiple histopathology-specific representations. We then propose a prior-guided dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, thus inducing adequate diversification in attention for capturing context-rich representations. Through quantitative and qualitative analysis on multiple slide-level tasks across cancer types, and on patch-level classification tasks, we demonstrate the efficacy of our method and observe that attention is more globally distributed.
Specifically, we obtain a relative improvement in accuracy of up to 6.9% on slide-level and 2% on patch-level classification tasks (with corresponding AUC improvements of up to 7.9% and 0.7%, respectively) over a baseline SSL model.

1. INTRODUCTION

Computational pathology is a rapidly emerging field that aims to analyze high-resolution images of biopsied or resected tissue samples. Advancements in computer vision and deep learning have enabled learning of the rich phenotypic information in whole slide images (WSIs) to understand mechanisms contributing to disease progression and patient outcomes. Acquiring crop-level localized annotations for WSIs is expensive and often not feasible; usually only slide-level pathologist labels are available. In such a scenario, weak supervision is a commonly utilized strategy: crops are first embedded into representations, and these WSI-crop representations are then treated as a bag for multiple instance learning (MIL). The question remains: how do we learn a model that effectively encodes the crops into rich representations? Traditionally, ImageNet (Krizhevsky et al., 2017) pre-trained neural networks are utilized to extract the representations (Lu et al., 2021b; Lerousseau et al., 2021; Shao et al., 2021). However, ImageNet and pathology datasets are composed of different semantics; while the former contains object-centric natural images, the latter consists of images with spatially distributed biological components such as cells, glands, and stroma. Therefore, to learn domain-specific features of WSI-crops in the absence of localized annotations, various self-supervised learning (SSL) techniques have recently been gaining traction (Ciga et al., 2022; Stacke et al., 2021; Boyd et al., 2021). SSL models pre-trained on histopathology datasets have been shown to be more effective in downstream classification tasks than those trained on ImageNet. To further analyze the role of SSL in computational pathology, we pre-trained a vision transformer (Dosovitskiy et al., 2020) on various WSI datasets using vanilla SSL (Caron et al., 2021).
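The two-stage weak-supervision pipeline above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the paper's implementation) of attention-based MIL pooling over a bag of crop embeddings, with random projections standing in for learned parameters:

```python
import numpy as np

def mil_attention_pool(crop_embeddings, w, v):
    """Aggregate a bag of crop embeddings into one slide-level representation.

    crop_embeddings : (n_crops, d) array, one row per embedded WSI-crop.
    w : (d, h) projection and v : (h,) scoring vector -- learned in
        practice, random here purely for illustration.
    """
    scores = np.tanh(crop_embeddings @ w) @ v      # one score per crop
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return alpha @ crop_embeddings, alpha          # (d,) slide repr, weights

rng = np.random.default_rng(0)
bag = rng.normal(size=(8, 16))  # 8 crops embedded to 16-d in the first stage
slide_repr, alpha = mil_attention_pool(
    bag, rng.normal(size=(16, 4)), rng.normal(size=(4,)))
```

The quality of `slide_repr` hinges entirely on how rich the first-stage crop embeddings are, which is the motivation for the pre-training question raised above.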
In-depth analysis of the pre-trained models' attention maps on WSI-crops led us to a striking observation: sparsity in attention maps. The model tends to localize most of its attention to a small fraction of regions, leading to sub-optimal representation learning. To further validate this observation, we visualized the attention maps of a self-supervised ImageNet pre-trained model on natural images (see Fig. 1). Similar observations led us to conclude that this is a property of SSL rather than of the data. We believe that sparsity in attention might benefit performance in some natural imaging tasks such as object classification. This stems from the fact that during SSL, the model is tasked to match the two views, and optimizing this objective leads the model to focus on the prominent patterns. For example, in Fig. 1(a), for an object-centric ImageNet example, since the prominent pattern is the object (e.g., a bird) itself (Yun et al., 2022), the model tends to center its attention on the object, thus benefiting numerous downstream applications (e.g., bird classification). In contrast, WSI-crops are not object-centric; rather, they constitute a spatial distribution of complex structures such as cells, glands, and their clusters and organizations (see Fig. 1(b)). Encoding this dense information into a holistic representation demands that the model attend diversely to various histopathology primitives and not just to specific ones. However, the vanilla SSL model pre-trained on histopathology only sparsely attends to the important regions (Fig. 1(b)), i.e., there is inadequate diversity in attention. We hypothesize that such a sparsely attending model could encode sub-optimal representations, as fine-grained context-rich details are often ignored. To address this issue of inadequate attention diversity, we propose DiRL, a diversity-inducing pre-training technique tailored to enhance representation learning in digital pathology.
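One simple way to quantify the sparsity-versus-diversity contrast described above is the normalized Shannon entropy of an attention distribution over patch tokens. The sketch below is our own illustrative measure, not a metric from the paper; the synthetic distributions stand in for a ViT's attention weights:

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Normalized Shannon entropy of an attention distribution.

    Returns a value in [0, 1]: 1.0 means perfectly uniform (diverse)
    attention; values near 0 mean attention concentrated on few tokens
    (sparse).
    """
    attn = attn / attn.sum()                        # ensure it is a distribution
    h = -(attn * np.log(attn + eps)).sum()          # Shannon entropy
    return h / np.log(attn.size)                    # normalize by max entropy

n = 196                                 # e.g. 14x14 patch tokens of a ViT
uniform = np.full(n, 1.0 / n)           # diversified attention
sparse = np.zeros(n)
sparse[:5] = 1.0 / 5                    # all mass on 5 "prominent" tokens
```

Under this measure, the uniform map scores near 1.0 while the 5-token map scores far lower, mirroring the sparse maps observed for vanilla SSL in Fig. 1.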
Each WSI-crop consists of two region types: cellular regions (containing cells) and non-cellular regions (containing no cells). We leverage an off-the-shelf cell segmentation pipeline to identify these regions. This domain-specific knowledge is then utilized to extract region-level representations separately for the cellular and non-cellular regions. We further propose to encode the inter- and intra-region spatial interplay of the two regions. This biologically inspired step (Saltz et al., 2018; Fassler et al., 2022) is achieved through a transformer-based disentangle block that encodes the self-interaction within each region and the cross-interaction between the two regions, termed disentangled representations. In contrast to vanilla SSL frameworks that leverage one image-level representation per WSI-crop, our prior-guided representation learning framework leverages histology-specific domain knowledge to densely extract a set of region-level and disentangled representations. We then task our framework to match all the corresponding representations between the views. We hypothesize that optimizing this dense matching objective between the views would encourage the model to diversify its attention to various regions; matching assorted representations would then enforce the model to explore diverse regions of the image.
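The dense matching objective can be sketched as an average similarity loss over corresponding representations of the two views. The snippet below is a simplified, hypothetical sketch (a plain negative-cosine objective, not the paper's exact loss), where the list entries stand in for the cellular, non-cellular, and disentangled representations:

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def dense_matching_loss(reprs_view1, reprs_view2):
    """Average negative cosine similarity over corresponding
    representations (e.g. cellular, non-cellular, disentangled)
    extracted from the two augmented views. Lower is better."""
    assert len(reprs_view1) == len(reprs_view2)
    return -np.mean([cosine(a, b) for a, b in zip(reprs_view1, reprs_view2)])

rng = np.random.default_rng(1)
v1 = [rng.normal(size=32) for _ in range(3)]          # 3 reprs from view 1
v2 = [r + 0.1 * rng.normal(size=32) for r in v1]      # view 2 ~ view 1
loss_aligned = dense_matching_loss(v1, v2)
loss_random = dense_matching_loss(v1, [rng.normal(size=32) for _ in range(3)])
```

Because every region-level representation contributes a term to the loss, no single prominent pattern can dominate the objective, which is the intuition behind the diversified attention the method targets.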



Figure 1: Diversification of attention for encoding dense information in digital pathology. View 1 and View 2 are two augmented views of the input image. a) Illustration of the attention map from a model pre-trained on ImageNet using vanilla SSL. b) Attention maps of a model pre-trained on a histopathology dataset with vanilla SSL, and with our proposed pre-training strategy. In both natural imaging and digital pathology, vanilla SSL pre-training creates sparse attention maps, i.e., it attends largely to only some prominent patterns. Although attention sparsification can be beneficial in natural image tasks such as object classification, it can be sub-optimal for encoding representations in digital pathology, as it leads to loss of important contextual information. Through a more diversified attention mechanism, DiRL encodes dense information critical to non-object-centric tasks.

