ATTENTION DE-SPARSIFICATION MATTERS: INDUCING DIVERSITY IN DIGITAL PATHOLOGY REPRESENTATION LEARNING

Abstract

In this work, we develop DiRL, a Diversity-inducing Representation Learning technique for histopathology image analysis. Self-supervised learning (SSL) techniques, such as contrastive and non-contrastive approaches, have been shown to learn rich and effective representations without any human supervision. Lately, computational pathology has also benefited from the resounding success of SSL. In this work, we develop a novel prior-guided pre-training strategy based on SSL to enhance representation learning in digital pathology. Our analysis of the attention distribution of vanilla SSL-pretrained models reveals an insightful observation: attention is sparse, i.e., models tend to localize most of their attention to a few prominent patterns in the image. Although attention sparsity can be beneficial in natural images, where these prominent patterns typically coincide with the object of interest, it can be sub-optimal in digital pathology. Unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components, and inadequate diversification of attention in these complex images can result in the loss of crucial information. To address this, we first leverage cell segmentation to densely extract multiple histopathology-specific representations. We then propose a prior-guided dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, inducing adequate diversification in attention for capturing context-rich representations. Through quantitative and qualitative analysis on multiple slide-level tasks across cancer types, and on patch-level classification tasks, we demonstrate the efficacy of our method and observe that the attention is more globally distributed.
Specifically, we obtain a relative improvement in accuracy of up to 6.9% in slide-level and 2% in patch-level classification tasks (corresponding AUC improvements of up to 7.9% and 0.7%, respectively) over a baseline SSL model.
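To make the dense pretext objective concrete, the following is a minimal sketch of how region-pooled features from two augmented views could be matched. All names and tensor shapes here are illustrative assumptions, not the paper's actual implementation: we assume each view yields `R` features, one per segmented biological component, and corresponding components are pulled together with a negative cosine similarity, which encourages attention to spread over every component rather than a few salient ones.

```python
import torch
import torch.nn.functional as F

def dense_matching_loss(feats_v1: torch.Tensor, feats_v2: torch.Tensor) -> torch.Tensor:
    """Hypothetical prior-guided dense matching objective.

    feats_v1, feats_v2: (B, R, D) tensors of region-pooled features for R
    biological components (e.g., cell regions) in two augmented views of
    the same image. Features at the same region index r correspond.
    """
    z1 = F.normalize(feats_v1, dim=-1)  # unit-normalize each region feature
    z2 = F.normalize(feats_v2, dim=-1)
    # Negative cosine similarity, averaged over all regions and samples;
    # minimizing this pulls every corresponding region pair together.
    return -(z1 * z2).sum(dim=-1).mean()
```

When both views produce identical features, the loss attains its minimum of -1; the per-region averaging is what distinguishes this dense formulation from a single global-embedding objective.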

1. INTRODUCTION

Computational pathology is a rapidly emerging field that aims at analyzing high-resolution images of biopsied or resected tissue samples. Advancements in computer vision and deep learning have enabled learning of the rich phenotypic information in whole slide images (WSIs) to understand mechanisms contributing to disease progression and patient outcomes. Acquiring crop-level localized annotations for WSIs is expensive and often not feasible; only slide-level pathologist labels are usually available. In such a scenario, weak supervision is a commonly utilized strategy: crops are embedded into representations in a first stage, and these WSI-crop representations are then treated as a bag for multiple instance learning (MIL). The question remains, how do we learn a model that effectively encodes the crops into rich representations? Traditionally, ImageNet (Krizhevsky et al., 2017) pre-trained neural networks are utilized to extract the representations (Lu et al., 2021b; Lerousseau et al., 2021; Shao et al., 2021). However, ImageNet and pathology datasets are composed of different semantics; while the former contains object-centric natural images, the latter consists of images with spatially distributed biological components such as cells, glands, stroma, etc. Therefore, to learn domain-specific features of WSI-crops in the absence of localized annotations, various self-supervised learning (SSL) techniques have recently been gaining traction (Ciga et al., 2022; Stacke
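The weakly supervised pipeline above can be sketched as attention-based MIL pooling over a bag of crop embeddings, in the spirit of Ilse et al.'s attention-based deep MIL. This is a minimal illustration under assumed names and dimensions, not the specific MIL head used in this work:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Sketch of attention-based MIL pooling for slide-level prediction.

    A bag of N pre-extracted crop embeddings (e.g., from an SSL-pretrained
    encoder) is aggregated into one slide embedding via learned attention
    weights, then classified with a linear head. Sizes are illustrative.
    """
    def __init__(self, dim: int = 512, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (N, dim) crop embeddings for one slide
        w = torch.softmax(self.attn(bag), dim=0)   # (N, 1) attention weights
        slide_feat = (w * bag).sum(dim=0)          # (dim,) weighted pooling
        return self.head(slide_feat)               # (n_classes,) slide logits
```

Because only the slide-level label supervises this head, the quality of the slide prediction hinges on how informative the frozen crop embeddings are, which is precisely what the pre-training stage must deliver.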

