CONTRASTIVE LEARNING OF MEDICAL VISUAL REPRESENTATIONS FROM PAIRED IMAGES AND TEXT

Abstract

Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.

1. INTRODUCTION

Medical image understanding has the potential to transform healthcare and has seen rapid progress with the use of deep neural architectures (Gulshan et al., 2016; Esteva et al., 2017; De Fauw et al., 2018; Rajpurkar et al., 2018b). Yet, with expert-level performance achieved only in some specialties and under some circumstances, medical image understanding remains a difficult task for the majority of specialties, mainly due to its challenging nature and the extreme scarcity of annotated data.

Existing work has followed two general approaches to obtain annotations for medical imaging tasks. The first approach uses high-quality annotations created by medical experts (Abràmoff et al., 2016; Gulshan et al., 2016; Shih et al., 2019; Wang & Wong, 2020). However, the high cost of this approach has resulted in datasets that are mostly orders of magnitude smaller than natural image datasets such as ImageNet (Russakovsky et al., 2015). To remedy this, existing work has relied heavily on transferring model weights from ImageNet pretraining (Wang et al., 2017; Esteva et al., 2017; Irvin et al., 2019). This approach is suboptimal because, as shown in Figure 1, medical image understanding often requires representations of very fine-grained visual features that are drastically different from those required for identifying objects in natural images. As a result, Raghu et al. (2019) found that ImageNet pretraining often provides little to no benefit compared to simple random initialization. A second popular approach is to use expert-crafted rules to extract labels from the textual reports accompanying the medical images. This approach has led to datasets of larger scale, since the text data paired with medical images are often produced naturally by medical experts in their routine workflow and are abundant in a typical hospital's IT systems.
Nevertheless, this rule-based label extraction approach has two limitations: 1) the rules are often inaccurate and limited to a few major categories (Wang et al., 2017), leading to very inefficient use of the textual report data; 2) these rules are often domain-specific and sensitive to the style of the text, making cross-domain and cross-institution generalization difficult (Irvin et al., 2019).

In efforts to make more efficient use of unlabeled image data, several recent studies have shown promising results from contrastive representation learning on natural images (Chen et al., 2020a; He et al., 2020; Grill et al., 2020). However, as we will show, applying these image view-based contrastive methods to medical images provides only marginal benefits compared to ImageNet pretraining, a result mostly due to the high inter-class similarity of medical images, as in Figure 1.

In this work, we aim to improve visual representations of medical images by combining the benefits of learning from abundant textual data with those of unsupervised statistical approaches. We present Contrastive VIsual Representation Learning from Text (ConVIRT), a framework for learning visual representations by exploiting the naturally occurring pairing of images and textual data. ConVIRT improves visual representations by maximizing the agreement between true image-text pairs versus random pairs via a bidirectional contrastive objective between the image and text modalities. We apply ConVIRT to the pretraining of medical image encoders, and show that it leads to higher-quality in-domain image representations that capture the subtlety of visual features required for medical image understanding tasks. Compared to existing methods, ConVIRT has the advantages of utilizing the paired text data in a way agnostic to the medical specialty and requiring no additional expert input.
We evaluate ConVIRT by transferring our pretrained weights to 4 different medical image classification tasks covering 2 different specialties. We find that the resulting models outperform all baseline initialization approaches, including the standard ImageNet pretraining and several strong baselines that also utilize the paired text data. Most notably, in all 4 tasks, ConVIRT requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance. We further evaluate ConVIRT on two new zero-shot retrieval tasks, an image-image and a text-image retrieval task, and also find it superior to all baselines. To facilitate future research, we will make our code and the collected retrieval datasets available.

2. METHOD

2.1. TASK DEFINITION

We start by giving a formal description of our representation learning setting. We assume paired input (x_v, x_u) where x_v represents one or a group of images, and x_u represents a text sequence which describes the imaging information in x_v. Our goal is to learn a parameterized image encoder function f_v, which maps an image to a fixed-dimensional vector. We are then interested in transferring the learned image encoder function f_v into downstream tasks, such as classification or image retrieval. In this work, we model the encoder function f_v as a convolutional neural network (CNN). We note that paired image-text data (x_v, x_u) naturally exists for many medical domains. Medical experts such as radiologists produce textual descriptions of images as part of their routine workflow, some of which are also made publicly available (Demner-Fushman et al., 2016; Johnson et al., 2019).
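The encoder f_v is any CNN that maps an image x_v to a fixed-dimensional vector. A minimal sketch in PyTorch is shown below; the tiny backbone and the 512-dimensional output are purely illustrative assumptions (the section above only requires that f_v be a CNN producing a fixed-dimensional vector), not the architecture used in our experiments:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """A stand-in for f_v: a CNN mapping an image x_v to a d_h-dimensional vector.

    The two-layer backbone here is illustrative only; any CNN ending in a
    global pooling step yields the required fixed-dimensional output.
    """
    def __init__(self, d_h: int = 512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> size-independent output
        )
        self.fc = nn.Linear(64, d_h)

    def forward(self, x_v: torch.Tensor) -> torch.Tensor:
        h = self.features(x_v).flatten(1)  # (batch, 64)
        return self.fc(h)                  # (batch, d_h): the vector h_v

f_v = ImageEncoder()
h_v = f_v(torch.randn(4, 3, 224, 224))  # a toy batch of 4 images
print(tuple(h_v.shape))                 # (4, 512)
```

Because the output dimension is fixed regardless of the downstream task, the same f_v can later be transferred to classification or retrieval by attaching a task-specific head.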

2.2. CONTRASTIVE VISUAL REPRESENTATION LEARNING FROM TEXT

An overview of our method, ConVIRT, for learning f_v is shown in Figure 2. At a high level, our method converts each input image x_v and text x_u into d-dimensional vector representations v and u respectively, following a similar processing pipeline. For each input image x_v, our method starts by drawing a random view x̃_v from x_v with a sampled transformation function t_v ∼ T, where T represents a family of stochastic image transformation functions described later. Next, the encoder function f_v transforms x̃_v into a fixed-dimensional vector h_v, followed by a non-linear projection function g_v which further transforms h_v into vector v:

v = g_v(f_v(x̃_v)),    (1)

where v ∈ R^d. Similarly, for each text input x_u, we obtain a span x̃_u from it following a sampling function t_u, and then a text representation u with:

u = g_u(f_u(x̃_u)),    (2)

where f_u is a text encoder,
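Given the batch of projected representations v and u, a bidirectional contrastive objective pulls each true (image, text) pair together while pushing randomly paired batch entries apart. The sketch below shows one common instantiation of such an objective, an InfoNCE-style loss over temperature-scaled cosine similarities, averaged over the image-to-text and text-to-image directions; the cosine similarity, the temperature value, and the equal weighting of the two directions are assumptions for illustration, since the exact form of the loss is defined later in the paper:

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(v: torch.Tensor, u: torch.Tensor,
                                   tau: float = 0.1) -> torch.Tensor:
    """Illustrative bidirectional contrastive loss over N projected pairs.

    Row i of v (image projection) and row i of u (text projection) form a true
    pair; every other row in the batch serves as a negative. Cosine similarity
    and the temperature tau follow common contrastive-learning practice and are
    assumptions here, not the paper's exact hyperparameters.
    """
    v = F.normalize(v, dim=1)
    u = F.normalize(u, dim=1)
    logits = v @ u.t() / tau                         # (N, N) scaled similarities
    targets = torch.arange(v.size(0))                # true pairs lie on the diagonal
    loss_v2u = F.cross_entropy(logits, targets)      # image -> text direction
    loss_u2v = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_v2u + loss_u2v) / 2

# Toy usage: 4 pairs of 8-dimensional projections. Near-identical pairs
# yield a low loss, since each diagonal entry dominates its row and column.
v = torch.randn(4, 8)
u = v + 0.01 * torch.randn(4, 8)
print(bidirectional_contrastive_loss(v, u).item())
```

Treating every non-matching pair in the batch as a negative is what makes the objective "bidirectional": the same similarity matrix is read row-wise for the image-to-text direction and column-wise for the reverse.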



Figure 1: Two example chest radiograph images with different abnormality categories, along with sentences from their paired textual report and example views indicative of their characteristics.

