CONTRASTIVE LEARNING OF MEDICAL VISUAL REPRESENTATIONS FROM PAIRED IMAGES AND TEXT

Abstract

Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.

1. INTRODUCTION

Medical image understanding has the potential to transform healthcare and has seen rapid progress with the use of deep neural architectures (Gulshan et al., 2016; Esteva et al., 2017; De Fauw et al., 2018; Rajpurkar et al., 2018b). Yet, with expert-level performance achieved only in some specialties and under some circumstances, medical image understanding remains a difficult task for the majority of specialties, mainly due to its challenging nature and the extreme scarcity of annotated data. Existing work has followed two general approaches to obtain annotations for medical imaging tasks. The first approach has been using high-quality annotations created by medical experts (Abràmoff et al., 2016; Gulshan et al., 2016; Shih et al., 2019; Wang & Wong, 2020). However, the high cost of this approach has resulted in datasets that are mostly orders of magnitude smaller than natural image datasets such as ImageNet (Russakovsky et al., 2015). To remedy this, existing work has relied heavily on transferring model weights from ImageNet pretraining (Wang et al., 2017; Esteva et al., 2017; Irvin et al., 2019). This approach is suboptimal because, as shown in Figure 1, medical image understanding often requires representations of very fine-grained visual features that are drastically different from those required for identifying objects in natural images. As a result, Raghu et al. (2019) found that ImageNet pretraining often provides little to no benefit compared to simple random initialization. A second popular approach is to use expert-crafted rules to extract labels from the textual reports accompanying the medical images. This approach has led to datasets of larger scale, since the text data paired with medical images are often produced naturally by medical experts in their routine workflow and abundant in a typical hospital's IT systems.
Nevertheless, this rule-based label extraction approach has two limitations: 1) the rules are often inaccurate and limited to a few major categories (Wang et al., 2017), leading to very inefficient use of the textual report data; 2) these rules are often domain-specific and sensitive to the style of the text, making cross-domain and cross-institution generalization difficult (Irvin et al., 2019). In efforts to make more efficient use of unlabeled image data, several recent studies have shown promising results from contrastive representation learning from natural images (Chen et al., 2020a; He et al., 2020; Grill et al., 2020). However, as we will show, applying these image view-based contrastive methods to medical images provides only marginal benefits compared to ImageNet pretraining, a result mostly due to the high inter-class similarity of the medical images as in Figure 1. In this work, we aim to improve visual representations of medical images by combining the benefits of both learning from abundant textual data and unsupervised statistical approaches. We present Contrastive VIsual Representation Learning from Text (ConVIRT), a framework for learning visual representations by exploiting the naturally occurring pairing of images and textual data. ConVIRT improves visual representations by maximizing the agreement between true image-text pairs versus random pairs via a bidirectional contrastive objective between the image and text modalities. We apply ConVIRT to the pretraining of medical image encoders, and show that it leads to higher-quality in-domain image representations that capture the subtlety of visual features required for medical image understanding tasks. Compared to existing methods, ConVIRT has the advantages of utilizing the paired text data in a way agnostic to the medical specialty and requiring no additional expert input.
This allows us to evaluate ConVIRT by transferring our pretrained weights to 4 different medical image classification tasks covering 2 different specialties. We find that the resulting models outperform all baseline initialization approaches, including the standard ImageNet pretraining and several strong baselines that also utilize the paired text data. Most notably, in all 4 tasks, ConVIRT requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance. We further evaluate ConVIRT on two new zero-shot retrieval tasks, an image-image and a text-image retrieval task, and also find it superior to all baselines. To facilitate future research, we will make our code and the collected retrieval datasets available.

2. METHOD

2.1. TASK DEFINITION

We start by giving a formal description of our representation learning setting. We assume paired input (x v , x u ) where x v represents one or a group of images, and x u represents a text sequence which describes the imaging information in x v . Our goal is to learn a parameterized image encoder function f v , which maps an image to a fixed-dimensional vector. We are then interested in transferring the learned image encoder function f v into downstream tasks, such as classification or image retrieval. In this work, we model the encoder function f v as a convolutional neural network (CNN). We note that paired image-text data (x v , x u ) naturally exists for many medical domains. Medical experts such as radiologists produce textual descriptions of images as part of their routine workflow, some of which are also made publicly available (Demner-Fushman et al., 2016; Johnson et al., 2019) .

2.2. CONTRASTIVE VISUAL REPRESENTATION LEARNING FROM TEXT

An overview of our method, ConVIRT, for learning f_v is shown in Figure 2. At a high level, our method converts each input image x_v and text x_u into d-dimensional vector representations v and u respectively, following a similar processing pipeline. For each input image x_v, our method starts by drawing a random view x̃_v from x_v with a sampled transformation function t_v ∼ T, where T represents a family of stochastic image transformation functions described later. Next, the encoder function f_v transforms x̃_v into a fixed-dimensional vector h_v, followed by a non-linear projection function g_v which further transforms h_v into vector v:

v = g_v(f_v(x̃_v)),

where v ∈ ℝ^d. Similarly, for each text input x_u, we obtain a span x̃_u from it following a sampling function t_u, and then a text representation u with:

u = g_u(f_u(x̃_u)),

where f_u is a text encoder, g_u a projection, and u ∈ ℝ^d. The projection functions g_v and g_u project representations for both modalities from their encoder space to the same d-dimensional space for contrastive learning. At training time, we sample a minibatch of N input pairs (x_v, x_u) from the training data, and calculate their representation pairs (v, u). We use (v_i, u_i) to denote the i-th pair.

[Figure 2: Overview of our ConVIRT framework. The blue and green shades represent the image and text encoding pipelines, respectively. Our method relies on maximizing the agreement between the true image-text representation pairs with bidirectional losses ℓ^(v→u) and ℓ^(u→v).]
The training objective of ConVIRT involves two loss functions. The first is an image-to-text contrastive loss for the i-th pair:

ℓ_i^(v→u) = −log [ exp(⟨v_i, u_i⟩/τ) / Σ_{k=1}^N exp(⟨v_i, u_k⟩/τ) ],

where ⟨v_i, u_i⟩ represents the cosine similarity, i.e., ⟨v, u⟩ = v⊤u / (‖v‖‖u‖); and τ ∈ ℝ^+ represents a temperature parameter. This loss takes the same form as the InfoNCE loss (Oord et al., 2018), and minimizing it leads to encoders that maximally preserve the mutual information between the true pairs under the representation functions. Intuitively, it is the log loss of an N-way classifier that tries to predict (v_i, u_i) as the true pair. Note that unlike previous work which uses a contrastive loss between inputs of the same modality (Chen et al., 2020a; He et al., 2020), our image-to-text contrastive loss is asymmetric for each input modality. We therefore define a similar text-to-image contrastive loss as:

ℓ_i^(u→v) = −log [ exp(⟨u_i, v_i⟩/τ) / Σ_{k=1}^N exp(⟨u_i, v_k⟩/τ) ].

Our final training loss is then computed as a weighted combination of the two losses averaged over all positive image-text pairs in each minibatch:

L = (1/N) Σ_{i=1}^N ( λ ℓ_i^(v→u) + (1−λ) ℓ_i^(u→v) ),

where λ ∈ [0, 1] is a scalar weight.
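The bidirectional objective above can be sketched in a few lines of NumPy. The sketch below operates on a minibatch of row-vector embeddings; the τ and λ values are illustrative defaults, not the paper's tuned settings:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarities between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def info_nce(sim):
    """Mean of -log softmax over each row, taken at the diagonal (true pair)."""
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()

def convirt_loss(v, u, tau=0.1, lam=0.75):
    """Weighted bidirectional contrastive loss over a minibatch of N pairs."""
    sim = cosine_sim(v, u) / tau
    loss_v2u = info_nce(sim)    # image-to-text: rows are images
    loss_u2v = info_nce(sim.T)  # text-to-image: rows are texts
    return lam * loss_v2u + (1 - lam) * loss_u2v
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero; shuffled or mismatched pairs drive it up, which is exactly the pressure that aligns the two modalities.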

2.3. REALIZATION

We note that our ConVIRT framework defined above is agnostic to the specific choice of image and text encoders, transformations and projection functions. In this work, following previous work (Chen et al., 2020a), we model g_v and g_u as separate learnable single-hidden-layer neural networks, i.e., g_v(·) = W^(2) σ(W^(1)(·)), where σ is a ReLU non-linearity, and similarly for g_u. For the image encoder f_v, we use the ResNet50 architecture (He et al., 2016) for all experiments, as it is the architecture of choice for much medical imaging work and is shown to achieve competitive performance. For the text encoder f_u, we use a BERT encoder (Devlin et al., 2019) followed by a max-pooling layer over all output vectors. We initialize our BERT encoder with the ClinicalBERT model (Alsentzer et al., 2019) pretrained on the MIMIC clinical notes, which achieved state-of-the-art performance on a suite of clinical NLP tasks. At training time we allow the encoder to adapt to our contrastive task by freezing the embeddings and the first 6 layers of this BERT encoder and fine-tuning the last 6 layers. For the image transformation family T from which t_v is sampled, we use sequential applications of five random transformations: cropping, horizontal flipping, affine transformation, color jittering and Gaussian blur. Different from recent work on contrastive visual representation learning (Chen et al., 2020a;b), we only apply brightness and contrast adjustments in color jittering, due to the monochrome nature of medical images. For the text transformation function t_u, we apply a simple uniform sampling of a sentence from the input document x_u (i.e., x̃_u is a randomly sampled sentence from x_u for each minibatch). We did not use a more aggressive transformation mainly because sampling at the sentence level can preserve the semantic meaning of the sampled spans.
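The single-hidden-layer projection heads can be sketched as below. The dimensions are assumptions for illustration only: 2048-d ResNet50 pooled features, 768-d max-pooled BERT features, and a hypothetical shared 512-d contrastive space with a 2048-d hidden layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_projection(d_in, d_hidden, d_out):
    """Single-hidden-layer projection g(h) = W2 @ relu(W1 @ h)."""
    w1 = rng.normal(size=(d_hidden, d_in)) * np.sqrt(2.0 / d_in)
    w2 = rng.normal(size=(d_out, d_hidden)) * np.sqrt(2.0 / d_hidden)
    return lambda h: w2 @ np.maximum(w1 @ h, 0.0)

# Hypothetical sizes: ResNet50 pooled features are 2048-d and BERT
# (max-pooled) features 768-d; both modalities land in the same d-dim space.
d = 512
g_v = make_projection(2048, 2048, d)  # image projection head
g_u = make_projection(768, 2048, d)   # text projection head
```

The key design point the sketch makes concrete: the two heads have different input widths but a shared output width, so the contrastive loss can compare image and text vectors directly.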

3. EXPERIMENTS

3.1. DATA FOR PRETRAINING

We test our ConVIRT framework by pretraining two separate image encoders covering different medical specialties using two separate paired image-text datasets:

• Chest image encoder: We use version 2 of the public MIMIC-CXR database (Johnson et al., 2019), a collection of chest radiograph images paired with their textual reports, which since its release has become a standard resource for studying multi-modal modeling of medical images. After preprocessing, this dataset contains a total of about 217k image-text pairs, with each pair containing an average of 1.7 images and 6.0 sentences.

• Bony image encoder: We obtain a collection of musculoskeletal image-text pairs from the Rhode Island Hospital system. After chest images, musculoskeletal images constitute the second most common type of radiograph images in a typical hospital. This dataset contains a total of 48k image-text pairs, with each pair containing an average of 2.5 images and 8.0 sentences.

We include model implementation and pretraining details in Appendix A.

3.2. EVALUATION TASKS & DATA

We evaluate our pretrained image encoders on three downstream medical imaging tasks: image classification, zero-shot image-image retrieval and zero-shot text-image retrieval.

Image Classification. We evaluate our pretrained image representations on four representative medical image classification tasks: 1) RSNA Pneumonia Detection (Wang et al., 2017; Shih et al., 2019), which involves binary classification of a chest radiograph image into either a pneumonia or a normal category; 2) CheXpert image classification (Irvin et al., 2019), which involves multi-label binary classification of a chest image for five individual labels, i.e., atelectasis, cardiomegaly, consolidation, edema and pleural effusion; 3) COVIDx image classification (Wang & Wong, 2020), which involves multi-class classification of a chest image into one of COVID19, non-COVID pneumonia or normal categories; and 4) MURA bony abnormality detection (Rajpurkar et al., 2018a), which involves binary classification of a musculoskeletal image into abnormal or normal. We report test accuracy for COVIDx given its balanced test set, and report the standard area under the receiver operating characteristic curve (AUC) metric for the other tasks following previous work.

Following previous work (Hénaff et al., 2020; Chen et al., 2020a; He et al., 2020), for all tasks we evaluate each pretrained image encoder under two individual settings: a linear classification setting, where the pretrained CNN weights are frozen and only a linear classification head is trained for the task; and a fine-tuning setting, where both the CNN weights and the linear head are fine-tuned. The two settings complement each other for evaluation purposes: while the linear setting directly evaluates the quality of the extracted image features with the pretrained CNN, the fine-tuning setting more closely resembles how the pretrained CNN weights are used in practical applications.
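The linear-classification setting can be sketched with a stand-in frozen encoder and a closed-form linear head. Everything here is a toy stand-in: the "encoder" is just a fixed random ReLU projection playing the role of the pretrained CNN, and the task is a synthetic binary problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained, frozen encoder f_v: a fixed random ReLU
# projection whose weights are never updated during evaluation.
W_ENC = rng.normal(size=(8, 32))
def encode(x):
    return np.maximum(x @ W_ENC, 0.0)

# Toy binary task whose label depends linearly on the raw inputs.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Linear evaluation: keep the encoder frozen, fit only a linear head.
H = encode(X)
H1 = np.hstack([H, np.ones((len(H), 1))])   # append a bias column
w, *_ = np.linalg.lstsq(H1, y, rcond=None)  # closed-form "linear head"
accuracy = float(((H1 @ w > 0.5) == y).mean())
```

In the fine-tuning setting, by contrast, gradients would also flow into the encoder weights (here `W_ENC`), which is why that setting better reflects practical transfer.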
To further compare the data efficiency of different pretraining methods, for each setting we evaluate the image encoders with 1%, 10% and all training data, respectively (except for the COVIDx dataset, where we omit the 1% setting due to the scarcity of data for some categories). To control the variance in results, for all settings and models, we report average results aggregated over 5 independent training runs. We include further dataset processing and model training details in Appendix B.

Zero-shot Image-image Retrieval. This evaluation is similar to the conventional content-based image retrieval setting, in which we search for images of a particular category using a representative query image. For evaluation, a group of query images and a larger collection of candidate images, each with a categorical label, are given to a pretrained CNN encoder. We encode each query and candidate image with this encoder, and then for each query, rank all candidates by their cosine similarities to the query in descending order. Since a widely-used annotated benchmark for this setting is not available, we create our own dataset by re-using existing annotations in the CheXpert dataset (Irvin et al., 2019) and additional expert annotations from a board-certified radiologist. The resulting dataset covers 8 different chest abnormality categories, each with 10 expert-annotated query images and 200 candidate images. We include the detailed collection and annotation procedure in Appendix C, and refer to this dataset as the CheXpert 8×200 Retrieval Dataset. We focus our evaluation on retrieval precision, and evaluate our models with Precision@k metrics where k = 5, 10, 100.

Zero-shot Text-image Retrieval. This setting is similar to the image-image retrieval setting, but instead of using query images, we retrieve images of a particular category with textual queries.
For this purpose, we ask a radiologist to write 5 diverse and representative textual descriptions for each of the 8 abnormality categories for the same CheXpert 8×200 candidate images (see Appendix D for details). At test time, for each query we encode its text with the learned text encoder f_u and then retrieve from the candidate images in the same way as above. This setting evaluates not only the quality of the learned image representations, but also their alignment with the learned text representations. We again use Precision@k metrics where k = 5, 10, 100.
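Concretely, both zero-shot retrieval settings reduce to ranking candidate image embeddings by cosine similarity to a query embedding (an image embedding in the image-image setting, a text embedding in the text-image setting) and scoring with Precision@k. The following numpy sketch is illustrative only; the function names and shapes are ours, not from the released implementation:

```python
import numpy as np

def rank_candidates(query_vecs, cand_vecs):
    """Rank candidates by cosine similarity to each query, best first.

    query_vecs: (Q, d) encoded queries (image or text embeddings).
    cand_vecs:  (C, d) encoded candidate images.
    Returns a (Q, C) array of candidate indices per query."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = q @ c.T                      # cosine similarities
    return np.argsort(-sims, axis=1)    # descending order

def precision_at_k(ranking, query_labels, cand_labels, k):
    """Fraction of the top-k retrieved candidates that share the
    query's category label, averaged over all queries."""
    hits = [np.mean(cand_labels[ranking[i, :k]] == query_labels[i])
            for i in range(len(query_labels))]
    return float(np.mean(hits))
```

For example, with one query of category 0 and candidates labeled [0, 1, 0], `precision_at_k(rank_candidates(q, c), ..., k=2)` returns 1.0 when both category-0 candidates outrank the category-1 candidate.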

3.3. BASELINE METHODS

We compare ConVIRT against the following standard or competitive initialization methods:

• Random Init.: For all tasks we initialize the ResNet50 with its default random initialization.

• ImageNet Init.: We use CNN weights pretrained on ImageNet (Russakovsky et al., 2015), which remains a dominant initialization approach for medical imaging work (Raghu et al., 2019).

• Caption-LSTM: We initialize the CNN weights with ImageNet, and then pretrain it with an image captioning task using the standard CNN-LSTM with attention architecture (Xu et al., 2015). For the captioning task, we train the model to decode the paired textual report from the encoded image representations. Compared to the random or ImageNet initializations, this is an "in-domain" initialization baseline which uses the paired text data for representation learning.

• Caption-Transformer: In this initialization we replace the CNN-LSTM model in Caption-LSTM with the CNN-Transformer-based captioning model of Cornia et al. (2020), which recently achieved state-of-the-art results on the COCO image captioning benchmark (Lin et al., 2014).

• Contrastive-Binary: This baseline differs from our method by contrasting the paired image and text representations with a binary classification head, as is widely done in visual-linguistic pretraining work (Tan & Bansal, 2019; Su et al., 2020). For each input pair, we first project the encoder outputs h_v and h_u into the same dimension with linear layers, concatenate them, and use an MLP network to predict a binary probability of whether the input is a real or a "fake" pair, which we train with a binary cross-entropy loss. During training, for each (x_v, x_u) pair in the training set, we construct a "fake" pair by replacing x_u with one randomly sampled from the dataset. We expect that the binary classification task requires the encoder to learn reasonable representations of the input images, and it is therefore a stronger in-domain initialization baseline.
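The Contrastive-Binary head can be sketched as a forward pass in plain numpy. The feature dimensions follow the encoders used in this paper (2048 for pooled ResNet50 features, 768 for BERT base), but the projection and hidden sizes here are illustrative assumptions, not the baseline's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_pair_head(h_v, h_u, params):
    """Forward pass of a Contrastive-Binary-style head (sketch).

    The image output h_v and text output h_u are projected to a shared
    dimension, concatenated, and passed through a small MLP that
    predicts the probability that the pair is real (not "fake")."""
    z_v = h_v @ params["W_v"]              # linear projection of image
    z_u = h_u @ params["W_u"]              # linear projection of text
    z = np.concatenate([z_v, z_u])
    hidden = np.maximum(0.0, z @ params["W_1"] + params["b_1"])  # ReLU
    logit = hidden @ params["w_2"] + params["b_2"]
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid probability

def bce_loss(p, is_real_pair):
    """Binary cross-entropy on the real/fake prediction."""
    y = 1.0 if is_real_pair else 0.0
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative shapes: 2048-d image features, 768-d text features,
# hypothetical 512-d projections and a 256-unit hidden layer.
params = {
    "W_v": rng.normal(size=(2048, 512)) * 0.01,
    "W_u": rng.normal(size=(768, 512)) * 0.01,
    "W_1": rng.normal(size=(1024, 256)) * 0.01,
    "b_1": np.zeros(256),
    "w_2": rng.normal(size=256) * 0.01,
    "b_2": 0.0,
}
```

During training, each real pair is scored with `bce_loss(p, True)` and each randomly constructed "fake" pair with `bce_loss(p, False)`.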
For fair comparison, for all baselines that require paired image-text data, we use the same datasets as in our contrastive pretraining. For the captioning-based methods, we always use the model checkpoints that achieve the best CIDEr score (Vedantam et al., 2015) on a held-out validation set.

4.1. CLASSIFICATION TASKS

Linear Classification. We present all linear classification results for the medical imaging tasks in Table 1a. We find that, compared to random initialization, ImageNet initialization provides markedly better representations.

Fine-tuning. We show the fine-tuning evaluation results in Table 1b. Similar to the linear setting, we find that: 1) ImageNet initialization is again better than random initialization, though with smaller margins; 2) all in-domain initialization methods are better than the popular ImageNet initialization in most settings; and 3) our proposed ConVIRT pretraining again achieves the best overall results in 10 out of the 11 settings, with the exception of the CheXpert dataset with all training data used, where the result of ConVIRT is similar to that of Caption-Transformer. Most notably, on all datasets, with only 10% of the labeled training data ConVIRT achieves classification results that are better than or close to those of ImageNet initialization with 100% of the training data. We also notice that our conclusion on ImageNet versus random initialization differs from that of Raghu et al. (2019): while they showed comparable results from the two strategies, we find that ImageNet initialization is still superior to random initialization in most results, justifying its popularity. Upon closer examination, we conjecture that this is likely due to under-optimization of their models: while our ResNet50 with random initialization achieves an average AUC of 85.8 on the CheXpert dataset, their ResNet50 model only achieved 83.5 AUC on the same evaluation set.

4.2. RETRIEVAL TASKS

We present the zero-shot image-image and text-image retrieval results in Table 2. For the image-image retrieval setting, we present additional results from fine-tuning our pretrained model on all CheXpert training data, and use them as "upper bounds" on the results obtainable with the use of supervised labels. We find that: 1) using ImageNet pretrained CNN weights in a zero-shot image retrieval setting is only better than random guessing by small margins; 2) all in-domain pretrained CNN weights achieve much better retrieval performance than the ImageNet weights; and 3) our proposed ConVIRT pretraining achieves the best overall retrieval results on all metrics. We find that while Contrastive-Binary performs notably better than the other baselines in the image-image retrieval setting, its text-image retrieval results are far behind those of ConVIRT pretraining. We conjecture that the lack of an explicit similarity-based loss function in the Contrastive-Binary baseline results in misaligned representations in the image and text space, leading to poor results in text-image retrieval. To understand how well ConVIRT pretraining helps separate images from different abnormality categories in its encoding space, in Figure 3 we present t-SNE plots (Maaten & Hinton, 2008) of candidate images in the CheXpert 8×200 dataset for five selected categories, from the ImageNet pretrained CNN encoder and the ConVIRT pretrained encoder.

Table 2: Zero-shot image-image and text-image retrieval results on the CheXpert 8×200 dataset, reporting Prec@5, Prec@10 and Prec@50 for each setting. Random shows results from a random guess (12.5 on all metrics, since the 8 categories are balanced); ConVIRT + CheXpert Supervised shows results from further fine-tuning the pretrained weights with supervised training data. Text-image retrieval results are not obtained for some methods due to the lack of text encoders.
It is worth noting that clustering images in our setting is much more challenging than in the general object classification setting, due to the high inter-class similarity of the medical images. Nevertheless, we find that ConVIRT pretraining achieves a better clustering of the images in the t-SNE plots. On the other hand, the lack of clear separations between groups suggests room for further improvement.

5. ANALYSIS AND DISCUSSION

Comparisons to Image-only Contrastive Learning. ConVIRT shows superior results against the baselines in our evaluation, but an important question remains as to how it compares against existing image-only contrastive visual representation learning methods. We study this by running two popular such methods, SimCLR (Chen et al., 2020a) and MoCo v2 (Chen et al., 2020b), on the same collection of images that we used in our pretraining. We present the results in Table 3 and include model training details in Appendix E. We find that, compared to ImageNet initialization, both contrastive methods lead to marginal to moderate improvements on the classification and retrieval tasks. However, our training strategy substantially outperforms both methods on all tasks, demonstrating its effective use of information from the paired text data.

To understand the representational difference that has led to this difference in performance, for all four initialization methods we visualize in Figure 4 the saliency maps (Simonyan et al., 2014) corresponding to the correct class on sampled images from the CheXpert dataset. Models for all initialization methods are trained with 1% of the CheXpert training data under the linear classification setting (with pretrained CNN weights frozen). We find that ImageNet pretraining has led to models that focus on trivial visual features that are mostly irrelevant to the task, and that the model with ConVIRT pretrained weights has focused on much more relevant areas than those with SimCLR and MoCo v2 pretraining, suggesting more effective representation learning. For example, for atelectasis, while the ConVIRT model has correctly focused on the bottom of the lung regions, the SimCLR model has a much more scattered focus and the MoCo model has incorrectly focused on the heart region.

Correlation Between Contrastive Loss and End Task Performance. To understand the relation between a model's performance on the ConVIRT pretraining task and its performance on the downstream tasks, we ran an analysis where, for every 5 epochs during pretraining, we transferred the pretrained checkpoint to the downstream tasks and evaluated its performance. The pretraining was run for a total of 200 epochs, yielding 40 points with varying validation loss and end task results. Figure 5 presents the models' validation loss on the pretraining task against the performance achieved on the RSNA 1% data linear evaluation and the two retrieval tasks.
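The vanilla gradient saliency of Simonyan et al. (2014) is simply |∂score/∂input|, obtained by backpropagating the class score to the pixels. As a minimal self-contained stand-in for the CNN case, the sketch below computes that gradient analytically for a toy two-layer network (the network and its sizes are illustrative assumptions):

```python
import numpy as np

def saliency_map(x, W1, w2):
    """Vanilla gradient saliency for score(x) = w2 . relu(W1 @ x).

    In practice the same quantity |d score / d x| is obtained by
    autodiff through the full CNN; here the chain rule is written
    out by hand for a tiny model. Returns one value per input unit."""
    pre = W1 @ x                        # pre-activations
    mask = (pre > 0).astype(float)      # ReLU derivative
    grad = W1.T @ (w2 * mask)           # chain rule back to the input
    return np.abs(grad)
```

With `W1` the identity and all-positive inputs, the saliency reduces to |w2|, which makes the hand-derived gradient easy to check.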
For all three tasks, we find a clear positive correlation between pretraining performance and end task performance. This corroborates that, by learning with the ConVIRT objective, the image encoder learns gradually improved representations for the end tasks, and suggests that further improvement on the pretraining task may have a positive impact on end task performance.

Hyperparameter Analysis. We run experiments to study the impact of hyperparameters, and find that: 1) similar to previous work on image-only contrastive learning (Chen et al., 2020a), the pretraining results are most sensitive to the choice of the temperature value τ; 2) unlike previous work, changing the batch size does not lead to substantial changes in the classification results; and 3) using linear projection heads instead of non-linear ones notably hurts the retrieval results. We include our detailed comparisons in Appendix F.

6. RELATED WORK

Our work is most relevant to existing work on medical image classification, which we have covered in Section 1, and on textual report generation from medical images (Wang et al., 2018; Jing et al., 2018; Liu et al., 2019). A dominant approach for initializing medical image encoders in this work has been using encoder weights pretrained on ImageNet, despite the drastic difference in image characteristics (Raghu et al., 2019). Instead, our work proposes an alternative in-domain pretraining strategy, and compares ImageNet pretraining with different pretraining approaches that also use the paired text data. To our knowledge, our work represents the first systematic attempt in this direction. Our work is inspired by the recent line of work on image view-based contrastive visual representation learning (Hénaff et al., 2020; Chen et al., 2020a; He et al., 2020; Grill et al., 2020), but differs from existing studies in its contrastive learning against the text modality, which, as we show in Section 5, is more effective in learning high-quality representations of medical images. Another line of work related to ours is visual-linguistic representation learning (Lu et al., 2019; Tan & Bansal, 2019; Su et al., 2020). Among existing studies, Ilharco et al. (2020) and Gupta et al. (2020) used cross-modality contrastive objectives related to ours, but for the purposes of probing visual-linguistic models and learning phrase grounding, respectively.
Our work differs from this line of work in several crucial ways: 1) existing work in visual-linguistic learning has focused on learning visual representations from paired text via a binary contrastive prediction task, whereas we show the superior performance of our cross-modality NCE objectives in this setting; 2) existing work has primarily used object representations extracted from image segmentation models in its preprocessing steps, making it less applicable to medical image understanding tasks, where anatomical segmentations are extremely hard to obtain; 3) while existing work has evaluated primarily on visual-linguistic tasks such as visual question answering, we instead focus on evaluation with the classification and retrieval tasks that are at the center of medical image understanding research.

7. CONCLUSION

We presented ConVIRT, an unsupervised method for learning medical visual representations from the naturally occurring pairing of images and text. Our method relies on contrasting the image representations with the paired text data via a bidirectional objective between the two modalities. On 4 medical image classification tasks and 2 image retrieval tasks, ConVIRT outperformed other strong in-domain initialization methods that also use the text data, and led to representations of notably higher quality. Compared to ImageNet pretraining, ConVIRT is able to achieve the same level of classification accuracy with an order of magnitude less labeled data. We hope that our work can inspire future work that makes more efficient use of textual data for medical image understanding.

Next, we keep only the Findings and Impression sections and remove all other sections. We remove from the dataset all image-text pairings where the text section is empty or has fewer than 3 tokens. This preprocessing procedure gives us about 217k total image-text pairs for pretraining our chest image encoder and 48k total pairs for pretraining our bone image encoder.

Image and Text Encoders. For the image encoder, we use the standard ResNet50 implementation provided by the torchvision library. For the text encoder, we use the BERT base encoder offered by the Transformers library (Wolf et al., 2019) and initialize it with the ClinicalBERT model (Alsentzer et al., 2019) pretrained on the MIMIC clinical notes. We also experimented with training a specialized BERT encoder on a large collection of radiology notes, but found that it made no substantial difference in the pretraining results. At pretraining time we freeze the embeddings and the first 6 layers of this BERT encoder, and only fine-tune the last 6 layers for our contrastive task.

Other Hyperparameters.
For contrastive learning, we use projection layers with an output dimension d = 512, a temperature value τ = 0.1, and a loss weight λ = 0.75. These hyperparameter settings were obtained by comparing the linear evaluation validation scores on the RSNA image classification task with the pretrained ResNet50 weights. For the image transformation family T, we adopt the implementations offered by the torchvision library. We apply random cropping with a ratio sampled from [0.6, 1.0]; horizontal flipping with p = 0.5; affine transformation with a degree sampled from [-20, 20], max horizontal and vertical translation fractions of 0.1, and a scaling factor sampled from [0.95, 1.05]; color jittering with brightness and contrast adjustment ratios sampled from [0.6, 1.4]; and Gaussian blur with σ ∈ [0.1, 3.0]. All images are resized to 224×224 after the transformation t_v is applied. Limited by computational resources, we arrived at these image transformation parameters via preliminary experiments rather than a systematic search.

Pretraining Details. At pretraining time, for each dataset, we randomly sample 5k image-text pairs to form a held-out validation set. We use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 1e-4 and a weight decay of 1e-6. We initialize the image encoder with ImageNet pretrained weights at the beginning of pretraining, and use a fixed batch size of 32. We calculate the validation loss every 5000 steps, and if the validation loss does not decrease after 5 straight evaluation runs, we anneal the learning rate by a factor of 0.5. We stop pretraining after 200 evaluation runs, and save the model checkpoint that achieves the lowest validation loss. For efficiency, we employ mixed-precision training; for reference, the whole pretraining run on the MIMIC-CXR dataset took about 3 days on a single Titan RTX GPU card.
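The validation-driven schedule above (anneal the learning rate by 0.5 after 5 consecutive non-improving evaluations, track the best checkpoint) can be sketched as a small plateau scheduler. The class and its names are illustrative, not the released training code:

```python
class PlateauScheduler:
    """Sketch of the pretraining schedule described above: halve the
    learning rate after `patience` consecutive evaluation runs with no
    improvement in validation loss, and remember the best checkpoint."""

    def __init__(self, lr=1e-4, factor=0.5, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best_loss = float("inf")
        self.bad_evals = 0
        self.best_step = None  # step of the lowest-loss checkpoint

    def step(self, val_loss, step):
        """Call after each validation run; returns the learning rate
        to use for the next stretch of training."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_step = val_loss, step
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr
```

In the actual setup, `step` would be invoked every 5000 training steps, and pretraining stops after 200 evaluation runs, restoring the checkpoint saved at `best_step`.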

B IMAGE CLASSIFICATION EXPERIMENTS

We prepared and used the 4 image classification datasets following the procedures below:

1. RSNA Pneumonia Detection (Wang et al., 2017; Shih et al., 2019): we used the original version of this dataset available from its Kaggle page, which contains 25184/1500/3000 annotated images in its training/validation/test sets, respectively.

2. CheXpert image classification (Irvin et al., 2019): we downloaded the original version of this dataset from its official website. Since the original expert-labeled test set of this dataset is hidden and not included as part of the release, we instead followed Raghu et al. (2019) and used the original expert-labeled validation set as our test set, and randomly sampled 5000 images from the original training set for validation purposes.

Example textual queries for the text-image retrieval task (Table 4):

Cardiomegaly: The cardiac silhouette is enlarged.
Edema: The presence of hazy opacity suggests interstitial pulmonary edema.
Fracture: A cortical step off indicates the presence of a fracture.
Pleural Effusion: The pleural space is partially filled with fluid.
Pneumonia: A pulmonary opacity with ill defined borders likely represents pneumonia.
Pneumothorax: A medial pneumothorax is present adjacent to the heart.
No Finding: No clinically significant radiographic abnormalities.

To create the textual queries for each abnormality category, we ask a board-certified radiologist to write at least 5 different sentences that they would use to describe this abnormality in radiology reporting. We additionally set the following requirements: 1) the sentences must describe the category with no ambiguity and must not include other categories; 2) the sentences must be diverse from each other; and 3) the sentences should not include very specific anatomic locations or rare clinical observations. At the end, we aggregate the results and keep 5 textual queries for each abnormality category. For reference, we present example textual queries in Table 4.

E EXPERIMENTS ON IMAGE-ONLY CONTRASTIVE LEARNING METHODS

We run experiments with two popular image-only contrastive visual representation learning methods: SimCLR (Chen et al., 2020a) and MoCo v2 (Chen et al., 2020b). For a fair comparison, in both experiments we use the exact same set of images from the MIMIC-CXR dataset that we use in the pretraining of our method and the baselines. Our settings for each method are:

• SimCLR: We use the open PyTorch implementation available at https://github.com/sthalles/SimCLR. For the image encoder we use ResNet50. We use cosine similarity in the loss function, set the temperature value to 0.1 and set the output dimension to 128. We use the default image augmentation functions in the paper, except for the color jittering transformation, where we set the saturation and hue adjustment to 0 due to the monochrome nature of our medical images. For training, we use the Adam optimizer with an initial learning rate of 3e-4 and a weight decay of 1e-4. We set the batch size to 128 and run training on a single GPU card for 100 epochs, as we find that increasing the batch size or the number of epochs does not lead to improved results. We use the default settings for all other parameters.

• MoCo v2: We use the authors' PyTorch implementation available at https://github.com/facebookresearch/moco. For the image encoder we use ResNet50. We follow the default MoCo v2 setting and use a temperature value of 0.07 and an output dimension of 128. Similarly, we adopt the default image augmentation functions, except for the color jittering transformation, where we set the saturation and hue adjustment to 0. For training, we use the SGD optimizer with a learning rate of 0.0075 and a weight decay of 1e-4. We use a batch size of 64 and a queue size of 4096, and run parallel training on two GPU cards for 100 epochs, as we find that further increasing the batch size or the number of epochs does not lead to improved results. During training, we anneal the learning rate by a factor of 0.1 at the 60th and 80th epochs.

F HYPERPARAMETER ANALYSIS

Similar to previous work on unsupervised image representation learning (Chen et al., 2020a; He et al., 2020), we first find that the effectiveness of ConVIRT pretraining is most sensitive to the temperature value τ. As shown in Table 5, using a temperature much lower than the ideal value (τ = 0.01) hurts the retrieval results, and a much larger temperature (τ = 1) notably hurts performance on all tasks. Unlike previous work, we find that using a smaller or larger batch size hurts the retrieval performance, but neither setup has a substantial impact on the classification results. Lastly, we find that replacing the non-linear projection heads in g_v and g_u with linear layers hurts the retrieval results, suggesting worse representations; however, this is again not reflected notably in the RSNA classification results.
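The role of τ and λ can be seen in a temperature-scaled bidirectional InfoNCE objective. The exact objective is specified in the paper; the numpy sketch below is our illustrative reconstruction consistent with the bidirectional form and the τ = 0.1, λ = 0.75 defaults above, not the released implementation:

```python
import numpy as np

def info_nce(sim, tau):
    """Rowwise InfoNCE: -log softmax over candidates at temperature
    tau, with the diagonal entries treated as the positive pairs."""
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def bidirectional_contrastive_loss(v, u, tau=0.1, lam=0.75):
    """Bidirectional image-text contrastive loss over a batch of N
    image representations v and N text representations u, weighting
    the image-to-text and text-to-image directions by lam / (1-lam)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    sim = v @ u.T                                  # (N, N) cosine sims
    return lam * info_nce(sim, tau) + (1 - lam) * info_nce(sim.T, tau)
```

A small τ sharpens the softmax over candidates, so mis-ranked negatives are penalized more heavily, which is consistent with the sensitivity to τ observed in Table 5.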



Footnote URLs referenced above:
https://physionet.org/content/mimic-cxr-jpg/2.0.0/
https://github.com/pytorch/vision
https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
https://stanfordmlgroup.github.io/competitions/chexpert/
https://github.com/lindawangg/COVID-Net
https://stanfordmlgroup.github.io/competitions/mura/



Figure 1: Two example chest radiograph images with different abnormality categories, along with sentences from their paired textual report and example views indicative of their characteristics.


Figure 3: t-SNE visualizations of encoded image representations from different pretraining methods.

Figure 4: Saliency maps on sampled images for 4 abnormality categories in the CheXpert dataset. For each image we present maps for ImageNet, SimCLR, MoCo v2 and our ConVIRT initializations. Ground truth regions that are indicative of the abnormalities are shown as red boxes in the images.

Figure 5: (a) shows pretraining validation loss at different epochs; (b)-(d) shows correlation between the pretraining loss and the performance of three end tasks. For (a) the x-axis shows the training epoch number, and for (b)-(d) the x-axis shows the negative value of the pretraining loss (i.e., -L) on a held-out validation set.

Table 1: Results for the medical image classification tasks: (a) linear classification setting; (b) fine-tuning setting. All results are averaged over 5 independent models. Best results for each setting are in boldface. The COVIDx 1% setting is omitted due to the scarcity of labels in COVIDx.

Table 3: Comparisons of ConVIRT to image-only unsupervised image representation learning approaches.

A MODEL IMPLEMENTATION AND PRETRAINING DETAILS

Dataset Preprocessing. For the MIMIC-CXR chest radiograph dataset, we use the publicly available JPG version of it. For both the MIMIC-CXR chest dataset and the Rhode Island Hospital bone image dataset, we resize the image files to have a size of 256 pixels on the larger side. For the textual radiology report data, we first tokenize all reports with the default English tokenizer in version 4.0.0 of the CoreNLP library (Manning et al., 2014).

Table 4: Example textual queries for each of the 8 categories in the text-image retrieval task.

Table 5: Evaluation results with different hyperparameters, for the RSNA 1% data linear evaluation, image-image retrieval and text-image retrieval tasks. bs represents batch size and linear proj. represents using linear projection layers for g_v and g_u. Our default model uses τ = 0.1, bs = 32 and non-linear projections.


The resulting dataset contains 218414/5000/234 images in each split.

3. COVIDx image classification (Wang & Wong, 2020): we prepared this dataset following the scripts provided by its authors. We used version 4 of this dataset, the latest version at the time of this work. We additionally randomly sampled 300 images from the training set for validation, resulting in a dataset with 13598/300/300 images in each split.

4. MURA bony abnormality detection (Rajpurkar et al., 2018a): we downloaded the original version of this dataset from its website. Similar to the CheXpert dataset, we again used the original validation set as our test set, and randomly sampled 10% of the images from the training set for validation, resulting in a dataset with 33078/3730/3197 images in each split. Different from the other 3 datasets, the MURA dataset uses patient-level evaluation, meaning that the prediction results from different images of the same patient need to be aggregated to produce a final prediction for the patient, which is then scored against the gold patient label. We therefore followed Rajpurkar et al. (2018a) and at test time aggregated the results for a patient by averaging the predicted probabilities from multiple images.

Classification Model Training Details. For all models that require ImageNet pretrained initialization, we use the pretrained weights from torchvision, which achieve an ImageNet top-5 error rate of 7.13%. For all datasets, we first zero-pad the input image to be square, and then resize it to 224×224. For training, we use the Adam optimizer with an initial learning rate of 1e-3 for the COVIDx task and 1e-4 for the other three tasks. We additionally apply a weight decay of 1e-6 and, in all tasks, a dropout with p = 0.2 before the last classification layer. All classification models are trained with a batch size of 64.
In the fine-tuning evaluation setting, we first "warm up" the classification head by freezing the CNN weights and training only the classification head with a learning rate of 1e-3 for 200 steps, after which we unfreeze the CNN weights and fine-tune the entire network together. A validation score is obtained after each epoch of training, and we anneal the learning rate by a factor of 0.5 if the validation score has not improved after 3 epochs. Training is stopped after no validation improvement is observed for 10 straight epochs, at which point the model checkpoint with the highest validation score is evaluated on the test set.
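The MURA patient-level aggregation described earlier in this appendix (averaging the predicted abnormality probabilities over all images of a patient before scoring against the gold patient label) can be sketched as follows; the function name is ours, not from the released code:

```python
import numpy as np

def patient_level_predictions(image_probs, patient_ids):
    """MURA-style patient-level aggregation, following Rajpurkar et
    al. (2018a): average the per-image predicted probabilities over
    all images of the same patient, giving one probability per
    patient that is then scored against the gold patient label."""
    grouped = {}
    for pid, prob in zip(patient_ids, image_probs):
        grouped.setdefault(pid, []).append(prob)
    return {pid: float(np.mean(probs)) for pid, probs in grouped.items()}
```

For example, a patient with image probabilities 0.9 and 0.7 receives a patient-level probability of 0.8.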

C IMAGE-IMAGE RETRIEVAL DATASET COLLECTION

We create the CheXpert 8×200 Retrieval Dataset with 8 different abnormality categories commonly found in chest radiograph images: atelectasis, cardiomegaly, edema, fracture, pleural effusion, pneumonia, pneumothorax, and a special no finding category indicating that no obvious abnormality is found in the image. We create the dataset by reusing existing rule-labeled annotations in the CheXpert dataset (Irvin et al., 2019) and additional expert annotations. To create the candidate images for a given category label, we go through all images in the CheXpert training set, and keep an image as a candidate image only if its label for that category is positive and its labels for all other categories are negative. We only include images with this "exclusive positivity" as candidate images, mainly to avoid confounding results between categories in the retrieval evaluation. To create the query images for a given category, we again first pre-select 50 exclusively positive images for this category from the CheXpert training set (with all candidate images excluded). Next, we ask a board-certified radiologist to examine each of the 50 images, and exclude images that: 1) might indicate additional abnormalities beyond the target category, 2) have uncommon color or contrast distortions, or 3) were not well posed during image capture. This procedure is mainly to avoid including query images that have uncommon features and may therefore bias the retrieval evaluation results. At the end, we aggregate the annotation results from the radiologist and keep 10 query images for each abnormality category.
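The "exclusive positivity" filter above is a simple predicate over a row of per-category labels. The sketch below uses hypothetical toy rows in a dict format, not the actual CheXpert label files:

```python
def exclusively_positive(labels, target):
    """True iff the target category's label is positive and every
    other category's label is negative, i.e. the image qualifies as a
    candidate for `target` under the exclusive-positivity rule."""
    return labels[target] and not any(
        value for category, value in labels.items() if category != target)

# Hypothetical toy label rows (category name -> positive?).
rows = [
    {"Edema": True,  "Pneumonia": False, "Fracture": False},  # keep
    {"Edema": True,  "Pneumonia": True,  "Fracture": False},  # drop
]
candidates = [row for row in rows if exclusively_positive(row, "Edema")]
```

Only the first row survives: the second is positive for both Edema and Pneumonia, so keeping it would confound the two categories during retrieval evaluation.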

D TEXT-IMAGE RETRIEVAL DATASET COLLECTION

For the text-image retrieval dataset, we first reuse all candidate images from the CheXpert 8×200 image-image retrieval dataset described above, with 200 images for each of the 8 categories.

