ADVANCING RADIOGRAPH REPRESENTATION LEARNING WITH MASKED RECORD MODELING

Abstract

Modern studies in radiograph representation learning (R²L) rely on either self-supervision to encode invariant semantics or associated radiology reports to incorporate medical expertise, while the complementarity between them is barely noticed. To explore this, we formulate self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM). In practice, MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations. With MRM pre-training, we obtain pre-trained models that transfer well to various radiography tasks. Specifically, we find that MRM offers superior performance in label-efficient fine-tuning. For instance, MRM achieves 88.5% mean AUC on CheXpert using 1% labeled data, outperforming previous R²L methods with 100% labels. On NIH ChestX-ray, MRM outperforms the best-performing counterpart by about 3% under small labeling ratios. Besides, MRM surpasses self- and report-supervised pre-training in identifying the pneumonia type and the pneumothorax area, sometimes by large margins. Code and models are available at https://github.com/RL4M/MRM-pytorch.

1. INTRODUCTION

Figure 1: Illustration. MRM learns transferable radiograph representations via reconstructing masked records, i.e., masked radiograph patches and masked report tokens.

Radiograph representation learning (R²L) has been among the core problems of medical image analysis. Previously, downstream radiograph analysis tasks counted on models pre-trained on ImageNet (Deng et al., 2009) or large X-ray datasets (Wang et al., 2017; Irvin et al., 2019; Johnson et al., 2019; Bustos et al., 2020) to alleviate the shortage of expert labeling. The emergence of self-supervised representation learning (Doersch et al., 2015; Agrawal et al., 2015; Wang & Gupta, 2015; Zhou et al., 2021a; 2023) provides a way to conduct pre-training with negligible human intervention by exploiting self-supervision. However, the self-supervised paradigm ignores the introduction of medical expertise (e.g., anatomy), reducing its transferability to downstream tasks with limited label information. On the other hand, free-text radiology reports written by experienced radiologists often contain rich domain knowledge. To leverage this, researchers developed automated rule-based labelers (Wang
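To make the multi-task reconstruction scheme concrete, the following is a minimal PyTorch sketch of masked record modeling: masked image patches are regressed and masked report tokens are classified from a shared representation, and the two losses are summed. This is a toy stand-in, not the paper's architecture; the dimensions, the shared transformer trunk, the zeroing-out of masked inputs, and the equal loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions (assumptions, not from the paper).
NUM_PATCHES, PATCH_DIM = 196, 768  # 14x14 grid of flattened image patches
VOCAB, TOKEN_LEN = 1000, 32        # toy report vocabulary and report length
MASK_RATIO = 0.75                  # fraction of patches/tokens hidden

class ToyMRM(nn.Module):
    """Minimal masked-record model: one shared trunk over concatenated
    patch and token embeddings, plus two reconstruction heads."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(PATCH_DIM, 256)
        self.token_embed = nn.Embedding(VOCAB, 256)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
            num_layers=2)
        self.patch_head = nn.Linear(256, PATCH_DIM)  # pixel regression
        self.token_head = nn.Linear(256, VOCAB)      # token classification

    def forward(self, patches, tokens, patch_mask, token_mask):
        # Zero out masked inputs (a crude stand-in for learned mask tokens).
        p = self.patch_embed(patches * ~patch_mask.unsqueeze(-1))
        t = self.token_embed(tokens * ~token_mask)
        h = self.trunk(torch.cat([p, t], dim=1))
        h_p, h_t = h[:, :NUM_PATCHES], h[:, NUM_PATCHES:]
        # Reconstruction losses are computed only on masked positions.
        loss_img = ((self.patch_head(h_p) - patches) ** 2)[patch_mask].mean()
        loss_txt = nn.functional.cross_entropy(
            self.token_head(h_t)[token_mask], tokens[token_mask])
        return loss_img + loss_txt  # equal-weight multi-task objective

patches = torch.randn(2, NUM_PATCHES, PATCH_DIM)
tokens = torch.randint(0, VOCAB, (2, TOKEN_LEN))
patch_mask = torch.rand(2, NUM_PATCHES) < MASK_RATIO
token_mask = torch.rand(2, TOKEN_LEN) < MASK_RATIO
loss = ToyMRM()(patches, tokens, patch_mask, token_mask)
```

After pre-training with such an objective, the heads would be discarded and the trunk fine-tuned on downstream radiograph tasks.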

