ADVANCING RADIOGRAPH REPRESENTATION LEARNING WITH MASKED RECORD MODELING

Abstract

Modern studies in radiograph representation learning (R²L) rely on either self-supervision to encode invariant semantics or associated radiology reports to incorporate medical expertise, while the complementarity between them is barely noticed. To explore this, we formulate self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM). In practice, MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations. With MRM pre-training, we obtain models that transfer well to various radiography tasks. Specifically, we find that MRM offers superior performance in label-efficient fine-tuning. For instance, MRM achieves 88.5% mean AUC on CheXpert using 1% labeled data, outperforming previous R²L methods trained with 100% labels. On NIH ChestX-ray, MRM outperforms the best-performing counterpart by about 3% under small labeling ratios. Besides, MRM surpasses self- and report-supervised pre-training in identifying the pneumonia type and the pneumothorax area, sometimes by large margins. Code and models are available at https://github.com/RL4M/MRM-pytorch.

1. INTRODUCTION

Radiograph representation learning (R²L) has been among the core problems of medical image analysis. Previously, downstream radiograph analysis tasks counted on models pre-trained on ImageNet (Deng et al., 2009) or large X-ray datasets (Wang et al., 2017; Irvin et al., 2019; Johnson et al., 2019; Bustos et al., 2020) to alleviate the shortage of expert labeling. The emergence of self-supervised representation learning (Doersch et al., 2015; Agrawal et al., 2015; Wang & Gupta, 2015; Zhou et al., 2021a; 2023) provides a way to conduct pre-training with negligible human intervention by exploiting self-supervision. However, the self-supervised paradigm ignores the introduction of medical expertise (e.g., anatomy), reducing its transferability to downstream tasks with limited label information. On the other hand, free-text radiology reports written by experienced radiologists often contain rich domain knowledge. To leverage this, researchers developed automated rule-based labelers (Wang et al., 2017; Irvin et al., 2019) to extract structured labels from unstructured texts. Nevertheless, these labelers have several practical limitations. First, some procedures of the label extraction workflow, such as rule-making and natural language processing, still require the intensive involvement of experts and engineers. Besides, the developed labelers can hardly adapt to new scenarios due to their fixed rules and lexicons. Against this background, report-supervised R²L was proposed (Zhang et al., 2020) to acquire supervision directly from radiology reports. In practice, this paradigm leverages words and sentences in free-text reports as supervision to guide deep neural networks in learning radiograph representations, outperforming the archetypical label- and self-supervised pre-training by observable margins on various downstream tasks (Zhang et al., 2020; Zhou et al., 2022). Report-supervised R²L highlights the importance of incorporating domain knowledge.
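To make the rule-based labeling described above concrete, the sketch below shows a minimal lexicon-and-negation labeler of the kind the paragraph criticizes. The finding lexicons and negation patterns here are hypothetical toy examples, far simpler than real labelers such as CheXpert's, but they illustrate why such systems demand expert-crafted rules and adapt poorly to new lexicons:

```python
import re

# Toy lexicons (hypothetical; real labelers use far richer rule sets).
FINDINGS = {
    "pneumonia": ["pneumonia"],
    "pneumothorax": ["pneumothorax"],
    "edema": ["edema", "vascular congestion"],
}
NEGATIONS = [r"\bno\b", r"\bwithout\b", r"\bnegative for\b"]

def extract_labels(report: str) -> dict:
    """Map each finding to 1 (mentioned), 0 (negated), or None (absent)."""
    labels = {finding: None for finding in FINDINGS}
    for sentence in re.split(r"[.\n]", report.lower()):
        # A sentence-level negation flips every finding mentioned in it.
        negated = any(re.search(pat, sentence) for pat in NEGATIONS)
        for finding, terms in FINDINGS.items():
            if any(term in sentence for term in terms):
                labels[finding] = 0 if negated else 1
    return labels
```

For example, `extract_labels("Right lower lobe pneumonia. No pneumothorax.")` marks pneumonia positive and pneumothorax negative while leaving edema unmentioned; any phrasing outside the fixed lexicon is silently missed, which is exactly the brittleness the report-supervised paradigm avoids.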
This differs from the self-supervised paradigm, which focuses on learning invariant semantic representations. Nonetheless, current studies view self- and report-supervised R²L as separate, discrete choices, preventing their combination. Driven by this analysis, we present a unified framework based on masked record modeling (MRM), where the self- and report-completion tasks are modeled as two complementary objectives. Specifically, masked image reconstruction integrates semantics into pre-trained models, while masked report restoration facilitates the incorporation of medical expertise. As a result, MRM learns knowledge-enhanced semantic representations that generalize well. In practice, MRM masks random patches and tokens from the input radiograph and the associated radiology report with high masking ratios. Following a multi-task scheme, MRM asks the radiography pre-trained model to learn visual representations that can not only reconstruct the missing patches but also restore the missing tokens from the non-masked token embeddings along with mask tokens. With MRM pre-training, we can train radiography models on MIMIC-CXR (Johnson et al., 2019) with improved generalization performance. With a pre-trained ViT-B/16 model, we achieve 88.5% mean AUC when fine-tuning on CheXpert (Irvin et al., 2019) with only 1% of the labels, outperforming all previous counterparts that use 100% labeled data. On NIH ChestX-ray (Wang et al., 2017), MRM surpasses the report-supervised paradigm by about 3% when the labeling ratios* are 1% and 10%. On pneumonia identification tasks, MRM outperforms self- and report-supervised baselines, sometimes by substantial margins. These observations help verify the effectiveness of MRM in learning more transferable radiograph representations.
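The multi-task objective described above can be sketched loosely as follows. This is a minimal illustration, not the paper's implementation: the masking ratios, feature dimensions, and equal loss weighting are assumptions, and the encoder/decoder networks are elided. It shows only the two complementary terms, a mean-squared error on masked image patches (MAE-style) and a cross-entropy on masked report tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(n_items, mask_ratio, rng):
    """Boolean mask with round(n_items * mask_ratio) entries set to True."""
    n_mask = int(round(n_items * mask_ratio))
    mask = np.zeros(n_items, dtype=bool)
    mask[rng.choice(n_items, size=n_mask, replace=False)] = True
    return mask

def mrm_loss(patches, pred_patches, token_logits, token_ids,
             patch_mask, token_mask):
    # Masked image reconstruction: MSE computed on masked patches only.
    mim = np.mean((pred_patches[patch_mask] - patches[patch_mask]) ** 2)
    # Masked report restoration: cross-entropy on masked tokens only.
    logits = token_logits[token_mask]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    targets = token_ids[token_mask]
    mlm = -np.mean(np.log(probs[np.arange(len(logits)), targets] + 1e-9))
    # Equal-weight sum of the two objectives (assumed weighting).
    return mim + mlm
```

In a real pipeline, `pred_patches` and `token_logits` would come from the shared radiograph encoder plus lightweight decoders, so gradients from both completion tasks shape the same visual representation.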

2. RELATED WORK

2.1. REPORT-SUPERVISED RADIOGRAPH REPRESENTATION LEARNING

Recently, report-supervised learning (Zhang et al., 2020; Liao et al., 2021; Huang et al., 2021; Zhou et al., 2022; Boecking et al., 2022) has emerged as a new R²L paradigm that automatically acquires supervision from free-text radiology reports. Zhang et al. (2020) proposed ConVIRT to contrast radiograph features with latent embeddings of sentences in radiology reports. Liao et al. (2021) and Huang et al. (2021) explored the alignment between local patches and words in the report. Zhou et al. (2022) presented a Transformer-based R²L framework that conducts autoregressive report modeling and study-report matching. Report-supervised R²L retains the advantage of label-supervised learning, namely the incorporation of domain knowledge. Compared to the self-supervised paradigm, however, report-supervised R²L lays no emphasis on learning semantically invariant representations. To address this discrepancy, we formalize self- and report-completion as two complementary objectives, based on which we propose to encode both semantics and medical expertise into latent representations following a multi-task scheme.
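For contrast with MRM's completion objectives, the ConVIRT-style supervision discussed in this subsection can be sketched as a symmetric InfoNCE loss over paired radiograph and report-sentence embeddings. The batch size, embedding dimension, and temperature below are illustrative assumptions, not values from any of the cited works:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.
    Row i of img_emb and txt_emb is assumed to come from the same study."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) cosine-similarity matrix

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        # Diagonal entries are the positive (matched) pairs.
        return -np.mean(np.log(np.diag(p) + 1e-9))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly matched pairs drive the loss toward zero, while mismatched pairings keep it high; note that, unlike MRM's reconstruction objectives, this loss never asks the model to restore missing image or report content.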

2.2. VISUAL REPRESENTATION LEARNING VIA IMAGE-LANGUAGE PRE-TRAINING

Learning visual representations from image-language pairs has achieved tremendous success in natural image tasks (Sariyildiz et al., 2020; Desai & Johnson, 2021; Radford et al., 2021; Mu et al., 2021; Zhao et al., 2022; Li et al., 2021; Geng et al., 2022; Wang et al., 2022; Chen et al., 2022).



*The labeling ratio X% means that X% of the training set from a fully annotated downstream dataset is used for supervised fine-tuning.



Figure 1: Illustration. MRM learns transferable radiograph representations via reconstructing masked records, i.e., masked radiograph patches and masked report tokens.


