MEDICAL IMAGE UNDERSTANDING WITH PRE-TRAINED VISION LANGUAGE MODELS: A COMPREHENSIVE STUDY

Abstract

Large-scale pre-trained vision-language models (VLMs) have shown remarkable domain transfer capability on natural images. However, it remains unknown whether this capability also applies to the medical image domain. This paper thoroughly studies the knowledge transferability of pre-trained VLMs to the medical domain, where we show that well-designed medical prompts are the key to eliciting knowledge from pre-trained VLMs. We demonstrate that by prompting with expressive attributes that are shared between domains, the VLM can carry knowledge across domains and improve its generalization. This mechanism empowers VLMs to recognize novel objects with few or even no image samples. Furthermore, to avoid the laborious manual design process, we develop three approaches for the automatic generation of medical prompts, which can inject expert-level medical knowledge and image-specific information into the prompts for fine-grained grounding. We conduct extensive experiments on thirteen medical datasets across various modalities, showing that our well-designed prompts greatly improve zero-shot performance compared to the default prompts, and that our fine-tuned models surpass supervised models by a significant margin.

1. INTRODUCTION

There may be no other domain like medical imaging that requires such a high level of expert knowledge while acquiring expert-labeled data remains so expensive. In fact, the limited amount of well-labeled data is one of the factors that deter the medical image domain from moving toward the era of large-scale pre-trained models, and transfer learning becomes a natural choice. Nevertheless, as argued by Niu et al. (2021), the mismatch between domains may compromise the capability of pre-trained models to transfer from one domain to another (Raghu et al., 2019). Unfortunately, this mismatch also exists between the medical and natural image domains. Therefore, finding a data-efficient approach with superior domain transfer performance is essential for advancing medical image understanding. Although pre-trained vision-language models (VLMs) have shown much success in domain transfer tasks, it is not known whether the knowledge learned from natural image and text pairs by large pre-trained VLMs can benefit the understanding of medical images. As pointed out by Shen et al. (2022), large-scale VLMs perform well in recognizing common objects but may struggle with visual concepts that rarely appear in their pre-training data. This observation motivates us to discover an even stronger approach to bridge the domain gap. In VL models such as GLIP (Li et al., 2022), X-VLM (Zeng et al., 2021), and VinVL (Zhang et al., 2021), prompt learning also plays an essential role in enhancing the model's generalization. Instead of simply aligning text and image pairs, GLIP aims to ground image regions with the help of text prompts and shows that prompts with expressive attributes can further improve the model's performance in domain transfer.
We presume that a prompt integrated with expert-level knowledge and image-specific information could vastly help the domain transfer process, because one key challenge in medical image understanding is locating objects that rarely appear in the natural image domain. With the help of well-designed text prompts, the model can be equipped with high-level semantics describing the characteristics of target objects instead of being given only object names. In this paper, we aim to leverage powerful pre-trained vision-language models like GLIP with expressive medical prompts to make efficient domain transfers from natural images to medical images for object detection. To this end, we first explore how to manually design effective medical prompts via attribute injection, and show that such well-designed prompts can significantly improve the domain transfer capability compared to default category names. Intuitively, some common graphic attributes in text prompts, such as color, texture, and shape, are shared across domains; therefore, by including these expressive attributes in the prompts, the VLMs can selectively learn to align visual features through the anchor points set by the prompts rather than learning aimlessly. Furthermore, to improve efficiency and avoid laborious manual design, we propose several approaches, i.e., masked language model (MLM) driven auto-prompt generation, image-specific auto-prompt generation, and a hybrid of both, to automatically generate medical prompts that make the VLMs perform on par with models using manually elaborated prompts.
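The attribute injection idea above can be sketched in a few lines: a bare category name is expanded into an expressive prompt by prepending cross-domain visual attributes. This is a minimal illustration, assuming a simple space-joined template; the attribute values below (e.g., "pinkish", "oval") are hypothetical examples, not the vocabularies used in our experiments.

```python
def build_prompt(category, color=None, texture=None, shape=None):
    """Expand a bare category name into an expressive detection prompt
    by prepending shared visual attributes (color, texture, shape)."""
    attributes = [a for a in (color, texture, shape) if a]
    if attributes:
        return f"{' '.join(attributes)} {category}"
    return category

# Default prompt: only the category name, as in standard GLIP usage.
default_prompt = build_prompt("polyp")
# Attribute-injected prompt for the same target object.
designed_prompt = build_prompt("polyp", color="pinkish",
                               texture="slightly elevated", shape="oval")
```

The resulting string is passed to the VLM as its text input in place of the default category name, so no change to the detection model itself is required.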
The MLM-driven approach mainly focuses on extracting expert-level knowledge from pre-trained language models specialized in the medical domain, whereas the image-specific prompt generation, based on a visual question answering (VQA) system, allows the flexibility to include image-specific attribute information in the prompts, rather than using a single fixed prompt for all images during inference. We evaluate our approaches on a broad range of existing medical datasets across different modalities, including photography, endoscopy, cytology, histopathology, and radiology (X-ray, CT, MRI, and ultrasound) image datasets. Models with our well-designed medical prompts exhibit significant superiority over those with default prompts in terms of zero-shot and few-shot performance, some even surpassing supervised models trained with full data. Moreover, our fine-tuned models outperform traditional supervised baselines by a significant margin across almost all datasets.
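The control flow of the two auto-prompt routes and their hybrid can be sketched as follows. This is a hedged sketch, not our actual pipeline: `mlm_fill` and `vqa_answer` stand in for a medical-domain masked language model and a VQA system, and are mocked here with fixed lookups so the logic is runnable; the question and attribute strings are illustrative assumptions.

```python
def mlm_fill(masked_sentence):
    # Stand-in for a medical MLM predicting the [MASK] token
    # (e.g., a domain-specialized BERT-style model).
    return {"the [MASK] polyp": "sessile"}.get(masked_sentence, "small")

def vqa_answer(image, question):
    # Stand-in for a VQA system answering attribute questions per image.
    return {"what color is the lesion?": "reddish"}.get(question, "unknown")

def auto_prompt(category, image=None):
    # MLM route: elicit an expert-level attribute from the language model.
    expert_attr = mlm_fill(f"the [MASK] {category}")
    if image is None:
        return f"{expert_attr} {category}"
    # Hybrid route: additionally ground an image-specific attribute via VQA,
    # so each image gets its own prompt at inference time.
    color = vqa_answer(image, "what color is the lesion?")
    return f"{color} {expert_attr} {category}"
```

The MLM route yields one knowledge-enriched prompt per category, while the hybrid route tailors the prompt to each input image before it is handed to the detector.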

2. RELATED WORK

Transfer between natural and medical image domains Transfer learning is a prevailing strategy for training deep neural networks in domains with limited labeled data, such as the medical domain, and has been widely investigated there for a while (Peng et al., 2021; Mustafa et al., 2021; Raghu et al., 2019). Zhou et al. (2021) broadly discussed transfer learning for medical images. Mustafa et al. (2021) argued that transfer from natural to medical images can help if performed at a sufficient scale. Peng et al. (2021) and Raghu et al. (2019) pointed out that large models do not consistently outperform simple and lightweight models. To the best of our knowledge, no prior transfer learning work on medical images has used VLMs.

Vision language models Recently, VLMs have made breakthroughs in cross-modal tasks and visual recognition problems. Some pre-trained VLMs (Lu et al., 2019; Ilharco et al., 2021) leverage BERT-like architectures to handle cross-modal inputs, and Zhang et al. (2020) adopted the contrastive learning paradigm to train a VLM for medical images. Inspired by this line of work, CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) trained VLMs through contrastive learning on a large number of image and text pairs. Eslami et al. (2021) proposed to leverage large-scale VLMs for medical VQA tasks. While these works focus on pre-trained VLMs, another line of work integrates multi-task learning with the vision-language pre-training paradigm (Bao et al., 2021; Yu et al., 2022; Wang et al., 2022); these models can perform cross-modal tasks such as image captioning and visual question answering. Moon et al. (2021) is one of the pioneering works on VLM multi-task learning in the medical domain.
Prompt design Knowledge-intensive domains, such as the medical domain, usually require training domain-specific language models on expert-knowledge-augmented corpora to learn proper representations for domain concepts (Gu et al., 2021b; Lee et al., 2020). Moreover, prompting language models in a zero-shot or few-shot manner to elicit knowledge has become a commonly adopted approach in recent years (Petroni et al., 2019; Jiang et al., 2020). Besides directly mining knowledge from language models, Shen et al. (2022) designed a pipeline for extracting knowledge from an external source such as WordNet (Miller, 1998). Our proposed auto-prompt generation approaches are also partially inspired by this line of research (Song et al., 2022; Yang et al., 2022). Zhou et al. (2022) pro-

