ANATOMICAL STRUCTURE-AWARE IMAGE DIFFERENCE GRAPH LEARNING FOR DIFFERENCE-AWARE MEDICAL VISUAL QUESTION ANSWERING

Abstract

To contribute to automating medical vision-language analysis, we propose a novel Chest X-ray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, the task is to answer questions about the diseases present and, more importantly, about the differences between the two images. This mirrors the radiologist's diagnostic practice of comparing the current image with a reference before writing the report. For this task, we construct a new dataset, MIMIC-Diff-VQA, comprising 698,739 QA pairs over 109,790 image pairs. We further propose a novel expert knowledge-aware graph representation learning model for this problem: we leverage expert knowledge such as anatomical structure priors and semantic and spatial relationships to construct a multi-relationship graph representing the differences between the two images. Our dataset and code will be released upon publication. We believe this work will further push forward the medical vision-language model.

1. INTRODUCTION

Several recent works extract text-mined labels from clinical notes and use them to train deep learning models for medical image analysis, building on datasets such as MIMIC (Johnson et al., 2019), NIH14 (Wang et al., 2017), and CheXpert (Irvin et al., 2019). On this arduous journey through the vision-language (VL) modality, the community either mines per-image common disease labels (Fig. 1(b)) through Natural Language Processing (NLP), works on report generation (Fig. 1(c), generated from (Nguyen et al., 2021)), or answers certain pre-defined questions (Fig. 1(d)). Despite significant progress on these tasks, the heterogeneity, systemic biases, and subjective nature of radiology reports still pose many technical challenges. For example, the automatically mined labels in Fig. 1(a) are obviously problematic because the rule-based extraction approach did not handle all uncertainties and negations well (Johnson et al., 2019). Training an automatic radiology report generation system to match the report directly appears to avoid the inevitable bias in NLP-mined thoracic pathology labels. However, radiologists tend to write impressions that follow abstract clinical logic. For example, as shown in Fig. 1(a), a radiology report excludes many diseases (either commonly diagnosed or specifically queried by physicians) using negation expressions, e.g., no, free of, without. An automatic report generator can hardly guess which diseases the radiologist chose to exclude. Instead of exhaustively generating all descriptions, VQA is more practical because it answers only the specific question asked. As shown in Fig. 1, the question "is there any pneumothorax in the image?" can be posed exactly as in the report, and the answer is without doubt "No". However, the questions in the existing VQA dataset ImageCLEF (Abacha et al., 2019) concentrate on a few general ones, such as "is there something wrong in the image?
what is the primary abnormality in this image?", which lack the specificity needed for the heterogeneous and subjective nature of reports and often reduce VQA to classification. While VQA-RAD (Lau et al., 2018) has more heterogeneous questions covering 11 question types, its dataset of only 315 images is too small. To bridge this gap in the visual language model, we propose a novel medical image difference VQA task that is more consistent with radiologists' practice. When making a diagnosis, radiologists compare current and previous images of the same patient to check the progress of the disease. Actual clinical practice follows a patient treatment process (assessment - diagnosis - intervention - evaluation), as shown in Figure 2. A baseline medical image is used as an assessment tool to diagnose a clinical problem, usually followed by therapeutic intervention. A follow-up medical image is then taken to evaluate the effectiveness of the intervention in comparison with the past baseline. In this framework, every medical image serves to clarify the doctor's clinical hypothesis for the patient's unique clinical course (e.g., whether the pneumothorax is mitigated after therapeutic intervention). However, existing methods cannot provide a straightforward answer to such a clinical hypothesis because they do not compare past and present images. Therefore, we present a chest X-ray image difference VQA dataset, MIMIC-Diff-VQA, to fulfill the needs of the medical image difference task. Moreover, we propose a system that can respond directly with the information the doctor wants by comparing the current medical image (main) to a medical image from a past visit (reference). This allows us to build a diagnostic support system that realizes the inherently interactive nature of radiology reports in clinical practice.
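To make the negation problem concrete, the following is a minimal illustrative sketch (not the actual MIMIC-CXR labeler, and the cue list and function names are our own assumptions): a naive keyword matcher assigns a disease label whenever the finding's name appears anywhere in the report, so a sentence like "there is no pneumothorax" produces a false-positive label, whereas a simple sentence-level negation check avoids it.

```python
import re

# Hypothetical negation cues; a real labeler uses a far richer rule set.
NEGATION_CUES = ("no ", "without ", "free of ", "negative for ")

def naive_label(report: str, finding: str) -> bool:
    """Label a finding present if its name appears anywhere in the report."""
    return finding in report.lower()

def negation_aware_label(report: str, finding: str) -> bool:
    """Label a finding present only if some sentence mentions it affirmatively."""
    for sentence in re.split(r"[.;]", report.lower()):
        if finding in sentence and not any(cue in sentence for cue in NEGATION_CUES):
            return True  # at least one non-negated mention
    return False

report = "Lungs are clear. There is no pneumothorax or pleural effusion."
naive_label(report, "pneumothorax")           # True: a false positive
negation_aware_label(report, "pneumothorax")  # False: negation detected
```

Even this small fix is brittle (it misses scoped negations, uncertainty phrases like "cannot exclude", and cross-sentence references), which is exactly why rule-based mined labels carry the biases discussed above.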
MIMIC-Diff-VQA contains pairs of "main" (present) and "reference" (past) images drawn from the same patient's radiology studies at different times in MIMIC (Johnson et al., 2019), a large-scale public database of chest radiographs with 227,835 studies, each with a unique report and images. The question and answer pairs are extracted from the MIMIC reports for the "main" and "reference" images with rule-based techniques. Similar to (Abacha et al., 2019; Lau et al., 2018; He et al., 2020), we



Figure 1: (a) The ground-truth report corresponding to the main (present) image. Red text marks labels incorrectly classified by either text mining or report generation; the red box marks the misclassified labels, and the green box marks the correctly classified ones. Underlined text is correctly produced in the generated report. (b) The label "Pneumothorax" is incorrectly classified because there is NO evidence of pneumothorax in the chest X-ray. (c) "There is a new left apical pneumothorax" → This sentence is wrong because the pneumothorax had mostly improved after treatment. However, the vascular shadow in the left pulmonary apex is not very obvious, so it is understandable why it is misidentified as a pneumothorax there. "there is a small left pleural effusion" → It is hard even for a doctor to tell whether a left pleural effusion is present. (d) The ImageCLEF VQA-MED questions are designed too simply. (e) The reference (past) image and its clinical report. (f) Our medical difference VQA questions are designed to guide the model to focus on and localize important regions.

