BAYES-MIL: A NEW PROBABILISTIC PERSPECTIVE ON ATTENTION-BASED MULTIPLE INSTANCE LEARNING FOR WHOLE SLIDE IMAGES

Abstract

Multiple instance learning (MIL) is a popular weakly-supervised learning model for whole slide images (WSIs) in AI-assisted pathology diagnosis. Recent advances in attention-based MIL allow the model to find its region of interest (ROI) for interpretation by learning attention weights for the image patches of a WSI. However, we empirically find that the interpretability of some related methods is either untrustworthy, because the principle of MIL is violated, or unsatisfactory, because the high-attention regions are not consistent with experts' annotations. In this paper, we propose Bayes-MIL to address the problem from a probabilistic perspective. The induced patch-level uncertainty is proposed as a new measure of MIL interpretability, which outperforms previous methods in matching doctors' annotations. We design a slide-dependent patch regularizer (SDPR) for the attention, imposing constraints derived from the MIL assumption on the attention distribution. SDPR explicitly constrains the model to generate correct attention values. The spatial information is further encoded by an approximate convolutional conditional random field (CRF) for better interpretability. Experimental results show that Bayes-MIL outperforms the related methods in patch-level and slide-level metrics and provides much more interpretable ROIs on several large-scale WSI datasets.

1. INTRODUCTION

In real-world applications of deep learning, data such as images or texts often come with insufficient labels because annotation is expensive. For example, whole slide images (WSIs) for medical diagnosis have about $10^5 \times 10^5$ pixels per image but are tagged with a single categorical label (Zhang et al., 2019; Campanella et al., 2019). Weakly-supervised learning methods are designed for learning representations and making decisions in these cases. Multiple instance learning (MIL) is a popular weakly-supervised learning model for WSI recognition (Ilse et al., 2018; Lu et al., 2021). Concretely, a large WSI is sliced into a bag of image patches (instances) of moderate size. MIL builds an end-to-end parametric model that aggregates the learned features from the instances and learns only from bag-level labels. The aggregation rule implements the key principle of MIL: for binary classification, a bag is negative when all of its instances are negative, and a bag is positive when at least one instance is positive (Ilse et al., 2018). Recent advances study attention-based MIL, which re-weights the instances for better performance. This attention mechanism has been extensively explored and used as a measure of interpretability in various downstream tasks for medical diagnosis, such as prostate cancer (Zhang et al., 2021) and breast cancer (Naik et al., 2020). Specifically, high attention weights are taken to indicate that the associated instances are positive, e.g., cancerous image patches. However, this rule is not formally justified, and it is not clear whether negative (i.e., benign) instances could be assigned high attention values, or positive instances low ones. We first analyze the convergence of attention and establish the validity of this rule under binary labels.
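To make the MIL principle and the idea of attention re-weighting concrete, the following is a minimal sketch in plain Python. It is an illustration only, not the implementation used in this paper: the function names `bag_label` and `attention_pool`, the toy scores, and the toy features are all hypothetical, and the attention scores are given by hand rather than produced by a learned network.

```python
import math


def bag_label(instance_labels):
    """MIL principle for binary labels: a bag is positive (1) if at least
    one instance is positive, and negative (0) if all instances are negative."""
    return int(any(instance_labels))


def attention_pool(scores, features):
    """Softmax-normalise raw scores into attention weights and return
    (weights, weighted sum of the instance feature vectors).

    Assumption: `scores` holds one raw score per instance; in a real model
    they would come from a small learned scoring network."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled


# Toy bag with three patches; only the second one is positive.
labels = [0, 1, 0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 2.0, 0.3]  # hypothetical raw attention scores

weights, z = attention_pool(scores, features)
print("bag label:", bag_label(labels))
print("attention weights:", weights)
print("bag-level representation:", z)
```

In this toy bag the highest score was placed on the positive patch by construction. Nothing in the pooling step itself forces a trained model to do so, which is the gap that the analysis of the attention rule is meant to address.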
Based on this rule, we conduct an empirical study on a large-scale WSI dataset of how the attention mechanism in related MIL methods performs. The study shows two clear flaws of the related methods:

• The interpretability for negative bags is untrustworthy because some methods violate the key principle of MIL by placing high attention values on instances in negative bags, thus indicating positive instances.
• The interpretability for positive bags is unsatisfactory because the high attention values do not match experts' annotations of positive instances well.

In this paper, we address these problems from a probabilistic perspective. First, a basic framework of Bayesian MIL (Bayes-MIL) is proposed for inducing uncertainty over the attention weights. The uncertainty is potentially an accurate measure for estimating whether the instances are positive or negative, as a replacement for attention. Second, a regularizer is derived from the MIL principle and implemented via the variational inference framework; it sets specific constraints on the attention distributions of positive and negative bags. Third, to encode the spatial information of instances for medical imaging applications, we propose an approximation to the convolutional conditional random field (CRF), which benefits the localization of the region of interest (ROI). The final classifier is modeled in a Bayesian way in order to provide calibrated uncertainty for the bag-level prediction. An overview of our proposed method is shown in Fig. 1. The contributions of this paper are as follows:

• We analyze attention-based MIL in the interpretability-critical medical application and point out the flaws of directly using attention for interpretation.
• To address these problems, we propose the first Bayesian MIL for WSIs with three key components: a probabilistic instance-wise attention module for uncertainty visualization, the slide-dependent patch regularizer for learning the correct attention distribution, and an approximate convolutional conditional random field for encoding spatial information. Our model provides well-calibrated uncertainties, which are crucial for safety in medical applications.
• The evaluation on large-scale MIL datasets shows that Bayes-MIL outperforms the related methods in instance-level interpretation and bag-level prediction under various evaluation metrics. The visualized distribution of data uncertainty shows a strong correlation with the designed regularizer, which validates the soundness of the regularizer and explains why uncertainty is useful in MIL interpretation.

2. FORMULATION AND ANALYSIS OF MULTIPLE INSTANCE LEARNING

Multiple instance learning formulation. We follow the standard formulation of attention-based multiple instance learning (MIL) (Ilse et al., 2018; Lu et al., 2021). In MIL, the input is a bag of instances, $X = \{x_1, \ldots, x_K\}$, $x_k \in \mathbb{R}^D$, where $K$ is the number of instances, which varies across bags. Each bag has a bag-level label $Y$. We further assume that the instances have corresponding instance-level labels $\{y_1, \ldots, y_K\}$, which are unknown during training. The dataset consists of $N$ such bag-label pairs, $\mathcal{D} = \{X_n, Y_n\}_{n=1}^{N}$. The objective of MIL is to learn an optimal function that predicts the bag-level label from the bag of instances. To this end, the MIL model should be able to aggregate the information of the instances $\{x_k\}_{k=1}^{K}$ to make the final decision. A widely adopted aggregation method is the embedding-based approach, which maps $X$ to a bag-level representation $z \in \mathbb{R}^D$ and uses $z$ to predict $Y$. Ilse et al. (2018) extends the embedding-based

Note: we use the following pairs of terms interchangeably for simplicity: "bag" and "slide"; "instance" and "patch".

Figure 1: Overview of our Bayes-MIL framework, with zoom-in views for clear visualization of interpretability. (1) The basic Bayes-MIL improves patch-level localization performance (Sec. 3.1). (2) The slide-dependent patch regularizer makes attention densely concentrated on the positive area, improving its interpretability (Sec. 3.2). (3) The convolutional CRF improves the localization by smoothing the uncertainty over different patches (Sec. 3.3). (Bottom) The ablation results on a few metrics show the improvement in interpretability. The full ablation results are in Sec. 5.

