EXPLORING LOW-RANK PROPERTY IN MULTIPLE INSTANCE LEARNING FOR WHOLE SLIDE IMAGE CLASSIFICATION

Abstract

The classification of gigapixel-sized whole slide images (WSIs) with slide-level labels can be formulated as a multiple-instance-learning (MIL) problem. State-of-the-art models often consist of two decoupled parts: local feature embedding with a pre-trained model, followed by a global feature aggregation network for classification. We leverage the apparent similarity among patches in high-resolution WSIs, which essentially exhibit low-rank structures in the data manifold, to develop a novel MIL method that improves both feature embedding and feature aggregation. We extend contrastive learning with a pathology-specific Low-Rank Constraint (LRC) for feature embedding to pull together samples (i.e., patches) belonging to the same pathological tissue in the low-rank subspace and simultaneously push apart those from different latent subspaces. At the feature aggregation stage, we introduce an iterative low-rank attention MIL (ILRA-MIL) model to aggregate features with low-rank learnable latent vectors. We highlight the importance of cross-instance correlation modeling but refrain from directly using the transformer encoder considering its $O(n^2)$ complexity. ILRA-MIL with LRC pre-trained features achieves strong empirical results across various benchmarks, including (i) 96.49% AUC on CAMELYON16 for binary metastasis classification, (ii) 97.63% AUC on TCGA-NSCLC for lung cancer subtyping, and (iii) 0.6562 kappa on the large-scale PANDA dataset for prostate cancer classification. Code is available at https://github.com/jinxixiang/low_rank_wsi.

1. INTRODUCTION

Recent advances in artificial intelligence for digital pathology have shown the potential to analyze gigapixel whole-slide images (WSIs). However, some challenges remain unsolved, including limited samples for training deep learning models and the extremely high resolution of WSIs (Lu et al., 2021c; Campanella et al., 2019; Shao et al., 2021; Sharma et al., 2021; Lu et al., 2021b). Since the relationship between input images and target labels is highly ill-posed, e.g., on CAMELYON16, 1.5 million 224×224 input image tiles against 270 WSI-level labels, one has to decompose the model into two separate stages: local feature embedding and global feature aggregation. Although biological tissues in WSIs exhibit a wide variation, there are still high semantic and background similarities among different image patches from the same type of tissue. Therefore, one fundamental challenge is performing feature embedding that only captures relevant biological information and allows for quantitative comparison, categorization, and interpretation. After embedding, the standard MIL uses non-parametric max-/mean-pooling to perform slide-level classification. Such simplified schemes might lead to sub-optimal feature aggregation for WSI classification, and the models cannot learn cross-instance correlation due to the weak supervision signal. Consistent with the findings in natural images (Cong et al., 2013; Zhou et al., 2014; Zhang et al., 2013; Liu et al., 2012), we empirically find that gigapixel WSIs exhibit essentially low-rank properties in the data manifold (see evidence in Appendix A). We aim to harness the low-rank property for WSI classification.

Figure 1: The proposed pipeline. WSI is cropped into patches and then embedded into vectors for classification. We design LRC for feature embedding and ILRA-MIL for feature aggregation.

The first intention is to learn a low-dimensional feature embedding in a discriminative way by extending contrastive loss with a low-rank constraint. For global feature aggregation, it would be beneficial for MIL to learn potential cross-instance correlation, which may help the model become more context-aware (Lu et al., 2021c). To this end, the second intention is to introduce self-attention with a low-rank matrix that forms an attention bottleneck with which all instances must interact, allowing it to handle large-scale bag sizes with a small computation overhead. This resolves the quadratic complexity $O(n^2)$ caused by global self-attention. Our main contributions: (1) We extend contrastive learning with a low-rank constraint (LRC) to learn feature embedding using unlabeled WSI data; (2) We use iterative low-rank attention MIL (ILRA-MIL) to process a large bag of instances, allowing it to encode cross-instance correlation naturally; (3) Extensive experiments on public benchmarks are conducted. Remarkably, ILRA-MIL improves over baselines, including attention-pooling and transformer-based MIL, by a large margin.

2.1. LOCAL FEATURE EMBEDDING IN MIL

Most methods conduct feature embedding with the ResNet50 pre-trained on ImageNet (Lu et al., 2021c; Campanella et al., 2019) . However, there is a significant domain deviation between pathological and natural images, which might lead to sub-optimal patch features for WSI classification. Contrastive learning paves a way for pathology-specific image pre-training (Lu et al., 2019; Li et al., 2021; Chen et al., 2020a; Ciga et al., 2022; Stacke et al., 2021) . The fundamental idea is to pull together an anchor and a "positive" sample in embedding space and push apart the anchor from many "negative" samples. Nevertheless, it is infeasible for pathology images since they usually consist of multiple positive instances (Li et al., 2022) . SupCon extends the self-supervised contrastive approach to the fully-supervised setting, allowing us to leverage label information (Khosla et al., 2020) effectively. Nevertheless, fine-grained local annotations for WSIs are hardly available; thus, we cannot adapt SupCon directly. We exploit the low-rank properties to generalize the supervised contrastive loss for WSIs without patch-level label information.

2.2. GLOBAL FEATURE AGGREGATION IN MIL

Traditional pooling operators are robust to noisy data and unbalanced distributions. MIL-RNN (Campanella et al., 2019), built on a recurrent network, achieved clinical-grade performance using more than 10,000 slides, but it is data-hungry and constrained to binary classification. The local attention method ABMIL (Ilse et al., 2018) uses attention weights to find key instances, bringing significant improvements and robustness. CLAM (Lu et al., 2021c) further improves ABMIL with a clustering constraint that pulls the most and least attended instances apart. As concluded by Lu et al. (2021c), one limitation of CLAM and MIL-based approaches is that they typically treat different patches in the slide as independent and do not learn the potential cross-interactions, which may help the model become context-aware. To this end, global attention-based networks (Li et al., 2021; Shao et al., 2021; Lu et al., 2021a) are introduced with non-local pooling or a transformer encoder to compensate for the shortcoming of local attention MIL, which considers no cross-instance correlations. We aim to further improve the global attention model with ILRA-MIL.

3. METHOD

The proposed pipeline is boosted with the low-rank property of WSI, consisting of a local feature embedding module and a global feature aggregation module, as illustrated in Fig. 1 .

3.1.1. PRELIMINARY

Contrastive learning implements the heuristic of discerning positive samples from negative samples (Chen et al., 2020a; b; c; Grill et al., 2020; Gao et al., 2021). Given a randomly sampled minibatch of $N$ images, we get pairs of projected feature vectors from augmented examples $\{z_i\}_{i \in I}$, $I = \{1, \cdots, 2N\}$. The self-supervised contrastive loss is (Chen et al., 2020a):

$$ L_{Con} = -\sum_{i \in I} \log \frac{\exp\left(\mathrm{sim}(z_i, z_{j(i)})/\tau\right)}{\sum_{a \in N(i)} \exp\left(\mathrm{sim}(z_i, z_a)/\tau\right)} \tag{1} $$

where $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is the dot product between $\ell_2$-normalized $u$ and $v$; $N(i) = I \setminus \{i\}$; $j(i)$ is the index of the other augmented sample from the same image; $\tau$ is a temperature parameter. For each anchor $z_i$, there is one positive sample $z_{j(i)}$ and $2(N-1)$ negative samples.
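As a minimal sketch of Eq. (1), the following NumPy function evaluates the loss for one minibatch. It assumes that rows `k` and `k + N` of the input hold the two augmented views of image `k`, and the default temperature `tau=0.5` is an illustrative choice; neither detail is specified in the text above.

```python
import numpy as np

def contrastive_loss(z, tau=0.5):
    """Self-supervised contrastive loss of Eq. (1), summed over the 2N anchors.

    z: array of shape (2N, d). Rows k and k + N are assumed to be the two
    augmented views of image k (a layout chosen for this sketch).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # l2-normalise each row
    sim = z @ z.T / tau                                  # cosine similarity / tau
    n2 = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                       # N(i) = I \ {i}
    pos = (np.arange(n2) + n2 // 2) % n2                 # j(i): index of the other view
    log_denominator = np.log(np.exp(sim).sum(axis=1))    # sum over a in N(i)
    return -(sim[np.arange(n2), pos] - log_denominator).sum()
```

For a batch in which the two views of each image coincide and different images are orthogonal, each anchor contributes $\log(1 + 2(N-1)e^{-1/\tau})$, which is a convenient sanity check.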

3.1.2. EXTENSION OF CONTRASTIVE LOSS

Most pathology cases have high semantic and background similarity, thus resulting in multiple positives in a batch and introducing estimation errors in (1). One straightforward approach is the generalization of supervised contrastive learning (i.e., SupCon (Khosla et al., 2020)) to an arbitrary number of positives by extending:

$$ L_{SupCon} = -\sum_{i \in I} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i, z_p)/\tau\right)}{\sum_{a \in N(i)} \exp\left(\mathrm{sim}(z_i, z_a)/\tau\right)} \tag{2} $$

where $P(i)$ is the set of indices of all positive samples in the minibatch given anchor $z_i$; $|P(i)|$ is its cardinality. For images with labels, it is intuitive to constitute positive samples with the same labels.
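Eq. (2) can be sketched in the same way. In this illustrative NumPy version, $P(i)$ is built from an integer label array, anchors without any positive are skipped, and `tau=0.5` is again an assumed default rather than a value from the text.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.5):
    """Supervised contrastive loss of Eq. (2) for features z (n, d) and labels (n,)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Denominator runs over N(i) = I \ {i}.
    exp_sim = np.where(not_self, np.exp(sim), 0.0)
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    # P(i): all other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1) / np.maximum(n_pos, 1)
    return per_anchor[n_pos > 0].sum()   # anchors without positives are skipped (sketch choice)
```

When every anchor has exactly one positive (its other augmented view), this reduces to the self-supervised loss in (1).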

3.1.3. PATHOLOGY SPECIFIC LOW-RANK LOSS

SupCon in (2) extends the vanilla contrastive loss by leveraging label information. However, we refrain from adopting SupCon for WSIs because no patch-level labels are available. We thus propose a new self-supervised learning loss named LRC, tailored for pathology images, which is shown to be a generalization of SupCon to unlabeled scenarios. Given a set of feature samples, each of which can be represented as a linear combination of the bases in a dictionary, we aim at finding the representations that have a low-rank similarity matrix between two sets of augmented representations:

$$ \mathcal{R}(T^\top \tilde{T}) = \left\{ T^\top \tilde{T} \in \mathbb{R}^{N \times N} : \mathrm{rank}(T^\top \tilde{T}) = r,\ r \ll N \right\} \tag{3} $$

where $T^\top \tilde{T}$ is a similarity matrix of $T = [t_1, \cdots, t_N]$ and $\tilde{T} = [\tilde{t}_1, \cdots, \tilde{t}_N]$; $\tilde{t}_i$ and $t_i$ are two augmented representations of the same image. A low-rank matrix can be decomposed as the product of a dictionary $D$ and a block-diagonal $B$ such that (Liu et al., 2012; Wright & Ma, 2022):

$$ T^\top \tilde{T} = DB + E = [D_1, D_2, \cdots, D_r] \begin{bmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_r \end{bmatrix} + E \tag{4} $$

where $E$ is an error matrix which should be minimized; $D_b \in \mathbb{R}^{N \times s_b}$, $B_b \in \mathbb{R}^{s_b \times q_b}$ with $b = 1, \cdots, r$; $s_b, q_b$ represent the shape of subspace $B_b$. Intuitively, pairs belonging to the same subspace are more semantically similar than randomly sampled ones. This has also been recognized as latent classes (Chuang et al., 2020; Saunshi et al., 2019). For the self-supervised contrastive loss $L_{Con}$ with only one positive pair for each anchor, $T^\top \tilde{T}$ is considered to be a full-rank diagonal matrix, i.e., all entries are zeros except for the diagonal ones. The SupCon loss $L_{SupCon}$ further leverages label information to access more positive samples, enforcing $T^\top \tilde{T}$ to explore the semantic similarity in the embedding space. In this way, the SupCon loss implicitly imposes low-rank constraints on $T^\top \tilde{T}$, where the rank $r$ of the matrix corresponds to the total number of classes in SupCon.

This observation establishes the connection between contrastive loss and the low-rank property. Since the low-rank decomposition of (3) is not tractable for online learning, as an alternative, we use the SupCon loss in (2) as a surrogate by accessing more positives belonging to the same subspace $B_b$. Suppose we get a set of descendingly sorted indices based on their similarity to the anchor:

$$ C(a) = \left\{ A(1), \cdots, A(N) \mid \text{if } i < j, \text{ then } \mathrm{sim}(t_a, \tilde{t}_{A(i)}) \ge \mathrm{sim}(t_a, \tilde{t}_{A(j)}) \right\}. \tag{5} $$

Given an anchor $t_a$, we get $r$ subspaces $C_b(a)$, $b = 1, \cdots, r$, as stated in the low-rank representation Eq. (4). We can intuitively consider that each subspace corresponds to a latent class, where $C_1(a) = \{A(1), \cdots, A(q_1)\}$, $C_2(a) = \{A(q_1 + 1), \cdots, A(q_1 + q_2)\}$, $\cdots$, $C_r(a) = \{A(N - q_r + 1), \cdots, A(N)\}$. Note that $q_1, q_2, \cdots, q_r$ are the column dimensions of $B_1, B_2, \cdots, B_r$. Instead of partitioning all samples to get all subspaces, which is computationally infeasible without solving (4), we only optimize the objective over the least- and most-distant subspaces $C_1(a)$ and $C_r(a)$ with respect to the anchor. For any positive sample $p \in C_1(a)$ and negative sample $n \in C_r(a)$, we would like to achieve the following:

$$ \mathrm{sim}(t_a, \tilde{t}_p) \ge \mathrm{sim}(t_a, \tilde{t}_n) + \xi, \tag{6} $$

where $\xi = 0.5$ is a constant margin for all negative pairs. We add the margin $\xi$ rather than just ensuring $\mathrm{sim}(t_a, \tilde{t}_p) \ge \mathrm{sim}(t_a, \tilde{t}_n)$ to avoid the trivial solution where features collapse together, i.e., $\mathrm{sim}(t_a, \tilde{t}_p) = \mathrm{sim}(t_a, \tilde{t}_n)$. We can incorporate the low-rank constraint with margin into the supervised contrastive loss function in (2) by adding the margin after the cosine similarity term, giving us:

$$ L_{LRC} = -\sum_{a=1}^{N} \frac{1}{|C_1(a)|} \sum_{p \in C_1(a)} \log \frac{\exp\left(\mathrm{sim}(t_a, \tilde{t}_p)\right)}{\sum_{j \in \{C_1(a) \cup C_r(a)\} \setminus a} \exp\left(\mathrm{sim}(t_a, \tilde{t}_j) + \xi_j\right)} \tag{7} $$

where $\xi_j = 0$ if $j \in C_1(a)$, otherwise $\xi_j = \xi$; $|C_1(a)| = q_1$ is the number of elements in $C_1(a)$.
The loss (7) is minimized when all positive pairs are correctly identified with condition (6) satisfied, thus enforcing our low-rank constraints. We set the top 5% of instances in a training batch as $C_1(a)$ and the bottom 5% as $C_r(a)$. The derivation and analysis of (7) are provided in Appendices A and C. The total loss for self-supervised learning for feature embedding is:

$$ L = \lambda L_{Con} + (1 - \lambda) L_{LRC}. \tag{8} $$

Without the self-supervised contrastive loss (1), there is a chicken-and-egg issue: good features will not be learned, and the low-rank loss in (7) is then not sufficiently good. Incorporating the contrastive loss $L_{Con}$ with $L_{LRC}$ yields an incremental self-updating learning process. In our default setting where $\lambda = 0.5$, no unstable training is observed.
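The selection of $C_1(a)$ and $C_r(a)$ and the margin-augmented loss can be sketched in NumPy as follows. This is a hedged illustration rather than the released implementation: it interprets the exclusion of $a$ in the loss by dropping the anchor's own index from the ranking before the top and bottom fractions are taken, and the helper `total_loss` simply combines two already computed scalar losses. Function and argument names are hypothetical.

```python
import numpy as np

def lrc_loss(t, t_tilde, frac=0.05, xi=0.5):
    """Sketch of the low-rank constraint loss for two views t, t_tilde of shape (N, d)."""
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    t_tilde = t_tilde / np.linalg.norm(t_tilde, axis=1, keepdims=True)
    sim = t @ t_tilde.T                        # sim[a, j] = sim(t_a, t~_j)
    n = sim.shape[0]
    q = max(1, int(round(frac * n)))           # |C_1(a)| = |C_r(a)| (top / bottom fraction)
    if 2 * q > n - 1:
        raise ValueError("batch too small for disjoint C_1(a) and C_r(a)")
    total = 0.0
    for a in range(n):
        others = np.delete(np.arange(n), a)              # assumption: anchor index removed
        order = others[np.argsort(-sim[a, others])]      # descending similarity to the anchor
        c1, cr = order[:q], order[-q:]                   # nearest and farthest subspaces
        # Margin xi_j is 0 for j in C_1(a) and xi for j in C_r(a).
        logits = np.concatenate([sim[a, c1], sim[a, cr] + xi])
        log_denominator = np.log(np.exp(logits).sum())
        total += -(sim[a, c1] - log_denominator).mean()
    return total

def total_loss(l_con, l_lrc, lam=0.5):
    """Convex combination of the contrastive and low-rank losses."""
    return lam * l_con + (1.0 - lam) * l_lrc
```

Because the margin only enlarges the denominator, a larger `xi` yields a larger loss for the same features, so the optimizer has to increase the gap between the nearest and farthest groups to compensate.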

3.2.1. PRELIMINARY

Without loss of generality, we take binary MIL classification as an example. The learning task is to learn a nonlinear function from feature space $X$ to label space $Y = \{1, 0\}$ using the training data set $\{(X_1, y_1), \cdots, (X_m, y_m)\}$, where $X_i = \{x_{i,1}, \cdots, x_{i,m_i}\}$ is a WSI; $m_i$ is the bag size of $X_i$; $x_{i,j}$ is an instance. The corresponding instance labels $\{y_{i,1}, \cdots, y_{i,m_i}\}$ are unknown, i.e.

$$ y_i = \begin{cases} 0, & \text{iff } \sum_j y_{i,j} = 0, \quad y_{i,j} \in \{0, 1\},\ j = 1, \cdots, m_i \\ 1, & \text{otherwise.} \end{cases} \tag{9} $$

MIL processes a bag of instances with the permutation invariance property, stating that the label of the bag remains unchanged regardless of the order of input instances in the bag (Ilse et al., 2018; Li et al., 2021). A simple example of a permutation invariant model is a network that performs pooling over the embeddings extracted from the patches of a bag. Mathematically:

$$ \mathrm{logits}(X_i) = \rho\left(\mathrm{pool}\left(\{\phi(x_{i,1}), \cdots, \phi(x_{i,m_i})\}\right)\right), \tag{10} $$

where 'pool' is a pooling operation; $\phi$ and $\rho$ denote the instance-level network and the bag-level classifier, respectively. Attention-based pooling is commonly used (Ilse et al., 2018; Lu et al., 2021c; Tomita et al., 2019; Hashimoto et al., 2020):

$$ \phi(x_{i,j}) = \frac{\exp\left\{W^\top\left(\tanh(V x_{i,j}) \odot \mathrm{sigm}(U x_{i,j})\right)\right\}}{\sum_{q=1}^{m_i} \exp\left\{W^\top\left(\tanh(V x_{i,q}) \odot \mathrm{sigm}(U x_{i,q})\right)\right\}}, \tag{11} $$

with learnable parameters $W$, $V$, and $U$.
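The gated attention score in Eq. (11) can be illustrated with a few lines of NumPy. In this sketch the bag feature is the attention-weighted sum of instance features, as in ABMIL (Ilse et al., 2018); the parameter shapes and the random initialisation used in the usage note are assumptions made only for illustration.

```python
import numpy as np

def gated_attention_scores(x, V, U, w):
    """Attention weights of Eq. (11). x: (m, d); V, U: (h, d); w: (h,)."""
    gate = np.tanh(x @ V.T) * (1.0 / (1.0 + np.exp(-(x @ U.T))))   # tanh(Vx) * sigm(Ux)
    logits = gate @ w
    logits = logits - logits.max()           # numerical stability; softmax is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

def attention_pool(x, V, U, w):
    """Bag feature as the attention-weighted sum of instance features."""
    return gated_attention_scores(x, V, U, w) @ x
```

Since each score depends only on its own instance and the pooled feature is a weighted sum, shuffling the rows of `x` leaves `attention_pool(x, V, U, w)` unchanged, which is the permutation invariance discussed above.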

3.2.2. TRANSFORMER-BASED MIL

Eq. (11) is a local attention network where the score $\phi(x_{i,j})$ of instance $x_{i,j}$ only depends on the instance itself. We aim to explore the dependence and interaction among all instances. One straightforward approach is the application of the transformer. The transformer encoder consists of alternating layers of multi-headed attention and MLP blocks. Here, we denote the feature matrix of the feature bag $X_i$ as $X_i^1 = [x_{i,1}, \cdots, x_{i,m_i}]^\top$. An attention head maps queries $Q \in \mathbb{R}^{m_i \times d}$ to outputs using $m_i$ key-value pairs $K \in \mathbb{R}^{m_i \times d}$, $V \in \mathbb{R}^{m_i \times d}$, where $d$ is the query/key dimension:

$$ \mathrm{head}_h(X_i^\ell) = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\left(Q_h K_h^\top / \sqrt{d}\right) V_h \tag{12} $$

where $Q_h = X_i^\ell W_{h,\ell}^Q$, $K_h = X_i^\ell W_{h,\ell}^K$, $V_h = X_i^\ell W_{h,\ell}^V$, $h = 1, \cdots, H$; $W_{h,\ell}^Q$, $W_{h,\ell}^K$, $W_{h,\ell}^V$ are learnable; $\ell = 1, \cdots, k$ is the index of the transformer layer; $k$ is the total number of layers. The transformer uses multi-head attention to project $Q$, $K$, $V$ onto $H$ different vectors and then concatenates all attention outputs:

$$ \mathrm{MHA}(X_i^\ell) = \mathrm{concat}(\mathrm{head}_1, \cdots, \mathrm{head}_H), \quad \tilde{X}_i^\ell = \mathrm{MHA}\left(\mathrm{LN}(X_i^\ell)\right) + X_i^\ell, \tag{13} $$

where LN is the layer norm. The output layer is an MLP with a skip connection:

$$ X_i^{\ell+1} = \mathrm{MLP}\left(\mathrm{LN}(\tilde{X}_i^\ell)\right) + \tilde{X}_i^\ell. \tag{14} $$

Considering the large number of instances in each bag (hundreds of thousands), one obstacle with the transformer for MIL is the quadratic time and memory complexity $O(m_i^2)$. Although approximations such as Nystromformer (Xiong et al., 2021), Linformer (Wang et al., 2020a), or Performer (Choromanski et al., 2020) achieve linear theoretical complexity, they overlook the innate characteristics of the input instances.
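To make the quadratic cost concrete, the following single-head NumPy sketch of Eq. (12) also returns the shape of the score matrix, which is $m_i \times m_i$. Layer norm, the MLP, and multiple heads are omitted, and the weight matrices are placeholders supplied by the caller.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(x, w_q, w_k, w_v):
    """Single attention head as in Eq. (12). x: (m, d_in); w_*: (d_in, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (m, m): memory and time grow as m^2
    return softmax(scores) @ v, scores.shape
```

For a bag of 100,000 instances the score matrix alone would hold $10^{10}$ entries per head, which motivates the low-rank bottleneck introduced next.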

3.2.3. ITERATIVE LOW-RANK ATTENTION MIL

Medical images, including WSIs, are extremely high-dimensional in their raw form. As such, it is effective to explore the hidden structures of high-dimensional data in the form of low-rank matrices (Wang et al., 2020b; Li et al., 2018; 2020). We thus introduce a learnable low-rank latent matrix $L \in \mathbb{R}^{r \times d}$ to interact with all input instances, as in the proposed ILRA-MIL shown in Fig. 2. One basic module of the network is the cross-attention (CAtt), defined as:

$$ \mathrm{CAtt}(L, X_i^\ell) = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(Q K^\top / \sqrt{d}\right) V \tag{15} $$

where $Q = L W_\ell^Q$, $K = X_i^\ell W_\ell^K$, $V = X_i^\ell W_\ell^V$. Note that $L$ is a unified matrix for all layers to keep the low-rank consistency across layers. As shown on the right-hand side of Fig. 2, we also use a unified layer with cross-attention and a Gated Linear Unit (GLU), named Gated Attention Block (GAB):

$$ \mathrm{GAB}(L, X_i^\ell) = (U \odot V) W_\ell^O, \quad U = \phi_U(L W_\ell^U), \quad V = \mathrm{CAtt}(L, X_i^\ell) \tag{16} $$

where $\odot$ stands for element-wise multiplication; $\phi_U$ is the Sigmoid Linear Unit (SiLU) (Elfwing et al., 2018; Hua et al., 2022); $W_\ell^O$, $W_\ell^U$ are linear transforms. The inputs of GAB are not permutation invariant, as the first is the query and the second is the key-value. An ILRA block consists of:

$$ P = \mathrm{GAB}_f(L, X_i^\ell), \quad X_i^{\ell+1} = \mathrm{GAB}_b(X_i^\ell, P). \tag{17} $$

Eq. (17) is analogous to low-rank projection or auto-encoder models. $\mathrm{GAB}_f$ first projects the high-dimensional $X_i^\ell$ to the low-rank space $L$. Then the projection result $P \in \mathbb{R}^{r \times d}$ is reconstructed to the high-dimensional space $X_i^{\ell+1}$ with $\mathrm{GAB}_b$, where the query is $X_i^\ell$ and the key-value is $P$. There are some desirable properties of ILRA-MIL. (i) The latent vectors $L$ encode global features that help to explain input instances. For example, in the cancer subtyping problem for computational pathology, the latent vectors could approximately capture some mutual and universal information of key cancerous regions, so that the ILRA module can compare instances in the query indirectly through $L$ to all inputs. (ii) The Q-K-V triplet is no longer symmetric as in MHA, because the shapes are $L \in \mathbb{R}^{r \times d}$, $K \in \mathbb{R}^{m_i \times d}$, $V \in \mathbb{R}^{m_i \times d}$, with $r \ll m_i$. Thus, the complexity of the cross-attention operation is significantly reduced from quadratic $O(m_i^2)$ to linear $O(r m_i)$. We set $r = 64$ by default. Constraining the latent vectors to be low-rank may restrict the network's ability to capture all of the necessary details from the input instances. To improve expressivity, the model stacks $k$ ($k = 4$ by default) ILRA layers to extract information from the input instances:

$$ \hat{X}_i = \underbrace{\mathrm{ILRA}(\cdots \mathrm{ILRA}(X_i^1) \cdots)}_{k \text{ layers}}, \tag{18} $$

where LN should be applied before the input of each layer. $\hat{X}_i = \{\hat{x}_1, \hat{x}_2, \cdots, \hat{x}_{m_i}\}$ encodes cross-instance correlations in the bag. A bag feature $x_b \in \mathbb{R}^{1 \times d}$ is obtained through max pooling over $\hat{X}_i$. Then, a trainable linear classifier $\rho$ is used to conduct non-local pooling at the output layer:

$$ \mathrm{logits}(\hat{X}_i) = \rho\left(\sum_{j=1}^{m_i} w_j \cdot \hat{x}_j\right), \quad w_j = \frac{\exp(x_b \cdot \hat{x}_j)}{\sum_{q=1}^{m_i} \exp(x_b \cdot \hat{x}_q)}. \tag{19} $$

Evaluation. For the evaluation metrics, we used accuracy and area under the curve (AUC) scores to evaluate the classification performance, where the accuracy was calculated with a threshold of 0.5 in all experiments. The multi-class PANDA is scored based on Cohen's kappa.

Baseline methods include mean/max-pooling and deep MIL models, i.e., ABMIL (Ilse et al., 2018), DSMIL (Li et al., 2021), CLAM-SB / CLAM-MB (Lu et al., 2021c), MIL-RNN (Campanella et al., 2019), TransMIL (Shao et al., 2021), and DTFD-MIL (Zhang et al., 2022).
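The ILRA-MIL aggregator described in Section 3.2.3 (a gated attention block applied forward and backward through a shared latent matrix, followed by non-local pooling) can be sketched in NumPy as below. This is an illustrative single-head version with layer normalisation omitted and randomly initialised stand-in weights; the dictionary keys (`Wq`, `Wk`, `Wv`, `Wu`, `Wo`) and the helper `init_params` are hypothetical names, not part of the released code.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def silu(a):
    return a / (1.0 + np.exp(-a))

def gab(query, keyval, p):
    """Gated Attention Block: (U * V) W_O with U = SiLU(query W_U), V = CAtt(query, keyval)."""
    q, k, v = query @ p["Wq"], keyval @ p["Wk"], keyval @ p["Wv"]
    att = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v     # cross-attention
    return (silu(query @ p["Wu"]) * att) @ p["Wo"]

def ilra_layer(x, latent, p_f, p_b):
    proj = gab(latent, x, p_f)     # (r, d): score matrix is r x m
    return gab(x, proj, p_b)       # (m, d): score matrix is m x r

def nonlocal_pool(x):
    x_b = x.max(axis=0)            # bag feature via max pooling
    w = softmax(x @ x_b)           # weights w_j from similarity to the bag feature
    return w @ x                   # input to the linear classifier rho

def init_params(d, rng):
    """Random stand-in weights for one GAB (illustration only)."""
    return {k: rng.normal(scale=d ** -0.5, size=(d, d)) for k in ("Wq", "Wk", "Wv", "Wu", "Wo")}

def ilra_mil_features(x, latent, layers):
    for p_f, p_b in layers:        # the same latent matrix is shared by all layers
        x = ilra_layer(x, latent, p_f, p_b)
    return nonlocal_pool(x)
```

Every score matrix in this sketch has shape $r \times m_i$ or $m_i \times r$, so memory grows linearly with the bag size, and reordering the instances only reorders the rows of each layer's output.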

5.1. RESULTS ON CLASSIFICATION

All results are provided in Table 6. 'DSMIL+SimCLR' denotes DSMIL with SimCLR features as reported in (Li et al., 2021). Other baselines use ImageNet pre-trained features unless otherwise noted. In all cases, ILRA with LRC feature embedding consistently improves over ImageNet pre-trained feature embedding, as the statistic in the last row shows. In CAMELYON16, tumors are minor regions in positive slides (on average < 10% per slide), resulting in a highly imbalanced distribution of positive and negative instances in a bag. Attention-based methods all outperform the traditional mean or max pooling operators. Non-local pooling methods, including DSMIL and TransMIL, outperform attention pooling because their non-local operators model the cross-instance correlation. DTFD-MIL is the best-performing competing method, which is specifically designed to address small sample cohorts. The proposed ILRA-MIL models cross-instance correlation, and its AUC score is at least 4.99% higher than that of CLAM-MB, which uses only local instance information for aggregation. In TCGA-NSCLC, positive slides contain relatively large tumour regions (average total cancer area per slide > 80%). As a result, both the max pooling and attention pooling operators work well in this scenario. Non-local pooling methods are consistently stable, and ILRA-MIL performs better than all the other competing methods, achieving a 1.53% improvement in AUC and 1.51% in accuracy compared with the best competing results. For PANDA, as MIL-RNN does not work for multi-class problems, we exclude it from the comparison. The distribution of cancer subtypes in PANDA is unbalanced, and it is challenging to differentiate glandular patterns with intermediate morphological structures (Nagpal et al., 2020). ILRA-MIL can also be applied to multi-class problems with unbalanced data, and it achieves the best results in both accuracy and kappa. For comparison, existing clinical-grade AI systems trained with pixel-wise annotations from urological pathologists score from 0.62 (Bulten et al., 2020) to 0.66 (Tolkach et al., 2020).

5.2. ABLATIONS ON LRC

To demonstrate the effectiveness of the proposed clustering-constrained contrastive loss, we compare its performance with alternative contrastive learning methods, namely SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), SimSiam (Chen & He, 2021), and MoCo-v3 (Chen et al., 2020c), and with an ImageNet pre-trained model, as shown in Table 2. The same ILRA-MIL model is used to evaluate the ACC and AUC performance. Unsurprisingly, all self-supervised features significantly boost the performance over the ImageNet pre-trained features. MoCo-v3 and SimCLR outperform BYOL and SimSiam, which do not use negative samples. The proposed LRC achieves a 1.59% AUC improvement over MoCo-v3. Similar results also apply to CLAM-SB, as shown in the table. Fig. 3 shows the predicted probability on the CAMELYON16 test set using ILRA-MIL trained with MoCo-v3 and LRC features. With 0.5 as the classification threshold, we can observe that LRC features lead to fewer false positive and false negative samples than MoCo-v3 features.

5.3. PARAMETER ANALYSIS AND ABLATIONS ON ILRA-MIL

We conduct ablations on the key modules of ILRA-MIL: (i) the low-rank latent vector attention in (15); (ii) the non-local pooling in (19); (iii) the iterative attention mechanism in (18). Ablation studies are performed on the CAMELYON16 dataset with ImageNet features; see Table 3. (1) The default rank of $L$ in Eq. (15) is $r = 64$. We adjust the rank from 32 to 128, and the result demonstrates that with a large enough rank, the latent matrix can attend to all input instances with negligible loss of information. Then, we compare it with the self-attention module. (2) We cannot directly apply full self-attention considering the large bag size; instead, we use the Nystrom transformer as an approximation. The same number of heads and layers is used for evaluation. The results indicate that low-rank attention achieves 0.9278 AUC, outperforming the 0.8127 AUC of full self-attention by a large margin. Even with a linear approximation, full self-attention involves excessively redundant and task-irrelevant interactions among instances and is challenging to optimize when only a small number of slide-level labels is available. (3) After the ILRA iteration in (17), non-local pooling with (19) is used to aggregate the global feature. We ablate it against the commonly used max pooling and the local attention pooling in (11). Remarkably, non-local pooling improves over max pooling and local attention pooling by 12.17% and 6.66% AUC, respectively. (4) ILRA-MIL can form a deeper network through iterative attention. As the number of iterations increases, the growth in model performance tends to level off, which indicates that the model is sufficient to characterize cross-instance correlations in the dataset. An iteration number greater than $k = 4$ leads to a significant decrease in performance, caused by over-fitting.

5.4. INTERPRETABILITY

Our feature embedding LRC boosts the performance of CLAM-SB (see Table 2 ), and it can also enhance interpretability. We use the trained CLAM-SB model with LRC features to draw the predicted heatmap as shown in Fig. 4 . The heatmaps show remarkable consistency with expert annotation, especially for "test 075" where the ROIs only occupied a small area; the most significant regions are located and identified. We show more visual comparisons in the Appendix.

6. CONCLUSION

In this paper, we address the problem of WSI classification by optimizing the feature embedding and feature aggregation with low-rank properties. We improve the vanilla contrastive loss with additional low-rank constraints to collect more positive samples for contrast. We also devise an iterative low-rank attention feature aggregator to model cross-instance correlations efficiently. All these designs boost the performance across various benchmarks, as the results show. One limitation of our model is that it has not been validated on multi-center larger-scale clinical datasets. In addition, ILRA-MIL cannot directly provide a local attention score for each instance, which might hinder an intuitive clinical analysis of each patch image.

A LOW-RANK PROPERTY OF WSI

High-dimensional WSI data bring great challenges to data analysis. Fortunately, high-dimensional WSI data often lie in a low-dimensional subspace, consistent with findings for natural images in computer vision and documents in natural language processing (Udell et al., 2016; Zhang et al., 2013; Zhou et al., 2014; Wright & Ma, 2022). Pursuing the low-rank property of high-dimensional data amounts to identifying the intrinsic manifold or physical mechanisms from which the data are generated. Given a bag of feature embeddings from a WSI, it can be formulated as a data matrix $X = [X_1, \cdots, X_r]^\top$, where $X_i$ corresponds to latent class $i$ and $r$ is the total number of latent classes. Ideally, $X$ can be decomposed into a low-rank component $DB$ and a sparse error component $E$, i.e., $X = DB + E$. With respect to the dictionary $D$, the optimal representation matrix $B$ for $X$ should be block-diagonal:

$$ B = \begin{bmatrix} B_1 & 0 & \cdots & 0 \\ 0 & B_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B_r \end{bmatrix} $$

The space matrix $D = [D_1, D_2, \cdots, D_r]$ contains $r$ sub-spaces. An example of optimal decomposition for feature embedding is illustrated in Fig. 5.
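As a small numerical illustration of this block-diagonal structure, the NumPy sketch below builds a noise-free $X = DB$ from three subspaces and checks its rank. The feature dimension, the number of samples per class, the number of basis vectors per subspace, and the Gaussian entries are toy assumptions chosen only for this example, with samples stored as columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                 # feature dimension (toy value)
sizes = (3, 4, 3)      # samples per latent class (toy values)
s = 2                  # basis vectors per subspace (toy value)

# Dictionary D = [D_1, D_2, D_3], one block of s columns per subspace.
D = rng.normal(size=(d, s * len(sizes)))

# Block-diagonal B: samples of class b only use the basis vectors of subspace b.
B = np.zeros((s * len(sizes), sum(sizes)))
col = 0
for b, q in enumerate(sizes):
    B[b * s:(b + 1) * s, col:col + q] = rng.normal(size=(s, q))
    col += q

X = D @ B                          # columns are samples; no error term E here
rank = np.linalg.matrix_rank(X)    # bounded by the total number of basis vectors
```

Although `X` holds 10 samples in a 32-dimensional space, its rank cannot exceed the 6 basis vectors in `D`, which is the kind of low-rank structure the decomposition is meant to expose.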
For example, data $X = [X_1, X_2, X_3]$ contains features from 3 classes, where $X_1$ contains 3 samples $x_1, x_2, x_3$, $X_2$ contains 4 samples $x_4, x_5, x_6, x_7$, and $X_3$ contains 3 samples $x_8, x_9, x_{10}$. $D$ has 3 sub-spaces, and each has 2 support items.

Figure 5: Example of an optimal low-rank decomposition.

Mathematically, the decomposition is achieved by optimizing:

$$ \min_{B, E} \|B\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = DB + E \tag{21} $$

We construct 270 data matrices with CAMELYON16 training WSIs for low-rank property analysis. The size of each data matrix is $\mathbb{R}^{m \times d}$, where $m$ is the bag size of a WSI and $d$ is the fixed dimension of the embedding depending on the encoder, e.g., $d = 1024$ for the ResNet50 backbone. The low-rank decomposition problem in (21) can be optimized by the ADMM algorithm (Alternating Direction Method of Multipliers) (Candès et al., 2011; Boyd et al., 2011). We plot the histogram of the rank of all matrices in Fig. 6. The average rank of ImageNet feature embedding is 349, much smaller than the full rank of 1024. Remarkably, the average rank of self-supervised learning feature BYOL (Grill et al., 2020), MoCo (Chen & He, 2021)

Given an anchor $t_a$, we get $r$ subspaces $C_b(a)$, $b = 1, \cdots, r$, as stated in the low-rank representation Eq. (4). We can intuitively consider that each subspace corresponds to a latent class. We thus would like to discriminate between different subspaces, e.g., $C_1(a)$ and $C_r(a)$, which are the least- and most-distant subspaces to anchor $t_a$. We extend (27) to $C_1(a)$ positive and $C_r(a)$ negative pairs in a sample batch with SupCon (Khosla et al., 2020), giving us:

$$ L_{LRC} = -\sum_{a=1}^{N} \frac{1}{|C_1(a)|} \sum_{p \in C_1(a)} \log \frac{\exp\left(\mathrm{sim}(t_a, \tilde{t}_p)\right)}{\sum_{j \in \{C_1(a) \cup C_r(a)\} \setminus a} \exp\left(\mathrm{sim}(t_a, \tilde{t}_j) + \xi_j\right)} \tag{28} $$

where $\xi_j = 0$ if $j \in C_1(a)$, otherwise $\xi_j = \xi$.

C HOW SENSITIVE IS THE HYPER-PARAMETER IN $L_{LRC}$?

Eq. (28) aims to push apart two subspaces. We set the top 5% of instances in a training batch as $C_1(a)$ and the bottom 5% as $C_r(a)$ by default. The percentage of $C_1(a)$ controls the number of estimated positive samples in a minibatch given anchor $t_a$, and helps strike a balance between the benefits of including more true positive samples and the adverse effects of using false positive samples. Table 4 shows the WSI classification evaluations on the CAMELYON16 and TCGA-NSCLC datasets for ILRA-MIL models trained with different choices of $C_1(a)$, ranging from 1% to 10%. The best result is highlighted in bold and the second best is underlined. The results show that 5% achieves relatively optimal performance on both datasets. A larger percentage of 10% or a smaller percentage of 1% generally leads to worse performance. We find that the sensitivity to the hyperparameter is reduced in the range of 3% to 7%.

PANDA³ is the largest public prostate biopsy dataset to date (Bulten et al., 2022). We only use slides with pure and unequivocal patterns (from 0+0, 3+3, 4+4, or 5+5 slides) where the interobserver variability was normally low (Tolkach et al., 2020; Bulten et al., 2022; Ström et al., 2020).

H HEATMAPS

Figure 7 shows 4 diverse examples from the CAMELYON16 test set. In the "raw image" column, the tumor area is delineated by the blue line. In the "CLAM" and "CLAM+LRC" columns, brighter red indicates a higher attention score, i.e., a higher likelihood of tumor at the corresponding location. "CLAM" is the original CLAM-SB method (Lu et al., 2021c), whereas "CLAM+LRC" combines CLAM-SB with our proposed LRC feature embedding method.

Figure 7: Four examples of high-resolution heatmaps of the CAMELYON16 test set, namely test 016, test 073, test 117, and test 092 from the top row to the bottom row. We compare the heatmap of CLAM in the second column with our proposed LRC method in the third column. Our method is more consistent with the ground truth annotations, indicating superior performance.



https://camelyon16.grand-challenge.org/Data/ https://portal.gdc.cancer.gov/ https://www.kaggle.com/competitions/prostate-cancer-grade-assessment/data



Figure 2: ILRA-MIL iterates over k ILRA layers. Each layer consists of two GAB blocks. The first GAB block projects the input instances $X_i^1 \in \mathbb{R}^{m_i \times d}$ to the low-rank space by attending to the latent vectors $L \in \mathbb{R}^{r \times d}$, $r < m_i$, and the second GAB recovers the input dimension. ω represents softmax. The output layer uses non-local pooling to make predictions. Layer normalization is omitted for brevity.

Figure 3: Probability distribution with MoCo-V3 and LRC.

Figure 4: Heatmap visualization of "test 075" and "test 026" for CLAM with and without LRC.

Classification Results on Benchmarks. In CAMELYON16, the 270 training WSIs are split into approximately 90% training and 10% validation, and models are tested on the official test set. In PANDA, we split the 4219 slides from Karolinska into 80% training and 20% validation and test on the 2591 slides from Radboud. For TCGA datasets, we first ensure that slides from the same patient do not appear in both the training and test sets, and then split the data in the ratio training:validation:test = 60:15:25. For self-supervised learning, we use ResNet50 to encode 224 × 224 images into 1024-dimensional vectors. The same training data is used to develop the LRC feature embedding and the ILRA-MIL feature aggregator.
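The patient-level 60:15:25 split described for the TCGA datasets can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `patient_level_split`, the slide and patient identifiers, and the choice to apply the ratios to patients rather than to slides are all assumptions made for the example.

```python
import random
from collections import defaultdict


def patient_level_split(slide_to_patient, ratios=(0.60, 0.15, 0.25), seed=0):
    """Split slides into train/val/test so that no patient spans two subsets.

    Assumption: the ratios are applied to the number of patients, so the
    slide-level proportions are only approximately 60:15:25.
    """
    # Group slide identifiers by the patient they belong to.
    by_patient = defaultdict(list)
    for slide_id, patient_id in slide_to_patient.items():
        by_patient[patient_id].append(slide_id)

    # Shuffle patients (not slides) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient group back into its slides.
    return {
        name: sorted(s for p in members for s in by_patient[p])
        for name, members in groups.items()
    }


# Hypothetical toy data: 20 patients with two slides each.
slides = {f"slide_{p}_{k}": f"patient_{p}" for p in range(20) for k in range(2)}
split = patient_level_split(slides)
print({name: len(ids) for name, ids in split.items()})
```

Splitting on patient identifiers first and expanding to slides afterwards is what guarantees that two slides from one patient never land in different subsets.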

Ablation on Different Pretrained Models.

Parameter Analysis and Ablations on ILRA-MIL

, and the proposed LRC can further reduce to 218, 195, and 181, respectively. As Table 1 in the main paper shows, the classification AUC of the same ILRA-MIL model using ImageNet, BYOL, MoCo, and LRC features is 0.9278, 0.9330, 0.9490, and 0.9649, respectively, i.e.:

$$\begin{aligned} \text{avg. rank:} &\quad \text{ImageNet} > \text{BYOL} > \text{MoCo} > \text{LRC} \\ \text{classification AUC:} &\quad \text{ImageNet} < \text{BYOL} < \text{MoCo} < \text{LRC} \end{aligned} \tag{22}$$

Even without low-rank constraints, BYOL and MoCo tend to produce features with lower ranks than ImageNet. Also, Fig. 6(d) indicates that the distribution of the feature embeddings of all WSIs is more

Therefore, the cross-entropy loss to identify the positive is:

WSI classification evaluations for models trained with different choices of $C_1(a)$

, random horizontal flip, and random grayscale conversion. For all methods, we use an initial learning rate of 1.5e-4. We use AdamW as the optimizer and adopt a learning rate warmup for 20 epochs to alleviate instability. Each model is optimized on 16 Nvidia V100 GPUs with a cosine learning rate decay schedule and a mini-batch size of 4096. We train for 200 epochs for CAMELYON16, TCGA-NSCLC, and PANDA. The training takes about 4 days.

D.2 MIL TRAINING DETAILS

MIL models, including the baselines and ours, are trained on a single Nvidia V100 GPU. The training of ILRA-MIL takes less than 1 hour for all datasets. The models are optimized end-to-end with the Adam optimizer with a batch size of 1 and a learning rate of 1e-4 for 200 epochs. The Adam optimizer has parameters $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\epsilon$ = 1e-8.

E DATASET

Each WSI is cropped into a series of 224 × 224 non-overlapping patches using a binary mask of the tissue regions, which is computed by thresholding the saturation channel of the image.

CAMELYON16 1 is a public dataset for metastasis detection in breast cancer (2-level classification), including 270 training WSIs and 130 test WSIs. A total of about 1.5 million patches at ×10 magnification are obtained after preprocessing.

TCGA-NSCLC 2 includes two subtype projects (2-level classification), i.e., Lung Squamous Cell Carcinoma (TCGA-LUSC) and Lung Adenocarcinoma (TCGA-LUAD), for a total of 993 diagnostic WSIs, including 507 LUAD slides from 444 cases and 486 LUSC slides from 452 cases. We obtain 3.4 million patches in total at ×10 magnification.
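The patch extraction step described at the start of this section (non-overlapping tiles kept according to a saturation-thresholded tissue mask) can be sketched as below. This is only an illustration under stated assumptions: the saturation threshold `sat_thresh`, the minimum tissue fraction `min_tissue`, and the function names are hypothetical and not taken from the paper, and a real pipeline would operate on a downsampled WSI read through a slide library rather than on nested Python lists.

```python
import colorsys


def saturation(rgb):
    """Return the HSV saturation (0..1) of an 8-bit (R, G, B) pixel."""
    r, g, b = (channel / 255.0 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)[1]


def tissue_patch_coords(image, patch=224, sat_thresh=0.1, min_tissue=0.5):
    """List the top-left (row, col) of non-overlapping patches that contain tissue.

    `image` is a list of rows of (R, G, B) tuples. `sat_thresh` and
    `min_tissue` are hypothetical values chosen for this sketch.
    """
    height, width = len(image), len(image[0])
    # Binary tissue mask: background in stained slides is close to white,
    # i.e. low saturation, while stained tissue is more saturated.
    mask = [[saturation(px) > sat_thresh for px in row] for row in image]

    coords = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tissue = sum(
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            )
            if tissue / (patch * patch) >= min_tissue:
                coords.append((top, left))
    return coords


# Toy 4x4 image: left half white background, right half pink "tissue".
white, pink = (255, 255, 255), (200, 100, 150)
toy = [[white, white, pink, pink] for _ in range(4)]
kept = tissue_patch_coords(toy, patch=2)
print(kept)  # [(0, 2), (2, 2)]
```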

We evaluate the inference runtime and MACs (multiply-accumulate operations) of the proposed model. We use the CAMELYON16 test set, which contains 130 WSIs with an average bag size of 1600 at ×10 magnification. The evaluation covers data preprocessing (segmentation and patching), feature embedding using the LRC pre-trained ResNet50, and slide-level prediction with ILRA-MIL. The average inference runtime for each WSI is reported in Table 5. Data preprocessing accounts for most of the time cost, and this module is not accelerated by GPU. The feature embedding module converts 224 × 224 images into 1024-dimensional vectors, and its MACs are relatively high. The feature aggregation module, which operates on embedding vectors, is efficient and takes only about 4.4 ms.

Average Runtime Per Slide on CAMELYON16 Using a P40 GPU.

The following Table 6 and Table 7 are extensions of Table 1 in the main paper with the standard deviation. Each experiment is conducted for 5 runs with different random seeds on the same data splits.

Results on Benchmarks CAMELYON16 and TCGA-NSCLC.

± 0.0184   0.9278 ± 0.0121   0.9004 ± 0.0218   0.9592 ± 0.0176
ILRA-MIL + LRC   0.9218 ± 0.0113   0.9649 ± 0.0096   0.9213 ± 0.0173   0.9763 ± 0.0149

Results on PANDA test set.


compact than ImageNet. This empirical evidence suggests that low-rank features are very likely to be beneficial to WSI representation.

The positive pair is correctly classified if:

$$\mathrm{sim}(t_a, t_p) \geq \mathrm{sim}(t_a, t_n)$$

We incorporate the margin constraint into the classification boundary so that the positive pair is correctly classified only if:

$$\mathrm{sim}(t_a, t_p) \geq \mathrm{sim}(t_a, t_n) + \xi.$$

The sigmoid in (23) is modified accordingly:

$$\frac{\exp\left(\mathrm{sim}(t_a, t_p)\right)}{\exp\left(\mathrm{sim}(t_a, t_p)\right) + \exp\left(\mathrm{sim}(t_a, t_n) + \xi\right)} \tag{26}$$
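The effect of the margin $\xi$ in (26) can be checked numerically with a short sketch. The function names and the example similarity values are hypothetical, and no temperature scaling is applied because none appears in (26). The expression in (26) is algebraically the sigmoid of $\mathrm{sim}(t_a, t_p) - \mathrm{sim}(t_a, t_n) - \xi$, which is how it is evaluated here.

```python
import math


def margin_positive_probability(sim_pos, sim_neg, margin=0.0):
    """Evaluate exp(s_p) / (exp(s_p) + exp(s_n + margin)).

    This equals sigmoid(s_p - s_n - margin); the two branches avoid
    overflow in exp() for large positive or negative arguments.
    """
    z = sim_pos - sim_neg - margin
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def margin_cross_entropy(sim_pos, sim_neg, margin=0.0):
    """Cross-entropy loss for identifying the positive under the margin."""
    return -math.log(margin_positive_probability(sim_pos, sim_neg, margin))


# Hypothetical similarities: the positive is ahead of the negative by 0.3.
p_plain = margin_positive_probability(0.8, 0.5)
p_margin = margin_positive_probability(0.8, 0.5, margin=0.3)
print(p_plain, p_margin)
```

With a margin equal to the similarity gap, the probability sits at the decision boundary of 0.5, so the pair is no longer counted as comfortably correct and continues to contribute a larger loss than in the margin-free case.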

