EFFECTIVE CROSS-INSTANCE POSITIVE RELATIONS FOR GENERALIZED CATEGORY DISCOVERY

Abstract

We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data contain instances from novel categories and also the labelled classes. In this paper, we address the GCD problem without a known category number in the unlabelled data. We propose a framework, named CiP, to bootstrap the representation by exploiting Cross-instance Positive relations for contrastive learning in the partially labelled data which are neglected in existing methods. First, to obtain reliable cross-instance relations to facilitate the representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components in the graph constructed by selective neighbors. We also extend SNC to be capable of label assignment for the unlabelled instances with the given class number. Moreover, we present a method to estimate the unknown class number using SNC with a joint reference score considering clustering indexes of both labelled and unlabelled data. Finally, we thoroughly evaluate our CiP framework on public generic image recognition datasets (CIFAR-10, CIFAR-100, and ImageNet-100) and challenging fine-grained datasets (CUB, Stanford Cars, and Herbarium19), all establishing the new state-of-the-art.

1. INTRODUCTION

After training on large-scale datasets with human annotations, existing machine learning models can achieve superb performance (e.g., (Krizhevsky et al., 2012) ). However, the success of these models heavily relies on the fact that they are only tasked to recognize images from the same set of classes with large-scale human annotations on which they are trained. This limits their application in the real open world where we will encounter data without annotations and from unseen categories. Indeed, more and more efforts have been devoted to dealing with more realistic settings. For example, semi-supervised learning (SSL) (Chapelle et al., 2006) aims at training a robust model using both labelled and unlabelled data from the same set of classes; few-shot learning (Snell et al., 2017) tries to learn models that can generalize to new classes with few annotated samples; open-set recognition (OSR) (Scheirer et al., 2012) learns to tell whether or not an unlabelled image belongs to one of the classes on which the model is trained. More recently, the problem of novel category discovery (NCD) (Han et al., 2019; 2020; Fini et al., 2021) has been introduced, which learns models to automatically partition unlabelled data from unseen categories by transferring knowledge from seen categories. One assumption in early NCD methods is that unlabelled images are all from unseen categories only. NCD has been recently extended to a more generalized setting, called generalized category discovery (GCD) (Vaze et al., 2022b) , by relaxing the assumption to reflect the real world better, i.e., unlabelled images are from both seen and unseen categories. In this paper, we tackle the problem of GCD by drawing inspiration from the baseline method (Vaze et al., 2022b) . 
In (Vaze et al., 2022b) , a vision transformer model was first trained for representation learning using supervised contrastive learning on labelled data and self-supervised contrastive learning on both labelled and unlabelled data. With the learned representation, semi-supervised k-means (Han et al., 2019) was then adopted for label assignment across all instances. In addition, based on semi-supervised k-means, (Vaze et al., 2022b) also introduced an algorithm to estimate the unknown category number for the unlabelled data by examining possible category numbers in a given range. However, this approach has several limitations. First, during representation learning, the method considers labelled and unlabelled data independently, and uses a stronger training signal for the labelled data which might compromise the representation of the unlabelled data. Second, the method requires a known category number for performing label assignment. Third, the category number estimation method is slow as it needs to run the clustering algorithm multiple times to test different category numbers. To overcome the above limitations, we propose a new approach for GCD which does not require a known unseen category number and considers Cross-instance Positive relations in unlabelled data for better representation learning (CiP) . At the core of our approach is our novel semi-supervised hierarchical clustering algorithm with selective neighbor, named as selective neighbor clustering (SNC), that takes inspiration from the parameter-free hierarchical clustering method FINCH (Sarfraz et al., 2019) . SNC can not only generate reliable pseudo labels for cross-instance positive relations, but also estimate unseen category numbers without the need for repeated runs of the clustering algorithm. SNC builds a graph indicating all subtly selected neighbor relations constrained by the labelled instances, and produces clusters directly from the connected components in the graph. 
SNC iteratively constructs a hierarchy of partitions with different granularity, while satisfying the constraints imposed by the labelled instances. With a one-by-one merging strategy, SNC can quickly estimate a reliable class number without repeated runs of the algorithm, which makes it significantly faster than (Vaze et al., 2022b) . The main contributions of this paper can be summarized as follows: (1) we propose a new GCD framework, named CiP, exploiting more cross-instance positive relations in the partially labelled set to strengthen the connections among all instances, fostering the representation learning for better category discovery; (2) we introduce a semi-supervised hierarchical clustering algorithm, named SNC, that can be adopted for reliable pseudo label generation during training and label assignment during testing; (3) we further leverage SNC for class number estimation by exploring intrinsic and extrinsic clustering quality based on a joint reference score considering both labelled and unlabelled data; (4) we comprehensively evaluate our CiP framework on both generic image recognition datasets and challenging fine-grained datasets, and demonstrate state-of-the-art performance across the board.

2. RELATED WORK

Our work is related to novel/generalized category discovery, semi-supervised learning, and open-set recognition.

Novel category discovery (NCD) aims at discovering new classes in unlabelled data by leveraging knowledge learned from labelled data. It was pioneered by (Han et al., 2019) with a transfer clustering approach. Some earlier works on cross-domain/task transfer learning (Hsu et al., 2018a; b) can also be adopted to tackle this problem. (Han et al., 2020) proposed an efficient method called AutoNovel (aka RankStats) using ranking statistics. They first learned a good embedding using low-level self-supervised learning on all data, followed by supervised learning on labelled data for higher-level features. They then introduced robust ranking statistics to determine whether two unlabelled instances are from the same class for NCD. Several successive works based on RankStats were proposed. For example, (Jia et al., 2021) proposed to use WTA hashing (Yagnik et al., 2011) for NCD in single- and multi-modal data; (Zhao & Han, 2021) extended NCD with dual ranking statistics and knowledge distillation. (Fini et al., 2021) proposed UNO, which uses a unified cross-entropy loss to train on labelled and unlabelled data. (Chi et al., 2022) proposed meta discovery, which links NCD to meta learning with limited labelled data. (Vaze et al., 2022b) introduced generalized category discovery (GCD), which extends NCD by allowing unlabelled data from both old and new classes. They first finetuned a pretrained DINO ViT (Caron et al., 2021) with both supervised contrastive loss and self-supervised contrastive loss. Semi-supervised k-means was then adopted for label assignment. A concurrent work called ORCA by (Cao et al., 2022) addressed a similar problem by formulating it as open-world semi-supervised learning. We draw inspiration from (Vaze et al., 2022b) and develop a novel method to tackle GCD by exploring cross-instance correlations on labelled and unlabelled data, which have been neglected in (Vaze et al., 2022b).

Semi-supervised learning (SSL) has long been studied in the machine learning community (Chapelle et al., 2006). It aims at learning a good model by leveraging unlabelled data from the same set of classes as the labelled data. Various methods have been proposed for SSL. For example, Π-model (Laine & Aila, 2017) uses self-ensembling to leverage label predictions on different epochs and under different conditions; Mean Teacher (Tarvainen & Valpola, 2017) averages model weights instead of label predictions; FixMatch (Sohn et al., 2020) and FlexMatch (Zhang et al., 2021) employ pseudo-labels generated from model predictions to guide the training. The assumption that labelled and unlabelled data are from the same closed set of classes is often not valid in practice. In contrast, GCD relaxes this assumption and considers a more challenging scenario where unlabelled data can also come from unseen classes.

Open-set recognition (OSR) aims at training a model using data from a known closed set of classes, and at test time determining whether or not a sample is from one of these known classes. It was first introduced in (Scheirer et al., 2012). Since then many methods have been proposed for this task. For example, OpenMax (Bendale & Boult, 2016) is the first deep learning work to address the OSR problem, based on Extreme Value Theory and fitting per-class Weibull distributions. RPL (Chen et al., 2020a) and its extension ARPL (Chen et al., 2021) exploit reciprocal points for constructing extra-class space to reduce the risk of the unknown. Recently, (Vaze et al., 2022a) found a correlation between closed-set and open-set performance, and boosted the performance of OSR by improving closed-set accuracy. They also proposed the Semantic Shift Benchmark (SSB) with a clear definition of semantic novelty for better OSR evaluation.

3.1. PROBLEM FORMULATION

Generalized category discovery (GCD) aims at automatically categorizing unlabelled images in a collection of data in which part of the data is labelled and the rest is unlabelled. The unlabelled images may come from the labelled classes or new ones. This is a much more realistic open-world setting than the common closed-set classification where the labelled and unlabelled data are from the same set of classes. Let the data collection be $\mathcal{D} = \mathcal{D}_L \cup \mathcal{D}_U$, where $\mathcal{D}_L = \{(x_i^\ell, y_i^\ell)\}_{i=1}^{M} \in \mathcal{X} \times \mathcal{Y}_L$ denotes the labelled subset and $\mathcal{D}_U = \{(x_i^u, y_i^u)\}_{i=1}^{N} \in \mathcal{X} \times \mathcal{Y}_U$ denotes the unlabelled subset with unknown $y_i^u \in \mathcal{Y}_U$. Only a subset of classes contains labelled instances, i.e., $\mathcal{Y}_L \subset \mathcal{Y}_U$. The number of labelled classes $N_L$ can be directly deduced from the labelled data, while the number of unlabelled classes $N_U$ is not known a priori.

To tackle this challenge, we propose a novel framework CiP to jointly learn representations using contrastive learning by considering all possible interactions between labelled and unlabelled instances. Contrastive learning has been applied to learn representations in GCD, but without considering the connections between labelled and unlabelled instances (Vaze et al., 2022b) due to the lack of reliable pseudo labels. This limits the learned representation. In this paper, we propose an efficient semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), to generate reliable pseudo labels that bridge labelled and unlabelled instances during training and bootstrap representation learning. With the generated pseudo labels, we can then train the model on both labelled and unlabelled data in a supervised manner considering all possible pairwise connections. We further extend SNC with a simple one-by-one merging process to allow cluster number estimation and label assignment on all unlabelled instances. An overview of our CiP is shown in Fig. 1.

3.2. JOINT CONTRASTIVE REPRESENTATION LEARNING

Contrastive learning has been widely used for self-supervised representation learning (Chen et al., 2020b; He et al., 2020) and supervised representation learning (Khosla et al., 2020). For GCD, since the data contains both labelled and unlabelled instances, a mix of self-supervised and supervised contrastive learning appears to be a natural fit, and good performance has been reported in (Vaze et al., 2022b). However, cross-instance correlations are only considered for pairs of labelled instances, but not for pairs of unlabelled instances or pairs of labelled and unlabelled instances. The learned representation is likely to be biased towards the labelled data due to the stronger learning signal they provide. Meanwhile, the embedding spaces learned from cross-instance correlations of labelled data and self correlations of unlabelled data are not necessarily well aligned. These might explain why a much stronger performance on labelled data was reported in (Vaze et al., 2022b) compared with the unlabelled data. To mitigate such a bias, we propose to introduce cross-instance relations for pairs of unlabelled instances and pairs of labelled and unlabelled instances in contrastive learning to bootstrap the representation learning. To this end, we propose an efficient semi-supervised hierarchical clustering algorithm to generate reliable pseudo labels relating pairs of unlabelled instances and pairs of labelled and unlabelled instances, as will be detailed in Sec. 3.3. Next, we briefly review supervised contrastive learning (Khosla et al., 2020), which accommodates cross-instance relations, and describe how to extend it to unlabelled data.

Figure 1: [...] (Caron et al., 2021) to obtain a good representation space. We then finetune ViT by conducting joint contrastive learning with both true and pseudo positive relations in a supervised manner. True positive relations come from labelled data while pseudo positive relations of all data are generated by our proposed SNC algorithm. Specifically, SNC generates a hierarchical clustering structure. Pseudo positive relations are granted to all instances in the same cluster at one level of partition, and are further exploited in joint contrastive learning. With representations well learned, we estimate the class number and assign labels to all unlabelled data using SNC with a one-by-one merging strategy.

Let $f$ and $\phi$ be a feature extractor and an MLP projection head. The supervised contrastive loss on labelled data can be formulated as

$$\mathcal{L}^s_i = -\frac{1}{|\mathcal{G}_B(i)|} \sum_{q \in \mathcal{G}_B(i)} \log \frac{\exp(z_i^\ell \cdot z_q^\ell / \tau_s)}{\sum_{n \in B_L,\, n \neq i} \exp(z_i^\ell \cdot z_n^\ell / \tau_s)} \tag{1}$$

where $z^\ell = \phi(f(x^\ell))$, $\tau_s$ is the temperature, and $\mathcal{G}_B(i)$ denotes the other instances sharing the same label with the $i$-th labelled instance in $B_L$, which is the labelled subset of the mini-batch $B$. The supervised contrastive loss leverages the true cross-instance positive relations between labelled instance pairs. To take into account the cross-instance positive relations for pairs of unlabelled instances and pairs of labelled and unlabelled instances, we extend the supervised contrastive loss to all data as

$$\mathcal{L}^a_i = -\frac{1}{|\mathcal{P}_B(i)|} \sum_{q \in \mathcal{P}_B(i)} \log \frac{\exp(z_i \cdot z_q / \tau_a)}{\sum_{n \in B,\, n \neq i} \exp(z_i \cdot z_n / \tau_a)} \tag{2}$$

where $\tau_a$ is the temperature and $\mathcal{P}_B(i)$ is the set of pseudo positive instances for the $i$-th instance in the mini-batch $B$. The overall loss considering cross-instance relations for pairs of labelled instances, pairs of unlabelled instances, and pairs of labelled and unlabelled instances can then be written as

$$\mathcal{L} = \sum_{i \in B} \mathcal{L}^a_i + \sum_{i \in B_L} \mathcal{L}^s_i \tag{3}$$

With the learned representation, we can discover classes with existing algorithms like semi-supervised k-means (Han et al., 2019; Vaze et al., 2022b). We further propose a new method in Sec. 3.4, based on our pseudo label generation approach, which is introduced next.
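To make Eqs. (1)-(3) concrete, the following is a minimal pure-Python sketch of the joint loss on a toy mini-batch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`positives_by_label`, `contrastive_sum`, `cip_loss`) are hypothetical, embeddings are assumed to be already ℓ2-normalized, unlabelled instances are marked with `None`, and anchors without any positive are assumed to contribute zero.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def positives_by_label(indices, labels):
    """For each anchor i, collect the other indices that share its label."""
    return {i: [q for q in indices if q != i and labels[q] == labels[i]]
            for i in indices}


def contrastive_sum(z, positives, candidates, tau):
    """Sum over anchors of the per-anchor loss of Eq. (1) / Eq. (2)."""
    total = 0.0
    for i, pos in positives.items():
        if not pos:  # assumption: anchors without positives contribute nothing
            continue
        # Denominator runs over all candidates except the anchor itself.
        log_denom = math.log(sum(math.exp(dot(z[i], z[n]) / tau)
                                 for n in candidates if n != i))
        total -= sum(dot(z[i], z[q]) / tau - log_denom for q in pos) / len(pos)
    return total


def cip_loss(z, true_labels, pseudo_labels, tau_s=0.07, tau_a=0.1):
    """Eq. (3): pseudo-positive term on the whole batch plus the labelled term."""
    batch = list(range(len(z)))
    labelled = [i for i in batch if true_labels[i] is not None]
    loss_s = contrastive_sum(z, positives_by_label(labelled, true_labels),
                             labelled, tau_s)
    loss_a = contrastive_sum(z, positives_by_label(batch, pseudo_labels),
                             batch, tau_a)
    return loss_a + loss_s


# Toy batch: two tight groups of unit vectors; only the first two are labelled.
z = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
true_labels = [0, 0, None, None]
good = cip_loss(z, true_labels, pseudo_labels=[0, 0, 1, 1])
bad = cip_loss(z, true_labels, pseudo_labels=[0, 1, 0, 1])
print(good < bad)
```

In this toy batch, pseudo labels that agree with the geometry of the embeddings give a much smaller loss than pseudo labels that pair instances across the two groups, which is the sense in which false positive pairs hurt training.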

3.3. SELECTIVE NEIGHBOR CLUSTERING

To generate pseudo labels for Eq. (2), an intuitive approach would be to apply an off-the-shelf clustering method like k-means or semi-supervised k-means to construct clusters and then obtain cross-instance relations based on the resulting cluster assignment. However, we empirically found that such a simple approach produces many false positive pairs, which severely hurt the representation learning. One way to tackle this problem is to overcluster the data to lower the false positive rate. FINCH (Sarfraz et al., 2019) has shown superior performance on unsupervised hierarchical overclustering, but it is non-trivial to extend it to cover both labelled and unlabelled data. Experiments show that FINCH fails drastically if we simply include all the labelled data. Inspired by FINCH, we propose an efficient semi-supervised hierarchical clustering algorithm, named SNC, with selective neighbor, which subtly makes use of the labelled instances during clustering.

FINCH constructs an adjacency matrix $A$ for all possible pairs of instances $(i, j)$, given by

$$A(i, j) = \begin{cases} 1 & \text{if } j = \kappa_i \text{ or } \kappa_j = i \text{ or } \kappa_i = \kappa_j \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\kappa_i$ is the first neighbor of the $i$-th instance and is defined as $\kappa_i = \arg\max_j \{f(x_i) \cdot f(x_j) \mid x_j \in \mathcal{D}\}$, where $f(\cdot)$ outputs an $\ell_2$-normalized feature vector. A data partition can then be obtained by extracting connected components from $A$. Each connected component in $A$ corresponds to one cluster. By treating each cluster as a super instance and building the first neighbor adjacency matrix iteratively, the algorithm can produce hierarchical partitions.

Algorithm 1 Selective Neighbor Clustering (SNC)
 1: Preparation:
 2: Given labelled set $\mathcal{D}_L$ and unlabelled set $\mathcal{D}_U$, treat each instance in $\mathcal{D}_L \cup \mathcal{D}_U$ as a cluster $c_i^0$ with the cluster centroid $\mu(c_i^0)$ being the instance itself, forming the first partition $\Gamma^0 = \Gamma_L^0 \cup \Gamma_U^0$, where $\Gamma^0 = \{c_i^0\}_{i=1}^{|\Gamma_L^0| + |\Gamma_U^0|}$.
 3: Main loop:
 4: $p \leftarrow 0$
 5: while there are more than $N_L$ clusters in $\Gamma^p$ do
 6:   Initialize $\Gamma_L^\star = \Gamma_L^p$.
 7:   while there exists $\kappa_i$ of $c_i^p \in \Gamma_L^p \cup \Gamma_U^p$ not specified do
 8:     if $c_i^p \in \Gamma_L^p$ then
 9:       Initialize $Q = \{c_i^p\}$, $\Gamma_L^\star = \Gamma_L^\star \setminus \{c_i^p\}$.
10:       while $|Q| < \lambda$ do
11:         $\kappa_i \leftarrow \arg\max_j \{\mu(c_i^p) \cdot \mu(c_j^p) \mid c_j^p \in \Gamma_L^\star,\ y_j^p = y_i^p\}$
12:         $\Gamma_L^\star \leftarrow \Gamma_L^\star \setminus \{c_{\kappa_i}^p\}$
13:         $Q \leftarrow Q \cup \{c_{\kappa_i}^p\}$
14:         $c_i^p \leftarrow c_{\kappa_i}^p$
15:       end while
16:     else
17:       $\kappa_i \leftarrow \arg\max_j \{\mu(c_i^p) \cdot \mu(c_j^p) \mid c_j^p \in \Gamma_L^p \cup \Gamma_U^p\}$
18:     end if
19:   end while
20:   Construct $A$ following Eq. (4) with selective neighbors, forming a new partition $\Gamma^{p+1} = \Gamma_L^{p+1} \cup \Gamma_U^{p+1}$.
21:   $p \leftarrow p + 1$
22: end while

First neighbor is designed for purely unlabelled data. To make use of the labels in partially labelled data, a straightforward idea is to connect all labelled data from the same class by setting $A(i, j)$ to 1 for all pairs of instances $(i, j)$ from the same class. However, after filling $A(i, j)$ for pairs of unlabelled instances using Eq. (4), very often all instances become connected into a single cluster, making it impossible to properly partition the data. This problem is caused by having too many links among the labelled instances. To solve this problem, we would like to reduce the links between labelled instances while keeping labelled instances from the same class in the same connected component. A simple idea is to connect instances with the same label one by one to form a chain, which can significantly reduce the number of links. However, we found this still produces many incorrect links, resulting in low purity of the connected components. To this end, we introduce our selective neighbor to improve the purity of clusters while properly incorporating the labelled instances. The key ideas are as follows. First, we limit the chain length to at most $\lambda$. Second, each labelled instance in a chain can only be the selective neighbor of another labelled instance once.
Third, the selective neighbor of an unlabelled instance can be a labelled or an unlabelled instance, depending on its actual distances to other instances. Similar to FINCH, we can apply selective neighbor iteratively to produce hierarchical clustering results. We name our method SNC, which is summarized in Algo. 1 (lines 7-19 correspond to selective neighbor computation). For the chain length $\lambda$, we simply set it to the smallest integer greater than or equal to the square root of the number of labelled instances $n_\ell$ in each class, i.e., $\lambda = \lceil \sqrt{n_\ell} \rceil$. This is applied to all classes with labelled instances, and at each hierarchy level. A proper chain length can therefore be dynamically determined based on the actual size of the labelled cluster and also the hierarchy level. We analyze different formulations of chain length in Appx. A.1.

SNC produces a hierarchy of data partitions with different granularity. Except for the bottom level, where each individual instance is treated as one cluster, every level can be used to capture cross-instance relations for the level below, because each instance in the current level represents a cluster of instances in the level below. In principle, we can pick any non-bottom level to generate pseudo labels. To have a higher purity for each cluster, it is beneficial to choose a relatively low level which overclusters the data. Hence, we choose a level that has a cluster number notably larger than the labelled class number (e.g., 2× more). Meanwhile, the level should not be too low, as this provides far fewer useful pair-wise correlations. In our experiments, we simply pick the third level from the bottom of the hierarchy, which consistently shows good performance on all datasets. We discuss the impact of the picked level in Appx. A.1.
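As a rough illustration of a single SNC level, the sketch below computes selective neighbors on raw instances and extracts the connected components implied by Eq. (4). It is a simplified reading of Algo. 1 rather than a faithful implementation, and it makes several assumptions: the function names are hypothetical, unlabelled instances carry the label `None`, a chain stops when it reaches length λ or when no same-class labelled instance remains, and the last instance of a chain gets no outgoing link. Linking each instance to its neighbor with union-find yields the same components as Eq. (4), because the condition κ_i = κ_j only joins instances that are already connected through their common neighbor.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def selective_neighbors(feats, labels):
    """Return kappa, where kappa[i] is the selective neighbor of i (or None)."""
    n = len(feats)
    kappa = [None] * n
    # Unlabelled instances: plain first neighbor over all other instances.
    for i in range(n):
        if labels[i] is None:
            kappa[i] = max((j for j in range(n) if j != i),
                           key=lambda j: dot(feats[i], feats[j]))
    # Labelled instances: same-class chains of length at most lambda.
    for y in sorted({c for c in labels if c is not None}):
        pool = [i for i in range(n) if labels[i] == y]
        lam = math.ceil(math.sqrt(len(pool)))  # lambda = ceil(sqrt(n_l))
        while pool:
            cur, length = pool.pop(0), 1
            while length < lam and pool:
                nxt = max(pool, key=lambda j: dot(feats[cur], feats[j]))
                pool.remove(nxt)  # each instance is a selective neighbor once
                kappa[cur] = nxt
                cur, length = nxt, length + 1
            # Assumption: the chain tail keeps kappa = None (no outgoing link).
    return kappa


def connected_components(kappa):
    """Cluster ids from the links i -- kappa[i], via union-find."""
    parent = list(range(len(kappa)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, k in enumerate(kappa):
        if k is not None:
            parent[find(i)] = find(k)
    roots = [find(i) for i in range(len(kappa))]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def unit(deg):
    """Unit vector at the given angle, standing in for a normalized feature."""
    return (math.cos(math.radians(deg)), math.sin(math.radians(deg)))


feats = [unit(a) for a in (0, 4, 9, 15, 90, 94, 99, 105)]
labels = [0, 0, None, None, 1, 1, None, None]
kappa = selective_neighbors(feats, labels)
clusters = connected_components(kappa)
print(kappa, clusters)
```

Repeating the same two steps on cluster centroids instead of raw instances would give the next level of the hierarchy; that loop is omitted here to keep the sketch short.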

3.4. LABEL ASSIGNMENT WITH AN UNKNOWN CLASS NUMBER

Once a good representation is learned, we can determine the class label assignment for all unlabelled instances. When the class number is known, we can obtain the label assignment by adopting semi-supervised k-means like (Han et al., 2019; Vaze et al., 2022b) or by directly using our proposed SNC. Since SNC is a hierarchical clustering algorithm and the cluster number in each hierarchy level is determined automatically by the intrinsic correlations of the instances, it might not produce a level of partition with exactly the same cluster number as the known class number. We therefore introduce a simple one-by-one merging strategy to SNC, allowing it to reach a given class number. Specifically, we first identify the level of partition that has the closest cluster number larger than the given class number, and then merge the clusters one by one until the given class number is reached. At each merging step, we simply merge the two closest clusters. The merging process is summarized in Algo. 2. The label assignment can then be retrieved from the final partition.

Algorithm 2 One-by-one merging
 [...]
 7: $(i, j) \leftarrow \arg\min_{i,j} \{\mu(c_i^t) \cdot \mu(c_j^t) \mid c_i^t, c_j^t \in \Gamma^t\}$
 8: Merge $c_i^t$ and $c_j^t$, forming a new partition $\Gamma^\star$.
 9: Update current partition $\Gamma^t \leftarrow \Gamma^\star$.
10: end while
11: Output:
12: Obtain a specific partition $\Gamma^t$ of $N_e$ clusters, i.e., $|\Gamma^t| = N_e$.

When the class number is unknown, existing methods based on semi-supervised k-means need to first estimate the unknown cluster number before they can produce the label assignment. To estimate the unknown cluster number, (Han et al., 2019) proposed to run semi-supervised k-means on all the data while dropping part of the labels for clustering performance validation. Though effective, this algorithm is computationally expensive as it needs to run semi-supervised k-means on all possible cluster numbers. (Vaze et al., 2022b) proposed an improved method with Brent's optimization (Brent, 1971), which increases the efficiency. With the estimated cluster number, semi-supervised k-means is run again on all labelled and unlabelled instances to produce the final label assignment.

In contrast, SNC can directly produce hierarchical cluster assignments without a known class number. For practical use, one can pick any level of assignments based on the required granularity. To obtain a more reliable class number estimation, we propose to use a joint reference score considering both labelled and unlabelled data. In particular, we further split the labelled data $\mathcal{D}_L$ into two parts, $\mathcal{D}_L^l$ and $\mathcal{D}_L^v$. We then run SNC on the full dataset $\mathcal{D}$, treating $\mathcal{D}_L^l$ as labelled and $\mathcal{D}_U \cup \mathcal{D}_L^v$ as unlabelled. We jointly measure the unsupervised intrinsic clustering index (such as the silhouette score (Rousseeuw, 1987)) on $\mathcal{D}_U$ and the extrinsic clustering accuracy on $\mathcal{D}_L^v$. We obtain a joint reference score $s_c$ by simply multiplying them after min-max scaling, to achieve the best overall measurement on the labelled and unlabelled subsets. We then choose the level in the SNC hierarchy with the maximum $s_c$. The cluster number in the chosen level can be regarded as the estimated class number. To achieve a more accurate class number estimation, we further leverage the one-by-one merging strategy. Namely, with the chosen level, we apply the one-by-one merging strategy starting from the level below the chosen one up to the level above the chosen one. We then identify the merge that gives the best reference score $s_c$ and consider its cluster number as our estimated class number. Our proposed SNC with the one-by-one merging strategy can carry out class number estimation with a single run of hierarchical clustering, which is significantly more efficient than the methods based on semi-supervised k-means (Han et al., 2019; Vaze et al., 2022b).
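The two ingredients of this section, one-by-one merging and the joint reference score, can be sketched in a few lines of pure Python. This is a hedged illustration, not the authors' code: the function names are hypothetical, cluster centroids are assumed to be the unnormalized mean of member features, "the two closest clusters" is assumed to mean the pair with the largest centroid dot product, and a constant score list is assumed to scale to all ones. The intrinsic and extrinsic values in the usage example are made-up numbers, not results from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centroid(members, feats):
    """Mean feature of a cluster given the indices of its members."""
    dim = len(feats[0])
    return [sum(feats[m][d] for m in members) / len(members) for d in range(dim)]


def merge_one_by_one(clusters, feats, target_k):
    """Repeatedly merge the most similar pair until target_k clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target_k:
        cents = [centroid(c, feats) for c in clusters]
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Assumption: "closest" = largest dot product between centroids.
        i, j = max(pairs, key=lambda p: dot(cents[p[0]], cents[p[1]]))
        clusters[i] += clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all ones (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def best_candidate(intrinsic, extrinsic):
    """Joint reference score s_c per candidate and the index of its maximum."""
    scores = [a * b for a, b in zip(min_max(intrinsic), min_max(extrinsic))]
    return max(range(len(scores)), key=scores.__getitem__), scores


def unit(deg):
    return (math.cos(math.radians(deg)), math.sin(math.radians(deg)))


# Merging: four singleton clusters collapse into two angular groups.
feats = [unit(a) for a in (0, 4, 90, 96)]
merged = merge_one_by_one([[0], [1], [2], [3]], feats, target_k=2)

# Joint score over three hypothetical candidate partitions.
best, scores = best_candidate(intrinsic=[0.2, 0.5, 0.4],
                              extrinsic=[0.6, 0.9, 0.95])
print(merged, best)
```

Because each merge changes the partition by exactly one cluster, scoring every intermediate partition between two hierarchy levels gives a fine-grained sweep over candidate class numbers without rerunning the clustering.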

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

Data and evaluation metric. We evaluate our method on three generic image classification datasets, namely CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), and ImageNet-100 (Deng et al., 2009). ImageNet-100 is obtained by randomly subsampling 100 classes from the ImageNet dataset. We further evaluate on more challenging fine-grained image classification datasets, namely the Semantic Shift Benchmark (Vaze et al., 2022a) (SSB includes CUB-200 (Wah et al., 2011) and Stanford Cars (Krause et al., 2013)) and the long-tailed Herbarium19 (Tan et al., 2019). We follow (Vaze et al., 2022b) to split the original training set of each dataset into labelled and unlabelled parts. We sample a subset of half the classes as seen categories. 50% of the instances of each labelled class are drawn to form the labelled set, and all remaining data constitute the unlabelled set. The model takes all images as input and predicts a label assignment for each unlabelled instance. For evaluation, we measure the clustering accuracy by comparing the predicted label assignment with the ground truth, following the protocol of (Vaze et al., 2022b).

Implementation details. We follow (Vaze et al., 2022b) to use the ViT-B-16 initialized with pretrained DINO (Caron et al., 2021) as our backbone. The output [CLS] token is used as the feature representation. Following standard practice, we project the representations with a non-linear projection head and use the projected embeddings for contrastive learning. We set the dimension of the projected embeddings to 65,536 following (Caron et al., 2021). At training time, we feed two views with random augmentations to the model. We only fine-tune the last block of the vision transformer with an initial learning rate of 0.01, and the head is trained with an initial learning rate of 0.1. All methods are trained for 200 epochs with a cosine annealing schedule. For our method, the temperatures of the two supervised contrastive losses, $\tau_s$ and $\tau_a$, are set to 0.07 and 0.1 respectively. For class number estimation, we set $|\mathcal{D}_L^l| : |\mathcal{D}_L^v| = 8:2$. Our experiments are conducted on RTX 3090 GPUs.
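The labelled/unlabelled split described above can be sketched as follows. This is only an illustration of the stated proportions (half of the classes seen, 50% of each seen class labelled); the function name `gcd_split`, the seed, and the choice of drawing seen classes at random are assumptions, and the actual class selection in the protocol of (Vaze et al., 2022b) may differ.

```python
import random


def gcd_split(labels, seen_fraction=0.5, labelled_fraction=0.5, seed=0):
    """Split instance indices into a labelled set and an unlabelled set.

    labels: ground-truth class of every training instance.
    Returns (seen_classes, labelled_indices, unlabelled_indices).
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    # Assumption: seen classes are drawn uniformly at random.
    seen = set(rng.sample(classes, int(len(classes) * seen_fraction)))
    labelled, unlabelled = [], []
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        if c in seen:
            rng.shuffle(idx)
            cut = int(len(idx) * labelled_fraction)
            labelled += idx[:cut]      # keeps its label during training
            unlabelled += idx[cut:]    # seen class, but label is hidden
        else:
            unlabelled += idx          # unseen class: never labelled
    return seen, sorted(labelled), sorted(unlabelled)


# Toy dataset: 10 classes with 20 instances each.
labels = [c for c in range(10) for _ in range(20)]
seen, labelled, unlabelled = gcd_split(labels)
print(len(seen), len(labelled), len(unlabelled))
```

With these proportions the unlabelled set mixes instances of seen and unseen classes, which is what distinguishes GCD from the NCD setting.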

4.2. COMPARISON WITH THE STATE-OF-THE-ART

We compare our CiP with four strong GCD baselines: RankStats+ and UNO+, which are adapted from RankStats (Han et al., 2021) and UNO (Fini et al., 2021), both originally developed for NCD; the state-of-the-art GCD method of (Vaze et al., 2022b); and ORCA (Cao et al., 2022), which addresses GCD from a semi-supervised learning perspective. As ORCA uses a different backbone model and different data splits, for a fair comparison we retrain ORCA with the ViT model using the official code on the same splits. In Tab. 1, we compare CiP with the other methods on the generic image recognition datasets. CiP consistently outperforms all others by a significant margin. For example, CiP outperforms the state-of-the-art GCD method of (Vaze et al., 2022b) by 6.2% on CIFAR-10, 10.7% on CIFAR-100, and 6.4% on ImageNet-100 for 'All' classes, and by 9.5% on CIFAR-10, 22.7% on CIFAR-100, and 12.0% on ImageNet-100 for 'Unseen' classes. This demonstrates that the cross-instance positive relations obtained by SNC are effective for learning better representations of unlabelled data. Because a linear classifier is trained on 'Seen' classes, UNO+ shows a strong performance on 'Seen' classes, but its performance on 'Unseen' ones is significantly worse. In contrast, CiP achieves comparably good performance on both 'Seen' and 'Unseen' classes, without being biased towards the labelled data. In Tab. 2, we further compare our method with others on fine-grained image recognition datasets, in which the differences between classes are subtle, making GCD more challenging. Again, CiP consistently outperforms all other methods for 'All' and 'Unseen' classes. On CUB-200 and SCars, CiP achieves 5.8% and 8.0% improvement over the state-of-the-art for 'All' classes. For the challenging Herbarium19 dataset, which contains many more classes than the other datasets and has the extra challenge of a long-tailed distribution, CiP still achieves an improvement of 1.4% and 5.6% for 'All' and 'Unseen' classes.
Both RankStats+ and UNO+ show a strong bias to the 'Seen' classes. In Fig. 2, we visualize the t-SNE projection of features generated by DINO (Caron et al., 2021), the GCD method of (Vaze et al., 2022b), and our method CiP, performed on CIFAR-10. Both (Vaze et al., 2022b) and our features are more discriminative than DINO features. The method of (Vaze et al., 2022b) captures better representations with more separable clusters, but some seen categories are confounded with unseen categories, e.g., cat with dog and automobile with truck, while CiP features show better cluster boundaries for seen and unseen categories, further validating the quality of our learned representation.

Figure 2: Visualization on CIFAR-10. We conduct t-SNE projection on features extracted by raw DINO, the GCD method of (Vaze et al., 2022b), and our CiP. We randomly sample 1000 images of each class from CIFAR-10 to visualize. Unseen categories are marked with *.

4.3. ESTIMATING THE UNKNOWN CLASS NUMBER

In Tab. 3, we report our estimated class numbers on both generic and fine-grained datasets using the joint reference score $s_c$ as described in Sec. 3.4. Overall, CiP achieves results comparable to the method of (Vaze et al., 2022b) at the cost of slightly more memory, but it is far more efficient (40-150 times faster) and does not require a list of predefined candidate class numbers. Even for the most difficult Herbarium19 dataset, CiP takes only a few minutes to finish, whereas a single run of k-means takes more than an hour due to the large class number, let alone multiple runs over a predefined list of possible class numbers.

4.4. ABLATION STUDY

Approaches to generate positive relations. In Tab. 4, we compare our SNC with several other approaches to generate positive relations for joint contrastive learning, including directly using the nearest neighbor (Zhong et al., 2021) in every mini-batch and running various clustering algorithms to obtain pseudo labels, e.g., FINCH (Sarfraz et al., 2019), k-means (MacQueen et al., 1967), and semi-supervised k-means (Han et al., 2019; Vaze et al., 2022b). Non-hierarchical clustering methods (k-means and semi-supervised k-means) require a given cluster number. For k-means, we use the ground-truth class number. For semi-supervised k-means, we use both the ground truth and the overclustering number (twice the ground truth). We evaluate performance using both the proposed SNC and semi-supervised k-means for comparison. It is clear that SNC reaches higher accuracy than semi-supervised k-means at test time. For generating pseudo positive relations, our method achieves the best performance among all approaches. FINCH performs well on CIFAR-100 but degrades on CUB-200. We hypothesize that, because FINCH is purely unsupervised and does not leverage labelled data, it fails to generate reliable pseudo labels for the more semantically similar instances of the fine-grained CUB-200. Overclustering semi-supervised k-means achieves comparable performance on CUB-200 but performs poorly on CIFAR-100. This might be caused by the intrinsically poorer performance of semi-supervised k-means compared to the proposed SNC, which results in worse pseudo labels. We further report the mean purity curve of pseudo labels generated by all clustering methods throughout the training process in Fig. 3. We can observe that pseudo labels produced by SNC maintain the highest purity on both datasets throughout the entire training process.

Effectiveness of cross-instance positive relations. In this paper, we use SNC to generate pair-wise relations among unlabelled data, as well as relations between unlabelled and labelled data, for supervised contrastive learning. In Tab. 5, we evaluate different settings to verify the effectiveness of both relation types. We report evaluation results of SNC and semi-supervised k-means, showing higher accuracy achieved by SNC. Row (0) represents the performance of the state-of-the-art GCD method of (Vaze et al., 2022b) without using any pseudo relations. Rows (1)-(3) show the effect of using different clustering methods to introduce relations between unlabelled and unlabelled pairs (u-u). All methods show improvements over (Vaze et al., 2022b). Among all relation generating methods, SNC brings the largest improvement, outperforming k-means and FINCH. Row (4) shows that only adding pair-wise relations between labelled and unlabelled data (u-ℓ) is not sufficient to boost baseline performance. Row (5) is our full method, which achieves the best performance. From rows (3)-(5), we clearly find that fully using both the u-u and u-ℓ relations generated by SNC benefits our method to the greatest extent, and also substantially improves performance on unseen categories.

Table 5: Results using different relations. u-u denotes pair-wise relations between unlabelled and unlabelled data, and u-ℓ denotes pair-wise relations between unlabelled and labelled data. Rows (3)-(4) mean applying SNC on all data but only using u-u or u-ℓ for pseudo positive relations.
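The split into u-u and u-ℓ relations can be made concrete with a small sketch. It assumes that a clustering step has already assigned every instance a cluster id at one level of the hierarchy; the function name `positive_relations` and its arguments are illustrative and not taken from our implementation.

```python
from itertools import combinations


def positive_relations(cluster_ids, is_labelled):
    """Split same-cluster pairs into u-u and u-l pseudo positive relations.

    cluster_ids[i] is the cluster of instance i at one level of the hierarchy;
    is_labelled[i] tells whether instance i carries a ground-truth label.
    Both argument names are hypothetical.
    """
    u_u, u_l = set(), set()
    for i, j in combinations(range(len(cluster_ids)), 2):
        if cluster_ids[i] != cluster_ids[j]:
            continue  # different clusters: no pseudo positive relation
        if not is_labelled[i] and not is_labelled[j]:
            u_u.add((i, j))  # unlabelled-unlabelled pair
        elif is_labelled[i] != is_labelled[j]:
            u_l.add((i, j))  # one labelled and one unlabelled instance
        # labelled-labelled pairs are covered by the true labels instead
    return u_u, u_l


# Toy example: instances 0 and 4 are labelled, the rest are unlabelled.
u_u, u_l = positive_relations([0, 0, 0, 1, 1], [True, False, False, False, True])
```

In this toy example the only u-u pair is (1, 2), while (0, 1), (0, 2) and (3, 4) are u-ℓ pairs.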

5. CONCLUSION

We have presented CiP, a framework for the challenging problem of GCD. Our framework leverages the cross-instance positive relations obtained with SNC, an efficient parameter-free hierarchical clustering algorithm that we develop for the GCD setting. With the positive relations obtained by SNC, we can learn a better representation for GCD, and the label assignment on the unlabelled data can be obtained from a single run of SNC, which is far more efficient than the semi-supervised k-means used in the state-of-the-art method. We also show that SNC can be used to estimate the unknown class number in the unlabelled data with higher efficiency.

Category discovery efficiency. The latency of the category discovery process mainly consists of two parts: feature extraction and label assignment. In Tab. 10, we present the feature extraction time. All methods consume roughly the same amount of time for feature extraction per image. RankStats+ (Han et al., 2021), UNO+ (Fini et al., 2021), and ORCA (Cao et al., 2022) assign labels with a linear classifier, thanks to the assumption of a known category number. Hence, their label assignment is simply a fast feed-forward pass of a linear classifier, costing negligible time (< 0.0005 seconds per image), though their performance lags behind. Our CiP and (Vaze et al., 2022b) both involve a transfer clustering process for label assignment, for which CiP is 6-30 times faster than the semi-supervised k-means used in (Vaze et al., 2022b).



When representing a clustering method here, SNC denotes selective neighbor clustering with one-by-one merging.



Figure 1: Overview of our CiP framework. We first initialize ViT with pretrained DINO (Caron et al., 2021) to obtain a good representation space. We then finetune ViT by conducting joint contrastive learning with both true and pseudo positive relations in a supervised manner. True positive relations come from labelled data while pseudo positive relations of all data are generated by our proposed SNC algorithm. Specifically, SNC generates a hierarchical clustering structure. Pseudo positive relations are granted to all instances in the same cluster at one level of partition, further exploited in joint contrastive learning. With representations well learned, we estimate class number and assign labels to all unlabelled data using SNC with a one-by-one merging strategy.

Figure 3: Purity curve.

Figure 5: Curves throughout class number estimation. We report curves of accuracy on the labelled subset $D_L^v$, silhouette score on the unlabelled data $D_U$, and our reference score on $D_L^v \cup D_U$. Note that the x-axis should be read from right to left, as the merging starts from the lower level and proceeds to the upper level.

Figure 6: Attention visualizations. We report visualization results of DINO (Caron et al., 2021) (left), (Vaze et al., 2022b) (middle) and CiP (right) on Stanford Cars (top) and CUB-200 (bottom). For each dataset, we show two rows of 'Seen' categories (solid green box) and two rows of 'Unseen' categories (dashed red box). Zoom in to see attention details.
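As a rough illustration of how a per-head attention map such as those in Figure 6 can be obtained, the sketch below computes a scaled dot-product attention matrix for one head and reshapes its [CLS] row into a patch grid. It is a minimal sketch in plain Python and not our visualization code: the function names are hypothetical, the queries and keys are random, `d_k = 8` is an arbitrary toy value, and the [CLS] token is assumed to sit at index 0.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d_k)) for a single head."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in K])
        for q_row in Q
    ]


def cls_attention_grid(A, height, width):
    """Reshape the [CLS] row of A to height x width, dropping the [CLS] entry."""
    patch_scores = A[0][1:]  # assumption: [CLS] is token 0
    if len(patch_scores) != height * width:
        raise ValueError("attention map does not match the patch grid")
    return [patch_scores[r * width:(r + 1) * width] for r in range(height)]


# Toy usage: 14 x 14 patches plus one [CLS] token, random queries and keys.
random.seed(0)
H = W = 14
d_k = 8  # illustrative value only
Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(H * W + 1)]
K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(H * W + 1)]
grid = cls_attention_grid(attention_map(Q, K), H, W)
```

The resulting `grid` has one score per patch and can be upsampled to the image resolution for display.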

Preparation:
    Get initial partitions $S = \{\Gamma_p\}_{p=0}$ by SNC and a cluster number range $[N_e, N_o]$. Note that the merging is from $N_o$ to $N_e$ and $N_o > N_e$.
Partition initialization:
    Find $\Gamma_t \in S$ satisfying $|\Gamma_t| > N_o$ and $|\Gamma_{t+1}| \le N_o$.
Merging:
    while $|\Gamma_t| > N_e$ do
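The partition initialization step can be sketched as follows. This is only an illustration of the stated condition on the starting partition: the function name `initial_level` is hypothetical, the hierarchy is represented solely by its cluster counts per level, and these counts are assumed to be non-increasing from finer to coarser levels. The body of the merging loop is not shown here, so the sketch stops at locating the starting level.

```python
def initial_level(partition_sizes, n_o):
    """Return t such that |Gamma_t| > N_o and |Gamma_{t+1}| <= N_o.

    partition_sizes[p] is the number of clusters at level p of the hierarchy,
    assumed non-increasing in p (finer partitions first).
    """
    for t in range(len(partition_sizes) - 1):
        if partition_sizes[t] > n_o >= partition_sizes[t + 1]:
            return t
    raise ValueError("no level of the hierarchy brackets N_o")


# Toy hierarchy with 1000, 240, 60, 15 and 4 clusters, and N_o = 100.
start = initial_level([1000, 240, 60, 15, 4], n_o=100)
```

For this toy hierarchy the merging would start from level 1, whose 240 clusters exceed the upper bound of 100 while the next level, with 60 clusters, does not.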

Table 1: Results on generic image recognition datasets.

Table 2: Results on fine-grained image recognition datasets.

Table 3: Estimation of class number in unlabelled data.

Table 4: Results using different approaches to generate positive relations. Semi-k-means ⋆ denotes using semi-supervised k-means with an overclustering class number (2 × ground truth). The results evaluated with SNC are reported in normal size (left), and those evaluated with semi-supervised k-means are reported in smaller size (right).

Table 9: Results using different loss formulations.

Table 11: Time cost in clustering.

Table 13: Data splits of all datasets. We present the number of classes in the labelled and unlabelled set ($|Y_L|$, $|Y_U|$), and the number of images ($|D_L|$, $|D_U|$).

Table 14: Performance on seen-only and unseen-only unlabelled data. "original setting" denotes the performance of CiP dealing with GCD; "direct testing" denotes the performance of CiP dealing with seen-only or unseen-only unlabelled data using the pretrained GCD model; "retraining" denotes the performance of retrained CiP dealing with seen-only or unseen-only unlabelled data.

ETHICS STATEMENT

The potential negative impacts lie in two aspects. On the one hand, although our method achieves state-of-the-art performance, it still lags behind fully supervised models, making it risky to apply to scenarios with strict safety and accuracy requirements, e.g., autonomous driving and medical image classification. On the other hand, because the labels of unseen categories are unknown, manually checking the results is necessary in real applications, which may expose annotators to sensitive contexts (e.g., private data) and inappropriate content (e.g., violent images).

A.1 MORE ANALYSIS ON SNC

We present a more detailed illustration of our proposed SNC in Fig. 4. SNC is inspired by FINCH, but the two differ significantly in two key aspects: (1) FINCH treats all instances the same and simply uses nearest neighbors to construct graphs, whereas SNC uses a novel selective neighbor strategy tailored for the GCD setting to construct graphs, treating labelled and unlabelled instances differently. (2) SNC is able to cluster a mixed set of labelled and unlabelled data while fully exploiting label supervision, but FINCH is not.

Effectiveness of SNC on different learned features. In Tab. 6, we evaluate SNC¹ on features extracted from DINO (Caron et al., 2021), the GCD method of (Vaze et al., 2022b), and our method CiP. We also compare SNC with semi-supervised k-means (Han et al., 2019; Vaze et al., 2022b). We can observe that SNC surpasses semi-supervised k-means by a significant margin on all features, except those extracted by (Vaze et al., 2022b) on ImageNet-100. Moreover, semi-supervised k-means with our features performs better than with other features. Overall, SNC with our learned features gives the best performance.

Different choices of chain lengths λ. The chain length should be positively correlated with (but smaller than) the number of labelled instances, while not being too small. The square root used in our paper is the simplest formulation that we could think of. In Tab. 7, we experiment with other formulations that satisfy the above relationship, e.g., $\lambda = \lceil \sqrt[3]{n_\ell} \rceil$ and $\lambda = \lceil n_\ell / 2 \rceil$, and our formulation performs the best. We also compare our dynamic λ with a possible alternative of a fixed λ. For the fixed chain length, we conduct multiple experiments with different length values to find the length giving the highest accuracy for each dataset. We observe that the best chain length varies from dataset to dataset, and there is no single fixed λ that gives the best performance for all datasets. In contrast, our dynamic λ consistently outperforms the fixed one, and it automatically adjusts the chain length for different datasets and different levels, without requiring any tuning or validation, unlike the fixed one.

Impacts of different levels for positive relation generation. A proper level for positive relation generation should overcluster the labelled data to some extent, such that reliable positive relations can be generated. Level 1 is not a valid choice because no positive relations can be generated if each instance is treated as a cluster. In Tab. 8, we present the performance using levels 2, 3, and 4 to generate pseudo labels and also compare with the previous state-of-the-art baseline of (Vaze et al., 2022b). We empirically find that the overclustering levels 3 and 4 are similarly good, while level 2 is worse because fewer positive relations are explored in each mini-batch. Even using level 2, our method still performs on par with (Vaze et al., 2022b).

In this paper, to leverage pseudo labels produced by SNC, we jointly train our model with two supervised contrastive losses, one using true positive relations of labelled data and the other using pseudo positive relations of all data. Alternatively, it is possible to train the model with a unified loss by replacing the pseudo relations in the second term of our loss and removing the first term. Formally, let $R_B(i)$ be the set of positive relations for instance $i$. The unified loss $L^r_i$ can be written as Eq. (6), where $I_L$ and $I_U$ denote the instance indices of the labelled and unlabelled sets, respectively. In Tab. 9, we compare our two-term loss formulation with this unified loss formulation. Our two-term loss turns out to be more effective. We hypothesize that the performance degradation of Eq. (6) is caused by the unbalanced granularity of labelled and unlabelled data, due to the mixture of overclustering pseudo labels and non-overclustering ground-truth labels.
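The difference between the two formulations can be illustrated with a minimal plain-Python sketch. It is not our training code and rests on several assumptions: a standard supervised contrastive loss with temperature `tau`, denominators taken over the whole mini-batch, equal weighting of the two terms, averaging over anchors that have at least one positive, and a unified positive set that uses true relations for labelled anchors and pseudo relations for unlabelled ones. All function names are hypothetical.

```python
import math


def sup_con_loss(z, positives, tau=0.1):
    """Mean supervised contrastive loss over anchors with at least one positive.

    z: list of L2-normalised feature vectors of a mini-batch.
    positives[i]: set of indices treated as positives of anchor i.
    """
    n = len(z)
    total, count = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
                for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[q] - log_denom for q in positives[i]) / len(positives[i])
        count += 1
    return total / max(count, 1)


def positives_from(ids):
    """positives[i] = indices j != i sharing the same non-None id as i."""
    n = len(ids)
    return [
        {j for j in range(n) if j != i and ids[i] is not None and ids[j] == ids[i]}
        for i in range(n)
    ]


def two_term_loss(z, true_labels, pseudo_labels, tau=0.1):
    """True relations on labelled anchors plus pseudo relations on all anchors."""
    return (sup_con_loss(z, positives_from(true_labels), tau)
            + sup_con_loss(z, positives_from(pseudo_labels), tau))


def unified_loss(z, true_labels, pseudo_labels, tau=0.1):
    """Single term: true relations if the anchor is labelled, pseudo otherwise."""
    true_pos = positives_from(true_labels)
    pseudo_pos = positives_from(pseudo_labels)
    merged = [true_pos[i] if true_labels[i] is not None else pseudo_pos[i]
              for i in range(len(z))]
    return sup_con_loss(z, merged, tau)


# Toy batch: two labelled instances of class 0 and two unlabelled instances.
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
true_labels = [0, 0, None, None]  # None marks an unlabelled instance
pseudo_labels = [3, 3, 4, 4]      # pseudo cluster ids for all instances
loss_two = two_term_loss(z, true_labels, pseudo_labels)
loss_uni = unified_loss(z, true_labels, pseudo_labels)
```

In the unified variant, labelled anchors follow the granularity of the ground-truth classes while unlabelled anchors follow the finer pseudo clusters, which is the mixture of granularities referred to above.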

Method                              Time cost
RankStats+ (Han et al., 2021)       0.015s±0.001
UNO+ (Fini et al., 2021)            0.017s±0.001
ORCA (Cao et al., 2022)             0.015s±0.001
Vaze et al. (Vaze et al., 2022b)

Estimating class number. Compared to repeatedly running k-means with different class numbers as in (Vaze et al., 2022b), CiP only requires a single run to obtain the estimated class number, thus significantly increasing efficiency. In Tab. 12, CiP is 40-150 times faster than (Vaze et al., 2022b), which utilizes k-means with the optimization of Brent's algorithm (Brent, 1971).

ViT (Dosovitskiy et al., 2020) has a multi-head attention design, with each head focusing on a different context of the image. For the final block of ViT, the input $X \in \mathbb{R}^{(HW+1) \times D}$, corresponding to the features of $HW$ patches and a [CLS] token, is fed into the multiple heads, which, following the standard formulation of (Vaswani et al., 2017), can be expressed as

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,$$

where

$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^\top}{\sqrt{d_k}}\right) V_j, \qquad Q_j = X W_j^Q,\; K_j = X W_j^K,\; V_j = X W_j^V,$$

where $d_k$ is the dimension of queries and keys. In our model, the patch size is 16 × 16 pixels and $HW = 14 \times 14 = 196$. The number of heads $h$ is 12. Referring to (Vaswani et al., 2017), consider the attention map of head $j$, $A_j = \mathrm{softmax}\!\left(Q_j K_j^\top / \sqrt{d_k}\right) \in [0, 1]^{(HW+1) \times (HW+1)}$. $A_j$ describes the similarity of one feature to every other feature captured in head $j$. The first row of $A_j$ shows how head $j$ attends from the [CLS] token to every spatial patch of the input image. In Fig. 6, we visualize some of the interpretable attention heads to show the semantic regions that ViT attends to. We can observe that our model CiP, as well as DINO (Caron et al., 2021) and (Vaze et al., 2022b), can attend to specific semantic object regions. For instance, three heads of CiP attend respectively to 'license plate', 'light' and 'wheels' for Stanford Cars (head 1 fails in row 1), and to 'body', 'head' and 'neck' for CUB-200.

A.6 DATA SPLITS

In Tab. 13, we show the details of the data splits of CIFAR-10 (Krizhevsky et al., 2009), CIFAR-100 (Krizhevsky et al., 2009), ImageNet-100 (Deng et al., 2009), CUB-200 (Wah et al., 2011), Stanford Cars (Krause et al., 2013) and Herbarium19 (Tan et al., 2019) in our experiments.

A.7 SPECIAL CASES OF UNLABELLED DATA

In the real world, we may encounter scenarios where the unlabelled data are all from seen classes or all from unseen classes. We investigate such scenarios and conduct experiments to validate the effectiveness of our method. Our experiments are conducted under two settings: (1) applying our pretrained models from the main paper to seen-only and unseen-only unlabelled data; (2) retraining the models with seen-only and unseen-only unlabelled data. In Tab. 14, we can observe that our model maintains strong performance in all cases.

A.8 LIMITATIONS

We note limitations of our method. In our current experiments, we consider images from the same curated dataset. However, in practice, we might want to transfer concepts from one dataset to another, which may have a different data distribution, introducing more challenges. For example, the unlabelled data could follow a long-tailed distribution. Another limitation is that currently, we need to train the model on both labelled and unlabelled data jointly. However, in the real world, there are often cases in which we do not have access to any labelled data from the seen classes when facing the unlabelled data. We consider these as our future research directions.

A.9 LICENSE FOR EXPERIMENTAL DATASETS

All datasets used in this paper are permitted for research use. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) are released under the MIT License, allowing for research purposes. ImageNet-100 is a subset of ImageNet (Deng et al., 2009), which allows non-commercial research use. Similarly, CUB-200 (Wah et al., 2011), Stanford Cars (Krause et al., 2013) and Herbarium19 (Tan et al., 2019) are also exclusively for non-commercial research purposes.

