NOVEL CLASS DISCOVERY UNDER UNRELIABLE SAMPLING

Abstract

When sampling data of specific classes (i.e., known classes) for a scientific task, collectors may encounter unknown classes (i.e., novel classes). Since these novel classes might be valuable for future research, collectors will also sample them and assign them to several clusters with the help of known-class data. This assigning process is known as novel class discovery (NCD). However, sampling errors are common in practice and may make the NCD process unreliable. To tackle this problem, this paper introduces a new and more realistic setting in which collectors may misidentify known classes and even confuse known classes with novel classes, which we name NCD under unreliable sampling (NUSA). We find that NUSA empirically degrades existing NCD methods if sampling errors are left untreated. To handle NUSA, we propose an effective solution named the hidden-prototype-based discovery network (HPDN). HPDN first trains a deep network to fully fit the wrongly sampled data, then feeds the relatively clean hidden representations yielded by this network into a novel mini-batch K-means algorithm, which further prevents these representations from overfitting to residual errors by detaching the noisy supervision in time. Experiments demonstrate that, under NUSA, HPDN significantly outperforms competitive baselines (e.g., by 6% over the best baseline on CIFAR-10) and remains robust even under severe sampling errors.

1. INTRODUCTION

Data, algorithms, and computing power have driven the boom in artificial intelligence, especially supervised learning with many powerful deep models (Deng et al., 2009; Krizhevsky et al., 2012; Simonyan & Zisserman, 2015). Although these deep models can accurately identify or cluster the classes that appear in the training set (i.e., known/seen classes), they cannot reliably extrapolate to novel classes (i.e., unseen classes). Young children, by contrast, after seeing some common vehicles (e.g., cars and bicycles), can easily distinguish (cluster) unseen but similar ones (e.g., trains and steamships) based on previous experience. This fact motivates researchers to formulate a novel problem called novel class discovery (NCD) (Han et al., 2020; 2019; Hsu et al., 2018; 2019; Zhao & Han, 2021; Zhong et al., 2021a; b), which aims to accurately cluster novel classes using labeled known-class data and unlabeled novel-class data. Existing work (Chi et al., 2022) demystifies the underlying assumptions of NCD and then defines NCD strictly from the perspective of sampling, making the NCD problem theoretically solvable. Specifically, given a sampling task (i.e., collecting known-class data), the known-class and novel-class data are sampled in the same scenario, but the novel-class data are sampled in passing, and experts cannot identify them. Since the shared scenario implies that the two groups have similar high-level semantic features, employing knowledge of known classes to assist the clustering of novel classes is meaningful. However, for professional and difficult sampling tasks, the experts may wrongly identify known classes (i.e., internal errors) and even confuse known classes with novel classes (i.e., external errors). A direct example is sampling different varieties of privet, a type of shrub. If experts are not very proficient, they may confuse ligustrum vicaryi with ligustrum quihoui (an internal error), since the two look very similar.
Furthermore, they may confuse ligustrum vicaryi with kerria japonica (an external error), since both appear red. Motivated by this scenario, we propose a new and challenging problem called NCD under unreliable sampling (NUSA), where we try to discover novel classes under both internal and external sampling errors, as shown in Figure 1. The most direct solution to NUSA is to apply existing NCD methods (Fini et al., 2021; Han et al., 2020; 2019; Zhong et al., 2021a), with the results shown in the left panel of Figure 2. Clearly, NUSA empirically degrades the four representative NCD methods, and previous methods cannot handle NUSA well. Alternatively, label-noise learning methods (Han et al., 2018; Li et al., 2020) can be employed to first correct the labels of all the sampled data (with the novel classes currently treated as a single class), after which these data and the revised labels are fed into existing NCD methods; this can be regarded as a two-step solution to NUSA. However, existing label-noise learning methods cannot fully eliminate noise, and experimental results (Table 1) show that residual errors still weaken NCD methods. Based on these empirical results, the two types of sampling errors substantially invalidate both NCD methods and the above two-step methods. To address the sampling errors in NUSA, we propose the hidden-prototype-based discovery network (HPDN). In terms of supervision, the sampled data with errors can be treated as data with label noise. Li et al. (2021a) pointed out that if an architecture "suits" one task, training with noisy supervision can induce useful hidden representations. Inspired by this conclusion, HPDN first trains a deep network (initialized by SimCLR (Chen et al., 2020)) to fully fit the wrongly sampled data. This network can yield relatively clean hidden representations for novel-class data (Li et al., 2021a). However, the right panel of Figure 2 indicates that the residual errors in hidden representations still degrade the existing NCD methods.
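As a toy illustration of the two error types, the snippet below injects internal errors (labels flipped among known classes) and external errors (novel-class samples entering the labeled pool under a known-class label) into a clean sampling. The function name, the error rates, and the use of -1 to mark as-yet-unlabeled novel samples are our own illustrative choices, not the paper's experimental protocol.

```python
import numpy as np

def corrupt_sampling(labels, is_novel, num_known, p_int=0.2, p_ext=0.2, seed=0):
    """Simulate NUSA-style sampling errors (illustrative, not the paper's protocol).

    labels:   collector-assigned class ids; -1 marks unlabeled novel samples.
    is_novel: boolean mask, True where a sample truly belongs to a novel class.
    p_int:    rate of internal errors (known class flipped to another known class).
    p_ext:    rate of external errors (novel sample mislabeled as a known class).
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    # Internal errors: flip some known-class labels to a different known class.
    known_idx = np.where(~is_novel)[0]
    flip = known_idx[rng.random(known_idx.size) < p_int]
    for i in flip:
        choices = [c for c in range(num_known) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    # External errors: some novel samples are confused with known classes.
    novel_idx = np.where(is_novel)[0]
    confuse = novel_idx[rng.random(novel_idx.size) < p_ext]
    noisy[confuse] = rng.integers(0, num_known, size=confuse.size)
    return noisy
```

Under this simulation, the unlabeled novel pool shrinks while the labeled known-class pool becomes contaminated, which is exactly the combination of corruptions NUSA considers.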
This is caused by the strong memorization ability of deep networks (Zhang et al., 2021), which leads to the accumulation of residual noisy supervision during training. To avoid further error accumulation in the representations, unlike existing NCD methods, we detach the noisy supervision in time at the clustering stage. We then employ K-means (MacQueen et al., 1967), an unsupervised clustering algorithm. Naive K-means uses all data representations at once and is sensitive to the initial centers. However, given many data representations, a proper initialization is hard to choose, it may be negatively affected by residual errors in the representations, and many iterations are required. Thus, we propose mini-batch K-means with memories of clustering centers (i.e., prototypes) to discover novel classes from the hidden representations. Mini-batches are easier to initialize with K-means++ (Arthur & Vassilvitskii, 2006) due to their smaller sample complexity. After obtaining the centers of each batch, we take their matched average (i.e., the prototypes) to initialize the batches of the next epoch, so that the whole dataset is taken into account. In this way, the prototypes gradually converge to a stable state and serve as the final clustering centers. To verify the effectiveness of HPDN, we perform experiments on three benchmarks: CIFAR-10, CIFAR-100, and ImageNet. Experimental results show that HPDN outperforms existing baselines (e.g., by 6% over the best baseline on CIFAR-10) and remains very robust to sampling errors in NCD (Figure 4), confirming the effectiveness of HPDN.
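The prototype-memory clustering described above can be sketched roughly as follows. This is a simplified illustration under our own assumptions (nearest-prototype assignment per mini-batch and a fixed 0.5 averaging rate for the prototype update), not the paper's exact algorithm, which matches per-batch K-means++-initialized centers before averaging.

```python
import numpy as np

def minibatch_kmeans_with_prototypes(feats, k, batch_size=256, epochs=10,
                                     init=None, seed=0):
    """Simplified sketch of mini-batch K-means with prototype memory.

    Each epoch clusters shuffled mini-batches of hidden representations;
    batch-wise cluster means are averaged into running prototypes, so the
    prototypes summarize the whole dataset across epochs.
    """
    rng = np.random.default_rng(seed)
    n, _ = feats.shape
    if init is None:
        # Random initialization for brevity; K-means++ could be used instead.
        prototypes = feats[rng.choice(n, k, replace=False)].astype(float)
    else:
        prototypes = np.array(init, dtype=float)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = feats[order[start:start + batch_size]]
            # Assign each batch point to its nearest prototype.
            dists = ((batch[:, None, :] - prototypes[None]) ** 2).sum(-1)
            assign = dists.argmin(1)
            # Pull each prototype toward the mean of its batch members.
            for j in range(k):
                members = batch[assign == j]
                if members.size:
                    prototypes[j] = 0.5 * prototypes[j] + 0.5 * members.mean(0)
    # Final cluster assignment for the whole dataset.
    final = ((feats[:, None, :] - prototypes[None]) ** 2).sum(-1).argmin(1)
    return prototypes, final
```

Because each update only touches one mini-batch, initialization and iteration costs stay small, while the prototype memory carries information across batches and epochs.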


Figure 1: Novel class discovery (NCD, (a)) is formulated by a sampling process (green arrows). When collectors sample the data of required classes (i.e., bear, lion, wolf, and tiger) in a scenario, they may encounter unfamiliar novel classes (i.e., squirrel and hare), which they should also sample for future research. Assigning these samples to several clusters with the help of known-class data is known as NCD. However, collectors may make mistakes in practice, a setting we name NCD under unreliable sampling (NUSA, (b)). Here we consider two cases, where they misidentify the known classes (i.e., internal errors, shown in blue boxes) and even confuse known classes with novel classes (i.e., external errors, shown in yellow boxes).

