A FLEXIBLE FRAMEWORK FOR DISCOVERING NOVEL CATEGORIES WITH CONTRASTIVE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to the labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment the instance discrimination used in conventional contrastive learning approaches. We further employ the Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data to better predict cluster assignments. We thoroughly evaluate our framework on the large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and the image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results.

1. INTRODUCTION

With the tremendous advances in deep learning, recent machine learning models have shown superior performance on many tasks, such as image recognition (Deng et al., 2009; Kuznetsova et al., 2020), object detection (Zhou et al., 2019; Tan et al., 2020), image segmentation (Cheng et al., 2019), etc. While state-of-the-art models may even outperform humans on these tasks, their success relies heavily on huge amounts of data with human annotations under the closed-world assumption. Applying deep learning in the real (open) world brings many new challenges: it is cost-prohibitive to identify and annotate all categories, and new categories keep emerging. Conventional methods struggle to handle unlabelled data from new categories (Fontanel et al., 2020). On the flip side, the real world provides rich unlabelled data, which are often multi-modal (e.g., video and audio), giving machine learning models more opportunities to learn in a similar way to humans. Indeed, humans learn from multi-modal data every day through text, videos, audio, etc. In this paper, we focus on automatically learning to discover new categories in the open-world setting. Similar to recent work (Han et al., 2019; 2020), which transfers knowledge from labelled images of a few classes to other unlabelled image collections, we formulate the problem as partitioning unlabelled data from unknown categories into proper semantic groups while some labelled data from other categories are available. This is a more realistic setting than both pure unsupervised clustering, which may produce equally valid data partitions following different unconstrained criteria (e.g., images can be clustered by texture, color, illumination, etc.), and closed-world recognition, which cannot handle unlabelled data from new categories without any labels.
Meanwhile, our setting is closer to the human cognition process, in which humans can easily learn the concept of a new object by transferring knowledge from known objects. Specifically, we introduce a flexible end-to-end framework to discover categories in unlabelled data, with the goal of utilizing both labelled and unlabelled data to build an unbiased feature representation while transferring more knowledge from labelled to unlabelled data. In particular, we extend conventional contrastive learning (Chen et al., 2020b; He et al., 2020) to consider both instance discrimination and category discrimination, learning a reliable feature representation on labelled and unlabelled data. We also demonstrate that cross-modal discrimination further benefits representation learning on multi-modal data. To leverage the unlabelled data more fully, we introduce Winner-Take-All (WTA) hashing (Yagnik et al., 2011) on the shared representation space to generate pairwise pseudo labels on-the-fly, which is the key to robust knowledge transfer from the labelled data to the unlabelled data. With these weak pseudo labels, the model can be trained with a simple binary cross-entropy loss on the unlabelled data together with the standard cross-entropy loss on the labelled data. This way, our model simultaneously learns the feature representation and performs cluster assignment using a unified loss function.
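To make the pseudo-labelling step concrete, the following is a minimal NumPy sketch of WTA hashing applied to batch features. It is illustrative only, not the paper's implementation: the function names and the hyper-parameters (number of hashes, window size, agreement threshold) are assumptions for the example.

```python
import numpy as np

def wta_hash(features, num_hashes=64, window=4, seed=0):
    """Winner-Take-All hashing (Yagnik et al., 2011): for each hash,
    randomly permute the feature dimensions, keep the first `window`
    entries, and record the index of the maximum. The codes depend only
    on the rank order of feature values, making them robust to
    monotonic perturbations of the embedding."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    codes = np.empty((n, num_hashes), dtype=np.int64)
    for h in range(num_hashes):
        perm = rng.permutation(d)[:window]          # random dimension subset
        codes[:, h] = np.argmax(features[:, perm], axis=1)
    return codes

def pairwise_pseudo_labels(features, threshold=0.5, **kwargs):
    """Pairwise pseudo labels for unlabelled samples: a pair is marked
    positive (likely the same novel category) when the fraction of
    matching WTA codes exceeds `threshold`."""
    codes = wta_hash(features, **kwargs)
    matches = (codes[:, None, :] == codes[None, :, :]).mean(axis=2)
    return (matches > threshold).astype(np.float32)
```

These weak pairwise labels would then supervise a binary cross-entropy loss on the pairwise similarities of the model's cluster-assignment predictions for unlabelled data, alongside the standard cross-entropy on labelled data.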
The main contributions of the paper can be summarized as follows: (1) we propose a generic, end-to-end framework for novel category discovery that can be trained jointly on labelled and unlabelled data; (2) to the best of our knowledge, we are the first to extend contrastive learning to the novel category discovery task, using category discrimination on labelled data and cross-modal discrimination on multi-modal data; (3) we propose a strategy that employs WTA hashing on the shared representation space of both labelled and unlabelled data to generate additional (pseudo) supervision on unlabelled data; and (4) we thoroughly evaluate our end-to-end framework on challenging large-scale multi-modal video benchmarks and single-modal image benchmarks, outperforming existing methods by a significant margin.
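The instance- and category-discrimination objectives named in the contributions can be sketched with standard NCE-style losses. This is a minimal sketch under assumed names and hyper-parameters (e.g., the temperature value), not the authors' implementation.

```python
import numpy as np

def instance_nce(anchors, positives, temperature=0.1):
    """Instance discrimination with an NCE-style loss (van den Oord et
    al., 2018): row i of `positives` (e.g., another augmentation of
    sample i) is the positive for anchor i; all other rows in the batch
    act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # cosine similarities
    m = logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def category_nce(features, labels, temperature=0.1):
    """Category discrimination on labelled data (cf. Khosla et al.,
    2020): every same-class pair in the batch is treated as a
    positive pair."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = f @ f.T / temperature
    np.fill_diagonal(logits, -np.inf)                # exclude self-pairs
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    # average log-probability over each anchor's same-class positives
    return -np.mean(np.where(pos, log_prob, 0.0).sum(axis=1)
                    / np.maximum(pos.sum(axis=1), 1))
```

Cross-modal discrimination follows the same pattern as `instance_nce`, with `positives` taken from the other modality's encoder (e.g., audio embeddings paired with visual anchors of the same clip).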

2. RELATED WORK

Our method is related to self-supervised learning, semi-supervised learning, and clustering, but differs from each of them. We review the most relevant works below.

Self-supervised learning aims at learning reliable feature representations using the data itself to provide supervision signals during training. Many pretext tasks (e.g., relative position (Doersch et al., 2015), colorization (Zhang et al., 2017), rotation prediction (Gidaris et al., 2018)) have been proposed for self-supervised learning, showing promising results. Recently, contrastive learning based methods, such as (He et al., 2020) and (Chen et al., 2020b), have attracted a lot of attention for their simplicity and effectiveness. The key idea of contrastive learning is instance discrimination, i.e., pulling similar pairs close and pushing dissimilar pairs apart in the feature space. (Khosla et al., 2020) studied supervised contrastive learning on labelled data as an alternative to cross-entropy. With labels, more positive pairs can be generated from intra-class instances, enabling category discrimination. Noise-Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010; van den Oord et al., 2018) is an effective contrastive loss widely used in these methods. When handling multi-modal data like videos, various self-supervised learning methods have been proposed to exploit data of different modalities, such as (Patrick et al., 2020; Alayrac et al., 2020; Asano et al., 2020; Alwassel et al., 2019; Morgado et al., 2020). Among them, (Morgado et al., 2020) suggests that cross-modal discrimination can be adopted to improve representation learning for downstream tasks like image recognition and object detection, which implies that good representations are shared between multiple views of the world. Tian et al.
(2019) show that cross-view prediction outperforms conventional alternatives in contrastive learning on images, depth, video and flow, and that more views can lead to better representations. In this paper, we consider the visual and audio modalities for cross-modal learning in videos, and present a new way to incorporate contrastive learning on both labelled and unlabelled data to bootstrap representation learning for novel category discovery.

Semi-supervised learning (Chapelle et al., 2006) considers the setting with both labelled and unlabelled data. Specifically, the unlabelled data are assumed to come from the same classes as the labelled data. The objective is to learn a robust model that makes use of both labelled and unlabelled data to avoid over-fitting to the labelled data. While this problem is well studied in the literature (e.g., (Oliver et al., 2018; Tarvainen & Valpola, 2017; Rebuffi et al., 2020a)), existing methods cannot handle unlabelled data from new classes. In contrast, our method is designed to discover new categories in the unlabelled data automatically.

Clustering, which aims at automatically partitioning unlabelled data into different groups, has long been studied in the machine learning community. There are many classic methods (e.g., k-means (MacQueen, 1967), mean-shift (Comaniciu & Meer, 1979)) and deep learning based methods (e.g., (Xie et al., 2016; Dizaji et al., 2017; Rebuffi et al., 2020b)) showing promising results. However, the definition of a cluster can be intrinsically ambiguous, because different criteria can be used to cluster the data. For example, objects can be clustered by color, shape or texture, and the clustering results will differ depending on the criterion, while these criteria cannot be predefined in the

