A FLEXIBLE FRAMEWORK FOR DISCOVERING NOVEL CATEGORIES WITH CONTRASTIVE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment the instance discrimination used in conventional contrastive learning approaches. We further employ the Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data, enabling better prediction of cluster assignments. We thoroughly evaluate our framework on the large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and on the image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results.
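To make the WTA pseudo-labelling step concrete, the following is a minimal sketch of Winner-Take-All hashing and pairwise pseudo-label generation. The function names, window size, number of hashes, and matching threshold are illustrative choices of ours, not the paper's exact settings.

```python
import numpy as np

def wta_hash(features, n_hashes=64, window=4, seed=0):
    """Compute WTA hash codes: for each hash, randomly permute the
    feature dimensions, keep the first `window` of them, and record
    the position of the maximum value (a rank-order statistic)."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    codes = np.empty((features.shape[0], n_hashes), dtype=np.int64)
    for h in range(n_hashes):
        perm = rng.permutation(d)[:window]
        codes[:, h] = np.argmax(features[:, perm], axis=1)
    return codes

def pairwise_pseudo_labels(codes, threshold=0.9):
    """Label a pair of samples as positive (same cluster) when the
    fraction of matching hash entries exceeds `threshold`."""
    matches = (codes[:, None, :] == codes[None, :, :]).mean(axis=2)
    return (matches >= threshold).astype(np.int64)
```

Because WTA codes depend only on the rank order of feature values, two embeddings that are close up to monotonic distortions receive similar codes, which is why hash agreement can serve as a (noisy) pairwise pseudo label for the unlabelled data.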

1. INTRODUCTION

With the tremendous advances in deep learning, recent machine learning models have shown superior performance on many tasks, such as image recognition (Deng et al., 2009; Kuznetsova et al., 2020), object detection (Zhou et al., 2019; Tan et al., 2020), and image segmentation (Cheng et al., 2019). While state-of-the-art models may even outperform humans on these tasks, their success relies heavily on huge amounts of human-annotated data under the closed-world assumption. Applying deep learning in the real (open) world brings many new challenges: it is cost-prohibitive to identify and annotate all categories, and new categories keep emerging. Conventional methods struggle to handle unlabelled data from new categories (Fontanel et al., 2020). On the flip side, the real world provides rich unlabelled data, which are often multi-modal (e.g., video and audio), opening up more possibilities for machine learning models to learn in a way similar to humans. Indeed, humans learn from multi-modal data every day through text, videos, audio, etc. In this paper, we focus on automatically learning to discover new categories in the open-world setting. Similar to recent work (Han et al., 2019; 2020), which transfers knowledge from labelled images of a few classes to other unlabelled image collections, we formulate the problem as partitioning unlabelled data from unknown categories into proper semantic groups while some labelled data from other categories are available. This is a more realistic setting than pure unsupervised clustering, which may produce equally valid data partitions following different unconstrained criteria (e.g., images can be clustered by texture, color, illumination, etc.), and than closed-world recognition, which cannot handle unlabelled data from new categories without any labels.
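The problem formulation above can be stated compactly; the notation below is our own shorthand for the setting, not necessarily the symbols used later in the paper:

```latex
D^{l} = \{(x_i, y_i)\}_{i=1}^{N}, \quad y_i \in \mathcal{C}^{l},
\qquad
D^{u} = \{x_j\}_{j=1}^{M} \text{ with latent categories in } \mathcal{C}^{u},
\qquad
\mathcal{C}^{l} \cap \mathcal{C}^{u} = \emptyset .
```

The goal is to partition the unlabelled set $D^{u}$ into its underlying semantic groups while exploiting the supervision available in $D^{l}$; the disjointness constraint $\mathcal{C}^{l} \cap \mathcal{C}^{u} = \emptyset$ is what distinguishes this setting from semi-supervised learning, where labelled and unlabelled data share the same categories.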
Meanwhile, our setting more closely resembles the human cognition process, in which humans can easily learn the concept of a new object by transferring knowledge from known objects. Specifically, we introduce a flexible end-to-end framework to discover categories in unlabelled data, with the goal of utilizing both labelled and unlabelled data to build an unbiased feature representation while transferring more knowledge from labelled to unlabelled data. In particular, we extend conventional contrastive learning (Chen et al., 2020b; He et al., 2020) to consider both instance discrimination and category discrimination, learning a reliable feature representation on labelled and unlabelled data. We also demonstrate that cross-modal discrimination further benefits representation learning on multi-modal data. To leverage more of unlabelled data, we

