TASK REGULARIZED HYBRID KNOWLEDGE DISTILLATION FOR CONTINUAL OBJECT DETECTION

Anonymous

Abstract

Knowledge distillation has been used to overcome catastrophic forgetting in the Continual Object Detection (COD) task. Previous works mainly focus on combining different distillation methods, including feature, classification, location and relation distillation, into a mixed scheme. In this paper, we propose a task regularized hybrid knowledge distillation method for COD. First, we propose an image-level hybrid knowledge representation that combines instance-level hard and soft knowledge so that teacher knowledge is used critically. Second, we propose a task-based regularization distillation loss that takes account of loss and category differences, making continual learning better balanced between old and new tasks. We find that, under appropriate knowledge selection and transfer strategies, using only classification distillation can also relieve knowledge forgetting effectively. Extensive experiments on MS COCO 2017 demonstrate that our method achieves state-of-the-art results under various scenarios. We obtain an absolute improvement of 27.98 in RelGap under the most difficult five-task scenario. Code is provided in the attachment and will be released on GitHub.

1. INTRODUCTION

Existing object detection models (Ge et al., 2021) mainly adopt an offline learning paradigm in which the annotations of all categories must be available before training. This paradigm assumes that the data distribution is fixed or stationary (Yuan et al., 2021), whereas real-world data arrives dynamically with a non-stationary distribution. When a model learns from incoming data continually, new knowledge interferes with the old, leading to catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2014). To solve this problem, continual learning has been proposed in recent years and has made progress in image classification (Zeng et al., 2019; Qu et al., 2021). In contrast, continual object detection (COD) is rarely studied.

Knowledge distillation (Hinton et al., 2015) has proved to be an effective method for the COD task: the model trained on old classes acts as a teacher to guide the training of a student model on new classes. There are four kinds of distillation schemes: feature, classification, location and relation distillation. Most previous works combine feature distillation and classification distillation to construct their distillation methods (Li & Hoiem, 2018; Li et al., 2019; Yang et al., 2022b), while the latest work (Feng et al., 2022) combines classification distillation and location distillation into a response-based distillation method. In addition, various distillation losses, based on KL divergence, cross entropy and mean squared error, have been proposed for knowledge transfer. In summary, the keys to knowledge distillation are what knowledge should be selected from the teacher and how it is transferred to the student. The former question calls for a Knowledge Selection Strategy (KSS), the latter for a Knowledge Transfer Strategy (KTS).

Continual object detection faces two problems. (1) The teacher outputs probability distributions as logits and converts them into one-hot labels as final predictions.
Logits and one-hot labels are regarded as soft and hard knowledge, respectively. Soft knowledge contains confidence relations among categories but inevitably brings knowledge fuzziness, while hard knowledge has exactly the opposite properties. Therefore, how to design a KSS that balances accuracy against ambiguity of knowledge is a key problem. (2) Continual learning should maintain old knowledge while learning new knowledge to overcome catastrophic forgetting; how to design a KTS that balances the stability of old knowledge against the plasticity of new knowledge is therefore another key problem.

This paper focuses on how to design an effective KSS and KTS for the COD task. We demonstrate that, as long as the KSS and KTS are good enough, using only classification distillation can significantly alleviate catastrophic forgetting and improve performance. Firstly, the maximum confidence value of the logits is always lower than its corresponding one-hot value (equal to 1), which introduces knowledge ambiguity and reduces the supervisory ability of the teacher. This means soft knowledge is not completely reliable and should be used critically, a key point that previous methods ignore. Motivated by this insight, we propose an image-level hybrid knowledge representation method, named HKR, that combines instance-level soft and hard knowledge adaptively to better exploit teacher knowledge. Secondly, newly arriving data contains many labeled objects of new classes but only a few unlabeled objects of old classes, so the student tends to be dominated by the new classes and falls into catastrophic forgetting. It is therefore crucial to balance the learning of old and new classes. We propose a task regularized distillation method, named TRD, that uses the loss difference between old and new classes to prevent the student from task over-fitting. To our knowledge, we are the first to explicitly explore this imbalanced learning problem for COD.
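As a concrete illustration of the soft/hard trade-off, the following sketch blends each instance's teacher distribution with its one-hot prediction, weighted by the teacher's own confidence. The blending rule and the confidence weighting are our illustrative assumptions, not the paper's exact HKR formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_knowledge(teacher_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of a hybrid knowledge representation.

    Blends soft knowledge (the teacher's softmax distribution) with hard
    knowledge (its one-hot prediction), per instance, using the teacher's
    max confidence as the blending weight. This is an assumed rule for
    illustration; the paper's HKR may combine the two differently.
    """
    probs = F.softmax(teacher_logits, dim=-1)                   # soft knowledge
    conf, pred = probs.max(dim=-1)                              # teacher confidence
    hard = F.one_hot(pred, num_classes=probs.size(-1)).float()  # hard knowledge
    w = conf.unsqueeze(-1)                                      # per-instance weight
    # Confident teacher -> sharpen toward one-hot; uncertain -> keep soft relations.
    return w * hard + (1.0 - w) * probs
```

Because both components are valid distributions, the convex blend still sums to one per instance, so it can be substituted wherever the soft target would otherwise be used.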
Our contributions can be summarized as follows: (1) We propose a hybrid knowledge representation strategy that combines logits and one-hot predictions to make a better trade-off and selection between soft and hard knowledge. (2) We propose a task regularized distillation method as an effective knowledge transfer strategy to overcome the imbalanced learning between old and new tasks, which relieves catastrophic forgetting significantly. (3) We demonstrate that, compared with composite distillation schemes, using only classification distillation with appropriate knowledge selection and transfer strategies can also reach state-of-the-art performance on the COD task.

Knowledge Distillation for Continual Object Detection. Knowledge distillation (Hinton et al., 2015) is an effective way to transfer knowledge between models, using KL divergence, cross entropy or mean squared error as the distillation loss. Four kinds of knowledge distillation are mainly used in the COD task: feature, classification, location and relation distillation. LwF was the first to apply knowledge distillation to the Fast RCNN detector (Li & Hoiem, 2018). RILOD designed feature, classification and location distillation for the RetinaNet detector on edge devices (Li et al., 2019). SID combined feature and relation distillation for anchor-free detectors (Peng et al., 2021). Yang et al. (2022b) proposed feature and classification distillation that treats channel and spatial features differently. ERD is the latest state-of-the-art method, combining classification and location distillation (Feng et al., 2022). Most existing methods combine feature, classification and location distillation in composite and complex schemes to realize knowledge selection and transfer.
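For reference, a minimal classification distillation loss of the kind these methods build on (Hinton et al., 2015) can be written as follows; the temperature value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def classification_distillation_loss(student_logits: torch.Tensor,
                                     teacher_logits: torch.Tensor,
                                     T: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL-divergence distillation loss.

    Both distributions are softened by temperature T; the T**2 factor
    keeps the gradient magnitude comparable to that of the hard-label
    loss when the two are summed.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
```

The loss is zero when student and teacher agree exactly and grows as their softened distributions diverge.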

3.1. OVERALL ARCHITECTURE

We build our continual object detector on top of YOLOX (Ge et al., 2021). Fig. 1 shows the overall architecture. YOLOX has two independent branches as its classification and location heads. Firstly, the hybrid knowledge selection (HKS) module works after the classification head of the teacher to discover and select valuable predictions for the old classes. Secondly, the task regularized distillation (TRD) module works between the heads of the teacher and student to transfer knowledge.
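A minimal sketch of how such a task-balancing regularizer might act is given below, assuming the gap between the detached distillation (old-task) loss and the new-task loss drives a re-weighting of the distillation term. The gating rule and the `gamma` scale are hypothetical choices for illustration, not the paper's exact TRD formulation.

```python
import torch

def task_regularized_total_loss(loss_new: torch.Tensor,
                                loss_distill: torch.Tensor,
                                gamma: float = 1.0) -> torch.Tensor:
    """Illustrative task-regularized distillation objective.

    If the new-task loss falls well below the distillation (old-task)
    loss, the student is drifting toward the new classes; the detached
    loss gap then up-weights the distillation term to rebalance the two
    tasks. `gamma` is a hypothetical scale, not the paper's exact rule.
    """
    gap = (loss_distill - loss_new).detach()  # > 0 when old knowledge lags behind
    weight = 1.0 + gamma * torch.relu(gap)    # never drops below the base weight
    return loss_new + weight * loss_distill
```

Detaching the gap ensures the weighting acts purely as a scalar schedule and contributes no gradient path of its own.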



Continual Object Detection. Several schemes exist for the COD task. Li & Hoiem (2018) first proposed a knowledge distillation scheme by applying LwF to Fast RCNN (Girshick, 2015). Zheng & Chen (2021) proposed a contrastive learning scheme to strike a balance between old and new knowledge. Joseph et al. (2021b) proposed a meta-learning scheme to share optimal information across continual tasks. Joseph et al. (2021a) introduced the concept of Open World Object Detection, which integrates continual learning and open-set learning. In addition, Li et al. (2021) first studied few-shot COD. Li et al. (2019) designed a COD system for edge devices. Wang et al. (2021) presented an online continual object detection dataset. Recently, Wang et al. (2022) proposed a data compression strategy to improve the sample replay scheme of COD. Yang et al. (2022a) proposed a prototypical correlation guiding mechanism to overcome knowledge forgetting. Cermelli et al. (2022) proposed to model missing annotations to improve COD performance.

