TASK REGULARIZED HYBRID KNOWLEDGE DISTILLATION FOR CONTINUAL OBJECT DETECTION

Anonymous

Abstract

Knowledge distillation has been used to overcome catastrophic forgetting in the Continual Object Detection (COD) task. Previous works mainly focus on combining different distillation methods, including feature, classification, location and relation distillation, into a mixed scheme to solve this problem. In this paper, we propose a task regularized hybrid knowledge distillation method for the COD task. First, we propose an image-level hybrid knowledge representation that combines instance-level hard and soft knowledge so that teacher knowledge is used critically. Second, we propose a task-based regularization distillation loss that takes loss and category differences into account to better balance continual learning between old and new tasks. We find that, under appropriate knowledge selection and transfer strategies, using only classification distillation can also relieve knowledge forgetting effectively. Extensive experiments conducted on MS COCO2017 demonstrate that our method achieves state-of-the-art results under various scenarios. We obtain an absolute improvement of 27.98 in RelGap under the most difficult five-task scenario. Code is provided in the attachment and will be made available on GitHub.

1. INTRODUCTION

Existing object detection models (Ge et al., 2021) mainly adopt an overall learning paradigm, in which the annotations of all categories must be available before learning. It assumes that the data distribution is fixed or stationary (Yuan et al., 2021), while real-world data arrives dynamically with a non-stationary distribution. When a model learns from incoming data continually, new knowledge interferes with the old, leading to catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2014). To solve this problem, continual learning has been proposed in recent years and has made progress in image classification (Zeng et al., 2019; Qu et al., 2021). By contrast, continual object detection (COD) is rarely studied. Knowledge distillation (Hinton et al., 2015) has proved to be an effective method for the COD task, in which the model trained on old classes acts as a teacher to guide the training of a student model on new classes. There are four kinds of distillation schemes: feature, classification, location and relation distillation. Most previous works combine feature distillation and classification distillation to construct their distillation methods (Li & Hoiem, 2018; Li et al., 2019; Yang et al., 2022b), while the latest work (Feng et al., 2022) combines classification distillation and location distillation into a response-based distillation method. In addition, various distillation losses, based on KL divergence, cross entropy and mean square error, have been proposed for knowledge transfer. In summary, the keys to knowledge distillation are what knowledge should be selected from the teacher and how it is transferred to the student. The former question needs a Knowledge Selection Strategy (KSS), while the latter needs a Knowledge Transfer Strategy (KTS). Continual object detection faces two problems. (1) The teacher outputs probability distributions as logits and converts them into one-hot labels as final predictions.
Logits and one-hot labels are regarded as soft and hard knowledge, respectively. Soft knowledge contains confidence relations among categories, but inevitably brings knowledge fuzziness; hard knowledge has the completely opposite effects. Therefore, how to design a KSS that balances accuracy and ambiguity of knowledge is a key problem. (2) Continual learning should maintain old knowledge while learning new knowledge to overcome catastrophic forgetting; therefore, how to design a KTS that balances stability of old knowledge and plasticity of new knowledge is a key problem. This paper focuses on how to design an effective KSS and KTS for the COD task. We demonstrate that as long as the KSS and KTS are good enough, using only classification distillation can also significantly alleviate catastrophic forgetting and improve performance. Firstly, the maximum confidence value of the logits is always lower than its corresponding one-hot value (equal to 1), which brings knowledge ambiguity and reduces the supervisory ability of the teacher. This means soft knowledge is not completely reliable and should be used critically. However, previous methods ignore this key point. Motivated by this insight, we propose an image-level hybrid knowledge representation method, named HKR, which combines instance-level soft and hard knowledge adaptively to improve the exploitation of teacher knowledge. Secondly, newly arriving data contains massive labeled objects of new classes but only a few unlabeled objects of old classes, so the student tends to be dominated by the new classes and falls into catastrophic forgetting. It is therefore very important to balance the learning of old and new classes. We propose a task regularized distillation method, named TRD, which uses the loss difference between old and new classes to effectively prevent the student from task over-fitting. To our knowledge, we are the first to explicitly explore the imbalanced learning problem for COD.
Our contributions can be summarized as follows: (1) We propose a hybrid knowledge representation strategy that combines logits and one-hot predictions to make a better trade-off and selection between soft knowledge and hard knowledge. (2) We propose a task regularized distillation method as an effective knowledge transfer strategy to overcome the imbalanced learning between old and new tasks, which relieves catastrophic forgetting significantly. (3) We demonstrate that, compared with composite distillation schemes, using only classification distillation with appropriate knowledge selection and transfer strategies can also reach the state-of-the-art performance on the COD task.

2. RELATED WORKS

Continual Object Detection. There are several schemes for the COD task. Li & Hoiem (2018) first proposed a knowledge distillation scheme by applying LwF to Fast RCNN (Girshick, 2015). Zheng & Chen (2021) proposed a contrastive learning scheme to strike a balance between old and new knowledge. Joseph et al. (2021b) proposed a meta-learning scheme to share optimal information across continual tasks. Joseph et al. (2021a) introduced the concept of Open World Object Detection, which integrates continual learning and open-set learning. In addition, Li et al. (2021) first studied few-shot COD. Li et al. (2019) designed a COD system for edge devices. Wang et al. (2021) presented an online continual object detection dataset. Recently, Wang et al. (2022) proposed a data compression strategy to improve the sample replay scheme of COD. Yang et al. (2022a) proposed a prototypical correlation guiding mechanism to overcome knowledge forgetting. Cermelli et al. (2022) proposed to model the missing annotations to improve COD performance.

Knowledge Distillation for Continual Object Detection. Knowledge distillation (Hinton et al., 2015) is an effective way to transfer knowledge between models, with KL divergence, cross entropy or mean square error as the distillation loss. There are mainly four kinds of knowledge distillation used in the COD task: feature, classification, location and relation distillation. LwF was the first to apply knowledge distillation to the Fast RCNN detector (Li & Hoiem, 2018). RILOD designed feature, classification and location distillation for the RetinaNet detector on edge devices (Li et al., 2019). SID combined feature and relation distillation for anchor-free detectors (Peng et al., 2021). Yang et al. (2022b) proposed a feature and classification distillation method that treats channel and spatial features differently. ERD is the latest state-of-the-art method, combining classification and location distillation (Feng et al., 2022).
Most existing methods combine feature, classification and location distillation in composite and complex schemes to realize knowledge selection and transfer.

3.1. OVERALL ARCHITECTURE

We build our continual object detector on top of YOLOX (Ge et al., 2021). Fig. 1 shows its overall architecture. YOLOX has two independent branches as its classification and location heads. Firstly, the hybrid knowledge representation (HKR) module works after the classification head of the teacher to discover and select valuable predictions for old classes. Secondly, the task regularized distillation (TRD) module works between the heads of the teacher and student to transfer knowledge.

3.2. HYBRID KNOWLEDGE REPRESENTATION

The teacher outputs probability distributions as logits and converts them to one-hot labels as final predictions. Logits are regarded as soft knowledge, while one-hot labels are regarded as hard knowledge. Hinton et al. (2015) shows that soft knowledge is better than hard knowledge for classification distillation. However, although soft knowledge reflects more between-class information than hard knowledge, it also inevitably brings fuzziness to the knowledge, which confuses the student during distillation learning. Meanwhile, teacher confidence reflects knowledge quality. If the teacher has high confidence in its predictions, we should further strengthen this trend so that the student can feel the certainty of this knowledge. Conversely, if the teacher has low confidence, we should not do so. Therefore, the key problem is how to evaluate the quality of the soft knowledge from the teacher. We propose to evaluate soft knowledge according to the confidence difference between the maximum value and the second maximum value of the teacher logits. Given a batch of images, the teacher outputs a batch of logits for potential objects. For each logit vector in the batch, if the difference between the maximum confidence and the second maximum confidence is larger than a threshold, the knowledge quality of this logit vector is regarded as high, otherwise as vanilla. High quality knowledge is represented as a one-hot prediction, while vanilla knowledge is represented as a soft prediction. We compute the mean value of the confidence differences across the entire batch as the threshold to judge knowledge quality adaptively. We formalize the description above as follows:

ConfDiff = Conf_max − Conf_secmax    (1)

quality = [ ConfDiff > (1/N) Σ_{i=1}^{N} ConfDiff_i ]    (2)

Hybrid = quality · Onehot + (1 − quality) · Soft    (3)

where Conf ∈ R^{N×C} refers to a batch of logits predictions with batch size N and C categories.
ConfDiff ∈ R^{N×1} gives, for each logit vector, the difference between its maximum and second maximum confidence. N and i refer to the total number of logit vectors and the i-th logit vector, respectively. (1/N) Σ_{i=1}^{N} ConfDiff_i is the threshold used to judge knowledge quality. quality, defined in Eq. 2, is a Boolean vector indicating knowledge quality. The Hybrid predictions can then be computed via Eq. 3 by combining the Onehot and Soft predictions. Our method thus combines soft and hard knowledge dynamically to form a hybrid knowledge representation for every input image.
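A minimal PyTorch sketch of the HKR computation (Eqs. 1-3) might look as follows; the function name and the assumption that the teacher scores arrive as an (N, C) probability tensor are ours, not taken from the paper's released code:

```python
import torch

def hybrid_knowledge(logits: torch.Tensor) -> torch.Tensor:
    """Combine soft (logits) and hard (one-hot) teacher knowledge (Eqs. 1-3).

    logits: (N, C) batch of teacher class confidences.
    Returns an (N, C) batch of hybrid targets.
    """
    soft = logits
    # Hard knowledge: one-hot vector at the argmax class of each row.
    onehot = torch.zeros_like(soft).scatter_(
        1, soft.argmax(dim=1, keepdim=True), 1.0)
    # Eq. 1: gap between the top-1 and top-2 confidences per row.
    top2 = soft.topk(2, dim=1).values
    conf_diff = top2[:, 0] - top2[:, 1]                            # (N,)
    # Eq. 2: batch mean of the gaps serves as an adaptive threshold.
    quality = (conf_diff > conf_diff.mean()).float().unsqueeze(1)  # (N, 1)
    # Eq. 3: high-quality rows become one-hot, the rest stay soft.
    return quality * onehot + (1.0 - quality) * soft
```

Rows whose top-1/top-2 gap exceeds the batch mean are sharpened to one-hot targets; all other rows keep the teacher's soft distribution.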

3.3. TASK REGULARIZED DISTILLATION

The learning loss of the student in the COD task is defined in Eq. 6. The new task loss (Loss_new, Eq. 5) is supervised by the ground-truth of the new classes. The old task loss (Loss_old, Eq. 4) is supervised by one-hot or soft targets from the teacher. Loss_cls and Loss_loc are the same as in official YOLOX, i.e., cross entropy loss and IoU loss with coefficients α = 1 and β = 5. The task balance factor γ is set to 1 by default.

Loss_old = α · Loss_cls + β · Loss_loc    (4)
Loss_new = α · Loss_cls + β · Loss_loc    (5)
Loss_total = Loss_new + γ · Loss_old    (6)

Continual learning is easily affected by the data proportion of old and new tasks. If the data proportion of the new task is too large, the student will be dominated by the new task loss and forget old knowledge. Conversely, the student will gain much more stability on old knowledge but lack the plasticity to accept new knowledge. Therefore, the key problem of distillation learning is to keep a balance between old and new tasks. Motivated by this insight, we propose a task regularized distillation (TRD) method to solve the imbalanced learning problem. TRD consists of two parts, a task equal loss and a task difference loss, formulated as follows:

Loss*_old = (2 · Loss_new / (Loss_old + Loss_new)) · Loss_old    (7)
Loss*_new = (2 · Loss_old / (Loss_old + Loss_new)) · Loss_new    (8)
Loss*_diff = (N_new / N_old)^2 · (Loss_old − Loss_new)^2    (9)
Loss*_total = Loss*_new + Loss*_old + Loss*_diff    (10)

where Loss_old and Loss_new are defined in Eqs. 4 and 5, and N_old and N_new refer to the number of old and new classes, respectively. Loss*_old and Loss*_new are the newly defined losses for the old and new tasks, in which two dimensionless coefficients play the role of cross-balancing factors.
Loss*_old and Loss*_new are always equal to each other throughout continual learning, which ensures a completely dynamic balance between old and new tasks regardless of their data imbalance. Loss*_diff measures the loss difference and the category proportion between old and new tasks, which further contributes to balanced learning. Loss*_total is the final formulation of the TRD method. Compared with Eq. 6, TRD emphasizes task balance explicitly by introducing a task-balancing penalty term (Eq. 9) and prevents the student from over-fitting to either task. Fig. 5 compares their loss curves.
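Eqs. 7-10 can be sketched in a few lines of PyTorch. This is a hypothetical implementation, not the paper's published code; how Loss_old and Loss_new are aggregated over a batch is assumed to happen before the call:

```python
import torch

def trd_loss(loss_old: torch.Tensor, loss_new: torch.Tensor,
             n_old: int, n_new: int) -> torch.Tensor:
    """Task regularized distillation loss (Eqs. 7-10).

    loss_old, loss_new: scalar losses of the old and new tasks.
    n_old, n_new: number of old and new classes.
    """
    total = loss_old + loss_new
    # Eqs. 7-8: cross-balancing factors make the two rescaled losses equal.
    loss_old_star = 2.0 * loss_new / total * loss_old
    loss_new_star = 2.0 * loss_old / total * loss_new
    # Eq. 9: penalty on the loss gap, scaled by the class-count ratio.
    loss_diff = (n_new / n_old) ** 2 * (loss_old - loss_new) ** 2
    # Eq. 10: final TRD objective.
    return loss_new_star + loss_old_star + loss_diff
```

Note that loss_old_star and loss_new_star both equal 2·Loss_old·Loss_new/(Loss_old+Loss_new), which is the dynamic balance property described above.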

4.1. IMPLEMENTATION DETAILS

We build our continual detector on top of YOLOX (Ge et al., 2021). It is a typical one-stage anchor-free detector in the famous and widely used YOLO series, which makes it a representative testbed for our method. YOLOX uses CSPNet (Wang et al., 2020) as its backbone, which needs to be trained from scratch along with the detection heads for 300 epochs. In general COD settings (Li et al., 2019), a 1x training schedule with 12 epochs and a frozen backbone is widely used. We therefore replace CSPNet with a ResNet backbone (He et al., 2016) for 12-epoch training, which is pretrained on ImageNet and frozen during continual learning. Official YOLOX adopts very strong data augmentation, including Mosaic, MixUp, Photo Metric Distortion, EMA, Random Affine, etc., to boost its performance, but we drop these tricks to reduce randomness. We keep the other components and hyper-parameters of YOLOX unchanged. The modified YOLOX, denoted ilYOLOX, is used for continual object detection. The ilYOLOX trained on the old task serves as the teacher to guide the next learning step of the student on the new task.

4.2. DATASETS AND EVALUATION METRICS

MS COCO2017 (Chen et al., 2015) is used to build benchmarks, with the train set for training and the minival set for testing. We split the dataset of 80 categories into several subsets in alphabetical order. Each subset is a continual learning task. For example, the 40+20+20 scenario means the data is split into three subsets containing 40, 20 and 20 categories respectively. The standard COCO protocols, including AP, AP50, AP75, APS, APM and APL, are used to evaluate detection performance. To better evaluate COD performance, we use the following metrics. (1) AbsGap and RelGap: the absolute and relative gaps between the task-average mAP of continual learning and the mAP of overall learning, respectively. The mAP of overall learning with the baseline detector is usually considered the Upper Bound mAP (denoted Upper or Upper Bound) of continual learning. RelGap equals the ratio of AbsGap to the Upper Bound mAP. (2) F1_b and F2_b, defined in Eq. 11(a) and Eq. 11(b) respectively, are proposed to evaluate the balance between old and new tasks. (3) Omega (Ω), defined in Eq. 12 (Hayes et al., 2018), evaluates the cumulative capability of continual learning task by task. Similar to the COCO protocols, Ω can be extended to Ω_all, Ω_50, Ω_75, Ω_S, Ω_M and Ω_L. RelGap and Ω eliminate the influence of the upper bound of the baseline detector, providing a fair comparison among different continual learning methods.

(a) F1_b = |mAP_old − mAP_new| / (mAP_old + mAP_new)    (b) F2_b = 2 · sqrt(mAP_old × mAP_new) / (mAP_old + mAP_new)    (11)

Ω = (1/T) Σ_{t=1}^{T} mAP_continual,t / mAP_overall,t    (12)

where T and t are the number of total tasks and the t-th task in continual learning. mAP_continual,t and mAP_overall,t denote the task-average mAP and the upper-bound mAP on all test data containing the learned categories after task t, respectively. The larger the Ω metric, the better the ability to reduce cumulative knowledge forgetting.
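The metrics in Eqs. 11-12 are straightforward to compute; the helper names below are illustrative:

```python
import math

def f1_balance(map_old: float, map_new: float) -> float:
    """Eq. 11(a): normalized absolute imbalance (lower is better)."""
    return abs(map_old - map_new) / (map_old + map_new)

def f2_balance(map_old: float, map_new: float) -> float:
    """Eq. 11(b): geometric-to-arithmetic mean ratio (higher is better)."""
    return 2.0 * math.sqrt(map_old * map_new) / (map_old + map_new)

def omega(map_continual: list[float], map_overall: list[float]) -> float:
    """Eq. 12: mean ratio of continual mAP to upper-bound mAP over T tasks."""
    return sum(c / o for c, o in zip(map_continual, map_overall)) / len(map_continual)
```

For perfectly balanced tasks (mAP_old = mAP_new), F1_b is 0 and F2_b is 1, its maximum; Ω reaches 1 only when continual learning matches the upper bound after every task.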

4.3. EXPERIMENT SETUP

We design the following scenarios as benchmarks. Given a scenario, we continually train ilYOLOX task by task (step by step) under the following settings. The optimizer is SGD with 1500 warm-up iterations, a continual learning rate of 0.2 decayed by 10% at the 8th and 11th epochs respectively, a momentum of 0.9 and a weight decay of 0.0005. All experiments are performed on 8 NVIDIA 3090 GPUs with a total batch size of 16×8. All images are randomly resized to [640 × 320, 640 × 640] by their short sides with content aspect ratios unchanged. Normalization and random horizontal flips with a probability of 50% are also used.

The results in Table 2 and Table 3 also show the same trend. We further plot the Ω_all curves in Fig. 2 to highlight our advantages. These results fully demonstrate the much better continual learning capacity of our method. To analyze the influence of the upper bound mAP, we run extra experiments using YOLOX-medium. The experiments in Table 9 (see Appendix A) demonstrate that our method obtains much better performance, with a very large improvement, under a higher upper bound mAP. This further demonstrates its potential.
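For illustration, the SGD optimizer and step schedule described in the setup above might be configured in PyTorch roughly as follows. The placeholder module and the decay factor of 0.1 are assumptions, and the 1500-iteration warm-up is omitted:

```python
import torch

# Hypothetical stand-in for the trainable heads of ilYOLOX.
model = torch.nn.Linear(256, 80)

# SGD as described in the setup: lr 0.2, momentum 0.9, weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=5e-4)

# Decay the learning rate at epochs 8 and 11 (a factor of 0.1 is assumed here).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[8, 11], gamma=0.1)
```

Calling `scheduler.step()` once per epoch would then drop the learning rate from 0.2 to 0.02 at epoch 8 and to 0.002 at epoch 11 under the assumed factor.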

4.4.2. BALANCE LEARNING ABILITY

For continual learning, the balance between stability of old knowledge and plasticity of new knowledge is very important. Table 3 compares the results of the ERD method and our method under the F1_b and F2_b metrics. Our method strikes a better balance between the mAP of old and new tasks, reflecting its good balance between stability of old knowledge and plasticity of new knowledge.

4.4.3. KNOWLEDGE FORGETTING

Figure 4: The continual learning curves under the Four-Task scenario (20+20+20+20), showing that the knowledge forgetting of old tasks is always kept within a limited range during the learning process.

Fig. 4 reports the results under the Four-Task scenario, which clearly shows the detailed continual learning process of ilYOLOX on successive data streams. The AbsGap (plotted as white fluctuation columns at each learning epoch) shows that our method gradually and effectively reduces the performance gap between continual learning and overall learning. The mAP curves of each task (denoted Task_i, i = 1, 2, 3, 4) reflect their knowledge forgetting process. On the whole, the knowledge forgetting of old tasks is always kept within a limited range during successive continual learning. Meanwhile, the TaskAvg mAP after each learning step, marked on the corresponding light-blue curve, also shows very good stability (28.71%, 28.20%, 28.16%, 28.11%). The results in Fig. 4 further demonstrate the effectiveness of our method.

5. ABLATION STUDY

The Independence and Compatibility of HKR and TRD. Table 4 shows the results of the ablation experiments. The two baseline methods, denoted Onehot and Soft, use one-hot predictions and soft predictions (logits) as teacher knowledge, respectively. We then add the hybrid knowledge representation module and the task regularized distillation module to the Soft baseline separately (see Fig. 1), with results denoted Soft+HKR and Soft+TRD. Finally, we add both HKR and TRD to the Soft baseline simultaneously, with results denoted Soft+Both. Under both scenarios, the results show that soft knowledge is better than hard knowledge, but both are inferior to hybrid knowledge. Compared with the Soft baseline, TRD brings a larger performance improvement than HKR. This demonstrates that HKR and TRD each have their own effect as two independent components. Meanwhile, the results of Soft+Both (i.e., Soft+HKR+TRD) show a further significant improvement, demonstrating that HKR and TRD have good additivity and compatibility.

HKR under the Three-Task Scenario. To further analyze hybrid knowledge representation, we run additional ablation studies under the Three-Task scenario. The results in Table 5 clearly demonstrate the effectiveness of HKR. They further show that hybrid knowledge can fully exploit the advantages of both soft and hard knowledge in an adaptive manner.

Task Balance During Continual Learning. To further analyze the influence of task balance on continual learning, we run experiments varying the task balance factor (γ in Eq. 6) from 0.2 to 3.0. The results in Table 6 show that our TRD method obtains a medium mAP on the new task but the highest performance under all other metrics. Brown-marked digits show the sub-optimal results. When γ changes from 0.2 to 3.0, the mAP values in Table 6 change noticeably. A similar experiment on another small dataset (see the appendix) is shown in Fig. 6.
It shows that even a very small change of γ from 1 to 0.9 can lead to a dramatic descent of the old task mAP curve (blue dotted line) and bring catastrophic knowledge forgetting. Obviously, the task balance factor (γ) has a significant influence on continual learning by controlling knowledge transfer.

Table 6: Continual learning results under the Two-Task scenario for the ablation study. Hyper-parameter γ is the task balance factor (see Eq. 6). TRD is the task regularized distillation method (see Eq. 10).


6. CONCLUSION

To improve the performance of continual object detection, we propose a knowledge distillation method that effectively combines a knowledge selection strategy and a knowledge transfer strategy. For the first strategy, hard knowledge and soft knowledge are dynamically and adaptively combined into a hybrid knowledge representation that uses teacher knowledge critically and effectively. For the second strategy, the loss difference and the category proportion are combined into a task regularized distillation loss that enhances balanced task learning. Extensive experiments under different scenarios validate the effectiveness of our method. Most existing methods for the COD task adopt a mixed distillation scheme including feature, classification, location and relation distillation to relieve catastrophic forgetting. However, we demonstrate that as long as the knowledge selection and transfer strategies are appropriate, even classification distillation alone can achieve state-of-the-art performance. In addition, our method has good prospects for working together with feature and location distillation for further improvements. More analyses and discussions are provided in Appendix A.

Task Regularized Distillation. Regularization is very important for statistical machine learning, as it can prevent a model from over-fitting to some part of the data. Classical regularization methods introduce a weight constraint in terms of a p-norm as a model penalty. Other regularization methods include stopping training as soon as performance on a validation set starts to get worse, and soft weight sharing (Nowlan & Hinton, 1992). Dropout works by dropping some connections randomly to prevent over-fitting of deep neural networks (Srivastava et al., 2014). In continual learning, the model must be trained task by task on a continuous data stream, which brings task-based imbalanced learning and leads to catastrophic forgetting.
A few previous works propose regularization-based methods for continual image classification (Kirkpatrick et al., 2017; Li & Hoiem, 2018). We give a solution by proposing a task-based regularized distillation loss (Eq. 10) for COD tasks. It explicitly uses the loss and category differences between old and new tasks as a model penalty to constrain the optimization process. The experiments in Table 4, Table 6, Table 8, Fig. 5 and Fig. 6 all show its strong effectiveness in preventing the continual model from over-fitting to any single task.

A.4 MORE DETAILS OF IMPLEMENTATION AND EXPERIMENTS

All the previous works, including the Catastrophic Forgetting, RILOD (Li et al., 2019), SID (Peng et al., 2021) and ERD (Feng et al., 2022) models, are implemented based on GFLv1 (Zheng et al., 2021) with an original image size of 1333×800. Since they use the same base detector, the upper bound mAP of these previous models is 40.20 for A(1-80), the overall training on all 80 categories. In contrast, our experiments are implemented based on the YOLOX-small detector (Ge et al., 2021). YOLOX is a typical one-stage anchor-free detector in the famous and widely used YOLO series, which makes it a representative testbed for our method. The official YOLOX implementation in MMDetection adopts a 300-epoch training schedule with strong data augmentation, including Mosaic, MixUp, Photo Metric Distortion, EMA, Random Affine, Random Horizontal Flip, etc., to boost its performance. We drop these tricks to reduce randomness for continual learning and for better reproducibility. The modified YOLOX, denoted ilYOLOX, is used for continual object detection. Since our experiments are conducted to demonstrate model effectiveness, we randomly resize all images to [640 × 320, 640 × 640] by their short sides with content aspect ratios unchanged. The upper bound mAP of our model is therefore 37.13 for A(1-80). Although the models use different baseline detectors, the evaluation metrics RelGap and Ω allow a fair comparison of their continual learning capability. We also analyze the influence of the upper bound mAP by comparing different base detectors: YOLOX-small versus YOLOX-medium. The experiments in Table 9 show that a higher upper bound mAP leads to better continual learning performance, which further demonstrates the potential of our method.
Here, we bring continual object detection research to the widely used lightweight YOLOX model to reduce computing load and resource consumption, accelerating research on this topic and protecting our environment.



See YOLOX in https://github.com/open-mmlab/mmdetection



Figure 1: The overall architecture of our continual detector ilYOLOX. Cls and Loc refer to classification and location respectively. Hybrid Knowledge Representation (HKR, Eq. 3) and Task Regularized Distillation (TRD, Eq. 10) are our two proposed components, playing the roles of the knowledge selection and knowledge transfer strategies respectively. During the learning of the student on a new task, the teacher is frozen and outputs predictions for the old task. ilYOLOX achieves SOTA performance using only classification distillation, without any feature or location distillation.

(i) Two-Task scenarios: 40+40, 50+30, 60+20 and 70+10 settings. (ii) Three-Task scenario: 40+20+20 setting, with 20 new classes added to the previously learned classes at each learning step. (iii) Four-Task scenario: 20+20+20+20 setting, with 20 new classes added at each learning step. (iv) Five-Task scenario: 40+10+10+10+10 setting, with 10 new classes added at each learning step. We denote A(a-b) as the first-step overall learning on classes a-b, and +B(c-d) as the successive continual learning steps on classes c-d.

Figure 2: The performance of multi-task continual learning on MS COCO2017

Compared across different γ values, the mAP on the old 70 classes shows that TRD relieves knowledge forgetting to the greatest extent. Meanwhile, TRD shows a much better balanced learning ability between old and new tasks, with the minimum F1_b and the maximum F2_b. Overall, our TRD realizes dynamically balanced knowledge transfer from teacher to student by introducing task-based regularization. Moreover, TRD is a statistically adaptive method without hyper-parameters.

Figure 5: The training losses of the old and new tasks at γ = 0.8 and with TRD (see Table 6), respectively.

Figure 6: The catastrophic forgetting caused by imbalanced learning of old and new tasks.

Table 1, Table 2 and Table 3 report the results of the Two-Task, Three-Task and Five-Task scenarios respectively.

Continual learning results under Two-Task scenarios. Our experiments are implemented with YOLOX-small, while all the others are implemented with GFLv1.

Continual learning results under the Five-Task scenario of 40+10+10+10+10. A(a-b) is the first-step normal training on categories a-b and +B(c-d) is the successive continual training on categories c-d. We use A(1-40) as the first step.

Continual learning results under the Three-Task scenario of 40+20+20.

Continual learning results under Two-Task scenarios for the ablation study. We equip YOLOX-small with two knowledge selection strategies (Onehot and Soft) as our baselines.

Continual learning results under the Three-Task scenario (40+20+20) for the ablation study.

Results on the COCO benchmark with different upper bound mAPs.

Algorithm of Hybrid Knowledge Representation
Input: unlabeled image I, teacher detector θ′
Output: hybrid predictions Hybrid of the response
1: Inference on I with θ′ yields the logits predictions Soft and one-hot predictions Onehot

A APPENDIX

A.1 COMPARED WITH KL DIVERGENCE LOSS

The Kullback-Leibler divergence loss (denoted KLD loss) is used for knowledge distillation in image classification (Hinton et al., 2015). YOLOX (Ge et al., 2021) uses a cross entropy loss (denoted CE loss) for its classification head. Table 7 compares these two losses on continual object detection. The experiments adopt the same loss weight setting, α = 1 and β = 5 (see Eqs. 4 and 5), for both losses. T is the temperature factor, a hyper-parameter of the KLD loss. As the temperature T changes from 1 to 5, ilYOLOX reaches its best performance (marked in brown) at the medium temperature of 3. However, the CE loss performs much better than the KLD loss (marked by underline). Our TRD loss exceeds both the KLD and CE losses. Using the KLD loss usually requires careful adjustment of the temperature factor T and the loss weight α. However, changing α in Loss_old would destroy the consistency of the classification and location losses between the old task (Loss_old, Eq. 4) and the new task (Loss_new, Eq. 5), which would affect task balance during continual learning. Based on this consideration and the experimental results in Table 7, we use the cross entropy loss as our fundamental knowledge transfer strategy.

We describe the experiments in Fig. 6 in detail as follows. They are conducted on a small dataset consisting of 3800 images and 9 classes (commonly seen toys including car, truck, train, person and so on), with 400-500 instances per class and a relatively balanced category distribution. We build a Two-Task scenario of 5 classes + 4 classes as our continual object detection benchmark. The experimental results are given in Table 8. When the task balance factor (γ, defined in Eq. 6) changes from 1.3 to 0.7, the mAP of the old task (the old 5 classes) drops quickly from 54.24% to 5.10%.
This fully demonstrates that the loss imbalance between old and new tasks can bring significant catastrophic forgetting during continual learning.
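For reference, a generic temperature-scaled KLD distillation loss of the kind compared in A.1 can be sketched as follows; this is the standard formulation after Hinton et al. (2015), not the paper's exact implementation, and the head shapes of ilYOLOX are not reproduced:

```python
import torch
import torch.nn.functional as F

def kld_distill_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     T: float = 3.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```

This is where the hyper-parameter sensitivity discussed above enters: both T and the loss weight must be tuned jointly, which is one reason the paper falls back on cross entropy.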

A.3 DISCUSSION

Hybrid Knowledge Representation. The teacher outputs logits and one-hot representations as its predictions. Soft logits contain more information about between-class confidences and are regarded as a kind of soft knowledge (Hinton et al., 2015; Yang et al., 2022c). Knowledge distillation methods have flourished under this background in both image classification and object detection tasks. But there are significant differences between the two tasks. In image classification, since an input image contains only one instance, the knowledge of the image is just the knowledge of its instance. In object detection, however, one input image contains several instances, so the knowledge of an image is a collection of the knowledge of all instances in it. Instances in an image can be regarded as entity nodes, so the overall image knowledge should be constructed from the entity nodes and their relations. In other words, image-level knowledge is the combination of instance-level knowledge. Logits and one-hot predictions provide two kinds of instance-level knowledge representation, as soft and hard knowledge respectively. Soft knowledge contains confidence relations among categories but inevitably brings knowledge fuzziness; hard knowledge has the completely opposite effects. By combining their advantages, we construct an image-level hybrid knowledge representation (Eq. 3) that shows better performance on the COD task in Table 4 and Table 5.

