TIB: DETECTING UNKNOWN OBJECTS VIA TWO-STREAM INFORMATION BOTTLENECK

Abstract

Detecting diverse objects, including ones never seen during model training, is critical for the safe application of object detectors. To this end, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect unknown objects without relying on an auxiliary dataset. For this task, it is important to reduce the impact of lacking unknown data for supervision and to leverage in-distribution (ID) data to improve the model's discrimination ability. In this paper, we propose a method of Two-Stream Information Bottleneck (TIB), which consists of a standard Information Bottleneck (IB) and a dedicated Reverse Information Bottleneck (RIB). Specifically, after extracting the features of an ID image, we first define a standard IB network to disentangle instance representations that are beneficial for localizing and recognizing objects. Meanwhile, we present RIB to obtain simulative OOD features that alleviate the impact of lacking unknown data. Different from the standard IB, which aims to extract task-relevant compact representations, RIB aims to obtain task-irrelevant representations by reversing the optimization objective of the standard IB. Next, to further enhance the discrimination ability, a mixture of information bottlenecks is designed to sufficiently capture object-related information. In the experiments, our method is evaluated on OOD-OD and incremental object detection. The significant performance gains over baselines show the superiority of our method.

1. INTRODUCTION

With the resurgence of deep neural networks, many advances in object detection have been achieved Ren et al. (2015); Redmon et al. (2016); Carion et al. (2020); Chen et al. (2022). Most existing methods follow a closed-set assumption that the training and testing processes share the same category space. However, practical scenarios are open and filled with unknown objects, presenting significant challenges for object detectors trained under the closed-set assumption. To this end, a task of unsupervised out-of-distribution object detection (OOD-OD) Du et al. (2022b) has recently been proposed, whose goal is to accurately detect objects never seen before during training without accessing any auxiliary data. Addressing this task helps promote the safe deployment of object detectors in real scenes, e.g., autonomous driving.

The main challenge of unsupervised OOD-OD is the lack of supervision signals from OOD data during training Du et al. (2022b). In particular, as shown in the left part of Fig. 1, an object detector is typically optimized only on the in-distribution (ID) data. During inference, the detector can accurately localize and recognize ID objects but easily produces overconfident incorrect predictions for OOD objects. The reason is that the object detector cannot learn a clear discrimination boundary between ID objects and OOD objects when OOD data for supervision is lacking. Thus, for this task, one feasible solution is to extract simulative OOD data from the ID data and use it to improve the discrimination ability of the object detector. To obtain simulative OOD data, it is common to leverage generative methods, e.g., generative adversarial networks Lee et al. (2018a) and mixup Zhang et al. (2018), to synthesize OOD images. Though these methods have been demonstrated to be effective, using a large number of synthesized images may increase computational costs.
Meanwhile, it is difficult for synthesized images to cover the overall object space, which may weaken the discrimination performance for certain unknown objects.

Figure 1: Two-Stream Information Bottleneck for OOD-OD. 'RPN' is a Region-Proposal Network with RoI Alignment. The green boxes are OOD objects. The red and black lines separately indicate the decision boundary between ID and OOD objects and that between ID objects belonging to different categories. Due to the lack of unknown data for supervision, a traditional object detector cannot distinguish ID objects from OOD objects effectively. Our method generates simulative OOD features by maximizing the prediction discrepancy between the features extracted by the IB module and those extracted by the RIB module, which enhances the discrimination ability.

In this paper, we explore employing the Information Bottleneck (IB) Tishby et al. (2000); Alemi et al. (2017) to obtain a series of simulative OOD features for training. Particularly, we propose a method of Two-Stream Information Bottleneck (TIB) to improve the discrimination ability of the object detector, which mainly consists of a standard Information Bottleneck and a dedicated Reverse Information Bottleneck (RIB). Specifically, as shown in the right part of Fig. 1, given an ID image as the input, a backbone network, e.g., ResNet He et al. (2016), is used to extract the corresponding representations. Then, a standard variational IB Alemi et al. (2017) is defined to decompose an Instance map from the backbone representations, which is instrumental in localizing and recognizing objects accurately. Besides, the standard IB strives to extract maximally compressed features of the input while preserving as much task-relevant information as possible Lee et al. (2021). In contrast, OOD features can be considered irrelevant to the current task.
Thus, we present RIB to obtain an OOD map for extracting task-irrelevant representations by reversing the optimization objective of the standard IB. Concretely, by maximizing the discrepancy between the predictions from the Instance map and those from the OOD map while simultaneously minimizing the classification loss, the OOD map is promoted to contain plentiful object-irrelevant information, which is beneficial for extracting simulative OOD features and improves the discrimination ability. Furthermore, recent research Schulz et al. (2020) has shown that IB is an effective mechanism for capturing object information. Inspired by this idea, we design a mixture of information bottlenecks to purify object-related information from multiple different facets. Finally, by combining this information, the discrimination ability can be further enhanced. In the experiments, our method is separately evaluated on OOD-OD and incremental object detection Kj et al. (2021). Extensive experimental results demonstrate the superiority of our method. The contributions of our work are summarized as follows:

• We propose a method of Two-Stream Information Bottleneck consisting of a standard IB and a dedicated RIB. Particularly, RIB aims to obtain simulative OOD features by maximizing the prediction discrepancy between ID features and OOD features, which reduces the impact of lacking unknown data for supervision.

• We design a mixture of information bottlenecks to purify object-related information from multiple different facets, which enhances the object-related information in the features used for classification and improves the detection performance.

• Experimental results show that our method effectively improves the performance of OOD-OD and incremental object detection. Particularly, for PASCAL VOC Everingham et al. (2010), compared with the baseline method Du et al. (2022b), our method significantly reduces FPR95 by around 10.42%.

2. RELATED WORK

Information bottleneck. Recent research has shown that IB Tishby et al. (2000) is a promising mechanism to reveal the principles of neural networks through the lens of the information stored in encoded representations of inputs. Given two random variables X and Y, the optimization objective of IB can be described as

max_T I(T; Y) − β I(T; X), (1)

where I(T; X) and I(T; Y) are the mutual information of the representation T with the inputs X and the labels Y, respectively, and β controls the trade-off. Under this objective, the intermediate representation T is promoted to contain compact task-relevant information. Recently, some works Ahuja et al. (2021); Li et al. (2022) have shown that IB is helpful for extracting domain-invariant representations, which improves the generalization ability. Besides, other works Schulz et al. (2020); Kim et al. (2021) indicate that IB can capture object-related information, which is beneficial for boosting performance and enhancing interpretability. In this paper, we exploit IB to obtain simulative OOD features for training and design a mixture of information bottlenecks to further enhance the object-related information. Extensive experiments on OOD-OD and incremental object detection show that our method is instrumental in improving the discrimination ability.

3. TWO-STREAM INFORMATION BOTTLENECK

For unsupervised OOD-OD, the object detector is trained based on the ID data {X, Y, B}, where X denotes the set of ID images, Y is the label set, and B indicates the location information. During inference, given an image x* that may include OOD objects, the trained object detector should accurately distinguish ID objects (the output is 1) from OOD objects (the output is 0).

3.1. OBJECT-RELATED INFORMATION EXTRACTION

Since object detection involves two subtasks, i.e., object localization and classification, for OOD-OD, the model should first localize both OOD objects and ID objects. Then, the model should accurately distinguish ID objects from OOD objects while correctly classifying the ID objects. To attain this goal, it is important to extract plentiful object-related information. As IB has the advantage of capturing compact task-relevant information Alemi et al. (2017), we exploit IB to compress the object-irrelevant information (e.g., background information) in the extracted features, which is beneficial for improving the discrimination performance.

Concretely, as shown in Fig. 2, we follow the baseline work Du et al. (2022b) and adopt the widely used Faster R-CNN Ren et al. (2015) as the basic detection model. Given an input image, we first employ the backbone network, e.g., ResNet He et al. (2016), to extract the corresponding feature map F ∈ R^{w×h×c}, where w, h, and c denote the width, height, and number of channels. To obtain rich object-related information, we exploit the constraint of the variational information bottleneck Alemi et al. (2017) to further encode the feature map F. Specifically, we define two convolutional networks W_{µ1} and W_{σ1} to estimate the corresponding means and variances. The encoding of F is:

µ_1 = W_{µ1} ∗ F,  σ_1 = W_{σ1} ∗ F,  Z = µ_1 + ε · exp(σ_1), (2)

where µ_1 ∈ R^{w×h×c} and σ_1 ∈ R^{w×h×c} are the estimated means and variances, ε is Gaussian noise sampled from N(0, I), and '∗' represents the convolution operation. The encoding output is denoted as the Instance map Z ∈ R^{w×h×c}. Next, Z is taken as the input of the RPN module to extract a series of object proposals O. Based on O, an RoI-Alignment operation followed by RoI-Feature extraction is performed on Z to obtain the output P_in ∈ R^{z×s}, where z and s respectively denote the number of proposals and the number of channels.
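The reparameterized encoding in equation 2 can be sketched in a few lines of numpy. This is a toy illustration of the sampling step only; the plain weight matrices stand in for the paper's convolutional heads, and all names here are ours:

```python
import numpy as np

def encode_instance_map(F, W_mu, W_sigma, rng=None):
    """Reparameterized encoding of a feature map (a sketch of equation 2).

    F:               backbone feature map, shape (w, h, c).
    W_mu / W_sigma:  (c, c) stand-ins for the convolutions estimating the
                     per-location means and log-scale variances.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mu = F @ W_mu                        # mu_1 = W_mu1 * F
    sigma = F @ W_sigma                  # sigma_1 = W_sigma1 * F (log scale)
    eps = rng.standard_normal(F.shape)   # eps ~ N(0, I)
    Z = mu + eps * np.exp(sigma)         # Z = mu_1 + eps . exp(sigma_1)
    return Z

w, h, c = 4, 4, 8
F = np.random.default_rng(1).standard_normal((w, h, c))
# Identity mean weights and zero log-variance weights: Z is F plus unit noise.
Z = encode_instance_map(F, np.eye(c), np.zeros((c, c)))
print(Z.shape)  # (4, 4, 8)
```

The noise injection is what makes the bottleneck stochastic; the KL penalty introduced below keeps the encoding close to the prior.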
Since the object proposals usually contain much object-irrelevant information (e.g., background information) that may weaken the discrimination performance, we design a mixture of information bottlenecks consisting of multiple branches to further enhance the object-related information. For each branch, based on P_in, we first define two fully-connected networks to separately estimate the corresponding means and variances. Then, we perform an encoding operation on P_in:

P_i^µ = Φ_i^µ(P_in),  P_i^σ = Φ_i^σ(P_in),  Q_i = P_i^µ + ε · exp(P_i^σ), (3)

where i = 1, ..., n and n is the number of IB branches. Φ_i^µ and Φ_i^σ represent two different fully-connected networks, P_i^µ ∈ R^{z×s} and P_i^σ ∈ R^{z×s} are the estimated means and variances, and Q_i ∈ R^{z×s} denotes the encoding result of the current branch. Since P_in contains information belonging to multiple different objects as well as much background information, exploiting multiple branches of information bottlenecks is beneficial for purifying object-related information from multiple different facets. Next, we define a gating operation to aggregate the information from the different IB branches. Through a residual operation between the aggregated information and the input P_in, the object-related information in P_in can be enhanced. The overall enhancement is:

G_i = (P̄_in^T Q̄_i) / (Σ_{j=1}^n P̄_in^T Q̄_j),  A = Σ_{i=1}^n G_i · Q_i,  E = A + α · P_in, (4)

where P̄_in ∈ R^s and Q̄_i ∈ R^s represent the averages of P_in and Q_i over the proposals, G_i is the calculated gating weight, and A ∈ R^{z×s} indicates the aggregated result. Here, α ∈ R^{z×s} denotes the learnable sigmoid weight, i.e., α = Ψ(P_in), where Ψ is a fully-connected network. Finally, E ∈ R^{z×s} is the enhanced output. As shown in Fig. 2, during training, E is taken as the input of the classifier and regressor to calculate the classification and localization losses.
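A minimal numpy sketch of the gating and residual enhancement in equation 4, assuming each Q_i has already been produced by its branch; the averaging axis and the fixed alpha are our assumptions:

```python
import numpy as np

def enhance(P_in, Qs, alpha):
    """Gated aggregation over n IB branches (a sketch of equation 4).

    P_in:  proposal features, shape (z, s).
    Qs:    list of branch encodings Q_i, each of shape (z, s).
    alpha: weight standing in for the learned sigmoid output Psi(P_in).
    """
    p_bar = P_in.mean(axis=0)                          # average over proposals -> (s,)
    dots = np.array([p_bar @ Q.mean(axis=0) for Q in Qs])
    G = dots / dots.sum()                              # gating weights (normalizer assumed nonzero)
    A = sum(g * Q for g, Q in zip(G, Qs))              # aggregated result, shape (z, s)
    E = A + alpha * P_in                               # residual enhancement
    return E, G

P_in = np.ones((3, 4))
Qs = [np.full((3, 4), 1.0), np.full((3, 4), 2.0)]
E, G = enhance(P_in, Qs, alpha=0.5 * np.ones((3, 4)))
print(G)  # [0.33333333 0.66666667]
```

Branches whose averaged encoding aligns better with the averaged input receive a larger gating weight, which is the intended "purify from multiple facets" behavior.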
The joint training objective is:

L_IB = L_cls + L_loc + β · (KL[p(Z|F), r(Z)] + (1/n) Σ_{i=1}^n KL[p(Q_i|P_in), r(Q_i)]), (5)

where L_cls and L_loc denote the classification and localization losses, and β is a hyper-parameter set to 0.0001 in the experiments. Following the information-theoretic treatments of deep learning Alemi et al. (2017), we define r(·) as a prior marginal distribution modeled as a standard Gaussian N(0, I). By minimizing the task losses together with the KL-divergence terms, the dependence between Z and F and that between Q_i and P_in are reduced while task performance is preserved, so that Z and Q_i encode compact object-relevant information from the inputs F and P_in, which is instrumental in improving the discrimination performance.
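The KL terms in equation 5 have a closed form against the standard-Gaussian prior r(·). Assuming σ parameterizes the log standard deviation, as in the sampling step Z = µ + ε · exp(σ), a per-sample sketch is:

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL[N(mu, diag(exp(log_sigma))^2) || N(0, I)],
    summed over channels and averaged over positions."""
    var = np.exp(2.0 * log_sigma)
    kl = 0.5 * (mu ** 2 + var - 2.0 * log_sigma - 1.0)
    return kl.sum(axis=-1).mean()

# When the encoder already matches the prior (mu = 0, log_sigma = 0),
# the penalty vanishes.
print(kl_to_standard_normal(np.zeros((3, 5)), np.zeros((3, 5))))  # 0.0
```

This is the standard variational-IB regularizer; any deviation of the encoding distribution from N(0, I) is penalized, weighted by β.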

3.2. SIMULATIVE OOD FEATURES GENERATION

For unsupervised OOD-OD, one of the major challenges lies in lacking unknown data for supervision, which is prone to producing overconfident incorrect predictions for OOD objects. To this end, we propose RIB to generate simulative OOD features by reversing the optimization objective of the standard IB Alemi et al. (2017), which reduces the impact of lacking unknown data and improves the ability to distinguish OOD objects. Concretely, as shown in Fig. 2, based on the feature map F from the backbone network, we first define two convolutional networks W_{µ2} and W_{σ2} to estimate the corresponding means and variances, and leverage the variational information bottleneck Alemi et al. (2017) to encode F:

µ_2 = W_{µ2} ∗ F,  σ_2 = W_{σ2} ∗ F,  R = µ_2 + ε · exp(σ_2), (6)

where R ∈ R^{w×h×c} indicates the encoded OOD map. Next, based on the object proposals O, an RoI-Alignment operation followed by RoI-Feature extraction is performed on R to obtain the simulative OOD features P_ood ∈ R^{z×s}. To promote P_ood to contain plentiful task-irrelevant information, we reverse the optimization objective of the standard IB (equation 1):

max_{R, P_ood} I((R, P_ood); F) − λ I(P_ood; Y), (7)

where λ controls the trade-off. The goal of equation 7 is to enforce the extracted R and P_ood to encode much less task-related information from the input F, which is beneficial for obtaining simulative OOD features for supervision. To attain this goal, we maximize the prediction discrepancy L_dis between E and P_ood:

L_dis = (1/K) Σ_{k=1}^K |p(P_ood)_k − p(E)_k|, (8)

where p(P_ood)_k and p(E)_k denote the prediction probabilities for class k, and K is the number of ID categories. A classifier with shared parameters is used to produce p(P_ood) and p(E). Besides, since R is directly encoded from the input F, R can be considered related to F.
Hence, we do not use a mutual information constraint to enhance the dependence between R and F. By maximizing the loss L_dis while minimizing the task losses, the gap between P_ood and E is enlarged, which promotes P_ood to contain plentiful information unrelated to the ID objects. Finally, to distinguish OOD objects from ID objects, P_ood and E are used to calculate an uncertainty loss Du et al. (2022b), which regularizes the detector to produce a low OOD score for the ID object features and a high OOD score for the simulative OOD features:

L_uncertainty = E_{u∼E}[−log (exp(−E(u)) / (1 + exp(−E(u))))] + E_{v∼P_ood}[−log (1 / (1 + exp(−E(v))))], (9)

where E(·) is the object-level energy score Du et al. (2022b); Liu et al. (2020). During training, we can only access the ID data. The overall training objective is:

L = L_IB − λ · L_dis + τ · L_uncertainty, (10)

where λ and τ are two hyper-parameters, set to 0.001 and 0.1 in the experiments.
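The losses of this section (equations 8 to 10) can be sketched together in numpy. The logits stand for the shared classifier's outputs on the enhanced ID features E and on P_ood; the batching details and helper names are our assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def discrepancy_loss(logits_ood, logits_id):
    """L_dis (equation 8): mean absolute gap between the class distributions
    predicted (by the SAME classifier head) for P_ood and for E."""
    return np.abs(softmax(logits_ood) - softmax(logits_id)).mean()

def energy_score(logits):
    """Object-level energy E(x) = -log sum_k exp(f_k(x)) (Liu et al. 2020)."""
    return -np.log(np.exp(logits).sum(axis=-1))

def uncertainty_loss(logits_id, logits_ood):
    """L_uncertainty (equation 9): pushes sigma(-E) toward 1 for ID features
    (low energy) and toward 0 for the simulative OOD features (high energy)."""
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    term_id = -np.log(sig(-energy_score(logits_id))).mean()
    term_ood = -np.log(1.0 - sig(-energy_score(logits_ood))).mean()
    return term_id + term_ood

def total_loss(L_ib, logits_id, logits_ood, lam=0.001, tau=0.1):
    """Overall objective (equation 10): L = L_IB - lambda*L_dis + tau*L_unc."""
    return (L_ib
            - lam * discrepancy_loss(logits_ood, logits_id)
            + tau * uncertainty_loss(logits_id, logits_ood))

id_logits = np.array([[8.0, 0.0, 0.0]])   # confident ID prediction
ood_logits = np.array([[0.0, 0.0, 0.0]])  # low-evidence prediction
```

Note the sign in the total: L_dis is maximized (hence subtracted), while the uncertainty term is minimized to separate the two energy distributions.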

3.3. INFERENCE FOR OOD OBJECT DETECTION

During inference, we use the output of the uncertainty branch for OOD object detection. Specifically, for a predicted bounding box b, the detection process is:

score(b) = exp(−E(b)) / (1 + exp(−E(b))),  D(b) = 1 if score(b) ≥ γ, and D(b) = 0 if score(b) < γ, (11)

where the commonly used threshold mechanism is leveraged to distinguish the ID objects (the result is 1) from the OOD objects (the result is 0). The threshold γ is set to 0.95 so that a high fraction of ID data is correctly classified.
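A sketch of this threshold rule, with the energy computed from classifier logits as in Liu et al. (2020); the helper name and the toy logits are ours:

```python
import numpy as np

def ood_decision(logits, gamma=0.95):
    """Inference rule of Section 3.3: score(b) = sigmoid(-E(b)); the
    threshold gamma decides ID (1) vs. OOD (0)."""
    energy = -np.log(np.exp(logits).sum(axis=-1))      # E(b)
    score = np.exp(-energy) / (1.0 + np.exp(-energy))  # sigmoid of -E(b)
    return (score >= gamma).astype(int), score

# A confident box (one large logit) vs. a low-evidence box.
decisions, scores = ood_decision(np.array([[10.0, 0.0, 0.0],
                                           [-10.0, -10.0, -10.0]]))
print(decisions)  # [1 0]
```

Low energy (strong ID evidence) yields a score near 1 and an ID decision; high energy yields a score near 0 and an OOD decision.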

4. EXPERIMENTS

In the experiments, for unsupervised OOD-OD, we follow the settings of the work Du et al. (2022b) and do not use any auxiliary dataset for training. Our method is evaluated on two different benchmarks. Furthermore, to further demonstrate the effectiveness of our method, we verify it on the task of class-incremental object detection Kj et al. (2021), i.e., new classes are sequentially introduced to the object detector.

Metrics. To evaluate the performance of unsupervised OOD-OD, we report: (1) the false positive rate of OOD objects when the true positive rate of ID objects is at 95% (FPR95); (2) the area under the receiver operating characteristic curve (AUROC); and (3) the mean average precision (mAP).
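The two OOD metrics can be sketched in plain numpy. This is a toy illustration of the definitions, not the evaluation code used in the paper; real evaluations typically rely on library routines such as those in sklearn.metrics:

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples still accepted at the score threshold
    that keeps 95% of ID samples (higher score = more ID-like)."""
    thresh = np.percentile(id_scores, 5)   # keep the top 95% of ID scores
    return float((np.asarray(ood_scores) >= thresh).mean())

def auroc(id_scores, ood_scores):
    """AUROC as the probability that a random ID sample outscores a random
    OOD sample (the Mann-Whitney U statistic)."""
    idv = np.asarray(id_scores)[:, None]
    oodv = np.asarray(ood_scores)[None, :]
    return float((idv > oodv).mean() + 0.5 * (idv == oodv).mean())

id_s = np.linspace(0.8, 1.0, 20)   # toy, well-separated score sets
ood_s = np.linspace(0.0, 0.5, 20)
print(fpr_at_95_tpr(id_s, ood_s), auroc(id_s, ood_s))  # 0.0 1.0
```

Perfectly separated scores give FPR95 = 0 and AUROC = 1; a detector that scores ID and OOD identically would sit at AUROC = 0.5.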

4.2. PERFORMANCE ANALYSIS OF UNSUPERVISED OOD-OD

Table 1 shows the performance of unsupervised OOD-OD. We can see that though these methods achieve similar detection performance, their abilities to distinguish OOD objects differ significantly. This shows that these detection methods are easily affected by OOD objects; thus, detecting OOD objects is meaningful for promoting the safe application of object detectors. Compared with the baseline methods, our method obtains the best OOD object detection performance. Particularly, with BDD-100k Yu et al. (2020) as the ID data, our method outperforms VOS Du et al. (2022b) by around 13.4% and 17.06% on FPR95. This shows that our method of Two-Stream Information Bottleneck is beneficial for extracting simulative OOD features, which reduces the impact of lacking unknown data for supervision and improves the discrimination ability of the object detector. In Fig. 3, we show some detection examples. For these images, the baseline method Du et al. (2022b) does not detect all OOD objects accurately. Taking the second column as an example, VOS Du et al. (2022b) misclassifies the dog into the Pedestrian category. Our method correctly localizes and recognizes the OOD objects, further demonstrating its effectiveness.

4.3. PERFORMANCE ANALYSIS OF CLASS-INCREMENTAL OBJECT DETECTION

To further demonstrate the effectiveness of our method, we evaluate it on class-incremental object detection Kj et al. (2021) and follow the standard evaluation protocol Kj et al. (2021). We initially learn 10, 15, or 19 base classes and then introduce 10, 5, or 1 new class as the second task. Meanwhile, we directly plug our method into the baseline method Kj et al. (2021) and do not calculate the uncertainty loss (equation 9). The training details are the same as the baseline Kj et al. (2021). In Tables 2 and 3, based on PASCAL VOC Everingham et al. (2010), we consider adding 10, 5, and 1 classes (highlighted in blue) to a detector trained on the rest of the classes; 'iOD + Ours' indicates that our method is plugged into iOD Kj et al. (2021), and 'P50' indicates that the mAP metric is calculated when the IoU threshold is set to 0.5. Tables 2 and 3 separately show the results based on the mAP50 and mAP75 metrics. We can see that plugging in our method improves the detection performance significantly. Particularly, for the '19+1' setting and the mAP75 metric, employing our method boosts the performance by around 5.2%, which further shows that our method can indeed enhance the discrimination ability. In the following ablation analyses, we utilize PASCAL VOC as the ID data for training and MS-COCO as the OOD data.

4.4. ABLATION AND VISUALIZATION ANALYSIS

Analysis of IB and RIB. In this paper, we define an IB branch and a RIB branch to separately extract instance-level features and simulative OOD features. Here, we make an ablation analysis of our method; Table 4 shows the performance. 'IB' includes the designed mixture of information bottlenecks and uses VOS Du et al. (2022b) to synthesize virtual outliers for training. We can see that employing the IB branch improves the detection performance, which shows that the IB is indeed helpful for compressing object-unrelated content. We also observe that exploiting the mixture of information bottlenecks reduces FPR95 by around 2.8%, indicating that this module is beneficial for enhancing object-related information and improving the discrimination ability. Furthermore, we observe that only using the RIB module already obtains superior performance compared with the baseline method Du et al. (2022b). This indicates that reversing the optimization objective of the information bottleneck extracts object-unrelated information, which is helpful for obtaining simulative OOD features and improving the discrimination ability.

Analysis of the branch number in MIB. In this paper, we design a mixture of information bottlenecks (MIB) to purify object-related information from multiple different facets (as shown in Fig. 2). We make an ablation analysis of the branch number; here, we only change the branch number and keep other modules unchanged. Table 5 shows the detection results. We can see that the number of IB branches indeed affects the performance of OOD detection. When the number of IB branches is small, the MIB module does not obtain better performance. The reason may be that the proposal features involve much information belonging to different objects, and using fewer IB branches cannot sufficiently capture object-related information, which weakens the discrimination. Conversely, using more IB branches introduces more parameters, which may lead to overfitting and reduce the performance of OOD detection. For our method, the performance of using 8 branches is the best.

Visualization analysis. In this paper, we separately extract an Instance map and an OOD map based on the backbone output. In Fig. 4, we make a visualization analysis. We can see that the Instance map contains plentiful object-related content and little object-irrelevant information, which is instrumental in improving the localization and recognition performance. Meanwhile, we observe that the extracted OOD map is largely unrelated to the current object, which is beneficial for obtaining simulative OOD features and alleviates the impact of lacking unknown data for supervision.

5. CONCLUSION

For unsupervised OOD-OD, this paper proposes a method of Two-Stream Information Bottleneck consisting of an IB branch and a RIB branch. Specifically, the IB branch aims to extract object-related information that is helpful for improving localization and recognition performance. Meanwhile, the RIB branch aims to extract simulative OOD features, alleviating the impact of lacking unknown data for training by reversing the optimization objective of the information bottleneck. Extensive experimental results on OOD-OD and class-incremental object detection, together with visualization analyses, indicate the superiority of our method.

Figure 6: OOD detection examples based on our method. Here, we use BDD-100k as the in-distribution data and MS-COCO as the OOD data. We can see that our method accurately distinguishes OOD objects, which shows its effectiveness.

Figure 7: Detection results based on PASCAL VOC. Our method accurately localizes and recognizes objects in these images, e.g., the dog, train, cow, and person, which shows that our method is effective for in-distribution data.

Figure 8: To further evaluate our method, we synthesize some images that contain OOD classes, e.g., the Dinosaur, Panda, and Camel. Meanwhile, we also collect some images from real scenarios.

Table 6: Definitions of notations used in our method.

F: The feature map extracted by the backbone network.
Z: The encoded Instance map.
P_in: The ID object features extracted based on Z.
Q_i: The output encoding result of the i-th IB branch for information enhancement.
G_i: The calculated gating weight of the i-th IB branch.
A: The aggregated results.
E: The output enhanced results.
R: The encoded OOD map.
P_ood: The OOD object features extracted based on R.

A.8 MORE ABLATION STUDIES ABOUT EQUATION 4

In equation 4, A aggregates the results of each IB branch; by this operation, A contains plentiful object-related information. The learned sigmoid weight α balances A and P_in during the enhancement process. We make an ablation analysis of A and α. First, we replace the gating operation (the left part of equation 4) with a simple mean operation and keep other modules unchanged. Compared with our method, the mean operation increases FPR95 by around 2.7%, which shows the effectiveness of the gating operation. Next, we replace the learned sigmoid weight α with a manually set value and keep other modules unchanged. Among multiple tested values, 0.6 corresponds to the best performance. However, compared with our method, this operation still increases FPR95 by around 1.3%, which indicates that using the learned weight is better for balancing A and P_in.

A.9 DEFINITIONS OF NOTATIONS

Table 6 gives the definitions of the notations used in our method.

A.10 LIMITATION

To promote the safe deployment of object detectors, we propose to use information bottlenecks to strengthen object-related information and to generate simulative OOD features, alleviating the impact of lacking OOD data for training. Since we only verify our method on two benchmarks, we remain unclear about the performance of the proposed method in practical scenes, which may be a limitation of our proposed method.



Figure2: The architecture of our method. 'Infor Enc' indicates information enhancement. Our method mainly consists of an IB branch and a RIB branch. Particularly, the IB branch aims to capture plentiful object-related information by optimizing the objective of information bottleneck. Meanwhile, after obtaining the high-level features from the detection head, a mixture of IB is designed to further enhance object-related information, which is beneficial for improving the discrimination ability. Besides, to alleviate the impact of lacking unknown data for training, we propose a RIB to generate simulative OOD features by maximizing the loss L dis . Here, it is worth noting that during the RPN process, for the OOD map, we only perform RoI Alignment based on the proposals O extracted from the Instance map. Finally, the ID features from the MIB and the simulative OOD features are all used to calculate the uncertainty loss L uncertainty .

4.1. IMPLEMENTATION DETAILS AND BENCHMARKS

Implementation Details. We use Faster R-CNN Ren et al. (2015) with the RoI-Alignment layer He et al. (2017) as the basic detection model. The backbone is ResNet-50 He et al. (2016), whose parameters are pre-trained on ImageNet Russakovsky et al. (2015) for initialization. For the generation of the Instance map (equation 2) and the OOD map (equation 6), we utilize two convolutional layers to define W_{µ1}, W_{σ1}, W_{µ2}, and W_{σ2}. For each branch of the MIB (equation 3), we utilize two fully-connected layers to define Φ_µ and Φ_σ. The number n of IB branches is set to 8. All experiments are trained using the standard SGD optimizer with a learning rate of 0.02.

OOD-OD Benchmarks. PASCAL VOC Everingham et al. (2010) and Berkeley DeepDrive (BDD-100k) Yu et al. (2020) are taken as the ID training data. Meanwhile, we adopt MS-COCO Lin et al. (2014) and OpenImages Kuznetsova et al. (2020) as the OOD datasets to evaluate the trained model. The OOD datasets are manually examined to ensure the OOD images do not contain ID categories.

Figure 3: Detection results on the OOD images from MS-COCO. The first and second rows respectively indicate results based on VOS Du et al. (2022b) and our method. The in-distribution dataset is BDD-100k. Blue boxes represent objects detected and classified as one of the ID categories. Green boxes indicate OOD objects. We can see that our method accurately determines OOD objects. Particularly, compared with VOS Du et al. (2022b) that aims to synthesize virtual outliers, based on FPR95, for PASCAL VOC Everingham et al. (2010), our method outperforms VOS by around 10.42% and 9.62%. For BDD-100k Yu et al. (2020), our method outperforms VOS by around 13.4% and 17.06%. This shows that our method of Two-Stream Information Bottleneck is beneficial for extracting simulative OOD features, which reduces the impact of lacking unknown data for supervision and improves the discrimination ability of the object detector.

Figure 4: Visualization of the Instance map and OOD map based on the OOD data (MS-COCO). For each feature map, the channels corresponding to the maximum value are selected for visualization.

Figure5: Visualization of the Instance map and OOD map. For each feature map, the channels corresponding to the maximum value are selected for visualization. We can see that the Instance map contains plentiful object-relevant information. Meanwhile, the OOD map involves sufficient object-irrelevant information, which is helpful for extracting simulative OOD features to alleviate the impact of lacking unknown data for supervision.

To promote the safe application of models in practical scenarios, OOD detection Pimentel et al. (2014); Yang et al. (2021b) has attracted much attention, whose goal is to distinguish ID data from OOD data. Most existing methods Lee et al. (2018b); Hendrycks et al. (2019);

Table 1: The performance (%) of unsupervised OOD-OD. All methods are trained based on ID data and do not use any auxiliary data. ↑ denotes larger values are better and ↓ denotes smaller values are better. '†' indicates that we directly run the released code to obtain the results. 'Bbone' and 'R50' separately represent the backbone network and ResNet-50.

Table 2: Performance (%) analysis of class-incremental object detection based on PASCAL VOC Everingham et al. (2010).

Table 3: Performance (%) analysis of incremental object detection based on PASCAL VOC Everingham et al. (2010).

Table 4: The performance (%) of only using the IB branch and only using the RIB branch.

Table 5: The performance (%) of using a different number of IB branches.

A APPENDIX

Here we provide additional analyses, various ablation studies, and more visualization results.

A.1 FURTHER DISCUSSION ABOUT RIB

In this paper, to alleviate the impact of lacking unknown data for supervision, we design a RIB module to extract simulative OOD features by reversing the optimization objective of the information bottleneck. The reversed objective is: max_{R, P_ood} I((R, P_ood); F) − λ I(P_ood; Y).

Proposition 1. Assume I(P_ood; Y) = 0; then we obtain features that contain plentiful out-of-distribution information.

Proof. Note that I(P_ood; Y) = 0 implies that P_ood and the labels Y are independent. By minimizing the classification loss, the in-distribution features P_in are promoted to be related to the labels Y. Hence, I(P_ood; Y) = 0 enforces P_ood to be irrelevant to P_in, which promotes P_ood to contain rich out-of-distribution information.

A.3 ABLATION ANALYSIS OF HYPER-PARAMETERS

In this paper, we use the hyper-parameter β to control the KL-loss (equation 5) and use λ and τ to separately control the losses L_dis and L_uncertainty (equation 10). Since L_cls, L_loc, and L_uncertainty are directly related to the task, β and λ should be set to smaller values than τ. Meanwhile, if β and λ are set to very small values, the KL-loss and L_dis may play only a small role in optimization. Thus, it is meaningful to set proper values for these hyper-parameters. Next, we utilize PASCAL VOC as the ID data and MS-COCO as the OOD data to perform an ablation analysis of the hyper-parameters, changing only the value of each hyper-parameter while keeping all other modules unchanged.

Analysis of hyper-parameter β. We use β to control the KL-divergence loss in the information bottlenecks. When β is set to 0.001, 0.0001, and 0.00001, the corresponding FPR95 is 43.84%, 41.55%, and 42.83%, respectively.

Analysis of hyper-parameter λ. We use λ to constrain the reversed optimization objective of the information bottleneck. When λ is set to 0.01, 0.001, and 0.0001, the corresponding FPR95 is 44.15%, 41.55%, and 43.26%, respectively.

Analysis of hyper-parameter τ. For the uncertainty loss, we follow the work Du et al. (2022b) and use the same setting for τ. When τ is set to 0.15, 0.1, and 0.05, the corresponding performance is 44.58%, 41.55%, and 43.73%, respectively.
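For concreteness, a minimal sketch of how these hyper-parameters could weight the individual losses. The plain weighted-sum form and the function signature here are our assumption for illustration (the exact combination in equation 10 may differ); the default weights are the best settings from the ablation above:

```python
def total_loss(l_cls, l_loc, l_kl, l_dis, l_unc,
               beta=1e-4, lam=1e-3, tau=0.1):
    """Illustrative weighted sum of the detection and bottleneck losses.

    beta weights the KL-loss (equation 5), lam weights L_dis, and tau weights
    L_uncertainty (equation 10); defaults follow the best ablation settings.
    """
    return l_cls + l_loc + beta * l_kl + lam * l_dis + tau * l_unc
```

Note that beta and lam are orders of magnitude smaller than tau, reflecting that L_cls, L_loc, and L_uncertainty are directly related to the detection task.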

A.4 MORE EXPERIMENTAL RESULTS

Results on RegNetX-4.0GF. To further verify the effectiveness of our method, we evaluate it on another backbone network, i.e., RegNetX-4.0GF Radosavovic et al. (2020). Here, we take PASCAL VOC as the ID training data and MS-COCO as the OOD data for evaluation. In terms of FPR95, AUROC, and mAP, the performance of our method is 44.90%, 90.08%, and 51.5%, which significantly outperforms VOS's performance of 50.81%, 88.42%, and 50.8%. This shows that our Two-Stream Information Bottleneck is able to strengthen the instance-related information and extract proper simulative OOD features, which reduces the impact of lacking unknown data for supervision and improves the discrimination ability.

Further Analysis about RIB. For RIB (see equation 7), since the OOD map R is extracted based on the input feature map F, R can be considered to be related to F. Hence, we do not use a mutual information constraint to enhance the dependence between R and F. Here, we make an ablation analysis of adding this mutual information constraint, again taking PASCAL VOC as the ID training data and MS-COCO as the OOD data for evaluation. We observe that adding the mutual information constraint increases FPR95 by around 1.1%. Meanwhile, replacing the variational operation with a traditional convolution operation increases FPR95 by around 4.8%. These analyses show that the variational operation is helpful for capturing distribution-related information, which is better suited to extracting out-of-distribution information.
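To make the distinction concrete, below is a minimal sketch of what a variational operation looks like via the reparameterization trick: instead of one deterministic output, the layer predicts a Gaussian and samples from it, which is what lets it capture distribution-related information. Dense matrices stand in for the convolutions of the actual RIB module, and the zero-initialized weights in the usage line are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_layer(x, w_mu, w_logvar):
    """Variational counterpart of a plain linear/conv map.

    Predicts a mean and a log-variance, then samples the output with the
    reparameterization trick; also returns the KL to a standard-normal prior,
    the kind of term a KL-loss in an information bottleneck would penalize.
    """
    mu = x @ w_mu
    logvar = x @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps            # reparameterized sample
    kl = 0.5 * np.mean(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return z, kl

# Toy usage: 4 feature vectors of dimension 8 projected to dimension 3.
# Zero weights give mu = 0, logvar = 0, hence z = eps and zero KL.
x = rng.standard_normal((4, 8))
z, kl = variational_layer(x, np.zeros((8, 3)), np.zeros((8, 3)))
```

A traditional convolution would return only the deterministic `mu`, discarding the sampled spread that the ablation above suggests is important.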

A.5 MORE VISUALIZATION EXAMPLES

In Fig. 5, we give more visualization examples. We can see that the extracted Instance map contains plentiful object-relevant information, which is helpful for localizing and recognizing objects accurately. Meanwhile, the calculated OOD map is significantly different from the Instance map, which conforms to the meaning of OOD, i.e., OOD data deviates from the in-distribution. This further shows that our proposed reverse information bottleneck could indeed extract simulative OOD features, which is beneficial for alleviating the impact of lacking unknown data for supervision and improving the discrimination ability.

In Figs. 6 and 7, we show more detection results. We can see that our method accurately localizes and recognizes both ID objects and OOD objects. Particularly, compared with ID objects, which belong to a fixed set of categories, the category distribution of OOD objects is diverse, presenting a significant challenge for the object detector. Our method attempts to solve this problem from the feature perspective, which has been demonstrated to be effective.

A.6 COMPUTATION OF MUTUAL INFORMATION ON 3D FEATURE MAPS

In this paper, we compute a KL-divergence loss to approximate the mutual information. The process is shown as follows:

KL(p ‖ q) = Σ_i p_i log(p_i / q_i),    (13)

where p(·) and q(·) represent two probability distributions. For example, given two 3D feature maps H ∈ R^{w×h×c} and C ∈ R^{w×h×c}, we first perform a softmax operation on H and C. Then, we separately take the corresponding elements of the two processed results as the input of equation 13 to calculate the KL result. The mean of the KL results over all corresponding elements is taken as the output KL-divergence loss.

A.7 PERFORMANCE ON HIGHER LEVELS OF CONTAMINATION FROM OOD CLASSES

We select 8K images from COCO and synthesize some OOD objects on these images to perform a further evaluation. In Fig. 8, we show some synthesized images. Through experiments, we observe that our method still outperforms VOS Du et al. (2022b) significantly. Particularly, compared with VOS, our method reduces FPR95 by around 10.1% and improves AUROC by around 3.2%, which further demonstrates the effectiveness of our method.
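The feature-map KL-divergence loss described in Sec. A.6 can be sketched as follows (a minimal NumPy sketch; the softmax normalization axis, taken here as the channel dimension, and the eps smoothing are our assumptions):

```python
import numpy as np

def kl_feature_loss(H, C, axis=-1, eps=1e-8):
    """KL-divergence loss between two feature maps of shape (w, h, c).

    Both maps are passed through a softmax (here over `axis`), then the
    element-wise KL terms p * log(p / q) are averaged into a scalar loss,
    following the description in Sec. A.6.
    """
    def softmax(x):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    p, q = softmax(H), softmax(C)
    return float(np.mean(p * np.log((p + eps) / (q + eps))))
```

The loss vanishes for identical feature maps and grows as the two softmax-normalized maps diverge.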

