THE KFIOU LOSS FOR ROTATED OBJECT DETECTION

Abstract

Differing from the well-developed horizontal object detection area, where the computing-friendly IoU-based loss is readily adopted and fits the detection metrics well, rotation detectors often involve a more complicated loss based on SkewIoU, which is unfriendly to gradient-based training. In this paper, we propose an effective approximate SkewIoU loss based on Gaussian modeling and Gaussian product, which mainly consists of two terms. The first term is a scale-insensitive center point loss, which is used to quickly narrow the distance between the center points of the two bounding boxes. In the distance-independent second term, the product of the Gaussian distributions is adopted to inherently mimic the mechanism of SkewIoU by its definition, and we show its alignment with the SkewIoU loss at trend-level within a certain distance (i.e. within 9 pixels). This is in contrast to recent Gaussian modeling based rotation detectors, e.g. GWD loss and KLD loss, which involve a human-specified distribution distance metric that requires additional hyperparameter tuning varying across datasets and detectors. The resulting new loss, called KFIoU loss, is easier to implement and works better than the exact SkewIoU loss, thanks to its full differentiability and ability to handle the non-overlapping cases. We further extend our technique to the 3-D case, which suffers from the same issues as 2-D. Extensive results on various datasets with different base detectors show the effectiveness of our approach.

1. INTRODUCTION

Rotated object detection is a relatively emerging but challenging area, due to the difficulties of locating arbitrary-oriented objects and separating them effectively from the background, in scenarios such as aerial images (Yang et al., 2018a; Ding et al., 2019; Yang et al., 2018b) and scene text (Jiang et al., 2017; Zhou et al., 2017). Though considerable progress has been made recently, challenges remain in practical settings for rotated objects with large aspect ratios and dense distribution.

The Skew Intersection over Union (SkewIoU) between large aspect ratio objects is sensitive to deviations of the object positions. As a result, the inconsistency between the metric (dominated by SkewIoU) and the regression loss (e.g. ℓn-norms), which is already common in horizontal detection, is further amplified in rotation detection. The red and orange arrows in Fig. 1 show the inconsistency between SkewIoU and Smooth L1 loss. Specifically, when the angle deviation is fixed (red arrow), SkewIoU decreases sharply as the aspect ratio increases, while the Smooth L1 loss is unchanged (mainly from the angle difference). Similarly, when SkewIoU does not change (orange arrow), Smooth L1 loss increases as the angle deviation increases.

Solutions to the inconsistency between the metric and the regression loss have been extensively discussed in horizontal detection by using IoU loss and related variants, such as GIoU loss (Rezatofighi et al., 2019) and DIoU loss (Zheng et al., 2020b). However, applying these solutions to rotation detection is blocked because an analytical solution of the SkewIoU calculation process¹ is not easy to provide, due to the complexity of the intersection between two rotated boxes (Zhou et al., 2019).
Especially, there exist some custom operations (intersection of two edges, sorting the vertexes, etc.) whose derivative functions have not been implemented in the existing deep learning frameworks (Abadi et al., 2016; Paszke et al., 2017; Hu et al., 2020). Besides, the calculation of SkewIoU is not differentiable when there are more than eight intersection points between two bounding boxes, i.e. when the two bounding boxes are completely coincident or one edge is coincident, which leads to the failure to obtain very accurate prediction results. Thus, developing an easy-to-implement and fully differentiable approximate SkewIoU loss is meaningful, and several works (Chen et al., 2020; Zheng et al., 2020a; Yang et al., 2021c; d) have been proposed.

This paper aims to find an easy-to-implement and better-performing alternative. We design an alternative to SkewIoU loss based on Gaussian product, named KFIoU loss², which can be easily implemented with the existing operations of deep learning frameworks without the need for additional acceleration (e.g. C++/CUDA). Specifically, we convert the rotated bounding box into a Gaussian distribution, which avoids the well-known boundary discontinuity and square-like problems (Yang et al., 2021c) in rotation detection. Then we use a center point loss to narrow the distance between the centers of the two Gaussian distributions, followed by calculating the overlap area under the new position through the product of the Gaussian distributions. By calculating the error variance and comparing the final performance of different methods, we find that trend-level alignment with the SkewIoU loss is critical for solving the inconsistency between metric and loss, and for further improving the performance. Furthermore, compared to best-tuned Gaussian distance metric based methods, our proposed method achieves more competitive performance without hyperparameter tuning.
The highlights are as follows:

1) For rotation detection, instead of exactly computing the SkewIoU loss, which is tedious and unfriendly to differentiable learning, we propose an easy-to-implement approximate loss, named KFIoU loss, which works better since it is fully differentiable and able to handle the non-overlapping cases. It follows the protocol of Gaussian modeling for objects, yet innovatively uses Gaussian product to mimic SkewIoU's computing mechanism within a looser distance.

2) Compared to Gaussian-based losses (GWD loss, KLD loss) that try to approximate SkewIoU loss by specifying a distance, which needs extra hyperparameter tuning and metric selection that vary across datasets and detectors, our mechanism-level simulation of SkewIoU is more interpretable and natural, and free from hyperparameter tuning.

3) We also show that KFIoU loss achieves better trend-level alignment with SkewIoU loss within a certain distance than GWD loss and KLD loss, where the trend deviation is measured by our devised error variance. The effectiveness of such a trend-level alignment strategy is verified by comparing KFIoU loss with the ideal SkewIoU loss. On extensive benchmarks (aerial images, scene texts, face), our approach also outperforms other best-tuned SOTA alternatives.

4) We further extend the Gaussian modeling and KFIoU loss from 2-D to 3-D rotation detection, with notable improvement compared with baselines. To our best knowledge, this is the first 3-D rotation detector based on Gaussian modeling, which also verifies its effectiveness, in contrast to (Yang et al., 2021c; d; 2022) focusing on 2-D rotation detection.

The source code is available at TensorFlow (Abadi et al., 2016)-based AlphaRotate (Yang et al., 2021e), PyTorch (Paszke et al., 2017)-based MMRotate (Zhou et al., 2022) and Jittor (Hu et al., 2020)-based JDet.

2. RELATED WORK

Rotated Object Detection. Rotated object detection is an emerging direction, which attempts to extend classical horizontal detectors (Girshick, 2015; Ren et al., 2015; Lin et al., 2017a; b) to the rotation case by adopting the rotated bounding boxes. Aerial images and scene text are popular application scenarios of rotation detector. For aerial images, objects are often arbitrary-oriented and dense-distributed with large aspect ratios. To this end, ICN (Azimi et al., 2018) , ROI-Transformer (Ding et al., 2019) , SCRDet (Yang et al., 2019) , Mask OBB (Wang et al., 2019) , Gliding Vertex (Xu et al., 2020) , ReDet (Han et al., 2021b) are two-stage mainstreamed approaches whose pipeline is inherited from Faster RCNN (Ren et al., 2015) , while DRN (Pan et al., 2020) , DAL (Ming et al., 2021b) , R 3 Det (Yang et al., 2021b) , RSDet (Qian et al., 2021a; b) and S 2 A-Net (Han et al., 2021a) are based on single-stage methods for faster detection speed. For scene text detection, RRPN (Ma et al., 2018) employs rotated RPN to generate rotated proposals and further perform rotated bounding box regression. TextBoxes++ (Liao et al., 2018a) adopts vertex regression on SSD (Liu et al., 2016) . RRD (Liao et al., 2018b) improves TextBoxes++ by decoupling classification and bounding box regression on rotation-invariant and rotation sensitive features, respectively. The regression loss of the above algorithms is rarely SkewIoU loss due to the complexity of implementing SkewIoU. Variants of IoU-based Loss. The inconsistency between metric and regression loss is a common issue for both horizontal detection and rotation detection. Solution for this inconsistency has been extensively discussed in horizontal detection by using IoU related loss. For instance, Unitbox (Yu et al., 2016) proposes an IoU loss which regresses the four bounds of a predicted box as a whole unit. 
More works (Rezatofighi et al., 2019; Zheng et al., 2020b) extend the idea of Unitbox by introducing GIoU (Rezatofighi et al., 2019) and DIoU (Zheng et al., 2020b) for bounding box regression. However, their applications to rotation detection are blocked due to the hard-to-implement SkewIoU. Recently, some approximate methods for SkewIoU loss have been proposed. Box/Polygon based: SCRDet (Yang et al., 2019) proposes IoU-Smooth L1, which partly circumvents the need for SkewIoU loss with gradient backpropagation by combining IoU and Smooth L1 loss. To tackle the uncertainty of convex caused by rotation, the work (Zheng et al., 2020a) proposes a projection operation to estimate the intersection area for both 2-D/3-D object detection. PolarMask (Xie et al., 2020) proposes Polar IoU loss that can largely ease the optimization and considerably improve the accuracy. CFA (Guo et al., 2021) proposes a convex hull based CIoU loss for the optimization of point based detectors. Pixel based: PIoU (Chen et al., 2020) calculates the SkewIoU directly by accumulating the contribution of interior overlapping pixels. Gaussian based: GWD (Yang et al., 2021c) and KLD (Yang et al., 2021d) simulate SkewIoU by Gaussian distance measurement.

3. PRELIMINARY

This section presents the preliminary according to (Yang et al., 2021c), for how to convert an arbitrary-oriented 2-D/3-D bounding box to a Gaussian distribution G(µ, Σ):

$$\Sigma = R \Lambda R^{\top}, \quad \mu = (x, y, (z))^{\top} \tag{1}$$

where R represents the rotation matrix, and Λ represents the diagonal matrix of eigenvalues. Take the 3-D object B_3d(x, y, z, w, h, l, θ) as an example:

$$R = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \frac{w^2}{4} & 0 & 0 \\ 0 & \frac{h^2}{4} & 0 \\ 0 & 0 & \frac{l^2}{4} \end{pmatrix} \tag{2}$$

It is worth noting that the recent works GWD loss (Yang et al., 2021c) and KLD loss (Yang et al., 2021d) also belong to the Gaussian modeling based methods. Compared with our work, their difference is that they use a nonlinear transformation of a distribution distance to approximate SkewIoU loss. In this process, additional hyperparameters are introduced.

Figure 2: SkewIoU loss approximation process in two-dimensional space based on Gaussian product. Compared with GWD loss (Yang et al., 2021c) and KLD loss (Yang et al., 2021d), our approach follows the calculation process of SkewIoU without introducing additional hyperparameters. We believe such a design is more mathematically rigorous and more in line with SkewIoU loss.

Since Gaussian modeling has the natural advantages of being immune to boundary discontinuity and square-like problems, in this paper, we will take another perspective to approximate the SkewIoU loss to better train the detector without any extra hyperparameter, which can be more in line with SkewIoU calculation. Tab. 1 shows the comparison of properties between different losses. It should be noted that the results presented in our experiments of GWD loss and KLD loss are obtained by best-tuned hyperparameters in DOTA, but not optimal in others.
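The conversion in Eq. 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `box_to_gaussian_2d` and `box_to_gaussian_3d` are hypothetical, and the angle is assumed to be given in radians.

```python
import numpy as np


def box_to_gaussian_2d(x, y, w, h, theta):
    """Map a rotated 2-D box to (mu, Sigma) with Sigma = R Lambda R^T.

    Assumption of this sketch: theta is in radians.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])               # 2-D rotation matrix
    Lam = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])   # squared half-extents
    mu = np.array([x, y], dtype=float)
    return mu, R @ Lam @ R.T


def box_to_gaussian_3d(x, y, z, w, h, l, theta):
    """Map a 3-D box rotated about the z-axis to (mu, Sigma)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    Lam = np.diag([w ** 2 / 4.0, h ** 2 / 4.0, l ** 2 / 4.0])
    mu = np.array([x, y, z], dtype=float)
    return mu, R @ Lam @ R.T
```

For a square box (w = h) the covariance does not depend on θ, which is the isotropic behaviour noted for square-like objects.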

4. PROPOSED METHOD

In this section, we present our main approach. Fig. 2 shows the approximate process of SkewIoU loss in two-dimensional space based on Gaussian product. Briefly, we first convert the bounding box to a Gaussian distribution as discussed in Sec. 3, and move the center points of the two Gaussian distributions to make them close. Then, the distribution function of the overlapping area is obtained by Gaussian product. Finally, the obtained distribution function is inverted into a rotated bounding box to calculate the overlapping area and the SkewIoU and loss.

4.1. SKEWIOU BASED ON GAUSSIAN PRODUCT

First of all, we can easily calculate the volume of the corresponding rotated box based on its covariance, V_B(Σ), once we obtain a new Gaussian distribution:

$$V_B(\Sigma) = 2^n \sqrt{\prod \mathrm{eig}(\Sigma)} = 2^n \cdot |\Sigma^{\frac{1}{2}}| = 2^n \cdot |\Sigma|^{\frac{1}{2}} \tag{3}$$

where n denotes the number of dimensions. To obtain the final SkewIoU, calculating the area of overlap is critical. For two Gaussian distributions, N_x(µ1, Σ1) and N_x(µ2, Σ2), we use the product of the Gaussian distributions to get the distribution function of the overlapping area:

$$\alpha \mathcal{N}_x(\mu, \Sigma) = \mathcal{N}_x(\mu_1, \Sigma_1)\, \mathcal{N}_x(\mu_2, \Sigma_2) \tag{4}$$

Note that α is written as:

$$\alpha = \mathcal{N}_{\mu_1}(\mu_2, \Sigma_1 + \Sigma_2) \tag{5}$$

where µ = µ1 + K(µ2 − µ1), Σ = Σ1 − KΣ1, and K is the Kalman gain, K = Σ1(Σ1 + Σ2)^{-1}.

We observe that Σ is only related to the covariances (Σ1 and Σ2) of the two given Gaussian distributions, which means that no matter how the two Gaussian distributions move, as long as the covariances are fixed, the area calculated by Eq. 3 will not change (distance-independent). This is obviously not in line with intuition: the overlapping area should shrink when the two Gaussian distributions are far apart. The main reason is that αN_x(µ, Σ) is not a standard Gaussian distribution (its probability mass does not sum to 1), so we cannot directly use Σ to calculate the area of the current overlap by Eq. 3 without considering α. Eq. 5 shows that α is related to the distance between the center points (µ1 − µ2) of the two Gaussian distributions. Based on the above findings, we can first use a center point loss L_c to narrow the distance between the centers of the two Gaussian distributions. In this way, α can be approximated as a constant, and the introduction of L_c also allows the entire loss to continue to optimize the detector in non-overlapping cases. Then, we calculate the overlap area under the new position by Eq. 3. According to Fig.
2, the overlap area is calculated as follows:

$$\mathrm{KFIoU} = \frac{V_{B_3}(\Sigma)}{V_{B_1}(\Sigma_1) + V_{B_2}(\Sigma_2) - V_{B_3}(\Sigma)} \tag{6}$$

where B1, B2 and B3 refer to the three different bounding boxes shown in the right part of Fig. 2. In the appendix, we prove that the upper bound of KFIoU in n-dimensional space is $\frac{1}{2^{\frac{n}{2}+1}-1}$. For 2-D/3-D detection, the upper bounds are $\frac{1}{3}$ and $\frac{1}{\sqrt{32}-1}$ when n = 2 and n = 3, respectively. We can easily stretch the range of KFIoU to [0, 1] by a linear transformation according to the upper bound, and then compare it with IoU for consistency.

Fig. 3(a)-3(b) show the curves of five loss forms for two bounding boxes with the same center in different cases. Note that we have expanded KFIoU by 3 times so that its value range is [0, 1] like SkewIoU. Fig. 3(a) depicts the relation between angle difference and loss functions. Though they all bear monotonicity, obviously the Smooth L1 loss curve is more distinctive. Fig. 3(b) shows the changes of the five losses under different aspect ratio conditions. It can be seen that the Smooth L1 loss of the two bounding boxes is constant (mainly from the angle difference), but the other losses change drastically as the aspect ratio varies. In the "regardless of the case" setting of Fig. 3(c), KFIoU loss maintains the best trend-level alignment with the SkewIoU loss within a 5-pixel deviation. This conclusion still holds at 9 pixels, which is already quite a distance, especially for aerial images.

To further explore the behavior of different approximate SkewIoU losses, we design the metrics of error mean (EMean) and error variance (EVar) as follows:

$$\mathrm{EMean} = \frac{1}{N}\sum_{i=1}^{N} \left(\mathrm{SkewIoU}_{plain} - \mathrm{SkewIoU}_{app}\right), \quad \mathrm{EVar} = \frac{1}{N}\sum_{i=1}^{N} \left(\mathrm{SkewIoU}_{plain} - \mathrm{SkewIoU}_{app} - \mathrm{EMean}\right)^2 \tag{7}$$

where EVar measures the trend-level consistency between the designed loss and the SkewIoU loss. Tab. 1 reports the EVar of different losses in Fig. 3(c). In general, $\mathrm{EVar}_{L_{kfiou}+L_c} < \mathrm{EVar}_{L_{kld}} < \mathrm{EVar}_{L_{gwd}} < \mathrm{EVar}_{L_1}$.
In our analysis, this is probably due to the fundamental inconsistency between the distribution distance as used in GWD/KLD and the definition of similarity in SkewIoU. Moreover, for GWD such inconsistency is more pronounced, because it has no scale invariance under the same IoU: a case with a larger scale gets a larger loss value, which can greatly magnify its trend inconsistency with SkewIoU loss. The results in Tab. 1 also verify our analysis. In contrast, the calculation process of KFIoU loss is essentially the calculation of the overlap rate, so it does not require hyperparameters and can maintain a high trend-level consistency with SkewIoU loss. Combined with the corresponding performance on three datasets, smaller EVars tend to give better performance in general. When EVar is small enough, which implies sufficient consistency, the performance of different methods (e.g. KLD loss and KFIoU loss) is close. Therefore, we come to the conclusion that the key to maintaining the consistency between metric and regression loss lies in the trend-level consistency between approximate and exact SkewIoU loss rather than value-level consistency. The reason why the Gaussian-based losses (e.g. KFIoU loss, KLD loss, GWD loss) outperform the plain SkewIoU loss is the advanced parameter optimization mechanism, effective measurement for non-overlapping cases, and complete derivation. However, the introduction of hyperparameters makes KLD loss and GWD loss less stable than KFIoU loss in terms of EVar and performance. Compared with GWD and KLD, which use the distribution distance to approximate SkewIoU, KFIoU is physically more reasonable (in line with the calculation process of SkewIoU) and simpler, as well as empirically more effective than best-tuned GWD and KLD. In addition, KFIoU is much simpler to implement than plain SkewIoU and can be easily implemented with the existing operations of deep learning frameworks.
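As a rough illustration of Eq. 3 to Eq. 6, the Gaussian-product overlap can be written with a handful of matrix operations. The sketch below uses plain NumPy rather than a deep learning framework, and the names `box_volume` and `kfiou` are hypothetical helpers, not the released implementation.

```python
import numpy as np


def box_volume(sigma):
    """Eq. 3: V_B(Sigma) = 2^n * |Sigma|^(1/2)."""
    n = sigma.shape[0]
    return 2.0 ** n * np.sqrt(np.linalg.det(sigma))


def kfiou(sigma1, sigma2):
    """Eq. 6 computed from the covariances of two Gaussians.

    Only covariances enter, so the value is independent of the
    distance between the two centers.
    """
    K = sigma1 @ np.linalg.inv(sigma1 + sigma2)   # Kalman gain
    sigma = sigma1 - K @ sigma1                   # covariance of the product
    v1, v2, v3 = box_volume(sigma1), box_volume(sigma2), box_volume(sigma)
    return v3 / (v1 + v2 - v3)
```

With Σ1 = Σ2 = diag(4, 1), i.e. two identical 4 × 2 boxes, the function returns 1/3, the 2-D upper bound; multiplying by 3 stretches the range to [0, 1] as described above.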

4.2. THE PROPOSED KFIOU LOSS

We take 2-D object detection as the main example for notation brevity, though our experiments further cover the 3-D case. We use the one-stage detector RetinaNet (Lin et al., 2017b) as the baseline. Rotated rectangle is represented by five parameters (x, y, w, h, θ). First, we shall clarify that the network has not changed the output of the original regression branch, that is, it is not directly predicting the parameters of the Gaussian distribution. The whole training process of detector is summarized as follows: i) predict offset $(t^*_x, t^*_y, t^*_w, t^*_h, t^*_\theta)$; ii) decode prediction box; iii) convert prediction box and target ground-truth into Gaussian distribution; iv) calculate L_c and L_kf of two Gaussian distributions. Therefore, the inference time remains unchanged.

The regression equation of (x, y, w, h) is as follows:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$
$$t^*_x = (x^* - x_a)/w_a, \quad t^*_y = (y^* - y_a)/h_a, \quad t^*_w = \log(w^*/w_a), \quad t^*_h = \log(h^*/h_a) \tag{8}$$

where x, y, w, h denote the box's center coordinates, width and height, respectively. x, x_a, x^* are for ground-truth box, anchor box, and predicted box (likewise for y, w, h).

For the regression of θ, we use two forms as the baselines:

i) Direct regression, marked as Reg. (∆θ). The model directly predicts the angle offset $t^*_\theta$:

$$t_\theta = (\theta - \theta_a) \cdot \pi/180, \quad t^*_\theta = (\theta^* - \theta_a) \cdot \pi/180 \tag{9}$$

ii) Indirect regression, marked as Reg.* (sin θ, cos θ).
The model predicts two vectors ($t^*_{\sin\theta}$ and $t^*_{\cos\theta}$) to match the two targets from the ground truth ($t_{\sin\theta}$ and $t_{\cos\theta}$):

$$t_{\sin\theta} = \sin(\theta \cdot \pi/180), \quad t_{\cos\theta} = \cos(\theta \cdot \pi/180), \quad t^*_{\sin\theta} = \sin(\theta^* \cdot \pi/180), \quad t^*_{\cos\theta} = \cos(\theta^* \cdot \pi/180) \tag{10}$$

To ensure that $t^{*2}_{\sin\theta} + t^{*2}_{\cos\theta} = 1$ is satisfied, we perform the following normalization:

$$t^*_{\sin\theta} = \frac{t^*_{\sin\theta}}{\sqrt{t^{*2}_{\sin\theta} + t^{*2}_{\cos\theta}}}, \quad t^*_{\cos\theta} = \frac{t^*_{\cos\theta}}{\sqrt{t^{*2}_{\sin\theta} + t^{*2}_{\cos\theta}}} \tag{11}$$

The multi-task loss is:

$$L_{total} = \lambda_1 \sum_{n=1}^{N_{pos}} L_{reg}\left(G(b_n), G(gt_n)\right) + \frac{\lambda_2}{N} \sum_{n=1}^{N} L_{cls}(p_n, t_n) \tag{12}$$

where N and N_pos indicate the number of all anchors and that of positive anchors, respectively. b_n denotes the n-th predicted bounding box, and gt_n is the n-th target ground-truth. G(·) is the Gaussian transfer function. The KFIoU term L_kf of the regression loss is:

$$L_{kf}(\Sigma_1, \Sigma_2) = e^{1-\mathrm{KFIoU}} - 1 \tag{13}$$

See more ablation experiments on the functional form of L_kf(Σ1, Σ2) in the Appendix. For the center point loss L_c, this paper provides two different forms:

1) The loss adopted in Faster RCNN (Lin et al., 2017a) (default): $L_c(t, t^*) = \sum_{i \in (x, y)} l_n(t_i, t^*_i)$.

2) The first term of KLD (Yang et al., 2021d) (advanced), which has an advanced center point optimization mechanism: $L_c(\mu_1, \mu_2, \Sigma_1) = \ln\left((\mu_2 - \mu_1)^{\top} \Sigma_1^{-1} (\mu_2 - \mu_1) + 1\right)$.
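To make the regression loss concrete, the following NumPy sketch evaluates L_c + L_kf for one pair of Gaussians, using the exponential form of L_kf and the KLD-style center point loss stated above. It is a minimal sketch under assumptions: the helper names are hypothetical, which of the two Gaussians is the prediction is left open, and a real detector would use differentiable tensor operations instead of NumPy.

```python
import numpy as np


def kfiou(sigma1, sigma2):
    """KFIoU from two covariances via the Gaussian product."""
    n = sigma1.shape[0]
    K = sigma1 @ np.linalg.inv(sigma1 + sigma2)      # Kalman gain
    sigma = sigma1 - K @ sigma1
    vol = lambda s: 2.0 ** n * np.sqrt(np.linalg.det(s))
    v1, v2, v3 = vol(sigma1), vol(sigma2), vol(sigma)
    return v3 / (v1 + v2 - v3)


def center_loss(mu1, mu2, sigma1):
    """L_c(mu1, mu2, Sigma1) = ln((mu2-mu1)^T Sigma1^{-1} (mu2-mu1) + 1)."""
    d = mu2 - mu1
    return np.log(d @ np.linalg.inv(sigma1) @ d + 1.0)


def kfiou_regression_loss(mu1, sigma1, mu2, sigma2):
    """L_reg = L_c + L_kf with L_kf = exp(1 - KFIoU) - 1."""
    l_kf = np.exp(1.0 - kfiou(sigma1, sigma2)) - 1.0
    return center_loss(mu1, mu2, sigma1) + l_kf
```

Because unexpanded 2-D KFIoU never exceeds 1/3, L_kf bottoms out at e^{2/3} − 1 for two identical boxes rather than at zero; only its gradient matters for training.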

5.1. DATASETS AND IMPLEMENTATION DETAILS

Aerial image dataset: DOTA (Xia et al., 2018) is one of the largest datasets for oriented object detection in aerial images, with three released versions: DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0. DOTA-v1.0 contains 15 common categories, 2,806 images and 188,282 instances. DOTA-v1.5 uses the same images as DOTA-v1.0, but extremely small instances (less than 10 pixels) are also annotated. Moreover, a new category is added in this version, which contains 402,089 instances in total. DOTA-v2.0 contains 18 common categories, 11,268 images and 1,793,658 instances. We divide the images into 600 × 600 subimages with an overlap of 150 pixels and scale them to 800 × 800. HRSC2016 (Liu et al., 2017) contains images from two scenarios with ships on sea and close inshore. The training, validation and test set include 436, 181 and 444 images.

We use PointPillar (Lang et al., 2019) implemented in MMDetection3D (Contributors, 2020) as the baseline, with the training schedule inherited from SECOND (Yan et al., 2018): ADAM optimizer with a cosine-shaped cyclic learning rate scheduler that spans 160 epochs. The learning rate starts from 1e-4, reaches 1e-3 at the 60th epoch, and then goes down gradually to 1e-7. In the development phase, the experiments are conducted with a single model for 3-class joint detection.
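One possible reading of the cyclic schedule described above is a half-cosine ramp up followed by a half-cosine decay. The sketch below encodes that reading only; the exact curve used by the SECOND/MMDetection3D schedule is not given here, so the cosine interpolation and the function name `cyclic_cosine_lr` are assumptions.

```python
import math


def cyclic_cosine_lr(epoch, base_lr=1e-4, peak_lr=1e-3, final_lr=1e-7,
                     peak_epoch=60, total_epochs=160):
    """Learning rate at a given (possibly fractional) epoch.

    Assumed shape: cosine ramp from base_lr to peak_lr over the first
    peak_epoch epochs, then cosine decay from peak_lr to final_lr.
    """
    if epoch <= peak_epoch:
        phase = epoch / peak_epoch
        return base_lr + (peak_lr - base_lr) * 0.5 * (1.0 - math.cos(math.pi * phase))
    phase = (epoch - peak_epoch) / (total_epochs - peak_epoch)
    return final_lr + (peak_lr - final_lr) * 0.5 * (1.0 + math.cos(math.pi * phase))
```

The three anchor values from the text are reproduced at epochs 0, 60 and 160; everything in between depends on the assumed interpolation.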

5.2. ABLATION STUDY AND FURTHER COMPARISON

Ablation study on different center point losses. Tab. 1 compares the two different center point losses proposed in Sec. 4.2 on three versions of DOTA datasets. Even with the most commonly used L_c(t, t*), KFIoU loss achieves competitive performance, significantly better than GWD loss and comparable to KLD loss. For a fairer comparison, after adopting the same center point loss term as KLD loss, L_c(µ1, µ2, Σ1), the performance of KFIoU loss is further improved and becomes better than KLD loss, thanks to a better center point optimization mechanism.

the model shows strong performance on large aspect ratio objects (e.g. SV, LV, SH). However, the large number of anchors makes it time-consuming. Therefore, we use horizontal anchors by default to balance accuracy and speed. In terms of definition, experiments show that the OpenCV definition (D_oc) (Yang et al., 2019) is slightly better than the Long Edge definition (D_le) (Ma et al., 2018) on the three versions of DOTA. Angle direct regression (Reg.) always suffers from the boundary discontinuity problem, as widely studied recently (Yang & Yan, 2020). In contrast, angle indirect regression (Reg.*) is a simpler way to avoid the above issues and brings a performance boost according to Tab. 4.

PIoU calculates the SkewIoU by accumulating the contribution of interior overlapping pixels, but the effect is not significant. IoU-Smooth L1 partly circumvents the need for SkewIoU loss with gradient backpropagation by combining IoU and Smooth L1 loss. Although IoU-Smooth L1 achieves an improvement of 1.26%/0.29%/2.15% on DOTA-v1.0/v1.5/v2.0, its gradient is still dominated by Smooth L1, so it is still worse than plain SkewIoU loss. Modulated Loss and RIL implement ordered and disordered quadrilateral detection respectively, and the more accurate representation gives them both a considerable performance improvement. In particular, Modulated Loss achieves the third highest performance on DOTA-v1.5/v2.0.
CSL and DCL convert the angle prediction from regression to classification, cleverly eliminating the boundary discontinuity problem caused by the angle periodicity. GWD loss, KLD loss and KFIoU loss are three different regression losses based on Gaussian distribution. The results presented in our experiments of GWD loss and KLD loss are obtained by best-tuned hyperparameters. In contrast, KFIoU loss is free from hyperparameter tuning and has a more stable performance increase due to a more consistent calculation process with SkewIoU loss as the center point gets closer.

5.3. COMPARISON WITH THE STATE-OF-THE-ART

Tab. 5 compares recent detectors on DOTA-v1.0, as categorized by single-, refine-, and two-stage based methods. Since different methods use different image resolution, network structure, training strategies and various tricks, we cannot make absolutely fair comparisons. In terms of overall performance, our method has achieved the best performance so far, at around 77.35%/81.03%/80.93%.

6. CONCLUSION

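We have presented a trend-level consistent approximation to the ideal but gradient-training unfriendly SkewIoU loss for rotation detection, and we call it KFIoU loss, as the product of the Gaussian distributions is adopted to directly mimic the computing mechanism of SkewIoU loss by definition. This design differs from the distribution distance based losses, including GWD loss and KLD loss, which in our analysis have fundamental difficulty in achieving trend-level alignment with SkewIoU loss without tuning hyperparameters. Moreover, KFIoU is easier to implement and works better than plain SkewIoU due to the effective measurement for non-overlapping cases and complete derivation. Experimental results on various 2-D and 3-D datasets show the effectiveness of our approach.

A PROOF OF KFIOU UPPER BOUND

For an n-dimensional Gaussian distribution, its volume is:

$$V = 2^n \cdot |\Sigma^{\frac{1}{2}}| = 2^n \cdot |\Sigma|^{\frac{1}{2}} \tag{14}$$

For Σ_kf, we have

$$|\Sigma_{kf}| = |\Sigma_1 - \Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_1| = |\Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2| = \frac{|\Sigma_1| \cdot |\Sigma_2|}{|\Sigma_1 + \Sigma_2|} \tag{15}$$

According to Minkowski's inequality:

$$|\Sigma_1 + \Sigma_2|^{\frac{1}{n}} \geq |\Sigma_1|^{\frac{1}{n}} + |\Sigma_2|^{\frac{1}{n}} \tag{16}$$

Combining this with the mean inequality:

$$|\Sigma_1 + \Sigma_2|^{\frac{1}{n}} \geq |\Sigma_1|^{\frac{1}{n}} + |\Sigma_2|^{\frac{1}{n}} \geq 2 \cdot |\Sigma_1|^{\frac{1}{2n}} \cdot |\Sigma_2|^{\frac{1}{2n}} \tag{17}$$

Thus:

$$\frac{|\Sigma_1|^{\frac{1}{2n}} \cdot |\Sigma_2|^{\frac{1}{2n}}}{|\Sigma_1 + \Sigma_2|^{\frac{1}{n}}} \leq \frac{1}{2}, \quad \frac{|\Sigma_1|^{\frac{1}{2}} \cdot |\Sigma_2|^{\frac{1}{2}}}{|\Sigma_1 + \Sigma_2|} \leq \frac{1}{2^n} \tag{18}$$

and

$$|\Sigma_{kf}| = \frac{|\Sigma_1| \cdot |\Sigma_2|}{|\Sigma_1 + \Sigma_2|} \leq \frac{|\Sigma_1|^{\frac{1}{2}} \cdot |\Sigma_2|^{\frac{1}{2}}}{2^n}, \quad |\Sigma_{kf}|^{\frac{1}{2}} \leq \frac{|\Sigma_1|^{\frac{1}{4}} \cdot |\Sigma_2|^{\frac{1}{4}}}{2^{\frac{n}{2}}} \tag{19}$$

Applying the mean inequality again:

$$|\Sigma_{kf}|^{\frac{1}{2}} \leq \frac{|\Sigma_1|^{\frac{1}{4}} \cdot |\Sigma_2|^{\frac{1}{4}}}{2^{\frac{n}{2}}} \leq \frac{|\Sigma_1|^{\frac{1}{2}} + |\Sigma_2|^{\frac{1}{2}}}{2^{\frac{n}{2}+1}} \tag{20}$$

According to Eq. 14, we have

$$V_{kf} \leq \frac{V_1 + V_2}{2^{\frac{n}{2}+1}} \tag{21}$$

Therefore, the upper bound of KFIoU is

$$\mathrm{KFIoU} = \frac{V_{kf}}{V_1 + V_2 - V_{kf}} \leq \frac{1}{2^{\frac{n}{2}+1} - 1} \tag{22}$$

When n = 2 and n = 3, the upper bounds are $\frac{1}{3}$ and $\frac{1}{\sqrt{32}-1}$, respectively.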
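The bound in Eq. 22 can also be checked numerically. The sketch below samples random symmetric positive-definite covariances and compares the largest KFIoU found with the bound; it is an illustrative sanity check with hypothetical helper names, not part of the proof.

```python
import numpy as np


def kfiou(sigma1, sigma2):
    """KFIoU = V_kf / (V_1 + V_2 - V_kf) with V = 2^n |Sigma|^(1/2)."""
    n = sigma1.shape[0]
    sigma_kf = sigma1 - sigma1 @ np.linalg.inv(sigma1 + sigma2) @ sigma1
    vol = lambda s: 2.0 ** n * np.sqrt(np.linalg.det(s))
    v1, v2, vkf = vol(sigma1), vol(sigma2), vol(sigma_kf)
    return vkf / (v1 + v2 - vkf)


def kfiou_upper_bound(n):
    """Right-hand side of Eq. 22."""
    return 1.0 / (2.0 ** (n / 2.0 + 1.0) - 1.0)


def random_spd(n, rng):
    """Random symmetric positive-definite matrix (well conditioned)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + 0.1 * np.eye(n)


def max_sampled_kfiou(n, trials=2000, seed=0):
    """Largest KFIoU observed over random covariance pairs."""
    rng = np.random.default_rng(seed)
    return max(kfiou(random_spd(n, rng), random_spd(n, rng))
               for _ in range(trials))
```

Equality is reached when the two covariances coincide, which matches the identical-box case of the main text.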

B SUPPLEMENTARY EXPERIMENT

Ablation study of three forms of KFIoU loss on two detectors. We use two different detectors and three different KFIoU-based loss functions to verify its effectiveness, as shown in Tab. 6. The RetinaNet-based detector produces a large number of low-SkewIoU predicted bounding boxes in the early stage of training, which yield very large losses after the log function and weaken the improvement of the model. Compared with the linear function, the derivative of the exp-based function pays more attention to the training of difficult samples, so it achieves a higher performance, at 70.64%. In contrast, the R³Det-based detector can generate high-quality predicted boxes at the beginning of training by adding refinement stages, so it does not suffer the same troubles as RetinaNet. Due to the same mechanism of focusing on difficult samples, log and exp-based functions are both better than linear functions, and the best performance is achieved with the log-based function, about 72.28%. We also expand KFIoU by 3 times to make its range truly consistent with the IoU loss, at [0, 1]. However, this consistency does not bring any additional gains, so the following experiments all use KFIoU without expansion.

Ablation study of training strategies and tricks. We reimplement KFIoU based on the more powerful benchmark, MMRotate (Zhou et al., 2022). We use a single GeForce RTX 3090 Ti with a total batch size of 2 for training. For ResNet (He et al., 2016), the SGD optimizer is adopted with an initial learning rate of 0.0025. The momentum and weight decay are 0.9 and 0.0001, respectively. For Swin Transformer (Liu et al., 2021), the AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2018) optimizer is adopted with an initial learning rate of 0.0001. The weight decay is 0.05. In addition, we adopt learning rate warmup for 500 iterations, and the learning rate is divided by 10 at each decay step. Tab. 7 performs ablation experiments on four detectors: RetinaNet (Lin et al., 2017b), S²A-Net (Han et al., 2021a), R³Det (Yang et al., 2021b), and RoI Transformer (Ding et al., 2019).
The experimental results show that KFIoU can stably enhance the performance of the detector. In order to further improve the performance of the model on DOTA, we verified many commonly used training strategies and tricks, including backbone, training schedule, data augmentation and multi-scale training and testing, as shown in Tab. 7.

Ablation study of KFIoU loss on 3-D case. More detailed results on KITTI are shown in Tab. 8.

Ablation study on more datasets. The performance of different loss functions is compared in Tab. 9 on the ICDAR2015, UCAS-AOD, SSDD (Li et al., 2017) and HRSID (Wei et al., 2020b) datasets, and KFIoU is still the best.

E LIMITATION

Note that Gaussian modeling has the limitation that it cannot be directly applied to quadrilateral/polygon detection (Ming et al., 2021a; Xu et al., 2020), which is also an important task in aerial images, scene text, etc. In addition, the Gaussian distribution of a square-like object is close to an isotropic circle, which is not suitable for object heading detection.



Figure 1: For rotation detection (Yang et al., 2021b), there is a notable inconsistency between the final detection metric i.e. mAP (largely depending on SkewIoU) and regression-based loss e.g. the popular Smooth L1. See Fig. 3(a) and Fig. 3(b) for more specific comparison.

(a) Angle difference case. (b) Aspect ratio case. (c) Regardless of the case.

Figure 3: Behavior comparison of different losses in different cases. (a) depicts the relation between angle difference and loss functions. (b) shows the changes of the five losses under different aspect ratio conditions. (c) gives a scatter plot between approximate losses and SkewIoU loss for 1,000 examples, regardless of the case, obtained by randomly generating box pairs with close centers (within 5 pixels).

Figure 4: Visual comparison between Smooth L1 loss-based (left), GWD-based (middle) and the KFIoU-based (right) detectors on DOTA (2-D) and KITTI (3-D). For 3-D object detection, red and blue boxes denote ground-truth and predicted bounding boxes, respectively.

Figure 5: Visual comparison between Smooth L1 loss-based (left), GWD-based (middle) and the KFIoU-based (right) detectors on FDDB.

Figure 6: (a) Impact of center deviation on the trend consistency of each loss function. (b) Impact of object scale on the trend consistency of each loss function under a 5-pixel center deviation.

Comparison of the properties and performance of different regression losses. Base model is RetinaNet. BC and HP denote Boundary Continuity and Hyperparameter. † indicates that the first term of KLD is taken as the center point loss, i.e. L c (µ 1 , µ 2 , Σ 1 ).

Ablation study on various 2-D datasets with different base detectors. 'R', 'F' and 'G' indicate random rotation, flipping, and graying. † indicates that the first term of KLD is taken as the center point loss, i.e. L_c(µ1, µ2, Σ1). Base detector is RetinaNet. Columns: Data Aug. | Reg. Loss | Hmean/AP50 | Hmean/AP60 | Hmean/AP75 | Hmean/AP85 | Hmean/AP50:95.

t_n represents the label of the n-th object, and p_n is the n-th probability distribution of classes calculated by the sigmoid function. λ1, λ2 control the trade-off and are set to {0.01, 1}. The classification loss L_cls is set as the focal loss (Lin et al., 2017b). The regression loss is set by L_reg = L_c + L_kf.

contains images from two scenarios with ships on sea and close inshore. The training, validation and test set include 436, 181 and 444 images.

Results on KITTI val split 3D detection and BEV Detection. The evaluation is classified into Easy, Moderate or Hard according to the object size, occlusion and truncation. All results are evaluated by the mean average precision with a rotated IoU threshold of 0.7 for cars and 0.5 for pedestrians and cyclists. To evaluate the model's performance on the KITTI val split, we train our model on the train set and report the results on the val set.



AP of different objects on DOTA-v1.0. Red and blue: top two performances.

Ablation study of different KFIoU loss forms with different detectors on DOTA-v1.0.

Ablation study of training strategies and tricks. Rotate and MS indicate rotation augmentation and multi-scale training and testing.

Results on KITTI val split 3D detection and BEV Detection. .33 88.11 85.44 64.46 57.26 52.53 87.40 68.19 64.47 +KFIoU 72.08 92.15 89.90 85.66 63.45 57.81 53.07 87.52 68.55 64.56 consistency do not bring any additional gains, so the following experiments are all use the KFIoU before non-expansion.

Results on more datasets, the base detector is RetinaNet.

funding

The work was partly done when the first author Xue Yang was an intern at Huawei Cloud. The work was also in part supported by NSFC (62222607), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

1 See an open-source version with thousands of lines of code for implementing the loss at https:// github.

