H2RBOX: HORIZONTAL BOX ANNOTATION IS ALL YOU NEED FOR ORIENTED OBJECT DETECTION

Abstract

Oriented object detection arises in many applications, from aerial images to autonomous driving, yet many existing detection benchmarks are annotated only with horizontal bounding boxes, which are also less costly than fine-grained rotated boxes. This leaves a gap between the readily available training corpus and the rising demand for oriented object detection. This paper proposes a simple yet effective oriented object detection approach called H2RBox, which uses only horizontal box annotations for weakly-supervised training, closes the above gap, and shows competitive performance even against detectors trained with rotated boxes. The core of our method is weakly- and self-supervised learning, which predicts the angle of the object by learning the consistency of two different views. To the best of our knowledge, H2RBox is the first horizontal box annotation-based oriented object detector. Compared to an alternative, i.e. horizontal box-supervised instance segmentation with our post-hoc adaptation to oriented object detection, our approach is not susceptible to the prediction quality of the mask and performs more robustly in complex scenes containing a large number of dense objects and outliers. Experimental results show that H2RBox has significant performance and speed advantages over horizontal box-supervised instance segmentation methods, as well as lower memory requirements. Compared to rotated box-supervised oriented object detectors, our method shows very close performance and speed. The source code is available at PyTorch-based MMRotate and Jittor-based JDet.

1. INTRODUCTION

Figure 1: Visual comparison of three HBox-supervised rotated detectors on aircraft detection (Wei et al., 2020), ship detection (Yang et al., 2018), vehicle detection (Azimi et al., 2021), etc. The HBox-Mask-RBox style methods, i.e. BoxInst-RBox (Tian et al., 2021) and BoxLevelSet-RBox (Li et al., 2022b), do not perform well in complex and object-cluttered scenes.

In addition to the relatively mature area of horizontal object detection (Liu et al., 2020), oriented object detection has received extensive attention, especially for complex scenes where a fine-grained bounding box (e.g. rotated/quadrilateral bounding box) is needed, such as aerial images (Ding et al., 2019; Yang et al., 2019a), scene text (Zhou et al., 2017), retail scenes (Pan et al., 2020), etc. Despite the increasing popularity of oriented object detection, many existing datasets are annotated with horizontal boxes (HBox), which may not be compatible (at least on the surface) with training an oriented detector. Hence labor-intensive re-annotation¹ has been performed on existing horizontally annotated datasets. For example, DIOR-R (Cheng et al., 2022) and SKU110K-R (Pan et al., 2020) are rotated box (RBox) annotations of the aerial image dataset DIOR (192K instances) (Li et al., 2020) and the retail scene dataset SKU110K (1,733K instances) (Goldman et al., 2019), respectively. An attractive question is whether one can achieve weakly supervised learning for oriented object detection using only the more readily available HBox annotations rather than RBox ones. One potential technique, verified in our experiments, is HBox-supervised instance segmentation, e.g. BoxInst (Tian et al., 2021), BoxLevelSet (Li et al., 2022b), etc. Based on the segmentation mask produced by these methods, one can readily obtain the final RBox by finding its minimum circumscribed rectangle. We term this procedure HBox-Mask-RBox style methods, i.e. BoxInst-RBox and BoxLevelSet-RBox in this paper.
Yet this in fact involves a potentially more challenging task, i.e. instance segmentation, whose quality can be sensitive to background noise and can heavily influence the subsequent RBox detection step, especially in complex scenes (Fig. 1(a)) and when objects are crowded (Fig. 1(b)). Also, involving segmentation is often more computationally costly, and the whole procedure can be time consuming (see Tab. 1-2). In this paper, we propose a simple yet effective approach, dubbed HBox-to-RBox (H2RBox), which achieves performance close to that of RBox annotation-supervised methods, e.g. (Han et al., 2021b; Yang et al., 2023a), by using only HBox annotations, and even outperforms them in a considerable number of cases, as shown in our experiments. The core of our method is weakly- and self-supervised learning, which predicts the angle of the object by learning the enforced consistency between two different views. Specifically, we predict five offsets in the regression sub-network based on FCOS (Tian et al., 2019) in the WS branch (see Fig. 2, left) so that the final decoded outputs are RBoxes. Since we only have horizontal box annotations, we use the horizontal circumscribed rectangle of the predicted RBox when computing the regression loss. Ideally, predicted RBoxes and the corresponding (unlabeled) ground truth (GT) RBoxes have highly overlapping horizontal circumscribed rectangles. In the SS branch (see Fig. 2, right), we rotate the input image by a random angle and predict the corresponding RBox through a regression sub-network. Then, the consistency of RBoxes between the two branches, including scale consistency and spatial location consistency, is learned to eliminate the undesired cases and ensure the reliability of the WS branch. Our main contributions are as follows: 1) To the best of our knowledge, we propose the first HBox annotation-based oriented object detector.
Specifically, a weakly- and self-supervised angle learning paradigm is devised which closes the gap between HBox training and RBox testing, and it can serve as a plugin for existing detectors. 2) We show through geometric equations that the predicted RBox is the correct GT RBox under our designed pipeline and consistency loss, without relying on not-fully-verified or ad-hoc assumptions, e.g. color-pairwise affinity in BoxInst, or on additional intermediate results whose quality cannot be ensured, e.g. the feature maps used by many weakly supervised methods (Wang et al., 2022). 3) Compared with the potential alternatives, e.g. HBox-Mask-RBox whose instance segmentation part is fulfilled by the state-of-the-art BoxInst, our H2RBox outperforms by about 14% mAP (67.90% vs. 53.59%) on the DOTA-v1.0 dataset, requires less than one third of its memory (6.25 GB vs. 19.93 GB), and is around 12× faster in inference (31.6 fps vs. 2.7 fps). 4) Compared with the fully RBox annotation-supervised rotation detector FCOS, H2RBox is only 0.91% (74.40% vs. 75.31%) and 1.01% (33.15% vs. 34.16%) behind on DOTA-v1.0 and DIOR-R, respectively. Furthermore, we do not add extra computation in the inference stage, thus maintaining a comparable detection speed, about 29.1 FPS vs. 29.5 FPS on DOTA-v1.0.

2. RELATED WORK

RBox-supervised Oriented Object Detection. Oriented object detection in visual images has received increasing attention across different areas, e.g. aerial images (Xu et al., 2020; Yang et al., 2022; 2023a; Hou et al., 2023), scene text (Zhou et al., 2017; Liao et al., 2018), retail (Pan et al., 2020; Chen et al., 2020), etc. Earlier methods including RRPN (Ma et al., 2018), ROI-Transformer (Ding et al., 2019) and ReDet (Han et al., 2021b) directly perform angle regression. To address the loss discontinuity and regression inconsistency caused by the periodicity of the angle, subsequent works either convert the parameterization of the rotated bounding box into 2-D Gaussian distributions (Yang et al., 2021c; d) or transform angle regression into classification (Yang et al., 2021a; Yang & Yan, 2022). Hou et al. (2022) and Li et al. (2022a) introduce the adaptive point set for object representation to mitigate the angle regression sensitivity while capturing instances' semantic information.

HBox-supervised Instance Segmentation and Its Potential for Oriented Object Detection. The bold idea of using only HBox annotations to train a rotated object detector is attractive yet still rarely studied in the literature; it can be seen as a weakly-supervised (WS) learning paradigm for oriented object detection. A related and better-studied technique is HBox-supervised instance segmentation, which tries to segment instances based on HBox annotations for WS training. For instance, SDI (Khoreva et al., 2017) relies on the region proposals generated by MCG (Pont-Tuset et al., 2020). BoxLevelSet (Li et al., 2022b) introduces an energy function to predict the instance-aware mask as the level set. Though one can obtain the final object orientation from the segmentation mask of the above instance segmentation methods, e.g. by finding the minimum circumscribed rectangle, we argue and show in our experiments that such an HBox-Mask-RBox pipeline can be complex (segmentation can be even more difficult than rotation detection, see Fig. 1) and expensive in the presence of dense objects and background noise. Hence we aim to skip the segmentation step and build an HBox-to-RBox paradigm, which to the best of our knowledge has not been studied before.

3. PROPOSED METHOD

The overview of H2RBox is shown in Fig. 2. Two augmented views are generated, with care taken to avoid angle information leakage that would otherwise cause overfitting during training. There are two branches. One branch is used for weakly-supervised (WS) learning, where the supervision is the GT HBox from the training data and the regression loss is calculated between the GT HBox and the circumscribed HBox derived from the RBox predicted by this branch. The other branch is trained by self-supervised (SS) learning that involves the two augmented views of the raw input image and encourages consistent RBox predictions between the two views. The final loss is the weighted sum of the WS loss and the SS loss. Note that test-stage prediction involves only the WS branch.

3.1. AUGMENTED VIEW GENERATION

In line with the general idea of self-supervised learning by data augmentation, given the input image, we perform a random rotation to generate View 2 while keeping View 1 identical to the input image, as shown in Fig. 2. However, the rotation transformation inevitably introduces an artificial black border area, which risks leaking GT angle information. We provide two techniques to resolve this issue: 1) Center Region Cropping: crop a $\frac{\sqrt{2}}{2}s \times \frac{\sqrt{2}}{2}s$ area² in the center of the image. 2) Reflection Padding: fill the black border area by reflection padding. If Center Region Cropping is used in View 2, View 1 must also be cropped in the same way and the corresponding ground truth filtered accordingly. Reflection Padding works better than Center Region Cropping because it preserves as much of the area as possible while maintaining a higher image resolution. Fig. 3(a) and Fig. 3(b) compare zeros padding and reflection padding. Note that the black border area does not participate in the regression loss calculation in the SS branch, so it does not matter that this region is filled with unlabeled foreground objects by reflection padding.

3.2. THE WEAKLY-SUPERVISED (WS) BRANCH

Recall that we cannot use the predicted RBox to calculate the final regression loss directly, as there is no RBox annotation but only HBox. Therefore, we first convert the predicted RBox into the corresponding minimum horizontal circumscribed rectangle and calculate the regression loss between the derived HBox and the GT HBox annotation (we defer the details of the loss formulation to Sec. 3.5). As the network becomes better trained, an indirect connection (the horizontal circumscribed rectangle constraint) emerges between the predicted RBox and the (unlabeled) GT RBox: no matter how an object is rotated, their corresponding horizontal circumscribed rectangles are always highly overlapping. However, as shown in Fig. 4(a), using the WS loss alone can only localize the objects and is not effective enough for accurate rotation estimation.

The constraints listed in Fig. 5 relate the RBox size $(w, h)$ and its angles $\theta$ and $\varphi$ in the two views to the widths and heights of the horizontal circumscribed rectangles in the WS and SS branches:

$$
\begin{aligned}
w\,|\cos\theta| + h\,|\sin\theta| &= w_{ws}, & w\,|\sin\theta| + h\,|\cos\theta| &= h_{ws}, \\
w\,|\cos\varphi| + h\,|\sin\varphi| &= w_{ss}, & w\,|\sin\varphi| + h\,|\cos\varphi| &= h_{ss}.
\end{aligned}
$$
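The conversion from an RBox to its horizontal circumscribed rectangle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `r2h` follows the notation of Sec. 3.5, while the tuple layout `(cx, cy, w, h, theta)` and the use of radians are assumptions made for the example. It also shows why the WS loss alone cannot pin down the angle: two different orientations share the same circumscribed rectangle.

```python
import math

def r2h(rbox):
    """Convert an RBox (cx, cy, w, h, theta) to its horizontal circumscribed
    rectangle (cx, cy, w_h, h_h). theta is in radians (assumed convention)."""
    cx, cy, w, h, theta = rbox
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return (cx, cy, w * c + h * s, w * s + h * c)

# Two different RBoxes: one at angle theta, one mirrored to pi - theta.
theta = math.radians(25)
gt_rbox = (50.0, 40.0, 30.0, 10.0, theta)
mirrored_rbox = (50.0, 40.0, 30.0, 10.0, math.pi - theta)

# Both produce the same HBox, so an HBox-only loss cannot tell them apart.
print(r2h(gt_rbox))
print(r2h(mirrored_rbox))
```

Because the mapping discards the sign of the angle, additional constraints are needed to recover the orientation, which is the role of the SS branch.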

3.3. THE SELF-SUPERVISED (SS) BRANCH

As a complement to the WS loss, we further introduce the SS loss. The SS branch contains only one regression sub-network for predicting the RBox in the rotated View 2. Given a (random) rotation transformation $R$ (with degree $\Delta\theta$) as adopted in View 2, the relationship between location $(x, y)$ of View 1 in the WS branch and location $(x^*, y^*)$ of View 2 with rotation $R$ in the SS branch is:

$$
(x^*, y^*) = (x - x_c,\; y - y_c)\,R^\top + (x_c, y_c), \qquad
R = \begin{pmatrix} \cos\Delta\theta & -\sin\Delta\theta \\ \sin\Delta\theta & \cos\Delta\theta \end{pmatrix} \tag{1}
$$

where $(x_c, y_c)$ is the rotation center (i.e. the image center). Recall that the labels of the black border area (in Fig. 3) in the SS branch are set as invalid and negative samples, which do not participate in the losses designed below.

A scale loss $L_{wh}$ accounts for scale consistency to enhance the indirect connection described above: for augmented objects obtained from the same object through different rotations, a set of RBoxes of the same scale is predicted by the detector, and these predicted RBoxes and the corresponding (unlabeled) GT RBoxes shall have highly overlapping horizontal circumscribed rectangles. With such an enhanced indirect connection, comprising the horizontal circumscribed rectangle constraint and the scale constraint, we can restrict the prediction results to a limited number of feasible cases, explained as follows. Fig. 5 shows two cases based on the above enhanced indirect connection and lists four equations in the four variables $(w, h, \theta, \varphi)$. Due to the periodicity of the angles, there are only two feasible solutions to the four equations within the angle definition, i.e. the green GT RBox and the orange symmetric RBox. In other words, with such a strengthened indirect connection, the predicted RBox either coincides with the GT RBox, $B^c(w, h, \theta)$, or is symmetrical to it about the center of the object, $B^s(w, h, \pi - \theta)$.

It can be seen from Fig. 4(a) that there are still many bad cases with extremely inaccurate angles after using $L_{wh}$. Interestingly, if we apply a symmetry transformation about the center point to these bad cases, the result becomes much better. When generating views, a geometric prior is available, namely the spatial transformation relationship between the two views, denoted as $R$ in Eq. 1. Thus, we obtain the following four transformation relationships between the two branches, denoted $T\langle B_{ws}, B_{ss}\rangle$:

$$
\begin{aligned}
T\langle B^c_{ws}, B^c_{ss}\rangle &= \{R\}, & T\langle B^c_{ws}, B^s_{ss}\rangle &= \{R, S\} = \{S, R^\top\}, \\
T\langle B^s_{ws}, B^s_{ss}\rangle &= \{R^\top\}, & T\langle B^s_{ws}, B^c_{ss}\rangle &= \{R^\top, S\} = \{S, R\}
\end{aligned} \tag{2}
$$

where $B^c_{ws}$ and $B^s_{ss}$ represent the coincident bounding box predicted in the WS branch and the symmetric bounding box predicted in the SS branch, respectively, and $S$ denotes the symmetric transformation. Taking $T\langle B^c_{ws}, B^s_{ss}\rangle = \{R, S\}$ as an example, it means $B^s_{ss} = S(R \cdot B^c_{ws})$. Therefore, an effective way to eliminate the symmetric case is to let the model know that the relationship between the RBoxes predicted by the two branches can only be $R$.

Inspired by the above analysis, a spatial location loss is used to enforce the spatial transformation relationship $R$ between the RBoxes predicted by the two branches. Specifically, the RBox predicted by the WS branch is first transformed by $R$, and then several losses (e.g. center point loss $L_{xy}$ and angle loss $L_{\theta}$) are used to measure its location consistency with the RBox predicted by the SS branch. In fact, the spatial location consistency, especially the angle loss, provides a fifth constraint equation on the angles ($\varphi - \theta = \Delta\theta \neq 0$), so that the system of equations in Fig. 5 has a unique solution (i.e. the predicted RBox is the GT RBox). This argument is not a strict proof, because the system of equations is nonlinear. The final SS learning consists of scale-consistent and spatial-location-consistent learning:

$$
\mathrm{Sim}\langle R \cdot B_{ws}, B_{ss}\rangle = 1 \tag{3}
$$

Fig. 4(b) shows the visualization when using the SS loss, with accurate predictions. The appendix shows visualizations of feasible solutions for different combinations of constraints.
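The elimination of the symmetric solution can be checked numerically. The following is a small sketch rather than the authors' code: the helper names, the concrete box size and angles, and the choice to compare angles modulo π are all assumptions made for illustration. It verifies that the coincident and the symmetric candidates satisfy the four circumscribed-rectangle equations in both views, while only the coincident one satisfies the angle constraint φ − θ = Δθ.

```python
import math

def hbox_size(w, h, angle):
    """Width and height of the horizontal circumscribed rectangle."""
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    return (w * c + h * s, w * s + h * c)

def satisfies_angle_constraint(theta, phi, delta, tol=1e-9):
    """Check phi - theta == delta modulo pi (assumed angle period)."""
    d = (phi - theta - delta) % math.pi
    return min(d, math.pi - d) < tol

w, h = 30.0, 10.0
theta = math.radians(25)   # GT angle in View 1 (hypothetical value)
delta = math.radians(40)   # rotation applied to obtain View 2
phi = theta + delta        # GT angle in View 2

coincident = (theta, phi)
symmetric = (math.pi - theta, math.pi - phi)

for name, (t, p) in (("coincident", coincident), ("symmetric", symmetric)):
    same_view1 = all(math.isclose(a, b) for a, b in
                     zip(hbox_size(w, h, t), hbox_size(w, h, theta)))
    same_view2 = all(math.isclose(a, b) for a, b in
                     zip(hbox_size(w, h, p), hbox_size(w, h, phi)))
    print(name, same_view1, same_view2,
          satisfies_angle_constraint(t, p, delta))
```

For the symmetric candidate the angle difference between the two views is −Δθ rather than Δθ, so the angle constraint rejects it whenever 2Δθ is not a multiple of π.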

3.4. LABEL RE-ASSIGNER

Since the consistency of the predictions of the two branches needs to be calculated, the labels need to be re-assigned in the SS branch. Specifically, the labels at location $(x^*, y^*)$ of the SS branch, including center-ness ($cn^*$), target category ($c^*$) and target GT HBox ($gtbox^{h*}$), are the same as those at location $(x, y)$ of the WS branch. Besides, we also need to assign the $rbox^{ws}(x_{ws}, y_{ws}, w_{ws}, h_{ws}, \theta_{ws})$ predicted by the WS branch as the target RBox of the SS branch to calculate the SS loss. We propose two re-assignment strategies: 1) One-to-one (O2O) assignment: with $cn$, $c$ and $gtbox^{h}$, the $rbox^{ws}$ predicted at location $(x, y)$ in the WS branch is used as the target RBox at $(x^*, y^*)$ of the SS branch (see Fig. 3(a)). 2) One-to-many (O2M) assignment: the $rbox^{ws}$ closest to the center point of the $gtbox^{h}$ is used as the target RBox at location $(x^*, y^*)$ of the SS branch (see Fig. 3(b)).

The visualized label assignment in Fig. 3 further shows that the SS loss effectively eliminates predictions of the undesired case. The label re-assignment of different detectors may require different strategies. The key is to design a suitable matching strategy for the prediction results of the two views, which allows the network to learn the consistency better.

3.5. THE OVERALL LOSS BY COMBINING THE WS AND SS LOSSES

Since the WS branch is a rotated object detector based on FCOS, the losses in this part mainly include the regression loss $L_{reg}$, the classification loss $L_{cls}$, and the center-ness loss $L_{cn}$. We define the WS loss in the WS branch as follows:

$$
L_{ws} = \frac{\mu_1}{N_{pos}} \sum_{(x,\,\dots)} \dots
$$

The SS loss is defined over the boxes $B_{ws}(-w^*_{ws}, -h^*_{ws}, w^*_{ws}, h^*_{ws})$, $B^1_{ss}(-w_{ss}, -h_{ss}, w_{ss}, h_{ss})$ and $B^2_{ss}(-h_{ss}, -w_{ss}, h_{ss}, w_{ss})$. We set $\gamma_1 = 0.15$ and $\gamma_2 = 1$ by default. $L_{wh\theta}$ takes into account the loss discontinuity caused by the boundary issues (Yang et al., 2021c), such as the periodicity of the angle and the exchangeability of edges. The overall loss is a weighted sum of the WS loss and the SS loss, where we set $\lambda = 0.4$ by default.

(He et al., 2016) backbone and FPN neck (Lin et al., 2017a) as the baseline method and building block on which we develop our approach (see Fig. 1). To implement the weakly-supervised HBox-Mask-RBox alternatives for comparison, we use two strong HBox annotation-based instance segmentation methods, BoxInst and BoxLevelSet, each followed by finding the minimum compact surrounding rectangle of the mask as the detected RBox; we dub them BoxInst-RBox and BoxLevelSet-RBox, respectively. All models are trained with AdamW (Loshchilov & Hutter, 2018) on a GeForce RTX 3090 GPU, except BoxLevelSet (Li et al., 2022b), which requires an NVIDIA V100 with larger memory. The initial learning rate is $10^{-4}$ with 2 images per mini-batch. The weight decay is 0.05. Besides, we adopt learning rate warm-up for 500 iterations, and the learning rate is divided by 10 at each decay step. Random flipping is adopted without any additional tricks.

4.2. MAIN RESULTS

Results on DOTA-v1.0. As shown in Tab. 1, our method significantly outperforms BoxInst-RBox and BoxLevelSet-RBox by 14.31% and 11.46% in terms of AP50, respectively. Moreover, our method is also more efficient in memory and inference. Specifically, compared to BoxInst, we need less than one third of its memory (6.25 GB vs. 19.93 GB) and have an approximately 12× speed advantage (31.6 fps vs. 2.7 fps). Compared to BoxLevelSet, our method uses only about a quarter of its memory (6.25 GB vs. 26.81 GB), and inference is about 7 times faster (31.6 fps vs. 4.7 fps). In fact, the main cost of the -RBox methods comes from the costly post-processing step of finding the compact surrounding box as the RBox, which is fulfilled by calling an OpenCV function in our implementation. Our method even outperforms several RBox-supervised methods, such as RepPoints and RetinaNet. Under the '1x' and '3x' training schedules, our method slightly lags behind the baseline method, i.e. FCOS (recall it is RBox-supervised), by 2.96% and 1.81%. After using multi-scale training and testing, the gap is reduced to only 0.91% (75.31% vs. 74.40%). Results on DIOR-R. Note that some categories in this dataset, including Chimney, Wind mill, Airport, and Golf field, are all forcefully annotated by horizontal boxes though the objects are not exactly

4.3. ABLATION STUDIES

The ablation study is performed on the proposed H2RBox with 12 training epochs. Border effect elimination for view generation. Tab. 3 studies the impact of different border effect elimination strategies for view generation, in terms of padding and/or cropping (see Sec. 3.1). Such techniques are essential to avoid ground truth angle information leakage; otherwise the model overfits, leading to a significant performance drop, as verified in the first row of the table. Note that when both reflection padding and cropping are applied, the AP drops slightly from 35.92% to 33.60% compared with using reflection padding only. A possible reason is the reduced size of the input image after cropping. Hence in all other experiments we always use reflection padding alone. Label re-assignment. Tab. 4 shows that the one-to-one strategy outperforms the one-to-many strategy. Strategies for dealing with isotropic circular object classes. For circular objects like Storage Tank (ST) and Roundabout (RA), the self-supervised loss takes no effect, as it is insensitive to isotropic information. We apply two treatments to handle such circular objects. S1: for training, we mask the SS loss for circular classes. S2: for testing, the horizontal circumscribed rectangle of the circular category is taken as the final output. Tab. 5 shows that, when either or both strategies are used, the performance is greatly improved, by about 15% on ST and about 25% on RA. Self-supervised loss. Without the SS loss, Tab. 6 shows that our method only achieves 12.63% and 15.27% on DOTA-v1.0 and DIOR-R, respectively. In contrast, the use of the SS loss leads to a substantial increase in overall performance, reaching 35.92% and 33.15%. Figure 4(b) also shows that the SS loss effectively helps the model learn the correct object angle information.
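The test-time strategy S2 for circular classes can be sketched as a simple post-processing step. This is an illustrative sketch only: the detection format `(label, (cx, cy, w, h, theta))`, the label strings, and the function name `apply_s2` are hypothetical and not taken from the released code.

```python
import math

# Circular categories named in the ablation (label strings are assumed).
CIRCULAR_CLASSES = {"ST", "RA"}

def apply_s2(detections):
    """Replace RBoxes of circular classes by their horizontal circumscribed
    rectangle, expressed as an RBox with angle 0. Other classes are kept."""
    output = []
    for label, (cx, cy, w, h, theta) in detections:
        if label in CIRCULAR_CLASSES:
            c, s = abs(math.cos(theta)), abs(math.sin(theta))
            box = (cx, cy, w * c + h * s, w * s + h * c, 0.0)
        else:
            box = (cx, cy, w, h, theta)
        output.append((label, box))
    return output

dets = [("ST", (10.0, 10.0, 4.0, 4.0, 0.3)),
        ("PL", (20.0, 20.0, 6.0, 2.0, 0.3))]
print(apply_s2(dets))
```

Since the predicted angle of an isotropic object carries no information, replacing it with the axis-aligned enclosing box removes an arbitrary degree of freedom from the output.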

5. CONCLUSION

This paper presents H2RBox, the first (to the best of our knowledge) HBox-supervised oriented object detector. H2RBox learns the rotation via self-supervised learning, whose loss measures the consistency of the predicted angles in two different views. Compared to the alternative HBox-supervised instance segmentation methods, H2RBox achieves much higher detection accuracy, especially for complex scenes, with lower memory and higher speed. Compared with fully RBox-supervised algorithms, our method remains competitive.



² When the rotation angle is an odd multiple of 45°, the black border area reaches its peak, so the side length of the largest crop area is $\frac{\sqrt{2}}{2}$ of the side length of the original image ($s$); refer to View 2 in Fig. 2.
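The crop size in this footnote can be checked with a short calculation. The sketch below assumes a square image of side `s` rotated about its center; the closed-form expression `s / (|cos a| + |sin a|)` for the largest centered axis-aligned square that stays inside the rotated image is standard geometry and is not stated in the paper, and the function name is hypothetical.

```python
import math

def max_center_crop_side(s, angle_deg):
    """Side of the largest centered axis-aligned square that contains no
    black border after rotating an s x s image by angle_deg about its center."""
    a = math.radians(angle_deg)
    return s / (abs(math.cos(a)) + abs(math.sin(a)))

s = 1024
worst = min(max_center_crop_side(s, deg) for deg in range(0, 360))
print(worst, s * math.sqrt(2) / 2)  # the two values agree at 45 degrees
```

A crop of side √2/2 · s is therefore safe for every rotation angle, which is why a single fixed crop size can be used for all random rotations.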



Figure 2: Our H2RBox consists of two branches fed respectively with two augmented views (View 1 and View 2) of the input image. The left Weakly-supervised Branch can in general be any rotated object detector (FCOS here) for RBox prediction, whose circumscribed HBox is used for supervised learning given the GT HBox label, in the sense of weakly-supervised learning. This branch is also used for test-stage inference. The right Self-supervised Branch tries to achieve RBox prediction consistency between the two views with self-supervised learning. The image is from the DIOR-R dataset.

(a) O2O by zeros padding. (b) O2M by reflection padding. (c) The two re-assignment strategies.

Figure 3: Comparison of different padding methods (Sec. 3.1) and re-assignment strategies (Sec. 3.4). Green and red RBoxes represent the target $rbox^{ws*}$ and $rbox^{ss}$, respectively.

Figure 4: Visual comparison of our methods with and without the SS loss used in the SS branch. It can help learn the scale and spatial location consistency between the two branches.

Figure 5: Proof of the relationship between predicted RBox and GT RBox under the horizontal circumscribed rectangle constraint and the scale constraint. Green and orange RBoxes represent the correct coincident prediction $B^c$ and the undesired symmetric prediction $B^s$.

The two generated views (View 1 and View 2) are respectively fed into the two branches with the parameter-shared backbone and neck, specified as ResNet (He et al., 2016) and FPN (Lin et al., 2017a), as shown in Fig. 2. The WS branch here is specified by an FCOS-based rotated object detector, which is involved in both training and inference. This branch contains regression and classification sub-networks to predict RBox, category, and center-ness.

With the O2M assignment, the target RBox at location $(x^*, y^*)$ of the SS branch is illustrated in Fig. 3(b), and Fig. 3(c) visualizes the difference between the two re-assignment strategies. After re-assigning, we need to perform a rotation transformation on the $rbox^{ws}$ to get the $rbox^{ws*}(x^*_{ws}, y^*_{ws}, w^*_{ws}, h^*_{ws}, \theta^*_{ws})$ for calculating the SS loss according to Eq. 3:

$$
(x^*_{ws}, y^*_{ws}) = (x_{ws} - x_c,\; y_{ws} - y_c)\,R^\top + (x_c, y_c), \qquad
(w^*_{ws}, h^*_{ws}) = (w_{ws}, h_{ws}), \qquad
\theta^*_{ws} = \theta_{ws} + \Delta\theta \tag{4}
$$
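Eq. 4 can be illustrated with a small function that maps an RBox predicted in View 1 to its target in View 2. This is a minimal sketch, assuming RBoxes are stored as `(x, y, w, h, theta)` tuples with angles in radians; the function name `transform_rbox` is hypothetical, and the sign convention of the rotation simply follows the matrix $R$ of Eq. 1.

```python
import math

def transform_rbox(rbox, delta, center):
    """Apply Eq. 4: rotate the RBox center about the image center by delta,
    keep width and height, and add delta to the angle."""
    x, y, w, h, theta = rbox
    xc, yc = center
    dx, dy = x - xc, y - yc
    cos_d, sin_d = math.cos(delta), math.sin(delta)
    # Row vector (dx, dy) times R^T, with R = [[cos, -sin], [sin, cos]].
    x_new = dx * cos_d - dy * sin_d + xc
    y_new = dx * sin_d + dy * cos_d + yc
    return (x_new, y_new, w, h, theta + delta)

rbox_ws = (2.0, 1.0, 4.0, 2.0, 0.1)
rbox_ws_star = transform_rbox(rbox_ws, math.pi / 2, center=(1.0, 1.0))
print(rbox_ws_star)
```

Only the center and the angle change under the transformation; the width and height are carried over unchanged, which is what makes the scale consistency between the two branches meaningful.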

FEASIBLE SOLUTIONS UNDER DIFFERENT CONSTRAINTS

Three different constraints, namely the horizontal circumscribed rectangle constraint (HCRC), the scale constraint (SC) and the angle constraint (AC), are introduced in this paper to guide the model to learn the correct result. Fig. 6(a) shows that when there is only the horizontal circumscribed rectangle constraint, the feasible solutions are still infinite. After adding the scale constraint, only the symmetric case and the correct case are left, as shown in Fig. 6(b). The final angle constraint leaves only the correct solution, as shown in Fig. 6(c).

Figure 6: Visualization of feasible solutions under different constraints.

Here $L_{cls}$ is the focal loss (Lin et al., 2017b), $L_{cn}$ is the cross-entropy loss, and $L_{reg}$ is the IoU loss (Yu et al., 2016). $N_{pos}$ denotes the number of positive samples. $p$ and $c$ denote the probability distribution over classes calculated by the Sigmoid function and the target category, respectively. $rbox^{ws}$ and $gtbox^{h}$ represent the predicted RBox in the WS branch and the horizontal GT box, respectively. $cn'$ and $cn$ indicate the predicted and target center-ness. $\mathbb{1}_{\{c_{(x,y)} > 0\}}$ is the indicator function, being 1 if $c_{(x,y)} > 0$ and 0 otherwise. The $r2h(\cdot)$ function converts the RBox to its corresponding horizontal circumscribed rectangle. We set the hyperparameters $\mu_1 = 1$, $\mu_2 = 1$ and $\mu_3 = 1$ by default.

Then, the SS loss between $rbox^{ws*}(x^*_{ws}, y^*_{ws}, w^*_{ws}, h^*_{ws}, \theta^*_{ws})$ and $rbox^{ss}(x_{ss}, y_{ss}, w_{ss}, h_{ss}, \theta_{ss})$ predicted by the SS branch is:

DOTA-v1.0 (Xia et al., 2018) is one of the largest datasets for oriented object detection in aerial images and contains challenging cases, e.g. large-scale dense scenes and complex backgrounds. It contains 15 categories, 2,806 images and 188,282 instances with both RBox and HBox annotations, where the latter are directly derived from the former. The proportions of the training set, validation set, and testing set are 1/2, 1/6, and 1/3, respectively. For training and testing, we follow a standard protocol by cropping images into 1,024×1,024 patches with a stride of 824. DIOR-R (Cheng et al., 2022) is an aerial image dataset annotated with RBoxes based on its horizontal annotation version DIOR (Li et al., 2020). There are 23,463 images and 190,288 instances with 20 classes.
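The patch cropping protocol can be illustrated along one image axis. This sketch only reflects the patch size (1,024) and stride (824) stated above, which imply an overlap of 200 pixels between neighboring patches; the handling of the image border (adding a final patch aligned to the edge) is an assumption, and `patch_starts` is a hypothetical helper rather than part of any toolkit.

```python
def patch_starts(length, patch=1024, stride=824):
    """Start offsets of sliding-window patches along one image axis.
    A final patch aligned to the border is added when needed (assumed)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts

# Offsets for a hypothetical image that is 2,048 pixels wide.
print(patch_starts(2048))
print("overlap between neighboring patches:", 1024 - 824)
```

The same function would be applied independently to the width and the height, and the Cartesian product of the two offset lists gives the top-left corners of all patches.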

Table 1: AP50 (%) results on DOTA-v1.0. All models are trained with ResNet50. '1x' and '3x' schedules indicate 12 epochs and 36 epochs for training. * indicates using an NVIDIA V100 GPU with more memory. MS denotes multi-scale (Zhou et al., 2022) training and testing. See the appendix for the performance on specific categories.

Table 2: AP (%) results on the DIOR-R test set. All models are trained with ResNet50. The input image size is 800×800. '1x' and '3x' schedules indicate 12 epochs and 36 epochs.

Table 3: Ablation for H2RBox with different border effect elimination strategies for view generation by padding/cropping on DOTA-v1.0.

Table 4: Ablation with different label re-assignment strategies. O2M and O2O represent one-to-many and one-to-one.

Table 5: Ablation with the two strategies S1 and S2 for dealing with the circular categories ST and RA on DOTA-v1.0.

Table 6: Ablation with and without the SS loss ($L_{ss}$) on DOTA-v1.0 and DIOR-R.

may affect the learning and the final results. As shown in Tab. 2, compared with DOTA-v1.0, DIOR-R is less challenging for the instance segmentation methods. This may explain the observation that the performance of H2RBox and BoxInst-RBox on AP50 is close. Yet for high-precision detection, i.e. AP75, which requires more accurate segmentation, H2RBox outperforms BoxLevelSet-RBox and BoxInst-RBox by 8.24% (32.6% vs. 24.36%) and 4.50% (32.6% vs. 28.10%), with lower memory and higher inference speed. Similarly, H2RBox is slightly inferior to the RBox-supervised FCOS: 33.15% vs. 34.16%.

AP50 (%) results on DOTA-v1.0. All models are trained with ResNet50. '1x' and '3x' schedules indicate 12 epochs and 36 epochs for training.

funding

* Corresponding author: Junchi Yan. The work was in part supported by National Key Research and Development Program of China (2020AAA0107600), National Natural Science Foundation of China (62222607), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

¹ The annotation cost (in price) of the RBox is about 36.5% ($86 vs. $63) higher than that of the HBox according to https://cloud.google.com/ai-platform/data-labeling/pricing.

