BBREFINEMENT: A UNIVERSAL SCHEME TO IMPROVE PRECISION OF BOX OBJECT DETECTORS

Abstract

We present a conceptually simple yet powerful and flexible scheme for refining bounding box predictions. Our approach is trained standalone on GT boxes and can then be combined with an object detector to improve its predictions. The method, called BBRefinement, uses mixed input data consisting of image information and the object's class and center. Because the problem is transformed into a domain where BBRefinement does not have to handle multiscale detection, recognition of the object's class, confidence estimation, or multiple detections, the training is much more effective. As a result, BBRefinement is able to refine even COCO's ground truth labels into a more precise form. BBRefinement improves the performance of SOTA architectures by up to 2 mAP points on the COCO dataset benchmark. The refinement process is fast; it adds 50-80 ms of overhead to a standard detector on an RTX 2080, so it can run in real time on standard hardware. The code is available at https://gitlab.com/irafm-ai/bb-refinement.

1. PROBLEM STATEMENT

Object detection plays an essential role in computer vision, and the field therefore attracts strong attention from researchers. As a result, new, more accurate, or faster object detectors frequently replace older ones. A typical object detector takes an image and produces a set of rectangles, so-called bounding boxes, which delimit the objects in the image. The detection quality is measured as the overlap between the detected box and the ground truth (GT), and it is essential for two reasons. Firstly, the criterion used in benchmarks, mean Average Precision (mAP), is based on a set of thresholds on the Intersection over Union (IoU) between the prediction and the GT. Such thresholds are typically applied to distinguish between accepted and rejected boxes in detection. Therefore, localization precision is crucial for separating valid boxes from discarded ones. Secondly, the more precise the detected box is, the more accurate the classification should be. Although NN-based classifiers tolerate some shifting or cropping of the data, higher accuracy in object detection may lead to higher accuracy in the classification process.

Existing solutions for object detection yield an accuracy of around 0.3-0.5 mAP on the COCO dataset (Lin et al., 2014). Such a score allows usage in many real applications, but there is still room for improvement. Such growth may be reached by a combination of the following: distinguishing between classes more precisely; increasing the rate of true-positive detections; decreasing the rate of false-positive detections; or increasing the IoU of the detections.

There are four reasons why object detection is difficult in general, which block further mAP growth. 1) A neural network has to find all objects in an image. Their number may vary from zero to hundreds. 2) A neural network has to be sensitive to all possible sizes of an object. The same object class may be tiny or occupy the whole image. 3) A network usually has no a priori information that would make the detection easier, such as the context of the scene or the number of objects. 4) Sufficiently large datasets are lacking, so the data distribution is sampled only roughly. In this paper, we propose BBRefinement, which can suppress the effect of all four difficulties. The proposed inference scheme 'Detection → Refinement' couples BBRefinement with a generic detector during the prediction phase and increases the IoU of the detected boxes with their ground truth labels, resulting in higher mAP.
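
To make the role of the IoU thresholds in this problem statement concrete, the following minimal Python sketch computes the IoU of two boxes and applies an acceptance threshold. The box convention (x1, y1, x2, y2) with continuous coordinates, the function names `iou` and `is_accepted`, and the threshold values are illustrative assumptions and are not taken from the BBRefinement code.

```python
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2), top-left and bottom-right corners.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_accepted(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction counts as a match only if its IoU with the GT reaches the threshold."""
    return iou(pred, gt) >= threshold


gt = (0.0, 0.0, 10.0, 10.0)
shifted = (1.0, 1.0, 11.0, 11.0)  # same size as gt, shifted by one pixel in x and y
print(is_accepted(shifted, gt, 0.5), is_accepted(shifted, gt, 0.75))
```

In this toy case a one-pixel shift of a 10x10 box already drops the IoU to 81/119 (about 0.68), so the prediction passes a threshold of 0.5 but fails a stricter threshold of 0.75. This is the kind of loss that a refinement of the box coordinates aims to recover.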

Related work.

The problem of refinement can be traced back to the origin of two-stage detectors, where R-CNN (Girshick et al., 2014) uses a region proposal algorithm to generate a fixed number of regions. The regions are classified and refined by a bounding box regressor. Faster R-CNN (Ren et al., 2015) replaces the region proposal algorithm with a region proposal network. The same bounding box regressor can be applied iteratively to obtain more precise detections (Gidaris & Komodakis, 2015; Li et al., 2017). The effect of iterative refinement may be increased by involving an LSTM module (Gong et al., 2019). Refinement can also target anchors; RefineDet (Zhang et al., 2018; 2020) refines them to obtain customized anchors for each cell. Cascade R-CNN (Cai & Vasconcelos, 2018) uses a sequence of bounding box regressors to create an n-stage object detector. In Cascade R-CNN, network head $h_0$ takes proposals from the region proposal network and feeds the regressed bounding boxes to network head $h_1$, and so on. All the heads work over the same features extracted from a backbone network. The cascade scheme implies that $h_1$ depends on the quality of $h_0$: if $h_0$ includes some bias, $h_1$ balances it. Therefore, all the heads have to be trained together (part by part), and if $h_0$ is retrained, $h_1$ should be retrained as well. In contrast, BBRefinement is trained standalone and does not depend on the quality of the object detector with which it is coupled during inference. That makes BBRefinement universal and applicable to various object detectors without retraining either the detector or BBRefinement.

2. EXPLAINING BBREFINEMENT

The main feature of BBRefinement is the transformation of the problem into a simpler scheme in which an NN can be trained easily. Compared with a standard object detector, BBRefinement is a specialized, single-purpose neural network working as a single-object detector. It does not search for zero to hundreds of objects; it always detects exactly one object and does not produce a confidence score. It also lacks the part responsible for classification, so it does not assess the object's class. Its only purpose is to take an image with a single object at a normalized scale and generate a highly precise bounding box. Training is performed on boxes extracted from a dataset according to the ground truth labels. Once BBRefinement is trained, the fixed model can be coupled with an arbitrary detector for inference. In that case, the input to BBRefinement consists of image crops given by the bounding boxes produced by the detector.
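
The coupling of a fixed refinement model with an arbitrary detector can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the callable `refiner(crop, class_id)` is hypothetical and is assumed to return a box in coordinates relative to the crop (values in [0, 1]), the enlargement `margin` around each detected box is an assumed parameter, and the resizing of crops to a normalized scale is assumed to happen inside `refiner`.

```python
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]      # (x1, y1, x2, y2) in image pixels
Refiner = Callable[[np.ndarray, int], Box]   # hypothetical: crop, class id -> relative box


def crop_with_margin(image: np.ndarray, box: Box, margin: float):
    """Cut the detected box out of the image, enlarged by an assumed relative margin."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    mx, my = margin * (x2 - x1), margin * (y2 - y1)
    # Clip the enlarged window to the image borders.
    cx1 = int(max(0, np.floor(x1 - mx)))
    cy1 = int(max(0, np.floor(y1 - my)))
    cx2 = int(min(w, np.ceil(x2 + mx)))
    cy2 = int(min(h, np.ceil(y2 + my)))
    return image[cy1:cy2, cx1:cx2], (cx1, cy1, cx2, cy2)


def refine_detections(image: np.ndarray,
                      detections: List[Tuple[Box, int]],
                      refiner: Refiner,
                      margin: float = 0.25) -> List[Box]:
    """Detection -> Refinement: each detected box is refined independently."""
    refined = []
    for box, class_id in detections:
        crop, (cx1, cy1, cx2, cy2) = crop_with_margin(image, box, margin)
        rx1, ry1, rx2, ry2 = refiner(crop, class_id)
        cw, ch = cx2 - cx1, cy2 - cy1
        # Map the relative prediction back to coordinates of the original image.
        refined.append((cx1 + rx1 * cw, cy1 + ry1 * ch,
                        cx1 + rx2 * cw, cy1 + ry2 * ch))
    return refined


# Usage with a dummy refiner that always predicts the central half of the crop.
image = np.zeros((100, 200, 3), dtype=np.uint8)
dummy_refiner = lambda crop, class_id: (0.25, 0.25, 0.75, 0.75)
print(refine_detections(image, [((50.0, 20.0, 90.0, 60.0), 0)], dummy_refiner))
```

Because the detector's class and confidence outputs are passed through unchanged and only the box coordinates are rewritten, a scheme of this shape does not require retraining either component.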

2.1. PROBLEM WITH A NAIVE SINGLE OBJECT DETECTOR

Let a bounding box $b$ be given by its top-left and bottom-right coordinates, $b = (x_1, y_1, x_2, y_2)$. Further, let us suppose a color image $f : D \subset \mathbb{N}^2 \to L \subset \mathbb{R}^3$. Then a neural network detecting a single object is generally denoted as $g : f \mapsto b$. To train such a network, we generally minimize the term $|b - g(f)|$ or one of its alternatives.
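
As a small illustration of the objective $|b - g(f)|$, the Python sketch below evaluates an L1 loss between ground truth boxes and predicted boxes. The text only states that this term "or its alternatives" is minimized; summing over the four coordinates and averaging over a batch are assumptions made here for the example, and the name `l1_box_loss` is hypothetical.

```python
import numpy as np


def l1_box_loss(b_true, b_pred) -> float:
    """Mean over the batch of |b - g(f)|, summed over the four box coordinates."""
    b_true = np.asarray(b_true, dtype=float)   # shape (batch, 4): x1, y1, x2, y2
    b_pred = np.asarray(b_pred, dtype=float)   # shape (batch, 4): outputs of g(f)
    return float(np.abs(b_true - b_pred).sum(axis=-1).mean())


# One ground truth box and a prediction that is off by one pixel in every coordinate.
print(l1_box_loss([[0, 0, 10, 10]], [[1, 1, 11, 11]]))
```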



Figure 1: The figure illustrates the proposed pipeline of prediction. A generic object detector processes an image, and then the detected boxes are taken from the original image, updated by BBRefinement, and taken as the output predictions.

