PBFORMER: CAPTURING COMPLEX SCENE TEXT SHAPE WITH POLYNOMIAL BAND TRANSFORMER

Abstract

We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text-shape representation, the Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, so it can capture texts with complex shapes by varying the polynomial coefficients. PB has appealing properties compared with conventional representations: 1) it can model different curvatures with a fixed number of parameters, while polygon-point-based methods need a varying number of points; 2) it can distinguish adjacent or overlapping texts because their curve coefficients differ markedly, while segmentation-based methods suffer from adhesive spatial positions. PBFormer combines PB with the transformer, which can directly generate smooth text contours sampled from the predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. This simple operation helps detect small-scale texts and is compatible with the one-stage DETR framework, which requires no NMS post-processing. Furthermore, PBFormer is trained with a shape-constrained loss, which not only enforces piecewise alignment between the ground truth and the predicted curves but also keeps the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to previous state-of-the-art text detectors on the arbitrarily shaped CTW1500 and Total-Text datasets. Code will be made public.

1. INTRODUCTION

Scene text detection is an active research topic in computer vision and enables many downstream applications such as image/video understanding, visual search, and autonomous driving (Radford et al., 2021; Long et al., 2021; Reddy et al., 2020). However, the task is also challenging. One non-negligible reason is that a text instance can have a complex shape due to non-uniform fonts, skewing during photographing, and specific art designs. Capturing complex text shapes requires developing an effective text representation. State-of-the-art methods tackle this problem with roughly two types of representations. One is the point-based representation, which predicts points in the image space to control the text's shape, including Bezier control points (Liu et al., 2020) and polygon points (Zhang et al., 2021). The other produces segmentation maps. The map can describe texts of various shapes and benefits from pixel-level prediction results (Liao et al., 2020; Zhu et al., 2021b). Despite the good performance, both types of representation have limitations: 1) point-based methods suffer from a fixed number of control points (Tang et al., 2022; Zhang et al., 2022b): too few points cannot handle highly curved texts, while simply adding points introduces redundancy for most perspective texts; 2) segmentation-based methods frequently fail to separate adjacent texts due to ambiguous spatial positions, and the produced segmentation map still needs post-processing and often requires extensive training data (Zhu et al., 2021b). To address these limitations, we propose a novel representation named Polynomial Band (PB), which has clear advantages over previous text representations. In particular, PB consists of four polynomial curves that fit a text's top, bottom, left, and right sides, respectively.
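The idea of a fixed-size parameterization can be illustrated with a minimal sketch. The snippet below is a simplification, not the paper's implementation: it models only the top and bottom curves as y = f(x) polynomials with K coefficients each, plus the two boundary variables that delimit the definition domain; the exact parameterization (and the handling of the left/right side curves) is an assumption.

```python
import numpy as np

def poly_eval(coeffs, t):
    """Evaluate a polynomial with K coefficients at t (coeffs[i] multiplies t**i)."""
    return sum(c * t**i for i, c in enumerate(coeffs))

def sample_band_contour(top, bottom, t_start, t_end, n=16):
    """Sample a closed text contour from a simplified Polynomial Band.

    `top` / `bottom` each hold K coefficients of a curve y = f(x); the two
    boundary variables (t_start, t_end) fix the curve's definition domain.
    Points are sampled uniformly inside the domain, mirroring how predicted
    curves can be compared against ground-truth polygon points.
    """
    xs = np.linspace(t_start, t_end, n)
    top_pts = [(x, poly_eval(top, x)) for x in xs]
    bot_pts = [(x, poly_eval(bottom, x)) for x in reversed(xs)]
    return top_pts + bot_pts  # closed polygon: top left-to-right, bottom back

# A straight band and a curved band use the same number of parameters (K = 4):
straight = sample_band_contour([10, 0.5, 0, 0], [30, 0.5, 0, 0], 0, 100)
curved = sample_band_contour([10, 0, 0.01, 0], [30, 0, 0.01, 0], 0, 100)
```

Note that both contours are produced from exactly eight coefficients and two boundary variables; only the coefficient values change between a straight and a highly curved instance.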
First, the coefficients of PB are discriminative in the parameter space even if two texts are very close in the image space, as shown in Fig. 1. We apply the proposed PB to the one-stage deformable DETR to improve efficiency. In particular, we design a parameter-free cross-scale pixel attention (CPA) module between the CNN features and the transformer encoder-decoder layers. The CPA module first aligns the feature maps of different scales by enlarging all of them to the same resolution. It then performs cross-scale attention that highlights the feature values from a suitable scale while suppressing the other feature maps. With this scale-selective mechanism, our method becomes more compatible with transformer decoders that do not use NMS for post-processing: it implicitly suppresses text proposals with incorrect scales, alleviating the learning burden of the transformer encoder-decoder layers. Since the features from CPA effectively represent text shapes, two layers of transformer encoder-decoders are sufficient to detect a reasonable number of PBs without NMS. The transformer decodes each polynomial curve's K coefficients and 2 boundary variables that determine the curve's definition domain. We uniformly sample points on the predicted curves within the definition domain and compare them with the corresponding points on the ground-truth polygon. Such a design supervises the curve piece by piece and learns the curve's shape and range consistently. In summary, the contributions of PBFormer are:

• A novel text representation called the Polynomial Band (PB) is proposed. PB can utilize a fixed number of parameters to capture text instances with various curvatures. It also excels at distinguishing spatially close text instances.

• A cross-scale pixel attention module is proposed. The module performs pixel-wise attention across feature maps of different sizes.
It implicitly highlights the text regions and enables the transformer to directly take all pixel-wise features as input.

• We design a shape-constrained loss function. The loss enforces piecewise supervision over the predicted curves and consistently optimizes the curve coefficients and definition domains.

Experiments on the multi-oriented and curved text detection datasets CTW1500 (Liu et al., 2019) and Total-Text (Chng & Chan, 2017) demonstrate the effectiveness of our approach. Without any pre-training on large-scale text datasets, our method achieves better results in terms of F-measure. Owing to the lightweight network architecture, our method runs in real time and is 4.4× faster than other open-sourced transformer text detectors.
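The parameter-free CPA fusion can be sketched as follows. This is only an illustration under a stated assumption: the paper says the attention is parameter-free and pixel-wise across scales, but does not give the scoring formula here, so the magnitude-based softmax score below is our guess, and `upsample_nearest` stands in for whatever interpolation the real model uses.

```python
import numpy as np

def upsample_nearest(f, H, W):
    """Nearest-neighbour upsampling of a (C, h, w) feature map to (C, H, W)."""
    C, h, w = f.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return f[:, rows][:, :, cols]

def cross_scale_pixel_attention(feats):
    """Parameter-free fusion of multi-scale CNN features.

    feats: list of (C, h_i, w_i) maps from different backbone stages.
    All maps are first enlarged to the largest resolution; at every pixel,
    a softmax over per-scale activation magnitudes (an assumed score)
    highlights the most responsive scale and suppresses the others.
    """
    H = max(f.shape[1] for f in feats)
    W = max(f.shape[2] for f in feats)
    aligned = np.stack([upsample_nearest(f, H, W) for f in feats])  # (S, C, H, W)
    scores = np.linalg.norm(aligned, axis=1, keepdims=True)         # (S, 1, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)             # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (weights * aligned).sum(axis=0)                          # (C, H, W)
```

Because no weights are introduced, such a module adds no learnable parameters, which is consistent with feeding the fused pixel-wise features directly into a shallow transformer encoder-decoder.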



Figure 1: Advantages of PB. Comparing (a) and (b), PB divides adjacent texts more clearly than Bezier control points. (c) shows that the number of output points increases gradually to represent shapes from straight to highly curved. (d) shows that varying curve coefficients can handle dynamic shapes with two boundary variables.

Second, PB is represented by functions defined in the image space. We can directly compare the ground-truth contour points with points sampled from the polynomial curves via re-sampling techniques. This differs from Bezier-curve-based methods, which need to generate an intermediate representation, i.e., "control points", for supervision. A loss defined on control points cannot truly reflect how humans perceive shape: a small difference in control points can lead to a large shape difference.

Witnessing the great success of transformers in NLP (Vaswani et al., 2017), there has been a recent surge of interest in introducing them to vision tasks, including scene text detection. Current transformer-based text detectors are two-stage, such as FewBetter (Tang et al., 2022) and TESTR (Zhang et al., 2022b).

