NATURAL WORLD DISTRIBUTION VIA ADAPTIVE CONFUSION ENERGY REGULARIZATION
Paper ID: 1442. Paper under double-blind review.

Abstract

We introduce a novel, adaptive batch-wise regularization based on the proposed Batch Confusion Norm (BCN) to flexibly address the natural world distribution, which usually exhibits fine-grained and long-tailed properties at the same time. The Fine-Grained Visual Classification (FGVC) problem is characterized by two intriguing properties, significant inter-class similarity and intra-class variation, which make learning an effective FGVC classifier a challenging task. Existing techniques attempt to capture the discriminative parts via modified attention mechanisms. The long-tailed distribution of visual classification poses a great challenge in handling class imbalance. Most existing solutions focus on class-balancing strategies, classifier normalization, or alleviating the negative gradients of tail categories. Departing from these conventional approaches, we propose to tackle both problems simultaneously with the adaptive confusion concept. When inter-class similarity prevails in a batch, the BCN term can alleviate possible overfitting caused by exploring fine-detail image features. On the other hand, when inter-class similarity is not an issue, the class predictions from different samples unavoidably yield a substantial BCN loss, which prompts the network to further reduce the cross-entropy loss. More importantly, by extending the existing confusion energy-based framework to account for the long-tailed scenario, BCN can learn to exert a proper distribution of confusion strength over tail and head categories to improve classification performance. While the FGVC model obtained with the BCN technique is already effective, its performance can be consistently boosted by incorporating an extra attention mechanism. In our experiments, we obtain state-of-the-art results on several benchmark FGVC datasets and also demonstrate that our approach is competitive on the popular natural world distribution dataset, iNaturalist2018.

1. INTRODUCTION

Fine-grained visual classification (FGVC) is an active and challenging problem in computer vision. Such a recognition task differs from the classical problem of large-scale visual classification (LSVC) in that it focuses on differentiating similar sub-categories of the same meta-category. In FGVC, inter-class similarity among the object categories is often pervasive, while intra-class variation further imposes ambiguities in learning a unified and discriminative representation for each category. Long-tailed distributions bring in another challenge: the head categories tend to dominate the training procedure. The learned classification model thus performs better on these categories while yielding significantly poorer performance on the tail categories; the performance distribution somewhat resembles the data distribution. As the natural world distribution often exhibits both fine-grained and long-tailed properties, how to satisfactorily address recognition under such a general setting is a practical and challenging problem.

From the existing literature, there have been only a few attempts at solving these two problems at the same time; relevant efforts mostly focus on tackling either task in isolation. In FGVC, most recent research efforts have converged on learning pivotal local/part details relevant to distinguishing fine-grained categories, e.g., (Fu et al., 2017; Yang et al., 2018; Zheng et al., 2019), and typically require the fusion of several sophisticated computer vision techniques to accomplish the task, as in (Ge et al., 2019). In resolving the long-tailed issue, previous approaches have looked into data-balancing strategies.

Figure 1a illustrates the two intertwined aspects of FGVC, where inter-class similarity and intra-class variation are subtly entangled, yielding a daunting classification task. For humans, the example convincingly suggests that expert knowledge is needed to differentiate one category from the other two.
Alternatively, it also exhibits the challenge of formulating universal criteria for machine learning frameworks to satisfactorily solve the FGVC problem, even in a modest case involving just three object categories. Figure 1b presents an extreme data distribution in which some head categories have 1,000 images while a tail category contains only 2 images. Hence, a model obtained by conventional training is expected to yield classification performance that mirrors the long-tailed distribution even on a balanced test/validation set.

It goes without saying that techniques based on deep neural networks (DNNs) have been the focal point of recent developments in tackling FGVC. Characterized by powerful model capacity and end-to-end feature learning, these state-of-the-art approaches are carefully designed to extract discriminative local details and a consistent global structure, and have been shown to achieve significant improvements over conventional non-DNN approaches, e.g., (Duan et al., 2012), on almost all FGVC benchmark datasets. However, the improvement gained by exploring visual features of different levels and resolutions from relevant regions seems to have saturated, and such methods do not properly address the long-tailed issue. This concern is reflected by the fact that most FGVC methods do not report experimental results on the natural world distribution dataset iNaturalist2018 (Van Horn et al., 2018).

Motivated by these developments, we propose a flexible and effective regularization design that guides the resulting DNN to improve model efficiency on the FGVC and long-tailed issues at the same time. Our method is related to the pairwise confusion regularization of (Dubey et al., 2018); however, the proposed formulation goes beyond the restriction of working on pairs of data and develops a batch-wise norm-based framework with sufficient model capacity to simultaneously deal with both issues.
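As a point of reference, the pairwise confusion idea penalizes the distance between the predicted class distributions of a pair of training samples, encouraging mild "confusion" that counteracts overfitting to fine details. The sketch below illustrates this pairwise form; the function names and the squared-Euclidean choice of distance are our own illustrative assumptions, not necessarily the exact formulation of (Dubey et al., 2018):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pairwise_confusion(logits_a, logits_b):
    """Illustrative pairwise confusion energy: squared Euclidean
    distance between two samples' predicted class distributions.
    Adding this term to the training loss pulls the two predictions
    toward each other, inducing a controlled amount of confusion."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return float(np.sum((pa - pb) ** 2))

# usage: two samples over 3 classes
la = np.array([2.0, 0.5, -1.0])
lb = np.array([0.1, 1.5, 0.3])
energy = pairwise_confusion(la, lb)
```

Note that the energy is computed on exactly one pair at a time, which is the restriction our batch-wise formulation removes.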
We first assume that all samples/images within a batch are of different classes. The targeted confusion energy is then modeled by a batch-wise matrix norm, termed the Batch Confusion Norm (BCN). The matrix is constructed from the prediction results of all images within a batch, together with an adaptive matrix that adjusts class-specific weights; the former handles the FGVC task while the latter resolves the long-tailed distribution. To achieve efficient DNN learning, we provide an approximation scheme for BCN so that gradient backpropagation can be readily carried out. The promising experimental results suggest that BCN has good potential to serve as a generic regularizer for a wide range of classification tasks.
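To make the construction concrete, the sketch below shows one plausible instantiation of a batch-wise confusion norm: the softmax predictions of a batch are stacked into a B x C matrix, each class column is scaled by an adaptive class-specific weight, and a matrix norm of the result serves as the confusion energy. The Frobenius norm and the diagonal (per-class vector) form of the adaptive weighting are illustrative assumptions; the exact norm and weighting scheme of BCN may differ:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class axis
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def batch_confusion_norm(logits, class_weights):
    """Illustrative BCN-style regularizer (hypothetical form).

    logits:        B x C raw scores for a batch of B samples.
    class_weights: length-C adaptive weights, e.g., up-weighting
                   tail classes to redistribute confusion strength.
    Returns the Frobenius norm of the weighted prediction matrix.
    """
    probs = softmax(logits)            # B x C prediction matrix
    weighted = probs * class_weights   # broadcast class weights over rows
    return float(np.linalg.norm(weighted))  # Frobenius norm

# usage: a batch of 4 samples over 3 classes, tail class up-weighted
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
w = np.array([1.0, 1.0, 2.0])  # hypothetical larger weight on a tail class
loss = batch_confusion_norm(logits, w)
```

Since each row of the prediction matrix is a probability vector, with uniform unit weights the norm is bounded by the square root of the batch size; the adaptive weights then let training concentrate confusion strength on selected (e.g., tail) classes.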



Figure 1: (a) Inter-class similarity vs. intra-class variation: each column includes two instances of a specific "Gull" category from the CUB-200-2011 dataset (Wah et al., 2011). (b) The natural world distribution dataset iNaturalist2018 (Van Horn et al., 2018).

