LEARNING BINARY NETWORKS ON LONG-TAILED DISTRIBUTIONS

Abstract

In deploying deep models to real-world scenarios, there are a number of issues including computational resource constraints and long-tailed data distributions. For the first time in the literature, we address the combined challenge of learning long-tailed distributions under the extreme resource constraint of using binary networks as backbones. Specifically, we propose a framework that calibrates off-the-shelf pretrained full precision weights, learned on non-long-tailed distributions, when training binary networks on long-tailed datasets. Within this framework, we additionally propose a novel adversarial balancing scheme and a multi-resolution learning method for better generalization to diverse semantic domains and input resolutions. We conduct extensive empirical evaluations on 15 datasets, including long-tailed datasets newly derived from existing balanced datasets, which constitutes the largest benchmark in the literature. Our empirical studies show that our proposed method outperforms prior art by large margins, e.g., at least +14.33% on average.

1. INTRODUCTION

In recent years, there has been growing emphasis on resource constraints in learning deep models, especially for edge devices, resulting in breakthroughs such as MobileNet (Howard et al., 2017) and YOLO-V7 (Wang et al., 2022) that concern not only accuracy but also computational cost. This has attracted attention to efficient deep learning in both the research and industrial communities. Besides, long-tailed (LT) training data are frequently encountered in the wild (He & Garcia, 2009), and many deep learning methods have been developed to combat them (Cui et al., 2021; He et al., 2021; Zhong et al., 2021). Yet, everyday devices, many of which aim to utilize these deep models, lack sufficient computing power. Thus, for real-world deployment of such deep models, methodological advances in LT recognition should also consider resource constraints. Unfortunately, current LT recognition methods largely assume sufficient computing resources and are designed to work with a large number of full precision parameters, i.e., floating point (FP) weights. While recognition on LT distributions by FP models has significantly improved (Cui et al., 2021; He et al., 2021; Zhong et al., 2021), it is not clear whether these improvements would immediately translate to real-world scenarios where resource constraints limit model selection to those with lower capacity. To this end, we argue that it is necessary to benchmark and improve the performance of long-tailed recognition with capacity-limited models. As binary networks are at the extreme end of capacity-limited models (Rastegari et al., 2016), long-tailed recognition performance using 1-bit networks would roughly correspond to the 'worst case scenario' for resource-constrained LT recognition. If we can show sufficient LT performance with binary networks, we can reasonably expect, at the very least, matching or better LT performance with N-bit models, where N > 1.
Here, we take an initiative to benchmark and develop long-tailed recognition methods using binary networks as a challenging reference. In the LT scenario, data scarcity in the tail classes is one of the major issues; it can cause problems such as the class weights having varying magnitudes from head to tail classes (Kang et al., 2019; Alshammari et al., 2022), leading to disappointing performance. Prior methods (Liu et al., 2019; Kozerawski et al., 2020; Park et al., 2022) use cosine classifiers, which eliminate the effect of class weight norms as a discriminative statistic and improve accuracy. However, since binary networks lack learning capacity and exhibit worse generalization performance than FP networks (Rastegari et al., 2016; Courbariaux et al., 2016), we want to reduce the adverse effect of the uneven class weight norms without sacrificing their usefulness as a discriminative statistic. Inspired by previous LT literature (Kang et al., 2019; Alshammari et al., 2022) that analyzed the classifier weight norms (of the last fully connected layer) in FP networks with linear classifiers, we first investigate the high variance issue in binary networks (Zhu et al., 2019) by visualizing the Binary-FP gap in the magnitude of the classifier weight norms in Fig. 1. The widening gap toward the tail classes indicates that a binary network has higher variance at the tail classes than an FP model. We argue that this issue must be mitigated to improve LT recognition with binary networks.

Figure 1: 'Samples' indicates the number of training samples for each class; classes are sorted by the number of training samples (large to small), i.e., head to tail. When trained from scratch on the long-tailed (LT) distribution, the Binary-FP gap widens toward the tail classes due to data scarcity, implying that the binary network has higher variance (i.e., larger weight magnitudes) than the FP network. Using our 'Calibrate and Distill' mitigates the gap substantially.

As most pretrained FP networks are learned on non-LT datasets, using them as teachers may not be helpful for training binary networks on LT datasets due to the domain gap. One could train the FP teacher on the target LT dataset from scratch (He et al., 2021) before distillation, but that is not scalable. Recent works in various domains, such as prompt engineering (Radford et al., 2021; Liu et al., 2021a) and self-supervised learning (Bachman et al., 2019; Chen et al., 2020), suggest that utilizing pretrained weights is desirable for scalability (He et al., 2020). Here, we propose a way to utilize teacher networks pretrained on non-LT data (e.g., from open-source projects) for LT distillation without training the teacher from scratch. Specifically, we propose a 'Calibrate and Distill' framework in which we calibrate the pretrained teacher by attaching and training a classifier on the target LT dataset. We investigate how the proposed method affects the variance and visualize the results in Fig. 1. As shown in the figure, Calibrate and Distill significantly decreases the Binary-FP gap, suggesting a reduction of the high variance in binary networks. Note that while the magnitudes fall within similar ranges, they are not equal for all classes, which can still be useful for discriminating between different classes.

In addition, semantic domains and image resolutions vary across real-world data. In addressing this variety, we found that balancing the amount of knowledge transferred by the distillation loss to the feature extractor and the classifier of the binary network largely affects performance, depending on how different the target data is from the pretraining data. However, manual tuning may yield good performance only on a limited number of benchmark datasets and may not generalize to various scenarios. For better generalization to different data distributions, we propose a novel adversarially learned balancing scheme for long-tailed recognition that learns the balance depending on the input data, without requiring finicky hyper-parameter tuning.

Furthermore, we incorporate various input resolutions for better generalization to different data distributions (Jacobs et al., 1995; Rosenfeld, 2013). This eases the burden on the model of learning the various optimal receptive fields that help it generalize to differently sized data. However, using multi-resolution inputs (Rosenfeld, 2013) can increase computational costs depending on what and how many resolutions are used. For a negligible increase in training time, we propose to selectively use multi-resolution inputs to mitigate the extra computational cost.
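The calibration step of the 'Calibrate and Distill' idea described above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: `teacher_features` is a hypothetical stand-in for a frozen pretrained FP backbone, and the dataset, learning rate, and epoch count are arbitrary assumptions. Only the new linear classifier is trained on the target long-tailed data; the calibrated teacher's logits would then serve as distillation targets for the binary student.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_features(x):
    # Stand-in for the FROZEN pretrained FP feature extractor; a fixed
    # projection used purely for illustration (not the paper's backbone).
    proj = np.linspace(-1.0, 1.0, x.shape[1] * 8).reshape(x.shape[1], 8)
    return np.tanh(x @ proj)

def calibrate(X, y, n_classes, lr=0.5, epochs=300):
    """Calibration: train only a new linear classifier on the target
    long-tailed dataset, on top of the frozen teacher features."""
    F = teacher_features(X)
    W = np.zeros((F.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(epochs):
        logits = F @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / len(y)                  # softmax cross-entropy gradient
        W -= lr * F.T @ g                     # only classifier params move;
        b -= lr * g.sum(axis=0)               # the feature extractor is frozen
    return W, b

# Toy LT dataset: class 0 is the head (60 samples), class 2 the tail (5).
sizes = [60, 20, 5]
X = np.vstack([rng.normal(c, 0.3, size=(n, 4)) for c, n in enumerate(sizes)])
y = np.concatenate([[c] * n for c, n in enumerate(sizes)]).astype(int)

W, b = calibrate(X, y, n_classes=3)
teacher_logits = teacher_features(X) @ W + b      # distillation targets
acc = (teacher_logits.argmax(axis=1) == y).mean()
print(f"calibrated teacher train accuracy: {acc:.2f}")
```

The design point the sketch captures is scalability: calibration touches only the small classifier head, so an off-the-shelf teacher can be adapted to an LT target without retraining it from scratch.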

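The per-class weight-norm comparison behind the Fig. 1 analysis can also be illustrated concretely. The weights below are hypothetical toy stand-ins: the tail-class magnitude inflation merely mimics the reported Binary-FP gap and is not measured data. The snippet additionally shows why a cosine classifier removes the weight norm as a discriminative statistic: L2-normalizing each class weight makes the logits invariant to per-class scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim = 10, 16

# Hypothetical last-layer weights: tail classes (higher index) get
# inflated magnitudes to mimic a binary network's higher variance.
fp_weights = rng.normal(0.0, 1.0, size=(n_classes, feat_dim))
scale = np.linspace(1.0, 3.0, n_classes)          # head -> tail inflation
bin_weights = fp_weights * scale[:, None]

# Per-class weight norms, as plotted in a Fig. 1-style analysis.
fp_norms = np.linalg.norm(fp_weights, axis=1)
bin_norms = np.linalg.norm(bin_weights, axis=1)
gap = bin_norms - fp_norms                        # the 'Binary-FP' gap
print("gap, head -> tail:", np.round(gap, 2))     # grows toward the tail here

def cosine_logits(weights, features):
    # Cosine classifier: logits_c = <w_c / ||w_c||, f / ||f||>, so the
    # class weight norm no longer affects the prediction.
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    f = features / np.linalg.norm(features)
    return w @ f

f = rng.normal(size=feat_dim)
# Per-class rescaling is invisible to a cosine classifier.
assert np.allclose(cosine_logits(fp_weights, f), cosine_logits(bin_weights, f))
```

This is exactly the trade-off noted above: a cosine classifier discards the uneven norms entirely, whereas the goal here is to shrink the Binary-FP gap while keeping the (unequal) norms available as a discriminative signal.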

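The selective multi-resolution idea can be sketched as a per-batch resolution draw. The resolution set and sampling scheme below are illustrative assumptions, not the paper's actual configuration; the point is that sampling one resolution per batch keeps the cost at one forward/backward pass per batch, instead of multiplying it by the number of resolutions.

```python
import random

# Illustrative resolution pool (an assumption, not the paper's setting).
RESOLUTIONS = [160, 192, 224]

def train(num_batches, seed=0):
    """Sketch of selective multi-resolution training: each batch is run at
    ONE randomly sampled resolution, so the extra cost is negligible."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(num_batches):
        res = rng.choice(RESOLUTIONS)
        # Here one would resize the batch to (res, res) and take a single
        # optimizer step; each batch still costs one forward/backward pass.
        schedule.append(res)
    return schedule

schedule = train(300)
print("resolutions used:", sorted(set(schedule)))
```

Over many batches the model still sees all resolutions (and thus varied receptive-field demands), while training time stays close to single-resolution training.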