DECOUPLED MIXUP FOR DATA-EFFICIENT LEARNING

Abstract

Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data. Recently, dynamic mixup methods have effectively improved on previous static policies (e.g., linear interpolation) by maximizing salient regions or maintaining the target in mixed samples. The key difference is that mixed samples generated by dynamic policies are more instance-discriminative than those from static ones, e.g., the foreground objects are decoupled from the background. However, optimizing mixup policies with dynamic methods in input space is computationally expensive compared to static ones. Hence, we transfer the decoupling mechanism of dynamic methods from the data level to the objective-function level and propose the general decoupled mixup (DM) loss. The primary effect is that DM can adaptively focus on discriminative features without losing the original smoothness of mixup, while avoiding heavy computational overhead. As a result, DM enables static mixup methods to match or even exceed the performance of dynamic methods. This also raises an interesting objective-design problem for mixup training: we need to focus both on smoothing decision boundaries and on identifying discriminative features. Extensive experiments on supervised and semi-supervised learning benchmarks across seven classification datasets validate the effectiveness of DM when it is equipped with various mixup methods.

1. INTRODUCTION

Deep learning has become the bedrock of modern AI for many machine learning tasks (Bishop, 2006), such as computer vision (He et al., 2016; 2017) and natural language processing (Devlin et al., 2018). Using a large number of learnable parameters, deep neural networks (DNNs) can capture subtle dependencies in large training datasets and later leverage them to make accurate predictions on unseen data. However, without constraints or enough data, models may overfit the training set (Srivastava et al., 2014). To this end, regularization techniques have been deployed to improve generalization (Wan et al., 2013); they can be categorized as data-independent or data-dependent (Guo et al., 2019). Data-independent strategies, for example, constrain the model by penalizing parameter norms, as in weight decay (Loshchilov & Hutter, 2017). Among data-dependent strategies, data augmentations (Shorten & Khoshgoftaar, 2019) are widely used. Mixup (Zhang et al., 2017; Yun et al., 2019), a data-dependent augmentation technique, generates virtual samples by linearly combining pairs of data and their corresponding labels with a mixing ratio λ ∈ [0, 1]. DNNs trained with this technique are typically more generalizable and better calibrated (Thulasidasan et al., 2019), with prediction accuracy that tends to be consistent with confidence. The main reason is that mixup heuristically smooths the decision boundary, improving overall robustness, by regressing the mixing ratio λ in the mixed labels. However, completely handcrafted mixing policies (Verma et al., 2019; Uddin et al., 2020; Harris et al., 2020) (referred to as static methods) may cause a mismatch between the mixed labels and the mixed samples, which leads to problems such as instability or slow convergence; this issue is known as label mismatch (Liu et al., 2022). Recently, dynamic mixup methods improve mixup accuracy significantly. The basic idea is to decouple the foreground from the background and mix their corresponding features to avoid label mismatch. However, these data-level decoupling methods require complex deployment and roughly 1.5 times the training time of static methods, which undermines the ease of use and lightness of mixup augmentations.
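As a reference point for the static policies discussed above, classic input mixup linearly interpolates a pair of samples and their one-hot labels. The following NumPy sketch follows the common convention of sampling the mixing ratio λ from a Beta distribution; the function name and signature are our illustrative choices, not code from this paper:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Static (linear-interpolation) mixup: combine a pair of inputs and
    their one-hot labels with ratio lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing ratio in [0, 1]
    x_mixed = lam * x1 + (1.0 - lam) * x2  # mixed input
    y_mixed = lam * y1 + (1.0 - lam) * y2  # mixed (soft) label
    return x_mixed, y_mixed, lam
```

Because the mixed label is only a λ-weighted average, a foreground object from one image may be occluded or diluted in the mixed input while its label weight stays fixed, which is exactly the label mismatch scenario described above.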
Thus, leaving aside the design of a new mixup policy, a new question arises: can we design a loss function for mixup that accounts for both the smoothness of mixup and the discriminability of instances, without heavy computation? From a different perspective, we first regard label-mismatched samples as hard mixed samples. In other words, even when a mixed sample does not contain sufficient features, we still expect the model to mine these hard features and make confident predictions. We therefore make full use of mixed samples, without introducing additional computation, to achieve data-efficient mixup training. Motivated by this intuition, we introduce Decoupled Mixup (DM), a mixup objective function that explicitly leverages the target-relevant information in mixed samples without losing the original smoothness. On top of the standard cross-entropy loss, an extra decoupled term is introduced to mine the underlying discriminative statistics in the mixed sample by computing the predicted probability of each mixed class independently. As a result, DM further emphasizes the contribution of each class involved in mixup, exploring the efficient usage of existing data. Extensive experiments demonstrate that DM achieves data-efficient training on supervised and semi-supervised learning benchmarks. Our contributions are summarized below:
• Unlike static mixup policies, which suffer from the label mismatch problem, we propose DM, a mixup objective that mines discriminative features while maintaining smoothness.
• Our work contributes more broadly to the understanding of mixup training: it is essential to focus not only on smoothness, via regression of the mixing weight, but also on discrimination, by encouraging the network to give highly confident predictions when the evidence is clear.
• The proposed DM can be easily generalized to semi-supervised learning with minor modifications. By leveraging unlabeled data efficiently, it reduces confirmation bias and significantly improves overall performance.
• Comprehensive experiments on various tasks verify the effectiveness of DM; e.g., DM-based static mixup policies achieve comparable or even better performance than dynamic methods without extra computational overhead.
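The "decoupled term" described above can be made concrete with a rough sketch. The exact DM formulation appears later in the paper; here we only hypothesize one way to score each mixed class independently, namely excluding the partner class's logit from the softmax normalization so the two targets no longer suppress each other. The function names, the masking scheme, and the weight η are our illustrative assumptions:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dm_loss(logits, ya, yb, lam, eta=0.1):
    """Illustrative decoupled mixup objective (not the paper's exact form).
    logits: (N, C) scores; ya, yb: (N,) class indices of the mixed pair
    (assumed distinct per pair); lam: mixing ratio; eta: weight of the
    decoupled term."""
    n = logits.shape[0]
    idx = np.arange(n)
    logp = log_softmax(logits)
    # Standard mixup cross-entropy: regress the mixing ratio lam.
    mce = -(lam * logp[idx, ya] + (1.0 - lam) * logp[idx, yb]).mean()

    def masked_logp(mask_cls, target_cls):
        # Score target_cls with the partner class removed from the
        # softmax denominator, so each class is judged independently.
        z = logits.copy()
        z[idx, mask_cls] = -np.inf
        return log_softmax(z)[idx, target_cls]

    # Decoupled term: encourage confident prediction of each mixed class.
    dec = -(masked_logp(yb, ya) + masked_logp(ya, yb)).mean()
    return mce + eta * dec
```

Setting η = 0 recovers the plain mixup cross-entropy, so the decoupled term can be read as an add-on that rewards mining the evidence for each involved class without altering the smoothness imposed by the first term.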



Figure 1: Illustration of the problem of label mismatch. The red mixed labels are the ground truth.

Figure 2: Experimental overview of the label mismatch issue. Compared with static policies such as Mixup and CutMix, the dynamic method AutoMix significantly reduces the difficulty of mixup classification and alleviates the label mismatch problem by providing more reliable mixed samples, but it also incurs a large computational overhead. Left: Top-1 and Top-2 accuracy on mixed data from ImageNet-1k with 100 epochs; a Top-1 prediction is counted as correct if it belongs to {y_a, y_b}, and the Top-2 predictions are counted as correct if they equal {y_a, y_b}. Right: Taking Mixup as an example, the decoupled mixup cross-entropy (DMCE) significantly improves training efficiency by alleviating the label mismatch issue at the level of the loss function.

