BEYOND CATEGORICAL LABEL REPRESENTATIONS FOR IMAGE CLASSIFICATION

Abstract

We find that the way we choose to represent data labels can have a profound effect on the quality of trained models. For example, training an image classifier to regress audio labels rather than traditional categorical probabilities produces more reliable classifiers. This result is surprising, considering that audio labels are more complex than simple numerical probabilities or text. We hypothesize that high-dimensional, high-entropy label representations are generally more useful because they provide a stronger error signal. We support this hypothesis with evidence from various label representations, including constant matrices, spectrograms, shuffled spectrograms, Gaussian mixtures, and uniform random matrices of various dimensionalities. Our experiments reveal that high-dimensional, high-entropy labels achieve accuracy comparable to text (categorical) labels on the standard image classification task, but features learned through our label representations are more robust under various adversarial attacks and more effective with a limited amount of training data. These results suggest that label representation may play a more important role than previously thought.

1. INTRODUCTION

Image classification is a well-established task in machine learning. The standard approach takes an input image and predicts a categorical distribution over the given classes. The most popular method to train these neural networks is through a cross-entropy loss with backpropagation. Deep convolutional neural networks (Lecun et al., 1998; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2015; Huang et al., 2016) have achieved extraordinary performance on this task, with some even surpassing human-level performance. However, is this a solved problem? State-of-the-art performance commonly relies on large amounts of training data (Krizhevsky, 2009; Russakovsky et al., 2015; Kuznetsova et al., 2018), and there exist many examples of networks with good performance that fail on images with imperceptible adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2013; Nguyen et al., 2014). Much progress has been made in domains such as few-shot learning and meta-learning to improve the data efficiency of neural networks. There is also a large body of research addressing the challenge of adversarial defense. Most efforts have focused on improving optimization methods, weight initialization, architecture design, and data preprocessing. In this work, we find that simply replacing standard categorical labels with high-dimensional, high-entropy variants (e.g. an audio spectrogram pronouncing the name of the class) can lead to interesting properties such as improved robustness and efficiency, without a loss of accuracy. Our research is inspired by key observations from human learning. Humans appear to learn to recognize new objects from few examples, and are not easily fooled by the types of adversarial perturbations applied to current neural networks. There could be many reasons for the discrepancy between how humans and machines learn. One significant aspect is that humans do not output categorical probabilities over all known categories.
A child shown a picture of a dog and asked "what is this a picture of?" will directly speak out the answer: "dog." Similarly, a child being trained by a parent is shown a picture and then provided the associated label in the form of speech. These observations raise the question: are we supervising neural networks on the best modality? In this paper, we take one step closer to understanding the role of label representations inside the training pipeline of deep neural networks. However, while useful properties emerge from utilizing various label representations, we do not attempt to achieve state-of-the-art performance on these metrics. Rather, we hope to provide a novel research perspective on the standard setup. Therefore, our study is not mutually exclusive with previous research on improving adversarial robustness and data efficiency.

An overview of our approach is shown in Figure 1. We first follow the above natural observation and modify existing image classifiers to "speak out" their predictions instead of outputting a categorical distribution. Our initial experiments show surprising results: neural networks trained with speech labels learn more robust features against adversarial attacks, and are more data-efficient when less than 20% of the training data is available. Furthermore, we hypothesize that the improvements from the speech label representation come from its property as a specific type of high-dimensional object. To test our hypothesis, we perform a large-scale systematic study with various other high-dimensional label representations, including constant matrices, speech spectrograms, shuffled speech spectrograms, compositions of Gaussians, and high-dimensional and low-dimensional uniform random vectors.
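The mechanism described above can be sketched in a few lines: instead of a softmax over classes, the network regresses each class's fixed high-dimensional label, and inference assigns the class whose label is nearest to the network's output. The following minimal numpy sketch illustrates the idea only; the uniform random stand-in labels, the dimensions, and the helper names (`regression_loss`, `classify`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 classes, each mapped to one fixed high-dimensional
# label (think of a flattened speech spectrogram; here random matrices stand in).
NUM_CLASSES, LABEL_DIM = 10, 4096
class_labels = rng.uniform(size=(NUM_CLASSES, LABEL_DIM))

def regression_loss(pred, class_idx):
    """Training objective: L2-regress the class's fixed high-dimensional
    label instead of applying cross-entropy to a categorical distribution."""
    target = class_labels[class_idx]
    return float(np.mean((pred - target) ** 2))

def classify(pred):
    """Inference: pick the class whose label is nearest to the output."""
    dists = np.linalg.norm(class_labels - pred, axis=1)
    return int(np.argmin(dists))
```

Because the labels live in a high-dimensional space, even a noticeably noisy prediction remains much closer to its own class label than to any other, which is one intuition for the stronger error signal claimed above.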
Our experimental results show that high-dimensional label representations with high entropy generally lead to robust and data-efficient network training. We believe our findings point to a significant role of label representations that has been largely unexplored in the training of deep neural networks. Our contributions are threefold. First, we introduce a new paradigm for the image classification task by using speech as the supervisory signal. We demonstrate that speech models can achieve performance comparable to traditional models that rely on categorical outputs. Second, we quantitatively show that high-dimensional label representations with high entropy (e.g. audio spectrograms and compositions of Gaussians) produce more robust and data-efficient neural networks, while high-dimensional labels with low entropy (e.g. constant matrices) and low-dimensional labels with high entropy do not have these benefits and may even lead to worse performance. Finally, we present a set of quantitative and qualitative analyses to systematically study and understand the learned feature representations of our networks. Our visualizations suggest that speech labels encourage learning more discriminative features.
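To make the low-entropy versus high-entropy distinction concrete, the sketch below generates three of the label families named above, a constant matrix (high-dimensional, low entropy), a composition of Gaussians, and a uniform random matrix (both high entropy), and scores each with a simple histogram-based entropy proxy. The matrix size, number of Gaussian bumps, and the entropy proxy itself are all hypothetical choices for illustration, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # hypothetical label size: each label is a DIM x DIM matrix

def constant_label(value=2.0):
    # High-dimensional but low entropy: every entry is identical.
    return np.full((DIM, DIM), value)

def composite_gaussian_label(num_bumps=3, sigma=4.0):
    # High entropy: sum of a few randomly placed 2-D Gaussian bumps.
    xs, ys = np.meshgrid(np.arange(DIM), np.arange(DIM))
    label = np.zeros((DIM, DIM))
    for _ in range(num_bumps):
        cx, cy = rng.uniform(0, DIM, size=2)
        label += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return label

def uniform_random_label():
    # High-dimensional and high entropy.
    return rng.uniform(size=(DIM, DIM))

def entropy(label, bins=32):
    # Shannon entropy (in bits) of the histogram of entries -- a crude proxy.
    hist, _ = np.histogram(label, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```

Under this proxy the constant label scores zero bits while the random and Gaussian-composite labels score well above it, mirroring the distinction that drives the second contribution.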

2. RELATED WORKS

Data Efficiency and Robustness Data efficiency has been a widely studied problem within the context of few-shot learning and meta-learning (Thrun & Pratt, 2012; Vilalta & Drissi, 2002; Vanschoren, 2018; Wang et al., 2019). Researchers have made exciting progress on improving methods of optimization (Ravi & Larochelle, 2016; Li et al., 2017), weight initialization (Finn et al., 2017; Ravi & Larochelle, 2017), and architecture design (Santoro et al., 2016a;b). There is also a large body of research addressing the challenge of adversarial defense. Adversarial training is perhaps the most common measure against adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2016; Szegedy et al., 2013; Shaham et al., 2018; Madry et al., 2017). Recent works try to tackle the problem by leveraging GANs (Samangouei et al., 2018), detecting adversarial examples (Meng & Chen, 2017; Lu et al., 2017; Metzen et al., 2017), and denoising or reconstruction (Song et al., 2017; Liao et al., 2017).



Figure 1: Label Representations beyond Categorical Probabilities: We study the role of label representation in training neural networks for image classification. We find that high-dimensional labels with high entropy lead to more robust and data-efficient feature learning.

