PLANCKIAN JITTER: COUNTERING THE COLOR-CRIPPLING EFFECTS OF COLOR JITTER ON SELF-SUPERVISED TRAINING

Abstract

Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation, which we call Planckian Jitter, that creates realistic variations in chromaticity and produces a model robust to illumination changes commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.

1. INTRODUCTION

Self-supervised learning enables the learning of representations without the need for labeled data (Doersch et al., 2015; Dosovitskiy et al., 2014). Several recent works learn representations that are invariant with respect to a set of data augmentations and have obtained spectacular results (Grill et al., 2020; Chen & He, 2021; Caron et al., 2020), significantly narrowing the gap with supervised learned representations. These works vary in their architectures, learning objectives, and optimization strategies; however, they are similar in applying a common set of data augmentations to generate different image views. By learning to map these different views to the same latent representation, these algorithms acquire rich semantic representations of visual data. The set of transformations (data augmentations) used induces invariances that characterize the learned visual representation. Before deep learning revolutionized the way visual representations are learned, features were handcrafted to represent various properties, leading to research on shape (Lowe, 2004), texture (Manjunath & Ma, 1996), and color features (Finlayson & Schaefer, 2001; Geusebroek et al., 2001). Color features were typically designed to be invariant to a set of scene-accidental events such as shadows, shading, and illuminant and viewpoint changes. With the rise of deep learning, feature representations that simultaneously exploit color, shape, and texture are learned implicitly, and the invariances are a byproduct of end-to-end training (Krizhevsky et al., 2009). Current approaches to self-supervision implicitly learn a set of invariances related to the applied data augmentations. In this work, we focus on the current de facto choice for color augmentations. We argue that it seriously cripples the color quality of learned representations, and we propose an alternative, physics-based color augmentation. Figure 1 (left) illustrates the currently used color augmentation on a sample image.
It is clear that the applied color transformation significantly alters the colors of the original image, both in terms of hue and saturation. This augmentation results in a representation that is invariant with respect to surface reflectance, an invariance beneficial for recognizing classes whose surface reflectance varies significantly, for example many man-made objects such as cars and chairs. However, such invariance is expected to hurt performance on downstream tasks for which color is an important feature, as is the case for natural classes such as birds or food. One justification for large color changes is that without them, different views of an image could be mapped to the same latent representation purely based on color, and no complex shape or texture features would be learned. However, as a result, the quality of the color representation learned with such algorithms is inferior, and important information on surface reflectance might be absent. Additionally, some traditional supervised learning methods propose domain-specific variations of color augmentation (Galdran et al., 2017; Xiao et al., 2019). In this paper we propose an alternative color augmentation (Figure 1, right) and assess its impact on self-supervised learning. We draw on the existing color imaging literature on designing features invariant to illuminant changes commonly encountered in the real world (Finlayson & Schaefer, 2001). Our augmentation, called Planckian Jitter, applies physically realistic illuminant variations. We consider the illuminants described by Planck's law for black-body radiation, which are known to be similar to illuminants encountered in real life (Tominaga et al., 1999). The aim of our color augmentation is to allow the representation to contain valuable information about the surface reflectance of objects, a feature that is expected to be important for a wide range of downstream tasks.
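The idea of re-illuminating an image with a black-body light source can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, the channel responses are crudely approximated by sampling Planck's law at one representative wavelength per RGB channel (rather than integrating against proper sensor sensitivities), and the temperature range of 3000-15000 K is an illustrative choice.

```python
import numpy as np

# Physical constants for Planck's law
H = 6.626e-34    # Planck constant (J*s)
C = 2.998e8      # speed of light (m/s)
K_B = 1.381e-23  # Boltzmann constant (J/K)

def planck_radiance(wavelength_m, temp_k):
    """Spectral radiance of a black body at the given temperature (Planck's law)."""
    a = 2.0 * H * C**2 / wavelength_m**5
    b = H * C / (wavelength_m * K_B * temp_k)
    return a / (np.exp(b) - 1.0)

def planckian_white_point(temp_k, rgb_peaks_nm=(600.0, 550.0, 450.0)):
    """Approximate the RGB white point of a black-body illuminant by sampling
    the spectrum at one representative wavelength per channel."""
    rgb = np.array([planck_radiance(w * 1e-9, temp_k) for w in rgb_peaks_nm])
    return rgb / rgb[1]  # normalize so the green channel equals 1

def planckian_jitter(image, t_min=3000.0, t_max=15000.0, rng=None):
    """Re-illuminate an RGB image (H x W x 3, floats in [0, 1]) with a
    black-body illuminant of randomly sampled color temperature."""
    rng = np.random.default_rng() if rng is None else rng
    temp = rng.uniform(t_min, t_max)
    out = image * planckian_white_point(temp)
    return np.clip(out, 0.0, 1.0)
```

Because the transformation only rescales the three channels by a physically plausible white point, colors move along the Planckian locus of the chromaticity plane (warm illuminants boost red, cool illuminants boost blue) instead of being scattered arbitrarily as with default color jitter.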
Combining such a representation with the already high-quality shape and texture representation learned with standard data augmentation leads to a more complete visual descriptor that also describes color. Our experiments show that self-supervised representations learned with Planckian Jitter are robust to illuminant changes. In addition, depending on the importance of color in the dataset, the proposed Planckian Jitter outperforms the default color jitter. Moreover, for all evaluated datasets, combining the features of our new data augmentation with those of standard color jitter leads to significant performance gains of over 5% on several downstream classification tasks. Finally, we show that Planckian Jitter can be applied to several state-of-the-art self-supervised learning methods.
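The combination of the two representations can be sketched as a simple concatenation of backbone features. The per-branch L2 normalization below is an assumption added so that neither branch dominates the concatenated descriptor; the function name is hypothetical and not taken from the paper.

```python
import numpy as np

def combine_features(feats_default, feats_planckian):
    """Concatenate features from two backbones, one trained with default
    color jitter and one with Planckian Jitter. Each (N, D) feature matrix
    is L2-normalized per row before concatenation so both branches
    contribute on the same scale."""
    def l2_normalize(x, eps=1e-12):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    return np.concatenate(
        [l2_normalize(feats_default), l2_normalize(feats_planckian)], axis=1
    )
```

The concatenated descriptor can then be fed to a linear classifier for downstream evaluation, as is standard in linear-probing protocols.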

2. BACKGROUND AND RELATED WORK

Self-supervised learning and contrastive learning. Recent advances in self-supervision learn semantically rich feature representations without the need for labelled data. In SimCLR (Chen et al., 2020a), similar samples are created by augmenting an input image, while dissimilar ones are chosen



Figure 1: Default color jitter (left) and Planckian Jitter (right). Augmentations based on default color jitter lead to unrealistic images, while Planckian Jitter leads to a set of realistic ones. The ARC chromaticity diagrams for each type of jitter are computed by sampling initial RGB values and mapping them into the range of possible outputs given by each augmentation. These diagrams show that Planckian Jitter transforms colors along chromaticity lines that occur in nature when the illuminant changes, whereas default color jitter spreads colors across the whole chromaticity plane.

Code availability: https://github.com/TheZino/

