PLANCKIAN JITTER: COUNTERING THE COLOR-CRIPPLING EFFECTS OF COLOR JITTER ON SELF-SUPERVISED TRAINING

Abstract

Several recent works on self-supervised learning train models by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of the learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation, which we call Planckian Jitter, that creates realistic variations in chromaticity and produces a model robust to the illumination changes commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.
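The core idea of the augmentation can be sketched as follows: sample a color temperature on the Planckian (black-body) locus, convert it to per-channel gains in linear sRGB, and re-illuminate the image. The sketch below is a minimal illustration, not the paper's reference implementation: it uses the standard Kim et al. cubic approximation of the Planckian locus and the standard XYZ-to-linear-sRGB matrix, and the uniform temperature range (3000 K to 15000 K here) is an assumed choice.

```python
import numpy as np

def planckian_gains(temperature):
    """Per-channel linear-sRGB gains for a black-body illuminant.

    Uses the cubic approximation of the Planckian locus
    (valid for roughly 1667 K <= T <= 25000 K).
    """
    T = float(temperature)
    # CIE xy chromaticity of the Planckian locus at temperature T.
    if T <= 4000:
        x = (-0.2661239e9 / T**3 - 0.2343589e6 / T**2
             + 0.8776956e3 / T + 0.179910)
    else:
        x = (-3.0258469e9 / T**3 + 2.1070379e6 / T**2
             + 0.2226347e3 / T + 0.240390)
    if T <= 2222:
        y = -1.1063814 * x**3 - 1.34811020 * x**2 + 2.18555832 * x - 0.20219683
    elif T <= 4000:
        y = -0.9549476 * x**3 - 1.37418593 * x**2 + 2.09137015 * x - 0.16748867
    else:
        y = 3.0817580 * x**3 - 5.87338670 * x**2 + 3.75112997 * x - 0.37001483
    # xyY -> XYZ with Y = 1, then XYZ -> linear sRGB.
    X, Y, Z = x / y, 1.0, (1.0 - x - y) / y
    rgb = np.array([
        3.2404542 * X - 1.5371385 * Y - 0.4985314 * Z,
        -0.9692660 * X + 1.8760108 * Y + 0.0415560 * Z,
        0.0556434 * X - 0.2040259 * Y + 1.0572252 * Z,
    ])
    return rgb / rgb.max()  # normalize so no channel gain exceeds 1

def planckian_jitter(image, t_min=3000.0, t_max=15000.0, rng=None):
    """Re-illuminate a linear-RGB image (H x W x 3, floats in [0, 1])
    with a randomly sampled black-body illuminant."""
    rng = np.random.default_rng(rng)
    gains = planckian_gains(rng.uniform(t_min, t_max))
    return np.clip(image * gains, 0.0, 1.0)
```

Low temperatures yield warm (reddish) casts and high temperatures cool (bluish) ones, so the augmentation stays on physically plausible illuminants rather than producing the arbitrary hue rotations of standard color jitter.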

1. INTRODUCTION

Self-supervised learning enables the learning of representations without the need for labeled data (Doersch et al., 2015; Dosovitskiy et al., 2014). Several recent works learn representations that are invariant with respect to a set of data augmentations and have obtained spectacular results (Grill et al., 2020; Chen & He, 2021; Caron et al., 2020), significantly narrowing the gap with supervised representations. These works vary in their architectures, learning objectives, and optimization strategies; however, they all apply a common set of data augmentations to generate different image views. By learning to map these different views to the same latent representation, these algorithms learn rich semantic representations of visual data. The set of transformations (data augmentations) used induces the invariances that characterize the learned visual representation. Before deep learning revolutionized the way visual representations are learned, features were handcrafted to represent various properties, leading to research on shape (Lowe, 2004), texture (Manjunath & Ma, 1996), and color features (Finlayson & Schaefer, 2001; Geusebroek et al., 2001). Color features were typically designed to be invariant to a set of scene-accidental events such as shadows, shading, and illuminant and viewpoint changes. With the rise of deep learning, feature

Availability: https://github.com/TheZino/

