LEARNING REPRESENTATION IN COLOUR CONVERSION

Abstract

Colours can be represented in an infinite set of spaces highlighting distinct features. Here, we investigated the impact of colour spaces on the encoding capacity of a visual system that is subject to information compression, specifically variational autoencoders (VAEs) where bottlenecks are imposed. To this end, we propose a novel unsupervised task: colour space conversion (ColourConvNets). We trained several instances of VAEs whose input and output are in different colour spaces, e.g. from RGB to CIE L*a*b* (in total five colour spaces were examined). This allowed us to systematically study the influence of input-output colour spaces on the encoding efficiency and learnt representation. Our evaluations demonstrate that ColourConvNets with decorrelated output colour spaces produce higher quality images, also evident in pixel-wise low-level metrics such as colour difference (∆E), peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). We also assessed the ColourConvNets' capacity to reconstruct the global content in two downstream tasks: image classification (ImageNet) and scene segmentation (COCO). Our results show a 5-10% performance boost for decorrelating ColourConvNets with respect to the baseline network (whose input and output are RGB). Furthermore, we thoroughly analysed the finite embedding space of Vector Quantised VAEs with three different methods (single feature, hue shift and linear transformation). The interpretations reached with these techniques are in agreement, suggesting that (i) luminance and chromatic information are encoded in separate embedding vectors, and (ii) the structure of the network's embedding space is determined by the output colour space.

1. INTRODUCTION

Colour is an inseparable component of our conscious visual perception and its objective utility spans a large set of tasks such as object recognition and scene segmentation (Chirimuuta et al., 2015; Gegenfurtner & Rieger, 2000; Wichmann et al., 2002). Consequently, colour is a ubiquitous feature in many applications: colour transfer (Reinhard et al., 2001), colour constancy (Chakrabarti, 2015), style transfer (Luan et al., 2017), computer graphics (Bratkova et al., 2009), image denoising (Dabov et al., 2007), quality assessment (Preiss et al., 2014), to name a few. Progress in these lines requires a better understanding of colour representation and its neural encoding in deep networks. To this end, we present a novel unsupervised task: colour conversion. In our proposed framework the input-output colour space is imposed on deep autoencoders (referred to as ColourConvNets) that learn to efficiently compress the visual information (Kramer, 1991) while transforming the input to output. Essentially, the output y for input image x is generated on the fly by a transformation y = T (x), where T maps input to output colour space. This task offers a fair comparison of different colour spaces within a system that learns to minimise a loss function in the context of the information bottleneck principle (Tishby & Zaslavsky, 2015). The quality of output images demonstrates whether the representation of input-output colour spaces impacts networks' encoding power. Furthermore, the structure of internal representation provides insights into how colour transformation is performed within a neural network. In this work, we focused on the Vector Quantised Variational Autoencoder (VQ-VAE) (van den Oord et al., 2017) due to the discrete nature of its latent space that facilitates the analysis and interpretability of the learnt features. We thoroughly studied five commonly used colour spaces by training ColourConvNets for all combinations of input-output spaces.
First, we show that ColourConvNets with a decorrelated output colour space (e.g. CIE L*a*b*) convey information more efficiently in their compressing bottleneck, in line with the presence of colour opponency in the human visual system. This is evident qualitatively (Figures 1 and A.1) and quantitatively (evaluated with three low-level and two high-level metrics). Next, we present the interpretation of ColourConvNets' latent space by means of three methods reaching a consensus interpretation: (i) the colour representation in the VQ-VAEs' latent space is determined by the output colour space, suggesting the transformation T occurs at the encoder, (ii) each embedding vector in VQ-VAEs encodes a specific part of the colour space, e.g. the luminance or chromatic information, which can be modelled by a parsimonious linear transformation.
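For reference, the discretisation step that makes the VQ-VAE latent space tractable to analyse amounts to a nearest-neighbour lookup into a finite codebook. The minimal pure-Python sketch below is ours; the toy codebook is illustrative only, not the trained K=8, D=128 embeddings:

```python
def quantise(z, codebook):
    """Replace a continuous encoder output z (list of floats) with its
    nearest codebook vector under squared Euclidean distance; return the
    chosen index and the embedding vector."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]

# Toy codebook with K=3 embedding vectors of dimension D=2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
idx, e = quantise([0.9, 0.1], codebook)  # nearest vector is [1.0, 0.0]
```

Because every spatial position of the encoder output is snapped to one of K vectors, probing what each vector encodes (e.g. luminance versus chromaticity) becomes a finite, enumerable analysis. In training, the gradient is passed through this non-differentiable lookup with a straight-through estimator, as in van den Oord et al. (2017).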

1.1. RELATED WORK

The effectiveness of different colour spaces has been investigated in a few empirical studies of deep neural networks (DNNs). Information fusion over several colour spaces improved retinal medical imaging (Fu et al., 2019). A similar strategy enhanced the robustness of face (Li et al., 2014; Larbi et al., 2018) and traffic light recognition (Cireşan et al., 2012; Kim et al., 2018). This was also effective in predicting eye fixation (Shen et al., 2015). Opponent colour spaces have been explored for applications such as style transfer (Luan et al., 2017; Gatys et al., 2017) and picture colourisation (Cheng et al., 2015; Larsson et al., 2016). Most of these works are within the domain of supervised learning. The most similar approach to our proposed ColourConvNets is image colourisation as a pretext task for unsupervised visual feature learning (Larsson et al., 2017). Initial works on colour representation in DNNs revealed that object classification networks learn to decorrelate their input images (Rafegas & Vanrell, 2018; Flachot & Gegenfurtner, 2018; Harris et al., 2019). This is reminiscent of horizontal and ganglion cells that decorrelate the retinal signal into colour-opponency before transmitting it to the visual cortex (Schiller & Malpeli, 1977; Derrington et al., 1984; Gegenfurtner & Kiper, 2003). Another set of works reported the existence of hue-sensitive units (Engilberge et al., 2017) that mainly emerge in early layers (Bau et al., 2017). The representation of colours in deep networks at intermediate and higher layers is rather understudied. In this article, we specifically focus on the intermediate representation that emerges at the latent space of autoencoders, which to the best of our knowledge has not been reported in the literature.

2. COLOUR CONVERSION AUTOENCODERS

In this article, we propose a novel unsupervised task of colour conversion: the network's output colour space is independent of its input (see Figure 2). A colour space is an arbitrary definition of the organisation of colours in space (Koenderink & van Doorn, 2003). Thus, the choice of transfor-



Figure 1: Qualitative comparison of three ColourConvNets (VQ-VAEs with K=8 and D=128). The first column is the networks' input and the other columns their corresponding outputs. The output images of rgb2dkl and rgb2lab have been converted to the RGB colour space for visualisation purposes. The artefacts in rgb2rgb are clearly more visible in comparison to the other ColourConvNets.

