COLORIZATION TRANSFORMER

Abstract

We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition on the grayscale input. Two subsequent fully parallel networks upsample the coarse colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based both on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. The code and pre-trained checkpoints for the Colorization Transformer are publicly available at this url.
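The three-step procedure described above can be sketched as a simple pipeline. This is a minimal stand-in, not the paper's implementation: the function names, the 64 × 64 coarse resolution, and the placeholder stage bodies are all illustrative assumptions; only the shapes and the order of the stages follow the text.

```python
import numpy as np

# Illustrative shapes; the coarse resolution (64x64) is an assumption.
H, W = 256, 256   # target resolution
h, w = 64, 64     # coarse resolution handled by the autoregressive core

def coarse_colorize(gray_lr):
    """Stage 1 (stand-in): autoregressive coarse coloring at low
    resolution. A real sampler draws coarse colors pixel by pixel;
    here we just return a random 3-channel image of the right shape."""
    return np.random.rand(h, w, 3)

def color_superresolve(coarse_rgb, gray_lr):
    """Stage 2 (stand-in): parallel upsampling of color *depth*
    (coarse colors -> fine colors) at low spatial resolution."""
    return coarse_rgb  # identity placeholder

def spatial_superresolve(rgb_lr, gray_hr):
    """Stage 3 (stand-in): parallel spatial upsampling to the full
    resolution, guided by the high-resolution grayscale input."""
    reps = (H // h, W // w)
    return np.kron(rgb_lr, np.ones((*reps, 1)))  # nearest-neighbor blow-up

gray_hr = np.random.rand(H, W, 1)
gray_lr = gray_hr[::H // h, ::W // w]
rgb = spatial_superresolve(
    color_superresolve(coarse_colorize(gray_lr), gray_lr), gray_hr)
assert rgb.shape == (H, W, 3)
```

The point of the decomposition is that only stage 1 is sequential (autoregressive); stages 2 and 3 are fully parallel, which is what makes sampling at 256 × 256 tractable.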

1. INTRODUCTION

Image colorization is a challenging, inherently stochastic task that requires a semantic understanding of the scene as well as knowledge of the world. Core immediate applications of the technique include producing organic new colorizations of existing image and video content as well as giving life to originally grayscale media, such as old archival images (Tsaftaris et al., 2014), videos (Geshwind, 1986) and black-and-white cartoons (Sỳkora et al., 2004; Qu et al., 2006; Cinarel & Zhang, 2017). Colorization also has important technical uses as a way to learn meaningful representations without explicit supervision (Zhang et al., 2016; Larsson et al., 2016; Vondrick et al., 2018) or as an unsupervised data augmentation technique, whereby diverse semantics-preserving colorizations of labelled images are produced with a colorization model trained on a potentially much larger set of unlabelled images. The current state-of-the-art in automated colorization are neural generative approaches based on log-likelihood estimation (Guadarrama et al., 2017; Royer et al., 2017; Ardizzone et al., 2019). Probabilistic models are a natural fit for the one-to-many task of image colorization and obtain better results than earlier deterministic approaches, avoiding some of their persistent pitfalls (Zhang et al., 2016). Probabilistic models also have the central advantage of producing multiple diverse colorings that are sampled from the learnt distribution. In this paper, we introduce the Colorization Transformer (ColTran), a probabilistic colorization model composed only of axial self-attention blocks (Ho et al., 2019b; Wang et al., 2020). The main advantages of axial self-attention blocks are the ability to capture a global receptive field with only two layers and O(D√D) instead of O(D²) complexity, where D is the number of pixels. They can be implemented efficiently using matrix multiplications on modern accelerators such as TPUs (Jouppi et al., 2017).
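The complexity argument above can be made concrete with a minimal numpy sketch of axial attention. This is a bare illustration, not the paper's architecture: it omits the learned query/key/value projections, multiple heads and layer normalization, and simply attends with the raw features. Each of the two passes attends over only √D positions per pixel (a row or a column of a √D × √D image), giving the O(D√D) cost, yet after a row pass followed by a column pass every pixel has a two-hop path to every other pixel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention over the second-to-last (sequence) axis;
    # leading axes are treated as batch dimensions by np.matmul.
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1]), axis=-1)
    return w @ v

def axial_attention(x):
    """x: (H, W, C) feature map. One row pass, then one column pass.

    Projections are omitted for brevity; x serves as query, key and value.
    """
    # Row pass: within each row, pixels attend across the W axis.
    x = x + attend(x, x, x)
    # Column pass: swap axes so each column becomes the sequence axis.
    xt = x.swapaxes(0, 1)          # (W, H, C)
    xt = xt + attend(xt, xt, xt)
    return xt.swapaxes(0, 1)       # back to (H, W, C)

out = axial_attention(np.random.randn(8, 8, 16))
```

Each attention matrix here is only √D × √D per row or column (8 × 8 above) rather than the D × D (64 × 64) matrix full self-attention over all pixels would require, and both passes are plain batched matrix multiplications, which is what makes the blocks accelerator-friendly.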
In order to enable colorization of high-resolution grayscale images, we decompose the task into three simpler sequential subtasks: coarse low-resolution autoregressive colorization, parallel color super-resolution and parallel spatial super-resolution. For coarse low-resolution colorization, we apply a conditional variant of the Axial Transformer (Ho et al., 2019b), a state-of-the-art autoregressive image generation model that does not require custom kernels (Child et al., 2019). While Axial Transformers support conditioning by biasing the input, we find that directly conditioning the transformer layers can improve results significantly. By leveraging the semi-parallel sampling mechanism of Axial Transformers, we are able to colorize images faster and at higher resolution than previous work (Guadarrama et al., 2017), which in turn improves colorization fidelity. Finally, we employ fast parallel deterministic upsampling models to super-resolve the coarsely colorized image into the final high-resolution output. In summary, our main contributions are:

• First application of transformers for high-resolution (256 × 256) image colorization. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth.
• We introduce conditional transformer layers for low-resolution coarse colorization in Section 4.1. The conditional layers incorporate conditioning information via multiple learnable components that are applied per-pixel and per-channel. We validate the contribution of each component with extensive experimentation and ablation studies.
• We propose training an auxiliary parallel prediction model jointly with the low-resolution coarse colorization model in Section 4.2. Improved FID scores demonstrate the usefulness of this auxiliary model.
• We establish a new state-of-the-art on image colorization, outperforming prior methods by a large margin on FID scores and in a 2-Alternative Forced Choice (2AFC) Mechanical Turk test.

2. RELATED WORK

Colorization methods have initially relied on human-in-the-loop approaches to provide hints in the form of scribbles (Levin et al., 2004; Ironi et al., 2005; Huang et al., 2005; Yatziv & Sapiro, 2006; Qu et al., 2006; Luan et al., 2007; Tsaftaris et al., 2014; Zhang et al., 2017; Ci et al., 2018) and on exemplar-based techniques that identify a reference source image to copy colors from (Reinhard et al., 2001; Welsh et al., 2002; Tai et al., 2005; Ironi et al., 2005; Pitié et al., 2007; Morimoto et al., 2009; Gupta et al., 2012; Xiao et al., 2020). Exemplar-based techniques have recently been extended to video as well (Zhang et al., 2019a). In the past few years, the focus has moved on to more automated, neural colorization methods.

Figure 1: Samples of our model showing diverse, high-fidelity colorizations.

Deterministic colorization techniques such as CIC (Zhang et al., 2016), LRAC (Larsson et al., 2016), LTBC (Iizuka et al., 2016), Pix2Pix (Isola et al., 2017) and DC (Cheng et al., 2015; Dahl, 2016) involve variations of CNNs that model per-pixel color information conditioned on the intensity. ColTran is similar to PixColor (Guadarrama et al., 2017) in its use of an autoregressive model for low-resolution colorization and parallel spatial upsampling, but differs from PixColor in the following ways. We train ColTran in a completely unsupervised fashion, while the conditioning network in PixColor requires pre-training with an object detection network that provides substantial semantic information. PixColor relies on PixelCNN (Oord et al., 2016), which requires a large depth to model interactions between all pixels, whereas ColTran relies on the Axial Transformer (Ho et al., 2019b) and can model all interactions between pixels with just two layers. PixColor uses different architectures for conditioning, colorization and super-resolution, while ColTran is conceptually simpler, as we use self-attention blocks everywhere for both colorization and super-resolution. Finally, we train

