DIFFACE: BLIND FACE RESTORATION WITH DIFFUSED ERROR CONTRACTION

Abstract

While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate seriously when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which entail laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that copes with unseen and complex degradations more gracefully without complicated loss designs. The key to our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to an intermediate state of a pre-trained diffusion model, and then gradually move from this intermediate state to the HQ target by recursively applying the pre-trained diffusion model. The transition distribution only relies on a restoration backbone trained with an L2 loss on synthetic data, which favorably avoids the cumbersome training process of existing methods. Moreover, the transition distribution is capable of contracting the error of the restoration backbone, which makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model will be released.

1. INTRODUCTION

Blind face restoration (BFR) aims at recovering a high-quality (HQ) image from its low-quality (LQ) counterpart, which usually suffers from complex degradations such as noise, blurring, and downsampling. BFR is an extremely ill-posed inverse problem, as multiple HQ solutions may exist for any given LQ image. Approaches to BFR have been dominated by deep learning-based methods (Wang et al., 2021; Tu et al., 2021; Feihong et al., 2022; Gu et al., 2022). Their main idea is to learn a mapping, usually parametrized as a deep neural network, from LQ images to HQ ones based on a large amount of pre-collected LQ/HQ image pairs. In most cases, these image pairs are synthesized by assuming a degradation model that often deviates from the real one. Most existing methods are sensitive to such a deviation and thus suffer a dramatic performance drop when encountering mismatched degradations in real scenarios. Various constraints have been designed to mitigate the influence of this deviation and improve restoration quality. The L2 (or L1) loss is commonly used to ensure fidelity, although these pixel-wise losses are known to favor the prediction of an average (or a median) over the plausible solutions. Recent BFR methods also introduce the adversarial loss (Goodfellow et al., 2014) and the perceptual loss (Johnson et al., 2016; Zhang et al., 2018) to achieve more realistic results. Besides, some existing methods also exploit face-specific priors to further constrain the restored solution, e.g., face landmarks (Chen et al., 2018), facial components (Li et al., 2020), and generative priors (Chan et al., 2022; Pan et al., 2021; Wang et al., 2021; Yang et al., 2021). Considering so many constraints together makes the training unnecessarily complicated, often requiring laborious hyper-parameter tuning to make a trade-off among these constraints. Worse, the notorious instability of the adversarial loss makes the training even more challenging.
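To make the hyper-parameter burden concrete, the multi-term objective used by GAN-based BFR methods typically looks like the following minimal sketch. The weights `w_pix`, `w_perc`, and `w_adv` are illustrative placeholders (not values from any particular method), and the feature/logit inputs stand in for a fixed perceptual network (e.g. VGG) and a discriminator:

```python
import numpy as np

def combined_bfr_loss(pred, target, feat_pred, feat_target, disc_logit_fake,
                      w_pix=1.0, w_perc=0.1, w_adv=0.01):
    """Illustrative multi-term BFR objective; the weights must be hand-tuned."""
    # Pixel-wise fidelity term (L2 here; some methods prefer L1). Known to
    # favor an average over the plausible HQ solutions.
    loss_pix = np.mean((pred - target) ** 2)
    # Perceptual term: distance between deep features of a fixed network.
    loss_perc = np.mean(np.abs(feat_pred - feat_target))
    # Non-saturating adversarial term on the discriminator's logits for the
    # generated image: softplus(-logit) = log(1 + exp(-logit)).
    loss_adv = np.mean(np.logaddexp(0.0, -disc_logit_fake))
    return w_pix * loss_pix + w_perc * loss_perc + w_adv * loss_adv
```

Every extra term adds a weight to tune, and the adversarial term additionally couples the generator's training dynamics to a discriminator, which is the source of the instability mentioned above.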
Figure 1: Overview of the proposed method. The solid lines denote the whole inference pipeline of our method. For ease of comparison, we also mark out the forward and reverse processes of the diffusion model by dotted lines.

To this end, we establish a posterior distribution p(x_0|y_0), aiming to infer the HQ image x_0 conditioned on its LQ counterpart y_0. Due to the complex degradations, solving for this posterior is non-trivial in blind restoration. Our solution, as depicted in Fig. 1, is to approximate this posterior by a transition distribution p(x_N|y_0), where x_N is a diffused version of the desired HQ image x_0, followed by a reverse Markov chain that estimates x_0 from x_N. To construct the transition distribution p(x_N|y_0), we introduce a deep neural network that can be trained simply with an L2 loss. Such a transition distribution is appealing because it is motivated by an important observation in DDPM (Ho et al., 2020): in the diffusion process, data is destroyed by re-scaling it with a factor of less than 1 and adding noise. Brought into our context, the residual between x_0 and y_0 is also contracted by this factor after diffusion. Our framework uniquely leverages this property by inferring the intermediate diffused variable x_N (where N < T) from the LQ image y_0, whose residual to the HQ image x_0 is thereby reduced, and then inferring the desired x_0 from this intermediate state. Doing so has several advantages: i) it is more efficient than the full reverse diffusion process from x_T to x_0; ii) we do not need to retrain the diffusion model from scratch; and iii) we can still take advantage of the pre-trained diffusion model via the reverse Markov chain from x_N to x_0.
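The inference pipeline in Fig. 1 can be sketched as follows under standard DDPM notation. Here `restorer` stands for the L2-trained backbone f(.; w), `eps_model` for the noise-prediction network of a pre-trained diffusion model (both placeholders), and `alphas_bar[n]` for the cumulative product of the schedule's alpha_t up to step n; the reverse step uses the usual DDPM posterior mean with variance beta_n, which is one common choice rather than the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def difface_sample(y0, restorer, eps_model, alphas_bar, N):
    """Sketch of DifFace-style inference: diffuse a rough restoration to
    timestep N < T, then run the pre-trained reverse chain back to x_0."""
    x0_hat = restorer(y0)  # rough HQ estimate f(y0; w)
    # Transition p(x_N | y_0): the residual (x0_hat - x0) is contracted by
    # the factor sqrt(alphas_bar[N]) < 1 when diffusing to timestep N.
    a_N = alphas_bar[N]
    x = np.sqrt(a_N) * x0_hat + np.sqrt(1.0 - a_N) * rng.standard_normal(y0.shape)
    # Reverse Markov chain of the pre-trained diffusion model, x_N -> x_0.
    for n in range(N, 0, -1):
        a_n, a_prev = alphas_bar[n], alphas_bar[n - 1]
        beta_n = 1.0 - a_n / a_prev
        eps = eps_model(x, n)
        mean = (x - beta_n / np.sqrt(1.0 - a_n) * eps) / np.sqrt(a_n / a_prev)
        noise = rng.standard_normal(x.shape) if n > 1 else 0.0
        x = mean + np.sqrt(beta_n) * noise
    return x
```

Note that only N reverse steps are executed rather than the full T, which is the efficiency advantage (i) above, and neither network is retrained.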
SR3 (Saharia et al., 2022b) also exploits the potential of diffusion models for blind restoration. It feeds the LQ image into the diffusion model as a condition to guide the restoration at each timestep. This requires retraining the diffusion model from scratch on pre-collected training data, so it still suffers from the issue of degradation mismatch when dealing with real-world data. Different from SR3, our method does not need to train the diffusion model from scratch but fully leverages the prior knowledge contained in the pre-trained diffusion model. The unique design of the transition distribution p(x_N|y_0) further allows us to cope with unknown degradations.
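The structural difference can be seen in a single denoising step. In the SR3-style step below, the LQ image is channel-concatenated with the noisy state at every timestep, so the denoiser must be trained from scratch on (LQ, HQ) pairs; in the DifFace-style step, the pre-trained unconditional denoiser never sees y_0, which only determines the starting state x_N. Both `cond_eps_model` and `eps_model` are hypothetical placeholders:

```python
import numpy as np

def sr3_step(x_t, t, y0, cond_eps_model):
    # SR3-style conditioning: the denoiser sees y0 at EVERY timestep via
    # channel-wise concatenation, tying it to the training degradations.
    return cond_eps_model(np.concatenate([x_t, y0], axis=0), t)

def difface_step(x_t, t, eps_model):
    # DifFace-style step: an unchanged, pre-trained unconditional denoiser;
    # the LQ image influenced only the initial state x_N.
    return eps_model(x_t, t)
```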



Figure 2: Comparative results of recent state-of-the-art methods and the proposed method on one typical real example. From left to right: (a) low-quality image, (b)-(g) restored results of DFDNet (Li et al., 2020), GLEAN (Chan et al., 2022), PSFRGAN (Chen et al., 2021), GFPGAN (Wang et al., 2021), VQFR (Gu et al., 2022), and our proposed method.

