AN OPTIMAL TRANSPORT PERSPECTIVE ON UNPAIRED IMAGE SUPER-RESOLUTION Anonymous

Abstract

Real-world image super-resolution (SR) tasks often do not have paired datasets, which limits the application of supervised techniques. As a result, the tasks are usually approached by unpaired techniques based on Generative Adversarial Networks (GANs), which yield complex training losses with several regularization terms, e.g., content or identity losses. We theoretically investigate the optimization problems which arise in such models and make two surprising observations. First, the learned SR map is always an optimal transport (OT) map. Second, we theoretically prove and empirically show that the learned map is biased, i.e., it does not actually transform the distribution of low-resolution images to that of high-resolution ones. Inspired by these findings, we propose an algorithm for unpaired SR which learns an unbiased OT map for a perceptual transport cost. Unlike the existing GAN-based alternatives, our algorithm has a simple optimization objective, reducing the need for complex hyperparameter selection and additional regularizations. At the same time, it provides nearly state-of-the-art performance on the large-scale unpaired AIM19 dataset.

1. INTRODUCTION

Figure 1: Super-resolution of a squirrel using bicubic upsampling, OTS (ours) and DASR (Wei et al., 2021) (4×4 upsampling, 370×800 crops).

The problem of image super-resolution (SR) is to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. In many modern deep learning approaches, SR networks are trained in a supervised manner on synthetic datasets containing LR-HR pairs (Lim et al., 2017, §4.1); (Zhang et al., 2018b, §4.1). For example, it is common to create LR images from HR ones with a simple downscaling, e.g., bicubic (Ledig et al., 2017, §3.2). However, such an artificial setup barely represents the practical setting, in which the degradation is more sophisticated and unknown (Maeda, 2020). This obstacle suggests the necessity of developing methods capable of learning SR maps from unpaired data without assuming prescribed degradations.

Contributions. We study the unpaired image SR task and its solutions based on Generative Adversarial Networks (Goodfellow et al., 2014, GANs) and analyse them from the Optimal Transport (OT, see (Villani, 2008)) perspective.

1. We investigate the GAN optimization objectives regularized with content losses, which are common in unpaired image SR methods (§5, §4). We prove that the solution to such objectives is always an optimal transport map. We theoretically and empirically show that such maps are biased (§7.1), i.e., they do not transform the LR image distribution to the true HR image distribution.

2. We provide an algorithm to fit an unbiased OT map for a perceptual transport cost (§6.1) and apply it to the unpaired image SR problem (§7.2). We establish connections between our algorithm and regularized GANs using integral probability metrics (IPMs) as a loss (§6.2). Our algorithm solves a minimax optimization objective and does not require extensive hyperparameter search, which makes it different from the existing methods for unpaired image SR.
At the same time, the algorithm provides nearly state-of-the-art performance in the unpaired image SR problem (§7.2).

Notation. We use X, Y to denote Polish spaces and P(X), P(Y) to denote the respective sets of probability distributions on them. We denote by Π(P, Q) the set of probability distributions on X × Y with marginals P and Q. For a measurable map T : X → Y, we denote the associated push-forward operator by T#. The expression ∥·∥ denotes the usual Euclidean norm unless stated otherwise. We denote the space of Q-integrable functions on Y by L¹(Q).
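The push-forward operator T# used throughout the paper can be illustrated on samples: if x ~ P, then T(x) ~ T#P. A minimal sketch, assuming a 1-D Gaussian P and an affine map T of our own choosing (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# P: a 1-D Gaussian N(0, 1), accessible only through samples.
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

# A measurable map T(x) = 2x + 1; the push-forward T#P is then N(1, 4).
def T(x):
    return 2.0 * x + 1.0

y = T(x)  # samples from T#P

print(y.mean(), y.std())  # empirical mean close to 1, std close to 2
```

This is exactly the sense in which the networks below "transform" P into Q: one never manipulates densities, only samples.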

2. UNPAIRED IMAGE SUPER-RESOLUTION TASK

In this section, we formalize the unpaired image super-resolution task that we consider (Figure 2).

Figure 2: The task of super-resolution we consider.

Let P and Q be two distributions of LR and HR images on spaces X and Y, respectively. We assume that P is obtained from Q via some unknown degradation. The learner has access to unpaired random samples from P and Q. The task is to fit a map T : X → Y satisfying T#P = Q which inverts the degradation.

We highlight that the image SR task is theoretically ill-posed for two reasons.

1. Non-existence. The degradation filter may be non-injective and, consequently, non-invertible. This is a theoretical obstacle to learning one-to-one SR maps T.

2. Ambiguity. There might exist multiple maps satisfying T#P = Q but only one inverting the degradation. With no prior knowledge about the correspondence between P and Q, it is unclear how to pick this particular map.

The first issue is usually not taken into account in practice. Most existing paired and unpaired SR methods learn one-to-one SR maps T, see (Ledig et al., 2017; Lai et al., 2017; Wei et al., 2021). The second issue is typically softened by regularizing the model with a content loss. In the real world, it is reasonable to assume that HR images and the corresponding LR images are close. Thus, the fitted SR map T is expected to only slightly change the input image. Formally, one may require the learned map T to have a small value of

R_c(T) := ∫_X c(x, T(x)) dP(x),     (1)

where c : X × Y → R+ is a function estimating how different the inputs are. The most popular example is the ℓ1 identity loss, i.e., formulation (1) for X = Y = R^D and c(x, y) = ∥x − y∥₁. More broadly, losses R_c(T) are typically called content losses and are incorporated into the training objectives of methods for SR (Lugmayr et al., 2019a, §3.4), (Kim et al., 2020, §3) and other unpaired tasks besides SR (Taigman et al., 2016, §4), (Zhu et al., 2017, §5.2) as regularizers.
They stimulate the learned map T to minimally change the image content.
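The content loss (1) is just an expectation over P and is estimated by Monte Carlo in practice. A minimal sketch with the ℓ1 identity cost; the batch shape and toy maps are our own illustrative choices:

```python
import numpy as np

def content_loss(T, x_batch, c):
    """Monte-Carlo estimate of R_c(T) = E_{x~P}[ c(x, T(x)) ]."""
    return np.mean([c(x, T(x)) for x in x_batch])

# l1 identity cost c(x, y) = ||x - y||_1, for X = Y = R^D
l1 = lambda x, y: np.abs(x - y).sum()

rng = np.random.default_rng(0)
x_batch = rng.normal(size=(256, 16))   # 256 samples from P, D = 16

identity = lambda x: x
print(content_loss(identity, x_batch, l1))  # exactly 0 for the identity map
```

Any map that moves images far from their inputs pays a proportionally larger R_c, which is precisely why this term discourages content changes.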

3. BACKGROUND ON OPTIMAL TRANSPORT

In this section, we give the key concepts of OT theory (Villani, 2008) that we use in our paper.

Primal form. For two distributions P ∈ P(X) and Q ∈ P(Y) and a transport cost c : X × Y → R, Monge's primal formulation of the optimal transport cost is

Cost(P, Q) := inf_{T#P=Q} ∫_X c(x, T(x)) dP(x),     (2)

where the infimum is taken over measurable functions (transport maps) T : X → Y that map P to Q, see Figure 3a. The optimal T* is called the optimal transport map. Note that (2) is not symmetric, and this formulation does not allow mass splitting, i.e., for some P, Q there may be no map T that satisfies T#P = Q. Thus, (Kantorovitch, 1958) proposed the relaxation:

Cost(P, Q) := inf_{π∈Π(P,Q)} ∫_{X×Y} c(x, y) dπ(x, y),     (3)

where the infimum is taken over transport plans π, i.e., measures on X × Y whose marginals are P and Q (Figure 3b). The optimal π* ∈ Π(P, Q) is called the optimal transport plan. Under mild assumptions on the transport cost c(x, y) and the distributions P, Q, the minimizer π* of (3) always exists (Villani, 2008, Theorem 4.1) but might not be unique. If π* is of the form [id, T*]#P ∈ Π(P, Q) for some T*, then T* is an optimal transport map that minimizes (2).

Dual form. The dual form (Villani, 2003) of the OT cost (3) is

Cost(P, Q) = sup_f [ ∫_X f^c(x) dP(x) + ∫_Y f(y) dQ(y) ],     (4)

where the supremum is taken over all f ∈ L¹(Q), and f^c(x) = inf_{y∈Y} {c(x, y) − f(y)} is the c-transform of f.
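For empirical distributions with the same number of atoms and uniform weights, the Kantorovich problem (3) is solved by a permutation, so the OT cost can be found by enumerating matchings. A toy sketch (the point sets are ours; brute force is only feasible for tiny n):

```python
import itertools
import numpy as np

# Empirical P and Q: n atoms each with weight 1/n, so no mass splitting
# is needed and the optimal plan is induced by a permutation (a Monge map).
P = np.array([0.0, 2.0])
Q = np.array([1.0, 3.0])
n = len(P)

# Brute-force the optimal permutation for the cost c(x, y) = |x - y|.
best_cost, best_perm = min(
    (np.abs(P - Q[list(perm)]).mean(), perm)
    for perm in itertools.permutations(range(n))
)
print(best_perm, best_cost)  # (0, 1): the map 0 -> 1, 2 -> 3, with cost 1.0
```

For larger n, the same problem is a linear assignment solvable in polynomial time (e.g., the Hungarian algorithm); the neural solver of §6 replaces this discrete machinery for continuous, high-dimensional P and Q.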

4. RELATED WORK

UNPAIRED IMAGE SUPER-RESOLUTION. Existing approaches to unpaired image SR mainly solve the problem in two steps. One group of approaches first learns the degradation operation and then trains a super-resolution model in a supervised manner using generated pseudo-pairs, see (Bulat et al., 2018; Fritsche et al., 2019). Another group of approaches (Yuan et al., 2018; Maeda, 2020) first learns a mapping from real-world LR images to "clean" LR images, i.e., HR images downscaled using a predetermined (e.g., bicubic) operation, and then a mapping from "clean" LR to HR images. Most methods are based on CycleGAN (Zhu et al., 2017), initially designed for the domain transfer task, and utilize a cycle-consistency loss. The methods are also usually endowed with several other losses, e.g., content (Kim et al., 2020, §3), identity (Wang et al., 2021, §3.2) or perceptual (Lugmayr et al., 2019a, §3.4) losses.

OPTIMAL TRANSPORT IN GENERATIVE MODELS. The majority of existing OT-based generative models employ the OT cost as the loss function to update the generative network, e.g., see (Arjovsky et al., 2017). These methods are out of the scope of the present paper, since they do not compute OT maps. Existing methods to compute the OT map approach the primal (2), (3) or dual form (4). Primal-form methods (Lu et al., 2020; Xie et al., 2019; Bousquet et al., 2017; Balaji et al., 2020) optimize complex GAN objectives such as (5) and provide biased solutions (§5, §7.1). For a comprehensive overview of dual-form methods, we refer to (Korotin et al., 2021). The authors conduct an evaluation of OT methods for the quadratic cost c(x, y) = ∥x − y∥². According to them, the best-performing method is ⌊MM:R⌋. It is based on the variational reformulation of (4), which is a particular case of our formulation (12). Extensions of ⌊MM:R⌋ appear in (Rout et al., 2022; Fan et al., 2021).

5. BIASED OPTIMAL TRANSPORT IN GANS

Figure 4: Illustration of Lemma 1. The solution T_λ of (5) is an OT map from P to T_λ#P. In general, T_λ#P ≠ Q (Theorem 1).

In this section, we establish connections between GAN methods regularized by content losses (1) and OT. Such GANs are popular in a variety of tasks besides SR, e.g., style transfer (Huang et al., 2018). The theoretical analysis in this section holds for these tasks as well. However, since we empirically demonstrate our findings on the practically important SR problem, we keep the corresponding notation throughout §5.

A common approach to solving unpaired SR via GANs is to define a loss function D : P(Y) × P(Y) → R+ and train a generative neural network T by minimizing

inf_{T:X→Y} [ D(T#P, Q) + λ·R_c(T) ].     (5)

The term D(T#P, Q) ensures that the generated distribution T#P of SR images is close to the true HR distribution Q; the second term R_c(T) is the content loss (1). For convenience, we assume that D(Q, Q) = 0 for all Q ∈ P(Y). The two most popular examples of D are the Jensen-Shannon divergence (Goodfellow et al., 2014), i.e., the vanilla GAN loss, and the Wasserstein-1 loss (Arjovsky & Bottou, 2017).

In unpaired SR methods, the optimization objectives are typically more complex than (5). In addition to the content or identity loss (1), several other regularizations are usually introduced, see §4. In Appendix F, we show that the learning objectives of popular SR methods can be represented as (5). For the theoretical analysis, we stick to the basic formulation regularized with a generic content loss (5), which represents the simplest and most straightforward SR setup. We prove the following lemma, which connects the solution T_λ of (5) with optimal maps for the transport cost c(x, y).

Lemma 1 (The solution of the regularized GAN is an OT map). Assume that λ > 0 and the minimizer T_λ of (5) exists. Then T_λ is an OT map between P and Q_λ := T_λ#P for the cost c(x, y), i.e., it minimizes

inf_{T#P=Q_λ} R_c(T) = inf_{T#P=Q_λ} ∫_X c(x, T(x)) dP(x).
Proof. Assume that T_λ is not an optimal map between P and T_λ#P. Then there exists a better map T† satisfying T†#P = T_λ#P and R_c(T†) < R_c(T_λ). Substituting this T† into (5), we derive

D(T†#P, Q) + λ·R_c(T†) = D(T_λ#P, Q) + λ·R_c(T†) < D(T_λ#P, Q) + λ·R_c(T_λ),

which is a contradiction: T_λ is a minimizer of (5), but T† provides a smaller value.

Our Lemma 1 states that the minimizer T_λ of a regularized GAN problem is always an OT map between P and the distribution Q_λ generated from P by the same T_λ. However, below we prove that Q_λ ≠ Q, i.e., T_λ does not actually produce the distribution of HR images (Figure 4). To begin with, we state and prove the following auxiliary result.

Lemma 2 (Reformulation of the regularized GAN via distributions). Under the assumptions of Lemma 1, let X = Y be a compact subset of R^D with negligible boundary. Let P ∈ P(X) be absolutely continuous, Q ∈ P(Y) and c(x, y) = ∥x − y∥^p with p > 1. Then (5) is equivalent to

inf_{Q′∈P(Y)} F(Q′) := inf_{Q′∈P(Y)} [ D(Q′, Q) + λ·Cost(P, Q′) ],     (6)

and the solutions of (5) and (6) are related as Q_λ = T_λ#P, where Q_λ is the minimizer of (6).

Proof. We derive

inf_{T:X→Y} [ D(T#P, Q) + λ·R_c(T) ] = inf_{T:X→Y} [ D(T#P, Q) + λ ∫_X c(x, T(x)) dP(x) ]     (7)
= inf_{T:X→Y} [ D(T#P, Q) + λ·Cost(P, T#P) ] = inf_{Q′∈P(Y)} [ D(Q′, Q) + λ·Cost(P, Q′) ].     (8)

In the transition from (7) to (8), we use the definition of the OT cost (2) and our Lemma 1, which states that the minimizer T_λ of (5) is an OT map, i.e., ∫_X c(x, T_λ(x)) dP(x) = Cost(P, T_λ#P). The equality in (8) follows from the fact that P is absolutely continuous and c(x, y) = ∥x − y∥^p: for every Q′ ∈ P(Y), there exists a (unique) solution T to the Monge OT problem (2) for P, Q′ (Santambrogio, 2015, Theorem 1.17).

In the following theorem, we prove that, in general, Q_λ ≠ Q for the minimizer Q_λ of (6).

Theorem 1 (The distribution solving the regularized GAN problem is always biased).
Under the assumptions of Lemma 2, assume that the first variation (Santambrogio, 2015, Definition 7.12) of the functional Q′ ↦ D(Q′, Q) at the point Q′ = Q exists and is equal to zero. This means that D(Q + ϵ∆Q, Q) = D(Q, Q) + o(ϵ) for every signed measure ∆Q of zero total mass and every ϵ ≥ 0 such that Q + ϵ∆Q ∈ P(Y). Then, if P ≠ Q, the point Q′ = Q does not deliver the minimum to F.

Before proving Theorem 1, we highlight that the assumption about the vanishing first variation of Q′ ↦ D(Q′, Q) at Q′ = Q is reasonable. In Appendix A, we prove that this assumption holds for the popular GAN discrepancies D(Q′, Q), e.g., f-divergences (Nowozin et al., 2016), Wasserstein distances (Arjovsky et al., 2017), and Maximum Mean Discrepancies (Li et al., 2017).

Proof. Let ∆Q = P − Q denote the difference measure of P and Q. It has zero total mass, and for all ϵ ∈ [0, 1] it holds that Q + ϵ∆Q = ϵP + (1 − ϵ)Q is a mixture of the probability distributions P and Q. As a result, for all ϵ ∈ [0, 1], we have

F(Q + ϵ∆Q) = D(Q + ϵ∆Q, Q) + λ·Cost(P, Q + ϵ∆Q)
= D(Q, Q) + o(ϵ) + λ·Cost(P, ϵP + (1 − ϵ)Q)     (9)
≤ o(ϵ) + λ·ϵ·Cost(P, P) + λ·(1 − ϵ)·Cost(P, Q)
= o(ϵ) + λ·(1 − ϵ)·Cost(P, Q)     (10)
= λ·Cost(P, Q) [= F(Q)] − λ·ϵ·Cost(P, Q) [> 0] + o(ϵ),

where in the transition from (9) to (10) we use D(Q, Q) = 0 and exploit the convexity of the OT cost (Villani, 2003, Theorem 4.8); in (10), we also use Cost(P, P) = 0. We see that F(Q + ϵ∆Q) is smaller than F(Q) for sufficiently small ϵ > 0, i.e., Q′ = Q does not minimize F.

Corollary 1. Under the assumptions of Theorem 1, the solution T_λ of the regularized GAN (5) is biased, i.e., it does not satisfy T_λ#P = Q and hence does not transform LR images to true HR ones.

Additionally, we provide a toy example that further illustrates the bias issue.

Example 1. Consider X = Y = R. Let P = ½δ₀ + ½δ₂ and Q = ½δ₁ + ½δ₃ be distributions concentrated at {0, 2} and {1, 3}, respectively. Put c(x, y) = |x − y| to be the content loss.
Also, let D be the OT cost for the cost |x − y|². Then for λ = 0 there exist two maps between P and Q that deliver the same minimal value in (5), namely T(0) = 1, T(2) = 3 and T(0) = 3, T(2) = 1. For λ > 0, the optimal solution of problem (5) is unique, biased and given by T(0) = 1 − λ/2, T(2) = 3 − λ/2.

Proof. Let T(0) = t₀ and T(2) = t₂. Then T#P = ½δ_{t₀} + ½δ_{t₂}, and (5) becomes

min_{t₀,t₂} [ min{ ½(t₀ − 1)² + ½(t₂ − 3)²; ½(t₀ − 3)² + ½(t₂ − 1)² } + λ( ½|0 − t₀| + ½|2 − t₂| ) ],

where the second term is R_c(T) and the first term is the OT cost D(T#P, Q), expressed as the minimum over the transport costs of the two possible transport maps t₀ ↦ 1, t₂ ↦ 3 and t₀ ↦ 3, t₂ ↦ 1. The minimizer can be derived analytically and equals t₀ = 1 − λ/2, t₂ = 3 − λ/2.

In Example 1, T_λ#P = Q_λ never matches Q exactly for λ > 0. In §7.1, we conduct an evaluation of maps obtained by minimizing objective (5) on the synthetic benchmark by (Korotin et al., 2021). We empirically demonstrate that the bias exists and is indeed a notable practical issue.

Remarks. Throughout this section, we enforce additional assumptions on (5); e.g., we restrict our analysis to content losses c(·, ·) which are powers of Euclidean norms ∥·∥^p. This is needed to keep the derivations concise and to exploit available results in OT. We believe that the provided results hold under more general assumptions and leave this question open for future studies.
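The closed-form minimizer of Example 1 can be sanity-checked numerically. A grid-search sketch (the grid resolution and the choice λ = 0.5 are ours):

```python
import numpy as np

lam = 0.5  # a small λ > 0; the claimed minimizer is (1 - λ/2, 3 - λ/2)

t0, t2 = np.meshgrid(np.linspace(-1, 4, 501), np.linspace(-1, 4, 501), indexing="ij")

# D(T#P, Q): OT cost for |x - y|^2, the minimum over the two possible matchings
d = np.minimum(0.5 * (t0 - 1) ** 2 + 0.5 * (t2 - 3) ** 2,
               0.5 * (t0 - 3) ** 2 + 0.5 * (t2 - 1) ** 2)
# content loss R_c(T) for c(x, y) = |x - y|
r = 0.5 * np.abs(t0) + 0.5 * np.abs(2 - t2)

obj = d + lam * r
i, j = np.unravel_index(np.argmin(obj), obj.shape)
print(t0[i, j], t2[i, j])  # 0.75 and 2.75, i.e., 1 - λ/2 and 3 - λ/2
```

The grid minimizer lands exactly on the analytic solution: the content term pulls both atoms of T#P toward the atoms of P, so the generated distribution is shifted away from Q by λ/2.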

6. UNBIASED OPTIMAL TRANSPORT SOLVER

In §6.1, we derive our algorithm to compute OT maps. Importantly, in §6.2, we detail its differences from and similarities to the regularized GANs discussed in §5. Our algorithm is suitable for general costs and generalizes the OT algorithm for the quadratic cost by (Rout et al., 2022).

6.1. MINIMAX OPTIMIZATION ALGORITHM

We derive a minimax optimization problem to recover the optimal transport map from P to Q by expanding the dual form (4). To do this, we first note that

∫_X f^c(x) dP(x) = ∫_X inf_{y∈Y} {c(x, y) − f(y)} dP(x) = inf_{T:X→Y} ∫_X [ c(x, T(x)) − f(T(x)) ] dP(x).     (11)

Here we replace the optimization over points y ∈ Y with an equivalent optimization over functions T : X → Y. This is possible due to the Rockafellar interchange theorem (Rockafellar, 1976, Theorem 3A). Substituting (11) into (4), we have

Cost(P, Q) = sup_f inf_{T:X→Y} [ ∫_Y f(y) dQ(y) + ∫_X [ c(x, T(x)) − f(T(x)) ] dP(x) ].     (12)

We denote the expression under the sup inf by L(f, T). Now we show that by solving the saddle point problem (12) one can obtain the OT map T*.

Lemma 3 (OT maps solve the saddle point problem). Assume that the OT map T* between P and Q for the cost c(x, y) exists. Then, for every optimal potential f* ∈ arg sup_f inf_{T:X→Y} L(f, T) of (12),

T* ∈ arg inf_{T:X→Y} ∫_X [ c(x, T(x)) − f*(T(x)) ] dP(x).     (13)

Proof. Since f* is optimal, we have inf_{T:X→Y} L(f*, T) = Cost(P, Q). We use T*#P = Q and the change of variables y = T*(x) to derive ∫_X f*(T*(x)) dP(x) = ∫_Y f*(y) dQ(y). Substituting this equality into (12), we obtain L(f*, T*) = ∫_X c(x, T*(x)) dP(x) = Cost(P, Q), i.e., (13) holds.

Our Lemma 3 states that one can solve the saddle point problem (12) and extract an OT map T* between P and Q from the optimal pair (f*, T*). An analogous result, but only for the (Q-embedded) quadratic cost, can be found in (Rout et al., 2022). For general P, Q, the arg inf_T set for an optimal f* might contain not only the OT map T* but other functions as well. However, our experiments (§7) show that this is not a serious issue in practice.

Algorithm 1: OT solver to compute the OT map between P and Q for the transport cost c(x, y).
Input: distributions P, Q accessible by samples; mapping network T_θ : X → Y; potential f_ω : Y → R; transport cost c : X × Y → R; number K_T of inner iterations.
Output: approximate OT map (T_θ)#P ≈ Q.
repeat
    Sample batches X ∼ P, Y ∼ Q;
    L_f ← (1/|Y|) Σ_{y∈Y} f_ω(y) − (1/|X|) Σ_{x∈X} f_ω(T_θ(x));
    Update ω using ∂L_f/∂ω to maximize L_f;
    for k_T = 1, 2, ..., K_T do
        Sample batch X ∼ P;
        L_T ← (1/|X|) Σ_{x∈X} [ c(x, T_θ(x)) − f_ω(T_θ(x)) ];
        Update θ using ∂L_T/∂θ to minimize L_T;
until not converged;

To solve the optimization problem (12), we approximate the potential f and the map T with neural networks f_ω and T_θ, respectively. We train the networks with stochastic gradient ascent-descent using random batches from P, Q. The practical optimization procedure is detailed in Algorithm 1. We call this procedure the Optimal Transport Solver (OTS).
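Algorithm 1 can be sketched end-to-end on toy 2-D Gaussians. This is a minimal illustration, not the paper's setup: the architectures, optimizer, learning rates, batch size and iteration counts below are our own assumptions (the paper uses ResNet/UNet/EDSR networks on images):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_P(n):          # toy "LR" distribution: N(0, I) on R^2
    return torch.randn(n, 2)

def sample_Q(n):          # toy "HR" distribution: N((3, 3), I)
    return torch.randn(n, 2) + 3.0

def cost(x, y):           # normalized quadratic cost c(x, y) = ||x - y||^2 / dim
    return ((x - y) ** 2).mean(dim=1)

T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))   # map T_theta
f = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # potential f_omega
opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)

K_T, batch = 10, 256
for step in range(300):
    # ascent step over f: maximize L_f = E_Q[f(y)] - E_P[f(T(x))]
    x, y = sample_P(batch), sample_Q(batch)
    L_f = f(y).mean() - f(T(x)).mean()
    opt_f.zero_grad(); (-L_f).backward(); opt_f.step()
    # K_T descent steps over T: minimize L_T = E_P[c(x, T(x)) - f(T(x))]
    for _ in range(K_T):
        x = sample_P(batch)
        L_T = (cost(x, T(x)) - f(T(x)).squeeze(1)).mean()
        opt_T.zero_grad(); L_T.backward(); opt_T.step()

with torch.no_grad():
    print(T(sample_P(4096)).mean(dim=0))  # drifts toward (3, 3) as training converges
```

Note the structure mirrors (12): f is updated to increase L(f, T), while T solves the inner minimization for the current f, with K_T inner steps per outer step.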

6.2. REGULARIZED GANS VS. OPTIMAL TRANSPORT SOLVER

In this subsection, we discuss similarities and differences between our optimization objective (12) and the objective of regularized GANs (5). We establish an intriguing connection with GANs that use integral probability metrics (IPMs) as D. A discrepancy D : P(Y) × P(Y) → R+ is an IPM if

D(Q₁, Q₂) = sup_{f∈F} [ ∫_Y f(y) dQ₂(y) − ∫_Y f(y) dQ₁(y) ],     (14)

where the maximization is performed over a certain class F of functions (discriminators) f : Y → R. The most popular example of such a D is the Wasserstein-1 loss (Arjovsky & Bottou, 2017), for which F is the class of 1-Lipschitz functions. For other IPMs, see (Mroueh et al., 2017, Table 1). Substituting (14) into (5) yields the saddle-point optimization problem for the regularized IPM GAN:

inf_{T:X→Y} sup_{f∈F} [ ∫_Y f(y) dQ(y) − ∫_X f(T(x)) dP(x) + λ ∫_X c(x, T(x)) dP(x) ]
= inf_{T:X→Y} sup_{f∈F} [ ∫_Y f(y) dQ(y) + ∫_X [ λ·c(x, T(x)) − f(T(x)) ] dP(x) ].     (15)

We emphasize that the expression inside (15) for λ = 1 is similar to the expression in the OTS optimization (12). Below we highlight the key differences between (12) and (15). First, in OTS the map T is a solution to the inner optimization problem, while in the IPM GAN the generator T is a solution to the outer problem. Swapping inf_T and sup_f is prohibited and, in general, yields a different problem, e.g., 1 = inf_x sup_y cos(x + y) ≠ sup_y inf_x cos(x + y) = −1. Second, in OTS the optimization over the potential f is unconstrained, while in the IPM GAN f must belong to F, a certain restricted class of functions. For example, when D is the Wasserstein-1 (W₁) IPM, one has to use an additional penalization, e.g., the gradient penalty (Gulrajani et al., 2017). This further complicates the optimization and adds hyperparameters which have to be carefully selected. Third, the optimization of the IPM GAN requires selecting a parameter λ that balances the content loss R_c and the discrepancy D. In OTS, for all costs λ·c(x, y) with λ > 0, the OT map T* is the same.
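The cos(x + y) example of the inf/sup swap can be checked directly on a grid (the grid resolution is our choice):

```python
import numpy as np

xs = np.linspace(0.0, 2.0 * np.pi, 721)   # grids over one period
ys = np.linspace(0.0, 2.0 * np.pi, 721)
F = np.cos(xs[:, None] + ys[None, :])     # F[i, j] = cos(x_i + y_j)

inf_sup = F.max(axis=1).min()   # inf_x sup_y cos(x + y): sup over y first, then inf over x
sup_inf = F.min(axis=0).max()   # sup_y inf_x cos(x + y): inf over x first, then sup over y

print(inf_sup, sup_inf)  # ≈ 1.0 and ≈ -1.0: the order of inf and sup matters
```

For every x, the inner sup over y can always reach cos = 1, while for every y, the inner inf over x can always reach cos = −1, which is why the two orderings give opposite values.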
To conclude, even for λ = 1, the IPM GAN problem does not match that of OTS. Table 1 summarizes the differences and similarities between OTS and regularized IPM GANs.

Table 1: Optimal Transport Solver (ours) vs. regularized IPM GAN.
- Minimax optimization objective. OTS: sup_f inf_{T:X→Y} [ ∫_Y f(y) dQ(y) + ∫_X [c(x, T(x)) − f(T(x))] dP(x) ]. IPM GAN: inf_{T:X→Y} sup_{f∈F} [ ∫_Y f(y) dQ(y) + ∫_X [λ·c(x, T(x)) − f(T(x))] dP(x) ].
- Transport map T (generator). OTS: T* solves the inner problem (for the optimal f*); it is an OT map from P to Q (Lemma 3). IPM GAN: T* solves the outer problem; it is a biased OT map (§5, §7.1).
- Potential f (discriminator). OTS: unconstrained, f ∈ L¹(Q). IPM GAN: constrained, f ∈ F ⊂ L¹(Q).

7. EVALUATION

In §7.1, we assess the bias of regularized IPM GANs using the Wasserstein-2 benchmark (Korotin et al., 2021). In §7.2, we evaluate our method on the large-scale unpaired AIM-19 dataset from (Lugmayr et al., 2019b). In Appendix D, we test it on the CelebA dataset (Liu et al., 2015). The code is written in PyTorch. We list the hyperparameters of Algorithm 1 in Table 4 of Appendix C.

Neural network architectures. We use WGAN-QC's (Liu et al., 2019) ResNet (He et al., 2016) architecture for the potential f_ω. In §7.1, where input and output images have the same size, we use UNet (Ronneberger et al., 2015) as the transport map T_θ. In §7.2, the LR input images are 4 × 4 times smaller than the HR ones, so we use the EDSR network (Lim et al., 2017).

Transport costs. In §7.1, we use the mean squared error (MSE), i.e., c(x, y) = ∥x − y∥² / dim(Y). It is equivalent to the quadratic cost but is more convenient due to the normalization. In §7.2, we consider c(x, y) = b(Up(x), y), where b is a cost between the bicubically upsampled LR image x_up = Up(x) and the HR image y. We test b defined as MSE and as a perceptual cost using the features of a pre-trained VGG-16 network (Simonyan & Zisserman, 2014); see Appendix C for details.
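The two cost constructions above can be sketched in a few lines. The normalized MSE is as defined in the text; the composed cost c(x, y) = b(Up(x), y) is shown with a nearest-neighbour 4× upsampling standing in for the bicubic "Up" (our simplification; the VGG-based perceptual b is omitted):

```python
import numpy as np

def mse_cost(x, y):
    """c(x, y) = ||x - y||^2 / dim(Y), the normalized quadratic cost."""
    return ((x - y) ** 2).mean()

def composed_cost(x_lr, y_hr, up, base_cost):
    """c(x, y) = b(Up(x), y): compare the upsampled LR image with the HR one."""
    return base_cost(up(x_lr), y_hr)

# toy 'Up': nearest-neighbour 4x upsampling stands in for bicubic here
up4 = lambda img: img.repeat(4, axis=0).repeat(4, axis=1)

x_lr = np.ones((8, 8))
y_hr = np.ones((32, 32))
print(composed_cost(x_lr, y_hr, up4, mse_cost))  # 0.0 for identical content
```

In practice, any differentiable upsampling and any differentiable base cost b (MSE, VGG features) can be plugged into `composed_cost` without changing the solver.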

7.1. ASSESSING THE BIAS IN REGULARIZED GANS

In this section, we empirically confirm the insight of §5 that the solution T_λ of (5) may not satisfy T_λ#P = Q. Note that if T_λ#P = Q, then by our Lemma 1 we conclude that T_λ ≡ T*, where T* is an OT map from P to Q for c(x, y). Thus, to assess the bias, it is reasonable to compare the learned map T_λ with the ground truth OT map T* for P, Q.

For evaluation, we use the Wasserstein-2 benchmark (Korotin et al., 2021). It provides high-dimensional continuous pairs P, Q with an analytically known OT map T* for the quadratic cost c(x, y) = ∥x − y∥². We use their "Early" images benchmark pair. It simulates the image deblurring setup, i.e., X = Y is the space of 64 × 64 RGB images, P is blurry faces, and Q is clean faces satisfying Q = T*#P, where T* is an analytically known OT map, see the 1st and 2nd rows in Figure 5.

To quantify the learned maps from P to Q, we use the PSNR, SSIM, LPIPS (Zhang et al., 2018a) and FID (Heusel et al., 2017) metrics. Similarly to (Wei et al., 2021), we use the AlexNet-based (Krizhevsky et al., 2012) LPIPS. FID and LPIPS are practically the most important, since they better correlate with the human perception of image quality. We include PSNR and SSIM as popular evaluation metrics, but they are known to measure perceptual quality poorly (Zhang et al., 2018a; Nilsson & Akenine-Möller, 2020). Due to this, higher PSNR, SSIM values do not necessarily mean better performance. We calculate SSIM using scikit-image and use open-source implementations for PSNR, LPIPS and FID. In this section, we additionally use the L²-UVP (Korotin et al., 2021, §4.2) metric.

On the benchmark, we compare OTS (12) and the IPM GAN (5). We use MSE as the content loss c(x, y). In the IPM GAN, we use the Wasserstein-1 (W₁) loss with the gradient penalty λ_GP = 10 (Gulrajani et al., 2017) as D. We do 10 discriminator updates per generator update and train the model for 15K generator updates.
For a fair comparison, the remaining hyperparameters match those of our algorithm.
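PSNR, used throughout the evaluation, follows directly from its definition, 10·log₁₀(MAX²/MSE). A self-contained sketch (the [0, 1] data range and the noise level are our illustrative assumptions):

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE), in dB."""
    mse = ((x - y) ** 2).mean()
    if mse == 0:
        return np.inf     # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.random((64, 64))
noisy = np.clip(clean + rng.normal(scale=0.05, size=clean.shape), 0, 1)
print(psnr(clean, noisy))  # roughly 26 dB for sigma = 0.05 Gaussian noise
```

Since PSNR is a monotone function of pixel-wise MSE, it rewards blurry averages over sharp but slightly shifted details, which is exactly why it correlates poorly with perceptual quality.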

Table 2: Quantitative evaluation of the restoration maps fitted by the regularized IPM GAN (WGAN-GP, λ_GP = 10; columns λ = 0, 10⁻¹, 10⁰, 10¹, 10², 10³, 10⁴, 10⁵) and by OTS (ours) on the Wasserstein-2 images benchmark (Korotin et al., 2021).

We train the regularized WGAN-GP with various coefficients of the content loss λ ∈ {0, 10⁻¹, ..., 10⁵} and show the learned maps T_λ and the map T obtained by OTS in Figure 5.

Results. The performance of the regularized IPM GAN depends significantly on the choice of the content loss coefficient λ. For high values λ ≥ 10³, the learned map is close to the identity, as expected. For small values λ ≤ 10¹, the regularization has little effect, and WGAN-GP alone struggles to fit a good restoration map. Even for the best-performing λ = 10², all metrics are notably worse than for OTS. Importantly, OTS decreases the burden of parameter searching, as there is no parameter λ.

7.2. LARGE-SCALE EVALUATION

For evaluating our method at a large scale, we employ the dataset of the AIM 2019 Real-World Super-Resolution Challenge, Track 2 (Lugmayr et al., 2019b). The train part contains 800 HR images with up to 2040 pixels in width or height and 2650 unpaired LR images of the same shape. They are constructed using artificial, but realistic, image degradations. We quantitatively evaluate our method on the validation part of the AIM dataset, which contains 100 pairs of LR-HR images.

Baselines. We compare OTS on the AIM dataset with bicubic upsampling and the FSSR (Fritsche et al., 2019) and DASR (Wei et al., 2021) methods. FSSR is the winner of the AIM 2019 Challenge; DASR is a current state-of-the-art method for unpaired image SR. Both methods utilize the idea of frequency separation and solve the problem in two steps. First, they train a network to generate LR images. Next, they train a super-resolution network using the generated pseudo-pairs. Unlike FSSR, DASR also employs real-world LR images for training the SR network, taking into consideration the domain gap between generated and real-world LR images. Both methods utilize several losses, e.g., adversarial and perceptual, either on the entire image or on its high/low-frequency components. For testing FSSR and DASR, we use their official code and pretrained models.

Implementation details. We train the networks using 128×128 HR and 32×32 LR random patches of images augmented via random flips and rotations. We conduct separate experiments using EDSR as the transport map and either the MSE or the perceptual cost, and denote them OTS (MSE) and OTS (VGG), respectively.

Metrics. We calculate PSNR, SSIM, LPIPS and FID. FID is computed on 32×32 patches of LR test images upsampled by the method in view w.r.t. random patches of test HR images; we use 50K patches to compute FID. The other metrics are computed on the entire upsampled LR test images and the HR test images. Experimental results are given in Table 3 and Figure 6.
The results show that the usage of the perceptual cost function in OTS boosts performance. According to FID, OTS with the perceptual cost beats DASR. At the same time, it outperforms FSSR in PSNR, SSIM and, importantly, LPIPS. Note that bicubic upsampling outperforms all the methods only according to PSNR and SSIM, which have the issues stated in §7.1. According to the visual analysis, OTS with the perceptual cost deals better with noise artifacts. Additional results are given in Appendix E. We also demonstrate the bias issue of FSSR and DASR in Appendix B.

8. DISCUSSION

Significance. Our analysis connects content losses in GANs with OT and reveals the bias issue. Content losses are used in a wide range of tasks besides SR, e.g., in style transfer and domain adaptation. Our results demonstrate that GAN-based methods in all these tasks may a priori lead to biased solutions. In certain cases this is undesirable, e.g., in medical applications (Bissoto et al., 2021): failing to learn the true data statistics (and learning biased ones instead), e.g., in the super-resolution of MRI images, might lead to a wrong diagnosis made by a doctor due to the SR algorithm drawing nonexistent details on the scan. Thus, we think it is essential to emphasize and alleviate the bias issue, and we provide a way to circumvent this difficulty.

Potential Impact. We expect our OT approach to improve the existing applications of image super-resolution. Importantly, it has fewer hyperparameters, uses a smaller number of neural networks than many existing methods (see Table 5 in Appendix C for a comparison), and is end-to-end; this should simplify its usage in practice. Besides, our method is generic and can presumably be applied to other unpaired learning tasks as well. Studying such applications is a promising avenue for future work.

Limitations. Our method fits a one-to-one optimal mapping (transport map) for super-resolution which, in general, might not exist. Besides, not all optimal solutions of our optimization objective are guaranteed to be OT maps. Moreover, our method requires solving a saddle point problem; thus, it might encounter training issues similar to those of GAN-based approaches. These limitations suggest the need for further theoretical analysis and improvement of our method for optimal transport.

A FIRST VARIATIONS OF GAN DISCREPANCIES VANISH AT THE OPTIMUM

We demonstrate that the first variation of Q′ ↦ D(Q′, Q) is equal to zero at Q′ = Q for common GAN discrepancies D. This suggests that the corresponding assumption of our Theorem 1 is relevant.

To begin with, for a functional G : P(Y) → R ∪ {∞}, we recall the definition of its first variation. A measurable function δG[Q] : Y → R ∪ {∞} is called the first variation of G at a point Q ∈ P(Y) if, for every measure ∆Q on Y with zero total mass (∫_Y 1 d∆Q(y) = 0),

G(Q + ϵ∆Q) = G(Q) + ϵ ∫_Y δG[Q](y) d∆Q(y) + o(ϵ)

for all ϵ ≥ 0 such that Q + ϵ∆Q is a probability distribution. Here, for the sake of simplicity, we suppress several minor technical aspects, see (Santambrogio, 2015, Definition 7.12) for details. Note that the first variation is defined up to an additive constant.

Now we recall the definitions of the three most popular GAN discrepancies and demonstrate that their first variations are zero at the optimal point. We consider f-divergences (Nowozin et al., 2016), Wasserstein distances (Arjovsky et al., 2017), and Maximum Mean Discrepancies (Li et al., 2017).

Case 1 (f-divergence). Let f : R+ → R be a convex and differentiable function satisfying f(1) = 0. The f-divergence between Q′, Q ∈ P(Y) is defined by

D_f(Q′, Q) := ∫_Y f( dQ′(y)/dQ(y) ) dQ(y).

The divergence takes a finite value only if Q′ ≪ Q, i.e., Q′ is absolutely continuous w.r.t. Q. The vanilla GAN loss (Goodfellow et al., 2014) is a case of an f-divergence (Nowozin et al., 2016, Table 1). We define G(Q′) := D_f(Q′, Q). For Q′ = Q and any ∆Q such that Q + ϵ∆Q ∈ P(Y), we derive

G(Q + ϵ∆Q) = ∫_Y f( (dQ(y) + ϵ d∆Q(y)) / dQ(y) ) dQ(y) = ∫_Y f( 1 + ϵ d∆Q(y)/dQ(y) ) dQ(y)     (18)
= ∫_Y f(1) dQ(y) + ϵ ∫_Y f′(1) (d∆Q(y)/dQ(y)) dQ(y) + o(ϵ) = G(Q) + ϵ ∫_Y f′(1) d∆Q(y) + o(ϵ),     (19)

where in the transition from (18) to (19) we use the Taylor expansion w.r.t. ϵ at ϵ = 0. We see that δG[Q](y) ≡ f′(1) is constant, i.e., the first variation of Q′ ↦ D_f(Q′, Q) vanishes at Q′ = Q.

Case 2 (Wasserstein distance).
If in the OT formulation (3) the cost function $c(x, y)$ equals $\|x - y\|^p$ with $p \geq 1$, then $\text{Cost}(P, Q)^{1/p}$ is called the Wasserstein distance ($\mathbb{W}_p$). Generative models which use $\mathbb{W}_p^p$ as the discrepancy are typically called Wasserstein GANs (WGANs). The most popular case is $p = 1$ (Arjovsky et al., 2017; Gulrajani et al., 2017), but more general cases appear in related work as well, see (Liu et al., 2019; Mallasto et al., 2019).
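As a numerical aside (a toy illustration on discrete distributions, not part of the paper's experiments), the vanishing first variations can be sanity-checked directly: under a zero-mass perturbation $\epsilon \Delta Q$, an $f$-divergence such as KL and the squared MMD (Case 3 below) decay as $O(\epsilon^2)$ rather than $O(\epsilon)$, while $\mathbb{W}_1$ between a distribution and itself is exactly zero. All distributions and kernel parameters below are made up for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([0.0, 1.0, 2.0])
q = np.array([0.2, 0.3, 0.5])            # base distribution Q on `support`
dq = np.array([0.1, -0.05, -0.05])       # perturbation with zero total mass

def kl(q_prime, q):
    """f-divergence with f(t) = t*log(t), i.e., KL(Q' || Q)."""
    r = q_prime / q
    return float(np.sum(r * np.log(r) * q))

def mmd2(q_prime, q, gamma=1.0):
    """Squared MMD with an RBF kernel on the discrete support."""
    K = np.exp(-gamma * (support[:, None] - support[None, :]) ** 2)
    d = q_prime - q
    return float(d @ K @ d)

for eps in [1e-1, 1e-2, 1e-3]:
    # Both ratios stay roughly constant as eps shrinks:
    # the first-order term in eps vanishes at Q' = Q.
    print(kl(q + eps * dq, q) / eps**2, mmd2(q + eps * dq, q) / eps**2)

# W_1 between identical empirical distributions is exactly zero.
samples = np.repeat(support, (q * 100).astype(int))
print(wasserstein_distance(samples, samples))   # 0.0
```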

The first variation of $\mathcal{G}(Q') \stackrel{\text{def}}{=} \mathbb{W}_p^p(Q', Q)$ at a point $Q'$ is given by $\delta\mathcal{G}[Q'](y) = (f^*)^c(y)$, where $f^*$ is the optimal dual potential (provided it is unique up to a constant) in (4) for the pair $(Q', Q)$, see (Santambrogio, 2015, 7.2). Our particular interest is to compute the optimal potential $(f^*)^c$ at $Q' = Q$. We recall (4) and use $\mathbb{W}_p^p(Q, Q) = 0$ to derive

$\mathbb{W}_p^p(Q, Q) = 0 = \sup_f \left[ \int_{\mathcal{Y}} f^c(y')\, dQ(y') + \int_{\mathcal{Y}} f(y)\, dQ(y) \right].$

One may see that $f^* \equiv 0$ attains the supremum (its $c$-transform $(f^*)^c$ is also zero). Thus, if $(f^*)^c \equiv 0$ is the unique potential (up to a constant), the first variation of $Q' \mapsto \mathbb{W}_p^p(Q', Q)$ at $Q' = Q$ vanishes.

Case 3 (Maximum Mean Discrepancy). Let $k: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a positive definite symmetric kernel. The square of the Maximum Mean Discrepancy between $Q', Q$ is given by

$\text{MMD}_k^2(Q', Q) \stackrel{\text{def}}{=} \int_{\mathcal{Y} \times \mathcal{Y}} k(y_1, y_2)\, d\big[(Q - Q') \times (Q - Q')\big](y_1, y_2)$
$= \int_{\mathcal{Y} \times \mathcal{Y}} k(y_1, y_2)\, d(Q' \times Q')(y_1, y_2) - 2 \int_{\mathcal{Y} \times \mathcal{Y}} k(y_1, y_2)\, d(Q' \times Q)(y_1, y_2) + \text{Const}(Q), \quad (20)$

see (Sejdinovic et al., 2013, Equation 3.3). The first variation of the term quadratic in $Q'$ is given by $y \mapsto 2 \int_{\mathcal{Y}} k(y, y_2)\, dQ'(y_2)$, see (Santambrogio, 2015, 7.2). The second term is linear in $Q'$, and its first variation is simply $y \mapsto -2 \int_{\mathcal{Y}} k(y, y_2)\, dQ(y_2)$. When $Q' = Q$, the sum of these terms is zero. That is, the first variation of the functional $Q' \mapsto \text{MMD}_k^2(Q', Q)$ vanishes at $Q' = Q$.

B ASSESSING THE BIAS OF METHODS ON AIM19 DATASET

Perceptual transport cost. In the formula for our perceptual transport cost, $f_k$ denotes the features of the $k$-th layer of a pre-trained VGG-16 network (Simonyan & Zisserman, 2014), and MAE is the mean absolute error, $\text{MAE}(x, y) = \frac{\|x - y\|_1}{\dim(\mathcal{Y})}$.

Dynamic transport cost. In the preliminary experiments, we used bicubic upsampling as the "Up" operation. Later, we found that the method works better if we gradually change the upsampling. We start from the bicubic upsampling.
Every $k_c$ iterations of $f_\omega$ (see Table 4), we change the cost so that the upsampling "Up" is replaced by $T'_\theta$, a fixed frozen copy of the currently learned SR map $T_\theta$; that is, the cost compares $T'_\theta(x)$ with $y$.
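The structure of such a feature-based transport cost, and the dynamic replacement of the upsampling, can be sketched as follows. This is a minimal illustration, not the paper's code: the nearest-neighbour upsampling and the two toy "feature extractors" are stand-ins for bicubic upsampling and the pre-trained VGG-16 layers.

```python
import numpy as np

def mae(x, y):
    """Mean absolute error, MAE(x, y) = ||x - y||_1 / dim."""
    return np.abs(x - y).mean()

def perceptual_cost(x_lr, y_hr, upsample, features):
    """c(x, y): sum of MAE between feature maps of Up(x) and of y.
    `features` is a list of callables standing in for VGG-16 layers."""
    x_up = upsample(x_lr)
    return sum(mae(f(x_up), f(y_hr)) for f in features)

# Stand-ins for illustration only.
upsample = lambda x: x.repeat(4, axis=0).repeat(4, axis=1)   # 4x nearest-neighbour
features = [lambda t: t, lambda t: np.diff(t, axis=0)]       # toy "feature" maps

x_lr = np.random.default_rng(0).random((16, 16))
y_hr = upsample(x_lr)                     # an HR image matching Up(x) exactly

print(perceptual_cost(x_lr, y_hr, upsample, features))  # 0.0 for a perfect match

# Dynamic cost: every k_c iterations, replace `upsample` with a frozen
# copy of the current SR map T_theta (here just another callable).
frozen_T = lambda x: upsample(x)          # stand-in for a frozen network copy
dynamic_cost = lambda x, y: perceptual_cost(x, y, frozen_T, features)
```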



Implementations used:
github.com/milesial/Pytorch-UNet
github.com/photosynthesis-team/piq
github.com/richzhang/PerceptualSimilarity
github.com/mseitzer/pytorch-fid



(a) Monge's formulation of OT. (b) Kantorovich's formulation of OT.

Figure 3: Monge's and Kantorovich's formulations of Optimal Transport.

Figure 5: Comparison of OTS (ours) and the regularized IPM GAN on the Wasserstein-2 benchmark. The 1st line shows blurry faces x ∼ P; the 2nd line shows clean faces y = T*(x), where T* is the OT map from P to Q. The next lines show maps from P to Q fitted by the methods.

Figure 6: Qualitative results of OTS (ours), bicubic upsample, FSSR and DASR on the AIM 2019 dataset (350×350 crops).

Figure 7: Color palettes and their variance for Test HR, LR datasets and solutions of Bicubic Upscale, OTS, FSSR, DASR methods on AIM19.

Figure 10: Additional qualitative results of OTS (ours), bicubic upsample, FSSR and DASR on AIM 2019. The sizes of crops on the 1st and 2nd images are 350×350 and 800×800, respectively.

Comparison of the optimization objectives of OTS (ours) and the regularized IPM GAN.

Comparison of OTS (ours) with FSSR and DASR.


Hyperparameters. For EDSR, we set the number of residual blocks to 64, the number of features to 128, and the residual scaling to 1. For UNet, we set the base factor to 64. The training details are given in Table 4. We provide a comparison of the hyperparameters of FSSR, DASR and OTS (ours) in Table 5. In contrast to FSSR and DASR, our method does not contain a degradation part. This helps to notably reduce the number of tunable hyperparameters.

Optimizer. We employ Adam (Kingma & Ba, 2014).

Computational complexity.

Training OTS with EDSR as the transport map and the perceptual transport cost on the AIM 2019 dataset takes ≈ 4 days on a single Tesla V100 GPU.

D IMAGE SUPER-RESOLUTION OF FACES

We conduct an experiment using CelebA (Liu et al., 2015) faces to test the applicability of OT for unpaired SR. We test our Algorithm 1 with MSE as the cost and UNet or EDSR as the transport map.

Pre-processing and train-test split. We resize images to 64×64 px. We adopt the unpaired train-test split from (Rout et al., 2022, 5.2). We split the original HR dataset into 3 parts A, B, C containing 90K, 90K and 22K samples, respectively. We apply the bicubic downsample to each image and obtain the LR dataset (16×16 faces). For training, we use the LR part A and the HR part B. For testing, we use part C.

Metrics. We compute PSNR, SSIM, LPIPS and FID metrics on the test part, see Table 6.
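Of the metrics above, PSNR follows directly from the MSE between images. A minimal sketch (the peak value of 255 assumes 8-bit images; the noise level in the usage example is made up):

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

clean = np.full((64, 64), 128.0)
noisy = clean + np.random.default_rng(0).normal(0.0, 5.0, clean.shape)
print(round(psnr(noisy, clean), 1))   # roughly 34 dB for sigma = 5 noise
```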

E ADDITIONAL RESULTS ON AIM19

Figure 9: Additional qualitative results of OTS (ours), bicubic upsample, FSSR and DASR on AIM 2019 (800×800 crops).

F CONNECTION BETWEEN GAN OBJECTIVES AND EQUATION (5)

Typical objectives of GAN-based approaches consist of multiple losses, usually one adversarial loss and several content losses. To make the exposition simple, in our paper we represent all the content losses as a single loss $c(\cdot, \cdot)$. Below we provide several examples showing how the objectives of popular GAN-based approaches to unpaired image SR can be viewed as instances of (5). For all of these methods, our Lemma 1 applies without any changes. We include in brackets the number of citations of each paper according to Google Scholar to show that the chosen methods are widely used.

FaceSR (2018, 284 citations). The paper of (Bulat et al., 2018) presents one of the first GAN-based approaches to the unpaired image SR problem. The method is composed of two steps. First, it learns a degradation between unpaired HR and LR images. Then it employs a second GAN to learn a supervised mapping between the generated LR images and the corresponding HR images. The objective of the unpaired step (see their Eq. (1)) is a sum of a content loss $\alpha\, l_{\text{pixel}}$ and an adversarial loss, where $l_{\text{pixel}}$ is the MSE loss between the generated LR image and the downsampled HR image. Thus, the objective of this method exactly follows Equation (5).

CinCGAN (2018, 400 citations). The method of (Yuan et al., 2018) is another pioneering GAN-based approach to the unpaired image SR problem, which establishes a group of two-step methods different from FaceSR. First, it uses one CycleGAN to learn a mapping between the given noisy LR images and downsampled HR ("clean LR") images. Then, a second CycleGAN fine-tunes a mapping between real LR and HR images. The objective of the first GAN (see their Eq. (5)) is a sum of an adversarial loss and three content terms: the cycle-consistency loss $L^{\text{LR}}_{\text{cyc}}$ (footnote 5), the $l_1$ identity loss $L^{\text{LR}}_{\text{idt}}$, and the total variation loss $L^{\text{LR}}_{\text{TV}}$.
FSSR (Winner of the AIM Challenge on Real-World SR (Lugmayr et al., 2019b), 2019, 127 citations). The FSSR (Fritsche et al., 2019) method employs a strategy similar to FaceSR. It first learns a mapping between downsampled HR images and the given unpaired LR images, and then uses the generated pairs to learn a supervised SR model. The objective of the unpaired step (see their Eq. (6)) is a weighted sum of the texture (adversarial) loss $L_{\text{tex},d}$ and the color ($l_1$ identity) loss $L_{\text{col},d}$, which are applied to the low frequencies of the images, and the perceptual loss $L_{\text{per},d}$, which is applied to the features of the full images.

DASR (2021, 51 citations). The DASR (Wei et al., 2021) method is also based on principles similar to FSSR, including its two-step structure. In contrast to FSSR, the SR network is trained in a partially supervised manner using not only the generated, but also the real LR images. The objective of the fully unpaired degradation-learning step (see their Eq. (4)) is a sum of the adversarial loss $L^{G}_{\text{adv}}$, defined on the high frequencies of the image, and the content loss $L_{\text{con}}$ ($l_1$ identity) and the perceptual loss $L_{\text{per}}$, defined on the full images and their features, respectively.

ESRGAN-FS (2020, 13 citations). ESRGAN-FS is another two-step approach based on the principle of learning the degradation, see (Zhou et al., 2020). The objective of its unpaired degradation-learning step (see their Eq. (4)) is a weighted sum of the $l_1$ identity loss $L_{\text{low}}$, applied to the low frequencies of the images, the perceptual loss $L_{\text{per}}$, applied to the features of the full images, and the adversarial loss $L_{\text{high}}$, applied to the high frequencies of the images.

5: $L^{\text{LR}}_{\text{cyc}}$ is defined as the MSE loss between a given LR image $x$ and $G_2(G_1(x))$, where $G_1$ learns to map real LR images to "clean" ones and $G_2$ learns the opposite mapping. For a fixed $G_2$, this loss can be considered as a part of the content loss.
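Abstracting over the methods above, every objective has the shape "adversarial loss plus a weighted sum of content losses", and the weighted sum collapses into the single cost $c(\cdot, \cdot)$ of Equation (5). A schematic sketch of this collapse (the loss terms and weights below are made up for illustration, not taken from any of the cited papers):

```python
import numpy as np

def combined_content_cost(x, y, losses, weights):
    """Collapses several weighted content losses into a single cost c(x, y)."""
    return sum(w * loss(x, y) for loss, w in zip(losses, weights))

l1 = lambda x, y: np.abs(x - y).mean()        # identity / pixel loss
tv = lambda x, y: np.abs(np.diff(x)).mean()   # total-variation-style term on x

x = np.linspace(0.0, 1.0, 11)
y = np.zeros(11)

# c(x, y) = 1.0 * l1 + 0.1 * tv, mirroring the weighted sums above.
c = combined_content_cost(x, y, [l1, tv], [1.0, 0.1])
print(round(c, 3))   # 0.51
```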

