ADAPTIVE IMLE FOR FEW-SHOT IMAGE SYNTHESIS Anonymous authors Paper under double-blind review

Abstract

Despite their success on large datasets, GANs have been difficult to apply in the few-shot setting, where only a limited number of training examples are provided. Due to mode collapse, GANs tend to ignore some training examples, causing overfitting to a subset of the training dataset, which is small to begin with. A recent method called Implicit Maximum Likelihood Estimation (IMLE) is an alternative to GANs that tries to address this issue. It uses the same kind of generator as GANs but trains it with a different objective that encourages mode coverage. However, the theoretical guarantees of IMLE hold only under restrictive conditions, such as the requirement that the optimal likelihood be the same at all data points. In this paper, we present a more general formulation of IMLE that includes the original formulation as a special case, and we prove that the theoretical guarantees hold under weaker conditions. Using this generalized formulation, we further derive a new algorithm, which we dub Adaptive IMLE, that can adapt to the varying difficulty of different training examples. We demonstrate on multiple few-shot image synthesis datasets that our method significantly outperforms existing methods.

1. INTRODUCTION

Image synthesis has achieved significant progress over the past decade with the emergence of deep learning. Deep generative models such as GANs (Goodfellow et al., 2014; Brock et al., 2019; Karras et al., 2019; 2020; 2021), VAEs (Kingma & Welling, 2013; Vahdat & Kautz, 2020; Child, 2021; Razavi et al., 2019), diffusion models (Dhariwal & Nichol, 2021; Ho et al., 2020), score-based models (Song et al., 2021; Song & Ermon, 2019), normalizing flows (Dinh et al., 2017; Kobyzev et al., 2021; Kingma & Dhariwal, 2018), and autoregressive models (Salimans et al., 2017; van den Oord et al., 2016b; a) have substantially improved generated image quality, making it possible to generate photorealistic images.

Many of these deep generative models require training on large-scale datasets to produce high-quality images. However, there are many real-life scenarios in which only a limited number of training examples are available, such as orphan diseases in the medical domain and rare events for training autonomous driving agents. One way to address this issue is to fine-tune a model pre-trained on a large auxiliary dataset from a similar domain (Wang et al., 2020; Zhao et al., 2020a; Mo et al., 2020). Nonetheless, a large auxiliary dataset with a sufficient degree of similarity to the task at hand may not be available in all domains. If an insufficiently similar auxiliary dataset is used regardless, image quality may be adversely impacted, as shown by Zhao et al. (2020b).

In this paper, we focus on the challenging setting of few-shot unconditional image synthesis without auxiliary pre-training. The scarcity of training data in this setting makes it especially important for generative models to make full use of all training examples. This requirement sets it apart from the many-shot setting with abundant training data, where ignoring some training examples does not cause as big an issue.
As a result, despite achieving impressive performance in the many-shot setting, GANs are challenging to apply to the few-shot setting due to the well-known problem of mode collapse, where the generator only learns from a subset of the training images and ignores the rest. A recent work (Li & Malik, 2018) proposed an alternative technique called Implicit Maximum Likelihood Estimation (IMLE) for unconditional image synthesis. Like a GAN, IMLE uses a generator, but rather than adopting an adversarial objective, which encourages each generated image to be similar to some training image, IMLE encourages each training image to have some similar generated image. The generated images can therefore cover all training examples without collapsing to a subset of the modes. However, the theoretical guarantees of IMLE hold only under restrictive conditions, such as requiring the optimal likelihood to be the same at all data points (Sec. 3).

In this paper, we introduce a generalized formulation of IMLE, which in turn enables the derivation of a new algorithm that requires fewer conditions and gets around this limitation. In particular, we prove that the theoretical guarantees of the generalized formulation hold under weaker conditions and that it subsumes the original IMLE formulation as a special case. Furthermore, we derive an algorithm called Adaptive IMLE from this generalized formulation, which can adapt to points of differing difficulty, as illustrated in the bottom row of Fig. 1. We compare our method to existing few-shot image synthesis baselines on six datasets and show significant improvements over the baselines in terms of mode modelling accuracy and coverage.

2. RELATED WORK

There are two broad families of work on few-shot learning, one that focuses on discriminative tasks such as classification (O' Mahony et al., 2019; Finn et al., 2017; Snell et al., 2017) and another that focuses on generative tasks. In this paper, we focus on the latter. Similar to many-shot generation tasks, few-shot generation tasks take a limited number of training examples as input and aim to generate samples that are similar to those training examples. What is different from the many-shot setting is that it is crucial for the generative model to utilize all the training examples in the few-shot setting. Due to the scarcity of available data for training, ignoring even just a few data points would cause a more serious issue in the few-shot setting than in the many-shot setting. One line of work focuses on pre-training on large-scale auxiliary datasets from similar domains and adapting the pre-trained models for the few-shot task. This has been applied to unconditional image generation (Li et al., 2020; Zhao et al., 2020a; Mo et al., 2020; Ojha et al., 2021; Wang et al., 2020) , conditional image generation (Sinha et al., 2021; Liu et al., 2019) and video generation (Wang et al., 2019) . However, there are no guarantees on the existence of such large-scale auxiliary datasets for all domains, and recent studies (Zhao et al., 2020b; Kong et al., 2022) also showed that fine-tuning from a dissimilar domain could even lead to the degradation of generated image quality. In this paper, we focus on the setting without fine-tuning pre-trained models from auxiliary datasets. Most prior work considered applying GANs to this setting and developed methods for alleviating the well-known mode collapse problem of GANs. FastGAN (Liu et al., 2021) introduced a skip-layer excitation module for faster training and used self-supervision for the discriminator to learn more descriptive features, which aids better mode coverage of the generator. 
MixDL (Kong et al., 2022) introduced a two-sided distance regularization to facilitate learning smooth and mode-preserving latent space. Despite these improvements, some degree of mode collapse still remains. A recent method called Implicit Maximum Likelihood Estimation (IMLE) (Li & Malik, 2018 ) adopted a different objective function and showed promising results towards alleviating mode collapse on unconditional image synthesis tasks. Prior IMLE-based methods mainly focused on conditional image synthesis (Li* et al., 2020; Peng et al., 2022) . In this work, we build on (Li & Malik, 2018) and introduce a novel and more generalized formulation of IMLE to make it more suitable for the unconditional few-shot setting.

3. BACKGROUND: IMPLICIT MAXIMUM LIKELIHOOD ESTIMATION (IMLE)

In unconditional image synthesis, the goal is to learn the unconditional probability distribution of images $p(x)$, from which samples can be drawn to yield new synthesized images. It is common to use an implicit generative model; one example of such a model is the generator in GANs, which takes the form of a function $T_\theta$ parameterized as a neural network with parameters $\theta$, which maps latent codes $z$ drawn from a standard Gaussian $\mathcal{N}(0, I)$ to images $x$.

One way to learn this model is with the GAN objective, which introduces a discriminator that aims to distinguish between generated images $T_\theta(z)$ and real images $x$. The generator is trained to produce more realistic images that would fool the discriminator. However, the output $T_\theta(z)$ tends to recover only a subset of the training examples even when varying $z$ over all values. This issue is known as mode collapse, and the intuitive reason behind it is that the adversarial objective of GANs only encourages each generated sample to be similar to some training example, but there is no guarantee that every training example will have some similar generated sample. In the few-shot image synthesis setting, the issue of mode collapse is even more significant given the limited number of training examples that are available in the first place.

A more recent method known as Implicit Maximum Likelihood Estimation (IMLE) (Li & Malik, 2018) proposed an alternative objective to address this issue. Instead of making each generated sample similar to some training example, IMLE tries to ensure that samples can be generated around each training example $x_i$. The generator $T_\theta$ is encouraged to pull some samples $T_\theta(z_j)$ towards each $x_i$, thereby rewarding coverage of the modes associated with all training examples. More precisely, the IMLE objective takes the following form:

$$\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \min_{j \in [m]} d\left(x_i, T_\theta(z_j)\right) \right] \qquad (1)$$

where $d(\cdot, \cdot)$ is a distance metric, $m$ is a hyperparameter, and $x_i$ is the $i$-th training example. The training procedure involves finding the index $\sigma(i)$ of the nearest generated sample to each training example $x_i$, and optimizing the model parameters $\theta$ by minimizing the distance from the selected sample $T_\theta(z_{\sigma(i)})$ to the target data $x_i$. Detailed pseudocode of the algorithm can be found in the appendix.

Despite the algorithm's simplicity, restrictive conditions need to be satisfied for the theoretical guarantees of IMLE to hold, such as requiring a uniform optimal likelihood for all data points. As an example, consider a dataset with two clusters with the same number of points, where one cluster has large variance and the other has small variance. In this case, the training examples from the high-variance cluster are more difficult to learn than the training examples from the low-variance cluster, because of sparser coverage of the space in the former cluster. The ground truth data distribution is bimodal, with the mode corresponding to the low-variance cluster having higher likelihood than the other. So requiring uniform optimal likelihood for all data points, as IMLE does, will result in overfitting to the low-variance cluster and underfitting to the high-variance cluster, which is not optimal. We refer readers to the IMLE paper (Li & Malik, 2018) for more details.
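As a rough illustration of the IMLE objective in Eqn. 1, the following NumPy sketch evaluates a one-draw Monte Carlo estimate given already generated samples. It is a minimal sketch under stated assumptions: the Euclidean distance stands in for $d$, the generator is omitted (the array `samples` plays the role of $T_\theta(z_j)$), and the function name `imle_loss` is hypothetical rather than taken from the paper.

```python
import numpy as np

def imle_loss(data, samples):
    """One-draw Monte Carlo estimate of the IMLE objective (Eqn. 1).

    data:    (n, D) array of training examples x_i
    samples: (m, D) array standing in for generated samples T_theta(z_j)
    Returns the summed nearest-sample distance and the indices sigma(i).
    """
    # Pairwise Euclidean distances d(x_i, T_theta(z_j)), shape (n, m).
    # Assumption: Euclidean distance is used as the metric d.
    dists = np.linalg.norm(data[:, None, :] - samples[None, :, :], axis=-1)
    sigma = dists.argmin(axis=1)  # index of the nearest sample for each x_i
    return dists.min(axis=1).sum(), sigma

data = np.array([[0.0, 0.0], [10.0, 0.0]])
samples = np.array([[0.0, 1.0], [0.0, 2.0], [9.0, 0.0]])
loss, sigma = imle_loss(data, samples)
```

Because the minimum is taken per training example, every $x_i$ contributes a term to the loss, which is what discourages the generator from ignoring any example.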

4. METHOD

In this paper, we devise a generalized formulation of IMLE, whose theoretical guarantees hold under more general conditions than vanilla IMLE (Sec 4.1). This formulation subsumes vanilla IMLE as a special case and also gives rise to a new algorithm which we call Adaptive IMLE. It turns out that Adaptive IMLE offers theoretical and practical advantages over IMLE, which we will demonstrate (Sec 4.2).

4.1. GENERALIZED FORMULATION

Since $T_\theta$ is an implicit generative model, the likelihood $p_\theta$ induced by the model cannot in general be expressed in closed form, and so evaluating it numerically is typically computationally intractable. In order to train the generative model, we would like to maximize the likelihood of the training examples without actually needing to evaluate the likelihood. Below we consider the generalized objective we propose and show that optimizing it is equivalent to maximizing the sum of likelihoods at the training examples, without requiring the evaluation of the likelihood. Consider the following optimization problem:

$$\max_\theta L_{\{\tau_i\}_i}(\theta) := \max_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{1}{m} \sum_{j=1}^{m} \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right) \right] \qquad (2)$$

where $T_\theta$, $d(\cdot, \cdot)$ and $m$ are as defined in Eqn. 1. We will choose $w_i$, $\tau_i$ and $\Phi_\tau(\cdot)$ later based on the insight revealed by the lemmas below. We present high-level sketches of our key lemmas (omitting some technicalities) and delineate their interpretations and significance. The precise statements of the lemmas and their proofs are left to the appendix.

We first present a lemma that relates the expectation of a function of a random variable to the weighted integral of one minus its cumulative distribution function (CDF) evaluated at different points, which we will refer to as cumulative densities.

Lemma 1. Let $X$ be a non-negative random variable and $\Phi$ be a continuous function on $[0, \infty)$. If $\Phi'$ is integrable on all closed intervals in $[0, \infty)$,

$$\mathbb{E}[\Phi(X)] = \Phi(0) + \int_0^\infty \Phi'(t) \Pr(X \geq t) \, dt$$

This lemma is useful because the left-hand side (LHS) is easy to approximate with Monte Carlo estimates of expectations, while the right-hand side (RHS) is a weighted integral of one minus cumulative densities, which are intractable to compute in general. It enables us to control the weighting of different cumulative densities by choosing the function $\Phi$.
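To make Lemma 1 concrete, the sketch below checks it numerically for one simple case chosen purely for illustration (it does not appear in the paper): $X \sim \mathrm{Exponential}(1)$ and $\Phi(t) = t^2$, for which $\Pr(X \geq t) = e^{-t}$ and both sides equal 2. The LHS is estimated by Monte Carlo and the RHS by trapezoidal integration on a truncated grid; the helper name `lemma1_rhs` is hypothetical.

```python
import numpy as np

def lemma1_rhs(phi0, dphi, survival, t_max=50.0, steps=200_001):
    """Phi(0) + integral_0^t_max Phi'(t) Pr(X >= t) dt via the trapezoid rule."""
    t = np.linspace(0.0, t_max, steps)
    y = dphi(t) * survival(t)
    return phi0 + np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)

# LHS: Monte Carlo estimate of E[Phi(X)] with Phi(t) = t^2.
lhs = np.mean(x ** 2)
# RHS: Phi(0) = 0, Phi'(t) = 2t, Pr(X >= t) = exp(-t).
rhs = lemma1_rhs(0.0, lambda t: 2.0 * t, lambda t: np.exp(-t))
```

The two quantities agree up to Monte Carlo noise, which mirrors how the paper uses the lemma: the sample average on the left is cheap to compute, while the integral on the right is what the objective implicitly controls.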
Recall that our goal is to maximize the likelihood at each training example without actually computing the likelihood. We can leverage Lemma 1 for this purpose by choosing the non-negative random variable $X$ appropriately. We choose $X$ to be the distance between a training example and a generated sample, $d(x_i, T_\theta(z_j))$. With this choice, Lemma 1 gives us a way to relate a weighted integral of the average likelihoods within differently sized neighbourhoods around the training example $x_i$ (RHS) to the expectation of a function of the distance $d(x_i, T_\theta(z_j))$ (LHS).

Moreover, we would like to restrict the average likelihoods we integrate over to only those within neighbourhoods of certain sizes, rather than from $0$ to $\infty$. Specifically, we would like to integrate from $\delta\tau$ to $\tau$, where $\tau > 0$ is the radius of the largest neighbourhood and $0 \leq \delta < 1$ is a tightening threshold. To this end, we can choose the weighting function $\Phi'_\tau(\cdot)$ to be 1 when $\delta\tau \leq t \leq \tau$ and 0 otherwise. One choice of $\Phi_\tau(\cdot)$ that satisfies this condition, together with its associated $\Phi'_\tau(\cdot)$, is:

$$\Phi_\tau(t) = \begin{cases} \delta\tau & t < \delta\tau \\ t & \delta\tau \leq t \leq \tau \\ \tau & t > \tau \end{cases} \qquad \Phi'_\tau(t) = \begin{cases} 0 & t < \delta\tau \\ 1 & \delta\tau \leq t \leq \tau \\ 0 & t > \tau \end{cases}$$

Using this choice of $\Phi_\tau(\cdot)$, we obtain the following lemma for a particular training example $x_i$.

Lemma 2. Under the choice of $\Phi_\tau(\cdot)$ above and its associated $\Phi'_\tau(\cdot)$,

$$\mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right] = \tau_i - \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt.$$

This lemma shows that, for one training example $x_i$, the expectation on the LHS reduces to $\tau_i$ minus the integral of the average likelihoods within balls whose radii lie between $\delta\tau_i$ and $\tau_i$. Applying Lemma 2 to all training examples $x_1, \ldots, x_n$, we obtain the following lemma, which reveals what the overall objective in Eqn. 2 optimizes.

Lemma 3. Under the choice of $\Phi_\tau(\cdot)$ above and its associated $\Phi'_\tau(\cdot)$,

$$L_{\{\tau_i\}_i}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m w_i} \sum_{j=1}^{m} \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt.$$
Lemma 3 shows that $L_{\{\tau_i\}_i}(\theta)$ implicitly computes the average likelihood that the generative model assigns to the neighbourhood of each data point. The choice of $\tau_i$ controls the radius. Since we would like to maximize probability in the immediate neighbourhood of each data point, we would like $\tau_i$ to be small. So should we choose an arbitrarily small value for $\tau_i$?

Recall that by definition of $\Phi_{\tau_i}(\cdot)$, if $d(x_i, T_\theta(z_j)) > \tau_i$, then $\Phi_{\tau_i}(d(x_i, T_\theta(z_j))) = \tau_i$. So, for a very small $\tau_i$, it may well be the case that $d(x_i, T_\theta(z_j)) > \tau_i$ for all $j$, which would make the Monte Carlo estimate of $L_{\{\tau_i\}_i}(\theta)$, i.e.,

$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{1}{m} \sum_{j=1}^{m} \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right),$$

zero. Since this is a constant, the gradient w.r.t. the parameters is zero, which makes gradient-based learning impossible. This would happen whenever $\tau_i < \min_{j \in [m]} d(x_i, T_\theta(z_j))$, and so the smallest $\tau_i$ that can be chosen is $\min_{j \in [m]} d(x_i, T_\theta(z_j))$ (which is treated as a constant rather than a function of $\theta$). With this choice of $\tau_i$, assuming that there is a unique $j^*$ such that $d(x_i, T_\theta(z_{j^*})) = \min_{j \in [m]} d(x_i, T_\theta(z_j))$ (which happens almost surely), the objective can be simplified to:

$$L_{\{\tau_i\}_i}(\theta) = \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{nm} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right) \right] \qquad (3)$$

Maximizing the objective in Eqn. 3 yields a novel objective, which we call the Adaptive IMLE objective. Its solution can be expressed as:

$$\arg\max_\theta L_{\{\tau_i\}_i}(\theta) = \arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \frac{1}{w_i} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right] \qquad (4)$$

It turns out that the vanilla IMLE objective can be recovered as a special case, by choosing $\delta = 0$ and $w_1 = w_2 = \cdots = w_n$.
$$\begin{aligned} \arg\max_\theta L_{\{\tau_i\}_i}(\theta) &= \arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \frac{1}{w_i} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, 0 \right) \right] \\ &= \arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \frac{1}{w_i} \min_{j \in [m]} d(x_i, T_\theta(z_j)) \right] \\ &= \arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \min_{j \in [m]} d(x_i, T_\theta(z_j)) \right] \end{aligned}$$

4.1.1. CURRICULUM LEARNING

Recall that our goal is to maximize the likelihood of the immediate neighbourhood around each data point, and the size of this neighbourhood is controlled by $\tau_i$. Therefore, we want to make $\tau_i$ small. In order to make $\tau_i$ small without impeding learning, we need to make

$$\mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)}[\tau_i] = \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \min_{j \in [m]} d(x_i, T_\theta(z_j)) \right]$$

small. To this end, we can either increase $m$, the number of samples, or train $T_\theta$ so that the samples it produces are close to the data point $x_i$. The former is computationally expensive, and so we devise a method to achieve the latter.

We propose a curriculum learning strategy, which solves a sequence of optimization problems with different $\tau_i$'s, such that the $\tau_i$'s get smaller for optimization problems later in the sequence. The earlier optimization problems in the sequence help train $T_\theta$ to produce samples close to the data points. After each optimization problem is solved to convergence, we start solving the next optimization problem with $\theta$ initialized to the solution found previously. This makes the $\tau_i$'s smaller and smaller. If they eventually converge to zero, then we would have equivalently maximized the sum of likelihoods $p_\theta(x_i)$ of the training examples under the probability distribution induced by the generative model, as shown in the lemma below.

Lemma 4. Suppose $p_\theta$ is continuous at all data points $x_1, \ldots, x_n$. Under the choice of $w_i = \int_{\delta\tau_i}^{\tau_i} \mathrm{vol}(B_t(x_i)) \, dt := \int_{\delta\tau_i}^{\tau_i} \int_{B_t(x_i)} dx \, dt$, where $B_r(x) = \{y \mid d(y, x) < r\}$ is an open ball of radius $r$ centred at $x$,

$$\lim_{\{\tau_i \to 0^+\}_i} L_{\{\tau_i\}_i}(\theta) = \frac{1}{n} \sum_{i=1}^{n} p_\theta(x_i)$$

This lemma shows that the theoretical guarantees of Adaptive IMLE hold under more general conditions than those required by vanilla IMLE.
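The Adaptive IMLE objective in Eqn. 4 can be illustrated with a small NumPy sketch that evaluates a one-draw estimate for fixed radii. This is only a sketch under stated assumptions: the Euclidean distance stands in for $d$, the radii `tau` are passed in as fixed constants (as in the text, they are not differentiated through), the generator is omitted, and the function name `adaptive_imle_loss` is hypothetical.

```python
import numpy as np

def adaptive_imle_loss(data, samples, tau, delta, weights=None):
    """One-draw estimate of sum_i (1/w_i) * max(min_j d(x_i, T(z_j)), delta * tau_i).

    data:    (n, D) training examples x_i
    samples: (m, D) array standing in for generated samples T_theta(z_j)
    tau:     (n,) per-example radii, treated as constants
    delta:   tightening coefficient in [0, 1)
    weights: (n,) per-example weights w_i (defaults to all ones)
    """
    # Assumption: Euclidean distance is used as the metric d.
    dists = np.linalg.norm(data[:, None, :] - samples[None, :, :], axis=-1)
    nearest = dists.min(axis=1)                 # min_j d(x_i, T(z_j))
    clipped = np.maximum(nearest, delta * tau)  # no reward below delta * tau_i
    if weights is None:
        weights = np.ones(len(data))
    return float(np.sum(clipped / weights))

data = np.array([[0.0, 0.0], [10.0, 0.0]])
samples = np.array([[0.0, 1.0], [0.0, 2.0], [9.0, 0.0]])
tau = np.array([1.0, 4.0])

adaptive = adaptive_imle_loss(data, samples, tau, delta=0.5)
vanilla = adaptive_imle_loss(data, samples, tau, delta=0.0)
```

With `delta=0.0` and equal weights the clipping has no effect and the value coincides with the vanilla IMLE estimate, matching the special case derived above. With `delta=0.5` the second example is already inside its threshold $\delta\tau_2 = 2$, so its term is clipped and contributes no gradient until the threshold is tightened.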

4.2. ADAPTIVE IMLE

The key difference between the objective in Eqn. 4 and the original IMLE formulation in Eqn. 1 is the individualized neighbourhood radius $\tau_i$ around each data point $x_i$. This change in the objective is crucial, as it allows the model to adapt to the varying difficulty of learning different training examples, hence the algorithm name, Adaptive IMLE.

As mentioned in Sec. 4.1.1, we need to gradually decrease $\tau_i$ in order to make learning feasible. This can be achieved by decreasing the tightening threshold $\delta\tau_i$. To this end, the algorithm optimizes the model's parameters until the distance between the generated sample and the target data, $d(x_i, T_\theta(z_j))$, decreases to $\delta\tau_i$. Once the threshold is reached, the algorithm tightens it by multiplying it by $\delta$, where $0 \leq \delta < 1$. This updated threshold then serves as the new target for learning $x_i$. Intuitively, the tightening coefficient $\delta$ determines the amount of progress the selected sample must make towards each training example.

Now let us turn our attention to the optimization problem we solve in each stage of the curriculum. Consider the following unweighted variant of the objective in Eqn. 4, without the $\frac{1}{w_i}$ factor:

$$\arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right] \qquad (5)$$

Algorithm 1 Adaptive IMLE Procedure
Require: The set of inputs $\{x_i\}_{i=1}^{n}$, tightening coefficient $\delta \in [0, 1)$
  Initialize the parameters $\theta$ of the generator $T_\theta$
  Draw latent codes $Z \leftarrow z_1, \ldots, z_m$ from $\mathcal{N}(0, I)$
  $\sigma(i) \leftarrow \arg\min_{j \in [m]} d(x_i, T_\theta(z_j)) \;\; \forall i \in [n]$
  $\tau_i \leftarrow d(x_i, T_\theta(z_{\sigma(i)})) \;\; \forall i \in [n]$    ▷ Initialize the threshold for each data point
  for $k = 1$ to $K$ do
    Pick a random batch $S \subseteq [n]$
    $\theta \leftarrow \theta - \eta \nabla_\theta \left( \sum_{i \in S} d(x_i, T_\theta(z_{\sigma(i)})) \right) / |S|$
    Draw latent codes $Z \leftarrow z_1, \ldots, z_m$ from $\mathcal{N}(0, I)$
    for $i \in S$ do
      if $d(x_i, T_\theta(z_{\sigma(i)})) \leq \delta\tau_i$ then    ▷ Only update $\sigma(i)$ when getting into the threshold
        $\tau_i \leftarrow \tau_i \delta$    ▷ Tightening the threshold
        $\sigma(i) \leftarrow \arg\min_{j \in [m]} d(x_i, T_\theta(z_j))$
      end if
    end for
  end for
  return $\theta$

If we were to run stochastic gradient descent (SGD) without replacement on the weighted objective (Eqn. 4) and the unweighted objective (Eqn. 5) and consider the updates made by each, we would find that the updates induced by the weighted objective are just scalar multiples of those induced by the unweighted objective, since $\tau_i$, and therefore $w_i$, is fixed during each stage of the curriculum. So, optimizing the weighted objective is equivalent to optimizing the unweighted objective with a different step size chosen for each update. We consider optimizing the unweighted objective with a constant step size that is smaller than these per-iteration step sizes. Then, if SGD on the weighted objective converges to a solution (footnote 1) where $\min_{j \in [m]} d(x_i, T_\theta(z_j))$ falls below $\delta\tau_i$ (which is a global minimum), SGD on the unweighted objective with such a step size will do the same, because the latter simply chooses more conservative step sizes than those implied by the weighted objective.

Putting everything together, we obtain the Adaptive IMLE algorithm. The details are shown in Algorithm 1.
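The control flow of Algorithm 1 can be sketched on a toy problem as follows. This is a minimal illustration, not the authors' implementation: it assumes a linear generator $T_\theta(z) = Wz + b$ with hand-written gradients, uses the squared Euclidean distance as a stand-in for $d$, and all names (`generate`, `sq_dist`, `adaptive_imle`) and default hyperparameters are hypothetical.

```python
import numpy as np

def generate(theta, z):
    """Toy linear generator T_theta(z) = W z + b (an assumption for illustration)."""
    W, b = theta
    return z @ W.T + b

def sq_dist(x, y):
    """Squared Euclidean distance, used here as a stand-in for d."""
    return np.sum((x - y) ** 2, axis=-1)

def adaptive_imle(data, delta=0.9, m=64, k=2, iters=500, batch=2, lr=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    theta = [0.1 * rng.standard_normal((dim, k)), np.zeros(dim)]

    # Draw an initial pool, pick the nearest sample per example, set tau_i.
    z0 = rng.standard_normal((m, k))
    d0 = sq_dist(data[:, None, :], generate(theta, z0)[None, :, :])  # (n, m)
    codes = z0[d0.argmin(axis=1)].copy()  # latent code z_sigma(i) per example
    tau = d0.min(axis=1)
    tau_init = tau.copy()

    for _ in range(iters):
        S = rng.choice(n, size=min(batch, n), replace=False)
        W, b = theta
        # Gradient of mean_i ||W z_i + b - x_i||^2 with the codes held fixed.
        resid = generate(theta, codes[S]) - data[S]   # (|S|, dim)
        grad_W = 2.0 * resid.T @ codes[S] / len(S)
        grad_b = 2.0 * resid.mean(axis=0)
        theta = [W - lr * grad_W, b - lr * grad_b]

        z = rng.standard_normal((m, k))               # fresh pool of latent codes
        for i in S:
            # Only tighten and reselect once the current sample is within delta * tau_i.
            if sq_dist(data[i], generate(theta, codes[i])) <= delta * tau[i]:
                tau[i] *= delta
                codes[i] = z[sq_dist(data[i], generate(theta, z)).argmin()]
    return theta, tau, tau_init

data = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
theta, tau, tau_init = adaptive_imle(data)
```

Each example keeps its own threshold `tau[i]`, so examples that are easy to fit tighten their thresholds quickly while harder ones keep their current target sample for longer, which is the adaptive behaviour described above.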

5. EXPERIMENTS

Baselines. We compare our method to recent few-shot unconditional image synthesis methods that operate in the same setting we consider, namely without needing to pre-train on auxiliary datasets. Two such recent methods are FastGAN (Liu et al., 2021) and MixDL (Kong et al., 2022).

Training Details. Our network architecture is modified from Child (2021), where we keep the decoder architecture and replace the encoder with a fully-connected mapping network inspired by Karras et al. (2019). We choose an input latent dimension of 1024 and a tightening coefficient $\delta = 0.9$. We train our model for 500k iterations with a mini-batch size of 4 using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $2 \times 10^{-6}$ on a single NVIDIA V100 GPU.

Datasets. We evaluate our method and the baselines on a wide range of natural image datasets at 256 × 256 resolution, which include Animal-Face Dog and Cat (Si & Zhu, 2012), Obama, Panda, and Grumpy-cat (Zhao et al., 2020b), and a Flickr-Faces-HQ (FFHQ) subset (Karras et al., 2019). All datasets contain 100 images except for Dog and Cat, which contain 389 and 160 images respectively. The FFHQ subset consists of 100 FFHQ images with similar backgrounds, in order to highlight diversity in the generation of foregrounds.

Evaluation Metrics. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to measure the perceptual quality of the generated images, where we randomly generate 5000 images and compute FID between the generated samples and the real images in each dataset. To evaluate the mode modelling accuracy (precision) and coverage (recall), we use the precision metric of Kynkäänniemi et al. (2019) to measure the former, and the recall metric of Kynkäänniemi et al. (2019) and the LPIPS backtracking score (Liu et al., 2021) to measure the latter. For LPIPS backtracking, we use 90% of the full dataset for training and evaluate the metric using the remaining 10% of the dataset.

Table 1: We compute FID (Heusel et al., 2017) between the real data and 5000 randomly generated samples in all cases. LPIPS denotes the LPIPS backtracking score (Liu et al., 2021). For this metric, each model is trained on 90% of the dataset. The resulting model is used to backtrack in the latent space and reconstruct the remaining 10%. A lower LPIPS backtracking score indicates better mode coverage of the training data.
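The precision and recall metrics described above can be sketched on precomputed feature vectors as follows. This is a simplified illustration of the k-nearest-neighbour manifold estimate of Kynkäänniemi et al. (2019), not the evaluation code used in the paper: the feature extractor is omitted, `k=3` is an assumed default, and the function names are hypothetical.

```python
import numpy as np

def knn_radii(feats, k):
    """Distance from each feature vector to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # Column 0 of each sorted row is the zero self-distance,
    # so column k is the distance to the k-th nearest neighbour.
    return np.sort(d, axis=1)[:, k]

def coverage(queries, refs, k=3):
    """Fraction of queries that fall inside at least one k-NN ball around refs."""
    radii = knn_radii(refs, k)
    d = np.linalg.norm(queries[:, None, :] - refs[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

def precision_recall(real_feats, gen_feats, k=3):
    precision = coverage(gen_feats, real_feats, k)  # generated inside real manifold
    recall = coverage(real_feats, gen_feats, k)     # real inside generated manifold
    return precision, recall

# Toy 1-D features: five distinct real points, and a "collapsed" generator
# that always outputs the first real point.
real = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
collapsed = np.zeros((5, 1))
p, r = precision_recall(real, collapsed)
```

In this toy case the collapsed generator attains precision 1.0 but recall 0.2, since every generated point lies on the real manifold while only one of the five real points is covered. This is the signature of mode collapse that the recall metric is meant to expose.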

5.1. QUANTITATIVE RESULTS

We compare the FID and LPIPS backtracking scores across all methods in Tab. 1. As shown, our method outperforms the baselines in terms of both metrics on all datasets except for Cat, where our FID is the second best and our LPIPS backtracking score is the best. We compare the mode accuracy and coverage in Tab. 2. As shown, our method achieves better precision than the baselines and significantly outperforms the baselines in terms of recall. These results show that our method can produce high-quality images while obtaining better mode coverage than the baselines.

Table 2: Precision and recall (Kynkäänniemi et al., 2019) are computed across 1000 randomly generated samples and the target dataset. Our method performs better for both precision and recall in all cases. Higher precision indicates better fitting to the target dataset and higher recall corresponds to better mode coverage.

5.2. QUALITATIVE RESULTS

We show the qualitative comparison of our method to the baselines in Fig. 2 . As shown, our method generates higher quality samples which better preserve the semantic structures compared to the baselines, such as the eyes in Cat, the facial structure in Dog and the mouth and hair in the FFHQ subset. In addition, our method generates more diverse results while the baselines suffer from mode collapse and generate similar samples, such as in Panda, Grumpy Cat and Obama. Additional samples from our model can be found in the appendix. We show the final reconstruction of the target image found using LPIPS backtracking (Liu et al., 2021) on the models trained with different methods in Fig. 3 . As shown, our method is the only one where the reconstruction is structurally similar to the target image, demonstrating that our model successfully covers the mode that the target image belongs to. We also compare our method to the baselines on the quality of interpolations between two samples in the latent space. As shown in 

5.3. ABLATION STUDY

We compare the FID, precision, and recall between the proposed method, Adaptive IMLE, and vanilla IMLE on the more challenging datasets, Obama and FFHQ subset. As shown in Tab. 3, Adaptive IMLE significantly improves upon vanilla IMLE in terms of FID and recall while achieving similar precision, validating the effectiveness of the proposed method under the few-shot setting. 

6. CONCLUSION

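We developed a method for the challenging few-shot image synthesis setting that does not depend on pre-training on auxiliary datasets. We presented a more generalized formulation of IMLE and proved that the theoretical guarantees of this generalized formulation hold under weaker conditions. We further derived a novel algorithm based on this formulation which can adapt to different training examples of varying difficulty. We showed that our method significantly outperforms existing baselines in terms of mode modelling accuracy and coverage on six few-shot benchmark datasets.

A. PROOFS

Lemma 1. Let $X$ be a non-negative random variable and $\Phi$ be a continuous function on $[0, \infty)$. If $\Phi'$ is integrable on all closed intervals in $[0, \infty)$,

$$\mathbb{E}[\Phi(X)] = \Phi(0) + \int_0^\infty \Phi'(t) \Pr(X \geq t) \, dt$$

Proof.

$$\begin{aligned} \Phi(0) + \int_0^\infty \Phi'(t) \Pr(X \geq t) \, dt &= \Phi(0) + \int_0^\infty \int_t^\infty \Phi'(t) p(x) \, dx \, dt \\ &= \Phi(0) + \iint_{\{x \geq t, \, t \geq 0\}} \Phi'(t) p(x) \, d(x, t) \\ &= \Phi(0) + \iint_{\{t \leq x, \, t \geq 0\}} \Phi'(t) p(x) \, d(x, t) \\ &= \Phi(0) + \int_0^\infty \int_0^x \Phi'(t) p(x) \, dt \, dx \\ &= \Phi(0) + \int_0^\infty \left( \int_0^x \Phi'(t) \, dt \right) p(x) \, dx \\ &= \Phi(0) + \int_0^\infty \left( \Phi(x) - \Phi(0) \right) p(x) \, dx \quad \text{(2nd FTC)} \\ &= \Phi(0) + \int_0^\infty \Phi(x) p(x) \, dx - \int_0^\infty \Phi(0) p(x) \, dx \\ &= \Phi(0) + \int_0^\infty \Phi(x) p(x) \, dx - \Phi(0) \int_0^\infty p(x) \, dx \\ &= \Phi(0) + \mathbb{E}[\Phi(X)] - \Phi(0) \\ &= \mathbb{E}[\Phi(X)] \end{aligned}$$

Lemma 2. Under the choice of $\Phi_\tau(\cdot)$ above and its associated $\Phi'_\tau(\cdot)$,

$$\mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right] = \tau_i - \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z)) < t\right) dt.$$

Proof. By definition, $\Phi_{\tau_i}(0) = \delta\tau_i$.

$$\begin{aligned} \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right] &= \Phi_{\tau_i}(0) + \int_0^\infty \Phi'_{\tau_i}(t) \Pr\left(d(x_i, T_\theta(z)) \geq t\right) dt \quad \text{(Lemma 1)} \\ &= \delta\tau_i + \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z)) \geq t\right) dt \\ &= \delta\tau_i + \int_{\delta\tau_i}^{\tau_i} \left( 1 - \Pr\left(d(x_i, T_\theta(z)) < t\right) \right) dt \\ &= \delta\tau_i + (\tau_i - \delta\tau_i) - \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z)) < t\right) dt \\ &= \tau_i - \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z)) < t\right) dt \end{aligned}$$

Lemma 3. Under the choice of $\Phi_\tau(\cdot)$ above and its associated $\Phi'_\tau(\cdot)$,

$$L_{\{\tau_i\}_i}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m w_i} \sum_{j=1}^{m} \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt.$$

Proof.

$$\begin{aligned} L_{\{\tau_i\}_i}(\theta) &= \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{1}{m} \sum_{j=1}^{m} \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right) \right] \\ &= \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \Phi_{\tau_i}\left(d(x_i, T_\theta(z_j))\right) \right] \right) \\ &= \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{1}{m} \sum_{j=1}^{m} \left( \tau_i - \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt \right) \right) \quad \text{(Lemma 2)} \\ &= \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m w_i} \sum_{j=1}^{m} \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt \end{aligned}$$

Equation 3.

Proof. With $\tau_i = \min_{j \in [m]} d(x_i, T_\theta(z_j))$ and a unique minimizer $j^*$, we have $\Phi_{\tau_i}(d(x_i, T_\theta(z_j))) = \tau_i$ for each of the $m - 1$ indices $j \neq j^*$, and so

$$\begin{aligned} L_{\{\tau_i\}_i}(\theta) &= \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \frac{m-1}{m} \tau_i - \frac{1}{m} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right) \right] \\ &= \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{w_i} \left( \frac{1}{m} \tau_i - \frac{1}{m} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right) \right] \\ &= \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \frac{1}{nm} \sum_{i=1}^{n} \frac{1}{w_i} \left( \tau_i - \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right) \right] \end{aligned}$$

Since $\frac{1}{nm}$ is a positive constant and $\tau_i$ and $w_i$ are treated as constants with respect to $\theta$, it follows that

$$\arg\max_\theta L_{\{\tau_i\}_i}(\theta) = \arg\min_\theta \; \mathbb{E}_{z_1,\ldots,z_m \sim \mathcal{N}(0,I)} \left[ \sum_{i=1}^{n} \frac{1}{w_i} \max\left( \min_{j \in [m]} d(x_i, T_\theta(z_j)), \, \delta\tau_i \right) \right]$$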



Footnote 1: The weighted objective will always converge to such a solution since we can choose δ to be close to 1.



Figure 1: Schematic illustration that compares vanilla IMLE (Li & Malik, 2018) (top row) with the proposed algorithm, Adaptive IMLE (bottom row). While IMLE treats all training examples (denoted by the squares on the left) equally and pulls the generated samples (denoted by the circles on the left) towards them at a uniform pace, Adaptive IMLE adapts to the varying difficulty of each training example and pulls the generated samples towards them at an individualized pace that depends on the training example. The dashed line on the left figure illustrates the progression towards three data points at four comparable epochs with the starting positions highlighted. The corresponding generated samples are shown on the right. As shown, Adaptive IMLE can converge to the various data points faster and closer than IMLE.


Figure 2: Qualitative comparison of images generated by our method and those generated by the baselines, FastGAN (Liu et al., 2021) and MixDL (Kong et al., 2022). As shown, our samples are of higher quality and have greater diversity. On the other hand, the samples generated by the baselines show more limited diversity, which is validated by the recall results in Table 2, which suggest that the baselines exhibit mode collapse.

Figure 3: Visualizations of the reconstructions of an unseen target image from LPIPS backtracking. While the reconstructions of MixDL and FastGAN are structurally dissimilar from the target images, the reconstructions of our method are structurally similar to the target images.

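Lemma 4. Suppose $p_\theta$ is continuous at all data points $x_1, \ldots, x_n$. Under the choice of $w_i = \int_{\delta\tau_i}^{\tau_i} \mathrm{vol}(B_t(x_i)) \, dt := \int_{\delta\tau_i}^{\tau_i} \int_{B_t(x_i)} dx \, dt$, where $B_r(x) = \{y \mid d(y, x) < r\}$ is an open ball of radius $r$ centred at $x$,

$$\lim_{\{\tau_i \to 0^+\}_i} L_{\{\tau_i\}_i}(\theta) = \frac{1}{n} \sum_{i=1}^{n} p_\theta(x_i)$$

Proof. Since $\Pr(d(x_i, T_\theta(z_j)) < t) = \int_{B_t(x_i)} p_\theta(x) \, dx$ for every $j$,

$$\begin{aligned} \lim_{\{\tau_i \to 0^+\}_i} L_{\{\tau_i\}_i}(\theta) &= \lim_{\{\tau_i \to 0^+\}_i} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m w_i} \sum_{j=1}^{m} \int_{\delta\tau_i}^{\tau_i} \Pr\left(d(x_i, T_\theta(z_j)) < t\right) dt \quad \text{(Lemma 3)} \\ &= \frac{1}{n} \sum_{i=1}^{n} \lim_{\tau_i \to 0^+} \frac{\int_{\delta\tau_i}^{\tau_i} \int_{B_t(x_i)} p_\theta(x) \, dx \, dt}{\int_{\delta\tau_i}^{\tau_i} \int_{B_t(x_i)} dx \, dt} \\ &= \frac{1}{n} \sum_{i=1}^{n} \lim_{\tau_i \to 0^+} \frac{\int_{B_{\tau_i}(x_i)} p_\theta(x) \, dx - \delta \int_{B_{\delta\tau_i}(x_i)} p_\theta(x) \, dx}{\int_{B_{\tau_i}(x_i)} dx - \delta \int_{B_{\delta\tau_i}(x_i)} dx} \quad \text{(L'Hôpital and 2nd FTC)} \\ &= \frac{1}{n} \sum_{i=1}^{n} \lim_{\tau_i \to 0^+} \frac{\int_{B_{\tau_i}(x_i)} p_\theta(x) \left( 1 - \delta \mathbb{1}_{B_{\delta\tau_i}(x_i)}(x) \right) dx}{\int_{B_{\tau_i}(x_i)} \left( 1 - \delta \mathbb{1}_{B_{\delta\tau_i}(x_i)}(x) \right) dx} \\ &= \frac{1}{n} \sum_{i=1}^{n} \lim_{\tau_i \to 0^+} \frac{\int_0^{\tau_i} \left( 1 - \delta \mathbb{1}_{\{r < \delta\tau_i\}}(r) \right) \int_{\{x \mid d(x, x_i) = r\}} p_\theta(x) \, dx \, dr}{\int_0^{\tau_i} \left( 1 - \delta \mathbb{1}_{\{r < \delta\tau_i\}}(r) \right) \int_{\{x \mid d(x, x_i) = r\}} dx \, dr} \\ &= \frac{1}{n} \sum_{i=1}^{n} p_\theta(x_i) \end{aligned}$$

The last step holds because each ratio is a weighted average of $p_\theta$ over $B_{\tau_i}(x_i)$ with weights $1 - \delta \mathbb{1}_{B_{\delta\tau_i}(x_i)}(x) \geq 1 - \delta > 0$, so by the continuity of $p_\theta$ at $x_i$ it converges to $p_\theta(x_i)$ as $\tau_i \to 0^+$.

Note that under common metrics like $\ell_p$ distances, $w_i$ can be found in closed form. Writing $d$ for the dimensionality of the data, $\mathrm{vol}(B_t(x_i)) = (2t)^d \frac{\Gamma(1 + 1/p)^d}{\Gamma(1 + d/p)}$, and so

$$w_i = \int_{\delta\tau_i}^{\tau_i} \mathrm{vol}(B_t(x_i)) \, dt = \int_{\delta\tau_i}^{\tau_i} (2t)^d \frac{\Gamma(1 + 1/p)^d}{\Gamma(1 + d/p)} \, dt = \frac{(2\tau_i)^{d+1} \left( 1 - \delta^{d+1} \right)}{2(d+1)} \cdot \frac{\Gamma(1 + 1/p)^d}{\Gamma(1 + d/p)},$$

where $\Gamma(\cdot)$ denotes the gamma function.

B. PSEUDOCODE FOR IMLE

Algorithm 2 Implicit maximum likelihood estimation (IMLE) procedure
Require: The set of inputs $\{x_i\}_{i=1}^{n}$
  Initialize the parameters $\theta$ of the generator $T_\theta$
  for $k = 1$ to $K$ do
    Pick a random batch $S \subseteq [n]$
    Draw latent codes $Z \leftarrow z_1, \ldots, z_m$ from $\mathcal{N}(0, I)$
    $\sigma(i) \leftarrow \arg\min_{j \in [m]} d(x_i, T_\theta(z_j)) \;\; \forall i \in S$
    [...]
      $\theta \leftarrow \theta - \eta \nabla_\theta \sum_{i \in S} d(x_i, T_\theta(z_{\sigma(i)}))$ [...]
    end for
  end for
  return $\theta$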


Table 3: Adaptive IMLE significantly improves perceptual quality (FID) and recall compared to vanilla IMLE, while maintaining similarly high levels of precision.

