DISCRETE CONTRASTIVE DIFFUSION FOR CROSS-MODAL MUSIC AND IMAGE GENERATION

Abstract

Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and the generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route: we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining diffusion training and contrastive learning for the first time by connecting the contrastive loss with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive overall synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and significantly increasing inference speed.

1. INTRODUCTION

Generative tasks that seek to synthesize data in different modalities, such as audio and images, have attracted much attention. The recently explored diffusion probabilistic models (DPMs) Sohl-Dickstein et al. (2015b) have served as a powerful generative backbone that achieves promising results in both unconditional and conditional generation Kong et al. (2020); Mittal et al. (2021); Lee & Han (2021); Ho et al. (2020); Nichol & Dhariwal (2021); Dhariwal & Nichol (2021); Ho et al. (2022); Hu et al. (2021). Compared to the unconditional case, conditional generation is usually applied in more concrete and practical cross-modal scenarios, e.g., video-based music generation Di et al. (2021); Zhu et al. (2022a); Gan et al. (2020a) and text-based image generation Gu et al. (2022); Ramesh et al. (2021); Li et al. (2019); Ruan et al. (2021). Most existing DPM-based conditional synthesis works Gu et al. (2022); Dhariwal & Nichol (2021) learn the connection between the conditioning and the generated data implicitly by adding a prior to the variational lower bound Sohl-Dickstein et al. (2015b). While such approaches still achieve high generation fidelity, the correspondence between the conditioning and the synthesized data can sometimes get lost, as illustrated in the right column of Fig. 1. In this paper, we aim to explicitly enhance input-output faithfulness by maximizing the mutual information between the conditioning input and the generated output within the conditional diffusion framework. Examples of our synthesized music audio and image results are given in Fig. 1.
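Mutual-information maximization of this kind is commonly realized with an InfoNCE-style contrastive bound, which treats each matched (conditioning, output) pair in a batch as a positive and all mismatched pairings as negatives. The following is a minimal sketch of such a contrastive loss over precomputed embeddings; the embedding shapes and the temperature value are illustrative assumptions, not the paper's exact CDCD formulation.

```python
import numpy as np

def infonce_loss(cond_emb, out_emb, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of N matched
    (conditioning, output) embedding pairs; minimizing it maximizes
    a lower bound on their mutual information."""
    # L2-normalize so the dot product is cosine similarity.
    c = cond_emb / np.linalg.norm(cond_emb, axis=-1, keepdims=True)
    x = out_emb / np.linalg.norm(out_emb, axis=-1, keepdims=True)
    # (N, N) similarity matrix; matched pairs lie on the diagonal.
    logits = (c @ x.T) / temperature
    # Row-wise log-softmax (shifted for numerical stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of picking the matched output for each input.
    return -np.mean(np.diag(log_prob))
```

Perfectly matched pairs drive the loss toward zero, while randomly paired embeddings leave it near log N; in the paper's setting, this contrastive term is combined with the conventional variational diffusion objective rather than used alone.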

