DISCRETE CONTRASTIVE DIFFUSION FOR CROSS-MODAL MUSIC AND IMAGE GENERATION

Abstract

Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is high correspondence between the conditioning input and the generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route: we explicitly enhance the input-output connection by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining diffusion training and contrastive learning for the first time by connecting the contrastive loss to the conventional variational objective. We demonstrate the efficacy of our approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we improve the input-output correspondence and achieve higher or competitive overall synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and significantly increasing inference speed.

1. INTRODUCTION

Generative tasks that seek to synthesize data in different modalities, such as audio and images, have attracted much attention. The recently explored diffusion probabilistic models (DPMs) Sohl-Dickstein et al. (2015b) have served as a powerful generative backbone that achieves promising results in both unconditional and conditional generation Kong et al. (2020); Mittal et al. (2021); Lee & Han (2021); Ho et al. (2020); Nichol & Dhariwal (2021); Dhariwal & Nichol (2021); Ho et al. (2022); Hu et al. (2021). Compared to the unconditional case, conditional generation is usually applied in more concrete and practical cross-modal scenarios, e.g., video-based music generation Di et al. (2021); Zhu et al. (2022a); Gan et al. (2020a) and text-based image generation Gu et al. (2022); Ramesh et al. (2021); Li et al. (2019); Ruan et al. (2021). Most existing DPM-based conditional synthesis works Gu et al. (2022); Dhariwal & Nichol (2021) learn the connection between the conditioning and the generated data implicitly, by adding a prior to the variational lower bound Sohl-Dickstein et al. (2015b). While such approaches still attain high generation fidelity, the correspondence between the conditioning and the synthesized data can sometimes get lost, as illustrated in the right column of Fig. 1. In this paper, we therefore aim to explicitly enhance input-output faithfulness in conditional settings by maximizing the mutual information between input and output under the diffusion generative framework. Examples of our synthesized music audio and image results are given in Fig. 1.

Inspired by the success of maximizing the mutual information between a learned latent embedding z and the raw data x in conventional contrastive representation learning Oord et al. (2018), we show that this can be done effectively within our proposed contrastive diffusion framework. Specifically, we reformulate the optimization problem for the desired conditional generative tasks via DPMs, by analogy between the embedding z and raw data x above and our conditioning input and synthesized output. We introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process: step-wise parallel diffusion, which invokes multiple parallel diffusion processes during contrastive learning, and sample-wise auxiliary diffusion, which maintains a single principal diffusion process. We demonstrate that with the proposed contrastive diffusion method we can not only train so as to maximize the desired mutual information, by connecting the CDCD loss with the conventional variational objective function, but also directly optimize the generative network p. The optimized CDCD loss further encourages faster convergence of a DPM with fewer diffusion steps. We additionally present intra- and inter-negative sampling methods, which provide internally disordered and instance-level negative samples, respectively. To better illustrate the input-output connections, we conduct our main experiments on the novel cross-modal dance-to-music generation task Zhu et al. (2022a), which aims to generate music audio from silent dance videos. Compared to other tasks such as text-to-image synthesis, dance-to-music
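As a rough illustration only (not the paper's implementation), the two ingredients described above can be sketched as follows: an InfoNCE-style contrastive objective, whose minimization maximizes a lower bound on the input-output mutual information, and the construction of intra-negatives (internally disordered copies of one sample) and inter-negatives (other instances in the batch). The function names and the use of abstract scalar scores are our own assumptions for this sketch; in the paper the scores would come from the denoising network's likelihood of an output under the true versus negative conditioning.

```python
import numpy as np


def intra_negatives(tokens, n_neg, rng):
    """Intra-negatives (hypothetical sketch): internally disordered copies
    of a single discrete token sequence, obtained by random permutation."""
    return [rng.permutation(tokens) for _ in range(n_neg)]


def inter_negatives(batch, idx):
    """Inter-negatives (hypothetical sketch): instance-level negatives,
    i.e., the other samples in the batch."""
    return [x for j, x in enumerate(batch) if j != idx]


def infonce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style contrastive loss: a cross-entropy that identifies the
    true (condition, output) pair among negatives. Minimizing it maximizes
    a lower bound on the mutual information between input and output."""
    logits = np.concatenate(([pos_score], np.asarray(neg_scores, dtype=float)))
    logits = logits / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

For example, a model that scores the true pair well above its negatives incurs a near-zero loss, while indistinguishable scores give the chance-level loss log(N + 1) for N negatives.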

Figure 1: Examples of the input (left column) and synthesized output (middle column) from our contrastive diffusion model for dance-to-music (Rows 1-2), text-to-image (Rows 3-4), and class-conditioned (Row 5) generation experiments on five datasets. The right column shows data synthesized by existing methods Zhu et al. (2022a); Gu et al. (2022) with reasonable quality but weaker correspondence to the input.

