UNIFIED DISCRETE DIFFUSION FOR SIMULTANEOUS VISION-LANGUAGE GENERATION

Abstract

Recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multimodal signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks with a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks.

1. INTRODUCTION

Here, we take discrete diffusion models into a new realm: multi-modality generation with a unified vision-language model. In contrast to the modality translation tasks outlined above, our multi-modality generative model does not require any conditional signal given in advance and is capable of simultaneously generating content pairs across the associated modalities. UniD3, our Unified Discrete Denoising Diffusion model, constructs a joint vision-language probability distribution by mixing discrete image tokens with text tokens, enabling the simultaneous generation of cross-domain results. This is achieved by a two-stage framework, illustrated in Fig. 2: (1) an offline model that produces a compact yet expressive discrete representation for both images and texts (the pink part in Fig. 2); (2) a novel unified discrete diffusion model that estimates the joint distribution of these latent visual and language codes (the cyan part in Fig. 2).

Once trained, UniD3 not only inherits the ability to manipulate a provided text or image, but also unifies text and image generation, e.g., unconditional generation of vision-language pairs, cross-modal manipulation, text-guided image completion, and image-conditioned captioning (Fig. 1 depicts the tasks our model can handle). Empirically, in the unconditional setting our model achieves image quality comparable to text-conditioned generation, and its image captions are competitive with several state-of-the-art methods.

In summary, our key contributions include the following:

• We design a specific Markov transition matrix for our unified discrete denoising diffusion model, which enables sophisticated control of the diffusion process, to estimate the joint
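To make the unified transition matrix concrete, the following is a minimal sketch of how a single forward-diffusion step over a joint image-plus-text vocabulary could be parameterised. It follows the common absorbing-state ("[MASK]") formulation of discrete diffusion; the vocabulary sizes, the keep/mask probabilities `alpha` and `gamma`, and the choice to confine uniform corruption within each modality are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def unified_transition_matrix(V_img: int, V_txt: int, alpha: float, gamma: float) -> np.ndarray:
    """One-step transition matrix Q_t over the joint vocabulary (sketch).

    With probability `alpha` a token keeps its value, with probability
    `gamma` it jumps to the shared [MASK] state, and the remaining mass
    is spread uniformly over tokens of the SAME modality, so image
    tokens never corrupt into text tokens and vice versa (assumed here).
    """
    V = V_img + V_txt + 1            # +1 for the shared [MASK] state
    mask_id = V - 1
    Q = np.zeros((V, V))
    beta_img = (1.0 - alpha - gamma) / V_img
    beta_txt = (1.0 - alpha - gamma) / V_txt
    # Image-token rows: uniform mass within the image vocabulary + mask.
    Q[:V_img, :V_img] = beta_img
    Q[:V_img, mask_id] = gamma
    # Text-token rows: uniform mass within the text vocabulary + mask.
    Q[V_img:V_img + V_txt, V_img:V_img + V_txt] = beta_txt
    Q[V_img:V_img + V_txt, mask_id] = gamma
    # Diagonal: add the keep-probability on top of the uniform share.
    idx = np.arange(V - 1)
    Q[idx, idx] += alpha
    # [MASK] is absorbing: once masked, a token stays masked.
    Q[mask_id, mask_id] = 1.0
    return Q
```

Every row sums to one (e.g. an image row gives `beta_img * V_img + gamma + alpha = 1`), and the block structure guarantees zero cross-modal leakage during corruption, which is one plausible way to realise the "sophisticated control" the contribution above refers to.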



Figure 1: Examples of various tasks supported by UniD3. The dark brown portions of the image and description represent the [MASK] tokens.

