UNIFIED DISCRETE DIFFUSION FOR SIMULTANEOUS VISION-LANGUAGE GENERATION

Abstract

Recently developed discrete diffusion models perform extraordinarily well on text-to-image tasks, showing significant promise for handling multimodal signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks with a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks.

1. INTRODUCTION

Diffusion models (Ho et al., 2020; Song et al., 2021b) have garnered significant interest for various high-quality conditional image generation tasks, such as image super-resolution (Rombach et al., 2022), image inpainting (Lugmayr et al., 2022), image editing (Avrahami et al., 2022), and image translation (Saharia et al., 2022a), among others. Concurrently, Vector Quantized (VQ) models have also achieved rapid advances in image generation, especially on cross-modal tasks; examples include text-to-image (Kim et al., 2022), sketch-to-image (Esser et al., 2021b), and image-to-video (Wu et al., 2021). Despite their success, all of these generation tasks are designed for only a single modality, i.e. either modality translation or modality generation, with the help of a powerful diffusion model. In particular, modality translation maps the given conditional signals into the target domain, while modality generation covers only unconditional image generation, e.g. on the CelebA (Karras et al., 2017) or LSUN (Yu et al., 2015) datasets. However, none of them consider learning a joint distribution over a mixture of modalities. Here, we take the discrete diffusion model into a new realm: multi-modality generation using a unified vision-language model.

Figure 2: The pipeline of UniD3. With an offline model (red background), the given inputs are represented by discrete token sequences in separate domains. The fused embedding concatenates the tokens of the different modalities and embeds them into the same space. The unified diffusion (blue background) constructs the joint distribution of all modalities from the fused embedding with a fixed unified Markov transition matrix.
In contrast to the modality translation tasks outlined above, our multi-modality generative model does not require any conditional signals to be given a priori and is capable of simultaneously generating content pairs across the associated modalities. UniD3, our new Unified Discrete Denoising Diffusion model, constructs a joint vision-language probability distribution by mixing discrete image tokens with text tokens, enabling the simultaneous generation of cross-domain results. This is achieved by a two-stage framework, illustrated in Fig. 2: (1) an offline model generates a compact yet expressive discrete representation for both images and texts (the pink part in Fig. 2); (2) a novel unified discrete diffusion model estimates the joint distribution of these latent visual and language codes (the cyan part in Fig. 2). Once trained, UniD3 not only inherits the ability to manipulate a provided text or image, but also unifies text and image generation, e.g., unconditional vision-language pair generation, cross-modal manipulation, text-guided image completion, and image-conditioned captioning (Fig. 1 depicts the tasks that our model is capable of handling). Empirically, our model achieves image quality in the unconditional case comparable to the text-conditional case, and in image captioning it is comparable to several state-of-the-art methods. In summary, our key contributions include the following:
• We design a specific Markov transition matrix for our unified discrete denoising diffusion model, which leads to sophisticated control of the diffusion process for estimating the joint distribution of language and images. The purposive design of transition matrices based on task objectives and data properties is also pioneering for discrete diffusion models.
• We further propose a mutual attention mechanism with fused embedding to fulfil the objective of multi-modal integration, and adopt a unified objective function that provides the optimization with more concise constraints.
• To the best of our knowledge, UniD3 is the first work to address both multi-modality generation and modality translation, handling simultaneous unconditional vision-language generation and bi-directional vision-language synthesis with a single model.

2. PRELIMINARIES

Vector Quantised Model The Vector-Quantised Variational Auto-Encoder (VQ-VAE) (Van Den Oord et al., 2017) learns to embed high-dimensional data, e.g., images or audio, into a discrete representation. In particular, given high-dimensional data $x \in \mathbb{R}^{C \times H \times W}$, the encoder $E$ first converts it to spatial latent features $z = \{z_{i,j}\} \in \mathbb{R}^{d \times h \times w}$, and then transfers these continuous features into a discrete space by looking up the closest entries in the codebook $Z = \{z_k\} \in \mathbb{R}^{K \times d}$ to obtain the tokens $z_q$:

$z_q = \text{Quantise}(z) = \text{Quantise}(E(x)) := \arg\min_k \|z_{i,j} - z_k\|$,

where the dimensions $h, w$ of the latent feature $z$ are substantially smaller than the original dimensions $H, W$. The reconstruction can be obtained through a decoder $G$: $\hat{x} = G(z_q) = G(\text{Quantise}(E(x)))$. Recently, there have been significant advances in learning more compact representations with higher-quality reconstruction, such as introducing new losses (Esser et al., 2021a), applying powerful backbones (Yu et al., 2021), using multi-channel representations (Lee et al., 2022) and spatial normalization (Zheng et al., 2022). However, in this paper, we focus on applying this technology in our novel UniD3, rather than exploring a new codebook learning approach.
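As a concrete illustration, the nearest-neighbour codebook lookup can be sketched in a few lines of numpy. This is a toy example with made-up features and codebook, not the authors' implementation:

```python
import numpy as np

def quantise(z, codebook):
    """Map each d-dim latent feature to the index of its nearest codebook entry.

    z:        (h*w, d) continuous encoder features
    codebook: (K, d) learned embedding vectors
    returns:  (h*w,) integer token indices
    """
    # Squared Euclidean distance between every feature and every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (h*w, K)
    return dists.argmin(axis=1)

# Toy example: 4 features, codebook of 3 entries in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [0.0, 0.2]])
tokens = quantise(z, codebook)  # one discrete token per spatial position
```

In a full VQ-VAE these indices are what the discrete diffusion model operates on; the decoder later maps the indexed codebook vectors back to pixels.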

Discrete Diffusion

The discrete diffusion model was originally mentioned in (Sohl-Dickstein et al., 2015), with transitions converging to a binomial distribution. Subsequent work extended this to multinomial diffusion (Hoogeboom et al., 2021; Song et al., 2021a), while Austin et al. (2021) provided more options for transition matrices. Recent works integrated discrete diffusion models with VQ-VAE, allowing for high-quality image synthesis (Gu et al., 2022; Hu et al., 2022). Here we briefly describe multinomial diffusion with the absorbing state, as employed in VQ-Diffusion (Gu et al., 2022). Besides the $K$ tokens from a discrete VAE, an additional [MASK] token is introduced. The forward process is defined as:

$q(x_t | x_{t-1}) = \text{Cat}(x_t; p = Q_t x_{t-1}) = x_t^T Q_t x_{t-1}$,

where $x$ is a one-hot vector identifying the token index, and $[Q_t]_{i,j} = q(x_t = i | x_{t-1} = j) \in \mathbb{R}^{(K+1) \times (K+1)}$ is the Markov transition matrix from $t-1$ to $t$, which can be expressed as:

$Q_t = \begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \alpha_t + \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \beta_t & \alpha_t + \beta_t & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_t & \gamma_t & \gamma_t & \cdots & 1
\end{bmatrix}$,

where $\alpha_t \in [0, 1]$ is the probability of retaining the token; each token has a probability $\gamma_t$ of being replaced by the [MASK] token, leaving a chance of $\beta_t = (1 - \alpha_t - \gamma_t)/K$ of being diffused to another token. The posterior of the diffusion process can be formulated as:

$q(x_{t-1} | x_t, x_0) = \dfrac{q(x_t | x_{t-1}, x_0)\, q(x_{t-1} | x_0)}{q(x_t | x_0)} = \dfrac{(x_t^T Q_t x_{t-1})(x_{t-1}^T \overline{Q}_{t-1} x_0)}{x_t^T \overline{Q}_t x_0}$,

where $\overline{Q}_t = Q_t \cdots Q_1$ can be calculated in closed form, and $q(x_t | x_0) = \text{Cat}(x_t; p = \overline{Q}_t x_0) = x_t^T \overline{Q}_t x_0$. In the reverse process, instead of explicitly predicting the posterior with a denoising neural network, an $x_0$-parameterisation can increase stability and permit fast inference (skipping $\Delta t$ steps per iteration).
The reverse transition with reparameterisation is given as:

$p_\theta(x_{t-1} | x_t) \propto \sum_{\hat{x}_0} q(x_{t-1} | x_t, \hat{x}_0)\, p_\theta(\hat{x}_0 | x_t)$,

in which the neural network predicts the logits of the target data distribution $q(x_0)$.
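The absorbing-state transition matrix described above can be constructed and sanity-checked directly. The following numpy sketch (toy sizes, not tied to any released code) verifies that each column conserves probability mass:

```python
import numpy as np

def single_modal_Q(K, alpha, gamma):
    """Transition matrix for multinomial diffusion with an absorbing [MASK] state.

    State layout: indices 0..K-1 are codebook tokens, index K is [MASK].
    Column j gives q(x_t = i | x_{t-1} = j).
    """
    beta = (1.0 - alpha - gamma) / K
    Q = np.full((K + 1, K + 1), beta)
    np.fill_diagonal(Q, alpha + beta)
    Q[K, :] = gamma          # any token may be absorbed into [MASK]
    Q[:, K] = 0.0            # ...but [MASK] never leaves,
    Q[K, K] = 1.0            # so its column puts all mass on itself
    return Q

Q = single_modal_Q(K=5, alpha=0.9, gamma=0.05)
assert np.allclose(Q.sum(axis=0), 1.0)  # columns conserve probability mass
```

Each non-mask column sums to alpha + K*beta + gamma = 1, matching the requirement stated for valid discrete-diffusion transition matrices.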

3. METHOD

Our goal is to adapt the discrete diffusion model to learn the joint distribution of linguistic and visual features concurrently. First, we propose a transition matrix that allows the diffusion model to capture the implicit association between text and images. Second, we present a mutual attention transformer architecture with a fused embedding layer as the denoising function, together with a unified objective function, which fits our unified diffusion process and permits more precise predictions. The overall pipeline is illustrated in Fig. 2. Specifically, our solution begins by compressing the images and texts into discrete token sequences using a dVAE and BPE, respectively. A robust diffusion model with a unified transition matrix is then constructed to fit the joint distribution across the different modalities, which is further empowered by a transformer with a mutual attention mechanism.

3.1. UNIFIED DIFFUSION PROCESS

Unified Transition Matrix Discrete diffusion models with transition matrices can capture global links. The transition matrix determines the nature of the discrete diffusion model and provides more choices for token evolution. We may therefore ask whether it is feasible to design transition matrices that capture the global connections between various modalities. The Markov transition matrix of a discrete diffusion model should satisfy the following requirements: 1. each column of $Q_t$ should sum to one to conserve probability mass; 2. each column of the cumulative product $\overline{Q}_t$ should converge to either a known stationary distribution or a learnt prior when $t$ becomes large. On the basis of these criteria, we construct a unified transition matrix $[Q_t]_{i,j} = q(x_t = i | x_{t-1} = j)$ capable of encapsulating discrete representations of various modalities. The following matrix $Q_t \in \mathbb{R}^{(K+K^*+1) \times (K+K^*+1)}$ illustrates a unified transition process with only the text and image modalities:

$Q_t = \begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & \beta_t & 0 & 0 & \cdots & 0 & 0 \\
\beta_t & \alpha_t + \beta_t & \cdots & \beta_t & 0 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t + \beta_t & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cdots & 0 & \alpha_t + \beta^*_t & \beta^*_t & \cdots & \beta^*_t & 0 \\
0 & 0 & \cdots & 0 & \beta^*_t & \alpha_t + \beta^*_t & \cdots & \beta^*_t & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 0 & \beta^*_t & \beta^*_t & \cdots & \alpha_t + \beta^*_t & 0 \\
\gamma_t & \gamma_t & \cdots & \gamma_t & \gamma_t & \gamma_t & \cdots & \gamma_t & 1
\end{bmatrix}$,

where $\alpha_t \in [0, 1]$ is the probability of keeping a token, $\beta_t$ and $\beta^*_t$ are the probabilities of a token being replaced by any other accessible token within its own modality, and $\gamma_t$ is the absorbing probability. The dimension of the matrix $Q_t$ is $(K + K^* + 1) \times (K + K^* + 1)$, where $K$ and $K^*$ are the numbers of states in the different modalities, e.g., $K$ is the size of the codebook in the discrete VAE and $K^*$ is the dictionary size of BPE.
The matrix comprises five sections:
• The final row and column form a section associated with the transition of the absorbing state. Intuitively, if a token is [MASK] at step $t-1$, it must remain [MASK] at step $t$. Conversely, every other modal token has the same probability $\gamma_t$ of being diffused to [MASK].
• The remainder of the matrix is composed of four quadrants, the first and third of which are entirely zero. These two sub-matrices prevent tokens from transitioning from one modality to another. The second and fourth quadrants closely resemble the matrix for multinomial diffusion: in addition to some probability of being converted into a [MASK] token, each token has some chance of transiting to a different state within the same modality, or remaining unaltered.
• The dimensions of the four quadrants are not identical: $K \times K^*$, $K \times K$, $K^* \times K$ and $K^* \times K^*$, respectively. Note that $\alpha_t$ and $\gamma_t$ are the same for all modalities, whereas $\beta_t$ varies with the number of states per modality; mathematically, $\beta_t = (1 - \alpha_t - \gamma_t)/K$ and $\beta^*_t = (1 - \alpha_t - \gamma_t)/K^*$.
The sum of each column of this transition matrix is one, preserving probability mass, and all the mass of the stationary distribution falls on the [MASK] token, which satisfies the prerequisites for a discrete diffusion model transition matrix. The computation of $\overline{Q}_t x_0$, needed for $q(x_t|x_0)$ in Eq. 6, can be obtained efficiently in closed form:

$\overline{Q}_t x_0 = \bar{\alpha}_t x_0 + \big(\bar{\gamma}_t - \mathbb{1}(x_0)\bar{\beta}_t - \mathbb{1}^*(x_0)\bar{\beta}^*_t\big) x_{[M]} + \mathbb{1}(x_0)\bar{\beta}_t + \mathbb{1}^*(x_0)\bar{\beta}^*_t$,

where $\mathbb{1}(x_0) = 1$ if $\arg\max x_0 \in [0, K)$ and $0$ otherwise, $\mathbb{1}^*(x_0) = 1$ if $\arg\max x_0 \in [K, K + K^*)$ and $0$ otherwise, $x_{[M]}$ is the one-hot vector with $\arg\max x = K + K^*$, and $\bar{\alpha}_t, \bar{\beta}_t, \bar{\beta}^*_t, \bar{\gamma}_t$ are the corresponding cumulative products. The detailed proof is provided in Appendix E.
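Putting the pieces together, the unified matrix can be sketched as a block construction; the sizes and rates below are toy values for illustration, not the paper's settings:

```python
import numpy as np

def unified_Q(K, K_star, alpha, gamma):
    """Unified transition matrix over image tokens [0, K), text tokens
    [K, K+K*), and a shared [MASK] state at index K + K*. Columns sum to one,
    and the off-diagonal zero quadrants forbid cross-modal transitions."""
    beta, beta_star = (1 - alpha - gamma) / K, (1 - alpha - gamma) / K_star
    n = K + K_star + 1
    Q = np.zeros((n, n))
    Q[:K, :K] = beta                              # image-to-image block
    Q[K:K + K_star, K:K + K_star] = beta_star     # text-to-text block
    idx = np.arange(K + K_star)
    Q[idx, idx] += alpha                          # keep-probability on diagonal
    Q[-1, :-1] = gamma                            # absorption into [MASK]
    Q[-1, -1] = 1.0                               # [MASK] is absorbing
    return Q

Q = unified_Q(K=4, K_star=3, alpha=0.8, gamma=0.1)
assert np.allclose(Q.sum(axis=0), 1.0)
assert np.all(Q[:4, 4:7] == 0) and np.all(Q[4:7, :4] == 0)  # no cross-modal leakage
```

The two assertions check exactly the two requirements stated above: probability-mass conservation per column, and the zero quadrants that keep each modality in its own state range.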
Unified Objective For conditional generation tasks such as text-to-image generation, the goal is to find $x$ that maximizes $p(x|y)$, where $y$ is the given text condition. In our task, the model instead maximizes the joint distribution $p(x, y)$, generating both modalities simultaneously. In practice, we minimize the Kullback-Leibler divergence between $q(x_{t-1}|x_t, x_0)$ and $p_\theta(x_{t-1}|x_t)$ in both the image and text directions:

$L_0 = -E_{q(x_1|x_0)}\big[\log p_\theta(x^{img}_0 | x_1, x^{txt}_0) + \log p_\theta(x^{txt}_0 | x_1, x^{img}_0)\big]$,
$L_{t-1} = E_{q(x_t|x_0)}\Big[D_{KL}\Big(q(x_{t-1}|x_t, x_0)\,\big\|\,\big[p_\theta(x^{img}_{t-1}|x_t);\, p_\theta(x^{txt}_{t-1}|x_t)\big]\Big)\Big]$,

where $p_\theta(x_{t-1})$ is the combination of the logits $p_\theta(x^{img}_{t-1})$ and $p_\theta(x^{txt}_{t-1})$ from the separate modalities, which can be obtained with the $x_0$-parameterisation as follows:

$p_\theta(x^{img}_{t-1}|x_t) \propto \sum_{\hat{x}^{img}_0} q(x^{img}_{t-1} | x_t, \hat{x}^{img}_0, x^{txt}_0)\, p_\theta(\hat{x}^{img}_0 | x_t)$,
$p_\theta(x^{txt}_{t-1}|x_t) \propto \sum_{\hat{x}^{txt}_0} q(x^{txt}_{t-1} | x_t, \hat{x}^{txt}_0, x^{img}_0)\, p_\theta(\hat{x}^{txt}_0 | x_t)$.

Thanks to the $x_0$-parameterisation, our model is also capable of fast sampling by increasing the step size. The last term of the variational lower bound, $L_T = D_{KL}(q(x_T|x_0) \| p(x_T))$, is a constant and can be ignored during training, as the prior $p(x_T)$ is fixed by the unified transition matrix: $p(x_T) = (\bar{\beta}_T, \bar{\beta}_T, \ldots, \bar{\beta}^*_T, \bar{\beta}^*_T, \ldots, \bar{\gamma}_T)^T$. The full expression of the loss function can be found in Appendix C.
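Schematically, the per-modality KL terms of the objective combine by accumulating over image and text positions. The sketch below uses random categorical distributions purely to illustrate the bookkeeping; the shapes and variable names are assumptions, not the paper's configuration:

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) for batched categorical distributions over the last axis."""
    return (q * (np.log(q + eps) - np.log(p + eps))).sum(-1)

def norm(x):
    return x / x.sum(-1, keepdims=True)

# Hypothetical shapes: 4 image positions over K=6 states, 3 text positions
# over K*=5 states. q_* stands in for the true posterior, p_* for the model.
rng = np.random.default_rng(0)
q_img, p_img = norm(rng.random((4, 6))), norm(rng.random((4, 6)))
q_txt, p_txt = norm(rng.random((3, 5))), norm(rng.random((3, 5)))

# L_{t-1}-style term: posterior-vs-model KL accumulated over both modalities.
loss = categorical_kl(q_img, p_img).mean() + categorical_kl(q_txt, p_txt).mean()
assert loss >= 0  # KL divergence is non-negative in each modality
```

The point of the sketch is only the structure: each modality contributes its own KL against its own predicted logits, and the two contributions are summed into one unified objective.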

3.2. DENOISING FUNCTION FOR MULTIMODAL

Mutual Attention As indicated by Eqs. 12 & 13, the neural network is responsible for predicting the distribution $p_\theta(\hat{x}_0 | x_t)$. However, the input to the neural network covers all modalities, and a simple self-attention mechanism can scarcely highlight the inter-modal linkages. In addition, the distributions predicted by the neural network need to be decoupled according to the various modalities during the reparameterisation. In other words, the network predictions should be expressed per modality, e.g., $p_\theta(\hat{x}^{img}_0 | x_t)$ and $p_\theta(\hat{x}^{txt}_0 | x_t)$, while retaining their interconnections. Therefore, we propose the mutual attention mechanism and construct the unified transformer shown in Fig. 3. The unified transformer contains several transformer blocks, each of which consists of one self-attention, two parallel mutual attention operations and one feed-forward module. Each block receives a sequence of mixed-modal tokens as input, which first traverses a layer of self-attention to capture the inherent connections within the modalities. Afterwards, the hidden tokens are decoupled according to the locations of the different modalities and fed to the corresponding mutual attention layers. Mutual attention is a modified version of the cross-attention layer: whereas the conditional inputs to cross-attention are kept constant, the inputs to mutual attention are derived from the hidden features. Next, the outputs of the mutual attention layers are concatenated into one mixed-modal token sequence for transmission to the next block. At the end of the unified transformer, we layer-normalise the sequence of tokens from the blocks, which is then decoupled into the different predictive distributions by fully-connected layers. In our model, each modality may become a component of what needs to be generated.
Therefore, we propose mutual attention to enable tokens of different modalities in a sequence to be conditioned on each other, allowing the capture of the relationships between the various modalities. Mathematically, our mutual attention can be expressed as:

$\text{MA}(T_i, T_j) = \text{Attn}(T_i, T_j; W) = \text{softmax}\Big(\dfrac{Q_i K_j^T}{\sqrt{d}}\Big) V_j$, with $Q_i = W_Q T_i$, $K_j = W_K T_j$, $V_j = W_V T_j$,

where $T_i$ and $T_j$ are the tokens of two different modalities, e.g. image and text. The unified block contains one self-attention and several mutual attention layers in parallel. Given a hidden token vector $H' = [H'_i, H'_j]$, the whole pipeline of the unified block is:
1. $T' = \text{SA}(H')$
2. Decouple: $T' \rightarrow T'_i, T'_j$
3. $T_i = \text{MA}(T'_i, T'_j)$, $T_j = \text{MA}(T'_j, T'_i)$
4. Couple: $T \leftarrow [T_i, T_j]$

Figure 4: Unconditionally generated vision-language pairs; sample captions include "this bird has a white belly, a spotted breast, a short tail, and pointed bill.", "a clean and tidy kitchen with wooden cabinets for use.", "a tall giraffe is standing in a forest.", and "colorful flowers in an orange vase on the window."

Fused Embedding In addition, discrete tokens of the various modalities should be embedded in the same space. Considering that the two modalities have $K$ and $K^*$ states respectively, we first create an embedding layer of size $K + K^* + 1$, with the additional state for the [MASK] token. We then create a learnable spatial position encoding for the image modality and a learnable sequence position encoding for the text modality. The final fused embedding is obtained by adding the embedded vectors to the positional encodings of their associated modalities.
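The decouple / cross-attend / recouple step can be sketched with plain numpy (single head, shared toy weights, no layer norms or feed-forward; the sqrt(d) scaling follows standard scaled dot-product attention):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mutual_attention(Ti, Tj, Wq, Wk, Wv):
    """MA(Ti, Tj): queries come from modality i, keys/values from modality j."""
    Q, K, V = Ti @ Wq, Tj @ Wk, Tj @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# One unified-block step (self-attention omitted): decouple, cross-attend, recouple.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
H_img, H_txt = rng.standard_normal((16, d)), rng.standard_normal((5, d))

T_img = mutual_attention(H_img, H_txt, Wq, Wk, Wv)  # image tokens attend to text
T_txt = mutual_attention(H_txt, H_img, Wq, Wk, Wv)  # text tokens attend to image
H = np.concatenate([T_img, T_txt], axis=0)          # recoupled mixed-modal sequence
assert H.shape == (21, d)
```

Swapping the roles of the two token sets in the two calls is what makes the attention "mutual": each modality conditions on the other within the same block.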

4. EXPERIMENTS

The description of the datasets and model experimental details can be found in Appendix B.

4.1. QUANTITATIVE RESULTS

In the unconditional scenario, our model samples from all modalities while maintaining the relationships between them. For conditional generation, we conducted experiments on text-based image generation and image-based caption generation, respectively.
Metrics We evaluate the multimodal capacity of our model in three distinct areas: image sampling quality, text sampling quality, and the similarity between the sampled image and text. Fréchet inception distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016) are used to evaluate the realism and variety of the produced images. BLEU-4, METEOR and SPICE are used to assess the quality of image captions, and CLIP (Radford et al., 2021) is used to measure the similarity between visuals and text.
Image Quality We quantitatively compare our unconditional and text-conditional results with several state-of-the-art text-to-image and bidirectional conditional generation methods, including GAN-based, diffusion-based and autoregressive methods. The comparison results are given in Table 1, in which Pair denotes the unconditionally generated vision-language pairings and T2I denotes text-to-image generation. The first section of the table covers modality-translation-only methods. The models in the second part are claimed to be endowed with cross-modal capabilities; however, such methods cannot achieve simultaneous vision-language pair generation and necessitate given signals.

Table 2: Evaluation of image captioning on the MSCOCO Karpathy split.

Method | BLEU-4 | METEOR | SPICE
(Herdade et al., 2019) | 38.6 | 28.7 | 22.6
AoA (Huang et al., 2019) | 38.9 | 29.3 | -
X-LAN (Pan et al., 2020) | 39.7 | 29.5 | 23.4
SimVLM (Wang et al., 2021) | 40.6 | 33.7 | 25.4
OSCAR (Li et al., 2020) | 40.5 | 29.7 | 22.8
Unifying (Huang et al., 2021) | 37.3 | 28.2 | 21.9
L-verse (Kim et al., 2022) | 39.9 | 31.4 | 23.3
OFA (Wang et al., 2022) | 41.0 | 30.9 | 24.2
Ours (I2T) | 39.6 | 29.3 | 23.4
The first section shows text-conditioned results, while the second part includes multimodal solutions. Text Quality For image captions, we compare against several other text description methods; the details are shown in Table 2, where I2T denotes the image-based text generation task. Similarly, the first part of the table contains pure image captioning solutions, and the bi-directional methods appear in the second part. Some results for the image captioning task are provided in Appendix G.1. Similarity In the multimodal generation phase, all the other methods for comparison need at least one specified modality as input; our method instead generates both visual and verbal content. The generated vision-language pairings are shown in Fig. 4, and more pair samples are provided in Appendix G.4. In this experiment, we use CLIP scores to evaluate the similarity between our generated visual and verbal modalities, while generated images with given texts were used for the other methods. The comparison results are given in Table 3, which shows a higher correlation between the text/image samples of our model.

4.2. CROSS MODAL MANIPULATION

To demonstrate the editing potential of our vision-language federation, we present the outcomes of image alteration with description modification in Fig. 5; more samples can be found in Appendix G.3. In this experiment, we obscured a portion of the image and modified the text description as well; the tokens in the masked region are simply set to [MASK] tokens. We expect the model to complete the image based on the amended text, in conjunction with completing the description based on the visible image region. Given the unmasked image regions, our model can supplement the caption derived from a portion of the image; moreover, based on the amended description, the model can fill in the appearance of the masked component. These findings indicate that our model is capable of modifying text and images in both directions, in addition to jointly generating text and images.

4.3. ABLATION STUDY

In the ablation test, we utilized a rapid sampling scheme with an interval of 10 steps. First, we evaluated the outcomes without mutual attention, replacing all mutual attention in the model with causal self-attention. In addition, we evaluated the performance without the unified transition matrix: we employed a matrix similar to Eq. 4, with out-of-bound outputs substituted by the minimal ordinal number of the corresponding modality.

5. RELATED WORK

Diffusion Models Beyond the continuous diffusion models that have received significant attention, with advances in sample quality (Dhariwal & Nichol, 2021) and sampling speed (Song et al., 2021a; Salimans & Ho, 2021), there is also substantial research on discrete diffusion models (Hoogeboom et al., 2021; Austin et al., 2021).
Modality Translation Text-to-image generation is one of the key modality translation tasks. On simpler datasets, traditional GAN-based models (Xu et al., 2018; Zhu et al., 2019) can create high-quality pictures from text hints, but they face difficulties on more complex datasets. With the emergence of transformers employing the attention mechanism (Vaswani et al., 2017), conventional autoregressive models (Van den Oord et al., 2016) have gained more powerful generative capacity as Auto-Regressive Transformers (ART) (Ramesh et al., 2021; Esser et al., 2021b; Yu et al., 2022). Other modality translation models add prospective modalities as conditions for improved generation (Gafni et al., 2022). Some works also address bidirectional generation between different modalities (Kim et al., 2022; Huang et al., 2021; Wang et al., 2022); however, these solutions are task-specific, necessitating different input formats and distinct models for distinct modalities.

6. CONCLUSION

This work presents UniD3, a novel unified framework for multi-modality generation tasks. We first introduced a unified transition matrix for the diffusion process, which permits the states of fused tokens to be corrupted in a stable way. We demonstrated that designing transition matrices based on the objective and data characteristics is beneficial for discrete diffusion models. To capture the connections between the various modalities, we proposed mutual attention with fused embedding for noiseless state recovery. Our method is also generic and competent to perform modality translation tasks for a specified modality. The specific designs of transition matrices are intriguing for further exploration, and performance with more modalities is also one of the future directions.

7. ETHICS

In this work, we propose UniD3, a new method for multimodal multi-task generation based on the discrete denoising diffusion model. All datasets evaluated in our experiments are open source, publicly accessible, and used with permission. Similar to other generation algorithms, our method has both beneficial and negative societal effects, depending on its application and usage.
• Positively, UniD3 may explore the limits of deep-learning generation through a unique pattern of image-text pair generation, letting the typical user's imagination run wild and decreasing the entry barrier for visual content production.
• On the other hand, UniD3 may be used to make modified photographs of high quality that are difficult to discern from the originals, providing ways to fool humans and propagate malicious falsehoods. UniD3 could be abused by nefarious users, which may have serious repercussions.
• In our code distribution, we shall outline the permitted uses of our system and give the corresponding licences. However, we observe that present discriminators used to distinguish generated images are ineffective at identifying images generated by diffusion models. Further exploration is still needed to discern authentic from fake images.

A LIMITATIONS AND DISCUSSION

Our work focuses on multi-modality generation, and the task of generating every modality is taken into account. We consequently cannot employ rich text or visual representations that are difficult to recover, which puts some pressure on the generative model to comprehend and represent the original data. This leads to a performance gap between our model and state-of-the-art methods in terms of conditional generation quality. Having the freedom to manipulate the conditional space for a better representation would alleviate the burden on the generative model to represent the condition. Most current modality translation models use a highly abstract representation of text as the conditional signal; for example, CLIP (Radford et al., 2021) embeddings are widely used in VQ-Diffusion (Gu et al., 2022) and Stable Diffusion / LDM (Rombach et al., 2022). Some work has shown that a powerful text encoder (T5, Raffel et al. (2020)) and a highly abstract text representation can provide better results (Saharia et al., 2022b). As a future challenge, we will consider obtaining a condensed modal representation as a discrete embedding that is also simple to recover. This would reduce the strain on the generative model, enabling it to concentrate on producing superior generative results. We also found that the first-stage VQ model partially limits the quality of our image generation; finding a VQ model with a lower information loss rate is also an important way to improve the performance of the model. Besides, our model demonstrates sufficient superiority in terms of text-image similarity without being optimized by a CLIP loss, as shown in Table 3. In contrast, both improved VQ-Diffusion (Tang et al., 2022) and OFA (Wang et al., 2022) make explicit or implicit use of a CLIP loss to optimise model parameters.

B DATASETS AND EXPERIMENTAL DETAILS B.1 DATASETS

We demonstrate the feasibility of the proposed method on two commonly used datasets: CUB-200 (Wah et al., 2011) and MSCOCO (Lin et al., 2014) . The CUB-200 dataset consists of 8,855 training images and 2,933 test images representing 200 species of birds. Each of these photos is accompanied by 10 textual descriptions. In MSCOCO, there are 591,753 images utilized for training and 25,014 for testing, with each image corresponding to 5 textual descriptions.

B.2 MODEL DETAILS

We use VQ-GAN (Esser et al., 2021b) with Gumbel softmax to compress the images into discrete token sequences. We directly use the pre-trained public model trained on OpenImage (Krasin et al., 2017). The compression ratio of the model is 8 × 8 × 3 = 192, and each image is compressed into a 32 × 32 token sequence. The codebook size is 2,887 after removing useless codes (Gu et al., 2022). Furthermore, as the transition matrix of the diffusion model is sensitive to the dictionary size, and the length of the sequence influences the performance of the denoising neural network, we further investigated the effect of various text word lists. We select dVAEs with different downsampling factors and dictionary sizes from (Esser et al., 2021b), using alternative released VQ-GAN models ranging from f8 with 8,192 entries (Z ∈ R^{8192×256}) to f16 with 16,384 entries (Z ∈ R^{16384×256}). For the text portion, we use a BPE tokenizer comparable to (Ramesh et al., 2021; Radford et al., 2021) and set the dictionary size to 8,192, compressing each text description into a sequence of length 128. The text and image encoders are fixed during the training phase.
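The resulting sequence and state-space bookkeeping is straightforward; the sketch below assumes 256×256 inputs (implied by the f8 downsampling factor together with the 32×32 token grid):

```python
# Sequence bookkeeping under the stated configuration: 32x32 image tokens,
# 128 BPE text tokens, and the reported codebook/dictionary sizes.
K_image = 2887          # VQ-GAN codebook size after pruning unused codes
K_text = 8192           # BPE dictionary size
image_tokens = 32 * 32  # each (assumed 256x256) image -> 1024 discrete tokens
text_tokens = 128       # each caption -> a length-128 token sequence

sequence_length = image_tokens + text_tokens
num_states = K_image + K_text + 1  # +1 for the shared [MASK] token

assert sequence_length == 1152
assert num_states == 11080
```

So the unified diffusion model operates on mixed sequences of 1,152 tokens, each token drawn from a combined state space of 11,080 values (image codes, text codes, and [MASK]).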

B.3 EXPERIMENTAL DETAILS

For the diffusion process, we set the number of diffusion steps to 500, and the noise schedule is linear, where α_t goes from 1 to 0 and γ_t goes from 0 to 1. We optimize with AdamW (Loshchilov & Hutter, 2018) and a learning rate of 9e-4 without warmup. We trained all models with a batch size of 16 across 8 Tesla A100 GPUs. It is worth noting in the comparison experiments that LAFITE (Zhou et al., 2022) is a language-free model. And LDM (Rombach et al., 2022)
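One plausible reading of this schedule is a linear interpolation of the cumulative quantities, from which the per-step rates can be recovered; the exact endpoints below are assumptions for illustration, not the paper's values:

```python
import numpy as np

# Linear schedule on the cumulative quantities: the cumulative keep
# probability goes 1 -> ~0 and the cumulative [MASK] probability 0 -> ~1.
T = 500
alpha_bar = np.linspace(1.0, 1e-4, T + 1)    # cumulative keep probability
gamma_bar = np.linspace(0.0, 1.0 - 1e-4, T + 1)  # cumulative [MASK] probability

# Per-step rates recovered from the cumulative schedule, using
# alpha_bar_t = prod(alpha_s) and 1 - gamma_bar_t = prod(1 - gamma_s):
alpha = alpha_bar[1:] / alpha_bar[:-1]
gamma = 1.0 - (1.0 - gamma_bar[1:]) / (1.0 - gamma_bar[:-1])

assert np.all((alpha >= 0) & (alpha <= 1))
assert np.all((gamma >= 0) & (gamma <= 1))
```

The recovered per-step α_t and γ_t are valid probabilities at every step, and by construction the process ends with essentially all mass on [MASK], matching the stationary distribution required by the unified transition matrix.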

C LOSS FUNCTION

Similar to the continuous domain, the complete expression of the loss function in our model is:

$L_{vb} = E_{q(x_0)}\big[ D_{KL}\big(q(x_T|x_0)\,\|\,p(x_T)\big) \big] + \sum_{t=2}^{T} E_{q(x_t|x_0)}\Big[ D_{KL}\Big(q(x_{t-1}|x_t, x_0)\,\Big\|\,\big[p_\theta(x^{img}_{t-1}|x_t);\, p_\theta(x^{txt}_{t-1}|x_t)\big]\Big) \Big] - E_{q(x_1|x_0)}\big[ \log p_\theta(x^{img}_0|x_1, x^{txt}_0) + \log p_\theta(x^{txt}_0|x_1, x^{img}_0) \big]$.

D TRUNCATION SAMPLING

As stated in (Gu et al., 2022), truncation sampling is crucial for discrete diffusion-based approaches, preventing the network from sampling tokens with low probability. In our experiments, we set the truncation rate to 0.88 for Pair and T2I generation and 0.75 for the I2T task; i.e., we only keep the top 88% or 75% of tokens of $p_\theta(\hat{x}_0|x_t)$ during inference for the image and text tasks, respectively. A truncation rate that is too low will cause loss of image detail, but a rate that is too high will prohibit the image from forming a distinct geometry.
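One plausible implementation of this truncation is nucleus-style: keep the smallest set of highest-probability tokens whose cumulative mass reaches the truncation rate, then renormalise. This is a sketch under that assumption, not the authors' exact rule (which could alternatively keep a fixed fraction of tokens):

```python
import numpy as np

def truncate(probs, keep=0.88):
    """Zero out low-probability tokens, keeping the smallest set of
    highest-probability tokens whose mass reaches `keep`, then renormalise."""
    order = np.argsort(probs)[::-1]          # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, keep) + 1  # number of tokens to keep
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = 1.0
    out = probs * mask
    return out / out.sum()

p = np.array([0.5, 0.3, 0.15, 0.05])
q = truncate(p, keep=0.88)
# 0.5 + 0.3 = 0.8 < 0.88, so the third token is also kept; the last is dropped.
assert q[3] == 0.0 and abs(q.sum() - 1.0) < 1e-12
```

Sampling then proceeds from the renormalised distribution, so low-probability tokens can never be drawn.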

E PROOF OF THE CLOSED-FORM SOLUTION FOR Q t

The solution for $\overline{Q}_t$ inherits the closed-form attribute from (Gu et al., 2022). Throughout, $\overline{\alpha}_t = \prod_{i=1}^{t}\alpha_i$, $\overline{\gamma}_t = 1-\prod_{i=1}^{t}(1-\gamma_i)$, $\overline{\beta}_t = (1-\overline{\alpha}_t-\overline{\gamma}_t)/K$ and $\overline{\beta}^*_t = (1-\overline{\alpha}_t-\overline{\gamma}_t)/K^*$. Mathematically, given the initial state $x_0$ at $t=0$, the probabilities for the next time-step $t=1$ can be obtained:

$$(\overline{Q}_1 x_0)_x = \begin{cases} \alpha_1+\beta_1, & x = x_0 \ \text{s.t.}\ \arg\max x_0 \in [0,K),\\ \alpha_1+\beta^*_1, & x = x_0 \ \text{s.t.}\ \arg\max x_0 \in [K, K+K^*),\\ \beta_1, & x \neq x_0 \ \text{s.t.}\ \arg\max x_0 \in [0,K),\\ \beta^*_1, & x \neq x_0 \ \text{s.t.}\ \arg\max x_0 \in [K, K+K^*),\\ \gamma_1, & \arg\max x = K+K^*. \end{cases}$$

Suppose the closed-form expression of $\overline{Q}_\tau$ holds at time-step $t=\tau$. Then for the next step $t=\tau+1$, $\overline{Q}_{\tau+1}x_0 = Q_{\tau+1}\overline{Q}_\tau x_0$, and the outputs under the different conditions can be discussed:

1. When $x = x_0$ and $\arg\max x_0 \in [0,K)$:

$$\begin{aligned}(\overline{Q}_{\tau+1}x_0)_x &= \overline{\beta}_\tau\beta_{\tau+1}(K-1) + (\alpha_{\tau+1}+\beta_{\tau+1})(\overline{\alpha}_\tau+\overline{\beta}_\tau)\\ &= \overline{\beta}_\tau(K\beta_{\tau+1}+\alpha_{\tau+1}) + \overline{\alpha}_\tau(\alpha_{\tau+1}+\beta_{\tau+1})\\ &= \tfrac{1}{K}(1-\overline{\alpha}_\tau-\overline{\gamma}_\tau)(1-\gamma_{\tau+1}) + \overline{\alpha}_\tau\alpha_{\tau+1} + \overline{\alpha}_\tau\beta_{\tau+1}\\ &= \tfrac{1}{K}\big[(1-\overline{\gamma}_{\tau+1}) - \overline{\alpha}_\tau(1-\gamma_{\tau+1}-K\beta_{\tau+1})\big] + \overline{\alpha}_{\tau+1}\\ &= \tfrac{1}{K}\big[(1-\overline{\gamma}_{\tau+1}) - \overline{\alpha}_{\tau+1}\big] + \overline{\alpha}_{\tau+1}\\ &= \overline{\alpha}_{\tau+1}+\overline{\beta}_{\tau+1};\end{aligned}\tag{19}$$

2. When $x = x_0$ and $\arg\max x_0 \in [K, K+K^*)$:

$$\begin{aligned}(\overline{Q}_{\tau+1}x_0)_x &= \overline{\beta}^*_\tau\beta^*_{\tau+1}(K^*-1) + (\alpha_{\tau+1}+\beta^*_{\tau+1})(\overline{\alpha}_\tau+\overline{\beta}^*_\tau)\\ &= \overline{\beta}^*_\tau(K^*\beta^*_{\tau+1}+\alpha_{\tau+1}) + \overline{\alpha}_\tau(\alpha_{\tau+1}+\beta^*_{\tau+1})\\ &= \tfrac{1}{K^*}(1-\overline{\alpha}_\tau-\overline{\gamma}_\tau)(1-\gamma_{\tau+1}) + \overline{\alpha}_\tau\alpha_{\tau+1} + \overline{\alpha}_\tau\beta^*_{\tau+1}\\ &= \tfrac{1}{K^*}\big[(1-\overline{\gamma}_{\tau+1}) - \overline{\alpha}_\tau(1-\gamma_{\tau+1}-K^*\beta^*_{\tau+1})\big] + \overline{\alpha}_{\tau+1}\\ &= \tfrac{1}{K^*}\big[(1-\overline{\gamma}_{\tau+1}) - \overline{\alpha}_{\tau+1}\big] + \overline{\alpha}_{\tau+1}\\ &= \overline{\alpha}_{\tau+1}+\overline{\beta}^*_{\tau+1};\end{aligned}\tag{20}$$

3. When $x \neq x_0$ and $\arg\max x_0 \in [0,K)$:

$$\begin{aligned}(\overline{Q}_{\tau+1}x_0)_x &= \overline{\beta}_\tau(\alpha_{\tau+1}+\beta_{\tau+1}) + \overline{\beta}_\tau\beta_{\tau+1}(K-1) + \overline{\alpha}_\tau\beta_{\tau+1}\\ &= \overline{\beta}_\tau(\alpha_{\tau+1}+K\beta_{\tau+1}) + \overline{\alpha}_\tau\beta_{\tau+1}\\ &= \frac{1-\overline{\alpha}_\tau-\overline{\gamma}_\tau}{K}(1-\gamma_{\tau+1}) + \overline{\alpha}_\tau\beta_{\tau+1}\\ &= \tfrac{1}{K}(1-\overline{\gamma}_{\tau+1}) + \overline{\alpha}_\tau\Big(\beta_{\tau+1}-\frac{1-\gamma_{\tau+1}}{K}\Big)\\ &= \overline{\beta}_{\tau+1} + \frac{\overline{\alpha}_{\tau+1}}{K} - \frac{\overline{\alpha}_\tau\alpha_{\tau+1}}{K}\\ &= \overline{\beta}_{\tau+1};\end{aligned}\tag{21}$$

4. When $x \neq x_0$ and $\arg\max x_0 \in [K, K+K^*)$:

$$\begin{aligned}(\overline{Q}_{\tau+1}x_0)_x &= \overline{\beta}^*_\tau(\alpha_{\tau+1}+\beta^*_{\tau+1}) + \overline{\beta}^*_\tau\beta^*_{\tau+1}(K^*-1) + \overline{\alpha}_\tau\beta^*_{\tau+1}\\ &= \overline{\beta}^*_\tau(\alpha_{\tau+1}+K^*\beta^*_{\tau+1}) + \overline{\alpha}_\tau\beta^*_{\tau+1}\\ &= \frac{1-\overline{\alpha}_\tau-\overline{\gamma}_\tau}{K^*}(1-\gamma_{\tau+1}) + \overline{\alpha}_\tau\beta^*_{\tau+1}\\ &= \tfrac{1}{K^*}(1-\overline{\gamma}_{\tau+1}) + \overline{\alpha}_\tau\Big(\beta^*_{\tau+1}-\frac{1-\gamma_{\tau+1}}{K^*}\Big)\\ &= \overline{\beta}^*_{\tau+1} + \frac{\overline{\alpha}_{\tau+1}}{K^*} - \frac{\overline{\alpha}_\tau\alpha_{\tau+1}}{K^*}\\ &= \overline{\beta}^*_{\tau+1};\end{aligned}\tag{22}$$

5. When $\arg\max x = K+K^*$:

$$(\overline{Q}_{\tau+1}x_0)_x = \overline{\gamma}_\tau + (1-\overline{\gamma}_\tau)\gamma_{\tau+1} = 1-(1-\overline{\gamma}_\tau)(1-\gamma_{\tau+1}) = \overline{\gamma}_{\tau+1}.\tag{23}$$

Thus, we can obtain a closed-form solution for $\overline{Q}_t x_0$:

$$\overline{Q}_t x_0 = \begin{cases} \overline{\alpha}_t x_0 + (\overline{\gamma}_t-\overline{\beta}_t)\,x_{[M]} + \overline{\beta}_t, & \text{s.t.}\ \arg\max x_0\in[0,K),\\ \overline{\alpha}_t x_0 + (\overline{\gamma}_t-\overline{\beta}^*_t)\,x_{[M]} + \overline{\beta}^*_t, & \text{s.t.}\ \arg\max x_0\in[K,K+K^*), \end{cases}\tag{24}$$

where $x_{[M]}$ is the one-hot vector of the [MASK] token, i.e. $\arg\max x_{[M]} = K+K^*$. Given indicator functions $\mathbb{1}$ and $\mathbb{1}^*$ for the different modalities,

$$\mathbb{1}(x_0)=\begin{cases}1 & \text{if}\ \arg\max x_0\in[0,K),\\ 0 & \text{otherwise,}\end{cases} \qquad \mathbb{1}^*(x_0)=\begin{cases}1 & \text{if}\ \arg\max x_0\in[K,K+K^*),\\ 0 & \text{otherwise,}\end{cases}$$

s.t. $\mathbb{1}(x_0)\,\mathbb{1}^*(x_0)=0$, Eq. 24 can be expressed as:

$$\overline{Q}_t x_0 = \overline{\alpha}_t x_0 + \big(\overline{\gamma}_t - \mathbb{1}(x_0)\overline{\beta}_t - \mathbb{1}^*(x_0)\overline{\beta}^*_t\big)x_{[M]} + \mathbb{1}(x_0)\overline{\beta}_t + \mathbb{1}^*(x_0)\overline{\beta}^*_t.\tag{25}$$
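As a sanity check (our own illustration, not part of the released code), the closed-form cumulative matrix can be verified numerically for a toy two-modality transition matrix; the vocabulary sizes and noise schedules below are arbitrary choices:

```python
import numpy as np

K, Ks, T = 5, 3, 10        # toy vocabulary sizes and number of diffusion steps
rng = np.random.default_rng(0)

def step_matrix(a, g, K, Ks):
    """One-step unified transition matrix; column j holds q(x_t | x_{t-1} = j).
    States [0, K) are modality 1, [K, K+Ks) modality 2, K+Ks is [MASK]."""
    b, bs = (1 - a - g) / K, (1 - a - g) / Ks
    N = K + Ks + 1
    Q = np.zeros((N, N))
    Q[:K, :K] = b                                 # resample inside modality 1
    Q[K:-1, K:-1] = bs                            # resample inside modality 2
    Q[np.arange(N - 1), np.arange(N - 1)] += a    # keep the current token
    Q[-1, :-1] = g                                # transition to [MASK]
    Q[-1, -1] = 1.0                               # [MASK] is absorbing
    return Q

alphas = rng.uniform(0.7, 0.95, T)
gammas = rng.uniform(0.01, 0.1, T)

Qbar = np.eye(K + Ks + 1)
for a, g in zip(alphas, gammas):
    Qbar = step_matrix(a, g, K, Ks) @ Qbar        # \bar{Q}_t = Q_t ... Q_1

abar = alphas.prod()
gbar = 1 - np.prod(1 - gammas)
bbar, bsbar = (1 - abar - gbar) / K, (1 - abar - gbar) / Ks

j = 2                                             # an x_0 in modality 1
assert np.isclose(Qbar[j, j], abar + bbar)            # case 1
assert np.allclose(np.delete(Qbar[:K, j], j), bbar)   # case 3
assert np.allclose(Qbar[K:-1, j], 0)                  # no cross-modal mass
assert np.isclose(Qbar[-1, j], gbar)                  # case 5
```

The assertions match cases 1, 3 and 5 of the derivation; the starred cases follow by symmetry with `K*` and `β*`.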

F EXTENSION TO MORE MODALITIES

The above description emphasises the textual and visual modalities. However, our proposed methodology is also applicable to other discretisable modalities. Even within the visual domain, we may further consider distinct modalities such as bounding boxes, segmentation masks and edge maps, in addition to RGB images. Specifically, our transition matrix in Eq. 8 is extensible: we can accommodate additional modalities by simply adding new modal quadrants while leaving the mask transition in the final row and column intact. Should the number of states in a modality become excessive, the process degrades to an absorbing diffusion. It is straightforward to verify that the extended unified transition matrix also meets the criteria of the discrete diffusion model, with the stationary distribution assigning all probability mass to the [MASK] token. The unified objective can also include more modalities: with more modalities, $p_\theta(x_{t-1})$ consists of more distributions, e.g., $p_\theta(x^0_{t-1}), p_\theta(x^1_{t-1}), \cdots, p_\theta(x^n_{t-1})$. The architecture of the neural network used to predict $\tilde{x}_0$ can also be extended by adding mutual attention for each corresponding modality, conditioned on the remaining modal sequences. Since the complexity of the transformer is quadratic in the sequence length, excessive modalities may incur an enormous memory cost. The above demonstrates the ability of the proposed model to handle complex modalities; however, we leave the exploration and optimisation of more modalities to future work.
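To make the quadrant construction concrete, here is a minimal sketch (our own illustration, with arbitrary vocabulary sizes and rates) of a one-step unified transition matrix extended to n modalities; the long-run check confirms that all probability mass eventually collapses onto the shared [MASK] state:

```python
import numpy as np

def unified_step_matrix(sizes, a, g):
    """One-step unified transition matrix for n discretised modalities.
    Each modality occupies its own quadrant; the last state is the
    shared [MASK], kept in the final row and column."""
    N = sum(sizes) + 1
    Q = np.zeros((N, N))
    off = 0
    for K in sizes:
        Q[off:off + K, off:off + K] = (1 - a - g) / K  # in-modality resampling
        off += K
    Q[np.arange(N - 1), np.arange(N - 1)] += a         # keep the current token
    Q[-1, :-1] = g                                     # decay to [MASK]
    Q[-1, -1] = 1.0                                    # [MASK] is absorbing
    return Q

# e.g. image tokens, text tokens, and a third discretised modality
Q = unified_step_matrix([8, 5, 3], a=0.9, g=0.05)
assert np.allclose(Q.sum(axis=0), 1.0)        # each column is a distribution
Qinf = np.linalg.matrix_power(Q, 500)         # long-run behaviour
assert np.allclose(Qinf[-1, :], 1.0)          # stationary mass on [MASK]
```

Adding a modality only appends one more diagonal block, so the construction scales to any number of discretisable signals.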

G.1 IMAGE CAPTION

We provide some image captioning results. The generated captions accompanying the sample images are: "this bird is blue and white bird with a large black beak."; "yellow bird with an orange bill and a black head with green feathers."; "this is a black bird with red eye rings and a sharp beak."; "this bird is brown with white and has a long , pointy beak."; "this little bird has a pointed bill and yellow colored breast."; "this is a bird with white a yellow wing and a yellow crown and black bill."; "this small brown bird has a grey beak and a yellow throat."; "this bird has wings that are black and has a yellow and black belly."; "the bird has a brown belly , black throat and black wings."; "this particular bird has a white belly and orange and brown feet and a black tail."; "this is a grey bird with black and blue wings"; "the small yellow bird has black wings."



REPRODUCIBILITY

In order to achieve reproducibility, we have made the following efforts:
1. The inference codes are released at https://github.com/mhh0318/UniD3.
2. Details on the dataset and model architectures are provided in Appendix B, and more experimental settings are provided in Appendix D.
3. Some critical proofs are included in Appendices C & E.
4. Additional experimental results are provided in Appendix G.



Figure 1: Examples of various tasks supported by UniD3. The dark brown portions of the image and description represent the [MASK].

Figure 3: Illustration of transformer blocks with mutual attention. A unified transformer is composed of several blocks stacked on top of one another.

Figure 4: Generated vision-language Pairs from CUB-200 and MSCOCO. Both the image and caption are generated simultaneously. The generated images and captions are of comprehensible quality, and the descriptions correlate with the visuals.

Figure 5: Results of cross-modal vision and language manipulation and infilling. The dark brown portions of the image and description represent the [MASK], while the strikethrough represents the caption manipulation. Image and text complement each other simultaneously.

for text generation or image segmentation. With the help of VQ-VAE, discrete diffusion models (Esser et al., 2021a; Hu et al., 2022; Gu et al., 2022; Tang et al., 2022) have become capable of generating high-quality images.

Figs 6 & 7,  respectively.

Figure 9: More samples of cross-modal vision and language manipulation and infilling. The dark brown portions of the image and description represent the [MASK], while the strike-through represents the caption manipulation.

The captions generated alongside these pair samples are: "this bird has wings that are black and has a red body."; "a small brown bird , with white breast , short beak."; "this bird is black and yellow in color, with black sharp beak, and black eye rings."; "a red stop sign in the desert."; "a view on the top of the hill."; "there is a village at the foot of the mountain."; "a bunch of red flowers in the plastic vase."; "a busy harbor before dawn."; "a plate has fast food with salad and breads."; "a large old church and a clear blue day."; "a plane just landed on the grass."; "there is a baseball player ready for a pitch."; "a box of already baked pizza being shared."; "a pair of couple are selling bread in the market."; "a green train traveling fast across the city."; "there are many people on the blue shuttle bus."; "many baseball players are on the field."; "a clean and tight kitchen with windows."

Figure 10: More vision-language pair samples from CUB-200 and MSCOCO. Both the image and caption are generated simultaneously.

comparison, we exhibit the quantitative generative results based on the various modalities individually. We provide more experimental results for text conditional generation in Appendix G.2.

The CLIP similarity between the generated captions and images.

Ablation studies. The experiments are conducted on the CUB-200 dataset.

Ablation studies on the image encoder. The experiments are conducted on the CUB-200 dataset.

As described in Sec. 3.2, the transformer comprises 20 transformer blocks with 16 attention heads and a feature dimension of 1024; this model contains 600M parameters. For the ablation model, we use 18 transformer blocks with 16 heads and a dimension of 256; this model contains 119M parameters. The optimiser for the model is AdamW.

is trained on LAION-400M and conducts zero-shot text-to-image generation based on the MSCOCO evaluation captions. The results of LDM are obtained with an f8-KL image encoder and a fast sampling strategy, excluding classifier guidance during inference. As the caption generation in our Pair scenario is unique, common text generation metrics are difficult to apply due to the lack of references. Here we use Perplexity (PPL) computed with GPT-2 to evaluate text quality. The perplexities of the generated CUB-200 and COCO captions are 123.32 and 188.74, whereas the corresponding values for the evaluation sets are 108.66 and 175.36.
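For reference, perplexity here is the exponential of the mean negative log-likelihood per token; given per-token log-probabilities from a scoring language model such as GPT-2, it can be computed as in this minimal sketch (our own illustration):

```python
import numpy as np

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood per token).
    `token_logprobs` holds the scoring model's log p(token | prefix)
    for each token of the caption (e.g. from GPT-2)."""
    return float(np.exp(-np.mean(token_logprobs)))

# a caption whose tokens each get probability 0.25 has PPL = 4
assert np.isclose(perplexity(np.log(np.full(4, 0.25))), 4.0)
```

Lower values indicate that the scoring model finds the generated captions more fluent; the gap to the evaluation-set captions above is therefore small.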

G.2 TEXT CONDITIONAL GENERATION

We provide more results of pure text-to-image synthesis in Fig. 8. The resolution of each generated image is 256 × 256.

G.3 CROSS MODAL MODIFICATION

We present samples for modification across the vision and language modalities on the CUB-200 dataset in

