UNIFIED DISCRETE DIFFUSION FOR SIMULTANEOUS VISION-LANGUAGE GENERATION

Hu, Minghui; Zheng, Chuanxia; Zheng, Heliang; Cham, Tat-Jen; Wang, Chaoyue; Yang, Zuopeng; Tao, Dacheng; Suganthan, Ponnuthurai N

Date

2023

Author

Hu, Minghui
Zheng, Chuanxia
Zheng, Heliang
Cham, Tat-Jen
Wang, Chaoyue
Yang, Zuopeng
Tao, Dacheng
Suganthan, Ponnuthurai N

Abstract

The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.

© 2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved.

DOI/handle

http://dx.doi.org/10.48550/arXiv.2211.14842
http://hdl.handle.net/10576/62244

Collections

Network & Distributed Systems [142 items]