
Author: Hu, Minghui
Author: Zheng, Chuanxia
Author: Zheng, Heliang
Author: Cham, Tat-Jen
Author: Wang, Chaoyue
Author: Yang, Zuopeng
Author: Tao, Dacheng
Author: Suganthan, Ponnuthurai N
Available Date: 2025-01-19T10:05:08Z
Publication Date: 2023
Publication Name: 11th International Conference on Learning Representations, ICLR 2023
Resource: Scopus
Identifier: http://dx.doi.org/10.48550/arXiv.2211.14842
URI: http://hdl.handle.net/10576/62244
Abstract: The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks. © 2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved.
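To make the "unified transition matrix" idea in the abstract concrete, the sketch below builds a single-step transition matrix for a discrete diffusion process over a token vocabulary, in the common style where a token is kept, resampled uniformly, or absorbed into a [MASK] state. This is a minimal illustrative sketch only: the function name, the alpha/gamma schedule, and the row-stochastic convention are assumptions for exposition, not the paper's exact formulation.

import numpy as np

def transition_matrix(num_classes: int, alpha: float, gamma: float) -> np.ndarray:
    """One-step transition matrix Q_t for a discrete diffusion process.

    A token keeps its value with probability alpha, is resampled uniformly
    over the vocabulary with probability (1 - alpha - gamma), and jumps to
    an absorbing [MASK] state (the last index) with probability gamma.
    The schedule values here are illustrative, not taken from the paper.
    """
    K = num_classes                      # real token classes; index K is [MASK]
    beta = (1.0 - alpha - gamma) / K     # uniform-resample mass per class
    Q = np.full((K + 1, K + 1), beta)
    np.fill_diagonal(Q, alpha + beta)    # stay on the same token
    Q[:, K] = gamma                      # jump to [MASK]
    Q[K, :] = 0.0
    Q[K, K] = 1.0                        # [MASK] is absorbing
    return Q                             # each row sums to 1

# Example: apply one noising step to a single token over a small vocabulary.
Q1 = transition_matrix(num_classes=4, alpha=0.9, gamma=0.05)
x0 = 2                                   # current token index
x1 = np.random.choice(Q1.shape[1], p=Q1[x0])

In the paper's setting, the point of a unified matrix is that image tokens (e.g. from a VQ codebook) and text tokens share one such corruption process, so a single model can denoise either or both modalities; the sketch above only shows the single-modality building block.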
Language: en
Publisher: International Conference on Learning Representations, ICLR
Subject: Diffusion model
Diffusion process
Discrete diffusion
Embeddings
Image-based
Language generation
Multi-modal
Single models
Transition matrices
Image generation
Image caption
Title: Unified Discrete Diffusion for Simultaneous Vision-Language Generation
Type: Conference
Access Type: Open Access