UNIFIED DISCRETE DIFFUSION FOR SIMULTANEOUS VISION-LANGUAGE GENERATION
Author | Hu, Minghui |
Author | Zheng, Chuanxia |
Author | Zheng, Heliang |
Author | Cham, Tat-Jen |
Author | Wang, Chaoyue |
Author | Yang, Zuopeng |
Author | Tao, Dacheng |
Author | Suganthan, Ponnuthurai N |
Date Available | 2025-01-19T10:05:08Z |
Date Published | 2023 |
Publication Name | 11th International Conference on Learning Representations, ICLR 2023 |
Source | Scopus |
Identifier | http://dx.doi.org/10.48550/arXiv.2211.14842 |
Abstract | The recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even vision-language simultaneous generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks. © 2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved. |
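The abstract's "unified transition matrix" refers to the corruption kernel of a discrete diffusion process applied jointly to text and image tokens. As a minimal sketch (not the paper's actual construction), an absorbing-state transition matrix in the style of common discrete diffusion models can be built over a single index space that concatenates the two vocabularies; the vocabulary sizes and the mask-rate `beta` below are illustrative assumptions:

```python
import numpy as np

def absorbing_transition_matrix(vocab_size, beta):
    """One-step transition matrix Q for absorbing-state discrete diffusion:
    each token keeps its value with probability (1 - beta) and moves to a
    shared [MASK] state (last index) with probability beta."""
    K = vocab_size + 1               # extra absorbing [MASK] state
    Q = np.eye(K) * (1.0 - beta)     # stay on the current token
    Q[:, -1] += beta                 # probability mass flows into [MASK]
    Q[-1, -1] = 1.0                  # [MASK] is absorbing
    return Q

# Hypothetical unified setup: text and image tokens share one diffusion
# process by living in a single concatenated vocabulary.
text_vocab, image_vocab = 5, 7
Q = absorbing_transition_matrix(text_vocab + image_vocab, beta=0.1)
assert np.allclose(Q.sum(axis=1), 1.0)  # each row is a valid distribution
```

Because both modalities share one transition matrix, a single model can denoise text tokens, image tokens, or both at once, which is what enables the simultaneous vision-language generation described above.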
Language | en |
Publisher | International Conference on Learning Representations, ICLR |
Subject | Diffusion model; Diffusion process; Discrete diffusion; Embeddings; Image-based; Language generation; Multi-modal; Single models; Transition matrices; Image generation; Image caption |
Type | Conference |