UniCanvas: A Diffusion-based Unified Model for Text-in-Image Joint Generation

Abstract

UniCanvas is a unified diffusion model that generates interleaved multimodal (text + image) content on a shared pixel canvas. It represents language as visual patterns within the model's embedding space, which enables text-in-image generation and positions diffusion-based world models as a promising paradigm for unified multimodal generation.

Key Contributions

- First unified diffusion model for interleaved text-in-image generation: Unlike autoregressive VLMs (strong reasoning, weaker image quality) and standard diffusion models (high image quality, poor text rendering), UniCanvas does both on a single pixel canvas.

- Language-as-visual-patterns representation: Instead of producing discrete text tokens, the model represents language as visual patterns within images, leveraging the diffusion model's inherent multimodal embedding.

- World model framing: Diffusion models over a shared pixel canvas are framed as world models of visual change, where the state is the image and actions are pixel transformations.
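To make the "state = image, actions = pixel transformations" framing concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the bitmap font `GLYPHS`, the `draw_text` and `step` helpers, and the canvas size are all hypothetical names chosen for illustration. Text is painted directly into the pixel grid rather than emitted as tokens, which is the sense in which language becomes a visual pattern.

```python
from typing import Callable, List

Canvas = List[List[float]]           # state: grayscale image, intensities in [0, 1]
Action = Callable[[Canvas], Canvas]  # action: a pixel-level transformation

# Hypothetical 3x5 bitmap glyphs standing in for "language as visual patterns".
GLYPHS = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
}


def blank_canvas(height: int, width: int) -> Canvas:
    """Create an all-black canvas."""
    return [[0.0] * width for _ in range(height)]


def draw_text(text: str, top: int, left: int) -> Action:
    """Build an action that paints `text` onto a canvas as pixels."""
    def act(canvas: Canvas) -> Canvas:
        out = [row[:] for row in canvas]  # copy, so the previous state is untouched
        for i, ch in enumerate(text):
            for r, bits in enumerate(GLYPHS[ch]):
                for c, bit in enumerate(bits):
                    if bit == "1":
                        # Each glyph is 3 pixels wide plus 1 pixel of spacing.
                        out[top + r][left + i * 4 + c] = 1.0
        return out
    return act


def step(state: Canvas, action: Action) -> Canvas:
    """Toy transition: the next state is the action applied to the current state."""
    return action(state)


canvas = blank_canvas(7, 9)
canvas = step(canvas, draw_text("HI", top=1, left=1))
for row in canvas:
    print("".join("#" if p > 0.5 else "." for p in row))
```

In this sketch the transition is a hand-written deterministic function; in a diffusion model the analogous transformation would be produced by learned denoising steps conditioned on the prompt.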

Method Details

UniCanvas builds on a diffusion model backbone (likely SDXL-class). The key architectural shift is replacing discrete text token generation with a continuous visual representation of language semantics. Key properties:

- Uses the diffusion model's native multimodal embedding to "draw" text as visual patterns

- Single shared pixel canvas for all modalities (text rendered as stylized visual elements)

- Trained with standard diffusion objectives on joint text-image data

- The canvas acts as a world model: generative actions = pixel-level transformations conditioned on text
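The notes say only that training uses "standard diffusion objectives", so the exact loss is not known. As an assumption, the sketch below shows one common instance, the DDPM-style noise-prediction loss, applied to a toy canvas where image content and text strokes share one pixel grid. The names `add_noise`, `noise_prediction_loss`, `zero_predictor`, and the `Predictor` interface are hypothetical, and the denoiser is a stand-in rather than a real network.

```python
import math
import random
from typing import Callable, List

Canvas = List[List[float]]                     # one pixel grid for picture and text
Predictor = Callable[[Canvas, float], Canvas]  # hypothetical denoiser interface


def add_noise(x0: Canvas, noise: Canvas, alpha_bar: float) -> Canvas:
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * p + b * n for p, n in zip(pr, nr)] for pr, nr in zip(x0, noise)]


def noise_prediction_loss(
    predict: Predictor, x0: Canvas, alpha_bar: float, rng: random.Random
) -> float:
    """Mean squared error between the sampled noise and the predicted noise."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in x0]
    x_t = add_noise(x0, noise, alpha_bar)
    pred = predict(x_t, alpha_bar)
    sq = [(p - n) ** 2 for pr, nr in zip(pred, noise) for p, n in zip(pr, nr)]
    return sum(sq) / len(sq)


def zero_predictor(x_t: Canvas, alpha_bar: float) -> Canvas:
    """Untrained stand-in for a denoiser: always predicts zero noise."""
    return [[0.0 for _ in row] for row in x_t]


# Toy canvas: the top two rows stand in for image content, the bottom row for
# text strokes. The loss treats every pixel the same way.
canvas = [
    [0.2, 0.4, 0.6, 0.8],
    [0.3, 0.5, 0.7, 0.9],
    [1.0, 0.0, 1.0, 1.0],
]
rng = random.Random(0)
loss = noise_prediction_loss(zero_predictor, canvas, alpha_bar=0.5, rng=rng)
print(f"loss of the untrained zero predictor: {loss:.3f}")
```

Because text lives in the same pixel grid as the picture, a single pixel-space loss of this kind covers both modalities without a separate token-level objective. Whether UniCanvas operates in pixel space or in a latent space is not stated in the notes.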

Authors include Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan (MIT/CMU/others).

Key Results

- Improves over previous unified models (autoregressive + diffusion hybrid approaches) on text-in-image coherence metrics

- Positions text-in-image generation as a viable unified multimodal generation paradigm

- No specific benchmark numbers reported in available abstract

Limitations & Future Work

- Text rendering quality not compared to dedicated text-to-image models (e.g., FLUX, DALL-E 3)

- "Language as visual patterns" is architecturally unusual and may struggle with long-form text generation

- Interleaved generation (long sequences of text+image) not thoroughly evaluated

- The "world model" framing is metaphorical rather than a learned predictive model

Relevance to Patrick's Research

Moderate relevance. The key insight — treating diffusion over a shared canvas as a world model of visual change — connects to Sora/VDM-style world models. The text-in-image application is novel but somewhat tangential to core world model research. Useful as a reference point for unified multimodal generation paradigms.