UM-Text: A Unified Multimodal Model for Context Understanding and Visual Text Editing
Abstract
A unified multimodal model for visual text editing that understands natural language instructions and keeps the generated text stylistically consistent with the reference image, by combining a visual language model with context-aware condition embeddings.
With the rapid advancement of image generation, visual text editing driven by natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and the reference image, and thus generate visual text that is stylistically consistent with the image. Previous methods often require complex steps to specify the text content and its attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing via natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and the reference image, so that the text content and layout can be carefully designed according to the contextual information. To generate accurate and harmonious visual text images, we further propose the UM-Encoder, which combines the embeddings of the various conditioning inputs, where the combination is automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss that provides more effective supervision for glyph generation in both the latent and RGB spaces, and we design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
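The abstract gives no implementation details, so the two sketches below are only illustrative readings of the described components; all class, function, and argument names are assumptions, not the paper's API.

The first sketch shows one plausible way the UM-Encoder's condition combination could work: the VLM emits per-condition gating weights from the instruction, and these weights mix the embeddings of the reference image, layout, and text content before conditioning the generator.

```python
import torch
import torch.nn as nn

class ConditionCombiner(nn.Module):
    """Hypothetical sketch of combining condition embeddings with VLM-predicted gates."""

    def __init__(self, dim, num_conditions=3):
        super().__init__()
        # One projection per condition stream (e.g. reference image, layout, text content).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_conditions)])

    def forward(self, cond_embeds, gate_logits):
        # cond_embeds: list of [B, L, D] condition embeddings (assumed same length L)
        # gate_logits: [B, num_conditions] mixing weights predicted by the VLM
        gates = torch.softmax(gate_logits, dim=-1)              # [B, C]
        projected = [p(e) for p, e in zip(self.proj, cond_embeds)]
        stacked = torch.stack(projected, dim=1)                  # [B, C, L, D]
        return (gates[:, :, None, None] * stacked).sum(dim=1)    # [B, L, D]
```

The second sketch illustrates a regional consistency loss in the spirit described: reconstruction error is penalized only inside the text regions, once in the diffusion latent space and once in RGB space. The masking and weighting scheme here are assumptions.

```python
import torch
import torch.nn.functional as F

def regional_consistency_loss(pred_latent, gt_latent, pred_rgb, gt_rgb,
                              text_mask, lambda_rgb=1.0, eps=1e-6):
    """Hypothetical masked L2 loss over latent and RGB reconstructions.

    text_mask: [B, 1, H, W] binary mask of the glyph/text regions.
    """
    # Resize the text-region mask to each prediction's spatial resolution.
    mask_latent = F.interpolate(text_mask, size=pred_latent.shape[-2:], mode="nearest")
    mask_rgb = F.interpolate(text_mask, size=pred_rgb.shape[-2:], mode="nearest")

    # Masked squared errors, normalized by masked area so the two terms are comparable.
    loss_latent = ((pred_latent - gt_latent) ** 2 * mask_latent).sum() / (mask_latent.sum() + eps)
    loss_rgb = ((pred_rgb - gt_rgb) ** 2 * mask_rgb).sum() / (mask_rgb.sum() + eps)

    return loss_latent + lambda_rgb * loss_rgb
```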
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images (2025)
- SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design (2025)
- Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes (2025)
- UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation (2025)
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (2025)
- Exploring MLLM-Diffusion Information Transfer with MetaCanvas (2025)
- UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment (2025)
