VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Abstract
VERSE is a methodology for analyzing and improving Vision-Language Models in document understanding by visualizing latent representations and generating synthetic data to enhance performance in error-prone clusters.
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
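For intuition, here is a minimal sketch of the kind of analysis the abstract describes: projecting visual embeddings to 2D, clustering them, and ranking clusters by mean F1 to locate error-prone regions. The projection method (t-SNE), clustering algorithm (k-means), and placeholder data are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a VERSE-style analysis (illustrative, not the authors' code):
# project visual embeddings to 2D, cluster them, and rank clusters by mean F1.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def find_error_prone_clusters(embeddings, f1_scores, n_clusters=8):
    """Return a 2D projection, cluster labels, and clusters sorted by ascending mean F1."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    cluster_f1 = {c: float(f1_scores[labels == c].mean()) for c in range(n_clusters)}
    # Low-F1 clusters are the candidates for targeted synthetic training data.
    worst_first = sorted(cluster_f1, key=cluster_f1.get)
    return coords, labels, cluster_f1, worst_first

# Placeholder data standing in for pooled encoder features and per-sample F1 scores.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))
f1_scores = rng.uniform(0.4, 1.0, size=500)
coords, labels, cluster_f1, worst_first = find_error_prone_clusters(embeddings, f1_scores)
print("Lowest-F1 clusters:", worst_first[:3])
```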
Community
We usually train VLMs on synthetic visual data that we (as humans) label as photorealistic. We argue that this is an anthropocentric perspective imposed on a model that might not synthesize visual information the way we do. VERSE helps visualize the latent space and overlay visual features on it, so you can detect poor-performing regions and act on them by adding better-suited training samples that boost model performance.
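To make the "overlay visual features" idea concrete, here is a rough sketch of coloring a 2D embedding projection by a per-sample visual attribute and by per-sample F1. The binary feature flag and the scores below are placeholder assumptions, not outputs of VERSE.

```python
# Illustrative overlay of a visual feature and per-sample F1 on a 2D embedding projection
# (placeholder data; not the VERSE implementation).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
coords = rng.normal(size=(500, 2))            # 2D projection (e.g., t-SNE/UMAP output)
has_feature = rng.integers(0, 2, size=500)    # hypothetical binary visual feature per sample
f1_scores = rng.uniform(0.4, 1.0, size=500)   # per-sample F1 from a downstream evaluation

fig, (ax_feat, ax_f1) = plt.subplots(1, 2, figsize=(10, 4))
ax_feat.scatter(coords[:, 0], coords[:, 1], c=has_feature, cmap="coolwarm", s=8)
ax_feat.set_title("Overlay: visual feature present")
sc = ax_f1.scatter(coords[:, 0], coords[:, 1], c=f1_scores, cmap="viridis", s=8)
ax_f1.set_title("Overlay: per-sample F1")
fig.colorbar(sc, ax=ax_f1, label="F1")
plt.tight_layout()
plt.show()
```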
You can explore more here:
- Code: https://github.com/nachoDRT/MERIT-Dataset
- Hugging Face Space: https://huggingface.co/spaces/de-Rodrigo/Embeddings
Thanks!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding (2025)
- GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models (2025)
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models (2025)
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models (2025)
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing (2025)
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- Next-Embedding Prediction Makes Strong Vision Learners (2025)