arxiv:2601.05125

VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Published on Jan 8

· Submitted by

de Rodrigo on Jan 9

CICLAB Comillas ICAI

Upvote

Authors:

Ignacio de Rodrigo ,

Alvaro J. Lopez-Lopez ,

Abstract

VERSE is a methodology for analyzing and improving Vision-Language Models in document understanding by visualizing latent representations and generating synthetic data to enhance performance in error-prone clusters.

AI-generated summary

This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.

View arXiv page View PDF Project page GitHub 1 Add to collection

Community

de-Rodrigo

Paper author Paper submitter 1 day ago

We usually train VLMs on visual synthetic data that we (as humans) label as photorealistic. We argue that this is an anthropocentric perspective imposed to a model that might not synthetize visual information as we do. VERSE helps to visualize latent space and overlay visual features to detect poor-performance regions and take action to include better-suited training sets to boost model performance.

You can explore more here:

Code: https://github.com/nachoDRT/MERIT-Dataset
Hugging Face Space: https://huggingface.co/spaces/de-Rodrigo/Embeddings

Thanks!

librarian-bot

about 10 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.05125 in a model README.md to link it from this page.

VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Abstract

Community

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 1