FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
Paper: arXiv:2501.16297
[Paper] [GitHub] [Project Page]
This repository hosts the official model weights for FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers. In this work, we propose the FALCON model, which introduces a novel visual register technique to simultaneously address visual redundancy and fragmentation in the high-resolution visual encoding of multimodal large language models (MLLMs).
Please refer to the instructions in the GitHub repository.
If you find this work useful for your research, please cite our paper:
@InProceedings{zhang2025falcon,
author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
title={FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month={October},
year={2025},
}
Base model
google/siglip-large-patch16-384