A LoRA adapter that enhances the spatial understanding of the FLUX.1 text-to-image diffusion model. It substantially improves the generation of images whose prompts specify explicit spatial relationships between objects (e.g., "a photo of A to the right of B").
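As a rough usage sketch (not taken from this card), the adapter could be loaded on top of a FLUX.1 checkpoint with the `diffusers` `FluxPipeline`. The base model ID `black-forest-labs/FLUX.1-dev`, the adapter path `path/to/compass-lora`, and the sampling settings below are assumptions, not confirmed values.

```python
import torch
from diffusers import FluxPipeline

# Load a FLUX.1 base checkpoint (FLUX.1-dev assumed here) and attach the LoRA adapter.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("path/to/compass-lora")  # hypothetical local path or repo id
pipe.to("cuda")

# Prompt with an explicit spatial relationship, the case this adapter targets.
prompt = "a photo of a cat to the right of a dog"
image = pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
image.save("compass_spatial.png")
```

Prompts that state the relation directly (e.g., "to the right of", "above") match the spatial phrasing the adapter was trained on.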
## Training Details
### Training Data
- Built with the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs drawn from COCO
- Pairs are filtered by criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance
### Training Process
- Trained for 24,000 steps
- Batch size: 4
- Learning rate: 1e-4
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999); see the sketch after this list
- Weight decay: 1e-2
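For reference only, here is a minimal sketch of how the reported settings map onto `torch.optim.AdamW`; `lora_params` is a dummy stand-in for the adapter's trainable LoRA parameters, not the actual training code.

```python
import torch

# Dummy stand-in for the LoRA adapter's trainable parameters.
lora_params = [torch.nn.Parameter(torch.zeros(16, 16))]

# Reported hyperparameters: lr 1e-4, AdamW betas (0.9, 0.999), weight decay 1e-2.
optimizer = torch.optim.AdamW(
    lora_params,
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)
```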
## Evaluation Results

| Metric | FLUX.1 | +CoMPaSS |
|---|---|---|
| VISOR uncond (⬆️) | 37.96% | 75.17% |
| T2I-CompBench Spatial (⬆️) | 0.18 | 0.30 |
| GenEval Position (⬆️) | 0.26 | 0.60 |
| FID (⬇️) | 27.96 | 26.40 |
| CMMD (⬇️) | 0.8737 | 0.6859 |
## Citation
If you use this model in your research, please cite:
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}