faster-whisper Collection faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models. • 15 items • Updated Jul 4, 2025 • 10
PP-OCRv5 Collection PP-OCRv5 is the latest text recognition solution, supporting Simplified Chinese, Chinese Pinyin, Traditional Chinese, English, and Japanese • 13 items • Updated Sep 15, 2025 • 50
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation Paper • 2511.09611 • Published Nov 12, 2025 • 69
Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6, 2025 • 20
Efficient Part-level 3D Object Generation via Dual Volume Packing Paper • 2506.09980 • Published Jun 11, 2025 • 7
Voila Collection Voila: Voice-Language Foundation Models. https://voila.maitrix.org • 7 items • Updated May 6, 2025 • 24
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published May 5, 2025 • 85
PixelHacker: Image Inpainting with Structural and Semantic Consistency Paper • 2504.20438 • Published Apr 29, 2025 • 43
Orpheus Multilingual Research Release Collection Beta Release of multilingual models. • 12 items • Updated Apr 10, 2025 • 109
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes Paper • 2503.23461 • Published Mar 30, 2025 • 94
Long-Video Audio Synthesis with Multi-Agent Collaboration Paper • 2503.10719 • Published Mar 13, 2025 • 9
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding Paper • 2502.19400 • Published Feb 26, 2025 • 47
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation Paper • 2502.18364 • Published Feb 25, 2025 • 36
MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use Paper • 2502.15872 • Published Feb 21, 2025 • 5
Qwen2.5-VL Collection Vision-language model series based on Qwen2.5 • 11 items • Updated 6 days ago • 549
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Paper • 2501.12909 • Published Jan 22, 2025 • 74
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21, 2025 • 84