- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
  Paper • 2407.07053 • Published • 47
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
  Paper • 2407.12772 • Published • 35
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
  Paper • 2407.11691 • Published • 15
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
  Paper • 2408.02718 • Published • 62
Collections including paper arxiv:2501.06186

- How Far Are We from Intelligent Visual Deductive Reasoning?
  Paper • 2403.04732 • Published • 21
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
  Paper • 2404.16790 • Published • 10
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 65

- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 65
- apple/OpenELM
  Updated • 1.44k
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  Text Generation • 2B • Updated • 1.21M • 1.43k
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 434

- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 41
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 65
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
  Paper • 2501.05707 • Published • 20
- The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
  Paper • 2502.17535 • Published • 8

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 19
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 65
- Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers
  Paper • 2501.02393 • Published • 7
- Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
  Paper • 2501.01904 • Published • 33
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
  Paper • 2412.14711 • Published • 16

- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 52
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
  Paper • 2501.06186 • Published • 65
- Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
  Paper • 2501.09012 • Published • 10