Collections including paper arxiv:2503.02130 (Forgetting Transformer: Softmax Attention with a Forget Gate)

Collection:
- You Do Not Fully Utilize Transformer's Representation Capacity • Paper • 2502.09245 • Published • 37
- LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers • Paper • 2502.15007 • Published • 174
- Transformers without Normalization • Paper • 2503.10622 • Published • 170
- Forgetting Transformer: Softmax Attention with a Forget Gate • Paper • 2503.02130 • Published • 32

Collection:
- RuCCoD: Towards Automated ICD Coding in Russian • Paper • 2502.21263 • Published • 133
- Unified Reward Model for Multimodal Understanding and Generation • Paper • 2503.05236 • Published • 122
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching • Paper • 2503.05179 • Published • 46
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning • Paper • 2503.05592 • Published • 27

Collection:
- What Matters in Transformers? Not All Attention is Needed • Paper • 2406.15786 • Published • 31
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss • Paper • 2410.17243 • Published • 92
- Forgetting Transformer: Softmax Attention with a Forget Gate • Paper • 2503.02130 • Published • 32
- Transformers without Normalization • Paper • 2503.10622 • Published • 170

Collection:
- Depth Anything V2 • Paper • 2406.09414 • Published • 103
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • Paper • 2406.09415 • Published • 51
- Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion • Paper • 2406.04338 • Published • 39
- SAM 2: Segment Anything in Images and Videos • Paper • 2408.00714 • Published • 120

Collection:
- Forgetting Transformer: Softmax Attention with a Forget Gate • Paper • 2503.02130 • Published • 32
- L^2M: Mutual Information Scaling Law for Long-Context Language Modeling • Paper • 2503.04725 • Published • 21
- Transformers without Normalization • Paper • 2503.10622 • Published • 170
- I-Con: A Unifying Framework for Representation Learning • Paper • 2504.16929 • Published • 29

Collection:
- LM2: Large Memory Models • Paper • 2502.06049 • Published • 31
- Titans: Learning to Memorize at Test Time • Paper • 2501.00663 • Published • 29
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training • Paper • 2501.17161 • Published • 123
- You Do Not Fully Utilize Transformer's Representation Capacity • Paper • 2502.09245 • Published • 37

Collection:
- LLM Pruning and Distillation in Practice: The Minitron Approach • Paper • 2408.11796 • Published • 58
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering • Paper • 2408.09174 • Published • 52
- To Code, or Not To Code? Exploring Impact of Code in Pre-training • Paper • 2408.10914 • Published • 45
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications • Paper • 2408.11878 • Published • 63

Collection:
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • Paper • 2404.08801 • Published • 66
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models • Paper • 2404.07839 • Published • 47
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence • Paper • 2404.05892 • Published • 40
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces • Paper • 2312.00752 • Published • 148