arxiv:2512.13033

Scaling Bidirectional Spans and Span Violations in Attention Mechanism

Published on Dec 15, 2025

Abstract

A Transformer optimization framework reduces computational inefficiency by decomposing backward-pass gradients through an asymmetric projection while keeping the forward-pass structure intact, demonstrating improved training performance on WikiText-2.

AI-generated summary

The canonical O(N^2) Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that uses an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass QKV structure intact. Consistent experimental validation across various decomposition and projection setups provides strong evidence that the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on the 0th-order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and with a crude configuration, this method achieved a 0.56% reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes.
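The core idea described above is a projection-based split of a gradient into a component lying in the span of a reference direction and an orthogonal residual ("span violation"), with each part rescaled separately. The sketch below illustrates this in PyTorch under stated assumptions; it is not the paper's implementation, and the function name `decompose_gradient` and the scaling factors `alpha_span` and `alpha_violation` are hypothetical placeholders chosen for illustration.

```python
# Minimal sketch (assumption: a simple row-wise projection stands in for the
# paper's asymmetric projection). Decomposes a backward-pass gradient into a
# component parallel to a reference tensor and an orthogonal "violation",
# then recombines them with separate scales.
import torch

def decompose_gradient(grad: torch.Tensor,
                       reference: torch.Tensor,
                       alpha_span: float = 1.1,
                       alpha_violation: float = 1.0,
                       eps: float = 1e-12) -> torch.Tensor:
    """Project `grad` onto the row-wise span of `reference`, scale the
    parallel and orthogonal components separately, and recombine them."""
    # Row-wise projection coefficient: <g, r> / <r, r>
    coeff = (grad * reference).sum(dim=-1, keepdim=True) / (
        reference.pow(2).sum(dim=-1, keepdim=True) + eps)
    parallel = coeff * reference   # component lying in span(reference)
    violation = grad - parallel    # orthogonal residual ("span violation")
    return alpha_span * parallel + alpha_violation * violation

# Toy usage: rescale the gradient of an input against its own direction.
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)
loss = (x @ w).pow(2).mean()
loss.backward()
x.grad = decompose_gradient(x.grad, x.detach())
```

In this toy setup the forward pass is untouched; only the stored gradient is modified before an optimizer step, mirroring the summary's claim that the method alters the backward pass while leaving the QKV forward structure intact.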
