Length-Induced Embedding Collapse in PLM-based Models

Conference: ACL 2025 (Findings)
arXiv: 2410.24200
Code: GitHub
Area: Other
Keywords: text embedding, length collapse, self-attention, low-pass filter, temperature scaling

TL;DR

Identifies the "length collapse" phenomenon in PLM-based text embedding models, in which embeddings of long texts cluster together, and explains it theoretically: self-attention acts as a low-pass filter whose effect strengthens as text length grows, over-suppressing high-frequency information. Proposes TempScale, which lowers the attention temperature to narrow the distributional gap between short-text and long-text embeddings, improving MTEB by 0.94% and LongEmbed by 1.10%.

Background & Motivation

Background: PLM-based embedding models (e.g., BGE, E5, etc.) encode text into fixed-dimensional vectors and are widely used in tasks such as retrieval, classification, and STS.

Limitations of Prior Work: Embedding models degrade significantly on long text. For BGE on IMDB classification, accuracy drops from 75.6% to 59.0% (a 16.6-point decrease) when the token count increases from [0, 100) to [400, 500). The underlying cause, however, had not been explained.

Key Challenge: Why do long-text embeddings perform poorly? The cause is not simply an excess of information: long-text embeddings become overly similar to each other and lose discriminative power.

Goal: (a) Define and empirically verify the Length Collapse phenomenon; (b) Provide a mechanistic explanation from a frequency-domain perspective; (c) Propose a mitigation method.

Key Insight: Analyze self-attention in the frequency domain as a low-pass filter (following the analysis of ViTs by Wang et al., 2022) and prove the relationship between the filtering rate and the sequence length \(n\).

Core Idea: The maximum singular value of the high-frequency component of the attention matrix \(\sigma_a\) decreases as \(n\) increases \(\to\) token representations of long texts tend to homogenize (retaining only the DC component) \(\to\) embeddings cluster after pooling = Length Collapse.

Method

Overall Architecture

Theoretical analysis: Proves that \(\sigma_a\) monotonically decreases with sequence length \(n \to\) explains the mechanism of Length Collapse. Empirical validation: Proposes TempScale to adjust the attention temperature to mitigate this phenomenon.
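
The low-pass behaviour can be illustrated numerically. The following is a minimal sketch, not the paper's code: it assumes i.i.d. Gaussian attention scores, reuses one attention matrix across layers, and ignores value projections and residual connections. The helper names (`random_attention`, `high_freq_ratio`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_attention(n, score_std, rng):
    """Row-stochastic n x n matrix softmax(P) with i.i.d. Gaussian scores P (an assumption)."""
    return [softmax([rng.gauss(0.0, score_std) for _ in range(n)]) for _ in range(n)]


def high_freq_norm(x):
    """L2 norm of x after removing its mean, i.e. everything except the DC component."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x))


def high_freq_ratio(n, layers=3, score_std=1.0, seed=0):
    """Fraction of high-frequency norm left after applying one attention matrix `layers` times."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(n)]
    attention = random_attention(n, score_std, rng)
    start = high_freq_norm(signal)
    for _ in range(layers):
        signal = [sum(a * v for a, v in zip(row, signal)) for row in attention]
    return high_freq_norm(signal) / start


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, high_freq_ratio(n))
```

Under these assumptions, the fraction of high-frequency content that survives should be clearly smaller for the longer sequences, which is the direction the theoretical analysis predicts.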

Key Designs

Frequency-Domain Analysis of Length Collapse
- Lemma 1: The attention matrix \(\mathbf{A} = \text{softmax}(\mathbf{P})\) is a low-pass filter—repeatedly applying \(\mathbf{A}\) causes the signal to retain only the DC component.
- Theorem 2: The high-frequency decay rate is controlled by \(\sigma_a \|\mathbf{W}_V\|_2\). Smaller \(\sigma_a \to\) high-frequency components are suppressed faster.
- Theorem 3: Assuming the query/key distributions are Gaussian, the paper proves that \(\sigma_a \leq \sqrt{\frac{n}{2\sqrt{1+e^{-2\sigma_s^2}}(n-1)^{3/2}+1}}\), an upper bound that decreases monotonically as \(n\) increases.
- Corollary 4: Larger \(n \to\) smaller \(\sigma_a \to\) token characteristics homogenize \(\to\) embedding cosine similarity increases = Length Collapse.
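
To get a feel for the Theorem 3 bound, its right-hand side can be evaluated for a few sequence lengths. This is only a sketch of the formula as quoted above; the function name `sigma_a_upper_bound` is hypothetical, and \(\sigma_s\) is treated as a plain positive number because these notes do not define it further.

```python
import math


def sigma_a_upper_bound(n, sigma_s):
    """Right-hand side of the Theorem 3 bound as quoted in these notes."""
    c = math.sqrt(1.0 + math.exp(-2.0 * sigma_s ** 2))
    return math.sqrt(n / (2.0 * c * (n - 1) ** 1.5 + 1.0))


if __name__ == "__main__":
    # The bound shrinks as n grows, and a larger sigma_s gives a looser (larger) bound.
    for n in (2, 16, 128, 512):
        print(n, sigma_a_upper_bound(n, sigma_s=1.0), sigma_a_upper_bound(n, sigma_s=2.0))
```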

Impact Analysis of Length Collapse on Downstream Tasks
- Classification/Clustering: Long-text embeddings cluster at the center \(\to\) the KNN classifier biases towards long texts \(\to\) accuracy decreases.
- Retrieval: The embedding space of long documents is compressed \(\to\) short noisy documents may rank higher than truly relevant long documents.
- STS: High similarity between long text pairs \(\to\) unrelated long text pairs also receive high scores, resulting in poor discriminative power.

The TempScale Method
- Function: Divides the attention scores by a temperature \(\tau < 1\) in addition to the usual \(\sqrt{d}\) scaling, i.e., \(\mathbf{A} = \text{softmax}(\frac{\mathbf{XW}_Q(\mathbf{XW}_K)^\top}{\tau\sqrt{d}})\).
- Mechanism: \(\tau < 1\) is equivalent to increasing \(\sigma_s\) and lowering the "temperature" of the attention \(\to\) makes \(\sigma_a\) less sensitive to changes in \(n \to\) narrows the filtering rate gap between long and short texts.
- Design Motivation: Analysis of extreme cases—as \(\tau \to 0\), the attention becomes one-hot (no filtering but losing aggregation capability); as \(\tau \to 1\), it maintains the original behavior. The optimal \(\tau\) lies in between.
- No retraining required: Directly modify the attention calculation during inference.
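
The TempScale formula above can be sketched in a few lines. This is an illustrative, framework-free version rather than the authors' implementation: it assumes the projected queries \(\mathbf{XW}_Q\) and keys \(\mathbf{XW}_K\) are already available as lists of rows, and the function name `tempscale_attention` and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tempscale_attention(queries, keys, tau=1.0):
    """Attention weights softmax(Q K^T / (tau * sqrt(d))); tau = 1.0 gives standard attention."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    d = len(queries[0])
    scale = tau * math.sqrt(d)
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in keys])
        for q_row in queries
    ]


# Toy projected queries and keys (3 tokens, d = 2); the values are made up for illustration.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

standard = tempscale_attention(Q, K, tau=1.0)
sharper = tempscale_attention(Q, K, tau=0.5)
```

With \(\tau < 1\) each row of the attention matrix stays a probability distribution but becomes more peaked, so less averaging across tokens takes place. In a real model the same change would be applied inside every self-attention layer at inference time.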

Key Experimental Results

Main Results

| Benchmark | Base Model | Original | +TempScale | Gain |
| --- | --- | --- | --- | --- |
| MTEB (56 tasks) | BGE-base | 63.55 | 64.49 | +0.94% |
| MTEB | E5-large | 66.23 | 67.01 | +0.78% |
| LongEmbed | BGE-base | 42.15 | 43.25 | +1.10% |
| LongEmbed | E5-4K | 56.82 | 57.89 | +1.07% |

Classification Accuracy by Text Length (BGE on IMDB)

| Token Range | Original (%) | +TempScale (%) |
| --- | --- | --- |
| [0, 100) | 75.6 | 75.8 |
| [100, 200) | 72.3 | 73.5 |
| [200, 300) | 65.1 | 67.2 |
| [300, 400) | 61.5 | 64.0 |
| [400, 500) | 59.0 | 62.3 |
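
The per-bucket gains implied by the table can be computed directly. The snippet below only redoes the arithmetic on the numbers reported above; the variable names are illustrative.

```python
# Accuracy (%) per token range, copied from the table above.
buckets = ["[0, 100)", "[100, 200)", "[200, 300)", "[300, 400)", "[400, 500)"]
original = [75.6, 72.3, 65.1, 61.5, 59.0]
tempscale = [75.8, 73.5, 67.2, 64.0, 62.3]

# Absolute gain in percentage points for each bucket.
gains = [round(t - o, 1) for o, t in zip(original, tempscale)]

# Drop from the shortest to the longest bucket, before and after TempScale.
drop_original = round(original[0] - original[-1], 1)
drop_tempscale = round(tempscale[0] - tempscale[-1], 1)

for bucket, gain in zip(buckets, gains):
    print(f"{bucket}: +{gain} points")
print(f"short-to-long drop: {drop_original} -> {drop_tempscale} points")
```

The gain rises from 0.2 points in the shortest bucket to 3.3 points in the longest, and the short-to-long drop shrinks from 16.6 to 13.5 points.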

Key Findings

Length Collapse is a universal phenomenon: It exists in mainstream embedding models such as BGE, E5, GTE, and ANCE.
Pairwise cosine similarity of long texts monotonically increases with length: Increasing from ~0.3 for [0, 100) to ~0.6 for [400, 500), validating Corollary 4.
TempScale improves long texts the most: The improvement is most significant in the [400, 500) range, while short texts are virtually unaffected.
Experimental validation of the relationship between \(\sigma_a\) and \(n\): \(\sigma_a\) extracted from the last layer of the BGE attention matrix indeed decreases as the text length increases.
The optimal \(\tau\) lies in the range \([0.5, 0.8]\): If \(\tau\) is too low, the attention degenerates toward one-hot.
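
The pairwise-similarity finding can be checked on one's own data with a short script. This is a minimal sketch that assumes each text is available as a `(token_count, embedding)` pair produced by some embedding model; the function names and the 100-token bucket width are illustrative choices, not taken from the paper's code.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def similarity_by_length(items, width=100):
    """Group (token_count, embedding) pairs into length buckets and average within each bucket."""
    buckets = {}
    for n_tokens, embedding in items:
        buckets.setdefault((n_tokens // width) * width, []).append(embedding)
    return {
        start: mean_pairwise_cosine(embs)
        for start, embs in sorted(buckets.items())
        if len(embs) >= 2  # a single embedding has no pairs to compare
    }


# Toy data: two orthogonal "short" embeddings and two parallel "long" ones.
toy = [(50, [1.0, 0.0]), (60, [0.0, 1.0]), (450, [1.0, 1.0]), (470, [2.0, 2.0])]
result = similarity_by_length(toy)
```

If Length Collapse is present, the per-bucket averages should increase with the bucket's starting token count.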

Highlights & Insights

Elegance of frequency-domain analysis: Explaining attention as a low-pass filter from a Fourier perspective, and then deriving the impact of length on the filtering rate—with a clear causal chain: \(n \uparrow \to \sigma_a \downarrow \to\) high frequencies are suppressed \(\to\) embeddings collapse.
Training-free inference-time solution: TempScale does not require retraining the model, only modifying the attention temperature during inference—making it directly applicable to deployed embedding models.
Perfect alignment between theoretical prediction and experiments: Theorem 3 predicts \(\sigma_a\) decreases with \(n\), which is experimentally verified in Figure 7; Corollary 4 predicts cosine similarity increases with \(n\), which is verified in Figure 1(c).
Direct value for RAG and long-text retrieval: Representation degradation of long-text embeddings is a real-world bottleneck in long-document retrieval. TempScale provides a simple yet effective mitigation.

Limitations & Future Work

Gaussian assumption: Theorem 3 assumes Gaussian distributions for query/key, which might not be strictly satisfied in actual PLMs.
Only a single hyperparameter \(\tau\): A globally uniform temperature; layer-wise or head-wise adaptive temperatures could be considered.
Causal attention (decoder-only) not considered: The analysis targets bidirectional attention (encoders). Whether and how Length Collapse arises under causal attention remains to be studied.
Limited performance gain: MTEB +0.94% / LongEmbed +1.10%. While the consistency is good, the absolute improvements are modest.
Not combined with other long-text methods: Such as position interpolation, RoPE scaling, etc.

vs. Wang et al. (2022) ViT over-smoothing: They found that self-attention in ViTs causes over-smoothing as depth increases. This paper finds that a similar effect occurs as length increases—both over-smoothing phenomena share the same mechanism.
vs. rotary position embedding: RoPE improves attention's generalization to length via rotation, but does not directly address the low-pass filtering problem—and can be combined with TempScale.
vs. chunking strategies: Splitting a long document into short chunks for individual encoding is an engineering solution, whereas TempScale is a theory-driven model-level solution.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to define Length Collapse and provide a rigorous frequency-domain proof, with a brand-new derivation for Theorem 3.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on MTEB + LongEmbed across multiple models with visualizations—highly comprehensive.
Writing Quality: ⭐⭐⭐⭐⭐ Seamlessly progressive theoretical derivations, intuitive figures (Figure 1 perfectly showcases the problem), and rigorous reasoning.
Value: ⭐⭐⭐⭐⭐ Provides a fundamental understanding of the long-text issue in embedding models; TempScale is both practical and training-free.