REOrdering Patches Improves Vision Models¶
Conference: NeurIPS 2025 arXiv: 2505.23751 Code: Project Page Area: Model Compression Keywords: patch ordering, long-sequence models, Mamba, reinforcement learning, Plackett-Luce
TL;DR¶
This paper reveals that patch ordering significantly affects the performance of long-sequence vision models, and proposes the REOrder framework, which leverages information-theoretic priors and reinforcement learning to automatically discover optimal patch permutations, achieving up to 3.01% improvement on ImageNet-1K and 13.35% on FMoW.
Background & Motivation¶
Vision Transformers and other sequence models require flattening 2D images into 1D patch sequences. The conventional approach uses raster-scan (row-major) order, based on the assumption that full self-attention is permutation-equivariant and thus patch order is irrelevant.
However, this assumption does not hold for modern long-sequence models:
- Transformer-XL introduces a memory mechanism based on segment-level recurrence and relative positional encodings; its dependence on the relative index \(i-j\) makes it sensitive to permutations.
- Longformer employs sliding-window local attention with global tokens, and the fixed local mask is not equivariant under permutation.
- Mamba uses content-dependent state-space updates (linear recurrence), where the token-by-token causal scan is inherently order-dependent.
These architectures trade the permutation equivariance of full attention for \(O(n)\) or \(O(nw)\) complexity, at the cost of output sensitivity to patch ordering. Experiments show that simply switching from row-major to column-major or Hilbert curve ordering can lead to substantial accuracy changes. More critically, no single fixed permutation is optimal across all models and datasets, motivating a data-driven approach to discover the optimal ordering.
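The equivariance claim is easy to verify numerically. Below is a minimal NumPy sketch (my own illustration, not code from the paper) comparing weight-free softmax self-attention, which commutes with any permutation of its input rows, against a toy causal linear recurrence standing in for an SSM-style scan, which does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
P = np.eye(n)[[1, 0, 3, 2, 5, 4]]  # a fixed non-identity permutation matrix

def attn(X):
    # Full softmax self-attention (projection weights omitted for brevity).
    s = X @ X.T / np.sqrt(X.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

def causal_scan(X, decay=0.9):
    # Toy linear recurrence h_t = decay * h_{t-1} + x_t: each output
    # depends on everything earlier in the sequence, hence on the order.
    h, out = np.zeros(X.shape[1]), []
    for x in X:
        h = decay * h + x
        out.append(h.copy())
    return np.stack(out)

# Full attention commutes with permutation; the causal scan does not.
print(np.allclose(attn(P @ X), P @ attn(X)))                # True
print(np.allclose(causal_scan(P @ X), P @ causal_scan(X)))  # False
```

The same asymmetry carries over to windowed attention masks and relative position indices: any operation that reads absolute or relative sequence position breaks the commutation.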
Method¶
Overall Architecture¶
REOrder is a two-stage framework:
- Information-theoretic prior discovery: Candidate orderings are filtered by evaluating the compressibility of different patch sequences, providing a warm-start initialization.
- Reinforcement learning optimization: A Plackett-Luce permutation policy combined with the REINFORCE algorithm searches the combinatorial space for the optimal permutation.
Key Designs¶
- Theoretical analysis of permutation equivariance: The authors formally prove that full self-attention satisfies \(\text{Attn}(\mathbf{P}\mathbf{X}) = \mathbf{P}\,\text{Attn}(\mathbf{X})\), and individually analyze why Transformer-XL, Longformer, and Mamba violate this property. This provides a theoretical foundation for the claim that patch order matters. The core insight is that efficient attention approximations introduce sensitivity to input permutations as a byproduct of reducing complexity.
- Information-theoretic prior (compression rate analysis): Image patches are discretized into VQ-VAE tokens and compressed via LZMA after unigram/bigram tokenization. Row-major and Hilbert curve orderings yield higher compression rates (greater local redundancy), while column-major and spiral orderings yield lower rates. Compression rate has a weak, model-dependent correlation with downstream accuracy—lower compression rate is not always better, but it serves as a weak prior for initialization.
- Plackett-Luce permutation policy: The permutation distribution is parameterized by a score vector \(\mathbf{z} \in \mathbb{R}^n\) (where \(n\) is the number of patches). Sampling is parallelized via the Gumbel-top-\(k\) trick: \(\pi = \text{argsort}_{\text{desc}}(\mathbf{z} + \tau \mathbf{g})\), where \(g_i \sim \text{Gumbel}(0,1)\). For 196 patches, the policy requires only 196 parameters with negligible overhead. The log-probability admits a closed-form expression: \(\log P(\pi|\mathbf{z}) = \sum_{i=1}^n [z_{\pi_i} - \log\sum_{k=i}^n \exp(z_{\pi_k})]\), computed efficiently via reverse cumulative logsumexp.
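The sampling step and the closed-form log-probability can be sketched directly. The following is a minimal NumPy illustration under the formulas above, not the authors' implementation (names and the 14×14 grid size are assumptions):

```python
import numpy as np

def sample_permutation(z, tau, rng):
    # Gumbel-top-k trick: perturb scores with scaled Gumbel noise,
    # then sort descending; at tau = 0 this is the deterministic argsort.
    g = rng.gumbel(size=z.shape)
    return np.argsort(-(z + tau * g))

def pl_log_prob(z, pi):
    # Plackett-Luce log-likelihood:
    #   log P(pi|z) = sum_i [ z[pi_i] - logsumexp(z[pi_i], ..., z[pi_n]) ],
    # with the tail logsumexps obtained by a reverse cumulative logaddexp.
    s = z[pi]                                       # scores in sampled order
    lse = np.logaddexp.accumulate(s[::-1])[::-1]    # lse[i] = log sum_{k>=i} exp(s[k])
    return np.sum(s - lse)

# Usage: one learnable score per patch of a 14x14 grid.
z = np.zeros(196)
rng = np.random.default_rng(0)
pi = sample_permutation(z, tau=0.2, rng=rng)
lp = pl_log_prob(z, pi)
```

Because all scores start equal, every permutation is initially equally likely; training then sharpens \(\mathbf{z}\) toward a single preferred ordering.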
Loss & Training¶
A three-stage curriculum is adopted:
- First \(N\) epochs: The classifier is trained with standard row-major ordering to establish a stable feature representation.
- Epochs \(N\) to \(N+M\): The PL policy is activated and trained with REINFORCE. The temperature \(\tau\) is linearly annealed from 0 to 0.2 and back to 0, encouraging exploration before convergence.
- After epoch \(N+M\): The policy is frozen to the deterministic maximum-likelihood permutation \(\hat{\pi} = \text{argsort}(\mathbf{z})\), and only the backbone is trained.
The reward is defined as the negative cross-entropy loss \(r = -\mathcal{L}_{\text{CE}}\), with a momentum baseline \(b_{t+1} = \beta b_t + (1-\beta)r_t\) (\(\beta=0.99\)) for variance reduction. The total loss is \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{policy}}\).
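Putting the reward, baseline, and score-function gradient together, the policy update can be sketched as follows (a NumPy illustration of REINFORCE with a momentum baseline; the class name, learning rate, and the explicit \(O(n^2)\) gradient loop are my own assumptions for clarity):

```python
import numpy as np

class PatchOrderPolicy:
    # Illustrative REINFORCE update for a Plackett-Luce score vector z.
    def __init__(self, n_patches, lr=0.1, beta=0.99):
        self.z = np.zeros(n_patches)   # PL scores, one per patch
        self.lr, self.beta, self.baseline = lr, beta, 0.0

    def grad_log_prob(self, pi):
        # d/dz of log P(pi|z): at each rank i, +1 on the chosen patch pi_i
        # minus the softmax over the still-remaining patches (O(n^2) loop).
        grad = np.zeros(len(self.z))
        for i in range(len(pi)):
            tail = pi[i:]
            p = np.exp(self.z[tail] - self.z[tail].max())
            p /= p.sum()
            grad[pi[i]] += 1.0
            grad[tail] -= p
        return grad

    def update(self, pi, reward):
        # REINFORCE: z += lr * (r - b) * grad log P(pi|z), where the
        # reward r is the batch's negative cross-entropy loss.
        advantage = reward - self.baseline
        self.z += self.lr * advantage * self.grad_log_prob(pi)
        # Momentum baseline b_{t+1} = beta * b_t + (1 - beta) * r_t.
        self.baseline = self.beta * self.baseline + (1 - self.beta) * reward
```

Subtracting the running baseline from the reward leaves the gradient unbiased while shrinking its variance, which is what makes single-sample REINFORCE workable here.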
Key Experimental Results¶
Main Results — Fixed Permutation Comparison¶
| Model | Dataset | Row-major | Best Fixed Permutation | Accuracy Range |
|---|---|---|---|---|
| ViT | IN-1K | 37.5% | ~37.5% | ±0% (equivariant) |
| T-XL | IN-1K | baseline | Column-major: +1.92% | −6.43% to +1.92% |
| Longformer | IN-1K | baseline | Column-major: +1.83% | −0.5% to +1.83% |
| Mamba | IN-1K | baseline | — | general degradation for non-row/column orderings |
| T-XL | FMoW | baseline | Column-major best | smaller variation |
Main Results — REOrder Improvements¶
| Model | Dataset | Best REOrder Gain | Permutation Source |
|---|---|---|---|
| Mamba | IN-1K | +3.01% (±0.23) | Hilbert curve |
| Mamba | FMoW | +13.35% (±0.21) | Diagonal |
| T-XL | IN-1K | +1.50% (±0.21) | Hilbert curve |
| T-XL | FMoW | +1.10% (±0.21) | Column-major |
| Longformer | IN-1K/FMoW | ~0% | No significant gain |
Ablation Study¶
| Configuration | Key Observation | Notes |
|---|---|---|
| Random permutation vs. REOrder | REOrder significantly outperforms random | Rules out coincidental training dynamics |
| 100 vs. 300 epochs | Gains persist across training lengths | Not an artifact of insufficient training |
| With/without information-theoretic initialization | Initialization helps but is not required | Weak prior provides a reasonable starting point |
| Longformer | Negligible improvement | Approximation is closest to full attention; lowest permutation sensitivity |
Key Findings¶
- Mamba is most sensitive to patch ordering; its fixed four-directional causal scan conflicts with non-standard permutations, yet REOrder successfully discovers better orderings through learning.
- Different datasets have different optimal permutations (ImageNet prefers column-major; FMoW prefers diagonal), confirming that no universal optimal permutation exists.
- During policy training, salient patches (e.g., keyboard keys) tend to be placed toward the end of the sequence, reflecting the center bias of the dataset.
Highlights & Insights¶
- Deep theoretical insight: The widely accepted assumption that "patch order is irrelevant" is precisely characterized as holding only for full self-attention, and shown to fail for all efficient attention approximations.
- Minimal overhead: The PL policy requires only 196 parameters (for a 14×14 patch grid) and a single argsort operation at training time.
- The 13.35% gain on FMoW is remarkable: It suggests that non-natural images such as satellite imagery may offer larger gains from permutation optimization.
Limitations & Future Work¶
- No significant improvement is observed for Longformer, indicating that the method is primarily effective for architectures with strong permutation sensitivity.
- The current approach uses a globally shared permutation (one per batch); sample-adaptive permutations remain unexplored.
- Validation is limited to classification tasks; effectiveness on dense prediction tasks such as detection and segmentation is unknown.
- Mamba's four-directional scan directions are fixed; ideally, scan directions should be jointly optimized with the permutation.
Related Work & Insights¶
- Compared to ARM (a Mamba vision variant) that clusters patches within rows at small scale, REOrder searches over a substantially larger permutation space and achieves superior results.
- This work has implications for any setting that converts 2D structured data into 1D sequences, such as point clouds and graph-structured data.
- REOrder can be combined with efficiency methods such as token pruning/merging: permutation optimization followed by pruning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic study revealing and addressing the impact of patch ordering on long-sequence vision models
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across four models, two datasets, six orderings, and multiple ablations
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical analysis, fluent exposition, and well-designed figures
- Value: ⭐⭐⭐⭐ Immediate value for the long-sequence vision model community; extension to more tasks and architectures is needed