Random Wins All: Rethinking Grouping Strategies for Vision Tokens

Conference: CVPR 2026
arXiv: 2603.00486
Authors: Qihang Fan, Yuang Ai, Huaibo Huang, Ran He (Institute of Automation, Chinese Academy of Sciences)
Code: GitHub
Area: 3D Vision
Keywords: Vision Transformer, Token Grouping, Random Grouping, Attention Mechanism, Efficiency Optimization

TL;DR

This paper proposes a minimalist random grouping strategy to replace the elaborately designed token grouping methods used in Vision Transformers. The approach improves on nearly all baselines across image classification, object detection, semantic segmentation, point cloud segmentation, and VLMs. The paper attributes this success to four factors: positional information, per-head feature diversity, a global receptive field, and a fixed grouping pattern.

Background & Motivation

Background: The self-attention mechanism in Transformers incurs \(O(n^2)\) quadratic complexity, and vision token grouping is the dominant approach to reducing this cost. Grouping strategies have grown increasingly sophisticated, from simple window partitioning (Swin Transformer) to semantically aware tree-structured grouping (Quadtree) and bi-level routing grouping (BiFormer), yet inference efficiency has continuously declined.
Limitations of Prior Work: It is not established that these elaborately designed grouping methods are necessary. Complex clustering and routing operations hurt deployment efficiency, and it remains unclear whether the reported performance gains originate from the grouping strategy itself.
Key Challenge: The added complexity does not appear to pay off. A minimalist random grouping strategy (a random permutation of the tokens followed by equal partitioning) outperforms existing complex grouping methods on almost all tasks and baselines, while also achieving faster inference.
Goal: The paper proposes the Random Grouping Strategy and conducts an in-depth analysis of why such a simple method succeeds, summarizing four key design principles for grouping strategies.

Method

Overall Architecture

The core mechanism of the random grouping strategy is extremely straightforward: a fixed random tensor is generated to shuffle the token order, and the tokens are then equally partitioned into groups for intra-group self-attention or pooling. The strategy serves as a drop-in replacement for the token grouping modules in various baselines including Swin, CSwin, Quadtree, BiFormer, PVTv2, and Focal, and can be extended to multi-modal tasks such as point cloud processing and VLMs.
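
The shuffle, partition, and intra-group attention steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `group_attention` and `random_group_attention` are hypothetical, attention uses the tokens themselves as queries, keys, and values (no learned projections, single head), and scattering outputs back to their original positions is an assumption the summary does not spell out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def group_attention(tokens):
    """Dot-product self-attention inside one group (hypothetical helper).

    Queries, keys and values are the raw tokens: an assumption for brevity.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) for c in range(d)])
    return out


def random_group_attention(tokens, order, num_groups):
    """Reorder tokens by a fixed permutation, attend within equal groups,
    then write each result back to the token's original position (assumed)."""
    if len(tokens) % num_groups != 0:
        raise ValueError("token count must be divisible by num_groups")
    size = len(tokens) // num_groups
    result = [None] * len(tokens)
    for g in range(num_groups):
        idx = order[g * size:(g + 1) * size]
        attended = group_attention([tokens[i] for i in idx])
        for i, vec in zip(idx, attended):
            result[i] = vec
    return result


rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 tokens, d = 8
order = list(range(16))
random.Random(1).shuffle(order)  # fixed permutation, reused for every input
out = random_group_attention(tokens, order, num_groups=4)
```

Because `order` is created once and reused, every input sees the same grouping, which matches the fixed-pattern behaviour the summary describes.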

Key Designs

Design 1: Random Tensor Generation and Sort-Based Grouping

Function: Randomly shuffles input tokens and partitions them into equal groups.
Mechanism: Given input \(X \in \mathbb{R}^{h \times w \times d}\), a random tensor \(P \in \mathbb{R}^{h \times w}\) is generated with a one-to-one spatial correspondence to \(X\). Tokens are reordered by sorting \(P\) in descending order to obtain \(X_p\), which is then evenly split into groups. Once generated, \(P\) is stored and fixed, so all subsequent images share the same token ordering.
Design Motivation: Random shuffling completely breaks local bias, ensuring that tokens within each group originate from globally distributed positions across the image, thereby naturally acquiring a global receptive field. Fixing \(P\) ensures a consistent grouping pattern during training, enabling the model to learn stable feature representations.
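
A minimal sketch of the sort-based grouping, assuming tokens are addressed by their flattened index on the \(h \times w\) grid. The function names (`make_random_scores`, `token_order`, `split_into_groups`) and the use of a seed to fix \(P\) are illustrative assumptions, not names from the paper.

```python
import random


def make_random_scores(h, w, seed=0):
    """Return a fixed h x w grid of random scores P, one per token position."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(w)] for _ in range(h)]


def token_order(scores):
    """Flattened token indices sorted by descending score in P."""
    flat = [s for row in scores for s in row]
    return sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)


def split_into_groups(order, num_groups):
    """Split the reordered index list into equal, contiguous chunks."""
    if len(order) % num_groups != 0:
        raise ValueError("token count must be divisible by num_groups")
    size = len(order) // num_groups
    return [order[k * size:(k + 1) * size] for k in range(num_groups)]


P = make_random_scores(h=4, w=4, seed=0)  # generated once, then stored and reused
groups = split_into_groups(token_order(P), num_groups=4)
```

Each entry of `groups` lists the grid positions that attend to one another. Since the scores are unrelated to position, a group typically mixes tokens from across the grid.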

Design 2: Per-Head Independent Random Grouping

Function: Assigns a distinct random grouping pattern to each head in multi-head attention.
Mechanism: The shape of \(P\) is extended from \(h \times w\) to \(n \times h \times w\) (where \(n\) is the number of attention heads), so each head uses an independent random tensor for grouping.
Design Motivation: Using different random groupings per head encourages greater diversity in the features learned by each head. Ablation experiments show that sharing a single \(P\) across all heads leads to a significant performance drop (e.g., Random-Swin-T drops from 82.7% to 80.5%), confirming the critical role of per-head feature diversity.
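
The per-head extension can be sketched by drawing one score vector per head, which is equivalent to sorting each slice of an \(n \times h \times w\) random tensor separately. The helper name `per_head_orders` and the seeding scheme are assumptions for illustration.

```python
import random


def per_head_orders(num_heads, h, w, seed=0):
    """Return one independent token ordering per attention head."""
    rng = random.Random(seed)
    orders = []
    for _ in range(num_heads):
        # One slice of the (num_heads, h, w) random tensor, flattened.
        scores = [rng.random() for _ in range(h * w)]
        orders.append(sorted(range(h * w), key=lambda i: scores[i], reverse=True))
    return orders


orders = per_head_orders(num_heads=3, h=4, w=4, seed=0)

# The shared-P ablation corresponds to reusing one ordering for every head.
shared = [orders[0]] * 3
```

With independent orderings, two tokens that share a group in one head will usually be separated in another, which is one way to read the per-head diversity argument.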

Design 3: Nearest-Neighbor Interpolation for High-Resolution Adaptation

Function: Adapts the fixed-resolution random tensor to different scales in downstream tasks.
Mechanism: Since \(P\) is fixed at shape \(h \times w\), nearest-neighbor interpolation is used to resize \(P\) to the target resolution when applied to high-resolution scenarios such as object detection (800×1333) or semantic segmentation (512×512).
Design Motivation: Nearest-neighbor interpolation preserves the discrete structural properties of the random permutation, avoiding the smoothing effects that methods such as bilinear interpolation may introduce, and ensures sharp grouping boundaries.
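
A small sketch of nearest-neighbor resizing of \(P\), assuming a floor-based index mapping (one common convention; libraries differ in the details). The function name `resize_nearest` is hypothetical.

```python
def resize_nearest(scores, new_h, new_w):
    """Nearest-neighbor resize of a 2D score grid given as a list of rows.

    Each target cell copies the source cell at floor(index * old / new).
    """
    old_h, old_w = len(scores), len(scores[0])
    return [
        [scores[(i * old_h) // new_h][(j * old_w) // new_w] for j in range(new_w)]
        for i in range(new_h)
    ]


P = [[0.9, 0.1],
     [0.4, 0.7]]
P_big = resize_nearest(P, 4, 4)
# P_big == [[0.9, 0.9, 0.1, 0.1],
#           [0.9, 0.9, 0.1, 0.1],
#           [0.4, 0.4, 0.7, 0.7],
#           [0.4, 0.4, 0.7, 0.7]]
```

Note that the resized grid only contains values already present in \(P\), so upsampling produces repeated scores. How ties are ordered during the subsequent sort is not specified in this summary.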

Design 4: Unified Adaptation for Three Backbone Categories

Function: Provides a unified grouping replacement scheme for Plain, Partition-based, and Pooling-based backbones.
Mechanism:
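- Plain backbones (e.g., DeiT): Random grouping is applied directly. For \(n\) tokens split into \(g\) equal groups (here \(n\) is the token count, not the number of heads), intra-group self-attention costs \(O((n/g)^2)\) per group, i.e. \(O(n^2/g)\) in total instead of the global \(O(n^2)\).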
- Partition-based backbones (e.g., Swin, CSwin, BiFormer): Original window/routing grouping is replaced by random grouping.
- Pooling-based backbones (e.g., PVTv2, Focal): Spatial grouping prior to token pooling is replaced by random grouping.
Design Motivation: To show that random grouping is a general-purpose, architecture-agnostic strategy that can replace diverse grouping methods in a uniform way.
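
The cost reduction for plain backbones can be checked with a short count of query-key pairs. Each group of \(n/g\) tokens needs \((n/g)^2\) pairs, and summing over \(g\) groups gives \(n^2/g\). The function name and the 196-token example (a 14×14 grid) are illustrative assumptions.

```python
def attention_pairs(num_tokens, num_groups=1):
    """Number of query-key pairs when attention is restricted to equal groups."""
    if num_tokens % num_groups != 0:
        raise ValueError("num_tokens must be divisible by num_groups")
    group_size = num_tokens // num_groups
    return num_groups * group_size * group_size


n = 196                                        # illustrative: a 14 x 14 token grid
full = attention_pairs(n)                      # 196^2 = 38416
grouped = attention_pairs(n, num_groups=4)     # 4 * 49^2 = 9604
```

In this count, splitting into 4 groups cuts the number of attention pairs by a factor of 4. Real throughput gains depend on the rest of the network and on the cost of the reordering itself.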

Key Experimental Results

Main Results: ImageNet-1K Image Classification

Model	Params (M)	FLOPs (G)	Throughput (img/s)	Top-1 Acc (%)
DeiT-T	6	1.3	6433	72.2
Random-DeiT-T	6	1.1	6682	73.1 (+0.9)
DeiT-S	22	4.6	3122	79.8
Random-DeiT-S	22	4.3	3313	80.9 (+1.1)
DeiT-B	87	17.6	1226	81.8
Random-DeiT-B	87	17.0	1348	82.5 (+0.7)
Swin-T	28	4.5	1738	81.3
Random-Swin-T	28	4.5	1866	82.7 (+1.4)
Swin-S	50	8.7	1186	83.0
Random-Swin-S	50	8.7	1248	83.9 (+0.9)
Swin-B	88	15.4	864	83.5
Random-Swin-B	88	15.4	902	84.4 (+0.9)
Quadtree-b2	24	4.5	467	82.7
Random-Quadtree-b2	21	4.3	1926	83.4 (+0.7)
BiFormer-B	57	9.8	544	84.3
Random-BiFormer-B	57	9.6	667	85.1 (+0.8)
PVTv2-B2	25	4.0	1663	82.0
Random-PVTv2-B2	21	4.2	1678	82.7 (+0.7)
Focal-B	90	16.0	248	83.8
Random-Focal-B	88	15.5	887	84.5 (+0.7)

Main Results: COCO Object Detection and Instance Segmentation

Backbone	Mask R-CNN AP^b	Mask R-CNN AP^m	RetinaNet AP^b
Swin-T	43.7	39.8	41.7
Random-Swin-T	46.0 (+2.3)	41.9 (+2.1)	44.3 (+2.6)
Swin-S	45.7	41.1	44.5
Random-Swin-S	48.0 (+2.3)	43.2 (+2.1)	46.6 (+2.1)
Swin-B	46.9	42.3	45.0
Random-Swin-B	49.1 (+2.2)	44.6 (+2.3)	47.4 (+2.4)
PVTv2-B2	45.3	41.2	44.6
Random-PVTv2-B2	47.1 (+1.8)	42.4 (+1.2)	46.0 (+1.4)

Main Results: ADE20K Semantic Segmentation

Model	UperNet 160K mIoU (%)
Swin-T	44.5
Random-Swin-T	46.8 (+2.3)
Swin-S	47.6
Random-Swin-S	48.9 (+1.3)
CSwin-B	51.1
Random-CSwin-B	52.2 (+1.1)
BiFormer-B	51.0
Random-BiFormer-B	52.0 (+1.0)

Ablation Study: Four Key Factors

Ablation	Model	Acc (%)	Change
Positional Information	Random-Swin-T	82.7	-
Remove PE	Random-Swin-T w/o PE	79.3	-3.4
Reference: Swin-T w/o PE	-	80.1	-1.6
Per-Head Feature Diversity	Random-Swin-T (multi-P)	82.7	-
All heads share single P	Random-Swin-T (single-P)	80.5	-2.2
Global Receptive Field	Random-Swin-T (global)	82.7	-
Restricted to local regions	Random-Swin-T (regional)	81.5	-1.2
Fixed Grouping Pattern	Random-Swin-T (fixed P)	82.7	-
Different P per image	Fully Random Swin-T	76.4	-6.3

Ablation Study: Roadmap from Fully Random to Random-Swin

Configuration	Throughput (img/s)	Acc (%)
Fully Random (single P)	1922	71.2
+ Fixed grouping pattern	1922	77.6 (+6.4)
+ Per-head independent P	1917	80.1 (+2.5)
+ CPE positional encoding	1866	82.7 (+2.6)
Swin-T (reference)	1738	81.3

Key Findings

Random grouping outperforms carefully designed methods in almost every setting: Across 3 backbone categories, 6+ architectures, and 5 tasks, random grouping surpasses the original grouping strategies in nearly all cases while achieving faster inference.
Larger gains on downstream tasks: On COCO detection, Random-Swin-T gains +2.3 AP^b with Mask R-CNN and +2.6 AP^b with RetinaNet, well above its +1.4% improvement on ImageNet-1K classification.
Significant speed advantages: Random-Quadtree-b2 runs at 1926 img/s vs. 467 img/s for the original Quadtree-b2, a 4.1× speedup, alongside a +0.7% accuracy gain.
Fixed pattern is the most critical factor: Fully Random grouping (a different P per image) lowers accuracy by 6.3 points (82.7% to 76.4%), the largest drop among the four ablations, indicating that models need a consistent grouping pattern to learn stable feature representations.
Multi-modal generalization: On Point Transformer v3, latency is reduced by 23% (88ms→68ms) while mIoU improves by +0.2; all benchmarks improve when applied to LLaVA-1.5/1.6.
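
The speed figures quoted above follow directly from the throughput column of the ImageNet-1K table and the stated Point Transformer v3 latencies. A short check, using only numbers already reported in this summary (the dictionary layout is just for illustration):

```python
# (baseline img/s, random-grouping img/s), copied from the classification table
throughput = {
    "Quadtree-b2": (467, 1926),
    "Focal-B": (248, 887),
    "BiFormer-B": (544, 667),
    "Swin-T": (1738, 1866),
}

speedup = {name: rand / base for name, (base, rand) in throughput.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")

# Point Transformer v3 latency, 88 ms -> 68 ms
latency_reduction = (88 - 68) / 88
print(f"PTv3 latency reduction: {latency_reduction:.0%}")
```

The largest speedups come from the backbones whose original grouping is most expensive (Quadtree, Focal), while window-based Swin gains only a few percent in throughput.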

Highlights & Insights

Counter-intuitive yet profound finding: Elaborately designed grouping strategies are outperformed by random grouping, overturning the default assumption that "more complex is better." The core contribution lies not only in the method itself but in a deeper understanding of the grouping strategy design space.
Practical guidance from the four-factor analysis: The four factors—positional information, per-head diversity, global receptive field, and fixed patterns—provide clear design principles for future Transformer efficiency research.
Exceptional engineering simplicity: No complex clustering, routing, or tree structures are required; only sorting and splitting, making the method deployment-friendly and well-suited for industrial applications.
Well-structured roadmap experiment: The progressive ablation in Tab. 10 starts from fully random grouping (71.2%) and adds each key factor in turn, ending at 82.7% and surpassing Swin-T (81.3%).

Limitations & Future Work

Random grouping yields smaller gains on architectures that already possess a global receptive field (e.g., CSwin-T gains only +0.4%), suggesting that its advantages stem primarily from enhancing global receptive fields and per-head diversity.
The shape of the fixed random tensor is tied to the training resolution; transferring to higher resolutions requires interpolation, which may degrade the quality of the random permutation.
The paper does not discuss random seed sensitivity or whether different random initializations introduce performance variance.

Related Work

Swin Transformer [ICCV 2021]: The canonical window-based grouping method, outperformed by random grouping across the reported benchmarks, which suggests that local windows are not the optimal choice.
BiFormer [CVPR 2023]: Bi-level routing grouping with high complexity, surpassed by a simple random approach, revealing the problem of over-engineering.
Point Transformer v3 [CVPR 2024]: Grouping strategies in point cloud scenarios can also be replaced by random grouping, demonstrating cross-modal universality.

Rating

Dimension	Score (1–10)	Notes
Novelty	9	A minimalist yet counter-intuitive finding that challenges the entire token grouping paradigm.
Technical Depth	8	The four-factor analysis is thorough and systematic; ablation design is elegant.
Experimental Thoroughness	9	Comprehensive validation across 6 backbone types, 5 tasks, and 3 modalities.
Writing Quality	8	Logical and clear, progressively building from observations to explanations.
Value	9	Plug-and-play, deployment-friendly, with extremely high engineering value.
Overall	8.6	A paradigm-shifting work that challenges complex designs with extreme simplicity.