Random Wins All: Rethinking Grouping Strategies for Vision Tokens¶
Conference: CVPR 2026
arXiv: 2603.00486
Authors: Qihang Fan, Yuang Ai, Huaibo Huang, Ran He (Institute of Automation, Chinese Academy of Sciences)
Code: GitHub
Area: 3D Vision
Keywords: Vision Transformer, Token Grouping, Random Grouping, Attention Mechanism, Efficiency Optimization
This paper proposes a minimalist random grouping strategy to replace various elaborately designed token grouping methods in Vision Transformers. The approach achieves near-universal improvements over all baselines across image classification, object detection, semantic segmentation, point cloud segmentation, and VLMs, and provides a four-dimensional explanation for its success: positional information, per-head feature diversity, global receptive field, and fixed grouping patterns.
Background: The self-attention mechanism in Transformers incurs \(O(n^2)\) quadratic complexity, and vision token grouping is the dominant approach to reducing this cost. Grouping strategies have grown increasingly sophisticated, from simple window partitioning (Swin Transformer) to semantically aware tree-structured grouping (Quadtree) and bi-level routing grouping (BiFormer), yet inference efficiency has continuously declined.
Limitations of Prior Work: Are these elaborately designed grouping methods truly necessary? Complex clustering and routing operations severely hinder deployment efficiency, and it remains unclear whether performance gains originate from the grouping strategy itself.
Key Finding: A minimalist random grouping strategy, which simply applies a random permutation to the tokens followed by equal partitioning, almost universally outperforms existing complex grouping methods across all tasks and baselines, while also achieving faster inference.
Goal: The paper proposes the Random Grouping Strategy and conducts an in-depth analysis of why such a simple method succeeds, summarizing four key design principles for grouping strategies.
The core mechanism of the random grouping strategy is extremely straightforward: a fixed random tensor is generated to shuffle the token order, and the tokens are then equally partitioned into groups for intra-group self-attention or pooling. The strategy serves as a drop-in replacement for the token grouping modules in various baselines including Swin, CSwin, Quadtree, BiFormer, PVTv2, and Focal, and can be extended to multi-modal tasks such as point cloud processing and VLMs.
Design 1: Random Tensor Generation and Sort-Based Grouping¶
Function: Randomly shuffles input tokens and partitions them into equal groups.
Mechanism: Given input \(X \in \mathbb{R}^{h \times w \times d}\), a random tensor \(P \in \mathbb{R}^{h \times w}\) is generated with a one-to-one spatial correspondence to \(X\). Tokens are reordered by sorting \(P\) in descending order to obtain \(X_p\), which is then evenly split into groups. Once generated, \(P\) is stored and fixed, so all subsequent images share the same token ordering.
Design Motivation: Random shuffling completely breaks local bias, ensuring that tokens within each group originate from globally distributed positions across the image, thereby naturally acquiring a global receptive field. Fixing \(P\) ensures a consistent grouping pattern during training, enabling the model to learn stable feature representations.
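The sort-and-split mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the grid size, channel dimension, and group count are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d, g = 8, 8, 16, 4                 # token grid, channel dim, group count

X = rng.standard_normal((h, w, d))       # input feature map, one token per cell

# A single random tensor P, generated once and then frozen, so every image
# is shuffled with the same fixed permutation.
P = rng.standard_normal((h, w))

# Sorting P in descending order defines the token permutation.
order = np.argsort(-P.reshape(-1))       # shuffled token indices
X_p = X.reshape(h * w, d)[order]         # tokens reordered by the permutation

# Equal split: intra-group self-attention (or pooling) then runs per chunk.
groups = np.split(X_p, g, axis=0)        # g groups of (h*w)//g tokens each
```

Because `order` is a permutation, the original token layout is exactly recoverable with its inverse, so no information is lost by the shuffle.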
Design 2: Per-Head Independent Random Grouping¶
Function: Assigns a distinct random grouping pattern to each head in multi-head attention.
Mechanism: The shape of \(P\) is extended from \(h \times w\) to \(n \times h \times w\) (where \(n\) is the number of attention heads), so each head uses an independent random tensor for grouping.
Design Motivation: Using different random groupings per head encourages greater diversity in the features learned by each head. Ablation experiments show that sharing a single \(P\) across all heads leads to a significant performance drop (e.g., Random-Swin-T drops from 82.7% to 80.5%), confirming the critical role of per-head feature diversity.
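A minimal sketch of the per-head extension, assuming the same sort-based mechanism as above (head count and grid size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, h, w = 4, 8, 8

# One independent fixed random tensor per attention head: shape (n, h, w).
P = rng.standard_normal((n_heads, h, w))

# Sorting each head's tensor yields a different fixed permutation per head,
# so every head partitions the same tokens along its own grouping pattern.
orders = np.argsort(-P.reshape(n_heads, -1), axis=1)   # (n_heads, h*w)
```

Each row of `orders` is a full permutation of the token indices, and the rows differ across heads, which is the source of the per-head diversity the ablation highlights.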
Design 3: Nearest-Neighbor Interpolation for High-Resolution Adaptation¶
Function: Adapts the fixed-resolution random tensor to different scales in downstream tasks.
Mechanism: Since \(P\) is fixed at shape \(h \times w\), nearest-neighbor interpolation is used to resize \(P\) to the target resolution when applied to high-resolution scenarios such as object detection (800×1333) or semantic segmentation (512×512).
Design Motivation: Nearest-neighbor interpolation preserves the discrete structural properties of the random permutation, avoiding the smoothing effects that methods such as bilinear interpolation may introduce, and ensures sharp grouping boundaries.
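A simple NumPy sketch of the resizing step (the `resize_nearest` helper and the resolutions are illustrative; frameworks typically provide this as a built-in nearest-mode interpolation):

```python
import numpy as np

def resize_nearest(P, new_h, new_w):
    """Nearest-neighbor resize of the fixed random tensor (no value smoothing)."""
    h, w = P.shape
    rows = np.arange(new_h) * h // new_h     # source row for each target row
    cols = np.arange(new_w) * w // new_w     # source col for each target col
    return P[rows[:, None], cols[None, :]]

rng = np.random.default_rng(0)
P = rng.standard_normal((7, 7))              # training-resolution random tensor
P_hi = resize_nearest(P, 32, 32)             # adapted to a larger feature map

# Unlike bilinear resizing, every output value already exists in P, so the
# discrete structure that drives the sort-based grouping is preserved.
```

The key property motivating nearest-neighbor mode is visible here: the resized tensor contains only values copied from `P`, never blended ones, so group boundaries stay sharp.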
Design 4: Unified Adaptation for Three Backbone Categories¶
Function: Provides a unified grouping replacement scheme for Plain, Partition-based, and Pooling-based backbones.
Mechanism:
Plain backbones (e.g., DeiT): Random grouping is applied directly; intra-group self-attention replaces the global \(O(n^2)\) cost with \(g\) groups of \(O((n/g)^2)\) each, i.e. \(O(n^2/g)\) overall.
Partition-based backbones (e.g., Swin, CSwin, BiFormer): Original window/routing grouping is replaced by random grouping.
Pooling-based backbones (e.g., PVTv2, Focal): Spatial grouping prior to token pooling is replaced by random grouping.
Design Motivation: Demonstrates that random grouping is a general-purpose strategy that is architecture-agnostic and can uniformly replace diverse grouping methods.
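The complexity arithmetic for the Plain-backbone case can be checked with a small sketch (the token count and head dimension are illustrative values, not from the paper):

```python
def attn_cost(n, d, groups=1):
    # Pairwise attention over m tokens costs ~m^2 * d multiply-adds; with g
    # groups we pay that for g blocks of n/g tokens: g * (n/g)^2 * d = n^2 * d / g.
    per_group = n // groups
    return groups * per_group * per_group * d

n, d = 196, 64                       # e.g. a 14x14 token grid, head dim 64
full = attn_cost(n, d)               # global self-attention
grouped = attn_cost(n, d, groups=4)  # random grouping into 4 equal groups
```

Grouping into \(g\) equal groups therefore cuts the attention cost by exactly a factor of \(g\), independent of how the groups are chosen, which is why the random strategy keeps the same efficiency benefit as the engineered ones.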
Random grouping comprehensively outperforms carefully designed methods: Across 3 backbone categories × 6+ architectures × 5 tasks, random grouping almost universally surpasses the original grouping strategies while achieving faster inference.
Larger gains on downstream tasks: On COCO detection, Random-Swin-T yields gains of up to +2.3 AP^b, and up to +2.6 AP^b with RetinaNet, far exceeding its +1.4% improvement on classification.
Significant speed advantages: Random-Quadtree-b2 achieves 1926 img/s vs. 467 img/s for the original Quadtree-b2—a 4.1× speedup—alongside a +0.7% accuracy gain.
Fixed pattern is the most critical factor: Fully Random (different P per image) causes a catastrophic -6.3% drop, demonstrating that models require a consistent grouping pattern to learn stable feature representations.
Multi-modal generalization: On Point Transformer v3, latency is reduced by 23% (88ms→68ms) while mIoU improves by +0.2; all benchmarks improve when applied to LLaVA-1.5/1.6.
Counter-intuitive yet profound finding: Elaborately designed grouping strategies are outperformed by random grouping, overturning the default assumption that "more complex is better." The core contribution lies not only in the method itself but in a deeper understanding of the grouping strategy design space.
Practical guidance from the four-factor analysis: The four factors—positional information, per-head diversity, global receptive field, and fixed patterns—provide clear design principles for future Transformer efficiency research.
Exceptional engineering simplicity: No complex clustering, routing, or tree structures are required; only sorting and splitting, making the method deployment-friendly and well-suited for industrial applications.
Elegant roadmap experiment design: The progressive ablation in Tab. 10 elegantly demonstrates the journey from fully random grouping (71.2%) to incrementally incorporating each key factor, ultimately surpassing Swin-T (82.7% vs. 81.3%).
Random grouping yields smaller gains on architectures that already possess a global receptive field (e.g., CSwin-T gains only +0.4%), suggesting that its advantages stem primarily from enhancing global receptive fields and per-head diversity.
The shape of the fixed random tensor is tied to the training resolution; transferring to higher resolutions requires interpolation, which may degrade the quality of the random permutation.
The paper does not discuss random seed sensitivity or whether different random initializations introduce performance variance.
Swin Transformer [ICCV 2021]: The canonical window-based grouping method, universally outperformed by random grouping, demonstrating that local windows are not the optimal choice.
BiFormer [CVPR 2023]: Bi-level routing grouping with high complexity, surpassed by a simple random approach, revealing the problem of over-engineering.
Point Transformer v3 [CVPR 2024]: Grouping strategies in point cloud scenarios can also be replaced by random grouping, demonstrating cross-modal universality.