TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

Conference: ICCV 2025 | arXiv: 2411.17473 | Code: GitHub | Area: Image Segmentation | Keywords: Lightweight visual backbone, Mamba, frequency decoupling, Laplacian pyramid, high-low frequency separation

TL;DR

This paper proposes TinyViM, a lightweight convolution-Mamba hybrid visual backbone based on frequency decoupling. A Laplace Mixer routes low-frequency components to Mamba for global context modeling and enhances high-frequency components via depthwise convolution. A frequency ramp Inception structure progressively adjusts frequency allocation across stages. TinyViM achieves 2–3× higher throughput than existing Mamba models on classification, detection, and segmentation tasks.

Background & Motivation

Mamba has attracted considerable attention in vision due to its linear-complexity global modeling capability, with methods such as ViM and VMamba demonstrating competitive performance on image classification. However, existing lightweight Mamba backbones (e.g., EfficientVMamba) fail to compete with lightweight CNN- or Transformer-based counterparts in terms of both performance and efficiency.

Through spectral analysis, the authors identify a key phenomenon: in convolution-Mamba hybrid architectures, Mamba primarily models low-frequency information while suppressing high-frequency information. Specifically:

  • Low-frequency components at the spectral center are amplified after Mamba processing.
  • High-frequency details such as edges and textures are attenuated.
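
This observation can be checked in spirit with a log-magnitude 2D FFT of feature maps before and after a mixing block. The sketch below is not the authors' analysis code; the feature map and the stand-in mixer are hypothetical placeholders.

```python
import torch

def log_spectrum(feat: torch.Tensor) -> torch.Tensor:
    """Centered log-magnitude spectrum, averaged over batch and channels.

    feat: (B, C, H, W) feature map; in the returned (H, W) map, low-frequency
    energy sits at the center and high frequencies at the borders.
    """
    fft = torch.fft.fft2(feat.float(), norm="ortho")
    fft = torch.fft.fftshift(fft, dim=(-2, -1))
    return fft.abs().log1p().mean(dim=(0, 1))

# Hypothetical usage: a random feature map and a plain conv standing in for a Mamba (SS2D) block.
x = torch.randn(1, 64, 56, 56)
mixer = torch.nn.Conv2d(64, 64, 3, padding=1)
delta = log_spectrum(mixer(x)) - log_spectrum(x)
# The paper's finding is that for Mamba this difference is positive near the center
# (low frequencies amplified) and negative near the borders (high frequencies damped).
print(delta[24:32, 24:32].mean().item(), delta[:4, :4].mean().item())
```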

This observation leads to two inferences:

  1. Feeding all frequency components into the Mamba block is inefficient; high-frequency information constitutes unnecessary overhead for Mamba.
  2. Uniformly processing low-frequency information across all stages degrades high-frequency components, harming fine-grained recognition.

Core motivation: since Mamba naturally favors low frequencies, it is more principled to explicitly decouple high and low frequencies — letting Mamba handle only the low-frequency components (at lower resolution, thus lower cost) while efficiently enhancing high-frequency components with convolutions.

Method

Overall Architecture

TinyViM follows a four-stage multi-scale design. Each stage contains Local Blocks (re-parameterized 3×3 convolution + FFN) and TinyViM Blocks (Laplace Mixer + FFN). Patch Embedding performs downsampling and channel expansion between stages.
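
A minimal PyTorch sketch of this four-stage layout. Block internals are simplified placeholders: the Local Block's conv is not actually re-parameterized here, for brevity each stage stacks only Local Blocks (the TinyViM Block's Laplace Mixer is sketched in the next subsection), and the stage widths and depths are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    """Simplified Local Block: 3x3 depthwise conv + pointwise FFN, both residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.ffn = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        x = x + self.dw(x)
        return x + self.ffn(x)

class Stage(nn.Module):
    """Patch embedding (stride-2 downsampling + channel expansion) followed by blocks."""
    def __init__(self, in_dim: int, dim: int, depth: int):
        super().__init__()
        self.embed = nn.Conv2d(in_dim, dim, 3, stride=2, padding=1)
        self.blocks = nn.Sequential(*[LocalBlock(dim) for _ in range(depth)])

    def forward(self, x):
        return self.blocks(self.embed(x))

# Illustrative four-stage backbone; dims and depths are made up for the sketch.
dims, depths = [48, 96, 192, 384], [2, 2, 6, 2]
backbone = nn.Sequential(*[Stage(i, d, n) for i, d, n in zip([3] + dims[:-1], dims, depths)])
print(backbone(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 384, 14, 14])
```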

Key Design 1: Quantitative Validation of Frequency Decoupling

A baseline combining convolution and standard Mamba (SS2D) is constructed and compared against several input variants:

Input Variant GMACs Throughput (im/s) Top-1 (%)
Baseline (all frequencies) 0.96 1673 79.1
Low frequency only 0.93 2574 79.0
High frequency only 0.96 1377 78.6
High + Low (parallel) 0.97 1509 79.1

Using only low-frequency input incurs negligible accuracy loss (79.0 vs. 79.1) while achieving 1.5× throughput improvement, validating the frequency decoupling strategy.

Key Design 2: Laplace Mixer

Given input features \(X \in \mathbb{R}^{H \times W \times D}\), the channel dimension is split by ratio \(\alpha\) into a low-frequency input \(X_l\) and a high-frequency input \(X_h\).

Low-frequency branch: Laplacian pyramid decomposition is applied: \(X_{ll} = \text{Pool}(X_l), \quad X_{lh} = X_l - \text{Upsample}(X_{ll})\)

The low-frequency component \(X_{ll}\), at \(\frac{1}{2}\) the original resolution, is fed into SS2D (VMamba's 2D selective scan) for global context modeling: \(\hat{X}_{ll} = \text{SS2D}(X_{ll})\)

High-frequency branch: The high-frequency residual \(X_{lh}\) from the low-frequency branch is concatenated with \(X_h\) to form \(X_{hh}\), which is then enhanced via a re-parameterized 3×3 depthwise convolution: \(X_{hh} = \text{Concat}(X_{lh}, X_h), \quad \hat{X}_{hh} = \text{Rep}_3(X_{hh})\)

The low-frequency output is upsampled back to the original resolution; corresponding high- and low-frequency features are then summed element-wise and fused through a 1×1 convolution.
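
A hedged sketch of the Laplace Mixer as described above. SS2D is replaced by a same-shape placeholder (VMamba's selective scan is not reproduced), the re-parameterized 3×3 depthwise conv is an ordinary one, and the way the narrower low-frequency output is aligned with the full-width high-frequency output for the element-wise sum is this sketch's interpretation, not the paper's exact wiring.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceMixer(nn.Module):
    def __init__(self, dim: int, alpha: float = 0.5, ss2d: nn.Module = None):
        super().__init__()
        self.dl = int(dim * alpha)              # channels routed to the low-frequency branch
        self.dh = dim - self.dl                 # channels routed to the high-frequency branch
        # Placeholder for VMamba's 2D selective scan (any (B,C,H,W) -> (B,C,H,W) module fits here).
        self.ss2d = ss2d if ss2d is not None else nn.Conv2d(self.dl, self.dl, 3, padding=1)
        # Stand-in for the paper's re-parameterized 3x3 depthwise conv on the high-frequency path.
        self.rep3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.fuse = nn.Conv2d(dim, dim, 1)      # 1x1 fusion after the element-wise sum

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_l, x_h = torch.split(x, [self.dl, self.dh], dim=1)
        # One-level Laplacian pyramid; 2x2 average pooling stands in for the paper's smoothing kernel.
        x_ll = F.avg_pool2d(x_l, kernel_size=2)                 # low frequencies at 1/2 resolution
        x_lh = x_l - F.interpolate(x_ll, scale_factor=2.0)      # high-frequency residual
        # Low-frequency branch: global modeling at half resolution, then upsample back.
        y_ll = F.interpolate(self.ss2d(x_ll), scale_factor=2.0)
        # High-frequency branch: concat the residual with the split-off channels, enhance locally.
        y_hh = self.rep3(torch.cat([x_lh, x_h], dim=1))
        # Sum the low-frequency output with its matching channels, then fuse with a 1x1 conv.
        y = torch.cat([y_ll + y_hh[:, : self.dl], y_hh[:, self.dl :]], dim=1)
        return self.fuse(y)

print(LaplaceMixer(64, alpha=0.5)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Because the scan only sees the pooled \(X_{ll}\), it runs over a quarter of the spatial positions and only an \(\alpha\)-fraction of the channels, which is where the throughput gain validated in Key Design 1 comes from.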

Key Design 3: Frequency Ramp Inception

Motivated by two observations, namely that (1) deep-layer features exhibit redundancy and (2) shallow layers require more high-frequency detail while deep layers benefit more from global information, the paper progressively adjusts the split ratio \(\alpha\) per stage, allocating more channels to the high-frequency branch (smaller \(\alpha\)) in shallow stages and more to the low-frequency branch (larger \(\alpha\)) in deep stages:

Allocation \(\alpha_1\) \(\alpha_2\) \(\alpha_3\) \(\alpha_4\) Top-1 (%)
Uniform allocation 0.5 0.5 0.5 0.5 79.0
Ramp allocation 0.25 0.5 0.5 0.75 79.2
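
A tiny sketch of how the ramped ratios translate into per-stage channel splits; only the \(\alpha\) values come from the "Ramp allocation" row above, and the stage widths are illustrative.

```python
# Illustrative stage widths; alpha values are the "Ramp allocation" row above.
stage_dims = [48, 96, 192, 384]
alphas = [0.25, 0.50, 0.50, 0.75]

for i, (dim, alpha) in enumerate(zip(stage_dims, alphas), start=1):
    low = int(dim * alpha)   # channels sent to the Mamba (low-frequency) branch
    high = dim - low         # channels kept on the convolutional (high-frequency) branch
    print(f"stage {i}: {low:>3d} low-freq / {high:>3d} high-freq channels")
```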

Loss & Training

Standard classification loss (cross-entropy) is used. Downstream tasks adopt the default losses of their respective frameworks.

Key Experimental Results

ImageNet-1K Classification

Model Type Param GMACs Throughput (im/s) Top-1 (%)
SwiftFormer-S CNN+ViT 6.1M 1.0 2626 78.5
EfficientVMamba-T Mamba 6.1M 1.0 1396 76.5
TinyViM-S CNN+Mamba 5.6M 0.9 2563 79.2
MobileOne-S4 CNN 14.8M 3.0 1223 79.4
EfficientVMamba-S Mamba 11M 1.3 674 78.7
TinyViM-B CNN+Mamba 11M 1.5 1851 81.2
VMamba-T Mamba 30M 4.9 383 82.6
EfficientVMamba-B Mamba 33M 4.0 580 81.8
TinyViM-L CNN+Mamba 31.7M 4.7 843 83.3

TinyViM-S achieves 79.2% Top-1 accuracy with only 5.6M parameters, surpassing EfficientVMamba-T by 2.7 percentage points with roughly 1.8× higher throughput.

COCO Detection and Instance Segmentation (Mask R-CNN)

Backbone Throughput (im/s) AP^box AP^mask
EfficientVMamba-S 104 39.3 36.7
SwiftFormer-L1 174 41.2 38.1
TinyViM-B 180 42.3 38.7
FastViT-SA24 93 42.0 38.0
EfficientVMamba-B 104 43.4 39.5
TinyViM-L 119 44.5 40.3

TinyViM-B outperforms SwiftFormer-L1 by +1.1 AP^box while achieving higher throughput.

ADE20K Semantic Segmentation (Semantic FPN)

Backbone mIoU (%)
EfficientFormer-L1 38.9
TinyViM-S 38.9
SwiftFormer-L1 41.1
TinyViM-B 41.9
PoolFormer-S36 42.0
TinyViM-L 44.1

Ablation Study

Laplacian kernel size:

Kernel size Top-1 (%) Throughput (im/s)
3 79.0 2510
5 (axial) 79.0 2598
7 (axial) 79.2 2563
9 (axial) 79.2 2479

The axial convolution with kernel size 7 achieves the best accuracy-efficiency trade-off.

Key Findings

  1. Frequency decoupling is essential for lightweight Mamba — restricting hidden-state propagation to low-frequency components preserves accuracy while substantially improving throughput.
  2. The progressive frequency allocation of Frequency Ramp Inception outperforms uniform allocation, consistent with the principle of "high frequencies in shallow layers, low frequencies in deep layers."
  3. Re-parameterization provides no gain for small models (TinyViM-S ±0) but benefits larger models (TinyViM-L +0.2); a generic merge sketch follows this list.
  4. ERF visualizations show that TinyViM has substantially larger effective receptive fields than MobileOne and SwiftFormer.
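
Re-parameterization here refers to RepVGG/MobileOne-style structural re-parameterization: multi-branch convolutions used at training time are folded into a single convolution for inference. The sketch below is a generic depthwise example (3×3 + 1×1 + identity, batch norm omitted for brevity), not the authors' exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepDWConv3x3(nn.Module):
    """Train-time branches: 3x3 depthwise conv + 1x1 depthwise conv + identity."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.conv1 = nn.Conv2d(dim, dim, 1, groups=dim)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    @torch.no_grad()
    def merge(self) -> nn.Conv2d:
        """Fold all three branches into one 3x3 depthwise conv for inference."""
        dim = self.conv3.out_channels
        merged = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        w = self.conv3.weight + F.pad(self.conv1.weight, [1, 1, 1, 1])  # 1x1 kernel placed at the center
        w[:, :, 1, 1] += 1.0                                            # identity branch: 1 at each channel's center
        merged.weight.copy_(w)
        merged.bias.copy_(self.conv3.bias + self.conv1.bias)
        return merged

block = RepDWConv3x3(8)
x = torch.randn(2, 8, 16, 16)
print(torch.allclose(block(x), block.merge()(x), atol=1e-5))  # True: same mapping, single conv at inference
```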

Highlights & Insights

  • Novel frequency-domain perspective: This is the first work to analyze Mamba's behavioral preference from a frequency-domain viewpoint and to design a dedicated architecture accordingly.
  • Extreme efficiency: TinyViM-S achieves 79.2% Top-1 with only 5.6M parameters and 0.9 GMACs, making it highly competitive in the ultra-lightweight regime.
  • Outstanding throughput advantage: TinyViM delivers 2–3× the throughput of comparable Mamba models, making it genuinely suitable for real-time deployment.

Limitations & Future Work

  • The Laplacian pyramid decomposition increases implementation complexity, and non-standard operators may not be fully optimized on all hardware platforms.
  • The assumption that Mamba consistently favors low frequencies requires further validation under different pre-training configurations.
  • Throughput is evaluated only on V100 GPUs; actual latency on mobile or edge devices is not reported.

Related Work

  • Efficient visual backbones: MobileNet series, EfficientFormer, SwiftFormer, FastViT, etc.
  • Vision Mamba: ViM, VMamba, EfficientVMamba, QuadMamba, etc.
  • Frequency analysis: Prior work has analyzed the frequency preferences of CNNs and Transformers; this paper is the first to conduct such analysis for Mamba.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of frequency decoupling and Mamba is novel, with analysis-driven design.
  • Technical Depth: ⭐⭐⭐⭐ — The logical chain from spectral analysis to quantitative validation to architecture design is complete and rigorous.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage of classification, detection, and segmentation with detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is well-argued and figures are clear.