
LSNet: See Large, Focus Small

Conference: CVPR 2025
arXiv: 2503.23135
Code: https://github.com/jameslahm/lsnet
Area: Model Compression
Keywords: lightweight network, LS convolution, large kernel, dynamic convolution, heteroscale vision, token mixing

TL;DR

Inspired by the dual-scale mechanism of human vision (peripheral vision for broad perception, foveal vision for fine aggregation), this paper proposes LS convolution, which pairs a large-kernel depthwise convolution for perception with a small-kernel dynamic convolution for aggregation. The resulting LSNet family of lightweight networks outperforms the compared state-of-the-art lightweight models in the 0.3 to 1.3 GFLOPs range.

Background & Motivation

Background: Lightweight vision networks are crucial for real-time deployment scenarios. Existing lightweight models primarily rely on self-attention or standard convolution for token mixing.

Limitations of Prior Work: (1) Self-attention: the perception and aggregation ranges are identical (homoscale). Expanding the perception range inevitably increases computational cost, and redundant attention to unimportant regions (e.g., backgrounds) wastes the limited computational budget. (2) Standard convolution: aggregation weights are fixed kernel weights and do not adapt to different contexts; lightweight models also typically use small kernels, which restricts them to a limited receptive field.

Key Challenge: How to simultaneously achieve broad-range perception (to understand context) and efficient fine aggregation (to extract discriminative features) under extremely low computational budgets.

Key Insight: The rod cells (widely distributed in the periphery, wide-field low-resolution) and cone cells (concentrated in the fovea, small-field high-resolution) in the human retina naturally form a "see large, focus small" mechanism. This study maps this mechanism into static large-kernel convolution for perception and dynamic small-kernel convolution for aggregation.

Method

Overall Architecture

LSNet uses a four-stage pyramid architecture: the first three stages stack LS Blocks, and the final stage uses MSA (multi-head self-attention) Blocks, since the resolution is already sufficiently small there. The feature-map resolution across the stages is H/8 → H/16 → H/32 → H/64.
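To make the stage resolutions concrete, here is a minimal sketch that tabulates the feature-map size per stage. The 224×224 input and the round-up behaviour (as a padded strided convolution would give) are assumptions of this sketch, not details stated in the note, and `stage_resolutions` is a hypothetical helper name.

```python
import math

def stage_resolutions(height, width, strides=(8, 16, 32, 64)):
    """Feature-map size at each stage for the given total strides.

    Rounding up is an assumption (it mimics a padded strided convolution).
    """
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]

# Assumed 224x224 input; the first three stages use LS Blocks, the last uses MSA.
for stage, (h, w) in enumerate(stage_resolutions(224, 224), start=1):
    mixer = "MSA" if stage == 4 else "LS"
    print(f"stage {stage}: {h}x{w} = {h * w} positions, {mixer} blocks")
```

Under these assumptions the last stage has only a handful of spatial positions, which is consistent with the note's remark that self-attention is affordable there.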

Key Designs

1. Large-Kernel Perception (LKP)
   - Function: Uses a large-kernel depthwise convolution to efficiently capture wide-range contextual relationships and generate position-adaptive aggregation weights.
   - Core Structure: PW (dimension reduction to $C/2$) → DW$_{K_L \times K_L}$ (large-kernel depthwise convolution, default $K_L=7$) → PW (generating the weights $\mathbf{W} \in \mathbb{R}^{H \times W \times D}$, where $D = G K_S^2$ so that each $w_i$ can be reshaped into $G$ kernels of size $K_S \times K_S$ in SKA):
     $$w_i = \mathcal{P}_{ls}(x_i, \mathcal{N}_{K_L}(x_i)) = \text{PW}(\text{DW}_{K_L \times K_L}(\text{PW}(\mathcal{N}_{K_L}(x_i))))$$
   - Design Motivation: The cost of the large-kernel DW convolution is $O(HWCK_L^2/2)$. It grows linearly with the number of spatial positions $HW$, whereas self-attention grows quadratically, $O(H^2W^2)$, so the receptive field can be expanded at low cost.

2. Small-Kernel Aggregation (SKA)
   - Function: Uses the adaptive weights generated by LKP to perform dynamic convolution over a small neighborhood, aggregating fine-grained features.
   - Mechanism: Reshapes the weights $w_i$ generated by LKP into $w_i^* \in \mathbb{R}^{G \times K_S \times K_S}$ (default $K_S=3$, $G=C/8$) and convolves with dynamic kernels that are shared by the channels within each group:
     $$y_{ic} = w_{ig}^* \circledast \mathcal{N}_{K_S}(x_{ic})$$
   - Design Motivation: The small kernel limits the scope of aggregation, which keeps computation cheap. Because the dynamic kernels are generated from a broad receptive field, the small kernel still becomes aware of wide-range context.

3. LS Block Design
   - Function: A comprehensive block design built around LS Conv.
   - Core Structure: LS Conv → Skip Connection → Extra DW Conv + SE layer (introducing local inductive bias) → FFN (channel mixing).
   - Design Motivation: SE and the extra DW provide a small but critical enhancement of local structural information under extremely lightweight budgets.
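The LKP and SKA steps above can be sketched in plain Python loops, written for readability rather than speed. This is a minimal illustration, not the authors' implementation. It assumes zero padding and stride 1, omits normalization and activation layers, uses random weights, stores the weights as `[D][H][W]`, and assigns contiguous channels to each group. All helper names (`pointwise`, `depthwise`, `lkp`, `ska`, `rand`, `ls_conv_demo`) are hypothetical.

```python
import random

def pointwise(x, weight):
    """1x1 convolution. x: [C_in][H][W], weight: [C_out][C_in]."""
    C_in, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(C_in))
              for j in range(W)]
             for i in range(H)]
            for o in range(len(weight))]

def depthwise(x, kernels):
    """Depthwise KxK convolution, zero padding, stride 1. kernels: [C][K][K]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(kernels[0])
    r = K // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                for u in range(K):
                    for v in range(K):
                        ii, jj = i + u - r, j + v - r
                        if 0 <= ii < H and 0 <= jj < W:
                            out[c][i][j] += kernels[c][u][v] * x[c][ii][jj]
    return out

def lkp(x, pw_in, dw_large, pw_out):
    """Large-kernel perception: PW -> DW (K_L x K_L) -> PW.

    Returns D = G * K_S^2 weight maps, i.e. one weight vector w_i per position.
    """
    return pointwise(depthwise(pointwise(x, pw_in), dw_large), pw_out)

def ska(x, w, G, K_S):
    """Small-kernel aggregation with per-position dynamic kernels.

    w: [G*K_S*K_S][H][W]. Channels are split into G contiguous groups
    (an assumption of this sketch); a group shares one kernel per position.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    r = K_S // 2
    per_group = C // G
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        g = c // per_group
        for i in range(H):
            for j in range(W):
                for u in range(K_S):
                    for v in range(K_S):
                        ii, jj = i + u - r, j + v - r
                        if 0 <= ii < H and 0 <= jj < W:
                            tap = (g * K_S + u) * K_S + v
                            out[c][i][j] += w[tap][i][j] * x[c][ii][jj]
    return out

def rand(*shape):
    """Nested list of small random floats with the given shape."""
    if len(shape) == 1:
        return [random.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [rand(*shape[1:]) for _ in range(shape[0])]

def ls_conv_demo(C=8, H=5, W=5, K_L=7, K_S=3, seed=0):
    """Run LKP then SKA on a random input with G = C/8 groups."""
    random.seed(seed)
    G = max(1, C // 8)
    x = rand(C, H, W)
    w = lkp(x, rand(C // 2, C), rand(C // 2, K_L, K_L),
            rand(G * K_S * K_S, C // 2))
    return x, w, ska(x, w, G, K_S)

x, w, y = ls_conv_demo()
print(len(w), "weight maps;", len(y), "output channels")
```

The perception path sees a $K_L \times K_L$ neighborhood, while each output value is formed from only a $K_S \times K_S$ neighborhood, which is the heteroscale split the section describes. A useful sanity check is that a dynamic kernel with a single 1 at its center tap makes `ska` return its input unchanged.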

Loss & Training

Training uses the standard classification cross-entropy loss, optionally combined with knowledge distillation from a RegNetY-16GF teacher (82.9% top-1 accuracy).
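The note does not say which form of distillation is used, so the following is only a generic sketch of how a cross-entropy term and an optional teacher term might be combined. The soft (KL-based) distillation form, the mixing weight `alpha`, and the `temperature` are all assumptions for illustration, not the paper's recipe.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def cross_entropy(student_logits, label):
    """Standard classification cross-entropy for one sample."""
    return -log_softmax(student_logits)[label]

def soft_distillation(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions (assumed KD form)."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return kl * temperature ** 2

def training_loss(student_logits, label, teacher_logits=None,
                  alpha=0.5, temperature=1.0):
    """Cross-entropy alone, or mixed with the teacher term when one is given.

    alpha and temperature are hypothetical hyperparameters.
    """
    ce = cross_entropy(student_logits, label)
    if teacher_logits is None:
        return ce
    kd = soft_distillation(student_logits, teacher_logits, temperature)
    return (1 - alpha) * ce + alpha * kd

print(training_loss([2.0, 0.5, -1.0], label=0))
print(training_loss([2.0, 0.5, -1.0], label=0, teacher_logits=[3.0, 0.0, -2.0]))
```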

Complexity Analysis

$$O\left(\frac{HWC}{4}\left(3C + 2K_L^2 + (2G+4)K_S^2\right)\right)$$

Every term scales with $HW$, so the complexity is linear in the input resolution (the number of spatial positions).
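As a quick numerical check of the linear scaling, the sketch below evaluates the expression above and compares it with a rough self-attention count. The attention estimate $2(HW)^2C$ (the $QK^\top$ product plus the attention-weighted sum, ignoring projections) is an assumption added for comparison; `ls_conv_cost` and `attention_cost` are hypothetical helper names.

```python
def ls_conv_cost(H, W, C, K_L=7, K_S=3, G=None):
    """Evaluate (HWC/4) * (3C + 2*K_L^2 + (2G+4)*K_S^2) with G = C/8 by default."""
    if G is None:
        G = C // 8
    return H * W * C / 4 * (3 * C + 2 * K_L ** 2 + (2 * G + 4) * K_S ** 2)

def attention_cost(H, W, C):
    """Rough self-attention mixing cost, 2 * (HW)^2 * C (assumed estimate)."""
    n = H * W
    return 2 * n * n * C

# Doubling both spatial dimensions quadruples the number of positions HW.
small, large = (28, 28, 64), (56, 56, 64)
print("LS conv growth:  ", ls_conv_cost(*large) / ls_conv_cost(*small))
print("attention growth:", attention_cost(*large) / attention_cost(*small))
```

With four times as many positions, the LS convolution cost grows by a factor of 4, while the attention estimate grows by a factor of 16.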

Key Experimental Results

Main Results: ImageNet-1K Classification

Model	Params (M)	FLOPs (G)	Throughput (img/s)	Top-1 (%)
EfficientViT-M3	6.9	0.3	14613	73.4
StarNet-S1	2.9	0.4	5034	73.5
LSNet-T	11.4	0.3	14708	74.9
UniRepLKNet-A	4.4	0.6	3931	77.0
SHViT-S3	14.2	0.6	8993	77.4
LSNet-S	16.1	0.5	9023	77.8
AFFNet	5.5	1.5	1355	79.8
RepViT-M1.1	8.2	1.3	3604	79.4
LSNet-B	23.2	1.3	3996	80.3

LSNet-T achieves 74.9% top-1 with only 0.3G FLOPs, ahead of the listed models at comparable FLOPs (EfficientViT-M3 at 73.4%, StarNet-S1 at 73.5%). LSNet-B achieves 80.3%, outperforming AFFNet by 0.5 percentage points with roughly 3$\times$ higher throughput (3996 vs. 1355 img/s).
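The accuracy gaps and speed ratios quoted here follow directly from the table. The short sketch below reproduces the arithmetic; the numbers are copied from the table above, and `compare` is a hypothetical helper.

```python
# (params M, FLOPs G, throughput img/s, top-1 %) copied from the table above.
RESULTS = {
    "EfficientViT-M3": (6.9, 0.3, 14613, 73.4),
    "LSNet-T": (11.4, 0.3, 14708, 74.9),
    "SHViT-S3": (14.2, 0.6, 8993, 77.4),
    "LSNet-S": (16.1, 0.5, 9023, 77.8),
    "AFFNet": (5.5, 1.5, 1355, 79.8),
    "LSNet-B": (23.2, 1.3, 3996, 80.3),
}

def compare(model, baseline):
    """Return (top-1 gain in points, throughput ratio) of model over baseline."""
    _, _, thr_m, acc_m = RESULTS[model]
    _, _, thr_b, acc_b = RESULTS[baseline]
    return round(acc_m - acc_b, 1), round(thr_m / thr_b, 2)

for model, baseline in [("LSNet-T", "EfficientViT-M3"),
                        ("LSNet-S", "SHViT-S3"),
                        ("LSNet-B", "AFFNet")]:
    gain, speedup = compare(model, baseline)
    print(f"{model} vs {baseline}: +{gain} points, {speedup}x throughput")
```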

COCO Detection + Instance Segmentation (RetinaNet / Mask R-CNN)

Backbone	FLOPs (G)	RetinaNet AP	Mask R-CNN AP^b / AP^m
EfficientViT-M4	1.6	32.7	32.8 / 31.0
StarNet-S1	2.2	33.6	—
LSNet-S	2.5	36.7	37.1 / 34.5

Ablation Study

Variant	Top-1 (%)	Description
Only Large Kernel DW	76.7	No dynamic aggregation
Only Small Kernel Dynamic	77.0	No wide-range perception
Simple Concatenation of Large+Small Kernel	77.2	Lacks the perception $\rightarrow$ aggregation guidance relationship
LS Conv (Full)	77.8	Large-kernel perception guiding small-kernel dynamic aggregation

Key Findings

Heteroscale outperforms Homoscale: The perception-aggregation heteroscale design of LS Conv is more efficient than both self-attention (homoscale) and standard convolution (homoscale + static).
Not simple stacking: Simply concatenating large and small kernels yields only 77.2%, whereas the structured combination in LS Conv achieves 77.8% (+0.6 percentage points). This suggests that letting perception guide aggregation works better than combining the two kernels in parallel.
Throughput advantage: LSNet-T reaches 14708 img/s, making it one of the fastest lightweight models.

Highlights & Insights

Precise biological vision-inspired mapping: Peripheral vision $\rightarrow$ large-kernel perception, foveal vision $\rightarrow$ small-kernel aggregation. The biological analogy is not just a narrative gimmick, but is converted into concrete computational structures.
Decoupling perception and aggregation: Breaks the "perception range = aggregation range" constraint of self-attention, allowing the use of cheap large kernels to obtain large receptive fields, and small kernels to maintain low aggregation overhead.
Linear complexity + Dynamics: Simultaneously achieves the linear complexity of large kernels and the content adaptability of dynamic convolutions.
Universality: Comprehensive SOTA performance across classification, detection, and segmentation tasks, rather than a single-point breakthrough.

Limitations & Future Work

The parameter count is relatively high (LSNet-T 11.4M vs EfficientViT-M3 6.9M). Although the FLOPs are low, memory footprint could be a bottleneck.
Pre-training is limited to ImageNet-1K; higher-resolution or larger-scale pre-training (e.g., ImageNet-22K) was not explored.
$K_L=7, K_S=3$ are empirical values; a systematic search for kernel sizes was not presented.
The final stage still falls back on MSA Blocks, rather than fully unifying into the LS Conv architecture.
The group mechanism of the dynamic kernel ($G=C/8$) increases implementation complexity, and its compatibility with specialized hardware requires evaluation.
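To put the parameter-count limitation above in perspective, the following back-of-the-envelope sketch converts parameter counts into weight storage. It assumes 4 bytes per parameter for fp32 (2 for fp16), counts weights only (no activations or optimizer state), and uses 1 MB = 10^6 bytes; `weight_megabytes` is a hypothetical helper.

```python
def weight_megabytes(params_millions, bytes_per_param=4):
    """Approximate weight storage in MB (10^6 bytes) for a parameter count in millions."""
    return params_millions * bytes_per_param

# Parameter counts taken from the ImageNet-1K results table.
for name, params in [("LSNet-T", 11.4), ("EfficientViT-M3", 6.9)]:
    fp32 = weight_megabytes(params)
    fp16 = weight_megabytes(params, bytes_per_param=2)
    print(f"{name}: {fp32:.1f} MB in fp32, {fp16:.1f} MB in fp16")
```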

Related Work & Insights

RepLKNet/UniRepLKNet: Representatives of large-kernel convolutions, but aggregation remains static $\rightarrow$ LSNet introduces dynamic aggregation to complement the shortcomings of the large-kernel scheme.
Involution (Li et al.): Generates dynamic kernels based on single-pixel MLPs $\rightarrow$ LSNet replaces the single-pixel MLP with large-kernel perception, providing richer contextual information.
EfficientViT: Cascaded group attention $\rightarrow$ still limited by the homoscale issue of self-attention.
Insight: The idea that perception and aggregation can have different scopes may generalize to NLP (e.g., token mixing could first model relationships broadly, then aggregate locally).

Rating

⭐⭐⭐⭐: The design is clear and biologically motivated, achieves state-of-the-art results across three tasks, and comes with released code. However, the parameter count is relatively high, and the key hyperparameters lack systematic ablation studies.