OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

Conference: CVPR 2025
arXiv: 2502.20087
Code: https://bit.ly/OverLoCK
Area: Segmentation/Vision Backbone
Keywords: Convolutional Neural Networks, Top-Down Attention, Dynamic Convolution, Long-Range Dependencies, Backbone

TL;DR

OverLoCK is the first pure convolutional backbone to explicitly incorporate a top-down attention mechanism. Using a deep-stage decomposition strategy (DDS) and context-mixing dynamic convolution (ContMix), OverLoCK-T surpasses ConvNeXt-B on ImageNet-1K (84.2% vs. 83.8% Top-1) with roughly 1/3 of the FLOPs, and the family also outperforms comparably sized backbones on detection and segmentation.

Background & Motivation

The top-down attention mechanism in the human visual system—first obtaining a global overview to discover salient cues, and then closely examining details—has been largely neglected in modern vision backbones.

Key challenges faced by current backbone networks:

Lack of Feedback in Pyramidal Architectures: Existing ConvNet/ViT/Mamba backbones adopt a step-by-step downsampling pyramidal structure. Intermediate layers can only rely on prior features, lacking explicit top-down semantic guidance.
Experimental Verification: Visualizations of Class Activation Maps (CAM) of Swin-T, ConvNeXt-T, and VMamba-T reveal that even at Stage 4 (close to the classifier), these models still struggle to accurately locate target objects, with performance being even worse at Stage 3.
Limitations of Prior Work: Recurrent top-down architectures introduce excessive computational overhead, resulting in a poor performance-complexity trade-off. Task-specific feedback designs are unsuitable for building general-purpose backbones.

Another key challenge is how to equip pure convolutions with dynamic global modeling capabilities (analogous to Transformers/Mamba) while preserving the inherent local inductive bias of convolutions. A large-kernel convolution has a fixed size, so its receptive field covers a shrinking fraction of the feature map as the input resolution increases, while deformable convolutions sacrifice local inductive bias.
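The shrinking-coverage argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `kernel_coverage`, the 13-wide kernel, and the stride of 16 are assumptions chosen for the example, not values taken from the paper.

```python
def kernel_coverage(kernel_size, image_size, stride=16):
    """Fraction of one feature-map side spanned by a kernel_size-wide kernel."""
    # Feature-map side length after downsampling the image by `stride`.
    feature_size = image_size // stride
    # A kernel cannot cover more than the whole side, so clip at 1.0.
    return min(1.0, kernel_size / feature_size)

# Same kernel, growing input resolution: coverage drops steadily.
for image_size in (224, 512, 1024):
    print(image_size, round(kernel_coverage(13, image_size), 3))
# 224 0.929
# 512 0.406
# 1024 0.203
```

Under these assumptions, a kernel that spans almost the whole feature map at 224×224 spans only about a fifth of each side at 1024×1024, which is the gap ContMix is meant to close.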

Method

Overall Architecture

OverLoCK decomposes the network into three collaborative sub-networks: Base-Net encodes mid-to-low-level features (Stage 1 to the first half of Stage 3); a lightweight Overview-Net rapidly generates coarse-grained global semantic overviews (Stages 3-4); a powerful Focus-Net performs detailed perception guided by top-down signals (Stages 3-4). The output of Overview-Net is injected into every building block of Focus-Net as a context prior.

Key Designs

Design 1: Deep-stage Decomposition Strategy (DDS)

Function: Explicitly encode the "overview first, look closely next" human visual mechanism into the network architecture.
Mechanism: Base-Net downsamples images to \(H/16 \times W/16\). Overview-Net further downsamples them to \(H/32 \times W/32\) to rapidly obtain the semantic overview (context prior). Focus-Net receives features from Base-Net and the context prior, progressively refining features under top-down guidance. Sharing Base-Net between the two sub-backbones minimizes additional overhead.
Design Motivation: Realizing top-down attention via a branched architecture rather than a recurrent one avoids the computational redundancy of recurrent structures. During pre-training, both sub-networks have individual classification heads, while only the Focus-Net output is used in downstream tasks.
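The resolution bookkeeping of DDS can be sketched as follows. This is a shape-only sketch, not the authors' implementation: the function name `dds_shapes` and the dictionary keys are hypothetical, and only the two strides (16 for Base-Net, 32 for Overview-Net) come from the description above.

```python
def dds_shapes(height, width):
    """Return the spatial size produced by each sub-network (shape-only sketch)."""
    # Base-Net: shared trunk that downsamples the image by 16.
    base = (height // 16, width // 16)
    # Overview-Net: downsamples once more to stride 32 and yields the context prior.
    overview = (height // 32, width // 32)
    # Focus-Net: starts from the Base-Net features and is guided by the context prior.
    focus_input = base
    return {"base": base, "overview_prior": overview, "focus_input": focus_input}

shapes = dds_shapes(224, 224)
print(shapes)
# {'base': (14, 14), 'overview_prior': (7, 7), 'focus_input': (14, 14)}
```

Because both branches consume the same Base-Net output, the extra cost of the top-down path is limited to the lightweight Overview-Net itself.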

Design 2: Context-Mixing Dynamic Convolution (ContMix)

Function: Equip fixed-size convolution kernels with adaptive long-range dependency modeling capabilities while retaining the local inductive bias.
Mechanism: For an input feature map, an affinity matrix \(A^g \in \mathbb{R}^{HW \times S^2}\) is computed between each token and the \(S \times S\) regional centers. A learnable linear layer \(W_d \in \mathbb{R}^{S^2 \times K^2}\), where \(K\) is the kernel size, aggregates the affinity values into spatially varying dynamic convolution kernels \(D^g = \text{softmax}(A^g W_d)\). Since the weights of each kernel encode global context, sliding-window convolutions can capture long-range dependencies.
Design Motivation: The receptive field of large-kernel convolutions is static, shrinking relatively as resolution increases. By mixing global context into kernel weights, ContMix enables global perception even under a fixed kernel size, while maintaining the local structure of convolutions.

\[D^g = \text{softmax}(A^g W_d) \in \mathbb{R}^{HW \times K^2}\]
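A minimal NumPy sketch of the formula above may help fix the shapes. It is a sketch under several assumptions that are not stated in this summary: regional centers are obtained by average pooling, queries and keys use no learned projections, there is a single kernel group shared by all channels, the softmax runs over the \(K^2\) kernel taps, and borders are zero-padded. The function names `region_centers` and `contmix` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_centers(x, s):
    """Average-pool an (H, W, C) map into s*s regional centers of shape (s*s, C)."""
    h, w, c = x.shape
    assert h % s == 0 and w % s == 0, "sketch assumes H and W are divisible by S"
    return x.reshape(s, h // s, s, w // s, c).mean(axis=(1, 3)).reshape(s * s, c)

def contmix(x, w_d, s, k):
    """Single-group ContMix sketch: x is (H, W, C), w_d is (S*S, K*K)."""
    h, w, c = x.shape
    q = x.reshape(h * w, c)            # one query token per pixel
    centers = region_centers(x, s)     # (S^2, C) keys (assumed: average pooling)
    a = q @ centers.T                  # affinity A^g: (HW, S^2)
    d = softmax(a @ w_d, axis=-1)      # dynamic kernels D^g: (HW, K^2)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero padding keeps H x W
    kernels = d.reshape(h, w, k, k)
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Shift the padded map and weight it by each pixel's own kernel tap.
            out += kernels[:, :, i, j, None] * xp[i:i + h, j:j + w, :]
    return out, d

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 8))
w_d = rng.standard_normal((7 * 7, 5 * 5)) * 0.1
y, d = contmix(x, w_d, s=7, k=5)
print(y.shape, d.shape)  # (14, 14, 8) (196, 25)
```

Each pixel still aggregates only a \(K \times K\) neighbourhood, but the weights it uses depend on its affinity to all \(S^2\) regions, which is how global context enters a local operator.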

Design 3: Context Flow and Gated Dynamic Spatial Aggregator (GDSA)

Function: Continuously update and utilize top-down semantic guidance within Focus-Net.
Mechanism: The context prior \(P_i\) and feature map \(Z_i\) are concatenated and fed into the Dynamic Block. Within ContMix, \(P_i\) is used to compute the keys (regional centers) while \(Z_i\) is used to compute the queries, so the kernel weights are guided by context. The block output is then split back into a feature part and a context part \(P_i'\), and the context prior is updated via \(P_{i+1} = \alpha P_i' + \beta P_o\), where \(P_o\) is the initial context prior, to prevent the context from being diluted.
Design Motivation: Top-down guidance should not be a one-time injection but should continuously affect the feature extraction process within each block. A residual connection to the initial context prior prevents information decay.
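The concatenate, split, and update cycle described above can be sketched in a few lines. This is an illustration rather than the authors' code: `context_flow_step` is a hypothetical name, `block` is a stand-in for the Dynamic Block, the default values of `alpha` and `beta` are placeholders, and the sketch assumes the prior and the features already share the same spatial size.

```python
import numpy as np

def context_flow_step(z_i, p_i, p_o, block, alpha=1.0, beta=1.0):
    """One Focus-Net block step (sketch): returns (Z_{i+1}, P_{i+1})."""
    c_z = z_i.shape[-1]
    fused = np.concatenate([z_i, p_i], axis=-1)       # concatenate features and prior
    out = block(fused)                                # stand-in for the Dynamic Block
    z_next, p_prime = out[..., :c_z], out[..., c_z:]  # separate the two outputs
    p_next = alpha * p_prime + beta * p_o             # P_{i+1} = alpha * P_i' + beta * P_o
    return z_next, p_next

z = np.ones((7, 7, 4))
p_o = 2.0 * np.ones((7, 7, 2))
# A toy block that halves its input, used only to make the arithmetic visible.
z_next, p_next = context_flow_step(z, p_o, p_o, block=lambda t: 0.5 * t)
print(z_next.shape, p_next.shape, p_next[0, 0, 0])  # (7, 7, 4) (7, 7, 2) 3.0
```

Even though the toy block shrinks the context part from 2.0 to 1.0, adding back the initial prior keeps it from fading, which is the point of the `beta * p_o` term.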

Loss & Training

During ImageNet pre-training, Focus-Net and Overview-Net are each connected to an individual classification head, optimized with the same cross-entropy classification loss. In downstream tasks, Overview-Net no longer requires auxiliary supervision.
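The two-head objective can be sketched as the sum of two cross-entropy terms. The function names are hypothetical, and `aux_weight` is an assumed knob: this summary does not state how the Overview-Net term is weighted against the Focus-Net term.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch; logits is (N, num_classes), labels is (N,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def pretraining_loss(focus_logits, overview_logits, labels, aux_weight=1.0):
    # aux_weight is hypothetical; the source only says both heads use cross-entropy.
    main = cross_entropy(focus_logits, labels)
    aux = cross_entropy(overview_logits, labels)
    return main + aux_weight * aux

labels = np.array([0, 3])
uniform_logits = np.zeros((2, 4))  # uniform predictions over 4 classes
loss = pretraining_loss(uniform_logits, uniform_logits, labels)
print(round(float(loss), 4))  # 2.7726, i.e. 2 * ln(4)
```

For downstream tasks only the Focus-Net output is consumed, so the auxiliary term is simply omitted.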

Key Experimental Results

ImageNet-1K Image Classification (224×224)

Method	Type	FLOPs(G)	Params(M)	Top-1 Acc(%)
ConvNeXt-T	ConvNet	4.5	29	82.1
UniRepLKNet-T	ConvNet	4.9	31	83.2
VMamba-T	Mamba	4.9	30	82.6
Swin-T	Transformer	4.5	29	81.3
OverLoCK-T	ConvNet	4.6	29	84.2
ConvNeXt-B	ConvNet	15.4	89	83.8
OverLoCK-T vs ConvNeXt-B	—	~1/3 FLOPs	~1/3 Params	+0.4
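The "~1/3" entries in the last row follow directly from the table values. A quick check, with variable names chosen for illustration:

```python
# Values copied from the table above.
overlock_t = {"flops_g": 4.6, "params_m": 29, "top1": 84.2}
convnext_b = {"flops_g": 15.4, "params_m": 89, "top1": 83.8}

flops_ratio = overlock_t["flops_g"] / convnext_b["flops_g"]
params_ratio = overlock_t["params_m"] / convnext_b["params_m"]
top1_gain = overlock_t["top1"] - convnext_b["top1"]

print(f"FLOPs ratio:  {flops_ratio:.2f}")   # 0.30
print(f"Params ratio: {params_ratio:.2f}")  # 0.33
print(f"Top-1 gain:   {top1_gain:+.1f}")    # +0.4
```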

COCO Object Detection (Mask R-CNN 3x)

Method	FLOPs(G)	\(AP^b\)	\(AP^m\)
ConvNeXt-S	348	49.7	43.8
MogaNet-B	373	49.9	44.2
OverLoCK-S	345	50.9	44.8

ADE20K Semantic Segmentation (UperNet)

Method	FLOPs(G)	mIoU
UniRepLKNet-T	946	48.6
MogaNet-S	946	49.2
OverLoCK-T	930	50.3

Key Findings

OverLoCK-T achieves 84.2% Top-1 accuracy with ~4.6G FLOPs, surpassing ConvNeXt-B which requires 15.4G FLOPs.
Effective Receptive Field (ERF) visualizations demonstrate that OverLoCK-T enjoys a larger ERF at Stages 3 & 4 than VMamba-T, despite being a pure ConvNet.
Class Activation Maps indicate that OverLoCK can accurately locate targets as early as Stage 3, validating the effectiveness of top-down guidance.
ContMix ablation analysis indicates that concurrent usage of large and small kernel groups (multi-scale) yields the best performance.

Highlights & Insights

Biologically-inspired Architectural Innovation: Demonstrates the first explicit realization of top-down attention in a pure ConvNet, without relying on recurrent structures or introducing Transformer blocks.
Core Insight of ContMix: By encoding global context into convolution kernel weights, it cleverly endows fixed-size convolutions with "resolution-adaptive" long-range modeling capability.
Excellent Efficiency-Accuracy Trade-off: OverLoCK-T exceeds ConvNeXt-B's accuracy (84.2% vs. 83.8% Top-1) at approximately 1/3 of its computational cost (4.6G vs. 15.4G FLOPs), which suggests the gain comes from the architectural design rather than from scale.

Limitations & Future Work

The Overview-Net branch is lightweight but still adds some computational overhead.
The three-subnet architecture increases design complexity and the hyperparameter space.
The number of regional centers \(S=7\) in ContMix is fixed; the possibility of adaptive adjustment was not explored.
Extending the DDS strategy to Transformer or Mamba architectures is worth exploring.

Related Work & Insights

ConvNeXt/RepLKNet/UniRepLKNet: Evolution path of large-kernel ConvNets; OverLoCK addresses long-range modeling from a different perspective (dynamic kernels).
InternImage: Achieves dynamic modeling via deformable convolutions but sacrifices inductive bias.
AbsViT: Feedback-driven ViT backbone, but relies on recurrent structures. OverLoCK avoids recurrence through a branched design.
Insight: The "context-to-kernel" concept of ContMix can be applied to other convolutional scenarios requiring long-range dependencies.

Rating

⭐⭐⭐⭐⭐: A significant advance for pure ConvNet architectures. The core innovations (DDS + ContMix) are well grounded, the experiments are comprehensive, and the results are strong. Surpassing ConvNeXt-B with only about 1/3 of the computational budget is highly impressive, making this a notable step for vision backbone design in 2025.