MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Conference: CVPR 2026
arXiv: 2603.01361
Code: GitHub
Area: Segmentation / Crack Segmentation
Keywords: crack segmentation, hybrid architecture, Mamba attention decoupling, direction-guided edge convolution, lightweight and efficient

TL;DR

MixerCSeg analyzes Mamba's implicit attention to decouple channels into global and local branches, which are enhanced by Self-Attention and a CNN respectively. Combined with Direction-guided Edge Gated Convolution, it achieves state-of-the-art crack segmentation performance with only 2.05 GFLOPs and 2.54M parameters.

Background & Motivation

Crack segmentation is a critical technique for infrastructure health monitoring, yet faces challenges including diverse crack morphologies, uneven texture distributions, and low contrast against backgrounds. The three mainstream architectural paradigms each have their own limitations:

CNN: Strong in local feature extraction but insufficient in global modeling, struggling with complex morphologies
Transformer: Strong in global dependency modeling but computationally expensive
Mamba: Global attention with linear complexity, but its sequential processing mechanism limits the utilization of global context within a single forward pass

Existing hybrid models (MambaVision, RestorMixer) naively stack different architectures without analyzing their underlying interaction logic. The core insight of this paper is that Mamba's implicit attention naturally differentiates channels into global and local channels along the channel dimension (discovered by analyzing \(\Delta_t\)), enabling principled role assignment among CNN, Transformer, and Mamba.

Method

Overall Architecture

An encoder-decoder structure: the input passes through a Stem layer, followed by TransMixer Blocks that extract multi-scale features \(\{F_1, F_2, F_3, F_4\}\) → DEGConv enhances edge and directional awareness → SRF module performs multi-scale fusion → segmentation head outputs pixel-level predictions.

Key Designs

TransMixer Block: Standard Mamba operations (Eq. 1–2) are first applied to obtain output \(Y\); channels are then sorted along the channel dimension according to \(\Delta_t\) (the factor controlling the influence of historical tokens on the current token). The top \(d_g = d \cdot \gamma\) channels are designated as global tokens (large \(\Delta_t\), fast decay, attending more to the current token), while the remaining \(d_l = d \cdot (1-\gamma)\) channels are designated as local tokens. The global branch is fed into Self-Attention to enhance long-range dependencies; the local branch is fed into a Local Refinement Module (Norm → Reshape → MaxPool2d → Conv \(1\times1\) → Sigmoid gate → element-wise multiplication with the original features) to enhance fine-grained details. The default setting is \(\gamma = 0.5\). This design enables the three architectures to fulfill distinct roles rather than being naively stacked.
Direction-guided Edge Gated Convolution (DEGConv): Proceeds in three steps: (a) Rearrange: the feature map is partitioned into \(N\) non-overlapping local views \(F_i^j \in \mathbb{R}^{C_i \times h_i \times w_i}\), each processed independently; (b) Direction embedding generation: for each view, channels are averaged → Sobel operators compute horizontal/vertical gradients → \(\theta = \arctan(d_y/d_x)\) yields directional angles → cells and bins are divided to construct directional histograms → a directional embedding vector \(\epsilon \in \mathbb{R}^{C_i}\) is obtained via Conv + ReLU + AvgPool; (c) Gated edge convolution: \(g = \sigma_2(\text{EdgeConv}(F_i^j + \epsilon))\), \({F_i^j}' = g \odot \text{EdgeConv}(F_i^j)\). EdgeConv uses \(1 \times k\) and \(k \times 1\) strip convolutions to extract horizontal and vertical features respectively, followed by concatenation and depthwise convolution. Crack orientation is explicitly modeled through directional priors.
Spatial Refinement Multi-Level Fusion (SRF): A spatial attention map \(\alpha = \sigma_2(\text{Conv}_{1\times1}(F_1'))\) is generated from the high-resolution feature \(F_1'\) to weight the upsampled low-resolution features \(F_i'' = \alpha \odot F_i^{up}\); all scale features are then concatenated and fed into the segmentation head \(r = \mu([F_1^{up}; F_2^{up}; F_3^{up}; F_4^{up}])\). High-resolution details guide low-resolution semantic fusion without additional computational cost.
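
The channel-routing step of the TransMixer Block can be illustrated with a minimal sketch. It assumes each channel is scored by the mean of its \(\Delta_t\) over all tokens (the exact reduction is not specified here), and the names `split_channels_by_delta` and `route` are hypothetical helpers, not taken from the released code. The Self-Attention and Local Refinement branches themselves are omitted.

```python
def split_channels_by_delta(delta, gamma=0.5):
    """Split channel indices into (global, local) by their mean step size.

    delta[c][t] is the step size of channel c at token t (hypothetical layout).
    """
    d = len(delta)
    d_g = int(d * gamma)  # number of global channels, d_g = d * gamma
    # Assumption: each channel is scored by its mean delta over all tokens.
    scores = [sum(row) / len(row) for row in delta]
    # Rank channels from largest to smallest score.
    order = sorted(range(d), key=lambda c: scores[c], reverse=True)
    # Top d_g channels go to the global branch, the rest to the local branch.
    return sorted(order[:d_g]), sorted(order[d_g:])


def route(y, global_idx, local_idx):
    """Gather the Mamba output channels y[c] destined for each branch."""
    return [y[c] for c in global_idx], [y[c] for c in local_idx]


# Toy example: 4 channels, 2 tokens.
delta = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.4], [0.05, 0.05]]
y = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
global_idx, local_idx = split_channels_by_delta(delta, gamma=0.5)
y_global, y_local = route(y, global_idx, local_idx)
```

With \(\gamma = 0.5\), the two channels with the largest mean \(\Delta_t\) (indices 1 and 2 in the toy data) would go to the Self-Attention branch and the other two to the Local Refinement Module.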

Loss & Training

BCE + Dice Loss with ratio 1:5
Single NVIDIA A100 GPU, 50 epochs, batch size = 1
AdamW optimizer, initial lr = 5e-4
Input size: 512×512
Key hyperparameters: \(\gamma=0.5\), cell size = (8, 8), number of bins \(n=180\) (36 for Crack500, due to smoother crack curvature and larger crack width)
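
As a rough illustration of the loss setup listed above, the following sketch combines binary cross-entropy and Dice loss on flattened per-pixel probabilities. It assumes the 1:5 ratio means a weight of 1 on BCE and 5 on Dice (the ordering is inferred from "BCE + Dice"); the function name and the `eps` smoothing term are illustrative choices, not the paper's.

```python
import math


def bce_dice_loss(probs, targets, w_bce=1.0, w_dice=5.0, eps=1e-7):
    """Weighted sum of BCE and Dice loss over flattened pixel probabilities."""
    n = len(probs)
    # Clamp probabilities away from 0 and 1 so that log() stays finite.
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(
        t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
        for p, t in zip(clamped, targets)
    ) / n
    # Soft Dice: 1 - 2|P ∩ T| / (|P| + |T|), smoothed by eps.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return w_bce * bce + w_dice * dice


# A perfect prediction gives a loss near zero; an uncertain one does not.
perfect = bce_dice_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
uncertain = bce_dice_loss([0.5, 0.5, 0.5], [1.0, 0.0, 1.0])
```

Weighting Dice more heavily is a common choice when the foreground (thin cracks) occupies a small fraction of the pixels, since Dice is less dominated by the background class than BCE.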

Key Experimental Results

Main Results

Dataset	Metric	MixerCSeg	Runner-up	Relative gain
DeepCrack	mIoU	0.9151	0.9022 (SCSegamba)	+1.43%
CamCrack789	mIoU	0.8409	0.8372 (U-Net)	+0.44%
CrackMap	mIoU	0.8123	0.8094 (SCSegamba)	+0.36%
Crack500	mIoU	0.7824	0.7778 (SCSegamba)	+0.59%
DeepCrack	F1	0.9205	0.9110 (SCSegamba)	+1.04%
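
The gain column is consistent with a relative improvement over the runner-up rather than an absolute difference in score. A short check, using only the numbers in the table above (the helper name is illustrative):

```python
# (ours, runner-up) scores copied from the results table.
results = {
    ("DeepCrack", "mIoU"): (0.9151, 0.9022),
    ("CamCrack789", "mIoU"): (0.8409, 0.8372),
    ("CrackMap", "mIoU"): (0.8123, 0.8094),
    ("Crack500", "mIoU"): (0.7824, 0.7778),
    ("DeepCrack", "F1"): (0.9205, 0.9110),
}


def relative_gain_pct(ours, runner_up):
    """Improvement over the runner-up, as a percentage of the runner-up score."""
    return 100.0 * (ours - runner_up) / runner_up


gains = {key: round(relative_gain_pct(*pair), 2) for key, pair in results.items()}
```

For DeepCrack mIoU, for example, the absolute difference is 0.0129 (1.29 points), which corresponds to the 1.43% relative gain listed.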

Model	FLOPs (G)	Params (M)	Memory (MiB)
MixerCSeg	2.05	2.54	1190
SCSegamba	18.16	2.80	2206
RestorMixer	98.71	3.19	10384
MambaVision	642.86	13.57	5222
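
To put the efficiency table in perspective, the following sketch computes how the FLOPs of each baseline compare with MixerCSeg. It uses only the values listed above; the function names are illustrative.

```python
# FLOPs in G, copied from the efficiency table.
flops_g = {
    "MixerCSeg": 2.05,
    "SCSegamba": 18.16,
    "RestorMixer": 98.71,
    "MambaVision": 642.86,
}


def reduction_pct(ours, other):
    """Percentage of the other model's FLOPs that our model saves."""
    return 100.0 * (other - ours) / other


def cost_ratio(ours, other):
    """How many times more FLOPs the other model needs."""
    return other / ours


ours = flops_g["MixerCSeg"]
summary = {
    name: (round(reduction_pct(ours, f), 1), round(cost_ratio(ours, f), 1))
    for name, f in flops_g.items()
    if name != "MixerCSeg"
}
```

The SCSegamba entry works out to an 88.7% reduction, roughly a 9× difference, while the gap to MambaVision exceeds 300×.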

Ablation Study

Configuration	DeepCrack mIoU	CamCrack789 mIoU	Note
Baseline (VMamba+Segformer)	0.8826	0.8283	No additional modules
+ TransMixer	0.9016	0.8359	Significant encoder enhancement
+ DEGConv	0.9097	0.8381	Directional edge modeling
+ SRF	0.9151	0.8409	Multi-scale fusion completed

Key Findings

MixerCSeg reduces FLOPs by 88.7% compared to SCSegamba (2.05 vs. 18.16 GFLOPs) while achieving higher mIoU on all four datasets
TransMixer outperforms naive stacking approaches (MambaVision, RestorMixer), validating that attention-property-based decoupling is superior to blind stacking
\(\gamma = 0.5\) (equal split between global and local channels) gives the best results among the channel allocation ratios tested
The number of bins in the directional embedding requires dataset-specific tuning: 180 bins for complex cracks, 36 bins for smooth and wide cracks
Memory footprint of only 1190 MiB, suitable for edge deployment
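
To make the role of the bin count concrete, here is a simplified sketch of the orientation histogram that underlies the directional embedding: Sobel gradients, an angle per pixel, and a histogram over \([0°, 180°)\). It is not the paper's implementation. It treats the whole input as a single cell, weights each vote by gradient magnitude (a HOG-style assumption), uses `atan2` to avoid division by zero, and omits the subsequent Conv + ReLU + AvgPool that produces \(\epsilon\).

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_at(img, y, x):
    """Horizontal and vertical Sobel responses at an interior pixel (y, x)."""
    dx = sum(SOBEL_X[i][j] * img[y + i - 1][x + j - 1]
             for i in range(3) for j in range(3))
    dy = sum(SOBEL_Y[i][j] * img[y + i - 1][x + j - 1]
             for i in range(3) for j in range(3))
    return dx, dy


def orientation_histogram(img, n_bins=36):
    """Magnitude-weighted histogram of gradient angles folded into [0, 180)."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            dx, dy = sobel_at(img, y, x)
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue  # flat region, no orientation to record
            theta = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[min(int(theta / bin_width), n_bins - 1)] += magnitude
    return hist


# A dark vertical line (a toy "crack") and a horizontal intensity step.
vertical_crack = [[1, 1, 0, 1, 1] for _ in range(5)]
horizontal_step = [[0] * 5, [0] * 5, [1] * 5, [1] * 5, [1] * 5]
h_vert = orientation_histogram(vertical_crack, n_bins=36)
h_horiz = orientation_histogram(horizontal_step, n_bins=36)
```

Note that the gradient direction is perpendicular to the crack itself: the vertical line votes into the 0° bin and the horizontal step into the 90° bin. With 36 bins each bin spans 5°, while 180 bins give 1° resolution, which is one way to read why finer binning helps on cracks with more complex curvature.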

Highlights & Insights

Mechanism-driven architecture design: Rather than intuitively mixing architectures, the paper analyzes Mamba's \(\Delta_t\) attention weights to discover channel-level global/local differentiation, providing a principled basis for role assignment
The directional embedding introduces crack-segmentation-specific prior knowledge (Sobel → directional histogram → embedding), enhancing perception of irregular geometric structures
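Extremely lightweight: 2.05 GFLOPs and 2.54M parameters. FLOPs are roughly 9× to 314× lower than the Mamba-based and hybrid baselines in the efficiency table, with a comparable or smaller parameter count, yet the model achieves the best performance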
SRF guides multi-scale fusion via high-resolution features rather than simple concatenation, without additional computational cost
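
The SRF idea of letting the high-resolution feature gate the coarser ones can be sketched as follows. This is a toy version under several assumptions: features are nested lists shaped `[C][H][W]`, upsampling is nearest-neighbour, the \(1\times1\) convolution is a per-pixel weighted sum over channels with hypothetical `weights`, the gated features are what gets concatenated, and \(F_1'\) itself is passed through ungated. The segmentation head is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_gate(f1, weights, bias=0.0):
    """1x1 conv to a single channel followed by sigmoid, giving alpha[y][x]."""
    channels, height, width = len(f1), len(f1[0]), len(f1[0][0])
    return [[sigmoid(bias + sum(weights[c] * f1[c][y][x] for c in range(channels)))
             for x in range(width)] for y in range(height)]


def upsample_nearest(f, factor):
    """Nearest-neighbour upsampling of a [C][H][W] feature by an integer factor."""
    return [[[ch[y // factor][x // factor]
              for x in range(len(ch[0]) * factor)]
             for y in range(len(ch) * factor)] for ch in f]


def srf_fuse(f1, low_res, weights):
    """Gate each upsampled low-resolution feature by alpha, then concatenate."""
    alpha = spatial_gate(f1, weights)
    fused = list(f1)  # assumption: the high-resolution feature is kept as-is
    for feature, factor in low_res:
        up = upsample_nearest(feature, factor)
        fused += [[[alpha[y][x] * ch[y][x] for x in range(len(ch[0]))]
                   for y in range(len(ch))] for ch in up]
    return fused


# Toy example: one 2x2 high-res channel and one 1x1 low-res channel.
f1 = [[[0.0, 0.0], [0.0, 0.0]]]
f2 = [[[4.0]]]
fused = srf_fuse(f1, [(f2, 2)], weights=[1.0])
```

In the toy example the gate is `sigmoid(0) = 0.5` everywhere, so the upsampled low-resolution value 4.0 is scaled to 2.0 before concatenation.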

Limitations & Future Work

Validation is limited to crack segmentation; applicability to general semantic segmentation benchmarks (e.g., Cityscapes) remains to be verified
The spatial block partitioning strategy in DEGConv may introduce discontinuities at block boundaries, partially mitigated by a subsequent EdgeConv layer
The number of bins in the directional histogram requires manual tuning across datasets, lacking an adaptive mechanism
Training with batch size = 1 may limit the effectiveness of BatchNorm layers

Related Work & Insights

SCSegamba: A pioneering work applying Mamba to crack segmentation with a structure-aware scanning strategy
MambaVision: The first Mamba-Transformer hybrid visual backbone, but relies on naive stacking
RestorMixer: Combines CNN + Transformer + Mamba for image restoration, similarly lacking in-depth analysis of architectural interactions
Insight: Designing architectures based on the internal attention mechanisms of models (rather than intuitively assembling components) constitutes a more principled hybrid strategy

Rating

Novelty: ⭐⭐⭐⭐ Channel decoupling based on analysis of Mamba's implicit attention is a novel and theoretically grounded design
Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, seven SOTA comparisons, comprehensive ablation studies, and efficiency analysis
Writing Quality: ⭐⭐⭐⭐ Figures and tables are clear; the derivation chain from theoretical analysis to architectural design is complete
Value: ⭐⭐⭐⭐ Achieves an excellent efficiency-accuracy trade-off in the practical application of crack segmentation; the lightweight design has deployment value