MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Conference: CVPR 2026 arXiv: 2603.01361 Code: GitHub Area: Segmentation / Crack Segmentation Keywords: crack segmentation, hybrid architecture, Mamba attention decoupling, direction-guided edge convolution, lightweight and efficient

TL;DR

By analyzing Mamba's implicit attention mechanism, MixerCSeg decouples channels into global and local branches, enhances them with Self-Attention and a CNN respectively, and adds Direction-guided Edge Gated Convolution, achieving state-of-the-art crack segmentation at only 2.05 GFLOPs and 2.54M parameters.

Background & Motivation

Crack segmentation is a critical technique for infrastructure health monitoring, yet faces challenges including diverse crack morphologies, uneven texture distributions, and low contrast against backgrounds. The three mainstream architectural paradigms each have their own limitations:

  • CNN: Strong in local feature extraction but insufficient in global modeling, struggling with complex morphologies
  • Transformer: Strong in global dependency modeling but computationally expensive
  • Mamba: Global attention with linear complexity, but its sequential processing mechanism limits the utilization of global context within a single forward pass

Existing hybrid models (MambaVision, RestorMixer) naively stack different architectures without analyzing their underlying interaction logic. The core insight of this paper is that Mamba's implicit attention naturally differentiates channels into global and local channels along the channel dimension (discovered by analyzing \(\Delta_t\)), enabling principled role assignment among CNN, Transformer, and Mamba.

Method

Overall Architecture

An encoder-decoder structure: the input passes through a Stem layer, followed by TransMixer Blocks that extract multi-scale features \(\{F_1, F_2, F_3, F_4\}\) → DEGConv enhances edge and directional awareness → SRF module performs multi-scale fusion → segmentation head outputs pixel-level predictions.

Key Designs

  1. TransMixer Block: Standard Mamba operations (Eq. 1–2) are first applied to obtain output \(Y\); channels are then sorted along the channel dimension according to \(\Delta_t\) (the factor controlling the influence of historical tokens on the current token). The top \(d_g = d \cdot \gamma\) channels are designated as global tokens (large \(\Delta_t\), fast decay, attending more to the current token), while the remaining \(d_l = d \cdot (1-\gamma)\) channels are designated as local tokens. The global branch is fed into Self-Attention to enhance long-range dependencies; the local branch is fed into a Local Refinement Module (Norm → Reshape → MaxPool2d → Conv \(1\times1\) → Sigmoid gate → element-wise multiplication with the original features) to enhance fine-grained details. The default setting is \(\gamma = 0.5\). This design enables the three architectures to fulfill distinct roles rather than being naively stacked.
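The channel-decoupling step above can be sketched as follows. This is an illustrative numpy sketch, not the authors' code: `split_channels_by_delta` and its shapes are hypothetical, and it only shows the sort-and-split by \(\Delta_t\), not the downstream Self-Attention or Local Refinement branches.

```python
import numpy as np

def split_channels_by_delta(y, delta_t, gamma=0.5):
    """Sort channels of a Mamba output by the per-channel time-step factor
    delta_t and split them into a 'global' branch (top gamma fraction) and
    a 'local' branch (the rest). y: (L, d) tokens; delta_t: (d,)."""
    d = y.shape[1]
    d_g = int(d * gamma)                 # number of global channels, d_g = d * gamma
    order = np.argsort(-delta_t)         # channel indices by descending delta_t
    global_idx = order[:d_g]             # large delta_t -> global branch
    local_idx = order[d_g:]              # remaining channels -> local branch
    return y[:, global_idx], y[:, local_idx]

y = np.random.randn(16, 8)               # 16 tokens, 8 channels
delta_t = np.random.rand(8)
g, l = split_channels_by_delta(y, delta_t, gamma=0.5)
```

With \(\gamma = 0.5\) and 8 channels, each branch receives 4 channels; the global branch would then go to Self-Attention and the local branch to the refinement module.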

  2. Direction-guided Edge Gated Convolution (DEGConv): Proceeds in three steps: (a) Rearrange: the feature map is partitioned into \(N\) non-overlapping local views \(F_i^j \in \mathbb{R}^{C_i \times h_i \times w_i}\), each processed independently; (b) Direction embedding generation: for each view, channels are averaged → Sobel operators compute horizontal/vertical gradients → \(\theta = \arctan(d_y/d_x)\) yields directional angles → the gradient map is divided into cells and orientation bins to construct directional histograms → a directional embedding vector \(\epsilon \in \mathbb{R}^{C_i}\) is obtained via Conv + ReLU + AvgPool; (c) Gated edge convolution: \(g = \sigma_2(\text{EdgeConv}(F_i^j + \epsilon))\), \({F_i^j}' = g \odot \text{EdgeConv}(F_i^j)\). EdgeConv uses \(1 \times k\) and \(k \times 1\) strip convolutions to extract horizontal and vertical features respectively, followed by concatenation and depthwise convolution. Crack orientation is explicitly modeled through directional priors.
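Step (b) is essentially a HOG-style orientation histogram. The sketch below, a simplified assumption rather than the paper's implementation, shows the channel-mean → Sobel → angle → magnitude-weighted histogram pipeline for a single local view; it omits the per-cell partitioning and the Conv + ReLU + AvgPool projection to \(\epsilon\).

```python
import numpy as np

def direction_histogram(feat, n_bins=36):
    """Magnitude-weighted orientation histogram for one view (C, h, w).
    Simplified sketch of DEGConv's direction-embedding step."""
    x = feat.mean(axis=0)                                  # average over channels
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # Sobel x
    ky = kx.T                                              # Sobel y
    pad = np.pad(x, 1, mode="edge")
    H, W = x.shape
    dx, dy = np.zeros_like(x), np.zeros_like(x)
    for i in range(H):                                     # naive 3x3 convolution
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3]
            dx[i, j] = (patch * kx).sum()
            dy[i, j] = (patch * ky).sum()
    theta = np.arctan2(dy, dx) % np.pi                     # orientation in [0, pi)
    mag = np.hypot(dx, dy)                                 # gradient magnitude
    hist, _ = np.histogram(theta, bins=n_bins, range=(0, np.pi), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist                     # normalized histogram

view = np.random.rand(4, 8, 8)                             # one (C, h, w) local view
h = direction_histogram(view, n_bins=36)                   # 36 bins, as for Crack500
```

The `n_bins` argument corresponds to the paper's per-dataset bin count (180 by default, 36 for Crack500).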

  3. Spatial Refinement Multi-Level Fusion (SRF): A spatial attention map \(\alpha = \sigma_2(\text{Conv}_{1\times1}(F_1'))\) is generated from the high-resolution feature \(F_1'\) to weight the upsampled low-resolution features \(F_i'' = \alpha \odot F_i^{up}\); the weighted features across all scales are then concatenated and fed into the segmentation head, \(r = \mu([F_1'; F_2''; F_3''; F_4''])\). High-resolution details guide low-resolution semantic fusion without additional computational cost.
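The SRF gating-and-concatenation step can be sketched as below. This is a minimal numpy sketch under simplifying assumptions: the \(1\times1\) convolution is reduced to a per-channel weighted sum with a hypothetical weight vector `w`, and the low-resolution features are assumed already upsampled to the high-resolution grid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srf_fuse(f1, low_feats, w):
    """Sketch of SRF: an attention map from the high-res feature F1'
    (1x1 conv approximated by weights w over channels) gates the upsampled
    low-res features; all scales are then concatenated channel-wise."""
    alpha = sigmoid(np.tensordot(w, f1, axes=([0], [0])))  # (H, W) spatial attention
    gated = [alpha[None] * up for up in low_feats]         # F_i'' = alpha * F_i^up
    return np.concatenate([f1] + gated, axis=0)            # concat before seg head

C, H, W = 4, 8, 8
f1 = np.random.rand(C, H, W)                               # high-res feature F1'
ups = [np.random.rand(C, H, W) for _ in range(3)]          # upsampled F2'..F4'
w = np.random.rand(C)
fused = srf_fuse(f1, ups, w)                               # (4*C, H, W) fused map
```

Because the gate reuses an existing feature map and a \(1\times1\) projection, the refinement adds essentially no extra computation, matching the paper's claim.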

Loss & Training

  • BCE + Dice Loss with ratio 1:5
  • Single NVIDIA A100 GPU, 50 epochs, batch size = 1
  • AdamW optimizer, initial lr = 5e-4
  • Input size: 512×512
  • Key hyperparameters: \(\gamma=0.5\), cell size = (8, 8), number of bins \(n=180\) (36 for Crack500, due to smoother crack curvature and larger crack width)
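The 1:5 BCE + Dice objective above can be written compactly. A minimal numpy sketch, assuming soft Dice over probabilities; this is not the authors' implementation, and the function name and epsilon are illustrative.

```python
import numpy as np

def bce_dice_loss(pred, target, w_bce=1.0, w_dice=5.0, eps=1e-6):
    """Combined BCE + Dice loss at the paper's stated 1:5 ratio.
    pred: predicted probabilities in (0, 1); target: binary mask."""
    p = np.clip(pred, eps, 1 - eps)                          # numerical stability
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    inter = (p * target).sum()                               # soft intersection
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return w_bce * bce + w_dice * dice

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = bce_dice_loss(pred, target)
```

Weighting Dice 5x over BCE emphasizes region overlap, which is useful for thin, heavily class-imbalanced crack masks.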

Key Experimental Results

Main Results

| Dataset | Metric | MixerCSeg | Runner-up | Gain |
|---|---|---|---|---|
| DeepCrack | mIoU | 0.9151 | 0.9022 (SCSegamba) | +1.43% |
| CamCrack789 | mIoU | 0.8409 | 0.8372 (U-Net) | +0.44% |
| CrackMap | mIoU | 0.8123 | 0.8094 (SCSegamba) | +0.36% |
| Crack500 | mIoU | 0.7824 | 0.7778 (SCSegamba) | +0.59% |
| DeepCrack | F1 | 0.9205 | 0.9110 (SCSegamba) | +1.04% |

| Model | FLOPs (G) | Params (M) | Memory (MiB) |
|---|---|---|---|
| MixerCSeg | 2.05 | 2.54 | 1190 |
| SCSegamba | 18.16 | 2.80 | 2206 |
| RestorMixer | 98.71 | 3.19 | 10384 |
| MambaVision | 642.86 | 13.57 | 5222 |

Ablation Study

| Configuration | DeepCrack mIoU | CamCrack789 mIoU | Note |
|---|---|---|---|
| Baseline (VMamba+Segformer) | 0.8826 | 0.8283 | No additional modules |
| + TransMixer | 0.9016 | 0.8359 | Significant encoder enhancement |
| + DEGConv | 0.9097 | 0.8381 | Directional edge modeling |
| + SRF | 0.9151 | 0.8409 | Multi-scale fusion completed |

Key Findings

  • MixerCSeg reduces FLOPs by 88.7% compared to SCSegamba while achieving higher mIoU — the efficiency advantage is highly significant
  • TransMixer outperforms naive stacking approaches (MambaVision, RestorMixer), validating that attention-property-based decoupling is superior to blind stacking
  • \(\gamma = 0.5\) (equal split between global and local channels) is the optimal channel allocation ratio
  • The number of bins in the directional embedding requires dataset-specific tuning: 180 bins for complex cracks, 36 bins for smooth and wide cracks
  • Memory footprint of only 1190 MiB, suitable for edge deployment

Highlights & Insights

  • Mechanism-driven architecture design: Rather than intuitively mixing architectures, the paper analyzes Mamba's \(\Delta_t\) attention weights to discover channel-level global/local differentiation, providing a principled basis for role assignment
  • The directional embedding introduces crack-segmentation-specific prior knowledge (Sobel → directional histogram → embedding), enhancing perception of irregular geometric structures
  • Extremely lightweight: 2.05 GFLOPs + 2.54M parameters, one to two orders of magnitude smaller than most methods, yet achieves the best performance
  • SRF guides multi-scale fusion via high-resolution features rather than simple concatenation, without additional computational cost

Limitations & Future Work

  • Validation is limited to crack segmentation; applicability to general semantic segmentation benchmarks (e.g., Cityscapes) remains to be verified
  • The spatial block partitioning strategy in DEGConv may introduce discontinuities at block boundaries, partially mitigated by a subsequent EdgeConv layer
  • The number of bins in the directional histogram requires manual tuning across datasets, lacking an adaptive mechanism
  • Training with batch size = 1 may limit the effectiveness of BatchNorm layers
Related Work

  • SCSegamba: A pioneering work applying Mamba to crack segmentation with a structure-aware scanning strategy
  • MambaVision: The first Mamba-Transformer hybrid visual backbone, but relies on naive stacking
  • RestorMixer: Combines CNN + Transformer + Mamba for image restoration, similarly lacking in-depth analysis of architectural interactions
  • Insight: Designing architectures based on the internal attention mechanisms of models (rather than intuitively assembling components) constitutes a more principled hybrid strategy

Rating

  • Novelty: ⭐⭐⭐⭐ Channel decoupling based on analysis of Mamba's implicit attention is a novel and theoretically grounded design
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, seven SOTA comparisons, comprehensive ablation studies, and efficiency analysis
  • Writing Quality: ⭐⭐⭐⭐ Figures and tables are clear; the derivation chain from theoretical analysis to architectural design is complete
  • Value: ⭐⭐⭐⭐ Achieves an excellent efficiency-accuracy trade-off in the practical application of crack segmentation; the lightweight design has deployment value