CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Conference: CVPR 2026
arXiv: 2604.05689
Code: https://github.com/NEU-Liuxuecong/CRFT
Area: Medical Imaging / Image Registration
Keywords: Cross-modal registration, feature flow learning, coarse-to-fine, discrepancy-guided attention, spatial geometric transformation

TL;DR

CRFT is a unified coarse-to-fine cross-modal image registration framework that learns modality-agnostic feature flow representations within a Transformer architecture. It employs 1/8-resolution global correspondence at the coarse stage and multi-scale local refinement at 1/2–1/4 resolution at the fine stage, coupled with iterative discrepancy-guided attention and Spatial Geometric Transform (SGT) to recursively refine flow fields and capture subtle spatial inconsistencies. CRFT outperforms SOTA methods including RAFT, GMFlow, and LoFTR across diverse cross-modal datasets covering optical, infrared, SAR, and multispectral imagery.

Background & Motivation

Background: Cross-modal image registration—establishing spatial correspondences between images acquired by different sensors—is a core problem in computer vision, with applications in 3D reconstruction, visual localization, and remote sensing analysis.

Limitations of Prior Work: (1) Hand-crafted features (SIFT/RIFT) are unreliable under strong nonlinear appearance discrepancies; (2) learning-based sparse matching methods (SuperGlue/LoFTR) are optimized for RGB imagery and generalize poorly to cross-modal inputs; (3) optical flow methods (RAFT/GMFlow) assume photometric consistency, which is violated by cross-modal inputs; (4) all existing methods struggle with combined challenges of large affine transformations, scale variation, and modality gaps.

Core Idea: (1) Learn modality-agnostic feature flow within a Transformer—establishing feature-space correspondences across modalities without relying on pixel-level consistency; (2) hierarchical coarse-to-fine matching (global + local); (3) iterative discrepancy-guided recurrent flow refinement—leveraging feature discrepancies to actively localize misaligned regions.

Method

Overall Architecture

Given an input image pair \((I^A, I^B)\), a shared ResNet CNN extracts features at three scales (1/2, 1/4, 1/8). The coarse stage (1/8 resolution) applies self-attention and cross-attention to establish global correspondences and produce an initial flow field. The fine stage (1/2 + 1/4 resolution) injects fine-grained spatial details via window attention and cross-attention. Iterative discrepancy-guided flow optimization (N rounds) then applies SGT-based feature alignment, computes feature discrepancies, applies discrepancy-weighted attention-guided updates with residual flow estimation, and employs a confidence estimation network for smoothing, ultimately producing a high-accuracy dense flow field.

Key Designs

Coarse-Stage Flow Estimation (Global Context):
- 1/8-resolution features → self-attention enhancement + cross-attention cross-modal matching → global correlation matrix → initial flow field.
- Design Motivation: Low-resolution features capture high-level structure, making them more robust to inter-modal spectral and radiometric inconsistencies, thereby stabilizing global matching.
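
The last step of the coarse stage, turning a global correlation matrix into an initial flow field, can be illustrated with a softmax-weighted coordinate readout. This is a minimal sketch that assumes the readout used in GMFlow-style global matching; the summary does not state CRFT's exact readout, and `correlation_to_flow`, its `temperature` parameter, and the toy features are hypothetical.

```python
import math


def correlation_to_flow(feat_a, feat_b, height, width, temperature=1.0):
    """Soft-argmax readout (an assumed stand-in for CRFT's coarse matcher).

    feat_a, feat_b: lists of H*W feature vectors in row-major order.
    Returns one (dx, dy) flow vector per location of image A.
    """
    dim = len(feat_a[0])
    coords = [(x, y) for y in range(height) for x in range(width)]
    flow = []
    for i, fa in enumerate(feat_a):
        # One row of the global correlation matrix: A[i] against every location in B.
        scores = [
            sum(p * q for p, q in zip(fa, fb)) / (math.sqrt(dim) * temperature)
            for fb in feat_b
        ]
        # Numerically stable softmax over all candidate locations in B.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Expected matching coordinate under the softmax distribution.
        ex = sum(w * cx for w, (cx, _) in zip(weights, coords)) / total
        ey = sum(w * cy for w, (_, cy) in zip(weights, coords)) / total
        flow.append((ex - coords[i][0], ey - coords[i][1]))
    return flow


# Toy 1x3 grid: B holds A's features shifted right by one location (with wrap-around).
feat_a = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
feat_b = [[0.0, 0.0, 10.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
flow = correlation_to_flow(feat_a, feat_b, height=1, width=3)
```

In this toy case the first two locations recover a flow of roughly (+1, 0), and the wrapped third location recovers roughly (-2, 0). Because every location in A is scored against every location in B, the cost grows quadratically with the number of locations, which is why this readout suits the 1/8-resolution level.
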
Fine-Stage Flow Refinement (Local Detail):
- 1/2 and 1/4-resolution features → window self-attention for local pattern capture → cross-attention to inject fine-grained spatial details.
- Design Motivation: High resolution preserves spatial detail, but global attention is computationally infeasible at this scale; window attention combined with hierarchical fusion addresses this limitation.
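
The cost argument behind window attention can be made concrete by counting query-key pairs per self-attention layer. The sketch below assumes a 512×512 input and 8×8 windows; both values, and the helper `attention_pairs`, are illustrative choices rather than settings reported for CRFT.

```python
def attention_pairs(height, width, window=None):
    """Number of query-key pairs scored by one self-attention layer."""
    tokens = height * width
    if window is None:
        # Global attention: every token attends to every token.
        return tokens * tokens
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    # Window attention: each token attends only to the tokens in its own window.
    return tokens * window * window


IMAGE_SIZE = 512  # hypothetical square input, not a value from the paper
WINDOW = 8        # hypothetical window size

costs = {}
for stride in (8, 4, 2):
    side = IMAGE_SIZE // stride
    costs[stride] = (
        attention_pairs(side, side),
        attention_pairs(side, side, WINDOW),
    )
    print(f"1/{stride} resolution: global={costs[stride][0]:,} windowed={costs[stride][1]:,}")
```

Under these assumptions, global attention at 1/8 resolution scores about 16.8 million pairs per layer, while the same operation at 1/2 resolution would need about 4.3 billion. Restricting attention to 8×8 windows brings the 1/2-resolution cost down to about 4.2 million.
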
Iterative Discrepancy-Guided Flow Optimization (Core Contribution):
- N rounds of iterative refinement. Per round:
  - Fine-Scale Feature Transformation (FSFT): Warp features using the current flow field.
  - Spatial Geometric Transform (SGT): Explicitly models affine/scale transformations to handle large deformations.
  - Discrepancy Computation: Residual between warped features and target features → discrepancy map.
  - Discrepancy-Guided Flow Optimization (DGFO): Discrepancy-weighted attention → automatically focuses on misaligned regions.
  - Residual Update (RU): Discrepancy-guided residual flow update.
  - Confidence Estimation Network (CENet): Predicts per-pixel confidence to smooth the final flow field.
- Design Motivation: Single-pass matching cannot handle complex nonlinear and affine deformations; recurrent refinement progressively corrects errors. Discrepancy guidance ensures attention concentrates on the most poorly aligned regions, improving efficiency.
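
The warp-then-compare part of one refinement round can be sketched as follows. This is a simplified illustration on a single-channel feature map: bilinear backward warping and an absolute-difference discrepancy are assumptions, and the learned modules (SGT, DGFO, RU, CENet) are not modeled. The helper names `bilinear_sample`, `warp`, and `discrepancy_weights` are hypothetical.

```python
def bilinear_sample(feat, x, y):
    """Sample a 2-D list at continuous (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = feat[y0][x0] * (1 - ax) + feat[y0][x1] * ax
    bottom = feat[y1][x0] * (1 - ax) + feat[y1][x1] * ax
    return top * (1 - ay) + bottom * ay


def warp(feat_b, flow):
    """Backward warp: output[y][x] samples feat_b at (x + dx, y + dy)."""
    return [
        [bilinear_sample(feat_b, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(len(feat_b[0]))]
        for y in range(len(feat_b))
    ]


def discrepancy_weights(feat_a, warped_b):
    """Absolute residual per location, plus the same map normalised to sum to 1."""
    disc = [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(feat_a, warped_b)]
    total = sum(map(sum, disc))
    if total == 0:
        return disc, disc  # perfectly aligned: nothing left to focus on
    return disc, [[d / total for d in row] for row in disc]


# Toy example: B is A shifted right by one pixel.
feat_a = [[0.0, 0.0, 1.0, 0.0, 0.0]]
feat_b = [[0.0, 0.0, 0.0, 1.0, 0.0]]
zero_flow = [[(0.0, 0.0)] * 5]
good_flow = [[(1.0, 0.0)] * 5]

disc0, weights0 = discrepancy_weights(feat_a, warp(feat_b, zero_flow))
disc1, weights1 = discrepancy_weights(feat_a, warp(feat_b, good_flow))
```

With the zero flow, all of the weight lands on the two misaligned locations, which is the behavior the discrepancy guidance relies on. With the correct flow the discrepancy map is zero everywhere, so there is nothing left for a further round to correct.
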
Modality-Agnostic Design:
- A shared CNN encoder is used across modalities, learning modality-invariant features.
- Feature flow formulation replaces pixel-level photometric/optical flow assumptions, eliminating dependence on pixel-level consistency.

Key Experimental Results

Main Results

OSdataset (Optical–SAR Registration)

Method	Type	AEPE (px) ↓	CMR@3px ↑	CMR@1px ↑	CMR@0.7px ↑
RIFT2	Hand-crafted	23.61	22.9%	0.0%	0.0%
GMFlow	Optical flow	11.91	17.0%	0.0%	0.0%
RAFT	Optical flow	3.51	69.6%	15.9%	8.7%
ADRNet	Dense matching	1.67	90.1%	35.0%	20.6%
GDROS	Dense matching	1.34	91.1%	49.2%	35.5%
XoFTR+Flow	Semi-dense	1.13	96.2%	57.6%	41.7%
CRFT	Ours	0.65	99.0%	95.1%	89.9%

CRFT is the only method to achieve a sub-pixel AEPE (0.65 px). Its CMR@0.7px reaches 89.9%, about 2.16× that of the second-best method, XoFTR+Flow (41.7%).
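
For reference, the two metrics in the table can be computed as sketched below. AEPE is the standard average endpoint error. The summary does not define CMR@τ, so the sketch assumes it is the fraction of test pairs whose AEPE falls below τ pixels; a per-pixel threshold would be the other natural reading. The function names and sample numbers are hypothetical.

```python
import math


def aepe(pred_flow, gt_flow):
    """Average endpoint error: mean Euclidean distance between flow vectors."""
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_flow, gt_flow)
    ]
    return sum(errors) / len(errors)


def cmr(per_pair_aepe, threshold):
    """Assumed definition: share of image pairs with AEPE below the threshold."""
    hits = sum(1 for e in per_pair_aepe if e < threshold)
    return hits / len(per_pair_aepe)


# Two flow vectors: one exact, one off by a 3-4-5 triangle.
example_aepe = aepe([(1.0, 0.0), (0.0, 0.0)], [(1.0, 0.0), (3.0, 4.0)])

# Made-up per-pair errors, only to show how the thresholds behave.
pair_errors = [0.5, 0.9, 2.0, 4.0]
rates = {t: cmr(pair_errors, t) for t in (3.0, 1.0, 0.7)}
```

Tightening the threshold from 3 px to 0.7 px can only lower the rate, which is why the gap between methods widens toward the right-hand columns of the table.
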

RoadScene (Visible–Infrared Registration)

Method	AEPE (px) ↓	CMR@3px ↑	CMR@1px ↑	CMR@0.7px ↑
RIFT2	17.27	36.4%	0.0%	0.0%
RAFT	8.92	66.6%	14.1%	8.0%
ADRNet	4.72	50.1%	9.4%	4.8%
XoFTR+Flow	4.83	27.3%	0.0%	0.0%
CRFT	2.37	68.2%	18.2%	4.5%

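On RoadScene, CRFT achieves the lowest AEPE (2.37 px) and the highest CMR@3px (68.2%) and CMR@1px (18.2%). At the strictest threshold, however, its CMR@0.7px (4.5%) trails both RAFT (8.0%) and ADRNet (4.8%).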

Ablation Study

Configuration	Effect
Coarse stage only	Global correspondences established but insufficient spatial precision
+ Fine stage	Improved local detail and higher accuracy
+ Discrepancy guidance (N=1)	Further correction of geometric misalignment
+ Iterative refinement (N=3)	Best performance with stable convergence
w/o SGT	Degraded: registration accuracy drops significantly under large affine transformations
w/o discrepancy guidance	Degraded: attention is unfocused and correction is less efficient
w/o FSFT	Degraded: cross-modal feature spaces remain misaligned, which makes discrepancy computation unstable

Key Findings

- The SGT module is the most critical component for large affine transformations: without it, registration under large rotation or scale changes is nearly infeasible.
- Discrepancy-guided attention outperforms uniform attention by concentrating on regions that require correction, which makes each iteration more efficient.
- Three iterations (N=3) are sufficient for convergence; additional iterations yield diminishing returns.
- CRFT remains competitive in RGB–RGB scenarios, demonstrating that the modality-agnostic design does not sacrifice same-modality performance.

Highlights & Insights

- Modality-agnostic feature flow: Cross-modal registration is unified as flow estimation in feature space. No modality-specific design is required per sensor pair, which yields strong generalizability.
- Discrepancy-guided adaptive attention: Feature discrepancies between warped and target representations serve as attention weights, automatically localizing misaligned regions. This is substantially more efficient than uniform attention.
- Explicit geometric modeling via SGT: Affine transformations are integrated as a learnable module rather than relying on the flow field to implicitly learn large deformations.
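
The benefit of modeling the affine component explicitly can be seen by expanding an affine map into the dense flow it implies, using flow(x) = A·x + t − x. The sketch below assumes a plain 2×2 matrix plus translation in pixel coordinates. How SGT parameterizes and predicts the transform is not described in this summary, and `affine_to_flow` is a hypothetical helper.

```python
def affine_to_flow(matrix, translation, height, width):
    """Dense flow implied by p -> A p + t, returned as flow[y][x] = (dx, dy)."""
    (a, b), (c, d) = matrix
    tx, ty = translation
    return [
        [(a * x + b * y + tx - x, c * x + d * y + ty - y) for x in range(width)]
        for y in range(height)
    ]


# The identity transform implies zero flow everywhere.
identity_flow = affine_to_flow(((1, 0), (0, 1)), (0, 0), height=2, width=2)

# A 90-degree rotation about the origin: displacement grows with distance from it.
rotation_flow = affine_to_flow(((0, -1), (1, 0)), (0, 0), height=1, width=4)

# Uniform 2x scaling with a one-pixel shift in x.
scaled_flow = affine_to_flow(((2.0, 0.0), (0.0, 2.0)), (1.0, 0.0), height=2, width=2)
```

Six numbers describe a displacement field whose magnitude grows linearly with distance from the origin. A flow network without an explicit geometric module would have to regress those large per-pixel displacements directly.
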

Limitations & Future Work

- Iterative refinement (N=3) increases inference time.
- The coarse stage employs global attention, so input resolution must be controlled for large images.
- Validation is currently limited to remote sensing and navigation scenarios; application to medical registration (CT–MRI) remains to be explored.

Rating

Novelty: ⭐⭐⭐⭐ The combination of discrepancy-guided recurrence, SGT, and modality-agnostic flow is effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across optical, infrared, SAR, and multispectral scenarios.
Writing Quality: ⭐⭐⭐⭐ Detailed architectural diagrams.
Value: ⭐⭐⭐⭐ Offers general-purpose registration utility for remote sensing and navigation applications.