CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration¶
Conference: CVPR 2026 arXiv: 2604.05689 Code: https://github.com/NEU-Liuxuecong/CRFT Area: Remote Sensing / Image Registration Keywords: Cross-modal registration, feature flow learning, coarse-to-fine, discrepancy-guided attention, spatial geometric transformation
TL;DR¶
CRFT is a unified coarse-to-fine cross-modal image registration framework that learns modality-agnostic feature flow representations within a Transformer architecture. It employs 1/8-resolution global correspondence at the coarse stage and multi-scale local refinement at 1/2–1/4 resolution at the fine stage, coupled with iterative discrepancy-guided attention and Spatial Geometric Transform (SGT) to recursively refine flow fields and capture subtle spatial inconsistencies. CRFT outperforms SOTA methods including RAFT, GMFlow, and LoFTR across diverse cross-modal datasets covering optical, infrared, SAR, and multispectral imagery.
Background & Motivation¶
Background: Cross-modal image registration—establishing spatial correspondences between images acquired by different sensors—is a core problem in computer vision, with applications in 3D reconstruction, visual localization, and remote sensing analysis.
Limitations of Prior Work: (1) Hand-crafted features (SIFT/RIFT) are unreliable under strong nonlinear appearance discrepancies; (2) learning-based sparse matching methods (SuperGlue/LoFTR) are optimized for RGB imagery and generalize poorly to cross-modal inputs; (3) optical flow methods (RAFT/GMFlow) assume photometric consistency, which is violated by cross-modal inputs; (4) all existing methods struggle with combined challenges of large affine transformations, scale variation, and modality gaps.
Core Idea: (1) Learn modality-agnostic feature flow within a Transformer—establishing feature-space correspondences across modalities without relying on pixel-level consistency; (2) hierarchical coarse-to-fine matching (global + local); (3) iterative discrepancy-guided recurrent flow refinement—leveraging feature discrepancies to actively localize misaligned regions.
Method¶
Overall Architecture¶
Given an input image pair \((I^A, I^B)\), a shared ResNet encoder extracts features at three scales (1/2, 1/4, 1/8). The coarse stage (1/8 resolution) applies self-attention and cross-attention to establish global correspondences and produce an initial flow field. The fine stage (1/2 and 1/4 resolution) injects fine-grained spatial detail via window attention and cross-attention. Iterative discrepancy-guided flow optimization (N rounds) then alternates SGT-based feature alignment, feature-discrepancy computation, and discrepancy-weighted attention updates with residual flow estimation, and finally applies a confidence estimation network for smoothing, ultimately producing a high-accuracy dense flow field.
Key Designs¶
- Coarse-Stage Flow Estimation (Global Context):
- 1/8-resolution features → self-attention enhancement + cross-attention cross-modal matching → global correlation matrix → initial flow field.
- Design Motivation: Low-resolution features capture high-level structure, making them more robust to inter-modal spectral and radiometric inconsistencies, thereby stabilizing global matching.
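The coarse-stage pipeline (global correlation → soft matching → initial flow) can be sketched as below. This is an illustrative NumPy reduction, not the paper's implementation: the attention enhancement is omitted, a soft-argmax over the global correlation matrix stands in for the learned matching layer, and all names are hypothetical.

```python
import numpy as np

def coarse_flow(feat_a, feat_b):
    """Initial flow from a global correlation matrix (toy sketch).

    feat_a, feat_b: (H, W, C) coarse feature maps (the paper's
    attention-enhanced 1/8-resolution features are assumed here).
    Returns a dense (H, W, 2) flow field as (dx, dy) offsets.
    """
    H, W, C = feat_a.shape
    fa = feat_a.reshape(H * W, C)
    fb = feat_b.reshape(H * W, C)
    # Correlate every source position with every target position,
    # scaled like dot-product attention.
    corr = fa @ fb.T / np.sqrt(C)                        # (HW, HW)
    prob = np.exp(corr - corr.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)              # softmax over targets
    # Soft-argmax: expected target coordinate for each source pixel.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    flow = prob @ coords - coords                        # (HW, 2)
    return flow.reshape(H, W, 2)
```

When the two feature maps agree, the soft-argmax collapses onto the matching position and the flow is near zero; cross-modal robustness comes from the learned features, not from this matching step itself.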
- Fine-Stage Flow Refinement (Local Detail):
- 1/2 and 1/4-resolution features → window self-attention for local pattern capture → cross-attention to inject fine-grained spatial details.
- Design Motivation: High resolution preserves spatial detail, but global attention is computationally infeasible at this scale; window attention combined with hierarchical fusion addresses this limitation.
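A minimal single-head window self-attention illustrates why the fine stage stays tractable: each position attends only within its own win×win tile, so the cost drops from O((HW)²) for global attention to O(HW·win²). Projections, multiple heads, and the hierarchical fusion are omitted; this is a simplification of the paper's fine-stage attention, with hypothetical names.

```python
import numpy as np

def window_self_attention(feat, win=4):
    """Toy window self-attention over non-overlapping win x win tiles.

    feat: (H, W, C) feature map with H and W divisible by win.
    Single head, no learned projections (illustrative only).
    """
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for y in range(0, H, win):
        for x in range(0, W, win):
            tile = feat[y:y + win, x:x + win].reshape(win * win, C)
            attn = tile @ tile.T / np.sqrt(C)            # scores within the tile
            attn = np.exp(attn - attn.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)      # softmax per query
            out[y:y + win, x:x + win] = (attn @ tile).reshape(win, win, C)
    return out
```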
- Iterative Discrepancy-Guided Flow Optimization (Core Contribution):
- N rounds of iterative refinement. Per round:
- Fine-Scale Feature Transformation (FSFT): Warp features using the current flow field.
- Spatial Geometric Transform (SGT): Explicitly models affine/scale transformations to handle large deformations.
- Discrepancy Computation: Residual between warped features and target features → discrepancy map.
- Discrepancy-Guided Flow Optimization (DGFO): Discrepancy-weighted attention → automatically focuses on misaligned regions.
- Residual Update (RU): Discrepancy-guided residual flow update.
- Confidence Estimation Network (CENet): Predicts per-pixel confidence to smooth the final flow field.
- Design Motivation: Single-pass matching cannot handle complex nonlinear and affine deformations; recurrent refinement progressively corrects errors. Discrepancy guidance ensures attention concentrates on the most poorly aligned regions, improving efficiency.
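The per-round loop can be sketched structurally as follows. The learned modules are replaced here by crude stand-ins: nearest-neighbor warping for FSFT, an L2 discrepancy map as the gating signal, and a greedy ±1-pixel local search as the residual update; SGT and CENet are omitted. This is a structural illustration under those assumptions, not the paper's method.

```python
import numpy as np

def warp_nearest(feat, flow):
    # FSFT stand-in: sample feat at coordinates shifted by flow (nearest neighbor).
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    sy = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return feat[sy, sx]

def refine_flow(feat_src, feat_tgt, flow, n_iters=3):
    """Discrepancy-gated iterative flow refinement (illustrative only)."""
    H, W, _ = feat_src.shape
    for _ in range(n_iters):
        warped = warp_nearest(feat_src, flow)
        # Discrepancy map: per-pixel residual between warped and target features.
        disc = np.linalg.norm(warped - feat_tgt, axis=-1)
        gate = disc > 1e-6              # concentrate updates on misaligned pixels
        # Residual update stand-in: greedy +/-1 px search per pixel.
        best = np.zeros((H, W, 2))
        best_err = disc.copy()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cand = flow + np.array([dx, dy], dtype=float)
                err = np.linalg.norm(
                    warp_nearest(feat_src, cand) - feat_tgt, axis=-1)
                upd = err < best_err
                best[upd] = (dx, dy)
                best_err[upd] = err[upd]
        flow = flow + gate[..., None] * best
    return flow
```

Even this toy version shows the mechanism: already-aligned pixels are gated out, so each round spends its correction budget only where the discrepancy map is large.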
- Modality-Agnostic Design:
- A shared CNN encoder is used across modalities, learning modality-invariant features.
- Feature flow formulation replaces pixel-level photometric/optical flow assumptions, eliminating dependence on pixel-level consistency.
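SGT itself is a learned module; as an illustration of the underlying idea, the dense flow induced by an explicit 2×3 affine matrix can be written in closed form, so large rotation, scale, or translation is handed to the estimator as a flow field rather than learned implicitly. A minimal sketch (names hypothetical):

```python
import numpy as np

def affine_flow(theta, H, W):
    """Flow field induced by a 2x3 affine matrix theta.

    theta maps homogeneous pixel coordinates (x, y, 1) to new (x', y');
    the returned (H, W, 2) array is the per-pixel offset (x' - x, y' - y).
    """
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=-1).astype(float)   # (H, W, 3)
    mapped = coords @ theta.T                                  # (H, W, 2)
    return mapped - coords[..., :2]
```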
Key Experimental Results¶
Main Results¶
OSdataset (Optical–SAR Registration)
| Method | Type | AEPE ↓ | CMR@3px ↑ | CMR@1px ↑ | CMR@0.7px ↑ |
|---|---|---|---|---|---|
| RIFT2 | Hand-crafted | 23.61 | 22.9% | 0.0% | 0.0% |
| GMFlow | Optical flow | 11.91 | 17.0% | 0.0% | 0.0% |
| RAFT | Optical flow | 3.51 | 69.6% | 15.9% | 8.7% |
| ADRNet | Dense matching | 1.67 | 90.1% | 35.0% | 20.6% |
| GDROS | Dense matching | 1.34 | 91.1% | 49.2% | 35.5% |
| XoFTR+Flow | Semi-dense | 1.13 | 96.2% | 57.6% | 41.7% |
| CRFT | Ours | 0.65 | 99.0% | 95.1% | 89.9% |
CRFT is the only method to achieve sub-pixel AEPE (0.65); CMR@0.7px reaches 89.9%, which is 2.15× the second-best method XoFTR+Flow (41.7%).
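For reference, the two metrics in these tables follow their standard definitions; a minimal sketch, assuming CMR@t is the fraction of pixels whose end-point error falls below t pixels:

```python
import numpy as np

def aepe(flow_pred, flow_gt):
    """Average End-Point Error: mean L2 distance between predicted and
    ground-truth flow vectors over all pixels (lower is better)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def cmr(flow_pred, flow_gt, thresh):
    """Correct Match Rate at a pixel threshold: fraction of pixels whose
    end-point error is below `thresh` (higher is better)."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return float((err < thresh).mean())
```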
RoadScene (Visible–Infrared Registration)
| Method | AEPE ↓ | CMR@3px ↑ | CMR@1px ↑ | CMR@0.7px ↑ |
|---|---|---|---|---|
| RIFT2 | 17.27 | 36.4% | 0.0% | 0.0% |
| RAFT | 8.92 | 66.6% | 14.1% | 8.0% |
| ADRNet | 4.72 | 50.1% | 9.4% | 4.8% |
| XoFTR+Flow | 4.83 | 27.3% | 0.0% | 0.0% |
| CRFT | 2.37 | 68.2% | 18.2% | 4.5% |
On RoadScene, CRFT achieves the lowest AEPE (2.37) and the highest CMR@1px (18.2%).
Ablation Study¶
| Configuration | Effect |
|---|---|
| Coarse stage only | Global correspondences established but insufficient spatial precision |
| + Fine stage | Improved local detail and higher accuracy |
| + Discrepancy guidance (N=1) | Further correction of geometric misalignment |
| + Iterative refinement (N=3) | Best performance with stable convergence |
| w/o SGT | Degraded—significant drop in registration of large affine transformations |
| w/o discrepancy guidance | Degraded—unfocused attention, lower correction efficiency |
| w/o FSFT | Degraded—cross-modal feature spaces remain misaligned, leading to unstable discrepancy computation |
Key Findings¶
- The SGT module is most critical for large affine transformations—without SGT, registration under large rotation/scale changes is nearly infeasible.
- Discrepancy-guided attention outperforms uniform attention by concentrating on regions requiring correction, making iterations more efficient.
- N=3 iterations is sufficient for convergence; additional iterations yield diminishing returns.
- CRFT remains competitive on RGB–RGB scenarios, demonstrating that the modality-agnostic design does not sacrifice same-modality performance.
Highlights & Insights¶
- Modality-agnostic feature flow: Cross-modal registration is unified as flow estimation in feature space—no modality-specific design is required per sensor pair, yielding strong generalizability.
- Discrepancy-guided adaptive attention: Feature discrepancies between warped and target representations serve as attention weights, automatically localizing misaligned regions—substantially more efficient than uniform attention.
- Explicit geometric modeling via SGT: Affine transformations are integrated as a learnable module rather than relying on the flow field to implicitly learn large deformations.
Limitations & Future Work¶
- N=3 iterative refinement increases inference time.
- The coarse stage employs global attention, requiring resolution control for large images.
- Validation is currently limited to remote sensing and navigation scenarios; application to medical registration (CT–MRI) remains to be explored.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of discrepancy-guided recurrence, SGT, and modality-agnostic flow is effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across optical, infrared, SAR, and multispectral scenarios.
- Writing Quality: ⭐⭐⭐⭐ Clearly written, with detailed architectural diagrams.
- Value: ⭐⭐⭐⭐ Offers general-purpose registration utility for remote sensing and navigation applications.