3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion¶
Conference: CVPR 2026 · arXiv: 2511.19117 · Code: GitHub · Area: Image Segmentation · Keywords: Thermal imaging super-resolution, cross-modal diffusion, calibration-free fusion, RGB guidance, mobile thermal imaging
TL;DR¶
This paper proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework that performs implicit alignment and fusion of uncalibrated RGB–thermal infrared image pairs via a Cross-modal Self-attention Module (CSM) in the VAE latent space. Combined with a misalignment augmentation strategy, the method achieves state-of-the-art performance on mobile thermal imaging super-resolution and significantly improves downstream object detection and semantic segmentation.
Background & Motivation¶
Hardware bottleneck in mobile thermal imaging: Miniaturized thermal sensors on mobile platforms suffer from reduced apertures and constrained pixel sizes, yielding blurry, information-deficient outputs (typically 96×96 resolution).
Limitations of Prior Work — single-image SR: A single thermal image lacks sufficient high-frequency information to recover fine structures, especially at large upscaling factors.
Limitations of Prior Work — RGB-guided methods require calibration: Existing RGB-guided thermal SR methods rely on precise pixel-level cross-camera calibration, which is cumbersome to perform and unreliable in practical deployment.
Large cross-modal domain gap: RGB and thermal infrared imaging operate on fundamentally different physical principles; naively merging features tends to introduce unrealistic texture artifacts.
Limited thermal infrared datasets: Compared to the RGB domain, thermal infrared datasets are small in scale and lack scene diversity, constraining network training and generalization.
Spatiotemporal misalignment in practice: Multi-camera systems inevitably suffer from parallax and temporal desynchronization, to which existing methods lack robustness.
Method¶
Overall Architecture¶
3M-TI is built upon the single-step diffusion model SD-Turbo. The inputs are a low-resolution thermal image (64×64) and an uncalibrated high-resolution RGB reference image (512×512). The pipeline proceeds as follows: (1) a frozen VAE encoder maps both modalities into the latent space; (2) the CSM replaces the original self-attention layers in the UNet Transformer blocks to enable cross-modal alignment and fusion; (3) misalignment augmentation is applied to RGB images during training to improve robustness; (4) zero-initialized skip connections are added to enhance structural consistency; (5) RAM generates text prompts from the RGB image to provide semantic guidance; (6) LoRA fine-tunes the UNet (rank=16) and VAE decoder (rank=4).
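As a rough mental model, the six-step pipeline above can be sketched with stub components. Every function below is an illustrative placeholder, not the authors' code: the real system uses a frozen SD-Turbo VAE, a UNet whose self-attention is replaced by CSM, RAM for prompting, and LoRA adapters.

```python
import numpy as np

# All components are stand-ins; names and shapes are illustrative only.
def vae_encode(img):
    return img.reshape(-1, img.shape[-1])       # (1) image -> latent tokens

def ram_tag(rgb):
    return "outdoor scene"                      # (5) hypothetical semantic prompt

def unet_csm_step(z_th, z_rgb, prompt):
    # (2)+(4) single-step denoising with joint RGB-thermal attention and
    # zero-initialized skip connections, all stubbed out here
    return z_th + 0.0 * z_rgb.mean()

def vae_decode(z, h, w):
    return z.reshape(h, w, -1)                  # latent tokens -> image

def super_resolve(thermal_lr, rgb_hr):
    z_th, z_rgb = vae_encode(thermal_lr), vae_encode(rgb_hr)
    prompt = ram_tag(rgb_hr)
    z = unet_csm_step(z_th, z_rgb, prompt)
    # upsampling to 512x512 is omitted in this stub; the real VAE decoder
    # produces the high-resolution thermal output
    return vae_decode(z, *thermal_lr.shape[:2])

out = super_resolve(np.zeros((64, 64, 1)), np.zeros((512, 512, 3)))
```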
Key Design 1: Cross-modal Self-attention Module (CSM)¶
- Function: Achieves implicit alignment and fusion of RGB and thermal infrared within UNet Transformer blocks.
- Mechanism: The latent tokens of both modalities are concatenated into a joint sequence \(\{z_{RGB}, z_{th}\} \in \mathbb{R}^{B \times (M \times H \times W) \times C}\), where \(M\) is the number of modalities (here 2), allowing self-attention to simultaneously model intra-modal (thermal–thermal) and inter-modal (RGB–thermal) dependencies without additional parameters.
- Design Motivation: Standard cross-attention captures only inter-modal information while ignoring intra-modal spatial context. CSM models both relationships jointly via self-attention on the concatenated sequence. Compared to static feature concatenation followed by a fully connected projection, CSM is content-adaptive. The design is inspired by video and multi-view diffusion models.
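A minimal single-head sketch of the concatenate-then-self-attend idea (NumPy, illustrative shapes; the actual CSM reuses the UNet's pretrained attention weights rather than random projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_self_attention(z_rgb, z_th, Wq, Wk, Wv):
    """Self-attention over the concatenated RGB+thermal token sequence:
    each thermal token attends to both thermal (intra-modal) and RGB
    (inter-modal) tokens in a single operation, with no extra parameters."""
    z = np.concatenate([z_rgb, z_th], axis=0)          # joint (2N, C) sequence
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    n = z_rgb.shape[0]
    return out[:n], out[n:]                            # split back per modality

rng = np.random.default_rng(0)
N, C = 16, 8                                           # N = H*W tokens per modality
z_rgb, z_th = rng.normal(size=(N, C)), rng.normal(size=(N, C))
W = [rng.normal(size=(C, C)) * 0.1 for _ in range(3)]
o_rgb, o_th = cross_modal_self_attention(z_rgb, z_th, *W)
```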
Key Design 2: Misalignment Augmentation¶
- Function: Applies controllable spatial transformations (translation, scaling, rotation, perspective distortion) to RGB images during training.
- Mechanism: Simulates geometric offsets caused by parallax and temporal desynchronization in real multi-camera systems without physical simulation.
- Design Motivation: Existing RGB–thermal datasets are strictly pixel-aligned, causing models to overfit specific calibration configurations. Injecting synthetic misalignment forces CSM to learn robust cross-modal correspondences under uncalibrated conditions, bridging the gap between training data and real deployment.
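The augmentation can be sketched as a random similarity warp of the RGB reference (nearest-neighbor resampling; the parameter ranges below are hypothetical, and the paper additionally applies perspective distortion):

```python
import numpy as np

def random_misalignment(img, rng, max_shift=8, max_scale=0.1, max_rot=5.0):
    """Apply a random shift/scale/rotation to the RGB reference only,
    simulating parallax and temporal desynchronization during training."""
    H, W = img.shape[:2]
    theta = np.deg2rad(rng.uniform(-max_rot, max_rot))
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    co, si = np.cos(theta), np.sin(theta)
    # inverse mapping: for each output pixel, find its source location
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = (W - 1) / 2, (H - 1) / 2
    xsrc = ( (xs - cx - tx) * co + (ys - cy - ty) * si) / s + cx
    ysrc = (-(xs - cx - tx) * si + (ys - cy - ty) * co) / s + cy
    xi = np.clip(np.round(xsrc).astype(int), 0, W - 1)
    yi = np.clip(np.round(ysrc).astype(int), 0, H - 1)
    return img[yi, xi]                                 # nearest-neighbor warp

rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))
rgb_aug = random_misalignment(rgb, rng)
```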
Key Design 3: Latent-Space Diffusion and Skip Connections¶
- Function: Leverages SD-Turbo's single-step diffusion in the latent space to generate high-fidelity thermal images, with zero-initialized skip connections propagating encoder feature maps to the decoder.
- Mechanism: The generative prior of the diffusion model compensates for the scarcity of thermal infrared data by synthesizing realistic high-frequency details; skip connections maintain geometric structural consistency.
- Design Motivation: Pure CNN/Transformer methods produce over-smoothed results under severe degradation, while diffusion models can generate high-frequency details but may introduce geometric distortions. Skip connections effectively mitigate this issue.
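The zero-initialization trick can be illustrated with a 1×1 projection whose weights start at zero, so the pretrained decoder path is initially untouched and the skip contribution is learned gradually (a sketch, not the paper's exact layer):

```python
import numpy as np

class ZeroInitSkip:
    """Zero-initialized skip connection: at initialization the encoder
    feature contributes nothing, preserving the pretrained decoder's
    behavior; training then learns how much structure to pass through."""
    def __init__(self, channels):
        self.W = np.zeros((channels, channels))   # zero init is the key idea

    def __call__(self, decoder_feat, encoder_feat):
        return decoder_feat + encoder_feat @ self.W

skip = ZeroInitSkip(4)
dec = np.ones((5, 4))
enc = np.full((5, 4), 3.0)
fused = skip(dec, enc)          # identical to dec at initialization
```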
Loss & Training¶
- Loss function: \(\mathcal{L} = \mathcal{L}_2 + \lambda \cdot \mathcal{L}_{\text{LPIPS}}\), where \(\lambda = 1\), combining pixel-level L2 loss and perceptual loss LPIPS.
- Optimizer: Adam, learning rate \(2 \times 10^{-5}\), batch size = 4.
- Training time: Approximately 4 hours (8,000 iterations) on a single A800 GPU (80 GB).
- LoRA configuration: UNet rank=16, VAE decoder rank=4.
- Training data: 10,922 RGB–thermal image pairs from four datasets: IRVI, LLVIP, M3FD, and PBVS 2025.
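The training objective is the simple additive form given above. The sketch below uses a gradient-domain stand-in for the perceptual term, since actual LPIPS requires a pretrained network:

```python
import numpy as np

def l2_loss(pred, target):
    return np.mean((pred - target) ** 2)

def perceptual_stub(pred, target):
    # Stand-in for LPIPS (which needs a pretrained feature extractor);
    # a simple horizontal-gradient distance, purely for illustration.
    return np.mean(np.abs(np.diff(pred, axis=-1) - np.diff(target, axis=-1)))

def total_loss(pred, target, lam=1.0):
    # L = L2 + lambda * L_LPIPS, with lambda = 1 as in the paper
    return l2_loss(pred, target) + lam * perceptual_stub(pred, target)

x = np.linspace(0.0, 1.0, 64).reshape(8, 8)
```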
Key Experimental Results¶
Table 1: Quantitative Comparison on Public Datasets¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | MANIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|
| CoReFusion | 30.11 | 0.8588 | 0.3214 | 0.2771 | 28.35 |
| CoRPLE | 30.47 | 0.8642 | 0.3206 | 0.2833 | 30.46 |
| SwinFuSR | 29.85 | 0.8549 | 0.3085 | 0.2740 | 29.86 |
| SeeSR | 29.41 | 0.8495 | 0.1828 | 0.4278 | 35.22 |
| OSEDiff | 28.05 | 0.8422 | 0.2113 | 0.4014 | 36.30 |
| DifIISR | 27.48 | 0.7905 | 0.3484 | 0.4214 | 36.74 |
| 3M-TI (Ours) | 30.09 | 0.8610 | 0.1787 | 0.4443 | 36.66 |
3M-TI achieves the best perceptual metrics (LPIPS, MANIQA, MUSIQ) while outperforming other diffusion-based methods on fidelity metrics (PSNR, SSIM).
Table 2: Downstream Object Detection Performance¶
| Method | Precision↑ | Recall↑ | F1↑ | IoU↑ |
|---|---|---|---|---|
| SwinPaste | 0.1800 | 0.2109 | 0.1765 | 0.1941 |
| SeeSR | 0.3832 | 0.4637 | 0.3849 | 0.3022 |
| 3M-TI | 0.4565 | 0.5455 | 0.4724 | 0.3427 |
| Reference RGB | 0.4322 | 0.5708 | 0.4643 | 0.3359 |
| GT Thermal | 0.4582 | 0.5793 | 0.4887 | 0.3494 |
3M-TI surpasses the RGB reference on precision, F1, and IoU (though not recall) and approaches the detection performance obtained on ground-truth thermal images.
Ablation Study¶
- Removing the RGB reference: reconstruction becomes blurry; LPIPS degrades from 0.1787 to 0.2106.
- Removing misalignment augmentation: MUSIQ drops from 36.66 to 34.94 with noticeable high-frequency detail degradation.
- Removing skip connections: structural fidelity decreases (PSNR from 30.09 to 29.86) with geometric distortions in objects such as circular wheels.
- CSM outperforms standard cross-attention (LPIPS 0.1787 vs. 0.1953) and feature concatenation (0.1787 vs. 0.2164).
Highlights & Insights¶
- Calibration-free cross-modal fusion is the most practically significant contribution — implicit alignment via attention in the VAE latent space completely avoids the calibration challenges encountered in real deployment.
- CSM is elegantly simple: it introduces no additional parameters and captures both intra- and inter-modal dependencies through token concatenation and self-attention, representing a creative adaptation of multi-frame processing ideas from video diffusion models.
- The misalignment augmentation strategy is conceptually novel, replacing complex physical simulation with simple geometric transformations to effectively improve generalization.
- Downstream task validation is thorough: beyond perceptual quality evaluation, the method demonstrates substantive gains in object detection and semantic segmentation, confirming the practical value of the super-resolution output.
- Real hardware validation: a practical system is constructed using a HIKVISION P09 thermal camera module (under $100) and a Xiaomi 15 smartphone, demonstrating strong engineering feasibility.
Limitations & Future Work¶
- Quality ceiling of single-step diffusion: while efficient, SD-Turbo-based single-step inference may not match the generation quality of multi-step diffusion methods.
- Semantic guidance depends on RAM: the method requires reasonably high-quality RGB input; degraded references (e.g., low-light or motion-blurred) may produce erroneous semantic prompts.
- Insufficient handling of FOV discrepancy: the RGB and thermal cameras have different fields of view (74°×59° vs. 50°×50°); robustness under larger FOV gaps remains to be validated.
- Only 8× super-resolution (64→512) is evaluated: applicability to other upscaling factors is not discussed.
- Inference overhead not fully analyzed: the cross-modal self-attention in the UNet operates on sequences of length \(2HW\), and the computational complexity at higher resolutions warrants attention.
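Since attention cost grows quadratically with sequence length, doubling the token count from \(HW\) to \(2HW\) quadruples the attention score computation. A quick back-of-the-envelope check (latent grid size and channel width are illustrative):

```python
# Multiply-add count for the Q @ K^T score matrix in self-attention.
def attn_score_flops(seq_len, dim):
    return 2 * seq_len * seq_len * dim

hw = 64 * 64    # e.g. a 512x512 image after 8x VAE downsampling
dim = 320       # hypothetical channel width
ratio = attn_score_flops(2 * hw, dim) / attn_score_flops(hw, dim)
# joint RGB+thermal attention costs 4x the single-modality score matrix
```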
Related Work & Insights¶
- CoReFusion / SwinFuSR / SwinPaste: Traditional RGB-guided thermal SR methods that rely on calibration and prioritize fidelity at the expense of high-frequency detail.
- SeeSR / OSEDiff: Diffusion-based image SR methods capable of generating high-frequency content but lacking cross-modal guidance, prone to artifacts.
- DifIISR: Infrared-specific diffusion SR, but dependent on strictly aligned data.
- Stable Video Diffusion: The multi-frame joint self-attention paradigm is the direct inspiration for CSM.
- Insights: The calibration-free cross-modal fusion concept is extensible to other modality pairs (e.g., depth–RGB, SAR–optical); the misalignment augmentation strategy is broadly applicable.
Rating¶
- Novelty: ⭐⭐⭐⭐ — CSM and misalignment augmentation are novel; the calibration-free setting has significant practical relevance
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers public benchmarks, real mobile hardware, downstream tasks, and ablation studies comprehensively
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-motivated design choices, and rich figures and tables
- Value: ⭐⭐⭐⭐⭐ — Calibration-free design, low-cost hardware, and mobile deployment make this highly practical