MMOne: Representing Multiple Modalities in One Scene¶
Conference: ICCV 2025 arXiv: 2507.11129 Code: MMOne Area: Multimodal VLM Keywords: Multimodal scene representation, 3D Gaussian splatting, modality conflict, modality decomposition, thermal imaging
TL;DR¶
MMOne is a general framework that addresses property disparity and granularity disparity in multi-modal scene representation through a modality modeling module (with modality indicators) and a multi-modal decomposition mechanism. It jointly models RGB, thermal, and language modalities within a single 3DGS representation, achieving consistent improvements across all modalities.
Background & Motivation¶
3D scene representation has achieved great success in RGB rendering, evolving from NeRF to 3D Gaussian Splatting (3DGS). However, integrating multiple modalities (RGB, thermal imaging, language) into a unified scene representation poses fundamental challenges—Modality Conflicts:
Property Disparity: Different modalities have intrinsically different data characteristics. For example, RGB is a 3-dimensional color vector, whereas language requires a high-dimensional feature space; likewise, a sheet of paper occludes a heat source in the RGB and language modalities but not in thermal imaging.
Granularity Disparity: Different modalities operate at different information granularities. Thermal imaging is relatively coarse, RGB is finer, and language features remain consistent within object boundaries. Consequently, at object edges, the thermal modality favors fewer large Gaussians, while RGB requires many small ones.
Key limitations of existing methods:
- Shared opacity for all modalities, ignoring inter-modal property disparity
- The same set of Gaussians for all modalities, contradicting modality-specific granularity requirements
- Modality-specific designs that target individual modalities and do not generalize to additional ones
The authors' core question: How can the intrinsic differences among modalities be resolved when representing multiple modalities simultaneously?
Method¶
Overall Architecture¶
MMOne builds upon the 3DGS framework. Given multi-view multi-modal inputs, it progressively constructs a multi-modal scene representation. Each modality is handled by a dedicated modality modeling module, and the densification process integrates a multi-modal decomposition mechanism.
During training, each modality is rendered independently and losses are summed: \(\mathcal{L} = \sum_{i=1}^{m} \mathcal{L}_{M_i}\)
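The training loop follows directly from this formula. Below is a minimal PyTorch-style sketch of one step; `render`, `gaussians`, `loss_fns`, and `cam.gt` are hypothetical placeholders standing in for a 3DGS backend, not the authors' implementation.

```python
import torch

def training_step(gaussians, cameras, modalities, render, loss_fns):
    """Render every modality from every view and sum the per-modality losses.

    `gaussians`, `render`, `loss_fns`, and `cam.gt` are hypothetical placeholders
    for a real 3DGS backend; only the loss structure L = sum_i L_{M_i} is illustrated.
    """
    total_loss = torch.zeros(())
    for cam in cameras:
        for name in modalities:                           # e.g. ["rgb", "thermal", "language"]
            pred = render(gaussians, cam, modality=name)  # modality-specific rasterization
            total_loss = total_loss + loss_fns[name](pred, cam.gt[name])
    total_loss.backward()  # gradients reach both shared geometry and per-modality attributes
    return total_loss
```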
Modality Modeling Module (Addressing Property Disparity)¶
Two components are introduced per modality:
Modality-specific features \(m_i \in \mathbb{R}^{d_m}\): Each modality stores its own feature vector, with the dimension \(d_m\) chosen per modality to accommodate its physical properties (e.g., 3 channels for RGB, a high-dimensional embedding for language).
Modality indicator \(\alpha^m \in [0,1]\): Replaces the shared opacity by independently controlling the opacity of each modality. Rendering a modality \(m\) then follows standard 3DGS alpha compositing with the shared opacity \(\alpha_i\) replaced by the modality indicator \(\alpha_i^m\):

\[
M = \sum_{i \in \mathcal{N}} m_i \, \alpha_i^{m} \prod_{j=1}^{i-1} \left(1 - \alpha_j^{m}\right),
\]

where \(\mathcal{N}\) is the set of depth-sorted Gaussians covering the pixel and \(m_i\) is the modality-specific feature of the \(i\)-th Gaussian.
Key roles of the modality indicator:
- Weighting mechanism: provides different rendering weights for different modalities.
- Switch function: selectively deactivates certain modalities during rendering. When a modality is "switched off," the geometric attributes of the Gaussians are influenced only by the remaining active modalities.
The switch functionality is implemented in CUDA rasterization by skipping the rendering of specific modalities, thereby freezing their corresponding gradient updates.
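To make the weight/switch roles concrete, here is an illustrative Python re-implementation of per-pixel compositing for a single modality. It mirrors the behavior described above but is not the paper's CUDA rasterizer; the `active` flag stands in for the switch.

```python
import torch

def composite_modality(features, indicators, active=True):
    """Front-to-back alpha compositing for one modality (illustrative, not the CUDA kernel).

    features   : (N, d_m) modality-specific features of depth-sorted Gaussians at a pixel
    indicators : (N,) modality indicators alpha^m in [0, 1], used in place of a shared opacity
    active     : the "switch"; when False the modality is skipped entirely, so no gradient
                 from it reaches the shared geometric attributes
    """
    if not active:
        return None  # switched off: rendering (and hence the gradient update) is skipped
    transmittance = 1.0
    out = torch.zeros(features.shape[-1])
    for f, a in zip(features, indicators):
        out = out + transmittance * a * f          # weighted by the modality indicator
        transmittance = transmittance * (1.0 - a)  # standard 3DGS alpha blending
    return out
```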
Multi-Modal Decomposition Mechanism (Addressing Granularity Disparity)¶
Multi-modal pruning (see the sketch after this list):
- In vanilla 3DGS, Gaussians with low opacity are directly removed (hard prune). In multi-modal scenes, directly pruning a Gaussian whose indicator is low in one modality but high in another degrades the latter modality.
- The proposed soft prune only deactivates the specific modality (by switching its modality indicator off) rather than deleting the entire Gaussian.
- The pruning threshold for single-modal Gaussians is raised, reducing unimportant single-modal Gaussians and encouraging the learning of cross-modal shared attributes.
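A rough sketch of soft pruning is shown below. The base threshold and the factor applied to single-modal Gaussians are illustrative assumptions; the text above only states that the threshold for single-modal Gaussians is raised.

```python
import torch

def soft_prune(indicators, base_thresh=0.005, single_modal_factor=2.0):
    """Deactivate Gaussians per modality instead of deleting them (soft prune).

    indicators : (N, M) modality indicators for N Gaussians and M modalities.
    base_thresh and single_modal_factor are assumed values for illustration only.
    """
    active = indicators > base_thresh                 # (N, M): which modalities stay on
    single_modal = active.sum(dim=1) == 1             # Gaussians serving only one modality
    raised = indicators > base_thresh * single_modal_factor
    active[single_modal] = raised[single_modal]       # stricter threshold for single-modal ones
    indicators = indicators * active                  # switch off pruned modalities, keep the Gaussian
    keep = active.any(dim=1)                          # hard-remove only if inactive in every modality
    return indicators[keep], active[keep]
```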
Multi-modal decomposition: During 3DGS densification, gradients from different modalities back-propagated to the same Gaussian may cancel each other, leading to suboptimal results. The solution:
The accumulated per-modality gradients \(g_{m_i}\) and \(g_{m_j}\) are used to compute the gradient discrepancy between modalities.
When the gradient discrepancy exceeds a threshold (0.0002), a multi-modal Gaussian is decomposed into multiple single-modal Gaussians, each optimized independently by its own modality loss.
This disentangles multi-modal information into shared components (multi-modal Gaussians) and modality-specific components (single-modal Gaussians), yielding a more compact and efficient representation.
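The decomposition trigger can be sketched as follows. The discrepancy metric used here (maximum pairwise L2 distance between accumulated per-modality gradients) is an assumption; the note above only records that a discrepancy between \(g_{m_i}\) and \(g_{m_j}\) is compared against the 0.0002 threshold.

```python
import torch

def decompose(xyz, grads, thresh=0.0002):
    """Split multi-modal Gaussians whose per-modality gradients disagree.

    xyz    : (N, 3) Gaussian centers
    grads  : (N, M, 3) accumulated positional gradients, one per modality
    thresh : gradient-discrepancy threshold (0.0002 in the paper)
    The discrepancy metric below is an assumed instantiation, not the paper's exact formula.
    """
    n, m, _ = grads.shape
    diff = grads.unsqueeze(1) - grads.unsqueeze(2)     # (N, M, M, 3) pairwise gradient differences
    discrepancy = diff.norm(dim=-1).amax(dim=(1, 2))   # (N,) largest pairwise disagreement
    conflicted = discrepancy > thresh                   # multi-modal Gaussians to decompose
    # each conflicted Gaussian becomes M single-modal copies at the same position;
    # in the full method each copy would keep only its own modality switched on
    new_xyz = xyz[conflicted].repeat_interleave(m, dim=0)
    return conflicted, new_xyz
```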
Loss & Training¶
The total loss is the sum of per-modality losses, with modality-specific formulations (a sketch follows this list):
- RGB: standard 3DGS L1 + SSIM loss
- Thermal: L1 + SSIM + smoothness regularization
- Language: semantic feature loss following LangSplat
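A sketch of the three loss terms in PyTorch is given below; the 0.2 SSIM weight, the total-variation smoothness term, and the L1 feature loss are common 3DGS/LangSplat choices assumed for illustration rather than details stated above.

```python
import torch.nn.functional as F

def rgb_or_thermal_loss(pred, gt, ssim_fn, lambda_ssim=0.2):
    """Standard 3DGS photometric loss: (1 - lambda) * L1 + lambda * (1 - SSIM).
    `ssim_fn` is assumed to be a differentiable SSIM such as the one in the 3DGS codebase."""
    return (1 - lambda_ssim) * F.l1_loss(pred, gt) + lambda_ssim * (1 - ssim_fn(pred, gt))

def thermal_smoothness(pred):
    """Total-variation style smoothness regularizer for thermal renders (illustrative)."""
    dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs().mean()
    dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs().mean()
    return dx + dy

def language_feature_loss(pred_feat, gt_feat):
    """LangSplat-style semantic loss: L1 between rendered and target language features."""
    return F.l1_loss(pred_feat, gt_feat)
```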
Key Experimental Results¶
RGB–Thermal Evaluation¶
| Modality | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| RGB | 3DGS | 23.27 | 0.821 | 0.220 |
| RGB | ThermalGaussian | 24.38 | 0.846 | 0.204 |
| RGB | MMOne | 24.89 | 0.854 | 0.209 |
| Thermal | 3DGS | 24.11 | 0.859 | 0.214 |
| Thermal | ThermalGaussian | 25.51 | 0.879 | 0.172 |
| Thermal | MMOne | 25.89 | 0.890 | 0.176 |
Compared to ThermalGaussian, MMOne improves RGB PSNR by roughly 0.5 dB and thermal PSNR by roughly 0.4 dB while using only about one-third as many Gaussians.
RGB–Language Evaluation¶
| Modality | Method | PSNR↑ (RGB) / mIoU↑ (Language) | Note |
|---|---|---|---|
| RGB | LangSplat | 24.02 | Sequential: RGB first, then language |
| RGB | LS-Joint | 23.23 | Joint training degrades RGB |
| RGB | MMOne | 24.35 | Surpasses single-modal LangSplat RGB |
| Language | LangSplat | 47.6 | Baseline |
| Language | LS-Joint | 55.3 | mIoU +7.7 but at the cost of RGB |
| Language | MMOne | 56.6 | Best mIoU with no RGB degradation |
Key finding: MMOne achieves higher RGB rendering quality than LangSplat trained on RGB alone, demonstrating mutual enhancement across modalities.
RGB–Thermal–Language (Three-Modality) Evaluation¶
| Method | RGB PSNR | Thermal PSNR | Language mIoU |
|---|---|---|---|
| MM-Joint | 22.32 | 23.38 | 45.1 |
| MMOne | 23.19 | 24.24 | 48.1 |
Modality Conflict Analysis (Key Findings)¶
| Method | RGB PSNR (2-modal→3-modal) | Thermal PSNR (2-modal→3-modal) |
|---|---|---|
| ThermalGaussian + Language | 22.88 → 22.32 (−0.56) | 23.90 → 23.38 (−0.52) |
| MMOne + Language | 23.12 → 23.19 (+0.07) | 24.17 → 24.24 (+0.07) |
Adding language to the joint-training baseline degrades RGB and thermal by roughly 0.5 dB each, whereas adding language to MMOne yields a slight improvement in both, indicating that the modality conflict is effectively resolved.
Ablation Study¶
| Method | RGB PSNR | Thermal PSNR | Lang mIoU | #Gaussians (×10⁴) |
|---|---|---|---|---|
| MM-Joint | 22.32 | 23.38 | 45.1 | 32.9 |
| + Modality Modeling | 22.38 | 23.73 | 45.3 | 29.0 |
| + Hard Prune | 22.67 | 23.86 | 46.9 | 13.4 |
| + Soft Prune | 22.98 | 23.99 | 47.0 | 10.6 |
| + Decomposition | 23.19 | 24.24 | 48.1 | 9.9 |
Each component contributes consistent improvements. The final model outperforms the MM-Joint baseline on all three modalities while using only about 30% of its Gaussian count.
Highlights & Insights¶
- Identification of fundamental problems: The first systematic analysis of property disparity and granularity disparity in multi-modal scene representation, accompanied by a unified solution.
- Elegant design of the modality indicator: by acting simultaneously as a per-modality rendering weight and an on/off switch, one concise mechanism contributes to resolving both property and granularity disparities.
- Mutual enhancement rather than conflict: Demonstrates that proper disentanglement enables multi-modal learning to be mutually beneficial rather than mutually detrimental.
- Compact and efficient: Superior performance is achieved with one-third the number of Gaussians, indicating genuine information efficiency gains through disentanglement.
- General and scalable: The modality-agnostic framework design readily accommodates additional modalities.
Limitations & Future Work¶
- Validation is limited to RGB, thermal, and language modalities; effectiveness for depth, tactile, and other modalities remains to be verified.
- The framework relies on RGB camera poses from COLMAP; thermal camera poses require accurate calibration.
- The gradient discrepancy threshold for multi-modal decomposition (0.0002) is manually set and may require adjustment for different scenes.
- Dynamic scenes are not addressed.
Related Work & Insights¶
- Single-modal scene representation: Advances of NeRF/3DGS in RGB; Thermal3D-GS for thermal imaging.
- Dual-modal representation: LERF/LangSplat for RGB+language; ThermalGaussian for RGB+thermal.
- Multi-modal representation: GLS/LangSurf leverage depth to assist RGB+language but are fundamentally dual-modal.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall Recommendation | 8.5/10 |