X-WIN: Building Chest Radiograph World Model via Predictive Sensing¶
Conference: CVPR 2026 arXiv: 2511.14918 Code: None Area: Medical Imaging Keywords: World model, chest radiograph representation learning, CT knowledge distillation, contrastive learning, domain adaptation
TL;DR¶
X-WIN is a chest radiograph world model that, for the first time, incorporates 3D CT spatial knowledge into CXR representation learning. By learning to predict 2D projections of CT volumes at varying rotation angles, the model internalizes 3D anatomical structure. Combined with affinity-guided contrastive alignment and structure-preserving domain adaptation, X-WIN achieves state-of-the-art linear probing performance across 6 CXR benchmarks.
Background & Motivation¶
- Inherent limitations of CXR: As 2D projection images, CXRs suffer from structural superimposition and cannot directly capture 3D anatomical information, limiting diagnostic capability.
- CT vs. CXR trade-off: CT provides 3D structural information but is costly, radiation-intensive, and less accessible, while CXR is safe and affordable but informationally limited.
- Radiologist insight: Radiologists viewing frontal/lateral CXRs can cognitively reconstruct a 3D thoracic model to assist in diagnosing occluded structures.
- Limitations of existing world models: CheXWorld only learns 2D local structure and global geometry, lacking 3D spatial awareness.
- Mechanism: If a model can accurately predict X-ray projections of a CT volume at arbitrary rotation angles, it has internalized meaningful 3D anatomical structure.
Method¶
Overall Architecture (JEPA Variant)¶
- Context encoder \(f_\theta\): Takes standard frontal/lateral CXRs as input.
- EMA encoder \(f_{\theta'}\): Takes multiple target projections; updated via exponential moving average.
- View predictor \(g_v\): Predicts latent representations of new projections, conditioned on the rotation action \(a_i\).
- Masked predictor \(g_m\): Reconstructs masked patch tokens.
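The relationship between the context encoder and its EMA target can be sketched in a few lines. This is a minimal numpy illustration (the momentum value and toy dimensions are my assumptions; X-WIN's code is unreleased), with the encoders stood in for by plain weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy embedding dimension

f_theta = rng.normal(size=(D, D))   # context encoder weights (trained by backprop)
f_theta_prime = f_theta.copy()      # EMA target encoder weights (never backpropped)

def ema_update(target, online, momentum=0.996):
    """theta' <- m * theta' + (1 - m) * theta  (standard JEPA/BYOL-style rule)."""
    return momentum * target + (1.0 - momentum) * online

# one simulated optimizer step on the online encoder, then the EMA update
f_theta = f_theta + 0.01 * rng.normal(size=(D, D))
f_theta_prime = ema_update(f_theta_prime, f_theta)
```

The target encoder thus lags the context encoder smoothly, which is what stabilizes the latent prediction targets in JEPA-style training.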
Action Design¶
The action \(a_i = k \cdot \Delta\phi\) defines the yaw rotation angle of the X-ray source relative to the input position, constrained to \([-90°, 90°]\); \(N = 8\) projections are sampled randomly per volume. The optimal step size is \(\Delta\phi = 3°\), yielding 60 discrete projection angles over the 180° range.
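The action space above is small enough to enumerate directly. A sketch of the sampling step as I read it (exactly which range endpoints are included among the 60 angles is my guess):

```python
import numpy as np

DELTA_PHI = 3.0
# 60 candidate yaw angles across the 180-degree range, as described above
candidates = np.arange(-90.0, 90.0, DELTA_PHI)  # -90, -87, ..., 87

def sample_actions(n=8, rng=None):
    """Draw n distinct yaw rotation angles (the per-volume actions)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(candidates, size=n, replace=False)

angles = sample_actions(8, np.random.default_rng(0))
```

Each sampled angle then conditions the view predictor \(g_v\) to produce the latent representation of the corresponding CT projection.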
Affinity-Guided Contrastive Alignment¶
Predicted representation: \(z_i^{\text{patch}} = g_v(\text{Linear}(a_i) \oplus (f_\theta(u_{\text{context}}) + \text{PE}))\)
Standard InfoNCE uses one-hot labels for hard alignment, but different projections of the same CT volume share rich anatomical correspondences. An affinity matrix \(A\) capturing these correspondences is therefore introduced as a soft regularization target.
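The summary does not reproduce the loss itself. A hedged numpy sketch of one plausible soft-target InfoNCE form (the affinity blending, temperature, and normalization are my assumptions, not the paper's exact definition):

```python
import numpy as np

def soft_infonce(z_pred, z_tgt, affinity, alpha=0.5, tau=0.1):
    """Cross-entropy between softmax similarities and soft targets.

    z_pred, z_tgt: (N, D) L2-normalized embeddings of N projections.
    affinity:      (N, N) nonnegative anatomical-similarity scores.
    alpha:         weight on the hard (identity) part of the target.
    """
    logits = z_pred @ z_tgt.T / tau                            # (N, N)
    logp = logits - logits.max(axis=1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=1, keepdims=True))  # row log-softmax
    A = affinity / affinity.sum(axis=1, keepdims=True)         # row-normalize
    targets = alpha * np.eye(len(z_pred)) + (1.0 - alpha) * A  # blend hard + soft
    return -(targets * logp).sum(axis=1).mean()
```

With `alpha = 1` (or `affinity` equal to the identity) this collapses to standard InfoNCE; the soft term spreads probability mass onto anatomically similar projections instead of treating them as pure negatives.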
Structure-Preserving Domain Adaptation¶
Simulated CXRs (generated from CT projections) exhibit a domain gap relative to real CXRs. The following strategies are adopted:
- Masked image modeling (MIM): Applied to both real and simulated CXRs to encode local and contextual features.
- Domain classifier \(f_c\): Learns to distinguish between real and simulated domain representations.
- Domain adaptation loss: an adversarial term defined against \(f_c\).
This encourages projection predictions to preserve structural information while being statistically aligned with the real domain.
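The loss itself is not reproduced in this summary. Assuming a standard domain-adversarial (DANN-style) formulation with \(f_c\) as the discriminator, it would plausibly take the form

\[\mathcal{L}_{\text{DA}} = -\,\mathbb{E}_{z \sim \text{real}}\left[\log f_c(z)\right] - \mathbb{E}_{z \sim \text{sim}}\left[\log\big(1 - f_c(z)\big)\right],\]

with the encoder updated through a gradient-reversal layer so that simulated-projection features become indistinguishable from real ones. This form is my assumption, not the paper's exact definition.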
Overall Loss¶
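The combined objective is not reproduced in this summary; a plausible weighted sum of the components introduced above (the weights \(\lambda_i\) are my placeholders) would be

\[\mathcal{L} = \mathcal{L}_{\text{InfoNCE}} + \lambda_1\,\mathcal{L}_{\text{MIM}} + \lambda_2\,\mathcal{L}_{\text{DA}}.\]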
Key Experimental Results¶
Main Results: Linear Probing (AUROC)¶
| Model | Pretraining Data | VinDr | CheXpert | NIH-CXR | RSNA | JSRT | Avg. |
|---|---|---|---|---|---|---|---|
| DINOv2 | LVD-142M | 0.795 | 0.776 | 0.711 | 0.798 | 0.559 | 0.728 |
| CheXFound | 987K CXR | 0.869 | 0.876 | 0.829 | 0.872 | 0.846 | 0.858 |
| Ark+ | 704K CXR | 0.906 | 0.876 | 0.831 | 0.893 | 0.807 | 0.863 |
| CheXWorld | 448K CXR | 0.903 | 0.871 | 0.833 | 0.824 | 0.791 | 0.844 |
| X-WIN (ViT-L) | 372K CXR+32K CT | 0.925 | 0.908 | 0.843 | 0.929 | 0.857 | 0.892 |
X-WIN achieves a mean AUROC of 0.892, surpassing all CXR foundation models and vision-language models.
Few-Shot Fine-Tuning (COVIDx, AUROC)¶
| Model | 4-shot | 8-shot | 16-shot | All |
|---|---|---|---|---|
| CheXFound | 0.823 | 0.883 | 0.897 | 0.977 |
| CheXWorld | 0.843 | 0.893 | 0.902 | 0.981 |
| X-WIN | 0.868 | 0.924 | 0.939 | 0.993 |
3D CT Reconstruction¶
3D CT volumes are reconstructed via a VQ-GAN decoder combined with the FDK algorithm:
- 2D projection: PSNR 30.23 dB, SSIM 0.888
- 3D reconstruction: PSNR 27.87 dB, SSIM 0.789
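For reference, the PSNR figures above follow the standard definition; a minimal numpy sketch (assuming intensities normalized to \([0, 1]\); the paper's exact evaluation code is unavailable):

```python
import numpy as np

def psnr(ref, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 over a [0, 1] image, for instance, gives an MSE of 0.01 and hence a PSNR of 20 dB.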
Ablation Study¶
- \(\mathcal{L}_{\text{InfoNCE}}\) alone establishes a strong baseline.
- Combining \(\mathcal{L}_{\text{MIM}}\) + \(\mathcal{L}_{\text{InfoNCE}}\) yields substantial improvements.
- Domain adaptation improves cosine similarity from 0.845 to 0.967.
- Direct rotation outperforms stepwise rotation; yaw-only rotation outperforms full 3D Euler angle rotation.
Highlights & Insights¶
- Breakthrough in medical imaging world models: X-WIN is the first to incorporate 3D CT spatial knowledge into a 2D CXR world model, representing a paradigm shift from 2D to 3D reasoning.
- Elegant predictive sensing design: 3D structure is internalized by predicting rotated projections rather than directly regressing 3D features.
- Affinity-guided contrastive learning: Natural correlations among different projections of the same CT are leveraged for soft contrastive supervision, which is more principled than hard InfoNCE.
- Validated 3D reconstruction: The model's ability to render projections from CXRs and reconstruct CT volumes confirms that genuine 3D knowledge has been acquired.
Limitations & Future Work¶
- Relies on DiffDRR for simulated projection generation; the simulation-to-real domain gap remains a bottleneck.
- Only yaw rotation is utilized, leaving pitch and roll dimensions unexplored.
- 3D reconstruction results exhibit blurring with notable loss of local detail.
- Training requires 8×A100 40 GB GPUs for 100 epochs, incurring substantial computational cost.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |