GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis¶
Conference: CVPR 2026 arXiv: 2603.09446 Code: Unavailable Area: Medical Image Analysis / Graph Neural Networks / Computer-Aided Diagnosis Keywords: Multi-heterogeneous graph, multi-view diagnosis, intra-/inter-view dependencies, missing view handling, CADx
TL;DR¶
This paper proposes the GIIM framework, which constructs a Multi-Heterogeneous Graph (MHG) with four types of edge relations to simultaneously model the dynamic changes of individual lesions across imaging phases and the spatial associations among different lesions. Four missing-view imputation strategies are designed. GIIM achieves significant improvements over existing methods on three modalities: liver CT, breast mammography, and breast MRI.
Background & Motivation¶
Background: Clinical diagnosis requires integrating complex dependencies among abnormalities across multiple views — including the dynamic enhancement pattern of a single lesion across multi-phase CT and the spatial co-occurrence among different lesions. CNN, Transformer, and GNN-based methods have made progress in single-view or simple multi-view fusion settings.
Limitations of Prior Work:
- Existing CADx methods typically process each view independently or simply concatenate features, ignoring intra-view multi-lesion relationships and inter-view temporal/spatial dynamics.
- Attention-based methods require fixed-size inputs and cannot flexibly handle a variable number of lesions.
- Missing views are common in clinical practice due to protocol constraints, technical failures, or patient-related factors, yet existing methods lack robustness to such scenarios.
Key Challenge: There is a need to simultaneously model four types of dependencies (same lesion across views, different lesions within a single view, different lesions across multiple views, and single-to-multi-view aggregation), while maintaining robustness under missing-view conditions.
Goal: Reformulate multi-view medical diagnosis as a relational modeling problem, comprehensively capturing four types of dependencies via heterogeneous graphs while handling missing data.
Key Insight: GNNs are naturally suited for variable-size node sets and heterogeneous relational modeling, enabling different node and edge types to encode distinct levels of clinical relationships.
Core Idea: Construct each patient's multi-lesion, multi-view data as a multi-heterogeneous graph and perform type-aware message passing to jointly reason over all four dependency types.
Method¶
Overall Architecture¶
Two-stage training: (1) A ConvNeXt feature extractor is independently trained for each view, leveraging 7×7 large-kernel convolutions and depth-wise separable convolutions to capture morphological and intensity details. (2) The feature extractor is frozen, and multi-lesion, multi-view features are organized into a Multi-Heterogeneous Graph (MHG), over which a heterogeneous message-passing GNN performs relational reasoning and classification.
Key Designs¶
- Dual-type Node Representation
  - Single-view node \(N_{single}^v = f_v(l_v)\): the feature of a lesion under a specific view.
  - Multi-view node \(M_{multi} = \|_{v=1}^V N_{single}^v\): an aggregated node formed by concatenating the lesion's features from all views.
  - For breast mammography, where lesion correspondences between the CC and MLO views are uncertain, single-view nodes are replaced by the mean of all lesion features within that view.
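The dual-type node construction can be sketched as follows. This is a minimal numpy illustration under our own naming and toy feature sizes; the paper's code is unavailable, so treat the helpers as assumptions, not the authors' implementation:

```python
import numpy as np

def build_nodes(view_feats):
    """Build the two node types for one lesion.

    view_feats: list of V per-view feature vectors (one per imaging
    phase/view), each of shape (d,). Returns the single-view nodes
    N_single^v and the concatenated multi-view node M_multi.
    """
    single_nodes = [np.asarray(f, dtype=float) for f in view_feats]  # N_single^v
    multi_node = np.concatenate(single_nodes)                        # M_multi = ||_v N_single^v
    return single_nodes, multi_node

def mammography_view_node(lesion_feats_in_view):
    """For CC/MLO views with uncertain lesion correspondence, the paper
    replaces single-view nodes with the mean over all lesion features
    within that view."""
    return np.mean(np.stack(lesion_feats_in_view), axis=0)

# Toy example: 3 views, 4-dim features per view.
views = [np.ones(4) * v for v in range(3)]
singles, multi = build_nodes(views)
assert multi.shape == (12,)
```

The concatenated multi-view node is what gives \(M_{multi}\) its \(V \times d\) width, which is why missing views must be imputed before concatenation.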
- Four Types of Edge Relations
  - \(E_{intra}\): same lesion across different views (e.g., arterial → venous → delayed phase), capturing temporal enhancement dynamics.
  - \(E_{s-m}\): single-view nodes to their corresponding multi-view aggregation node, integrating information across phases.
  - \(E_{inter-s}\): different lesions within the same view, modeling spatial co-occurrence (e.g., HCC commonly co-occurs with satellite nodules).
  - \(E_{inter-m}\): between the aggregated nodes of different lesions, capturing high-level inter-lesion context and allowing small lesions to leverage context from nearby larger ones.
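Per patient, the four edge sets can be enumerated directly from the lesion and view counts. The sketch below is our illustration only: the node-indexing scheme, and the choice to chain consecutive phases for \(E_{intra}\), are assumptions, since the paper's code is unavailable:

```python
from itertools import combinations

def build_mhg_edges(num_lesions, num_views):
    """Enumerate the four edge types of the Multi-Heterogeneous Graph.

    Nodes are indexed as ("s", lesion, view) for single-view nodes and
    ("m", lesion) for multi-view aggregation nodes (our convention).
    """
    E_intra, E_sm, E_inter_s, E_inter_m = [], [], [], []
    for l in range(num_lesions):
        # E_intra: same lesion across consecutive views/phases
        for v in range(num_views - 1):
            E_intra.append((("s", l, v), ("s", l, v + 1)))
        # E_s-m: each single-view node feeds the lesion's multi-view node
        for v in range(num_views):
            E_sm.append((("s", l, v), ("m", l)))
    for l1, l2 in combinations(range(num_lesions), 2):
        # E_inter-s: different lesions within the same view
        for v in range(num_views):
            E_inter_s.append((("s", l1, v), ("s", l2, v)))
        # E_inter-m: between aggregated nodes of different lesions
        E_inter_m.append((("m", l1), ("m", l2)))
    return E_intra, E_sm, E_inter_s, E_inter_m

# Toy patient: 2 lesions imaged in 4 CT phases.
E_intra, E_sm, E_inter_s, E_inter_m = build_mhg_edges(2, 4)
```

Because the edge sets are generated from the lesion count, the graph naturally grows with the number of detected lesions, which is the flexibility the paper claims over fixed-size attention inputs.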
- Heterogeneous Message Passing
  - For each node, messages are aggregated separately from single-view neighbors and multi-view neighbors (each with an independent weight matrix, \(\mathbf{W}_{single}^k\) and \(\mathbf{W}_{multi}^k\)), concatenated with the node's previous state, and updated via a nonlinear transformation: \(h_n^k = \sigma(\mathbf{W}^k \cdot \text{CONCAT}(h_n^{k-1}, h_{N_{single}(n)}^k, h_{M_{multi}(n)}^k))\)
  - Five SAGEConv layers (512→256→128→64→number of classes), with the final layer directly outputting classification probabilities.
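The type-aware update rule above can be sketched with one numpy layer. The mean aggregator, ReLU nonlinearity, and toy widths are our assumptions (SAGEConv commonly uses mean aggregation, but the paper's exact choices are not restated here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature width; the paper uses 512 at the first layer

# Independent weights per neighbor type, mirroring W_single^k and
# W_multi^k in the update rule; initialization is illustrative only.
W_single = rng.standard_normal((d, d))
W_multi = rng.standard_normal((d, d))
W_update = rng.standard_normal((d, 3 * d))  # acts on the concatenation

def relu(x):
    return np.maximum(x, 0.0)

def hetero_update(h_prev, single_neighbors, multi_neighbors):
    """One layer: aggregate single-view and multi-view neighborhoods
    separately (mean aggregator, SAGE-style), concatenate both with the
    node's previous state h_n^{k-1}, then apply sigma(W^k . CONCAT(...))."""
    agg_s = relu(W_single @ np.mean(single_neighbors, axis=0))
    agg_m = relu(W_multi @ np.mean(multi_neighbors, axis=0))
    return relu(W_update @ np.concatenate([h_prev, agg_s, agg_m]))

h = hetero_update(rng.standard_normal(d),
                  [rng.standard_normal(d) for _ in range(3)],
                  [rng.standard_normal(d)])
assert h.shape == (d,)
```

Keeping the two neighbor aggregations separate before concatenation is what prevents the edge-type conflation noted in the highlights below.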
- Four Missing-View Imputation Strategies
  - Constant: zero-vector padding; simple, and the all-zero vector acts as an explicit missing indicator that the model learns to ignore.
  - Learnable: learnable parameters normalized via the Frobenius norm.
  - RAG-based: retrieves the most similar complete sample from a database using the available features and borrows its missing-view features.
  - Covariance-based: computes the covariance of inter-view feature differences to measure sample similarity, then imputes from the most similar complete sample.
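Three of the four strategies can be sketched in a few lines; the covariance-based variant follows the same retrieval pattern with a different similarity measure. All function names, the cosine-similarity choice for retrieval, and the stand-in database arrays are our assumptions, not the paper's implementation:

```python
import numpy as np

def impute_constant(d):
    """Zero-vector padding: the all-zero vector doubles as an explicit
    'missing' indicator."""
    return np.zeros(d)

def impute_learnable(param):
    """Learnable vector normalized by its Frobenius norm (L2 norm for a
    vector); `param` would be a trained parameter in practice."""
    return param / np.linalg.norm(param)

def impute_retrieval(avail_feats, db_avail, db_missing):
    """RAG-style: find the complete database sample whose available-view
    features are closest (cosine similarity here), and borrow its
    missing-view features."""
    a = avail_feats / np.linalg.norm(avail_feats)
    b = db_avail / np.linalg.norm(db_avail, axis=1, keepdims=True)
    return db_missing[np.argmax(b @ a)]

# Toy database: 2 complete samples, 2-dim available / 1-dim missing view.
db_avail = np.array([[0.0, 1.0], [1.0, 0.0]])
db_missing = np.array([[10.0], [20.0]])
borrowed = impute_retrieval(np.array([1.0, 0.0]), db_avail, db_missing)
```

The trade-off reported later (zero padding most robust under missing views, retrieval-style imputation best on complete data) follows from this design: the zero vector is distinguishable as "missing", while borrowed features are plausible but never flagged as imputed.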
Loss & Training¶
- Single-view stage: Standard cross-entropy classification loss; ConvNeXt is trained independently and then frozen.
- Graph model stage: MHG is trained end-to-end; graphs are constructed per patient (one patient graph per batch).
Key Experimental Results¶
Main Results¶
| Dataset | Method | Accuracy (%) | AUC (%) |
|---|---|---|---|
| Liver CT | NN-based (multi-view) | 75.45 | 89.09 |
| Liver CT | Attention-based | 73.41 | 88.53 |
| Liver CT | GIIM | 78.20 | 91.05 |
| VinDr-Mammo | NN-based | 67.48 | 82.21 |
| VinDr-Mammo | Attention-based | 68.09 | 81.00 |
| VinDr-Mammo | GIIM | 71.17 | 82.54 |
| BreastDM (MRI) | NN-based | 80.85 | 87.35 |
| BreastDM (MRI) | Attention-based | 85.10 | 76.37 |
| BreastDM (MRI) | GIIM | 87.23 | 89.02 |
Multi-view vs. single-view: approximately +12% accuracy on Liver CT and +7.8% on Mammography.
Ablation Study¶
Missing-view strategy comparison on liver CT (accuracy, %):
| Strategy | 100% Missing-View Test | Full-View Test |
|---|---|---|
| NN-based | 70.00 | 75.45 |
| GIIM (Constant) | 72.27 | 78.20 |
| GIIM (Learnable) | 72.05 | 77.05 |
| GIIM (RAG) | 71.59 | 78.41 |
| GIIM (Covariance) | 72.05 | 78.18 |
Edge type ablation: Removing any of the four edge types leads to performance degradation; \(E_{intra}\) (same lesion across phases) has the largest impact.
Key Findings¶
- Zero-vector imputation is the most stable under missing-view testing (serving as a unique "missing indicator" that trains the model to rely on other views), while RAG/Covariance imputation performs better on complete data.
- Multi-view consistency yields the greatest gains on Liver CT (4-phase CT with significant enhancement pattern changes across phases for the same lesion).
- For BI-RADS classification, mean pooling is adopted instead of per-lesion graph construction due to the uncertain correspondence between CC and MLO view lesions.
Highlights & Insights¶
- The four-type edge design comprehensively covers the relational reasoning patterns employed by clinical radiologists, offering greater interpretability than simple attention decomposition.
- The practical trade-off finding for missing-view strategies is actionable: generative imputation performs better on complete data, while zero-vector imputation is more robust under missing-view conditions.
- The heterogeneous message-passing aggregation scheme — aggregating separately from single-view and multi-view neighbors — prevents information loss due to edge-type conflation.
- The flexibility of GNNs allows the framework to handle an arbitrary number of lesions and views, outperforming CNN/Transformer architectures that require fixed-size inputs.
Limitations & Future Work¶
- The single-view feature extractor and graph model are trained in separate stages; joint end-to-end training may yield further improvements.
- The graph structure is hard-coded by data (determined by the number of lesions and views); dynamic graph construction or attention-weighted edges remain unexplored.
- ConvNeXt serves as a relatively conservative backbone; stronger alternatives such as ViT or SAM may further improve performance.
- The three datasets are relatively small in scale (the largest contains 920 cases), limiting the generalizability of the validation.
Related Work & Insights¶
- vs. Phase Attention (Wang et al. 2022): Performs intra-phase and inter-phase attention but handles fixed-size inputs and ignores inter-lesion relationships; GIIM uses GNNs to flexibly accommodate variable numbers of lesions.
- vs. SSL-MNGCN (Ibrahim et al. 2022): Applies GCN to texture/spatial feature maps in mammography but does not model cross-view temporal relationships.
- vs. mmFormer (Zhang et al. 2022): A multi-modal Transformer for incomplete brain tumor segmentation, targeting voxel-level tasks rather than lesion-level classification.
- Insight: The heterogeneous graph relational modeling paradigm is generalizable to other scenarios requiring joint multi-view/multi-modal reasoning.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of four heterogeneous edge types and missing-view imputation strategies constitutes a complete and well-motivated design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three modalities, missing-view ablation, and comparisons of four imputation strategies.
- Writing Quality: ⭐⭐⭐ — Content is detailed but the structure is somewhat complex.
- Value: ⭐⭐⭐ — Provides a general framework for multi-view medical diagnosis, though the limited dataset scale constrains its persuasiveness.