NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening¶
Conference: AAAI 2026 arXiv: 2511.16566 Code: IAB-RUBRIC NutriScreener Toolkit Area: Medical Imaging / Nutritional Screening Keywords: Childhood malnutrition detection, multi-pose imaging, graph attention network, CLIP, retrieval augmentation, anthropometric prediction
TL;DR¶
This paper proposes NutriScreener, a framework combining a CLIP visual encoder, a multi-pose graph attention network (GAT), and a FAISS-based retrieval-augmented classification/regression module. Through cross-pose attention and category-enhanced retrieval, the system achieves robust childhood malnutrition detection and anthropometric prediction, attaining 0.79 recall and 0.82 AUC on cross-continental datasets including AnthroVision, with clinician ratings of 4.3/5 for accuracy and 4.6/5 for efficiency.
Background & Motivation¶
Background: As of 2024, approximately 150 million children under five suffer from stunting and over 42 million from wasting globally. Malnutrition remains a leading cause of irreversible developmental harm and mortality in children, and low-resource regions in particular lack timely screening capacity.
Limitations of Prior Work:
- Inefficiency of traditional methods: Manual anthropometric measurements using MUAC tapes, weight-for-height charts, and questionnaires are time-consuming, error-prone, and unscalable.
- Limitations of existing AI methods: Facial-based methods are mostly designed for elderly populations and are unsuitable for children; Microsoft's Child Growth Monitor requires infrared depth sensors; existing models suffer from small datasets and majority-class bias (DomainAdapt recall of only 67%).
- Severe class imbalance: Malnourished children constitute a minority class, causing models to be biased toward predicting healthy outcomes.
- Single-pose insufficiency: A single image cannot capture all diagnostic cues, such as asymmetric fat loss or pose-dependent deformations.
Key Challenge: Low-resource settings demand low-cost, scalable screening solutions, yet existing AI approaches either rely on specialized hardware or perform poorly on minority-class detection, making real-world deployment infeasible.
Goal: From multi-pose 2D images captured with standard smartphones, simultaneously achieve: (1) binary nutritional status classification; and (2) regression prediction of four anthropometric measures — height, weight, MUAC, and head circumference.
Key Insight: Each subject is modeled as a graph (nodes = per-pose CLIP embeddings), with a GAT capturing inter-pose relationships. A retrieval-augmented module then queries a knowledge base for similar samples to compensate for minority-class bias.
Core Idea: Multi-pose CLIP embeddings + GAT cross-pose reasoning + category-enhanced FAISS retrieval + context-aware adaptive fusion.
Method¶
Overall Architecture¶
NutriScreener comprises four core components:
1. CLIP image encoder: Extracts semantic features from each pose.
2. Graph Attention Network (GAT): Models inter-pose relationships to produce consistent multi-view predictions.
3. Retrieval module: Queries a knowledge base (KB) to retrieve representative support samples.
4. Context-aware fusion mechanism: Adaptively combines GAT and retrieval predictions.
Multi-Pose Embedding Extraction¶
- Each pose image \(x_{i,j}\) is passed through a frozen CLIP encoder (RN50x64 variant) to extract a 1024-dimensional embedding \(e_{i,j}\).
- The scalar age \(a_i\) is concatenated to form a 1025-dimensional node feature: \(v_{i,j} = [e_{i,j}; a_i]\).
- Advantages of multi-pose design: (1) aggregating cross-view redundant cues compensates for single-pose limitations; (2) accommodates varying capture conditions (occlusion, missing poses).
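A minimal sketch of this extraction step, assuming OpenAI's `clip` package; the helper `pose_node_feature` is illustrative, and the paper's exact preprocessing pipeline may differ:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)  # frozen; 1024-d image embeddings
model.eval()

@torch.no_grad()
def pose_node_feature(image_path: str, age_years: float) -> torch.Tensor:
    """Encode one pose image and append the subject's age -> 1025-d node feature v_ij."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    e = model.encode_image(image).squeeze(0).float()                  # e_ij, (1024,)
    return torch.cat([e, torch.tensor([age_years], device=device)])  # v_ij, (1025,)
```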
Graph Construction and GAT Inference¶
- All pose embeddings \(\{v_{i,1}, \ldots, v_{i,P}\}\) of the same subject are organized as nodes in a fully connected undirected graph.
- A 2-layer GAT (8-head attention, dropout = 0.1) performs multi-head self-attention message passing.
- Global pooling yields a subject-level embedding \(h_i\), which is fed into classification and regression heads.
- Cross-pose attention in the GAT can capture inter-pose correlations (e.g., asymmetric fat loss), improving robustness.
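A sketch of this stage with PyTorch Geometric: the 2 layers, 8 heads, and 0.1 dropout follow the text, while the hidden width, head averaging in the second layer, and mean pooling are assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class PoseGAT(nn.Module):
    """Two GAT layers over a fully connected graph of per-pose node features."""
    def __init__(self, in_dim=1025, hidden=256, heads=8, dropout=0.1):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads, dropout=dropout)
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads,
                            concat=False, dropout=dropout)  # average the heads
        self.cls_head = nn.Linear(hidden, 1)   # malnourished-vs-healthy logit
        self.reg_head = nn.Linear(hidden, 4)   # height, weight, MUAC, head circumference

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        h = global_mean_pool(x, batch)          # subject-level embedding h_i
        return self.cls_head(h), self.reg_head(h)

def fully_connected_edges(num_nodes: int) -> torch.Tensor:
    """Bidirectional edge index for a fully connected pose graph."""
    idx = torch.arange(num_nodes)
    src, dst = torch.meshgrid(idx, idx, indexing="ij")
    mask = src != dst
    return torch.stack([src[mask], dst[mask]])
```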
Knowledge Base Construction¶
- 248 pediatric subjects, each with 8 poses (frontal ×4, left lateral, right lateral, posterior, selfie).
- Captured with a standard smartphone (OnePlus Nord, approximately 165 cm distance); trained healthcare workers recorded height, weight, MUAC, and head circumference.
- Per-subject average pose embeddings and labels are indexed with FAISS.
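A minimal indexing sketch (`build_kb_index` is illustrative): L2-normalizing the embeddings makes FAISS inner-product search equivalent to cosine similarity.

```python
import faiss
import numpy as np

def build_kb_index(subject_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """subject_embeddings: (N, D) per-subject average pose embeddings."""
    x = subject_embeddings.astype("float32")  # copy, so the caller's array is untouched
    faiss.normalize_L2(x)                     # unit norm -> inner product == cosine
    index = faiss.IndexFlatIP(x.shape[1])     # exact inner-product search
    index.add(x)
    return index
```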
Retrieval-Augmented Classification¶
- Compute the global query embedding: \(q_i = \frac{1}{P_i}\sum_{j=1}^{P_i} v_{i,j}\)
- FAISS retrieves the top-\(k\) nearest neighbors, yielding cosine distances \(\{d_j\}\) and labels \(\{y_j^{kb}\}\).
- Distances are converted to weights via a temperature-scaled softmax, \(w_j \propto \exp(-d_j/\tau)\), so closer neighbors contribute more.
- Category enhancement: Malnourished neighbors are multiplied by an enhancement factor \(\gamma\) to upweight minority-class contributions.
- After renormalization, the retrieval prediction is obtained as a weighted sum: \(y_i^{retrieved} = \sum_j w_j y_j^{kb}\)
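Putting these steps together, a hedged sketch of the retrieval branch: the values of \(k\), \(\tau\), and \(\gamma\) are placeholders, and the softmax here is taken over FAISS similarities rather than the paper's distances.

```python
import faiss
import numpy as np

def retrieval_prediction(index, kb_labels, query, k=10, tau=0.1, gamma=2.0):
    """kb_labels: (N,) array in {0: healthy, 1: malnourished}; returns y_retrieved in [0, 1]."""
    q = query.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    sims, nbrs = index.search(q, k)            # cosine similarities, neighbor ids
    w = np.exp(sims[0] / tau)                  # temperature-scaled softmax weights
    w /= w.sum()
    y = kb_labels[nbrs[0]].astype(float)
    w = np.where(y == 1, gamma * w, w)         # category enhancement: upweight minority class
    w /= w.sum()                               # renormalize
    return float((w * y).sum())                # weighted label vote y_i^retrieved
```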
Context-Aware Fusion¶
An auxiliary context vector \(c_i = [\log\frac{p_i}{1-p_i}, \bar{d}]\), comprising the GAT log-odds and the mean retrieval distance, is fed into a small MLP that predicts a fusion coefficient \(\alpha \in [0,1]\); the final prediction fuses the two branches as \(\hat{y}_i = \alpha\, y_i^{retrieved} + (1-\alpha)\, p_i\):
- When KB neighbors are dense, the mechanism favors retrieval; when neighbors are sparse, it favors the GAT.
- The same formulation applies to regression tasks using an independent \(\alpha^{reg}\).
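A small sketch of the fusion gate under these definitions; the MLP width and the sigmoid parameterization of \(\alpha\) are assumptions.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Predicts alpha from the context vector c_i = [GAT log-odds, mean retrieval distance]."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, gat_logit, mean_dist, y_retrieved):
        c = torch.stack([gat_logit, mean_dist], dim=-1)
        alpha = torch.sigmoid(self.mlp(c)).squeeze(-1)      # alpha in [0, 1]
        p_gat = torch.sigmoid(gat_logit)                    # GAT probability p_i
        return alpha * y_retrieved + (1.0 - alpha) * p_gat  # fused prediction
```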
Loss & Training¶
Joint training: \(\mathcal{L} = \mathcal{L}_{class} + \mathcal{L}_{reg}\) (BCE with logits + MSE).
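As a PyTorch sketch, with the equal weighting implied by the formula above (`joint_loss` is an illustrative name):

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # classification: BCE with logits
mse = nn.MSELoss()            # regression: MSE over the four measures

def joint_loss(cls_logit, cls_target, reg_pred, reg_target):
    return bce(cls_logit, cls_target) + mse(reg_pred, reg_target)  # L = L_class + L_reg
```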
Key Experimental Results¶
Datasets¶
| Dataset | Samples | Population | Poses | Annotations |
|---|---|---|---|---|
| AnthroVision | 2,141 | Indian children | Multi-pose | Height/Weight/MUAC/HC/WC |
| ARAN | 512 | Kurdish children | 4 anonymized views | Height/Weight/WC/HC |
| CampusPose | 80 | University students | Multi-pose | Height/Weight/MUAC/HC/WC |
Main Results¶
| Model | Acc↑ | Prec↑ | Rec↑ | F1↑ | AUC↑ | H RMSE↓ | W RMSE↓ | MUAC RMSE↓ | HC RMSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| DomainAdapt | 0.68 | 0.63 | 0.67 | 0.64 | 0.55 | 22.00 | 12.40 | 3.55 | 5.05 |
| CLIP+GNN | 0.76 | 0.66 | 0.54 | 0.59 | 0.82 | 7.37 | 5.82 | 3.80 | 5.23 |
| Retrieval-only | 0.53 | 0.36 | 0.66 | 0.45 | 0.61 | 9.48 | 7.89 | 3.12 | 2.76 |
| NutriScreener-W | 0.74 | 0.56 | 0.79 | 0.66 | 0.82 | 6.38 | 5.32 | 2.80 | 2.97 |
Key gains:
- vs. DomainAdapt: recall 0.67 → 0.79; height RMSE 22.00 cm → 6.38 cm.
- vs. CLIP+GNN: recall 0.54 → 0.79 (a 46% relative improvement), with gains on all four regression metrics.
Ablation Study¶
| Variant | Rec↑ | F1↑ | AUC↑ | H RMSE↓ |
|---|---|---|---|---|
| BCE | 0.81 | 0.59 | 0.78 | 10.93 |
| Focal | 0.73 | 0.53 | 0.73 | 10.82 |
| Context | 0.65 | 0.59 | 0.78 | 10.82 |
| Weighted (final) | 0.79 | 0.66 | 0.82 | 6.38 |
CLIP Encoder Selection¶
Among 9 CLIP variants, RN50x64 achieves the highest ROC-AUC (68%) and mAP (58%) with the most balanced precision–recall trade-off. The frozen pretrained encoder substantially outperforms its fine-tuned counterpart (recall: 79% vs. 38%), supporting the use of foundation models in frozen form in low-resource settings.
Cross-Dataset Analysis¶
- Using a demographically matched knowledge base yields up to 25% recall improvement and a 3.5 cm reduction in RMSE.
- A highly out-of-distribution KB (CampusPose → AnthroVision) is equivalent to no retrieval; the fusion mechanism automatically degrades gracefully in this scenario.
- Across cohorts (community vs. clinical), AUC values are 0.78 and 0.74 respectively, demonstrating good generalizability.
Clinical User Study¶
Twelve medical professionals (mean 9.5 years of experience) evaluated the system in a realistic clinical setting:
- Clinical consistency: 4.3/5
- Efficiency: 4.6/5
- Trustworthiness: 4.4/5
- Deployment readiness: 4.1/5
- Notable feedback: the system successfully flagged a visually ambiguous malnutrition case.
Key Findings¶
- Frozen CLIP outperforms fine-tuned CLIP — in low-resource settings, pretrained representations from foundation models generalize better.
- Retrieval augmentation is an effective tool for addressing class imbalance — but requires category enhancement and context-aware fusion to be effective.
- Demographic alignment of the knowledge base directly impacts performance — even a small number of matched samples yields significant gains.
- Multi-pose modeling is particularly important for anthropometric regression, with lateral views contributing more than frontal ones.
- GAT attention weights provide cross-pose interpretability.
Highlights & Insights¶
- End-to-end multi-task design: Jointly addresses classification (malnourished/healthy) and regression (4 anthropometric measures) with efficient parameter sharing.
- Deployment-oriented design: Operates with standard smartphone images, requires no specialized hardware, uses irreversible CLIP embeddings (privacy-friendly), and has received IRB approval.
- Knowledge base as adaptation: No retraining is required — domain adaptation to a new population is achieved simply by swapping the knowledge base, offering a highly practical paradigm for low-resource deployment.
- Elegant adaptive fusion: Log-odds and retrieval distance serve as context signals to automatically arbitrate between the GAT and retrieval — trusting retrieval when the KB is dense, and the GAT when it is sparse.
- Significant leap from CNN to VLM: DomainAdapt's height RMSE of 22 cm renders it nearly unusable in practice; NutriScreener reduces this to 6.38 cm — a qualitative improvement.
Limitations & Future Work¶
- Limited data scale: AnthroVision contains only 2,141 children and the KB only 248 subjects, making it difficult to cover global diversity.
- Geographic constraints: Validation is primarily conducted on Indian and Kurdish children; coverage of high-burden regions such as sub-Saharan Africa and Southeast Asia is absent.
- Age range limitations: CampusPose comprises university students (out-of-domain), with regression RMSE reaching 24 cm, indicating difficulty generalizing across age groups.
- Trade-off of privacy design: Frozen CLIP with irreversible embeddings ensures privacy but also precludes domain adaptation through fine-tuning.
- Lack of uncertainty estimation: Clinicians suggested adding uncertainty quantification and visual attention heatmaps, which the current version does not provide.
- Coarse binary classification: The model only distinguishes healthy from malnourished, without further stratification into subtypes such as stunting, wasting, and underweight.
Related Work & Insights¶
- AI-based nutritional assessment: ARAN (512 children), AnthroVision + DomainAdapt (multi-task CNN), Microsoft Child Growth Monitor (infrared depth sensor).
- Visual foundation models: CLIP (cross-domain generalization), MedCLIP (medical adaptation), NurtureNet (anthropometric CLIP).
- Graph neural networks: DMGNN (multi-scale joint relationships), GraphCMR (body shape regression).
- Retrieval-augmented learning: RAC (FAISS memory indexing), COBRA (mutual-information-optimized retrieval).
- Multi-view anthropometric estimation: Liu et al. (linear model + multi-angle height and MUAC prediction).
Rating¶
⭐⭐⭐⭐ (4/5)
- Novelty: ⭐⭐⭐⭐ — The combination of multi-pose GAT and retrieval augmentation is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Cross-dataset validation, clinical user study, and extensive ablations.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with thorough discussion of ethics and deployment.
- Value: ⭐⭐⭐⭐⭐ — Genuinely targets low-resource deployment, clinically validated, open-source toolkit.