NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening¶
Conference: AAAI 2026 arXiv: 2511.16566 Code: IAB-RUBRIC NutriScreener Toolkit Area: Medical Imaging / Nutritional Screening Keywords: Childhood malnutrition detection, multi-pose imaging, graph attention network, CLIP, retrieval augmentation, anthropometric prediction
TL;DR¶
This paper proposes NutriScreener, a framework combining a CLIP visual encoder, a multi-pose graph attention network (GAT), and a FAISS-based retrieval-augmented classification/regression module. Through cross-pose attention and category-enhanced retrieval, the system achieves robust childhood malnutrition detection and anthropometric prediction, attaining 0.79 recall and 0.82 AUC on cross-continental datasets including AnthroVision, with clinician ratings of 4.3/5 for accuracy and 4.6/5 for efficiency.
Background & Motivation¶
Background: As of 2024, approximately 150 million children under five suffer from stunting and over 42 million from wasting globally. Malnutrition remains a leading cause of irreversible developmental harm and mortality in children, and low-resource regions in particular lack timely screening capacity.
Limitations of Prior Work:
- Inefficiency of traditional methods: Manual anthropometric measurements using MUAC tapes, weight-for-height charts, and questionnaires are time-consuming, error-prone, and unscalable.
- Limitations of existing AI methods: Facial-based methods are mostly designed for elderly populations and are unsuitable for children; Microsoft's Child Growth Monitor requires infrared depth sensors; existing models suffer from small datasets and majority-class bias (DomainAdapt recall of only 67%).
- Severe class imbalance: Malnourished children constitute a minority class, causing models to be biased toward predicting healthy outcomes.
- Single-pose insufficiency: A single image cannot capture all diagnostic cues, such as asymmetric fat loss or pose-dependent deformations.
Key Challenge: Low-resource settings demand low-cost, scalable screening solutions, yet existing AI approaches either rely on specialized hardware or perform poorly on minority-class detection, making real-world deployment infeasible.
Goal: From multi-pose 2D images captured with standard smartphones, simultaneously achieve: (1) binary nutritional status classification; and (2) regression prediction of four anthropometric measures — height, weight, MUAC, and head circumference.
Key Insight: Each subject is modeled as a graph (nodes = per-pose CLIP embeddings), with a GAT capturing inter-pose relationships. A retrieval-augmented module then queries a knowledge base for similar samples to compensate for minority-class bias.
Core Idea: Multi-pose CLIP embeddings + GAT cross-pose reasoning + category-enhanced FAISS retrieval + context-aware adaptive fusion.
Method¶
Overall Architecture¶
NutriScreener comprises four core components:
1. CLIP image encoder: Extracts semantic features from each pose.
2. Graph Attention Network (GAT): Models inter-pose relationships to produce consistent multi-view predictions.
3. Retrieval module: Queries a knowledge base (KB) to retrieve representative support samples.
4. Context-aware fusion mechanism: Adaptively combines GAT and retrieval predictions.
Multi-Pose Embedding Extraction¶
- Each pose image \(x_{i,j}\) is passed through a frozen CLIP encoder (RN50x64 variant) to extract a 1024-dimensional embedding \(e_{i,j}\).
- The scalar age \(a_i\) is concatenated to form a 1025-dimensional node feature: \(v_{i,j} = [e_{i,j}; a_i]\).
- Advantages of multi-pose design: (1) aggregating cross-view redundant cues compensates for single-pose limitations; (2) accommodates varying capture conditions (occlusion, missing poses).
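A minimal sketch of this extraction step, assuming OpenAI's `clip` package; the helper `pose_node_feature` is illustrative, and the paper's exact preprocessing pipeline may differ:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x64", device=device)  # frozen; 1024-d image embeddings
model.eval()

@torch.no_grad()
def pose_node_feature(image_path: str, age_years: float) -> torch.Tensor:
    """Encode one pose image and append the subject's age -> 1025-d node feature v_ij."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    e = model.encode_image(image).squeeze(0).float()                  # e_ij, (1024,)
    return torch.cat([e, torch.tensor([age_years], device=device)])  # v_ij, (1025,)
```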
Graph Construction and GAT Inference¶
- All pose embeddings \(\{v_{i,1}, \ldots, v_{i,P}\}\) of the same subject are organized as nodes in a fully connected undirected graph.
- A 2-layer GAT (8-head attention, dropout = 0.1) performs multi-head self-attention message passing.
- Global pooling yields a subject-level embedding \(h_i\), which is fed into classification and regression heads.
- Cross-pose attention in the GAT can capture inter-pose correlations (e.g., asymmetric fat loss), improving robustness.
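A sketch of this stage with PyTorch Geometric: the 2 layers, 8 heads, and 0.1 dropout follow the text, while the hidden width, head averaging in the second layer, and mean pooling are assumptions.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class PoseGAT(nn.Module):
    """Two GAT layers over a fully connected graph of per-pose node features."""
    def __init__(self, in_dim=1025, hidden=256, heads=8, dropout=0.1):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads, dropout=dropout)
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads,
                            concat=False, dropout=dropout)  # average the heads
        self.cls_head = nn.Linear(hidden, 1)   # malnourished-vs-healthy logit
        self.reg_head = nn.Linear(hidden, 4)   # height, weight, MUAC, head circumference

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.gat1(x, edge_index))
        x = torch.relu(self.gat2(x, edge_index))
        h = global_mean_pool(x, batch)          # subject-level embedding h_i
        return self.cls_head(h), self.reg_head(h)

def fully_connected_edges(num_nodes: int) -> torch.Tensor:
    """Bidirectional edge index for a fully connected pose graph."""
    idx = torch.arange(num_nodes)
    src, dst = torch.meshgrid(idx, idx, indexing="ij")
    mask = src != dst
    return torch.stack([src[mask], dst[mask]])
```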
Knowledge Base Construction¶
- 248 pediatric subjects, each with 8 poses (frontal ×4, left lateral, right lateral, posterior, selfie).
- Captured with a standard smartphone (OnePlus Nord, approximately 165 cm distance); trained healthcare workers recorded height, weight, MUAC, and head circumference.
- Per-subject average pose embeddings and labels are indexed with FAISS.
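A minimal indexing sketch (`build_kb_index` is illustrative): L2-normalizing the embeddings makes FAISS inner-product search equivalent to cosine similarity.

```python
import faiss
import numpy as np

def build_kb_index(subject_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """subject_embeddings: (N, D) per-subject average pose embeddings."""
    x = subject_embeddings.astype("float32")  # copy, so the caller's array is untouched
    faiss.normalize_L2(x)                     # unit norm -> inner product == cosine
    index = faiss.IndexFlatIP(x.shape[1])     # exact inner-product search
    index.add(x)
    return index
```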
Retrieval-Augmented Classification¶
- Compute the global query embedding: \(q_i = \frac{1}{P_i}\sum_{j=1}^{P_i} v_{i,j}\)
- FAISS retrieves the top-\(k\) nearest neighbors, yielding cosine distances \(\{d_j\}\) and labels \(\{y_j^{kb}\}\).
- Distances are converted to weights via a temperature-scaled softmax, \(w_j \propto \exp(-d_j/\tau)\), so closer neighbors contribute more.
- Category enhancement: Malnourished neighbors are multiplied by an enhancement factor \(\gamma\) to upweight minority-class contributions.
- After renormalization, the retrieval prediction is obtained as a weighted sum: \(y_i^{retrieved} = \sum_j w_j y_j^{kb}\)
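Putting these steps together, a hedged sketch of the retrieval branch: the values of \(k\), \(\tau\), and \(\gamma\) are placeholders, and the softmax here is taken over FAISS similarities rather than the paper's distances.

```python
import faiss
import numpy as np

def retrieval_prediction(index, kb_labels, query, k=10, tau=0.1, gamma=2.0):
    """kb_labels: (N,) array in {0: healthy, 1: malnourished}; returns y_retrieved in [0, 1]."""
    q = query.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    sims, nbrs = index.search(q, k)            # cosine similarities, neighbor ids
    w = np.exp(sims[0] / tau)                  # temperature-scaled softmax weights
    w /= w.sum()
    y = kb_labels[nbrs[0]].astype(float)
    w = np.where(y == 1, gamma * w, w)         # category enhancement: upweight minority class
    w /= w.sum()                               # renormalize
    return float((w * y).sum())                # weighted label vote y_i^retrieved
```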
Context-Aware Fusion¶
An auxiliary context vector \(c_i = [\log\frac{p_i}{1-p_i}, \bar{d}]\), comprising the GAT log-odds and the mean retrieval distance, is fed into a small MLP that predicts a fusion coefficient \(\alpha \in [0,1]\); the final prediction fuses the two branches as \(\hat{y}_i = \alpha\, y_i^{retrieved} + (1-\alpha)\, p_i\):
- When KB neighbors are dense, the mechanism favors retrieval; when neighbors are sparse, it favors the GAT.
- The same formulation applies to regression tasks using an independent \(\alpha^{reg}\).
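A small sketch of the fusion gate under these definitions; the MLP width and the sigmoid parameterization of \(\alpha\) are assumptions.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Predicts alpha from the context vector c_i = [GAT log-odds, mean retrieval distance]."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, gat_logit, mean_dist, y_retrieved):
        c = torch.stack([gat_logit, mean_dist], dim=-1)
        alpha = torch.sigmoid(self.mlp(c)).squeeze(-1)      # alpha in [0, 1]
        p_gat = torch.sigmoid(gat_logit)                    # GAT probability p_i
        return alpha * y_retrieved + (1.0 - alpha) * p_gat  # fused prediction
```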
Loss & Training¶
Joint training: \(\mathcal{L} = \mathcal{L}_{class} + \mathcal{L}_{reg}\) (BCE with logits + MSE).
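As a PyTorch sketch, with the equal weighting implied by the formula above (`joint_loss` is an illustrative name):

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # classification: BCE with logits
mse = nn.MSELoss()            # regression: MSE over the four measures

def joint_loss(cls_logit, cls_target, reg_pred, reg_target):
    return bce(cls_logit, cls_target) + mse(reg_pred, reg_target)  # L = L_class + L_reg
```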
Key Experimental Results¶
Datasets¶
| Dataset | Samples | Population | Poses | Annotations |
|---|---|---|---|---|
| AnthroVision | 2,141 | Indian children | Multi-pose | Height/Weight/MUAC/HC/WC |
| ARAN | 512 | Kurdish children | 4 anonymized views | Height/Weight/WC/HC |
| CampusPose | 80 | University students | Multi-pose | Height/Weight/MUAC/HC/WC |
Main Results¶
| Model | Acc↑ | Prec↑ | Rec↑ | F1↑ | AUC↑ | H RMSE↓ | W RMSE↓ | MUAC RMSE↓ | HC RMSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| DomainAdapt | 0.68 | 0.63 | 0.67 | 0.64 | 0.55 | 22.00 | 12.40 | 3.55 | 5.05 |
| CLIP+GNN | 0.76 | 0.66 | 0.54 | 0.59 | 0.82 | 7.37 | 5.82 | 3.80 | 5.23 |
| Retrieval-only | 0.53 | 0.36 | 0.66 | 0.45 | 0.61 | 9.48 | 7.89 | 3.12 | 2.76 |
| NutriScreener-W | 0.74 | 0.56 | 0.79 | 0.66 | 0.82 | 6.38 | 5.32 | 2.80 | 2.97 |
Key gains:
- vs. DomainAdapt: recall 0.67 → 0.79; height RMSE 22.00 cm → 6.38 cm.
- vs. CLIP+GNN: recall 0.54 → 0.79 (a 46% relative improvement), with gains on all four regression metrics.
Ablation Study¶
| Variant | Rec↑ | F1↑ | AUC↑ | H RMSE↓ |
|---|---|---|---|---|
| BCE | 0.81 | 0.59 | 0.78 | 10.93 |
| Focal | 0.73 | 0.53 | 0.73 | 10.82 |
| Context | 0.65 | 0.59 | 0.78 | 10.82 |
| Weighted (final) | 0.79 | 0.66 | 0.82 | 6.38 |
CLIP Encoder Selection¶
Among 9 CLIP variants, RN50x64 achieves the highest ROC-AUC (68%) and mAP (58%) with the most balanced precision–recall trade-off. The frozen pretrained encoder substantially outperforms its fine-tuned counterpart (recall: 79% vs. 38%), supporting the use of foundation models in frozen form in low-resource settings.
Cross-Dataset Analysis¶
- Using a demographically matched knowledge base yields up to 25% recall improvement and a 3.5 cm reduction in RMSE.
- A highly out-of-distribution KB (CampusPose → AnthroVision) is equivalent to no retrieval; the fusion mechanism automatically degrades gracefully in this scenario.
- Across cohorts (community vs. clinical), AUC values are 0.78 and 0.74 respectively, demonstrating good generalizability.
Clinical User Study¶
Twelve medical professionals (mean 9.5 years of experience) evaluated the system in a realistic clinical setting:
- Clinical consistency: 4.3/5
- Efficiency: 4.6/5
- Trustworthiness: 4.4/5
- Deployment readiness: 4.1/5
- Notable feedback: the system successfully flagged a visually ambiguous malnutrition case.
Key Findings¶
- Frozen CLIP outperforms fine-tuned CLIP — in low-resource settings, pretrained representations from foundation models generalize better.
- Retrieval augmentation is an effective tool for addressing class imbalance — but requires category enhancement and context-aware fusion to be effective.
- Demographic alignment of the knowledge base directly impacts performance — even a small number of matched samples yields significant gains.
- Multi-pose modeling is particularly important for anthropometric regression, with lateral views contributing more than frontal ones.
- GAT attention weights provide cross-pose interpretability.
Highlights & Insights¶
- End-to-end multi-task design: Jointly addresses classification (malnourished/healthy) and regression (4 anthropometric measures) with efficient parameter sharing.
- Deployment-oriented design: Operates with standard smartphone images, requires no specialized hardware, uses irreversible CLIP embeddings (privacy-friendly), and has received IRB approval.
- Knowledge base as adaptation: No retraining is required — domain adaptation to a new population is achieved simply by swapping the knowledge base, offering a highly practical paradigm for low-resource deployment.
- Elegant adaptive fusion: Log-odds and retrieval distance serve as context signals to automatically arbitrate between the GAT and retrieval — trusting retrieval when the KB is dense, and the GAT when it is sparse.
- Significant leap from CNN to VLM: DomainAdapt's height RMSE of 22 cm renders it nearly unusable in practice; NutriScreener reduces this to 6.38 cm — a qualitative improvement.
Limitations & Future Work¶
- Limited data scale: AnthroVision contains only 2,141 children and the KB only 248 subjects, making it difficult to cover global diversity.
- Geographic constraints: Validation is primarily conducted on Indian and Kurdish children; coverage of high-burden regions such as sub-Saharan Africa and Southeast Asia is absent.
- Age range limitations: CampusPose comprises university students (out-of-domain), with regression RMSE reaching 24 cm, indicating difficulty generalizing across age groups.
- Trade-off of privacy design: Frozen CLIP with irreversible embeddings ensures privacy but also precludes domain adaptation through fine-tuning.
- Lack of uncertainty estimation: Clinicians suggested adding uncertainty quantification and visual attention heatmaps, which the current version does not provide.
- Coarse binary classification: The model only distinguishes healthy from malnourished, without further stratification into subtypes such as stunting, wasting, and underweight.
Related Work & Insights¶
- AI-based nutritional assessment: ARAN (512 children), AnthroVision + DomainAdapt (multi-task CNN), Microsoft Child Growth Monitor (infrared depth sensor).
- Visual foundation models: CLIP (cross-domain generalization), MedCLIP (medical adaptation), NurtureNet (anthropometric CLIP).
- Graph neural networks: DMGNN (multi-scale joint relationships), GraphCMR (body shape regression).
- Retrieval-augmented learning: RAC (FAISS memory indexing), COBRA (mutual-information-optimized retrieval).
- Multi-view anthropometric estimation: Liu et al. (linear model + multi-angle height and MUAC prediction).
Rating¶
⭐⭐⭐⭐ (4/5)
- Novelty: ⭐⭐⭐⭐ — The combination of multi-pose GAT and retrieval augmentation is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Cross-dataset validation, clinical user study, and extensive ablations.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with thorough discussion of ethics and deployment.
- Value: ⭐⭐⭐⭐⭐ — Genuinely targets low-resource deployment, clinically validated, open-source toolkit.