NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening

Conference: AAAI 2026
arXiv: 2511.16566
Code: IAB-RUBRIC NutriScreener Toolkit
Area: Medical Imaging / Nutritional Screening
Keywords: Childhood malnutrition detection, multi-pose imaging, graph attention network, CLIP, retrieval augmentation, anthropometric prediction

TL;DR

This paper proposes NutriScreener, a framework combining a CLIP visual encoder, a multi-pose graph attention network (GAT), and a FAISS-based retrieval-augmented classification/regression module. Through cross-pose attention and category-enhanced retrieval, the system achieves robust childhood malnutrition detection and anthropometric prediction, attaining 0.79 recall and 0.82 AUC on cross-continental datasets including AnthroVision, with clinician ratings of 4.3/5 for accuracy and 4.6/5 for efficiency.

Background & Motivation

Background: As of 2024, approximately 150 million children under five suffer from stunting and over 42 million from wasting globally. Malnutrition remains a leading cause of irreversible developmental harm and mortality in children, and low-resource regions in particular lack timely screening capacity.

Limitations of Prior Work:

  • Inefficiency of traditional methods: Manual anthropometric measurements using MUAC tapes, weight-for-height charts, and questionnaires are time-consuming, error-prone, and unscalable.
  • Limitations of existing AI methods: Facial-based methods are mostly designed for elderly populations and are unsuitable for children; Microsoft's Child Growth Monitor requires infrared depth sensors; existing models suffer from small datasets and majority-class bias (DomainAdapt recall of only 67%).
  • Severe class imbalance: Malnourished children constitute a minority class, causing models to be biased toward predicting healthy outcomes.
  • Single-pose insufficiency: A single image cannot capture all diagnostic cues, such as asymmetric fat loss or pose-dependent deformations.

Key Challenge: Low-resource settings demand low-cost, scalable screening solutions, yet existing AI approaches either rely on specialized hardware or perform poorly on minority-class detection, making real-world deployment infeasible.

Goal: From multi-pose 2D images captured with standard smartphones, simultaneously achieve: (1) binary nutritional status classification; and (2) regression prediction of four anthropometric measures — height, weight, MUAC, and head circumference.

Key Insight: Each subject is modeled as a graph (nodes = per-pose CLIP embeddings), with a GAT capturing inter-pose relationships. A retrieval-augmented module then queries a knowledge base for similar samples to compensate for minority-class bias.

Core Idea: Multi-pose CLIP embeddings + GAT cross-pose reasoning + category-enhanced FAISS retrieval + context-aware adaptive fusion.

Method

Overall Architecture

NutriScreener comprises four core components, each sketched in code in the subsections below:

  1. CLIP image encoder: extracts semantic features from each pose.
  2. Graph Attention Network (GAT): models inter-pose relationships to produce consistent multi-view predictions.
  3. Retrieval module: queries a knowledge base (KB) to retrieve representative support samples.
  4. Context-aware fusion mechanism: adaptively combines GAT and retrieval predictions.

Multi-Pose Embedding Extraction

  • Each pose image \(x_{i,j}\) is passed through a frozen CLIP encoder (RN50x64 variant) to extract a 1024-dimensional embedding \(e_{i,j}\).
  • The scalar age \(a_i\) is concatenated to form a 1025-dimensional node feature: \(v_{i,j} = [e_{i,j}; a_i]\).
  • Advantages of multi-pose design: (1) aggregating cross-view redundant cues compensates for single-pose limitations; (2) accommodates varying capture conditions (occlusion, missing poses).
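A minimal sketch of the extraction step above, assuming the open_clip package (the paper specifies only the CLIP RN50x64 variant, not a library); the function and argument names are illustrative:

```python
import torch
import open_clip
from PIL import Image

# Frozen CLIP RN50x64 image encoder; its weights are never updated.
model, _, preprocess = open_clip.create_model_and_transforms(
    "RN50x64", pretrained="openai"
)
model.eval()

@torch.no_grad()
def pose_node_feature(image_path: str, age_years: float) -> torch.Tensor:
    """Encode one pose image and append the scalar age -> 1025-d node v_ij."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)  # (1, 3, H, W)
    e = model.encode_image(image).squeeze(0)                 # e_ij, (1024,)
    return torch.cat([e, torch.tensor([age_years])])         # v_ij = [e_ij; a_i]
```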

Graph Construction and GAT Inference

  • All pose embeddings \(\{v_{i,1}, \ldots, v_{i,P}\}\) of the same subject are organized as nodes in a fully connected undirected graph.
  • A 2-layer GAT (8-head attention, dropout = 0.1) performs multi-head self-attention message passing.
  • Global pooling yields a subject-level embedding \(h_i\), which is fed into classification and regression heads.
  • Cross-pose attention in the GAT can capture inter-pose correlations (e.g., asymmetric fat loss), improving robustness.
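A sketch of the graph construction and GAT pass described above, assuming PyTorch Geometric (the paper names no library); the hidden width of 256 is an illustrative choice, while the 2 layers, 8 heads, and dropout of 0.1 follow the text:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class PoseGAT(nn.Module):
    """2-layer, 8-head GAT over a fully connected graph of pose features."""
    def __init__(self, in_dim=1025, hidden=256, heads=8, dropout=0.1):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads, dropout=dropout)
        self.gat2 = GATConv(hidden * heads, hidden, heads=heads,
                            concat=False, dropout=dropout)
        self.cls_head = nn.Linear(hidden, 1)  # malnourished-vs-healthy logit
        self.reg_head = nn.Linear(hidden, 4)  # height, weight, MUAC, HC

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        h = global_mean_pool(h, batch)        # subject-level embedding h_i
        return self.cls_head(h), self.reg_head(h)

def fully_connected_edges(num_poses: int) -> torch.Tensor:
    """All ordered pairs (i != j): a fully connected undirected graph."""
    idx = torch.arange(num_poses)
    src, dst = torch.meshgrid(idx, idx, indexing="ij")
    mask = src != dst
    return torch.stack([src[mask], dst[mask]])

# One subject with 8 poses; all nodes are assigned to graph 0.
x = torch.randn(8, 1025)
logit, anthro = PoseGAT()(x, fully_connected_edges(8),
                          torch.zeros(8, dtype=torch.long))
```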

Knowledge Base Construction

  • 248 pediatric subjects, each with 8 poses (frontal ×4, left lateral, right lateral, posterior, selfie).
  • Captured with a standard smartphone (OnePlus Nord, approximately 165 cm distance); trained healthcare workers recorded height, weight, MUAC, and head circumference.
  • Per-subject average pose embeddings and labels are indexed with FAISS.
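A sketch of the KB indexing step, using FAISS inner-product search over L2-normalized vectors (equivalent to cosine similarity); the array names are illustrative:

```python
import faiss
import numpy as np

def build_knowledge_base(mean_embeddings: np.ndarray, labels: np.ndarray):
    """Index per-subject mean pose embeddings; labels ride along by row id."""
    embs = np.array(mean_embeddings, dtype="float32")  # copy before mutation
    faiss.normalize_L2(embs)        # in place; IP on unit vectors = cosine
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index, labels
```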

Retrieval-Augmented Classification

  1. Compute the global query embedding: \(q_i = \frac{1}{P_i}\sum_{j=1}^{P_i} v_{i,j}\)
  2. FAISS retrieves the top-\(k\) nearest neighbors, yielding cosine distances \(\{d_j\}\) and labels \(\{y_j^{kb}\}\).
  3. Distances are normalized via temperature-scaled softmax.
  4. Category enhancement: Malnourished neighbors are multiplied by an enhancement factor \(\gamma\) to upweight minority-class contributions.
  5. After renormalization, the retrieval prediction is obtained as a weighted sum: \(y_i^{retrieved} = \sum_j w_j y_j^{kb}\)
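A sketch of steps 1–5, continuing the FAISS index and label array from the previous sketch; k, the temperature tau, and the enhancement factor gamma are illustrative values, as the paper's exact settings are not reproduced here:

```python
import faiss
import numpy as np

def retrieval_prediction(node_feats, index, kb_labels, k=10, tau=0.1, gamma=2.0):
    """Steps 1-5: mean-pool the query, retrieve k neighbors, softmax-weight
    by temperature-scaled similarity, upweight malnourished neighbors."""
    q = node_feats.mean(axis=0).astype("float32").reshape(1, -1)  # step 1: q_i
    faiss.normalize_L2(q)
    sims, ids = index.search(q, k)            # step 2: cosine similarities, ids
    sims, ids = sims[0], ids[0]
    w = np.exp(sims / tau)                    # step 3: temperature-scaled softmax
    w /= w.sum()
    y_kb = kb_labels[ids].astype("float32")
    w = np.where(y_kb == 1.0, gamma * w, w)   # step 4: category enhancement
    w /= w.sum()                              # step 5: renormalize ...
    return float((w * y_kb).sum())            # ... and take the weighted sum
```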

Context-Aware Fusion

An auxiliary context vector \(c_i = [\log\frac{p_i}{1-p_i}, \bar{d}_i]\), comprising the GAT log-odds and the mean retrieval distance, is fed into a small MLP to predict a fusion coefficient \(\alpha \in [0,1]\):

\[\hat{y}_i^{CLS} = \alpha^{CLS} y_i^{GAT} + (1-\alpha^{CLS}) y_i^{retrieved}\]
  • When KB neighbors are dense, the mechanism favors retrieval; when neighbors are sparse, it favors the GAT.
  • The same formulation applies to regression tasks using an independent \(\alpha^{reg}\).
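A sketch of the fusion gate; the hidden width and the sigmoid output that keeps \(\alpha\) in [0, 1] are assumptions, since the paper only describes a small MLP over \(c_i\):

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Map c_i = [GAT log-odds, mean retrieval distance] to alpha in [0, 1]."""
    def __init__(self, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, gat_logit, mean_dist, y_retrieved):
        ctx = torch.stack([gat_logit, mean_dist], dim=-1)  # context vector c_i
        alpha = self.mlp(ctx).squeeze(-1)                  # fusion coefficient
        y_gat = torch.sigmoid(gat_logit)                   # GAT probability p_i
        return alpha * y_gat + (1 - alpha) * y_retrieved   # fused prediction
```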

Loss & Training

Joint training: \(\mathcal{L} = \mathcal{L}_{class} + \mathcal{L}_{reg}\) (BCE with logits + MSE).
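A direct sketch of this objective; the targets are assumed to be float tensors shape-matched to the two heads:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # L_class: malnourished-vs-healthy logit
mse = nn.MSELoss()            # L_reg: height, weight, MUAC, HC

def joint_loss(cls_logit, cls_target, reg_pred, reg_target):
    """L = L_class + L_reg, with both heads trained jointly."""
    return bce(cls_logit, cls_target) + mse(reg_pred, reg_target)
```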

Key Experimental Results

Datasets

| Dataset | Samples | Population | Poses | Annotations |
|---|---|---|---|---|
| AnthroVision | 2,141 | Indian children | Multi-pose | Height / Weight / MUAC / HC / WC |
| ARAN | 512 | Kurdish children | 4 anonymized views | Height / Weight / WC / HC |
| CampusPose | 80 | University students | Multi-pose | Height / Weight / MUAC / HC / WC |

Main Results

| Model | Acc↑ | Prec↑ | Rec↑ | F1↑ | AUC↑ | H RMSE↓ | W RMSE↓ | MUAC RMSE↓ | HC RMSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| DomainAdapt | 0.68 | 0.63 | 0.67 | 0.64 | 0.55 | 22.00 | 12.40 | 3.55 | 5.05 |
| CLIP+GNN | 0.76 | 0.66 | 0.54 | 0.59 | 0.82 | 7.37 | 5.82 | 3.80 | 5.23 |
| Retrieval-only | 0.53 | 0.36 | 0.66 | 0.45 | 0.61 | 9.48 | 7.89 | 3.12 | 2.76 |
| NutriScreener-W | 0.74 | 0.56 | 0.79 | 0.66 | 0.82 | 6.38 | 5.32 | 2.80 | 2.97 |

RMSE units: cm for height, MUAC, and head circumference; kg for weight.

Key gains:

  • vs. DomainAdapt: recall 0.67 → 0.79; height RMSE 22.00 cm → 6.38 cm.
  • vs. CLIP+GNN: recall 0.54 → 0.79 (a 46% relative gain), with improvements across all regression metrics.

Ablation Study

| Variant | Rec↑ | F1↑ | AUC↑ | H RMSE↓ |
|---|---|---|---|---|
| BCE | 0.81 | 0.59 | 0.78 | 10.93 |
| Focal | 0.73 | 0.53 | 0.73 | 10.82 |
| Context | 0.65 | 0.59 | 0.78 | 10.82 |
| Weighted (final) | 0.79 | 0.66 | 0.82 | 6.38 |

CLIP Encoder Selection

Among 9 CLIP variants, RN50x64 achieves the highest ROC-AUC (68%) and mAP (58%) with the most balanced precision–recall trade-off. The frozen pretrained encoder substantially outperforms its fine-tuned counterpart (recall: 79% vs. 38%), supporting the use of foundation models in frozen form in low-resource settings.

Cross-Dataset Analysis

  • Using a demographically matched knowledge base yields up to 25% recall improvement and a 3.5 cm reduction in RMSE.
  • A highly out-of-distribution KB (CampusPose → AnthroVision) is equivalent to no retrieval; the fusion mechanism automatically degrades gracefully in this scenario.
  • Across cohorts (community vs. clinical), AUC values are 0.78 and 0.74 respectively, demonstrating good generalizability.

Clinical User Study

Twelve medical professionals (mean 9.5 years of experience) evaluated the system in a realistic clinical setting:

  • Clinical consistency: 4.3/5
  • Efficiency: 4.6/5
  • Trustworthiness: 4.4/5
  • Deployment readiness: 4.1/5
  • Notable feedback: the system successfully flagged a visually ambiguous malnutrition case.

Key Findings

  1. Frozen CLIP outperforms fine-tuned CLIP — in low-resource settings, pretrained representations from foundation models generalize better.
  2. Retrieval augmentation is an effective tool for addressing class imbalance — but requires category enhancement and context-aware fusion to be effective.
  3. Demographic alignment of the knowledge base directly impacts performance — even a small number of matched samples yields significant gains.
  4. Multi-pose modeling is particularly important for anthropometric regression, with lateral views contributing more than frontal ones.
  5. GAT attention weights provide cross-pose interpretability.

Highlights & Insights

  1. End-to-end multi-task design: Jointly addresses classification (malnourished/healthy) and regression (4 anthropometric measures) with efficient parameter sharing.
  2. Deployment-oriented design: Operates with standard smartphone images, requires no specialized hardware, uses irreversible CLIP embeddings (privacy-friendly), and has received IRB approval.
  3. Knowledge base as adaptation: No retraining is required — domain adaptation to a new population is achieved simply by swapping the knowledge base, offering a highly practical paradigm for low-resource deployment.
  4. Elegant adaptive fusion: Log-odds and retrieval distance serve as context signals to automatically arbitrate between the GAT and retrieval — trusting retrieval when the KB is dense, and the GAT when it is sparse.
  5. Significant leap from CNN to VLM: DomainAdapt's height RMSE of 22 cm renders it nearly unusable in practice; NutriScreener reduces this to 6.38 cm — a qualitative improvement.

Limitations & Future Work

  1. Limited data scale: AnthroVision contains only 2,141 children and the KB only 248 subjects, making it difficult to cover global diversity.
  2. Geographic constraints: Validation is primarily conducted on Indian and Kurdish children; coverage of high-burden regions such as sub-Saharan Africa and Southeast Asia is absent.
  3. Age range limitations: CampusPose comprises university students (out-of-domain), with regression RMSE reaching 24 cm, indicating difficulty generalizing across age groups.
  4. Trade-off of privacy design: Frozen CLIP with irreversible embeddings ensures privacy but also precludes domain adaptation through fine-tuning.
  5. Lack of uncertainty estimation: Clinicians suggested adding uncertainty quantification and visual attention heatmaps, which the current version does not provide.
  6. Coarse binary classification: The model only distinguishes healthy from malnourished, without further stratification into subtypes such as stunting, wasting, and underweight.

Related Work

  • AI-based nutritional assessment: ARAN (512 children), AnthroVision + DomainAdapt (multi-task CNN), Microsoft Child Growth Monitor (infrared depth sensor).
  • Visual foundation models: CLIP (cross-domain generalization), MedCLIP (medical adaptation), NurtureNet (anthropometric CLIP).
  • Graph neural networks: DMGNN (multi-scale joint relationships), GraphCMR (body shape regression).
  • Retrieval-augmented learning: RAC (FAISS memory indexing), COBRA (mutual-information-optimized retrieval).
  • Multi-view anthropometric estimation: Liu et al. (linear model + multi-angle height and MUAC prediction).

Rating

⭐⭐⭐⭐ (4/5)

  • Novelty: ⭐⭐⭐⭐ — The combination of multi-pose GAT and retrieval augmentation is novel and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Cross-dataset validation, clinical user study, and extensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with thorough discussion of ethics and deployment.
  • Value: ⭐⭐⭐⭐⭐ — Genuinely targets low-resource deployment, clinically validated, open-source toolkit.