HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

Conference: NeurIPS 2025 · arXiv: 2509.07260 · Code: None · Area: AI Safety (Healthcare AI) · Keywords: small language models, mobile health monitoring, wearable devices, privacy preservation, on-device deployment

TL;DR

The first benchmark systematically evaluating small language models (SLMs, 1–4B parameters) on mobile and wearable health monitoring tasks, covering zero-shot, few-shot, and instruction fine-tuning paradigms, with on-device deployment validated on an iPhone.

Background & Motivation

Background: Mobile and wearable devices can continuously collect physiological data such as step counts, heart rate, and sleep metrics. LLMs have demonstrated strong generalization in health prediction tasks (e.g., Health-LLM, PhysioLLM).

Limitations of Prior Work: LLM-based approaches predominantly rely on cloud inference and face three core challenges: (1) privacy leakage risk, as sensitive health data must be uploaded to remote servers; (2) communication latency that impairs real-time monitoring; and (3) computational and memory overhead of 7B+ models far exceeding mobile device capacity.

Key Challenge: A fundamental tension exists between the powerful capabilities of LLMs and the resource constraints of mobile platforms — a solution is needed that preserves LLM-level performance while enabling local execution.

Goal: Can SLMs (1–4B parameters) match LLMs on health prediction tasks? And what is their deployment efficiency on real mobile devices?

Key Insight: Construct a comprehensive benchmark that systematically compares 9 state-of-the-art SLMs against multiple LLMs across 8 health tasks, with actual deployment on an iPhone for validation.

Core Idea: With appropriate fine-tuning, SLMs can match or even surpass LLMs on health monitoring tasks while delivering orders-of-magnitude efficiency gains and stronger privacy guarantees.

Method

Overall Architecture

HealthSLM-Bench evaluates SLMs under three paradigms: (1) Zero-shot learning — direct inference without examples; (2) Few-shot learning — in-context learning with 1/3/5/10 examples; (3) Instruction fine-tuning — parameter-efficient fine-tuning via LoRA. The best-performing models are subsequently deployed on an iPhone 15 Pro Max to assess on-device efficiency.

Key Designs

  1. Zero-shot Prompt Construction:

    • Function: Design standardized prompt templates for health monitoring
    • Design Motivation: Evaluate the intrinsic health reasoning capability of SLMs based on pre-trained knowledge
    • Mechanism: Prompts consist of three components — Instruction (role setting, e.g., "You are a personal health agent") + Main Query (14-day sensor data sequences: steps, calories, heart rate, sleep, etc.) + Output Constraints (restricting output format, e.g., "Predict fatigue level 1–5")
    • Novelty: Chain-of-thought and self-consistency are intentionally excluded to preserve on-device deployment efficiency
  2. Few-shot Prompt Construction:

    • Function: Enhance in-context learning via a small number of labeled examples
    • Design Motivation: Leverage in-context learning to capture input–output patterns
    • Mechanism: \(\text{Prompt}_{FS} = \text{Instruction}_{FS} + \text{Examples}_N + \text{Prompt}_{ZS}\), where each example is a zero-shot prompt paired with its answer. Experiments are conducted with \(N \in \{1, 3, 5, 10\}\) (see the prompt-assembly sketch after this list)
    • Novelty: Different tasks exhibit distinct sensitivity to the number of examples — mental health tasks benefit more from additional demonstrations
  3. Instruction Fine-tuning (LoRA):

    • Function: Format instruction–response pairs using the Alpaca template and fine-tune efficiently via LoRA
    • Design Motivation: Update model parameters to achieve more persistent task alignment
    • Mechanism: Trainable low-rank decomposition matrices are introduced into attention and feed-forward layers while original weights are frozen
    • Novelty: Particularly suited for on-device inference, minimizing memory and computational overhead
  4. On-device Deployment:

    • Function: Deploy the best-performing SLMs on an iPhone 15 Pro Max
    • Mechanism: Models are converted to GGUF format → 4-bit quantization → inference via the Llama.cpp engine
    • Evaluation Metrics: TTFT (time to first token), ITPS/OTPS (input/output tokens per second), OET (output evaluation time), and CPU/RAM usage
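
A minimal sketch of the zero-shot and few-shot prompt assembly from designs 1–2 above (referenced there). The instruction wording, field names, and the example sensor summary are illustrative placeholders, not the paper's exact templates or data.

```python
# Hypothetical prompt-assembly helpers; template wording is illustrative only.

ZS_INSTRUCTION = "You are a personal health agent."
OUTPUT_CONSTRAINT = (
    "Predict the user's fatigue level on a scale of 1-5. Answer with a single number."
)

def build_zero_shot_prompt(sensor_summary: str) -> str:
    """Instruction + main query (14-day sensor summary) + output constraint."""
    return f"{ZS_INSTRUCTION}\n\n{sensor_summary}\n\n{OUTPUT_CONSTRAINT}"

def build_few_shot_prompt(examples: list[tuple[str, str]], query_summary: str) -> str:
    """Prompt_FS = Instruction_FS + Examples_N + Prompt_ZS, with N in {1, 3, 5, 10}."""
    fs_instruction = "Here are some solved examples, followed by a new case to predict."
    demos = "\n\n".join(
        f"{build_zero_shot_prompt(summary)}\nAnswer: {answer}"
        for summary, answer in examples
    )
    return f"{fs_instruction}\n\n{demos}\n\n{build_zero_shot_prompt(query_summary)}"

# Usage with a fabricated one-line sensor summary (N = 1).
summary = "Past 14 days: avg 7,850 steps/day, resting HR 62 bpm, avg sleep 6.4 h."
print(build_few_shot_prompt([(summary, "3")], summary))
```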

Loss & Training

  • Cross-entropy loss for classification tasks; MAE loss for regression tasks
  • LoRA fine-tuning: 8:2 train/test split; inputs are normalized 14-day sliding-window sensor sequences
  • Evaluation metrics: Accuracy for classification, MAE for regression
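
A hedged sketch of the LoRA setup described above, using Hugging Face PEFT. The base checkpoint, rank, alpha, dropout, and target modules are assumptions for illustration; the paper's exact hyperparameters are not reproduced here.

```python
# Illustrative LoRA configuration; hyperparameters and checkpoint are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.2-1B"  # any 1-4B SLM from the benchmark would fit here
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Trainable low-rank adapters on the attention projections; base weights stay frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of base parameters

# Alpaca-style instruction-response formatting (wording illustrative).
def format_alpaca(instruction: str, sensor_input: str, response: str) -> str:
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{sensor_input}\n\n"
        f"### Response:\n{response}"
    )
```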

Key Experimental Results

Main Results

Zero-shot Performance Comparison (LLMs vs. SLMs):

| Metric | LLMs Mean | SLMs Mean | Best SLM |
|---|---|---|---|
| Stress MAE↓ | 0.64 | 0.61 | Qwen2-1.5B: 0.40 |
| Readiness MAE↓ | 2.56 | 2.15 | Llama-3.2-1B: 1.87 |
| Fatigue Acc↑ | 41.54% | 52.20% | Llama-3.2-1B: 63.79% |
| Sleep Quality MAE↓ | 0.60 | 0.60 | Gemma-2-2B: 0.47 |
| Calories MAE↓ | 47.60 | 143.23 | Llama-3.2-3B: 19.70 |

Instruction Fine-tuning (LoRA) Performance Comparison:

| Metric | LLMs Mean | SLMs Mean | Best SLM |
|---|---|---|---|
| Fatigue Acc↑ | 52.4% | 46.1% | TinyLlama: 63.2% |
| Calories MAE↓ | 41.6 | 7.57 | Gemma-2-2B: 2.80 |
| Stress MAE↓ | 0.44 | 0.57 | Phi-3-mini: 0.40 |
| Activity Acc↑ | 28.2% | 21.8% | Gemma-2-2B: 34.4% |

On-device Deployment Efficiency

Measured on an iPhone 15 Pro Max:

| Model | TTFT (s)↓ | ITPS (t/s)↑ | OET (s)↓ | OTPS (t/s)↑ | RAM (GB)↓ |
|---|---|---|---|---|---|
| Llama-2-7B | 29.12 | 24.74 | 27.85 | 3.04 | 7.15 |
| Phi-3-mini-4k | 6.39 | 112.39 | 0.96 | 13.49 | 6.48 |
| TinyLlama-1.1B | 1.37 | 527.01 | 0.35 | 45.89 | 5.17 |

Speedup of TinyLlama-1.1B over Llama-2-7B: roughly 21× faster TTFT, 79× faster OET, and an ITPS improvement exceeding 2000%.
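
To make the efficiency metrics concrete, here is a hedged desktop sketch of how TTFT and OTPS can be measured for a 4-bit GGUF model. The paper runs Llama.cpp natively on the iPhone; this sketch uses the llama-cpp-python bindings purely for illustration, and the model path and prompt are placeholders.

```python
# Illustrative TTFT/OTPS measurement with llama-cpp-python; not the paper's
# on-device harness. Model path and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="tinyllama-1.1b-q4_0.gguf", n_ctx=2048)

prompt = "You are a personal health agent. <14-day sensor summary> Predict fatigue level 1-5."
start = time.perf_counter()
first_token_at = None
n_tokens = 0
# Streaming yields roughly one chunk per generated token.
for _chunk in llm(prompt, max_tokens=64, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1
end = time.perf_counter()

if first_token_at is not None:
    ttft = first_token_at - start                 # time to first token
    otps = n_tokens / (end - first_token_at)      # output tokens per second
    print(f"TTFT: {ttft:.2f}s  OTPS: {otps:.1f} tok/s")
```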

Key Findings

  • Under zero-shot evaluation, SLMs already match or surpass LLMs on most health tasks, particularly in stress, readiness, and fatigue prediction.
  • Regression tasks (calorie estimation) remain challenging for SLMs in the zero-shot setting, but after instruction fine-tuning, SLMs substantially outperform LLMs (MAE: 7.57 vs. 41.6).
  • A "collapse" phenomenon is observed in few-shot learning, where certain SLMs experience abrupt performance degradation under specific few-shot configurations.
  • Mental health tasks (anxiety, depression) benefit more from additional few-shot examples than physiological monitoring tasks.
  • Instruction fine-tuned SLMs exhibit class imbalance bias, tending to predict the majority class.

Highlights & Insights

  • High practical value: Directly addresses the critical question of whether SLMs are viable for mobile health monitoring.
  • End-to-end validation: Provides a complete pipeline from model evaluation to real iPhone deployment, rather than purely theoretical analysis.
  • Remarkable efficiency gains: TinyLlama achieves a first-token latency of only 1.37 seconds on iPhone, 21× faster than 7B models.
  • Privacy advantage: On-device inference entirely eliminates the privacy risk of uploading health data to the cloud.
  • Comprehensiveness: 9 SLMs × 8 tasks × 3 evaluation paradigms constitutes a highly systematic benchmark.

Limitations & Future Work

  • Evaluation is limited to 3 public datasets, with insufficient coverage of health scenarios (lacking important tasks such as cardiovascular and diabetes monitoring).
  • Severe class imbalance significantly impacts fine-tuning performance, with no concrete mitigation proposed.
  • Root cause analysis of the few-shot collapse phenomenon is insufficiently thorough.
  • Deployment is tested only on the iPhone 15 Pro Max; other mobile platforms (Android devices, smartwatches, etc.) are not covered.
  • The safety of SLMs is not evaluated — erroneous health predictions may have serious consequences.
  • The impact of 4-bit quantization on health prediction accuracy is not analyzed in detail.

Context & Outlook

  • Extends the research line of Health-LLM with a focus on more efficient deployment strategies.
  • Complements MobileAIBench, which evaluates only general NLP tasks.
  • Motivates a "small but precise" paradigm for mobile health: fine-tuned 1–4B models can satisfy the majority of practical scenarios.
  • The combination of on-device SLMs and federated learning may represent a key architectural direction for future privacy-preserving health AI.

Rating

  • Novelty: ⭐⭐⭐ Applying SLMs to health monitoring is a natural extension, though the first systematic benchmark represents a meaningful contribution
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model, multi-task, multi-paradigm evaluation with on-device deployment validation, though the number of datasets is limited
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rich tables, though some analyses are primarily descriptive
  • Value: ⭐⭐⭐⭐ Important reference value for the practical deployment of mobile health AI