
Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis

Conference: CVPR 2026 · arXiv: 2604.02616 · Code: None · Area: Medical Imaging
Keywords: Federated Learning, Autism Behavior Recognition, Skeleton-based Action Recognition, Privacy Preservation, Personalized Federated Learning

TL;DR

This paper proposes the first federated learning framework for children's autism behavior recognition. Through a two-tier privacy strategy (3D skeleton abstraction to remove identity, combined with federated optimization so that raw data never leaves a site), the approach achieves 87.80% average accuracy on the MMASD dataset using the APFL personalized federated method, surpassing local training by 5.19% while satisfying HIPAA/GDPR compliance requirements.

Background & Motivation

  1. Background: Early identification of Autism Spectrum Disorder (ASD) relies on behavioral observation and assessment, currently performed primarily by clinical experts. Video-based automatic behavior analysis (e.g., action recognition) has the potential to assist large-scale screening, but requires multi-site clinical data to train well-generalizing models.
  2. Limitations of Prior Work: (1) Clinical videos of children constitute extremely sensitive data, and HIPAA/GDPR strictly prohibit cross-site sharing of raw video; (2) single-site data is limited in volume and subject to treatment style variation (e.g., robot-assisted vs. yoga), resulting in poor model generalizability; (3) existing federated learning research focuses predominantly on medical imaging (e.g., CT/MRI), with virtually no work targeting behavioral video.
  3. Key Challenge: Multi-site collaboration improves model generalization, yet raw video contains sensitive information such as children's facial features and body characteristics—even federated learning cannot fully eliminate the risk of privacy leakage through gradient inversion attacks.
  4. Goal: Design a two-tier privacy protection scheme that enables multi-site collaborative training without transmitting any identifiable information.
  5. Key Insight: Skeleton sequences naturally eliminate identity-related cues such as facial features, clothing, and background, and are invariant to lighting and camera conditions—serving simultaneously as a privacy protection mechanism and a robust behavioral representation.
  6. Core Idea: The first privacy tier anonymizes data via ROMP-based 3D skeleton extraction; the second tier enforces federated learning to keep data local. The combination of both layers satisfies the most stringent compliance requirements.

Method

Overall Architecture

Clinical videos at each site → ROMP extracts 3D skeleton sequences \(S \in \mathbb{R}^{T \times 71 \times 3}\) (identity removed) → local FreqMixFormer model training → federated aggregation (FedAvg / FedProx / APFL, etc.) → global or personalized model returned to each site → iterate until convergence.
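The round-based pipeline above can be sketched in a few lines (a minimal illustration only: `local_update` is a hypothetical stand-in for one site's local FreqMixFormer training step, ROMP skeleton extraction is assumed to happen offline before training, and the per-site sample counts are made up):

```python
import numpy as np

def local_update(global_w, rng):
    # Hypothetical stand-in for one site's local training: returns the
    # delta between locally trained weights and the received global ones.
    return rng.normal(scale=0.01, size=global_w.shape)

rng = np.random.default_rng(0)
num_sites, dim = 3, 8                    # 3 therapy-theme sites, toy model size
site_sizes = np.array([620, 300, 395])   # hypothetical per-site sample counts n_i
weights = site_sizes / site_sizes.sum()  # aggregation weights n_i / n

global_w = np.zeros(dim)
for round_t in range(30):                # 30 communication rounds, K=1 local epoch
    deltas = [local_update(global_w, rng) for _ in range(num_sites)]
    # FedAvg-style weighted aggregation: w^{t+1} = sum_i (n_i/n) (w^t + Δw_i^t)
    global_w = sum(p * (global_w + d) for p, d in zip(weights, deltas))
```

The same loop structure applies to FedProx and APFL; only the local objective and the model each site keeps change.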

Key Designs

  1. Skeleton Abstraction Layer
     • Function: Converts clinical video into a privacy-safe behavioral representation.
     • Mechanism: ROMP extracts 71 3D keypoints (SMPL + additional + H36M joints), completely removing facial features, clothing information, and environmental context. The skeleton sequence \(S \in \mathbb{R}^{T \times V \times 3}\) is invariant to lighting, background, and camera parameters.
     • Design Motivation: Serves as the first line of privacy defense: even if skeleton data is leaked, identity cannot be recovered. It also eliminates appearance-level distribution discrepancies across sites.

  2. FreqMixFormer Action Recognition Backbone
     • Function: Recognizes autism-related behavioral patterns from skeleton sequences.
     • Mechanism: A frequency-aware attention module applies the Discrete Cosine Transform (DCT) to joint trajectories; a hybrid Transformer architecture balances global temporal dependencies and local spatial correlations. The lightweight design minimizes parameter count for efficient federated edge deployment.
     • Design Motivation: A smaller model lowers federated communication cost (fewer parameters transmitted per round), and frequency-domain features capture repetitive autism-related behavioral patterns better than purely temporal representations.

  3. Adaptive Personalized Federated Learning (APFL)
     • Function: Adaptively balances global knowledge sharing and site-specific specialization.
     • Mechanism: Each site maintains a personalized model \(v_i = \alpha_i u_i + (1-\alpha_i)w\), where \(u_i\) is the local model and \(w\) is the global model. The mixing coefficient \(\alpha_i\) is learned adaptively via gradient descent: \(\alpha_i^{t+1} = \alpha_i^t - \eta_\alpha \langle \nabla f_i(v_i), u_i - w \rangle\).
     • Design Motivation: The behavioral distribution across therapy themes (robot-assisted / rhythmic / yoga) is highly heterogeneous (non-IID), causing FedAvg to suffer a 12% performance drop relative to local training. APFL lets each site automatically determine how much to trust the global model.
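The APFL update rules quoted above can be played out with a small numpy sketch (illustrative only: the quadratic `grad_f` is a hypothetical stand-in for the gradient of site \(i\)'s loss on its own skeleton data, and all step sizes are made up):

```python
import numpy as np

# Hypothetical local objective f_i(v) = 0.5 * ||v - target||^2,
# standing in for site i's training loss; its gradient is v - target.
target = np.array([1.0, -2.0, 0.5])
def grad_f(v):
    return v - target

u = np.zeros(3)                  # local model u_i
w = np.array([0.8, -1.5, 0.2])   # global model w (held fixed here for clarity)
alpha, eta_a, eta = 0.25, 0.05, 0.1

for _ in range(50):
    v = alpha * u + (1 - alpha) * w   # personalized model v_i = α u_i + (1-α) w
    # adaptive mixing update: α ← α − η_α ⟨∇f_i(v_i), u_i − w⟩, clipped to [0, 1]
    alpha = float(np.clip(alpha - eta_a * (grad_f(v) @ (u - w)), 0.0, 1.0))
    # local step on u_i, whose gradient through v_i carries a factor of α
    u -= eta * alpha * grad_f(v)
```

In the paper's setting, the trajectory of \(\alpha\) over rounds is what provides the "global first, local later" interpretability discussed in the findings.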

Loss & Training

A standard cross-entropy classification loss is used. Federated training runs for 30 communication rounds, with \(K=1\) local SGD epoch per round and weighted-averaging aggregation \(w^{t+1} = \sum_{i=1}^N \frac{n_i}{n}(w^t + \Delta w_i^t)\). FedProx adds a proximal regularization term \(\frac{\mu}{2}\|w - w^t\|^2\) to the local objective.
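For concreteness, the FedProx proximal term can be bolted onto any local loss like this (a sketch; `base_loss` stands in for the cross-entropy value and the value of \(\mu\) is a hypothetical choice):

```python
import numpy as np

mu = 0.01  # proximal strength (hypothetical value)

def fedprox_loss(local_w, global_w, base_loss):
    # Local objective plus (mu/2) * ||w - w^t||^2, which pulls the local
    # model toward the last global model and tempers non-IID drift.
    prox = 0.5 * mu * np.sum((local_w - global_w) ** 2)
    return base_loss + prox

w_local = np.array([1.0, 2.0])
w_global = np.array([0.5, 2.5])
loss = fedprox_loss(w_local, w_global, base_loss=0.7)  # → 0.7025
```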

Key Experimental Results

Main Results

| Method | Theme 1 (Robot) | Theme 2 (Rhythmic) | Theme 3 (Yoga) | Average |
|---|---|---|---|---|
| Local Training | 87.10% | 65.33% | 95.41% | 82.61% |
| FedAvg | 70.16% | 52.67% | 88.07% | 70.30% |
| FedProx | 79.03% | 70.00% | 98.17% | 82.40% |
| FedBN | 66.13% | 78.67% | 64.22% | 69.67% |
| FedPer | 63.71% | 74.67% | 91.74% | 76.71% |
| APFL | 92.74% | 78.00% | 92.66% | 87.80% |

Ablation Study

| Comparison | Key Observation | Explanation |
|---|---|---|
| APFL vs. Local Training | +5.19% average | Federated collaboration genuinely improves generalization |
| APFL vs. FedAvg | +17.50% average | Personalization is critical for non-IID data |
| FedProx vs. FedAvg | +12.10% average | Proximal regularization effectively mitigates heterogeneity |
| APFL \(\alpha\) evolution | Low initially → gradually increases | Global knowledge leveraged first → progressive local adaptation |

Key Findings

  • FedAvg fails severely under strongly non-IID conditions (12% below local training), validating the necessity of personalized federated learning in this setting.
  • APFL outperforms local training on all three themes, demonstrating that federated collaboration remains beneficial even under highly heterogeneous distributions.
  • The evolution trajectory of \(\alpha\) provides interpretability—the model first learns global commonalities and then adapts to local specifics.
  • Theme 2 (rhythmic activities) is the most challenging (local accuracy only 65.33%); APFL improves it to 78.00%, indicating that cross-site knowledge transfer is most beneficial for difficult tasks.

Highlights & Insights

  • Engineering value of the two-tier privacy design: The combination of skeleton abstraction and federated learning not only satisfies compliance requirements but also incidentally provides cross-site feature alignment benefits—all sites receive the same skeleton representation free of scene-specific biases.
  • Interpretability through APFL's adaptive mixing coefficient: The training dynamics of \(\alpha\) allow direct observation of the model's transition from "global learning" to "local specialization," which facilitates understanding of model behavior in clinical settings.
  • Unified solution for privacy and representation learning: Skeleton extraction serves simultaneously as a privacy protection mechanism and a domain alignment technique, achieving two objectives with a single design choice.

Limitations & Future Work

  • The MMASD dataset is limited in scale (1,315 samples); larger multi-site clinical validation is needed.
  • Only skeleton features are used, discarding potentially diagnostically valuable facial expressions and speech information.
  • Communication efficiency in federated training is not thoroughly analyzed (e.g., gradient compression, sparsification).
  • The experimental setup covers only 3 sites; scalability to 10+ sites remains unexplored.
  • Future work could integrate multimodal information such as speech prosody and conversational dynamics, enabling privacy-preserving multimodal fusion within the federated framework.
Comparison with Prior Approaches

  • vs. standard FedAvg: Performance degrades severely under the strong heterogeneity of autism behavior data, confirming that this setting requires personalized federated learning rather than a one-size-fits-all approach.
  • vs. traditional privacy protection methods (differential privacy, homomorphic encryption): The skeleton abstraction in this paper operates at the data level, offering greater efficiency than cryptographic computation without sacrificing model accuracy.
  • vs. medical imaging federated learning: Prior work (e.g., FedBN) primarily addresses domain shift in CT/MRI; this paper is the first to apply federated learning to behavioral video analysis.

Rating

  • Novelty: ⭐⭐⭐ The combination of skeleton abstraction and federated learning is straightforward yet effective; technical novelty is moderate.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison of multiple federated methods, convergence analysis, and \(\alpha\) evolution analysis.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation and privacy design are clearly articulated.
  • Value: ⭐⭐⭐⭐ High clinical application value for early ASD screening; the two-tier privacy design is practically useful.