
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Conference: NeurIPS 2025
arXiv: 2506.09650
Code: https://github.com/KPeng9510/HopaDIFF
Area: Temporal Action Segmentation / Multi-Person Video Understanding
Keywords: Referring Human Action Segmentation, Multi-Person Scenarios, Diffusion Models, xLSTM, Fourier Conditioning, RHAS

TL;DR

This paper introduces the Referring Human Action Segmentation (RHAS) task—localizing a specific individual in multi-person videos via textual descriptions and performing frame-level action segmentation. The authors construct the RHAS133 dataset comprising 133 movies, 137 action categories, and 33 hours of video, and propose HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework that substantially outperforms existing baselines across multiple evaluation settings.

Background & Motivation

Background: Action segmentation aims to temporally partition untrimmed videos into action segments and classify them. Existing methods (FACT, ActDiff, ASQuery) primarily target single-person scenarios, where action sequences follow predefined protocols (e.g., preparing a salad according to a fixed recipe or assembling objects from instructions), and thus fail to capture the randomness and complexity of real-world multi-person scenarios.

Limitations of Prior Work:
  • No multi-person action segmentation dataset exists; existing datasets (50Salads, Assembly101, Breakfast) are single-person and procedure-following.
  • No text-guided mechanism is available to specify which individual to segment.
  • Existing action segmentation methods lack target-aware partial feature reasoning and fine-grained generative control.

Core Idea: Define the new RHAS task + construct the RHAS133 dataset + propose the dual-branch diffusion model HopaDIFF, which enhances controllability via holistic-partial feature interaction and frequency-domain conditioning.

RHAS133 Dataset

Dataset Statistics

Property RHAS133 50Salads Assembly101 Breakfast
Multi-Person ✓ ✗ ✗ ✗
Text-Guided ✓ ✗ ✗ ✗
Multi-Label ✓ ✗ ✗ ✗
Free-Form Procedure ✓ ✗ ✗ ✗
Viewpoint Third-Person First-Person First-Person Third-Person
Duration 33h 4h 513h 77h
Action Categories 137 50 101 10
Individuals 542 25 53 52
  • 133 movies, 542 annotated individuals, cross-validated by 6 domain experts.
  • Action labels follow the AVA protocol, extended from 80 to 137 fine-grained categories.
  • Textual descriptions refer to the target individual's appearance without revealing action content.

Method

Overall Architecture

HopaDIFF employs a dual-branch diffusion architecture: the holistic branch captures full-video context, while the partial branch extracts fine-grained features from cropped video clips of the target person detected by GroundingDINO. Both branches share a VLM feature extractor (BLIP-2/CLIP) but use independent encoders.

1. Holistic-Partial Conditioned Diffusion

Dual-branch feature extraction:

\[\mathbf{z}^h, \mathbf{z}^p = \mathbf{E}_h(\mathbf{F}_\phi(\mathbf{v}_h, \mathbf{r})), \quad \mathbf{E}_p(\mathbf{F}_\phi(\text{G-Dino}(\mathbf{v}_h, \mathbf{r}), \mathbf{r}))\]

where \(\mathbf{v}_h\) denotes the original video, \(\mathbf{r}\) the textual reference, \(\mathbf{F}_\phi\) the VLM feature extractor, and G-Dino performs target person detection and cropping.
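A minimal schematic of this dual-branch extraction, written as plain Python for illustration. The callables vlm_encode, ground_and_crop, enc_h, and enc_p are hypothetical stand-ins for the BLIP-2/CLIP extractor, the GroundingDINO detect-and-crop step, and the two branch encoders; they are not names from the released code.

```python
def extract_features(video, text, vlm_encode, ground_and_crop, enc_h, enc_p):
    """Sketch of the dual-branch feature extraction (assumed interfaces)."""
    # Holistic branch: per-frame VLM features of the full video, conditioned on the text.
    feats_h = vlm_encode(video, text)            # (T, D) frame features
    z_h = enc_h(feats_h)

    # Partial branch: detect and crop the referred person first, then encode the crops.
    person_clip = ground_and_crop(video, text)   # person-centric crops from the detector
    feats_p = vlm_encode(person_clip, text)      # (T, D) target-focused features
    z_p = enc_p(feats_p)
    return z_h, z_p
```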

2. HP-xLSTM: Cross-Input Gated Attention xLSTM

Building on the mLSTM variant, bidirectional cross-attention (BCA) is introduced at the input gate to enable holistic-partial feature interaction:

\[\hat{\mathbf{z}}^h, \hat{\mathbf{z}}^p = \text{HP-xLSTM}(\mathbf{z}^h, \mathbf{z}^p)\]

The mLSTM matrix memory update: \(\boldsymbol{C}_t^m = f_t^m \boldsymbol{C}_{t-1}^m + i_t^m \boldsymbol{v}_t^m \boldsymbol{k}_t^{m\top}\)

The key innovation lies in the input gate: \(\tilde{i}_t^h, \tilde{i}_t^p = \text{BCA}(\boldsymbol{w}_i^{h\top} \boldsymbol{z}_t^h + b_i^h, \; \boldsymbol{w}_i^{p\top} \boldsymbol{z}_t^p + b_i^p)\)

Through bidirectional cross-attention, the input gate of the holistic branch is modulated by partial features and vice versa, enabling bidirectional information exchange between the two branches.
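A simplified, single-head PyTorch sketch of this cross-input gating follows. It is not the authors' implementation: the shared nn.MultiheadAttention module, the scalar cross_to_gate readout, and the unnormalized matrix-memory readout are simplifying assumptions; the intent is only to show how each branch's input gate can be driven by a cross-attended feature from the other branch.

```python
import torch
import torch.nn as nn


class HPxLSTMSketch(nn.Module):
    """Two mLSTM-style matrix memories whose input gates are coupled by
    bidirectional cross-attention (BCA) between holistic and partial streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv_h = nn.Linear(dim, 3 * dim)    # holistic branch q/k/v projections
        self.qkv_p = nn.Linear(dim, 3 * dim)    # partial branch q/k/v projections
        self.fi_h = nn.Linear(dim, 2)           # forget/input gate pre-activations
        self.fi_p = nn.Linear(dim, 2)
        self.bca = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.cross_to_gate = nn.Linear(dim, 1)  # cross-attended feature -> gate logit

    def _step(self, z_t, cross_t, C, qkv, fi):
        q, k, v = qkv(z_t).chunk(3, dim=-1)                        # each (B, D)
        f_pre, i_pre = fi(z_t).unbind(dim=-1)                      # (B,), (B,)
        f_t = torch.sigmoid(f_pre)[:, None, None]
        # input gate additionally driven by the cross-attended other-branch feature
        i_t = torch.sigmoid(i_pre + self.cross_to_gate(cross_t).squeeze(-1))[:, None, None]
        C = f_t * C + i_t * (v.unsqueeze(-1) @ k.unsqueeze(-2))    # (B, D, D) matrix memory
        h_t = (C @ q.unsqueeze(-1)).squeeze(-1)                    # simplified readout
        return h_t, C

    def forward(self, z_h, z_p):
        # z_h, z_p: (B, T, D) holistic / partial frame features
        B, T, D = z_h.shape
        C_h = z_h.new_zeros(B, D, D)
        C_p = z_p.new_zeros(B, D, D)
        h2p, _ = self.bca(z_h, z_p, z_p)   # holistic queries attend to partial keys/values
        p2h, _ = self.bca(z_p, z_h, z_h)   # partial queries attend to holistic keys/values
        out_h, out_p = [], []
        for t in range(T):
            o_h, C_h = self._step(z_h[:, t], h2p[:, t], C_h, self.qkv_h, self.fi_h)
            o_p, C_p = self._step(z_p[:, t], p2h[:, t], C_p, self.qkv_p, self.fi_p)
            out_h.append(o_h)
            out_p.append(o_p)
        return torch.stack(out_h, dim=1), torch.stack(out_p, dim=1)
```

In this form the forget gate stays branch-local while the input gate mixes in the other branch, mirroring the gate equation above.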

3. Fourier Frequency-Domain Conditioning

Discrete Fourier Transform (DFT) is applied to the HP-xLSTM-enhanced features to extract frequency-domain information as additional conditioning:

\[\hat{\mathbf{z}}_f^h, \hat{\mathbf{z}}_f^p = \text{DFT}(\hat{\mathbf{z}}^h), \text{DFT}(\hat{\mathbf{z}}^p)\]

The decoder receives both spatiotemporal and frequency-domain features:

\[\mathbf{s}^h, \mathbf{s}^p = \mathbf{D}_h(\mathbf{y}_t, \hat{\mathbf{z}}^h, \hat{\mathbf{z}}_f^h), \quad \mathbf{D}_p(\mathbf{y}_t, \hat{\mathbf{z}}^p, \hat{\mathbf{z}}_f^p)\]

Frequency-domain conditioning enhances fine-grained controllability—low frequencies encode overall action rhythm while high frequencies capture action boundary transitions.
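A small sketch of how such a frequency-domain condition could be computed with torch.fft.rfft. Concatenating real and imaginary parts and resampling back to T steps are illustrative choices made here for self-containment; the paper's exact fusion into the decoder may differ.

```python
import torch


def fourier_condition(z_hat: torch.Tensor) -> torch.Tensor:
    """Turn an HP-xLSTM-enhanced sequence (B, T, D) into a frequency-domain
    conditioning tensor of the same shape (hypothetical helper)."""
    # real FFT along the temporal axis; keeps T//2 + 1 frequency bins
    spec = torch.fft.rfft(z_hat, dim=1)
    # concatenate real and imaginary parts so both magnitude- and phase-like
    # information is available to the decoder
    cond = torch.cat([spec.real, spec.imag], dim=1)           # (B, 2*(T//2+1), D)
    # resample back to T steps so it can be fused with the time-domain features
    cond = torch.nn.functional.interpolate(
        cond.transpose(1, 2), size=z_hat.shape[1], mode="linear", align_corners=False
    ).transpose(1, 2)
    return cond
```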

4. Loss & Training

  • Training: Binary cross-entropy loss + temporal boundary loss (aligning the denoised sequence with GT action boundaries); a schematic version is sketched after this list.
  • Inference: Iterative denoising from Gaussian noise; final predictions are averaged across the two branches.
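A hedged sketch of such a training objective, assuming per-frame multi-label logits from both branches. The boundary term below simply matches frame-to-frame changes of predictions and ground truth, and the boundary_weight value is a placeholder rather than the paper's setting.

```python
import torch
import torch.nn.functional as F


def training_loss(logits_h, logits_p, targets, boundary_weight: float = 0.1):
    """Per-frame multi-label BCE on both branches plus a simple boundary term."""
    # logits_*: (B, T, C) denoised action logits; targets: (B, T, C) floats in {0, 1}
    bce = (F.binary_cross_entropy_with_logits(logits_h, targets)
           + F.binary_cross_entropy_with_logits(logits_p, targets))
    # boundary term: match frame-to-frame label changes of predictions and GT
    probs = torch.sigmoid(0.5 * (logits_h + logits_p))
    pred_bound = (probs[:, 1:] - probs[:, :-1]).abs()
    gt_bound = (targets[:, 1:] - targets[:, :-1]).abs()
    boundary = F.mse_loss(pred_bound, gt_bound)
    return bce + boundary_weight * boundary
```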

Key Experimental Results

BLIP-2 Features + Random Split

Method ACC↑ EDIT↑ F1@10↑ F1@25↑ F1@50↑
FACT 26.08 0.27 52.91 50.77 47.06
ActDiff 41.85 7.20 70.56 68.34 63.29
LTContent 34.23 0.31 64.70 63.09 58.50
RefAtomNet 38.01 0.13 34.01 31.93 27.62
HopaDIFF 62.58 7.75 87.96 85.50 79.39

HopaDIFF achieves a roughly 50% relative improvement in ACC (62.58 vs. 41.85) and a roughly 25% relative improvement in F1@50 (79.39 vs. 63.29) over the strongest baseline (ActDiff), substantially outperforming all baselines.
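For reference, the relative gains over ActDiff work out as:

\[\frac{62.58 - 41.85}{41.85} \approx 49.5\%, \qquad \frac{79.39 - 63.29}{63.29} \approx 25.4\%\]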

BLIP-2 Features + Cross-Movie Evaluation (Stricter Generalization Test)

Method ACC↑ EDIT↑ F1@10↑ F1@25↑ F1@50↑
ActDiff 2.36 15.09 22.44 22.28 21.80
LTContent 52.52 0.37 49.35 47.24 42.55
HopaDIFF 59.63 19.37 90.91 90.33 89.26

Under the cross-movie setting, the advantage is even more pronounced: F1@50 improves from 42.55 (the best baseline, LTContent) to 89.26, demonstrating the strong generalization of the dual-branch architecture. ActDiff degrades severely (ACC drops to 2.36%), while HopaDIFF remains robust.

Highlights & Insights

  1. New Task + New Dataset: RHAS is defined for the first time, and RHAS133 is the first multi-person text-guided action segmentation dataset.
  2. HP-xLSTM: Bidirectional cross-attention introduced at the xLSTM input gate elegantly enables holistic-partial information exchange.
  3. Fourier Conditioning: Frequency-domain conditioning supplements spatiotemporal features, enhancing the diffusion model's fine-grained control over action boundaries.
  4. Cross-Movie Generalization: F1@50 reaches 89.26%, far exceeding competing methods, validating the robustness of the dual-branch architecture.

Limitations & Future Work

  1. The framework relies on GroundingDINO for target person detection, which may fail under severe occlusion or when faces are not visible.
  2. RHAS133 covers only 133 movies; while broad in scope, the dataset size remains limited.
  3. The two-stage architecture (VLM feature pre-extraction + diffusion segmentation) precludes end-to-end training.
  4. Inference requires simultaneous execution of GroundingDINO, the VLM, and the dual-branch diffusion model, resulting in considerable computational overhead.
Future Directions:
  • The RHAS task can be extended to sports analytics (tracking the action sequences of specific athletes) and movie content understanding.
  • The cross-input gating design of HP-xLSTM is applicable to other long-sequence modeling tasks requiring multi-source information fusion.
  • The frequency-domain conditioning strategy can be adopted in video generation to maintain temporal consistency.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ New task + new dataset + innovative technical approach; comprehensive contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple evaluation settings and two feature extractors; ablation study could be strengthened.
  • Writing Quality: ⭐⭐⭐⭐ Clear framework presentation, though the dense notation warrants careful cross-referencing.
  • Value: ⭐⭐⭐⭐⭐ Pioneers the RHAS direction; the dataset holds long-term research value.