Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition¶
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2511.03725
Code: Available
Area: Video Understanding / Explainable AI
Keywords: Explainable video action recognition, concept bottleneck model, motion disentanglement, pose sequences, concept discovery
TL;DR¶
This paper proposes DANCE, a framework that achieves structured and motion-aware explainable video action recognition by disentangling action explanations into three concept types: motion dynamics, objects, and scenes.
Background & Motivation¶
Video action recognition models have achieved remarkable performance, yet their decision-making processes remain opaque. Existing explainability methods exhibit notable limitations:
Saliency methods (Saliency Tubes, GradCAM, etc.): produce entangled explanations that cannot distinguish whether a model relies on motion or spatial context.
Language-based methods (LLM-generated concept descriptions): can describe objects and scenes but struggle to express motion dynamics — motion constitutes tacit knowledge, i.e., knowledge that is intuitively understood but difficult to verbalize.
From a cognitive science perspective, humans perceive actions by separately analyzing two factors:
- Temporal dynamics: how motion unfolds over time
- Spatial context: surrounding objects and scenes
Consequently, ideal video XAI should explicitly disentangle temporal dynamics from spatial context — a requirement unmet by existing approaches.
Method¶
Overall Architecture¶
DANCE adopts an ante-hoc concept bottleneck design, inserting a concept layer between a pretrained video backbone encoder and the final classifier. The prediction pipeline is:
Input video → Video features → Three-type concept activations (motion dynamics, objects, scenes) → Action prediction
Each concept type has its own concept layer parameters \(W_C = [W_C^m; W_C^o; W_C^s]\), ensuring explicit disentanglement across concept types.
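The disentangled forward pass can be sketched as follows. This is a minimal NumPy sketch with illustrative dimensions; the variable names, shapes, and random stand-in features are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: feature dim, motion/object/scene concept counts, classes
D, Mm, Mo, Ms, K = 768, 16, 12, 8, 10

# Stand-in for frozen backbone output (placeholder for VideoMAE features)
v = rng.standard_normal(D)

# Separate concept-layer parameters per concept type: W_C = [W_m; W_o; W_s]
W_m = rng.standard_normal((Mm, D))
W_o = rng.standard_normal((Mo, D))
W_s = rng.standard_normal((Ms, D))

# Concept activations z = [z_m; z_o; z_s], each block tied to one concept type
z = np.concatenate([W_m @ v, W_o @ v, W_s @ v])

# Linear classifier over concept activations -> action logits
W_A = rng.standard_normal((K, Mm + Mo + Ms))
logits = W_A @ z
pred = int(np.argmax(logits))
print(z.shape, pred)
```

Because each block of `z` is produced by its own weight matrix, a user can read off which concept type drove a prediction by inspecting the corresponding block of classifier weights.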
Key Designs¶
1. Motion Dynamics Concepts
Core innovation: motion concepts are defined via human pose sequences rather than textual descriptions.
- Key segment selection: informative short clips are extracted from videos via keyframe detection
- Pose sequence extraction: a 2D pose estimator extracts per-frame poses \(P_i^s \in \mathbb{R}^{L \times J \times 2}\) for each key segment
- Concept discovery via clustering: pose sequences from all training videos are aggregated and clustered using the FINCH algorithm to discover representative motion patterns
- Concept annotation: binary labels \(c_{i,k}^m = \mathbb{1}\big(\sum_s a_{i,s,k} > 0\big)\) are automatically generated from cluster assignments, i.e., motion concept \(k\) is active for video \(i\) if any of its key segments falls in cluster \(k\)
Advantage: pose sequences provide appearance-agnostic motion representations that allow users to intuitively understand how an action unfolds over time.
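The discovery step above can be illustrated with a toy sketch. FINCH builds successive partitions from first-nearest-neighbour links; the sketch below implements only that first-partition idea (nearest-neighbour graph plus connected components) on random stand-in pose features, then derives binary concept labels from the cluster assignments. All shapes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy pose sequences: N segments, each L frames x J joints x 2 coords, flattened
N, L, J = 40, 8, 17
poses = rng.standard_normal((N, L * J * 2))
poses[:20] += 3.0  # inject two artificial motion patterns so clustering has structure

# First FINCH-style partition: link every point to its 1-nearest neighbour
# and take connected components of that graph
d = np.linalg.norm(poses[:, None] - poses[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn = d.argmin(axis=1)

labels = np.arange(N)
def find(i):
    while labels[i] != i:
        labels[i] = labels[labels[i]]  # path compression
        i = labels[i]
    return i
for i, j in enumerate(nn):
    labels[find(i)] = find(j)
cluster = np.array([find(i) for i in range(N)])
_, cluster = np.unique(cluster, return_inverse=True)

# Binary motion-concept labels: c_k = 1 iff the video has a segment in cluster k
# (one segment per "video" here for brevity)
Mm = cluster.max() + 1
c = np.zeros((N, Mm), dtype=int)
c[np.arange(N), cluster] = 1
print(Mm, c.sum(axis=1).min())
```

The real pipeline clusters pose sequences pooled across all training videos and may use FINCH's further partition levels; this sketch only conveys the parameter-free, neighbour-driven flavour of the algorithm.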
2. Object & Scene Concepts
- GPT-4o is queried to retrieve objects and scenes associated with each action class
- Pseudo-labels are automatically generated via a vision-language dual encoder (InternVid)
- Object pseudo-labels: \(\tilde{c}_i^o = E_T(\mathcal{O}) E_V(V_i)\)
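The pseudo-labeling formula is a similarity between text and video embeddings. Below is a minimal sketch where random unit vectors stand in for the InternVid text encoder \(E_T\) and video encoder \(E_V\) (both hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def l2norm(x):
    # Normalize embeddings so the dot product is a cosine similarity
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

d = 512
# Stand-ins for E_T(O): embeddings of 5 candidate object names
E_T_objects = l2norm(rng.standard_normal((5, d)))
# Stand-in for E_V(V_i): embedding of one video
E_V_video = l2norm(rng.standard_normal(d))

# Soft object pseudo-labels: c_tilde_i^o = E_T(O) E_V(V_i)
c_tilde = E_T_objects @ E_V_video
print(c_tilde.shape)
```

With normalized embeddings the scores lie in \([-1, 1]\), giving soft pseudo-labels without any manual annotation.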
3. Concept Bottleneck Architecture
- The pretrained video backbone (VideoMAE) is frozen; only the concept layer and classifier are trained
- The concept layer projects video features into concept space to obtain activations \(z = [z_m; z_o; z_s]\)
- The classifier predicts actions based on concept activations
Loss & Training¶
Training proceeds in two stages:
Stage 1: Concept layer training
- Motion dynamics concepts: binary cross-entropy loss (motion labels are multi-label)

\[\mathcal{L}_m = -\frac{1}{M_m}\sum_{k=1}^{M_m}\big[c_k^m\log\sigma_k(z_m) + (1-c_k^m)\log(1-\sigma_k(z_m))\big]\]

- Object/scene concepts: cosine cubed loss, emphasizing directional alignment
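The multi-label BCE term above can be written directly; the logits and targets below are illustrative:

```python
import numpy as np

def motion_bce(z_m, c_m, eps=1e-7):
    """Multi-label BCE over motion-concept logits z_m and binary targets c_m."""
    p = 1.0 / (1.0 + np.exp(-z_m))   # sigma_k(z_m), one sigmoid per concept
    p = np.clip(p, eps, 1 - eps)     # numerical safety for the logs
    return -np.mean(c_m * np.log(p) + (1 - c_m) * np.log(1 - p))

z_m = np.array([4.0, -4.0, 0.0])   # confident-on, confident-off, uncertain
c_m = np.array([1.0, 0.0, 1.0])    # binary motion-concept labels
loss = motion_bce(z_m, c_m)
print(round(loss, 4))
```

Each concept is treated as an independent binary prediction, which is why BCE (rather than softmax cross-entropy) fits the multi-label motion annotations.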
Stage 2: Classifier training
- The concept layer is frozen; the final linear classifier is trained with cross-entropy loss and an elastic-net sparsity regularizer

\[\mathcal{L}_{cls} = -\frac{1}{K}\sum_k y_k\log\hat{y}_k + \lambda\Big[(1-\alpha)\frac{1}{2}\|W_A\|_F^2 + \alpha\|W_A\|_{1,1}\Big]\]

- The L1 term promotes weight sparsity, enhancing interpretability
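The elastic-net penalty on the classifier weights is a one-liner; `lam` and `alpha` below are illustrative hyperparameters, not the paper's values:

```python
import numpy as np

def sparsity_penalty(W_A, lam=1e-3, alpha=0.99):
    """Elastic-net penalty: lam * [(1-alpha)/2 * ||W||_F^2 + alpha * ||W||_{1,1}]."""
    frob_sq = np.sum(W_A ** 2)        # squared Frobenius norm
    l1 = np.sum(np.abs(W_A))          # entrywise L1 norm
    return lam * ((1 - alpha) * 0.5 * frob_sq + alpha * l1)

W_A = np.array([[1.0, -2.0],
                [0.0,  3.0]])
pen = sparsity_penalty(W_A, lam=1.0, alpha=0.5)
print(pen)
```

A high `alpha` pushes most concept-to-class weights to exactly zero, so each action class ends up explained by a small, readable set of concepts.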
Key Experimental Results¶
Main Results¶
Table 1: Video Action Recognition Performance (Top-1 Accuracy %)
| Method | KTH | Penn Action | HAA-100 | UCF-101 |
|---|---|---|---|---|
| Non-explainable baseline | 89.7 | 97.8 | 73.5 | 88.4 |
| CBM + UCF-101 attributes | - | - | - | 86.8 |
| LF-CBM + entangled language concepts | 87.4 | 96.3 | 66.5 | 85.5 |
| LF-CBM + disentangled language concepts | 89.9 | 97.7 | 65.3 | 83.7 |
| DANCE | 91.1 | 98.1 | 70.7 | 87.5 |
Key findings:
- DANCE surpasses the non-explainable baseline on KTH and Penn Action (+1.4 and +0.3)
- Only marginal drops on HAA-100 and UCF-101 (−2.8 and −0.9)
- DANCE consistently outperforms language-concept-based CBM variants across all datasets
User Study Results (Figure 6)
| Compared Method | DANCE Better | Comparable | Other Better |
|---|---|---|---|
| vs GPT-4o concept CBM | >70% | ~20% | <10% |
| vs VTCD (saliency method) | >70% | ~20% | <10% |
| vs Expert-defined concepts | >70% | ~15% | <15% |
Motion dynamics concept interpretability score: Ours 4.3/5, language-based method 2.3/5, expert-defined concepts 3.4/5.
Ablation Study¶
Cross-domain model editing experiment (Figure 10)
Under severe domain shift (UCF-101 → UCF-101-SCUBA):
- Adjusting concept weights for 3 classes improves accuracy from 77.7% to 82.0% (+4.3 points)
- No retraining required
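Training-free editing of this kind amounts to modifying rows of the frozen classifier's weight matrix. The sketch below zeroes the connections from hypothetical scene-concept slots to a few hypothetical class indices; all indices and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

K, M = 10, 36                         # classes, total concepts
W_A = rng.standard_normal((K, M))     # stand-in trained classifier weights

# Training-free model editing (sketch): for a shifted domain where scene
# concepts become misleading, zero the weights connecting selected classes
# to the scene-concept block. No gradients or retraining involved.
edited_classes = [2, 5, 7]            # hypothetical class indices
scene_concepts = np.arange(28, 36)    # hypothetical scene-concept slots
W_edit = W_A.copy()
W_edit[np.ix_(edited_classes, scene_concepts)] = 0.0

# Edited classes ignore scene concepts; untouched classes are unchanged
print(np.abs(W_edit[2, 28:36]).sum(), np.allclose(W_edit[0], W_A[0]))
```

Because the concept layer stays fixed, the edit is local, reversible, and interpretable: the modified weights state exactly which concept-class dependencies were removed.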
Sample-level intervention (Figure 9)
- Deactivating irrelevant scene concepts (e.g., "Table Tennis Club") corrects erroneous predictions
- Demonstrates DANCE's support for fine-grained, transparent prediction control
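A sample-level intervention operates on the concept activations of a single video rather than on the weights. The sketch below zeroes one spurious concept activation and re-predicts; the concept index and all data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

M, K = 36, 10
z = rng.standard_normal(M)            # concept activations for one video
W_A = rng.standard_normal((K, M))     # stand-in trained classifier weights

# Sample-level intervention (sketch): deactivate one spurious scene concept
# (e.g. the "Table Tennis Club" slot) by zeroing its activation, then re-predict
spurious = 30                          # hypothetical concept index
z_edit = z.copy()
z_edit[spurious] = 0.0

before = int(np.argmax(W_A @ z))
after = int(np.argmax(W_A @ z_edit))
print(before, after)
```

Since the classifier is linear in `z`, the effect of the intervention is exactly the removal of that concept's weighted contribution, which is what makes the correction transparent.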
Temporal direction sensitivity check (Figure 7)
- A forward video is predicted as "Bowing FullBody"; the reversed video is predicted as "Burpee"
- Validates that the model genuinely relies on motion dynamics concepts rather than spatial context alone
Key Findings¶
- Cleaner concepts yield better performance: replacing language descriptions of motion with pose sequences consistently outperforms language-concept methods across all datasets
- Interpretability and accuracy need not conflict: DANCE even improves performance on KTH and Penn Action
- Motion dynamics concepts are the most intuitive: 89.7% of user study participants rated them 4 or 5 out of 5
- Supports training-free model debugging: concept weight editing recovers performance under domain shift without retraining
Highlights & Insights¶
- Representing motion concepts via pose sequences is the key innovation: this circumvents the tacit knowledge problem of motion — rather than describing motion verbally, the approach directly visualizes pose sequences
- Fully automated concept discovery: motion concepts are discovered through clustering; object/scene concepts are extracted via LLM, requiring no manual annotation
- Ante-hoc explainability design: explanations are not post-hoc rationalizations — the model itself makes predictions through concepts, guaranteeing faithfulness
- Practical model debugging capability: the editability of concept weights makes model debugging and domain adaptation straightforward
Limitations & Future Work¶
- Relies on the quality of the 2D pose estimator; inaccurate estimates degrade motion concept quality
- Applicable only to human-centric action recognition; not suitable for non-human actions (e.g., natural phenomena)
- The linear concept layer may limit the modeling of complex inter-concept interactions
- The number of concepts (especially motion concepts) depends on clustering hyperparameter selection
- A ~0.9% performance gap remains on UCF-101; scalability to large-scale datasets requires further investigation
Related Work & Insights¶
- Concept Bottleneck Models (CBM) [Koh et al., 2020]: foundation for the concept layer design in this work
- Label-Free CBM [Oikarinen et al., 2023]: automated concept discovery via LLM; DANCE's object/scene concept discovery builds upon this
- VTCD [Kowal et al., 2024]: optimization-based video concept discovery, requiring additional computational overhead
- Saliency Tubes [Stergiou et al., 2019]: 3D saliency method, but produces entangled explanations
Rating¶
- Novelty: ★★★★★ — The definition and discovery of motion dynamics concepts is a pioneering contribution to video XAI
- Technical Depth: ★★★★☆ — The concept bottleneck framework is well-established; innovation lies primarily in concept definition and the discovery pipeline
- Experimental Thoroughness: ★★★★★ — Comprehensive evaluation across 4 datasets, user studies, ablations, and model editing experiments
- Writing Quality: ★★★★★ — Polished figures, clear narrative structure; a well-deserved Spotlight paper
- Value: ★★★★☆ — Model debugging and editing capabilities have direct practical applicability