# Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Conference: CVPR 2026 · arXiv: 2603.03827 · Code: GitHub · Area: Dialogue Systems · Keywords: Multimodal Intent Recognition, Hierarchical Semantic Representation, Self-Evolutionary Reasoning, Concept Clustering, CoT

## TL;DR
This paper proposes HIER, which combines hierarchical semantic representation (a three-level hierarchy of tokens → concepts → relations) with a self-evolutionary reasoning mechanism driven by MLLM feedback, consistently outperforming SOTA methods and leading MLLMs by 1–3% on three multimodal intent recognition benchmarks.
## Background & Motivation

- Importance of Multimodal Intent Recognition: Inferring human intent from multimodal signals (text + video + audio) is a core task in human-computer interaction, dialogue systems, and intelligent transportation.
- Neglect of Hierarchical Semantics in Prior Work: Most existing methods focus on fine-grained multimodal cue fusion while overlooking the inherently hierarchical nature of semantic information, which limits coherent and reliable reasoning.
- Limitations of Static Reasoning Pipelines: Existing methods rely on fixed reasoning workflows and lack self-evolutionary refinement capabilities, making it difficult to dynamically adapt in complex scenarios.
- Underutilization of MLLM Reasoning Potential: Although MLLMs possess strong reasoning capabilities, they still struggle with complex multimodal semantics in the absence of fine-grained hierarchical reasoning paths.
- Inspiration from Human Cognition: Humans first establish situational awareness, then identify salient semantic cues, and finally perform integrative judgment through relational reasoning and iterative self-refinement.
- Preliminary Attempt by LGSRR: Leveraging LLM reasoning to assist intent understanding has shown promise, but the reasoning process remains shallow and dependent on specific semantic concepts.
## Method

### Overall Architecture
HIER consists of three steps: (1) Multimodal Concept Clustering, which clusters tokens into mid-level semantic concepts; (2) Multimodal Relation Selection, which employs an information bottleneck (IB) network with Jensen–Shannon (JS) divergence to select highly informative inter-concept relations; and (3) Evolutionary Multimodal Reasoning, which performs hierarchical reasoning via structured CoT combined with a self-evolutionary mechanism.
### Key Designs

#### Multimodal Concept Clustering
A Qwen2-VL encoder extracts text tokens \(T\) and visual tokens \(V\), which are concatenated into a unified sequence \(Z\). Spherical K-Means++ (with cosine similarity) is applied for soft clustering. A label-guided strategy uses intent label embeddings as semantic anchors: each centroid is updated as a convex combination of the current centroid and a label anchor, weighted by their cosine similarity.
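To make the clustering step concrete, here is a minimal Python sketch. It assumes L2-normalized features so that Euclidean k-means++ approximates the spherical variant, relies on scikit-learn's `KMeans`, and blends each centroid with its most similar label anchor using an assumed weight `alpha`; the library choice, `alpha`, and the nearest-anchor rule are illustrative assumptions rather than details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_guided_concepts(tokens, label_embs, k=8, alpha=0.5, seed=0):
    """Cluster token features into k mid-level concepts and nudge each
    centroid toward an intent-label anchor (illustrative sketch)."""
    # L2-normalize so Euclidean k-means approximates spherical (cosine) k-means.
    X = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    A = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)

    # k-means++ initialization, echoing the paper's Spherical K-Means++.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed).fit(X)
    centroids = km.cluster_centers_
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Label guidance (assumed form): convex combination of each centroid with
    # its most similar label anchor, weighted by their cosine similarity.
    sims = centroids @ A.T                                  # (k, L) cosine similarities
    nearest = sims.argmax(axis=1)
    w = alpha * np.clip(sims[np.arange(k), nearest], 0.0, 1.0)
    guided = (1.0 - w)[:, None] * centroids + w[:, None] * A[nearest]
    guided /= np.linalg.norm(guided, axis=1, keepdims=True)

    # Soft assignments: softmax over cosine similarity to the guided centroids.
    soft = np.exp(X @ guided.T)
    soft /= soft.sum(axis=1, keepdims=True)
    return guided, soft
```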
#### Multimodal Relation Selection
For all concept pairs \((c_i, c_j)\), relations are encoded via an information bottleneck network as \(r_{ij} = \text{MLP}(\text{ReLU}([c_i; c_j]))\). JS divergence is used to quantify the semantic novelty provided by each relation — high-divergence relations capture complementary or emergent semantics beyond individual concepts. The top-\(k\) high-divergence relations are retained.
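A PyTorch sketch of how the relation encoder and JS-based filtering could fit together. Comparing intent-label distributions predicted from a relation against those of its two constituent concepts, the shared `intent_head`, and the hidden width are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class RelationSelector(nn.Module):
    """Encode all concept pairs and retain the top-k most novel relations."""
    def __init__(self, dim, num_intents, hidden=512):
        super().__init__()
        # r_ij = MLP(ReLU([c_i; c_j])), following the formula in the text.
        self.relation_mlp = nn.Sequential(nn.ReLU(), nn.Linear(2 * dim, hidden),
                                          nn.ReLU(), nn.Linear(hidden, dim))
        self.intent_head = nn.Linear(dim, num_intents)  # shared intent classifier (assumed)

    def forward(self, concepts, top_k=4):
        k = concepts.size(0)
        i, j = torch.triu_indices(k, k, offset=1)        # all unordered concept pairs
        pairs = torch.cat([concepts[i], concepts[j]], dim=-1)
        relations = self.relation_mlp(pairs)             # (num_pairs, dim)

        # Novelty: JS divergence between the relation's intent distribution and
        # the average of its constituent concepts' distributions (assumed form).
        p_rel = F.softmax(self.intent_head(relations), dim=-1)
        p_con = 0.5 * (F.softmax(self.intent_head(concepts[i]), dim=-1)
                       + F.softmax(self.intent_head(concepts[j]), dim=-1))
        novelty = js_divergence(p_rel, p_con)

        keep = novelty.topk(min(top_k, relations.size(0))).indices
        return relations[keep], novelty[keep]
```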
#### Evolutionary Multimodal Reasoning
A structured CoT with three stages is employed: CoT-1 (contextual understanding, operating on token-level input) → CoT-2 (concept analysis, operating on mid-level concepts) → CoT-3 (relational reasoning, operating on high-level relations). In the latter two stages, the model is explicitly prompted to assess the utility of each concept and relation.
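The three reasoning stages could be organized roughly as in the sketch below; the prompt wording, stage names, and chat-message format are hypothetical placeholders rather than the paper's actual prompts.

```python
# Illustrative three-stage CoT skeleton; all prompt text is hypothetical.
COT_STAGES = [
    # CoT-1: contextual understanding over the token-level input
    ("contextual_understanding",
     "Given the utterance and the video/audio description below, summarize the "
     "overall situation and speaker context.\n{token_level_input}"),
    # CoT-2: concept analysis over the clustered mid-level concepts
    ("concept_analysis",
     "Candidate semantic concepts extracted from the input:\n{concepts}\n"
     "For each concept, state whether it helps judge the speaker's intent and why."),
    # CoT-3: relational reasoning over the selected high-level relations
    ("relational_reasoning",
     "Selected relations between the concepts:\n{relations}\n"
     "Assess the utility of each relation, then combine the evidence and output "
     "the most likely intent label."),
]

def build_cot_messages(token_level_input, concepts, relations):
    """Fill the stage templates and return chat-style messages for the MLLM."""
    fields = {"token_level_input": token_level_input,
              "concepts": concepts, "relations": relations}
    return [{"role": "user", "content": template.format(**fields)}
            for _, template in COT_STAGES]
```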
#### Self-Evolutionary Mechanism
Concept and relation features are projected into vocabulary logits via a shared generation head. Normalized confidence scores for "Yes/No" tokens are extracted from a reflection prompt and used to dynamically modulate features: \(\text{Feature}' = \text{Score} \cdot \text{Feature}\).
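A minimal PyTorch sketch of the feature modulation, assuming the reflection step reduces to projecting each concept/relation feature through the shared generation head and reading off the "Yes"/"No" logits; in the paper the scores come from a reflection prompt, which this simplification glosses over.

```python
import torch.nn.functional as F

def self_evolve(features, lm_head, yes_id, no_id):
    """Scale each concept/relation feature by its normalized 'Yes' confidence.

    features: (n, d) concept or relation features
    lm_head:  the MLLM's shared generation head mapping d -> vocabulary size
    yes_id / no_id: tokenizer ids of the 'Yes' / 'No' tokens (model-specific)
    """
    logits = lm_head(features)                  # (n, vocab) vocabulary logits
    yes_no = logits[:, [yes_id, no_id]]         # keep only the two reflection tokens
    score = F.softmax(yes_no, dim=-1)[:, :1]    # normalized 'Yes' confidence, shape (n, 1)
    return score * features                     # Feature' = Score * Feature
```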
### Loss & Training
Training combines two objectives: \(\mathcal{L}_{\text{task}}\), the autoregressive language modeling loss, and \(\mathcal{L}_{\text{relation}}\), a cross-entropy loss for intent classification over the concept and relation features.
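A plausible form of the overall objective is a weighted sum of the two terms; the combination weight \(\lambda\) is an assumption of this summary, not a value reported in the paper:

\[
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \, \mathcal{L}_{\text{relation}}
\]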
## Key Experimental Results

### Main Results: Comparison on Three Benchmarks
| Method | MIntRec ACC (%) | MIntRec F1 (%) | MIntRec2.0 ACC (%) | MELD-DA ACC (%) |
|---|---|---|---|---|
| MAG-BERT | 72.40 | 68.29 | 60.38 | 61.08 |
| MulT | 72.31 | 68.97 | 60.66 | 59.99 |
| TCL-MAP | 73.17 | 68.92 | 58.24 | 61.63 |
| SDIF-DA | 71.64 | 68.19 | - | - |
| HIER (Ours) | 74.5+ | 71.0+ | 62.5+ | 63.0+ |
### Ablation Study
| Component | Contribution |
|---|---|
| Concept Clustering | Provides mid-level semantic abstraction |
| Label Guidance | Aligns clustering with intent semantics |
| Relation Selection | Captures higher-order interaction patterns |
| JS Divergence Filtering | Removes redundant relations |
| Self-Evolutionary Mechanism | Dynamically refines features |
| Structured CoT | Deepens hierarchical reasoning |
### Key Findings
- HIER consistently outperforms SOTA methods across all three benchmarks and surpasses direct use of MLLMs (e.g., Qwen2-VL).
- The self-evolutionary mechanism effectively filters out uninformative concepts and relations, improving reasoning robustness.
- The method generalizes to different backbone architectures beyond Qwen2-VL.
- Hierarchical representation yields the greatest benefit for complex, multi-class intent recognition.
## Highlights & Insights
- Convincing three-level hierarchical design: The progressive abstraction from tokens → concepts → relations naturally mirrors human cognitive processes.
- The label-guided clustering strategy elegantly bridges unsupervised clustering and task objectives.
- The use of JS divergence for relation selection is theoretically well-motivated — high divergence indicates that a relation introduces new information.
- The self-evolutionary mechanism leverages the MLLM's generation head for feature evaluation without requiring additional annotations.
- This is the first work to establish a multi-level progressive reasoning paradigm for multimodal intent recognition.
## Limitations & Future Work
- The number of concepts \(k\) and the relation retention ratio require careful tuning.
- Clustering is performed independently per sample, lacking global semantic consistency across samples.
- The binary Yes/No evaluation in the self-evolutionary mechanism is coarse and may miss subtle distinctions.
- The overall computational cost is substantial, involving concept clustering, relation modeling, and MLLM inference.
## Related Work & Insights
- HIER shares the motivation of LGSRR in leveraging LLM reasoning to enhance intent understanding, but employs deeper and more structured reasoning.
- InMu-Net addresses noisy non-verbal cues; HIER implicitly handles a similar issue through concept clustering.
- The self-evolutionary mechanism intersects with self-alignment approaches such as RLAIF-V and SENA, but operates at the feature level rather than the sample level.
- The framework combining hierarchical representation and self-evolution is generalizable to tasks such as sentiment analysis and dialogue understanding.
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐