Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition¶

Conference: CVPR 2026
arXiv: 2603.03827
Code: GitHub
Area: Dialogue Systems
Keywords: Multimodal Intent Recognition, Hierarchical Semantic Representation, Self-Evolutionary Reasoning, Concept Clustering, CoT

TL;DR¶

This paper proposes HIER, which combines hierarchical semantic representation (a three-level hierarchy of tokens → concepts → relations) with a self-evolutionary reasoning mechanism driven by MLLM feedback. On three multimodal intent recognition benchmarks, it consistently outperforms SOTA methods and leading MLLMs by 1–3%.

Background & Motivation¶

Importance of Multimodal Intent Recognition: Inferring human intent from multimodal signals (text + video + audio) is a core task in human-computer interaction, dialogue systems, and intelligent transportation.

Neglect of Hierarchical Semantics in Prior Work: Most existing methods focus on fine-grained multimodal cue fusion while overlooking the inherently hierarchical nature of semantic information, which limits coherent and reliable reasoning.

Limitations of Static Reasoning Pipelines: Existing methods rely on fixed reasoning workflows and lack self-evolutionary refinement capabilities, making it difficult to dynamically adapt in complex scenarios.

Underutilization of MLLM Reasoning Potential: Although MLLMs possess strong reasoning capabilities, they still struggle with complex multimodal semantics in the absence of fine-grained hierarchical reasoning paths.

Inspiration from Human Cognition: Humans first establish situational awareness, then identify salient semantic cues, and finally perform integrative judgment through relational reasoning and iterative self-refinement.

Preliminary Attempt by LGSRR: Leveraging LLM reasoning to assist intent understanding has shown promise, but the reasoning process remains shallow and dependent on specific semantic concepts.

Method¶

Overall Architecture¶

HIER consists of three steps: (1) Multimodal Concept Clustering, which clusters tokens into mid-level semantic concepts; (2) Multimodal Relation Selection, which uses an information bottleneck (IB) network with Jensen-Shannon (JS) divergence to select highly informative inter-concept relations; (3) Evolutionary Multimodal Reasoning, which performs hierarchical reasoning via a structured chain of thought (CoT) combined with a self-evolutionary mechanism.

Key Designs¶

Multimodal Concept Clustering¶

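A Qwen2-VL encoder extracts text tokens \(T\) and visual tokens \(V\), which are concatenated into a unified sequence \(Z\). Spherical K-Means++ (with cosine similarity) is then applied for soft clustering. A label-guided strategy uses the intent label embeddings \(y_i\) (\(i = 1, \dots, L\)) as semantic anchors: each current centroid \(c_m^{(u)}\) is replaced by a convex combination, with mixing coefficient \(\alpha \in [0, 1]\), of itself and a sum of label embeddings weighted by their cosine similarity to that centroid (the superscript \((u)\) presumably indexes the clustering iteration):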

\[\tilde{c}_m^{(u)} = \alpha \cdot c_m^{(u)} + (1-\alpha) \cdot \sum_{i=1}^L \text{Weight}_{i,m}^{(u)} y_i\]
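The centroid update above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the softmax form of \(\text{Weight}_{i,m}\), the soft-assignment temperature, the re-normalization onto the unit sphere, and all function names are assumptions, and the K-Means++ initialization is omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax_rows(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_assign(tokens, centroids, temperature=0.1):
    """Soft cluster memberships from cosine similarity (assumed form)."""
    sim = l2_normalize(tokens) @ l2_normalize(centroids).T  # (N, M)
    return softmax_rows(sim / temperature)

def label_guided_update(centroids, label_emb, alpha=0.5):
    """c~_m = alpha * c_m + (1 - alpha) * sum_i Weight[i, m] * y_i."""
    c = l2_normalize(centroids)      # (M, d)
    y = l2_normalize(label_emb)      # (L, d)
    # Assumption: Weight is a softmax over labels of the cosine similarities.
    weight = softmax_rows(c @ y.T)   # (M, L), each row sums to 1
    mixed = alpha * c + (1.0 - alpha) * (weight @ y)
    return l2_normalize(mixed)       # project back onto the unit sphere

def cluster_step(tokens, centroids, label_emb, alpha=0.5):
    """One soft spherical k-means update followed by label guidance."""
    assign = soft_assign(tokens, centroids)                        # (N, M)
    new_centroids = l2_normalize(assign.T @ l2_normalize(tokens))  # (M, d)
    return label_guided_update(new_centroids, label_emb, alpha), assign

# Toy usage with random stand-ins for the token sequence Z and label embeddings.
rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 16))        # 32 tokens, 16-dim features
labels = rng.normal(size=(5, 16))    # 5 intent labels
centroids = Z[:4].copy()             # naive init; K-Means++ seeding omitted
for _ in range(3):
    centroids, memberships = cluster_step(Z, centroids, labels, alpha=0.7)
```

Setting `alpha=1.0` recovers plain spherical clustering, while smaller values pull each centroid toward the label embeddings it already resembles.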

Multimodal Relation Selection¶

For all concept pairs \((c_i, c_j)\), relations are encoded via an information bottleneck network as \(r_{ij} = \text{MLP}(\text{ReLU}([c_i; c_j]))\). JS divergence is used to quantify the semantic novelty provided by each relation — high-divergence relations capture complementary or emergent semantics beyond individual concepts. The top-\(k\) high-divergence relations are retained.
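A rough sketch of the relation scoring and top-\(k\) selection is given below. The summary does not say which distributions the JS divergence compares, so this example assumes each feature vector is turned into a distribution with a softmax and that a relation's novelty is its average divergence from its two source concepts. The two-layer MLP with random weights, the use of unordered pairs, and all function names are likewise placeholders.

```python
import numpy as np
from itertools import combinations

def softmax(x):
    """Turn a feature vector into a probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)))
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)

def encode_relation(ci, cj, w1, w2):
    """r_ij = MLP(ReLU([c_i; c_j])) with a hypothetical two-layer MLP."""
    h = np.maximum(np.concatenate([ci, cj]), 0.0)
    return w2 @ np.maximum(w1 @ h, 0.0)

def select_relations(concepts, w1, w2, k):
    """Return the k relations that diverge most from their source concepts."""
    scored = []
    for i, j in combinations(range(len(concepts)), 2):
        r = encode_relation(concepts[i], concepts[j], w1, w2)
        pr = softmax(r)
        # Assumed novelty score: mean JS divergence to both source concepts.
        novelty = 0.5 * (js_divergence(pr, softmax(concepts[i]))
                         + js_divergence(pr, softmax(concepts[j])))
        scored.append((float(novelty), (i, j), r))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy usage: 4 concepts of dimension 6, hidden width 10, keep the top 3.
rng = np.random.default_rng(0)
d, hidden = 6, 10
concepts = rng.normal(size=(4, d))
w1 = rng.normal(size=(hidden, 2 * d))
w2 = rng.normal(size=(d, hidden))
top_relations = select_relations(concepts, w1, w2, k=3)
```

Sorting on the score alone (via `key`) avoids comparing the NumPy arrays stored in each tuple.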

Evolutionary Multimodal Reasoning¶

A structured CoT with three stages is employed: CoT-1 (contextual understanding, operating on token-level input) → CoT-2 (concept analysis, operating on mid-level concepts) → CoT-3 (relational reasoning, operating on high-level relations). In the latter two stages, the model is explicitly prompted to assess the utility of each concept and relation.
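To make the three-stage structure concrete, the following sketch builds one prompt per stage. The prompt wording, the `<concept_i>` and `<relation_i>` placeholder tokens (standing in for wherever the concept and relation features are injected), and the function name are all hypothetical; the summary does not give the paper's actual prompts.

```python
def build_cot_prompts(utterance, num_concepts, num_relations):
    """Return the three stage prompts in order; all wording is hypothetical."""
    concepts = ", ".join(f"<concept_{i}>" for i in range(num_concepts))
    relations = ", ".join(f"<relation_{i}>" for i in range(num_relations))
    cot1 = (
        "Step 1 (contextual understanding): Describe the scene and what the "
        f'speaker is doing. Utterance: "{utterance}"'
    )
    cot2 = (
        f"Step 2 (concept analysis): Given the concepts {concepts}, state for "
        "each one whether it is useful for inferring the speaker's intent."
    )
    cot3 = (
        f"Step 3 (relational reasoning): Given the relations {relations}, state "
        "for each one whether it is useful, then name the most likely intent."
    )
    return [cot1, cot2, cot3]

prompts = build_cot_prompts("Can you pass me that?", num_concepts=3, num_relations=2)
for prompt in prompts:
    print(prompt)
```

Keeping the stages as separate prompts mirrors the token → concept → relation ordering, with the utility question appearing only in the last two stages.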

Self-Evolutionary Mechanism¶

Concept and relation features are projected into vocabulary logits via a shared generation head. In response to a reflection prompt, the logits of the "Yes" and "No" tokens are normalized into a confidence score, which then dynamically rescales each feature: \(\text{Feature}' = \text{Score} \cdot \text{Feature}\).
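The gating step can be illustrated with a small NumPy sketch. It assumes the score is the "Yes" probability under a softmax restricted to the "Yes" and "No" logits, which is one plausible reading of "normalized confidence". The reflection prompt context is omitted, and the head matrix, token ids, and function names are placeholders.

```python
import numpy as np

def yes_confidence(feature, lm_head, yes_id, no_id):
    """Probability of "Yes" under a softmax over the Yes/No logits only."""
    logits = lm_head @ feature                  # (vocab_size,)
    margin = logits[yes_id] - logits[no_id]
    return 1.0 / (1.0 + np.exp(-margin))        # equals the two-way softmax

def modulate(features, lm_head, yes_id, no_id):
    """Feature' = Score * Feature, applied row by row."""
    scores = np.array([yes_confidence(f, lm_head, yes_id, no_id)
                       for f in features])
    return scores[:, None] * features, scores

# Toy usage with a random stand-in for the shared generation head.
rng = np.random.default_rng(0)
head = rng.normal(size=(20, 8))       # vocabulary of 20 tokens, 8-dim features
feats = rng.normal(size=(5, 8))       # 5 concept or relation features
YES_ID, NO_ID = 3, 7                  # hypothetical token ids
gated, scores = modulate(feats, head, YES_ID, NO_ID)
```

Because each score lies strictly between 0 and 1, features judged unhelpful are attenuated rather than removed outright.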

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{\text{task}} + \beta \mathcal{L}_{\text{relation}}\]

\(\mathcal{L}_{\text{task}}\) is the autoregressive language modeling loss, \(\mathcal{L}_{\text{relation}}\) is the cross-entropy loss for intent classification over concepts and relations, and \(\beta\) is a coefficient that weights the relation term against the task term.
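As a small worked example of how the two terms combine, the sketch below computes both as mean cross-entropies over toy logits. The shapes, the default value of `beta`, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target indices."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

def total_loss(token_logits, token_targets, rel_logits, intent_targets, beta=0.1):
    """L = L_task + beta * L_relation."""
    l_task = cross_entropy(token_logits, token_targets)    # next-token loss
    l_relation = cross_entropy(rel_logits, intent_targets)  # intent classification
    return l_task + beta * l_relation

# Toy usage: 4 target tokens over a 10-word vocabulary, 2 features over 5 intents.
token_logits = np.zeros((4, 10))
token_targets = np.array([1, 2, 3, 4])
rel_logits = np.zeros((2, 5))
intent_targets = np.array([0, 3])
loss = total_loss(token_logits, token_targets, rel_logits, intent_targets, beta=0.5)
```

With all-zero logits each term reduces to the log of the number of classes, which makes a convenient sanity check.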

Key Experimental Results¶

Main Results: Comparison on Three Benchmarks¶

Method	MIntRec ACC	MIntRec F1	MIntRec2.0 ACC	MELD-DA ACC
MAG-BERT	72.40	68.29	60.38	61.08
MulT	72.31	68.97	60.66	59.99
TCL-MAP	73.17	68.92	58.24	61.63
SDIF-DA	71.64	68.19	-	-
HIER (Ours)	74.5+	71.0+	62.5+	63.0+

Ablation Study¶

Component	Contribution
Concept Clustering	Provides mid-level semantic abstraction
Label Guidance	Aligns clustering with intent semantics
Relation Selection	Captures higher-order interaction patterns
JS Divergence Filtering	Removes redundant relations
Self-Evolutionary Mechanism	Dynamically refines features
Structured CoT	Deepens hierarchical reasoning

Key Findings¶

HIER consistently outperforms SOTA methods across all three benchmarks and surpasses direct use of MLLMs (e.g., Qwen2-VL).
The self-evolutionary mechanism effectively filters out uninformative concepts and relations, improving reasoning robustness.
The method generalizes to different backbone architectures beyond Qwen2-VL.
Hierarchical representation yields the greatest benefit for complex, multi-class intent recognition.

Highlights & Insights¶

Convincing three-level hierarchical design: The progressive abstraction from tokens → concepts → relations naturally mirrors human cognitive processes.
The label-guided clustering strategy elegantly bridges unsupervised clustering and task objectives.
The use of JS divergence for relation selection is theoretically well-motivated — high divergence indicates that a relation introduces new information.
The self-evolutionary mechanism leverages the MLLM's generation head for feature evaluation without requiring additional annotations.
This is the first work to establish a multi-level progressive reasoning paradigm for multimodal intent recognition.

Limitations & Future Work¶

The number of concept clusters and the relation retention ratio (the top-\(k\) cutoff in relation selection) require careful tuning.
Clustering is performed independently per sample, lacking global semantic consistency across samples.
The binary Yes/No evaluation in the self-evolutionary mechanism is coarse and may miss subtle distinctions.
The overall computational cost is substantial, involving concept clustering, relation modeling, and MLLM inference.

Related Work & Insights¶

HIER shares the motivation of LGSRR in leveraging LLM reasoning to enhance intent understanding, but employs deeper and more structured reasoning.
InMu-Net addresses noisy non-verbal cues; HIER implicitly handles a similar issue through concept clustering.
The self-evolutionary mechanism intersects with self-alignment approaches such as RLAIF-V and SENA, but operates at the feature level rather than the sample level.
The framework combining hierarchical representation and self-evolution is generalizable to tasks such as sentiment analysis and dialogue understanding.

Rating¶

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐