ECLM: Entity Level Language Model for Spoken Language Understanding with Chain of Intent¶

Conference: ACL 2025
arXiv: 2403.04481
Area: LLM / Spoken Language Understanding
Keywords: Multi-intent spoken language understanding, entity-level slot filling, Chain of Intent, large language models, dialogue systems

TL;DR¶

This paper proposes the ECLM framework to apply LLMs to multi-intent spoken language understanding. By converting token-level slot filling into an entity recognition task, it solves the sequence alignment issue. It introduces the "Chain of Intent" to achieve step-by-step multi-intent recognition, improving overall accuracy over the previous SOTA (Uni-MIS) by 3.7 points on MixATIS and 3.1 points on MixSNIPS.

Background & Motivation¶

Limitations of Prior Work¶

Spoken Language Understanding (SLU) is a core component of task-oriented dialogue systems, consisting of two subtasks: intent detection (classification) and slot filling (sequence labeling). In real-world scenarios, users often express multiple intents in a single utterance (e.g., 52% of instances in an internal Amazon dataset are multi-intent), making multi-intent SLU a significant challenge.

Directly applying LLMs to multi-intent SLU faces two core challenges:

Sequence Alignment Issue: Autoregressive generation in LLMs may produce outputs that do not align one-to-one with the original tokens, making it impossible to align slot-filling BIO tags with the original utterance.

Multi-intent Relationship Modeling: Naively fine-tuning an LLM makes it difficult to capture the fine-grained interactions between intents and slots that semantic-level tasks require.

Method¶

Overall Architecture¶

The ECLM framework consists of three core components: an Entity Slots construction/recovery mechanism, a Chain of Intent inference strategy, and supervised fine-tuning based on LLaMA 3.1-8B-Instruct.

Key Designs¶

1. Entity Slots Construction and Recovery

Core Idea: Convert traditional token-level BIO sequence labeling into an entity-level slot detection problem.

Training Stage (Entity Slots Construction): Given a token sequence $T$ and a BIO tag sequence $S$, the mapping function $c(T,S)$ is used to extract entity-slot pairs $\{(k_i, \bigcup_{j \in I_i} t_j)\}$. For example, converting {O, O, B-Weather, O, O, O, O, B-Location} into {Weather: weather, Location: destination}.
Inference Stage (Entity Slots Recovery): Convert the entity-slot structure generated by the LLM back to the BIO tag sequence via the recovery function $r(T,E)$, achieving precise alignment with the original tokens.

This design allows the LLM to focus solely on entity-level slot detection, eliminating the need to generate tags for each token and effectively avoiding alignment and generation length control issues.
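The construction and recovery steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `construct_entity_slots` and `recover_bio_tags`, the whitespace tokenization, the slot names, and the example utterance are all assumptions made for this sketch. Recovery uses exact token matching, and an entity whose words do not appear in the utterance is simply left untagged.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (slot_type, entity_text)


def construct_entity_slots(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Sketch of c(T, S): group BIO-tagged tokens into (slot_type, text) pairs."""
    entities: List[Entity] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


def recover_bio_tags(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Sketch of r(T, E): map generated entities back onto the original tokens."""
    tags = ["O"] * len(tokens)
    for slot_type, text in entities:
        words = text.split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            if span_is_free and tokens[start:start + n] == words:
                tags[start] = "B-" + slot_type
                for i in range(start + 1, start + n):
                    tags[i] = "I-" + slot_type
                break  # tag only the first untagged exact match
        # If no exact match exists, the entity is skipped (tokens stay "O").
    return tags


# Hypothetical example utterance and slot names.
tokens = "show flights from new york to boston".split()
tags = ["O", "O", "O", "B-from_city", "I-from_city", "O", "B-to_city"]

entities = construct_entity_slots(tokens, tags)
recovered = recover_bio_tags(tokens, entities)
```

Because the model only has to emit the short entity list, the tag sequence produced by recovery always has exactly one tag per original token, whatever the length of the generated text.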

2. Chain of Intent

Inspired by Chain-of-Thought, multi-intent recognition is decomposed into a step-by-step process:

Given an utterance $U$ containing $n$ intents, it is mapped to: $$U \mapsto \{(I_1: U_1), (I_2: U_2), \ldots, (I_n: U_n)\}$$

Each intent $I_i$ is paired with its corresponding sub-utterance $U_i$. For example, "Check the weather and then navigate to the office" is decomposed into:
- Intent 1 (Weather_Inquiry): "Check the weather"
- Intent 2 (Navigation): "navigate to the office"

This step-by-step decomposition enables the LLM to systematically process multiple intents instead of attempting to output all of them at once.
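One way to picture the mapping from an utterance to its intent and sub-utterance pairs is as a textual target that the model generates line by line and that is parsed back afterwards. The sketch below is illustrative only: the line format `Intent i (Label): sub-utterance` and the helper names `format_chain_of_intent` and `parse_chain_of_intent` are assumptions, not the prompt format used in the paper.

```python
import re
from typing import List, Tuple

IntentPair = Tuple[str, str]  # (intent_label, sub_utterance)

# Assumed line format: "Intent <index> (<label>): <sub-utterance>"
_LINE = re.compile(r"^Intent (\d+) \(([^)]+)\): (.+)$")


def format_chain_of_intent(pairs: List[IntentPair]) -> str:
    """Serialize (intent, sub-utterance) pairs as one numbered line per intent."""
    return "\n".join(
        f"Intent {i} ({intent}): {sub_utterance}"
        for i, (intent, sub_utterance) in enumerate(pairs, start=1)
    )


def parse_chain_of_intent(text: str) -> List[IntentPair]:
    """Parse generated text back into pairs, ignoring lines that do not match."""
    pairs: List[IntentPair] = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            pairs.append((match.group(2), match.group(3)))
    return pairs


pairs = [
    ("Weather_Inquiry", "Check the weather"),
    ("Navigation", "navigate to the office"),
]
target = format_chain_of_intent(pairs)
intents = [intent for intent, _ in parse_chain_of_intent(target)]
```

Keeping each intent on its own line next to the span that triggers it ties the predicted label to explicit evidence in the utterance.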

3. Loss & Training

LLaMA 3.1-8B-Instruct is fine-tuned with the standard cross-entropy loss for a single epoch, using a learning rate of $2 \times 10^{-5}$ and a batch size of 32. During inference, a temperature of 0.0 is used to ensure deterministic output.
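For reference, the standard token-level cross-entropy objective for supervised fine-tuning can be written as below. The notation is introduced here for illustration and is not taken from the paper: $x$ is the prompt containing the utterance, $y = (y_1, \ldots, y_m)$ is the target text (the Chain of Intent and entity-slot output), and $\theta$ denotes the model parameters. Restricting the sum to target tokens, as written, is also an assumption.

$$\mathcal{L}(\theta) = -\sum_{t=1}^{m} \log p_\theta(y_t \mid x, y_{<t})$$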

Key Experimental Results¶

Main Results¶

On two multi-intent SLU datasets, MixATIS and MixSNIPS:

Model	MixATIS Slot(F1)	MixATIS Overall(Acc)	MixSNIPS Slot(F1)	MixSNIPS Overall(Acc)
Uni-MIS (SOTA)	88.3	52.5	96.4	83.4
Vanilla SFT	68.2	47.7	88.9	65.3
ECLM	90.2	56.2	97.0	86.5

Key Comparisons:
- vs Uni-MIS: Overall Acc improves by 3.7 points (MixATIS) and 3.1 points (MixSNIPS).
- vs Vanilla SFT: Overall Acc improves by 8.5 and 21.2 points, and Slot F1 by 22.0 and 8.1 points (MixATIS and MixSNIPS, respectively).
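The gains quoted above are absolute differences between rows of the results table. The short check below recomputes them from the reported numbers; the dictionary layout and the helper name `gain` are just conveniences for this sketch.

```python
# Scores copied from the main results table.
results = {
    "Uni-MIS": {"MixATIS Slot": 88.3, "MixATIS Overall": 52.5,
                "MixSNIPS Slot": 96.4, "MixSNIPS Overall": 83.4},
    "Vanilla SFT": {"MixATIS Slot": 68.2, "MixATIS Overall": 47.7,
                    "MixSNIPS Slot": 88.9, "MixSNIPS Overall": 65.3},
    "ECLM": {"MixATIS Slot": 90.2, "MixATIS Overall": 56.2,
             "MixSNIPS Slot": 97.0, "MixSNIPS Overall": 86.5},
}


def gain(metric: str, baseline: str) -> float:
    """Absolute improvement of ECLM over a baseline, in points."""
    return round(results["ECLM"][metric] - results[baseline][metric], 1)


for baseline in ("Uni-MIS", "Vanilla SFT"):
    for metric in results["ECLM"]:
        print(f"ECLM vs {baseline}, {metric}: +{gain(metric, baseline)}")
```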

Key Findings¶

Ablation studies validate the independent value of both components:
- Removing Entity Slot: Slot F1 drops from 90.2 to 73.5 (MixATIS), showing that entity-level conversion is crucial for sequence labeling.
- Removing Chain of Intent: Overall Acc drops from 56.2 to 52.9 (MixATIS), demonstrating that the Chain of Intent significantly contributes to multi-intent recognition.
- Removing both (= Vanilla SFT): Slot F1 falls to 68.2 and Overall Acc to 47.7 (MixATIS), validating the overall design of the framework.
Greater advantage in scenarios with higher numbers of intents: In 1/2/3-intent scenarios, it improves over Uni-MIS by 1.1%, 4.3%, and 7.8% respectively.
High data efficiency: ECLM outperforms Uni-MIS trained on the full dataset using only 60% of the training data.

Highlights & Insights¶

The conversion from BIO to entities is simple and effective: It exploits the generative strengths of LLMs while avoiding the token-alignment weaknesses of generating sequence labels directly.
Chain of Intent is a natural extension of CoT in SLU: The idea of step-by-step decomposition of multi-intent utterances is intuitive and reasonable, and can be generalized to other multi-label classification tasks.
Entity Slots Recovery ensures precise alignment during inference: This design resolves the most critical engineering issue for LLMs in sequence labeling.
Outperforming SOTA with only 1 epoch of fine-tuning: This suggests that, given an appropriate task formulation, the pretrained capabilities of LLMs transfer to SLU with little additional training.

Limitations & Future Work¶

Evaluation is limited to two English datasets, MixATIS and MixSNIPS, lacking evaluation in multilingual or more complex scenarios.
Entity Slots Recovery relies on exact matching; if the LLM generates words not present in the original utterance, recovery may fail.
The number of intents is limited to 1–3, leaving scenarios with higher intent counts (e.g., 5+) unverified.
The base model is LLaMA 3.1-8B, which incurs high deployment overhead; smaller models or quantization schemes have not yet been explored.
Chain of Intent requires intent boundary annotations in the training data, limiting its applicability to data without segmentation annotations.

Related Work¶

Multi-intent SLU: Interactive modeling methods based on graph attention networks, such as AGIF, GL-GIN, CLID, and Uni-MIS.
LLMs for NLU: Attempts and limitations of directly fine-tuning LLMs for sequence labeling.
Chain-of-Thought: CoT reasoning frameworks and their variants in classification tasks.
Joint Intent Detection and Slot Filling Models: Classical joint modeling approaches such as Stack-Propagation.

Rating¶

Novelty: ⭐⭐⭐⭐ (Ingenious BIO-to-entity conversion and Chain of Intent design)
Technical Depth: ⭐⭐⭐ (Method is intuitive and simple, with limited theoretical analysis)
Experimental Thoroughness: ⭐⭐⭐⭐ (Detailed ablation and analyses across different intent counts)
Value: ⭐⭐⭐⭐ (Addresses practical problems in dialogue systems; the framework is highly extensible)