ECLM: Entity Level Language Model for Spoken Language Understanding with Chain of Intent

Conference: ACL 2025
arXiv: 2403.04481
Area: LLM / Spoken Language Understanding
Keywords: Multi-intent spoken language understanding, entity-level slot filling, Chain of Intent, large language models, dialogue systems

TL;DR

This paper proposes the ECLM framework to apply LLMs to multi-intent spoken language understanding. By converting token-level slot filling into an entity recognition task, it solves the sequence alignment issue. It introduces the "Chain of Intent" to achieve step-by-step multi-intent recognition, significantly outperforming SOTA baselines on MixATIS and MixSNIPS.

Background & Motivation

Spoken Language Understanding (SLU) is a core component of task-oriented dialogue systems and consists of two subtasks: intent detection (classification) and slot filling (sequence labeling). In real-world scenarios, users often express multiple intents in a single utterance (e.g., 52% of instances in an internal Amazon dataset are multi-intent), which makes multi-intent SLU a significant challenge.

Limitations of Prior Work

Directly applying LLMs to multi-intent SLU faces two core challenges:

Sequence Alignment Issue: Autoregressive generation in LLMs may produce outputs that do not align one-to-one with the original tokens, making it impossible to align slot-filling BIO tags with the original utterance.

Multi-intent Relationship Modeling: Simply fine-tuning LLMs directly makes it difficult to capture fine-grained intent-slot interaction relationships in semantic-level tasks.

Method

Overall Architecture

The ECLM framework consists of three core components: an Entity Slots construction/recovery mechanism, a Chain of Intent inference strategy, and supervised fine-tuning based on LLaMA 3.1-8B-Instruct.

Key Designs

1. Entity Slots Construction and Recovery

Core Idea: Convert traditional token-level BIO sequence labeling into an entity-level slot detection problem.

  • Training Stage (Entity Slots Construction): Given a token sequence \(T\) and a BIO tag sequence \(S\), the mapping function \(c(T,S)\) extracts entity-slot pairs \(\{(k_i, \bigcup_{j \in I_i} t_j)\}\), where, reading the notation, \(k_i\) is the slot label of the \(i\)-th entity and \(I_i\) is the set of indices of its tokens. For example, the tag sequence {O, O, B-Weather, O, O, O, O, B-Location} is converted into {Weather: weather, Location: destination}.

  • Inference Stage (Entity Slots Recovery): Convert the entity-slot structure generated by the LLM back to the BIO tag sequence via the recovery function \(r(T,E)\), achieving precise alignment with the original tokens.

This design allows the LLM to focus solely on entity-level slot detection, eliminating the need to generate tags for each token and effectively avoiding alignment and generation length control issues.
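
The construction and recovery steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `construct_entity_slots` and `recover_bio_tags`, the list-of-pairs representation (used instead of a dictionary so that a slot can occur more than once), the first-exact-match recovery rule, and the example utterance are all assumptions made for this sketch.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (slot label, entity text); representation assumed for this sketch


def construct_entity_slots(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Sketch of c(T, S): collapse a BIO tag sequence into (slot, text) pairs."""
    entities: List[Entity] = []
    slot, span = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity that was still open.
            if slot is not None:
                entities.append((slot, " ".join(span)))
            slot, span = tag[2:], [token]
        elif tag.startswith("I-") and slot == tag[2:]:
            # Continuation of the currently open entity.
            span.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open entity.
            if slot is not None:
                entities.append((slot, " ".join(span)))
            slot, span = None, []
    if slot is not None:
        entities.append((slot, " ".join(span)))
    return entities


def recover_bio_tags(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Sketch of r(T, E): map generated entities back onto the original tokens."""
    tags = ["O"] * len(tokens)
    for slot, text in entities:
        words = text.split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            untagged = all(t == "O" for t in tags[start:start + n])
            if untagged and tokens[start:start + n] == words:
                tags[start] = f"B-{slot}"
                for i in range(start + 1, start + n):
                    tags[i] = f"I-{slot}"
                break  # tag only the first exact match
        # If no exact match exists, the entity is skipped and its tokens stay "O".
    return tags


# Invented utterance chosen to fit the tag pattern of the example above.
tokens = "what's the weather like at my travel destination".split()
tags = ["O", "O", "B-Weather", "O", "O", "O", "O", "B-Location"]

entities = construct_entity_slots(tokens, tags)
recovered = recover_bio_tags(tokens, entities)
print(entities)
print(recovered == tags)
```

Because recovery writes tags by position in the original token list, the output always has exactly one tag per input token, regardless of how the model phrases its answer. The price is that an entity whose text does not appear verbatim in the utterance is silently dropped in this sketch.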

2. Chain of Intent

Inspired by Chain-of-Thought, multi-intent recognition is decomposed into a step-by-step process:

Given an utterance \(U\) containing \(n\) intents, it is mapped to:

\[U \mapsto \{(I_1: U_1), (I_2: U_2), \ldots, (I_n: U_n)\}\]

Each intent \(I_i\) is paired with its corresponding sub-utterance \(U_i\). For example, "Check the weather and then navigate to the office" is decomposed into:

  • Intent 1 (Weather_Inquiry): "Check the weather"
  • Intent 2 (Navigation): "navigate to the office"

This step-by-step decomposition enables the LLM to systematically process multiple intents instead of attempting to output all of them at once.
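
One way to picture the mapping is as a list of (intent, sub-utterance) pairs that is serialized into the text the model is trained to generate and parsed back at inference time. The sketch below assumes a line-per-intent format modeled on the example above; the format itself and the helper names `format_chain_of_intent` and `parse_chain_of_intent` are illustrative assumptions, not the prompt template used in the paper.

```python
import re
from typing import List, Tuple

IntentStep = Tuple[str, str]  # (intent label, sub-utterance)

# Hypothetical serialization: "Intent <i> (<label>): <sub-utterance>", one per line.
_STEP_PATTERN = re.compile(r"^Intent \d+ \((?P<intent>[^)]+)\): (?P<sub>.+)$")


def format_chain_of_intent(steps: List[IntentStep]) -> str:
    """Serialize the chain {(I_1: U_1), ..., (I_n: U_n)} as numbered lines."""
    return "\n".join(
        f"Intent {i} ({intent}): {sub}"
        for i, (intent, sub) in enumerate(steps, start=1)
    )


def parse_chain_of_intent(text: str) -> List[IntentStep]:
    """Parse generated text back into (intent, sub-utterance) pairs."""
    steps: List[IntentStep] = []
    for line in text.splitlines():
        match = _STEP_PATTERN.match(line.strip())
        if match:  # lines that do not follow the assumed format are ignored
            steps.append((match.group("intent"), match.group("sub")))
    return steps


chain = [
    ("Weather_Inquiry", "Check the weather"),
    ("Navigation", "navigate to the office"),
]
target_text = format_chain_of_intent(chain)
print(target_text)
print(parse_chain_of_intent(target_text) == chain)
```

The final intent set is then simply the labels of the parsed steps, so each predicted intent stays tied to the span of the utterance that supports it.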

3. Loss & Training

LLaMA 3.1-8B-Instruct is fine-tuned with the standard cross-entropy loss, using a learning rate of \(2 \times 10^{-5}\) and a batch size of 32, for only 1 epoch. During inference, the temperature is set to 0.0 so that the output is deterministic.
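
For reference, the reported settings and the objective can be written down in plain Python. This is only a sketch: the `FineTuneConfig` container and the helper names are hypothetical, the loss is the textbook token-level cross-entropy computed over toy logits rather than a real model, and temperature 0.0 is treated as greedy (argmax) decoding, which is the usual convention.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the summary; the container is hypothetical."""
    base_model: str = "LLaMA 3.1-8B-Instruct"
    learning_rate: float = 2e-5
    batch_size: int = 32
    num_epochs: int = 1
    inference_temperature: float = 0.0


def token_cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target token at each position."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(targets)


def greedy_token(row: List[float]) -> int:
    """Temperature 0.0 in practice: pick the highest-scoring token."""
    return max(range(len(row)), key=row.__getitem__)


config = FineTuneConfig()
uniform_loss = token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [0])
print(config)
print(uniform_loss)                    # equals log(4) for a uniform 4-way distribution
print(greedy_token([0.1, 2.3, -1.0]))  # index of the largest logit
```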

Key Experimental Results

Main Results

On two multi-intent SLU datasets, MixATIS and MixSNIPS:

| Model | MixATIS Slot (F1) | MixATIS Overall (Acc) | MixSNIPS Slot (F1) | MixSNIPS Overall (Acc) |
|---|---|---|---|---|
| Uni-MIS (SOTA) | 88.3 | 52.5 | 96.4 | 83.4 |
| Vanilla SFT | 68.2 | 47.7 | 88.9 | 65.3 |
| ECLM | 90.2 | 56.2 | 97.0 | 86.5 |

Key Comparisons:

  • vs Uni-MIS: Overall Acc improves by 3.7 points (MixATIS) and 3.1 points (MixSNIPS).
  • vs Vanilla SFT: Overall Acc improves by 8.5 points (MixATIS) and 21.2 points (MixSNIPS), and Slot F1 improves by 22.0 and 8.1 points respectively.
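
The gaps quoted in the comparisons are absolute differences between table entries, which a short script can confirm. The dictionary below copies the scores from the main results table; the `gain` helper and the key names are introduced here only for illustration.

```python
# Scores copied from the main results table above.
results = {
    "Uni-MIS":     {"mixatis_slot": 88.3, "mixatis_acc": 52.5, "mixsnips_slot": 96.4, "mixsnips_acc": 83.4},
    "Vanilla SFT": {"mixatis_slot": 68.2, "mixatis_acc": 47.7, "mixsnips_slot": 88.9, "mixsnips_acc": 65.3},
    "ECLM":        {"mixatis_slot": 90.2, "mixatis_acc": 56.2, "mixsnips_slot": 97.0, "mixsnips_acc": 86.5},
}


def gain(metric: str, baseline: str) -> float:
    """Absolute gain of ECLM over a baseline, rounded to one decimal place."""
    return round(results["ECLM"][metric] - results[baseline][metric], 1)


for baseline in ("Uni-MIS", "Vanilla SFT"):
    gains = {metric: gain(metric, baseline) for metric in results["ECLM"]}
    print(baseline, gains)
```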

Key Findings

  • Ablation studies validate the independent value of both components:
    • Removing Entity Slot: Slot F1 drops from 90.2 to 73.5 (MixATIS), showing that entity-level conversion is crucial for sequence labeling.
    • Removing Chain of Intent: Overall Acc drops from 56.2 to 52.9 (MixATIS), demonstrating that the Chain of Intent significantly contributes to multi-intent recognition.
    • Removing both (= Vanilla SFT): Performance drops significantly, validating the overall design of the framework.
  • Greater advantage in scenarios with higher numbers of intents: In 1/2/3-intent scenarios, it improves over Uni-MIS by 1.1%, 4.3%, and 7.8% respectively.
  • High data efficiency: ECLM outperforms Uni-MIS trained on the full dataset using only 60% of the training data.

Highlights & Insights

  1. The conversion from BIO to entities is highly ingenious: It perfectly exploits the generative advantages of LLMs while avoiding the inherent weaknesses of sequence labeling, making it simple and effective.
  2. Chain of Intent is a natural extension of CoT in SLU: The idea of step-by-step decomposition of multi-intent utterances is intuitive and reasonable, and can be generalized to other multi-label classification tasks.
  3. Entity Slots Recovery ensures precise alignment during inference: This design resolves the most critical engineering issue for LLMs in sequence labeling.
  4. Outperforming SOTA with only 1 epoch of fine-tuning: Overall accuracy rises by 3.7 points on MixATIS and 3.1 points on MixSNIPS over Uni-MIS, which indicates that the foundational capabilities of LLMs can be effectively activated under an appropriate framework.

Limitations & Future Work

  • Evaluation is limited to two English datasets, MixATIS and MixSNIPS, lacking evaluation in multilingual or more complex scenarios.
  • Entity Slots Recovery relies on exact matching; if the LLM generates words not present in the original utterance, recovery may fail.
  • The number of intents is limited to 1–3, leaving scenarios with higher intent counts (e.g., 5+) unverified.
  • The base model is LLaMA 3.1-8B, which incurs high deployment overhead; smaller models or quantization schemes have not yet been explored.
  • Chain of Intent requires intent boundary annotations in the training data, limiting its applicability to data without segmentation annotations.

Related Work

  • Multi-intent SLU: Interactive modeling methods based on graph attention networks, such as AGIF, GL-GIN, CLID, and Uni-MIS.
  • LLMs for NLU: Attempts and limitations of directly fine-tuning LLMs for sequence labeling.
  • Chain-of-Thought: CoT reasoning frameworks and their variants in classification tasks.
  • Joint Intent Detection and Slot Filling Models: Classical joint modeling approaches such as Stack-Propagation.

Rating

  • Novelty: ⭐⭐⭐⭐ (Ingenious BIO-to-entity conversion and Chain of Intent design)
  • Technical Depth: ⭐⭐⭐ (Method is intuitive and simple, with limited theoretical analysis)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Detailed ablation and analyses across different intent counts)
  • Value: ⭐⭐⭐⭐ (Addresses practical problems in dialogue systems; the framework is highly extensible)