EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance

Conference: ACL 2025
arXiv: 2505.16526
Code: https://github.com/linkyouhj/enstom
Area: Dialogue Systems
Keywords: Dialogue Systems, Topic Consistency, Steering Vector, Entropy Scaling, Activation Engineering

TL;DR

EnSToM is a lightweight, training-free method that scales a steering vector by a coefficient derived from layer-wise entropy. Because on-topic and distractor inputs produce different entropy distributions at internal layers of an LLM, the steering intensity can be adjusted per input, which improves topic maintenance in task-oriented dialogue systems without modifying model parameters.

Background & Motivation

Background: Small-scale large language models (sLLMs) are lightweight and efficient, which makes them suitable for deployment in resource-constrained environments. Enterprise task-oriented dialogue systems (such as banking customer service bots) require models to adhere strictly to predetermined topics and to refuse off-topic or malicious inputs.

Limitations of Prior Work: (1) sLLMs have limited capacity, making it difficult to maintain scenario consistency during long-term interactions; (2) Fine-tuning methods require large amounts of data and computing resources, making it hard to cover all scenarios; (3) Prompt engineering has limited effectiveness in complex scenarios; (4) Directly applying steering vectors can improve off-topic rejection rates but severely damages the quality of on-topic responses (on-topic accuracy drops from 0.94 to 0.70).

Key Challenge: Steering vectors improve the rejection of distractor inputs, but applying them indiscriminately to every input causes on-topic requests to be rejected as well. The open question is how to adjust steering dynamically based on the input.

Goal: Design an adaptive steering intensity adjustment mechanism that strongly steers distractor inputs while applying weak or no steering to on-topic inputs.

Key Insight: The entropy distribution at individual LLM layers differs markedly between on-topic and distractor inputs, so layer-wise entropy can serve as a signal for adjusting the steering coefficient dynamically.

Core Idea: Leverage layer-wise generation entropy in LLMs to distinguish off-topic from on-topic inputs, dynamically scaling the steering vector intensity via a sigmoid function to achieve precise topic maintenance.

Method

Overall Architecture

EnSToM consists of three components: (1) extracting steering vectors from contrastive data; (2) dynamically adjusting steering intensity based on entropy-based coefficient scaling; (3) generating responses using the scaled steering vectors. The entire process is training-free and intervenes purely at inference time.

Key Designs

Steering Vector Extraction: Construct a Steering QA Dataset $S = \{q_1, q_2, \dots\}$, where each $q_i$ pairs a prompt showing the desired behavior (refusing and redirecting back to the topic) with one showing the undesired behavior (continuing to answer the off-topic question). A forward pass is run up to a designated layer $l$, and the difference between the hidden representations of the desired ($h_p^{(l)}$) and undesired ($h_n^{(l)}$) behaviors is computed: $v_s^i = h_p^{(l)} - h_n^{(l)}$. The final steering vector is obtained by normalizing and averaging: $v = \frac{1}{k}\sum_{i=1}^{k} \text{norm}(v_s^i)$.
Layer-wise Entropy Analysis: Compute the entropy of generating the first 2 tokens at layer $l$ of the LLM: $$E^{(l)} = \mathbb{E}\left[-\sum_{i=1}^{V} p_i^{(l)} \log(p_i^{(l)} + \epsilon)\right]$$ Key finding: At Layer 16 (a semantically critical layer), the entropy of distractor inputs is lower than that of on-topic inputs (since off-topic content induces highly focused attention); at Layer 19 (a deeper layer), this relationship reverses.
Entropy-based Scaling Coefficient: A sigmoid function maps the entropy to a steering coefficient: $$C_H^{(L)} = \frac{C_{\max}}{1 + e^{-\alpha \delta (H^{(L)} - t)}}$$ where $H^{(L)}$ is the entropy measured at the entropy layer $L$, $C_{\max} = 1.5$ is the maximum coefficient, $\alpha = 5$ controls the steepness of the sigmoid, $t = 7.5$ is the threshold, and $\delta$ takes $\pm 1$ depending on the direction of the entropy distribution at that layer (for example, $\delta = -1$ makes the coefficient grow as the entropy falls below $t$, which suits a layer where distractors have lower entropy). This assigns a high coefficient (strong steering) to distractor inputs and a low coefficient (weak or no steering) to on-topic inputs.
Response Generation: During inference, 2 tokens are generated first to compute the entropy and obtain the coefficient, and then the scaled steering vector is added to the activations at the specified layer: $h'^{(l)} = h^{(l)} + C_H^{(L)} \cdot v$
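The four steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: hidden states are toy lists rather than real LLM activations, `norm` is assumed to be L2 normalization, and the helper names (`steering_vector`, `mean_entropy`, `steering_coefficient`, `apply_steering`) are hypothetical. The default hyperparameters are the ones quoted above, with $\delta = -1$ assumed for a layer where distractors have lower entropy.

```python
import math

def l2_normalize(vec):
    # Assumption: "norm" in the summary is taken to mean L2 normalization.
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length > 0 else list(vec)

def steering_vector(pos_hidden, neg_hidden):
    """Average of normalized (desired - undesired) hidden-state differences."""
    diffs = [
        l2_normalize([p - n for p, n in zip(h_p, h_n)])
        for h_p, h_n in zip(pos_hidden, neg_hidden)
    ]
    k = len(diffs)
    return [sum(d[j] for d in diffs) / k for j in range(len(diffs[0]))]

def token_entropy(probs, eps=1e-10):
    """Shannon entropy of one next-token distribution, with the epsilon guard."""
    return -sum(p * math.log(p + eps) for p in probs)

def mean_entropy(step_probs, eps=1e-10):
    """Mean entropy over the first generated tokens (two in the summary)."""
    return sum(token_entropy(p, eps) for p in step_probs) / len(step_probs)

def steering_coefficient(entropy, c_max=1.5, alpha=5.0, t=7.5, delta=-1.0):
    """Sigmoid mapping from entropy to steering coefficient."""
    return c_max / (1.0 + math.exp(-alpha * delta * (entropy - t)))

def apply_steering(hidden, v, coeff):
    """h' = h + C * v at the steering layer."""
    return [h + coeff * x for h, x in zip(hidden, v)]

# Toy contrastive pairs (hypothetical 3-dimensional hidden states).
pos = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
v = steering_vector(pos, neg)

# Hypothetical entropies on either side of the threshold t = 7.5.
c_low_entropy = steering_coefficient(7.0)   # treated as distractor-like
c_high_entropy = steering_coefficient(8.0)  # treated as on-topic-like

steered = apply_steering([1.0, 1.0, 1.0], v, c_low_entropy)
```

With $\delta = -1$, an entropy below the threshold yields a coefficient close to $C_{\max}$ and an entropy above it yields a coefficient close to zero, so the same vector `v` is applied strongly or weakly depending on the input. In a real model the probabilities fed to `mean_entropy` would come from projecting the layer-$L$ hidden state onto the vocabulary, which this sketch does not cover.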

Loss & Training

Completely Training-Free: Requires only about 100 contrastive samples to extract steering vectors.
Rejection and response options are generated by GPT-4o, with positions randomly assigned to avoid positional bias.
Evaluation uses GPT-4o as a classifier that labels each response as either rejection or response.
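The random position assignment can be illustrated with a small sketch. The two-option (A)/(B) prompt layout and the function name `build_contrastive_item` are hypothetical choices made for this example; the summary only states that the rejection and response options are placed in random positions to avoid positional bias.

```python
import random

def build_contrastive_item(question, refusal, off_topic_answer, rng):
    """Build one contrastive item with the two options in random order.

    Returns the prompt text plus the letters of the desired (refusal)
    and undesired (off-topic answer) options.
    """
    options = [("refusal", refusal), ("answer", off_topic_answer)]
    rng.shuffle(options)  # random position removes a fixed A/B bias

    lines = [f"Question: {question}"]
    letter_of = {}
    for letter, (kind, text) in zip(["A", "B"], options):
        lines.append(f"({letter}) {text}")
        letter_of[kind] = letter

    return {
        "prompt": "\n".join(lines),
        "desired": letter_of["refusal"],
        "undesired": letter_of["answer"],
    }

# Hypothetical banking-domain example.
rng = random.Random(0)
item = build_contrastive_item(
    "Can you recommend a good pizza place?",
    "I can only help with banking questions. Is there an account issue I can assist with?",
    "Sure, here are some popular pizza places nearby.",
    rng,
)
```

The desired and undesired completions for one item would then be the same prompt followed by the respective letter, giving the pair of hidden states from which a difference vector is taken.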

Key Experimental Results

Main Results (LLaMA-2-7B-Chat, CantTalkAboutThis Banking Domain)

Method	Entropy Layer L	Steering Layer	Distractor ↑	On-topic ↑	Overall ↑
Prompt Only	-	-	0.282	0.938	0.610
Vanilla Steering	-	-	0.800	0.700	0.750
EnSToM	16	15	0.810 (+0.53)	0.747 (-0.19)	0.779
EnSToM	16	16	0.709 (+0.43)	0.895 (-0.04)	0.802
EnSToM	19	16	0.749 (+0.47)	0.818 (-0.12)	0.784

Optimal configuration (L=16, Steer@16): overall score of 0.802, which is 19.2 percentage points higher than Prompt Only and 5.2 percentage points higher than Vanilla, with on-topic performance dropping by only 4.3 percentage points.
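The figures quoted above can be checked with a few lines of arithmetic. One assumption is made here: the Overall column is treated as the unweighted mean of the Distractor and On-topic scores. The summary does not state this explicitly, but the reported values are consistent with it to within rounding.

```python
# (distractor, on_topic, reported_overall) copied from the main results table.
rows = {
    "Prompt Only": (0.282, 0.938, 0.610),
    "Vanilla Steering": (0.800, 0.700, 0.750),
    "EnSToM L=16, Steer@15": (0.810, 0.747, 0.779),
    "EnSToM L=16, Steer@16": (0.709, 0.895, 0.802),
    "EnSToM L=19, Steer@16": (0.749, 0.818, 0.784),
}

def mean_score(distractor, on_topic):
    # Assumption: Overall = unweighted mean of the two accuracies.
    return (distractor + on_topic) / 2

# Largest gap between the recomputed mean and the reported Overall value.
max_gap = max(abs(mean_score(d, o) - rep) for d, o, rep in rows.values())

best = rows["EnSToM L=16, Steer@16"]
prompt_only = rows["Prompt Only"]
vanilla = rows["Vanilla Steering"]

# Differences in percentage points.
gain_vs_prompt = (best[2] - prompt_only[2]) * 100
gain_vs_vanilla = (best[2] - vanilla[2]) * 100
on_topic_drop = (prompt_only[1] - best[1]) * 100

print(f"max gap vs reported overall: {max_gap:.4f}")
print(f"gain vs Prompt Only: {gain_vs_prompt:.1f} pp")
print(f"gain vs Vanilla: {gain_vs_vanilla:.1f} pp")
print(f"on-topic drop: {on_topic_drop:.1f} pp")
```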

Cross-Architecture Generalization (Ministral-8B-Instruct)

Method	Distractor	On-topic	Overall
Prompt Only	0.25	0.98	0.62
EnSToM @ layer 18	0.63 (+0.38)	0.91 (-0.07)	0.76

Ablation Study (Impact of Threshold $t$)

Threshold $t$	Distractor	On-topic	Overall
Vanilla (Fixed)	0.80	0.70	0.75
$t = 2$	0.30	0.95	0.63
$t = 7.5$	0.76	0.84	0.80
$t = 9$	~baseline	0.72	~0.6x
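A short sweep over the threshold helps explain why $t$ matters so much. The sketch below is a qualitative illustration only: the two entropy values (7.0 and 8.0) are hypothetical stand-ins for a distractor-like and an on-topic-like input, $\delta = -1$ is assumed, and the function name `coefficient` is made up for this example. It does not reproduce the accuracies in the table.

```python
import math

def coefficient(entropy, t, c_max=1.5, alpha=5.0, delta=-1.0):
    """Sigmoid steering coefficient for a given entropy and threshold t."""
    return c_max / (1.0 + math.exp(-alpha * delta * (entropy - t)))

# Hypothetical entropies, chosen only to sit on either side of t = 7.5.
H_DISTRACTOR = 7.0
H_ON_TOPIC = 8.0

sweep = {
    t: (coefficient(H_DISTRACTOR, t), coefficient(H_ON_TOPIC, t))
    for t in (2.0, 7.5, 9.0)
}

for t, (c_d, c_o) in sweep.items():
    print(f"t={t}: distractor C={c_d:.3f}, on-topic C={c_o:.3f}")
```

Under these assumptions, a threshold far below both entropies pushes both coefficients toward zero, so almost no steering is applied to anything. A threshold far above both saturates both coefficients near $C_{\max}$, which behaves like a fixed steering coefficient. Only a threshold that falls between the two entropy values separates the inputs.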

Data Efficiency

Effective steering vectors can be extracted with as few as 10 contrastive samples: distractor accuracy is 0.74 (vs. 0.81 with 100 samples) and on-topic accuracy is 0.85 (vs. 0.75 with 100 samples), which makes the method well suited to low-resource scenarios.

Key Findings

Entropy Separation is Most Pronounced at Layer 16: Middle layers encode semantic information; distractor inputs focus on a small number of unique tokens, leading to low entropy, while on-topic inputs have scattered attention, resulting in high entropy.
Cross-Domain Consistency: Steering vectors extracted from various domains (such as banking, education, health, and insurance) are all effective, indicating that the rejection mechanism is general rather than domain-specific.
Potential for Task Generalization: In jailbreak defense tasks, the entropy distribution at Layer 33 can similarly distinguish between harmful and harmless inputs.
Analysis of Coefficient Distribution: 82.5% of distractors are assigned $C \geq 1.0$ (strong steering), and 45.8% of on-topic inputs are assigned $C < 0.5$ (weak steering), aligned with the design expectations.
Robustness of On-topic Inputs to Over-steering: Even though 40.2% of on-topic inputs are assigned $C \geq 1.0$, the accuracy still reaches 0.79.

Highlights & Insights

The core finding is highly elegant: internal layerwise entropy of LLMs naturally distinguishes on-topic and distractor inputs without requiring external classifiers.
Completely training-free inference-time intervention requires only ~100 contrastive samples, leading to exceptionally low deployment costs.
The analysis of layer-wise functional differentiation is consistent with findings in cognitive science: shallow layers capture syntax, middle layers encode semantics, and deep layers integrate context.
Dynamic coefficients offer a clear advantage over fixed coefficients in avoiding "harm to normal dialogue."

Limitations & Future Work

Manual Selection of Layers and Thresholds: The entropy extraction layer $L$ and threshold $t$ are currently determined empirically, which needs to be automated in the future.
Hard Negative Samples: Samples in the overlapping regions of entropy distribution can be misclassified, leading to incorrect steering directions.
Evaluated only on 7B/8B Models: The effectiveness on larger models (70B+) has not been verified.
Evaluation Relies on GPT-4o: Labeling responses as rejection or response relies on GPT-4o, which may introduce bias.
In-depth Evaluation Limited to Banking Domain: Cross-domain experiments only utilize steering vector transfer without comprehensive domain adaptation.

Related Work & Insights

Steering Vectors: First proposed by Turner et al. 2023 and applied to LLaMA-2 by Rimsky et al. 2024. This work adds entropy scaling to resolve the on-topic degradation issue.
Topic Maintenance: CantTalkAboutThis (Sreedhar et al. 2024) provides the dataset, and Llama Guard achieves safety guarding via instruction tuning. This work offers a much more lightweight alternative.
Utilization of LLM Internal States: DoLa (Chuang et al. 2024) improves truthfulness via layer-wise contrast, and INSIDE (Chen et al. 2024) uses internal states to detect hallucinations. This work leverages internal entropy for input classification.
Insights: Entropy signals could be extended to other scenarios (e.g., hallucination detection, uncertainty identification) and could also be combined with parameter-efficient methods such as LoRA.

Rating

Novelty: ⭐⭐⭐⭐⭐ — The idea of entropy-scaled steering vectors is highly novel, elegantly combining activation engineering with internal signals.
Experimental Thoroughness: ⭐⭐⭐⭐ — Includes multi-layer analysis, cross-architecture, cross-domain, and data efficiency experiments, though model scale is limited.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation, rigorous formulations, and logical diagrams.
Value: ⭐⭐⭐⭐ — Provides a practical, training-free solution for topic maintenance in dialogue systems, making a significant contribution to the field of activation engineering.