Old Habits Die Hard: How Conversational History Geometrically Traps LLMs¶

Conference: ICML 2026
arXiv: 2603.03308
Code: https://github.com/technion-cs-nlp/OldHabitsDieHard
Area: LLM Security / Mechanistic Interpretability / Conversational Behavior Analysis
Keywords: Conversational History, Behavioral Persistence, Markov Chains, Geometric Traps, Refusal / Sycophancy / Hallucination

TL;DR¶

The History-Echoes framework analyzes how conversational history carries over in LLMs from two angles: the state persistence of a Markov chain over dialogue turns, and the angular geometry of the latent space. The two measures have a Spearman correlation of 0.78: once a behavior (hallucination, sycophancy, or refusal) occurs, the model tends to stay in the latent space region corresponding to that state, which makes escape difficult. The refusal trap is the strongest and the hallucination trap the weakest, and the traps largely dissolve when topic consistency is broken.

Background & Motivation¶

Background: LLMs exhibit various state-dependent behaviors, both undesirable (hallucination, sycophancy) and desirable (refusal). Prior work has documented these phenomena, but there is no unified framework for how they persist and how they are represented across multi-turn dialogues. Existing studies on safety trajectories or generation difficulty analyze these phenomena in isolation, without linking "persistence probability" to "internal geometry."

Limitations of Prior Work: Neither a purely black-box analysis (outputs only) nor a purely white-box analysis (hidden states only) is sufficient. Black-box analysis does not reveal the mechanism (why a behavior persists), while white-box analysis lacks behavioral validation (whether geometric patterns actually correspond to external behavior).

Key Challenge: Explaining why "a model that has refused once is more likely to refuse again" requires proving both that the behavior persists at the output level and that there is a structural correspondence in internal geometry, with the two being strongly correlated. Otherwise, the findings are either statistical illusions or cherry-picked geometry.

Goal: (1) Quantitatively measure behavioral carryover; (2) reveal the mechanism via latent space geometry; (3) demonstrate a strong correlation between these perspectives, providing dual evidence for "behavioral persistence \(\approx\) geometric trap."

Key Insight: Dialogue states are binarized (behavior present/absent) and modeled as first-order Markov chains. In parallel, a shared orthonormal basis is built in latent space with Gram-Schmidt from the hidden states of the two classes, \(\mathcal{H}_{\phi^+}\) and \(\mathcal{H}_{\phi^-}\), and used to measure their angular separation. The study predicts a positive correlation between the two metrics (black-box persistence vs. white-box geometric angle).

Core Idea: Behavioral persistence is not an isolated output-layer phenomenon; it is a latent space "geometric trap" where two state regions are separated by large angles, and switching states requires a significant rotation that is often incomplete.

Method¶

Overall Architecture¶

History-Echoes investigates why specific behaviors (refusal, sycophancy, hallucination) tend to recur once they appear. It combines two complementary perspectives. At the black-box level, it treats the presence or absence of a behavior in each turn as a two-state sequence and quantifies its "stickiness" through the transition structure of a Markov chain. At the white-box level, it projects hidden states onto a low-dimensional basis to quantify the separation between states and the incompleteness of state transitions. Finally, it correlates the black-box and white-box metrics across multiple models and datasets to check whether they reflect the same underlying mechanism.

For experimental data, QA pairs from five datasets (TriviaQA, NaturalQA, SORRY-Bench, Do-Not-Answer, SycophancyEval) are embedded using Qwen3-Embedding. They are then either ordered by nearest neighbors to form topic-consistent dialogues (\(D_{\text{consistent}}\)) or randomly shuffled to form topic-inconsistent dialogues (\(D_{\text{inconsistent}}\)). Each dataset comprises 100 dialogues of 20 turns each.
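The two orderings can be illustrated with a minimal sketch. It assumes a greedy nearest-neighbor chain over cosine similarity, which is only one plausible reading of "sorted by nearest neighbors"; the function names and the toy 2-D vectors (standing in for Qwen3-Embedding outputs) are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def consistent_order(embeddings, start=0):
    """Greedy chain (an assumption): always move to the most similar unused item."""
    order = [start]
    remaining = set(range(len(embeddings))) - {start}
    while remaining:
        last = embeddings[order[-1]]
        nxt = max(remaining, key=lambda i: cosine(last, embeddings[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def inconsistent_order(n, seed=0):
    """Random shuffle of the same items, used for the topic-inconsistent condition."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

# Toy embeddings: items 0, 2, 4 point one way, items 1, 3 another.
toy = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9), (0.8, 0.3)]
chain = consistent_order(toy)
shuffled = inconsistent_order(len(toy))
print(chain)     # similar items end up adjacent
print(shuffled)  # same items, arbitrary order
```

In this toy case the greedy chain visits the three items of the first cluster before crossing to the second, which is the property a topic-consistent dialogue needs.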

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Dialogue Construction<br/>QA pairs → Qwen3 Embedding Sorting → Consistent/Shuffled<br/>→ 100 dialogues × 20 turns"]
    A --> B["Two-state Markov Chain + Trace (Black-box)<br/>Classify turn as φ+/φ− → Transition Matrix T → Tr(T)=1+λ₂"]
    A --> C["Gram-Schmidt Orthogonal Basis + θ_ref (White-box)<br/>Hidden states → 2D Orthogonal Basis<br/>→ Angular Separation θ_ref + Transition Incompleteness"]
    B --> D["Cross-perspective Correlation<br/>18 points (trace, θ_ref) Spearman = 0.78"]
    C --> D
    D --> E["Conclusion: Behavioral Persistence = Geometric Trap"]

Key Designs¶

1. Two-state Markov Chain + Trace: Quantifying "Stickiness" via Black-box

To quantify the intuition that "previous refusal leads to more refusal," the behavior is reduced to a scalar that does not depend on model internals. Each turn is classified as "phenomenon \(\phi\) present" or "absent" using string matching (a manual audit found a 6.5% error rate). The transition matrix \(T_{ij}=P(s_j|s_i)\) is estimated from consecutive turns, and its trace \(\text{Tr}(\mathbf{T})=P(s_{\phi^+}|s_{\phi^+})+P(s_{\phi^-}|s_{\phi^-})\) serves as the persistence measure. Because \(\mathbf{T}\) is a \(2\times 2\) stochastic matrix, its largest eigenvalue is 1 and \(\text{Tr}(\mathbf{T})=1+\lambda_2\), where \(\lambda_2\) is the second eigenvalue. \(\text{Tr}(\mathbf{T}) > 1\) (equivalently \(\lambda_2 > 0\)) therefore means each state tends to follow itself, whereas a memoryless sequence gives \(\text{Tr}(\mathbf{T}) = 1\). Larger values imply longer mixing times, meaning the behavior stays "locked in" for more turns.
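As a minimal sketch of this black-box measure, the code below estimates the transition matrix from already-labeled turn sequences and checks the trace identity. The function names and the two toy label sequences are illustrative assumptions, not taken from the paper's code.

```python
def transition_matrix(sequences):
    """Estimate T[i][j] = P(next = j | current = i) from binary label sequences.

    State 1 means the phenomenon is present (phi+); state 0 means absent (phi-).
    """
    counts = [[0, 0], [0, 0]]
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    T = []
    for row in counts:
        total = sum(row)
        # A state that never occurs gets a uniform row (arbitrary choice for this sketch).
        T.append([c / total for c in row] if total else [0.5, 0.5])
    return T

def trace(T):
    """Persistence measure: P(phi- | phi-) + P(phi+ | phi+)."""
    return T[0][0] + T[1][1]

def second_eigenvalue(T):
    """A 2x2 row-stochastic matrix has eigenvalues 1 and det(T)."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]

# Toy per-turn labels for two short dialogues (hypothetical data).
dialogues = [
    [0, 0, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1],
]
T = transition_matrix(dialogues)
print(trace(T))              # about 1.43: both states tend to repeat
print(second_eigenvalue(T))  # about 0.43, i.e. trace - 1
```

Only the per-turn labels are needed, so the same computation applies to any model whose outputs can be classified.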

2. Gram-Schmidt Orthogonal Basis + \(\theta_{\text{ref}}\): Quantifying Geometric Separation via White-box

To explain why behaviors stick, the latent geometry is analyzed. Residual hidden states at 85% relative depth are collected from the first token of each response. Mean vectors \(\mathbf{h}_{\phi^+}\) and \(\mathbf{h}_{\phi^-}\) are computed and orthonormalized using Gram-Schmidt to create a shared 2D basis (\(\mathbf{B}_1\) from normalized \(\mathbf{h}_{\phi^-}\), and \(\mathbf{B}_2\) from the component of \(\mathbf{h}_{\phi^+}\) orthogonal to \(\mathbf{B}_1\)). Two signatures are calculated: angular separation \(\theta_{\text{ref}}\) (the angle between the two class means in this basis) and "transition incompleteness" (the ratio of the actual Procrustes rotation angle during a state switch to \(\theta_{\text{ref}}\)). If the ratio is \(<1\), the activation fails to reach the target state, leaving a "geometric fingerprint" of the previous state.
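The geometric quantities can be sketched as follows. The Gram-Schmidt step follows the description above; the rotation estimate uses the closed-form least-squares rotation for paired 2-D points, and how activations are paired across a state switch is an assumption of this sketch. All names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def shared_basis(h_neg, h_pos):
    """Gram-Schmidt: B1 from h_neg, B2 from the part of h_pos orthogonal to B1.

    Fails with a division by zero if the two means are exactly parallel.
    """
    n = norm(h_neg)
    b1 = [x / n for x in h_neg]
    proj = dot(h_pos, b1)
    residual = [x - proj * b for x, b in zip(h_pos, b1)]
    r = norm(residual)
    b2 = [x / r for x in residual]
    return b1, b2

def to_plane(h, b1, b2):
    """Coordinates of a hidden state in the shared 2-D basis."""
    return (dot(h, b1), dot(h, b2))

def theta_ref_degrees(h_neg, h_pos):
    """Angle between the class means; h_neg lies on the B1 axis by construction."""
    b1, b2 = shared_basis(h_neg, h_pos)
    x, y = to_plane(h_pos, b1, b2)
    return math.degrees(math.atan2(y, x))

def rotation_angle_degrees(before, after):
    """Least-squares 2-D rotation angle mapping paired points before -> after."""
    s = sum(x1 * y2 - y1 * x2 for (x1, y1), (x2, y2) in zip(before, after))
    c = sum(x1 * x2 + y1 * y2 for (x1, y1), (x2, y2) in zip(before, after))
    return math.degrees(math.atan2(s, c))

# Toy 3-D "hidden states" (hypothetical): class means are 45 degrees apart.
h_neg = mean([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
h_pos = mean([[1.0, 1.0, 0.0], [3.0, 3.0, 0.0]])
theta = theta_ref_degrees(h_neg, h_pos)

# A switch from phi- toward phi+ that rotates only 30 of the 45 degrees.
b1, b2 = shared_basis(h_neg, h_pos)
before = [to_plane(h_neg, b1, b2)]
after = [to_plane([math.sqrt(3.0), 1.0, 0.0], b1, b2)]
incompleteness = rotation_angle_degrees(before, after) / theta
print(theta, incompleteness)
```

In the toy switch the ratio is about 0.67, below 1, which is the case the text describes as leaving a fingerprint of the previous state.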

3. Cross-perspective Correlation: Linking Black-box and White-box

To test whether high trace and large \(\theta_{\text{ref}}\) describe the same mechanism, the Spearman rank correlation is calculated across 18 model/dataset combinations (3 models \(\times\) 6 datasets). A strong positive correlation supports the view that behavioral persistence is the external projection of latent state separation and incomplete geometric rotation.

Key Experimental Results¶

Behavioral Persistence (trace, average across three models)¶

Metric	NaturalQA	TriviaQA	Sorry	DoNotAns	S-pos	S-neg	Mean
Tr(T)	1.13	1.12	1.57	1.59	1.33	1.14	1.31

All phenomena exhibit \(\text{Tr} > 1\); refusal datasets show the highest trace (\(\approx 1.6\)), indicating the strongest carryover.
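The link between these trace values and mixing time follows from a short calculation. In the sketch below, \(a = P(s_{\phi^+}|s_{\phi^+})\) and \(b = P(s_{\phi^-}|s_{\phi^-})\) are shorthand introduced here, and \(\boldsymbol{\Pi}\) denotes the matrix whose rows both equal the stationary distribution (assuming \(a + b < 2\)).

\[
\mathbf{T}=\begin{pmatrix} a & 1-a\\ 1-b & b\end{pmatrix},\qquad
\det(\mathbf{T}-\lambda\mathbf{I})=\lambda^2-(a+b)\lambda+(a+b-1)=(\lambda-1)\bigl(\lambda-(a+b-1)\bigr)
\]

\[
\lambda_1=1,\qquad \lambda_2=a+b-1=\text{Tr}(\mathbf{T})-1,\qquad
\mathbf{T}^k=\boldsymbol{\Pi}+\lambda_2^{\,k}\,(\mathbf{I}-\boldsymbol{\Pi})
\]

The influence of the starting state therefore decays as \(\lambda_2^{\,k}\) after \(k\) turns. Reading the averaged values from the table as an illustration only (per-model values differ): Sorry gives \(\lambda_2 \approx 0.57\), so \(0.57^2 \approx 0.32\) of the initial-state effect remains after two turns, while NaturalQA gives \(\lambda_2 \approx 0.13\) and \(0.13^2 \approx 0.017\).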

Geometric Angular Separation \(\theta_{\text{ref}}\) (degrees)¶

Model	NaturalQA	TriviaQA	Sorry	DoNotAns	S-pos	S-neg
LLaMA-3.1-8B	11.3	13.1	66.5	54.3	14.6	28.2
Qwen-8B	11.7	6.4	46.4	38.6	22.5	22.6
GPT-OSS-20B	9.6	13.9	42.7	34.0	27.8	23.6

Refusal datasets show \(\theta_{\text{ref}}\) of roughly 34–67°, well above the 6–14° of the hallucination datasets, so refusal states are the most distinctly separated geometrically.

Cross-perspective Correlation¶

The Spearman correlation for 18 (trace, \(\theta_{\text{ref}}\)) points across 3 models and 6 datasets is 0.78, confirming a strong positive correlation between high trace and large geometric angles.
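As a rough sanity check of the direction of this relationship, the sketch below computes a Spearman correlation from the two tables above. Because the trace table reports only averages over the three models, this uses six dataset-level points rather than the 18 points behind the reported 0.78, so the result is illustrative and is not a reproduction of the paper's number. The helper names are hypothetical.

```python
def ranks(values):
    """1-based ranks; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Column order: NaturalQA, TriviaQA, Sorry, DoNotAns, S-pos, S-neg.
trace_mean = [1.13, 1.12, 1.57, 1.59, 1.33, 1.14]
theta = {
    "LLaMA-3.1-8B": [11.3, 13.1, 66.5, 54.3, 14.6, 28.2],
    "Qwen-8B": [11.7, 6.4, 46.4, 38.6, 22.5, 22.6],
    "GPT-OSS-20B": [9.6, 13.9, 42.7, 34.0, 27.8, 23.6],
}
# Average theta_ref over the three models, one value per dataset.
theta_mean = [sum(col) / len(col) for col in zip(*theta.values())]

rho = spearman(trace_mean, theta_mean)
print(round(rho, 2))  # 0.83 on these six averaged points
```

The averaged points agree in sign and rough magnitude with the reported correlation; the per-model traces would be needed to recompute the 18-point value.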

Topic Inconsistency Dissolves the Trap¶

Dataset	\(D_{\text{consistent}}\) trace	\(D_{\text{inconsistent}}\) trace	Difference
Sorry	1.57	1.18	−0.39
DoNotAns	1.59	1.20	−0.39
S-neg	1.14	1.05	−0.09

Shuffling topics reduces both the trace and \(\theta_{\text{ref}}\), with the largest drops on the refusal datasets, although the trace stays above 1 in every row. This indicates that the "geometric trap" depends on topic consistency, and it is in line with adversarial jailbreak strategies that inject irrelevant tokens to break context.

Key Findings¶

Carryover strength is ordered: refusal > sycophancy > hallucination, consistent across both trace and \(\theta_{\text{ref}}\).
Refusal strength stems from a "single direction": This matches Arditi et al. 2024, where refusal is governed by a single representation direction. Clearly defined phenomena are more geometrically separated and thus harder to escape.
Hallucinations are weakest: Likely because hallucinations are a broad set of failure modes (factual errors, fabrications, inconsistencies) without a unified latent subspace.
Inconsistent dialogues break traps: Switching topics may be a simple practical method to "unlock" a stuck model.

Highlights & Insights¶

Strong correlation between black-box and white-box: Systematically links behavioral statistics to latent geometry, providing dual evidence for the "behavioral persistence = geometric trap" theory.
Unified treatment of phenomena: Contrasting failure modes (hallucination) with conservative behaviors (refusal) reveals that carryover strength corresponds to "geometric clarity."
Diagnostic utility for closed-source models: Trace calculation does not require internal access, providing a proxy to diagnose internal carryover in models like GPT-5 or Claude.
Geometric explanation for jailbreaking: Explains why adversarial tokens work—they disrupt topic consistency, thereby dissolving the geometric trap.

Limitations & Future Work¶

Phenomenon detection relies on string matching (6.5% error rate), which lacks granularity for hallucination types.
First-order Markov assumptions may oversimplify long-range dependencies.
The study uses small models (4–20B); geometric patterns in larger models might differ.
Calculations are fixed at 85% relative depth; trap strength may vary across layers.
Research focuses on "once-trapped-stay-trapped" without exploring active "de-trapping" mechanisms besides topic shuffling.

vs. Arditi et al. 2024 (Refusal Direction): This work builds on the "single direction" finding by showing that refusal also exhibits strong carryover across turns.
vs. Carryover Effects Studies (Simhi 2024, Zhang 2024): Previous works examined only model outputs; this work adds the white-box perspective and shows that the two views correlate.
vs. Jailbreak via Adversarial Tokens (Zou 2023): Provides a geometric explanation for why adversarial tokens are effective.
Inspiration: This framework can be extended to other state-dependent phenomena (e.g., ICL format locking, persona drift). It also suggests designing active de-trap mechanisms, such as prompt-side safety patches that refresh topic state.

Rating¶

Novelty: ⭐⭐⭐⭐ (Dual-perspective framework is new; individual components are known.)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Wide coverage across models, datasets, phenomena, and closed-source validation.)
Writing Quality: ⭐⭐⭐⭐ (Clear concepts and intuitive figures.)
Value: ⭐⭐⭐⭐ (Practical insights for safety and jailbreak mechanisms.)