Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue

Conference: ACL2026
arXiv: 2605.00506
Code: None
Area: Dialogue Modeling / Psycholinguistics / Information-theoretic Language Production
Keywords: surprisal, goal-directed alternatives, language production, UID, dialogue corpus

TL;DR

This paper models utterance production in natural dialogue as a cost-sensitive choice among contextual alternatives. It finds that minimizing surprisal relative to "goal-directed alternatives" (paraphrases sharing the same communicative goal) best predicts the continuations humans actually produce.

Background & Motivation

Background: In language comprehension research, surprisal is frequently used to explain reading times, eye movements, brain imaging, and processing load. In language production research, the Uniform Information Density (UID) hypothesis and length costs are common explanations for why speakers choose specific expressions.

Limitations of Prior Work: Many information-theoretic analyses examine only the observed utterances, without explicitly defining the alternative expressions available to the speaker at that moment. Without alternatives, it is hard to tell whether a sentence was chosen because of its low cost or whether it simply happened to appear in the corpus.

Key Challenge: Production choices must be defined relative to a candidate set, yet natural dialogue exists in an open-ended generation space. Traditional models either study small sets of syntactic variants or define alternatives too narrowly to cover the range of expressions a speaker might consider in real dialogue.

Goal: The authors aim to use language models to generate open-ended alternatives, distinguishing between two candidate sets: goal-agnostic alternatives, representing reasonable continuations a listener might expect in context; and goal-directed alternatives, representing paraphrases a speaker could choose to achieve the same communicative goal.

Key Insight: If a cost metric truly characterizes language production, it should favor human choices within a set of candidates sharing the same goal. If it primarily reflects listener comprehension pressure, its effect might be more evident among goal-agnostic alternatives.

Core Idea: Use LMs to generate both goal-directed and goal-agnostic alternatives and compare whether human continuations exhibit lower surprisal, UID, or length costs than these alternatives to identify which cost best explains production choices.

Method

The key contribution of this paper is not a new neural model, but the redefinition of "production choice" as an experimental framework involving candidate sets, cost functions, and probabilistic choice rules.

The most critical distinction is: in the same context, the listener does not know what the speaker intends to express, but the speaker knows their own communicative goal.

Therefore, alternatives used to explain speaker choice should maintain the same goal, rather than just being contextually plausible.

Overall Architecture

The paper utilizes the Switchboard Dialogue Act Corpus, which consists of natural spoken dialogue data.

The authors first clean the transcripts, removing backchannels, short pauses, noise markers, and obvious disfluencies.

Target utterances are restricted to 10 to 30 words, must carry a statement or question dialogue act, and must follow a turn from a different speaker.

Each target utterance is split into context and continuation: the sentence's root verb serves as the selection point; the root verb and everything prior form the context, and everything following is the actual human continuation.
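
The selection and splitting steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Utterance` class, its field names, the coarse act labels, the inclusive reading of the 10 to 30 word bound, and the example sentences are all assumptions, and the root-verb index is taken as given (it would come from whatever dependency parser is used).

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    dialogue_act: str   # hypothetical coarse label, e.g. "statement" or "question"
    tokens: list[str]
    root_index: int     # position of the root verb, supplied by a parser

def is_target(utt: Utterance, prev: Utterance) -> bool:
    """Apply the three selection criteria (word bound assumed inclusive)."""
    return (
        10 <= len(utt.tokens) <= 30
        and utt.dialogue_act in {"statement", "question"}
        and prev.speaker != utt.speaker
    )

def split_at_root(utt: Utterance) -> tuple[list[str], list[str]]:
    """Context is everything up to and including the root verb; the rest is the continuation."""
    cut = utt.root_index + 1
    return utt.tokens[:cut], utt.tokens[cut:]

# Invented example turns, for illustration only.
prev = Utterance("A", "question", "what do you do on weekends".split(), 3)
utt = Utterance(
    "B", "statement",
    "well i usually spend most of my time working in the garden".split(), 3,
)
context, continuation = split_at_root(utt)
```

Here `context` ends at the root verb "spend", and `continuation` holds the eight words that follow it.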

After cleaning, 1,342 utterances were obtained. Following alternative generation and filtering, the final analysis set included 309 contexts, 309 observed human continuations, and 12,360 generated continuations.

Costs are estimated with GPT-2 Small, whose surprisal is a common proxy for processing load in psycholinguistics.

Goal-agnostic alternatives are generated by GPT-4o completing sentences under different history conditions: no history, previous turn, or full history.

Goal-directed alternatives are generated by GPT-4o performing constrained paraphrasing of the observed human sentences, requiring that the paraphrase retains the same context and semantic goal.

A GPT-4o judge then filters the paraphrases; a manual check of sampled judgments found the judge's paraphrase decisions to be 98.75% accurate.

Finally, the authors compare the rank of human continuations across various costs and use a pairwise logistic choice model to test whether cost differences predict human choices.

Key Designs

  1. Two Types of Contextual Alternatives:

    • Function: Separates the producer's perspective from the listener's perspective.
    • Mechanism: The goal-agnostic set \(A_c\) is conditioned only on context \(c\), containing any grammatically plausible and contextually coherent continuation. The goal-directed set \(A_{c,g}\) is conditioned on both context and communicative goal \(g\), containing only paraphrases that are semantically equivalent or near-equivalent to the human continuation.
    • Design Motivation: Without distinguishing these sets, the interpretation of surprisal is confounded. Low surprisal relative to all possible continuations reflects listener expectation; low surprisal relative to same-goal paraphrases reflects the speaker choosing an easier-to-produce form to express a goal.
  2. Cost-Sensitive Choice Model:

    • Function: Converts "why humans say this" into a testable probabilistic choice hypothesis.
    • Mechanism: The production probability of a candidate continuation grows exponentially with its utility, where utility is a constant minus cost. Thus \(P_S(a \mid c, g) \propto \exp(-\alpha\, C(a; c))\), where \(\alpha\) scales sensitivity to cost (the inverse of choice noise). Lower cost means a higher probability of selection; as choice noise approaches zero (large \(\alpha\)), probability mass concentrates on the lowest-cost candidate, so the human continuation should frequently be rank 1.
    • Design Motivation: This model bridges deterministic rank analysis and probabilistic logistic analysis, answering both "did humans choose the minimum cost item?" and "do cost differences continuously predict selection probability?"
  3. Stratified Sampling of Alternatives:

    • Function: Prevents generated alternatives from being systematically different from human utterances in length or global UID, which could drive biased conclusions.
    • Mechanism: The authors compared the overall cost distributions of generated and human continuations. Surprisal and local UID did not differ significantly, but length and global UID did. They therefore binned human utterances by length and global UID, then sampled from the generation pool without replacement so that the strata proportions of the alternatives match the human distribution.
    • Design Motivation: If generated sentences are naturally shorter or smoother, the cost comparison is unfair. Stratified sampling ensures results reflect context-specific preferences rather than global distributional differences.
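
The stratified sampling in design 3 can be sketched as below. This is a simplified illustration under stated assumptions: the bin edges, the function names `stratum` and `stratified_sample`, and the toy data are hypothetical, and the paper's exact binning scheme is not reproduced here.

```python
import random
from collections import Counter, defaultdict

def stratum(length, global_uid, len_edges, uid_edges):
    """Map a continuation to a (length bin, global-UID bin) pair."""
    len_bin = sum(length >= e for e in len_edges)
    uid_bin = sum(global_uid >= e for e in uid_edges)
    return (len_bin, uid_bin)

def stratified_sample(human, pool, n_total, len_edges, uid_edges, seed=0):
    """Pick about n_total pool indices whose strata proportions follow the human ones.

    human and pool are lists of (length, global_uid) pairs.
    """
    rng = random.Random(seed)
    human_counts = Counter(stratum(l, u, len_edges, uid_edges) for l, u in human)
    by_stratum = defaultdict(list)
    for idx, (l, u) in enumerate(pool):
        by_stratum[stratum(l, u, len_edges, uid_edges)].append(idx)

    chosen = []
    for s, count in human_counts.items():
        quota = round(n_total * count / len(human))
        candidates = by_stratum.get(s, [])
        # Without replacement: a stratum contributes at most what it holds.
        chosen.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sorted(chosen)

# Toy data: humans are half short/smooth, half long/bumpy; the pool is skewed short.
human = [(5, 1.0), (6, 1.2), (12, 3.0), (14, 3.5)]
pool = [(4, 0.5)] * 6 + [(13, 3.2)] * 2
picked = stratified_sample(human, pool, n_total=4, len_edges=[10], uid_edges=[2.0])
```

Although the pool is 75% short continuations, `picked` contains two short and two long items, mirroring the human proportions.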

Loss & Training

No new models were trained. The experiments rely on existing LMs for cost estimation, alternative generation, and paraphrase judgment.

Surprisal, local UID, global UID, and length are the four classes of cost functions.

Surprisal is the negative log-probability of the continuation given the context and dialogue history.

Local UID measures the mean squared difference in surprisal between adjacent words in a continuation; lower values indicate smoother local information density.

Global UID measures the variance of surprisal across the sentence relative to its mean; lower values indicate more uniform information density throughout the sentence.

Length is the number of words in the continuation, acting as a simple proxy for production effort.
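
The four costs can be computed from a list of per-word surprisal values, as in this minimal sketch. It assumes word-level surprisals have already been obtained from the language model (for instance by summing subword surprisals); the function names are illustrative, and the choice of population variance for global UID is an assumption rather than a detail confirmed by the paper.

```python
def total_surprisal(s):
    """Negative log-probability of the whole continuation: the sum of per-word surprisals."""
    return sum(s)

def local_uid(s):
    """Mean squared difference in surprisal between adjacent words."""
    if len(s) < 2:
        return 0.0
    return sum((b - a) ** 2 for a, b in zip(s, s[1:])) / (len(s) - 1)

def global_uid(s):
    """Variance of per-word surprisal around the continuation's own mean."""
    mu = sum(s) / len(s)
    return sum((x - mu) ** 2 for x in s) / len(s)

def length(s):
    """Number of words in the continuation."""
    return len(s)

# Toy per-word surprisals for a three-word continuation.
s = [2.0, 4.0, 3.0]
costs = {
    "surprisal": total_surprisal(s),   # 9.0
    "local_uid": local_uid(s),         # (2^2 + (-1)^2) / 2 = 2.5
    "global_uid": global_uid(s),       # ((-1)^2 + 1^2 + 0^2) / 3
    "length": length(s),               # 3
}
```

For all four functions, a lower value means a cheaper continuation, which is what lets them be compared under the same choice rule.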

Statistical analyses include the Poisson-binomial rank-1 test, pairwise logistic choice model, one-sided t-tests, and conditional logit analysis.
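
As an illustration of the first of these tests, the sketch below computes a Poisson-binomial upper-tail p-value for the number of rank-1 hits. The null hypothesis used here is an assumption: the human continuation is taken to be equally likely to hold any rank, so context i contributes a hit with probability 1/n_i, where n_i is the candidate-set size including the human continuation. The function names are hypothetical.

```python
def poisson_binomial_pmf(probs):
    """Distribution of the number of successes among independent, unequal Bernoulli trials."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)      # this trial fails
            nxt[k + 1] += mass * p          # this trial succeeds
        pmf = nxt
    return pmf

def rank1_p_value(set_sizes, observed_hits):
    """P(at least observed_hits rank-1 hits) under the uniform-rank null."""
    probs = [1.0 / n for n in set_sizes]
    pmf = poisson_binomial_pmf(probs)
    return sum(pmf[observed_hits:])

# Two contexts with two candidates each: both being rank 1 has probability 0.25 by chance.
p_small = rank1_p_value([2, 2], observed_hits=2)

# Ten contexts with seven candidates each, eight observed hits: far beyond chance.
p_large = rank1_p_value([7] * 10, observed_hits=8)
```

The Poisson-binomial form matters because candidate sets can differ in size across contexts, so the per-context chance of a hit is not constant.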

Key Experimental Results

Main Results

In the deterministic cost minimisation analysis, the authors calculated the proportion of times the human continuation was the lowest-cost item among alternatives.

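| Cost | Goal-directed rank-1 | Goal-agnostic rank-1 | Uniform baseline (goal-directed / goal-agnostic) |
| --- | --- | --- | --- |
| Surprisal | 53.4% | 15.2% | 16.5% / 7.2% |
| Local uniformity | 34.1% | 16.2% | 16.5% / 7.2% |
| Global uniformity | 24.1% | 19.3% | 16.5% / 7.2% |
| Length | 28.6% | 26.6% | 16.5% / 7.2% |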

For every cost, the human rank-1 rate was significantly above chance, but the strongest result was surprisal under goal-directed alternatives: 53.4% of human continuations were the minimum-surprisal option, approximately 3.24 times the 16.5% uniform baseline.
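
The rank-1 rate and its uniform baseline can be computed as in the sketch below. It assumes that a hit requires the human cost to be strictly lower than every alternative and that the baseline is the mean of 1/(number of candidates) across contexts; both are assumptions about details the summary does not spell out, and the data are invented.

```python
def rank1_rate(items):
    """Share of contexts where the human continuation has the strictly lowest cost.

    items is a list of (human_cost, [alternative costs]) pairs.
    """
    hits = sum(human < min(alts) for human, alts in items)
    return hits / len(items)

def uniform_baseline(items):
    """Expected rank-1 rate if every candidate were equally likely to be cheapest."""
    return sum(1.0 / (len(alts) + 1) for _, alts in items) / len(items)

# Three toy contexts: the human continuation is cheapest in the first and third.
items = [
    (3.0, [4.0, 5.0]),
    (6.0, [5.0, 7.0]),
    (2.0, [2.5]),
]
rate = rank1_rate(items)            # 2 hits out of 3 contexts
baseline = uniform_baseline(items)  # (1/3 + 1/3 + 1/2) / 3 = 7/18
ratio = rate / baseline
```

On the reported figures the same ratio is 53.4 / 16.5 ≈ 3.24, which is consistent with reading the first baseline value as the goal-directed one.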

The pairwise logistic choice model supported the same conclusion: the negative cost coefficient for surprisal was the most stable, with the effect in the goal-directed condition being about 7 times stronger than in the goal-agnostic condition.

| Cost | Goal-agnostic β | Goal-directed β | Interaction β | Per-item LL |
| --- | --- | --- | --- | --- |
| Surprisal | -0.304 | -2.073 | -1.769 | -0.615 |
| Local uniformity | 0.357 | -0.683 | -1.040 | -0.670 |
| Global uniformity | 0.796 | -0.632 | -1.428 | -0.637 |

A negative coefficient indicates humans are more likely to choose a low-cost continuation; surprisal is most negative under the goal-directed condition and yields the best log-likelihood.
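
A pairwise logistic choice model of this kind can be sketched as follows. With only two candidates a and b, the exponential choice rule reduces to \(P(a) = \sigma(-\alpha\,(C(a) - C(b)))\), so a fitted coefficient \(\beta\) on the cost difference plays the role of \(-\alpha\). The sketch assumes a single coefficient with no intercept, fitted by plain gradient ascent; the function names and toy cost differences are hypothetical, and the paper's exact specification (for example, how the interaction term is coded) is not reproduced.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def per_item_ll(beta, diffs):
    """Mean log-probability that the human continuation wins each pair."""
    return sum(math.log(sigmoid(beta * d)) for d in diffs) / len(diffs)

def fit_beta(diffs, steps=2000, lr=0.1):
    """Maximise per_item_ll over beta, where each d = C(human) - C(alternative)."""
    beta = 0.0
    for _ in range(steps):
        # d/dbeta of log sigmoid(beta * d) is d * (1 - sigmoid(beta * d)).
        grad = sum(d * (1.0 - sigmoid(beta * d)) for d in diffs) / len(diffs)
        beta += lr * grad
    return beta

# Toy cost differences: the human continuation is cheaper in three of four pairs.
diffs = [-2.0, -1.0, -0.5, 1.0]
beta_hat = fit_beta(diffs)
ll_hat = per_item_ll(beta_hat, diffs)
ll_chance = per_item_ll(0.0, diffs)   # equals ln 0.5, about -0.693
```

In this sketch `beta_hat` comes out negative and `ll_hat` exceeds the chance value of ln 0.5 ≈ -0.693. If the paper's model has the same pairwise form, that value may serve as a reference point when reading the Per-item LL column.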

Ablation Study

The authors conducted additional analyses on distributional differences and results without stratified sampling to confirm the main conclusion was not an artifact of the sampling strategy.

| Analysis | Goal-directed | Goal-agnostic | Explanation |
| --- | --- | --- | --- |
| Surprisal t-test | t=-32.48, p<1e-8 | t=-8.23, p<1e-8 | Human surprisal is significantly lower than alternatives, stronger in goal-directed |
| Local UID t-test | t=-13.37, p<1e-8 | t=10.57, p=1.00 | UID prediction holds only under goal-directed |
| Global UID t-test | t=-12.11, p<1e-8 | t=26.48, p=1.00 | Direction reverses under goal-agnostic |
| Length t-test | t=2.99, p=1.00 | t=-10.72, p<1e-8 | Length acts more like surface pressure in goal-agnostic |
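
The p=1.00 entries follow from the one-sided design: the alternative hypothesis is that human cost is lower, so a positive t statistic yields a p-value near 1 rather than near 0. The sketch below illustrates this. It assumes a one-sample test on per-context differences (human cost minus mean alternative cost) and uses a normal approximation to the t distribution, which is reasonable only for large samples; neither detail is confirmed by the summary, and the data are invented.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sided_t(diffs):
    """Test H1: mean(diffs) < 0, returning (t statistic, approximate lower-tail p-value)."""
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    p_lower = NormalDist().cdf(t)   # normal approximation, assumed adequate for large n
    return t, p_lower

# Human cost below the alternatives: negative t, small p.
t_low, p_low = one_sided_t([-1.0, -2.0, -1.5, -0.5])

# Human cost above the alternatives: same magnitude, positive t, p close to 1.
t_high, p_high = one_sided_t([1.0, 2.0, 1.5, 0.5])
```

The two calls give t statistics of equal magnitude and opposite sign, with p-values that sum to 1, which is the pattern seen in the UID and length rows.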

Without stratified sampling, surprisal remains the strongest goal-directed explanatory variable, though absolute proportions decrease.

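| Cost | Goal-directed rank-1 | Goal-agnostic rank-1 | Uniform baseline (goal-directed / goal-agnostic) |
| --- | --- | --- | --- |
| Surprisal | 47.6% | 10.7% | 9.3% / 3.3% |
| Local uniformity | 22.1% | 12.4% | 9.3% / 3.3% |
| Global uniformity | 13.0% | 13.4% | 9.3% / 3.3% |
| Length | 22.8% | 22.5% | 9.3% / 3.3% |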

Key Findings

  • Surprisal relative to goal-directed alternatives has the strongest predictive power, supporting the idea that speakers prefer more conventional, easier-to-produce expressions when achieving the same goal.
  • UID has some predictive power, but its effect is weaker and depends on the alternative set; it even reverses in the goal-agnostic set, suggesting that UID should not be read as a single, unified production goal.
  • Length has some predictive power in the goal-agnostic set but fails to explain why one expression is chosen among paraphrases sharing the same communicative goal.
  • LM-generated alternatives can transform open dialogue production into testable experiments, provided paraphrase filtering and distribution alignment are used to prevent generation bias.
  • This work provides an operational distinction between "speaker cost" and "listener expectation": look at whether the cost operates within a set of same-goal alternatives or across all contextually plausible continuations.

Highlights & Insights

  • The distinction between goal-directed and goal-agnostic is elegant. While many studies mention "alternatives," they rarely specify if these share a communicative goal; this paper turns that ambiguity into an experimental variable.
  • The paper does not treat the LLM as a cognitive model directly but as a toolchain (generator, estimator, judge). This positioning is robust and facilitates error analysis.
  • The stratified sampling design is crucial. LLMs have their own length and style biases; without stratification, model preferences could easily be mistaken for regularities of human production.
  • The results are insightful for NLG: to generate natural expressions, systems should optimize costs within a paraphrase set for a fixed communicative goal, rather than just sampling from globally high-probability text.

Limitations & Future Work

  • The experiment only covers one type of selection point in the English Switchboard corpus (continuations after a matrix verb) and may not generalize to all syntactic choices, languages, or domains.
  • Costs are aggregated over the entire continuation, which may obscure effects in long sentences or those arising from incremental planning; tokens closer to the selection point might warrant higher weights.
  • The quality of GPT-4o's generation and filtering affects conclusions, particularly whether goal-directed paraphrases truly maintain the same communicative goal.
  • GPT-2 Small's surprisal may not perfectly align with modern large models or human processing, though its use is traditional in psycholinguistics.
  • The framework does not explicitly model communicative effectiveness, as natural dialogue lacks external success signals for every paraphrase. Integrating a listener interpretation model would be more comprehensive.

Comparison with Related Work

  • vs Uniform Information Density: UID emphasizes smooth information density, but this paper shows UID's explanatory power depends on the alternative set and is less stable than goal-directed surprisal.
  • vs Rational Speech Act (RSA) / rate-distortion: RSA often models choices over small discrete action sets; this paper uses LLMs to generate open alternatives, extending similar ideas to natural dialogue.
  • vs Traditional Surprisal Comprehension: Traditional surprisal mostly explains listener processing load; this paper points out that the same metric can have a speaker-oriented interpretation in goal-directed sets.
  • vs LM Alternative Sampling: Past work using LM alternatives often predicted listener uncertainty or semantic inference; this paper refines the theoretical interpretation by splitting alternatives into speaker vs. listener categories.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Redefining alternatives by communicative goal and systematically comparing costs across the two sets is a theoretically novel move.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Data cleaning, alternative generation, and statistical testing are complete, though the scope is narrow: one corpus and one type of selection point.
  • Writing Quality: ⭐⭐⭐⭐⭐ The argumentation chain is clear, naturally connecting background, models, experiments, and limitations.
  • Value: ⭐⭐⭐⭐☆ Highly insightful for both psycholinguistics and NLG goal design; engineering adoption depends on the quality of alternative generation.