LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight¶
Conference: ACL2026
arXiv: 2601.03645
Code: Unreleased
Area: Affective Computing / Dialogue Analysis / Educational Dialogue
Keywords: Probabilistic Affective Modeling, Monte Carlo Sampling, Affective Trajectories, Latent Ambiguity, Interpersonal Dynamics
TL;DR¶
This paper proposes LLM-MC-Affect, which transforms dialogue affect from single-point labels into a latent distribution approximated by stochastic LLM decoding, using mean, variance, cross-correlation, and slope metrics to analyze affective synchrony and dominance in teacher-student dialogues.
Background & Motivation¶
Background: Affective synchrony and interpersonal emotional dynamics are typically studied using physiological signals, neural synchrony, or manual annotation. These methods provide fine-grained time series but impose high requirements on equipment, scene control, and privacy compliance. On the other hand, natural language dialogues in education, counseling, and collaboration already contain continuous emotional changes, making textual affective trajectories a more scalable alternative signal.
Limitations of Prior Work: Many textual sentiment analysis methods still compress each dialogue turn into a deterministic affective score, assuming there is only one correct emotional judgment. This erases two types of information: first, the same sentence may be interpreted with different emotions by different people; second, the emotions of both parties are temporally coupled rather than independent of each other.
Key Challenge: The challenge addressed in the paper is that scalable text analysis often sacrifices affective ambiguity and interaction structure, while high-fidelity interpersonal dynamic analysis often relies on expensive sensors or manual labelers. The authors aim to transform LLM stochasticity into a quantifiable source of affective uncertainty without fine-tuning or collecting additional physiological signals.
Goal: First, to estimate the central affective tendency of each turn; second, to represent latent affective ambiguity through the variance of multiple stochastic inferences; third, to organize the affective sequences of both teacher and student into trajectories; fourth, to interpret who is influencing whom and whether the interaction is mutually improving or deteriorating through time-lagged cross-correlation and trend slopes.
Key Insight: The authors observe that LLMs at non-zero temperature provide different but plausible affective ratings for the same utterance. This stochastic output should not be viewed merely as noise but as sampling from an implicit affective distribution. Thus, repeated sampling serves as a proxy for multiple human raters.
Core Idea: Use Monte Carlo samples from stochastic LLM decoding to estimate affective distributions, then link the distribution means and variances into dialogue trajectories to analyze interpersonal affective synchrony at the text level.
Method¶
The core of LLM-MC-Affect is not training a new affective model but defining a statistical pipeline from dialogue text to interaction interpretation. It prompts the LLM to provide affective scores for dialogue turns multiple times under a unified psychometric rubric, converts these scores into standardized affective trajectories, and finally uses time-series tools to interpret the emotional coupling between teacher and student.
Overall Architecture¶
The input is a dyadic dialogue window and an affective rating prompt. For each turn and each speaker, the system runs \(K\) independent stochastic LLM inferences to obtain a set of affective score samples. Subsequently, the sample mean is calculated as the central affective state, and the sample variance as the perceived ambiguity. Raw scores are first rated on a scale of \(0\) to \(5\) and then mapped to \([-1,1]\), where positive values indicate positive affect and negative values indicate negative affect.
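The per-turn estimation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_raw_score` is a hypothetical stand-in that draws a noisy rating instead of calling an LLM, the Gaussian noise model is an assumption, and the unbiased (n-1) variance is one possible reading of "sample variance". The polarity mapping follows the formula given under Key Designs.

```python
import random
import statistics


def sample_raw_score(utterance, rng):
    """Hypothetical stand-in for ONE stochastic LLM rating on the 0-5 rubric.

    Assumption: a real system would send the utterance and rubric to an LLM
    at non-zero temperature; here we just draw a clipped Gaussian value.
    """
    return min(5.0, max(0.0, rng.gauss(3.0, 0.4)))


def to_polarity(raw):
    """Map a raw 0-5 score to [-1, 1]; 0 -> +1 (positive), 5 -> -1 (negative)."""
    return 1.0 - 2.0 * (raw / 5.0)


def estimate_affect(utterance, k=20, seed=0):
    """Return (mean, variance) of K mapped samples for one utterance."""
    rng = random.Random(seed)
    samples = [to_polarity(sample_raw_score(utterance, rng)) for _ in range(k)]
    mean = statistics.fmean(samples)      # central affective state
    var = statistics.variance(samples)    # perceived ambiguity (n-1 denominator)
    return mean, var


mean, var = estimate_affect("I still don't get it.", k=20)
```

Replacing the stub with a real model call leaves the rest of the function unchanged, since the statistics only depend on the list of sampled scores.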
After obtaining standardized trajectories for both the teacher and the student, the method calculates the normalized cross-correlation \(R_{TS}(L)\) at different dialogue lags \(L\), selecting \(L^*=\arg\max_L |R_{TS}(L)|\) as the dominant lag. If \(L^*>0\), it indicates the teacher's affect leads the student's; if \(L^*<0\), it indicates the student's affect leads the teacher's. Simultaneously, the method performs linear regression on each trajectory, using the slopes \(\beta_T\) and \(\beta_S\) to summarize long-term affective trends.
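A minimal sketch of the lag search described above, under stated assumptions: the summary does not give the exact normalization of \(R_{TS}(L)\), so the code uses the Pearson correlation over the overlapping segment at each lag, pairing the teacher's value at turn \(t\) with the student's value at turn \(t+L\) so that \(L>0\) means the teacher leads. The function names, the `max_lag` default, and the toy trajectories are all illustrative and not taken from the paper.

```python
import numpy as np


def lagged_correlation(teacher, student, lag):
    """Pearson correlation between teacher[t] and student[t + lag]."""
    t = np.asarray(teacher, dtype=float)
    s = np.asarray(student, dtype=float)
    if lag >= 0:
        a, b = t[: len(t) - lag], s[lag:]
    else:
        a, b = t[-lag:], s[: len(s) + lag]
    # Too few overlapping points or a flat segment: correlation is undefined.
    if len(a) < 3 or a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])


def dominant_lag(teacher, student, max_lag=2):
    """Return (L*, R_TS(L*)) where L* maximizes |R_TS(L)| over the lag window."""
    scores = {
        lag: lagged_correlation(teacher, student, lag)
        for lag in range(-max_lag, max_lag + 1)
    }
    best = max(scores, key=lambda lag: abs(scores[lag]))
    return best, scores[best]


# Toy trajectories (made up): the student repeats the teacher one turn later.
teacher = [0.2, -0.4, 0.1, 0.5, 0.3, 0.8, 0.6]
student = [0.0, 0.2, -0.4, 0.1, 0.5, 0.3, 0.8]
best_lag, r = dominant_lag(teacher, student)
```

On short dialogues the overlapping segment shrinks quickly as the lag grows, so a small `max_lag` keeps each correlation based on a reasonable number of turns.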
Key Designs¶
- Stochastic Decoding as Affective Distribution Sampling:
  - Function: Expands affective judgment of an utterance from a single deterministic output into a set of samples.
  - Mechanism: Performs repeated inference \(K=20\) times under non-zero temperature with the same dialogue context to obtain \(\{\hat{s}_{t,k}\}_{k=1}^K\), using the sample mean for central tendency and the sample variance for latent ambiguity.
  - Design Motivation: Traditional single-point affective scores average out ambiguity; retaining variance allows distinguishing between "the model is certain it is neutral" and "multiple plausible affective interpretations are competing."
- Sign-Stable Affective Rating and Polarity Mapping:
  - Function: Reduces LLM sign confusion during positive and negative affective rating and transforms results into standard affective coordinates.
  - Mechanism: The model is first asked to rate on a scale of \(0\) to \(5\), where \(2.5\) is neutral, closer to \(0\) is more positive, and closer to \(5\) is more negative; subsequently, \(\tilde{s}_t=1-2(s_t/5)\) maps the result to \([-1,1]\), with variance scaled by \((2/5)^2\).
  - Design Motivation: Non-negative scores are more stably executed by LLMs, while standardized positive/negative polarity facilitates alignment with affective computing literature.
- Joint Interpretation of Interaction Patterns via Cross-Correlation and Slope:
  - Function: Converts low-level affective trajectories into interactive interpretations such as "Effective Scaffolding," "Mutual Fatigue," or "Dynamic Compensation."
  - Mechanism: Cross-correlation provides the phase relationship of affective changes, while slopes provide the direction of long-term improvement or deterioration; the paper defines teacher-student interaction types using the signs of \(L^*\), \(R_{TS}(L^*)\), and the \(\beta_T, \beta_S\) combination.
  - Design Motivation: Observing a single individual's emotional state cannot explain the interaction relationship; only combined lags and trends can explain who is guiding the emotion and whether that guidance is positive or negative.
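As a worked check of the polarity mapping in the second design, the map is affine in \(s_t\), so the mean transforms through the same formula and the variance picks up only the squared scale factor:

\[
\tilde{s}_t = 1 - \frac{2}{5}\,s_t
\quad\Longrightarrow\quad
\operatorname{Var}(\tilde{s}_t) = \left(\frac{2}{5}\right)^{2}\operatorname{Var}(s_t) = 0.16\,\operatorname{Var}(s_t)
\]

For illustration (these numbers are not from the paper), a raw mean of \(3.5\) with raw variance \(0.25\) maps to a polarity mean of \(1 - 0.4 \times 3.5 = -0.4\) and a polarity variance of \(0.16 \times 0.25 = 0.04\).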
Loss & Training¶
This work does not train a new model or propose a supervised loss. It relies on zero-shot inference and statistical estimation: a unified rubric, \(K=20\) Monte Carlo samples per turn, a decoding temperature held fixed within each run and varied across runs for sensitivity analysis, and, in the interaction interpretation phase, normalized cross-correlation plus ordinary least squares slope estimation. The final interaction analysis primarily uses GPT-4.1 at \(\tau=0.7\), as this setting balances mean stability and ambiguity visibility.
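The slope estimation step can be sketched as below. This is an illustration under assumptions: the regressor is taken to be the turn index with unit spacing, `joint_trend` and its labels are hypothetical helpers for reading the two signs together, and the trajectories are made-up values rather than the paper's data.

```python
import numpy as np


def trend_slope(trajectory):
    """OLS slope of affect against turn index (0, 1, 2, ...)."""
    y = np.asarray(trajectory, dtype=float)
    x = np.arange(len(y), dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    return float(slope)


def joint_trend(beta_t, beta_s):
    """Illustrative reading of the two slope signs (labels are assumptions)."""
    if beta_t > 0 and beta_s > 0:
        return "mutually improving"
    if beta_t < 0 and beta_s < 0:
        return "mutually deteriorating"
    return "mixed"


# Made-up trajectories on the [-1, 1] polarity scale.
teacher = [0.0, 0.1, 0.2, 0.3]
student = [-0.2, 0.0, 0.2, 0.4]
beta_t = trend_slope(teacher)
beta_s = trend_slope(student)
label = joint_trend(beta_t, beta_s)
```

The slope only summarizes the long-run direction; the lag analysis is still needed to say which speaker's changes precede the other's.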
Key Experimental Results¶
Main Results¶
| Target | Setting | Key Metric / Observation | Conclusion |
|---|---|---|---|
| Google Education Dialogue Dataset | Synthetic multi-turn teacher-student dialogue | Used GPT-4.1, GPT-3.5-Turbo, Gemma 3 4B, Llama 3.3 70B, Phi 4 14B, GPT-OSS 120B | As a controlled education interaction case, it validates that the method can extract affective trajectories from text. |
| Personification Topic | GPT-4.1, \(\tau=0.7\) | Teacher slope \(\beta_T=0.1621\), student slope \(\beta_S=0.2532\), \(L^*=+1\), \(R_{TS}=0.999\) | Interpreted as Effective Scaffolding, i.e., the teacher leads and drives the improvement of the student's affect. |
| Cross-Model Comparison | Same rubric, zero-shot | Both GPT-4.1 and GPT-3.5-Turbo captured a V-shaped trajectory that dips early and then recovers. | The GPT-series models track fine-grained affective transitions more stably. |
| Open-Source Model Behavior | Llama 3.3 70B / Phi 4 14B / Gemma 3 4B | Llama 3.3 70B almost missed the negative dip at Turn 2; Phi-4 recovery peaked at ~0.40; Gemma 3 4B recovery flattened around 0.15. | Alignment or model scale may introduce positivity bias and conservative emotional estimation. |
Ablation Study¶
| Analysis Item | Setting | Key Data | Description |
|---|---|---|---|
| Temperature Sensitivity | Utterance 6, GPT-4.1 | At \(\tau=0.1\), mean \(\approx -0.12\), variance \(0.010\); at \(\tau=1.0\), variance increased to \(0.024\). | Higher temperature makes ambiguity more visible without destabilizing the affective mean. |
| Mean Stability | Personification topic, full temperature range | The mean of Utterance 6 stayed between approximately \(-0.26\) and \(-0.11\). | Central affective tendency remains relatively stable even when variance changes significantly. |
| Trajectory Convergence | Teacher affective trajectory | A trend of dipping near Turn 2 followed by a strong positive turn at Turn 7 was observed across different \(\tau\). | The method can filter out some stochastic sampling noise while preserving the main affective signals. |
| Statistical Interpretation | NCCF + Slope | \(L^*=+1\) and \(R_{TS}=0.999\), with both \(\beta_T, \beta_S\) positive. | Supports the interpretation that "teacher's previous turn emotion predicts student's next turn emotion." |
Key Findings¶
- Monte Carlo variance is not simple noise but a core variable used explicitly by the paper to represent affective ambiguity.
- Affective means are relatively robust to changes in temperature, while variance expands as temperature increases; thus, mean and variance serve different interpretive functions.
- GPT-4.1 is the most suitable model for interaction analysis in this case; some open-source models tend to be overly positive or to underestimate negative transitions, which can itself serve as a diagnostic signal of bias in a model's affective perception.
- Cross-correlation can only support sequential association interpretation and cannot directly infer causality, a point explicitly emphasized in the paper's limitations.
Highlights & Insights¶
- Viewing LLM stochasticity as an "estimable affective distribution" rather than "noise that needs to be suppressed" is insightful. It grants zero-shot LLM inference statistical semantics similar to multi-person rating.
- The paper does not stop at affective classification but continues to link affective sequences to interaction pattern interpretation. For educational dialogue, the combination of \(L^*\) and slope is closer to classroom dynamics that teachers truly care about than single affective scores.
- The method also has diagnostic value for model bias: if a certain model always provides overly positive affective trajectories, it may not be suitable for monitoring student frustration.
- This pipeline can be transferred to dyadic or multi-party dialogue scenarios such as psychological counseling, customer service, or collaborative meetings, as long as the rubric is replaced with target-domain interaction evaluation dimensions.
Limitations & Future Work¶
- Monte Carlo variance mixes linguistic ambiguity, model bias, prompt sensitivity, and decoding randomness; it cannot be equated directly with real human perceptual ambiguity.
- Experiments are primarily based on synthetic educational dialogues. While convenient for controlling variables, they cannot fully represent noise, non-verbal cues, and student group differences in real classrooms.
- Cross-correlation and lag metrics can only indicate sequence alignment or leading relationships and should not be interpreted as causal evidence that teacher emotions cause student emotional changes.
- Repeated stochastic decoding incurs significant computational cost; real-time deployment in educational systems would need lighter affective models or hybrid architectures.
- Subsequent work could add real classroom data, human rating calibration, cross-cultural affective rubrics, and sliding-window cross-correlation to handle longer dialogues.
Related Work & Insights¶
- vs Traditional Text Affective Classification: Traditional methods output deterministic labels or scores; LLM-MC-Affect outputs a mean and a variance, thus preserving subjective ambiguity.
- vs Human Multi-Annotator Affective Modeling: Multi-annotator methods approximate implicit distributions with real human disagreement; LLM-MC-Affect uses stochastic LLM inference as a proxy, which costs less but depends more heavily on model bias.
- vs Physiological Signal Affective Synchrony Research: Physiological signals can capture high-fidelity synchrony but have high deployment barriers; LLM-MC-Affect uses textual trajectories, trading some sensing precision for scalability and privacy-friendliness.
- vs Pure LLM-as-a-Judge: Standard LLM evaluation often takes a single judgment; LLM-MC-Affect treats the judgment as a statistical estimate over repeated samples, making it more suitable for analyzing uncertainty and temporal dynamics.
Rating¶
- Novelty: ⭐⭐⭐⭐☆ Using stochastic decoding for affective distribution estimation and linking it to interaction dynamics is a clear and somewhat novel idea.
- Experimental Thoroughness: ⭐⭐⭐☆☆ Includes temperature, cross-model, and case studies, but relies mainly on synthetic educational scenarios with insufficient validation on real data.
- Writing Quality: ⭐⭐⭐⭐☆ The chain of motivation, statistical modeling, and interactive interpretation is complete, and limitations are addressed honestly.
- Value: ⭐⭐⭐⭐☆ Has reference value for educational dialogue analysis, affective computing, and LLM evaluation, especially for inspiring textual interaction analysis with uncertainty awareness.