SDPO: Segment-Level Direct Preference Optimization for Social Agents

Conference: ACL 2025
arXiv: 2501.01821
Code: Open-sourced (see paper for details)
Area: LLM Alignment
Keywords: Preference Optimization, Multi-turn Dialogue Alignment, Social Agents, Segment-level Optimization, DPO

TL;DR

SDPO proposes optimizing preferences in multi-turn social dialogues at the granularity of "segments." By dynamically locating error turns, resampling positive instances from the history before the error point, and selecting equal-length key segment pairs for training, it reduces the training noise of session-level DPO and strictly eliminates the partition function \(Z\) through equal-length constraints, outperforming GPT-4o and all DPO variants on the SOTOPIA benchmark.

Background & Motivation

Background: LLM-based social agents can simulate human social behaviors, but perform poorly in complex, goal-oriented scenarios such as negotiation, cooperation, and competition. Direct Preference Optimization (DPO) has emerged as the dominant method for aligning LLM behavior with human preferences, with standard DPO optimizing "good/bad" response pairs at a single-turn level.

Limitations of Prior Work: Standard DPO only optimizes single-turn responses, failing to model policy continuity during multi-turn goal completion. Although session-level expansion methods (ETO, DMPO) extend the optimization scope to the entire session, they suffer from two key issues: (1) Coarse granularity—normal turns in negative sessions are treated as "bad" outputs, introducing significant training noise, and positive sessions are sampled from scratch where the interlocutor has an immense action space, meaning high scores for positive samples may stem from interlocutor behavior shifts rather than agent policy improvements. (2) Theoretical flaws—multi-turn DPO extensions cannot directly eliminate the partition function \(Z\): ETO lacks theoretical guarantees, while DMPO is forced to use heuristic normalization due to unequal lengths of positive and negative samples.

Key Challenge: The alignment of multi-turn social dialogues requires a broader optimization scope than single-turn to model policy continuity, yet must avoid the noise introduced by coarse-grained session-level methods. A "just-right granularity" located in between is needed.

Goal: To propose segment-level preference optimization granularity—covering multiple key turns to model policy continuity, precisely excluding irrelevant turns to mitigate noise, and strictly eliminating the partition function via equal-length constraints.

Key Insight: The authors observe that core improvement opportunities in multi-turn social dialogues are concentrated in specific "key segments"—the critical interaction window starting from an error turn leading to goal completion. By shifting the sampling start point to just before the error turn (reducing the interlocutor's action space) and extracting equal-length key segments, the causal consistency of positive samples is improved, and \(Z\) is theoretically eliminated.

Core Idea: By dynamically selecting equal-length key segments within multi-turn dialogues instead of entire sessions to construct preference pairs, the partition function is theoretically eliminated, and training noise is reduced, achieving more precise alignment for multi-turn social dialogues.

Method

Overall Architecture

The SDPO pipeline consists of three stages. (1) Behavior Cloning (BC): Fine-tuning Llama-3.1-8B on expert social dialogue data generated by GPT-4-turbo to obtain a base social agent. (2) Preference Data Construction: The base agent generates dialogues on SOTOPIA-π scenarios, and sessions with a goal completion score below a threshold of 7 are treated as potential negative samples. Segment-level preference pairs are constructed using a three-step pipeline: "error localization → positive sampling → segment selection". (3) SDPO Training: Optimizing preferences using equal-length segment pairs and the SDPO loss function.

Key Designs

  1. Three-step preference data construction pipeline:

    • Function: Automatically extracting high-quality segment-level preference pairs from negative sessions.
    • Mechanism: First, GPT-4o is used to locate the "error turn" \(e\) in negative sessions—turns critical to goal completion but still having room for improvement. Second, starting from the interaction history \(h_e\) before the error turn, 5 complete sessions are sampled, and the one with the highest goal/relationship score is selected as the positive sample (requiring the score to be higher than the negative one, otherwise discarded). Third, GPT-4o is utilized to select the "key segment" contributing most to the high score in the positive sample, and a matching segment of the same start point and equal length is cropped from the negative sample.
    • Design Motivation: Unlike the "sampling from scratch" in session-level approaches, sampling from before the error point significantly narrows the interlocutor's action space, making high scores of positive samples more likely driven by improvements in the agent's own strategy rather than stochastic changes in the interlocutor's behavior. Equal-length cropping provides the foundation for theoretically eliminating \(Z\).
  2. SDPO Loss Function:

    • Function: Strictly extending DPO to multi-turn scenarios by summing log probability ratios of all turns within a segment.
    • Mechanism: Based on the State-Action Occupancy Measure (SAOM) framework, dialogue history is treated as states and agent outputs as actions. The key insight is that when positive and negative segment lengths are equal (\(T_w = T_l = k\)), the partition function \(Z\) in the Bradley-Terry model is precisely canceled out in the reward difference of positive and negative samples, yielding a concise SDPO loss: \(L_{SDPO} = -\mathbb{E}\log\sigma[\sum_{t=e}^{e+k}\beta(\log\frac{\pi_\theta(y_t^w|h_t^w)}{\pi_{ref}(y_t^w|h_t^w)} - \log\frac{\pi_\theta(y_t^l|h_t^l)}{\pi_{ref}(y_t^l|h_t^l)})]\).
    • Design Motivation: Address the theoretical vulnerabilities of ETO (lacks theoretical guarantee) and DMPO (uses heuristic normalization to eliminate \(Z\)). While the equal-length constraint is a trade-off, experiments demonstrate that dynamically selected equal-length segments are sufficient to cover key interactions.
  3. Dynamic Segment Length Selection:

    • Function: Dynamically determining the optimal segment length based on the specific circumstances of each data pair.
    • Mechanism: Allow GPT-4o to freely select the segment range in positive samples that contributes most to high scores (with no length limit), and then crop a segment of equal length from negative samples. Consequently, segment lengths across different data pairs can vary (averaging around 3 turns), while ensuring that positive and negative segments within each individual pair are of equal length.
    • Design Motivation: In the ablation study, dynamic selection (Goal 8.56) outperforms fixed lengths of [3,3] (Goal 8.40) and [5,5] (Goal 8.34), while the unequal-length configuration [1,3] falls to Goal 7.77, below even single-turn DPO (7.95). These results support both dynamic selection and the equal-length constraint.
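
The three-step construction described in Key Design 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' released code: `Session`, `SegmentPair`, `build_pair`, and the three callables (`locate_error`, `resample`, `select_segment`) are hypothetical names, with the callables standing in for the GPT-4o error localization, the agent rollout from the pre-error history, and the GPT-4o segment selection. Only the goal score is used to rank candidates here, whereas the text mentions a goal/relationship score, and pairs whose negative session is too short to crop are simply discarded.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

GOAL_THRESHOLD = 7.0  # sessions scoring below this are candidate negatives (from the text)
NUM_SAMPLES = 5       # sessions resampled from the pre-error history (from the text)


@dataclass
class Session:
    turns: List[str]  # agent turns in order (interlocutor turns omitted for brevity)
    goal: float       # goal-completion score, 0-10


@dataclass
class SegmentPair:
    start: int           # index of the first turn of both segments
    chosen: List[str]    # key segment taken from the positive session
    rejected: List[str]  # equal-length segment cropped from the negative session


def build_pair(
    negative: Session,
    locate_error: Callable[[Session], int],                      # hypothetical judge call
    resample: Callable[[List[str]], Session],                    # hypothetical rollout
    select_segment: Callable[[Session, int], Tuple[int, int]],   # hypothetical judge call
) -> Optional[SegmentPair]:
    """Return one segment-level preference pair, or None if the session is unusable."""
    if negative.goal >= GOAL_THRESHOLD:
        return None  # not a negative session

    # Step 1: error localization.
    e = locate_error(negative)

    # Step 2: resample from the history before the error turn, keep the best session.
    history = negative.turns[:e]
    candidates = [resample(history) for _ in range(NUM_SAMPLES)]
    positive = max(candidates, key=lambda s: s.goal)
    if positive.goal <= negative.goal:
        return None  # no improvement over the negative session: discard

    # Step 3: pick the key segment [start, end) in the positive session and
    # crop the same span from the negative session.
    start, end = select_segment(positive, e)
    chosen = positive.turns[start:end]
    rejected = negative.turns[start:end]
    if not chosen or len(chosen) != len(rejected):
        return None  # assumption: drop pairs that cannot be made equal-length
    return SegmentPair(start=start, chosen=chosen, rejected=rejected)
```

Because the segment span is chosen per pair, lengths vary across pairs while each individual pair stays equal-length, which is the property the loss relies on.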

Loss & Training

The SDPO loss sums the standard DPO log-probability ratios over all turns within a segment. Training uses a batch size of 32, \(\beta=0.1\), and a learning rate of \(1\times10^{-6}\) with cosine decay and no warmup. The reference model is the base agent fine-tuned in the BC stage.
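
As a rough illustration of the loss for a single preference pair, the sketch below computes it in plain Python. It assumes each input is a list of per-turn log-probabilities (one value per agent turn in the segment, typically the sum of that turn's token log-probabilities) that has already been computed; the function names `segment_log_ratio` and `sdpo_loss` are illustrative and do not come from the paper's code.

```python
import math
from typing import Sequence


def segment_log_ratio(policy_logps: Sequence[float], ref_logps: Sequence[float]) -> float:
    """Sum over turns of log pi_theta(y_t | h_t) - log pi_ref(y_t | h_t)."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same turns")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def sdpo_loss(
    policy_w: Sequence[float],
    ref_w: Sequence[float],
    policy_l: Sequence[float],
    ref_l: Sequence[float],
    beta: float = 0.1,  # value reported in the text
) -> float:
    """Loss for one pair: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    if len(policy_w) != len(policy_l):
        # The equal-length constraint is what lets the partition term cancel.
        raise ValueError("chosen and rejected segments must have equal length")
    margin = beta * (
        segment_log_ratio(policy_w, ref_w) - segment_log_ratio(policy_l, ref_l)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, every ratio is zero and the loss is \(\log 2\). Raising the policy's probability of the chosen turns, or lowering it for the rejected turns, pushes the loss below that value.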

Key Experimental Results

Main Results

Goal completion (Goal, 0-10) and relationship dimension (Rel, -5 to 5) on the SOTOPIA benchmark:

| Method | Self-Chat Goal | Self-Chat Rel | vs GPT-4o Goal | vs GPT-4o Rel |
|---|---|---|---|---|
| GPT-4o | 7.90 | 2.67 | 7.90 | 2.67 |
| GPT-4-turbo | 8.18 | 2.96 | 7.92 | 2.79 |
| Llama-8B+BC | 7.81 | 3.05 | 7.53 | 2.78 |
| +DPO (Single-turn) | 7.95 | 3.28 | 7.80 | 2.97 |
| +ETO (Session-level) | 8.29 | 3.39 | 8.02 | 3.03 |
| +DMPO (Session-level) | 8.28 | 3.37 | 8.00 | 2.98 |
| +SDPO (Segment-level) | 8.56 | 3.69 | 8.13 | 3.16 |

Ablation Study

Segment length ablation (Self-Chat, based on Llama-8B+BC):

| Segment Length [Pos, Neg] | Goal | Rel | Description |
|---|---|---|---|
| [1,1] (=DPO) | 7.95 | 3.28 | Single-turn, baseline |
| [3,3] Fixed | 8.40 | 3.64 | Multi-turn effective |
| [5,5] Fixed | 8.34 | 3.60 | Diminishing marginal returns |
| [Dynamic, Dynamic] (SDPO) | 8.56 | 3.69 | Optimal |
| [1,3] Unequal | 7.77 | 3.08 | Collapse, validates equal-length necessity |
| [3,5] Unequal | 8.07 | 3.16 | Performance degradation |

Data source ablation:

| Data Source | Goal | Rel |
|---|---|---|
| Self-chat only | 8.42 | 3.56 |
| GPT-4o interaction only | 7.88 | 3.05 |
| Mixed (Self-chat + GPT-4o) | 8.56 | 3.69 |

Key Findings

  • Segment-level granularity is significantly superior to both extremes: SDPO (Goal 8.56) outperforms single-turn DPO (7.95) by ~7.7%, session-level ETO/DMPO (~8.28) by ~3.4%, and GPT-4o (7.90) by ~8.4%.
  • Equal-length constraints matter in practice as well as in theory: The unequal-length configuration [1,3] collapses to Goal 7.77, compared with 8.40 for fixed [3,3] and 7.81 for the BC-only base agent. This is consistent with the training signal becoming unreliable when the partition function \(Z\) cannot be eliminated.
  • High-quality data ≠ better results: Although session-level methods employ higher-scoring positive samples, SDPO achieves superior alignment via more precise segment selection and lower training noise.
  • Cross-model generalization: SDPO also outperforms all baselines on Mistral-v0.3 (Goal 8.48 vs ETO 8.30), suggesting that the method does not depend on a specific base model.
  • Alignment enhances social intelligence rather than 'cheating': Both Goal and Relationship scores improve simultaneously, indicating that SDPO does not achieve goals through anti-social behaviors such as threats or deception.
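
The relative gains quoted in the first finding can be reproduced from the self-chat Goal column of the main results table. The short check below is only a convenience for the reader; `relative_gain` is an illustrative helper, and ETO and DMPO are listed separately rather than averaged to ~8.28.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


SDPO_GOAL = 8.56  # self-chat Goal for SDPO, from the main results table

# Self-chat Goal scores of the compared methods, from the same table.
baselines = {
    "DPO (single-turn)": 7.95,
    "ETO (session-level)": 8.29,
    "DMPO (session-level)": 8.28,
    "GPT-4o": 7.90,
}

gains = {
    name: round(relative_gain(SDPO_GOAL, score), 1)
    for name, score in baselines.items()
}

for name, gain in gains.items():
    print(f"SDPO vs {name}: +{gain}%")
```

These are relative improvements on a 0-10 scale; in absolute terms the gaps range from 0.27 to 0.66 Goal points.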

Highlights & Insights

  • Segment-level granularity is an elegant solution for multi-turn alignment: Lying between single-turn and session-level granularity, it preserves multi-turn policy modeling capabilities while naturally resolving the theoretical puzzle of eliminating the partition function via equal-length constraints. This approach can be generalized to any multi-turn interaction alignment task.
  • Sampling from the error point rather than from scratch: Anchors the causal origin of the positive sample to the improvement of the agent's own strategy rather than stochastic changes in the interlocutor's behavior, making the preference signal cleaner.
  • Theoretical elegance: The derivation that strictly eliminates \(Z\) using equal-length constraints is highly concise and much more convincing than the heuristic normalization of DMPO.
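
The cancellation behind the last point can be sketched as follows. This is a reconstruction under an explicit assumption rather than the paper's exact derivation: each turn's reward is taken to have the DPO-style form with a log-partition term \(\log Z\) that is the same for every turn, \(\pi^*\) denotes the optimal policy, and \(t\) indexes turns within a segment of length \(T_w\) (chosen) or \(T_l\) (rejected).

\[
r(h_t, y_t) = \beta \log\frac{\pi^*(y_t \mid h_t)}{\pi_{ref}(y_t \mid h_t)} + \beta \log Z
\]

Summing over each segment and taking the difference that enters the Bradley-Terry model gives

\[
\sum_{t=1}^{T_w} r(h_t^w, y_t^w) - \sum_{t=1}^{T_l} r(h_t^l, y_t^l)
= \beta\left[\sum_{t=1}^{T_w}\log\frac{\pi^*(y_t^w \mid h_t^w)}{\pi_{ref}(y_t^w \mid h_t^w)} - \sum_{t=1}^{T_l}\log\frac{\pi^*(y_t^l \mid h_t^l)}{\pi_{ref}(y_t^l \mid h_t^l)}\right] + \beta\,(T_w - T_l)\log Z
\]

When \(T_w = T_l = k\), the final term is exactly zero, so no normalization heuristic is needed. When the lengths differ, a residual multiple of \(\log Z\) remains, which is the term that session-level methods have to approximate away.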

Limitations & Future Work

  • Flexibility limits of the equal-length constraint: Coercing positive and negative segments to be of equal length may result in losing useful information existing in unequal-length segments.
  • Dependence on GPT-4o for error localization and segment selection: The quality of these steps directly dictates the quality of training data, increasing costs and reliance on external models.
  • Evaluated only on SOTOPIA: Social dialogue serves as the sole evaluation scenario, lacking validation in other multi-turn contexts such as customer service or tutoring.
  • Unexplored iterative training: Currently only a single round of SDPO is performed; multi-round iterations (similar to Self-Play) could potentially yield further improvements.

Comparison with Related Methods

  • vs DPO: Standard DPO only optimizes a single turn, whereas SDPO models multi-turn policy continuity via segment-level extensions, giving a roughly 7.7% relative improvement in self-chat Goal (7.95 to 8.56).
  • vs ETO/DMPO: Session-level methods construct preference pairs utilizing the entire session, introducing noise due to coarse granularity; SDPO precisely isolates key segments and is theoretically more rigorous.
  • vs RLHF/PPO: SDPO maintains the advantages of the DPO family—obviating the need for reward models and RL training—while resolving the theoretical challenges of multi-turn scaling.

Rating

  • Novelty: ⭐⭐⭐⭐ The proposal of segment-level granularity is intuitive and theoretically backed, with a highly elegant derivation for eliminating \(Z\) via equal-length constraints.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Includes multi-baseline comparisons, multi-model validation, and multi-dimensional ablations (segment length, data sources, output length, etc.).
  • Writing Quality: ⭐⭐⭐⭐ The motivational derivation unfolds progressively, and the connection between theory and practice is tightly integrated.
  • Value: ⭐⭐⭐⭐ Provides a general segment-level framework for multi-turn alignment; the superior performance over GPT-4o possesses practical significance.