Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
| Conference | Area | Keywords |
|---|---|---|
| ACL 2025 | Other / Alignment & Fine-tuning | LLM Alignment, SFT, Preference Optimization, MDP, Residual Connection, Unified Framework |
TL;DR: By unifying SFT and Preference Optimization (PO) under an MDP framework, this paper shows that SFT is a special case of PO. It proposes Intuitive Fine-Tuning (IFT), which uses temporal residual connections to combine the data efficiency of SFT with the alignment performance of PO, achieving results close to or exceeding sequential SFT+PO using only positive samples and a single policy model.
Background & Motivation
Research Question: SFT and preference optimization (PO, e.g., DPO/PPO) are typically executed sequentially as two separate stages of alignment. Given the paradigm gap (mismatched loss functions, data formats, and auxiliary models), is it possible to unify them into a single process?
Limitations of Prior Work:
- SFT: Predicts the next token using ground truth prefixes, but these prefixes deviate from the model's own distribution, leading to biased preference estimation and suboptimal transition optimization.
- PPO: Provides unbiased preference estimation but requires reward models and online sampling, incurring high computational costs.
- DPO: Achieves theoretically optimal estimation but requires paired preference data (positive + negative samples), which is expensive to collect. Offline variants use negative samples generated by non-current policies, causing biased estimation.
- Existing Unification Attempts: Methods like ORPO and SimPO still require annotated preference data or reference models.
Core Motivation: Why is the preference estimation in SFT biased? Because when predicting the \(n\)-th token, the context uses the first \(n-1\) ground truth tokens, rather than the prefixes generated by the model itself. Can this bias be corrected without increasing data and computational costs?
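To make the gap concrete, here is a minimal sketch with made-up numbers. Teacher forcing conditions on the ground truth prefix as if the model had produced it with certainty, whereas by the chain rule the model's own probability of producing that prefix is the product of its per-token probabilities. The function name `prefix_probability` and the probability values are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def prefix_probability(token_probs: List[float]) -> float:
    """Probability that the model itself emits the whole prefix (chain rule)."""
    return math.prod(token_probs)


# Hypothetical probabilities the model assigns to each ground truth token.
gold_token_probs = [0.9, 0.8, 0.5]

model_prefix_prob = prefix_probability(gold_token_probs)
sft_assumed_prob = 1.0  # teacher forcing treats the prefix as certain

print(round(model_prefix_prob, 2), sft_assumed_prob)
```

The longer the prefix, the further the product can drift below 1, which is the mismatch the question above refers to.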
Method
Overall Architecture
IFT consists of three steps:
1. One-step Forward Inference: For each ground truth prefix, the current model predicts the next token to obtain the model's preference.
2. Intuitive Preference Estimation: Blend the embedding of the model-predicted token with the embedding of the ground truth token, weighted by \(\lambda\), constructing a prior state closer to the model's distribution.
3. Dynamic Relation Propagation: Reconstruct the loss function via cumulative summation, allowing the gradient of the current token to be influenced by the accuracy of future tokens.
Key Designs
- Temporal Residual Connection: \(\hat{s}_i^{\theta} = (1-\lambda) \cdot s_i^* + \lambda \cdot \pi_\theta(s_{i-1}^*)\). The embedding the model predicts from the previous ground truth state is passed forward as a residual, so the model perceives its own "overall response intuition" while still conditioned on the ground truth context.
- Unified MDP Perspective: Defines Preference Estimation and Transition Optimization, and shows that SFT implicitly assumes \(T_\theta(s_{n-1}^*, \rho_0)=1\) (i.e., prefixes are guaranteed to be generated by the model), which leads to an overestimated preference estimate.
- Positive-Only Samples: Unlike DPO, which requires positive-negative pairs, IFT only requires data in the same format and scale as standard SFT.
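The temporal residual connection above can be sketched as a position-wise blend of two embedding sequences. This is a minimal illustration in plain Python under stated assumptions: `blend_states`, the 2-dimensional toy embeddings, and `lam=0.25` are hypothetical, and a real implementation would operate on batched tensors inside the training loop.

```python
from typing import List

Vector = List[float]


def blend_states(gt_embeds: List[Vector], pred_embeds: List[Vector], lam: float) -> List[Vector]:
    """Blend ground truth and model-predicted embeddings position by position.

    gt_embeds[i]   plays the role of s_i^* (embedding of the ground truth token).
    pred_embeds[i] plays the role of pi_theta(s_{i-1}^*), the embedding the model
                   predicts for position i from the ground truth prefix.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(gt_embeds) != len(pred_embeds):
        raise ValueError("need one prediction per ground truth position")
    blended = []
    for gt, pred in zip(gt_embeds, pred_embeds):
        blended.append([(1.0 - lam) * g + lam * p for g, p in zip(gt, pred)])
    return blended


# Toy example: the model disagrees with the ground truth at position 0 only.
gt = [[1.0, 0.0], [0.0, 1.0]]
pred = [[0.0, 1.0], [0.0, 1.0]]
mixed = blend_states(gt, pred, lam=0.25)
print(mixed)
```

With `lam=0.0` the blend returns the ground truth embeddings unchanged, which corresponds to ordinary teacher forcing; larger values move the state toward the model's own prediction.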
Loss & Training
The loss function of IFT introduces cumulative summation on top of standard cross-entropy to achieve dynamic relation propagation:
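One form consistent with this description is given here as an assumption, not as a quotation of the paper's equation. It takes \(a_i^*\) to be the ground truth token that follows state \(s_i^*\), \(N\) to be the last position, \(T_\theta(a, s)\) to be the probability the model assigns to \(a\) from state \(s\), and \(\delta_\theta(s_i^*)\) to be the blended state \(\hat{s}_i^{\theta}\) from the temporal residual connection. The expectation over the dataset is omitted.

\[
\mathcal{L}_{\text{IFT}}(\theta) = -\sum_{n=0}^{N} \sum_{i=n}^{N} \log T_\theta\big(a_i^*, \delta_\theta(s_i^*)\big)
\]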
where \(\delta_\theta\) is the intuitive preference estimation function. This loss implicitly satisfies the Bellman equation, combining the effectiveness of RLHF with the efficiency of SFT. A decay factor \(\alpha\) can optionally be introduced to handle long sequences.
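The cumulative summation pattern can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: `cumulative_nll` is a hypothetical name, `gold_probs[i]` stands for the probability the model assigns to the i-th ground truth token given the blended state, and applying the decay as `alpha ** (i - n)` is one possible reading of the optional decay factor.

```python
import math
from typing import List


def cumulative_nll(gold_probs: List[float], alpha: float = 1.0) -> float:
    """Sum, over every start position n, the negative log-likelihood of tokens n..N.

    alpha is an assumed per-step decay on the distance (i - n);
    alpha = 1.0 means no decay.
    """
    nll = [-math.log(p) for p in gold_probs]
    total = 0.0
    for n in range(len(nll)):
        for i in range(n, len(nll)):
            total += (alpha ** (i - n)) * nll[i]
    return total


probs = [0.5, 0.5, 0.5]
full = cumulative_nll(probs)               # no decay
plain = cumulative_nll(probs, alpha=0.0)   # only the i == n terms survive
print(full, plain)
```

In this sketch `alpha=0.0` reduces to the ordinary summed cross-entropy, while `alpha=1.0` counts the token at position `i` once for every start position `n <= i`. The sketch shows only the summation pattern, not how gradients flow through the blended states during training.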
Key Experimental Results
Main Results (Open-LLM Leaderboard, Mistral-7B Backbone)
| Method | ARC | MMLU | TruthfulQA | WinoGrande | GSM8K | Average |
|---|---|---|---|---|---|---|
| SFT | 56.49 | 60.44 | 55.57 | 77.90 | 42.84 | 58.65 |
| DPO | 61.86 | 61.02 | 47.98 | 76.64 | 43.89 | 58.28 |
| ORPO | 56.66 | 60.57 | 51.77 | 77.19 | 42.30 | 57.70 |
| SimPO | 59.90 | 52.61 | 47.25 | 78.30 | 37.53 | 55.15 |
| IFT | 56.74 | 60.49 | 57.65 | 78.45 | 44.73 | 59.61 |
Generation Quality Evaluation (Alpaca-Eval)
| Method | Data Size | Preference Data | Reference Model | Win Rate | LC Win Rate |
|---|---|---|---|---|---|
| SFT | 120k | ✗ | ✗ | 82.56 | 78.32 |
| DPO | 120k | ✓ | ✓ | 74.00 | 73.12 |
| ORPO | 120k | ✓ | ✗ | 85.14 | 76.60 |
| IFT | 120k | ✗ | ✗ | 85.18 | 78.78 |
| SFT+DPO | 320k | ✓ | ✓ | 91.62 | 81.54 |
| SFT+IFT | 260k | ✗ | ✗ | 88.37 | 81.29 |
Key Findings
- Single-Stage IFT Outperforms SFT: Averaged over the 5 benchmarks reported in the main results table, IFT scores 59.61 vs SFT's 58.65, without requiring preference data.
- IFT Approaches the Performance of Sequential SFT+DPO Training: SFT+IFT (260k data) achieves an LC Win Rate of 81.29, close to the 81.54 achieved by SFT+DPO (320k data).
- Significant Advantage on TruthfulQA: IFT's score of 57.65 is well above DPO (47.98) and ORPO (51.77), suggesting that IFT is better at preserving truthfulness.
- Data Efficiency: With 120k non-preference samples, single-stage IFT reaches an LC Win Rate of 78.78, ahead of SFT (78.32), ORPO (76.60), and DPO (73.12) at the same data size, and within 3 points of SFT+DPO trained on 320k samples (81.54).
- Frozen Lake Experimental Validation: Experiments in the interpretable Frozen Lake environment confirm that IFT learns competitive policies.
Highlights & Insights
- Elegant Theory: Unifies SFT and PO under the MDP framework, revealing the root cause of SFT bias.
- Simple and Effective Design: The temporal residual connection is easy to implement and requires no additional models or data.
- Positive-Only + Single-Policy Model: Lowers the data and compute barrier for alignment, since neither paired preference data nor a reference model is needed.
- Strong Performance: Best TruthfulQA score among the compared methods (57.65) and the highest LC Win Rate among the 120k single-stage methods on Alpaca-Eval (78.78).
Limitations & Future Work
- The theoretical analysis relies on idealized MDP assumptions, whereas actual language generation is far more complex than such an MDP captures.
- Hyperparameter \(\lambda\) requires tuning, and its sensitivity is not fully discussed in the paper.
- One-step forward inference roughly doubles computational cost (about \(2\times\)), though this is still lower than the online sampling used in PPO or online DPO.
- Primarily validated on 7B-8B models, lacking thorough testing on larger-scale models.
Related Work & Insights
- SFT: Standard supervised fine-tuning, using teacher forcing with ground truth.
- PPO: Schulman et al. (2017), online policy optimization with a reward model.
- DPO: Rafailov et al. (2024), merging reward modeling and policy optimization.
- ORPO/SimPO/TDPO: Intermediate solutions attempting to unify SFT and PO, but still requiring preference data.
- Unlikelihood Training: Welleck et al. (2019), introducing penalties for negative samples in SFT.
Rating
| Dimension | Score (1-5) |
|---|---|
| Novelty | 5 |
| Theoretical Depth | 5 |
| Experimental Thoroughness | 4 |
| Writing Quality | 4 |
| Practical Value | 4 |
| Overall Rating | 4.4 |