TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning
Conference: ACL 2025
arXiv: 2503.04381
Code: https://github.com/d223302/TRACT
Area: LLM Reasoning
Keywords: LLM-as-a-Judge, Regression-Aware Fine-tuning, Chain-of-Thought, Numerical Prediction, Self-generated CoT
TL;DR
Proposes TRACT, a two-stage regression-aware fine-tuning method that combines CoT reasoning with regression loss (squared error) to improve numerical scoring accuracy in LLM-as-a-judge scenarios, significantly outperforming existing approaches using only cross-entropy training or only regression loss.
Background & Motivation
Background:
- LLM-as-a-Judge has become the mainstream paradigm for automated text evaluation, where models score text outputs against fine-grained scoring criteria (e.g., on a 1-5 scale).
- Existing methods typically fine-tune LLMs with cross-entropy (CE) loss, prompting them to generate a CoT analysis before outputting the score.
- RAFT (Regression-Aware Fine-Tuning) has been shown to improve performance on numerical regression tasks but does not consider CoT reasoning.
Limitations of Prior Work:
- CE loss ignores numerical distance: given a ground-truth score of 1, predictions of 5 and 2 receive the same penalty, although their numerical errors differ greatly.
- RAFT lacks CoT: although RAFT introduces a squared error loss to improve regression prediction, it does not leverage CoT reasoning, which is crucial for LLM-as-a-Judge.
- Distribution shift in CoT sources: during training, the CoT comes from GPT-4 annotations, while during inference it is generated by the model itself, so the two distributions are inconsistent.
Key Challenge:
- CoT reasoning and regression-aware training each have their own advantages, but how to combine them effectively remains an open question.
- There is a significant distribution shift between GPT-4-generated CoTs and the self-generated CoTs of the fine-tuned model.
Goal:
- Leverage both CoT reasoning capability and a regression-aware loss in LLM-as-a-Judge fine-tuning.
- Alleviate the distribution shift in CoT sources between the training and inference stages.
Key Insight:
- Use CE loss for CoT learning and RAFT loss for score prediction, combining them into a CoT-RAFT objective.
- Employ a two-stage training strategy in which the second stage replaces the external annotations with CoTs generated by the model itself.
Core Idea:
- Combine CoT reasoning and numerical prediction through two-stage training on self-generated CoTs with regression-aware fine-tuning.
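The CE limitation above can be checked with a small numerical example. The sketch below is illustrative only: the two probability vectors are hypothetical predictors, not outputs from the paper, and the squared error is taken on the probability-weighted mean score, which is how the regression-aware loss is described in this note.

```python
import math

SCORES = [1, 2, 3, 4, 5]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the ground-truth score."""
    return -math.log(probs[SCORES.index(target)])

def squared_error_of_mean(probs, target):
    """Squared error between the probability-weighted mean score and the target."""
    expected = sum(p * y for p, y in zip(probs, SCORES))
    return (expected - target) ** 2

target = 1
# Both hypothetical predictors put 0.1 on the true score 1;
# the remaining 0.9 goes to score 2 (near miss) or score 5 (far miss).
near_miss = [0.1, 0.9, 0.0, 0.0, 0.0]
far_miss = [0.1, 0.0, 0.0, 0.0, 0.9]

ce_near = cross_entropy(near_miss, target)
ce_far = cross_entropy(far_miss, target)
se_near = squared_error_of_mean(near_miss, target)
se_far = squared_error_of_mean(far_miss, target)

print(f"CE: near={ce_near:.3f} far={ce_far:.3f}")
print(f"SE: near={se_near:.3f} far={se_far:.3f}")
```

CE only looks at the probability of the true score, so both predictors receive the same CE loss, while the squared error separates the near miss from the far miss.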
Method
Overall Architecture
TRACT (Two-stage Regression-Aware fine-tuning with CoT) consists of two stages:
- Stage 1: Fine-tune the seed LLM on GPT-4-annotated CoTs with the CoT-RAFT objective to obtain the model \(p_s\).
- Stage 2: Use \(p_s\) to generate CoTs that replace GPT-4's CoTs, then retrain from the seed LLM on these self-generated CoTs to obtain the final model \(p_{\text{tract}}\).
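The two stages can be sketched as a small driver function. This is a minimal sketch, not the authors' code: `tract_two_stage`, `fine_tune`, and `generate_cot` are hypothetical names, the toy stand-ins only exist so the example runs, and keeping the ground-truth scores when swapping in self-generated CoTs is an assumption.

```python
from typing import Any, Callable, List, Tuple

# (input x, CoT s, ground-truth score y*)
Example = Tuple[str, str, int]

def tract_two_stage(
    seed_model: Any,
    annotated_data: List[Example],
    fine_tune: Callable[[Any, List[Example]], Any],
    generate_cot: Callable[[Any, str], str],
):
    """Run both stages with caller-supplied training and sampling functions."""
    # Stage 1: fine-tune the seed model on externally annotated CoTs -> p_s.
    p_s = fine_tune(seed_model, annotated_data)
    # Swap each annotated CoT for one generated by p_s; scores are kept (assumption).
    self_cot_data = [(x, generate_cot(p_s, x), y) for x, _, y in annotated_data]
    # Stage 2: retrain from the seed model (not from p_s) on the self-generated CoTs.
    p_tract = fine_tune(seed_model, self_cot_data)
    return p_s, p_tract, self_cot_data

# Toy stand-ins so the sketch runs end to end; a real setup would call an LLM trainer.
def toy_fine_tune(model, data):
    return {"init": model, "num_examples": len(data)}

def toy_generate_cot(model, x):
    return f"self-CoT for {x!r}"

data = [("response A", "GPT-4 CoT A", 4), ("response B", "GPT-4 CoT B", 2)]
p_s, p_tract, self_cot_data = tract_two_stage(
    "seed", data, toy_fine_tune, toy_generate_cot
)
```

Passing the trainer and sampler in as arguments keeps the sketch independent of any particular training framework.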
Key Designs
- CoT-RAIL Predictor (CR Predictor):
  - Function: At inference, first generates a CoT, then takes a probability-weighted sum over the possible scores instead of argmax decoding.
  - Mechanism: \(\hat{y}_{CR}(x) = \sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x, \hat{s}]) \cdot y\), where \(\hat{s} \sim p(\cdot \mid x)\) is the sampled CoT.
  - Design Motivation: RAIL weighted averaging outperforms argmax decoding, and the CoT provides better context to condition on; the two benefits are complementary.
- CoT-RAFT Training Objective:
  - Function: A weighted combination of CE loss (for learning the CoT) and RAFT loss (for learning score prediction).
  - Mechanism: \(\ell_{\text{CoT-RAFT}}^{\lambda} = \lambda \left(\sum_{y \in \mathcal{Y}} p(\text{str}(y) \mid [x,\hat{s}]) \cdot y - y^*\right)^2 - \log p([\hat{s}, y^*] \mid x)\), where \(y^*\) is the ground-truth score and \(\hat{s}\) is the training CoT.
  - Design Motivation: CE ensures the quality of CoT generation, while RAFT makes score prediction sensitive to numerical distance.
- Two-Stage Self-Generated CoT Strategy:
  - Function: Train with external CoTs in Stage 1, then retrain in Stage 2 using CoTs generated by the Stage 1 model.
  - Mechanism: Replace the GPT-4 CoT with a CoT drawn from the model's own distribution to reduce the training-inference distribution shift.
  - Design Motivation: Since the CoT is generated by the model during inference, training on self-generated CoTs keeps the two distributions consistent.
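The CR predictor's final step can be illustrated as follows. This is a minimal sketch: `score_probs` holds hypothetical values of \(p(\text{str}(y) \mid [x, \hat{s}])\) for the five candidate scores, as if read off the model after a sampled CoT, and the function names are made up for illustration.

```python
def cr_predict(score_probs):
    """RAIL-style prediction: probability-weighted sum over candidate scores."""
    return sum(p * y for y, p in score_probs.items())

def argmax_predict(score_probs):
    """Standard decoding baseline: pick the single most likely score."""
    return max(score_probs, key=score_probs.get)

# Hypothetical score-token probabilities after a sampled CoT (not from the paper).
score_probs = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10}

weighted = cr_predict(score_probs)      # continuous prediction between 3 and 4
decoded = argmax_predict(score_probs)   # discrete prediction
print(weighted, decoded)
```

Argmax discards the probability mass on score 4, whereas the weighted sum moves the prediction toward it; this is the difference the predictor is designed to exploit.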
Loss & Training
- Loss Function: CoT-RAFT = \(\lambda \times (\text{RAIL prediction} - \text{ground truth score})^2 - \log p(\text{CoT} + \text{score} \mid \text{input})\)
- Hyperparameter \(\lambda\): Controls the regression loss weight, selected via the validation set in experiments.
- Training Data: Feedback Collection (approx. 100K samples), with CoTs initially generated by GPT-4.
- Both stages initialize from the seed LLM: Stage 2 does not continue from \(p_s\), which avoids overfitting to the Stage 1 parameters.
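The loss above can be written out for a single training example. This is a sketch under stated assumptions: `cot_raft_loss` is a hypothetical helper, the probabilities and token log-probabilities are invented, and the score is assumed to be a single token so that \(\log p([\hat{s}, y^*] \mid x)\) is the sum of the CoT token log-probabilities plus the log-probability of the score token.

```python
import math

def cot_raft_loss(score_probs, y_star, seq_log_prob, lam):
    """lam * (RAIL prediction - y*)^2 - log p([CoT, y*] | x) for one example."""
    rail_pred = sum(p * y for y, p in score_probs.items())
    regression = (rail_pred - y_star) ** 2
    cross_entropy = -seq_log_prob
    return lam * regression + cross_entropy

# Hypothetical model outputs for one example (not values from the paper).
score_probs = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10}
cot_token_log_probs = [-0.2, -0.1, -0.3]
y_star = 4

# log p([s, y*] | x) = sum of CoT token log-probs + log p(str(y*) | [x, s]).
seq_log_prob = sum(cot_token_log_probs) + math.log(score_probs[y_star])

loss = cot_raft_loss(score_probs, y_star, seq_log_prob, lam=1.0)
ce_only = cot_raft_loss(score_probs, y_star, seq_log_prob, lam=0.0)
print(loss, ce_only)
```

Setting `lam=0.0` recovers plain CE training on the CoT and score, which makes the role of \(\lambda\) as the regression weight explicit.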
Key Experimental Results
Main Results
Evaluated on four LLM-as-a-Judge datasets using Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct:
| Method | Feedback Bench (r) | FLASK (r) | Vicuna Bench (r) | MT Bench (r) | Average (r/ρ) |
|---|---|---|---|---|---|
| CE (w/o CoT) | 0.890 | 0.355 | 0.429 | 0.279 | 0.488/0.483 |
| CE (w/ CoT) | 0.872 | 0.413 | 0.463 | 0.480 | 0.557/0.554 |
| RAFT | 0.932 | 0.509 | 0.567 | 0.483 | 0.623/0.605 |
| Prometheus-2-7B | 0.845 | 0.512 | 0.488 | 0.519 | 0.591/0.576 |
| TRACT | 0.931 | 0.518 | 0.593 | 0.555 | 0.650/0.628 |
- TRACT achieves an average Pearson correlation coefficient of 0.650, which is an improvement of 0.027 over RAFT and 0.059 over Prometheus-2-7B.
- On MT Bench, TRACT improves by 0.072 (r) and 0.060 (ρ) compared to RAFT.
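For readers unfamiliar with the two metrics, the sketch below computes both on toy data. It assumes that ρ in the table denotes Spearman's rank correlation (r is stated to be Pearson's); the score lists are hypothetical, and the helper functions are written out in plain Python rather than taken from any library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical human scores and continuous judge predictions.
human = [1, 2, 3, 4, 5]
judge = [1.2, 2.1, 2.9, 4.4, 4.6]
r = pearson(human, judge)
rho = spearman(human, judge)
print(r, rho)
```

Because the toy predictions preserve the human ordering, the rank correlation is perfect even though the linear correlation is slightly below 1.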
Key Findings
- Self-generated CoT is crucial: The variant trained on GPT-4 CoTs (A.1) achieves an average r of only 0.556, whereas the self-generated CoT version (TRACT) reaches 0.650, a gap of 0.094.
- Both components matter: Replacing the RAFT term with pure CE (A.2) drops the average r to 0.617; removing CoT and using pure RAFT yields an average r of 0.623.
- TRACT transfers to RewardBench: Despite being trained only on the pointwise scoring format, it remains competitive on the pairwise comparison dataset RewardBench.
- Modest inference overhead: Compared to standard CoT decoding, TRACT only requires additional RAIL weighting over the 5 candidate scores.
Highlights & Insights
- Regression versus classification: The work makes explicit a basic limitation of treating numerical prediction as a classification task: CE loss is insensitive to numerical distance.
- Importance of self-generated CoT: Provides compelling evidence that matching the training-inference distribution is more critical than the sheer quality of the CoT.
- Clear modular design: The CoT-RAFT objective elegantly combines CE and regression losses, each playing its designated role.
- Fully open-source code and models: Facilitates replication and adoption.
Limitations & Future Work
- Validated only on LLM-as-a-Judge tasks; not yet extended to other numerical prediction scenarios (e.g., the STS-B regression task).
- The two-stage training increases computational overhead as it requires generating self-CoTs before retraining.
- The impact of CoT quality on final performance is not thoroughly analyzed; self-generated CoT might introduce systematic biases.
- Evaluated only on 7B/8B scale models; the effectiveness on larger or smaller models remains unknown.
- The \(\lambda\) hyperparameter must be tuned on a validation set, which adds tuning cost.
Related Work & Insights
- RAFT (Lukasik et al., 2025): The foundational study on regression-aware fine-tuning, upon which this work introduces CoT.
- Prometheus-2 (Kim et al., 2024b): Currently the SOTA 7B-scale LLM-as-a-Judge, achieved through model merging.
- RAIL (Lukasik et al., 2024): The zero-shot version of regression-aware inference, employing weighted expectations instead of argmax.
- Insight: Other LLM tasks that require numerical outputs (e.g., reward model training, regression annotations) may benefit from a similar regression-aware + CoT strategy.
Rating
| Metric | Score (1-10) |
|---|---|
| Novelty | 7 |
| Technical Depth | 8 |
| Experimental Thoroughness | 8 |
| Writing Quality | 8 |
| Value | 8 |
| Overall Score | 7.8 |