TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning¶

Conference: ACL 2025
arXiv: 2503.04381
Code: https://github.com/d223302/TRACT
Area: LLM Reasoning
Keywords: LLM-as-a-Judge, Regression-Aware Fine-tuning, Chain-of-Thought, Numerical Prediction, Self-generated CoT

TL;DR¶

Proposes TRACT, a two-stage regression-aware fine-tuning method that combines CoT reasoning with a squared-error regression loss to improve numerical scoring accuracy in LLM-as-a-judge scenarios. It outperforms approaches trained with cross-entropy alone or with a regression loss alone.

Background & Motivation¶

Background:
- LLM-as-a-Judge has become the mainstream paradigm for automated text evaluation, where models score text outputs against fine-grained scoring criteria (e.g., on a 1-5 scale).
- Existing methods typically fine-tune LLMs with cross-entropy (CE) loss, prompting them to generate a CoT analysis before outputting the score.
- RAFT (Regression-Aware Fine-Tuning) has been shown to improve performance on numerical regression tasks but does not consider CoT reasoning.

Limitations of Prior Work:
- CE loss ignores numerical distance: With a ground-truth score of 1, a model that puts its probability mass on 5 incurs the same penalty as one that puts it on 2, even though the numerical error is four times larger.
- RAFT lacks CoT: Although RAFT introduces a squared-error loss to improve regression prediction, it does not leverage CoT reasoning, which is crucial for LLM-as-a-Judge.
- Distribution shift in CoT sources: During training, CoTs come from GPT-4 annotations, whereas at inference the model generates its own CoT, so the two distributions differ.
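The first limitation can be checked with a small calculation. The sketch below uses two made-up predictive distributions over the scores 1-5 (illustrative values, not taken from the paper): both assign probability 0.1 to the true score 1, but one concentrates the remaining mass on 2 and the other on 5.

```python
import math

SCORES = [1, 2, 3, 4, 5]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the ground-truth score."""
    return -math.log(probs[SCORES.index(target)])

def expected_score(probs):
    """Probability-weighted mean of the candidate scores."""
    return sum(p * y for p, y in zip(probs, SCORES))

def squared_error(probs, target):
    """Squared distance between the expected score and the ground truth."""
    return (expected_score(probs) - target) ** 2

target = 1
near_miss = [0.1, 0.9, 0.0, 0.0, 0.0]  # illustrative: most mass on score 2
far_miss = [0.1, 0.0, 0.0, 0.0, 0.9]   # illustrative: most mass on score 5

ce_near, ce_far = cross_entropy(near_miss, target), cross_entropy(far_miss, target)
se_near, se_far = squared_error(near_miss, target), squared_error(far_miss, target)

print(f"CE: near={ce_near:.3f} far={ce_far:.3f}")
print(f"SE: near={se_near:.2f} far={se_far:.2f}")
```

CE depends only on the probability given to the true score, so it is identical for both distributions, while the squared error separates the near miss from the far miss.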

Key Challenge:
- CoT reasoning and regression-aware training each have their own advantages, but how to combine them effectively remains an open question.
- There is a significant distribution shift between GPT-4-generated CoTs and the self-generated CoTs of the fine-tuned model.

Goal:
- How to leverage both CoT reasoning capability and regression-aware loss in LLM-as-a-Judge fine-tuning.
- How to alleviate the distribution shift problem of CoT sources between the training and inference stages.

Key Insight:
- Use CE loss for CoT learning and RAFT loss for score prediction, combining them into a CoT-RAFT objective.
- Employ a two-stage training strategy where the second stage uses model-self-generated CoT to replace external annotations.

Core Idea: Combine CoT reasoning with numerical prediction through two-stage fine-tuning that pairs self-generated CoTs with a regression-aware objective.

Method¶

Overall Architecture¶

TRACT (Two-stage Regression-Aware fine-tuning with CoT) consists of two stages:
- Stage 1: Fine-tune the seed LLM on GPT-4-annotated CoTs with the CoT-RAFT objective to obtain the model \(p_s\).
- Stage 2: Use \(p_s\) to generate CoTs that replace GPT-4's CoTs, then fine-tune the seed LLM again on this data to obtain the final model \(p_{\text{tract}}\).

Key Designs¶

CoT-RAIL Predictor (CR Predictor):
- Function: First generates CoT during inference, then performs a weighted sum over possible scores instead of argmax decoding.
- Mechanism: \(\hat{y}_{CR}(x) = \sum_{y \in \mathcal{Y}} p(\text{str}(y) | [x, \hat{s}]) \cdot y\), where \(\hat{s} \sim p(\cdot|x)\)
- Design Motivation: RAIL weighted averaging outperforms argmax decoding, and CoT provides better context conditioning; combining both yields complementary benefits.
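A minimal sketch of the weighted-sum step, assuming the model exposes logits for the five score tokens after the prompt and the sampled CoT. The function names and the logit values are hypothetical, and renormalizing over only the candidate tokens is a simplification made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rail_predict(score_logits, scores=(1, 2, 3, 4, 5)):
    """Expected score under the distribution over candidate score tokens."""
    probs = softmax(score_logits)
    return sum(p * y for p, y in zip(probs, scores))

def argmax_predict(score_logits, scores=(1, 2, 3, 4, 5)):
    """Standard decoding: return the single most likely score."""
    best = max(range(len(scores)), key=lambda i: score_logits[i])
    return scores[best]

# Hypothetical logits for the tokens "1".."5", conditioned on [x, sampled CoT].
logits = [0.0, 0.0, 2.0, 2.1, 0.0]

rail_score = rail_predict(logits)
hard_score = argmax_predict(logits)
print(f"argmax={hard_score} weighted={rail_score:.3f}")
```

When the model is nearly undecided between 3 and 4, argmax commits to 4, whereas the weighted sum returns a value between the two that reflects the uncertainty.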
CoT-RAFT Training Objective:
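- Function: A weighted combination of a CE loss (for learning the CoT) and a RAFT squared-error loss (for learning score prediction).
- Mechanism: \(\ell_{\text{CoT-RAFT}}^{\lambda} = \lambda \left(\sum_{y \in \mathcal{Y}} p(\text{str}(y) | [x,\hat{s}]) \cdot y - y^*\right)^2 - \log p([\hat{s}, y^*] | x)\), where \(y^*\) is the ground-truth score and \(\hat{s}\) is the CoT paired with the training example.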
- Design Motivation: CE ensures the quality of CoT generation, while RAFT ensures score prediction is sensitive to numerical distance.
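The objective can be sketched for a single training example as follows. This is an illustration under stated assumptions, not the authors' implementation: `score_probs` stands for \(p(\text{str}(y) | [x, \hat{s}])\) over the five scores, `cot_token_logprobs` stands for the per-token log-probabilities of the CoT given the input, and each score is assumed to be a single token.

```python
import math

def cot_raft_loss(score_probs, cot_token_logprobs, target, lam=1.0,
                  scores=(1, 2, 3, 4, 5)):
    """lam * (expected score - target)^2 - log p([CoT, target] | x)."""
    # Regression term: squared error of the probability-weighted score.
    expected = sum(p * y for p, y in zip(score_probs, scores))
    regression = (expected - target) ** 2
    # CE term: log p([s, y*] | x) factorizes into the CoT token log-probs
    # plus the log-probability of the target score given [x, s].
    log_p_target = math.log(score_probs[scores.index(target)])
    cross_entropy = -(sum(cot_token_logprobs) + log_p_target)
    return lam * regression + cross_entropy

# Hypothetical values for one example with ground-truth score 4.
probs = [0.05, 0.05, 0.20, 0.60, 0.10]
cot_logprobs = [-0.2, -0.1, -0.3]

ce_only = cot_raft_loss(probs, cot_logprobs, target=4, lam=0.0)
combined = cot_raft_loss(probs, cot_logprobs, target=4, lam=1.0)
print(f"CE only={ce_only:.4f} CoT-RAFT={combined:.4f}")
```

Setting `lam=0` recovers plain CE training on the CoT and score, so \(\lambda\) directly controls how much the numerical distance contributes.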
Two-Stage Self-Generated CoT Strategy:
- Function: Train with external CoT in Stage 1, and retrain using CoTs generated by the Stage 1 model in Stage 2.
- Mechanism: Replace the GPT-4 CoTs with CoTs sampled from the Stage 1 model, so that training CoTs follow the model's own distribution and the training-inference distribution shift is reduced.
- Design Motivation: Since CoT is generated by the model during inference, using self-generated CoT during training maintains distributional consistency.
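The data flow of the two stages can be sketched as below. `fine_tune` and `generate_cot` are hypothetical callables standing in for a training loop with the CoT-RAFT objective and an LLM sampler; the toy versions exist only so the example runs.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str, int]  # (input x, CoT s, ground-truth score y*)

def tract_two_stage(seed_model, gpt4_data: List[Example],
                    fine_tune: Callable, generate_cot: Callable):
    """Return (p_s, p_tract, stage2_data) for the two-stage recipe."""
    # Stage 1: fine-tune the seed model on GPT-4-annotated CoTs.
    p_s = fine_tune(seed_model, gpt4_data)
    # Stage 2: swap each GPT-4 CoT for one generated by p_s; keep x and y*.
    stage2_data = [(x, generate_cot(p_s, x), y) for x, _cot, y in gpt4_data]
    # Restart from the seed model rather than continuing from p_s.
    p_tract = fine_tune(seed_model, stage2_data)
    return p_s, p_tract, stage2_data

# Toy stand-ins (assumptions for illustration only).
def toy_fine_tune(model, data):
    return {"init": model, "num_examples": len(data)}

def toy_generate_cot(model, x):
    return f"self-generated rationale for: {x}"

gpt4_data = [
    ("response A", "GPT-4 rationale A", 4),
    ("response B", "GPT-4 rationale B", 2),
]
p_s, p_tract, stage2_data = tract_two_stage(
    "seed-llm", gpt4_data, toy_fine_tune, toy_generate_cot
)
print(p_tract["init"])
print(stage2_data[0][1])
```

Only the CoT field changes between the stages; the inputs and ground-truth scores are reused as they are.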

Loss & Training¶

Loss Function: CoT-RAFT = \(\lambda \times (\text{RAIL prediction} - \text{ground truth score})^2 - \log p(\text{CoT} + \text{score} \mid \text{input})\)
Hyperparameter \(\lambda\): Controls the regression loss weight, selected via the validation set in experiments.
Training Data: Feedback Collection (approx. 100K samples), with CoTs initially generated by GPT-4.
Initialization: Both stages start from the seed LLM. Stage 2 does not continue training from \(p_s\), which avoids overfitting to the Stage 1 parameters.

Key Experimental Results¶

Main Results¶

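Evaluated on four LLM-as-a-Judge datasets using Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct. Here r denotes the Pearson correlation and ρ the Spearman correlation with the ground-truth scores: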

Method	Feedback Bench (r)	FLASK (r)	Vicuna Bench (r)	MT Bench (r)	Average (r/ρ)
CE (w/o CoT)	0.890	0.355	0.429	0.279	0.488/0.483
CE (w/ CoT)	0.872	0.413	0.463	0.480	0.557/0.554
RAFT	0.932	0.509	0.567	0.483	0.623/0.605
Prometheus-2-7B	0.845	0.512	0.488	0.519	0.591/0.576
TRACT	0.931	0.518	0.593	0.555	0.650/0.628

TRACT achieves an average Pearson correlation coefficient of 0.650, which is an improvement of 0.027 over RAFT and 0.059 over Prometheus-2-7B.
On MT Bench, TRACT improves by 0.072 (r) and 0.060 (ρ) compared to RAFT.
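For readers who want to reproduce this kind of comparison, the two correlation metrics can be computed as in the sketch below, assuming r is the Pearson correlation and ρ the Spearman rank correlation between predicted and ground-truth scores. The score lists are made-up toy values, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy example: ground-truth scores vs. real-valued judge predictions.
gold = [1, 2, 3, 4, 5, 3]
pred = [1.4, 2.1, 2.8, 4.3, 4.6, 3.5]
r, rho = pearson(gold, pred), spearman(gold, pred)
print(f"r={r:.3f} rho={rho:.3f}")
```

Because ground-truth scores are integers on a small scale, ties are common, which is why the rank step averages tied positions.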

Key Findings¶

Self-generated CoT is crucial: The variant trained on GPT-4 CoTs (A.1) achieves an average r of only 0.556, whereas TRACT with self-generated CoTs reaches 0.650, a gap of 0.094.
Both components are indispensable: Removing RAFT to use pure CE (A.2) drops the average r to 0.617; removing CoT to use pure RAFT yields an average r of 0.623.
TRACT transfers to RewardBench: Despite being trained only on pointwise scoring formats, it remains competitive on the pairwise comparison dataset RewardBench.
Modest inference overhead: Compared to standard CoT decoding, TRACT adds only a RAIL-weighted average over the 5 candidate scores.

Highlights & Insights¶

Regression versus classification: The work highlights a basic limitation of treating numerical prediction as a classification task: CE loss is insensitive to numerical distance.
Importance of self-generated CoT: The ablation indicates that matching the training and inference CoT distributions matters more than the quality of the CoT alone.
Clear modular design: The CoT-RAFT objective elegantly combines CE and regression losses, each playing its designated role.
Fully open-source code and models: Facilitates replication and adoption.

Limitations & Future Work¶

Validated only on LLM-as-a-Judge tasks; not yet extended to other numerical prediction scenarios (e.g., the STS-B regression task).
The two-stage training increases computational overhead as it requires generating self-CoTs before retraining.
The impact of CoT quality on final performance is not thoroughly analyzed; self-generated CoT might introduce systematic biases.
Evaluated only on 7B/8B scale models; the effectiveness on larger or smaller models remains unknown.
The \(\lambda\) hyperparameter must be tuned on a validation set, which adds tuning cost.

Related Work & Insights¶

RAFT (Lukasik et al., 2025): The foundational study on regression-aware fine-tuning, which this work extends with CoT reasoning.
Prometheus-2 (Kim et al., 2024b): A SOTA 7B-scale LLM-as-a-Judge at the time of the paper, obtained through model merging.
RAIL (Lukasik et al., 2024): The zero-shot version of regression-aware inference, employing weighted expectations instead of argmax.
Insight: In other LLM tasks requiring numerical outputs (e.g., reward model training, regression annotations), similar regression-aware + CoT strategies could be equally effective.

Rating¶

Metric	Score (1-10)
Novelty	7
Technical Depth	8
Experimental Thoroughness	8
Writing Quality	8
Value	8
Overall Score	7.8