TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning

Conference: ACL 2025
arXiv: 2503.04381
Code: https://github.com/d223302/TRACT
Area: LLM Reasoning
Keywords: LLM-as-a-Judge, Regression-Aware Fine-tuning, Chain-of-Thought, Numerical Prediction, Self-generated CoT

TL;DR

Proposes TRACT, a two-stage regression-aware fine-tuning method that combines CoT reasoning with regression loss (squared error) to improve numerical scoring accuracy in LLM-as-a-judge scenarios, significantly outperforming existing approaches using only cross-entropy training or only regression loss.

Background & Motivation

Background:

  • LLM-as-a-Judge has become the mainstream paradigm for automated text evaluation: a model scores text outputs against fine-grained scoring criteria (e.g., on a 1-5 scale).
  • Existing methods typically fine-tune LLMs with cross-entropy (CE) loss, prompting them to generate a CoT analysis before outputting the score.
  • RAFT (Regression-Aware Fine-Tuning) has been shown to improve performance on numerical regression tasks but does not consider CoT reasoning.

Limitations of Prior Work:

  • CE loss ignores numerical distance: CE only rewards probability placed on the correct score token, so given a ground-truth score of 1, a prediction of 5 is penalized the same as a prediction of 2, although the numerical errors are 4 and 1.
  • RAFT lacks CoT: Although RAFT introduces a squared error loss to improve regression prediction, it does not leverage CoT reasoning, which is crucial for LLM-as-a-Judge.
  • Distribution shift in CoT sources: During training, CoTs come from GPT-4 annotations, while during inference, CoTs are generated by the model itself, so the two distributions do not match.
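The first limitation can be checked with a small worked example. This is a minimal sketch, not code from the paper: the two predicted distributions below are made-up numbers chosen so that both put probability 0.1 on the correct score, and the squared error is taken on the probability-weighted expected score.

```python
import math

SCORES = [1, 2, 3, 4, 5]
TARGET = 1  # ground-truth score


def cross_entropy(probs, target):
    """Negative log-probability assigned to the ground-truth score."""
    return -math.log(probs[SCORES.index(target)])


def expected_score(probs):
    """Probability-weighted average over the candidate scores."""
    return sum(p * y for p, y in zip(probs, SCORES))


def squared_error(probs, target):
    return (expected_score(probs) - target) ** 2


# Hypothetical predictions: same mass on the correct score (0.1),
# but the remaining mass sits on 2 (near miss) or on 5 (far miss).
near_miss = [0.1, 0.9, 0.0, 0.0, 0.0]
far_miss = [0.1, 0.0, 0.0, 0.0, 0.9]

for name, probs in [("near", near_miss), ("far", far_miss)]:
    print(name, round(cross_entropy(probs, TARGET), 3),
          round(squared_error(probs, TARGET), 2))
```

Both predictions get the same CE (about 2.303), while the squared error is 0.81 for the near miss and 12.96 for the far miss, which is the distance sensitivity that a regression-aware loss adds.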

Key Challenge:

  • CoT reasoning and regression-aware training each has its own advantages, but how to combine them effectively remains an open question.
  • There is a significant distribution shift between GPT-4-generated CoTs and the CoTs generated by the fine-tuned model itself.

Goal:

  • How to leverage both CoT reasoning capability and regression-aware loss in LLM-as-a-Judge fine-tuning.
  • How to alleviate the distribution shift problem of CoT sources between the training and inference stages.

Key Insight:

  • Use CE loss for CoT learning and RAFT loss for score prediction, combining them into a CoT-RAFT objective.
  • Employ a two-stage training strategy in which the second stage replaces the external annotations with CoTs generated by the model itself.

Core Idea: Combine CoT reasoning with numerical prediction through two-stage training on self-generated CoTs plus regression-aware fine-tuning.

Method

Overall Architecture

TRACT (Two-stage Regression-Aware fine-tuning with CoT) consists of two stages:

  • Stage 1: Fine-tune the seed LLM using GPT-4-annotated CoTs and the CoT-RAFT objective to obtain the model \(p_s\).
  • Stage 2: Use \(p_s\) to generate CoTs that replace the GPT-4 CoTs, then retrain from the seed LLM to obtain the final model \(p_{\text{tract}}\).
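The data flow of the two stages can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' implementation: `tract_two_stage`, `toy_fine_tune`, and `toy_generate_cot` are hypothetical names, and the toy stand-ins only record what a real fine-tuning run would consume.

```python
from typing import Callable, List, Tuple

# (input text, CoT text, ground-truth score)
Example = Tuple[str, str, int]


def tract_two_stage(seed_model, annotated: List[Example],
                    fine_tune: Callable, generate_cot: Callable):
    """Run both stages and return (p_s, p_tract)."""
    # Stage 1: fine-tune the seed LLM on externally annotated (GPT-4) CoTs.
    p_s = fine_tune(seed_model, annotated)
    # Stage 2: swap each annotated CoT for one generated by p_s, keeping
    # the input and the ground-truth score unchanged.
    self_cot_data = [(x, generate_cot(p_s, x), y) for x, _, y in annotated]
    # Retrain from the seed LLM, not from the Stage 1 checkpoint.
    p_tract = fine_tune(seed_model, self_cot_data)
    return p_s, p_tract


# Toy stand-ins (assumptions) so the sketch runs without any ML library.
def toy_fine_tune(init, data):
    return {"init": init, "cots": [cot for _, cot, _ in data]}


def toy_generate_cot(model, x):
    return f"self-CoT for {x!r}"


data = [("essay A", "gpt4 CoT A", 4), ("essay B", "gpt4 CoT B", 2)]
p_s, p_tract = tract_two_stage("seed", data, toy_fine_tune, toy_generate_cot)
print(p_tract)
```

In a real setup `fine_tune` would minimize the CoT-RAFT objective and `generate_cot` would sample from the Stage 1 model; both are left abstract here.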

Key Designs

  1. CoT-RAIL Predictor (CR Predictor):

    • Function: First generates CoT during inference, then performs a weighted sum over possible scores instead of argmax decoding.
    • Mechanism: \(\hat{y}_{CR}(x) = \sum_{y \in \mathcal{Y}} p(\text{str}(y) | [x, \hat{s}]) \cdot y\), where \(\hat{s} \sim p(\cdot|x)\)
    • Design Motivation: RAIL weighted averaging outperforms argmax decoding, and CoT provides better context conditioning; combining both yields complementary benefits.
  2. CoT-RAFT Training Objective:

    • Function: A weighted combination of CE loss (for learning the CoT) and RAFT loss (for learning score prediction).
    • Mechanism: \(\ell_{\text{CoT-RAFT}}^{\lambda} = \lambda \left( \sum_{y \in \mathcal{Y}} p(\text{str}(y) | [x, \hat{s}]) \cdot y - y^* \right)^2 - \log p([\hat{s}, y^*] | x)\)
    • Design Motivation: CE ensures the quality of CoT generation, while RAFT ensures score prediction is sensitive to numerical distance.
  3. Two-Stage Self-Generated CoT Strategy:

    • Function: Train with external CoT in Stage 1, and retrain using CoTs generated by the Stage 1 model in Stage 2.
    • Mechanism: Replace the GPT-4 CoTs with CoTs sampled from the Stage 1 model's own distribution to reduce the training-inference distribution shift.
    • Design Motivation: Since CoT is generated by the model during inference, using self-generated CoT during training maintains distributional consistency.
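The CR predictor from item 1 can be illustrated numerically. This is a minimal sketch that assumes the CoT \(\hat{s}\) has already been sampled and that the model's probabilities for the score tokens given \([x, \hat{s}]\) are available as a dictionary; the probabilities are made up, and renormalizing over the candidate scores is an assumption of this sketch rather than something the summary specifies.

```python
from typing import Dict


def cr_predict(score_probs: Dict[int, float], normalize: bool = True) -> float:
    """Probability-weighted sum over candidate scores (RAIL-style decoding).

    score_probs maps each candidate score y to p(str(y) | [x, s_hat]).
    """
    total = sum(score_probs.values())
    if normalize:
        # Assumption: renormalize so the candidate probabilities sum to 1.
        score_probs = {y: p / total for y, p in score_probs.items()}
    return sum(p * y for y, p in score_probs.items())


def argmax_predict(score_probs: Dict[int, float]) -> int:
    """Standard decoding: return the single most likely score."""
    return max(score_probs, key=score_probs.get)


# Hypothetical probabilities for the tokens "1".."5" after the CoT.
probs = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10}

print(argmax_predict(probs))        # discrete score
print(round(cr_predict(probs), 2))  # real-valued score
```

With these numbers argmax returns 3, while the weighted sum returns 3.35, reflecting the substantial probability mass on 4 that argmax discards.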

Loss & Training

  • Loss Function: CoT-RAFT = \(\lambda \times (\text{RAIL prediction} - \text{ground truth score})^2 - \log p(\text{CoT} + \text{score} \mid \text{input})\)
  • Hyperparameter \(\lambda\): Controls the regression loss weight, selected via the validation set in experiments.
  • Training Data: Feedback Collection (approx. 100K samples), with CoTs initially generated by GPT-4.
  • Initialization: Both stages start from the seed LLM rather than continuing Stage 2 from the Stage 1 checkpoint, to avoid overfitting to the Stage 1 parameters.
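The loss above can be evaluated by hand for a single training example. The sketch below is an assumption-laden illustration, not the paper's training code: the score probabilities, the sequence log-probability, and the value of \(\lambda\) are made-up numbers, and `cot_raft_loss` is a hypothetical helper name.

```python
from typing import Dict


def cot_raft_loss(score_probs: Dict[int, float], target: float,
                  seq_log_prob: float, lam: float) -> float:
    """lam * (RAIL prediction - target)^2 - log p([CoT, score] | input).

    score_probs:  p(str(y) | [x, s_hat]) for each candidate score y.
    seq_log_prob: summed token log-probabilities of the CoT followed by
                  the ground-truth score, given the input.
    """
    rail_prediction = sum(p * y for y, p in score_probs.items())
    regression_term = lam * (rail_prediction - target) ** 2
    ce_term = -seq_log_prob
    return regression_term + ce_term


# Hypothetical values for one example.
probs = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10}  # RAIL prediction 3.35
loss = cot_raft_loss(probs, target=4, seq_log_prob=-12.5, lam=1.0)
print(round(loss, 4))
```

Here the regression term is \((3.35 - 4)^2 = 0.4225\) and the CE term is 12.5, giving 12.9225; setting `lam=0.0` recovers plain CE training on the CoT and score.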

Key Experimental Results

Main Results

Evaluated on four LLM-as-a-Judge datasets using Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct:

| Method | FB Bench (r) | FLASK (r) | Vic. Bench (r) | MT Bench (r) | Average (r/ρ) |
|---|---|---|---|---|---|
| CE (w/o CoT) | 0.890 | 0.355 | 0.429 | 0.279 | 0.488/0.483 |
| CE (w/ CoT) | 0.872 | 0.413 | 0.463 | 0.480 | 0.557/0.554 |
| RAFT | 0.932 | 0.509 | 0.567 | 0.483 | 0.623/0.605 |
| Prometheus-2-7B | 0.845 | 0.512 | 0.488 | 0.519 | 0.591/0.576 |
| TRACT | 0.931 | 0.518 | 0.593 | 0.555 | 0.650/0.628 |

  • TRACT achieves an average Pearson correlation coefficient of 0.650, which is an improvement of 0.027 over RAFT and 0.059 over Prometheus-2-7B.
  • On MT Bench, TRACT improves by 0.072 (r) and 0.060 (ρ) compared to RAFT.
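For readers unfamiliar with the two reported metrics, the sketch below computes Pearson's r and Spearman's ρ between reference scores and predicted scores using only the standard library. The score lists are invented for illustration and are not taken from any benchmark in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical reference scores and real-valued judge predictions.
reference = [1, 2, 3, 4, 5]
predicted = [1.2, 1.9, 3.4, 3.8, 4.9]
print(round(pearson(reference, predicted), 3), spearman(reference, predicted))
```

Pearson's r rewards predictions that are linearly close to the reference scores, while Spearman's ρ only checks that the ordering agrees, which is why the table reports both.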

Key Findings

  1. Self-generated CoT is crucial: The variant using GPT-4 CoTs (A.1) achieves an average r of only 0.556, whereas the self-generated CoT version (TRACT) reaches 0.650, a gap of 0.094.
  2. Both components contribute: Removing RAFT to use pure CE (A.2) drops the average r to 0.617; removing CoT to use pure RAFT yields an average r of 0.623, compared with 0.650 for the full method.
  3. TRACT transfers to RewardBench: Despite being trained only on pointwise scoring formats, it remains competitive on the pairwise comparison dataset RewardBench.
  4. Modest inference overhead: Compared to standard CoT decoding, TRACT only adds RAIL weighting over the 5 candidate scores.

Highlights & Insights

  • Regression versus classification: The work highlights the basic limitation of treating numerical prediction as a classification task, namely that CE loss is insensitive to numerical distance.
  • Importance of self-generated CoT: The ablation (0.556 with GPT-4 CoTs versus 0.650 with self-generated CoTs) suggests that matching the training and inference distributions matters more than using CoTs from a stronger annotator.
  • Clear modular design: The CoT-RAFT objective elegantly combines CE and regression losses, each playing its designated role.
  • Fully open-source code and models: Facilitates replication and adoption.

Limitations & Future Work

  1. Validated only on LLM-as-a-Judge tasks; not yet extended to other numerical prediction scenarios (e.g., the STS-B regression task).
  2. The two-stage training increases computational overhead as it requires generating self-CoTs before retraining.
  3. The impact of CoT quality on final performance is not thoroughly analyzed; self-generated CoT might introduce systematic biases.
  4. Evaluated only on 7B/8B scale models; the effectiveness on larger or smaller models remains unknown.
  5. The \(\lambda\) hyperparameter must be tuned on a validation set, which adds tuning cost.

Related Work & Insights

  • RAFT (Lukasik et al., 2025): The foundational study on regression-aware fine-tuning, on which this work builds by adding CoT.
  • Prometheus-2 (Kim et al., 2024b): Currently the SOTA 7B-scale LLM-as-a-Judge, achieved through model merging.
  • RAIL (Lukasik et al., 2024): The zero-shot version of regression-aware inference, employing weighted expectations instead of argmax.
  • Insight: In other LLM tasks requiring numerical outputs (e.g., reward model training, regression annotations), similar regression-aware + CoT strategies could be equally effective.

Rating

| Metric | Score (1-10) |
|---|---|
| Novelty | 7 |
| Technical Depth | 8 |
| Experimental Thoroughness | 8 |
| Writing Quality | 8 |
| Value | 8 |
| Overall Score | 7.8 |