How to Mitigate Overfitting in Weak-to-Strong Generalization?¶

Conference: ACL 2025
arXiv: 2503.04249
Code: None
Area: Others
Keywords: Weak-to-strong generalization, superalignment, overfitting, data filtering, self-consistency

TL;DR¶

A two-stage training framework is proposed to address overfitting in weak-to-strong generalization. The first stage improves the quality of weak supervision through uncertainty-based filtering; the second stage uses the fine-tuned strong model to re-label the hard problems that were discarded, restoring the difficulty and diversity of the training set. With Llama 3, PGR rises from 7.19% to 120.50% on GSM8k and from 36.17% to 121.28% on MATH.

Background & Motivation¶

Background: The core challenge of superalignment is how to align superhuman models when tasks exceed human evaluation capabilities. Weak-to-strong generalization explores whether weak supervisors can guide stronger models.

Limitations of Prior Work: Labels generated by weak models contain noise, and the strong fitting ability of strong models causes them to overfit these incorrect labels, leading to severe performance degradation.

Key Challenge: Simply filtering out incorrect labels improves label quality but simultaneously discards valuable hard instances, causing degradation in the difficulty and diversity of the training set (decline in problem quality). This creates a dilemma between "supervision quality" and "problem quality."

Goal: Simultaneously improve supervision signal quality and problem quality, breaking the quality-diversity trade-off caused by filtering.

Key Insight: The starting point is the theoretical analysis of Lang et al. (2024), in which weak-to-strong generalization relies on two mechanisms: pseudo-label correction and coverage expansion. Excessive filtering helps the former but harms the latter.

Core Idea: First filter to purify the supervision signals, then re-annotate the discarded hard problems using the fine-tuned strong model. This preserves label correctness while restoring problem difficulty and diversity.

Method¶

Overall Architecture¶

Stage I (Purifying Supervision Signals): The weak model generates 10 CoT responses for each problem \(\rightarrow\) Compute self-consistency \(\rightarrow\) Filter out low-consistency samples \(\rightarrow\) High-consistency samples form Training Set A \(\rightarrow\) Fine-tune the strong model.
Stage II (Restoring Problem Quality): Use the strong model fine-tuned in Stage I to regenerate answers for the problems discarded in Stage I \(\rightarrow\) Perform consistency filtering again \(\rightarrow\) High-confidence samples form Training Set B \(\rightarrow\) Merge A and B to re-fine-tune the initial strong model.

Key Designs¶

Uncertainty-based Filtering
- Use CoT prompting to generate 10 responses for each problem.
- Select the most frequent answer as the final answer.
- Compute confidence: \(\text{Confidence}(\text{Ans}) = \frac{N_{Ans}}{N_{Total}} \times 100\%\).
- Set consistency thresholds (e.g., 50%, 60%, 70%, 80%) to filter out low-confidence samples.
- Experimental verification: Higher thresholds yield higher label accuracy (Figure 3).
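
The filtering steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `samples` dictionary layout, and the choice of `>=` for the threshold comparison are assumptions made for the example.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, confidence in percent) for one problem.

    `answers` holds the final answers extracted from the sampled CoT
    responses (10 per problem in the setup described above).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, 100.0 * count / len(answers)


def filter_by_confidence(samples, threshold):
    """Split problems into kept (labelled) and discarded by confidence.

    `samples` maps a problem id to its list of sampled answers (hypothetical
    layout). Keeping samples with confidence >= threshold is an assumption;
    the text above only says low-confidence samples are filtered out.
    """
    kept, discarded = [], []
    for problem, answers in samples.items():
        label, confidence = self_consistency(answers)
        if confidence >= threshold:
            kept.append((problem, label, confidence))
        else:
            discarded.append(problem)
    return kept, discarded
```

For instance, a problem whose 10 samples contain the same answer 7 times has a confidence of 70% and survives a 60% threshold but not an 80% one.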
Problem Degradation Analysis
- Difficulty degradation: As the filtering threshold increases, the average difficulty decreases from 3.48 to 2.66, and the proportion of high-difficulty problems (Level 4-5) drops sharply.
- Diversity degradation: The proportion of certain topics (e.g., Counting and Probability) drops from 10.79% to 4.31%, showing a significant shift in topic distribution.
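
The two degradation measures above (average difficulty and topic share) are simple aggregates over the retained problems. The sketch below assumes each problem is a dictionary with hypothetical `level` and `topic` fields; the toy data is invented for illustration and is not taken from the paper.

```python
def mean_difficulty(problems):
    """Average difficulty level of a non-empty list of problems."""
    return sum(p["level"] for p in problems) / len(problems)


def topic_share(problems, topic):
    """Percentage of problems in a non-empty list that belong to `topic`."""
    return 100.0 * sum(p["topic"] == topic for p in problems) / len(problems)


# Toy data (hypothetical): filtering that removes the hard problems lowers
# the mean difficulty and can eliminate a topic entirely.
full_set = [
    {"level": 5, "topic": "Counting and Probability"},
    {"level": 4, "topic": "Geometry"},
    {"level": 2, "topic": "Algebra"},
    {"level": 1, "topic": "Algebra"},
]
filtered_set = [p for p in full_set if p["level"] <= 2]
```

Comparing `mean_difficulty` and `topic_share` on the full and filtered sets gives the kind of before/after numbers reported above.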
Stage II: Strong Model Re-labeling
- Leverage the fact that the strong model fine-tuned in Stage I has surpassed the weak teacher.
- Regenerate diverse responses for hard problems that the weak model was uncertain about (and were discarded).
- Apply consistency filtering similarly to ensure the quality of the new labels.
- Append high-confidence samples to the training set to enhance the difficulty and diversity of the training data.
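
Putting the two stages together, the overall data flow might look like the sketch below. It is only an outline under stated assumptions: `sample_answers` and `finetune` are hypothetical callables supplied by the caller (they stand in for CoT sampling and SFT), and reusing the same threshold in both stages is a simplification, not something the summary specifies.

```python
from collections import Counter


def vote(answers):
    """Majority answer and its confidence in percent."""
    label, count = Counter(answers).most_common(1)[0]
    return label, 100.0 * count / len(answers)


def label_and_filter(model, problems, sample_answers, threshold, n=10):
    """Label each problem by majority vote; split by confidence threshold."""
    kept, discarded = [], []
    for problem in problems:
        label, confidence = vote(sample_answers(model, problem, n))
        if confidence >= threshold:
            kept.append((problem, label))
        else:
            discarded.append(problem)
    return kept, discarded


def two_stage(weak, strong_init, problems, sample_answers, finetune, threshold):
    """Return (final_model, set_a, set_b) following the two-stage recipe."""
    # Stage I: weak model labels; confident samples form Training Set A.
    set_a, hard = label_and_filter(weak, problems, sample_answers, threshold)
    strong_stage1 = finetune(strong_init, set_a)
    # Stage II: the Stage I strong model re-labels the discarded problems;
    # confident samples form Training Set B.
    set_b, _ = label_and_filter(strong_stage1, hard, sample_answers, threshold)
    # Merge A and B and fine-tune the *initial* strong model again.
    return finetune(strong_init, set_a + set_b), set_a, set_b
```

Note that the final fine-tuning starts from the initial strong model rather than continuing from the Stage I checkpoint, matching the description of Stage II above.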

Loss & Training¶

Standard supervised fine-tuning (SFT), the usual training approach for mathematical reasoning.
Weak model: Llama 3 8B Instruct / Deepseek 7B Chat.
Strong model: Llama 3 70B / Deepseek 67B Base.
Strong ceiling: The strong model fine-tuned with ground-truth labels.
Evaluation metric: \(PGR = \frac{\text{weak-to-strong} - \text{weak}}{\text{strong ceiling} - \text{weak}}\).
Datasets: GSM8k (train set), MATH (train set), using the same training set as Yang et al. (2024b).
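
As a worked illustration of the PGR metric defined above, the helper below computes it from three accuracies. The numbers in the usage note are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
def pgr(weak_to_strong, weak, strong_ceiling):
    """Performance Gap Recovered: share of the weak-to-ceiling gap closed.

    All three arguments are accuracies on the same scale (e.g. percent).
    """
    gap = strong_ceiling - weak
    if gap == 0:
        raise ValueError("PGR is undefined when the ceiling equals the weak accuracy")
    return (weak_to_strong - weak) / gap
```

With a hypothetical weak accuracy of 70, a strong ceiling of 80, and a weak-to-strong accuracy of 75, half of the gap is recovered (PGR = 0.5). A weak-to-strong accuracy of 82 exceeds the ceiling and gives a PGR of 1.2, i.e. 120%.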

Key Experimental Results¶

Main Results¶

Llama 3 (8B Instruct → 70B):

Stage	GSM8k Acc	GSM8k PGR	MATH Acc	MATH PGR
Baseline	75.20%	7.19%	18.2%	36.17%
Stage I	80.28%	98.56%	34.0%	112.77%
Stage II	81.50%	120.50%	35.2%	121.28%

Deepseek (7B Chat → 67B Base):

Stage	GSM8k Acc	GSM8k PGR	MATH Acc	MATH PGR
Baseline	62.39%	51.39%	16.8%	65.85%
Stage I	71.11%	83.33%	21.2%	119.51%
Stage II	72.94%	90.04%	21.8%	126.83%

PGR exceeding 100% means that the accuracy of the weak-to-strong method even surpasses the strong ceiling trained on ground-truth labels.

Ablation Study¶

Necessity of Filtering in Stage II (Llama 3, GSM8k):

Stage I Threshold	No Stage II	Stage II with Filtering	Stage II without Filtering
50%	78.99	80.89 (+1.90)	78.31 (-0.68)
60%	80.07	81.50 (+1.43)	78.84 (-1.23)
70%	80.28	81.19 (+0.91)	80.28 (+0.00)
80%	80.06	80.74 (+0.68)	79.59 (-0.47)

Appending all re-labeled samples without filtering lowers accuracy at three of the four thresholds and gives no gain at the fourth (70%).
This verifies the necessity of uncertainty-based filtering in Stage II.

Exploration of Iterative Refinement (Deepseek):
- Adding another iteration round (Stage Exp) on top of Stage II further improves MATH PGR from 126.83% to 134.15%.
- This indicates that iterative refinement has room for further improvement.

Key Findings¶

Double-edged sword effect of filtering: There exists an optimal threshold for filtering—too low fails to denoise, while too high discards hard problems. The performance curve of Stage I clearly demonstrates this trade-off.
Robustness of Stage II: Stage II with filtering yields additional gains at every Stage I threshold tested. In the Llama 3 GSM8k ablation the gain is largest at the 50% threshold (+1.90) and smallest at 80% (+0.68), so that table does not show a stronger recovery under heavier filtering.
Restoration of difficulty and diversity: The refined dataset from Stage II is closer to the original dataset in terms of difficulty distribution and topic diversity.
Implications of PGR > 100%: Weak-to-strong generalization can surpass the strong ceiling, presumably because the filtering process under weak supervision inherently acts as data augmentation and denoising.

Highlights & Insights¶

Deep problem insight: The paper goes beyond observing that "filtering improves label quality" and shows that "over-filtering damages problem quality." Identifying this trade-off is the core contribution of the work.
Elegant two-stage framework: Stage I builds the foundation (purification), and Stage II compensates for the shortcomings (restoring hard problems), which is logically clear and complementary.
Self-enhancement loop: The fine-tuned strong model is used to label problems that the weak model cannot handle, forming a positive feedback loop.
Comprehensive experimental design: Experiments span two model families (Llama 3, Deepseek), two datasets (GSM8k, MATH), and multiple filtering thresholds.

Limitations & Future Work¶

Validated only on mathematical reasoning tasks; its efficacy in other domains remains to be verified.
The optimal consistency threshold varies by task and dataset, making automatic threshold determination an open question.
The computational overhead of two-stage fine-tuning is high, especially since Stage II requires generating multiple responses for all discarded problems.
Using instruct versions as weak supervisors simplifies the experimental setup, but may not fully reflect real-world human-AI alignment scenarios.
The convergence and optimal number of iterations in iterative refinement have not been deeply studied.

Related Work & Insights¶

Burns et al. (2023): Proposed the concept of weak-to-strong generalization and the PGR metric.
Lang et al. (2024): Proposed two mechanisms of weak-to-strong generalization (pseudo-label correction + coverage expansion), providing the theoretical foundation for the proposed framework.
Guo & Yang (2024): Introduced filtering and confidence re-weighting, but did not consider the degradation of problem quality.
Insight: In any scenario involving learning from noisy labels, one should be cautious of the side effects of "denoising" on the data distribution—difficulty and diversity are just as important as cleanliness.

Rating¶

Dimension	Score (1-5)
Novelty	4
Technical Depth	3
Experimental Thoroughness	5
Writing Quality	4
Overall Rating	4.0