Synergistic Weak-Strong Collaboration by Aligning Preferences¶
Conference: ACL 2025
arXiv: 2504.15188
Code: Publicly available (as stated in the paper)
Area: Others
Keywords: Weak-strong model collaboration, preference alignment, DPO, knowledge complementarity, LLM collaboration
TL;DR¶
The paper proposes CoWest, a framework in which a specialized weak model (e.g., Llama-3-8B) generates an initial draft that a general strong model (e.g., GPT-4) then refines. Feedback from this collaboration is used to fine-tune the weak model via DPO so that its outputs align with the strong model's preferences. CoWest outperforms both individual models and existing collaborative methods on counterfactual reasoning, medicine, and ethics.
Background & Motivation¶
Large Language Models (LLMs) excel at general reasoning but still fall short on specialized tasks that require proprietary or domain-specific knowledge. Domain-specific fine-tuning for large models faces two major obstacles: (1) many popular LLMs (e.g., GPT-4, Gemini) are black-box models whose parameters are inaccessible; (2) even when fine-tuning is feasible, it incurs massive computational costs and privacy risks, as fine-tuning requires exposing sensitive data to the model.
Existing weak-strong collaboration methods have two limitations: (1) their predefined interaction mechanisms lack flexibility, e.g., the weak model only provides knowledge snippets in a fixed format; (2) they fine-tune one model using feedback from a single other model, ignoring the feedback generated by the collaborative process itself. Collaborative feedback can help the weak model learn what the strong model prefers, which makes the collaboration more effective.
The core idea is to go beyond collaborative reasoning between weak and strong models: preference signals are extracted from the collaborative process and used to optimize the weak model, so that its outputs better match what the strong model needs.
Method¶
Overall Architecture¶
CoWest consists of two phases:

- Training Phase: (1) fine-tune the weak model on task data via SFT to acquire domain capabilities; (2) construct collaborative preference data; (3) align the weak model using DPO.
- Inference Phase: the weak model generates an initial output (including a reasoning chain) for a query; the strong model receives this output together with the original query and refines it to produce the final answer.
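The inference phase can be sketched as a two-call pipeline. This is a minimal illustration rather than the paper's code: the `Generator` type, the function name `collaborative_answer`, and the wording of the refinement prompt are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical stand-in: any function that maps a prompt string to a completion.
Generator = Callable[[str], str]


def collaborative_answer(query: str, weak: Generator, strong: Generator) -> str:
    """Weak model drafts an answer with reasoning; strong model refines it.

    The strong model sees both the original query and the weak model's draft.
    The prompt template below is an assumption, not the paper's prompt.
    """
    draft = weak(query)
    refine_prompt = (
        f"Question:\n{query}\n\n"
        f"Draft answer with reasoning from a domain-specialized model:\n{draft}\n\n"
        "Refine the draft and give the final answer."
    )
    return strong(refine_prompt)


if __name__ == "__main__":
    # Toy stubs in place of real model calls.
    weak_stub = lambda q: "Reasoning: ... Answer: B"
    strong_stub = lambda p: "B" if "Answer: B" in p else "unknown"
    print(collaborative_answer("Which option?", weak_stub, strong_stub))
```

In practice `weak` would wrap the locally fine-tuned model and `strong` would wrap an API call to the black-box model, which is why the sketch only needs text in and text out.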
Key Designs¶
- Weak Model SFT Fine-Tuning: On the task-specific training set \(\mathcal{D}_{\text{SFT}} = \{(x, \hat{y})\}\), the weak model \(\pi_w\) is fine-tuned with the standard negative log-likelihood loss to acquire domain-specific expertise. LoRA is used for efficient fine-tuning.
- Collaborative Preference Feedback Construction: The core innovation lies in the source of the preference data, which comes from the collaborative process itself rather than from human annotation. The steps are:
    - Strong Model Standalone Inference: The strong model is prompted directly with CoT to answer, yielding output \(z \sim \pi_s(z|x)\).
    - Weak-Strong Collaborative Inference: The weak model first generates an explanation and an initial result \(y \sim \pi_w(y|x)\), which is then passed to the strong model for refinement to obtain \(y^* \sim \pi_s(y^*|y)\).
    - Preference Evaluation: An external evaluator \(E\) scores both outputs on the coherence of their reasoning and their consistency with the ground truth. The score difference is \(\Delta = E(\pi_s \circ y, x) - E(z, x)\): if \(\Delta > 0\), the weak model's contribution is beneficial and its output is treated as a positive sample \(y_+\); otherwise, it is treated as a negative sample \(y_-\).
- DPO Preference Optimization: The weak model is fine-tuned via DPO on the constructed preference triplets \(\mathcal{D}_{\text{PT}} = \{(x, y_+, y_-)\}\). The optimization target is for the weak model to generate outputs that score higher after being refined by the strong model.
- Collaborative Inference: During final inference, the query is first processed by the aligned weak model \(\pi_w^*\), and its output is then refined by the strong model: \(y^* = \pi_s \circ (x, \pi_w^* \circ x)\).
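The preference-construction step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `strong`, `refine`, and `evaluate` are hypothetical callables, and pairing every positive draft with every negative draft for the same query is an assumption, since the pairing rule is not described here.

```python
from itertools import product
from typing import Callable, List, Tuple

Generator = Callable[[str], str]           # prompt -> completion (hypothetical)
Refiner = Callable[[str, str], str]        # (query, weak draft) -> refined output
Evaluator = Callable[[str, str], float]    # (output, query) -> score, i.e. E(., x)


def build_preference_triplets(
    query: str,
    weak_drafts: List[str],
    strong: Generator,
    refine: Refiner,
    evaluate: Evaluator,
) -> List[Tuple[str, str, str]]:
    """Label weak drafts by whether they help the strong model, then pair them."""
    z = strong(query)                      # strong model answering alone
    baseline = evaluate(z, query)          # E(z, x)

    positives, negatives = [], []
    for y in weak_drafts:
        y_star = refine(query, y)          # strong model refining the weak draft
        delta = evaluate(y_star, query) - baseline
        # Delta > 0: the draft improved on the standalone answer -> y_plus.
        (positives if delta > 0 else negatives).append(y)

    # Assumed pairing rule: every (y_plus, y_minus) combination for this query.
    return [(query, y_pos, y_neg) for y_pos, y_neg in product(positives, negatives)]
```

A query whose drafts are all helpful, or all unhelpful, yields no triplet under this pairing rule, so several drafts per query would usually be sampled.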
Loss & Training¶
Training is conducted in two steps: SFT in the first step (standard cross-entropy loss) and DPO in the second step (preference optimization loss). The evaluator uses the same LLM as the strong model (such as GPT-4) to ensure that preference signals reflect the strong model’s actual preferences.
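For reference, the two losses can be written in this section's notation. The forms below are the standard negative log-likelihood and the standard DPO objective of Rafailov et al. (2023), not formulas copied from the paper. The reference policy \(\pi_{\text{ref}}\) (assumed here to be the frozen SFT weak model), the temperature \(\beta\), and the logistic function \(\sigma\) are symbols introduced for this sketch.

\[
\mathcal{L}_{\text{SFT}}(\pi_w) = -\,\mathbb{E}_{(x, \hat{y}) \sim \mathcal{D}_{\text{SFT}}} \big[ \log \pi_w(\hat{y} \mid x) \big]
\]

\[
\mathcal{L}_{\text{DPO}}(\pi_w) = -\,\mathbb{E}_{(x, y_+, y_-) \sim \mathcal{D}_{\text{PT}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_w(y_+ \mid x)}{\pi_{\text{ref}}(y_+ \mid x)} - \beta \log \frac{\pi_w(y_- \mid x)}{\pi_{\text{ref}}(y_- \mid x)} \right) \right]
\]

Minimizing the second loss raises the weak model's likelihood of drafts that helped the collaboration (\(y_+\)) relative to drafts that did not (\(y_-\)), measured against the reference policy.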
Key Theoretical Insight: Under simplifying assumptions (where the strong model's score for any query is constant), the optimized weak model \(\pi_w^*\) assigns zero probability to all outputs that result in collaborative scores no higher than the strong model's standalone performance. That is, the weak model learns to produce only outputs that contribute positively to the collaboration.
Key Experimental Results¶
Main Results¶
| Method Category | Method | Counterfactuals (EM/F1) | Medicine (Acc/F1) | Ethics (Acc/F1) |
|---|---|---|---|---|
| Weak Model | Llama-3-8B (SFT) | 69.71/72.69 | 73.08/58.26 | 64.29/62.40 |
| Strong Model | GPT-4 (CoT) | 57.42/65.60 | 71.80/57.69 | 39.00/39.58 |
| RAG | FLARE | 62.07/70.59 | 72.40/58.89 | 55.27/54.97 |
| Collaboration | SuperICL | 68.85/74.82 | 73.64/58.33 | 66.18/63.86 |
| Collaboration | RLWF | 70.52/75.04 | 72.01/57.65 | 64.85/62.10 |
| Collaboration | CoWest | 75.85/77.34 | 75.10/60.13 | 68.33/65.61 |
CoWest's gain over the best single model (the SFT-tuned Llama-3-8B in all three domains): Counterfactuals +6.14 EM, Medicine +2.02 Acc, Ethics +4.04 Acc.
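The gains above follow directly from the main results table. A short check, using only the first metric of each pair (EM or Acc) as reported in the table; the dictionary layout and function name are illustrative choices:

```python
# First metric of each pair (EM for Counterfactuals, Acc otherwise),
# copied from the main results table.
scores = {
    "Counterfactuals": {"weak_sft": 69.71, "strong_cot": 57.42, "cowest": 75.85},
    "Medicine": {"weak_sft": 73.08, "strong_cot": 71.80, "cowest": 75.10},
    "Ethics": {"weak_sft": 64.29, "strong_cot": 39.00, "cowest": 68.33},
}


def gain_over_best_single(row: dict) -> float:
    """CoWest score minus the better of the two single-model scores."""
    best_single = max(row["weak_sft"], row["strong_cot"])
    return round(row["cowest"] - best_single, 2)


gains = {domain: gain_over_best_single(row) for domain, row in scores.items()}
# gains == {"Counterfactuals": 6.14, "Medicine": 2.02, "Ethics": 4.04}
```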
Ablation Study¶
Ablation over interaction formats (relative EM/Acc levels):
| Interaction Format | Without Alignment | With Alignment | Description |
|---|---|---|---|
| Direct Answer | Low | Medium | Direct answers provide limited information |
| Domain Knowledge | Medium | High | Provides domain background knowledge |
| Chain of Thought | High | Highest | Detailed CoT reasoning paths are most effective |
Impact of different strong models:
| Strong Model | Counterfactuals | Ethics |
|---|---|---|
| GPT-4 | 75.9% | 38.2% |
| Llama-3-70B | 72.1% | 62.3% |
| GPT-3.5-Turbo | 70.8% | 68.3% |
| Llama-2-70B | 68.5% | 55.7% |
Impact of different weak models:
| Weak Model | Parameters | Overall Performance |
|---|---|---|
| Llama-3-8B | 8B | Best |
| Llama-2-7B | 7B | Better |
| Phi-3-mini | 3B | Moderate |
| TinyLlama | 1B | Weak |
Key Findings¶
- Collaboration Outperforms Single Models: CoWest outperforms both the individual weak model (even after SFT) and the individual strong model across all three datasets. GPT-4 alone reaches only 57.42 EM on counterfactual reasoning, which rises to 75.85 EM under collaboration.
- Preference Alignment is Key: Under all interaction formats, the aligned versions outperform their unaligned counterparts, demonstrating the effectiveness of collaborative feedback.
- CoT Format is Most Effective: The Chain of Thought format performs best across all datasets; its detailed reasoning trajectory helps the strong model better comprehend and refine the weak model's outputs.
- The General Ability of Strong Models, Rather than Absolute Dominance, is Critical: GPT-4 is the strongest in counterfactual reasoning but falls short of GPT-3.5-Turbo on ethics. Simply being "stronger than the weak model" is insufficient; the strong model must possess adequate error-correction capabilities.
- The Foundational Capability of the Weak Model is Important: Larger weak models (8B/7B) outperform smaller ones (3B/1B). The weak model needs adequate foundational capabilities to generate useful initial outputs.
- There is an Optimal Training Data Volume: For counterfactual reasoning (with only 2K samples), performance peaks at 1K preference samples, and adding more degrades quality due to duplicate sampling. In contrast, the larger datasets (Medicine/Ethics) continue to improve up to 2K.
Highlights & Insights¶
- Elegant mechanism for constructing collaborative preferences: It determines whether the weak model's contribution is positive or negative by comparing the performance difference between "strong model alone" and "weak-strong collaboration", eliminating the need for human annotation.
- Theoretical analysis provides meaningful guarantees: the optimized weak model will not produce outputs that are counterproductive.
- The framework is friendly to black-box strong models, as it does not require access to the strong model's parameters and only relies on its API calls, offering strong practical significance.
- The finding that strong-model performance depends on the domain is instructive: there is no one-size-fits-all strong model, and different domains may call for different strong models.
Limitations & Future Work¶
- Single-Round Feedback: Only one round of preference alignment iteration was performed, leaving the potential for continuous improvements via multi-round iterative alignments unexplored.
- Limited Model Families: Experiments only employed Llama and GPT series, leaving the effectiveness of other model architectures (e.g., Mistral, Qwen) unverified.
- Evaluator Bias: Using the same LLM as the strong model for evaluation might introduce bias, as the evaluator may prefer outputs modeled after the strong model's style.
- Inference Latency: Each query requires two sequential LLM calls (weak model \(\rightarrow\) strong model), so the weak model's generation time is added on top of the strong model's.
- Preference Data Construction Cost: It requires calling the strong model twice for each training sample (standalone inference and collaborative inference), resulting in high API costs.
Related Work & Insights¶
- SuperICL (Xu et al., 2024): Prompts large models with small-model outputs, but through a fixed interaction mechanism. CoWest makes this interaction adaptive through preference alignment.
- Weak-to-Strong Generalization (Burns et al., 2024): Strong models learn from weak model supervision, but require access to the strong model's parameters. CoWest keeps the strong model black-box.
- DPO (Rafailov et al., 2023): CoWest applies DPO to align the weak model, extending the source of preference signals from human/AI feedback to the collaborative process itself.
- RAG: Retrieval-Augmented Generation provides static contexts, whereas CoWest's weak model outputs represent dynamic reasoned knowledge, offering stronger adaptability.
Rating¶
- Novelty: ⭐⭐⭐⭐ The concept of collaborative preference feedback is novel, extending DPO from traditional human/AI preferences to collaborative process preferences.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across three domains with multiple baselines and detailed ablations (interaction strategies, model selection, data volume), although the scale of datasets is relatively small.
- Writing Quality: ⭐⭐⭐⭐ Method motivation and workflow descriptions are clear, with concise and powerful theoretical analysis.
- Value: ⭐⭐⭐⭐ Offers practical guidance for applications in the era of black-box LLMs; the framework is generic and easy to implement.