Cool-Fusion: Fuse Large Language Models without Training
Conference: ACL 2025
arXiv: 2407.19807
Code: None
Area: LLM/NLP
Keywords: LLM Fusion, training-free ensemble, perplexity reranking, heterogeneous models, text segment alignment
TL;DR
This work proposes Cool-Fusion, a training-free method for fusing heterogeneous LLMs. The source models evaluate and rerank each other's generated text at the granularity of text segments, which yields a 17.4% relative accuracy improvement over the strongest source model on GSM8K (69.14% → 81.20%).
Background & Motivation
Limitations of Prior Work
Different large language models exhibit diverse strengths and weaknesses owing to differences in pre-training data, architectures, optimizers, and training methods. Existing model fusion methods face the following challenges:
Vocabulary Incompatibility: The token vocabularies of different LLMs differ significantly. For example, LLaMA-3 and Phi-3 share only 6.4% of their total tokens, while Phi-3 and GLM-4 share only 7.5%.
Prohibitive Training Costs: Existing fusion methods typically require training, in some combination of fine-tuning, distillation, or vocabulary alignment.
Need for Rapid Deployment: Many application scenarios require rapid deployment and cannot afford training overheads.
Existing methods such as weight merging (Model Soups) require identical architectures, and traditional ensembling requires the same vocabulary. While methods like FuseLLM and EVA can handle heterogeneous models, they all require varying degrees of training. Cool-Fusion aims to provide a completely training-free fusion scheme applicable to any heterogeneous LLM collection.
Method
Overall Architecture
Cool-Fusion adopts an iterative text generation loop, where each iteration consists of three steps:
- Generation (TextGen): Each source LLM independently generates a text segment.
- Evaluation (Evaluate): All generated text segments are sent to all source LLMs to calculate perplexity, which is then averaged.
- Selection (Select): The text segment with the lowest average perplexity is selected as the joint prediction result, and the states of all models are updated via broadcasting.
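The three steps can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyModel`, its `propose` and `perplexity` callables, and the made-up scores in `TOY_PPL` are hypothetical stand-ins for real LLM calls, and an empty segment is used as a toy end-of-sequence signal.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyModel:
    """Hypothetical stand-in for a source LLM (not a real library API)."""
    name: str
    propose: Callable[[str], str]            # context -> next text segment
    perplexity: Callable[[str, str], float]  # (context, segment) -> perplexity


def cool_fusion_step(models: List[ToyModel], context: str) -> str:
    # 1) TextGen: every model proposes its own segment.
    candidates = [m.propose(context) for m in models]
    # 2) Evaluate: every model scores every candidate; average the scores.
    avg_ppl = [
        sum(m.perplexity(context, c) for m in models) / len(models)
        for c in candidates
    ]
    # 3) Select: keep the candidate with the lowest average perplexity.
    best = min(range(len(candidates)), key=avg_ppl.__getitem__)
    return candidates[best]


def cool_fusion_generate(models: List[ToyModel], prompt: str, max_steps: int = 8) -> str:
    context = prompt
    for _ in range(max_steps):
        segment = cool_fusion_step(models, context)
        if segment == "":   # toy end-of-sequence signal (an assumption)
            break
        context += segment  # "broadcast": all models continue from the same text
    return context


# Made-up perplexities for the demo: lower means "more plausible".
TOY_PPL = {" 4": 1.5, " 5": 9.0}


def propose_4(ctx: str) -> str:
    return " 4" if ctx.endswith("=") else ""


def propose_5(ctx: str) -> str:
    return " 5" if ctx.endswith("=") else ""


def score(ctx: str, seg: str) -> float:
    return TOY_PPL.get(seg, 1.0)


models = [ToyModel("a", propose_4, score), ToyModel("b", propose_5, score)]
result = cool_fusion_generate(models, "2 + 2 =")
```

In this toy run both models propose a different segment, the averaged score prefers " 4", and every model then continues from the same extended context.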
In the evaluation step, using the average perplexity as the selection criterion is justified from two perspectives:
- Ensemble Perspective: The average perplexity aligns with the cross-entropy objective of LLM ensembling.
- Critic Perspective: LLMs leverage complementary critical abilities to detect non-factual text segments by assigning high perplexities.
Key Designs
Shortest Text Segment: Defined as the text decodable from the shortest token sequence generated by greedy decoding. Different tokenizers have different implementations:
- LLaMA-3-like tokenizers provide a word_ids function to directly retrieve word boundaries.
- LLaMA-2-like tokenizers provide an offsets attribute.
- Other tokenizers iteratively append tokens until reversible decoding is achieved.
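The fallback in the last bullet can be illustrated with a toy tokenizer. The sketch below assumes a hypothetical byte-level tokenizer (one token per UTF-8 byte) and treats a segment as complete once decoding and re-encoding reproduce the same tokens; real tokenizers differ, and the function names are illustrative only.

```python
from typing import Iterable, List, Tuple


def encode(text: str) -> List[int]:
    """Toy tokenizer: one token per UTF-8 byte (an assumption for illustration)."""
    return list(text.encode("utf-8"))


def decode(tokens: List[int]) -> str:
    # Incomplete or invalid byte sequences decode to U+FFFD replacement characters.
    return bytes(tokens).decode("utf-8", errors="replace")


def shortest_segment(token_stream: Iterable[int]) -> Tuple[str, List[int]]:
    """Consume tokens until the decoded text re-encodes to the same tokens."""
    tokens: List[int] = []
    for tok in token_stream:
        tokens.append(tok)
        text = decode(tokens)
        if encode(text) == tokens:  # round trip succeeds: segment is complete
            return text, tokens
    raise ValueError("stream ended before a reversible segment was found")


# "\u00e9" (e with acute accent) takes two bytes in UTF-8, "!" takes one.
stream = iter(encode("\u00e9!"))
segment, used = shortest_segment(stream)
```

Here the first byte alone does not round-trip, so the loop waits for the second byte before emitting the segment; the "!" token stays in the stream for the next call.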
Aligned Text Segment: An improved scheme proposed to address perplexity bias. Different tokenizers may produce shortest text segments of varying lengths, and perplexity, as a measure of uncertainty, is often larger at the first token of each word. A segment that spans more tokens dilutes that first-token cost in its average, which leads to:
- Perplexity evaluation biasing towards longer text segments.
- Bias towards tokenizers that produce longer average text segments.
An aligned text segment is defined as the shortest text segment generated by an LLM that can be decoded by the tokenizers of all source LLMs, thereby reducing perplexity bias caused by uneven segment lengths.
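One way to picture alignment is as a search for a boundary shared by every tokenizer. The sketch below adopts that reading as an assumption: a prefix counts as aligned when it ends on a token boundary for all tokenizers. The three toy tokenizers are hypothetical and only split strings; they are not real LLM tokenizers.

```python
import re
from typing import Callable, List, Set

Tokenizer = Callable[[str], List[str]]  # text -> token strings (toy interface)


def boundaries(tokenize: Tokenizer, text: str) -> Set[int]:
    """Character offsets at which some token of `text` ends."""
    ends, pos = set(), 0
    for tok in tokenize(text):
        pos += len(tok)
        ends.add(pos)
    return ends


def aligned_segment(text: str, tokenizers: List[Tokenizer]) -> str:
    """Shortest non-empty prefix ending on a token boundary of every tokenizer."""
    common = set.intersection(*(boundaries(t, text) for t in tokenizers))
    return text[: min(common)]


def word_tok(text: str) -> List[str]:
    return re.findall(r"\s*\S+", text)  # whole words, leading space attached


def pair_tok(text: str) -> List[str]:
    return [text[i:i + 2] for i in range(0, len(text), 2)]  # 2-character chunks


def char_tok(text: str) -> List[str]:
    return list(text)  # one token per character


generated = " fusion works"
with_chars = aligned_segment(generated, [word_tok, char_tok])
with_pairs = aligned_segment(generated, [word_tok, pair_tok])
```

With the character tokenizer the first word boundary is shared, so the aligned segment is " fusion". With the 2-character tokenizer the first shared boundary is the end of the text, so the aligned segment is longer than the word tokenizer's own shortest segment.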
Incremental Encoding & Decoding: The encoding and decoding functions of some tokenizers (such as LLaMA-2) are context-dependent, so a new segment cannot simply be encoded or decoded in isolation. The solution is to prepend only the tokens of the last \(k=4\) decoded words, so that the cost of each encoding or decoding call stays constant instead of growing with the full context.
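The windowing trick can be sketched with a deliberately context-dependent toy tokenizer. Everything here is hypothetical: `toy_encode` marks the first word of its input differently from later words, which mimics the kind of context dependence described above without reproducing any real tokenizer.

```python
from typing import List

K = 4  # number of trailing words kept as left context (k = 4 in the text)


def toy_encode(text: str) -> List[str]:
    """Toy tokenizer whose output depends on whether a word starts the input."""
    words = text.split()
    return [("^" if i == 0 else "_") + w for i, w in enumerate(words)]


def encode_incremental(history_words: List[str], new_text: str) -> List[str]:
    """Encode `new_text` as it would appear after `history_words`."""
    context = history_words[-K:]  # at most K words, regardless of history length
    full = toy_encode(" ".join(context + [new_text]))
    prefix_len = len(toy_encode(" ".join(context))) if context else 0
    return full[prefix_len:]      # drop the tokens that belong to the context


history = "cool fusion needs no extra".split()
incremental = encode_incremental(history, "training")
reference = toy_encode("cool fusion needs no extra training")[-1:]
naive = toy_encode("training")
```

Encoding the new word in isolation gives a different token than encoding it in context, while the windowed call matches the full-context result and only ever touches `K` words plus the new text.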
Rerank Combination: In addition to the fine-grained text segment selection of Cool-Fusion, each source LLM is also allowed to independently predict the complete continuation. Finally, the \(k+1\) candidate continuations (one from Cool-Fusion plus one from each of the \(k\) source LLMs; this \(k\) is unrelated to the \(k=4\) context window above) are reranked by average perplexity, incurring almost zero extra overhead.
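A minimal sketch of the reranking step follows, assuming each model is reduced to a hypothetical scoring function that maps a full continuation to a perplexity. The candidate labels and all scores are made up for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List

Scorer = Callable[[str], float]  # continuation -> perplexity under one model


def rerank(candidates: Dict[str, str], scorers: List[Scorer]) -> str:
    """Return the label of the candidate with the lowest mean perplexity."""
    return min(
        candidates,
        key=lambda label: mean(score(candidates[label]) for score in scorers),
    )


# Two source models (k = 2) give k + 1 = 3 candidates: fused output plus one each.
candidates = {"fused": "answer A", "model_1": "answer B", "model_2": "answer C"}

# Made-up perplexity tables standing in for two real models.
table_1 = {"answer A": 2.0, "answer B": 1.8, "answer C": 5.0}
table_2 = {"answer A": 2.2, "answer B": 3.0, "answer C": 4.0}
scorers = [table_1.__getitem__, table_2.__getitem__]

winner = rerank(candidates, scorers)
```

In this toy case model 1 prefers its own answer, but the average across both models favors the fused continuation.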
Loss & Training
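Cool-Fusion requires no training. Its selection criterion is the perplexity of a text segment \(s = (s_1, \dots, s_n)\) under LLM \(u\):
\[
\mathrm{PPL}_u(s) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log p_u(s_i)\right)
\]
where \(\log p_u(s_i)\) is the log-probability that LLM \(u\) assigns to token \(s_i\) given the preceding tokens, and \(n\) is the number of tokens in the segment under the tokenizer of \(u\). The perplexities of multiple models are averaged arithmetically.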
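As a small numeric illustration, the sketch below computes perplexity as the exponential of the negative mean token log-probability (the standard definition, assumed here) and then takes the arithmetic mean across models. The log-probabilities are invented for the example.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def average_perplexity(per_model_logprobs: List[Sequence[float]]) -> float:
    """Arithmetic mean of the per-model perplexities for one text segment."""
    return sum(perplexity(lp) for lp in per_model_logprobs) / len(per_model_logprobs)


# One segment scored by two models that tokenize it differently.
model_u = [math.log(0.5), math.log(0.5)]  # two tokens, each with probability 0.5
model_v = [math.log(0.25)]                # one token with probability 0.25
avg = average_perplexity([model_u, model_v])
```

Model `u` yields a perplexity of 2 and model `v` a perplexity of 4, so the averaged score for this segment is 3. Each model normalizes by its own token count, which is why differing tokenizations can still be compared.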
Key Experimental Results
Main Results
Experiments are conducted using three heterogeneous models: LLaMA-3 8B, Phi-3 mini (3.8B), and GLM-4 9B, covering multiple domains:
GSM8K Mathematical Reasoning (relative improvement of Cool+R over each source model in parentheses):
- LLaMA-3: 69.14% → Cool+R: 81.20% (+17.4%)
- Phi-3: 68.31% → Cool+R: 81.20% (+18.9%)
- GLM-4: 63.38% → Cool+R: 81.20% (+28.1%)
Comparison with Training-required Methods (GSM8K):
- EVA (7 source LLMs + trained vocabulary mapping): 42.91%
- PairRanker (7 source LLMs + Ranker training): 39.58%
- Cool-Fusion (3 source LLMs, training-free): 33.5% (Gain: +6.6%)
- Note: Cool-Fusion uses only 3 models, and the average score of its source models is 4 points lower.
Cross-Domain Performance:
- Q&A Datasets (CoQA, DROP, TriviaQA): Cool-Fusion outperforms or matches the best source model.
- Multilingual GSM: Outperforms the best source model in most languages.
- Mathematics and Unscramble: Maintains performance comparable to the best source model.
Ablation Study
| Method | GSM8K Accuracy |
|---|---|
| LLaMA-3 | 69.14% |
| Phi-3 | 68.31% |
| GLM-4 | 63.38% |
| Cool2 (LLaMA-3 + Phi-3) | 72.33% |
| Rerank3 | 77.79% |
| Cool-align (Cool without alignment; shortest text segment) | 74.45% |
| Cool (aligned text segment) | 74.68% |
| Cool+R (Cool + Rerank) | 81.20% |
Key Findings:
- Cool2 (two-model fusion) improves on the best source model by 4.6% relative (69.14% → 72.33%).
- Aligned text segments slightly outperform shortest text segments (74.68% vs 74.45%), which is consistent with the perplexity bias correction helping, although the margin is small.
- Rerank alone is highly effective (+12.5% relative, 69.14% → 77.79%), and combining it with Cool yields the best performance.
- Cool+R improves by a further 4.4% relative over Rerank alone (77.79% → 81.20%).
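The percentage gains quoted for GSM8K are relative improvements rather than percentage-point differences. The short check below recomputes them from the accuracies in the ablation table; the helper name `relative_gain` is introduced here only for illustration.

```python
ACCURACY = {  # GSM8K accuracy (%) taken from the ablation table
    "LLaMA-3": 69.14, "Phi-3": 68.31, "GLM-4": 63.38,
    "Cool2": 72.33, "Rerank3": 77.79, "Cool+R": 81.20,
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


best_source = max(ACCURACY[m] for m in ("LLaMA-3", "Phi-3", "GLM-4"))

gains = {
    "Cool+R vs LLaMA-3": relative_gain(ACCURACY["Cool+R"], ACCURACY["LLaMA-3"]),
    "Cool+R vs Phi-3": relative_gain(ACCURACY["Cool+R"], ACCURACY["Phi-3"]),
    "Cool+R vs GLM-4": relative_gain(ACCURACY["Cool+R"], ACCURACY["GLM-4"]),
    "Cool2 vs best source": relative_gain(ACCURACY["Cool2"], best_source),
    "Rerank3 vs best source": relative_gain(ACCURACY["Rerank3"], best_source),
    "Cool+R vs Rerank3": relative_gain(ACCURACY["Cool+R"], ACCURACY["Rerank3"]),
}

for label, value in gains.items():
    print(f"{label}: +{value:.1f}%")
```

In absolute terms the headline gain is 12.06 percentage points (69.14% to 81.20%), which corresponds to the 17.4% relative figure.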
Key Findings
- Cool-Fusion outperforms or matches the best source model across all domains, remaining robust even when some source models perform poorly in specific domains.
- Coarse-grained Rerank and fine-grained text segment selection are complementary, and their combination yields the best results.
- The method works effectively across heterogeneous models (different architectures and vocabulary sizes ranging from 32K to 151K).
- FuseLLM (which requires distillation training) reaches 13.8% on GSM8K, while the training-free Cool achieves 12.3%, so the two perform almost on par.
Highlights & Insights
- Simple yet Effective: The core idea is to extend token-level ensembling to segment-level ensembling, elegantly bypassing the vocabulary alignment bottleneck.
- Theoretical Soundness: Average perplexity selection is supported by both the ensemble and critic perspectives.
- Bias Analysis of Aligned Text Segments: Deeply analyzes the bias of perplexity over text segments of different lengths and proposes a theoretically grounded solution.
- Scalability: With \(k\) GPUs, the \(k\) source LLMs can run in parallel, so latency stays roughly constant as more models are added.
- Practical Value: Requires no training data or fine-tuning; it can be used as long as model inference capability is available.
Limitations & Future Work
- Inference Speed: The current implementation runs at roughly one-sixth the speed of standard single-LLM inference, mainly because of inter-model communication and frequent tokenizer invocations.
- Experimental Scale: Limited by resources, experiments were only conducted fusing 2-3 source models.
- Evaluation Methodology: Relies solely on automatic metrics, lacking human or GPT-4 evaluation.
- Room for Optimization: Possible speedups include parallelizing the tokenizers, using longer text segments to reduce communication overhead, and pipelining inference to make use of GPU idle time.
- Whether it is applicable to Unicode-based vocabularies (such as Chinese glyph encodings) remains unexplored.
Related Work & Insights
- EVA (ICLR 2024): Achieves token-level ensembling by training a vocabulary projection matrix; requires training but offers finer granularity.
- LLM-Blender: Uses a fine-tuned ranking model to select the optimal output first, and then generates fused output using a fine-tuned LLM.
- Contrastive Decoding: Leverages contrasts between expert and amateur LLMs to maximize the log-likelihood difference.
- CALM: Combines representations through cross-attention between models to support new capabilities.
- Inspiration from Cool-Fusion: Text-level operations represent an elegant solution to overcome vocabulary barriers and can be generalized to other scenarios requiring cross-model collaboration.
Rating
- Novelty: ⭐⭐⭐⭐ — Segment-level ensembling to bypass vocabulary alignment is an ingenious design.
- Practicality: ⭐⭐⭐⭐ — A true zero-training solution, easy to deploy.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers multiple domains like mathematics, QA, and multilingual, with detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear examples and solid theoretical analysis.