Collaborative Performance Prediction for Large Language Models
Conference: ACL 2025
arXiv: 2407.01300
Code: https://github.com/Don-Joey/CPP_LLM
Area: LLM/NLP
Keywords: Performance Prediction, Scaling Law, Collaborative Filtering, LLM Evaluation, Model Selection
TL;DR
This paper proposes Collaborative Performance Prediction (CPP), a framework that predicts downstream performance in a collaborative-filtering style from two sources: the historical performance of many LLMs across many tasks, and the design factors of models and tasks. It removes the single-family restriction of traditional Scaling Laws and accurately forecasts downstream performance across different model families.
Background & Motivation
Background: Understanding and predicting the performance of large language models on various downstream tasks is a key challenge in NLP research. Scaling Laws (e.g., Kaplan et al., Chinchilla) reveal power-law relationships between model performance and compute budget, which have been used to guide model design and resource allocation. Recent works (e.g., Hu et al., 2024; Isik et al., 2024) extend Scaling Laws to predict downstream task performance.
Limitations of Prior Work: Existing downstream performance prediction methods suffer from three key limitations: (1) they require transparent design factors (e.g., FLOPs, training data volume), which many closed-source models do not provide; (2) they are restricted to predictions within the same model family (e.g., only using LLaMA-7B to predict LLaMA-70B), ignoring similarities across different model families; (3) evaluation is expensive: evaluating an LLM on certain benchmarks can require $10K+ and 4K+ GPU hours, which makes exhaustive evaluation of new tasks and models prohibitive.
Key Challenge: The "single-family" assumption of Scaling Laws is too restrictive—strong similarities actually exist across different model families (e.g., models strong at reasoning perform well across multiple reasoning tasks), yet existing methods fail to exploit this cross-family similarity.
Goal: Devise a framework that uses "model \(\times\) task" collaborative information to predict performance, akin to collaborative filtering in recommendation systems: given the scores of some models on some tasks, predict the unknown scores.
Key Insight: The authors analogize LLM performance prediction to the rating prediction problem in recommendation systems where "models" correspond to "users", "tasks" to "items", and "performance scores" to "ratings". If Model A performs similarly to Model B on tasks 1–3, the performance of Model A on task 4 can leverage the score of Model B on task 4.
Core Idea: Construct a collaborative data matrix (a model-task performance score matrix enriched with additional design factors) and apply matrix factorization and factor-enhanced collaborative filtering methods to predict missing performance scores.
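The analogy in the Key Insight can be made concrete with a small neighborhood-based sketch. This is not the paper's method (which uses matrix factorization); it is a classic user-based collaborative filter, and all model names, task names, and scores below are made-up assumptions for illustration.

```python
from math import sqrt

# Toy scores (hypothetical numbers, not from the paper): model -> {task: score}
scores = {
    "model_a": {"task1": 70.0, "task2": 55.0, "task3": 62.0},
    "model_b": {"task1": 72.0, "task2": 54.0, "task3": 60.0, "task4": 48.0},
    "model_c": {"task1": 40.0, "task2": 80.0, "task3": 30.0, "task4": 75.0},
}

def centered(model, tasks):
    """Return mean-centered scores of `model` on `tasks`, plus the mean."""
    vals = [scores[model][t] for t in tasks]
    mean = sum(vals) / len(vals)
    return [v - mean for v in vals], mean

def similarity(m1, m2):
    """Pearson-style similarity over the tasks both models were scored on."""
    shared = sorted(set(scores[m1]) & set(scores[m2]))
    if len(shared) < 2:
        return 0.0
    x, _ = centered(m1, shared)
    y, _ = centered(m2, shared)
    denom = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / denom if denom else 0.0

def predict(target, task):
    """Target's mean score plus the similarity-weighted deviation of its neighbors."""
    own = list(scores[target].values())
    own_mean = sum(own) / len(own)
    num = den = 0.0
    for other in scores:
        if other == target or task not in scores[other]:
            continue
        sim = similarity(target, other)
        if sim <= 0:  # ignore dissimilar models
            continue
        shared = sorted(set(scores[target]) & set(scores[other]))
        _, other_mean = centered(other, shared)
        num += sim * (scores[other][task] - other_mean)
        den += sim
    return own_mean if den == 0 else own_mean + num / den

prediction = predict("model_a", "task4")
print(round(prediction, 2))
```

Here `model_a` tracks `model_b` closely on tasks 1 to 3, so its predicted task 4 score follows `model_b` rather than the dissimilar `model_c`.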
Method
Overall Architecture
The CPP framework consists of two core components: (1) Collaborative Data — collecting a model-task performance score matrix from online platforms (e.g., Open LLM Leaderboard), along with model design factors (parameter size, training method, architecture type) and task design factors (task type, difficulty, evaluation metrics); (2) Collaborative Prediction Method — a prediction model based on matrix factorization and auxiliary factors, which leverages the collaborative information between a given model ID and task ID to predict the performance score.
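As a rough illustration of the Collaborative Data component, the sketch below assembles a sparse model-by-task score matrix from leaderboard-style records. The record layout, the model and task names, and the helper names `build_matrix` and `density` are assumptions for this example, not the format used by the authors' code.

```python
# Hypothetical leaderboard records: (model_id, task_id, score). Values are made up.
records = [
    ("model_a", "arc", 61.2),
    ("model_a", "mmlu", 58.4),
    ("model_b", "arc", 64.0),
    ("model_b", "gsm8k", 35.1),
    ("model_c", "mmlu", 70.3),
    ("model_c", "gsm8k", 52.7),
]

# Hypothetical design factors attached to each model and task.
model_factors = {"model_a": {"params_b": 7}, "model_b": {"params_b": 13}, "model_c": {"params_b": 70}}
task_factors = {"arc": {"few_shot": 25}, "mmlu": {"few_shot": 5}, "gsm8k": {"few_shot": 5}}

def build_matrix(records):
    """Build a dense list-of-lists view of the sparse matrix; None marks a missing score."""
    models = sorted({m for m, _, _ in records})
    tasks = sorted({t for _, t, _ in records})
    row = {m: i for i, m in enumerate(models)}
    col = {t: j for j, t in enumerate(tasks)}
    matrix = [[None] * len(tasks) for _ in models]
    for m, t, s in records:
        matrix[row[m]][col[t]] = s
    return models, tasks, matrix

def density(matrix):
    """Fraction of model-task cells that have an observed score."""
    total = sum(len(r) for r in matrix)
    observed = sum(v is not None for r in matrix for v in r)
    return observed / total

models, tasks, R = build_matrix(records)
print(models, tasks)
print(R)
print(round(density(R), 3))
```

The `None` cells are exactly the entries the prediction method is asked to fill in.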
Key Designs
- Collaborative Data Construction:
    - Function: Construct a model-task performance dataset for collaborative prediction.
    - Mechanism: Crawl extensive evaluation results of models on multiple benchmark tasks from platforms such as Hugging Face Open LLM Leaderboard to form a sparse score matrix \(R \in \mathbb{R}^{M \times T}\). Concurrently, harvest metadata for each model (parameter size \(N\), architecture type \(a\), training data volume \(D\), fine-tuning method \(f\), etc.) and attributes for each task (type \(c\), few-shot setting \(k\), evaluation metric \(m\), etc.). These design factors serve as auxiliary information to enhance prediction.
    - Design Motivation: Pure matrix factorization performs poorly when the score matrix is sparse; auxiliary factors provide additional inductive bias.
- Factor-Enhanced Matrix Factorization:
    - Function: Fuse collaborative signals and design factors for performance prediction.
    - Mechanism: Represent model \(i\) and task \(j\) as latent vectors \(u_i\) and \(v_j\) respectively (akin to user/item embeddings in recommendation systems). The predicted score is \(\hat{R}_{ij} = u_i^\top v_j + f(x_i^{\text{model}}, x_j^{\text{task}})\), where \(f\) is a lightweight MLP that takes the design-factor feature vectors of the model and task as input. The matrix factorization term captures collaborative signals (models that perform alike behave alike on similar tasks), while the MLP term exploits the design factors; the two complement each other.
    - Design Motivation: Traditional Scaling Laws use only design factors (e.g., \(\text{Loss} \propto N^{-\alpha}\)) and ignore collaborative signals, while pure collaborative filtering neglects prior knowledge. This method unifies both.
- Factor Importance Analysis:
    - Function: Quantify the contribution of each design factor to performance prediction.
    - Mechanism: Estimate the importance of each design factor from the gradient magnitudes and SHAP values of the corresponding inputs to the MLP branch. The contribution of a factor is quantified by perturbing its input and observing the change in the predicted score (e.g., how much "parameter size", "training data type", or "fine-tuning method" contributes to prediction accuracy).
    - Design Motivation: Unlike previous Scaling Laws, which consider only a few predefined factors (such as N, D, C), this method discovers which factors matter most in a data-driven way.
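The prediction formula and the perturbation step described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the MLP has one hidden ReLU layer, all weights and feature values are hand-picked rather than trained, and importance is estimated with a simple one-at-a-time perturbation (not SHAP or gradients). None of the names or numbers come from the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mlp(x, W1, b1, w2, b2):
    """One hidden ReLU layer followed by a linear output (stand-in for f)."""
    hidden = [max(0.0, dot(row, x) + b) for row, b in zip(W1, b1)]
    return dot(w2, hidden) + b2

def predict(u, v, x_model, x_task, params):
    """R_hat = u^T v + f(x_model, x_task): collaborative term plus factor term."""
    return dot(u, v) + mlp(x_model + x_task, *params)

def perturbation_importance(u, v, x_model, x_task, params, eps=0.1):
    """Absolute change in the prediction when each factor is bumped by eps."""
    base = predict(u, v, x_model, x_task, params)
    x = x_model + x_task
    split = len(x_model)
    importance = []
    for k in range(len(x)):
        bumped = list(x)
        bumped[k] += eps
        changed = predict(u, v, bumped[:split], bumped[split:], params)
        importance.append(abs(changed - base))
    return importance

# Hypothetical latent vectors, factor features, and untrained hand-picked weights.
u_i = [0.5, 1.0]                 # latent vector of model i
v_j = [2.0, 1.0]                 # latent vector of task j
x_model = [1.0, 2.0]             # e.g. scaled parameter size, fine-tuning flag
x_task = [3.0]                   # e.g. scaled few-shot setting
params = ([[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]], [0.0, 0.0], [2.0, 1.0], 0.5)

score = predict(u_i, v_j, x_model, x_task, params)
importance = perturbation_importance(u_i, v_j, x_model, x_task, params)
print(score)
print([round(v, 3) for v in importance])
```

With these toy weights the third factor has no path to the output, so its importance is zero, while the first factor moves the prediction the most.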
Loss & Training
The model is trained with a Mean Squared Error (MSE) loss on the known performance scores, with L2 regularization to prevent overfitting. A negative sampling strategy is used to handle the sparsity of the data.
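A minimal sketch of such a training loop, covering only the matrix factorization term with MSE and L2 regularization, is shown below. It uses plain SGD in pure Python; the MLP branch and negative sampling are omitted, and the data, hyperparameters, and the function name `train_mf` are assumptions for illustration rather than the authors' settings.

```python
import random

def train_mf(observed, n_models, n_tasks, dim=2, lr=0.05, reg=0.01, epochs=300, seed=0):
    """Fit latent vectors U (models) and V (tasks) to observed (i, j, score) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_models)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_tasks)]
    history = []  # mean squared error per epoch, measured while updating
    for _ in range(epochs):
        total = 0.0
        for i, j, r in observed:
            pred = sum(a * b for a, b in zip(U[i], V[j]))
            err = pred - r
            total += err * err
            for k in range(dim):
                u, v = U[i][k], V[j][k]
                # Gradient of 0.5 * err^2 + 0.5 * reg * (u^2 + v^2)
                U[i][k] -= lr * (err * v + reg * u)
                V[j][k] -= lr * (err * u + reg * v)
        history.append(total / len(observed))
    return U, V, history

# Hypothetical scores scaled to [0, 1]; the cell (model 0, task 2) is left unobserved.
observed = [
    (0, 0, 0.70), (0, 1, 0.55),
    (1, 0, 0.72), (1, 1, 0.54), (1, 2, 0.48),
    (2, 1, 0.80), (2, 2, 0.75),
]

U, V, history = train_mf(observed, n_models=3, n_tasks=3)
missing = sum(a * b for a, b in zip(U[0], V[2]))  # prediction for the held-out cell
print(round(history[0], 4), round(history[-1], 4))
print(round(missing, 3))
```

Only observed cells contribute to the loss, which is what lets the learned vectors produce a score for the missing cell.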
Key Experimental Results
Main Results
| Method | RMSE↓ | MAE↓ | R²↑ | Cross-family Prediction RMSE↓ |
|---|---|---|---|---|
| Chinchilla Scaling Law | 5.82 | 4.31 | 0.62 | 8.47 |
| Observational Scaling (Ruan et al.) | 4.15 | 3.02 | 0.74 | 6.23 |
| FPE (Isik et al.) | 3.87 | 2.78 | 0.78 | 5.81 |
| Pure Matrix Factorization | 3.21 | 2.35 | 0.83 | 3.45 |
| CPP (Ours) | 2.43 | 1.76 | 0.91 | 2.68 |
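
For reference, the three metrics reported in the table above follow their standard definitions, sketched here in pure Python. The true and predicted scores are made-up numbers used only to exercise the functions; they are not results from the paper.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out scores and predictions.
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 58.0, 71.0, 77.0]

print(round(rmse(y_true, y_pred), 4))
print(mae(y_true, y_pred))
print(round(r2(y_true, y_pred), 4))
```

Lower is better for RMSE and MAE, higher is better for R², matching the arrows in the table header.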
Ablation Study
| Configuration | RMSE↓ | Description |
|---|---|---|
| Full CPP | 2.43 | Collaborative + Factors |
| w/o Design Factors (Pure Collaborative) | 3.21 | Degrades to pure MF without factors |
| w/o Collaborative Signals (Pure Factors) | 3.95 | Degrades to Scaling Law variant without collaboration |
| w/o Model Factors | 2.87 | Model metadata contribution +0.44 |
| w/o Task Factors | 2.71 | Task attribute contribution +0.28 |
| 20% of scores observed | 3.12 | Still effective with a small amount of observed data |
| 50% of scores observed | 2.43 | Matches the Full CPP RMSE when more scores are observed |
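
The contribution figures in the table above are the RMSE increase of each ablation relative to Full CPP. The short check below recomputes them from the reported numbers; the dictionary keys are shortened labels chosen for this example.

```python
# RMSE values copied from the ablation table.
full_cpp = 2.43
ablations = {
    "w/o design factors": 3.21,
    "w/o collaborative signals": 3.95,
    "w/o model factors": 2.87,
    "w/o task factors": 2.71,
}

# RMSE increase when a component is removed (larger means the component matters more).
deltas = {name: round(value - full_cpp, 2) for name, value in ablations.items()}
ranked = sorted(deltas, key=deltas.get, reverse=True)

for name in ranked:
    print(f"{name}: +{deltas[name]:.2f}")
```

Sorting by the increase puts the collaborative signal first, followed by the design factors as a whole, then model factors and task factors.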
Key Findings
- The collaborative signal contributes the most: pure matrix factorization alone reaches an RMSE of 3.21, 0.66 lower than the best scaling-law baseline (FPE, 3.87), and removing the collaborative signal from CPP raises RMSE from 2.43 to 3.95. This demonstrates the value of inter-model similarity.
- CPP's advantage is largest in cross-family prediction (RMSE 2.68 vs 5.81 for FPE), because collaborative filtering does not depend on an intra-family training-resource curve.
- Factor importance analysis reveals that the fine-tuning method (RLHF vs SFT) exerts a greater impact on downstream task performance than parameter size—a critical detail overlooked by traditional Scaling Laws.
- Even with only 20% of the score matrix available, CPP still outperforms traditional methods that require full design factors.
Highlights & Insights
- Modeling LLM performance prediction as a recommendation system problem is a compelling analogy—where "user ratings of movies" correspond to "model scores on tasks," enabling direct migration of mature recommendation technologies.
- The factor importance analysis unveils counter-intuitive insights (e.g., fine-tuning method being more critical than parameter size), which offers direct practical guidance for LLM development decisions.
- Collaborative data can be collected automatically and updated continuously; as more models and evaluation results are released, prediction accuracy should improve.
Limitations & Future Work
- The quality of collaborative data heavily relies on the consistency of evaluation standards across online platforms; discrepancies in evaluation settings across different platforms may introduce noise.
- For entirely new architectures that differ substantially from existing model families (e.g., state space models, SSMs), cold-start issues may impair prediction performance.
- Currently, only single-metric performance scores are predicted, without accounting for correlation among multiple metrics.
- The system can be extended to dynamic prediction—forecasting the final converged performance based on checkpoints during the training process.
Related Work & Insights
- vs Scaling Laws (Kaplan et al., 2020): Traditional Scaling Laws rely on power-law relationships with compute, whereas this work introduces collaborative filtering to lift the single-family restriction.
- vs FPE (Isik et al., 2024): Whereas FPE leverages intra-family similarities, this work further exploits cross-family similarities.
- vs Observational Scaling (Ruan et al., 2024): While observational scaling fits scaling curves via empirical data, the collaborative approach proposed here predicts performance without strictly relying on design factors.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Introducing recommendation system concepts to LLM performance prediction is a highly clever and elegant innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Features comprehensive comparison, detailed ablation, and in-depth analysis.
- Writing Quality: ⭐⭐⭐⭐ The framework is clearly described, utilizing precise and appropriate analogies.
- Value: ⭐⭐⭐⭐⭐ Holds high practical utility for LLM evaluation and model selection.