IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory
Conference: ACL 2025
arXiv: 2506.01048
Code: https://github.com/Mercidaiha/IRT-Router
Area: Interpretability
Keywords: LLM routing, item response theory, multi-model selection, interpretability, cost optimization
TL;DR
IRT-Router borrows Item Response Theory (IRT) from psychometrics, treating LLMs as "test-takers" and queries as "exam questions." It learns a multi-dimensional ability vector for each LLM, plus difficulty and discrimination parameters for each query, which makes multi-LLM routing interpretable. It reaches over 87% accuracy in OOD scenarios at roughly 1/30 of the cost of GPT-4o.
Background & Motivation
Background
Background: When using multiple LLMs, it is necessary to automatically select the most appropriate model based on query characteristics to balance performance and cost.
Limitations of Prior Work: Existing routing methods (RouteLLM, RouterBench) rely on simple heuristics or black-box classifiers, which lack interpretability and cannot explain why a query is routed to a specific model.
Key Challenge: A router must address three issues at once: interpretability, cold-start (how to route new queries), and the performance-cost trade-off.
Core Idea: IRT naturally models the "ability-difficulty" relationship, and transferring it to LLM routing yields both interpretability and effectiveness.
Method
Overall Architecture
There are two implementation variants: (1) MIRT-Router (Multi-dimensional IRT): \(\hat{P}(q_i, M_j) = 1/(1 + \exp(-a_i^\top \theta_{M_j} + b_i))\), where \(\hat{P}(q_i, M_j)\) is the predicted probability that LLM \(M_j\) answers query \(q_i\) correctly, \(\theta_{M_j}\) is the LLM ability vector, \(a_i\) is the query's discrimination vector, and \(b_i\) is the query's difficulty; (2) NIRT-Router (Neural IRT): introduces relevance vectors and neural network interaction functions.
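As a rough illustration of the MIRT-Router formula above, the following minimal sketch computes the predicted probability for one query against two models. The function name `mirt_probability` and all numeric values are hypothetical, chosen only to show how ability, discrimination, and difficulty interact; they are not taken from the paper or its code.

```python
import math

def mirt_probability(a_i, theta_j, b_i):
    """Predicted success probability: 1 / (1 + exp(-a_i^T theta_j + b_i))."""
    logit = sum(a * t for a, t in zip(a_i, theta_j)) - b_i
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 3-dimensional parameters, for illustration only.
a_i = [1.0, 0.5, 0.0]          # query discrimination vector
b_i = 0.5                      # query difficulty
theta_strong = [1.0, 1.0, 1.0] # ability vector of a stronger model
theta_weak = [0.0, 0.0, 0.0]   # ability vector of a weaker model

p_strong = mirt_probability(a_i, theta_strong, b_i)
p_weak = mirt_probability(a_i, theta_weak, b_i)
print(f"strong: {p_strong:.3f}, weak: {p_weak:.3f}")
```

Raising the difficulty \(b_i\) lowers the predicted probability for every model, while the discrimination vector \(a_i\) determines which ability dimensions matter for this query. In this example the third dimension has zero discrimination, so it has no effect on the prediction.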
Key Designs
- IRT Modeling: Each LLM is assigned a multi-dimensional ability vector \(\theta_{M_j}\), and each query has a difficulty \(b_i\) and a discrimination \(a_i\). These parameters are obtained from embeddings followed by a linear transformation.
- Cold-start Warm-up: For an unseen query, the query embedding is interpolated with the mean embedding of its neighboring known queries: \(e_{q_i}' = (1-\lambda) e_{q_i} + \lambda \cdot \text{mean(neighbors)}\). Values of \(\lambda\) between 0.3 and 0.4 work best.
- Scoring Function: \(S(q_i, M_j) = \alpha \hat{P}(q_i, M_j) - \beta C(M_j)\), where \(C(M_j)\) is the cost of model \(M_j\) and the weights satisfy \(\alpha+\beta=1\) to balance performance against cost. The query is routed to the model with the highest score.
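The warm-up and scoring steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`nearest_neighbors`, `warm_up`, `route`), the neighbor count `k`, the toy embeddings, the model names, and the normalization of costs to [0, 1] are all hypothetical. Only the interpolation formula, the Euclidean neighbor metric, and the score \(S = \alpha \hat{P} - \beta C\) come from the text.

```python
import math

def nearest_neighbors(e_q, known_embeddings, k=2):
    """Return the k known embeddings closest to e_q (Euclidean distance)."""
    return sorted(known_embeddings, key=lambda e: math.dist(e_q, e))[:k]

def warm_up(e_q, neighbors, lam=0.3):
    """Interpolate: e' = (1 - lam) * e_q + lam * mean(neighbors)."""
    mean = [sum(col) / len(neighbors) for col in zip(*neighbors)]
    return [(1 - lam) * x + lam * m for x, m in zip(e_q, mean)]

def route(pred_probs, costs, alpha=0.8):
    """Pick the model maximizing S = alpha * P - beta * C, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    scores = {m: alpha * pred_probs[m] - beta * costs[m] for m in pred_probs}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical 2-d embeddings of previously seen queries.
known = [[0.0, 1.0], [0.0, 3.0], [10.0, 10.0]]
e_new = [1.0, 0.0]
neighbors = nearest_neighbors(e_new, known, k=2)  # the two closest points
e_warm = warm_up(e_new, neighbors, lam=0.3)

# Hypothetical predicted success probabilities and costs normalized to [0, 1].
pred_probs = {"large_model": 0.90, "small_model": 0.80}
costs = {"large_model": 1.0, "small_model": 0.1}
choice_balanced, _ = route(pred_probs, costs, alpha=0.8)
choice_accuracy_only, _ = route(pred_probs, costs, alpha=1.0)
print(e_warm, choice_balanced, choice_accuracy_only)
```

With these toy numbers, the cost-aware setting picks the cheaper model because its small accuracy deficit is outweighed by the cost penalty, while setting \(\alpha = 1\) ignores cost and picks the model with the higher predicted probability.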
Key Experimental Results
Main Results
| Method | Accuracy | Cost | Reward |
|---|---|---|---|
| MIRT-Router | 80.67% | $0.42 | 63.89 |
| RouterBench | 80.01% | $1.15 | 62.23 |
| RouteLLM | 77.25% | $12.80 | 42.00 |
| GPT-4o only | 77.53% | $12.93 | 42.02 |
OOD Scenarios (20 candidate LLMs, 12 datasets)
| Method | Accuracy | Cost |
|---|---|---|
| MIRT-Router | 87.12% | $0.14 |
| NIRT-Router | 87.37% | $0.15 |
Top-k Routing Accuracy (MIRT-Router)
| Scenario | Top-1 | Top-2 | Top-3 | Top-5 |
|---|---|---|---|---|
| ID | 2.72% | 9.88% | 32.51% | 47.85% |
| OOD | 2.15% | 7.50% | 27.29% | 39.47% |
Top-1 accuracy is relatively low because multiple LLMs have similar abilities (e.g., both DeepSeek-Chat and DeepSeek-Coder reach 81%), but Top-3 accuracy reaches 32.5%. Routing analysis shows that 80% of high-difficulty queries are routed to DeepSeek-Chat, while 99% of low-difficulty queries are routed to the most cost-effective model, Qwen2.5-32B-GPTQ ($0.2/M).
Key Findings
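- Optimal Performance-Cost Trade-off: In the main results, MIRT-Router's accuracy is about 3 percentage points higher than using GPT-4o alone (80.67% vs. 77.53%), at roughly 1/30 of the cost ($0.42 vs. $12.93).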
- Interpretability: The ability vectors and difficulty scores possess clear semantics (e.g., DeepSeek-Chat achieves the highest ability of 81%, while GPT-4o achieves 78%).
- Effective Cold-start: The warm-up mechanism significantly boosts OOD performance, with a more substantial impact on NIRT-Router.
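The headline trade-off figures can be checked against the Main Results table with a few lines of arithmetic. The variable names below are illustrative; the numbers are the MIRT-Router and GPT-4o rows of that table.

```python
# Values copied from the Main Results table.
mirt_acc, gpt4o_acc = 80.67, 77.53    # accuracy in percent
mirt_cost, gpt4o_cost = 0.42, 12.93   # cost in dollars

acc_gap_points = mirt_acc - gpt4o_acc  # gap in percentage points
cost_ratio = gpt4o_cost / mirt_cost    # how many times cheaper MIRT-Router is

print(f"accuracy gap: {acc_gap_points:.2f} points, cost ratio: {cost_ratio:.1f}x")
```

The gap is an absolute difference in percentage points rather than a relative improvement, and the cost ratio of slightly over 30 is what the "1/30 of the cost" claim rounds to.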
Highlights & Insights
- Elegant cross-domain transfer from IRT to LLM routing: The mature theory of psychometrics is directly applied to LLM capability evaluation.
- Interpretability is the core selling point: Beyond strong routing performance, it provides explanations for what each LLM specializes in and where the difficulty lies in each query.
Limitations & Future Work
- Top-1 routing accuracy is relatively low (2-3%) because multiple models share similar capabilities, resulting in close proximity of ability vectors in high-dimensional space.
- Generalization to entirely new LLMs that were not seen during training is limited; a small amount of calibration data is required to initialize the ability vector of a new model.
- The router is insufficiently sensitive to changes in the cost parameters; tuning the ratio \(\alpha/\beta\) has limited influence on routing decisions.
- The proximity metric for cold-start warm-up relies on Euclidean distance within the embedding space, which may not accurately reflect the similarity in true query difficulty.
- The IRT model assumes static capabilities and difficulties, yet LLM capabilities can change as API versions update, necessitating periodic recalibration.
- Although the NIRT version introduces neural networks to enhance flexibility, it sacrifices the complete interpretability of the MIRT version.
Rating
- Novelty: ⭐⭐⭐⭐ The application of IRT to LLM routing represents an ingenious cross-domain transfer.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 20 LLMs × 12 datasets × ID+OOD scenarios.
- Writing Quality: ⭐⭐⭐⭐ Clear presentation of IRT theory combined with thorough interpretability analysis.
- Value: ⭐⭐⭐⭐⭐ Directly practical for multi-LLM deployment scenarios.