Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

Conference: ACL 2026
arXiv: 2604.09377
Code: GitHub
Area: LLM Evaluation
Keywords: LLM Routing, Cold-Start, Data Synthesis, Task Awareness, Cost-Performance Trade-off

TL;DR

This paper proposes a multi-level task-profile-guided data synthesis framework to address the cold-start problem in LLM routing. It designs TRouter—a routing method that treats task types as latent variables and models the query-cost-performance relationship through variational inference, achieving effective routing in both cold-start and in-domain settings.

Background & Motivation

Background: LLM routing aims to select the optimal model from a candidate pool for each user query to balance performance and cost. Prevailing methods are categorized into classification-based (predicting the best model directly) and regression-based (predicting cost and performance to maximize a utility function), typically requiring small routers (e.g., BERT) trained on in-domain data.

Limitations of Prior Work: (1) Real-world deployments often encounter cold-start scenarios where no in-domain labeled data is available for training routers; (2) Pre-trained routers exhibit poor generalization across domains, sometimes performing worse than simple rule-based baselines (Adaptive LLM); (3) Relying directly on LLMs for model selection is unreliable due to the difficulty in accurately characterizing the capability boundaries of candidate models.

Key Challenge: LLM routing depends on labeled data, which is unavailable in cold-start scenarios; meanwhile, out-of-distribution shifts render cross-domain trained routers ineffective.

Goal: (1) Design a data synthesis method without human annotation to approximate the query distribution at test time; (2) Construct a task-aware router to enhance cross-domain robustness.

Key Insight: The cost and performance of LLMs depend strongly on task category and difficulty: different task types and difficulty levels place significantly different demands on models. A hierarchical task taxonomy can therefore organize the synthetic data, and the implicit task type of a query can be exploited during routing.

Core Idea: Use a hierarchical task taxonomy (domain \(\rightarrow\) subcategory \(\rightarrow\) difficulty) to guide synthetic data generation and incorporate task types as latent variables into a regression-based routing framework.

Method

Overall Architecture

The system consists of two major modules: (1) A multi-level task-profile-guided data synthesis framework, which iteratively builds a hierarchical task taxonomy starting from seed domain descriptions to generate diverse QA pairs as routing training data; (2) TRouter—a task-type-aware router that introduces latent task variables and jointly models the conditional distribution of performance and cost via variational inference.
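The iterative taxonomy construction can be pictured as a generate-then-review loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `TaskNode`, `expand_until_stable`, and the `generate`/`review` callables are hypothetical names, the LLM-backed generator and evaluator are replaced by deterministic stubs, and "no modification" is approximated by comparing sub-type names only. It shows the stopping rule described under Key Designs (stop once three consecutive review rounds change nothing).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskNode:
    """One node of the domain -> subcategory -> difficulty taxonomy."""
    name: str
    definition: str
    children: List["TaskNode"] = field(default_factory=list)


def expand_until_stable(
    parent: TaskNode,
    generate: Callable[[TaskNode], List[TaskNode]],
    review: Callable[[TaskNode, List[TaskNode]], List[TaskNode]],
    patience: int = 3,
    max_rounds: int = 50,
) -> List[TaskNode]:
    """Generate sub-types, then re-review until `patience` consecutive
    rounds leave the list unchanged (or `max_rounds` is reached)."""
    children = generate(parent)
    unchanged = 0
    for _ in range(max_rounds):
        revised = review(parent, children)
        # Simplification: a round counts as "unchanged" if the names match.
        if [c.name for c in revised] == [c.name for c in children]:
            unchanged += 1
        else:
            unchanged = 0
        children = revised
        if unchanged >= patience:
            break
    parent.children = children
    return children


# Stubs standing in for the LLM-backed generator and evaluator (assumptions).
def stub_generate(parent: TaskNode) -> List[TaskNode]:
    names = ["algebra", "geometry", "algebra"]  # deliberately redundant
    return [TaskNode(n, f"{n} tasks under {parent.name}") for n in names]


def stub_review(parent: TaskNode, children: List[TaskNode]) -> List[TaskNode]:
    seen, kept = set(), []
    for child in children:  # drop redundant sub-types, keep first occurrence
        if child.name not in seen:
            seen.add(child.name)
            kept.append(child)
    return kept


root = TaskNode("math", "mathematical reasoning")
subtypes = expand_until_stable(root, stub_generate, stub_review)
print([c.name for c in subtypes])  # ['algebra', 'geometry']
```

Applying the same function recursively to each returned node would yield the second and third taxonomy levels; a real evaluator would also check specificity and completeness, not only redundancy.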

Key Designs

  1. Hierarchical Task Taxonomy Generation (Task Type Generator + Quality Evaluator):

    • Function: Automatically expands a small number of seed domain descriptions into a complete three-level taxonomy (domain \(\rightarrow\) subcategory \(\rightarrow\) difficulty).
    • Mechanism: The Task Type Generator uses parent category descriptions as prompt conditions to recursively generate sub-types (each level including name, definition, and examples). The Task Type Quality Evaluator performs self-review on the generated sub-types to check for redundancy, specificity, and completeness, iterating until no modifications are made for three consecutive rounds. With GPT-4.1 as the synthesizer, this yields 10 domains, 103 subcategories, 447 difficulty-level nodes, and 17,880 QA pairs (40 per difficulty-level node).
    • Design Motivation: The hierarchical structure enables fine-grained control and efficient sampling coverage, while the quality evaluator ensures the cohesion and diversity of the taxonomy.
  2. QA Pair Generation and Deduplication (Question-Answer Pair Generator):

    • Function: Generates diverse QA pairs for each difficulty-level task profile to serve as routing training data.
    • Mechanism: Uses task profiles (containing descriptions of the current and parent task types) as conditions for batch QA pair generation. A sentence-transformer computes the semantic similarity between new and existing QA pairs, and any new pair whose maximum similarity to an existing pair exceeds \(0.9\) is discarded as a near-duplicate. Generation proceeds in batches of 8 until each profile reaches the target of 40 pairs.
    • Design Motivation: Ensures synthetic data approximates the diversity of real test distributions, while the deduplication mechanism avoids data redundancy.
  3. TRouter: Task-Type-Aware Router:

    • Function: Introduces latent task type variables to enhance routing robustness and generalization.
    • Mechanism: Decomposes \(p(h|q,m)\) as \(\sum_t p(h|t,m) \cdot p(t|q)\), where \(h\) is the evaluation metric, \(m\) the candidate model, \(q\) the query, and \(t\) the latent task type. The Task Recognition Module encodes the query and all task type descriptions, concatenates the embeddings, and passes them through an MLP followed by a softmax to predict the task distribution \(q_\phi(t|q)\), which is constrained by cross-entropy and a KL divergence from a prior. The Metric Prediction Module makes a prediction per task type and combines these, weighted by the task distribution, into the final prediction for each metric-model pair; it is trained with an MSE loss. At inference, the model maximizing the utility \(U(m,q)=\mu_r \cdot r(m,q) - \mu_c \cdot c(m,q)\) is selected, where \(r\) and \(c\) are the predicted performance and cost and \(\mu_r\), \(\mu_c\) are their weights.
    • Design Motivation: Directly predicting cost/performance from query features is prone to superficial feature influence; introducing task types as intermediate representations decouples the impact of task semantics, improving cross-domain robustness.
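The near-duplicate filter from the QA generation step above can be sketched as follows. This is an illustrative sketch rather than the paper's code: `filter_near_duplicates` is a hypothetical helper, toy 2-D vectors stand in for sentence-transformer embeddings, and comparing each candidate against earlier accepted candidates from the same batch (not only against previously stored pairs) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_near_duplicates(existing, candidates, threshold=0.9):
    """Greedily accept candidates whose maximum cosine similarity to all
    already-accepted embeddings is at most `threshold`.
    Returns the indices of the accepted candidates."""
    accepted = list(existing)
    kept = []
    for i, cand in enumerate(candidates):
        max_sim = max((cosine(cand, a) for a in accepted), default=-1.0)
        if max_sim <= threshold:
            accepted.append(cand)
            kept.append(i)
    return kept


existing = [[1.0, 0.0]]
candidates = [
    [0.99, 0.05],  # near-duplicate of the existing item
    [0.0, 1.0],    # orthogonal to it, so clearly new
    [0.02, 1.0],   # near-duplicate of the previous candidate
]
kept = filter_near_duplicates(existing, candidates)
print(kept)  # [1]
```

In a full pipeline this filter would sit inside a loop that keeps requesting batches until the profile holds its target number of pairs.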
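The TRouter inference rule, marginalizing per-task-type predictions over \(q_\phi(t|q)\) and then maximizing \(U(m,q)\), can be illustrated with a small numeric sketch. All numbers are made up, `marginalize` and `route` are hypothetical names, and the per-type predictions are given as fixed tables rather than produced by a trained network.

```python
def marginalize(task_posterior, by_task):
    """Expected value per model: sum_t q(t|q) * value[t][m]."""
    n_models = len(by_task[0])
    return [
        sum(p * row[m] for p, row in zip(task_posterior, by_task))
        for m in range(n_models)
    ]


def route(task_posterior, perf_by_task, cost_by_task, mu_r, mu_c):
    """Return the model index maximizing U = mu_r * r - mu_c * c."""
    r = marginalize(task_posterior, perf_by_task)  # expected performance
    c = marginalize(task_posterior, cost_by_task)  # expected cost
    utility = [mu_r * ri - mu_c * ci for ri, ci in zip(r, c)]
    best = max(range(len(utility)), key=utility.__getitem__)
    return best, utility


# Toy setup: 3 latent task types (rows), 2 candidate models (columns).
task_posterior = [0.7, 0.2, 0.1]  # q_phi(t | q), sums to 1
perf_by_task = [[0.60, 0.90], [0.50, 0.70], [0.40, 0.80]]
cost_by_task = [[0.10, 0.50], [0.10, 0.40], [0.20, 0.60]]

perf_first, _ = route(task_posterior, perf_by_task, cost_by_task, mu_r=1.0, mu_c=0.1)
cost_first, _ = route(task_posterior, perf_by_task, cost_by_task, mu_r=0.2, mu_c=1.0)
print(perf_first, cost_first)  # 1 0
```

With these toy values the stronger but pricier model (index 1) wins when performance dominates the utility, and the cheaper model (index 0) wins when cost dominates.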

Loss & Training

The total loss is \(\mathcal{L} = \mathcal{L}_{CE} + \frac{1}{|\mathcal{M}||\mathcal{H}|} \sum_m \sum_h \mathcal{L}_{MSE}^{h,m}\), where \(\mathcal{M}\) is the set of candidate models and \(\mathcal{H}\) the set of metrics. The cross-entropy loss corresponds to the KL term of the ELBO and the MSE loss to the reconstruction term. Queries and task types are encoded with all-MiniLM-L6-v2 and mapped to 256 dimensions. In cold-start settings, each task type contributes 30 training and 10 validation QA pairs.
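For a single training query, the total loss above can be computed as in the following sketch. It is a simplified illustration under assumptions: `total_loss` is a hypothetical helper, the task label is taken to be the taxonomy node that produced the synthetic query, and each \(\mathcal{L}_{MSE}^{h,m}\) reduces to one squared error because only one query is shown (in a batch it would be averaged over queries).

```python
import math


def cross_entropy(task_probs, true_task):
    """Negative log-probability assigned to the labelled task type."""
    return -math.log(task_probs[true_task])


def total_loss(task_probs, true_task, preds, targets):
    """L = L_CE + mean over (metric, model) pairs of the squared error.
    `preds` and `targets` map (metric, model) -> float for one query."""
    ce = cross_entropy(task_probs, true_task)
    mse = sum((preds[k] - targets[k]) ** 2 for k in targets) / len(targets)
    return ce + mse


task_probs = [0.5, 0.25, 0.25]  # predicted q_phi(t | q)
preds = {("perf", "A"): 0.8, ("cost", "A"): 0.2,
         ("perf", "B"): 0.5, ("cost", "B"): 0.1}
targets = {("perf", "A"): 1.0, ("cost", "A"): 0.2,
           ("perf", "B"): 0.5, ("cost", "B"): 0.3}

loss = total_loss(task_probs, true_task=0, preds=preds, targets=targets)
print(round(loss, 4))  # 0.7131
```

Here the cross-entropy term is \(\ln 2 \approx 0.6931\) and the averaged squared error is \(0.02\), with the \(1/(|\mathcal{M}||\mathcal{H}|)\) factor realized as the division by the four metric-model pairs.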

Key Experimental Results

Main Results

| Setting | Method | Cost-first Utility | Balanced Utility | Perf-first Utility | Utility Sum |
|---|---|---|---|---|---|
| Cold-start | Adaptive LLM | 0.0217 | 0.1809 | 0.2887 | 0.4913 |
| Cold-start | RouterDC⋆ | 0.0197 | 0.1490 | 0.2989 | 0.4676 |
| Cold-start | Ours▲ (GPT-4.1 Synth) | 0.0355 | 0.1811 | 0.3108 | 0.5274 |
| Cold-start | Ours∙ (Gemini Synth) | 0.0352 | 0.1809 | 0.3221 | 0.5382 |
| In-domain | MetricRouter | 0.0442 | 0.1911 | 0.3388 | 0.5741 |
| In-domain | Ours▲ | 0.0518 | 0.1949 | 0.3447 | 0.5914 |

Ablation Study

| Configuration | Utility Sum | Description |
|---|---|---|
| TRouter (Full) | 0.5382 | Full model (Gemini Synth) |
| w/o Task Variables | ~0.52 | Degenerates to standard regression routing |
| w/o Data Synthesis | 0.4913 | Degenerates to rule-based baseline |
| w/o Quality Evaluator | ~0.51 | Decline in taxonomy quality |

Key Findings

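  • In cold-start scenarios, TRouter's Utility Sum (0.5274 with GPT-4.1 synthesis, 0.5382 with Gemini synthesis) exceeds all baselines (0.4913 for Adaptive LLM, 0.4676 for RouterDC⋆) and approaches the in-domain MetricRouter (0.5741).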
  • The synthesis framework is effective using both GPT-4.1 and Gemini-2.5-flash, verifying its versatility.
  • In in-domain settings, TRouter also outperforms regression baselines such as MetricRouter (0.5914 vs. 0.5741 Utility Sum), indicating that the gains from task type modeling are not limited to cold-start.
  • Traditional cross-domain trained routers (RouterDC⋆, MetricRouter⋆) perform poorly under cold-start, sometimes failing to beat the Adaptive LLM rule-based baseline.

Highlights & Insights

  • The design of task type as a latent variable is ingenious: Extending the task taxonomy from the data synthesis stage to the routing modeling stage creates a closed loop of "synthetic data \(\rightarrow\) routing prior." This adds a layer of structured inductive bias compared to simply training a standard router on synthetic data.
  • Problem definition and solution for cold-start are transferable: The core idea of the data synthesis framework—using hierarchical classification to guide the generation of diverse samples—is applicable to any model selection or scheduling scenario lacking labeled data.
  • The variational inference framework provides interpretability: The task distribution \(q_\phi(t|q)\) is not only used for prediction but also informs the user about the type of task a query belongs to, enhancing the explainability of routing decisions.

Limitations & Future Work

  • Synthetic data remains dependent on powerful LLMs (GPT-4.1 or Gemini); applicability may be limited in scenarios where these models are unavailable.
  • Seed domains for the task taxonomy must be manually specified (expanded from 6 to 10), and adaptation capability to entirely new domains is unverified.
  • The candidate model pool in experiments is relatively small (6 open-source + 5 commercial); routing efficiency and scalability with larger model pools require further validation.
  • Routing latency is not discussed—whether the inference time of the router itself offsets the efficiency gains of model selection in real-world deployments remains to be seen.

Related Work & Insights

  • vs GraphRouter: GraphRouter models routing as edge prediction on a heterogeneous graph; TRouter is more concise using latent task variables and shows a clear advantage in cold-start scenarios.
  • vs MetricRouter: Both are regression-based; whereas MetricRouter predicts metrics directly from query embeddings, TRouter introduces additional task type decomposition, outperforming it in both in-domain and cold-start settings.
  • vs Adaptive Rules: Adaptive LLM linearly selects models based only on user cost tolerance. Its relative robustness compared to most learning-based methods in cold-start highlights the severity of the cold-start problem.

Rating

  • Novelty: ⭐⭐⭐⭐ Valuable cold-start problem definition; clever integration of data synthesis and latent variable routing.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers both cold-start and in-domain settings with multi-LLM pool validation, though ablation studies could be more detailed.
  • Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivation, intuitive framework diagrams, and well-defined problem statements.
  • Value: ⭐⭐⭐⭐ Cold-start routing is a genuine pain point in actual deployment; the synthesis framework offers good generality.