Mastering Multiple-Expert Routing: Realizable H-Consistency and Strong Guarantees

Conference: ICML 2025
arXiv: 2506.20650
Code: None
Area: Medical Imaging (Learning to Defer/Expert Routing)
Keywords: Learning to Defer, Multiple-Expert Routing, H-Consistency, Surrogate Loss, Bayes Consistency

TL;DR

This paper proposes new surrogate loss functions and efficient algorithms for the multiple-expert routing (learning to defer) problem, establishing theoretical guarantees for realizable H-consistency, H-consistency bounds, and Bayes consistency across both single-stage and two-stage learning scenarios.

Background & Motivation

Background: Multiple-expert routing (or learning to defer) is the problem of assigning each input instance to the most suitable of several experts, which may be AI models or humans. It is increasingly important in natural language generation, image processing, and medical diagnosis. For example, simple queries can be handled by a small model, while harder ones are referred to a larger model or a human expert.

Limitations of Prior Work: Recent studies have proposed various surrogate loss functions for optimizing routing decisions, but several questions about their consistency guarantees remain open: (1) whether existing surrogate losses are realizable H-consistent; (2) whether they admit H-consistency bounds; (3) how Bayes consistency can be guaranteed in multiple-expert scenarios.

Key Challenge: Practical deployment requires guarantees over the restricted hypothesis set \(H\) actually used for learning (e.g. a fixed network architecture), but most existing theory stops at Bayes consistency, which only applies to the family of all measurable functions (infinite capacity). The routing loss is a discontinuous 0-1-type loss that cannot be optimized directly, so a surrogate loss is required, and the effect of optimizing the surrogate instead of the target loss needs to be quantified.

Goal: To provide a comprehensive theoretical guarantee framework for both single-stage and two-stage multiple-expert routing.

Key Insight: Starting from the design of surrogate losses, establish consistency guarantees at all levels through rigorous mathematical analysis.

Core Idea: Design a new family of surrogate losses that simultaneously satisfy realizable H-consistency and H-consistency bounds, providing the strongest theoretical guarantees for multiple-expert routing.

Method

Overall Architecture

Problem Definition: Given an input \(x\), select from \(K+1\) options (\(K\) experts + the predictor itself), aiming to minimize the total error and computational cost of allocation.
Single-stage: Jointly learn the predictor and the routing function.
Two-stage: Fix the experts and only learn the routing function.
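
To make the objective concrete, here is a minimal sketch of the deferral loss that a router is evaluated on. It assumes the common learning-to-defer formulation in which option 0 is the predictor (paying its 0-1 error) and option \(j \geq 1\) is expert \(j\) (paying a cost \(c_j\)). The function name `deferral_loss`, the array layout, and the toy numbers are illustrative assumptions, not notation taken from the paper.

```python
import numpy as np

def deferral_loss(route, pred, y, expert_cost):
    """Per-example deferral loss under an assumed standard L2D formulation.

    route:       (n,) ints in {0, ..., K}; 0 = use the predictor, j >= 1 = defer to expert j
    pred:        (n,) labels output by the predictor
    y:           (n,) true labels
    expert_cost: (n, K) cost of deferring to each expert (e.g. expert error plus a fee)
    """
    route = np.asarray(route)
    pred_err = (np.asarray(pred) != np.asarray(y)).astype(float)
    # Column 0 holds the predictor's 0-1 error; columns 1..K hold the expert costs.
    all_costs = np.column_stack([pred_err, np.asarray(expert_cost, dtype=float)])
    # Pick, for each example, the cost of the option the router selected.
    return all_costs[np.arange(len(route)), route]

# Toy example with K = 2 experts and 3 inputs (made-up numbers).
y = np.array([0, 1, 2])
pred = np.array([0, 2, 2])
expert_cost = np.array([[0.3, 1.0], [0.3, 0.1], [1.3, 0.1]])
route = np.array([0, 2, 1])
losses = deferral_loss(route, pred, y, expert_cost)
```

In this toy run the first input is kept by the (correct) predictor, the second is deferred to the cheap expert 2, and the third is deferred to expert 1, which is the expensive choice for that input.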

Key Designs

Single-Stage Realizable H-Consistent Surrogate Loss Family:
- Propose a new family of surrogate loss functions that satisfy realizable H-consistency.
- Meaning of realizable H-consistency: in the realizable case, where some hypothesis in \(H\) attains zero deferral loss, minimizing the surrogate loss over \(H\) yields a hypothesis that also attains zero deferral loss.
- Further prove that a specific member within this family satisfies H-consistency bounds (quantifying the relationship: surrogate loss gap \(\rightarrow\) original loss gap).
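- Why it matters: These guarantees are specific to the hypothesis set \(H\) and are therefore stronger than Bayes consistency, which only holds for the family of all measurable functions (an infinite-capacity assumption) and which was the main guarantee established in much prior work.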
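
For reference, an H-consistency bound in the general form used in the H-consistency literature can be written as below. The notation is assumed here for illustration and may differ from the paper's: \(\mathsf{L}_{\mathrm{def}}\) denotes the deferral loss, \(\Phi\) a surrogate loss, \(\mathcal{E}_{\ell}(h)\) the expected \(\ell\)-loss of \(h\), \(\mathcal{E}^*_{\ell}(H) = \inf_{h \in H} \mathcal{E}_{\ell}(h)\) the best-in-class error, and \(\mathcal{M}_{\ell}(H)\) the minimizability gap.

\[
\mathcal{E}_{\mathsf{L}_{\mathrm{def}}}(h) - \mathcal{E}^*_{\mathsf{L}_{\mathrm{def}}}(H) + \mathcal{M}_{\mathsf{L}_{\mathrm{def}}}(H)
\;\le\;
\Gamma\!\left( \mathcal{E}_{\Phi}(h) - \mathcal{E}^*_{\Phi}(H) + \mathcal{M}_{\Phi}(H) \right)
\quad \text{for all } h \in H,
\]

where \(\Gamma\) is a non-decreasing concave function, typically with \(\Gamma(0) = 0\). When the minimizability gaps vanish, the bound says that driving the surrogate excess error over \(H\) to zero also drives the deferral-loss excess error over \(H\) to zero, at a rate controlled by \(\Gamma\).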
Multi-Level Theoretical Guarantees for the Two-Stage Setting:
- For the two-expert scenario: derive new surrogate losses that simultaneously satisfy realizable H-consistency, H-consistency bounds, and Bayes consistency.
- For the multiple-expert scenario: obtain similar guarantees under natural assumptions (bounded expert errors).
- Why it matters: The two-stage setting is more practical (as experts are typically already deployed), but its theoretical analysis is more challenging.
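
Because the experts are fixed in the two-stage setting, their per-example costs can be computed once before any router is trained. The sketch below illustrates this under the assumption that the cost of an expert is its 0-1 error plus a per-expert fee; `precompute_costs`, `fees`, and the toy data are hypothetical and not from the paper.

```python
import numpy as np

def precompute_costs(expert_preds, y, fees):
    """Cost matrix for fixed experts: 0-1 error plus an assumed per-expert fee.

    expert_preds: (n, K) labels predicted by each fixed expert
    y:            (n,) true labels
    fees:         (K,) hypothetical consultation fee for each expert
    """
    errors = (np.asarray(expert_preds) != np.asarray(y)[:, None]).astype(float)
    return errors + np.asarray(fees, dtype=float)[None, :]

# Toy data: 4 inputs, 2 fixed experts (made-up numbers).
y = np.array([1, 0, 1, 1])
expert_preds = np.array([[1, 1], [1, 0], [0, 1], [1, 1]])
fees = np.array([0.0, 0.2])

costs = precompute_costs(expert_preds, y, fees)
# The cost-minimizing route per input; a learned router tries to approximate
# this choice from the features alone, without access to the labels.
oracle_route = costs.argmin(axis=1)
```

The oracle route uses the true labels and is therefore only a reference point: the learning problem is to predict a good route from \(x\) without seeing \(y\).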
Enhanced Guarantees under Low-Noise Conditions:
- Provide tighter bounds under low-noise assumptions (i.e., when most inputs have a clear optimal choice).
- Features include faster convergence rates and smaller constants.
- Why it matters: In real-world scenarios, most inputs indeed have a clear optimal route.

Loss & Training

Single-stage surrogate loss: A parameterized family of losses \(\Phi_\alpha\) satisfying convexity + consistency.
Two-stage surrogate loss: A new loss designed specifically for the fixed-expert scenario.
Training method: Standard gradient descent or SGD; unlike the discontinuous routing loss, the surrogate losses are designed to be amenable to gradient-based optimization.
Hyperparameter \(\alpha\) controls the trade-off of different consistency properties.
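
As a rough illustration of surrogate-based training, the sketch below fits a linear softmax router by plain gradient descent. The surrogate used here is a generic cost-weighted log-loss chosen for simplicity; it is not the paper's \(\Phi_\alpha\) family, and `train_router`, the learning rate, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_router(X, costs, lr=0.5, steps=500):
    """Fit a linear router by gradient descent on a cost-weighted log-loss.

    X:     (n, d) features
    costs: (n, K+1) cost of each option in [0, 1]; column 0 is the predictor
    Surrogate (illustrative, NOT the paper's loss):
        sum_j (1 - costs[i, j]) * (-log softmax(X @ W)[i, j])
    """
    n, d = X.shape
    W = np.zeros((d, costs.shape[1]))
    weights = 1.0 - costs  # low-cost options receive a large weight
    for _ in range(steps):
        P = softmax(X @ W)
        # Gradient of the weighted log-loss with respect to the logits.
        G = P * weights.sum(axis=1, keepdims=True) - weights
        W -= lr * (X.T @ G) / n
    return W

# Synthetic two-option problem: option 0 is free when the first feature is
# positive, option 1 is free otherwise (made-up data).
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(200, 2)), np.ones(200)])  # last column = bias
costs = np.where(X[:, [0]] > 0, [[0.0, 1.0]], [[1.0, 0.0]])

W = train_router(X, costs)
route = (X @ W).argmax(axis=1)
avg_cost = costs[np.arange(len(route)), route].mean()
```

The router is evaluated by the average cost of the options it selects, which is the quantity the surrogate is meant to control; how tightly it does so is exactly what consistency guarantees describe.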

Key Experimental Results

Main Results

| Dataset | Setting | Ours | Prev. SOTA | Description |
| --- | --- | --- | --- | --- |
| CIFAR-10 + expert | Two-stage / 2-expert | Better | L2D series | Acc improvement |
| CIFAR-100 + expert | Two-stage / multi-expert | Better | CE-based | Acc improvement |
| HAM10000 (skin) | Medical routing | Competitive | Existing methods | Safety-critical scenario |
| NLP routing | Multi-expert | Better | mixture baselines | NLG tasks |

Ablation Study

| Configuration | Key Metrics | Description |
| --- | --- | --- |
| Different \(\alpha\) values | Monotonic changes | Verifies the controllability of the parameterized family |
| Bayes consistency only | Poor generalization | Finite capacity requires H-consistency |
| Single-stage vs Two-stage | Respective advantages | Depends on whether experts can be retrained |
| Low-noise vs High-noise | Low-noise is better | Verifies the enhanced guarantees |

Key Findings

The new surrogate losses consistently outperform or match existing methods in experiments.
Theoretical guarantees indeed translate to better performance in finite-capacity (practical network) scenarios.
Performance improvement under low-noise conditions is more pronounced.
Safety guarantees in medical routing scenarios hold practical significance.

Highlights & Insights

Theoretical Completeness: Provides the first complete theory of realizable H-consistency + H-consistency bounds + Bayes consistency for multiple-expert routing.
Practicality: The theory is not just on paper; the new losses indeed perform better in experiments.
Comprehensive Coverage: Covers single-stage and two-stage settings, two-expert and multi-expert scenarios, and both general and low-noise conditions.
Answering Open Questions: Resolves several open theoretical questions from previous literature.

Limitations & Future Work

Full guarantees for multiple experts require natural assumptions (bounded expert errors) and are not completely unconditional.
Experimental scale is relatively small, with insufficient validation on large-scale LLM routing.
The cost model for computation is relatively simplified (assuming the cost of each expert is known).
Dynamic/non-stationary expert scenarios are not considered.

The L2D line of work by Madras et al. and Mozannar & Sontag is the primary precursor.
The H-consistency theoretical framework of Mohri et al. provides the methodological foundation.
Insight: The H-consistency analysis method can be generalized to other decision-making problems involving multi-system collaboration.

Rating

Novelty: ⭐⭐⭐⭐ Solid theoretical contribution
Experimental Thoroughness: ⭐⭐⭐⭐ Validated in multiple scenarios
Writing Quality: ⭐⭐⭐⭐ Rigorously written
Value: ⭐⭐⭐⭐ Provides a solid theoretical foundation for multiple-expert routing