# Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
Conference: NeurIPS 2025 · arXiv: 2410.20749 · Code: GitHub · Area: Model Compression / LLM Control · Keywords: Black-Box LLM, White-Box Controller, Iterative DPO, Intermediate Guidance, Multi-Turn Interaction
## TL;DR
This paper proposes Matryoshka Pilot (M-Pilot), which employs a lightweight white-box LLM as a controller to generate intermediate guidance (task decomposition, high-level plans, user profiles) for driving black-box LLMs on complex long-horizon tasks such as reasoning, planning, and personalization, with iterative DPO enabling continual self-improvement.
## Background & Motivation
Background: Commercial LLMs (e.g., GPT-4, Gemini) are predominantly black-box models, offering no access to parameters, architecture, or even output logits.
Limitations of Prior Work: Existing approaches to enhancing black-box LLM capabilities fall into two categories: (a) ICL-based methods that rely on carefully designed demonstrations and prompts, dependent on human heuristics; and (b) adapter-based methods that select the best output from multiple candidates, but remain constrained by the black-box LLM's own generation capacity. Both categories perform poorly on long-horizon tasks involving multi-step reasoning and long-range planning.
Key Challenge: The opacity of black-box LLMs renders direct parameter optimization infeasible, yet task-specific performance improvements remain a practical necessity.
Goal: To systematically enhance black-box LLMs' reasoning, planning, and personalization capabilities on complex long-horizon tasks without accessing their parameters.
Key Insight: Treating the black-box LLM as an "environment" and training a white-box LLM as a "policy" to generate intermediate guidance.
Core Idea: Driving large models with small models — a lightweight white-box controller generates intermediate guidance to steer black-box LLM behavior, with iterative preference optimization enabling continuous improvement.
## Method
### Overall Architecture
M-Pilot adopts a controller–generator framework: a white-box LLM (e.g., LLaMA-3-8B-Instruct) serves as the controller, generating \(T\)-step intermediate guidance \(\{g_t\}_{t=1}^T \sim f_\theta(x)\); a black-box LLM (e.g., GPT-4o-mini) acts as the generator/environment, receiving the guidance and producing the final answer \(\hat{y} \sim g_{\text{LLM}}(x, \{g_t\}_{t=1}^T)\).
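To make the interaction concrete, here is a minimal sketch of one controller–generator episode (not the authors' implementation); `controller_generate` and `blackbox_generate` are hypothetical stand-ins for the white-box policy \(f_\theta\) and the black-box API, and only prompt/response text crosses the black-box boundary.

```python
# Minimal sketch of one M-Pilot inference episode (illustrative, not the
# authors' code). The two callables are hypothetical: the first wraps the
# white-box controller (e.g., LLaMA-3-8B-Instruct), the second the black-box
# generator (e.g., GPT-4o-mini).
from typing import Callable, List


def m_pilot_answer(
    x: str,
    controller_generate: Callable[[str], str],  # white-box policy f_theta
    blackbox_generate: Callable[[str], str],    # black-box generator g_LLM
    num_steps: int = 3,                         # T guidance steps
) -> str:
    guidance: List[str] = []
    state = x
    for t in range(num_steps):
        # Controller emits the next piece of intermediate guidance
        # (a sub-question, a high-level plan step, or a profile summary).
        g_t = controller_generate(state)
        guidance.append(g_t)
        # Black-box LLM acts on the task plus the guidance so far.
        o_t = blackbox_generate(f"Task: {x}\nGuidance:\n" + "\n".join(guidance))
        # State transition s_t = (s_{t-1}, a_t, o_t).
        state = f"{state}\n[guidance {t + 1}] {g_t}\n[observation {t + 1}] {o_t}"
    # Final answer conditioned on the full guidance sequence.
    return blackbox_generate(f"Task: {x}\nGuidance:\n" + "\n".join(guidance))
```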
### Key Designs
- Task-Specific Instantiation of Intermediate Guidance:
  - Reasoning tasks: The controller outputs a sequence of sub-tasks decomposed from the original question, facilitating step-by-step reasoning in the black-box LLM.
  - Planning tasks: The controller generates high-level plans, decomposing complex tasks into sub-goals.
  - Personalization tasks: The controller summarizes user history to construct a user profile.

  All three guidance forms are unified within the same framework, demonstrating the generality of the approach.
- Multi-Turn Interaction and MDP Formulation: The controller–environment interaction is modeled as a Markov Decision Process. At each step \(t\), the controller generates action \(a_t\) (i.e., guidance) based on the current state \(s_{t-1}\), and the environment returns observation \(o_t\). The state transition is defined as \(s_t = (s_{t-1}, a_t, o_t) = (x, a_1, o_1, \cdots, a_t, o_t)\). The trajectory reward is determined by a correctness evaluation function on the final answer: \(u(\tau) = \text{eval}(\hat{y}, y)\).
- Data Collection: For each input \(x_i\), \(K\) multi-turn interaction trajectories are sampled and evaluated, yielding positive examples (successful guidance) and negative examples (failed guidance). Stochastic sampling is used to increase guidance diversity (see the collection sketch after this list).
- Iterative Direct Preference Optimization (IDPO):
  - SFT Warm-Up: The controller policy is initialized via behavioral cloning on guidance generated by GPT-3.5.
  - Iterative Preference Pair Collection: At iteration \(m\), new trajectories are generated using \(\theta^{(m)}\), grouped by success/failure, and merged with historical data.
  - Preference Optimization: Preferences are modeled with the Bradley-Terry model, and the reference policy is updated to the previous checkpoint \(\pi_{\text{ref}} = \pi_\theta^{(m)}\). The training objective is

    \[
    \mathcal{L}_{\text{IDPO}} = \mathbb{E}_{(x, \tau^+, \tau^-) \sim \mathcal{D}} \left[ -\log\sigma\left(\eta^{-1}\left(\log\frac{p_{\theta^{(m+1)}}(\{g_t^+\} \mid x)}{p_{\theta^{(m)}}(\{g_t^+\} \mid x)} - \log\frac{p_{\theta^{(m+1)}}(\{g_t^-\} \mid x)}{p_{\theta^{(m)}}(\{g_t^-\} \mid x)}\right)\right)\right]
    \]

    A key derivation shows that the black-box LLM's generation probabilities cancel in the positive/negative ratio, so the optimization objective involves only the white-box controller's parameters.
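As a concrete illustration of the data-collection step above, the following hedged sketch builds DPO preference pairs from \(K\) stochastic rollouts per input; `rollout` and `eval_fn` are assumed helpers (one multi-turn episode as sketched earlier, and a task-specific correctness check such as exact match or ALFWorld success), and all names are illustrative.

```python
# Illustrative preference-pair construction (assumed helper signatures, not
# the authors' code): K rollouts per input, split by the task-level reward
# eval(y_hat, y), then paired as (successful guidance, failed guidance).
import random
from typing import Callable, Dict, List, Tuple


def collect_preference_pairs(
    dataset: List[Tuple[str, str]],                    # (input x, reference y)
    rollout: Callable[[str], Tuple[List[str], str]],   # -> (guidance seq, y_hat)
    eval_fn: Callable[[str, str], bool],               # correctness of y_hat
    k: int = 8,                                        # rollouts per input
) -> List[Dict]:
    pairs: List[Dict] = []
    for x, y in dataset:
        positives, negatives = [], []
        for _ in range(k):                              # stochastic sampling
            guidance, y_hat = rollout(x)                # one multi-turn episode
            (positives if eval_fn(y_hat, y) else negatives).append(guidance)
        # Every (successful, failed) combination yields one preference pair.
        for g_pos in positives:
            for g_neg in negatives:
                pairs.append({"prompt": x, "chosen": g_pos, "rejected": g_neg})
    random.shuffle(pairs)
    return pairs
```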
### Loss & Training
- SFT warm-up → iterative data sampling + DPO training (with bootstrapping-style data accumulation)
- The reference policy is updated at each iteration, enabling continual self-improvement.
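Below is a minimal PyTorch sketch of the per-pair IDPO loss, assuming the summed log-probabilities of the guidance tokens have already been computed under the current policy \(\theta^{(m+1)}\) and the frozen previous-iteration reference \(\theta^{(m)}\); the value of `eta` is illustrative. Because the black-box generation probabilities cancel in the chosen/rejected ratios, no term from the black-box LLM appears.

```python
# Hedged sketch of the IDPO objective (Bradley-Terry over log-ratio margins).
# Inputs are per-example summed log-probs of the guidance sequences; `eta` is
# an illustrative temperature (eta^{-1} plays the role of the usual DPO beta).
import torch
import torch.nn.functional as F


def idpo_loss(
    logp_chosen: torch.Tensor,        # log p_{theta^(m+1)}({g_t^+} | x)
    logp_rejected: torch.Tensor,      # log p_{theta^(m+1)}({g_t^-} | x)
    ref_logp_chosen: torch.Tensor,    # log p_{theta^(m)}({g_t^+} | x)
    ref_logp_rejected: torch.Tensor,  # log p_{theta^(m)}({g_t^-} | x)
    eta: float = 10.0,
) -> torch.Tensor:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(margin / eta).mean()


# Toy usage on dummy log-probabilities (batch of 4 preference pairs).
loss = idpo_loss(
    torch.tensor([-10.0, -12.0, -9.5, -11.0]),
    torch.tensor([-13.0, -11.5, -12.0, -14.0]),
    torch.tensor([-10.5, -12.5, -10.0, -11.5]),
    torch.tensor([-12.5, -11.0, -11.5, -13.5]),
)
```

After each round, the reference log-probabilities are recomputed under the newly trained checkpoint, which is what makes the procedure iterative rather than a single DPO pass.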
## Key Experimental Results
### Main Results
Personalization (LaMP): GPT-4o-mini used as the black-box LLM.
| Method | LaMP-1 Acc | LaMP-2N Acc | LaMP-2M Acc | LaMP-3 MAE↓ | LaMP-4 BLEU |
|---|---|---|---|---|---|
| gpt-4o-mini | 0.514 | 0.655 | 0.413 | 0.371 | 0.992 |
| RAG (k=4) | 0.632 | 0.792 | 0.502 | 0.272 | 2.953 |
| M-Pilot | 0.640 | 0.823 | 0.527 | 0.277 | 4.298 |
Reasoning (GSM8K):
| Method | GSM8K (gpt-3.5) | GSM-HARD (gpt-3.5) |
|---|---|---|
| CoT | 0.809 | 0.406 |
| PAL_SelfDebug | 0.864 | 0.701 |
| M-Pilot | 0.931 | 0.761 |
Planning (ALFWorld): GPT-3.5-turbo used as the black-box LLM.
| Method | Overall Success Rate |
|---|---|
| ReAct | 47.76% |
| AdaPlanner | 88.06% |
| M-Pilot | 96.27% |
### Ablation Study
| Variant | ALFWorld Success Rate |
|---|---|
| M-Pilot (full) | 96.27% |
| w/o 2nd-round IDPO | 94.78% |
| w/o 1st & 2nd-round IDPO | 88.06% |
| w/o Guidance Optimization | 81.34% |
### Key Findings
- M-Pilot achieves average improvements of +3.19% (reasoning), +7.46% (planning), and +5.82% (personalization) across the three task categories.
- Plug-and-play transferability: a controller trained on GPT-4o-mini transfers directly to GPT-3.5 and Gemini-1.5-flash without additional training.
- High sample efficiency: M-Pilot surpasses the strongest baseline (AdaPlanner) using only 1/4 of the training data.
- Iterative DPO yields consistent self-improvement, with each additional IDPO round contributing stable gains.
## Highlights & Insights
- Paradigm innovation of "small models driving large models": The controller–generator interaction is formalized as an MDP and optimized via reinforcement learning methods.
- Strong generality: The same framework is effective across three fundamentally distinct task types — reasoning, planning, and personalization.
- Plug-and-play: A trained controller transfers to other black-box LLMs without any additional fine-tuning.
- Self-improvement: Iterative DPO enables the controller to continuously improve guidance quality without human annotation.
## Limitations & Future Work
- The controller requires task-type-specific guidance format design (e.g., question decomposition, high-level plans, user summaries).
- The data collection phase incurs substantial API costs due to extensive interactions with the black-box LLM.
- The quality ceiling of intermediate guidance is bounded by the white-box LLM's capacity.
- Validation is currently limited to NLP tasks; extension to multimodal settings remains unexplored.
- Security concern: malicious users could potentially exploit the white-box controller to jailbreak the black-box LLM.
## Related Work & Insights
- o1-preview Scratchpad: M-Pilot shares conceptual similarity, but externalizes the intermediate reasoning process into a separate white-box model.
- STaR (Zelikman et al.): M-Pilot draws on the bootstrapping idea from self-taught reasoning.
- RLPrompt / TEMPERA: Existing RL-based prompt optimization methods are limited to classification tasks; M-Pilot extends them to long-horizon generation.
- Insight: Using lightweight models as a "prefrontal cortex" to guide large model execution may be a promising direction for future black-box LLM optimization.
## Rating
- Novelty: ⭐⭐⭐⭐ The MDP formulation of the controller–generator framework is novel, though the general idea of "small models assisting large models" is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across three task types, with thorough ablation and plug-and-play experiments.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear and the framework is presented in a well-structured manner.
- Value: ⭐⭐⭐⭐ Practically valuable for black-box LLM enhancement; the plug-and-play property is particularly useful.