# Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
Conference: NeurIPS 2025 · arXiv: 2410.20749 · Code: GitHub · Area: Model Compression / LLM Control · Keywords: Black-Box LLM, White-Box Controller, Iterative DPO, Intermediate Guidance, Multi-Turn Interaction
## TL;DR
This paper proposes Matryoshka Pilot (M-Pilot), which employs a lightweight white-box LLM as a controller to generate intermediate guidance (task decomposition, high-level plans, user profiles) for driving black-box LLMs on complex long-horizon tasks such as reasoning, planning, and personalization, with iterative DPO enabling continual self-improvement.
## Background & Motivation
Background: Commercial LLMs (e.g., GPT-4, Gemini) are predominantly black-box models, offering no access to parameters, architecture, or even output logits.
Limitations of Prior Work: Existing approaches to enhancing black-box LLM capabilities fall into two categories: (a) ICL-based methods that rely on carefully designed demonstrations and prompts, dependent on human heuristics; and (b) adapter-based methods that select the best output from multiple candidates, but remain constrained by the black-box LLM's own generation capacity. Both categories perform poorly on long-horizon tasks involving multi-step reasoning and long-range planning.
Key Challenge: The opacity of black-box LLMs renders direct parameter optimization infeasible, yet task-specific performance improvements remain a practical necessity.
Goal: To systematically enhance black-box LLMs' reasoning, planning, and personalization capabilities on complex long-horizon tasks without accessing their parameters.
Key Insight: Treating the black-box LLM as an "environment" and training a white-box LLM as a "policy" to generate intermediate guidance.
Core Idea: Driving large models with small models — a lightweight white-box controller generates intermediate guidance to steer black-box LLM behavior, with iterative preference optimization enabling continuous improvement.
## Method
### Overall Architecture
M-Pilot adopts a controller–generator framework: a white-box LLM (e.g., LLaMA-3-8B-Instruct) serves as the controller, generating \(T\)-step intermediate guidance \(\{g_t\}_{t=1}^T \sim f_\theta(x)\); a black-box LLM (e.g., GPT-4o-mini) acts as the generator/environment, receiving the guidance and producing the final answer \(\hat{y} \sim g_{\text{LLM}}(x, \{g_t\}_{t=1}^T)\).
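To make the interaction concrete, here is a minimal sketch of one controller–generator episode (not the authors' implementation); `controller_generate` and `blackbox_generate` are hypothetical stand-ins for the white-box policy \(f_\theta\) and the black-box API, and only prompt/response text crosses the black-box boundary.

```python
# Minimal sketch of one M-Pilot inference episode (illustrative, not the
# authors' code). The two callables are hypothetical: the first wraps the
# white-box controller (e.g., LLaMA-3-8B-Instruct), the second the black-box
# generator (e.g., GPT-4o-mini).
from typing import Callable, List


def m_pilot_answer(
    x: str,
    controller_generate: Callable[[str], str],  # white-box policy f_theta
    blackbox_generate: Callable[[str], str],    # black-box generator g_LLM
    num_steps: int = 3,                         # T guidance steps
) -> str:
    guidance: List[str] = []
    state = x
    for t in range(num_steps):
        # Controller emits the next piece of intermediate guidance
        # (a sub-question, a high-level plan step, or a profile summary).
        g_t = controller_generate(state)
        guidance.append(g_t)
        # Black-box LLM acts on the task plus the guidance so far.
        o_t = blackbox_generate(f"Task: {x}\nGuidance:\n" + "\n".join(guidance))
        # State transition s_t = (s_{t-1}, a_t, o_t).
        state = f"{state}\n[guidance {t + 1}] {g_t}\n[observation {t + 1}] {o_t}"
    # Final answer conditioned on the full guidance sequence.
    return blackbox_generate(f"Task: {x}\nGuidance:\n" + "\n".join(guidance))
```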
### Key Designs
- Task-Specific Instantiation of Intermediate Guidance:
  - Reasoning tasks: The controller outputs a sequence of sub-tasks decomposed from the original question, facilitating step-by-step reasoning in the black-box LLM.
  - Planning tasks: The controller generates high-level plans, decomposing complex tasks into sub-goals.
  - Personalization tasks: The controller summarizes user history to construct a user profile.

  All three guidance forms are unified within the same framework, demonstrating the generality of the approach.
- Multi-Turn Interaction and MDP Formulation: The controller–environment interaction is modeled as a Markov Decision Process. At each step \(t\), the controller generates action \(a_t\) (i.e., guidance) based on the current state \(s_{t-1}\), and the environment returns observation \(o_t\). The state transition is defined as \(s_t = (s_{t-1}, a_t, o_t) = (x, a_1, o_1, \cdots, a_t, o_t)\). The trajectory reward is determined by a correctness evaluation function on the final answer: \(u(\tau) = \text{eval}(\hat{y}, y)\).
- Data Collection: For each input \(x_i\), \(K\) multi-turn interaction trajectories are sampled and evaluated, yielding positive examples (successful guidance) and negative examples (failed guidance). Stochastic sampling is used to increase guidance diversity (see the collection sketch after this list).
- Iterative Direct Preference Optimization (IDPO):
  - SFT Warm-Up: The controller policy is initialized via behavioral cloning on guidance generated by GPT-3.5.
  - Iterative Preference Pair Collection: At iteration \(m\), new trajectories are generated using \(\theta^{(m)}\), grouped by success/failure, and merged with historical data.
  - Preference Optimization: Preferences are modeled with the Bradley-Terry model, and the reference policy is updated to the previous checkpoint \(\pi_{\text{ref}} = \pi_\theta^{(m)}\). The training objective is

    \[
    \mathcal{L}_{\text{IDPO}} = \mathbb{E}_{(x, \tau^+, \tau^-) \sim \mathcal{D}} \left[ -\log\sigma\left(\eta^{-1}\left(\log\frac{p_{\theta^{(m+1)}}(\{g_t^+\} \mid x)}{p_{\theta^{(m)}}(\{g_t^+\} \mid x)} - \log\frac{p_{\theta^{(m+1)}}(\{g_t^-\} \mid x)}{p_{\theta^{(m)}}(\{g_t^-\} \mid x)}\right)\right)\right]
    \]

    A key derivation shows that the black-box LLM's generation probabilities cancel in the positive/negative ratio, so the optimization objective involves only the white-box controller's parameters.
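As a concrete illustration of the data-collection step above, the following hedged sketch builds DPO preference pairs from \(K\) stochastic rollouts per input; `rollout` and `eval_fn` are assumed helpers (one multi-turn episode as sketched earlier, and a task-specific correctness check such as exact match or ALFWorld success), and all names are illustrative.

```python
# Illustrative preference-pair construction (assumed helper signatures, not
# the authors' code): K rollouts per input, split by the task-level reward
# eval(y_hat, y), then paired as (successful guidance, failed guidance).
import random
from typing import Callable, Dict, List, Tuple


def collect_preference_pairs(
    dataset: List[Tuple[str, str]],                    # (input x, reference y)
    rollout: Callable[[str], Tuple[List[str], str]],   # -> (guidance seq, y_hat)
    eval_fn: Callable[[str, str], bool],               # correctness of y_hat
    k: int = 8,                                        # rollouts per input
) -> List[Dict]:
    pairs: List[Dict] = []
    for x, y in dataset:
        positives, negatives = [], []
        for _ in range(k):                              # stochastic sampling
            guidance, y_hat = rollout(x)                # one multi-turn episode
            (positives if eval_fn(y_hat, y) else negatives).append(guidance)
        # Every (successful, failed) combination yields one preference pair.
        for g_pos in positives:
            for g_neg in negatives:
                pairs.append({"prompt": x, "chosen": g_pos, "rejected": g_neg})
    random.shuffle(pairs)
    return pairs
```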
### Loss & Training
- SFT warm-up → iterative data sampling + DPO training (with bootstrapping-style data accumulation)
- The reference policy is updated at each iteration, enabling continual self-improvement.
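Below is a minimal PyTorch sketch of the per-pair IDPO loss, assuming the summed log-probabilities of the guidance tokens have already been computed under the current policy \(\theta^{(m+1)}\) and the frozen previous-iteration reference \(\theta^{(m)}\); the value of `eta` is illustrative. Because the black-box generation probabilities cancel in the chosen/rejected ratios, no term from the black-box LLM appears.

```python
# Hedged sketch of the IDPO objective (Bradley-Terry over log-ratio margins).
# Inputs are per-example summed log-probs of the guidance sequences; `eta` is
# an illustrative temperature (eta^{-1} plays the role of the usual DPO beta).
import torch
import torch.nn.functional as F


def idpo_loss(
    logp_chosen: torch.Tensor,        # log p_{theta^(m+1)}({g_t^+} | x)
    logp_rejected: torch.Tensor,      # log p_{theta^(m+1)}({g_t^-} | x)
    ref_logp_chosen: torch.Tensor,    # log p_{theta^(m)}({g_t^+} | x)
    ref_logp_rejected: torch.Tensor,  # log p_{theta^(m)}({g_t^-} | x)
    eta: float = 10.0,
) -> torch.Tensor:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(margin / eta).mean()


# Toy usage on dummy log-probabilities (batch of 4 preference pairs).
loss = idpo_loss(
    torch.tensor([-10.0, -12.0, -9.5, -11.0]),
    torch.tensor([-13.0, -11.5, -12.0, -14.0]),
    torch.tensor([-10.5, -12.5, -10.0, -11.5]),
    torch.tensor([-12.5, -11.0, -11.5, -13.5]),
)
```

After each round, the reference log-probabilities are recomputed under the newly trained checkpoint, which is what makes the procedure iterative rather than a single DPO pass.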
## Key Experimental Results
### Main Results
Personalization (LaMP): GPT-4o-mini used as the black-box LLM.
| Method | LaMP-1 Acc | LaMP-2N Acc | LaMP-2M Acc | LaMP-3 MAE↓ | LaMP-4 BLEU |
|---|---|---|---|---|---|
| gpt-4o-mini | 0.514 | 0.655 | 0.413 | 0.371 | 0.992 |
| RAG (k=4) | 0.632 | 0.792 | 0.502 | 0.272 | 2.953 |
| M-Pilot | 0.640 | 0.823 | 0.527 | 0.277 | 4.298 |
Reasoning (GSM8K):
| Method | GSM8K (gpt-3.5) | GSM-HARD (gpt-3.5) |
|---|---|---|
| CoT | 0.809 | 0.406 |
| PAL_SelfDebug | 0.864 | 0.701 |
| M-Pilot | 0.931 | 0.761 |
Planning (ALFWorld): GPT-3.5-turbo used as the black-box LLM.
| Method | Overall Success Rate |
|---|---|
| ReAct | 47.76% |
| AdaPlanner | 88.06% |
| M-Pilot | 96.27% |
### Ablation Study
| Variant | ALFWorld Success Rate |
|---|---|
| M-Pilot (full) | 96.27% |
| w/o 2nd-round IDPO | 94.78% |
| w/o 1st & 2nd-round IDPO | 88.06% |
| w/o Guidance Optimization | 81.34% |
### Key Findings
- M-Pilot achieves average improvements of +3.19% (reasoning), +7.46% (planning), and +5.82% (personalization) across the three task categories.
- Plug-and-play transferability: a controller trained on GPT-4o-mini transfers directly to GPT-3.5 and Gemini-1.5-flash without additional training.
- High sample efficiency: M-Pilot surpasses the strongest baseline (AdaPlanner) using only 1/4 of the training data.
- Iterative DPO yields consistent self-improvement, with each additional IDPO round contributing stable gains.
## Highlights & Insights
- Paradigm innovation of "small models driving large models": The controller–generator interaction is formalized as an MDP and optimized via reinforcement learning methods.
- Strong generality: The same framework is effective across three fundamentally distinct task types — reasoning, planning, and personalization.
- Plug-and-play: A trained controller transfers to other black-box LLMs without any additional fine-tuning.
- Self-improvement: Iterative DPO enables the controller to continuously improve guidance quality without human annotation.
## Limitations & Future Work
- The controller requires task-type-specific guidance format design (e.g., question decomposition, high-level plans, user summaries).
- The data collection phase incurs substantial API costs due to extensive interactions with the black-box LLM.
- The quality ceiling of intermediate guidance is bounded by the white-box LLM's capacity.
- Validation is currently limited to NLP tasks; extension to multimodal settings remains unexplored.
- Security concern: malicious users could potentially exploit the white-box controller to jailbreak the black-box LLM.
## Related Work & Insights
- o1-preview Scratchpad: M-Pilot shares conceptual similarity, but externalizes the intermediate reasoning process into a separate white-box model.
- STaR (Zelikman et al.): M-Pilot draws on the bootstrapping idea from self-taught reasoning.
- RLPrompt / TEMPERA: Existing RL-based prompt optimization methods are limited to classification tasks; M-Pilot extends them to long-horizon generation.
- Insight: Using lightweight models as a "prefrontal cortex" to guide large model execution may be a promising direction for future black-box LLM optimization.
## Rating
- Novelty: ⭐⭐⭐⭐ The MDP formulation of the controller–generator framework is novel, though the general idea of "small models assisting large models" is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across three task types, with thorough ablation and plug-and-play experiments.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are clear and the framework is presented in a well-structured manner.
- Value: ⭐⭐⭐⭐ Practically valuable for black-box LLM enhancement; the plug-and-play property is particularly useful.