Strategy-Induct: Task-Level Strategy Induction for Instruction Generation
Conference: ACL 2026
arXiv: 2605.20924
Code: To be confirmed
Area: llm_reasoning
Keywords: Instruction induction, reasoning strategy, prompt engineering, question-only, task-level instructions, cross-model generalization
TL;DR
Strategy-Induct is a framework for inducing task-level instructions from a small number of input questions, with no labeled answers. It first generates a reasoning strategy for each individual question, then induces a reusable task instruction from the resulting strategy-question pairs. It outperforms existing SOTA methods in most comparisons on the BBH-Induct, Evals-Induct, and Shift Cipher benchmarks.
Background & Motivation
High-quality task instructions are critical for LLM performance, yet designing them by hand requires domain expertise and is costly. Existing instruction induction methods rely on input-output pairs, but labeled answers are often difficult or expensive to obtain in real-world applications. This work instead induces task instructions in a question-only setting, removing the dependency on labeled answers.
Method
Overall Architecture
Strategy-Induct consists of three stages: (1) the Strategy Stage, which generates a reasoning strategy for each input question; (2) the Induct Stage, which induces a task-level instruction from the strategy-question pairs; and (3) the Inference Stage, which solves new questions guided by the induced instruction.
Key Designs
- Strategy Generation (Strategy Stage): Given \(N\) input questions \(\mathcal{X} = \{x_1, \ldots, x_N\}\), the LLM generates a reasoning strategy \(s_i = \text{LLM}(P_S, d, x_i)\) for each question, using a meta prompt \(P_S\) and an optional Short Phrase description \(d\). The results form the set of strategy-question pairs \(\mathcal{S} = \{(s_i, x_i)\}_{i=1}^{N}\). These strategies take the place of the labeled answers used by traditional methods and supply a structured reasoning signal.
- Instruction Induction (Induct Stage): The strategy-question pairs \(\mathcal{S}\) are combined with a meta prompt \(P_I\) and the Short Phrase \(d\) to induce a reusable task-level instruction \(P_{\text{Strategy-Induct}} = \text{LLM}(P_I, d, \mathcal{S})\).
- Short Phrase Mechanism: Employs concise task descriptions (e.g., one or two words) to convey task intent, lowering the barrier for user prompt engineering; this can be omitted if questions are self-explanatory.
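The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the function names, and the wording of the placeholder meta prompts `P_S` and `P_I` are all hypothetical, since the summary does not reproduce the paper's prompts.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt text in, completion text out.
LLM = Callable[[str], str]

# Placeholder meta prompts (assumptions; the paper's actual P_S and P_I differ).
P_S = "Describe a general strategy for solving the question below. Do not solve it."
P_I = "From the strategy-question pairs below, write one reusable task instruction."


def strategy_stage(llm: LLM, questions: List[str], d: str = "") -> List[Tuple[str, str]]:
    """Build S: one (strategy, question) pair per unlabeled question."""
    pairs = []
    for x in questions:
        strategy = llm(f"{P_S}\nTask: {d}\nQuestion: {x}")
        pairs.append((strategy, x))
    return pairs


def induct_stage(llm: LLM, pairs: List[Tuple[str, str]], d: str = "") -> str:
    """Induce a single task-level instruction from all strategy-question pairs."""
    body = "\n\n".join(f"Question: {x}\nStrategy: {s}" for s, x in pairs)
    return llm(f"{P_I}\nTask: {d}\n{body}")


def inference_stage(llm: LLM, instruction: str, question: str) -> str:
    """Answer a new question with the induced instruction prepended."""
    return llm(f"{instruction}\n\nQuestion: {question}")


def strategy_induct(inductor: LLM, solver: LLM, questions: List[str],
                    new_question: str, d: str = "") -> Tuple[str, str]:
    """Run all three stages; the inductor and solver may be different models."""
    pairs = strategy_stage(inductor, questions, d)
    instruction = induct_stage(inductor, pairs, d)
    return instruction, inference_stage(solver, instruction, new_question)
```

Keeping `inductor` and `solver` as separate arguments is a design choice of this sketch: it lets one model write the instruction and another apply it. With \(N\) questions, the induction costs \(N + 1\) model calls, and each new question costs one more.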
Loss & Training
No training is involved. The framework relies entirely on the in-context learning capabilities of LLMs. By default it uses \(N=3\) example questions and temperature 0, which makes outputs as deterministic as the serving stack allows.
Key Experimental Results
Main Results
Strategy-Induct is evaluated across 18 models on BBH-Induct, Evals-Induct, and Shift Cipher, and compared with ZCoT, SCoT, and INDUCT. Selected results:
| Model | ZCoT | SCoT | INDUCT | Strategy-Induct |
|---|---|---|---|---|
| Llama 3.1 8B (BBH) | 62.03 | 56.29 | 59.48 | 65.33 |
| Llama 3.1 70B (BBH) | 82.09 | 84.52 | 86.03 | 88.99 |
| GPT-4o (BBH) | 84.12 | 87.83 | 87.94 | 87.65 |
| GPT o3 mini high (BBH) | 88.87 | 89.91 | 89.74 | 91.30 |
| Gemini 2.0 Flash (Shift) | 54.24 | 53.44 | 65.60 | 67.04 |
Overall win-tie-loss record of Strategy-Induct: 50-3-7 against ZCoT and 44-3-13 against INDUCT.
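A win-tie-loss record is a pairwise tally of which method scores higher in each comparison. The sketch below applies that tally to the five rows shown in the table above only, so its counts are not the paper's overall records, which cover many more model-benchmark pairs. The names `ROWS`, `BASELINES`, and `win_tie_loss` are illustrative.

```python
from typing import Dict, Tuple

# Scores copied from the table above: (ZCoT, SCoT, INDUCT, Strategy-Induct).
ROWS: Dict[str, Tuple[float, float, float, float]] = {
    "Llama 3.1 8B (BBH)": (62.03, 56.29, 59.48, 65.33),
    "Llama 3.1 70B (BBH)": (82.09, 84.52, 86.03, 88.99),
    "GPT-4o (BBH)": (84.12, 87.83, 87.94, 87.65),
    "GPT o3 mini high (BBH)": (88.87, 89.91, 89.74, 91.30),
    "Gemini 2.0 Flash (Shift)": (54.24, 53.44, 65.60, 67.04),
}

# Column index of each baseline within a row tuple.
BASELINES = {"ZCoT": 0, "SCoT": 1, "INDUCT": 2}


def win_tie_loss(baseline: str) -> Tuple[int, int, int]:
    """Count rows where Strategy-Induct beats, ties, or trails the baseline."""
    idx = BASELINES[baseline]
    wins = ties = losses = 0
    for scores in ROWS.values():
        ours, theirs = scores[3], scores[idx]
        if ours > theirs:
            wins += 1
        elif ours == theirs:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


for name in BASELINES:
    print(name, win_tie_loss(name))
```

On these five rows the only losses are the GPT-4o row against SCoT and against INDUCT.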
Ablation Study
| Model | N=1 | N=3 | N=5 |
|---|---|---|---|
| Llama 3.1 8B | 64.35 | 65.33 | 61.74 |
| Llama 3.1 70B | 87.54 | 88.99 | 89.97 |
| Mistral Large 2 | 84.87 | 85.97 | 84.58 |
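\(N=3\) is the best setting for Llama 3.1 8B and Mistral Large 2, and the default. \(N=1\) lacks diversity, while \(N=5\) may exceed what smaller models can process in context. Llama 3.1 70B is the exception in this table: it continues to improve at \(N=5\).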
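The per-model and average picture can be checked directly from the ablation table. The snippet below is a small sketch over those nine numbers only; `ABLATION`, `best_n`, and `mean_by_n` are illustrative names.

```python
from typing import Dict

# Scores copied from the ablation table above, keyed by model and then by N.
ABLATION: Dict[str, Dict[int, float]] = {
    "Llama 3.1 8B": {1: 64.35, 3: 65.33, 5: 61.74},
    "Llama 3.1 70B": {1: 87.54, 3: 88.99, 5: 89.97},
    "Mistral Large 2": {1: 84.87, 3: 85.97, 5: 84.58},
}


def best_n(scores: Dict[int, float]) -> int:
    """Return the N with the highest score for one model."""
    return max(scores, key=scores.get)


def mean_by_n(table: Dict[str, Dict[int, float]]) -> Dict[int, float]:
    """Average the score at each N across all models in the table."""
    ns = sorted(next(iter(table.values())))
    return {n: sum(row[n] for row in table.values()) / len(table) for n in ns}


for model, scores in ABLATION.items():
    print(model, "-> best N =", best_n(scores))
print({n: round(v, 2) for n, v in mean_by_n(ABLATION).items()})
```

Averaged over the three models, \(N=3\) scores highest, even though the largest model individually prefers \(N=5\).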
Key Findings
- Smaller models (8B-12B) generally benefit from Strategy-Induct, achieving a 10-3-2 record against INDUCT.
- The largest improvements occur in knowledge-intensive subtasks (e.g., snarks, sports understanding), with gains ranging from 8 to 60 percentage points.
- Large Reasoning Models (LRMs) such as GPT o3 mini gain more from Strategy-Induct as their reasoning intensity increases.
- On Shift Cipher, improvements are largest for low-frequency shift values (shifts other than ROT-1, ROT-3, and ROT-13), where the strategies explicitly guide the LLM to handle character wrap-around.
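The wrap-around mentioned in the last finding is the modular step of a shift cipher: shifting past `z` must continue from `a`. The following is a generic sketch of that arithmetic, not the benchmark's own code; the function names are illustrative, and the exact input format used by the Shift Cipher benchmark is not assumed.

```python
import string


def shift_encode(text: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, wrapping around modulo 26."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # leave digits, spaces, and punctuation unchanged
    return "".join(out)


def shift_decode(text: str, shift: int) -> str:
    """Invert shift_encode; Python's % keeps the result in 0..25 for negative shifts."""
    return shift_encode(text, -shift)


print(shift_encode("xyz", 5))   # letters near the end of the alphabet wrap to the start
print(shift_decode("cde", 5))
```

A model that adds the shift without the modulo step produces non-letter characters for inputs like `xyz`, which is the kind of error an explicit strategy can prevent.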
Highlights & Insights
- Instruction Induction without Labeled Answers: LLM-generated reasoning strategies replace expensive labeled answers, which removes the main data requirement of earlier instruction induction methods.
- Cross-model Generalization: Induced instructions can be transferred across different models without per-model re-optimization.
- LLM + LRM Synergy: Combining an LLM for instruction generation with an LRM for reasoning execution can further enhance performance.
Limitations & Future Work
- When \(N=5\), performance in some small models degrades, indicating that the scale of strategy-question pairs is restricted by model context windows and induction capabilities.
- Strategy quality depends on the reasoning capability of the LLM itself; strategies generated by small models may be of lower quality.
- The method was primarily verified on classification/decoding tasks; its applicability to open-ended generation remains to be explored.
Related Work & Insights
- INDUCT-LEARN (Chen et al., 2024b): A current SOTA instruction induction method, but it requires input-output pairs; Strategy-Induct outperforms it in the question-only setting.
- SCoT (Wang et al., 2024): Automated strategy chains of thought, but it operates at the instance level, so its strategies cannot be reused as task instructions.
- APE (Zhou et al., 2022): A pioneering automatic prompt engineering method, but it requires large external resources or initial instructions.
Rating
| Dimension | Score (1-10) |
|---|---|
| Novelty | 7 |
| Value | 8 |
| Writing Quality | 8 |
| Experimental Thoroughness | 9 |