Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Conference: ACL 2026
arXiv: 2605.20924
Code: To be confirmed
Area: llm_reasoning
Keywords: Instruction induction, reasoning strategy, prompt engineering, question-only, task-level instructions, cross-model generalization

TL;DR

Strategy-Induct is a framework for inducing task-level instructions from a small number of input questions, without labeled answers. It first generates a reasoning strategy for each question, then induces a reusable task instruction from the strategy-question pairs. On the BBH-Induct, Evals-Induct, and Shift Cipher benchmarks it outperforms existing SOTA methods in most comparisons.

Background & Motivation

High-quality task instructions are critical for LLM performance, yet designing them by hand requires domain expertise and is costly. Existing instruction induction methods rely on input-output pairs, but labeled answers are often difficult or expensive to obtain in real-world applications. This work instead induces task instructions in a question-only setting, removing the dependency on labeled answers.

Method

Overall Architecture

Strategy-Induct consists of three stages: (1) Strategy Stage—generating reasoning strategies for each input question; (2) Induct Stage—inducing task-level instructions from strategy-question pairs; (3) Inference Stage—solving new questions guided by the induced instructions.

Key Designs

  1. Strategy Generation (Strategy Stage): Given \(N\) input questions \(\mathcal{X} = \{x_1, \dots, x_N\}\), the LLM generates a reasoning strategy \(s_i = \text{LLM}(P_S, d, x_i)\) for each question from a meta prompt \(P_S\) and an optional Short Phrase description \(d\), yielding the set of strategy-question pairs \(\mathcal{S} = \{(s_i, x_i)\}_{i=1}^{N}\). The strategies take the place of the labeled answers used by traditional methods and supply a structured reasoning signal.
  2. Instruction Induction (Induct Stage): The pairs \(\mathcal{S}\) are combined with a meta prompt \(P_I\) and the Short Phrase \(d\) to induce a reusable task-level instruction \(P_{\text{Strategy-Induct}} = \text{LLM}(P_I, d, \mathcal{S})\).
  3. Short Phrase Mechanism: A concise task description (e.g., one or two words) conveys the task intent and lowers the prompt-engineering burden on the user; it can be omitted if the questions are self-explanatory.
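The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the wording of the placeholder meta prompts `P_S` and `P_I`, the prompt layout, and `stub_llm` are all hypothetical, since the summary does not reproduce the paper's prompts.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Placeholder meta prompts (assumptions; not the paper's actual P_S and P_I).
P_S = "Describe a step-by-step strategy for solving the question. Do not give the answer."
P_I = "Using the strategy-question pairs, write one reusable instruction for this task."


def _task_line(d: str) -> str:
    """Render the optional Short Phrase d; empty when d is omitted."""
    return f"Task: {d}\n" if d else ""


def strategy_stage(llm: LLM, questions: List[str], d: str = "") -> List[Tuple[str, str]]:
    """Strategy Stage: s_i = LLM(P_S, d, x_i) for each x_i; returns pairs (s_i, x_i)."""
    return [(llm(f"{P_S}\n{_task_line(d)}Question: {x}"), x) for x in questions]


def induct_stage(llm: LLM, pairs: List[Tuple[str, str]], d: str = "") -> str:
    """Induct Stage: P_Strategy-Induct = LLM(P_I, d, S)."""
    body = "\n\n".join(f"Question: {x}\nStrategy: {s}" for s, x in pairs)
    return llm(f"{P_I}\n{_task_line(d)}{body}")


def inference_stage(llm: LLM, instruction: str, question: str) -> str:
    """Inference Stage: answer a new question under the induced instruction."""
    return llm(f"{instruction}\n\nQuestion: {question}")


def stub_llm(prompt: str) -> str:
    """Stand-in model for a dry run; replace with a real client call."""
    return f"<completion for {len(prompt)}-char prompt>"


# Dry run with N = 3 unlabeled questions and a Short Phrase.
questions = [
    "Decode 'khoor' (shift 3).",
    "Decode 'ebiil' (shift 23).",
    "Decode 'uryyb' (shift 13).",
]
pairs = strategy_stage(stub_llm, questions, d="shift cipher")
instruction = induct_stage(stub_llm, pairs, d="shift cipher")
answer = inference_stage(stub_llm, instruction, "Decode 'olssv' (shift 7).")
```

Note that no answer labels appear anywhere in the flow: the only supervision passed to the Induct Stage is the model-generated strategy for each question.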

Loss & Training

No training is involved. The framework relies entirely on the in-context learning ability of LLMs, with defaults of \(N=3\) example questions and temperature 0 to keep outputs as deterministic as possible.

Key Experimental Results

Main Results

Strategy-Induct is evaluated on 18 models across three benchmarks (BBH-Induct, Evals-Induct, Shift Cipher) and compared with ZCoT, SCoT, and INDUCT. Representative rows:

| Model (benchmark) | ZCoT | SCoT | INDUCT | Strategy-Induct |
| Llama 3.1 8B (BBH) | 62.03 | 56.29 | 59.48 | 65.33 |
| Llama 3.1 70B (BBH) | 82.09 | 84.52 | 86.03 | 88.99 |
| GPT-4o (BBH) | 84.12 | 87.83 | 87.94 | 87.65 |
| GPT o3 mini high (BBH) | 88.87 | 89.91 | 89.74 | 91.30 |
| Gemini 2.0 Flash (Shift) | 54.24 | 53.44 | 65.60 | 67.04 |

Overall win-tie-loss record of Strategy-Induct: 50-3-7 against ZCoT and 44-3-13 against INDUCT.
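A win-tie-loss record of this kind can be tallied as in the sketch below. It uses only the five rows excerpted in the table above, so it does not reproduce the overall 50-3-7 and 44-3-13 totals; treating a tie as exact score equality is an assumption, since the summary does not state the paper's tie criterion.

```python
from typing import Sequence, Tuple


def win_tie_loss(ours: Sequence[float], baseline: Sequence[float]) -> Tuple[int, int, int]:
    """Count per-row wins, ties, and losses of `ours` against `baseline`."""
    wins = sum(o > b for o, b in zip(ours, baseline))
    ties = sum(o == b for o, b in zip(ours, baseline))  # assumption: tie = equal score
    losses = sum(o < b for o, b in zip(ours, baseline))
    return wins, ties, losses


# Scores copied from the five table rows, in row order.
strategy_induct = [65.33, 88.99, 87.65, 91.30, 67.04]
zcot = [62.03, 82.09, 84.12, 88.87, 54.24]
induct = [59.48, 86.03, 87.94, 89.74, 65.60]

vs_zcot = win_tie_loss(strategy_induct, zcot)      # (5, 0, 0)
vs_induct = win_tie_loss(strategy_induct, induct)  # (4, 0, 1)
```

The single loss against INDUCT in this excerpt is the GPT-4o row (87.65 vs 87.94).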

Ablation Study

| Model | N=1 | N=3 | N=5 |
| Llama 3.1 8B | 64.35 | 65.33 | 61.74 |
| Llama 3.1 70B | 87.54 | 88.99 | 89.97 |
| Mistral Large 2 | 84.87 | 85.97 | 84.58 |

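\(N=3\) is the best setting for Llama 3.1 8B and Mistral Large 2, whereas Llama 3.1 70B keeps improving at \(N=5\). \(N=1\) lacks diversity, while \(N=5\) may exceed the context processing capacity of smaller models, so \(N=3\) is used as the default balance point.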
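Reading the ablation table row by row makes the per-model picture explicit. The short sketch below picks the best \(N\) for each model and the change from \(N=3\) to \(N=5\), using only the numbers in the table; the variable names are illustrative.

```python
# Scores copied from the ablation table: model -> {N: score}.
ablation = {
    "Llama 3.1 8B": {1: 64.35, 3: 65.33, 5: 61.74},
    "Llama 3.1 70B": {1: 87.54, 3: 88.99, 5: 89.97},
    "Mistral Large 2": {1: 84.87, 3: 85.97, 5: 84.58},
}

# Best N per model (the key with the highest score).
best_n = {model: max(scores, key=scores.get) for model, scores in ablation.items()}

# Change when going from N=3 to N=5, rounded to two decimals.
delta_3_to_5 = {model: round(s[5] - s[3], 2) for model, s in ablation.items()}

for model in ablation:
    print(f"{model}: best N={best_n[model]}, N=3 -> N=5 change {delta_3_to_5[model]:+.2f}")
```

On these three rows, the 8B model loses the most when moving to \(N=5\), while the 70B model is the only one that gains.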

Key Findings

  • Smaller models (8B-12B) generally benefit from Strategy-Induct, achieving a 10-3-2 record against INDUCT.
  • The largest improvements occur in knowledge-intensive subtasks (e.g., snarks, sports understanding), with gains ranging from 8 to 60 percentage points.
  • Large Reasoning Models (LRMs) such as GPT o3 mini gain more from Strategy-Induct as their reasoning effort increases.
  • On Shift Cipher, improvements are most significant for low-frequency shift values (non ROT-1/3/13), where strategies explicitly guide the LLM to handle character wrap-around effects.
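The wrap-around behaviour mentioned for Shift Cipher is the standard modular step of a Caesar-style shift. The sketch below shows it for ASCII letters; the benchmark's exact input format is not given in this summary, so the function names and the handling of non-letters are assumptions for illustration.

```python
def shift_text(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions, wrapping past 'z'/'Z' back to 'a'/'A'."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            base = ord("a")
        elif "A" <= ch <= "Z":
            base = ord("A")
        else:
            out.append(ch)  # leave digits, spaces, and punctuation unchanged
            continue
        # The modulo is the wrap-around step: positions beyond 25 start again at 0.
        out.append(chr(base + (ord(ch) - base + shift) % 26))
    return "".join(out)


def decode(ciphertext: str, shift: int) -> str:
    """Undo a shift by applying the negative shift."""
    return shift_text(ciphertext, -shift)


encoded = shift_text("xyz", 5)   # all three letters wrap past 'z'
decoded = decode(encoded, 5)
```

Without the `% 26`, letters near the end of the alphabet would map outside the letter range, which is the failure mode an explicit strategy is meant to prevent.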

Highlights & Insights

  • Instruction Induction without Labeled Answers: Replacing expensive labeled answers with LLM-generated reasoning strategies removes the dependence on input-output pairs that earlier instruction induction methods have.
  • Cross-model Generalization: Induced instructions can be transferred across different models without per-model re-optimization.
  • LLM + LRM Synergy: Combining an LLM for instruction generation with an LRM for reasoning execution can further enhance performance.
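Cross-model reuse and the LLM + LRM pairing both amount to inducing an instruction once and handing the same text to other solver models. The sketch below illustrates that pattern only; the `LLM` callable type, the prompt layout, and the stub solvers are hypothetical and stand in for real model clients.

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def apply_instruction(instruction: str, question: str, solvers: Dict[str, LLM]) -> Dict[str, str]:
    """Send the same induced instruction and question to every solver model."""
    prompt = f"{instruction}\n\nQuestion: {question}"
    return {name: solve(prompt) for name, solve in solvers.items()}


# Stub solvers for a dry run; in practice these would wrap an LLM and an LRM.
solvers = {
    "llm_stub": lambda prompt: f"llm saw {len(prompt)} chars",
    "lrm_stub": lambda prompt: f"lrm saw {len(prompt)} chars",
}

# The instruction text would come from the Induct Stage of whichever model induced it.
induced_instruction = "Shift each letter back by the stated amount, wrapping from 'a' to 'z'."
answers = apply_instruction(induced_instruction, "Decode 'olssv' (shift 7).", solvers)
```

Because the instruction is plain text, nothing in it is tied to the model that produced it, which is what makes this kind of reuse possible.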

Limitations & Future Work

  • At \(N=5\), performance degrades for some small models, suggesting that the usable number of strategy-question pairs is limited by the model's context window and induction capability.
  • Strategy quality depends on the reasoning capability of the LLM itself; strategies generated by small models may be of lower quality.
  • The method was primarily verified on classification/decoding tasks; its applicability to open-ended generation remains to be explored.

Related Work

  • INDUCT-LEARN (Chen et al., 2024b): A current SOTA instruction induction method, but it requires input-output pairs; Strategy-Induct outperforms it in the question-only setting.
  • SCoT (Wang et al., 2024): Automated strategic chain-of-thought, but it is an instance-level method and cannot reuse instructions across questions.
  • APE (Zhou et al., 2022): A pioneering automatic prompt engineering method that requires large external resources or initial instructions.

Rating

| Dimension | Score (1-10) |
| Novelty | 7 |
| Value | 8 |
| Writing Quality | 8 |
| Experimental Thoroughness | 9 |
