Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Conference: ACL 2026
arXiv: 2602.21103
Code: Not released
Area: LLM Efficiency / Prompt Optimization / Reasoning Distillation
Keywords: Prompt Distillation, Non-parametric fine-tuning, Instruction clustering, Conflict resolution, Reasoning efficiency

TL;DR

This paper proposes Prompt-Level Distillation (PLD), which extracts reasoning patterns from a teacher model on training samples, clusters them, resolves their conflicts, and writes the result into the system prompt of a student model. This significantly improves the reasoning and classification performance of small models without updating any parameters.

Background & Motivation

Background: Complex reasoning tasks typically rely on Chain-of-Thought prompting, allowing models to generate intermediate reasoning before outputting an answer. This approach is effective in logical inference, compliance judgment, and reading comprehension but introduces additional tokens, latency, and inference costs.

Limitations of Prior Work: Industrial systems often use fine-tuning of small models to replace expensive CoT reasoning, but fine-tuning requires training data, computational resources, and model version management. More problematically, when teacher models are updated or business rules change, student models must be retrained. For closed-source small models or rapidly iterating business scenarios, this maintenance cost is heavy.

Key Challenge: Reasoning ability requires complex rules, while production environments demand low latency, auditability, and ease of maintenance. Compressing reasoning rules into weights turns them into a black box, and running CoT at runtime is too slow. The authors aim to shift "reasoning" to an offline stage, explicitly saving reusable logic in the prompt.

Goal: Propose a non-parametric distillation framework, PLD, using labeled training sets to extract generalizable natural language instructions from a teacher model and synthesize a conflict-free system prompt, enabling the student model to execute these rules directly during zero-shot reasoning.

Key Insight: The paper focuses on reasoning-intensive classification, where input boundaries are relatively static and rules can be summarized, such as contract clause relationships, bias type identification, and logical QA. This setup makes "compressing reasoning into an instruction library" possible.

Core Idea: Instead of distilling teacher outputs into student weights, the decision logic of the teacher is distilled into a system prompt, allowing small models to execute pre-mined rules at near zero-shot speeds.

Method

PLD can be understood as an offline compilation pipeline: extracting micro-rules from each training sample, merging similar rules into general heuristics, and driving conflict resolution with error cases from the student model. The final product is not a new model but a readable, checkable, and replaceable consolidated instruction set.

Overall Architecture

The input consists of a labeled training set \(T=\{(x_i,y_i)\}\), a teacher model, a student model, and the target task format. The pipeline has four stages. First, the teacher is given each sample together with its ground-truth label, asked to explain why that label is correct, and asked to abstract the explanation into a natural-language rule \(I_i\) free of specific entities, yielding \(D=\{(x_i,y_i,I_i)\}\). Second, the rules are embedded into a vector space, grouped into semantic clusters with DBSCAN, and each cluster is synthesized by a strong model into one more general instruction. Third, the current instruction prompt is run on the student model over training/validation samples to surface errors, and a conflict resolution model revises the rules using both success and failure cases. Fourth, the final system prompt is injected into the student model at deployment, which requires no retrieval, fine-tuning, or runtime CoT.
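The first stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording in `EXTRACTION_TEMPLATE`, the `RULE:` marker, and the `teacher` callable (any function that maps a prompt string to a reply string) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical prompt wording; the paper's actual extraction prompt is not shown here.
EXTRACTION_TEMPLATE = (
    "Input:\n{x}\n\n"
    "Ground-truth label: {y}\n\n"
    "1. Explain step by step why this label is correct.\n"
    "2. Then state one general rule, without naming specific entities, "
    "on a final line that starts with 'RULE:'."
)


@dataclass
class DistilledExample:
    x: str            # input text x_i
    y: str            # ground-truth label y_i
    instruction: str  # abstracted rule I_i


def parse_rule(teacher_output: str) -> str:
    """Return the text after the last 'RULE:' line, or '' if there is none."""
    for line in reversed(teacher_output.splitlines()):
        stripped = line.strip()
        if stripped.startswith("RULE:"):
            return stripped[len("RULE:"):].strip()
    return ""


def extract_instructions(
    train_set: Sequence[Tuple[str, str]],
    teacher: Callable[[str], str],
) -> List[DistilledExample]:
    """Build D = {(x_i, y_i, I_i)} by querying the teacher once per sample."""
    distilled = []
    for x, y in train_set:
        reply = teacher(EXTRACTION_TEMPLATE.format(x=x, y=y))
        rule = parse_rule(reply)
        if rule:  # skip samples where no rule could be parsed
            distilled.append(DistilledExample(x, y, rule))
    return distilled
```

Passing the label into the prompt is what makes the extraction supervised: the teacher explains a known answer instead of guessing one.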

Key Designs

  1. Supervised Instruction Extraction:

    • Function: Compress the reasoning process of the teacher model on a single training sample into transferable natural language rules.
    • Mechanism: Prompt the teacher to complete two things simultaneously: perform a CoT-style analysis based on the ground-truth label, and abstract this analysis into a de-entitized instruction. This utilizes label supervision while avoiding putting sample content directly into the prompt.
    • Design Motivation: If the teacher only generates answers, the student cannot learn decision boundaries; if the CoT is saved verbatim, it is too long and carries sample-specific details. An abstract instruction is a compromise between the two.
  2. DBSCAN Semantic Clustering and Logic Synthesis:

    • Function: Merge a large number of repetitive, fragmented, and local micro-rules into a small number of readable rules.
    • Mechanism: Embed each rule with Gemini Embedding (768-dimensional vectors) and run DBSCAN with cosine distance, \(\epsilon=0.4\), and min_samples=6. DBSCAN does not force every point into a cluster; points in low-density regions are labeled as noise and discarded. Each remaining cluster is then handed to Gemini 3 Pro, which synthesizes it into a unified heuristic.
    • Design Motivation: If the number of rules is too high, the system prompt will bloat; if merging is too coarse, minority classes and fringe rules will be lost. Density clustering allows the number of rules to emerge naturally from the data.
  3. Closed-loop Conflict Resolution:

    • Function: Fix contradictions between merged rules and subtle boundaries not covered by the initial extraction.
    • Mechanism: Load the current instruction set into the student model, run inference on the training data, and collect the cases where the student follows the rules but still predicts incorrectly. These failure cases, together with success cases, are given to a conflict resolution model that generates updated rules. The loop repeats until validation error converges.
    • Design Motivation: Rules in complex tasks often have priorities and exception conditions. Single-pass clustering synthesis easily averages out minority class rules; closed-loop error analysis can recover these long-tail constraints.
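The clustering step in design 2 can be illustrated with a small, dependency-free DBSCAN over cosine distance. This is a didactic re-implementation under stated assumptions (non-zero embedding vectors, a point counts as its own neighbor), not the authors' code; in practice scikit-learn's `DBSCAN(eps=0.4, min_samples=6, metric="cosine")` would be the usual choice. The helper `group_by_cluster` is hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

NOISE = -1


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(points: Sequence[Sequence[float]],
           eps: float = 0.4, min_samples: int = 6) -> List[int]:
    """Return one cluster id per point, with NOISE (-1) for outliers."""
    n = len(points)
    # Neighborhoods include the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # core point: keep expanding
                queue.extend(neighbors[j])
        cluster += 1
    return [label if label is not None else NOISE for label in labels]


def group_by_cluster(rules: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Collect rules per cluster, dropping noise points."""
    clusters: Dict[int, List[str]] = {}
    for rule, label in zip(rules, labels):
        if label != NOISE:
            clusters.setdefault(label, []).append(rule)
    return clusters
```

Each list returned by `group_by_cluster` would then be passed to the synthesis model to be merged into a single instruction; rules labeled as noise never reach the prompt.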

Loss & Training

PLD does not update student model parameters, so there is no traditional training loss. Its "optimization objective" instead shows up in prompt search and the error loop: the teacher extraction phase seeks rules that explain each label, the clustering phase trades off prompt length against rule coverage, and the conflict resolution phase uses convergence of the student model's error rate on training/validation samples as the stop signal.
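The error loop with a convergence-based stop signal might look like the sketch below. The `student` and `resolver` callables, the `max_rounds` and `tol` parameters, and the keep-the-best-prompt policy are assumptions introduced for illustration; the source only states that the loop runs until validation error converges.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def error_rate(prompt: str, data: Sequence[Example],
               student: Callable[[str, str], str]) -> float:
    """Fraction of examples the student mislabels under the given prompt."""
    wrong = sum(1 for x, y in data if student(prompt, x) != y)
    return wrong / len(data)


def resolve_conflicts(
    prompt: str,
    train: Sequence[Example],
    val: Sequence[Example],
    student: Callable[[str, str], str],
    resolver: Callable[[str, List[Example], List[Example]], str],
    max_rounds: int = 5,   # hypothetical safety cap
    tol: float = 1e-3,     # hypothetical convergence threshold
) -> Tuple[str, float]:
    """Iteratively revise the prompt until validation error stops improving."""
    best_prompt = prompt
    best_err = error_rate(prompt, val, student)
    for _ in range(max_rounds):
        preds = [(x, y, student(best_prompt, x)) for x, y in train]
        failures = [(x, y) for x, y, p in preds if p != y]
        successes = [(x, y) for x, y, p in preds if p == y]
        if not failures:
            break
        # The resolver sees both failures and successes, so a fix for one
        # case can be checked against behavior that was already correct.
        candidate = resolver(best_prompt, failures, successes)
        err = error_rate(candidate, val, student)
        if best_err - err < tol:  # no meaningful improvement: stop
            break
        best_prompt, best_err = candidate, err
    return best_prompt, best_err
```

Because a candidate prompt is accepted only when validation error drops, a revision that fixes training failures but harms held-out behavior is discarded.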

In experiments, Gemini 3 Flash in thinking mode is used for teacher extraction, while Gemini 3 Pro in thinking mode is used for clustering synthesis and conflict resolution. Student models include Gemma-3 4B, Mistral Small 3.1 24B, and Gemini 3 Flash. Comparison methods include zero-shot, 5-shot, TextGrad, and parametric distillation baselines using LoRA fine-tuning on Gemma/Mistral.

Key Experimental Results

Main Results

The paper evaluates PLD on StereoSet, Contract-NLI, and LogiQA. Macro-F1 is reported for StereoSet and Contract-NLI, while Accuracy is reported for LogiQA.

| Task / Student Model | Zero-shot | TextGrad | Clustered-Inst. | Post-Conflict | Main Conclusions |
|---|---|---|---|---|---|
| StereoSet / Gemma-3 4B | 0.57 | 0.87 | 0.90 | 0.90 | PLD improves small models to near-strong model levels |
| Contract-NLI / Gemma-3 4B | 0.67 | 0.74 | 0.81 | 0.83 | Conflict resolution continues to provide benefits in legal logic tasks |
| LogiQA / Gemma-3 4B | 0.67 | 0.69 | 0.69 | 0.70 | Small but stable improvement |
| StereoSet / Mistral Small 3.1 | 0.65 | 0.96 | 0.96 | 0.97 | Equally effective across architectures |
| Contract-NLI / Gemini 3 Flash | 0.77 | 0.76 | 0.84 | 0.86 | Even teacher-level models benefit from explicit rules |
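For reference, the two metrics used in these results can be computed as below. This follows the standard definitions (Macro-F1 as the unweighted mean of per-class F1 over all labels seen in gold or predictions); it is an assumption that the paper's evaluation matches this exactly, and the function names are illustrative.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every label in `labels` occurs at least once in gold or pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of exact label matches."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
print(round(macro_f1(gold, pred), 4), accuracy(gold, pred))
```

Macro-F1 weights every class equally, so it penalizes prompts that sacrifice minority classes, which is the failure mode the conflict resolution stage is meant to repair.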

Ablation Study

| Configuration (Contract-NLI) | Clusters | Prompt tokens | F1 | Description |
|---|---|---|---|---|
| \(\epsilon=0.2\) / min_samples=6 | 27 | 6,449 | 0.79 | Clusters are too fragmented, prompt lengthens, and performance drops |
| \(\epsilon=0.4\) / min_samples=6 | 17 | 4,630 | 0.83 | Compromise configuration chosen by the authors |
| \(\epsilon=0.5\) / min_samples=6 | 14 | 4,068 | 0.80 | Merging is too coarse, fine-grained logic is lost |
| 1,030 examples | 16 | 4,062 | 0.77 | Small data is already sufficient to discover main themes |
| 7,190 examples | 18 | 4,630 | 0.83 | Increased data mainly refines existing clusters rather than infinitely increasing prompt length |

Key Findings

  • Gemma-3 4B improved from 0.57 to 0.90 on StereoSet and from 0.67 to 0.83 on Contract-NLI, showing that non-parametric prompt distillation can significantly bridge the gap between small and strong models.
  • The authors report that Gemma-3 4B is 25x cheaper and 80x faster than Gemini 3 Flash; the value of PLD lies not only in accuracy but in converting runtime CoT costs into offline prompt compilation costs.
  • Conflict resolution adds about 2.5% Macro-F1 on Contract-NLI (0.81 to 0.83 for Gemma-3 4B) but almost nothing on StereoSet (0.90 to 0.90), indicating that it mainly helps with overlapping decision boundaries and complex exceptions.
  • The dataset-size ablation for Contract-NLI shows that the number of clusters stays nearly constant (16 at 1,030 examples, 18 at 7,190), supporting the claim that "more data improves rule quality rather than linearly increasing prompt length."
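As a quick arithmetic check on the gains quoted in these findings, the snippet below computes absolute and relative differences from the rounded scores in the main results table. The helper `gains` is hypothetical, and whether the reported 2.5% is meant as an absolute or a relative figure is not stated in this summary.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) between two scores."""
    absolute = after - before
    return absolute, absolute / before


# Scores are the rounded values from the main results table.
for name, before, after in [
    ("Contract-NLI / Gemma-3 4B, conflict resolution", 0.81, 0.83),
    ("StereoSet / Gemma-3 4B, conflict resolution", 0.90, 0.90),
    ("StereoSet / Gemma-3 4B, zero-shot to PLD", 0.57, 0.90),
]:
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: {abs_gain:+.2f} absolute, {rel_gain:+.1%} relative")
```

With these rounded inputs, the Contract-NLI conflict resolution step is a 0.02 absolute gain, or roughly 2.5% relative to 0.81.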

Highlights & Insights

  • The paper shifts the distillation target from "model weights" to "auditable prompts," a perspective highly suitable for compliance, legal, finance, and content moderation scenarios where human verification of rules is required.
  • The choice of DBSCAN is appropriate: it can discard outlier rules to avoid contaminating the system prompt with one-off samples; simultaneously, it does not require pre-specifying the number of clusters, meeting the need for an automatically generated rule library.
  • The conflict resolution stage emphasizes providing both success and failure cases simultaneously, preventing the model from breaking original correct behaviors while fixing an error. This is closer to production iteration than many automated prompt optimization methods.
  • PLD compresses semantic reasoning paths rather than tokens, so its goal differs from prompt compression; it behaves more like offline compilation of a teacher model into a task-specific "judge manual."

Limitations & Future Work

  • The authors explicitly limit the scope to reasoning-intensive classification with static decision boundaries. For tasks like complex arithmetic, symbolic proof, or planning that must generate intermediate states at runtime, concise instructions alone are likely insufficient.
  • The upper limit of system prompt size has not been systematically studied; as task rules grow more complex, prompts could grow well beyond the 4,630 tokens used here, and the longer context would add prompt-processing latency.
  • StereoSet experiments involve bias category identification; teacher models might solidify data biases or incorrect interpretations into the prompt during rule extraction. While auditability is increased, rule fairness is not automatically guaranteed.
  • Subsequent work could combine PLD with retrieval-augmented rule libraries: placing high-frequency stable rules in the system prompt and retrieving long-tail rules on demand to balance coverage and context length more flexibly.

Related Work & Insights

  • vs Chain-of-Thought prompting: CoT generates reasoning at runtime, which is accurate but slow; PLD extracts reasoning in an offline stage, turning runtime into direct rule execution.
  • vs Knowledge Distillation: Traditional KD updates student weights, which is hard to audit and requires maintaining multiple model artifacts; PLD does not change parameters, requiring only a prompt update when rules change.
  • vs Automatic Prompt Optimization: APE, OPRO, and TextGrad focus on optimizing wording or prompt programs; the core of PLD is mining and merging the teacher's domain logic rather than just finding better phrasing.
  • Insights: For enterprise LLM applications, expert-reviewed error cases can be continuously fed back into PLD to form a versionable prompt rule library, instead of retraining small models every time.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Non-parametric distillation is not a brand-new concept, but the full "extraction-clustering-conflict resolution" pipeline is very practical.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers three types of tasks and various student models with solid ablations; however, task types remain concentrated on classification.
  • Writing Quality: ⭐⭐⭐⭐☆ Method description is clear, and tables are direct; some latency/cost results are only presented as appendix figures, lacking finer numerical values in the main text.
  • Value: ⭐⭐⭐⭐☆ Highly enlightening for low-latency, auditable small model deployment, particularly suited for industry scenarios with relatively stable rules.