TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination

Conference: ACL 2026
arXiv: 2510.22767
Code: https://github.com/omyokun/tale/
Area: Model Compression / LLM Efficiency
Keywords: Task-Aware Pruning, Layer Elimination, Inference Acceleration, Validation Set Search, Retraining-free

TL;DR

TALE utilizes a retraining-free greedy search process to directly eliminate "underperforming" Transformer layers for specific downstream tasks, simultaneously improving task accuracy and reducing inference costs across 5 open-source LLMs and 9 benchmarks.

Background & Motivation

Background: Large Language Models (LLMs) are typically deployed with a fixed depth, processing all Transformer layers regardless of whether the downstream task involves mathematical reasoning, commonsense QA, or knowledge-based multiple-choice questions. Existing model compression methods can remove weights, heads, blocks, or implement early exit, but most rely on general metrics such as perplexity, representational similarity, or reconstruction error, with the primary goal of saving computational resources.

Limitations of Prior Work: General pruning metrics do not necessarily correspond to performance on target tasks. A layer that appears important for language modeling perplexity might introduce noise for a specific task; conversely, certain intermediate layers may already be "good enough" for a specific task, and proceeding through subsequent layers can actually decrease accuracy. On the other hand, fine-tuning can improve task performance but does not reduce inference costs and requires data and training budgets.

Key Challenge: Model compression typically assumes that "removing layers will damage capability," so optimization goals focus on minimizing performance drops. However, this paper observes that for some tasks, deleting mismatched layers is itself a form of task adaptation, potentially making the model both more accurate and faster.

Goal: The authors aim to provide a practically deployable method: no weight modifications, no retraining, and no reliance on hardware-specific implementations. Using only a small task validation set, the method searches for a task-specific pruned architecture that is both accurate and efficient.

Key Insight: The paper first explains layer removal through the residual stream: deleting the \(\ell\)-th layer is equivalent to setting that layer's transformation \(F_\ell\) to zero, so the hidden state passes through unchanged. The authors then project intermediate-layer hidden states into the vocabulary space and find that, for many tasks, intermediate-layer predictions already outperform the final layer, indicating that "deeper" is not always "better."
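The residual view can be written out explicitly. This is a sketch in the notation used above; the symbol \(h_\ell\) for the hidden state entering layer \(\ell\) is an assumption introduced here, and \(F_\ell\) lumps together the attention and MLP sub-blocks of one Transformer layer.

```latex
\begin{align}
h_{\ell+1} &= h_\ell + F_\ell(h_\ell) && \text{(layer $\ell$ kept)} \\
h_{\ell+1} &= h_\ell && \text{(layer $\ell$ removed, i.e. $F_\ell \equiv 0$)}
\end{align}
```

Because the skip connection carries \(h_\ell\) forward unchanged, the layers after \(\ell\) still receive an input of the expected shape, which is why removal needs no retraining to produce a valid model.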

Core Idea: Instead of using task-agnostic proxies to guess which layers are redundant, the method directly tests the removal of each layer on the target task validation set, retaining the deletion operations that maximize validation accuracy.

Method

The TALE method is straightforward, which makes it suitable for deployment. Given an open-source LLM and a task validation set, it updates no parameters and only performs architectural search: each step enumerates all currently existing layers in the model, temporarily removes one layer, and runs the validation set to obtain accuracy. The candidate model with the highest accuracy is selected, that layer is permanently removed, and the process repeats on the new, shallower model.

The search yields two models: BEST is the pruned model with the highest task accuracy seen during the search, suited to accuracy-prioritized scenarios; BSBA (Best Speedup with at least Baseline Accuracy) deletes as many layers as possible without falling below the original model's accuracy, suited to speed-prioritized scenarios.

Overall Architecture

The input includes a pre-trained or instruction-tuned model \(M\), a validation set \(D_{val}\), and a threshold \(\epsilon\). TALE initializes \(M^*=M\). In each iteration, for every removable layer \(\ell\) in the current model, a candidate model \(M_{-\ell}\) is constructed, and \(A_\ell=Acc(M_{-\ell},D_{val})\) is calculated. The layer \(\ell^*=\arg\max_\ell A_\ell\) is selected. If the performance after removal remains within the allowed range, \(M^*\) is updated to \(M_{-\ell^*}\); otherwise, the search stops. The stop threshold is set at 8% below the baseline accuracy to allow the search to briefly explore lower-performing structures, though the authors did not observe recovery after falling below the baseline in practice.
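The loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `greedy_layer_elimination`, `evaluate`, and `toy_evaluate` are hypothetical names, a model is represented only by the list of layer indices it keeps, and `epsilon` is assumed to be an absolute margin in accuracy units.

```python
from typing import Callable, List, Tuple


def greedy_layer_elimination(
    layers: List[int],
    evaluate: Callable[[List[int]], float],
    epsilon: float,
) -> List[Tuple[List[int], float]]:
    """Return the search trajectory as (kept_layers, accuracy) pairs.

    `evaluate` is a hypothetical callback that scores a model containing
    only `kept_layers` on the task validation set.
    """
    baseline = evaluate(layers)
    current = list(layers)
    trajectory = [(list(current), baseline)]
    while len(current) > 1:
        # Score every candidate obtained by removing one remaining layer.
        candidates = []
        for layer in current:
            kept = [l for l in current if l != layer]
            candidates.append((evaluate(kept), kept))
        best_acc, best_kept = max(candidates, key=lambda c: c[0])
        # Stop once even the best candidate falls outside the allowed range.
        if best_acc < baseline - epsilon:
            break
        current = best_kept
        trajectory.append((list(current), best_acc))
    return trajectory


def toy_evaluate(kept: List[int]) -> float:
    """Made-up accuracy: layers 2 and 5 hurt the task, the others help."""
    harmful = {2, 5}
    acc = 0.5
    for layer in range(8):
        if layer not in kept:
            acc += 0.05 if layer in harmful else -0.10
    return round(acc, 4)


trajectory = greedy_layer_elimination(list(range(8)), toy_evaluate, epsilon=0.08)
accuracies = [acc for _, acc in trajectory]
print(accuracies)          # [0.5, 0.55, 0.6, 0.5]
print(trajectory[-1][0])   # [1, 3, 4, 6, 7]
```

In the toy run the two harmful layers are removed first, one further removal returns accuracy to the baseline, and the search stops when the next removal would fall below the margin.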

For evaluation, the authors use two protocols: LM-Eval and Decoder Eval. Decoder Eval requires the model to generate a structured answer, from which the final answer is extracted and compared against the ground truth. The authors argue this is closer to real generation capability, because LM-Eval scores multiple-choice questions by comparing the probabilities of a few fixed options, which can make weak models appear stronger.
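A generation-based protocol of this kind might look like the sketch below. The `Answer: X` output format, the regular expression, and the function names are assumptions made for illustration; the summary does not specify the paper's actual prompt or extraction rules.

```python
import re
from typing import List, Optional

# Assumed answer format: "Answer: B" or "Answer: (b)" with options A-D.
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\b", re.IGNORECASE)


def extract_choice(generation: str) -> Optional[str]:
    """Return the letter of the last 'Answer: X' in the text, or None."""
    matches = ANSWER_PATTERN.findall(generation)
    return matches[-1].upper() if matches else None


def decoder_eval_accuracy(generations: List[str], gold: List[str]) -> float:
    """Fraction of generations whose extracted answer matches the gold label."""
    correct = sum(extract_choice(g) == a for g, a in zip(generations, gold))
    return correct / len(gold)


generations = [
    "The capital is Paris. Answer: B",
    "Answer: (c)",
    "I am not sure.",  # no extractable answer counts as wrong
]
gold = ["B", "C", "A"]
accuracy = decoder_eval_accuracy(generations, gold)
```

Unlike likelihood-based scoring, a generation with no extractable answer simply counts as incorrect, so a model cannot score well by chance agreement over a handful of options.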

Key Designs

  1. Directly Using Task Validation Accuracy as the Deletion Criterion:

    • Function: Avoids using proxies like perplexity or similarity to represent real downstream targets.
    • Mechanism: Every step evaluates all single-layer removal candidates on the validation set of the same task, selecting the structure that yields the highest validation accuracy, aligning search and deployment objectives.
    • Design Motivation: Methods like SLEB or BlockPruner often rely on representational similarity or perplexity, which can easily delete layers useful for a task or retain layers harmful to it. Task-aware accuracy can directly identify "negative contribution layers."
  2. Greedy Iterative Layer Elimination:

    • Function: Finds task-specific shallow architectures in a simple, interpretable way.
    • Mechanism: Instead of specifying how many layers to delete at once, only the current optimal layer is removed per round, and all remaining layers are re-evaluated. This captures interactions between layer removals and avoids fixed top-k truncation.
    • Design Motivation: The paper finds that the optimal number of deleted layers varies by task; some tasks peak after \(n\) deletions, and one more deletion causes a sharp drop. Preset pruning budgets can easily miss this optimum.
  3. BEST / BSBA Dual Deployment Targets:

    • Function: Serves both accuracy-prioritized and efficiency-prioritized scenarios.
    • Mechanism: The search trajectory records the BEST model (highest accuracy) and the BSBA model (most layers deleted while maintaining baseline performance). Users can choose based on deployment needs.
    • Design Motivation: Real-world systems may not only seek the highest score; multi-agent, high-concurrency, or edge deployments care more about throughput and latency. TALE exposes both needs as optional results.
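Selecting the two deployment targets from a recorded search trajectory can be sketched as below. The function name, the trajectory format, and the tie-breaking rule are assumptions, and the numbers are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (number of deleted layers, validation accuracy)


def select_best_and_bsba(trajectory: List[Point]) -> Tuple[Point, Point]:
    """Pick BEST and BSBA; trajectory[0] must be the unpruned baseline."""
    baseline_acc = trajectory[0][1]
    # BEST: highest accuracy; ties go to the shallower model (assumption).
    best = max(trajectory, key=lambda p: (p[1], p[0]))
    # BSBA: most layers deleted while staying at or above baseline accuracy.
    at_or_above = [p for p in trajectory if p[1] >= baseline_acc]
    bsba = max(at_or_above, key=lambda p: p[0])
    return best, bsba


# Hypothetical trajectory: accuracy rises, then decays as layers are removed.
toy_trajectory = [(0, 0.50), (1, 0.55), (2, 0.53), (3, 0.51), (4, 0.46)]
best, bsba = select_best_and_bsba(toy_trajectory)
print(best)  # (1, 0.55)
print(bsba)  # (3, 0.51)
```

Both points come from the same search run, so exposing BSBA in addition to BEST costs no extra evaluation.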

Loss & Training

TALE itself has no training loss because it does not update model weights. Its "optimization objective" is the validation set accuracy. The computational cost is approximately \(O(I\cdot L\cdot V\cdot T_{layer})\), where \(I\) is the number of deletion iterations, \(L\) is the number of layers, and \(V\) is the size of the validation set. For LLaMA 3.1 8B, with a validation set of 500 to 1500 samples, one search per task takes approximately 1 to 2 A100 GPU hours. The paper also uses LoRA in fine-tuning experiments, but only to evaluate how TALE interacts with fine-tuning; LoRA is not a required step of TALE itself.
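To make the cost expression concrete, the sketch below counts how many candidate models and validation forward passes a search needs. The function names and the example values (4 greedy rounds, 500 samples) are illustrative assumptions; 32 is the number of decoder layers in LLaMA 3.1 8B.

```python
def candidate_evaluations(num_layers: int, iterations: int) -> int:
    """Number of candidate models scored over `iterations` greedy rounds."""
    # Round i (0-based) has num_layers - i layers left to try removing.
    return sum(num_layers - i for i in range(iterations))


def forward_passes(num_layers: int, iterations: int, val_size: int) -> int:
    """Each candidate is scored once on every validation sample."""
    return candidate_evaluations(num_layers, iterations) * val_size


n_candidates = candidate_evaluations(32, 4)   # 32 + 31 + 30 + 29 = 122
n_passes = forward_passes(32, 4, 500)         # 122 * 500 = 61000
```

The count grows linearly in both the number of rounds and the validation set size, which matches the \(I\cdot L\cdot V\) factors in the stated bound.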

Key Experimental Results

Main Results

TALE was evaluated on 5 open-source models and 9 tasks, including ARC-Challenge, ARC-Easy, MMLU, Winogrande, GSM8K-HARD, MATH500, CommonQA, BIG-Bench, and BoolQ. The table below extracts zero-shot results for LLaMA 3.1 8B and Qwen 2.5 7B; #D is the number of deleted layers.

| Model | Dataset | Baseline | TALE BEST | #D | BSBA | Observations |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B | ARC-Challenge | 79.4 | 80.6 | 4 | 77.6 | Knowledge/Commonsense improvements are modest |
| LLaMA 3.1 8B | MMLU | 48.8 | 53.8 | 1 | 50.2 | Significant gain with only 1 layer removed |
| LLaMA 3.1 8B | GSM8K-HARD | 39.0 | 59.0 | 1 | 39.4 | Mathematical reasoning sees the largest gain |
| LLaMA 3.1 8B | MATH500 | 25.4 | 28.2 | 2 | 27.4 | Removing early/middle layers is effective for reasoning |
| Qwen 2.5 7B | ARC-Challenge | 86.55 | 92.00 | 2 | 86.55 | ARC-C improves by 5.45 points |
| Qwen 2.5 7B | MMLU | 68.10 | 71.00 | 5 | 68.13 | More layers can be deleted while maintaining baseline |
| Qwen 2.5 7B | GSM8K-HARD | 43.80 | 61.80 | 2 | 43.99 | Math reasoning improves by 18 points |
| Qwen 2.5 7B | MATH500 | 31.00 | 38.20 | 2 | 32.10 | Consistent gains in math tasks |

The overall trend summarized in the paper is that LLaMA's improvement on ARC-Challenge is smaller (~+1.6), while Qwen 2.5 7B's is larger (~+6.3); reasoning tasks like MATH500 and GSM8K show improvements ranging from 23% to 51%. These figures are consistent with relative gains over the baseline rather than absolute points: in the table above, Qwen 2.5 7B gains 5.45 points on ARC-Challenge, which is about 6.3% relative. This supports the authors' judgment that some layers are not merely redundant for reasoning tasks but may actually introduce task-mismatched representational disturbances.
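The quoted percentages can be checked against the table with a short calculation. The helper name `relative_gain` is hypothetical; the baseline and BEST values are copied from the main results table.

```python
def relative_gain(baseline: float, best: float) -> float:
    """Relative improvement of BEST over the baseline, in percent."""
    return round((best - baseline) / baseline * 100, 1)


# (baseline, TALE BEST) pairs taken from the main results table.
results = {
    ("LLaMA 3.1 8B", "ARC-Challenge"): (79.4, 80.6),
    ("LLaMA 3.1 8B", "GSM8K-HARD"): (39.0, 59.0),
    ("Qwen 2.5 7B", "ARC-Challenge"): (86.55, 92.00),
    ("Qwen 2.5 7B", "GSM8K-HARD"): (43.80, 61.80),
    ("Qwen 2.5 7B", "MATH500"): (31.00, 38.20),
}

gains = {key: relative_gain(*pair) for key, pair in results.items()}
for (model, dataset), gain in gains.items():
    print(f"{model:13s} {dataset:14s} +{gain}%")
```

Read this way, Qwen 2.5 7B on ARC-Challenge comes out at +6.3%, and the reasoning rows span roughly +23% (Qwen MATH500) to +51% (LLaMA GSM8K-HARD), in line with the range quoted above.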

| Eval | Method | ARC-Easy | ARC-Challenge | Winogrande | Conclusion |
| --- | --- | --- | --- | --- | --- |
| Decoder Eval | Ours | 76.7 | 54.3 | 73.1 | Best across all three |
| Decoder Eval | SLEB-ta | 61.0 | 38.0 | 66.5 | Still significantly behind even when task-aware |
| Decoder Eval | BlockPruner-ta | 64.6 | 39.6 | 65.59 | Inferior to direct accuracy search |
| LM-Eval | BlockPruner-ta | 65 | 41 | 66 | Baseline pruning still shows obvious drops |
| LM-Eval | Ours | 81 | 55 | 78 | Similarly maintains lead |

Ablation Study

The paper lacks traditional module ablation, instead validating TALE through robustness, evaluation protocols, validation set size, and fine-tuning interactions. The most critical comparison is replacing TALE's objective from task accuracy to representational similarity or perplexity, which significantly degrades results. For example, on ARC-Easy, using cosine similarity to guide TALE deletes 2 layers, but LLaMA accuracy drops from 79.5 to 58.5, indicating that proxy objectives can be severely misaligned with real task objectives.
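For contrast, a representational-similarity proxy can be sketched as follows. This is a generic illustration of the idea, not the exact procedure of SLEB, BlockPruner, or the paper's ablation; the function names and the toy hidden states are assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_layers_by_similarity(hidden_states: List[Sequence[float]]) -> List[int]:
    """Rank layers from most to least 'redundant' under the proxy.

    hidden_states[i] enters layer i and hidden_states[i + 1] leaves it; a
    layer whose output is nearly parallel to its input is assumed removable.
    """
    sims = [
        cosine_similarity(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


# Toy 2-D hidden states for a 3-layer model (made-up values).
toy_states = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 2.0]]
ranking = rank_layers_by_similarity(toy_states)
print(ranking)  # [2, 0, 1]
```

The ranking never consults task accuracy, so a layer that barely rotates the hidden state can still be the one a task depends on, which is the misalignment the ARC-Easy example above illustrates.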

| Setting | Representative Results | Explanation |
| --- | --- | --- |
| Validation Set Size | Beyond 500 samples, the set of deleted layers for ARC-Easy/MMLU/GSM8K tends to stabilize | TALE does not require a massive validation set |
| Random Seeds | BEST results for LLaMA, Qwen, Lucie, and Mistral show very low variance | Search does not rely on lucky seed hits |
| Inference Efficiency | 9/9 settings improved first-token latency, macro avg. -14.3%; throughput improved in 9/9, macro avg. +17.9% | BEST models also provide practical throughput gains |
| Search Cost | ~1-2 A100 hours per task for LLaMA 3.1 8B | One-time search cost is amortized by long-term inference |

Fine-tuning interaction is also noteworthy. TALE is not just for inference pruning but can complement LoRA fine-tuning.

| Model / Dataset | Baseline | TALE only | FT only | TALE → FT | FT → TALE | (TALE → FT) → TALE |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3.1 8B / Winogrande | 53.83 | 56.67 (#D=4) | 85.00 | 87.06 (#D=4) | 86.74 (#D=7) | 87.37 (#D=8) |
| LLaMA 3.1 8B / MMLU | 54.87 | 59.90 (#D=1) | 63.62 | 63.49 (#D=1) | 64.21 (#D=2) | 64.01 (#D=2) |
| LLaMA 3.1 8B / GSM8K | 15.07 | 37.08 (#D=3) | 42.70 | 53.96 (#D=1) | 50.86 (#D=2) | 54.02 (#D=2) |
| Qwen 0.5B / MMLU | 31.48 | 39.98 (#D=2) | 44.87 | 43.76 (#D=2) | 45.53 (#D=2) | 45.58 (#D=3) |

Key Findings

  • Layer removal is not just compression; it can also be task adaptation. Especially in mathematical reasoning tasks, deleting 1 to 3 early-to-middle layers often yields the greatest benefit.
  • Layer importance is highly task-dependent. Commonsense/knowledge tasks and mathematical reasoning tasks rely on different layer segments, so fixed strategies that always prune from the top or the bottom generalize poorly.
  • TALE is effective for both large and small models, though the magnitude of gain varies. The paper notes significant gains for Lucie 7B, potentially due to fewer pre-training tokens leaving it further from its performance ceiling.

Highlights & Insights

  • The most elegant aspect is changing the pruning objective from "minimizing drop" to "directly maximizing task score." This unifies compression and adaptation, avoiding the detours of proxy metrics.
  • The method is engineering-friendly: no weight changes, no retraining, hardware-agnostic, and outputs a standard Transformer that can enter existing inference stacks directly.
  • TALE provides an explanatory perspective: if task performance increases after removing a layer, that layer might have introduced redundant or unnecessary transformations for that task; layer-by-layer ablation curves help locate the distribution of task capabilities within the network.

Limitations & Future Work

  • TALE currently operates at the whole-layer granularity, which is transparent but coarse. Finer structures like attention heads, MLP blocks, or token-adaptive routing might offer better trade-offs.
  • It requires separate searches for each task, producing task-specific structures; if a deployment scenario requires one model to handle numerous heterogeneous tasks, frequent structure switching increases management complexity.
  • Greedy search does not guarantee global optimality. Deleting one layer changes the importance of the remaining layers, and a locally optimal path might miss structures that only emerge through combined deletions.
  • While search costs are low, a target task validation set is still required. For open-ended generation tasks that lack a stable validation set or have high evaluation noise, TALE's objective function would need redesigning.

Comparison with Related Work

  • vs SLEB: SLEB performs training-free layer removal based on representational similarity and perplexity; TALE searches via direct target task accuracy, making it better at identifying task-harmful layers.
  • vs BlockPruner: BlockPruner divides layers into blocks and uses general metrics; TALE is coarser but more direct. Experiments show TALE outperforms SLEB/BlockPruner even when they are given task validation data.
  • vs SparseGPT / Wanda / SliceGPT: These methods lean towards weight or dimension-level general compression, whereas TALE keeps weights intact and only changes the layer-level path, making deployment and rollback simpler.
  • vs Early Exit: Early exit dynamically decides when to stop during inference; TALE performs offline search for a fixed pruned structure. The former is more flexible, while the latter integrates more easily with existing static inference optimizations.

Rating

  • Novelty: ⭐⭐⭐⭐ Simple idea but well-aimed; using task accuracy as a pruning target yields strong empirical results.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across models, tasks, evaluation protocols, random seeds, validation sizes, baselines, and fine-tuning interactions.
  • Writing Quality: ⭐⭐⭐⭐ Clear main line and ample appendix data; some tables and speed metrics are densely formatted.
  • Value: ⭐⭐⭐⭐⭐ Highly practical for task-specific LLM deployment, especially for preparing lightweight specialized models for different roles in multi-agent systems.