Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy
Conference: ACL 2026
arXiv: 2601.02989
Code: To be confirmed
Area: Mechanistic Interpretability / LLM Reasoning / Counting
Keywords: System-2, Counting, activation patching, causal mediation, attention knockout
TL;DR
LLMs fail at large-scale counting: a single forward pass becomes unreliable at roughly \(10–30\) items because of limited layer depth. A simple test-time strategy, partitioning the list with | and prompting the model to count each segment before summing, raises Qwen2.5/Llama3/Gemma3/GPT-4o/Gemini-2.5-Pro from \(0–20\%\) to \(50–95\%\) accuracy in 50–100 object scenarios. Through attention analysis and four types of causal mediation experiments, the "segment counting -> intermediate aggregation -> final summation" three-stage circuit is localized to Layer 22 of Qwen2.5-7B (head 13 for segments, head 1 for aggregation).
Background & Motivation
Background: LLMs perform well on simple arithmetic, but accuracy for naive counting tasks (e.g., "how many apples in the list") drops sharply when \(N > 10\). Prior work (Hasani 2025b, Yehudai 2024) demonstrated this as an architectural bottleneck of Transformers: counting signals accumulate layer-by-layer (latent counter), saturating once they reach the layer depth limit. Furthermore, numerical representations in LLMs are sublinear/log-like, making larger numbers increasingly "fuzzy."
Limitations of Prior Work: (1) Chain-of-Thought (CoT) alone provides limited help; it becomes effective only when combined with structured input; (2) Training-side fixes (re-tokenization, specialized math models) treat symptoms rather than the root cause, as they remain depth-limited; (3) Existing test-time block-based techniques (LVLM-COUNT, Izadi 2025) verify behavioral improvements but do not explain the internal mechanisms, such as which heads and layers are activated by partitioning.
Key Challenge: (a) Architecture vs. Task Scale — A Transformer forward pass can only count up to its depth limit, whereas task scales can be infinite; (b) Behavior vs. Mechanism — While prompting is known to be effective, the underlying circuit structure remains unknown, preventing guaranteed controllable scaling.
Goal: (1) Propose "explicit partitioning + CoT summation" as a System-2 counting strategy, proving its efficacy across various LLMs; (2) Decouple the "segment count → write intermediate steps → aggregate" three-stage circuit using attention analysis, activation patching, masking ablation, and attention knockout; (3) Causally verify via cross-context patching that the final answer is directly regulated by intermediate step token embeddings.
Key Insight: The authors adopt Kahneman's Dual Process Theory: the implicit counting in a single LLM forward pass is treated as "System-1" (fast but depth-limited), while "partitioning + CoT step-by-step summation" is treated as "System-2" (slow, explicit, scalable). Mechanistic interpretability confirms that System-2 is implemented via specific heads/layers within the Transformer.
Core Idea: Explicitly partition the input using | and force the model to output part1: x1, part2: x2, ... before the sum. This keeps each partition within the model's "reliable counting range" (where System-1 works), while System-2 handles integer summation, a step at which almost all models succeed (final-step accuracy of 86–100% in Table 3).
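The decomposition can be written out as a short worked equation. This is only an illustrative sketch: the symbols \(k\) (the model's reliable per-partition counting range), \(m\) (number of partitions), and \(n_i\) (true count of partition \(i\)) are notation introduced here, not taken from the paper, and the formula for \(m\) assumes every partition except possibly the last is filled to size \(k\).

\[
N = \sum_{i=1}^{m} n_i, \qquad n_i \le k, \qquad m = \left\lceil \frac{N}{k} \right\rceil
\]

For example, with \(N = 50\) and an assumed \(k = 8\), this gives \(m = 7\) partitions: six of size 8 and one of size 2, so \(6 \cdot 8 + 2 = 50\). Each \(n_i\) stays within the range where implicit counting is reliable, and only the explicit summation grows with \(N\).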
Method
Overall Architecture
The "Test-time System-2 Counting" pipeline: (1) At the input, partition a list of \(N\) objects using | (fragment size \(\sim 6–9\) for open models, \(\sim 15–25\) for closed models); (2) Use a prompt requiring the model to follow a fixed output format: part1: x1\npart2: x2\n... Final answer: x; (3) The model implicitly counts partitions (System-1) and explicitly aggregates (System-2). No fine-tuning or external tools are required.
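A minimal sketch of steps (1) and (2), assuming a plain comma-separated list. The helper names (`partition_items`, `build_prompt`, `parse_response`), the instruction wording, and the default partition size of 8 are illustrative assumptions; the paper's exact prompt is not reproduced here.

```python
import re


def partition_items(items, part_size):
    """Split a list into consecutive chunks of at most part_size items."""
    return [items[i:i + part_size] for i in range(0, len(items), part_size)]


def build_structured_input(items, part_size):
    """Join items with commas inside a partition and ' | ' between partitions."""
    parts = partition_items(items, part_size)
    return " | ".join(", ".join(part) for part in parts)


def build_prompt(items, part_size=8):
    # Hypothetical instruction text; only the output format follows the summary.
    listing = build_structured_input(items, part_size)
    return (
        "Count the items in each part separated by '|', then sum them.\n"
        "Answer in the format:\n"
        "part1: x1\npart2: x2\n...\nFinal answer: x\n\n"
        f"List: {listing}"
    )


def parse_response(text):
    """Extract the per-part counts and the final answer from a formatted reply."""
    parts = [int(m) for m in re.findall(r"part\d+:\s*(\d+)", text)]
    final = re.search(r"Final answer:\s*(\d+)", text)
    return parts, (int(final.group(1)) if final else None)


items = ["apple"] * 20
structured = build_structured_input(items, 8)   # three parts: 8, 8, 4 items
reply = "part1: 8\npart2: 8\npart3: 4\nFinal answer: 20"
parsed_parts, parsed_final = parse_response(reply)
```

Parsing the intermediate lines separately from the final answer is what makes it possible to attribute an error either to segment counting or to summation.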
Mechanistic analysis uses five tools: (a) CountScope probing: patch target token activations into a blank counting context to decode the implied count, localizing where the latent count is stored; (b) Token zero-ablation: zero out the activations of the item and comma at the end of a partition and observe the drop in intermediate count probability; (c) Layer-wise mask/unmask: identify which layers write the count; (d) Attention knockout: block individual attention heads to find the heads critical for segment counting and aggregation; (e) Cross-context patching: exchange intermediate token embeddings between two different contexts to see whether the final answer changes accordingly (verifying causal direction).
Key Designs
- Dual combination of explicit partitioning + forced intermediate steps:
  - Function: Decomposes large-number tasks (impossible for a single forward pass) into sub-tasks within the model's "reliable range" and forces sub-results into materialized tokens for subsequent aggregation.
  - Mechanism: Adding | without CoT is actually harmful (Qwen2.5-7B accuracy at \(N=11-20\) drops from 0.38 unstructured to 0.20 structured w/o steps), as partitioning resets the implicit counter but the model fails to aggregate. CoT without partitioning also helps little. Only the combination yields a jump from 0.38 to 0.95. Final-step accuracy (summation stage) exceeds \(86\%\) for nearly all models, proving the bottleneck lies entirely in the intermediate count step, where System-1 still functions within segments.
  - Design Motivation: This combination is the minimum viable implementation of System-2. Partitioning provides a "controllable System-1 sub-task space," while CoT materializes sub-results as tokens for subsequent attention access.
- CountScope localizing latent counts at partition boundaries:
  - Function: Identify "which token's hidden state stores the count information for each partition" to provide precise targets for causal intervention.
  - Mechanism: CountScope (Hasani 2025b) is a patching-based probe: injecting target activations into a blank counting context to see what number the LM generates. Experiments show count signals are stored with high confidence on the last item token + last comma token of each partition, and counters reset between partitions.
  - Design Motivation: Traditional logit-lens/tuned-lens are unreliable for decoding numbers. Task-conditioned probes like CountScope are necessary. Localizing to boundary tokens directly validates the mechanistic hypothesis of independent segment counting.
- Three-stage circuit + single-head causal attribution:
  - Function: Dissect the specific attention pathways for "Information Storage (partition end tokens) → Information Transfer (intermediate step tokens) → Information Aggregation (final answer token)."
  - Mechanism: (a) Attention analysis reveals Layers 19-23 point strongly from intermediate tokens to corresponding partition boundary tokens; (b) Attention knockout identifies Layer 22 as critical: Head 22-13 handles "partition end → intermediate step" transfer, while Head 22-1 handles "intermediate step → final answer" aggregation; (c) Cross-context patching provides ultimate validation: swapping intermediate token embeddings from Context A with Context B changes A's final answer accordingly (e.g., \(19 \to 21\)), confirming token embeddings as causal mediators.
  - Design Motivation: Pure attention analysis offers only correlational evidence; activation patching provides causal conclusions. The discovery of different heads for different stages implies a division of labor within the LLM for this task.
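The idea behind attention knockout can be illustrated on a toy single-query attention step. This is not the paper's implementation: the scores, the scalar "count signal" values, and the function names are invented for illustration, and a real experiment would apply the same mask inside one head of a Transformer layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf scores receive exactly zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, blocked=()):
    """Single-query attention over scalar values.

    Key positions listed in `blocked` are knocked out by setting their
    pre-softmax score to -inf. At least one position must stay unblocked.
    """
    masked = [float("-inf") if i in blocked else s for i, s in enumerate(scores)]
    weights = softmax(masked)
    output = sum(w * v for w, v in zip(weights, values))
    return output, weights


# Toy setup: the query stands in for an intermediate-step token, and key
# position 2 for the partition-boundary token carrying a count signal of 7.
scores = [0.0, 0.0, 4.0, 0.0]
values = [0.0, 0.0, 7.0, 0.0]

clean_out, clean_w = attend(scores, values)
ko_out, ko_w = attend(scores, values, blocked={2})
```

In the clean run most of the attention mass lands on position 2, so the output carries the count signal. After the knockout the remaining positions share the weight and the signal disappears, which is the kind of drop used to flag a head as critical.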
Loss & Training
Pure inference-time, no training involved. All interventions (CountScope probe / zero-ablation / attention knockout / cross-context patching) are implemented via forward hooks.
Key Experimental Results
Main Results: Behavioral Performance (Accuracy / MAE for \(N=11–50\) (Open) and \(N=51–100\) (Closed))
| Model | Input | Output | Acc \(N=21-30\) | Acc \(N=41-50\) | MAE \(N=41-50\) |
|---|---|---|---|---|---|
| Qwen2.5-7B (28 layers) | Unstruct | w/o steps | 0.13 | 0.00 | 10.50 |
| Qwen2.5-7B | Unstruct | w/ steps | 0.11 | 0.00 | 9.68 |
| Qwen2.5-7B | Struct | w/o steps | 0.13 | 0.01 | 6.35 |
| Ours (Qwen2.5-7B) | Struct | w/ steps | 0.61 | 0.24 | 2.18 |
| Ours (Llama3-8B) | Struct | w/ steps | 0.54 | 0.26 | 2.20 |
| Ours (Gemma3-27B) | Struct | w/ steps | 0.85 | 0.50 | 2.25 |

| Closed Model (\(N=51-100\)) | Input | Output | Acc \(N=91-100\) | MAE \(N=91-100\) |
|---|---|---|---|---|
| GPT-4o | Unstruct | w/o steps | 0.24 | 4.26 |
| GPT-4o | Struct | w/ steps | 0.86 | 0.18 |
| Gemini-2.5-Pro | Unstruct | w/o steps | 0.20 | 2.70 |
| Gemini-2.5-Pro | Struct | w/ steps | 0.91 | 0.07 |
Structured + w/ steps is the only universally effective configuration.
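For reference, the two metrics in the tables can be computed per count range as sketched below. The sketch assumes accuracy means exact match of the final answer and MAE means the mean absolute difference between predicted and true counts; the function names and the sample numbers are illustrative, not data from the paper.

```python
from collections import defaultdict


def bucket_label(n, width=10):
    """Map a target count to a range label such as '41-50'."""
    lo = ((n - 1) // width) * width + 1
    return f"{lo}-{lo + width - 1}"


def score_by_bucket(preds, targets, width=10):
    """Return {range label: (exact-match accuracy, mean absolute error)}."""
    groups = defaultdict(list)
    for pred, target in zip(preds, targets):
        groups[bucket_label(target, width)].append((pred, target))
    return {
        label: (
            sum(p == t for p, t in pairs) / len(pairs),
            sum(abs(p - t) for p, t in pairs) / len(pairs),
        )
        for label, pairs in groups.items()
    }


# Made-up predictions for illustration only.
preds = [41, 45, 50, 48, 25]
targets = [41, 46, 50, 44, 25]
scores = score_by_bucket(preds, targets)
```

Reporting both metrics matters because a configuration can lower MAE (answers land closer to the truth) without improving exact-match accuracy, as the Struct / w/o steps row shows.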
Ablation Study: Error Source Decomposition (Structured CoT setting)
| Model | Total Acc | Final-step Acc | Intermediate Acc |
|---|---|---|---|
| Qwen2.5-7B | 0.51 | 0.86 | 0.53 |
| Llama 3-8B | 0.49 | 0.96 | 0.48 |
| Gemma 3-27B | 0.71 | 0.93 | 0.76 |
| GPT-4o | 0.89 | 1.00 | 0.89 |
| Gemini-2.5-Pro | 0.94 | 0.97 | 0.94 |
Final-step Accuracy is consistently \(\ge 86\%\), indicating the bottleneck is entirely the intermediate count.
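One way to compute such a decomposition from parsed model outputs is sketched here. The definitions are assumptions made for illustration, since the summary does not spell them out: a sample counts as intermediate-correct when every reported part count matches the true part count, as final-step-correct when the final answer equals the sum of the model's own reported part counts, and as totally correct when the final answer equals the true total.

```python
def decompose_errors(records):
    """Return (total_acc, final_step_acc, intermediate_acc) for parsed outputs.

    Each record is a dict with keys:
      true_parts - list of true per-partition counts
      pred_parts - list of per-partition counts reported by the model
      pred_final - the model's final answer
    """
    n = len(records)
    total = sum(r["pred_final"] == sum(r["true_parts"]) for r in records)
    final_step = sum(r["pred_final"] == sum(r["pred_parts"]) for r in records)
    intermediate = sum(r["pred_parts"] == r["true_parts"] for r in records)
    return total / n, final_step / n, intermediate / n


# Made-up records for illustration only.
records = [
    {"true_parts": [8, 8, 4], "pred_parts": [8, 8, 4], "pred_final": 20},
    {"true_parts": [8, 8, 4], "pred_parts": [8, 7, 4], "pred_final": 19},
    {"true_parts": [8, 8], "pred_parts": [8, 8], "pred_final": 15},
    {"true_parts": [8, 2], "pred_parts": [8, 2], "pred_final": 10},
]
total_acc, final_step_acc, intermediate_acc = decompose_errors(records)
```

Under these definitions the second record is a pure counting error (the sum is faithful to a wrong part count) and the third is a pure summation error, which is the distinction the table relies on.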
Key Findings
- Failure modes differ across prompt combinations: Partitioning without CoT causes models to output the "maximum partition size" as the answer (found in \(13\%\) of Qwen2.5-7B and \(43.6\%\) of Llama3-8B errors), aligning with the hypothesis that models output the maximum latent count.
- System-2 mechanism is staged: CountScope shows partition counters reset after each |. Intermediate tokens aggregate these via Layer 22-Head 13, and the final answer aggregates intermediate tokens via Layer 22-Head 1.
- Cross-model mechanistic consistency: Parallel attention patterns were found in Llama3.2-8B (Layers 13-18) and Gemma3-4B (Layers 21-23), suggesting the System-2 circuit activated by prompting is a general Transformer capability rather than model-specific.
- Counter-intuitive finding on CoT: While CoT is often seen as a panacea, it provides almost no improvement for unstructured inputs (0.45 vs 0.38). Explicit structural separators are required to provide "stage boundaries" for attention heads.
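The "maximum partition size" failure mode in the first finding can be checked mechanically. The sketch below assumes each sample is a pair of true partition sizes and the model's final answer; the function names and sample values are hypothetical.

```python
def is_max_partition_error(true_parts, pred_final):
    """True if the answer is wrong and equals the largest partition size."""
    return pred_final != sum(true_parts) and pred_final == max(true_parts)


def max_partition_error_rate(samples):
    """Fraction of erroneous answers that equal the largest partition size."""
    errors = [(parts, pred) for parts, pred in samples if pred != sum(parts)]
    if not errors:
        return 0.0
    hits = sum(is_max_partition_error(parts, pred) for parts, pred in errors)
    return hits / len(errors)


# Made-up samples: (true partition sizes, model's final answer).
samples = [
    ([8, 8, 4], 8),    # wrong, and equal to the largest partition
    ([8, 8, 4], 20),   # correct
    ([8, 8, 4], 19),   # wrong, but not the largest partition
]
rate = max_partition_error_rate(samples)
```

The rate is computed over errors rather than over all samples, matching how the percentages in the finding are stated.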
Highlights & Insights
- Staged Computation Framework: The paper decomposes System-2 behavior into "Storage → Transfer → Aggregation," mapping each to specific token positions and heads. This staged interpretation paradigm is extensible to other "divide and conquer" tasks like multi-step reasoning or long-doc summarization.
- Input structure as stage boundaries: Explicit separators create "stage anchors" in the token stream that attention heads can index, which is more reliable than letting the model generate its own stages.
- Methodological contribution of CountScope: The discovery that logit-lens/tuned-lens are unreliable for numerical decoding is significant; patching-style probes are superior for math/reasoning interpretability.
- Bypassing architectural limits via test-time scaling: Instead of increasing layers, externalizing computation into token sequences allows token-by-token generation to act as depth unrolling.
Limitations & Future Work
- The study used synthetic noun-repetition data rather than natural prose.
- Requires prior knowledge of the model's "reliable partition size."
- Strategy is limited to near-independent sub-tasks (counting, step-by-step arithmetic) and may not generalize to strongly coupled multi-hop causal chains.
- Cross-model verification focused on layer-level attention rather than exhaustive head-level knockout for every model.
Related Work & Insights
- vs. CoT (Wei 2022): CoT provides steps but not necessarily stage boundaries; this paper proves boundaries are critical for saturation tasks.
- vs. Hasani 2025b: They proposed the layer-wise counter; this paper applies their tool to explain System-2 decomposition.
- vs. Yehudai 2024: They provided theoretical upper bounds for implicit counting; this paper bypasses these bounds empirically via test-time decomposition.
Rating
- Novelty: ⭐⭐⭐⭐☆ The behavioral strategy is simple, but the complete three-stage circuit identification and causal validation are significant mechanistic contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Broad coverage across 5 models, multiple configurations, and rigorous causal interventions.
- Writing Quality: ⭐⭐⭐⭐☆ Clear structure (phenomenon → localization → causation).
- Value: ⭐⭐⭐⭐⭐ Directly guides prompt engineering for capacity-saturated tasks and enhances our understanding of System-2 internal circuits.