Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length¶

Conference: ACL 2026
arXiv: 2603.22608
Code: https://github.com/jingxuanchen916/multi-instance-processing
Area: LLM Efficiency / Long-context Evaluation / Multi-instance Processing
Keywords: Multi-instance Processing, Long Context, Aggregated Reasoning, Failure Modes, Instance Count

TL;DR¶

This paper systematically evaluates the degradation patterns of 16 LLMs in multi-instance processing (MIP). It finds that performance decline is not solely caused by increasing context length; the instance count itself exerts a stronger influence on success rates. Specifically, almost all models collapse when processing over 1,000 instances and rarely proactively alert the user.

Background & Motivation¶

Background: Many LLM applications are still evaluated via single-instance processing (SIP), such as determining the sentiment of a single review, the language of a single sentence, or the parity of a single number. However, real-world data analysis scenarios often require users to input dozens to thousands of instances at once, expecting the model to process each item and aggregate them into totals, category distributions, or sums.

Limitations of Prior Work: Research on long contexts typically couples increasing input length with increasing task complexity, making it difficult to distinguish whether a model is overwhelmed by the number of tokens or by the repetitive processing of numerous instances. Batch processing research often focuses on merging a small number of queries primarily to reduce costs, lacking a systematic scan of the large-scale instance count dimension.

Key Challenge: MIP imposes simultaneous pressure from both long context and repetitive operations. Even if each instance is simple in isolation, placing 500 or 2,000 instances in a single prompt requires the model to identify each item, maintain indices, avoid omissions, and perform accurate aggregation. This differs fundamentally from retrieving a few pieces of evidence in RAG.

Goal: To answer two questions: how LLMs degrade and what failure modes emerge as instance counts increase in MIP; and whether performance degradation is more affected by context length or instance count.

Key Insight: The authors first filter out samples that models fail on individually using SIP, retaining only instances that all compared models can answer correctly in isolation. These "inherently simple" instances are then combined into MIP inputs. Thus, failures in MIP can be more accurately attributed to multi-instance processing and aggregation capabilities rather than individual sample difficulty.

Core Idea: After controlling for single-instance difficulty, the study independently manipulates instance count and context length to measure the true bottlenecks of LLMs in repetitive processing and aggregation.

Method¶

MIP (multi-instance processing) is defined as follows: a model receives a set of instances \(X'=\{x_1,\dots,x_n\}\) and a task instruction \(\tau\) within the same prompt, performs an implicit or explicit judgment for each instance, and outputs an aggregated answer \(y_{agg}\). This output may be correct, incorrect, or invalid; errors are further decomposed into instance-level errors, aggregation errors, index/key errors, or combinations thereof. The methodology is designed to factor out the difficulty of individual samples, so that the results show whether LLMs are limited by the number of tokens or by having to repeat and aggregate the same operation hundreds or thousands of times.
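The definition above can be sketched in a few lines: build one prompt from \(\tau\) and the \(n\) instances, then map the model's reply to correct, incorrect, or invalid. This is a minimal sketch, not the authors' harness; the `ANSWER:` tag, the `[i]` numbering, and all function names are assumptions made for illustration.

```python
import re
from typing import Optional

def build_prompt(instruction: str, instances: list[str]) -> str:
    """Put the instruction tau and all n instances x_1..x_n into one prompt."""
    numbered = [f"[{i}] {x}" for i, x in enumerate(instances, start=1)]
    # The reply format below is a hypothetical convention, not the paper's template.
    return instruction + "\n\n" + "\n".join(numbered) + "\n\nReply with 'ANSWER: <integer>'."

def parse_aggregate(output: str) -> Optional[int]:
    """Return the integer after the last 'ANSWER:' tag, or None if there is none."""
    matches = re.findall(r"ANSWER:\s*(-?\d+)", output)
    return int(matches[-1]) if matches else None

def classify(output: str, gold: int) -> str:
    """Map one model response to 'correct', 'incorrect', or 'invalid'."""
    pred = parse_aggregate(output)
    if pred is None:
        return "invalid"
    return "correct" if pred == gold else "incorrect"
```

A reply with no parseable aggregate is counted as invalid rather than incorrect, which keeps the two failure kinds separate for the metrics introduced later.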

Overall Architecture¶

The experiment proceeds in two steps. First, single-instance evaluation (SIP) is conducted for each task to filter out ambiguous or unstable samples, leaving a clean pool \(X_{SIP}\) of instances that "everyone can do." Second, instance sets of sizes 2, 5, 10, 20, 50, 100, 200, 500, 1000, and 2000 are sampled from this pool using 5 random seeds. These are fed to models to generate aggregated answers, such as "how many reviews are positive," "how many numbers are odd," or "how many person entities are in these sentences." The primary metrics are success rate (SR) and invalid rate (IR). An "instance-level variant" is also established, requiring the model to provide per-instance predictions before the aggregate answer, so that failures can be localized.
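The sampling grid and the two metrics can be sketched as below. The sizes come from the text; the concrete seed values, the function names, and the assumption that SR and IR are simple percentages over runs are illustrative choices, not details confirmed by the source.

```python
import random

SIZES = [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000]
SEEDS = [0, 1, 2, 3, 4]  # hypothetical values; the source only says 5 random seeds

def sample_sets(pool, sizes=SIZES, seeds=SEEDS):
    """Yield (size, seed, instances) for every size/seed pair the pool can fill."""
    for n in sizes:
        if n > len(pool):
            continue  # skip sizes larger than the clean pool
        for seed in seeds:
            rng = random.Random(seed)  # seeded so each configuration is reproducible
            yield n, seed, rng.sample(pool, n)

def rates(outcomes):
    """Return (SR, IR) in percent from 'correct'/'incorrect'/'invalid' labels."""
    total = len(outcomes)
    if total == 0:
        return 0.0, 0.0
    sr = 100.0 * outcomes.count("correct") / total
    ir = 100.0 * outcomes.count("invalid") / total
    return sr, ir
```

With a pool of 2,500 instances this yields 10 sizes × 5 seeds = 50 MIP inputs per task.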

Key Designs¶

1. SIP Filtering: Eliminating the Interference of Task Difficulty

If a model fails to identify the sentiment of a single review, its failure to count positive reviews among 500 samples is uninterpretable. To address this, the authors perform SIP on 2,500 instances per task, retaining only those correctly answered by all participating models. Three thresholds are set: an average SIP success rate >95%, a per-task SIP success rate >90%, and an inter-model agreement >85%. After filtering, the MIP test becomes a pure measure of repeating a verified "simple" operation many times and aggregating the results.
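A minimal sketch of the core filtering rule, keeping only instances that every model answered correctly in isolation. The data layout and function names are assumptions, and the sketch does not encode the three thresholds, since the text here does not say exactly what each one is applied to.

```python
def clean_pool(sip_results):
    """sip_results: {instance_id: {model_name: bool}}.

    Keep the ids that every participating model answered correctly on its own.
    """
    return [
        iid for iid, by_model in sip_results.items()
        if by_model and all(by_model.values())
    ]

def sip_success_rate(sip_results, model):
    """Percent of instances that `model` answered correctly in isolation."""
    flags = [by_model[model] for by_model in sip_results.values()]
    return 100.0 * sum(flags) / len(flags)
```

Instances dropped here are not necessarily hard; a single model's mistake is enough to exclude them, which is what makes later MIP failures attributable to processing many items rather than to any one item.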

2. Multi-task and Multi-model MIP Scanning: Confirming Universal Degradation

To demonstrate that MIP degradation is a systemic weakness of LLMs, the study covers eight task categories (Arithmetic, Category, Language, NER, Parity, Sentiment, Word, and WSD) that span various instance types and aggregation forms. Sixteen models are evaluated, including 9 open-weight models (DeepSeek R1/V3, gpt-oss-120b/20b, Llama, Qwen3, etc.) and 7 closed-source models (Claude, Gemini, GPT-5, Grok, etc.). All models are set to temperature 0 with a maximum output of 20K tokens and up to 3 retries for invalid outputs.

3. Decoupling Instance Count and Context Length: Identifying the Primary Driver

MIP involves both increasing context length and increasing repetitive operations. The authors decouple these using two methods: first, artificial length augmentation, where irrelevant noise text is inserted into each instance to increase the average length from ~136 tokens to ~326 tokens without increasing the instance count; second, Spearman correlation analysis, calculating the correlation of success rate with instance count versus total context length, and examining context length effects while fixing the instance count.
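The length-augmentation idea can be sketched as follows: every instance is padded with task-irrelevant filler so that total context grows while the instance count stays fixed. The filler vocabulary, the `[IGNORE: ...]` delimiter, and the use of whitespace-separated words as a stand-in for tokens are all assumptions; the paper's actual noise text is not described here.

```python
import random

FILLER = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]  # hypothetical noise words

def pad_instance(text, extra_words, rng):
    """Append clearly delimited, task-irrelevant filler to one instance."""
    noise = " ".join(rng.choice(FILLER) for _ in range(extra_words))
    return f"{text} [IGNORE: {noise}]"

def augment(instances, extra_words, seed=0):
    """Lengthen every instance by the same amount; the instance count is unchanged."""
    rng = random.Random(seed)
    return [pad_instance(x, extra_words, rng) for x in instances]
```

Comparing success rates on `instances` and `augment(instances, k)` at the same count isolates the effect of length from the effect of the number of items.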

Loss & Training¶

This is an evaluation and diagnostic paper; no new models were trained. Reproducibility relies on experimental controls: uniform prompt templates, temperature=0, 20K maximum tokens, three retries for invalid outputs, and sampling with 5 random seeds for each MIP configuration. The core metrics are success rate (\(SR\)) and invalid rate (\(IR\)).
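The retry control can be sketched as a small wrapper. Whether "three retries" means three attempts in total or one initial attempt plus three retries is not stated; the sketch assumes the latter, and `call_model` and `parse` are hypothetical callables supplied by the caller.

```python
def run_with_retries(call_model, prompt, parse, max_retries=3):
    """Call the model once, then retry up to max_retries times while output is invalid.

    Returns (answer, attempts); answer is None if every attempt was invalid.
    """
    attempts = 0
    for _ in range(1 + max_retries):  # assumption: 1 initial call + 3 retries
        attempts += 1
        output = call_model(prompt)
        answer = parse(output)
        if answer is not None:
            return answer, attempts
    return None, attempts
```

A run that is still invalid after the last attempt would then count toward the invalid rate.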

Key Experimental Results¶

Main Results¶

The overall success rates show that closed-source models do not completely dominate open-weight models. GPT-5 and Gemini 3.1 Pro were highest, but Qwen3-Thinking, gpt-oss-120b, DeepSeek R1, Grok 4 Fast, and GPT-5 Nano also exceeded 65%.

Model	Type / Cost Info	Success Rate (%)	Invalid Rate (%)	Observation
GPT-5	Closed, ~USD 2.64 / task	81.8 ± 2.6	1.8 ± 0.7	Highest overall
Gemini 3.1 Pro	Closed, ~USD 6.28 / task	80.3 ± 1.4	2.6 ± 0.9	High performance, highest cost
Grok 4	Closed, ~USD 5.54 / task	70.6 ± 1.7	1.3 ± 0.0	Strong closed model
Qwen3-Thinking	Open-weight A22B	69.4 ± 2.4	3.9 ± 1.6	Very strong open-weight
gpt-oss-120b	Open-weight A5.1B	68.3 ± 2.8	3.6 ± 1.1	High success rate
DeepSeek R1	Open-weight A37B	67.5 ± 2.6	2.9 ± 0.6	Stable reasoning performance
Grok 4 Fast	Closed, ~USD 0.26 / task	67.0 ± 2.8	0.0 ± 0.0	Zero invalid rate, great robustness
GPT-5 Nano	Closed, ~USD 0.13 / task	66.5 ± 3.8	7.5 ± 0.6	Most likely to admit limitations

Ablation Study¶

The key analyses compare failure types and context-length effects. Models are generally reliable with few instances, decline sharply beyond 200, and approach collapse at 1,000 to 2,000. Lengthening each instance with noise (from ~136 to ~326 tokens on average) does not cause degradation of comparable magnitude.

Analysis Item	Key Data	Conclusion
Instance Count Degradation	All models drop significantly above 200 instances; SR < 20% at 2,000; no model > 40% at 1,000+	Large-scale MIP is a systemic weakness of current LLMs
Task Variance	All tasks (excl. Arithmetic) avg SR > 60% with < 50 instances	Small-batch MIP is usable, but degradation rates vary by task
Failure Types	Combined mistakes reach ~25%-45% after 100 instances; parsing error approaches 30% at 2,000	Errors stem from aggregation, formatting, and length, not just judgment
Self-Awareness	Only 171 / 4,620 experiments showed omission; only 27 explicitly suggested batching	Most models fail without notifying the user of capacity limits
Artificial Context Extension	Context lengthened from ~136 to ~326 tokens; SR remained similar for fixed instance counts	Context length is not the sole cause of failure
Correlation	Spearman correlation of SR with instance count is -0.61 and with context length is -0.37; with instance count held fixed, the SR vs. context length correlation ranges from about -0.15 to 0.15 (p > 0.1)	Instance count has higher explanatory power for degradation
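For readers unfamiliar with the statistic in the last row, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below is a plain standard-library implementation with average ranks for ties; it is not the authors' analysis code, and it assumes neither input is constant (otherwise the denominator is zero).

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)
```

Applied once with instance count and once with total context length as `x`, and success rate as `y`, this gives the two coefficients being compared.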

Key Findings¶

MIP degradation follows a "slow decline then sudden collapse" pattern. Models perform well at 20-100 instances but deteriorate after 200 and become unreliable beyond 1,000.
Instance order has little impact. Randomized shuffles yield highly consistent degradation curves, suggesting the issue is not simple positional bias.
Aggregation errors are critical. Even if instance-level predictions are correct, models fail at final counting or summation; combinations of instance and aggregation errors become common as counts increase.
"Long-context capability" is not equivalent to "MIP capability." A model may read long documents but fail to reliably execute the same operation across hundreds of independent instances.
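The distinction between instance-level, aggregation, and index errors can be made concrete for a counting task. The sketch below is one plausible operationalization for the instance-level variant, not the paper's exact definitions; the data layout and names are assumptions.

```python
def diagnose(pred_by_index, gold_by_index, reported_total, target):
    """Return the set of error types found in one instance-level MIP response.

    pred_by_index / gold_by_index: {item_index: label}
    reported_total: the aggregate count the model stated
    target: the label being counted (e.g. "pos")
    """
    errors = set()
    if set(pred_by_index) != set(gold_by_index):
        errors.add("index")  # some item keys are missing or invented
    shared = set(pred_by_index) & set(gold_by_index)
    if any(pred_by_index[i] != gold_by_index[i] for i in shared):
        errors.add("instance")  # at least one per-item label is wrong
    own_total = sum(1 for label in pred_by_index.values() if label == target)
    if reported_total != own_total:
        errors.add("aggregation")  # the count disagrees with the model's own labels
    return errors
```

An aggregation error is judged against the model's own per-item labels, so a response can have every item right and still fail at the final count.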

Highlights & Insights¶

The paper addresses a highly realistic use case: users pasting raw data into an LLM for statistics without writing scripts. It shows that this is unreliable at scale.
SIP filtering is a methodological highlight, isolating MIP capacity from general task performance.
The decoupling of instance count and context length is insightful. While many systems manage batches by token budget, this suggests batch policies must also restrict instance counts.
The analysis of failure self-awareness is practical. Models rarely admit "I can't process this many items," which is dangerous as errors often appear as confident aggregated answers.

Limitations & Future Work¶

The study is diagnostic and does not validate mitigation strategies like automated batching, tool usage, external counters, or map-reduce pipelines.
Tasks focused on exact aggregation (counting, summing) might be more fragile than "soft" aggregation like summarization; findings may need further validation for open-ended analysis.
Prompting was fixed; the effects of advanced CoT, JSON schemas, or self-check prompts were not explored.
Despite decoupling experiments, instance count and context length remain difficult to separate causally in practice; future work requires attention-level or hidden-state diagnostics.
The focus on English tasks and specific models suggests that cross-lingual MIP and specialized data agents warrant further evaluation.

Related Work & Insights¶

vs. Long-context benchmarks: While long-context evaluations focus on finding evidence or cross-segment reasoning, this work focuses on repetitive operations across all instances.
vs. Batch prompting: Unlike batch prompting for cost reduction, this pushes instance counts to 2,000 to reveal the collapse thresholds of large-batch processing.
vs. RAG: RAG can succeed with few relevant snippets; MIP requires traversing every instance. Thus, a long context window does not guarantee reliable MIP.
Transferable Insight: LLM data analysis products should manage token and instance budgets separately, defaulting to partitioned execution and programmatic aggregation for large inputs.
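The transferable insight can be sketched as a batching policy with two independent caps, followed by aggregation in code rather than by the model. The paper does not validate this mitigation; the default limits, the whitespace token counter, and the `label_batch` callable (a stand-in for an LLM call returning one label per item) are all hypothetical.

```python
def partition(instances, max_instances, max_tokens,
              count_tokens=lambda s: len(s.split())):
    """Split instances into batches that respect an instance cap and a token cap."""
    batches, current, used = [], [], 0
    for item in instances:
        cost = count_tokens(item)
        full = len(current) >= max_instances or used + cost > max_tokens
        if current and full:
            batches.append(current)
            current, used = [], 0
        current.append(item)  # an oversized single item still gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches

def count_matching(instances, label_batch, target,
                   max_instances=50, max_tokens=4000):
    """Label each batch separately, then sum the per-batch counts in code."""
    total = 0
    for batch in partition(instances, max_instances, max_tokens):
        labels = label_batch(batch)
        if len(labels) != len(batch):
            raise ValueError("batch returned the wrong number of labels")
        total += sum(1 for label in labels if label == target)
    return total
```

Checking that each batch returns exactly one label per item turns silent omissions into explicit errors, and keeping the final sum in code removes the aggregation step from the model entirely.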

Rating¶

Novelty: ⭐⭐⭐⭐☆ High practical value in problem definition and decoupling analysis.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across 16 models, 8 tasks, and 10 scales of instance counts.
Writing Quality: ⭐⭐⭐⭐☆ Clear narrative and solid charts.
Value: ⭐⭐⭐⭐⭐ Directly informs long-context applications, data analysis agents, and reliability evaluation.