Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
Conference: ACL2026
arXiv: 2603.22608
Code: https://github.com/jingxuanchen916/multi-instance-processing
Area: LLM Efficiency / Long-Context Evaluation / Multi-Instance Processing
Keywords: Multi-instance processing, Long-context, Aggregated reasoning, Failure modes, Instance count
TL;DR
This paper systematically evaluates how 16 LLMs degrade in multi-instance processing (MIP). It finds that the performance decline is not caused by context length alone; the number of instances itself has a stronger effect on success rate. Beyond 1,000 instances, almost all models collapse, and they rarely warn the user proactively.
Background & Motivation
Background: Many LLM applications are still evaluated using single-instance processing (SIP), such as determining the sentiment of a single review, the language of a sentence, or the parity of one number. However, real-world data analysis scenarios often require users to input dozens to thousands of instances simultaneously for the model to judge each one and then aggregate them into total counts, category distributions, or sums.
Limitations of Prior Work: Long-context research typically couples input length with task complexity, making it hard to tell whether models are hindered by the number of tokens or by having to process many instances repeatedly. Batch-processing research usually merges only a few queries, is motivated mainly by cost reduction, and lacks a systematic sweep over large instance counts.
Key Challenge: MIP simultaneously involves the pressures of long contexts and repetitive operations. Even if each instance is simple on its own, placing 500 or 2,000 instances in a single prompt requires the model to identify each item, maintain indices, avoid omissions, and perform accurate aggregation; this differs from RAG, which involves finding a small amount of relevant evidence.
Goal: To answer two questions: how LLMs degrade and what failure modes appear as the number of instances increases in MIP; and whether performance degradation is more affected by context length or instance count.
Key Insight: The authors first use SIP to filter out samples that models fail on individually, retaining only instances that all compared models can answer correctly in isolation. These "inherently simple" instances are then combined into MIP inputs. Thus, if a model fails in MIP, it can be more accurately attributed to its multi-instance processing and aggregation capabilities rather than the difficulty of single samples.
Core Idea: After controlling for single-instance difficulty, the number of instances and context length are independently manipulated to measure the true bottlenecks of LLMs in repetitive processing and aggregation.
Method
The paper defines MIP as: a model receives a set of instances \(X'=\{x_1,\dots,x_n\}\) and a task instruction \(\tau\) in the same prompt, needing to produce an implicit or explicit judgment for each instance and output an aggregated answer \(y_{agg}\). The output may be correct, incorrect, or invalid; errors can be further broken down into instance-level errors, aggregation errors, index/key errors, and combinations thereof.
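To make the definition concrete, here is a minimal Python sketch of one MIP query for a counting task. The prompt layout, the function names (`build_mip_prompt`, `gold_count`, `parse_aggregate`, `classify_outcome`), and the rule of reading the last integer in the output as \(y_{agg}\) are illustrative assumptions, not the paper's actual template or parser.

```python
import re
from typing import Callable, Optional, Sequence


def build_mip_prompt(instruction: str, instances: Sequence[str]) -> str:
    """Put the task instruction tau and all n instances x_1..x_n in one prompt."""
    numbered = [f"[{i}] {x}" for i, x in enumerate(instances, start=1)]
    return f"{instruction}\n\n" + "\n".join(numbered) + "\n\nReply with one integer."


def gold_count(instances: Sequence[str], is_positive: Callable[[str], bool]) -> int:
    """Ground-truth aggregate: how many instances satisfy the per-instance label."""
    return sum(1 for x in instances if is_positive(x))


def parse_aggregate(output: str) -> Optional[int]:
    """Assumed parsing rule: the last integer in the output is y_agg; None if absent."""
    matches = re.findall(r"-?\d+", output)
    return int(matches[-1]) if matches else None


def classify_outcome(output: str, gold: int) -> str:
    """Map a raw model output to one of: correct, incorrect, invalid."""
    pred = parse_aggregate(output)
    if pred is None:
        return "invalid"
    return "correct" if pred == gold else "incorrect"


# Toy Parity-style example with three instances.
numbers = ["3", "4", "7"]
prompt = build_mip_prompt("How many of these numbers are odd?", numbers)
gold = gold_count(numbers, lambda s: int(s) % 2 == 1)
```

An output such as "There are 2 odd numbers." would be classified as correct here, while a reply containing no integer at all would count as invalid.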
Overall Architecture
The experimental workflow begins by running SIP on each task to filter out ambiguous samples or those the model cannot solve stably in isolation. Then, different-sized instance sets are sampled from the retained \(X_{SIP}\) using 5 random seeds, with instance counts of 2, 5, 10, 20, 50, 100, 200, 500, 1000, and 2000. Models are required to output an aggregated answer, such as "how many of these reviews are positive," "how many of these numbers are odd," or "how many person entities are in these sentences." Evaluation uses success rate and invalid rate; an additional instance-level variant requires the model to provide individual predictions before the aggregated answer to analyze failure types.
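The filter-then-sample workflow can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (`sip_correct` as one boolean list per model), the seed values, and the function names are hypothetical, and the toy SIP results are made up rather than taken from the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple

INSTANCE_COUNTS = [2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000]
SEEDS = [0, 1, 2, 3, 4]  # assumed values; the notes only state "5 random seeds"


def retain_solved_by_all(sip_correct: Dict[str, List[bool]]) -> List[int]:
    """Indices of instances that every model answered correctly under SIP."""
    per_model = list(sip_correct.values())
    n = len(per_model[0])
    return [i for i in range(n) if all(flags[i] for flags in per_model)]


def sample_mip_sets(
    pool: Sequence[int],
    counts: Sequence[int] = INSTANCE_COUNTS,
    seeds: Sequence[int] = SEEDS,
) -> Dict[Tuple[int, int], List[int]]:
    """Draw one instance set per (instance count, seed) pair, without replacement."""
    configs: Dict[Tuple[int, int], List[int]] = {}
    for seed in seeds:
        rng = random.Random(seed)  # seeded generator keeps the draw reproducible
        for n in counts:
            if n > len(pool):
                raise ValueError(f"pool has {len(pool)} instances, need {n}")
            configs[(n, seed)] = rng.sample(pool, n)
    return configs


# Toy SIP results for 2,500 instances and two hypothetical models.
sip_correct = {
    "model_a": [i != 0 for i in range(2500)],        # fails instance 0
    "model_b": [i not in (1, 2) for i in range(2500)],  # fails instances 1 and 2
}
pool = retain_solved_by_all(sip_correct)
configs = sample_mip_sets(pool)
```

With 10 instance counts and 5 seeds this yields 50 MIP configurations per task, each drawn only from instances that all models solved in isolation.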
Key Designs
- SIP Filtering to Control Single-Sample Difficulty:
    - Function: Ensures MIP failure is not due to the difficulty of individual samples.
    - Mechanism: Each task first undergoes a single-instance evaluation with 2,500 instances, retaining only those instances that all compared models get correct. Models are retained only if their average SIP success rate exceeds 95% and per-task SIP success rate exceeds 90%, with tasks requiring over 85% agreement between models.
    - Design Motivation: If a model misjudges a single review, multi-instance failure cannot be interpreted. After filtering, MIP tests the ability to repeat simple operations many times and aggregate them.
- Multi-Task Multi-Model MIP Scanning:
    - Function: Covers various instance types, aggregation forms, and model families.
    - Mechanism: Tasks include eight categories: Arithmetic, Category, Language, NER, Parity, Sentiment, Word, and WSD. Models include 9 open-weight models and 7 closed-source models, such as DeepSeek R1/V3, gpt-oss-120b/20b, Llama, Qwen3, Claude, Gemini, GPT-5, and Grok. Temperature is set to 0, maximum output length to 20K, and invalid outputs are allowed up to 3 retries.
    - Design Motivation: MIP is not a phenomenon unique to one dataset. Cross-task and cross-model evaluation allows observation of whether degradation curves are universal.
- Decoupling Analysis of Instance Count and Context Length:
    - Function: Distinguishes between "too many tokens" and "too many instances."
    - Mechanism: The authors construct manual length augmentation: irrelevant noise text is added to each instance, increasing the average instance length from approximately 136 tokens to 326 tokens while keeping the instance count constant. Spearman correlation analysis is performed to compare success rates with instance count and total context length respectively, and the impact of context length is examined while fixing the instance count.
    - Design Motivation: If performance degrades solely due to token length, adding noise should significantly lower performance; if degradation stems from repetitive processing and aggregation, the impact of context length under a fixed instance count should be much weaker.
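
The length-augmentation idea can be illustrated with a short sketch: each instance is padded with irrelevant filler until it reaches a target length, while the number of instances stays unchanged. The whitespace tokenizer, the filler vocabulary, and the function names are assumptions made for illustration; the notes do not specify which noise text or tokenizer the authors used.

```python
import random
from typing import List, Sequence

FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # hypothetical noise vocabulary


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real study would use the model's tokenizer."""
    return len(text.split())


def pad_with_noise(instance: str, target_tokens: int, rng: random.Random) -> str:
    """Append irrelevant filler words until the instance reaches target_tokens."""
    deficit = target_tokens - count_tokens(instance)
    if deficit <= 0:
        return instance
    noise = " ".join(rng.choice(FILLER_WORDS) for _ in range(deficit))
    return f"{instance} {noise}"


def lengthen_set(instances: Sequence[str], target_tokens: int, seed: int = 0) -> List[str]:
    """Lengthen every instance; the instance count is left untouched."""
    rng = random.Random(seed)
    return [pad_with_noise(x, target_tokens, rng) for x in instances]


short_set = ["great phone, works well", "battery died after a week"]
long_set = lengthen_set(short_set, target_tokens=326)
```

Running the same aggregation prompt on `short_set` and `long_set` changes the total context length but not the number of judgments the model has to make, which is the contrast the decoupling analysis relies on.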
Loss & Training
This is an evaluation and diagnostic paper; no new models are trained. Key experimental controls include unified prompt templates, temperature 0, a 20K-token maximum output, up to three retries for invalid outputs, and sampling with 5 random seeds for each MIP configuration. Success rate \(SR\) is the proportion of correct aggregated answers across all experiments; invalid rate \(IR\) is the proportion of outputs that cannot be parsed or exceed context limits.
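The retry control and the two metrics can be sketched as below. This is an illustrative reading, not the authors' harness: `call_model` is a hypothetical callable standing in for any LLM client, the last-integer parsing rule is an assumption, and "up to three retries" is interpreted as one initial attempt plus at most three more.

```python
import re
from typing import Callable, List, Optional, Tuple

MAX_RETRIES = 3  # from the notes: invalid outputs may be retried up to 3 times


def parse_int(output: str) -> Optional[int]:
    """Assumed parsing rule: take the last integer in the output, else None."""
    matches = re.findall(r"-?\d+", output)
    return int(matches[-1]) if matches else None


def run_once(call_model: Callable[[str], str], prompt: str, gold: int) -> str:
    """Query the model, retrying only while the output is unparseable."""
    for _ in range(1 + MAX_RETRIES):
        pred = parse_int(call_model(prompt))
        if pred is not None:
            return "correct" if pred == gold else "incorrect"
    return "invalid"


def rates(outcomes: List[str]) -> Tuple[float, float]:
    """Success rate SR and invalid rate IR, both in percent of all experiments."""
    n = len(outcomes)
    sr = 100.0 * outcomes.count("correct") / n
    ir = 100.0 * outcomes.count("invalid") / n
    return sr, ir


# Scripted stand-in for a model: two unparseable replies, then a valid one.
replies = iter(["sorry, too long", "no idea", "41"])
outcome = run_once(lambda prompt: next(replies), "How many are positive?", gold=41)
sr, ir = rates(["correct", "correct", "incorrect", "invalid"])
```

Note that an incorrect but parseable answer is not retried in this sketch; only invalid outputs trigger another attempt.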
Key Experimental Results
Main Results
The overall success rates of the models indicate that closed-source models do not completely dominate open-weight models. GPT-5 and Gemini 3.1 Pro are the highest, but Qwen3-Thinking, gpt-oss-120b, DeepSeek R1, Grok 4 Fast, and GPT-5 Nano also stay above 65%.
| Model | Type / Cost Info | Success Rate | Invalid Rate | Observation |
|---|---|---|---|---|
| GPT-5 | Closed, ~USD 2.64 / task | 81.8 ± 2.6 | 1.8 ± 0.7 | Highest overall |
| Gemini 3.1 Pro | Closed, ~USD 6.28 / task | 80.3 ± 1.4 | 2.6 ± 0.9 | High performance but highest cost |
| Grok 4 | Closed, ~USD 5.54 / task | 70.6 ± 1.7 | 1.3 ± 0.0 | Strong closed-source model |
| Qwen3-Thinking | open-weight A22B | 69.4 ± 2.4 | 3.9 ± 1.6 | Very strong performance for open-weight |
| gpt-oss-120b | open-weight A5.1B | 68.3 ± 2.8 | 3.6 ± 1.1 | High success rate |
| DeepSeek R1 | open-weight A37B | 67.5 ± 2.6 | 2.9 ± 0.6 | Stable performance for a reasoning model |
| Grok 4 Fast | Closed, ~USD 0.26 / task | 67.0 ± 2.8 | 0.0 ± 0.0 | Zero invalid rate, good robustness |
| GPT-5 Nano | Closed, ~USD 0.13 / task | 66.5 ± 3.8 | 7.5 ± 0.6 | Most frequently admits capacity limits |
Ablation Study
The key analysis compares failure types and the impact of context length. Results show that models typically work on small numbers of instances but drop significantly beyond 200 and approach collapse at 1,000-2,000. More than doubling the length of each instance (from ~136 to ~326 tokens) at a fixed instance count does not cause degradation of the same magnitude.
| Analysis Item | Key Data | Conclusion |
|---|---|---|
| Instance Count Degradation | All models drop significantly above 200 instances; SR below 20% at 2,000; none exceed 40% at >1,000 | Large-scale MIP is a consistent weakness for current LLMs |
| Task Differences | Except for Arithmetic, average SR exceeds 60% for all tasks under 50 instances | Small-batch MIP is usable, but degradation rates vary by task |
| Failure Types | Combined mistakes rise to ~25%-45% after 100 instances; parsing error near 30% at 2,000 | Errors stem not only from single judgments but also from aggregation, formatting, and output length |
| Self-Awareness | Only 171 / 4,620 experiments showed omission; only 27 explicitly suggested batch-wise processing | Most models do not proactively inform users to use batching when failing |
| Manual Context Lengthening | Success rates remain similar when average instance length increases from ~136 to ~326 tokens at a fixed count | Context length is not the sole primary cause |
| Correlation Analysis | Spearman correlation of SR with instance count is -0.61 and with context length is -0.37; with instance count fixed, the correlation with context length is ~-0.15 to 0.15 (p > 0.1) | Instance count has stronger explanatory power for degradation |
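
For readers unfamiliar with the statistic in the last row, the following is a small self-contained sketch of Spearman rank correlation (Pearson correlation computed on average ranks). The function names and the toy numbers are assumptions for illustration only; they are not the paper's data and do not reproduce its reported coefficients.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values: success rate falls monotonically as instance count grows.
toy_counts = [2, 5, 10, 20, 50]
toy_success = [1.0, 0.9, 0.8, 0.5, 0.1]
rho = spearman(toy_counts, toy_success)
```

Because the statistic depends only on ranks, it captures monotonic trends such as "more instances, lower success rate" without assuming the relationship is linear.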
Key Findings
- MIP degradation follows a "slow decline, then sudden collapse" shape. Success rates drop only slightly at 20-100 instances, fall sharply beyond 200, and are generally unreliable above 1,000.
- Instance order has little impact. After shuffling the same set of instances twice, the degradation curves remain highly consistent, suggesting the issue is not simple positional bias.
- Aggregation errors are critical. Even if instance-level predictions are correct, models may fail in the final count or sum; as instance count increases, the combination of instance errors and aggregation errors becomes more common.
- "Long-context capability" cannot be equated with "MIP capability." A model might read long texts but fail to reliably repeat the same operation across hundreds or thousands of independent instances.
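
The distinction between instance-level, aggregation, and index/key errors can be sketched for the instance-level output variant, where the model lists a prediction per instance before its total. The definitions encoded here are assumptions for illustration (in particular, an aggregation error is taken to mean that the reported total disagrees with the model's own per-instance predictions); the paper's exact criteria may differ.

```python
from typing import Dict, Set


def diagnose(gold: Dict[int, bool], predicted: Dict[int, bool], reported_total: int) -> Set[str]:
    """Return the set of error types present in one instance-level MIP output."""
    errors: Set[str] = set()
    # Index/key error: the model skipped, duplicated away, or invented instance ids.
    if set(predicted) != set(gold):
        errors.add("index/key")
    # Instance-level error: a wrong label on an instance the model did address.
    shared = set(gold) & set(predicted)
    if any(predicted[k] != gold[k] for k in shared):
        errors.add("instance-level")
    # Aggregation error: the total does not match the model's own listed labels.
    if reported_total != sum(predicted.values()):
        errors.add("aggregation")
    return errors


gold = {1: True, 2: False, 3: True}
clean = diagnose(gold, dict(gold), reported_total=2)
miscount = diagnose(gold, dict(gold), reported_total=3)
combined = diagnose(gold, {1: False, 2: False}, reported_total=2)
```

In this toy case `miscount` isolates a pure aggregation error (all labels right, total wrong), while `combined` shows several error types occurring together.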
Highlights & Insights
- The paper identifies a very realistic usage scenario: users may not write scripts or agent workflows but instead paste a large amount of data into an LLM for statistics. The paper shows that this usage is unreliable at scale.
- SIP filtering is a methodological highlight. It removes the interference of "sample difficulty," focusing the failure on MIP capability.
- The decoupling experiment of instance count and context length is insightful. While many systems partition batches by token length, this paper suggests batch policies should also limit the number of instances.
- The analysis of failure self-awareness is practical. Models rarely proactively say "I cannot handle this many items," which is dangerous for real users because errors appear as confident aggregated answers.
Limitations & Future Work
- The paper focuses on diagnosis without validating specific mitigation strategies, such as automatic batching, tool calling, external counters, verification agents, or map-reduce style pipelines.
- Tasks are primarily precise aggregation (e.g., counting, summation, category frequency), which may be more fragile than soft aggregation like summarization or trend analysis; more experiments are needed for open-ended analysis.
- Prompt templates are fixed; there is no systematic study on whether stronger chain-of-thought, tabular output, JSON schema, step-by-step constraints, or self-check prompts can mitigate issues.
- Despite noise lengthening and correlation analysis, instance count and context length remain difficult to separate completely in reality; future attention-level or hidden-state diagnostics are needed.
- Experiments focus on English tasks and current models; cross-lingual MIP, ultra-long context models, and specially trained data agents still warrant evaluation.
Related Work & Insights
- vs. Long-context benchmarks: Long-context evaluation often focuses on finding evidence in long documents or cross-paragraph reasoning; this paper focuses on repetitive operations and aggregation across every instance, presenting different challenges.
- vs. Batch prompting: Batch prompting focuses on merging a few queries to reduce costs; this paper pushes instance counts to 2,000, revealing the collapse zone during massive batch processing.
- vs. RAG: RAG can answer using only relevant snippets, whereas MIP must traverse all instances; thus, a sufficiently long context window does not guarantee reliable MIP.
- Transferable Insights: LLM data analysis products should manage token budget and instance budget separately, adopting chunked execution, programmatic aggregation, and result verification by default for large batch inputs rather than letting a single prompt handle everything.
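
One way to read the "separate token budget and instance budget" suggestion is the chunked, map-reduce style sketch below. It is not a mitigation validated by the paper: the budget values, the whitespace token count, and `count_chunk` (a hypothetical callable that would wrap one LLM call per chunk) are all assumptions.

```python
from typing import Callable, List, Sequence


def chunk_by_budgets(
    instances: Sequence[str],
    max_instances: int = 50,    # hypothetical instance budget per prompt
    max_tokens: int = 4000,     # hypothetical token budget per prompt
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[List[str]]:
    """Split instances so each chunk respects both budgets independently."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for x in instances:
        t = count_tokens(x)
        # Close the chunk if adding x would break either budget.
        if current and (len(current) >= max_instances or used + t > max_tokens):
            chunks.append(current)
            current, used = [], 0
        current.append(x)
        used += t
    if current:
        chunks.append(current)
    return chunks


def chunked_count(
    instances: Sequence[str],
    count_chunk: Callable[[Sequence[str]], int],
    max_instances: int = 50,
    max_tokens: int = 4000,
) -> int:
    """Map: one small query per chunk. Reduce: sum the partial counts in code."""
    chunks = chunk_by_budgets(instances, max_instances, max_tokens)
    return sum(count_chunk(chunk) for chunk in chunks)


# Stand-in for a per-chunk model call: counts odd numbers exactly.
numbers = [str(i) for i in range(1, 121)]
total_odd = chunked_count(numbers, lambda chunk: sum(int(x) % 2 for x in chunk))
chunk_sizes = [len(c) for c in chunk_by_budgets(numbers)]
```

The final sum is done programmatically rather than by the model, so each model call only has to handle a small number of instances; a verification pass over the partial counts could be added on top.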
Rating
- Novelty: ⭐⭐⭐⭐☆ The problem definition and decoupling analysis have significant real-world value; the method is a systematic evaluation rather than a new algorithm.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 16 models, 8 tasks, 10 instance-count levels, and comprehensive analysis of failure types and context length.
- Writing Quality: ⭐⭐⭐⭐☆ Clear narrative and solid charts; some tables span multiple pages in the PDF, which makes detailed reading slightly cumbersome.
- Value: ⭐⭐⭐⭐⭐ Direct guidance for long-context applications, data analysis agents, batch policies, and reliability evaluation.