ParallelPrompt: Extracting Parallelism from Large Language Model Queries¶
- Conference: NeurIPS 2025
- arXiv: 2506.18728
- Code: Available (open-source benchmark + C++ execution framework)
- Area: LLM Inference Acceleration
- Keywords: Intra-query parallelism, prompt decomposition, LLM inference acceleration, benchmark, semantic fidelity
TL;DR¶
This work presents ParallelPrompt, the first benchmark for intra-query parallelism, comprising structured decomposition annotations for 37,000+ real user prompts. It demonstrates that approximately 10% of user queries contain exploitable parallel structure, and that parallel execution can achieve up to 5.7× latency speedup with limited quality degradation.
Background & Motivation¶
Background: LLM inference optimization has primarily focused on two levels — token-level techniques (speculative decoding, KV caching, etc.) and request-level techniques (batching, scheduling, etc.). A largely overlooked dimension is intra-query parallelism: many user prompts inherently contain multiple subtasks that can be executed independently.
Limitations of Prior Work: Existing task decomposition methods (e.g., Skeleton-of-Thought, Tree-of-Problems) target synthetic scenarios or predefined patterns, and generalize poorly to the diverse parallel structures found in real user queries. More critically, no standardized benchmark dataset exists to quantify the potential of intra-query parallelism.
Key Challenge: Users frequently compose queries containing multiple independently processable subtasks (e.g., "translate the following 10 sentences," "analyze the pros and cons of these 5 books"), yet LLM serving systems process them serially as monolithic units. This "monolithic query" assumption wastes substantial parallelization opportunities.
Goals:
- Construct the first intra-query parallelism benchmark from real user queries
- Quantify the practical gains of parallelism (latency–quality trade-off)
- Evaluate existing decomposition methods on real-world queries
Key Insight: A multi-stage pipeline filters queries with decomposable structure from 1M+ real LLM conversation logs, extracts a structured schema (template + shared context + iterated data), and quantifies gains via parallel vs. serial execution.
Core Idea: Approximately 10% of real user queries exhibit exploitable intra-query parallel structure, and simple schema-based decomposition can yield significant latency improvements.
Method¶
Overall Architecture¶
ParallelPrompt consists of three components: (1) Data filtering pipeline: multi-stage classification and schema extraction using Claude 3.5 on LMSYS-Chat-1M and WildChat-1M, followed by rule-based validation with multilingual support; (2) Structured annotation: each query is annotated with a 5-field schema (task template, shared context, data list, generation count, etc.); (3) Execution evaluation suite: a high-performance C++ backend comparing serial and parallel execution across latency, structural consistency, and semantic fidelity.
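To make the 5-field annotation concrete, here is a minimal sketch of what one benchmark record might look like; the field names are illustrative and may not match the dataset's exact keys:

```python
# Illustrative example of one annotated query (hypothetical field names).
parallel_prompt_record = {
    "task_template": "Translate the following sentence into French: {item}",
    "shared_context": "",          # context shared by every subtask, if any
    "data": [                      # the iterated items; one independent subtask each
        "The weather is nice today.",
        "Where is the nearest train station?",
    ],
    "n_generations": None,         # used instead of `data` for "write N stories"-style queries
    "category": "translation",     # one of the 7 canonical classes or an emerging category
}
```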
Key Designs¶
- Multi-stage Filtering Pipeline:
- Function: Filters 37,000+ parallelizable queries from 2M+ raw queries
- Mechanism: Claude 3.5 classification → schema extraction → three-tier validation; of the retained queries, 62% are high-confidence (explicit structural markers) and 38% medium-confidence (semantic cues only), while 38.6% of candidates are rejected
- Design Motivation: Precision is prioritized over recall — it is preferable to miss some parallelizable queries than to incorrectly annotate queries that cannot be safely parallelized
- Finding: 10.3% of user queries exhibit decomposable structure, a surprisingly high proportion
- Taxonomy:
- Function: Categorizes parallel patterns into 7 canonical classes and 400+ emerging categories
- 7 main classes: repeated generation (25%), reading comprehension (30%), named entity recognition (30%), keyword extraction (7%), translation (9%), grammatical correction (6%), sentiment analysis (3%)
- 400+ emerging categories organized under 6 meta-categories: comparative, generative, analytical, transformative, organizational, and computational
- Multilingual coverage: 11+ languages; Chinese 5.6%, Russian 6.3%; non-Western languages show lower validation success rates (28–34% vs. 55–63%)
- Execution and Evaluation Framework (a control-flow sketch follows this list):
- Function: Precise latency measurement via a C++ backend, replacing task-specific decomposition logic with schema-based decomposition
- Evaluation dimensions: latency (raw speedup and normalized speedup), semantic fidelity (GPT-4o judge), and structural consistency
- Normalized speedup design rationale: LLMs tend to impose a "length budget" when processing multi-item queries, causing truncation (e.g., only 50 words per story when 10 stories are requested); parallel execution grants each subtask full context, necessitating adjustment for token-length differences
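A minimal control-flow sketch of schema-based decomposition and parallel execution. The paper's actual framework is a C++ backend; this Python version, with a stubbed `complete()` call standing in for the model API, only illustrates the mechanics:

```python
import asyncio
import time

async def complete(prompt: str) -> str:
    """Stub for an LLM call; replace with a real API client (hypothetical, not the paper's backend)."""
    await asyncio.sleep(0)  # placeholder for network + generation latency
    return f"[completion for: {prompt[:40]}]"

def expand(schema: dict) -> list[str]:
    """Turn (template + shared context + data list) into independent subtask prompts."""
    ctx = schema.get("shared_context") or ""
    items = schema.get("data") or [""] * (schema.get("n_generations") or 1)
    return [f"{ctx}\n{schema['task_template'].format(item=it)}".strip() for it in items]

async def run_parallel(schema: dict) -> tuple[list[str], float]:
    """Issue all subtasks concurrently and measure wall-clock latency."""
    prompts = expand(schema)
    start = time.perf_counter()
    outputs = await asyncio.gather(*(complete(p) for p in prompts))
    return list(outputs), time.perf_counter() - start

async def run_serial(original_query: str) -> tuple[str, float]:
    """Baseline: execute the untouched monolithic query."""
    start = time.perf_counter()
    out = await complete(original_query)
    return out, time.perf_counter() - start
```

With this setup, raw speedup is simply the ratio of the two measured latencies; the E2E variant in the results table below additionally includes the schema-extraction call before `run_parallel`.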
Key Experimental Results¶
Main Results (schema extraction: Claude 3.5; execution: GPT-4-1106-preview)¶
| Task Category | Parallel Avg. Time (s) | Serial Avg. Time (s) | Normalized Speedup | Raw Speedup |
|---|---|---|---|---|
| Keyword Extraction | 2.38 | 3.23 | 2.54× | 1.36× |
| Reading Comprehension | 3.49 | 10.27 | 5.72× | 2.94× |
| Repeated Generation | 3.79 | 9.51 | 4.39× | 2.50× |
| Repeated Generation (E2E w/ extraction) | 4.88 | 9.51 | 3.41× | 1.70× |
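These notes do not reproduce the paper's exact normalization, but a plausible per-token reading of the two speedup columns (an assumption, not the authors' stated formula) is:

\[
\text{raw speedup} = \frac{T_{\text{serial}}}{T_{\text{parallel}}}, \qquad
\text{normalized speedup} \approx \frac{T_{\text{serial}} / N^{\text{serial}}_{\text{tok}}}{T_{\text{parallel}} / N^{\text{parallel}}_{\text{tok}}},
\]

where \(T\) is wall-clock latency and \(N_{\text{tok}}\) the number of generated tokens. Because the serial baseline tends to truncate per-item outputs (fewer tokens) while parallel execution grants each subtask its full length budget, the per-token (normalized) speedup exceeds the raw speedup, as seen in the table.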
Quality Retention Analysis¶
| Task Category | Quality Retention | Impact Level |
|---|---|---|
| Reading Comprehension | 92% | Low (independent Q&A preserves semantic accuracy) |
| Keyword/Entity Extraction | 82–85% | Moderate (losses in consistency and redundancy handling) |
| Creative Generation | 76% | High (narrative coherence and stylistic consistency degraded) |
Key Findings¶
- Serial execution time grows roughly linearly with the number of outputs \(n\), while parallel execution time remains approximately constant, so speedup increases with larger \(n\) (a simple cost model is sketched after this list)
- SoT and ToP fail extensively on real queries: SoT's two-stage outline-expand assumption does not apply to extraction tasks; ToP's recursive decomposition introduces unnecessary overhead for flat, independent tasks
- End-to-end speedup including schema extraction overhead remains 3.41×, indicating that extraction costs are offset by parallel gains
- Schema extraction quality is notably lower for non-Western languages, with validation success rates of only 28–34% for East Asian languages
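A back-of-the-envelope model behind the first finding, under simplifying assumptions of my own (per-item generation time \(t\) roughly constant, fixed per-request overhead \(c\), one-off schema-extraction cost \(t_{\text{extract}}\), and enough concurrency for subtasks to fully overlap):

\[
T_{\text{serial}}(n) \approx c + n\,t, \qquad
T_{\text{parallel}}(n) \approx c + t + t_{\text{extract}},
\]

so the speedup \(T_{\text{serial}}(n)/T_{\text{parallel}}(n)\) grows roughly linearly in \(n\), consistent with the observation that larger output counts yield larger gains.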
Highlights & Insights¶
- The empirical finding that "~10% of queries are parallelizable" is itself highly significant: for production systems processing billions of queries daily, this represents hundreds of millions of optimization opportunities, fully complementary to existing techniques such as speculative decoding and KV caching
- Schema-based decomposition generalizes far better than task-specific approaches: the three-field representation (template + context + data list) handles diverse parallel patterns, whereas SoT/ToP are limited to specific task types
- The quality–speedup task-dependent trade-off plot (Figure 7) provides a clear decision guide: structured extraction tasks are the sweet spot for parallelism, while creative generation warrants caution
Limitations & Future Work¶
- Schema extraction relies on Claude 3.5 (closed-source) and performs poorly on non-Western languages — open-source alternatives and multilingual optimization warrant exploration
- Only "fully decomposable" queries are evaluated; many real queries exhibit partial dependencies (e.g., "first summarize, then compare"), requiring support for DAG-style semi-parallel execution
- Execution uses the GPT-4 API and is subject to rate limits; the cost-efficiency of parallel gains in local deployment settings (e.g., vLLM) has not been analyzed
- The benchmark skews toward information extraction tasks (NER 30% + reading comprehension 30%), with insufficient representation of creative generation
- The reassembly and ordering of results after parallel execution is not discussed — for certain tasks (e.g., ordered lists), result merging may introduce additional complexity
Related Work & Insights¶
- vs. Skeleton-of-Thought: SoT's two-stage outline-expand design is applicable only to tasks amenable to segmented expansion, and fails entirely on extraction and analysis queries. ParallelPrompt's schema-based approach is substantially more general
- vs. vLLM/TensorRT-LLM: These systems perform inter-request batching, which is orthogonal to and complementary with ParallelPrompt's intra-query decomposition
- Implications for LLM serving: Future serving systems may benefit from a "parallelism detection" stage upon query receipt to automatically determine whether decomposed execution is appropriate
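A purely speculative sketch of what such a detection stage could look like in a serving loop, reusing the `run_parallel`/`run_serial` helpers from the Method sketch above; `extract_schema` is a hypothetical classifier or lightweight LLM pass, not an API from the paper:

```python
async def serve(query: str) -> str:
    """Speculative gate: decompose only when a confident parallel schema is detected."""
    schema = await extract_schema(query)  # hypothetical detection pass
    if schema and schema.get("confidence") == "high" and len(schema.get("data") or []) > 1:
        outputs, _ = await run_parallel(schema)  # decomposed, concurrent execution
        return "\n\n".join(outputs)              # naive reassembly; ordering is task-dependent
    out, _ = await run_serial(query)             # fall back to monolithic execution
    return out
```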
Rating¶
- Novelty: ⭐⭐⭐⭐ Intra-query parallelism is a novel and important research direction; this is the first standardized benchmark
- Experimental Thoroughness: ⭐⭐⭐⭐ 37,000+ queries, multiple task categories, and detailed quality analysis, though model coverage is limited
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rich visualizations, though certain sections are repetitive
- Value: ⭐⭐⭐⭐⭐ Direct engineering value for LLM serving; the benchmark dataset will advance research in this direction