Hogwild! Inference: Parallel LLM Generation via Concurrent Attention¶
Conference: NeurIPS 2025 arXiv: 2504.06261 Code: https://github.com/eqimp/hogwild_llm Area: LLM Agent Keywords: parallel inference, shared KV cache, collaborative inference, RoPE positional encoding, LLM acceleration
TL;DR¶
This paper proposes Hogwild! Inference—a parallel LLM inference protocol that requires no predefined collaboration framework. Multiple LLM instances synchronize in real time through a shared concurrent KV cache, leveraging RoPE positional encoding to avoid recomputation, achieving higher accuracy with fewer serial steps on mathematical reasoning and programming tasks.
Background & Motivation¶
Modern LLMs require substantial inference-time computation (token-by-token generation) for complex tasks such as reasoning, long-form generation, and tool use. A common human strategy for solving complex problems is collaboration: decomposing subtasks, exploring different strategies in parallel, and communicating adjustments in real time.
Existing parallel inference methods each have limitations:
Self-Consistency (vote aggregation): generates multiple independent solutions and aggregates by voting; inefficient when some threads are slower than others.
Skeleton-of-Thought (subtask parallelism): plans first, then executes subtasks in parallel; requires the problem to be immediately decomposable, and initial planning errors cannot be corrected.
PASTA and similar asynchronous subtask methods: require specialized fine-tuning and stall when individual subtasks are excessively long.
Core insight: no single collaboration strategy suits all tasks. Rather than predefining a collaboration framework, it is preferable to let LLMs decide how to collaborate themselves. This is inspired by human collaboration—dynamic replanning, abandoning poor strategies midway, and discussing strategy adjustments.
Method¶
Overall Architecture¶
Hogwild! Inference (inspired by the asynchronous-update idea of Hogwild! SGD):

- Multiple LLM instances ("workers") generate in parallel using identical weights
- A concurrently updated KV cache is shared among all workers
- Each worker immediately observes the tokens generated by the other workers
- Workers are guided via prompts to decide their collaboration strategy autonomously
- No additional fine-tuning is required; existing reasoning models work out of the box
Key Designs¶
1. Shared KV Cache and Concurrent Attention (Section 3.1)¶
Using 2 workers (Alice and Bob) as an example:

- The cache is divided into multiple blocks: a common prompt block plus one generation block per worker
- Alice sees: common prompt → Bob's tokens → Alice's own tokens
- Bob sees: common prompt → Alice's tokens → Bob's own tokens
Challenge: the same KV pair appears at different positions from different workers' perspectives, and relative positions shift as generation proceeds.
Leveraging RoPE to avoid recomputation: most modern LLMs use Rotary Position Embedding (RoPE), where keys and queries are rotated by angles proportional to their absolute positions, so attention scores depend only on relative positions. Rather than re-encoding cached tokens whenever their positions shift, Hogwild! Inference rotates only the query, once per cache block:

\[
\rho(q, i_q)\left[\rho(A, i_k^A) \oplus \rho(B, i_k^B) \oplus \rho(C, i_k^C)\right] = \rho(q, i_q - i_k^A)\,A \oplus \rho(q, i_q - i_k^B)\,B \oplus \rho(q, i_q - i_k^C)\,C
\]

where \(\rho(x, i)\) rotates \(x\) to position \(i\), \(A, B, C\) are cache blocks starting at positions \(i_k^A, i_k^B, i_k^C\), and \(\oplus\) denotes block concatenation.
Only the query of the current step (a single token) needs to be rotated, rather than cache blocks containing thousands of tokens.
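This identity is easy to verify numerically. Below is a minimal NumPy sketch (the `rope` helper and all names here are illustrative, not the paper's code): the attention logit between a query and key rotated to their absolute positions equals the logit between a query rotated by the relative offset and the unrotated key, so cached blocks can stay stored at local positions while only the single current query is re-rotated per block.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate an even-dimensional vector x to absolute position `pos` (RoPE)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D plane
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)
i_q, i_k = 905, 112        # query position vs. cached-key position

# Standard RoPE: both query and key rotated to their absolute positions.
logit_abs = rope(q, i_q) @ rope(k, i_k)
# Hogwild! trick: the key stays unrotated in its block (local position 0);
# only the query is rotated, by the relative offset i_q - i_k.
logit_rel = rope(q, i_q - i_k) @ k
assert np.allclose(logit_abs, logit_rel)
```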
2. Cache Layout Design (Section 3.2)¶
Combined Layout (primary method)—combines chatroom-style step sharing with real-time token visibility:
- Common Cache: stores the system prompt, task description, and the history of all workers' past reasoning steps
- Other Workers: KV of other workers' current incomplete steps (ordered by worker)
- Current Worker: the current worker's own incomplete step
Each block begins with \n\n + a worker ID header. When a worker completes a reasoning step, its KV is rotated into the shared history.
This design addresses the "distance problem": naive concatenation places the latest outputs of different workers far apart, degrading attention effectiveness.
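A rough sketch of how the Combined layout could be assembled per worker; the `WorkerCache` structure, function names, and token-level representation are assumptions for illustration (the real system manipulates KV-cache blocks, not strings):

```python
from dataclasses import dataclass, field

@dataclass
class WorkerCache:
    header: str                                   # "\n\n" + worker ID header
    current_step: list[str] = field(default_factory=list)  # in-progress step

def attention_view(me: str, workers: dict[str, WorkerCache],
                   common: list[str]) -> list[str]:
    """Order of cache blocks worker `me` attends over at its next step:
    shared history first, other workers' unfinished steps next, own step
    last, so every worker's freshest tokens sit closest to the query."""
    view = list(common)                           # prompt + finished steps
    for name, w in workers.items():
        if name != me:
            view += [w.header] + w.current_step   # others' unfinished steps
    me_cache = workers[me]
    return view + [me_cache.header] + me_cache.current_step

def finish_step(me: str, workers: dict[str, WorkerCache],
                common: list[str]) -> None:
    """On step completion, move the worker's block into the shared history.
    In the real system this is a cheap position reassignment of existing
    KV entries (via the RoPE query-rotation trick), not a recomputation."""
    w = workers[me]
    common += [w.header] + w.current_step
    w.current_step = []
```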
3. Zero-Shot Collaboration Prompting (Section 3.3)¶
A two-part prompting strategy:

1. System prompt: describes the "rules" of the shared cache and encourages workers to collaborate
2. s1-like intervention prompt: every 1024 generated tokens, a randomly selected worker has "Wait, am I doing redundant work? (yes/no):" injected into its stream
The latter is critical—reasoning models may be overly "focused" on their current generation and neglect the progress of other instances. When directly queried, they can typically detect redundancy and adjust their strategy.
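A hedged sketch of that intervention loop; the prompt string follows the paper, but `workers`, `generate_one_token`, and the bookkeeping are illustrative stand-ins for the real decoding loop:

```python
import random

INTERVENTION = "Wait, am I doing redundant work? (yes/no):"
INTERVAL = 1024                    # generated tokens between interventions

def decode_loop(workers: list[list[str]], generate_one_token, max_tokens: int):
    """Each inner iteration is one parallel step: every worker appends one
    token to its own stream (all streams share the concurrent KV cache)."""
    total = since_last = 0
    while total < max_tokens:
        for stream in workers:
            stream.append(generate_one_token(stream))
        total += len(workers)
        since_last += len(workers)
        if since_last >= INTERVAL:
            # Force one randomly chosen worker to re-read the shared cache
            # and explicitly check whether its current work is redundant.
            random.choice(workers).append(INTERVENTION)
            since_last = 0
```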
Loss & Training¶
- No training or fine-tuning required—entirely a zero-shot inference protocol
- Works out of the box with existing reasoning models (QwQ-32B, DeepSeek-R1, Qwen3, Phi-4 Reasoning Plus)
- Implementation is based on Flash-Decoding, with custom GPU kernels handling attention computation across multiple cache blocks
- KV entries within each cache block are stored at local positions 0, 1, ..., len(block) − 1; their true global positions are handled via query rotation (see the sketch below)
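The per-block attention itself can be merged with a running log-sum-exp, as in Flash-Decoding. The NumPy sketch below is an assumed, simplified single-head version of that math (the paper's implementation uses custom GPU kernels); note the query must be passed pre-rotated for each block's local positions:

```python
import numpy as np

def concurrent_attention(queries: list[np.ndarray],
                         blocks: list[tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Attention of one decoding step over several cache blocks.
    queries[b]: the current query rotated for block b, shape (d,);
    blocks[b] = (K, V), each of shape (n_b, d), stored at local positions.
    Blocks are processed one at a time and merged with an online softmax,
    so no concatenated logit vector is ever materialized."""
    d = queries[0].shape[-1]
    m = -np.inf                      # running max logit (for stability)
    denom = 0.0                      # running softmax denominator
    out = np.zeros(d)                # running unnormalized output
    for q, (K, V) in zip(queries, blocks):
        logits = K @ q / np.sqrt(d)              # (n_b,)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)                # rescale old accumulators
        p = np.exp(logits - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V
        m = m_new
    return out / denom
```

Merging block by block yields exactly the same result as softmax attention over the concatenated blocks, which is why the shared cache can remain physically fragmented across workers.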
Key Experimental Results¶
Main Results¶
LIMO dataset (817 mathematical reasoning problems)—QwQ-32B:
Hogwild! Inference (2 workers) consistently outperforms all baselines under the same serial step budget:

- The advantage is most pronounced at low-to-medium token budgets (faster convergence to correct answers)
- Accuracy improves further as worker count increases (2 → 3 → 4)
- Self-Consistency improves results but underperforms Hogwild!
- Skeleton-of-Thought has limited effectiveness on problems without an obvious subtask decomposition
Cross-model generalization:

- QwQ-32B: significant improvement ✓
- Phi-4-Reasoning-Plus (14B): improvement ✓
- Qwen3-8B: improvement ✓
- Qwen3-4B: partial improvement
- Qwen3-1.7B: smaller models struggle to adapt to the shared-cache setup
GSM8k×5 synthetic benchmark (5 independent problems packed together):

- Both Hogwild! and SoT effectively accelerate this type of decomposable task
- Validates that the KV-cache operations do not disrupt the model's reasoning capability
Ablation Study¶
OlympiadBench mathematics + physics:

- Mathematics: QwQ-32B, Qwen3-14B, and Qwen3-8B all show improvement
- Physics: QwQ-32B and Qwen3-8B improve, but Qwen3-14B exhibits overthinking beyond ~4096 tokens
LiveCodeBench code generation (279 problems):

- QwQ-32B: notable improvement
- Phi-4-Reasoning-Plus: improvement
- Qwen3-8B: improvement
AIME'25 (large models):

- Qwen3-235B-A22B: improvement
- DeepSeek-R1: improvement
Cache layout ablation: Combined (token-wise synchronization + step history) > Interleaved (step-level synchronization only) > Contiguous (token-level synchronization only)
Key Findings¶
Collaboration capability quantification (GPT-4o scoring, 1–6 scale):
| Setting | QwQ-32B | Phi-4-R+ | Qwen3-8B |
|---|---|---|---|
| No synchronization (independent generation) | ~1.2 | ~1.1 | ~1.1 |
| Step-level synchronization | ~2.5 | ~2.2 | ~2.0 |
| Token-level synchronization (full Hogwild!) | ~3.5 | ~3.0 | ~2.5 |
Token-level real-time synchronization significantly outperforms step-level synchronization alone.
Inference throughput (QwQ-32B-AWQ, L40S GPU):
| Workers | 1024 tokens | 8192 tokens | 16384 tokens |
|---|---|---|---|
| 1 (baseline) | 20.1 tok/s | 19.3 tok/s | 18.3 tok/s |
| 2 workers | 36.3 tok/s | 36.1 tok/s | 34.3 tok/s |
| 4 workers | 68.9 tok/s | 66.3 tok/s | 60.3 tok/s |
2 workers achieve nearly a 2× speedup; 4 workers achieve roughly a 3.3–3.4× speedup (e.g., 68.9 / 20.1 ≈ 3.4), both with minimal overhead.
Highlights & Insights¶
- Paradigm innovation: rather than predefining a collaboration framework, the model lets LLMs decide how to collaborate—a fundamental rethinking of parallel inference strategies
- Zero-shot collaboration capability: existing reasoning models (QwQ, DeepSeek-R1) can "reason to coordinate" without fine-tuning—formulating and following plans, correcting errors, and leveraging each other's key findings
- Elegant engineering: by exploiting the rotational property of RoPE, only the current query is rotated rather than all cache blocks, turning an \(O(n)\)-per-step re-rotation of the entire cache into an \(O(1)\)-per-step query rotation
- The power of immediate visibility: token-level synchronization (where tokens are visible to other workers as they are being generated) substantially outperforms step-level synchronization
- s1-like intervention prompting is simple yet effective—reasoning models tend to "over-focus," and periodically asking "Are you doing redundant work?" triggers strategy adjustment
Limitations & Future Work¶
- Poor robustness in small models: Qwen3-1.7B struggles to adapt to the shared cache setting, suggesting a lower bound on model scale
- Degraded robustness in long contexts: collaboration effectiveness may diminish as context length grows
- Dependence on GPT-4o for automatic evaluation: collaboration quality scoring relies on a proprietary model, which may affect reproducibility
- Fine-tuning unexplored: further gains in collaboration capability may be achievable through RL or specialized data fine-tuning
- Cache growth management: the shared history cache may become excessively large during extended inference, necessitating selective forgetting mechanisms
- Human-in-the-loop potential: KV cache rearrangement could support asynchronous human feedback, but this direction remains underexplored
Related Work & Insights¶
- Hogwild! SGD: the namesake inspiration; the idea of asynchronous updates is transferred from training to inference
- Self-Consistency / Skeleton-of-Thought: parallel inference methods used for direct comparison
- Paged Attention (vLLM): similar segmented KV cache management, but Hogwild! enables cross-worker attention
- s1 (Muennighoff et al., 2025): the inspiration for the budget-forcing-style intervention prompt
- Insights: (1) the emergent collaboration capabilities of LLMs warrant deeper investigation; (2) parallelizing inference-time computation is an important direction for LLM acceleration; (3) the shared memory model is extensible to scenarios such as code collaboration and tool use
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (a new paradigm for parallel inference; shared KV cache + zero-shot collaboration is highly creative)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (multiple models × multiple benchmarks × ablations × collaboration scoring × throughput benchmarks; exceptionally comprehensive)
- Writing Quality: ⭐⭐⭐⭐ (main body is clear, though technical details are dense)
- Value: ⭐⭐⭐⭐⭐ (opens a new direction for inference-time parallelization; highly practical; open-source code)