ASAP: An Agentic Solution to Auto-Optimize Performance of Large-Scale LLM Training

Conference: NeurIPS 2025 · arXiv: 2511.03844 · Code: Not released (Google internal ADK) · Area: ML Systems · Keywords: Multi-agent systems, distributed training optimization, sharding configuration, bottleneck analysis, Roofline model

TL;DR

ASAP is a multi-agent system (Coordinator + Analyzer + Proposal) that automatically diagnoses bottleneck types (compute/memory/communication) in large-scale LLM distributed training and proposes sharding configurations. Across 3 experimental scenarios, it matches human expert solutions and achieves up to 2.58× throughput improvement.

Background & Motivation

Background: Large-scale LLM training requires parallelization across hundreds or thousands of TPUs/GPUs (data parallelism, model parallelism, sequence parallelism, pipeline parallelism, etc.), where sharding configurations have a substantial impact on performance — suboptimal configurations can cause 50%+ throughput degradation.

Limitations of Prior Work: (a) Manual trial-and-error demands deep expertise in distributed systems; (b) black-box optimization methods (e.g., Bayesian optimization) converge slowly in large configuration spaces and provide no interpretable rationale; (c) different bottleneck types (compute-bound/memory-bound/communication-bound) require different optimization strategies, making general-purpose methods inefficient.

Key Challenge: Diagnosis requires understanding performance profiling reports (Xprof profiles), and decision-making requires knowledge of parallelization principles and hardware topology — expertise that has traditionally been confined to human specialists.

Goal: Replace human experts with LLM agents for the full pipeline of performance diagnosis → proposal generation → configuration tuning.

Key Insight: Combining LLM reasoning capabilities with specialized tools (Xprof profiling, Roofline analysis) and knowledge bases (historical optimization records, JAX parallelization documentation) into a multi-agent system.

Core Idea: Coordinator orchestrates → Analyzer diagnoses bottleneck type → Proposal (RAG + knowledge base) generates multiple candidate solutions = LLM agents serving as distributed training experts.

Method

Overall Architecture

Three agents collaborate: Coordinator (accepts experiment ID, orchestrates workflow) → Analyzer (invokes Xprof to analyze KPIs + operation-level performance + Roofline, diagnosing compute/memory/communication bottlenecks) → Proposal (retrieves historical configurations and knowledge-base documents via RAG, generates 3 candidate solutions with rationale and expected impact).
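
The paper's implementation sits on Google's internal Agent Development Kit (ADK) and is not public. As a rough, hypothetical sketch of the control flow only (every name below is invented, and the agent bodies are stubbed out), the three-agent pipeline can be pictured as:

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    bottleneck: str   # "compute" | "hbm" | "communication"
    report: str       # natural-language diagnostic report

@dataclass
class Proposal:
    ici_mesh: dict    # e.g. {"data": 8, "model": 2, "sequence": 16}
    rationale: str
    confidence: float

def analyzer_agent(profile: dict) -> Diagnosis:
    # Stub: the real agent reads Xprof KPIs, runs op-level and Roofline
    # analysis, and writes a diagnostic report.
    return Diagnosis("communication", "collectives dominate step time")

def proposal_agent(diag: Diagnosis, n_candidates: int = 3) -> list[Proposal]:
    # Stub: the real agent retrieves historical configs and JAX docs via RAG,
    # then asks the LLM for n_candidates ranked sharding configurations.
    cand = Proposal({"data": 8, "model": 2, "sequence": 16},
                    "reduce model parallelism to cut all-reduce volume", 0.88)
    return [cand] * n_candidates

def coordinator(experiment_id: str) -> list[Proposal]:
    """Accepts an experiment ID, fetches its profile, and chains the agents."""
    profile = {"experiment_id": experiment_id}  # placeholder for an Xprof profile
    return proposal_agent(analyzer_agent(profile))
```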

Key Designs

  1. Analyzer Agent (Performance Diagnosis):

    • Reads KPIs: accelerator busy rate, training step time, goodput
    • Operation-level analysis: FLOPs utilization and HBM bandwidth utilization for the top-5 most time-consuming operations
    • Roofline analysis: classifies each operation as compute-bound, HBM-bound, or communication-bound (a minimal classification sketch follows this list)
    • Correlates KPI anomalies with operation-level bottlenecks → generates a diagnostic report
  2. Proposal Agent (RAG-based Solution Generation):

    • Historical database: past configurations, Xprof profiles, impact summaries
    • Knowledge base: JAX parallelization documentation
    • Generates 3 candidate sharding configurations + rationale + expected trade-offs
    • Each candidate carries a confidence score (85–90% in the reported runs); a sketch of the retrieval step follows this list
  3. Sharding Configuration Space:

    • ici_mesh: four parallelism dimensions, {model, data, replica, sequence} (a JAX mesh sketch follows this list)
    • Hardware: TPU v5p-512 (8×8×8), TPU v6e-16 (4×4), TPU v5e-256 (16×16)
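
To make the Roofline step in item 1 concrete, here is a minimal classification sketch. The peak numbers are illustrative assumptions in the rough range of a TPU v5p chip, not values from the paper, and the function and its input format are hypothetical:

```python
COLLECTIVES = ("all-reduce", "all-gather", "reduce-scatter",
               "collective-permute", "all-to-all")

def classify_op(name: str, flops: float, hbm_bytes: float,
                peak_flops: float = 459e12,    # assumed bf16 peak, FLOP/s
                peak_hbm_bw: float = 2.77e12,  # assumed HBM bandwidth, bytes/s
                ) -> str:
    """Roofline-style classification of one profiled operation."""
    if any(c in name for c in COLLECTIVES):
        return "communication-bound"
    intensity = flops / max(hbm_bytes, 1.0)  # arithmetic intensity, FLOPs/byte
    ridge = peak_flops / peak_hbm_bw         # machine balance point
    return "compute-bound" if intensity >= ridge else "HBM-bound"

# A low-intensity fusion (25 FLOPs/byte, below the ~166 FLOPs/byte ridge):
print(classify_op("fusion.123", flops=1e9, hbm_bytes=4e7))  # -> HBM-bound
```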
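
The retrieval step in item 2 can be sketched the same way. The paper does not disclose an embedding model or similarity metric, so the cosine-similarity search below is a generic stand-in:

```python
import numpy as np

def retrieve_similar(diag_vec: np.ndarray,
                     history: list[tuple[np.ndarray, dict]],
                     k: int = 3) -> list[dict]:
    """Return the k historical records (config, profile summary, impact)
    whose diagnosis embeddings are closest to the current diagnosis."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    ranked = sorted(history, key=lambda rec: cosine(diag_vec, rec[0]), reverse=True)
    return [record for _, record in ranked[:k]]
```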
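
The configuration space in item 3 maps naturally onto a named JAX device mesh. The snippet below assumes a 256-chip slice arranged to match Scenario 3's proposed config (data=8, replica=1, model=2, sequence=16); the paper's exact config format is not shown, so this is only a generic JAX equivalent:

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a 4-D ici_mesh-style device mesh (requires a 256-device slice).
devices = mesh_utils.create_device_mesh((8, 1, 2, 16))
mesh = Mesh(devices, axis_names=("data", "replica", "model", "sequence"))

# Example: shard activations of shape [batch, seq, hidden] so that batch is
# split over "data", sequence over "sequence", and hidden over "model".
activation_sharding = NamedSharding(mesh, PartitionSpec("data", "sequence", "model"))
```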

Loss & Training

No training involved — purely LLM inference with tool invocation.

Key Experimental Results

Main Results (3 Scenarios)

Scenario  Hardware     Bottleneck Type      Agent Solution           Effect            Matches Expert?
1         TPU v5p-512  Compute-bound        model=8, data=16, seq=4  −28% step time    ✓ Exact match
2         TPU v6e-16   HBM-bound            data=4, model=4          1.43× throughput  ✓ Exact match
3         TPU v5e-256  Communication-bound  data=8, model=2, seq=16  2.58× throughput  ✓ Exact match

Ablation Study

Configuration                             Key Finding                                    Remark
Scenario 1 in historical database         Agent retrieves and improves upon it           Retrieval augmentation
Scenarios 2/3 not in historical database  Agent still diagnoses and proposes correctly   Principled reasoning & generalization
Contribution of Roofline analysis         Differentiates compute/memory/communication    Critical for diagnostic accuracy
3 proposals vs. 1 proposal                Best solution most often ranked first          Diversity provides fallback options
Agent confidence                          85–90%                                         Accurately reflects uncertainty

Key Findings

  • All 3 scenarios exactly match human expert solutions — demonstrating that LLM reasoning combined with tools can substitute for distributed systems expertise.
  • Scenarios 2 and 3 are absent from the historical database — indicating that the agent reasons rather than memorizes, and generalizes to novel scenarios.
  • The largest gain of 2.58× throughput stems from correctly identifying the communication bottleneck and reducing model parallelism degree.

Highlights & Insights

  • LLM Reasoning + Specialized Tools = Domain Expert: ASAP demonstrates that LLMs do not merely retrieve historical configurations but reason from parallelization principles, enabling them to handle novel situations not present in the historical database.
  • Interpretable Decision-Making: Each proposal includes detailed rationale and performance references, offering greater engineering value than black-box optimization methods.
  • Full Coverage of Three Bottleneck Types: Different scenarios demand fundamentally different strategies (compute-bound → increase parallelism degree; communication-bound → reduce parallelism degree), and the agent correctly distinguishes among all three.

Limitations & Future Work

  • Only global sharding is supported; per-layer sharding is not considered.
  • No closed-loop feedback — solutions are proposed but not automatically verified and iterated.
  • Compiler-level optimizations (operator fusion, memory layout selection) are absent.
  • The system depends on accurate Xprof data; robustness to noisy profiling data is untested.
  • Holistic optimization of the computation–communication trade-off remains limited.

Comparison with Related Work

  • vs. Alpa (Zheng et al., 2022): Alpa employs integer linear programming to search for optimal parallelization strategies; ASAP uses LLM reasoning with a knowledge base, offering greater flexibility and interpretability.
  • vs. FlexFlow: An automatic parallelization framework with a limited search space; ASAP handles a substantially larger configuration space.
  • vs. SOLAR (Google, 2024): A system recommendation framework that does not perform bottleneck diagnosis; ASAP covers the full pipeline from diagnosis to recommendation.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of multi-agent systems with performance profiling tools is well-motivated and effective.
  • Experimental Thoroughness: ⭐⭐⭐ Only 3 scenarios are evaluated; all match expert solutions, but the sample size is small.
  • Writing Quality: ⭐⭐⭐⭐ The diagnostic process is described in detail with a clear reasoning chain.
  • Value: ⭐⭐⭐⭐ The approach has practical engineering value for automated tuning of large-scale LLM training.

Broader Implications

  • Implications for MLOps: ASAP's multi-agent diagnose-and-propose paradigm is generalizable to hyperparameter search, data pipeline optimization, and similar tasks.
  • Intersection with the MLSys Community: Introducing LLM agents into systems optimization represents a promising new research direction.
  • Commercial Value: Configuration optimization for large-scale training clusters directly impacts computational costs on the order of millions of dollars.
  • Cross-Hardware Generalization: Success across all 3 TPU topologies indicates that the reasoning capability is not tied to specific hardware.
  • Relevance to the Open-Source Community: The diagnostic methodology (Roofline + operation-level analysis) is transferable to PyTorch/DeepSpeed environments.
  • Comparison with Auto-Tuning: Traditional auto-tuning methods do not explain why a configuration works; ASAP provides an interpretable decision chain.
  • Extensible Architecture: The three-agent design can be extended with an Evaluator Agent to enable closed-loop iterative optimization.
  • Value of Agent Confidence Scores: Confidence estimates of 85–90% help users determine whether additional validation is warranted.
  • Maintainability of the Knowledge Base: The RAG-based design allows the knowledge base to be updated by appending new records, without retraining the agents.