# S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning

Conference: AAAI 2026 · arXiv: 2511.06727 · Code: https://github.com/WanyuGroup/AAAI2026_S-DAG · Area: Graph Learning · Keywords: Subject-level analysis, directed acyclic graph, GNN reasoning, expert model composition, heterogeneous reasoning
## TL;DR
This paper proposes S-DAG, which uses a GNN to identify relevant subjects and their dependencies from a given question, constructing a directed acyclic graph. Subject nodes are matched to the most capable expert LLMs (14 domain-specific models of 7–13B parameters), and collaborative reasoning proceeds in DAG topological order (supporting subjects → dominant subject). The resulting small-model pool surpasses GPT-4o-mini in average accuracy (59.73 vs. 58.52) and approaches the performance of a 72B model.
## Background & Motivation
Background: Existing MoE/routing methods (e.g., MoE Router, GraphRouter) select models at the task level—choosing one model or Top-k models for an entire question. However, many questions span multiple subjects (e.g., a problem involving physics, mathematics, and chemistry simultaneously), making task-level routing too coarse-grained.
Limitations of Prior Work: Multi-Agent Debate allows multiple models to deliberate but does not differentiate their respective expertise; Symbolic-MoE selects Top-k models by skill but ignores inter-subject dependencies (e.g., solving a physics problem may require mathematical derivation first).
Key Challenge: Three issues must be addressed simultaneously: (a) identifying which subjects a question involves; (b) determining the information flow among subjects (which supports which); (c) matching each subject to the most capable model.
Key Insight: Model multi-subject problems as a DAG—subjects as nodes, dependencies as directed edges (supporting → dominant)—and assign the best expert model to each node.
Core Idea: GNN-based subject-level DAG construction + model capability profiling and matching + DAG topological-order multi-agent collaborative reasoning.
## Method

### Overall Architecture
Two stages: (1) a GNN constructs the S-DAG (nodes = subjects, edges = dependencies); (2) expert models are assigned and collaborate in DAG topological order.
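Stage 1 can be illustrated with a toy sketch. Assuming per-subject relevance weights are available (in the paper these come from the GNN's predictions, or from qwen-turbo's scores for the ground-truth graphs), a hypothetical `build_s_dag` helper keeps relevant subjects and orients edges from supporting (lower-weight) to dominant (higher-weight) subjects; the function name and threshold are illustrative assumptions, not the paper's implementation:

```python
# Toy sketch (hypothetical names): select relevant subjects by weight and
# orient S-DAG edges from lower-weight (supporting) subjects toward
# higher-weight (dominant) ones, mirroring the paper's ground-truth rule.
def build_s_dag(subject_weights, threshold=0.2):
    """Keep subjects with weight >= threshold; edges point supporting -> dominant."""
    nodes = [s for s, w in subject_weights.items() if w >= threshold]
    edges = [
        (a, b)
        for a in nodes
        for b in nodes
        if a != b and subject_weights[a] < subject_weights[b]
    ]
    # Acyclic by construction: every edge follows the strict weight ordering.
    return nodes, edges

# A mechanics question that leans on mathematical derivation:
nodes, edges = build_s_dag({"Math": 0.4, "Physics": 0.9, "Chemistry": 0.05})
# nodes == ["Math", "Physics"]; edges == [("Math", "Physics")]
```

With strict `<` comparison, equal-weight subjects get no edge between them, which keeps the graph acyclic even for ties.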
### Key Designs
- S-DAG Construction (GNN-based):
  - The question is encoded by BERT → fused with embeddings of 15 candidate subjects → GNN directed message passing updates node features → a node classifier determines subject relevance and an edge classifier determines dependency direction.
  - Ground-truth S-DAG: an LLM (qwen-turbo) scores the weight of each of the 15 subjects (averaged over 3 rounds); edges point from lower-weight (supporting) to higher-weight (dominant) subjects.
- LLM Profiling:
  - 14 domain expert models (DeepseekMath-7B, BioMistral-7B, Qwen2.5-Coder-7B, etc.).
  - A per-model, per-subject capability matrix \(C_{ij}\) is constructed using 200 randomly sampled test questions.
  - Each subject node is assigned the highest-scoring model.
- DAG-guided Collaboration:
  - Subject Expert Agent (root nodes): processes the original question from its own domain perspective.
  - Supporting Agent (intermediate nodes): integrates outputs from upstream agents with its own expertise.
  - Dominant Agent (terminal node): synthesizes all inputs and produces the final answer.
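The capability matching and collaboration steps above can be sketched as an argmax over the capability matrix followed by a topological traversal. The helper names (`assign_experts`, `run_dag`, `call_model`) and the dict-of-dicts capability matrix are illustrative assumptions; the paper's prompts and agent implementations are not reproduced here:

```python
# Sketch of DAG-guided collaboration: assign each subject node the expert
# with the highest capability score C_ij, then run agents in topological
# order (supporting subjects first, the dominant subject last).
from graphlib import TopologicalSorter


def assign_experts(nodes, capability):
    # Argmax over each subject's row of the capability matrix.
    return {s: max(capability[s], key=capability[s].get) for s in nodes}


def run_dag(nodes, edges, capability, call_model):
    experts = assign_experts(nodes, capability)
    preds = {s: [u for u, v in edges if v == s] for s in nodes}
    outputs = {}
    order = list(TopologicalSorter(preds).static_order())
    for s in order:
        upstream = [outputs[u] for u in preds[s]]  # answers from supporting agents
        outputs[s] = call_model(experts[s], s, upstream)
    return outputs[order[-1]]  # the dominant (terminal) subject's answer


def call_model(model, subject, upstream):
    # Stand-in for a real LLM call; returns a traceable placeholder string.
    return f"{model}:{subject}({len(upstream)} inputs)"
```

`graphlib.TopologicalSorter` (Python ≥ 3.9) accepts a node → predecessors mapping, which matches the "upstream agents feed downstream agents" flow directly.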
## Key Experimental Results

### Main Results
| Method | MMLU-Pro | GPQA | MedMCQA | Avg. |
|---|---|---|---|---|
| GPT-4o-mini (CoT) | 49.42 | 47.31 | 78.82 | 58.52 |
| Symbolic-MoE | 48.13 | 45.92 | 78.55 | 57.53 |
| MAD (Qwen2.5-7B) | 45.82 | 46.81 | 76.55 | 56.39 |
| S-DAG (7–13B pool) | 50.98 | 49.82 | 78.38 | 59.73 |
| Qwen2.5-72B (CoT) | 50.81 | 48.98 | 80.44 | 60.08 |
### Ablation Study
| Configuration | Avg. Accuracy | Inference Time | LLM Calls |
|---|---|---|---|
| w/o GNN + random model | 41.12% | 14.21s | 5.1 |
| w/ GNN + random model | 42.19% | 14.82s | 4.1 |
| w/o GNN + capability matching | 53.51% | 14.53s | 5.1 |
| Fully connected graph | 57.29% | 38.45s | 8.2 |
| S-DAG (full) | 59.73% | 15.02s | 4.1 |
### Key Findings
- Small model pool surpasses GPT-4o-mini: S-DAG (7–13B) 59.73% vs. GPT-4o-mini 58.52%, demonstrating that a pool of collaborating small models can substitute for a single large model.
- Model capability matching is the most critical factor: random matching yields only 42.19%, while capability matching reaches 53.51% (+11.3 pp), with GNN-based DAG further improving it to 59.73%.
- DAG outperforms fully connected graph in both accuracy and efficiency: 59.73% vs. 57.29% (higher accuracy), 15s vs. 38.5s (2.5× faster), 4.1 vs. 8.2 LLM calls (halved)—redundant communication is in fact harmful.
- GNN's value lies in noise filtering: LLM-generated S-DAG labels are noisy and inconsistent; the trained GNN produces more robust graph structures.
## Highlights & Insights
- Subject-level granularity is finer than task-level routing—different subjects within the same question can be handled by different expert models, realizing true specialization.
- DAG topological constraints improve both accuracy and efficiency: supporting subjects are processed first and the dominant subject integrates their results, eliminating the redundant communication of fully connected graphs.
- The paradigm of replacing a large model with a small-model pool has significant practical deployment value: only about four expert models of 7–13B are invoked per question (4.1 LLM calls on average), so the active parameter count per query is far smaller than that of a single 72B model.
## Limitations & Future Work
- Evaluation is limited to MCQ benchmarks; open-ended generation tasks are not assessed.
- The 15 candidate subjects may lack sufficient granularity (e.g., "Medicine" could be divided into finer sub-disciplines).
- The model capability profile is estimated from only 200 samples, which is a relatively small sample size.
- Deploying 14 models simultaneously introduces non-trivial engineering complexity.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Subject-level DAG construction + capability matching + topological collaboration constitutes a new paradigm for multi-agent reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, nine baselines, and comprehensive ablations, though limited to MCQ format.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and ablation analysis is thorough.
- Value: ⭐⭐⭐⭐⭐ The result of a small-model pool surpassing a large model has practical deployment value; the finding that DAG outperforms fully connected graphs carries theoretical significance.