MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation

Conference: ICML 2026
arXiv: 2602.07905
Code: Undisclosed
Area: Medical Reasoning / LLM Agent / Meta-Cognition
Keywords: Meta-cognitive regulation, Medical reasoning, Knowledge Graph, Inference density, On-demand reasoning

TL;DR

MedCoG has the LLM perform a task-level self-assessment along three dimensions (Complexity, Familiarity, Knowledge Density) before invoking SCoT, Memory, and the Knowledge Graph (KG) on demand. Inference density (the cost the reference scaling curve would need to reach the same accuracy, divided by the actual cost) rises to 6.2×, and average accuracy on the 5 MedAgentsBench hard sets improves from AFlow's 34.5 to 37.5.

Background & Motivation

Background: Medical reasoning is one of the most challenging domains for LLMs. Mainstream approaches wrap LLMs in agent frameworks—multi-role play (MedAgents, MDAgents), KG retrieval (MedReason), historical experience memory, and iterative self-refinement (Self-Refine, AFlow)—relying on test-time scaling to boost performance.

Limitations of Prior Work: Plotting the cost-accuracy Pareto frontier, the authors observed that these methods generally follow a logarithmic scaling law (\(Acc = \alpha \ln(C) + \beta, R^2=0.996\)): each doubling of compute adds only a constant \(\alpha \ln 2\) accuracy points, so the cost of every additional point grows exponentially. Worse, on MedQA-H, SCoT+KG scored 4 points lower than SCoT alone (41→37), since blindly adding KG/Memory can interfere with the LLM's internal knowledge.
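
To make the scaling law concrete, the sketch below fits \(Acc = \alpha \ln(C) + \beta\) by ordinary least squares and reports the gain per doubling of cost. The function names and the data points are hypothetical (the points are generated to lie exactly on a made-up curve); they are not the paper's measurements.

```python
import math

def fit_log_law(costs, accs):
    """Least-squares fit of acc = alpha * ln(cost) + beta."""
    xs = [math.log(c) for c in costs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta

def gain_per_doubling(alpha):
    """Accuracy points gained each time the cost doubles."""
    return alpha * math.log(2)

# Hypothetical points lying exactly on acc = 3 * ln(C) + 20.
costs = [1.0, 2.0, 4.0, 8.0]
accs = [3.0 * math.log(c) + 20.0 for c in costs]

alpha, beta = fit_log_law(costs, accs)
print(alpha, beta, gain_per_doubling(alpha))
```

Under such a law the gain per doubling is the same at every budget, which is why accuracy bought through scaling alone becomes expensive quickly.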

Key Challenge: Oracle experiments over a pool of five strategies {Zero-Shot, SCoT, SCoT+Mem, SCoT+KG, SCoT+KG+Mem} revealed that selecting the optimal strategy per sample could reach 98.98 on MedQA-Full (surpassing o1's 96.52) and 67.0 on MedQA-H, while no single fixed strategy in the pool exceeds 50 on MedQA-H. This indicates that the bottleneck is not the scope of knowledge but the lack of a mechanism for per-sample strategy selection.
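
An oracle bound of this kind is usually computed by counting a sample as solved if at least one strategy in the pool answers it correctly. The sketch below assumes that definition; the per-sample correctness flags are toy values, not the paper's data.

```python
def strategy_accuracy(correct_by_strategy):
    """Accuracy of each fixed strategy, given per-sample 0/1 correctness."""
    return {
        name: sum(flags) / len(flags)
        for name, flags in correct_by_strategy.items()
    }

def oracle_accuracy(correct_by_strategy):
    """Accuracy when the best strategy is picked separately for each sample."""
    per_sample = zip(*correct_by_strategy.values())
    solved = [any(flags) for flags in per_sample]
    return sum(solved) / len(solved)

# Toy correctness flags for four samples (hypothetical).
toy = {
    "Zero-Shot": [1, 0, 0, 0],
    "SCoT":      [0, 1, 0, 0],
    "SCoT+KG":   [0, 0, 1, 0],
}

print(strategy_accuracy(toy))
print(oracle_accuracy(toy))
```

In the toy data each fixed strategy solves a different sample, so the oracle is far ahead of any single strategy. This is the same pattern the MedQA-H numbers show.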

Goal: To allow the LLM to judge "what kind of knowledge is needed and how much" for a given question, rather than indiscriminately stacking KG, Memory, and CoT.

Key Insight: Drawing from meta-cognition in cognitive science (Schraw 1998), agents should evaluate their cognitive state before selecting a strategy. Tulving's three types of knowledge (Procedural / Episodic / Factual) are mapped to SCoT / Memory / KG respectively.

Core Idea: A Meta-Cognition Regulator performs "on-demand routing" between SCoT, Memory, and KG, shifting scaling from blind expansion to LLM-centric on-demand reasoning, simultaneously reducing costs (avoiding useless knowledge) and improving accuracy (avoiding noise).

Method

Overall Architecture

MedCoG = Meta-Cognition Regulator + Knowledge Executor. Given a medical question \(\mathcal{Q}\):

  1. The Regulator's Monitoring outputs a 3D state vector \(\mathbf{s}=[s_c, s_f, s_k]\) (Complexity / Familiarity / Knowledge Density).
  2. Planning uses a non-parametric gate \(I_j = \mathbb{1}(s_j > \tau_j)\) to route to the strategy pool \(\mathcal{S}=\{\text{Zero-Shot, SCoT, SCoT+Mem, SCoT+KG, SCoT+KG+Mem}\}\). If \(I_k=1\), a KG verification plan is generated.
  3. The Executor executes SCoT (procedural), Memory retrieval (episodic), or KG path search (factual) according to the plan.
  4. If KG is used, Evaluating checks if the retrieved evidence is sufficient; if not, it refines the plan once or falls back to SCoT to conclude.
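
The gate in the Planning step can be sketched as a few threshold comparisons. This is an illustration only: the class and function names are hypothetical, scores are assumed to lie in [0, 1], and the threshold values are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scores:
    """Holds either a state vector s or a threshold vector tau."""
    complexity: float
    familiarity: float
    knowledge_density: float

def route(state: Scores, tau: Scores) -> str:
    """Non-parametric gate I_j = 1(s_j > tau_j) mapped onto the strategy pool."""
    gate_c = state.complexity > tau.complexity
    gate_f = state.familiarity > tau.familiarity
    gate_k = state.knowledge_density > tau.knowledge_density

    # All scores below threshold: answer directly.
    if not (gate_c or gate_f or gate_k):
        return "Zero-Shot"

    # Any active gate uses SCoT as the base; KG and Memory are added on demand.
    parts = ["SCoT"]
    if gate_k:
        parts.append("KG")
    if gate_f:
        parts.append("Mem")
    return "+".join(parts)

tau = Scores(0.5, 0.5, 0.5)  # hypothetical thresholds
print(route(Scores(0.2, 0.1, 0.3), tau))
print(route(Scores(0.9, 0.9, 0.9), tau))
```

Because the gate has no learned parameters beyond the three thresholds, switching backbones only requires re-estimating those thresholds.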

Key Designs

  1. 3D Meta-Cognitive Monitoring + Non-parametric Gated Routing:

    • Function: Explicitly quantifies the LLM's self-judgment into three scalars \(s_c, s_f, s_k\), using thresholds \(\tau=[\tau_c, \tau_f, \tau_k]\) to decide whether to activate Memory and KG.
    • Mechanism: Complexity determines if multi-hop reasoning is needed; Familiarity checks if the question resembles textbook cases (deciding whether to reuse past experience); Knowledge Density checks dependence on specific medical facts. The strategy selection is \(\mathcal{M}=\pi(\mathbf{s};\tau) = \text{SCoT} \oplus \sum_{j\in\{f,k\}} I_j \cdot \mathcal{M}_j\). If all scores are below their thresholds, Zero-Shot is used. Thresholds are calibrated per backbone on 50 held-out samples.
    • Design Motivation: The authors found meta-cognitive characteristics vary significantly by LLM—o3-mini is overconfident (underestimating KG needs), Qwen3-8B fails on Knowledge Density (F1 only 0.33), and GPT-4o-mini rarely uses Memory. Non-parametric gating with per-backbone thresholds is more robust than training a policy network and avoids per-backbone model fine-tuning.
  2. KG Verification Plan + Three-Step Entity Grounding:

    • Function: Instead of direct retrieval using the full question, the problem is decomposed into "Verification Pairs" \(\mathcal{V}(\mathcal{Q})=\{(v_i, h_i)\}\) (atomic queries + LLM hypotheses) before searching the KG for the shortest connecting paths, compressing the search space.
    • Mechanism: Entity grounding follows three steps: (1) KG-LLM extracts candidate entity phrases \(\mathcal{E}_v, \mathcal{E}_h\); (2) Each candidate finds the most similar entity in PrimeKG (4M+ relations) using bge-base-en-v1.5: \(\hat{e}=\arg\max_{e^g \in \mathcal{E}} \text{sim}(\text{enc}_\theta(e), \text{enc}_\theta(e^g))\); (3) KG-LLM refines based on context. Shortest paths \(\mathcal{P}^g = \bigcup \{\text{SP}(e_v, e_h)\}\) are retrieved for all \((v_i, h_i)\), followed by a ranker selecting Top-K=5.
    • Design Motivation: In clinical questions, only a small set of indicators determines the answer. Retrieving based on the full text introduces noise (e.g., family history, physiological baseline). Executing a plan of "what to verify" hard-codes a cognitive path to avoid the hub entity explosion common in KG retrieval.
  3. SCoT: Decoupling Procedural Knowledge from Structured Paths:

    • Function: Explicitly separates "knowing what facts" from "knowing how to reason," outputting structured entity-relation paths \(\mathcal{P}^e\) before generating reasoning chains \(\mathcal{C}\) anchored on them, denoted as \(\text{SCoT}=(\mathcal{P}^e, \mathcal{C})\).
    • Mechanism: All samples activating any \(I_j\) use SCoT as a base. When KG is activated, \(\mathcal{P}^e\) is filled by retrieval; otherwise, it is elicited from the LLM. Episodic Memory reuses the same format: the Case Bank \(\mathcal{B}=\{(q_i, (\mathcal{P}^e_i, \mathcal{C}_i), r_i)\}\) stores historical questions, SCoT trajectories, and rewards \(r_i \in \{0,1\}\), and the Top-K=5 cases are retrieved by similarity.
    • Design Motivation: Standard CoT mixes facts and reasoning; when KG results are inserted, LLMs often hallucinate stories following incorrect KG paths. Decoupling allows the "fact layer" and "reasoning layer" to be independently replaced and evaluated, while historical SCoT trajectories provide procedural templates for "how to decompose" rather than just giving final answers.
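
The grounding and path-search steps of the KG verification plan can be illustrated on a toy graph. In this sketch the three-dimensional vectors stand in for bge-base-en-v1.5 embeddings, the adjacency dictionary stands in for PrimeKG, and all node names are hypothetical. Breadth-first search is one standard way to obtain a shortest path in an unweighted graph; the source does not state which algorithm the authors use.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ground(phrase_vec, entity_vecs):
    """Return the KG entity whose embedding is most similar to the phrase."""
    return max(entity_vecs, key=lambda name: cosine(phrase_vec, entity_vecs[name]))

def shortest_path(graph, source, target):
    """Breadth-first search; returns a node list, or None if unreachable."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy embeddings and toy graph (hypothetical stand-ins).
entity_vecs = {
    "drug_a": [1.0, 0.0, 0.0],
    "enzyme_b": [0.0, 1.0, 0.0],
    "disease_c": [0.0, 0.0, 1.0],
}
graph = {
    "drug_a": ["enzyme_b"],
    "enzyme_b": ["drug_a", "disease_c"],
    "disease_c": ["enzyme_b"],
}

# One verification pair: a query phrase and a hypothesis phrase.
e_v = ground([0.9, 0.1, 0.0], entity_vecs)
e_h = ground([0.0, 0.2, 0.8], entity_vecs)
print(shortest_path(graph, e_v, e_h))
```

Searching only between the grounded endpoints of each verification pair keeps the retrieved subgraph small, which is the compression effect the design aims for.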

Loss & Training

The system is training-free. GPT-4o (2024-08-06) serves as the Regulator and SCoT backbone, and GPT-4o-mini performs KG grounding, both with temperature=0. The only calibrated parameters are the thresholds \(\tau\) (50-sample calibration); the bge-base-en-v1.5 ranker is used off the shelf. PrimeKG is used for KG; the Case Bank is filtered from MedReason. The Evaluating module allows a maximum of 2 re-planning rounds to control overhead.
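
The source does not spell out the calibration procedure. One simple possibility, shown here purely as an assumption, is a grid search that picks each threshold to maximize F1 on the held-out samples, given binary labels for whether the module was actually needed. All names and data below are hypothetical.

```python
def f1_at(scores, labels, tau):
    """F1 of the gate 1(score > tau) against binary 'module needed' labels."""
    preds = [s > tau for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda tau: f1_at(scores, labels, tau))

# Toy calibration set (hypothetical): self-assessed scores and ground truth.
scores = [0.1, 0.4, 0.6, 0.9]
labels = [False, False, True, True]
candidates = [0.0, 0.25, 0.5, 0.75]

best_tau = calibrate_threshold(scores, labels, candidates)
print(best_tau, f1_at(scores, labels, best_tau))
```

The same routine would be run once per dimension and per backbone, yielding the three values of \(\tau\).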

Key Experimental Results

Main Results (5 MedAgentsBench Hard Sets, GPT-4o backbone, IIE* = Marginal Efficiency per 1k samples)

| Method | MedQA | MedMCQA | MMLU | MMLU-Pro | PubMedQA | Avg | IIE* |
|---|---|---|---|---|---|---|---|
| CoT (baseline) | 39.0 | 30.0 | 26.0 | 35.0 | 10.0 | 28.0 | Ref |
| Self-Refine | 41.0 | 34.0 | 34.2 | 34.0 | 13.0 | 31.2 | 0.345 |
| MultiPersona | 45.0 | 25.0 | 37.0 | 42.0 | 15.0 | 32.8 | 0.162 |
| AFlow | 48.0 | 31.0 | 38.4 | 37.0 | 18.0 | 34.5 | 0.141 |
| MedAgents | 43.0 | 30.0 | 28.8 | 8.0 | 15.0 | 25.0 | −0.035 |
| MDAgents | 36.0 | 22.0 | 24.7 | 8.0 | 11.0 | 20.3 | −0.165 |
| MedCoG-Meta | 52.0 | 36.0 | 35.6 | 44.0 | 20.0 | 37.5 | 0.438 |
| MedCoG-All (All On) | 50.0 | 32.0 | 28.8 | 36.0 | 19.0 | 33.2 | 0.181 |

MedCoG-Meta outperforms AFlow by 3.0 points on average (37.5 vs. 34.5, an 8.7% relative gain), with an IIE* 3.1× higher (0.438 vs. 0.141). Two medical agent frameworks, MedAgents and MDAgents, performed worse than CoT (negative IIE*), confirming that "blindly stacking agents can harm the LLM's inherent capabilities."

Oracle Upper Bound and Inference Density

| Strategy (GPT-4o) | MedQA-Full | MedQA-H |
|---|---|---|
| Zero-Shot | 87.80 | 32.0 |
| SCoT | 89.55 | 41.0 |
| SCoT+Mem | 89.08 | 42.0 |
| SCoT+KG | 87.43 | 37.0 |
| SCoT+KG+Mem | 88.85 | 50.0 |
| MedCoG-Oracle | 98.98 | 67.0 |
| Current SOTA (o1/o3-mini) | 96.52 | 53.0 |

Oracle results show the strategy pool ceiling exceeds o1. MedCoG-Meta achieves an Inference Density \(\rho = f^{-1}(Acc_\mathcal{M}) / C_\mathcal{M}\) of 6.2× (fit curve \(R^2=0.996\)), meaning reaching the same accuracy on the reference curve would cost 6.2× more.
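
Given the logarithmic reference curve \(f(C) = \alpha \ln(C) + \beta\), the inverse is \(f^{-1}(a) = \exp((a - \beta)/\alpha)\), so \(\rho\) can be computed directly. The sketch below uses made-up values for \(\alpha\), \(\beta\), accuracy, and cost; the function names are hypothetical.

```python
import math

def reference_cost(acc, alpha, beta):
    """Cost at which the reference curve acc = alpha * ln(C) + beta reaches acc."""
    return math.exp((acc - beta) / alpha)

def inference_density(acc, cost, alpha, beta):
    """rho = f^{-1}(acc) / cost; rho > 1 means cheaper than the reference curve."""
    return reference_cost(acc, alpha, beta) / cost

# Hypothetical curve and method: acc = 5 * ln(C) + 20, method scores 30 at cost 2.
alpha, beta = 5.0, 20.0
rho = inference_density(acc=30.0, cost=2.0, alpha=alpha, beta=beta)
print(rho)
```

A method that sits exactly on the reference curve has \(\rho = 1\); values above 1 mean the method reaches its accuracy for less than the curve would predict.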

Key Findings

  • Meta-cognitive monitoring reliability tracks model scale: Qwen3-8B had an F1 of 0.33 on Knowledge Density, which rose to 0.79 to 0.80 at 32B/Max, indicating that meta-cognition requires sufficient model capacity. Separately, o3-mini was overconfident on Familiarity (Recall 1.0 but Precision 0.65).
  • Synergy between KG and Memory: Adding KG alone lowered MedQA-H accuracy (41→37), but SCoT+KG+Mem rose to 50. The authors suggest episodic memory helps LLMs interpret abstract KG paths.
  • Error Structure Comparison: Total strategy pool error was 156; MedCoG-Meta reduced this to 70. Reductions included Synergy Missed (29→4), Memory Noise (20→3), and Over Reasoning (33→14), while Unsolvable remained at 33.
  • OOD Memory Utility: Using the MedQA+MedMCQA case bank on MMLU/PubMedQA still achieved best or second-best results, as the Familiarity threshold effectively filtered inapplicable historical cases.

Highlights & Insights

  • Making cost a first-class metric: Inference Density \(\rho\) and IIE = \((Acc_\mathcal{M} - Acc_{CoT}) / (C_\mathcal{M} - C_{CoT})\) reward accuracy gained per unit of extra cost, which penalizes methods that spend tokens indiscriminately.
  • Oracle experiments pinpoint bottlenecks: By proving an ideal router can achieve 98.98 while existing methods struggle to reach 50, the authors justify focusing on the router rather than just expanding the KG.
  • Per-backbone meta-cognitive profiling is transferable: The P/R/F1 data in Table 3 provides backbone-level baseline data for future "LLMs' self-assessment capability" evaluations.
  • Training-free, threshold-based: Avoiding end-to-end policy network training makes the system easy to deploy across different backbones with only 50 samples and three \(\tau\) values.
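
The IIE formula above is a ratio of accuracy gain to extra cost relative to the CoT baseline. A minimal sketch follows, with hypothetical accuracies and costs in arbitrary units; the source summary does not list per-method costs, so none are reproduced here.

```python
def iie(acc, cost, acc_ref, cost_ref):
    """(Acc_M - Acc_CoT) / (C_M - C_CoT), for methods that cost more than CoT."""
    extra_cost = cost - cost_ref
    if extra_cost <= 0:
        raise ValueError("IIE is only defined here for cost above the baseline")
    return (acc - acc_ref) / extra_cost

# Hypothetical baseline: accuracy 28.0 at cost 5.0.
helpful = iie(acc=30.0, cost=15.0, acc_ref=28.0, cost_ref=5.0)
harmful = iie(acc=25.0, cost=15.0, acc_ref=28.0, cost_ref=5.0)
print(helpful, harmful)
```

A method that spends more than CoT but scores lower gets a negative IIE, which matches the sign pattern of the MedAgents and MDAgents rows in the main results.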

Limitations & Future Work

  • Complete comparisons were only done on hard subsets (100 samples each, 73 for MMLU) with GPT-4o, making IIE figures sensitive to individual errors.
  • The Regulator relies on closed-source LLM calls (at least 2 calls per sample). If small open-source models (e.g., Qwen3-8B) are used, meta-cognitive performance drops significantly.
  • Non-parametric gating with single thresholds is simple but brittle when scores are correlated or data distributions shift.
  • KG remains a bottleneck: Table 4 shows 10 cases of KG Noise, suggesting a need for path verbalization or chain-level verifiers.

Related Work & Insights

  • vs AFlow / MedAgents / MDAgents: These use fixed workflow-level agent orchestration. MedCoG uses sample-level dynamic strategy selection; token savings come from skipping KG when unnecessary.
  • vs MedReason / MedPrompt: MedReason provides SCoT data but no dynamic routing. MedCoG reuses this data for the Case Bank and adds a meta-cognitive layer, demonstrating an "upstream knowledge engineering + upper-level cognitive scheduling" hierarchy.
  • Inspiration for RAG: Using a verification plan to atomize questions before retrieval significantly reduces noise compared to direct dense retrieval of full questions.

Rating

  • Novelty: ⭐⭐⭐⭐ Explicitly mapping Tulving's knowledge types (Procedural/Episodic/Factual) to SCoT/Memory/KG, with meta-cognitive non-parametric routing among them, is a first systematic attempt.
  • Experimental Thoroughness: ⭐⭐⭐ Comprehensive across 5 backbones and 5 datasets, but hard subset sample sizes are relatively small.
  • Writing Quality: ⭐⭐⭐⭐ The logic chain from Pilot → Oracle → Method → Metric is highly persuasive.
  • Value: ⭐⭐⭐⭐ Provides a rigorous framework for medical agents to avoid "blind token usage"; IIE metric is likely to be adopted by future agent research.