A Coherence-Based Measure of AGI
Conference: AAAI 2026 · arXiv: 2510.20784 · Code: Not available · Area: Interpretability · Keywords: AGI evaluation, generalized mean, coherence measure, cognitive capability balance, non-compensatory aggregation
TL;DR
This paper identifies that existing AGI scores rely on arithmetic averaging, which implicitly encodes a "compensatory" assumption (strengths offsetting weaknesses), and proposes \(\text{AGI}_{\text{AUC}}\)—a coherence measure based on the continuous spectrum of generalized means. By integrating over the compensability parameter \(p \in [-1, 1]\), the metric penalizes uneven capability profiles and exposes bottlenecks concealed by arithmetic averaging.
Background & Motivation
Background: Hendrycks et al. define an AGI score as the arithmetic mean of scores across 10 cognitive domains based on the CHC (Cattell-Horn-Carroll) theory of cognitive abilities. GPT-4 scores 27% and GPT-5 scores 58% under this scheme. However, arithmetic averaging implicitly encodes a compensatory assumption—strong reasoning can compensate for weak memory.
Limitations of Prior Work: Psychometric evidence from CHC theory argues against compensability: cognitive abilities are interdependent (reasoning relies on working memory; perception constrains abstraction), and extreme imbalance typically indicates dysfunction rather than high intelligence.
Key Challenge: Systems theory supports a bottleneck effect—the overall capability of a complex system is constrained by its weakest component (limiting-factor dynamics), which simple summation cannot capture.
Goal: General intelligence should exhibit coherent sufficiency—all critical capabilities meeting a balanced threshold—rather than excelling in isolated domains.
Method
Overall Architecture
The degree of compensability is parameterized via the generalized power mean family, and a robust coherence measure is obtained as the area under the resulting \(\text{AGI}_p\) curve:
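For \(n\) normalized domain scores \(x_1, \dots, x_n\), the family and its aggregate are presumably defined as follows (notation reconstructed from the definitions in this section; the factor \(\tfrac{1}{2}\) normalizes by the interval length so that an all-100% profile scores 100%):

\[
\text{AGI}_p = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{\,p} \right)^{1/p}, \qquad \text{AGI}_{\text{AUC}} = \frac{1}{2} \int_{-1}^{1} \text{AGI}_p \, dp,
\]

with the \(p = 0\) case taken as its limit, the geometric mean \(\left( \prod_{i=1}^{n} x_i \right)^{1/n}\).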
Key Designs
- Semantics of the compensability index \(p\): \(p=1\) (arithmetic mean, strong compensation) → \(p=0\) (geometric mean, moderate non-compensation) → \(p=-1\) (harmonic mean, strong non-compensation) → \(p \to -\infty\) (minimum, strict bottleneck).
- \(\text{AGI}_p\) curve: The horizontal axis represents \(p\) and the vertical axis represents the score. A flatter and higher curve indicates more balanced capabilities. GPT-5's curve drops sharply for \(p < 0\), exposing bottlenecks in memory and perception.
- AUC aggregation: Integration over \(p \in [-1, 1]\) provides a comprehensive measure of model robustness across different compensability assumptions.
- Stability constant: \(\varepsilon = 10^{-6}\) keeps the generalized mean well-defined for \(p \le 0\) when any domain score is exactly zero; see the sketch after this list.
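Since the paper's code is not released, here is a minimal NumPy sketch of this spectrum (only \(\varepsilon = 10^{-6}\) is taken from the paper; the rest is a plain implementation of the power mean):

```python
import numpy as np

def generalized_mean(scores, p, eps=1e-6):
    """Power mean M_p of domain scores in [0, 1]; eps keeps the mean
    well-defined for p <= 0 when a domain scores exactly zero."""
    x = np.clip(np.asarray(scores, dtype=float), eps, None)
    if abs(p) < 1e-9:                     # p -> 0 limit: geometric mean
        return float(np.exp(np.log(x).mean()))
    return float(((x ** p).mean()) ** (1.0 / p))

# GPT-5's ten CHC domain scores (see the results tables), rescaled to [0, 1]
gpt5 = [0.9, 1.0, 1.0, 0.7, 0.5, 0.0, 0.4, 0.4, 0.6, 0.3]
for p in (1, 0.5, 0, -0.5, -1):
    print(f"AGI_p at p={p:+.1f}: {generalized_mean(gpt5, p):.0%}")
```

Run on GPT-5's domain scores from the results table, this reproduces the reported \(\text{AGI}_p\) row (58%, 50%, 16%, ≈0%, ≈0%) up to rounding.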
Loss & Training
This paper presents a purely evaluative framework with no model training. Numerical integration is performed via the composite trapezoidal rule on a uniform grid over \(p\).
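A sketch of the AUC step under the same assumptions: the composite trapezoidal rule on a uniform grid matches the paper's description, while the grid density and the normalization by interval length (inferred from the Ideal-AGI row scoring 100%) are my own choices:

```python
import numpy as np

def agi_auc(scores, p_lo=-1.0, p_hi=1.0, n_grid=201, eps=1e-6):
    """Area under the AGI_p curve on [p_lo, p_hi], via the composite
    trapezoidal rule on a uniform grid, normalized by interval length."""
    x = np.clip(np.asarray(scores, dtype=float), eps, None)
    ps = np.linspace(p_lo, p_hi, n_grid)
    curve = np.array([
        np.exp(np.log(x).mean()) if abs(p) < 1e-9     # p = 0: geometric mean
        else ((x ** p).mean()) ** (1.0 / p)
        for p in ps
    ])
    dp = ps[1] - ps[0]
    area = dp * (0.5 * curve[0] + curve[1:-1].sum() + 0.5 * curve[-1])
    return area / (p_hi - p_lo)       # an all-1.0 profile scores exactly 1.0

# Domain scores from the results table below, rescaled to [0, 1]
gpt4 = [0.8, 0.6, 0.4, 0.0, 0.2, 0.0, 0.4, 0.0, 0.0, 0.3]
gpt5 = [0.9, 1.0, 1.0, 0.7, 0.5, 0.0, 0.4, 0.4, 0.6, 0.3]
print(f"GPT-4 AGI_AUC = {agi_auc(gpt4):.0%}")   # paper reports 7%
print(f"GPT-5 AGI_AUC = {agi_auc(gpt5):.0%}")   # paper reports 24%
```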
Key Experimental Results
CHC Domain Score Analysis (GPT-4 / GPT-5 / Ideal AGI)
| Model | \(\text{AGI}_1\) (Arithmetic) | \(\text{AGI}_{0.5}\) | \(\text{AGI}_0\) (Geometric) | \(\text{AGI}_{-0.5}\) | \(\text{AGI}_{-1}\) (Harmonic) | \(\text{AGI}_{\text{AUC}}\) (Ours) |
|---|---|---|---|---|---|---|
| GPT-4 | 27% | 16% | ≈0% | ≈0% | ≈0% | 7% |
| GPT-5 | 58% | 50% | 16% | ≈0% | ≈0% | 24% |
| Ideal AGI | 100% | 100% | 100% | 100% | 100% | 100% |
GPT-4/5 Scores Across Ten Domains (%)
| Model | Knowledge | Literacy | Math | Reasoning | Working Memory | Long-Term Memory Storage | Long-Term Memory Retrieval | Visual | Auditory | Speed |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4 | 80 | 60 | 40 | 0 | 20 | 0 | 40 | 0 | 0 | 30 |
| GPT-5 | 90 | 100 | 100 | 70 | 50 | 0 | 40 | 40 | 60 | 30 |
Key Findings
- Arithmetic averaging substantially overstates AGI progress: GPT-5's \(\text{AGI}_1 = 58\%\) creates the illusion that the field is "more than halfway there," whereas \(\text{AGI}_{\text{AUC}} = 24\%\) more faithfully reflects actual capability.
- Zero-score domains cause collapse of geometric/harmonic means: GPT-5's long-term memory storage score of 0% drives \(\text{AGI}_0\) down to 16% and \(\text{AGI}_{-1} \approx 0\%\).
- Consistency with external benchmarks: \(\text{AGI}_{\text{AUC}} = 24\%\) for GPT-5 aligns more closely with ARC-AGI-2's 18% than with the arithmetic mean of 58%.
- "GPT-6" simulation: Raising only GPT-5's weakest domain (long-term memory storage) from 0% to 30% produces a substantial improvement in \(\text{AGI}_{\text{AUC}}\), demonstrating that patching bottlenecks yields far greater returns than strengthening existing capabilities.
- 17-benchmark extended validation: Repeating the analysis on Gemini 3 Pro, GPT-5.1, Claude Sonnet 4.5, and other models using 17 heterogeneous benchmarks yields coherence patterns fully consistent with the CHC domain analysis.
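The thought experiment can be replayed with the `agi_auc` sketch above. The scenario (0% → 30% on long-term memory storage) is the paper's; the comparison arm, spending the same 30 points on an already-strong domain, is my own illustration:

```python
# Assumes agi_auc from the sketch in the Method section.
gpt5 = [0.9, 1.0, 1.0, 0.7, 0.5, 0.0, 0.4, 0.4, 0.6, 0.3]

patched = list(gpt5); patched[5] = 0.3   # weakest domain: LT storage 0% -> 30%
boosted = list(gpt5); boosted[3] = 1.0   # same +30 points on reasoning instead
print(f"GPT-5   AGI_AUC: {agi_auc(gpt5):.0%}")
print(f"patched AGI_AUC: {agi_auc(patched):.0%}")   # large jump
print(f"boosted AGI_AUC: {agi_auc(boosted):.0%}")   # barely moves
```

Under this sketch, patching the zero moves the coherence score by tens of points, while the same budget spent on a strong domain shifts it by roughly one point.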
Highlights & Insights
- Primary conceptual contribution: The paper frames "compensability" as the central unexamined assumption in AGI evaluation—a foundational issue previously overlooked by the community.
- The "GPT-6" thought experiment is particularly illuminating: Patching only the weakest domain yields disproportionately large coherence gains, intuitively demonstrating the leverage effect of bottleneck remediation.
- The generalized mean spectrum is an elegant mathematical tool: The continuous transition from full compensation (\(p=1\)) to strict bottleneck (\(p \to -\infty\)) means the \(\text{AGI}_p\) curve itself serves as a diagnostic instrument.
- Practical implications: If the AGI community were to adopt \(\text{AGI}_{\text{AUC}}\), model development would shift toward addressing weaknesses rather than amplifying existing strengths.
- Framework agnosticism: The approach is independent of any particular benchmark suite; any collection of domain scores can be aggregated under this framework.
- Agreement with ARC-AGI-2 and BIG-Bench Hard scores supports the claim that the AUC measure reflects functional coherence more faithfully than arithmetic averaging.
Limitations & Future Work
- Dependence on domain score quality: The normalization and estimation of CHC domain scores are themselves subject to bias (the paper discusses sub-domain inflation in an appendix).
- Subjective choice of \(p\) range: The interval \([-1, 1]\) is an empirical choice; alternative ranges such as \([-2, 1]\) or \([-0.5, 1]\) would yield different results.
- Handling of zero values via \(\varepsilon\): Replacing zero-score domains with \(10^{-6}\) is mathematically defensible, but semantically a zero score on any capability should arguably render the AGI score zero.
- Purely evaluative: The framework diagnoses problems but provides no technical prescription for improving weak domains.
- Equal domain weighting: The 10 cognitive domains are treated with uniform weights, without accounting for differential contributions to general intelligence.
- Absence of temporal dimension: The framework provides a static snapshot and does not capture coherence dynamics over continual learning or forgetting.
- Limited multi-model comparison: The CHC domain analysis covers only GPT-4 and GPT-5; broader comparisons (e.g., Claude, Gemini) would strengthen the conclusions.
Related Work & Insights
- Hendrycks et al. (2025): The first psychometric AGI definition using a 10-domain arithmetic mean; this paper directly extends and improves upon that framework.
- Chollet (ARC-AGI): Emphasizes out-of-distribution reasoning and abstraction, consistent with the non-compensatory philosophy of this work.
- Multi-criteria decision theory (Keeney & Raiffa): Provides the theoretical foundation for non-compensatory aggregation.
- Bottleneck effects in systems theory (Kitano): System performance is constrained by the weakest component.
- BIG-Bench Hard: GPT-4 scores approximately 6% on this benchmark, closely matching \(\text{AGI}_{\text{AUC}} = 7\%\), as opposed to the arithmetic mean of 27%.
- Gemini 3 Pro Model Evaluation Report: The paper uses 17 benchmarks from this report for extended validation, demonstrating the generality of the framework.
- Core insight: The design of an evaluation metric itself encodes assumptions about the nature of capability—choosing an aggregation function is choosing a theory of intelligence.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic introduction of the compensability problem into AGI evaluation
- Experimental Thoroughness: ⭐⭐⭐⭐ Dual validation via CHC domains and 17 benchmarks, though dependent on external data with no original experiments
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematically elegant, rigorously argued, and thoroughly discussed
- Value: ⭐⭐⭐⭐ The evaluation framework design approach is broadly applicable; generalized mean aggregation transfers well to multi-task evaluation settings