# Out-of-distribution Generalisation is Hard: Evidence from ARC-like Tasks
- Conference: NeurIPS 2025
- arXiv: 2505.09716
- Code: None
- Area: Other
- Keywords: OOD generalisation, compositional generalisation, ARC tasks, inductive bias, feature learning
## TL;DR
By constructing ARC-like tasks with well-defined OOD metrics, this paper demonstrates that standard neural networks (MLP/CNN/Transformer) fail to achieve compositional OOD generalisation. Moreover, even architectures designed with correct inductive biases that attain near-perfect OOD performance may still learn incorrect compositional features.
## Background & Motivation
Out-of-distribution (OOD) generalisation is widely regarded as a hallmark of human and animal intelligence. Achieving OOD generalisation through composition requires a system to:

1. Identify input-output mapping properties that are invariant across environments
2. Transfer them to novel inputs
However, existing research has a critical gap:

- Testing on OOD settings alone is insufficient to prove that an algorithm has learned compositional structure
- One must additionally verify that the identified features are genuinely compositional
- When a system performs well on OOD tests, it remains unclear whether it has truly learned the correct compositional rules
The authors' central argument: validating OOD generalisation requires jointly verifying both performance and feature correctness.
## Method

### Overall Architecture
- Design two tasks with clearly defined OOD metrics
- Evaluate the OOD capabilities of three standard networks (MLP, CNN, Transformer)
- Design two novel architectures with correct inductive biases
- Analyse whether feature learning remains correct even when OOD performance is perfect
### Key Designs
Task Design:

- Inspired by the ARC (Abstraction and Reasoning Corpus) benchmark
- Tasks possess decomposable compositional structure
- OOD metrics are clearly defined: whether novel combinations of input features are handled correctly (a split sketch follows this list)
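To make the OOD metric concrete, here is a minimal sketch of how such a compositional split can be constructed; the factor names and hold-out scheme are illustrative, not taken from the paper.

```python
from itertools import product

# Illustrative factor values (not the paper's exact task specification).
shapes = ["square", "cross", "L", "T"]
colours = [1, 2, 3, 4]  # ARC-style integer colour codes

all_combos = list(product(shapes, colours))

# Hold out the "diagonal" pairs: every individual shape and every colour
# appears in training, but these particular combinations never do.
test_combos = {(shapes[i], colours[i]) for i in range(len(shapes))}
train_combos = [c for c in all_combos if c not in test_combos]

assert {s for s, _ in train_combos} == set(shapes)   # all shapes seen in training
assert {c for _, c in train_combos} == set(colours)  # all colours seen in training
# OOD generalisation is then measured as accuracy on test_combos.
```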
Task 1: Geometric Transformation

- Input: grid patterns with specific geometric properties
- Output: patterns after deterministic transformations
- OOD setting: the combinations of geometric properties differ between training and test sets
- Key property: transformations can be decomposed into independent sub-operations (a generator sketch follows this list)
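A minimal sketch of what such a task generator might look like, assuming the sub-operations are rotation and reflection; the specific operations and grid size are assumptions for illustration.

```python
import numpy as np

def apply_transform(grid: np.ndarray, n_rot: int, flip: bool) -> np.ndarray:
    """Compose two independent sub-operations: rotation and reflection."""
    out = np.rot90(grid, k=n_rot)
    if flip:
        out = np.fliplr(out)
    return out

def make_example(rng: np.random.Generator, n_rot: int, flip: bool):
    grid = rng.integers(0, 10, size=(5, 5))  # ARC grids use colour codes 0-9
    return grid, apply_transform(grid, n_rot, flip)

rng = np.random.default_rng(0)
# Training might cover (n_rot=1, flip=False) and (n_rot=0, flip=True);
# the OOD test would then use the unseen combination (n_rot=1, flip=True).
x, y = make_example(rng, n_rot=1, flip=True)
```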
Task 2: Colour–Shape Composition

- Inputs involve independent variation in colour and shape
- Correct compositional generalisation requires learning invariant features for colour and shape separately (a rendering sketch follows this list)
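A sketch of how inputs with independently varying factors could be rendered; the shape masks and grid layout here are hypothetical, chosen only to show the two factors varying independently.

```python
import numpy as np

# Hypothetical shape templates as binary masks (not the paper's shapes).
SHAPES = {
    "square": np.ones((3, 3), dtype=int),
    "cross":  np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]]),
}

def render(shape: str, colour: int, size: int = 9) -> np.ndarray:
    """Paint one shape in one colour onto an empty grid; shape and colour
    vary independently, which is what makes the task compositional."""
    grid = np.zeros((size, size), dtype=int)
    grid[3:6, 3:6] = SHAPES[shape] * colour
    return grid

sample = render("cross", colour=4)  # any shape may appear in any colour
```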
Two Novel Architectures:

- Architecture A: embeds geometric invariance biases, enabling the network to process transformations along each dimension independently
- Architecture B: embeds compositional decomposition biases, forcing the network to factorise the feature space into independent subspaces (an architectural sketch follows this list)
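The paper releases no code, so the following is only a sketch of what a compositional-decomposition bias in the spirit of Architecture B might look like (not the authors' implementation): each factor is predicted from its own dedicated subspace, so the network cannot entangle them.

```python
import torch
import torch.nn as nn

class FactorisedNet(nn.Module):
    """Sketch of a factorisation bias: one encoder per factor, and each
    prediction head sees only its own subspace."""

    def __init__(self, in_dim: int, sub_dim: int, n_colours: int, n_shapes: int):
        super().__init__()
        self.colour_enc = nn.Sequential(nn.Linear(in_dim, sub_dim), nn.ReLU())
        self.shape_enc = nn.Sequential(nn.Linear(in_dim, sub_dim), nn.ReLU())
        self.colour_head = nn.Linear(sub_dim, n_colours)
        self.shape_head = nn.Linear(sub_dim, n_shapes)

    def forward(self, x: torch.Tensor):
        # The architecture enforces a factorisation, but nothing guarantees
        # each subspace captures the *intended* factor -- the paper's point.
        return self.colour_head(self.colour_enc(x)), self.shape_head(self.shape_enc(x))
```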
### Loss & Training
- Standard cross-entropy / MSE loss
- All architectures use identical training data and optimisation strategies
- Train–test splits strictly control the degree of OOD shift (a shared training-loop sketch follows)
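A minimal sketch of such a shared training recipe (hyperparameters are illustrative, not the paper's), so that differences in OOD accuracy are attributable to architecture rather than optimisation:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    """Train any architecture with the same data, loss, and optimiser."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:  # loader yields only in-distribution combinations
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```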
## Key Experimental Results

### Main Results
OOD generalisation of standard networks:
| Network | Task 1 ID Acc. | Task 1 OOD Acc. | Task 2 ID Acc. | Task 2 OOD Acc. |
|---|---|---|---|---|
| MLP | ~100% | ~0% | ~100% | ~0% |
| CNN | ~100% | ~0% | ~100% | ~0% |
| Transformer | ~100% | ~0% | ~100% | ~0% |
Novel architectures with correct inductive biases:
| Architecture | Task 1 OOD Acc. | Task 2 OOD Acc. | Feature Correctness |
|---|---|---|---|
| Biased Arch. A | ~100% | ~100% | Partially incorrect |
| Biased Arch. B | ~100% | ~100% | Partially incorrect |
### Ablation Study
| Condition | OOD Performance | Correct Compositional Features Learned |
|---|---|---|
| No inductive bias | Fails | No |
| Weak inductive bias | Partial success | Uncertain |
| Strong inductive bias | Succeeds | Not guaranteed |
| Perfect inductive bias | Near-perfect | May still be incorrect |
### Key Findings
- Standard networks fail comprehensively: MLP, CNN, and Transformer all achieve near-zero OOD accuracy on both tasks.
- Correct bias ≠ correct features: Even when embedding correct inductive biases yields near-perfect OOD performance, networks may still learn incorrect compositional features.
- OOD performance ≠ compositional generalisation: High OOD performance does not prove that an algorithm has learned the underlying compositional structure.
- Necessity of verifying feature correctness: Testing OOD performance alone is insufficient to confirm compositional generalisation capability (a hypothetical probe-based check is sketched below).
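The paper checks feature correctness by inspection; a hypothetical automated check (not the authors' procedure) could use linear probes: if a subspace is genuinely compositional, it should decode its own factor well and the other factor at near chance. This sketch assumes scikit-learn and pre-extracted features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe and report its accuracy (a real check would
    evaluate on held-out examples)."""
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    return clf.score(features, labels)

# For a colour subspace learned by a factorised model:
#   probe_accuracy(colour_features, colour_labels)  -> should be high
#   probe_accuracy(colour_features, shape_labels)   -> should be near chance
# High OOD accuracy alone establishes neither.
```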
## Highlights & Insights
- Methodological contribution: Proposes a new standard for evaluating compositional generalisation — one must not only measure performance but also verify feature correctness.
- Counter-intuitive finding: Near-perfect OOD performance can be achieved through "incorrect" means, i.e., without relying on correct compositional features.
- Implications for ARC: Provides an explanation for why current AI systems perform poorly on ARC tasks — they lack genuine compositional reasoning ability.
- Theory–practice gap: Inductive biases are necessary but not sufficient, underscoring the importance of understanding what models actually learn.
## Limitations & Future Work
- The designed tasks are relatively simple and limited in complexity compared to real ARC tasks.
- Only three standard architectures and two custom architectures are evaluated.
- The verification of feature correctness relies heavily on manual inspection, lacking automated evaluation tools.
- The compositional generalisation capabilities of larger-scale or pretrained models are not explored.
- The influence of new paradigms such as in-context learning on OOD generalisation is not considered.
## Related Work & Insights
- ARC Challenge: The abstract reasoning benchmark proposed by François Chollet.
- Compositional generalisation: Compositional generalisation benchmarks in semantic parsing, such as SCAN and COGS.
- Systematic generalisation: Lake & Baroni's discussion on systematic generalisation in neural networks.
- Insight: In-depth analysis on simple tasks is often more informative than surface-level testing on complex ones.
## Rating
- Novelty: ⭐⭐⭐⭐ (significant methodological contribution)
- Technical depth: ⭐⭐⭐⭐ (rigorous analysis)
- Experimental thoroughness: ⭐⭐⭐ (tasks could be more diverse)
- Value: ⭐⭐⭐ (primarily theoretical insights)