Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities¶
- Conference: NeurIPS 2025
- arXiv: 2505.12680
- Code: GitHub
- Area: LLM Evaluation
- Keywords: automated theorem proving, compositional reasoning, formal verification, Lean 4, mathematical inequalities, benchmark
TL;DR¶
This paper introduces the Ineq-Comp benchmark, which takes simple inequality seed problems and applies compositional transformations that humans can handle with minimal extra effort, exposing fundamental deficiencies in the compositional reasoning of current LLM-based formal theorem provers. Even DeepSeek-Prover-V2-7B suffers a performance drop exceeding 20%.
Background & Motivation¶
Recent years have seen remarkable progress in LLM-based formal theorem proving: systems such as AlphaProof and DeepSeek-Prover-V2 have set new records on benchmarks like MiniF2F. Nevertheless, existing benchmarks suffer from the following shortcomings:
- MiniF2F is small in scale and spans an extreme difficulty range, mixing introductory problems with IMO-level challenges.
- ProofNet / PutnamBench focus on high-difficulty problems that current open-source models can solve only in rare cases.
- None of these benchmarks explicitly test compositional reasoning—the ability to chain known simple reasoning steps into more complex arguments.
- They carry a non-trivial risk of data contamination.
The authors' core insight is that the essence of mathematical reasoning lies in "standing on the shoulders of giants": complex arguments are constructed by linking simple, well-understood steps. If a prover can prove a seed problem, it should likewise handle variants obtained by straightforward compositional transformations—variants that require virtually no additional reasoning for a human expert.
Method¶
Overall Architecture¶
The Ineq-Comp benchmark comprises three subsets:
- Ineq-Simp: 150 problems generated from 75 seed problems via two classes of transformations.
- Ineq-Mix: An arbitrarily extensible pool of problems produced by a rule-based automatic generation framework.
- Ineq-Real: 50 inequality problems sourced from real mathematical competitions.
Key Designs¶
Seed Problems (75 total):
| Technique | Count | Description |
|---|---|---|
| AM-GM Inequality | 25 | Arithmetic–geometric mean inequality |
| Cauchy–Schwarz | 25 | Cauchy–Schwarz inequality |
| Others (Jensen/Schur/SOS/Induction) | 25 | Jensen (10) + Schur (5) + SOS (5) + Induction (5) |
Each seed problem is accompanied by a verified Lean 4 proof, written and reviewed by personnel with IMO/CMO experience.
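For concreteness, a seed of this flavor could be stated and proved in Lean 4 as below. This is an illustrative sketch rather than one of the paper's 75 actual seeds, and it deliberately uses the `nlinarith`/`sq_nonneg` pattern that reappears in the Key Findings section.

```lean
import Mathlib

-- Illustrative AM–GM-style seed (not an actual benchmark item):
-- for all real x, y, x² + y² ≥ 2xy.
theorem seed_am_gm (x y : ℝ) : x ^ 2 + y ^ 2 ≥ 2 * x * y := by
  -- The ubiquitous sum-of-squares pattern: hand (x − y)² ≥ 0 to nlinarith.
  nlinarith [sq_nonneg (x - y)]
```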
Type I Transformation (Variable Duplication + Composition):
Given a seed inequality \(f(X) \geq g(X)\) with \(X = (x_1, \ldots, x_n)\) and constraint \(X \in \mathcal{C}\):
- Duplicate the inequality with fresh variables \(Y = (y_1, \ldots, y_n)\).
- If \(g(X) \geq 0\): multiplicative composition \(f(X) \cdot f(Y) \geq g(X) \cdot g(Y)\).
- If \(g(X)\) is not guaranteed non-negative: additive composition \(f(X) + f(Y) \geq g(X) + g(Y)\).
This transformation is trivially simple for humans—it merely requires decomposing the problem into two identical parts.
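Continuing the illustrative seed above (again, not an actual benchmark item), a Type I variant and its human-trivial decomposition might look like this; since \(g(x, y) = 2xy\) is not guaranteed non-negative over the reals, the additive form applies.

```lean
import Mathlib

-- Type I (additive) variant of the illustrative seed: duplicate the
-- variables as (u, v) and add the two copies.
theorem seed_am_gm_type1 (x y u v : ℝ) :
    (x ^ 2 + y ^ 2) + (u ^ 2 + v ^ 2) ≥ 2 * x * y + 2 * u * v := by
  -- A human simply proves the seed twice and adds the results.
  have h1 : x ^ 2 + y ^ 2 ≥ 2 * x * y := by nlinarith [sq_nonneg (x - y)]
  have h2 : u ^ 2 + v ^ 2 ≥ 2 * u * v := by nlinarith [sq_nonneg (u - v)]
  linarith
```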
Type II Transformation (Algebraic Variable Substitution):
A transformation \(T: \mathbb{R}^n \to \mathbb{R}^n\) (e.g., squaring, taking square roots) is applied to the variables of the seed problem, and one must prove \(f(T(X)) \geq g(T(X))\) under the condition \(T(X) \in \mathcal{C}\).
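For the same illustrative seed, a Type II variant under the squaring substitution \(T(x, y) = (x^2, y^2)\) might read as follows (again a sketch, not a benchmark item):

```lean
import Mathlib

-- Type II variant of the illustrative seed under T(x, y) = (x², y²):
-- prove f(T(X)) ≥ g(T(X)).
theorem seed_am_gm_type2 (x y : ℝ) :
    (x ^ 2) ^ 2 + (y ^ 2) ^ 2 ≥ 2 * x ^ 2 * y ^ 2 := by
  -- For a human this is just the seed instantiated at (x², y²);
  -- the sum-of-squares hint shifts accordingly.
  nlinarith [sq_nonneg (x ^ 2 - y ^ 2)]
```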
Ineq-Mix Automatic Generation Framework:
Three classes of rules can be freely combined:
1. Compositional transformations: combining inequalities via addition, multiplication, max, or min.
2. Variable-level transformations: substituting variables with algebraic expressions.
3. Problem-level transformations: applying monotone functions (e.g., \(\exp\), \(\log\)) to both sides.
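As a rough sketch of the kind of statement such rules could produce (illustrative only, not output of the authors' pipeline; the Mathlib lemma name `Real.exp_le_exp` is an assumption and may differ across versions), the following combines rule 1 (additive composition of two seed instances) with rule 3 (wrapping both sides in the monotone function \(\exp\)):

```lean
import Mathlib

-- Hypothetical Ineq-Mix-style composition: rule 1 adds two AM–GM-style
-- seed instances, rule 3 wraps both sides in the monotone function exp.
theorem mix_example (x y u v : ℝ) :
    Real.exp ((x ^ 2 + y ^ 2) + (u ^ 2 + v ^ 2)) ≥
      Real.exp (2 * x * y + 2 * u * v) := by
  -- Reduce through exp's monotonicity, then fall back to sum-of-squares.
  have h : 2 * x * y + 2 * u * v ≤ (x ^ 2 + y ^ 2) + (u ^ 2 + v ^ 2) := by
    nlinarith [sq_nonneg (x - y), sq_nonneg (u - v)]
  exact Real.exp_le_exp.mpr h
```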
Loss & Training¶
This is a benchmarking paper and involves no training loss functions. Evaluation uses the standard pass@N accuracy: a problem is considered "solved" if and only if at least one correct proof is produced among \(N\) generations.
Key Experimental Results¶
Main Results¶
pass@3200 accuracy (%) on Ineq-Simp:
| Model | AM-GM Seed | AM-GM Type I | AM-GM Type II | Cauchy Seed | Cauchy Type I | Cauchy Type II |
|---|---|---|---|---|---|---|
| DeepSeek-Prover-V2-7B | 96.0 | 84.0 | 76.0 | 76.0 | 52.0 | 48.0 |
| Kimina-7B | 80.0 | 44.0 | 64.0 | 68.0 | 52.0 | 36.0 |
| STP | 64.0 | 44.0 | 40.0 | 44.0 | 24.0 | 28.0 |
| Goedel-Prover-SFT | 60.0 | 4.0 | 24.0 | 40.0 | 32.0 | 28.0 |
| DeepSeek-Prover-V1.5-RL | 60.0 | 4.0 | 24.0 | 24.0 | 12.0 | 12.0 |
Key finding: Every model degrades from seed to Type I. DeepSeek-Prover-V2-7B holds up best but still loses ground, while Goedel-Prover-SFT drops precipitously from 60% to 4% on AM-GM Type I.
Model scaling experiments (pass@32):
| Model | Seed | Type I | Type II |
|---|---|---|---|
| DeepSeek-Prover-V2-7B | 66.2 | 47.0 | 42.1 |
| DeepSeek-Prover-V2-671B | 88.0 | 68.3 | 70.7 |
| Goedel-Prover-V2-8B | 69.0 | 35.0 | 49.0 |
| Goedel-Prover-V2-32B | 90.0 | 68.0 | 83.0 |
Although scaling up model size improves absolute performance, the gap between seed problems and their transformed variants persists.
Ablation Study¶
In-Context Learning (providing seed proofs):
| Model (inference budget N) | AM-GM Type I | AM-GM Type II | Cauchy Type I | Cauchy Type II |
|---|---|---|---|---|
| Qwen2.5-Coder-32B (128) | 28.0 | 40.0 | 48.0 | 20.0 |
| DeepSeek-R1-32B w/ thinking (128) | 16.0 | 44.0 | 84.0 | 40.0 |
| GPT-4o (16) | 12.0 | 40.0 | 56.0 | 16.0 |
| Claude 3.7 Sonnet (16) | 36.0 | 28.0 | 32.0 | 20.0 |
Even when complete proofs of the seed problems are provided in context, models still struggle to generalize to transformed variants. The root cause of the compositional reasoning deficit is not a lack of knowledge, but a lack of compositional generalization ability.
Fine-tuning experiments:
Synthetic data (~8k compilable samples) generated from AM-GM seed problems was used to fine-tune Goedel-Prover-SFT:
- In-distribution (AM-GM Type I): accuracy improves substantially to 56%.
- Out-of-distribution: performance shows virtually no improvement, remaining near pre-fine-tuning levels.
- Performance on MiniF2F does not degrade, ruling out catastrophic forgetting.
Key Findings¶
- Tactic monoculture: Nearly all models rely heavily on `nlinarith` + `sq_nonneg` (a sum-of-squares approach) and rarely attempt to invoke classical inequalities such as AM-GM or Cauchy–Schwarz directly; see the sketch after this list.
- Natural language vs. formal reasoning gap: Models frequently invoke correct high-level strategies (e.g., AM-GM) in natural-language annotations, but abandon these strategies in Lean code, reverting to brute-force algebraic manipulation.
- Comparison with O3: OpenAI's O3 model can readily decompose and solve these compositional problems in natural-language mode, confirming that the bottleneck lies in the formal reasoning system rather than in mathematical understanding per se.
- Ineq-Mix harder than Ineq-Real: Automatically generated compositional problems prove more challenging for models than real-world competition problems—a counterintuitive result that underscores the core difficulty of compositional reasoning.
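To make the tactic monoculture concrete, the sketch below proves the same illustrative seed twice: once in the sum-of-squares style that models default to, and once by citing a named Mathlib lemma, closer to how a human would invoke the classical inequality. The lemma name `two_mul_le_add_sq` is assumed from a recent Mathlib and may vary across versions.

```lean
import Mathlib

-- The same illustrative seed proved two ways.
theorem seed_sos (x y : ℝ) : x ^ 2 + y ^ 2 ≥ 2 * x * y := by
  nlinarith [sq_nonneg (x - y)]   -- brute-force sum-of-squares "monoculture"

theorem seed_named (x y : ℝ) : x ^ 2 + y ^ 2 ≥ 2 * x * y := by
  exact two_mul_le_add_sq x y     -- cite the classical AM–GM-type lemma
```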
Highlights & Insights¶
- Elegant benchmark design: The bottom-up construction methodology enables controlled, scalable difficulty and allows precise diagnosis of compositional reasoning as a specific capability.
- A deep underlying problem revealed: Current formal provers do not simply lack sufficient intelligence; they lack the ability to compositionally apply existing knowledge—a sharp contrast with the core capacity of human mathematical reasoning.
- The O3–formalization gap: Compositional problems that natural-language reasoning handles effortlessly become a primary bottleneck in the formal setting, suggesting that compositional reasoning patterns are underrepresented in formal training data.
- Fine-tuning cannot bridge the gap: This is not merely a data scarcity issue; moderate-scale SFT can only learn specific patterns without achieving genuine generalization.
Limitations & Future Work¶
- The benchmark focuses exclusively on inequalities; the behavior of compositional reasoning in other mathematical domains—algebra, geometry, number theory—remains unexplored.
- Seed problem difficulty is limited to introductory competition level; applicability to higher-difficulty problems has not been validated.
- Newer tree-search-based methods such as HunyuanProver and BFS-Prover are not evaluated, as their code is not publicly available.
- The random generation process in Ineq-Mix may introduce problems that are formally valid but mathematically trivial.
Related Work & Insights¶
- MiniF2F / ProofNet / PutnamBench: Existing benchmarks emphasize overall difficulty rather than diagnostic evaluation of specific reasoning capabilities.
- AlphaProof / DeepSeek-Prover-V2: The strongest formal proving systems, yet they exhibit notable weaknesses in compositional reasoning.
- AIPS / inequality solvers: Targeting higher-difficulty problems, whereas this work exposes a qualitatively different dimension of failure by starting from simple problems.
- Implication: Future provers may require explicit decompose-and-compose reasoning modules, or curriculum learning that incorporates compositional reasoning during RL training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Examining formal theorem proving through the lens of compositional reasoning is a genuinely novel perspective.
- Technical Depth: ⭐⭐⭐⭐ — The benchmark design is rigorous, the transformation definitions are precise, and the ablation study is comprehensive.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 6+ provers across multiple inference budgets, with validation via ICL, fine-tuning, and model scaling.
- Value: ⭐⭐⭐⭐ — Provides the formal theorem proving community with a sustainably extensible diagnostic tool.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, intuitive figures, and well-motivated problem framing.
- Overall: ⭐⭐⭐⭐ (8/10)