Solving Inequality Proofs with Large Language Models¶
Conference: NeurIPS 2025 arXiv: 2506.07927 Code: IneqMath Area: LLM/NLP Keywords: mathematical reasoning, inequality proving, LLM-as-Judge, benchmark, reasoning evaluation
TL;DR¶
This paper proposes IneqMath, the first large-scale olympiad-level inequality benchmark, formulates inequality proving as two automatically verifiable subtasks (bound estimation and relation prediction), develops a five-module LLM-as-Judge framework, and finds that even o1 achieves an overall accuracy below 10% under step-by-step reasoning scrutiny.
Background & Motivation¶
Inequality proving is a central challenge in mathematical reasoning, requiring the discovery of tight bounds, strategic application of classical theorems (AM-GM, Cauchy-Schwarz, etc.), and precise symbolic manipulation.
Existing resources are insufficient:
General ATP datasets (MiniF2F, ProofNet): contain very few inequalities
Synthetic datasets (INT, AIPS): template-based generation, lacking structural diversity
Manually curated sets (ChenNEQ): too small in scale
Format limitations: most adopt formal representations (Lean/Isabelle), whereas LLMs may perform better in natural language reasoning
Core problem: evaluating only final answer accuracy severely overestimates LLMs' mathematical reasoning ability—models may guess the correct answer while the reasoning process is riddled with flaws.
Method¶
Overall Architecture¶
Three main contributions: (1) informal yet verifiable task definitions; (2) the IneqMath dataset; (3) a five-module LLM-as-Judge evaluation framework.
Key Design 1: Informal but Verifiable Tasks¶
Inequality proving is decomposed into two automatically checkable subtasks:
Bound Estimation: find the optimal constant for which the inequality holds: \(C^* = \sup\{C \in \mathbb{R} : f(\mathbf{x}) \geq C \cdot g(\mathbf{x}),\ \forall \mathbf{x} \in \mathcal{D}\}\)
Relation Prediction: determine the relationship between two expressions (\(>, \geq, =, \leq, <\))
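Both answer formats are machine-checkable: a bound is a single constant and a relation is one of five symbols. As a concrete instance (not taken from the paper), \(\sup\{C : x^2 + y^2 \geq C \cdot xy,\ \forall x, y > 0\} = 2\) by AM-GM, with equality at \(x = y\). The sketch below, assuming SymPy and using illustrative function names of my own choosing rather than the paper's code, shows how such final answers could be verified automatically.

```python
# Minimal sketch of automatic answer checking for the two IneqMath subtasks.
# Function and variable names are illustrative, not the paper's implementation.
import sympy as sp

def check_bound_answer(predicted: str, ground_truth: str) -> bool:
    """Bound estimation: the answer is a constant C*; accept any
    symbolically equivalent form (e.g. 1/sqrt(2) vs sqrt(2)/2)."""
    diff = sp.simplify(sp.sympify(predicted) - sp.sympify(ground_truth))
    return diff == 0

def check_relation_answer(predicted: str, ground_truth: str) -> bool:
    """Relation prediction: the answer is one of >, >=, =, <=, <."""
    normalize = {">": ">", ">=": ">=", "=": "=", "<=": "<=", "<": "<",
                 "\\geq": ">=", "\\leq": "<="}
    return normalize.get(predicted.strip()) == normalize.get(ground_truth.strip())

# Example: C* = sup{C : x^2 + y^2 >= C*x*y for all x, y > 0} is 2 (AM-GM),
# so "2" and "4/2" should both be accepted as the same answer.
assert check_bound_answer("4/2", "2")
assert check_relation_answer("\\geq", ">=")
```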
Key Design 2: IneqMath Dataset¶
| Statistic | Count | Bound Estimation | Relation Prediction |
|---|---|---|---|
| Test set | 200 | 96 | 104 |
| Dev set | 100 | 50 | 50 |
| Train set | 1,252 | 626 | 626 |
| Theorem categories | 29 | - | - |
| Named theorems | 83 | - | - |
| With theorem annotation | 962 | 482 | 480 |
| Max solutions per problem | 4 | 4 | 4 |
The test set was originally designed by IMO medalists and verified by an independent panel of experts. The training set includes step-by-step solutions and theorem annotations.
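A minimal sketch of how the training split might be consumed, assuming it has been exported to a local JSON file; the file name and the field names (`type`, `theorems`) are hypothetical and only illustrate the kind of per-problem metadata described above, not the official schema.

```python
# Hypothetical loader for the IneqMath training split.
# File path and field names are assumptions for illustration.
import json
from collections import Counter

with open("ineqmath_train.json", encoding="utf-8") as f:
    problems = json.load(f)  # expected: a list of problem records

type_counts = Counter(p["type"] for p in problems)          # bound vs. relation
annotated = sum(1 for p in problems if p.get("theorems"))   # theorem-annotated subset

print(f"{len(problems)} problems: {dict(type_counts)}")
print(f"{annotated} problems carry theorem annotations")
```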
Key Design 3: Five-Module LLM-as-Judge Framework¶
A solution is considered correct only if it passes all five judges:
- Final Answer Judge: two-stage verification that extracts the concluding answer sentence and checks mathematical equivalence (e.g., recognizing \(\frac{1}{\sqrt{2}} = \frac{\sqrt{2}}{2}\)); F1 = 1.00
- Toy Case Judge: detects errors where general conclusions are drawn solely from specific numerical cases, F1 = 0.91
- Logical Gap Judge: flags missing reasoning steps and unsupported assertions, F1 = 0.96
- Numerical Approximation Judge: flags improper use of numerical approximations, F1 = 0.96
- Numerical Computation Judge: verifies arithmetic step correctness (Python/SymPy), F1 = 0.80
Overall F1 = 0.93.
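Because a solution counts as correct only when every module accepts it, the framework is effectively a logical AND over five specialized checks: answer accuracy corresponds to the Final Answer Judge alone, while overall accuracy requires all five to pass. The sketch below illustrates that gating with hypothetical judge callables (in practice each judge is an LLM prompt or a SymPy routine); it is a sketch of the composition under those assumptions, not the paper's implementation.

```python
# Sketch of the all-pass gating over the five judges.
# Judge names mirror the paper's modules; the callable interface is hypothetical.
from typing import Callable, Dict

Judge = Callable[[str, str], bool]  # (problem, candidate solution) -> pass/fail

def evaluate_solution(problem: str, solution: str,
                      judges: Dict[str, Judge]) -> Dict[str, object]:
    """Run every judge; the solution counts as correct only if all pass."""
    verdicts = {name: judge(problem, solution) for name, judge in judges.items()}
    return {"verdicts": verdicts, "overall_correct": all(verdicts.values())}

judges: Dict[str, Judge] = {
    "final_answer": lambda p, s: True,             # placeholder: answer-equivalence check
    "toy_case": lambda p, s: True,                 # placeholder: no generalization from examples
    "logical_gap": lambda p, s: True,              # placeholder: no unjustified steps
    "numerical_approximation": lambda p, s: True,  # placeholder: approximations used properly
    "numerical_computation": lambda p, s: True,    # placeholder: arithmetic verified via SymPy
}

result = evaluate_solution("Prove x^2 + y^2 >= 2xy.", "By AM-GM ...", judges)
print(result["overall_correct"])  # True only when every module passes
```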
Loss & Training¶
This work presents an evaluation framework and does not involve model training. The step-by-step solutions and theorem annotations in the training set are available for future research.
Key Experimental Results¶
Main Results¶
Zero-shot evaluation of 29 leading LLMs:
| Model | Answer Accuracy | Overall Accuracy | Drop |
|---|---|---|---|
| o1 | ~73.5% | 8.0% | -65.5% |
| o3 | ~65% | ~31% | -34% |
| Qwen2.5-72B | 42.0% | 2.5% | -39.5% |
| GPT-4o | ~40% | ~2% | -38% |
| Llama-4-Maverick | 40.5% | 2.5% | -38% |
Ablation Study¶
- Model scale: larger models improve final-answer accuracy but yield only limited gains in overall accuracy
- Test-time compute: o1's overall accuracy stays at 8.0% as the maximum token budget grows from 5K to 40K; o3 saturates at roughly 31%
- Improvement strategies: theorem guidance raises o3-mini's overall accuracy by about 11%, and critic-guided self-refinement raises Gemini 2.5 Pro's by about 5%
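Theorem guidance amounts to placing relevant named theorems in the prompt before the problem. The snippet below is a hypothetical prompt builder illustrating that setup; the template wording and helper name are my own, not the paper's.

```python
# Hypothetical prompt builder for theorem-guided inequality proving.
# Template wording is illustrative; the paper's exact prompt may differ.
def build_theorem_guided_prompt(problem: str, theorems: list[str]) -> str:
    hints = "\n".join(f"- {t}" for t in theorems)
    return (
        "You may find the following theorems useful:\n"
        f"{hints}\n\n"
        "Solve the problem with rigorous, step-by-step justification.\n"
        f"Problem: {problem}"
    )

prompt = build_theorem_guided_prompt(
    "For positive reals a, b, c, find the largest C with a^2+b^2+c^2 >= C(ab+bc+ca).",
    ["AM-GM inequality", "Cauchy-Schwarz inequality"],
)
```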
Key Findings¶
- Correct answers do not imply correct reasoning: accuracy drops of up to 65.5% are observed
- Scaling is not a panacea: model size and inference time yield minimal improvement in step-level correctness
- Taxonomy of LLM reasoning flaws: logical gaps > toy-case generalization > numerical approximation misuse > arithmetic errors
- Theorem knowledge is critical: providing relevant theorems significantly improves reasoning quality
Highlights & Insights¶
- Reveals an evaluation blind spot: the first systematic demonstration of drastic accuracy drops under step-level scrutiny
- Modular judge design outperforms monolithic general-purpose judges
- High dataset quality: original problems by IMO medalists, independently verified, with multi-solution annotations
- Points to concrete improvement directions: theorem-guided reasoning and self-refinement, rather than naive scaling
- Comprehensive evaluation scope: 29 models covering both open-source and proprietary systems
Limitations & Future Work¶
- LLM-as-Judge is imperfect: the numerical computation judge achieves only F1 = 0.80
- Small test set: 200 problems provide limited statistical power
- Zero-shot evaluation only: few-shot and fine-tuned performance remain unexplored
- Inequality types skewed toward competitions: insufficient coverage of inequalities in analysis and probability theory
- Hybrid approaches combining formal and informal verification are worth exploring
Related Work & Insights¶
- MiniF2F (Zheng et al., 2022): formal mathematical benchmark
- MATH/GSM8K: conventional math benchmarks focusing on final answers
- LLM-as-Judge (Zheng et al., 2024): general-purpose judging framework
Rating¶
⭐⭐⭐⭐⭐ (4.5/5)
- Novelty ⭐⭐⭐⭐⭐: task definition, evaluation framework, and dataset constitute a unified contribution
- Experimental Thoroughness ⭐⭐⭐⭐⭐: systematic evaluation across 29 models
- Impact ⭐⭐⭐⭐⭐: exposes fundamental deficiencies in LLM mathematical reasoning
- Writing Quality ⭐⭐⭐⭐⭐: logically clear with well-crafted figures and tables