
GeoCodeBench: Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

Conference: CVPR 2026
arXiv: 2603.30038
Code: https://geocodebench.github.io/
Area: LLM Efficiency / Code Generation Evaluation
Keywords: 3D Vision Code Generation, LLM Evaluation, Geometric Algorithm Implementation, PhD-level Benchmark, Unit Testing

TL;DR

GeoCodeBench is the first PhD-level code generation benchmark for 3D geometric computer vision: 100 function completion tasks curated from top-venue 2025 papers and their codebases, each paired with automatically generated, diverse unit tests. The strongest model, GPT-5, achieves only a 36.6% pass rate, revealing a significant gap in LLMs' ability to implement scientific 3D code.

Background & Motivation

Background: AI-assisted programming has reshaped software practices and research workflows, but existing models still struggle with complex 3D geometric vision code. If models could reliably write such code, 3D vision research would be fundamentally transformed (automated prototyping, accelerated research cycles, democratized algorithm development).

Limitations of Prior Work: (1) Existing code benchmarks (HumanEval/MBPP/SWE-bench) do not cover 3D geometric implementations—they target general software engineering or competitive programming; (2) Scientific 3D vision code requires mathematically precise geometric operators, physical modeling, and multi-view reasoning—far beyond general-purpose capabilities; (3) Paper-to-code long-context scientific understanding remains an open problem.

Key Challenge: LLMs can generate general-purpose code but cannot reliably implement core functions in 3D geometric vision—how large is this gap? Where are the bottlenecks?

Key Insight: Simulate the real research scenario: give models the paper text plus function skeletons, require them to complete the implementation, and evaluate automatically via unit tests.

Core Idea: (1) Extract core functions from official repositories of 2025 top-venue papers; (2) Automated tool nomination plus manual curation to ensure quality; (3) Diverse boundary tests covering geometric degenerate configurations; (4) Two-level capability taxonomy for evaluation.

Method

Overall Architecture

Paper PDF (OCR → structured JSON) + Code repository (automated candidate extraction → manual curation → function masking) + Unit tests (auto-generation → manual review) → LLM receives (paper + masked code + execution template) → fills implementation → sandbox execution + testing → PassRate scoring.
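
To make this pipeline concrete, here is a minimal Python sketch of one evaluation round. The task schema (`paper_sections`, `masked_code`, `execution_template`, `unit_tests`) and the `call_llm` hook are illustrative assumptions, not the authors' actual interface:

```python
import json
import subprocess
import tempfile

def evaluate_task(task_path: str, call_llm) -> float:
    """Score one task as the fraction of unit tests its completion passes."""
    with open(task_path) as f:
        task = json.load(f)                        # paper JSON + masked code + tests
    prompt = "\n\n".join([
        task["paper_sections"]["method"],          # OCR-structured paper text
        task["masked_code"],                       # function body replaced by ****EMPTY****
        task["execution_template"],                # standardized imports and I/O definitions
    ])
    program = task["masked_code"].replace("****EMPTY****", call_llm(prompt))

    passed = 0
    for test in task["unit_tests"]:                # auto-generated, manually reviewed tests
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program + "\n\n" + test)
        try:                                       # stand-in for the paper's sandbox
            result = subprocess.run(["python", f.name], capture_output=True, timeout=60)
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass
    return passed / len(task["unit_tests"])
```

PassRate is then the mean of these per-task fractions across the benchmark.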

Key Designs

  1. Benchmark Construction Pipeline:

    • Paper Processing: MinerU OCR automatically extracts text/formulas/figures → organized into JSON by sections
    • Code Processing: Cursor automatically recommends candidate functions (10–20 per repository) → manual review by 3D vision researchers → retains 3–5 core geometric functions → function bodies replaced with ****EMPTY**** placeholders
    • Unit Tests: Cursor auto-generates 10 test cases (multiple parameter configurations) → manual review ensures reliability. Standardized execution templates (import/input-output definitions) are also provided
    • Design Motivation: Automated nomination is efficient but may select trivial/auxiliary functions → manual curation ensures every task is a "paper-core 3D geometric component" (a hypothetical masked task is sketched after this list)
  2. Two-Level Capability Taxonomy:

    • General 3D Capability (foundational geometric knowledge):
      • Geometric Transformations (24%): coordinate conversions, projections, normals, rotation parameterizations
      • Mechanics/Optics Formulation (31%): spherical harmonics, BRDF, equations of motion, radiometric quantities
    • Research Capability (research-level reasoning):
      • Novel Algorithm Implementation (34%): function-level implementation of paper-core novel ideas
      • Geometric Logic Routing (11%): composing existing operators into new pipelines—many influential papers are structured this way
    • Design Motivation: Separating foundational and research capabilities to diagnose model weaknesses
  3. Evaluation Metric:

    • PassRate = \(\frac{1}{N}\sum_{i=1}^{N}\frac{p_i}{T_i}\), where \(N\) is the number of tasks, \(p_i\) is the number of passed tests for task \(i\), and \(T_i\) is its total number of tests
    • Context ablation: Method-only vs. full-text input
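
To ground the task format and metric, here is a hypothetical task in the Geometric Transformations category, with degenerate-configuration tests and the PassRate aggregation from above. The function, tests, and names are illustrative; real tasks are drawn from 2025 paper repositories:

```python
import numpy as np

def quaternion_to_rotation_matrix(q):
    """In the benchmark this body would be the ****EMPTY**** placeholder;
    a reference implementation is included here so the tests below run."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize to guard against drift
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Tests in the spirit of the benchmark's suites: ordinary and degenerate configurations.
def test_identity():
    assert np.allclose(quaternion_to_rotation_matrix(np.array([1.0, 0, 0, 0])), np.eye(3))

def test_half_turn_about_z():  # 180-degree rotation, a classic degenerate case
    R = quaternion_to_rotation_matrix(np.array([0.0, 0, 0, 1.0]))
    assert np.allclose(R, np.diag([-1.0, -1.0, 1.0]))

def test_orthonormality():
    R = quaternion_to_rotation_matrix(np.array([0.3, 0.5, -0.2, 0.4]))
    assert np.allclose(R @ R.T, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)

def pass_rate(passed, totals):
    """PassRate = (1/N) * sum_i p_i / T_i, matching the formula above."""
    return sum(p / t for p, t in zip(passed, totals)) / len(totals)
```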

Paper Sources

Covering 3DGS, pose estimation, SLAM, reconstruction, physics-based modeling, NeRF, 3D segmentation, and other subfields. All papers are from CVPR/ICCV/ICLR 2025, minimizing the risk of training-data leakage.

Key Experimental Results

Main Results (8 Representative Models)

| Model | Company | Overall | General | Research | Geo. Trans. | Algorithm |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | OpenAI | 36.6% | 42.8% | 29.1% | 41.7% | 29.1% |
| Claude-Sonnet-4.5 | Anthropic | 31.1% | 37.2% | 23.7% | 38.3% | 19.7% |
| Gemini-2.5-Pro | Google | 30.4% | 33.8% | 26.2% | 41.9% | 25.3% |
| Kimi-K2-Instruct | Moonshot | 30.4% | 34.6% | 25.1% | 36.7% | 23.1% |
| Doubao-Seed-1.6 | ByteDance | 26.9% | 29.7% | 23.4% | 40.9% | 22.9% |
| Qwen3-Coder-480B | Alibaba | 23.5% | 22.7% | 24.6% | 29.0% | 21.8% |
| DeepSeek-R1 | DeepSeek | 21.0% | - | - | - | - |

Ablation Study

| Input Context | PassRate | Note |
| --- | --- | --- |
| Full text | Baseline | Includes introduction, related work, etc. |
| Truncated to Method | Statistically significantly better | Irrelevant context interferes with reasoning |
| Abstract only | Significant drop | Insufficient technical details |
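
The variants above could be assembled from the OCR-structured paper JSON roughly as follows (a sketch; the section keys are assumptions):

```python
def build_context(paper: dict, variant: str) -> str:
    """Select the paper text fed to the model. `paper` maps section names
    to OCR-extracted text, e.g. from the MinerU step in the pipeline."""
    if variant == "abstract_only":
        return paper["abstract"]
    if variant == "method_only":               # the statistically strongest variant
        return paper["method"]
    return "\n\n".join(paper.values())         # full text: intro, related work, etc.
```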

Key Findings

  • Best model achieves only 36.6%: GPT-5 is far from reliable on PhD-level 3D code
  • Research tasks are harder but positively correlated with General: Geometric fundamentals are necessary but not sufficient for research-level implementation
  • Truncating to the Method section actually performs better: LLMs face severe difficulties in long-context scientific paper understanding; more text adds interference rather than useful information
  • Creative correctness: In some successful cases, models used completely different but mathematically equivalent approaches to pass the tests, demonstrating genuine problem-solving beyond copying (a toy example of such equivalence follows this list)
  • Geometric Logic Routing (11% of tasks) reflects how many classic 3D vision papers are constructed—composing existing operators—requiring higher-level system design capabilities
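
A toy illustration of such equivalence (hypothetical, not one of the paper's cases): rotating a vector via Rodrigues' formula versus via quaternion conjugation. The two implementations look nothing alike but pass identical unit tests:

```python
import numpy as np

def rotate_rodrigues(v, axis, theta):
    """Rodrigues: v cos(t) + (k x v) sin(t) + k (k . v)(1 - cos(t))."""
    k = axis / np.linalg.norm(axis)
    return (v * np.cos(theta)
            + np.cross(k, v) * np.sin(theta)
            + k * np.dot(k, v) * (1 - np.cos(theta)))

def rotate_quaternion(v, axis, theta):
    """Same rotation as conjugation q v q* by the unit quaternion (w, u)."""
    u = np.sin(theta / 2) * axis / np.linalg.norm(axis)
    w = np.cos(theta / 2)
    t = 2 * np.cross(u, v)                 # expanded form of q v q* for unit q
    return v + w * t + np.cross(u, t)

v, axis, theta = np.array([1.0, 2.0, 3.0]), np.array([0.0, 0.0, 1.0]), 0.7
assert np.allclose(rotate_rodrigues(v, axis, theta), rotate_quaternion(v, axis, theta))
```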

Highlights & Insights

  • First 3D vision code benchmark: Fills the gap in AI coding evaluation for the scientific 3D domain. The community-driven, extensible design enables continuous growth with new papers
  • "More context is not better" finding: Raises sharp questions about LLMs' long-context scientific understanding capabilities. Method truncation outperforms full text → LLMs may be misled by noise in introduction/related work sections
  • Paper-to-code research paradigm: GeoCodeBench's evaluation setup directly simulates the real research workflow of "read paper → implement algorithm," representing a first step toward an "automated 3D vision scientist"
  • Unit test engineering contribution: The diverse, boundary-case-covering automated tests for each function are themselves valuable pedagogical materials for 3D geometry

Limitations & Future Work

  • The scale of 100 functions remains limited—continuous expansion is needed
  • Restriction to 2025 papers may require updates over time to avoid data leakage
  • Unit test coverage may be incomplete—passing tests does not necessarily guarantee a fully correct implementation
  • Only evaluates function-level completion—full paper reproduction (including training loops, data pipelines) is more challenging

Comparison with Related Benchmarks

  • vs HumanEval/MBPP: general programming benchmarks without domain knowledge; GeoCodeBench requires deep 3D geometric reasoning
  • vs SWE-bench: repository-level issue resolution; GeoCodeBench is function-level paper-to-code
  • vs PaperBench: full-paper reproduction evaluation; GeoCodeBench focuses on function-level core components, so the two are complementary
  • vs ResearchCodeBench: also masks key paper code, but does not focus on 3D geometry, and its tests are less diverse

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First 3D vision code benchmark with an insightful two-level capability taxonomy
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 models, context ablation, category analysis, creative case studies
  • Writing Quality: ⭐⭐⭐⭐⭐ Transparent and reproducible construction pipeline
  • Value: ⭐⭐⭐⭐⭐ Long-term impact on 3D vision automation research and LLM scientific coding evaluation