ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming

Conference: ACL 2025
arXiv: 2505.16667
Code: Yes
Area: LLM Evaluation
Institution: Sichuan University, Tianjin University of Science and Technology, Vanderbilt University
Authors: Xinwei Yang, Zhaofeng Liu, Chen Huang, Jiashuai Zhang, Tong Zhang, Yifan Zhang, Wenqiang Lei

TL;DR

This work introduces ELABORATION, the first comprehensive benchmark for evaluating human-LLM collaborative competitive programming. Built on a human feedback taxonomy that covers the entire programming workflow and an annotated dataset of 8,320 problems, the study shows that LLMs have limited independent problem-solving ability (only a 3.4% Pass@1 rate on hard problems), while human feedback, especially expert guidance during the coding phase, yields a significant average improvement of 9.3%.

Background & Motivation

Competitive programming demands mastery across four cognitive phases: problem understanding \(\rightarrow\) solution planning \(\rightarrow\) code generation \(\rightarrow\) debugging, requiring extremely high precision and efficiency. Although LLMs have made strides in code generation, their utility on expert-level competitive tasks remains limited. Consequently, recent efforts have shifted toward human-LLM collaborative programming (human-in-the-loop) to enhance LLM performance through multi-turn human feedback.

However, prior work has two core limitations:

Fragmented feedback types: Prior work focuses on isolated phases only (e.g., Mozannar et al. address only strategic suggestions; Zheng et al. focus solely on conversational bug identification) and lacks systematic coverage of the complete programming process.
Lack of standardized evaluation: Most existing benchmarks target fully autonomous programming, failing to support systematic evaluation of human-LLM collaborative scenarios.

Therefore, there is a compelling need for a comprehensive benchmark that spans the entire programming process and supports multi-granular human feedback.

Method

Overall Architecture

The ELABORATION benchmark comprises three core components:

Human Feedback Taxonomy: Systematizes human feedback across the entire programming workflow into a four-stage taxonomy for the first time.
Elaborationset Dataset: Consists of 8,320 competitive programming problems, meticulously annotated to support simulated and real-human feedback experiments.
Evaluation Protocol: Supports both LLM simulator and human participant modes, injecting feedback at each stage and evaluating the outcomes.

During evaluation, the LLM interacts iteratively with a human (either a real user or a simulator). The human provides textual feedback at each stage until a correct solution is generated or the maximum of 10 interaction turns is reached.
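
The interaction protocol described above can be sketched as a simple loop. This is a minimal illustration, not the benchmark's actual harness: the callables `generate`, `give_feedback`, and `passes_tests` are hypothetical stand-ins for the LLM under test, the human (or simulator), and the judge. Only the 10-turn cap comes from the text.

```python
from typing import Callable, List, Optional

MAX_TURNS = 10  # turn cap stated in the text


def collaborate(
    problem: str,
    generate: Callable[[str, List[str]], str],   # hypothetical: LLM under test
    give_feedback: Callable[[str, str], str],    # hypothetical: human or simulator
    passes_tests: Callable[[str], bool],         # hypothetical: test judge
    max_turns: int = MAX_TURNS,
) -> Optional[str]:
    """Return the first passing solution, or None if the turn cap is hit."""
    feedback_history: List[str] = []
    for _ in range(max_turns):
        # The model sees the problem plus all feedback gathered so far.
        code = generate(problem, feedback_history)
        if passes_tests(code):
            return code
        # Otherwise collect one more round of textual feedback.
        feedback_history.append(give_feedback(problem, code))
    return None
```

Passing the three roles as callables keeps the loop identical whether the feedback source is a real participant or an LLM simulator.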

Key Designs

Key Design 1: Four-Stage Human Feedback Taxonomy

The entire competitive programming process is partitioned into four distinct phases, with specific human feedback forms defined for each stage:

Phase	Human Feedback Content	Example
Problem Understanding	Providing key requirements and specifications	Pointing out edge cases, clarifying functional requirements, emphasizing time complexity constraints
Solution Planning	Recommending algorithms and providing pseudocode	Recommending Dijkstra's algorithm with pseudocode for a shortest-path problem
Code Generation	Recommending design strategies and implementation details	Recommending specific implementations such as stack structures or binary heap-based priority queues
Code Debugging	Identifying bugs until all tests pass	Locating logical flaws that lead to infinite loops

The unique value of this taxonomy lies in its scope: while most existing methods are limited to the debugging stage, this work demonstrates the necessity of full-process feedback.
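
One way to picture the taxonomy is as an ordered enumeration with a feedback form attached to each phase. The sketch below only encodes the table above. The names `Phase`, `FEEDBACK_FORM`, and `next_phase` are illustrative and do not come from the paper's code.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    # Declaration order mirrors the workflow order in the table.
    UNDERSTANDING = "problem understanding"
    PLANNING = "solution planning"
    CODING = "code generation"
    DEBUGGING = "code debugging"


# Feedback content per phase, paraphrased from the taxonomy table.
FEEDBACK_FORM = {
    Phase.UNDERSTANDING: "key requirements and specifications",
    Phase.PLANNING: "recommended algorithms and pseudocode",
    Phase.CODING: "design strategies and implementation details",
    Phase.DEBUGGING: "bug identification until all tests pass",
}


def next_phase(phase: Phase) -> Optional[Phase]:
    """Return the phase that follows `phase`, or None after debugging."""
    order = list(Phase)
    i = order.index(phase)
    return order[i + 1] if i + 1 < len(order) else None
```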

Key Design 2: Elaborationset Meticulously Annotated Dataset

Sourced from Codeforces and AtCoder (Oct 2011 to Nov 2024), the dataset consists of 8,320 problems across three difficulty levels. Predefined annotations include:

Problem Clarifications (averaging 8.1–12.1 items/problem): Requirements and specifications for each problem.
Algorithmic Knowledge Summaries (averaging 2.4–3.8 items/problem): Required definitions and pseudocode of algorithms.
Ground-truth Solutions (averaging 4.8 solutions/problem): Directly retrieved from the platforms.
Human-LLM Interaction Records (300 problems): Containing multi-turn dialogues and human-annotated LLM code errors.

The annotation pipeline adopts a two-phase strategy: "LLM initial generation followed by human verification." Reference solutions are directly retrieved from the competitive platforms. The dataset is split temporally to support contamination-free evaluation.
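
A temporal split of this kind can be sketched as follows. The record layout and the cutoff date are assumptions for illustration: the source does not state the exact schema or the cutoff, so `Problem`, its field names, and `cutoff` are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class Problem:
    # Hypothetical record layout loosely following the annotation types above.
    problem_id: str
    released: date
    clarifications: List[str] = field(default_factory=list)
    algorithm_notes: List[str] = field(default_factory=list)
    solutions: List[str] = field(default_factory=list)


def temporal_split(
    problems: List[Problem], cutoff: date
) -> Tuple[List[Problem], List[Problem]]:
    """Split into (possibly seen, contamination-free) by release date.

    `cutoff` would be a model's training-data cutoff; problems released
    strictly after it cannot have appeared in that model's training data.
    """
    seen = [p for p in problems if p.released <= cutoff]
    fresh = [p for p in problems if p.released > cutoff]
    return seen, fresh
```

Because the cutoff is a parameter, the same dataset can yield a different contamination-free subset for each evaluated model.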

Key Design 3: Dual-Mode Human Simulator

Two levels of LLM-simulated participants are designed (both based on o1-mini):

Student Programmers (Intermediate): Provide feedback based solely on the model's internal knowledge, simulating typical coders.
Teacher Programmers (Expert): Leverage the full dataset annotations to guarantee expert-grade feedback quality.

For example, in the N-Queens problem, the teacher's feedback explicitly specifies coordinate validation, iterative placement strategies, and backtracking conditions, whereas the student's feedback might omit these critical details.
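
The difference between the two simulator levels comes down to what context the simulator is given. The sketch below shows one possible way to assemble the prompt. The wording, the function name `build_feedback_prompt`, and the `annotations` layout are all assumptions, not the paper's prompts.

```python
from typing import Dict, List, Optional


def build_feedback_prompt(
    role: str,
    problem: str,
    candidate: str,
    annotations: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Assemble a simulator prompt; only the teacher sees dataset annotations."""
    if role not in ("student", "teacher"):
        raise ValueError(f"unknown role: {role}")
    parts = [
        f"You are a {role} programmer reviewing a solution attempt.",
        "Problem:\n" + problem,
        "Attempt:\n" + candidate,
    ]
    # The student relies on the model's internal knowledge alone, so any
    # annotations passed in are deliberately ignored for that role.
    if role == "teacher" and annotations:
        for name, items in annotations.items():
            parts.append(name + ":\n" + "\n".join(f"- {item}" for item in items))
    parts.append("Give concise textual feedback.")
    return "\n\n".join(parts)
```

For the N-Queens case above, a teacher prompt would carry annotation items such as the backtracking condition, while the student prompt would contain only the problem and the attempt.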

Key Experimental Results

Table 1: Pass@1 (%) of Closed-Source Models in Contamination-Free Evaluation

Model	Easy	Middle	Hard	Overall
O1-Mini	80.6	66.6	30.8	59.3
GPT-4o	74.1	31.7	10.3	38.7
+ Student Feedback	76.2	34.8	15.1	42.0
+ Teacher Feedback	80.1	42.9	23.3	48.8
Claude-3.5	74.5	34.3	5.4	38.1
+ Teacher Feedback	83.1	44.2	16.5	47.9
Closed-Source Average	71.8	31.5	7.7	37.0
+ Teacher Feedback	79.9(+8.1)	41.8(+10.3)	19.6(+11.9)	47.1(+10.1)
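
The gains in parentheses are absolute differences in Pass@1 (percentage points) between the closed-source average with and without teacher feedback. The short check below recomputes them from the two table rows. The dictionary and function names are illustrative.

```python
# Closed-source average rows copied from Table 1 (Pass@1, %).
BASE = {"Easy": 71.8, "Middle": 31.5, "Hard": 7.7, "Overall": 37.0}
TEACHER = {"Easy": 79.9, "Middle": 41.8, "Hard": 19.6, "Overall": 47.1}


def gains(base: dict, with_feedback: dict) -> dict:
    """Absolute gain in percentage points per difficulty level."""
    # Rounding to one decimal removes floating-point noise in the subtraction.
    return {k: round(with_feedback[k] - base[k], 1) for k in base}


print(gains(BASE, TEACHER))
# {'Easy': 8.1, 'Middle': 10.3, 'Hard': 11.9, 'Overall': 10.1}
```

The absolute gain grows with difficulty, from 8.1 points on easy problems to 11.9 points on hard ones.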

Table 2: Comparison of Human Debugging Experiments (GPT-4 Turbo, 300 contamination-free problems)

Debugging Method	Precision	Recall	Base P@1	+Debugging P@1
Automated Debugging (Compiler + o1-mini simulation)	0.23	0.40	0.33	0.38(+5%)
Human Debugging (5 CS graduate students)	0.81	0.71	0.38	0.62(+24%)
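
To make the precision and recall columns concrete, the sketch below assumes they are computed over sets of identified bugs versus annotated true bugs. The source does not spell out this definition, so treat it as an assumption. The F1 helper is a derived convenience for comparing the two rows and is not a number reported in the source.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[int], true_bugs: Set[int]) -> Tuple[float, float]:
    """Precision and recall of flagged bug locations against annotated bugs."""
    hits = len(flagged & true_bugs)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_bugs) if true_bugs else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Derived from the Table 2 values; F1 itself is not reported in the source.
print(round(f1(0.23, 0.40), 2))  # automated debugging
print(round(f1(0.81, 0.71), 2))  # human debugging
```

Under this assumed definition, the derived F1 is roughly 0.29 for automated debugging and 0.76 for human debugging, which matches the direction of the Pass@1 gap in the table.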

Highlights & Insights

Systematic Coverage: Establishes the first human feedback taxonomy covering all four stages of the programming process, addressing the fragmentation of prior work that focuses on individual phases only.
Code-Generation Feedback Yields the Highest Returns: Fine-grained analysis shows that feedback during the understanding phase brings minimal benefit (LLMs already understand problem specifications reasonably well), whereas feedback during the code generation phase contributes the most. This challenges the intuitive assumption that "debugging is the most critical phase."
Human-AI Complementarity in Debugging: Automated debugging excels at syntactic errors, whereas human developers are better at identifying semantic errors (references, calculations, and logical flaws). Combining the two yields the best results.
Meticulous Dataset Design: Multi-layer annotations are integrated to support feedback simulations of different expertise levels. Temporal partitioning ensures reliable, contamination-free evaluation.

Limitations & Future Work

Prompt Sensitivity: As with all LLM prompting research, performance remains sensitive to prompt design, though averaging across 8,000+ problems partially mitigates this.
Limited Generalizability: The benchmark is validated only in competitive programming scenarios, so its applicability to real-world software development remains to be verified.
Gap Between Simulators and Real Humans: Although LLM-based simulators are more realistic than rule-based counterparts, a clear quality gap remains compared to actual human feedback.
Cost Efficiency: Although feedback during the coding phase yields the best results, it incurs the highest token overhead, making the solution planning phase potentially more cost-effective.

Related Work

Competitive Programming Benchmarks: Benchmarks like APPS, CodeContests, and LiveCodeBench target fully autonomous programming, whereas ELABORATION is the first to explicitly support human-LLM collaborative evaluation.
Human-LLM Collaborative Coding: OpenCodeInterpreter (Zheng et al., 2024) is the closest prior work but covers only the debugging phase, whereas this work offers full-process coverage.
Human Feedback Simulation: Transitioning from rule-based to LLM-based simulators, this work deploys both simulated and real human participants.
Code LLMs: Models like CodeLlama, DeepSeek-Coder, and Qwen2.5-Coder are utilized as evaluation targets.

Rating

⭐⭐⭐⭐

Reasoning: The problem definition is clear, and the taxonomy is comprehensive and systematic. The dataset is large-scale and meticulously annotated, and the experimental evaluation covers both simulated and real human participants. The two core insights, "feedback is most effective during the coding phase" and "human-AI diagnostic complementarity," carry strong practical significance. The main limitations are the exclusive focus on competitive programming and relatively limited methodological novelty: the primary contribution is benchmark construction rather than novel algorithm design.