Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
| Information | Content |
|---|---|
| Conference | ACL 2025 |
| arXiv | 2501.03916 |
| Code | GitHub |
| Area | others (Auto-Research × LLM Agent × Code Generation) |
| Keywords | auto-research, closed-loop, idea generation, experimental verification, feedback |
TL;DR
Proposes Dolphin, a closed-loop auto-research framework that incorporates a three-stage cycle of "idea generation \(\rightarrow\) experimental verification \(\rightarrow\) results feedback". Through task-attribute-guided paper ranking and exception-traceback-guided debugging processes, Dolphin automatically proposes and verifies methods that approach human-designed SOTA on tasks such as 3D classification.
Background & Motivation
- Scientific Research Paradigm Shift: AI-assisted scientific research is evolving from "fully human-driven" to "auto-research", progressing through four stages: fully human-driven \(\rightarrow\) AI-assisted \(\rightarrow\) semi-automatic \(\rightarrow\) fully automatic.
- Key Challenges in Prior Work:
    - Inaccurate Idea Evaluation: Most works (Si et al., 2024; Li et al., 2024) rely on humans or LLMs to evaluate idea quality, focusing only on novelty rather than experimental validity. Although AI-Scientist (Lu et al., 2024) performs experimental verification, it uses simple self-constructed datasets and lacks meaningful comparisons with existing methods in the same field.
    - Lack of Feedback Mechanisms: Human researchers iteratively improve ideas based on experimental results, whereas existing works either offer feedback only within the isolated idea-generation phase or lack a feedback loop entirely.
- Core Motivation: To build a true "closed-loop" system where experimental results are fed back into the next round of idea generation, mimicking the iterative research process of human researchers.
Method
Overall Architecture
Dolphin consists of three core processes to form a closed loop (Figure 2):
1. Idea Generation Process
Paper Retrieval and Ranking
- Uses the Semantic Scholar API to retrieve 50 related papers.
- Task-Attribute-Guided Paper Ranking:
    - The LLM first extracts the attributes of the input task (e.g., model input, output, and other characteristics).
    - It scores the papers (on a scale of 1-10) based on two criteria: topic relevance and task attribute match.
    - Papers scored below 8 are filtered out.
    - Effect: For 3D classification tasks, irrelevant papers such as those on 3D detection are significantly reduced.
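The filtering step above can be illustrated with a minimal sketch. It assumes each retrieved paper already carries a single 1-10 score returned by the LLM ranking prompt; the `ScoredPaper` type, the `filter_papers` helper, and the paper titles are hypothetical and only serve the example. How the two criteria are combined into one score is not specified here, so the sketch treats the score as given.

```python
from typing import NamedTuple


class ScoredPaper(NamedTuple):
    title: str
    score: int  # 1-10, assumed to be returned by the LLM ranking prompt


def filter_papers(papers, min_score=8):
    """Drop papers scored below min_score and sort the rest, best first."""
    kept = [p for p in papers if p.score >= min_score]
    return sorted(kept, key=lambda p: p.score, reverse=True)


# Hypothetical titles and scores for a 3D classification task.
papers = [
    ScoredPaper("Rotation-robust point cloud classification", 8),
    ScoredPaper("3D object detection for autonomous driving", 5),
    ScoredPaper("Point cloud classification with local geometry", 9),
]
ranked = filter_papers(papers)
print([p.title for p in ranked])
```

In this toy input the 3D detection paper is removed because its task attributes (detection output rather than a class label) would earn it a low score.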
Idea Generation and Filtering
- Based on the ranked highly-relevant papers, the LLM generates \(N\) ideas (each containing a title, experimental plan, and abstract).
- Independence Check: Uses sentence embeddings to compute the cosine similarity between ideas, filtering redundancies with a threshold of 0.8.
    - Maintains an idea library B (storing the embeddings of checked ideas).
- Novelty Check: Determines whether an idea is novel through Semantic Scholar searches.
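A minimal sketch of the independence check follows. It assumes the sentence embeddings are already computed (toy 3-dimensional vectors stand in for them) and that an idea is rejected when its cosine similarity to any entry in the library exceeds 0.8; whether the comparison is strict is an assumption, as are the function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def independence_check(candidates, library, threshold=0.8):
    """Return indices of candidates that are not redundant with the library.

    Kept embeddings are appended to `library`, so later candidates are also
    compared against ideas accepted earlier in the same batch.
    """
    kept = []
    for i, emb in enumerate(candidates):
        if all(cosine_similarity(emb, seen) <= threshold for seen in library):
            kept.append(i)
            library.append(emb)
    return kept


# Toy embeddings: the second idea nearly duplicates the first.
candidates = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
library = []  # idea library B, empty at the start of the first loop
kept = independence_check(candidates, library)
print(kept)
```

The second candidate has a similarity of roughly 0.99 to the first and is dropped, while the orthogonal third candidate is kept.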
2. Experimental Verification Process
- The LLM generates a detailed experimental plan and modifies the reference code.
- Core Innovation: Exception-Traceback-Guided Debugging
    - Directly feeding the traceback to the LLM yields a low execution success rate (4/15 in Loop 1), as the LLM struggles to understand complex nested relationships.
    - Solution: Extract information from the traceback \(\rightarrow\) Guide the LLM to generate the local code structure related to the error \(\rightarrow\) Debug based on the code structure and the traceback.
    - Focuses only on custom code, excluding library function calls.
    - Maximum of 5 debugging iterations.
    - Effect: Pooled over the three loops of the debugging ablation, the success rate increases from 33.3% (14/42) to 50.0% (21/42).
3. Results Feedback Process
- Classifies experimental results into three categories: improvement, maintenance, and decline.
- Adds embeddings of maintained/declining ideas to the idea library B to avoid redundant verification.
- Appends ideas that successfully improve performance to the prompt for the next round of idea generation.
- Closed-loop effect: Improvement rate increases from 2/7 in Loop 1 to 4/8 in Loop 3.
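The routing described above can be sketched as follows. The tolerance used to decide what counts as "maintenance", the dictionary layout of a result, and all names and numbers in the example are assumptions made for illustration.

```python
def classify_result(score, baseline, tolerance=0.1):
    """Label a result relative to the baseline metric (higher is better)."""
    if score > baseline + tolerance:
        return "improvement"
    if score < baseline - tolerance:
        return "decline"
    return "maintenance"


def apply_feedback(results, baseline, idea_library, prompt_ideas, tolerance=0.1):
    """Route verified ideas: improvements go to the next prompt, the rest to library B."""
    for idea in results:
        label = classify_result(idea["score"], baseline, tolerance)
        if label == "improvement":
            prompt_ideas.append(idea["title"])
        else:
            # Storing the embedding lets the independence check reject
            # similar ideas in later loops, avoiding redundant verification.
            idea_library.append(idea["embedding"])


# Hypothetical results for one loop against a baseline accuracy of 91.0.
results = [
    {"title": "idea-A", "score": 92.4, "embedding": [1.0, 0.0]},
    {"title": "idea-B", "score": 91.0, "embedding": [0.0, 1.0]},
    {"title": "idea-C", "score": 89.7, "embedding": [0.7, 0.7]},
]
idea_library, prompt_ideas = [], []
apply_feedback(results, 91.0, idea_library, prompt_ideas)
print(prompt_ideas, len(idea_library))
```

Only `idea-A` is carried into the next round's prompt; the embeddings of the other two ideas are stored so that near-duplicates are filtered before they are verified again.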
Key Experimental Results
Experimental Setup
- LLM Agent: GPT-4o-2024-08-06 (idea generation); DeepSeek-v2.5 via Ollama (code execution)
- Tasks:
    - 3D Point Cloud Classification: ModelNet40 + PointNet baseline
    - 2D Image Classification: CIFAR-100 + WRN-28-10 baseline
    - Sentiment Classification: SST-2 + BERT-base baseline
- 2 loops executed per task (yielding a total of 40 ideas)
Main Results
| Task | Baseline | Average (Gain) | Max (Gain) | Human-designed SOTA | No. of Effective Ideas |
|---|---|---|---|---|---|
| ModelNet40 OA | 91.0 (PointNet) | 92.0 (+1.0) | 93.9 (+2.9) | 93.8 (GPSFormer) | 5/40 |
| ModelNet40 mAcc | 87.6 (PointNet) | 88.7 (+1.1) | 91.1 (+3.5) | 91.8 (GPSFormer) | 5/40 |
| CIFAR-100 | 81.2 (WRN) | 81.8 (+0.6) | 82.0 (+0.8) | 82.2 (ResNeXt) | 6/40 |
| SST-2 | 91.0 (BERT-base) | 91.8 (+0.8) | 92.5 (+1.5) | 93.1 (BERT-large) | 6/40 |
Highlight Result: PointNet-CSR, automatically generated on ModelNet40, achieves 93.9% OA, on par with the human-designed SOTA GPSFormer (93.8%), while its mAcc (91.1%) remains slightly below GPSFormer's 91.8%.
MLE-bench Results
| Task | Code Source | Previous Score | Dolphin Score |
|---|---|---|---|
| Social Insult Detection | AIDE | 81.0 | 84.7 |
| Tabular Prediction | Kaggle | 95.3 | 96.2 |
| Toxic Comment Classification | Kaggle | 94.7 | 97.2 |
- Dolphin can be flexibly integrated with other frameworks such as AIDE and Agent Laboratory.
- It can also update existing solutions with newer techniques and code versions.
Ablation Study
Analysis of the Idea Generation Process
| Method | No. of Novel Ideas | Average Cost / Idea |
|---|---|---|
| Naive Generation (w/o retrieval) | 8/20 | $0.106 |
| Naive Retrieval + Generation | 13/20 | $0.187 |
| Task-Attribute Filtering (Ours) | 19/20 | $0.184 |
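- Task-attribute filtering raises the proportion of novel ideas to 95% (19/20), compared with 65% (13/20) for naive retrieval and 40% (8/20) without retrieval, at essentially the same cost per idea as naive retrieval ($0.184 vs. $0.187).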
Analysis of the Debugging Process
| Local Code Structure | Traceback Information Extraction | Successful Execution Rate (L1/L2/L3) |
|---|---|---|
| ✗ | ✗ | 4/15, 5/13, 5/14 |
| ✓ | ✗ | 3/15, 5/13, 6/14 |
| ✓ | ✓ | 7/15, 6/13, 8/14 |
- Providing only the local code structure is insufficient (as it may contain irrelevant library information); extracting information from the traceback is required to focus on custom code.
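The 33.3% and 50.0% success rates quoted in the Method section are consistent with pooling the per-loop counts in this table. The short check below reproduces that arithmetic; the dictionary keys are descriptive labels chosen for the example.

```python
from fractions import Fraction

# (successes, attempts) for Loops 1-3, copied from the ablation table.
runs = {
    "neither": [(4, 15), (5, 13), (5, 14)],
    "local code structure only": [(3, 15), (5, 13), (6, 14)],
    "structure + traceback extraction": [(7, 15), (6, 13), (8, 14)],
}


def pooled_rate(counts):
    """Total successes divided by total attempts across all loops."""
    successes = sum(s for s, _ in counts)
    attempts = sum(n for _, n in counts)
    return Fraction(successes, attempts)


for name, counts in runs.items():
    print(f"{name}: {float(pooled_rate(counts)):.1%}")
```

Pooled this way, adding only the local code structure leaves the rate at 14/42, the same as the baseline, and the gain to 21/42 appears only once traceback information extraction is added.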
Analysis of Closed-Loop Feedback
| Loop | Loop 1 | Loop 2 | Loop 3 |
|---|---|---|---|
| Improvement Rate | 2/7 | 3/6 | 4/8 |
- The improvement rate rises from 28.6% (2/7) in Loop 1 to 50% in Loops 2 and 3 (3/6 and 4/8), which supports the value of closed-loop feedback, although the sample sizes are small.
Case Study: PointNet-CSR vs DGCNN
| Dimension | DGCNN (Human-designed) | PointNet-CSR (Dolphin) |
|---|---|---|
| Idea Level | Architecture-level | Module-level |
| Parameters | Learnable parameters | No learnable parameters |
| Structure | Repeated blocks | Single module |
| mAcc / OA | 90.2% / 92.9% | 91.1% / 93.9% |
| Training Speed | ~20.86s/epoch | ~6.12s/epoch (>3x faster) |
PointNet-CSR achieves higher accuracy and faster training than DGCNN with a simpler, single-module design.
Highlights & Insights
- True Closed Loop: The flow of experimental results \(\rightarrow\) feedback \(\rightarrow\) next-round idea generation represents a rare fully closed-loop design among current auto-research frameworks.
- Validation on Public Benchmarks: Unlike AI-Scientist, which relies on custom-built datasets, Dolphin validates the efficacy of its ideas on standard benchmarks such as ModelNet40, CIFAR-100, and SST-2.
- Nearing Human SOTA: The automatically generated method on 3D classification achieves 93.9% OA on ModelNet40, on par with the hand-designed GPSFormer (93.8%), although its mAcc (91.1%) is still below GPSFormer's 91.8%.
- Highly Cost-Effective: The average cost per idea is only about $0.2.
- Exception-Traceback-Guided Debugging: Resolves issues with LLMs struggling to understand complex nested code structures. It upgrades the general practice of "feeding error logs" to structured "local code structure analysis".
Limitations & Future Work
- Knowledge Leakage: Pre-existing knowledge acquired by LLMs during training may lead to the "rediscovery" of existing methods rather than genuine innovation.
- Reliance on Abstracts and Titles Only: Idea generation is solely based on paper titles and abstracts, which limits deep understanding of technical nuances and logical relationships between papers.
- Limitations in Coding Capability: LLMs struggle to comprehend complex, project-level codebases, constraining their ability to verify more intricate tasks.
- Low Ratio of Effective Ideas: Only 5-6 ideas out of 40 are effective (\(12.5\% \sim 15\%\)), meaning substantial computational resources are spent on ineffective verifications.
- Constrained Scope of Tasks: The baseline models verified are relatively simple (PointNet, WRN, BERT-base), and the framework has not been tested on more sophisticated modern architectures.
- Simplistic Feedback Signals: The feedback relies entirely on a coarse "improvement/maintenance/decline" classification, lacking a detailed analysis of failure modes.
Related Work & Insights
- Open-ended Auto-Research: AI-Scientist (Lu et al., 2024) proposes an end-to-end framework but lacks feedback and validation on real benchmarks; Chain of Ideas (Li et al., 2024) generates ideas based on paper chains but lacks experimental verification; NOVA (Hu et al., 2024) enhances novelty through iterative refinement.
- Constrained Auto-Research: AutoML-GPT (Zhang et al., 2023b) utilizes LLMs for hyperparameter tuning; AgentHPO (Liu et al., 2024) iteratively optimizes hyperparameters.
- Code Generation: AIDE (Schmidt, 2024) and Agent Laboratory (Schmidgall et al., 2025) automate code generation in machine learning competitions and laboratory scenarios, respectively.
Rating ⭐⭐⭐⭐
The closed-loop design concept is sound and systematically validated (the improvement rate rises from 2/7 in Loop 1 to 4/8 in Loop 3), yielding results that approach human SOTA on standard benchmarks. Task-attribute-guided ranking and exception-traceback-guided debugging are highly practical technical contributions. The principal limitations stem from the relatively low proportion of effective ideas, simple baseline configurations, and the threat of LLM knowledge leakage. Overall, however, this represents a solid step toward fully automated scientific research.