IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Conference: ACL 2025
arXiv: 2504.16728
Code: Yes (Open-source platform)
Area: Other
Keywords: Scientific hypothesis generation, Human-in-the-Loop, MCTS, Research ideation, LLM-assisted discovery
TL;DR
Proposes IRIS, an open-source interactive research ideation system for human-AI collaborative hypothesis generation. It combines Monte Carlo Tree Search (MCTS) for test-time compute scaling, fine-grained feedback mechanisms, and query-based literature synthesis.
Background & Motivation
LLMs have shown considerable potential for automating scientific discovery, particularly hypothesis generation, the earliest stage of research. However, existing methods suffer from the following key issues:
Lack of Human Intervention: Most methods (such as AI-Researcher, ResearchAgent) rely on multi-agent frameworks or scaling test-time compute, but are essentially fully autonomous, failing to effectively integrate human oversight.
Alignment Issues: Allocating substantial computation to generate "objectively novel" ideas might result in directions that do not align with the user's research goals.
"Reward Hacking" Behavior: LLMs may use inflated terminology (e.g., "Prompt Learning and Optimization Nexus") or propose "graph" structures without justification in order to score higher; recursive feedback loops that push LLMs to be "more novel" end up gaming the evaluation metrics.
AI Safety Concerns: Models may hallucinate scientific information or even exhibit deceptive behaviors.
Limitations of Prior Work include:
- Generating hypotheses in a single pass, ignoring the iterative nature of research.
- Coarse-grained feedback (overall scoring rather than targeting specific sections).
- Naive retrieval-augmented generation (merely appending keywords or abstracts).
- Unstructured search of the idea space.
Method
Overall Architecture
IRIS takes research objectives \(\mathcal{G}\) (problems + motivations) as input and outputs a research brief \(\mathcal{B}\) (title + methodology + experimental plans). The system supports two modes: semi-automatic (human-guided) and fully automatic (MCTS autonomous exploration).
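To make the input and output concrete, here is a minimal sketch of how \(\mathcal{G}\), \(\mathcal{B}\), and the two modes could be represented. The class and field names (`ResearchGoal`, `ResearchBrief`, `Mode`, `experiment_plan`) are illustrative assumptions, not taken from the IRIS codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Mode(Enum):
    """The two operating modes described above (names are hypothetical)."""
    SEMI_AUTOMATIC = "semi_automatic"    # a human researcher picks each step
    FULLY_AUTOMATIC = "fully_automatic"  # MCTS explores on its own


@dataclass
class ResearchGoal:
    """Input G: a research problem plus the motivation behind it."""
    problem: str
    motivation: str


@dataclass
class ResearchBrief:
    """Output B: title, methodology, and experimental plan."""
    title: str
    methodology: str
    experiment_plan: str

    def sections(self) -> Dict[str, str]:
        # Exposing sections by name lets feedback target one part of the brief.
        return {
            "title": self.title,
            "methodology": self.methodology,
            "experiment_plan": self.experiment_plan,
        }


# Placeholder content, purely for illustration.
goal = ResearchGoal(problem="<research problem>", motivation="<why it matters>")
brief = ResearchBrief(title="<title>", methodology="<method>", experiment_plan="<plan>")
```

Keeping the brief as named sections rather than one block of text is what makes section-level feedback possible later in the pipeline.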
Key Designs
1. Three-Agent Architecture
Ideation Agent:
- Generates and iteratively refines research briefs.
- Can switch between semi-automatic mode (guided by human researchers) and fully automatic mode (driven by MCTS).

Review Agent:
- Responsible for two tasks: providing reward scores and feedback.
- Defines a hierarchical evaluation taxonomy (based on real scientific review criteria).
- Key Innovation, Fine-Grained Feedback: Instead of evaluating the entire brief, it provides actionable feedback for specific aspects of specific parts of the brief.
- Human researchers validate the feedback and remove irrelevant parts, thereby mitigating "reward hacking."

Retrieval Agent:
- Generates queries targeted at the research objectives.
- Employs the Ai2 Scholar QA API (Semantic Scholar, 200M+ papers).
- Two-stage retrieval + three-stage generation: passage retrieval \(\to\) reranking \(\to\) citation extraction \(\to\) section planning \(\to\) citation report generation.
- Supports researchers uploading PDFs to supplement missing literature.
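The Review Agent's fine-grained feedback, followed by human validation, can be sketched as a small data structure plus a filter. This is an assumption-laden illustration: `FeedbackItem`, `keep_validated`, and `group_by_section` are hypothetical names, and the example comments are invented rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class FeedbackItem:
    section: str  # which part of the brief, e.g. "methodology"
    aspect: str   # which review criterion, e.g. "novelty"
    comment: str  # the actionable suggestion


def keep_validated(items: List[FeedbackItem], rejected: Set[int]) -> List[FeedbackItem]:
    """Drop the items (by index) that the researcher marked as irrelevant."""
    return [item for i, item in enumerate(items) if i not in rejected]


def group_by_section(items: List[FeedbackItem]) -> Dict[str, List[FeedbackItem]]:
    """Group surviving feedback so each brief section can be refined separately."""
    grouped: Dict[str, List[FeedbackItem]] = defaultdict(list)
    for item in items:
        grouped[item.section].append(item)
    return dict(grouped)


# Invented feedback; index 1 stands for a suggestion the researcher rejects.
feedback = [
    FeedbackItem("methodology", "novelty", "Explain how this differs from prompt tuning."),
    FeedbackItem("title", "clarity", "Add a more impressive-sounding acronym."),
    FeedbackItem("experiment_plan", "feasibility", "Name the evaluation dataset."),
]
validated = keep_validated(feedback, rejected={1})
by_section = group_by_section(validated)
```

The human filter sits between the reviewer's output and the next refinement step, so a suggestion that would only inflate the score never reaches the Ideation Agent.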
2. MCTS for Hypothesis Generation
- Function: Systematically explores the vast research idea space.
- State Definition: \(s = \{ \text{research brief } b, \text{reward } r, \text{latest feedback } f, \text{retrieved knowledge } k \}\)
- Action Space \(\mathcal{A} = \{ \text{generate, refinement based on retrieval, refinement based on review, refinement based on user feedback} \}\)
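- UCT Selection Policy: \(\text{UCT}(n) = \frac{Q(n)}{N(n)} + c\sqrt{\frac{\ln N(n_p)}{N(n)}}\), where \(Q(n)\) is the cumulative reward of node \(n\), \(N(n)\) is its visit count, \(n_p\) is its parent, and \(c\) is the exploration constant (decrease \(c\) to favor exploitation when the budget is tight).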
- Four-Stage Iteration: Selection \(\to\) Evaluation \(\to\) Expansion \(\to\) Backpropagation.
- Design Motivation: Unlike math/code (where rewards are objective), the quality of scientific ideation is subjective. Therefore, scores from the Review Agent are used as proxy rewards.
- Memory Mechanism: Each agent maintains trajectory-level memory to avoid redundant generation.
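The search loop described above can be sketched in a few dozen lines. This is a generic MCTS skeleton under stated assumptions, not the IRIS implementation: `apply_action` stands in for the Ideation Agent's LLM call, `reward_fn` stands in for the Review Agent's score, the action names are paraphrased from the action space, and the loop uses the common textbook ordering (select, expand, evaluate, backpropagate), which may differ from the paper's ordering.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Paraphrased action names (assumed identifiers).
ACTIONS = ["generate", "refine_with_retrieval", "refine_with_review",
           "refine_with_user_feedback"]


@dataclass
class Node:
    brief: str
    parent: Optional["Node"] = None
    action: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0            # N(n)
    total_reward: float = 0.0  # Q(n)


def uct(node: Node, c: float) -> float:
    """UCT score of a non-root node; unvisited nodes are tried first."""
    if node.visits == 0:
        return math.inf
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def search(root: Node, apply_action: Callable[[str, str], str],
           reward_fn: Callable[[str], float], iterations: int, c: float = 1.0) -> Node:
    for _ in range(iterations):  # the iteration count acts as the compute budget
        # Selection: descend through fully expanded nodes by highest UCT.
        node = root
        while len(node.children) == len(ACTIONS):
            node = max(node.children, key=lambda child: uct(child, c))
        # Expansion: apply the first action not yet tried at this node.
        tried = {child.action for child in node.children}
        action = next(a for a in ACTIONS if a not in tried)
        child = Node(brief=apply_action(node.brief, action), parent=node, action=action)
        node.children.append(child)
        # Evaluation: a proxy reward for the new brief.
        reward = reward_fn(child.brief)
        # Backpropagation: update statistics up to the root.
        current: Optional[Node] = child
        while current is not None:
            current.visits += 1
            current.total_reward += reward
            current = current.parent
    return root


def best_node(root: Node) -> Optional[Node]:
    """Return the non-root node with the highest mean reward."""
    best, stack = None, list(root.children)
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        if best is None or n.total_reward / n.visits > best.total_reward / best.visits:
            best = n
    return best


# Toy stand-ins so the sketch runs without an LLM.
toy_apply = lambda brief, action: brief + "|" + action
toy_reward = lambda brief: min(1.0, brief.count("review") / 3)
tree = search(Node(brief="goal"), toy_apply, toy_reward, iterations=20, c=1.0)
```

In semi-automatic mode the selection step would be replaced by the researcher choosing a node and an action in the tree interface; the bookkeeping stays the same.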
3. Human-AI Co-Design Principles
- Draws on design principles from Amershi et al. (2019) and Shneiderman (2020).
- Minimizes opacity: The MCTS tree interface provides visual control.
- Fine-grained feedback instead of general, vague scores.
- Maintains human oversight during planning, generation, and review stages.
Experimental Setup
- LLM Backbone: Gemini-2.0-Flash (via LiteLLM)
- Evaluation Metrics:
  - Absolute rating: 1-10 points for each hypothesis.
  - Relative rating: head-to-head comparisons between hypotheses, aggregated into an Elo rating.
- User Study: 8 researchers (AI/NLP, Chemistry, Physics, HCI), 10 case studies, approximately 60 minutes each.
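For readers unfamiliar with Elo, the sketch below shows the standard update rule applied to pairwise comparisons of hypotheses. The starting rating of 1000 and the K-factor of 32 are conventional defaults chosen for illustration; the summary does not state which values the paper uses, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings; score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate(ids: List[str], matches: Iterable[Tuple[str, str]],
         initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Process (winner, loser) pairs in order and return the final ratings."""
    ratings = {i: initial for i in ids}
    for winner, loser in matches:
        ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser], 1.0, k)
    return ratings


# Hypothetical comparison outcomes between three hypotheses.
ratings = rate(["h1", "h2", "h3"], [("h1", "h2"), ("h1", "h3"), ("h2", "h3")])
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass stays constant, so scores are only meaningful relative to one another.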
Key Experimental Results
Automated Evaluation (Figure 3)
| Metric | Depth 0 → Depth 3 | Gain |
|---|---|---|
| Absolute Rating | ~6.5 → ~7.0 | +0.5 points |
| Elo Rating | ~990 → ~1002 | +12 points |
User interaction consistently improved hypothesis quality, and the improvement grew with interaction depth.
User Study Ratings (Table 1)
| Feature/Aspect | Average Rating (1-5 Likert) |
|---|---|
| Usefulness of Fine-Grained Feedback | 4.3 ± 0.7 |
| MCTS Tree Interface (Controllability) | 4.2 ± 0.6 |
| Quality of Literature Synthesis | 3.7 ± 0.8 |
| Usability and Sense of Control | 4.5 ± 0.7 |
| Overall Satisfaction | 3.9 ± 0.7 |
Qualitative Findings
| Dimension | Proportion | Details |
|---|---|---|
| Controllability | 100% (8/8) | All users valued the control and transparency offered by the MCTS tree. |
| Feedback Resonance | 87.5% (7/8) | Review feedback often aligned with the users' own concerns. |
| Novel Insights | 50% (5/10) | Feedback occasionally sparked new ideas. |
| Relevance | 62.5% (5/8) | Hypotheses were connected to the ongoing work of the users. |
Key Findings
- Interaction Improves Quality: Hypotheses with user involvement achieved higher quality than those generated purely automatically.
- Elo is More Reliable than Absolute Rating: Elo ratings correlated with human preferences at Pearson \(r=0.60\), compared with only \(r=0.45\) for absolute ratings.
- Literature Retrieval Quality Varies by Field: Performance was better in AI/NLP (3.7/5) and poorer in Chemistry/Physics, limited by the corpus coverage of Semantic Scholar.
- Usability Received the Highest Rating (4.5/5), suggesting that users respond well to the human-AI collaborative design.
- 25% of Users Felt Hypotheses Were "Significantly Better" and 50% "Slightly Improved"; 100% agreed that IRIS enhanced their understanding of the methodology.
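The reliability comparison in the findings above rests on the Pearson correlation between automated scores and human preferences. A minimal sketch of that computation follows; the function name is hypothetical and the sample numbers are made up for illustration only, not drawn from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Made-up scores: automated ratings vs. human preference ranks for five hypotheses.
automated = [1012.0, 995.0, 1003.0, 988.0, 1001.0]
human = [5.0, 2.0, 3.0, 1.0, 4.0]
r = pearson_r(automated, human)
```

A higher \(r\) for Elo than for absolute ratings means that pairwise comparisons track human judgment more closely than independent 1-10 scores do.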
Highlights & Insights
- Applying MCTS to scientific ideation is a key novelty — using search tree structures to balance exploration and exploitation makes the process more systematic than linear refinement.
- Fine-grained, human-validated review feedback effectively addresses the "reward hacking" problem, which is a major pain point for fully autonomous systems.
- Open-source implementation lowers the barrier to entry for the academic community.
- The focus on alignment is forward-looking: the paper calls out "smart plagiarism" and superficial repackaging by LLMs in scientific ideation.
Limitations & Future Work
- Relies on researchers as evaluators, assuming they possess sufficient domain expertise.
- Due to budget constraints, stronger LLMs (e.g., Claude 3.7, o1, Gemini-2.5-Pro) were not used.
- The user study is small (\(N=8\)), so its findings have limited statistical power.
- Literature retrieval relies on Semantic Scholar, which has insufficient coverage for fields like Chemistry/Physics.
- The actual feasibility of the generated hypotheses (i.e., whether they lead to valid experiments) has not been validated.
- MCTS is computationally intensive and requires budget control.
Related Work & Insights
- AI-Researcher (Si et al., 2024): Fully automated but found to suffer from "smart plagiarism" issues.
- ResearchAgent (Baek et al., 2025): Uses coarse-grained feedback; its recursive refinement leads to reward gaming.
- Acceleron (Nigam et al., 2024): Early HITL attempt but lacks flexibility.
- OpenScholar (Asai et al., 2024): An advanced system for literature synthesis.
- Insights: Future work could establish a true "two-way Socratic" dialogue — where the AI questions the researcher's choices and the researcher validates the AI's suggestions.
Rating
- Novelty: ⭐⭐⭐⭐ — The combination of MCTS + HITL + fine-grained feedback is a fresh approach in scientific ideation.
- Experimental Thoroughness: ⭐⭐⭐ — The user study is small-scale (\(N=8\)), automated evaluation improvements are limited (+0.5 / +12), and comparisons with strong baselines are lacking.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is thoroughly explained, system description is detailed, and safety discussions are deep.
- Value: ⭐⭐⭐⭐ — The open-source platform brings practical value to the academic community, and the human-AI collaborative design philosophy serves as a valuable exemplar.