IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery¶

Conference: ACL 2025
arXiv: 2504.16728
Code: Yes (Open-source platform)
Area: Other
Keywords: Scientific hypothesis generation, Human-in-the-Loop, MCTS, Research ideation, LLM-assisted discovery

TL;DR¶

Proposes IRIS, an open-source interactive research ideation system that achieves human-machine collaborative scientific hypothesis generation through Monte Carlo Tree Search (MCTS) for test-time compute scaling, fine-grained feedback mechanisms, and query-based literature synthesis.

Background & Motivation¶

LLMs have demonstrated great potential in automating scientific discovery, particularly in hypothesis generation, the earliest stage of research. However, existing methods suffer from the following key issues:

Lack of Human Intervention: Most methods (such as AI-Researcher, ResearchAgent) rely on multi-agent frameworks or scaling test-time compute, but are essentially fully autonomous, failing to effectively integrate human oversight.

Alignment Issues: Allocating substantial computation to generate "objectively novel" ideas might result in directions that do not align with the user's research goals.

"Reward Hacking" Behavior: LLMs might use fancy terminology (e.g., "Prompt Learning and Optimization Nexus") or groundlessly propose "graph" structures to achieve high scores; recursive feedback loops that force LLMs to become "more novel" are in fact merely gaming the evaluation metrics.

AI Safety Concerns: Models may hallucinate scientific information or even exhibit deceptive behaviors.

Limitations of Prior Work:
- Generating hypotheses in a single pass, ignoring the iterative nature of research.
- Coarse-grained feedback (overall scoring rather than targeting specific sections).
- Naive retrieval-augmented generation (merely appending keywords or abstracts).
- Unstructured search of the idea space.

Method¶

Overall Architecture¶

IRIS takes research objectives $\mathcal{G}$ (problems + motivations) as input and outputs a research brief $\mathcal{B}$ (title + methodology + experimental plans). The system supports two modes: semi-automatic (human-guided) and fully automatic (MCTS autonomous exploration).

Key Designs¶

1. Three-Agent Architecture¶

Ideation Agent:
- Generates and iteratively refines research briefs.
- Can switch between semi-automatic mode (guided by human researchers) and fully automatic mode (driven by MCTS).

Review Agent:
- Responsible for two tasks: providing reward scores and feedback.
- Defines a hierarchical evaluation taxonomy (based on real scientific review criteria).
- Key innovation (fine-grained feedback): instead of evaluating the entire brief, it provides actionable feedback for specific aspects of specific parts of the brief.
- Human researchers validate the feedback and remove irrelevant parts, thereby mitigating "reward hacking."

Retrieval Agent:
- Generates queries targeted at the research objectives.
- Employs the Ai2 Scholar QA API (Semantic Scholar, 200M+ papers).
- Two-stage retrieval + three-stage generation: passage retrieval $\to$ reranking $\to$ citation extraction $\to$ section planning $\to$ citation report generation.
- Supports researchers uploading PDFs to supplement missing literature.
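The Review Agent's fine-grained feedback, and the human validation step that filters it, can be pictured with a small data structure. This is only a sketch under stated assumptions: `FeedbackItem`, `keep_validated`, `refinement_prompt`, and the example aspect names are hypothetical and are not taken from the IRIS codebase; only the idea of targeting a specific aspect of a specific part of the brief comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class FeedbackItem:
    """One piece of review feedback aimed at a single aspect of a single section."""
    section: str   # part of the brief, e.g. "title", "methodology", "experimental plan"
    aspect: str    # hypothetical review criterion, e.g. "feasibility" or "clarity"
    comment: str   # actionable suggestion from the Review Agent
    score: float   # per-aspect score (the scale is an assumption)


def keep_validated(items: List[FeedbackItem], rejected: Set[int]) -> List[FeedbackItem]:
    """Drop the feedback items the human researcher marked as irrelevant."""
    return [item for i, item in enumerate(items) if i not in rejected]


def refinement_prompt(brief: str, items: List[FeedbackItem]) -> str:
    """Build a refinement instruction that lists only the validated feedback."""
    points = "\n".join(
        f"- [{item.section} / {item.aspect}] {item.comment}" for item in items
    )
    return (
        "Revise the brief below, addressing only these points:\n"
        f"{points}\n\n{brief}"
    )
```

Because the researcher removes items before the prompt is built, a suggestion that merely chases a higher score (for example, adding a "graph" structure without justification) never reaches the Ideation Agent.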

2. MCTS for Hypothesis Generation¶

Function: Systematically explores the vast research idea space.
State Definition: $s = \{ \text{research brief } b, \text{reward } r, \text{latest feedback } f, \text{retrieved knowledge } k \}$
Action Space: $\mathcal{A} = \{ \text{generate, refinement based on retrieval, refinement based on review, refinement based on user feedback} \}$
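UCT Selection Policy: $$\text{UCT}(n) = \frac{Q(n)}{N(n)} + c\sqrt{\frac{\ln N(n_p)}{N(n)}}$$ where $Q(n)$ is the cumulative reward of node $n$, $N(n)$ is its visit count, $n_p$ is its parent node, and $c$ is the exploration constant (decrease $c$ to favor exploitation when the budget is tight).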
Four-Stage Iteration: Selection $\to$ Evaluation $\to$ Expansion $\to$ Backpropagation.
Design Motivation: Unlike math/code (where rewards are objective), the quality of scientific ideation is subjective. Therefore, scores from the Review Agent are used as proxy rewards.
Memory Mechanism: Each agent maintains trajectory-level memory to avoid redundant generation.
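The UCT rule and the search loop described in this section can be sketched as follows. This is a minimal illustration rather than the IRIS implementation: the `Node` class, the action identifiers, and the `apply_action` and `review_score` callables are hypothetical stand-ins for the Ideation and Review Agents, the default `c = 1.4` is an arbitrary choice, and the loop scores each newly expanded node before backpropagating, which is an assumption about how the stages are sequenced.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical identifiers for the four actions in the action space.
ACTIONS = [
    "generate",
    "refine_with_retrieval",
    "refine_with_review",
    "refine_with_user_feedback",
]


@dataclass
class Node:
    brief: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0            # N(n)
    total_reward: float = 0.0  # Q(n)
    untried: List[str] = field(default_factory=lambda: list(ACTIONS))


def uct(node: Node, c: float) -> float:
    """UCT(n) = Q(n)/N(n) + c * sqrt(ln N(n_p) / N(n))."""
    if node.visits == 0:
        return math.inf  # always try unvisited children first
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore


def select(node: Node, c: float) -> Node:
    """Descend through fully expanded nodes, following the highest UCT value."""
    while not node.untried and node.children:
        node = max(node.children, key=lambda child: uct(child, c))
    return node


def expand(node: Node, apply_action: Callable[[str, str], str]) -> Node:
    """Apply one untried action to create a new child brief."""
    action = node.untried.pop()
    child = Node(brief=apply_action(node.brief, action), parent=node)
    node.children.append(child)
    return child


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def search(
    root: Node,
    apply_action: Callable[[str, str], str],
    review_score: Callable[[str], float],
    budget: int,
    c: float = 1.4,
) -> Node:
    """Run `budget` iterations and return the most-visited child of the root."""
    for _ in range(budget):
        leaf = select(root, c)
        if leaf.untried:
            leaf = expand(leaf, apply_action)
        reward = review_score(leaf.brief)  # proxy reward from the reviewer
        backpropagate(leaf, reward)
    return max(root.children, key=lambda child: child.visits)
```

Returning the most-visited child instead of the child with the highest mean reward is a common robustness choice in MCTS; the source does not say which criterion IRIS uses. A smaller `c` shrinks the exploration term, so a limited budget is spent refining the branches that already score well.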

3. Human-AI Co-Design Principles¶

Draws on design principles from Amershi et al. (2019) and Shneiderman (2020).
Minimizes opacity: The MCTS tree interface provides visual control.
Fine-grained feedback instead of general, vague scores.
Maintains human oversight during planning, generation, and review stages.

Experimental Setup¶

LLM Backbone: Gemini-2.0-Flash (via LiteLLM)
Evaluation Metrics:
- Absolute rating: 1-10 points for each hypothesis.
- Relative rating: Head-to-head comparisons between hypotheses, used to compute an Elo rating.
User Study: 8 researchers (AI/NLP, Chemistry, Physics, HCI), 10 case studies, approximately 60 minutes each.
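For readers unfamiliar with the relative metric, the standard Elo update turns a sequence of head-to-head outcomes into ratings. The sketch below uses the textbook formula; the K-factor of 32, the initial rating of 1000, and the function names are assumptions, since the settings used in the paper are not given here.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(
    matches: Iterable[Tuple[str, str, float]],
    k: float = 32.0,         # assumed K-factor
    initial: float = 1000.0  # assumed starting rating
) -> Dict[str, float]:
    """Update ratings over (a, b, score_a) matches.

    score_a is 1.0 if hypothesis a wins, 0.0 if it loses, 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        delta = k * (score_a - expected_score(ra, rb))
        ratings[a] = ra + delta  # the winner gains what the loser gives up
        ratings[b] = rb - delta
    return ratings
```

With these defaults, a single win between two unrated hypotheses moves them to 1016 and 984, because each side's expected score is 0.5.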

Key Experimental Results¶

Automated Evaluation (Figure 3)¶

Metric	Depth 0 → Depth 3	Gain
Absolute Rating	~6.5 → ~7.0	+0.5 points
Elo Rating	~990 → ~1002	+12 points

Hypothesis quality improved consistently as interaction depth increased, with both metrics rising from depth 0 to depth 3.

User Study Ratings (Table 1)¶

Feature/Aspect	Average Rating (1-5 Likert)
Usefulness of Fine-Grained Feedback	4.3 ± 0.7
MCTS Tree Interface (Controllability)	4.2 ± 0.6
Quality of Literature Synthesis	3.7 ± 0.8
Usability and Sense of Control	4.5 ± 0.7
Overall Satisfaction	3.9 ± 0.7

Qualitative Findings¶

Dimension	Proportion	Details
Controllability	100% (8/8)	All users valued the control and transparency offered by the MCTS tree.
Feedback Resonance	87.5% (7/8)	Review feedback often aligned with the users' own concerns.
Novel Insights	50% (4/8)	Feedback occasionally sparked new ideas.
Relevance	62.5% (5/8)	Hypotheses were connected to the ongoing work of the users.

Key Findings¶

Interaction Improves Quality: Hypotheses with user involvement achieved higher quality than those generated purely automatically.
Elo is More Reliable than Absolute Rating: Elo correlation with human preferences yielded Pearson $r=0.60$, whereas absolute rating was only $r=0.45$.
Literature Retrieval Quality Varies by Field: Performance was better in AI/NLP (3.7/5) and poorer in Chemistry/Physics, limited by the corpus coverage of Semantic Scholar.
Usability Received the Highest Rating (4.5/5), which suggests, though the ratings alone cannot establish, that human-AI collaborative design delivers a better user experience than fully automated solutions.
25% of Users Felt Hypotheses Were "Significantly Better", 50% "Slightly Improved", and 100% agreed that it enhanced their understanding of the methodology.
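The comparison between Elo and absolute ratings rests on the Pearson correlation coefficient, $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. A minimal sketch of the computation is given below; the function name is an assumption and no data from the paper is reproduced.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root sums of squared deviations (standard deviations up to the same factor).
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)
```

Here `xs` would hold an automatic metric (Elo or absolute rating) per hypothesis and `ys` the corresponding human preference scores. A value near 1 means the metric ranks hypotheses much as humans do.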

Highlights & Insights¶

Applying MCTS to scientific ideation is a key novelty — using search tree structures to balance exploration and exploitation makes the process more systematic than linear refinement.
Fine-grained, human-validated review feedback mitigates the "reward hacking" problem, which is a major pain point for fully autonomous systems.
Open-source implementation lowers the barrier to entry for the academic community.
The focus on alignment is forward-looking: it calls out "smart plagiarism" and superficial packaging by LLMs in scientific ideation.

Limitations & Future Work¶

Relies on researchers as evaluators, assuming they possess sufficient domain expertise.
Due to budget constraints, stronger LLMs (e.g., Claude 3.7, o1, Gemini-2.5-Pro) were not used.
The user study is small ($N=8$), so its findings have limited statistical power and may not generalize.
Literature retrieval relies on Semantic Scholar, which has insufficient coverage for fields like Chemistry/Physics.
Has not validated the actual feasibility of the generated hypotheses (i.e., whether they can yield valid experiments).
MCTS is computationally intensive and requires budget control.

Related Work & Insights¶

AI-Researcher (Si et al., 2024): Fully automated but found to suffer from "smart plagiarism" issues.
ResearchAgent (Baek et al., 2025): Coarse-grained feedback, recursive refinement leads to reward gaming.
Acceleron (Nigam et al., 2024): Early HITL attempt but lacks flexibility.
OpenScholar (Asai et al., 2024): An advanced system for literature synthesis.
Insights: Future work could establish a true "two-way Socratic" dialogue — where the AI questions the researcher's choices and the researcher validates the AI's suggestions.

Rating¶

Novelty: ⭐⭐⭐⭐ — The combination of MCTS + HITL + fine-grained feedback is a fresh approach in scientific ideation.
Experimental Thoroughness: ⭐⭐⭐ — The user study is small-scale ($N=8$), automated evaluation improvements are limited (+0.5 / +12), and comparisons with strong baselines are lacking.
Writing Quality: ⭐⭐⭐⭐ — Problem motivation is thoroughly explained, system description is detailed, and safety discussions are deep.
Value: ⭐⭐⭐⭐ — The open-source platform brings practical value to the academic community, and the human-AI collaborative design philosophy serves as a valuable exemplar.