Literature Meets Data: A Synergistic Approach to Hypothesis Generation¶

Conference: ACL 2025
arXiv: 2410.17309
Code: https://github.com/ChicagoHAI/hypothesis-generation
Area: Other
Keywords: hypothesis generation, LLM, literature-based, data-driven, human evaluation

TL;DR¶

This work proposes the first method that integrates literature-driven and data-driven hypothesis generation. Using two strategies, Refinement and Union, an LLM generates more generalizable hypotheses from paper abstracts together with observational data. On OOD datasets for five social science classification tasks, the approach improves accuracy by 3.37% on average over the purely data-driven baseline. Human experiments further show, for the first time, that LLM-generated hypotheses significantly improve human decision-making accuracy (+7.44% on deception detection and +14.19% on AI-generated content detection).

Background & Motivation¶

LLM-driven hypothesis generation is an important direction in AI for Science. Existing methods fall roughly into two types: theory-driven (summarizing patterns from literature to generate hypotheses) and data-driven (discovering patterns from observational data to generate hypotheses). Theory-driven methods rely on high-quality literature, are difficult to adapt to new data, and lack empirical support. Data-driven methods (e.g., HypoGeniC) adapt to specific data but easily overfit to particular datasets and generalize poorly. Although the two paradigms have complementary strengths and weaknesses, previous research has not attempted to combine them. The motivation echoes Einstein's remark that "it is the theory which decides what we can observe": prior knowledge from the literature should guide discovery from data.

Core Problem¶

Can literature knowledge and data patterns complement each other? Specifically: (1) How can prior hypotheses from literature and empirical hypotheses from data be integrated within a unified framework? (2) Does this integration generate more generalizable and practical hypotheses? (3) Can these hypotheses truly assist humans in making better decisions?

Method¶

Overall Architecture¶

Given a research question \(q\) (e.g., "what features indicate a review is deceptive"), a set of relevant papers \(\mathcal{P}\), and an observational dataset \(\mathcal{D}\), the goal is to use an LLM \(\mathcal{M}\) to generate a high-quality hypothesis set \(\mathcal{H} = f_{\mathcal{M}}(q, \mathcal{P}, \mathcal{D})\). The entire pipeline consists of three modules: literature-based hypothesis generation, data-driven hypothesis generation (HypoGeniC), and two integration strategies (HypoRefine / Union).

Key Designs¶

Literature-Based Hypothesis Generation: 10 relevant papers are manually collected from Semantic Scholar / Google Scholar \(\rightarrow\) converted to JSON using S2ORC-doc2json \(\rightarrow\) LLMs generate paper abstracts \(\rightarrow\) the abstracts then guide the LLM to generate task-specific hypotheses. To address overly concise hypotheses generated by GPT-4o-mini, a Specificity Booster is introduced to add concrete examples and details.
Data-Driven Hypothesis Generation (HypoGeniC): This module follows the framework of Zhou et al. (2024). In the initialization phase, the LLM generates an initial hypothesis set \(\mathcal{H}_\mathcal{D}\) from a small number of samples. In the update phase, each new sample is predicted with the top-\(k\) highest-reward hypotheses, and mispredicted samples are added to an error sample pool \(\mathcal{W}\). Once the pool is full, new hypotheses are generated from it and the hypothesis repository is updated using a UCB reward function, \(r_i = \frac{\sum_{(x_j, y_j) \in \mathcal{S}_i} \mathbb{I}(y_j = \hat{y}_j)}{|\mathcal{S}_i|} + \alpha \sqrt{\frac{\log t}{|\mathcal{S}_i|}}\), where \(\mathcal{S}_i\) is the set of samples on which hypothesis \(i\) has been evaluated and \(t\) is the training step. The first term is the hypothesis's accuracy and the second is an exploration bonus, so the reward balances accuracy and exploration.
Fusion Strategy I: HypoRefine (Iterative Refinement): In the initialization phase of HypoGeniC, both data samples and paper abstracts are used to generate initial hypotheses. During the update phase, each new hypothesis \(\mathcal{H}_0\) generated from the error pool undergoes multiple rounds of alternating refinement. Odd rounds are refined by a data-refinement agent using error samples \(\mathcal{W}\), while even rounds are refined by a literature-refinement agent using paper information \(\mathcal{P}\), for a total of max_refine=6 rounds.
Fusion Strategy II: Union (Union & De-duplication): Literature-based and data-driven hypothesis repositories are generated separately \(\rightarrow\) duplicates within each are removed via an LLM redundancy detector (pairwise comparison forming a 20×20 matrix) \(\rightarrow\) the top 10 hypotheses from the data-driven repository and 10 randomly selected hypotheses from the literature-based repository are combined to form the final hypothesis repository (size of 20). This approach prevents literature-based hypotheses from being undervalued by the HypoGeniC reward function.
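
The two fusion strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `data_agent`, `lit_agent`, and `is_redundant` are hypothetical callables standing in for LLM calls, and the greedy de-duplication is one simple way to realize the pairwise redundancy check.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical agent signature: (hypothesis, context items) -> refined hypothesis.
Refiner = Callable[[str, Sequence[str]], str]


def hypo_refine(h0: str, error_pool: Sequence[str], papers: Sequence[str],
                data_agent: Refiner, lit_agent: Refiner,
                max_refine: int = 6) -> str:
    """Alternate refinement: odd rounds use error samples, even rounds use papers."""
    h = h0
    for round_idx in range(1, max_refine + 1):
        if round_idx % 2 == 1:
            h = data_agent(h, error_pool)   # data-refinement agent
        else:
            h = lit_agent(h, papers)        # literature-refinement agent
    return h


def dedup(items: Sequence[str],
          is_redundant: Callable[[str, str], bool]) -> List[str]:
    """Greedy pairwise de-duplication (an assumption; keeps the first of each duplicate pair)."""
    kept: List[str] = []
    for h in items:
        if not any(is_redundant(h, other) for other in kept):
            kept.append(h)
    return kept


def union_bank(data_ranked: Sequence[str], lit_bank: Sequence[str],
               is_redundant: Callable[[str, str], bool],
               k: int = 10, seed: int = 0) -> List[str]:
    """Top-k data-driven hypotheses plus k randomly chosen literature hypotheses.

    `data_ranked` is assumed to be sorted by reward, best first.
    """
    data_top = dedup(data_ranked, is_redundant)[:k]
    lit_unique = dedup(lit_bank, is_redundant)
    rng = random.Random(seed)
    lit_sample = rng.sample(lit_unique, min(k, len(lit_unique)))
    return data_top + lit_sample
```

Sampling the literature hypotheses at random, rather than ranking them by reward, mirrors the stated motivation: the literature-based hypotheses are not filtered by the HypoGeniC reward function.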

Loss & Training¶

The hypothesis repository size is fixed to 20.
The training set contains 200 samples, with 10 samples used for initialization.
The reward coefficient is set to \(\alpha=0.5\), and the error pool limit is \(w_{max}=10\).
1 new hypothesis is generated per update.
During inference, CoT prompts are used to guide the LLM to first select the most relevant hypotheses from the 20 available, and then make the prediction.
Temperature is set to \(1 \times 10^{-5}\), with max_tokens=4000.
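
To make the settings above concrete, the following Python sketch computes the UCB reward with \(\alpha=0.5\) and checks the error pool against \(w_{max}=10\). The class and function names are hypothetical, and treating an untested hypothesis as having infinite reward is an assumption made here so the example is well defined, not something the summary specifies.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

ALPHA = 0.5      # reward coefficient
W_MAX = 10       # error pool limit
BANK_SIZE = 20   # hypothesis repository size


@dataclass
class HypothesisStats:
    text: str
    n_correct: int = 0   # correct predictions made with this hypothesis
    n_used: int = 0      # |S_i|: samples the hypothesis was evaluated on


def ucb_reward(stats: HypothesisStats, t: int, alpha: float = ALPHA) -> float:
    """Accuracy plus exploration bonus; t is the training step (t >= 1)."""
    if stats.n_used == 0:
        return math.inf  # assumption: try untested hypotheses first
    accuracy = stats.n_correct / stats.n_used
    bonus = alpha * math.sqrt(math.log(t) / stats.n_used)
    return accuracy + bonus


def top_k(bank: Sequence[HypothesisStats], t: int, k: int) -> List[HypothesisStats]:
    """Hypotheses with the k highest rewards at step t."""
    return sorted(bank, key=lambda s: ucb_reward(s, t), reverse=True)[:k]


def should_generate(error_pool: Sequence[object], w_max: int = W_MAX) -> bool:
    """New hypotheses are generated once the error pool is full."""
    return len(error_pool) >= w_max
```

For example, a hypothesis that was correct on 8 of 10 samples at step \(t=100\) gets a reward of about 1.14: an accuracy of 0.8 plus a bonus of about 0.34. A hypothesis with the same accuracy over 100 samples receives a smaller bonus, so less-tested hypotheses are favored when accuracy is equal.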

Key Experimental Results¶

The evaluation covers five social science classification tasks: deceptive review detection (Deceptive Reviews), AI-generated content detection (LlamaGC and GPTGC), persuasiveness prediction (Persuasive Pairs), and psychological stress detection (Dreaddit).

Method (GPT-4o-mini)	Deceptive (OOD)	LlamaGC (OOD)	GPTGC (OOD)	Persuasive (OOD)	Dreaddit (OOD)
Few-shot k=3	65.56	51.11	64.22	83.64	75.00
Literature-only	59.22	49.00	54.00	78.80	67.68
HypoGeniC	75.22	81.67	68.56	82.20	76.56
HypoRefine	77.78	55.33	63.33	89.04	78.04
Lit∪HypoGeniC	72.41	83.00	69.22	89.88	78.20

On the OOD datasets, the best literature + data method (HypoRefine or Lit∪HypoGeniC) outperforms the other methods across all tasks and model configurations.
Shows an average improvement of 8.97% over few-shot, 15.75% over literature-only, and 3.37% over HypoGeniC.
Cross-model transferability: Hypotheses generated by one model are used by another model for inference, with performance variations < 3% in 10/20 cases.
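
The per-task comparison in the table above can be checked with a short Python script. It is only a sketch over the GPT-4o-mini OOD rows shown here; the reported averages (8.97%, 15.75%, 3.37%) cover configurations that this table does not include, so the script does not try to reproduce them. The function names are illustrative.

```python
from typing import Dict, List, Sequence

TASKS = ["Deceptive", "LlamaGC", "GPTGC", "Persuasive", "Dreaddit"]

# OOD accuracy (%) copied from the table above.
OOD_ACC: Dict[str, List[float]] = {
    "Few-shot k=3":    [65.56, 51.11, 64.22, 83.64, 75.00],
    "Literature-only": [59.22, 49.00, 54.00, 78.80, 67.68],
    "HypoGeniC":       [75.22, 81.67, 68.56, 82.20, 76.56],
    "HypoRefine":      [77.78, 55.33, 63.33, 89.04, 78.04],
    "Lit∪HypoGeniC":   [72.41, 83.00, 69.22, 89.88, 78.20],
}


def best_per_task(table: Dict[str, List[float]]) -> Dict[str, str]:
    """Name of the highest-accuracy method for each task."""
    return {task: max(table, key=lambda m: table[m][i])
            for i, task in enumerate(TASKS)}


def gain_over(table: Dict[str, List[float]], baseline: str,
              methods: Sequence[str]) -> List[float]:
    """Per-task gain of the best of `methods` over `baseline`, in points."""
    return [round(max(table[m][i] for m in methods) - table[baseline][i], 2)
            for i in range(len(TASKS))]


if __name__ == "__main__":
    print(best_per_task(OOD_ACC))
    print(gain_over(OOD_ACC, "HypoGeniC", ["HypoRefine", "Lit∪HypoGeniC"]))
```

In these rows, HypoRefine is the best method on Deceptive and Lit∪HypoGeniC is the best on the other four tasks. The better of the two fusion strategies beats HypoGeniC on every task, with the largest margin on Persuasive.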

Human Evaluation:
- AIGC detection: Human accuracy increased from 58.86% \(\rightarrow\) 73.05% (+14.19%, p=0.01).
- Deception detection: Human accuracy increased from 57.14% \(\rightarrow\) 64.58% (+7.44%, p=0.04).
- 100% of participants found the hypotheses helpful, with >40% rating them as "very helpful" or "extremely helpful".
- Novelty: 84% (deception detection) and 80% (AIGC detection) of the literature-data hypothesis pairs were evaluated by humans as providing mutually novel information.

Ablation Study¶

HypoRefine fails on AIGC detection tasks: Literature refinement led to a 13.64% drop compared to pure HypoGeniC because literature on AIGC detection lacks effective, interpretable features. In this scenario, the Union strategy (Lit∪HypoGeniC) performs better.
Union and Refine have distinct advantages: Refine is superior on Deceptive/Persuasive/Dreaddit (+3.92%), whereas Union performs better on AIGC.
HypoGeniC is occasionally stronger in In-Distribution (IND) settings: Because HypoGeniC is specifically optimized for IND data, the generalization of the integrated methods is primarily demonstrated on OOD datasets.
Hypothesis sets generated by commercial tools such as NotebookLM and HyperWrite contain invalid or irrelevant hypotheses, which hurt inference performance.

Highlights & Insights¶

Pioneering literature + data hypothesis fusion: Fills an intuitive yet previously unaddressed gap with a simple and effective core idea.
Two complementary fusion strategies: HypoRefine (deep fusion) is suitable for scenarios with high-quality literature, while Union (shallow fusion) is more robust. Their combination covers diverse scenarios.
First human experiment validating hypothesis utility: Beyond improving benchmark scores, the work shows empirically that these hypotheses help humans make better decisions.
UCB reward function: Leverages the exploration-exploitation balance from multi-armed bandits to evaluate hypothesis quality, which is both elegant and reasonable.
Cross-model transferability: The generated hypotheses are not tied to a specific LLM and exhibit strong transferability.

Limitations & Future Work¶

Small literature scale and reliance on manual collection: Only 10 papers per task were manually searched and collected; future work should integrate automated literature retrieval (RAG pipelines).
Limited to classification tasks: Research questions are formalized as classification tasks, without covering tasks with non-natural language representations like mathematics or code generation.
Limited scale of human experiments: With only 60 participants, the study cannot distinguish the performance difference between HypoGeniC and HypoRefine in human evaluation.
Hypothesis selection relies on ablation: The 3 hypotheses shown to humans were selected through ablation and subjective judgment, lacking a systematic approach for hypothesis recommendation.
Insufficient hyperparameter search: A single set of hyperparameters was shared across all tasks.

Comparison with Related Work¶

vs. HypoGeniC (Zhou et al., 2024): The direct predecessor and data-driven backbone of this work. HypoGeniC only utilizes data, leading to poor generalization. This work achieves an average improvement of 3.37% on OOD datasets by introducing literature.
vs. ResearchAgent (Baek et al., 2024) / SciMON (Wang et al., 2024): These studies generate hypotheses from literature using knowledge graphs, but they either lack open-source implementations or are difficult to adapt to new tasks. This work builds a simpler, self-contained literature hypothesis pipeline.
vs. AI Scientist (Lu et al., 2024): While AI Scientist aims for a fully automated scientific research pipeline (from idea generation to paper writing), this work focuses specifically on the hypothesis generation phase, emphasizing human agency in scientific research.

Inspirations & Connections¶

Inspirations for AI for Science: The synergistic "theory + data" paradigm can be extended to other scientific discovery domains. For instance, in drug discovery or materials science, existing literature knowledge (such as structure-activity relationships) can guide data mining.
Inspirations for LLM Agent Design: Alternating refinement by dual agents (a literature agent + a data agent) represents a generalizable module for fusing multi-source information.
UCB Application in Hypothesis Management: Adaptive exploration with UCB can be applied to managing other candidate pools, such as idea pool maintenance or prompt optimization.
Hypotheses as Human Decision Aids: Instead of replacing human judgment with AI, this work enhances human capabilities through interpretable hypotheses, highlighting a valuable paradigm for human-AI collaboration.

Rating¶

Novelty: ⭐⭐⭐⭐ First to integrate literature-based and data-driven hypothesis generation paradigms. The core idea is intuitive and natural, although technically it is primarily an extension of HypoGeniC.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets × 2 models × OOD/IND × automatic/human evaluations, along with cross-model transferability and novelty analyses.
Writing Quality: ⭐⭐⭐⭐ Clear structure, well-articulated motivation, and intuitive case studies.
Value: ⭐⭐⭐⭐ Establishes a complete evaluation paradigm for hypothesis generation, with the first human experiment validation marking a major milestone.