AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

Conference: ACL 2026
arXiv: 2604.18398
Code: https://github.com/yxwang19/AlphaContext
Area: LLM/NLP
Keywords: Creativity Assessment, Psychometrics, Evolutionary Algorithm, MCTS Text Generation, MAP-Elites

TL;DR

This paper proposes AlphaContext, an evolutionary tree-based psychometric context generator that automatically produces high-quality long-form contexts for creativity assessment. It comprises four modules: HyperTree outline planning, MCTS sentence-level generation, MAP-Elites diversity optimization, and assessment-guided iterative refinement. It achieves an average improvement of 8% over competitive baselines across 7 evaluation dimensions.

Background & Motivation

Background: Creativity assessment has become increasingly important in the LLM era. Psychometric research identifies scenario-based assessment as an effective approach for measuring creative thinking—presenting subjects with a future-oriented context and asking them to identify potential challenges in order to elicit creative responses. This paradigm originates from the Future Problem Solving Program (FPSP).

Limitations of Prior Work: High-quality creativity assessment contexts still rely on expert manual design, creating a severe production bottleneck (each context requires at least one week to craft). Existing LLM-based generation methods face two key challenges: (1) difficulty in simultaneously satisfying implicit assessment cue embedding and global narrative coherence; and (2) difficulty in achieving diversity while guaranteeing quality and measurement validity.

Key Challenge: Psychometric contexts differ fundamentally from ordinary stories—they must implicitly embed assessment cues within a coherent narrative, and these cues must effectively elicit creative thinking. Conventional story generation frameworks cannot satisfy such fine-grained constraints.

Goal: To automatically generate psychometric contexts that can substitute for expert-designed ones, while guaranteeing narrative coherence, assessment cue alignment, and stylistic diversity.

Key Insight: The context generation process is decomposed into three stages—planning, generation, and evolution—with search algorithms applied to ensure global structure, local quality, and diverse coverage respectively.

Core Idea: A HyperTree structure formalizes the expert outline design process; MCTS performs sentence-level search for optimal text under outline constraints; MAP-Elites iteratively evolves contexts in a stylistic behavior space; and virtual subject simulation validates assessment effectiveness.

Method

Overall Architecture

Given a title and topic query \(Q\), AlphaContext passes through four modules: (1) the HyperTree Outline Planner generates a structured outline; (2) the MCTS-based Context Generator performs sentence-level search under outline constraints to produce seed contexts; (3) the Evolutionary Context Optimizer applies MAP-Elites to iteratively evolve contexts in a style space, improving diversity and quality; and (4) the Assessment-Guided Evolution Refiner simulates virtual subjects and re-evolves low-efficacy contexts.
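
The data flow between the four modules can be sketched as a small driver. This is a minimal illustration, not the authors' implementation: the function `alpha_context_pipeline`, its callable parameters, and the `threshold`, `max_rounds`, and `n_seeds` defaults are all hypothetical stand-ins for the HOP, MCG, ECO, and virtual-subject components.

```python
from typing import Callable, List


def alpha_context_pipeline(
    query: str,
    plan_outline: Callable[[str], List[str]],    # stand-in for the HyperTree Outline Planner
    generate_seed: Callable[[List[str]], str],   # stand-in for the MCTS-based generator
    evolve: Callable[[List[str]], List[str]],    # stand-in for the MAP-Elites optimizer
    efficacy: Callable[[str], float],            # stand-in for virtual-subject scoring
    threshold: float = 0.5,                      # assumed efficacy cut-off
    max_rounds: int = 3,                         # assumed cap on refinement rounds
    n_seeds: int = 2,                            # assumed number of seed contexts
) -> List[str]:
    """Plan -> generate seeds -> evolve -> re-evolve low-efficacy contexts."""
    outline = plan_outline(query)
    seeds = [generate_seed(outline) for _ in range(n_seeds)]
    population = evolve(seeds)

    for _ in range(max_rounds):
        scored = [(ctx, efficacy(ctx)) for ctx in population]
        weak = [ctx for ctx, score in scored if score < threshold]
        if not weak:
            break  # every context already elicits enough creativity
        strong = [ctx for ctx, score in scored if score >= threshold]
        # Only the low-efficacy contexts are sent back for further evolution.
        population = strong + evolve(weak)
    return population
```

Passing the modules in as callables keeps the control flow visible without committing to any particular LLM backend.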

Key Designs

HyperTree Outline Planner (HOP):
- Function: Formalizes the expert outline design process as a rule-guided hypertree search.
- Mechanism: Defines a hypertree \(\mathcal{H} = (N, Q, \mathcal{R})\), where hyperedges connect a parent node to sets of child nodes, supporting hierarchical divide-and-conquer. The search proceeds in four steps: HT-Select (evaluates and prunes hyperedges to select the optimal leaf node) → HT-Expand (applies expansion rules to generate candidate child groups) → HT-Construct (iteratively builds until a termination condition is met) → HT-Decide (globally evaluates candidates to select the final outline).
- Design Motivation: Experts design contexts by first planning the overall structure and then refining layer by layer; the hypertree structure captures this hierarchical divide-and-conquer process more faithfully than a standard tree. Ablation results show that removing HOP causes Relevance to drop from 79.06% to 70.20%.
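
The four search steps can be illustrated with a toy hypertree. The sketch below is an assumption-laden simplification rather than the paper's planner: the `RULES` table, the `Node` class, and the length-based `score_group` are hypothetical, and a real system would replace `score_group` with an LLM evaluator. Selection here is breadth-first, and pruning is done by keeping the top `keep` child groups at expansion time.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional

# Hypothetical expansion rules: node label -> candidate child groups.
RULES: Dict[str, List[List[str]]] = {
    "context": [["setting", "tension", "open question"], ["setting", "open question"]],
    "setting": [["time", "place"]],
    "tension": [["stakeholders", "trade-off"]],
}


@dataclass
class Node:
    label: str
    depth: int = 0
    # Each hyperedge links this node to a whole group of children at once.
    hyperedges: List[List["Node"]] = field(default_factory=list)


def score_group(labels: List[str]) -> float:
    """Toy evaluator: prefer groups or outlines that cover more distinct aspects."""
    return float(len(set(labels)))


def ht_select(root: Node, max_depth: int) -> Optional[Node]:
    """Breadth-first search for the next expandable leaf, or None if none is left."""
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        if node.hyperedges:
            for group in node.hyperedges:
                frontier.extend(group)
        elif node.label in RULES and node.depth < max_depth:
            return node
    return None


def ht_expand(node: Node, keep: int = 2) -> None:
    """Apply the rules to a leaf and keep only the `keep` best child groups."""
    ranked = sorted(RULES[node.label], key=score_group, reverse=True)
    node.hyperedges = [[Node(lbl, node.depth + 1) for lbl in g] for g in ranked[:keep]]


def ht_construct(root: Node, max_depth: int = 2) -> None:
    """Alternate select and expand until no expandable leaf remains."""
    while (leaf := ht_select(root, max_depth)) is not None:
        ht_expand(leaf)


def candidate_outlines(node: Node) -> List[List[str]]:
    """Enumerate outlines: choose one hyperedge per expanded node, collect the leaves."""
    if not node.hyperedges:
        return [[node.label]]
    outlines = []
    for group in node.hyperedges:
        for combo in product(*(candidate_outlines(child) for child in group)):
            outlines.append([lbl for part in combo for lbl in part])
    return outlines


def ht_decide(root: Node) -> List[str]:
    """Score every complete candidate outline and return the best one."""
    return max(candidate_outlines(root), key=score_group)


def plan_outline(topic: str = "context", max_depth: int = 2) -> List[str]:
    root = Node(topic)
    ht_construct(root, max_depth)
    return ht_decide(root)
```

The point of the hyperedge representation is that sibling sections are proposed and scored as a group, so an outline is chosen by comparing whole decompositions rather than individual children.
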
MCTS-based Context Generator (MCG):
- Function: Generates high-quality seed contexts through sentence-level search under outline constraints.
- Mechanism: Treats text generation as a sentence-level decision process, with an LLM proposing candidate sentences at each step. A dual-horizon evaluation mechanism is adopted—high-scoring nodes are evaluated using immediate scores (a weighted average of scenario cue alignment \(S_{sc}\), imagery vividness \(S_{im}\), and discourse coherence \(S_{co}\), multiplied by \(1-S_{ha}\) for hallucination risk), while low-scoring nodes trigger short-continuation lookahead for re-evaluation. A UCT formula balances exploration and exploitation.
- Design Motivation: Sentence-level search maintains long-range structural consistency better than one-shot generation. Removing MCG causes Coherence to drop from 81.28% to 74.38%.
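
A minimal sketch of the scoring side of this search follows. The equal default `weights`, the `threshold` that separates high-scoring from low-scoring nodes, and the exploration constant `c` are assumptions for illustration, and `uct` is the textbook UCT rule rather than a variant confirmed by the paper.

```python
import math
from typing import Callable, Tuple


def immediate_score(
    s_sc: float,   # scenario cue alignment
    s_im: float,   # imagery vividness
    s_co: float,   # discourse coherence
    s_ha: float,   # hallucination risk in [0, 1]
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weights
) -> float:
    """Weighted average of the three quality scores, discounted by hallucination risk."""
    w_sc, w_im, w_co = weights
    quality = (w_sc * s_sc + w_im * s_im + w_co * s_co) / (w_sc + w_im + w_co)
    return quality * (1.0 - s_ha)


def dual_horizon_score(
    immediate: float,
    lookahead: Callable[[], float],  # scores the node again after a short continuation
    threshold: float = 0.5,          # assumed cut-off between high and low scores
) -> float:
    """Trust the immediate score when it is high; otherwise pay for a lookahead."""
    if immediate >= threshold:
        return immediate
    return lookahead()


def uct(total_value: float, visits: int, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean value plus an exploration bonus; unvisited nodes come first."""
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

Making `lookahead` a callable means the extra generation cost is only incurred for the low-scoring nodes that actually need re-evaluation.
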
Evolutionary Context Optimizer (ECO) + Assessment-Guided Refiner:
- Function: MAP-Elites evolutionary search improves stylistic diversity; virtual subject simulation validates assessment effectiveness.
- Mechanism: A 3-dimensional behavior space is defined (proximity range \(\phi_1\), knowledge density \(\phi_2\), perspective diversity \(\phi_3\)) and discretized into a grid, with each cell storing the current best context. Seed contexts are edited via insertion, deletion, and substitution mutations, and elites are updated according to a fitness function (average of coherence, relevance, and engagement scores). A virtual subject simulator (with talkative, normal, and quiet styles) generates responses; contexts whose creativity scores fall below a threshold are returned for further evolution.
- Design Motivation: A single topic requires contexts of varied styles to suit different assessment populations. MAP-Elites naturally supports joint optimization of diversity and quality. Removing ECO degrades all metrics, with the largest drop observed in Uncertainty.
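
The archive update at the heart of MAP-Elites can be sketched in a few lines. Everything named below is hypothetical: `toy_descriptor` and `toy_fitness` are arbitrary stand-ins for the LLM-scored behavior dimensions and fitness, `POOL` is made-up sentence material, and the grid size and iteration count are assumed values.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Context = List[str]            # a context represented as a list of sentences
Cell = Tuple[int, ...]         # index of a grid cell in the behavior space

# Made-up sentence material used by the mutation operators.
POOL = [
    "By 2050 the city floods twice a year.",
    "Residents disagree about who should move.",
    "What happens to the old harbor?",
    "Engineers propose 3 floating districts.",
]


def to_cell(phi: Sequence[float], bins: int) -> Cell:
    """Discretize a descriptor in [0, 1]^3 into grid indices."""
    return tuple(min(int(x * bins), bins - 1) for x in phi)


def mutate(ctx: Context, pool: List[str], rng: random.Random) -> Context:
    """Apply one sentence-level insertion, deletion, or substitution."""
    child = list(ctx)
    op = rng.choice(["insert", "delete", "substitute"])
    if op == "delete" and len(child) <= 1:
        op = "insert"  # never produce an empty context
    if op == "insert":
        child.insert(rng.randrange(len(child) + 1), rng.choice(pool))
    elif op == "delete":
        del child[rng.randrange(len(child))]
    else:
        child[rng.randrange(len(child))] = rng.choice(pool)
    return child


def toy_descriptor(ctx: Context) -> Tuple[float, float, float]:
    """Arbitrary stand-ins for the three behavior dimensions."""
    n = len(ctx)
    phi1 = min(n / 8, 1.0)
    phi2 = sum(any(ch.isdigit() for ch in s) for s in ctx) / n
    phi3 = len({s.split()[0] for s in ctx}) / n
    return (phi1, phi2, phi3)


def toy_fitness(ctx: Context) -> float:
    """Average of three toy scores standing in for coherence, relevance, engagement."""
    n = len(ctx)
    coherence = 1.0 / (1.0 + abs(n - 4))
    relevance = sum("city" in s.lower() for s in ctx) / n
    engagement = sum(s.endswith("?") for s in ctx) / n
    return (coherence + relevance + engagement) / 3.0


def map_elites(
    seeds: List[Context],
    descriptor: Callable[[Context], Sequence[float]],
    fitness: Callable[[Context], float],
    pool: List[str],
    iterations: int = 200,
    bins: int = 4,
    seed: int = 0,
) -> Dict[Cell, Tuple[float, Context]]:
    rng = random.Random(seed)
    archive: Dict[Cell, Tuple[float, Context]] = {}

    def try_insert(ctx: Context) -> None:
        cell = to_cell(descriptor(ctx), bins)
        f = fitness(ctx)
        # A cell keeps only the best context seen so far for that style.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, ctx)

    for s in seeds:
        try_insert(s)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, pool, rng))
    return archive
```

Because each cell competes only with contexts of the same style, a mediocre but stylistically novel context survives, which is how the archive gains coverage without lowering the per-cell quality bar.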

Loss & Training

AlphaContext is an unsupervised search framework and does not involve a conventional loss function. Quality is assessed via an LLM-based scorer (DeepSeek-V3.1), and the fitness function is the mean of the coherence, relevance, and engagement scores: \(F(C) = \frac{1}{3}\big(S_{coh}(C) + S_{rel}(C) + S_{eng}(C)\big)\).

Key Experimental Results

Main Results

| Method | Coherence↑ | Relevance↑ | Engagement↑ | Significance↑ | Uncertainty↑ |
|---|---|---|---|---|---|
| GPT-5.1 | 70.44 | 70.20 | 65.39 | 50.37 | 68.60 |
| Gemini-3.0-Pro | 72.54 | 75.37 | 62.56 | 48.40 | 63.30 |
| SS-GEN | 60.22 | 69.69 | 56.40 | 60.10 | 53.57 |
| AlphaContext | 81.28 | 79.06 | 79.93 | 71.06 | 80.30 |

Ablation Study

| Configuration | Coherence | Relevance | Engagement | Uncertainty |
|---|---|---|---|---|
| Full AlphaContext | 81.28 | 79.06 | 79.93 | 80.30 |
| w/o HOP | 77.96 | 70.20 | 76.85 | 76.11 |
| w/o MCG | 74.38 | 71.80 | 72.17 | 71.92 |
| w/o ECO | 75.62 | 70.57 | 71.80 | 70.69 |
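
The per-module drops implied by the ablation table can be recomputed directly. The snippet below only restates the numbers from the table; the names `FULL`, `ABLATIONS`, `drops`, and `largest_drop` are illustrative.

```python
from typing import Dict, Tuple

FULL = {"Coherence": 81.28, "Relevance": 79.06, "Engagement": 79.93, "Uncertainty": 80.30}

ABLATIONS = {
    "w/o HOP": {"Coherence": 77.96, "Relevance": 70.20, "Engagement": 76.85, "Uncertainty": 76.11},
    "w/o MCG": {"Coherence": 74.38, "Relevance": 71.80, "Engagement": 72.17, "Uncertainty": 71.92},
    "w/o ECO": {"Coherence": 75.62, "Relevance": 70.57, "Engagement": 71.80, "Uncertainty": 70.69},
}


def drops(full: Dict[str, float], ablated: Dict[str, float]) -> Dict[str, float]:
    """Absolute drop in points for each metric when a module is removed."""
    return {metric: round(full[metric] - ablated[metric], 2) for metric in full}


def largest_drop(full: Dict[str, float], ablated: Dict[str, float]) -> Tuple[str, float]:
    """Metric that suffers most from the ablation, with the size of the drop."""
    d = drops(full, ablated)
    metric = max(d, key=d.get)
    return metric, d[metric]


for name, scores in ABLATIONS.items():
    print(name, drops(FULL, scores), "largest:", largest_drop(FULL, scores))
```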

Key Findings

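AlphaContext ranks first across all 7 dimensions. It leads the runner-up by 10.96 points in Significance (71.06 vs. 60.10 for SS-GEN) and by 11.70 points in Uncertainty (80.30 vs. 68.60 for GPT-5.1); among the baselines listed in the main results table, the Engagement margin is larger still (79.93 vs. 65.39 for GPT-5.1, +14.54 points).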
In human preference evaluation, AlphaContext achieves a win rate of 62% against GPT-5.1 and 74% against Gemini-3.0-Pro; human and LLM judgments show high agreement (Cohen's κ > 0.8).
In a real human experiment involving 36 middle school students, creativity scores follow a normal distribution and correlate with the standardized Alternate Uses Task (AUT) at Pearson r = 0.377, a moderate correlation that supports criterion validity.
Generating one context takes approximately 227 seconds—far faster than expert design (approximately one week)—at an acceptable cost.
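
Margins over the runner-up can be recomputed from the main results table. This sketch covers only the five columns and three baselines reproduced there; `RESULTS` and `margins_over_runner_up` are illustrative names.

```python
from typing import Dict, Tuple

RESULTS: Dict[str, Dict[str, float]] = {
    "GPT-5.1": {"Coherence": 70.44, "Relevance": 70.20, "Engagement": 65.39,
                "Significance": 50.37, "Uncertainty": 68.60},
    "Gemini-3.0-Pro": {"Coherence": 72.54, "Relevance": 75.37, "Engagement": 62.56,
                       "Significance": 48.40, "Uncertainty": 63.30},
    "SS-GEN": {"Coherence": 60.22, "Relevance": 69.69, "Engagement": 56.40,
               "Significance": 60.10, "Uncertainty": 53.57},
    "AlphaContext": {"Coherence": 81.28, "Relevance": 79.06, "Engagement": 79.93,
                     "Significance": 71.06, "Uncertainty": 80.30},
}


def margins_over_runner_up(
    results: Dict[str, Dict[str, float]], system: str = "AlphaContext"
) -> Dict[str, Tuple[str, float]]:
    """For each metric, return the best competing method and the margin in points."""
    out = {}
    for metric, ours in results[system].items():
        rival, best = max(
            ((name, scores[metric]) for name, scores in results.items() if name != system),
            key=lambda pair: pair[1],
        )
        out[metric] = (rival, round(ours - best, 2))
    return out


for metric, (rival, margin) in margins_over_runner_up(RESULTS).items():
    print(f"{metric}: +{margin:.2f} points over {rival}")
```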

Highlights & Insights

The three-stage "planning–search–evolution" design is highly systematic: HyperTree ensures global structure, MCTS optimizes local quality, and MAP-Elites expands diversity. This framework is transferable to other structured long-form text generation scenarios (e.g., lesson plan design, examination question generation).
Using virtual subject simulation to validate assessment effectiveness is an elegant closed-loop design that avoids the high cost of relying on real human experiments.
The real human experiment validates the psychometric validity of the generated contexts—a rare but highly persuasive contribution in NLP research.

Limitations & Future Work

Generation cost is relatively high (~12.9k tokens per context), requiring multiple LLM calls; future work could distill the system into a lightweight generator.
The CreaTE dataset consists of expert-curated title–topic pairs and is limited in scale (203 instances); domain coverage warrants expansion.
The current approach targets only future-oriented contexts; applicability to other creativity assessment types (e.g., open-ended tasks) has not been validated.
The representativeness of the virtual subject simulator depends on how well the underlying LLM approximates real human creative behavior.
The efficiency of sentence-level MCTS and MAP-Elites is sensitive to the choice of the underlying LLM and evaluator.

Related Work

vs. DOC/CRITICS: These story generation frameworks focus on narrative entertainment and fluency, and do not meet the quality and validity requirements of psychometric assessment.
vs. SS-GEN: SS-GEN generates social stories for autism intervention—a fundamentally different setting from creativity assessment.
vs. CPIG: CPIG generates short items and is unsuitable for long-form contexts that require discourse coherence and implicit cue embedding.

Rating

Novelty: ⭐⭐⭐⭐⭐ The combination of HyperTree + MCTS + MAP-Elites is highly novel in text generation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Ablation studies, human preference evaluation, real human experiments, and case studies are all present.
Writing Quality: ⭐⭐⭐⭐ Structure is clear, though notation is dense.
Value: ⭐⭐⭐⭐ Opens a new direction for LLM-assisted psychometric context generation.