
Discovering Novel LLM Experts via Task-Capability Coevolution

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=efNINVs2So
Code: To be confirmed
Area: LLM / Open-ended Evolution / Model Merging
Keywords: open-endedness, coevolution, model merging, quality-diversity, collective intelligence, LLM discovery

TL;DR

AC/DC coevolves a population of LLMs (produced by evolutionary model merging) with a population of synthetic tasks (generated by a "Scientist LLM"). A single run automatically discovers a complementary suite of small expert models whose collective Coverage can exceed that of larger models in the same family and approach or surpass GPT-4o, while using significantly fewer total parameters.

Background & Motivation

Background: Development of frontier models aims for the continuous emergence of diverse capabilities. However, current "pre-training + post-training" paradigms require manual intervention to initialize new training runs with static datasets or reward functions whenever a new capability is needed. Even with synthetic data and broad reward-based self-improvement, only one massive and static model is produced per cycle.

Limitations of Prior Work: Relying on a single large static model for all real-world problems presents two dilemmas: first, "fractured entangled representations" and high inference costs make it difficult for a single model to robustly cover all tasks; second, the high barrier of scaling models and datasets is inaccessible for most ML researchers.

Key Challenge: Developers are forced to push the frontier through incremental improvements to static data, environments, algorithms, or architectures. Each run is manually initialized and outputs a single model, which fundamentally conflicts with the goal of continuous, open-ended accumulation of new capabilities: knowledge accumulation is open-ended, but training paradigms remain closed and directed.

Goal: Drawing on Collective Intelligence (CI) and open-endedness (OE), this work aims to automatically discover a group of small, accessible, and complementary expert LLMs in a single run. The system does not explicitly optimize for any specific benchmark but allows newer, more complex skills to emerge continuously over time.

Core Idea: [Model-Task Coevolution] Open-ended coevolution is extended to LLM discovery for the first time. The framework evolves an LLM population via evolutionary model merging (crossover + mutation) while simultaneously evolving a task population using a "Scientist LLM" via synthetic data generation; the two act as each other's environment, becoming more difficult and diverse together. [Minimal Criteria + Quality-Diversity] Minimal Criteria (MC) are used to filter degenerate models and unsolvable tasks, while Quality-Diversity (QD) selection ensures the population is both high-quality and behaviorally diverse.

Method

Overall Architecture

AC/DC (Assessment Coevolving with Diverse Capabilities) maintains two continuously updated archives: a Model Archive $A_M$ (the active LLMs, selected by Dominated Novelty Search (DNS) over their skill vectors) and a Synthetic Task Archive $A_Q$ (a set of increasingly complex and novel challenges). Each generation performs "Model Evolution": select parents, generate offspring via merging and mutation, evaluate them on the task archive to obtain skill vectors, filter out degenerate models, and update the archive with DNS. Every $G_{task}$ generations, "Task Evolution" is triggered: the Scientist LLM generates new tasks, the tasks are filtered for novelty and solvability, and skill vectors are backfilled for the existing models. Finally, the $N_{tf}$ models with the highest collective coverage of the synthetic task distribution are selected as a "Task Force" and evaluated on out-of-distribution (OOD) real-world benchmarks.

flowchart LR
    subgraph M[Model Evolution Each Gen]
      A[SelectParents] --> B[Crossover+Mutation Generate N Offspring]
      B --> C[Evaluate on Task Archive<br/>Get Skill Vectors]
      C --> D[Gibberish Filter<br/>Remove Degenerate Models]
      D --> E[DNS Update Archive A_M]
    end
    subgraph Q[Task Evolution Every Gtask Gens]
      F[Scientist LLM Generates Tasks] --> G[Novelty Filtering]
      G --> H[Reflection+Verification<br/>Self-solve+Execute Scoring Function]
      H --> I[Unsolvable Task Filter<br/>Update Task Archive A_Q]
      I --> J[Backfill/Re-evaluate Skill Vectors]
    end
    E -.Triggered every Gtask.-> F
    J -.Next Gen Env.-> A
    E --> K[Select Ntf Models with Max Coverage<br/>to form Task Force]
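
The final Task Force step can be illustrated with a small sketch. The source only states that the $N_{tf}$ models with the highest collective coverage are selected; the greedy strategy below, and the names `select_task_force` and `skill_vectors`, are assumptions made for illustration, not the authors' procedure.

```python
from typing import List, Sequence, Set, Tuple


def select_task_force(
    skill_vectors: Sequence[Sequence[int]], n_tf: int
) -> Tuple[List[int], float]:
    """Greedily pick up to n_tf models that maximize collective coverage.

    skill_vectors[i][q] is 1 if model i solves task q, else 0.
    Greedy selection is an assumption; the source does not say how the
    highest-coverage subset is found.
    """
    n_tasks = len(skill_vectors[0])
    # Set of task indices each model solves.
    solved: List[Set[int]] = [
        {q for q, hit in enumerate(vec) if hit} for vec in skill_vectors
    ]
    chosen: List[int] = []
    covered: Set[int] = set()
    while len(chosen) < min(n_tf, len(solved)):
        remaining = [i for i in range(len(solved)) if i not in chosen]
        # Pick the model whose addition covers the most tasks overall.
        best = max(remaining, key=lambda i: len(covered | solved[i]))
        chosen.append(best)
        covered |= solved[best]
    return chosen, len(covered) / n_tasks


archive = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
team, team_coverage = select_task_force(archive, n_tf=2)
```

In this toy archive the second pick is the model that solves only the last task, because it adds more new coverage than a stronger but redundant model would.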

Key Designs

1. Evolutionary Model Merging: Generating LLM Populations via Crossover and Singular Value Mutation. AC/DC does not train from scratch; it uses existing LLMs as "stepping stones" for gradient-free evolution. Crossover is a weighted linear interpolation of the task vectors $\tau_{p_i}=\theta_{\text{parent}_i}-\theta_{\text{base}}$ of two parents (following the CycleQD approach). Mutation applies Singular Value Decomposition (SVD) $W=U\Sigma V^T$ to each weight matrix $W$ of the merged model and adds noise only to the top-$k$ singular values in $\Sigma$ before reconstruction. Because $U$ and $V$ are left unchanged, the mutation rescales the dominant directions of the weight matrix while preserving its overall geometry, which avoids destructive mutations. This operator reduces the cost of "creating a new expert" from a training run to a single merge.
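
A minimal NumPy sketch of the two operators on a single weight matrix follows. The function names, the mixing weight `alpha`, the additive Gaussian form of the noise, and its scale `sigma` are assumptions for illustration; the source specifies only task-vector interpolation and noise on the top-$k$ singular values.

```python
import numpy as np


def crossover(base, parent_a, parent_b, alpha=0.5):
    """Interpolate the parents' task vectors and add the result to the base."""
    tau_a = parent_a - base  # task vector of parent A
    tau_b = parent_b - base  # task vector of parent B
    return base + alpha * tau_a + (1.0 - alpha) * tau_b


def svd_mutate(weight, k=2, sigma=0.01, rng=None):
    """Perturb only the top-k singular values of a matrix, then rebuild it."""
    rng = np.random.default_rng() if rng is None else rng
    # Singular values are returned in descending order, so s[:k] is the top-k.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = min(k, s.size)
    s = s.copy()
    # Additive Gaussian noise is an assumption; U and V^T stay untouched.
    s[:k] += sigma * rng.standard_normal(k)
    return (u * s) @ vt


base = np.zeros((2, 2))
child = crossover(base, np.ones((2, 2)), 3.0 * np.ones((2, 2)), alpha=0.5)

w = np.diag([5.0, 2.0, 1.0])
mutated = svd_mutate(w, k=1, sigma=0.1, rng=np.random.default_rng(0))
```

In a full model, these operators would be applied per weight matrix across the parents' parameter dictionaries; the singular values outside the top $k$ are reconstructed unchanged.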

2. Coverage Metric + Skill Vectors: Aiming for Collective Complementarity. The value of the population is measured by collective solvability rather than individual model accuracy. Given $Q$ queries and $N$ models, Coverage is defined as: $$\text{Coverage}=\frac{1}{Q}\sum_{q=1}^{Q}\left(\bigvee_{i=1}^{N}(x_{q,i}=y_q)\right)$$ where $x_{q,i}$ is model $i$'s answer to query $q$, $y_q$ is the reference answer, and $\bigvee$ is logical OR, so a query counts as covered if any model in the population answers it correctly. Each model is represented by a binary skill vector (one entry per task, set to 1 if the model solves that task), which acts as a behavioral signature. This allows direct comparison of model differences without pre-defining niches as in MAP-Elites. The distance between skill vectors drives diversity selection.
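
The Coverage formula translates directly into code. The sketch below assumes exact-match scoring against a reference answer, as the formula's $x_{q,i}=y_q$ suggests; the function names and the toy answers are illustrative only.

```python
from typing import List, Sequence


def skill_vectors(
    answers: Sequence[Sequence[str]], references: Sequence[str]
) -> List[List[int]]:
    """answers[i][q] is model i's answer to query q; one binary vector per model."""
    return [
        [int(x == y) for x, y in zip(model_answers, references)]
        for model_answers in answers
    ]


def coverage(skills: Sequence[Sequence[int]]) -> float:
    """Fraction of queries answered correctly by at least one model."""
    n_queries = len(skills[0])
    covered = sum(any(model[q] for model in skills) for q in range(n_queries))
    return covered / n_queries


references = ["A", "B", "C", "D"]
answers = [
    ["A", "B", "D", "A"],  # model 0: correct on queries 0 and 1
    ["C", "B", "C", "A"],  # model 1: correct on queries 1 and 2
]
skills = skill_vectors(answers, references)
population_coverage = coverage(skills)
```

Each model here is only 50% accurate on its own, but together they cover three of the four queries, which is the complementarity the metric is designed to reward.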

3. Dominated Novelty Search (DNS) for Quality-Diversity: Rewarding Models Far from the Strongest. While traditional optimization seeks a single global optimum, QD seeks a group of high-quality and diverse solutions. AC/DC uses DNS in the skill vector space. For model $i$, let $D_i$ be the set of solutions with higher fitness than $i$, and let $d_{i,j}$ be the distance between the skill vectors of models $i$ and $j$. The local competition fitness is computed from the $k$ nearest neighbors $K_i$ of $i$ within $D_i$: $$\tilde{f}_i=\begin{cases}\frac{1}{k}\sum_{j\in K_i}d_{i,j} & \text{if }|D_i|>0\\ +\infty & \text{otherwise}\end{cases}$$ This rewards experts that occupy behavioral niches where stronger models do not excel, and it always retains models that no other model outperforms. Ablations show that DNS and gibberish filtering are both critical: removing them causes performance drops of 2.39% and 2.46% respectively (at N=3).
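
The local competition fitness can be sketched as follows. Several details are assumptions not stated in the source: quality is taken to be the number of tasks solved, $d_{i,j}$ is taken to be Hamming distance, and when fewer than $k$ stronger models exist the mean is taken over those available.

```python
import math
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of tasks on which two skill vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def dns_fitness(skills: Sequence[Sequence[int]], k: int = 3) -> List[float]:
    """Mean distance from each model to its k nearest stronger models."""
    quality = [sum(vec) for vec in skills]  # assumption: quality = tasks solved
    scores: List[float] = []
    for i, vec in enumerate(skills):
        stronger = [j for j in range(len(skills)) if quality[j] > quality[i]]
        if not stronger:
            scores.append(math.inf)  # no stronger model: always retained
            continue
        nearest = sorted(hamming(vec, skills[j]) for j in stronger)[:k]
        # Assumption: average over the neighbors found if fewer than k exist.
        scores.append(sum(nearest) / len(nearest))
    return scores


skills = [
    [1, 1, 1, 0],  # strongest model
    [1, 1, 0, 0],  # close to the strongest
    [0, 0, 0, 1],  # weak, but solves a task nobody else does
    [1, 0, 0, 0],  # weak and redundant
]
scores = dns_fitness(skills, k=2)
```

Models 2 and 3 solve the same number of tasks, yet model 2 scores 3.5 against 1.5 for model 3, because its single solved task lies far from what the stronger models already cover.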

4. Task Coevolution: Scientist LLM with Difficulty Adaptation and Triple Filtering. Tasks form a "living environment" that adapts to the population. The Scientist LLM synthesizes "QA pairs + Python scoring functions" following simplified METR standards. Generation involves four steps: (1) Task Proposal: difficulty is adjusted based on population pass rates; (2) Novelty Filtering: embedding cosine similarity is used to ensure new tasks are distinct from existing ones; (3) Reflection and Verification: the Scientist LLM solves the task itself and executes the scoring function, and errors trigger auto-correction; (4) Quality Assurance & MC: unsolvable tasks, on which no model succeeds, are removed.
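
Steps (2) and (4) above can be sketched as two simple filters. The similarity threshold `max_sim`, the dictionary layout of a task, and the function names are hypothetical; the source does not give the threshold or the embedding model, and the toy embeddings here are hand-written.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_tasks(
    candidates: List[Dict],
    archive_embeddings: List[Sequence[float]],
    max_sim: float = 0.9,  # hypothetical novelty threshold
) -> List[Dict]:
    """Keep candidates that are novel and solved by at least one model."""
    seen = list(archive_embeddings)
    kept: List[Dict] = []
    for task in candidates:
        # Novelty filter: reject near-duplicates of tasks already kept.
        if any(cosine(task["embedding"], e) >= max_sim for e in seen):
            continue
        # Minimal criterion: reject tasks that no model in the population solves.
        if not any(task["solved_by"]):
            continue
        kept.append(task)
        seen.append(task["embedding"])
    return kept


archive_embeddings = [[1.0, 0.0]]
candidates = [
    {"name": "near_duplicate", "embedding": [0.99, 0.05], "solved_by": [True, False]},
    {"name": "unsolvable", "embedding": [0.0, 1.0], "solved_by": [False, False]},
    {"name": "novel_solvable", "embedding": [0.0, 1.0], "solved_by": [False, True]},
]
kept = filter_tasks(candidates, archive_embeddings)
```

Only the third candidate survives: the first is too similar to an archived task, and the second is novel but fails the minimal criterion.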

Key Experimental Results

Main Results: Coverage Gain (Across 4 Model Families, Average % Gain vs. Baselines)

Base Model	vs Experts(N=3)	vs Control N=8	vs Big Model N=8	vs GPT-4o N=8
Qwen2 7B	+2.06	+0.69	-6.08	+2.05
Qwen2.5 7B	+4.40	+3.85	+1.02	+6.95
Qwen3 14B	-0.21	+4.22	+5.45	+10.71
DeepSeek V1 7B	+9.69	+1.96	-18.46	-7.72
Average	+3.99	+2.68	-4.52	+2.99

High Parameter Efficiency: At N=3, the Qwen2.5 7B population (3 × 7B = 21B parameters) uses only 29% of the parameters of the 72B model yet achieves 3.85% higher Coverage; the gain rises to +9.78% at N=8.
The N=8 population exceeds GPT-4o in Coverage; 3 Qwen2.5 7B models (N=3) are sufficient to surpass GPT-4o.

Best-of-N Single Answer Selection (Average % Gain vs. Baselines)

Base Model	vs Experts(N=3)	vs Control N=8	vs Big Model N=8
DeepSeek V1 7B	+11.73	+7.92	+4.94
Qwen3 14B	-0.49	+0.50	+1.37
Average	+1.34	+1.05	-0.25

DeepSeek 7B at N=3 comes within 1.27% of the 67B model (using 17% of the parameters) and surpasses it by 4.94% at N=8.

Ablation Study

Configuration	N=3	N=8
AC/DC (Ours)	60.82	69.00
DNS	60.18	66.48
CQD	59.85	65.42

Removing all evolutionary components results in a drop of 2.36% (N=3) and 7.02% (N=8).
Coevolution outperforms "model-only evolution on static data" by 3.62% at N=8.

Key Findings

OOD Generalization: The models achieve high coverage on OOD benchmarks without direct optimization, indicating the discovery of generalized complementary skills rather than benchmark overfitting.
Emergent Specialization: Eight models exhibit distinct expertise (e.g., Model 4 in Chemistry, Model 6 in CS/Business, Model 3 in Biology), whereas control groups show minimal variance and weaker overall performance.
Continuous Improvement: More generations of coevolution lead to sustained performance increases in the test population.

Highlights & Insights

Paradigm Shift: The focus shifts from "training a large static model" to "growing a population of small experts in one run," utilizing Coverage as the "North Star" metric for collective intelligence.
Mutual Environment: Tasks adapt to population competence, while models are forced to cover new tasks, creating an open-ended arms race that avoids saturation.
Pragmatic Engineering: The use of merging over training, SVD-based mutation, and Python-based execution verification enables an "unattended long-run" discovery process.
Parameter Efficiency: The results provide empirical evidence that a collective of small models can approach or exceed frontier large models using 17%–29% of the parameters.

Limitations & Future Work

Best-of-N Bottleneck: While Coverage is high, realizing this benefit in single-answer deployment requires better BoN selection; current BoN performance still lags behind GPT-4o (Average -7.17% at N=8).
Stability Variance: Performance varies across families (e.g., DeepSeek shows significant fluctuations against larger models), suggesting a need for improved robustness.
Dependency on Teacher Models: Task generation and novelty filtering rely on frontier LLMs as judges, which may introduce biases.
Task Scope: Evaluation remains focused on QA/MCQ/Math/Code; creative or interactive tasks have yet to be fully explored.

Related Work

Evolutionary Model Merging: Builds on EvoMerge (automated merging via CMA-ES) and CycleQD (task vector interpolation). AC/DC upgrades simple merging to "merging + task coevolution."
Quality-Diversity: Follows the lineage of MAP-Elites and Dominated Novelty Search, but adapts to LLMs by using skill vectors instead of pre-defined niches.
Open-Endedness: Implements the "AI-generating algorithms" philosophy by applying coevolution with minimal criteria to LLM discovery.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to extend open-ended coevolution to joint discovery of LLMs and synthetic tasks with Coverage-oriented goals.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 4 families and multiple metrics; however, BoN results lag, and cross-family stability varies.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and algorithmic visualization, though some technical details are dense in the appendices.
Value: ⭐⭐⭐⭐⭐ Significant value for low-resource researchers as a viable alternative to large-scale training via continuous evolutionary discovery.