SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams¶
Conference: ACL 2026 · arXiv: 2601.09515 · Code: None · Area: Multilingual Translation · Keywords: Search relevance, self-evolving model, multi-agent annotation, query stream adaptation, distribution shift
TL;DR¶
This paper proposes the SERM framework, which continuously self-evolves a search relevance model from large-scale real-world query streams via a multi-agent sample miner and a multi-agent relevance annotator. After three iterative rounds on an industrial search platform, SERM achieves an NDCG@1 improvement of +2.99 and significantly improves user retention in online A/B testing.
Background & Motivation¶
Background: Search relevance modeling is central to information retrieval, with the goal of ranking candidate documents for a given query. Traditional approaches adopt discriminative modeling (encoder + scoring function), while recent work leverages the generative capabilities of LLMs to directly produce relevance judgments and rationales. The standard training pipeline follows a two-stage paradigm of continued pre-training followed by supervised fine-tuning.
Limitations of Prior Work: Real-world query distributions evolve continuously—users constantly introduce new expressions, cultural references, and emerging linguistic patterns. Statically trained models fail to cover these shifts, resulting in insufficient generalization. For instance, queries such as "remember me pets arriving on 10/27" carry subtle semantics (commemorating deceased pets vs. general pet homecoming) that are difficult for models to capture.
Key Challenge: Self-evolution is a promising direction, but applying it to industrial-scale massive query streams poses two challenges: (C1) informative samples are extremely sparse among billions of queries and hard to identify; (C2) pseudo-labels generated by the model itself may be unreliable, leading to error accumulation.
Goal: Design a search relevance model capable of continuous self-evolution from large-scale query streams while simultaneously addressing the challenges of informative sample discovery and label reliability.
Key Insight: A multi-agent framework is employed in which different roles serve distinct functions: environment-feedback agents leverage user click and dwell-time signals to discover hard samples, introspective-feedback agents exploit model inconsistency and uncertainty to identify model weaknesses, and a multi-agent annotator generates reliable labels through a two-level consensus mechanism.
Core Idea: A multi-agent sample miner efficiently filters the most informative hard samples from massive query streams, after which a multi-agent annotator (multiple LLMs + intra- and inter-agent consensus) generates reliable labels for these samples, enabling iterative self-evolution.
Method¶
Overall Architecture¶
SERM is built upon an LLM-based generative relevance model (taking query + document as input and generating a relevance score with rationale). The self-evolution cycle runs every two weeks: (1) the multi-agent sample miner selects approximately 700K hard samples from new query streams; (2) the multi-agent annotator generates reliable labels for these samples; (3) the new data is mixed with existing SFT data to retrain the model, preventing catastrophic forgetting.
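The bi-weekly cycle above can be sketched as a simple loop. This is a toy illustration only: the stage functions below (`mine_hard_samples`, `annotate`, `evolve_once`) and their bodies are stand-ins invented for this sketch, since no code is released.

```python
# Toy sketch of one SERM self-evolution cycle; all function names and
# bodies are illustrative stand-ins, not the paper's implementation.

def mine_hard_samples(model_score, query_stream):
    """Stage (1): keep pairs the current model is least confident about."""
    return [q for q in query_stream if abs(model_score(q) - 0.5) < 0.2]

def annotate(samples):
    """Stage (2): stand-in for the multi-agent annotator's consensus labels."""
    return [(q, 1) for q in samples]  # pretend the consensus label is 1

def evolve_once(model_score, query_stream, sft_data, history):
    """Stage (3): mix new, historical, and original SFT data for retraining."""
    labeled = annotate(mine_hard_samples(model_score, query_stream))
    train_mix = sft_data + history + labeled  # mixing prevents forgetting
    return train_mix, history + labeled
```

In the real system, stage (1) yields roughly 700K hard samples per cycle and stage (3) triggers a full retraining run rather than returning a data list.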
Key Designs¶
- Multi-Agent Sample Miner (MSM):
- Function: Efficiently identify hard samples of highest value for model improvement from large-scale query streams.
- Mechanism: Four complementary agents are deployed. (a) User feedback agent: detects conflicting pairs where users show strong positive engagement (clicks/high dwell time) but the model assigns low confidence; (b) Click model feedback agent: uses a pre-trained click model to compensate for position bias and sparsity in raw click signals; (c) Model disagreement agent: applies temperature sampling to generate \(K\) judgments for the same pair, computing maximum disagreement \(MD(q,d) = \max_{i,j} |f^i(q,d) - f^j(q,d)|\); (d) Model uncertainty agent: computes the entropy of the label distribution \(MU(q,d) = -\sum_y \Pr(y|q,d) \log \Pr(y|q,d)\).
- Design Motivation: No single signal is comprehensive—user feedback suffers from bias and sparsity, while internal model signals cannot capture changes in external user intent. The union of multiple agents covers different types of hard samples.
- Multi-Agent Relevance Annotator (MRA):
- Function: Generate reliable relevance labels and rationales for hard samples.
- Mechanism: A two-level consensus framework. (a) Intra-agent consensus: each LLM (e.g., GPT-4o, Gemini 2.5 Pro) first retrieves external knowledge, then generates multiple independent reasoning paths via multi-path CoT, producing stable labels through majority voting; (b) Cross-agent consensus: only samples on which multiple LLMs agree are retained, and all reasoning paths supporting the final label are consolidated into a unified rationale.
- Design Motivation: This differs from knowledge distillation—MRA provides evolutionary feedback by filtering noisy labels through cross-model consensus, avoiding error propagation in self-training. Intra-agent consensus addresses the stochasticity of individual LLM outputs, while cross-agent consensus mitigates systematic biases of individual models.
- Iterative Self-Evolution Training:
- Function: Continuously improve model performance through multiple rounds of iteration.
- Mechanism: Each iteration retrains the model by mixing newly generated data, data from previous iterations, and the original SFT data. The cycle runs every two weeks to ensure sufficient query distribution shift. The model can be distilled into a smaller model (0.5B) to meet latency requirements.
- Design Motivation: Mixed training prevents catastrophic forgetting, and periodic updates ensure the model tracks the evolution of the query distribution.
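The two introspective signals from the sample miner can be computed directly from their definitions. A minimal sketch for a single \((q,d)\) pair, with toy sampled judgments as input:

```python
import math

# Sketch of the miner's two introspective signals for one (q, d) pair,
# following the formulas MD and MU above; the inputs here are toy values.

def max_disagreement(scores):
    # MD(q,d) = max_{i,j} |f^i(q,d) - f^j(q,d)|, i.e. range of K samples
    return max(scores) - min(scores)

def model_uncertainty(label_probs):
    # MU(q,d) = -sum_y Pr(y|q,d) * log Pr(y|q,d)
    return -sum(p * math.log(p) for p in label_probs if p > 0)

scores = [3, 1, 2, 3, 0]          # K=5 temperature-sampled judgments (0-3)
probs = [0.25, 0.25, 0.25, 0.25]  # uniform over 4 labels = max uncertainty

print(max_disagreement(scores))   # 3
print(model_uncertainty(probs))   # ln(4) ~ 1.386
```

Pairs scoring high on either signal are candidates for annotation; the union with the two user-feedback agents' selections forms the mined set.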
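The annotator's two-level consensus reduces to majority voting within each LLM followed by agreement filtering across LLMs. A minimal sketch, where agent count, path counts, and labels are illustrative:

```python
from collections import Counter

# Toy sketch of the two-level consensus: intra-agent majority voting over
# multi-path CoT labels, then cross-agent agreement filtering.

def intra_agent_label(path_labels):
    """Majority vote over one LLM's independent reasoning-path labels."""
    label, _count = Counter(path_labels).most_common(1)[0]
    return label

def cross_agent_consensus(per_agent_paths):
    """Retain the sample only if every agent's voted label agrees."""
    voted = [intra_agent_label(paths) for paths in per_agent_paths]
    if all(v == voted[0] for v in voted):
        return voted[0]   # retained with the consensus label
    return None           # discarded as unreliable

# e.g. two annotator LLMs, three reasoning paths each
print(cross_agent_consensus([[2, 2, 3], [2, 2, 2]]))  # 2
print(cross_agent_consensus([[1, 1], [3, 3]]))        # None
```

Discarding disagreements (returning `None`) is what distinguishes this from naive pseudo-labeling: unreliable samples never enter the training mix.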
Loss & Training¶
The generative modeling objective is \(\mathcal{L}_g = -\mathbb{E} \log \Pr_\theta(y|q,d)\), where the model generates a rationale followed by a relevance score in the range 0–3. Mixed training across three data sources is employed during iterative training to prevent forgetting. Distillation uses a KL divergence loss.
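Under the standard definitions of these two objectives (the summary does not give token-level or temperature details, so the formulations below are the common ones, not necessarily the paper's exact implementation):

```python
import math

# Common formulations of the two losses: sequence-level NLL for the
# generative objective, and KL(teacher || student) for distillation.

def nll_loss(token_probs):
    """L_g = -E log P(y|q,d): mean negative log-likelihood of the
    gold rationale-plus-score token sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def kl_loss(teacher, student):
    """KL divergence from the teacher's to the student's distribution."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

print(nll_loss([0.9, 0.8, 0.95]))              # ~0.1266
print(kl_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # ~0.0268, 0 if identical
```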
Key Experimental Results¶
Main Results¶
| Method | Model | Germanic NDCG@1 | Romance NDCG@1 | Minor Lang NDCG@1 |
|---|---|---|---|---|
| CT+SFT | 7B | 84.74 | 85.61 | 82.02 |
| Self-Training Iter3 | 7B | 84.78 | 85.58 | 82.20 |
| SERM Iter3 | 7B | 87.56 | 88.14 | 84.99 |
| CT+SFT | 1.5B | 84.59 | 85.99 | 81.75 |
| SERM Iter3 | 1.5B | 87.30 | 87.83 | 84.75 |
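For reference, NDCG@1 reduces to the (normalized) gain of the single top-ranked document. A minimal sketch under the common exponential-gain convention; the paper's exact gain function for the 0-3 relevance scale is not stated in this summary:

```python
# NDCG@1 under the common exponential-gain definition (gain = 2^rel - 1).
# At cutoff 1 there is no discount term, so only the top document matters.

def ndcg_at_1(ranked_rels):
    """ranked_rels: relevance grades (0-3) in ranked order."""
    gain = lambda r: 2 ** r - 1
    ideal = gain(max(ranked_rels))            # best achievable top-1 gain
    return gain(ranked_rels[0]) / ideal if ideal > 0 else 0.0

print(ndcg_at_1([2, 3, 1]))  # top doc rel=2 but rel=3 exists -> 3/7
print(ndcg_at_1([3, 2, 1]))  # best doc ranked first -> 1.0
```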
Online A/B Testing¶
| Metric | Gain | P-value |
|---|---|---|
| 14-day retention rate | +0.0359% | 0.0278 |
| User negative feedback | −1.2081% | 0.0001 |
| Query reformulation rate | −0.0839% | 0.0023 |
| Query reformulation rate (long-tail) | −0.1312% | 0.0015 |
Key Findings¶
- After three iterations, SERM achieves NDCG@1 gains of +2.82 (7B) / +2.71 (1.5B), whereas self-training yields only +0.04 / +0.45 and degrades by the third round (error propagation).
- Distillation results show that SERM distilled to 0.5B outperforms self-training distillation, demonstrating that more reliable labels transfer effectively to smaller models.
- Online A/B testing reveals significant user experience improvements—negative feedback reduced by 1.2% and 14-day retention improved by 0.036%—small percentages that are highly meaningful on a platform processing billions of requests per day.
- The instability of self-training is especially pronounced in the third round (Germanic NDCG@1 drops from 84.95 back to 84.78), confirming the pseudo-label error accumulation problem.
Highlights & Insights¶
- Elegant multi-agent collaborative design: Environment feedback (user signals) and introspective feedback (model uncertainty) complement each other—the former captures external information unknown to the model, while the latter identifies gaps in the model's own knowledge. This design is transferable to any system requiring continuous learning from data streams.
- Two-level consensus annotation mechanism: Intra-agent multi-path voting followed by cross-model consensus provides layered noise filtering. Compared to simple knowledge distillation or self-training, this mechanism fundamentally addresses the unreliability of pseudo-labels.
- Industrial-scale validation: Online A/B testing on a real search platform serving billions of requests per day lends strong credibility to the reported results.
Limitations & Future Work¶
- The framework relies on GPT-4o and Gemini 2.5 Pro as annotators, incurring high API costs; moreover, the annotators themselves may carry inherent biases.
- The bi-weekly iteration frequency may be insufficient to respond to sudden distributional shifts caused by breaking events.
- The current approach is validated only on document search; extension to multimodal scenarios such as video or image search requires additional adaptation.
- Future directions include: reducing reliance on external LLMs (e.g., incorporating the evolving model itself as one annotator in a hybrid consensus scheme) and introducing active learning strategies for more efficient sample selection.
Related Work & Insights¶
- vs. Self-Training: Self-training directly uses the model's own predictions as pseudo-labels and degrades after three iterations; SERM provides reliable labels via external multi-LLM consensus, achieving stable and consistent improvement.
- vs. Knowledge Distillation: Distillation is a unidirectional knowledge transfer from a fixed teacher to a student; SERM is an iterative evolutionary feedback process—each round's model is stronger than the previous, and annotation quality improves accordingly.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of multi-agent sample mining and two-level consensus annotation is novel and well-suited to industrial scenarios.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Offline multilingual evaluation combined with online A/B testing provides highly convincing industrial-scale validation.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation and method descriptions are clear, though notation is somewhat dense.
- Value: ⭐⭐⭐⭐⭐ Directly addresses a core pain point in industrial search and has been validated at scale on a production platform.