Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation
Conference: ACL 2026
arXiv: 2605.07613
Code: To be confirmed
Area: Recommendation / Conversation / Generative Recommendation / Semantic ID
Keywords: Semantic ID, Conversational Recommendation, Generative Recommendation, Cold Start, RQ-VAE
TL;DR
This paper proposes NewsRec-Chat, which inverts conversational news recommendation from "retrieve-then-generate" to "generate a SID, then fuzzy-match." Using two-stage SID alignment and GPT-4 CoT distillation, a 7B model directly generates hierarchical Semantic ID prefixes, which are fuzzy-matched against the daily news pool. On the Tencent News platform, within an open generation space of 152K SIDs, it achieves 12.4% L1 (4× random) with 0% hallucinations. With Profile-Aware Dual-Signal Reasoning, zero-history users reach 18.0% L1, whereas SASRec and TIGER fall to 0% and OneRec-7B reaches 16.1%.
Background & Motivation
Background: Mainstream conversational recommendation systems are built on stable product catalogs (movies, goods). They typically convert conversational intent into keywords or embeddings for retrieval, then let an LLM rank and explain the recalled set. Recent generative recommendation work uses SIDs (hierarchical tokens quantized by RQ-VAE) to encode items as learnable discrete sequences, but these methods assume abundant click behavior.
Limitations of Prior Work: News platforms differ significantly from stable catalogs. Large numbers of articles expire within 24 hours, new articles flow in continuously, and 20-30% of users have fewer than 10 history records. Conversations are dominated by five types of implicit intent, such as "one more," "something different," or "no sports," which give RAG no keywords to retrieve on. Applying SIDs directly leaves two open problems: (1) how to generate SID prefixes from conversational intent rather than click sequences, and (2) how to handle cold-start users without click history.
Key Challenge: The retrieve-first paradigm requires explicit keys in the query, whereas real conversational news needs are implicit and short-lived, invalidating the assumptions of "query-first" and "static corpora."
Goal: (1) Map conversational intent directly to candidate items without relying on keywords, (2) structurally guarantee zero hallucinations (every recommendation must exist in today's pool), (3) enable meaningful recommendations for cold-start users via profiles, and (4) meet sub-100ms online latency.
Key Insight: The authors observe that the first three layers of an RQ-VAE SID form a semantic hierarchy (\(s_1\) coarse, \(s_2\) middle, \(s_3\) fine clusters), while the fourth layer approximates the item ID and fluctuates daily. Having the LLM generate only the first three layers expresses intent while decoupling the model from the daily-changing pool.
Core Idea: Replace RAG with Generate-then-Match. The LLM directly generates a three-layer SID prefix \(P = (s_1, s_2, s_3)\) from the user profile, history, and current intent. The prefix is then fuzzy-matched against the daily news pool \(\mathcal{P}\) with a tolerance \(\delta\), giving \(\text{Match}(P, \mathcal{P}) \subseteq \mathcal{P}\), so every returned item exists in the pool by construction.
Method
Overall Architecture
Input: User profile \(\mathbf{p}_u\) (25+ dimensional features), behavioral history \(\mathbf{h}_u\), and the current conversational query \(q\). Process: (1) the PADR router selects the warm, hybrid, or cold path based on \(|\mathbf{h}_u|\) and assembles the prompt; (2) the two-stage fine-tuned LLM generates a 3-layer SID prefix; (3) the fuzzy matching module compares the prefix against today's news pool with tolerance \(\delta=5\), returning a small candidate set (mean 5.2, median 3.0 articles); (4) online serving uses a Dual-Track architecture in which the Fast Track serves cached results in under 100 ms, while the Enhance Track asynchronously runs full PADR reasoning to update the cache. Output: 1-3 grounded recommendations from today's pool.
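Step (4) can be pictured as a cache-first server. The sketch below is a minimal illustration under stated assumptions, not the paper's serving code: `DualTrackServer`, `full_pipeline`, and the in-memory dict cache are hypothetical names, and waiting for the full pipeline on a cache miss is an assumption, since the summary only describes the cache-hit path.

```python
import asyncio


class DualTrackServer:
    """Cache-first serving sketch (hypothetical; not the paper's implementation)."""

    def __init__(self, full_pipeline):
        # full_pipeline: async callable user_id -> list of recommendations.
        # It stands in for the slow "full PADR reasoning" path.
        self._full_pipeline = full_pipeline
        self._cache = {}
        self._pending = set()  # keeps references so background tasks are not lost

    async def recommend(self, user_id):
        if user_id in self._cache:
            # Fast Track: answer from cache, refresh in the background.
            task = asyncio.create_task(self._refresh(user_id))
            self._pending.add(task)
            task.add_done_callback(self._pending.discard)
            return self._cache[user_id]
        # Assumed miss behavior: no cached answer, so wait for the slow path.
        return await self._refresh(user_id)

    async def _refresh(self, user_id):
        # Enhance Track: run the full pipeline and update the cache.
        recs = await self._full_pipeline(user_id)
        self._cache[user_id] = recs
        return recs
```

Under this sketch, only the first request for an unseen user pays the full-pipeline latency; later requests return the cached list immediately and pick up the refreshed result on the following call.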
Key Designs
- Generate-then-Match: Reversing the RAG Paradigm:
  - Function: Moves from "retrieving from the pool with a query" to "generating the SID first, then looking it up in the pool," which eliminates hallucinations at the architectural level.
  - Mechanism: \(\text{LLM}(u, h, q) \to \text{SID}\), followed by \(\text{Match}(\text{SID}, \mathcal{P}) = \{n \in \mathcal{P}: s_1' = s_1, s_2' = s_2, |s_3' - s_3| \leq \delta\}\), where \((s_1', s_2', s_3')\) are the first three SID codes of candidate \(n\). \(s_1\) and \(s_2\) must match exactly for semantic consistency, while the \(s_3\) tolerance captures fine-grained neighbors. Candidates are ranked by the score \(1 - |s_3'-s_3|/(\delta+1)\). \(\delta=5\) was chosen via grid search. Generating only the first 3 layers avoids the daily-fluctuating \(s_4\) layer, so the model is not locked to a specific day's inventory.
  - Design Motivation: Implicit intents fail in retrieve-first paradigms. Directly outputting SIDs collapses item selection from "semantic retrieval + ranking" into "semantic generation + existence check," leveraging LLM strengths while decoupling from daily pool changes via fuzzy matching.
- Profile-Aware Dual-Signal Reasoning (PADR):
  - Function: Enables effective recommendations for zero-history cold-start users from profiles alone, and applies a dedicated hybrid strategy to sparse-history users.
  - Mechanism: Data is partitioned into three tiers based on \(|\mathbf{h}_u|\) and a threshold \(\tau=10\). The context explicitly inserts "sparse" or "no history" prompts, allowing the model to learn differentiated reasoning via CoT (warm: behavior-profile association; cold: demographic-to-interest mapping; hybrid: cross-validation). The cold path achieves 18.0% L1; among the compared baselines, SASRec and TIGER drop to 0% on cold users and OneRec-7B reaches 16.1%.
  - Design Motivation: Cold-start users make up 20-30% of news platform users. Traditional SID models collapse without history. CoT distillation allows the "routing strategy" to be learned by the model rather than hard-coded, avoiding engineering complexity.
- Two-stage Training: SID Alignment + CoT Distillation:
  - Function: First teaches the LLM "content to SID, SID to content, and behavior summarization to SID," then teaches it to "generate SIDs using a different reasoning chain for each intent."
  - Mechanism: Stage 1 uses 6 tasks (bidirectional content↔SID mapping, behavior summarization, next-item prediction, multi-turn recommendation) with 483K samples for multi-task alignment. Stage 2 uses GPT-4 to generate gold CoTs for each (input, target SID) pair, which are distilled into Qwen2.5-7B-Instruct. Key techniques: (i) 31% cold-start samples to ensure profile-only reasoning; (ii) independent CoT structures for each intent; (iii) limiting CoT length to 150-300 words to avoid inference degradation.
  - Design Motivation: Without Stage 1, the model only "repeats SIDs" without reasoning. Without intent-specific distillation in Stage 2, it applies the same CoT to all intents. Removing Stage 2 causes hallucinations to spike from 0% to 18.4%.
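The PADR tiering and the Match rule described above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: `history_tier`, `Article`, and `fuzzy_match` are hypothetical names, SID codes are treated as integers so that \(|s_3' - s_3|\) is well defined, and the exact tier boundaries (0 records is cold, 1 to \(\tau - 1\) is hybrid, \(\tau\) or more is warm) are an assumed reading of the threshold \(\tau=10\).

```python
from dataclasses import dataclass

TAU = 10   # sparse-history threshold tau from the text
DELTA = 5  # s3 tolerance delta from the text


def history_tier(history_len, tau=TAU):
    """Assumed boundaries: 0 -> cold, 1..tau-1 -> hybrid, >= tau -> warm."""
    if history_len < 0:
        raise ValueError("history_len must be non-negative")
    if history_len == 0:
        return "cold"
    return "hybrid" if history_len < tau else "warm"


@dataclass(frozen=True)
class Article:
    article_id: str
    sid: tuple  # (s1, s2, s3, s4); only the first three codes are matched


def fuzzy_match(prefix, pool, delta=DELTA):
    """Return (score, article) pairs from the pool, best score first.

    s1 and s2 must match exactly; s3 may differ by at most delta.
    The score is 1 - |s3' - s3| / (delta + 1), as in the Mechanism bullet.
    """
    s1, s2, s3 = prefix
    scored = []
    for article in pool:
        a1, a2, a3 = article.sid[:3]
        gap = abs(a3 - s3)
        if a1 == s1 and a2 == s2 and gap <= delta:
            scored.append((1 - gap / (delta + 1), article))
    # Stable sort: ties keep their original pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Because every returned article is drawn from `pool`, a generated prefix with no neighbors yields an empty list rather than a nonexistent item, which is the sense in which the existence check is structural.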
Loss & Training
Stage 1 multi-task alignment uses standard LM cross-entropy. Stage 2 uses instruction distillation from teacher (GPT-4 CoT) to student (Qwen2.5-7B-Instruct) using next-token loss on the "CoT + SID prefix" sequence. Backbone is Qwen2.5-7B-Instruct + LoRA. RQ-VAE encoder is trained offline on new content embeddings (~2h, 1 GPU).
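Written out, the Stage 2 objective described above is an ordinary next-token loss over the teacher's output. In the notation below, \(x\) denotes the assembled prompt built from \((\mathbf{p}_u, \mathbf{h}_u, q)\), \(c\) the GPT-4 chain of thought, \(P\) the target SID prefix, and \(\oplus\) concatenation. These symbols, and the choice to sum only over target tokens rather than prompt tokens, are assumptions made for illustration, since the summary does not specify the masking.

\[
\mathcal{L}_{\text{Stage 2}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right), \qquad y = c \oplus P, \quad P = (s_1, s_2, s_3).
\]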
Key Experimental Results
Main Results (9,982 test samples, 152K SID open space)
| Method | Hit@1 (Rand) | Hit@1 (Align) | L1 | L2 | Category | Hallucination |
|---|---|---|---|---|---|---|
| Random | 20.0 | 20.0 | 5.1 | 0.1 | 10.3 | – |
| Popular | 20.0 | 20.0 | 7.7 | 0.5 | 12.6 | – |
| Hist-Pop (Prod. Baseline) | – | – | 11.6 | 0.7 | 16.8 | – |
| Qwen-7B Direct | 28.1 | 26.0 | 2.4 | 0.0 | 6.9 | 70.0% |
| GPT-4 Direct | 34.4 | 30.9 | 0.9 | 0.0 | 1.4 | 94.6% |
| Qwen-7B + Hybrid RAG | 28.1 | 26.0 | 11.4 | 0.5 | 18.1 | 0% |
| GPT-4 + Hybrid RAG | 34.4 | 30.9 | 12.4 | 0.5 | 18.8 | 0% |
| NewsRec-Chat (Ours, 7B) | 59.3 | 30.8 | 12.4 | 1.0 | 20.0 | 0% |
Cold-start L1: SASRec 0% / TIGER 0% / OneRec-7B 16.1% / NewsRec-Chat 18.0%. NewsRec-Chat is the only method covering all 6 intent types.
Ablation Study (Rand setting)
| Configuration | Hit@1 | L1 | Hallucination | Latency |
|---|---|---|---|---|
| Full Model | 59.3% | 12.4% | 0% | 85ms |
| w/o Stage 2 (Stage 1 only) | 51.6% | 8.9% | 18.4% | 0.67s |
| w/o Fuzzy Match | 59.3% | 12.4% | 5.7% | 85ms |
| w/o Dual-Track | 59.3% | 12.4% | 0% | 3.7s |
Key Findings
- Stage 2 PADR CoT distillation is the single largest contributor: removing it lowers Hit@1 from 59.3% to 51.6% and L1 from 12.4% to 8.9%, raises hallucinations from 0% to 18.4%, and increases latency from 85ms to 0.67s, which makes it central to the zero-hallucination result.
- Fuzzy Match is indispensable: removing the tolerance yields a 5.7% match-failure rate, because some exact \((s_1, s_2, s_3)\) triplets have no items in the daily pool.
- Cold-start L1 is, surprisingly, higher than that of warm users, suggesting that profile-to-SID-cluster mapping is more focused than reasoning over long, noisy behavioral histories.
- Cross-category generalization: Average L1 of 23.5% across 29 categories. Zero-shot categories almost never seen in Stage 2 still reached 4× the random baseline.
- Pilot Deployment: 38-day study with 300+ people showed zero hallucination complaints and a 22.8% return rate.
- Comparison with GPT-4+Hybrid RAG: Tied in L1 (12.4%) but doubled L2 and increased category accuracy by 1.2pp at ~100× lower cost.
Highlights & Insights
- The engineering decision to generate 3 layers instead of 4 is clever—it decouples coarse semantics from daily item fluctuations, making the model loosely coupled with the inventory. This is transferable to other short-lived item domains like short videos or live streaming.
- Generate-then-Match provides an architectural solution to LLM hallucinations: instead of using constrained decoding or grounding losses, the model outputs "semantic slots" to be looked up in a real pool.
- PADR utilizes availability indicators in prompts to let the model learn routing itself, a trick applicable to any multi-branch strategy system.
- The finding that cold-start L1 is higher than warm-start L1 challenges the overestimation of behavioral sequence value in LLM-based recommendation.
Limitations & Future Work
- Evaluation is limited to a single platform and language (Tencent News, Chinese); cross-domain transfer (e.g., e-commerce) is left for future work.
- 7B inference for cold-start takes 3.7s initially, requiring caching to hit 85ms, which remains a latency issue for non-prewarmable user segments.
- \(\delta=5\) is a fixed grid-searched value; it may be too loose for sparse clusters and too tight for dense ones.
- Stage 2 distillation depends on GPT-4, risking potential drift or bias transfer from the teacher model.
Related Work & Insights
- vs TIGER (Rajput et al. 2023): They used SIDs for single-turn next-item prediction; this paper extends to 6 conversational intents plus PADR cold-start.
- vs OneRec-7B (Zhou et al. 2025): OneRec-7B uses the same backbone with constrained decoding, whereas this paper uses Generate-then-Match. Cold-start L1 is 18.0% for NewsRec-Chat vs 16.1% for OneRec-7B.
- vs GPT-4 + Hybrid RAG: They rely on massive scale and heavy retrieval; we rely on 7B and direct generation, achieving parity in L1 with much lower costs.
- vs Constrained Decoding (Hokamp & Liu 2017): They constrain tokens during decoding, which limits expression; we apply constraints post-generation via fuzzy matching.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of Generate-then-Match and PADR is a genuine "paradigm shift."
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 5 baseline types, 4 retrieval variants, full ablations, p-values, task decomposition, and a 38-day pilot.
- Writing Quality: ⭐⭐⭐⭐ Clear pipeline diagrams; however, some key tricks like CoT length limits are in the appendix.
- Value: ⭐⭐⭐⭐⭐ Provides a reproducible industrial-grade "7B + SID" alternative to "GPT-4 + RAG" with zero hallucinations and sub-100ms latency.