Transforming Podcast Preview Generation: From Expert Models to LLM-Based Systems

Conference: ACL 2025
arXiv: 2505.23908
Code: None
Area: LLM/NLP
Keywords: podcast preview, LLM application, content understanding, A/B testing, industry deployment

TL;DR

Spotify replaces its legacy multi-model feature engineering pipeline for generating podcast preview clips with a single LLM (Gemini 1.5 Pro). The LLM system outperforms the legacy system in both offline human evaluation and online A/B testing, with a 4.6% increase in per-user engagement time and a 5x improvement in processing efficiency (roughly 100 seconds down to 20 seconds per episode).

Background & Motivation

Background: Discovering and evaluating long-form content like podcasts is time-consuming for users. Preview clips serve as an effective way to help users quickly assess their interest in the content.

Limitations of Prior Work: Legacy podcast preview systems (Legacy ML) rely on complex feature engineering pipelines that integrate multiple expert models, including topic analysis, sentiment analysis, ad detection, audio event detection, sentence boundary detection, and ranking. Maintaining and iterating on such a pipeline is very costly.

Key Challenge: Whenever a new requirement is added or criteria are adjusted in legacy systems, multiple models must be retrained or their weights and aggregation logic retuned. This leads to long iteration cycles and poor flexibility.

Goal: Replace the entire multi-model pipeline with a single LLM driven by a few-shot prompt, so that prompt iteration takes the place of feature engineering and the architecture becomes much simpler.

Key Insight: Leverage the long-context understanding and structured reasoning capabilities of LLMs to select the optimal preview clip directly from transcripts while generating metadata (such as topic tags and recommendation explanations).

Core Idea: Combining LLMs, sentence indexing, and few-shot prompting can replace complex legacy feature engineering pipelines, generating better podcast previews more rapidly.

Method

Overall Architecture

Podcast Audio → Transcript → Sentence Segmentation and Timestamp Alignment → LLM (Gemini 1.5 Pro) Few-shot Inference to Select Preview Clip → Post-processing Curation to ~1 Minute → Output Final Preview.

Key Designs

Sentence Indexing and Timestamp Alignment (Sentencization)
- Segment the transcripts into sentences based on heuristic rules like punctuation, and label each sentence with start and end timestamps.
- Design Motivation: LLMs require precise localization of clip boundaries; timestamp indexing serves as a critical bridge mapping the text space back to the audio space.
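
As a minimal sketch of this step, assuming the transcript arrives as word-level tokens with timestamps (the `Word` and `Sentence` types, the punctuation rule, and the `[index] text` rendering are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from episode start
    end: float


@dataclass
class Sentence:
    index: int
    text: str
    start: float
    end: float


# Hypothetical heuristic: a word ending in one of these closes a sentence.
SENTENCE_END = (".", "?", "!")


def _close(index, words):
    """Build a Sentence spanning the first word's start to the last word's end."""
    text = " ".join(w.text for w in words)
    return Sentence(index, text, words[0].start, words[-1].end)


def sentencize(words):
    """Group timestamped words into indexed, timestamped sentences."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith(SENTENCE_END):
            sentences.append(_close(len(sentences), current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(_close(len(sentences), current))
    return sentences


def render_indexed(sentences):
    """Render one '[index] text' line per sentence for use in a prompt."""
    return "\n".join(f"[{s.index}] {s.text}" for s in sentences)
```

Under this layout the LLM only needs to name sentence indices, and the stored timestamps map its choice back to audio.
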
Structured Reasoning Prompt
- Guide the LLM toward step-by-step reasoning: identify the main topic of the episode → evaluate the relevance and appeal of various segments → generate preview metadata (reasons for recommendation, topic tags).
- Design Motivation: Structured reasoning enhances both the transparency and interpretability of decisions, thereby improving the quality of the generated previews.
Preview Constraint Rules
- The prompt explicitly lists preview constraints: engaging hook, logical progression, exclusion of advertisements, self-contained beginning and ending, emotional resonance, and a duration of approximately one minute.
- Design Motivation: Encode the domain knowledge of product design teams as prompt constraints, replacing the rule engines in legacy systems.
Few-shot Learning
- Provide manually curated, high-quality showcase examples in the prompt.
- Design Motivation: Let the LLM learn the definitions of "quality previews" through exemplars without requiring fine-tuning.
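
The reasoning steps, constraint rules, and few-shot examples described above could be assembled into one prompt roughly as follows. This is a hedged sketch: the wording, the `build_prompt` helper, and the JSON answer keys are assumptions made for illustration, since the actual prompt is not reproduced in this summary.

```python
# Constraint rules paraphrased from the list above.
CONSTRAINTS = [
    "Open with an engaging hook.",
    "Follow a logical progression.",
    "Exclude advertisements.",
    "Begin and end in a self-contained way.",
    "Carry emotional resonance.",
    "Last approximately one minute.",
]

# Step-by-step reasoning order paraphrased from the list above.
REASONING_STEPS = [
    "Identify the main topic of the episode.",
    "Evaluate the relevance and appeal of candidate segments.",
    "Return the chosen sentence range with a recommendation reason and topic tags.",
]


def build_prompt(indexed_transcript: str, examples: list[str]) -> str:
    """Assemble instructions, constraints, few-shot examples, then the transcript."""
    parts = ["You select a preview clip from a podcast transcript.", ""]
    parts.append("Work through these steps:")
    parts += [f"{i}. {step}" for i, step in enumerate(REASONING_STEPS, 1)]
    parts += ["", "The preview must satisfy:"]
    parts += [f"- {rule}" for rule in CONSTRAINTS]
    for n, example in enumerate(examples, 1):
        parts += ["", f"Example {n}:", example]
    parts += [
        "",
        "Transcript (one indexed sentence per line):",
        indexed_transcript,
        "",
        'Answer as JSON with keys "start_index", "end_index", "reason", "topic_tags".',
    ]
    return "\n".join(parts)
```

Keeping constraints and examples as plain data means a product team can change criteria by editing a list rather than retraining a model, which is the iteration benefit the paper emphasizes.
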
Manual Prompt Iteration
- Product and design teams iteratively optimize prompts and validate them repeatedly on small-scale evaluation sets.
- Design Motivation: For tasks that require aesthetic judgment, human feedback is better suited than automated prompt engineering.
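
The post-processing step in the architecture (curating the selected clip to about one minute) is not detailed here. One plausible sketch, assuming the LLM answers with JSON containing `start_index` and `end_index` and that trimming drops trailing sentences until the clip fits (both are assumptions, not the paper's stated procedure):

```python
import json


def clip_from_response(response_text, sentences, max_seconds=60.0):
    """Map an LLM's sentence-range answer to audio timestamps.

    sentences: list of (start, end) times in seconds, indexed by sentence number.
    """
    choice = json.loads(response_text)
    first, last = choice["start_index"], choice["end_index"]
    if not (0 <= first <= last < len(sentences)):
        raise ValueError("sentence range outside transcript")
    start = sentences[first][0]
    # Drop trailing sentences until the clip fits, always keeping the first one.
    while last > first and sentences[last][1] - start > max_seconds:
        last -= 1
    return {"start": start, "end": sentences[last][1], "first": first, "last": last}
```

Cutting only at sentence boundaries keeps the clip ending on a complete sentence, in line with the self-contained ending constraint.
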

Comparison with Legacy System

Dimension	Legacy ML System	LLM System
Number of Models	6+ Expert Models	1 LLM
Input Modality	Audio + Text	Text Only
Processing Time	~100 seconds / episode	~20 seconds / episode
Iteration Method	Model Retraining + Feature Adjustment	Prompt Modification

Key Experimental Results

Table 1: Offline Human Evaluation - Overall Comparison and Statistical Tests

Evaluation Metric	Z-Test Statistic	P-value	LLM Significantly Better?
Understandability	-4.05	5.09e-05	Yes
Contextual Clarity	-3.40	0.00067	Yes
Interest Level	-4.32	1.59e-05	Yes
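
As a sanity check on the table, the reported p-values are close to what a two-sided test against the standard normal distribution gives for each z statistic. The sketch below assumes a two-sided test, which the summary does not state; small differences from the reported values are expected because the z statistics are rounded to two decimals.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z statistics copied from Table 1.
Z_STATS = [
    ("Understandability", -4.05),
    ("Contextual Clarity", -3.40),
    ("Interest Level", -4.32),
]

for name, z in Z_STATS:
    print(f"{name}: z={z:+.2f}, p={two_sided_p(z):.2e}")
```
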

Out of 238 valid annotations, the LLM previews were rated better than or equivalent to legacy previews in 81.09% of cases, with a pure win rate of 54.2%.
A binomial test gives a p-value of 1.37e-10, so the LLM's advantage is highly statistically significant.
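
The summary does not say how the binomial test was set up. One common choice for paired preference data is a sign test: discard ties and test wins against losses under a null of equal preference. The sketch below recovers the counts implied by the reported percentages and applies that test. It is an assumed setup and is not claimed to reproduce the reported p-value exactly.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial p-value under H0: p = 0.5.

    The null distribution is symmetric, so double the larger-side tail.
    """
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


total = 238
better_or_equal = round(total * 0.8109)  # 193 annotations: LLM better or tied
wins = round(total * 0.542)              # 129 annotations: LLM strictly better
ties = better_or_equal - wins            # 64
losses = total - better_or_equal         # 45

# Sign test: ties dropped, wins compared against losses.
p_value = binomial_two_sided_p(wins, wins + losses)
print(f"wins={wins}, ties={ties}, losses={losses}, p={p_value:.2e}")
```
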

Table 2: Online A/B Test Results

Metric	Gain	Description
Evaluation Time per User	+4.6%	Statistically significant, week 2 data
Evaluation Time per Preview	+4.0%	Statistically significant, week 2 data
Processing Efficiency	5x Gain	100s → 20s

The A/B test covered 67 English-speaking countries over 6 weeks, with LLM previews constituting 34% of the visible set for the treatment groups.

Key Findings

The LLM outperforms the legacy system with statistical significance on all three dimensions: understandability, contextual clarity, and interest level.
Online data confirms the offline evaluation: users engage more with LLM-generated previews.
The text-only LLM system surpasses the legacy system even though the latter uses both audio and text modalities.

Highlights & Insights

Real-world large-scale deployment: Deployed in Spotify's live production environment serving hundreds of thousands of podcast previews, validated by an A/B test covering 67 countries.
Substantial reduction in engineering complexity: Simplifying the system from a pipeline of 6+ expert models to a single LLM API call brings a qualitative change to maintenance costs and iteration speed.
5x improvement in processing efficiency: 20 seconds vs. 100 seconds, without requiring audio signal processing.
Rigorous evaluation framework: Dual validation via offline human evaluation (20 evaluators, 238 annotations, statistical tests) and online A/B testing (6 weeks, 67 countries).

Limitations & Future Work

English Only: Currently relies on metadata language tagging for English filtering; multilingual expansion remains unexplored.
Reliance on commercial LLMs: Using Gemini 1.5 Pro introduces dependencies on third-party API costs and controllability.
Non-automated prompt iteration: The manual optimization process is neither reproducible nor scalable.
No audio signals used: Outstanding clips that require acoustic cues (e.g., tone, laughter) to identify might be missed.
Limited evaluation metrics: Increased user engagement duration does not necessarily equate to improved content discovery; deeper metrics like conversion rate are still lacking.

Method	Mechanism	Data Modality	Requires Multiple Models?	Deployment Scale
Legacy Feature Engineering	Multi-expert model aggregation	Audio+Text	Yes (6+)	Production-scale
Unsupervised Highlight Detection	Clustering or graph-based methods	Video/Text	Partial	Research-grade
LLM Summarization	Extractive / Abstractive	Text	No	Research-grade
PodTile (Chapter Generation)	LLM + Indexing	Text	No	Production-scale
Ours (LLM Preview)	Few-shot LLM + Sentence Index	Text	No (Single LLM)	Production-scale (Spotify)

Rating

Novelty: ⭐⭐⭐ (Replacing a traditional pipeline with an LLM is not a new idea, but the engineering execution and dual validation are highly valuable)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Offline human evaluation + online A/B testing, complete statistical testing, industry-grade validation)
Writing Quality: ⭐⭐⭐⭐ (Clear structure, thorough legacy vs. LLM comparison, intuitive tables)
Value: ⭐⭐⭐⭐ (A benchmark paper for industrial applications, demonstrating the complete deployment path and efficacy of LLMs)