Accelerate Creation of Product Claims Using Generative AI

Conference: NeurIPS 2025 · arXiv: 2509.20652 · Code: Available (GitHub) · Area: Audio & Speech · Keywords: product claims, in-context learning, LoRA fine-tuning, MaxDiff, consumer simulation

TL;DR

This paper develops the Claim Advisor platform, leveraging LLM in-context learning and LoRA fine-tuning to accelerate the search, generation, refinement, and ranking of product claims for consumer goods. By emulating the MaxDiff research methodology, a fine-tuned Phi-3 14B model outperforms GPT-4o on claim ranking using only 1 in-context example versus GPT-4o's 100, and after three iterative rounds, 100% of generated claims achieve a "highly appealing" rating.

Background & Motivation

Background: Product claims are a key driver of consumer purchasing behavior. Traditional claim creation workflows involve manually drafting candidate claims, testing preferences via MaxDiff consumer studies, and iterating across multiple rounds—a process that typically requires weeks to months and substantial financial investment.

Limitations of Prior Work: (a) Claims must be legally compliant and scientifically substantiated while resonating with consumer topic trends, imposing multiple constraints on the creative process; (b) significant time in traditional workflows is spent searching existing claims, crafting new ones manually, and conducting consumer testing; (c) consumer research via MaxDiff is costly, requiring real consumer participation in each round.

Key Challenge: High-quality claim creation demands domain expertise combined with consumer preference feedback, yet manual iteration is extremely slow.

Goals: (a) How can LLMs accelerate claim search, generation, and ranking? (b) How can LLMs simulate consumer preferences to partially replace real consumer studies?

Key Insight: Integrating the MaxDiff consumer research methodology into LLM prompt design and fine-tuning—training LLMs to select the best and worst claims in the manner of consumers.

Core Idea: Emulating the MaxDiff experimental paradigm (selecting the best and worst from a set of 5) to fine-tune/prompt LLMs for claim ranking, which substantially outperforms asking LLMs to rank all claims simultaneously.

Method

Overall Architecture

Claim Advisor is an MVP web application comprising three functional modules: (1) Semantic Search—retrieving relevant claims and visual assets from an existing claim repository using text embeddings and CLIP multimodal embeddings; (2) Generation/Refinement—generating new claims conditioned on product descriptions and consumer personas via prompt engineering and in-context learning (GPT-4o); (3) Ranking/Simulation—using a LoRA fine-tuned Phi-3 model to emulate MaxDiff experiments for virtual screening of candidate claims.

Key Designs

  1. Semantic Search (Multimodal Fusion Retrieval)

    • Function: Retrieves textual claims and visual designs from the claim repository that are semantically similar to a user query.
    • Mechanism: Text is encoded with OpenAI TEXT-EMBEDDING-ADA-002; images are encoded into a shared space via CLIP. Multimodal fusion queries are supported as: \(emb = (1-W) \cdot emb_{txt} + W \cdot emb_{img}\), with retrieval performed via cosine similarity.
    • Design Motivation: The first step in claim creation is typically searching existing assets—approved claims can be reused directly, and high-scoring MaxDiff claims serve as starting points for new ones.
  2. In-Context Learning for Claim Generation/Refinement

    • Function: Constructs examples from historical MaxDiff studies to guide the LLM in generating highly appealing new claims.
    • Mechanism: Two methods for constructing in-context examples: (a) Performance-based—using claims ranked 2nd–6th in MaxDiff scores as input, prompting the model to generate a claim that surpasses them; (b) Semantics-based—using the 5 claims most semantically similar to the top-performing claim as input. A total of 300 examples are constructed for in-context learning.
    • Design Motivation: The hypothesis is that LLMs can infer consumer preferences from moderately performing claims and synthesize improved ones. The performance-based method supplies preference signals, while the semantics-based method provides topical direction.
  3. MaxDiff-Simulated Ranking (LoRA Fine-Tuned Phi-3)

    • Function: Uses a fine-tuned LLM to emulate consumer behavior in MaxDiff studies, enabling virtual screening of candidate claims.
    • Mechanism: Rather than asking the LLM to rank all claims simultaneously (which performs poorly), the approach emulates the MaxDiff paradigm—presenting 5 claims at a time and asking the model to select the best and worst. After multiple rounds, each claim's score is computed as: best count / worst count. Phi-3 (7B/14B) with LoRA is fine-tuned on 100K+ training samples.
    • Design Motivation: Asking LLMs to produce a full ranking in a single pass lacks statistical grounding. Emulating the MaxDiff best–worst selection paradigm aligns with consumer decision psychology and yields statistically meaningful results.
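The weighted-fusion retrieval in design (1) can be sketched as follows. This is a minimal illustration of the stated formula \(emb = (1-W) \cdot emb_{txt} + W \cdot emb_{img}\) with cosine-similarity lookup; the function names and the toy vectors are hypothetical, and the real system would call the TEXT-EMBEDDING-ADA-002 and CLIP encoders rather than take raw vectors.

```python
import numpy as np

def fuse_embeddings(emb_txt: np.ndarray, emb_img: np.ndarray, w: float) -> np.ndarray:
    """Weighted multimodal fusion: emb = (1 - w) * emb_txt + w * emb_img."""
    return (1.0 - w) * emb_txt + w * emb_img

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_emb: np.ndarray, corpus: list, top_k: int = 3) -> list:
    """Return indices of the top_k corpus embeddings most similar to the query."""
    sims = [cosine_similarity(query_emb, c) for c in corpus]
    return sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:top_k]
```

Setting W = 0 reduces to pure text retrieval and W = 1 to pure image retrieval, so the single weight lets one query slide between the two modalities.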
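The performance-based example construction in design (2) can be sketched as below: claims ranked 2nd through 6th by MaxDiff score become the input, and the top claim becomes the target the model should beat. The function name and prompt wording are illustrative assumptions, not the paper's actual prompt.

```python
def build_icl_example(claims_by_score: dict):
    """Build one performance-based in-context example from a MaxDiff study.

    claims_by_score maps claim text -> MaxDiff score. Claims ranked 2nd-6th
    form the prompt; the top-ranked claim is the reference target.
    """
    ranked = sorted(claims_by_score, key=claims_by_score.get, reverse=True)
    inputs, target = ranked[1:6], ranked[0]
    prompt = "Given these consumer-tested claims:\n"
    prompt += "\n".join(f"- {c}" for c in inputs)
    prompt += "\nWrite a new claim that consumers would prefer over all of them."
    return prompt, target
```

The semantics-based variant would differ only in how the five input claims are chosen (nearest neighbors of the top claim in embedding space rather than ranks 2 through 6).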
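The MaxDiff emulation loop in design (3) can be sketched as follows: repeatedly show the model 5 claims, record its best/worst picks, then aggregate counts into a score. This is a schematic under assumptions: `pick_best_worst` stands in for a call to the fine-tuned Phi-3 model, and the +1 in the denominator is an added smoothing term (not from the paper) to keep the stated best-count/worst-count ratio defined when a claim is never picked worst.

```python
import random
from collections import defaultdict

def simulate_maxdiff(claims, pick_best_worst, n_rounds=200, set_size=5, seed=0):
    """Emulate MaxDiff: each round shows `set_size` claims and tallies the
    best and worst picks; returns a per-claim score after all rounds."""
    rng = random.Random(seed)
    best, worst = defaultdict(int), defaultdict(int)
    for _ in range(n_rounds):
        shown = rng.sample(claims, set_size)
        b, w = pick_best_worst(shown)  # stand-in for one fine-tuned LLM call
        best[b] += 1
        worst[w] += 1
    # Score as described above (best count / worst count); +1 smoothing
    # added here so claims never picked worst don't divide by zero.
    return {c: best[c] / (worst[c] + 1) for c in claims}
```

The key point the design motivation makes is visible in the interface: the model only ever answers a small best/worst question, and the ranking emerges statistically from many such answers.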

Loss & Training

  • Generation/Refinement: No training; purely prompt engineering and in-context learning (GPT-4o).
  • Ranking: LoRA fine-tuning of Phi-3 on 100,316 training samples derived from historical MaxDiff studies.
  • Evaluation: Kendall's tau rank correlation coefficient and Top-N Coverage.
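The two evaluation metrics above can be sketched directly. This is a minimal self-contained version (tau-a, ignoring tie corrections) rather than the paper's exact evaluation code; in practice one would likely use `scipy.stats.kendalltau`.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over paired scores: (concordant - discordant) / pairs."""
    assert len(scores_a) == len(scores_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        d = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n = len(scores_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

def top_n_coverage(pred_scores, true_scores, n):
    """Fraction of the true top-n items that also appear in the predicted top-n."""
    order = lambda s: sorted(range(len(s)), key=lambda i: s[i], reverse=True)
    return len(set(order(pred_scores)[:n]) & set(order(true_scores)[:n])) / n
```

Kendall's tau rewards getting the full relative order right, while Top-N Coverage only asks whether the best few claims are surfaced, which is closer to how the tool is used in practice.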

Key Experimental Results

Main Results: Claim Generation Quality (Three-Round MaxDiff Validation)

| Round | Highly Appealing | Appealing | Low Appeal |
|---|---|---|---|
| Round 1 (Human) | 20% | 46% | 34% |
| Round 2 (Claim Advisor) | 33% | 36% | 31% |
| Round 3 (Claim Advisor) | 100% | 0% | 0% |

Within only two LLM-assisted iterations, the share of highly appealing claims rose from 20% (the human-written Round 1 baseline) to 100%.

Ranking Model Comparison (Kendall's tau)

| Model | # In-Context Examples | Kendall's tau |
|---|---|---|
| GPT-3.5 | 100 | ~0.15 |
| GPT-4 | 100 | ~0.25 |
| GPT-4o | 100 | ~0.35 |
| Phi-3 7B (mini) + LoRA | 10 | ~0.35 |
| Phi-3 14B (medium) + LoRA | 1 | ~0.40 (best) |

The fine-tuned Phi-3 14B surpasses GPT-4o with only 1 in-context example, compared to GPT-4o's 100.

Key Findings

  • The MaxDiff paradigm is critical: Asking LLMs to rank all claims directly performs poorly, but emulating MaxDiff (selecting best and worst) yields substantial improvements—methodology matters more than model scale.
  • More examples do not always help: Phi-3 14B performs better with 1 example than with 10, suggesting that a concise, well-designed prompt may outperform information-dense prompts.
  • Smaller models can outperform larger ones: The LoRA fine-tuned 7B Phi-3 approaches the 14B version on Top-N Coverage while substantially outperforming GPT-3.5/GPT-4.
  • Remarkable convergence in three rounds: Only 2 LLM-assisted iterations suffice to reach 100% high appeal, demonstrating the strong potential of LLMs in creative tasks.

Highlights & Insights

  • Integrating domain methodology (MaxDiff) into LLM design is the most elegant contribution of this work: rather than simply asking LLMs to "rank," the approach trains LLMs to emulate the experimental paradigm of consumer research—this "methodology alignment" is far more effective than naive prompting.
  • Performance-based vs. semantics-based in-context example construction: The two methods are complementary—one provides preference signals (what is popular) and the other provides topical direction (what themes are relevant).
  • Practical industrial deployment experience: Real-world concerns such as latency, model stability (commercial API updates), and cost management are discussed, offering valuable reference for practitioners.

Limitations & Future Work

  • Non-public data: Proprietary MaxDiff data and prompts cannot be released, limiting reproducibility.
  • Single product category: Validation is confined to P&G consumer goods; cross-industry generalizability remains unknown.
  • Regulatory compliance not automated: Generated claims still require manual legal review; LLMs do not guarantee compliance.
  • Diversity vs. specificity trade-off: Excessive prompt constraints reduce output diversity.
  • Evaluation limitations: Kendall's tau measures only relative ranking, not absolute preference intensity.

Comparisons & Takeaways

  • vs. General-purpose LLM text generation: This work adapts general LLMs to a specific business context via domain methodology (MaxDiff) and proprietary data, representing a canonical paradigm for industrial LLM applications.
  • vs. Direct fine-tuning: Fine-tuning is only necessary for the ranking task; in-context learning suffices for generation—a tiered strategy aligned with task complexity.
  • Transferable insight: Any domain with an established human preference comparison methodology (A/B testing, Elo ranking, etc.) can adopt a similar approach to embed that methodology into LLM workflows.

Rating

  • Novelty: ⭐⭐⭐ Incorporating the MaxDiff paradigm into LLM design is creative, but technically the approach is a standard combination of prompt engineering and LoRA.
  • Experimental Thoroughness: ⭐⭐⭐ Real consumer MaxDiff validation is a strength, but data unavailability and unsystematic model comparisons are weaknesses.
  • Writing Quality: ⭐⭐⭐⭐ Practical problem descriptions are clear and methods are explained intuitively, though the paper reads more as an industrial report than an academic contribution.
  • Value: ⭐⭐⭐ Valuable reference for industrial LLM applications, but academic contribution is limited.