ACL 2025 (Findings) LLM (Other) Personalized Generation Survey User Modeling LLM Diffusion Model Multimodal

Personalized Generation In Large Model Era: A Survey¶

Conference: ACL 2025 (Findings)
arXiv: 2503.02614
Code: Unreleased
Area: Other
Keywords: Personalized Generation, Survey, User Modeling, LLM, Diffusion Model, Multimodal

TL;DR¶

The first systematic survey on cross-modal Personalized Generation (PGen), presenting a unified user-centric perspective to integrate research from NLP, CV, and IR communities under a single framework, covering six modalities: text, image, video, audio, 3D, and cross-modality.

Background & Motivation¶

Core Observation: In the era of large models, content generation is shifting from one-size-fits-all generation to Personalized Generation (PGen). However, research in different communities (NLP, CV, IR) remains siloed, lacking a unified perspective.
Limitations of Prior Surveys: Existing surveys are either model-centric (e.g., focusing specifically on personalization of LLMs/diffusion models) or task-centric (e.g., dialogue generation, role-playing), lacking a cross-community panoramic survey.
Goal: Propose the first modality-agnostic unified framework that systematically organizes research across the NLP, CV, and IR communities.

Method¶

Overall Architecture—Unified User-Centric Perspective¶

PGen relies on two types of user inputs: (1) Personalized Context: historical data containing user preferences; (2) Multimodal Instructions: signals such as text prompts and voice commands that explicitly express content requirements. Generative models learn preferences from the personalized context to generate customized content according to instructions.
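The two input types can be pictured as a small data structure. The sketch below is only illustrative: the class and field names (`PersonalizedContext`, `Instruction`, `PGenRequest`, `has_personal_signal`) are hypothetical and do not come from the survey, and only text-like context fields are modeled (image-based context such as faces or subjects is omitted for brevity).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonalizedContext:
    """Historical data that implicitly carries user preferences."""
    profile: Dict[str, str] = field(default_factory=dict)  # e.g. age, occupation
    documents: List[str] = field(default_factory=list)     # reviews, emails, posts
    behaviors: List[str] = field(default_factory=list)     # searches, clicks, purchases


@dataclass
class Instruction:
    """Explicit content requirement; modality is e.g. 'text' or 'voice'."""
    modality: str
    content: str


@dataclass
class PGenRequest:
    """One personalized-generation request: context plus instructions."""
    context: PersonalizedContext
    instructions: List[Instruction]

    def has_personal_signal(self) -> bool:
        # True if at least one context field is non-empty.
        c = self.context
        return bool(c.profile or c.documents or c.behaviors)


# Hypothetical example request.
request = PGenRequest(
    context=PersonalizedContext(documents=["Loved the minimalist desk lamp."]),
    instructions=[Instruction(modality="text", content="Write a product review.")],
)
```

A request with an empty context degenerates to ordinary instruction following, which is why the sketch exposes a check for whether any personal signal is present.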

Key Designs—Five Personalized Context Dimensions¶

Context Type	Description	Common Tasks
User Profile	Age, gender, occupation, location, etc.	Dialogue systems, e-commerce product images
User Document	Reviews, emails, social media posts	Writing assistant, personalized recommendation
User Behavior	Interactions such as search, click, purchase, etc.	Recommendation systems, information retrieval
Personal Face/Body	Facial structure, body shape, expression, movement	Portrait generation, virtual try-on
Personalized Subject	User-specific concepts such as pets, personal belongings, etc.	Subject-driven generation

Three Core Objectives¶

High Quality: Coherence, relevance, and aesthetics of generated content
Instruction Alignment: Accurately following the user's multimodal instructions
Personalization: Consistency with user preferences and personalized context

PGen Workflow¶

User Modeling Stage:
- Representation Learning: Encoding user data into dense embeddings or discrete text representations
- Prompt Engineering: Designing task-specific prompts to organize user information
- RAG: Filtering irrelevant information and integrating relevant external data
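As a rough illustration of the retrieval and prompt-engineering steps of user modeling, the sketch below filters a user's history by word overlap with the instruction and then assembles a prompt. Everything here is an assumption made for illustration: the function names, the tiny stopword list, and the overlap-based scoring, which stands in for the learned retrievers used in practice.

```python
import re
from typing import List, Set

STOPWORDS = {"a", "an", "the", "for", "of", "to"}  # tiny illustrative list


def tokenize(text: str) -> Set[str]:
    """Lowercase content-word set; a crude stand-in for a learned encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(history: List[str], instruction: str, k: int = 2) -> List[str]:
    """Keep up to k history items that share the most words with the instruction."""
    query = tokenize(instruction)
    scored = [(len(tokenize(item) & query), item) for item in history]
    relevant = [pair for pair in scored if pair[0] > 0]  # drop irrelevant items
    # Stable sort: ties keep their original order in the history.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in relevant[:k]]


def build_prompt(profile: str, retrieved: List[str], instruction: str) -> str:
    """Organize user information into a task-specific prompt."""
    lines = [f"User profile: {profile}", "Relevant history:"]
    lines += [f"- {item}" for item in retrieved]
    lines.append(f"Instruction: {instruction}")
    return "\n".join(lines)


# Hypothetical user history and instruction.
history = [
    "Bought a trail running backpack",
    "Searched for vegan lasagna recipes",
    "Reviewed lightweight running shoes",
]
instruction = "Recommend running gear for a weekend race"
prompt = build_prompt("amateur runner", retrieve(history, instruction), instruction)
```

The resulting prompt keeps the two running-related items and drops the unrelated recipe search, which is the filtering role the RAG step plays in this workflow.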

Generation Modeling Stage:
- Step 1 - Foundation Model Selection: LLM / MLLM / Diffusion Model
- Step 2 - Guidance Mechanism: Instruction guidance (ICL, instruction tuning) + Structural guidance (adapter, cross-attention)
- Step 3 - Optimization Strategy: Tuning-free (model merging, multi-turn interaction) / Supervised Fine-Tuning (full or PEFT) / Preference Optimization (RLHF, DPO)
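To make the preference-optimization option in Step 3 more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes sequence log-probabilities have already been computed under the policy and a frozen reference model; the function name, argument names, and the default `beta` are illustrative choices, and treating the chosen and rejected outputs as ones a particular user liked or disliked is an assumption about how this would apply to personalization.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen output over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is zero, loss is log(2).
neutral = dpo_loss(-5.0, -9.0, -5.0, -9.0)
# Policy has shifted probability toward the chosen output: lower loss.
improved = dpo_loss(-5.0, -9.0, -6.0, -6.0)
```

The loss falls below log 2 only when the policy favors the chosen output more than the reference does, so minimizing it nudges the model toward the preferred content without a separate reward model.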

Multi-level Taxonomy¶

The survey is organized hierarchically as Modality -> Personalized Context -> Task, covering 200+ papers:

Modality	Representative Tasks	Representative Methods
Text	Recommendation, writing assistant, dialogue, role-playing	LLM-Rec, REST-PG, PAED, CharacterLLM
Image	Subject-driven T2I, face generation, virtual try-on	DreamBooth, PhotoMaker, IDM-VTON
Video	Subject-driven T2V, Talking Head, dance generation	AnimateDiff, EMO, AnimateAnyone
3D	Image-to-3D, 3D face/body	MVDream, DreamBooth3D, DreamWaltz
Audio	Music generation, text-to-speech	UMP, DiffAVA
Cross-modal	Personalized captions/comments, dialogue	MyVLM, Yo'LLaVA
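The Modality -> Personalized Context -> Task hierarchy maps naturally onto a nested dictionary. The sketch below uses a small subset of entries assembled from the tables in this summary; the specific pairing of context types with modalities and tasks is an illustrative assumption, not the survey's full taxonomy, and the helper names are hypothetical.

```python
from typing import Dict, List

# Modality -> personalized context -> tasks (illustrative subset only).
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Text": {
        "User Behavior": ["Recommendation"],
        "User Document": ["Writing assistant"],
        "User Profile": ["Dialogue"],
    },
    "Image": {
        "Personalized Subject": ["Subject-driven T2I"],
        "Personal Face/Body": ["Face generation", "Virtual try-on"],
    },
}


def tasks_for(modality: str, context: str) -> List[str]:
    """Look up tasks at the leaf of the hierarchy; empty list if absent."""
    return TAXONOMY.get(modality, {}).get(context, [])


def modalities_using(context: str) -> List[str]:
    """Invert the hierarchy: which modalities list a given context type?"""
    return [m for m, contexts in TAXONOMY.items() if context in contexts]
```

Keeping the context type as the middle level makes it easy to ask cross-modality questions, such as which modalities draw on a given kind of user data.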

Experiments¶

This is a survey paper and does not contain original experiments. The primary contribution lies in the systematic review and taxonomy of the literature.

Dataset Summary¶

Modality	Representative Datasets
Text	LaMP, LongLaMP, Amazon Reviews, MovieLens
Image	DreamBooth dataset, VITON-HD, DeepFashion
Video	TikTok Dance, HDTF (Talking Head)
3D	ShapeNet, THuman2.0
Audio	LibriSpeech, MusicNet

Evaluation Metrics Summary¶

Objective	Metrics
Quality	FID, IS, CLIP Score, BLEU, Perplexity
Instruction Alignment	CLIP-T, BERT-Score
Personalization	CLIP-I, DINO Score, Face-Sim, User Study
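Several of the listed metrics (CLIP-I, DINO Score, CLIP-T) reduce to cosine similarity between embeddings. The sketch below shows that computation on toy vectors, assuming the embeddings have already been produced by an image or text encoder; no real encoder is called, and the function names and vector values are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_similarity(generated: Sequence[float],
                     references: List[Sequence[float]]) -> float:
    """CLIP-I / DINO-style score: mean cosine similarity to reference images."""
    return sum(cosine(generated, ref) for ref in references) / len(references)


def text_alignment(generated: Sequence[float], prompt: Sequence[float]) -> float:
    """CLIP-T-style score: cosine similarity of image and prompt embeddings."""
    return cosine(generated, prompt)


# Toy 3-d vectors standing in for encoder outputs (hypothetical values).
gen = [1.0, 0.0, 0.0]
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
prompt_emb = [3.0, 4.0, 0.0]

personalization_score = image_similarity(gen, refs)
alignment_score = text_alignment(gen, prompt_emb)
```

The same similarity function serves both objectives; what changes is the comparison target: the user's reference images for personalization, the prompt embedding for instruction alignment.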

Key Findings¶

Personalization research in the text modality is the most mature, followed by the image modality, while video/3D/audio modalities remain in early stages.
User behavior and user documents are the most commonly used personalized contexts in the text modality, whereas the CV field relies more on personal faces/bodies and personalized subjects.
PEFT (especially LoRA) has emerged as the mainstream fine-tuning strategy for personalization across modalities.

Highlights & Insights¶

Brings personalized generation research from the NLP, CV, and IR communities into a unified framework for the first time, filling an important gap in the survey literature.
The proposed modality-agnostic workflow (user modeling -> generation modeling) provides a common language for researchers across different communities.
The multi-level taxonomy is clear and extensible, making it easy to track research progress in specific subfields.
The future work section discusses five open challenges, including scalability, preference evolution, privacy, and fairness.

Limitations & Future Work¶

As a survey paper, the depth of discussion for each subfield is limited, and the literature coverage of some emerging directions (such as 3D personalization) may not be comprehensive.
Although the unified framework provides high-level abstractions, technical differences between modalities remain significant, which limits how much practical guidance the framework can offer.
The survey does not compare the reported performance of different methods, so it lacks a quantitative comparative analysis.
Due to the literature search cutoff date, some of the most recent works might have been missed.

Related Work¶

Model-Centric Surveys: Zhang et al. (2024) focus on LLM personalization; Zhang et al. (2024) discuss diffusion model personalization.
Task-Centric Surveys: Chen et al. (2024) discuss personalized dialogue; Tseng et al. (2024) discuss role-playing.
Foundation Model Surveys: Wu et al. (2024) review multimodal large language models.
Recommendation System Surveys: Ayemowa et al. (2024) discuss generative recommendation.

Rating¶

Dimension	Score (1-5)
Novelty	4
Technical Depth	3
Experimental Thoroughness	N/A (Survey)
Writing Quality	4
Total Score	3.7