UniVid: A Unified Vision-Language Model for Video Moderation

Conference: ACL 2026
arXiv: 2606.05748
Code: None (Authors do not open-source due to security considerations)
Area: AI Safety / Content Moderation
Keywords: Video Moderation, Vision-Language Models, Policy Alignment, Content Understanding, Model Compression

TL;DR

UniVid replaces 1,000+ black-box classifiers with a single policy-aware captioning VLM, turning a fragmented, hard-to-maintain moderation stack into an interpretable and reusable end-to-end system. In ByteDance's production deployment, this reduces violation leakage by 42.7% relative to the previous solution.

Background & Motivation

Background: Content moderation systems for global short-video platforms typically utilize massive families of specialized classifiers, where each classifier is independently trained and maintained for a specific policy (e.g., violence, pornography, harassment).

Limitations of Prior Work: This "decentralized" scheme has three major problems. First, the decision scores produced by thousands of black-box classifiers lack transparency and are difficult to explain to human auditors. Second, maintenance is extremely costly, because the classifiers must be retrained and redeployed whenever platform policies are updated. Third, the models operate in silos and cannot share semantic understanding, which makes synergy with other business lines such as advertising and recommendation difficult.

Key Challenge: There is a fundamental tension between "high accuracy" and "maintainability" in traditional classifier schemes—increasing the number of classifiers improves detection precision for certain fine-grained policies but simultaneously makes the system more complex and difficult to maintain.

Goal: Design a unified Vision-Language Model (VLM) that not only understands video content accurately but also generates verifiable natural language rationales, enabling humans to intuitively understand the basis for moderation decisions.

Key Insight: The authors observe that while modern VLMs (e.g., LLaVA) possess powerful multimodal understanding, open-source models often "over-refuse" to describe violating content due to safety alignment, while commercial VLMs (GPT-4o, Gemini) have extremely high deployment costs ($3,000+ per million videos). If a VLM can be fine-tuned using a carefully designed "data recipe" to safely describe violations under internal safety policies, multiple objectives could be achieved simultaneously.

Core Idea: Use a specifically fine-tuned VLM to generate policy-aligned video captions as a unified intermediate representation for all downstream decisions. Captions serve as both human-readable evidence and reusable semantic features.

Method

Overall Architecture

UniVid decomposes end-to-end video moderation into a three-stage pipeline. The upstream Risk Filter stage rapidly scans all video streams, utilizing caption embeddings generated by UNIVID to fuse multimodal risk signals and outputting preliminary risk judgments via multiple MLP policy heads. The midstream Moderation Actor stage performs fine-grained moderation on high-risk videos, utilizing UNIVID-Lite (primary actor, prioritizing precision) and UNIVID-RAG (supplementary actor, prioritizing recall) for dual-layer decision-making. The downstream Trend Governance module reuses cached caption embeddings to quickly adapt to emerging harm trends through few-shot learning.
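
The routing between the first two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PolicyHead`, `risk_filter`, `moderate`, the 0.3 threshold, and the toy weights are all hypothetical, and each head is reduced to a single linear layer plus sigmoid where the paper describes MLP heads.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import numpy as np


@dataclass
class PolicyHead:
    """Hypothetical policy head: one linear layer plus sigmoid, for brevity."""
    weight: np.ndarray  # shape (dim,)
    bias: float

    def score(self, embedding: np.ndarray) -> float:
        logit = float(embedding @ self.weight + self.bias)
        return float(1.0 / (1.0 + np.exp(-logit)))


def risk_filter(embedding: np.ndarray,
                heads: Dict[str, PolicyHead],
                threshold: float = 0.3) -> List[str]:
    """Return policies whose risk score clears a deliberately loose threshold."""
    return [name for name, head in heads.items()
            if head.score(embedding) >= threshold]


def moderate(embedding: np.ndarray,
             heads: Dict[str, PolicyHead],
             actor: Callable[[List[str]], str],
             threshold: float = 0.3) -> str:
    """Approve low-risk videos early; send flagged ones to the stricter actor."""
    flagged = risk_filter(embedding, heads, threshold)
    if not flagged:
        return "Approve"
    return actor(flagged)


# Toy 2-d "caption embeddings" and hand-set heads (illustrative values only).
heads = {
    "violence": PolicyHead(weight=np.array([2.0, 0.0]), bias=0.0),
    "regulated": PolicyHead(weight=np.array([0.0, 2.0]), bias=-4.0),
}


def toy_actor(flagged: List[str]) -> str:
    # Stand-in for the Moderation Actor stage.
    return "Violation" if "violence" in flagged else "Approve"


print(moderate(np.array([1.0, 0.0]), heads, toy_actor))   # flagged, then judged
print(moderate(np.array([-2.0, 0.0]), heads, toy_actor))  # filtered out early
```

The point of the split is that the cheap heads run on every video while the expensive actor only sees what the filter flags, so the filter's threshold can be tuned for recall without touching the actor.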

Key Designs

  1. Hybrid Data Recipe:

    • Function: Construct a high-quality, policy-aligned training set to solve the problem of real violating videos being difficult to collect and legally sensitive.
    • Mechanism: Training proceeds in three stages. The first stage pre-trains the base model on LLaVA public data for vision-text modality alignment. The second stage performs instruction fine-tuning on captions and synthetic VQA pairs generated by GPT-4o (3.2M samples). The third stage continues fine-tuning on high-quality, human-refined caption data (0.1M samples) to strengthen policy alignment. Human annotation covers two dimensions: (i) factual correction, which removes hallucinations and fills in missing subject-action details and OCR text; and (ii) policy grounding, which requires annotators to map violating content to specific clauses in the internal policy library.
    • Design Motivation: Relying solely on GPT-4o or solely on human data is insufficient. GPT-4o captions often contain hallucinations and offer limited policy coverage, while pure human annotation is too costly and generalizes poorly. The hybrid scheme reduces annotation cost while preserving data quality and policy consistency through human quality control.
  2. Policy-Aware Captioning as Intermediate Representation:

    • Function: Replace black-box classification scores with natural language descriptions to provide clear violation rationales for human auditing.
    • Mechanism: UNIVID does not output numerical predictions like "violence level = 0.87" but generates descriptive text such as "three men riding a red motorcycle." These captions contain objective factual statements and implicit policy mappings—humans can immediately judge violations upon seeing the caption, enabling a "human-in-the-loop" decision process. The captions themselves can also be reused for other downstream tasks (ad safety, recommendation systems).
    • Design Motivation: The "black-box" nature of traditional classifiers prevents human auditors from verifying decisions, leading to confirmation bias and accountability issues. Using natural language captions as an intermediate representation preserves raw evidence while giving humans final decision-making power, meeting modern requirements for transparent content governance.
  3. Modular Phased Decision Architecture:

    • Function: Optimize different performance metrics across the three decision stages—Risk Filter prioritizes recall, Moderation Actor prioritizes precision, and Trend Governance prioritizes adaptation speed.
    • Mechanism: UNIVID-Lite is the primary actor, fine-tuned on 1M labeled videos to output "Approve/Violation" decisions auto-regressively. UNIVID-RAG is the supplementary actor: it retrieves the Top-3 most similar cases from a knowledge base of violation events (100K historical cases) and uses them as in-context examples, which helps capture low-frequency or borderline violations. Trend Governance is kept lightweight: only a single MLP trend head is trained on UNIVID's cached embeddings, so emerging threats can be handled via few-shot adaptation.
    • Design Motivation: Expecting a single model to deliver high recall, high precision, and low latency at the same time is unrealistic. The phased design lets each module optimize independently for its own task constraints.
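
The Top-3 retrieval step of UNIVID-RAG can be illustrated with a small sketch. The summary does not state the similarity measure, index, or prompt format, so the choices here are assumptions: cosine similarity over caption embeddings, a brute-force search, and the hypothetical helpers `top_k_similar` and `build_prompt`.

```python
from typing import Dict, List

import numpy as np


def top_k_similar(query: np.ndarray,
                  case_embeddings: np.ndarray,
                  k: int = 3) -> List[int]:
    """Indices of the k stored cases most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    cases = case_embeddings / np.linalg.norm(
        case_embeddings, axis=1, keepdims=True)
    sims = cases @ q                      # cosine similarity per stored case
    order = np.argsort(-sims)[:k]         # highest similarity first
    return [int(i) for i in order]


def build_prompt(caption: str,
                 cases: List[Dict[str, str]],
                 indices: List[int]) -> str:
    """Format retrieved cases as in-context examples (hypothetical layout)."""
    lines = ["Similar historical cases:"]
    for rank, idx in enumerate(indices, start=1):
        case = cases[idx]
        lines.append(f"Example {rank}: {case['caption']} -> {case['label']}")
    lines.append(f"New video: {caption} ->")
    return "\n".join(lines)


# Toy knowledge base with 2-d embeddings (illustrative values only).
case_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
    [0.7, 0.7],
])
cases = [
    {"caption": "case A", "label": "Violation"},
    {"caption": "case B", "label": "Approve"},
    {"caption": "case C", "label": "Violation"},
    {"caption": "case D", "label": "Approve"},
    {"caption": "case E", "label": "Violation"},
]

indices = top_k_similar(np.array([1.0, 0.0]), case_embeddings, k=3)
print(build_prompt("new caption", cases, indices))
```

At the stated scale of 100K cases a brute-force matrix product like this is still feasible, though a production system would more likely use an approximate nearest-neighbor index.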

Loss & Training

UNIVID employs a standard auto-regressive causal language modeling objective. Given an input text prompt \(T_{\text{in}}\), sampled video frames \(V_{\text{in}}\), and a target caption \(C\) of length \(L\), the model maximizes the conditional likelihood \(P(C \mid T_{\text{in}}, V_{\text{in}}) = \prod_{i=1}^{L} P(C_i \mid H_t, H_v, C_{<i})\), where \(H_t\) and \(H_v\) are the text and visual token representations derived from \(T_{\text{in}}\) and \(V_{\text{in}}\). During the pre-training stage, only the projection layer MLP is trained, while the vision encoder and LLM are frozen. During the fine-tuning stage, both the projection layer and the LLM decoder are trained. The entire training process takes 120 hours on 32 H100 GPUs. Mistral-v0.3-7B is selected as the LLM backbone.
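
As a numeric illustration of this objective, the sketch below computes the caption log-likelihood from per-position logits; training minimizes its negative. It assumes the logits at position \(i\) have already been conditioned on the visual and text tokens and on \(C_{<i}\); the function name and the toy shapes are hypothetical.

```python
import numpy as np


def caption_log_likelihood(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Sum over i of log P(C_i | context, C_<i).

    logits:     shape (L, vocab), next-token scores at each caption position
    target_ids: shape (L,), ground-truth token id at each position
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct token at every position.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(picked.sum())


# Uniform logits over a 4-token vocabulary, caption length 3.
logits = np.zeros((3, 4))
targets = np.array([0, 1, 2])

log_lik = caption_log_likelihood(logits, targets)
loss = -log_lik / len(targets)   # mean negative log-likelihood per token
print(log_lik, loss)
```

With uniform logits each token has probability 1/4, so the per-token loss equals \(\log 4\); a trained model lowers this by concentrating probability on the reference caption tokens.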

Key Experimental Results

Main Results

| Model | Violence | Sexual Abuse | Mental Health | Regulated Act. | Violation Recall | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 45.9 | 17.4 | 32.3 | 42.5 | 36.1 | 32.8 | 65.5 | 37.4 |
| Gemini-2.5-Pro | 63.8 | 44.3 | 55.6 | 57.6 | 55.1 | 42.5 | 95.2 | 57.9 |
| LLaVA-OV 8B | 17.8 | 6.9 | 12.9 | 15.7 | 13.0 | 12.0 | 86.3 | 19.3 |
| UNIVID-7B | 56.3 | 51.3 | 50.2 | 57.7 | 54.3 | 28.9 | 82.3 | 39.1 |
| UNIVID-1B | 53.6 | 49.1 | 49.9 | 55.3 | 52.1 | 27.4 | 82.8 | 37.5 |

UNIVID-7B outperforms the open-source baseline LLaVA-OV across the board in violation recall (54.3% vs. 13.0%), with particularly strong performance in sensitive domains such as sexual abuse and mental health.

Ablation Study

| Configuration | Violence | Sexual Abuse | Mental Health | Regulated Act. | Violation Recall |
|---|---|---|---|---|---|
| UNIVID-7B (Full) | 56.3 | 51.3 | 50.2 | 57.7 | 54.3 |
| w/o Hybrid Data | 37.9 | 35.5 | 33.5 | 41.1 | 37.5 |
| Human Data Only | 29.1 | 22.9 | 18.9 | 29.4 | 26.1 |

The hybrid data recipe is crucial: the "w/o Hybrid Data" configuration loses 16.8 percentage points of violation recall (54.3% to 37.5%), while using only human data generalizes worst (recall drops to 26.1%).

Production Results

| Metric | Before | After | Relative Gain |
|---|---|---|---|
| Violation Leakage Rate | 0.255% | 0.146% | -42.7% ↓ |
| Over-deletion Rate | 35.4% | 22.3% | -37.0% ↓ |
| Deployment Cost (per 1M videos) | - | $180 | 1/15 of commercial VLM |
| Specialized Classifiers Replaced | 1000+ | 1 | Massive complexity reduction |
| Reclaimed GPU Resources | - | 1900 A30 units | Compute efficiency gain |
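
The two relative gains can be reproduced from the before and after rates. The helper below is a small arithmetic check (the function name is ours), useful mainly to make clear that these are relative changes and not percentage-point differences.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


leakage = relative_change(0.255, 0.146)       # violation leakage rate
over_deletion = relative_change(35.4, 22.3)   # over-deletion rate

print(round(leakage, 1), round(over_deletion, 1))  # -42.7 -37.0
```

In absolute terms, the leakage rate falls by 0.109 percentage points and the over-deletion rate by 13.1 percentage points.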

Highlights & Insights

  • The Brilliance of Captions as Intermediate Representation: This design is the most astute aspect of the paper. Instead of forcing the VLM to directly output "Violation/No Violation," it generates natural language descriptions. This transforms the system from a "black-box neural network" into an "auditable human-AI collaborative workflow"—humans can judge based on policies after reading the caption or provide feedback to annotators to identify model blind spots.
  • Restrained Design of Hybrid Data Recipe: The authors did not greedily collect massive amounts of real violating videos but intelligently mixed GPT-4o synthetic data with human refinement. Human annotation is not done from scratch but corrects GPT hallucinations and aligns with internal policies.
  • Phased Decisions Eliminate "Multi-Metric Conflicts": The Risk Filter loosens thresholds to maximize recall, the Moderation Actor makes strict decisions to maximize precision, and Trend Governance adapts quickly to handle emerging threats.

Limitations & Future Work

Limitations acknowledged by the authors:

  • The system currently does not integrate reinforcement learning methods (e.g., GRPO). Although generated policy-aware captions can serve as reasoning trajectories, internal policy guidelines are not directly encoded as reward signals explicitly bound to the generation process.
  • The system employs frame sampling rather than full temporal modeling, meaning violating content appearing only in a single frame might be missed if not sampled.

Limitations observed by the reviewer:

  • While the evaluation set is more globalized than previous benchmarks (e.g., KuaiMod), it is still primarily derived from the short-video platform ecosystem.
  • The consistency of the model in multilingual violation descriptions still needs validation.

Future directions: Explicitly encode policy guidelines as LLM system prompts or fine-tuning objectives; explore more efficient temporal sampling strategies; perform out-of-distribution testing on real traffic across multiple regions and platforms.

Related Work & Insights

  • vs. Traditional Video Moderation Systems (e.g., end-to-end classifier families in Shi et al. 2024): They use thousands of independent classifiers, each corresponding to one policy; this paper uses a single VLM to generate captions before making decisions. Traditional schemes are "one model per policy," so maintenance costs grow linearly with the number of policies.
  • vs. Open-source VLMs (LLaVA-OV, LLaVA-Next): They pursue general-purpose capabilities, while this paper fine-tunes for violating content. LLaVA's strength lies in community support and general ability; this paper's strength is specialized optimization for violation detection, raising violation recall from 13.0% (LLaVA-OV) to 54.3% (UNIVID-7B).
  • vs. Commercial VLMs (GPT-4o, Gemini-2.5-Pro): They possess the strongest multimodal understanding but involve extremely high deployment costs and "refusal to generate" issues caused by safety guardrails; this paper offers lower cost (about 1/15) and does not refuse to describe violating content, since it is fine-tuned to describe violations under internal safety policies.
  • Inspiration: The hybrid data strategy and human annotation framework in this paper are valuable for other data-constrained scenarios (e.g., rare diseases in medical imaging, edge cases in autonomous driving).

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of replacing specialized classifiers with a VLM is not entirely new, but systematizing it into an industrial-scale moderation system that covers data recipes, multi-stage decisions, and model variants represents significant engineering innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes not only offline benchmarking but also sandbox simulations and ablation studies. The CapBench dataset is more comprehensive than previous benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ The structure is clear, and the ethics and compliance sections are well-developed.
  • Value: ⭐⭐⭐⭐⭐ This is the first report of a successful deployment of a specialized VLM moderation system on an industrial-scale short-video platform, showing significant improvements in leakage rates, over-deletion rates, and costs compared to baseline systems.