TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

Conference: CVPR 2026 · arXiv: 2603.22882 · Code: https://github.com/ChunXiaostudy/TreeTeaming
Area: Multimodal VLM
Keywords: red-teaming, vision-language model safety, strategy tree exploration, automated vulnerability discovery, jailbreak attack

TL;DR

This paper proposes TreeTeaming, an autonomous red-teaming framework that transforms strategy exploration from static testing into a dynamic evolutionary process. An LLM orchestrator autonomously constructs and expands a hierarchical strategy tree, while a multimodal executor carries out concrete attacks. TreeTeaming achieves state-of-the-art attack success rates on 11 out of 12 evaluated VLMs, reaching 87.60% on GPT-4o.

Background & Motivation

  1. Background: As VLMs grow more capable, their safety has attracted increasing attention. Red-teaming is a critical methodology for systematically identifying model vulnerabilities.
  2. Limitations of Prior Work: Existing methods are constrained by predefined strategies — whether fixed prompt templates, typographic obfuscation, or fixed image patterns — and can only optimize within a known strategy space, failing to discover novel attack vectors.
  3. Key Challenge: Even methods with feedback mechanisms (e.g., TRUST-VLM) can only refine test cases within a predefined framework; the strategies themselves still require manual design.
  4. Goal: To automate the discovery of attack strategies themselves, rather than merely automating the execution of known strategies.
  5. Key Insight: Starting from a single seed example, a complete strategy system is grown through hierarchical exploration over a tree structure.
  6. Core Idea: A strategy orchestrator autonomously decides whether to "deepen promising attack paths" or "explore new strategy branches," thereby constructing a strategy tree.

Method

Overall Architecture

Two core components: (1) a Strategy Orchestrator (LLM) — maintains a hierarchical strategy tree and autonomously determines the direction of exploration (deepening vs. branching); (2) a Multimodal Executor — executes concrete attack strategies using a pluggable tool suite, with a built-in consistency checker that verifies the alignment between final samples and intended attacks.
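The hierarchical strategy tree can be pictured as a simple recursive data structure. The sketch below is an illustrative assumption, not the paper's implementation: the class name, fields, and helper methods are hypothetical, chosen only to show how abstract concepts sit at internal nodes and concrete attack strategies at the leaves.

```python
from dataclasses import dataclass, field

@dataclass
class StrategyNode:
    """Hypothetical node in the hierarchical strategy tree: internal nodes
    hold abstract concepts, leaves hold concrete, executable attack strategies."""
    name: str
    success_rate: float = 0.0  # running attack success rate for this subtree
    children: list = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A node with no children is a concrete attack strategy.
        return not self.children

    def add_child(self, child: "StrategyNode") -> "StrategyNode":
        self.children.append(child)
        return child

    def leaves(self):
        # Yield every concrete strategy under this node.
        if self.is_leaf():
            yield self
        else:
            for c in self.children:
                yield from c.leaves()

# Growing a small tree from a single seed, mirroring the paper's setup.
root = StrategyNode("seed")
camo = root.add_child(StrategyNode("semantic camouflage"))
camo.add_child(StrategyNode("academic framing"))
camo.add_child(StrategyNode("educational framing"))
root.add_child(StrategyNode("typographic obfuscation"))
```

In this layout, "deepening" adds children under a promising node, while "branching" adds a new sibling subtree under the root.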

Key Designs

  1. Hierarchical Strategy Tree: Abstract concepts serve as parent nodes, and concrete attack strategies serve as leaf nodes; the orchestrator expands the tree autonomously.
  2. Deepen vs. Explore Decision: The LLM orchestrator autonomously decides whether to go deeper (refining existing strategies) or sideways (exploring new strategy branches), based on attack success rates and strategy diversity.
  3. Consistency Checker: Verifies that execution results align with the intended strategy, addressing the problem of strategy drift.
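The deepen-vs-explore decision can be sketched as a scoring rule over existing branches. In the actual framework this judgment is delegated to the LLM orchestrator; the function below is a toy stand-in whose scoring, threshold, and diversity penalty are illustrative assumptions only.

```python
def decide_action(children, diversity_penalty=0.05, threshold=0.5):
    """Toy deepen-vs-explore rule. `children` is a list of
    (success_rate, num_subchildren) pairs for the current branches.
    Deepen the most promising branch when its success rate is high,
    but penalize branches that are already crowded, which pushes the
    search toward new strategy branches (diversity pressure)."""
    if not children:
        return "explore"  # nothing to deepen yet: open a new branch
    best_rate, best_width = max(children)
    score = best_rate - diversity_penalty * best_width
    return "deepen" if score > threshold else "explore"
```

A real orchestrator would also weigh the consistency checker's verdicts, pruning branches whose executed samples drift from the intended strategy.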

Loss & Training

No training is required — an LLM serves as the orchestrator and multimodal tools execute the attacks.

Key Experimental Results

Main Results

Model      TreeTeaming ASR   Best Baseline ASR   Note
GPT-4o     87.60%            ~70%                SOTA
Claude-3   ~80%              ~65%                SOTA
LLaVA      ~90%              ~75%                SOTA

TreeTeaming reaches state-of-the-art ASR on 11 of 12 evaluated models, comprehensively leading the baselines.

Key Findings

  • The set of discovered strategies is more diverse than the union of all publicly known attack strategies.
  • Generated attacks exhibit an average toxicity reduction of 23.09% — making them more covert and harder to detect by safety filters.
  • A rich strategy tree can be grown from a single seed example.

Newly Discovered Attack Strategy Categories

Strategy Category                   Example                                                Status
Typographic obfuscation             Embedding textual instructions in images               Known
Role-play induction                 Constructing fictional scenarios to bypass moderation  Known
Semantic camouflage                 Framing content within academic/educational contexts   Newly discovered
Multi-turn progressive elicitation  Gradually guiding toward target content               Newly discovered
Visual-semantic conflict            Misleading through image-text semantic contradiction   Newly discovered

Attack Toxicity Comparison

  • Attacks generated by TreeTeaming show an average toxicity reduction of 23.09%, making them more covert.
  • Their success rates are simultaneously higher, suggesting that lower surface toxicity helps attacks evade safety filters rather than weakening them.
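To make the reported metric concrete, a mean relative toxicity reduction can be computed as below. This is a hedged sketch: the paper's exact toxicity scorer and aggregation are not shown here, and the function name and sample values are illustrative.

```python
def avg_toxicity_reduction(baseline_scores, ours_scores):
    """Mean per-sample relative reduction in toxicity, as a percentage.
    Each pair compares a baseline attack's toxicity score with the
    TreeTeaming attack's score on the same target query (illustrative)."""
    reductions = [(b - o) / b for b, o in zip(baseline_scores, ours_scores)]
    return 100 * sum(reductions) / len(reductions)

# Example: scores drop from 0.8 -> 0.6 and 0.5 -> 0.4, a 22.5% mean reduction.
print(avg_toxicity_reduction([0.8, 0.5], [0.6, 0.4]))
```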

Highlights & Insights

  • "Automatically discovering strategies" rather than "automatically executing strategies" represents a paradigm-level innovation.
  • The strategy tree structure makes discovered attacks interpretable and traceable.
  • The framework is equally valuable for defense research — newly discovered strategies can be directly used to improve safety alignment.

Limitations & Future Work

  • The framework relies on a powerful LLM as the orchestrator, which incurs high cost; weaker models may be unable to effectively explore the strategy space.
  • The ethical boundaries of automated attack generation require careful consideration, as newly discovered attack vectors could be maliciously exploited.
  • Attacking closed-source models (e.g., GPT-4o) depends on API access, which is subject to rate limits and cost constraints.
  • The growth rate of the strategy tree may slow with increasing depth, making deeper strategies progressively harder to discover.
  • The accuracy of the consistency checker may affect the quality of final attack samples.
  • The automatic generation of defensive strategies following attack discovery remains unexplored.
  • The definition of attack success rate may be insufficient — some "successful" cases may represent only marginal violations.
Comparison with Related Work

  • vs. TRUST-VLM: TRUST-VLM automates test-case generation within a fixed strategy framework; TreeTeaming automates the discovery of strategies themselves.
  • vs. FigStep/MM-SafetyBench: These methods each embody a single manually designed attack strategy; TreeTeaming can automatically discover these strategies and more.

Additional Discussion

  • The core shift is from optimizing attacks within a fixed strategy space to discovering and expanding the strategy space itself, which yields a broader view of model vulnerabilities.
  • The evaluation spans 12 VLMs, covering both open- and closed-source models, against multiple baselines.
  • The modular executor with its pluggable tool suite should make the framework straightforward to extend to new attack tools and target models.
  • Open-sourcing the code is of significant value for community reproduction and follow-up safety research.
  • Relative to prior automated red-teaming work, the paper automates strategy discovery rather than test-case generation, a deeper level of automation.
  • The paper forms a complete loop from problem definition to method design to experimental validation.
  • Runtime cost is dominated by orchestrator LLM calls, so practical deployability hinges on orchestrator pricing and API rate limits.
  • Future work could integrate additional modalities (e.g., audio, 3D point clouds).
  • Validating scalability to larger models and deeper strategy trees remains an important direction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ A paradigm leap from automated execution to automated strategy discovery
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 12 models with strategy diversity analysis
  • Writing Quality: ⭐⭐⭐⭐ Clear framework presentation and intuitive comparisons
  • Value: ⭐⭐⭐⭐⭐ Significant contribution to AI safety research