HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Conference: ECCV 2024
arXiv: 2403.12884
Code: https://github.com/ControlNet/HYDRA
Area: LLM Agent
Keywords: compositional visual reasoning, reinforcement learning, LLM agent, dynamic planning, visual foundation models
TL;DR
(Note: Brief note based on abstract) This paper proposes HYDRA, a multi-stage dynamic compositional visual reasoning framework. Through the collaboration of three modules—a Planner, a reinforcement learning cognitive controller (RL Agent), and a Reasoner—it achieves reliable and progressive visual reasoning, reaching SOTA performance on multiple datasets including RefCOCO/RefCOCO+, OK-VQA, and GQA.
Background & Motivation
Background: Visual reasoning (VR) has recently benefited from large vision-language models (VLMs), with progress on tasks such as VQA and visual grounding. However, end-to-end VLM methods require training on large-scale datasets, which is computationally expensive and limits generalization. Compositional visual reasoning (compositional VR) methods such as ViperGPT have emerged as effective alternatives: they decompose a task into sub-tasks and call tools to solve them.
Limitations of Prior Work: Existing compositional methods (ViperGPT, IdealGPT, etc.) rely heavily on the common-sense knowledge of LLMs for planning and reasoning, but do not consider the impact of decisions on the visual reasoning process. The instructions/code generated by the LLM may not suit the current visual scene, leading to incorrect or failed reasoning processes without the ability to self-correct from mistakes.
Key Challenge: Compositional VR requires flexible planning (exploring multiple decomposition strategies) and robust decision-making (adjusting strategies based on feedback). However, single-pass LLM generation has no mechanism for dynamic adjustment, so a planning error propagates through every later step to the final result.
Goal: Give compositional visual reasoning systems the ability to adjust dynamically, selecting the best instruction based on historical feedback so that reasoning proceeds from coarse to fine.
Key Insight: Introduce a reinforcement learning agent as a "cognitive controller" to act as a decision hub between the LLM planner and LLM reasoner, dynamically selecting the optimal instruction path.
Core Idea: Utilize an RL agent as the cognitive hub between the LLM planner and reasoner to dynamically select the best reasoning instructions based on a feedback loop of historical states, achieving adaptive compositional visual reasoning.
Method
Overall Architecture
HYDRA adopts a multi-stage iterative reasoning architecture: Input Image and Question \(\to\) Planner Module (LLM) generates \(N\) candidate instruction samples (each instruction represents a task decomposition strategy) \(\to\) Controller Module (RL Agent, DQN architecture) selects the best instruction based on the current state (including historical information) \(\to\) Reasoner Module (LLM) translates the selected instruction into executable code to invoke Visual Foundation Models (VFMs) \(\to\) Textualizer Module translates the perceptual output into text \(\to\) Results are stored in the State Memory Bank \(\to\) Proceed to the next iteration until the final answer is obtained.
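The data flow above can be sketched as a plain loop. This is a minimal sketch under stated assumptions, not the HYDRA codebase: every name here (`run_hydra_loop`, `StepRecord`, `StateMemory`, the `toy_*` stubs) is hypothetical, the four modules are passed in as ordinary callables, and the stopping rule (the Reasoner returns a non-`None` answer, or an iteration cap is hit) is assumed for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class StepRecord:
    """One iteration's entry in the (hypothetical) state memory bank."""
    instruction: str
    code: str
    feedback: str  # textualized perception output


@dataclass
class StateMemory:
    records: List[StepRecord] = field(default_factory=list)

    def as_text(self) -> str:
        # Flatten the history so it can be fed back to planner and controller.
        return "\n".join(
            f"[{i}] {r.instruction} -> {r.feedback}"
            for i, r in enumerate(self.records)
        )


def run_hydra_loop(
    question: str,
    plan: Callable[[str, str], List[str]],    # (question, history) -> N candidates
    select: Callable[[str, List[str]], int],  # (history, candidates) -> chosen index
    reason: Callable[[str], Tuple[str, object, Optional[str]]],  # -> (code, raw, answer)
    textualize: Callable[[object], str],
    max_iters: int = 4,                       # assumed iteration cap
) -> Tuple[Optional[str], StateMemory]:
    memory = StateMemory()
    for _ in range(max_iters):
        history = memory.as_text()
        candidates = plan(question, history)
        chosen = candidates[select(history, candidates)]
        code, raw, answer = reason(chosen)
        memory.records.append(StepRecord(chosen, code, textualize(raw)))
        if answer is not None:
            return answer, memory
    return None, memory


# Toy stand-ins so the loop runs without any LLM or vision model.
def toy_plan(question: str, history: str) -> List[str]:
    coarse = "detect objects in the whole image"
    fine = "answer the question from the detections"
    return [coarse, fine] if not history else [fine, coarse]


def toy_select(history: str, candidates: List[str]) -> int:
    return 0  # a trained controller would score the candidates instead


def toy_reason(instruction: str) -> Tuple[str, object, Optional[str]]:
    if instruction.startswith("detect"):
        return "find('object')", [("cup", (10, 20, 50, 60))], None
    return "return 'cup'", "cup", "cup"


def toy_textualize(raw: object) -> str:
    if isinstance(raw, str):
        return raw
    return "; ".join(f"{label} at {box}" for label, box in raw)


answer, memory = run_hydra_loop(
    "What is on the table?", toy_plan, toy_select, toy_reason, toy_textualize
)
print(answer)
print(memory.as_text())
```

With these stubs the first round performs coarse detection and returns no answer, and the second round answers from the stored feedback, which mirrors the coarse-to-fine behaviour described in these notes.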
Key Designs
- Planner Module:
  - Function: Generates multiple candidate reasoning instructions for a given visual reasoning question.
  - Mechanism: Uses an LLM as a natural language planner to generate \(N\) instruction samples of varied complexity and scope tailored to the input image descriptions and questions. Each instruction contains the allocation of perception tasks and a description of reasoning steps, along with estimated validity probabilities. The diversity of instructions is reflected in perceptual granularity (coarse/fine), task decomposition methods, types of visual tools invoked, etc.
  - Design Motivation: Unlike methods such as ViperGPT that generate a single fixed program, HYDRA generates multiple candidate plans, giving the Controller a set of options to choose from. This diversity avoids the "one mistake ruins all" issue.
- Controller Module (RL Cognitive Controller):
  - Function: Selects the optimal instruction from the \(N\) candidate instructions generated by the Planner in each iteration.
  - Mechanism: An RL agent based on DQN (Deep Q-Network) models the visual reasoning process as a Markov Decision Process (MDP). The state space includes the current question, image features, historical instructions, and execution results (stored in the State Memory Bank). The action space is the choice among \(N\) candidate instructions. Rewards are based on the comparison between the reasoning results and ground-truth answers. The RL agent learns the optimal decision policy across different visual reasoning scenarios through interaction with the environment.
  - Design Motivation: Pure LLM-driven decision-making is stateless: it does not remember what was done previously or how effective it was. The RL agent achieves "decision-making with memory" via a feedback loop of historical states, learning from past successes and failures to make the reasoning process more reliable.
- Reasoner Module:
  - Function: Translates the natural language instructions selected by the Controller into executable code and executes it.
  - Mechanism: Composed of two sub-modules: an LLM Code Generator (translating natural language instructions into Python code that calls VFM APIs) and a Code Executor (executing the code in a secure sandbox to call visual tools such as object detection, image segmentation, OCR, and depth estimation, and so obtain perceptual results).
  - Design Motivation: Decoupling reasoning from planning: the Planner is only responsible for "what to do", while the Reasoner is responsible for "how to do it". Code-based execution keeps the reasoning process explainable and debuggable.
- Textualizer Module and State Memory Bank:
  - Function: Converts perceptual outputs into text and maintains global state memory.
  - Mechanism: The Textualizer converts the output of visual tools (bounding boxes, segmentation masks, depth maps, etc.) into structured textual descriptions that the LLM can consume more easily. The State Memory Bank stores data from all historical iterations (instructions, code, outputs, textualized results), providing context for the decisions of the Planner and Controller in subsequent rounds.
  - Design Motivation: A core challenge of compositional reasoning is passing information across iterations: the visual perception information obtained in each round needs to be available to later rounds. The State Memory Bank realizes this incremental knowledge accumulation.
- Incremental Reasoning:
  - Function: Gradually accumulates fine-grained visual information across multiple iterations.
  - Mechanism: In each iteration, historical state information helps the LLM and RL agent acquire finer-grained visual information (via VFMs and visual-perception-to-text modules), progressively refining the reasoning process. The first few rounds might perform coarse-grained perception (whole-image detection), while the subsequent rounds conduct fine-grained analysis (attribute recognition in specific regions).
  - Design Motivation: Complex visual reasoning often requires multi-step information accumulation; a single-step reasoning pass can easily miss key details.
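The Controller's selection step can be sketched as follows. This is a minimal sketch rather than the paper's implementation: a linear Q-function stands in for the deep Q-network, the state is assumed to already be a fixed-length feature vector (how HYDRA encodes the question and memory bank is not covered in these notes), and `LinearQController` with its parameters is a hypothetical name.

```python
import random
from typing import List, Sequence


class LinearQController:
    """Linear stand-in for a DQN: Q(s, a) = w_a . s, one weight row per candidate slot."""

    def __init__(self, state_dim: int, n_actions: int, lr: float = 0.1,
                 gamma: float = 0.9, epsilon: float = 0.1, seed: int = 0) -> None:
        self.weights = [[0.0] * state_dim for _ in range(n_actions)]
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.rng = random.Random(seed)

    def q_values(self, state: Sequence[float]) -> List[float]:
        return [sum(w * s for w, s in zip(row, state)) for row in self.weights]

    def select(self, state: Sequence[float]) -> int:
        # Epsilon-greedy: explore a random candidate, otherwise take the best Q-value.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        q = self.q_values(state)
        return max(range(len(q)), key=q.__getitem__)

    def update(self, state: Sequence[float], action: int, reward: float,
               next_state: Sequence[float], done: bool) -> float:
        # One-step TD target; terminal steps use the reward alone.
        target = reward if done else reward + self.gamma * max(self.q_values(next_state))
        td_error = target - self.q_values(state)[action]
        for i, s in enumerate(state):
            self.weights[action][i] += self.lr * td_error * s
        return td_error


# Usage: three candidate instructions, a 2-dimensional toy state, no exploration.
ctrl = LinearQController(state_dim=2, n_actions=3, epsilon=0.0)
state = [1.0, 0.0]
before = ctrl.select(state)  # all Q-values tie at 0, so the first slot wins
err = ctrl.update(state, action=1, reward=1.0, next_state=state, done=True)
after = ctrl.select(state)   # slot 1 now has the highest Q-value
print(before, err, after)
```

A real DQN would replace the weight rows with a neural network and add a replay buffer; the selection and target logic keep the same shape.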
Loss & Training
- RL Agent Training: Trained using the DQN algorithm with a reward function based on the correctness of the final answer. Training data is sourced from visual reasoning datasets (GQA, OK-VQA, etc.).
- LLM Components of Planner and Reasoner: Can utilize GPT-4 or open-source LLMs driven by prompt engineering without requiring extra fine-tuning.
- Generalization of RL Agent: An agent trained on GQA can be directly transferred to OK-VQA, with performance only slightly below the specifically trained version (48.17 vs 48.63), demonstrating excellent cross-dataset generalization.
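For reference, the standard DQN objective that such training would minimize is written out below. This is the textbook formulation, not one taken from the paper: the binary reward, the discount factor \(\gamma\), the target-network parameters \(\theta^-\), and the replay buffer \(\mathcal{D}\) are assumptions. Here \(s_t\) is the state built from the State Memory Bank and \(a_t\) is the index of the selected candidate instruction.

\[
r_t =
\begin{cases}
1 & \text{if the final answer matches the ground truth} \\
0 & \text{otherwise}
\end{cases}
\]

\[
y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}
\left[ \big( y_t - Q_{\theta}(s_t, a_t) \big)^2 \right]
\]

For a terminal step the target reduces to \(y_t = r_t\), since no further instruction is selected.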
Key Experimental Results
Main Results
RefCOCO/RefCOCO+ Visual Grounding:
| Type | Method | RefCOCO | RefCOCO+ |
|---|---|---|---|
| E2E | OWL-ViT | 30.3 | 29.4 |
| E2E | OWLv2 | 33.5 | 31.7 |
| E2E | GLIP | 55.0 | 52.2 |
| E2E | ReCLIP | 58.6 | 60.5 |
| E2E | KOSMOS-2 | 57.4 | 50.7 |
| Compositional | Code-bison | 44.4 | 38.2 |
| Compositional | ViperGPT | 59.8 | 60.0 |
| Compositional | HYDRA | 61.7 | 61.1 |
OK-VQA / GQA Visual Question Answering:
| Type | Method | OK-VQA | GQA |
|---|---|---|---|
| E2E | BLIP-2 | 45.9 | 45.5 |
| E2E | Flamingo (9B) | 44.7 | - |
| E2E | InstructBLIP (13B) | 47.9 | - |
| Compositional | IdealGPT | 19.4 | 41.7 |
| Compositional | ViperGPT | 40.7 | 37.9 |
| Compositional | HYDRA | 48.6 | 47.9 |
Ablation Study
RL Agent Generalization:
| Method | Training Set | Test Set | Accuracy |
|---|---|---|---|
| ViLT | GQA | OK-VQA | 32.13 |
| ViperGPT | - | OK-VQA | 40.74 |
| HYDRA | GQA | OK-VQA | 48.17 |
| HYDRA | OK-VQA | OK-VQA | 48.63 |
| HYDRA | OK-VQA | A-OKVQA | 55.94 |
| HYDRA | A-OKVQA | A-OKVQA | 56.35 |
Key Findings
- HYDRA outperforms the listed E2E baselines: On OK-VQA it scores 48.6 vs 47.9 for InstructBLIP (13B), and on GQA 47.9 vs 45.5 for BLIP-2, showing that a compositional reasoning framework can match or exceed end-to-end large models without large-scale training. The other compositional baselines (IdealGPT, ViperGPT) do not show this advantage on OK-VQA and GQA.
- Value of the RL Agent's decision-making: Compared to ViperGPT (fixed strategy), HYDRA gains 10.0 points on GQA (37.9 \(\to\) 47.9). This gain points to the importance of dynamic instruction selection, although the tables here do not isolate the Controller's contribution from HYDRA's other components.
- Strong cross-dataset generalization: The agent trained on GQA loses only 0.46 percentage points (48.17 vs 48.63) when transferred to OK-VQA, suggesting that the RL agent learns reasoning and decision-making policies that are not specific to one dataset.
- Incremental reasoning improves reliability: Through the feedback loop and state memory, HYDRA can correct errors from previous rounds across multiple iterations, which is intended to reduce reasoning failures.
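The deltas quoted in these findings can be recomputed from the tables above. The snippet is only an arithmetic check; the dictionary layout and helper names are illustrative.

```python
# Scores copied from the result tables in this note.
main = {
    "RefCOCO": {"HYDRA": 61.7, "ViperGPT": 59.8},
    "RefCOCO+": {"HYDRA": 61.1, "ViperGPT": 60.0},
    "OK-VQA": {"HYDRA": 48.6, "ViperGPT": 40.7, "InstructBLIP (13B)": 47.9},
    "GQA": {"HYDRA": 47.9, "ViperGPT": 37.9, "BLIP-2": 45.5},
}


def point_gain(ours: float, theirs: float) -> float:
    """Absolute difference in accuracy points, rounded to avoid float noise."""
    return round(ours - theirs, 2)


def relative_drop(in_domain: float, transferred: float) -> float:
    """Relative drop in percent when transferring the agent across datasets."""
    return round((in_domain - transferred) / in_domain * 100, 2)


gqa_gain = point_gain(main["GQA"]["HYDRA"], main["GQA"]["ViperGPT"])
okvqa_vs_e2e = point_gain(main["OK-VQA"]["HYDRA"], main["OK-VQA"]["InstructBLIP (13B)"])
transfer_gap = point_gain(48.63, 48.17)        # OK-VQA: in-domain vs GQA-trained agent
transfer_rel = relative_drop(48.63, 48.17)
aokvqa_gap = point_gain(56.35, 55.94)          # A-OKVQA: in-domain vs OK-VQA-trained agent

print(gqa_gain, okvqa_vs_e2e, transfer_gap, transfer_rel, aokvqa_gap)
```

The transfer gap is 0.46 accuracy points, which corresponds to a relative drop of just under 1%.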
Highlights & Insights
- New paradigm of RL as a cognitive controller: Distinct from letting the LLM make all decisions directly, introducing an RL agent achieves "metacognition"—the dynamic evaluation and selection of LLM-generated strategies. This represents a new design paradigm for LLM agent systems.
- Compositional capability + adaptability: It inherits both the explainability and zero-shot generalization advantages of compositional reasoning, while acquiring adaptive learning capabilities from experience via RL.
- Addressing planning robustness via feedback loops: The State Memory Bank enables information accumulation and error correction across iterations instead of a one-off serial execution.
- Modular design facilitating extensibility: Each module is independently replaceable—one can easily swap in a stronger LLM, add more visual tools, or adjust the RL policy.
Limitations & Future Work
- Data demand for RL training: Although large-scale VLM training data is not required, the RL agent still needs training on specific tasks. Cross-domain generalization is robust, but performance gaps still exist.
- Efficiency issues of multi-round reasoning: Multi-stage iterative reasoning involves multiple LLM calls and visual tool executions, resulting in significantly higher inference latency compared to end-to-end methods.
- Selection of candidate instruction count \(N\): If \(N\) is too small, the Controller has too few strategies to choose from; if \(N\) is too large, its action space grows and selection becomes harder to learn. The optimal value may vary by task.
- Dependency on LLM code generation quality: The Reasoner's code generation quality directly affects reasoning outcomes. LLM code errors are difficult for the RL agent to completely correct.
- Lack of video and 3D scene reasoning: The method is currently validated only on static image tasks. Its applicability to dynamic scenes (video VQA) and 3D scenes requires further exploration.
Related Work & Insights
- ViperGPT [Surís et al., 2023]: Uses an LLM to generate Python code for calling visual APIs for visual reasoning. HYDRA introduces a dynamic decision-making mechanism on top of this.
- IdealGPT [You et al., 2023]: An iterative decomposition and alignment framework for VQA, but lacks a mechanism to learn from feedback.
- Chameleon [Lu et al., 2024]: A tool-augmented LLM for compositional reasoning, but depends on fixed tool selection policies.
- Insights: The idea of using an RL agent as a "cognitive controller" can be generalized to other LLM agent systems, for example dynamically selecting analysis strategies in automated scientific research assistants, or choosing among candidate implementations in code generation systems.
Rating
- ⭐⭐⭐⭐ Novelty: Introducing RL into the decision-making process of compositional VR is an innovative approach; the "cognitive controller" concept is highly valuable.
- ⭐⭐⭐⭐ Experimental Thoroughness: Covering 4 datasets, various comparison methods, and comprehensive evaluation of generalization.
- ⭐⭐⭐⭐ Writing Quality: Clear explanation of the framework design and an easily comprehensible modular structure.
- ⭐⭐⭐⭐ Value: The approach of introducing RL-based dynamic decision-making to LLM agent systems holds extensive transfer value.