Olympus: A Universal Task Router for Computer Vision Tasks

Conference: CVPR 2025
arXiv: 2412.09612
Code: None
Area: 3D Vision
Keywords: Task Routing, Multimodal Large Language Models (MLLMs), Chain-of-Action, Unified Framework, Vision Tasks

TL;DR

Olympus uses a multimodal large language model (MLLM) as a unified task router. By designing task-specific routing tokens and constructing a large-scale instruction dataset, it dispatches over 20 computer vision tasks (covering image, video, and 3D) to dedicated expert models, achieving a 94.75% single-task routing accuracy and a 91.82% chain-of-action precision.

Background & Motivation

Background: Multimodal Large Language Models (MLLMs, e.g., LLaVA, GPT-4V) show excellent performance in understanding tasks like visual question answering. Meanwhile, unified models such as Emu3 and Omni-Gen attempt to handle both understanding and generation tasks within a single network.

Limitations of Prior Work: "All-in-one" models face three major dilemmas: (1) Conflict between different task objectives (such as text generation vs. image generation) leads to degraded performance in individual tasks; (2) Diverse input and output formats make it very difficult to scale to more tasks; (3) Training such comprehensive models requires massive computational resources (e.g., Omni-Gen requires 104 \(\times\) A800 GPUs and a five-stage training process).

Key Challenge: Optimizing dozens of different vision tasks simultaneously in a single model makes it difficult to balance performance and scalability due to inter-task optimization conflicts and architectural limitations.

Goal: Instead of pursuing an "all-in-one" model, this work employs the MLLM as a "dispatcher" that handles vision-language understanding tasks itself, while delegating generation and classical vision tasks to external expert models.

Key Insight: Inspired by HuggingGPT but diverging from its pure prompt-engineering approach, Olympus trains the MLLM to learn task routing. Combined with a large-scale instruction dataset generated by GPT-4o, this establishes a precise mapping from user instructions to expert models.

Core Idea: With task-specific routing tokens, the MLLM learns to emit the matching routing tokens and refined prompts in its responses. It can therefore schedule 20+ vision tasks through expert models while avoiding the objective conflicts that arise when a single network is trained on all of them.

Method

Overall Architecture

Olympus uses a trainable MLLM (based on the Mipha architecture, featuring a SigLIP vision encoder and a Phi-2 language model) as its central controller. For understanding tasks like VQA, the MLLM directly responds using its internal capability. For tasks like image/video/3D generation, image editing, and depth estimation, the MLLM generates responses containing routing tokens and refined prompts, which are then dispatched to corresponding expert models (e.g., Stable Diffusion, ControlNet, InstantMesh) for execution. The framework supports executing up to five chained tasks within a single instruction.

Key Designs

OlympusInstruct Dataset Construction:
- Function: Provides high-quality user instruction-response pairs for 20 types of vision tasks as training data.
- Mechanism: A task-specific GPT-4o prompt template is designed for each task, incorporating diverse prefixes/phrases and a three-level complexity hierarchy (short/medium/long) to generate instructions varying in style, tone, and structure. Additionally, 64.8K chain-of-action instruction pairs are constructed to train the model to schedule multiple tasks within a single instruction. A total of 446.3K training samples and 49.6K evaluation samples were collected.
- Design Motivation: User instructions in real-world scenarios are highly diverse and require coverage of various expression styles. Chain-of-action data enables the model to handle composite requests, such as "generate an image first, then edit it."
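
The template idea above can be illustrated with a minimal sketch. The source does not give the actual GPT-4o templates, so the wording, the `PREFIXES` list, the `COMPLEXITY_HINTS` mapping, and both function names below are hypothetical; only the three complexity levels and the `<image_gen>` tag come from the text.

```python
from itertools import product
from typing import List

# Hypothetical wording for the three complexity levels named in the text.
COMPLEXITY_HINTS = {
    "short": "a single short sentence",
    "medium": "two or three sentences",
    "long": "a detailed paragraph",
}
# Hypothetical instruction openers used to vary style and tone.
PREFIXES = ["Please", "Could you", "I would like you to"]


def build_generation_prompt(task: str, tag: str, complexity: str, prefix: str) -> str:
    """Compose one meta-prompt asking an LLM for an instruction-response pair."""
    hint = COMPLEXITY_HINTS[complexity]
    return (
        f"Write a user instruction for the task '{task}' as {hint}, "
        f"starting with '{prefix}'. Then write the response as "
        f"<{tag}>refined prompt</{tag}>."
    )


def all_prompts(task: str, tag: str) -> List[str]:
    """One meta-prompt per (complexity, prefix) combination."""
    return [
        build_generation_prompt(task, tag, c, p)
        for c, p in product(COMPLEXITY_HINTS, PREFIXES)
    ]


prompts = all_prompts("image generation", "image_gen")
```

Crossing complexity levels with prefixes is one simple way to get variation in length and tone from a single per-task template; the real dataset presumably uses much richer variation.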
Task-Specific Routing Tokens:
- Function: Defines unique token pairs for each vision task (e.g., <image_gen>...</image_gen>, <depth_est>...</depth_est>). The MLLM specifies the expert model to be scheduled by predicting these tokens.
- Mechanism: Given a user instruction such as "Please generate an image of a chihuahua", the model outputs <image_gen>a chihuahua dog...</image_gen>. After parsing the routing token, the system sends the content to the corresponding image generation model. In chain-of-action scenarios, the model can sequentially output multiple routing tokens, such as <pose_to_image>...</pose_to_image><image_edit>...</image_edit>, to form a task pipeline.
- Design Motivation: Routing tokens provide clear task boundaries and model selection signals, avoiding the ambiguity of natural language parsing while supporting flexible task composition.
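
The parsing step can be sketched as follows. This is an illustrative sketch only: the source does not describe how the system extracts routing tokens, so the regular expression and the `parse_routing_tokens` name are assumptions, as is the idea that every tag name consists of lowercase letters, digits, and underscores.

```python
import re
from typing import List, Tuple

# Matches <tag>content</tag> pairs whose closing tag repeats the opening name.
ROUTING_PATTERN = re.compile(r"<([a-z0-9_]+)>(.*?)</\1>", re.DOTALL)


def parse_routing_tokens(response: str) -> List[Tuple[str, str]]:
    """Return (task_name, refined_prompt) pairs in the order they appear."""
    return [
        (match.group(1), match.group(2).strip())
        for match in ROUTING_PATTERN.finditer(response)
    ]


single = parse_routing_tokens("<image_gen>a chihuahua dog</image_gen>")
chain = parse_routing_tokens(
    "<pose_to_image>a castle</pose_to_image><image_edit>add green trees</image_edit>"
)
plain = parse_routing_tokens("The image shows a dog on a sofa.")
```

A response with no routing tokens yields an empty list, which a caller could treat as an understanding-task answer produced by the MLLM itself.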
Chain-of-Action Capability:
- Function: Decomposes and sequentially executes multiple vision tasks within a single user instruction.
- Mechanism: The MLLM is trained to comprehend the intent of composite instructions and decompose them into an ordered sequence of sub-tasks, with each sub-task corresponding to a routing token. The output of the previous task serves as the input to the subsequent task, forming a task pipeline. The training data includes chained instructions of up to 5 tasks.
- Design Motivation: Real-world user requests are often composite (e.g., "generate a castle image based on this pose and add green trees"), which cannot be met by single-task routing.
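
The sequential hand-off described above can be sketched with stub experts. Everything here is hypothetical: the `Expert` signature, the `run_chain` function, and the lambda stand-ins (real experts would be models such as Stable Diffusion or ControlNet, and the source does not specify how outputs are passed between them).

```python
from typing import Callable, Dict, List, Optional, Tuple

# An expert takes (refined_prompt, previous_output) and returns its own output.
Expert = Callable[[str, Optional[str]], str]


def run_chain(
    steps: List[Tuple[str, str]],
    experts: Dict[str, Expert],
    initial_input: Optional[str] = None,
) -> Optional[str]:
    """Run each (task, prompt) step in order, feeding each output to the next step."""
    current = initial_input
    for task, prompt in steps:
        if task not in experts:
            raise KeyError(f"no expert registered for task '{task}'")
        current = experts[task](prompt, current)
    return current


# Stub experts that only record what they were asked to do.
experts: Dict[str, Expert] = {
    "pose_to_image": lambda prompt, prev: f"image({prompt} | pose={prev})",
    "image_edit": lambda prompt, prev: f"edited({prev} | {prompt})",
}
steps = [("pose_to_image", "a castle"), ("image_edit", "add green trees")]
result = run_chain(steps, experts, initial_input="pose.png")
```

Because each step only sees the previous output, an error in an early expert carries through the rest of the chain in this sketch.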

Loss & Training

Training employs the standard autoregressive language model loss (cross-entropy). The model must provide correct textual answers for understanding tasks and generate correct routing tokens + refined prompts for routing tasks. Using the Mipha architecture (SigLIP-SO + Phi-2, 2.7B parameters), training is divided into two stages: first, pre-training vision-language alignment using 558K image-text pairs; second, instruction fine-tuning using OlympusInstruct.
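
For reference, the standard autoregressive objective can be written as below. The notation is an assumption, not taken from the paper: \(v\) denotes the visual features from the vision encoder, \(x\) the instruction tokens, and \(y_1, \dots, y_T\) the target response tokens, which are either a textual answer or routing tokens plus a refined prompt.

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, v,\, x\right)
\]

Under this formulation, routing tokens are ordinary vocabulary items in \(y\), so understanding and routing share one loss with no task-specific heads.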

Key Experimental Results

Main Results: Multimodal Understanding Benchmarks

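| Method | LM Params | VQAv2 | GQA | SQA\(^{I}\) | MME-P | MMB | MM-Vet | POPE |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 7B | 78.5 | 62.0 | 66.8 | 1510.7 | 64.3 | 30.5 | 85.9 |
| Mipha-3B | 2.7B | 81.3 | 63.9 | 70.9 | 1488.9 | 69.7 | 35.2 | 86.7 |
| Olympus | 2.7B | 80.8 | 63.6 | 72.5 | 1501.2 | 69.2 | 34.8 | 87.0 |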

Routing Accuracy

| Task Type | Score |
|---|---|
| Single-task routing (average of 20 categories) | 94.75% (accuracy) |
| Chain-of-action (2-5 tasks) | 91.82% (precision) |
| 2-task chain | 94.82% |
| 5-task chain | 87.81% |
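
The source does not define how these scores are computed. As a toy illustration only, the sketch below shows one plausible reading: exact-match accuracy for single-task routing and position-wise precision for a predicted chain. Both function names and the example labels are hypothetical and are not data from the paper.

```python
from typing import Sequence


def single_task_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of instructions whose predicted task equals the reference task."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def chain_precision(pred_chain: Sequence[str], gold_chain: Sequence[str]) -> float:
    """Fraction of predicted steps that match the reference task at the same position."""
    if not pred_chain:
        return 0.0
    hits = sum(
        1
        for i, task in enumerate(pred_chain)
        if i < len(gold_chain) and task == gold_chain[i]
    )
    return hits / len(pred_chain)


# Toy labels for illustration; these are not results from the paper.
acc = single_task_accuracy(
    ["image_gen", "depth_est", "image_edit", "image_gen"],
    ["image_gen", "depth_est", "image_gen", "image_gen"],
)
prec = chain_precision(
    ["pose_to_image", "image_edit", "depth_est"],
    ["pose_to_image", "image_edit"],
)
```

Under a position-wise measure like this, longer chains offer more places for a mismatch, which would be consistent with the lower score reported for 5-task chains.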

Ablation Study

| Configuration | Single-task Accuracy | Chain-of-Action Accuracy |
|---|---|---|
| W/o routing tokens (pure text description) | 71.32% | 62.15% |
| Routing tokens, w/o chain-of-action data | 94.75% | 78.43% |
| Full Olympus | 94.75% | 91.82% |

Key Findings

- On multimodal understanding benchmarks, Olympus performs comparably to the Mipha-3B baseline of the same parameter scale: it is slightly better on three of the seven benchmarks (including MME-P and POPE) and slightly lower on the rest, indicating that adding routing capability does not degrade the original understanding ability.
- Routing accuracy decreases as chain length increases (94.82% for a 2-task chain \(\rightarrow\) 87.81% for a 5-task chain), a drop of about 7 points, but it stays close to 88% even at five tasks.
- Compared with pure text description routing, task-specific routing tokens improve single-task accuracy by 23.43 percentage points (71.32% \(\rightarrow\) 94.75%) and chain-of-action accuracy by 29.67 points (62.15% \(\rightarrow\) 91.82%).

Highlights & Insights

- Trade-off between Modularity and All-in-one: Olympus demonstrates that, at the current stage, employing an MLLM as a "dispatcher" is more practical than making it an "all-rounder". This approach preserves understanding capabilities while expanding task coverage through external expert models.
- Simple yet Effective Routing Token Design: Compared to HuggingGPT's pure prompt engineering approach, training routing tokens provides stronger robustness in task recognition.
- Chain-of-Action Capability: Users can describe complex multi-step workflows in natural language, which is a major highlight of the framework.

Limitations & Future Work

- Routing quality relies heavily on the coverage of OlympusInstruct, so routing may fail when user instructions fall outside the training distribution.
- In the current framework, the MLLM cannot perceive or validate the outputs of expert models, which creates a risk of error propagation.
- Chain-of-action currently supports only sequential execution, with no support for conditional branches or parallel tasks.
- Extending the framework to new tasks requires constructing new instruction data and retraining with the added routing tokens.

vs HuggingGPT: HuggingGPT relies on prompt engineering to invoke ChatGPT as a task allocator without training. In contrast, Olympus achieves more precise routing by training the MLLM end-to-end.
vs Emu3/Omni-Gen: These all-in-one models attempt to handle everything within a single network, but incur extremely high computational costs and suffer from obvious task conflict. Olympus's divide-and-conquer strategy is more scalable.
vs Visual ChatGPT: Visual ChatGPT relies on a prompt manager so that ChatGPT can invoke visual foundation models as tools, which is flexible but lacks control. Olympus's routing tokens provide a more structured approach to scheduling.

Rating

Novelty: ⭐⭐⭐ The idea is intuitive. The core contributions lie more in engineering and data aspects, while the technical method itself (routing tokens + instruction tuning) is relatively standard.
Experimental Thoroughness: ⭐⭐⭐⭐ The routing evaluation across 20 tasks is comprehensive, and the chain-of-action experiments are convincing. However, a quantitative comparison of end-to-end task quality is lacking.
Writing Quality: ⭐⭐⭐⭐ The architecture diagram is clear, and the task coverage is well-presented.
Value: ⭐⭐⭐⭐ It provides a practical paradigm for a unified vision task framework, and the OlympusInstruct dataset holds potential value for reuse.