Chimera: Improving Generalist Model with Domain-Specific Experts

Conference: ICCV 2025 | arXiv: 2412.05983 | Code: Open-source (weights, data, evaluation) | Area: Multimodal Large Models / VLM | Keywords: Multimodal reasoning, expert model integration, domain adaptation, routing mechanism, visual content extraction

TL;DR

This paper proposes Chimera, a scalable and low-cost multimodal pipeline that integrates domain-specific expert knowledge (tables, charts, math, documents) into a generalist multimodal large model via a lightweight routing module for dynamic expert selection, a progressive training strategy, and a Generalist-Specialist Collaboration Masking (GSCM) mechanism. Chimera achieves 64.9% on MathVista testmini, a new state of the art among open-source LMMs under 70B, and matches or surpasses specialist models on multiple visual structure extraction tasks.

Background & Motivation

Large multimodal models (LMMs) perform well on general tasks but remain inadequate for specialized domain tasks:

  • Limitations of generalist LMMs: Training data is dominated by natural images, whereas specialized tasks involve charts, tables, geometric figures, and function plots — content that is denser in text and more abstract.
  • "One for One" paradigm problems: Domain-specific expert models achieve strong performance but poor generalization; distribution gaps across sub-domains are large.
  • Data privacy issues: Domain-specific data is often proprietary and unavailable for post-training LMMs.

Directly integrating expert models faces two core challenges:

Representation gap: Large distribution shifts exist across encoders from different domains.

Optimization imbalance: The generalist visual encoder is already well-aligned with the language model, causing the model to over-rely on the generalist encoder and ignore expert features.

Method

Overall Architecture

Chimera consists of the following components:

  • Generalist visual encoder \(E^g\), generalist projector \(P^g\), and language model \(f\), initialized from a pretrained LMM (e.g., InternVL2)
  • Router \(R\): a linear layer that predicts from the CLS token of the generalist encoder
  • Expert model set \(S^e = \{E^{table}, E^{chart}, E^{math}\}\)
  • Expert projector set \(S^p = \{P^{table}, P^{chart}, P^{math}\}\)

Inference flow: the router determines whether and which expert to invoke based on the visual input → expert features are concatenated with generalist features → fed into the language model.
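The following is a minimal PyTorch-style sketch of this flow. It paraphrases the description above and is not the released implementation: module names (`ChimeraSketch`, `proj_experts`, the HuggingFace-style `inputs_embeds` call, etc.) and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChimeraSketch(nn.Module):
    """Sketch of Chimera's inference flow; all submodule names are illustrative."""

    def __init__(self, generalist, experts, proj_g, proj_experts, router, llm, no_expert_id=0):
        super().__init__()
        self.generalist = generalist                       # E^g: generalist visual encoder
        self.experts = nn.ModuleDict(experts)              # S^e: {"table": ..., "chart": ..., "math": ...}
        self.proj_g = proj_g                               # P^g: generalist projector
        self.proj_experts = nn.ModuleDict(proj_experts)    # S^p: expert projectors
        self.router = router                               # R: linear layer over the CLS token
        self.llm = llm                                     # language model f
        self.no_expert_id = no_expert_id                   # router class 0 = "no expert"
        self.names = ["table", "chart", "math"]            # expert classes 1..N_e

    @torch.no_grad()
    def forward(self, image, text_embeds):
        z_v = self.generalist(image)              # [B, 1 + N_patches, D], CLS token first
        logits = self.router(z_v[:, 0])           # route on the CLS token, [B, N_e + 1]
        choice = logits.argmax(dim=-1).item()     # highest-confidence class (batch size 1 here)

        vis_tokens = self.proj_g(z_v[:, 1:])      # generalist visual tokens for the LLM
        if choice != self.no_expert_id:
            name = self.names[choice - 1]
            z_e = self.experts[name](image)       # expert features for the selected domain
            vis_tokens = torch.cat([vis_tokens, self.proj_experts[name](z_e)], dim=1)

        # Visual tokens (generalist [+ expert]) are prepended to the text embeddings.
        return self.llm(inputs_embeds=torch.cat([vis_tokens, text_embeds], dim=1))
```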

Key Designs

  1. Lightweight Routing Module (Router):

    • Takes the CLS token \(\mathcal{Z}_v^{cls}\) of the generalist encoder as input
    • A linear layer outputs \(N_e + 1\) predictions (\(N_e\) experts + 1 no-expert option)
    • The selected class is \(\hat{i} = \arg\max_i (\mathcal{H}_r)_i\), where \(\mathcal{H}_r\) denotes the router's output logits, i.e., the expert (or no-expert option) with the highest routing confidence
    • Achieves 95.4% routing accuracy on MathVista; all errors are generalist–expert confusions, with no inter-expert confusion
    • The routing loss is a categorical cross-entropy over the \(N_e + 1\) classes: \(\mathcal{L}_c = -\sum_{i=0}^{N_e} y_i \log P(c_i \mid \mathcal{X}_v, \theta)\), where \(y_i\) is the one-hot routing label
  2. Generalist-Specialist Collaboration Masking (GSCM):

    • Core problem: The generalist encoder is already well-aligned with the language model, so directly concatenating expert features causes the model to "take shortcuts" — relying solely on generalist features and ignoring expert features.
    • Solution: During training, a proportion of generalist visual tokens is sampled uniformly at random and their attention-mask entries are set to False (a minimal sketch follows after this list).
    • Effect: Forces the model to utilize domain information from expert models as a complement to generalist information.
    • The uniform distribution prevents bias introduced by concentrating masking on image centers or specific regions.
    • Attention analysis confirms this: after applying GSCM, the model's output attention to expert visual tokens increases significantly.
  3. Progressive Training Strategy:

    • Stage 1 — Domain-generalist knowledge alignment: Freezes the generalist encoder, expert models, and language model; trains only the router, generalist projector, and expert projectors.
      • Data: natural image captioning, table format conversion, chart structure extraction, math diagram description, paragraph-level OCR.
    • Stage 2 — Visual instruction fine-tuning: Unfreezes the router, projectors, and language model (expert encoders remain frozen throughout); applies GSCM.
      • Data: multi-domain instruction-following datasets.
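As referenced in the GSCM item above, here is a minimal sketch of the masking step, assuming the generalist visual tokens occupy the leading positions of a boolean attention mask over the concatenated sequence; the masking ratio is an assumed hyperparameter, not a value reported in these notes.

```python
import torch

def gscm_mask(attn_mask: torch.Tensor, num_generalist: int, mask_ratio: float = 0.5) -> torch.Tensor:
    """Generalist-Specialist Collaboration Masking (sketch).

    attn_mask: [B, L] boolean attention mask for the concatenated sequence
               (generalist visual tokens, then expert visual tokens, then text).
    num_generalist: number of generalist visual tokens at the front of the sequence.
    mask_ratio: fraction of generalist tokens to hide (assumed hyperparameter).
    Applied only during Stage-2 instruction tuning; expert encoders stay frozen throughout.
    """
    attn_mask = attn_mask.clone()
    n_drop = int(mask_ratio * num_generalist)
    for b in range(attn_mask.size(0)):
        # Sample positions uniformly so masking is not biased toward any image region.
        drop = torch.randperm(num_generalist)[:n_drop]
        attn_mask[b, drop] = False   # hidden generalist tokens receive no attention
    return attn_mask
```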

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_c + \mathcal{L}_m\)

  • \(\mathcal{L}_m\): token-level cross-entropy loss for autoregressive modeling
  • \(\mathcal{L}_c\): router classification loss
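A small sketch of how the two terms could be combined, assuming equal weighting as written above; tensor shapes and the `ignore_index` convention are illustrative assumptions.

```python
import torch.nn.functional as F

def chimera_loss(lm_logits, lm_labels, router_logits, router_labels):
    # L_m: token-level cross-entropy for autoregressive modeling (padded positions ignored)
    loss_m = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(), ignore_index=-100)
    # L_c: categorical cross-entropy for the router over N_e + 1 classes
    loss_c = F.cross_entropy(router_logits, router_labels)
    return loss_m + loss_c   # L = L_c + L_m, summed with equal weight as above
```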

Preference optimization (optional Stage 3): Preference pairs are constructed from 60K public samples for DPO, pushing Chimera from 64.9% to 68.3% on MathVista testmini.
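For reference, such preference pairs are typically optimized with the standard DPO objective; the reference policy \(\pi_{\mathrm{ref}}\) (here, the pre-DPO Chimera) and the temperature \(\beta\) follow the usual formulation and are not specified in these notes:

\[ \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right] \]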

Key Experimental Results

Main Results

MathVista testmini accuracy (%)

| Model | Scale | Overall | GPS | MWP | TQA | GEO |
|---|---|---|---|---|---|---|
| GPT-4o | - | 63.8 | - | - | - | - |
| InternVL2-8B | 8B | 61.6 | 64.4 | 61.3 | 64.6 | 61.9 |
| Qwen2-VL | 7B | 58.2 | - | - | - | - |
| Math-LLaVA* | 13B | 46.6 | 57.7 | 56.5 | 51.3 | 56.5 |
| Chimera-8B | 8B | 64.9 | 71.6 | 72.6 | 65.2 | 69.5 |
| Chimera†-8B | 8B | 68.3 | 76.9 | 80.1 | 60.8 | 74.5 |

MathVerse accuracy (%)

| Model | Scale | Overall | Text Dominant | Vision Only |
|---|---|---|---|---|
| GPT-4V | - | 39.4 | 54.7 | 31.6 |
| MAVIS-7B* | 7B | 27.5 | 41.4 | 14.6 |
| InternVL2-8B | 8B | 31.3 | 38.8 | 17.0 |
| Chimera-8B | 8B | 32.4 | 39.6 | 19.3 |

Visual Structure Extraction (ChartQA-SE AP@strict / Table-SE TEDS)

| Method | ChartQA-SE AP@strict | Table-SE TEDS | Table-SE Edit Dist ↓ |
|---|---|---|---|
| GOT* | 74.7 | 0.745 | 0.257 |
| InternVL2 | 73.7 | 0.676 | 0.229 |
| Chimera | 74.1 | 0.740 | 0.165 |

Ablation Study

Domain-level analysis (MathVista testmini)

| Model | Overall | General | Chart | Table | Math |
|---|---|---|---|---|---|
| InternVL2-2B | 48.3 | 45.3 | 58.9 | 50.0 | 44.2 |
| Chimera-2B | 53.1 | 46.0 | 60.3 | 62.9 | 56.1 |
| InternVL2-4B | 57.0 | 50.1 | 66.2 | 65.7 | 58.3 |
| Chimera-4B | 61.3 | 54.0 | 64.8 | 72.9 | 66.9 |
| InternVL2-8B | 61.6 | 52.7 | 71.2 | 67.1 | 66.5 |
| Chimera-8B | 64.9 | 57.5 | 71.2 | 62.9 | 71.9 |

Router error statistics (MathVista testmini)

| GT \ Pred | General | Table | Chart | Math |
|---|---|---|---|---|
| General | – | 0 | 16 | 6 |
| Table | 1 | – | 0 | 0 |
| Chart | 1 | 0 | – | 0 |
| Math | 22 | 0 | 0 | – |

(Diagonal entries, i.e., correctly routed samples, are omitted; the off-diagonal counts sum to 46 errors.)

Routing accuracy is 95.4% (46 errors out of the 1,000 testmini samples). All misclassifications occur between the general and expert categories; no inter-expert confusion is observed.

Key Findings

  • Chimera-8B achieves 64.9% on MathVista, surpassing GPT-4o (63.8%) and setting a new SOTA among open-source LMMs under 70B.
  • Expert knowledge even improves performance on general-domain tasks: Chimera outperforms the InternVL2 baseline in the "General" category (57.5% vs. 52.7%), indicating that domain knowledge provides diversified perspectives.
  • DPO post-training with only 60K preference samples improves MathVista performance from 64.9% to 68.3% (+3.4 points), demonstrating the scalability of the framework.
  • Attention analysis of GSCM confirms that with masking, expert tokens receive significantly more attention; without masking, the model relies almost entirely on the generalist encoder.
  • The math expert exhibits the most consistent gains — it improves performance across all model scales; the table expert's high specialization may introduce noise at the 8B scale.
  • Chimera substantially outperforms InternVL2 on document structure extraction (Doc-SE) in both English and Chinese tasks.

Highlights & Insights

  • "One for All + Domain Experts" paradigm: Achieves specialized capabilities without sacrificing generality, offering more practical value than either pure expert models or pure generalist models.
  • Deep insight behind GSCM: The mechanism identifies and addresses the counter-intuitive problem of "optimization imbalance caused by pretraining alignment advantage."
  • Minimal routing design: A single linear layer achieves 95.4% routing accuracy, indicating that visual content from different domains is already well-separable in feature space.
  • Efficient DPO integration: Demonstrates that Chimera, as a base framework, can be seamlessly combined with preference optimization.

Limitations & Future Work

  • Chimera-8B underperforms InternVL2-8B on the table domain (62.9% vs. 67.1%), as the table expert (StructEqTable) is overly specialized and the task gap introduces noise.
  • Routing labels are assigned at the dataset level rather than the image level; the "general" category may contain images from mixed domains.
  • Only three domain experts (table, chart, math) are currently integrated; the effect of extending to more domains remains to be validated.
  • Expert models remain frozen throughout training, limiting further adaptation of domain knowledge.
  • Inference requires running all encoders (generalist + selected expert), incurring higher computational cost than purely generalist models.

Related Work

  • InternVL2: The base model for Chimera; its pretrain–finetune paradigm informs the progressive alignment strategy.
  • GOT: An expert model for document structure extraction, trained on millions of proprietary data samples.
  • ChartVLM: Its routing structure for chart QA inspires Chimera's routing design.
  • MAVIS / Math-LLaVA: Math reasoning expert models that demonstrate the importance of domain specialization but lack cross-domain generalization.
  • MoE: Chimera's router-plus-expert design shares similarities with Mixture of Experts but is more lightweight and targets the integration of pretrained specialist models.

Rating

  • Novelty: ⭐⭐⭐⭐ — The GSCM mechanism addresses an important yet easily overlooked optimization imbalance problem; the "integrate existing experts" paradigm is practical and scalable.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers both reasoning and extraction scenarios, three model scales (2B/4B/8B), multiple benchmarks (MathVista/MathVerse/ChartQA-SE/Table-SE/Doc-SE), and fine-grained domain-level analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Architecture diagrams are clear; the progressive method description is well-structured; the training strategy is presented in a logical hierarchy.
  • Value: ⭐⭐⭐⭐⭐ — Provides a low-cost, highly efficient framework for LMMs to rapidly acquire domain capabilities; reproducible using public data and models, with high practical utility.