TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding

- Conference: ICLR 2026
- arXiv: 2509.14671
- Code: GitHub
- Area: Multimodal VLM
- Keywords: Table Understanding, Dynamic Routing, Multi-modal Fusion, Gating Network, LLM Agent

TL;DR

This paper proposes TableDART, which uses a lightweight MLP gating network with only 2.59M parameters to dynamically select the best processing path (Text-only / Image-only / Fusion) for each query-table pair. By reusing frozen unimodal expert models and introducing an LLM Agent for cross-modal fusion, TableDART is evaluated on 7 table understanding benchmarks and improves average accuracy by 4.02 points over the strongest MLLM baseline, HIPPO, while cutting inference latency by 24.5% relative to always running the Fusion path.

Background & Motivation

Background: Table understanding is a core task bridging structured data and natural language. Existing approaches fall into three paradigms: (1) Table-as-Text — linearizing tables into text sequences for LLM processing, which is effective but loses spatial structure and is sensitive to serialization format; (2) Table-as-Image — rendering tables as screenshots for VLM processing, which preserves structure but has limited semantic capture capability; (3) Table-as-Multimodality — fusing both textual and visual views, as in HIPPO, which jointly processes both representations within an MLLM.

Limitations of Prior Work: Despite promising results, multimodal methods suffer from two critical limitations: (1) Static fusion introduces redundancy and conflict: processing both modalities for every query-table pair is often unnecessary, and the two views can disagree (text linearization is sensitive to row order, whereas the image representation is permutation-invariant), so conflicting signals can mislead the model; (2) MLLM fine-tuning is expensive: even with parameter-efficient strategies such as LoRA, HIPPO requires 25.87M trainable parameters, approximately 10× TableDART's 2.59M.

Key Challenge: The benefit of multimodal fusion stems from information complementarity, but its cost lies in redundancy and potential conflict. In 58.7% of test samples, both unimodal paths yield correct answers (i.e., "easy samples"), making forced fusion not only computationally wasteful but also potentially noise-inducing.

Key Insight: Since the optimal processing strategy varies across query-table pairs, the system should automatically learn "when to use text, when to use image, and when fusion is needed." An extremely lightweight routing network can make instance-level decisions by fully reusing existing unimodal experts.

Core Idea: Replace expensive MLLM fine-tuning with a 2.59M-parameter MLP gating network that dynamically selects the Text-only / Image-only / Fusion path for each query-table pair.

Method

Overall Architecture

TableDART consists of five collaborative components: (1) a Table-as-Text model \(\mathcal{M}_t\) (TableGPT2-7B, frozen); (2) a Table-as-Image model \(\mathcal{M}_v\) (Ovis2-8B, frozen); (3) a query text embedding model; (4) a lightweight MLP gating network (the only trainable component, 2.59M parameters); and (5) an LLM Agent (Gemini 2.0 Flash, used in the Fusion path, training-free). Given a query and table, three encoders extract in parallel the text representation \(\mathbf{e}_t\), image representation \(\mathbf{e}_v\), and query embedding \(\mathbf{e}_q\), which are concatenated as \(\mathbf{x} = [\mathbf{e}_q, \mathbf{e}_t, \mathbf{e}_v]\) and fed into the gating network. The path with the highest output logit is then selected for inference.
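The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `select_path`, `run_pipeline`, the path names, and the toy gate and handlers are all hypothetical stand-ins, and real feature vectors would come from the encoders rather than hand-written lists.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical names for the three inference paths.
PATHS = ("text_only", "image_only", "fusion")


def select_path(logits: Sequence[float]) -> str:
    """Return the path whose gating logit is largest (first one wins ties)."""
    if len(logits) != len(PATHS):
        raise ValueError("expected exactly one logit per path")
    best = max(range(len(PATHS)), key=lambda i: logits[i])
    return PATHS[best]


def run_pipeline(
    e_q: Sequence[float],
    e_t: Sequence[float],
    e_v: Sequence[float],
    gate: Callable[[List[float]], Sequence[float]],
    handlers: Dict[str, Callable[[], str]],
) -> Tuple[str, str]:
    """Concatenate features, score the three paths, and run only the winner."""
    x = list(e_q) + list(e_t) + list(e_v)  # x = [e_q, e_t, e_v]
    path = select_path(gate(x))
    return path, handlers[path]()


# Toy stand-ins: a fixed "gate" and handlers that return canned answers.
toy_gate = lambda x: [0.2, 1.5, -0.3]
toy_handlers = {
    "text_only": lambda: "answer from M_t",
    "image_only": lambda: "answer from M_v",
    "fusion": lambda: "answer from the Fusion Agent",
}
chosen, output = run_pipeline([0.1], [0.2], [0.3], toy_gate, toy_handlers)
```

Only the handler for the selected path is called, which is where the latency saving over always running Fusion would come from.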

Key Designs

Multimodal Encoding and Feature Concatenation
- Function: Unifies multimodal information from the query and table into input representations for the gating network.
- Mechanism: The table is separately serialized into text (encoded by \(\mathcal{M}_t\)'s encoder \(\mathcal{E}_t\)) and rendered as a screenshot (encoded by \(\mathcal{M}_v\)'s encoder \(\mathcal{E}_v\)), while the query is encoded by an independent text embedding model \(\mathcal{E}_q\). The three feature streams are pooled in a modality-specific manner and concatenated as \(\mathbf{x} = [\mathbf{e}_q, \mathbf{e}_t, \mathbf{e}_v]\). Note that \(\mathcal{E}_t\) and \(\mathcal{E}_v\) activate only a small fraction of the expert models' parameters (7.15% and 7.63%, respectively), incurring minimal computational overhead.
- Design Motivation: The gating network requires access to all modalities to make optimal routing decisions, but only needs feature-level representations rather than full inference outputs; thus, only the early encoder layers are used.
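As a rough illustration of the pooling-and-concatenation step, the sketch below collapses each token or patch sequence into a fixed-size vector and joins the three vectors. The source only says pooling is "modality-specific"; using plain mean pooling for every stream, and the names `mean_pool` and `build_gate_input`, are assumptions made for this example.

```python
from typing import List, Sequence


def mean_pool(token_features: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of equal-length feature vectors into one vector."""
    if not token_features:
        raise ValueError("need at least one feature vector")
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[d] for tok in token_features) / n for d in range(dim)]


def build_gate_input(
    query_tokens: Sequence[Sequence[float]],
    text_tokens: Sequence[Sequence[float]],
    image_patches: Sequence[Sequence[float]],
) -> List[float]:
    """Pool each stream (mean pooling is an assumption) and concatenate."""
    e_q = mean_pool(query_tokens)
    e_t = mean_pool(text_tokens)
    e_v = mean_pool(image_patches)
    return e_q + e_t + e_v  # x = [e_q, e_t, e_v]


# Tiny hand-made features; real ones would come from E_q, E_t and E_v.
x = build_gate_input(
    query_tokens=[[1.0, 3.0], [3.0, 5.0]],
    text_tokens=[[2.0]],
    image_patches=[[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]],
)
```

The three streams may have different widths; the gate only needs the concatenated length to be fixed across samples.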
Gating Network and Policy Training
- Function: Dynamically selects the optimal inference path for each query-table pair.
- Mechanism: The gating network \(\mathcal{G}\) is a lightweight MLP that outputs three-way logits \(\mathbf{z} = \mathcal{G}(\mathbf{x})\). The training objective \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \mathcal{L}_{\text{resource}}\) consists of two terms: a KL-divergence task loss that pulls the predicted path distribution toward an empirical correctness distribution (a binary vector \(\mathbf{s}\) pre-computed for each path, converted to soft targets via temperature-scaled softmax); and a resource regularization term \(\mathcal{L}_{\text{resource}} = \text{softmax}(\mathbf{z}/\tau_g)^\top \mathbf{c}\), the expected inference cost under the gate's distribution (where \(\mathbf{c}\) is an empirically measured inference cost vector), which penalizes high-cost paths and prevents over-reliance on the Fusion path.
- Design Motivation: Pure task optimization would route most samples to Fusion (the safest but most expensive option); resource regularization encourages easy samples to be routed to more efficient unimodal paths. Setting \(\lambda = 0.15\) achieves the best balance between performance and efficiency.
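The two-term objective can be written out numerically as follows. This is a sketch under stated assumptions, not the paper's implementation: the KL direction (target against prediction), the use of the same temperature \(\tau_g\) for the predicted distribution in both terms, the default temperatures, and the example cost vector are all choices made for illustration. Only \(\lambda = 0.15\) and the form of the resource term come from the text.

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable temperature-scaled softmax."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_loss(
    logits: Sequence[float],
    correctness: Sequence[float],  # s in {0,1}^3, one entry per path
    cost: Sequence[float],         # c, hypothetical relative path costs
    lam: float = 0.15,
    tau: float = 1.0,              # assumed soft-label temperature
    tau_g: float = 1.0,            # assumed gate temperature
) -> Tuple[float, float, float]:
    """Return (total, task, resource) for one training sample."""
    target = softmax(correctness, tau)
    predicted = softmax(logits, tau_g)
    task = kl_divergence(target, predicted)
    resource = sum(p * c for p, c in zip(predicted, cost))  # softmax(z/tau_g)^T c
    return task + lam * resource, task, resource


# All paths correct and an undecided gate: the task term vanishes, and the
# loss is driven purely by the expected cost.
example_cost = [1.0, 1.0, 3.0]  # made-up costs, Fusion being the most expensive
total, task, resource = total_loss([0.0, 0.0, 0.0], [1, 1, 1], example_cost)
```

With an "easy" sample like this one, shifting probability mass toward Fusion raises only the resource term, which is the pressure that pushes easy samples onto the cheaper unimodal paths.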
LLM Agent Fusion Inference
- Function: Integrates the outputs of two unimodal experts when the gating network selects the Fusion path.
- Mechanism: The two expert models \(\mathcal{M}_t\) and \(\mathcal{M}_v\) are executed in parallel to obtain their respective results \(r_t, r_v\) and auxiliary outputs \(a_t, a_v\), which are then submitted to the Fusion Agent (Gemini 2.0 Flash) together with the original table. The Agent operates in two roles: (a) Arbitrator — when the two experts' results conflict, it selects the more reliable answer based on confidence; (b) Rescuer — when both experts are uncertain, it synthesizes partial evidence from both to reason toward a new answer.
- Design Motivation: Directly training an MLLM for fusion is expensive; using a training-free LLM Agent for post-hoc reasoning leverages strong reasoning capabilities without training costs. Experiments show that the Fusion path successfully rescues 14% of "hard samples" where both unimodal paths fail.
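One way to picture the hand-off to the Fusion Agent is as prompt assembly. The sketch below only builds a text prompt from the two experts' outputs; the function name `build_fusion_prompt`, the prompt wording, and the treatment of the auxiliary outputs as opaque strings are all hypothetical, and no model API call is shown because the source does not describe one.

```python
def build_fusion_prompt(
    query: str,
    table_text: str,
    r_t: str,
    a_t: str,
    r_v: str,
    a_v: str,
) -> str:
    """Assemble one text prompt from both experts' outputs (illustrative wording)."""
    sections = [
        "You are given a table, a question, and two candidate answers.",
        # Arbitrator role: resolve disagreement between the experts.
        "If the candidates conflict, act as an arbitrator and pick the more reliable one.",
        # Rescuer role: derive a fresh answer when neither expert is convincing.
        "If both look uncertain, act as a rescuer and reason from the table to a new answer.",
        f"Table:\n{table_text}",
        f"Question: {query}",
        f"Text expert answer: {r_t}\nText expert auxiliary output: {a_t}",
        f"Image expert answer: {r_v}\nImage expert auxiliary output: {a_v}",
        "Final answer:",
    ]
    return "\n\n".join(sections)


prompt = build_fusion_prompt(
    query="Which city has the larger population?",
    table_text="city | population\nA | 10\nB | 12",
    r_t="A",
    a_t="(text expert notes)",
    r_v="B",
    a_v="(image expert notes)",
)
```

The resulting string would then be sent to whichever LLM serves as the agent; both roles are expressed as instructions, so no extra training is involved.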

Loss & Training

The training set comprises 10K mixed samples drawn from 5 table understanding benchmarks. Only the gating network is trained; all large models are frozen. For each training sample, the correctness of all three paths \(\mathbf{s} \in \{0,1\}^3\) is pre-computed, and temperature \(\tau\) controls the smoothness of the soft label distribution. During inference, the path with the highest logit is selected deterministically.
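The label pre-computation and the role of \(\tau\) can be illustrated with a toy sample. The sketch assumes a path order of (text, image, fusion) and an exact-match scorer; both are illustrative choices, as are the helper names `correctness_vector`, `soft_targets`, and `exact_match`.

```python
import math
from typing import Callable, List, Sequence


def correctness_vector(
    path_answers: Sequence[str],
    gold: str,
    is_correct: Callable[[str, str], bool],
) -> List[float]:
    """Build s in {0,1}^3 by scoring each path's answer against the gold answer."""
    return [1.0 if is_correct(a, gold) else 0.0 for a in path_answers]


def soft_targets(s: Sequence[float], tau: float) -> List[float]:
    """Temperature-scaled softmax over the binary correctness vector."""
    scaled = [v / tau for v in s]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def exact_match(pred: str, gold: str) -> bool:
    """Hypothetical scorer; real benchmarks use their own metrics."""
    return pred.strip().lower() == gold.strip().lower()


# Text and Fusion paths are right, the Image path is wrong.
s = correctness_vector(["42", "41", " 42"], "42", exact_match)
smooth = soft_targets(s, tau=1.0)  # wrong path keeps noticeable mass
sharp = soft_targets(s, tau=0.1)   # wrong path gets almost no mass
```

Because several paths can be correct at once, the target spreads its mass over all correct paths instead of forcing a single hard label; a smaller \(\tau\) concentrates that mass more strictly on the correct ones.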

Key Experimental Results

Main Results

| Method | WTQ | TABMWP | TAT-QA | HiTab | FeTaQA | TabFact | InfoTabs | Avg. Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TableGPT2-7B (Text) | 61.42 | 83.87 | 50.39 | 70.27 | 28.97 | 77.80 | 71.07 | 69.14 |
| Ovis2-8B (Image) | 58.76 | 87.00 | 47.67 | 68.59 | 34.70 | 80.80 | 74.11 | 69.49 |
| HIPPO-8B (Multimodal) | 55.77 | 87.50 | 60.75 | 63.00 | 33.18 | 82.27 | 75.74 | 70.84 |
| Gemini 2.0 Flash | 63.56 | 46.29 | 35.62 | 60.41 | 10.57 | 81.33 | 54.31 | 56.92 |
| TableDART | 70.58 | 84.54 | 62.05 | 74.37 | 36.11 | 81.37 | 76.22 | 74.86 |

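TableDART achieves an average accuracy of 74.86%, surpassing the strongest multimodal baseline, HIPPO-8B, by 4.02 points. The Avg. Acc column equals the mean of the six columns other than FeTaQA. Generalization is particularly notable on HiTab, the unseen dataset: TableDART scores 74.37% vs. 63.00% for HIPPO, an 11.37-point gap (18.05% relative).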
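The headline numbers can be re-derived from the table. The short check below uses only values from the two relevant rows; treating Avg. Acc as the mean over the six columns other than FeTaQA is inferred from the fact that the arithmetic matches, not from an explicit statement.

```python
ACC_DATASETS = ("WTQ", "TABMWP", "TAT-QA", "HiTab", "TabFact", "InfoTabs")

hippo = {"WTQ": 55.77, "TABMWP": 87.50, "TAT-QA": 60.75, "HiTab": 63.00,
         "FeTaQA": 33.18, "TabFact": 82.27, "InfoTabs": 75.74}
tabledart = {"WTQ": 70.58, "TABMWP": 84.54, "TAT-QA": 62.05, "HiTab": 74.37,
             "FeTaQA": 36.11, "TabFact": 81.37, "InfoTabs": 76.22}


def avg_acc(row: dict) -> float:
    """Mean over the six accuracy columns (FeTaQA left out)."""
    return sum(row[d] for d in ACC_DATASETS) / len(ACC_DATASETS)


avg_gap = avg_acc(tabledart) - avg_acc(hippo)            # about 4.02 points
hitab_abs = tabledart["HiTab"] - hippo["HiTab"]          # about 11.37 points
hitab_rel = 100.0 * hitab_abs / hippo["HiTab"]           # about 18.05 percent
```

Note that the average gain is an absolute difference in accuracy points, whereas the 18.05% figure for HiTab is a relative improvement over HIPPO's score.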

Ablation Study

| Routing Strategy | WTQ | TABMWP | TAT-QA | HiTab | TabFact | InfoTabs | Note |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random Routing | 65.40 | 75.50 | 58.94 | 70.49 | 79.50 | 69.57 | No effective routing |
| Non-adaptive Fusion | 70.97 | 81.47 | 63.34 | 73.35 | 81.56 | 76.83 | All samples routed to Fusion |
| Dynamic Routing | 70.58 | 84.54 | 62.05 | 74.37 | 81.37 | 76.22 | Ours |

Dynamic routing outperforms non-adaptive fusion on TABMWP (+3.07) and HiTab (+1.02), suggesting that forced fusion introduces noise on simpler datasets; on WTQ, TAT-QA, TabFact, and InfoTabs it trails by 0.19 to 1.29 points. In terms of inference efficiency, dynamic routing achieves an average latency of 2.20s vs. 2.92s for non-adaptive fusion, a 24.5% reduction.
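For a per-dataset view of the ablation, the differences between the last two rows can be computed directly. The snippet uses only numbers from the ablation table and the two reported latencies; variable names are illustrative.

```python
DATASETS = ("WTQ", "TABMWP", "TAT-QA", "HiTab", "TabFact", "InfoTabs")
non_adaptive = dict(zip(DATASETS, (70.97, 81.47, 63.34, 73.35, 81.56, 76.83)))
dynamic = dict(zip(DATASETS, (70.58, 84.54, 62.05, 74.37, 81.37, 76.22)))

# Positive values mean dynamic routing is ahead of always-Fusion.
delta = {d: dynamic[d] - non_adaptive[d] for d in DATASETS}
dynamic_wins = [d for d in DATASETS if delta[d] > 0]
mean_delta = sum(delta.values()) / len(DATASETS)

# Relative latency saving from the two reported (rounded) averages.
latency_dynamic, latency_fusion = 2.20, 2.92
latency_reduction = (latency_fusion - latency_dynamic) / latency_fusion
```

Averaged over the six datasets, dynamic routing comes out slightly ahead of always-Fusion despite losing narrowly on four of them. The rounded latencies give a reduction close to the reported 24.5%; the small difference presumably comes from rounding in the published timings.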

Key Findings

- 58.7% of samples are "easy samples": Both unimodal paths answer correctly, making forced fusion entirely unnecessary.
- 24.0% of samples exhibit cross-modal complementarity: 17.2% are answered correctly only by the image path and 6.8% only by the text path, validating the necessity of maintaining independent unimodal paths.
- The Fusion path achieves a roughly 14% "rescue" rate: 17.3% of samples are hard samples where both unimodal paths fail, and the Fusion Agent resolves part of this group, amounting to 2.4% of all samples (2.4 / 17.3 ≈ 14%).
- Routing strategies are interpretable: 97.2% of samples in simple datasets like TABMWP are routed to Image-only, while 88.7% of hard samples in TAT-QA are routed to Fusion.
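The sample breakdown above is internally consistent, which a few lines of arithmetic confirm. The two "ceiling" values are derived here for illustration and are not reported in the source: they assume the four groups are disjoint and that a perfect router always picks a correct path when one exists.

```python
# Shares of test samples, in percent, as quoted in the findings.
both_correct = 58.7   # easy: text and image paths both right
image_only = 17.2     # only the image path is right
text_only = 6.8       # only the text path is right
both_fail = 17.3      # hard: neither unimodal path is right
rescued = 2.4         # hard samples the Fusion Agent gets right

total = both_correct + image_only + text_only + both_fail
complementary = image_only + text_only
rescue_rate = rescued / both_fail

# Derived illustration (not from the source): accuracy ceilings for an
# ideal router without and with the Fusion path.
ceiling_unimodal = both_correct + complementary
ceiling_with_fusion = ceiling_unimodal + rescued
```

Under these assumptions the Fusion path lifts the attainable ceiling only modestly, which is consistent with routing most samples to a single expert.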

Highlights & Insights

- Extreme training efficiency: Training only 2.59M parameters surpasses HIPPO trained with 25.87M parameters. The core insight is that "routing decisions matter more than modal fusion." This "meta-decision + frozen experts" paradigm is transferable to any multi-expert system.
- Generalizability of the routing strategy: Performance is nearly consistent across seen and unseen datasets (74.95% vs. 74.37%), whereas HIPPO drops from 72.41% to 63.00%, indicating that the gating network learns a generalizable routing policy rather than overfitting.
- Elegant design of training signals: Using "three independently pre-computed correctness labels" as supervision allows multiple paths to be simultaneously correct, and combined with KL divergence soft-label training, this is more principled than hard-label classification.
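The seen/unseen comparison can be reproduced from the main results. The split used below (HiTab as the unseen benchmark, the other five accuracy columns as seen) is inferred from the fact that it reproduces the quoted averages; it is not spelled out in the text.

```python
SEEN = ("WTQ", "TABMWP", "TAT-QA", "TabFact", "InfoTabs")  # inferred split
UNSEEN = "HiTab"

hippo = {"WTQ": 55.77, "TABMWP": 87.50, "TAT-QA": 60.75,
         "HiTab": 63.00, "TabFact": 82.27, "InfoTabs": 75.74}
tabledart = {"WTQ": 70.58, "TABMWP": 84.54, "TAT-QA": 62.05,
             "HiTab": 74.37, "TabFact": 81.37, "InfoTabs": 76.22}


def seen_avg(row: dict) -> float:
    """Mean accuracy over the five datasets treated as seen."""
    return sum(row[d] for d in SEEN) / len(SEEN)


# Drop in accuracy when moving from seen datasets to the unseen one.
tabledart_drop = seen_avg(tabledart) - tabledart[UNSEEN]   # under 1 point
hippo_drop = seen_avg(hippo) - hippo[UNSEEN]               # over 9 points
```

The gap between the two drops is the quantitative basis for the claim that the gate generalizes better than a fine-tuned MLLM.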

Limitations & Future Work

- Dependency on external Gemini as the Fusion Agent: The Fusion path requires calls to a closed-source API, increasing cost and privacy concerns; open-source LLMs could be explored as alternatives.
- Pre-computation of three-path results for training data: Running inference three times per training sample is non-trivial, limiting scalability of the training set.
- Gating network relies solely on feature-level information: The current routing decision is based on shallow encoder features and does not exploit higher-level information such as the semantic complexity of the query.
- Only three fixed paths are supported: More flexible routing strategies, such as partial fusion or cascaded inference, have not been explored.

vs. HIPPO: HIPPO statically fuses text and image representations inside an MLLM, whereas TableDART dynamically routes externally, achieving both better performance and greater training efficiency (2.59M vs. 25.87M parameters).
vs. Mixture-of-Experts: TableDART's design resembles MoE but routes at the model level rather than the layer level; the experts are complete frozen models rather than trainable sub-networks.
vs. Table-LLaVA/TabPedia: These methods train on a single visual modality and cannot leverage the complementary advantages of the text modality.

Rating

Novelty: ⭐⭐⭐⭐ The combination of instance-level dynamic routing and training-free LLM Agent fusion is novel, though the basic idea of dynamic routing is not new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 7 benchmarks with comprehensive ablations, routing strategy analysis, efficiency analysis, and generalization verification.
Writing Quality: ⭐⭐⭐⭐ Clear structure, well-motivated arguments, and rich figures and tables.
Value: ⭐⭐⭐⭐ Provides a training-efficient paradigm for multimodal fusion with reference value for both table understanding and broader multi-expert systems.