FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning

Conference: ACL 2025
arXiv: 2412.12567
Code: HYU-NLP/FCMR
Area: Other
Keywords: Cross-Modal Reasoning, Multi-Hop QA, Financial NLP, Benchmark, MLLM Evaluation

TL;DR

This work constructs FCMR, a cross-modal multi-hop reasoning benchmark in the financial domain, comprising three modalities: text, tables, and charts. It is categorized into three difficulty levels: Easy, Medium, and Hard. The strongest model, Claude 3.5 Sonnet, achieves only 30.4% accuracy on the Hard level, revealing critical bottlenecks of MLLMs in the information retrieval phase.

Background & Motivation

Real-world decision-making often requires integrating information from multiple modalities to perform reasoning. For instance, financial analysts need to examine textual reports, tabular data (balance sheets), and charts (trend graphs) simultaneously to make judgments. This capability is referred to as Cross-Modal Multi-Hop Reasoning.

Existing evaluation benchmarks suffer from two key limitations:

Data Leakage: Mainstream benchmarks like MMQA are constructed based on Wikipedia, which is a core data source for LLM pre-training. Experiments show that GPT-4o achieves 43.4% Exact Match on the hardest subset of MMQA even without looking at images, indicating that the models are "recalling" rather than "reasoning".

Lack of genuinely complex cross-modal multi-hop questions: MMQA samples that truly require three-hop reasoning across all three modalities account for only 0.8% of the dataset (205 instances); the vast majority are single-hop or two-hop questions.

The motivation of FCMR is to address these two pain points: using financial domain data to avoid leakage, and designing complex reasoning tasks that necessitate crossing all three modalities.

Method

Overall Architecture

The authors propose the CMRGen (Cross-Modal Multi-Hop Reasoning Generator) framework to automatically construct cross-modal multi-hop reasoning datasets. CMRGen consists of three stages: input data construction, statement generation, and rewriting and filtering. This framework is highly automated and extremely cost-effective—generating a single question costs only $0.004, compared to $0.33 for MMQA.

Key Designs

  1. Input Data Construction: Two types of financial data sources are used—SEC EDGAR's 10-K annual reports (text source) and WRDS Compustat's simplified financial statements (table source). These are aligned via shared corporate entities. Each FCMR instance contains one document, one table, and one chart, involving three companies. The chart is plotted using tabular data, and once plotted, the corresponding columns are removed from the table to ensure no information overlap between the chart and the table.

  2. Hierarchical Design of Statement Generation:

    • Easy: Single-modality single-hop statements (though all three modalities are still required to verify the correctness of all statements)
    • Medium: Cross-modal two-hop statements
    • Hard: Cross-modal three-hop statements—such as "In the year when fopo value of ABBOTT LABORATORIES was below 730.5, the company with the minimum act value was entitled to receive $43 million of sublease income"—requiring sequential lookup of Chart \(\rightarrow\) Table \(\rightarrow\) Text.
  3. Distractor Generation Strategy: Instead of simply modifying numerical values, incorrect statements are generated by substituting corporate entities. This aligns with real-world scenarios of multi-company analysis in the financial domain.

  4. Multiple-Choice Question Design: Each question contains three statements, where 0 to 3 statements can be true. The model must judge the veracity of all statements, with only completely correct classifications considered correct. This design is significantly more complex than traditional multiple-choice questions.

  5. Quality Control: Rewriting quality is evaluated with WPD (Word Position Deviation) and LD (Lexical Deviation); on these measures the rewritten statements score better than the paraphrase pairs in the MRPC and PAWS datasets. Chart types include line charts, bar charts, scatter plots, and pie charts, covering approximately 98% of common 10-K chart types.
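
The all-or-nothing scoring described in item 4 can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names and the representation of each answer as a list of three booleans are assumptions made for this example. It also shows why guessing uniformly over the eight possible true/false patterns yields 12.5%, which is in line with the Random row in the main results.

```python
from itertools import product


def score_question(predicted, gold):
    """Return 1 only if every statement label matches the gold label, else 0."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return int(all(p == g for p, g in zip(predicted, gold)))


def accuracy(predictions, golds):
    """Fraction of questions for which all statement labels are correct."""
    scores = [score_question(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


def uniform_guess_accuracy(num_statements=3):
    """Expected accuracy when one of the 2**n label patterns is picked uniformly."""
    patterns = list(product([True, False], repeat=num_statements))
    return 1 / len(patterns)


# Toy data (made up for illustration): two questions, three statements each.
golds = [[True, False, False], [True, True, False]]
preds = [[True, False, False], [True, False, False]]  # second one misses a statement

print(accuracy(preds, golds))      # one of two questions fully correct
print(uniform_guess_accuracy())    # 1 / 8
```

Because a single wrong label zeroes out the whole question, a model that judges each statement correctly 80% of the time (independently) would score only about 51% under this metric.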

Loss & Training

This study introduces an evaluation benchmark and does not involve model training. However, three enhancement strategies are explored in preliminary optimization experiments:

  • Modality Integration: Converting all modalities into textual representations.
  • 4-Stage Reasoning: Explicitly guiding four-step reasoning within the prompt.
  • Self-Refine: Enabling the model to iteratively refine its own answers.

Combining these three strategies improves the performance of Claude 3.5 Sonnet on the Hard level from 32% to 46%.
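
A rough sketch of how the 4-stage prompt and the self-refine loop might fit together is given below. Everything here is an assumption for illustration: `model` is a hypothetical callable that maps a prompt string to an answer string, the prompt wording, the `max_rounds` parameter, and the stop-when-unchanged rule are invented, and `context` is assumed to already hold the text, table, and chart in textual form (the modality-integration step). Only the four stage names come from the paper's analysis.

```python
from typing import Callable

# Stage names as used in the paper's four-stage analysis.
STAGES = [
    "Planning",
    "Modality Identification",
    "Information Retrieval",
    "Information Reasoning",
]


def build_prompt(question: str, context: str) -> str:
    """Compose a prompt that asks for the four stages explicitly (wording is illustrative)."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(STAGES, start=1))
    return (
        f"{context}\n\nQuestion: {question}\n\n"
        f"Work through these steps in order:\n{steps}"
    )


def self_refine(model: Callable[[str], str], prompt: str, max_rounds: int = 2) -> str:
    """Get an initial answer, then let the model revise it up to max_rounds times."""
    answer = model(prompt)
    for _ in range(max_rounds):
        revised = model(
            f"{prompt}\n\nPrevious answer:\n{answer}\n\n"
            "Check it and give a corrected final answer."
        )
        if revised == answer:  # stop early once the answer is stable
            break
        answer = revised
    return answer
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a real setup would wrap whichever MLLM endpoint is being evaluated.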

Key Experimental Results

Main Results

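Accuracy (%) by difficulty level:

| Model | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| Random | 12.2 | 12.9 | 12.3 | 12.5 |
| ChartInstruct-Llama2 | 11.5 | 12.6 | 10.8 | 11.6 |
| MiniCPM-V-2_6 | 16.4 | 11.7 | 13.2 | 13.7 |
| Qwen2-VL-7B | 17.6 | 13.3 | 12.0 | 12.3 |
| Llama 3.2 90B-Vision | 42.5 | 21.6 | 13.7 | 25.9 |
| GPT-4o mini | 49.1 | 22.0 | 13.0 | 28.1 |
| Gemini 1.5 Pro | 63.0 | 31.2 | 22.3 | 38.8 |
| GPT-4o | 64.2 | 43.7 | 24.4 | 44.1 |
| Claude 3.5 Sonnet | 75.4 | 50.8 | 30.4 | 52.2 |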

Ablation Study

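GPT-4o with and without image input:

| Dataset | With Image | Accuracy |
|---|---|---|
| MMQA Hard | No | 43.4% |
| MMQA Hard | Yes | 63.4% |
| FCMR Hard | No | 14.7% |
| FCMR Hard | Yes | 24.4% |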

FCMR performance drops close to random (12.3%) after removing charts, verifying the data is leakage-free.

Key Findings

  1. Information Retrieval is the Primary Bottleneck: Through a four-stage fine-grained analysis (Planning \(\rightarrow\) Modality Identification \(\rightarrow\) Information Retrieval \(\rightarrow\) Information Reasoning), it is discovered that MLLMs are most prone to failure in the "Information Retrieval" stage—often failing to accurately extract information even after correctly identifying which modality it resides in.

  2. Severe Degradation when Processing the Second Modality: The models perform acceptably when handling the first modality of the first statement, but their success rate collapses sharply when moving to the second modality.

  3. Chart Understanding Remains a Key Weakness: 75% of Claude's errors in the Easy level are chart-related. Scatter plots are the most challenging (23.4% accuracy on Hard), whereas line and bar charts perform slightly better.

  4. Models Exhibit Conservative Strategies: When uncertain, models tend to predict "False", sacrificing recall to minimize false positives.

  5. Trend Misinterpretation is the Most Frequent Error: Among 100 analyzed error samples from Claude, 35 cases involve trend misreading in charts, and 16 cases are ranking errors.
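
The conservative behavior in finding 4 shows up at the statement level as high precision but low recall on the "True" label. The snippet below is a small sketch of that measurement; the function name and the toy labels are made up for illustration and are not taken from the paper's data.

```python
def true_label_precision_recall(predicted, gold):
    """Precision and recall for the 'True' label over a flat list of statements."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # predicted True, actually True
    fp = sum(1 for p, g in pairs if p and not g)    # predicted True, actually False
    fn = sum(1 for p, g in pairs if not p and g)    # predicted False, actually True
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: the model answers "False" whenever it is unsure.
gold = [True, True, True, False, False, True]
pred = [True, False, False, False, False, False]

precision, recall = true_label_precision_recall(pred, gold)
print(precision, recall)  # 1.0 0.25
```

In this toy case the model never asserts a false statement as true, yet it misses three of the four true statements, which is the pattern the finding describes.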

Highlights & Insights

  • Extremely low data construction cost ($0.004/question), with a framework that is highly generalizable to other fields (the appendix demonstrates an application in materials science).
  • The design of multiple-choice questions allowing 0-3 correct answers evaluates genuine reasoning abilities more robustly than traditional single-choice formats.
  • The four-stage analysis methodology provides a valuable framework for comprehending reasoning failures in MLLMs.
  • A counter-intuitive finding: converting charts to tables with DePlot before passing them to GPT-4o yields higher Hard-level performance than direct visual reading (32.9% vs 24.4%), indicating that visual understanding in MLLMs still lags behind structured text processing.

Limitations & Future Work

  1. Coverage is limited to the financial domain: Although the framework is extensible, it has not yet been validated at scale in other domains.
  2. Part of the analysis relies on manual inspection: Future work could explore automated analysis tools.
  3. Charts are synthetically generated rather than extracted from real 10-K reports, which may introduce discrepancies with the complexity of actual document charts.
  4. The optimal combination of strategies achieves only 46% accuracy on the Hard level, suggesting a need for more fundamental methodological innovations.

Related Work

  • MMQA (Talmor et al., 2021): The de facto standard for cross-modal multi-hop reasoning, yet plagued by data leakage and a scarcity of three-hop samples.
  • HybridQA, FinQA, TAT-QA, and similar benchmarks cover only two modalities.
  • ManyModalQA and CT2C-QA cover three modalities but lack cross-modal multi-hop reasoning.
  • WebQA and MuMuQA focus only on two-hop reasoning.
  • The most prominent distinction from these works is that FCMR strictly mandates three-modality three-hop reasoning for all Hard-level questions.

Rating

  • Novelty: ⭐⭐⭐⭐ The design of cross-modal three-hop reasoning across three modalities in the financial sector is highly unique, and the multiple-choice format is creative.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage of models, with diversified analytical dimensions (modality-level, stage-level, error classification, chart types) and deep qualitative analysis.
  • Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, chart designs are elegant, and the analysis is progressively structured.
  • Value: ⭐⭐⭐⭐ It establishes a high-quality testbed for evaluating the multimodal reasoning capabilities of MLLMs, exposing pivotal capability deficiencies.