M3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset
Conference: ACL 2025
arXiv: 2506.02510
Code: Available
Area: Multilingual Translation
Keywords: Financial Meetings, Multilingual Benchmark, Long Context Understanding, Summarization, Question Answering
TL;DR
This work introduces M3FinMeeting, the first multilingual (Chinese, English, Japanese), multi-sector, and multi-task evaluation benchmark for financial meetings. It contains 600 real-world financial meetings and defines three tasks: summarization, Q&A pair extraction, and question answering. The results show that state-of-the-art LLMs still have significant room for improvement in understanding financial meetings.
Background & Motivation
Existing financial NLP benchmarks such as FinQA, ConvFinQA, and CFLUE suffer from three major limitations:
- Single Data Source: They rely heavily on news articles, financial reports, and announcements, and lack real financial meeting content. Financial meetings have distinctive characteristics, such as a conversational nature, real-time dynamics, and strategic discussions, which existing datasets fail to cover.
- Monolingual Nature: They are almost exclusively limited to either English or Chinese.
- Lack of Long-Context Challenges: Financial meetings typically last 1-2 hours, with transcripts often exceeding 10K tokens, which makes them a real test of LLMs' long-context capabilities that existing benchmarks do not provide.
M3FinMeeting aims to fill these gaps and evaluate the comprehensive capabilities of LLMs in understanding real-world financial meetings.
Method
Overall Architecture
M3FinMeeting is an evaluation benchmark dataset whose core design revolves around "three multis":
- Multilingual: English (100 meetings), Chinese (400 meetings), and Japanese (100 meetings), totaling 600 meetings.
- Multi-sector: Covering all 11 sectors under the GICS standard (Communication, IT, Financials, Energy, etc.).
- Multi-task: Three tasks consisting of summarization generation, Q&A pair extraction, and question answering.
Key Designs
1. Data Collection and Annotation
Function: Real meeting audio was obtained from partner financial institutions, transcribed via ASR, and then manually corrected.
Core Pipeline:
- Collection Criteria: Timeliness (recent meetings), length (prioritizing long audio), sector coverage, and authority.
- Whisper was used for speech-to-text transcription, followed by paragraph-by-paragraph manual correction by annotators.
- Each meeting lasts approximately 1 hour on average, with English averaging 10,086 tokens, Chinese about 11,740 tokens, and Japanese about 13,284 tokens.
- Sensitive and personally identifiable information (PII) was strictly removed.
Design Motivation: Direct utilization of real-world financial meetings instead of synthetic data to ensure that the benchmark reflects real-world challenges.
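As a quick sanity check on the corpus scale, the per-language meeting counts and average transcript lengths quoted in this summary can be combined into an approximate corpus-wide figure. The sketch below is only arithmetic on those reported numbers: the Chinese and Japanese averages are stated as approximate, the tokenizer is not specified here, and `CORPUS` and `corpus_totals` are illustrative names rather than anything from the paper's code.

```python
# Meeting counts and average transcript lengths as quoted in this summary.
# The Chinese and Japanese averages are approximate, so the totals are too.
CORPUS = {
    "English": {"meetings": 100, "avg_tokens": 10_086},
    "Chinese": {"meetings": 400, "avg_tokens": 11_740},
    "Japanese": {"meetings": 100, "avg_tokens": 13_284},
}


def corpus_totals(corpus):
    """Return (total meetings, approximate total tokens, weighted mean tokens per meeting)."""
    total_meetings = sum(lang["meetings"] for lang in corpus.values())
    total_tokens = sum(lang["meetings"] * lang["avg_tokens"] for lang in corpus.values())
    return total_meetings, total_tokens, total_tokens / total_meetings


if __name__ == "__main__":
    meetings, tokens, mean_len = corpus_totals(CORPUS)
    print(meetings, tokens, round(mean_len))  # prints: 600 7033000 11722
```

Because Chinese meetings make up two thirds of the corpus, the weighted mean sits close to the Chinese average.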
2. Three Evaluation Tasks
Summarization Generation: LLMs are required to implicitly identify the topical segments in the document, generate a summary for each segment, and then concatenate them into a complete document summary. Evaluation combines two signals: segment-level P/R/F1, where generated and reference segments are aligned when their cosine similarity is \(\ge 0.75\), and GPT-4-Judge scoring on a 0-100 scale across five dimensions (coverage, redundancy, readability, accuracy, and consistency).
Q&A Pair Extraction: Identifying meaningful questions and their corresponding answers from the full meeting transcript. This requires the LLM to understand the conversational structure, distinguish between meaningful questions and meaningless interruptions, and correctly pair multi-turn Q&A.
Question Answering: Given the full meeting transcript and a set of preset questions, LLMs must locate evidence within the long context and generate answers. Merging related questions into a single prompt simulates practical scenarios like writing reports or summaries.
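To make the two prompting styles for the question answering task concrete, here is a minimal sketch of a merged prompt versus per-question prompts. These appear to correspond to the "All-in-one Answer" and "One-by-one Answer" baselines in the ablation table of this summary. The prompt wording and the function names `build_all_in_one_prompt` and `build_one_by_one_prompts` are hypothetical; the paper's actual prompt templates are not reproduced here.

```python
def build_all_in_one_prompt(transcript: str, questions: list[str]) -> str:
    """Merge all questions into a single prompt over the full transcript."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    # Hypothetical wording; the benchmark's real template may differ.
    return (
        "You are given the transcript of a financial meeting.\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Answer each question using only evidence from the transcript.\n"
        f"Questions:\n{numbered}\n"
    )


def build_one_by_one_prompts(transcript: str, questions: list[str]) -> list[str]:
    """Build one prompt per question, each repeating the full transcript."""
    return [build_all_in_one_prompt(transcript, [q]) for q in questions]


if __name__ == "__main__":
    demo_questions = ["What drove revenue growth?", "What is the capex outlook?"]
    print(build_all_in_one_prompt("<full transcript here>", demo_questions))
    print(len(build_one_by_one_prompts("<full transcript here>", demo_questions)))
```

The merged variant sends the long transcript once for all questions, while the per-question variant repeats it for every question, which multiplies the input cost by the number of questions.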
3. Evaluation System
Function: Multi-level evaluation that balances both automated metrics and human judgment.
- Segment-level Precision/Recall/F1: Aligning generated and reference summaries based on embedding similarity.
- GPT-4-Judge: 0-100 scoring across five dimensions, cross-validated with Qwen-plus-Judge.
- Human Evaluation + Fleiss' Kappa: Validating the alignment between LLM evaluation and human judgment.
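The segment-level metric can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes segment embeddings are already available as plain float vectors, and it assumes a greedy one-to-one alignment in which a generated segment counts as correct when it is paired with a reference segment at cosine similarity of at least 0.75. The exact matching strategy and embedding model are not specified in this summary, and `segment_prf` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_prf(gen_vecs, ref_vecs, threshold=0.75):
    """Segment-level precision, recall, and F1 via greedy one-to-one alignment (assumed)."""
    # Score every generated/reference pair and visit the most similar pairs first.
    pairs = sorted(
        ((cosine(g, r), i, j) for i, g in enumerate(gen_vecs) for j, r in enumerate(ref_vecs)),
        reverse=True,
    )
    used_gen, used_ref = set(), set()
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are all below the alignment threshold
        if i not in used_gen and j not in used_ref:
            used_gen.add(i)
            used_ref.add(j)
    matched = len(used_gen)
    precision = matched / len(gen_vecs) if gen_vecs else 0.0
    recall = matched / len(ref_vecs) if ref_vecs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2-D "embeddings": three generated segments, two reference segments.
    generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    reference = [[1.0, 0.1], [0.0, 1.0]]
    print(segment_prf(generated, reference))
```

In the toy example, two of the three generated segments align with a reference segment, so precision is 2/3, recall is 1, and F1 is 0.8. An over-segmented summary is penalized on precision, while an under-segmented one is penalized on recall.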
Loss & Training
This work introduces an evaluation benchmark and does not involve model training. Evaluations are conducted in a zero-shot setting.
Key Experimental Results
Main Results
Comprehensive Evaluation on Three Tasks (GPT-4-Judge Scores):
| Model | Summarization | Q&A Pair Extraction | Question Answering | Overall |
|---|---|---|---|---|
| GPT-3.5-turbo | 44.56 | 31.13 | 42.78 | 39.55 |
| LLaMA3.1-8B | 52.01 | 44.64 | 40.01 | 45.76 |
| GLM4-9B-Chat | 67.71 | 46.06 | 67.72 | 60.76 |
| Qwen2-7B | 73.59 | 37.33 | 69.99 | 60.71 |
| GPT-4o | 73.61 | 66.85 | 71.79 | 70.66 |
| Qwen2-72B | 74.17 | 60.85 | 73.50 | 69.66 |
| Qwen2.5-72B | 74.51 | 68.03 | 74.81 | 72.54 |
Extremely low F1 scores for Q&A pair extraction: Even the best model, Qwen2.5-72B, achieves an F1 of only 38.41%, indicating that automatically extracting high-quality Q&A pairs from long conversations remains extremely challenging.
Ablation Study
Impact of RAG on Q&A Performance (Qwen2.5-72B, GPT-4-Judge):
| Method | <5K | 5-10K | 10-15K | 15-20K | >20K |
|---|---|---|---|---|---|
| Baseline 1 (All-in-one Answer) | Medium | High | High | Best | Best |
| Baseline 2 (One-by-one Answer) | Medium | Medium | High | Second Best | Second Best |
| RAG (top 5) | Good | Good | Medium | Poor | Poor |
| RAG (top 1) | Poor | Poor | Poor | Worst | Worst |
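Key Finding: On long documents (\(>10\text{K}\) tokens), feeding the full transcript outperforms RAG, whereas RAG (top 5) remains competitive on shorter transcripts. This is counter-intuitive, since retrieval is usually expected to help most when the context is long.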
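The difference between the full-context baselines and the RAG settings comes down to what is placed in the prompt. The sketch below contrasts the two, using a simple bag-of-words cosine score as a stand-in for a real retriever. The chunk size, the scoring function, and all function names are assumptions made for illustration; this summary does not describe the paper's retriever or chunking.

```python
import math
from collections import Counter


def chunk_text(text, chunk_words=50):
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def bow_cosine(a, b):
    """Bag-of-words cosine similarity; a stand-in for dense embedding similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(transcript, question, k=5, chunk_words=50):
    """Return the k chunks most similar to the question."""
    chunks = chunk_text(transcript, chunk_words)
    return sorted(chunks, key=lambda c: bow_cosine(c, question), reverse=True)[:k]


def build_context(transcript, question, k=None, chunk_words=50):
    """k=None keeps the full transcript (full-context baseline); otherwise top-k chunks (RAG)."""
    if k is None:
        return transcript
    return "\n".join(retrieve_top_k(transcript, question, k, chunk_words))


if __name__ == "__main__":
    demo = "revenue grew ten percent this quarter the weather was nice outside today"
    print(build_context(demo, "how much did revenue grow this quarter", k=1, chunk_words=6))
```

With `k=1`, any evidence that falls outside the single retrieved chunk is lost, which is one plausible reason the top-1 setting ranks lowest in the table above.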
Key Findings
- Qwen2.5-72B achieves the best overall performance, but still scores only 72.54 points (out of 100), indicating significant room for improvement.
- Summarization Task: Segment-level F1 is below 30%, showing that LLMs perform poorly in implicit document segmentation.
- Q&A Extraction is the Most Difficult: Even the best model has a recall rate of less than 50%, missing more than half of the key questions.
- Language Performance: Most models perform best in Japanese, with no significant difference between Chinese and English; this might be attributed to better instruction following in Japanese.
- Sector Variation: Performance is better in the Communication, Consumer Discretionary, and IT sectors, with sector-wise performance disparities being more pronounced in weaker models.
- Impact of Length: GPT-3.5 degrades sharply beyond 15K tokens (due to the 16K window limit), while Qwen2.5-72B and GPT-4o remain stable on long documents.
- Reliability of LLM Evaluation: GPT-4-Judge scores follow the same trends as Qwen-plus-Judge, and the human evaluation with 5 annotators yields a Fleiss' Kappa of \(0.701\).
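For readers unfamiliar with the agreement statistic in the last finding, Fleiss' Kappa can be computed from a table of rating counts as sketched below. The formula is the standard one; the toy ratings are invented for illustration and are not the paper's data. How the 0-100 judge scores were mapped to categories for the human study is not stated in this summary, so the two-category setup here is an assumption.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same rater count."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items        # observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    # Toy data: 4 items, 5 raters, 2 categories ("agree with judge", "disagree").
    toy = [[5, 0], [4, 1], [1, 4], [0, 5]]
    print(round(fleiss_kappa(toy), 3))
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.701 falls into.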
Highlights & Insights
- The First Benchmark Dedicated to Financial Meetings: Fills the gap in meeting scenarios, whose data differs fundamentally in structure from news or report data.
- RAG Underperforms Full-Context Input in Long Contexts: This holds important reference value for the practical implementation of RAG applications.
- Comprehensive Multi-Dimensional Evaluation System: Combines automatic metrics, LLM-Judge, and human evaluation, using cross-validation between judges to reduce evaluator bias.
- Revealing Q&A Pair Extraction as the Most Challenging Task: Provides guiding significance for future research directions in financial NLP.
Limitations & Future Work
- Extremely high annotation costs: professional analysts must work through 1-2 hours of audio and more than 10K tokens of text per meeting.
- The Q&A task only uses extracted Q&A pairs and does not evaluate open-ended questions requiring deep reasoning.
- The number of Chinese meetings (400) is far greater than English and Japanese ones (100 each), presenting data imbalance.
- Only 7 LLMs were evaluated, without covering a broader range of open-source models (e.g., Mistral, DeepSeek).
- ASR errors may still persist despite manual correction, potentially impacting downstream tasks.
Related Work & Insights
- Complementary to ECTSum (a summarization benchmark built on earnings call transcripts), but M3FinMeeting offers more languages, wider sector coverage, and a more comprehensive set of tasks.
- Draws inspiration from long-context evaluation designs like LongBench and RULER, but focuses specifically on the financial vertical domain.
- While GPT-4-Judge evaluation has become mainstream, this work adds Qwen-plus cross-validation and human Kappa computation, enhancing overall credibility.
Rating
- Novelty: ⭐⭐⭐⭐ — The first multilingual and multi-sector financial meeting understanding benchmark; valuable scenario definition.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 7 models, 3 tasks, multilingual/multi-sector/multi-length analysis, RAG comparison, and human evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, detailed statistical tables, and sound evaluation methodology.
- Value: ⭐⭐⭐⭐ — Clear contribution to the financial NLP community; the dataset and findings (e.g., RAG's disadvantage in long documents) hold practical reference value.