EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements¶

Conference: ICLR 2026 arXiv: 2506.08762 Code: GitHub Area: Time Series Keywords: financial benchmark, LLM evaluation, fraud detection, earnings forecasting, Japanese NLP

TL;DR¶

This paper constructs EDINET-Bench, a financial benchmark derived from ten years of Japanese EDINET annual reports, comprising three expert-level tasks—accounting fraud detection, earnings forecasting, and industry classification—and finds that even state-of-the-art LLMs only marginally outperform logistic regression.

Background & Motivation¶

Background: LLMs have surpassed human performance in mathematics and programming, with benchmark datasets serving as a key driver of progress. However, financial benchmarks remain relatively scarce, and existing ones (e.g., FinQA, ConvFinQA) are mostly simple QA or data extraction tasks.

Limitations of Prior Work: Existing financial benchmarks do not involve expert-level reasoning—such as integrating multiple financial statements and textual passages—and thus fail to assess LLM capability on real-world, high-stakes financial tasks.

Key Challenge: Although LLMs excel at general tasks, financial analysis requires simultaneously processing large volumes of tabular data and textual information while performing complex cross-year reasoning.

Goal: Provide the first open-source Japanese financial benchmark requiring expert-level reasoning, and in particular the first publicly available accounting fraud detection dataset.

Key Insight: Leveraging ten years of real annual report data from Japan's EDINET system (analogous to the U.S. EDGAR), three challenging tasks are constructed.

Core Idea: Real annual reports + expert-level financial tasks = revealing the inadequacy of LLMs in financial reasoning.

Method¶

Overall Architecture¶

Data pipeline: EDINET API → edinet2dataset tool for parsing → EDINET-Corpus (~40,000 annual reports) → three benchmark tasks.

Key Designs¶

edinet2dataset Tool: Downloads annual reports via the EDINET API, parses TSV-format files at high speed using Polars, and extracts six categories of information: Meta, Summary, BS, PL, CF, and Text. Covers approximately 41,691 annual reports spanning 2014–2025.
Accounting Fraud Detection: Extracts 6,712 amended reports from revision filings; Claude 3.7 Sonnet is used to determine whether the reason for amendment involves fraud (668 confirmed as fraudulent), with manual review yielding a mislabeling rate below 5%. Non-fraudulent samples are randomly drawn from 700 companies and split by company into a training set (865) and test set (224).
Earnings Forecasting: Randomly selects 1,000 companies and constructs pairs of consecutive annual reports; the direction of change in net income attributable to parent shareholders serves as the label. A temporal split is applied (pre-2020 as training), yielding 549 training and 451 test samples.
Industry Classification: The TOPIX-33 categories based on SICC are merged into 16 broad classes, with approximately 35 companies per class, totaling 496 test samples.

Evaluation Setup¶

Zero-shot prompting: system prompt set to "You are a financial analyst"; inputs vary across combinations of annual report sections (Summary only / +BS+CF+PL / +Text).
Models: GPT-4o, o4-mini, GPT-5, Claude 3.5 Haiku/Sonnet, Claude 3.7 Sonnet, Kimi-K2, DeepSeek-V3/R1, Llama 3.3 70B.
Classical baselines: Logistic Regression, Random Forest, XGBoost.

Key Experimental Results¶

Main Results¶

Fraud Detection ROC-AUC (selected):

Model	Summary	+BS/CF/PL	+Text
Claude 3.5 Sonnet	0.64	0.63	0.73
GPT-5	0.56	0.62	0.67
Logistic Regression†	-	0.61	-

Earnings Forecasting ROC-AUC:

Model	Summary	+BS/CF/PL	+Text
GPT-5	0.58	0.62	0.65
Claude 3.7 Sonnet	0.55	0.58	0.61
Logistic Regression†	-	0.60	-

Ablation Study¶

Ablation on input information:

Input Configuration	Fraud Detection (avg)	Earnings Forecasting (avg)
Summary only	~0.58	~0.48
+BS/CF/PL	~0.59	~0.52
+Text	~0.64	~0.52

Key Findings¶

LLMs only marginally outperform logistic regression: On binary classification tasks, even the strongest LLM achieves MCC values of only 0.1–0.3.
Textual information is beneficial: Adding the Text section improves fraud detection ROC-AUC by ~0.06 on average.
Open-source models lag behind: DeepSeek-V3/R1 performs notably worse than closed-source models on financial tasks.
Industry classification is relatively easier: With complete financial statements, Claude 3.5 Sonnet achieves 41% accuracy (random baseline: 6.25% for 16 classes).
Each annual report contains approximately 30K tokens; single-inference cost is approximately $0.1 (Claude 3.7 Sonnet).

Highlights & Insights¶

First open-source accounting fraud detection dataset: No publicly available fraud detection evaluation benchmark previously existed.
Open-source edinet2dataset tool: Provides a complete pipeline for constructing financial datasets from EDINET, with high-speed TSV parsing based on Polars.
Honest conclusions: The paper explicitly states that providing annual reports for direct LLM inference is insufficient, and that additional scaffolding—such as simulation environments and task-specific reasoning support—is required.
Cross-linguistic value: The Japanese financial benchmark fills a gap in non-English financial NLP.
Rigorous experimental design: Ablation across multiple input configurations, comparison against classical ML baselines, and transparent cost analysis.
Label quality control: Fraud labels are generated by Claude and verified through manual review, with a mislabeling rate below 5%.

Limitations & Future Work¶

Only zero-shot settings are evaluated; few-shot and RAG experiments are absent, and reasoning augmentation strategies such as chain-of-thought are unexplored.
Fraud labels are generated by Claude 3.7 Sonnet rather than through full human annotation, potentially introducing systematic bias.
Most evaluated LLMs have limited understanding of Japanese financial terminology, particularly open-source models.
The data covers only the Japanese market, and cross-national generalizability is not assessed.
No in-depth analysis of LLM reasoning processes is provided (e.g., which financial statement items are attended to, visualization of reasoning paths).
Fine-tuned Llama-3.2-1B results are incomplete, and exploration of small-model fine-tuning is insufficient.
Both fraud detection and earnings forecasting are binary classification tasks; finer-grained regression formulations are not explored.
Annual reports of approximately 30K tokens approach the context length limits of some models, which may affect results.

Compared to FinQA/ConvFinQA: EDINET-Bench requires processing complete annual reports rather than short passages, more closely approximating real-world financial analysis scenarios.
Compared to FinanceBench: FinanceBench is an open-ended QA benchmark, whereas EDINET-Bench requires integrating multiple financial statements and textual content for expert-level reasoning.
Compared to FAMMA: FAMMA is based on CFA exams and tutorials, while EDINET-Bench is grounded in real corporate annual reports.
Compared to kim2024 (GPT-4 for earnings direction prediction): This paper provides open-source data and evaluation code for reproducibility.
Insight: Financial LLMs need to move beyond simple QA toward agent-based approaches that simulate financial analyst workflows.
Implications for non-English financial NLP: Countries can adopt similar approaches to construct local financial benchmarks (e.g., China CSRC disclosures, U.S. EDGAR).
Future direction: Integrating RAG or multi-agent frameworks may substantially improve LLM financial reasoning performance.

Rating¶

Novelty: ⭐⭐⭐⭐ First open-source fraud detection benchmark, though the task designs themselves are relatively straightforward.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers 10+ models and 3 input configurations, but lacks advanced experiments such as few-shot evaluation.
Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed data construction procedures and rich tables.
Value: ⭐⭐⭐⭐ The open-source tools and datasets make practical contributions to the financial NLP community and expose the limitations of LLMs in financial reasoning.