EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Conference: ICLR 2026 arXiv: 2506.08762 Code: GitHub Area: Time Series Keywords: financial benchmark, LLM evaluation, fraud detection, earnings forecasting, Japanese NLP

TL;DR

This paper constructs EDINET-Bench, a financial benchmark derived from ten years of Japanese EDINET annual reports, comprising three expert-level tasks—accounting fraud detection, earnings forecasting, and industry classification—and finds that even state-of-the-art LLMs only marginally outperform logistic regression.

Background & Motivation

Background: LLMs have surpassed human performance in mathematics and programming, with benchmark datasets serving as a key driver of progress. However, financial benchmarks remain relatively scarce, and existing ones (e.g., FinQA, ConvFinQA) are mostly simple QA or data extraction tasks.

Limitations of Prior Work: Existing financial benchmarks do not involve expert-level reasoning—such as integrating multiple financial statements and textual passages—and thus fail to assess LLM capability on real-world, high-stakes financial tasks.

Key Challenge: Although LLMs excel at general tasks, financial analysis requires simultaneously processing large volumes of tabular data and textual information while performing complex cross-year reasoning.

Goal: Provide the first open-source Japanese financial benchmark requiring expert-level reasoning, and in particular the first publicly available accounting fraud detection dataset.

Key Insight: Leveraging ten years of real annual report data from Japan's EDINET system (analogous to the U.S. EDGAR), three challenging tasks are constructed.

Core Idea: Real annual reports + expert-level financial tasks = revealing the inadequacy of LLMs in financial reasoning.

Method

Overall Architecture

Data pipeline: EDINET API → edinet2dataset tool for parsing → EDINET-Corpus (~40,000 annual reports) → three benchmark tasks.

Key Designs

  1. edinet2dataset Tool: Downloads annual reports via the EDINET API, parses TSV-format files at high speed using Polars, and extracts six categories of information: Meta, Summary, BS, PL, CF, and Text. Covers 41,691 annual reports spanning 2014–2025.
  2. Accounting Fraud Detection: 6,712 amended reports are extracted from revision filings, and Claude 3.7 Sonnet judges whether the stated reason for amendment involves fraud (668 confirmed fraudulent); manual review puts the mislabeling rate below 5%. These fraudulent samples, together with non-fraudulent samples randomly drawn from 700 companies, are split by company into a training set (865) and a test set (224).
  3. Earnings Forecasting: Randomly selects 1,000 companies and constructs pairs of consecutive annual reports; the direction of change in net income attributable to parent shareholders serves as the label. A temporal split is applied (pre-2020 as training), yielding 549 training and 451 test samples.
  4. Industry Classification: The TOPIX-33 categories based on SICC are merged into 16 broad classes, with approximately 35 companies per class, totaling 496 test samples.
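The earnings-forecasting construction in step 3 can be sketched in a few lines: pair consecutive annual reports per company, label the direction of change in net income, and split temporally at 2020. This is an illustrative reconstruction, not the paper's actual code; the field layout and the handling of ties (counted as "decrease") are assumptions.

```python
# Sketch of the earnings-forecasting label construction: consecutive-year
# report pairs, direction-of-change labels, and a pre-2020 temporal split.
# The tuple layout (company_id, fiscal_year, net_income) is illustrative.
from itertools import groupby

reports = [
    ("A", 2018, 120), ("A", 2019, 150), ("A", 2020, 140),
    ("B", 2018, 80),  ("B", 2019, 60),
]

def build_pairs(reports):
    """Yield (company, year, label, split) for each consecutive-year pair."""
    samples = []
    for company, rows in groupby(sorted(reports), key=lambda r: r[0]):
        rows = list(rows)
        for prev, curr in zip(rows, rows[1:]):
            if curr[1] != prev[1] + 1:   # require consecutive fiscal years
                continue
            # Ties are treated as "decrease" in this sketch.
            label = "increase" if curr[2] > prev[2] else "decrease"
            split = "train" if curr[1] < 2020 else "test"
            samples.append((company, curr[1], label, split))
    return samples

pairs = build_pairs(reports)
```

Splitting by the *later* year of each pair reproduces the paper's temporal split (pre-2020 pairs train, the rest test), which prevents leakage of post-cutoff information into training.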

Evaluation Setup

  • Zero-shot prompting: system prompt set to "You are a financial analyst"; inputs vary across combinations of annual report sections (Summary only / +BS+CF+PL / +Text).
  • Models: GPT-4o, o4-mini, GPT-5, Claude 3.5 Haiku/Sonnet, Claude 3.7 Sonnet, Kimi-K2, DeepSeek-V3/R1, Llama 3.3 70B.
  • Classical baselines: Logistic Regression, Random Forest, XGBoost.
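The three input configurations above amount to assembling different subsets of report sections into one prompt. A minimal sketch, assuming illustrative section names and task wording (only the system prompt string is taken from the paper):

```python
# Sketch of the zero-shot setup: one fixed system prompt, three input
# configurations built from annual-report sections. Section headers and
# task wording are assumptions, not the paper's exact prompts.
SYSTEM_PROMPT = "You are a financial analyst."

CONFIGS = {
    "summary":   ["Summary"],
    "+bs/cf/pl": ["Summary", "BS", "PL", "CF"],
    "+text":     ["Summary", "BS", "PL", "CF", "Text"],
}

def build_prompt(report: dict, config: str, task_instruction: str) -> str:
    """Concatenate the selected report sections, then append the task."""
    parts = [f"## {name}\n{report[name]}" for name in CONFIGS[config]]
    return "\n\n".join(parts) + "\n\n" + task_instruction

report = {"Summary": "...", "BS": "...", "PL": "...", "CF": "...", "Text": "..."}
prompt = build_prompt(report, "+text",
                      "Did this company commit accounting fraud? Answer Yes or No.")
```

The resulting prompt (under SYSTEM_PROMPT) would then be sent to each model; only the section subset varies across the three configurations.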

Key Experimental Results

Main Results

Fraud Detection ROC-AUC (selected):

Model | Summary | +BS/CF/PL | +Text
Claude 3.5 Sonnet | 0.64 | 0.63 | 0.73
GPT-5 | 0.56 | 0.62 | 0.67
Logistic Regression† | – | 0.61 | –

(† classical baseline trained on structured financial-statement features only)

Earnings Forecasting ROC-AUC:

Model | Summary | +BS/CF/PL | +Text
GPT-5 | 0.58 | 0.62 | 0.65
Claude 3.7 Sonnet | 0.55 | 0.58 | 0.61
Logistic Regression† | – | 0.60 | –
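ROC-AUC, the metric in both tables, has a simple rank interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A self-contained pairwise implementation (ties count 0.5):

```python
# ROC-AUC via pairwise comparison of positive and negative scores.
# Equivalent to the usual trapezoidal ROC-curve integral for binary labels.
def roc_auc(labels, scores):
    """labels: 0/1 ints; scores: model confidences. Ties contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, the best fraud-detection score above (0.73) means the model ranks a fraudulent report above a non-fraudulent one only about 73% of the time, versus 50% for random guessing.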

Ablation Study

Ablation on input information:

Input Configuration | Fraud Detection (avg) | Earnings Forecasting (avg)
Summary only | ~0.58 | ~0.48
+BS/CF/PL | ~0.59 | ~0.52
+Text | ~0.64 | ~0.52

Key Findings

  • LLMs only marginally outperform logistic regression: on the binary classification tasks, even the strongest LLMs reach MCC values of only 0.1–0.3.
  • Textual information is beneficial: Adding the Text section improves fraud detection ROC-AUC by ~0.06 on average.
  • Open-source models lag behind: DeepSeek-V3/R1 performs notably worse than closed-source models on financial tasks.
  • Industry classification is relatively easier: With complete financial statements, Claude 3.5 Sonnet achieves 41% accuracy (random baseline: 6.25% for 16 classes).
  • Each annual report contains approximately 30K tokens; a single inference costs roughly $0.10 with Claude 3.7 Sonnet.
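The Matthews correlation coefficient behind the "only 0.1–0.3" finding is computed from the binary confusion matrix; unlike accuracy, it stays near 0 for uninformative classifiers even on imbalanced data such as fraud detection. A pure-Python sketch:

```python
# Matthews correlation coefficient from a binary confusion matrix.
# Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect).
import math

def mcc(tp, tn, fp, fn):
    """Return MCC; defined as 0 when the denominator degenerates."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

An all-negative classifier on an imbalanced fraud test set gets tp = 0 and a degenerate denominator, hence MCC 0, which is why values of 0.1–0.3 indicate only weak real signal.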

Highlights & Insights

  • First open-source accounting fraud detection dataset: No publicly available fraud detection evaluation benchmark previously existed.
  • Open-source edinet2dataset tool: Provides a complete pipeline for constructing financial datasets from EDINET, with high-speed TSV parsing based on Polars.
  • Honest conclusions: The paper explicitly states that providing annual reports for direct LLM inference is insufficient, and that additional scaffolding—such as simulation environments and task-specific reasoning support—is required.
  • Cross-linguistic value: The Japanese financial benchmark fills a gap in non-English financial NLP.
  • Rigorous experimental design: Ablation across multiple input configurations, comparison against classical ML baselines, and transparent cost analysis.
  • Label quality control: Fraud labels are generated by Claude and verified through manual review, with a mislabeling rate below 5%.

Limitations & Future Work

  • Only zero-shot settings are evaluated; few-shot and RAG experiments are absent, and reasoning augmentation strategies such as chain-of-thought are unexplored.
  • Fraud labels are generated by Claude 3.7 Sonnet rather than through full human annotation, potentially introducing systematic bias.
  • Most evaluated LLMs have limited understanding of Japanese financial terminology, particularly open-source models.
  • The data covers only the Japanese market, and cross-national generalizability is not assessed.
  • No in-depth analysis of LLM reasoning processes is provided (e.g., which financial statement items are attended to, visualization of reasoning paths).
  • Fine-tuned Llama-3.2-1B results are incomplete, and exploration of small-model fine-tuning is insufficient.
  • Both fraud detection and earnings forecasting are binary classification tasks; finer-grained regression formulations are not explored.
  • Annual reports of approximately 30K tokens approach the context length limits of some models, which may affect results.
Comparison with Related Work

  • FinQA/ConvFinQA: EDINET-Bench requires processing complete annual reports rather than short passages, more closely approximating real-world financial analysis.
  • FinanceBench: an open-ended QA benchmark, whereas EDINET-Bench requires integrating multiple financial statements and textual content for expert-level reasoning.
  • FAMMA: built from CFA exams and tutorials, while EDINET-Bench is grounded in real corporate annual reports.
  • Kim et al. (2024), who used GPT-4 for earnings direction prediction: EDINET-Bench additionally releases open-source data and evaluation code for reproducibility.

Takeaways & Future Directions

  • Financial LLMs need to move beyond simple QA toward agent-based approaches that simulate financial analyst workflows.
  • Other markets can adopt the same recipe to build local financial benchmarks (e.g., from China's CSRC disclosures or the U.S. EDGAR system).
  • Integrating RAG or multi-agent frameworks may substantially improve LLM financial reasoning performance.

Rating

  • Novelty: ⭐⭐⭐⭐ First open-source fraud detection benchmark, though the task designs themselves are relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 10+ models and 3 input configurations, but lacks advanced experiments such as few-shot evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed data construction procedures and rich tables.
  • Value: ⭐⭐⭐⭐ The open-source tools and datasets make practical contributions to the financial NLP community and expose the limitations of LLMs in financial reasoning.