BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications
Conference: ACL 2026 Findings · arXiv: 2604.17305 · Code: https://bizcompass.dev.ypemc.com/ · Area: LLM Evaluation · Keywords: business reasoning benchmark, knowledge and application evaluation, LLM capability diagnosis, finance and economics, dual-axis design
TL;DR
This paper introduces BizCompass, a business reasoning benchmark bridging theoretical foundations and practical applications. It covers four knowledge domains (finance, economics, statistics, and operations management) and three application roles (analyst, trader, and consultant), systematically evaluating the business reasoning capabilities of both open-source and closed-source LLMs, and revealing how theoretical knowledge transfers to real-world performance.
Background & Motivation
Background: LLMs hold significant promise for business applications, yet business analysis is inherently complex, requiring rigorous reasoning and the integration of diverse knowledge. Existing benchmarks (e.g., FinBen, CFLUE) typically target narrow, single-task settings such as sentiment analysis or entity extraction, and thus cannot answer a fundamental question: how can LLMs be reliably deployed in business contexts, and what theoretical foundations underlie such capabilities?
Limitations of Prior Work: (1) Existing benchmarks focus predominantly on finance, lacking coverage of other core business disciplines such as economics, statistics, and operations management; (2) no diagnostic framework links theoretical knowledge to practical application performance — one can observe whether an LLM performs well or poorly on a specific task, but the foundational capabilities driving that performance remain unclear.
Key Challenge: Increasing model scale and chain-of-thought (CoT) techniques do not guarantee improvements in business reasoning. DeepSeek-R1 (671B) underperforms considerably smaller closed-source models on certain tasks, indicating that naive scaling is insufficient and that a deeper understanding of the mapping between knowledge and application is necessary.
Goal: (1) Construct an evaluation benchmark with comprehensive coverage of the business domain; (2) Employ a dual-axis design to diagnose how theoretical knowledge drives or constrains real-world application performance; (3) Provide actionable recommendations for model selection and training optimization.
Key Insight: A dual-axis design consisting of a knowledge layer and an application layer is adopted — the knowledge layer addresses "what the model knows," the application layer addresses "what the model can do," and cross-axis analysis answers "why it can or cannot."
Core Idea: A dual-axis benchmark is used to elevate business LLM evaluation from task performance measurement to capability diagnosis, not only assessing how well a model performs but also diagnosing the root causes of its strengths and weaknesses.
Method
Overall Architecture
BizCompass is structured into two layers. The knowledge layer spans four core domains: Finance (FIN), Economics (ECON), Statistics (STAT), and Operations Management (OM), each containing multiple-choice questions and open-ended questions. The application layer is organized around three representative business roles: Analyst (data analysis, risk assessment), Trader (market forecasting, investment decision-making), and Consultant (strategic advice, solution evaluation). Evaluation metrics include accuracy, F1, ROUGE, and GPT-Eval (a multi-dimensional scoring scheme using GPT-4o as the judge).
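The paper does not publish evaluation code in this summary, but the mapping from metric name to computation for the multiple-choice portion can be illustrated with a minimal sketch (function names here are illustrative assumptions):

```python
def accuracy(preds, golds):
    # Fraction of exact matches — the metric for multiple-choice items.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    # Macro-averaged F1 over answer labels (e.g., A/B/C/D),
    # so that rarely chosen options weigh as much as common ones.
    labels = set(golds) | set(preds)
    f1s = []
    for lab in labels:
        tp = sum(p == g == lab for p, g in zip(preds, golds))
        fp = sum(p == lab and g != lab for p, g in zip(preds, golds))
        fn = sum(g == lab and p != lab for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

preds = ["A", "B", "C", "A"]
golds = ["A", "B", "D", "A"]
print(accuracy(preds, golds))  # 0.75
```

ROUGE and GPT-Eval apply to the open-ended and case-analysis formats, where exact matching is not meaningful.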
Key Designs
- Four-Domain Coverage in the Knowledge Layer:
  - Function: Comprehensively evaluates theoretical business knowledge
  - Mechanism: Finance draws from professional examination items such as FRM and CFA; Economics covers micro- and macroeconomic theory; Statistics covers probability theory, hypothesis testing, regression analysis, and related topics; Operations Management covers supply chain, project management, quality control, and more. Each domain includes questions of varying difficulty levels.
  - Design Motivation: Business decision-making is inherently cross-disciplinary. The four domains collectively cover the core theoretical foundations of business analysis.
- Three-Role Design in the Application Layer:
  - Function: Evaluates the transfer of theoretical knowledge into practical business skills
  - Mechanism: The Analyst role requires analytical capabilities such as data interpretation, trend analysis, and risk quantification; the Trader role requires decision-making capabilities such as market judgment, portfolio construction, and risk management; the Consultant role requires integrated capabilities such as strategic thinking, solution evaluation, and client communication. Each role corresponds to specific task formats, including multiple-choice, open-ended questions, and case analysis.
  - Design Motivation: Different business roles demand and apply knowledge in distinct ways. The three roles span the full spectrum from quantitative analysis to qualitative reasoning.
- Cross-Domain Correlation Analysis:
  - Function: Diagnoses how knowledge capabilities drive application performance
  - Mechanism: A correlation matrix is computed between the four knowledge-layer domains and each application-layer task. Analytical and quantitative tasks show stronger correlations with OM and STAT, while text-based and advisory tasks exhibit weaker correlations with the knowledge domains. Correlations with code reasoning ability (SWE-bench) are also analyzed, revealing a positive relationship.
  - Design Motivation: Rather than merely reporting scores, the framework aims to explain "why" — identifying which foundational capabilities constitute bottlenecks, thereby informing targeted training.
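The cross-axis analysis can be sketched as a per-domain Pearson correlation across models. The scores below are a subset of the main-results table (GPT, Gemini, DeepSeek-R1, Llama rows); the resulting correlations are purely illustrative of the method, not the paper's reported matrix:

```python
import numpy as np

# Rows = models (GPT, Gemini, DeepSeek-R1, Llama); columns = knowledge domains.
domains = ["FIN", "ECON", "STAT", "OM"]
knowledge = np.array([
    [80.4, 83.0, 83.8, 79.3],
    [82.1, 87.8, 85.7, 82.7],
    [73.8, 81.7, 70.9, 71.1],
    [52.6, 62.8, 57.8, 60.5],
])
# One application-layer score per model (here: the application average).
app_task = np.array([79.9, 77.4, 71.3, 60.2])

# Pearson correlation of each knowledge domain with the application task;
# repeating this over every task yields the 4 x T correlation matrix.
for j, name in enumerate(domains):
    r = np.corrcoef(knowledge[:, j], app_task)[0, 1]
    print(f"{name}: r = {r:.2f}")
```

In the paper this is computed per application task rather than against the average, which is what lets analytical tasks separate out as more OM/STAT-driven.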
Key Experimental Results
Main Results
| Model | Finance Acc | Economics Acc | Statistics Acc | OM Acc | Application Avg Acc |
|---|---|---|---|---|---|
| GPT (closed-source) | 80.4% | 83.0% | 83.8% | 79.3% | 79.9% |
| Gemini (closed-source) | 82.1% | 87.8% | 85.7% | 82.7% | 77.4% |
| Claude (closed-source) | 81.8% | 85.8% | 84.6% | 80.2% | 75.5% |
| DeepSeek-R1 (671B) | 73.8% | 81.7% | 70.9% | 71.1% | 71.3% |
| Qwen (235B) | 78.6% | 81.7% | 82.1% | 80.0% | 64.8% |
| Llama (70B) | 52.6% | 62.8% | 57.8% | 60.5% | 60.2% |
Ablation Study
| Analysis Dimension | Finding | Explanation |
|---|---|---|
| Scale vs. Performance | Non-linear | DeepSeek-R1 (671B) underperforms smaller closed-source models on multiple metrics |
| CoT vs. No CoT | Unstable | Adding CoT does not guarantee improvement; gains depend on data quality and alignment |
| Knowledge-to-Application Correlation | Uneven | OM/STAT exert greater influence on analytical tasks; FIN/ECON influence is weaker |
| Code Reasoning to Business Performance | Positive correlation | SWE-bench scores correlate positively with knowledge-layer performance |
Key Findings
- Closed-source models consistently outperform open-source counterparts on both the knowledge and application layers, with the gap being more pronounced on the application layer, suggesting that application capabilities are harder to acquire through open-source training.
- Model scale is not a determining factor: DeepSeek-R1 (671B) scores lower than Qwen (235B) on statistics and operations management, and distilled models perform even worse.
- Cross-domain correlation analysis reveals that statistical and operations management knowledge is more critical for analytical application tasks.
- Code reasoning ability correlates positively with business knowledge, indicating that decomposed reasoning and structured thinking constitute shared underlying capabilities.
Highlights & Insights
- Diagnostic Power of the Dual-Axis Design: Unlike conventional benchmarks that merely report scores, BizCompass diagnoses "why a model performs well or poorly" — cross-axis analysis between the knowledge and application layers pinpoints specific capability bottlenecks.
- Empirical Evidence That Scale ≠ Capability: DeepSeek-R1 with 671B parameters underperforms smaller closed-source models on multiple business reasoning metrics, providing a strong empirical challenge to the applicability of scaling laws in vertical domains.
- Diversified Evaluation Metrics: The combined use of accuracy, F1, ROUGE, and GPT-Eval across different task types reflects a well-considered evaluation design.
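For the open-ended tasks, GPT-Eval relies on a multi-dimensional judge prompt scored by GPT-4o. A minimal sketch of what such a prompt could look like is below; the dimension names and rubric are assumptions for illustration, not the paper's published rubric:

```python
# Hypothetical LLM-as-judge template in the style of GPT-Eval.
JUDGE_TEMPLATE = """You are grading a model's answer to a business question.
Score each dimension from 1 to 5:
- correctness: is the conclusion factually and numerically right?
- reasoning: are the intermediate steps sound and complete?
- clarity: is the answer well organized and unambiguous?

Question: {question}
Reference answer: {reference}
Model answer: {answer}

Return JSON: {{"correctness": _, "reasoning": _, "clarity": _}}"""

def build_judge_prompt(question, reference, answer):
    # The filled prompt would be sent to the judge model (GPT-4o in the paper).
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )

prompt = build_judge_prompt(
    "What is the PV of $100 in one year at r=10%?",
    "$90.91",
    "$90.91, since 100 / 1.1 ≈ 90.91",
)
```

Asking for structured JSON scores rather than a single number is what makes the scheme "multi-dimensional": each dimension can be averaged separately across the benchmark.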
Limitations & Future Work
- The knowledge layer is primarily based on English-language examination items, leaving business environments in non-English languages unaddressed.
- Although the three application-layer roles are representative, they do not cover all business scenarios (e.g., human resources, marketing).
- GPT-Eval employs GPT-4o as the judge, introducing the risk of bias inherent to the judging model itself.
- The dataset is static, and the rapid evolution of business environments poses challenges to benchmark timeliness.
- A substantial portion of the 40-page paper is devoted to presenting complete result tables; the core findings could be presented more concisely.
Related Work & Insights
- vs. FinBen: Covers 36 datasets exclusively within the finance domain; BizCompass extends coverage to four business disciplines and introduces an application-layer evaluation.
- vs. CFLUE: A Chinese financial language understanding benchmark; BizCompass is English-based and broader in scope.
- vs. MMLU: A general knowledge benchmark that includes business-related subcategories but lacks diagnostic capability for business applications.
- vs. BBT-Fin: Focuses solely on financial NLP tasks; BizCompass additionally covers reasoning and decision-making.
Rating
- Novelty: ⭐⭐⭐⭐ The dual-axis design is innovative, though the technical contribution of a benchmark paper is inherently limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ A large number of open-source and closed-source models are evaluated with diverse metrics and in-depth analysis.
- Writing Quality: ⭐⭐⭐⭐ The structure is clear, though the paper is overly long at 40 pages.
- Value: ⭐⭐⭐⭐ Fills a gap in LLM evaluation for the business domain and offers practical reference value for industry applications.