
Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes for Precise Commonsense Diagnosis

Conference: ACL 2025
Code: None
Area: Human Understanding
Keywords: Commonsense Reasoning Diagnosis, Hidden Reasoning Attributes, Reasoning Difficulty Analysis, Precise Diagnosis, Fine-grained Evaluation

TL;DR

This paper shows that commonsense reasoning benchmarks contain questions that look simple on the surface but in fact require complex hidden reasoning attributes. It proposes a fine-grained diagnostic framework built on these attributes, enabling more precise analysis and evaluation of models' commonsense reasoning capabilities.

Background & Motivation

Background: Commonsense reasoning is a core task for measuring how well AI systems understand the world. Benchmarks such as CommonsenseQA (CSQA), WinoGrande, and HellaSwag are widely used, but in recent years the accuracy of large language models on these benchmarks has approached or even exceeded human levels, triggering debate over whether "commonsense reasoning has been solved."

Limitations of Prior Work: Existing benchmark evaluations are overly coarse-grained: they report only overall accuracy and ignore differences in the reasoning types and difficulty levels that individual questions require. A model may perform very well on "easy" categories yet fall well short on "difficult" ones, and overall accuracy masks this imbalance. More importantly, many seemingly simple questions actually involve multiple hidden reasoning dimensions.

Key Challenge: The surface form of questions (e.g., short multiple-choice questions) masks the underlying reasoning complexity. For example, "It is very hot today, so I opened (A. the window B. the refrigerator)" seems simple, but actually requires multi-layered reasoning, including causal reasoning (hot \(\rightarrow\) need to cool down), physical commonsense (window ventilation), and teleological reasoning (the purpose of opening a window). Existing evaluation methods cannot distinguish these reasoning dimensions.

Goal: (1) Systematically identify hidden reasoning attributes in commonsense reasoning tasks; (2) build a fine-grained diagnostic framework based on these attributes; (3) reveal the true capabilities of models across different reasoning dimensions.

Key Insight: Instead of designing new benchmarks, this work annotates multi-dimensional reasoning attributes for each question in existing benchmarks, constructing a "reasoning attribute lens" to re-examine model performance.

Core Idea: Re-annotate existing commonsense reasoning benchmarks with multi-dimensional reasoning attributes (such as causal reasoning, temporal reasoning, spatial reasoning, social commonsense, physical commonsense, etc.), and then conduct fine-grained evaluations based on attribute combinations to reveal the differences in models' capabilities across various reasoning dimensions.

Method

Overall Architecture

The work is divided into three stages: (1) design and validation of the reasoning attribute annotation system; (2) multi-dimensional attribute annotation for existing benchmark questions; (3) attribute-based fine-grained diagnosis and analysis.

Key Designs

Hidden Reasoning Attribute Taxonomy:
- Function: Define a set of reasoning attributes covering various dimensions of commonsense reasoning.
- Mechanism: Based on cognitive science and linguistics research, define multiple reasoning dimensions, which may include: causal reasoning (causal relationships between events), temporal reasoning (temporal order and duration), spatial reasoning (object positions and movement), social commonsense (social norms and conventions), physical commonsense (intuitive understanding of physical laws), functional commonsense (purpose and function of objects), emotional reasoning (inference of emotional states), etc. Each question can simultaneously involve multiple attributes.
- Design Motivation: Commonsense reasoning is not a single capability but a combination of multiple reasoning skills; precise diagnosis requires multi-dimensional annotation.
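
To make the multi-label nature of the taxonomy concrete, here is a minimal Python sketch of how one annotated question might be represented. The names `Attr` and `AnnotatedQuestion` are hypothetical and not taken from the paper. Mapping the "teleological reasoning" in the window example to functional commonsense is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Attr(Enum):
    """Reasoning dimensions listed in the summary (hypothetical encoding)."""
    CAUSAL = "causal"
    TEMPORAL = "temporal"
    SPATIAL = "spatial"
    SOCIAL = "social"
    PHYSICAL = "physical"
    FUNCTIONAL = "functional"
    EMOTIONAL = "emotional"


@dataclass(frozen=True)
class AnnotatedQuestion:
    """A benchmark question plus the set of attributes it requires."""
    text: str
    options: Tuple[str, ...]
    attributes: FrozenSet[Attr]


# The window example from the Key Challenge paragraph. Assumption: the
# "purpose of opening a window" is labeled as functional commonsense.
example = AnnotatedQuestion(
    text="It is very hot today, so I opened ___",
    options=("the window", "the refrigerator"),
    attributes=frozenset({Attr.CAUSAL, Attr.PHYSICAL, Attr.FUNCTIONAL}),
)

print(sorted(a.value for a in example.attributes))
```

Storing the attributes as a set rather than a single label is what later allows evaluation by attribute combination.
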
Hybrid Automatic-Manual Annotation Workflow:
- Function: Efficiently and accurately annotate reasoning attributes for a large number of questions.
- Mechanism: First utilize LLMs (such as GPT-4) to perform initial reasoning attribute prediction, followed by verification and correction by human annotators. Statistical tests for annotation consistency are conducted to ensure annotation quality. Active learning strategies may be adopted to prioritize human annotation for samples with low LLM confidence.
- Design Motivation: Pure manual annotation is highly costly, while pure automatic annotation is unreliable; the hybrid approach balances efficiency and accuracy.
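
The summary does not say which consistency statistic or prioritization rule the authors use. As a sketch under stated assumptions, the code below uses Cohen's kappa as one common agreement measure, and a plain confidence threshold as a simplified stand-in for the active learning step. The function names, the `qid` and `confidence` fields, and the 0.8 threshold are all hypothetical.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences (assumed metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def route_for_review(predictions, threshold=0.8):
    """Return ids of low-confidence LLM predictions, least confident first."""
    low = [p for p in predictions if p["confidence"] < threshold]
    return [p["qid"] for p in sorted(low, key=lambda p: p["confidence"])]


# Toy labels for one binary attribute ("requires causal reasoning?").
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5

llm_predictions = [
    {"qid": "q1", "confidence": 0.95},
    {"qid": "q2", "confidence": 0.40},
    {"qid": "q3", "confidence": 0.70},
]
print(route_for_review(llm_predictions))  # ['q2', 'q3']
```

Because each question can carry several attributes, agreement would be computed per attribute, treating each as its own binary labeling task.
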
Attribute-Aware Diagnostic Framework:
- Function: Diagnose model capabilities at a fine-grained level based on combinations of reasoning attributes.
- Mechanism: For each model, report not only the overall accuracy but also the accuracy on each reasoning attribute and on attribute combinations. Through conditional accuracy analysis (e.g., accuracy on questions requiring "causal + temporal reasoning"), identify specific weaknesses of the models. An analysis of the relationship between reasoning attributes and difficulty may also be included.
- Design Motivation: Precise diagnosis is the prerequisite for precise improvement; knowing "where the model is weak" is essential to "improving it there."
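
The conditional accuracy analysis described above can be sketched as follows. This is an illustration, not the authors' implementation: the record format (an attribute set paired with a correctness flag) and both function names are assumptions, and the toy records are invented for the example rather than taken from the paper.

```python
from collections import defaultdict


def per_attribute_accuracy(records):
    """Accuracy over all questions whose attribute set contains each attribute."""
    hits, totals = defaultdict(int), defaultdict(int)
    for attrs, correct in records:
        for attr in attrs:
            totals[attr] += 1
            hits[attr] += int(correct)
    return {attr: hits[attr] / totals[attr] for attr in totals}


def conditional_accuracy(records, required):
    """Accuracy on questions that require every attribute in `required`.

    Returns None when no question matches, so an empty slice is not
    mistaken for 0% accuracy.
    """
    required = frozenset(required)
    subset = [correct for attrs, correct in records if required <= frozenset(attrs)]
    return sum(subset) / len(subset) if subset else None


# Invented toy records: (attributes required, model answered correctly?)
records = [
    ({"causal"}, True),
    ({"causal"}, True),
    ({"causal", "temporal"}, True),
    ({"causal", "temporal"}, False),
    ({"causal", "physical", "spatial"}, False),
]

print(per_attribute_accuracy(records)["causal"])            # 0.6
print(conditional_accuracy(records, {"causal", "temporal"}))  # 0.5
```

Note that `conditional_accuracy` matches questions that require at least the given attributes. Restricting to questions with exactly that attribute set would be a different and equally plausible reading of "attribute combination".
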

Loss & Training

This paper is primarily an analytical work and does not involve training new models.

Key Experimental Results

Main Results

Model	Overall Accuracy	Causal Reasoning	Temporal Reasoning	Social Commonsense	Physical Commonsense	Weakest Dimension
GPT-4	~90%	~92%	~85%	~88%	~80%	Physical Commonsense
Llama-3-70B	~82%	~85%	~78%	~80%	~72%	Physical Commonsense
Llama-3-8B	~72%	~75%	~65%	~70%	~60%	Physical Commonsense
BERT-large	~65%	~68%	~58%	~62%	~55%	Physical Commonsense
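
As a quick check on the "Weakest Dimension" column, the short sketch below recomputes it from the approximate values in the table and reports how far the weakest attribute falls below overall accuracy. The numbers are copied from the table and are approximate (marked "~" there). The dictionary layout and the helper name `weakest_dimension` are assumptions for illustration.

```python
# Approximate accuracies (%) copied from the Main Results table.
table = {
    "GPT-4":       {"overall": 90, "causal": 92, "temporal": 85, "social": 88, "physical": 80},
    "Llama-3-70B": {"overall": 82, "causal": 85, "temporal": 78, "social": 80, "physical": 72},
    "Llama-3-8B":  {"overall": 72, "causal": 75, "temporal": 65, "social": 70, "physical": 60},
    "BERT-large":  {"overall": 65, "causal": 68, "temporal": 58, "social": 62, "physical": 55},
}


def weakest_dimension(row):
    """Return (attribute with lowest accuracy, points below overall accuracy)."""
    dims = {name: acc for name, acc in row.items() if name != "overall"}
    name = min(dims, key=dims.get)
    return name, row["overall"] - dims[name]


for model, row in table.items():
    name, gap = weakest_dimension(row)
    print(f"{model}: weakest = {name}, {gap} points below overall")
```

On these approximate figures, physical commonsense is the lowest dimension for every model, sitting 10 to 12 points below overall accuracy.
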

Attribute Combination Analysis

Attribute Combination	GPT-4	Llama-3-70B	Difficulty
Single Causal	~95%	~90%	Low
Causal + Temporal	~88%	~80%	Medium
Causal + Physical + Spatial	~75%	~62%	High
All-Attribute Combination	~65%	~50%	Very High
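
The following sketch works through the arithmetic behind the table: the gap between the two models at each difficulty level, and each model's drop relative to its single-attribute accuracy. The values are the approximate figures from the table, and the variable names are chosen here for illustration.

```python
# Approximate accuracies (%) copied from the Attribute Combination table.
combos = [
    "Single Causal",
    "Causal + Temporal",
    "Causal + Physical + Spatial",
    "All-Attribute Combination",
]
gpt4 = [95, 88, 75, 65]
llama70b = [90, 80, 62, 50]

# Gap between the two models for each attribute combination.
gaps = [g - l for g, l in zip(gpt4, llama70b)]

# Accuracy lost relative to each model's single-attribute baseline.
drop_gpt4 = [gpt4[0] - v for v in gpt4]
drop_llama = [llama70b[0] - v for v in llama70b]

for name, gap, dg, dl in zip(combos, gaps, drop_gpt4, drop_llama):
    print(f"{name}: gap={gap}, GPT-4 drop={dg}, Llama-3-70B drop={dl}")

print(gaps)  # [5, 8, 13, 15]
```

On these figures, the gap between the models grows from about 5 points on single-attribute questions to about 15 points on the all-attribute combination.
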

Key Findings

Models with overall accuracy close to human level still exhibit significant performance gaps on questions requiring multi-attribute combination reasoning, indicating that "solved commonsense reasoning" is an illusion.
Physical commonsense is a consistent weakness across all models, likely because training data contains few descriptions of physical interactions.
The difficulty of questions involving combinations of multiple reasoning attributes increases exponentially rather than linearly.
The gap between small and large models is narrow on simple dimensions but widens drastically on complex combination dimensions.
Surface simplicity (e.g., short question texts, fewer options) does not equate to reasoning simplicity; many short questions involve highly complex implicit reasoning.

Highlights & Insights

Redefining "Difficulty": Difficulty depends not on the surface complexity of questions, but on the number and types of required reasoning attributes. This insight has profound implications for benchmark design.
Transferable Diagnostic Framework: The methodology of reasoning attribute annotation and conditional accuracy analysis can be transferred to other evaluation fields such as mathematical and logical reasoning, helping construct more diagnostic benchmarks.

Limitations & Future Work

The granularity and completeness of the reasoning attribute taxonomy require further validation.
Subjectivity in the annotation process may introduce bias, as different annotators may have varying judgments on "whether causal reasoning is required."
The analysis is limited to English benchmarks; commonsense reasoning attributes in other languages may have culture-specific differences.
Future research can leverage the diagnostic results to guide targeted data augmentation or curriculum learning strategies.

Related Work

vs CommonsenseQA 2.0: CQA 2.0 focuses on constructing harder questions, whereas this work focuses on analyzing the implicit complexity of existing questions, providing a more refined perspective for evaluation.
vs BIG-Bench: BIG-Bench provides task-level capability evaluation, whereas this work performs multi-dimensional attribute analysis at the question level, offering a finer granularity.
vs CheckList (Ribeiro et al.): CheckList proposes a methodology for behavioral testing, and this work instantiates this idea in the field of commonsense reasoning by defining testing dimensions through reasoning attributes.

Rating

Novelty: ⭐⭐⭐⭐ The concept of "hidden reasoning attributes" is valuable.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-model, multi-benchmark, and multi-dimensional analysis.
Writing Quality: ⭐⭐⭐⭐ In-depth analysis with valuable insights.
Value: ⭐⭐⭐⭐ Significant contribution to the methodology of commonsense reasoning evaluation.