The Role of Deductive and Inductive Reasoning in Large Language Models

Conference: ACL 2025
arXiv: 2410.02892
Code: None
Area: LLM/NLP
Keywords: Deductive reasoning, inductive reasoning, prompt engineering, complexity evaluation, cognitive science

TL;DR

This paper proposes the DID (De-In-Ductive) framework, which improves LLM reasoning by dynamically combining inductive and deductive reasoning. A dual-indicator complexity measure, built from the Littlestone dimension and information entropy, guides how each question is decomposed. With GPT-4o, DID reaches 70.3% accuracy on the AIW benchmark (versus 62.2% for ToT) at a lower per-sample cost than ToT ($0.0031 versus $0.0038).

Background & Motivation

Background: LLMs have made significant progress on reasoning tasks. Methods such as Chain of Thought (CoT), Tree of Thought (ToT), and Graph of Thought (GoT) guide the model's reasoning through structured prompting and achieve promising results.

Limitations of Prior Work: These methods primarily rely on static prompting structures and extensive exploration of the output space. For example, although ToT achieves a 62.2% accuracy rate on AIW, it requires generating a large number of output tokens to explore multiple reasoning paths, leading to high computational overhead ($0.0038 per sample compared to CoT's $0.0022). More critically, these methods lack the ability to dynamically adjust reasoning strategies based on task complexity.

Key Challenge: When solving problems, humans dynamically switch between inductive reasoning (from specific to general) and deductive reasoning (from general to specific) based on task complexity, whereas existing LLM reasoning methods are static and cannot simulate this cognitive adaptability.

Goal: (1) How to precisely evaluate question complexity to guide reasoning strategies? (2) How to dynamically integrate inductive and deductive reasoning into LLM reasoning frameworks? (3) How to control computational costs while improving reasoning capabilities?

Key Insight: Grounded in cognitive science, the authors observe that inductive reasoning is used to discover patterns from simple instances, while deductive reasoning is used to apply these patterns to complex problems. Unlike existing methods that focus on expanding output exploration, DID adopts an input-centric strategy, investing computation in input structuring.

Core Idea: Using a dual-indicator complexity evaluation based on the Littlestone dimension and information entropy, DID decomposes a complex problem into a sequence of progressively harder sub-questions, applying inductive reasoning first to discover patterns and then deductive reasoning to solve the target problem.

Method

Overall Architecture

The input to the DID framework is a reasoning question, and the output is the solution to that question. The overall pipeline is divided into three stages: (1) question complexity evaluation—using the Littlestone dimension $d$ and information entropy $H$ to calculate the overall complexity of the question $C(p) = d \cdot H(p)$; (2) question decomposition—decomposing the question into a sequence of sub-questions from simple to difficult based on complexity; (3) progressive reasoning—first learning patterns from simple sub-questions through inductive reasoning, and then applying the patterns to more complex questions through deductive reasoning.

Key Designs

Dual-Indicator Complexity Evaluation System:
- Function: Precisely evaluates the difficulty of reasoning tasks to guide the question decomposition strategy.
- Mechanism: In traditional online learning, the Littlestone dimension $d$ can measure the structural complexity of a question (such as decision tree depth and the number of critical decision points). However, the authors find that for LLMs, questions with the same Littlestone dimension can still vary greatly in difficulty. For instance, "Alice has 0 brothers and 1 sister" is much easier than "Alice has 3 brothers and 6 sisters," despite sharing the same reasoning structure. Therefore, information entropy $H(p) = \log_2(\prod_{i=1}^{n}(1+|x_i|))$ is introduced to measure instance-level complexity. The final complexity is defined as $C(p) = d \cdot H(p)$.
- Design Motivation: Addresses the limitation of the Littlestone dimension in distinguishing isomorphic questions of different scales, making complexity evaluation more accurate for LLMs.
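
The two indicators can be sketched in a few lines. This is a minimal illustration of the formulas above, not the authors' implementation: it assumes the $x_i$ are the integer quantities that appear in the question, and it takes the Littlestone dimension $d$ as a given input (the value 3 below is arbitrary). The function names are hypothetical.

```python
import math
from typing import Sequence


def information_entropy(values: Sequence[int]) -> float:
    """H(p) = log2(prod_i (1 + |x_i|)) over the quantities x_i in the question."""
    return math.log2(math.prod(1 + abs(x) for x in values))


def complexity(littlestone_dim: int, values: Sequence[int]) -> float:
    """C(p) = d * H(p)."""
    return littlestone_dim * information_entropy(values)


# Two AIW-style instances with the same structure but different scale.
# d = 3 is an arbitrary illustrative value, not a figure from the paper.
easy = complexity(3, [0, 1])  # 0 brothers, 1 sister: H = log2(1 * 2)
hard = complexity(3, [3, 6])  # 3 brothers, 6 sisters: H = log2(4 * 7)
print(easy, hard)  # the second instance scores higher despite equal d
```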
Progressive Question Decomposition Algorithm:
- Function: Decomposes complex questions into a sequence of sub-questions ranging from simple to difficult.
- Mechanism: The algorithm first creates a basic sub-question with a reduced dimension of $d-2$ (by setting certain variables to zero), and then iteratively constructs $N = \lceil C(p)/a \rceil$ sub-questions. In the first half ($i < N/2$), it maintains a reduced dimension of $d-1$ to allow the model to establish basic patterns at lower complexity. In the second half ($i \geq N/2$), it restores the full dimension of $d$, gradually introducing the complete complexity of the question. The step-size parameter $a$ controls the granularity of the decomposition.
- Design Motivation: Mimics the human cognitive process—starting with a simplified version to identify core patterns, and then systematically applying insights to more complex scenarios. The two-stage dimension management ensures that the model balances pattern recognition (induction) and rule application (deduction).
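
The dimension schedule described above can be sketched as follows. This only models which Littlestone dimension each sub-question uses; how a sub-question is actually rewritten at a given dimension is not specified here. The 0-based index `i`, the clamp at zero, and the function name are assumptions made for the sketch.

```python
import math
from typing import List


def dimension_schedule(d: int, complexity: float, step: float) -> List[int]:
    """Dimensions for the base sub-question followed by N = ceil(C(p) / a) sub-questions."""
    n = math.ceil(complexity / step)
    # Base sub-question at reduced dimension d - 2 (clamped at 0, an assumption).
    schedule = [max(d - 2, 0)]
    for i in range(n):
        if i < n / 2:
            # First half: stay at d - 1 so basic patterns can be established.
            schedule.append(max(d - 1, 0))
        else:
            # Second half: restore the full dimension d.
            schedule.append(d)
    return schedule


# Illustrative values only: d = 3, C(p) = 14.4, a = 4 gives N = 4.
print(dimension_schedule(3, 14.4, 4.0))
```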
Inductive-Deductive Reasoning Integration:
- Function: Dynamically switches between inductive and deductive reasoning during the reasoning process.
- Mechanism: The inductive component starts from simplified question instances (with a reduced Littlestone dimension of $d-2$ or $d-1$) to identify basic patterns and relationships, generating and refining hypotheses as it is exposed to progressively more complex examples. The deductive component then systematically applies the rules discovered during induction to more complex instances, and each deductive step also verifies the induced patterns. Together the two components form a continuous learning loop.
- Design Motivation: Unlike methods such as CoT/ToT that focus on expanding output exploration, DID is an input-centric approach, achieving more efficient reasoning by investing in input structuring.
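
One way to picture the input-centric strategy is a prompt that lists the sub-questions from simple to hard, asks for the induced rule, and only then poses the target question. The sketch below is a hypothetical template: the wording of the instructions and the function name are assumptions, and the paper's actual prompts may differ.

```python
from typing import List


def build_did_prompt(sub_questions: List[str], target: str) -> str:
    """Assemble a prompt that asks for induction first, then deduction."""
    lines = ["Work through the simpler questions first and note the pattern they share."]
    for i, question in enumerate(sub_questions, start=1):
        lines.append(f"Sub-question {i}: {question}")
    # Induction: ask for the general rule suggested by the simple cases.
    lines.append("State the general rule you inferred from the sub-questions.")
    # Deduction: apply the rule to the target and verify it against earlier answers.
    lines.append(
        "Apply that rule step by step to the target question, "
        "and check that it still agrees with the earlier answers."
    )
    lines.append(f"Target question: {target}")
    return "\n".join(lines)


prompt = build_did_prompt(
    [
        "Alice has 1 brother and 1 sister. How many sisters does Alice's brother have?",
        "Alice has 2 brothers and 3 sisters. How many sisters does Alice's brother have?",
    ],
    "Alice has 3 brothers and 6 sisters. How many sisters does Alice's brother have?",
)
print(prompt)
```

All of the extra computation here goes into constructing the input; the model is called once rather than asked to explore many output paths.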

Loss & Training

DID does not involve model training or fine-tuning. It is a pure prompt engineering framework that guides the reasoning process of LLMs at inference time using carefully constructed input prompts. All experiments are conducted in a zero-shot setting, with model parameters kept at their default values.

Key Experimental Results

Main Results

| Dataset | Metric | DID (GPT-4o) | ToT (GPT-4o) | CoT (GPT-4o) | IO (GPT-4o) |
| --- | --- | --- | --- | --- | --- |
| AIW (Alice Problem) | Accuracy | 70.3% | 62.2% | 55.9% | 43.4% |
| MR-GSM8K | Accuracy | 87.7% | 89.1% | 85.0% | - |
| Holiday Puzzle | Accuracy | 15.4% | 7.5% | 5.2% | 7.8% |

Using Claude 3.5 Sonnet, DID achieves an accuracy of up to 89.5% on AIW and 24.5% on Holiday Puzzle.

Ablation Study

| Metric (GPT-4o) | DID | ToT |
| --- | --- | --- |
| Cost per case (AIW) | $0.0031 | $0.0038 |
| Input tokens per case (AIW) | 90 | 56 |
| Output tokens per case (AIW) | 290 | 370 |
| Cost per case (Holiday Puzzle) | $0.0181 | $0.0262 |
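
The AIW cost figures can be related to the token counts with a short calculation. The per-token rates used by the authors are not stated in this summary, so the sketch below assumes rates of $2.50 per million input tokens and $10.00 per million output tokens; under that assumption the reported per-case costs are approximately reproduced. The function name is hypothetical.

```python
def cost_per_case(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float = 2.50,    # assumed rate, not given in the source
    usd_per_m_output: float = 10.00,  # assumed rate, not given in the source
) -> float:
    """Dollar cost of one query from its input and output token counts."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


did = cost_per_case(90, 290)  # close to the reported $0.0031
tot = cost_per_case(56, 370)  # close to the reported $0.0038
saving = 1 - did / tot        # relative saving of DID over ToT on AIW
print(f"DID ${did:.4f}  ToT ${tot:.4f}  saving {saving:.1%}")
```

Because output tokens are priced higher than input tokens under these assumed rates, spending 34 more input tokens to save 80 output tokens lowers the total cost.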

Key Findings

- DID shows its largest gains on tasks that require logical reasoning and pattern discovery: with GPT-4o it beats ToT by 8.1 points on AIW (70.3% vs. 62.2%) and by 7.9 points on Holiday Puzzle (15.4% vs. 7.5%). On the well-structured mathematical reasoning task MR-GSM8K it is slightly behind ToT (87.7% vs. 89.1%) but ahead of CoT (85.0%).
- DID shows improvements across all three models tested (GPT-3.5, GPT-4o, Claude 3.5 Sonnet), which suggests that the method does not depend on a specific model architecture.
- The input-centric strategy is effective: although DID uses more input tokens than ToT on AIW (90 vs. 56), its overall cost is lower because it generates fewer output tokens (290 vs. 370).

Highlights & Insights

- Introduces the dual-process model of inductive-deductive reasoning from cognitive science to LLM prompting frameworks, providing an interesting interdisciplinary perspective.
- The complexity evaluation method integrating Littlestone dimension and information entropy is theoretically grounded, moving beyond intuition-driven prompt design.
- The shift from an output-centric to an input-centric strategy represents a noteworthy paradigm shift: rather than prompting models to explore more output paths, it focuses on guiding the models with better-structured inputs.

Limitations & Future Work

- LLMs are inherently based on next-token prediction and lack the cognitive ability to maintain consistent internal representations across reasoning steps. DID merely mitigates rather than fundamentally resolves this issue.
- Generalizability: The highest accuracy on Holiday Puzzle is only 24.5%, indicating that it remains insufficient for tasks requiring precise temporal reasoning.
- The automated calculation of the Littlestone dimension and information entropy is not detailed, which may require manual setting or additional heuristics in practical applications.
- Validated only on the GPT series and Claude, lacking experiments on open-source models.
- The method relies on the assumption that progressive sub-questions can be successfully constructed, which may not be applicable to certain reasoning tasks that cannot be naturally decomposed.

Similar to the concept of Test-Time Training (TTT), DID can be viewed as a form of Test-Time Prompting, investing more computation to construct inputs during inference.
While models like DeepSeek-R1 and o1 enhance performance through RL and extended reasoning paths, DID provides a complementary, training-free alternative.
Future work could explore combining DID with RL-based reasoning, or automating the complexity evaluation process.

Rating

- Novelty: 7/10. Bringing the inductive-deductive view from cognitive science into LLM reasoning is novel, but the method remains prompt engineering at its core.
- Technical Depth: 6/10. The theoretical framework is interesting, but the actual implementation is relatively simple.
- Experimental Thoroughness: 6/10. The number of tasks is limited, and accuracy on some tasks is quite low (24.5% at best on Holiday Puzzle).
- Writing Quality: 7/10. Clearly structured with well-articulated theoretical motivation.
- Value: 6/10. As a zero-shot prompting method it has some generality, but the complexity evaluation is not sufficiently automated.