Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions

Conference: ACL 2025
arXiv: 2502.12616
Code: None
Area: LLM Reasoning
Keywords: Chain-of-Thought, Quasi-Symbolic Reasoning, Abstract Representation, In-Context Learning, Robustness

TL;DR

This paper proposes QuaSAR (Quasi-Symbolic Abstract Reasoning), a Chain-of-Thought (CoT) variant that guides LLMs to first abstract the problem symbolically (extracting variables/predicates), reconstruct it using a semi-formal representation, and finally solve it based on a quasi-symbolic reasoning chain. QuaSAR achieves up to an 8% accuracy improvement over standard CoT on GPT-4o, while significantly enhancing robustness against adversarial variants (e.g., option shuffling, numerical substitution).

Background & Motivation

Chain-of-Thought (CoT) is currently the dominant strategy for LLM reasoning, enhancing performance by decomposing complex problems into intermediate steps. However, explanations generated by CoT are susceptible to content bias: models may reason from surface-level context cues rather than the underlying logical relationships. This leads to:

  • Answers changing when the option order is shuffled (MMLU-Redux)
  • Significant performance drops when numerical values are substituted (GSM-Symbolic)
  • Reasoning processes that are unfaithful to the true logical chain

To address this issue, one line of work proposes utilizing logical formalization (e.g., translating natural language into logical programs) combined with external symbolic solvers. However, complete formalization faces efficiency bottlenecks: fully translating natural language to formal language is highly complex, error-prone, and lacks flexibility.

The core idea of QuaSAR is to find a compromise: instead of complete formalization, it guides LLMs to symbolize only the key variables and predicates, allowing natural language and symbolic representations to coexist. This "quasi-symbolic abstraction" decouples concrete world knowledge from symbolic reasoning, reducing content bias while bypassing the bottlenecks of complete formalization.

This approach is grounded in the philosophy of science—Kitcher's (1981) unificationist theory of explanation suggests that a good explanation should establish reusable argument schemas by replacing concrete entities with abstract symbols.

Method

Overall Architecture

QuaSAR structures the reasoning process as a quadruple \((\mathcal{Q}, \mathcal{S}, \mathcal{R}, \mathcal{A})\), where \(\mathcal{S} = (s_1, s_2, s_3, s_4)\) represents a four-step structured instruction chain. Compared to the standard CoT triple \((\mathcal{Q}, \mathcal{R}, \mathcal{A})\), QuaSAR introduces an additional symbolic transformation layer. The workflow operates as a single-step prompting pipeline (without requiring external solvers), keeping execution overhead low.

Key Designs

  1. Step 1 - Abstraction

    • Guides the LLM to analyze the problem and extract key information: identifying relevant symbolic predicates, variables (numerical or textual), and constants.
    • This is the initial step of problem-solving—abstracting concrete questions into structured representations.
    • Example: Abstracting "Lisa has 5 apples" into apples(Lisa) = 5.
  2. Step 2 - Formalisation

    • Reformulates the original problem into a hybrid symbolic-natural language format.
    • The key lies in "quasi-formalisation"—translating only the necessary components while retaining natural language context critical for solving.
    • The goal is to minimize ambiguity and content effects without losing important information.
  3. Step 3 - Explanation

    • Performs step-by-step reasoning based on the quasi-symbolic structure.
    • The reasoning trajectory uses symbolic representations to explicitly demonstrate the logical connections between steps.
    • This reduces errors caused by contextual knowledge or implicit logical relationships.
  4. Step 4 - Answering

    • The LLM generates the final answer in a fixed format: "The answer is: [number]".
    • This ensures the reasoning chain reaches a clear conclusion and facilitates automated evaluation.
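
The four steps above can be sketched as a single prompt template plus an answer parser. This is a minimal illustration, not the paper's prompt: the instruction wording in `QUASAR_STEPS` and the helper names `build_quasar_prompt` and `extract_answer` are assumptions made for this example. Only the step names and the "The answer is: [number]" format come from the description above.

```python
import re
from typing import Optional

# Hypothetical instruction wording; the paper's exact prompt text is not reproduced here.
QUASAR_STEPS = (
    ("Abstraction", "Identify the relevant predicates, variables, and constants."),
    ("Formalisation", "Restate the problem with those symbols, keeping the "
                      "natural-language context needed to solve it."),
    ("Explanation", "Reason step by step over the quasi-symbolic form."),
    ("Answering", "Finish with 'The answer is: [number]'."),
)


def build_quasar_prompt(question: str) -> str:
    """Assemble one prompt that contains the question and all four instructions."""
    lines = [f"Question: {question}", "", "Solve the problem in four steps:"]
    for i, (name, instruction) in enumerate(QUASAR_STEPS, start=1):
        lines.append(f"Step {i} ({name}): {instruction}")
    return "\n".join(lines)


# Matches the fixed answer format, with or without brackets around the number.
ANSWER_RE = re.compile(r"The answer is:\s*\[?\s*(-?\d+(?:\.\d+)?)\s*\]?")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number given in the fixed answer format, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(build_quasar_prompt("Lisa has 5 apples and buys 3 more. How many now?"))
    print(extract_answer("apples(Lisa) = 5 ... The answer is: 8"))
```

Keeping all four instructions in one prompt mirrors the single-pass design described under Overall Architecture; the fixed answer format is what makes the regex-based parsing possible.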

Two Application Modes

  1. QuaSAR for ICL: Used directly as an in-context learning strategy to guide large model reasoning.
  2. QuaSAR for Demonstrations: Employs high-performance LLMs (e.g., GPT-4o) to generate demonstrations in QuaSAR format, which serve as training data to fine-tune smaller models.
    • Quality Filtering: Demonstrations are first filtered for correct answers using exact match, followed by verification of citation and reference accuracy (filtering out approximately 50% of the generated data).
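
As a rough illustration of the first filtering stage (exact match on the final answer), the sketch below keeps only teacher-generated demonstrations whose parsed answer equals the gold label. The `Demo` record, its field names, and the parsing rule are hypothetical; the second verification stage is not sketched because the summary above does not specify how it is carried out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """Hypothetical record for one teacher-generated demonstration."""
    question: str
    trajectory: str  # full four-step output from the teacher model
    gold: str        # reference answer from the dataset


def final_answer(trajectory: str) -> Optional[str]:
    """Parse the token that follows the last 'The answer is:' marker."""
    found = re.findall(r"The answer is:\s*(\S+)", trajectory)
    if not found:
        return None
    # Drop a trailing full stop, then any surrounding brackets.
    return found[-1].rstrip(".").strip("[]")


def exact_match_filter(demos: List[Demo]) -> List[Demo]:
    """Keep demonstrations whose parsed answer equals the gold answer exactly."""
    return [d for d in demos if final_answer(d.trajectory) == d.gold.strip()]


if __name__ == "__main__":
    demos = [
        Demo("5 + 3?", "total = 5 + 3. The answer is: 8", "8"),
        Demo("5 + 3?", "total = 5 + 4. The answer is: 9", "8"),
    ]
    print(len(exact_match_filter(demos)))
```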

Loss & Training

  • The small models are trained using the standard language modeling objective: \(\max_\theta \mathbb{E}_{(Q, \alpha, Y) \sim \mathcal{D}} \left[ \log p_\theta(Y \mid \alpha, Q)\, p_\theta(\alpha \mid Q) \right]\)
  • Where \(\alpha = \alpha_1 \cdot \alpha_2 \cdot \alpha_3 \cdot \alpha_4\) is the concatenation of the four-step reasoning trajectory.
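
To make the objective concrete, the log of the product inside the expectation can be expanded with the chain rule of probability. The expansion below is a worked step added for illustration, not a formula quoted from the paper; the notation \(\alpha_{<i}\) (all steps before step \(i\)) is introduced here.

```latex
\[
\log \bigl[ p_\theta(Y \mid \alpha, Q)\, p_\theta(\alpha \mid Q) \bigr]
  = \log p_\theta(Y \mid \alpha, Q)
  + \sum_{i=1}^{4} \log p_\theta(\alpha_i \mid \alpha_{<i}, Q)
\]
```

Read this way, the objective is ordinary next-token likelihood over the concatenated sequence: each of the four steps is predicted from the question and the preceding steps, and the answer is predicted from the question and the full trajectory.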

Key Experimental Results

Main Results (QuaSAR as ICL)

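| Model | Method | AQuA | GSM8K | SVAMP | MMLU-Redux | GPQA | DROP |
|---|---|---|---|---|---|---|---|
| GPT-4o | Baseline | 72.8 | 94.0 | 90.4 | 79.7 | 46.5 | 83.4 |
| GPT-4o | + CoT | 84.3 | 94.5 | 90.3 | 88.1 | 50.2 | 84.2 |
| GPT-4o | + QuaSAR | 87.4 | 96.5 | 97.0 | 90.2 | 55.4 | 88.9 |
| Llama-3-70B | + CoT | 74.0 | 86.1 | 84.6 | 82.0 | 41.9 | 80.2 |
| Llama-3-70B | + QuaSAR | 79.1 | 88.2 | 84.9 | 85.7 | 49.2 | 88.0 |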

QuaSAR as Fine-Tuning Demonstration (results after fine-tuning in parentheses)

| Model | AQuA | GSM8K | SVAMP | MMLU-Redux |
|---|---|---|---|---|
| Llama-3-8B + CoT | 69.6 (72.2) | 80.4 (82.6) | 76.3 (78.8) | 64.5 (65.9) |
| Llama-3-8B + QuaSAR | 67.2 (78.4) | 77.2 (83.0) | 77.3 (82.6) | 63.0 (67.2) |

Robustness Evaluation

| Model | Task | Baseline | CoT | QuaSAR |
|---|---|---|---|---|
| GPT-4o | MMLU-Redux (Option Shuffling) | 78.6 (-1.2) | 86.8 (-1.2) | 90.3 (0.0) |
| GPT-4o | GSM-Symbolic (Numerical Substitution) | 89.7 (-4.3) | 90.8 (-4.7) | 95.3 (-1.2) |
| Llama-3-8B | MMLU-Redux (Option Shuffling) | 27.0 (-3.2) | 30.4 (-1.2) | 37.3 (-0.3) |
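
The option-shuffling perturbation can be sketched as follows. This is an illustrative harness, not the benchmark's actual tooling: `shuffle_options`, `is_order_robust`, and the `predict` callable (which maps a question and its options to a chosen index) are hypothetical names introduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def shuffle_options(options: Sequence[str], gold_index: int,
                    seed: int) -> Tuple[List[str], int]:
    """Permute the answer options and return the gold answer's new position."""
    rng = random.Random(seed)            # seeded so the perturbation is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_gold = order.index(gold_index)   # where the original gold option ended up
    return shuffled, new_gold


def is_order_robust(predict: Callable[[str, List[str]], int], question: str,
                    options: Sequence[str], gold_index: int,
                    seeds: Sequence[int] = (0, 1, 2, 3, 4)) -> bool:
    """True if `predict` picks the gold option under every tested permutation."""
    for seed in seeds:
        shuffled, new_gold = shuffle_options(options, gold_index, seed)
        if predict(question, shuffled) != new_gold:
            return False
    return True


if __name__ == "__main__":
    opts = ["3", "4", "5", "6"]
    # Toy predictor that selects by content, so option order cannot affect it.
    by_content = lambda q, o: o.index("4")
    print(is_order_robust(by_content, "What is 2 + 2?", opts, gold_index=1))
```

A predictor that keys on option position rather than content will disagree with the remapped gold index under some permutations, which is the failure mode the shuffled-option rows in the table measure.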

Ablation Study (Impact of removing components)

| Configuration | Change in Avg. Accuracy | Description |
|---|---|---|
| w/o Step 1 (Abstraction) | -1.8 | Important but not decisive |
| w/o Step 2 (Formalisation) | -3.5 | Most significant impact |
| w/o Step 3 (Explanation) | -3.4 | Most significant impact |
| w/o Step 4 (Answering) | -2.5 | Larger impact on multiple-choice questions |
| Randomly shuffling step order | ~-4.0 | Step order is also crucial |

Key Findings

  • Significant improvements of QuaSAR over CoT on GPT-4o: AQuA +3.1, SVAMP +6.7, GPQA +5.2.
  • QuaSAR dramatically enhances robustness: Under option shuffling, performance shows almost zero degradation (0.0 vs. -1.2 for CoT); under numerical substitution, performance drops by only 1.2 (vs. -4.7 for CoT).
  • Direct QuaSAR ICL on small models yields limited performance: without fine-tuning, Llama-3-8B with QuaSAR trails CoT on AQuA, GSM8K, and MMLU-Redux, suggesting that small models struggle with the complexity of the four-step instructions.
  • Fine-tuning small models with QuaSAR-generated demonstrations is highly effective: on AQuA, Llama-3-8B reaches 78.4 after fine-tuning on QuaSAR demonstrations, compared with 72.2 after fine-tuning on CoT demonstrations.
  • Step 2 (Formalisation) and Step 3 (Explanation) are the most critical: removing either one lowers average accuracy by about 3.5 points (-3.5 and -3.4, respectively).
  • High training data efficiency: QuaSAR demonstrations require only 25-50% of the training data quantity of standard CoT demonstrations to achieve comparable or superior performance.
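
The per-task gains quoted in the first finding follow directly from the GPT-4o rows of the main results table. The short script below recomputes them; the dictionary names are only for illustration, and the numbers are copied from the table above.

```python
# GPT-4o accuracies copied from the main results table above.
cot = {"AQuA": 84.3, "GSM8K": 94.5, "SVAMP": 90.3,
       "MMLU-Redux": 88.1, "GPQA": 50.2, "DROP": 84.2}
quasar = {"AQuA": 87.4, "GSM8K": 96.5, "SVAMP": 97.0,
          "MMLU-Redux": 90.2, "GPQA": 55.4, "DROP": 88.9}

# Gain of QuaSAR over CoT in accuracy points, rounded to one decimal place.
gains = {task: round(quasar[task] - cot[task], 1) for task in cot}
best_task = max(gains, key=gains.get)

if __name__ == "__main__":
    for task, gain in gains.items():
        print(f"{task}: +{gain}")
    print("largest gain:", best_task)
```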

Highlights & Insights

  • "Quasi-symbolisation" is an elegant compromise: it captures much of the precision and robustness of symbolic reasoning while sidestepping the cost and brittleness of full formalization.
  • Robustness improvements are the most compelling results: GPT-4o shows no degradation under option shuffling (0.0 vs. -1.2 for CoT), which suggests that content bias is substantially reduced.
  • Theoretical motivation from philosophy of science: Kitcher’s unificationist theory of explanation provides a rigorous and elegant theoretical foundation for the method.
  • Dual application modes (ICL and demonstration fine-tuning) ensure broad applicability of the methodology.

Limitations & Future Work

  • The method is validated only on English tasks; multilingual generalization remains unexplored.
  • For pure natural language understanding tasks that do not require logical reasoning (e.g., sentiment analysis), the additional steps of QuaSAR might act as unnecessary computational overhead.
  • Small models yield unstable performance under direct QuaSAR ICL, indicating a minimum capability threshold for the model.
  • No direct comparisons were made against long-CoT reasoning models such as the OpenAI o1 class.
  • Approximately 50% of QuaSAR demonstrations are filtered out during quality checking, leaving room to improve the efficiency of demonstration generation.
  • Research Idea: The formalization step of QuaSAR can be naturally combined with external verifiers—first using QuaSAR to generate a quasi-symbolic reasoning chain, and then utilizing a symbolic solver to check logical consistency, achieving "verifiable CoT."
  • Faithful CoT (Lyu et al., 2023): Follows the path of complete formalization plus external solvers. QuaSAR demonstrates that "partial symbolization" is sufficient.
  • CoMAT (Leang et al., 2024): Another CoT method integrating symbols; QuaSAR outperforms it on GPT-4o by 6.8% on average.
  • FLAIRE (Arakelyan et al., 2024): A reasoning method based on logical formalization.
  • The most critical insight of this work: Complete formalization is unnecessary; symbolizing only key variables and predicates is sufficient to yield the benefits of symbolic reasoning, representing an exceptional balance between practicality and theoretical elegance.
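
The "verifiable CoT" research idea above could look roughly like the sketch below, which replays a quasi-symbolic arithmetic chain and compares the result with the claimed answer. Everything here is an assumption made for illustration: the step format (`name = expression` with plain identifiers rather than predicates such as `apples(Lisa)`), the restriction to the four basic arithmetic operators, and the function name `check_chain`. It is not part of QuaSAR as described in the paper.

```python
import ast
import operator
from typing import Dict, Sequence

# Only the four basic arithmetic operators are allowed in a step.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node: ast.AST, env: Dict[str, float]) -> float:
    """Evaluate a restricted arithmetic expression tree without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]  # KeyError if a variable is used before definition
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    raise ValueError("unsupported expression")


def check_chain(steps: Sequence[str], claimed_answer: float,
                tol: float = 1e-9) -> bool:
    """Replay 'name = expr' steps in order; compare the last value to the claim."""
    env: Dict[str, float] = {}
    last = None
    try:
        for step in steps:
            name, expr = (part.strip() for part in step.split("=", 1))
            env[name] = _eval(ast.parse(expr, mode="eval").body, env)
            last = name
    except (KeyError, ValueError, SyntaxError, ZeroDivisionError):
        return False  # malformed or inconsistent chain
    return last is not None and abs(env[last] - claimed_answer) <= tol


if __name__ == "__main__":
    chain = ["apples_lisa = 5", "apples_tom = 3",
             "total = apples_lisa + apples_tom"]
    print(check_chain(chain, 8))
```

A real verifier would need to handle predicates, units, and non-arithmetic reasoning; the point of the sketch is only that a partially symbolic chain already exposes enough structure to be checked mechanically.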

Rating

  • Novelty: ⭐⭐⭐⭐ The positioning of "quasi-symbolic abstraction" between pure NL and pure symbols is highly unique, backed by theoretical support from the philosophy of science.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation across multiple tasks (math, NLP) and models, using two distinct deployment modes (ICL and fine-tuning), along with robustness testing and exhaustive ablation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear theoretical motivation, precise methodology description, and in-depth empirical analysis.
  • Value: ⭐⭐⭐⭐ Provides a generalizable and practical enhancement strategy for CoT, with improvements in robustness holding practical deployment significance.