Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning¶

Conference: ECCV 2024
Paper: ECVA
Code: GitHub
Area: Multimodal VLM / Causal Reasoning
Keywords: Counterfactual Reasoning, Knowledge Bias, Vision-Language Reasoning, VQA, Large Vision-Language Models

TL;DR¶

This paper proposes the Counterfactual Bias-Robust Reasoning dataset (CoBRa) and the Chain of Counterfactual Thought (CoCT) method. By editing knowledge graphs and image content to construct counterfactual scenarios, the study evaluates and mitigates knowledge bias in large vision-language models (LVLMs), guiding models to reason step by step from the evidence rather than rely on biased knowledge. On CoBRa, CoCT raises accuracy from 38.2% to 52.3% for InstructBLIP and from 42.7% to 56.8% for LLaVA, clearly ahead of standard inference and standard CoT.

Background & Motivation¶

Background: Large Vision-Language Models (LVLMs, such as GPT-4V and LLaVA) have achieved great success in tasks like VQA and image understanding. These models are pre-trained on massive datasets and accumulate rich world knowledge. However, such knowledge inherited from training data also introduces implicit bias.

Limitations of Prior Work: LVLMs are highly sensitive to knowledge bias in training data. When faced with counterfactual scenarios inconsistent with the training distribution—such as a purple banana or a medieval knight in a spacesuit—the models tend to rely on biased prior knowledge learned during pre-training ("bananas are yellow") rather than reasoning based on image content. This limits the models' generalization ability in novel scenarios and their reliability in practical applications.

Key Challenge: Models require knowledge to understand the world, but over-reliance on this knowledge leads to bias. The core issue is how to enable models to prioritize reasoning based on visual evidence when encountering scenarios that conflict with prior knowledge, while still retaining useful knowledge.

Goal: (1) How to systematically evaluate the knowledge bias of LVLMs? (2) How to teach models to perform robust reasoning in counterfactual scenarios?

Key Insight: The authors approach this from the perspective of counterfactual thinking—constructing "what if the world were different" scenarios. By editing knowledge graphs (modifying factual relations) and image content (visual counterfactuals), they create a dataset that requires models to discard bias and reason based on current evidence. They then teach models a "Chain of Counterfactual Thought" (CoCT)—first identifying conflicts with prior knowledge, and then reasoning step-by-step based on the current evidence.

Core Idea: Construct a counterfactual VQA dataset to expose knowledge bias, and use the Chain of Counterfactual Thought (CoCT) to teach LVLMs to reason based on evidence rather than bias.

Method¶

Overall Architecture¶

The work consists of two parts: dataset construction and the core method. The dataset portion (CoBRa) generates VQA samples containing counterfactual scenarios by editing knowledge graphs and image content, with each sample accompanied by detailed annotations of the reasoning process. The method portion (CoCT) contains two core steps: (1) training a Translation Language Model (TLM) to learn the counterfactual reasoning process; (2) using the counterfactual reasoning exemplars generated by the TLM as in-context examples to guide LVLMs in bias-robust reasoning.

Key Designs¶

CoBRa Counterfactual Bias-Robust Reasoning Dataset:
- Function: Provides data for systematically evaluating and training LVLMs to handle knowledge bias.
- Mechanism: Starting from existing knowledge graphs (KGs), the authors select common factual triples (e.g., "banana-color-yellow") and then edit them—replacing "yellow" with "purple". Accordingly, image editing tools are used to modify the image to align with the new knowledge. Each sample includes: (a) the original image and question; (b) the edited image and knowledge graph; (c) the complete annotated reasoning process explaining how to derive the correct answer from the edited evidence. The dataset contains approximately 64K samples, covering over 14K unique entries.
- Design Motivation: Existing VQA datasets do not specifically assess knowledge bias. Explicitly constructing counterfactual scenarios allows for accurate measurement of the models' bias. Detailed annotations of the reasoning process provide a foundation for learning robust reasoning strategies.
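
The triple-editing step and the sample layout described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the class names `Triple` and `CounterfactualSample`, the field names, and the image paths are all hypothetical, and the image editing itself is out of scope here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) fact from a knowledge graph."""
    subject: str
    relation: str
    obj: str


@dataclass(frozen=True)
class CounterfactualSample:
    """Hypothetical record layout mirroring items (a)-(c) in the text."""
    original_image: str          # path to the unedited image (placeholder)
    edited_image: str            # path to the image edited to match the new fact
    question: str
    original_fact: Triple
    edited_fact: Triple
    reasoning: tuple[str, ...]   # annotated reasoning steps
    answer: str


def edit_triple(fact: Triple, new_obj: str) -> Triple:
    """Return a counterfactual copy of `fact` with its object replaced."""
    if new_obj == fact.obj:
        raise ValueError("the edit must change the fact")
    return replace(fact, obj=new_obj)


fact = Triple("banana", "color", "yellow")
edited = edit_triple(fact, "purple")

sample = CounterfactualSample(
    original_image="banana.jpg",
    edited_image="banana_purple.jpg",
    question="What color is the banana?",
    original_fact=fact,
    edited_fact=edited,
    reasoning=(
        "The question is about the color of a banana.",
        "Prior knowledge says bananas are yellow, but the image shows purple.",
        "The visual evidence takes precedence over the prior.",
    ),
    answer=edited.obj,
)
```

Keeping both the original and the edited triple in the record makes the conflict explicit, which is what the annotated reasoning has to resolve.
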
Chain of Counterfactual Thought (CoCT):
- Function: Teaches LVLMs to perform step-by-step reasoning when encountering knowledge conflicts.
- Mechanism: CoCT is a structured reasoning paradigm that includes several key steps: (a) identifying the knowledge domain involved in the question; (b) checking whether visual evidence conflicts with prior knowledge; (c) if a conflict exists, explicitly discarding the prior and establishing a new reasoning chain based on current evidence; (d) deriving the answer step-by-step. The reasoning chain of CoCT is learned through annotations in the CoBRa dataset. During inference, CoCT is provided to LVLMs as few-shot in-context exemplars, guiding the models to follow the same reasoning pattern.
- Design Motivation: Standard Chain-of-Thought prompting does not include the "check for bias - discard prior" step, thereby failing to effectively address knowledge conflicts. CoCT explicitly models the strategy of "questioning before reasoning".
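
As a rough sketch of how the four CoCT steps could be turned into a few-shot prompt, the snippet below assembles the text part of such a prompt. The wording of the steps is paraphrased from the list above; the function names, the exemplar dictionary keys, and the prompt layout are assumptions, and the image would be passed to the LVLM separately.

```python
# Steps (a)-(d) from the text, paraphrased as instructions.
COCT_STEPS = (
    "Identify the knowledge domain involved in the question.",
    "Check whether the visual evidence conflicts with prior knowledge.",
    "If a conflict exists, discard the prior and reason from the current evidence.",
    "Derive the answer step by step.",
)


def format_exemplar(question: str, reasoning: list[str], answer: str) -> str:
    """Render one worked example as numbered reasoning steps plus an answer."""
    lines = [f"Question: {question}"]
    lines += [f"Step {i}: {step}" for i, step in enumerate(reasoning, start=1)]
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


def build_coct_prompt(exemplars: list[dict], question: str) -> str:
    """Concatenate instructions, few-shot exemplars, and the new question."""
    instructions = "Follow these steps:\n" + "\n".join(
        f"{i}. {step}" for i, step in enumerate(COCT_STEPS, start=1)
    )
    shots = "\n\n".join(
        format_exemplar(e["question"], e["reasoning"], e["answer"])
        for e in exemplars
    )
    return f"{instructions}\n\n{shots}\n\nQuestion: {question}\nStep 1:"


exemplars = [
    {
        "question": "What color is the banana?",
        "reasoning": [
            "The question concerns the color of a fruit.",
            "Bananas are usually yellow, but this one appears purple.",
            "The prior is discarded in favor of the image.",
            "The banana in the image is purple.",
        ],
        "answer": "purple",
    }
]
prompt = build_coct_prompt(exemplars, "What is the knight wearing?")
```

Ending the prompt with `Step 1:` nudges the model to continue in the same step-by-step format as the exemplars.
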
Translation Language Model (TLM):
- Function: Learns to "translate" a standard reasoning process into a counterfactual reasoning process.
- Mechanism: The TLM is trained on the CoBRa dataset. It takes the reasoning process of the original scenario plus the edited knowledge graph/image as input and outputs the reasoning process for the counterfactual scenario. In effect, the TLM learns how to adapt a reasoning chain to a new knowledge setting. The trained TLM can then automatically generate reasoning exemplars for new counterfactual scenarios, which serve as in-context examples for LVLMs. Training uses a joint objective of masked language modeling (MLM) and translation language modeling; note that the acronym TLM refers to both the model and this second objective.
- Design Motivation: Manually writing counterfactual reasoning exemplars is expensive and has limited coverage. TLM automating this process can generate tailored reasoning chains for any new scenario.
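
To make the TLM's input and output concrete, here is a text-only sketch of how one training pair might be serialized. The `[EDIT]` and `[ORIGINAL]` tags, the function name, and the dictionary keys are invented for illustration; the summary does not specify the actual serialization, and the edited image is omitted.

```python
def make_tlm_pair(
    original_reasoning: list[str],
    edited_fact: tuple[str, str, str],
    counterfactual_reasoning: list[str],
) -> dict[str, str]:
    """Build one (source, target) pair for sequence-to-sequence training.

    source: the edited fact followed by the original reasoning chain
    target: the reasoning chain adapted to the counterfactual scenario
    """
    subject, relation, obj = edited_fact
    source = (
        f"[EDIT] {subject} | {relation} | {obj} "
        f"[ORIGINAL] " + " ".join(original_reasoning)
    )
    target = " ".join(counterfactual_reasoning)
    return {"source": source, "target": target}


pair = make_tlm_pair(
    original_reasoning=["Bananas are yellow.", "So the answer is yellow."],
    edited_fact=("banana", "color", "purple"),
    counterfactual_reasoning=[
        "The image shows a purple banana.",
        "So the answer is purple.",
    ],
)
```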

Loss & Training¶

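The TLM is trained with a joint objective that combines MLM and translation language modeling (the objective that shares the TLM acronym). MLM predicts masked words within a single reasoning chain (original or counterfactual), which plays the role of monolingual text. The translation objective learns to map between reasoning chains using aligned pairs of original and counterfactual reasoning, which play the role of bilingual sentence pairs. At inference time, the LVLM is guided by few-shot prompting only and requires no additional fine-tuning.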
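
The difference between the two objectives lies mainly in how training examples are built. The sketch below assumes the commonly used formulation in which the translation objective concatenates an aligned pair and masks tokens on both sides, so a masked token can be recovered from the other chain. The summary does not spell out these details, so the masking rate, the `[MASK]`/`[SEP]` tokens, whitespace tokenization, and all function names are assumptions.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"


def mask_tokens(tokens, rng, mask_prob=0.15, protected=(SEP,)):
    """Replace each non-protected token with MASK with probability mask_prob.

    Returns the corrupted sequence and a parallel list of labels: the
    original token at masked positions, None elsewhere.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if tok not in protected and rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


def mlm_example(chain, rng, mask_prob=0.15):
    """MLM: mask tokens inside a single reasoning chain."""
    return mask_tokens(chain.split(), rng, mask_prob)


def tlm_example(original, counterfactual, rng, mask_prob=0.15):
    """Translation objective: mask tokens across an aligned pair of chains."""
    tokens = original.split() + [SEP] + counterfactual.split()
    return mask_tokens(tokens, rng, mask_prob)


rng = random.Random(0)
corrupted, labels = tlm_example(
    "bananas are yellow", "this banana is purple", rng, mask_prob=0.5
)
```

Under this reading, a joint objective would sum the prediction losses over masked positions from both kinds of examples.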

Key Experimental Results¶

Main Results¶

| Model | Method | CoBRa Accuracy (%) | Standard VQA Accuracy (%) | Bias Impact |
| --- | --- | --- | --- | --- |
| InstructBLIP | Standard Inference | 38.2 | 72.1 | Severe Bias |
| InstructBLIP | CoT | 41.5 | 71.8 | Minor Improvement |
| InstructBLIP | CoCT | 52.3 | 71.5 | Significant Improvement |
| LLaVA | Standard Inference | 42.7 | 74.3 | Severe Bias |
| LLaVA | CoCT | 56.8 | 73.9 | Significant Improvement |
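
The trade-off in the table is easier to see as differences against standard inference. The short script below only re-computes numbers already listed above; the dictionary layout and function names are illustrative.

```python
# (CoBRa accuracy %, standard VQA accuracy %) copied from the table above.
results = {
    ("InstructBLIP", "Standard Inference"): (38.2, 72.1),
    ("InstructBLIP", "CoT"): (41.5, 71.8),
    ("InstructBLIP", "CoCT"): (52.3, 71.5),
    ("LLaVA", "Standard Inference"): (42.7, 74.3),
    ("LLaVA", "CoCT"): (56.8, 73.9),
}


def delta_vs_baseline(model: str, method: str) -> tuple[float, float]:
    """Change in (CoBRa, standard VQA) accuracy relative to standard inference."""
    base_cobra, base_vqa = results[(model, "Standard Inference")]
    cobra, vqa = results[(model, method)]
    return round(cobra - base_cobra, 1), round(vqa - base_vqa, 1)


def bias_gap(model: str, method: str) -> float:
    """Gap between standard VQA and CoBRa accuracy for one configuration."""
    cobra, vqa = results[(model, method)]
    return round(vqa - cobra, 1)


instructblip_coct = delta_vs_baseline("InstructBLIP", "CoCT")
llava_coct = delta_vs_baseline("LLaVA", "CoCT")
```

For both models, CoCT gains 14.1 points on CoBRa while losing at most 0.6 points on standard VQA; for InstructBLIP the gap between the two benchmarks shrinks from 33.9 to 19.2 points.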

Ablation Study¶

| Configuration | CoBRa Accuracy (%) | Description |
| --- | --- | --- |
| CoCT (Full) | 52.3 | Full method |
| w/o TLM (manual exemplars) | 48.1 | TLM automatic generation performs better |
| w/o Knowledge Graph Editing | 44.6 | Only editing images is insufficient to fully expose bias |
| w/o Reasoning Process Annotations | 43.2 | Reasoning chain annotation is the core component |
| Standard CoT | 41.5 | Lacks counterfactual reasoning steps |
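
Ranking the ablations by how much accuracy each one loses relative to the full method makes the relative importance of the components explicit. The snippet re-uses only the numbers in the table; the variable names are illustrative.

```python
FULL = 52.3  # CoBRa accuracy (%) of the full CoCT configuration

ablations = {
    "w/o TLM (manual exemplars)": 48.1,
    "w/o Knowledge Graph Editing": 44.6,
    "w/o Reasoning Process Annotations": 43.2,
    "Standard CoT": 41.5,
}

# Accuracy lost relative to the full method, in percentage points.
drops = {name: round(FULL - acc, 1) for name, acc in ablations.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
```

Among the single-component ablations, removing the reasoning process annotations costs the most (9.1 points), followed by knowledge graph editing (7.7) and the TLM (4.2).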

Key Findings¶

All tested LVLMs exhibit severe knowledge bias on CoBRa: under standard inference, accuracy is 38.2% for InstructBLIP and 42.7% for LLaVA, far below their standard VQA accuracy of 72.1% and 74.3%.
CoCT significantly improves counterfactual reasoning (+14.1 points on CoBRa for both models) while costing at most 0.6 points on standard VQA, indicating that the two capabilities are largely compatible.
The reasoning exemplars generated by the TLM work better than manually written ones (52.3 vs. 48.1 on CoBRa), as the TLM can adaptively generate reasoning chains that better match new scenarios.
Combining knowledge graph editing with image editing performs better than image editing alone (52.3 vs. 44.6 in the ablation).

Highlights & Insights¶

Chain of Counterfactual Thought (CoCT) is a meaningful extension of standard CoT: it adds a metacognitive step of detecting a conflict and then discarding the prior. This mirrors how humans handle counter-intuitive information, and it could transfer to any scenario where a model needs to question its own knowledge.
The concept of TLM as a reasoning chain adapter is novel—instead of directly training the LVLM, an assistant model is trained to generate heuristic exemplars. This "coach model" philosophy can be extended to other tasks requiring customized prompts.
The CoBRa dataset itself is a significant contribution, providing a new dimension for evaluating the robustness of LVLMs.

Limitations & Future Work¶

CoBRa primarily relies on knowledge graphs for editing, which is limited by the coverage of the KG and the variety of editing operations.
The quality of image editing may affect the naturalness of counterfactual scenarios; unnatural edits might introduce additional biases.
As an in-context learning method, the effectiveness of CoCT is constrained by the context window length and the number of exemplars.
Direct fine-tuning of LVLMs using CoCT, rather than using it solely as a prompting strategy, has not been explored.
The method has only been validated on VQA tasks and has not been extended to other vision-language tasks (e.g., image captioning, visual reasoning).

Comparison with Related Work¶

vs VQA-CP (VQA under Changing Priors): VQA-CP focuses on language biases (such as answer distribution shift), while CoBRa focuses on knowledge biases (counterfactual modifications of known facts), making them complementary.
vs Debiasing methods (e.g., causal debiasing): Traditional debiasing methods modify model architectures or training processes, whereas CoCT achieves this through reasoning strategies without requiring re-training the model.
vs Chain-of-Thought: Standard CoT does not include bias detection steps, whereas CoCT adds a metacognitive layer, making it better suited for knowledge conflict scenarios.

Rating¶

Novelty: ⭐⭐⭐⭐ The ideas behind the counterfactual reasoning chain and the CoBRa dataset construction are novel, representing a promising direction.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on multiple LVLMs with reasonable ablation experiments, though the dataset scale could be larger.
Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, and the concept of counterfactual thought is explained in depth.
Value: ⭐⭐⭐⭐ Knowledge bias is a critical issue for LVLMs, and CoBRa provides a standardized evaluation tool for this direction.