V-CECE: Visual Counterfactual Explanations via Conceptual Edits¶
Conference: NeurIPS 2025 | arXiv: 2509.16567 | Code: Project Page | Area: Explainable AI / Counterfactual Explanation / Diffusion Models | Keywords: Counterfactual explanation, concept editing, black-box, knowledge graph, diffusion models
TL;DR¶
V-CECE proposes the first black-box visual counterfactual explanation framework that systematically reveals the explanatory gap between human and neural network semantic understanding. It guarantees edit-set optimality via WordNet knowledge graphs and the Hungarian algorithm, and executes concept-level edits using Stable Diffusion. The key finding is that CNN classifiers are severely misaligned with human semantic reasoning (requiring 5+ edit steps), whereas LVLMs (Claude 3.5 Sonnet) are highly aligned with humans (requiring only 2–3 steps).
Background & Motivation¶
Background: Counterfactual explanation is an important tool in explainable AI—revealing model decision rationale through the lens of "how would the classification change if X were modified?" Existing methods are divided into white-box (requiring gradient access) and black-box (requiring no internal access) approaches.
Limitations of Prior Work: Existing counterfactual image generation methods suffer from three major issues: (1) edits are dispersed and uninterpretable (ACE, DiME, etc. produce pixel-level changes that humans cannot comprehend); (2) over-reliance on training to guide generation (white-box methods require days of training); (3) most critically, all semantic counterfactual methods assume classifiers reason at a human semantic level—an assumption that has never been validated.
Key Challenge: Do humans and neural network classifiers understand semantics in the same way? If not, using human-interpretable concept edits to explain CNN classification decisions is inherently misleading—more dangerous than uninterpretable adversarial edits, as it introduces a false sense of interpretability.
Goal: Two progressive questions: (1) Can the decision process of a classifier be explained using human-level semantics? (2) If so, what is the minimum set of semantic edits required to flip the classification label?
Key Insight: Decompose counterfactual explanation into two independent stages—first compute the optimal semantic edit set on a knowledge graph (model-agnostic), then execute edits using a frozen diffusion model (avoiding training bias), and finally verify effectiveness via classification outcomes.
Core Idea: Use knowledge graphs to guarantee edit optimality and frozen diffusion models to ensure evaluation fairness, thereby systematically measuring the semantic understanding gap between humans and models.
Method¶
Overall Architecture¶
A two-stage pipeline: (1) Explanation stage: Given concept sets for source class \(L\) and target class \(L^*\), solve for the minimum-cost edit set \(E\) (comprising insertions \(I\), deletions \(D\), and substitutions \(S\)) on the WordNet knowledge graph using the Hungarian algorithm. (2) Generation stage: Order edits according to one of three strategies, localize target regions using GroundingDINO+SAM, execute edits using Stable Diffusion v1.5 Inpainting, and check after each step whether the classifier flips its label.
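The generation-stage loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the "image" is reduced to a set of concept names, and `apply_edit`, `classifier`, and the edit tuples are hypothetical stand-ins for the SAM-masked inpainting and the black-box classifier.

```python
def counterfactual_explain(image, edits, classifier, apply_edit, target_label):
    """Apply the pre-computed, pre-ordered edit set one step at a time,
    stopping as soon as the classifier flips to the target label."""
    applied = []
    for edit in edits:
        image = apply_edit(image, edit)        # e.g. mask a concept and inpaint
        applied.append(edit)
        if classifier(image) == target_label:  # check after every single edit
            return image, applied              # early stop: label flipped
    return None, applied                       # edit set exhausted, no flip

# Toy stand-ins: an "image" is just a set of concepts.
def apply_edit(concepts, edit):
    op, concept = edit
    return (concepts | {concept}) if op == "insert" else (concepts - {concept})

def classifier(concepts):
    return "Stop" if "red light" in concepts else "Move"

img, applied = counterfactual_explain(
    {"red light", "car"},
    [("delete", "red light"), ("insert", "green light")],
    classifier, apply_edit, target_label="Move",
)
# One edit suffices here; the second is never executed.
```

The per-step flip check is what yields the Avg|E| metric reported later: the number of edits actually applied before the label changes, not the size of the full optimal edit set.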
Key Designs¶
- Optimal Edit Guarantee (Knowledge Graph + Hungarian Algorithm):
- Function: Compute the minimum semantic edit set from class \(L\) to class \(L^*\).
- Mechanism: Substitution cost = shortest-path distance between two concepts on WordNet, computed via Dijkstra's algorithm. The problem is formulated as a bipartite graph matching—source and target concepts as nodes on each side, edge weights as substitution costs, with virtual nodes added to model insertions/deletions (cost = distance to root node). The Hungarian algorithm solves for the minimum-weight matching with time complexity \(O(mn\log n)\).
- Design Motivation: Provides a deterministic optimality guarantee, unlike the heuristic edit selection of prior methods. Non-executable edits can be automatically excluded by assigning infinite cost.
- Three Edit Ordering Strategies:
- Function: Determine execution order within the optimal edit set \(E\) to flip the label as early as possible.
- Mechanism: (1) Local Edits: An LVLM observes the current image and remaining edits at each step and selects the next operation (image updated each step to prevent logical inconsistencies). (2) Global Edits: Aggregates edit frequencies across all images and ranks edits by an Importance score—defined as \((|I_{s_j^*}| - |D_{s_i}| + |S_{s_i \to s_j^*}| - |S_{s_j^* \to s_i}|) / |e \in E|\)—to capture systematic classifier biases. (3) Local-Global: Selects a local edit subset for a specific image and orders it by global importance.
- Design Motivation: Local leverages image context but ignores classifier bias; Global captures bias but ignores scene details; Local-Global balances both.
- Frozen Diffusion Model for Edit Execution:
- Function: Execute concept-level image edits while maintaining fair evaluation.
- Mechanism: Uses Stable Diffusion v1.5 Inpainting (frozen weights, zero training), DPM++ 2M SDE sampler with 40 steps, GroundingDINO+SAM to generate concept masks, and inpaints only within masked regions. An LVLM (Claude 3.5 Sonnet) determines optimal placement and background fill.
- Design Motivation: Deliberately avoids fine-tuning the diffusion model on the target dataset, as training would introduce data bias and yield spuriously favorable counterfactual images. A frozen model ensures consistent bias, so differences in evaluation results reflect solely classifier behavior.
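The bipartite matching with virtual nodes can be sketched with SciPy's `linear_sum_assignment` (a Hungarian-family solver). This is an illustrative reading of the mechanism, not the authors' code: the cost functions are placeholders (in the paper, `sub_cost` is the WordNet shortest-path distance via Dijkstra, and deletion/insertion cost is the concept's distance to the root).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def min_edit_set(src, tgt, sub_cost, del_cost, ins_cost):
    """Minimum-cost edit set from source to target concepts via
    bipartite matching on a cost matrix padded with virtual nodes."""
    m, n = len(src), len(tgt)
    size = m + n                          # virtual nodes on both sides
    C = np.zeros((size, size))            # virtual-virtual pairs cost 0
    for i, a in enumerate(src):
        for j, b in enumerate(tgt):
            C[i, j] = sub_cost(a, b)      # substitution cost
        C[i, n:] = del_cost(a)            # src matched to virtual -> deletion
    for j, b in enumerate(tgt):
        C[m:, j] = ins_cost(b)            # virtual matched to tgt -> insertion
    rows, cols = linear_sum_assignment(C) # minimum-weight perfect matching
    edits = []
    for i, j in zip(rows, cols):
        if i < m and j < n and src[i] != tgt[j]:
            edits.append(("substitute", src[i], tgt[j]))
        elif i < m and j >= n:
            edits.append(("delete", src[i]))
        elif i >= m and j < n:
            edits.append(("insert", tgt[j]))
    return edits

# Toy costs: identical concepts cost 0, related concepts 2, unrelated 10.
def sub_cost(a, b):
    if a == b:
        return 0
    return 2 if ("light" in a and "light" in b) else 10

edits = min_edit_set(
    ["car", "red light"], ["car", "green light"],
    sub_cost, del_cost=lambda a: 3, ins_cost=lambda b: 3,
)
# Substituting red light -> green light (cost 2) beats
# deleting it and inserting the other (cost 3 + 3).
```

Setting a cell to `np.inf` excludes a non-executable edit, exactly as described above.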
Loss & Training¶
No training—V-CECE is a fully plug-and-play framework. All modules (classifier, diffusion model, LVLM) are used in a black-box manner with zero training.
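As a concrete illustration of the Global Edits Importance score defined above, the toy computation below applies one reading of the formula to made-up edit counts; the function name and all numbers are hypothetical, not taken from the paper.

```python
def importance(n_insert, n_delete, n_sub_to, n_sub_from, n_total_edits):
    """Importance of a concept pair, aggregated over all images:
    (|I| - |D| + |S_to| - |S_from|) / |E|,
    i.e. edits introducing the target concept count positively,
    edits removing or reversing it count negatively."""
    return (n_insert - n_delete + n_sub_to - n_sub_from) / n_total_edits

# Suppose, across the dataset, "green light" is inserted 30 times,
# deleted 2 times, substituted in 10 times, substituted away 3 times,
# out of 100 total edits:
score = importance(30, 2, 10, 3, 100)  # -> 0.35
```

The Global strategy then executes edits in descending order of this score, so systematic classifier biases are tried first.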
Key Experimental Results¶
Main Results¶
BDD100K (autonomous driving scene classification, Stop/Move):
| Method | FID↓ | CMMD↓ | SR↑ | Avg \|E\|↓ | Training |
|---|---|---|---|---|---|
| ACE l1 (white-box) | 1.02 | - | 99.9% | - | Days |
| TIME (black-box) | 51.5 | - | 81.8% | - | Hours |
| V-CECE+DenseNet Local | 90.42 | 1.101 | 88.9% | 4.77 | N/A |
| V-CECE+Claude3.5 Global | 45.22 | 0.427 | 97.8% | 2.65 | N/A |
| V-CECE+Claude3.5 L-G | 42.76 | 0.364 | 98.1% | 2.44 | N/A |
Ablation Study¶
Human evaluation—edit steps required by the model vs. steps deemed reasonable by humans:
| Classifier | Avg \|E\| (Model) | Avg \|E\| (Human) | Visual Correctness (%) |
|---|---|---|---|
| DenseNet | 5.22 | 2.21 | 59.71 |
| ConvNext | 7.35 | 2.27 | 34.24 |
| EfficientNet | 5.96 | 2.66 | 30.17 |
| Claude 3 Haiku | 2.91 | 1.88 | 69.58 |
| Claude 3.5 Sonnet | 2.19 | 1.33 | 81.20 |
| Claude 3.7 Sonnet | 2.50 | 1.37 | 79.98 |
Key Findings¶
- CNNs exhibit a significant semantic gap with humans: DenseNet requires 5.22 edit steps to flip the label, whereas humans consider 2.21 steps sufficient. Moreover, only 59.7% of flipped DenseNet images are judged visually correct (roughly 40% exhibit visual artifacts), indicating that the classifier responds to pixel-distribution shifts rather than semantic changes.
- LVLMs are highly aligned with humans: Claude 3.5 Sonnet requires only 2.19 steps (close to the human baseline of 1.33 steps), with 81.2% of images visually correct, demonstrating near-human semantic understanding.
- CNN decisions lack consistency: The highest-importance concept scores only 0.16–0.23, and 35–55 important concepts are identified, indicating no consistent semantic dependency pattern. LVLMs show the opposite: top concept importance reaches 0.37–0.40, with only 27–31 important concepts.
- Chain-of-thought reasoning is counterproductive: Enabling thinking in Claude 3.7 increases the required edit steps (3.78 vs. 3.03) and worsens FID, corroborating prior findings that CoT can be detrimental for visual tasks.
Highlights & Insights¶
- Precise problem formulation: Decomposing counterfactual explanation into "semantic alignment verification" and "minimum edit computation" as two progressive questions is more fundamental than prior work that addresses only the second step. If a classifier does not reason at the semantic level, explaining it with semantic counterfactuals is misleading.
- Frozen models ensure evaluation fairness: The deliberate choice not to fine-tune the diffusion model is elegant—it prevents data bias from contaminating evaluation results, enabling genuine comparisons of semantic understanding across classifiers.
- LVLM-as-classifier as a new paradigm: Using LVLMs as classifiers is not only feasible but yields high semantic alignment, offering a new paradigm for explaining opaque commercial models.
Limitations & Future Work¶
- Limited scale of human evaluation: The current human study is small in scale, limiting statistical power and precision; results should be treated as preliminary insights.
- Generation quality constraints of diffusion models: Visual artifacts accumulating after multiple Stable Diffusion v1.5 inpainting steps are unavoidable, potentially conflating "classifier semantic misalignment" with "classification changes caused by image quality degradation."
- Semantic granularity of the knowledge graph: WordNet has fixed and incomplete concept granularity, potentially missing visually important concepts that are relevant to classification.
- Evaluation limited to BDD100K and Visual Genome: Generalizability to high-stakes domains such as medical imaging requires further validation.
- Future directions: incorporating white-box generative models for comparison, scaling human evaluation with inter-annotator agreement assessment, and testing next-generation diffusion models (SD3/Flux).
Related Work & Insights¶
- vs. ACE/DiME (white-box counterfactuals): White-box methods achieve SR up to 99.9% but require gradients and training, producing uninterpretable edits; V-CECE is black-box and training-free, with human-understandable edits.
- vs. Dervakos/Dimitriou (semantic counterfactuals): Prior work requires 12+ edit steps and does not generate images; V-CECE requires only 2–3 steps and produces visualizations.
- vs. TIME (black-box counterfactuals): TIME requires training and does not provide semantic edits; V-CECE with an LVLM classifier achieves superior FID and SR compared to TIME.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First systematic revelation of the semantic understanding gap between humans and models; the problem formulation is highly valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers diverse classifiers (CNNs, ViTs, LVLMs) and includes human evaluation, though the human study scale is limited.
- Writing Quality: ⭐⭐⭐⭐ Logically clear, with precise problem motivation and in-depth experimental analysis.
- Value: ⭐⭐⭐⭐⭐ The revealed explanatory gap carries fundamental significance for the XAI field and reshapes the understanding of counterfactual explanation.