ViUniT: Visual Unit Tests for More Robust Visual Programming¶

Conference: CVPR 2025
arXiv: 2412.08859
Code: Project Page
Area: Image Generation / Visual Reasoning
Keywords: Visual Programming, Unit Testing, Program Selection, Visual Question Answering, Reinforcement Learning

TL;DR¶

ViUniT is a framework for automatically generating visual unit tests. An LLM generates image descriptions and expected answers, a text-to-image model renders the corresponding test images, and the resulting tests check the logical correctness of visual programs. With ViUniT, 7B open-source models surpass gpt-4o-mini, and "right-for-the-wrong-reason" programs are reduced by 40%.

Background & Motivation¶

Visual programming tackles compositional reasoning tasks by generating executable programs that call expert systems, but it suffers from severe reliability issues. On benchmark datasets, the underlying program is incorrect 33% of the time even when the model produces the correct answer. Such programs are right for the wrong reason and can fail unexpectedly on novel data.

Unit testing is a fundamental software engineering practice for ensuring code correctness, but its application to visual programming has been limited for three reasons: (1) existing methods only verify the type of the return value (e.g., whether the output is a yes/no answer) rather than logical correctness; (2) visual unit tests require image-answer pairs, which are expensive to construct; (3) it is unclear how best to use test signals to improve model behavior.

Core Idea of ViUniT: A completely unsupervised visual unit testing suite is constructed by utilizing LLMs to generate descriptions of test scenarios and expected answers, and employing diffusion models to synthesize corresponding images.

Method¶

Overall Architecture¶

The ViUniT framework consists of: (1) candidate unit test generation—using LLMs to create image description and expected answer pairs; (2) coverage-aware sampling—selecting a subset of tests that maximizes coverage; (3) image generation—converting descriptions into images using diffusion models; and (4) program evaluation and selection—executing candidate programs on unit tests and selecting the best one. Four utilization strategies are explored: optimal program selection, answer rejection, re-prompting, and unsupervised RL rewards.
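
The four stages above can be wired together roughly as follows. This is a minimal sketch rather than the authors' implementation: `UnitTest`, `run_viunit`, and the three callables passed in (`generate_candidates`, `sample_tests`, `render_image`) are hypothetical names standing in for the LLM test generator, the coverage-aware sampler, and the text-to-image model. Programs are assumed to be callables that map an image to an answer string, and the score is assumed to be the fraction of tests passed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class UnitTest:
    description: str      # text description later rendered into a test image
    expected_answer: str  # answer a correct program should return on that image


def run_viunit(
    query: str,
    programs: Sequence[Callable[[object], str]],
    generate_candidates: Callable[[str], List[UnitTest]],
    sample_tests: Callable[[List[UnitTest], int], List[UnitTest]],
    render_image: Callable[[str], object],
    k: int = 5,
) -> Tuple[int, List[float]]:
    """Return the index of the best program and every program's pass rate."""
    candidates = generate_candidates(query)   # stage 1: LLM proposes tests
    selected = sample_tests(candidates, k)    # stage 2: coverage-aware subset
    images = [render_image(t.description) for t in selected]  # stage 3

    def pass_rate(program: Callable[[object], str]) -> float:
        if not selected:
            return 0.0
        passed = sum(
            program(img) == t.expected_answer for img, t in zip(images, selected)
        )
        return passed / len(selected)

    scores = [pass_rate(p) for p in programs]  # stage 4: execute and score
    best = max(range(len(programs)), key=lambda i: scores[i])
    return best, scores


# Toy stand-ins: "images" are simply the description strings.
tests = [UnitTest("a red car", "yes"), UnitTest("a blue car", "no")]
programs = [
    lambda img: "yes" if "red" in img else "no",  # checks the queried property
    lambda img: "yes",                            # always answers "yes"
]
best, scores = run_viunit(
    "Is the car red?",
    programs,
    generate_candidates=lambda q: tests,
    sample_tests=lambda cands, k: cands[:k],
    render_image=lambda desc: desc,
)
```

In the toy run, the constant-"yes" program fails the "blue car" test, so the first program is selected even though both would answer a red-car image correctly.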

Key Designs¶

Key Design 1: Coverage-Aware Unit Test Sampling¶

Function: Select the most diagnostic subset of tests from the candidate test set to maximize coverage.
Mechanism: Sampling proceeds in two steps. First, answer coverage ensures that at least one test is included for each possible answer \(y \in Y\) (Eq. 2). Next, input coverage \(\sigma_V(\mathcal{T}_K)\), measured through the CLIP embedding distance between image descriptions, is maximized (Eq. 3-4) by iteratively adding the test furthest in feature distance from the already selected set. The entire procedure operates on text to reduce computational overhead, and images are generated only for the final \(K\) selected tests.
Design Motivation: Utilizing all candidate tests directly is computationally expensive, whereas random sampling can lead to imbalanced answers or redundant inputs. The "answer-then-input" coverage strategy consistently improves GQA performance as the number of tests increases.
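
The "answer-then-input" strategy can be sketched as a greedy selection over precomputed description embeddings. The sketch below makes several assumptions that the summary does not pin down: the function name `coverage_aware_sample` is hypothetical, the first candidate seen for each answer is used for answer coverage, cosine distance is used as the feature distance, and "furthest from the selected set" is read as maximizing the minimum distance to any already selected test. Embeddings are plain lists of floats with non-zero norm; in practice they would come from a text encoder.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors have non-zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def coverage_aware_sample(
    answers: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of up to k candidate tests."""
    selected: List[int] = []

    # Step 1 (answer coverage): keep one test per distinct expected answer.
    seen = set()
    for i, answer in enumerate(answers):
        if answer not in seen and len(selected) < k:
            seen.add(answer)
            selected.append(i)

    # Step 2 (input coverage): greedily add the candidate whose description
    # embedding is furthest from everything selected so far.
    remaining = [i for i in range(len(answers)) if i not in selected]
    while len(selected) < k and remaining:
        best = max(
            remaining,
            key=lambda i: min(
                cosine_distance(embeddings[i], embeddings[j]) for j in selected
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: candidates 0 and 1 are near-duplicates, 3 points the other way.
answers = ["yes", "yes", "no", "yes"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = coverage_aware_sample(answers, embeddings, k=3)
```

Here the near-duplicate candidate 1 is skipped in favor of candidate 3, which illustrates why random sampling can waste the test budget on redundant inputs.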

Key Design 2: Program-Agnostic Test Generation¶

Function: Utilize LLMs to generate implementation-agnostic test cases, avoiding overfitting to specific program implementations.
Mechanism: The unit test generator \(\psi\) takes an input query \(q\) (optionally including program \(p\)) and uses Llama-3-8B-Instruct to generate \(M\) candidate tests, where each test is an (image description \(c_i\), expected answer \(y_i\)) pair. Experiments indicate that generating tests without program implementation details performs significantly better under higher counts of tests and programs.
Design Motivation: Adhering to software engineering best practices, unit tests should be independent of concrete implementations. While including program details can be helpful with fewer tests, it becomes detrimental at scale.
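
The interface of the generator \(\psi\) can be illustrated with a small prompt builder and parser. Everything in this sketch is an assumption made for illustration: the prompt wording, the JSON output format, and the names `build_prompt` and `parse_tests` are hypothetical and are not taken from the paper. The only point it shows is that the program \(p\) is an optional input, so the same generator covers both the program-agnostic and the program-aware setting.

```python
import json
from typing import List, Optional, Tuple


def build_prompt(query: str, m: int, program: Optional[str] = None) -> str:
    """Assemble a (hypothetical) test-generation prompt for query q."""
    parts = [
        f"Question: {query}",
        f"Propose {m} test cases as a JSON list of objects with keys "
        '"description" (an image description) and "answer" (the expected answer).',
    ]
    if program is not None:
        # Program-aware variant: expose the implementation to the generator.
        parts.insert(1, f"Program:\n{program}")
    return "\n".join(parts)


def parse_tests(llm_output: str) -> List[Tuple[str, str]]:
    """Turn the assumed JSON output into (description c_i, answer y_i) pairs."""
    return [(item["description"], item["answer"]) for item in json.loads(llm_output)]


agnostic_prompt = build_prompt("Is the car red?", m=2)
aware_prompt = build_prompt("Is the car red?", m=2, program="def execute(image): ...")
parsed = parse_tests(
    '[{"description": "a red car on a street", "answer": "yes"},'
    ' {"description": "a blue car on a street", "answer": "no"}]'
)
```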

Key Design 3: Four Test Utilization Strategies¶

Function: Convert unit test signals into distinct mechanisms for model improvement.
Mechanism: (a) Optimal Program Selection: \(p^* = \arg\max_{p \in \mathcal{P}} S(p)\) chooses the program that passes the most tests; (b) Answer Rejection: falls back to an end-to-end model if \(S(p^*) < \theta\); (c) Re-prompting: uses test feedback \(\mathcal{F}\) (discrepancies between descriptions, expected answers, and actual outputs) as context to guide LLMs to generate improved programs; (d) Unsupervised RL Reward: \(R_{\text{ViUnit}}(v,p)\) replaces ground-truth-dependent correctness rewards, providing finer-grained feedback via a threshold \(S(p) \geq \theta\).
Design Motivation: Different scenarios require different utilization mechanisms—program selection is the simplest and most effective; RL rewards address the issue of reinforcing "right-for-the-wrong-reason" programs.
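
Strategies (a) to (c) can be sketched on top of per-test execution records. This is an illustrative sketch under stated assumptions: \(S(p)\) is taken to be the fraction of passed tests, ties in the arg-max go to the first program, and the feedback text format is invented here. The names `score`, `select_or_reject`, and `feedback` are hypothetical.

```python
from typing import List, Optional, Tuple

# One record per executed test: (description, expected answer, actual output).
TestResult = Tuple[str, str, str]


def score(results: List[TestResult]) -> float:
    """S(p): fraction of unit tests whose output matches the expected answer."""
    if not results:
        return 0.0
    return sum(expected == actual for _, expected, actual in results) / len(results)


def select_or_reject(all_results: List[List[TestResult]], theta: float) -> Optional[int]:
    """(a) pick the arg-max program; (b) return None when S(p*) < theta."""
    scores = [score(r) for r in all_results]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= theta else None


def feedback(results: List[TestResult]) -> str:
    """(c) summarize failed tests as context for re-prompting the LLM."""
    lines = [
        f"Image: {desc} | expected: {expected} | got: {actual}"
        for desc, expected, actual in results
        if expected != actual
    ]
    return "\n".join(lines)


results_a = [("a red car", "yes", "yes"), ("a blue car", "no", "yes")]
results_b = [("a red car", "yes", "yes"), ("a blue car", "no", "no")]

chosen = select_or_reject([results_a, results_b], theta=0.7)
rejected = select_or_reject([results_a], theta=0.7)
hint = feedback(results_a)
```

A `None` result signals that the caller should fall back to an end-to-end model, and `hint` is the kind of discrepancy text that could be appended to the prompt when regenerating a program.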

Loss & Training¶

RL training employs a reward-weighted negative log-likelihood loss: \(J(w) = \mathbb{E}_{(v,q,p,y) \sim D}[R(v,p,y) L_{\text{NLL}}(p,q;w)]\).
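
A minimal numeric sketch of this objective with the unsupervised reward plugged in is given below. It assumes a binary reward (1 when \(S(p) \geq \theta\), otherwise 0), which is one plausible reading of the threshold described above and may differ from the paper's exact definition. It also assumes that \(L_{\text{NLL}}\) is the negative sum of the program's token log-probabilities and that the expectation is approximated by a batch mean. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def viunit_reward(unit_test_score: float, theta: float) -> float:
    """Assumed binary reward: 1 if S(p) >= theta, else 0."""
    return 1.0 if unit_test_score >= theta else 0.0


def nll(token_logprobs: Sequence[float]) -> float:
    """Negative log-likelihood of a program given its token log-probs."""
    return -sum(token_logprobs)


def reward_weighted_nll(
    batch: List[Tuple[float, Sequence[float]]], theta: float
) -> float:
    """Batch estimate of J(w) = E[R * L_NLL]; each item is (S(p), log-probs)."""
    total = sum(viunit_reward(s, theta) * nll(lp) for s, lp in batch)
    return total / len(batch)


# Two sampled programs: the first passes most tests, the second does not.
batch = [(0.8, [-0.1, -0.2]), (0.2, [-1.0, -2.0])]
loss = reward_weighted_nll(batch, theta=0.7)
```

Only the first program contributes to the loss, so gradient steps raise the likelihood of programs that pass their unit tests, with no ground-truth answer involved.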

Key Experimental Results¶

Main Results: Optimal Program Selection Performance¶

LLM	#Prog	#UT	GQA	Winoground	SugarCREPE	Avg
gpt-4o-mini	1	0	42.03	44.98	38.75	41.92
CodeLlama-7B	5	5	49.00	54.38	46.79	50.06
CodeGemma-7B	5	5	48.71	50.63	48.57	49.30
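
As a quick sanity check, the Avg column and the gap between the 7B models and gpt-4o-mini can be recomputed from the per-benchmark numbers in the table. The snippet only uses the values shown above; averaging the two 7B models with equal weight is an assumption.

```python
rows = {
    "gpt-4o-mini": [42.03, 44.98, 38.75],
    "CodeLlama-7B": [49.00, 54.38, 46.79],
    "CodeGemma-7B": [48.71, 50.63, 48.57],
}

# Mean over GQA, Winoground, and SugarCREPE, rounded like the table.
avg = {name: round(sum(vals) / len(vals), 2) for name, vals in rows.items()}

# Equal-weight mean of the two 7B models, then the gap to gpt-4o-mini.
open_source_mean = (avg["CodeLlama-7B"] + avg["CodeGemma-7B"]) / 2
gap = round(open_source_mean - avg["gpt-4o-mini"], 2)
```

The recomputed averages match the Avg column, and the gap computed from these rounded entries is about 7.8 points, close to the 7.7-point figure reported in the findings.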

Ablation Study: RL Reward Design¶

Reward Type	Performance	Notes
Correctness (Supervised)	Baseline	Requires GT labels
ViUniT (Unsupervised)	+1.3 points avg vs. baseline	No GT labels required

Key Findings¶

ViUniT improves the accuracy of frozen LLMs by 11.4%.
On average, 7B open-source models outperform gpt-4o-mini by 7.7 percentage points.
Reduces "right-for-the-wrong-reason" programs by 40%.
Diminishing returns observed beyond 5 tests and 5 programs.
The unsupervised ViUniT RL reward outperforms the supervised correctness reward by 1.3 percentage points.
Re-prompting yields a 3%+ improvement over regeneration without feedback.

Highlights & Insights¶

Software Engineering Mindset in Visual AI: Transfers the concept of unit testing to visual programming, using LLMs together with diffusion models to construct tests automatically.
Unsupervised Reward Outperforms Supervised Reward: The ViUniT RL reward surpasses GT-based correctness rewards without requiring any ground-truth labels, because it penalizes "right-for-the-wrong-reason" programs that a correctness reward would reinforce.
Coverage Maximization is Crucial: The paper systematically analyzes configurations for test generation, sampling, and image synthesis, and coverage-aware sampling is the variant whose gains on GQA keep growing as the number of tests increases.

Limitations & Future Work¶

Image generation quality impacts test reliability, and the synthesis of spatial relationships remains limited.
The execution of unit tests increases inference execution time.
The current framework depends on external expert models (e.g., detectors, VQA models).
Future work could explore extending ViUniT to more sophisticated visual reasoning tasks.

Related Work & Insights¶

ViperGPT: Pioneering work in visual programming, upon which this study's API is built.
Khan et al.: Employs RL training with correctness rewards, but risks reinforcing "right-for-the-wrong-reason" programs.
Hu et al.: Uses GT answers as a proxy for correctness to select the best program, whereas ViUniT does not require GT labels.
Insight: Automated testing can serve not only as a tool for evaluating program quality but also as an unsupervised training signal.

Rating¶

⭐⭐⭐⭐ — Innovatively applies the software engineering concept of unit testing to visual programming with a systematic and rigorous design. The four utilization strategies span various scenarios, from inference to training. The 11.4% improvement and outperformance of gpt-4o-mini are highly impressive.