VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

Conference: CVPR 2026
arXiv: 2601.12781
Code: https://github.com/ml-postech/VIRO-neuro-symbolic-reasoning-with-verification
Area: Interpretability
Keywords: Referring Expression Comprehension, Neuro-Symbolic Reasoning, Operator-Level Verification, Zero-Shot Learning, Target-Absent Detection

TL;DR

VIRO embeds lightweight operator-level verification mechanisms (CLIP uncertainty verification + spatial logic verification) into a neuro-symbolic REC pipeline, enabling each reasoning step to self-verify and terminate early when no target exists. Under a zero-shot setting, it achieves 61.1% balanced accuracy, substantially outperforming compositional reasoning baselines, while maintaining a program failure rate below 0.3% and efficient inference speed.

Background & Motivation

Background: Referring Expression Comprehension (REC) aims to localize a target region in an image given a natural language description. Recent neuro-symbolic methods based on LLMs and VLMs decompose queries into structured programs and execute them step by step, enabling interpretable reasoning and strong zero-shot generalization.
Limitations of Prior Work: Existing compositional reasoning pipelines assume that every intermediate result is correct; however, open-vocabulary detectors (OVDs) frequently produce high-confidence false positives (FPs). These errors cascade along the reasoning chain and are particularly severe in target-absent scenarios—where the system is forced to select a false positive as the final answer (the "forced prediction" problem).
Key Challenge: Existing methods lack verification mechanisms for intermediate reasoning steps. On one hand, candidate boxes generated by OVDs may be visually or semantically similar false positives; on the other hand, spatial relation reasoning may still produce outputs even when constraints are not satisfied. Furthermore, many systems place large multimodal LLMs inside the reasoning loop, causing significant latency, with program generation and execution tightly coupled such that a new reasoning program must be generated for every image.
Goal: (a) How can verification be embedded within reasoning steps to prevent cascading errors? (b) How can the system correctly "abstain" in target-absent scenarios rather than producing forced predictions? (c) How can accuracy be maintained while improving efficiency and scalability?
Key Insight: Lightweight verification modules can be integrated inside each reasoning operator: CLIP-based uncertainty verification filters OVD false positives, and geometric logic verification checks whether spatial relations hold.
Core Idea: Within the neuro-symbolic reasoning pipeline, each operator follows an "execute-then-verify" paradigm: if verification fails, an empty set is returned and the pipeline terminates early, enabling robust target-absent detection.

Method

Overall Architecture

VIRO adopts a two-stage decoupled pipeline: (1) Pre-execution stage: an LLM translates the natural language query into a symbolic program \(P = (o_1, o_2, \dots, o_T)\) composed of Verified Reasoning Operators (VROs), followed by a program verifier that ensures syntactic correctness; (2) Execution stage: a program interpreter executes the operator sequence over the image step by step, where each operator performs self-verification after execution—returning an empty set and immediately terminating the entire pipeline if verification fails. The output is either a localized bounding box or an empty set (indicating target absence).

Key Designs

Verified Reasoning Operators (VROs):
- Function: A finite set of operators serves as the basic building blocks of reasoning, covering four categories—recognition operators (FIND, PROPERTY), absolute spatial operators (LOCATE, SIZE, ORDER), relative spatial operators (FIND_DIRECTION, FIND_NEAR, FIND_INSIDE), and a termination operator (RESULT).
- Mechanism: Each operator not only executes a reasoning action but also self-verifies its output. If the verification condition is not satisfied, the operator returns \(\varnothing\), triggering early termination of the entire pipeline. This formalizes the REC output as either \(Y = B\) (target present) or \(Y = \varnothing\) (target absent).
- Design Motivation: By performing verification at the operator level rather than at the end of the pipeline, cascading errors are blocked at the earliest point of failure, while enabling efficient early exit.
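
The execute-then-verify loop with early termination can be sketched as follows. This is a minimal illustration, not the authors' interpreter: the `Operator` interface, `run_program`, and the toy operators are hypothetical names introduced here, and `None` stands in for the empty set \(\varnothing\).

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
# Hypothetical operator interface: candidates in, verified candidates out.
Operator = Callable[[List[Box]], List[Box]]


def run_program(operators: List[Operator]) -> Optional[Box]:
    """Execute operators in order; None stands for the empty set (target absent)."""
    candidates: List[Box] = []
    for op in operators:
        candidates = op(candidates)
        if not candidates:  # verification failed, so stop before later operators run
            return None
    return candidates[0]


calls: List[str] = []  # records which toy operators actually ran


def toy_find(_: List[Box]) -> List[Box]:
    calls.append("FIND")
    return [(10.0, 10.0, 40.0, 40.0), (60.0, 10.0, 90.0, 40.0)]


def toy_reject_all(boxes: List[Box]) -> List[Box]:
    calls.append("VERIFY")
    return []  # simulates a failed verification


def toy_result(boxes: List[Box]) -> List[Box]:
    calls.append("RESULT")
    return boxes[:1]


present = run_program([toy_find, toy_result])
calls.clear()
absent = run_program([toy_find, toy_reject_all, toy_result])
print(present, absent, calls)
```

In the second run the toy `RESULT` operator is never called, which is the early-exit behavior the operator-level design is meant to provide.
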
Uncertainty Verification (UV) — within the FIND operator:
- Function: Filters high-confidence false positive candidates produced by OVDs.
- Mechanism: For each OVD candidate box \(B_j\), the corresponding image region \(I_j\) is cropped, and \(K\) common categories are predefined as negative anchors. A CLIP verification score is computed as \(S(l|I_j) = \frac{1}{K}\sum_{k=1}^K \frac{\exp(\text{sim}(I_j, l)/\tau)}{\exp(\text{sim}(I_j, l)/\tau) + \exp(\text{sim}(I_j, c_k)/\tau)}\), representing the average binary classification probability of the candidate label against the negative anchors. Candidates with scores below threshold \(\delta_l\) are filtered. To account for CLIP's intrinsic bias across different labels, per-label threshold calibration is performed using ImageNet.
- Design Motivation: OVDs (e.g., GroundingDINO) tend to produce high-confidence detections for visually or semantically similar but incorrect targets in open-vocabulary settings. Using CLIP as a binary discriminator incurs minimal computational overhead while effectively filtering such false positives.
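
The verification score above can be computed directly from precomputed similarities. The sketch below assumes the CLIP similarities \(\text{sim}(I_j, l)\) and \(\text{sim}(I_j, c_k)\) are already available as floats; the function names, the temperature, and the threshold values are illustrative assumptions, not values from the paper. Each term is written as a sigmoid of the similarity gap, which is algebraically the same as the ratio in the formula.

```python
import math
from typing import List, Sequence


def verification_score(sim_label: float, sim_negatives: Sequence[float], tau: float) -> float:
    """Average pairwise probability of the label against each negative anchor.

    exp(a/tau) / (exp(a/tau) + exp(b/tau)) equals 1 / (1 + exp(-(a - b)/tau)).
    """
    probs = [1.0 / (1.0 + math.exp(-(sim_label - s) / tau)) for s in sim_negatives]
    return sum(probs) / len(probs)


def filter_candidates(scores: Sequence[float], delta: float) -> List[int]:
    """Return indices of candidates whose score is not below the threshold delta."""
    return [j for j, s in enumerate(scores) if s >= delta]


# Hypothetical similarities for one cropped region and K = 2 negative anchors.
score = verification_score(0.35, [0.20, 0.25], tau=0.05)
kept = filter_candidates([0.9, 0.4, 0.6], delta=0.6)
print(score, kept)
```

A per-label threshold, as described above, would replace the single `delta` with a lookup keyed by the label.
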
Logic Verification (LV) — within the FIND_DIRECTION operator:
- Function: Verifies whether candidate targets genuinely satisfy the specified spatial relation constraints.
- Mechanism: Geometric tests are applied to all input candidates to check whether each target candidate satisfies the specified directional spatial relation relative to a reference target. Candidates failing the test cause the operator to return an empty set.
- Design Motivation: During spatial reasoning steps, even if the preceding FIND operator passes verification, the spatial relation may not actually hold (e.g., "the person to the left of the elephant" when no person appears to the left). Logic verification provides an additional filtering stage to eliminate such errors.
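
A geometric test of this kind might look like the sketch below. The source does not specify the exact test, so two assumptions are made here: directions are decided by comparing box centers, and candidates that fail are dropped, with an empty list returned when none pass. The function name mirrors the operator but the signature is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), image y grows downward


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def find_direction(candidates: List[Box], reference: Box, direction: str) -> List[Box]:
    """Keep candidates whose center lies in the given direction from the reference center."""
    ref_x, ref_y = center(reference)
    tests: Dict[str, Callable[[Tuple[float, float]], bool]] = {
        "left": lambda c: c[0] < ref_x,
        "right": lambda c: c[0] > ref_x,
        "above": lambda c: c[1] < ref_y,
        "below": lambda c: c[1] > ref_y,
    }
    test = tests[direction]
    return [b for b in candidates if test(center(b))]


elephant = (100.0, 50.0, 200.0, 150.0)
person_on_right = (220.0, 60.0, 260.0, 140.0)
person_on_left = (10.0, 60.0, 50.0, 140.0)

# "the person to the left of the elephant" with only a person on the right
print(find_direction([person_on_right], elephant, "left"))
print(find_direction([person_on_left, person_on_right], elephant, "left"))
```
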
Program Generation and Verification:
- Function: Reliably converts natural language queries into executable symbolic programs.
- Mechanism: An LLM (Qwen2.5-72B-Instruct-AWQ) generates programs via few-shot prompting as \(P = \text{LLM}(Q|m)\), followed by a program verifier that checks syntactic correctness. If verification fails, concise diagnostic feedback is provided to trigger LLM self-correction. The constrained operator space (fixed symbolic structure vs. ViperGPT's open Python code) substantially reduces runtime errors.
- Design Motivation: LLMs occasionally generate syntactically incorrect programs; the constrained operator set combined with a structured verification loop reduces the program failure rate to below 0.3%.
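
The verify-and-retry loop can be illustrated as follows. The program syntax (`OP(args)` per line), the specific checks, and the `generate` callable are assumptions made for this sketch; only the operator names come from the operator set listed above.

```python
import re
from typing import Callable, List, Optional

ALLOWED_OPS = {
    "FIND", "PROPERTY", "LOCATE", "SIZE", "ORDER",
    "FIND_DIRECTION", "FIND_NEAR", "FIND_INSIDE", "RESULT",
}
_LINE = re.compile(r"^([A-Z_]+)\((.*)\)$")  # assumed line format: OP(args)


def verify_program(lines: List[str]) -> List[str]:
    """Return a list of diagnostics; an empty list means the program passed."""
    if not lines:
        return ["program is empty"]
    errors: List[str] = []
    for i, line in enumerate(lines, start=1):
        match = _LINE.match(line.strip())
        if match is None:
            errors.append(f"line {i}: expected OP(args), got {line!r}")
        elif match.group(1) not in ALLOWED_OPS:
            errors.append(f"line {i}: unknown operator {match.group(1)!r}")
    if not lines[-1].strip().startswith("RESULT("):
        errors.append("program must end with RESULT")
    return errors


def generate_with_feedback(
    generate: Callable[[str, List[str]], List[str]],  # hypothetical LLM wrapper
    query: str,
    max_rounds: int = 3,
) -> Optional[List[str]]:
    """Ask for a program, feeding diagnostics back until it verifies or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        program = generate(query, feedback)
        feedback = verify_program(program)
        if not feedback:
            return program
    return None


print(verify_program(["FIND(elephant)", "DETECT(person)"]))
```

Restricting programs to a closed operator vocabulary is what makes a check this simple possible; open-ended Python output would need a much heavier validator.
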
Decoupled Architecture:
- Function: Program generation and execution are decoupled, enabling efficient 1-query-N-images scenarios.
- Mechanism: VIRO generates a program only once per query and reuses it across all \(N\) images, with total inference time \(T_{\text{total}} = T_{\text{pre}} + N \times T_{\text{exec}}\). In contrast, HYDRA and NAVER regenerate the reasoning program for each image.
- Design Motivation: In practical applications such as robotic visual search, the same query must be executed over many images. With the decoupled design, the program-generation cost \(T_{\text{pre}}\) is paid once per query rather than once per image, so only the execution cost scales with \(N\).
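
The cost difference can be made concrete with a small calculation. The decoupled formula is the one given above; the coupled formula \(N \times (T_{\text{pre}} + T_{\text{exec}})\) is an inference from "regenerate the reasoning program for each image", and the timing values are hypothetical placeholders rather than measurements.

```python
def decoupled_total(t_pre: float, t_exec: float, n: int) -> float:
    """Generate the program once, then execute it on each of n images."""
    return t_pre + n * t_exec


def coupled_total(t_pre: float, t_exec: float, n: int) -> float:
    """Regenerate the program for every image (assumed cost model)."""
    return n * (t_pre + t_exec)


# Hypothetical per-step costs in seconds.
t_pre, t_exec, n = 2.0, 0.5, 10
saved = coupled_total(t_pre, t_exec, n) - decoupled_total(t_pre, t_exec, n)
print(decoupled_total(t_pre, t_exec, n), coupled_total(t_pre, t_exec, n), saved)
```

Under this model the saving is \((N - 1) \times T_{\text{pre}}\), so it grows with the number of images that share a query.
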

Key Experimental Results

Main Results

Dataset / Split	Metric	VIRO	Prev. SOTA (Compositional Reasoning)	Gain
gRefCOCO+RefCOCO TestA	Balanced Acc.	61.1%	35.2% (HYDRA)	+25.9
gRefCOCO TestA	TNR (N-acc)	50.2%	7.5% (HYDRA)	+42.7
RefCOCO TestA	TPR (Acc@0.5)	71.9%	66.7% (ViperGPT)	+5.2
RefCOCO TestA	Program Failure Rate	0.07%	3.45% (ViperGPT)	Large reduction
RefCOCO TestA	Execution Latency	0.71s	1.49s (ViperGPT)	2.1× faster
RefEgo (all frames)	ACC@0.5+n	51.9%	23.0% (ViperGPT)	+28.9

Ablation Study

Configuration	Balanced Acc.	TNR	TPR	Notes
Detector-only	40.0%	22.8%	57.1%	OVD only
+ Operators	56.8%	38.9%	74.6%	+ compositional reasoning operators
+ LV	57.0%	39.3%	74.6%	+ logic verification
+ UV (fixed)	58.8%	43.1%	74.4%	+ uncertainty verification (fixed threshold)
+ UV (adaptive)	61.1%	50.2%	71.9%	+ adaptive threshold (full model)
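
The Balanced Acc. column is consistent with the standard definition of balanced accuracy as the mean of TPR and TNR. The check below assumes that definition (the summary does not state it explicitly) and uses only the numbers from the ablation table.

```python
from typing import Dict, Tuple

# (TNR, TPR, reported balanced accuracy), all in percent, copied from the table above.
ABLATION: Dict[str, Tuple[float, float, float]] = {
    "Detector-only": (22.8, 57.1, 40.0),
    "+ Operators": (38.9, 74.6, 56.8),
    "+ LV": (39.3, 74.6, 57.0),
    "+ UV (fixed)": (43.1, 74.4, 58.8),
    "+ UV (adaptive)": (50.2, 71.9, 61.1),
}


def balanced_accuracy(tnr: float, tpr: float) -> float:
    """Mean of true-negative rate and true-positive rate."""
    return (tnr + tpr) / 2.0


# Largest difference between the recomputed and reported values across rows.
max_gap = max(abs(balanced_accuracy(tnr, tpr) - reported)
              for tnr, tpr, reported in ABLATION.values())
print(max_gap)
```

Every row agrees with the reported value to within rounding at one decimal place.
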

Key Findings

Compositional reasoning operators alone contribute the largest gain (Balanced Acc. +16.8 points over the detector-only baseline). Verification then raises TNR from 38.9% to 50.2% (+11.3 points), of which UV with adaptive thresholds accounts for 10.9 points over the + LV configuration, at the cost of a TPR–TNR trade-off.
Adaptive thresholds improve TNR by 7.1 points over fixed thresholds (43.1% to 50.2%) but reduce TPR by 2.5 points (74.4% to 71.9%), reflecting the same TPR–TNR trade-off.
In 1-query-N-images scenarios, VIRO and ViperGPT outperform HYDRA/NAVER due to decoupled architecture, while VIRO achieves lower execution latency.
Regarding CLIP backbone selection, ViT-H/14 achieves 3.1% higher TPR than ViT-L/14 but incurs 29% greater execution latency.

Highlights & Insights

Operator-level verification is the key innovation: Rather than verifying only at the end of the pipeline, the "execute-then-verify" paradigm is applied inside each reasoning step, described as the first such design among compositional REC methods, so errors are caught at their source.
CLIP as a lightweight binary verification classifier: CLIP's discriminative capacity (rather than its typical retrieval usage) is cleverly exploited by computing verification scores through pairwise comparisons against a set of negative anchors, incurring minimal computational cost with significant effectiveness.
Target-absent detection as "abstention" rather than "classification": This is naturally realized via the empty-set return mechanism without requiring additional training for target-absent detection—a design principle transferable to any visual reasoning system that requires a "refuse to answer" capability.

Limitations & Future Work

An inherent trade-off exists between TPR and TNR—improving target-absent detection sacrifices accuracy in target-present cases (TPR decreases from 74.6% to 71.9%).
The fixed operator set has limited coverage for complex queries (e.g., those involving actions or temporal semantics may require extending the operator set).
CLIP verification uses ImageNet-calibrated thresholds; generalization to out-of-domain data remains to be validated.
Logic verification is currently limited to simple geometric tests; more complex spatial relations (occlusion, relative size, etc.) may require stronger reasoning mechanisms.

vs. ViperGPT: ViperGPT generates open Python code without verification, resulting in a 3.45% program failure rate; VIRO reduces this to 0.07% via constrained operators and a verification loop, while also achieving higher TPR.
vs. HYDRA/NAVER: These methods tightly couple program generation and execution, requiring program regeneration for every image and relying on large multimodal LLMs, resulting in significantly higher latency and failure rates than VIRO.
vs. Supervised REC (GREC-UNINEXT): VIRO approaches or surpasses supervised methods trained with target-absent annotations on target-absent detection, demonstrating the potential of zero-shot approaches.

Rating

Novelty: ⭐⭐⭐⭐ Operator-level verification combined with the abstention mechanism represents an important innovation in compositional reasoning for REC, though the overall approach is relatively intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple standard REC benchmarks, target-absent benchmarks, video benchmarks, efficiency, scalability, and comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition and method exposition with rich figures and tables.
Value: ⭐⭐⭐⭐ Significant practical implications for robotic search and target-absent detection; fills a critical gap in compositional reasoning methods.