Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions
Conference: NeurIPS 2025
arXiv: 2509.22814
Code: Coming soon (benchmark and validator suite)
Area: AI Systems / Protocol Security / Computer Vision Workflows
Keywords: Model Context Protocol, vision system orchestration, protocol security, schema validation, multimodal agents
TL;DR
This paper presents the first protocol-level audit of MCP deployment in vision systems. It analyzes 91 public MCP servers, finds that 78% exhibit schema inconsistencies and 89% lack runtime validation, and proposes protocol extensions including semantic schemas, visual memory, and runtime validators.
Background & Motivation
Model Context Protocol (MCP) is a schema-bound execution model for agent–tool interaction that enables modular computer vision workflows through typed schemas and dynamic context objects, without requiring model retraining. However, applying MCP to vision domains introduces unique challenges:
- High-dimensional tensor inputs: Vision data involves large-scale image streams and multimodal semantic fusion, placing significant pressure on orchestration pipelines.
- Inconsistent spatial conventions: Different tools adopt different coordinate systems (e.g., XYWH vs. X1Y1X2Y2).
- Lack of systematic auditing: No prior work has conducted a protocol-level, deployment-scale audit of MCP in vision systems.
- Security risks: Dynamic and multi-agent workflows expose attack surfaces for privilege escalation and untyped tool chaining.
Existing orchestration strategies primarily rely on end-to-end model training or prompt-tuned vision-language systems, which are brittle under tool specialization and yield opaque intermediate reasoning.
Method
Overall Architecture
This work presents a systematic empirical audit rather than a novel algorithmic method. The core contributions include:
- Filtering 91 vision-relevant MCP servers from the MCPServerCorpus (13,942 publicly registered deployments)
- Annotating servers along nine compositional fidelity dimensions
- Developing an executable benchmark and validator suite to detect and classify protocol violations
Key Designs
- Taxonomy of Four Orchestration Patterns:
    - Static composition (37%): Fixed tool sequences; highly auditable but low adaptability
    - Retrieval-augmented selection (29%): Embedding-based semantic matching; flexible but 87% have undeclared coordinate formats
    - Dynamic orchestration (21%): Runtime construction of execution graphs; 89% lack runtime schema checks
    - Multi-agent coordination (13%): Distributed control; 55% exhibit stale memory or cross-tool leakage
- Benchmark Validator Suite:
    - Schema format validator: Detects inter-tool schema inconsistencies (detection rate: 78.0%)
    - Coordinate convention validator: Detects missing or inconsistent spatial references (detection rate: 24.6%)
    - Mask–image consistency validator: Detects dimension or channel mismatches (detection rate: 17.3%)
    - Memory scope validator: Detects undocumented visual state retention (avg. 33.8 warnings / 100 executions)
    - Permission verification validator: Detects privilege escalation or leakage via tool binding (detection rate: 41.0%)
- Security Threat Taxonomy: Eight primary threat vectors are identified:
    - Prompt injection, schema bypass, remote code execution (RCE), privilege escalation
    - Stale memory access, untracked provenance, cross-tool leakage, type coercion injection
- Protocol Extension Proposals:
    - Semantic grounding schema: Adds `semantic_role`, `modality`, and `coordinate_system` fields
    - Protocol-native visual memory: Encodes structured, versioned, semantically annotated intermediate states
    - Runtime validators and compatibility contracts: Validates spatial dimensions, tensor channel semantics, and coordinate alignment at runtime
    - Composable benchmarking: Evaluates orchestration fidelity, memory hygiene, and schema stability
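To make the semantic grounding and runtime validator proposals concrete, here is a minimal Python sketch. Only the three field names (`semantic_role`, `modality`, `coordinate_system`) come from the proposal. The `VisualDescriptor` class, the `shape` field, the example tag values, and the `validate_mask_against_image` function are hypothetical and are not the paper's validator suite.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VisualDescriptor:
    """Hypothetical typed descriptor attached to a tool input or output."""
    semantic_role: str       # e.g. "image", "mask", "bbox" (assumed values)
    modality: str            # e.g. "rgb", "label_map" (assumed values)
    coordinate_system: str   # e.g. "pixel_grid", "xyxy_abs" (assumed values)
    shape: Tuple[int, ...]   # assumed layout: (height, width[, channels])


def validate_mask_against_image(image: VisualDescriptor,
                                mask: VisualDescriptor) -> List[str]:
    """Return a list of violations; an empty list means the pair passes."""
    problems: List[str] = []
    if image.semantic_role != "image":
        problems.append(f"first argument has role {image.semantic_role!r}, expected 'image'")
    if mask.semantic_role != "mask":
        problems.append(f"second argument has role {mask.semantic_role!r}, expected 'mask'")
    # Coordinate alignment: both descriptors must declare the same convention.
    if image.coordinate_system != mask.coordinate_system:
        problems.append(
            f"coordinate systems differ: {image.coordinate_system!r} vs {mask.coordinate_system!r}"
        )
    # Spatial dimensions: height and width must agree.
    if image.shape[:2] != mask.shape[:2]:
        problems.append(f"spatial size differs: {image.shape[:2]} vs {mask.shape[:2]}")
    # Channel semantics: this sketch assumes a mask carries one channel.
    mask_channels = mask.shape[2] if len(mask.shape) == 3 else 1
    if mask_channels != 1:
        problems.append(f"mask has {mask_channels} channels, expected 1")
    return problems


image = VisualDescriptor("image", "rgb", "pixel_grid", (480, 640, 3))
good_mask = VisualDescriptor("mask", "label_map", "pixel_grid", (480, 640))
bad_mask = VisualDescriptor("mask", "label_map", "normalized", (240, 320))

print(validate_mask_against_image(image, good_mask))
print(validate_mask_against_image(image, bad_mask))
```

Returning a list of violations rather than raising on the first one lets an orchestrator log every mismatch for a tool pair in a single pass.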
Analysis Methodology
Compatibility is formalized via a predicate \(\mathrm{comp}: \mathcal{T} \times \mathcal{T} \rightarrow \{0,1\}\), where \(\mathcal{T}\) is the set of tool schemas and \(\mathrm{comp}(t_i, t_j) = 1\) when the output of tool \(t_i\) is admissible as the input of tool \(t_j\). All confidence intervals are reported at the 95% level.
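The compatibility predicate can be illustrated with a small Python sketch. The summary does not spell out what "admissible" means, so this sketch assumes the strictest reading: every input field the consumer declares must appear in the producer's outputs with an identical type tag. The `ToolSchema` class, the type-tag strings, and `incompatible_links` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToolSchema:
    """Hypothetical tool schema: field name -> declared type tag."""
    name: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]


def comp(producer: ToolSchema, consumer: ToolSchema) -> int:
    """Return 1 if the producer's output is admissible as the consumer's input.

    Assumption: admissible means every consumer input field is present in the
    producer's outputs with exactly the same type tag.
    """
    for field_name, type_tag in consumer.inputs.items():
        if producer.outputs.get(field_name) != type_tag:
            return 0
    return 1


def incompatible_links(chain: List[ToolSchema]) -> List[Tuple[str, str]]:
    """List adjacent (producer, consumer) pairs in a chain where comp is 0."""
    return [(a.name, b.name) for a, b in zip(chain, chain[1:]) if comp(a, b) == 0]


detector = ToolSchema("detector", {"image": "tensor[rgb]"}, {"boxes": "bbox[xyxy_abs]"})
tracker = ToolSchema("tracker", {"boxes": "bbox[xyxy_abs]"}, {"tracks": "track_list"})
cropper = ToolSchema("cropper", {"boxes": "bbox[xywh_abs]"}, {"crops": "tensor[rgb]"})

print(comp(detector, tracker))
print(comp(detector, cropper))
print(incompatible_links([detector, cropper]))
```

Encoding the coordinate convention in the type tag means a box-format mismatch surfaces as a failed `comp` check instead of a silent geometric error downstream.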
Key Experimental Results
Main Results: Protocol Failure Mode Analysis (N=91)
| Failure Type | Rate | 95% CI |
|---|---|---|
| Schema format divergence | 62% | — |
| No runtime schema validation | 89% | — |
| Undeclared coordinate conventions | 87% | — |
| Out-of-band bridging scripts | 41% | — |
| Undocumented memory retention logic | 55% | — |
| Declared compositional fallback strategy | Only 9% | — |
| Schema inconsistency (composite detection) | 78.0% | [68.45, 85.28] |
| Coordinate convention inconsistency | 24.6% | [16.90, 34.36] |
| Mask–image dimension mismatch | 17.3% | [10.90, 26.35] |
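The summary does not say how the 95% intervals were computed. As a sanity check, the sketch below applies the Wilson score interval to the three composite detection rates with N=91. Treating the intervals as Wilson intervals with z = 1.96 is an inference from the numbers, not something the source states, and `wilson_interval` is a helper written for this note.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion p_hat observed over n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Rates taken from the three rows of the table that report an interval.
rows = [
    ("Schema inconsistency (composite detection)", 0.780),
    ("Coordinate convention inconsistency", 0.246),
    ("Mask-image dimension mismatch", 0.173),
]

for label, rate in rows:
    low, high = wilson_interval(rate, 91)
    print(f"{label}: [{100 * low:.2f}, {100 * high:.2f}]")
```

Under these assumptions the computed bounds agree with the tabulated intervals for these three rows to two decimal places, which suggests the reported rates were plugged in as proportions rather than as integer counts.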
Security Audit Results (N=47)
| Security Issue | Rate | 95% CI |
|---|---|---|
| Untyped tool chaining | 89.0% | [76.80, 95.19] |
| Privilege escalation or data leakage risk | 41.0% | [28.02, 55.37] |
| Memory scope warnings | Avg. 33.8 / 100 executions | [28.4, 39.9] |
Key Findings from Case Studies
| System | Audit Scale | Primary Issues Found |
|---|---|---|
| ParaView-MCP | — | Binary textures embedded in nested JSON; latency spikes >2.3s |
| SUMO+YOLO-MCP | 134 call pairs | 27.6% exhibit projection conflicts or axis mismatches |
| ALITA | 143 tool chains | 18.4% produce malformed responses |
| FHIR-MCP | 108 outputs | 14.9% of caption outputs show scaling mismatches |
| Blender-RCP | 97 multi-step compositions | 22 produce orphaned references or cache conflicts |
Key Findings
- Segmentation outputs across 91 servers span 5 incompatible formats: URI-encoded masks, run-length encoding, base64 tensors, polygon contours, and per-pixel label maps
- Bounding box formats are inconsistent across absolute XYWH, corner-based X1Y1X2Y2, and center-normalized representations
- Only 8 of 91 servers implement post-call output inspection
- 41% of deployments rely on undocumented bridging scripts for format conversion
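The bounding box inconsistency above is the kind of conversion that undocumented bridging scripts typically perform. The Python sketch below normalizes the three representations to absolute corner form. The format tags (`xywh_abs`, `xyxy_abs`, `cxcywh_norm`) and the function names are hypothetical, and the sketch assumes XYWH means top-left corner plus width and height.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(box: Box) -> Box:
    """Absolute (x, y, width, height) with top-left origin -> (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def center_norm_to_xyxy(box: Box, image_width: int, image_height: int) -> Box:
    """Normalized (cx, cy, w, h) in [0, 1] -> absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    half_w = w * image_width / 2
    half_h = h * image_height / 2
    return (cx * image_width - half_w, cy * image_height - half_h,
            cx * image_width + half_w, cy * image_height + half_h)


def to_xyxy(box: Box, fmt: str, image_size: Optional[Tuple[int, int]] = None) -> Box:
    """Convert a box to absolute corners based on its declared format tag."""
    if fmt == "xyxy_abs":
        return box
    if fmt == "xywh_abs":
        return xywh_to_xyxy(box)
    if fmt == "cxcywh_norm":
        if image_size is None:
            raise ValueError("cxcywh_norm requires image_size=(width, height)")
        return center_norm_to_xyxy(box, image_size[0], image_size[1])
    # An undeclared or unknown convention is rejected instead of guessed.
    raise ValueError(f"undeclared or unknown box format: {fmt!r}")


print(to_xyxy((10, 20, 30, 40), "xywh_abs"))
print(to_xyxy((0.5, 0.5, 0.5, 0.5), "cxcywh_norm", image_size=(640, 480)))
```

Rejecting an unknown tag makes a missing coordinate declaration fail loudly at the tool boundary, which is the behavior a runtime validator would need.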
Highlights & Insights
- First protocol-level audit of vision systems: Systematically exposes structural problems MCP faces in the vision domain, rather than defects in individual tools
- Rigorous quantitative analysis: Composite detection rates and security findings are reported with 95% confidence intervals; the 91 servers were filtered from a corpus of 13,942 publicly registered deployments
- Actionable security threat taxonomy: The eight threat vectors and their mitigation strategies offer practical guidance for deployment
- Orchestration pattern taxonomy clarifies the trade-offs among different deployment strategies
Limitations & Future Work
- Only publicly accessible servers are analyzed, excluding enterprise-grade and proprietary deployments; research prototypes are overrepresented
- Protocol extensions are implemented only as reference prototypes on controlled testbeds, without validation in heterogeneous production environments
- Security analysis covers only 47 servers, potentially failing to capture the full threat surface of larger-scale deployments
- No comparative analysis with alternative orchestration frameworks (e.g., LangChain, AutoGen)
- The MCP ecosystem is rapidly evolving, and the observed patterns may not reflect the current state of the art
Related Work & Insights
- Unlike prompt-chaining systems such as LLaVA-Plus and MAGMA, MCP decouples reasoning from execution and supports dynamic tool loading at runtime
- The SPORT system demonstrates a lightweight, confidence-based tool prioritization strategy
- The AgentOrchestra case reveals the prevalence of out-of-band bridging scripts
- For building reliable multimodal agent systems, standardization at the protocol level—including semantic typing and memory scope management—is indispensable
Rating
- Novelty: ⭐⭐⭐⭐ (First protocol-level audit of MCP for vision systems; pioneering contribution)
- Experimental Thoroughness: ⭐⭐⭐⭐ (91 servers, 5 case studies, quantified confidence intervals)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure; taxonomy and tables are well organized)
- Value: ⭐⭐⭐⭐ (Significant guidance for security and reliability in the MCP ecosystem)