
Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions

Conference: NeurIPS 2025
arXiv: 2509.22814
Code: Coming soon (benchmark and validator suite)
Area: AI Systems / Protocol Security / Computer Vision Workflows
Keywords: Model Context Protocol, vision system orchestration, protocol security, schema validation, multimodal agents

TL;DR

The first protocol-level audit of MCP deployment in vision systems, analyzing 91 public MCP servers and finding that 78% exhibit schema inconsistencies and 89% lack runtime validation; the paper further proposes protocol extensions including semantic schemas, visual memory, and runtime validators.

Background & Motivation

Model Context Protocol (MCP) is a schema-bound execution model for agent–tool interaction that enables modular computer vision workflows through typed schemas and dynamic context objects, without requiring model retraining. However, applying MCP to vision domains introduces unique challenges:

High-dimensional tensor inputs: Vision data involves large-scale image streams and multimodal semantic fusion, placing significant pressure on orchestration pipelines.

Inconsistent spatial conventions: Different tools adopt different coordinate systems (e.g., XYWH vs. X1Y1X2Y2).

Lack of systematic auditing: No prior work has conducted a protocol-level, deployment-scale audit of MCP in vision systems.

Security risks: Dynamic and multi-agent workflows expose attack surfaces for privilege escalation and untyped tool chaining.
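
To make the coordinate-convention problem concrete, here is a minimal sketch of converting between the two box formats named above. It assumes absolute pixel coordinates with the origin at the top-left corner; the function names are illustrative and are not taken from the paper or from any MCP server.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def xywh_to_xyxy(box: Box) -> Box:
    """Convert (x, y, width, height) to corner form (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box: Box) -> Box:
    """Convert corner form (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


# The same four numbers describe different regions under each convention,
# which is why an undeclared format cannot be recovered from the payload alone.
raw = (10.0, 20.0, 30.0, 40.0)
print(xywh_to_xyxy(raw))  # raw read as XYWH
print(xyxy_to_xywh(raw))  # raw read as X1Y1X2Y2
```

A tool that receives `raw` without a declared convention has no way to tell which of the two readings is intended, so the conversion has to be driven by schema metadata rather than guessed.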

Existing orchestration strategies primarily rely on end-to-end model training or prompt-tuned vision-language systems, which are brittle under tool specialization and yield opaque intermediate reasoning.

Method

Overall Architecture

This work presents a systematic empirical audit rather than a novel algorithmic method. The core contributions include:

  • Filtering 91 vision-relevant MCP servers from the MCPServerCorpus (13,942 publicly registered deployments)
  • Annotating servers along nine compositional fidelity dimensions
  • Developing an executable benchmark and validator suite to detect and classify protocol violations

Key Designs

  1. Taxonomy of Four Orchestration Patterns:

    • Static composition (37%): Fixed tool sequences; highly auditable but low adaptability
    • Retrieval-augmented selection (29%): Embedding-based semantic matching; flexible but 87% have undeclared coordinate formats
    • Dynamic orchestration (21%): Runtime construction of execution graphs; 89% lack runtime schema checks
    • Multi-agent coordination (13%): Distributed control; 55% exhibit stale memory or cross-tool leakage
  2. Benchmark Validator Suite:

    • Schema format validator: Detects inter-tool schema inconsistencies (detection rate: 78.0%)
    • Coordinate convention validator: Detects missing or inconsistent spatial references (detection rate: 24.6%)
    • Mask–image consistency validator: Detects dimension or channel mismatches (detection rate: 17.3%)
    • Memory scope validator: Detects undocumented visual state retention (avg. 33.8 warnings / 100 executions)
    • Permission verification validator: Detects privilege escalation or leakage via tool binding (detection rate: 41.0%)
  3. Security Threat Taxonomy: Eight primary threat vectors are identified:

    • Prompt injection, schema bypass, remote code execution (RCE), privilege escalation
    • Stale memory access, untracked provenance, cross-tool leakage, type coercion injection
  4. Protocol Extension Proposals:

    • Semantic grounding schema: Adds semantic_role, modality, and coordinate_system fields
    • Protocol-native visual memory: Encodes structured, versioned, semantically annotated intermediate states
    • Runtime validators and compatibility contracts: Validates spatial dimensions, tensor channel semantics, and coordinate alignment at runtime
    • Composable benchmarking: Evaluates orchestration fidelity, memory hygiene, and schema stability
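
The semantic grounding fields and the runtime validator idea can be sketched together as follows. This is an illustration under stated assumptions, not the paper's reference prototype: only the field names `semantic_role`, `modality`, and `coordinate_system` come from the proposal, while the class, the function names, the value strings, and the nested-list payload layout are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SemanticField:
    # The three field names follow the proposed extension; the values are made up.
    semantic_role: str      # e.g. "input_image", "segmentation_mask"
    modality: str           # e.g. "image", "mask"
    coordinate_system: str  # e.g. "pixel_top_left"


def shape_of(grid) -> Tuple[int, int]:
    """Return (rows, cols) of a nested-list grid."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    return rows, cols


def validate_mask_for_image(image_field, mask_field, image, mask) -> List[str]:
    """Return a list of violations; an empty list means the pair passed."""
    violations = []
    if mask_field.modality != "mask":
        violations.append(f"expected modality 'mask', got {mask_field.modality!r}")
    if image_field.coordinate_system != mask_field.coordinate_system:
        violations.append("coordinate_system mismatch between image and mask")
    if shape_of(image) != shape_of(mask):
        violations.append(f"spatial mismatch: image {shape_of(image)} vs mask {shape_of(mask)}")
    if any(not isinstance(v, int) for row in mask for v in row):
        violations.append("mask must be single-channel (one integer label per pixel)")
    return violations


image_field = SemanticField("input_image", "image", "pixel_top_left")
mask_field = SemanticField("segmentation_mask", "mask", "pixel_top_left")
image = [[(0, 0, 0)] * 4 for _ in range(3)]  # 3 rows x 4 cols, RGB tuples
mask = [[0] * 4 for _ in range(2)]           # only 2 rows: a dimension mismatch
print(validate_mask_for_image(image_field, mask_field, image, mask))
```

Returning a list of violations instead of raising on the first one lets an orchestrator log every problem in a single pass, which fits the audit-style reporting used in the benchmark.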

Analysis Methodology

Compatibility is formalized via a predicate function \(\mathrm{comp}: \mathcal{T} \times \mathcal{T} \rightarrow \{0,1\}\), where \(\mathcal{T}\) is the set of tool schemas. The predicate takes the value 1 when the output of the first tool is admissible as the input of the second, and 0 otherwise. All confidence intervals are reported at the 95% level.
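
A minimal sketch of such a predicate is shown below. The summary does not say how the paper decides admissibility, so this example assumes the simplest possible rule: each schema carries a single type tag for its input and its output, and two tools compose when the tags are equal. The `ToolSchema` class, the tool names, and the tag strings are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ToolSchema:
    name: str
    input_type: str   # hypothetical type tag, e.g. "boxes:xywh_abs"
    output_type: str


def comp(producer: ToolSchema, consumer: ToolSchema) -> int:
    """Return 1 if the producer's output is admissible as the consumer's input, else 0."""
    return int(producer.output_type == consumer.input_type)


tools = [
    ToolSchema("detector", "image", "boxes:xywh_abs"),
    ToolSchema("tracker", "boxes:xyxy_abs", "tracks"),
    ToolSchema("cropper", "boxes:xywh_abs", "crops"),
]

# Enumerate every ordered pair of distinct tools that can be chained directly.
admissible = [
    (a.name, b.name)
    for a, b in product(tools, repeat=2)
    if a is not b and comp(a, b)
]
print(admissible)
```

In this toy registry the detector can feed the cropper but not the tracker, because the tracker expects corner-based boxes. The predicate is also order-sensitive: `comp(detector, cropper)` is 1 while `comp(cropper, detector)` is 0.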

Key Experimental Results

Main Results: Protocol Failure Mode Analysis (N=91)

| Failure Type | Rate | 95% CI |
| --- | --- | --- |
| Schema format divergence | 62% | |
| No runtime schema validation | 89% | |
| Undeclared coordinate conventions | 87% | |
| Out-of-band bridging scripts | 41% | |
| Undocumented memory retention logic | 55% | |
| Declared compositional fallback strategy | Only 9% | |
| Schema inconsistency (composite detection) | 78.0% | [68.45, 85.28] |
| Coordinate convention inconsistency | 24.6% | [16.90, 34.36] |
| Mask–image dimension mismatch | 17.3% | [10.90, 26.35] |
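
This summary does not state which interval method the paper uses. As a worked check, the sketch below computes a Wilson score interval with z = 1.96, which closely reproduces the three intervals reported above for N = 91. Treat the choice of Wilson as an inference from the numbers, not as the authors' documented procedure.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion p_hat observed over n trials."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Schema inconsistency: 78.0% of 91 servers.
low, high = wilson_interval(0.78, 91)
print(f"[{100 * low:.2f}, {100 * high:.2f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval is not symmetric around the observed rate, which matches the asymmetric bounds in the table (78.0% sits closer to the upper bound than to the lower one).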

Security Audit Results (N=47)

| Security Issue | Rate | 95% CI |
| --- | --- | --- |
| Untyped tool chaining | 89.0% | [76.80, 95.19] |
| Privilege escalation or data leakage risk | 41.0% | [28.02, 55.37] |
| Memory scope warnings | Avg. 33.8 / 100 executions | [28.4, 39.9] |

Key Findings from Case Studies

| System | Audit Scale | Primary Issues Found |
| --- | --- | --- |
| ParaView-MCP | | Binary textures embedded in nested JSON; latency spikes >2.3s |
| SUMO+YOLO-MCP | 134 call pairs | 27.6% exhibit projection conflicts or axis mismatches |
| ALITA | 143 tool chains | 18.4% produce malformed responses |
| FHIR-MCP | 108 outputs | 14.9% of caption outputs show scaling mismatches |
| Blender-RCP | 97 multi-step compositions | 22 produce orphaned references or cache conflicts |

Key Findings

  • Segmentation outputs across 91 servers span 5 incompatible formats: URI-encoded masks, run-length encoding, base64 tensors, polygon contours, and per-pixel label maps
  • Bounding box formats are inconsistent across absolute XYWH, corner-based X1Y1X2Y2, and center-normalized representations
  • Only 8 of 91 servers implement post-call output inspection
  • 41% of deployments rely on undocumented bridging scripts for format conversion
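
To illustrate the kind of conversion that bridging scripts end up doing by hand, here is a sketch that translates between two of the five mask formats listed above: run-length encoding and a per-pixel label map. It assumes a simplified binary, row-major RLE whose first run counts background pixels; real encodings (for example column-major variants) differ, and nothing here is taken from the audited servers.

```python
from typing import List, Sequence, Tuple


def rle_encode(mask: Sequence[Sequence[int]]) -> List[int]:
    """Encode a binary mask as alternating run lengths, starting with a run of 0s."""
    counts: List[int] = []
    current, run = 0, 0
    for row in mask:
        for pixel in row:
            if pixel == current:
                run += 1
            else:
                counts.append(run)  # may be 0 if the mask starts with a 1
                current, run = pixel, 1
    counts.append(run)
    return counts


def rle_decode(counts: Sequence[int], shape: Tuple[int, int]) -> List[List[int]]:
    """Decode alternating run lengths back into a (height, width) binary mask."""
    height, width = shape
    if sum(counts) != height * width:
        raise ValueError("run lengths do not cover the declared mask shape")
    flat: List[int] = []
    value = 0
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    return [flat[r * width:(r + 1) * width] for r in range(height)]


mask = [[0, 1, 1],
        [1, 0, 0]]
encoded = rle_encode(mask)
print(encoded)
print(rle_decode(encoded, (2, 3)) == mask)
```

Note that decoding needs the mask shape as a separate input. An RLE payload without declared dimensions and scan order is ambiguous, which is one reason post-call output inspection matters.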

Highlights & Insights

  • First protocol-level audit of vision systems: Systematically exposes structural problems MCP faces in the vision domain, rather than defects in individual tools
  • Rigorous quantitative analysis: The composite detection rates and security findings are reported with 95% confidence intervals; the sample of 91 servers is presented as representative of this research area
  • Actionable security threat taxonomy: The eight threat vectors and their mitigation strategies offer practical guidance for deployment
  • Orchestration pattern taxonomy clarifies the trade-offs among different deployment strategies

Limitations & Future Work

  • Only publicly accessible servers are analyzed, excluding enterprise-grade and proprietary deployments; research prototypes are overrepresented
  • Protocol extensions are implemented only as reference prototypes on controlled testbeds, without validation in heterogeneous production environments
  • Security analysis covers only 47 servers, potentially failing to capture the full threat surface of larger-scale deployments
  • No comparative analysis with alternative orchestration frameworks (e.g., LangChain, AutoGen)
  • The MCP ecosystem is rapidly evolving, and the observed patterns may not reflect the current state of the art

Related Work & Insights

  • Unlike prompt-chaining systems such as LLaVA-Plus and MAGMA, MCP decouples reasoning from execution and supports dynamic tool loading at runtime
  • The SPORT system demonstrates a lightweight, confidence-based tool prioritization strategy
  • The AgentOrchestra case reveals the prevalence of out-of-band bridging scripts
  • For building reliable multimodal agent systems, standardization at the protocol level—including semantic typing and memory scope management—is indispensable

Rating

  • Novelty: ⭐⭐⭐⭐ (First protocol-level audit of MCP for vision systems; pioneering contribution)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (91 servers, 5 case studies, quantified confidence intervals)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure; taxonomy and tables are well organized)
  • Value: ⭐⭐⭐⭐ (Significant guidance for security and reliability in the MCP ecosystem)