
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

Conference: CVPR 2026
arXiv: 2604.15670
Code: https://github.com/XIEFOX/PixDLM
Area: Semantic Segmentation
Keywords: UAV reasoning segmentation, multimodal large language model, dual-path visual encoder, chain-of-thought reasoning, pixel-level prediction

TL;DR

This paper formally defines the UAV Reasoning Segmentation task, constructs the DRSeg benchmark of 10K high-resolution UAV images with chain-of-thought (CoT) reasoning annotations, and proposes PixDLM, a dual-path pixel-level multimodal large language model, as a strong baseline.

Background & Motivation

Background: Reasoning Segmentation aims to identify regions in an image satisfying conditions described by free-form textual instructions. Models such as LISA and PixelLM have demonstrated the capacity of multimodal large language models (MLLMs) for implicit reasoning and pixel-level segmentation in ground-view scenarios.

Limitations of Prior Work: Existing reasoning segmentation models and datasets are predominantly built upon ground-view or nadir-view imagery, whose visual assumptions—moderate resolution, limited scale variation, stable camera orientation, and relatively large object sizes—are fundamentally inapplicable to UAV imagery. UAV images present three unique challenges: (1) high-altitude oblique perspectives continuously alter projective geometry; (2) extreme scale variation and densely packed small objects, with many critical targets spanning only tens of pixels; and (3) ultra-high-resolution scenes requiring simultaneous reasoning over global semantics and fine-grained high-frequency details.

Key Challenge: Existing MLLMs typically employ low-resolution visual tokenization, causing fine-grained UAV details to be lost during compression. Moreover, the absence of a reasoning segmentation benchmark specifically tailored to UAV scenarios impedes systematic research progress.

Goal: (1) Formally define the UAV Reasoning Segmentation task and construct a dedicated benchmark dataset; (2) propose a baseline model capable of jointly handling global semantics and local details.

Key Insight: The semantic reasoning requirements of UAV imagery are organized along three dimensions—spatial reasoning, attribute reasoning, and scene-level reasoning—corresponding to positional relationships, visual states, and global context, respectively.

Core Idea: A dual-path visual encoder (global low-resolution path + high-resolution structural path) is employed to preserve small-object and boundary cues, which are then combined with LLM-driven reasoning for pixel-level segmentation.

Method

Overall Architecture

PixDLM consists of four core components: (1) a dual-path visual encoder that extracts global semantic and fine-grained structural features; (2) a MultiPath Alignment module that fuses the dual-path features; (3) an LLM that performs instruction-conditioned reasoning; and (4) a multi-scale decoder that reconstructs the final segmentation mask. Given a UAV image and a natural language instruction, the model outputs a pixel-level mask satisfying the instruction.

Key Designs

Dual-Path Vision Encoder:
- Function: Simultaneously captures global semantic context and high-resolution structural details.
- Mechanism: The global path employs a CLIP visual encoder to process low-resolution inputs for semantic features; the structural path employs a SAM encoder to process high-resolution inputs, preserving small-object and boundary cues. The two paths are complementary—CLIP excels at semantic understanding while SAM excels at fine-grained structural perception.
- Design Motivation: A single low-resolution encoder loses densely distributed small-object information in UAV imagery, whereas a single high-resolution encoder incurs prohibitive computational costs. The dual-path design balances semantic understanding with detail preservation.
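
To make the resolution trade-off concrete, the following back-of-the-envelope sketch estimates how many patches a small object spans on each path and how many visual tokens each path produces. The input sizes (336 px and 1024 px), patch sizes (14 and 16), the 4000 px image width, and the 40 px object width are illustrative assumptions, not values confirmed by the paper.

```python
def tokens_per_image(input_px: int, patch_px: int) -> int:
    """Number of visual tokens for a square input split into square patches."""
    side = input_px // patch_px
    return side * side


def object_extent_in_patches(object_px: float, image_px: int,
                             input_px: int, patch_px: int) -> float:
    """Width of an object, measured in patches, after the image is resized."""
    resized_px = object_px * input_px / image_px
    return resized_px / patch_px


# Assumed, illustrative settings (not taken from the paper).
IMAGE_PX = 4000    # width of the original UAV image
OBJECT_PX = 40     # width of a small target in the original image
GLOBAL_PATH = {"input_px": 336, "patch_px": 14}       # CLIP-like global path
STRUCTURAL_PATH = {"input_px": 1024, "patch_px": 16}  # SAM-like structural path

global_tokens = tokens_per_image(**GLOBAL_PATH)
structural_tokens = tokens_per_image(**STRUCTURAL_PATH)
global_extent = object_extent_in_patches(OBJECT_PX, IMAGE_PX, **GLOBAL_PATH)
structural_extent = object_extent_in_patches(OBJECT_PX, IMAGE_PX, **STRUCTURAL_PATH)

print(global_tokens, structural_tokens)
print(round(global_extent, 2), round(structural_extent, 2))
```

Under these assumptions the object covers about a quarter of one patch on the global path and about two thirds of a patch on the structural path, at the cost of roughly seven times as many tokens (4096 versus 576).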

MultiPath Alignment Module:
- Function: Lightweight fusion of global semantic and local structural features.
- Mechanism: The semantic features from CLIP and the structural features from SAM are aligned into a unified representation space via a controlled integration scheme, enabling subsequent LLM reasoning to leverage both.
- Design Motivation: The two paths produce features at different scales and semantic levels; an effective alignment mechanism is required for the LLM to simultaneously exploit the advantages of both paths.
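
The summary does not spell out the "controlled integration scheme", so the sketch below shows only one plausible reading: a gated residual fusion, in which structural features are average-pooled to the global grid, linearly projected to the global feature width, and added with a scalar gate. The function names, the pooling step, the projection matrix, and the gate are all hypothetical and are not the authors' implementation.

```python
def avg_pool_grid(grid, factor):
    """Average-pool an H x W grid of feature vectors by `factor` on each axis."""
    dim = len(grid[0][0])
    pooled = []
    for i in range(0, len(grid), factor):
        row = []
        for j in range(0, len(grid[0]), factor):
            cell = [grid[a][b]
                    for a in range(i, i + factor)
                    for b in range(j, j + factor)]
            row.append([sum(v[k] for v in cell) / len(cell) for k in range(dim)])
        pooled.append(row)
    return pooled


def project(vec, weight):
    """Linear map; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse(global_grid, structural_grid, weight, gate):
    """Gated residual fusion (hypothetical): global + gate * proj(pool(structural))."""
    factor = len(structural_grid) // len(global_grid)
    pooled = avg_pool_grid(structural_grid, factor)
    return [
        [[g + gate * p for g, p in zip(g_vec, project(s_vec, weight))]
         for g_vec, s_vec in zip(g_row, s_row)]
        for g_row, s_row in zip(global_grid, pooled)
    ]


# Toy inputs: a 2x2 global grid (width 2) and a 4x4 structural grid (width 1).
global_grid = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]
structural_grid = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
weight = [[1.0], [2.0]]  # hypothetical projection from width 1 to width 2

fused = fuse(global_grid, structural_grid, weight, gate=0.5)
print(fused[0][0], fused[1][1])
```

With the gate at zero the output equals the global features, which is one way an integration can stay "controlled": the structural path is introduced gradually rather than overwriting the semantic representation.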

DRSeg Dataset Construction Pipeline:
- Function: Provides 10K high-resolution UAV images with corresponding reasoning annotations.
- Mechanism: The construction follows four stages: (1) manual selection of complex scene images; (2) coarse mask generation with SAM2 followed by human refinement; (3) GPT-5 generation of reasoning QA pairs with CoT reasoning chains, covering the three reasoning dimensions and conditioned on image, mask, and category; and (4) human review. The data are evenly distributed across the three reasoning dimensions (spatial, attribute, and scene), at approximately 33.3% each.
- Design Motivation: Existing UAV datasets lack fine-grained annotations and reasoning-oriented textual supervision, making them insufficient to support systematic reasoning segmentation research.
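
As an illustration of what one annotation might carry, the sketch below models a DRSeg sample as a small record and checks the balance across reasoning dimensions. The field names, file paths, and example texts are hypothetical placeholders; the released dataset may use a different schema.

```python
from collections import Counter
from dataclasses import dataclass

REASONING_DIMENSIONS = ("spatial", "attribute", "scene")


@dataclass(frozen=True)
class DRSegSample:
    """One hypothetical annotation record: image, mask, and reasoning text."""
    image_path: str
    mask_path: str
    category: str
    dimension: str         # one of REASONING_DIMENSIONS
    question: str          # free-form instruction
    chain_of_thought: str  # CoT reasoning chain
    answer: str

    def __post_init__(self):
        if self.dimension not in REASONING_DIMENSIONS:
            raise ValueError(f"unknown reasoning dimension: {self.dimension!r}")


def dimension_shares(samples):
    """Fraction of samples that falls into each reasoning dimension."""
    counts = Counter(s.dimension for s in samples)
    return {d: counts[d] / len(samples) for d in REASONING_DIMENSIONS}


# Placeholder samples, for illustration only.
samples = [
    DRSegSample("img_0001.jpg", "mask_0001.png", "car", dim,
                f"<{dim} instruction>", f"<{dim} reasoning chain>", "<answer>")
    for dim in REASONING_DIMENSIONS
]
print(dimension_shares(samples))
```

A check like `dimension_shares` is a cheap way to confirm that a split keeps the roughly equal thirds described above.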

Loss & Training

The model follows the standard LISA training paradigm, in which the LLM emits a special mask token whose hidden embedding is decoded into the segmentation mask (the embedding-as-mask design). The model can be trained with supervised fine-tuning (SFT).
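
For orientation, the LISA objective combines a text cross-entropy term with a mask loss made of per-pixel binary cross-entropy and Dice loss. The sketch below implements that combination on flattened masks in plain Python. The weights (1.0, 2.0, 0.5) are the LISA defaults as far as we know; whether PixDLM reuses them is an assumption, and the function names are hypothetical.

```python
import math


def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|)."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def lisa_style_loss(text_ce, probs, target,
                    w_text=1.0, w_bce=2.0, w_dice=0.5):
    """Weighted sum of text loss and mask losses (weights are assumed defaults)."""
    return (w_text * text_ce
            + w_bce * bce_loss(probs, target)
            + w_dice * dice_loss(probs, target))


# Toy example: a 4-pixel mask and an uncertain prediction.
target = [1, 1, 0, 0]
probs = [0.9, 0.6, 0.2, 0.1]
print(lisa_style_loss(text_ce=0.0, probs=probs, target=target))
```

The `text_ce` argument stands in for the autoregressive cross-entropy over the LLM's output tokens, which would be computed by the language-model head in a real training loop.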

Key Experimental Results

Main Results

Model	Attribute gIoU (%)	Scene gIoU (%)	Spatial gIoU (%)
LISA-13B (zero-shot)	52.65	47.08	42.85
PixelLM-7B (zero-shot)	46.87	43.07	41.28
LISA-7B (SFT)	59.22	54.45	57.33
PixDLM (Ours)	62.80	61.75	62.51
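
For reference, gIoU in reasoning segmentation benchmarks (following LISA) is the mean of per-image IoU values, so every image counts equally regardless of object size. The sketch below computes it on flattened binary masks. Scoring an empty prediction against an empty target as 1.0 is an assumed convention, and the toy masks are illustrative only.

```python
def iou(pred, gt):
    """Intersection over union of two flattened 0/1 masks."""
    intersection = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return 1.0 if union == 0 else intersection / union  # assumed empty-mask rule


def giou(preds, gts):
    """gIoU: average of per-image IoU values."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# Toy data: one half-missed small object and one perfectly segmented large one.
preds = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
gts = [[1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]

score = giou(preds, gts)
print(f"gIoU = {100 * score:.2f}%")
```

Because the per-image scores are averaged, the half-missed small object lowers the result as much as a half-missed large one would, which is why the metric is sensitive to small-target quality.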

Ablation Study

Configuration	Attribute gIoU (%)	Scene gIoU (%)	Spatial gIoU (%)
DRSeg + RRSIS-D + CoT	61.13	55.60	60.55
DRSeg + CoT (w/o RRSIS-D)	62.80	61.75	62.51
DRSeg (w/o CoT)	62.51	61.67	61.98

Key Findings

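- PixDLM consistently outperforms both zero-shot and SFT baselines across all three reasoning dimensions, with the largest gain in scene-level reasoning (+7.30 gIoU over SFT LISA-7B, compared with +3.58 on attribute and +5.18 on spatial).
- Incorporating RRSIS-D data degrades performance on all three dimensions (by 1.67, 6.15, and 1.96 gIoU on attribute, scene, and spatial, respectively), indicating that domain-specific UAV data is more critical than data volume.
- CoT reasoning supervision yields only modest gains (+0.29, +0.08, and +0.53 gIoU on attribute, scene, and spatial, respectively), suggesting that segmentation accuracy depends only weakly on the reasoning chains and is therefore robust to noise in them.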
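
The gaps behind these findings can be recomputed directly from the two tables. The snippet below does the arithmetic; the dictionary keys are shorthand labels chosen here, and the numbers are copied from the Main Results and Ablation Study tables.

```python
# gIoU (%) values copied from the result tables.
RESULTS = {
    "lisa_7b_sft":    {"attribute": 59.22, "scene": 54.45, "spatial": 57.33},
    "pixdlm":         {"attribute": 62.80, "scene": 61.75, "spatial": 62.51},
    "pixdlm_no_cot":  {"attribute": 62.51, "scene": 61.67, "spatial": 61.98},
    "pixdlm_rrsis_d": {"attribute": 61.13, "scene": 55.60, "spatial": 60.55},
}


def delta(model, reference):
    """Per-dimension gIoU difference (model minus reference), in points."""
    return {dim: round(RESULTS[model][dim] - RESULTS[reference][dim], 2)
            for dim in RESULTS[model]}


gain_over_lisa = delta("pixdlm", "lisa_7b_sft")
gain_from_cot = delta("pixdlm", "pixdlm_no_cot")
change_with_rrsis_d = delta("pixdlm_rrsis_d", "pixdlm")

print(gain_over_lisa)
print(gain_from_cot)
print(change_with_rrsis_d)
```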

Highlights & Insights

- Clear Task Definition: The semantic requirements of UAV reasoning segmentation are systematically organized into spatial, attribute, and scene dimensions, providing a clear framework for future research.
- Mature Data Construction Pipeline: The semi-automatic annotation pipeline combining GPT-5 generation with human review achieves a favorable balance between quality and scalability.
- Concise and Effective Dual-Path Design: Leveraging off-the-shelf CLIP and SAM encoders avoids the need to train a high-resolution encoder from scratch.

Limitations & Future Work

- Approximately 58% of instances are small objects (occupying less than 2% of the image area), leaving substantial room for improvement on extremely small targets.
- Only a single target instance is annotated per image, precluding evaluation of multi-object reasoning scenarios.
- The computational overhead of the dual-path encoder is non-trivial and may be too high for real-time UAV applications.
- The dataset scale (10K images) is relatively limited; larger-scale data may yield further performance improvements.

Comparison with Related Work

- vs. LISA: LISA feeds the LLM with features from a single CLIP encoder, whereas PixDLM adds a SAM high-resolution path, yielding clear advantages in UAV small-object scenarios.
- vs. GeoPix/GeoPixel: These remote sensing models exploit geographic priors but lack open-vocabulary reasoning capability and perform poorly on densely packed small objects.
- vs. LLaVA-HR: LLaVA-HR similarly adopts a dual-path strategy for high-resolution processing, but PixDLM is specifically designed for pixel-level output.

Rating

- Novelty: ⭐⭐⭐⭐ First formal definition of the UAV reasoning segmentation task; the dataset and task formulation are pioneering contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-baseline comparisons with well-designed ablations.
- Writing Quality: ⭐⭐⭐⭐ Task definition and dataset construction are described in thorough and clear detail.
- Value: ⭐⭐⭐⭐ Provides an important benchmark and baseline for UAV visual understanding research.