PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation¶
Conference: CVPR 2026 arXiv: 2604.15670 Code: https://github.com/XIEFOX/PixDLM Area: Semantic Segmentation Keywords: UAV reasoning segmentation, multimodal large language model, dual-path visual encoder, chain-of-thought reasoning, pixel-level prediction
TL;DR¶
This paper formally defines the UAV Reasoning Segmentation task, constructs the DRSeg benchmark comprising 10K high-resolution UAV images with chain-of-thought reasoning annotations, and proposes the dual-path pixel-level multimodal large language model PixDLM as a strong baseline.
Background & Motivation¶
Background: Reasoning Segmentation aims to identify regions in an image satisfying conditions described by free-form textual instructions. Models such as LISA and PixelLM have demonstrated the capacity of multimodal large language models (MLLMs) for implicit reasoning and pixel-level segmentation in ground-view scenarios.
Limitations of Prior Work: Existing reasoning segmentation models and datasets are predominantly built upon ground-view or nadir-view imagery, whose visual assumptions—moderate resolution, limited scale variation, stable camera orientation, and relatively large object sizes—are fundamentally inapplicable to UAV imagery. UAV images present three unique challenges: (1) high-altitude oblique perspectives continuously alter projective geometry; (2) extreme scale variation and densely packed small objects, with many critical targets spanning only tens of pixels; and (3) ultra-high-resolution scenes requiring simultaneous reasoning over global semantics and fine-grained high-frequency details.
Key Challenge: Existing MLLMs typically employ low-resolution visual tokenization, causing fine-grained UAV details to be lost during compression. Moreover, the absence of a reasoning segmentation benchmark specifically tailored to UAV scenarios impedes systematic research progress.
Goal: (1) Formally define the UAV Reasoning Segmentation task and construct a dedicated benchmark dataset; (2) propose a baseline model capable of jointly handling global semantics and local details.
Key Insight: The semantic reasoning requirements of UAV imagery are organized along three dimensions—spatial reasoning, attribute reasoning, and scene-level reasoning—corresponding to positional relationships, visual states, and global context, respectively.
Core Idea: A dual-path visual encoder (global low-resolution path + high-resolution structural path) is employed to preserve small-object and boundary cues, which are then combined with LLM-driven reasoning for pixel-level segmentation.
Method¶
Overall Architecture¶
PixDLM consists of four core components: (1) a dual-path visual encoder that extracts global semantic and fine-grained structural features; (2) a MultiPath Alignment module that fuses the dual-path features; (3) an LLM that performs instruction-conditioned reasoning; and (4) a multi-scale decoder that reconstructs the final segmentation mask. Given a UAV image and a natural language instruction, the model outputs a pixel-level mask satisfying the instruction.
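The four-stage dataflow can be sketched in shapes only. Everything below is an illustrative assumption (toy dimensions, random projections in place of CLIP/SAM, a guessed gated sum for the "controlled integration" fusion), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C, D = 64, 64, 3, 32          # toy image size, channels, embed dim
W_g = rng.standard_normal((C, D))   # stand-in projections for the two encoders
W_s = rng.standard_normal((C, D))

def global_path(img):
    """CLIP-like path: low-resolution input -> coarse semantic tokens."""
    low = img[::8, ::8]                         # 8x downsample
    return low.reshape(-1, C) @ W_g             # (64, D) tokens

def structural_path(img):
    """SAM-like path: full-resolution input -> fine-grained tokens."""
    return img.reshape(-1, C) @ W_s             # (4096, D) tokens

def align(g_tok, s_tok):
    """MultiPath Alignment sketch: pool structural tokens onto the global
    token grid, then fuse with a sigmoid gate (an assumed fusion rule)."""
    pooled = s_tok.reshape(g_tok.shape[0], -1, D).mean(axis=1)
    gate = 1 / (1 + np.exp(-(g_tok * pooled).sum(-1, keepdims=True)))
    return g_tok + gate * pooled

img = rng.random((H, W, C))
fused = align(global_path(img), structural_path(img))
print(fused.shape)   # (64, 32): unified tokens handed to the LLM for reasoning
```

The point of the sketch is the asymmetry: the global path discards resolution before tokenizing, while the structural path tokenizes at full resolution, so small-object evidence survives into the fused representation.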
Key Designs¶
- Dual-Path Visual Encoder:
- Function: Simultaneously captures global semantic context and high-resolution structural details.
- Mechanism: The global path employs a CLIP visual encoder to process low-resolution inputs for semantic features; the structural path employs a SAM encoder to process high-resolution inputs, preserving small-object and boundary cues. The two paths are complementary—CLIP excels at semantic understanding while SAM excels at fine-grained structural perception.
- Design Motivation: A single low-resolution encoder loses densely distributed small-object information in UAV imagery, whereas a single high-resolution encoder incurs prohibitive computational costs. The dual-path design balances semantic understanding with detail preservation.
- MultiPath Alignment Module:
- Function: Lightweight fusion of global semantic and local structural features.
- Mechanism: The semantic features from CLIP and the structural features from SAM are aligned into a unified representation space via a controlled integration scheme, enabling subsequent LLM reasoning to leverage both.
- Design Motivation: The two paths produce features at different scales and semantic levels; an effective alignment mechanism is required for the LLM to simultaneously exploit the advantages of both paths.
- DRSeg Dataset Construction Pipeline:
- Function: Provides 10K high-resolution UAV images with corresponding reasoning annotations.
- Mechanism: The construction follows four stages—manual selection of complex scene images → coarse mask generation via SAM2 followed by human refinement → GPT-5 generation of three-dimensional reasoning QA pairs (with CoT reasoning chains) conditioned on image, mask, and category → human review. The data are uniformly distributed across three reasoning dimensions: spatial, attribute, and scene (each approximately 33.3%).
- Design Motivation: Existing UAV datasets lack fine-grained annotations and reasoning-oriented textual supervision, making them insufficient to support systematic reasoning segmentation research.
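A DRSeg-style sample can be pictured as a record tying an image and mask to one reasoning dimension plus a CoT chain. The field names and example values below are assumptions for illustration, not the paper's released schema:

```python
# Hypothetical DRSeg-style record (assumed schema, not the official release).
sample = {
    "image": "uav_000123.jpg",
    "mask": "uav_000123_mask.png",
    "dimension": "spatial",          # one of: spatial / attribute / scene
    "question": "Segment the vehicle closest to the crosswalk.",
    "cot": ["Locate the crosswalk.", "Find nearby vehicles.",
            "Select the closest one."],
}

def dimension_split(records):
    """Count samples per reasoning dimension (the paper reports a roughly
    uniform ~33.3% split across the three dimensions)."""
    counts = {}
    for r in records:
        counts[r["dimension"]] = counts.get(r["dimension"], 0) + 1
    return counts

print(dimension_split([sample]))   # {'spatial': 1}
```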
Loss & Training¶
The model follows the standard LISA training paradigm: the LLM emits a dedicated segmentation token, whose hidden embedding conditions the mask decoder (embedding-as-mask). Training is performed via supervised fine-tuning (SFT).
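The embedding-as-mask mechanism can be sketched in a few lines of numpy. This assumes the LISA-style objective (per-pixel BCE plus Dice on the mask, added to the usual next-token loss, which is omitted here); dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

D, Hm, Wm = 16, 8, 8
h_seg = rng.standard_normal(D)                 # hidden state of the seg token
pix_feat = rng.standard_normal((Hm, Wm, D))    # decoder per-pixel features

# Embedding-as-mask: dot the seg-token embedding with pixel features.
logits = pix_feat @ h_seg                      # (8, 8) mask logits
prob = 1 / (1 + np.exp(-logits))

gt = (rng.random((Hm, Wm)) > 0.5).astype(float)
bce = -(gt * np.log(prob + 1e-8)
        + (1 - gt) * np.log(1 - prob + 1e-8)).mean()
dice = 1 - 2 * (prob * gt).sum() / (prob.sum() + gt.sum() + 1e-8)
mask_loss = bce + dice                         # added to the text CE loss
print(np.isfinite(mask_loss))   # True
```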
Key Experimental Results¶
Main Results¶
| Model | Attribute gIoU | Scene gIoU | Spatial gIoU |
|---|---|---|---|
| LISA-13B (zero-shot) | 52.65 | 47.08 | 42.85 |
| PixelLM-7B (zero-shot) | 46.87 | 43.07 | 41.28 |
| LISA-7B (SFT) | 59.22 | 54.45 | 57.33 |
| PixDLM (Ours) | 62.80 | 61.75 | 62.51 |
Ablation Study¶
| Configuration | Attr gIoU | Scene gIoU | Spatial gIoU |
|---|---|---|---|
| DRSeg + RRSIS-D + CoT | 61.13 | 55.60 | 60.55 |
| DRSeg + CoT (w/o RRSIS-D) | 62.80 | 61.75 | 62.51 |
| DRSeg (w/o CoT) | 62.51 | 61.67 | 61.98 |
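The gIoU reported in both tables is assumed to follow the LISA convention: the mean of per-image intersection-over-union. A minimal sketch under that assumption:

```python
import numpy as np

def giou(preds, gts):
    """Mean per-image IoU over a list of boolean masks (LISA-style gIoU)."""
    ious = []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

a = np.zeros((4, 4), bool); a[:2] = True      # top half predicted
b = np.zeros((4, 4), bool); b[:, :2] = True   # left half is ground truth
print(giou([a], [b]))   # 0.333...: intersection 4 px / union 12 px
```

Because every image contributes equally regardless of mask area, gIoU is less dominated by large objects than a cumulative IoU, which matters for UAV scenes dense with small targets.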
Key Findings¶
- PixDLM consistently outperforms both zero-shot and SFT baselines across all three reasoning dimensions, with particularly notable gains in scene-level reasoning (+7.3 vs. SFT LISA).
- Incorporating RRSIS-D data degrades performance, indicating that domain-specific UAV data is more critical than data volume.
- CoT reasoning supervision yields modest but consistent gains (+0.1 to +0.5 gIoU across the three dimensions), suggesting that most of the improvement comes from the architecture and domain-specific data rather than from the reasoning chains themselves.
Highlights & Insights¶
- Clear Task Definition: The semantic requirements of UAV reasoning segmentation are systematically organized into spatial, attribute, and scene dimensions, providing a clear framework for future research.
- Mature Data Construction Pipeline: The semi-automatic annotation pipeline combining GPT-5 generation with human review achieves a favorable balance between quality and scalability.
- Concise and Effective Dual-Path Design: Leveraging off-the-shelf CLIP and SAM encoders avoids the need to train a high-resolution encoder from scratch.
Limitations & Future Work¶
- Approximately 58% of instances are small objects (area < 2%), leaving substantial room for improvement on extremely small targets.
- Only a single target instance is annotated per image, precluding evaluation of multi-object reasoning scenarios.
- The computational overhead of the dual-path encoder is non-trivial and may be insufficient for real-time UAV applications.
- The dataset scale (10K images) is relatively limited; larger-scale data may yield further performance improvements.
Related Work & Insights¶
- vs. LISA: LISA relies on a single CLIP encoder, whereas PixDLM adds a SAM high-resolution path, yielding clear advantages in UAV small-object scenarios.
- vs. GeoPix/GeoPixel: These remote sensing models exploit geographic priors but lack open-vocabulary reasoning capability and perform poorly on densely packed small objects.
- vs. LLaVA-HR: LLaVA-HR similarly adopts a dual-path strategy for high-resolution processing, but PixDLM is specifically designed for pixel-level output.
Rating¶
- Novelty: ⭐⭐⭐⭐ First formal definition of the UAV reasoning segmentation task; the dataset and task formulation are pioneering contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-baseline comparisons with well-designed ablations.
- Writing Quality: ⭐⭐⭐⭐ Task definition and dataset construction are described in thorough and clear detail.
- Value: ⭐⭐⭐⭐ Provides an important benchmark and baseline for UAV visual understanding research.