Semantic Satellite Communications for Synchronized Audiovisual Reconstruction

Conference: CVPR 2026 arXiv: 2603.10791 Code: N/A Area: Video Generation Keywords: Satellite Communications, Semantic Transmission, Audiovisual Synchronization, Cross-Modal Generation, LLM Agent

TL;DR

This paper proposes an adaptive multimodal semantic satellite transmission system that flexibly switches transmission priorities via a dual-stream generative architecture (video-driven audio / audio-driven video). Combined with a dynamic knowledge-base update mechanism and an LLM agent for adaptive decision-making, the system achieves high-fidelity synchronized audiovisual reconstruction under stringent bandwidth constraints.

Background & Motivation

  1. Background: Satellite communications are indispensable in maritime, aviation, and disaster relief scenarios, yet face severe physical-layer challenges including free-space path loss (FSPL), Doppler shift, and long propagation delays. Conventional adaptive modulation and beamforming techniques struggle to support data-intensive multimodal streaming.
  2. Limitations of Prior Work: Semantic communications have made progress in text and image transmission, but video transmission remains a bottleneck due to its large data volume and temporal complexity. Existing methods are largely limited to unimodal video transmission and cannot satisfy practical audiovisual synchronization requirements. Existing cross-modal semantic transmission schemes (e.g., transmitting 3DMM parameters for video conferencing) adopt fixed modality priorities and cannot dynamically adjust to task requirements.
  3. Key Challenges: ① semantic-layer and physical-layer designs that are decoupled from each other, ② rigid cross-modal dependencies (unidirectional generation only), and ③ passive channel adaptation strategies that perform poorly over high-latency satellite links.
  4. Goal: Achieve flexible modality-priority switching, efficient knowledge-base maintenance, and environment-aware proactive decision-making in bandwidth-constrained satellite scenarios.
  5. Key Insight: Incorporating the reasoning and planning capabilities of LLMs into satellite semantic communications, enabling an LLM agent to understand task intent and physical constraints and dynamically adjust transmission strategies.
  6. Core Idea: Replace fixed-rule transmission with an LLM-driven dual-stream cross-modal generative architecture to realize adaptive audiovisual synchronized semantic communications in satellite scenarios.

Method

Overall Architecture

The system comprises three layers: an effectiveness layer (evaluating task quality metrics), a semantic layer (semantic feature extraction/encoding/decoding/cross-modal generation), and a technical layer (OFDM physical-layer transmission), along with a shared semantic knowledge base. Input audiovisual data is decomposed into video \(\mathbf{V}\) and audio \(\mathbf{A}\); semantic features are extracted from each and multiplexed into a unified data stream mapped to OFDM symbols, transmitted over uplink/downlink satellite channels, and reconstructed at the ground receiver.
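The transmit path above (decompose into modalities, extract semantics, multiplex into one stream, map to OFDM symbols) can be sketched as follows. This is a hypothetical illustration only: every function is a stand-in for one of the paper's modules, and the feature counts and the features-per-symbol ratio are assumptions, not the paper's values.

```python
# Hypothetical sketch of the transmit path; each function stands in for a module.

def extract_video_semantics(num_frames):
    # 3DMM parameters: 6 expression coefficients per frame (per the paper).
    return [("3dmm", 6 * num_frames)]

def extract_audio_semantics(num_tokens):
    # Text, phoneme, and duration streams (sizes here are assumptions).
    return [("text", num_tokens), ("phonemes", num_tokens), ("duration", num_tokens)]

def multiplex(*streams):
    # Merge per-modality semantic streams into one unified data stream.
    merged = []
    for s in streams:
        merged.extend(s)
    return merged

def ofdm_symbol_count(stream, feats_per_symbol=4):
    # Stand-in for semantic encoding + OFDM mapping (4 features/symbol assumed).
    total = sum(n for _, n in stream)
    return -(-total // feats_per_symbol)  # ceiling division

stream = multiplex(extract_video_semantics(25), extract_audio_semantics(40))
print(ofdm_symbol_count(stream))  # 270 features -> 68 symbols
```

The point of the sketch is the ordering: both modalities are reduced to compact semantics before multiplexing, so the symbol count scales with semantic feature counts rather than pixel or sample counts.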

Key Designs

  1. Dual-Stream Cross-Modal Generation Network

    • Function: Flexibly switches between "video-to-audio generation (V2A)" and "audio-to-video generation (A2V)" according to task requirements.
    • Mechanism: The V2A path transmits 3DMM parameters and text, first reconstructing video and then generating synchronized audio from lip features and text; the A2V path transmits audio semantics (text, phonemes, duration), first reconstructing audio and then predicting facial parameters via an Audio-to-3DMM module to drive video generation. Video semantic extraction uses a 3DMM model to obtain expression/rotation/translation parameters (retaining only the first 6 expression coefficients); audio semantic extraction uses Whisper-small for speech recognition and the Montreal Forced Aligner for acoustic feature extraction.
    • Design Motivation: Different tasks impose different modality requirements—surveillance scenarios require high video fidelity (V2A), while voice dispatch requires high audio clarity (A2V). Fixed priority schemes cannot accommodate the diverse service requirements of satellite communications.
  2. Temporal Semantic Codec

    • Function: Encodes extracted semantic features (3DMM parameters, text, phonemes, duration) into channel symbols and decodes them at the receiver.
    • Mechanism: Floating-point data (3DMM, duration) are embedded via linear projection, while token sequences (text, phonemes) use lookup-table embeddings; both are mapped to the same dimension \(E\). The encoder employs a Transformer architecture to exploit temporal correlations and resist channel fading; the decoder uses fully connected layers (MSE loss for floating-point outputs, cross-entropy loss for token outputs).
    • Design Motivation: Time-frequency fading in satellite channels disrupts the temporal coherence of semantic features, necessitating a dedicated temporal codec architecture to protect semantic information.
  3. Dynamic Knowledge Base Update Mechanism

    • Function: Balances bandwidth consumption against reconstruction quality, preventing knowledge base staleness from degrading generation quality.
    • Mechanism: A four-level decision mechanism (L0–L3) is designed—L0 checks user identity consistency (CSIM cosine similarity \(> \alpha_{CSIM}\)), L1 evaluates pixel-level visual quality (PSNR \(> \alpha_{PSNR}\)), L2 evaluates 3DMM semantic consistency (expression/rotation/translation distances), and L3 enforces an unconditional update. Reference frames are reused only when all preceding checks are satisfied; otherwise, a new frame is transmitted to update the knowledge base.
    • Design Motivation: Satellite bandwidth is limited, making frequent updates of high-dimensional images costly (each 256×256 image requires 16,384 symbols). However, stale reference frames cause significant degradation in generation quality, necessitating an on-demand update strategy.
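The embedding stage of the temporal semantic codec (design 2 above) can be sketched in a few lines: floating-point features (3DMM, duration) pass through a linear projection, token sequences (text, phonemes) through a lookup table, and both land in the same dimension \(E\) before the Transformer encoder. This is a minimal NumPy illustration, not the authors' implementation; the embedding dimension, vocabulary size, and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
E = 64  # shared embedding dimension (assumed value)

# Linear projection for floating-point features (6 3DMM expression coeffs/frame).
W_float = rng.standard_normal((6, E)) / np.sqrt(6)

# Lookup-table embedding for token features (phoneme vocabulary of 70, assumed).
lut = rng.standard_normal((70, E))

def embed_float(x):     # x: (T, 6) frames of 3DMM coefficients
    return x @ W_float  # -> (T, E)

def embed_tokens(ids):  # ids: (T,) phoneme indices
    return lut[ids]     # -> (T, E)

frames = rng.standard_normal((25, 6))
phonemes = rng.integers(0, 70, size=25)

v = embed_float(frames)
a = embed_tokens(phonemes)
print(v.shape, a.shape)  # both (25, 64): ready for a shared Transformer encoder
```

Mapping both feature types into one dimension is what lets a single Transformer encoder exploit temporal correlations across heterogeneous semantic streams.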
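The L0–L3 cascade of design 3 can be written as a short gate function. The threshold values and metric inputs below are placeholders (the paper defines thresholds \(\alpha_{CSIM}\), \(\alpha_{PSNR}\), and 3DMM distance checks but this sketch does not reproduce its numbers); the per-clip symbol counts are taken from the paper's tables.

```python
# Hypothetical sketch of the L0-L3 knowledge-base update cascade.
ALPHA_CSIM = 0.85   # identity-similarity threshold (assumed value)
ALPHA_PSNR = 28.0   # pixel-quality threshold in dB (assumed value)
ALPHA_3DMM = 0.10   # 3DMM parameter-distance threshold (assumed value)

SEMANTIC_SYMBOLS = 300     # per clip, from the paper's tables
KB_UPDATE_SYMBOLS = 16384  # one 256x256 reference frame

def needs_update(level, csim, psnr, dmm_dist):
    """Return True if the reference frame must be retransmitted.

    Checks are cumulative: a higher level runs all lower-level checks too.
    L3 forces an unconditional update (the paper's baseline)."""
    if level >= 3:
        return True
    if csim <= ALPHA_CSIM:                    # L0: identity consistency
        return True
    if level >= 1 and psnr <= ALPHA_PSNR:     # L1: pixel-level quality
        return True
    if level >= 2 and dmm_dist > ALPHA_3DMM:  # L2: 3DMM semantic consistency
        return True
    return False

def clip_cost(level, csim, psnr, dmm_dist):
    extra = KB_UPDATE_SYMBOLS if needs_update(level, csim, psnr, dmm_dist) else 0
    return SEMANTIC_SYMBOLS + extra

# A fresh, similar reference frame passes every check at L2 ...
print(clip_cost(2, csim=0.95, psnr=32.0, dmm_dist=0.05))  # 300
# ... while L3 always pays for a full reference-frame update.
print(clip_cost(3, csim=0.95, psnr=32.0, dmm_dist=0.05))  # 16684
```

The cascade orders the checks from cheapest/coarsest (identity) to finest (3DMM semantics), so a clip that passes early gates reuses the stored reference frame at semantic-symbol cost only.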

Loss & Training

  • Semantic codec: MSE loss for floating-point features (3DMM, duration); cross-entropy loss for token sequences (text, phonemes). Four codecs are trained independently for 400 epochs each.
  • V2A audio generation network: Joint optimization of pitch, energy, and Mel-spectrogram losses, \(\mathcal{L} = \|\hat{\mathbf{P}} - \mathbf{P}\|_2^2 + \|\hat{\mathbf{E}} - \mathbf{E}\|_2^2 + \|\hat{\mathbf{F}} - \mathbf{F}\|_F^2\), trained for 1,000 epochs.
  • LLM Agent: GPT-4o is employed and configured via prompt engineering, incorporating satellite environment knowledge and historical transmission logs for adaptive decision-making.
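The V2A loss above can be written out directly: squared \(\ell_2\) errors on the pitch \(\mathbf{P}\) and energy \(\mathbf{E}\) contours plus the squared Frobenius error on the Mel-spectrogram \(\mathbf{F}\). A NumPy sketch (array shapes are assumed, not the paper's):

```python
import numpy as np

def v2a_loss(P_hat, P, E_hat, E, F_hat, F):
    """Squared L2 errors on pitch and energy plus squared Frobenius
    error on the Mel-spectrogram, matching the loss formula above."""
    return (np.sum((P_hat - P) ** 2)
            + np.sum((E_hat - E) ** 2)
            + np.linalg.norm(F_hat - F, ord="fro") ** 2)

rng = np.random.default_rng(0)
T, M = 100, 80  # frames and Mel bins (assumed shapes)
P = rng.standard_normal(T)
E_ = rng.standard_normal(T)
F = rng.standard_normal((T, M))

# A perfect prediction drives the loss to zero.
print(v2a_loss(P, P, E_, E_, F, F))  # 0.0
```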

Key Experimental Results

Main Results

Datasets: LRS2 (40K training + 8K test) and a VoxCeleb subset, video resolution 256×256. Satellite channel model: NTN-TDL-A, altitude 300–1200 km.

| Method | Transmitted Symbols | Parameters (M) | Runtime (s/frame) |
| --- | --- | --- | --- |
| H264+LDPC | 400,991 | - | 0.033 |
| H265+LDPC | 54,390 | - | 0.013 |
| SVC | 600 | 60.11 | 0.019 |
| V2A | 600 | 540.9 | 0.171 |
| A2V | 600 | 477.13 | 0.115 |

At 12 dB SNR, V2A achieves AKD = 5.41, A2V achieves 5.85, and SVC achieves 8.36 (average keypoint distance; lower is better), while keypoints cannot be detected in the H264/H265 reconstructions at all due to facial blurring.

Ablation Study: Knowledge Base Update Strategy

| Update Level | Updates (per 100 clips) | Semantic Symbols | KB Update Symbols | Description |
| --- | --- | --- | --- | --- |
| L0 | 17 | 300 | 2,785 | Identity consistency only |
| L1 | 27 | 300 | 4,427 | + Pixel quality |
| L2 | 50 | 300 | 8,192 | + 3DMM semantics |
| L3 | 100 | 300 | 16,384 | Forced update (baseline) |

V2A-L2 achieves AKD = 5.8 at 12 dB, approaching L3's 4.8, while consuming only approximately 50% of L3's bandwidth.

Key Findings

  • V2A shows significant improvement in video reconstruction as bandwidth increases; A2V performs better on audio-related tasks but exhibits a performance ceiling in video reconstruction (additional bandwidth yields no further improvement).
  • Cross-modal generation methods (V2A, A2V) substantially outperform conventional H264/H265 at low SNR, demonstrating strong robustness.
  • The LLM agent achieves comparable performance to the lookup-table method on facial verification tasks while saving approximately 50% bandwidth.

Highlights & Insights

  • The flexibility of the dual-stream generative architecture is the most notable contribution: A2V enables "zero-symbol" video transmission, in which no video-specific symbols are sent and the video is generated entirely from the transmitted audio semantics, which is highly practical in extreme bandwidth-constrained scenarios.
  • The multi-level knowledge base update mechanism is elegantly designed, performing hierarchical checks from coarse to fine (identity → pixels → 3DMM semantics) to determine whether an update is needed, balancing generation quality against the roughly 50-fold symbol cost of a full reference-frame update (16,384 vs. 300 symbols per clip).
  • The LLM agent elevates rule-based matching to semantic understanding combined with proactive planning, a paradigm transferable to other communication system designs requiring multi-dimensional trade-offs.

Limitations & Future Work

  • Computational complexity is relatively high: V2A and A2V involve multiple large-scale generative networks, resulting in inference latency substantially higher than conventional methods.
  • The system is limited to talking-head video scenarios (relying on 3DMM) and cannot be directly extended to general video transmission.
  • The LLM agent's decision process is slow and depends on a cloud API (GPT-4o), potentially creating a practical deployment bottleneck in latency-sensitive satellite communications.
  • Lightweight local models or distilled compact models could be explored as replacements for GPT-4o to facilitate on-device decision-making.
Comparison with Related Work

  • vs. SVC [27]: SVC transmits video keypoints but lacks cross-modal capability and cannot dynamically adjust modality priority. The proposed system supports bidirectional cross-modal generation with LLM coordination.
  • vs. DeepWiVe [59]: DeepWiVe is an end-to-end JSCC scheme whose reconstruction quality is limited by pixel-level approximation, resulting in blurring at low compression rates. The proposed system leverages generative semantics to drastically reduce bandwidth requirements.
  • vs. [57,58]: These works support only unidirectional fixed-path V2A or A2V generation, respectively, whereas the proposed system enables bidirectional flexible switching.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematic integration of dual-stream generation and LLM agent is innovative, though individual modules largely rely on existing pretrained models
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-dimensional comparisons with case studies; real satellite experiments are absent
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with thorough system description
  • Value: ⭐⭐⭐⭐ Provides a complete system paradigm for satellite semantic communications, though the application scope is relatively narrow