
🎞️ ECCV2024 Accepted Papers

869 ECCV2024 paper notes covering 3D Vision (181), Image Generation (117), Human Understanding (54), Segmentation (54), Autonomous Driving (53), Video Understanding (51), Multimodal VLM (44), Image Restoration (32), and 42 other areas. Each note covers the TL;DR, motivation, method, experiments, highlights, and limitations, so the core ideas can be read in about 5 minutes.

📊 LLM Evaluation (19)

ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization: This paper proposes ColorMNet, a memory-based deep spatial-temporal feature propagation network. By incorporating three components—Pre-trained Vision-Guided Feature Extraction (PVGFE), Memory-based Feature Propagation (MFP), and Local Attention (LA)—this method achieves video colorization performance superior to state-of-the-art (SOTA) models while significantly reducing GPU memory consumption to only 1.9 GB.
Deep Cost Ray Fusion for Sparse Depth Video Completion: This paper proposes the RayFusion framework, which achieves temporal fusion by applying self-attention and cross-attention along the ray direction on the cost volume. With only 1.15M parameters, it outperforms or matches state-of-the-art sparse depth completion methods across three datasets: KITTI, VOID, and ScanNetV2.
Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams: A Distribution Alignment (DA) loss is proposed to pull the test-time feature distribution back to the source domain distribution. Combined with a domain shift detection mechanism, this method significantly outperforms existing TTA methods under non-i.i.d. dynamic data streams and continuous domain shift scenarios.
Eliminating Warping Shakes for Unsupervised Online Video Stitching: Defines a new problem in video stitching termed "warping shake" (temporal shaking in non-overlapping regions when extending image stitching to video). This work proposes StabStitch, the first unsupervised online video stitching framework. By simultaneously generating and smoothing stitching trajectories, it achieves both video stitching and stabilization, reaching a real-time speed of 28.2 ms/frame.
EvSign: Sign Language Recognition and Translation with Streaming Events: This work constructs the first event-camera benchmark dataset, EvSign, for Continuous Sign Language Recognition (CSLR) and Sign Language Translation (SLT) tasks, and proposes an efficient sparse Transformer-based framework that achieves comparable or superior performance to SOTA RGB methods using only 0.34% FLOPs and 44.2% of the parameters.
Gradient-Regularized Out-of-Distribution Detection: This paper proposes GReg/GReg+, which learns the local smoothness of the scoring manifold by regularizing the input gradient norm of the OOD scoring function, and incorporates an energy-score-based clustering sampling strategy to select highly informative auxiliary samples, achieving SOTA on CIFAR and ImageNet OOD detection benchmarks.
Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning: This paper proposes IFMatch, which introduces feature-level perturbation and constructs a three-branch structure based on the traditional image-level weak-to-strong consistency paradigm. By employing a confidence strategy to distinguish naive/hard samples, IFMatch significantly enhances the performance of existing methods (e.g., FixMatch and FreeMatch) on multiple SSL benchmarks.
Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems: A solution for the Electromagnetic Inverse Scattering Problem (EISP) is proposed based on Implicit Neural Representations (INR). By modeling the relative permittivity of the scatterer as a continuous implicit representation and optimizing it within a forward framework, this approach effectively avoids the difficulties of inverse estimation and low-resolution issues caused by discretization.
Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation: This paper proposes a noise-rate estimation method based on a probabilistic graphical model, which automatically estimates the label noise rate of the training set. The estimated values are used to guide the curriculum design of sample selection strategies. It can be seamlessly integrated into state-of-the-art (SOTA) noisy-label learning methods such as DivideMix and InstanceGM, improving their classification accuracy on both synthetic and real-world benchmarks.
Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence: The LFTL (Learn from the Learnt) framework is proposed. Consisting of two core modules—Contrastive Active Sampling (CAS) and Visual Persistence-guided Adaptation (VPA)—it achieves highly efficient domain adaptation under source-free and extremely low target annotation budgets (\(\le 5\%\)), reaching 87.4% accuracy on VisDA-C with only 1% annotation.
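
For readers unfamiliar with the energy score referenced in the Gradient-Regularized Out-of-Distribution Detection entry above, here is a minimal sketch of the commonly used energy-based score \(E(x) = -T \log \sum_i \exp(z_i / T)\) computed from classifier logits \(z\). It is not the GReg implementation: the function name, the temperature default, and the example logits are illustrative assumptions, and the gradient-norm regularizer described in the entry is not shown.

```python
import math

def energy_score(logits, temperature=1.0):
    """Return -T * logsumexp(logits / T); lower energy suggests in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse

# Hypothetical logits: a peaked prediction versus a flat one.
confident = energy_score([10.0, 0.0, 0.0])
uncertain = energy_score([0.0, 0.0, 0.0])
print(confident < uncertain)  # peaked logits give the lower energy
```

Under this convention, a flat logit vector over three classes has energy \(-\log 3\), while a peaked one has a much lower energy.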

Browse all 19 LLM Evaluation papers →

📚 Pretraining (8)

Cross-Domain Learning for Video Anomaly Detection with Limited Supervision: Proposes a weakly-supervised cross-domain learning (CDL) framework that integrates unlabeled external videos into training via an uncertainty-driven pseudo-labeling mechanism, significantly improving the cross-domain generalization capability of video anomaly detection.
DragAPart: Learning a Part-Level Motion Prior for Articulated Objects: DragAPart proposes an image generator that uses dragging as an interactive interface, capable of responding to part-level interactions (such as opening/closing drawers/doors) rather than merely moving the entire object. Through the new synthetic dataset Drag-a-Move, multi-resolution drag encoding, and domain randomization strategies, the model generalizes well to real images and unseen categories despite being trained solely on synthetic data.
I Can't Believe It's Not Scene Flow!: Reveals that the catastrophic failure of existing scene flow methods on small objects like pedestrians is masked by current evaluation metrics, and proposes a category-aware and velocity-normalized Bucket Normalized EPE evaluation protocol, alongside a simple yet SOTA baseline, TrackFlow (generating scene flow from a detector + tracker), achieving a 1.5x improvement in pedestrian motion description.
Learning to Obstruct Few-Shot Image Classification over Restricted Classes: The Learning to Obstruct (LTO) algorithm is proposed, which modifies pre-trained backbone parameters via a MAML-like meta-learning approach to make them a "bad initialization" for specific restricted classes. This hinders the fine-tuning performance of few-shot classification methods on restricted classes while maintaining normal performance on other classes.
Plan, Posture and Go: Towards Open-Vocabulary Text-to-Motion Generation: This paper proposes PRO-Motion, a divide-and-conquer framework that decomposes text-to-motion generation into three stages: LLM-driven motion planning (Plan), script-based posture diffusion generation (Posture), and global translation and rotation estimation (Go). By reducing the complexity of each stage, it achieves high-quality open-vocabulary motion generation.
PreLAR: World Model Pre-training with Learnable Action Representation: This paper proposes PreLAR to bridge the gap between action-free pre-training and action-conditioned fine-tuning for world models. By encoding implicit action representations from adjacent frames and designing an action-state consistency loss during unsupervised pre-training on action-free videos, PreLAR significantly improves the sample efficiency of downstream visual control tasks.
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning: This paper proposes the PLID method, which leverages sentence-level category descriptions generated by LLMs to construct language-knowledge-driven Gaussian distributions. Combined with vision-language primitive decomposition and randomized logit fusion, it achieves state-of-the-art (SOTA) performance on the Compositional Zero-Shot Learning (CZSL) task.
Scaling Backwards: Minimal Synthetic Pre-training?: Proposes 1p-frac—achieving pre-training performance comparable to the ImageNet-1k level using minute perturbations of a single fractal image. This challenges the conventional wisdom that "pre-training requires large-scale datasets" and reveals that the essence of pre-training might be closer to weight initialization than visual concept learning.

💬 LLM (Other) (11)

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection: By concurrently incorporating static (globally shared) and dynamic (instance-specifically generated) learnable prompts into CLIP, and using auxiliary anomaly detection data for optimization, this method establishes a zero-shot SOTA on 14 industrial and medical anomaly detection datasets. The core innovation lies in the hybrid prompt design that achieves dual-tier adaptation at both the "task" and "instance" levels.
APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension: This paper proposes APL (Anchor-based Prompt Learning), which designs an Anchor-based Prompt Encoder (APE) to generate distinctive prompts across three categories: location, color, and category. By dynamically integrating these prompts into anchor features to enrich visual semantics, alongside text reconstruction and visual alignment losses, APL achieves precise vision-language alignment. It outperforms existing weakly supervised methods on four REC benchmarks (e.g., exceeding RefCLIP by 6.44% on RefCOCO).
Cultural Value Differences of LLMs: Prompt, Language, and Model Size: This paper systematically investigates the behavioral patterns of LLMs in expressing cultural values utilizing the Hofstede cultural dimensions questionnaire. It finds that prompt language (Chinese vs. English) and model size have a far greater impact on cultural value disparities than differences in model architecture and question order.
FreestyleRet: Retrieving Images from Style-Diversified Queries: This work proposes the first Style-Diversified Query-Based Image Retrieval (Style-Diversified QBIR) task and the DSR dataset. It designs FreestyleRet, a lightweight, plug-and-play framework that extracts texture/style features of queries using Gram matrices to construct a style space. These style features then initialize prompt tokens, enabling a frozen vision encoder to adapt to various query styles such as texts, sketches, low-resolution images, and artistic paintings.
FunQA: Towards Surprising Video Comprehension: The authors construct a large-scale counter-intuitive video question answering benchmark, FunQA (consisting of 4.3K videos and 312K QA pairs), covering three categories of surprising videos: Humor, Creativity, and Magic. They also propose the FunMentor agent, which enhances the counter-intuitive reasoning capabilities of VLMs through multi-turn dialogue.
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts: The authors propose PromptIQA, which uses a small number of "image-score pairs" (ISPs) as prompts. This allows the trained NR-IQA model to adapt to new quality assessment requirements without fine-tuning, achieving SOTA performance and generalization capabilities across 12 datasets and 5 categories of IQA tasks.
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos: VidAssist proposes a three-step framework of "Propose-Assess-Search", leveraging LLMs as a knowledge base and evaluation tool combined with a breadth-first search algorithm. It outperforms fully supervised SOTA in a zero/few-shot manner on goal-oriented planning tasks in instructional videos, achieving a +7.7% SR improvement on COIN compared to the fully supervised VLaMP in the few-shot setting.
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression: This paper proposes the Error-Guided Feature Selection (EGFS) mechanism, which leverages low reprojection error regions as point prompts for SAM to expand into semantic masks. By iteratively filtering reliable training samples, the method outperforms existing 3D-free SCR methods on the Cambridge Landmarks and Indoor6 datasets with a smaller model size and less training time.
RoadPainter: Points Are Ideal Navigators for Topology Transformer: RoadPainter is proposed, which adopts a two-stage strategy of first regressing lane centerline points and then refining them using instance masks. Combined with a hybrid attention mechanism and a real-virtual lane separation strategy, it achieves SOTA topology inference performance on the OpenLane-V2 dataset.
Stripe Observation Guided Inference Cost-Free Attention Mechanism: By deeply analyzing the stripe pattern phenomenon in the attention weight matrices of Transformers, this paper proposes an attention enhancement mechanism that completely eliminates additional computational cost during the inference phase. By training an auxiliary module to learn stripe-guided attention modulation during the training phase, and re-parameterizing it into the standard attention weights during inference, this method achieves a "free lunch" style performance boost.
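
The FreestyleRet entry above mentions Gram matrices as a way to extract texture/style features. As a rough illustration, a Gram matrix records the inner products between channels of a feature map. The sketch below assumes features are given as C channels of N flattened activations and normalizes by N; both the input layout and the normalization are assumptions for illustration, not details taken from the paper.

```python
def gram_matrix(features):
    """features: list of C channels, each a list of N flattened activations.

    Returns a C x C matrix whose (i, j) entry is the mean product of
    channel i and channel j over all spatial positions.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

# Hypothetical 2-channel feature map with 3 spatial positions.
feats = [[1.0, 2.0, 3.0],
         [0.0, 1.0, 0.0]]
g = gram_matrix(feats)
```

Because the spatial positions are summed out, the result depends on which channels co-activate rather than where they activate, which is why it is often used as a style descriptor.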

Browse all 11 LLM (Other) papers →

🎨 Image Generation (117)

2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction: 2S-ODIS utilizes a pre-trained VQGAN (without fine-tuning) to synthesize panoramic images via a two-stage architecture: the first stage generates a low-resolution coarse ERP image, and the second stage corrects geometric distortions by generating and fusing 26 NFoV perspective images. This reduces training time from 14 days to 4 days while achieving superior image quality.
A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks: Proposes IF-GMI, which decomposes the generator of a pre-trained StyleGAN2 into multiple blocks and optimizes intermediate features layer-by-layer (incorporating an \(\ell_1\)-ball constraint to prevent image collapse). This expands the search space of model inversion attacks from the latent space to intermediate features, boosting attack accuracy in OOD scenarios by up to 38.8%.
A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control: Controllable generation of 3D multi-tissue coronary artery segmentation maps is achieved using Latent Diffusion Models (LDM). Topological interaction loss ensures anatomical plausibility, and decoupled control over cross-sectional morphology and branch structure is obtained through dual-channel morpho-skeletal conditioning. Additionally, Adaptive Null Guidance (ANG) is proposed to efficiently enhance conditional fidelity using a non-differentiable regressor, ultimately supporting counterfactual anatomical editing for finite element simulation.
A High-Quality Robust Diffusion Framework for Corrupted Dataset: This paper proposes the RDUOT framework, which integrates Unbalanced Optimal Transport (UOT) into a diffusion model (DDGAN) for the first time. By learning \(q(x_0|x_t)\) instead of \(q(x_{t-1}|x_t)\), it effectively filters outliers in training data, achieving robust generation on corrupted datasets while outperforming the DDGAN baseline on clean datasets.
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation: This paper proposes AccDiffusion, which decouples global text prompts into patch-level content-aware prompts (utilizing cross-attention maps to determine whether each word belongs to a specific patch) and introduces dilated sampling with window interaction to improve global consistency. Without requiring extra training, this approach effectively solves the object duplication issue in patch-wise high-resolution image generation, achieving high-quality, duplication-free image extrapolation from 2K to 4K resolutions on SDXL.
AdaDiffSR: Adaptive Region-Aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution: Observing that the required denoising steps for different image regions in diffusion-based super-resolution vary significantly (background regions converge early while foreground textures still need iterations), this work proposes a dynamic step-skipping strategy based on Multi-Metric Latent Entropy (MMLE) to perceive information gain. Sub-regions are categorized into stable, growth, and saturated types, each assigned different step sizes. Concurrently, a Progressive Feature Injection (PFJ) module is developed to balance fidelity and realism. On datasets such as DRealSR, this approach achieves reconstruction quality comparable to StableSR while reducing inference time and FLOPs by 1.5\(\times\) and 2.7\(\times\), respectively.
AdaGen: Learning Adaptive Policy for Image Synthesis: This paper unifies step-level parameter scheduling (temperature, mask ratio, CFG scale, timestep, etc.) of multi-step generative models (MaskGIT/AR/Diffusion/Rectified Flow) as an MDP. A lightweight RL policy network is used to achieve sample-adaptive scheduling, and an adversarial reward design is proposed to prevent policy overfitting, consistently improving performance across four generative paradigms (e.g., VAR FID \(1.92 \rightarrow 1.59\), and reducing the inference cost of DiT-XL by 3x with superior performance).
AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation: This work proposes AdaNAT, which models the generation policy configuration of Non-Autoregressive Transformers (NAT) as an MDP. Utilizing a lightweight policy network combined with PPO reinforcement learning and an adversarial reward model, AdaNAT automatically customizes generation policies (re-masking ratio, sampling temperature, CFG weights, etc.) for each sample. It achieves an FID of 2.86 on ImageNet-256 using only 8 steps, yielding an approximate 40% relative improvement over hand-crafted policies.
AFreeCA: Annotation-Free Counting for All: By leveraging Stable Diffusion to generate synthetic sorting/counting data, this work implements a two-stage strategy of learning sorting before anchoring counts, combined with density-guided image partitioning. This enables the first annotation-free counting method applicable to objects of arbitrary categories, outperforming existing unsupervised methods in crowd counting.
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation: Proposes AnyControl, which supports arbitrary combinations of multiple spatial control signals (depth, edge, segmentation, pose) via a Multi-Control Encoder featuring an alternating fusion and alignment block structure, outperforming existing methods on the COCO multi-control benchmark with an FID of 44.28.

Browse all 117 Image Generation papers →

🎬 Video Generation (14)

BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering: BlazeBVD is proposed, which leverages classical Scale-Time Equalization (STE) in the illumination histogram space to extract deflickering priors (filtered illumination maps, exposure maps, and flickering frame indices). This simplifies complex video space-time learning into frame-by-frame processing using a 2D spatial network coupled with a lightweight 3D temporal consistency network. It achieves SOTA quality on blind video deflickering while speeding up inference by more than 10 times compared to baselines.
DragAnything: Motion Control for Anything using Entity Representation: This paper proposes DragAnything, which utilizes the latent space features of diffusion models as Entity Representations to achieve entity-level motion control. It addresses the issue of existing trajectory-driven methods only dragging pixels without being able to precisely control the motion of target objects. DragAnything achieves state-of-the-art (SOTA) FVD/FID metrics on VIPSeg, outperforming DragNUWA by 26% in motion control votes in a user study.
DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing: DreamMotion, a zero-shot video editing framework based on Score Distillation, is proposed. By employing space-time self-similarity regularization, it injects target appearances while preserving the structural and motion integrity of the original video, applicable to both cascaded and non-cascaded video diffusion models.
Evaluating Text-to-Visual Generation with Image-to-Text Generation: The authors propose VQAScore, which uses Visual Question Answering (VQA) models instead of CLIP to evaluate text-to-visual generation quality. VQAScore significantly outperforms CLIPScore on complex compositional prompts, and the authors also release the GenAI-Bench benchmark.
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation: This paper is the first to explore the visual features of pre-trained text-to-video (T2V) diffusion models for video understanding tasks. It proposes the VD-IT framework, which extracts visual features with superior temporal-semantic consistency from a frozen T2V diffusion model using two key designs: text-guided image projection and video-specific noise prediction. VD-IT outperforms state-of-the-art methods using discriminatively pre-trained video backbones (such as Video Swin Transformer) across four major R-VOS benchmarks.
FreeInit: Bridging Initialization Gap in Video Diffusion Models: This work identifies a training-inference initialization discrepancy in video diffusion models (where low-frequency information leakage during training leads to temporally correlated initial noise, whereas uncorrelated Gaussian noise is used during inference). It proposes FreeInit, which bridges this gap by iteratively refining the spatiotemporal low-frequency components of the initial noise, thereby significantly improving the temporal consistency of generated videos.
Kalman-Inspired Feature Propagation for Video Face Super-Resolution: This paper proposes the KEEP framework, which leverages Kalman filtering principles to recursively fuse prior information from previous frames with observations of the current frame in the latent space. This achieves high-fidelity reconstruction of facial details and ensures temporal consistency in video face super-resolution, outperforming the previous state-of-the-art method by 0.8 dB in PSNR on the VFHQ dataset.
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing: This paper proposes MagDiff, the first multi-alignment diffusion model that unifies video generation and editing. Through three mechanisms—subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment—MagDiff achieves high-quality video generation and editing simultaneously within a single, tuning-free framework.
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model: This paper proposes MOFA-Video, which equips a frozen image-to-video diffusion model (SVD) with controllable motion capabilities by designing multiple domain-specific motion field adapters (MOFA-Adapters). It supports various control signals and their combinations, such as hand-drawn trajectories and facial landmarks, to achieve open-domain controllable image animation.
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation: Estimates spatially-varying Young's modulus material fields for static 3D Gaussian objects by leveraging physical dynamics priors implicit in video generation models, enabling physically plausible interactive 3D dynamics synthesis.
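
The Kalman-Inspired Feature Propagation entry above describes recursively fusing a prior from previous frames with the current observation. The textbook scalar Kalman update below illustrates that idea in its simplest form. It is a sketch under stated assumptions, not the KEEP architecture: the identity dynamics, the noise variances `q` and `r`, and the function name are all hypothetical, and KEEP operates on latent features rather than scalars.

```python
def kalman_step(x_prev, p_prev, z, q=0.01, r=0.1):
    """One predict/update cycle of a scalar Kalman filter.

    x_prev, p_prev: previous state estimate and its variance (the prior).
    z: current noisy observation; q, r: process and observation noise variances.
    """
    # Predict: identity dynamics, so only the uncertainty grows.
    x_pred = x_prev
    p_pred = p_prev + q
    # Update: the gain weighs the observation against the prediction.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k

# With equal prior and observation variance the estimate lands halfway.
x, p, k = kalman_step(x_prev=0.0, p_prev=1.0, z=2.0, q=0.0, r=1.0)
```

A larger observation variance `r` lowers the gain, so the estimate leans more on the previous frames; a smaller `r` makes it follow the current frame more closely.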

Browse all 14 Video Generation papers →

🧩 Multimodal VLM (44)

A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis: This work constructs the CDDM dataset containing 137k crop disease images and 1 million question-answering pairs, and proposes a strategy to apply LoRA fine-tuning simultaneously to the vision encoder, adapter, and language model. This enables Qwen-VL-Chat and LLaVA to leap from single-digit accuracy to over \(90\%\) in crop disease diagnosis.
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting: The AdaShield framework is proposed, which comprises a meticulously designed static defense prompt (AdaShield-S) and an LLM-based adaptive iterative optimization framework (AdaShield-A). Without fine-tuning MLLMs or training additional modules, it effectively defends against structure-based jailbreak attacks, reducing the attack success rate from over 75% to below 15% while maintaining normal task performance.
AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization: The AddressCLIP framework is proposed, which models the Image Address Localization (IAL) problem as an end-to-end vision-language alignment task through two core components: image-text alignment (contrastive learning of address and scene descriptions) and image-geography matching (manifold learning based on GPS distance). It achieves a Top-1 accuracy of up to 85.92% on three self-constructed IAL datasets.
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling: This paper proposes to reformulate the visual attribute recognition problem as a sentence generation probability problem under an image-conditioned Prefix Language Model (PrefixLM). By replacing traditional "contrastive retrieval" with "generative retrieval", the model explicitly captures the conditional dependency between objects and attributes, significantly outperforming contrastive methods on both the VAW and the newly proposed VGARank datasets.
Attention Prompting on Image for Large Vision-Language Models: This paper proposes Attention Prompting on Image (API), which utilizes an auxiliary VLM (CLIP or LLaVA) to generate attention attribution maps based on text queries. These maps are overlaid as heatmaps onto the original image to guide the LVLM to focus on relevant regions. API improves LLaVA-1.5 by up to 3.8% on MM-Vet and is widely effective across various LVLMs, including GPT-4V.
BLINK: Multimodal Large Language Models Can See but Not Perceive: Introduces BLINK—a multimodal evaluation benchmark containing 14 classic computer vision perception tasks (3,807 multiple-choice questions) that humans can solve "in a blink" (95.7% accuracy), but the strongest GPT-4V achieves only 51.26% (only 13.17% above random guessing), revealing a severe deficiency of current MLLMs in core visual perception capabilities.
BRAVE: Broadening the Visual Encoding of Vision-Language Models: This paper systematically analyzes the impact of different visual encoders (CLIP, DINOv2, EVA-CLIP, etc.) on VLM performance, finding that no single encoder is optimal across all tasks. Based on this, the BRAVE method is proposed, which utilizes a lightweight MEQ-Former to fuse features from multiple frozen encoders into a compact representation. Consequently, it achieves SOTA results on captioning and VQA tasks with only 116M trainable parameters while significantly reducing visual hallucinations.
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios: This paper proposes the CAT model, which captures fine-grained audio-visual features via a question-aware Clue Aggregator. Combined with a hybrid multimodal training strategy and an AI-assisted Vagueness-aware Direct Preference Optimization (ADPO) strategy, it significantly improves MLLM question-answering accuracy in dynamic audio-visual scenarios, achieving SOTA performance on multiple AVQA benchmarks.
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts: From the perspective of causal generative models, this paper proposes CLAP (Contrastive Learning with Augmented Prompts). It trains a lightweight disentanglement network using text prompt augmentation and contrastive learning to separate content and style within CLIP pre-trained features. Trained solely on text, CLAP simultaneously improves representation quality for both image and text modalities, achieving consistent gains in zero-shot classification, few-shot classification, and adversarial robustness.
Dataset Growth (InfoGrowth): InfoGrowth is proposed as an efficient online data cleaning and selection algorithm. By estimating the information gain of each sample through nearest neighbor search, it enables continuous dataset growth while maintaining cleanliness and diversity, outperforming full training on CC3M using only 1/6 of the data.
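
The InfoGrowth entry above describes estimating each sample's information gain through nearest neighbor search. The toy sketch below shows one way such a filter could look: a sample is kept only if it is sufficiently dissimilar from its nearest neighbor in the current pool. The novelty definition (one minus the maximum cosine similarity), the threshold, and the brute-force search are stand-in assumptions for illustration and are not taken from the paper; embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def novelty(candidate, pool):
    """Hypothetical gain proxy: 1 - similarity to the nearest pooled sample."""
    if not pool:
        return 1.0
    return 1.0 - max(cosine(candidate, p) for p in pool)

def grow(pool, stream, threshold=0.1):
    """Append streamed embeddings that are novel enough relative to the pool."""
    for emb in stream:
        if novelty(emb, pool) > threshold:
            pool.append(emb)
    return pool

# The near-duplicate second sample is skipped; the orthogonal third is kept.
kept = grow([], [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
```

A practical system would replace the brute-force loop with an approximate nearest neighbor index so that the cost per incoming sample stays low as the pool grows.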

Browse all 44 Multimodal VLM papers →

🧠 VLM Reasoning (1)

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models: NavGPT-2 closes the performance gap between LM-based agents and VLN-specific models while retaining the LLM's interpretable navigational reasoning capabilities by feeding the hidden layer representations of a frozen LLM into a topological map navigation policy network as vision-language features, showcasing excellent data efficiency.

⚡ VLM Efficiency (4)

Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding: This work proposes ClassAct/ActiveCLIP, which utilizes small, low-cost proxy models to compute "learnability" scores for data points to prioritize training data. This reduces training updates for large-scale visual classifiers and multimodal models by 46% and 51% respectively, achieves up to 25% total compute savings, and stands as the first active learning method to achieve net positive compute savings in large-scale pre-training.
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models: Groma proposes a new paradigm that embeds localization capabilities directly into the visual tokenization process. By discovering regions of interest (ROIs) via a region proposer and encoding them into region tokens, Groma enables MLLMs to perform high-accuracy referring and grounding without relying on LLM-generated coordinates or external modules. It also leverages GPT-4V with visual prompting to construct Groma Instruct, the first grounded chat dataset featuring dual visual-textual prompts.
IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language Models: IVTP proposes utilizing textual instruction information to dynamically assess the importance of each visual token and prune redundant tokens during the inference of Large Vision-Language Models (LVLMs). This achieves task-related adaptive visual information compression, significantly reducing computational overhead while maintaining or even improving model performance.
Quantized Prompt for Efficient Generalization of Vision-Language Models: By treating quantization error as a form of regularization noise, this work applies ultra-low-bit quantization (down to 1-bit) to the learnable prompts of VLMs. This significantly reduces storage overhead (up to \(16\times\) compression) while markedly improving the model's generalization capability to unseen classes. QCoOp achieves superior performance over various state-of-the-art (SOTA) methods using only 0.26KB of storage.
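
To make the idea of 1-bit quantization in the entry above concrete, here is a minimal sketch of a generic sign-plus-scale scheme: each value is stored as a single sign bit, together with one shared scale equal to the mean absolute value. This is a common textbook scheme and is assumed here only for illustration; the quantizer actually used by QCoOp may differ, and the function names and example values are hypothetical.

```python
def quantize_1bit(values):
    """Compress values to signs (+1/-1) plus one shared scale."""
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return signs, scale

def dequantize(signs, scale):
    """Reconstruct approximate values from signs and the shared scale."""
    return [s * scale for s in signs]

# Hypothetical prompt parameters.
prompt = [0.5, -1.5, 1.0, -1.0]
signs, scale = quantize_1bit(prompt)
restored = dequantize(signs, scale)

# The gap between original and restored values is the quantization error,
# which the entry describes as being treated like regularization noise.
errors = [p - r for p, r in zip(prompt, restored)]
```

As a back-of-the-envelope check, if the original values were stored in 16 bits each, keeping one bit per value would shrink storage roughly 16-fold, ignoring the single shared scale.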

🎵 Audio & Speech (8)

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos: AV-LDM is proposed to implicitly disentangle foreground action sounds and background ambient sounds by introducing audio from different time segments of the same video as an ambient sound condition during training. Combined with retrieval-augmented generation (RAG) to select appropriate ambient sound conditions during inference, it significantly outperforms existing methods on Ego4D and EPIC-KITCHENS.
Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation: The Beat-It framework is proposed to achieve beat-synchronized and keyframe-controllable 3D dance generation by decoupling beat conditions from music and designing a hierarchical multi-condition fusion mechanism, significantly outperforming existing methods on AIST++.
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing: This work proposes the CoLeaF dual-branch learning framework, which explicitly optimizes the integration of cross-modal context through event-aware contrastive learning, achieving an average improvement of 1.9% F-score on the weakly supervised audio-visual video parsing task.
ControlLLM: Augment Language Models with Tools by Searching on Graphs: This paper proposes the ControlLLM framework, which plans multimodal tool execution by performing graph search (Thoughts-on-Graph) on a pre-built Tool Graph. This significantly improves the accuracy of tool selection and parameter assignment in complex tasks.
Label-Anticipated Event Disentanglement for Audio-Visual Video Parsing: This paper proposes the LEAP (Label semantic-based Projection) decoding paradigm, which utilizes the text embeddings of event categories as semantic anchors. Using a cross-modal attention mechanism, potentially overlapping event semantics within audio/visual latent features are disentangled into independent label embeddings. Combined with an EIoU-based audio-visual semantic similarity loss, LEAP achieves SOTA performance on the AVVP task.
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics: This paper proposes the Latent-INR framework. By learning an implicit latent code for each video frame and combining it with a hypernetwork for low-rank weight modulation, the framework decouples the spatial and temporal modeling of video INR. While maintaining competitive compression performance, it equips video representations with semantic discriminative capabilities, supporting various downstream tasks such as retrieval, video frame interpolation, and arbitrary-resolution inference.
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation: This paper proposes CSTS (Contrastive Spatial-Temporal Separable), an audio-visual fusion method that introduces audio signals to egocentric gaze anticipation for the first time. It models spatial co-occurrence and temporal correlation of audio-visual signals separately through spatial and temporal separable fusion modules, and enhances representations using post-fusion contrastive learning, surpassing SOTA on Ego4D and Aria datasets.
Siamese Vision Transformers are Scalable Audio-Visual Learners: The AVSiam framework is proposed, which uses a single weight-shared ViT backbone to simultaneously process both audio and visual inputs. Combined with a multi-ratio random masking strategy and a dual-objective pre-training scheme (contrastive plus reconstruction), AVSiam achieves state-of-the-art (SOTA) performance on audio-visual classification and retrieval at an extremely low cost (28.9 times faster than MAViL).

🧊 3D Vision (181)¶

3D Congealing: 3D-Aware Image Alignment in the Wild: 3D Congealing aligns a set of unannotated, semantically similar internet images into a shared 3D canonical space. By combining SDS guidance from a pre-trained diffusion model to obtain the 3D shape and DINO semantic feature matching to estimate poses and coordinate mappings, it requires no templates, pose annotations, or camera parameters.
3D Reconstruction of Objects in Hands without Real World 3D Supervision: This paper proposes the HORSE framework, which trains an occupancy network to reconstruct the 3D shape of hand-held objects from a single RGB image. This is achieved by extracting multi-view 2D mask supervision from in-the-wild videos (using hand pose as an object pose proxy) and learning a 2D slice adversarial shape prior from a synthetic 3D shape collection. Without using any real-world 3D annotations, it outperforms 3D-supervised methods by 11.6% on the MOW dataset.
3D Single-Object Tracking in Point Clouds with High Temporal Variation: HVTrack is the first to explore 3D single-object tracking under high temporal variation scenarios. It addresses coordinate-wise cloud shape variations, distractor interference, and background noise via three modules: Relative-Pose-Aware Memory (RPM), Base-Expansion Feature Cross-Attention (BEA), and Contextual Point Guided Self-Attention (CPA). On the KITTI-HV dataset with a 5-frame interval, it improves Success/Precision by 11.3%/15.7% over the state-of-the-art (SOTA).
3DEgo: 3D Editing on the Go!: 3DEgo compresses the traditional three-stage 3D editing pipeline (COLMAP pose estimation \(\rightarrow\) unedited scene initialization \(\rightarrow\) iterative editing and update) into a single-stage framework: first performing multi-view consistent 2D editing on video frames using an autoregressive noise blending module, and then directly reconstructing the 3D scene from the edited frames using COLMAP-free 3DGS, boosting the speed by approximately 10x and supporting videos from arbitrary sources.
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting: 3iGS replaces the independently optimized spherical harmonics (SH) coefficients of each Gaussian in 3DGS with a continuous incident illumination field based on tensor decomposition. Combined with learnable BRDF features and a lightweight neural renderer to model the outgoing radiance, it significantly improves the rendering quality of view-dependent effects such as specular reflections while maintaining real-time rendering speeds.
3×2: 3D Object Part Segmentation by 2D Semantic Correspondences: Proposes a training-free 3D object part segmentation method, 3-By-2, which utilizes 2D semantic correspondences from diffusion models (DIFT) to transfer part labels from annotated 2D datasets or a small number of 3D annotated objects to 3D, achieving state-of-the-art (SOTA) performance under both zero-shot and few-shot settings.
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation: This paper proposes 4Diff, a transformer-based diffusion model integrating 3D geometric priors. By incorporating egocentric point cloud rasterization and 3D-aware rotary cross-attention mechanisms, it translates exocentric (third-person) images into egocentric (first-person) images, achieving state-of-the-art performance on the Ego-Exo4D dataset and demonstrating strong generalization capabilities to novel environments.
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model: Proposes 6DGS, which inverts the 3DGS rendering workflow—casting rays uniformly from the surfaces of the ellipsoids (Ellicell), using an attention mechanism to bind rays to target image pixels, and then utilizing weighted least squares to solve for camera pose in closed form. Requiring no iterations or initial poses, it improves rotation accuracy by 12% and translation accuracy by 22% on real-world scenes, achieving near-real-time performance at 15fps.
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis: This work models the position and rotation parameters in 3DGS as continuous functions of time (Fourier approximation for position, linear approximation for rotation), reducing the storage complexity of dynamic scenes from \(O(TN)\) to \(O(LN)\). It achieves rendering quality comparable to NeRF-based methods on the D-NeRF, DyNeRF, and HyperNeRF datasets while maintaining real-time rendering speeds over 118 FPS.
A Direct Approach to Viewing Graph Solvability: This paper proposes a more direct reformulation of the viewing graph solvability problem than prior works, introduces new concepts to understand the solvability of real-world SfM graphs, and presents more efficient algorithms for detecting and decomposing unsolvable scenarios.
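The storage reduction from \(O(TN)\) to \(O(LN)\) in the Compact Dynamic 3D Gaussian entry above can be sketched for a single coordinate of one Gaussian: instead of storing one value per timestep, store a few Fourier coefficients and evaluate them at any time. The parameterization below (a constant term plus cosine/sine pairs over normalized time), the frame and Gaussian counts, and all function names are assumptions for illustration, not the paper's formulation.

```python
import math


def fourier_position(coeffs, t):
    """Evaluate one coordinate at normalized time t in [0, 1].

    coeffs = [a0, (a1, b1), (a2, b2), ...]: a constant term followed by
    cosine/sine amplitude pairs. Hypothetical layout for illustration.
    """
    value = coeffs[0]
    for k, (a, b) in enumerate(coeffs[1:], start=1):
        angle = 2.0 * math.pi * k * t
        value += a * math.cos(angle) + b * math.sin(angle)
    return value


def linear_rotation(base, slope, t):
    """Linear-in-time approximation of a rotation parameter vector."""
    return [b + s * t for b, s in zip(base, slope)]


def storage_counts(num_frames, num_gaussians, num_coeffs):
    """Stored scalars per coordinate: per-frame values vs. coefficients."""
    return num_frames * num_gaussians, num_coeffs * num_gaussians


coeffs = [1.0, (0.5, 0.0), (0.0, 0.25)]   # L = 5 scalars for this coordinate
x_start = fourier_position(coeffs, 0.0)   # constant term plus cosine amplitude
per_frame, per_coeff = storage_counts(300, 100_000, 5)
```

With these hypothetical sizes, the per-frame layout stores 300 scalars per Gaussian coordinate while the coefficient layout stores 5, independent of the number of frames.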

Browse all 181 3D Vision papers →

🎯 Object Detection (31)¶

Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction: This paper proposes a two-step conformal prediction framework for uncertainty quantification in multi-object detection: the first step generates conformal prediction sets of class labels to handle classification errors, and the second step produces adaptive bounding box uncertainty intervals based on ensembles and quantile regression, providing practically useful tight prediction intervals while guaranteeing coverage.
Adaptive Multi-task Learning for Few-Shot Object Detection: This paper proposes an adaptive multi-task learning method (MTL-FSOD) that dynamically adjusts the gradient scales of classification and localization tasks using a precision-driven gradient balancer to alleviate their conflict. It also introduces CLIP-based knowledge distillation and a classification refinement scheme to enhance individual task performance, achieving consistent improvements across multiple few-shot object detection benchmarks.
AugDETR: Improving Multi-scale Learning for Detection Transformer: This paper proposes AugDETR (Augmented DETR), which expands the receptive field of the deformable encoder and introduces global context features to enhance feature representations through a Hybrid Attention Encoder. It then adaptively utilizes information from multiple encoder layers using Encoder-Mixing Cross-Attention to accelerate convergence, yielding improvements of 1.2, 1.1, and 1.0 AP over DINO, AlignDETR, and DDQ on COCO, respectively.
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos: Proposes the Boundary-Aligned Moment Detection Transformer (BAM-DETR), which models moments using an anchor-boundary triplet \((p, d_s, d_e)\) instead of the traditional center-length duplet \((c, l)\). Combined with a dual-pathway decoder and a quality-based ranking mechanism, it effectively addresses the issue of imprecise localization caused by center ambiguity.
Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection: This paper proposes the Bridge Past and Future (BPF) method, which bridges past stages via pseudo-labels, excludes potential future objects using an attention mechanism, and incorporates dual-teacher distillation (Distillation with Future) to resolve the optimization goal inconsistency caused by cross-stage information asymmetry in incremental object detection.
Can OOD Object Detectors Learn from Foundation Models?: SyncOOD proposes an automated data curation method that leverages LLMs to imagine semantically novel OOD concepts and performs region-level editing on ID images via Stable Diffusion Inpainting to synthesize scene-level OOD samples. After refining bounding boxes with SAM and filtering via feature similarity, a lightweight MLP classifier is trained, substantially outperforming SOTA on multiple OOD detection benchmarks with a minimal amount of synthetic data.
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer: DAMSDet proposes a dynamic adaptive infrared-visible object detection method based on the DETR architecture. By utilizing Modality Competitive Query Selection (dynamically selecting the dominant modality feature as the initial query for each object) and Multispectral Deformable Cross-Attention (adaptively sampling and aggregating bi-modal features across multiple semantic levels), it simultaneously addresses the dual challenges of complementary information fusion and modality misalignment, significantly outperforming the state-of-the-art (SOTA) on four public datasets.
DSPDet3D: 3D Small Object Detection with Dynamic Spatial Pruning: Proposes a Dynamic Spatial Pruning (DSP) strategy that progressively removes voxel features in areas where large objects have already been detected within the decoders of multi-scale 3D detectors. This allows the detector to process scenes at extremely high spatial resolutions, significantly improving small object detection accuracy (the ScanNet small-object detection metric rises from 27.5% to 44.8%) while reducing GPU memory to 1/5 of the baseline method at the same resolution.
GRA: Detecting Oriented Objects Through Group-Wise Rotating and Attention: A lightweight Group-wise Rotating and Attention (GRA) module is proposed. By grouping and rotating convolution kernels and applying group-wise spatial attention, it outperforms the previous SOTA method ARC with nearly 50% fewer parameters, achieving new state-of-the-art performance on DOTA-v2.0.
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction: LaMI-DETR is proposed to address two core challenges in open-vocabulary object detection—insufficient concept representation and base-category overfitting—by leveraging GPT to generate visual concept descriptions and T5 to mine inter-category visual similarity relationships. It outperforms previous state-of-the-art methods by 7.8 rare AP on OV-LVIS, achieving 43.4 AP_rare.
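The coverage guarantee mentioned in the Adaptive Bounding Box Uncertainties entry above rests on conformal prediction. The sketch below shows only the standard split-conformal quantile step applied to absolute residuals of one box coordinate; it is not the paper's two-step procedure, and the function names and calibration residuals are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def coordinate_interval(prediction, q):
    """Symmetric interval around a predicted box coordinate."""
    return prediction - q, prediction + q


# Hypothetical absolute residuals |predicted - true| on a calibration set.
calibration_residuals = [1.0, 3.0, 2.0, 5.0, 4.0, 2.5, 0.5]
q = conformal_quantile(calibration_residuals, alpha=0.25)
low, high = coordinate_interval(120.0, q)
```

A fixed-width interval like this is the simplest case; the adaptive intervals described in the entry would instead vary the width per prediction.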

Browse all 31 Object Detection papers →

✂️ Segmentation (54)¶

A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties: ProLab uses LLMs to generate common-sense descriptions of categories, compressing them into 256 interpretable descriptive properties via sentence embeddings and K-Means clustering. This constructs an attribute-level multi-hot label space to supervise the segmentation model, replacing traditional one-hot category labels. It consistently outperforms category-level supervision across five classic benchmarks and shows emergent out-of-domain generalization capabilities.
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting: Based on Stable Diffusion, a minimalist latent diffusion segmentation framework named LDMSeg is proposed. It compresses segmentation masks into a latent space using a shallow autoencoder and then trains an image-conditioned diffusion model to generate panoptic segmentation results. This bypasses object detection modules, Hungarian matching, and complex post-processing found in traditional methods, while naturally supporting mask inpainting and multi-task extensions.
ActionVOS: Actions as Prompts for Video Object Segmentation: ActionVOS is proposed—a new setting for Referring Video Object Segmentation that uses human action narratives as additional linguistic prompts. It generates pseudo-labels via a parameter-free action-aware labeling module and designs an action-guided focal loss to suppress false positives, reducing the false segmentation of inactive objects by 35.6% mIoU on VISOR, while improving the segmentation of state-changing objects by 3.0% mIoU on VOST/VSCOS.
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images: Proposes the first active learning framework for instance segmentation of moveable parts in real-world indoor RGB images. Utilizing a pose-aware masked attention network, the framework achieves coarse-to-fine segmentation. It requires manual annotation of only 11.45% of the images to obtain fully verified high-quality segmentation results, saving 60% of manual effort compared to the best non-active learning methods.
Attention Decomposition for Cross-Domain Semantic Segmentation: This paper proposes ADFormer, a novel Transformer architecture for cross-domain semantic segmentation. By decomposing the cross-attention in the decoder into domain-agnostic and domain-specific components, combined with gradient reversal adversarial learning, it effectively bridges the distribution gap between source and target domains. It outperforms existing proposal-free methods on GTA→Cityscapes and SYNTHIA→Cityscapes benchmarks with significantly lower complexity.
CoLA: Conditional Dropout and Language-Driven Robust Dual-Modal Salient Object Detection: This paper proposes the CoLA framework, which introduces two core modules, Language-driven Quality Assessment (LQA) and Conditional Dropout (CD), to simultaneously address two key robustness issues in dual-modal salient object detection for the first time: noisy inputs and missing modalities.
ColorMAE: Exploring Data-Independent Masking Strategies in Masked AutoEncoders: This paper proposes ColorMAE, which generates data-independent masking patterns with spatial and semantic priors by applying different frequency domain filters to random noise. Without adding any parameters or computational overhead, ColorMAE significantly improves the downstream performance of MAE, particularly achieving a 2.72 mIoU improvement over random masking on semantic segmentation tasks.
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback: This paper proposes ControlNet++, which explicitly optimizes the quality of conditional controllable generation through pixel-level cycle consistency loss. A pre-trained discriminative model is used to extract conditions from generated images and align them with the input conditions. To avoid the massive GPU memory overhead of multi-step sampling, an efficient single-step denoising reward strategy is designed. This significantly improves controllability (e.g., +11.1% segmentation mIoU) under various conditional controls such as segmentation masks, edges, and depth.
CoReS: Orchestrating the Dance of Reasoning and Segmentation: This paper proposes CoReS (Chains of Reasoning and Segmenting), a multimodal Chain-of-Thought framework with a dual-chain structure. Through the hierarchical collaboration of the reasoning chain and the segmenting chain, combined with an in-context guidance strategy, it achieves progressive and precise segmentation of target objects in complex reasoning text, outperforming LISA by 6.5% on the ReasonSeg dataset.
CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation: This paper proposes the Class-Conditional Prompting Machine (CPM), which enhances the stability of bipartite matching and the effectiveness of cross-modal attention in Mask2Former for audio-visual segmentation by combining class-agnostic queries with GMM-sampled class-conditional queries. Simultaneously, three auxiliary tasks are designed—Audio-Conditional Prompting (ACP), Visual-Conditional Prompting (VCP), and Prompt Contrastive Learning (PCL)—achieving state-of-the-art performance on AVSBench and VPO benchmarks.

Browse all 54 Segmentation papers →

🖼️ Image Restoration (32)¶

A New Dataset and Framework for Real-World Blurred Images Super-Resolution: Addressing the issue where existing blind super-resolution methods over-texturize and destroy the perceptual quality of blurred regions when processing images with blur (defocus/motion blur), this work constructs the ReBlurSR dataset containing nearly 3,000 blurred images. It proposes the PBaSR framework, which employs Cross-Disentanglement training (CDM) and weight-interpolation-based Cross-Fusion (CFM) to simultaneously improve the super-resolution quality of both blurred and general images without introducing any additional inference overhead, improving LPIPS by 0.02 to 0.10.
Accelerating Image Super-Resolution Networks with Pixel-Level Classification: This paper introduces PCSR, the first super-resolution method with pixel-level computational resource allocation. Using a lightweight MLP classifier, it determines the restoration difficulty of each pixel and assigns pixels to upsamplers of varying capacities. PCSR reduces FLOPs to \(18\% \sim 57\%\) of the original models with almost no drop in PSNR, significantly outperforming existing patch-level methods like ClassSR and ARM.
Asymmetric Mask Scheme for Self-supervised Real Image Denoising: Proposes the asymmetric mask scheme AMSNet, which uses a single mask during training and multiple complementary masks during inference. This removes the structural requirements and receptive field constraints of blind spot networks and achieves SOTA performance in self-supervised real image denoising.
BAMM: Bidirectional Autoregressive Motion Model: BAMM (Bidirectional Autoregressive Motion Model) is proposed. By unifying generative masked modeling and autoregressive modeling through a hybrid attention masking strategy, it simultaneously achieves high-quality motion generation, adaptive length prediction, and zero-shot motion editing within a single framework, comprehensively outperforming SOTA on HumanML3D and KIT-ML.
Blind Image Deblurring with Noise-Robust Kernel Estimation: This paper proposes a blind deblurring method based on a noise-robust kernel estimation function and deep image prior (DIP). By designing a kernel estimation function capable of accurately estimating blur kernels even under strong noise, combined with a multiple-kernel estimation scheme to handle unknown noise levels, it achieves superior deblurring performance on both simulated and real images.
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion: Proposes BrushNet, a plug-and-play dual-branch diffusion model architecture for image inpainting. By decoupling masked image feature extraction and image generation into separate branches, it achieves layer-wise pixel-level feature injection, thoroughly outperforming existing methods in image quality, masked area preservation, and text alignment.
Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution: To address the unique challenges of infrared image super-resolution, this paper proposes the CoRPLE framework. It utilizes the Contourlet transform for multi-scale and multi-directional infrared spectral residual enhancement, and introduces a prompt learning paradigm based on vision-language models to capture the inherent features of infrared images, achieving SOTA performance on infrared SR tasks.
DenoiSplit: A Method for Joint Microscopy Image Splitting and Unsupervised Denoising: This paper proposes DenoiSplit, the first method to jointly address semantic image splitting and unsupervised denoising. By integrating pixel noise models and an improved KL divergence loss weighting strategy into a hierarchical VAE, the method achieves end-to-end denoising and splitting on fluorescence microscopy images, significantly outperforming serial pipelines that perform denoising prior to splitting.
Domain-Adaptive Video Deblurring via Test-Time Blurring: A test-time domain adaptation method based on a diffusion blur model is proposed. By detecting relatively sharp regions from blurry videos as pseudo-sharp images and generating domain-adaptive blur conditions to synthesize training pairs, the method enables fine-tuning of deblurring models on unseen domains, achieving a maximum gain of 7.54dB across 5 real-world datasets.
EDformer: Transformer-Based Event Denoising Across Varied Noise Levels: EDformer proposes an event-by-event denoising model based on Transformer, which handles event camera noise under varied noise levels by learning spatiotemporal correlations among events. It also establishes ED24, the first real-world event denoising dataset containing 21 noise levels.

Browse all 32 Image Restoration papers →

🛰️ Remote Sensing (6)¶

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth: To address the performance degradation of fine-grained cross-view localization models when deployed in new areas, a weakly-supervised learning method based on knowledge self-distillation is proposed. Employing three strategies—mode-based pseudo GT generation, coarse-level supervision, and outlier filtering—it reduces localization errors by 12% to 20% on VIGOR and KITTI using only ground-to-aerial image pairs from the target area (without requiring precise GT).
ConGeo: Robust Cross-View Geo-Localization Across Ground View Variations: This paper proposes ConGeo, a model-agnostic single-view + cross-view contrastive learning framework. By enforcing feature consistency across different ground view variations at the same location, it enables a single model to achieve robust cross-view geo-localization under arbitrary orientations and fields of view (FoV).
Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach: This work constructs G2A-VReID, the first ground-to-aerial cross-platform video person re-identification dataset, and proposes the VSLA-CLIP method, which adapts CLIP to video ReID tasks through visual-semantic alignment and a parameter-efficient Video Set-Level-Adapter.
Learning Representations of Satellite Images From Metadata Supervision: This paper proposes SatMIP (Satellite Metadata-Image Pretraining), which represents satellite image metadata (such as time, geographic location, sensor information, etc.) as text descriptions to align images and metadata in a shared embedding space via an image-metadata contrastive learning task. This constructs satellite image representations that encode both visual features and semantic information. It further introduces SatMIPS (combining image self-supervision and metadata supervision), which outperforms purely visual self-supervised methods like SimCLR on multiple remote sensing downstream tasks.
Masked Angle-Aware Autoencoder for Remote Sensing Images: The authors propose MA3E, which explicitly introduces angle variations into MAE pre-training (by constructing rotated crops from a scaled center crop) and automatically assigns reconstruction targets using an optimal transport loss. This allows the model to perceive the diverse angles of remote sensing objects and learn rotation-invariant representations.
Weakly-Supervised Camera Localization by Ground-to-Satellite Image Registration: Proposes the first weakly-supervised ground-to-satellite image registration localization method. By training an orientation estimator on satellite-to-satellite pairs in a self-supervised manner and training a translation estimator via contrastive learning, it achieves the best cross-area generalization performance without requiring accurate ground-truth (GT) pose labels, outperforming most fully-supervised SOTA methods.

🧑 Human Understanding (54)¶

3D Hand Pose Estimation in Everyday Egocentric Images: By systematically investigating four practices—cropped inputs, Intrinsics-aware Positional Encoding (KPE), auxiliary supervision (hand segmentation + grasp labels), and multi-dataset joint training—this work proposes the WildHands system. Under the constraint of using only a ResNet50 backbone and a small amount of data, WildHands achieves robust 3D hand pose estimation in in-the-wild egocentric images. Its zero-shot generalization outperforms FrankMocap across all metrics and competes closely with HaMeR, which is \(10\times\) larger.
3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views: This paper proposes 3DFG-PIFu, which globally fuses multi-view features across the entire pipeline by introducing 3D Feature Grids, replacing the traditional point-wise local fusion approach. Combined with an iterative grid refinement mechanism and SDF-based SMPL-X features, it significantly outperforms state-of-the-art sparse-view human digitization methods.
3DGazeNet: Generalizing 3D Gaze Estimation with Weak-Supervision from Synthetic Views: Proposes to reformulate gaze estimation as dense 3D eye mesh regression, and performs weakly supervised training via automatic pseudo-label extraction from large-scale in-the-wild face images + HeadGAN-synthesized multi-views, achieving up to 30% improvement over SOTA in cross-domain scenarios.
3DSA: Multi-view 3D Human Pose Estimation With 3D Space Attention Mechanisms: This paper proposes a 3D Space Attention (3DSA) module that partitions the feature volume into multiple regions via a 3D space subdivision algorithm and assigns view-based attention weights to them. This addresses the issue of unequal contributions of different views to different spatial regions in multi-view 3D human pose estimation, achieving SOTA performance on the CMU Panoptic Studio dataset.
A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars: This paper proposes the first baseline system for Spoken2Sign translation with 3D Avatar output. The system translates spoken text into 3D sign language animations through a three-step pipeline (dictionary construction \(\to\) SMPLSign-X 3D pose estimation \(\to\) retrieval-connection-rendering translation). It achieves a back-translation BLEU-4 of 25.46 on Phoenix-2014T, while its 3D sign language byproducts (keypoint enhancement and multi-view understanding) significantly improve the performance of sign language understanding tasks.
AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition: This paper proposes AdaDistill, which embeds the knowledge distillation concept into the margin penalty softmax loss. By utilizing EMA-based adaptive class centers (employing simple sample-to-sample knowledge in early stages and complex sample-to-center knowledge in later stages) and a hard sample-aware mechanism, it enhances the discriminative power of lightweight face recognition models without requiring extra hyperparameters, outperforming SOTA distillation methods on challenging benchmarks such as IJB-B/C and ICCV21-MFR.
Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification: An Adaptive High-frequency Transformer (AdaFreq) is proposed. By employing frequency-domain mixup augmentation, target-aware dynamic selection of high-frequency tokens, and a feature equilibrium loss, it unifies high-frequency information (such as fur texture and contour edges) for the re-identification of diverse wildlife, outperforming existing ReID methods across 8 cross-species datasets.
ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation: The ADen framework is proposed to unify pose regression and probabilistic estimation paradigms by employing a generator to yield multiple pose hypotheses and a discriminator to score and select the best hypothesis. With only 500 adaptive samples, this approach outperforms methods requiring 500K uniform samples while achieving real-time inference.
Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences: Proposes Alignist, the first method that leverages CAD model information (SDF + SurfEmb correspondence features) to train an implicit distribution network to estimate pose distributions over \(SO(3)\). By fusing geometric and feature alignment via a product of experts, it significantly outperforms contrastive learning methods in low-data scenarios.
Audio-Driven Talking Face Generation with Stabilized Synchronization Loss: This work proposes three improvements—AVSyncNet, stabilized synchronization loss, and a silent-lip generator—to systematically address the two core issues of SyncNet instability and lip leaking in audio-driven talking face generation, achieving SOTA performance in both lip synchronization and visual quality.

Browse all 54 Human Understanding papers →

📹 Video Understanding (51)¶

ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos: ActionSwitch is proposed—the first online temporal action localization (On-TAL) framework to detect overlapping action instances in streaming videos without category information. The core idea is to model multi-action detection as a state classification problem for a finite state machine, augmented by a conservativeness loss to reduce fragmented false positives. It achieves SOTA among OAD-extension methods on datasets such as THUMOS14, FineAction, and Epic-Kitchens 100.
Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts: Proposes Adapt2Reward, which adapts pre-trained video-language models into generalizable language-conditioned reward functions using learnable failure prompts. Requiring only a small amount of robot data from a single environment, it generalizes to new environments and tasks, outperforming prior methods by approximately 28% on MetaWorld.
AMEGO: Active Memory from Long EGOcentric Videos: Proposes AMEGO, a method for online construction of structured "active memory" from long egocentric videos. By combining HOI tracklets, location segments, and semantic-free visual queries, it outperforms Video QA baselines by 12.7% on the newly proposed AMB benchmark.
Bayesian Evidential Deep Learning for Online Action Detection: This paper proposes the BEDL (Bayesian Evidential Deep Learning) framework. Incorporating a Bayesian teacher-evidential student architecture, it achieves accurate and efficient inference as well as reliable uncertainty quantification in online action detection tasks. Furthermore, it designs an attention module based on Bayesian mutual information for active feature selection.
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects: Based on the HANDS23 challenge (using the AssemblyHands and ARCTIC datasets), this study systematically benchmarks and deeply analyzes 3D pose estimation methods for egocentric hand-object interactions, revealing the effectiveness of distortion correction, high-capacity Transformers, and multi-view fusion, while highlighting unresolved challenges such as rapid motion, severe occlusion, and object reconstruction under narrow viewpoints.
Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training: This paper proposes a unified 3D single object tracking (SOT) framework that addresses the scarcity of point cloud data and the sparseness/incompleteness of LiDAR scans through 3D generative pre-training and matching knowledge distillation from a pre-trained 2D foundation tracker, achieving SOTA performance on KITTI, Waymo, and nuScenes.
Classification Matters: Improving Video Action Detection with Class-Specific Attention: Proposes a class-specific query (class queries) mechanism, which assigns an independent learnable query to each action class, allowing the model to dynamically attend to context regions relevant to each class, significantly improving classification performance in video action detection.
CrossGLG: LLM Guides One-Shot Skeleton-Based 3D Action Recognition in a Cross-Level Manner: This paper proposes the CrossGLG framework, which utilizes LLM-generated text descriptions to guide skeleton feature learning in a "global \(\to\) local \(\to\) global" manner, significantly outperforming competitors in one-shot 3D action recognition with only 2.8% of the parameter size of the SOTA model.
Data Collection-Free Masked Video Modeling: This paper proposes a Pseudo-Motion Generator (PMG) to recursively generate pseudo-motion videos from static images. Combined with Masked Video Modeling (VideoMAE) for self-supervised pre-training, it entirely eliminates the collection costs, privacy, and copyright concerns of real video data, and even enables effective video Transformer pre-training using only synthetic images.
DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video: This paper proposes DINO-Tracker, which combines the semantic features of pretrained DINOv2 with test-time single-video optimization. Through Delta-DINO residual fine-tuning and multi-source self-supervised losses, it achieves long-range dense point tracking. It reaches state-of-the-art (SOTA) performance among self-supervised methods and is comparable to supervised trackers, particularly outperforming existing methods by a wide margin in long-term occlusion scenarios.

Browse all 51 Video Understanding papers →

🚗 Autonomous Driving (53)

4D Contrastive Superflows are Dense 3D Representation Learners: The SuperFlow framework is proposed, which establishes 4D pre-training objectives using continuous LiDAR-camera pairs through three modules: view consistency alignment, dense-sparse consistency regularization, and flow-based spatiotemporal contrastive learning. It comprehensively outperforms prior image-to-LiDAR pre-training methods across 11 heterogeneous LiDAR datasets.
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention: This paper proposes to directly expose internal BEV features from online map estimation models to downstream trajectory prediction models (instead of just passing decoded vectorized maps). Through three BEV feature injection strategies, the proposed method achieves up to a 73% acceleration in inference and up to a 29% improvement in prediction accuracy.
Adaptive Human Trajectory Prediction via Latent Corridors: This paper introduces prompt tuning to pedestrian trajectory prediction. By adding learnable low-rank visual prompts (termed latent corridors) to the input of a pre-trained trajectory predictor, it achieves highly parameter-efficient adaptation to scene-specific behavioral patterns with less than 0.1% extra parameters, improving ADE by up to 23.9% and 26.8% on synthetic and real-world data, respectively.
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene: This paper proposes the LiSe method, which incorporates 2D image information into unsupervised 3D object detection. Through adaptive sampling and weak model aggregation strategies in self-paced learning, it significantly improves the detection capability for long-range and small targets.
CarFormer: Self-Driving with Learned Object-Centric Representations: CarFormer is proposed to introduce self-supervised slot attention-learned object-centric representations into autonomous driving for the first time. On the CARLA Longest6 benchmark, it outperforms PlanT, which utilizes precise object attributes, while demonstrating the capability of a world model to predict future states.
CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection: The CSOT (Cross-Scan Object Transfer) paradigm is proposed, which predicts semantically consistent object placement locations and compatibility scores using a Transformer network. This achieves the first successful object copy-paste augmentation in semi-supervised LiDAR object detection. Combined with a spatial-aware classification loss, it matches the performance of the fully supervised baseline using only 1% of the annotated data.
Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection: This paper derives from the data labeling process the rule that "image features should not be used for regression tasks" and proposes the DAL paradigm. DAL models the detection process on the labeling process, using LiDAR features alone for regression predictions and fused features for classification predictions. Combined with a simplified training pipeline, DAL sets a new SOTA on nuScenes with 74.0 NDS (val) and 74.8 NDS (test).
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-directional Structure Alignment: A clustering-based Local-to-Global fusion network, DVLO, is proposed to address the data structure inconsistency between vision and LiDAR through bi-directional structure alignment (image-to-pseudo-point-cloud + point-cloud-to-pseudo-image), achieving state-of-the-art (SOTA) performance on both the KITTI odometry and FlyingThings3D scene flow tasks.
DySeT: A Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction: DySeT proposes a dynamic masked self-distillation approach. By leveraging reinforcement learning-driven priority sampling of informative tokens and knowledge distillation from a complete representation to a masked representation, it significantly enhances the generalization ability and robustness of trajectory prediction models in autonomous driving scenarios.
Enhancing Vectorized Map Perception with Historical Rasterized Maps: This paper proposes HRMapNet, which maintains a low-cost global historical rasterized map to provide complementary prior information for online vectorized map perception. It enhances existing methods at two levels—BEV feature aggregation and query initialization—achieving significant improvements on nuScenes and Argoverse 2.

Browse all 53 Autonomous Driving papers →

🤖 Robotics & Embodied AI (13)

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation: Proposes the STAformer architecture and two affordance-based modules (an environment affordance database + interaction hotspots), improving the relative performance of Short-Term object Interaction Anticipation (STA) in egocentric videos by 30-45% on Ego4D and EPIC-Kitchens.
An Economic Framework for 6-DoF Grasp Detection: This paper proposes the EconomicGrasp framework. By identifying that the ambiguity problem in dense supervision is the root cause of the conflict between performance and resource consumption, it designs an economic supervision paradigm (retaining all view perspectives but cropping rotation angles and depths) and a focus representation module (an interactive grasp head with composite scoring). It outperforms the SOTA by approximately 3 AP on GraspNet-1Billion with only 1/4 of the training time and 1/8 of the memory cost.
Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation: Proposes Decomposed VQ-VAE (DVQ-VAE), which decomposes the hand into six parts to encode them into independent codebooks, and designs a dual-stage decoding strategy (posture first, then position), achieving an approximate 14.1% relative improvement in quality index across four benchmark datasets.
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control: The DISCO framework is proposed, which significantly improves the performance of embodied navigation and interaction on the ALFRED benchmark (outperforming SOTA by +8.6% in unseen success rate, without requiring step-by-step instructions) through differentiable scene semantic representation and dual-level coarse-to-fine action control.
GraspXL: Generating Grasping Motions for Diverse Objects at Scale: GraspXL is proposed, an RL-based grasp motion generation framework that generalizes to over 500,000 unseen objects after training on only 58 objects, while simultaneously supporting multi-objective motion control (grasp region, heading, wrist rotation, and hand position) and multiple dexterous hand platforms.
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos: The Hierarchically Structured Neural Bones (HSNB) framework is proposed, which decomposes object motion in a coarse-to-fine manner using a tree-structured bone system to reconstruct high-quality animatable 3D models from casual videos.
Learning Cross-Hand Policies of High-DOF Reaching and Grasping: A two-stage hierarchical framework is proposed, which uses semantic keypoints and the Interaction Bisector Surface (IBS) as hand-agnostic state representations. Combined with a Transformer policy network and hand-specific adaptation models, it achieves zero-shot transfer of dexterous grasping policies across different high-DOF robotic hands.
LLM as Copilot for Coarse-Grained Vision-and-Language Navigation: This paper proposes the VLN-Copilot framework, where a vision-and-language navigation agent actively seeks help from an LLM when confused under coarse-grained (short and ambiguous) instructions. Acting as a copilot, the LLM generates real-time, fine-grained navigation guidance, significantly improving navigation success rates on two coarse-grained VLN datasets.
Prioritized Semantic Learning for Zero-shot Instance Navigation: This paper proposes the Prioritized Semantic Learning (PSL) method. Through a semantic-augmented agent architecture, a prioritized semantic training strategy, and a semantic expansion inference scheme, it significantly improves the agent's semantic perception capabilities in zero-shot object/instance navigation, achieving SOTA performance on both ObjectNav and the newly proposed InstanceNav tasks.
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots: This paper proposes the first vision-language-action (QUAR-VLA) paradigm for quadruped robots, constructing a multi-task dataset QUARD with 259K episodes and the QUART model based on a pretrained multimodal large model, achieving unified control across multiple tasks such as perception, navigation, and whole-body manipulation.

Browse all 13 Robotics & Embodied AI papers →

🎮 Reinforcement Learning (3)

AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale: This paper proposes AdaGlimpse, which utilizes Soft Actor-Critic (SAC) reinforcement learning to select glimpses of arbitrary positions and scales from a continuous action space. Combined with a ViT encoder equipped with elastic positional encoding, it achieves multi-task active visual exploration (reconstruction, classification, and segmentation). Using only 6% of pixels, it outperforms state-of-the-art methods that use 18% of pixels.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback: This paper proposes Octopus, an embodied vision-language programming model that bridges high-level planning and low-level manipulation by generating executable code. It introduces a Reinforcement Learning with Environmental Feedback (RLEF) training scheme to enhance decision-making quality.
Visual Grounding for Object-Level Generalization in Reinforcement Learning: This paper leverages the visual grounding capability of a vision-language model (MineCLIP) to generate confidence maps of target objects. VLM knowledge is transferred to reinforcement learning through two pathways—reward design and task representation—enabling zero-shot generalization to unseen objects and instructions.

🎁 Recommender Systems (1)

AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling: This paper proposes the Image Content Appeal Assessment (ICAA) task for the first time, distinguishing it from traditional Image Aesthetics Assessment (IAA). It designs a complete pipeline integrating automatic dataset generation, appeal estimation, and appeal enhancement, achieving large-scale dataset creation with zero human annotation using Stable Diffusion and Textual Inversion.

🔄 Self-Supervised Learning (16)

Adaptive Multi-head Contrastive Learning: This paper proposes Adaptive Multi-head Contrastive Learning (AMCL), which generates different feature perspectives through multiple projection heads and independently weights each sample pair using an adaptive temperature mechanism derived from Maximum Likelihood Estimation (MLE). This effectively resolves the overlap in similarity distributions of positive and negative samples under diverse data augmentations, consistently improving the performance of SimCLR, MoCo, and Barlow Twins.
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation: This work proposes a city-scale 2.5D urban layout generation method based on Graph Masked Autoencoders (GMAE). By capturing multi-level semantic context across buildings, blocks, and communities through a canonical graph representation and combining it with priority-scheduled iterative sampling, the method achieves realistic, semantically consistent, and topologically correct large-scale urban layout generation across 330 US cities.
Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders: CropMAE is proposed: a siamese masked autoencoder trained by replacing video frame pairs with two randomly cropped views of the same image. With an extremely high masking ratio of 98.5%, it learns object boundary-aware representations using only 2 visible patches. This accelerates training by up to 23.8× compared to SiamMAE while achieving competitive performance on video propagation tasks.
Exemplar-Free Continual Representation Learning via Learnable Drift Compensation: Proposes Learnable Drift Compensation (LDC), which trains a forward projector to map the old feature space to the new feature space. This effectively compensates for the semantic drift of class prototypes without needing to store old exemplars, achieving exemplar-free semi-supervised continual learning for the first time.
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning: Proposes FlowCon, a density estimation-based OOD detection method that innovatively combines normalizing flows with supervised contrastive learning. By using a contrastive loss based on the Bhattacharyya coefficient in the latent space of the flow model to learn class-conditional Gaussian distributions, it achieves efficient OOD detection without requiring external OOD data or retraining the classifier.
InfMAE: A Foundation Model in the Infrared Modality: This paper proposes InfMAE, the first foundation model designed specifically for the infrared modality. By constructing the Inf30 dataset with 300,000 infrared images, and designing an information-aware masking strategy along with a multi-scale encoder, the proposed method outperforms existing state-of-the-art approaches in three downstream tasks: infrared semantic segmentation, object detection, and small object detection.
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description: This paper proposes MarineInst, a foundation model for marine image analysis that simultaneously outputs instance masks and semantic descriptions. Additionally, it constructs MarineInst20M—the largest marine image dataset to date (20 million images), supporting multi-level marine visual analysis tasks from image-level scene understanding to region-level instance understanding.
PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer: The Position Forest Transformer (PosFormer) is proposed. By encoding the LaTeX sequence of mathematical expressions into a position forest structure, it explicitly models the hierarchical and spatial relationships among symbols. Together with an implicit attention correction module, it outperforms SOTA methods on single-line, multi-line, and complex expression datasets without introducing additional inference overhead.
PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery: This paper proposes the PromptCCD framework, which utilizes a Gaussian Mixture Model (GMM) as a prompt pool to achieve continual discovery of novel categories in unlabeled data streams while mitigating catastrophic forgetting.
Rethinking Unsupervised Outlier Detection via Multiple Thresholding: This work proposes the Multi-T (Multiple Thresholding) module, which generates two thresholds to isolate inliers and outliers within a target dataset, respectively. By utilizing the identified inliers to train a clean normal manifold and using the outliers for feature denoising, Multi-T significantly enhances the performance of existing outlier scoring methods.

Browse all 16 Self-Supervised Learning papers →

🔬 Interpretability (5)

DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration: DetailSemNet is proposed for offline signature verification, which decouples features into detail and semantic branches via a Detail-Semantics Integrator, and introduces EMD-based local structural matching to achieve SOTA performance on multiple multilingual signature datasets.
EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding: This paper proposes the EgoExo-Fitness dataset, which contains synchronized egocentric and exocentric fitness videos. It provides two-level temporal boundary annotations and innovative explainable action assessment labels (technical keypoint verification, natural language commentary, and quality scoring) and establishes five benchmark tasks.
Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models: This paper identifies that the low efficiency of human intervention in Concept Bottleneck Models (CBMs) stems from the independent processing of concepts during intervention, which neglects inter-concept correlations. It proposes a lightweight Concept Intervention Realignment Module (CIRM) that automatically realigns the predictions of related concepts post-intervention, reducing the number of interventions required to reach target performance by up to 70%.
PLOT: Text-based Person Search with Part Slot Attention for Corresponding Part Discovery: This paper proposes the PLOT framework, which utilizes a Part Discovery Module based on Slot Attention to automatically discover corresponding human body parts across modalities (image-text). Combined with Text-based Dynamic Part Attention (TDPA) to dynamically adjust the importance of each part, it thoroughly outperforms state-of-the-art (SOTA) methods on three benchmarks without requiring part-level annotations.
POA: Pre-training Once for Models of All Sizes: POA proposes introducing an elastic student branch into the self-supervised self-distillation framework. Through parameter sharing and random sub-network sampling, hundreds of pre-trained models of different sizes can be produced simultaneously with a single pre-training run (e.g., directly extracting ViT-S/B from ViT-L). Each sub-network achieves SOTA performance on k-NN, linear probing, and downstream tasks.

📦 Model Compression (24)

A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging: Proposes Q-SCI, the first low-bit quantization framework specifically designed for Video Snapshot Compressive Imaging (Video SCI) reconstruction. By incorporating a high-quality feature extraction module, a precise video reconstruction module, and query/key distribution shift calibration in the Transformer branch, it achieves a 7.8× theoretical speedup with only a 2.3% performance drop under 4-bit quantization.
AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer: This paper proposes AdaLog, an adaptive logarithmic base quantizer that addresses the power-law distribution of post-Softmax and post-GELU activations in ViTs by replacing fixed \(\log_2\)/\(\log_{\sqrt{2}}\) quantizers with a searchable logarithmic base. Additionally, a Fast Progressive Combinatorial Search (FPCS) strategy is designed to efficiently determine quantization hyperparameters, which significantly outperforms existing ViT PTQ methods under ultra-low bit (3/4-bit) configurations.
Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling: This paper proposes AdaSense, which leverages the zero-shot posterior sampling capability of pre-trained diffusion models to quantify reconstruction uncertainty, thereby adaptively selecting the optimal measurement matrix. It achieves training-free adaptive compressed sensing across multiple domains including face images, MRI, and CT, outperforming non-adaptive methods and even the optimal PCA-based non-adaptive scheme.
Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing: This paper proposes the "adaptive selection of sampling-reconstruction pairs" (\(\mathcal{H}_{1.5}\)) framework. It leverages a super-resolution spatial generative model to quantify high-frequency Bayesian uncertainty and selects the optimal sampling mask-reconstruction network pair for each input data. Theoretically and experimentally, it outperforms both non-adaptive joint optimization (\(\mathcal{H}_1\)) and adaptive sampling (\(\mathcal{H}_2\)), achieving significant SSIM improvements in face image and multi-coil MRI reconstruction.
Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap: This paper proposes an adversarially robust knowledge distillation method based on feature distribution statistical alignment. By reducing the feature variance gap between adversarial and clean examples in the student and teacher models, the adversarial robustness of the student model is enhanced. It is discovered that robust accuracy exhibits a strong negative linear correlation with the variance gap.
Anytime Continual Learning for Open Vocabulary Classification: The AnytimeCL framework is proposed to achieve open-vocabulary continual learning, allowing the model to receive samples at any time and perform inference on arbitrary label sets. This is realized by partially fine-tuning the final transformer block of CLIP and dynamically fusing predictions from both the fine-tuned and original models.
Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search: This paper proposes Auto-DAS, an automated proxy discovery framework based on evolutionary algorithms for training-free distillation-aware architecture search (DAS). By automatically discovering optimal proxy metrics within a search space composed of student intrinsic statistics and teacher-student interaction statistics, it bypasses the limitations of hand-crafted proxies. Auto-DAS achieves SOTA ranking correlations and search accuracies across various architectures and search spaces, including ResNet, ViT, and NAS-Bench-101/201.
BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression: This paper proposes the BaSIC framework, which simultaneously controls backbone network complexity and the parallel computation capability of autoregressive units by learning the Bayesian network structure of neural image compression (NIC) systems, achieving computational scalability control over the entire NIC pipeline for the first time.
Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model: Proposes a bidirectionally symmetric stereo image compression framework, BiSIC, using a 3D convolutional joint codec and a cross-dimensional entropy model. It outperforms both traditional standards and existing learned methods on PSNR and MS-SSIM, while eliminating the reconstruction quality imbalance between the left and right views inherent in unidirectional approaches.
Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery: Proposes the CAMP method, which significantly improves the balance between learning new categories and retaining old knowledge in Generalized Continual Category Discovery (GCCD) scenarios through the cooperative combination of learnable projector distillation and category prototype adaptation networks.

Browse all 24 Model Compression papers →

🏥 Medical Imaging (28)

A Cephalometric Landmark Regression Method Based on Dual-Encoder for High-Resolution X-Ray Image: This paper proposes D-CeLR, an end-to-end regression method based on a dual-encoder architecture. Utilizing only Transformer encoders, it designs a three-stage framework comprising feature extraction, a reference encoder, and a finetune encoder to achieve coarse-to-fine cephalometric landmark detection, significantly outperforming existing SOTA methods in Mean Radial Error (MRE) and 2 mm Successful Detection Rate (SDR) metrics.
A Rotation-Invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images: This paper proposes SRRM-ViT, which introduces a Statistical Rotation-invariant Reinforcement Mechanism (SRRM) into ViT to adaptively select key regions and fuse histogram statistical features. This achieves unbiased fine-grained classification of lesions at any radial position in endoscopic ultrasound (EUS) images of esophageal cancer, obtaining significant performance improvements on clinical and public datasets.
Adaptive Correspondence Scoring for Unsupervised Medical Image Registration: To address the issue of spurious reconstruction errors caused by confounding factors such as noise and occlusions in unsupervised medical image registration, this paper proposes an adaptive correspondence scoring framework (AdaCS). By learning pixel-wise correspondence confidence maps to re-weight error residuals, AdaCS consistently improves the performance of three mainstream registration architectures across three datasets in a plug-and-play manner.
Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation: Proposes AD-MT (Alternate Diverse Mean Teacher), which addresses the confirmation bias problem in semi-supervised medical image segmentation through random periodic alternate updating of two teacher models and an entropy-based conflict-combating strategy, comprehensively outperforming SOTA methods on ACDC, LA, and Pancreas datasets.
Architecture-Agnostic Untrained Network Priors for Image Reconstruction with Frequency Regularization: This paper proposes three architecture-agnostic frequency regularization techniques (bandwidth-constrained input, bandwidth-controllable upsampling, and Lipschitz-regularized convolutional layers) to address the issues of architectural sensitivity, overfitting, and operational inefficiency in untrained network priors, significantly narrowing the performance gap among different architectures in MRI reconstruction tasks.
Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging: This paper proposes Brain-ID, a contrast-agnostic brain anatomical representation learning model. Through a "mild-to-severe" intra-subject image synthesis strategy, it is trained on fully synthetic data to obtain anatomical features robust to MRI contrast, resolution, orientation, and artifacts. With only a single-layer adaptation, it achieves SOTA performance on four downstream tasks and six public datasets.
Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals: This paper proposes a novel method for reconstructing videos from functional magnetic resonance imaging (fMRI) signals. Through multi-dataset, multi-subject training and a three-stage pipeline utilizing pre-trained text-to-video and video-to-video models, it achieves state-of-the-art (SOTA) video reconstruction capabilities across both datasets and subjects.
CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos: A reconstruction-based cardiac disease assessment framework, CardiacNet, is proposed. By utilizing a Consistency Deformation Codebook (CDC) and a Consistency Deformation Discriminator (CDD), the model learns structural and motion discrepancies between normal and abnormal echocardiogram videos, achieving state-of-the-art (SOTA) performance in ejection fraction prediction, pulmonary arterial hypertension classification, and atrial septal defect classification.
Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild: This work proposes Chameleon, a data-efficient vision generalist model based on meta-learning and token matching. It adapts to entirely new dense prediction tasks (including medical images, video, 3D, etc.) using only dozens of labeled images, significantly outperforming existing generalist methods across six downstream benchmarks.
ChEX: Interactive Localization and Region Description in Chest X-rays: This paper proposes ChEX, an interactive chest X-ray interpretation model that supports both text prompts and bounding box queries. Through a DETR-style prompt detector and multi-task joint training, ChEX achieves performance competitive with SOTA on 9 chest X-ray tasks while providing unique grounding interpretability and user interaction capabilities.

Browse all 28 Medical Imaging papers →

📡 Signal & Communications (6)

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics: This paper constructs the Defect Spectrum dataset, providing fine-grained, semantic-rich, and large-scale multi-class defect annotations (125 defect classes, 3,518 + 1,920 images) across four industrial benchmarks. It also proposes Defect-Gen, a two-stage diffusion generator, to synthesize high-quality, diverse defect images under few-shot conditions, improving defect segmentation mIoU by up to 9.85.
Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging: This paper proposes extracting a compact Dual-Exposure Feature (DEF) from dual-exposure HDR image pairs, based on which two ultra-lightweight illuminant estimators, EMLP and ECCC, are constructed. They achieve or exceed the performance of prior methods requiring hundreds of thousands of parameters, while using only a few hundred to a few thousand parameters.
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation: This paper proposes PYRA, which generates decoupled adaptive modulation weights in parallel and modulates features of tokens to be merged using a re-activation strategy. This approach enables Vision Transformers to achieve both training efficiency (tuning only 0.4% parameters) and inference efficiency (approx. 1.7x-3.2x speedup) during downstream task adaptation, achieving comparable or superior performance to uncompressed PEFT methods.
QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images: QueryCDR is proposed, a network built on a distortion-aware learnable query mechanism (DLQM) and two controllable modulation modules (CCMB/CAMB). It is the first to achieve high-quality controllable rectification of fisheye images with various distortion degrees without retraining.
RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images: This paper proposes RAW-Adapter, which efficiently adapts sRGB pre-trained models to camera RAW images with extremely small parameter overhead (0.2–0.8M) via an input-level adapter (learnable ISP stages) and a model-level adapter (injecting ISP intermediate features into the backbone). It achieves SOTA performance on detection and segmentation tasks under various lighting conditions, including normal, low-light, and overexposure.
Unsupervised Exposure Correction: This paper proposes the first unsupervised exposure correction (UEC) method, which trains on multi-exposure sequences freely generated by ISP pipelines, using the images as mutual ground truths. It designs a pixel-level transformation function with only 19K parameters to preserve image details, outperforming supervised SOTA on exposure correction and downstream edge detection.

🛡️ AI Safety (13)¶

Any Target Can Be Offense: Adversarial Example Generation via Generalized Latent Infection: GAKer is proposed as the first targeted adversarial attack generator that generalizes to unseen target classes. By injecting target features (latent infection) into the intermediate layers of a UNet and employing a class-agnostic cosine distance loss instead of cross-entropy, it outperforms HGN on unseen classes by 14.13% in attack success rate.
Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement: This paper proposes the Bi-TTA framework, which introduces Test-Time Adaptation to remote photoplethysmography (rPPG) tasks for the first time. By leveraging a spatiotemporal consistency self-supervised prior and a prospective-retrospective bidirectional adaptation strategy, the proposed method adapts the model to the target domain at test time using only unlabeled single-instance data.
CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks: This paper proposes CGNC, which leverages the CLIP text encoder to inject target-category semantic information into a conditional generative network. Combining cross-attention modules with masked fine-tuning, this method significantly improves the black-box transfer success rate of both multi-target and single-target targeted adversarial attacks.
Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning: This paper proposes Self-Driven Fisher Calibration (SDFC), which utilizes Fisher information to measure differences in parameter importance across different distributions. SDFC effectively distinguishes malicious backdoor clients and performs parameter calibration in heterogeneous federated learning scenarios, overcoming the limitations of existing defense methods that rely on data homogeneity and minority malicious node assumptions.
Event Trojan: Asynchronous Event-based Backdoor Attacks: This paper proposes the Event Trojan framework, which, for the first time, designs backdoor attack methods specifically for asynchronous event data streams. It includes two modes, namely immutable triggers and mutable triggers, directly injecting malicious events at the event stream level to achieve stealthy and efficient backdoor attacks.
Noise-Assisted Prompt Learning for Image Forgery Detection and Localization: This paper proposes CLIP-IFDL, a CLIP-based image forgery detection and localization model. By employing instance-aware dual-stream prompt learning and a forgery-enhanced noise adapter, it addresses CLIP's lack of domain-specific prompts and forgery sensitivity in forgery detection, successfully transferring CLIP's open-world generalization capability to the forgery detection task.
One-stage Prompt-based Continual Learning: This paper proposes the OS-Prompt framework. By directly utilizing the token embeddings of ViT intermediate layers as prompt queries (rather than relying on an extra query ViT forward pass), it reduces the computational cost of prompt-based continual learning by approximately 50%. It further compensates for the loss in representation capacity with a Query-Pool Regularization (QR) loss, outperforming CodaPrompt by about 1.4% on CIFAR-100, ImageNet-R, and DomainNet.
Operational Open-Set Recognition and PostMax Refinement: This paper proposes OOSA (Operational Open-Set Accuracy), an evaluation metric for practical deployment scenarios, and PostMax, a post-processing algorithm. By normalizing the maximum class logit with deep feature magnitude and mapping it through a Generalized Pareto Distribution (GPD), logits are converted into reasonable probability estimates, achieving statistically significant SOTA performance in large-scale evaluations.
Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective: Analyzes the causes of catastrophic overfitting in fast adversarial training from a bi-level optimization perspective, and proposes the FGSM-PCO method. By adaptively fusing historical and current adversarial examples and adding a custom regularization loss, it effectively prevents and corrects the collapse of the inner optimization.
Resilience of Entropy Model in Distributed Neural Networks: This paper presents the first systematic study on the robustness of entropy coding models in distributed DNNs under both intentional interference (adversarial attacks) and unintentional interference (weather changes, motion blur, etc.). It reveals that the compression features learned by the entropy model are distinct from classification features, and proposes an object-aware total variation denoising defense method. This approach reduces post-attack transmission overhead to below clean data levels, with an accuracy drop of only around 2%.

Browse all 13 AI Safety papers →

📂 Others (42)¶

A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation: A statistical framework is proposed that synergistically designs three components—stratification, sampling design, and estimation—to accurately estimate Computer Vision (CV) model accuracy with only a small number of annotated test samples, achieving up to a 10x efficiency gain (i.e., reaching equivalent accuracy with 1/10 of the annotations).
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-Agnostic Counting: Proposes ABC123, the first exemplar-free method capable of concurrently counting multiple classes of unknown objects in an image. By combining a ViT for multi-channel density map regression, Hungarian-matching-based training, and a SAM-based example discovery mechanism, this approach significantly outperforms exemplar-based methods on the self-created synthetic MCAC dataset while demonstrating strong generalization on the real-world FSC-147 dataset.
Active Generation for Image Classification: This paper proposes ActGen, which integrates the concept of active learning into the image generation process of diffusion models. By identifying misclassified validation samples as guidance images and combining attentive guidance with gradient-based generation control, ActGen achieves a +2.26% accuracy improvement on ImageNet using only 10% generated images, outperforming previous methods that utilize 94% synthetic data.
AddMe: Zero-Shot Group-Photo Synthesis by Inserting People Into Scenes: This paper proposes AddMe, a zero-shot portrait generator based on diffusion models. Through an identity decoupling adapter and an enhanced portrait attention module, it can naturally insert a given portrait into specified positions of an existing scene image, while maintaining identity consistency and the plausibility of group interactions.
ADMap: Anti-disturbance Framework for Vectorized HD Map Construction: This paper proposes the ADMap framework, which monitors the point sequence prediction process in a cascaded manner at both the inter-instance and intra-instance levels using three modules: Multi-Scale Perception Neck (MPN), Instance Interactive Attention (IIA), and Vector Direction Difference Loss (VDDL). This effectively alleviates the point sequence jitter/jaggedness issues in vectorized HD map construction and achieves SOTA performance on nuScenes and Argoverse2.
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception: Proposes NEAT, a model-agnostic and lightweight plug-in that explicitly addresses feature-level spatial misalignment caused by pose errors and communication delays in collaborative perception, using three modules: importance-guided query proposals, deformable feature alignment, and region cross-attention reinforcement. It delivers consistent gains for multiple baseline methods under noisy settings across four collaborative 3D detection datasets.
An Incremental Unified Framework for Small Defect Inspection: This work proposes an Incremental Unified Framework (IUF), integrating incremental learning into unified reconstruction-based defect detection for the first time. By establishing semantic boundaries via Object-Aware Self-Attention (OASA), compressing non-primary semantic spaces through Semantic Compression Loss (SCL), and protecting old object features using an SVD-based weight update strategy, IUF achieves state-of-the-art incremental defect detection performance at both image and pixel levels on MVTec-AD and VisA.
AttnZero: Efficient Attention Discovery for Vision Transformers: This paper proposes AttnZero, the first framework to automatically discover efficient attention modules. By constructing a structured search space consisting of six computation graph types and a rich set of operators, and leveraging an evolutionary algorithm for multi-objective search, it automatically discovers linear attention formulations applicable to various ViTs. It achieves ImageNet top-1 accuracies of 74.9%/78.1%/82.1%/82.9% on DeiT/PVT/Swin/CSwin, respectively, and constructs the Attn-Bench-101 benchmark containing 2000 attention variants.
Auto-GAS: Automated Proxy Discovery for Training-Free Generative Architecture Search: This paper proposes Auto-GAS, the first training-free architecture search framework for generative adversarial networks (GANs). By automatically discovering and optimizing zero-cost proxy metrics to replace traditional training-based evaluations, it achieves a \(110\times\) search speedup while maintaining generation quality comparable to training-based methods.
Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation: The BUAL framework is proposed to push unknown-class samples toward high-confidence regions and known-class samples toward low-confidence regions using Random Label Negative Learning. Combined with a bidirectional uncertainty sampling strategy, the framework effectively selects highly informative known-class samples under open-set scenarios.

Browse all 42 Others papers →

🗂 More Areas (30)¶

💡 LLM Reasoning (1)¶

Controllable Navigation Instruction Generation with Chain of Thought Prompting: This paper proposes C-Instructor, which leverages chain-of-thought prompting of LLMs to achieve style- and content-controllable navigation instruction generation. Through three core mechanisms—Chain of Thought with Landmarks (CoTL), Spatial Topology Modeling Task (STMT), and Style-Mixed Training (SMT)—the method comprehensively outperforms existing approaches on four indoor and outdoor navigation datasets.

🦾 LLM Agent (3)¶

Agent3D-Zero: An Agent for Zero-shot 3D Understanding: Agent3D-Zero proposes a VLM-based zero-shot 3D scene understanding agent framework. By utilizing Set-of-Line visual prompting on the bird's-eye view (BEV) to guide the VLM to actively select observation viewpoints and synthesizing multi-view images for 3D reasoning, it outperforms fine-tuned 3D-LLM methods on tasks like ScanQA.
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning: (Note: Brief note based on abstract) This paper proposes HYDRA, a multi-stage dynamic compositional visual reasoning framework. Through the collaboration of three modules—a Planner, a reinforcement learning cognitive controller (RL Agent), and a Reasoner—it achieves reliable and progressive visual reasoning, reaching SOTA performance on multiple datasets including RefCOCO/RefCOCO+, OK-VQA, and GQA.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding: This paper proposes VideoAgent, a memory-augmented multimodal agent. By constructing structured memory (temporal memory storing event descriptions and object memory storing object tracking states) and utilizing four tools to interact with the memory, it performs zero-shot long video QA tasks. It achieves an average gain of +6.6% on NExT-QA and +26.0% on EgoSchema, approaching the performance of Gemini 1.5 Pro.

🔒 LLM Safety (1)¶

MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment: Proposes the MAGR method, which utilizes a manifold alignment projector and an Intra-Inter-Joint graph regularizer to address the misalignment between old and current feature manifolds caused by feature replay in Continual Action Quality Assessment (CAQA), significantly outperforming existing baselines across four datasets.

👻 Hallucination Detection (2)¶

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models: BEAF proposes a "before-after comparison" hallucination evaluation paradigm: by observing changes in VLM responses after removing objects through image editing and introducing four change-aware metrics (TU/IG/SB/ID), it reveals hallucination behaviors that cannot be detected by traditional text-axis evaluations.
LiDAR-Event Stereo Fusion with Hallucinations: This paper proposes the first framework to fuse sparse LiDAR depth points with event stereo cameras. By "hallucinating" (inserting fictitious events) within the event stack representations (VSH) or the raw event stream (BTH), the framework compensates for the missing information of event cameras in motion-free or textureless regions, significantly improving event stereo matching accuracy.

📖 NLP Understanding (1)¶

SLIMER: Show Less, Instruct More - Enriching Prompts with Definitions and Guidelines for Zero-Shot NER: SLIMER enhances the zero-shot named entity recognition capability of LLMs by injecting entity definitions and annotation guidelines into prompts. Trained on only 391 entity categories, it achieves performance on unseen entity tags comparable to state-of-the-art (SOTA) methods trained on 13,000+ entity categories.
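
The idea of enriching an NER prompt with a definition and guidelines can be sketched roughly as below. This is only an illustration: the function name `build_ner_prompt`, the template wording, and the example definition are all hypothetical and are not taken from SLIMER's actual prompt format.

```python
def build_ner_prompt(text: str, tag: str, definition: str, guidelines: str) -> str:
    """Assemble a zero-shot NER prompt for a single entity tag.

    Hypothetical template: the field names and wording are assumptions
    made for illustration, not the prompt used by SLIMER.
    """
    return "\n".join([
        f"Text: {text}",
        f"Entity type to extract: {tag}",
        f"Definition: {definition}",      # what the tag means
        f"Guidelines: {guidelines}",      # how to handle edge cases
        "Answer with a JSON list of the matching text spans.",
    ])


prompt = build_ner_prompt(
    text="Marie Curie moved to Paris in 1891.",
    tag="LOCATION",
    definition="A named geographic place such as a city or country.",
    guidelines="Do not label people or dates; label only the place name itself.",
)
print(prompt)
```

Issuing one such prompt per entity tag lets an unseen tag be described at inference time through its definition and guidelines alone.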

🗣️ Dialogue Systems (1)¶

BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation: This paper proposes the BI-MDRG framework, which bridges image history information to enhance the image-grounding capability of textual responses and the object consistency in sequential image responses in multimodal dialogues.

🔍 Information Retrieval & RAG (3)¶

Multi-Label Cluster Discrimination for Visual Representation Learning: This work proposes MLCD (Multi-Label Cluster Discrimination), which assigns multiple cluster pseudo-labels to each image and designs a disambiguated multi-label classification loss. Pre-trained on LAION-400M, the ViT model under MLCD comprehensively outperforms OpenCLIP, FLIP, and UNICOM in linear probe, zero-shot classification, and retrieval tasks.
OneRestore: A Universal Restoration Framework for Composite Degradation: OneRestore is proposed as a Transformer-based universal image restoration framework. Driven by a scene-descriptor-guided cross-attention mechanism and a composite degradation restoration loss, it adaptively handles low-light, haze, rain, snow, and their arbitrary composite combinations within a single model, supporting controllable restoration under both text and visual modes.
Towards Open-Ended Visual Recognition with Large Language Model: This paper proposes the OmniScient Model (OSM)—a generative mask classifier based on a frozen CLIP-ViT, a trainable MaskQ-Former, and a frozen LLM (Vicuna-7B). It shifts visual recognition from "selecting categories from a predefined vocabulary" to "directly generating category names," eliminating the dependency on predefined vocabularies during both training and testing. It outperforms DaTaSeg by +4.3 PQ on COCO panoptic segmentation.

💻 Code Intelligence (1)¶

DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation: Proposes using code generation to synthesize structured visual data (slides and UIs) to train understanding models, thereby reducing the need for manual annotation.

📐 Optimization & Theory (2)¶

Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction: This paper proposes a sample-level bias prediction method named SBP. By leveraging a Bias-Oriented GAN, it utilizes the contextual information of the union region of object pairs to predict sample-specific bias correction vectors, refining coarse-grained relationships into fine-grained ones. SBP outperforms dataset-level bias correction methods by an average of 5.6%/3.9%/3.2% in Average@K on the VG/GQA/VG-1800 datasets, respectively.
Handling the Non-smooth Challenge in Tensor SVD: A Multi-objective Tensor Recovery Framework: A multi-objective tensor recovery framework (MOTC) based on a learnable tensor nuclear norm is proposed. By introducing learnable unitary matrices in place of fixed transforms, this approach addresses the performance degradation of t-SVD methods on non-smooth tensor data, while effectively exploiting the low-rankness of tensors across all dimensions through multi-objective optimization.
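
To make the role of the unitary transform in the last entry concrete, here is a minimal sketch of a tensor nuclear norm computed under an arbitrary unitary matrix applied along the third mode. It assumes the common definition (sum of the nuclear norms of the transformed frontal slices, with no normalization factor); the function name `transformed_tnn` is hypothetical, and the matrix here is fixed rather than learned as in MOTC.

```python
import numpy as np


def transformed_tnn(tensor: np.ndarray, unitary: np.ndarray) -> float:
    """Sum of nuclear norms of frontal slices after a mode-3 unitary transform.

    Assumed definition for illustration; not the paper's exact objective.
    """
    # Mode-3 product: mix the frontal slices with the rows of `unitary`.
    transformed = np.einsum("ijk,lk->ijl", tensor, unitary)
    return float(sum(
        np.linalg.norm(transformed[:, :, l], ord="nuc")
        for l in range(transformed.shape[2])
    ))


# Toy 2x2x2 tensor whose frontal slices are I and 2I.
tensor = np.stack([np.eye(2), 2.0 * np.eye(2)], axis=2)

identity = np.eye(2)
hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)

tnn_identity = transformed_tnn(tensor, identity)   # 2 + 4 = 6
tnn_hadamard = transformed_tnn(tensor, hadamard)   # 3*sqrt(2) + sqrt(2) = 4*sqrt(2)
print(tnn_identity, tnn_hadamard)
```

The two values differ, which is the point: the choice of transform changes how low-rank the tensor appears, so a transform fitted to the data can expose structure that a fixed one misses.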

🔗 Causal Inference (4)¶

Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation: This paper proposes a bi-level data pruning strategy, BiLP, which combines static pruning based on empirical loss and dynamic pruning based on individual treatment effect (ITE) to efficiently select the most valuable real samples for dataset distillation. It consistently improves the performance of existing distillation methods in a plug-and-play manner while reducing computational overhead.
Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization: This work proposes the CMBRL framework to discover Markov Blanket (MB) features—the minimal sufficient statistics of the target variable—within the latent space. This replaces the convention of selecting only causal or anti-causal variables in existing methods, constructing an invariant prediction mechanism to achieve cross-domain generalization.
Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning: This paper proposes the Counterfactual Bias-Robust Reasoning dataset (CoBRa) and the Chain of Counterfactual Thought (CoCT) method. By constructing edited knowledge graphs and image content, the study evaluates and mitigates knowledge bias in large vision-language models (LVLMs), enabling models to perform step-by-step reasoning rather than relying on biased knowledge. This approach significantly outperforms existing methods on tasks requiring reasoning under knowledge bias.
Understanding Physical Dynamics with Counterfactual World Modeling: This paper proposes Counterfactual World Modeling (CWM), which trains a masked video predictor using a temporally-factored masking policy and designs a "counterfactual prompting" mechanism to extract multiple visual structures (e.g., optical flow, segmentation, keypoints) from a single pre-trained model without fine-tuning, achieving state-of-the-art performance on the Physion benchmark for physical dynamics understanding.

🕸️ Graph Learning (4)¶

Confidence Self-Calibration for Multi-Label Class-Incremental Learning: To address the overconfident predictions and false-positive errors caused by partial labels in Multi-Label Class-Incremental Learning (MLCIL), a Confidence Self-Calibration (CSC) framework is proposed. It calibrates label relationships using a Class-Incremental Graph Convolutional Network (CI-GCN) and calibrates confidence via max-entropy regularization, significantly outperforming SOTA methods on MS-COCO and VOC.
GKGNet: Group K-Nearest Neighbor Based Graph Convolutional Network for Multi-Label Image Recognition: Proposes GKGNet, the first fully graph convolutional multi-label recognition model, which dynamically constructs graph structures between labels and image regions utilizing a Group KNN mechanism, achieving SOTA performance on MS-COCO and VOC2007 with lower computational cost.
SENC: Handling Self-collision in Neural Cloth Simulation: This paper proposes SENC, which effectively addresses the cloth self-collision problem in self-supervised neural cloth simulation for the first time, using a self-collision loss based on Global Intersection Analysis (GIA) and a self-collision-aware graph neural network.
Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching: A synchronous diffusion regularization method is proposed for unsupervised non-rigid 3D shape matching. The core idea is that "synchronously diffusing the same function on two shapes should yield consistent outputs." Through this simple yet efficient regularization, the matching smoothness of existing deep functional map methods is significantly improved, achieving SOTA performance on several datasets including FAUST, SCAPE, and TOPKIDS.
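
The consistency idea quoted in the last entry can be illustrated on a toy problem. In this sketch, small graph Laplacians stand in for the shapes' Laplace-Beltrami operators and a permutation matrix stands in for the learned correspondence; the function names and the squared-error form of the penalty are assumptions for illustration, not the paper's exact regularizer.

```python
import numpy as np


def heat_diffuse(laplacian: np.ndarray, f: np.ndarray, t: float) -> np.ndarray:
    """Apply exp(-t * L) to f via the eigendecomposition of a symmetric Laplacian."""
    evals, evecs = np.linalg.eigh(laplacian)
    return evecs @ (np.exp(-t * evals) * (evecs.T @ f))


def sync_diffusion_loss(lap_x, lap_y, corr, f, t=0.5) -> float:
    """Mismatch between 'diffuse on X, then map' and 'map, then diffuse on Y'."""
    diffuse_then_map = corr @ heat_diffuse(lap_x, f, t)
    map_then_diffuse = heat_diffuse(lap_y, corr @ f, t)
    return float(np.sum((diffuse_then_map - map_then_diffuse) ** 2))


# Shape X: a path graph on 4 vertices.
lap_x = np.array([[1, -1, 0, 0],
                  [-1, 2, -1, 0],
                  [0, -1, 2, -1],
                  [0, 0, -1, 1]], dtype=float)

# Shape Y: the same graph with relabelled vertices (vertex i on X -> perm[i] on Y).
perm = np.array([2, 0, 3, 1])
corr_true = np.zeros((4, 4))
corr_true[perm, np.arange(4)] = 1.0
lap_y = corr_true @ lap_x @ corr_true.T

f = np.array([1.0, 0.0, 0.0, 0.0])  # a function on X: heat placed at one endpoint

loss_true = sync_diffusion_loss(lap_x, lap_y, corr_true, f)
loss_wrong = sync_diffusion_loss(lap_x, lap_y, np.eye(4), f)
print(loss_true, loss_wrong)
```

With the correct correspondence the two diffusion orders agree and the penalty vanishes up to floating-point error, while the wrong map (here the identity) yields a clearly positive value.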

📈 Time Series (3)¶

Multi-person Pose Forecasting with Individual Interaction Perceptron and Prior Learning: This paper proposes IAFormer (Interaction-Aware Pose Forecasting Transformer). By designing the Interaction Perceptron Module (IPM) to evaluate the level of individual interaction with events, and introducing the Interaction Prior Learning Module (IPLM) to accumulate prior knowledge of high-frequency interaction patterns, it achieves semantic-level multi-person pose forecasting, significantly outperforming existing methods on multiple multi-person scene datasets.
OmniSat: Self-Supervised Modality Fusion for Earth Observation: This paper proposes OmniSat, a unified framework that fuses heterogeneous remote sensing data—including multi-spectral time-series (S2), SAR time-series (S1), and high-resolution single-temporal images (SPOT/Aerial)—into a unified representation using modality-specific encoders and cross-modal contrastive self-supervised pre-training. It outperforms all unimodal and multimodal baselines on semantic segmentation and crop classification.
Semantically Guided Representation Learning For Action Anticipation: The S-GEAR framework is proposed, which learns visual action prototypes and utilizes the semantic associations of language models to guide the geometric relationships among these prototypes. This enables the model to comprehend the semantic interconnectedness among actions, thereby enhancing action anticipation performance. S-GEAR achieves SOTA or highly competitive results across four benchmarks: Epic-Kitchens 55/100, EGTEA Gaze+, and 50 Salads.

⚛️ Physics & Scientific Computing (1)¶

Robust Fitting on a Gate Quantum Computer: Robust fitting is implemented on a real gate quantum computer (IonQ Aria) for the first time: a quantum circuit is proposed for 1D \(\ell_\infty\) feasibility testing, filling a critical gap in using Bernstein-Vazirani (BV) circuits to compute Boolean influence, and demonstrating how to accumulate 1D influence for high-dimensional non-linear models (e.g., fundamental matrix estimation).
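
For readers unfamiliar with the Bernstein-Vazirani circuit mentioned above, here is a small classical state-vector simulation of the textbook algorithm, which recovers a hidden bit string `s` from a single query to the phase oracle for f(x) = s·x mod 2. It only illustrates the basic circuit under these assumptions; it does not reproduce the paper's feasibility-test circuit or its influence computation, and the function name is hypothetical.

```python
import numpy as np


def bernstein_vazirani(secret: list[int]) -> list[int]:
    """Simulate H^n -> phase oracle -> H^n and return the measured bit string."""
    n = len(secret)
    s = np.array(secret)
    dim = 2 ** n

    # Row i holds the bits of basis state i, most significant bit first.
    bits = (np.arange(dim)[:, None] >> np.arange(n - 1, -1, -1)) & 1

    # First Hadamard layer on |0...0>: uniform superposition.
    state = np.full(dim, 1.0 / np.sqrt(dim))

    # Phase oracle: multiply the amplitude of |x> by (-1)^(s.x mod 2).
    state = state * (-1.0) ** ((bits @ s) % 2)

    # Second Hadamard layer: H^n[y, x] = (-1)^(x.y) / sqrt(2^n).
    hadamard_n = (-1.0) ** ((bits @ bits.T) % 2) / np.sqrt(dim)
    state = hadamard_n @ state

    # All probability mass now sits on |s>, so the measurement is deterministic.
    outcome = int(np.argmax(np.abs(state) ** 2))
    return [int(b) for b in bits[outcome]]


recovered = bernstein_vazirani([1, 0, 1, 1])
print(recovered)
```

A classical algorithm would need one query per bit to learn `s`, whereas this circuit needs a single oracle call.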

🌍 Earth Science (1)¶

Semi-supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization: This paper proposes SemiVDN, the first semi-supervised video desnowing framework. By incorporating a physics-prior-guided temporal decoupling expert module and distribution-driven contrastive regularization, SemiVDN utilizes unlabeled real-world snowy videos to narrow the synthetic-to-real domain gap, outperforming existing methods on both synthetic and real-world datasets.

👥 Social Computing (2)¶

Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels: Proposes the DaSC framework, which jointly addresses long-tailed distribution and noisy labels through distribution-aware class centroid estimation (DaCC) and confidence-aware contrastive learning (SBCL + MIDL), achieving SOTA results on CIFAR and real-world noisy datasets.
GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering: Proposes GRACE (GRAph-based Contextual DEbiasing). Through unsupervised context graph learning and graph-based diverse in-context example selection, it addresses the data bias inherited by large language models in knowledge-enhanced VQA systems.