
📂 Others

🎞️ ECCV2024 · 42 paper notes

📌 Same area in other venues: 📷 CVPR2026 (105) · 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🧪 ICML2026 (70) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (121)

🔥 Top topics: Adversarial Robustness ×4

A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation: A statistical framework is proposed that jointly designs three components (stratification, sampling design, and estimation) to accurately estimate Computer Vision (CV) model accuracy from only a small number of annotated test samples, achieving up to a 10x efficiency gain (i.e., reaching equivalent estimation accuracy with 1/10 of the annotations).
ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-Agnostic Counting: Proposes ABC123, the first exemplar-free method capable of concurrently counting multiple classes of unknown objects in an image. By combining a ViT for multi-channel density map regression, Hungarian-matching-based training, and a SAM-based example discovery mechanism, this approach significantly outperforms exemplar-based methods on the self-created synthetic MCAC dataset while demonstrating strong generalization on the real-world FSC-147 dataset.
Active Generation for Image Classification: This paper proposes ActGen, which integrates the concept of active learning into the image generation process of diffusion models. By identifying misclassified validation samples as guidance images and combining attentive guidance with gradient-based generation control, ActGen achieves a +2.26% accuracy improvement on ImageNet using only 10% generated images, outperforming previous methods that utilize 94% synthetic data.
AddMe: Zero-Shot Group-Photo Synthesis by Inserting People Into Scenes: This paper proposes AddMe, a zero-shot portrait generator based on diffusion models. Through an identity decoupling adapter and an enhanced portrait attention module, it can naturally insert a given portrait into specified positions of an existing scene image, while maintaining identity consistency and the plausibility of group interactions.
ADMap: Anti-disturbance Framework for Vectorized HD Map Construction: This paper proposes the ADMap framework, which monitors the point sequence prediction process in a cascaded manner at both the inter-instance and intra-instance levels using three modules: Multi-Scale Perception Neck (MPN), Instance Interactive Attention (IIA), and Vector Direction Difference Loss (VDDL). This effectively alleviates the point sequence jitter/jaggedness issues in vectorized HD map construction and achieves SOTA performance on nuScenes and Argoverse2.
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception: Proposes NEAT, a model-agnostic and lightweight plug-in that explicitly addresses feature-level spatial misalignment caused by pose errors and communication delays in collaborative perception, using three modules: importance-guided query proposals, deformable feature alignment, and region cross-attention reinforcement. It delivers consistent gains for multiple baseline methods under noisy settings across four collaborative 3D detection datasets.
An Incremental Unified Framework for Small Defect Inspection: This work proposes an Incremental Unified Framework (IUF), integrating incremental learning into unified reconstruction-based defect detection for the first time. By establishing semantic boundaries via Object-Aware Self-Attention (OASA), compressing non-primary semantic spaces through Semantic Compression Loss (SCL), and protecting old object features using an SVD-based weight update strategy, IUF achieves state-of-the-art incremental defect detection performance at both image and pixel levels on MVTec-AD and VisA.
AttnZero: Efficient Attention Discovery for Vision Transformers: This paper proposes AttnZero, the first framework to automatically discover efficient attention modules. By constructing a structured search space consisting of six computation graph types and a rich set of operators, and leveraging an evolutionary algorithm for multi-objective search, it automatically discovers linear attention formulations applicable to various ViTs. It achieves ImageNet top-1 accuracies of 74.9%/78.1%/82.1%/82.9% on DeiT/PVT/Swin/CSwin, respectively, and constructs the Attn-Bench-101 benchmark containing 2000 attention variants.
Auto-GAS: Automated Proxy Discovery for Training-Free Generative Architecture Search: This paper proposes Auto-GAS, the first training-free architecture search framework for generative adversarial networks (GANs). By automatically discovering and optimizing zero-cost proxy metrics to replace traditional training-based evaluations, it achieves a 110× search speedup while maintaining generation quality comparable to training-based methods.
Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation: The BUAL framework is proposed to push unknown-class samples toward high-confidence regions and known-class samples toward low-confidence regions using Random Label Negative Learning. Combined with a bidirectional uncertainty sampling strategy, the framework effectively selects highly informative known-class samples under open-set scenarios.
CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction: This paper proposes the CLR-GAN training paradigm. By requiring the discriminator to recover the pre-defined latent code of the generator and the generator to reconstruct the real input, it establishes consistency constraints between the latent spaces of G and D. This makes GAN training fairer and more stable, improving the FID by 31.22% on CIFAR10 and 39.5% on AFHQ-Cat.
COIN-Matting: Confounder Intervention for Image Matting: This paper analyzes the dataset bias issue in image matting from a causal inference perspective, identifying two typical biases (contrast bias and transparency bias) and their root cause—confounders. It proposes COIN-Matting, a model-agnostic de-biasing framework based on backdoor adjustment, which significantly mitigates the impact of bias and improves the performance of existing matting models.
DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation: DC-Solver is proposed to alleviate the misalignment issue in predictor-corrector diffusion samplers through Dynamic Compensation. It optimizes compensation ratios using only 10 data points and achieves instant generalization to unseen NFE/CFG configurations via Cascade Polynomial Regression (CPR).
Docling Technical Report: Docling is an open-source PDF document conversion tool that integrates DocLayNet-based layout analysis models and TableFormer table structure recognition models. It efficiently converts PDFs into structured JSON or Markdown formats on standard hardware.
Domain Reduction Strategy for Non-Line-of-Sight Imaging: This paper proposes an optimization-based method for non-line-of-sight (NLOS) imaging. By modeling the transient signal as a superposition of point-wise light propagation functions and designing a coarse-to-fine domain reduction strategy to prune empty regions, it achieves approximately 20x acceleration under general NLOS scenarios while simultaneously reconstructing both albedo and surface normals.
Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning: This paper proposes DMLoRA (Dropout-Mixture Low-Rank Adaptation), which balances accuracy and regularization by introducing a multi-branch up-and-down projection structure and progressively dropping out branches during training. Along with a two-stage learning scalar strategy to optimize the scaling coefficients of each layer, DMLoRA achieves SOTA performance on the VTAB-1k and FGVC visual fine-tuning benchmarks with zero additional inference overhead.
Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting: This paper proposes a sequence model-based method for online Chinese handwriting trajectory beautification. By disentangling writer style and character structural style via a cross-attention mechanism, it transforms the user's scribbled handwriting trajectories into aesthetic writing while preserving personal style. Concurrently, a Cartesian product decomposition effectively removes redundant style features.
Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent: This paper proposes the Diode optimizer, specifically designed for binary neural networks (BNNs). By utilizing lower-order moment estimates of gradient signs, Diode achieves latent-weight-free parameter updates. It improves the Top-1 accuracy of BNext-18 on ImageNet by 0.96% with an 8x reduction in training iterations, while achieving new SOTA performance on NLP tasks.
ET: The Exceptional Trajectories - Text-to-Camera-Trajectory Generation with Character Awareness: Proposes the first camera-character trajectory dataset, E.T. (115K samples, 11M frames), extracted from real movies, alongside Director, a diffusion-based camera trajectory generation method that generates complex camera trajectories based on text descriptions and character trajectories. Additionally, designs CLaTr, a contrastive language-trajectory embedding, for trajectory generation quality evaluation.
Exploring Guided Sampling of Conditional GANs: This paper proposes introducing a diffusion-like guided sampling strategy into conditional GANs. By estimating the joint data-condition distribution through latent space vector operations, without the need for pre-trained classifiers or learning unconditional models, it significantly improves GAN generation quality, reducing the FID on ImageNet 64×64 from 8.87 to 4.37.
FisherRF: Active View Selection and Mapping with Radiance Fields Using Fisher Information: This paper proposes FisherRF, which utilizes Fisher Information to directly quantify the observed information of radiance fields model parameters. By maximizing the Expected Information Gain (EIG), it selects the optimal view. The proposed method achieves state-of-the-art (SOTA) performance across three tasks: view selection, active mapping, and uncertainty quantification. Furthermore, by leveraging sparsity and custom CUDA kernels, it achieves a view evaluation speed of 70 fps.
Foster Adaptivity and Balance in Learning with Noisy Labels: Proposes the SED method, which addresses the label noise problem through an adaptive and class-balanced sample selection and re-weighting mechanism. It achieves SOTA performance on both synthetic and real-world noisy datasets without requiring prior knowledge such as predefined thresholds.
Free-Viewpoint Video of Outdoor Sports Using a Flying Camera: Proposes a system based on a UAV-mounted RGB camera capable of reconstructing 4D dynamic humans and 3D unbounded backgrounds in outdoor sports scenes, enabling free-viewpoint video rendering at any timestamp.
FreeAugment: Data Augmentation Search Across All Degrees of Freedom: This paper proposes FreeAugment, the first fully differentiable search method capable of simultaneously and globally optimizing the four degrees of freedom (number, type, sequence, and magnitude of transformations) of data augmentation strategies. By learning the depth distribution via Gumbel-Softmax and the permutation distribution via Gumbel-Sinkhorn to avoid duplicate sampling, it achieves state-of-the-art (SOTA) performance across multiple benchmarks.
Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery: Proposes Functional Transform-based Low-Rank Tensor Factorization (FLRTF), which utilizes implicit neural representations instead of traditional discrete transforms to capture the continuous smoothness of data along the third dimension, effectively addressing temporal/spectral degradation.
HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images: This paper proposes a universal framework, HiEI, which converts natural images into high-quality Emerging Images (EIs) through a human-centric color quantization module (TTNet), a perceptual difficulty control module (PDC), and a template vectorization module (TV). It outperforms existing methods in content and style quality while effectively resisting attacks from deep vision models, making it suitable for CAPTCHA mechanisms.
High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding: This paper proposes a 3D textured shape generation framework based on a sparse encoding module and an adversarial decoding module. By minimally adapting StableDiffusion to the 3D domain, it achieves open-vocabulary, high-fidelity 3D generation on ShapeNet and G-Objaverse (200K samples), outperforming existing SOTA methods.
HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion: HPFF is proposed to resolve block-to-block information loss and high GPU memory footprint in local learning through Hierarchical Locally Supervised Learning (HiLo, partitioning the network into independent and cascade levels of local modules) and Patch Feature Fusion (PFF, dividing auxiliary network inputs into patches for individual computation and average fusion). HPFF significantly outperforms existing local learning methods across multiple datasets and approaches or even surpasses end-to-end backpropagation (BP).
Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance: Proposes an Auxiliary Point Guidance (APG) strategy and an Implicit Feature Interpolation (IFI) module to stabilize proposal-target matching in point-based crowd counting methods by explicitly generating auxiliary positive and negative samples near ground-truth points, achieving state-of-the-art results on multiple datasets.
Mahalanobis Distance-Based Multi-View Optimal Transport for Multi-View Crowd Localization: Proposes a Mahalanobis distance-based multi-view optimal transport loss (M-MVOT) that adaptively adjusts the transport cost based on the line-of-sight direction and target-to-camera distance, introducing point-supervised optimal transport to the multi-view crowd localization task for the first time and significantly outperforming methods based on density-map MSE loss.
MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory: This paper proposes MemBN (Memory-based Batch Normalization). By maintaining a statistics memory queue within each BN layer and designing dedicated memory management and aggregation algorithms, TTA methods can robustly estimate test domain statistics across various batch sizes, significantly improving accuracy and robustness in small-batch scenarios.
Momentum Auxiliary Network for Supervised Local Learning: This paper proposes the Momentum Auxiliary Network (MAN), which transfers parameter information from adjacent local blocks to the auxiliary network of the current block via Exponential Moving Average (EMA). By introducing learnable biases to compensate for cross-block feature discrepancies, MAN addresses the "myopia" problem caused by the lack of inter-block information exchange in supervised local learning. It achieves superior performance on ImageNet using less than half of the GPU memory of end-to-end (E2E) training.
Non-parametric Sensor Noise Modeling and Synthesis: This paper proposes a non-parametric sensor noise model that models the real noise distribution by directly constructing a probability mass function (PMF) for each brightness level from real-world captured images, requiring no assumption of a specific noise distribution form. It also introduces ISO interpolation and noise synthesis methods on noisy images, significantly outperforming existing parametric noise models on downstream denoising tasks.
Object-Aware NIR-to-Visible Translation: This paper proposes an object-aware near-infrared (NIR) to visible image translation framework. By decomposing visible images into object-independent illumination components and object-specific reflectance components for separate processing, and combining this with segmentation priors, it achieves high-quality NIR colorization even without large-scale paired data. In addition, the first fully aligned large-scale paired NIR-visible dataset is constructed.
PartCraft: Crafting Creative Objects by Parts: This work proposes PartCraft, which achieves part-selection-based control for text-to-image generation for the first time. Users can "pick" different parts (such as a bird's head, wings, and body) from various objects, and the model naturally combines them into a novel and structurally coherent creative object.
Real-Data-Driven 2000 FPS Color Video from Mosaicked Chromatic Spikes: Proposes a fully real-data-driven 2000 FPS color high-dynamic-range (HDR) video reconstruction method for mosaicked chromatic spikes. It addresses noise in short-time frames and motion blur via a self-supervised denoising module and a progressive warping module, reconstructing high-quality high-speed color video without requiring synthetic data.
Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch: This paper proposes the RECD algorithm, which estimates the unknown class distribution of unlabeled data via Monte Carlo approximation. It rebalances the classifier based on the estimated distribution and introduces feature cluster compression to mitigate representation space imbalance, achieving SOTA performance in semi-supervised learning scenarios with labeled-unlabeled class distribution mismatch.
Rethinking Data Bias: Dataset Copyright Protection via Embedding Class-Wise Hidden Bias: This paper proposes "Undercover Bias", a dataset watermarking method. By embedding hidden watermark patterns that are irrelevant to the target task but correspond to the labels into the training data, models trained by unauthorized users unconsciously learn to classify these watermarks. The capability to classify the watermarks serves as irrefutable evidence of unauthorized use, achieving dataset copyright protection that is covert, model-agnostic, and non-destructive to the target task.
SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding: This work proposes the SpatialFormer architecture, which explicitly models the global spatial relations of scenes by introducing adaptive spatial tokens. It adopts a decoder-only architecture and bilateral cross-attention blocks to achieve efficient interaction between context and spatial information, demonstrating excellent generalization and transferability across classification, segmentation, and detection tasks.
Superpixel-Informed Implicit Neural Representation for Multi-Dimensional Data: Proposes Superpixel-Informed Implicit Neural Representation (S-INR), which replaces pixels with generalized superpixels as the basic unit of INR. By utilizing two modules—an exclusive attention MLP and a shared dictionary matrix—this method fully mines semantic information within and across generalized superpixels, outperforming existing INR methods in tasks such as image reconstruction, completion, denoising, and point data recovery.
Synergy of Sight and Semantics: Visual Intention Understanding with CLIP: This paper proposes the IntCLIP framework, which transfers "sight" (visual perception) knowledge in CLIP to "semantic" (semantic-centric) multi-label intention understanding tasks through a dual-branch encoding strategy. Combined with hierarchical class integration and sight-assisted aggregation, it significantly outperforms state-of-the-art (SOTA) methods on standard MIU benchmarks and image emotion recognition tasks.
Wavelength-Embedding-guided Filter-Array Transformer for Spectral Demosaicing: This paper proposes WeFAT, which endows the model with "wavelength memory" capability through Wavelength-Embedding-guided Multi-head Self-Attention (We-MSA). Combined with the Masked Attention Mechanism (MaM) to focus on high-quality spectral regions, it maintains stable performance under different cameras and spectral distributions when trained only on the ARAD dataset, outperforming existing SOTA methods.
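
The stratification-sampling-estimation idea summarized in the first entry of this list can be illustrated with a textbook stratified estimator. The sketch below is only an assumption-laden illustration, not the paper's actual design: it uses simple random sampling within each stratum and weights each stratum by its share of the pool. The function `stratified_accuracy`, the toy oracle `toy_is_correct`, and the two-stratum split are hypothetical names introduced here.

```python
import random
from statistics import mean


def stratified_accuracy(strata, n_per_stratum, is_correct, rng):
    """Estimate overall accuracy from a few annotated samples per stratum.

    strata:        list of lists of sample ids (a partition of the test pool)
    n_per_stratum: annotation budget spent in each stratum
    is_correct:    hypothetical oracle standing in for "annotate, then compare"
    """
    total = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum in strata:
        if not stratum:
            continue  # an empty stratum carries no weight
        k = min(n_per_stratum, len(stratum))
        picked = rng.sample(stratum, k)  # simple random sample, no replacement
        stratum_acc = mean(1.0 if is_correct(i) else 0.0 for i in picked)
        # Weight each stratum's sample accuracy by its share of the pool.
        estimate += (len(stratum) / total) * stratum_acc
    return estimate


# Toy pool: ids 0-599 act as a "confident" stratum, ids 600-999 as an "uncertain" one.
confident = list(range(600))
uncertain = list(range(600, 1000))


def toy_is_correct(i):
    # Hypothetical ground truth: 95% correct when confident, 50% when uncertain.
    return i % 20 != 0 if i < 600 else i % 2 == 0


rng = random.Random(0)
estimate = stratified_accuracy([confident, uncertain], 25, toy_is_correct, rng)
true_accuracy = mean(1.0 if toy_is_correct(i) else 0.0 for i in range(1000))
print(f"estimate={estimate:.3f} true={true_accuracy:.3f}")
```

Here 50 annotations stand in for the full pool of 1,000. The more homogeneous each stratum is with respect to correctness, the lower the variance of the estimate, which is the intuition behind pairing a good stratification with the sampling and estimation steps.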