
📊 LLM Evaluation

🎞️ ECCV2024 · 19 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (131) · 💬 ACL2026 (97) · 🧪 ICML2026 (40) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (38) · 📹 ICCV2025 (27)

🔥 Top topics: Adversarial Robustness ×2

ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization: This paper proposes ColorMNet, a memory-based deep spatial-temporal feature propagation network. By incorporating three components—Pre-trained Vision-Guided Feature Extraction (PVGFE), Memory-based Feature Propagation (MFP), and Local Attention (LA)—this method achieves video colorization performance superior to state-of-the-art (SOTA) models while significantly reducing GPU memory consumption to only 1.9 GB.
Deep Cost Ray Fusion for Sparse Depth Video Completion: This paper proposes the RayFusion framework, which achieves temporal fusion by applying self-attention and cross-attention along the ray direction on the cost volume. With only 1.15M parameters, it outperforms or matches state-of-the-art sparse depth completion methods across three datasets: KITTI, VOID, and ScanNetV2.
Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams: A Distribution Alignment (DA) loss is proposed to pull the test-time feature distribution back to the source domain distribution. Combined with a domain shift detection mechanism, this method significantly outperforms existing TTA methods under non-i.i.d. dynamic data streams and continuous domain shift scenarios.
Eliminating Warping Shakes for Unsupervised Online Video Stitching: This paper defines a new problem in video stitching termed "warping shake" (temporal shaking in non-overlapping regions when image stitching is extended to video) and proposes StabStitch, the first unsupervised online video stitching framework. By simultaneously generating and smoothing stitching trajectories, it achieves both video stitching and stabilization at a real-time speed of 28.2 ms/frame.
EvSign: Sign Language Recognition and Translation with Streaming Events: This work constructs the first event-camera benchmark dataset, EvSign, for Continuous Sign Language Recognition (CSLR) and Sign Language Translation (SLT) tasks, and proposes an efficient sparse Transformer-based framework that achieves comparable or superior performance to SOTA RGB methods using only 0.34% of the FLOPs and 44.2% of the parameters.
Gradient-Regularized Out-of-Distribution Detection: This paper proposes GReg/GReg+, which learns the local smoothness of the scoring manifold by regularizing the input gradient norm of the OOD scoring function, and incorporates an energy-score-based clustering sampling strategy to select highly informative auxiliary samples, achieving SOTA on CIFAR and ImageNet OOD detection benchmarks.
Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning: This paper proposes IFMatch, which extends the traditional image-level weak-to-strong consistency paradigm with feature-level perturbation and a three-branch structure. By employing a confidence strategy to distinguish naive/hard samples, IFMatch significantly enhances the performance of existing methods (e.g., FixMatch and FreeMatch) on multiple SSL benchmarks.
Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems: A solution for the Electromagnetic Inverse Scattering Problem (EISP) is proposed based on Implicit Neural Representations (INR). By modeling the relative permittivity of the scatterer as a continuous implicit representation and optimizing it within a forward framework, this approach effectively avoids the difficulties of inverse estimation and low-resolution issues caused by discretization.
Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation: This paper proposes a noise-rate estimation method based on a probabilistic graphical model, which automatically estimates the label noise rate of the training set. The estimated values are used to guide the curriculum design of sample selection strategies. It can be seamlessly integrated into state-of-the-art (SOTA) noisy-label learning methods such as DivideMix and InstanceGM, improving their classification accuracy on both synthetic and real-world benchmarks.
Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence: This paper proposes the LFTL (Learn from the Learnt) framework, which consists of two core modules, Contrastive Active Sampling (CAS) and Visual Persistence-guided Adaptation (VPA). It achieves efficient domain adaptation in the source-free setting with extremely low target annotation budgets (\(\le 5\%\)), reaching 87.4% accuracy on VisDA-C with only 1% annotation.
MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo: This paper proposes MERLiN, a single-stage attentive hourglass network, to jointly estimate spatially-varying BRDF parameters and perform physically correct relighting from a single image. It is the first to leverage relit images to drive photometric stereo methods for single-image normal estimation, bridging the gap between Shape from Shading and Photometric Stereo.
OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations: This paper proposes OGNI-DC, which achieves both SOTA accuracy and strong generalization in depth completion through an "Optimization-Guided Neural Iteration" (OGNI) framework, combining a ConvGRU for iterative depth gradient field refinement with a Differentiable Depth Integrator (DDI).
PetFace: A Large-Scale Dataset and Benchmark for Animal Identification: A large-scale animal face recognition dataset, PetFace, is constructed, containing 13 animal families, 319 breeds, and 257,484 individuals (over 1 million images). Two benchmarks are established: seen individual re-identification (Re-ID) and unseen individual verification, providing infrastructure for non-invasive automatic animal identification.
R²-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations: This paper proposes R²-Bench, a comprehensive benchmark to systematically evaluate the robustness of referring perception models (RPMs) under various perturbations. It features a complete perturbation taxonomy, a versatile perturbation synthesis toolbox, and an LLM-based automated evaluation agent (R²-Agent), covering five key tasks and revealing the vulnerability of current RPMs under noisy conditions.
SIGMA: Sinkhorn-Guided Masked Video Modeling: This paper proposes SIGMA, which upgrades the reconstruction target of masked video modeling from the pixel level to learnable deep feature cluster assignments by introducing a projection network. Employing optimal transport with the Sinkhorn algorithm to enforce high-entropy regularization, SIGMA avoids representation collapse and comprehensively outperforms state-of-the-art methods like VideoMAE across 10 datasets and 3 benchmarks.
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval: A Slerp-based ZS-CIR method is proposed, which directly fuses the image and text embeddings of VLP models via Spherical Linear Interpolation (Slerp) to construct composed query representations. Combined with Text-Anchored-Tuning (TAT), which fine-tunes the image encoder using LoRA to narrow the modality gap, this method achieves state-of-the-art (SOTA) performance on CIRR, CIRCO, and FashionIQ.
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets: This paper proposes the task of Alignable Video Retrieval (AVR), which identifies and retrieves the most suitable videos for temporal alignment with a query video from a large-scale video database using the DRAQ alignment quality metric. It also introduces a feature contextualization method to improve alignment performance.
Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning: This work is the first to define the Versatile Incremental Learning (VIL) scenario—where the class or domain incremental type of subsequent tasks is unknown. It proposes the ICON framework, which controls learning directions via CAST loss to avoid conflicts with historical tasks, and dynamically expands output nodes with the IC incremental classifier to address the cross-domain intra-class overwriting problem, comprehensively surpassing existing CIL/DIL methods on three benchmarks.
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding: VisFocus proposes a prompt-guided vision encoding method for OCR-free document understanding. By directly injecting the user prompt into the patch merging layers (ViLMA layers) of the vision encoder, combined with a local masked prompt modeling (LMPM) pre-training task, the vision encoder learns to focus on prompt-related text regions, achieving state-of-the-art results among similarly-sized models across multiple document VQA benchmarks.
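
As a small illustration of the Spherical Linear Interpolation (Slerp) step mentioned in the composed image retrieval entry above, the sketch below fuses an image embedding and a text embedding along the arc between them on the unit sphere. It is a minimal pure-Python sketch, not the authors' implementation: the function names `normalize` and `slerp`, the interpolation weight `alpha`, and the toy 2-D vectors are all assumptions made for illustration, and real VLP embeddings would be much higher-dimensional.

```python
import math


def normalize(v):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def slerp(image_emb, text_emb, alpha):
    """Spherical linear interpolation between two embeddings.

    alpha = 0 returns the (normalized) image embedding,
    alpha = 1 returns the (normalized) text embedding.
    The weight alpha is an assumed free parameter of this sketch.
    """
    u = normalize(image_emb)
    w = normalize(text_emb)
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, w))))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        # For exactly opposite vectors the midpoint is undefined and
        # normalize() raises ValueError.
        mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(u, w)]
        return normalize(mixed)
    coef_u = math.sin((1.0 - alpha) * theta) / sin_theta
    coef_w = math.sin(alpha * theta) / sin_theta
    return [coef_u * a + coef_w * b for a, b in zip(u, w)]


# Toy usage: two orthogonal "embeddings" fused halfway along the arc.
query = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(query)
```

Unlike a plain weighted average, the interpolated vector stays on the unit sphere, which matters when retrieval ranks candidates by cosine similarity.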