
💬 LLM (Other)

🎞️ ECCV2024 · 11 paper notes


🔥 Top topics: Few-/Zero-Shot Learning ×2

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

By incorporating both static (globally shared) and dynamic (instance-specifically generated) learnable prompts into CLIP, and optimizing them on auxiliary anomaly detection data, this method achieves zero-shot SOTA results on 14 industrial and medical anomaly detection datasets. The core innovation is the hybrid prompt design, which adapts CLIP at two levels: the task level (static prompts) and the instance level (dynamic prompts).
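To make the static/dynamic split concrete, here is a minimal sketch of how a hybrid prompt sequence could be assembled. It is not AdaCLIP's code: the dimensions, the names `hybrid_prompts` and `dyn_proj`, and the choice of a plain linear map as the dynamic-prompt generator are all hypothetical. It only shows the property the summary describes: static tokens are shared by every image, while dynamic tokens change with the input.

```python
import random

EMBED_DIM = 4   # hypothetical token width
N_STATIC = 3    # hypothetical number of shared (static) prompt tokens
N_DYNAMIC = 2   # hypothetical number of per-image (dynamic) prompt tokens

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Static prompts: one learnable matrix shared by every input image.
static_prompts = rand_matrix(N_STATIC, EMBED_DIM)

# Dynamic prompts: generated per image. As an assumption, each dynamic token
# is a linear projection of the image's global feature vector.
dyn_proj = [rand_matrix(EMBED_DIM, EMBED_DIM) for _ in range(N_DYNAMIC)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def hybrid_prompts(image_feature):
    """Return the static prompts followed by image-conditioned dynamic prompts."""
    dynamic = [matvec(w, image_feature) for w in dyn_proj]
    return static_prompts + dynamic

img_a = [1.0, 0.0, 0.0, 0.0]
img_b = [0.0, 1.0, 0.0, 0.0]
pa, pb = hybrid_prompts(img_a), hybrid_prompts(img_b)
print(len(pa), len(pa[0]))             # 5 4
print(pa[:N_STATIC] == pb[:N_STATIC])  # True: static tokens are shared
print(pa[N_STATIC:] == pb[N_STATIC:])  # False: dynamic tokens depend on the image
```

During training both the static matrix and the generator weights would be learnable; the sketch fixes them at random values only to show the data flow.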

APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension

This paper proposes APL (Anchor-based Prompt Learning), which designs an Anchor-based Prompt Encoder (APE) to generate distinctive prompts of three types: location, color, and category. These prompts are dynamically integrated into anchor features to enrich their visual semantics and, together with text reconstruction and visual alignment losses, yield precise vision-language alignment. APL outperforms existing weakly supervised methods on four REC benchmarks (e.g., exceeding RefCLIP by 6.44% on RefCOCO).

Cultural Value Differences of LLMs: Prompt, Language, and Model Size

This paper systematically investigates the behavioral patterns of LLMs in expressing cultural values utilizing the Hofstede cultural dimensions questionnaire. It finds that prompt language (Chinese vs. English) and model size have a far greater impact on cultural value disparities than differences in model architecture and question order.

FreestyleRet: Retrieving Images from Style-Diversified Queries

This work introduces the Style-Diversified Query-Based Image Retrieval (Style-Diversified QBIR) task, the first of its kind, along with the DSR dataset. It also designs FreestyleRet, a lightweight, plug-and-play framework that uses Gram matrices to extract the texture/style features of queries and build a style space. These style features then initialize prompt tokens, enabling a frozen vision encoder to adapt to diverse query styles such as text, sketches, low-resolution images, and artistic paintings.
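The Gram matrix is what lets FreestyleRet separate style from content: for a feature map with C channels and N spatial positions, G = F Fᵀ / N records which channels co-activate but discards where they do so. The sketch below computes it in plain Python; the helper names `gram_matrix` and `style_vector` and the toy feature map are hypothetical, and the later step of turning style features into prompt tokens is not shown.

```python
def gram_matrix(features):
    """features: C channels, each a flattened list of N spatial activations.
    Returns the C x C Gram matrix G[i][j] = (1/N) * sum_k F[i][k] * F[j][k]."""
    c, n = len(features), len(features[0])
    return [[sum(features[i][k] * features[j][k] for k in range(n)) / n
             for j in range(c)]
            for i in range(c)]

def style_vector(features):
    """Flatten the upper triangle (diagonal included) of the symmetric Gram matrix."""
    g = gram_matrix(features)
    return [g[i][j] for i in range(len(g)) for j in range(i, len(g))]

# Toy feature map: 2 channels x 4 spatial positions.
feat = [[1.0, 2.0, 0.0, 1.0],
        [0.0, 1.0, 3.0, 1.0]]
# Shuffle spatial positions identically in every channel: the layout changes,
# but the channel co-activation statistics (the "style") do not.
perm = [3, 1, 0, 2]
shuffled = [[row[p] for p in perm] for row in feat]

print(gram_matrix(feat))                           # [[1.5, 0.75], [0.75, 2.75]]
print(gram_matrix(feat) == gram_matrix(shuffled))  # True
print(style_vector(feat))                          # [1.5, 0.75, 2.75]
```

The invariance to spatial shuffling is the design point: two queries with the same texture statistics but different content map to the same place in the style space.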

FunQA: Towards Surprising Video Comprehension

The authors construct FunQA, a large-scale benchmark for counter-intuitive video question answering (4.3K videos and 312K QA pairs) that covers three categories of surprising videos: Humor, Creativity, and Magic. They also propose the FunMentor agent, which enhances the counter-intuitive reasoning capabilities of VLMs through multi-turn dialogue.

PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts

The authors propose PromptIQA, which uses a small number of "image-score pairs" (ISPs) as prompts. This allows the trained NR-IQA model to adapt to new quality assessment requirements without fine-tuning, achieving SOTA performance and generalization capabilities across 12 datasets and 5 categories of IQA tasks.
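A toy illustration of why conditioning on image-score pairs lets one model follow different scoring requirements without fine-tuning. This is not PromptIQA's architecture (the paper trains a network); `predict_score`, the 2-D features, and the distance-based softmax weighting are hypothetical stand-ins chosen only to show that swapping the prompt, not the weights, changes the output.

```python
import math

def predict_score(query_feat, isps, temperature=1.0):
    """Toy prompt-conditioned quality predictor (hypothetical, not PromptIQA's model).
    isps: list of (feature_vector, score) image-score pairs supplied as the prompt.
    Returns the softmax-weighted average of the prompt scores; each weight comes
    from the negative squared Euclidean distance between query and prompt features."""
    logits = [-sum((q - x) ** 2 for q, x in zip(query_feat, feat)) / temperature
              for feat, _ in isps]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    return sum(w * s for w, (_, s) in zip(weights, isps)) / sum(weights)

sharp, blurry = [1.0, 0.0], [0.0, 1.0]  # hypothetical 2-D quality features
query = [0.9, 0.1]                      # a mostly sharp test image

# Requirement A: higher score = better quality.
isps_a = [(sharp, 5.0), (blurry, 1.0)]
# Requirement B: same images, inverted scale (higher = more distortion).
isps_b = [(sharp, 1.0), (blurry, 5.0)]

print(round(predict_score(query, isps_a), 2))  # 4.33
print(round(predict_score(query, isps_b), 2))  # 1.67
```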

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos

VidAssist is a three-step "Propose-Assess-Search" framework that uses LLMs both as a knowledge base and as an evaluation tool, combined with a breadth-first search algorithm. Operating in zero- and few-shot settings, it outperforms fully supervised SOTA methods on goal-oriented planning in instructional videos; in the few-shot setting it improves success rate (SR) on COIN by 7.7% over the fully supervised VLaMP.
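The loop below sketches how a propose/assess/search cycle can be wired together. It is not VidAssist's implementation: `plan_bfs`, the toy step graph, the fixed scores standing in for LLM calls, and the pruning to the top-k candidates per node are all hypothetical. It shows the division of labor: `propose` suggests next steps, `assess` ranks them, and breadth-first search returns the shortest step sequence that reaches the goal.

```python
from collections import deque

def plan_bfs(start, goal, propose, assess, max_depth=4, keep_top=2):
    """Breadth-first search over step sequences.
    propose(state) -> candidate next steps (an LLM in VidAssist; a stub here).
    assess(state, step) -> score in [0, 1] (also an LLM in VidAssist; a stub here).
    Expanding only the `keep_top` best-assessed candidates is an assumption."""
    queue = deque([(start, [])])
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        if len(plan) >= max_depth:
            continue
        ranked = sorted(propose(state), key=lambda s: assess(state, s), reverse=True)
        for step in ranked[:keep_top]:
            queue.append((step, plan + [step]))
    return None  # goal not reached within max_depth

# Hypothetical stand-ins for the LLM: a fixed step graph and fixed relevance scores.
NEXT_STEPS = {
    "start": ["boil water", "eat cake", "wash car"],
    "boil water": ["add tea bag", "wash car"],
    "add tea bag": ["pour water"],
}
RELEVANCE = {"boil water": 0.8, "add tea bag": 0.9, "pour water": 0.9,
             "eat cake": 0.1, "wash car": 0.05}

def propose(state):
    return NEXT_STEPS.get(state, [])

def assess(state, step):
    return RELEVANCE.get(step, 0.0)

print(plan_bfs("start", "pour water", propose, assess))
# ['boil water', 'add tea bag', 'pour water']
```

Here "wash car" is pruned at the root because it ranks below the top two candidates, which is how assessment keeps the search tree small.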

Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

This paper proposes the Error-Guided Feature Selection (EGFS) mechanism, which uses regions with low reprojection error as point prompts for SAM, which expands them into semantic masks. By iteratively filtering for reliable training samples, the method outperforms existing 3D-free scene coordinate regression (SCR) methods on the Cambridge Landmarks and Indoor6 datasets while using a smaller model and less training time.
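The selection signal in EGFS is the reprojection error: project each predicted scene coordinate through the camera pose and intrinsics, and measure its pixel distance from the pixel it was predicted for. A minimal sketch under a standard pinhole model follows; the helper names, the 2-pixel threshold, and the toy pose are hypothetical, and the SAM mask expansion and iterative filtering are not shown.

```python
import math

def project(point_world, R, t, f, cx, cy):
    """Pinhole projection of a world point to pixel coordinates.
    R (3x3, list of rows) and t (length 3) map world coordinates into the camera frame."""
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)

def reprojection_error(pixel, point_world, R, t, f, cx, cy):
    """Euclidean distance (in pixels) between a pixel and its reprojected prediction."""
    u, v = project(point_world, R, t, f, cx, cy)
    return math.hypot(u - pixel[0], v - pixel[1])

def select_low_error_pixels(pixels, predicted_coords, R, t, f, cx, cy, threshold=2.0):
    """Keep pixels whose predicted scene coordinate reprojects within `threshold` pixels."""
    return [p for p, X in zip(pixels, predicted_coords)
            if reprojection_error(p, X, R, t, f, cx, cy) < threshold]

# Toy example: identity pose, focal length 100 px, principal point (50, 50).
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 0]
pixels = [(50, 50), (60, 50), (70, 50)]
predicted = [(0.0, 0.0, 2.0), (0.2, 0.0, 2.0), (0.0, 0.0, 2.0)]  # last one is wrong
prompts = select_low_error_pixels(pixels, predicted, R, t, f=100, cx=50, cy=50)
print(prompts)  # [(50, 50), (60, 50)]
```

The third prediction reprojects to (50, 50), 20 pixels from its source pixel, so it is rejected; the surviving pixels are the kind of reliable points that would be handed to SAM as prompts.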

RoadPainter: Points Are Ideal Navigators for Topology Transformer

RoadPainter adopts a two-stage strategy: it first regresses lane centerline points and then refines them using instance masks. Combined with a hybrid attention mechanism and a real-virtual lane separation strategy, it achieves SOTA topology inference performance on the OpenLane-V2 dataset.

Stripe Observation Guided Inference Cost-Free Attention Mechanism

Based on an in-depth analysis of the stripe patterns that appear in the attention weight matrices of Transformers, this paper proposes an attention enhancement mechanism that adds no computational cost at inference. During training, an auxiliary module learns stripe-guided modulation of the attention; at inference, this module is re-parameterized into the standard attention weights, yielding a "free lunch" performance boost.
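Why can a trained auxiliary module cost nothing at inference? The general principle of structural re-parameterization is that a linear branch can be folded into existing weights once training ends. The sketch below assumes, purely for illustration, that the auxiliary module is a linear branch added to the query projection; the paper's stripe-guided module may take a different form, and all names and matrices here are hypothetical. Because the logits are identical, the softmax attention weights computed from them are identical too.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def madd(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def logits_train(x, W_q, W_aux, W_k):
    """Training: queries get an extra (assumed linear) auxiliary branch x @ W_aux."""
    q = madd(matmul(x, W_q), matmul(x, W_aux))
    return matmul(q, transpose(matmul(x, W_k)))

def reparameterize(W_q, W_aux):
    """After training, fold the linear branch into the query weights once."""
    return madd(W_q, W_aux)

def logits_infer(x, W_q_merged, W_k):
    """Inference: plain attention logits with no extra branch."""
    return matmul(matmul(x, W_q_merged), transpose(matmul(x, W_k)))

x = [[1, 2], [0, 1]]  # 2 tokens, 2 features
W_q = [[1, 0], [0, 1]]
W_aux = [[0, 1], [1, 0]]
W_k = [[2, 0], [1, 1]]

print(logits_train(x, W_q, W_aux, W_k))                  # [[18, 6], [6, 2]]
print(logits_infer(x, reparameterize(W_q, W_aux), W_k))  # [[18, 6], [6, 2]]
```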

Zero-Shot Object Counting with Good Exemplars (VA-Count)

This paper proposes VA-Count, a zero-shot object counting framework based on visual association. Through a Grounding DINO-driven exemplar enhancement module and a contrastive-learning-based noise suppression module, it establishes robust visual associations between high-quality exemplars and images for arbitrary categories.