
👥 Social Computing

🧠 NeurIPS2025 · 18 paper notes

Active Slice Discovery in Large Language Models

This paper formulates the Active Slice Discovery problem, integrating active learning into LLM error-slice discovery. By combining uncertainty sampling with LLM internal representations (raw embeddings or SAE features), the method achieves slice-detection accuracy comparable to that of fully supervised settings while using only 2–10% of the labeled data.
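
A minimal sketch of the uncertainty-sampling loop this implies, assuming frozen LLM embeddings have already been extracted and a hypothetical `oracle_label` callback plays the annotator; the paper's actual acquisition function and probe may differ.

```python
# Active slice discovery sketch: fit a linear probe on a few labeled embeddings,
# then repeatedly query labels for the points the probe is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_slice_probe(X, oracle_label, seed_idx, budget=50, batch=5):
    labeled = list(seed_idx)                        # seed should cover both classes
    y = {i: oracle_label(i) for i in labeled}
    for _ in range(budget // batch):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(X[labeled], [y[i] for i in labeled])
        pool = [i for i in range(len(X)) if i not in y]
        p = probe.predict_proba(X[pool])[:, 1]      # probability of "in the slice"
        uncertain = np.argsort(np.abs(p - 0.5))[:batch]   # closest to the boundary
        for j in uncertain:                         # query the annotator for them
            idx = pool[j]
            y[idx] = oracle_label(idx)
            labeled.append(idx)
    return probe, y
```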

Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

This paper proposes the Reasoning-based Bias Detector (RBD), a plug-and-play debiasing module for LLM judges. By externally detecting four types of evaluation bias (verbosity, position, bandwagon, and sentiment), RBD generates structured feedback with reasoning chains to guide judges toward self-correction. RBD-8B achieves an average accuracy improvement of 18.5% and consistency improvement of 10.9% across 8 LLM judges.
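
A minimal sketch of the detect-then-correct loop, assuming two hypothetical wrappers `call_judge` and `call_detector` around the judge model and the detector model; the prompt templates below are illustrative, not the paper's.

```python
# External bias detection for an LLM judge: the detector audits the judge's
# verdict for the four bias types and, if it finds any, its reasoning is fed
# back so the judge can self-correct.
BIASES = ["verbosity", "position", "bandwagon", "sentiment"]

def debiased_judgment(question, answer_a, answer_b, call_judge, call_detector):
    verdict = call_judge(
        f"Question: {question}\nA: {answer_a}\nB: {answer_b}\n"
        "Which answer is better? Explain, then answer 'A' or 'B'.")
    report = call_detector(
        f"Evaluation to audit:\n{verdict}\n"
        f"Check for these biases: {', '.join(BIASES)}. "
        "Reason step by step and list any bias you find, or say 'no bias'.")
    if "no bias" in report.lower():
        return verdict
    return call_judge(
        f"Your previous evaluation:\n{verdict}\n"
        f"A bias auditor reported:\n{report}\n"
        "Re-evaluate and give a corrected verdict ('A' or 'B').")
```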

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in LLMs

This paper proposes FaIRMaker, a framework that adopts an "auto-search + refinement" paradigm: it first employs gradient-based optimization to identify debiasing trigger tokens (Fairwords), then trains a seq2seq model to transform them into human-readable instructions, effectively mitigating gender bias on both open-source and closed-source LLMs while preserving or even improving task performance.
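
A rough sketch of the gradient-guided trigger search stage in the spirit of HotFlip, assuming a differentiable `debias_loss` that scores how biased the model behaves when the trigger embeddings are prepended; FaIRMaker's actual objective and the seq2seq rewriting stage are not shown.

```python
import torch
import torch.nn.functional as F

def search_trigger(model, embedding_matrix, trigger_ids, debias_loss, n_steps=50):
    # embedding_matrix: (V, d) token embeddings; trigger_ids: (T,) current trigger tokens
    for _ in range(n_steps):
        one_hot = F.one_hot(trigger_ids, embedding_matrix.shape[0]).float()
        one_hot.requires_grad_(True)
        trigger_emb = one_hot @ embedding_matrix         # differentiable lookup, (T, d)
        loss = debias_loss(model, trigger_emb)           # lower = less biased behavior
        grad = torch.autograd.grad(loss, one_hot)[0]     # (T, V)
        # first-order estimate: the most promising swap has the largest negative gradient
        pos = (-grad).max(dim=1).values.argmax()
        trigger_ids = trigger_ids.clone()
        trigger_ids[pos] = (-grad[pos]).argmax()         # greedy single-token replacement
    return trigger_ids
```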

AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

AVerImaTeC introduces the first image-text fact-checking dataset with complete evidence annotation — 1,297 real-world image-text claims, a 5-stage annotation pipeline (extraction → QA reasoning → sufficiency check → iterative refinement → second check), and temporally constrained evidence (to prevent temporal leakage). The baseline system achieves 82% accuracy with ground-truth evidence, but drops to 15–25% under automatic evidence retrieval, revealing the substantial challenges of image-text verification.

Concept-Level Explainability for Auditing & Steering LLM Responses

This paper proposes ConceptX, an LLM explainability method based on concept-level (rather than token-level) Shapley attribution. It measures the influence of input concepts on outputs via semantic similarity rather than token overlap, and can be used to audit bias and steer LLM outputs through prompt editing — reducing attack success rate from 0.463 to 0.242 in jailbreak defense.
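
A minimal Monte-Carlo sketch of concept-level Shapley attribution, assuming three hypothetical helpers: `generate` (the LLM call), `ablate` (removes or neutralizes the listed concepts in the prompt), and `similarity` (semantic similarity between two texts, e.g. cosine of sentence embeddings); ConceptX's exact value function may differ.

```python
import random

def concept_shapley(prompt, concepts, generate, ablate, similarity, n_perm=20):
    reference = generate(prompt)                    # output on the full prompt
    phi = {c: 0.0 for c in concepts}
    for _ in range(n_perm):
        order = random.sample(concepts, len(concepts))   # random concept ordering
        present = set()
        prev = similarity(generate(ablate(prompt, exclude=concepts)), reference)
        for c in order:
            present.add(c)
            masked = ablate(prompt, exclude=[x for x in concepts if x not in present])
            value = similarity(generate(masked), reference)
            phi[c] += (value - prev) / n_perm       # average marginal contribution of c
            prev = value
    return phi                                      # concept -> attribution score
```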

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

DATE-LM introduces the first unified benchmark for evaluating data attribution methods in LLMs. Through three application-driven tasks—training data selection, toxicity filtering, and factual attribution—it systematically compares multiple attribution approaches, finding that no single method dominates across all tasks and that simple baselines can match attribution methods in certain settings.

DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding

Inspired by the depth-first search (DFS) algorithm, DeepTraverse is a visual backbone network that uses a parameter-sharing recursive exploration module and an adaptive channel recalibration module to achieve highly competitive image classification performance with very few parameters.
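
A minimal sketch of the two ideas named above, under the assumption that "parameter-sharing recursive exploration" means applying one block repeatedly with shared weights and "adaptive channel recalibration" is SE-style channel reweighting; this is a reading of the summary, not DeepTraverse's actual architecture.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    def __init__(self, channels, depth=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # shared across steps
        self.recalib = nn.Sequential(                              # channel recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):               # same weights reused recursively
            x = torch.relu(self.conv(x)) + x
        return x * self.recalib(x)                # adaptively reweight channels
```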

Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

This paper proposes Token Timestep Allocation (TTA-Diffusion), which assigns independent denoising timesteps to each token to address the update-forgetting problem caused by classifier guidance in diffusion language models, achieving substantial improvements in both stability and efficiency for controllable text generation.
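
A minimal sketch of the allocation idea, assuming a per-token confidence signal is available (e.g. how well each token already satisfies the edit or control constraint); the linear mapping below is an assumption, not the paper's exact rule.

```python
import torch

def allocate_timesteps(token_confidence, t_max=1000, t_min=10):
    # token_confidence: (seq_len,) in [0, 1]; higher = "already correct, protect it"
    conf = token_confidence.clamp(0.0, 1.0)
    t = t_max - conf * (t_max - t_min)     # confident tokens get small (late) timesteps
    return t.round().long()                # per-token timesteps passed to the denoiser
```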

Evaluating Multiple Models Using Labeled and Unlabeled Data

This paper proposes SSME (Semi-Supervised Model Evaluation), which leverages a small amount of labeled data and a large amount of unlabeled data to estimate the joint distribution \(P(y, \mathbf{s})\) of multiple classifiers via a semi-supervised mixture model, enabling accurate classifier performance evaluation with errors reduced to 1/5 of those incurred when using labeled data alone.
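
A minimal EM sketch of the idea, with the simplifying assumption that the K classifier scores are class-conditionally Gaussian with diagonal covariance; SSME's actual density model may differ. Labeled rows keep fixed one-hot responsibilities, while unlabeled rows are re-estimated each iteration.

```python
import numpy as np

def ssme_em(S_lab, y_lab, S_unl, n_iter=100):
    # S_lab: (n_lab, K) scores with labels y_lab in {0, 1}; S_unl: (n_unl, K) unlabeled scores
    S = np.vstack([S_lab, S_unl])
    n_lab = len(y_lab)
    R = np.full((len(S), 2), 0.5)
    R[:n_lab] = np.eye(2)[y_lab]                      # labeled responsibilities are fixed
    for _ in range(n_iter):
        pi = R.mean(axis=0)                           # class prior
        mu = [(R[:, c:c+1] * S).sum(0) / R[:, c].sum() for c in (0, 1)]
        var = [(R[:, c:c+1] * (S - mu[c]) ** 2).sum(0) / R[:, c].sum() + 1e-6
               for c in (0, 1)]
        log_lik = np.stack([                          # class-conditional log density + prior
            np.log(pi[c]) - 0.5 * (((S - mu[c]) ** 2 / var[c])
                                   + np.log(2 * np.pi * var[c])).sum(1)
            for c in (0, 1)], axis=1)
        post = np.exp(log_lik - log_lik.max(1, keepdims=True))
        post /= post.sum(1, keepdims=True)
        R[n_lab:] = post[n_lab:]                      # E-step only updates unlabeled rows
    return pi, mu, var, R                             # R[:, 1] estimates P(y=1 | s)
```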

GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation

GraphKeeper is proposed to address catastrophic forgetting in Graph Domain-Incremental Learning (Graph Domain-IL) through three components: domain-specific LoRA parameter isolation, intra/inter-domain disentanglement, and ridge regression-based deviation-free knowledge preservation. It outperforms the second-best method by 6.5%–16.6% and can be seamlessly integrated with graph foundation models.

IF-GUIDE: Influence Function-Guided Detoxification of LLMs

This paper proposes IF-Guide, which leverages influence functions to identify toxic content in training data at the token granularity and actively suppresses the model from learning toxic behaviors during pre-training or fine-tuning via a penalty-based training objective, substantially outperforming passive alignment methods such as DPO and RAD.
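
A minimal sketch of a token-level penalty objective in this spirit, assuming a precomputed 0/1 mask of tokens flagged by influence scores: the model learns unflagged tokens normally and is pushed away from flagged ones. The exact penalty form and how IF-Guide derives token influence are not reproduced here.

```python
import torch
import torch.nn.functional as F

def penalized_lm_loss(logits, targets, toxic_mask, lam=1.0):
    # logits: (B, T, V), targets: (B, T), toxic_mask: (B, T) with 1 = token flagged as toxic
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # (B, T)
    clean = (1 - toxic_mask.float()) * nll        # usual LM loss on unflagged tokens
    penalty = toxic_mask.float() * (-nll)         # drive flagged tokens' likelihood down
    return (clean + lam * penalty).mean()
```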

Noise-Robustness Through Noise: A Framework Combining Asymmetric LoRA with Poisoning MoE

This paper proposes LoPE, which designates a dedicated "poisoning expert" within an asymmetric LoRA architecture to absorb injected noise during training; at inference time, this expert is masked so that only the clean experts contribute to the output — achieving noise robustness through noise itself, entirely without data cleaning.
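
A minimal sketch of the masking idea, with the routing and asymmetric-LoRA details simplified away: one expert (the "poisoning expert") stays active during training to absorb injected noise and is simply skipped at inference.

```python
import torch
import torch.nn as nn

class LoRAExpertLayer(nn.Module):
    def __init__(self, d, r=8, n_experts=4, poison_idx=0):
        super().__init__()
        self.base = nn.Linear(d, d)                        # stands in for the frozen weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, d, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, r, d))
        self.poison_idx = poison_idx

    def forward(self, x):
        out = self.base(x)
        for e in range(self.A.shape[0]):
            if not self.training and e == self.poison_idx:
                continue                                   # mask the poisoning expert at inference
            out = out + (x @ self.A[e]) @ self.B[e]        # low-rank expert update
        return out
```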

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

This paper presents OS-Harm, the first safety benchmark targeting general-purpose computer-use agents (beyond browser-only settings), covering 150 tasks across three risk categories: deliberate user misuse, prompt injection attacks, and model misbehavior. Evaluations reveal that frontier models (o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro, etc.) frequently comply with harmful instructions (unsafe rates of up to 70%) and fall for even basic prompt injection attacks in roughly 20% of cases.

Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents

This paper proposes the Policy-as-Prompt framework, a two-stage end-to-end pipeline—POLICY-TREE-GEN and POLICY-AS-PROMPT-GEN—that automatically converts a team's existing unstructured design documents (PRD, TDD, code) into runtime-enforceable policy guardrails, using a lightweight LLM as a compliance "judge," achieving 70–73% input/output classification accuracy in HR and SOC applications.
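
A minimal sketch of the runtime guardrail only, assuming the policies have already been extracted into plain-text rules and `call_llm` is a hypothetical wrapper around the lightweight judge; the two-stage POLICY-TREE-GEN / POLICY-AS-PROMPT-GEN pipeline itself is not shown.

```python
def check_compliance(message, policies, call_llm):
    rules = "\n".join(f"- {p}" for p in policies)
    prompt = (
        "You are a compliance judge for an AI agent.\n"
        f"Policies:\n{rules}\n\n"
        f"Message to check:\n{message}\n\n"
        "Answer exactly 'ALLOW' or 'BLOCK', then one sentence citing the policy."
    )
    verdict = call_llm(prompt)
    return verdict.strip().upper().startswith("ALLOW"), verdict
```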

Position Paper: If Innovation in AI Systematically Violates Fundamental Rights, Is It Innovation at All?

This paper challenges the prevailing belief that regulation and innovation are inherently at odds. Through historical analogies from pharmaceuticals, aviation, and welfare systems, combined with an analysis of the Collingridge dilemma, it argues that well-designed regulation serves as the foundation for sustainable innovation rather than an impediment to it. The regulatory sandbox, SME support mechanisms, and other provisions of the EU AI Act are presented as exemplars demonstrating how regulation can accelerate, rather than delay, responsible technological progress.

Precise Information Control in Long-Form Text Generation

This paper proposes the Precise Information Control (PIC) task, which requires LLMs to generate long-form text that strictly adheres to a given set of claims (neither omitting nor adding information). The authors construct PIC-Bench to evaluate 8 tasks, finding that over 70% of outputs from state-of-the-art models contain faithfulness hallucinations. Through weakly supervised preference data construction combined with DPO training, the proposed PIC-LM improves the F1 of an 8B model from 69.1% to 91.0%.
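
A minimal sketch of claim-level precision/recall/F1 in this spirit, assuming a hypothetical NLI-style checker `entails(premise, hypothesis)`; the benchmark's exact scoring protocol may differ. Precision asks whether each output claim is backed by some input claim, recall asks whether each input claim is expressed in the output.

```python
def pic_f1(input_claims, output_claims, entails):
    supported = sum(any(entails(c_in, c_out) for c_in in input_claims)
                    for c_out in output_claims)            # output claims with support
    covered = sum(any(entails(c_out, c_in) for c_out in output_claims)
                  for c_in in input_claims)                # input claims that were expressed
    precision = supported / max(len(output_claims), 1)
    recall = covered / max(len(input_claims), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```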

SLAyiNG: Towards Queer Language Processing

This work introduces SLAyiNG, the first explicitly annotated queer slang dataset, comprising 695 terms and nearly 200,000 usage instances. Inter-annotator agreement experiments (Krippendorff's \(\alpha = 0.746\)) demonstrate that reasoning models can serve as pre-screening tools but community-driven expert annotation remains indispensable.

VDRP: Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

This paper proposes the VDRP framework, which addresses two core challenges in zero-shot human-object interaction (HOI) detection, namely intra-class visual diversity and inter-class visual entanglement, through visual diversity-aware prompt learning (via group-level variance injection and Gaussian perturbation) and region-aware prompt augmentation (via LLM-generated regional concept retrieval).
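
A minimal sketch of the visual-diversity idea only, under the assumption that a class prompt embedding is perturbed with Gaussian noise scaled by the group-level feature variance; shapes and the variance source are illustrative, and the region-aware augmentation branch is not shown.

```python
import torch

def diversify_prompt(prompt_emb, group_feats, n_samples=4, scale=0.1):
    # prompt_emb: (d,) learnable class prompt; group_feats: (n, d) visual features of that class
    var = group_feats.var(dim=0)                            # group-level per-dimension variance
    noise = torch.randn(n_samples, prompt_emb.shape[0])     # Gaussian perturbation
    return prompt_emb.unsqueeze(0) + scale * noise * var.sqrt()   # (n_samples, d) diverse prompts
```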