⚖️ Alignment & RLHF

📷 CVPR2025 · 5 paper notes

📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (102) · 💬 ACL2026 (38) · 🧪 ICML2026 (37) · 🤖 AAAI2026 (17) · 🧠 NeurIPS2025 (36)

🔥 Top topics: Alignment/RLHF ×2

Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: This paper proposes an alternative method for solving the constraint equations of steerable equivariant CNN kernels. It solves simpler invariance conditions at a fixed point and then "steers" the solution to arbitrary points, which bypasses the computation of Clebsch-Gordan coefficients and yields explicit kernel basis formulas for SO(2), O(2), SO(3), O(3), and the Lorentz group.
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation: This paper proposes the CAD-Llama framework, which converts 3D CAD models into Python-style code rich in semantic descriptions (SPCC) via a hierarchical annotation pipeline. It then applies adaptive pre-training and instruction tuning to turn LLaMA3-8B into a parametric CAD model generator. The approach outperforms previous methods by approximately 14% in accuracy on the text-to-CAD task and supports CAD editing tasks such as completion, addition, and deletion.
Continual SFT Matches Multimodal RLHF with Negative Supervision: A gradient analysis shows that the core advantage of multimodal RLHF over continual SFT lies in the negative supervision signal carried by rejected responses. Building on this, the paper proposes nSFT, which uses an LLM to extract error information from rejected responses and construct corrective dialogue data. Using only an SFT loss, nSFT matches or even outperforms RLHF methods such as DPO and PPO while requiring only one model, which significantly improves GPU memory efficiency.
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-Modal LLMs?: This work investigates whether safety alignment in multimodal large language models genuinely requires meticulously curated malicious data. It shows that effective safety alignment can be achieved with existing benign data combined with simple safety fine-tuning strategies, which significantly reduces the data curation cost of safety alignment.
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising: This paper proposes JailNTL, the first black-box attack method against Non-Transferable Learning (NTL) models. By utilizing test-time data disguising to transform unauthorized domain data into the style of the authorized domain, it improves unauthorized domain accuracy by up to 55.7% using only 1% authorized samples, without requiring any model modifications.
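
The "negative supervision" point in the Continual SFT entry above can be made concrete with the standard DPO objective. The sketch below is not the paper's code: it assumes sequence-level log-probabilities are already available as plain floats, and the function names (`sft_loss`, `dpo_loss`, `dpo_grads`) are hypothetical. It shows that the SFT loss depends only on the chosen response, while the DPO loss has a non-zero gradient with respect to the rejected response's log-probability.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sft_loss(logp_chosen: float) -> float:
    """SFT loss: negative log-likelihood of the chosen response only."""
    return -logp_chosen


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return _softplus(-margin)


def dpo_grads(logp_chosen: float, logp_rejected: float,
              ref_logp_chosen: float, ref_logp_rejected: float,
              beta: float = 0.1) -> tuple[float, float]:
    """Gradient of the DPO loss w.r.t. (logp_chosen, logp_rejected)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    w = beta * _sigmoid(-margin)
    # Descent raises logp_chosen (negative gradient) and
    # lowers logp_rejected (positive gradient): the negative supervision term.
    return -w, w


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    args = (-12.0, -15.0, -13.0, -14.0)
    print("SFT loss:", sft_loss(args[0]))
    print("DPO loss:", dpo_loss(*args))
    print("DPO grads (chosen, rejected):", dpo_grads(*args))
```

The second gradient component is the term that plain SFT lacks. According to the summary above, nSFT recovers that signal through corrective dialogue data built from the rejected responses rather than through a preference loss.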