Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Conference: ICLR 2026 arXiv: 2603.13186 Code: None Area: AI Security / Privacy Protection Keywords: Membership Inference Attack, Weight Importance, Privacy Vulnerability, Weight Rewinding, Fine-grained Privacy Defense

TL;DR

This paper reveals that privacy vulnerability is concentrated in a remarkably small fraction of weights (as few as 0.1%), and that these weights are highly entangled with the weights critical to learnability (Pearson \(r > 0.9\)). The proposed CWRF method achieves a superior privacy-utility trade-off by rewinding the privacy-vulnerable weights to their initialization and freezing them, while fine-tuning only the remaining weights.

Background & Motivation

  • Background: Membership Inference Attacks (MIA) exploit the behavioral discrepancy between a model's responses to training and non-training data to infer whether a given sample was in the training set. Existing privacy-preserving methods (e.g., DP-SGD, RelaxLoss, HAMP) update or retrain all weights, incurring high computational cost and unnecessary utility loss.
  • Limitations of Prior Work: While prior work (e.g., the Lottery Ticket Hypothesis) has shown that only a small subset of weights is critical to model performance, privacy vulnerability had not previously been analyzed at the weight level. Standard pruning (e.g., TFO), which removes "unimportant" weights, fails to reduce privacy risk and may even increase it: at 90% sparsity, test loss rises while the MIA success rate is unchanged or worse.
  • Key Challenge: Intuitively one would simply remove the "privacy-vulnerable" weights, but these are precisely the weights "critical to learnability": the two properties are highly entangled within a tiny fraction of weights (Pearson \(r > 0.9\)), so naive removal is infeasible.
  • Goal: Precisely identify and handle the privacy-vulnerable weights so as to reduce MIA risk without compromising model performance.
  • Key Insight & Core Idea: Weight positions matter more than weight values; preserving the positions of critical weights suffices to recover performance. Rather than removing the privacy-vulnerable weights, the paper rewinds them to their initialization values, eliminating the learned privacy-leaking signal while retaining the connectivity topology, then freezes them and fine-tunes only the remaining weights, allowing the model to recover performance while protecting privacy.

Method

Overall Architecture

CWRF (Critical Weights Rewinding and Finetuning) comprises three stages: (1) Privacy Vulnerability Estimation (PVE) based on machine unlearning → (2) Rewinding and freezing the most vulnerable weights according to PVE scores → (3) Fine-tuning only the remaining weights (compatible with any privacy-preserving training method). The pipeline starts from a pretrained model and outputs a privacy-enhanced fine-tuned model.
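
As a rough illustration of how the three stages compose, here is a minimal Python sketch; the callables `score_fn` and `rewind_finetune_fn` stand in for the PVE and rewind/fine-tune stages sketched in the sections below, and all names (`cwrf_pipeline`, `model_up`, `model_vn`) are illustrative assumptions, not the authors' actual API.

```python
from typing import Callable


def cwrf_pipeline(model_up, model_vn, train_loader, ref_loader,
                  score_fn: Callable, rewind_finetune_fn: Callable,
                  r: float = 0.03):
    """Three-stage CWRF pipeline: PVE scoring -> rewind/freeze -> fine-tune.

    model_up: the pretrained model; model_vn: a copy of its initialization.
    """
    # Stage 1: estimate per-weight privacy vulnerability (PVE).
    scores = score_fn(model_up, model_vn, train_loader, ref_loader)
    # Stages 2+3: rewind and freeze the top-r fraction of vulnerable
    # weights, then fine-tune only the remaining weights.
    return rewind_finetune_fn(model_up, model_vn, scores, train_loader, r=r)
```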

Key Designs

  1. Privacy Vulnerability Estimation (PVE):

    • Function: Quantify each weight's contribution to privacy leakage.
    • Mechanism: A dual-objective fine-tuning is adopted: cross-entropy loss on the training set (member data) is minimized to "learn" member information, while KL divergence from the initial model on a reference set (non-member data) is minimized to "forget" non-member information. The loss is defined as \(\mathcal{L}_{\text{pve}} = (1-\lambda)\,\mathcal{L}_{\text{ce}}(f(x_{tr};\theta_{up}), y_{tr}) + \lambda\,\mathcal{L}_{\text{kl}}(f(x_{re};\theta_{up}), f(x_{re};\theta_{vn}))\), where \(\theta_{up}\) denotes the pretrained (updated) weights and \(\theta_{vn}\) the initial (virgin) weights. During this process, the score \(|g_i \cdot w_i|\) (gradient times weight magnitude) is accumulated for each weight, yielding a weight-level privacy-vulnerability ranking; a minimal sketch of this loop follows the list below.
    • Design Motivation: Unlike conventional TFO, which optimizes only accuracy, PVE incorporates both "learning" and "forgetting" signals, ensuring that high-scoring weights are those that amplify the behavioral gap between training and non-training data, which is precisely the signal MIA exploits.
  2. Weight Rewinding & Freezing + Privacy Fine-tuning:

    • Function: Eliminate the risk of privacy-vulnerable weights while restoring model utility.
    • Mechanism: The top-\(r\)% most vulnerable weights are selected by PVE score and rewound to their initialization values via a binary mask: \(\theta_{rw} = \mathcal{B}_f \odot \theta_{up} + \mathcal{B}_r \odot \theta_{vn}\), where \(\mathcal{B}_r\) marks the rewound (vulnerable) weights and \(\mathcal{B}_f = 1 - \mathcal{B}_r\) the rest. The rewound weights are then frozen (blocked from updates via the gradient mask \(\mathcal{G}_p \leftarrow \mathcal{B}_f \odot \mathcal{G}_p\)), and only the remaining non-vulnerable weights are fine-tuned. The learning rate is also rewound to its initial value and decayed with a cosine annealing schedule. The approach is compatible with any privacy-preserving method (DP-SGD, RelaxLoss, HAMP, CCL, etc.).
    • Design Motivation: Rewinding rather than removing weights preserves the "position" (connectivity topology), which is key to performance recovery. Ablation experiments confirm this: removing weights (A1) leads to irrecoverable accuracy collapse; rewinding + fine-tuning vulnerable weights (A2) and rewinding + fine-tuning non-vulnerable weights (A3/CWRF) both recover performance, but A3 achieves a significantly better privacy-utility trade-off than A2.
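
Below is a minimal PyTorch sketch of the PVE scoring loop from design (1): `model_up` is the trainable pretrained model and `model_vn` a frozen copy of the initial (virgin) weights. The KL direction, optimizer, and hyperparameter defaults (`lam`, `T`, `lr`) are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def pve_scores(model_up, model_vn, train_loader, ref_loader,
               lam=0.5, T=100, lr=1e-3):
    """Accumulate the per-weight vulnerability score |g_i * w_i| over T steps."""
    model_vn.eval()
    for p in model_vn.parameters():           # the initial model stays fixed
        p.requires_grad_(False)
    opt = torch.optim.SGD(model_up.parameters(), lr=lr)
    scores = [torch.zeros_like(p) for p in model_up.parameters()]
    train_it, ref_it = iter(train_loader), iter(ref_loader)
    for _ in range(T):                        # assumes >= T batches per loader
        x_tr, y_tr = next(train_it)
        x_re, _ = next(ref_it)
        # "Learn" member data: cross-entropy on a training (member) batch.
        loss_ce = F.cross_entropy(model_up(x_tr), y_tr)
        # "Forget" non-member data: KL to the initial model's predictions.
        with torch.no_grad():
            p_vn = F.softmax(model_vn(x_re), dim=-1)
        log_p_up = F.log_softmax(model_up(x_re), dim=-1)
        loss_kl = F.kl_div(log_p_up, p_vn, reduction="batchmean")
        loss = (1 - lam) * loss_ce + lam * loss_kl
        opt.zero_grad()
        loss.backward()
        with torch.no_grad():                 # accumulate |gradient * weight|
            for s, p in zip(scores, model_up.parameters()):
                if p.grad is not None:
                    s += (p.grad * p).abs()
        opt.step()
    return scores
```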

Loss & Training

The PVE stage optimizes \(\mathcal{L}_{\text{pve}}\) (dual-objective CE + KL) for \(T\) iterations while accumulating the scores. The fine-tuning stage plugs in the user's chosen privacy-preserving method (standard CE or its variants), updating only the non-frozen weights via gradient masking. The learning rate is reset to its initial value and follows a cosine annealing schedule for \(E\) epochs. The total computational overhead is far lower than retraining from scratch.
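
Continuing the sketch above, stages (2) and (3) might look as follows: the top-\(r\)% of weights by PVE score are rewound to their initial values, a gradient hook implements the mask \(\mathcal{G}_p \leftarrow \mathcal{B}_f \odot \mathcal{G}_p\), and plain cross-entropy stands in for whichever privacy-preserving loss (RelaxLoss, DP-SGD, ...) is plugged in. The optimizer settings (`lr`, momentum, `epochs`) are assumed defaults for illustration.

```python
import torch
import torch.nn.functional as F
from torch.optim.lr_scheduler import CosineAnnealingLR


def rewind_and_finetune(model_up, model_vn, scores, train_loader,
                        r=0.03, lr=0.1, epochs=30):
    """Rewind the top-r fraction of weights to init, freeze them, fine-tune the rest."""
    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(r * flat.numel()))
    threshold = flat.topk(k).values.min()         # PVE-score cutoff for top-r%
    for p_up, p_vn, s in zip(model_up.parameters(),
                             model_vn.parameters(), scores):
        rewind_mask = (s >= threshold).float()    # B_r: vulnerable weights
        keep_mask = 1.0 - rewind_mask             # B_f: weights left trainable
        with torch.no_grad():                     # theta_rw = B_f*theta_up + B_r*theta_vn
            p_up.copy_(keep_mask * p_up + rewind_mask * p_vn)
        # Freeze the rewound weights by zeroing their gradients every step.
        p_up.register_hook(lambda g, m=keep_mask: g * m)
    opt = torch.optim.SGD(model_up.parameters(), lr=lr, momentum=0.9)
    sched = CosineAnnealingLR(opt, T_max=epochs)  # rewound LR + cosine decay
    for _ in range(epochs):
        for x, y in train_loader:
            loss = F.cross_entropy(model_up(x), y)  # stand-in privacy loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model_up
```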

Key Experimental Results

Main Results

Entanglement Quantification (Table 1, Pearson correlation coefficients):

| Architecture | Weight Type | Pearson \(r\) | Parameter Ratio |
|--------------|-------------|---------------|-----------------|
| ResNet18     | Conv        | 0.9410        | 99.50%          |
| ResNet18     | Linear      | 0.8096        | 0.45%           |
| ResNet18     | Norm        | 0.6776        | 0.05%           |
| ViT          | Att+MLP     | 0.9068        | 99.39%          |
| ViT          | Linear      | 0.8642        | 0.54%           |
| ViT          | Norm        | 0.7336        | 0.07%           |
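
For context, the entanglement numbers above are Pearson correlations, per weight group, between per-weight learnability-importance scores and privacy-vulnerability scores. The sketch below uses random placeholder score vectors purely to show the computation; in practice they would come from a TFO-style accuracy criterion and from `pve_scores` above.

```python
import torch


def pearson_r(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation between two flattened score vectors."""
    a, b = a.flatten().float(), b.flatten().float()
    a, b = a - a.mean(), b - b.mean()
    return float((a @ b) / (a.norm() * b.norm()))


learnability = torch.rand(10_000)  # placeholder learnability-importance scores
privacy = 0.9 * learnability + 0.1 * torch.rand(10_000)  # placeholder PVE scores
print(f"Pearson r = {pearson_r(learnability, privacy):.4f}")
```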

CIFAR-10 Defense Performance (Table 3, ResNet18, LiRA Attack AUC ↓ lower is better):

| Defense Method | Test Accuracy (%) | LiRA AUC (%) | LiRA TPR@0.1%FPR (%) |
|----------------|-------------------|--------------|----------------------|
| No Defense     | 79.44             | 85.00        | 2.18                 |
| RelaxLoss      | 77.10             | 70.51        | 1.38                 |
| RelaxLoss+CWRF | 76.86             | 68.31        | 0.03                 |
| CCL            | 79.56             | 83.95        | 1.50                 |
| CCL+CWRF       | 77.77             | 64.82        | 0.22                 |

Ablation Study

| Configuration                               | Rewind Rate | Train Loss | Test Loss | Notes                             |
|---------------------------------------------|-------------|------------|-----------|-----------------------------------|
| A1 (Remove + Fine-tune Non-vulnerable)      | 0.1–5%      | —          | —         | Accuracy collapses, unrecoverable |
| A2 (Rewind + Fine-tune Vulnerable)          | 3.0%        | 0.4326     | 0.9288    | Larger train-test loss gap        |
| A3/CWRF (Rewind + Fine-tune Non-vulnerable) | 3.0%        | 0.4473     | 0.8044    | Smallest train-test loss gap      |
| From Scratch (RelaxLoss)                    | —           | 0.8087     | 1.5398    | Global retraining performs worse  |

At a 3% rewind rate, CWRF achieves a test loss of only 0.8044, far superior to retraining from scratch (1.5398).

Key Findings

  • Standard pruning (TFO at 90% sparsity) does not reduce the MIA success rate and may even increase it: as redundancy decreases, the influence of the vulnerable weights is amplified.
  • Weight "position" is more critical than weight "value": rewinding to initialization and retraining can fully recover accuracy, whereas removal cannot.
  • Attention layers in Transformers exhibit higher privacy vulnerability than convolutional layers in CNNs.
  • CWRF is composable with existing privacy-preserving training methods, yielding consistent improvements across DP-SGD, RelaxLoss, HAMP, and CCL.
  • On ViT, DP-SGD+CWRF reduces LiRA AUC from 55.68% to 54.97% (approaching the random baseline of 50%) and TPR@0.1%FPR from 0.17% to 0.00%.

Highlights & Insights

  • This is the first work to analyze privacy vulnerability at the weight level, revealing its deep entanglement with learnability—which fundamentally explains why conventional pruning fails to improve privacy.
  • The "position > value" finding echoes the Lottery Ticket Hypothesis, providing new evidence and a novel application from a privacy-protection perspective.
  • CWRF incurs far lower computational cost than global methods such as DP-SGD—requiring fine-tuning of only a small subset of weights—making it a potentially practical lightweight privacy enhancement scheme.
  • Normalization layers account for only 0.05–0.07% of parameters yet contain highly privacy-vulnerable weights with relatively low correlation to learnability, suggesting a unique role in privacy protection.

Limitations & Future Work

  • Validation is limited to classification models (ResNet18, ViT) and small-scale datasets (CIFAR-10/100, CINIC-10); applicability to LLMs remains unknown.
  • The rewind rate \(r\) must be selected via cross-validation; no automated selection strategy is provided.
  • PVE requires a non-member reference set, an assumption that may not hold in all deployment scenarios.
  • No formal theoretical connection to differential privacy is established, and no formal privacy guarantees are provided.

Comparison with Related Methods

  • vs. DP-SGD: DP-SGD injects noise globally, whereas CWRF precisely identifies the vulnerable weights. CWRF can serve as a complement to DP-SGD, with experiments showing further gains when the two are combined.
  • vs. Lottery Ticket Hypothesis / Model Pruning: Pruning focuses on efficiency; CWRF focuses on privacy. Both share the insight that a small number of weights determines model behavior, but CWRF reveals that the privacy dimension and the utility dimension are highly coupled in those weights.

Rating

  • Novelty: ⭐⭐⭐⭐ Weight-level privacy analysis is a genuinely new angle, with three core insights presented in a coherent progression.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated on two architectures (ResNet/ViT), two attacks (LiRA/RMIA), and four defense methods.
  • Writing Quality: ⭐⭐⭐⭐ Excellent visualizations; the reasoning chain from observation to hypothesis to validation is clear.
  • Value: ⭐⭐⭐⭐ Offers a new direction for lightweight privacy-preserving fine-tuning with low practical deployment overhead.