Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

Conference: CVPR 2026
arXiv: 2508.05059
Code: jjh6297/KNOW
Area: Model Training / Weight Prediction / Knowledge Transfer
Keywords: weight prediction, structured forgetting, meta-learning, hyper-model, knowledge transfer, scaling law, pre-trained weights

TL;DR

This paper proposes KNOW prediction: a framework that induces a structured forgetting process via sequential fine-tuning on progressively shrinking data subsets, collects the resulting weight transition trajectories, and then employs a meta-learned hyper-model (KNOWN) to reverse the forgetting direction, predicting virtually knowledge-enriched weights as if the model had been trained on a larger dataset. The approach consistently outperforms naive fine-tuning and multiple weight prediction baselines across diverse datasets (CIFAR/ImageNet/PACS, etc.) and architectures (ResNet/PVTv2/DeepLabV3+), yielding significant improvements on downstream tasks including image classification, semantic segmentation, image captioning, and domain generalization.

Background & Motivation

Pre-trained weights are the cornerstone of modern deep learning, particularly in data-scarce few-shot scenarios where high-quality pre-trained weights can substantially improve downstream task performance. The central question is: how can one obtain better pre-trained weights without increasing the amount of training data?

The authors' approach is grounded in three key observations:

Scaling Law: More training data generally yields better pre-trained weights (i.e., better generalization). However, large-scale data collection is costly and often infeasible in practice.

Fine-tuning induces forgetting: Fine-tuning on a data subset overwrites the model's knowledge of out-of-subset data—a classic manifestation of catastrophic forgetting, traditionally regarded as a deficiency.

Fine-tuning is reversible: Prior unlearning research has demonstrated that the weight-space changes induced by fine-tuning are partially reversible; the smoothness of the loss landscape (mode connectivity) makes weight prediction theoretically feasible.

Core idea: Since fine-tuning on shrinking data → knowledge forgetting → weight degradation is a structured process, reversing this process → knowledge recovery → weight enhancement is equally feasible. This reframes "forgetting" from a deficiency into a tool.

Method

Problem Formulation

Given weights \(\Theta_0\) pre-trained on dataset \(D_0\), a sequence of progressively smaller datasets \(D_S \subset D_{S-1} \subset \cdots \subset D_1 \subset D_0\) is constructed (with sampling rate \(r \in [0,1]\)). Sequential fine-tuning yields a weight sequence \([\Theta_0, \Theta_1, \Theta_2, \ldots, \Theta_{S-1}]\).

Assumption: There exists an ideal weight \(\Theta_{-1}\) corresponding to training on a larger dataset \(D_{-1} \supset D_0\), such that fine-tuning \(\Theta_{-1}\) on \(D_0\) recovers \(\Theta_0\).

Objective: By observing the forgetting trajectory \([\Theta_0, \Theta_1, \ldots, \Theta_{S-1}]\), retrodict \(\hat{\Theta}_{-1}\)—this constitutes KNOW (KNowledge-Overflowed Weights) prediction.
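Putting the assumption and the objective side by side, the chain that KNOW prediction aims to reverse can be written as:

\[
D_{-1} \supset D_0 \supset D_1 \supset \cdots \supset D_{S-1},
\qquad
\Theta_{-1} \xrightarrow{D_0} \Theta_0 \xrightarrow{D_1} \Theta_1 \xrightarrow{D_2} \cdots \xrightarrow{D_{S-1}} \Theta_{S-1},
\]

where only the suffix \([\Theta_0, \ldots, \Theta_{S-1}]\) is ever actually computed; \(D_{-1}\) and \(\Theta_{-1}\) are never materialized and exist only as the target of the retrodiction.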

Structured Forgetting

  1. Starting from the full dataset \(D_0\), construct progressively smaller subsets by sampling a fraction \(r\) of the previous one: \(D_1\) from \(D_0\), \(D_2\) from \(D_1\), and so on (so \(|D_s| = r^s |D_0|\)).
  2. Fine-tune the previous step's weights on each subset: \(\Theta_0 \xrightarrow{D_1} \Theta_1 \xrightarrow{D_2} \Theta_2 \cdots\)
  3. This process deliberately induces structured forgetting—the amount of knowledge forgotten at each step is proportional to the degree of data reduction.
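A minimal PyTorch sketch of this trajectory-collection loop, assuming a labeled classification dataset; `shrink`, `fine_tune`, and `structured_forgetting` are hypothetical helpers, and the optimizer and schedule are placeholders rather than the paper's settings:

```python
import copy
import random

import torch
from torch.utils.data import DataLoader, Subset


def shrink(dataset, r):
    """Keep a random fraction r of `dataset`; applying this repeatedly to the
    previous subset yields the nested sequence D_1, D_2, ... used above."""
    keep = random.sample(range(len(dataset)), int(r * len(dataset)))
    return Subset(dataset, keep)


def fine_tune(model, dataset, epochs=1, lr=1e-3, device="cpu"):
    """Minimal fine-tuning loop (illustrative hyperparameters, not the paper's)."""
    model = copy.deepcopy(model).to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in DataLoader(dataset, batch_size=128, shuffle=True):
            opt.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            opt.step()
    return model


def structured_forgetting(model_0, D_0, r=0.5, S=5):
    """Return the forgetting trajectory [Theta_0, Theta_1, ..., Theta_{S-1}]."""
    trajectory = [copy.deepcopy(model_0.state_dict())]
    model, dataset = model_0, D_0
    for _ in range(1, S):
        dataset = shrink(dataset, r)        # D_s: a fraction r of D_{s-1}
        model = fine_tune(model, dataset)   # Theta_{s-1} --fine-tune on D_s--> Theta_s
        trajectory.append(copy.deepcopy(model.state_dict()))
    return trajectory
```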

A key property: loss landscape visualization (via PCA projection) shows that the weight sequence forms a smooth curve, with high-accuracy regions continuously surrounding the trajectory—supporting the feasibility of weight prediction through trajectory extrapolation.
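A minimal sketch of the projection step behind such a visualization, assuming the trajectory is a list of checkpoints (state dicts); the accuracy map that the paper overlays around the trajectory would additionally require evaluating the network at reconstructed grid points in this plane, which is omitted here:

```python
import numpy as np
from sklearn.decomposition import PCA


def flatten_checkpoint(state_dict):
    """Concatenate every tensor in a checkpoint into one long vector."""
    return np.concatenate(
        [t.detach().cpu().numpy().astype(np.float32).ravel() for t in state_dict.values()]
    )


def project_trajectory(trajectory, n_components=2):
    """Project the weight trajectory [Theta_0, ..., Theta_{S-1}] onto its top
    principal directions; each checkpoint becomes one 2-D point on the plot."""
    W = np.stack([flatten_checkpoint(sd) for sd in trajectory])  # (S, num_params)
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(W)  # (S, 2) coordinates of each checkpoint in the PCA plane
    return coords, pca
```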

KNOWN (Knowledge-Overflowed Weights Nowcaster)

KNOWN is a lightweight meta-trained hyper-model with only 9,425 parameters, implemented as a two-stream MLP based on the WNN architecture.

Input: Weight history \(W_t = [\theta_0, \theta_1, \ldots, \theta_{S-1}]\) and its finite differences \(dW_t = [\theta_1 - \theta_0, \ldots, \theta_{S-1} - \theta_{S-2}]\) (with \(S=5\)).

Prediction: Outputs a weight residual to predict the enhanced weights:

\[\hat{\theta}^{t-1} = \theta^t + \text{KNOWN}(W_t, dW_t)\]

Parameter-type specialization: Three dedicated KNOWN models \([\text{KNOWN}_{\text{Conv}}, \text{KNOWN}_{\text{FC}}, \text{KNOWN}_{\text{Bias}}]\) are trained separately for Conv, FC, and Bias parameters.

Meta-training:

  • Weight trajectories are collected from diverse small-scale DNNs (CNN/ResNet/DenseNet/ShuffleNet/MobileNetV2, all <3M parameters) trained on CIFAR10/MNIST/FashionMNIST, totaling approximately 50 GB.
  • The training objective is \(\ell_1\) residual minimization: \(\|(\theta^t + \text{KNOWN}(W_t, dW_t)) - \theta^{t-1}\|_1\).
  • Once meta-trained, KNOWN requires no additional training and generalizes directly to all downstream settings.
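A hedged sketch of what such a two-stream residual predictor and its \(\ell_1\) meta-training step could look like in PyTorch; `KNOWNSketch`, `meta_train_step`, and the hidden widths are illustrative assumptions and do not reproduce the paper's exact WNN-based, 9,425-parameter architecture:

```python
import torch
import torch.nn as nn


class KNOWNSketch(nn.Module):
    """Two-stream MLP acting on one scalar parameter at a time: one stream reads
    the S-step weight history, the other its finite differences.
    Layer widths here are illustrative, not the paper's configuration."""

    def __init__(self, S=5, hidden=32):
        super().__init__()
        self.hist = nn.Sequential(nn.Linear(S, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.diff = nn.Sequential(nn.Linear(S - 1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # residual added to the anchor weight

    def forward(self, W, dW):
        # W: (N, S) histories per scalar parameter, ordered from the least-forgotten
        # theta^t to the most-forgotten; dW: (N, S-1) finite differences.
        h = torch.cat([self.hist(W), self.diff(dW)], dim=-1)
        return W[:, :1] + self.head(h)  # \hat{theta}^{t-1} = theta^t + KNOWN(W, dW)


def meta_train_step(model, W, dW, target, opt):
    """One l1 residual-minimization step; `target` is the ground-truth theta^{t-1}
    taken from the collected forgetting trajectories."""
    opt.zero_grad()
    loss = (model(W, dW) - target).abs().mean()
    loss.backward()
    opt.step()
    return loss.item()


# One instance per parameter type, mirroring the Conv/FC/Bias specialization.
known = {k: KNOWNSketch() for k in ("conv", "fc", "bias")}
```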

Iterative Multi-Step Prediction

If the first-step prediction \(\hat{\Theta}_{-1}\) is reliable, one can further use \([\hat{\Theta}_{-1}, \Theta_0, \Theta_1, \ldots, \Theta_{S-2}]\) to predict \(\hat{\Theta}_{-2}\). With \(r=0.5\), the predictions \(\hat{\Theta}_{-1}\), \(\hat{\Theta}_{-2}\), \(\hat{\Theta}_{-3}\) correspond to virtual data augmentation factors of ×2, ×4, and ×8, respectively. Iterative prediction continues to yield performance gains, indicating that predicted weights are of sufficient quality to support recursive application.
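A minimal sketch of this recursion, assuming a predictor with the interface from the sketch above (`known(W, dW)` returns the residual-corrected weight); names and shapes are illustrative:

```python
import torch


@torch.no_grad()
def iterative_predict(known, history, n_steps=3):
    """Recursively retrodict knowledge-enriched weights. `history` is an (N, S)
    tensor holding, per scalar parameter, [Theta_0, ..., Theta_{S-1}] from least to
    most forgotten. Each step prepends the new prediction and drops the
    most-forgotten entry, so with r=0.5 step n corresponds to a virtual x2^n data scale."""
    predictions = []
    for _ in range(n_steps):
        dW = history[:, 1:] - history[:, :-1]                 # finite differences of the window
        pred = known(history, dW)                             # (N, 1): predicted Theta_{-k}
        predictions.append(pred)
        history = torch.cat([pred, history[:, :-1]], dim=1)   # slide the window one step back
    return predictions
```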

Computational Cost

The training overhead of sequential forgetting is \(\frac{1-r^{S-1}}{1-r}\) times the original training cost. The inference overhead of weight prediction is negligible: predicting all parameters of ResNet18 requires only \(3.01 \pm 0.09\) seconds (\(2.67 \times 10^{-7}\) seconds per parameter).
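As a concrete instantiation of the stated formula: with the default \(r = 0.5\) and \(S = 5\), the factor is \(\frac{1 - 0.5^{4}}{1 - 0.5} = 1.875\), i.e. the sequential forgetting stage adds roughly \(1.875\times\) the original training cost.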

Key Experimental Results

Image Classification (ResNet18, CIFAR100 → CIFAR10)

| Method | Pred. Steps | 100% Data | 50% Data | 25% Data |
|---|---|---|---|---|
| Naïve Transfer | 1 | 92.40 | 92.08 | – |
| KNOWN | ×2 | 93.00±0.11 | 92.58±0.14 | 92.29±0.04 |
| KNOWN | ×4 | 93.27±0.09 | 92.62±0.25 | 92.88±0.11 |
| KNOWN | ×8 | 93.55±0.05 | 93.11±0.19 | 92.92±0.15 |

With only 50% of the data, KNOWN (92.58) already surpasses the 100%-data baseline (92.40), and iterative prediction (×8) further improves performance to 93.55. Other methods (LogFit/TaskVector/TSV, etc.) sometimes degrade performance.

Cross-Architecture, Cross-Dataset (PVTv2, ImageNet Pre-trained → 5 Downstream Datasets)

KNOWN consistently achieves improvements on CIFAR100, TinyImageNet, Stanford Cars, CUB, and Oxford Flowers. At ×3 prediction: CIFAR100 82.46 (↑), TinyImageNet 77.53 (↑), CUB 71.18 (↑).

Domain Generalization (PACS, Leave-One-Domain-Out)

| Method | art | sketch | cartoon | photo | Avg. |
|---|---|---|---|---|---|
| Naïve Transfer | – | – | – | – | 63.48 |
| KNOWN (×3) | 72.12 | 44.11 | 62.73 | 93.87 | 68.21 |
| KNOWN (×9) | 72.07 | 44.02 | 64.28 | 92.98 | 68.33 |

Average accuracy improves from 63.48 to 68.33, a gain of approximately 5 percentage points.

Semantic Segmentation (DeepLabV3+, Pascal VOC → Cityscapes)

| Method | mIoU |
|---|---|
| Naïve Transfer | baseline |
| KNOWN (×3) | 69.00±1.04 (↑) |
| KNOWN (×9) | 71.22±0.82 (↑) |

TaskVector falls below the baseline at ×9, while KNOWN yields stable improvements.

Image Captioning (PVTv2 + Transformer Decoder, Flickr8K)

KNOWN improves masked accuracy by approximately 2.2 percentage points over the baseline (39.38 vs. ~37.2), demonstrating effectiveness in cross-modal tasks.

Ablation Study (Effect of \(S\))

| S | ×2 Acc. | ×4 Acc. | ×8 Acc. |
|---|---|---|---|
| 2 (≈TaskVector) | 92.69 | 92.70 | 92.65 |
| 3 | 93.01 | 93.04 | 92.72 |
| 4 | 92.97 | 93.10 | 92.89 |
| 5 | 93.00 | 93.27 | 93.55 |

Longer forgetting sequences (\(S=5\)) provide richer trajectory information, with the advantage becoming particularly pronounced in multi-step iterative prediction.

Highlights & Insights

  • Reframing forgetting as a tool: Catastrophic forgetting has long been regarded as a persistent challenge in deep learning. This work is the first to deliberately induce and reverse it as a means of knowledge enhancement—a genuinely creative reframing.
  • KNOWN is extremely lightweight: A hyper-model with only 9,425 parameters, once meta-trained, generalizes across architectures (CNN/ViT), datasets, and tasks (classification/segmentation/captioning/domain generalization) without any retraining—a remarkable degree of generalization.
  • Near-zero inference cost: Predicting all parameters of ResNet18 takes only 3 seconds, which is entirely negligible compared to hours of training.
  • No additional data required: Unlike data augmentation or knowledge distillation, KNOW leverages only the forgetting structure of existing data to achieve the virtual effect of expanded training data.
  • Loss landscape visualization provides intuitive validation: The smoothness of the forgetting trajectory under PCA projection and the accurate localization of predicted weights offer direct visual evidence for the feasibility of the approach.

Limitations & Future Work

  1. Absence of large-scale model validation: The largest model evaluated is PVTv2 (~25M parameters); the approach has not been validated on ViT-Large, LLMs, or other models with hundreds of millions of parameters. Whether the weight space remains sufficiently smooth at larger scales is an open question.
  2. Meta-training dataset constraints: KNOWN is meta-trained exclusively on models with fewer than 3M parameters. Although experiments demonstrate generalization to PVTv2, whether this generalization holds across even larger scale gaps remains unknown.
  3. Sampling rate selection: Both \(r=0.5\) and \(r=0.33\) perform well in the paper, but no systematic guidelines for choosing \(r\) are provided. An excessively small \(r\) causes too much forgetting per step, while an excessively large \(r\) yields insufficient trajectory variation.
  4. Evaluation limited to visual tasks: Although the experiments cover classification, segmentation, captioning, and domain generalization, all settings are in the visual domain. Weight evolution patterns in NLP or speech modalities may differ substantially.
  5. Compatibility with modern training paradigms: The paper does not discuss integration with parameter-efficient fine-tuning methods such as LoRA or Adapters, nor does it address application during the pre-training phase itself (the method is used only as a post-pre-training enhancement).

Related Work & Context

  • WNN (ICML 2023): A prior work from the same research group proposing periodic weight prediction to accelerate training. The present paper extends WNN to cross-dataset-scale KNOW prediction.
  • Task Arithmetic (ICLR 2023): Proposes editing models through linear operations on task vectors. The present work demonstrates that linear extrapolation (TaskVector) is unstable on forgetting trajectories, and that KNOWN's nonlinear modeling is more reliable.
  • Model Soups / Weight Averaging: Averages weights from multiple fine-tuned models. KNOW differs fundamentally by extrapolating from a sequential trajectory rather than interpolating among multiple terminal states.
  • Scaling Law research: The present work can be viewed as an implementation pathway for "indirectly leveraging the Scaling Law without increasing data volume."
  • Implications: The KNOW framework may extend to other structured degradation processes, such as reversing model pruning (sparse → predicting dense weights) or reversing quantization (low-precision → predicting high-precision weights).

Rating

| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4.5 | The idea of reversing forgetting is highly creative; the shift from "deficiency → tool" represents a genuine paradigm shift. |
| Practicality | 4.0 | KNOWN is lightweight, generalizes well, and incurs near-zero inference cost, making deployment straightforward. |
| Experimental Thoroughness | 4.0 | Multi-architecture, multi-dataset, multi-task validation is comprehensive with clear ablations, but large-scale model and NLP experiments are absent. |
| Writing Quality | 4.0 | Problem formulation is clear, mathematical notation is well-defined, and landscape visualizations are intuitive; some table values are rendered unclearly due to template issues. |