Model-Behavior Alignment under Flexible Evaluation: When the Best-Fitting Model Isn't the Right One

Conference: NeurIPS 2025 · arXiv: 2510.23321 · Code: GitHub · Area: Computational Neuroscience / Representational Alignment · Keywords: Model Recovery, Representational Alignment, Linear Probing, Identifiability, THINGS Dataset

TL;DR

Through large-scale model recovery experiments, this paper demonstrates that even with up to 4.2 million behavioral training triplets, flexible evaluation methods based on linear probing achieve model recovery accuracy below 80% across 20 visual models. This reveals a fundamental trade-off between predictive accuracy and model identifiability, challenging the prevailing paradigm that the best-fitting model is the most appropriate one.

Background & Motivation

Representations from deep neural networks (DNNs) are widely used as computational models of biological visual systems. The standard evaluation pipeline involves extracting ANN representations, aligning them to brain/behavioral data via some metric, and designating the model with the highest predictive accuracy as the best biological representational model.

Flexible, data-driven alignment methods (e.g., linear probing) substantially improve predictive accuracy—but this raises a critical question: does predictive accuracy genuinely reflect representational similarity?

Limitations of Prior Work:

  • Kornblith et al. found that non-cross-validated flexible metrics fail to distinguish between layers (though this may be attributable to overfitting)
  • Han et al. tested under idealized settings (noise-free ANN activations), which do not reflect real noisy data
  • Schütt et al. validated the recovery capacity of non-flexible RSA but did not evaluate flexible RSA in noise-calibrated settings

Root Cause: Flexible evaluation improves predictive accuracy, potentially at the cost of model identifiability. The authors use the THINGS odd-one-out dataset (4.7 million behavioral judgments) to quantitatively investigate this trade-off.

Method

Overall Architecture

The paper adopts a model recovery experimental design: synthetic behavioral data are generated from model A → all models (including A) compete to fit the data → it is assessed whether model A can be correctly identified. If the best-fitting model is not the data-generating model, the evaluation method suffers from an identifiability problem.

Key Designs

  1. Mapping from ANN representations to behavioral predictions: For each pretrained ANN, the final representation layer \(\mathbf{X} \in \mathbb{R}^{n \times p}\) (\(n=1854\) images) is extracted, and a linear transformation \(\mathbf{W} \in \mathbb{R}^{p \times p}\) is learned. The similarity matrix is \(\mathbf{S} = (\mathbf{X}\mathbf{W})(\mathbf{X}\mathbf{W})^\top\). Odd-one-out predictions for triplets \(\{a,b,c\}\) use a softmax:

    \(p(\text{odd-one-out}=a \mid \{a,b,c\}) = \frac{\exp(S_{b,c}/T)}{\exp(S_{a,b}/T) + \exp(S_{a,c}/T) + \exp(S_{b,c}/T)}\)

Optimization minimizes the negative log-likelihood plus a regularization term via L-BFGS.
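
A minimal NumPy sketch of this objective, assuming `X` holds the ANN representations and `triplets` indexes its rows; the variable names are illustrative. After flattening `W`, the function can be handed to `scipy.optimize.minimize(..., method='L-BFGS-B')`, in the spirit of the paper's L-BFGS setup.

```python
import numpy as np

def odd_one_out_nll(W, X, triplets, T=1.0):
    """Negative log-likelihood of odd-one-out choices under S = (XW)(XW)^T.

    X        : (n, p) ANN representations
    W        : (p, p) learned linear transform
    triplets : (m, 3) int array of image indices; column 0 holds the
               human-chosen odd one out
    T        : softmax temperature
    """
    Z = X @ W                                    # transformed representations
    S = Z @ Z.T                                  # similarity matrix
    a, b, c = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    # p(odd-one-out = a) is driven by the similarity of the *other* pair (b, c)
    logits = np.stack([S[b, c], S[a, c], S[a, b]], axis=1) / T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_p.mean()
```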

  2. Improved regularization: Standard Frobenius norm regularization is replaced by "shrinkage toward a scalar matrix": \(\mathcal{R}(\mathbf{W}) = \min_\gamma \|\mathbf{W} - \gamma\mathbf{I}\|_F^2 = \|\mathbf{W}\|_F^2 - \frac{(\text{tr}(\mathbf{W}))^2}{p}\). This avoids the degenerate behavior of Frobenius regularization, under which a strong penalty can push performance below the zero-shot baseline.
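
A direct transcription of this penalty; the closed form follows because the inner minimization is a quadratic in \(\gamma\) with optimum \(\gamma^* = \text{tr}(\mathbf{W})/p\). Function names are illustrative.

```python
import numpy as np

def shrinkage_to_scalar_penalty(W):
    """min_gamma ||W - gamma*I||_F^2, with optimum gamma* = tr(W)/p.

    Unlike a plain Frobenius penalty, this leaves scalar multiples of the
    identity (the zero-shot solution, up to scale) unpenalized, so heavy
    regularization shrinks W toward zero-shot rather than toward zero.
    """
    p = W.shape[0]
    return np.sum(W**2) - np.trace(W)**2 / p

# Sanity check against the explicit minimization over gamma:
# gamma_star = np.trace(W) / p
# assert np.isclose(shrinkage_to_scalar_penalty(W),
#                   np.sum((W - gamma_star * np.eye(p))**2))
```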

  3. Noise calibration: Rather than maximizing predictive likelihood, the temperature parameter \(T\) is tuned so that the model's response variability matches the human noise ceiling (67.8% leave-one-subject-out consistency). This ensures that synthetic data have noise levels consistent with real experiments.
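
One way such a calibration could be implemented, sketched below: bisection on \(T\) so that a simple consistency proxy, the expected agreement between two independent draws from the model's choice distribution (\(\sum_i p_i^2\) per triplet), matches the 67.8% target. The proxy and all names are my assumptions, not the paper's exact procedure.

```python
import numpy as np

def expected_agreement(S, triplets, T):
    """Probability that two independent draws from the model's choice
    distribution agree, averaged over triplets, as a simplified stand-in
    for leave-one-subject-out consistency."""
    a, b, c = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    logits = np.stack([S[b, c], S[a, c], S[a, b]], axis=1) / T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return (p ** 2).sum(axis=1).mean()

def calibrate_temperature(S, triplets, target=0.678, lo=1e-3, hi=1e3):
    """Log-space bisection: agreement decreases monotonically in T."""
    for _ in range(60):
        mid = np.sqrt(lo * hi)
        if expected_agreement(S, triplets, mid) > target:
            lo = mid            # still too deterministic, raise T
        else:
            hi = mid
    return np.sqrt(lo * hi)
```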

  4. Model recovery experimental pipeline (see the schematic sketch after this list):

    • 20 diverse ANNs (varying architectures and training objectives)
    • Each model is first fit to the full human dataset to obtain \(\mathbf{W}\) and a calibrated temperature
    • Synthetic behavioral data are sampled from the calibrated model
    • All models fit the synthetic data from scratch
    • 3-fold cross-validation (over different image subsets) is used to compare predictive accuracy
    • 30 random seeds × 20 generative models × 18 dataset sizes
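
A schematic of one recovery run under the definitions above. Here `fit_W_and_T`, `sample_triplet_responses`, and `cv_accuracy` are hypothetical helpers wrapping the fitting, sampling, and cross-validation steps already described; only the loop structure is meant literally.

```python
import numpy as np

def recovery_run(models, human_triplets, n_synth, rng):
    """One seed of the recovery experiment. `models` maps model names to
    their (n, p) representation matrices."""
    recovered = {}
    for gen_name, X_gen in models.items():
        # 1) Fit the generating model to the full human dataset
        W, T = fit_W_and_T(X_gen, human_triplets)          # hypothetical helper
        # 2) Sample noise-calibrated synthetic behavior from it
        synth = sample_triplet_responses(X_gen, W, T, n_synth, rng)
        # 3) Refit every candidate from scratch on the synthetic data,
        #    scoring with 3-fold cross-validation over image subsets
        scores = {name: cv_accuracy(X_cand, synth, n_folds=3)
                  for name, X_cand in models.items()}
        recovered[gen_name] = max(scores, key=scores.get)
    # Recovery accuracy: fraction of generators identified as their own best fit
    return np.mean([gen == best for gen, best in recovered.items()])
```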

Identifiability Analysis

Regression analysis is used to identify causes of model misidentification:

  • Alignment-induced representational geometry shift of the candidate model (positive predictor of the accuracy difference)
  • Magnitude of the shift of the data-generating model (negative predictor: models that shift more produce data more easily predicted by other models)
  • Effective dimensionality (ED) of the data-generating model (negative predictor: higher-dimensional representations are harder to correctly recover)
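
A minimal version of such a regression, assuming the three predictors have been computed per (generating model, candidate) pair; plain OLS via NumPy stands in for whatever the authors used, with z-scored variables so the coefficients are standardized betas comparable to the reported \(\beta\) values.

```python
import numpy as np

def standardized_ols(X, y):
    """OLS with z-scored predictors and outcome, so the fitted coefficients
    are standardized betas comparable across predictors."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    A = np.column_stack([np.ones(len(yz)), Xz])    # add intercept
    beta, *_ = np.linalg.lstsq(A, yz, rcond=None)
    return beta[1:]                                # drop the intercept

# Hypothetical layout, one row per (generating model, candidate) pair:
#   X[:, 0] candidate's alignment-induced geometry shift   (reported beta > 0)
#   X[:, 1] generating model's geometry shift              (reported beta < 0)
#   X[:, 2] generating model's effective dimensionality    (reported beta < 0)
#   y       accuracy difference between candidate and generating model
```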

Key Experimental Results

Main Results: Model Recovery Accuracy vs. Data Volume

| Training Triplets | Model Recovery Accuracy | Notes |
|---|---|---|
| ~1,000 | <10% | Near chance (5%) |
| ~10,000 | ~15% | |
| ~100,000 | ~45% | Typical experimental scale |
| ~1,000,000 | ~70% | |
| 4,200,000 | <80% | Maximum data volume, still not saturated |

Flexibility vs. Accuracy vs. Identifiability Trade-off

| Evaluation Method | Mean Predictive Accuracy | Model Recovery Accuracy (4.2M triplets) |
|---|---|---|
| Zero-shot | ~34% | ~95% |
| Diagonal \(\mathbf{W}\) | ~47% | ~85% |
| \(p \times 10\) rectangular \(\mathbf{W}\) | ~55% | ~75% |
| \(p \times p\) full matrix | ~63% (near ceiling) | <80% |

Ablation Study

| Control Variable | Change in Recovery Accuracy | Notes |
|---|---|---|
| PCA to a fixed 500 dimensions | No improvement | Parameter count is not the main factor |
| Candidate set expanded to 30 models | Drops to ~70% | More competitors make discrimination harder |
| Grouped by training objective | 73.7% | Difficult even within objective categories |
| Grouped by architecture | 70.3% | CNN vs. ViT also hard to distinguish |

Key Findings

  • Systematic bias: OpenAI CLIP ResNet-50 is systematically over-selected as the best-fitting model even when it did not generate the data; 4 models have a mean rank >2 (i.e., more than one competitor ranks higher on average)
  • Representational geometry shift: After linear probing, all models converge toward VICE (a human embedding model); models initially farther from VICE exhibit larger shifts
  • Three significant regression predictors (after Bonferroni correction): candidate model shift (\(\beta=0.495\), \(p=0.02\)), generative model shift (\(\beta=-0.251\), \(p=0.01\)), and generative model effective dimensionality (\(\beta=-0.455\), \(p=0.01\))

Highlights & Insights

  • Rigorous quantitative proof that "best fit ≠ most correct": This is not a philosophical argument but a large-scale simulation-based validation
  • Noise calibration as a key innovation: Prior model recovery studies used noise-free ANN activations, which are unrepresentative of real conditions. Temperature calibration matches synthetic data noise to human levels, yielding sobering results
  • Shrinkage-to-scalar-matrix regularization is a small but practically important modification that avoids the degenerate behavior of the standard approach
  • The experimental design is analogous to knowledge distillation—candidate models act as "students" attempting to imitate the behavior of the "teacher" (data-generating model)

Limitations & Future Work

  • Restricted to behavioral data (THINGS odd-one-out); neural data (fMRI/EEG) may exhibit different trade-off characteristics
  • Quantitative model recovery results depend on the specific candidate model set (20 models)
  • In practice, the "true model" (biological representation) is absent from the candidate set, making the problem even harder
  • The paper proposes three improvement directions without implementing them: (1) active stimulus selection, (2) biologically constrained metrics, and (3) models with built-in alignment capacity
  • Complements Kornblith et al.'s comparison of CKA vs. linear encoding: CKA is more conservative but potentially more reliable
  • Muttenthaler et al. (2023)'s large-scale THINGS study serves as the direct foundation for this work
  • Serves as a cautionary message for the broader representational alignment community: optimizing predictive accuracy may be a misleading objective
  • Suggests that adaptive, model-discriminating stimulus design may be more effective than simply collecting more data

Rating

  • Novelty: ⭐⭐⭐⭐ The model recovery paradigm itself is not new, but noise calibration and large-scale application constitute important contributions
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive design spanning 20 models × 18 data sizes × 30 random seeds
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated, experimental design is rigorous, and discussion is thorough
  • Value: ⭐⭐⭐⭐⭐ Has fundamental methodological implications for computational neuroscience; an exemplary instance of a negative result