# Time-Evolving Dynamical System for Learning Latent Representations of Mouse Visual Cortex

- Conference: NeurIPS 2025
- arXiv: 2408.07908
- Code: Available
- Area: Interpretability
- Keywords: latent variable models, visual neural activity, time-evolving dynamical systems, contrastive learning, mouse visual cortex
## TL;DR
This paper proposes TE-ViDS, a sequential latent variable model that decomposes visual neural activity into an external representation linked to visual stimuli and an internal representation reflecting internal states. By incorporating a time-evolving structure and contrastive learning, TE-ViDS achieves state-of-the-art decoding performance on natural scenes and videos.
## Background & Motivation
Latent variable models (LVMs) reveal intrinsic associations between neural activity and behavior or sensory stimuli by constructing low-dimensional representations, making them central to neural data analysis. However, three important gaps exist in the literature:
State of the Field — Bias toward motor cortex: Most LVM research focuses on motor regions (e.g., pre-planned movement), with relatively little work on the visual cortex.
Limitations of Prior Work — Temporal relationships ignored: Natural visual stimuli are inherently high-dimensional and temporally dependent, yet most LVMs do not explicitly model the temporal structure of neural activity.
Limitations of Prior Work — Visual-specific properties underutilized: Visual neural activity contains both stimulus-related and internal-state components, which existing methods do not specifically address.
Key Challenge: When mice passively observe natural scenes or videos, neural dynamics in the visual cortex are driven by two factors:

- External visual stimuli: the content of the scene or movie frame
- Internal states: attention, arousal level, etc., which may exert an even greater influence on neural activity than the visual stimuli themselves
How to construct high-quality latent representations that disentangle these two components is therefore a critical open problem.
## Method

### Overall Architecture
TE-ViDS is a sequential latent variable model whose core components include:

- Encoder: extracts spatial features from sequential spike data
- Time-evolving system: evolves latent variables conditioned on RNN state factors
- Decoder: maps latent variables to inferred firing rates
- Disentangled design: external latent variables (deterministic) + internal latent variables (stochastic)
The input is \(\mathbf{x} = (\mathbf{x}_1, ..., \mathbf{x}_T) \in \mathbb{R}^{T \times N}\) (spike counts from \(N\) neurons across \(T\) time windows).
### Key Designs

#### 1. External Latent Variables (Deterministic, Stimulus-Related)
Function: Capture the component of neural activity associated with visual stimuli.
Mechanism: Designed as deterministic (non-stochastic) values, since stimulus-related components should be stable and variability should be attributed to internal states. Shaped via contrastive learning (NT-Xent loss) — temporally offset sequences serve as positive pairs (as adjacent-time visual stimuli are similar), while negative samples are drawn randomly from the training set.
Design Motivation: Positive pairs cover time segments with similar visual stimuli, naturally aligning external representations with stimulus content. A swap operation is also applied — exchanging external representations between positive pairs while preserving internal representations — to further enhance disentanglement.
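The contrastive objective and swap operation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the temperature value, and the batching convention (positive pair `i` ↔ `i + B`) are our assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, temperature=0.5):
    """NT-Xent loss. z_a, z_b: (B, D) external representations of two
    temporally offset windows (positive pairs); all other in-batch
    samples serve as negatives."""
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2B, D)
    sim = z @ z.t() / temperature                          # pairwise similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))             # exclude self-pairs
    B = z_a.size(0)
    # the positive of sample i is sample i + B (and vice versa)
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

def swap(z_ext_a, z_ext_b, z_int_a, z_int_b):
    """Swap operation: exchange external representations between the
    positive pair while keeping internal representations fixed."""
    return (z_ext_b, z_int_a), (z_ext_a, z_int_b)
```

The swapped pairs are then decoded and reconstructed as usual, which penalizes any stimulus-related information leaking into the internal representation.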
#### 2. Internal Latent Variables (Stochastic, State-Related)
Function: Reflect the animal's internal dynamic states (attention, arousal, etc.), which exhibit high variability and noise.
Approximate posterior: \(\mathbf{z}_t^{(i)} | \mathbf{x}_{1:t}, \mathbf{h}_{1:t-1}^{(i)} \sim \mathcal{N}(\boldsymbol{\mu}_{z,t}, \boldsymbol{\sigma}_{z,t}^2 \cdot \mathbf{I})\)
Prior distribution: \(\tilde{\mathbf{z}}_t^{(i)} | \mathbf{h}_{1:t-1}^{(i)} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}_{z,t}, \tilde{\boldsymbol{\sigma}}_{z,t}^2 \cdot \mathbf{I})\)
Mechanism: Modeled as stochastic variables whose prior depends only on the previous state factor (capturing temporal spontaneity); KL divergence constrains the gap between posterior and prior.
Design Motivation: Internal states are inherently variable and noisy, making stochastic modeling more appropriate. A temporally dependent prior allows the model to capture the slow drift of internal states.
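A minimal sketch of the stochastic machinery behind the internal latent variables: reparameterized sampling from the approximate posterior, and the closed-form KL term between the posterior and the learned prior (both diagonal Gaussians, matching the equations above). Variable names are ours, not the paper's.

```python
import torch

def sample_internal(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for diagonal covariances,
    summed over latent dimensions, averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q
                + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()
```

When posterior and prior coincide the KL term vanishes, so the regularizer only penalizes internal-state dynamics the temporally dependent prior cannot already predict.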
#### 3. Time-Evolving Mechanism (GRU State Factors)
Two independent GRUs maintain the external and internal state factors, respectively. A key distinction: the GRU for the internal state factor additionally receives the external latent variable as input, reflecting the fact that internal states are inevitably influenced by visual stimuli.
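One time-evolution step can be sketched with two `GRUCell`s, where the internal-state GRU also consumes the external latent, as described above. Layer names, dimensions, and the linear read-out heads are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TimeEvolvingStep(nn.Module):
    def __init__(self, feat_dim, ext_dim, int_dim, hid_dim):
        super().__init__()
        self.gru_ext = nn.GRUCell(feat_dim, hid_dim)            # external state factor
        self.gru_int = nn.GRUCell(feat_dim + ext_dim, hid_dim)  # internal GRU also sees z_ext
        self.to_ext = nn.Linear(hid_dim, ext_dim)               # deterministic z_ext
        self.to_int = nn.Linear(hid_dim, 2 * int_dim)           # mu and logvar of z_int

    def forward(self, x_t, h_ext, h_int):
        h_ext = self.gru_ext(x_t, h_ext)
        z_ext = self.to_ext(h_ext)                              # deterministic external latent
        h_int = self.gru_int(torch.cat([x_t, z_ext], dim=-1), h_int)
        mu, logvar = self.to_int(h_int).chunk(2, dim=-1)
        z_int = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        return z_ext, z_int, h_ext, h_int
```

Unrolling this step over the `T` time windows of a trial yields the full latent sequence fed to the decoder.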
### Loss & Training
- \(\mathcal{L}_{\text{recons}}\): Poisson negative log-likelihood (spike count reconstruction)
- \(\mathcal{L}_{\text{contrastive}}\): NT-Xent contrastive loss (shaping external representations)
- \(\mathcal{L}_{\text{regular}}\): KL divergence + prior regularization (constraining internal representations)
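Assembled, the objective is the Poisson reconstruction term plus weighted contrastive and regularization terms. A minimal sketch, assuming the weights `alpha` and `beta` are tunable hyperparameters (the paper's actual weighting scheme may differ):

```python
import torch
import torch.nn.functional as F

def poisson_nll(log_rates, spikes):
    """Poisson negative log-likelihood of observed spike counts given
    inferred log firing rates (log_input=True expects log rates)."""
    return F.poisson_nll_loss(log_rates, spikes, log_input=True, full=True)

def total_loss(log_rates, spikes, l_contrastive, l_regular,
               alpha=1.0, beta=0.1):
    """L_recons + alpha * L_contrastive + beta * L_regular."""
    return poisson_nll(log_rates, spikes) + alpha * l_contrastive + beta * l_regular
```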
## Key Experimental Results

### Main Results 1: Natural Scene Decoding (118 scene images)
| Model | Mouse 1 | Mouse 2 | Mouse 3 | Mouse 4 | Mouse 5 |
|---|---|---|---|---|---|
| PCA | 0.59% | 1.53% | 1.53% | 0.80% | 0.85% |
| LFADS | 30.76% | 16.46% | 22.20% | 19.69% | 4.69% |
| pi-VAE | 7.49% | 19.42% | 22.92% | 13.71% | 2.22% |
| Swap-VAE | 32.81% | 24.34% | 14.36% | 14.85% | 3.92% |
| CEBRA | 1.53% | 3.42% | 4.86% | 2.81% | 1.08% |
| TE-ViDS-small | 47.08% | 23.95% | 29.08% | 34.95% | 9.93% |
| TE-ViDS | 50.86% | 27.24% | 29.90% | 38.05% | 9.44% |
TE-ViDS achieves the highest decoding accuracy across all five mice, with substantial margins over the second-best model (roughly 18 percentage points for both Mouse 1 and Mouse 4).
### Main Results 2: Natural Movie Frame Decoding (900 frames, 1-second windows)
| Model | Mouse 1 | Mouse 2 | Mouse 3 | Mouse 4 | Mouse 5 |
|---|---|---|---|---|---|
| PCA | 8.44% | 28.77% | 25.42% | 21.56% | 11.69% |
| LFADS | 8.94% | 26.57% | 26.77% | 24.76% | 12.69% |
| Swap-VAE | 12.19% | 51.31% | 45.96% | 41.53% | 22.70% |
| CEBRA | 10.62% | 52.76% | 61.01% | 42.11% | 22.33% |
| TE-ViDS | 13.88% | 65.38% | 59.88% | 54.33% | 30.18% |
### Ablation Study
| Configuration | Key Metric | Remarks |
|---|---|---|
| External vs. internal representations | External decoding score >> internal | Validates the hypothesis that external representations capture stimulus-related information |
| Temporal vs. non-temporal synthetic data | Performance drops sharply after time dimension shuffling | Demonstrates model sensitivity to temporal structure |
| TE-ViDS vs. TE-ViDS-small | Comparable or marginally better | Small model is also effective; gains are not due to parameter scaling |
| Comparison across 6 cortical areas | VISp highest, VISrl lowest | Provides computational evidence for a functional hierarchy in the visual cortex |
## Key Findings
- Mechanistic basis of individual differences: RSA analysis reveals that Mouse 1's neural representations split into two distinct temporal epochs across scenes (attributable to internal state shifts), whereas Mouse 2 shows no such pattern. This explains the large variance in decoding performance across animals.
- Evidence for cortical hierarchy: Primary and intermediate visual areas (VISp, VISl, VISal) show higher decoding performance than higher-order areas (VISpm, VISam), with the multisensory area VISrl scoring lowest — offering novel computational evidence for a functional hierarchy in the mouse visual cortex.
- Limitations of CEBRA: CEBRA performs extremely poorly on natural scene decoding (~3%), indicating that its fixed-kernel temporal encoding is ill-suited for extracting temporal features under static stimulation.
## Highlights & Insights
- Stimulus-related and state-related disentanglement strategy: The deterministic-external plus stochastic-internal design precisely matches the two components of visual neural activity.
- Natural application of contrastive learning: Using temporally offset sequences as positive pairs is highly principled — visually similar stimuli naturally occur at adjacent time points.
- Rich biological insights: Beyond methodological contributions, the work reveals the influence of internal states on visual coding and functional differences across cortical regions.
- Methodological generality: The framework is not limited to the mouse visual cortex and can be extended to other species, brain regions, and modalities.
## Limitations & Future Work
- Lack of quantitative evaluation for internal representations: No behavioral or internal state recordings are available to validate the interpretability of the internal latent variables.
- Large individual variability: The substantial gap in decoding performance between Mouse 1 and Mouse 5 indicates that the model does not fully overcome inter-individual variability.
- Passive viewing paradigm only: Mice do not perform any task, precluding direct links between representations and task-related behavior.
- Computational cost not thoroughly discussed: The time complexity of sequential GRU processing may become a bottleneck for very long time series.
## Related Work & Insights
- vs. CEBRA (Schneider 2023): CEBRA encodes temporal features via fixed convolutional kernels, whereas TE-ViDS uses RNN-based dynamic evolution, which is better suited to visual neural activity.
- vs. Swap-VAE (Liu 2021): TE-ViDS inherits the swap operation and split architecture but augments them with a time-evolving mechanism.
- The influence of internal states on perception is consistent with the behavioral findings of Ashwood (2022).
- Future work could integrate brain–computer interface applications, leveraging the disentangled representations to handle stimulus and state information separately.
## Rating
- Novelty: ⭐⭐⭐⭐ (Time-evolving + disentangled design is well-motivated but not revolutionary)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Synthetic and real neural data; comprehensive multi-animal, multi-region analysis)
- Writing Quality: ⭐⭐⭐⭐ (Methods are clearly presented; biological discussion is in-depth)
- Value: ⭐⭐⭐⭐ (Fills a gap in LVM research on visual neural activity; provides valuable neuroscientific insights)