Temporal Slowness in Central Vision Drives Semantic Object Learning
Conference: ICLR 2026 · arXiv: 2602.04462 · Code: None · Area: Self-Supervised Learning · Keywords: central vision, temporal slowness, self-supervised learning, Ego4D, semantic representation
TL;DR
SSL models trained on Ego4D data while simulating human central vision (foveal cropping) and the temporal slowness principle (temporal contrastive learning) demonstrate that combining these two mechanisms improves semantic object representations: central vision enhances foreground extraction, while temporal slowness distills semantic information during fixation periods.
Background & Motivation
State of the Field
Background: Humans acquire semantic object representations from egocentric visual streams with minimal supervision, yet SSL models trained on such human visual experience still learn comparatively poor semantic object representations.
Limitations of Prior Work: Existing SSL models neglect two key biological processes: (1) high-resolution foveal processing in the retina (central vision), and (2) the slowness principle, whereby temporally proximate inputs yield similar representations.
Key Challenge: Full-field training conflates foreground and background information and fails to exploit temporal object-tracking signals.
Goal: To investigate the roles of central vision and temporal slowness in the formation of semantic object representations.
Key Insight: Gaze coordinates are generated on Ego4D (5 months of visual experience) using a gaze prediction model (GLC), central visual regions are cropped accordingly, and a temporal contrastive SSL model is trained.
Method
Overall Architecture
Ego4D frames → Gaze prediction (GLC model) → Central vision cropping → MoCoV3 + temporal contrastive learning (InfoNCE over temporally neighboring frames).
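A minimal sketch of this data path is given below, assuming the GLC gaze predictor yields per-frame (x, y) coordinates normalized to [0, 1] and treating \(\Delta T\) as a frame offset; names such as `gaze_centered_crop` and `sample_temporal_pair` are illustrative, not the authors' code.

```python
import random

import torch
from torchvision.transforms.functional import crop


def gaze_centered_crop(frame: torch.Tensor, gaze_xy, crop_size: int = 224) -> torch.Tensor:
    """Crop an N x N window centered on the predicted gaze point.

    frame:   (C, H, W) image tensor
    gaze_xy: (x, y) gaze coordinates normalized to [0, 1]
    """
    _, h, w = frame.shape
    cx, cy = int(gaze_xy[0] * w), int(gaze_xy[1] * h)
    # Keep the window inside the frame as far as possible.
    top = max(0, min(cy - crop_size // 2, h - crop_size))
    left = max(0, min(cx - crop_size // 2, w - crop_size))
    return crop(frame, top, left, crop_size, crop_size)


def sample_temporal_pair(video, gaze_track, t: int, delta_t: int = 3):
    """Return two gaze-centered crops at most `delta_t` frames apart,
    used as a positive pair in place of spatial augmentations."""
    t2 = min(max(t + random.randint(-delta_t, delta_t), 0), len(video) - 1)
    view1 = gaze_centered_crop(video[t], gaze_track[t])
    view2 = gaze_centered_crop(video[t2], gaze_track[t2])
    return view1, view2
```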
Key Designs
- Central Vision Simulation: Crop an \(N \times N\) region centered on the predicted gaze point.
- Temporal Slowness Learning: Randomly sample neighboring frames within a temporal window \(\Delta T\) as positive pairs (a loss sketch follows this list).
- Single-Epoch Training: Train for one epoch over 64 million frames.
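The temporal contrastive objective can be sketched as a symmetric InfoNCE in the spirit of MoCo v3; the temperature value and the omission of the momentum encoder and prediction head are simplifications here, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def temporal_infonce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE where the positive for query embedding q[i] is k[i],
    the embedding of a temporally neighboring crop from the same video;
    all other items in the batch act as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                   # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)


# Symmetric form (as in MoCo v3): average over both view orderings.
# loss = 0.5 * (temporal_infonce(z1, z2) + temporal_infonce(z2, z1))
```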
Key Experimental Results
Main Results
| Method | ImageNet-1k | Fine-Grained Avg. | Instance Recognition |
|---|---|---|---|
| Frames Learning (full-field, no slowness) | 49.50 | Baseline | Baseline |
| Bio-inspired (central + slowness) | 49.58 | Improved | Improved |

(Per-dimension numbers are reported in the ablation and semantic-dimension tables below.)
Key Findings
- Central vision strengthens foreground object feature extraction relative to background.
- Temporal slowness during fixation periods distills broader semantic information (category, contextual co-occurrence).
- The model exhibits greater alignment with human semantic judgments (CKA analysis; a minimal CKA sketch follows this list).
- The two mechanisms are complementary: central vision provides "what," and temporal slowness provides "semantic associations."
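Such a CKA comparison can be reproduced roughly as follows, under the assumption that the linear variant of CKA is used; `X` holds model embeddings and `Y` the reference embeddings (e.g., GloVe co-occurrence vectors or vectors derived from human similarity judgments).

```python
import torch


def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two representation matrices.

    X: (n, d1) model embeddings for n objects/categories
    Y: (n, d2) reference embeddings (e.g., GloVe co-occurrence vectors)
    """
    X = X - X.mean(dim=0, keepdim=True)    # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.t() @ X).norm(p="fro") ** 2  # ||Y^T X||_F^2
    return hsic / ((X.t() @ X).norm(p="fro") * (Y.t() @ Y).norm(p="fro"))
```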
Ablation Study
| Ablation / Analysis | Finding |
|---|---|
| Crop size \(N\) | 224–336 is the sweet spot; \(N=112\) is too small and loses information; full frame benefits scene recognition but degrades object recognition |
| Temporal window \(\Delta T\) | Optimal \(\Delta T=3\) for ResNet50; optimal \(\Delta T=1\) for ViT |
| Foreground vs. background analysis | Central vision reduces the importance of background features (validated via ImageNet-9 experiments) |
| Fixation vs. saccade | Temporal contrastive learning during fixation (small temporal window) distills the richest semantic information |
| Object co-occurrence CKA | The bio-inspired model shows higher CKA alignment with GloVe co-occurrence embeddings |
| Training epochs | A single epoch approaches saturation (second epoch yields only +0.5%), as Ego4D is highly redundant at 5 fps |
ImageNet-9 Foreground/Background Analysis
| Model | Normal Accuracy | Change w/o Background | Change w/o Foreground |
|---|---|---|---|
| Frames Learning (full-field) | 75% | −15% | −5% |
| Bio-inspired (central + slowness) | 80% | −10% | −20% |
→ Demonstrates that central vision causes the model to rely more on foreground objects than background — consistent with human visual processing.
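A hedged sketch of how such a foreground/background analysis can be run: freeze the encoder, fit a linear probe on the standard ImageNet-9 images, then measure the accuracy drop on variants with the background or the foreground removed. The directory layout and the specific variant names (`only_fg`, `no_fg`) are assumptions, not details confirmed by the paper.

```python
import torch
from torchvision import datasets, transforms


@torch.no_grad()
def probe_accuracy(encoder, probe, root, device="cuda"):
    """Top-1 accuracy of a frozen encoder + linear probe on one ImageNet-9
    variant stored as an ImageFolder under `root`."""
    tfm = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    loader = torch.utils.data.DataLoader(
        datasets.ImageFolder(root, transform=tfm), batch_size=256)
    correct = total = 0
    for images, labels in loader:
        feats = encoder(images.to(device))
        preds = probe(feats).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


# Hypothetical variant directories; the drops relative to `original`
# quantify reliance on background vs. foreground cues.
# acc_orig  = probe_accuracy(encoder, probe, "imagenet9/original")
# acc_no_bg = probe_accuracy(encoder, probe, "imagenet9/only_fg")  # background removed
# acc_no_fg = probe_accuracy(encoder, probe, "imagenet9/no_fg")    # foreground removed
```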
Performance on Semantic Dimensions (ResNet50)
| Semantic Dimension | Frames Learning | Bio-inspired | Gain |
|---|---|---|---|
| Category Recognition Avg. | 45.65 | 46.94 | +1.29 |
| Fine-Grained Recognition Avg. | 33.84 | 38.42 | +4.58 |
| Instance Recognition Avg. | 59.03 | 67.00 | +7.97 |
| Scene Recognition Places365 | 43.02 | 42.95 | −0.07 |
Highlights & Insights
- Interdisciplinary integration — the paper bridges computational neuroscience (temporal slowness principle and central vision) with SSL, using computational experiments to validate neuroscientific hypotheses.
- Complementarity of central vision and temporal slowness: central vision determines "what to look at" (enhancing foreground object features), while temporal slowness captures "how to associate" (different viewpoints of the same object, co-occurring objects within the same scene).
- Explanation for degraded scene recognition: full-field views contain richer background and spatial layout information favorable to scene recognition; foveal cropping discards this information.
- Implications for embodied AI: robotic visual processing can mimic human vision — processing only the high-resolution foveal region to substantially reduce computational cost.
- Semantic distillation during fixation is a compelling finding — it suggests that human "looking" is not merely information collection; sustained fixation also leverages temporal consistency to learn invariant representations.
Limitations & Future Work
- Absolute performance gains are modest (only +1.29% on category recognition); the primary contribution lies in scientific understanding rather than engineering advancement.
- The gaze prediction model (GLC) introduces errors — genuine human gaze data amounts to only 45 hours, while the remaining 3,600+ hours rely on predicted gaze.
- Experiments primarily employ ResNet50 and ViT-B/16; validation on larger models (e.g., ViT-L) is absent.
- Single-epoch training approaches saturation on Ego4D — though this may stem from data redundancy rather than an inherent limitation of the method.
- No direct comparison is made with other egocentric SSL methods (e.g., EgoVLP, VC-1).
Related Work & Insights
- vs. R3M (Nair et al.): R3M learns slowly varying representations from Ego4D for robotic tasks but uses full-field views; this paper incorporates foveal cropping to further improve object features.
- vs. DINO/MoCo: Standard SSL methods rely on data augmentation (cropping, flipping, color jittering); this paper replaces spatial augmentation with temporal neighbors — a more biologically plausible approach.
- vs. Orhan et al. (2024) egocentric SSL: They train on full-field views without accounting for the distinctive role of central vision.
- vs. VIP (Ma et al.): VIP learns video prediction representations from Ego4D with a focus on temporal progress; this paper focuses on temporal slowness, representing a distinct perspective.
- Inspiration: The combination of central vision and temporal slowness can serve as a data processing strategy for pretraining visual foundation models — requiring no architectural changes, only modifications to data sampling.
Rating
- Novelty: ⭐⭐⭐⭐ Innovative combination of biological inspiration and SSL with scientific merit.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional analysis (categorization, fine-grained, instance, scene, co-occurrence).
- Writing Quality: ⭐⭐⭐⭐ Clear logic and targeted experimental design.
- Value: ⭐⭐⭐⭐ Scientific contribution to understanding human visual learning with practical implications for embodied AI.