Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain¶
Conference: NeurIPS 2025 arXiv: 2505.18361 Code: GitHub Area: Robotics Keywords: Tactile Perception, Convolutional Recurrent Networks, Somatosensory Cortex, Self-Supervised Learning, NeuroAI
TL;DR¶
This paper proposes the Encoder-Attender-Decoder (EAD) framework to systematically explore task-optimized temporal neural networks for tactile processing. It finds that convolutional recurrent networks (ConvRNNs, especially IntersectionRNN) outperform feedforward and state-space models on both tactile object classification and neural alignment with rodent somatosensory cortex. Contrastive self-supervised learning with tactile-specific augmentations achieves neural fitting comparable to supervised learning, providing the first quantitative characterization of the brain's computational mechanisms for touch.
Background & Motivation¶
Tactile perception is central to how animals manipulate objects and understand their environment, yet it remains far less understood in neuroscience than vision and language, and artificial systems lag far behind their biological counterparts in tactile capability. The gap arises on two fronts:
Hardware side: Current biomimetic whisker sensors face severe limitations — hardware complexity grows sharply beyond 18–20 whiskers, simultaneous stimuli such as airflow, direct contact, and inertia cannot be reliably distinguished, and mechanical properties (sensitivity, flexibility, bending angle) differ from biological whiskers. Anthropomorphic robotic hands face similar challenges — suitable artificial skin remains an open problem after four decades. These limitations make it difficult to identify robust tactile processing algorithms on real sensor inputs.
Neuroscience side: Despite extensive experimental characterization of the rodent somatosensory pathway, the underlying neural computations remain poorly understood. A key reason is the lack of computational models — prior work is limited to Zhuang et al. (2017), which used simple recurrent architectures and did not compare against brain data.
The paper's core goal is to systematically evaluate temporal neural network architectures on biomechanically realistic simulated tactile inputs, identifying models that both perform tactile tasks efficiently and best match neural responses in the somatosensory cortex. This directly bears on: (1) understanding the inductive biases of brain tactile processing; and (2) developing biologically inspired tactile perception algorithms for embodied AI.
Method¶
Overall Architecture¶
An Encoder-Attender-Decoder (EAD) parameterization is proposed to systematically explore the space of temporal neural networks:
- Encoder: Processes temporally smoothed force/torque sensor signals using ConvRNN, ResNet, or S4
- Attender: High-level temporal aggregation using Transformer (GPT), Mamba, or no attention
- Decoder: Classification head (supervised) or self-supervised features
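A minimal sketch of how the EAD factorization could be composed in PyTorch. The module interfaces, tensor shapes, and the mean-pooling over time are illustrative assumptions, not the paper's PyTorchTNN API.

```python
import torch
import torch.nn as nn
from typing import Optional

class EAD(nn.Module):
    """Illustrative Encoder-Attender-Decoder wrapper (interfaces are assumptions)."""
    def __init__(self, encoder: nn.Module, attender: Optional[nn.Module], decoder: nn.Module):
        super().__init__()
        self.encoder = encoder    # e.g. a ConvRNN / ResNet / S4 over the sensor stream
        self.attender = attender  # e.g. a small GPT / Mamba block, or None
        self.decoder = decoder    # e.g. a linear classification head or SSL projector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) temporally smoothed force/torque signals
        feats = self.encoder(x)                  # (batch, time, d_model) per-step features
        if self.attender is not None:
            feats = self.attender(feats)         # temporal aggregation across steps
        return self.decoder(feats.mean(dim=1))   # pool over time (an assumption), then decode
```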
Key Designs¶
- Biomimetic Tactile Dataset Generation
The WHISKiT Physics simulator (the first 3D full rodent whisker array simulation) is used to generate a high-variability tactile dataset:
- The rat whisker array is adapted to a mouse configuration of 30 whiskers arranged in a 5×7 grid
- Force clipping range: ±1000 mN (±1 N), within biologically plausible bounds
- Diverse scanning augmentations (speed, height, rotation, distance variation) applied across 9,981 ShapeNet objects (117 classes)
- Two dataset versions: high-variability/low-fidelity (288 augmentations, 110 Hz) and low-variability/high-fidelity (16 augmentations, 1000 Hz)
- 22 time steps extracted, corresponding to the rodent whisking frequency of 20 Hz
Design Motivation: The high-variability dataset strongly constrains learned representations, simulating evolutionary pressure ("Contravariance Principle") and enabling differentiation among architectures within a constrained search space.
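For concreteness, a minimal sketch of what a single simulated sweep might look like as a tensor under the dataset description above; the per-whisker channel count of 6 (3 force + 3 torque components) and the exact layout are assumptions, not the paper's stated format.

```python
import torch

# Assumed layout: (time, whiskers, channels); the channel count of 6 is an assumption.
T, WHISKERS, CHANNELS = 22, 30, 6

def clip_sweep(raw: torch.Tensor) -> torch.Tensor:
    """Clip a raw sweep to +/- 1 N. The paper states this clipping range for forces;
    it is applied uniformly to all channels here only for simplicity."""
    return raw.clamp(min=-1.0, max=1.0)

sweep = clip_sweep(torch.randn(T, WHISKERS, CHANNELS))
print(sweep.shape)  # torch.Size([22, 30, 6])
```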
- EAD Architecture Search
The search space is designed around the properties of tactile signals:
- Temporally smooth force/torque signals → Encoder layers suited to convolutional and recurrent mechanisms (local temporal integration)
- High-level aggregation must span irregular time scales → Attender layers suited to attention mechanisms (dynamic time-step weighting)
A key implementation detail for ConvRNN encoders is temporal unrolling: each time step corresponds to a single feedforward pass with state propagation to the next step, rather than treating an entire feedforward pass as one recurrent time step. This parallels the sequential processing of stimuli across cortical layers in biological systems.
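A minimal sketch of this unrolling scheme: every time step drives one feedforward pass through the stacked recurrent layers, with per-layer hidden state carried to the next step. The cell interface (an `(output, new_state)` return, `None` state on the first step) is an assumption rather than PyTorchTNN's actual API.

```python
import torch
import torch.nn as nn

def unroll_convrnn(cells: list, x: torch.Tensor) -> torch.Tensor:
    """Unroll stacked recurrent cells over time; x is (batch, time, features).
    Each step is one feedforward pass through all layers, and each layer's
    hidden state is propagated to the next time step."""
    states = [None] * len(cells)   # cells are assumed to initialize state when given None
    outputs = []
    for t in range(x.shape[1]):
        inp = x[:, t]                                   # current sensor frame
        for i, cell in enumerate(cells):
            inp, states[i] = cell(inp, states[i])       # assumed (output, new_state) signature
        outputs.append(inp)
    return torch.stack(outputs, dim=1)                  # (batch, time, features)
```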
ConvRNN variants explored include: UGRNN, IntersectionRNN, LSTM, and GRU. The IntersectionRNN update rule is:
\(s_t^\ell = p_t^\ell \circ s_{t-1}^\ell + (1-p_t^\ell) \circ m_t^\ell\)

\(h_t^\ell = y_t^\ell \circ x_t^\ell + (1-y_t^\ell) \circ n_t^\ell\)
where \(m_t^\ell\) and \(n_t^\ell\) are candidate activations computed from the current input and previous state, gate \(p\) controls state memory, and gate \(y\) determines how the input is mixed with the transformed signal.
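For reference, a hedged sketch of a fully connected cell implementing the update rule above; the paper's encoder uses convolutional operations inside the cell, and the particular gate/candidate parameterizations shown here are assumptions.

```python
import torch
import torch.nn as nn

class IntersectionCell(nn.Module):
    """Illustrative IntersectionRNN-style cell following the update rule above.
    Gates p, y and candidates m, n are computed from [x_t, s_{t-1}] (an assumption)."""
    def __init__(self, dim: int):
        super().__init__()
        self.p_gate = nn.Linear(2 * dim, dim)  # state-memory gate
        self.y_gate = nn.Linear(2 * dim, dim)  # input-mixing gate
        self.m_cand = nn.Linear(2 * dim, dim)  # candidate recurrent state
        self.n_cand = nn.Linear(2 * dim, dim)  # candidate transformed output

    def forward(self, x_t: torch.Tensor, s_prev: torch.Tensor):
        z = torch.cat([x_t, s_prev], dim=-1)
        p = torch.sigmoid(self.p_gate(z))
        y = torch.sigmoid(self.y_gate(z))
        m = torch.tanh(self.m_cand(z))
        n = torch.tanh(self.n_cand(z))
        s_t = p * s_prev + (1 - p) * m   # recurrent state update
        h_t = y * x_t + (1 - y) * n      # output mixes input with transformed signal
        return h_t, s_t
```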
- Tactile-Specific Self-Supervised Learning Augmentations
Augmentation strategies specifically designed for force/torque time-series data (a code sketch follows below):
- Vertical flip: simulates flipping the whisker array upside down
- Horizontal flip: simulates left-right flipping
- Rotation: simulates rotation of the whisker array
- Temporal reversal: simulates reversal of motion direction
Conventional image augmentations (color jitter, grayscale) cause training failure, demonstrating that augmentation strategies must be modality-appropriate. Self-supervised methods include SimCLR (contrastive), SimSiam (non-contrastive), and autoencoders.
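A rough sketch of these tactile augmentations, assuming sweeps are laid out as (time, rows, cols, channels); the exact tensor layout, the use of a 180° rotation to preserve the grid shape, and the omission of force/torque sign handling under flips are all simplifying assumptions.

```python
import torch

def augment_sweep(x: torch.Tensor) -> torch.Tensor:
    """Randomly apply one tactile augmentation to a sweep shaped
    (time, rows, cols, channels). Sign conventions for force/torque
    components under spatial flips are ignored here for simplicity."""
    choice = torch.randint(0, 4, (1,)).item()
    if choice == 0:
        return torch.flip(x, dims=[1])              # vertical flip of the whisker array
    if choice == 1:
        return torch.flip(x, dims=[2])              # horizontal (left-right) flip
    if choice == 2:
        return torch.rot90(x, k=2, dims=[1, 2])     # 180° rotation keeps the grid shape
    return torch.flip(x, dims=[0])                  # temporal reversal of the sweep
```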
Loss & Training¶
- Supervised learning: batch size = 256, 100 epochs, checkpoint saved at highest validation accuracy
- SSL pretraining: SimCLR/AE batch size = 256, SimSiam batch size = 1024, 100 epochs; linear probing stage with frozen encoder for an additional 100 epochs
- Optional LayerNorm added to ConvRNNs to stabilize training
- Neural alignment evaluation: Representational Similarity Analysis (RSA), a parameter-free method comparing the pairwise dissimilarity structures of model and neural population responses (see the sketch after this list)
- Only neurons with split-half internal consistency (Spearman-Brown corrected) > 0.5 are retained
- Noise-corrected RSA Pearson \(r\) used as the alignment score
- Compared against animal-to-animal (a2a) consistency as a baseline
- Neural data from Rodgers (2022): 11 mice, 999 neural units, barrel cortex L2/3–L6
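A minimal sketch of the RSA comparison: build representational dissimilarity matrices (RDMs) for model and neural responses to the same stimuli and correlate their upper triangles. The split-half reliability filtering and noise correction described above are omitted, so this is not the paper's exact pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def rdm(responses: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix. responses is (stimuli, units);
    dissimilarity is 1 - Pearson correlation between stimulus response patterns."""
    return 1.0 - np.corrcoef(responses)

def rsa_score(model_resp: np.ndarray, neural_resp: np.ndarray) -> float:
    """Pearson correlation between the upper triangles of the two RDMs
    (noise correction omitted in this sketch)."""
    iu = np.triu_indices(model_resp.shape[0], k=1)
    return pearsonr(rdm(model_resp)[iu], rdm(neural_resp)[iu])[0]
```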
Key Experimental Results¶
Main Results¶
Figures 3 and 4b report classification performance and neural fitting across 64 models:
| Model Architecture (Encoder + Attender) | Training | Top-5 Classification Accuracy | Neural Fit (noise-corrected RSA \(r\)) |
|---|---|---|---|
| IntersectionRNN + GPT | Supervised | Highest | ~1.1 |
| IntersectionRNN + None | SimCLR (tactile augmentations) | Moderate | ~1.2 (Highest) |
| UGRNN + GPT | Supervised | High | ~0.9 |
| ResNet + GPT | Supervised | Below ConvRNN | ~0.7 |
| S4 + GPT | Supervised | Training failure | — |
| Raw sensor input | — | — | 0.46 |
| Animal-to-animal (a2a) | — | — | 0.18 (average) |
Ablation Study¶
| Ablation Dimension | Configuration | Key Findings |
|---|---|---|
| Encoder type | ConvRNN vs. ResNet vs. S4 | ConvRNN significantly outperforms feedforward and SSM models |
| Recurrent unit | IntersectionRNN vs. LSTM vs. GRU vs. UGRNN | IntersectionRNN consistently best on both classification and neural fitting |
| Attender type | GPT vs. Mamba vs. None | GPT yields modest but consistent improvement |
| Augmentation type | Tactile augmentations vs. image augmentations | Image augmentations cause training failure; tactile augmentations are critical for both task performance and neural fitting |
| SSL method | SimCLR vs. SimSiam vs. AE | SimCLR achieves the best results, matching or exceeding supervised neural fitting |
| Supervised vs. SSL | Same architectures, different training objectives | ~10× gap in classification accuracy, yet neural fitting is comparable; the brain may prioritize broad, task-agnostic sensory representations |
Key Findings¶
- ConvRNN encoders (especially IntersectionRNN) consistently outperform ResNet and S4 on both tactile classification and neural alignment.
- The best models saturate explainable neural variance — exceeding the animal-to-animal consistency baseline, passing a "NeuroAI Turing Test."
- A clear linear relationship exists between supervised classification performance and neural fitting (\(r = 0.59\)).
- SimCLR self-supervised models achieve neural fitting comparable to or slightly exceeding supervised models, despite classification accuracy being an order of magnitude lower — suggesting that somatosensory cortex may prioritize general tactile representations over specialized classification features.
- Tactile-specific SSL augmentations (vertical/horizontal flip, rotation, temporal reversal) are essential for establishing biologically accurate tactile representations.
- The modest improvement from GPT-based Attenders hints at the possible presence of attention-like mechanisms in somatosensory cortex.
Highlights & Insights¶
- First quantitative characterization of inductive biases in tactile processing: Demonstrates that nonlinear recurrent processing is a key computational feature of rodent somatosensory cortex.
- Engineering value of the EAD parameterization: Systematically combines encoding, attention, and decoding modules to efficiently explore a large architecture search space.
- Self-supervision as an ecologically valid, label-free proxy: Classifying 117 ShapeNet object categories is not ecologically relevant for rodents, but SSL bypasses this issue and serves as a more principled surrogate task.
- Strong generalization: Models saturate the explainable neural variance even under entirely different experimental conditions (trained on passive contact, evaluated on active whisking) and on novel objects.
- PyTorchTNN open-source library: Supports large-scale exploration of temporal neural networks.
Limitations & Future Work¶
- Existing tactile neural datasets are severely limited in stimulus diversity — only 6 stimuli (concave/convex × 3 distances), which is why the noise ceiling is relatively low.
- The average animal-to-animal consistency (0.18) is far below the maximum pairwise consistency (1.34), suggesting that larger datasets with more animals and stimuli are needed.
- Multimodal fusion (tactile + proprioception + vision) has not been explored; naive concatenation at the final layer is insufficient.
- A gap remains between simulated data and real whisker sensing — a train-test mismatch exists between passive scanning and active whisking conditions.
- Model scale is relatively small; whether observed trends persist in larger models has not been verified.
Related Work & Insights¶
- Extends the success of the NeuroAI "goal-driven modeling" paradigm — previously validated in primate visual, auditory, motor, memory, and language areas.
- An interesting contrast with mouse visual cortex studies (Nayebi et al., 2023): in the tactile domain, SSL matches supervised neural fitting, whereas in vision SSL substantially outperforms supervised methods.
- Directly answers and extends the open questions raised by Zhuang et al. (2017).
- Future direction: The success of IntersectionRNN suggests that its input-state "cross-gating" may correspond to selective modulation in biological somatosensory pathways.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First complete application of goal-driven modeling to the tactile system; the EAD framework is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ Systematic evaluation across 64 models, though limited by the stimulus diversity of the neural dataset.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem motivation is clear; results and interpretations are tightly integrated.
- Value: ⭐⭐⭐⭐⭐ Important implications for both NeuroAI and tactile robotics; bridges the gap from brain to machine.