
Towards Reinforcement Learning from Neural Feedback: Mapping fNIRS Signals to Agent Performance

Conference: AAAI 2026 arXiv: 2511.12844 Code: Dataset (Public) / Classification Code Area: Human-Computer Interaction / Reinforcement Learning / Brain-Computer Interface Keywords: RLNF, fNIRS, Neural Feedback, Implicit Signals, Cross-Subject Generalization, NEURO-LOOP

TL;DR

This paper proposes the NEURO-LOOP framework, which leverages fNIRS (functional near-infrared spectroscopy) brain signals as implicit neural feedback for evaluating RL agent performance. The authors release an fNIRS dataset covering 25 subjects across 3 domains × 2 interaction modes (6 conditions in total). Classification F1 reaches 67% (binary) / 46% (multi-class), with cross-subject fine-tuning yielding relative improvements of 17% and 41% respectively, laying the groundwork for Reinforcement Learning from Neural Feedback (RLNF).

Background & Motivation

RLHF has become a critical methodology for training and aligning advanced AI systems, yet existing feedback mechanisms suffer from fundamental limitations:

  • Explicit feedback (preference labels, ratings, demonstrations): Requires active engagement and cognitive effort, leading to fatigue and superficial feedback.
  • Expression/gesture-based implicit feedback: Still requires conscious physical adjustment and may feel unnatural.
  • EEG signals (e.g., Error-related Potentials): Transient signals are susceptible to motion artifacts, with limited temporal resolution.
  • Cognitive load problem: Richer feedback typically imposes higher cognitive burden on users.

Core Problem: Can brain signals recorded while humans passively observe agent behavior reliably map to agent performance levels? If so, can cross-subject generalization be achieved to reduce deployment costs?

Advantages of fNIRS: Non-invasive, portable, tolerant to physical motion, superior spatial resolution compared to EEG, and suitable for extended naturalistic scenarios. It measures hemodynamic changes in the prefrontal cortex (PFC), which is closely associated with reward processing and cognitive evaluation.

Method

Overall Architecture

The NEURO-LOOP pipeline:

  1. Experimental design → 25 subjects × 3 domains × passive/active conditions
  2. fNIRS data acquisition → ISS OxiplexTS device measuring bilateral PFC hemodynamic changes
  3. Preprocessing → Motion artifact removal, bandpass filtering (0.001–0.2 Hz), short-channel regression
  4. Feature extraction → Sliding window (5–7 s window, 1–2 s stride); 6 statistical features × 8 channels = 48-dimensional feature vector per window
  5. Classification/Regression → SVM / KNN / Random Forest / MLP
  6. Transfer learning → Multi-subject pretraining + fine-tuning on 20% of target-subject data
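A minimal sketch of the windowed feature extraction (step 4), assuming the 5.2 Hz sampling rate, 8 channels, and 6 statistical features stated in the paper; the specific 5 s window and 1 s stride below are picked from the reported 5–7 s / 1–2 s ranges and are not the authors' exact choice:

```python
import numpy as np
from scipy import stats

FS = 5.2                     # fNIRS sampling rate (Hz), per the paper
WIN_S, STRIDE_S = 5.0, 1.0   # illustrative values from the 5-7 s / 1-2 s ranges

def window_features(signal: np.ndarray) -> np.ndarray:
    """6 statistical features for one channel window:
    mean, std, slope, intercept, skewness, kurtosis."""
    t = np.arange(len(signal)) / FS
    slope, intercept = np.polyfit(t, signal, 1)  # linear trend of the window
    return np.array([
        signal.mean(), signal.std(), slope, intercept,
        stats.skew(signal), stats.kurtosis(signal),
    ])

def extract_windows(recording: np.ndarray) -> np.ndarray:
    """recording: (n_samples, 8 channels) -> (n_windows, 48) feature matrix."""
    win, stride = int(WIN_S * FS), int(STRIDE_S * FS)
    feats = []
    for start in range(0, recording.shape[0] - win + 1, stride):
        chunk = recording[start:start + win]          # (win, 8)
        feats.append(np.concatenate(
            [window_features(chunk[:, ch]) for ch in range(chunk.shape[1])]))
    return np.array(feats)
```

Each window thus yields the 6 features × 8 channels = 48-dimensional vector described above.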

Key Designs

  1. Three-Domain Experimental Design:

    • Robot Fetch and Place: Continuous action space, 6 DoF + gripper; optimal / suboptimal (wrong target) / worst (random jitter + throwing block).
    • Lunar Lander: Discrete action space; optimal (landing between flags) / suboptimal (off-center) / worst (crash).
    • Flappy Bird: Discrete action space; optimal (15+ pipes) / suboptimal (5–15) / worst (≤5).
    • Each domain includes both passive (observing agent) and active (human control) conditions.
  2. Multi-Policy Action Agreement Labeling System:

    • \(K=10\) near-optimal policies are used to evaluate the agent's chosen actions.
    • Discrete actions: KL divergence measures error; continuous actions: Euclidean distance.
    • The average error across \(K\) policies serves as the continuous performance label, mitigating mislabeling from single-policy dependence.
    • Binary label \(B_t \in \{0,1\}\) (optimal/suboptimal); ternary label \(V_t \in \{0,1,2\}\) (optimal/suboptimal/worst).
  3. Three Training Paradigms:

    • Single-subject: Train and validate on individual subject data; F1 = 0.79 (binary) / 0.75 (multi-class).
    • Multi-subject: Joint training across subjects; cross-subject generalization is challenging.
    • Fine-tuned: Multi-subject pretraining + fine-tuning on 20% target-subject data; significant improvement observed.
  4. fNIRS Feature Engineering:

    • Dual-slope frequency-domain probes suppress superficial tissue and motion artifacts.
    • Dual wavelengths at 690 nm / 830 nm modulated at 110 MHz to measure oxy- and deoxyhemoglobin.
    • 5.2 Hz sampling rate; third-order bandpass filtering.
    • 6 statistical features per channel: mean, standard deviation, slope, intercept, skewness, kurtosis.
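The Multi-Policy Action Agreement labeling (design 2) can be sketched as follows. This is a hedged reconstruction from the description above: KL divergence for discrete action distributions, Euclidean distance for continuous actions, and an average over the \(K\) policies as the continuous label. The discretization thresholds `t1`/`t2` are illustrative placeholders, not the paper's cutoffs:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def agreement_error(agent_action, policy_actions, discrete=True):
    """Average error of the agent's action against K near-optimal policies.
    Discrete: KL divergence between action distributions.
    Continuous: Euclidean distance between action vectors."""
    if discrete:
        errs = [kl_divergence(agent_action, pa) for pa in policy_actions]
    else:
        errs = [float(np.linalg.norm(np.asarray(agent_action) - np.asarray(pa)))
                for pa in policy_actions]
    return sum(errs) / len(errs)   # continuous performance label

def discretize(error, t1, t2):
    """Binary B_t and ternary V_t labels; thresholds t1 < t2 are hypothetical."""
    binary = 0 if error <= t1 else 1                     # optimal / suboptimal
    ternary = 0 if error <= t1 else (1 if error <= t2 else 2)  # + worst
    return binary, ternary
```

Averaging over \(K = 10\) policies is what mitigates mislabeling when several distinct actions are equally near-optimal.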

Loss & Training

  • Classification: Standard cross-entropy loss.
  • Regression: MSE loss.
  • Multiple classifiers compared: SVM, KNN, Random Forest, MLP.
  • Data balancing: Random undersampling of the majority class; 60-20-20 train-test-validation split.
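A minimal scikit-learn sketch of this training setup (random undersampling, 60-20-20 split, the four compared classifiers); hyperparameters here are library defaults, not the paper's reported settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def undersample(X, y, seed=0):
    """Randomly undersample every class down to the minority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n, replace=False) for c in classes])
    return X[idx], y[idx]

def evaluate(X, y):
    """Balance, split 60-20-20 (train/test/validation), compare classifiers."""
    Xb, yb = undersample(X, y)
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        Xb, yb, test_size=0.4, random_state=0, stratify=yb)
    X_te, X_va, y_te, y_va = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)
    models = {
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(random_state=0),
        "MLP": MLPClassifier(max_iter=500, random_state=0),
    }
    return {name: f1_score(y_va, m.fit(X_tr, y_tr).predict(X_va), average="macro")
            for name, m in models.items()}
```

The fine-tuning paradigm would then warm-start the MLP on pooled multi-subject data before refitting on 20% of the target subject's windows.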

Key Experimental Results

Main Results: Multi-Subject Classification Performance (MLP, F1)

| Condition | Binary: Multi-Sub | Binary: Cross-Sub | Binary: Fine-tuned | Multi-class: Multi-Sub | Multi-class: Cross-Sub | Multi-class: Fine-tuned |
|---|---|---|---|---|---|---|
| Robot Passive | 0.72 | 0.54 | 0.57 | 0.50 | 0.33 | 0.41 |
| Robot Active | 0.67 | 0.53 | 0.56 | 0.47 | 0.29 | 0.35 |
| Lunar Passive | 0.61 | 0.45 | 0.56 | 0.46 | 0.26 | 0.36 |
| Lunar Active | 0.62 | 0.52 | 0.54 | 0.40 | 0.27 | 0.42 |
| Flappy Passive | 0.67 | 0.44 | 0.52 | 0.35 | 0.26 | 0.39 |
| Flappy Active | 0.66 | 0.46 | 0.57 | 0.51 | 0.31 | 0.51 |
| Average | 0.66 | 0.49 | 0.55 | 0.45 | 0.29 | 0.41 |
| Fine-tune gain | | | +17% | | | +41% |

Regression Performance & NASA-TLX

| Metric | Passive Condition | Active Condition |
|---|---|---|
| Regression R² (average) | 0.77 | 0.81 |
| Single-subject binary F1 | 0.79 (overall) | |
| Single-subject multi-class F1 | 0.75 (overall) | |
| NASA-TLX cognitive load | Low | High |
| Subjective experience | "Boring" | Effortful |

Key Findings

  • Passive observation yields meaningful neural feedback: Binary F1 = 0.66 under the passive condition with low cognitive load—users require virtually no additional effort.
  • Cross-subject transfer is the primary challenge: Zero-shot cross-subject performance approaches chance (binary ~0.49), but fine-tuning on only 20% of target-subject data substantially recovers performance.
  • Robot condition yields the best results: Likely because robotic actions are more visually interpretable, sustaining greater participant attention.
  • Multi-class classification is substantially harder than binary: F1 drops from 0.66 to 0.45 (multi-subject), indicating that fine-grained signals distinguishing "suboptimal vs. worst" are weak.
  • Regression outperforms classification: \(R^2 = 0.77\)–\(0.81\) suggests that fNIRS signals encode continuous evaluative information beyond categorical boundaries.
  • Active condition yields better regression but at higher cognitive cost: A trade-off exists between signal quality and user experience.
  • NASA-TLX results show that Lunar Active imposes the highest cognitive load, while passive conditions are consistently low—supporting RLNF's goal of minimal intrusiveness.

Highlights & Insights

  • First experimental validation of the RLNF (Reinforcement Learning from Neural Feedback) concept: Advances the intersection of BCI and RLHF in a new direction.
  • Release of a large-scale fNIRS-RL dataset: 25 subjects across 3 domains × 2 interaction modes (6 conditions), with synchronized neural data and agent transition variables, addressing a critical data gap in the field.
  • The Multi-Policy Action Agreement system resolves labeling ambiguity arising from multiple equally optimal trajectories in RL.
  • The effectiveness of fine-tuning (41% gain with only 20% target data) suggests the existence of shared cross-individual neural patterns, making personalized calibration cost-efficient.
  • The systematic passive vs. active comparison provides a baseline for interaction design in future RLNF systems.

Limitations & Future Work

  • Classification performance is not yet sufficient for direct use as a reward signal in RLHF (F1 = 0.67 implies approximately one-third of feedback is erroneous).
  • Cross-subject generalization remains an open challenge—zero-shot transfer is largely ineffective.
  • Partial dataset imbalance: Some conditions have shorter episodes with insufficient samples.
  • The inherent 5–7 second hemodynamic delay in fNIRS limits the temporal granularity of real-time feedback.
  • Only traditional ML classifiers are employed; deep learning approaches (e.g., Transformers for time series) are not explored.
  • No closed-loop validation has been performed—classification outputs have not been integrated into an actual RL training pipeline.
  • The natural next step is to integrate trained classification/regression models into an RLHF pipeline for genuine closed-loop RLNF.
  • Multimodal fusion (fNIRS + EEG + eye tracking + facial expression) may substantially improve classification performance.
  • Personalized calibration strategies warrant further investigation—the efficiency of 41% gain from 20% data is highly promising.
  • fNIRS devices are rapidly miniaturizing (consumer headbands have emerged), increasing practical deployment feasibility.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (RLNF concept is original; first systematic study of fNIRS in RL)
  • Technical Depth: ⭐⭐⭐ (Methodology follows a standard ML pipeline; primary contributions lie in experimental design and dataset construction)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (25 subjects × 6 conditions + 3 training paradigms + classification + regression + NASA-TLX)
  • Practical Value: ⭐⭐⭐ (Currently a proof-of-concept; substantial gap remains before real-world application)

Weaknesses

  • Multi-class accuracy is limited (46%), insufficient for deployment as a reward signal.
  • fNIRS temporal resolution is inherently limited (hemodynamic response delay of ~5–6 seconds).
  • Experiments are conducted in controlled settings; real-world environments introduce greater noise.
  • Signal discriminability is validated, but the loop is not closed (i.e., signals are not used in actual RL training).
  • The sample size (25 subjects) is relatively small.

Comparison with Related Approaches

  • vs. RLHF: Implicit neural signals replace explicit human feedback; no active annotation is required.
  • vs. EEG-based BCI: fNIRS is more portable and less susceptible to motion artifacts.
  • vs. Affective computing approaches: Infers cognitive evaluation directly from neural activity rather than from proxies for emotional state.

Inspiration & Connections

This work introduces a novel signal channel for building human-in-the-loop RL systems. Should classification accuracy improve further, the approach could directly serve as a reward shaping signal. It has potential applications in tasks requiring human supervision with high annotation costs, such as safety evaluation in autonomous driving.

Overall Rating: ⭐⭐⭐ (3/5)

An exploratory contribution with a novel concept but limited current classification accuracy. The dataset is a valuable resource; however, a substantial gap remains before a practical RLNF system can be realized.