
Incremental Sequence Classification with Temporal Consistency

Conference: NeurIPS 2025 · arXiv: 2505.16548 · Code: None · Area: Reinforcement Learning / NLP

Keywords: incremental classification, temporal consistency, temporal-difference learning, LLM verification, sequence classification

TL;DR

This paper imports the temporal-difference (TD) learning idea from reinforcement learning into sequence classification, proposing the TC-\(\lambda\) loss function. By requiring the predictive distributions at adjacent time steps to satisfy a temporal consistency condition, it trains incremental sequence classifiers that outperform standard cross-entropy methods on both text classification and LLM verification tasks.

Background & Motivation

Background: Sequence classification is a fundamental problem in machine learning. Conventional approaches predict only on complete sequences, but many scenarios require continuously updated predictions as a sequence unfolds incrementally—for example, waiting for a complete sequence is costly in medical or financial settings, and recent LLM verification work demands early detection of whether a generation is correct.

Limitations of Prior Work: The standard approach (Direct Cross-Entropy, DCE) trains by directly comparing each prefix prediction against the final label. However, for early prefixes \(\mathbf{x}_{\leq t}\) (where \(t \ll T\)), the final label constitutes a very noisy training signal—the model must simultaneously handle uncertainty about future sequence tokens and uncertainty about the final label.

Key Challenge: Using distant final labels as training targets for every intermediate time step ignores the temporal structure of sequences—the predictive distributions at adjacent time steps should satisfy a consistency relation.

Goal: Develop improved loss functions for training incremental sequence classifiers, with particular emphasis on improving data efficiency and accuracy for prefix predictions.

Key Insight: The paper observes a key identity: for any calibrated classifier, \(p(y|s_t) = \mathbb{E}_{p(s_{t+1}|s_t)}[p(y|s_{t+1})]\)—that is, the class distribution at the current step equals the expectation of the class distribution at the next step. This is precisely the classification analogue of the Bellman equation in TD learning.
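
This identity is easy to verify numerically. The sketch below builds a small absorbing Markov chain (the transition matrix is an arbitrary made-up example, not from the paper), solves for the exact absorption probabilities \(p(y|s)\), and checks that they satisfy the one-step consistency condition.

```python
import numpy as np

# Toy Markov chain: 3 transient states, 2 absorbing label states y ∈ {0, 1}.
# P[i, j] = probability of moving from transient state i to state j;
# columns 3 and 4 are the absorbing label states (numbers are made up).
P = np.array([
    [0.0, 0.5, 0.2, 0.2, 0.1],   # state 0
    [0.1, 0.0, 0.4, 0.3, 0.2],   # state 1
    [0.0, 0.2, 0.0, 0.1, 0.7],   # state 2
])

# Exact class distribution p(y|s) for each transient state, from the
# absorption equations p = P_tt @ p + P_ta.
P_tt, P_ta = P[:, :3], P[:, 3:]
p_exact = np.linalg.solve(np.eye(3) - P_tt, P_ta)   # shape (3, 2)

# Temporal consistency: p(y|s_t) must equal the one-step expectation of
# p(y|s_{t+1}), where absorbing states carry a one-hot distribution δ_y.
full = np.vstack([p_exact, np.eye(2)])              # append δ_y rows
lhs, rhs = p_exact, P @ full
print(np.allclose(lhs, rhs))   # True: the Bellman-style identity holds
```

Any calibrated predictor on this chain must satisfy the same relation; TC training enforces it as a constraint rather than assuming it.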

Core Idea: Use the predictive distribution at the next time step as a "soft target" for the current step, replacing distant hard labels (one-hot vectors), thereby enforcing temporal consistency constraints.

Method

Overall Architecture

Sequence classification is framed as an absorption probability estimation problem on a Markov chain. Given a labeled sequence dataset, a parameterized classifier \(p_\theta(y|s_t)\) is trained to accurately predict the final class at every intermediate state. The paper proposes the TC-\(\lambda\) family of loss functions, which interpolates between pure temporal consistency training (\(\lambda=0\)) and pure direct cross-entropy (\(\lambda=1\)) by tuning \(\lambda \in [0,1]\).

Key Designs

  1. Temporal Consistency Condition:

    • Function: Exploits the Markov property to establish constraints between the predictive distributions at adjacent time steps.
    • Design Motivation: If \(p(y|s_t, \ldots, s_1) = p(y|s_t)\) (Markov property), then it necessarily holds that \(p(y|s_t) = \mathbb{E}_{p(s_{t+1}|s_t)}[p(y|s_{t+1})]\)
    • Key Significance: This implies that the class distribution at time step \(t\) equals the expected class distribution at time step \(t+1\)—a calibrated predictor should not fluctuate sharply between adjacent steps.
    • Note: For text sequences, setting \(s_t = \mathbf{x}_{\leq t}\) trivially satisfies the Markov property.
  2. TC Loss Function:

    • Function: Uses the next step's predictive distribution as the training target for the current step.
    • Mechanism: \(\ell_{\text{TC}}(\theta; \theta', \mathbf{s}, y) = H[\delta_y \| \mathbf{p}_\theta(\cdot|s_T)] + \sum_{t=1}^{T-1} H[\mathbf{p}_{\theta'}(\cdot|s_{t+1}) \| \mathbf{p}_\theta(\cdot|s_t)]\)
    • Explanation: The final step is trained with the ground-truth label (hard target); intermediate steps use the reference model \(\theta'\)'s prediction at the next step as a soft target.
    • Optimization: Given \(\theta'\) (parameters from the previous iteration), \(\theta\) is optimized by minimizing the TC loss; training proceeds by alternating iterations.
    • Distinction from DCE: The essential difference is replacing the distant hard label \(\delta_y\) with the nearby soft target \(\mathbf{p}_{\theta'}(\cdot|s_{t+1})\).
  3. TC-\(\lambda\) Generalized Loss Function:

    • Function: Interpolates between TC and DCE, controlling the effective look-ahead distance.
    • Mechanism: \(\ell_{\text{TC-}\lambda}(\theta; \theta', \mathbf{s}, y) = \sum_{t=1}^{T} H[\mathbf{z}_t \| \mathbf{p}_\theta(\cdot|s_t)]\) where \(\mathbf{z}_t = \lambda^{T-t}\delta_y + (1-\lambda)\sum_{k=1}^{T-t}\lambda^{k-1}\mathbf{p}_{\theta'}(\cdot|s_{t+k})\)
    • Interpretation of \(\lambda\): The weight on the \(k\)-step-ahead soft target decays geometrically, giving an effective (mean) look-ahead of \(1/(1-\lambda)\) steps. \(\lambda=0\) reduces to TC (one-step look-ahead); \(\lambda=1\) reduces to DCE (final label only).
    • Optimal \(\lambda\): Experiments find that a look-ahead of 5–50 tokens performs best, corresponding to \(\lambda \in [0.8, 0.98]\).
  4. Theoretical Analysis (Finite-State Setting):

    • Function: Proves convergence, consistency, and data efficiency advantages of TC in tabular models.
    • Key Result (Proposition 3): In a \(T\)-layer, \(W\)-state Markov chain, the mean squared error of TC's indirect estimator is asymptotically \(W\) times smaller than that of DCE's direct estimator: \(\mathbb{E}[(\hat{p}^{\text{ind}}_{mk} - p^*_{mk})^2] \,/\, \mathbb{E}[(\hat{p}^{\text{dir}}_{mk} - p^*_{mk})^2] \to 1/W\)
    • Intuition: TC benefits from a "data pooling" effect—optimizing consistency between adjacent states allows the model to leverage trajectory information from different starting points that pass through the same intermediate states.
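
As a concrete illustration, the TC-\(\lambda\) soft targets \(\mathbf{z}_t\) can be computed directly from the formula above. The numpy sketch below is my own (function name and shapes are illustrative, not from the paper's code, which is not released):

```python
import numpy as np

def tc_lambda_targets(p_ref: np.ndarray, y: int, lam: float) -> np.ndarray:
    """Soft targets z_t = λ^{T-t} δ_y + (1-λ) Σ_{k=1}^{T-t} λ^{k-1} p_ref[t+k].

    p_ref: (T, C) reference-model distributions p_{θ'}(·|s_t), 0-indexed in t.
    Each returned row is a convex combination of the one-hot label δ_y and
    the reference predictions at later steps (weights sum to 1)."""
    T, C = p_ref.shape
    delta_y = np.eye(C)[y]
    z = np.empty_like(p_ref)
    for t in range(T):
        n = T - 1 - t                      # steps remaining to the final state
        z[t] = lam ** n * delta_y          # residual weight on the hard label
        for k in range(1, n + 1):          # geometric weights on soft targets
            z[t] += (1 - lam) * lam ** (k - 1) * p_ref[t + k]
    return z

# λ = 0 recovers TC (bootstrap from the next step); λ = 1 recovers DCE.
p_ref = np.random.default_rng(0).dirichlet(np.ones(4), size=6)  # T=6, C=4
z = tc_lambda_targets(p_ref, y=2, lam=0.9)
```

Training then minimizes the cross-entropy \(H[\mathbf{z}_t \| \mathbf{p}_\theta(\cdot|s_t)]\) against these targets while \(\theta'\) is held fixed, as in the loss definition above.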

Loss & Training

  • A decoder-only transformer (OPT series) is used, with a linear classification head attached to the final hidden representation.
  • \(p_\theta(\cdot|\mathbf{x}_{\leq t}) = \text{softmax}(\mathbf{A}\mathbf{h}_t + \mathbf{b})\)
  • Both the classification head and all transformer parameters are jointly optimized.
  • The loss for each sequence is averaged over all prefixes (rather than summed), ensuring equal contribution from sequences of different lengths.
  • Training overhead is virtually identical to DCE (difference \(< 1\%\)).
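
Under these design choices, the per-sequence objective can be sketched as follows. The numpy stand-in below uses the paper's head \(\text{softmax}(\mathbf{A}\mathbf{h}_t + \mathbf{b})\), but the function name and shapes are illustrative assumptions:

```python
import numpy as np

def prefix_averaged_ce(H: np.ndarray, A: np.ndarray, b: np.ndarray,
                       Z: np.ndarray) -> float:
    """Cross-entropy H[z_t || softmax(A h_t + b)] averaged (not summed)
    over the T prefixes of one sequence, so sequences of different
    lengths contribute equally to the batch loss.

    H: (T, d) hidden states, A: (C, d) head weights, b: (C,) bias,
    Z: (T, C) per-prefix targets (hard one-hot for DCE, soft for TC-λ)."""
    logits = H @ A.T + b                              # (T, C)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(Z * log_p).sum(axis=1).mean())
```

With an all-zero head every prefix gets the uniform distribution, so the loss equals \(\log C\) for any valid targets, a handy sanity check.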

Key Experimental Results

Main Results: Text Classification (OPT-125M)

Accuracy (%) on 4-token and 16-token prefixes and on full sequences:

| Method | ohsumed (4 tok) | ohsumed (16 tok) | ohsumed (all) | newsgroups (4 tok) | newsgroups (all) | imdb (4 tok) | imdb (all) | ag-news (4 tok) | ag-news (all) |
|---|---|---|---|---|---|---|---|---|---|
| Most frequent | 16.0 | 16.0 | 16.0 | 5.3 | 5.3 | 50.0 | 50.0 | 25.0 | 25.0 |
| GPT-4o | 31.5 | 54.0 | 57.5 | 7.5 | 80.4 | 58.0 | 94.3 | 77.4 | 88.3 |
| Last token | 16.7 | 45.0 | 80.6 | 6.5 | 87.9 | 56.6 | 94.7 | 54.4 | 94.8 |
| DCE | 30.5 | 65.5 | 81.1 | 27.7 | 89.0 | 63.5 | 94.4 | 80.0 | 94.8 |
| LSTD(\(\lambda\)) | 32.7 | 64.9 | 78.0 | 26.2 | 87.8 | 64.6 | 94.7 | 81.1 | 94.9 |
| TC-\(\lambda\) (ours) | 33.7 | 68.3 | 81.8 | 33.4 | 88.5 | 64.7 | 94.9 | 81.4 | 95.0 |

Ablation Study

| Ablation Dimension | Finding |
|---|---|
| Model scale (125M → 350M → 1.3B) | DCE requires approximately 10× larger models to match TC-\(\lambda\) performance |
| Choice of \(\lambda\) | \(\lambda=1\) (i.e., DCE) is never optimal; the optimal look-ahead is 5–50 tokens |
| Cross-entropy vs. squared loss | Cross-entropy significantly outperforms squared loss on multi-class tasks; the gap is small for binary classification |
| Incremental training vs. training on complete sequences only | Incremental training (DCE/TC-\(\lambda\)) improves full-sequence classification accuracy |
| Temporal consistency metric | Models trained with TC-\(\lambda\) exhibit significantly lower KL divergence between adjacent steps |
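
The temporal-consistency metric in the last row can be measured, for instance, as the mean KL divergence between predictive distributions at adjacent steps. A minimal sketch of such a metric (my own formulation, not necessarily the paper's exact definition):

```python
import numpy as np

def mean_adjacent_kl(p: np.ndarray, eps: float = 1e-12) -> float:
    """Average KL(p_t || p_{t+1}) over adjacent steps of one sequence.

    p: (T, C) array of per-step predictive distributions. Lower values
    mean smoother, more temporally consistent predictions."""
    a, b = p[:-1] + eps, p[1:] + eps   # eps guards against log(0)
    return float((a * np.log(a / b)).sum(axis=1).mean())
```

A model whose per-step distributions barely move between adjacent tokens scores near zero; a model that fluctuates sharply scores high.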

LLM Verification Experiments (GSM8K)

| Metric | DCE | TC-\(\lambda\) |
|---|---|---|
| ROC AUC @ 8 tokens | ~65% | ~70% |
| ROC AUC @ 64 tokens | ~78% | ~82% |
| Compute savings (boosted best-of-N) | baseline | 23–33% token savings |

Key Findings

  • TC-\(\lambda\) surpasses DCE on prefix classification across all 4 datasets and on full-sequence classification on 3 out of 4 datasets.
  • Most important insight: Improvements in prefix prediction also yield improvements in full-sequence classification accuracy—temporal consistency acts as a regularizer.
  • In LLM verification, observing only 8 tokens is sufficient to distinguish correct from incorrect generations with approximately 70% accuracy.
  • Boosted best-of-N with a TC-\(\lambda\) verifier achieves the same answer accuracy at substantially lower computational cost.

Highlights & Insights

  • Elegant cross-domain connection: The core TD learning idea from RL—replacing final rewards with consistency between adjacent states—is cleanly transferred to classification, yielding a conceptually transparent and practically simple approach.
  • Unified theory and practice: Data efficiency advantages are rigorously proven in the tabular setting, then validated in large-scale LLM experiments.
  • High practical value: The method has direct applications to LLM verification and test-time scaling—earlier detection of generation quality translates directly to computational savings.
  • Minimal implementation overhead: Only the target distribution in the loss function needs to be modified; training cost increases by less than 1%.

Limitations & Future Work

  • Theoretical analysis is rigorous only in the tabular (finite-state) setting; guarantees for parameterized models are heuristic.
  • Experiments are limited to models with \(\leq 1.3\)B parameters; effectiveness at larger scales remains uncertain.
  • LLM verification experiments employ simple boosted best-of-N, without in-depth comparison against more advanced methods such as speculative rejection.
  • \(\lambda\) is a hyperparameter requiring tuning, and its optimal value depends on prefix length.
  • Applications to multimodal sequences (e.g., frame-level video prediction) are unexplored.

Related Work
  • TD Learning (Sutton, 1988): The direct inspiration for this work, extending the core idea of "replacing Monte Carlo with temporal differences" from scalar values to multi-class distributions.
  • Cobbe et al. (2021): Training verifiers to solve math word problems; the present work builds on this by introducing improved verifier training methods.
  • Mudgal et al. (2023): Also applies TD learning to train LLM verifiers, but uses squared loss, which is unsuitable for classification and binary outcomes.
  • Insight: The "bootstrapping" idea from RL—using the model's own predictions as targets—is equally effective in supervised learning, provided the correct invariance constraint is identified (here, temporal consistency).

Rating

  • Novelty: ⭐⭐⭐⭐ The transfer from TD to classification is novel yet natural; the correspondence between TC-\(\lambda\) and TD(\(\lambda\)) is clearly articulated.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, baselines, and metrics, with theoretical validation; model scale is limited.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, the analogy to RL is intuitive, and experimental presentation is convincing.
  • Value: ⭐⭐⭐⭐⭐ The method is simple and general, with direct practical utility for LLM verification and test-time scaling.