RoToR: Towards More Reliable Responses for Order-Invariant Inputs

Conference: ACL 2025
arXiv: 2502.08662
Code: Yes (https://github.com/soyoung97/RoToR)
Area: Others
Keywords: positional bias, order invariance, positional encoding, causal language models, selective routing

TL;DR

Proposes RoToR, a zero-shot, order-invariant language model based on global ordering and circular position encoding allocation. It achieves stable order invariance by minimizing position ID modifications and designs a Selective Routing mechanism to adaptively handle mixed input types.

Background & Motivation

Language models are highly sensitive to input order, but in many practical scenarios, the order of list-like inputs (such as table rows, multiple-choice options, and retrieved document collections) should be irrelevant. This "positional bias" problem is widely recognized:

In LLM-as-a-judge scenarios, models show up to a 75% preference for the first response.
On MMLU, simply changing the order of options can shift model rankings by up to 8 positions.
The "lost-in-the-middle" phenomenon: information in the middle positions is heavily ignored.

Existing zero-shot order-invariant methods suffer from two key limitations:

Training-inference distribution mismatch: PCW completely isolates attention between segments, while PINE dynamically reallocates position IDs for each query token. This results in a position encoding allocation that deviates significantly from the pre-training distribution.

Inability to adapt to mixed inputs: Practical questions (such as MMLU) contain both order-insensitive options (e.g., A, B, C) and order-sensitive options (e.g., "None of the above"). Existing methods apply a single strategy to both.

The core idea of RoToR is to replace PINE's query-wise dynamic sorting with a global ordering plus circular allocation, which significantly reduces position ID perturbations. In addition, selective routing is introduced to adaptively handle both order-sensitive and order-insensitive inputs.

Method

Overall Architecture

RoToR consists of two stages:

1. RoToR Core: A new position ID allocation scheme that uses global ordering and circular arrangement to achieve order invariance.
2. Selective Routing: Adaptively selects the output based on the confidence of the original model and the order-invariant model.

Key Designs

Global Ordering
- Function: Determines a unified permutation order for all input segments, rather than PINE's query-wise permutation.
- Core Idea:
  - Provides three global ordering algorithms:
    - Lexicographical: Based on the lexicographical order of token sequences, with minimal overhead.
    - MonoT5: Uses a pointwise reranker to sort based on relevance to the question.
    - Frequency: Normalized sorting based on inverse token frequency.
  - The sorting result is shared across all query tokens, all layers, and all attention heads.
- Design Motivation:
  - PINE recalculates sorting for each query token, leading to \(O(n^2d)\) extra computation and frequent position ID changes.
  - Global ordering only needs to be performed once, reducing complexity to \(O(nk\log k)\), and consistent position ID allocation minimizes distribution shift.
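
The lexicographical variant can be illustrated with a minimal sketch, assuming each segment has already been tokenized into a list of integer token IDs. The function name `global_order` and the toy token IDs are hypothetical and not taken from the RoToR codebase; the point is only that the sorted result depends on segment contents, not on the order in which segments arrive.

```python
from typing import List, Sequence


def global_order(segments: Sequence[Sequence[int]]) -> List[int]:
    """Return segment indices sorted by the lexicographical order of their token IDs.

    The sort key depends only on segment contents, so shuffling the input
    yields the same sequence of segments after sorting.
    """
    return sorted(range(len(segments)), key=lambda i: tuple(segments[i]))


# Toy token IDs (hypothetical): the same three segments in two input orders.
docs = [[7, 2, 9], [3, 8], [3, 1, 4]]
shuffled = [docs[2], docs[0], docs[1]]

ordered_a = [docs[i] for i in global_order(docs)]
ordered_b = [shuffled[i] for i in global_order(shuffled)]
```

Both `ordered_a` and `ordered_b` contain the segments in the same sequence, which is the property that lets one ordering be shared across all query tokens, layers, and heads.
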
Circular Arrangement
- Function: Simulates bidirectional attention in causal LMs, allowing each segment to "see" all other segments.
- Core Idea:
  - Given the global ordering A→B→O→K→G, construct a directed circular graph.
  - When segment B is used as a query, put B at the end according to the circular order: O→K→G→A→B.
  - When segment K is used as a query: G→A→B→O→K.
  - Key: All suffix and generated tokens are concatenated using the front and back parts of the global ordering, no longer changing position IDs token-by-token.
- Design Motivation:
  - In causal attention, tokens at the end of the sequence can see all preceding tokens.
  - Circular arrangement allows each segment to take turns in the "last position," achieving de facto bidirectional access.
  - Unlike PINE, the position IDs of suffix tokens remain constant, significantly reducing OOD risk.
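
The rotation described above can be sketched as follows. This is an illustration of the segment arrangement only, assuming segments are identified by unique labels; it does not model position ID assignment or attention masks, and `circular_view` is a hypothetical helper name.

```python
from typing import List


def circular_view(order: List[str], query: str) -> List[str]:
    """Rotate the global ordering so that `query` lands in the last slot.

    Every view is a rotation of the same cycle, so the relative circular
    order of the segments never changes.
    """
    i = order.index(query)
    return order[i + 1:] + order[: i + 1]


order = ["A", "B", "O", "K", "G"]
view_b = circular_view(order, "B")  # ['O', 'K', 'G', 'A', 'B']
view_k = circular_view(order, "K")  # ['G', 'A', 'B', 'O', 'K']
```

Because the query segment always sits last, a causal attention mask lets it attend to every other segment.
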
Selective Routing
- Function: Adaptively chooses between using the output of the original model or the order-invariant model.
- Core Idea:
  - For the same input, the original model and the RoToR model each generate an answer together with a confidence score (the maximum token probability).
  - If original model confidence + bias α > RoToR confidence, the answer from the original model is chosen; otherwise, RoToR is selected.
  - α = 0.2 (determined via validation set search), leaning slightly towards the original model.
- Design Motivation:
  - In practical tasks (e.g., MMLU), some options are order-sensitive (e.g., "None of the above").
  - The order-invariant model might perform worse on these options.
  - Confidence-based routing allows adaptive selection of the most suitable model.
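
The routing rule can be sketched in a few lines, assuming each model's output has already been reduced to an (answer, confidence) pair. The `Prediction` alias and the function name `select_answer` are hypothetical; only the comparison rule and the value α = 0.2 come from the description above.

```python
from typing import Tuple

# (answer, confidence), where confidence is the maximum token probability.
Prediction = Tuple[str, float]


def select_answer(original: Prediction, rotor: Prediction, alpha: float = 0.2) -> str:
    """Pick the original model's answer if its biased confidence is higher."""
    orig_answer, orig_conf = original
    rotor_answer, rotor_conf = rotor
    # The bias alpha tilts the decision slightly towards the original model.
    if orig_conf + alpha > rotor_conf:
        return orig_answer
    return rotor_answer


# Toy confidences (made up for illustration).
kept_original = select_answer(("A", 0.55), ("C", 0.70))   # 0.75 > 0.70
switched = select_answer(("A", 0.40), ("C", 0.70))        # 0.60 < 0.70
```

In the first call the original answer is kept because the bias closes the gap; in the second the RoToR answer wins because the original model is too uncertain.
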

Computational Complexity Analysis

Method	Extra Computational Overhead
PINE	\(O(n^2d + nk\log k)\) (Recalculates RoPE-free attention + sorting for each query)
RoToR (Lexicographical)	\(O(nk\log k)\) (Single global sorting)
RoToR (Radix Sort Optimized)	\(O(nk)\)

Key Experimental Results

Main Results (Lost in the Middle Benchmark, best_subspan_em %)

Method	ndoc=10	ndoc=20	ndoc=30
Llama-3.1-8B-Instruct
Original	50.2~54.7	51.0~54.8	43.5~56.8
PCW	11.9~12.4	3.7~4.0	1.8~2.3
Set-Based Prompting	42.5	26.3	14.1
PINE	58.6~59.0	55.5~56.2	53.7~54.8
RoToR-lexical	61.4~61.6	59.6~61.4	59.0~59.5
RoToR-MonoT5	61.2~61.4	60.7~61.2	60.7~60.9
Llama-3.1-70B-Instruct
Original	65.7~66.2	64.3~66.2	—
PINE	67.5~67.9	65.5~65.9	—
RoToR	69.3~69.6	67.6~67.9	—

KGQA Experiment (N=30 segments, best_subspan_em %)

Method	Llama-8B Acc.	Llama-70B Acc.	Qwen-4B Acc.	Qwen-7B Acc.
Original	50.2	61.6	30.7	31.5
PINE	51.5	63.1	31.6	32.3
RoToR	53.1	63.6	32.0	34.3
RoToR-MonoT5	51.6	—	32.3	32.9

Key Findings

RoToR consistently outperforms PINE across all reported models and settings, with gains of roughly 2-5 percentage points on the LitM benchmark.
Excellent order invariance: After the segment order is shuffled, RoToR shows a very small standard deviation (0.02-0.11), compared with 0.07-0.75 for the Original model.
Simple lexicographical order is sufficient: It does not require complex MonoT5 sorting; RoToR-lexical already brings significant benefits.
Computational overhead is far lower than PINE: Eliminates the \(O(n^2d)\) term, making the advantage more pronounced as the number of segments k increases.
PCW and Set-Based Prompting nearly fail as the number of segments grows: At ndoc=30, PCW reaches only about 2% and Set-Based Prompting 14.1%.
Selective routing is effective: Helps handle order-sensitive special options in MMLU.

Highlights & Insights

Clever naming: RoToR is a palindrome, echoing the theme of "order invariance," while also hinting at "Rotary."
Simplicity is power: The global ordering + circular arrangement scheme is extremely simple, yet mathematically guarantees order invariance.
Unique OOD perspective: Frames the positional bias problem as a training-inference distribution mismatch and mitigates it through minimal modifications.
Experimental insights: Observes that under bfloat16 precision, attention scores in PINE produce many tied values, which leads to non-deterministic sorting. This is an important practical finding.

Limitations & Future Work

Inability to handle completely arbitrary input structures: Still requires explicit segment partitioning.
Selective routing requires two forward passes: Increases inference cost during practical deployment.
Global ordering does not guarantee optimality: Although lexicographical sorting is simple and effective, placing relevant documents closer might be better (the advantage of MonoT5 sorting).
Limited large-scale experiments: Due to resource constraints, experiments for the 70B model were not conducted at ndoc=30.
Lack of direct validation in high-impact scenarios such as LLM-as-a-judge.

Related Work & Insights

PINE is the most direct predecessor. RoToR eliminates its core defect of query-wise sorting through global ordering.
Conceptually linked to order-invariance methods in Set/Graph ML (Murphy et al., 2019), but circular allocation and its application to pre-trained LMs represent novel contributions.
The idea of selective routing can be generalized to other scenarios requiring "heterogeneous processing" (e.g., deciding whether to use retrieval results in RAG).
Techniques for adapting RoPE and causal attention can inspire other work requiring modifications to the attention mechanism.

Rating

Novelty: ⭐⭐⭐⭐ — The combination of circular arrangement and global ordering is simple and novel. Introducing the OOD perspective to positional bias analysis is a unique contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three major task categories (LitM / KGQA / MMLU) across 5 model sizes and multiple sorting algorithm variants, including variance and time analysis.
Writing Quality: ⭐⭐⭐⭐ — Well-structured, vivid diagrams, with a direct and intuitive comparison to PINE.
Value: ⭐⭐⭐⭐ — Solves a persistent and practical problem in decoder-only LMs with a neat and practical approach.