RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Conference: ACL 2025
arXiv: 2502.08662
Code: Yes (https://github.com/soyoung97/RoToR)
Area: Others
Keywords: positional bias, order invariance, positional encoding, causal language models, selective routing
TL;DR
Proposes RoToR, a zero-shot, order-invariant language model based on global ordering and circular position encoding allocation. It achieves stable order invariance by minimizing position ID modifications and designs a Selective Routing mechanism to adaptively handle mixed input types.
Background & Motivation
Language models are highly sensitive to input order, but in many practical scenarios, the order of list-like inputs (such as table rows, multiple-choice options, and retrieved document collections) should be irrelevant. This "positional bias" problem is widely recognized:
- In LLM-as-a-judge scenarios, models show up to a 75% preference for the first response.
- On MMLU, simply changing the order of options can shift model rankings by up to 8 positions.
- The "lost-in-the-middle" phenomenon: information in the middle positions is heavily ignored.
Existing zero-shot order-invariant methods suffer from two key limitations:
- Training-inference distribution mismatch: PCW completely isolates attention between segments, while PINE dynamically reallocates position IDs for each query token. Both produce position encoding assignments that deviate significantly from what the model saw during pre-training.
- Inability to adapt to mixed inputs: Practical questions (such as those in MMLU) contain both order-insensitive options (e.g., A, B, C) and order-sensitive options (e.g., "None of the above"). Existing methods apply a single strategy to both.
The core idea of RoToR is to replace PINE's query-wise dynamic sorting with a single global ordering plus circular allocation, which significantly reduces position ID perturbations. Selective routing is then added to handle order-sensitive and order-insensitive inputs adaptively.
Method
Overall Architecture
RoToR consists of two stages:
1. RoToR Core: A new position ID allocation scheme that uses global ordering and circular arrangement to achieve order invariance.
2. Selective Routing: Adaptively selects the output based on the confidence of the original model and the order-invariant model.
Key Designs
- Global Ordering
  - Function: Determines a unified permutation order for all input segments, rather than PINE's query-wise permutation.
  - Core Idea:
    - Provides three global ordering algorithms:
      - Lexicographical: Based on the lexicographical order of token sequences, with minimal overhead.
      - MonoT5: Uses a pointwise reranker to sort based on relevance to the question.
      - Frequency: Normalized sorting based on inverse token frequency.
    - The sorting result is shared across all query tokens, all layers, and all attention heads.
  - Design Motivation:
    - PINE recalculates sorting for each query token, leading to \(O(n^2d)\) extra computation and frequent position ID changes.
    - Global ordering only needs to be performed once, reducing complexity to \(O(nk\log k)\), and consistent position ID allocation minimizes distribution shift.
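As a rough illustration of the lexicographical variant, the sketch below sorts segments once by their token-ID sequences, so any shuffle of the same segments yields the same canonical order. The function name `global_lexical_order` and the token IDs are hypothetical and are not taken from the RoToR codebase.

```python
from typing import List, Sequence


def global_lexical_order(segments: Sequence[Sequence[int]]) -> List[int]:
    """Return segment indices sorted by the lexicographic order of their token IDs."""
    # Python compares tuples element by element, which is lexicographic order.
    return sorted(range(len(segments)), key=lambda i: tuple(segments[i]))


# Hypothetical token-ID sequences for three segments.
segments = [[5, 2, 9], [1, 7], [5, 2, 3]]
ordered = [segments[i] for i in global_lexical_order(segments)]
print(ordered)  # [[1, 7], [5, 2, 3], [5, 2, 9]]

# A different input order of the same segments gives the same canonical order.
shuffled = [segments[2], segments[0], segments[1]]
reordered = [shuffled[i] for i in global_lexical_order(shuffled)]
print(reordered == ordered)  # True
```

Because the ordering depends only on segment content, it can be computed once and reused for every query token, layer, and attention head.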
- Circular Arrangement
  - Function: Simulates bidirectional attention in causal LMs, allowing each segment to "see" all other segments.
  - Core Idea:
    - Given the global ordering A→B→O→K→G, construct a directed circular graph.
    - When segment B is used as a query, put B at the end according to the circular order: O→K→G→A→B.
    - When segment K is used as a query: G→A→B→O→K.
    - Key: All suffix and generated tokens are concatenated using the front and back parts of the global ordering, no longer changing position IDs token-by-token.
  - Design Motivation:
    - In causal attention, tokens at the end of the sequence can see all preceding tokens.
    - Circular arrangement allows each segment to take turns in the "last position," achieving de facto bidirectional access.
    - Unlike PINE, the position IDs of suffix tokens remain constant, significantly reducing OOD risk.
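The rotation in the A→B→O→K→G example can be sketched as a simple list rotation that places the query segment last. This is a minimal illustration at the segment level only; the helper name `circular_view` is hypothetical, and the actual method operates on position IDs inside the attention computation.

```python
from typing import List, Sequence


def circular_view(global_order: Sequence[str], query: str) -> List[str]:
    """Rotate the global order so that `query` ends up in the last position."""
    i = list(global_order).index(query)
    # Everything after the query comes first, then the prefix up to and including it.
    return list(global_order[i + 1:]) + list(global_order[: i + 1])


order = ["A", "B", "O", "K", "G"]
print(circular_view(order, "B"))  # ['O', 'K', 'G', 'A', 'B']
print(circular_view(order, "K"))  # ['G', 'A', 'B', 'O', 'K']
```

Each rotation preserves the cyclic order of the segments, which is why the relative arrangement stays consistent no matter which segment acts as the query.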
- Selective Routing
  - Function: Adaptively chooses between using the output of the original model or the order-invariant model.
  - Core Idea:
    - The original model and the RoToR model generate answers and their confidences (maximum token probabilities) for the same input, respectively.
    - If original model confidence + bias α > RoToR confidence, the answer from the original model is chosen; otherwise, RoToR is selected.
    - α = 0.2 (determined via validation set search), leaning slightly towards the original model.
  - Design Motivation:
    - In practical tasks (e.g., MMLU), some options are order-sensitive (e.g., "None of the above").
    - The order-invariant model might perform worse on these options.
    - Confidence-based routing allows adaptive selection of the most suitable model.
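The routing rule described above reduces to a single comparison. The sketch below assumes each model has already produced an (answer, confidence) pair, where confidence is the maximum token probability; the function name `selective_route` and the example values are hypothetical.

```python
from typing import Tuple


def selective_route(
    original: Tuple[str, float],
    rotor: Tuple[str, float],
    alpha: float = 0.2,  # bias toward the original model, as reported in the notes above
) -> str:
    """Pick the original model's answer if its confidence plus alpha exceeds RoToR's."""
    orig_answer, orig_conf = original
    rotor_answer, rotor_conf = rotor
    if orig_conf + alpha > rotor_conf:
        return orig_answer
    return rotor_answer


# Hypothetical confidences: the bias keeps the original answer when the gap is small,
# and RoToR wins only when it is clearly more confident.
print(selective_route(("A", 0.5), ("B", 0.6)))  # A
print(selective_route(("A", 0.3), ("B", 0.9)))  # B
```

Both answers must be available before the comparison, so this scheme implies a forward pass through each model.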
Computational Complexity Analysis
| Method | Extra Computational Overhead |
|---|---|
| PINE | \(O(n^2d + nk\log k)\) (Recalculates RoPE-free attention + sorting for each query) |
| RoToR (Lexicographical) | \(O(nk\log k)\) (Single global sorting) |
| RoToR (Radix Sort Optimized) | \(O(nk)\) |
Key Experimental Results
Main Results (Lost in the Middle Benchmark, best_subspan_em %)
| Method | ndoc=10 | ndoc=20 | ndoc=30 |
|---|---|---|---|
| Llama-3.1-8B-Instruct | | | |
| Original | 50.2~54.7 | 51.0~54.8 | 43.5~56.8 |
| PCW | 11.9~12.4 | 3.7~4.0 | 1.8~2.3 |
| Set-Based Prompting | 42.5 | 26.3 | 14.1 |
| PINE | 58.6~59.0 | 55.5~56.2 | 53.7~54.8 |
| RoToR-lexical | 61.4~61.6 | 59.6~61.4 | 59.0~59.5 |
| RoToR-MonoT5 | 61.2~61.4 | 60.7~61.2 | 60.7~60.9 |
| Llama-3.1-70B-Instruct | | | |
| Original | 65.7~66.2 | 64.3~66.2 | — |
| PINE | 67.5~67.9 | 65.5~65.9 | — |
| RoToR | 69.3~69.6 | 67.6~67.9 | — |
KGQA Experiment (N=30 segments, best_subspan_em %)
| Method | Llama-8B Acc. | Llama-70B Acc. | Qwen-4B Acc. | Qwen-7B Acc. |
|---|---|---|---|---|
| Original | 50.2 | 61.6 | 30.7 | 31.5 |
| PINE | 51.5 | 63.1 | 31.6 | 32.3 |
| RoToR | 53.1 | 63.6 | 32.0 | 34.3 |
| RoToR-MonoT5 | 51.6 | — | 32.3 | 32.9 |
Key Findings
- RoToR outperforms PINE in every reported model and setting, by roughly 2-5 percentage points on the LitM benchmark.
- Excellent order invariance: After shuffling segment order, RoToR exhibits extremely small standard deviation (0.02-0.11), which is significantly better than the Original model (0.07-0.75).
- Simple lexicographical order is sufficient: It does not require complex MonoT5 sorting; RoToR-lexical already brings significant benefits.
- Computational overhead is far lower than PINE: Eliminates the \(O(n^2d)\) term, making the advantage more pronounced as the number of segments k increases.
- PCW and Set-Based Prompting degrade sharply as the number of segments increases: at ndoc=30, PCW scores only about 2% and Set-Based Prompting 14.1%.
- Selective routing is effective: Helps handle order-sensitive special options in MMLU.
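One way to read the order-invariance finding is as the spread of a score across shuffled inputs. The sketch below is a toy illustration of that measurement, not the paper's evaluation code: `score_spread` and both scorers are hypothetical stand-ins for a real model and metric.

```python
import random
import statistics
from typing import Callable, List, Sequence


def score_spread(
    score_fn: Callable[[List[str]], float],
    segments: Sequence[str],
    n_shuffles: int = 5,
    seed: int = 0,
) -> float:
    """Population standard deviation of score_fn over random shuffles of the segments."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(segments)
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    return statistics.pstdev(scores)


def order_sensitive(segs: List[str]) -> float:
    # Toy scorer: only succeeds when the gold segment happens to come first.
    return 1.0 if segs[0] == "gold" else 0.0


def order_invariant(segs: List[str]) -> float:
    # Canonicalizing the order first makes the score independent of input order.
    return order_sensitive(sorted(segs))


docs = ["gold", "a", "b", "c"]
print(score_spread(order_invariant, docs))  # 0.0
print(score_spread(order_sensitive, docs))  # depends on which shuffles are drawn
```

An exactly order-invariant scorer gives zero spread by construction; the small nonzero deviations reported for RoToR are measured on a real model rather than a toy function like this one.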
Highlights & Insights
- Clever naming: RoToR is a palindrome, echoing the theme of "order invariance," while also hinting at "Rotary."
- Simplicity is power: The global ordering + circular arrangement scheme is extremely simple, yet mathematically guarantees order invariance.
- Unique OOD perspective: Frames the positional bias problem as a training-inference distribution mismatch and mitigates it through minimal modifications.
- Experimental insights: The authors observe that in bfloat16 precision, PINE's attention scores produce many tied values, which makes its sorting non-deterministic. This is an important practical finding.
Limitations & Future Work
- Inability to handle completely arbitrary input structures: Still requires explicit segment partitioning.
- Selective routing requires two forward passes: Increases inference cost during practical deployment.
- Global ordering does not guarantee optimality: Although lexicographical sorting is simple and effective, placing relevant documents closer might be better (the advantage of MonoT5 sorting).
- Limited large-scale experiments: Due to resource constraints, experiments for the 70B model were not conducted at ndoc=30.
- Lack of direct validation in high-impact scenarios such as LLM-as-a-judge.
Related Work & Insights
- PINE is the most direct predecessor. RoToR eliminates its core defect of query-wise sorting through global ordering.
- Conceptually linked to order-invariance methods in Set/Graph ML (Murphy et al., 2019), but circular allocation and its application to pre-trained LMs represent novel contributions.
- The idea of selective routing can be generalized to other scenarios requiring "heterogeneous processing" (e.g., deciding whether to use retrieval results in RAG).
- Techniques for adapting RoPE and causal attention can inspire other work requiring modifications to the attention mechanism.
Rating
- Novelty: ⭐⭐⭐⭐ — The combination of circular arrangement and global ordering is simple and novel. Introducing the OOD perspective to positional bias analysis is a unique contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers three major task categories (LitM / KGQA / MMLU) across 5 model sizes and multiple sorting algorithm variants, including variance and time analysis.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured, vivid diagrams, with a direct and intuitive comparison to PINE.
- Value: ⭐⭐⭐⭐ — Solves a persistent and practical problem in decoder-only LMs with a neat and practical approach.