All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation¶
Conference: ICLR 2026 arXiv: 2603.14276 Code: https://ganvin-li.github.io/AlldayWalker/ Area: Robotics Keywords: Lifelong Vision-and-Language Navigation, Tucker Decomposition, Parameter-Efficient Fine-Tuning, Catastrophic Forgetting, Multi-level Knowledge Decoupling
TL;DR¶
This paper proposes Tucker Adaptation (TuKA), which represents multi-level navigation knowledge across scenes and environments as a high-order tensor, decomposed via Tucker decomposition into a shared subspace (core tensor plus encoder/decoder) and scene/environment expert vectors. Combined with a Decoupled Knowledge Incremental Learning (DKIL) strategy, TuKA enables all-day multi-scene lifelong VLN, achieving higher success rates (SR) and lower forgetting than LoRA variants across 24 navigation scenarios.
Background & Motivation¶
Background: VLN agents have evolved from discrete graph-based navigation to continuous low-level action navigation, yet real-world deployment requires agents to adapt to diverse scenes (bedroom, living room, etc.) and varying environmental conditions (normal, low-light, overexposed, hazy), necessitating continual learning.
Limitations of Prior Work: VLN agents fine-tuned on specific scenes suffer catastrophic forgetting when switched to new scenes. Existing LoRA/MoE-LoRA methods can only represent a two-level knowledge structure ("shared matrix + task-specific matrix"), and cannot decouple the orthogonal dimensions of "scene knowledge" and "environment knowledge."
Key Challenge: Navigation knowledge exhibits a multi-level structure — core navigation skills (shared across all scenes), scene-specific knowledge (e.g., indoor layouts), and environment-specific knowledge (e.g., visual adaptation under low light). The shared skills must transfer across all tasks, while the scene- and environment-specific levels must be learned independently without interfering with one another.
Goal: Formalize the All-day Multi-scene Lifelong VLN (AML-VLN) problem and design a parameter-efficient adaptation method capable of decoupling multi-level knowledge.
Key Insight: Leverage the natural multi-modal factorization capability of Tucker tensor decomposition — the core tensor captures shared knowledge, while rows of the factor matrices encode scene/environment experts respectively.
Core Idea: A fourth-order Tucker tensor decomposition simultaneously encodes shared core navigation skills, scene experts, and environment experts; decoupled incremental learning achieves forgetting-free lifelong navigation.
Method¶
Overall Architecture¶
TuKA introduces a fourth-order tensor \(\mathcal{X}^l \in \mathbb{R}^{a_l \times b_l \times M \times N}\) at each layer of an LLM backbone (Qwen2-7B), decomposed via Tucker decomposition into: a core tensor \(\mathcal{G}\) (shared navigation skills), \(U^1, U^2\) (shared encoder/decoder), \(U^3 \in \mathbb{R}^{M \times r_3}\) (\(M\) scene experts), and \(U^4 \in \mathbb{R}^{N \times r_4}\) (\(N\) environment experts). When learning the \(t\)-th task (scene \(s\) under environment \(e\)), the corresponding scene expert row \(U^3[s,:]\) and environment expert row \(U^4[e,:]\) are selected and combined with the shared components to produce the layer-wise adaptation weight \(\Delta W_t\).
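The ΔW construction described above can be sketched in a few lines of numpy. All shapes, variable names, and the `route_experts` helper below are illustrative assumptions, not the authors' implementation; the routing function stands in for the paper's CLIP-prototype-based expert search.

```python
import numpy as np

def tucker_delta_w(G, U1, U2, U3, U4, s, e):
    """Layer-wise adaptation weight from a 4th-order Tucker factorization.

    G  : core tensor (r1, r2, r3, r4) -- shared navigation skills
    U1 : shared encoder (a, r1);  U2 : shared decoder (b, r2)
    U3 : M scene expert rows (M, r3);  U4 : N environment expert rows (N, r4)
    s, e : indices of the active scene / environment expert
    """
    # Mode-3 and mode-4 products with single expert rows collapse the
    # core tensor to an (r1, r2) matrix.
    core = np.einsum('ijkl,k,l->ij', G, U3[s], U4[e])
    # The shared encoder/decoder lift it to the backbone weight shape (a, b).
    return U1 @ core @ U2.T

def route_experts(feat, scene_protos, env_protos):
    """Pick (s, e) by cosine similarity between an observation feature and
    stored visual prototypes (hypothetical stand-in for CLIP-based search)."""
    def nearest(x, P):
        sims = P @ x / (np.linalg.norm(P, axis=1) * np.linalg.norm(x) + 1e-8)
        return int(np.argmax(sims))
    return nearest(feat, scene_protos), nearest(feat, env_protos)
```

Because only one row is taken from each of \(U^3\) and \(U^4\), the result is an ordinary \(a \times b\) matrix that can be added to the frozen backbone weight, exactly like a LoRA update.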
Key Designs¶
- Tucker Adaptation Architecture
  - Function: Replace LoRA's low-rank matrix decomposition with a high-order Tucker tensor decomposition.
  - Mechanism: \(\Delta W_t = U^1 \cdot (\mathcal{G} \times_3 U^3[s,:] \times_4 U^4[e,:]) \cdot (U^2)^T\). Scene experts are selected via mode-3 indexing (selecting the \(s\)-th row from \(M\) candidates), and environment experts via mode-4 indexing (selecting the \(e\)-th row from \(N\) candidates), naturally realizing a "scene × environment" two-dimensional compositional space.
  - Design Motivation: LoRA/MoE-LoRA compresses all knowledge into a two-dimensional matrix (one shared + multiple task-specific), which cannot independently model the orthogonal knowledge dimensions of scene and environment. The high-order nature of Tucker decomposition natively supports multi-dimensional knowledge decoupling — the multi-modal structure of core tensor + factor matrices precisely matches the hierarchical structure of navigation knowledge.
- Decoupled Knowledge Incremental Learning (DKIL)
  - Function: Consolidate shared knowledge and constrain task-specific experts during continual learning of new scenes.
  - Mechanism: Three losses work in concert:
    - Shared Knowledge EWC (\(\mathcal{L}_{ewc}\)): Applies Fisher information-weighted quadratic constraints to the core tensor and encoder/decoder to prevent the shared components from drifting. Fisher weights are updated via exponential moving average.
    - Expert Consistency (\(\mathcal{L}_{co}\)): Applies L2 constraints to previously learned scene/environment experts to prevent forgetting.
    - Expert Separability (\(\mathcal{L}_{es}\)): Encourages new experts to be orthogonal to existing experts, ensuring new knowledge is learned in an independent subspace.
  - Design Motivation: Shared knowledge requires gradual consolidation (EWC), learned experts must be preserved (consistency constraint), and new experts require independent exploration (orthogonality constraint) — the three mechanisms address distinct challenges of continual learning.
- Task Expert Inference Search
  - Function: Automatically match scene and environment experts at test time (without task-id).
  - Mechanism: During training, CLIP visual feature prototypes are stored for each scene/environment. At test time, visual features of current observations are extracted and matched to the nearest scene and environment experts via cosine similarity.
  - Design Motivation: Task-id is unavailable in real-world deployment; automatic routing to the correct expert combination based on visual features is required.
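The three DKIL losses can be sketched as simple numpy functions. This is a minimal illustration of the mechanisms described above under assumed parameter layouts; the paper's exact weighting, scheduling, and Fisher estimation details are not specified here, and all function names are hypothetical.

```python
import numpy as np

def ewc_loss(shared_params, anchor_params, fisher, lam=1.0):
    """Shared Knowledge EWC: Fisher-weighted quadratic penalty keeping the
    core tensor and encoder/decoder near their values after the last task."""
    return lam * sum(float(np.sum(F * (p - a) ** 2))
                     for p, a, F in zip(shared_params, anchor_params, fisher))

def expert_consistency_loss(experts, frozen_experts, old_ids):
    """Expert Consistency: L2 anchor on previously learned expert rows
    of a factor matrix (U^3 or U^4)."""
    return float(sum(np.sum((experts[i] - frozen_experts[i]) ** 2)
                     for i in old_ids))

def expert_separability_loss(experts, new_id, old_ids):
    """Expert Separability: push the new expert row toward orthogonality
    with existing rows via squared dot products (a soft penalty)."""
    v = experts[new_id]
    return float(sum(np.dot(experts[i], v) ** 2 for i in old_ids))

def ema_update_fisher(fisher, grads, beta=0.9):
    """Exponential moving average of squared gradients as a running
    diagonal Fisher estimate, as described for the EWC term."""
    return [beta * F + (1 - beta) * g ** 2 for F, g in zip(fisher, grads)]
```

In training, the three penalties would be added to the navigation loss; only the new expert rows and the (EWC-regularized) shared components receive gradients.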
Allday-Habitat Simulation Platform¶
Built on Habitat with three imaging models (atmospheric scattering model, low-light noise model, overexposure clipping model) to synthesize degraded environments from normal ones, constructing 24 navigation scenarios (5 simulated scenes × 4 environments + 2 real-world scenes × 2 environments).
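The three imaging models follow standard formulations; a minimal numpy sketch of each, with illustrative parameter values (scattering coefficient, airlight, gain, noise level) that are assumptions rather than the platform's actual settings:

```python
import numpy as np

def synthesize_haze(img, depth, beta=1.0, A=0.9):
    """Atmospheric scattering model: I = J * t + A * (1 - t),
    with transmission t = exp(-beta * d).
    img in [0, 1], shape (H, W, 3); depth in meters, shape (H, W)."""
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmission map
    return img * t + A * (1.0 - t)

def synthesize_low_light(img, gain=0.25, sigma=0.05, rng=None):
    """Low-light noise model: darken, then add Gaussian sensor noise."""
    rng = rng or np.random.default_rng(0)
    return np.clip(img * gain + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def synthesize_overexposure(img, gain=2.5):
    """Overexposure clipping model: brighten, then clip highlights."""
    return np.clip(img * gain, 0.0, 1.0)
```

Applying these to the normal renderings of each simulated scene yields the degraded "low-light / overexposed / hazy" counterparts that make up the 24 scenarios.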
Key Experimental Results¶
Main Results (Average SR% across 24 scenarios)¶
| Method | Avg SR↑ | Avg F-SR↓ | Notes |
|---|---|---|---|
| Seq-FT | 11% | High | Sequential fine-tuning, severe forgetting |
| EWC-LoRA | 15% | — | LoRA + EWC |
| HydraLoRA | ~17% | — | MoE-LoRA |
| BranchLoRA | ~18% | — | Branched LoRA |
| AlldayWalker (TuKA) | Best | Lowest | Tucker adaptation |
TuKA consistently outperforms all LoRA-variant baselines on SR and SPL across all 24 scenarios, with significantly lower forgetting rates.
Ablation Study¶
| Configuration | Avg SR | Notes |
|---|---|---|
| w/o core tensor sharing | Drops | Shared knowledge cannot transfer across tasks |
| w/o EWC constraint | Notable drop | Shared knowledge overwritten by new tasks |
| w/o orthogonality constraint | Drops | New experts interfere with old experts in the same subspace |
| w/o expert consistency | Drops | Learned experts are modified, causing forgetting |
| Full TuKA | Best | Complete framework |
Key Findings¶
- Sequential fine-tuning (Seq-FT) reduces SR on earlier scenes to nearly 0% (T1–T6 all at 0%), demonstrating severe catastrophic forgetting.
- The mode-3/4 factor matrices of Tucker decomposition naturally support compositional generalization across "scene × environment" — scenes seen during training show some generalization to unseen environments.
- The orthogonality constraint, though simple, is critical for independent learning of new experts.
- Real-world deployment on two real scenes also validates the effectiveness of the approach.
Highlights & Insights¶
- The idea of modeling multi-level knowledge via tensor decomposition is elegant — the "core tensor + factor matrices" structure of Tucker decomposition maps naturally onto the "shared skills + scene experts + environment experts" knowledge hierarchy.
- The solution to the dimensionality alignment problem is clever: selecting a single row vector from each factor matrix reduces the high-order tensor to a two-dimensional weight matrix, perfectly matching the matrix structure of the LLM backbone.
- The three-level DKIL mechanism (EWC consolidation + consistency constraint + orthogonal exploration) forms a complete toolkit for continual learning.
- The Allday-Habitat platform synthesizes degraded environments via physics-based imaging models (rather than simple filters), enhancing the realism of environmental variation.
Limitations & Future Work¶
- The current setup covers only 5+2=7 scenes and 4 environments — scalability to large numbers of scenes (hundreds) remains unknown.
- The number of experts \(M\) and \(N\) must be predefined and cannot grow dynamically — truly open-ended lifelong learning should support unbounded expansion.
- Expert search at inference time relies on CLIP feature matching, which may fail when new environments differ substantially from seen ones.
- The four degradation types (normal/low-light/overexposed/hazy) have physical grounding but are simpler than real-world environmental variation (rain, motion blur, occlusion, etc.).
- The rank choices \(r_1=r_2=8, r_3=r_4=64\) appear somewhat arbitrary; sensitivity analysis of rank selection is absent.
Related Work & Insights¶
- vs. LoRA: LoRA's two-dimensional matrix factorization cannot decouple multi-dimensional knowledge; TuKA extends this to fourth-order tensor decomposition.
- vs. HydraLoRA/BranchLoRA: These MoE-LoRA methods only support a two-level "shared + specific" structure; TuKA introduces a three-level "shared + scene + environment" hierarchy.
- vs. EWC/LwF and other continual learning methods: Traditional continual learning approaches do not account for the hierarchical structure of knowledge; DKIL applies different strategies to different levels of knowledge.
- vs. StreamVLN: AlldayWalker builds on the StreamVLN agent architecture, with TuKA inserted as a parameter-efficient adaptation layer.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of Tucker decomposition and multi-level knowledge decoupling is pioneering in VLN and continual learning; the problem formulation (AML-VLN) is also novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of 24 scenarios, real-world deployment, and ablation studies is solid, though the scale of scenarios is limited.
- Writing Quality: ⭐⭐⭐⭐ Problem formalization is clear and method diagrams are intuitive.
- Value: ⭐⭐⭐⭐ Directly relevant to real-world VLN deployment; the Tucker adaptation paradigm is transferable to other multi-dimensional continual learning settings.