Self-Taught Agentic Long-Context Understanding

Conference: ACL 2025
arXiv: 2502.15920
Code: https://github.com/EvanZhuang/AgenticLU
Area: LLM Agent
Keywords: long-context understanding, agentic workflow, chain-of-clarifications, inference-time scaling, self-taught reasoning

TL;DR

AgenticLU is a framework that lets an LLM autonomously generate clarification questions and retrieve relevant context through a Chain-of-Clarifications (CoC) workflow. Search paths collected by tree search are distilled into the model with two-stage SFT+DPO fine-tuning, and the resulting 8B model significantly outperforms baselines on 128K long-context QA tasks.

Background & Motivation

Background:
  • While current LLMs support context windows of 128K or even 2M tokens, their performance in actual long-document understanding tasks falls far short of what their nominal capacity suggests.
  • A severe gap exists between the "nominal context size" and the "effective context size."
  • On HotpotQA, the accuracy of Llama3.1-8B-Instruct drops sharply as the context scales from 8K to 128K.

Limitations of Prior Work:
  • When directly processing ultra-long texts, models tend to lose key information located in the middle segments (the "lost-in-the-middle" effect).
  • Existing methods like ProLong require fine-tuning on an additional 40B tokens of long-context corpora, incurring extremely high training costs.
  • Prompting-based methods (such as Chain-of-Thought, Plan-and-Solve) suffer from severe performance degradation at extreme lengths (128K).

Key Challenge:
  • A large gap separates the model's nominal context capacity (how long an input it can accept) from its effective context capability (how much of the input it can actually utilize).
  • The situation is analogous to computer memory: larger capacity alone does not yield efficient computation; an intelligent "information loading" mechanism is also required.

Goal:
  • How to enhance the LLM's capability to understand and utilize long contexts without relying on human annotations or stronger teacher models.
  • How to amortize the high computational overhead during inference into the training phase.

Key Insight:
  • Reformulate long-context understanding as an agentic workflow of iterative self-clarification and context localization.
  • Leverage tree search during inference to collect high-quality reasoning paths, which are then distilled back into the model.

Core Idea:
  • Let the model generate its own clarification questions, retrieve the supporting evidence itself, and answer them. This capability is then internalized through SFT+DPO to achieve "self-taught" long-context understanding.

Method

Overall Architecture

AgenticLU consists of two core phases:

  1. CoC Path Construction (inference-time tree search): Generates diverse Chain-of-Clarifications paths using tree search.
  2. CoC Path Distillation (training-time knowledge transfer): Distills the searched paths into the model via a two-stage fine-tuning of SFT + DPO.

Key Designs

  1. Chain-of-Clarifications (CoC):

    • Function: In each CoC step, the model autonomously executes three actions: (1) generating clarification questions to identify potentially misunderstood areas; (2) locating relevant paragraphs using the pointback mechanism; (3) answering the clarification questions and the original question based on the collected evidence.
    • Mechanism: Instead of processing the entire long context all at once, the task is decomposed into a series of targeted sub-tasks to progressively refine the understanding.
    • Design Motivation: Simulates the natural process of humans reading long documents—referring back to verify whenever encountering uncertainty.
  2. Pointback Mechanism:

    • Function: Highlights key context segments by labeling the index numbers of the relevant paragraphs.
    • Mechanism: During the data collection phase, the context is split into 512-token chunks and the LLM is queried on each chunk in turn to judge its relevance; after training, the model generates the relevant paragraph numbers directly.
    • Design Motivation: Internalizes the computationally intensive block-by-block retrieval process into the model's intuitive capability.
  3. Tree Search Data Construction:

    • Function: Constructs search trees with a branching factor of 8 and a maximum depth of 3, where each node represents a CoC step.
    • Mechanism: Selects the best paths using a combined score of ROUGE-L overlap with the reference answer and binary validation by GPT-4o-mini.
    • Design Motivation: 92% of the questions can be resolved with just one round of clarification; a second round resolves 53% of the questions still unresolved; and a third round resolves 35% of what remains after that, for an overall answer coverage of 97.8%.
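
The data-collection side of the pointback mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: whitespace-split words stand in for model tokens, and `keyword_judge` is a hypothetical placeholder for the per-chunk LLM relevance query. Only the 512-token segment length comes from the text.

```python
from typing import Callable, List

CHUNK_TOKENS = 512  # segment length stated in the text


def chunk_tokens(tokens: List[str], size: int = CHUNK_TOKENS) -> List[List[str]]:
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def pointback(
    chunks: List[List[str]],
    question: str,
    is_relevant: Callable[[List[str], str], bool],
) -> List[int]:
    """Return the indices of the chunks judged relevant to `question`."""
    return [i for i, chunk in enumerate(chunks) if is_relevant(chunk, question)]


def keyword_judge(chunk: List[str], question: str) -> bool:
    """Hypothetical stand-in for the LLM relevance query: keyword overlap."""
    keywords = {w.lower() for w in question.split() if len(w) > 3}
    return any(tok.lower() in keywords for tok in chunk)


# Toy context: one informative token buried in filler.
tokens = ["filler"] * 1000 + ["Paris"] + ["filler"] * 500
chunks = chunk_tokens(tokens)
hits = pointback(chunks, "Where is Paris located", keyword_judge)
print(len(chunks), hits)
```

In this sketch the judge is called once per chunk, which is the block-by-block cost that training is meant to remove: the fine-tuned model would emit the index list itself instead of looping over chunks.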

Loss & Training

  • First Stage (SFT): Trains the model with standard cross-entropy loss on CoC reasoning paths; each training sequence contains the full context, the question, and the step-by-step reasoning chain.
  • Second Stage (DPO): Uses incorrect reasoning paths as negative samples (where correctness is determined by GPT-4o-mini) to create preference pairs for Direct Preference Optimization.
  • The base model is Llama3.1-8B-Instruct. The training data is built from NarrativeQA (14.7K QA pairs), yielding 107,550 traces with an average context length of 67K tokens and 17M generated tokens in total.
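
The second stage can be illustrated with a small sketch of how preference pairs might be assembled and scored. The pairing rule (every correct trace against every incorrect trace for the same question), the `Trace` class, and `beta=0.1` are assumptions made for illustration; the source only states that incorrect paths serve as negatives. The loss is the standard DPO objective for a single pair, written over sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass
class Trace:
    question_id: str
    text: str
    correct: bool  # set by an external judge in the source; hard-coded here


def build_preference_pairs(traces: List[Trace]) -> List[Tuple[Trace, Trace]]:
    """Pair each correct trace with each incorrect trace of the same question."""
    by_question: Dict[str, List[Trace]] = {}
    for t in traces:
        by_question.setdefault(t.question_id, []).append(t)
    pairs: List[Tuple[Trace, Trace]] = []
    for group in by_question.values():
        chosen = [t for t in group if t.correct]
        rejected = [t for t in group if not t.correct]
        pairs.extend(product(chosen, rejected))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the source
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


traces = [
    Trace("q1", "path A", True),
    Trace("q1", "path B", False),
    Trace("q1", "path C", False),
    Trace("q2", "path D", True),  # no incorrect sibling, so no pair is formed
]
pairs = build_preference_pairs(traces)
print(len(pairs), dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

Questions whose traces are all correct or all incorrect contribute no pairs under this rule, which is one reason a tree search with several branches per question is convenient for building DPO data.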

Key Experimental Results

Main Results

  • Long-Context Tasks (128K) Average: AgenticLU-8B improves over Llama3.1-8B by +14.7 points (\(53.4 \rightarrow 68.1\)).
  • HotpotQA (128K): \(+31.1\) (\(40.0 \rightarrow 71.1\)) — the multi-hop reasoning task with the most significant improvement.
  • NaturalQ (128K): \(+21.7\) (\(56.1 \rightarrow 77.8\))
  • TriviaQA (128K): \(+7.7\) (\(80.6 \rightarrow 88.3\))
  • NarrativeQA (128K): \(+18.0\) (\(38.0 \rightarrow 56.0\))
  • Short-Context Tasks Average: Drops by only \(0.6\) points (\(62.3 \rightarrow 61.7\)), with almost no impact on general capabilities.
  • Consistently outperforms prompting methods and ProLong-8B across all 7 long-context tasks and all context lengths (8K to 128K).
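
As a quick consistency check, the gains quoted above can be recomputed from the before/after scores. The snippet below only restates the numbers in this list; the dictionary layout is an illustrative choice.

```python
# (baseline Llama3.1-8B score, AgenticLU-8B score) as quoted in the list above
reported = {
    "Long-context average": (53.4, 68.1),
    "HotpotQA": (40.0, 71.1),
    "NaturalQ": (56.1, 77.8),
    "TriviaQA": (80.6, 88.3),
    "NarrativeQA": (38.0, 56.0),
    "Short-context average": (62.3, 61.7),
}

# Round to one decimal to match the precision of the reported scores.
gains = {
    name: round(after - before, 1) for name, (before, after) in reported.items()
}
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
print("largest gain:", largest)
```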

Ablation Study / Key Findings

  • Effect of Multi-Turn CoC: \(1\text{ turn} \rightarrow 75.7\%\); \(2\text{ turns} \rightarrow 76.7\%\); \(3\text{ turns} \rightarrow 78.4\%\) (average of 4 RAG tasks on 128K), with the first turn obtaining most of the gains.
  • Removing Self-Clarification: Average accuracy drops from \(75.7\%\) to \(62.1\%\) (\(-13.6\)), and from \(71.1\%\) to \(57.8\%\) on HotpotQA.
  • Removing Pointback: Average accuracy drops from \(75.7\%\) to \(62.2\%\) (\(-13.5\)), demonstrating that context localization is equally critical.
  • Tree Search Coverage: Reaches \(97.8\%\) answer recall on NarrativeQA with a depth of 3 and a branching factor of 8.
  • With prefix caching, the additional inference overhead scales linearly only with the number of newly generated tokens.
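
The tree-search numbers can be made concrete with a short calculation. It assumes a fully expanded tree (every node has 8 children), so the node count is an upper bound rather than the actual search cost, and it chains the per-round resolution rates quoted in the Key Designs section (92%, then 53% and 35% of the remainder).

```python
from typing import List

BRANCHING = 8   # branching factor from the text
MAX_DEPTH = 3   # maximum depth from the text

# Nodes at depth 1..3 of a fully expanded tree (root excluded).
nodes_per_depth = [BRANCHING ** d for d in range(1, MAX_DEPTH + 1)]
total_nodes = sum(nodes_per_depth)


def cumulative_coverage(rates: List[float]) -> List[float]:
    """Coverage after each round when round k resolves rates[k] of the remainder."""
    unresolved = 1.0
    coverage = []
    for rate in rates:
        unresolved *= 1.0 - rate
        coverage.append(1.0 - unresolved)
    return coverage


coverage = cumulative_coverage([0.92, 0.53, 0.35])
print(nodes_per_depth, total_nodes)
print([round(c, 4) for c in coverage])
```

Chaining the rounded rates gives roughly 97.6% after three rounds, slightly below the reported 97.8%; the difference is presumably due to rounding in the per-round percentages. The per-depth node counts also show why the depth limit matters: each extra level multiplies the number of CoC steps by 8.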

Highlights & Insights

  • Self-taught Paradigm: It does not rely on human annotations or stronger teacher models; the base model itself generates training data to teach itself. The "self-taught" concept is highly elegant.
  • Amortization of Inference Time \(\rightarrow\) Training Time: Moves the expensive tree search into a one-time training cost, so inference needs only a single generation pass with no search.
  • Pointback Mechanism: Ingeniously internalizes RAG-style retrieval capabilities into the generative model, avoiding the introduction of external retrievers.
  • Generalization Preservation: Performance on short-context tasks remains almost intact after fine-tuning (\(-0.6\) points), indicating well-constructed data.

Limitations & Future Work

  • Training data comes from a single dataset, NarrativeQA, so generalization to other domains may be limited by the lack of data diversity.
  • Search depth is restricted to 3 (limited by exponential computational costs); deeper reasoning may require alternative strategies.
  • The base model uses 8B parameters; whether larger models can benefit further remains unexplored.
  • The quality of CoC paths highly depends on the long-context understanding capability of the initial model; models with weak capabilities may fail to generate effective clarification questions.
  • Lacks an in-depth comparison against RAG + external retriever approaches.

Related Work & Insights

  • AgenticLU shares the same philosophy as the STaR (Self-Taught Reasoner) line of work, but extends it from mathematical reasoning to long-context understanding.
  • While LongRAG and Chain-of-Agents require multi-component/multi-agent collaboration, AgenticLU uses only a single LLM to orchestrate both reasoning and retrieval.
  • ProLong-8B requires 40B tokens of additional training data, whereas AgenticLU is much more data-efficient (17M generated tokens).
  • DPO, a method from the RLHF family used in LLM alignment, is applied here in a new setting: long-context understanding.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐