
Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations

Conference: NeurIPS 2025 · arXiv: 2509.09651 · Code: GitHub · Area: NLP Understanding / RAG / Domain-Specific QA · Keywords: RAG, Radio Regulations, Domain QA, Retrieval-Augmented Generation, ITU

TL;DR

This paper presents a dedicated RAG pipeline for radio regulations, a legally sensitive, high-stakes domain, and introduces the first multiple-choice evaluation benchmark for the ITU Radio Regulations. The proposed system achieves 97% retrieval accuracy and a +11.9-point gain in QA accuracy over GPT-4o without retrieval, substantially outperforming naive full-document insertion into the prompt.

Background & Motivation

Background: LLMs perform well in general-purpose QA but are severely limited by hallucination in legally sensitive domains such as radio regulations. ITU Radio Regulations constitute legally binding, technically dense text requiring precise interpretation; any error may cause spectrum interference, legal disputes, or critical infrastructure disruption.

Limitations of Prior Work:

  • General-purpose LLMs lack domain knowledge of radio regulations, yielding low direct-answer accuracy (GPT-4o: 59%)
  • Inserting the full document into the prompt is nearly ineffective (+0.6%), as the text is too long and terminology-dense for models to reliably locate relevant provisions
  • Existing telecom RAG work (e.g., Telco-RAG, Telco-DPR) targets 3GPP standards; no prior work addresses radio regulations

Key Challenge: Radio regulations require answers pinpointed to specific provisions, yet LLMs' parametric memory cannot cover such specialized legal text; naive retrieval may also return irrelevant or incomplete fragments due to poor chunking strategies.

Goal: (1) Build a RAG pipeline specifically for radio regulations; (2) Construct the first standardized evaluation benchmark for the radio regulation domain.

Key Insight: Decouple the evaluation of retrieval and generation, optimizing each independently. FAISS + Sentence-Transformers are employed for efficient retrieval, and a domain-specific retrieval correctness metric is defined via ROUGE-L.

Core Idea: A carefully designed, structured RAG pipeline with optimized chunking strategies and retrieval parameters outperforms naive document injection by an order of magnitude in regulatory QA.

Method

Overall Architecture

The pipeline consists of two stages: a Retrieval Block that retrieves the top-\(k\) most relevant passages from the regulatory corpus, followed by a Generation Block that concatenates the retrieved passages with the query and passes them to an LLM for answer generation. An optional LLM-based reranker can be inserted between the two stages.
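
As a concrete illustration, here is a minimal sketch of the two-stage flow. Assumptions: `retrieve` is the retrieval helper sketched after the Key Designs list below, `llm` is a hypothetical callable wrapping any off-the-shelf model, and the prompt template is ours, not the paper's.

```python
# Minimal sketch of the two-stage pipeline (not the paper's exact code).
def answer(query: str, index, chunks: list[str], llm, top_k: int = 7) -> str:
    # Retrieval Block: fetch the top-k most relevant passages.
    passages = retrieve(index, chunks, query, top_k=top_k)
    # Generation Block: concatenate passages with the query and call the LLM.
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)
```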

Key Designs

  1. Corpus Chunking and FAISS Retrieval:

    • Function: Split the full ITU Radio Regulations text into word-count-based chunks, encode them into dense vectors using Sentence-Transformers (all-MiniLM-L6-v2), and build a FAISS index.
    • Mechanism: At inference time, the user query is encoded and the top-\(k\) nearest neighbors are retrieved. Chunk size is the critical hyperparameter: 500–700-word chunks with top-7 retrieval achieve 97% retrieval accuracy, while chunks smaller than 300 words suffer catastrophic drops (down to 1% at 150 words) due to insufficient context.
    • Design Motivation: Provisions in radio regulations contain redundancy and cross-references; larger chunks preserve complete provision context. (A retrieval sketch follows this list.)
  2. Domain-Specific Retrieval Evaluation Metric:

    • Function: Define a retrieval correctness criterion to enable decoupled evaluation of retrieval and generation.
    • Mechanism: For each question, the ROUGE-L F1 score between the retrieved result \(R_i\) and the ground-truth context \(C_i\) is computed. Retrieval is deemed correct when \(F_1^{(i)} \geq \gamma \cdot F_{1,\max}\), where \(F_{1,\max}\) is the highest F1 attainable given the length difference between \(R_i\) and \(C_i\), and \(\gamma\) is a tolerance threshold.
    • Design Motivation: Provisions in regulatory documents are redundant, so strict exact matching underestimates retrieval quality; a proportional threshold is more appropriate. (A sketch of this criterion follows the list.)
  3. Evaluation Dataset Construction (First of Its Kind):

    • Function: Construct the first multiple-choice evaluation benchmark for ITU radio regulations.
    • Mechanism: A four-step pipeline—(1) extract full text from PDF and segment; (2) uniformly sample passages to ensure coverage; (3) generate multiple-choice questions using Flan-T5-XXL; (4) filter for quality using a telecom-domain expert model (Llama-3-8B-Tele), retaining only "Good"-rated items, followed by manual review.
    • Design Motivation: No standard benchmark exists in this domain; this constitutes a foundational contribution to advancing research in the area. (A construction sketch follows the list.)
  4. Optional Reranking:

    • Function: Apply LLM-based reranking of retrieved results between retrieval and generation stages.
    • Effect: Yields only ~+1% accuracy improvement while incurring 1.5× computational overhead; disabled by default.
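
A minimal sketch of Key Design 1, assuming the paper's best configuration (700-word chunks, top-7 retrieval); the helper names and code structure are ours, not the authors'.

```python
# Sketch of the retrieval block: word-count chunking + FAISS dense retrieval.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_by_words(text: str, chunk_size: int = 700) -> list[str]:
    """Split the regulations text into fixed word-count chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    """Encode chunks and index them; normalized vectors + inner product = cosine."""
    emb = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    return index

def retrieve(index: faiss.IndexFlatIP, chunks: list[str], query: str, top_k: int = 7) -> list[str]:
    """Return the top-k chunks nearest to the encoded query."""
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), top_k)
    return [chunks[i] for i in ids[0]]
```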
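For Key Design 2, a sketch of the correctness test. The closed form used for \(F_{1,\max}\), the best ROUGE-L F1 achievable when the shorter text is fully contained in the longer, is our reading of "accounts for length differences", not a formula quoted from the paper; \(\gamma = 0.7\) follows the value mentioned under Limitations.

```python
# Sketch of the retrieval-correctness criterion; rouge-score is Google's package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"])

def retrieval_correct(retrieved: str, gold: str, gamma: float = 0.7) -> bool:
    f1 = scorer.score(gold, retrieved)["rougeL"].fmeasure
    # Assumed form of F1_max: the maximum F1 given the two lengths,
    # reached when the shorter text is an exact subsequence of the longer.
    n_r, n_c = len(retrieved.split()), len(gold.split())
    f1_max = 2 * min(n_r, n_c) / (n_r + n_c)
    return f1 >= gamma * f1_max
```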
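And a sketch of the four-step benchmark construction (Key Design 3). The PDF extraction and generation calls use standard pypdf and transformers APIs, but the file name, prompt, passage length, sample size, and the quality-filter wrapper are all illustrative assumptions.

```python
# Sketch of the four-step benchmark pipeline; details are assumptions, not the paper's code.
import random
from pypdf import PdfReader
from transformers import pipeline

# Step 1: extract the full text from the PDF and segment into passages.
reader = PdfReader("radio_regulations.pdf")  # hypothetical filename
words = " ".join(page.extract_text() for page in reader.pages).split()
passages = [" ".join(words[i:i + 300]) for i in range(0, len(words), 300)]

# Step 2: uniformly sample passages to ensure coverage of the document.
sampled = random.sample(passages, k=min(500, len(passages)))  # sample size assumed

# Step 3: generate candidate multiple-choice questions with Flan-T5-XXL.
generator = pipeline("text2text-generation", model="google/flan-t5-xxl")
candidates = [
    generator(f"Write a multiple-choice question with four options about: {p}")[0]["generated_text"]
    for p in sampled
]

def rate_quality(item: str) -> str:
    """Hypothetical stand-in for the Llama-3-8B-Tele quality filter."""
    return "Good"  # the real pipeline prompts the telecom expert model here

# Step 4: keep only "Good"-rated items; manual review follows.
benchmark = [q for q in candidates if rate_quality(q) == "Good"]
```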

Loss & Training

No training is required—the entire pipeline is training-free: retrieval relies on pretrained embeddings and a FAISS index, while generation uses off-the-shelf LLMs. A key practical advantage is that only the FAISS index needs to be rebuilt when regulations are updated, with no model retraining required.
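
Reusing the helpers from the retrieval sketch above, an update amounts to re-chunking and re-indexing (illustrative only; the file name is hypothetical):

```python
# On a regulatory update: re-chunk and rebuild the FAISS index; no retraining.
new_text = open("radio_regulations_2024.txt").read()  # hypothetical updated corpus
chunks = chunk_by_words(new_text, chunk_size=700)
index = build_index(chunks)
```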

Key Experimental Results

Main Results

| Method | Accuracy | Gain |
|---|---|---|
| GPT-4o (no RAG) | 59.0% ± 0.5 | |
| GPT-4o + Full Document Insertion | 59.6% ± 0.4 | +0.6% |
| GPT-4o + RAG | 70.9% ± 0.8 | +11.9% |
| DeepSeek-R1-14B (no RAG) | 36.0% ± 1.0 | |
| DeepSeek-R1-14B + RAG | 59.0% ± 1.0 | +23.0% |

Ablation Study (Retrieval Hyperparameters)

| Chunk Size (words) | top-k | Retrieval Accuracy |
|---|---|---|
| 700 | 7 | 97% |
| 700 | 5 | 95% |
| 500 | 7 | 95% |
| 300 | 7 | 91% |
| 300 | 5 | 73% |
| 150 | 7 | 1% |

Key Findings

  • Structured retrieval vs. naive document insertion: GPT-4o with full document injection gains only +0.6%, while the RAG pipeline yields +11.9%, demonstrating that precise localization of relevant provisions is far more important than providing the complete document.
  • Chunk size is the most critical hyperparameter: 500–700 words is the optimal range; below 300 words, provisions are truncated and retrieval fails.
  • Smaller models benefit more: DeepSeek-R1-14B gains +23.0%, roughly double GPT-4o's +11.9%, indicating that RAG effectively compensates for limited parametric knowledge; the 1.5B variant gains only +3%, suggesting a minimum capacity is needed to exploit retrieved context.
  • Reranking yields marginal gains: Only +1% improvement, as initial retrieval quality is already sufficiently high.

Highlights & Insights

  • The finding that full-document insertion is ineffective carries broad implications: In long specialized-document QA, more context is not necessarily better; precise retrieval and localization matter far more than context length.
  • First radio regulation benchmark: The dataset construction pipeline—LLM-based generation, domain expert model filtering, and manual review—serves as a reusable template for building domain-specific evaluation benchmarks.
  • Training-free modular design: Only the FAISS index needs to be rebuilt upon regulatory updates, which is a key advantage for real-world deployment.

Limitations & Future Work

  • Evaluation limited to multiple-choice questions: Real-world regulatory consultation typically involves open-ended QA; multiple-choice format oversimplifies the task.
  • Single embedding model: Only all-MiniLM-L6-v2 is evaluated; no comparison with legal- or telecom-domain-specific embeddings is conducted.
  • Limited methodological novelty: The RAG pipeline combines existing techniques; core contributions lie in the dataset and domain application.
  • Threshold selection for retrieval correctness metric: The fixed value of \(\gamma = 0.7\) lacks sensitivity analysis.

Comparison with Related Work

  • vs. Telco-RAG: Telco-RAG targets 3GPP standards, whereas this paper addresses the ITU Radio Regulations, which carry stronger legal binding force.
  • vs. Tele-LLMs: Tele-LLMs inject domain knowledge via pretraining; this paper demonstrates that RAG can achieve comparable performance without any pretraining.
  • vs. TelecomGPT: TelecomGPT requires continual pretraining, instruction fine-tuning, and alignment fine-tuning, at substantially higher cost than the training-free approach proposed here.

Rating

  • Novelty: ⭐⭐⭐ — Methodologically a combination of standard RAG components; innovation lies primarily in domain application and the first evaluation benchmark.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-model comparison, retrieval ablation, and hyperparameter analysis are comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured; the decoupled evaluation of retrieval and generation is a commendable methodological choice.
  • Value: ⭐⭐⭐ — Clear domain value, though generalizability of the method is limited.