Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Conference: ACL 2025
arXiv: 2504.04745
Code: None
Area: LLM/NLP
Keywords: AMR, structured semantic representation, LLM, long-context understanding, zero-shot/few-shot prompting

TL;DR

The paper systematically evaluates whether LLMs can interpret and leverage Abstract Meaning Representation (AMR). AMR-augmented prompting improves performance on long-context tasks such as dialogue summarization (on SAMSum, Llama 3.1 zero-shot cosine similarity rises from 66% to 76%), but it typically degrades performance on short-context tasks.

Background & Motivation

Background: Structured semantic representations such as AMR encode the high-level semantics of text, but earlier approaches to exploiting them required changes to the model architecture (for example, graph attention mechanisms).

Limitations of Prior Work: It remains unclear whether general-purpose LLMs can directly understand and leverage linearized AMR without architecture modifications.

Core Idea: Directly use linearized AMR as part of LLM prompts to evaluate its impact on LLM performance across different tasks and context lengths.

Method

Overall Architecture

The study evaluates how well multiple LLMs understand and use AMR across a range of NLP tasks. AMRs are extracted from text with IBM's transition-based neural parser (the AMR3-structbart-L and doc-sen-conll-amr-seed42 models), linearized, and inserted directly into the LLM prompts. All experiments are run under three settings: zero-shot, 3-shot, and 5-shot.
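The linearization step can be illustrated with a minimal sketch. It assumes the parser emits PENMAN-style multi-line graphs and that linearization simply means flattening the graph to one line; the paper's exact procedure is not described here, and `linearize_amr` and the example graph are hypothetical, hand-written illustrations rather than parser output.

```python
import re


def linearize_amr(penman_text: str) -> str:
    """Collapse a multi-line PENMAN-style AMR into a single line.

    Metadata lines starting with '#' (e.g. '# ::snt ...') are dropped and
    every run of whitespace is replaced by one space.
    """
    graph_lines = [
        line for line in penman_text.splitlines()
        if not line.lstrip().startswith("#")
    ]
    return re.sub(r"\s+", " ", " ".join(graph_lines)).strip()


# Hand-written illustrative graph (assumption, not actual parser output).
example = """# ::snt The boy wants to go.
(w / want-01
    :ARG0 (b / boy)
    :ARG1 (g / go-02
        :ARG0 b))"""

print(linearize_amr(example))
```

The flattened string keeps the bracket structure, variables, and role labels intact, so it can be pasted into a prompt as ordinary text.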

Key Designs

Three Prompting Strategies:
- Context-only: Standard baseline
- AMR-augmented: Appends the AMR representation to the original text to evaluate whether AMR can enhance understanding
- AMR-only: Removes the original text and only provides the AMR, evaluating whether LLMs can extract sufficient information solely from structured representations
- Design Motivation: Distinguish whether AMR is more valuable as auxiliary information or as a replacement
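The three strategies above can be sketched as a single prompt builder. This is a minimal sketch under stated assumptions: the function name `build_prompt`, the strategy keys, and the `Text:` / `AMR:` section labels are hypothetical, and the paper's actual prompt templates are not reproduced here.

```python
def build_prompt(strategy: str, instruction: str, text: str, amr: str) -> str:
    """Assemble a prompt for one of the three strategies.

    strategy: 'context_only', 'amr_augmented', or 'amr_only' (hypothetical keys).
    """
    if strategy == "context_only":
        parts = [instruction, f"Text:\n{text}"]
    elif strategy == "amr_augmented":
        # The AMR is appended after the original text.
        parts = [instruction, f"Text:\n{text}", f"AMR:\n{amr}"]
    elif strategy == "amr_only":
        # The original text is withheld; only the AMR is provided.
        parts = [instruction, f"AMR:\n{amr}"]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_prompt(
        "amr_augmented",
        "Summarize the dialogue.",
        "A: The boy wants to go.",
        "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))",
    )
    print(demo)
```

Keeping the instruction identical across strategies means any difference in model output can be attributed to the presence or absence of the text and the AMR.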
Six Evaluation Tasks:
- AMR-to-text reconstruction: Evaluates the basic AMR understanding of LLMs
- Single-hop QA (SQuAD 2.0): Short-context reasoning
- Multi-hop QA (HotpotQA): Long-context reasoning (10 documents per question)
- Dialogue summarization (SAMSum): Long-dialogue understanding
- NLI (SNLI): Short-text natural language inference
- Document-level NLI (DocNLI): Long-text natural language inference
- Design Motivation: Cover the full spectrum from short context to long context, and from understanding to reasoning
Three Models:
- Llama3.1-8B-Instruct: Latest and largest
- Phi3-mini-128k-instruct: Medium-scale
- Mistral-7B-Instruct-v0.1: Older and smaller
- All models use 8-bit quantized versions
- Design Motivation: Observe the impact of model scale and recency on AMR utilization capabilities
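For readers who want to reproduce the setup, the following is a hedged sketch of loading the three models in 8-bit. It assumes the Hugging Face `transformers` library with `bitsandbytes` as the quantization backend; the summary does not say which toolchain the authors used, and the Hub ids in `MODEL_IDS` are assumptions inferred from the model names.

```python
# Hub ids are assumptions inferred from the model names in the list above.
MODEL_IDS = {
    "llama3.1": "meta-llama/Llama-3.1-8B-Instruct",
    "phi3": "microsoft/Phi-3-mini-128k-instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.1",
}


def load_8bit(name: str):
    """Load a tokenizer and an 8-bit quantized causal LM.

    Requires `transformers`, `accelerate`, `bitsandbytes`, and a CUDA GPU.
    The imports are deferred so this module can be imported without them.
    """
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # let accelerate place layers on available devices
    )
    return tokenizer, model
```

A call such as `tokenizer, model = load_8bit("llama3.1")` would download the weights on first use; gated models additionally require an authenticated Hub account.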

Loss & Training

The primary experiments use zero-shot and few-shot inference and require no training. As an additional comparison, Llama3.1 is fine-tuned with rank-32 LoRA on the SAMSum summarization task.
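A rank-32 LoRA setup could look like the sketch below, assuming the Hugging Face `peft` library. Only the rank comes from the text above; `lora_alpha`, `lora_dropout`, and `target_modules` are placeholder assumptions, and the helper names are hypothetical.

```python
LORA_HPARAMS = {
    "r": 32,                                 # rank stated in the summary
    "lora_alpha": 64,                        # assumption: not given in the summary
    "lora_dropout": 0.05,                    # assumption: not given in the summary
    "target_modules": ["q_proj", "v_proj"],  # assumption: attention projections
    "task_type": "CAUSAL_LM",
}


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r),
    so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def wrap_with_lora(model):
    """Attach LoRA adapters to an already loaded model (requires `peft`)."""
    from peft import LoraConfig, get_peft_model

    return get_peft_model(model, LoraConfig(**LORA_HPARAMS))


# Illustration with a hypothetical 4096 x 4096 projection matrix.
print(lora_extra_params(4096, 4096, LORA_HPARAMS["r"]))
```

The parameter count shows why LoRA is cheap: at rank 32, a 4096 x 4096 projection gains about 0.26M trainable parameters against roughly 16.8M frozen ones.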

Key Experimental Results

Main Results

| Task | Context | Model / setting | Context-only | AMR-augmented | AMR-only | Gain / note |
|---|---|---|---|---|---|---|
| SAMSum | Long dialogue | Llama3.1 0-shot | 66% cos | 76% cos | Moderate | +10 points |
| SAMSum | Long dialogue | Llama3.1 3-shot | High | Slightly higher | Medium | Positive |
| SQuAD | Short | Llama3.1 3-shot | 59% F1 | 52% F1 | 48% F1 | -7 points |
| AMR→Text | - | Llama3.1 5-shot | - | - | 81% cos | Basic understanding |
| SNLI | Short | Phi3 0-shot | 27% F1 | 39% F1 | 25% F1 | +12 points |
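The "cos" and "F1" figures in the table can be read with the following sketch in mind. It assumes that "cos" is the cosine similarity between embedding vectors of the generated and reference texts, and that the SQuAD F1 is the usual token-overlap F1; the embedding model is not named in this summary, so the function below takes precomputed vectors as input. Both function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(token_f1("the cat sat", "the cat"))
```

Differences between columns in the table are differences in these scores, so a move from 66% to 76% is a gain of 10 percentage points rather than a 10% relative improvement.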

Ablation Study

| Experiment | Results | Description |
|---|---|---|
| LoRA fine-tuning (SAMSum) | 75%→76% cos | Slight improvement after AMR fine-tuning, but underperforms few-shot |
| 3-shot vs 5-shot | Diminishing marginal returns | Negligible gains beyond 3 examples |
| HotpotQA (Long context) | No improvement with AMR | Since individual document AMRs are short, stacking multiple AMRs is not equivalent to long-context AMR |

Key Findings

Clear benefit in long-context tasks: AMR compresses lengthy dialogues into structured semantic graphs, helping LLMs retain the key points. Zero-shot SAMSum summarization with Llama 3.1 improved by 10 points (66% to 76% cosine similarity).
Typically detrimental for short-context tasks: AMR lengthens the input without adding useful information and can introduce parsing noise. For example, Llama 3.1 3-shot F1 on SQuAD drops from 59% to 52%.
Newer and larger models benefit more: Llama 3.1 (8B, most recent) benefits the most from AMR, whereas Mistral (7B, oldest) sees almost no gain.
LLMs possess basic AMR understanding capabilities: An 81% cosine similarity on AMR-to-text reconstruction (Llama 3.1, 5-shot) indicates that LLMs can recover most of the original semantics from linearized AMRs.
Reasonable performance of AMR-only on certain tasks: Under 3-shot SQuAD, AMR-only achieves 48% F1 against 59% for context-only, suggesting that AMRs retain much of the information needed to answer the questions.
The contrast between HotpotQA and SAMSum points to the key factor: What helps is not "long context" in general but the "AMR of a single long document"; stacking the AMRs of multiple short documents is not equivalent.

Highlights & Insights

First systematic evaluation of the direct understanding capabilities of LLMs on structured semantic representations, without any architectural modifications.
The finding that "long-context is beneficial, short-context is detrimental" provides clear guidance for the practical application of AMR: AMR augmentation should only be applied in long dialogue/document scenarios.
The reasonable performance in AMR-only experiments suggests a possible compression strategy: replacing raw text with AMR to reduce token counts, provided the linearized AMR is in fact shorter than the text it replaces.

Limitations & Future Work

Only LoRA fine-tuning is explored, not full fine-tuning, so the potential of fine-tuning may be underestimated.
The AMR parser itself may introduce errors, affecting downstream performance.
Effects of other structured representations (such as Knowledge Graphs, Discourse Representation Structures (DRS)) remain unexplored.
The HotpotQA experiments did not utilize Chain-of-Thought prompting, which might influence the conclusions.

Rating

Novelty: ⭐⭐⭐⭐ First systematic evaluation of LLM + AMR
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple tasks, models, and strategies
Writing Quality: ⭐⭐⭐⭐ Clear experimental design
Value: ⭐⭐⭐ Provides useful empirical guidance