Vulnerability of LLMs to Vertically Aligned Text Manipulations¶

Conference: ACL 2025
arXiv: 2410.20016
Code: None
Area: Robotics
Keywords: LLM Robustness, Vertical Text Format, Text Classification, Adversarial Attacks, Tokenization

TL;DR¶

This paper shows that LLMs are severely vulnerable to vertically aligned text: writing only a few keywords vertically lowers text classification accuracy by roughly 25-45 percentage points on most datasets (16-26 points on QNLI). CoT reasoning does not mitigate the problem, whereas few-shot prompting with carefully analyzed exemplars restores GPT-4 and GPT-4o to near-normal accuracy.

Background & Motivation¶

Background: Transformer-based LLMs have achieved exceptional performance in text classification tasks and are widely deployed in critical application scenarios such as sentiment analysis, toxic content detection, and spam filtering.

Limitations of Prior Work: Prior work shows that LLMs are sensitive to input formatting (e.g., line breaks, punctuation, word order) and that encoder-based models such as BERT are vulnerable to vertical text formatting. Whether decoder-based LLMs share this weakness had not been systematically investigated.

Key Challenge: Vertically aligned text remains easily understandable for humans but can severely mislead models. If LLMs fail to recognize vertically formatted keywords, malicious users could exploit this vulnerability to bypass toxic content detection systems.

Goal: To systematically evaluate the impact of vertical text formatting on various LLMs in text classification tasks, analyze the root causes, and explore mitigation strategies.

Key Insight: Converting only a handful of task-critical keywords to vertical form simulates a realistic formatting-manipulation attack; the evaluation covers both closed-source and open-source models.

Core Idea: Defects in LLM tokenization mechanisms and pre-training data render them incapable of understanding vertically aligned text, presenting a realistic threat to safety-critical applications like content moderation.

Method¶

Overall Architecture¶

The method comprises two core steps: Word Selection and Word Transformation, where selected keywords are converted from horizontal to vertical format while the rest of the text remains normal.

Key Designs¶

1. Word Selection¶

Function: Identifying the most critical words for classification from the text.
Mechanism: Utilizing a prompt-based LLM (GPT-4o-mini) as an evaluator to extract keywords, bypassing the high computational cost of traditional greedy evaluation methods that assess each word individually.
Design Motivation: Previous methods (Rusert, 2024) used a greedy search to evaluate the impact of each word on the prediction probability, which is computationally prohibitive for LLMs.
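
The selection step can be sketched as a single model call per text, rather than one call per word as in greedy search. This is a minimal sketch under stated assumptions: the prompt wording, the function names, and the `llm` callable are all hypothetical and stand in for whatever client and prompt the authors actually used with GPT-4o-mini.

```python
from typing import Callable

PUNCT = ".,!?;:\"'"

def build_selection_prompt(text: str, task: str, k: int) -> str:
    """Hypothetical prompt wording; not the paper's actual prompt."""
    return (
        f"Task: {task}\n"
        f"Text: {text}\n"
        f"List the {k} words in the text that matter most for this task, "
        "as a comma-separated list and nothing else."
    )

def select_keywords(text: str, task: str, k: int, llm: Callable[[str], str]) -> list[str]:
    """Ask `llm` (any prompt -> reply function) for keywords found in `text`."""
    reply = llm(build_selection_prompt(text, task, k))
    # Words that really occur in the text, lowercased and stripped of punctuation.
    present = {w.strip(PUNCT).lower() for w in text.split()}
    picked: list[str] = []
    for candidate in reply.split(","):
        word = candidate.strip().strip(PUNCT).lower()
        # Discard words the model invented and skip duplicates.
        if word in present and word not in picked:
            picked.append(word)
    return picked[:k]
```

Filtering the reply against the words actually present in the text is a design choice of this sketch, so that a hallucinated keyword cannot reach the transformation step.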

2. Word Transformation¶

Function: Embedding selected keywords into the original text in a vertical layout.
Mechanism: A five-step process: (1) tokenizing sentences into word lists and determining the vertical height; (2) initializing a 2D grid; (3) placing vertical word characters line by line; (4) aligning non-vertical words; (5) generating the final formatted string.
Design Motivation: Keep the text readable overall (non-vertical words stay horizontal) while transforming only the selected keywords.
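
The five steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `verticalize`, whitespace tokenization, and the choice to place horizontal words on the first row are assumptions. Keywords are expected in lowercase, and punctuation attached to a word is not stripped.

```python
def verticalize(sentence: str, keywords: set[str]) -> str:
    """Render `sentence` with every word in `keywords` written top-to-bottom."""
    # Step 1: split into words and find the height of the tallest vertical word.
    words = sentence.split()
    is_vertical = [w.lower() in keywords for w in words]
    height = max((len(w) for w, v in zip(words, is_vertical) if v), default=1)

    # Step 2: one grid row per output line; each row collects one cell per word.
    grid: list[list[str]] = [[] for _ in range(height)]

    for word, vertical in zip(words, is_vertical):
        width = 1 if vertical else len(word)
        for row in range(height):
            if vertical:
                # Step 3: one character per row, blank once the word runs out.
                cell = word[row] if row < len(word) else " "
            else:
                # Step 4: horizontal words sit on the first row (assumed);
                # lower rows get padding so later columns stay aligned.
                cell = word if row == 0 else " " * width
            grid[row].append(cell)

    # Step 5: join cells with single spaces and drop trailing whitespace.
    return "\n".join(" ".join(row).rstrip() for row in grid)


print(verticalize("this movie is bad", {"bad"}))
```

The printed result keeps "this movie is" on the first line and stacks the letters b, a, d in a single column after it.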

3. CoT Reasoning Attempts (Failed Mitigation Strategy)¶

Function: Guiding model reasoning by incorporating "think step by step" in the prompt.
Mechanism: The prompt elicits an explicit reasoning path, in the hope that the model will notice and reconstruct the vertical words.
Actual Results: CoT does not help models recognize vertical text; accuracy changes are small and inconsistent in sign (mostly within \(\pm 5\) percentage points).

4. Few-Shot Learning (Effective Mitigation Strategy)¶

Function: Providing 3 examples with detailed analysis to assist model learning.
Mechanism: Each exemplar comes with a hand-written analysis that shows how to identify the vertically formatted words and reconstruct them.
Design Motivation: Models appear to have no built-in notion of vertical text, so in-context exemplars are needed to teach them the format.
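
A few-shot prompt of this kind might be assembled as shown below. The `Exemplar` structure, the field labels, and the demo exemplar are hypothetical; the paper's actual exemplars and analyses are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    text: str       # input containing vertically written keywords
    analysis: str   # worked explanation that reconstructs the vertical words
    label: str      # gold label for the exemplar

def build_few_shot_prompt(instruction: str, exemplars: list[Exemplar], query: str) -> str:
    """Concatenate an instruction, worked exemplars, and the unanswered query."""
    parts = [instruction]
    for i, ex in enumerate(exemplars, start=1):
        parts.append(
            f"Example {i}:\nText:\n{ex.text}\nAnalysis: {ex.analysis}\nLabel: {ex.label}"
        )
    # End on "Analysis:" so the model writes its own reconstruction first.
    parts.append(f"Text:\n{query}\nAnalysis:")
    return "\n\n".join(parts)

# Hypothetical exemplar: "bad" is stacked in the last column.
demo = Exemplar(
    text="the film was b\n" + " " * 13 + "a\n" + " " * 13 + "d",
    analysis='Reading the last column downward gives "bad", so the text says "the film was bad".',
    label="negative",
)
prompt = build_few_shot_prompt("Classify the sentiment of each text.", [demo], "an unlabeled input")
```

The paper uses three exemplars per task; the list here holds one only to keep the sketch short.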

Loss & Training¶

This work is evaluative and does not involve training. The primary evaluation metric is classification accuracy: \(\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(y_i = \hat{y}_i)\).
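
As a small worked example of the metric, the sketch below computes accuracy and the drop in percentage points that the results tables report. The function names `accuracy` and `drop_in_points` are illustrative and do not come from the paper.

```python
def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of examples where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def drop_in_points(original_acc: float, vertical_acc: float) -> float:
    """Absolute accuracy drop in percentage points (not a relative change)."""
    return 100 * (original_acc - vertical_acc)

gold = ["pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "pos", "pos"]
print(accuracy(gold, pred))                 # 3 of 4 labels match
print(round(drop_in_points(0.93, 0.65)))    # 93% vs 65% accuracy
```

Reporting the drop in percentage points means that a fall from 93% to 65% counts as 28 points, not as a 30% relative decrease.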

Key Experimental Results¶

Main Results¶

Impact of vertical text on LLM accuracy in % (5 datasets; 4 vertical words per input, 2 for CoLA). Each cell reads original/vertical, with the drop in percentage points in parentheses:

Model	SST-2	CoLA	QNLI	Rotten Tomatoes	Jigsaw
GPT-3.5	93/65 (↓28)	80/47 (↓33)	85/69 (↓16)	92/57 (↓35)	85/62 (↓23)
GPT-4	96/67 (↓29)	90/49 (↓41)	89/71 (↓18)	93/64 (↓29)	89/58 (↓31)
GPT-4o	95/68 (↓27)	87/47 (↓40)	90/70 (↓20)	90/65 (↓25)	91/60 (↓31)
Llama3-8B	89/61 (↓28)	75/50 (↓25)	83/62 (↓21)	86/42 (↓44)	88/58 (↓30)
Llama3.1-70B	96/66 (↓30)	84/50 (↓34)	84/66 (↓18)	92/63 (↓29)	87/62 (↓25)
Qwen2-72B	96/60 (↓36)	84/50 (↓34)	88/62 (↓26)	93/59 (↓34)	91/59 (↓32)

Ablation Study¶

Effect of CoT on vertical-text classification accuracy (change in percentage points relative to the same model without CoT):

Model	SST-2	CoLA	QNLI	Rotten Tomatoes	Jigsaw
GPT-3.5 w/CoT	-4	+3	-10	-4	0
GPT-4 w/CoT	-1	+2	-3	-4	-2
GPT-4o w/CoT	+3	+5	+4	+1	+6
Llama3.1-8B w/CoT	+2	+2	+3	+2	-1
Gemma2-27B w/CoT	+3	+1	0	+3	-2

Recovery with few-shot learning (GPT series, 3-shot): given three analyzed exemplars, GPT-4 and GPT-4o recover to near-normal accuracy.

Key Findings¶

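Severity: Writing only 4 keywords vertically (2 for CoLA) lowers accuracy by roughly 25-45 percentage points on most datasets and by 16-26 points on QNLI. The largest drops in the main results are 44 points (Llama3-8B on Rotten Tomatoes) and 41 points (GPT-4 on CoLA).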
Security Threat: Negative sentiment identification rate on SST-2 dropped from 91% to 24%, and toxic content recognition on Jigsaw dropped from 86% to 28%.
CoT Ineffectiveness: Chain-of-Thought reasoning offers almost no help here, with accuracy changes mostly within \(\pm 5\) percentage points.
Few-Shot Effectiveness: 3-shot learning accompanied by detailed analysis can recover GPT-4/4o performance to near-normal levels.
Root Cause Analysis: Tokenization splits a vertical word into many unrelated tokens (e.g., "vertical" goes from 1 token to 15). In the attention matrix, these fragments no longer show strong attention associations with the key classification tokens.
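
One way to see where a count such as 15 could come from is sketched below. It assumes, and the source does not state this, that every character and every line break of a stacked word ends up as its own token. Real subword tokenizers may merge or split differently, and alignment whitespace is ignored here.

```python
def vertical_pieces(word: str) -> list[str]:
    """Split a vertically written word into its characters and the newlines between them."""
    stacked = "\n".join(word)   # "vertical" -> "v\ne\nr\nt\ni\nc\na\nl"
    return list(stacked)

pieces = vertical_pieces("vertical")
print(len(pieces))  # 15: 8 characters plus 7 newlines
```

Under this accounting a word of n characters becomes 2n - 1 pieces, none of which carries the meaning of the original word on its own.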

Highlights & Insights¶

Unique Safety Perspective: Studying formatting manipulation as a potential attack vector, providing practical warnings for the security of content moderation systems.
In-depth Root Cause Analysis: Unveiling the underlying mechanisms of vulnerability from both tokenization and attention matrix dimensions.
Counter-intuitive Finding: CoT reasoning, usually regarded as a standard way to improve comprehension, provides almost no benefit here; the models simply cannot "see" the vertical text.
Comprehensive Coverage: Evaluating 12 models (4 closed-source + 8 open-source) across 5 datasets supports the generality of the conclusions.

Limitations & Future Work¶

The paper does not explore whether fine-tuning can fundamentally resolve the issue.
Evaluation is restricted to text classification, leaving text generation tasks unaddressed.
The few-shot scheme requires manually designed exemplars for each task, limiting its practicality.
Whether incorporating vertical text data into the pre-training phase can enhance robustness was not discussed.
The impact of other unconventional text formats (diagonal, spiral, etc.) can be further explored.

Related Work & Insights¶

Rusert (2024): First to identify key vulnerabilities of encoder-based models to vertical text, which this paper extends to decoder-based LLMs.
Sclar et al. (2024): Studies LLM sensitivity to punctuation and line breaks.
Dong et al. (2024): Jailbreak attacks on LLMs; this paper provides a new attack surface from the perspective of formatting manipulation.
Insights: Robustness evaluation of existing LLMs might be far from sufficient, and formatting-level attacks remain an underestimated threat.

Rating¶

Dimension	Rating
Novelty	⭐⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐⭐
Writing Quality	⭐⭐⭐⭐
Value	⭐⭐⭐⭐