Is Linguistically-Motivated Data Augmentation Worth It?

Conference: ACL 2025
arXiv: 2506.03593
Code: GitHub
Area: Others
Keywords: Data Augmentation, Linguistic Knowledge, Low-Resource Languages, Machine Translation, Morphological Analysis

TL;DR

This study systematically compares the effectiveness of linguistically motivated and non-linguistic (random perturbation) data augmentation strategies across two low-resource languages, revealing that linguistic approaches yield advantages only when generated samples closely align with the training data distribution, and can otherwise be detrimental.

Background & Motivation

Data augmentation is a widely used technique to address data sparsity, but a fundamental question remains unresolved: Is the expert effort required to design linguistically-informed augmentation strategies truly worth it?

Simple methods are surprisingly effective: Random perturbations (e.g., word deletion, noise insertion)—even when they generate nonsensical or ungrammatical sentences—can nonetheless benefit model performance.

Linguistic methods are costly: They demand domain-specific linguistic expertise and involve high implementation complexity.

Lack of systematic comparison: There has been no prior systematic empirical study directly comparing these two classes of strategies, preventing developers from making well-informed decisions.

Variable task difficulty: For sequence-to-sequence tasks (translation, morphological tagging), the target label is also an unconstrained sequence, making it significantly harder to preserve label validity than in simple classification tasks.

This paper investigates two typologically and morphologically distinct low-resource languages: Uspanteko (a Mayan language of Guatemala, with <6,000 speakers) and Arapaho (an endangered Algonquian language of North America, with <300 fluent speakers).

Method

Overall Architecture

Both linguistic and non-linguistic augmentation strategies are designed and evaluated across two languages, three tasks per language (translation in both directions and morphological tagging), and five dataset scales. In total, over 1,080 models are trained.

Key Designs

1. Simulating the Linguistic Expert Process

The first author (who has graduate-level linguistics training but zero prior knowledge of either language) spent over a year and nearly 200 hours studying grammar references and bilingual dictionaries to construct grammatical sentences in both languages. This simulates the actual costs associated with employing a linguistic expert.

2. Uspanteko Augmentation Strategies (6 types)

Linguistic Strategies:

  • Upd-TAM: Modifies the tense-aspect-mood markers of verbs (e.g., perfective \(\leftrightarrow\) imperfective) while synchronously updating verb conjugations in the corresponding Spanish translation. It produces roughly 0.3 new samples per original sample.
  • Ins-Conj: Inserts a random conjunction or adverb (from a pool of 20 common ones) at the beginning of the sentence. Sentence-initial conjunctions are generally grammatical in Uspanteko. This generates up to 20 new samples per original sample.
  • Del-Excl: Randomly deletes words but never verbs, to avoid generating completely ungrammatical sentences.

Non-Linguistic Strategies:

  • Ins-Noise: Inserts a random word (from a pool of 20 words that are not conjunctions/adverbs) at the sentence beginning, creating a direct ungrammatical contrast to Ins-Conj.
  • Del-Any: Randomly deletes words from any position.
  • Dup: Randomly duplicates a word at some position.
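
The insertion, deletion, and duplication strategies above can be illustrated with a minimal sketch. It works on the source side only (a real pipeline would also have to update the translation or tagging target), the word pools are hypothetical placeholders rather than the paper's actual 20-word lists, and the function names are invented for illustration.

```python
import random

# Hypothetical placeholder pools; the paper's real word lists are not reproduced here.
CONJ_POOL = ["CONJ_A", "CONJ_B", "CONJ_C"]
NOISE_POOL = ["NOISE_A", "NOISE_B", "NOISE_C"]


def insert_initial(tokens, pool):
    """Ins-Conj / Ins-Noise style: one new sentence per pool word, inserted sentence-initially."""
    return [[word] + list(tokens) for word in pool]


def delete_word(tokens, rng, protected=frozenset()):
    """Del-Any style when `protected` is empty; Del-Excl style when it holds verb positions."""
    candidates = [i for i in range(len(tokens)) if i not in protected]
    if len(tokens) < 2 or not candidates:
        return None  # nothing can be deleted without emptying or violating the constraint
    drop = rng.choice(candidates)
    return list(tokens[:drop]) + list(tokens[drop + 1:])


def duplicate_word(tokens, rng):
    """Dup style: repeat one randomly chosen word right after itself (an assumed placement)."""
    if not tokens:
        return None
    i = rng.randrange(len(tokens))
    return list(tokens[:i + 1]) + [tokens[i]] + list(tokens[i + 1:])
```

The only difference between the Ins-Conj and Ins-Noise sketches is the pool passed in, which mirrors the contrastive design described above: the operation is held fixed and only the grammaticality of the inserted word changes.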

3. Arapaho Augmentation Strategies (3 types)

Linguistic Strategies:

  • Ins-Intj: Inserts interjections, greetings, or conjunctions (from 20 common words) at the sentence beginning.
  • Perm: Generates up to 10 word-order permutations. Since Arapaho exhibits free word order, these permuted sentences remain grammatical.

Non-Linguistic Strategies:

  • Ins-Noise: Inserts a random word (mostly nouns) at the beginning of the sentence.
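
As a rough illustration of Perm, the sketch below returns up to 10 distinct reorderings of a tokenized sentence. It assumes fully free reordering of whitespace tokens, which may be looser than whatever constraints the paper applies; `permute_sentence` and its parameters are hypothetical names.

```python
import itertools
import random


def permute_sentence(tokens, max_new=10, seed=0):
    """Return up to `max_new` distinct reorderings of `tokens`, excluding the original order."""
    tokens = list(tokens)
    rng = random.Random(seed)
    original = tuple(tokens)

    if len(tokens) <= 4:
        # Short sentences: enumerate every distinct permutation, then sample from them.
        candidates = sorted(p for p in set(itertools.permutations(tokens)) if p != original)
        rng.shuffle(candidates)
        return [list(p) for p in candidates[:max_new]]

    # Longer sentences: draw random shuffles and keep the unseen ones.
    seen, results, attempts = set(), [], 0
    while len(results) < max_new and attempts < 100 * max_new:
        attempts += 1
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        key = tuple(shuffled)
        if key != original and key not in seen:
            seen.add(key)
            results.append(shuffled)
    return results
```

A three-word sentence yields at most five new orders (3! minus the original), so short sentences contribute fewer augmented samples than the cap of 10 suggests.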

4. Combinatorial Strategies

  • Uspanteko: \(2^6 = 64\) combinations.
  • Arapaho: \(2^3 = 8\) combinations.
  • Explores whether strategy diversity is more effective than any single strategy.
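
The combination counts follow from treating each strategy as independently switched on or off. A small sketch that enumerates the subsets is shown below; the strategy names come from the lists above, while `all_subsets` is a hypothetical helper.

```python
from itertools import combinations

USPANTEKO_STRATEGIES = ["Upd-TAM", "Ins-Conj", "Del-Excl", "Ins-Noise", "Del-Any", "Dup"]
ARAPAHO_STRATEGIES = ["Ins-Intj", "Perm", "Ins-Noise"]


def all_subsets(strategies):
    """Every on/off combination of the strategies, including the empty one."""
    return [
        combo
        for size in range(len(strategies) + 1)
        for combo in combinations(strategies, size)
    ]


usp_combos = all_subsets(USPANTEKO_STRATEGIES)
arp_combos = all_subsets(ARAPAHO_STRATEGIES)
print(len(usp_combos), len(arp_combos))
```

Note that \(2^n\) counts the empty subset, which in this sketch corresponds to training with no augmentation at all.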

Loss & Training

  • Model: ByT5-small (300M parameters, byte-level processing to bypass tokenization issues).
  • Curriculum Learning: The model is first trained on local/augmented data, followed by training on the original data, with the optimizer being reset between phases.
  • Fixed training steps to control for the impact of different augmented data volumes.
  • Three random seeds per experimental configuration.
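
The fixed-step, two-phase schedule can be sketched without any deep learning library. The code below only builds the order in which batches would be consumed; the function names, the batch size, and the assumption that both phases use the same number of steps are illustrative choices, not details taken from the paper.

```python
from itertools import cycle, islice


def fixed_step_batches(examples, batch_size, num_steps):
    """Return exactly `num_steps` batches, cycling over the data as often as needed."""
    if not examples:
        raise ValueError("examples must be non-empty")
    stream = cycle(examples)
    return [list(islice(stream, batch_size)) for _ in range(num_steps)]


def curriculum_schedule(phase_one_data, original_data, batch_size, steps_per_phase):
    """Phase one first, then the original data, with the same step budget for each phase."""
    schedule = []
    for phase, data in (("phase_one", phase_one_data), ("original", original_data)):
        # A real training loop would build a fresh optimizer at the start of each phase,
        # which is one way to reset it between phases.
        for batch in fixed_step_batches(data, batch_size, steps_per_phase):
            schedule.append((phase, batch))
    return schedule
```

Because the number of steps is fixed, a strategy that generates 20 times more samples does not get 20 times more updates; a smaller dataset is simply cycled through more often.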

Key Experimental Results

Baseline Performance (Table 3)

| Task | 100 samples | 500 samples | 1000 samples | 5000 samples | Full |
|---|---|---|---|---|---|
| usp→esp (Translation) | 14.6 | 26.4 | 31.7 | 44.1 | 45.2 |
| esp→usp (Translation) | 13.7 | 23.1 | 29.1 | 39.6 | 40.6 |
| usp→igt (Tagging) | 18.4 | 53.9 | 65.2 | 74.5 | 75.4 |
| arp→eng (Translation) | 15.3 | 18.7 | 22.2 | 31.0 | 38.9 |
| eng→arp (Translation) | 21.8 | 27.4 | 30.7 | 40.4 | 46.2 |
| arp→igt (Tagging) | 17.7 | 38.7 | 51.2 | 68.0 | 76.7 |

Individual Strategy Effects (Core Findings from Figure 2)

| Strategy | Type | Translation Performance | Tagging Performance |
|---|---|---|---|
| Ins-Conj/Ins-Intj | Linguistic | ✅ Consistently improves | ✅ Mostly improves |
| Ins-Noise | Non-Linguistic | ✅ Consistently improves | ✅ Consistently improves |
| Upd-TAM | Linguistic | ✅ Slightly improves | ✅ Slightly improves |
| Dup | Non-Linguistic | ✅ Slightly improves | ✅ Moderately improves |
| Del-Any/Del-Excl | Mixed | ⚠️ Mixed results | ⚠️ Mixed results |
| Perm | Linguistic | ❌ Consistently worsens (-1+ chrF) | ❌ Worsens even further |

  • Ins-Conj shows a clear advantage over Ins-Noise in translation tasks.
  • Perm, despite being perfectly grammatical, consistently degrades performance, even when evaluated with a modified chrF metric that ignores word order.

Combinatorial Strategy Effects (Figure 4)

The optimal strategy is typically a combination of multiple augmentation methods. For Uspanteko, the best combination consistently includes Ins-Conj and/or Ins-Noise. The greatest absolute improvements are approximately +8 chrF (Uspanteko) and +3 chrF (Arapaho).

Permutation Strategy Verification (Table 5)

| Metric | Baseline | +Perm |
|---|---|---|
| chrF | 30.0 | 29.0 (-1.0) |
| Order-insensitive chrF | 30.9 | 29.9 (-1.0) |

The drop is the same (-1.0) under both metrics, so the negative effect of Perm does not come from the model learning "incorrect word orders". Rather, the augmented samples drift away from the natural data distribution.
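
To make the idea of an order-insensitive metric concrete, here is a simplified sketch. It is neither the paper's modified chrF nor the reference chrF implementation: `simple_chrf` is a stripped-down character n-gram F-score, and the order-insensitive variant assumes that sorting the words of both hypothesis and reference before scoring is an acceptable way to discard word order.

```python
from collections import Counter


def char_ngrams(text, n):
    """Character n-gram counts with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF: precision and recall averaged over n-gram orders, combined as F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # skip orders longer than either string
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def order_insensitive_chrf(hyp, ref, **kwargs):
    """Assumed variant: sort the words on both sides so that word order cannot affect the score."""
    def sort_words(s):
        return " ".join(sorted(s.split()))
    return simple_chrf(sort_words(hyp), sort_words(ref), **kwargs)
```

Under this sketch, a hypothesis that contains exactly the reference words in a different order scores 100 on the order-insensitive variant but lower on the plain one. If a model trained with Perm were merely scrambling word order, only the plain score should fall.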

Key Findings

  1. Key takeaway: Linguistic validity \(\neq\) Data distribution alignment. Grammatical but rare samples (such as Perm) impair performance, whereas grammatical and natural/common samples (such as Ins-Conj) indeed outperform non-linguistic methods.
  2. Most strategies actually slightly degrade performance—only a handful of strategies are consistently beneficial.
  3. Diminishing returns as training size scales up—obtaining more natural data is consistently more effective.
  4. Combinatorial strategies outperform single ones—strategy diversity matters more than the volume of a single strategy.
  5. Translating to a high-resource language is easier than translating from it—pretrained models have already mastered generating English/Spanish.
  6. Morphological tagging is significantly easier than translation due to the constrained output space.

Highlights & Insights

  • Highly rigorous experimental design: Direct contrastive pairs of linguistic/non-linguistic strategies (Ins-Conj vs. Ins-Noise) are crafted to isolate variables exhaustively.
  • Counter-intuitive findings: Grammatical but unnatural sentences can actually be harmful, challenging the naive assumption that "more correct data is always better".
  • Pragmatic advice for the linguistics community: Spending roughly 200 hours studying the languages to design custom augmentation strategies yields only a few chrF points of gain; this effort is far better spent on manual data collection and annotation.
  • Quantifying the 200-hour learning cost makes the question "Is it worth it?" concrete and open to discussion.

Limitations & Future Work

  • The number of augmented samples generated varies by strategy; although controlled by a fixed number of training steps, diversity itself might still act as a confounding factor.
  • Experiments are restricted to only two languages, which limits direct generalization to all morphological typologies.
  • The potential of LLMs as augmenters (e.g., prompting GPT to generate grammatical augmented data) was not investigated.
  • ByT5-small is the sole model architecture utilized; conclusions might vary with larger or alternative architectures.
  • There is a lack of quantitative metrics for measuring "data distribution alignment," relying instead on qualitative explanations.

Related Work & Insights

  • Wei & Zou (2019) EDA method: Random augmentation is effective for classification tasks.
  • Seo et al. (2023): Synthesized new samples through morpheme composition.
  • Lucas et al. (2024): Sampled sentences via finite-state machines / context-free grammars.
  • Dai & Adel (2020): Compared linguistic and non-linguistic augmentations in classification tasks, whereas this work extends the evaluation to sequence-to-sequence tasks.
  • Insight: Future data augmentation should prioritize distributional fidelity over mere linguistic grammaticality.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to systematically compare the impact of linguistic vs. non-linguistic augmentation on seq2seq tasks, backed by an exceptionally well-designed experimental setup.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 1080+ models, 5 training dataset sizes, 64 combinations, and 3 random seeds, showing rigorous control of experimental variables.
  • Writing Quality: ⭐⭐⭐⭐ — Highly articulate arguments and cautious conclusions without overclaiming.
  • Value: ⭐⭐⭐⭐ — Of immediate practical guidance to the low-resource NLP community; the core finding (distribution alignment mattering more than grammaticality) is broadly generalizable.