Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
Conference: ACL 2026
arXiv: 2604.19921
Code: https://github.com/wang-zijie/commonsense_with_negation
Area: LLM Pretraining
Keywords: Commonsense Knowledge, Negation Understanding, Knowledge Base Enhancement, Negation Reasoning, Pretraining
TL;DR
The paper proposes an automated method for adding negation to existing commonsense knowledge bases, builds a negated commonsense corpus of over 2 million triplets (¬Atomic and ¬Anion), and shows that pretraining on this corpus improves LLMs' understanding of negation.
Background & Motivation
Background: Commonsense knowledge has been widely studied, with large-scale KBs such as Atomic and ConceptNet being constructed. LLMs have achieved success across various NLU tasks.
Limitations of Prior Work: (1) LLMs struggle with natural language understanding tasks involving negation, yet prior research has been limited to encoder models like BERT and early LLMs like GPT-3; (2) The intersection of commonsense knowledge and negation remains largely unexplored; (3) Anion, the only commonsense KB involving negation, only negates "if" events, requires heavy manual annotation, and fails to consider negated "then" events.
Key Challenge: Negation appears in approximately 25% of English sentences and constitutes an important semantic feature; however, existing commonsense KBs contain almost no negation, leading to insufficient negation understanding in LLMs.
Goal: To automatically add negation to existing commonsense KBs, construct a large-scale negation commonsense corpus, and utilize it to improve the negation understanding of LLMs.
Key Insight: Negating the "if" event, the "then" event, or both sometimes yields a new triplet that still conforms to commonsense. Each existing triplet can therefore contribute up to three new candidate triplets.
Core Idea: By automatically negating the if/then events of commonsense triplets and training a specialized LLM judge to verify validity, a large-scale commonsense knowledge corpus containing negation is constructed. Pretraining on this corpus can improve downstream negation understanding.
Method
Overall Architecture
Given a commonsense triplet \(\langle A, R, B \rangle\), three new triplets \(\langle \neg A, R, B \rangle\), \(\langle A, R, \neg B \rangle\), and \(\langle \neg A, R, \neg B \rangle\) are generated by adding "not" before the main verb or modifier to negate the if-event (\(A\)), the then-event (\(B\)), or both. A specialized LLM judge is then trained to verify whether each new triplet is Valid (consistent with commonsense), Invalid (violating commonsense), or Ambiguous. Finally, the LLM is pretrained using the validated corpus.
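The expansion step can be sketched as follows. This is a minimal illustration rather than the authors' code: the `Triplet` class, the `negated_variants` function, and the example events are hypothetical, and `toy_negate` is a placeholder that only tags an event as negated, whereas the paper uses an LLM to insert "not" before the main verb or modifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Triplet:
    if_event: str    # A
    relation: str    # R
    then_event: str  # B


def negated_variants(t: Triplet, negate: Callable[[str], str]) -> List[Triplet]:
    """Return <not A, R, B>, <A, R, not B>, and <not A, R, not B>."""
    not_a = negate(t.if_event)
    not_b = negate(t.then_event)
    return [
        Triplet(not_a, t.relation, t.then_event),
        Triplet(t.if_event, t.relation, not_b),
        Triplet(not_a, t.relation, not_b),
    ]


def toy_negate(event: str) -> str:
    # Placeholder (assumption): tags the event instead of rewriting it.
    # A real pipeline would call an LLM to place "not" grammatically.
    return f"NOT({event})"


# Hypothetical example triplet, not taken from the corpus.
source = Triplet("PersonX eats breakfast", "xEffect", "PersonX feels full")
variants = negated_variants(source, toy_negate)
for v in variants:
    print(v)
```

Passing the negation function as a parameter keeps the combinatorial step (which events get negated) separate from the linguistic step (how negation is inserted), so the placeholder can be swapped for a model call without touching the rest.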
Key Designs
- Automatic Negation Generation:
  - Function: Automatically adds negation to commonsense triplets without manual annotation.
  - Mechanism: Llama 3.1 70B is used to insert "not" before the main verb or modifier of an event. Manual evaluation of 200 instances confirmed a 99% grammatical accuracy rate.
  - Design Motivation: To avoid the manual annotation costs associated with Anion while covering negations of then-events (Anion only negates if-events).
- LLM Judge (Automatic Validation):
  - Function: Automatically determines whether generated negation triplets conform to commonsense knowledge.
  - Mechanism: Off-the-shelf models such as GPT-4o and Claude Sonnet 4 perform poorly on this task in a few-shot setting (F1 of only 0.52–0.56). Llama 3.1 70B is therefore trained with Supervised Fine-Tuning (SFT) and QLoRA 4-bit quantization as a specialized judge, reaching an F1 of 0.63.
  - Design Motivation: To close the gap in LLM capability for judging negated commonsense. The fine-tuned judge reaches a precision of 0.70 on Valid and 0.79 on Invalid labels.
- Pretraining Enhancement Strategy:
  - Function: Utilizes the negation commonsense corpus to improve LLM negation understanding.
  - Mechanism: Both Valid and Invalid triplets are used as pretraining data, and the resulting models are evaluated on five downstream benchmarks spanning three tasks (QA, NLI, and Information Retrieval).
  - Design Motivation: Invalid triplets, not only Valid ones, help the model learn the semantics of negation.
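The judging and selection steps above can be sketched as a small routing function. Everything here is illustrative: `bucket_by_label`, `stub_judge`, and the example statements are hypothetical, the stub is a lookup table standing in for the fine-tuned judge, and leaving Ambiguous statements out of the pretraining pool is an assumption based on the summary's statement that Valid and Invalid triplets are used.

```python
from typing import Callable, Dict, Iterable, List

LABELS = ("Valid", "Invalid", "Ambiguous")


def bucket_by_label(
    statements: Iterable[str], judge: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group statements by the label the judge assigns to each one."""
    buckets: Dict[str, List[str]] = {label: [] for label in LABELS}
    for statement in statements:
        label = judge(statement)
        if label not in buckets:
            raise ValueError(f"unexpected judge label: {label!r}")
        buckets[label].append(statement)
    return buckets


# Hypothetical statements and labels, used only to exercise the code.
_STUB_LABELS = {
    "If PersonX does not eat breakfast, then PersonX feels hungry.": "Valid",
    "If PersonX eats breakfast, then PersonX does not feel full.": "Invalid",
}


def stub_judge(statement: str) -> str:
    # Stand-in (assumption) for the fine-tuned judge model.
    return _STUB_LABELS.get(statement, "Ambiguous")


candidates = list(_STUB_LABELS) + [
    "If PersonX does not eat breakfast, then PersonX does not feel tired."
]
buckets = bucket_by_label(candidates, stub_judge)

# Assumption: only Valid and Invalid statements feed pretraining.
pretraining_pool = buckets["Valid"] + buckets["Invalid"]
print(len(pretraining_pool), "statements selected")
```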
Loss & Training
The judge is trained with Supervised Fine-Tuning on Llama 3.1 8B/70B using QLoRA 4-bit quantization. The training data comprises 5,400 triplets (200 per relation per label, which with three labels corresponds to nine relations). For pretraining, commonsense triplets are converted into natural-language if-then statements.
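The conversion of triplets into if-then statements might look like the sketch below. The templates are assumptions made for illustration; the summary does not give the paper's actual wording. The relation names follow Atomic's naming style, and `verbalize` is a hypothetical helper.

```python
from typing import Dict

# Hypothetical templates (assumption); the paper's phrasing may differ.
TEMPLATES: Dict[str, str] = {
    "xEffect": "If {a}, then as a result, {b}.",
    "xReact": "If {a}, then {b}.",
}


def verbalize(if_event: str, relation: str, then_event: str) -> str:
    """Render one triplet as a natural-language if-then statement."""
    try:
        template = TEMPLATES[relation]
    except KeyError:
        raise ValueError(f"no template for relation {relation!r}") from None
    return template.format(a=if_event, b=then_event)


# Hypothetical negated triplet.
statement = verbalize(
    "PersonX does not eat breakfast", "xEffect", "PersonX feels hungry"
)
print(statement)
```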
Key Experimental Results
Main Results (Judge Validation)
| Model | Overall F1 | Overall Acc | Valid P | Invalid P |
|---|---|---|---|---|
| GPT-4o (few-shot) | 0.52 | 0.54 | 0.71 | 0.54 |
| Claude Sonnet 4 (few-shot) | 0.56 | 0.56 | 0.83 | 0.51 |
| Llama 3.1 70B (fine-tuned) | 0.63 | 0.64 | 0.70 | 0.79 |
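For readers who want to reproduce this kind of table on their own judge outputs, the sketch below computes per-label precision, macro-averaged F1, and accuracy for the three-way labels. The gold and predicted labels are toy data, and the use of macro averaging is an assumption: the summary does not say how the overall F1 is averaged.

```python
from typing import Dict, Sequence

LABELS = ("Valid", "Invalid", "Ambiguous")


def per_label_scores(
    gold: Sequence[str], pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for each label."""
    scores: Dict[str, Dict[str, float]] = {}
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: Dict[str, Dict[str, float]]) -> float:
    # Unweighted mean of per-label F1 (assumed averaging scheme).
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Toy labels, not data from the paper.
gold = ["Valid", "Valid", "Invalid", "Invalid", "Ambiguous", "Ambiguous"]
pred = ["Valid", "Ambiguous", "Invalid", "Invalid", "Valid", "Ambiguous"]

scores = per_label_scores(gold, pred)
overall_f1 = macro_f1(scores)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(scores["Valid"]["precision"], scores["Invalid"]["precision"])
```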
Corpus Statistics
| Corpus | Total Triplets | Valid | Invalid | Ambiguous |
|---|---|---|---|---|
| ¬Atomic | 1,798k | 681k (37.9%) | 463k (25.8%) | 652k (36.3%) |
| ¬Anion | 285k | 104k (36.4%) | 46k (16.1%) | 135k (47.5%) |
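The label shares in the table can be recomputed from the rounded counts with a few lines of arithmetic. This is only a consistency check on the numbers shown above (counts in thousands); `label_shares` is a hypothetical helper.

```python
from typing import Dict

# Counts in thousands, copied from the table above.
COUNTS_K: Dict[str, Dict[str, int]] = {
    "¬Atomic": {"total": 1798, "Valid": 681, "Invalid": 463, "Ambiguous": 652},
    "¬Anion": {"total": 285, "Valid": 104, "Invalid": 46, "Ambiguous": 135},
}


def label_shares(row: Dict[str, int]) -> Dict[str, float]:
    """Percentage of the total for each label, rounded to one decimal."""
    total = row["total"]
    return {
        label: round(100 * count / total, 1)
        for label, count in row.items()
        if label != "total"
    }


shares = {name: label_shares(row) for name, row in COUNTS_K.items()}
label_sums = {
    name: sum(count for label, count in row.items() if label != "total")
    for name, row in COUNTS_K.items()
}
print(shares)
print(label_sums)
```

Recomputed from the rounded counts, the ¬Atomic shares match the table, while two ¬Anion shares differ by 0.1 percentage points (36.5% and 47.4% here), and the ¬Atomic label counts sum to 1,796k rather than 1,798k. Both gaps are small enough to be explained by rounding the counts to thousands, although the summary does not confirm this.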
Key Findings
- Negating a then-event is more likely to produce an Invalid triplet (63.6%), whereas negating an if-event while keeping the original then-event remains Valid in most cases (83.7%).
- Triplets with both the if-event and the then-event negated are mostly Valid (48.0%) or Ambiguous (42.9%), with few Invalid cases (9.1%).
- Even SOTA models like GPT-4o and Claude Sonnet 4 show limited performance in negation commonsense judgment.
- Pretraining on the negation commonsense corpus improves LLM negation understanding across three downstream tasks: QA, NLI, and Information Retrieval.
Highlights & Insights
- The method is simple yet effective: adding "not" yields up to three new candidate triplets per original triplet, without requiring manual annotation of new then-events.
- The judge experiments expose a gap between generating and judging negation: Llama 3.1 70B inserts negation with 99% grammatical accuracy, yet few-shot GPT-4o and Claude Sonnet 4 reach only 0.52–0.56 F1 when judging whether the negated triplets conform to commonsense.
- Both Valid and Invalid triplets contribute to improving negation understanding, indicating that models need exposure to both positive and negative examples.
Limitations & Future Work
- The accuracy of the automatic judge is limited (overall F1 of 0.63), which may introduce noisy labels into the corpus.
- Currently, the approach is only validated in English; the manifestation of negation varies significantly across different languages.
- Pretraining effectiveness may depend on the choice of base model and on the amount of pretraining data.
- Future work could explore more complex forms of negation (e.g., double negation, implicit negation).
Related Work & Insights
- vs Anion: Anion only negates if-events and requires manual annotation of new then-events; this work automatically negates the if-event, the then-event, or both without manual labor.
- vs COMET: COMET generates new then-events; this work retains the original events and only adds negation, making it more controllable.
- vs UNcommonsense: UNcommonsense focuses on explanations for rare or uncommon scenarios, while this work focuses on the impact of negation on commonsense reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ Systematic integration of negation into commonsense KBs for the first time; concise and elegant approach.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation across three tasks and five benchmarks; thorough judge training.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, intuitive method, and detailed analysis.
- Value: ⭐⭐⭐ High resource contribution value, though the scope of application is relatively specific.