Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

Conference: ACL 2026 arXiv: 2604.19921 Code: https://github.com/wang-zijie/commonsense_with_negation Area: LLM Pretraining Keywords: Commonsense Knowledge, Negation Understanding, Knowledge Base Augmentation, Negation Reasoning, Pretraining

TL;DR

This paper proposes an automated method for augmenting existing commonsense knowledge bases with negation, constructs two large-scale negated commonsense corpora (¬Atomic and ¬Anion) totaling over 2 million triples, and demonstrates that pretraining on these corpora improves LLMs' negation understanding.

Background & Motivation

Background: Commonsense knowledge has been studied extensively; large-scale commonsense knowledge bases such as Atomic and ConceptNet have been constructed, and LLMs achieve strong performance on a wide range of NLU tasks.

Limitations of Prior Work: (1) LLMs struggle with natural language understanding tasks involving negation, yet prior research has been limited to encoder models such as BERT and early LLMs such as GPT-3; (2) the intersection of commonsense knowledge and negation remains largely unexplored; (3) the only commonsense knowledge base that addresses negation, Anion, negates only if-events (it does not consider negating then-events) and requires extensive human annotation.

Key Challenge: Negation appears in approximately 25% of English sentences and constitutes an important semantic feature; however, existing commonsense knowledge bases contain almost no negation, and LLMs exhibit insufficient negation understanding.

Goal: To automatically augment existing commonsense knowledge bases with negation, construct a large-scale negated commonsense corpus, and leverage it to improve LLMs' negation understanding.

Key Insight: The observation that negating if-events, then-events, or both can sometimes yield new triples that remain commonsensically valid, thereby enabling the expansion of existing corpora by up to a factor of three.

Core Idea: By automatically negating if/then events in commonsense triples and training a dedicated LLM-based judge to validate the results, a large-scale negation-augmented commonsense knowledge corpus is constructed; pretraining on this corpus improves downstream negation understanding.

Method

Overall Architecture

Given a commonsense triple \(\langle A, R, B \rangle\), three new triples \(\langle \neg A, R, B \rangle\), \(\langle A, R, \neg B \rangle\), and \(\langle \neg A, R, \neg B \rangle\) are generated by inserting "not" before the main verb or modifier in the if-event (\(A\)), then-event (\(B\)), or both. An LLM judge is then trained to classify each new triple as Valid (commonsensically sound), Invalid (commonsensically unsound), or Ambiguous, and the validated corpus is used to pretrain LLMs.
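
The following is a minimal sketch of this expansion step (the names and the `negate_event` placeholder are illustrative, not the authors' code): given a triple \(\langle A, R, B \rangle\), it produces the three negated candidates.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    if_event: str    # A
    relation: str    # R
    then_event: str  # B

def negate_event(event: str) -> str:
    # Placeholder: the paper prompts Llama 3.1 70B to insert "not" before
    # the main verb or modifier; a real pipeline would make that model call.
    return f"NOT[{event}]"

def expand_with_negation(t: Triple) -> list[Triple]:
    # Generate the three candidates <~A, R, B>, <A, R, ~B>, <~A, R, ~B>.
    neg_a = negate_event(t.if_event)
    neg_b = negate_event(t.then_event)
    return [
        Triple(neg_a, t.relation, t.then_event),
        Triple(t.if_event, t.relation, neg_b),
        Triple(neg_a, t.relation, neg_b),
    ]

# Atomic-style example triple
seed = Triple("PersonX studies hard", "xEffect", "PersonX passes the exam")
for candidate in expand_with_negation(seed):
    print(candidate)
```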

Key Designs

  1. Automatic Negation Generation:

    • Function: Automatically introduces negation into commonsense triples without human annotation.
    • Mechanism: Llama 3.1 70B is employed to insert "not" before the main verb or modifier of an event; manual evaluation of 200 instances confirms a grammatical correctness rate of 99%.
    • Design Motivation: To avoid the high annotation cost of the Anion-style approach, while also covering negation of then-events, which Anion does not address.
  2. LLM Judge (Automatic Validation):

    • Function: Automatically determines whether generated negated triples constitute valid commonsense knowledge.
    • Mechanism: Since state-of-the-art models such as GPT-4o and Claude Sonnet 4 perform poorly on this task (F1 of only 0.52–0.56), a dedicated judge is trained via supervised fine-tuning of Llama 3.1 70B using QLoRA 4-bit quantization, achieving an F1 of 0.63 (a prompt sketch follows this list).
    • Design Motivation: To address the gap in LLMs' ability to evaluate negated commonsense triples; Valid precision reaches 0.70 and Invalid precision reaches 0.79.
  3. Pretraining Augmentation Strategy:

    • Function: Leverages the negated commonsense corpus to improve LLMs' negation understanding.
    • Mechanism: Both Valid and Invalid triples are used as pretraining data; the resulting models are evaluated on five downstream benchmarks spanning three tasks: question answering, NLI, and information retrieval.
    • Design Motivation: Both Valid and Invalid triples contribute to the model's learning of negation semantics, rather than relying solely on Valid triples.
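
A hypothetical prompt for the judge in item 2 above; the exact instructions and label descriptions used in the paper are assumptions here, not reproduced from the source.

```python
# Assumed three-way classification prompt; the paper's actual wording may differ.
JUDGE_PROMPT = """You are given a commonsense if-then statement.
Decide whether it reflects sound commonsense knowledge.

Statement: If {if_event}, then {then_event}.

Answer with exactly one label: Valid, Invalid, or Ambiguous.
Label:"""

def build_judge_input(if_event: str, then_event: str) -> str:
    # The fine-tuned Llama 3.1 70B judge would complete this prompt with a label.
    return JUDGE_PROMPT.format(if_event=if_event, then_event=then_event)

print(build_judge_input("PersonX does not study hard", "PersonX passes the exam"))
```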

Loss & Training

QLoRA 4-bit quantization is applied to fine-tune Llama 3.1 8B/70B as the judge model, with training data comprising 5,400 triples (200 per relation per label). During the pretraining stage, commonsense triples are converted into natural language if-then statements.
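
A sketch of how validated triples might be verbalized into if-then statements for pretraining; the relation templates and the handling of Invalid triples below are assumptions, not the paper's exact format.

```python
# Assumed relation verbalizations (Atomic defines its own templates; the
# paper's exact phrasing may differ).
RELATION_TEMPLATES = {
    "xEffect": "If {a}, then as a result, {b}.",
    "xWant":   "If {a}, then PersonX wants {b}.",
    "xReact":  "If {a}, then PersonX feels {b}.",
}

def verbalize(if_event: str, relation: str, then_event: str, label: str) -> str:
    # Turn a (possibly negated) triple into a natural-language statement.
    sentence = RELATION_TEMPLATES[relation].format(a=if_event, b=then_event)
    if label == "Invalid":
        # Assumption: mark Invalid triples explicitly so the model can also
        # learn from commonsensically unsound statements.
        return "It is not the case that: " + sentence
    return sentence

print(verbalize("PersonX does not study hard", "xEffect",
                "PersonX passes the exam", "Invalid"))
```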

Key Experimental Results

Main Results (Judge Validation)

| Model | Overall F1 | Overall Acc. | Valid Precision | Invalid Precision |
| --- | --- | --- | --- | --- |
| GPT-4o (few-shot) | 0.52 | 0.54 | 0.71 | 0.54 |
| Claude Sonnet 4 (few-shot) | 0.56 | 0.56 | 0.83 | 0.51 |
| Llama 3.1 70B (fine-tuned) | 0.63 | 0.64 | 0.70 | 0.79 |

Corpus Statistics

| Corpus | Total Triples | Valid | Invalid | Ambiguous |
| --- | --- | --- | --- | --- |
| ¬Atomic | 1,798k | 681k (37.9%) | 463k (25.8%) | 652k (36.3%) |
| ¬Anion | 285k | 104k (36.4%) | 46k (16.1%) | 135k (47.5%) |

Key Findings

  • Negating then-events is more likely to yield Invalid triples (63.6%), whereas negating if-events while retaining the original then-events tends to produce Valid triples (83.7%).
  • Triples in which both if- and then-events are negated exhibit a more balanced distribution (Valid 48.0%, Invalid 9.1%, Ambiguous 42.9%).
  • Even state-of-the-art models such as GPT-4o and Claude Sonnet 4 show limited performance on negated commonsense judgment.
  • Pretraining on the negated commonsense corpus improves LLMs' negation understanding across downstream tasks in question answering, NLI, and information retrieval.

Highlights & Insights

  • The method is remarkably simple yet effective: inserting "not" alone can expand a commonsense knowledge base by a factor of three, with no need to annotate new then-events manually.
  • A "generation–evaluation gap" in LLMs is identified: models are capable evaluators but deviate from commonsense norms during generation, a finding consistent with observations in the counterfactual inference literature.
  • Both Valid and Invalid triples contribute to improved negation understanding, indicating that models benefit from exposure to both positive and negative examples.

Limitations & Future Work

  • The automatic validator achieves limited accuracy (F1 of 0.63), which may introduce noisy labels.
  • Validation is currently conducted only in English, whereas negation behaves differently across languages.
  • The gains from pretraining may depend on the base model and on the volume of augmented training data.
  • Future work could explore more complex negation forms, such as double negation and implicit negation.

Comparison with Related Work

  • vs. Anion: Anion negates only if-events and requires human annotation of new then-events; this work automatically negates if-events, then-events, or both without human involvement.
  • vs. COMET: COMET generates new then-events, whereas this work retains original events and introduces negation only, resulting in greater controllability.
  • vs. UNcommonsense: UNcommonsense focuses on explaining rare or atypical scenarios, whereas this work examines the effect of negation on commonsense reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic integration of negation into commonsense knowledge bases; the approach is concise and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation spans three tasks and five benchmarks, with thorough judge training.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clear, the method is intuitive, and the analysis is detailed.
  • Value: ⭐⭐⭐ — High value as a resource contribution, though the scope of application is relatively narrow.