A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity
Conference: ACL 2025 (Outstanding Paper)
Code: None
Area: Others
Keywords: Zipf's law, meaning-frequency relationship, contextual diversity, language models, word meaning quantification
TL;DR
This paper proposes to reformulate Zipf's meaning-frequency law as a power-law relationship between word frequency and contextual diversity. It quantifies the number of word meanings through the directional distribution of contextualized word vectors generated by language models. The findings reveal that this law is unobservable in small-scale language models, and autoregressive LMs require significantly more parameters than masked LMs to exhibit the law.
Background & Motivation
Background: Zipf's meaning-frequency law is one of the classic findings in computational linguistics, describing a power-law relationship between word frequency and the number of meanings, where high-frequency words tend to have more meanings. The law was originally proposed by Zipf in 1945, and subsequent studies validated it using the number of dictionary definitions. Modern NLP, with the aid of contextualized representations from language models, provides new tools to revisit this law.
Limitations of Prior Work: Traditional validation relies heavily on human-annotated dictionaries as the source for the number of word meanings, which has three serious limitations: (1) the number of dictionary senses is limited and cannot cover all semantic variations in language usage; (2) dictionaries are subject to the subjective judgment of lexicographers, and different dictionaries may partition the senses of the same word inconsistently; (3) validation is limited to a finite set of common words and cannot scale to rare words or domain-specific vocabulary in specialized corpora.
Key Challenge: Word meaning itself is a vague concept, lacking an objective, computable definition. Traditional methods equate the number of dictionary senses with the number of meanings, but these senses are artifacts of manual discretization and fail to reflect the true distribution of word meanings in a continuous semantic space.
Goal: The paper proposes a language-model-based, computable method for quantifying word meanings. It replaces traditional dictionary sense counting with "contextual diversity," thereby generalizing Zipf's law from discrete dictionary definitions to continuous semantic spaces.
Key Insight: Inspired by the theory of "low-entropy information centralization," the authors observe that language models generate different vector directions for the same word in different contexts. If a word has more varied semantic usages, the directional distribution of its contextualized vectors will be more dispersed, yielding higher contextual diversity.
Core Idea: The number of word meanings is quantified by the directional diversity of contextualized word vectors generated by language models, thus reformulating Zipf's law as a power-law relationship between word frequency and contextual diversity.
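Written out, the reformulation keeps the power-law form of the classic law and swaps the dependent variable. The symbols below are notation chosen here for illustration and may differ from the paper's: \(f\) is word frequency, \(m\) is the dictionary sense count, \(v\) is contextual diversity, and \(\delta\), \(\alpha\), \(c\) are fitted constants.

\[
m \propto f^{\delta}
\quad\longrightarrow\quad
v \propto f^{\alpha},
\qquad
\log v = \alpha \log f + c
\]

Under this reading, a power law shows up as a straight line with slope \(\alpha\) on a log-log plot of diversity against frequency.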
Method
Overall Architecture
The entire pipeline consists of three steps: (1) using a language model (such as BERT or the GPT family) to generate contextualized word vectors for words extracted from a large-scale corpus in various contexts; (2) for each word, calculating the directional distribution of all its contextualized vectors on a hypersphere, and quantifying its contextual diversity using a directional diversity metric (serving as a proxy for the "number of meanings"); (3) plotting the log-log graph of word frequency against contextual diversity to test if it fits a power-law relationship.
Key Designs
- Contextual Diversity Metric:
    - Function: To convert the continuous directional distribution of vectors into a scalar, serving as a proxy metric for the number of word meanings.
    - Mechanism: For a given word \(w\), all contextualized vectors \(\{v_1, v_2, \ldots, v_n\}\) of its occurrences in the corpus are collected. Each vector is normalized to a unit vector (projected onto a hypersphere), and the directional dispersion of these unit vectors is then computed. Specifically, this can be modeled using the reciprocal of the concentration parameter of the von Mises-Fisher distribution, or directly by calculating the complement of the average cosine similarity. More dispersed directions indicate higher contextual diversity, meaning the word plays more distinct semantic roles across different contexts.
    - Design Motivation: Compared to traditional dictionary senses, this continuous metric avoids manual discretization bias and scales automatically to any vocabulary and corpus.
- Analysis of Language Model Scale and Law Observability:
    - Function: To reveal the relationship between the observability of Zipf's law and the scale of language models.
    - Mechanism: The aforementioned process is repeated across various LM sizes (ranging from millions to billions of parameters) to inspect how the goodness of fit of the power-law relationship changes with model scale. The authors systematically compare multiple models including BERT-base/large and different sizes of GPT-2. It is found that small-scale LMs generate contextualized vectors with insufficient discriminative capabilities, causing the directional diversity of all words to collapse towards uniformity, which makes the power-law relationship unobservable.
    - Design Motivation: If contextual diversity is indeed a valid proxy for the number of word meanings, it should depend on the semantic understanding capabilities of the LM. Representations learned by smaller LMs are not sufficiently fine-grained, failing to distinguish between words with high and low numbers of senses.
- Comparative Analysis of Masked LMs and Autoregressive LMs:
    - Function: To compare how the two LM architectures differ in manifesting Zipf's law.
    - Mechanism: BERT (masked LM) and GPT-2 (autoregressive LM) families are systematically compared under comparable parameter sizes. Results indicate that autoregressive LMs require significantly more parameters than masked LMs to make Zipf's law observable. For instance, BERT-base (110M parameters) already manifests the power-law relationship well, whereas GPT-2 Small (117M parameters) does not.
    - Design Motivation: Masked LMs directly learn semantic representations of words at specific positions using bidirectional contexts, whereas autoregressive LMs can only utilize unidirectional information, thereby requiring larger capacity to compensate for the asymmetry in information direction.
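The contextual diversity metric described in the first design above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It uses only the Python standard library, and the function names (`diversity_cosine`, `diversity_vmf`, `noisy_copies`) are hypothetical. The vMF concentration is estimated with the closed-form approximation of Banerjee et al. (2005), which is an assumption, since the summary does not say how the parameter is estimated. The toy vectors are random stand-ins for real contextualized embeddings.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (project it onto the hypersphere)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_resultant_length(vectors):
    """Length of the mean unit vector: 1.0 for identical directions, near 0 when spread out."""
    units = [normalize(v) for v in vectors]
    dim = len(units[0])
    mean = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    return math.sqrt(sum(m * m for m in mean))


def diversity_cosine(vectors):
    """Complement of the average pairwise cosine similarity."""
    units = [normalize(v) for v in vectors]
    n = len(units)
    if n < 2:
        raise ValueError("need at least two occurrences of the word")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum(a * b for a, b in zip(units[i], units[j]))
    return 1.0 - total / (n * (n - 1) / 2)


def diversity_vmf(vectors):
    """Reciprocal of an approximate vMF concentration (assumed estimator, see lead-in)."""
    dim = len(vectors[0])
    r = mean_resultant_length(vectors)
    if r >= 1.0:          # all directions coincide: infinite concentration
        return 0.0
    kappa = r * (dim - r * r) / (1.0 - r * r)
    return math.inf if kappa == 0.0 else 1.0 / kappa


def noisy_copies(center, noise, n, rng):
    """Toy stand-in for contextualized vectors: Gaussian noise around one direction."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
center = [1.0] + [0.0] * 15
focused = noisy_copies(center, 0.05, 50, rng)   # word used in similar contexts
varied = noisy_copies(center, 1.0, 50, rng)     # word used in very different contexts

print("cosine diversity:", diversity_cosine(focused), diversity_cosine(varied))
print("vMF diversity:   ", diversity_vmf(focused), diversity_vmf(varied))
```

Both measures grow as the directions spread out, so either can serve as the scalar plotted against frequency. The pairwise cosine version is quadratic in the number of occurrences, whereas the mean-resultant-length version is linear, which matters for very frequent words.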
Loss & Training
This work does not involve training new models and instead utilizes existing pre-trained language models. The key analysis method is to perform linear regression fitting on word frequency and contextual diversity in log-log space, using the \(R^2\) value to measure the goodness of fit of the power-law relationship.
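The fitting step can be sketched as ordinary least squares in log-log space. The function name `fit_power_law`, the `min_frequency` filter (mirroring the frequency-threshold row of the ablation table), and the synthetic data are assumptions made for illustration. None of the numbers below are results from the paper.

```python
import math


def fit_power_law(frequencies, diversities, min_frequency=0):
    """Fit log(diversity) = slope * log(frequency) + intercept and report R^2.

    Words whose frequency is not above `min_frequency` are dropped first.
    """
    pairs = [(f, v) for f, v in zip(frequencies, diversities) if f > min_frequency]
    if len(pairs) < 2:
        raise ValueError("need at least two words after filtering")
    xs = [math.log(f) for f, _ in pairs]
    ys = [math.log(v) for _, v in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all remaining words have the same frequency")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0.0 else float("nan")
    return slope, intercept, r2


# Synthetic data that follows v = 0.01 * f^0.5 exactly (made up for the demo).
freqs = [10, 100, 1000, 10000, 100000]
divs = [0.01 * f ** 0.5 for f in freqs]

slope, intercept, r2 = fit_power_law(freqs, divs, min_frequency=100)
print("exponent:", slope, "prefactor:", math.exp(intercept), "R^2:", r2)
```

On real data the points scatter around the line, and an \(R^2\) close to 1 is what the summary calls an observable law.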
Key Experimental Results
Main Results
| Language Model | Parameters | Corpus | \(R^2\) (Power-law Fit) | Observable |
|---|---|---|---|---|
| BERT-base | 110M | Wikipedia | High | Yes |
| BERT-large | 340M | Wikipedia | Higher | Yes |
| GPT-2 Small | 117M | Wikipedia | Low | No |
| GPT-2 Medium | 345M | Wikipedia | Medium | Partially Observable |
| GPT-2 Large | 774M | Wikipedia | Relatively High | Yes |
| GPT-2 XL | 1.5B | Wikipedia | High | Yes |
Ablation Study
| Configuration | Power-law Fit \(R^2\) | Description |
|---|---|---|
| BERT-base average of all layers | High | Uses average vectors across all layers |
| BERT-base last layer | Slightly lower | Uses only the last layer |
| BERT-base middle layers | Highest | Middle layers are richest in semantic information |
| Randomly initialized BERT | Extremely low | Untrained models fail to manifest the law |
| Word frequency thresholding (>100) | More stable | Insufficient statistical samples for low-frequency words |
Key Findings
- A power-law relationship indeed exists between contextual diversity and word frequency, supporting the validity of Zipf's law in continuous semantic spaces.
- Masked LMs exhibit this law more readily than autoregressive LMs at comparable parameter sizes, suggesting an advantage of bidirectional contextual modeling.
- Model scale is a critical factor: the contextualized representations of LMs with too few parameters lack sufficient discriminative power.
- Vectors from the middle layers are more suitable for computing contextual diversity than those from the last layer, aligning with prior findings that "middle layers are rich in semantic information."
Highlights & Insights
- Extending a classic linguistic law from discrete senses to a continuous semantic space stands as an elegant theoretical contribution. Contextual diversity, as a proxy metric for the number of word meanings, avoids reliance on manual dictionaries and allows for automated, large-scale validation.
- Discovering the intrinsic differences between autoregressive LMs and masked LMs in manifesting Zipf's law reveals the performance gap in semantic representation between the two architectures from an unexpected angle. This perspective is novel and theoretically profound.
- This methodology can be transferred to other tasks requiring the quantification of word meaning richness, such as the automated evaluation of word sense disambiguation (WSD) systems, research on semantic evolution, and polysemy detection.
Limitations & Future Work
- Although contextual diversity correlates with the number of word meanings, it is not an exact measurement, presenting a degree of conceptual indirectness.
- Validation was conducted solely on English corpora and English language models; the cross-linguistic generalizability of this reformulated Zipf's law requires further investigation.
- The granularity of word meanings was not considered: teasing apart two closely related meanings of a word in vector space remains challenging.
- Due to sparse occurrences, the estimation of contextual diversity for low-frequency words suffers from high variance, which affects the reliability of the power-law fitting.
Related Work & Insights
- vs Traditional Zipf's Law Studies (Zipf 1945, Miller 1957): Traditional studies rely on dictionary senses. This paper replaces discrete counts with continuous representations generated by LMs, achieving a more general and scalable formulation.
- vs Word Sense Disambiguation (WSD) Methods: WSD treats word meaning as discrete labels for classification. In contrast, this work does not require a predefined set of senses and directly measures semantic diversity from continuous space.
- vs Analysis of Contextualized Embeddings (Ethayarajh 2019): While Ethayarajh investigated the anisotropy of BERT representations across different layers, this paper further links directional analysis with classic linguistic laws, providing deeper theoretical significance.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Ingeniously links information theory with a classic linguistic law. The unique perspective makes the Outstanding Paper award highly deserved.
- Experimental Thoroughness: ⭐⭐⭐⭐ The design of systematic comparative experiments across multiple models and scales is rigorous.
- Writing Quality: ⭐⭐⭐⭐⭐ Starting from a linguistic law, the logical thread is clear, tightly coupling theory with experimentation.
- Value: ⭐⭐⭐⭐ Makes a significant contribution to computational linguistics theory, although practical application scenarios remain relatively limited.