The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

Conference: ACL 2026
arXiv: 2605.07409
Code: None
Area: Causal Inference / Computational Social Science / Representation Measurement
Keywords: Construct Validity, Semantic Embeddings, Causal Representation, Counterfactual Neutralization, Social Measurement

TL;DR

This paper argues that labeling geometric distances between NLP embeddings directly as social constructs such as creativity, bias, or innovation constitutes a "Proxy Presumption." It proposes a Construct Validity Protocol and Counterfactual Neutralization to turn heuristic proxies into verifiable measurement instruments.

Background & Motivation

Background: NLP is shifting from a pure prediction tool to a measurement tool for computational social science. Many works utilize sentence vectors, document vectors, or LLM embeddings to measure abstract social concepts, such as paper novelty, text creativity, political bias, social norms, or toxicity.

Limitations of Prior Work: These works often default to the assumption that "cosine distance in embedding space is a specific social construct." The problem is that embeddings simultaneously encode a large number of nuisance factors, including topic, style, author, length, register, time, and institution; geometric distance is not naturally equivalent to theoretical variables.

Key Challenge: Researchers aim to measure a latent construct \(C\), but the model only observes text \(D\), which is generated jointly by \(C\) and confounding factors \(Z\). Without explicit assumptions, interventions, or validation, \(C\) cannot be identified from unsupervised representations of \(D\).

Goal: The paper does not aim to negate embedding-based measurement but intends to establish a minimum methodological standard for such measurements: "define the construct first, design the tool second, and report validity evidence last."

Key Insight: The authors place construct validity from social sciences, validity cards from psychometrics, and non-identifiability from causal representation learning within the same framework, demonstrating that the core issue of NLP proxies is not insufficient model size but the lack of identification of the measurement target.

Core Idea: Rewrite the NLP social measurement process using the language of causal identification and psychometrics, transforming "embedding similarity" from a default proxy into a measurement instrument that must undergo counterfactual, discriminant validity, and incremental validity testing.

Method

This is a position-and-synthesis paper. The methodological contributions primarily include theoretical formalization, operational protocols, and forensic literature analysis. The main thread is as follows: first, prove why unsupervised embeddings cannot automatically identify social constructs; second, provide intervention points to mitigate confounding; and finally, organize these intervention points into a Construct Validity Protocol.

Overall Architecture

The paper views a document as being jointly generated by a target construct \(c\) and a nuisance vector \(z\): \(p_{\theta}(D \mid c, z)\). A standard NLP measurement workflow first uses an encoder \(E\) to map text to an embedding \(e\), then uses a proxy function \(f(e)\) to output a scalar score. The authors point out that if \(E\) is learned unsupervised, the coordinate system of \(e\) can undergo arbitrary rotations, and a specific dimension or distance function does not necessarily correspond to \(c\).
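
The rotation argument can be checked numerically. The sketch below is an illustration under assumed settings, not the authors' code: the latent layout (coordinate 0 as the construct, the rest as nuisances), the dimensionality, and the sample size are all hypothetical. It shows that an isotropic Gaussian prior assigns the same density to \(h\) and to a rotated \(Rh\), while the first rotated coordinate is a linear mixture of construct and nuisances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent layout: coordinate 0 is the construct c,
# the remaining coordinates are the nuisance vector z.
n, dim = 5000, 4
h = rng.standard_normal((n, dim))            # h = [c; z] ~ N(0, I)

# A random orthogonal matrix R, obtained from a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
h_rot = h @ R.T                              # row-wise h' = R h


def log_density_std_normal(x):
    """Row-wise log density of the isotropic Gaussian N(0, I)."""
    d = x.shape[1]
    return -0.5 * (x ** 2).sum(axis=1) - 0.5 * d * np.log(2 * np.pi)


# The prior cannot tell the two coordinate systems apart ...
same_density = np.allclose(log_density_std_normal(h),
                           log_density_std_normal(h_rot))

# ... yet the first rotated coordinate mixes c with every nuisance.
mixing_weights = R[0]                        # h'_0 = sum_j R[0, j] * h_j
corr_with_c = np.corrcoef(h_rot[:, 0], h[:, 0])[0, 1]

print("identical prior density:", same_density)
print("mixing weights of h'_0:", np.round(mixing_weights, 3))
print("correlation of h'_0 with c:", round(float(corr_with_c), 3))
```

If a decoder absorbs the inverse rotation, the observed text distribution is unchanged as well, which is why a likelihood objective alone gives no reason to prefer the construct-aligned axis.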

In a single-document scenario, this means a "toxicity score" might simultaneously incorporate dialect, register, and topic. In a dual-document scenario, this means the cosine distance of "paper novelty" might merely reflect differences in topic, terminology shifts, or writing style changes rather than conceptual contributions.

To address this, the authors propose a three-layer mitigation path. The first layer disentangles at the input, for example by extracting only construct-relevant text segments, standardizing writing style, or masking entities. The second layer disentangles at the representation level, for example through adversarial removal, iterative nullspace projection, or contrastive learning. The third layer applies counterfactual neutralization at the scoring-function level: the score of a neutral version of the text, which retains the nuisances but weakens the target construct, is subtracted from the score of the observed text.

These technical means are further integrated into the Construct Validity Protocol (CVP). CVP consists of three phases: Conceptualization, Operationalization, and Validity Suite. It requires researchers to first clarify what the construct is and is not, and which nuisances might confound it; then design measurement instruments that reduce confounding; and finally, report evidence of whether the proxy is truly measuring the target construct using stability, convergent validity, discriminant/incremental validity, known-groups, and predictive validity.
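
One way to make the Phase 3 reporting requirement concrete is a small record type that flags which evidence is still missing. This is a hypothetical sketch: the class name `ValidityCard`, its field names, and the `missing_evidence` helper are illustrative assumptions, not a schema defined in the paper.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ValidityCard:
    """Hypothetical container for the evidence types named by the protocol."""
    construct: str
    nuisances: List[str] = field(default_factory=list)
    stability: Optional[str] = None      # reliability / stability
    convergent: Optional[str] = None     # agreement with independent instruments
    discriminant: Optional[str] = None   # how much nuisances explain the proxy
    incremental: Optional[str] = None    # signal added beyond nuisances
    known_groups: Optional[str] = None   # separation of groups expected to differ
    predictive: Optional[str] = None     # criterion-related evidence

    def missing_evidence(self) -> List[str]:
        """Return the evidence fields that have not been filled in."""
        skip = {"construct", "nuisances"}
        return [f.name for f in fields(self)
                if f.name not in skip and getattr(self, f.name) is None]


# Example card filled only with the three evidence types this summary reports
# for the GoEmotions worked example; the other fields are left empty here
# because the summary does not list them.
card = ValidityCard(
    construct="gratitude",
    nuisances=["length/style", "topic"],
    stability="ICC(2,1)=0.8467, ICC(2,k)=0.9779",
    discriminant="topic R^2=0.7762",
    incremental="AUC 0.9658 -> 0.9831",
)
print(card.missing_evidence())
```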

Key Designs

  1. Non-identifiability Argument:

    • Function: Explains why the assumption that "unsupervised embeddings automatically disentangle social constructs" is invalid.
    • Mechanism: The authors assume the latent variables \(h=[c;z]\) follow an isotropic Gaussian prior. For any orthogonal rotation matrix \(R\), the rotated latent space \(h'=Rh\) yields the same observed distribution as the original latent space; thus, unsupervised likelihood objectives cannot distinguish the "true construct coordinate" from a linear mixture of the construct and nuisances.
    • Design Motivation: This elevates the proxy presumption from an empirical critique to an identification problem. Even if \(c\) and \(z\) are independent in the real world, model-learned embeddings may mix them, a problem that cannot be solved automatically by larger corpora or deeper encoders.
  2. Counterfactual Neutralization:

    • Function: Offsets the contributions of nuisances such as topic, style, and entities to the proxy at the scoring function level.
    • Mechanism: Instead of reporting \(f(e_{obs})\) directly, a counterfactual neutral text is constructed (e.g., removing stance, novelty claims, or emotional expression while retaining topical content) to calculate \(\hat{C}=f(e_{obs})-f(e_{base})\). This differential score aims to isolate the portion caused by changes in the target construct.
    • Design Motivation: In many scenarios, full nuisance labels are unavailable, and retraining embeddings is inconvenient. Counterfactual rewriting provides a text-native intervention that leverages LLM generative capabilities to explicitly manipulate construct intensity while keeping the input readable.
  3. Construct Validity Protocol:

    • Function: Provides a reportable, reproducible, and auditable validity process for embedding-based social measures.
    • Mechanism: Phase 1 produces a construct map, facet blueprint, and three-tiered exemplar set; Phase 2 specifies how the input, representation, and scoring functions control nuisances; Phase 3 reports a Validity Card, including reliability/stability, convergent validity, discriminant and incremental validity, known-groups validity, and criterion-related evidence.
    • Design Motivation: Many papers only prove that a proxy is "useful" or "correlated" but fail to prove it is not a surrogate for another nuisance. CVP decomposes "measurement accuracy" into multiple complementary pieces of evidence, emphasizing discriminant and incremental validity as they best expose topic/style surrogacy.
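
The differential score \(\hat{C}=f(e_{obs})-f(e_{base})\) from the second design can be sketched with a toy pipeline. Everything below is an assumption made for illustration: the bag-of-words `encode` function stands in for a real embedding model, the anchor direction is deliberately contaminated with the topical word "report", and the neutral text is written by hand where the paper envisions an LLM rewrite.

```python
import numpy as np

# Toy vocabulary; a real pipeline would use a sentence encoder instead.
VOCAB = ["thanks", "grateful", "appreciate", "the", "report", "was",
         "sent", "on", "monday", "for", "sending", "so"]
INDEX = {word: i for i, word in enumerate(VOCAB)}


def encode(text):
    """Toy encoder E: L2-normalized bag-of-words counts."""
    vec = np.zeros(len(VOCAB))
    for token in text.lower().split():
        if token in INDEX:
            vec[INDEX[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Proxy direction contaminated by a topic word ("report"), mimicking a
# gratitude proxy that also picks up topical content.
ANCHOR = encode("thanks grateful appreciate report")


def proxy(e):
    """Proxy f(e): cosine similarity to the anchor (both are unit vectors)."""
    return float(e @ ANCHOR)


def neutralized_score(observed, neutral):
    """C_hat = f(e_obs) - f(e_base)."""
    return proxy(encode(observed)) - proxy(encode(neutral))


observed = "thanks so grateful for sending the report on monday"
neutral = "the report was sent on monday"   # hand-written neutral rewrite

raw_score = proxy(encode(observed))
neutral_score = proxy(encode(neutral))
score = neutralized_score(observed, neutral)
print(raw_score, neutral_score, score)
```

In this toy setting the neutral text still scores above zero purely because of the topic word, and the difference removes that share from the raw score. The subtraction is only as good as the neutral rewrite: if the rewrite also shifts topic or length, the residual is contaminated again.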

Loss & Training

This paper does not propose a new model requiring end-to-end training but offers measurement protocols and plug-and-play intervention strategies. If operating at the representation layer, adversarial removal or nullspace projection can be used to suppress nuisance labels; if operating at the input and scoring layers, LLM extraction, style standardization, entity anonymization, and counterfactual neutralization can be used. The core optimization goal is not to improve classification accuracy but to enhance the interpretability, stability, and incremental signal of the proxy relative to the target construct.
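
For the representation-layer option, the sketch below shows a single linear projection that removes the direction of the embeddings most covariant with a nuisance score. It is a simplified stand-in, not iterative nullspace projection itself (which repeatedly trains probes and projects out their weight vectors), and the synthetic embeddings, the scalar nuisance, and the helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 8

# Synthetic setup: one latent nuisance and one latent construct, each
# written into the embedding and then mixed by a random rotation.
nuisance = rng.standard_normal(n)      # hypothetical nuisance, e.g. a topic score
construct = rng.standard_normal(n)     # hypothetical target construct
X = 0.1 * rng.standard_normal((n, dim))
X[:, 0] += nuisance
X[:, 1] += construct
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
X = X @ Q                              # no axis is construct-aligned any more


def r_squared(features, target):
    """In-sample R^2 of a least-squares fit of target on features."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()


def project_out(X, target):
    """Remove the single direction with maximal covariance with target."""
    Xc = X - X.mean(axis=0)
    w = Xc.T @ (target - target.mean())          # cross-covariance direction
    w = w / np.linalg.norm(w)
    P = np.eye(X.shape[1]) - np.outer(w, w)      # projector onto its nullspace
    return Xc @ P


X_clean = project_out(X, nuisance)
r2_before = r_squared(X, nuisance)
r2_after = r_squared(X_clean, nuisance)
r2_construct_after = r_squared(X_clean, construct)
print(r2_before, r2_after, r2_construct_after)
```

After the projection every embedding coordinate has zero sample covariance with the nuisance, so a linear probe can no longer recover it, while the construct stays linearly recoverable in this synthetic case. Nonlinear leakage is not addressed by this step.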

Key Experimental Results

Main Results

The empirical portion of the paper includes a GoEmotions worked example and a forensic literature audit of 17 social measurement papers. The GoEmotions example demonstrates the operability of CVP, while the literature audit illustrates that the current community generally lacks the most critical validity evidence.

| Validation Step | Setting | Key Findings | Implication |
| --- | --- | --- | --- |
| Stability Card 1 | GoEmotions gratitude dimension, 2 encoders × 2 pooling × 2 text normalizations, 8 proxy variants total, n=2000 | AUC for variants: 0.9407-0.9662, ICC(2,1)=0.8467, ICC(2,k)=0.9779 | The proxy is quite stable under similar implementations, but stability \(\neq\) identification. |
| Discriminant Validity Step 1 | Predicting proxy using length/style features and TF-IDF+SVD topic block | Length/Style \(R^2=0.0245\), Topic \(R^2=0.7762\), Full nuisance block \(R^2=0.7768\) | The embedding proxy is largely recoverable from the topic, indicating high surrogacy risk. |
| Incremental Validity Step 2 | Adding proxy to a nuisance-only model to predict human gratitude labels | AUC increased from 0.9658 to 0.9831, \(\beta_{inc}>0\) | Although strongly explained by topic, the proxy still provides additional signals; risks and benefits must be reported simultaneously. |
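
The discriminant and incremental checks in the table can be sketched on synthetic data. This is an illustrative setup, not a reproduction of the GoEmotions analysis: the mixing weights, the two nuisance features, the hand-rolled logistic regression, and the train/test split are assumptions, and the resulting numbers are unrelated to the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
topic = rng.standard_normal(n)                 # hypothetical topic nuisance
length = rng.standard_normal(n)                # hypothetical length/style nuisance
construct = rng.standard_normal(n)             # latent target construct
proxy = 0.85 * topic + 0.45 * construct + 0.1 * rng.standard_normal(n)
logits = 2.0 * construct + 1.0 * topic
labels = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
nuisance = np.column_stack([topic, length])


def r_squared(features, target):
    """In-sample R^2 of a least-squares fit of target on features."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return 1.0 - (target - A @ coef).var() / target.var()


def fit_logistic(X, y, n_iter=25, ridge=1e-3):
    """Ridge-regularized logistic regression fitted with Newton steps."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (p - y) + ridge * beta
        hess = (A * (p * (1.0 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        beta -= np.linalg.solve(hess, grad)
    return beta


def auc(scores, y):
    """Rank-based AUC (assumes no tied scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Step 1, discriminant: how much of the proxy do nuisances alone explain?
r2_nuisance = r_squared(nuisance, proxy)

# Step 2, incremental: does the proxy add out-of-sample signal?
train, test = slice(0, n // 2), slice(n // 2, n)
full = np.column_stack([nuisance, proxy])
beta_nuis = fit_logistic(nuisance[train], labels[train])
beta_full = fit_logistic(full[train], labels[train])
auc_nuisance_only = auc(beta_nuis[0] + nuisance[test] @ beta_nuis[1:], labels[test])
auc_full = auc(beta_full[0] + full[test] @ beta_full[1:], labels[test])
beta_inc = beta_full[-1]                       # coefficient on the proxy
print(r2_nuisance, auc_nuisance_only, auc_full, beta_inc)
```

Reporting both quantities together mirrors the table: a high nuisance \(R^2\) signals surrogacy risk, while a positive held-out AUC gain shows the proxy still carries signal beyond the nuisances.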

Ablation Study

The "ablation" here is closer to a methodological diagnosis: the authors do not remove model modules but audit whether existing social measurement papers report different types of validity evidence.

| Validity Dimension | Yes | Partial | No | Interpretation |
| --- | --- | --- | --- | --- |
| Construct Validity | 10 | 7 | 0 | Most papers define constructs, but with varying degrees of rigor. |
| Face/Content Validity | 6 | 11 | 0 | Usually based on examples or expert intuition, but lacks systematicity. |
| Reliability / Stability | 11 | 4 | 2 | Reliability is the most reported dimension, usually via annotator agreement or perturbation stability. |
| Convergent Validity | 1 | 12 | 4 | Rarely use independent instruments of the same construct as benchmarks. |
| Discriminant Validity | 0 | 11 | 6 | No paper fully proves the proxy is not just a nuisance like topic/style. |
| Predictive Validity | 1 | 3 | 13 | External criterion evidence is significantly insufficient. |
| Handling Confounders | 0 | 14 | 3 | Most only perform heuristic control or regression on covariates, not an identification strategy. |

Key Findings

  • The core empirical evidence is that the topic block in GoEmotions explains roughly 78% of the proxy's variance (\(R^2=0.7762\)), a direct illustration that embedding proxies may primarily track topical structure.
  • Stability is not a sufficient condition. A proxy can be stable across encoders and pooling methods but still consistently measure the wrong thing.
  • The audited literature is weakest on discriminant validity and confound handling, which are exactly the gaps that the Proxy Presumption describes.
  • The value of Counterfactual Neutralization lies in turning the "text-intervenability" of NLP into a measurement tool, rather than relying solely on post-hoc correlation analysis.
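
The point that stability is not identification can be illustrated with the standard two-way random-effects ICC formulas (Shrout and Fleiss). The sketch assumes that proxy variants play the role of raters and texts the role of targets; the synthetic variants all track a topic signal rather than the construct, and the function name `icc2` is a hypothetical helper.

```python
import numpy as np


def icc2(scores):
    """ICC(2,1) and ICC(2,k) for an (n targets x k raters) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)                       # between-target mean square
    ms_cols = ss_cols / (k - 1)                       # between-rater mean square
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average


rng = np.random.default_rng(0)
n_texts, n_variants = 500, 8
topic = rng.standard_normal(n_texts)       # nuisance that every variant tracks
construct = rng.standard_normal(n_texts)   # target construct, ignored by the variants

# Eight hypothetical proxy variants: the same topic signal plus small noise.
variants = topic[:, None] + 0.3 * rng.standard_normal((n_texts, n_variants))

icc_single, icc_average = icc2(variants)
corr_with_construct = np.corrcoef(variants.mean(axis=1), construct)[0, 1]
print(icc_single, icc_average, corr_with_construct)
```

The variants agree closely with one another, so both ICC values come out high, yet their average is essentially uncorrelated with the construct. High agreement across implementations therefore says nothing about what is being measured.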

Highlights & Insights

  • The paper names a common but vague problem as the Proxy Presumption and provides a clear explanation via causal representation learning: embedding geometry is not a social theoretical variable but an operation on mixed representations.
  • The strength of CVP is bringing mature construct validity language from social sciences back to NLP, upgrading paper writing from "I defined a score" to "I proved this score measures the target within reasonable bounds."
  • Counterfactual Neutralization is well-suited for NLP because text can be rewritten, extracted, and anonymized by LLMs; this is more flexible than construct manipulation in images or structured tables.
  • This paper is highly significant for downstream causal inference. No matter how sophisticated the subsequent causal ML is, if the input variables themselves are not valid measurements, the causal conclusions remain uninterpretable.

Limitations & Future Work

  • The paper is primarily a methodological framework and position statement; it does not fully execute CVP on a new social measurement task, so the costs, benefits, and failure modes of CVP remain unquantified.
  • Counterfactual Neutralization depends on the quality of LLM rewriting. If the LLM changes the topic or register while trying to "retain nuisance but change the construct," the differential score will still be contaminated.
  • Validity Cards increase paper and engineering workflow costs, especially when requiring expert samples, independent gold instruments, or external criteria, which small teams might find difficult to execute fully.
  • Future work could develop standardized benchmarks that hold the construct and the nuisance set fixed while varying embeddings and counterfactual strategies, so that proxies can be compared systematically on discriminant and incremental validity tests.

Related Work & Insights

  • vs. WEAT / embedding bias measurement: Works like WEAT measure bias via embedding association; this paper does not negate such tools but points out that association scores need to be proven not to be surrogates for word frequency, register, or corpus structure.
  • vs. Causal representation learning: Traditional causal representation emphasizes the non-identifiability of latent factors; this paper migrates this conclusion to social measurement, stating that embedding-based proxies must introduce structural assumptions or interventions.
  • vs. Psychometrics: Psychometrics has long distinguished between construct and measure; this paper translates reliability, convergent validity, discriminant validity, and criterion evidence into an executable checklist for NLP.
  • Insights for future research: Any paper using LLMs/embeddings to generate "social variables" should at minimum report nuisance blocks, the degree to which the proxy can be predicted by nuisances, and whether there is out-of-sample incremental explanatory power after adding the proxy.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Incisively integrates existing measurement theory, causal representation learning, and critiques of NLP proxies; a strong conceptual contribution despite not proposing a new model.
  • Experimental Thoroughness: ⭐⭐⭐☆☆ The GoEmotions case and 17-paper forensic audit support the arguments, but a complete prospective empirical pipeline is missing.
  • Writing Quality: ⭐⭐⭐⭐☆ Clear structure and accurate terminology; "Proxy Presumption" and "Validity Card" are effective for community dissemination.
  • Value: ⭐⭐⭐⭐⭐ Direct methodological value for computational social science, embedding measurement, LLM-based evaluation, and downstream causal inference.