Research Borderlands: Analysing Writing Across Research Cultures

Conference: ACL 2025
arXiv: 2506.00784
Code: shaily99/research_borderlands
Area: Other
Keywords: research culture, cultural norms, scientific writing, LLM evaluation, interdisciplinary research, cultural competence
Authors: Shaily Bhatt (CMU), Tal August (UIUC), Maria Antoniak (Copenhagen)

TL;DR

Through interviews with interdisciplinary researchers, this work constructs a cultural norm framework for academic writing (comprising four categories: structure, style, rhetoric, and citation). It quantifies writing differences across 11 CS communities using computational metrics, revealing a severe "homogenization" tendency in LLMs during cross-community writing adaptation.

Background & Motivation

  • Limitations of Prior Work: In evaluating the cultural competence of LLMs, the concept of "culture" is vaguely defined. Most studies rely on coarse-grained proxy variables such as nationality or language, lacking in-depth interaction with community members.
  • Key Insight: Academic writing serves as an explicit carrier of cultural norms—different research communities (such as NLP, HCI, and Education) hold distinct implicit expectations regarding paper structure, terminology, and argumentation styles.
  • Goal: To discover and measure cultural norms in a bottom-up, "human-centered" manner, rather than presetting proxy variables top-down. The study also aims to evaluate whether LLMs can adapt to the cultural norms of different research communities when functioning as writing tools.
  • Ecological Validity: A survey (\(N=78\)) confirms that "cross-community paper adaptation" is a real-world task faced by researchers, which grounds the rest of the study.

Method

1. Qualitative Study: Discovering Cultural Norms

  • Survey (\(N=78\)): Targeting interdisciplinary researchers to confirm that paper adaptation is a real-world demand; 66 respondents reported experience in cross-community adaptation, and almost all of them adjust the Introduction section during adaptation.
  • Semi-structured Interviews (\(N=10\)): Conducted with senior interdisciplinary scholars for 60 minutes each; prior to the interviews, participants were asked to provide multiple versions of the Introduction of the same paper submitted to different communities for comparison.
  • Coding Analysis: Two authors independently coded the first two interview transcripts, establishing a unified cultural norm framework after three weeks of iterative discussions.

2. Cultural Norm Framework (Four Categories)

Category | Norm Dimension | Example Differences
Structural Norms | Length, use of tables/figures | NLP papers 8-9 pages vs. FAccT papers 14 pages; CV community uses more figures, NLP community uses more tables
Stylistic Norms | Terminology/jargon, readability, formality, redundancy | The NLP community does not require explanations for "RoBERTa"; Education uses words like "preponderance"; Humanities allow informal prose
Rhetorical Norms | Quantitative evidence, figurative language, framing, narrative organization | ML/NLP emphasize numerical evidence; Humanities favor narrative storytelling; 9/10 interviewees consider "reframing" the most critical adaptation step
Citation Norms | Classic citations, citation interaction styles | The same concept has different seminal citations across communities (e.g., "mental models" in Cognitive Science vs. "folk theories" in HCI); Humanities often quote directly at the beginning

3. Computational Evaluation Suite

Operationalizes quantifiable norms in the framework into computational metrics:

  • Structure: Word count, sentence count (NLTK tokenization), and rates of table/figure mentions (regular expressions).
  • Style: Terminology specificity (NPMI specificity score), formality (DeBERTa-large fine-tuned on GYAFC), and readability (Flesch reading-ease).
  • Rhetoric: Percentage of quantitative evidence (Llama 3.1 70B as judge, showing 93% agreement with human labels), narrative organization (skewness of sentence positions classified by function), and value framing (10-dimensional value vector based on a lexicon classifier with 72.95% precision).
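
The closed-form metrics in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the released evaluation suite: the regular expression, the function names, the count-based NPMI inputs, and the use of population (Fisher-Pearson) skewness are all assumptions made here. Only the Flesch reading-ease constants are the standard published ones.

```python
import math
import re

# Assumed pattern (not from the paper): matches "Table 2", "Figure 3", "Fig. 1".
MENTION_RE = re.compile(r"\b(?:Table|Figure|Fig\.?)\s*\d+", re.IGNORECASE)


def mention_rate(text: str) -> float:
    """Table/figure mentions per 100 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    return 100.0 * len(MENTION_RE.findall(text)) / len(words)


def npmi(word_in_comm: int, word_total: int, comm_total: int, corpus_total: int) -> float:
    """Normalized PMI between a word and a community, in [-1, 1].

    All arguments are token counts; higher values mean the word is
    more specific to the community.
    """
    if word_in_comm == 0:
        return -1.0
    p_joint = word_in_comm / corpus_total
    p_word = word_total / corpus_total
    p_comm = comm_total / corpus_total
    denom = -math.log(p_joint)
    if denom == 0.0:  # the word makes up the whole corpus; NPMI is undefined
        return 0.0
    return math.log(p_joint / (p_word * p_comm)) / denom


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch reading-ease score from raw counts."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def skewness(positions: list[float]) -> float:
    """Population skewness of normalized sentence positions in [0, 1]."""
    n = len(positions)
    mean = sum(positions) / n
    m2 = sum((x - mean) ** 2 for x in positions) / n
    m3 = sum((x - mean) ** 3 for x in positions) / n
    if m2 == 0.0:
        return 0.0
    return m3 / m2 ** 1.5
```

Under this convention, a positive skewness means that sentences of a given function (for example, objective sentences) cluster early in the Introduction, with a thin tail toward the end.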

4. LLM Cultural Competence Evaluation

  • Task: Given an Introduction from a source community, instruct the LLM to adapt it to the style of a target community.
  • Data Sampling: (a) 100 random papers per community pair; (b) 100 papers with the highest specificity per community pair (better resembling real-world adaptation scenarios).
  • Models: GPT-3.5 Turbo, GPT-4o Mini, Llama 3.1 8B, Llama 3.3 70B, and Mistral Ministral 8B; 5 samples per prompt, totaling 550,000 generations.
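
As a sanity check on the stated total, the short sketch below counts generations under one assumed breakdown: every ordered (source, target) pair among the 11 communities, both sampling schemes, all five models, and five samples per prompt. The breakdown is an inference that happens to reproduce 550,000; the summary does not spell it out, and the community names are placeholders.

```python
from itertools import permutations

# Placeholder names; the summary does not list all 11 communities.
communities = [f"community_{i}" for i in range(11)]

# Assumption: every ordered (source, target) pair is evaluated.
ordered_pairs = list(permutations(communities, 2))

papers_per_pair = 100 + 100   # random sample + highest-specificity sample
n_models = 5
samples_per_prompt = 5

total_generations = len(ordered_pairs) * papers_per_pair * n_models * samples_per_prompt
print(len(ordered_pairs), total_generations)
```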

Key Experimental Results

Differences in Writing Norms Across 11 CS Communities (Section 6, Figure 3)

Metric | Key Findings
Length | Economics & Computation is the longest, while NLP is relatively short
Figures/Tables | The CV community uses the most figures, while the NLP community uses the most tables
Terminology Specificity | All communities have positive values, with the Education community being the most distinctive
Formality | Differences across communities are minor (as all are CS subfields)
Quantitative Evidence | ML/NLP/AI exhibit the smallest variance, suggesting that quantitative evidence is a strong cultural norm in these fields
Narrative Organization | Objective sentences appear earlier in ML/NLP/AI; results sentences also appear earlier in the AI community

Table 2: LLM Cultural Competence Evaluation (Section 7, targeting ML and NLP communities)

Observation | Details
Vocabulary Adaptation Success | Specificity almost always increases after adaptation across all LLMs, suggesting that models capture vocabulary differences between communities
Homogenization in Other Dimensions | Length is consistently shortened, mention rates of tables/figures consistently decrease, readability consistently drops, and the proportion of quantitative evidence slightly increases; successes occur only by chance, when the target direction happens to align with the model's default shift
Rigid Narrative Organization | The skewness of background and methodology sentences increases, while that of objective sentences decreases, showing that all models tend to converge on a unified template
Framing Similarity | The cosine similarity of framing remains almost unchanged before and after adaptation; LLMs fail to adjust the value framing based on the target community
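
The two checks behind these observations can be sketched as follows. This is an illustrative sketch under assumptions: the helper names are hypothetical, the word counts in the usage lines are made up, and the paper's own scoring may differ. The first helper compares value-framing vectors by cosine similarity; the second asks whether an adapted text moved closer to the target community's mean on a scalar metric, which shows why a model that always shortens text "succeeds" only when the target happens to be shorter.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two value-framing vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def moved_toward_target(source: float, adapted: float, target_mean: float) -> bool:
    """True if the adapted value is closer to the target mean than the source was."""
    return abs(adapted - target_mean) < abs(source - target_mean)


# Made-up word counts: a model that always shortens 900 words to 600.
shorter_target = moved_toward_target(900, 600, target_mean=500)
longer_target = moved_toward_target(900, 600, target_mean=1200)
print(shorter_target, longer_target)
```

With these illustrative numbers, the same shortening counts as a success for the shorter target and as a failure for the longer one, even though the model's behavior is identical in both cases.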

Highlights & Insights

  • Methodological Innovation: Discovers cultural norms in a bottom-up, human-centered manner. Instead of presetting proxy variables, the study conducts interviews with interdisciplinary experts to establish a framework that comprehensively covers structure, style, rhetoric, and citation.
  • Large-Scale Validation: Quantitative analysis of 81,178 papers across 11 CS communities and 38 conferences successfully replicates the qualitative observations described by the interviewees.
  • Discovery of LLM Homogenization: First to systematically demonstrate that LLMs exhibit comprehensive homogenization (except for vocabulary) during cross-community writing adaptation. The large-scale experiment involving 550,000 generations provides highly convincing evidence.
  • Open-Source Evaluation Suite: Provides reusable computational metrics and code, which can be applied to scientometrics and LLM cultural competence evaluation.

Limitations & Future Work

  • Community Scope: Only covers subfields of CS, leaving out disciplines with potentially more pronounced differences such as sociology, biology, and art history.
  • Interviewee Bias: The 10 interviewees are primarily from ML/NLP and Computational Social Science, and recruiting through social media introduces potential selection bias.
  • Metric Limitations: Formality metrics show insufficient discrimination across CS communities; verbosity and figurative language are excluded due to the lack of reliable metrics; citation norms are not computed due to the inability to map in-text citations.
  • Evaluation Scheme: LLMs are evaluated only under zero-shot settings, without exploring methods like few-shot learning or RAG to improve cultural competence.
  • Proxy Selection: Uses "communities" rather than "conferences" as the cultural unit. While verified by surveys, this choice might overlook differences across subareas within the same community.

Related Work

  • Understanding Research Communities: Lucy et al. (2023) analyze lexical specificity across communities; Birhane et al. (2022) and Jiang et al. (2025) study values encoded in ML research; Michael et al. (2023) investigate beliefs in the NLP community.
  • LLMs as Scientific Tools: Si et al. (2024) leverage LLMs for scientific idea generation; Robinson et al. (2024) focus on writing assistance for a single community—neither considers cross-community cultural differences.
  • Cultural Competence of LLMs: Adilazuarda et al. (2024) review the vague definitions of "culture"; Rao et al. (2024) construct geocultural benchmarks based on expert documents but use artificial settings; this work is driven by real-world tasks and community members, offering a more grounded methodology.
  • Homogenization in LLM Writing: Liang et al. (2024) find that LLM-assisted papers are shorter and more similar; Guo et al. (2024) observe a drop in linguistic diversity; Xu et al. (2025) discover a reduction in rhetorical diversity.

Rating

  • Novelty: ⭐⭐⭐⭐ — Approaches NLP cultural competence evaluation from anthropological and qualitative perspectives, offering a fresh angle.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Incorporates 81K papers and 550K LLM generations, presenting a massive scale with highly convincing metric designs.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Smooth narrative flow using mixed methods, closely integrating qualitative and quantitative perspectives.
  • Value: ⭐⭐⭐⭐ — Provides a critical warning regarding the cultural adaptability of LLM-based scientific writing tools; the evaluation suite is highly reusable.