Position: Towards Bidirectional Human-AI Alignment¶
Conference: NeurIPS 2025 | arXiv: 2406.09264 | Code: None | Area: AI Alignment / Human-AI Interaction | Keywords: AI alignment, bidirectional alignment, human values, human-AI interaction, systematic review
TL;DR¶
This paper proposes a Bidirectional Human-AI Alignment framework grounded in a systematic review of 400+ papers. It argues that alignment research should not stop at the unidirectional goal of "aligning AI with humans" but must also address the critically underexplored direction of "aligning humans with AI," and it identifies key gaps in the current research landscape.
Background & Motivation¶
Background: AI alignment has become a central topic in AI safety and ethics. Conventionally, alignment research focuses on making AI systems behave in accordance with human goals and values—exemplified by methods such as RLHF and Constitutional AI. However, as AI systems become more deeply embedded in everyday life, a unidirectional alignment perspective is no longer sufficient.
Limitations of Prior Work:

- Ambiguous definition of alignment: Different communities (HCI, NLP, ML) hold inconsistent definitions of "alignment," lacking a shared conceptual foundation.
- Unidirectional perspective: Existing work almost exclusively addresses the "AI → Human" direction, neglecting how human cognition and behavior adapt to AI.
- Insufficient value modeling: The plurality, dynamism, and context-dependence of human values are oversimplified in current alignment methods.
Key Challenge: AI systems are growing increasingly complex and autonomous, yet human capacity to understand, oversee, and collaborate with them has not kept pace. A dynamic feedback loop exists between AI and humans—AI behavior shapes human responses, which in turn reshape AI behavior—yet this bidirectional interaction is almost entirely absent from existing research.
Goal: To rigorously define the conceptual boundaries of "alignment," propose a systematic framework encompassing bidirectional interaction, identify existing research gaps, and provide a roadmap for future alignment research.
Key Insight: By systematically reviewing 400+ papers spanning HCI, NLP, and ML, and combining qualitative coding with quantitative analysis, the authors construct a comprehensive taxonomy covering both the AI side and the human side.
Core Idea: AI alignment should be bidirectional—not only aligning AI with human values, but also helping humans understand, evaluate, and adapt to AI.
Method¶
Overall Architecture¶
The proposed Bidirectional Human-AI Alignment framework comprises two interconnected directions:
- Align AI with Humans: Incorporating human values and norms into AI training, steering, and customization.
- Align Humans with AI: Supporting human cognitive, behavioral, and societal adaptation to rapidly evolving AI technologies.
The framework is organized around four core research questions (RQ1–RQ4):
| Direction | Research Question | Focus |
|---|---|---|
| AI → Human | RQ1: Human values and norms | Which values are aligned? How are values specified interactively? |
| AI → Human | RQ2: Integrating human norms into AI | How are values embedded throughout the development and deployment pipeline? |
| Human → AI | RQ3: Human cognitive adaptation | How do humans learn to perceive, understand, and critically evaluate AI? |
| Human → AI | RQ4: Human behavioral adaptation | How do individuals and society collaborate with AI and respond to its impacts? |
Key Designs¶
RQ1: Human Values and Norms¶
- Value taxonomy: An adaptation of Schwartz's Theory of Basic Values, analyzed along two dimensions:
  - Sources: Individual values (e.g., factuality, cognitive bias), social values (e.g., fairness, ethics), and interaction values (e.g., usability, trust).
  - Types: Self-enhancement, self-transcendence, conservation, and openness to change.
- Interactive techniques for value specification:
  - Explicit feedback: principles, ratings, natural language, multimodal input.
  - Implicit feedback: behavioral cues, linguistic patterns, theory of mind.
  - Simulated feedback: feedback simulators, synthetic data.
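The three feedback channels above (explicit, implicit, simulated) can be thought of as records in a shared data model. The sketch below is purely illustrative and not from the paper; `ValueFeedback` and `aggregate_preference` are hypothetical names, and the per-channel weights are an assumed design choice.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical unified record for value-specification feedback.
# Channel names follow the taxonomy: explicit, implicit, simulated.
@dataclass
class ValueFeedback:
    channel: Literal["explicit", "implicit", "simulated"]
    signal: str    # e.g. a rating, a behavioral cue, or simulator output
    weight: float  # assumed: how much to trust this channel when aggregating

def aggregate_preference(feedback: list[ValueFeedback]) -> dict[str, float]:
    """Sum per-channel weights as a toy proxy for preference strength."""
    totals: dict[str, float] = {}
    for fb in feedback:
        totals[fb.channel] = totals.get(fb.channel, 0.0) + fb.weight
    return totals
```

A unified record like this would let an alignment pipeline mix scarce explicit ratings with abundant implicit signals, which is exactly the imbalance the paper's gap analysis highlights.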
RQ2: Integrating Human Values into AI¶
- General value integration: Via instruction data (human-annotated / human-AI co-annotated / simulated), model learning (online/offline alignment), and inference stage (prompting / tools / search).
- Personalized/group-level customization: Customization data, adaptive learning (group-level learning, MoE, adapters), and interactive alignment.
- Evaluation frameworks: Human-in-the-loop evaluation vs. automated evaluation.
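The "offline alignment" category in RQ2 covers preference-based methods such as Direct Preference Optimization (DPO). As an illustration (the paper surveys such methods but does not implement them), a minimal per-pair DPO loss in pure Python:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The policy is pushed to widen the log-probability margin between the
    chosen and rejected responses, measured relative to a frozen
    reference model; beta controls the strength of that pressure.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's log-probability above its reference value drives the loss down.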
RQ3: Human Cognitive Adaptation¶
- Perceiving and understanding AI: AI literacy education, explainable AI visualizations, interactive explanation techniques.
- Critical thinking: Trust and reliance calibration, ethical auditing, cognitive recalibration.
RQ4: Human Behavioral Adaptation¶
- Human-AI collaboration modes: Collaboration mechanisms under three AI roles—assistant, partner, and tutor.
- AI impacts on individuals and society: Individual behavioral change, shifts in social relationships, and institutional responses to AI advancement.
- Evaluation methods: Micro-level (human-AI collaboration assessment) and macro-level (societal impact assessment).
Systematic Review Methodology¶
- Systematic literature review following PRISMA guidelines.
- Initial retrieval of 34,213 papers → keyword filtering to 2,136 → inclusion criteria filtering to 411 papers.
- Independent dual coding with a joint agreement rate of 0.78.
- Qualitative coding using a combined inductive and deductive approach to develop the analytical framework.
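The paper reports a joint agreement rate of 0.78 for its dual coding. One common way to quantify agreement between two independent coders is Cohen's kappa, which corrects for chance agreement; the paper's exact measure may differ, so the sketch below is illustrative only.

```python
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Chance-corrected agreement between two coders over the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items coded identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if both coders labeled at random from their
    # own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)
```

Kappa is 1.0 for perfect agreement and 0.0 when coders agree no more often than chance, which is why review protocols report it alongside raw agreement rates.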
Key Experimental Results¶
Analysis of Literature Distribution¶
Quantitative statistics reveal significant imbalances in research coverage across dimensions:
| Dimension | Publication Volume | Key Finding |
|---|---|---|
| Explicit human feedback | Highest | Dominates value specification research |
| Implicit/simulated feedback | Very few | Severely underexplored, yet high potential |
| Model training stage | Many | Online/offline alignment research concentrated here |
| Inference-stage alignment | Few | Real-time adaptation capacity neglected |
| AI literacy education | Very few | One of the largest research gaps on the human side |
| Collaboration when AI surpasses human capability | Near absent | Existing research assumes AI remains in an assistive role |
| Societal-level impact assessment | Insufficient | Long-term behavioral changes lack longitudinal tracking |
Key Research Gaps¶
| Direction | Gap | Severity |
|---|---|---|
| AI → Human | Implicit/simulated value feedback | High |
| AI → Human | Inference-stage customization and interactive alignment | High |
| AI → Human | Standardization of human-in-the-loop evaluation | Medium |
| Human → AI | AI literacy and education | Very High |
| Human → AI | Collaboration with superhuman-capability AI | Very High |
| Human → AI | AI ethical auditing (from the human perspective) | High |
| Human → AI | Long-term societal impact assessment | High |
Key Findings¶
- Research is heavily skewed toward the AI side: The vast majority of alignment research focuses on "aligning AI with humans"; work on "aligning humans with AI" is severely underrepresented.
- Value specification methods are narrow: There is an excessive reliance on explicit feedback (ratings, rankings, instructions), while implicit behavioral signals and simulated feedback are largely ignored.
- Lack of dynamic perspective: Existing work treats alignment as a static process, with limited attention to long-term interaction design and value evolution modeling.
- Human-side research centers on explainability: The primary focus is on how XAI helps humans understand model decisions, while dimensions such as AI literacy, critical thinking, and ethical auditing remain neglected.
Highlights & Insights¶
- Conceptual breakthrough: The paper is the first to explicitly position "Align Humans with AI" as an equal and complementary direction in alignment research, challenging the entrenched unidirectional paradigm.
- Exceptional systematicity: A cross-disciplinary review spanning HCI, NLP, and ML with 400+ papers, featuring a detailed taxonomy and well-grounded quantitative analysis.
- Three long-term challenges are precisely defined:
  - Specification Game: How can complex human values be fully specified?
  - Dynamic Co-evolution: How do humans, AI, and society co-evolve over time?
  - Safeguarding Co-adaptation: How can the co-adaptation process be kept safe?
- Introduction of Schwartz's value theory provides a psychological and sociological theoretical anchor for AI alignment research.
- Practical roadmap: Each identified research gap is accompanied by specific suggestions for future research directions.
Limitations & Future Work¶
- Limited review scope: Coverage is primarily restricted to computation-adjacent fields (ML/NLP/HCI); cognitive science, psychology, and science and technology studies (STS) are insufficiently represented.
- Temporal window: Coverage is mainly limited to 2019–2024, potentially omitting foundational earlier work.
- Lack of empirical validation: As a position paper, the proposed framework lacks experimental validation of its effectiveness.
- Insufficient treatment of value conflicts: Although value plurality and social choice theory are mentioned, the paper does not deeply discuss concrete mechanisms for resolving conflicts between the values of different groups.
- Unclear technical implementation pathways: While many research gaps are identified, the discussion of specific technical solutions remains relatively shallow.
Related Work & Insights¶
- Relation to traditional AI alignment research: Methods such as RLHF (Ouyang et al., 2022) and Constitutional AI (Bai et al., 2022) constitute a subset of the "AI → Human" direction within the proposed framework.
- Connection to HCI: The framework integrates HCI research on XAI, human-AI collaboration, and AI literacy under an alignment-oriented lens.
- Inspiration from social choice theory: Social choice theory (e.g., Arrow's work on preference aggregation) can provide formal tools for aggregating pluralistic values.
- Connection to Scalable Oversight: The "Human → AI" direction directly speaks to the challenge of maintaining human oversight as AI systems scale.
Implications for future research:

- Alignment systems capable of capturing implicit value signals are worth developing.
- AI literacy education should become foundational infrastructure for alignment research.
- Longitudinal research infrastructure is needed to track the long-term co-evolution of humans and AI.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The concept of "bidirectional alignment" is an important conceptual breakthrough, though, as a position paper, the work introduces no new technical methods.
- Experimental Thoroughness: ⭐⭐⭐⭐ — The systematic review of 400+ papers is highly rigorous, and the quantitative analysis effectively reveals research biases.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, rigorous taxonomy, and excellent readability; exemplary as a survey paper.
- Value: ⭐⭐⭐⭐⭐ — Offers a novel conceptual framework and systematic roadmap for AI alignment research, with significant guiding value for the community.