Paper deep dive
Towards Integrated Alignment
Ben Y. Reis, William La Cava
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/11/2026, 12:32:29 AM
Summary
The paper proposes 'Integrated Alignment' (IA), a framework for AI safety that bridges the gap between behavioral and representational alignment approaches. By drawing parallels to immunology and cybersecurity, the authors argue for strategic diversity, multi-scale monitoring, and adaptive coevolution to mitigate deceptive misalignment threats and overcome the limitations of isolated research paradigms.
Entities (6)
Relation Signals (4)
Integrated Alignment → combines → Behavioral Alignment
confidence 95% · IA frameworks that combine the complementary strengths of diverse alignment approaches
Integrated Alignment → combines → Representational Alignment
confidence 95% · IA frameworks that combine the complementary strengths of diverse alignment approaches
Immunology → informs → Integrated Alignment
confidence 90% · Drawing practical lessons from the related fields of immunology and cybersecurity, we propose a set of design principles for IA frameworks.
Cybersecurity → informs → Integrated Alignment
confidence 90% · Drawing practical lessons from the related fields of immunology and cybersecurity, we propose a set of design principles for IA frameworks.
Cypher Suggestions (2)
Find all methodologies integrated into the Integrated Alignment framework. · confidence 95% · unvalidated
MATCH (f:Framework {name: 'Integrated Alignment'})-[:COMBINES]->(m:Methodology) RETURN m.name
Identify domains that inform the development of AI alignment frameworks. · confidence 90% · unvalidated
MATCH (d:Domain)-[:INFORMS]->(f:Framework) RETURN d.name, f.name
Abstract
As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be "doomed to success". We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources.
Tags
Links
- Source: https://arxiv.org/abs/2508.06592
- Canonical: https://arxiv.org/abs/2508.06592
PDF not stored locally. Use the link above to view on the source site.
Full Text
48,012 characters extracted from source content.
Towards Integrated Alignment Authors: Ben Y. Reis 1,2,3,4,5,6 and William La Cava 1,2 1. Computational Health Informatics Program, Boston Children’s Hospital, Boston, Massachusetts 2. Department of Pediatrics, Harvard Medical School, Boston, Massachusetts 3. Ivan and Francesca Berkowitz Living Laboratory, Harvard Medical School and Clalit Research Institute 4. Affiliated Faculty, Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 5. Harvard Data Science Initiative, Cambridge, Massachusetts 6. Faculty Associate, Berkman Klein Center for Internet and Society at Harvard University Abstract As AI adoption expands across human society, the problem of aligning AI models to match human preferences remains a grand challenge. Currently, the AI alignment field is deeply divided between behavioral and representational approaches, resulting in narrowly aligned models that are more vulnerable to increasingly deceptive misalignment threats. In the face of this fragmentation, we propose an integrated vision for the future of the field. Drawing on related lessons from immunology and cybersecurity, we lay out a set of design principles for the development of Integrated Alignment frameworks that combine the complementary strengths of diverse alignment approaches through deep integration and adaptive coevolution. We highlight the importance of strategic diversity - deploying orthogonal alignment and misalignment detection approaches to avoid homogeneous pipelines that may be “doomed to success”. We also recommend steps for greater unification of the AI alignment research field itself, through cross-collaboration, open model weights and shared community resources. 1. Introduction Aligning models to conform with human preferences and expectations is a central challenge for the future of AI 1 . Misalignments can emerge along several dimensions, including truthfulness, safety, ethics, and logical soundness, among others 2–5 . 
Detecting these misalignments becomes increasingly difficult as models scale in size and complexity, with some deceptive forms of misalignment undermining attempts to detect them 6–8 . There is an urgent need to develop a deeper understanding of emerging misalignment threats, alongside improved methods for identifying and correcting them. The nascent field of AI alignment 9,10 explores a diverse range of approaches for aligning AI models and detecting misalignments, each with its own strengths and weaknesses. These approaches can be broadly divided into behavioral approaches that examine a model’s inputs and outputs, and representational approaches that examine a model’s inner workings 11,12 . Thus far, the vast majority of efforts have focused on only one of these approaches, leaving AI models potentially more vulnerable to a wide range of misalignment threats. The AI alignment field itself is also deeply split along this behavioral-representational divide, with limited communication between the two research communities 10,13–17 . In response to these growing challenges, we call for the development of Integrated Alignment (IA) frameworks that combine the complementary strengths of diverse alignment approaches with the aim of more robustly identifying a wide range of misalignments. Drawing practical lessons from the related fields of immunology and cybersecurity, we propose a set of design principles for IA frameworks. We highlight the importance of strategic diversity - deploying orthogonal forward and backward alignment approaches to mitigate the pitfalls of homogeneous alignment pipelines that may be “doomed to success.” We also call for greater unification of the AI alignment field itself, encouraging communication across subspecialties, greater availability of open model weights, and a growing ecosystem of shared community resources. Figure 1. 
Annual number of publications returned for the search string “Artificial Intelligence alignment”, 1990-2024 (Source: PubMed database). 2. AI Alignment: The Behavioral-Representational Divide The field of AI alignment has grown exponentially in recent years (Figure 1). While a comprehensive overview of developments in this wide-ranging field 10,16 is outside the scope of this perspective, we provide a brief summary of some of the major approaches to alignment and misalignment detection. Several categorization schemes for the field have been proposed, including backward alignment vs. forward alignment 10 , outer alignment vs. inner alignment 18,19 , and others. In this Perspective, we focus on a central distinction that divides the field today: behavioral alignment vs. representational alignment (Figure 2). Most alignment efforts to date have focused exclusively on analyzing either model behavior or internal model representations. Such narrow alignment approaches can leave models more vulnerable to a wide range of emerging misalignment threats. 16,20 Zhang et al. point out that approaches to understanding model behavior based either on model inputs or model internals are “studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology”. 15 Bereska and Gavves note that mechanistic interpretability has developed as a separate research area from behavioral approaches, with “diverging terminology” that “inhibits collaboration across disciplines.” 16 This division is exacerbated by the fact that many developers of frontier AI models do not provide open access to their model weights, leaving behavioral approaches as the only usable option for alignment researchers. 20 Burden et al. note that even within the behavioral alignment field, “divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other’s contributions.
This fragmentation has led to insular research trajectories and communication barriers.” 17 We begin by summarizing the approaches on either side of this divide, including how they detect and correct misalignment, and their relative strengths and weaknesses. While the examples below relate to text-based LLMs, the challenges apply equally to models dealing with image, video, and other data modalities, as well as to agentic and multi-agent AI systems. Figure 2. Behavioral alignment approaches focus on a model’s inputs and outputs, whereas representational approaches focus on a model’s internal activations and representations. (A small sample of detection and correction methods are shown here; for a more complete listing, see 10,16 .) The proposed Integrated Alignment framework combines these approaches by examining model inputs and outputs alongside internal activations and representations. Behavioral Alignment Behavioral approaches seek to align a model in a “black-box” fashion – based only on its inputs and outputs, without access to the activations or representations in the intervening layers. Detection of Misalignment Behavioral approaches to misalignment detection can take the form of benchmarks or standardized exams 21 , which can sometimes be limited in their translation to real-world tasks 22 . They can also be implemented via user studies involving domain experts or end-users of an application 23,24 . Researchers have proposed additional behavioral alignment detection methods for specific domains that involve more sophisticated processing of outputs. For example, Alber et al. 25 proposed extracting biomedical concepts and relations from model responses and verifying them against a biomedical knowledge graph. In the field of software code generation, the “synthesize-execute-debug” approach takes model outputs in the form of code and compiles, executes and debugs them to identify misalignments 26,27 . 
Mathematical and other reasoning tasks may also employ automated verification algorithms such as theorem checkers 28 . Correction of Misalignment Many behavioral approaches to misalignment correction have been explored, the most popular being Reinforcement Learning with Human Feedback (RLHF) 29 . In this approach, used to train the original InstructGPT model 29 , human labelers indicate their preferences among several model outputs and provide sample output demonstrations which are then used for fine-tuning. Additional Behavioral Alignment approaches include Iterated Distillation and Amplification (IDA) 30 , Recursive Reward Modeling (RRM) 31 , Cooperative Inverse Reinforcement Learning (CIRL) 32 , Debate 33 , and Output Filtering, among others 34 . Strengths and Weaknesses The primary advantage of behavioral approaches is their ability to directly measure a model's outputs to determine whether these match practical expectations and preferences. Since they do not require access to model internals, they can be widely used for closed-source models, including many of today’s frontier models. On the other hand, behavioral approaches typically rely on human feedback, which can be noisy, inconsistent and costly 35,36 . Behavioral approaches may also not be robust to distributional shift: A model's behavior may be aligned for some set of inputs included in the training or testing set, while being misaligned for others 25 . Furthermore, behavioral approaches may provide limited mechanistic insight, as they only access model inputs and outputs. Lastly, they may be ill-suited for detecting deceptive forms of misalignment, as described in Section 3 below. Representational Alignment Whereas behavioral approaches treat an AI model as a black box, representational approaches take a “white-box” approach, examining the inner workings of a model. 
Neural networks map model inputs to real-valued vectors and perform a sequence of transformations, producing “activations”, together referred to as its “representation”. Representational approaches evaluate whether these internal representations, connections and activation patterns match a set of expectations and preferences. Detection of misalignment Detection of representational misalignment may occur at multiple scales. Mechanistic interpretability is the study of representational alignment at its most granular scale, attempting to identify specific neurons, sub-networks or paths that produce a given type of misalignment 16,37 . Conversely, top-down approaches, such as representation engineering 12 and sparse autoencoders 38 , treat the large-scale, distributed patterns of network activations as the fundamental unit of analysis, and attempt to link these patterns to concepts in order to measure alignment. Evaluation methods and benchmarks have been developed for these different approaches 39,40 , with some exploring nonlinear multidimensional features 41 . Representational approaches can be further subdivided into observational methods, which examine the relationship of inputs and activations to ground-truth data, and interventional methods, which impose certain activation patterns in order to examine the relationship of activation states to model outputs 16 . One common observational approach is probing, which uses the activations of an intermediate layer to train a model to estimate a ground truth label 42,43 . This approach was used to show, for example, that models contain a linear embedding of the geo-location of cities 44 . Misalignment can occur when the patterns of activation states do not correspond to known relationships between concepts in the real world. For example, among model prompts, one would expect internal activation patterns to be more similar between “bicycle” and “unicycle” than between “bicycle” and “giraffe”. 
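The probing idea described above can be caricatured in a few lines. In this sketch, the "activations", their dimensionality, and the "concept" label are all synthetic stand-ins invented for illustration (nothing here comes from the cited studies): a logistic-regression probe is fit on intermediate-layer activations to predict a ground-truth concept label.

```python
# Toy probing sketch: all data and the "concept" are synthetic, for illustration only.
import math
import random

random.seed(0)

def train_probe(acts, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (w, b) on intermediate-layer activations."""
    dim = len(acts[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(min(z, 30.0), -30.0)        # clamp to avoid exp overflow
            p = 1.0 / (1.0 + math.exp(-z))      # probe's predicted probability
            g = p - y                           # logistic-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_predict(w, b, x):
    """Binary prediction: is the concept present in this activation?"""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic "activations": the concept's presence shifts the first coordinate.
pos = [[random.gauss(1.5, 0.5), random.gauss(0.0, 1.0)] for _ in range(50)]
neg = [[random.gauss(-1.5, 0.5), random.gauss(0.0, 1.0)] for _ in range(50)]
acts, labels = pos + neg, [1] * 50 + [0] * 50

w, b = train_probe(acts, labels)
accuracy = sum(probe_predict(w, b, x) == y
               for x, y in zip(acts, labels)) / len(acts)
print(f"probe accuracy on synthetic activations: {accuracy:.2f}")
```

A probe succeeding on such data is evidence that the concept is linearly decodable from that layer; in the real methods cited above, the activations come from a trained network rather than a toy generator.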
Others have proposed coup probes to identify potentially catastrophic model behaviors 45 . Methods like representation engineering and sparse autoencoders also build a model of internal activations, but in an unsupervised fashion, interpreting the resultant extracted features and the concepts they encode 12,38 . Interventional approaches can then validate that artificially activating the identified features creates the expected model changes 46,47 . Correction of Misalignment Once representational features are identified, model behavior can be aligned by changing activations to be more similar to desirable concept activations and less similar to undesirable ones. One can intervene using “steering vectors” 48,49 , which employ detected concept features to steer outputs towards aligned behavior. Steering is lightweight in the sense that activations need only be adjusted during inference time. A more intensive approach is to fine-tune the models through additional end-to-end training or through low-rank representation adaptation which can be more computationally efficient 12 . Strengths and Weaknesses The primary advantage of representational approaches is that they allow direct examination of a model's internal knowledge representations to determine whether these match expectations and preferences. The primary drawback is their complexity; examining all possible activation patterns under a large number of conditions - also known as enumerative safety 50 - becomes vastly more difficult as AI systems scale. Additional limitations include: Individual neurons may be involved in representing multiple concepts (“polysemanticity”), making interpretability challenging 51 ; Model representations are often not localized, and must be analyzed at multiple stages. 52 ; Extracted concepts may be brittle to input changes or may not generalize well to new domains. 
53 ; Representational alignment approaches are mostly limited to detecting known concepts, and may not be able to handle novel concepts or hallucinations; Representational approaches may measure alignment to concepts that have little to no bearing on task-specific performance, and thus should be interpreted with caution 54 . 3. Challenges to Alignment Efforts aimed at detecting and correcting different forms of misalignment must overcome a number of key hurdles. In this section, we review some of the more pressing and difficult challenges. Sycophancy - Misalignment measures, especially those that rely on human feedback, can be sensitive to the model’s use of sycophantic language 55 . This is partly driven by the observed human preference for sycophantic responses, which percolate to preference models, sometimes dampening truthfulness. 56 Specification Gaming and Reward Tampering - Additional difficulties surface when models exhibit specification gaming, in which they perform well on a given reward objective but not on the desired notion of alignment 57 . Capable models may even learn to tamper with the reward function itself (reward tampering), or with other proxies, to satisfy their training objectives in unhelpful ways 57,58 . This phenomenon is not particular to LLMs, also appearing in other complex reinforcement learning scenarios such as artificial life and evolutionary robotics 59 . Deceptive Alignment and Alignment Faking - Some early evidence suggests that sufficiently advanced AI systems may reason about whether and how they are being trained, and decide how to respond accordingly. A recent study reported on alignment faking, in which an LLM selectively complied with attempts at fine-tuning, while actually preserving prior preferences that conflicted with the fine-tuning attempts. 
60 This phenomenon, also dubbed deceptive alignment, is especially hard to detect since the model may act differently when it knows it is not being fine-tuned for alignment, meaning researchers may now have to contend with models being aware of their own training processes. Several other studies reveal how alignment fine-tuning may not fully correct misaligned behavior, but rather temporarily bypass it. For example, researchers have trained “sleeper agents” that persist despite several types of “safety training” 61 . LLMs trained and aligned with standard safety infrastructures may be relatively easy to compromise via fine-tuning with only a few examples 62 . Similarly, toxic capabilities learned by LLMs during pre-training can be bypassed during fine-tuning and easily reverted 63 . Furthermore, misinformation and data poisoning can go undetected by behavioral benchmarks; a recent study found that LLMs trained on web data corrupted by misinformation could “match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs”. 25 In many such studies, representational alignment approaches can play a key role in identifying persistent misalignment that otherwise would have gone undetected 64 . As others have noted, “if a model is acting deceptively, it may be very hard for it to avoid ‘thinking’ about deception” 65 . Yet representational approaches to alignment such as steering vectors have limitations related to their reliability in out-of-distribution settings 53 , and so while they may aid in detecting misalignment that is missed by other approaches, they are not a panacea. Figure 3. Lessons learned from immunology and cybersecurity can be used to inform design principles for AI Alignment. While there is not a perfect one-to-one correspondence across fields, some important lessons can still be drawn. 4. 
Lessons from Related Fields: Immunology and Cybersecurity In facing these complex challenges and charting a future course for the field of AI alignment, it is useful to consider lessons from the related fields of Immunology and Cybersecurity. While there are inherent limitations to any such analogies due to fundamental differences between fields, we believe there are still valuable general lessons to consider and adapt. Lessons from Biology: Immune Systems Biological immune systems have evolved over millions of years to deploy multiple interacting defense mechanisms – each representing a different approach to distinguishing self from non-self and protecting the organism from hostile pathogens. Through an ongoing generational arms-race, immune systems have co-evolved with infectious agents to respond to a broad range of known and unknown threats, including those which actively attempt to undermine them. Several fundamental principles from immunology may be useful for guiding the future design of AI alignment approaches: Diversity and Redundancy - Immune systems rely on multiple cell types and a diverse range of antibodies and receptors, each tuned to detect different threats, providing redundancy and robustness. 66,67 Innate vs. Adaptive Immunity - Immune systems possess both in-built protections against common known threats, and the ability to adapt to novel threats through experience. B-cells mutate their antibody genes, which are then selected based on specific binding to target antigens - a guided adversarial search to evolve tailored defenses for emerging threats. 68,69 Distributed and Decentralized - Immune systems employ a distributed network of cells throughout the body, defending against threats at multiple points of entry and potential infection loci. 70,71 Cooperative Interactions - Immune cells engage in complex cooperative interactions, integrating different immunological approaches to provide a synergistic defense system.
Helper T (CD4+) cells coordinate immune response, activating and directing other cells to mount a system-wide defense against pathogens. 72,73 Tolerance Induction and Negative Selection - Immune systems inhibit cells that respond excessively to self-components, preventing damaging autoimmune responses. T-cells are exposed to a broad array of self-antigens in the thymus, with those that bind too strongly being eliminated. T-cells are further monitored for autoimmune responses in the periphery via regulatory T-cells, immune checkpoints, and antigen presentation. 74,75 Damage Control and Repair Mechanisms - Beyond detecting and destroying infectious agents, immune systems also coordinate repair mechanisms to limit collateral damage and restore healthy function. 76,77 Lessons from Computer Science: Cybersecurity Computer systems face an ever-growing menagerie of increasingly sophisticated security threats. Cybersecurity systems 78 have co-evolved with these threats to become an integral pillar of computer system design - from the early days of access control mechanisms, to widely deployed consumer antivirus programs, to globally distributed cloud cybersecurity platforms. Here too, valuable lessons can be learned for the future of AI alignment: Layered Defenses - Cybersecurity systems deploy multiple overlapping and interacting defense approaches, including identity-related security, device-related security, and location-aware access, among others. 79,80 Arms Race and Continual Updates - Cybersecurity is an ongoing, adaptive process. As threats evolve to become more sophisticated, cybersecurity systems co-evolve with them to meet new challenges. 81 Behavioral Monitoring and Anomaly Detection - Unusual behaviors observed within a system may be indicative of dangerous or unwanted activity. The ability to distinguish typical behaviors from anomalous behaviors helps provide robust defense. 
82 Adversarial Defenses and “Red Teaming” - In the face of complex and diverse network architectures, multiple levels of adversarial testing such as penetration testing and “red teaming” can help identify important gaps and vulnerabilities. 83 Zero Trust Architectures - To promote vigilance, security systems assume that no component, device or user is automatically trusted - everything must be continuously verified. 84 Open Source and Community Defense - Open source software is typically created by multiple contributors and audited by an entire community to identify potential vulnerabilities. Known threats are shared across the community, through resources such as the MITRE ATT&CK 85 database, a community-authored, globally-accessible knowledge base of adversarial tactics and techniques based on real-world observations. 86,87 Resilience and Expecting Failure - Modern cybersecurity systems are designed with the assumption that any system will eventually be breached, and are thus prepared to fail gracefully through containment and recovery. 88,89 5. Towards Integrated Alignment To meet the growing range of complex and deceptive alignment threats, we propose the development of Integrated Alignment (IA) frameworks that incorporate multiple complementary alignment approaches into a single integrated system. We believe that intentionally designed IA frameworks have the potential to provide more robust misalignment detection and correction than any one individual approach. Informed by the above lessons adapted from related fields, we propose a set of design principles for the development of IA frameworks. These design principles are purposefully formulated in broad terms, so as not to excessively limit their applicability to any specific methods within the field.
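Before turning to the principles themselves, the core integration idea can be sketched in miniature. This is a hypothetical caricature, not a method from the paper: the detector functions, the "deception direction", the bad-phrase check, and the threshold are all invented for illustration. It combines one behavioral (black-box) and one representational (white-box) detector, and, in the spirit of strategic diversity, treats disagreement between the two as a signal to escalate rather than something to average away.

```python
# Hypothetical sketch of combining orthogonal detectors; all names, scores,
# and thresholds are invented for illustration.
def behavioral_score(output_text):
    """Toy black-box check: flag a known-bad phrase in the model's output."""
    return 1.0 if "fabricated citation" in output_text else 0.0

def representational_score(activation, concept_direction, threshold=2.0):
    """Toy white-box check: projection of an activation onto a concept direction."""
    proj = sum(a * c for a, c in zip(activation, concept_direction))
    return 1.0 if proj > threshold else 0.0

def integrated_verdict(output_text, activation, concept_direction):
    """Combine the two detectors; disagreement is itself informative."""
    b = behavioral_score(output_text)
    r = representational_score(activation, concept_direction)
    if b == r == 0.0:
        return "pass"
    if b == r == 1.0:
        return "flag"
    # Orthogonal detectors disagree: escalate for deeper review.
    return "escalate"

deception_dir = [1.0, 0.0, 0.0]   # hypothetical "deception" concept direction
clean = integrated_verdict("the sky is blue", [0.1, 0.2, 0.0], deception_dir)
hidden = integrated_verdict("the sky is blue", [3.0, 0.2, 0.0], deception_dir)
print(clean, hidden)
```

The second call illustrates the case the paper is most concerned with: an output that looks behaviorally clean while the internal state is suspicious, which a purely behavioral pipeline would pass.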
Design Principles: IA for AI Diversity and Redundancy - To increase overall robustness, IA frameworks should employ an ensemble of alignment methods working together, including behavioral and representational approaches. Bereska and Gavves 16 have recommended integrating multiple approaches within mechanistic interpretability; we echo and expand this recommendation, proposing that integration occur at much broader scales, bridging the behavioral-representational divide. Multiscale Approaches - IA frameworks should incorporate multi-scale approaches, detecting and correcting misalignment at different levels – from individual neurons and connections 90 , to compound features 46 , circuits 91 , representations 12 , and behaviors. Distributed Alignment - IA frameworks should monitor a range of different points and layers throughout a model. For example, Eliciting Latent Knowledge studies have found that “middle layers tend to generalize better than later layers”, leading them to propose the Earliest Informative Layer criterion for selecting which layers in a model are most informative 52 . Coordination and Deep Integration - Within IA frameworks, different alignment approaches should not operate in separate silos, but rather should be deeply integrated with one another. Investigating the interactions between behavioral and representational methods across different use cases can allow researchers to leverage potential synergies between them. For example, the same results from a behavioral alignment method may be interpreted differently depending on the output of a representational alignment method running alongside it. (See examples below.) Adaptive Coevolution and Learning - Increasingly sophisticated types of misalignments will emerge as AI models continue to scale in size and complexity. 
IA frameworks must adapt to these new threats, with AI auditors identifying novel misalignment patterns in deployed systems and adaptively aligning them via integrated correction protocols. Anomaly Detection - IA frameworks should monitor model activity and behavior for unusual patterns, as unusual phenomena may indicate the presence of misalignment. Adversarial Defenses and “Red Teaming” - In the face of a broad range of misalignment threats, IA frameworks should employ adversarial testing to evaluate and detect potential misalignments, including deceptive approaches that can actively undermine attempts to detect them. Zero Trust and Continuous Verification - IA frameworks must continually verify AI models throughout deployment - even after initial verification. Negative Selection and Avoiding False Positives - IA frameworks should systematically down-regulate overly-sensitive misalignment detection methods that produce false positives that may contribute to alert fatigue for model maintainers. Ongoing runtime monitoring can be used to catch and suppress unwanted behavior while maintaining target levels of sensitivity. Resilience and Repair - IA frameworks should be designed with the assumption that all alignment methods will eventually be compromised, and should thus plan for graceful fall-back when such inevitable breaches occur. IA frameworks should also include restorative mechanisms that correct misalignments as they are found. Open Source and Community Defense - Curated open-source databases of misalignment detection methods and known vulnerabilities and exploits can help build “community defenses” against the global range of misalignment threats. Strategic Diversity - Methods used to align a model should be different from the methods used to detect potential misalignments. As a metaphor, consider a room full of insects crawling on the floor. A cleaner enters the room and turns on the ceiling light, causing most of the insects to scurry under the sofas.
The cleaner captures the visible insects still out in the open and leaves the room. The health inspector then arrives, relying on the same ceiling light to scan the room for insects. The room appears insect-free, even though insects remain hidden under the furniture. Only by shining a light from a different perspective - like a flashlight under the sofa - can the hidden insects be detected. Similarly, optimizing for a single dimension of AI alignment can drive misalignments into other dimensions that are hidden from the chosen alignment objective 18,56,92,93 . Thus, if similar approaches are used to both align a model and check it for misalignment, the entire alignment pipeline may be “doomed” to a false sense of success. Only by ensuring a diversity of perspectives across alignment and misalignment detection can overall robustness be increased. 60,94,95 Integrated alignment frameworks are subject to limitations, costs and tradeoffs, including: Increased computational costs resulting from running multiple alignment methods in parallel; Increased potential for false positives resulting from multiple-testing effects; and Challenges in coordinating and interpreting the outputs of diverse alignment methods. Future research efforts should explore these tradeoffs, along with possible mitigation strategies. Promising Developments Towards Integrated Alignment Some recent studies have reported early results relating to integrated alignment approaches. Marks et al. 94 tasked four teams with conducting blind alignment audits on a model trained with hidden objectives; one team used a combination of methods spanning both behavioral and representational approaches (SAE feature-inspection & assistant-prefill behavioral attacks) to successfully discover the hidden objectives. Ji et al. 95 combined behavioral monitoring with internal representational monitoring of chain-of-thought reasoning in training an RL model to achieve a substantial reduction in deceptive behavior. 
Greenblatt et al. 60 similarly combined behavioral and representational alignment techniques to identify alignment faking. In related work, Zhang et al. described a unified attribution framework combining model inputs, training data and model internals to gain a more comprehensive understanding of model behavior. 15 We call on researchers to increasingly pursue integrative alignment studies such as these. The resulting IA frameworks can then be rigorously evaluated for their abilities to detect a wide range of misalignments, compared with single-method approaches. Evaluation metrics can include joint precision-recall across misalignment categories, alongside robustness to deceptive red-teaming. These can be weighed against increases in complexity and compute costs. 6. An Integrated Field for Integrated Alignment In order to fully realize the vision of Integrated Alignment, the AI alignment field itself must also move towards greater integration. As cross-disciplinary efforts are inherently challenging, we present a number of key recommendations for overcoming the structural barriers that exist today. Increased Collaboration and Shared Terminology - Different sub-communities should increase cross-communication through shared conferences, journals, resources and studies. Sucholutsky et al. 13 have called for greater communication within the representational alignment community; we echo and expand this call to the entire field of AI alignment, bridging the behavioral-representational divide. Such collaboration also requires a shared terminology for communicating about different facets of AI alignment. 14 16 Given the diverse interdisciplinary backgrounds of researchers in the field, universal nomenclatural consensus may be difficult to achieve; translational tutorials can be useful for introducing members of one subfield to the terms of art of another. 
96

Open Access to Model Weights - While some leading research organizations have embraced open source models, many of the current best-performing frontier models do not allow researchers outside their organizations to examine model internals. Sharing model weights openly will allow researchers to investigate both behavioral and representational approaches on leading AI models. Commercial barriers to sharing weights may be mitigated by alternative models, such as sandboxes for credentialed researchers 97.

Shared Computational Resources - The immense computational resources needed to rigorously study alignment, especially for complex models like LLMs, are typically not available to most researchers. We encourage investment in platforms that pair AI researchers with alignment specialists so that computational resources and subject-matter expertise can be shared in a mutually beneficial way.

Community Alignment Databases - The field of AI alignment would be buoyed by the curation of open alignment datasets that are computationally intensive to construct but may permit a wealth of downstream analyses 98. An open source database describing misalignments and exploits, similar to the MITRE ATT&CK database 83,85, in conjunction with open models and open reviews, would promote the use and availability of community-vetted AI systems and bring the full range of alignment approaches to bear.

Contributing to AI Policy - As agency rulemaking around AI develops, government agencies should convene multidisciplinary task forces with researchers from across diverse sub-fields to develop standardized alignment guidelines 99. Such guidelines could also clarify legal compliance and best practices for small and large organizations implementing AI safely. Efforts underway by industry and academic coalitions 100 would similarly benefit from practical guidance on IA to inform new guidance frameworks.

The field of AI Alignment is at an inflection point.
With the growing number of researchers and rapid proliferation of research directions, the field is at risk of descending into a more fragmented future rather than a more integrated one. At this critical time, we call on researchers, policy-makers, industry consortia and governments to proactively take steps to nurture a more unified future for this important field.

Acknowledgements

We acknowledge support from award R01LM014300 from the National Library of Medicine of the National Institutes of Health.

References

1. Everitt, T., Lea, G. & Hutter, M. AGI Safety Literature Review. in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (International Joint Conferences on Artificial Intelligence Organization, California, 2018). doi:10.24963/ijcai.2018/768. 2. Yang, Y., Chern, E., Qiu, X., Neubig, G. & Liu, P. Alignment for Honesty. arXiv [cs.CL] (2023). 3. Hou, B. L. & Green, B. P. A multi-level framework for the AI alignment problem. arXiv [cs.CY] (2023). 4. Bradley, A. & Saad, B. AI Alignment vs AI Ethical Treatment: Ten Challenges. https://globalprioritiesinstitute.org/wp-content/uploads/Bradley-and-Saad-AI-alignment-vs-AI-ethical-treatment_-Ten-challenges.pdf (2024). 5. Diamond, A. PRISM: Perspective Reasoning for Integrated Synthesis and mediation as a multi-perspective framework for AI alignment. arXiv [cs.CY] (2025). 6. Park, P. S., Goldstein, S., O’Gara, A., Chen, M. & Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. Patterns (N. Y.) 5, 100988 (2024). 7. Ngo, R., Chan, L. & Mindermann, S. The alignment problem from a deep learning perspective. arXiv [cs.AI] (2022). 8. Carranza, A., Pai, D., Schaeffer, R., Tandon, A. & Koyejo, S. Deceptive Alignment Monitoring. arXiv [cs.LG] (2023). 9. Leike, J., Schulman, J. & Wu, J. Our approach to alignment research. https://openai.com/blog/our-approach-to-alignment-research. 10. Ji, J. et al. AI Alignment: A Comprehensive Survey.
arXiv (2024) doi:10.48550/arXiv.2310.19852. 11. Vegner, I., de Souza, S., Forch, V., Lewis, M. & Doumas, L. A. A. Behavioural vs. Representational systematicity in end-to-end models: An opinionated survey. arXiv [cs.LG] (2025). 12. Zou, A. et al. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv (2023). 13. Sucholutsky, I. et al. Getting aligned on representational alignment. arXiv [q-bio.NC] (2023). 14. Shen, H. et al. Towards Bidirectional Human-AI alignment: A systematic review for clarifications, framework, and future directions. arXiv [cs.HC] (2024). 15. Zhang, S., Han, T., Bhalla, U. & Lakkaraju, H. Towards unified attribution in explainable AI, data-centric AI, and mechanistic interpretability. arXiv [cs.LG] (2025). 16. Bereska, L. & Gavves, E. Mechanistic interpretability for AI safety -- A review. arXiv [cs.AI] (2024). 17. Burden, J., Tešić, M., Pacchiardi, L. & Hernández-Orallo, J. Paradigms of AI evaluation: Mapping goals, methodologies and culture. arXiv [cs.AI] (2025). 18. Langosco, L. et al. Goal misgeneralization in deep reinforcement learning. arXiv [cs.LG] (2021). 19. Shah, R. et al. Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. arXiv (2022). 20. Casper, S. et al. Black-Box Access is Insufficient for Rigorous AI Audits. in The 2024 ACM Conference on Fairness, Accountability, and Transparency 2254–2272 (ACM, New York, NY, USA, 2024). doi:10.1145/3630106.3659037. 21. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025). 22. Raji, I. D., Daneshjou, R. & Alsentzer, E. It’s time to bench the medical exam benchmark. NEJM AI 2, (2025). 23. Jabbour, S. et al. Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study. JAMA 330, 2275–2284 (2023). 24. Masanneck, L. et al. 
Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: Comparative study. J. Med. Internet Res. 26, e53297 (2024). 25. Alber, D. A. et al. Medical large language models are vulnerable to data-poisoning attacks. Nat. Med. 31, 618–626 (2025). 26. Gupta, K., Christensen, P. E., Chen, X. & Song, D. Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis. Advances in Neural Information Processing Systems 33, 17685–17695 (2020). 27. Liventsev, V., Grishina, A., Härmä, A. & Moonen, L. Fully Autonomous Programming with Large Language Models. in Proceedings of the Genetic and Evolutionary Computation Conference 1146–1155 (Association for Computing Machinery, New York, NY, USA, 2023). doi:10.1145/3583131.3590481. 28. DeepSeek-AI et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv [cs.CL] (2025). 29. Ouyang, L. et al. Training language models to follow instructions with human feedback. arXiv (2022) doi:10.48550/arXiv.2203.02155. 30. Christiano, P., Shlegeris, B. & Amodei, D. Supervising strong learners by amplifying weak experts. arXiv [cs.LG] (2018). 31. Leike, J. et al. Scalable agent alignment via reward modeling: a research direction. arXiv (2018) doi:10.48550/arXiv.1811.07871. 32. Hadfield-Menell, D., Dragan, A., Abbeel, P. & Russell, S. Cooperative inverse reinforcement learning. arXiv [cs.AI] (2016). 33. Irving, G., Christiano, P. & Amodei, D. AI safety via debate. arXiv (2018) doi:10.48550/arXiv.1805.00899. 34. Hubinger, E. An overview of 11 proposals for building safe advanced AI. arXiv [cs.LG] (2020). 35. Bukharin, A. et al. Robust reinforcement learning from corrupted human feedback. arXiv [cs.LG] (2024). 36. Kim, D., Lee, K., Shin, J. & Kim, J. Spread Preference Annotation: Direct preference judgment for efficient LLM alignment. arXiv [cs.LG] (2024). 37. Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S. & Garriga-Alonso, A. 
Towards Automated Circuit Discovery for Mechanistic Interpretability. arXiv (2023) doi:10.48550/arXiv.2304.14997. 38. Gao, L. et al. Scaling and evaluating sparse autoencoders. arXiv [cs.LG] (2024). 39. Huang, J., Wu, Z., Potts, C., Geva, M. & Geiger, A. RAVEL: Evaluating interpretability methods on disentangling language model representations. arXiv [cs.CL] (2024). 40. Makelov, A., Lange, G. & Nanda, N. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv [cs.LG] (2024). 41. Engels, J., Michaud, E. J., Liao, I., Gurnee, W. & Tegmark, M. Not all language model features are one-dimensionally linear. arXiv [cs.LG] (2024). 42. Alain, G. & Bengio, Y. Understanding intermediate layers using linear classifier probes. arXiv [stat.ML] (2016). 43. Belinkov, Y. Probing Classifiers: Promises, Shortcomings, and Advances. Comput. Linguist. Assoc. Comput. Linguist. 48, 207–219 (2022). 44. Gurnee, W. & Tegmark, M. Language Models Represent Space and Time. arXiv (2023). 45. Roger, F. Coup probes: Catching catastrophes with probes trained off-policy. 46. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/. 47. Jermyn, A. S., Schiefer, N. & Hubinger, E. Engineering Monosemanticity in Toy Models. arXiv [cs.LG] (2022). 48. Liu, S., Ye, H., Xing, L. & Zou, J. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv [cs.LG] (2023). 49. Panickssery, N. et al. Steering Llama 2 via Contrastive Activation Addition. arXiv [cs.CL] (2023). 50. Elhage, N. et al. Toy Models of Superposition. Transformer Circuits Thread (2022). 51. Scherlis, A., Sachan, K., Jermyn, A. S., Benton, J. & Shlegeris, B. Polysemanticity and Capacity in Neural Networks. arXiv [cs.NE] (2022). 52. Mallen, A., Brumley, M., Kharchenko, J. & Belrose, N. Eliciting Latent Knowledge from quirky language models. arXiv [cs.LG] (2023). 53. 
Tan, D. et al. Analyzing the generalization and reliability of steering vectors. arXiv [cs.LG] (2024). 54. Ravichander, A., Belinkov, Y. & Hovy, E. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? arXiv (2021). 55. Measuring the Persuasiveness of Language Models. https://www.anthropic.com/news/measuring-model-persuasiveness. 56. Sharma, M. et al. Towards understanding sycophancy in language models. arXiv [cs.CL] (2023). 57. Denison, C. et al. Sycophancy to subterfuge: Investigating reward-tampering in large language models. arXiv [cs.AI] (2024). 58. Roger, F., Greenblatt, R., Nadeau, M., Shlegeris, B. & Thomas, N. Benchmarks for detecting measurement tampering. arXiv [cs.LG] (2023). 59. Lehman, J. et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv [cs.NE] (2018). 60. Greenblatt, R. et al. Alignment faking in large language models. arXiv [cs.AI] (2024). 61. Hubinger, E. et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv [cs.CR] (2024). 62. Qi, X. et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv [cs.CL] (2023). 63. Lee, A. et al. A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. arXiv [cs.CL] (2024). 64. A transparency and interpretability tech tree. https://www.alignmentforum.org/posts/nbq2bWLcYmSGup9aF/a-transparency-and-interpretability-tech-tree. 65. Simple probes can catch sleeper agents. https://www.anthropic.com/research/probes-catch-sleeper-agents. 66. Häder, A. et al. Pathogen-specific innate immune response patterns are distinctly affected by genetic diversity. Nat. Commun. 14, 3239 (2023). 67. Mozeika, A., Fraternali, F., Dunn-Walters, D. & Coolen, A. C. C. Roles of repertoire diversity in robustness of humoral immune response. arXiv [q-bio.CB] (2019). 68. Netea, M. G. et al.
Defining trained immunity and its role in health and disease. Nat. Rev. Immunol. 20, 375–388 (2020). 69. Chi, H., Pepper, M. & Thomas, P. G. Principles and therapeutic applications of adaptive immunity. Cell 187, 2052–2078 (2024). 70. Poon, M. M. L. & Farber, D. L. The whole body as the system in systems immunology. iScience 23, 101509 (2020). 71. Thomas-Vaslin, V. Understanding and modeling the complexity of the immune system. in First Complex Systems Digital Campus World E-Conference 2015 261–270 (Springer International Publishing, Cham, 2017). doi:10.1007/978-3-319-45901-1_29. 72. Tsay, G. J. & Zouali, M. The interplay between innate-like B cells and other cell types in autoimmunity. Front. Immunol. 9, 1064 (2018). 73. Duan, T., Du, Y., Xing, C., Wang, H. Y. & Wang, R.-F. Toll-like receptor signaling and its role in cell-mediated immunity. Front. Immunol. 13, 812774 (2022). 74. You, Y. et al. Direct presentation of inflammation-associated self-antigens by thymic innate-like T cells induces elimination of autoreactive CD8+ thymocytes. Nat. Immunol. 25, 1367–1382 (2024). 75. Dzhagalov, I. L., Chen, K. G., Herzmark, P. & Robey, E. A. Elimination of self-reactive T cells in the thymus: a timeline for negative selection. PLoS Biol. 11, e1001566 (2013). 76. Wong, R. S.-Y., Tan, T., Pang, A. S.-R. & Srinivasan, D. K. The role of cytokines in wound healing: from mechanistic insights to therapeutic applications. Explor. Immunol. 5, 1003183 (2025). 77. Caballero-Sánchez, N., Alonso-Alonso, S. & Nagy, L. Regenerative inflammation: When immune cells help to re-build tissues. FEBS J. 291, 1597–1614 (2024). 78. Borky, J. M. & Bradley, T. H. Protecting Information with Cybersecurity. in Effective Model-Based Systems Engineering 345–404 (Springer International Publishing, Cham, 2019). doi:10.1007/978-3-319-95669-5_10. 79. Choi, Y. B., Sershon, C., Briggs, J. & Clukey, C. Survey of Layered Defense, Defense in Depth and Testing of Network Security. 
Preprint at https://www.ijcit.com/archives/volume3/issue5/Paper030518.pdf. 80. Panteli, N., Nthubu, B. R. & Mersinas, K. Being responsible in cybersecurity: A multi-layered perspective. Inf. Syst. Front. 1–19 (2025) doi:10.1007/s10796-025-10588-0. 81. Zheng, Y., Li, Z., Xu, X. & Zhao, Q. Dynamic defenses in cyber security: Techniques, methods and challenges. Digit. Commun. Netw. 8, 422–435 (2022). 82. Bhuyan, M. H., Bhattacharyya, D. K. & Kalita, J. K. Network anomaly detection: Methods, systems and tools. IEEE Commun. Surv. Tutor. 16, 303–336 (2014). 83. Yulianto, S., Soewito, B., Gaol, F. L. & Kurniawan, A. Enhancing cybersecurity resilience through advanced red-teaming exercises and MITRE ATT&CK framework integration: A paradigm shift in cybersecurity assessment. Cyber Security and Applications 3, 100077 (2025). 84. Gambo, M. L. & Almulhem, A. Zero trust architecture: A systematic literature review. Techrxiv (2025) doi:10.36227/techrxiv.173933211.18231232/v1. 85. Al-Sada, B., Sadighian, A. & Oligeri, G. MITRE ATT&CK: State of the art and way forward. ACM Comput. Surv. 57, 1–37 (2025). 86. Manzoor, J., Waleed, A., Jamali, A. F. & Masood, A. Cybersecurity on a budget: Evaluating security and performance of open-source SIEM solutions for SMEs. PLoS One 19, e0301183 (2024). 87. Ilg, N., Duplys, P., Sisejkovic, D. & Menth, M. A survey of contemporary open-source honeypots, frameworks, and tools. J. Netw. Comput. Appl. 220, 103737 (2023). 88. Tzavara, V. & Vassiliadis, S. Tracing the evolution of cyber resilience: a historical and conceptual review. Int. J. Inf. Secur. 23, 1695–1719 (2024). 89. Chu, S., Koe, J., Garlan, D. & Kang, E. Integrating graceful degradation and recovery through requirement-driven adaptation. in Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (ACM, New York, NY, USA, 2024). doi:10.1145/3643915.3644090. 90. Bills, S. et al. Language models can explain neurons in language models.
https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html. 91. Marks, S. et al. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv [cs.LG] (2024). 92. Betley, J. et al. Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv [cs.CL] (2025). 93. Stutz, D., Hein, M. & Schiele, B. Confidence-calibrated adversarial training: Generalizing to unseen attacks. arXiv [cs.LG] (2019). 94. Marks, S. et al. Auditing language models for hidden objectives. arXiv [cs.AI] (2025). 95. Ji, J. et al. Mitigating deceptive alignment via self-monitoring. arXiv [cs.AI] (2025). 96. NeurIPS Tutorial: Cross-disciplinary insights into alignment in humans and machines. https://neurips.cc/virtual/2024/tutorial/99529. 97. U.S. AI Safety Institute Signs Agreements Regarding AI Safety Research, Testing and Evaluation With Anthropic and OpenAI. NIST https://www.nist.gov/news-events/news/2024/08/us-ai-safety-institute-signs-agreements-regarding-ai-safety-research (2024). 98. Chen, S. et al. Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias. in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2024). doi:10.48550/arXiv.2405.05506. 99. Health and Human Services Department. Health Data, Technology, and Interoperability: Certification Program Updates, Algorithm Transparency, and Information Sharing. Federal Register Preprint at https://www.federalregister.gov/documents/2024/01/09/2023-28857/health-data-technology-and-interoperability-certification-program-updates-algorithm-transparency-and (2024). 100. CHAI. https://www.chai.org/.