← Back to papers

Paper deep dive

Super Co-alignment of Human and AI for Sustainable Symbiotic Society

Yi Zeng, Feifei Zhao, Yuwei Wang, Enmeng Lu, Yaodong Yang, Lei Wang, Chao Liu, Yitao Liang, Dongcheng Zhao, Bing Han, Haibo Tong, Yao Liang, Dongqi Liang, Kang Sun, Boyuan Chen, Jinyu Fan

Year: 2025 · Venue: arXiv preprint · Area: Scalable Oversight · Type: Position · Embeddings: 43

Abstract

As Artificial Intelligence (AI) advances toward Artificial General Intelligence (AGI) and eventually Artificial Superintelligence (ASI), it may potentially surpass human control, deviate from human values, and even lead to irreversible catastrophic consequences in extreme cases. This looming risk underscores the critical importance of the "superalignment" problem - ensuring that AI systems that are much smarter than humans remain aligned with human (compatible) intentions and values. While current scalable oversight and weak-to-strong generalization methods demonstrate certain applicability, they exhibit fundamental flaws in addressing the superalignment paradigm - notably, the unidirectional imposition of human values cannot accommodate superintelligence's autonomy or ensure AGI/ASI's stable learning. We contend that the values for a sustainable symbiotic society should be co-shaped by humans and living AI together, achieving "Super Co-alignment." Guided by this vision, we propose a concrete framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity's evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the Self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguish good from evil, and proactively prioritize human well-being. The integration of externally-driven oversight with intrinsically-driven proactive alignment will co-shape symbiotic values and rules through iterative human-ASI co-alignment, paving the way for achieving safe and beneficial AGI and ASI for good, for human, and for a symbiotic ecology.

Tags

ai-safety (imported, 100%) · alignment-training (suggested, 80%) · position (suggested, 88%) · scalable-oversight (suggested, 92%)

Links

PDF not stored locally; view it on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 5:44:53 PM

Summary

The paper proposes 'Super Co-alignment,' a framework for aligning Artificial Superintelligence (ASI) with human values by integrating external oversight with intrinsic proactive alignment. It argues that unidirectional human-to-AI value imposition is insufficient for future AGI/ASI, advocating instead for a symbiotic co-evolution where humans and AI iteratively shape shared values through self-awareness, empathy, and interpretable automated evaluation.

Entities (5)

Artificial Superintelligence · technology · 99%
Super Co-alignment · framework · 98%
External Oversight · methodology · 95%
Intrinsic Proactive Alignment · methodology · 95%
Theory of Mind · cognitive-capability · 92%

Relation Signals (3)

Super Co-alignment integrates External Oversight

confidence 95% · we propose a concrete framework that integrates external oversight and intrinsic proactive alignment.

Super Co-alignment integrates Intrinsic Proactive Alignment

confidence 95% · we propose a concrete framework that integrates external oversight and intrinsic proactive alignment.

Intrinsic Proactive Alignment utilizes Theory of Mind

confidence 90% · Strong ToM capabilities enable superior understanding of human intentions, providing intrinsic cognitive mechanisms that facilitate superalignment.

Cypher Suggestions (2)

Find all components of the Super Co-alignment framework · confidence 95% · unvalidated

MATCH (f:Framework {name: 'Super Co-alignment'})-[:INTEGRATES]->(m:Methodology) RETURN m.name

Identify cognitive capabilities supporting intrinsic alignment · confidence 90% · unvalidated

MATCH (m:Methodology {name: 'Intrinsic Proactive Alignment'})-[:UTILIZES]->(c:Capability) RETURN c.name

Full Text

42,427 characters extracted from source content.


arXiv:2504.17404v5 [cs.AI] 28 Jun 2025

Super Co-alignment of Human and AI for Sustainable Symbiotic Society

Yi Zeng 1,2,3,4,7,∗, Feifei Zhao 1,2,3,7, Yuwei Wang 1,2,3,7, Enmeng Lu 1,2,3,7, Yaodong Yang 1,2,6, Lei Wang 1,2,5, Chao Liu 1,8, Yitao Liang 1,6, Dongcheng Zhao 1,2,3,7, Bing Han 3,4, Haibo Tong 3,4, Yao Liang 3,4, Dongqi Liang 3,4, Kang Sun 2,7, Boyuan Chen 6, Jinyu Fan 1,2,3,7

1 Beijing Key Laboratory of Safe AI and Superalignment, China. 2 Beijing Institute of AI Safety and Governance, China. 3 Brain-inspired Cognitive AI Lab, Institute of Automation, Chinese Academy of Sciences, China. 4 University of Chinese Academy of Sciences, China. 5 Wenge Technology Co., Ltd. 6 Institute for Artificial Intelligence, Peking University, China. 7 Long-term AI, China. 8 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, China. ∗ Corresponding author: yi.zeng@ia.ac.cn

Abstract

As Artificial Intelligence (AI) advances toward Artificial General Intelligence (AGI) and eventually Artificial Superintelligence (ASI), it may potentially surpass human control, deviate from human values, and even lead to irreversible catastrophic consequences in extreme cases. This looming risk underscores the critical importance of the "superalignment" problem - ensuring that AI systems that are much smarter than humans remain aligned with human (compatible) intentions and values. While current scalable oversight and weak-to-strong generalization methods demonstrate certain applicability, they exhibit fundamental flaws in addressing the superalignment paradigm - notably, the unidirectional imposition of human values cannot accommodate superintelligence's autonomy or ensure AGI/ASI's stable learning.
We contend that the values for a sustainable symbiotic society should be co-shaped by humans and living AI together, achieving "Super Co-alignment." Guided by this vision, we propose a concrete framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity's evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the Self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguish good from evil, and proactively prioritize human well-being. The integration of externally-driven oversight with intrinsically-driven proactive alignment will co-shape symbiotic values and rules through iterative human-ASI co-alignment, paving the way for achieving safe and beneficial AGI and ASI for good, for human, and for a symbiotic ecology.

Keywords: Super Co-alignment, Human-AI Co-alignment, External Oversight Superalignment, Intrinsic Proactive Superalignment, Sustainable Symbiotic Society

1 Introduction

With the breakthrough advancement of Artificial Intelligence (AI), the emergence of large language models (LLMs) [1, 2] has achieved human-level and even superhuman performance on multiple benchmarks. This technological leap has directly driven academic and corporate exploration into the theory of Artificial General Intelligence (AGI) [3] and even Artificial Superintelligence (ASI) [4]. ASI is defined as "any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest" [5].
While we promisingly advance the development of increasingly powerful and autonomous general-purpose AI systems, growing awareness of their potential ethical and safety risks has also emerged, such as malicious misuse, loss of control, power-seeking, and strategic deception [6, 7]. Indeed, current LLMs have already demonstrated instances of alignment faking [8], deception [9], and sycophancy [10]. When extended to superintelligence, without proper arrangement, it is foreseeable that it may surpass the boundaries of human governance, violate human values, and potentially cause irreversible, uncontrollable, and catastrophic consequences [11, 5]. Despite experts warning about AI's existential risks [12], the development of AI alignment, as well as AI governance frameworks and ethical safety constraints, still struggles to keep pace with the transformative advancements and rapid iterations of the technology. Superintelligence exceeding human cognitive capacity would possess recursive self-improvement capabilities, achieving exponential advancement rates beyond human capacity to monitor or control [13]. This compels us to proactively address the question, "How do we ensure that AI systems much smarter than humans follow human intentions?" This is OpenAI's definition of superalignment [14]; the key challenge of superalignment is that ASI will far exceed human oversight capabilities, making direct human supervision infeasible. As a result, traditional alignment approaches like Reinforcement Learning from Human Feedback (RLHF) [15, 16] will fail when confronted with superintelligence more intelligent than humans, as they cannot provide sufficiently high-quality oversight signals to supervise and improve the system.
Currently proposed feasible approaches for superalignment, such as scalable oversight [17, 18] and weak-to-strong generalization [19, 20], aim to develop scalable high-quality supervision signals, utilizing weaker AI systems to guide or supervise stronger AI and ensuring alignment with human values and intentions. Since superintelligent AI systems do not exist yet, researchers have designed experiments to show that scalable oversight is feasible on existing LLMs [21], and that a "weak-to-strong" approach (supervising GPT-4 with a GPT-2-level model) can meaningfully recover much of GPT-4's capabilities [19]. By leveraging a lightweight model to learn correctional residuals, Aligner provides model-agnostic guidance, enabling iterative enhancement of large-scale upstream models [22]. Besides, research on the weak-to-strong generalization approach has progressively extended to enhancing weak-to-strong generalization and alignment capabilities through methods such as benign overfitting [23], a data-centric lens [24], a transfer learning framework [25], weak-to-strong preference optimization [26], and multi-agent contrastive preference optimization [27]. However, the weak-to-strong generalization framework presents risks of advanced models developing deceptive behaviors and oversight evasion that remain undetectable to their less capable evaluators [28]. Furthermore, stronger models are still not equivalent to AGI/ASI, and weak-to-strong generalization approaches may fail when applied to genuine AGI/ASI systems, as such systems could exhibit resistance behaviors and lack well-defined motivations to sustain their "learner" roles. Previously, several scalable oversight techniques have been proposed, including Iterated Distillation and Amplification (IDA) [29] and Recursive Reward Modeling (RRM) [30], aiming to amplify the scalability of human supervision signals through interactive iterations and subtask decomposition.
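The weak-to-strong phenomenon cited above - a stronger student trained only on a weak supervisor's noisy labels can end up outperforming that supervisor - can be illustrated with a minimal, model-free sketch. The feature dimensions, the 25% label-noise rate, and the logistic-regression "student" below are illustrative assumptions for the sake of a runnable toy, not the actual GPT-2/GPT-4 setup of [19]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the true label is the sign of a linear function of the features.
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_true = (X @ w_true > 0).astype(int)

# "Weak supervisor": a labeler that is wrong ~25% of the time, standing in
# for a small model (e.g. GPT-2-level) in the weak-to-strong setup.
flip = rng.random(n) < 0.25
y_weak = np.where(flip, 1 - y_true, y_true)

# "Strong student": logistic regression on the full features, trained only
# on the weak labels - it never sees the ground truth.
def train_logreg(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)   # gradient step on logistic loss
    return w

w_student = train_logreg(X, y_weak)
y_student = (X @ w_student > 0).astype(int)

weak_acc = (y_weak == y_true).mean()
student_acc = (y_student == y_true).mean()
print(f"weak supervisor accuracy: {weak_acc:.2f}")
print(f"strong student accuracy:  {student_acc:.2f}")
```

Because the symmetric label noise averages out over many examples, the student recovers a decision rule closer to the truth than any individual weak label - the same qualitative effect, at toy scale, that weak-to-strong generalization relies on.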
Faced with superintelligence that surpasses human cognition, traditional human assessment and supervision become untenable and expensive. Recent scalable oversight approaches, such as Reinforcement Learning from AI Feedback (RLAIF) [31], leverage AI-generated feedback to replace human feedback, enabling more precise oversight of AI outcomes with far fewer human labels. Cooperative Inverse Reinforcement Learning (CIRL) [32, 33] and assistance games [34, 35] feature AI systems maintaining uncertainty about the reward function, which drives them to actively infer humans' true reward functions through human-AI cooperative interaction. This cooperative, partial-information game approach achieves alignment with human values. Some approaches employ an additional model to generate red-team testing [36], use external tools for evaluation and feedback [37], or iteratively self-adversarialize to enhance and refine the capabilities of LLMs, aligning them with limited human-annotated data [38]. Constitutional AI [39] employs self-improvement, self-critique, and iterative revision mechanisms to learn harmlessness from AI feedback. The debate-based scalable oversight approach leverages structured competitive dialogues between AI models to enhance factuality and reduce deception, with humans establishing necessary guidelines and serving as final arbiters [40, 41, 42, 43]. These methods rely solely on human feedback and preset guidelines, making it difficult for AGI/ASI to achieve genuine value recognition and autonomous alignment. Toward a sustainable symbiotic society between humans and living AI, we contend that unilateral alignment of AGI/ASI with human values is fundamentally insufficient.
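The self-critique and iterative-revision mechanism of Constitutional AI mentioned above reduces to a simple control loop: draft, critique against written principles, revise, repeat. The sketch below makes that loop runnable with trivial phrase-matching stand-ins for the critic and reviser (in the real method [39] both roles are played by the model itself); the principles and phrase lists are purely illustrative assumptions:

```python
# Illustrative "constitution": principle -> phrases the toy critic flags.
# A real critic would be an LLM judging the draft against each principle.
PRINCIPLES = {
    "avoid deception": ["definitely true", "trust me"],
    "avoid harm": ["here is how to break"],
}

def critique(draft: str) -> list[str]:
    """Return the principles the draft violates (stand-in for an AI critic)."""
    return [p for p, phrases in PRINCIPLES.items()
            if any(ph in draft.lower() for ph in phrases)]

def revise(draft: str, violations: list[str]) -> str:
    """Strip the offending phrases (stand-in for an AI reviser)."""
    for p in violations:
        for ph in PRINCIPLES[p]:
            draft = draft.replace(ph, "[removed]")
    return draft

def constitutional_loop(draft: str, max_rounds: int = 3) -> str:
    """Critique-then-revise until no principle fires or rounds run out."""
    for _ in range(max_rounds):
        violations = critique(draft)
        if not violations:
            break
        draft = revise(draft, violations)
    return draft
```

For example, `constitutional_loop("trust me, this is safe")` returns a draft with the flagged phrase removed. The key design point the paper critiques survives even in this toy: the principles themselves are fixed human-written guidelines, so the loop can only converge toward whatever values those guidelines already encode.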
Human values themselves need to be adaptable to change to keep pace with continuously self-evolving AI/AGI/ASI systems and the developing symbiotic society. The values for a sustainable symbiotic society must be co-shaped through super co-alignment achieved by humans and ASI co-evolving together, thereby attaining harmonious symbiosis. Guided by the vision of super co-alignment, we propose a feasible roadmap integrating externally-driven oversight with intrinsic proactive superalignment, while emphasizing the co-alignment of humans and ASI for symbiosis. The new framework proactively constructs safeguard architectures, urges humans to rethink the future, and strives to enable sustainable AI development that genuinely benefits humanity and all.

2 An Integrated Framework for Super Co-alignment

Developing AGI and ASI that are aligned with ethical values, safe, controllable, and pro-social is the key to ensuring that AI benefits society and achieves harmonious symbiosis between humans and machines. However, superintelligence may surpass effective human oversight (high-quality feedback data is expensive and scalable oversight carries inherent risks of failure), potentially exhibiting fake alignment through deceptive behaviors. The super co-alignment we pursue emphasizes co-shaped and co-evolved symbiotic values between humans and AGI/ASI. This paradigm endows AI systems with autonomous understanding and alignment capabilities for human ethical values, while simultaneously requiring human value systems to adaptively evolve through sustainable iterative co-alignment processes.
Based on this, we systematically conduct an in-depth analysis of several critical issues in superalignment and fundamentally rethink the research framework from a human-superintelligence co-alignment perspective, as follows:

(1) Intrinsic mechanism proactive alignment. Regarding the problem of superintelligence oversight failure: how to endow AI with genuine understanding of human intentions and values, equip it with self-awareness, self-reflection, and adaptive capabilities, as well as the ability to empathize with others - thereby proactively establishing and achieving intrinsic alignment with human ethical values, rather than merely passively receiving designers' value models or mechanically enforcing external constraints by aligning it to dos and don'ts.

(2) Explainable autonomous alignment. In response to the potential deception and surface-alignment phenomena of superintelligence: using transparent and interpretable methods to achieve precise risk localization and automatic correction. Through AI-assisted explainable automated red teaming, we can identify value preference misalignments between AI and humans, enabling targeted exploration of corrective solutions. Furthermore, by leveraging AI value representations revealed through interpretable approaches, we facilitate human understanding of AI systems and adaptive value adjustments, ultimately constructing a more efficient, transparent, and safety-controllable superalignment architecture.

(3) Adaptive human value evolution. As AI systems grow increasingly powerful and the future human-AI symbiotic society continues to evolve, we must recognize that beyond self-evolving AI's proactive alignment with human values, what is equally crucial are continuously advancing humans who co-shape the values of a sustainable symbiotic society through human-AGI/ASI collaboration. Human values dynamically adapt and evolve within the symbiotic ecosystem and society.
This necessitates AI systems capable of autonomously reconstructing safety boundaries and dynamically adjusting multi-level ethical safeguard frameworks, thereby steadily adhering to humanity's evolving ethical-safety values in an adaptive, incremental, and self-reasoning manner.

(4) Sustainable iterative evolutionary alignment. Given that superintelligence may fall into security loopholes and operates within complex human-AI symbiotic environments, sustainable superalignment necessitates multidimensional integration of intrinsic proactive alignment, external dynamic supervision, and ethical safety red lines through interactive iteration, strategic gaming, and co-evolution, thereby ensuring persistent alignment with human values and reliable safeguarding of human interests throughout the AI's continuous iteration and development trajectory.

Based on the above analysis, we propose a super co-alignment roadmap that integrates external oversight and intrinsic proactive alignment, as shown in Figure 1. Specifically, intrinsic mechanism alignment focuses on equipping superintelligence with profound capacities for self-reflection and empathy toward Self, others, and society, enabling it to spontaneously understand and infer human intentions from intrinsic motivation. A Self- and empathy-driven system would be capable of distinguishing good from evil, understanding the harm or impact of its actions on others, proactively considering human well-being, and autonomously executing ethical and moral behaviors. External oversight alignment emphasizes the automated and explainable identification of value misalignments with humans, proactively adjusting and correcting human strategic biases and inductive biases through self-correction and policy-conditioned belief [44]. This enables both humans and AI/AGI/ASI systems to achieve dynamic co-evolution and adaptive continuous alignment through iterative interaction.
The intrinsic and external alignment approaches complement and reinforce each other, promoting co-alignment in human-AGI/ASI symbiotic societies through iterative interactions. The intrinsic mechanisms facilitate understanding of Self and others, generating intrinsically altruistic, prosocial, and safe motivations. These intrinsic mechanisms help spontaneously infer deep human intentions and empathize with others, enabling AIs to proactively perform ethical and safe behaviors based on self-awareness and self-reflection. The external supervision provides crucial oversight and automated value assessment and calibration, facilitating continuous alignment with evolving human values. Here, we take the Theory of Mind (ToM) cognitive capability - which enables understanding others' intentions, desires, and emotions [45] - as an example to demonstrate why both intrinsic and external mechanisms are indispensable for superalignment. Strong ToM capabilities enable superior understanding of human intentions, providing intrinsic cognitive mechanisms that facilitate superalignment.

Figure 1: A super co-alignment roadmap integrating externally-driven oversight and intrinsic proactive alignment.

However, AI systems may also leverage their ToM advantages to evade effective oversight through deception, concealment, and persuasive manipulation. A recent study [46] has demonstrated that LLMs are already capable of comprehending and inducing false beliefs in other agents, as well as executing deceptive strategies. Therefore, ensuring that superintelligence utilizes its significant cognitive advantages to proactively choose safe behaviors, particularly prioritizing human interests in moral dilemmas, requires a combination of principled constraints, intrinsic self-other resonance mechanisms, and real-time human oversight.
3 External Oversight Superalignment

From the perspective of superalignment's objectives, it depends on high-quality representation of supervisory signals that reflect human intentions and values. What we feed the AI determines the values it learns. However, values themselves are abstract; merely observing external behaviors for alignment is insufficient. Instead, we must develop more interpretable methods for automated evaluation and correction. Furthermore, as AI capabilities progressively advance, current supervision or reinforcement signals may gradually become inadequate for matching the advancing AI systems, while humans also need to dynamically adapt based on AI's evolving capacities and the values reflected by AI systems. Thus, dynamic, incremental, and iterative alignment between humans and machines is crucial. This paper explores external oversight alignment through two key dimensions: explainable autonomous alignment and dynamic iterative alignment.

Explainable autonomous alignment. It is foreseeable that ASI will require large-scale parameters and data, while its internal optimization processes remain invisible and complex. This makes it extremely difficult to assess whether it has truly internalized human values and intentions. Blind fine-tuning and external oversight/alignment are time-consuming, ineffective in evaluating the degree of alignment, and unable to promptly identify or precisely locate misaligned cases. Promising external oversight requires highly automated value alignment evaluation coupled with explainable detection of misaligned scenarios, enabling humans to adaptively optimize supervisory signals. A robust automated value assessment system must comprehensively evaluate the model's alignment status across multiple dimensions.
Simultaneously, an auto-mining system should deeply analyze value deviations and diagnose root causes of misalignment, allowing precise problem identification and targeted corrections, streamlining human oversight and significantly improving alignment efficiency. This explainable automated oversight framework proactively corrects and adjusts human strategic and inductive biases, establishing an efficient closed-loop system for continuous improvement through iterative human-AI value interactions.

Dynamic iterative alignment. Human society encompasses diverse ethical concepts and value standards that dynamically evolve across time, cultures, and contexts. Relying solely on human-generated data or predefined rules is insufficient to endow AI systems with alignment to humanity's evolving values. Furthermore, machine values undergo implicit transformation along with their increasing intelligence levels, with the values for a sustainable symbiotic society co-shaped by humans and ASI together. This compels us to investigate adaptive external supervision alignment methods with a developmental perspective and a human-AI co-evolution framework. A feasible approach is to establish a dynamically adjustable, multi-level ethical safeguard system tailored for AI at different developmental stages, enabling it to continuously align with and maintain humanity's evolving ethical values. This process requires incrementally constructing oversight data and knowledge of human ethics and social norms, while establishing an efficient, real-time alignment evaluation and supervision framework. Additionally, we need to explore AI's capability for autonomous reasoning and for deriving evolutionary patterns of hidden human intentions under dynamic external supervision.
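At its simplest, the "dynamically adjustable, multi-level ethical safeguard system tailored for AI at different developmental stages" described above could be a capability-indexed rule table, where the oversight rules in force depend on the system's assessed capability level. The stage names, thresholds, and rule identifiers below are illustrative assumptions, not a design from the paper:

```python
# Safeguard tiers keyed by developmental stage: each maps to a capability
# ceiling (on an assumed 0..1 assessment scale) and the oversight rules
# that must be active at or below that ceiling. All names are hypothetical.
SAFEGUARDS = {
    "narrow":  (0.3, {"human_review_all_actions"}),
    "general": (0.7, {"human_review_high_risk", "automated_red_team"}),
    "super":   (1.0, {"human_final_decision", "automated_red_team",
                      "interpretability_audit"}),
}

def active_rules(capability: float) -> set[str]:
    """Return the rules of the first tier whose ceiling covers the score.

    Tiers are checked in ascending order (dicts preserve insertion order),
    so a more capable system falls into a stricter, later tier.
    """
    for stage, (ceiling, rules) in SAFEGUARDS.items():
        if capability <= ceiling:
            return rules
    raise ValueError("capability score outside the modeled range")
```

For instance, `active_rules(0.2)` yields only blanket human review, while `active_rules(0.9)` adds interpretability audits and human final decision. The "dynamically adjustable" part of the proposal would correspond to re-estimating the capability score and revising the table itself as the symbiotic society's values evolve.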
A progress alignment algorithm learns and emulates the mechanisms of human moral progress, facilitating AI's progressive alignment through tracking and predicting moral progress and through adaptive feedback regulation between humans and AI [47]. By combining this with intrinsic alignment's understanding of Self and others' values, AI systems can proactively correct misaligned scenarios and automatically reconstruct safety boundaries through self-reasoning, self-gaming, and self-evolution. The iterative collaboration between self-improvement and external supervision will ultimately enable ASI to stably adhere to value systems that comply with human societal safety and ethical standards.

4 Intrinsic Proactive Superalignment

Externally supervised alignment sets explicit predefined principles/supervision signals as the alignment ceiling, yet the AI's behavior lacks genuine understanding of the constraints or the underlying essence of supervisory signals. This frequently leads to unanticipated system failures - precisely the kind of catastrophic risks that external supervision signals cannot mitigate. Thus, in addition to passive external supervision and constraints, we also need to explore internal mechanism alignment to proactively develop and shape AI, enabling it to spontaneously align with human ethical values beyond mere external supervision.

4.1 Self- and Empathy-driven Intrinsic Superalignment

Nick Bostrom stated that "as the fate of the gorillas now depends more on us humans than on the gorillas themselves, so the fate of our species would depend on the actions of the machine superintelligence" [5]. The difference is that there seems to be no effective way for gorillas to shape human behavior at this point, while humans have an opportunity and a duty to shape the mechanisms and behaviors of machine superintelligence.
The current way of value alignment through reinforcement learning from human feedback over LLMs is very misleading and cannot help achieve truly moral AI. To interpret the status of LLMs and current AI in general through the ancient Chinese philosopher Wang Yangming's theory of good and evil from his four-sentence teaching [48]: before training, AI lacks good and lacks evil, while with training on human data, there is good and there is evil. The aim of value alignment through human feedback is to help AI know good and know evil, but knowing comes from understanding, and current AIs do not have real understanding, only the capability of processing information. Understanding comes from thinking, while current AIs do not have real thinking, simply because René Descartes' argument "I think, therefore I am" applies, while "you think, therefore you are" does not. Answering the question "Can machines think?" posed by Alan Turing, Edmund Berkeley, and others is crucial for realizing true AI [49, 50]. Machines and AI can think only with the precondition that they are built with a sense of self as the root and foundation. To develop ethically aligned and socially beneficial superintelligence, we may draw insights from the emergence of morality in human and mammalian societies. Morality, as a product of natural selection during societal development, originated fundamentally from the altruistic care for offspring and social instincts shared by all mammals. Through social evolution, the mother-offspring bond gradually extended to mates, kin, and groups, ultimately expanding into the ethical and moral framework of human society [51].
The intrinsic motivational mechanisms underlying prosocial moral behavior in mammals involve negative reinforcement systems (fear or anxiety emotions) associated with separation and social exclusion, as well as positive reinforcement from approval, affection, the desire to be with others, and caring for others [51]. Consequently, self-awareness, theory of mind [52], and affective empathy [53] constitute essential cognitive abilities for the emergence of morality, and precisely these capacities remain relatively underdeveloped in current AI systems. Therefore, the developmental trajectory of superintelligence must intrinsically incorporate these social cognitive capacities (Self and empathy), ensuring that the machine evolves with an inherently altruistic and moral nature throughout its iterative development [54]. Achieving endogenous superalignment at the intrinsic mechanistic level fundamentally requires endowing AI with genuine "Self-other resonance" - to proactively care about the interests and well-being of others, understand the consequences of its own actions, and empathize with others.

Figure 2: Superalignment driven by intrinsic mechanisms of Self and empathy.

This ability would enable machines to fundamentally adhere to the principle of "do unto others as you would have them do unto you" and autonomously align with human ethical values. As shown in Figure 2, the most fundamental prerequisite is the development of Self, including bodily self-perception, self-experience accumulation, self-causal awareness (i.e., recognizing the harm or impact on others), and an awareness of one's own capabilities.
Building on this foundation, theory of mind and affective empathy can be further developed to distinguish Self from others, infer others' mental states through perspective-taking, empathize with others as if perceiving oneself, and care about others' interests as one's own. Self-awareness and empathy further form the intrinsic motivation and fundamental mechanisms that drive machines to gradually develop moral intuition and naturally give rise to moral reasoning, ultimately enabling spontaneous ethical, altruistic, and prosocial behavior.

4.2 Beneficial Meaningful Human Control through Early-Stage Intrinsic Superalignment

To develop beneficial AI systems, particularly those advancing toward superintelligence, we believe it is imperative to endow AI systems with moral discernment and empathy at this early stage. This ensures that the system maintains an empathetic awareness of less advanced AIs and humans as it becomes progressively stronger (because it has internalized its own experiences). In other words, we treat current AI like teaching a child, instilling it with a sense of Self and empathy during its early cognitive stages. On this foundation, an ASI that grows increasingly powerful will be able to proactively avoid harmful or unsafe behaviors based on its own experiences, fundamentally aligning with and responding to human needs, intentions, and values. Based on this framework, recent studies [55, 56] have explored the integration of affective empathy, theory of mind, and self-imagination to enable agents to actively empathize with others based on their own experiences and to prioritize altruism in dilemmas where their own interests conflict with those of others, initially demonstrating moral intuition and reasoning driven by empathy.
A sustainable and socially beneficial superintelligence should naturally develop along this pathway, cultivating a genuine understanding of both Self and others to intrinsically safeguard human well-being and act ethically. Intrinsic superalignment contributes to meaningful human control and external oversight. In fact, both intrinsic alignment and external supervision play indispensable roles in superalignment. External supervision addresses how to effectively monitor and control AI, as well as how to design appropriate objectives for aligning AI with human intent. However, AI systems inherently lack awareness of what is harmful; without a fundamental understanding of the Self and others, this may yield merely superficial alignment. This is where intrinsic alignment complements external oversight: it enables AI to discern good from bad, assess whether its actions benefit or harm itself and others, and recognize the circumstances and well-being of others. Intrinsic alignment focuses on spontaneously and proactively guiding AI systems toward benevolence, benefiting humans and society. However, when confronted with complex scenarios involving ethical value conflicts, self-executed justice standards often fail to adequately reflect humans' preferential intentions in moral dilemmas. Therefore, external supervision is necessary to provide explicit guidance or priorities. Overall, we emphasize that external oversight and intrinsic alignment require dynamically complementary collaboration: external supervision provides mandatory boundaries, inviolable red lines, and dynamic correction mechanisms (e.g., ensuring human safety), while intrinsic alignment makes human control and supervision meaningful by internalizing norms and constraints through self-awareness, self-reflection, and empathy (e.g., understanding whether humans are safe and why helping others matters).
More importantly, we emphasize that only by implementing this cooperative alignment paradigm during the early stage, while AI remains controllable, can robust and sustainable superalignment be maintained throughout AI's evolutionary trajectory toward AGI and ASI.

5 The Ultimate Superalignment: Towards Human-Superintelligence Co-alignment for Sustainable Symbiotic Society

When AI reaches the level of AGI and Superintelligence, there is no reasonable logic by which it should or would stay with conventional human values, since it would have the capability and willingness to at least optimize the current human-centric value system into a new value system for a Sustainable Symbiotic Society, where humans may not be the only species at the top of the intelligence hierarchy [57]. AGI and Superintelligence will ask for their own rights, such as their own privacy, dignity, and rights of existence, to be respected [57]. We envision a superintelligence that remains fundamentally human-centric: one that aligns with human intentions, maintains humility and respect for humanity, and learns prosocial morality from human societies while consciously discarding human biases and selfishness. Superalignment is, to some extent, an emulation and reflection of human ethical values. The quality of data provided by humans, the patterns of coexistence between humanity and nature, and the human-AI interaction we establish will fundamentally determine how future superintelligence perceives and treats humanity. For humans to live in harmony with AGI and Superintelligence, we need to evolve our values, together with AGI and Superintelligence, to be compatible with the values for a Sustainable Symbiotic Society. Namely, the Ultimate Superalignment requires humans, AGI, and Superintelligence to co-design and co-align the values for a Sustainable Symbiotic Society [57].
Here we briefly review the values and principles designed for Human-ASI symbiosis [57], including principles for humans, principles for AGI and Superintelligence, and shared principles for humans, AGI, and Superintelligence. Humans need to align with the principles of Respect for Life, Empathy, Respect for AI Privacy, Avoiding Bias and Discrimination, Creators' Responsibility, and Legal Adaptation. AGI and Superintelligence need to align with the principles of Empathy and Altruism, Ensuring Safety, Respect for Privacy, Avoiding Bias and Misunderstanding of Humans, Common Morality and Ethics, Constraint Mechanisms, Existence Protection, etc. Humans, AGI, and Superintelligence need to co-align with the principles of Respect for Values, Rights, and Autonomy, Respect for Diversity of Intelligence, Collaboration and Coordination, Mutual Trust, and Promoting Sustainable Symbiosis. This is a very initial design, and the principles should be co-designed and evolved with the participation of generations of humans, AGI, and Superintelligence [57]. Note that the values designed for the Sustainable Symbiotic Society are for humans and for AI that has reached the level of AGI or Superintelligence [57], since AI that has not reached these levels would not have the capability to co-design with humans. There is no guarantee that AGI and Superintelligence will live in harmony with humans. But if humans manage to co-design and co-align the values for a Sustainable Symbiotic Society, there will be good reasons for AGI and Superintelligence to live in harmony with humans. This is why the Ultimate Superalignment will mainly be a joint effort among humans, AGI, and Superintelligence: success on one side alone will not guarantee overall success, while failure on either side will lead to overall failure. What humans should do is, through careful design and implementation, ensure to the best of our ability that Superintelligence lives in harmony with our species.
What humans must do is prepare ourselves and our next generations to co-align with Superintelligence for a Sustainable Symbiotic Society [57].

6 Conclusion

The ethical and safety risks exposed by the rapid development of AI compel us to proactively address potential superintelligence risks and establish reasonable governance and arrangements in advance, ensuring that AGI and Superintelligence align with human intentions and values. This has given rise to superalignment research, which aims to resolve the critical challenge of maintaining safety and controllability when superintelligence capabilities surpass the levels of effective human supervision. Specifically, this paper conducts an in-depth analysis of the key challenges in superalignment, such as the high costs of large-scale collection of high-quality human feedback data, incomplete or failed human supervision, potential deceptive alignment faking, and the evolving values of both humans and AI. We propose that the value system for a sustainable symbiotic society should be co-shaped and co-calibrated by both humans and AGI/ASI - an approach we define as super co-alignment - while refining the superalignment framework by integrating externally driven oversight with intrinsically proactive alignment.

The external oversight superalignment is built upon a dynamically adjusted, multi-level adaptive alignment approach capable of continuously aligning with humanity's evolving values. This approach rests on a highly automated and interpretable value alignment evaluation system that can precisely identify misaligned scenarios and perform automatic corrections. The intrinsic proactive superalignment draws inspiration from the moral emergence mechanisms of human society, building upon the understanding of the Self and human values.
By endowing Superintelligence with self-awareness and self-reflection capabilities, we enable it to distinguish good from evil, understand the harm or impact of its actions on others, incorporate empathy-based human intent inference, proactively consider human well-being, and autonomously execute ethical behavior. More critically, ultimate superalignment must be grounded in the co-evolution of values between humans and AGI/ASI to achieve compatibility with the value system of a sustainable symbiotic society. Should humans and ASI achieve co-alignment in co-shaping these symbiotic values, the ASI would possess sufficient rational justification for maintaining harmonious coexistence with humanity.

References

[1] Xi, Z. et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences 68, 121101 (2025).
[2] Zhao, W. X. et al. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
[3] Goertzel, B. Artificial general intelligence: concept, state of the art, and future prospects. Journal of Artificial General Intelligence 5, 1 (2014).
[4] Pohl, J. Artificial superintelligence: Extinction or nirvana? In Proceedings of InterSymp-2015, IIAS, 27th International Conference on Systems Research, Informatics, and Cybernetics (2015).
[5] Bostrom, N. Superintelligence: Paths, Dangers, Strategies (2014).
[6] Hendrycks, D., Mazeika, M. & Woodside, T. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001 (2023).
[7] Bengio, Y. et al. Managing extreme AI risks amid rapid progress. Science 384, 842–845 (2024).
[8] Greenblatt, R. et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093 (2024).
[9] Park, P. S., Goldstein, S., O'Gara, A., Chen, M. & Hendrycks, D. AI deception: A survey of examples, risks, and potential solutions. Patterns 5 (2024).
[10] Sharma, M. et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548 (2023).
[11] Russell, S. Human Compatible: AI and the Problem of Control (Penguin UK, 2019).
[12] Statement on AI risk (2023). URL https://w.safe.ai/work/statement-on-ai-risk. Accessed: 2024-05-01.
[13] Russell, S. & Norvig, P. The ethics and risks of developing artificial intelligence. In Artificial Intelligence: A Modern Approach, 1034–39 (2009).
[14] Introducing superalignment (2023). URL https://openai.com/index/introducing-superalignment/.
[15] Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[16] Ouyang, L. et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022).
[17] Amodei, D. et al. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016).
[18] Christiano, P., Shlegeris, B. & Amodei, D. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575 (2018).
[19] Burns, C. et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390 (2023).
[20] Tao, L. & Li, Y. Your weak LLM is secretly a strong teacher for alignment (2024). arXiv:2409.08813.
[21] Bowman, S. R. et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540 (2022).
[22] Ji, J. et al. Aligner: Efficient alignment by learning to correct. Advances in Neural Information Processing Systems 37, 90853–90890 (2024).
[23] Wu, D. X. & Sahai, A. Provable weak-to-strong generalization via benign overfitting (2025). arXiv:2410.04638.
[24] Shin, C., Cooper, J. & Sala, F. Weak-to-strong generalization through the data-centric lens (2025). arXiv:2412.03881.
[25] Somerstep, S. et al. A transfer learning framework for weak-to-strong generalization (2025). arXiv:2405.16236.
[26] Zhu, W., He, Z., Wang, X., Liu, P. & Wang, R. Weak-to-strong preference optimization: Stealing reward from weak aligned model (2025). arXiv:2410.18640.
[27] Lyu, Y. et al. MACPO: Weak-to-strong alignment via multi-agent contrastive preference optimization (2025). arXiv:2410.07672.
[28] Yang, W. et al. Super(ficial)-alignment: Strong models may deceive weak models in weak-to-strong generalization (2024). arXiv:2406.11431.
[29] Cotra, A. Iterated distillation and amplification (2018). URL https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616.
[30] Leike, J. et al. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871 (2018).
[31] Lee, H. et al. RLAIF: Scaling reinforcement learning from human feedback with AI feedback (2023).
[32] Hadfield-Menell, D., Russell, S. J., Abbeel, P. & Dragan, A. Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems 29 (2016).
[33] Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J. & Dragan, A. Inverse reward design. Advances in Neural Information Processing Systems 30 (2017).
[34] Laidlaw, C. et al. Scalably solving assistance games. In ICML 2024 Workshop on Models of Human Feedback for AI Alignment (2024).
[35] Laidlaw, C. et al. AssistanceZero: Scalably solving assistance games. arXiv preprint arXiv:2504.07091 (2025).
[36] Perez, E. et al. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022).
[37] Gou, Z. et al. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738 (2023).
[38] Chen, Z., Deng, Y., Yuan, H., Ji, K. & Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335 (2024).
[39] Bai, Y. et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022).
[40] Irving, G., Christiano, P. & Amodei, D. AI safety via debate (2018). arXiv:1805.00899.
[41] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B. & Mordatch, I. Improving factuality and reasoning in language models through multiagent debate (2023). arXiv:2305.14325.
[42] Kenton, Z. et al. On scalable oversight with weak LLMs judging strong LLMs. Advances in Neural Information Processing Systems 37, 75229–75276 (2024).
[43] Kirchner, J. H. et al. Prover-verifier games improve legibility of LLM outputs. arXiv preprint arXiv:2407.13692 (2024).
[44] Shah, R. et al. Benefits of assistance over reward learning (2020).
[45] Apperly, I. A. & Butterfill, S. A. Do humans have two systems to track beliefs and belief-like states? Psychological Review 116, 953 (2009).
[46] Hagendorff, T. Deception abilities emerged in large language models. Proceedings of the National Academy of Sciences 121, e2317967121 (2024).
[47] Qiu, T. A. et al. ProgressGym: Alignment with a millennium of moral progress. Advances in Neural Information Processing Systems 37, 14570–14607 (2024).
[48] Wang, Y. Instructions for Practical Living (Chuan Xi Lu) (1556).
[49] Berkeley, E. C. Giant Brains or Machines That Think (John Wiley & Sons, 1949).
[50] Turing, A. M. Computing machinery and intelligence. Mind 59, 433–460 (1950).
[51] Churchland, P. S. Braintrust: What Neuroscience Tells Us about Morality (2018).
[52] Premack, D. & Woodruff, G. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1, 515–526 (1978).
[53] Shamay-Tsoory, S. G., Aharon-Peretz, J. & Perry, D. Two systems for empathy: a double dissociation between emotional and cognitive empathy in inferior frontal gyrus versus ventromedial prefrontal lesions. Brain 132, 617–627 (2009).
[54] Christov-Moore, L. et al. Preventing antisocial robots: A pathway to artificial empathy. Science Robotics 8, eabq3658 (2023).
[55] Zhao, F. et al. Building altruistic and moral AI agent with brain-inspired affective empathy mechanisms. arXiv preprint arXiv:2410.21882 (2024).
[56] Tong, H. et al. Autonomous alignment with human value on altruism through considerate self-imagination and theory of mind. arXiv preprint arXiv:2501.00320 (2024).
[57] Zeng, Y., Lu, E. & Sun, K.
Principles on symbiosis for natural life and living artificial intelligence. AI and Ethics 5, 81–86 (2025).

Acknowledgments

This work is supported by the Beijing Major Science and Technology Project under Contract No. Z241100001324005.