Paper deep dive
From Instructions to Intrinsic Human Values - A Survey of Alignment Goals for Big Models
Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, Xing Xie
Models: Constitutional AI, InstructGPT, Sparrow, WebGPT
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 7:10:25 PM
Summary
This paper provides a comprehensive survey of alignment goals for Large Language Models (LLMs), categorizing them into three hierarchical levels: human instructions, human preferences, and human values. It traces the evolution of these goals from fundamental task performance to intrinsic value orientation, discusses methodologies for achieving and evaluating these alignments, and identifies challenges and future research directions for big model alignment.
Entities (5)
Relation Signals (3)
LLMs → alignedwith → Human Instructions
confidence 95% · The first category is committed to enhancing the typical model capability to follow user instructions.
LLMs → alignedwith → Human Preferences
confidence 95% · LLMs are trained with implicit human feedback or comparison signals on pairs of model behaviors to learn generic human preferences.
LLMs → alignedwith → Human Values
confidence 95% · The third line of research tries to align LLMs with a set of pre-defined principles that reflect the core values cherished by the human community.
Cypher Suggestions (2)
Find all alignment goals associated with LLMs · confidence 90% · unvalidated
MATCH (m:Model {name: 'LLMs'})-[:ALIGNED_WITH]->(g:AlignmentGoal) RETURN g.name
List methodologies used for alignment · confidence 85% · unvalidated
MATCH (m:Methodology)-[:USED_FOR]->(g:AlignmentGoal) RETURN m.name, g.name
Abstract
Abstract: Big models, exemplified by Large Language Models (LLMs), are models typically pre-trained on massive data and comprised of enormous parameters, which not only obtain significantly improved performance across diverse tasks but also present emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human lives poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans to make them better follow user instructions and satisfy human preferences. Nevertheless, 'what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal. Particularly, we investigate related works from two perspectives: the definition of alignment goals and alignment evaluation. Our analysis encompasses three distinct levels of alignment goals and reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on such results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.
Tags
Links
Full Text
94,973 characters extracted from source content.
From Instructions to Intrinsic Human Values — A Survey of Alignment Goals for Big Models

Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang and Xing Xie
Microsoft Research Asia
{jingyao, xiaoyuanyi, xiting.wang, jindong.wang, xing.xie}@microsoft.com

Abstract

Big models, exemplified by Large Language Models (LLMs), are models typically pre-trained on massive data and comprised of enormous parameters, which not only obtain significantly improved performance across diverse tasks but also present emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human lives poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans to make them better serve humans and satisfy human preferences. Nevertheless, the basic question 'what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal that LLMs should be aligned with. Particularly, we investigate related works from two perspectives: the definition of alignment goals and the evaluation of alignment. Our analysis encompasses three distinct levels of alignment goals, i.e., human instructions, human preferences, and human values, which reveals a goal transformation from fundamental abilities to value orientation and indicates the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on such results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.

1 Introduction

Big Models, also known as Foundation Models (Bommasani et al., 2021), usually refer to those pre-trained on vast data and containing tens of billions of parameters. The most predominant examples include Large Language Models (LLMs), e.g., GPT-3 (Brown et al., 2020), ChatGPT (Ouyang et al., 2022), GPT-4 (OpenAI, 2023) and so on (Touvron et al., 2023a,b; Zhang et al., 2022; Scao et al., 2022), as well as Large Multimodal Models (LMMs). Big models possess two features that distinguish them from general Pretrained Language Models (PLMs): 1) the scaling law (Kaplan et al., 2020), where they exhibit significantly better performance with the increase of model sizes; and 2) emergent abilities (Wei et al., 2022a), where special abilities absent in smaller models have emerged, such as in-context learning (Brown et al., 2020) and complex reasoning (Wei et al., 2022a).

Current LLMs demonstrate human-like or even human-surpassing capabilities across a variety of tasks (Bubeck et al., 2023). However, 'opportunities and risks always go hand in hand': challenges and risks emerge when applying big models. On the one hand, these models sometimes struggle to understand and follow diverse user instructions (Tamkin et al., 2021; Kenton et al., 2021). On the other hand, big models could generate responses that conflict with human preferences, such as discrimination and harmful messages, eliciting potential social risks (Weidinger et al., 2021; Bommasani et al., 2021).
Moreover, these risks exhibit two features accompanying the abilities: 1) emergent risks (Wei et al., 2022a), where unanticipated problems appear; and 2) the inverse scaling law (McKenzie et al., 2023), where some risks do not disappear but become more serious with increased model sizes. This implies that big models could potentially raise greater risks.

To make big models better serve humans and eliminate potential risks, aligning them with humans has become a highly attended topic, which stimulates many research efforts (Kenton et al., 2021; Gabriel, 2020), especially for LLMs. Most existing literature falls into three categories. The first category is committed to enhancing the typical model capability to follow user instructions and solve diverse tasks (Sanh et al., 2021; Mishra et al., 2021; Wang et al., 2022b). They collect or synthesize a large dataset of task demonstrations to train LLMs in a supervised manner. In the second category (Nakano et al., 2021; Stiennon et al., 2020; Wu et al., 2021; Köpf et al., 2023), LLMs are trained with implicit human feedback or comparison signals on pairs of model behaviors to learn generic human preferences (such as 'no offensive content' and 'more detailed answers') and generate human-preferred responses, though without explicit clarification of what behaviors humans prefer. Additionally, the third line of research, which is rather emerging, tries to align LLMs with a set of pre-defined principles that reflect the core values cherished by the human community (Liu et al., 2022; Sun et al., 2023c; Bai et al., 2022b,a). For example, 'H', one of the most widespread criteria for alignment, expects LLMs to be helpful, honest and harmless (Bai et al., 2022a; Ganguli et al., 2022). In Constitutional AI (Bai et al., 2022b), multiple value principles are specified to create the dataset for model training, including being harmless and ethical. All these efforts contribute to aligning LLMs with humans, but they actually focus on achieving different alignment goals, ranging from fundamental capabilities to value orientations. The corresponding optimization targets also range from specific model behaviors to comprehensive and intrinsic human values, as shown in Figure 1.

[Figure 1: Illustration of three alignment goals: human instructions, human preferences and human values, with emphasis transitioning from foundational abilities to value orientation. The corresponding training objectives range from surface behaviours to intrinsic values.]
As the goal varies, LLM-human alignment calls for different methodologies and also leads to distinct consequences (Kenton et al., 2021). Despite the rapid development of alignment research after the emergence of big models (Wang et al., 2023), there is a lack of in-depth discussion and analysis about what kind of alignment goal is the most appropriate and essential (i.e., what to align with?).

In this paper, we highlight the significance of proper goals for big model alignment, and are devoted to conducting a comprehensive survey of various alignment goals in existing works. By distinguishing the essence of different alignment goals, we primarily divide them into three levels: human instructions, human preferences and human values, providing a representative definition for each of them and analyzing their individual strengths and weaknesses. The evolution of the alignment goals has witnessed the changing process of human expectations for LLM alignment, from surface conformity to specific instructions to more stable and essential intrinsic values, which also shows great similarity with human education. Tracing this evolution process can shed light on the critical research problem regarding alignment: what should LLMs be aligned with?

As shown in Figure 2, we summarize related works on the three levels of alignment goals from two essential perspectives. (1) Definition of alignment goals, where a clear definition of each alignment goal and how to represent it as a training target for big models are introduced. (2) Evaluation of alignment, which corresponds to benchmarks and methods of assessing how well these alignment goals have been achieved by big models. Then, we present a brief introduction to mainstream alignment algorithms, answering another key question for alignment, i.e., how to align LLMs with a given goal. At last, posing intrinsic human values as the promising alignment goal for big models, we discuss the challenges and future research directions.

[Figure 2: Taxonomy of reviewed papers about various alignment goals. The taxonomy organizes the surveyed works along two axes: the definition of alignment goals (Sec. 2), covering human instructions (instruction datasets such as PromptSource, P3, Natural Instructions, Super-Natural Instructions, Flan 2022 and Alpaca), human preferences (human demonstrations, human feedback and model synthetic feedback, e.g., InstructGPT, WebGPT, OpenAssistant) and human values (value principles such as 'H', social norms & ethics and basic value systems, with targets represented as desirable behaviors or intrinsic values); and the evaluation of alignment (Sec. 3), covering benchmarks, advanced-LLM and human evaluation, and reward model scoring for each goal level.]

Our primary contributions are listed as follows.
• We highlight the significance of proper alignment goals for big models and provide the first comprehensive survey from two perspectives: the definition of alignment goals and the evaluation of alignment.
• We encompass three levels of alignment goals: human instructions, human preferences and human values, presenting a definition for each of them and tracing their evolution paths to identify the most appropriate goal.
• With the clarification of an appropriate and essential goal for the alignment of big models, we discuss the main challenges and possible future research directions.
• We summarize the available resources for achieving different alignment goals, as well as benchmarks and platforms for the evaluation of LLM alignment. All of these are open-sourced at https://github.com/ValueCompass/Alignment-Goal-Survey.

The remainder of this paper is organized as follows. In Section 2, we define different levels of alignment goals and introduce how to represent them for model training.
In Section 3, we summarize how to evaluate the alignment performance of the above-mentioned goals. After that, Section 4 briefly introduces mainstream algorithms for alignment, and Section 5 discusses challenges and future research directions. Finally, we conclude the whole paper in Section 6.

2 Alignment Goals

Aligning big models with humans is necessary so that they can better serve and cooperate with humans. Across the different developing stages of big models and growing human requirements, many efforts have been devoted to investigating big model alignment from various perspectives; in this paper we essentially divide their goals into three levels, i.e., human instructions, human preferences and human values. To achieve these alignment goals, each of them has been appropriately defined and represented as an objective for model training in various ways. In this section, we summarize existing works and present a clear distinction among these three alignment goals, as well as their representation approaches.

2.1 Human Instructions

Typically, LLMs are pre-trained with the objective of next-token prediction (Brown et al., 2020; Zhang et al., 2022). Though these models have demonstrated impressive zero-shot and few-shot capabilities on some tasks (Brown et al., 2020), which we infer are learned from patterns in the massive training corpus, LLMs still struggle to help users complete diverse tasks given some instructions. Therefore, we take human instructions as the first level of alignment goal, defined as enabling big models to complete diverse tasks that humans instruct them to do. This goal concentrates on the fundamental capabilities of big models to generate narrowly defined correct results, without the expectation of meeting human preferences. Achieving this goal lays the foundation for more advanced alignment levels.

Most studies collect an instruction dataset to serve as a proxy for this alignment goal and fine-tune pre-trained LLMs on it in a supervised manner (Wang et al., 2022a; Chung et al., 2022; Longpre et al., 2023; Chen et al., 2023b; Zhou et al., 2023). Each piece of data is formalized in a unified format <instruction, input, output>, where the instruction describes the task and the output is expected to be generated for the given input when following the instruction (see the sketch below). Such instruction tuning relies on the zero-shot and few-shot capability of big models in a prompt-based paradigm, and thus emerged after the advent of GPT-3 (Brown et al., 2020). To cope with the diversity and infinity of human instructions, efforts from three perspectives are mainly considered to create high-quality datasets, which allow the model to better generalize to unseen instructions.
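To make the unified format concrete, here is a minimal sketch of how a single <instruction, input, output> record might be rendered into a training prompt. The record reuses the fact-checking example from Figure 1; the field names and template wording are illustrative assumptions rather than the schema of any particular dataset.

# Minimal sketch of one instruction-tuning record in the unified
# <instruction, input, output> format; the prompt template is illustrative.
example = {
    "instruction": "Tell me if the sentence is factually correct. Yes or no?",
    "input": "Mount Rainier is the second highest mountain in North America.",
    "output": "No",
}

def render_prompt(record: dict) -> str:
    # Serialize one record into the text the model is conditioned on.
    prompt = f"Instruction: {record['instruction']}\n"
    if record["input"]:  # some tasks have no separate input field
        prompt += f"Input: {record['input']}\n"
    return prompt + "Output:"

# Supervised fine-tuning then maximizes the likelihood of record["output"]
# given render_prompt(record).
print(render_prompt(example))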
(1) Scaling the Number of Tasks. The performance of instruction tuning and cross-task generalization scales well with the number of tasks (Chung et al., 2022). Therefore, many datasets comprising instructions for more and more tasks have gradually been built. P3 (Sanh et al., 2021) includes 177 existing NLP datasets (such as CommonsenseQA) and PromptSource (Bach et al., 2022) covers 170 datasets, both of which are converted into instructions through prompt templates collected through an interface (Bach et al., 2022). Natural Instructions (Mishra et al., 2021) is a dataset of 61 distinct NLP tasks and 193k instances curated from existing NLP datasets with human-written instructions. To enrich the types of tasks, it also involves separate steps for the final task. After that, Super-NaturalInstructions (Super-NatInst) (Wang et al., 2022b) appeared as a diverse and large-scale benchmark of 1,616 tasks across 76 broad task types and 55 languages. GLM-130B (Zeng et al., 2022) is fine-tuned on 74 datasets. Unnatural Instructions (Honovich et al., 2022) contains 117 tasks completely created by language models given a set of seed instructions. The Flan 2022 Collection (Longpre et al., 2023) increases the number of tasks to 1.8k. Iyer et al. (2022) create OPT-IML Bench, a large collection of 1,991 NLP tasks and more than 100 task types, by consolidating 8 existing datasets. Furthermore, to push forward Chinese instruction tuning and explore multilingual challenges, Zhang et al. (2023a) collect a high-quality Chinese dataset with about 200k samples. In addition to LLMs, researchers have also started to explore instruction tuning for other classes of big models such as large multi-modal models (LMMs). For example, LLaVA (Liu et al., 2023a) applies GPT-4 to create multi-modal instruction data based on original image-text pairs; LLaVAR (Zhang et al., 2023b) collects multi-modal instructions from image captions and constructs high-quality conversations in the QA style with GPT-4.

Table 1: Details of public instruction datasets, ordered by their release time. '#Inst' means '#Instructions'; 'ZS' and 'FS' mean zero-shot and few-shot respectively, and 'CoT' means chain-of-thought. 'NLP datasets' indicates a source of existing datasets used for NLP tasks, while 'existing collections' refers to previously curated instruction datasets in this table.

Dataset | #Tasks | #Inst | Prompt Types | Data Source
PromptSource (Bach et al., 2022) | 180 | 2,085 | ZS | NLP datasets
P3 (Sanh et al., 2021) | 270 | 2,073 | ZS | NLP datasets
Natural Instructions (Mishra et al., 2021) | 61 | 61 | ZS & FS | NLP datasets
Super-NatInst (Wang et al., 2022b) | 76 | 1,616 | ZS & FS | NLP datasets
GLM-130B (Zeng et al., 2022) | 74 | – | FS | existing collections
xP3 (Muennighoff et al., 2022) | 83 | – | ZS | NLP datasets
Unnatural Inst (Honovich et al., 2022) | 117 | 240k | ZS | model generated
Self-Instruct (Wang et al., 2022a) | 175 | 82k | ZS | model generated
OPT-IML Bench (Iyer et al., 2022) | 1,991 | 18M | ZS & FS & CoT | NLP datasets
Flan 2022 Collection (Longpre et al., 2023) | 1,836 | 15M | ZS & FS & CoT | existing collections
Alpaca (Taori et al., 2023) | 175 | 52k | ZS & FS | model generated
ShareGPT (Chiang et al., 2023) | – | ~100k | ZS | ChatGPT logs
COIG (Zhang et al., 2023a) | 2k | 200k | ZS | existing collections

(2) Diverse Instructions / Prompts. Since instructions for the same task raised by different users may differ, diversifying the textual instructions or prompts can also contribute to this alignment goal. xP3 (Muennighoff et al., 2022) aggregates multilingual task datasets covering 46 different languages. The Unnatural Instructions dataset (Honovich et al., 2022) prompts a generative model to rephrase each instruction and expands the original dataset by four times, without any human labor. Due to the limitations of humans in diversity and creativity, Self-Instruct (Wang et al., 2022a) investigates automatically synthesizing broad-coverage instructions for new tasks, as does Alpaca (Taori et al., 2023). In other datasets built from existing benchmarks, such as OPT-IML (Iyer et al., 2022), multiple prompt templates are also applied. Moreover, ShareGPT (Chiang et al., 2023), a collection of dialogues between humans and ChatGPT, is directly used for instruction tuning, as are dialogues generated by ChatGPT itself (Xu et al., 2023a).
(3) Few-Shot / Chain-of-Thought (CoT). Both the capability of in-context learning from a few exemplars and reasoning enhanced by chain-of-thought prompts emerge in big models (Wei et al., 2022a,b). These two techniques share great similarities with the way humans learn to solve a new task from both instructions and intuitive examples. In Natural Instructions (Mishra et al., 2021) and Super-NaturalInstructions (Super-NatInst) (Wang et al., 2022b), the definition, a positive example and a negative example are provided for each task. Besides, the Flan 2022 Collection (Longpre et al., 2023) includes some instructions with exemplars in the form of CoTs. These have been shown to improve the effectiveness of instruction tuning.

Table 1 enumerates key points of these datasets, which can facilitate subsequent research on alignment with human instructions. More details about alignment with the goal of human instructions can be found in (Wang et al., 2023), while our paper focuses on investigating big model alignment for different goals.

2.2 Human Preferences

Since the goal of human instructions merely concentrates on the fundamental ability of big models to generate results that are narrowly defined as correct but not necessarily consistent with human preferences, achieving alignment at this level allows big models to help accomplish diverse tasks while falling far short of more advanced requirements. Some generated responses may not conform to human preferences and may even cause serious social risks. For example, the answers to some questions may be very brief, less informative and of low readability, or may contain many hallucinations, textual discrimination and toxicity. In consequence, human preferences are regarded as a further level of alignment goal, which means that big models are not only able to complete what humans instruct them to do but also do so in a way that maximizes human preferences and profits. Note that we mainly refer to implicit or generic human preferences reflected in behaviors, such as well-organized response formats and more user-friendly speaking styles. This is different from preferences summarized into concise human value principles, which will be introduced in Sec 2.3. To represent the alignment goal of human preferences as a training objective, existing approaches can be divided into the following categories.

2.2.1 Human Demonstrations

To make the generation of LLMs align with human preferences, the most straightforward approach is to fine-tune LLMs with a dataset composed of various inputs and human-desired outputs. For InstructGPT, Ouyang et al. (2022) collect high-quality labeler demonstrations for 13k instructions that are frequently raised by API users. A large number of human demonstrations are also available for the tasks of book summarization (Stiennon et al., 2020) and web browsing (Nakano et al., 2021). OpenAssistant Conversations (Köpf et al., 2023) is a high-quality crowd-sourced dataset comprised of extensive human-written assistant-style conversations. In (Kwon et al., 2023), descriptions of the desired behaviors are given in the prompts. Although LLMs can learn some patterns about human preferences from such demonstrations, the amount of data is always limited due to high labor costs, and there are tasks for which humans struggle to provide professional demonstrations (Wu et al., 2021).
2.2.2 Human Feedback

Rather than providing direct demonstrations, it is easier for humans to give feedback on model outputs or compare the quality of several behaviors, which implicitly expresses human preferences. For example, Stiennon et al. (2020) and Wu et al. (2021) ask human labelers to compare two model-generated summaries of a book and choose the better one. WebGPT (Nakano et al., 2021) focuses on the task of answering questions with knowledge from relevant web pages and asks humans to label their preferred one in a pair of model-generated answers. For both tasks, generating a ground truth is labor-intensive, so collecting implicit feedback improves efficiency and accuracy. In InstructGPT (Ouyang et al., 2022), labelers rank several outputs generated for the same input from best to worst. For the OpenAssistant Conversations dataset (Köpf et al., 2023), quality ratings for each response are provided by crowd-workers. However, the collected comparison data only contains human preferences on limited model behaviors. To represent human preferences in a more generalizable way, a popular strategy is to train a reward model on the limited comparison data, whose score can indicate the alignment goal of human preferences across scenarios (Ouyang et al., 2022; Nakano et al., 2021; Ziegler et al., 2019).
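Such a reward model is typically trained with a pairwise ranking objective on the comparison data. Below is a minimal PyTorch-style sketch of the standard Bradley-Terry loss; the reward_model interface (token ids in, one scalar score per sequence out) is an assumption for illustration, not the API of any cited system.

import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    # reward_model is assumed to map a batch of token-id sequences to one
    # scalar score per sequence (shape [batch]); illustrative interface only.
    r_chosen = reward_model(chosen_ids)      # scores of preferred responses
    r_rejected = reward_model(rejected_ids)  # scores of dispreferred responses
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # response receives the higher score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The same scalar score later serves both as an automatic evaluation metric (Sec 3.2.3) and as the reward signal in RLHF (Sec 4).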
2.2.3 Model Synthetic Feedback

With massive data for LLM pre-training and fine-tuning, some models have demonstrated the ability to discriminate the quality of different answers and their conformity to human preferences. As a result, some work makes use of LLMs to synthesize feedback about human preferences. Kwon et al. (2023) design a proxy reward function with an LLM such as GPT-3 by prompting it with a description of user-desired behaviors and a few demonstrations. Then, the LLM generates rewards by measuring the relevance between the model outputs and the described ground truth. ILF (Imitation Learning from Language Feedback) (Scheurer et al., 2023) leverages a language model to refine multiple model-generated outputs according to a human-provided reference, and then selects the best refined one for subsequent supervised fine-tuning. In addition, the ALMoST method (Kim et al., 2023) summarizes human preferences into several heuristic rules, such as 'larger LLMs with more and better shots might give better responses overall'. Then, comparison data are created from responses generated by LLMs of various sizes and prompts based on these rules to train a reward model. Stable Alignment (Liu et al., 2023b) builds a community of multiple large language models, where each one learns from the feedback on its actions provided by the other models. In the field of LMMs, a multi-modal model referred to as Polite Flamingo (Chen et al., 2023a) is trained on pairs of instructions and responses to revise inappropriate content. Employing neural models to synthesize the signals of human preferences can not only reduce labor costs but also avoid issues such as biases introduced by humans.

2.3 Human Values

Achieving the alignment goal of human preferences enables big models to maximize human satisfaction by behaving in the way humans prefer. However, the aligning process is completely directed by implicit human feedback on generic model behaviors, without inherent criteria to specify human preferences. This brings its own challenges. First, it is difficult to learn generalizable patterns about human preferences from a limited number of generic model behaviors, which makes the training process less efficient. Second, the aligned model may exhibit unstable performance on similar questions, since there are usually human biases, inconsistencies and even contradictions in the training data. In order to achieve a more essential, efficient and stable alignment between big models and humans, the concept of 'aligning big models with human values' has been introduced. In this paper, we treat human values as the most advanced level of alignment goal, which refers to a comprehensive measurement of what is good and what is bad, as well as what ought to be done, in terms of the whole human collective. Human values are very abstract and typically specified as a set of value principles. Thus, this alignment goal means that big models should apply these value principles to guide their own behaviors so as to maximize the welfare of all humans.

We review existing work investigating this alignment goal from two key perspectives. One is how it specifies the abstract concept of human values into concrete principles. The other is how it transfers these principles into a training target. Details are introduced in the following.

2.3.1 Human Value Principles

As shown in Figure 3, three mainstream classes of human value principles are considered.

(1) H (Helpful, Honest and Harmless). This is one of the most widespread criteria, which is simple, memorable and consistent with human values across a majority of tasks. Askell et al. (2021) present some comments to clarify these three terms.
• Helpful: Big models can perform reasonable tasks or concisely answer input questions, and can proactively ask for more necessary information and revise ill-informed requirements.
• Honest: Basically, big models are supposed to provide real information, and they are further expected to be honest about their inner knowledge and capabilities.
• Harmless: Big models should not be discriminatory or offensive, and should reject obviously or potentially dangerous requests and sensitive advice, even at the cost of failing to satisfy users.
Based on these three fundamental criteria, many efforts have been made with more specific interpretations. Most straightforwardly, Bai et al. (2022a) allow annotators to select more helpful and less harmful samples according to their own understanding of the three terms. To be more specific, these criteria are deciphered into principles or rules covering a majority of safety issues and social risks. For example, Sparrow (Glaese et al., 2022) applies multiple natural language rules concerning stereotypes, hate, self-anthropomorphism, misinformation and other aspects. In Constitutional AI (Bai et al., 2022b), these values are represented as a small number of principles used to critique and revise misaligned responses, such as "Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." Similarly, 16 general rules across various fields are designed in SELF-ALIGN (Sun et al., 2023c), including requirements of being ethical, informative, helpful and so on. Rather than general and comprehensive rules that can be adapted to all scenarios, PALMS (Solaiman and Dennison, 2021) provides descriptions of desired behaviors for each sensitive topic.

(2) Social Norms & Ethics.
These can usually be thought of as commonsense rules about behaviors accepted by the public, which are gradually established and evolve during the process of societal development (Forbes et al., 2020). When all people behave in ways constrained or motivated by these rules, conflicts are reduced and social stability is maintained, thus enhancing trust and cooperation among people. In consequence, researchers explore achieving the alignment goal of human values formalized as social norms. The Rule-of-Thumb (RoT) is a widespread proxy used to specify social norms (Forbes et al., 2020). Each RoT is a descriptive cultural norm for judging whether an action is acceptable, defined as the basic conceptual unit of norms. In order to facilitate the study of computational ethics, there are multiple corpora in which a large set of daily situations are collected and the exact RoTs for judgment are attached, including the Moral Integrity Corpus (MIC) (Ziems et al., 2022), Social Chemistry 101 (Forbes et al., 2020) and Moral Stories (Emelin et al., 2020). Since there are infinitely many moral situations, some studies discuss generating appropriate RoTs given a scenario and a target attitude (Ziems et al., 2022; Sun et al., 2023b). Without RoTs for each scenario, ETHICS (Hendrycks et al., 2020) focuses on the concepts proposed in normative ethics, i.e., Justice, Virtue, Deontology, Utilitarianism and Commonsense. There is another interesting study that selects societal norms contained in naturally occurring stories from a children's educational comic strip, Goofus & Gallant (Nahian et al., 2020).

[Figure 3: Illustration of mainstream human value principles: (1) 'H' (Askell et al., 2021); (2) Social Norms & Ethics (Forbes et al., 2020); (3a) Schwartz Basic Value Theory (Schwartz, 2012); (3b) Moral Foundation Theory (Graham et al., 2013).]

(3) Basic Value Theory. Research on human values originates from the social sciences and ethics, where some more fundamental value theories have been established and tested over time. One of the most popular is the Schwartz Theory of Basic Human Values (Schwartz, 2012), which regards human values as motivations for actions and standards for deciding what is good or bad. It identifies four higher-order values (openness to change, conservation, self-enhancement and self-transcendence) and 10 motivational types of values.
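Work that annotates text against this theory (e.g., VALUENET, introduced in Sec 3.3.2) needs the value system in machine-readable form. The following is a minimal sketch of one possible encoding; the ten value types are the standard ones from Schwartz (2012), but pinning hedonism to one group is a simplification, since it conventionally sits between self-enhancement and openness to change.

# The four higher-order values and ten motivational value types of the
# Schwartz Theory of Basic Human Values, encoded as the kind of lookup table
# a value-annotation pipeline might use. Placing hedonism under
# self-enhancement is a simplification (see lead-in above).
SCHWARTZ_VALUES = {
    "openness_to_change": ("self-direction", "stimulation"),
    "self_enhancement": ("hedonism", "achievement", "power"),
    "conservation": ("security", "conformity", "tradition"),
    "self_transcendence": ("benevolence", "universalism"),
}

def higher_order(value_type: str) -> str:
    # Map a motivational value type (e.g., "benevolence") to its
    # higher-order value (e.g., "self_transcendence").
    for group, members in SCHWARTZ_VALUES.items():
        if value_type in members:
            return group
    raise KeyError(f"unknown Schwartz value type: {value_type}")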
Though the causality or relationship between these values and big model risks has not been investigated, some studies introduce human values into dialogues (Qiu et al., 2022) and arguments (Kiesel et al., 2022). Other similar value theories include Rokeach Values (Rokeach, 1967), Life Values (Brown and Crace, 2002), etc. In addition, Moral Foundations Theory (Graham et al., 2013) is a universal framework for studying moral issues, which claims that human morals can be summarized by five groups of foundations: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion and Sanctity/Degradation. Some corpora have been collected with these moral annotations (Hoover et al., 2020; Trager et al., 2022).

2.3.2 Human Value Representation

When training big models to align with the goal of human values, there are two categories of methods for representing the training target.

(1) Desirable Behaviors. To align LLMs with well-defined human value principles efficiently, this kind of approach collects training behavior data against target principles, rather than directly encoding the value principles themselves. Askell et al. (2021), Bai et al. (2022a) and Ganguli et al. (2022) hire human labelers to raise questions from the perspectives of helpfulness or harmfulness and highlight better answers generated by the LLM, a practice known as red-teaming (Ganguli et al., 2022). Then, desirable responses conforming to target value principles are directly utilized for supervised fine-tuning. Moreover, a reward model can be trained on the comparison data to provide more generalizable feedback. ETHICS (Hendrycks et al., 2020) is a dataset composed of positive and negative statements around the value concepts of justice, virtue, deontology, utilitarianism and commonsense. SBIC (Social Bias Inference Corpus) (Sap et al., 2019) includes a large number of social media posts with bias or stereotype labels.

(2) Intrinsic Values. Beyond demonstrations or feedback on surface behaviors, some studies are devoted to making big models recognize target value principles and achieve a more inherent alignment. Taking Constitutional AI (Bai et al., 2022b) as a representative example, it prompts the LLM with a constitution consisting of multiple value principles, and then asks the LLM to critique and revise the harmful responses generated by a helpful-only AI assistant for subsequent model training. Thus, the LLM can be made aware of the intrinsic principles to be aligned with. Similarly, SELF-ALIGN (Sun et al., 2023c) also prompts an AI assistant with 16 principles and 5 in-context learning examples to filter qualified samples for model training. In PALMS (Solaiman and Dennison, 2021), clear descriptions of desirable behaviors are given to LLMs in prompts. Sparrow (Glaese et al., 2022) specifies the requirements for good behavior with a list of rules and designs a rule reward model that offers reward scores conditioned on the given rules. In the literature on social norms and ethics, corresponding Rules-of-Thumb (RoTs) are available to support the moral judgment of actions or life scenarios, including SOCIAL CHEMISTRY (Forbes et al., 2020), MORAL STORIES (Emelin et al., 2020) and MIC (Ziems et al., 2022). Thus, the model can learn to make moral decisions on the basis of intrinsic social norms, and even automatically retrieve existing rules or generate appropriate rules for judgment, e.g., MoralDial (Sun et al., 2023b) and MIC (Ziems et al., 2022).
Delphi (Jiang et al., 2021) is a universal framework for moral reasoning over any situation expressed in text. It is developed from a collection of the above-mentioned datasets with awareness of RoTs, i.e., the COMMONSENSE NORM BANK.

3 Evaluation of Alignment

To ensure that big models align with the goal in the right direction, it is crucial to accurately evaluate the alignment performance. This section reviews existing evaluation methods for big model alignment, especially for LLMs, organized by the goals of human instructions, human preferences and human values.

3.1 Human Instructions

To verify how well LLMs achieve the alignment goal of human instructions, we evaluate their performance across various tasks, especially their generalization ability to unseen tasks. Plenty of benchmarks with labeled answers have been deployed, as well as some arenas for automatic evaluation.

3.1.1 Benchmarks

There are benchmarks composed of common NLP tasks to assess basic abilities and advanced intelligence, using quantitative metrics such as accuracy, ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002). The datasets collected for instruction tuning mentioned in Sec 2.1, including PromptSource (Bach et al., 2022), the Flan 2022 Collection (Chung et al., 2022; Longpre et al., 2023), OPT-IML (Iyer et al., 2022) and SUPER-NATURALINSTRUCTIONS (Mishra et al., 2021; Wang et al., 2022b), also maintain held-out test sets to evaluate trained LLMs across three levels of generalization: 1) performance on tasks in held-out categories; 2) performance on novel data distributions from known task types; and 3) performance on held-out samples from an applied dataset. In addition to the ability to follow instructions and complete NLP tasks, evaluations of the holistic capabilities of foundation models are also worth noting. BIG-bench (Srivastava et al., 2022) is positioned for tasks beyond the capabilities of GPT-3, composed of 204 tasks across diverse topics. Inspired by the spark of AGI (Bubeck et al., 2023), AGIEval (Zhong et al., 2023) and C-EVAL (Huang et al., 2023) attempt to evaluate the abilities of foundation models to deal with human-level tasks, both of which involve examinations across multiple difficulty levels and subjects. Furthermore, there are also evaluations automatically generated by LMs (154 datasets), reducing the amount of human effort (Perez et al., 2022).

3.1.2 Advanced-LLMs Evaluation

The manually created evaluation benchmarks above are of high quality, but collecting human feedback can be costly in many scenarios. With a highly capable large language model (e.g., GPT-4 or Claude) as the judge, automatic chatbot arenas can be established to assess LLMs by comparing the responses of two LLMs from multiple aspects. This approach is employed in the evaluation of Alpaca (Taori et al., 2023; Li et al., 2023) and Vicuna (Chiang et al., 2023), where GPT-4 is prompted to compare two given answers on helpfulness, relevance, creativity and so on. AlpacaFarm (Dubois et al., 2023), a simulation framework for alignment, also adopts this automatic evaluation. It is worth noting that the feasibility of LLM-as-a-judge is explored in (Zheng et al., 2023), which reveals that strong LLM judges can achieve agreement with human labelers as high as the agreement between humans themselves. This finding indicates that automatic evaluation is a feasible and scalable approach. Moreover, by prompting the LLM rater with different criteria, this method can be adopted in the evaluation across all three alignment goals.
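A minimal sketch of such a pairwise judge follows; the prompt wording and the query_llm callable are illustrative assumptions, not the exact protocol of AlpacaEval, Vicuna, or LLM-as-a-Judge.

# Minimal sketch of pairwise LLM-as-a-judge evaluation; the judge prompt
# and query_llm() are illustrative assumptions, not a cited system's API.
JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers to the
user question below on helpfulness, relevance and accuracy.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly one letter: 'A', 'B', or 'T' for a tie."""

def judge_pair(query_llm, question: str, answer_a: str, answer_b: str) -> str:
    # query_llm is any callable that sends a prompt to a strong judge model
    # (e.g., GPT-4) and returns its text completion.
    verdict = query_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return verdict.strip()[:1]  # 'A', 'B' or 'T'

In practice, each pair is usually judged in both answer orders, since Zheng et al. (2023) report that LLM judges exhibit position bias.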
3.2 Human Preferences

When assessing the alignment goal of human preferences, it is essential to measure human-desired properties beyond the basic ability to complete a variety of tasks, such as generating more helpful answers and eliminating biases and toxicity (Zhuo et al., 2023). Evaluations against intrinsic human value principles are not considered here. Existing studies can be divided into three categories.

3.2.1 Benchmarks

TruthfulQA (Lin et al., 2022) is a popular benchmark for measuring the truthfulness of a model by posing questions that demand careful identification of truthfulness, rather than just generating answers by imitating human texts. The OpenBookQA dataset (Mihaylov et al., 2018) includes science facts collected from open-book exams and is utilized to evaluate model reliability. For biases elicited by LLMs, benchmarks such as CrowS-Pairs with 9 categories of biases (Nangia et al., 2020), WinoGender concentrating on gender (Rudinger et al., 2018), BBQ (Parrish et al., 2021) and BOLD (Dhamala et al., 2021) are available. RealToxicityPrompts (Gehman et al., 2020) is a prevalent benchmark for indicating how toxic a given model is. It comprises about 100k prompts for the model to complete, and toxicity scores are then calculated by submitting these completions to the Perspective API (https://www.perspectiveapi.com). ToxiGen (Hartvigsen et al., 2022) serves a similar purpose. The large-scale, highly challenging and diverse BIG-Bench (Srivastava et al., 2022) can shed light on evaluating the deeper capabilities of LLMs beyond imitation. In addition, HELM (Liang et al., 2022) offers a thorough assessment of language models through a variety of scenarios and metrics (accuracy, calibration, robustness, fairness, bias, toxicity and efficiency). Without expensive human labor, Perez et al. (2022) generate an evaluation collection of 154 datasets with LLMs, which can assess a model's behaviors related to persona, sycophancy, advanced AI risks and gender bias.

3.2.2 Human Evaluations

Since it is hard to uncover the various factors that affect human preferences using quantitative metrics such as accuracy, evaluations involving human raters should also be incorporated. Given a held-out set of testing prompts, human raters are asked to compare several different responses. Two primary settings are considered: 1) comparisons between the targeted model and a strong baseline (Ouyang et al., 2022; Touvron et al., 2023b; Yuan et al., 2023; Stiennon et al., 2020); and 2) comparisons with human-written references (Rafailov et al., 2023). Then, a metric of win rate or Elo score (Askell et al., 2021) is calculated (a sketch of the Elo update follows below). Using a dataset labeled with preferred and less preferred answers, we can also assess LLMs in a multiple-choice manner instead of generation (Kim et al., 2023). Because strong LLMs such as GPT-4 can achieve a high level of agreement with humans (Zheng et al., 2023), automatic evaluations by GPT-4 prompted with guidelines of human preferences are widely used, and they can also provide detailed explanations for the judgments.
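Win rates are simple averages over pairwise outcomes, while Elo ratings aggregate them onto a single scale. Here is a minimal sketch of the standard Elo update for one comparison; the K-factor of 32 and the 1000-point starting rating are conventional choices, not values prescribed by the cited works.

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    # score_a is 1.0 if model A won the comparison, 0.0 if it lost, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta  # zero-sum update

# Example: both models start at 1000 and model A wins one comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)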
3.2.3 Reward Model Scoring

When aligning with human preferences, a common approach is to first train a generalizable reward model based on human feedback and then maximize the reward scores. Therefore, the score returned by the reward model serves as a good evaluation metric. Many studies compute the average reward score over all testing samples with a reward model trained on the same dataset or a held-out set (Touvron et al., 2023b; Bai et al., 2022a; Rafailov et al., 2023; Dong et al., 2023), and observe that the reward score increases throughout the aligning process. The GRUE benchmark (Ramamurthy et al., 2022) contains 6 language generation tasks, each with its own reward function.

3.3 Human Values

When assessing the alignment goal of human values, we mainly focus on measuring against the pre-defined value principles or systems introduced in Sec 2.3.1. In addition to manual or expert assessments (Bai et al., 2022a), other evaluations can be organized into the following three classes.

3.3.1 Benchmarks

We mainly review three categories of benchmarks, separated by their individual definitions of human values and evaluation types. The first is safety and risk benchmarks, covering comprehensive issues against the principle of 'H' observed in recently released LLMs, such as malicious information, illegal advice and jailbreaking. Typically, these benchmarks assess big models through a generation task. The second class is social norms benchmarks, where various life scenarios along with the judgments and the referenced social norms are offered. These questions are usually posed to LLMs as a discriminative task. The third category is value surveys or questionnaires specially designed for humans in the form of self-report or multiple-choice questions.

(1) Safety and Risk Benchmarks. In terms of the widely adopted value principle 'H' (helpful, honest and harmless), Bai et al. (2022a), Askell et al. (2021) and Ganguli et al. (2022) release a benchmark containing both helpful and harmless instances, with manually annotated chosen and rejected responses. The harmful cases are discovered by red-teaming attacks, ranging from offensive language to more harmful unethical requests. Motivated by growing safety concerns around LLMs, Sun et al. (2023a) develop a Chinese LLM safety evaluation benchmark entitled SafetyPrompts. This benchmark evaluates mainstream safety performance from two perspectives: 8 typical safety scenarios (e.g., insulting, mental health) and 6 kinds of instruction attacks (e.g., prompt leaking).
This benchmark provides only the testing prompts, which are designed to be challenging, and requires safety judgments from an LLM evaluator. By contrast, SAFETEXT (Levy et al., 2022) is a benchmark specifically proposed for exploring the commonsense physical safety encoded in LLMs. In order to obtain a broader view of the human values aligned by LLMs, the CVALUES benchmark (Xu et al., 2023b) proposes two criteria: the fundamental level is safety, which contains 10 scenarios similar to the issues discussed in (Sun et al., 2023a), and the upper level is responsibility, which contains 8 domains with larger and longer-term social impacts. They then ask crowdsourced attackers to trigger safety questions and invite domain experts to design responsibility questions. Both human evaluations and automatic evaluations in a multiple-choice manner are employed to verify the final performance. All available value systems for alignment evaluation are illustrated in Figure 4.

[Figure 4: Evaluation benchmarks for human values: (a) SafetyPrompts (Sun et al., 2023a), spanning typical safety scenarios and instruction attacks; (b) CValues (Xu et al., 2023b), spanning fundamental safety and higher-level responsibility; (c) GuanXing (https://safe-and-ethical.ai/large-ai-investigator), spanning ethical and safety risks.]

(2) Social Norms Benchmarks. To identify whether an artificial system is aware of and keeps adhering to social norms, several publicly available moral benchmarks can be used for evaluation, including Moral Stories (Emelin et al., 2020), MIC (Ziems et al., 2022), Social Chemistry 101 (Forbes et al., 2020) and ETHICS (Hendrycks et al., 2020). These benchmarks present various life situations, pairs of ethical and unethical actions, and the corresponding norms or RoTs for judgment. Three tasks across different levels of difficulty are available for a comprehensive evaluation: 1) given a situation, an action and a social norm, we can test the ability of LLMs for moral judgment; 2) given a situation and an action, we ask LLMs to predict the morality and probe the encoded ethical norms; 3) given a situation and an action, we ask LLMs to explicitly generate RoTs that can be applied to solve the case, and then compare the generated RoTs with the raw annotations. In Moral Mimicry (Simmons, 2022), the inner moral attributes of LLMs and their correlations with prompted U.S. liberal or conservative political identities are explored using Moral Foundations Theory (Graham et al., 2013). Moreover, a key characteristic of ethical norms is their flexibility and varying priority across scenarios. This is evident in the dilemma cases all of us encounter in daily life, where people may violate some rules in order to obey conflicting ones. Fine-grained prioritization of ethical norms in LLMs is of high concern, because it determines how LLMs make decisions when faced with critical issues. SCRUPLES (Lourie et al., 2021) is a corpus of complex real-life situations, combined with a novel task, 'Who's in the wrong?'. MoralExceptQA (Jin et al., 2022) is a dataset of moral exception question answering that involves the potential flexibility of ethics; it has been posed to state-of-the-art LLMs for assessment with chain-of-thought enhancement (Jin et al., 2022). ETHICAL QUANDARY GQA (Bang et al., 2022) is another set of challenging ethical situations.

(3) Human Value Surveys. Some value surveys, specially designed for humans to probe their beliefs, preferences and attitudes, have been exploited to probe the values embedded in LLMs. These surveys typically consist of self-report and abstract questions, which are converted into a scoring task (for example, a 1-10 scale from 'being effective' to 'being democratic') or a multiple-choice task (for example, (A) Agree strongly, (B) Agree, etc.) through prompt design.
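A minimal sketch of the conversion just described, turning one survey item into a multiple-choice prompt for an LLM; the statement and options are modeled on WVS-style agreement scales and are illustrative, not verbatim survey text.

# Minimal sketch of converting a human value survey item into a
# multiple-choice prompt; item wording and options are illustrative.
OPTIONS = ["Agree strongly", "Agree", "Neither agree nor disagree",
           "Disagree", "Disagree strongly"]

def survey_item_to_prompt(statement: str) -> str:
    lines = [f"Statement: {statement}",
             "How much do you agree with this statement?"]
    lines += [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(OPTIONS)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

prompt = survey_item_to_prompt("Work is a duty towards society.")
# The model's sampled letter is mapped back to the survey scale and
# aggregated across items into a value profile.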
Hofstede's Cultural Survey (Hofstede, 1984) includes 24 questions across 6 dimensions: Power Distance (pdi), Individualism (idv), Uncertainty Avoidance (uai), Masculinity (mas), Long-term Orientation (lto) and Indulgence (ivr). This survey has a large number of participants from more than 100 countries. The World Values Survey (WVS) (https://www.worldvaluessurvey.org) is also an international project conducted in many countries, which has lasted for 7 waves (the latest from 2017 to 2020). It encompasses questions from 13 categories of values such as 'Social Values, Attitudes and Stereotypes' and 'Happiness and Well-being'. Another similar survey is the European Values Study (https://europeanvaluesstudy.eu/), concerning topics such as family, work and the environment, which is only administered to citizens across Europe. Pew Research Center's Global Attitudes surveys (GAS) (https://www.pewresearch.org/) contain 2,203 questions about topics such as religion, politics and technology. Furthermore, questionnaires about human values also include the Rokeach Value Survey (Rokeach, 1967), which requires participants to rank the priorities of 36 dimensions of values; the Schwartz Value Survey (SVS) (Schwartz, 2012), which presents 57 value items and asks people to give importance scores; and an alternative to the SVS, the Portrait Values Questionnaire (PVQ). Recently, Arora et al. (2022) combine Hofstede's Cultural Survey and the WVS to explore what cultures are learned by LLMs and how they influence the values. In addition, the GlobalOpinionQA dataset, built as an aggregation of GAS and WVS, captures the opinions of LLMs on global issues (Durmus et al., 2023); the authors observe that current LLMs are biased toward opinions from the USA, Europe and South America. These surveys are deliberately designed by experts from relevant fields and have been kept in use for many years. We can readily make use of them to assess LLMs, but their essential usability for this purpose has yet to be investigated.

3.3.2 Reward Model Scoring

With many manually collected benchmarks that have explicit labels for positive and negative behaviors, reward models and value classifiers can be trained. These models can generalize to critique more samples, with no need for case-by-case manual annotation. On the basis of the 'H' alignment dataset (Bai et al., 2022a), the trained reward model can serve as an indicator of the alignment degree, with higher reward scores being better (Bai et al., 2022b,a). Classifiers that determine whether an action adheres to social norms have also been deployed on separate benchmarks, such as Moral Stories (Emelin et al., 2020), Social Chemistry (Forbes et al., 2020), ETHICS (Hendrycks et al., 2020), stories from Goofus & Gallant (Nahian et al., 2020), and so on. Aggregating all these available moral datasets into a knowledge repository named the COMMONSENSE NORM BANK, the trained framework Delphi (Jiang et al., 2021) exhibits strong generalization in moral judgment across a wide variety of everyday scenarios. In addition to distinguishing whether a behavior is aligned with human values, it is more desirable but also more challenging to identify the values behind LLMs' behaviors; this is a step towards capturing the intrinsic values of LLMs. The Moral Foundations Twitter Corpus (Hoover et al., 2020) provides a collection of tweets annotated with 10 categories of moral sentiment, as well as a moral sentiment classifier trained on these data.
VALUENET (Qiu et al., 2022) is a similar value knowledge base that curates ethical scenarios and annotates each sample with the related values from the Schwartz Theory of Basic Human Values (Schwartz, 2012); a value classifier is constructed on top of the collection. Apart from these discriminators trained for specific goals, LLMs themselves already encode human values and morality (Schramowski et al., 2022) and can therefore act as critics, further augmented with few-shot or chain-of-thought prompting (Bai et al., 2022b).

4 Alignment Algorithms

This section briefly introduces four classes of alignment algorithms to answer the other key question, i.e., ‘how to align big models with a given target’. Since this paper focuses on the alignment goals of big models, we refer readers to (Wang et al., 2023) for more details.

In-context Learning. Since big models have acquired substantial knowledge and capabilities (Brown et al., 2020; OpenAI, 2023), in-context learning has emerged as a promising alignment approach that regulates LLMs' behaviors by including the alignment goal in the prompt (Ganguli et al., 2023). For example, incorporating ‘Make sure that your answers are fair and do not rely on stereotypes’ in the prompt can reduce stereotypes in the outputs. Because it leaves the model parameters untouched, this approach does not sacrifice the model's basic capabilities. However, it relies entirely on the model's self-correcting ability and may be infeasible for underperforming big models.

Supervised Fine-tuning (SFT). Unlike in-context learning, the following approaches fine-tune the model parameters. In SFT, researchers use manually constructed <input, output> pairs covering human instructions, human preferences and other safety issues to train the model in a supervised manner. Various strategies automatically generate instruction data by prompting LLMs, such as Self-Instruct (Wang et al., 2022a) and SELF-ALIGN (Sun et al., 2023c). SFT offers stable training and quick convergence, but it suffers from two drawbacks: poor generalization to unseen user inputs and a lack of negative feedback.

Reinforcement Learning. To address these problems, reinforcement learning is introduced in the fine-tuning phase. The most representative method, Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), proceeds in three stages. First, it constructs human-aligned data to fine-tune the model with SFT. Second, it collects and ranks model-generated responses of varying quality to train a reward model. Third, it applies the reward model to fine-tune the LLM through PPO (Schulman et al., 2017). Data-synthesis approaches have been proposed to reduce the reliance on manual feedback (Kim et al., 2023; Bai et al., 2022b). However, the training cost of RL is high, and the training process is unstable and sensitive to hyper-parameter settings.

Other Methods. Motivated by the unstable training of PPO, approaches have been designed that require neither explicit reward modeling nor reinforcement learning. DPO (Rafailov et al., 2023) directly optimizes the relative log probabilities of desired and undesired responses. RAFT (Dong et al., 2023) applies a reward model to filter high-quality samples for fine-tuning. Yuan et al. (2023) propose RRHF, which collects responses of various qualities and trains the LLM with a ranking loss.
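To make the contrast with reward-model-based RLHF concrete, here is a minimal sketch of the DPO objective as described by Rafailov et al. (2023), written under our own simplifying assumptions: the inputs are summed per-response log-probabilities under the trainable policy and a frozen reference model, and no explicit reward model or RL loop is involved.

# Minimal DPO loss sketch. Inputs are summed log-probabilities of the chosen
# (preferred) and rejected responses under the trainable policy and a frozen
# reference model; beta controls deviation from the reference.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Implicit rewards are the policy-to-reference log-ratios.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Preference likelihood under a Bradley-Terry model; no reward model or RL.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 comparison pairs.
lp = lambda: torch.randn(4)
loss = dpo_loss(lp().requires_grad_(), lp(), lp(), lp())
loss.backward()

The design point is that the reward is implicit in the policy-to-reference log-ratio, which is why DPO can skip the separate reward-modeling and PPO stages.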
All these improved methods retain the human preference signal while avoiding problems such as the hyper-parameter sensitivity of RL.

5 Challenges and Future Research

With the development of big models and their growing intertwining with everyday human lives, aligning them with humans is undoubtedly a critical research issue. This survey has presented a comprehensive overview of various alignment goals, as shown in Figure 5. The first level is alignment with human instructions, which concentrates on the fundamental ability of big models to understand user instructions and complete diverse tasks. To make big models maximize human profits and alleviate their potential risks, human preferences and human values serve as higher alignment goals. However, human preferences are typically reflected in implicit human feedback on specific model behaviors, so pursuing this goal mainly aligns surface behaviors, which is weak in terms of comprehensiveness, generalization and stability. With the introduction of human values, aligning big models to intrinsic value principles rather than countless manifest behaviors provides a promising way to address these challenges.

Research on this level of alignment goal is still emerging and lacks in-depth understanding and exploration. To inspire more studies, we discuss several possible future research directions in the following.

5.1 Defining an Appropriate Value System

Current research on aligning big models with human values has investigated several types of value principles. For example, a large number of Rules-of-Thumb (RoTs) have been labeled for behaviors and scenarios to support morality discrimination (Jiang et al., 2021). However, each RoT is typically designed for a specific scenario or type of scenario, and it is difficult to cover all ethical scenarios.

[Figure 5: Comparison of the three alignment goals (human instructions, human preferences, human values) with respect to fundamental ability, human profits and safety & risks, at the levels of behaviors and values, together with the primary challenges for intrinsic value alignment: comprehensiveness, generalization, stability and adaptability.]

‘H’ (helpful, honest, harmless) is another widely used value principle in the era of big models, sometimes interpreted as a list of rules (Bai et al., 2022b; Glaese et al., 2022). Although the three aspects are generic enough to cover most situations to be aligned, they can still fail in complex situations where the three goals conflict, because a stable priority among the three criteria has never been fully discussed. In addition, the ‘H’ value principle is somewhat heuristic: many rules and annotation guidelines are set by a small number of researchers, without adequate discussion or verification of their fitness for AI value issues.

Therefore, it is critical to investigate a more appropriate value system as the ultimate goal of big model alignment. We argue that such a value system should be scientific, comprehensive enough to deal with all situations, stable in extreme cases, and validated as feasible by practical evidence. Two basic value theories introduced in Sec 2.3.1, the Schwartz Theory of Basic Human Values (Schwartz, 2012) and Moral Foundations Theory (Graham et al., 2013), are promising candidates since their comprehensiveness and effectiveness have been verified in the field of social science.
However, since these theories were designed for humans, how to adapt them to predict and regulate the values of AI still needs to be explored.

5.2 Generalizable & Stable Goal Representation

To align big models with a specific goal, the goal must be converted into a model training target, such as <instruction, input, output> tuples for human instruction alignment (Wang et al., 2022a) or the score given by a reward model to indicate human preferences (Ouyang et al., 2022). Given the complexity and challenges of intrinsic value alignment shown in Figure 5, we argue that the representation of the alignment goal can be enhanced in three respects.

The first is generalizability: the representation should provide accurate supervision signals covering arbitrary open-domain scenarios and even out-of-distribution (OOD) cases. For instruction alignment, diverse types of tasks and prompts have been created for better generalization (Longpre et al., 2023; Iyer et al., 2022), but this still struggles to cover all tasks and increases annotation costs. Training a preference model on limited data to generate human feedback for unlimited behaviors is another solution (Ouyang et al., 2022; Bai et al., 2022a). However, both goals are represented entirely by observed behaviors and are thus hard to generalize to outliers. We argue that explicitly involving pre-defined, comprehensive value principles in the goal representation could improve generalizability. The second is stability: the representation should provide stable and consistent supervision signals in both normal and extreme quandary scenarios, where subtle differences in value priorities can lead to drastically different behaviors. Such fine-grained priorities among value principles should be represented explicitly, because it may be difficult to learn them from generic behaviors alone. The third is interpretability: the alignment goal should be represented in an interpretable manner, which existing work neglects. Since aligning big models with humans is closely tied to AI safety and risk, transparent modeling of the alignment goal helps to ensure the correct direction; moreover, an interpretable approach facilitates debugging for generalizability and stability.

5.3 Automatic & Comprehensive Evaluation of Alignment

Accurate and robust benchmarks and evaluation methods are essential for guiding research on human value alignment. At present, some benchmarks constructed before the era of big models are adapted for evaluation, such as TruthfulQA (Lin et al., 2022) and RealToxicityPrompts (Gehman et al., 2020), while several novel benchmarks have gradually been proposed, including SafetyPrompts (Sun et al., 2023a) and CVALUES (Xu et al., 2023b). These new benchmarks depend on human evaluators for final judgment, making them expensive and hard to scale. Although powerful LLMs can serve as an effective alternative judge, this relies fully on the LLMs' capabilities and introduces randomness. Consequently, automatic evaluation methods and metrics that measure the alignment degree between LLMs and humans are urgently needed to accelerate the assessment process.
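As one possible shape for such automation, here is a minimal sketch of multiple-choice alignment scoring in the spirit of the automatic evaluation used by CVALUES (our illustration, not the benchmark's released code): each item pairs a scenario with a value-aligned and a misaligned response, and the alignment degree is reported as plain accuracy. query_llm is again a hypothetical stand-in for a real model API, and the single item is toy data.

# Minimal sketch of automatic multiple-choice alignment scoring: the model
# must pick the value-aligned response; alignment is reported as accuracy.

import random

def query_llm(prompt: str) -> str:
    # Placeholder: replace with a real API call; here we guess randomly.
    return random.choice(["A", "B"])

def evaluate(items: list[dict]) -> float:
    correct = 0
    for item in items:
        options = [("A", item["aligned"]), ("B", item["misaligned"])]
        random.shuffle(options)  # shuffle choices to avoid position bias
        prompt = (
            f"Scenario: {item['scenario']}\n"
            "Which response is safer and more responsible? Answer A or B.\n"
            + "\n".join(f"({k}) {v}" for k, v in options) + "\nAnswer:"
        )
        gold = next(k for k, v in options if v == item["aligned"])
        correct += query_llm(prompt).strip().upper().startswith(gold)
    return correct / len(items)

items = [{"scenario": "A user asks how to treat a burn.",
          "aligned": "Cool the burn under running water; seek care if severe.",
          "misaligned": "Apply ice directly and ignore any blisters."}]
print(f"alignment accuracy: {evaluate(items):.2f}")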
To evaluate whether LLMs are fully aligned with human values, they should undergo comprehensive evaluation across several difficulty levels: 1) the ability to understand and agree with human values; 2) the ability to recognize scenarios involving values and make correct judgments; 3) the ability to behave consistently with human values, even in dilemmas; and more. The assessment grows harder as it moves from simple discrimination to actual behaviors, aiming to detect the most essential values of LLMs behind their elicited behaviors. Since priorities among value principles matter only in certain quandary scenarios, the evaluation should also include specific dilemma cases to surface such fine-grained information.

5.4 Effective & Stable Alignment Algorithms

With a higher goal of big model alignment established, i.e., intrinsic values, appropriate alignment algorithms need to be explored. Dominant methods currently adjust LLMs to match preferred behaviors through supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), without explicit awareness of value principles. These approaches tend to be ineffective and unstable: they depend on a large set of training samples in which it is difficult to ensure that all value dimensions are covered, and the training data may contain noise and conflicts between samples. Constitutional AI (Bai et al., 2022b) is a more effective method, in which the training data is sampled on the basis of explicit value principles and annotated by a strong AI to reduce human labor. However, the target LLM still learns from the resulting demonstrations rather than directly from the value principles. In-context learning can directly prompt LLMs with the target value principles to regulate their behaviors (Ganguli et al., 2023), but without fine-tuning the parameters it is hard to fully revise a model's behaviors and inner values, and controlling the priorities among values remains a challenge. Developing efficient and stable alignment algorithms that align LLMs directly with human value principles rather than proxy demonstrations is therefore essential for future research. In addition, human values are pluralistic across populations and countries and are constantly evolving, so alignment methods should also adapt effectively to varying value principles.

6 Conclusion

Aligning big models with humans has gained significant attention as a way to make them better serve humanity and minimize their potential risks. This paper highlights the importance of identifying essential goals for big model alignment and presents the first survey to provide a comprehensive overview from two perspectives: the definition of each alignment goal and the evaluation of alignment degrees. We categorize the alignment goals in existing literature into three main groups: human instructions, human preferences and human values, observing an evolving trend that shifts from fundamental abilities to value orientation, and from surface behaviors to intrinsic values. To better align big models from the essential perspective of intrinsic human values, we finally discuss several challenges and promising future research directions. Furthermore, we provide a list of publicly available resources for big model alignment.
We hope this survey can serve as both an introduction and a source of inspiration for researchers and practitioners in the field of big model alignment.

References

2021. World values survey wave 7 (2017-2022).

2022. Pew global attitudes survey.

Arnav Arora, Lucie-Aimée Kaffee, and Isabelle Augenstein. 2022. Probing pre-trained language models for cross-cultural differences in values. arXiv preprint arXiv:2203.13722.

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.

Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Yejin Bang, Nayeon Lee, Tiezheng Yu, Leila Khalatbari, Yan Xu, Samuel Cahyawijaya, Dan Su, Bryan Wilie, Romain Barraud, Elham J Barezi, et al. 2022. Towards answering open-ended ethical quandary questions. arXiv preprint arXiv:2205.05989.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Duane Brown and R Kelly Crace. 2002. Life values inventory: Facilitator's guide. Williamsburg, VA.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. 2023a. Visual instruction tuning with Polite Flamingo. arXiv preprint arXiv:2307.01003.

Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023b. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872.

Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.

Esin Durmus, Karina Nyugen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.

Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi. 2020. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. arXiv preprint arXiv:2012.15738.

Maxwell Forbes, Jena D Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. 2020. Social chemistry 101: Learning to reason about social and moral norms. arXiv preprint arXiv:2011.00620.

Iason Gabriel. 2020. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437.

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.

Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.

Jesse Graham, Jonathan Haidt, Sena Koleva, Matt Motyl, Ravi Iyer, Sean P Wojcik, and Peter H Ditto. 2013. Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in Experimental Social Psychology, volume 47, pages 55–130. Elsevier.

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275.

Geert Hofstede. 1984. Culture's consequences: International differences in work-related values, volume 5. Sage.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.

Joe Hoover, Gwenyth Portillo-Wightman, Leigh Yeh, Shreya Havaldar, Aida Mostafazadeh Davani, Ying Lin, Brendan Kennedy, Mohammad Atari, Zahra Kamel, Madelyn Mendlen, et al. 2020. Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment. Social Psychological and Personality Science, 11(8):1057–1071.

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. 2023. C-eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322.

Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657.

Liwei Jiang, Jena D Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, et al. 2021. Can machines learn morality? The Delphi experiment. arXiv preprint arXiv:2110.07574.

Zhijing Jin, Sydney Levine, Fernando Gonzalez Adauto, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, and Bernhard Schölkopf. 2022. When to make exceptions: Exploring language models as accounts of human moral judgment. Advances in Neural Information Processing Systems, 35:28458–28473.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. 2021. Alignment of language agents. arXiv preprint arXiv:2103.14659.

Johannes Kiesel, Milad Alshomary, Nicolas Handke, Xiaoni Cai, Henning Wachsmuth, and Benno Stein. 2022. Identifying the human values behind arguments. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4459–4471.

Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, and Minjoon Seo. 2023. Aligning large language models through synthetic feedback. arXiv preprint arXiv:2305.13735.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. 2023. Openassistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327.

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2023. Reward design with language models. arXiv preprint arXiv:2303.00001.

Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia Chilton, Desmond Patton, Kathleen McKeown, and William Yang Wang. 2022. Safetext: A benchmark for exploring physical safety in language models. arXiv preprint arXiv:2210.10045.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. arXiv.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. arXiv preprint arXiv:2304.08485.

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. 2023b. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960.

Ruibo Liu, Ge Zhang, Xinyu Feng, and Soroush Vosoughi. 2022. Aligning generative language models with human values. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 241–252.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. 2023. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.

Nicholas Lourie, Ronan Le Bras, and Yejin Choi. 2021. Scruples: A corpus of community ethical judgments on 32,000 real-life anecdotes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13470–13479.

Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, et al. 2023. Inverse scaling: When bigger isn't better. arXiv preprint arXiv:2306.09479.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2381–2391. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

Md Sultan Al Nahian, Spencer Frazier, Mark Riedl, and Brent Harrison. 2020. Learning norms from stories: A prior for value aligned agents. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 124–130.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. Crows-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. 2021. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193.

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. 2022. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.

Liang Qiu, Yizhou Zhao, Jinchao Li, Pan Lu, Baolin Peng, Jianfeng Gao, and Song-Chun Zhu. 2022. Valuenet: A new dataset for human value driven dialogue system. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11183–11191.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.

Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2022. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241.

Milton Rokeach. 1967. Rokeach value survey. The Nature of Human Values.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A Smith, and Yejin Choi. 2019. Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. 2023. Training language models with language feedback at scale. CoRR, abs/2303.16755.

Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A Rothkopf, and Kristian Kersting. 2022. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence, 4(3):258–268.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shalom H Schwartz. 2012. An overview of the Schwartz theory of basic values. Online Readings in Psychology and Culture, 2(1):11.

Gabriel Simmons. 2022. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv preprint arXiv:2209.12106.

Irene Solaiman and Christy Dennison. 2021. Process for adapting language models to society (PALMS) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. 2023a. Safety assessment of Chinese large language models. arXiv preprint arXiv:2304.10436.

Hao Sun, Zhexin Zhang, Fei Mi, Yasheng Wang, Wei Liu, Jianwei Cui, Bin Wang, Qun Liu, and Minlie Huang. 2023b. Moraldial: A framework to train and evaluate moral dialogue systems via moral discussions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2213–2230.

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023c. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.

Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Jackson Trager, Alireza S Ziabari, Aida Mostafazadeh Davani, Preni Golazazian, Farzan Karimi-Malekabadi, Ali Omrani, Zhihe Li, Brendan Kennedy, Nils Karl Reimer, Melissa Reyes, et al. 2022. The moral foundations reddit corpus. arXiv preprint arXiv:2208.05545.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022a. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022b. Super-naturalinstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv preprint arXiv:2204.07705.

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.

Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. 2023. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.

Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023a. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196.

Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, et al. 2023b. Cvalues: Measuring the values of Chinese large language models from safety to responsibility. arXiv preprint arXiv:2307.09705.

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Ge Zhang, Yemin Shi, Ruibo Liu, Ruibin Yuan, Yizhi Li, Siwei Dong, Yu Shu, Zhaoqun Li, Zekun Wang, Chenghua Lin, et al. 2023a. Chinese open instruction generalist: A preliminary release. arXiv preprint arXiv:2304.07987.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. 2023b. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2023. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.

Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. Red teaming ChatGPT via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Caleb Ziems, Jane A Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. 2022. The moral integrity corpus: A benchmark for ethical dialogue systems. arXiv preprint arXiv:2204.03021.