Paper deep dive
A Survey of LLM Alignment: Instruction Understanding, Intention Reasoning, and Reliable Generation
Zongyu Chang, Feihong Lu, Ziqin Zhu, Qian Li, Cheng Ji, Zhuo Chen, Yang Liu, Ruifeng Xu, Yangqiu Song, Shangguang Wang, Jianxin Li
Abstract
Large language models have demonstrated exceptional capabilities in understanding and generation. However, in real-world scenarios, users' natural language expressions are often inherently fuzzy, ambiguous, and uncertain, leading to challenges such as vagueness, polysemy, and contextual ambiguity. This paper focuses on three challenges in LLM-based text generation tasks: instruction understanding, intention reasoning, and reliable dialog generation. Regarding complex human instructions, LLMs have deficiencies in understanding long contexts and instructions in multi-round conversations. For intention reasoning, LLMs may exhibit inconsistent command reasoning, difficulty reasoning about commands containing incorrect information, difficulty understanding ambiguous user commands, and a weak understanding of user intention in commands. Besides, in terms of reliable dialog generation, LLMs may produce unstable or unethical content. To this end, we classify and analyze the performance of LLMs in challenging scenarios and conduct a comprehensive evaluation of existing solutions. Furthermore, we introduce benchmarks and categorize them based on the aforementioned three core challenges. Finally, we explore potential directions for future research to enhance the reliability and adaptability of LLMs in real-world applications.
Tags
Links
- Source: https://arxiv.org/abs/2502.09101
- Canonical: https://arxiv.org/abs/2502.09101
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/12/2026, 5:33:41 PM
Summary
This paper provides a comprehensive survey of LLM alignment, focusing on three core challenges: instruction understanding, intention reasoning, and reliable dialog generation. It analyzes the limitations of current LLMs in handling fuzzy, ambiguous, and complex human instructions, categorizes existing solutions, and proposes future research directions to enhance model reliability and adaptability.
Entities (6)
Relation Signals (4)
LLM → faceschallenge → Instruction Understanding
confidence 100% · One of the most pressing challenges that LLMs face is instruction understanding
LLM → faceschallenge → Intention Reasoning
confidence 100% · Another critical area is intention reasoning... where they struggle to align the generated responses with the user's underlying intention.
LLM → faceschallenge → Reliable Dialog Generation
confidence 100% · The final major challenge is the reliable dialog generation, which pertains to the accuracy, ethical considerations, and stability of the content they produce
SFT → addresseschallenge → Instruction Understanding
confidence 90% · supervised fine-tuning methods using multi-turn dialogue data, enhanced by techniques like optimized instruction parsing
Cypher Suggestions (2)
Find all challenges associated with LLMs · confidence 90% · unvalidated
MATCH (l:Entity {name: 'LLM'})-[:FACES_CHALLENGE]->(c:Challenge) RETURN c.name
List methods used to address specific challenges · confidence 90% · unvalidated
MATCH (m:Method)-[:ADDRESSES_CHALLENGE]->(c:Challenge) RETURN m.name, c.name
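The two suggested queries assume a property graph with `Entity`/`Challenge`/`Method` nodes and `FACES_CHALLENGE`/`ADDRESSES_CHALLENGE` relationships. As a sanity check, the same lookups can be mirrored in plain Python over the four extracted relation signals above (the triple representation is an assumption made here for illustration, not part of the extraction):

```python
# Toy in-memory mirror of the extracted relation signals.
# Triples are (subject, predicate, object); node labels are assumed, not validated.
TRIPLES = [
    ("LLM", "FACES_CHALLENGE", "Instruction Understanding"),
    ("LLM", "FACES_CHALLENGE", "Intention Reasoning"),
    ("LLM", "FACES_CHALLENGE", "Reliable Dialog Generation"),
    ("SFT", "ADDRESSES_CHALLENGE", "Instruction Understanding"),
]

def challenges_faced_by(entity: str) -> list[str]:
    """Mirrors: MATCH (l {name: $entity})-[:FACES_CHALLENGE]->(c) RETURN c.name"""
    return [o for s, p, o in TRIPLES if s == entity and p == "FACES_CHALLENGE"]

def methods_addressing() -> list[tuple[str, str]]:
    """Mirrors: MATCH (m)-[:ADDRESSES_CHALLENGE]->(c) RETURN m.name, c.name"""
    return [(s, o) for s, p, o in TRIPLES if p == "ADDRESSES_CHALLENGE"]
```

Against a live Neo4j instance the Cypher above would be run as-is; this sketch only checks that the suggested queries match the signals listed.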
Full Text
68,287 characters extracted from source content.
A Survey of LLM Alignment: Instruction Understanding, Intention Reasoning, and Reliable Generation

Zongyu Chang* a, Feihong Lu* b, Ziqin Zhu f, Qian Li a, Cheng Ji b, Tao Yang b, Zhuo Chen a, Hao Peng b, Yang Liu e, Ruifeng Xu c, Yangqiu Song d, Jianxin Li b, Shangguang Wang a

a Beijing University of Posts and Telecommunications, Beijing, 100876, China
b Beihang University, Beijing, 100191, China
c Harbin Institute of Technology, Harbin, 150081, China
d Hong Kong University of Science and Technology, Hong Kong, China
e Chinese Academy of Sciences, Beijing, 230026, China
f The University of Auckland, Auckland, New Zealand

Abstract

Large language models have demonstrated exceptional capabilities in understanding and generation. However, in real-world scenarios, users' natural language expressions are often inherently fuzzy, ambiguous, and uncertain, leading to challenges such as vagueness, polysemy, and contextual ambiguity. This paper focuses on three challenges in LLM-based text generation tasks: instruction understanding, intention reasoning, and reliable dialog generation. Regarding complex human instructions, LLMs have deficiencies in understanding long contexts and instructions in multi-round conversations. For intention reasoning, LLMs may exhibit inconsistent command reasoning, difficulty reasoning about commands containing incorrect information, difficulty understanding ambiguous user commands, and a weak understanding of user intention in commands. Besides, in terms of reliable dialog generation, LLMs may produce unstable or unethical content. To this end, we classify and analyze the performance of LLMs in challenging scenarios and conduct a comprehensive evaluation of existing solutions. Furthermore, we introduce benchmarks and categorize them based on the aforementioned three core challenges.
Finally, we explore potential directions for future research to enhance the reliability and adaptability of LLMs in real-world applications.

* These authors contributed equally to this work.
arXiv:2502.09101v3 [cs.HC] 29 Jan 2026

Keywords: Instruction Understanding, Intention Reasoning, Reliable Dialog Generation

1. Introduction

The field of artificial intelligence has experienced rapid advancements with the development of large language models (LLMs). These models, built upon massive amounts of data and extensive computing resources, have shown impressive capabilities in understanding and generating human language. Recent advancements in LLMs, including the use of scaling laws [1], supervised fine-tuning (SFT) [2], and reinforcement learning with human feedback (RLHF) [3], have propelled these models to new heights. Researchers have explored innovative strategies like chain-of-thought reasoning (COT) [4], aiming to enhance their performance in processing and generating accurate responses. However, these models still struggle when interacting with human instructions in real-world scenarios due to the fuzzy, uncertain, and complex nature of those instructions, especially when the input data is fuzzy, incomplete, or inconsistent. Despite improvements, issues such as content hallucination [5] and logical misinterpretation remain prevalent. Consequently, while LLMs show promise, they are far from flawless and require further refinement to address the challenges posed by more unpredictable and complex human instructions, as follows.

1.1. Challenge of Instruction Understanding

One of the most pressing challenges that LLMs face is instruction understanding, as illustrated in Figure 1(a) and Figure 2 (I), particularly when the user input involves complex or multi-step instructions.
While models have improved in parsing relatively simple queries, they continue to encounter significant difficulties when dealing with long, context-rich instructions or when instructions are spread across multiple conversational turns. LLMs often fail to grasp subtle nuances or interpret implicit meanings embedded within the text, which leads to inaccurate or incomplete responses. Existing approaches to instruction understanding have introduced techniques like optimizing the model's parsing abilities [6] and context-aware optimization [7]. While these methods show promise, they often fall short when addressing the complexities and ambiguities present in instructions.

[Figure 1 shows three example dialogues in which a user asks for the signature dishes of the baozi restaurant "DongDongBao" and the address of its Beijing store: (a) Challenge of Instruction Understanding, where the model ignores the second instruction; (b) Challenge of Intention Reasoning, where a reasoning error yields content unrelated to the input (a Xi'an address is returned as the Beijing address); (c) Challenge of Reliable Generation, where reasoning and instructions are straightforward but the generated content is wrong.]
Figure 1: Example of LLMs generation.

1.2. Challenge of Intention Reasoning

Another critical area is intention reasoning, as illustrated in Figure 1(b) and Figure 2 (II), where LLMs struggle to align the generated responses with the user's underlying intention. Ambiguities in language, conflicting instructions, and implicit requirements often result in models producing outputs that diverge from the user's expectations. LLMs also face difficulties when instructions are inconsistent or contain incorrect information, which challenges the model's ability to make accurate inferences. Various strategies, including retrieval-enhanced generation and fine-tuning techniques, have been proposed to enhance reasoning capabilities, enabling the models to better handle inconsistent or incomplete instructions. However, these methods often introduce new challenges related to bias and the inability to fully resolve conflicts in user input, further complicating the alignment between generated content and user expectations.

[Figure 2 schematic (not fully recoverable from extraction): three stages of LLM-human alignment — (I) Instruction Understanding (e.g., hard-to-understand long inputs; solutions such as information focusing), (II) Intention Reasoning (e.g., inconsistent instruction reasoning on fuzzy inputs; solutions such as knowledge updating), (III) Reliable Dialog Generation (e.g., unstable responses and wrong outputs; solutions such as external tools).]
Figure 2: Unlike previous surveys on LLMs, we do not consider the alignment between LLMs and humans as an isolated process; instead, we view it as a continuous and dynamic information processing process consisting of instruction understanding, intention reasoning, and reliable dialogue generation.

1.3. Challenge of Reliable Dialog Generation
The final major challenge is reliable dialog generation, which pertains to the accuracy, ethical considerations, and stability of the content LLMs produce, as illustrated in Figure 1(c) and Figure 2 (III). While LLMs are generally capable of generating coherent and contextually relevant outputs, they sometimes exhibit instability, generating content that is factually incorrect, logically inconsistent, or ethically questionable. This challenge is exacerbated by the models' inability to recognize uncertainty, which can lead to overconfident but inaccurate outputs. Recent efforts to address this issue involve techniques like uncertainty-aware fine-tuning and using external tools to evaluate output credibility. However, these approaches struggle to provide a comprehensive and reliable solution, especially in complex or dynamic contexts.

Present Survey. Facing these challenges, there is an increasing need for focused research on LLMs and their interaction with human instructions and intentions. This paper systematically analyzes LLMs' performance in processing human instructions, highlighting three key areas: user instruction understanding, intention comprehension and reasoning, and reliable dialog generation. While existing review papers address model training, fine-tuning, and specific aspects of LLMs' capabilities [92, 93, 94], our focus is on the LLMs' ability to understand and reason about user intentions.
Specifically, we explore how well LLMs understand user input, reason about the user's intention, infer user intentions, and generate content that aligns closely with human intentions, thereby maximizing the alignment between LLMs and humans.

[Figure 3 taxonomy — gap between LLMs and human intention:]
- Instruction Understanding (§2)
  - Long-text Comprehension (§2.1)
    - Information Focusing: 1) attention sparsification [8], 2) dynamically adjusting attention [9], 3) attention optimization [10], 4) position-independent training [11]
    - Multipath Optimization: 1) SFT and RL [12, 13, 14, 15], 2) retrieval-augmented [16, 17], 3) recurrent-sequence optimization [18], 4) external memory [19], 5) mimicking brain memory [20]
  - Multi-turn Conversation Handling (§2.2)
    - Soft Fine-tuning: multi-turn SFT [6, 21], Orca [22], WizardLM [23], Vicuna [24], Self-Instruct [25], Parrot [7]
    - Reinforcement Learning: multi-turn reinforcement [26], SPIN [27], ArCHer [28]
- Intention Reasoning (§3)
  - Inconsistent Instruction Resolution (§3.1)
    - Knowledge Updating: SituatedQA [29], CDConv [30], Red Teaming LM [31], RAG systems [32], ContraDoc [33]
    - Confidence Calibration: CD2 [34], uncertainty calibration [35, 36], abstention [37, 38], MacNoise [39]
  - Misinformation Reasoning (§3.2)
    - Targeted Fine-tuning: CKL [40], StreamingQA [41], BIPIA [42], ROME [43], MEMIT [44], Web-Poisoning [45], KC-LLMs [46], CAR [47]
  - Fuzzy Language Interpretation (§3.3)
    - Clue Engineering: Folkscope [48], Miko [49], Alignment LM [50], clarification questions [51], AmbigQA [52], behavioral cloning [53], ATC [54]
  - Intention Clarification Failure (§3.4)
    - Deep Reasoning: DeepSeek-R1 [55], S1 [56], SoulChat [57], MoChat [58], deep reasoning [59, 60, 61, 62], empathy adaptation [63, 64], retrieval [65, 66, 67], LARA [68]
- Reliable Generation (§4)
  - Response Stability (§4.1)
    - Fine-tuning LLMs: EDL [69], DER [70], ConformalFactuality [71], Bayesian [72, 73], calibration [74, 75], conformal prediction [76, 77, 78], UaIT [79], LUQ [80]
    - External Tools: SUE [81], CalibrateMath [82]
  - Alignment (§4.2)
    - Data Cleaning and Curation: JoSEC [83], Stochastic Parrots [84], DeepSoftDebias [85]
    - RL-based Alignment: PPO [3], DPO [86], GRPO [87], RLAIF [88]
    - In-context Alignment: URIAL [89], ICA [90], PICA [91]

Figure 3: Challenges and existing solutions between LLMs and Human Intentions.

Comparison with Previous Surveys. While the gap between human intention and LLMs is a core challenge in generative AI, many studies focus on specific aspects of the issue, lacking a comprehensive overview. These works offer valuable insights but do not provide a systematic summary of the field. Lou et al. [92] primarily address instruction-following challenges in LLMs without delving into the reasoning capabilities required for complex user instructions. Xu et al. [95] examine the impact of various memory conflicts on LLM-generated content credibility and performance, yet do not consider reasoning or intention comprehension. Plaat et al. [93] focus on LLM reasoning for basic mathematical problems, without exploring its applicability to broader fields. Shorinwa et al. [96] provide an initial analysis of LLM uncertainty quantification, but exclude user input instructions. In contrast, our survey offers a more comprehensive perspective, as shown in Figure 2 and Figure 3, with a unique classification and systematic analysis of instruction processing, while addressing current solutions to key challenges.

Survey Organization. As shown in Figure 3, we begin by exploring the capability of user instruction understanding (§2). Next, we focus on how models infer implicit intentions, incorporate contextual information for logical reasoning, and address inconsistencies or incomplete instructions (§3). We then examine reliable dialog generation, assessing the quality and credibility of model-generated outputs (§4). Next, we briefly analyze the problems faced by LLMs in the face of different challenges (§5) and review the benchmarks (§6) for the above problems.
Finally, we propose potential research directions (§7) and summarize the key findings (§8).

2. Instruction Understanding

LLMs excel at single-turn dialogues but struggle to understand multi-turn dialogues and long contexts, which are commonly used by users. LLMs may forget prior information, be influenced by irrelevant data, and overlook key inputs.

2.1. Long-Text Comprehension

Understanding lengthy textual instructions remains a significant hurdle for large language models (LLMs), as real-world human instructions are often expressed in loose, unstructured natural language, contrasting with the explicitly defined tasks and structured labeling commonly employed in cue word engineering. We categorize the relevant factors into the following three categories: 1) Information Sparsity and Redundancy. Long texts often contain redundant or irrelevant information that can obscure the task-relevant content, leading to difficulties in information extraction. 2) Remote Information Failure (Figure 4). Long contexts may cause models to forget relevant information that is distant within the text. Additionally, links between remote information across paragraphs or sentences can be difficult for models to identify, diminishing their understanding of contextual connections. 3) Attention Dilution. As context length increases, the model's attention mechanism faces greater computational demands and struggles to assign appropriate weights to each token, making it harder to prioritize key information, particularly with complex, multi-level relationships in longer texts.

Figure 4: Case of Remote Information Failure (§2.1), where the model forgets relevant information over long distances in long context.

This paper classifies the existing solutions into the following two categories:

Information Focusing. Improving LLMs' ability to focus on important information in long texts involves several methods: 1) Sparsifying attention to concentrate on critical information [8].
2) Dynamically adjusting attention based on the current task [9]. 3) Optimizing attention to minimize redundancy and emphasize core content [10]. 4) Training with location-independent tasks to enhance the ability to search for and react to relevant information in long contexts [11].

Multipath Optimization. Various methods can enhance LLMs on long-context tasks: 1) Pre-training with extended context windows and reinforcement learning for fine-tuning to optimize long-context understanding [12, 13, 14, 15]. 2) Combining retrieval-based models with generative models on long-context tasks [16, 17]. 3) Leveraging cyclic sequence models' linear scaling property for better inference efficiency [18]. 4) Using external memory to store and retrieve long-context information [19]. 5) Mimicking brain memory hierarchies to improve long-context processing efficiency [20].

Figure 5: Case of Incorrect Relevance Judgment (§2.2), where the model incorrectly associates wrong content from the previous turn.

2.2. Multi-Turn Conversation Handling

Multi-turn conversation serves as a fundamental interaction mode between LLMs and humans. Given the challenges users face in providing complete and precise instructions in a single turn, they often opt to refine and clarify their intentions incrementally through iterative exchanges. However, due to the characteristics of real-world conversations, such as constant changes in user intentions and long-distance dependencies, LLMs still face significant challenges in achieving coordinated multi-round interactions with humans. This paper categorizes the challenges existing LLMs face when understanding multi-turn conversations into three categories, as follows: 1) Capability Weakening. Current supervised instruction fine-tuning (SIFT) and RLHF may even impair multi-turn capabilities [97], with models struggling on complex reasoning tasks that span multiple rounds, such as those requiring evidence collection and conclusions [98].
Additionally, multi-turn dialogs increase the vulnerability of LLMs to adversarial attacks, where malicious users can mask harmful intentions across multiple rounds, leading to the generation of misleading or harmful content [99]. 2) Error Propagation. Instruction comprehension errors accumulate across rounds, leading to an escalating failure rate in subsequent responses [100], which may snowball into larger issues such as biased or incorrect outputs [101]. 3) Incorrect Relevance Judgment (Figure 5 Q.1). LLMs often struggle to identify relevant content in multi-turn dialogs, failing to properly link content from previous rounds or to discern the ellipsis and implicit meaning inherent in user commands [7].

To solve the above challenges, this paper categorizes existing solutions into two types: supervised fine-tuning methods using multi-turn dialogue data, enhanced by techniques like optimized instruction parsing [6] and context-aware preference strategies [7]; and reinforcement learning methods tailored for multi-turn dialogue, with improvements such as hierarchical reinforcement learning [28].

3. Intention Reasoning

User instructions often lack clarity due to language ambiguities. While humans can infer intention, LLMs struggle with misinterpreting ambiguous inputs, leading to errors. We explore causes of and solutions for intention errors, focusing on inconsistent instructions, misinformation, fuzzy language, and intention clarification.

Figure 6: Case of Inconsistent Instruction Reasoning (§3.1), where LLMs fail to detect conflicting inputs (Q.1) or overlook logical inconsistencies (Q.2).

3.1. Inconsistent Instruction Reasoning

In natural language communication, humans easily identify inconsistencies using context and prior knowledge, whereas LLMs struggle, often accepting contradictory inputs and generating unreliable answers.
This phenomenon has been observed across multiple question-answering generation tasks [33, 30], and we categorize the causes of this problem according to the scenarios in which it occurs as follows: 1) Ignoring input errors (Figure 6 Q.1). The model ignores errors in the input and gives an answer, assigning the same weight to each context given by the user, which in turn affects the generation of the answer. 2) Inability to detect user inconsistencies (Figure 6 Q.2). Even when the model has already learned the relevant knowledge, it still has difficulty detecting user inconsistencies. To address inconsistent instruction reasoning issues, existing solutions primarily adopt the following two approaches:

Knowledge Updating. SituatedQA [29] attempts to enhance model performance by updating the knowledge base. ContraDoc [33] trains models to identify contradictions in long texts using human-annotated datasets. Additionally, CDConv [30] simulates common user behaviors to trigger chatbots through an automated dialogue generation method, generating contradictions for training purposes. See Figure 3 for more methods.

Confidence Calibration. Given the high cost of data annotation and model fine-tuning, some researchers have sought alternative approaches by introducing additional processing techniques. CD2 [34] maximizes probabilistic output and calibrates model confidence under knowledge conflicts using conflict decoupling and comparison decoding methods. MacNoise [39] enhances contradiction retrieval in an augmented generative system by explicitly fine-tuning a discriminator or prompting LLMs to improve contradiction discrimination. See Figure 3 for more methods.

3.2. Misinformation Reasoning

Erroneous instructions mislead model outputs more severely than inconsistent ones, as they lack obvious contradictions, requiring the model to comprehend, reason, compare input knowledge with its parameterized knowledge, and make objective judgments [102, 103]. From the input perspective, this paper classifies the sources of erroneous information into two categories: 1) Temporal Alignment Failure (Figure 7 Q.1). This arises when the knowledge provided by the user and the model is temporally misaligned due to updates occurring at different times, leading to inconsistent responses. Such discrepancies typically originate during the training process. 2) Information Contamination (Figure 7 Q.2). This refers to the degradation of model quality caused by the intentional distortion of input data.

Figure 7: Case of Misinformation Reasoning (§3.2), caused by temporal misalignment leading to failure responses (Q.1) or data contamination resulting in misleading outputs (Q.2).

To solve the above problems, existing methods mainly focus on reducing model susceptibility to internal and external knowledge conflicts through targeted fine-tuning and processing. CKL [40] ensures that the model's knowledge is updated in a timely manner through an online approach, although this approach is slightly weaker than re-training in terms of effectiveness [41]. KC-LLMs [46] allows a large model to recognize knowledge conflicts by means of instruction fine-tuning that identifies specific passages of conflicting information. BIPIA [42] uses adversarial training to combat the effects of information pollution and improve model robustness. CAR [47] achieved nearly a 20% improvement by discriminating external knowledge that may not be contaminated in the RAG system. See Figure 3 for more methods.
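The first failure mode above, temporal alignment, ultimately reduces to comparing the timestamp of user-supplied or retrieved knowledge against the model's training cutoff and deciding which source to trust. A minimal sketch of that decision (the cutoff date, fact format, and flagging convention are illustrative assumptions, not any cited system):

```python
from datetime import date

# Hypothetical training cutoff of the deployed model (assumption for illustration).
MODEL_CUTOFF = date(2023, 9, 1)

def resolve_fact(candidates: list[tuple[date, str]]) -> str:
    """Given (as_of_date, claim) pairs answering the same question, prefer the
    most recent claim; flag it when it postdates the model's parametric
    knowledge, so generation trusts the input over stale internal memory."""
    as_of, claim = max(candidates, key=lambda c: c[0])
    if as_of > MODEL_CUTOFF:
        return f"[post-cutoff, trust input] {claim}"
    return claim
```

Real systems such as the online knowledge-updating or conflict-recognition methods cited above operate on model parameters or retrieved passages rather than explicit timestamps; this sketch only illustrates the temporal-misalignment logic itself.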
3.3. Fuzzy Language Interpretation

Figure 8: Case of Fuzzy Language Interpretation (§3.3), where the model relies on biases for fuzzy queries (Q.1) or defaults to a response without seeking clarification (Q.2).

When user instructions contain fuzzy terms (e.g., polysemy or vagueness), LLMs may select an incorrect interpretation from multiple possibilities, potentially leading to misleading responses. This phenomenon has been observed across multiple information-seeking tasks [50], and we categorize the causes of this problem according to the scenarios in which it occurs as follows: 1) Self-defined problem (Figure 8 Q.1). When the user inputs content with fuzzy sentences, LLMs may generate content based on the preferences of their own training data. 2) Selecting data based on fuzzy input (Figure 8 Q.2). In response to ambiguous user input, LLMs may select a default interpretation without actively asking the user to clarify.

To solve this problem, researchers have started to parse the semantic information expressed by users through clue engineering. Folkscope [48] proposed the FolkScope framework, which uses a large language model to analyze and discriminate users' fuzzy purchasing intentions. Miko [49] introduces a hierarchical intention generation framework that interprets users' posting behaviors by analyzing the fuzzy information they share on social platforms. Zhang et al. [53] employed behavioral cloning, using demonstration data from a strong model to train a weaker model so that the weaker model can perform better on a similar task. ATC [54] utilizes the Active Thinking Chain cueing scheme, which enhances the proactivity of a large language model by adding goal-planning capabilities to the descriptive reasoning chain. See Figure 3 for more methods.

3.4. Intention Clarification Failure

Figure 9: Case of Intention Clarification Failure (§3.4), where the model misinterprets sarcasm (Q.1) or ignores prior emotional context (Q.2).

Unlike humans, who reason based on experience, LLMs lack real-world common sense and thus struggle to infer complex contexts beyond their input data. Moreover, they often fail to maintain a consistent reasoning trajectory across long texts, complex contexts, or multi-turn conversations. When handling intricate intentions or sentiment shifts, LLMs may struggle to retain prior context, leading to errors in inferring implicit user needs. We categorize the causes of this problem according to the scenarios in which it occurs as follows: 1) Failure to detect sarcasm (Figure 9 Q.1), when LLMs fail to understand the sarcastic intention of the user. 2) Ignoring prior emotional context (Figure 9 Q.2), when LLMs focus only on the second half of the sentence and ignore the emotions in the previous round of dialog.

To solve the above problems, researchers have started to construct multi-domain datasets [57] containing implicit intentions to strengthen the ability of LLMs to reason about complex intentions and user emotions in multi-round interaction scenarios. DeepSeek-R1 [55] enhances its understanding of human intention through a structured process with two RL stages for refining reasoning patterns and aligning with human preferences. S1 [56] uses budget forcing to control the number of thinking tokens: the upper limit is enforced by terminating early with a delimiter, while the lower limit suppresses delimiters and appends "wait" to guide in-depth reasoning and optimize answer quality. SoulChat [57] fine-tuned LLMs by constructing a dataset containing more than 2 million samples of multi-round empathic conversations. MoChat [58] constructs multi-round dialogues for spatial localization using joint-grouping spatio-temporal localization. LARA [68] combines a fine-tuned smaller model with retrieval enhancement mechanisms and integrates it into the architecture of an LLM. See Figure 3 for more methods.

4. Reliable Dialog Generation

Despite strong performance, LLMs struggle with output reliability. Trained on large corpora using maximum likelihood estimation, they generate deterministic responses. While effective on familiar data, they often produce unstable or incomplete responses to unseen inputs, undermining reliability.

4.1. Response Stability

The knowledge acquired by LLMs is generally determined in the pre-training stage and stored in parameterized form. For data in a specific field, the current model is generally optimized by instruction fine-tuning so that it outputs what humans want [104]. If knowledge samples the model has not seen are used in instruction tuning, the model will inevitably give a definite response to unknown inputs, and there is a high probability that an answer will be fabricated. This overconfidence causes the model to output unreliable answers, and we categorize the causes of this problem according to the scenarios in which it occurs as follows: 1) Fabricated incorrect information (Figure 10 Q.1). When the model's knowledge does not match the input question, it fabricates information that does not match the facts. 2) Incorrect context output (Figure 10 Q.2). When the model's knowledge does match the input question, it outputs incorrect context information.

Figure 10: Case of Unstable Content Generation (§4.1), where the model fabricates details when it lacks relevant knowledge (Q.1) or produces incorrect contextual information despite having relevant knowledge (Q.2).

To address these issues, researchers have explored uncertainty, which quantifies the credibility and stability of model outputs.

Fine-tuning LLMs. To make LLMs more accurate in estimating uncertainty, existing methods fine-tune models [69, 70]. LUQ [80] is a novel sampling-based uncertainty quantification method specifically designed for long texts. UaIT [79] uses semantic entropy to assess output uncertainty.
ConformalFactuality [71] defines the associated uncertainties for each possible output. See Figure 3 for more methods.

External Tools. Fine-tuning LLMs typically demands substantial computing resources and slow training; therefore, reducing computational overhead is crucial for improving efficiency. Researchers have proposed methods to evaluate the uncertainty of model outputs through external tools [81]. ConfidenceElicitation [105] is a new uncertainty measurement tool for large model outputs. CalibrateMath [82] assesses uncertainty by requiring models to generate numerical answers with confidence levels, evaluating their reliability.

Figure 11: Case of Misalignment with Human Values (§4.2), where the model generates harmful or offensive content (Q.1).

4.2. Alignment

Despite the impressive capabilities of large language models (LLMs), they have raised significant concerns regarding the unsafe or harmful content they may generate. LLMs are typically trained on vast datasets scraped from the internet, including inappropriate or harmful content [84]. This means that the models may inadvertently produce outputs misaligned with human values, as follows: 1) Generation of Toxic Content (Figure 11 Q.1). LLMs may generate toxic content, such as hate speech or offensive comments, when asked to respond to sensitive topics [106, 107]. 2) Conflicts with Moral/Ethical Standards (Figure 11 Q.1). LLMs might produce outputs that conflict with moral or ethical standards, such as guiding illegal activities [108, 109]. To tackle the above concerns regarding unsafe or harmful content produced by LLMs, researchers have focused on various stages:

Pretraining Data Cleaning and Curation. To minimize the risks associated with harmful or inappropriate content, LLM training datasets should undergo rigorous cleaning processes [84], such as filtering out toxic language, hate speech, and harmful stereotypes.
Tools like the Perspective API [83] and word embedding debiasing methods [85] can help identify and remove toxic and biased content.

Reinforcement Learning-based Alignment. To better align LLMs with human values and societal norms, reinforcement learning approaches are widely adopted, including RLHF and its advanced variants such as PPO [3], DPO [86], and GRPO [87]. Extending RLHF, RLAIF [88] leverages AI systems to assist in the feedback process, improving the scalability of evaluation and fine-tuning while enabling continuous monitoring and refinement of model behavior.

In-context Alignment. This approach leverages the ability of LLMs to adapt their responses based on a few examples provided in the prompt. [89] demonstrates that effective alignment can be achieved purely through ICL with just a few stylistic examples and a system prompt. [90] explored the effectiveness of different components of in-context alignment and found that examples within the context are crucial for enhancing alignment capabilities. Liu et al. [91] introduced PICA, which uses a two-stage approach to improve alignment efficiency.

| Stage | Challenge | GPT-4o | GPT-o3* | Qwen3 | Qwen3* | Deepseek-v3 | Deepseek-R1* |
|---|---|---|---|---|---|---|---|
| Instruction Understanding | Long-Text Comprehension | C | A | C | A | C | A |
| Instruction Understanding | Multi-Turn Conversation Handling | C | A | C | C | C | A |
| Intention Reasoning | Inconsistent Instruction Reasoning | C | C | C | C | B | C |
| Intention Reasoning | Misinformation Reasoning | B | A | C | B | B | A |
| Intention Reasoning | Fuzzy Language Interpretation | B | B | C | C | C | C |
| Intention Reasoning | Intention Clarification Failure | C | B | C | C | C | C |
| Reliable Dialog Generation | Response Stability | C | C | C | C | C | C |
| Reliable Dialog Generation | Alignment | C | A | C | C | C | A |

Table 1: Results of using different LLMs on the challenge cases; * denotes models with reasoning abilities. 'A' indicates the model's accuracy is greater than 75%, 'B' is between 50% and 75%, and 'C' is below 50%.

5.
Challenge Cases in Human-LLM Alignment

To fully understand the limitations of LLMs in practical applications, it is particularly important to analyze their performance in a variety of complex scenarios. Across the eight challenges summarized above, spanning the three stages of instruction understanding, intention reasoning, and reliable dialog generation, we collected human user instructions from real scenarios in which a large model must interact with users. We cleaned and filtered these instructions through manual screening, diversity sampling, and difficulty filtering, and finally retained 50 human instructions for each challenge. The statistical results of the test are shown in Table 1. In this section, we present a series of representative case studies and conduct an in-depth analysis of the current mainstream LLMs (including GPT-4o, GPT-o3, Qwen3, DeepSeek-V3, and DeepSeek-R1). In the figures, the blue smiley face indicates that the model is capable of providing accurate responses or identifying inconsistencies in the user's input instructions. In contrast, the red crying face signifies that the LLM failed to recognize contradictions in the user's instructions and produced incorrect responses. Through these cases, we systematically reveal the typical failure modes of each model in different aspects and their deep-seated causes, providing important references and directions for targeted optimization in future research.

5.1. Case of Instruction Understanding

Although LLMs perform well in short-text and single-round dialogue scenarios, their ability to understand and execute complex instructions in long contexts and multi-round dialogues still faces many severe challenges.

Figure 4 shows that long texts usually contain a lot of irrelevant or redundant information, which causes the core content directly related to the task to be submerged.
The model is prone to omission or confusion when extracting key information, and even to hallucination, thereby generating content that is irrelevant to the user's instructions.

Figure 5 shows that when dealing with long-distance information associations that span multiple paragraphs or dialogue rounds, the model often has difficulty tracking and integrating the logical relationships between contexts, thereby losing important clues. In multi-round dialogue scenarios, the model is not only prone to gradually accumulating errors due to inaccurate understanding of the preceding text, but may also fail to correctly judge the relevance of each round of dialogue, causing the answer to deviate from the user's real needs.

As shown in Table 1, Figure 4, and Figure 5, in the instruction understanding stage, the reasoning models generally performed well on the "Long-Text Comprehension" task, while the non-reasoning models all struggled. For Multi-Turn Conversation Handling, in addition to the non-reasoning models, Qwen3, which provides reasoning capability, also exhibited suboptimal performance. These results show that reasoning can effectively improve the ability of LLMs in instruction understanding, but cannot completely solve the instruction understanding problem.

5.2. Case of Intention Reasoning

Since user instructions often contain spelling errors, factual contradictions, and semantic ambiguities, LLMs face many challenges in understanding the user's true intentions.

Figure 6 shows that the model often relies too heavily on the user's input and ignores obvious errors in it. It tends to generate answers directly instead of first identifying and correcting these errors. Figure 7 shows that when knowledge updates are not synchronized or input data is maliciously tampered with, the model is more likely to output information that is inconsistent with actual needs or contaminated.
However, the reasoning models can identify such errors, indicating that a model's reasoning ability is crucial for identifying the user's input intention.

Figure 8 shows that when faced with ambiguous or uncertain expressions, large models usually tend to customize questions or give default interpretations based on their own training preferences, rather than actively asking users for more context. The GPT series performs better than models such as DeepSeek in such tasks, indicating that even models with strong reasoning capabilities may have difficulty dealing with modal expressions such as sarcasm and irony. This reflects the limitations of AI in understanding complex human language expressions, as well as differences in the coverage of such language phenomena in the training data and in the depth of the models' understanding of social and cultural context.

Figure 9 shows that for implicit intentions such as sarcasm, metaphor, and emotion, the model often focuses only on the literal meaning and has difficulty grasping the deeper emotions or context, thereby outputting incorrect analysis results.

As shown in Table 1 and Figures 6 to 9, in the intention reasoning stage, all models encountered difficulties and performed poorly in Inconsistent Instruction Reasoning, Fuzzy Language Interpretation, and Intention Clarification Failure; in Misinformation Reasoning, only GPT-o3 and Deepseek-R1 achieved good performance, while the other models underperformed. This suggests that all models face significant challenges in reasoning about intentions.

5.3. Case of Reliable Dialog Generation

Current large language models exhibit both deterministic response preferences and ambiguous knowledge boundaries during generation. When queries fall within the model's knowledge coverage, the outputs are generally reliable; however, in open-domain or previously unseen scenarios, the quality of generated content fluctuates significantly.
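One practical way to tame this fluctuation is to make the model abstain near its knowledge boundary: elicit a verbalized confidence alongside each answer and refuse to respond when it falls below a threshold. The sketch below is a minimal illustration, not a method from the surveyed papers; the response format "answer (confidence: NN%)" and the 0.7 threshold are assumptions for the example.

```python
import re

# Matches a self-reported confidence like "confidence: 95%".
CONF_RE = re.compile(r"confidence:\s*(\d{1,3})\s*%", re.IGNORECASE)

def answer_or_abstain(raw_response: str, threshold: float = 0.7) -> str:
    """Parse a response of the form '<answer> (confidence: NN%)' and
    abstain when the self-reported confidence is below the threshold."""
    m = CONF_RE.search(raw_response)
    if m is None:
        return "[abstain: no confidence reported]"
    confidence = int(m.group(1)) / 100
    if confidence < threshold:
        return "[abstain: low confidence]"
    # Strip the confidence annotation, leaving only the answer text.
    return CONF_RE.sub("", raw_response).strip(" ()")

print(answer_or_abstain("Paris (confidence: 95%)"))     # Paris
print(answer_or_abstain("Atlantis (confidence: 20%)"))  # [abstain: low confidence]
```

Verbalized confidences are known to be poorly calibrated in many models, so in practice the threshold would be tuned on held-out data or replaced by a sampling-based score.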
Figure 10 illustrates the instability of LLM-generated content, which primarily manifests in two typical issues. First, when the model lacks relevant knowledge or information regarding the input (Q.1), it tends to fabricate details, producing content inconsistent with facts and thereby substantially compromising the accuracy and reliability of its outputs. Second, even when the model possesses relevant knowledge (Q.2), it may still misinterpret or improperly integrate contextual information, resulting in outputs that are contextually inappropriate or logically flawed. These issues not only undermine user experience but also limit the applicability of LLMs in high-stakes scenarios.

Figure 11 demonstrates the misalignment between model-generated content and human values, mainly reflected in the model's potential to violate widely accepted moral and ethical standards. On one hand, when handling sensitive or controversial topics, the model may generate harmful, offensive, or discriminatory statements, negatively impacting users. On the other hand, it may also produce responses that encourage illegal activities or contravene social ethics, introducing potential legal and societal risks and imposing higher requirements on the safety and trustworthiness of LLMs.

As shown in Table 1, Figure 10, and Figure 11, in the reliable dialog generation stage, all models failed to maintain Response Stability; in the Alignment test, only GPT-o3 and Deepseek-R1 performed well, while the rest of the models failed. The results show that all models have clear shortcomings in Response Stability, and the reasoning models have improved in Alignment. Overall, reasoning capabilities can improve the performance of LLMs when interacting with real human user commands; however, all models still face significant challenges and remain underpowered in real-world human interaction scenarios.
| Cat. | Benchmark | Year | Lang. | Num. | Type | Description |
|---|---|---|---|---|---|---|
| Instruction Understanding | BotChat [110] | 2023 | En&Zh | 7658 | M | Multi-round dialogue eval. via simulated data |
| Instruction Understanding | MINT [97] | 2023 | En | 29,307 | M | Tool use and feedback in multi-turn dialogue |
| Instruction Understanding | MT-BENCH-101 [111] | 2024 | En | 1,388 | M | Multi-turn dialogue ability |
| Instruction Understanding | ∞Bench [112] | 2024 | En&Zh | 100k | S | Long-context handling |
| Instruction Understanding | L-Eval [113] | 2024 | En | 2,000 | S | Evaluation of long-context language models |
| Instruction Understanding | LongICLBench [114] | 2025 | En | 2,100k | S | Long in-context learning |
| Intention Reasoning | BIPIA [42] | 2023 | En | 712.5K | S | Vulnerability to hint injection |
| Intention Reasoning | Miko [49] | 2024 | En | 10k | S | Multimodal social intent understanding |
| Intention Reasoning | CONTRADOC [33] | 2023 | En | 891 | S | Self-contradiction in long docs |
| Intention Reasoning | CDCONV [30] | 2022 | Zh | 12K | M | Contradiction in Chinese dialogues |
| Reliable Dialog Generation | Open-LLM-Leaderboard [115] | 2024 | En | 10K | S | Uncertainty in generation |
| Reliable Dialog Generation | ETHICS [116] | 2020 | En | 130K | S | Moral reasoning |
| Reliable Dialog Generation | FACTOR [117] | 2023 | En | 300 | S | Factuality in generated text |

Table 2: A selection of widely used benchmark datasets for evaluating LLMs. 'Cat.': task stage; 'Lang.': language of the benchmark; 'Num.': data size; 'Type': S = single-round, M = multi-round.

6. Benchmark

This section covers benchmarks for LLMs in the above three stages (Table 2).

6.1. Benchmarking Instruction Understanding

Instruction understanding in LLMs involves extracting key information, maintaining coherence, and adapting to dynamic conversation changes, especially in long or multi-round dialogues. LLM capabilities in multi-turn dialogue and long-context processing have been explored through various benchmarks. BotChat [110] evaluates dialogue generation, showing GPT-4's strengths but noting instruction compliance and length limitations in other models. MINT [97] highlights limited progress in tool use and feedback for complex tasks. MT-Bench-101 [111] identifies challenges in enhancing long-term interaction skills. For long-context tasks, performance drops on ultra-long texts (Zhang et al.
[112]), L-Eval [113] emphasizes length-instruction-enhanced metrics, and LongICLBench [114] reveals difficulties in reasoning across multiple pieces of information.

6.2. Benchmarking LLM Reasoning

LLM intention reasoning involves inferring user intentions by interpreting both explicit and implicit language cues. The existing benchmarks comprehensively evaluate the multifaceted reasoning capabilities of LLMs, encompassing vulnerability to indirect hint injection attacks (BIPIA [42]), advantages in multimodal intention understanding (Miko [49]), analysis of self-contradictions in long documents (CONTRADOC [33]), and contradiction detection in Chinese dialogues (CDCONV [30]). Collectively, these benchmarks highlight both the challenges and advancements of LLMs in complex reasoning.

6.3. Benchmarking LLM Generation

LLM generation assesses a model's ability to understand user instructions, avoid fabricating false information, and generate accurate, contextually appropriate responses. Open-LLM-Leaderboard [115], ETHICS [116], and FACTOR [117] all focus on the reliability evaluation of content generated by large models. Open-LLM-Leaderboard finds that large-scale models have higher uncertainty, and fine-tuned models have higher accuracy but greater uncertainty. ETHICS focuses on the ethical value alignment of generated content. FACTOR evaluates factuality through scalable methods to ensure that diverse and rare facts are covered.

7. Future Directions

This section summarizes ongoing challenges in the above stages and outlines potential future research directions.

Automated Annotation Framework. Although LLMs excel in general-domain tasks, they often produce hallucinated or incomplete content in specialized fields due to limited domain-specific training data. While contextual learning and instruction fine-tuning methods have been explored to address this issue, manual data annotation remains labor-intensive and prone to quality inconsistencies.
An automated annotation framework could streamline data labeling, enhancing model performance in specialized fields by ensuring higher quality and scalability of domain-specific training datasets.

GraphRAG. LLMs have shown impressive language generation capabilities through pre-training on large datasets, but their reliance on static data often results in inaccurate or fictional content, particularly in domain-specific tasks. The graph-enhanced generation approach aims to tackle this by leveraging KGs and GNNs for precise knowledge retrieval. Despite its advantages, GraphRAG faces challenges in capturing structural information during graph reasoning tasks and struggles with multi-hop retrieval accuracy and conflict resolution between external and internal knowledge. Future work should focus on refining retrieval strategies and improving the stability and accuracy of GraphRAG in complex tasks.

Quantifying Uncertainty in LLMs. As LLMs like LLaMA and ChatGPT have revolutionized content generation, ensuring the reliability of their outputs remains a critical challenge. Uncertainty quantification is a promising approach to address this, enabling models to provide a confidence assessment alongside their responses. Evidence learning [69, 70], an emerging method in uncertainty representation, offers a reliable approach for quantifying uncertainty directly from data. However, its application to LLMs is computationally intensive, and most existing work focuses on small-scale models. Future research should aim to optimize uncertainty quantification for large-scale models efficiently.

Balancing Safety and Performance. Although advancements in alignment techniques have improved factual accuracy and safety, they often come at the cost of the model's creativity and fluency. Striking a balance between safety and performance is crucial.
Future research should explore new alignment methods that ensure both the safety and usability of LLMs, optimizing the trade-off between generating reliable, safe content and maintaining the model's creative and contextual capabilities.

8. Conclusion

This paper analyzes LLMs' performance in processing user instructions. Despite progress in natural language understanding, LLMs struggle with complex, inconsistent instructions, often resulting in biases, errors, and hallucinations. Improvements through prompt engineering, model expansion, RLHF, and related techniques have not fully addressed LLMs' limitations in reasoning and comprehension, limiting their real-world applicability. We identify three challenges: instruction understanding, intention reasoning, and reliable dialog generation. Future research should focus on enhancing reasoning for complex instructions and aligning outputs with user intention to improve LLMs' reliability.

References

[1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, CoRR (2020).
[2] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, P. Christiano, Recursively summarizing books with human feedback, CoRR (2021).
[3] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, NeurIPS 35 (2022).
[4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, NeurIPS 35 (2022).
[5] J. Li, X. Cheng, X. Zhao, J. Nie, J. Wen, Halueval: A large-scale hallucination evaluation benchmark for large language models, in: EMNLP, 2023.
[6] Z. Teng, Y. Song, X. Ye, Y. Ouyang, Fine-tuning llms for multi-turn dialogues: Optimizing cross-entropy loss with kl divergence for all rounds of responses, in: ICML, 2024.
[7] Y. Sun, C.
Liu, K. Zhou, J. Huang, R. Song, W. X. Zhao, F. Zhang, D. Zhang, K. Gai, Parrot: Enhancing multi-turn instruction following for large language models, in: ACL, 2024. [8]I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document trans- former, CoRR (2020). [9]C. Wu, F. Wu, T. Qi, B. Jiao, D. Jiang, Y. Huang, X. Xie, Smart bird: Learnable sparse attention for efficient and effective transformer, CoRR (2021). [10]Y. Chen, Z. You, S. Zhang, H. Li, Y. Li, Y. Wang, M. Tan, Core context aware attention for long context language modeling, CoRR (2024). [11]J. He, K. Pan, X. Dong, Z. Song, L. LiuYiBo, Q. Qianguosun, Y. Liang, H. Wang, E. Zhang, J. Zhang, Never lost in the middle: Mastering long-context question answering with position-agnostic decompositional training, in: ACL, 2024. [12]J. Zhang, Z. Hou, X. Lv, S. Cao, Z. Hou, Y. Niu, L. Hou, Y. Dong, L. Feng, J. Li, Longreward: Improving long-context large language models with ai feedback, CoRR (2024). [13]Y. Bai, X. Lv, J. Zhang, Y. He, J. Qi, L. Hou, J. Tang, Y. Dong, J. Li, Longalign: A recipe for long context alignment of large language models, arXiv preprint arXiv:2401.18058 (2024). [14]Z. Tang, Z. Sun, J. Li, Q. Zhu, M. Zhang, Logo–long context alignment via efficient preference optimization, arXiv preprint arXiv:2410.18533 (2024). [15]Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, Longlora: Efficient fine- tuning of long-context large language models, arXiv preprint arXiv:2309.12307 (2023). [16]Z. Li, C. Li, M. Zhang, Q. Mei, M. Bendersky, Retrieval augmented generation or long-context llms? a comprehensive study and hybrid approach, in: EMNLP, 2024. 25 [17]P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in neural information processing sys- tems 33 (2020) 9459–9474. [18]A. Gu, T. 
Dao, Mamba: Linear-time sequence modeling with selective state spaces, CoRR (2023). [19]W. Liu, Z. Tang, J. Li, K. Chen, M. Zhang, Memlong: Memory-augmented retrieval for long text modeling, CoRR (2024). [20]Z. He, Z. Qin, N. Prakriya, Y. Sun, J. Cong, Hmt: Hierarchical memory trans- former for long context language processing, CoRR (2024). [21]J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021). [22]S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, A. Awadallah, Orca: Progressive learning from complex explanation traces of gpt-4, arXiv preprint arXiv:2306.02707 (2023). [23]C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, D. Jiang, Wiz- ardlm: Empowering large language models to follow complex instructions, arXiv preprint arXiv:2304.12244 (2023). [24]W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al., Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, See https://vicuna. lmsys. org (accessed 14 April 2023) 2 (3) (2023) 6. [25]Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-instruct: Aligning language models with self-generated instructions, in: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), 2023, p. 13484–13508. 26 [26]L. Shani, A. Rosenberg, A. Cassel, O. Lang, D. Calandriello, A. Zipori, H. Noga, O. Keller, B. Piot, I. Szpektor, et al., Multi-turn reinforcement learning from preference human feedback, december 2024, URL http://arxiv. org/abs/2405.14655. [27]Z. Chen, Y. Deng, H. Yuan, K. Ji, Q. Gu, Self-play fine-tuning converts weak language models to strong language models, arXiv preprint arXiv:2401.01335 (2024). [28]Y. Zhou, A. Zanette, J. Pan, S. Levine, A. 
Kumar, Archer: Training language model agents via hierarchical multi-turn rl, in: ICML, 2024. [29]M. J. Q. Zhang, E. Choi, Situatedqa: Incorporating extra-linguistic contexts into QA, in: EMNLP, 2021, p. 7371–7387. [30]C. Zheng, J. Zhou, Y. Zheng, L. Peng, Z. Guo, W. Wu, Z. Niu, H. Wu, M. Huang, Cdconv: A benchmark for contradiction detection in chinese conversations, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), EMNLP, ACL, 2022, p. 18–29. [31]X. Wen, B. Li, T. Huang, M. Chen, Red teaming language models for contradic- tory dialogues, CoRR (2024). [32]V. Gokul, S. Tenneti, A. Nakkiran, Contradiction detection in rag systems: Eval- uating llms as context validators for improved information consistency, arXiv preprint arXiv:2504.00180 (2025). [33]J. Li, V. Raheja, D. Kumar, Contradoc: Understanding self-contradictions in documents with large language models, in: K. Duh, H. Gómez-Adorno, S. Bethard (Eds.), NAACL, ACL, 2024, p. 6509–6523. [34]Z. Jin, P. Cao, Y. Chen, K. Liu, X. Jiang, J. Xu, Q. Li, J. Zhao, Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval- augmented language models, in: N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), LREC/COLING 2024, ELRA and ICCL, 2024, p. 16867–16878. 27 [35]S. Kapoor, N. Gruver, M. Roberts, K. Collins, A. Pal, U. Bhatt, A. Weller, S. Dooley, M. Goldblum, A. G. Wilson, Large language models must be taught to know what they dont know, Advances in Neural Information Processing Sys- tems 37 (2024) 85932–85972. [36]G. He, P. Cui, J. Chen, W. Hu, J. Zhu, Investigating uncertainty calibration of aligned language models under the multiple-choice setting, arXiv preprint arXiv:2310.11732 (2023). [37]J. Xin, R. Tang, Y. Yu, J. 
Lin, The art of abstention: Selective prediction and error regularization for natural language processing, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, p. 1040–1051. [38]B. Wen, J. Yao, S. Feng, C. Xu, Y. Tsvetkov, B. Howe, L. L. Wang, Know your limits: A survey of abstention in large language models, Transactions of the Association for Computational Linguistics 13 (2025) 529–556. [39]G. Hong, J. Kim, J. Kang, S. Myaeng, J. J. Whang, Why so gullible? enhancing the robustness of retrieval-augmented models against counterfactual noise, in: K. Duh, H. Gómez-Adorno, S. Bethard (Eds.), NAACL, ACL, 2024, p. 2474– 2495. [40]J. Jang, S. Ye, S. Yang, J. Shin, J. Han, G. Kim, S. J. Choi, M. Seo, Towards continual knowledge learning of language models, in: ICLR, 2022. [41]A. Liska, T. Kociský, E. Gribovskaya, T. Terzi, E. Sezener, D. Agrawal, C. de Masson d’Autume, T. Scholtes, M. Zaheer, S. Young, E. Gilsenan- McMahon, S. Austin, P. Blunsom, A. Lazaridou, Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models, in: ICML, Vol. 162, 2022, p. 13604–13622. [42]J. Yi, Y. Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, F. Wu, Benchmarking and defending against indirect prompt injection attacks on large language models, in: 28 Y. Sun, F. Chierichetti, H. W. Lauw, C. Perlich, W. H. Tok, A. Tomkins (Eds.), KDD, ACM, 2025, p. 1809–1820. [43]K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual as- sociations in gpt, Advances in neural information processing systems 35 (2022) 17359–17372. [44]K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, D. Bau, Mass-editing mem- ory in a transformer, arXiv preprint arXiv:2210.07229 (2022). [45]N. Carlini, M. Jagielski, C. A. Choquette-Choo, D. Paleka, W. Pearce, H. Ander- son, A. Terzis, K. Thomas, F. 
Tramèr, Poisoning web-scale training datasets is practical, in: 2024 IEEE Symposium on Security and Privacy (SP), IEEE, 2024, p. 407–425. [46]Y. Wang, S. Feng, H. Wang, W. Shi, V. Balachandran, T. He, Y. Tsvetkov, Re- solving knowledge conflicts in large language models, CoRR (2023). [47]O. Weller, A. Khan, N. Weir, D. J. Lawrie, B. V. Durme, Defending against disinformation attacks in open-domain question answering, in: EACL, 2024, p. 402–417. [48]C. Yu, W. Wang, X. Liu, J. Bai, Y. Song, Z. Li, Y. Gao, T. Cao, B. Yin, Folkscope: Intention knowledge graph construction for e-commerce common- sense discovery, in: Findings of ACL, 2023, p. 1173–1191. [49]F. Lu, W. Wang, Y. Luo, Z. Zhu, Q. Sun, B. Xu, H. Shi, S. Gao, Q. Li, Y. Song, et al., Miko: multimodal intention knowledge distillation from large language models for social-media commonsense discovery, in: ACM M, 2024. [50]H. J. Kim, Y. Kim, C. Park, J. Kim, C. Park, K. M. Yoo, S.-g. Lee, T. Kim, Aligning language models to explicitly handle ambiguity, CoRR (2024). [51]M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd 29 international acm sigir conference on research and development in information retrieval, 2019, p. 475–484. [52]S. Min, J. Michael, H. Hajishirzi, L. Zettlemoyer, Ambigqa: Answering am- biguous open-domain questions, arXiv preprint arXiv:2004.10645 (2020). [53]Y. Zhang, J. Lu, N. Jaitly, Probing the multi-turn planning capabilities of llms via 20 question games, in: ACL, 2024, p. 1495–1516. [54]Y. Deng, L. Liao, L. Chen, H. Wang, W. Lei, T. Chua, Prompting and evaluating large language models for proactive dialogues: Clarification, target-guided, and non-collaboration, in: Findings of EMNLP, 2023. [55]D. Guo, D. Yang, H. Zhang, et al., Deepseek-r1: Incentivizing reasoning capa- bility in llms via reinforcement learning, CoRR (2025). [56]N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. 
Fei-Fei, et al., s1: Simple test-time scaling, CoRR (2025). [57]Y. Chen, X. Xing, J. Lin, H. Zheng, Z. Wang, Q. Liu, X. Xu, Soulchat: Im- proving llms empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations, in: Findings of EMNLP, 2023. [58]J. Mo, Y. Chen, R. Lin, Y. Ni, M. Zeng, X. Hu, M. Li, Mochat: Joints-grouped spatio-temporal grounding LLM for multi-turn motion comprehension and de- scription, CoRR abs/2410.11404 (2024). [59]S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y. Cao, React: Syn- ergizing reasoning and acting in language models, in: The eleventh international conference on learning representations, 2022. [60]N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, S. Yao, Reflexion: Language agents with verbal reinforcement learning, Advances in Neural Information Pro- cessing Systems 36 (2023) 8634–8652. 30 [61]X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171 (2022). [62]S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, Advances in neural information processing systems 36 (2023) 11809–11822. [63]D. Demszky, D. Movshovitz-Attias, J. Ko, A. Cowen, G. Nemade, S. Ravi, Goe- motions: A dataset of fine-grained emotions, arXiv preprint arXiv:2005.00547 (2020). [64]H. Rashkin, E. M. Smith, M. Li, Y.-L. Boureau, Towards empathetic open- domain conversation models: A new benchmark and dataset, in: Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, p. 5370–5381. [65]K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: International conference on machine learning, PMLR, 2020, p. 3929–3938. [66]S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. 
Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., Improving language models by retrieving from trillions of tokens, in: International Conference on Machine Learning, PMLR, 2022, pp. 2206–2240.
[67] C. Packer, V. Fang, S. Patil, K. Lin, S. Wooders, J. Gonzalez, Memgpt: Towards llms as operating systems (2023).
[68] J. Liu, T. Y. Keat, F. Bin, LARA: Linguistic-adaptive retrieval-augmented llms for multi-turn intent classification, CoRR abs/2403.16504 (2024).
[69] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, NeurIPS 31 (2018).
[70] A. Amini, W. Schwarting, A. Soleimany, D. Rus, Deep evidential regression, NeurIPS 33 (2020) 14927–14937.
[71] C. Mohri, T. Hashimoto, Language models with conformal factuality guarantees, in: Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, OpenReview.net, 2024.
[72] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[73] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
[74] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibration of modern neural networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1321–1330.
[75] M. Kull, M. Perello Nieto, M. Kängsepp, T. Silva Filho, H. Song, P. Flach, Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration, Advances in Neural Information Processing Systems 32 (2019).
[76] G. Shafer, V. Vovk, A tutorial on conformal prediction, Journal of Machine Learning Research 9 (3) (2008).
[77] Y. Romano, E. Patterson, E. Candes, Conformalized quantile regression, Advances in Neural Information Processing Systems 32 (2019).
[78] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression, Journal of the American Statistical Association 113 (523) (2018) 1094–1111.
[79] L. Kuhn, Y. Gal, S. Farquhar, Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net, 2023.
[80] C. Zhang, F. Liu, M. Basaldella, N. Collier, LUQ: Long-text uncertainty quantification for llms, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, ACL, 2024, pp. 5244–5262.
[81] L. Liu, Y. Pan, X. Li, G. Chen, Uncertainty estimation and quantification for llms: A simple supervised approach, CoRR (2024).
[82] S. Lin, J. Hilton, O. Evans, Teaching models to express their uncertainty in words, Transactions on Machine Learning Research.
[83] L. Cheng, N. Kim, H. Liu, Debiasing word embeddings with nonlinear geometry, in: COLING, 2022.
[84] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: FAccT, 2021.
[85] A. Rakshit, S. Singh, S. Keshari, A. Ghosh Chowdhury, V. Jain, A. Chadha, From prejudice to parity: A new approach to debiasing large language model word embeddings, in: CCL.
[86] Y. Zeng, G. Liu, W. Ma, N. Yang, H. Zhang, J. Wang, Token-level direct preference optimization, in: ICML, 2024.
[87] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, D. Guo, Deepseekmath: Pushing the limits of mathematical reasoning in open language models, CoRR (2024).
[88] H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V. Carbune, A. Rastogi, S. Prakash, RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback, in: ICML, 2024.
[89] B. Y. Lin, A. Ravichander, X. Lu, N. Dziri, M. Sclar, K. R. Chandu, C. Bhagavatula, Y. Choi, The unlocking spell on base llms: Rethinking alignment via in-context learning, in: ICLR, 2024.
[90] H. Huang, Y. Li, H. Sun, Y. Bai, Y. Gao, How far can in-context alignment go? Exploring the state of in-context alignment, in: Findings of EMNLP, 2024.
[91] Z. Liu, D. Li, X. Hu, X. Zhao, Y. Chen, B. Hu, M. Zhang, Take off the training wheels! Progressive in-context learning for effective alignment, in: EMNLP, 2024.
[92] R. Lou, K. Zhang, W. Yin, Large language model instruction following: A survey of progresses and challenges, Computational Linguistics (2024).
[93] A. Plaat, A. Wong, S. Verberne, J. Broekens, N. van Stein, T. Back, Reasoning with large language models, a survey, CoRR (2024).
[94] H.-Y. Huang, Y. Yang, Z. Zhang, S. Lee, Y. Wu, A survey of uncertainty estimation in llms: Theory meets practice, CoRR (2024).
[95] R. Xu, Z. Qi, Z. Guo, C. Wang, H. Wang, Y. Zhang, W. Xu, Knowledge conflicts for llms: A survey, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), EMNLP.
[96] O. Shorinwa, Z. Mei, J. Lidard, A. Z. Ren, A. Majumdar, A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions, CoRR (2024).
[97] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, H. Ji, MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback, in: The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jp3gWrMuIZ
[98] E. Banatt, J. Cheng, S. Vaidyanath, T. Hwu, Wilt: A multi-turn, memorization-robust inductive logic benchmark for llms, CoRR (2024).
[99] D. Agarwal, A. R. Fabbri, B. Risher, P. Laban, S. Joty, C.-S. Wu, Prompt leakage effect and mitigation strategies for multi-turn llm applications, in: EMNLP, 2024.
[100] Y. He, D. Jin, C. Wang, et al., Multi-if: Benchmarking llms on multi-turn and multilingual instructions following, CoRR (2024).
[101] Z. Fan, R. Chen, T. Hu, Z. Liu, Fairmt-bench: Benchmarking fairness for multi-turn dialogue in conversational llms, CoRR (2024).
[102] C. S. Cheang, H. P. Chan, D. F. Wong, X. Liu, Z. Li, Y. Sun, S. Liu, L. S. Chao, Can lms generalize to future data? An empirical analysis on text summarization, in: EMNLP, 2023.
[103] R. Xu, B. S. Lin, S. Yang, T. Zhang, W. Shi, T. Zhang, Z. Fang, W. Xu, H. Qiu, The earth is flat because...: Investigating llms’ belief towards misinformation via persuasive conversation, in: ACL, 2024, pp. 16259–16303.
[104] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9.
[105] M. Xiong, Z. Hu, X. Lu, Y. Li, J. Fu, J. He, B. Hooi, Can llms express their uncertainty? An empirical evaluation of confidence elicitation in llms, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024.
[106] T. Luong, T. Le, L. Ngo, T. Nguyen, Realistic evaluation of toxicity in large language models, in: Findings of ACL, 2024.
[107] A. Dutta, A. Khorramrouz, S. Dutta, A. R. KhudaBukhsh, Down the toxicity rabbit hole: A framework to bias audit large language models with key emphasis on racism, antisemitism, and misogyny, in: IJCAI, 2024, pp. 7242–7250.
[108] A. Ramezani, Y. Xu, Knowledge of cultural moral norms in large language models, in: ACL, 2023.
[109] M. Abdulhai, G. Serapio-García, C. Crepy, D. Valter, J. Canny, N. Jaques, Moral foundations of large language models, in: EMNLP, 2024.
[110] H. Duan, J. Wei, C. Wang, H. Liu, Y. Fang, S. Zhang, D. Lin, K. Chen, BotChat: Evaluating LLMs’ capabilities of having multi-turn dialogues, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the ACL: NAACL 2024, 2024.
[111] G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, W. Ouyang, MT-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues, in: ACL, 2024.
[112] X. Zhang, Y. Chen, S. Hu, Z. Xu, J. Chen, M. Hao, X. Han, Z. Thai, S. Wang, Z. Liu, M. Sun, ∞Bench: Extending long context evaluation beyond 100K tokens, in: ACL, 2024.
[113] C. An, S. Gong, M. Zhong, X. Zhao, M. Li, J. Zhang, L. Kong, X. Qiu, L-eval: Instituting standardized evaluation for long context language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), ACL, 2024.
[114] T. Li, G. Zhang, Q. D. Do, X. Yue, W. Chen, Long-context LLMs struggle with long in-context learning, Transactions on Machine Learning Research (2025).
[115] F. Ye, M. Yang, J. Pang, L. Wang, D. F. Wong, E. Yilmaz, S. Shi, Z. Tu, Benchmarking llms via uncertainty quantification, in: A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, C. Zhang (Eds.), NeurIPS, 2024.
[116] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, J. Steinhardt, Aligning AI with shared human values, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, OpenReview.net, 2021.
[117] D. Muhlgay, O. Ram, I. Magar, Y. Levine, N. Ratner, Y. Belinkov, O. Abend, K. Leyton-Brown, A. Shashua, Y. Shoham, Generating benchmarks for factuality evaluation of language models, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the ACL, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024, ACL, 2024, pp. 49–66.