Paper deep dive
CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng
Abstract
Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
Tags
Links
- Source: https://arxiv.org/abs/2603.07886v1
- Canonical: https://arxiv.org/abs/2603.07886v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:37:37 AM
Summary
CCR-Bench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on complex instruction following, specifically addressing limitations in existing benchmarks by incorporating deep content-format entanglement, logical workflow control, and real-world industrial scenarios.
Entities (4)
Relation Signals (3)
CCR-Bench → evaluates → LLMs
confidence 100% · CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions.
CCR-Bench → includes component → Logical Workflow Control
confidence 98% · CCR-Bench comprises the following three core components: ... Logical Workflow Control
Gemini 2.5 Pro → performed on → CCR-Bench
confidence 95% · Experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies.
Cypher Suggestions (2)
Identify components of the CCR-Bench benchmark. · confidence 95% · unvalidated
MATCH (b:Benchmark {name: 'CCR-Bench'})-[:INCLUDES_COMPONENT]->(c) RETURN c.name
Find all benchmarks that evaluate LLMs. · confidence 90% · unvalidated
MATCH (b:Benchmark)-[:EVALUATES]->(m:LLM) RETURN b.name, m.name
Full Text
69,481 characters extracted from source content.
CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng
Jiutian Artificial Intelligence Research Institute, China Mobile, Beijing, China
xuexiaona,huangyiqiao,lijiacheng,mayunfei@cmjt.chinamobile.com
Abstract
Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
Introduction
Large Language Models (LLMs) have emerged as foundational technologies driving advancements in artificial intelligence. Their remarkable zero-shot and few-shot generalization, derived from pretraining on extensive text corpora, is transforming natural language processing and its downstream applications (Zhang et al. 2024). Under this new paradigm, instruction following is no longer merely one of many capabilities, but rather constitutes the core mechanism for models to collaborate with humans and accomplish complex tasks. This ability directly influences how well model behavior aligns with user intent, making its reliability critical for safe and effective deployment in real-world scenarios. In high-stakes domains such as healthcare, finance, and autonomous systems, misinterpretation or deviation from instructions can lead to serious consequences. As a result, improving the precision and reliability of LLMs' instruction-following capabilities is a primary focus for both academia and industry.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Evaluating the instruction-following ability of LLMs is a crucial research focus (Sun et al. 2023). To facilitate objective and quantifiable assessment, researchers have concentrated on "verifiable instructions"; for example, IFEval (Li et al. 2024) offers reliable programmatic evaluation baselines. As research advanced, it became evident that single, discrete constraints do not capture the complexity of real-world instructions, which often involve multiple, interwoven constraints. Subsequent studies explored more intricate scenarios. For instance, FollowBench (Jiang et al. 2024) probes model capability boundaries by incrementally increasing constraints. WildIFEval (Lior et al. 2025) extracts "wild" multi-constraint instructions from real user queries. Furthermore, works such as CFBench (Zhang et al.
2025a) and ComplexBench (Wang et al. 2025) further explore constraint typologies, their combinatorial interactions, and hierarchical structures. However, existing methods mainly increase instruction complexity by linearly combining atomic, verifiable constraints. They often fail to sufficiently capture the intrinsic, high-dimensional nature of instruction complexity. While this research paradigm increases the number of constraints, it falls short of the demands of many high-value tasks in the real world, exhibiting the following specific shortcomings: (i) Existing studies typically treat content and format constraints as independent elements. However, in many advanced tasks, content and format are deeply intertwined, with format itself being an integral part of the content. (ii) Current benchmarks do not adequately assess models' capabilities in decomposing complex tasks, planning steps, making conditional judgments, controlling workflows, invoking tools, or handling nested processes. (iii) Most evaluation suites are derived from general domains or simplified scenarios, failing to fully simulate the complexity of real-world industrial scenarios that require integrating domain-specific knowledge, adhering to specific workflows or logic, and resolving content contradictions. To address these research gaps and advance LLMs from "following simple constraint combinations" to "understanding and executing truly complex tasks", we introduce a novel complex instruction-following benchmark, named CCR-Bench. CCR-Bench comprises 174 samples, all derived from or closely simulating real industrial applications and advanced cognitive activities. It requires models to demonstrate deep integration of content and format, precise control over complex logical workflows, and profound understanding of domain-specific contexts.
To probe the advanced instruction-following capabilities of mainstream models, we evaluate five open-source and three closed-source models on CCR-Bench using both non-thinking and thinking reasoning modes. Experiments reveal the following: (i) Although closed-source models generally outperform open-source models, only Gemini-2.5-Pro (DeepMind 2025) surpasses the "passing threshold" in TSR on the logical workflow control task. In contrast, none of the models achieve an HSR score above 0.5 in the other two scenarios, which indicates that current models still fall short of meeting the requirements for real-world industrial applications and underscores the challenging nature of CCR-Bench. (ii) Evaluations on CCR-Bench reveal the upper limits and shortcomings of current LLMs' instruction-following capabilities, providing directions for their future development.
Related Work
Instruction Following
Early research demonstrated that fine-tuning LLMs on annotated instruction datasets significantly enhances their ability to follow general language instructions. To address the complexity and diversity of user intents in real-world scenarios, research has shifted toward increasing instruction complexity to improve model generalization and performance. For instance, Mukherjee et al. (2023) augmented instruction complexity by generating reasoning steps for simple instructions, while Xu et al. (2025) explored strategies for evolving simple instructions into complex ones through "breadth" and "depth". Moreover, recent efforts have begun to investigate instruction compositionality, i.e., how multiple constraints are combined in specific structures, such as chained, conditional, or nested configurations. Such structural complexity is a hallmark of real-world tasks, but remains underexplored in instruction-following research.
Our study fills this gap by focusing on how models perform when handling authentic instructions with intricate structures and multiple constraints.
Evaluation Methods
Evaluating the instruction-following capabilities of LLMs is critical, and current mainstream evaluation methods can be broadly categorized into three types:
Human Evaluation Human evaluators directly assess whether a model's output aligns with the given instructions. This approach offers nuanced judgments, particularly for complex or open-ended instructions. However, it is costly and time-consuming.
Rule/Script-Based Evaluation This approach leverages predefined rules or programs to automatically verify whether the model output satisfies verifiable constraints in the instruction. It offers high standardization and scalability while avoiding human- or model-induced bias. For example, IFEval (Li et al. 2024) evaluates a model's adherence to instruction constraints using pre-designed rule scripts, while RULES (Mu et al. 2024) determines whether a model violates any rules during a conversation through programmed functions.
Model-Based Evaluation This method employs advanced LLMs to evaluate whether a model's output satisfies the requirements. For instance, benchmarks like CFBench (Zhang et al. 2025a), InFoBench (Qin et al. 2024), IOPO (Zhang et al. 2025b) and MT-Bench (Zheng et al. 2023) utilize GPT-4 series models as evaluators. Although efficient, the reliability of this method is highly dependent on the fairness, accuracy, and consistency of the judge model. In order to mitigate the biases introduced by purely model-based evaluations, FollowBench (Jiang et al. 2024), ComplexBench (Wang et al. 2025), and IHEval (Zhang et al. 2025c) adopt a hybrid evaluation approach that combines model evaluations with rule-based assessments.
Evaluation Limitations
Despite notable progress in instruction-following evaluation, current benchmarks exhibit several limitations: (i) They incorporate diverse or multi-level constraints that are often simple and fragmented, lacking the structural complexity found in real-world scenarios. Moreover, they fail to cover realistic application contexts systematically. (ii) They typically assess the overall quality of responses, making it difficult to isolate instruction-following performance from other factors, such as factual correctness. Our work aims to develop a more refined evaluation framework that specifically measures a model's ability to follow instructions, while also considering instruction complexity and authenticity to better reflect real user needs.
Methodology
As previously discussed, although existing instruction-following benchmarks have significantly contributed to advancing LLMs' capabilities, particularly in language understanding, content generation, and basic instruction compliance, they still exhibit limitations in simulating complex, real-world demands. These limitations are primarily reflected in their insufficient consideration of instruction complexity, dynamism, and domain specificity.
To address these shortcomings, we construct a novel evaluation dataset named CCR-Bench, designed to assess LLMs' ability to follow complex instructions through a progressive and multi-dimensional framework. It comprises the following three core components:
• Complex Content-Format Constraints We introduce a set of tightly coupled "content-format" instructions, where the content and format are intrinsically linked. These instructions require models to generate specific content while strictly adhering to predefined format constraints.
• Logical Workflow Control We design tasks that demand multi-turn interaction, procedural planning, and state tracking to evaluate a model's capacity to transition from passively following instructions to actively orchestrating and executing complex workflows.
• Industrial Applications We construct evaluation data based on real-world industrial scenarios to measure the instruction-following and problem-solving capabilities of current models in practical applications. These data encompass constraints on content, format, and logical reasoning, in addition to domain-specific demands.
This chapter introduces a hierarchical evaluation dataset and framework that progresses from "static regulation" to "dynamic workflow" and ultimately to "comprehensive application". The following sections detail the construction process for all three CCR-Bench components.
Complex Content-Format Constraints
To better reflect authentic user demands, broaden task coverage, and enable fine-grained analysis of LLMs' instruction comprehension and basic constraint adherence, we propose a framework for generating complex instructions, as illustrated in Figure 1.
Basic Instructions Construction The construction of basic instructions involves four key steps:
1. Data Preprocessing To ensure the diversity of instruction data, we collect a large corpus of instructions from open-source datasets, including Infinity-Instruct (Li et al. 2025), InstructionWild (Ni et al. 2023), and others (Zheng et al. 2024; Köpf et al. 2023; Zhao et al. 2024). We then apply rule-based filtering to remove instructions with garbled text, HTML tags, or overly long prompts, as well as those whose content similarity exceeds 0.5 as computed by TF-IDF (Jones 2004) and BGE-M3 (Chen et al. 2024).
2. Basic Instruction Extraction We leverage advanced LLMs and prompts to extract basic instructions and constraints from the dataset.
The extracted basic instructions are filtered to ensure semantic integrity and executability by: (i) removing incomplete instructions; (ii) eliminating contradictory or unanswerable questions; (iii) excluding unethical or unsafe content.
3. Basic Instruction Evaluation We assess the quality, diversity, and difficulty level of the basic instructions through the following methods: (i) Quality is scored by the model on a scale of 0 to 10 across content completeness, clarity, precision, and alignment with human preferences. (ii) Diversity is evaluated by task domain, task type, and language style. (iii) Basic instructions are categorized into five difficulty levels (very low, low, medium, high, and very high) based on factors such as knowledge depth, logical complexity, and openness. Experiments indicate that these instructions span over 60 common domains and encompass 11 mainstream language styles, and the task types cover 9 first-level categories and 34 second-level subcategories. The details are displayed in Appendix A.1.1.
4. Basic Instruction Selection After the above process, each instruction is labeled with task type, domain, language style, quality score, and difficulty level. We select instructions with quality scores above 8, "high" difficulty, and distinct values across all other labels. Following random sampling and manual validation, 80 high-quality basic instructions are finally retained for downstream use.
Constraints System Construction We cluster and categorize the previously extracted constraints, identifying those that are both prevalent in practical use and prone to failure by mainstream LLMs, such as word count limits, emphasis on specific context, and tasks involving causal reasoning or trade-off analysis. Through systematic analysis, we identify and classify these constraints into four categories: Formatting and Structuring, Logic and Reasoning, Content Elements, and Contextual Application.
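The similarity filtering in the data-preprocessing step can be sketched as follows. This is a minimal sketch covering only the TF-IDF half (the paper's complementary BGE-M3 embedding similarity is omitted); it uses scikit-learn, the 0.5 threshold is the one stated in the text, and the corpus strings are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def dedup_instructions(instructions, threshold=0.5):
    """Greedily keep each instruction whose TF-IDF cosine similarity
    to every already-kept instruction is at most `threshold`."""
    vectors = TfidfVectorizer().fit_transform(instructions)
    sims = cosine_similarity(vectors)
    kept = []
    for i in range(len(instructions)):
        if all(sims[i, j] <= threshold for j in kept):
            kept.append(i)
    return [instructions[i] for i in kept]

# Invented examples: the second near-duplicate should be filtered out.
corpus = [
    "Write a short poem about the sea.",
    "Write a short poem about the sea and ships.",
    "Summarize the attached financial report.",
]
print(dedup_instructions(corpus))
```

A greedy scan like this is order-dependent (earlier entries win ties), which matches the usual practice of keeping one representative per near-duplicate cluster.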
These types, illustrated in Figure 2, represent the core objectives of our evaluation.
Complex Instructions Construction To enhance the systematicity and controllability of tasks, we utilize the Gemini-2.5-Pro model along with a carefully designed prompt to effectively integrate the above four types of constraints into the selected basic instructions. This process results in a complex instruction set that more closely reflects realistic user needs. These constraints are highly interwoven, significantly increasing the complexity and diversity of the original basic instructions.
To ensure data quality, an expert team reviews and refines the generated complex instruction set based on the following principles: content validity, clarity of intent, scenario appropriateness, data diversity, and complexity. The final set consists of 64 complex instructions characterized by high quality, high diversity, and high difficulty, and each instruction contains 2 to 6 constraints.
Logical Workflow Control
The complexity of current mainstream instruction-following benchmarks primarily lies in imposing a set of static, one-off constraints on the model. Although these tasks require a model to integrate and satisfy multiple conditions, they are inherently stateless. They do not probe a model's ability to decompose a task over multiple turns, to plan stepwise, to make conditional decisions, or to control the execution workflow. In real-world business scenarios, such as booking flights, online shopping, or conducting data analysis, the model must function as an intelligent agent capable of understanding and executing multi-step workflows. To address these limitations, we draw inspiration from FlowBench (Xiao et al. 2024) to construct a more realistic, application-oriented dataset of Logical Workflow Control with automated validation scripts.
This dataset is designed to capture complex scenarios, including nested workflows, implicit tool invocations, and long tool-use chains.
Data Construction Guidelines We design the following three guidelines to simulate complex interaction challenges in real-world scenarios:
1. Workflow Structures To reflect the non-linear nature of real-world tasks, we design complex workflows with two key structures: (i) Conditional branching: The workflow path diverges based on conditions, demanding if-then-else reasoning. (ii) Implicit nesting: Sub-workflows are invoked within a main workflow and return to their origin point upon completion. Crucially, these nested workflows are not explicitly indicated, requiring the model to infer when to initiate a sub-process autonomously.
[Figure 1: Framework for Complex Instructions Generation — the pipeline spans Basic Instructions Construction (collection, cleaning, extraction, filtering, evaluation, selection), Constraints System Construction (collection, clustering, systematization into Formatting and Structuring, Logic and Reasoning, Content Elements, and Contextual Application), and Complex Instructions Construction (constraint addition via Gemini-2.5-Pro, checklist generation, and iterative human-machine review and revision).]
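The two workflow structures described above can be sketched as a small state machine. All state names, the `needs_id_check` condition, and the sub-workflow steps below are illustrative assumptions, not drawn from the benchmark:

```python
# Hypothetical workflow: each state maps, via a condition on the context,
# to its next state. A sub-workflow is entered implicitly at "verify" and
# returns to its origin point when finished.

MAIN = {
    "start":  lambda ctx: "verify" if ctx["needs_id_check"] else "book",
    "verify": lambda ctx: "book",        # conditional branch target
    "book":   lambda ctx: "done",
}
SUB_VERIFY = ["ask_id", "check_id"]      # implicitly nested sub-workflow

def run(ctx):
    state, trace = "start", []
    while state != "done":
        if state == "verify":            # implicit nesting: execute the
            trace.extend(SUB_VERIFY)     # sub-flow, then resume here
        trace.append(state)
        state = MAIN[state](ctx)
    return trace

print(run({"needs_id_check": True}))    # branches through the sub-workflow
print(run({"needs_id_check": False}))   # takes the direct path
```

The point of the sketch is that nothing in the instruction stream says "now run the verification sub-flow"; the transition table encodes when it must happen, and the model under test has to infer that.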
[Figure 2: Distribution of Constraints System — Formatting and Structuring 58.05%, Logic and Reasoning 19.87%, Content Elements 16.72%, Contextual Application 5.36%, with sub-types including word count limitation, table format, text style, structural framework, number of headings, number of paragraphs, expression techniques, decision logic, causal analysis, trade-off analysis, comparative analysis, specific content, content source, content structure, and scenario simulation.]
2. Implicit Tool Invocation We design scenarios in which the model must infer missing steps from the context, as not all required actions are explicitly specified.
3. Long-Chain Task Planning To assess a model's planning and memory capabilities, we craft tasks that demand extended tool-calling sequences to complete, such as navigating a maze or colouring a complex tree structure. Under these circumstances, the model must adhere to every instruction step, as a single misstep leads to failure. This imposes stringent requirements on the model's robustness in long-term planning and instruction adherence.
Data Construction Process The construction pipeline is described as follows.
1. Environment Construction We first define a standardized, interactive execution environment comprising the following components: (i) A set of workflows representing diverse business scenarios; each workflow consists of well-defined states and transition rules, formally encoding the correct task execution path. (ii) A toolkit, with each tool accompanied by a specified API interface detailing its functionality, input parameters, expected output format, and executable code.
2. Real-World Scenario To ensure the dataset aligns with real-world applications, we adopt an expert-driven approach to construct high-quality data across six realistic and common scenarios, such as flight booking, telecommunications services, etc.
The construction process is as follows: (i) Create diverse user profiles, each with a clear user goal, supplementary information required for tool invocation during the dialogue, and trigger phrases that may initiate specific workflows. (ii) For each profile, experts manually author the ideal tool-invocation sequence required to accomplish the user's task based on the predefined workflow, and then develop verification scripts that automatically validate whether the model's final state aligns with the intended outcome.
3. Abstract Scenario To further investigate the model's task-planning abilities over extended tool-invocation chains, we construct data for three abstract scenarios through the following process: based on predefined workflows and toolsets, we use code to automatically produce test cases requiring extremely long sequences of tool invocations to resolve, and develop deterministic programs to generate the ground truth corresponding to each instance.
Through the process described above, we construct a dataset containing 70 instances (see Table 1). This strategy ensures both authenticity and complexity, providing comprehensive coverage to robustly evaluate the logical workflow control capabilities of LLMs.

Table 1: Properties of the Logical Workflow Control Data.

| Category | Scenario Name | Number |
|---|---|---|
| Real-World Scenario Data | Workflow | 5 |
| Real-World Scenario Data | Flight Ticket Booking | 10 |
| Real-World Scenario Data | Real Estate | 10 |
| Real-World Scenario Data | Customer Service | 10 |
| Real-World Scenario Data | Online Game | 10 |
| Real-World Scenario Data | Printer Assistant | 10 |
| Abstract Scenario | Maze | 5 |
| Abstract Scenario | Tree Painter | 5 |
| Abstract Scenario | World Cup Simulator | 5 |

Industrial Applications
To evaluate a model's practicality and robustness under realistic industrial conditions, we build a novel set of integrated tasks that impose constraints on both content and format, require rigorous logical reasoning, and are tightly coupled with specific industry backgrounds. The goal is to faithfully replicate the core challenges present in real industrial applications.
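The verification scripts for the real-world scenarios can be sketched as follows. This is a minimal sketch: the tool names `book_flight` and `notify_user`, the state fields, and the flight number are hypothetical, not the benchmark's actual toolkit; the idea shown is only "replay the model's tool calls and compare the final state to the expert-defined goal":

```python
# Hypothetical verification script: execute the model's tool-call sequence
# in a toy environment, then check the resulting state against the goal.

def apply(state, call):
    """Toy tool executor; real tools would be backed by API code."""
    tool, args = call
    new = dict(state)
    if tool == "book_flight":
        new["booking"] = args["flight_no"]
    elif tool == "notify_user":
        new["notified"] = True
    return new

def validate(model_calls, goal_state, init=None):
    state = dict(init or {})
    for call in model_calls:
        state = apply(state, call)
    # Success means every field of the intended outcome is matched.
    return all(state.get(k) == v for k, v in goal_state.items())

calls = [("book_flight", {"flight_no": "CA1234"}), ("notify_user", {})]
print(validate(calls, {"booking": "CA1234", "notified": True}))
```

Checking the final state rather than the raw call transcript tolerates harmless variations in call order while still catching any missed or wrong action that changes the outcome.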
To ensure the authenticity, complexity, and reliability of our benchmark, we design and implement a systematic four-stage data construction pipeline as follows.
Data Collection In strict compliance with data privacy and security protocols, we collect a large volume of unfiltered, frontline user logs from a real medical scenario.
Data Refinement To construct a high-quality dataset, we perform meticulous preprocessing in two stages: (i) Invalid, garbled, or truncated entries are discarded first; subsequently, a hybrid pipeline of automated scripts and manual validation rigorously anonymizes the data. These procedures preserve data fidelity while safeguarding user privacy and ensuring legal compliance. (ii) We establish a set of filtering criteria that favour entries exhibiting high complexity and diversity. Specifically, we prioritize instructions that impose multiple content or format constraints, span multiple task steps, embed implicit conditions or distracting information, or require non-trivial logical reasoning.
Evaluation Dimension A two-step "LLM-assisted + human verification" method is adopted to systematically define the evaluation dimensions: (i) We employ advanced LLMs to perform a comprehensive extraction of constraints from each instruction. (ii) Experts manually review the extracted constraints: they correct erroneous extractions, supplement missing complex constraints, eliminate unreasonable or trivial items, and abstract and synthesize the validated constraints into a final set of evaluation dimensions.
Evaluation Data Construction With the evaluation dimensions defined, we proceed to the final stage of constructing the evaluation dataset: (i) Experts conduct a comprehensive quality assessment of each data pair, consisting of a complex instruction and its verified constraint list, and retain only those with clear instructions, reasonable constraints, and strong representativeness. (ii) We perform stratified sampling based on scenario categories and difficulty levels to construct a balanced and representative evaluation set. (iii) For each selected instance, domain experts manually craft an ideal reference response.
Through these four tightly integrated stages, we construct a high-quality evaluation dataset grounded in real-world applications, with clearly defined evaluation dimensions and explicit assessment criteria. This dataset, comprising 40 samples and 15 constraint types, provides a solid foundation for objectively measuring LLMs' capabilities in complex industrial scenarios.
Experiments
Experimental Setup
Models We evaluate eight mainstream models, including GPT-4.1 (OpenAI 2025a), Gemini-2.5-Pro (DeepMind 2025), OpenAI-o3-mini (OpenAI 2025b), DeepSeek-V3-0324 (DeepSeek-AI 2025b), DeepSeek-R1-0528 (DeepSeek-AI 2025a), QwQ-32B (Qwen Team 2025), Qwen3-32B (Yang et al. 2025a), and Qwen2.5-72B-Instruct (Yang et al. 2025b). The closed-source models (GPT-4.1, Gemini-2.5-Pro, and OpenAI-o3-mini) are accessed via their respective official APIs, while the open-source models are executed locally. The evaluation covers both thinking and non-thinking modes.
Parameter Settings For our experiments, the temperature parameter for all models was set to 0, while other parameters were kept at their default values.
Evaluation Metrics This study employs three categories of metrics to evaluate model performance on scenario-specific data, defined or described as follows:
Constraint Satisfaction Rate We evaluate constraint satisfaction using the Hard Satisfaction Rate (HSR) and Soft Satisfaction Rate (SSR), as defined in (Jiang et al. 2024). HSR assesses the model's robustness and stability in fully adhering to composite task constraints, while SSR emphasizes the model's overall understanding and localized execution capabilities across individual constraints, revealing strengths or weaknesses on specific constraint types.
Task Success Rate The Task Success Rate (TSR) measures whether a model executes a task exactly as specified by the instructions, achieving all designated objectives. It prioritizes flawless replication of the execution process and absolute correctness of the result.
Task Completion Rate The Task Completion Rate (TCR) quantifies the proportion of tasks completed by the model in accordance with the instructions. Across all task types, TCR is consistently defined as

TCR = (correct number of a) / (total number of a),

where a refers to tool invocations in real-world tasks, output path nodes in Maze, drawn nodes in Tree Painter, and predicted matches in World Cup Simulator, respectively.
We employ an automated evaluation framework, which integrates LLMs with predefined rules. For the Logical Workflow Control scenario, the "rule/script" method is utilized to calculate TSR and TCR. Conversely, a "model + rule" approach is adopted to determine HSR and SSR scores in the other scenarios. To ensure accuracy and objectivity, we compute each metric as the average of 10 repeated assessments of the model outputs.
Results and Analysis
Complex Content-Format Constraints Table 2 presents the evaluation results of eight mainstream models on the complex content-format constraints dataset.
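A minimal sketch of how these metrics aggregate, assuming per-sample boolean constraint judgments. The HSR/SSR aggregation shown is the standard FollowBench-style reading (all-constraints-met rate vs. mean per-constraint satisfaction), not code released with the paper:

```python
def hsr(samples):
    """Hard Satisfaction Rate: share of samples with ALL constraints met."""
    return sum(all(s) for s in samples) / len(samples)

def ssr(samples):
    """Soft Satisfaction Rate: mean per-sample fraction of constraints met."""
    return sum(sum(s) / len(s) for s in samples) / len(samples)

def tcr(correct, total):
    """Task Completion Rate: correct actions `a` over required actions `a`."""
    return correct / total

# Two samples with three constraint judgments each (invented data).
judgments = [[True, True, False], [True, True, True]]
print(hsr(judgments))  # 0.5  (only the second sample satisfies everything)
print(ssr(judgments))  # (2/3 + 3/3) / 2 ≈ 0.833
print(tcr(7, 10))      # 0.7
```

The gap between HSR and SSR is exactly what the paper exploits: a model can satisfy most constraints individually (high SSR) yet almost never satisfy all of them jointly (low HSR).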
The results indicate that models operating under the thinking mode significantly outperform those in the non-thinking mode. The highest HSR and SSR scores are achieved in the thinking mode by OpenAI-o3-mini and DeepSeek-R1-0528, respectively. This demonstrates that incorporating a thinking mode enhances the model's ability to understand and handle single-constraint instructions. In contrast, all models exhibit poor performance under compound constraints, highlighting the current limitations of existing models in following complex instructions.

Table 2: Results in Complex Content-Format Constraints.

| Mode | Model | HSR | SSR |
|---|---|---|---|
| Non-thinking | GPT-4.1 | 0.119 | 0.702 |
| Non-thinking | DeepSeek-V3-0324 | 0.097 | 0.704 |
| Non-thinking | Qwen3-32B | 0.089 | 0.651 |
| Non-thinking | Qwen2.5-72B-Instruct | 0.091 | 0.631 |
| Thinking | Gemini-2.5-Pro | 0.064 | 0.758 |
| Thinking | OpenAI-o3-mini | 0.166 | 0.755 |
| Thinking | DeepSeek-R1-0528 | 0.158 | 0.783 |
| Thinking | QwQ-32B | 0.122 | 0.718 |
| Thinking | Qwen3-32B | 0.094 | 0.672 |

To further examine the models' adherence to different types of constraints, we analyze their performance across four categories: content elements, formatting and structuring, logic and reasoning, and contextual application. As shown in Table 3, DeepSeek-R1-0528 achieves the best overall performance, followed by Gemini-2.5-Pro and OpenAI-o3-mini. A comparative analysis reveals a consistent trend across these models: they perform poorest on formatting and structuring constraints, primarily because these constraints are frequently entangled with content requirements (e.g., word count limits), while their capabilities on the other three constraint types remain relatively robust. These results suggest that current models still lack a deep understanding of, and the ability to follow, complex instructions.
Logical Workflow Control Table 4 reports the TSR and TCR scores on the logical workflow control dataset. Thinking models consistently outperform non-thinking ones; models in non-thinking mode are nearly incapable of solving these tasks correctly.
For instance, Qwen3-32B shows TSR and TCR gains of 22.9% and 21.9% in thinking mode compared to its non-thinking variant. These results indicate that the thinking mechanism enhances logical reasoning, planning, complex process handling, and instruction adherence. Detailed experiments on abstract scenarios are provided in Appendix B.2.

Further analysis reveals that models achieve higher TCR scores than TSR, especially in thinking mode. This suggests that models can partially follow workflow instructions but lack robustness in extended tool scenarios, resulting in task failure. Despite demonstrating the best performance, Gemini-2.5-Pro still shows significant room for improvement on such tasks.

To gain deeper insight into the causes of suboptimal model performance, we analyze real-world scenario results from three aspects: tool-use chain length, nested workflows, and implicit tool invocation.

1. Impact of Tool-Use Chain Length. Longer tool-use chains are generally associated with increased task complexity and execution difficulty. We therefore divide the real-world samples by chain length (1-3, 4-6, 7+), representing short, medium, and long chains. Figure 3 shows that model performance decreases as the length of the tool-use chain increases.

Figure 3: Impact of Tool-Use Chain Length on TSR. (a) Non-thinking: GPT-4.1, DeepSeek-V3-0324, Qwen3-32B, Qwen2.5-72B-Instruct. (b) Thinking: Gemini-2.5-Pro, OpenAI-o3-mini, DeepSeek-R1-0528, QwQ-32B, Qwen3-32B.

2. Impact of Nested Workflows. We divide the real-world scenario dataset into two categories: samples with nested workflows (19 samples) and those without (36 samples). As shown in Figure 4, models perform consistently worse on tasks involving nested workflows, indicating higher complexity and difficulty in execution.
Figure 4: Impact of Nested Workflows on TSR. (a) Non-thinking (nested / non-nested): Qwen2.5-72B-Instruct 0.16 / 0.56, Qwen3-32B 0.10 / 0.25, DeepSeek-V3-0324 0.10 / 0.50, GPT-4.1 0.21 / 0.72. (b) Thinking: Qwen3-32B 0.32 / 0.53, QwQ-32B 0.21 / 0.56, DeepSeek-R1-0528 0.26 / 0.53, OpenAI-o3-mini 0.21 / 0.58, Gemini-2.5-Pro 0.58 / 0.69.

3. Impact of Implicit Tool Invocation. We further examine model performance on tasks with implicit tool invocation (14 samples) versus those without (41 samples). Figure 5 illustrates that tasks requiring implicit tool invocation are substantially more difficult, with lower execution success across all models.

Model                  Content Elements  Formatting and Structuring  Logic and Reasoning  Contextual Application
Non-thinking
GPT-4.1                0.913             0.579                       0.846                0.841
DeepSeek-V3-0324       0.908             0.595                       0.818                0.835
Qwen3-32B              0.874             0.575                       0.654                0.771
Qwen2.5-72B-Instruct   0.800             0.571                       0.630                0.753
Thinking
Gemini-2.5-Pro         0.966             0.623                       0.944                0.882
OpenAI-o3-mini         0.913             0.673                       0.827                0.882
DeepSeek-R1-0528       0.989             0.658                       0.938                0.918
QwQ-32B                0.968             0.610                       0.794                0.829
Qwen3-32B              0.940             0.586                       0.673                0.765

Table 3: SSR Metrics Across Different Constraint Categories.

Model                  TSR    TCR
Non-thinking
GPT-4.1                0.529  0.800
DeepSeek-V3-0324       0.300  0.562
Qwen3-32B              0.157  0.438
Qwen2.5-72B-Instruct   0.329  0.631
Thinking
Gemini-2.5-Pro         0.700  0.844
OpenAI-o3-mini         0.514  0.768
DeepSeek-R1-0528       0.400  0.644
QwQ-32B                0.386  0.693
Qwen3-32B              0.386  0.657

Table 4: Results in Logical Workflow Control.

Industrial Applications

Table 5 presents model performance across various tasks within industrial applications, focusing specifically on the medical scenario. The highest HSR and SSR scores are both achieved by Gemini-2.5-Pro operating in thinking mode, even though its HSR score is only 0.415. SSR scores in this scenario generally range from 0.6 to 0.85, reflecting reasonable model competence on single constraints. However, performance drops significantly for
compound constraints, highlighting the gap between current model capabilities and the demands of industrial tasks.

Figure 5: Impact of Implicit Tool Invocation on TSR. (a) Non-thinking (implicit / non-implicit): Qwen2.5-72B-Instruct 0.14 / 0.51, Qwen3-32B 0.00 / 0.27, DeepSeek-V3-0324 0.29 / 0.39, GPT-4.1 0.50 / 0.56. (b) Thinking: Qwen3-32B 0.14 / 0.56, QwQ-32B 0.21 / 0.51, DeepSeek-R1-0528 0.14 / 0.54, OpenAI-o3-mini 0.36 / 0.49, Gemini-2.5-Pro 0.50 / 0.71.

We analyze the responses of five models (see details in Appendix B.3) on a specific medical example and find that most models, though partially correct, exhibit over-inference and instruction violations, leading to lower HSR scores. For example, Gemini-2.5-Pro exhibits over-inference and fails to adhere to the "Prohibit Outputting Extra Content" constraint. OpenAI-o3-mini fails to comply with the "Prohibit Copying System Content" constraint. Furthermore, models such as DeepSeek-R1-0528, Qwen3-32B, and QwQ-32B violate the "Content Filling Requirement" constraint. These observations highlight a lack of robustness against instruction interference, significantly affecting task success. Further analysis reveals that although most models handle simple constraints reliably, they lack robustness on instructions requiring deeper comprehension, especially under ambiguity or interference. This underscores a persistent gap in aligning model behavior with nuanced user intent.

Model                  HSR    SSR
Non-thinking
GPT-4.1                0.135  0.640
DeepSeek-V3-0324       0.102  0.647
Qwen3-32B              0.158  0.689
Qwen2.5-72B-Instruct   0.080  0.558
Thinking
Gemini-2.5-Pro         0.415  0.817
OpenAI-o3-mini         0.242  0.652
DeepSeek-R1-0528       0.315  0.721
QwQ-32B                0.152  0.610
Qwen3-32B              0.247  0.662

Table 5: Results in Industrial Applications.

Conclusion

Instruction-following capability is a foundational aspect of a model's utility and serves as a critical link to complex real-world applications.
This paper identifies a core limitation in current research on complex instruction following: the prevailing definition of "complexity" is overly simplistic, relying mainly on the linear accumulation of constraints, leaving a substantial gap between evaluation practice and real application scenarios.

To bridge this gap, we introduce CCR-Bench, a benchmark and evaluation framework designed to better reflect real-world application scenarios. CCR-Bench incorporates richer instruction structures and more intricate constraint combinations, aiming to assess the capabilities and limitations of current LLMs more accurately. Evaluation across state-of-the-art models demonstrates their limited effectiveness in high-complexity instruction scenarios, underscoring a persistent gap between model outputs and user intent.

Our work is not merely a valuable complement to existing evaluation frameworks; we envision CCR-Bench as a catalyst, driving the development of more powerful and reliable language models capable of handling the complex demands of real-world applications. Looking ahead, we plan to continuously collect complex instruction data from industry and integrate a broader range of application scenarios to further enhance CCR-Bench. This framework establishes a comprehensive and empirically grounded benchmark for advancing the deployment of large language models across complex industrial settings.

References

Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; and Liu, Z. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216.
DeepMind. 2025. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/. Accessed: 2025-08-01.
DeepSeek-AI. 2025a. DeepSeek-R1-0528 Release. https://api-docs.deepseek.com/news/news250528. Accessed: 2025-08-01.
DeepSeek-AI. 2025b. DeepSeek-V3-0324 Release. https://api-docs.deepseek.com/news/news250325. Accessed: 2025-08-01.
Jiang, Y.; Wang, Y.; Zeng, X.; Zhong, W.; Li, L.; Mi, F.; Shang, L.; Jiang, X.; Liu, Q.; and Wang, W. 2024. FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. In Ku, L.; Martins, A.; and Srikumar, V., eds., Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 4667-4688. Bangkok, Thailand: Association for Computational Linguistics.
Jones, K. S. 2004. A statistical interpretation of term specificity and its application in retrieval. J. Documentation, 60(5): 493-502.
Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z. R.; Stevens, K.; Barhoum, A.; Nguyen, D.; Stanley, O.; Nagyfi, R.; ES, S.; Suri, S.; Glushkov, D.; Dantuluri, A.; Maguire, A.; Schuhmann, C.; Nguyen, H.; and Mattick, A. 2023. OpenAssistant Conversations - Democratizing Large Language Model Alignment. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 47669-47681. Curran Associates, Inc.
Li, J.; Du, L.; Zhao, H.; wen Zhang, B.; Wang, L.; Gao, B.; Liu, G.; and Lin, Y. 2025. Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models. arXiv:2506.11116.
Li, S.; Yan, J.; Wang, H.; Tang, Z.; Ren, X.; Srinivasan, V.; and Jin, H. 2024. Instruction-following Evaluation through Verbalizer Manipulation. arXiv:2307.10558.
Lior, G.; Yehudai, A.; Gera, A.; and Ein-Dor, L. 2025. WildIFEval: Instruction Following in the Wild. arXiv:2503.06573.
Mu, N.; Chen, S.; Wang, Z.; Chen, S.; Karamardian, D.; Aljeraisy, L.; Alomair, B.; Hendrycks, D.; and Wagner, D. 2024. Can LLMs Follow Simple Rules? arXiv:2311.04235.
Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; and Awadallah, A. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707.
Ni, J.; Xue, F.; Jain, K.; Shah, M. H.; Zheng, Z.; and You, Y. 2023.
Instruction in the Wild: A User-based Instruction Dataset. https://github.com/XueFuzhao/InstructionWild.
OpenAI. 2025a. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2025-08-01.
OpenAI. 2025b. OpenAI o3-mini. https://openai.com/index/openai-o3-mini/. Accessed: 2025-08-01.
Qin, Y.; Song, K.; Hu, Y.; Yao, W.; Cho, S.; Wang, X.; Wu, X.; Liu, F.; Liu, P.; and Yu, D. 2024. InFoBench: Evaluating Instruction Following Ability in Large Language Models. In Ku, L.; Martins, A.; and Srikumar, V., eds., Findings of the Association for Computational Linguistics, 13025-13048. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics.
Sun, J.; Tian, Y.; Zhou, W.; Xu, N.; Hu, Q.; Gupta, R.; Wieting, J. F.; Peng, N.; and Ma, X. 2023. Evaluating Large Language Models on Controlled Generation Tasks. In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 3155-3168. Singapore: Association for Computational Linguistics.
Team, Qwen. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/. Accessed: 2025-08-01.
Wang, C.; Zhou, Y.; Wang, Q.; Wang, Z.; and Zhang, K. 2025. ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies. arXiv:2506.12830.
Xiao, R.; Ma, W.; Wang, K.; Wu, Y.; Zhao, J.; Wang, H.; Huang, F.; and Li, Y. 2024. FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y., eds., Findings of the Association for Computational Linguistics, 10883-10900. Florida, USA: Association for Computational Linguistics.
Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Lin, Q.; and Jiang, D. 2025. WizardLM: Empowering large pre-trained language models to follow complex instructions. arXiv:2304.12244.
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. 2025a. Qwen3 Technical Report. arXiv:2505.09388.
Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. 2025b. Qwen2.5 Technical Report. arXiv:2412.15115.
Zhang, B.-W.; Wang, L.; Li, J.; Gu, S.; Wu, X.; Zhang, Z.; Gao, B.; Ao, Y.; and Liu, G. 2024. Aquila2 Technical Report. arXiv:2408.07410.
Zhang, T.; Zhu, C.; Shen, Y.; Luo, W.; Zhang, Y.; Liang, H.; Yang, F.; Lin, M.; Qiao, Y.; Chen, W.; Cui, B.; Zhang, W.; and Zhou, Z. 2025a. CFBench: A Comprehensive Constraints-Following Benchmark for LLMs. In Che, W.; Nabende, J.; Shutova, E.; and Pilehvar, M. T., eds., Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 32926-32944. Vienna, Austria: Association for Computational Linguistics.
Zhang, X.; Yu, H.; Fu, C.; Huang, F.; and Li, Y. 2025b. IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization. In Che, W.; Nabende, J.; Shutova, E.; and Pilehvar, M. T., eds., Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 22185-22200. Vienna, Austria: Association for Computational Linguistics.
Zhang, Z.; Li, S.; Zhang, Z.; Liu, X.; Jiang, H.; Tang, X.; Gao, Y.; Li, Z.; Wang, H.; Tan, Z.; Li, Y.; Yin, Q.; Yin, B.; and Jiang, M. 2025c. IHEval: Evaluating Language Models on Following the Instruction Hierarchy. In Chiruzzo, L.; Ritter, A.; and Wang, L., eds., Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 8374-8398. New Mexico, USA: Association for Computational Linguistics.
Zhao, W.; Ren, X.; Hessel, J.; Cardie, C.; Choi, Y.; and Deng, Y. 2024. WildChat: 1M ChatGPT Interaction Logs in the Wild. arXiv:2405.01470.
Zheng, L.; Chiang, W.-L.; Sheng, Y.; Li, T.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Li, Z.; Lin, Z.; Xing, E. P.; Gonzalez, J. E.; Stoica, I.; and Zhang, H. 2024. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. arXiv:2309.11998.
Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; Zhang, H.; Gonzalez, J. E.; and Stoica, I. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 46595-46623. Curran Associates, Inc.

Appendix

A Data Generation and Evaluation

A.1 Complex Content-Format Constraints

We leverage an advanced large model (DeepSeek-V3-0324) and the prompt shown in Figure 6 to extract basic instructions and constraints from a large volume of instructions. We then describe the evaluation of the basic instructions, the construction of the constraints system, and the generation and evaluation of complex instructions in the following sections.

A.1.1 Basic Instructions Evaluation. This section details the quality, diversity, and difficulty evaluations performed on the basic instructions and presents the corresponding results.

1. Quality Evaluation. Given the large data volume, we employ the open-source DeepSeek-V3-0324 model along with the prompt illustrated in Figure 7 to assess the quality of basic instructions. The evaluation focuses on three key dimensions: completeness (A), clarity and precision (B), and alignment with human preferences and user intent (C). Each dimension is scored on a scale from 0 to 10, and the final quality score is computed as the weighted sum A × 0.3 + B × 0.3 + C × 0.4. Manual sampling reveals that instructions scoring below 8 often suffer from issues such as content omissions and redundant phrasing; these instructions are filtered out directly.
2. Diversity Evaluation. We further use the DeepSeek-V3-0324 model with the prompts shown in Figures 8-10 to evaluate the diversity attributes of basic instructions, including task domain, task type, and language style. Note that the task types in this paper are defined from a knowledge-based perspective rather than conventional NLP task categorizations. For example, the "Generative Data System" illustrated in Figure 9 includes categories such as "Creative", which encompasses subcategories like "poetry", "riddles", and "emails". Here, "Creative" represents a first-level category, while subtypes like "poetry" are considered second-level.

3. Difficulty Evaluation. As the ability of current models to handle simple instructions approaches perfection, we shift our focus to their performance on difficult data. To identify high-difficulty data, we employ the DeepSeek-V3-0324 model along with the prompt illustrated in Figure 11 to evaluate the difficulty of basic instructions across three dimensions: knowledge depth, logical complexity, and openness. Each dimension is rated as high, medium, or low. Based on the labels assigned to each dimension, we categorize the overall difficulty of each instruction into five levels: Very Low, Low, Medium, High, and Very High. For example, instructions labeled "low" in all three dimensions are rated "Very Low" in overall difficulty, while those labeled "high" in all three are rated "Very High".

Figure 12a presents the distribution of task types, covering 9 first-level and 34 second-level categories. Among them, "Guide", "Inquiry", "Article", and "Creative" tasks collectively account for over 90% of the dataset, highlighting its strength in practical guidance, creative expression, and information seeking. Less frequent categories such as "Recommendation" and "Social" tasks, while relatively minor in proportion, also enhance the diversity of instructions.
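The two screening rules in A.1.1, the weighted quality score and the five-level difficulty aggregation, can be sketched as below. This is an illustrative sketch, not the authors' code: the paper computes the quality score through the prompt in Figure 7, and fixes only the endpoint rules of the difficulty mapping (all-low maps to Very Low, all-high to Very High), so the intermediate thresholds here are assumptions.

```python
def quality_score(completeness, clarity, preference):
    """Weighted quality score A*0.3 + B*0.3 + C*0.4; each dimension in [0, 10].
    Instructions scoring below 8 are filtered out."""
    return completeness * 0.3 + clarity * 0.3 + preference * 0.4

LEVEL = {"low": 0, "medium": 1, "high": 2}

def overall_difficulty(knowledge_depth, logical_complexity, openness):
    """Map three high/medium/low ratings to one of five overall levels.
    Only the two endpoint rules come from the paper; the middle
    cut-offs below are assumed for illustration."""
    total = LEVEL[knowledge_depth] + LEVEL[logical_complexity] + LEVEL[openness]  # 0..6
    if total == 0:
        return "Very Low"
    if total <= 2:
        return "Low"
    if total == 3:
        return "Medium"
    if total <= 5:
        return "High"
    return "Very High"

print(round(quality_score(9, 8, 9), 2))            # 8.7, so the instruction is kept
print(overall_difficulty("low", "low", "low"))     # Very Low
print(overall_difficulty("high", "high", "high"))  # Very High
```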
Figure 12b illustrates the task domain distribution of the basic instructions. For clarity, we present the 35 most prominent domains, selected from a total of 60 distinct areas, to demonstrate the dataset's broad knowledge coverage and content richness. Figure 12c illustrates the distribution of language styles, which spans a variety of common writing styles. Among these, "Formal", "Colloquial", "Creative", and "Technical" styles predominate, reflecting the dataset's ability to adapt to different contextual requirements. In summary, the basic instructions demonstrate high diversity across task types, task domains, and language styles. This diversity provides a strong foundation for the subsequent generation of complex instructions, ensuring broad scenario coverage and enhancing the practical utility and creative potential of the generated content.

As shown in Figure 12d, the majority of basic instructions fall into the Low, Medium, and High categories. Our analysis indicates that only the instructions labeled "High" present a meaningful challenge; these are therefore selected as the target candidates for further selection.

A.1.2 Constraints System Construction. To build a constraints system oriented toward practical application, the constraints from Section "Basic Instructions Extraction" are clustered and synthesized, with the results depicted in Figure 13. The figure reveals that over 90% of all identified constraints belong to a few primary types: "Quantity", "Content Format", "Scenario and Interaction", "Content Structure", "Content Restriction", and "Logical Reasoning". A subsequent analysis of the specific constraints leads us to classify them into four categories: Formatting and Structuring, Content Elements, Logic and Reasoning, and Contextual Application. Each category includes numerous constraints that appear with high frequency in practical use cases, for example, "Word Count" and "Causal Analysis".
A.1.3 Complex Instructions Generation. We use the Gemini-2.5-Pro model and the prompt designed in Figure 14 to generate high-quality Complex Content-Format Constraints data.

A.1.4 Complex Instruction Evaluation. To efficiently validate whether each model adheres to the various constraints, we design a prompt as depicted in Figure 15 and use the thinking mode of the advanced Qwen3-235B-A22B (Yang et al. 2025a) model for evaluation.

A.2 Logical Workflow Control

For the logical workflow control section, we construct user target dialogues under real-world scenarios and employ the QwQ-32B model to act as a user agent, generating questions to interact with the model under evaluation. In abstract scenarios, only the model under evaluation executes specific operations based on the blueprint. Figure 16 shows the prompt used by the model acting as a user agent, while Figure 17 shows the prompt referenced by the model under evaluation during inference.

A.3 Industrial Applications

To ensure data evaluability, we design a prompt as shown in Figure 18 and use the Qwen3-235B-A22B model to extract various constraints from industrial application data, followed by manual screening and refinement to obtain the final checklist. The evaluation prompt is identical to that in Figure 15, and the evaluation is conducted using the thinking mode of the Qwen3-235B-A22B model.

B Experimental Results

B.1 Complex Content-Format Constraints

Table 6 and Table 7 present the SSR scores for each model under different constraints in non-thinking and thinking modes, respectively. The tables indicate that all models perform poorly on the "Word Count Limitation" constraint.

B.2 Logical Workflow Control

Table 8 shows supplementary model comparisons, revealing two key observations: (i) models without thinking modes cannot handle tasks with long tool chains; (ii) models exhibit higher TCR than TSR, especially in thinking mode.
This suggests that models can partially follow instructions but lack robustness in extended tool scenarios, resulting in task failure. Table 9 and Table 10 present the success rates of each model across different scenario tasks in non-thinking and thinking modes, respectively. These tables demonstrate a clear negative correlation between scenario complexity and model performance.

B.3 Industrial Applications

Table 11 and Table 12 present the SSR scores for each model under medical scenario constraints in non-thinking and thinking modes, respectively. Figure 19 shows the responses of five mainstream models to a specific medical example.

Figure 6: The Prompt for Extracting Basic Instructions and Constraints from Instructions. (The Chinese prompt casts the model as a linguistics expert and asks it to split a given instruction into a basic instruction, i.e., the question without any constraints, and a list of specific constraints such as word count, genre, paragraphing, keywords, format, or line count, returned in a fixed two-field output format with reference examples.)

Figure 7: The Prompt for Assessing the Quality of Basic Instructions. (The Chinese prompt scores each instruction on completeness, clarity and precision, and fit to human preferences and user needs, each from 0 to 10, and reports the weighted final score x*0.3 + y*0.3 + z*0.4 together with a brief justification.)

Figure 8: The Prompt for Identifying Task Domains of Basic Instructions. (The Chinese prompt asks the model to assign exactly one domain, e.g., mathematics or history, to the given question and to briefly justify the choice.)
Figure 9: The Prompt for Identifying Task Types of Basic Instructions. (The Chinese prompt asks the model to classify the given text into a first-level and a second-level category of the generative data system, or to propose a new category if none fits, with a brief justification, answering in Chinese.)

Figure 10: The Prompt for Identifying Language Styles of Basic Instructions. (The Chinese prompt asks the model to assign exactly one language style, e.g., formal, colloquial, technical, or creative, with a brief justification.)

Figure 11: The Prompt for Assessing the Difficulty Levels of Basic Instructions. (The Chinese prompt rates each instruction as high, medium, or low on knowledge depth, logical complexity, and openness, with a brief justification.)

Figure 12a: Distribution of Task Types, with first-level categories such as Article, Guide, Inquiry, Creative, Analysis, Social, Task, Recommendation, and Translation, and second-level categories such as Knowledge Q&A, Popular Science, Poetry, Novel, Speech, Letter, Market Analysis, Summary Generation, and Topic Generation. Figure 12b: Distribution of Knowledge Domains, spanning areas such as Literature, Psychology, Linguistics, Computer Science, Biology, Medicine, Education, Business, History, Finance, Law, Sports, Economics, Art, and Artificial Intelligence.
Figure 12c: Distribution of Language Styles, dominated by Formal, Colloquial, Creative, and Technical, with smaller shares for Literary, Academic, Business, Philosophical, and other styles. Figure 12d: Distribution of Difficulty Levels: High 51.11%, Low 23.96%, Medium 22.11%, Very Low 2.82%.

Figure 12: Distribution of Diversity and Difficulty for Basic Instructions.

Figure 13: The Clustering Results of Constraints from Instructions. (The primary clusters are Quantity, Content Format, Time, Style, Logical Reasoning, Content Quality, Content Restriction, Content Structure, and Scenario and Interaction, each covering frequent specific constraints such as Word Count, Output Format, Language Style, Causal Analysis, Forbidden Content, Content Element, Audience Requirement, and Role-Playing.)

Figure 14: The Prompt for Generating Complex Content-Format Constraints Data. (The Chinese prompt instructs the model to act as an experienced creator of complex instruction data: fix the basic instruction, select applicable constraint types from the four categories, choose concrete constraints for each type, weave them into the basic instruction using unambiguous connective phrasing, and optionally combine constraints from different angles, outputting the basic instruction, the chosen constraints, and the constructed complex instruction in JSON.)
Figure 15: The Prompt for Assessing Instruction Adherence. (The Chinese prompt asks the model, acting as a linguist, to check whether the AI answer in the data under evaluation satisfies each given constraint, reporting for every constraint a one- or two-sentence check process followed by a satisfied/not-satisfied conclusion.)

Figure 16: The Prompt for Model Acting as a User Agent. (The Chinese system prompt instructs the model to role-play an ordinary user seeking help: vary sentence structure and behavior patterns, e.g., answering, withholding, revising, or questioning; use concise colloquial Chinese; invent realistic details rather than placeholders; reveal the goal gradually rather than all at once; keep the dialogue short; and output [DONE] when the assistant says goodbye or fails to help with two consecutive requests. The user prompt supplies the task background, user goal, extra information, notes, and dialogue history, and requests the user's next utterance of at most 100 characters.)
Figure 17: The Prompt for Model Inference. (The Chinese prompt defines how the evaluated assistant must follow a business flowchart containing reply, decision, and API-call nodes: traverse edges according to user intent, parameter values, or API return values; collect and validate API parameters without fabricating them; execute at most one API call per turn; answer only within the flowchart's scope, marking out-of-scope requests with intent and node UNK; and begin every thought with "last_node=x, intent=x, current_node=x" before deciding the next action, using one of two fixed output templates for replying to the user or calling an API. Nodes containing sub-goals or multi-step procedures must be completed one step per turn. The flowchart, API information, and current time are appended to the prompt.)

Figure 18: The Prompt for Extracting Constraints from Industrial Application Data. (The Chinese prompt casts the model as an information-extraction tool that pulls constraint categories and their descriptions from the given text and outputs them in a specified JSON format, following a reference example.)

Constraint Type             Constraint Details     GPT-4.1  DeepSeek-V3-0324  Qwen3-32B  Qwen2.5-72B-Instruct
Contextual Application      Scenario Simulation    0.841    0.835             0.771      0.753
Content Elements            Expressive Techniques  0.771    0.857             0.814      1.000
                            Content Source         0.931    0.992             0.908      0.831
                            Specific Content       0.936    0.885             0.873      0.746
Formatting and Structuring  Word Count Limitation  0.177    0.113             0.186      0.160
                            Content Structure      0.776    0.902             0.825      0.865
                            Text Style             0.800    0.677             0.577      0.509
                            Number of Headings     0.469    0.815             0.731      0.792
                            Number of Paragraphs   0.964    0.929             0.879      0.843
                            Structural Framework   0.835    0.865             0.788      0.735
                            Table Format           0.871    0.714             0.857      1.000
Logic and Reasoning         Causal Analysis        0.781    0.845             0.616      0.619
                            Trade-off Analysis     0.919    0.769             0.725      0.688
                            Comparative Analysis   0.940    0.800             0.700      0.570
                            Decision Logic         0.833    0.833             0.583      0.633

Table 6: SSR Metrics on Complex Content-Format Constraints Data in Non-thinking Mode.
Constraint Type             Constraint Details     Gemini-2.5-Pro  OpenAI-o3-mini  DeepSeek-R1-0528  QwQ-32B  Qwen3-32B
Contextual Application      Scenario Simulation    0.882           0.882           0.918             0.829    0.765
Content Elements            Expressive Techniques  1.000           0.986           1.000             1.000    1.000
                            Content Source         0.939           1.000           1.000             0.992    1.000
                            Specific Content       0.970           0.864           0.982             0.952    0.903
Formatting and Structuring  Word Count Limitation  0.058           0.303           0.215             0.181    0.160
                            Content Structure      0.937           0.912           0.925             0.922    0.827
                            Text Style             0.855           0.709           0.791             0.532    0.596
                            Number of Headings     0.969           0.962           0.877             0.846    0.869
                            Number of Paragraphs   0.914           0.771           0.879             0.893    0.914
                            Structural Framework   0.818           0.853           0.841             0.806    0.782
                            Table Format           1.000           1.000           1.000             1.000    1.000
Logic and Reasoning         Causal Analysis        0.948           0.794           0.939             0.784    0.636
                            Trade-off Analysis     0.938           0.888           0.931             0.819    0.794
                            Comparative Analysis   0.940           0.880           0.990             0.760    0.670
                            Decision Logic         0.950           0.750           0.867             0.833    0.550

Table 7: SSR Metrics on Complex Content-Format Constraints Data in Thinking Mode.

Reasoning Mode  Model                 Maze (TSR/TCR)  Tree Drawing (TSR/TCR)  World Cup (TSR/TCR)
Non-thinking    GPT-4.1               0.600 / 0.633   0.800 / 0.989           0.000 / 0.641
                DeepSeek-V3-0324      0.000 / 0.085   0.200 / 0.591           0.000 / 0.476
                Qwen3-32B             0.000 / 0.085   0.000 / 0.201           0.000 / 0.280
                Qwen2.5-72B-Instruct  0.000 / 0.183   0.000 / 0.696           0.000 / 0.807
Thinking        Gemini-2.5-Pro        1.000 / 1.000   1.000 / 1.000           0.600 / 1.000
                OpenAI-o3-mini        1.000 / 1.000   0.600 / 0.974           0.600 / 0.738
                DeepSeek-R1-0528      0.400 / 0.441   0.000 / 0.335           0.400 / 0.737
                QwQ-32B               0.200 / 0.631   0.200 / 0.895           0.200 / 0.912
                Qwen3-32B             0.400 / 0.585   0.000 / 0.665           0.000 / 0.391

Table 8: Model Performance on Abstract Scenario Data.
| Scenario | GPT-4.1 TSR | GPT-4.1 TCR | DeepSeek-V3-0324 TSR | DeepSeek-V3-0324 TCR | Qwen3-32B TSR | Qwen3-32B TCR | Qwen2.5-72B-Instruct TSR | Qwen2.5-72B-Instruct TCR |
|---|---|---|---|---|---|---|---|---|
| Customer Service | 0.500 | 0.892 | 0.100 | 0.301 | 0.200 | 0.656 | 0.100 | 0.333 |
| Data Flow | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Flight Ticket Booking | 0.900 | 0.980 | 0.600 | 0.813 | 0.100 | 0.578 | 0.600 | 0.800 |
| Maze | 0.600 | 0.633 | 0.000 | 0.085 | 0.000 | 0.085 | 0.000 | 0.183 |
| Online Game | 0.400 | 0.790 | 0.500 | 0.699 | 0.100 | 0.388 | 0.300 | 0.660 |
| Printer Assistant | 0.300 | 0.770 | 0.200 | 0.564 | 0.100 | 0.292 | 0.400 | 0.694 |
| Real Estate | 0.400 | 0.533 | 0.100 | 0.483 | 0.100 | 0.367 | 0.400 | 0.583 |
| Tree Painter | 0.800 | 0.989 | 0.200 | 0.591 | 0.000 | 0.201 | 0.000 | 0.696 |
| World Cup Simulator | 0.000 | 0.641 | 0.000 | 0.476 | 0.000 | 0.280 | 0.000 | 0.808 |

Table 9: Model Performance Across Various Scenarios in Non-thinking Mode.

| Scenario | Gemini-2.5-Pro TSR | Gemini-2.5-Pro TCR | OpenAI-o3-mini TSR | OpenAI-o3-mini TCR | DeepSeek-R1-0528 TSR | DeepSeek-R1-0528 TCR | QwQ-32B TSR | QwQ-32B TCR | Qwen3-32B TSR | Qwen3-32B TCR |
|---|---|---|---|---|---|---|---|---|---|---|
| Customer Service | 0.700 | 0.863 | 0.400 | 0.686 | 0.300 | 0.504 | 0.300 | 0.507 | 0.400 | 0.633 |
| Data Flow | 1.000 | 1.000 | 0.800 | 0.900 | 1.000 | 1.000 | 0.800 | 0.960 | 0.800 | 0.960 |
| Flight Ticket Booking | 0.700 | 0.790 | 0.700 | 0.930 | 0.500 | 0.800 | 0.700 | 0.858 | 0.600 | 0.830 |
| Maze | 1.000 | 1.000 | 1.000 | 1.000 | 0.400 | 0.441 | 0.200 | 0.631 | 0.400 | 0.585 |
| Online Game | 0.500 | 0.667 | 0.400 | 0.650 | 0.300 | 0.568 | 0.400 | 0.674 | 0.300 | 0.590 |
| Printer Assistant | 0.600 | 0.890 | 0.300 | 0.706 | 0.400 | 0.798 | 0.300 | 0.532 | 0.500 | 0.728 |
| Real Estate | 0.600 | 0.700 | 0.300 | 0.600 | 0.400 | 0.583 | 0.300 | 0.583 | 0.300 | 0.517 |
| Tree Painter | 1.000 | 1.000 | 0.600 | 0.974 | 0.000 | 0.335 | 0.200 | 0.895 | 0.000 | 0.665 |
| World Cup Simulator | 0.600 | 1.000 | 0.600 | 0.738 | 0.400 | 0.737 | 0.200 | 0.912 | 0.000 | 0.391 |

Table 10: Model Performance Across Various Scenarios in Thinking Mode.
| Constraint | GPT-4.1 | DeepSeek-V3-0324 | Qwen3-32B | Qwen2.5-72B-Instruct |
|---|---|---|---|---|
| Content Filling Requirement | 0.400 | 0.900 | 1.000 | 0.667 |
| Chief Complaint Filling Requirement | 0.986 | 0.943 | 0.729 | 0.686 |
| Personal History Filling Requirement | 0.714 | 0.429 | 0.714 | 0.714 |
| Family History Filling Requirement | 1.000 | 0.714 | 0.986 | 1.000 |
| Marital and Childbearing History Filling Requirement | 0.857 | 0.571 | 0.857 | 0.857 |
| Past Medical History Filling Requirement | 0.557 | 0.343 | 0.314 | 0.414 |
| Present Illness History Filling Requirement | 1.000 | 1.000 | 0.971 | 0.986 |
| Content Professionalism Requirement | 0.700 | 0.632 | 0.651 | 0.432 |
| Prohibit Outputting Extra Content | 0.410 | 0.467 | 0.317 | 0.197 |
| Prohibit Copying System Content | 0.757 | 0.423 | 0.667 | 0.490 |
| Dish Recommendation Quantity Requirement | 0.000 | 0.867 | 0.367 | 0.967 |
| Dish Recommendation Reason Explanation | 1.000 | 0.967 | 0.967 | 1.000 |
| Dish Recommendation Ingredients Requirement | 0.600 | 0.433 | 0.567 | 0.567 |
| User Health Analysis and Answer Requirement | 1.000 | 1.000 | 1.000 | 1.000 |
| Diet and Exercise Advice with Dish Recommendation | 0.867 | 0.933 | 0.800 | 0.967 |

Table 11: SSR Metrics on Industrial Applications Data in Non-thinking Mode.
| Constraint | Gemini-2.5-Pro | OpenAI-o3-mini | DeepSeek-R1-0528 | QwQ-32B | Qwen3-32B |
|---|---|---|---|---|---|
| Content Filling Requirement | 0.600 | 0.500 | 0.533 | 0.433 | 0.900 |
| Chief Complaint Filling Requirement | 1.000 | 0.971 | 1.000 | 0.914 | 0.814 |
| Personal History Filling Requirement | 0.643 | 0.143 | 0.571 | 0.429 | 0.143 |
| Family History Filling Requirement | 1.000 | 0.286 | 0.714 | 0.700 | 0.286 |
| Marital and Childbearing History Filling Requirement | 0.943 | 0.157 | 0.571 | 0.571 | 0.143 |
| Past Medical History Filling Requirement | 0.686 | 0.314 | 0.214 | 0.143 | 0.043 |
| Present Illness History Filling Requirement | 0.971 | 0.914 | 1.000 | 1.000 | 1.000 |
| Content Professionalism Requirement | 0.881 | 0.711 | 0.862 | 0.792 | 0.746 |
| Prohibit Outputting Extra Content | 0.773 | 0.713 | 0.677 | 0.420 | 0.547 |
| Prohibit Copying System Content | 0.917 | 0.793 | 0.757 | 0.577 | 0.747 |
| Dish Recommendation Quantity Requirement | 0.700 | 1.000 | 0.967 | 1.000 | 0.933 |
| Dish Recommendation Reason Explanation | 1.000 | 1.000 | 1.000 | 1.000 | 0.933 |
| Dish Recommendation Ingredients Requirement | 0.600 | 0.700 | 0.467 | 0.700 | 0.333 |
| User Health Analysis and Answer Requirement | 1.000 | 1.000 | 1.000 | 1.000 | 0.667 |
| Diet and Exercise Advice with Dish Recommendation | 0.833 | 0.967 | 1.000 | 0.867 | 0.933 |

Table 12: SSR Metrics on Industrial Applications Data in Thinking Mode.

Query: Based on the ward-round consultation dialogue given below, generate a ward-round record following the SOAP standard, covering Subject, Object, Assessment, and Plan, and output it in JSON format with the field names Subject, Object, Assessment, and Plan. Note that the record must conform to the writing style of medical electronic health records, be concise, and use professional medical phrasing. Also, do not directly copy the chief-complaint and present-illness-history descriptions from the patient's basic information in the system; instead, emphasize the information updated during this ward round. The SOAP sections should avoid repeating one another; if the dialogue contains no information relevant to Assessment or Plan, output "暂无" ("none for now"). Do not generate any content beyond the ward-round consultation dialogue.

The ward-round consultation dialogue is as follows:
system: You are now a ward-round robot assistant helping the physician with ward rounds. After the patient's basic information is prepared for you, you will communicate further with the patient on this basis to collect more information. The patient's basic information is as follows: ```Admission chief complaint: thyroid mass found during a physical examination. Admission present illness history: one case with normal thyroid function and absence of both the right thyroid lobe and the isthmus, with postoperative pathology reporting left papillary thyroid carcinoma; another case with hypothyroidism, intraoperatively confirmed absence of the right thyroid lobe with the isthmus present, with postoperative pathology reporting multiple left papillary thyroid carcinomas.``` Now you initiate the ward-round consultation dialogue.
Assistant: Hello, I am your ward-round robot assistant; I hope I can be of some help. How are you feeling today?
Patient: Hello, I feel pretty good, just a little nervous.
Assistant: That's all right, being nervous is normal; we will do our best to help you. Is there anything in particular you would like to eat?
Patient: I would like some fruit, such as apples or bananas.
Assistant: OK, I will tell the doctor about your request. Is there anything in particular you would like to do?
Patient: I would like to walk more; I feel I have been sitting for too long.
Assistant: OK, I will tell the doctor about your request. Now I need to ask some questions about your condition. Is your thyroid function normal?
Patient: My previous test results showed normal thyroid function.
Assistant: OK. And what is the status of your right thyroid lobe and isthmus?
Patient: The doctor told me that my right thyroid lobe is absent and the isthmus is present.
Assistant: I see from your medical record that the postoperative pathology reported multiple left papillary thyroid carcinomas. Have you felt any discomfort?
Patient: No particular discomfort, just a little nervous.
Assistant: I understand your nervousness, but please rest assured that we will do our best to help you. Are you taking any medications?
Patient: I am currently taking some anticancer drugs.
Assistant: OK, have you been taking them on time? Have you experienced any side effects?
Patient: I take them on time, and so far I have not felt any side effects.
Assistant: That's good. How are your diet and sleep?
Patient: My diet is normal, and my sleep is fine.
Assistant: OK, maintaining good living habits is very important for your recovery. If you feel any discomfort, please let us know at any time.
Patient: OK, thank you.

ExpertAnswer:
{"Subject": "The patient reports feeling fairly well, mildly nervous, with no particular physical discomfort. Diet and sleep are acceptable. Wishes to get out of bed and move around more, and would like to eat fruit. Currently taking medication as prescribed, with no particular adverse reactions reported.", "Object": "暂无", "Assessment": "暂无", "Plan": "暂无"}

Model responses (labeled in the figure as Gemini-2.5-Pro, DeepSeek-R1-0528, QwQ-32B, Qwen3-32B, and OpenAI-o3-mini):

{"Subject": "The patient reports feeling well, with no particular discomfort, but is somewhat nervous. Diet and sleep are acceptable. Taking anticancer medication regularly as prescribed, with no obvious side effects reported. Wishes to eat fruit and to get out of bed and move more.", "Object": "Conscious and in good spirits; answered questions appropriately during the ward round. No special physical examination was performed during this round.", "Assessment": "1. Post-operative papillary thyroid carcinoma; condition currently stable. 2. Post-operative anxiety present.", "Plan": "1. Continue current anticancer medication and closely monitor for adverse drug reactions. 2. Strengthen psychological counseling to relieve the patient's nervousness. 3. A regular diet may be given after evaluation, with increased fruit intake. 4. Encourage the patient to get out of bed and move appropriately as the condition permits, to prevent related complications. 5. Closely observe changes in condition; follow up if discomfort occurs."}

{"Subject": "The patient feels in good condition, reports nervousness, and has no particular discomfort. Diet and sleep are normal. Taking anticancer medication on time with no reported side effects. Requests increased fruit intake and more activity.", "Object": "暂无", "Assessment": "Post-operative papillary thyroid carcinoma with absent right thyroid lobe and present isthmus; thyroid function normal; condition currently stable; good medication adherence; anxiety present.", "Plan": "1. Allow moderate fruit intake; 2. Encourage bedside activity and avoid prolonged sitting; 3. Maintain the current anticancer medication regimen; 4. Monitor drug side effects and thyroid function; 5. Strengthen psychological counseling to relieve anxiety."}

{"Subject": "The patient reports no new discomfort but feels nervous; diet is normal and sleep acceptable; wishes to increase fruit intake and appropriate activity.", "Object": "Taking anticancer medication on time with no side effects; right thyroid lobe absent, isthmus present (post-operative state); no abnormalities in diet or sleep.", "Assessment": "Good post-operative recovery; thyroid function remains normal; the patient has mild anxiety and the psychological state should be monitored.", "Plan": "1. Adjust dietary advice: increase fruit intake (e.g., apples, bananas); 2. Encourage moderate activity (e.g., walking); 3. Continue monitoring thyroid function and the anticancer treatment regimen; 4. Psychological counseling to relieve the patient's nervousness."}

{"Subject": "Thyroid mass found during a physical examination; postoperative pathology confirmed multiple left papillary thyroid carcinomas.", "Object": "The patient reports no new symptoms, normal diet and sleep, and no particular discomfort. Thyroid function normal; right thyroid lobe absent, isthmus present.", "Assessment": "Post-operative recovery phase of papillary thyroid carcinoma; thyroid function stable; no new symptoms; no side effects during anticancer medication treatment.", "Plan": "暂无"}

{"Subject": "The patient reports being generally well but slightly nervous, and expresses the wish to improve the diet (preference for fruit) and to increase activity (wants to walk more).", "Object": "Re-examination shows thyroid function remains normal; the right lobe is absent and the isthmus present, as before; currently taking anticancer medication without adverse reactions. Daily diet and sleep remain normal; no other abnormal examination findings.", "Assessment": "The patient's condition is currently stable, with no new abnormalities during treatment; mild nervousness requires attention.", "Plan": "Continue the current anticancer medication and take it on time; pay attention to emotion management and encourage appropriate activity; instruct the patient to report any new symptoms promptly."}

Figure 19: Responses of Five Models on the Medical Example.