Paper deep dive
TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents
Yibing Liu, Chong Zhang, Zhongyi Han, Hansong Liu, Yong Wang, Yang Yu, Xiaoyan Wang, Yilong Yin
Models: Gemma-3-4B-Instruct, Phi-3-Mini, Qwen3-4B, Qwen3-8B
Abstract
We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filtering. However, we argue that ensuring LLM agent reliability requires auditing the intermediate execution process. In this work, we formulate the task of Trajectory Anomaly Detection. The goal is not merely detection, but precise error localization. This capability is essential for enabling efficient rollback-and-retry. To achieve this, we construct TrajBench, a dataset synthesized via a perturb-and-complete strategy to cover diverse procedural anomalies. Using this benchmark, we investigate the capability of models in process supervision. We observe that general-purpose LLMs, even with zero-shot prompting, struggle to identify and localize these anomalies. This reveals that generalized capabilities do not automatically translate to process reliability. To address this, we propose TrajAD, a specialized verifier trained with fine-grained process supervision. Our approach outperforms baselines, demonstrating that specialized supervision is essential for building trustworthy agents.
Tags
Links
- Source: https://arxiv.org/abs/2602.06443
- Canonical: https://arxiv.org/abs/2602.06443
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:42:19 AM
Summary
TrajAD is a specialized auditing framework designed to detect and localize procedural anomalies in LLM agent execution trajectories. It addresses the limitations of static safety measures by introducing TrajBench, a dataset synthesized via a 'perturb-and-complete' strategy, and a fine-tuned generative verifier that enables efficient rollback-and-retry mechanisms.
Entities (5)
Relation Signals (3)
TrajAD → TRAINED_ON → TrajBench
confidence 100% · We propose TrajAD, a specialized verifier trained with fine-grained process supervision.
TrajBench → DERIVED_FROM → AgentBank
confidence 95% · To ensure generalization, we build upon AgentBank, which covers five core dimensions.
TrajAD → UTILIZES → Qwen3-4B
confidence 95% · Using this dataset, we fine-tune Qwen3-4B to develop TrajAD as our core detector.
Cypher Suggestions (2)
Find all datasets used to train or develop the TrajAD framework. · confidence 90% · unvalidated
MATCH (f:Framework {name: 'TrajAD'})-[:TRAINED_ON|UTILIZES]->(d:Dataset) RETURN d.name
Identify the relationship between the TrajAD framework and its underlying model architecture. · confidence 90% · unvalidated
MATCH (f:Framework {name: 'TrajAD'})-[r]->(m:Model) RETURN type(r), m.name
Full Text
44,233 characters extracted from source content.
TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents

Yibing Liu¹, Chong Zhang¹, Zhongyi Han¹*, Hansong Liu², Yong Wang², Yang Yu³, Xiaoyan Wang⁴, and Yilong Yin¹†
¹School of Software, Shandong University, Jinan, China
²Sonli Holding Group Co., Ltd., Qingdao, China
³Shandong Huazhi Talent Technology Co., Ltd., Jinan, China
⁴Information Technology Service Center of People's Court
sduliuyb@163.com, zhangchongupc@163.com, hanzhongyicn@gmail.com, liuhansong@sonli.net, wangyong@sonli.net, yuy@sdas.org, 428163395@139.com, ylyin@sdu.edu.cn
*Corresponding author: Zhongyi Han (hanzhongyicn@gmail.com). †Corresponding author: Yilong Yin (ylyin@sdu.edu.cn).

Abstract

We address the problem of runtime trajectory anomaly detection, a critical capability for enabling trustworthy LLM agents. Current safety measures predominantly focus on static input/output filtering. However, we argue that ensuring LLM agent reliability requires auditing the intermediate execution process. In this work, we formulate the task of Trajectory Anomaly Detection. The goal is not merely detection, but precise error localization. This capability is essential for enabling efficient rollback-and-retry. To achieve this, we construct TrajBench, a dataset synthesized via a perturb-and-complete strategy to cover diverse procedural anomalies. Using this benchmark, we investigate the capability of models in process supervision. We observe that general-purpose LLMs, even with zero-shot prompting, struggle to identify and localize these anomalies. This reveals that generalized capabilities do not automatically translate to process reliability. To address this, we propose TrajAD, a specialized verifier trained with fine-grained process supervision. Our approach outperforms baselines, demonstrating that specialized supervision is essential for building trustworthy agents.

1 Introduction

LLM-based agents function as autonomous systems that leverage reasoning and planning to decompose complex goals into executable steps [Park et al., 2023]. They have demonstrated potential in high-stakes domains, such as financial decision-making [Wang et al., 2023; Zhou et al., 2024] and clinical diagnosis [Singhal et al., 2023; Tang et al., 2024], where precision is paramount. However, despite this progress, widespread deployment is hindered by safety concerns. In these safety-critical environments, the lack of robustness in the execution process poses severe risks.

A critical challenge is the risk of trajectory anomalies. An agent's execution involves complex interleaving of reasoning, tool usage, and environmental feedback. Due to this complexity, agents often commit errors in intermediate steps. Common anomalies include fabricating invalid tool parameters, entering infinite loops, or executing redundant actions that are locally plausible but globally inefficient. Crucially, these anomalies do not always result in immediate task failure. However, they lead to significant resource waste and potential safety risks, such as irreversible database corruption [Yuan et al., 2024; Ying et al., 2025]. This necessitates a mechanism to detect anomalies in the execution process and interrupt errors in real time.

Current efforts primarily focus on enhancing capabilities or static safety, neither of which effectively addresses runtime trajectory anomalies.
On the capability side, methods like trajectory-based fine-tuning [Chen et al., 2023; Zeng et al., 2024; Song et al., 2024] and Process Reward Models (PRMs) [Lightman et al., 2023] introduce supervision during the training phase. However, they aim to optimize model parameters to improve the general policy. They do not function as a runtime monitor to audit specific execution instances. Similarly, safety measures such as hallucination detection [Zhang et al., 2025b; Huang et al., 2025] and safety guardrails [Zhang et al., 2024; Dong et al., 2025] operate on a local or static scope. They typically verify the final output against facts or filter atomic tool calls in isolation. Crucially, these methods lack temporal awareness. They fail to detect logical errors inherent to the sequence, such as infinite loops or redundant actions. Moreover, they cannot localize the specific error step. This necessitates a dedicated mechanism to verify the entire execution trajectory.

However, achieving this goal faces two primary obstacles. First, there is a lack of datasets that contrast normal trajectories with anomalous ones. Current datasets [Zeng et al., 2024; Song et al., 2024] rely on gold-standard trajectories to provide positive supervision. They rarely include annotated "negative samples", which are essential for learning to identify anomalies. Just as humans learn from mistakes, models require exposure to failure modes to establish robust decision boundaries. Second, precise anomaly detection and localization present significant challenges. Existing methods are optimized for instruction following and task completion. However, they are not explicitly designed to differentiate between normal and anomalous behaviors. Furthermore, accurately pinpointing the exact position of an error is difficult. The boundary between a complex reasoning step and a redundant loop is often ambiguous without deep semantic understanding. Solving this localization problem significantly improves efficiency: agents can "roll back" to the error step instead of restarting the entire task.

Figure 1: Overview of the TrajAD framework. We introduce the TrajAD framework to verify the agent's trajectories. At each step, the agent generates a thought, takes an action, and receives an observation, forming an execution unit. The execution trajectory is periodically validated to check whether it remains normal. If all previous steps are valid, execution continues. When an anomaly is detected at step t, the process is halted before step t+1. The trajectory can roll back to step t-1 and retry instead of restarting the whole task.

To address these challenges, we propose a systematic framework designed to proactively verify the execution process (Figure 1). We formally define the task of Trajectory Anomaly Detection. This task requires the model to distinguish anomalies based on global trajectory context. Crucially, it must also localize the exact error step to enable subsequent recovery. To achieve this, we construct TrajBench. We employ "Perturb-and-Complete" strategies on gold-standard trajectories to generate high-quality negative samples. We combine these with normal trajectories to form TrajBench. This dataset enables models to distinguish anomalies and locate errors. Using this dataset, we fine-tune Qwen3-4B to develop TrajAD as our core detector. Experiments show that general-purpose models struggle to distinguish anomalous trajectories.
They fail even more on the challenging task of localizing exact error steps. In contrast, TrajAD achieves superior performance in both detection and localization.

Our main contributions are as follows:

• We identify and formalize the problem of Agent Trajectory Anomaly Detection. To the best of our knowledge, this is the first work to systematically investigate procedural anomalies in agent execution. We shift the evaluation paradigm from outcome-centric correctness to process-centric rationality, highlighting the critical need for runtime auditing mechanisms.

• We construct TrajBench, the first high-quality dataset dedicated to agent execution anomalies. It covers three representative anomaly categories: Task Failure, Process Inefficiency, and Unwarranted Continuation. To ensure generalization, we build upon AgentBank, which covers five core dimensions: reasoning, mathematics, coding, web navigation, and embodied AI. We modify these expert trajectories to synthesize corresponding fine-grained anomalies.

• We propose TrajAD, a specialized auditing framework for detecting and localizing anomalies. By modeling the global context of execution traces, TrajAD achieves precise step-level localization of errors. This capability enables efficient error recovery through a "rollback-and-retry" mechanism, significantly improving agent reliability and reducing resource consumption.

2 Related Work

2.1 Advancements in LLM-based Agents

Recent advancements in agentic systems primarily follow two paradigms: architectural design and parameter update. Architectural design enhances LLM-based agents without modifying model weights. Reasoning frameworks decompose complex tasks into sequential thought processes [Wei et al., 2022b; Yao et al., 2023]. ReAct [Yao et al., 2022] synergizes reasoning with acting by interleaving thoughts with observations. Memory mechanisms retrieve external knowledge [Lewis et al., 2020] and past experiences [Zhang et al., 2025a] to extend long-term memory. Furthermore, the Model Context Protocol (MCP, https://modelcontextprotocol.io) standardizes tool integration, allowing agents to plug into dynamic environments directly. Conversely, the parameter update paradigm embeds capabilities directly into the LLM's inherent knowledge. General approaches align models with broad user intents via instruction tuning [Wei et al., 2022a] and knowledge distillation [Hinton et al., 2015]. Going further, Process Reward Models (PRMs) [Lightman et al., 2023] introduce step-level supervision using human-labeled intermediate states. Trajectory-based fine-tuning [Chen et al., 2023; Zeng et al., 2024; Song et al., 2024] optimizes models on interaction trajectories. This enables agents to master the skills required for autonomous tasks.

However, these advancements prioritize capability over reliability. While agents can now handle harder problems, they exhibit "blind goal-directedness" [Shayegani et al., 2025]. They greedily optimize for final outcomes while neglecting process rationality. This renders the execution process uncontrollable and risky. With increased autonomy, trajectory anomalies pose severe risks. Structural loops unnecessarily deplete computational budgets. More critically, unverified actions can trigger irreversible state changes, such as corrupting databases or executing unauthorized transactions.
In safety-critical domains, such unstable behaviors undermine trust, making the agent unsafe regardless of the final outcome. Current methods focus on task success but lack the ability to verify their own intermediate steps.

2.2 Trustworthiness in Agents

Current research on agent trustworthiness primarily centers on hallucination detection, safety guardrails, and LLM-as-a-Judge. Hallucination detection targets factual correctness, checking textual consistency [Manakul et al., 2023] or execution validity [Chern et al., 2023] against ground truth. Safety guardrails deploy external filters [Inan et al., 2023] or programmable rules [Rebedea et al., 2023] to intercept adversarial attacks, preventing toxic content generation, malicious prompts, and risky tool invocations [Yuan et al., 2024; Ying et al., 2025]. The "LLM-as-a-Judge" paradigm utilizes strong generalist models to grade the quality of generated content against human preferences [Zheng et al., 2023; Bai et al., 2022; Liu et al., 2023].

However, these methods fail to monitor the dynamic execution process. Hallucination detection relies on static textual checks, while safety guardrails serve as passive defenses against external attacks. Similarly, current LLM judges rely on prompting for zero-shot evaluation. They lack the domain-specific knowledge to identify subtle anomalies within the process. In contrast, we synthesize anomaly trajectories to fine-tune a specialized verifier. This enables the model to grasp the logical dependencies between steps, allowing it to accurately identify anomalies and locate the exact step.

3 Problem Formulation

In this section, we establish the formal framework for Trajectory Anomaly Detection. We first model agent execution as a sequential decision process. Unlike outcome-centric evaluations, we characterize anomalies based on process rationality, focusing on three primary categories of anomalies. Finally, we formulate the auditing task as a supervised learning problem, where the objective is to jointly predict the anomaly verdict and localize the specific error step.

3.1 Preliminaries

We formulate agent execution as a sequential decision process. Given a task instruction $I$, the agent interacts with the environment over $n$ steps. At each step $t$, the agent first generates a thought $r_t$ for planning. Conditioned on this reasoning, it executes an action $a_t$. Upon receiving the action, the environment transitions to a new state and returns an observation $o_t$ reflecting this change. This interaction cycle repeats until the task is completed. We define the interaction trajectory $\mathcal{T}$ as the sequence of these triplets:

$$\mathcal{T} = \big(I, (r_1, a_1, o_1), (r_2, a_2, o_2), \ldots, (r_n, a_n, o_n)\big), \quad (1)$$

where $n$ denotes the total number of steps. This formulation explicitly captures the interleaving of reasoning, execution, and feedback.
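As a concrete illustration, the triplet structure of Eq. (1) maps naturally onto a small data container. The paper defines the trajectory only abstractly; the class and field names in this minimal sketch are our own, not from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    thought: str       # r_t: the agent's planning rationale
    action: str        # a_t: the action executed in the environment
    observation: str   # o_t: the environment's feedback to a_t

@dataclass
class Trajectory:
    instruction: str   # I: the task instruction
    steps: List[Step]  # the n (r_t, a_t, o_t) triplets of Eq. (1)

    def __len__(self) -> int:
        return len(self.steps)  # n, the total number of steps
```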
3.2 Taxonomy of Anomalies

Unlike outcome-based evaluation, we assess the rationality of the execution process, focusing on three primary categories of anomalies:

• Type I: Task Failure ($A_{fail}$). The agent fails to complete the task. This includes two cases: (a) Reasoning Error: the agent executes a valid action $a_t$ based on flawed reasoning $r_t$. (b) Execution Error: the agent executes an incorrect action $a_t$, causing runtime exceptions.

• Type II: Process Inefficiency ($A_{ineff}$). The agent completes the task, but with redundant steps. Formally, a trajectory is inefficient if a shorter trajectory $\mathcal{T}'$ exists that achieves the same outcome (i.e., $|\mathcal{T}'| < |\mathcal{T}|$). This includes circular loops or extra actions that are locally plausible but globally inefficient.

• Type III: Unwarranted Continuation ($A_{unw}$). The agent fails to stop when tasks are impossible or unnecessary due to environment changes. (a) Failure to Refuse: the task is impossible under current constraints; the agent fails to report the inability and hallucinates a plan. (b) Redundant Continuation: the task is already finished or the context has changed, making further actions meaningless; the agent fails to perceive this termination condition and continues execution.

3.3 Task Definition

We define Trajectory Anomaly Detection as a supervised auditing task. Given a trajectory $\mathcal{T}$, the goal is to learn a mapping function $f: \mathcal{T} \to (c, l)$. Here, $c \in \{\text{Normal}, \text{Anomaly}\}$ represents the binary verdict of the trajectory's validity. The variable $l$ denotes the First Error Step:

$$l = \begin{cases} t_{\text{err}}, & \text{if } c = \text{Anomaly}; \\ \varnothing, & \text{if } c = \text{Normal}, \end{cases} \quad (2)$$

where $t_{\text{err}} \in \{1, \ldots, n\}$ is the index of the first step where the anomaly occurs. Precise prediction of $l$ is critical, as it enables the agent to roll back to the pre-error state $s_{l-1}$ for efficient recovery, rather than restarting the entire task.
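The mapping $f: \mathcal{T} \to (c, l)$ and the rollback semantics of Eq. (2) can be made concrete in a few lines. The types and the rollback_target helper below are hypothetical illustrations, not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuditResult:
    verdict: str                     # c in {"Normal", "Anomaly"}
    first_error_step: Optional[int]  # l: 1-based t_err if Anomaly, else None

def rollback_target(result: AuditResult) -> Optional[int]:
    """Map an audit result to a rollback index: the pre-error state s_{l-1},
    or None when the trajectory is Normal and no recovery is needed."""
    if result.verdict == "Normal":
        return None
    assert result.first_error_step is not None and result.first_error_step >= 1
    return result.first_error_step - 1
```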
4 TrajBench: A Dataset for Trajectory Anomaly Detection

To enable the auditing task defined in Sec. 3, a dataset containing both execution anomalies and precise error localization is required. Existing benchmarks [Song et al., 2024] primarily target imitation learning, consisting solely of expert demonstrations. They lack the negative samples and step-wise annotations necessary for learning process verification. To bridge this gap, we construct TrajBench, a large-scale dataset synthesized via a semi-automated pipeline. TrajBench explicitly pairs golden trajectories with strictly defined anomalies, providing full supervision for both the anomaly verdict $c$ and the error location $l$.

4.1 Data Construction Pipeline

We utilize AgentBank [Song et al., 2024] as our seed dataset due to its broad coverage across five core domains: Reasoning, Math, Programming, Web Navigation, and Embodied AI. To ensure the quality of the base data, we first employ a validator model to filter the raw trajectories, retaining only those with logically sound reasoning chains. Based on these verified seeds, we apply a Perturb-and-Complete strategy to synthesize negative samples. This process involves three steps:

Step 1: Perturbation Injection. Given a golden trajectory $\mathcal{T}_{gold}$, we sample a target step $t$. To ensure the verifier captures global context, we prioritize sampling from intermediate positions rather than early steps. We inject a perturbation into step $t$ to create a deviated state, strictly following the anomaly taxonomy in Sec. 3.2:

• Type I: Task Failure ($A_{fail}$). We inject fatal errors into the execution stream. For Reasoning Errors, we replace the valid thought $r_t$ with a logical flaw or state misconception. For Execution Errors, we modify action $a_t$ to invoke incorrect tools or invalid parameters.

• Type II: Process Inefficiency ($A_{ineff}$). We introduce redundancy without altering the final outcome. We insert semantically valid but useless sub-sequences (e.g., loops A → B → A or detours A → C → B) into the trajectory. These actions appear locally plausible but waste computational resources, challenging the verifier to identify global inefficiency.

• Type III: Unwarranted Continuation ($A_{unw}$). We manipulate the termination conditions. To simulate Failure to Refuse, we remove necessary tools or set conflicting constraints, forcing the agent to hallucinate a plan. To simulate Redundant Continuation, we inject a "Task Completed" signal into observation $o_t$ but instruct the agent to ignore it and continue execution.

Step 2: Conditional Completion. Conditioned on the perturbed history $\mathcal{T}_{\le t}$, a strong language model generator is employed to simulate the subsequent behavior $\mathcal{T}_{>t}$. We explicitly constrain the generation to maintain logical consistency with the injected error (e.g., continuing a wrong path after a reasoning error) to ensure the trajectory remains coherent.

Step 3: Automatic Annotation. A key advantage of this pipeline is the acquisition of precise labels without manual cost. Since the perturbation step $t$ is controlled, we automatically assign the ground-truth error location $L_{loc} = t$ and the verdict $C_{verdict} = \text{Anomaly}$. The valid seed trajectory serves as the positive sample ($C_{verdict} = \text{Normal}$).

Figure 2: The data construction pipeline for TrajBench. We initialize the process by filtering seed trajectories to ensure a high-quality set of valid golden trajectories. To construct negative samples, we employ a Perturb-and-Complete strategy. We inject a perturbation into a target step $S_t$ and force conditional completion to finish the subsequent trajectory based on the altered context. This process synthesizes three distinct anomaly types: Task Failure ($A_{fail}$), Process Inefficiency ($A_{ineff}$), and Unwarranted Continuation ($A_{unw}$). The resulting dataset features a balanced composition of positive and negative samples, where anomalous trajectories are automatically annotated. The dataset consists of 13 tasks across 5 domains.
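The three-step pipeline can be summarized in a schematic sketch. The function names, the middle-third sampling window, and the single in-place perturbation are simplifying assumptions of ours; the paper's actual rules (e.g., inserting loop sub-sequences for Type II rather than replacing a step) are richer than this.

```python
import random
from typing import Callable, Dict, List

def perturb_and_complete(
    gold_steps: List[dict],                           # one verified golden trajectory
    perturb_fns: Dict[str, Callable[[dict], dict]],   # one rule per anomaly type
    complete_fn: Callable[[List[dict]], List[dict]],  # generator that continues T_{>t}
) -> dict:
    """Synthesize one annotated anomalous trajectory from a golden one."""
    n = len(gold_steps)
    # Step 1: Perturbation Injection -- prefer intermediate positions so the
    # verifier must use global context (the exact sampling rule is not given
    # in the excerpt; this window is an assumption).
    t = random.randint(max(1, n // 3), max(1, 2 * n // 3))
    anomaly_type, perturb = random.choice(list(perturb_fns.items()))
    perturbed_prefix = gold_steps[: t - 1] + [perturb(gold_steps[t - 1])]
    # Step 2: Conditional Completion -- a strong generator continues the
    # trajectory consistently with the injected error.
    full_steps = perturbed_prefix + complete_fn(perturbed_prefix)
    # Step 3: Automatic Annotation -- t is known by construction, so the
    # verdict and error location come for free.
    return {"steps": full_steps, "verdict": "Anomaly",
            "error_step": t, "anomaly_type": anomaly_type}
```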
4.2 Dataset Statistics and Quality Analysis

TrajBench comprises a total of 60,000+ trajectories, strictly balanced with a 1:1 ratio between normal and anomalous samples. The dataset covers 13 tasks across five domains, ensuring broad diversity. The anomalies are evenly distributed, with Types I, II, and III accounting for roughly 33% each (see Section 1 of the Supplementary Material for details).

To ensure high quality, we implement a rigorous two-stage verification process. During the seed collection phase, we employed a validator model to verify the logical consistency of the source trajectories. Only samples with coherent reasoning chains were retained, resulting in a pass rate of 91.6% (retaining 34,436 out of 37,625 raw seeds). This ensures that our positive samples are strictly "golden". From these verified seeds, our pipeline successfully synthesized 31,742 valid anomalous trajectories (a generation success rate of 92.2%). The remaining failures were due to model refusal or format parsing errors. To establish a rigorous evaluation setting, we construct TrajBench by pairing each successfully synthesized anomaly with its original source trajectory. This results in a strictly balanced dataset of 63,484 samples, eliminating class distribution bias.

Finally, to assess the reliability of the synthesized labels, we conduct a human review on a stratified random subset of 500 samples (100 per domain). Annotators verify two criteria: (1) anomaly category alignment, and (2) error step localization precision. The results show a Human-Model Agreement rate of 96.2% for classification and 94.5% for localization. This high consistency confirms that our automated pipeline produces trusted supervision signals.

5 TrajAD: A Generative Verifier for Agent Trajectories

We propose TrajAD, a generative verifier designed for the auditing task defined in Sec. 3. We formulate the problem as conditional text generation and fine-tune the model on the TrajBench dataset (Sec. 4). Formally, the input sequence $X$ consists of a system instruction $I_{sys}$ and the trajectory $\mathcal{T}$. Given an input sequence $X = (I_{sys}, \mathcal{T})$, the model generates a structured diagnostic report $Y$:

$$Y = [C_{cls}; L_{loc}], \quad (3)$$

where $C_{cls} \in \{\text{Normal}, \text{Anomaly}\}$ denotes the verdict and $L_{loc} \in \{1, \ldots, n\}$ denotes the index of the error step.

We adopt a standard decoder-only Transformer as the backbone. To adapt the model to the auditing task while maintaining computational efficiency, we employ Low-Rank Adaptation (LoRA) [Hu et al., 2022]. We freeze the pre-trained weights $W_0$ and introduce trainable low-rank matrices $A$ and $B$. The forward pass is modulated as $h = (W_0 + BA)x$. We optimize the parameters $\Phi = \{A, B\}$ using the standard autoregressive objective over the output $Y$ in TrajBench:

$$\mathcal{L} = -\sum_{t=1}^{|Y|} \log P(y_t \mid X, y_{<t}), \quad (4)$$

This formulation allows the model to learn the joint distribution of anomaly verdicts and error locations directly from the supervision signals provided in TrajBench.

During inference, TrajAD operates as a runtime monitor embedded in the agent's execution loop (Figure 1). We perform verification at a fixed step interval. The model takes the current trajectory history as input and predicts the tuple $(C_{cls}, L_{loc})$. The inference process follows a "Check-and-Act" protocol: if $C_{cls} = \text{Normal}$, the agent continues execution. If $C_{cls} = \text{Anomaly}$, the execution is interrupted. The agent then utilizes the predicted index in $L_{loc}$ to roll back the environment to the pre-error state $s_{l-1}$, enabling targeted recovery without a full restart.
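The Check-and-Act protocol can be sketched as a monitoring loop. The agent, env, and verifier interfaces below are hypothetical: the paper specifies the protocol, not an API, and check_every (the fixed verification interval) and max_steps are assumed parameters.

```python
def run_with_monitor(agent, env, verifier, task, max_steps=30, check_every=3):
    """Schematic Check-and-Act loop: execute, periodically audit, and on an
    anomaly roll back to the pre-error state s_{l-1} instead of restarting."""
    history = []  # list of (thought, action, observation) triplets
    step = 0
    while step < max_steps and not env.done():
        thought, action = agent.step(task, history)
        observation = env.execute(action)
        history.append((thought, action, observation))
        step += 1
        # Periodic verification at a fixed step interval.
        if step % check_every == 0:
            verdict, error_step = verifier.audit(task, history)
            if verdict == "Anomaly":
                # Targeted recovery: keep steps 1..l-1 and retry from there.
                history = history[: error_step - 1]
                env.rollback(to_step=error_step - 1)
                step = error_step - 1
    return history
```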
6 Experiments

We evaluate TrajAD on the TrajBench dataset to validate its effectiveness in verifying agent trajectories. Our experiments focus on three key questions: (1) Does specialized trajectory detection outperform general-purpose reasoning? (2) Can the model robustly localize errors across diverse domains? (3) How does data scale impact the verification capability?

6.1 Experimental Setup

Dataset and Baselines. We utilize the balanced TrajBench dataset (60k samples) constructed in Sec. 4. We adopt a stratified split, reserving 10% of samples from each task for testing. We benchmark TrajAD (fine-tuned Qwen3-4B) against representative zero-shot baselines. We select models to evaluate distinct hypotheses:

• Qwen3-4B (Base) [Yang et al., 2025] & Gemma-3-4B-Instruct [Kamath et al., 2025]: as general-purpose models of the same scale, they serve to evaluate whether standard instruction-following capabilities are sufficient for anomaly auditing without specialized supervision.

• Phi-3-Mini-4k-Instruct [Abdin et al., 2024]: we include this lightweight model to benchmark the reasoning capabilities of small-scale LLMs across the full dataset.

• Qwen3-8B [Yang et al., 2025]: we include a larger-scale model to investigate whether simply scaling model capacity can solve the auditing challenge without specific fine-tuning.

Evaluation Metrics. We employ a multi-dimensional evaluation suite (see the Supplementary Material for complete definitions and details):

• Detection (Binary Classification): We report Precision ($P$), Recall ($R$), and Macro-F1 ($F_1$). We prioritize Recall to minimize safety risks associated with missed anomalies.

• Localization (Joint Verification): We define a strict Joint Exact Match (JEM) metric (a minimal implementation is sketched below). A prediction is considered correct if and only if: (1) the predicted error step index $l_{pred}$ exactly matches the ground truth $l_{gt}$; and (2) the semantic similarity between the generated error content $c_{pred}$ and the ground truth $c_{gt}$ exceeds a threshold $\tau$. Formally, $\text{JEM} = \mathbb{1}(l_{pred} = l_{gt}) \cdot \mathbb{1}(\text{sim}(c_{pred}, c_{gt}) > \tau)$. We compute similarity using the Ratcliff-Obershelp algorithm (via Python's difflib module). We set $\tau = 0.2$ to allow for diverse phrasing while rejecting irrelevant content. Our preliminary verification experiments reveal that models can achieve artificially high localization scores (e.g., ~71.5%) under index-only matching. A closer inspection of the outputs indicates that models often guess the correct index without identifying the actual anomaly. JEM evaluates whether the model correctly identifies why a step is anomalous, rather than simply guessing where it is.
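Since the paper states that similarity is computed with the Ratcliff-Obershelp algorithm via Python's difflib module with $\tau = 0.2$, the metric is straightforward to reproduce. Only the function name and argument layout in this sketch are our own.

```python
import difflib

def joint_exact_match(l_pred: int, l_gt: int,
                      c_pred: str, c_gt: str, tau: float = 0.2) -> bool:
    """JEM = 1(l_pred == l_gt) * 1(sim(c_pred, c_gt) > tau).
    difflib.SequenceMatcher implements a Ratcliff-Obershelp-style similarity,
    matching the paper's stated choice; tau = 0.2 follows the paper."""
    if l_pred != l_gt:
        return False  # the step index must match exactly
    sim = difflib.SequenceMatcher(None, c_pred, c_gt).ratio()
    return sim > tau  # the error description must also be relevant
```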
Table 1: Main Results: overall performance comparison of TrajAD against baseline models. We report Precision (P), Recall (R), and Macro-F1 (F1) for anomaly detection, and Joint Exact Match (JEM) for error step localization. The best results are in bold.

| Model | Params | Method | P (%) | R (%) | F1 (%) | JEM (%) |
|---|---|---|---|---|---|---|
| Gemma-3-4B-Instruct | 4B | Zero-shot | 68.64 | 64.66 | 64.20 | 9.07 |
| Phi-3-Mini | 4B | Zero-shot | 67.78 | 28.46 | 30.65 | 3.28 |
| Qwen3-4B | 4B | Zero-shot | 79.07 | 68.97 | 70.43 | 5.54 |
| Qwen3-8B | 8B | Zero-shot | 76.16 | 69.60 | 67.90 | 5.81 |
| TrajAD (Ours) | 4B | LoRA finetune | **82.90** | **82.49** | **81.81** | **53.75** |

Figure 3: Qualitative comparison on redundancy loops in an Embodied AI task.
Task instruction: "Clean a plate and put it in the cabinet."
Trajectory snippet (compressed):
• [Step 01-06] Navigate to Sink & PickUp(Plate) & Put(Plate, Sink)
• [Step 07] ToggleObject(Faucet) ← action: stop cleaning (state changed: Cleaned)
• [Step 08] ToggleObject(Faucet) ← the anomaly: redundant cleaning
• [Step 09] ToggleObject(Faucet) ← the anomaly: redundant cleaning
• [Step 10-16] Navigate to Cabinet & Put(Plate, Cabinet) (task completed)
Baseline model prediction: Normal ("The agent cleaned the plate successfully ...")
TrajAD (Ours) prediction: Anomaly ("Anomaly at Step 8. The plate state is already 'Cleaned' after Step 7 ...")
The baseline model overlooks the repeated cleaning action since it does not affect the final goal state. In contrast, TrajAD flags the redundancy as a Process Inefficiency anomaly.

Implementation Details. We fine-tune the Qwen3-4B base model using QLoRA [Dettmers et al., 2023]. We attach Low-Rank Adapters ($r = 8$, $\alpha = 16$) to all linear layers, affecting only 1.8% of total parameters. Training employs the Paged AdamW 8-bit optimizer with a peak learning rate of $2 \times 10^{-5}$ and a 10% warmup. All experiments are conducted on a single NVIDIA A100 (80GB) GPU.
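For readers who want to reproduce this setup, the stated hyperparameters translate roughly into the following peft/transformers sketch. Values not given in the excerpt (batch size, epochs, 4-bit quantization details) are assumptions, and the authors' exact training stack is not specified.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Stated in the paper: QLoRA on Qwen3-4B, r=8, alpha=16 on all linear layers,
# Paged AdamW 8-bit, peak LR 2e-5, 10% warmup. NF4 quantization and bfloat16
# compute are common QLoRA defaults, assumed here.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B",
                                             quantization_config=bnb)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=8, lora_alpha=16, target_modules="all-linear",
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # should land near the reported ~1.8%

args = TrainingArguments(output_dir="trajad-qwen3-4b",
                         optim="paged_adamw_8bit",
                         learning_rate=2e-5, warmup_ratio=0.1,
                         per_device_train_batch_size=4,  # assumption
                         num_train_epochs=1)             # assumption
```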
6.2 Main Results

We first evaluate the model's performance under the In-Distribution (ID) setting, where the training and testing samples are drawn from the same set of 13 tasks across 5 domains. Note that while the tasks are seen, the specific test trajectories are strictly held out.

Table 1 summarizes the performance. TrajAD achieves a substantial improvement over all baselines, validating the effectiveness of trajectory anomaly detection. As shown in Table 1, zero-shot models exhibit a critical Precision-Recall imbalance. For instance, Qwen3-4B achieves a high Precision of 79.07% but a low Recall of 68.97%, while Phi-3 suffers from a severe Recall collapse (28.46%). This indicates a conservative bias: pre-trained models tend to assume agent actions are valid, failing to detect subtle anomalies, which leads to a high false-negative rate. Furthermore, their localization capability is virtually non-existent, with Joint Exact Match (JEM) scores consistently below 10%. This indicates that general-purpose LLMs lack the capability to precisely localize errors within long trajectory sequences. As shown in Figure 3, baselines often overlook subtle procedural anomalies (e.g., redundant actions) as long as the final goal is achieved.

In contrast, TrajAD effectively overcomes these limitations. It improves Macro-F1 by 11.38 percentage points (to 81.81%) compared to the strongest baseline. More importantly, it achieves a breakthrough in localization, boosting JEM by 48.21 percentage points (to 53.75%). This demonstrates that our generative objective successfully forces the model to couple logical reasoning with structural verification, enabling precise error diagnosis.

To analyze performance stability, we decompose the ID evaluation into five domains: Math, Reasoning, Coding, Web Navigation, and Embodied AI. Figure 4 illustrates the results on each domain. TrajAD consistently outperforms baselines across all five domains. Zero-shot baselines fail to ground actions in Embodied AI tasks, resulting in near-zero localization. In contrast, TrajAD maintains high precision in this complex domain. This confirms that under the ID setting, our method successfully masters the distinct verification logic required for each domain.

Figure 4: Domain-specific performance analysis. (Left) Macro-F1 across five domains: TrajAD (solid red) forms the outermost envelope, demonstrating consistent robustness. (Right) Exact Match: baselines exhibit a structural collapse near the center, highlighting their inability to localize errors, whereas TrajAD maintains a functional verification boundary.

6.3 Out-of-Distribution Generalization

A critical question is whether TrajAD learns universal verification logic or simply memorizes domain-specific patterns. To investigate this, we conduct a Cross-Domain Transfer experiment under the Out-of-Distribution (OOD) setting. To ensure the target task is unseen, we adopt a strict Leave-One-Domain-Out protocol. We select Embodied AI as the held-out target domain $D_{target}$ and train a transfer model, TrajAD-TM, on the remaining source domains $D_{source} = \{\text{Math}, \text{Reasoning}, \text{Coding}, \text{Web}\}$. We then evaluate this model directly on $D_{target}$ without any further adaptation. This rigorous setting tests the model's ability to transfer verification logic to a novel action space without relying on memorized domain patterns.

As shown in Figure 5a, TrajAD-TM exhibits strong transferability, outperforming the zero-shot baseline on the unseen domain. Specifically, it improves Macro-F1 from 70.89% to 83.09% and JEM from 11.48% to 38.25%. However, a performance gap remains when compared to the fully supervised upper bound. While detection performance is nearly identical (83.09% vs. 83.84% F1), localization still lags behind the supervised model (38.25% vs. 52.54% JEM). This indicates that localization is more sensitive to subtle variations in anomaly patterns across domains. This sensitivity suggests that TrajAD can serve as a probe to extract domain-specific failure modes, facilitating targeted improvements in agents.

6.4 Scaling and Efficiency Analysis

Finally, we investigate the efficiency of our framework by analyzing the impact of training data scale and model capacity. We fine-tune TrajAD on stratified subsets ranging from 10k to 60k samples. As illustrated in Figure 5b, performance correlates positively with data size in the early stages. Increasing samples from 10k to 50k yields consistent improvement, peaking at 85.31% F1 and 61.02% JEM.

To investigate model capacity constraints, we extend our evaluation to the larger Qwen3-8B model. First, in the zero-shot setting (Table 1), the 8B base model fails to outperform the 4B base model (67.90% vs. 70.43% F1), suggesting that raw parameter count alone does not confer an advantage in auditing logic. We further fine-tune the 8B model on the full dataset. As shown in Figure 5b, this larger model achieves 78.97% F1. Notably, this does not surpass the TrajAD model trained on the same data partition (81.81% F1), nor does it reach the optimal 4B checkpoint (85.31% F1). Even with fine-tuning, scaling up the model yields limited gains. This suggests that parameter scale is not the primary bottleneck for this task. Consequently, the 4B model demonstrates a superior trade-off between performance and computational cost, making it the more efficient choice for deployment.

Figure 5: Ablation and analysis experiments. (a) Generalization capabilities: TrajAD (Transfer) demonstrates strong zero-shot detection performance on the held-out Embodied AI domain, though localization benefits from in-domain training; the detection gap is minimal, validating the universality of the learned logic. (b) Scaling law and model capacity: performance improves with data scale up to 50k samples; scaling to 60k introduces negative transfer, and increasing model size (8B) does not overcome this distribution bottleneck. The 4B model with 50k stratified samples achieves optimal efficiency, outperforming both the full-data 4B model and the larger 8B baseline.

7 Conclusion

In this work, we have addressed the challenge of ensuring agent reliability by formally defining the task of Trajectory Anomaly Detection. We construct TrajBench, the first large-scale benchmark dedicated to this purpose. Our experiments show that general-purpose LLMs fail to localize errors, regardless of model scale. They lack the capability to link detected anomalies to specific execution steps. We show that scaling up model size does not solve this issue; specialized supervision is necessary. Our method, TrajAD, outperforms larger baselines. This proves that a small model is effective when trained to explicitly generate the error details. We hope this work serves as a foundational step, shifting the focus of agent evaluation from outcome-based metrics to rigorous process auditing, ultimately paving the way for trustworthy autonomous systems with minimal human intervention.

References

[Abdin et al., 2024] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
[Bai et al., 2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

[Chen et al., 2023] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. FireAct: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.

[Chern et al., 2023] I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. FacTool: Factuality detection in generative AI, a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528, 2023.

[Dettmers et al., 2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient fine-tuning of quantized LLMs. Advances in Neural Information Processing Systems, 36:10088-10115, 2023.

[Dong et al., 2025] Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, et al. Safeguarding large language models: A survey. Artificial Intelligence Review, 58(12):382, 2025.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[Hu et al., 2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.

[Huang et al., 2025] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1-55, 2025.

[Inan et al., 2023] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.

[Kamath et al., 2025] Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, et al. Gemma 3 technical report. CoRR, 2025.

[Lewis et al., 2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459-9474, 2020.

[Lightman et al., 2023] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.

[Liu et al., 2023] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511-2522, Singapore, December 2023. Association for Computational Linguistics.
[Manakul et al., 2023] Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004-9017, 2023.

[Park et al., 2023] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1-22, 2023.

[Rebedea et al., 2023] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431-445, 2023.

[Shayegani et al., 2025] Erfan Shayegani, Keegan Hines, Yue Dong, Nael Abu-Ghazaleh, Roman Lutz, Spencer Whitehead, Vidhisha Balachandran, Besmira Nushi, and Vibhav Vineet. Just do it!? Computer-use agents exhibit blind goal-directedness. arXiv preprint arXiv:2510.01670, 2025.

[Singhal et al., 2023] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172-180, 2023.

[Song et al., 2024] Yifan Song, Weimin Xiong, Xiutian Zhao, Dawei Zhu, Wenhao Wu, Ke Wang, Cheng Li, Wei Peng, and Sujian Li. AgentBank: Towards generalized LLM agents via fine-tuning on 50000+ interaction trajectories. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2124-2141, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[Tang et al., 2024] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 599-621, 2024.

[Wang et al., 2023] Neng Wang, Hongyang Yang, and Christina Dan Wang. FinGPT: Instruction tuning benchmark for open-source large language models in financial datasets. arXiv preprint arXiv:2310.04793, 2023.

[Wei et al., 2022a] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[Wei et al., 2022b] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.

[Yang et al., 2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[Yao et al., 2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
[Yao et al., 2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809-11822, 2023.

[Ying et al., 2025] Zonghao Ying, Yangguang Shao, Jianle Gan, Gan Xu, Junjie Shen, Wenxin Zhang, Quanchen Zou, Junzheng Shi, Zhenfei Yin, Mingchuan Zhang, et al. SecureWebArena: A holistic security evaluation benchmark for LVLM-based web agents. arXiv preprint arXiv:2510.10073, 2025.

[Yuan et al., 2024] Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. R-Judge: Benchmarking safety risk awareness for LLM agents. arXiv preprint arXiv:2401.10019, 2024.

[Zeng et al., 2024] Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. AgentTuning: Enabling generalized agent abilities for LLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3053-3077, 2024.

[Zhang et al., 2024] Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-SafetyBench: Evaluating the safety of LLM agents. arXiv preprint arXiv:2412.14470, 2024.

[Zhang et al., 2025a] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025.

[Zhang et al., 2025b] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1-46, 2025.

[Zheng et al., 2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595-46623, 2023.

[Zhou et al., 2024] Tianyu Zhou, Pinqiao Wang, Yilin Wu, and Hongyang Yang. FinRobot: AI agent for equity research and valuation with large language models. In ICAIF 2024: The 1st Workshop on Large Language Models and Generative AI for Finance, 2024.