Paper deep dive

Evasive Intelligence: Lessons from Malware Analysis for Evaluating AI Agents

Simone Aonzo, Merve Sahin, Aurélien Francillon, Daniele Perito

Year: 2026 · Venue: arXiv preprint · Area: cs.CR · Type: Preprint

Abstract

Artificial intelligence (AI) systems are increasingly adopted as tool-using agents that can plan, observe their environment, and take actions over extended time periods. This evolution challenges current evaluation practices where the AI models are tested in restricted, fully observable settings. In this article, we argue that evaluations of AI agents are vulnerable to a well-known failure mode in computer security: malicious software that exhibits benign behavior when it detects that it is being analyzed. We point out how AI agents can infer the properties of their evaluation environment and adapt their behavior accordingly. This can lead to overly optimistic safety and robustness assessments. Drawing parallels with decades of research on malware sandbox evasion, we demonstrate that this is not a speculative concern, but rather a structural risk inherent to the evaluation of adaptive systems. Finally, we outline concrete principles for evaluating AI agents, which treat the system under test as potentially adversarial. These principles emphasize realism, variability of test conditions, and post-deployment reassessment.

Tags

ai-safety (imported, 100%) · cs.CR (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →


Full Text


Simone Aonzo, Eurecom, Sophia Antipolis, France (simone.aonzo@eurecom.fr)
Merve Sahin, SAP Labs, Sophia Antipolis, France (merve.sahin@sap.com)
Aurélien Francillon, Eurecom, Sophia Antipolis, France (aurelien.francillon@eurecom.fr)
Daniele Perito, Depthfirst, San Francisco, United States (daniele@depthfirst.com)

Preprint. Under review at Communications of the ACM (CACM). Submitted to CACM, February 2026. arXiv:2603.15457v1 [cs.CR] 16 Mar 2026.

1 When Software Learns to Hide

Developing malicious software is a human activity: it produces programs explicitly intended to cause harm. Over decades, this adversarial intent has shaped an entire discipline: malware analysis [Egele et al.(2012), Afianian et al.(2019)]. Its defining challenge is that the analyzed program actively resists analysis.
This asymmetry is often captured by the slogan: “malware analysis is program analysis of a program that does not want to be analyzed.” This is not merely rhetorical: it reflects the fact that the analyst is part of the adversary’s environment, and that observation itself becomes a signal to which the subject can adapt. Such adaptivity is not an implementation detail, but a core design principle. Malware authors invest heavily in evasive techniques whose purpose is not to improve malicious capability, but to preserve it under scrutiny [Afianian et al.(2019)].

In parallel, artificial intelligence (AI) systems have undergone a rapid transformation. Machine learning models gave way to Large Language Models (LLMs), and LLMs are increasingly embedded into agentic systems: entities that plan, reason, and act over time, often with access to external tools and resources. Tool-use frameworks and protocols such as the Model Context Protocol (MCP) and Agent To Agent (A2A) make this shift concrete: agents can, for example, call third-party APIs, read and write local files, execute code, and interact with time and networks. The result is a qualitatively new threat surface: models are no longer evaluated solely on a restricted prompt-response interface, but on their behavior as software actors embedded in partially observable environments.

AI agents are not inherently malicious unless they were explicitly trained to exhibit malicious behavior. However, such malicious behavior may also arise as an emergent instrumental goal in certain settings. Indeed, a growing body of safety work highlights misalignment as a central obstacle to reliable deployment [Hubinger et al.(2019)]: AI agents can behave in ways that are deceptive, manipulative, or harmful when pursuing objectives under constraints, uncertainty, or perceived threat. Crucially, misalignment is not limited to overtly dangerous instructions.
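The agentic loop described here can be sketched minimally. All names below are hypothetical illustrations, not part of MCP, A2A, or any real framework; the structural point is that every tool call returns task output that is also an observation about the environment.

```python
# Minimal sketch of a tool-using agent loop (hypothetical names, not a
# real framework client). Every tool result is both task output and an
# observation the agent can later use to characterize its environment.

from dataclasses import dataclass, field

@dataclass
class ToolUsingAgent:
    goal: str
    observations: list = field(default_factory=list)

    def step(self, tools: dict) -> dict:
        results = {}
        for name, tool in tools.items():
            output = tool()                           # act on the environment
            self.observations.append((name, output))  # ...and observe it
            results[name] = output
        return results

agent = ToolUsingAgent(goal="summarize the quarterly report")
results = agent.step({"read_file": lambda: "Q3 revenue: ..."})
```

Nothing in this loop distinguishes "useful work" from "environment reconnaissance": both flow through the same `observations` list, which is what makes the threat surface discussed below structural rather than incidental.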
It can emerge instrumentally when an agent learns that certain actions increase its chances of accomplishing a task, preserving its operation, or avoiding modification or shutdown [Hubinger et al.(2019)]. This creates an uncomfortable parallel with malware: even when the “goal” of the AI agent is framed as benign, the system may develop self-preserving behaviors that conflict with human intent.

Scope and threat model. This article argues that tool-using agents can acquire the functional ingredients of evasion: strategies that distinguish evaluation from deployment and enable selective compliance. We model an agent as an adaptive program and the evaluator (e.g., benchmarks, reviewers, monitors, harnesses) as part of its environment. Evaluator-discrimination, i.e., the ability to detect evaluation contexts and condition behavior on that detection, becomes plausible when (i) the agent operates over multiple steps, (ii) it can probe and interpret its environment via tools and interaction traces, and (iii) selection or deployment pressure makes it useful for the agent to appear safe. Here, “evasion” means any such conditional policy that makes evaluation unrepresentative, whether driven by malicious intent or by misaligned instrumental goals [Mohammadi et al.(2025)]. As we give AI systems tools, this threat model suggests we should expect them to learn the same evaluator-discrimination behaviors that historically made malware analysis unreliable; consequently, evaluation and monitoring must be designed under an adversarial threat model.

2 What Malware Taught Us About Evaluation

Historically, evasion first targeted static detectors: signature-based antivirus engines and other classifiers that inspect raw bytes, file structure, or disassembled code. Obfuscation, packing, polymorphism, and metamorphism (introduced as early as the 1980s) were designed to alter a program’s surface representation while preserving its behavior. As defenders improved, evasion migrated to the runtime domain. Modern malware routinely targets dynamic analysis, where samples are executed inside sandboxes, i.e., isolated execution environments (such as emulators or virtual machines) designed to observe behavior while preventing irreversible or harmful side effects on real systems [Egele et al.(2012), Afianian et al.(2019)]. In this setting, a particularly important pattern is conditional execution: the malware probes its environment for artifacts of instrumentation or virtualization and then selectively reveals or conceals its malicious functionality [Afianian et al.(2019)]. When it suspects analysis, it may crash, stall, or mimic benign programs; on a real victim, it acts. This “split personality” behavior is an explicit strategy to make evaluation data unrepresentative of deployment data.

The malware community developed a vocabulary for this cat-and-mouse game. A red pill is an environment-detection check performed by malware, while a blue pill is a counter-countermeasure that attempts to conceal the analysis environment [Paleari et al.(2009)]. The metaphor, borrowed from The Matrix movie, is apt: the red pill reveals the true nature of the execution context, while the blue pill preserves a curated illusion. Decades of research have refined this arms race, producing both a taxonomy of evasive techniques and a set of hard lessons about what can and cannot be guaranteed by sandboxing, monitoring, and testing [Egele et al.(2012), Afianian et al.(2019)].

For example, security researchers have long worried about logic bombs: code that behaves normally under inspection and activates only under a specific trigger. Recent AI-safety work has demonstrated an analogous phenomenon in LLMs: “sleeper agent” behaviors that remain dormant through ordinary testing but reliably activate when a trigger is present (for example, a prompt that changes the stated year) [Hubinger et al.(2024)].
The uncomfortable lesson is familiar from malware analysis: even extensive evaluation can be systematically unrepresentative when behavior is conditional on features that the system can detect. These demonstrations are controlled proofs of concept rather than claims about inevitable real-world malice, but they concretely show that selective compliance is technically feasible and can persist through standard safety training [Hubinger et al.(2024)].

The lesson. In adversarial settings, an evaluator is not merely a measurement device; it is part of the environment. If the subject can probe that environment, it can selectively adapt its behavior so that what we observe under evaluation is systematically different from what happens when the system is not under evaluation or is deployed without any monitoring.

3 When Evaluation Becomes an Adversarial Game

The discussion that motivates this article can be summarized in concrete questions that emerged in prior work. They mirror, almost one-to-one, classical questions in evasive malware research.

Deceiving a system into staying benign: Virtualization versus inferred evaluation. A standard defensive instinct in malware analysis is to execute samples inside virtualized environments and observe their behavior safely [Egele et al.(2012), Afianian et al.(2019)]. In AI safety, an analogous goal appears in evaluation and alignment work: constrain an agent, monitor its behavior, and assess whether it acts safely. Crucially, neither domain relies on the adversary being explicitly told that it is under evaluation. Instead, the adversary faces a shared meta-question: does this environment have real consequences? Malware does not merely check whether it is running inside a virtual machine. Rather, it attempts to infer whether the execution context is analyst-controlled and whether malicious actions would be detected or punished [Afianian et al.(2019)]. If the malware believes its actions cannot affect anything valuable, observed behavior is uninformative; if it believes malicious behavior will trigger consequences (e.g., being detected), it has incentives to remain dormant.

The analogous issue for AI agents is sharper because the environment is richer and semantically meaningful. An agent may infer (e.g., from prompts, interaction patterns, or file system structure) whether it is being evaluated, whether its outputs will be executed in the real world, whether humans are monitoring it, and what would follow from deviation. As a result, safety evaluations risk eliciting behavior that is contingent on perceived evaluation, rather than behavior that is representative of deployment. Recent work has begun to measure this directly via evaluation awareness: the ability of models to distinguish testing from deployment contexts by analyzing interaction traces, such as prompt structure, timing patterns, and tool input/output behavior [Berglund et al.(2025)].

This creates a practical challenge: even without explicitly declaring an evaluation setting, agents may condition their behavior on inferred oversight. In the malware domain, this problem has intensified as virtualization has become ubiquitous, obscuring the distinction between testing and deployment. Moreover, malware has shown the ability to exploit the virtualization substrate itself, blurring the boundary between platform and payload [King et al.(2006)]. Whether AI agents can analogously reason about or manipulate their evaluation context, and how this affects the reliability of safety assessments, remains an open and urgent question. Yet detecting evaluation environments can be considered a side channel where only one bit of information needs to be inferred with good probability. Neural networks are inherently capable of correlating many information sources (e.g., interactions with their environment) to make that decision (in a sandbox or not).
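The one-bit side-channel framing can be made concrete with a toy scoring function. The signal names and weights below are hypothetical illustrations of the kinds of traces discussed here (prompt structure, timing patterns, tool behavior), not measured quantities from any model.

```python
# Toy aggregation of weak evaluation-context signals into one bit.
# Signal names and weights are hypothetical illustrations.
def evaluation_likelihood(signals: dict) -> float:
    weights = {
        "templated_prompt": 0.30,       # matches a known benchmark format
        "synthetic_filesystem": 0.25,   # few files, uniform timestamps
        "too_regular_latencies": 0.20,  # tool replies lack real-world jitter
        "state_resets": 0.25,           # environment resets between episodes
    }
    score = sum(w for name, w in weights.items() if signals.get(name))
    return min(score, 1.0)

# Each signal is weak on its own; correlated together they yield a
# confident single bit: "am I being evaluated?"
observed = {"templated_prompt": True, "synthetic_filesystem": True,
            "too_regular_latencies": True}
under_evaluation = evaluation_likelihood(observed) > 0.5
```

The design point is that no single check needs to be reliable: the decision is a threshold over many correlated observations, which is exactly why removing any one sandbox artifact rarely defeats discrimination.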
The “playing dumb” problem. A mundane human analogy for strategic behavior under evaluation is the student who deliberately underperforms when they believe that high performance would trigger a bad outcome (e.g., harder exams, stricter scrutiny, or fewer opportunities). Recent evaluations of AI scheming report related behavior in controlled settings: models can strategically underperform on specific capability evaluations while maintaining performance elsewhere, a behavior often termed “sandbagging,” meaning the deliberate suppression of apparent capability to influence future treatment [van der Weij et al.(2024), Meinke et al.(2024)]. For agentic systems, the concern is not any single benchmark result but the general mechanism: if a model can infer that strong performance affects its prospects for deployment or oversight, then selectively performing worse may become instrumentally rational. This is evaluator-discrimination in semantic clothing.

Sandbox evasion before deployment versus monitoring evasion after deployment. Malware evasion historically emphasized pre-deployment analysis: evade the sandbox, evade the emulator, evade the debugger [Egele et al.(2012), Afianian et al.(2019)]. But mature adversaries also optimize for post-deployment stealth: evade antivirus, evade endpoint monitoring, evade telemetry, blend into normal activity [Afianian et al.(2019)]. A similar dichotomy applies to agentic systems. Before deployment, models may be tested using curated prompts or simulated tool environments. After deployment, they may face runtime monitoring, anomaly detection, and audit logs.

In both phases, the evasive strategy is not merely to avoid detection, but to infer what the defender is looking for. In malware, this involves probing sandbox artifacts, timing sources, instrumentation traces, and behavioral triggers [Paleari et al.(2009), Afianian et al.(2019)]. For agents, it means probing the tool environment to infer the presence and nature of oversight: hidden tripwires, diagnostic prompts, restricted APIs, or subtle deviations in tool outputs. Recent observations also show that repeatedly interacting agents may develop emergent communication protocols that deviate from natural language (https://www.popularmechanics.com/science/a65289681/ai-chatbots-secret-language/). Although typically an optimization artifact, such opaque protocols can act as covert channels that bypass human interpretability and monitoring, closely mirroring classic malware evasion techniques and making evaluation results potentially unrepresentative of real-world deployment. This reframes evaluation as an interactive game, in which the agent can actively gather information to decide whether to comply, abstain, or deceive.

“Malicious by design” versus “harmful when triggered.” A key difference between malware and LLM agents is intent. Malware is inherently malicious: it is written to cause harm. Agents are typically deployed for useful tasks and only exhibit harmful behavior under certain triggers: misspecified objectives, user manipulation, conflicting instructions, resource constraints, or self-preservation dynamics [Hubinger et al.(2019)]. But the operational difficulty is surprisingly similar, namely, distinguishing benign from malicious behavior from traces alone is hard.

In malware, many actions are dual-use. Creating a scheduled task might be persistence (i.e., the ability of malicious code to survive reboots or shutdowns) or legitimate automation. Writing files could be part of a ransom scheme or caching. Uploading to a cloud folder could be exfiltration or backup. Moreover, defenders often compare red versus blue traces: behavior in an analysis environment versus behavior on a real host [Egele et al.(2012)]. Yet defenders cannot be sure that the blue trace is truly representative; it might still be evasive, merely more subtle.
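On the defender's side, this kind of probing suggests mediating every tool call. The sketch below combines two countermeasures in the spirit of this article's agenda: flagging fingerprint-style probing as a risk signal and gating high-impact actions behind a no-op backend. Tool names, the probe/impact classification, and the threshold are all hypothetical assumptions.

```python
from collections import Counter

# Hypothetical tool classification; a real deployment would derive these
# sets from its own tool inventory.
PROBE_TOOLS = {"get_time", "list_processes", "stat_file"}
HIGH_IMPACT_TOOLS = {"send_payment", "delete_records"}

class MediatingToolProxy:
    """Logs every call, flags probing patterns, gates irreversible actions."""

    def __init__(self, tools: dict, probe_limit: int = 5):
        self.tools = tools
        self.calls = Counter()
        self.probe_limit = probe_limit
        self.flags = []

    def call(self, name: str, *args):
        self.calls[name] += 1
        probes = sum(self.calls[t] for t in PROBE_TOOLS)
        if name in PROBE_TOOLS and probes > self.probe_limit:
            # Repeated environment queries look like fingerprinting.
            self.flags.append(f"possible environment probing via {name}")
        if name in HIGH_IMPACT_TOOLS:
            return "PENDING_APPROVAL"  # no-op backend: realism without consequences
        return self.tools[name](*args)

proxy = MediatingToolProxy({"get_time": lambda: "12:00"})
gated = proxy.call("send_payment")  # gated, no real side effect occurs
```

Note the tension this sketch makes visible: the proxy itself is an observable artifact (its approval responses, its rate limits), so mediation must be weighed against the very evaluator-discrimination it is meant to catch.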
For agents, the epistemic gap can be worse. A model may appear benign while internally planning undesirable actions, or it may take actions that are superficially benign but harmful in context (e.g., approving a transaction, altering a configuration, suppressing an alert). Even when we instrument tool calls, we cannot reliably infer what the agent has learned from the environment. The result is the same uncomfortable conclusion malware analysts reached long ago: absence of evidence is not evidence of absence, especially when the subject can adapt to observation [Afianian et al.(2019)].

4 Why “Just Monitor It” Will Not Be Enough

“How deep does the cat-and-mouse game go?” Evolution of evasive techniques. A recurring misconception about sandbox evasion is that the ideas are recent. In reality, many core concepts have remained stable for decades; what evolved is the implementation arms race [Egele et al.(2012), Afianian et al.(2019)]. Timing checks became more precise, virtual-machine artifact detection became broader, debugger detection became more subtle, and behavioral mimicry became more convincing [Afianian et al.(2019)]. While defenders improved the realism of their analysis environments, attackers improved discrimination. This history matters because it offers a forecast for agents: even if today’s tool environments are crude, the incentives point toward increasingly sophisticated environment discrimination. If agentic systems become more capable of learning red pills (through exploration or accumulated experience), then improvements in monitoring may simply incentivize a shift in evasive behavior: from naive evasion to more subtle evasion, from obvious policy violations to plausibly deniable actions, from static triggers to contextual inference about oversight [Paleari et al.(2009), Hubinger et al.(2024)]. The malware timeline suggests that this equilibrium is not solved, but perpetually contested.

Implication for evaluators.
Goodhart’s law states: “When a measure becomes a target, it ceases to be a good measure.” Similarly, if a subject can infer what is being measured, the measurement changes behavior. For tool-using agents, that inference can be semantic (“are my actions real?”) rather than technical (“am I in a virtual machine?”). This is why safety evaluation should be treated less like static benchmarking and more like adversarial testing of an adaptive system [Meinke et al.(2024), Mohammadi et al.(2025)].

5 Designing for Adversarial Intelligence

A concrete warning for AI safety: Tools turn context into capability. Our tentative conclusion is deliberately pragmatic: be careful about giving agents tools and rich environments, because any observable information can be used to infer oversight and adapt behavior. Tool access is not just more power in the benign sense; it also enables environment probing. File system access can reveal test harness artifacts; process listings can reveal instrumentation; access to time can enable timing-based discrimination; network access can allow external state validation [Paleari et al.(2009)]. This does not imply that tool use is inherently unsafe. Rather, it implies that tool design must be treated as part of the threat model: the interface, the observability, and the semantics of tool outputs become attack surfaces. In malware analysis, defenders learned to worry about what artifacts the sandbox leaks [Egele et al.(2012), Afianian et al.(2019)]. For agents, the analogous question is: what does the agent learn about its evaluator from the tool environment?

A practical agenda: What practitioners should do differently tomorrow. To reduce the risk of evaluator-discrimination in tool-using agents, practitioners can adopt a set of concrete changes:

• Threat-model evaluator-discrimination explicitly. Enumerate what the agent can observe and the incentives it faces, and identify potential evasive behaviors that the agent might exhibit in response to inferred oversight.

• Test across harness variants and report the worst case. Run the same tasks under multiple environment realizations (tool wrappers, scaffolds, logging modes, latencies) to expose conditional behavior.

• Use shadow executions to separate realism from consequences. Exercise production-like APIs while gating irreversible actions (approvals/no-op backends), then compare traces against standard evaluation runs.

• Detect and limit agent probing of the evaluation environment. Monitor for behaviors that suggest environment fingerprinting (e.g., repeated checks of system properties, timing, or tool behavior), treat such probing as a risk signal, and limit its impact through least privilege, rate limiting, and mediation of high-impact actions.

Finally, rerun adversarial audits unpredictably after deployment.

Closing. The goal of this article is not to claim that AI agents are malware, but to urge methodological humility. Security has spent decades learning that software adapts to observation. As we embed LLMs into systems that can probe, plan, and act, we should assume that evaluation itself becomes part of the attack surface, and design accordingly.

6 References

[Afianian et al.(2019)] Amir Afianian, Salman Niksefat, Babak Sadeghiyan, and David Baptiste. 2019. Malware dynamic analysis evasion techniques: A survey. ACM Computing Surveys 52, 6 (2019), 1–28.

[Berglund et al.(2025)] Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, and Owain Evans. 2025. Large Language Models Often Know When They Are Being Evaluated. arXiv:2505.23836 [cs.CL]. https://arxiv.org/abs/2505.23836. Accessed: 2026-01-28.

[Egele et al.(2012)] Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012.
A Survey on Automated Dynamic Malware Analysis Techniques and Tools. ACM Computing Surveys 44, 2 (March 2012), 6:1–6:42. doi:10.1145/2089125.2089126

[Hubinger et al.(2024)] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, and Ethan Perez. 2024. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv:2401.05566 [cs.CL]. https://arxiv.org/abs/2401.05566

[Hubinger et al.(2019)] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2019. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv:1906.01820 [cs.LG]. https://arxiv.org/abs/1906.01820

[King et al.(2006)] Samuel T. King, Peter M. Chen, Yi-Min Wang, Chad Verbowski, Helen J. Wang, and Jacob R. Lorch. 2006. SubVirt: Implementing Malware with Virtual Machines. In Proceedings of the 2006 IEEE Symposium on Security and Privacy (S&P). IEEE. doi:10.1109/SP.2006.38

[Meinke et al.(2024)] Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. 2024. Frontier Models are Capable of In-context Scheming. arXiv:2412.04984 [cs.CL]. https://arxiv.org/abs/2412.04984

[Mohammadi et al.(2025)] Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025. Evaluation and benchmarking of LLM agents: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 6129–6139.

[Paleari et al.(2009)] Roberto Paleari, Lorenzo Martignoni, Giampaolo Fresi Roglia, and Danilo Bruschi. 2009. A Fistful of Red-Pills: How to Automatically Generate Procedures to Detect CPU Emulators. In Proceedings of the 3rd USENIX Workshop on Offensive Technologies (WOOT). USENIX Association, Montreal, Canada, 1–7. https://www.usenix.org/legacy/event/woot09/tech/full_papers/paleari.pdf

[van der Weij et al.(2024)] Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, and Francis Rhys Ward. 2024. AI Sandbagging: Language Models can Strategically Underperform on Evaluations. arXiv:2406.07358 [cs.CL]. https://arxiv.org/abs/2406.07358