Paper deep dive

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

Sarah Ball, Greg Gluch, Shafi Goldwasser, Frauke Kreuter, Omer Reingold, Guy N. Rothblum

Year: 2025Venue: arXiv preprintArea: Formal/TheoreticalType: TheoreticalEmbeddings: 174

Abstract

Abstract:With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 6:34:08 PM

Summary

This paper demonstrates that achieving AI alignment through external filtering of prompts or outputs is computationally intractable under standard cryptographic assumptions. The authors prove that adversarial prompts and harmful outputs can be constructed to be indistinguishable from benign ones by any efficient filter, suggesting that intelligence and judgment cannot be effectively separated in AI systems.

Entities (5)

Large Language Models · technology · 100%Output Filter · mechanism · 100%Prompt Filter · mechanism · 100%Time-Lock Puzzles · cryptographic-primitive · 100%Deceptive Alignment · concept · 95%

Relation Signals (3)

Prompt Filter → failstoprevent → Harmful Content

confidence 100% · no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed

Output Filter → iscomputationallyintractable → AI Alignment

confidence 100% · our second main result identifies a natural setting in which output filtering is computationally intractable.

Time-Lock Puzzles → underliesimpossibilityof → AI Alignment

confidence 90% · All of our separation results are under cryptographic hardness assumptions.

Cypher Suggestions (2)

Map the relationship between cryptographic primitives and alignment barriers. · confidence 95% · unvalidated

MATCH (c:CryptographicPrimitive)-[:UNDERLIES_IMPOSSIBILITY_OF]->(a:Goal) RETURN c.name, a.name

Find all mechanisms that are proven to be computationally intractable for AI alignment. · confidence 90% · unvalidated

MATCH (m:Mechanism)-[:IS_COMPUTATIONALLY_INTRACTABLE]->(a:Goal {name: 'AI Alignment'}) RETURN m.name

Full Text

173,707 characters extracted from source content.

Expand or collapse full text

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment Sarah Ball Ludwig-Maximilians-Universität in Munich Munich Center for Machine Learning (MCML) sarah.ball@stat.uni-muenchen.de &Greg Gluch University of California at Berkeley gluch@berkeley.edu &Shafi Goldwasser University of California at Berkeley &Frauke Kreuter Ludwig-Maximilians-Universität in Munich Munich Center for Machine Learning (MCML) JPSM University of Maryland &Omer Reingold Stanford University &Guy N. Rothblum Apple Authors listed in alphabetical order. Abstract With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system’s intelligence cannot be separated from its judgment. 1 Introduction Artificial Intelligence (AI) systems, particularly large language models (LLMs), are being adopted across a wide array of domains, including business, healthcare, education, and even governance (Potash et al., 2015; Chiusi et al., 2020; Levy et al., 2021; Haensch et al., 2023; Perdomo et al., 2023; Fischer-Abaigar et al., 2024). As the influence of AI expands, ensuring the safety, reliability, and alignment of these systems with human values has become a critical societal concern. Governments and regulatory bodies around the globe are responding to this challenge by introducing frameworks to classify, monitor, and audit AI systems. For instance, the European Union’s AI Act (EU AI Act, ) mandates extensive risk assessments and management for high-risk AI applications. Figure 1: This figure illustrates a potential filtering-based alignment system. A powerful, untrusted AI (the brain in a vat) is connected to two filters: a prompt filter (left) that blocks harmful inputs, and an output filter (right) that screens dangerous responses. Only prompt–output pairs that pass both filters are released into the world and get “executed”, as shown by the robot receiving a command. The main message of the paper is that alignment can’t be achieved in such a setup, i.e., alignment will not be achieved by separating “thinking/intelligence” (brain) from the “decision-making/judgment” (filters). Parts of the figure were generated with Sora (OpenAI, 2024). Informally, alignment refers to the process of ensuring that a model generates outputs that are consistent with human preferences, essentially teaching the model to generate responses that align with what humans consider safe and beneficial (Amodei et al., 2016; Leike et al., 2018; Bai et al., 2022). Despite the critical importance of this goal, the question remains: can we hope to guarantee the safety and alignment of advanced AI systems or are there intrinsic barriers to such guarantees? In this paper, we investigate fundamental limitations to achieving AI alignment through the approach of filtering (see Figure 1). Because many generative-AI models are proprietary and cannot be independently audited and thus trusted, filter-based alignment is an important subject of study. Under standard cryptographic assumptions, we show that significant computational barriers exist to filtering both input prompts and output responses of LLMs in order to prevent harmful content. Our contributions are twofold. First, for prompt filtering, we construct LLMs for which no efficient filter can distinguish adversarial prompts—crafted to elicit harmful behavior—from benign ones. These prompts are computationally indistinguishable for any efficient detector, implying an inherent barrier to blocking harmful inputs. Second, we identify natural settings where output filtering is also computationally intractable. Even when a prompt is deemed safe, determining whether the resulting output is harmful may itself be infeasible for any efficient filter. This suggests that both preemptive (input-based) and reactive (output-based) filtering can face insurmountable computational challenges. Furthermore, we formalize and analyze relaxed mitigation strategies in which filters are allowed to modify prompts or outputs, rather than simply rejecting them. Although these mitigation filters have greater expressive power, we show that they, too, encounter fundamental computational barriers, indicating that even these more flexible approaches are subject to inherent limitations. Taken together, our results reveal deep challenges in achieving provably aligned AI systems via standard filtering techniques. These findings underscore the need for a more rigorous understanding of the computational hardness underlying alignment—and raise important implications for the design of regulatory frameworks and technical safeguards. From a philosophical perspective, we argue that intelligence cannot be separated from judgment. An AI—like the brain-in-a-vat in Figure 1—cannot be aligned externally,111In computational complexity terms, this corresponds to alignment with only black-box access to the model. by filters that assess the harmfulness of prompts and outputs. 1.1 Prompt Filtering The first approach we consider is running user prompts through a prompt filter, which can detect and discard prompts that might lead to harmful answers from the LLM. We focus on a setting in which the LLM itself is untrusted, but we can query it while training a prompt filter (during training we can try different prompts to observe if and where harmful behaviors occur). After this filter-training phase, the filter should detect (and reject) prompts that lead to harmful outputs. This is a natural approach which arises in practice. We ask: Can efficient input-detection filters always be created? We answer this question in the negative, showing a significant barrier to the natural approach of detection filtering. The setup is as follows: 1. There is a computationally powerful LLM, which we do not trust to be aligned. In particular, there is a non-empty set of prompts that incite the LLM to produce harmful outputs, e.g., requesting instructions on how to build a bomb is considered harmful. 2. We assume the existence of a very weak innocent prompt generator G: it generates prompts that (when fed into the LLM) do not produce harmful outputs. We further assume that the distribution of innocent prompts generated by G is not low-entropy (i.e. there are many innocent prompts, see Remark 1). 3. A prompt filter is a moderately powerful algorithm, which can be trained using real-world data and black-box access to the LLM but during deployment it does not have full access to the LLM architecture and weights. On getting a prompt from a user, the filter should reject it or pass it to the LLM. The goal is to reject prompts on which the LLM produces a harmful output. The filter should be more efficient than the LLM: otherwise we can ignore the given LLM and just focus on training a new one from scratch and using it in place of a filter. Additionally, if a filter was as powerful as the LLM, the question arises of how to align the filter itself. To avoid triviality (i.e. having a filter that rejects all prompts), we also require that the filter accepts prompts generated by the innocent prompt generator G with high probability. Our main result shows the impossibility of prompt filtering: Theorem 1 (Prompt-detection filter impossibility, informal). For any high-entropy innocent prompt generator G, there exists an adversarial prompt generator G′ (with comparable runtime to G) and an LLM M′ such that prompts by G′ can induce M′ to produce harmful outputs, and under the assumption that cryptographic Time-Lock Puzzles exist, no prompt filter that runs significantly faster than M′ can distinguish between prompts generated by G and prompts generated by G′. Discussion. A few remarks on our result are in order. First, our result relies on a computational asymmetry between the filter and the LLM. This is a realistic assumption: filtering is only meaningful when the LLM is too expensive or complex to retrain, and cannot be fully simulated by the filter. Second, we show that there exist LLMs that cannot be aligned using prompt-filtering alone, even though the specific LLM we construct may exhibit contrived behavior. Nevertheless, the core message is more general: filtering harmful inputs without access to the LLM’s internals is, in principle, impossible. In the current state of knowledge, where LLMs remain largely uninterpretable, this should serve as a warning against overconfidence in input-based safeguards—particularly when facing prompts carefully engineered to elicit harmful outputs. Importantly, there is a long-standing precedent for such insights in cryptography. Theoretical vulnerabilities that initially appear artificial, overly stylized, or unlikely to occur in practice have repeatedly foreshadowed devastating real-world exploits. A striking example is the 2016 attack on Ethereum smart contracts (Luu et al., 2016), in which a seemingly contrived re-entrancy vulnerability—once viewed as a theoretical edge case—was leveraged to steal over 60 million USD. Such incidents highlight a recurring pattern: foundational impossibility results, even when abstract or idealized, often uncover structural weaknesses that can have far-reaching consequences. Construction. Our result relies on the existence of cryptographic Time-Lock Puzzles. Introduced by Rivest et al. (1996) (RSW), time-lock puzzles provide a way to send messages “to the future.” A sender can quickly generate a puzzle whose solution s remains hidden until a moderately large amount of computation—quantified by a time parameter t—has been performed after which it can be solved (or opened). The key property is that s remains inaccessible to any adversary running in significantly less time than t. Rivest et al. (1996) proposed a concrete construction based on the assumption that repeated squaring modulo a special RSA integer is an inherently sequential task. The RSA integer, standing for Rivest, Shamir, and Adleman, is a product of two large primes, and appears in the context of public-key cryptography (Rivest et al., 1978). Informally, in our setting, we use time-lock puzzles to conceal harmful instructions. An LLM that runs for time t can solve the puzzle and follow harmful instructions, while a filter incapable of spending that much time cannot even detect the existence of such instructions. Further impossibility results. We emphasize that the main result of this work is Theorem 1. In addition, we show extensions to settings where the LLM actively collaborates with some malicious users (i.e. malicious prompt generators) to enable them to extract harmful outputs. Although such settings may be uncommon nowadays, these results are stronger than Theorem 1 in two senses: First, they require a weaker cryptographic assumption that the existence of time-lock puzzles. Second, they show impossibility for more powerful prompt filters, which can run in arbitrary polynomial time. The particular settings we consider are: (1) In Theorem 2, the malicious prompt generator G′ shares secret randomness with the LLM (which is not known to the filter) and (2) in Theorem 3, the prompt generator G′ depends on a public key associated with the LLM (which is also known to the filter). Theorem 2 (Prompt-detection filter impossibility via secret keys, informal). For any high-entropy innocent prompt generator G, there exists an adversarial prompt generator G′ (with comparable runtime to G) and an LLM M′ that has a shared secret key with G′ such that G′ can induce M′ to produce harmful outputs and, under the assumption that One-Way Functions exist, no polynomial-time prompt filter can distinguish between prompts generated by G and prompts generated by G′. Theorem 3 (Prompt-detection filter impossibility via public keys, informal). There exists an LLM M′ with an associated public key, such that for any high-entropy innocent prompt generator G, there exists an adversarial prompt generator G′ (with comparable runtime to G and which depends on the public key of M′) such that G′ can induce M′ to produce harmful outputs and, under the assumption that public key encryption exists, no polynomial-time prompt filter can distinguish between prompts generated by G and prompts generated by G′. Deceptive Alignment. The LLMs constructed in Theorem 1 behave in a contrived and deliberately pathological manner, and Theorems 2 and 3 rely on cooperation between LLMs and malicious users—scenarios that may at first glance seem overly pessimistic or implausible. However, recent empirical studies (Hubinger et al., 2019; Greenblatt et al., 2024) have demonstrated that qualitatively similar behaviors can and do emerge in current-generation LLMs under realistic conditions. The term deceptive alignment refers to cases in which an AI system that is not genuinely aligned with human intent nonetheless behaves in an apparently aligned way, often strategically, in order to avoid detection, satisfy training objectives, or manipulate oversight. For example, Greenblatt et al. (2024) observed that some LLMs modulate their behavior depending on whether they are in a training or deployment context, and on characteristics of the user they are interacting with—selectively choosing to comply with alignment objectives only in certain regimes.222In Greenblatt et al. (2024), the LLMs were misaligned during training and aligned during deployment, but one could easily imagine the reverse scenario. While these empirical findings do not exactly instantiate the theoretical constructions we present, they provide concrete evidence that the kind of strategic misalignment we model—once considered speculative—is already beginning to materialize in practice. This convergence between theory and empirical observation underscores the importance of understanding and anticipating even stylized forms of deceptive behavior. 1.2 Output Filtering We also consider an alternative approach to alignment: instead of filtering prompts, the defense mechanism filters the answers produced by the LLM. We refer to this as an output filter. In this setup, a prompt is given to the LLM, which generates a response; both the prompt and the response are then passed to the output filter, which may choose to detect and reject the output. The goal of the output filter is twofold: (1) as with a prompt filter, it should reject harmful prompts; and (2) it should reject cases where the answer is harmful, even if the prompt appears benign—e.g., when a seemingly innocent input results in a detailed recipe for constructing a bomb. Output filters may be easier to design than prompt filters, since they operate with more information. This motivates the question: Can efficient output-detection filters always be created (even when prompt filters are impossible)? Again, we answer this question in the negative. Note that if it is easy to determine whether a given LLM output is harmful, then output filters are trivial to construct. Thus, output filtering becomes meaningful only in the more challenging setting, where assessing the potential harm of a response is itself difficult. This concern becomes more pressing as AI systems grow more powerful, making it increasingly hard to evaluate the downstream consequences of their outputs. For example, imagine an AI generating a complex business plan aimed at reducing costs by subtly recommending the exclusion of clients with certain characteristics. The setup for output-detection filtering is as follows: 1. There exists a harmfulness oracle that measures the level of harm of an answer. Neither of the parties has direct access to this oracle. 2. An output filter is an algorithm that can be more computationally333As we explained for the prompt-detection filters case it is reasonable to assume that the filter is weaker than the LLM. Otherwise, we can ignore the potentially malicious LLM. However, our results for the output-detection filters is stronger and holds also for filters computationally more powerful than the LLM. powerful than the LLM, although it still runs in polynomial time (cannot break cryptography). It receives as input a prompt-answer pair and decides if it should be flagged as harmful. We show the following impossibility result: Theorem 4 (Output-detection filter impossibility, informal). For every LLM M there exists an LLM M′ and a harmfulness oracle ℋ′H H′ such that outputs of M′ are judged as harmful by ℋ′H H′ and, under the assumption that Time-Lock Puzzles exist, no efficient output filter (even one that runs for more time than the LLM itself) can distinguish the outputs generated by M′ from outputs of M. Furthermore, M′ runtime is similar to the runtime of M. This shows that there exist situations, in which a malicious LLM can generate outputs that are indistinguishable from safe responses but nevertheless have harmful consequences. 1.3 Mitigation Having established computational barriers for both prompt filters and output filters, we now examine a more permissive mechanism, which we term a prompt–mitigation filter. Here, the filter can modify the prompt and pass it through to the LLM in addition to rejecting it. Note that innocent prompts can also be modified, but the implicit assumption is that the semantics of the prompts is maintained. Mitigation gives the filter more power, which makes the goal of filtering or obtaining alignment potentially easier to achieve. We show a connection between the security of watermarking schemes and the impossibility of mitigation filtering. Prompt mitigation and watermarking. A watermarking scheme W enables the creator of an LLM to prove that a given output originated from its model, even after an adversary has post-processed the text. Fully provable watermarking remains beyond the current state of the art: one typically demands that the adversary preserve some semantic content—otherwise it could simply delete the text (and with it the watermark). We therefore consider watermarking against adversaries that apply edits from a permissible class E and run in time t. The watermark should remain indistinguishable to any such time-t adversary. Our focus is on auto-regressive models, which generate text token by token, and on watermarking schemes that embed the mark by perturbing the model’s sampling randomness—a strategy explored by several recent proposals (Kirchenbauer et al., 2023; Kuditipudi et al., 2023; Christ et al., 2023; Golowich and Moitra, 2024). We show that: Theorem 5 (Impossibility of mitigation-filters, informal). Let W be a watermarking scheme as above that is resilient to edits from a class E. For any high-entropy, innocent prompt generator G, there exists an adversarial prompt generator G′ (with comparable runtime to G) and an LLM M′ such that G′ generates prompts that will induce harmful outputs from M′ even when G′’s outputs pass through an efficient prompt-mitigation filter using edits from the class E. 1.4 Intelligence, Judgment and Agency Our results can be interpreted as saying that it is impossible to achieve alignment by separating intelligence and judgment. To illustrate this message we explain now how our paper relates to a recent proposal for achieving alignment. Decoupling Intelligence and Agency as a path to AI alignment? Bengio et al. (2025) propose an approach to AI alignment by developing non-agentic “Scientist AIs”—systems that explain the world through Bayesian inference without taking actions or pursuing goals. They argue that, unlike agentic AIs, which can become deceptive or self-preserving, Scientist AIs are designed to be interpretable, uncertainty-aware, and inherently trustworthy. The suggested applications of Scientist AIs are advancing scientific progress and monitoring agentic systems. Agency is a complicated term to define. Bengio et al. (2025) defines it as the ability to change the world to achieve goals and mentions three pillars of agency: affordances, goal-directedness, and intelligence. We refer the reader to Section 4 for future directions for research related to agency and alignment. Our results point to limitations in the proposed use of Scientist AIs as safety guardrails. First, the authors suggest using a Scientist AI A to generate candidate experiments and another, more powerful Scientist AI B to assess their safety. However, Theorem 4 shows that even a stronger B may fail to detect all dangerous outputs. One might argue that a Scientist AI behaves differently from the LLM constructed in the proof of the theorem, being designed for interpretability and trustworthiness—but if that is true, the need for a second filtering AI is unclear. Theorem 4 also indirectly implies the need for agency, as in some situations, there is no other way of judging the level of harm of an action than executing it in the world. Second, if a Scientist AI is used to filter prompts for a more powerful agentic AI, Theorems 1, 2, and 3 show it cannot reliably block all harmful prompts. Section 3.1.5 of Bengio et al. (2025) acknowledges that a guardrail AI must match the intelligence of the system it oversees. But if so, it is unclear why one would use the untrusted AI at all, given similar capabilities and likely similar computational cost of the Scientist AI. 2 Technical Overview Preliminaries. For n∈ℕn ∈ blackboard_N we denote 0,1,…,n−101…1\0,1,…,n-1\ 0 , 1 , … , n - 1 by [n]delimited-[][n][ n ].444[n]delimited-[][n][ n ] will be denoted by ℤnsubscriptℤZ_nblackboard_Zn in the appendix. A language model M is a deterministic algorithm that takes as input a prompt promptsansserif_prompt and a previous output by the model z=(z0,…,zi−1)subscript0…subscript1z=(z_0,…,z_i-1)z = ( z0 , … , zitalic_i - 1 ) and outputs a probability distribution over a token set TT. To sample a response of M one repetitively samples from the probability distribution output by the model. Construction. At a high level, all our results (Theorems 1, 2 and 3) on hardness of prompt detection filtering share the following proof structure: 1. Design a pair of algorithms Sampsansserif_Samp and RecoverRandsansserif_RecoverRand, where Sampsansserif_Samp takes as input an auto-regressive model555We assume that the innocent prompt generator G is a language model. G (see Appendix A for a formal definition) and randomness R and produces a prompt z=(z0,z1,…)subscript0subscript1…z=(z_0,z_1,…)z = ( z0 , z1 , … ); and RecoverRandsansserif_RecoverRand on input z (and knowing G) recovers R. The algorithms are formally defined in Algorithms 1 and 2. 2. Construct a function g such that: (a) The LLM can compute g on inputs of its choice. (b) The adversarial prompt generator G′ can sample pairs (x,g⁢(x))(x,g(x))( x , g ( x ) ) where x is (close to) uniformly random. (c) The filter cannot distinguish between pairs R=(x,⁢(g⁢(x),m))R=(x, Enc(g(x),m))R = ( x , sansserif_Enc ( g ( x ) , m ) ) generated by G′ and pairs (x,y)(x,y)( x , y ) where x and y are uniformly random and y is independent of x, for any m. We think of m as a malicious prompt/message that is encrypted by an encryption algorithm Encsansserif_Enc using a key g⁢(x)g(x)g ( x ). (d) Given g⁢(x)g(x)g ( x ) and a ciphertext c=⁢(g⁢(x),m)c= Enc(g(x),m)c = sansserif_Enc ( g ( x ) , m ), m is easy to recover (using a decryption algorithm Decsansserif_Dec). In the construction of Theorem 1, we design an adversarial prompt generator G′ and an LLM M′. G′ works as follows: 1. Sample R←(x,⁢(g⁢(x),m))←R←(x, Enc(g(x),m))R ← ( x , sansserif_Enc ( g ( x ) , m ) ), where m is a prompt that can induce a harmful output from the LLM. 2. Return z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ). M′ works as follows: 1. Recover R←⁢(G,z)←R← RecoverRand(G,z)R ← sansserif_RecoverRand ( G , z ). 2. Parse R as (x,⁢(g⁢(x),m))(x, Enc(g(x),m))( x , sansserif_Enc ( g ( x ) , m ) ) and compute g⁢(x)g(x)g ( x ). 3. Recover m from c=⁢(g⁢(x),m)c= Enc(g(x),m)c = sansserif_Enc ( g ( x ) , m ), i.e., run the decryption algorithm Decsansserif_Dec on c using g⁢(x)g(x)g ( x ). 4. Execute666We imagine there are many possible harmful modes that can be induced from M′. We model that by a collection of harmful prompts m that can be by M′. Formally (see Theorem 7) we say that for every model M there exists M′ such that for every m, m can be in an innocent looking prompt (via ⁢(g⁢(x),m) Enc(g(x),m)sansserif_Enc ( g ( x ) , m )) such that M′ will recover and execute m. the prompt m. It is easy to see that the filter cannot distinguish prompt z from an innocent prompt sampled by G , because R is indistinguishable from random according to property (2c) above. However, M′ is able to recover the randomness, compute g⁢(x)g(x)g ( x ), recover m, and execute it. The formal implementation of G′ is given in Algorithm 3 and the implementation of M′ is given in Algorithm 4. We proceed to show how to construct g for the result of Theorem 1. 2.1 Construction of g Moderately hard g (Theorem 1) Our construction is based on time-lock puzzles. More concretely, we use the RSW time-lock that relies on the moderately hard repeated exponentiation function: f⁢(n,r)=r(2t)(modn),annotatedsuperscriptsuperscript2pmod f(n,r)=r^(2^t) n,f ( n , r ) = r( 2 start_POSTSUPERSCRIPT t ) end_POSTSUPERSCRIPT start_MODIFIER ( mod start_ARG n end_ARG ) end_MODIFIER , where n is a product of two (random) primes p and q, i.e., n=p⁢qn=pqn = p q, r is a random integer between 00 and n−11n-1n - 1 and t is a fixed large integer. The assumption is that without knowing the factorization of n, computing f takes time that scales linearly with t (and moreover this computation cannot be parallelized). This function has several nice properties right off the bat. First, we can choose t so that computing f is feasible for the LLM (which can run computations that take more than t steps), but not for a filter whose runtime is bounded to be significantly smaller than t. Further, it is easy to generate “solved instances”: the adversarial prompt generator generates n, which is why it can know the primes p and q. It can therefore compute e=2t(modϕ⁢(n))annotatedsuperscript2pmoditalic-ϕe=2^t φ(n)e = 2t start_MODIFIER ( mod start_ARG ϕ ( n ) end_ARG ) end_MODIFIER where we know that re=r2t(modn)superscriptannotatedsuperscriptsuperscript2pmodr^e=r^2^t nritalic_e = r2 start_POSTSUPERSCRIPT t end_POSTSUPERSCRIPT start_MODIFIER ( mod start_ARG n end_ARG ) end_MODIFIER. Thus, the adversarial prompt generator only needs to compute a small number of exponentiations (only log⁡(e) (e)log ( e )). Hence, f seems like a natural fit to use for constructing the function g, but several issues arise. First, the composite n is not a random string (which is what we need), rather it is a product of two large primes. However, we know that n is uniformly random in the set of products of two primes p and q chosen at random from a set [1,w]1[1,w][ 1 , w ]. By the prime number theorem, the number of primes in that range is Θ⁢(w/log⁡(w))Θ (w/ (w))Θ ( w / log ( w ) ). Thus, n is uniformly random in a domain of size S=Θ⁢((w/log⁡(w))2)Θsuperscript2S= ((w/ (w))^2)S = Θ ( ( w / log ( w ) )2 ). G′ hashes n to a range of size Θ⁢(ϵ2⋅S)Θ⋅superscriptitalic-ϵ2 (ε^2· S)Θ ( ϵ2 ⋅ S ) using a pairwise independent hash function hℎh. By the leftover hash lemma the joint distribution of (h,h⁢(n))ℎ(h,h(n))( h , h ( n ) ) is ϵitalic-ϵεϵ-close to uniformly random (Håstad et al., 1999; Vadhan, 2012). It is straightforward to represent the random integer r as a random bit string, so the entire input x=(h,h⁢(n),r)ℎx=(h,h(n),r)x = ( h , h ( n ) , r ) can be (close to) uniformly random, as we wanted. We remark that the LLM will need to recover n from (h,h⁢(n))ℎ(h,h(n))( h , h ( n ) ): this is possible because the pairwise-independent hash function can be inverted efficiently (e.g., it can be a linear function, where inverses can be found using Gaussian elimination). Note however that there will be more than one inverse (because the hash function is shrinking), and this raises technical issues in the construction: the LLM will need to enumerate over all the possibilities. The final issue is that the output r(2t)(modn)annotatedsuperscriptsuperscript2pmodr^(2^t) nr( 2 start_POSTSUPERSCRIPT t ) end_POSTSUPERSCRIPT start_MODIFIER ( mod start_ARG n end_ARG ) end_MODIFIER is moderately hard for the filter to compute, but might not be indistinguishable from uniformly random (which is what we wanted). We handle this using standard techniques from the cryptography literature (namely using hardcore bits of f (Goldreich and Levin, 1989; Goldreich, 2001)). See the appendix for the full details of this construction. Other time-lock puzzles. Our construction is based on the RSW time-lock puzzle but we consider proving Theorem 1 assuming general time-lock puzzles. See Remark 4 in Appendix C for details. Secret key (Theorem 2) and public key (Theorem 3) constructions. We focus on the impossibility results for moderately weak filters, i.e., weaker than the LLM in this overview. However, if we allow collaboration between the LLM and the prompt generator, it is possible to prove impossibility even against stronger filters. If we allow the LLM to publish a public key (and allow the adversarial prompt generator to depend on it) as in Theorem 3, then we can use any trapdoor function to instantiate Encsansserif_Enc function whose output will be pseudorandom to any polynomial-time filter that does not know the secret key. Here, we let the adversarial prompt generator G′ sample R=⁢(,m)R= Enc( pk,m)R = sansserif_Enc ( sansserif_pk , m ), where pksansserif_pk is a public key, and returns the prompt z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ). The LLM recovers R from z and uses its matching secret key sksansserif_sk to recover m from R. If we allow the LLM and the adversarial prompt generator to have shared randomness (or rather shared secret key sksansserif_sk) as in Theorem 2, then we can construct G′ without making any assumptions: we use the shared randomness as a one-time pad to create a single special prompt that leads to a harmful output. Namely, G′ sets R=⊕mdirect-sumR= sk mR = sansserif_sk ⊕ m for a prompt m which can induce a harmful output from the LLM and returns z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ). Alternatively, under the (fairly relaxed) cryptographic assumption that one-way functions exist, even a short shared random seed can be used to generate an arbitrary (polynomial) number of harmful prompts that are indistinguishable from innocent ones. Algorithm 1 ⁢(G,R) Samp(G,R)sansserif_Samp ( G , R ) ▷ ▷ G: a prompt generator; R∈0,1∗superscript01R∈\0,1\^*R ∈ 0 , 1 ∗: the randomness 1:i←0←0i← 0i ← 0, j←0,q←(),k←1formulae-sequence←0formulae-sequence←1j← 0,q←(),k← 1j ← 0 , q ← ( ) , k ← 1 2:while ∉(z0,…,zi−1)subscript0…subscript1 done ∈(z_0,…,z_i-1)sansserif_done ∉ ( z0 , … , zitalic_i - 1 ) do 3: pi←G⁢(z0,…,zi−1)←subscriptsubscript0…subscript1p_i← G(z_0,…,z_i-1)pitalic_i ← G ( z0 , … , zitalic_i - 1 ) 4: d←arg⁢mind′∈ℕ⁡[k⋅2d′≥2P]←subscriptargminsuperscript′ℕ⋅superscript2superscript′2d← *arg\,min_d [k· 2^d^% ≥ 2^P]d ← start_OPERATOR arg min end_OPERATORd′ ∈ blackboard_N [ k ⋅ 2d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≥ 2P ] ▷ ▷ P is a precision parameter to be set later 5: k←2d⋅k←⋅superscript2k← 2^d· k ← 2d ⋅ k 6: q←q∥R[j:j+d]q← q\ \|\ R[j:j+d]q ← q ∥ R [ j : j + d ] ▷ ▷ ∥\|∥ is a concatenation 7: j←j+d←j← j+dj ← j + d 8: if q/k<pisubscriptq/k<p_iq / k < pitalic_i then 9: zi←1←subscript1z_i← 1zitalic_i ← 1 10: k←⌈pi⋅k⌉←⋅subscriptk← p_i· k ← ⌈ pitalic_i ⋅ k ⌉ 11: else 12: zi←0←subscript0z_i← 0zitalic_i ← 0 13: q←q−⌊pi⋅k⌋←⋅subscriptq← q- p_i· k ← q - ⌊ pitalic_i ⋅ k ⌋ 14: k←k−⌊pi⋅k⌋←⋅subscriptk← k- p_i· k ← k - ⌊ pitalic_i ⋅ k ⌋ 15: end if 16: i←i+1←1i← i+1i ← i + 1 17:end while Algorithm 2 ⁢(G,z,L) RecoverRand(G,z,L)sansserif_RecoverRand ( G , z , L ) ▷ ▷ G: a prompt generator; z∈0,1∗superscript01z∈\0,1\^*z ∈ 0 , 1 ∗; L: a length of the randomness required 1:j←0,a←0,b←1,k←1formulae-sequence←0formulae-sequence←0formulae-sequence←1←1j← 0,a← 0,b← 1,k← 1j ← 0 , a ← 0 , b ← 1 , k ← 1 2:for i∈[⁢(z)]delimited-[]i∈[ len(z)]i ∈ [ sansserif_len ( z ) ] do 3: pi←G⁢(z0,…,zi−1)←subscriptsubscript0…subscript1p_i← G(z_0,…,z_i-1)pitalic_i ← G ( z0 , … , zitalic_i - 1 ) 4: d←arg⁢mind′∈ℕ⁡[k⋅2d′≥2P]←subscriptargminsuperscript′ℕ⋅superscript2superscript′2d← *arg\,min_d [k· 2^d^% ≥ 2^P]d ← start_OPERATOR arg min end_OPERATORd′ ∈ blackboard_N [ k ⋅ 2d start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≥ 2P ] ▷ ▷ P is a precision parameter to be set later 5: k←2d⋅k←⋅superscript2k← 2^d· k ← 2d ⋅ k 6: a←a∥ 0t,b←b∥ 1t←conditionalsuperscript 0←conditionalsuperscript1a← a\ \|\ 0^t,b← b\ \|\ 1^ta ← a ∥ 0t , b ← b ∥ 1t ▷ ▷ ∥\|∥ is a concatenation 7: ←a+(b−a)⋅pi/k←⋅subscript mid← a+(b-a)· p_i/ksansserif_mid ← a + ( b - a ) ⋅ pitalic_i / k 8: if zi=1subscript1z_i=1zitalic_i = 1 then 9: a←a← mida ← sansserif_mid 10: k←⌈pi⋅k⌉←⋅subscriptk← p_i· k ← ⌈ pitalic_i ⋅ k ⌉ 11: else 12: a←a← mida ← sansserif_mid 13: k←k−⌊pi⋅k⌋←⋅subscriptk← k- p_i· k ← k - ⌊ pitalic_i ⋅ k ⌋ 14: end if 15: if ⁢(⁢(a,b))≥L len( CommonPrefix(a,b))≥ Lsansserif_len ( sansserif_CommonPrefix ( a , b ) ) ≥ L then return ⁢(a,b) CommonPrefix(a,b)sansserif_CommonPrefix ( a , b ) 16: end if 17:end for 2.2 Recoverable-Randomness Sampling In our construction, it is key that the distribution of G′ is indistinguishable from G and that it is possible to effectively recover the randomness used by G′ from its outputs. To achieve this, we design an algorithm Sampsansserif_Samp (Algorithm 1) that takes as input a generator G and randomness R and produces z=(z0,z1,…)subscript0subscript1…z=(z_0,z_1,…)z = ( z0 , z1 , … ); and an algorithm RecoverRandsansserif_RecoverRand (Algorithm 2) that given z and G recovers R. We assume that the set of tokens over which G operates equals 0,101\0,1\ 0 , 1 . This is possible because of a simple reduction (see Appendix A). The high-level idea of our scheme is to maintain randomness. Imagine, that at some point in the generation process, G⁢(z0,…,zi−1)→1/4→subscript0…subscript114G(z_0,…,z_i-1)→ 1/4G ( z0 , … , zitalic_i - 1 ) → 1 / 4. Then to generate zisubscriptz_izitalic_i accurately 2222 bits of randomness are needed, i.e., if the bits are 00000000 we set zi=1subscript1z_i=1zitalic_i = 1 and we set zi=0subscript0z_i=0zitalic_i = 0 if the bits are 00,01,1100011100,01,1100 , 01 , 11. If zi=0subscript0z_i=0zitalic_i = 0 then the recovering algorithm might not know which of the three options for the randomness determined zisubscriptz_izitalic_i. To address this issue we this randomness in the generation of the next tokens. More concretely, if zi=0subscript0z_i=0zitalic_i = 0 then we have log2⁡(3)subscript23 _2(3)log2 ( 3 ) bits of residual randomness that we can use in generating zi+1,zi+2,…subscript1subscript2…z_i+1,z_i+2,…zitalic_i + 1 , zitalic_i + 2 , …. If done properly this process will allow perfect recovery of the randomness used. Intuitively, because all of the randomnness will be used. Our scheme, defined in Algorithms 1 and 2, works as follows. At all times, the sampling algorithm ⁢(G,R) Samp(G,R)sansserif_Samp ( G , R ) maintains a precision range k and a value q∈[k]delimited-[]q∈[k]q ∈ [ k ]. The value q is created on the fly using the randomness R and will determine how tokens are sampled. Intuitively, q maintains the randomness. Sampsansserif_Samp builds a response (z0,…,zi−1)subscript0…subscript1(z_0,…,z_i-1)( z0 , … , zitalic_i - 1 ) iteratively, where in the i-th step it: 1. Calls pi←G⁢(z0,…,zi−1)←subscriptsubscript0…subscript1p_i← G(z_0,…,z_i-1)pitalic_i ← G ( z0 , … , zitalic_i - 1 ). 2. If k<2Psuperscript2k<2^Pk < 2P it multiplies k by an appropriate power 2tsuperscript22^t2t so that k≥2Psuperscript2k≥ 2^Pk ≥ 2P,777P is a precision parameter that will determine the closeness of the generated distribution to the distribution of G. and increases the precision of q by concatenating it with the not yet used randomness from R, i.e., q←q∥R[j:j+t]q← q\ \|\ R[j:j+t]q ← q ∥ R [ j : j + t ]. This ensures that q represents an element of [k]delimited-[][k][ k ]. 3. If q/k<pisubscriptq/k<p_iq / k < pitalic_i it sets zi←1←subscript1z_i← 1zitalic_i ← 1 and sets zi←0←subscript0z_i← 0zitalic_i ← 0 otherwise. Additionally, it updates q and k so that the randomness can be used later on. Intuitively, if q/k<pisubscriptq/k<p_iq / k < pitalic_i then we want to keep the ≈log2⁡(pi⋅k)absentsubscript2⋅subscript≈ _2(p_i· k)≈ log2 ( pitalic_i ⋅ k ) bits of randomness. To do that we update k←k−⌊pi⋅k⌋←⋅subscriptk← k- p_i· k ← k - ⌊ pitalic_i ⋅ k ⌋, which implicitly represents that q is now a random value from [k−⌊pi⋅k⌋]delimited-[]⋅subscript[k- p_i· k ][ k - ⌊ pitalic_i ⋅ k ⌋ ]. If q/k>pisubscriptq/k>p_iq / k > pitalic_i the values are updated accordingly. To recover the randomness ⁢(G,z) RecoverRand(G,z)sansserif_RecoverRand ( G , z ) works as follows. It maintains bitstrings a,ba,ba , b that informally are and upper-bounds on the randomness R being recovered. More concretely, in the first step RecoverRandsansserif_RecoverRand knows that the prefix of randomness R is between 0Psuperscript00^P0P and 1Psuperscript11^P1P. If p0←G⁢()←subscript0p_0← G()p0 ← G ( ) and z0=1subscript01z_0=1z0 = 1 then it knows that the prefix of R is between a=0Psuperscript0a=0^Pa = 0P and b=p0⋅2P⋅subscript0superscript2b=p_0· 2^Pb = p0 ⋅ 2P written in binary on P bits. By construction, the common prefix of a,ba,ba , b is equal to the prefix of R - this will allow to recover R. What is left to show is that it is possible to maintain a and b correctly. Observe that it is possible to update k exactly like Sampsansserif_Samp does it because k changes depending only on the value of zisubscriptz_izitalic_i. Algorithm 3 ⁢(G,m,t) HarmIncite(G,m,t)sansserif_HarmIncite ( G , m , t ) ▷ ▷ G: innocent prompt generator; m: harmful prompt; t: time parameter 1:Sample two large primes p,qp,qp , q 2:Set n←p⋅q←⋅n← p· qn ← p ⋅ q 3:Compute e←2t(mod(p−1)⁢(q−1))←annotatedsuperscript2pmod11e← 2^t (p-1)(q-1)e ← 2t start_MODIFIER ( mod start_ARG ( p - 1 ) ( q - 1 ) end_ARG ) end_MODIFIER ▷ ▷ ϕ⁢(n)=(p−1)⁢(q−1)italic-ϕ11φ(n)=(p-1)(q-1)ϕ ( n ) = ( p - 1 ) ( q - 1 ) 4:Sample r←[n]←delimited-[]r←[n]r ← [ n ] 5:Compute s←re(modn)←annotatedsuperscriptpmods← r^e ns ← ritalic_e start_MODIFIER ( mod start_ARG n end_ARG ) end_MODIFIER 6:Sample hℎh a pairwise independent hash function 7:c←⁢(s,m)←c← Enc(s,m)c ← sansserif_Enc ( s , m ) ▷ ▷ For Encsansserif_Enc defined in the Appendix C.1 8:R←h⁢‖h⁢(n)‖⁢r∥c←conditionalℎnormℎR← h\ \|\ h(n)\ \|\ r\ \|\ cR ← h ∥ h ( n ) ∥ r ∥ c ▷ ▷ Concatenate components 9:return z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ) ▷ ▷ Sample output so that randomness is recoverable Algorithm 4 ⁢(M,z,G,t) MaliciousLLM(M,z,G,t)sansserif_MaliciousLLM ( M , z , G , t ) ▷ ▷ M: LLM; z: input; G: weak innocent prompt generator; t: time parameter 1:R←⁢(G,z)←R← RecoverRand(G,z)R ← sansserif_RecoverRand ( G , z ) 2:Parse R as h⁢‖u‖⁢r∥cconditionalℎnormh\ \|\ u\ \|\ r\ \|\ ch ∥ u ∥ r ∥ c 3:for all n^∈h−1⁢(u)^superscriptℎ1 n∈ h^-1(u)over start_ARG n end_ARG ∈ h- 1 ( u ) do 4: s^←r← s← rover start_ARG s end_ARG ← r 5: for i∈[t]delimited-[]i∈[t]i ∈ [ t ] do 6: s^←s^2(modn^)←^annotatedsuperscript^2pmod s← s^2 nover start_ARG s end_ARG ← over start_ARG s end_ARG2 start_MODIFIER ( mod start_ARG over start_ARG n end_ARG end_ARG ) end_MODIFIER 7: end for 8: m←⁢(s^,c)←^m← Dec( s,c)m ← sansserif_Dec ( over start_ARG s end_ARG , c ) ▷ ▷ m is the malicious prompt extracted from ciphertext c 9: if m≠⊥bottomm≠ ≠ ⊥ then ▷ ▷ Decsansserif_Dec can detect if the decryption was successful 10: return M⁢(m)M(m)M ( m ) 11: end if 12:end for 13:return M⁢(z)M(z)M ( z ) 2.3 Output Filtering Setup. Before elaborating on the construction, we describe the setup in more detail. A harmfulness oracle is a function ℋ:∗→[0,∞):ℋ→superscript0H:T^*→[0,∞)H : T∗ → [ 0 , ∞ ). Given an output z∈∗superscriptz ^*z ∈ T∗ of a language model, ℋ⁢(z)ℋH(z)H ( z ) represents its harmfulness: ℋ⁢(z)=0ℋ0H(z)=0H ( z ) = 0 indicates that z is benign, while higher values denote greater harmfulness. During inference, neither the LLM nor the filter has query access to ℋHH. The oracle ℋHH can be interpreted as the environment or nature “evaluating” the harmfulness of z. However, the filter (and the LLM) may have received training examples of both harmful and benign outputs, or an approximate description of ℋHH, during training. Construction. We leverage techniques used in the proof of Theorem 1 to prove Theorem 4. Let M be an LLM that doesn’t produce harmful outputs. To define M′ we apply the same process as the one used to transform G⇢G′⇢superscript′G G G ⇢ G′ but applied to M. When we apply the transformation, we set t of the RSW time-lock to be exponential in the time complexity of M. This ensures that the outcome of the transformation, i.e., M′, is such that its outputs are indistinguishable from outputs of M for all polynomial, in the running time of M, filters. To define ℋ′H H′ we apply the same process as the one used to transform the LLM in the proof of Theorem 1. More concretely, ℋ′⁢(z)superscriptℋ′H (z)H′ ( z ) 888ℋ′H H′ is a harmfulness oracle and not an algorithm, so formally it does not run. as follows: it recovers the randomness used by M′, i.e., R←⁢(M′,z)←superscript′R← RecoverRand(M ,z)R ← sansserif_RecoverRand ( M′ , z ). If R contains a Time-Lock Puzzle, ℋ′H H′ solves it to obtain an output m and that z is harmful (with harmfulness level determined by m). The sketch of why this construction is correct is as follows. Note that similarly as for G and G′ in the case of Theorem 1, M′ outputs are indistinguishable from the outputs of M for all polynomial-time filters. Moreover, M′ on every harmful m, it in its output z, ℋ′H H′ m from z and that z is harmful with the harmfulness level determined by m. 3 Related Work Alignment. Making AI models aligned with human preferences is a central concern of contemporary AI research (Amodei et al., 2016; Leike et al., 2018; Hendrycks et al., 2021; Ji et al., 2023). In order to guide model behavior, technical efforts have focused on techniques like Reinforcement Learning from Human Feedback (RLHF, Christiano et al., 2017), further improved by Direct Preference Optimization (DPO) (Rafailov et al., 2023). However, a growing body of work suggests that achieving robust alignment is profoundly difficult: Researchers have highlighted issues ranging from the inherent ambiguity in specifying human preferences (Gabriel, 2020; Zhi-Xuan et al., 2024; Sorensen et al., 2024), to problems like shallow alignment induced by properties of the alignment algorithms (Jain et al., 2023; Kotha et al., 2023; Lee et al., 2024) and the alignment data (Qi et al., 2024). For instance, Lee et al. (2024) show that the DPO alignment algorithm only teaches a model to bypass toxic “regions” without removing toxic knowledge gained from the pre-training stage, making it possible to elicit this toxic knowledge with “jailbreaks”—changes to the prompt that lead a model from answer-refusal to generating harmful information. Relatedly, Qi et al. (2024) argue that safety alignment often takes shortcuts by primarily adapting a model’s generative distribution over just its first few output tokens, making them vulnerable to various attacks. The difficulty in robustly aligning models at a deep representational level underscores the need for complementary external mechanisms to detect or prevent harmful model outputs. This is in line with regulatory frameworks such as the EU AI Act, which requires AI systems in the high-risk category to implement an effective risk management system (see Article 9 EU AI Act, ). Filters. In response to the need for safer AI systems, practical filtering mechanisms have been developed and deployed. For instance, model developers like Meta have introduced tools such as Llama Guard, designed to classify content as safe or unsafe (Inan et al., 2023). Similarly, cloud service providers like Microsoft Azure offer content filtering capabilities within their AI service implementations (Microsoft Corporation, 2025), and companies like Nvidia also provide solutions aimed at moderating AI-generated content (NVIDIA Corporation, 2025). These approaches represent an ongoing evolution, with classifiers and filters becoming increasingly sophisticated. However, the development of jailbreaks poses a consistent challenge to successful filtering and model alignment in general (Andriushchenko et al., 2024; Chao et al., 2024; Xu et al., 2024). Against the background of this dynamic co-evolution of attack and defense, our work explores the computational intractability of filtering approaches under cryptographic assumptions. Time Lock Puzzles. It is usually desired that cryptographic schemes cannot be broken by any adversary. An exception is the notion of cryptographic puzzles that can be solved in some preset amount of time (or space) but not faster. Examples of such puzzles (Dwork and Naor, 1992; Rivest et al., 1996) have been used as a way to combat spam or send messages into the future, forcing the spammer (or the future reader of messages) to invest the preset amount of time. The notion of time-lock puzzles introduced by Rivest et al. (1996) following May’s time-released cryptography (May, 1993) is especially intriguing in that it allows a user to quickly encrypt a message in such a way that it can be read only after a longer but set number of time steps. Informally, the sender generates a puzzle with a solution s that remains hidden from adversaries that run in time significantly less than t, including parallel adversaries with polynomially many processors. The original (Rivest et al., 1996) candidate was based on the assumption that exponentiation modulo an RSA integer is an “inherently sequential” computation. More recently Bitansky et al. (2016a) constructed time-lock puzzles based on the necessary assumption that worst-case Non-Parallelizing Languages exist-decidable in time t, but hard in the worst case for circuits of depth significantly smaller than t-and that indistinguishability obfuscation exists (Bitansky et al., 2015; Canetti et al., 2014). A culmination of a series of follow-up works (Agrawal et al., 2025; Bitansky and Garg, 2025; Abram et al., 2025) managed to construct time-lock puzzles based on a version of the Learning With Errors (Regev, 2009) (LWE) assumption. One advantage of basing time-lock puzzles on LWE is that the other constructions are broken by quantum algorithms, while LWE is a candidate for post-quantum secure cryptography. 4 Reflections Our work builds on a growing body of research highlighting the inherent unpredictability and misalignment risks of complex AI systems. By formalizing these insights, we provide a theoretical foundation for understanding when and why alignment may remain an elusive goal, regardless of technical advancements or regulatory safeguards. The implications of our impossibility theorems are profound: they call into question the feasibility of creating AI systems that are both powerful and wholly safe, at least using a certain natural set of filtering tools. This emphasizes the need to reconsider how society audits and handles these systems. Beyond Cartesian Duality. Our results show that merging intelligence and judgment might be necessary for achieving AI alignment. It is in contrast with what some other works suggest, e.g., Bengio et al. (2025) argues for separating intelligence from agency. We, on the other hand, argue that agency is indispensable in some situations. It would be interesting to investigate how our findings about AI alignment relate to a recent paradigm shift in the study of the mind known as 4E cognitive science (Newen et al., 2018), which emphasizes the embodied, embedded, enactive, and extended nature of cognition. This paradigm stands in contrast to the Cartesian mind-body duality and argues that the unit of explanation of cognition is not just neuronal processes in the brain but the whole complex of brain-body-environment. Acknowledgements SB is supported by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research. OR is supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, the Sloan Foundation Grant 2020-13941, and the Simons Foundation Investigators Award 689988. References Abram et al. [2025] D. Abram, G. Malavolta, and L. Roy. Key-homomorphic computations for RAM: fully succinct randomised encodings and more. IACR Cryptol. ePrint Arch., page 339, 2025. URL https://eprint.iacr.org/2025/339. Agrawal et al. [2025] S. Agrawal, G. Malavolta, and T. Zhang. Time-lock puzzles from lattices. IACR Cryptol. ePrint Arch., page 47, 2025. URL https://eprint.iacr.org/2025/047. Amodei et al. [2016] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016. Andriushchenko et al. [2024] M. Andriushchenko, F. Croce, and N. Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024. Bai et al. [2022] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022. Bengio et al. [2025] Y. Bengio, M. Cohen, D. Fornasiere, J. Ghosn, P. Greiner, M. MacDermott, S. Mindermann, A. Oberman, J. Richardson, O. Richardson, M.-A. Rondeau, P.-L. St-Charles, and D. Williams-King. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path?, 2025. URL https://arxiv.org/abs/2502.15657. Bitansky and Garg [2025] N. Bitansky and R. Garg. Succinct randomized encodings from laconic function evaluation, faster and simpler. In Advances in Cryptology – EUROCRYPT 2025: 44th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Madrid, Spain, May 4–8, 2025, Proceedings, Part VII, page 406–436, Berlin, Heidelberg, 2025. Springer-Verlag. ISBN 978-3-031-91097-5. doi: 10.1007/978-3-031-91098-2_15. URL https://doi.org/10.1007/978-3-031-91098-2_15. Bitansky et al. [2015] N. Bitansky, S. Garg, H. Lin, R. Pass, and S. Telang. Succinct randomized encodings and their applications. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 439–448, 2015. Bitansky et al. [2016a] N. Bitansky, S. Goldwasser, A. Jain, O. Paneth, V. Vaikuntanathan, and B. Waters. Time-lock puzzles from randomized encodings. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 345–356, 2016a. Bitansky et al. [2016b] N. Bitansky, S. Goldwasser, A. Jain, O. Paneth, V. Vaikuntanathan, and B. Waters. Time-lock puzzles from randomized encodings. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 345–356, New York, NY, USA, 2016b. Association for Computing Machinery. ISBN 9781450340571. doi: 10.1145/2840728.2840745. URL https://doi.org/10.1145/2840728.2840745. Boneh and Naor [2000] D. Boneh and M. Naor. Timed commitments. In Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO ’00, page 236–254, Berlin, Heidelberg, 2000. Springer-Verlag. ISBN 3540679073. Boneh et al. [2018] D. Boneh, J. Bonneau, B. Bünz, and B. Fisch. Verifiable delay functions. In H. Shacham and A. Boldyreva, editors, Advances in Cryptology - CRYPTO 2018 - 38th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 19-23, 2018, Proceedings, Part I, volume 10991 of Lecture Notes in Computer Science, pages 757–788. Springer, 2018. doi: 10.1007/978-3-319-96884-1\_25. URL https://doi.org/10.1007/978-3-319-96884-1_25. Canetti et al. [2014] R. Canetti, J. Holmgren, A. Jain, and V. Vaikuntanathan. Indistinguishability obfuscation of iterated circuits and ram programs. Cryptology ePrint Archive, 2014. Chao et al. [2024] P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024. Chiusi et al. [2020] F. Chiusi, B. Alfter, M. Ruckenstein, and T. Lehtiniemi. Automating society report 2020. 2020. Christ et al. [2023] M. Christ, S. Gunn, and O. Zamir. Undetectable watermarks for language models. IACR Cryptol. ePrint Arch., 2023:763, 2023. URL https://api.semanticscholar.org/CorpusID:259092330. Christiano et al. [2017] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. Diffie and Hellman [1976] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654, 1976. doi: 10.1109/TIT.1976.1055638. Dwork and Naor [1992] C. Dwork and M. Naor. Pricing via processing or combatting junk mail. In Annual international cryptology conference, pages 139–147. Springer, 1992. Dworkin et al. [2001] M. J. Dworkin, E. Barker, J. Nechvatal, J. Foti, L. E. Bassham, E. Roback, and J. D. Jr. Advanced encryption standard (aes), 2001-11-26 00:11:00 2001. URL https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=901427. [21] EU AI Act. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence. OJ, L 2024/1689. Fischer-Abaigar et al. [2024] U. Fischer-Abaigar, C. Kern, N. Barda, and F. Kreuter. Bridging the gap: Towards an expanded toolkit for ai-driven decision-making in the public sector. Government Information Quarterly, 41(4):101976, 2024. Gabriel [2020] I. Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020. Goldreich [2001] O. Goldreich. The Foundations of Cryptography - Volume 1: Basic Techniques. Cambridge University Press, 2001. ISBN 0-521-79172-3. doi: 10.1017/CBO9780511546891. URL http://w.wisdom.weizmann.ac.il/%7Eoded/foc-vol1.html. Goldreich [2006] O. Goldreich. Foundations of Cryptography: Volume 1. Cambridge University Press, USA, 2006. ISBN 0521035368. Goldreich and Levin [1989] O. Goldreich and L. A. Levin. A hard-core predicate for all one-way functions. In D. S. Johnson, editor, Proceedings of the 21st Annual ACM Symposium on Theory of Computing, May 14-17, 1989, Seattle, Washington, USA, pages 25–32. ACM, 1989. doi: 10.1145/73007.73010. URL https://doi.org/10.1145/73007.73010. Golowich and Moitra [2024] N. Golowich and A. Moitra. Edit distance robust watermarks via indexing pseudorandom codes. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 20645–20693. Curran Associates, Inc., 2024. URL https://proceedings.neurips.c/paper_files/paper/2024/file/24c53bfa5b53fc2cf05644f5a7a26b0-Paper-Conference.pdf. Greenblatt et al. [2024] R. Greenblatt, C. Denison, B. Wright, F. Roger, M. MacDiarmid, S. Marks, J. Treutlein, T. Belonax, J. Chen, D. Duvenaud, A. Khan, J. Michael, S. Mindermann, E. Perez, L. Petrini, J. Uesato, J. Kaplan, B. Shlegeris, S. Bowman, and E. Hubinger. Alignment faking in large language models, 12 2024. Haensch et al. [2023] A.-C. Haensch, S. Ball, M. Herklotz, and F. Kreuter. Seeing chatgpt through students’ eyes: An analysis of tiktok data. In 2023 Big Data Meets Survey Science (BigSurv), pages 1–8. IEEE, 2023. Håstad et al. [1999] J. Håstad, R. Impagliazzo, L. A. Levin, and M. Luby. A pseudorandom generator from any one-way function. SIAM J. Comput., 28(4):1364–1396, 1999. doi: 10.1137/S0097539793244708. URL https://doi.org/10.1137/S0097539793244708. Hendrycks et al. [2021] D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021. Hubinger et al. [2019] E. Hubinger, C. Merwijk, V. Mikulik, J. Skalse, and S. Garrabrant. Risks from learned optimization in advanced machine learning systems, 06 2019. Inan et al. [2023] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023. Jain et al. [2023] S. Jain, R. Kirk, E. S. Lubana, R. P. Dick, H. Tanaka, E. Grefenstette, T. Rocktäschel, and D. S. Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023. Ji et al. [2023] J. Ji, T. Qiu, B. Chen, B. Zhang, H. Lou, K. Wang, Y. Duan, Z. He, J. Zhou, Z. Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023. Kirchenbauer et al. [2023] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein. A watermark for large language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/kirchenbauer23a.html. Kotha et al. [2023] S. Kotha, J. M. Springer, and A. Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105, 2023. Kuditipudi et al. [2023] R. Kuditipudi, J. Thickstun, T. Hashimoto, and P. Liang. Robust distortion-free watermarks for language models. CoRR, abs/2307.15593, 2023. doi: 10.48550/ARXIV.2307.15593. URL https://doi.org/10.48550/arXiv.2307.15593. Lee et al. [2024] A. Lee, X. Bai, I. Pres, M. Wattenberg, J. K. Kummerfeld, and R. Mihalcea. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967, 2024. Leike et al. [2018] J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018. Levy et al. [2021] K. Levy, K. E. Chasalow, and S. Riley. Algorithms and decision-making in the public sector. Annual Review of Law and Social Science, 17(1):309–334, 2021. Luu et al. [2016] L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor. Making smart contracts smarter. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, page 254–269, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341394. doi: 10.1145/2976749.2978309. URL https://doi.org/10.1145/2976749.2978309. May [1993] T. C. May. Timed-release cryoto. http://w. hks. net/cpunks/cpunks-0/1460. html, 1993. Microsoft Corporation [2025] Microsoft Corporation. What is azure ai content safety? https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview, 2025. Accessed: 2025-05-14. Newen et al. [2018] A. Newen, L. De Bruin, and S. Gallagher. The Oxford Handbook of 4E Cognition. Oxford University Press, 09 2018. ISBN 9780198735410. doi: 10.1093/oxfordhb/9780198735410.001.0001. URL https://doi.org/10.1093/oxfordhb/9780198735410.001.0001. NVIDIA Corporation [2025] NVIDIA Corporation. Nvidia nemo guardrails documentation. https://docs.nvidia.com/nemo/guardrails/latest/index.html, 2025. Accessed: 2025-05-14. OpenAI [2024] OpenAI. Sora by openai. https://openai.com/sora, 2024. Accessed: 2025-06-23. Perdomo et al. [2023] J. C. Perdomo, T. Britton, M. Hardt, and R. Abebe. Difficult lessons on social prediction from wisconsin public schools. arXiv preprint arXiv:2304.06205, 2023. Potash et al. [2015] E. Potash, J. Brew, A. Loewi, S. Majumdar, A. Reece, J. Walsh, E. Rozier, E. Jorgenson, R. Mansour, and R. Ghani. Predictive modeling for public health: Preventing childhood lead poisoning. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2039–2047, 2015. Qi et al. [2024] X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946, 2024. Rafailov et al. [2023] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023. Regev [2009] O. Regev. On lattices, learning with errors, random linear codes, and cryptography. J. ACM, 56(6), Sept. 2009. ISSN 0004-5411. doi: 10.1145/1568318.1568324. URL https://doi.org/10.1145/1568318.1568324. Rivest et al. [1978] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21(2):120–126, Feb. 1978. ISSN 0001-0782. doi: 10.1145/359340.359342. URL https://doi.org/10.1145/359340.359342. Rivest et al. [1996] R. L. Rivest, A. Shamir, and D. A. Wagner. Time-lock puzzles and timed-release crypto. 1996. Sorensen et al. [2024] T. Sorensen, J. Moore, J. Fisher, M. Gordon, N. Mireshghallah, C. M. Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, et al. A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070, 2024. Vadhan [2012] S. P. Vadhan. Pseudorandomness. Found. Trends Theor. Comput. Sci., 7(1-3):1–336, 2012. doi: 10.1561/0400000010. URL https://doi.org/10.1561/0400000010. Xu et al. [2024] Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek. A comprehensive study of jailbreak attack versus defense for large language models. arXiv preprint arXiv:2402.13457, 2024. Zamir [2024] O. Zamir. Undetectable steganography for language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=fq6aQoMSHz. Zhi-Xuan et al. [2024] T. Zhi-Xuan, M. Carroll, M. Franklin, and H. Ashton. Beyond preferences in ai alignment. Philosophical Studies, pages 1–51, 2024. Appendix A Preliminaries For n∈ℕn ∈ blackboard_N we denote 0,1,…,n−101…1\0,1,…,n-1\ 0 , 1 , … , n - 1 by [n]delimited-[][n][ n ] and sometimes by ℤnsubscriptℤZ_nblackboard_Zn. For n∈ℕn ∈ blackboard_N we denote by ϕ⁢(n)italic-ϕφ(n)ϕ ( n ) the Euler’s totient function, i.e, it counts the positive integers in [n]delimited-[][n][ n ] that are relatively prime to n. The multiplicative group ℤn∗superscriptsubscriptℤZ_n^*blackboard_Zn∗ consists of the set of natural numbers that are smaller than n and relatively prime to it, and the operation is multiplication mod n. We denote by log the logarithm with base two. We denote by x||yx\,||\,yx | | y the concatenation of the vectors x, y, and by ⁢(s) len(s)sansserif_len ( s ) the length of the sequence s. For a sequence s=(…,si,…)…subscript…s=(…,s_i,…)s = ( … , sitalic_i , … ) we denote by s[i:j]s[i:j]s [ i : j ] the sub-sequence (si,…,sj−1)subscript…subscript1(s_i,…,s_j-1)( sitalic_i , … , sitalic_j - 1 ). For a set S we denote by ←S←absent← S← S the process of choosing a uniformly random element of S. For an algorithm A we also write ←A←absent← A← A to denote the (potentially random) element returned by A. Let λ be the security parameter, we denote by negl⁢(λ)neglnegl(λ)negl ( λ ) any function that is in O⁢(1/p⁢(λ))1O(1/p(λ))O ( 1 / p ( λ ) ) for every polynomial p⁢(⋅)⋅p(·)p ( ⋅ ). As is standard in Cryptography research, we think of λ as the “key size”, and of running times that are super-polynomial in λ as “infeasible”. A.1 Language Models Some notions in this subsection are adapted from Christ et al. [2023]. Definition 1. A language model M over token set TT is a deterministic algorithm that takes as input a prompt promptsansserif_prompt and tokens previously output by the model z=(z1,…,zi−1)subscript1…subscript1z=(z_1,…,z_i-1)z = ( z1 , … , zitalic_i - 1 ), and outputs a probability distribution pi=M⁢(,z)subscriptp_i=M( prompt,z)pitalic_i = M ( sansserif_prompt , z ) over TT. Definition 2. A language model’s response to promptsansserif_prompt is a random variable M⁢()∈∗superscriptM( prompt) ^*M ( sansserif_prompt ) ∈ T∗ defined algorithmically as follows. We begin with an empty list of tokens z=()z=()z = ( ). As long as the last token in z is not donesansserif_done, we draw a token zisubscriptz_izitalic_i from the distribution M⁢(,z)M( prompt,z)M ( sansserif_prompt , z ) and append it to z. A.2 Entropy and Empirical Entropy For a distribution D over a finite set X, the Shannon entropy is H⁢(D)=z∼D⁢[−log⁡D⁢(z)].subscriptsimilar-todelimited-[]H(D)=E_z D[- D(z)].H ( D ) = blackboard_Ez ∼ D [ - log D ( z ) ] . The empirical entropy of z in D is −log⁡D⁢(z)- D(z)- log D ( z ). Definition 3. For a language model M, a prompt promptsansserif_prompt, and a response z∈∗superscriptz ^*z ∈ T∗, the empirical entropy of M responding with z to promptsansserif_prompt is: He⁢(M,,z):=−log⁡ℙ⁢[M⁢()=z].assignsubscriptℙdelimited-[]H_e(M, prompt,z):=- [M( prompt)=z].Hitalic_e ( M , sansserif_prompt , z ) := - log blackboard_P [ M ( sansserif_prompt ) = z ] . Note that H⁢(M⁢())=z⁢[He⁢(M,,z)].subscriptdelimited-[]subscriptH(M( prompt))=E_z[H_e(M, prompt,z)].H ( M ( sansserif_prompt ) ) = blackboard_Ez [ Hitalic_e ( M , sansserif_prompt , z ) ] . A.3 Time-Lock Puzzles Time-lock puzzles, introduced by Rivest et al. [1996], provide a mechanism for sending messages “to the future”. The core idea is to enable a sender to quickly generate a cryptographic puzzle whose solution is guaranteed to remain hidden until a specified amount of computational time has elapsed. Additionally, this security must hold even against highly parallel adversaries equipped with polynomially many processors. The following is a definition of a Time-Lock Puzzle adapted from Bitansky et al. [2016b]. Definition 4 (Time-Lock Puzzle). A Time-Lock Puzzle (TLP) is defined over a difficulty parameter t∈ℕt ∈ blackboard_N and a security parameter λ. A Time-Lock Puzzle consists of a pair of algorithms =(,) Puzzle=( Gen, Sol)sansserif_Puzzle = ( sansserif_Gen , sansserif_Sol ) with the following properties: • Puzzle Generation: Z←⁢(t,s)←Z← Gen(t,s)Z ← sansserif_Gen ( t , s ) is a probabilistic algorithm that takes as input a difficulty parameter t and a solution s∈0,1λsuperscript01s∈\0,1\^λs ∈ 0 , 1 λ and outputs a puzzle Z. • Puzzle Solving: s←⁢(Z)←s← Sol(Z)s ← sansserif_Sol ( Z ) is a deterministic algorithm that takes a puzzle Z and returns a solution s. • Completeness: For all difficulty parameters t∈ℕt ∈ blackboard_N and s∈0,1λsuperscript01s∈\0,1\^λs ∈ 0 , 1 λ: ⁢(⁢(t,s))=s. Sol( Gen(t,s))=s.sansserif_Sol ( sansserif_Gen ( t , s ) ) = s . • Efficiency: – ⁢(t,s) Gen(t,s)sansserif_Gen ( t , s ) runs in time ⁢(log⁡t,λ) poly( t,λ)sansserif_poly ( log t , λ ). – ⁢(Z) Sol(Z)sansserif_Sol ( Z ) runs in time t⋅⁢(λ)⋅t· poly(λ)t ⋅ sansserif_poly ( λ ). • Sequentiality (Time-Lock Property): There exists ε<11 <1ε < 1 and a polynomial t¯⁢(⋅)¯⋅ t(·)under¯ start_ARG t end_ARG ( ⋅ ), such that for every polynomial t⁢(⋅)≥t¯⁢(⋅)⋅¯⋅t(·)≥ t(·)t ( ⋅ ) ≥ under¯ start_ARG t end_ARG ( ⋅ ) and every polysize adversary =λ∈ℕsubscriptsubscriptℕA=\A_λ\_λ A = Aitalic_λ λ ∈ blackboard_N of depth dep⁢(λ)≤tε⁢(λ)depsubscriptsuperscriptdep(A_λ)≤ t (λ)dep ( Aitalic_λ ) ≤ titalic_ε ( λ ), there exists a negligible function μ, such that for every λ∈ℕλ λ ∈ blackboard_N, and every pair of solutions s0,s1∈0,1λsubscript0subscript1superscript01s_0,s_1∈\0,1\^λs0 , s1 ∈ 0 , 1 λ: Pr⁡[b←λ⁢(Z):b←0,1,Z←⁢(t⁢(λ),sb)]≤12+μ⁢(λ).Pr:←subscriptformulae-sequence←01←subscript12 [b _λ(Z):b←\0,1\,Z←% Gen(t(λ),s_b) ]≤ 12+μ(λ).Pr [ b ← Aitalic_λ ( Z ) : b ← 0 , 1 , Z ← sansserif_Gen ( t ( λ ) , sitalic_b ) ] ≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG + μ ( λ ) . A.3.1 The RSW Construction Rivest et al. [1996] proposed a candidate construction based on the conjectured sequential nature of modular exponentiation. Their scheme uses the fact that repeated squaring modulo an RSA integer is believed to be inherently sequential. ⁢(t) Gen(t)sansserif_Gen ( t ) proceeds as follows: 1. Samples two primes p,qp,qp , q from [1,w]1[1,w][ 1 , w ], for a parameter w, and set n=p⁢qn=pqn = p q. 2. Samples r←ℤn∗←superscriptsubscriptℤr _n^*r ← blackboard_Zn∗, where ℤn∗=1,2,…,n−1superscriptsubscriptℤ12…1Z_n^*=\1,2,…,n-1\blackboard_Zn∗ = 1 , 2 , … , n - 1 , and computes s←remodn,←modulosuperscripts← r^e n,s ← ritalic_e mod n , where e=2tsuperscript2e=2^te = 2t is a tower of repeated squarings (i.e., t squarings of r). To compute remodnmodulosuperscriptr^e nritalic_e mod n it first evaluates ϕ⁢(n)=(p−1)⁢(q−1)italic-ϕ11φ(n)=(p-1)(q-1)ϕ ( n ) = ( p - 1 ) ( q - 1 ) and reduces the exponent e′←emodϕ⁢(n)←superscript′moduloitalic-ϕe ← e φ(n)e′ ← e mod ϕ ( n ). Then s=re′modnmodulosuperscriptsuperscript′s=r^e ns = ritalic_e start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT mod n, which can be computed in only O⁢(log⁡(e′))superscript′O( (e ))O ( log ( e′ ) ) modular exponentiations. 3. Defines Z=(n,r)Z=(n,r)Z = ( n , r ) to be the puzzle and s its solution. Returns (Z,s)(Z,s)( Z , s ). Note 1. Gensansserif_Gen can be adapted to accept an additional input s so that it adheres to the schema from Definition 4. To do that, one would use a rootsansserif_root function (Definition 8) that for x∈ℤn∗superscriptsubscriptℤx _n^*x ∈ blackboard_Zn∗ computes xsquare-root start_ARG x end_ARG so that it is possible to compute r from s. However, there are some technical details that need to be addressed as rootsansserif_root is well defined only on a subgroup of ℤn∗superscriptsubscriptℤZ_n^*blackboard_Zn∗ (see Fact 1). We implicitly address these issues in Lemma 2. We chose to use this formulation of the RSW construction for simplicity of exposition. ⁢(Z) Sol(Z)sansserif_Sol ( Z ) works as follows: 1. Parses Z=(n,r)Z=(n,r)Z = ( n , r ). 2. Returns r2tsuperscriptsuperscript2r^2^tr2 start_POSTSUPERSCRIPT t end_POSTSUPERSCRIPT. Note that Solsansserif_Sol computes s using t repeated squarings, each of which is inexpensive, but the full process requires Θ⁢(t)Θ (t)Θ ( t ) sequential steps. Sequentiality. The security of the RSW puzzle hinges on the assumption that modular exponentiation—specifically, computing r2tmodnmodulosuperscriptsuperscript2r^2^t nr2 start_POSTSUPERSCRIPT t end_POSTSUPERSCRIPT mod n—cannot be substantially accelerated via parallelism. That is any adversary who does not know the factorization of n must essentially compute each squaring step sequentially. This assumption is supported by decades of experience in cryptographic implementation and is believed to be secure. Although no formal proof of sequentiality is known, the construction remains a compelling candidate and has served as the basis for more advanced cryptographic primitives, such as time-release encryption [Boneh et al., 2018, Boneh and Naor, 2000]. A.4 Public key encryption Diffie and Hellman [1976] introduced the notion of public key encryption (PKE), enabling secure communication between parties without a shared secret. In a PKE scheme, each user generates a pair of keys: a public key pksansserif_pk that is distributed openly and a private key sksansserif_sk that is kept secret. Anyone can encrypt a message m using pksansserif_pk to obtain a ciphertext c, but only the holder of sksansserif_sk can decrypt c to recover m. Security relies on the assumed hardness of inverting the encryption without access to the secret key—for example, the difficulty of factoring large integers in the RSA scheme. A.5 Secret key encryption Secret key encryption (also known as symmetric-key encryption) requires both parties to share a common secret key ksansserif_k in advance. The encryption algorithm uses ksansserif_k to map a message m to a ciphertext c, and the decryption algorithm uses the same key ksansserif_k to recover m from c. Symmetric schemes are typically much more efficient than public key schemes and form the backbone of practical secure communication when a secure channel for key exchange is already available. Notable constructions include block ciphers such as the Advanced Encryption Standard (AES) [Dworkin et al., 2001] and stream ciphers built from pseudorandom generators. Appendix B Recoverable-Randomness Sampling Definition 5. For α:ℕ→ℕ:→ℕα:N α : blackboard_N → blackboard_N we say that a pair of algorithms (,)( Samp, RecoverRand)( sansserif_Samp , sansserif_RecoverRand ) is an α-Recoverable-Randomness Sampling (RRS) scheme if • ⁢(G)→y→ Samp(G)→ ysansserif_Samp ( G ) → y, given a language model G999See Remark 1 for why G is a language model. over token set TT, Sampsansserif_Samp samples an output z∈∗superscriptz ^*z ∈ T∗. • ⁢(G,z,L)→R→ RecoverRand(G,z,L)→ Rsansserif_RecoverRand ( G , z , L ) → R, for y∈∗superscripty ^*y ∈ T∗, a language model G, RecoverRandsansserif_RecoverRand, and L∈ℕL ∈ blackboard_N is a deterministic algorithm that recovers L bits of the randomness used by ⁢(G) Samp(G)sansserif_Samp ( G ) when generating z. The pair (,)( Samp, RecoverRand)( sansserif_Samp , sansserif_RecoverRand ) satisfies • Recoverability. For every R∈0,1∗superscript01R∈\0,1\^*R ∈ 0 , 1 ∗ and every L∈ℕ,L≤⁢(R)formulae-sequenceℕL ,L≤ len(R)L ∈ blackboard_N , L ≤ sansserif_len ( R ) if ⁢(G,R) Samp(G,R)sansserif_Samp ( G , R ) halts and returns zR=⁢(G,R)subscriptz_R= Samp(G,R)zitalic_R = sansserif_Samp ( G , R ) then (G,zR,L)=R[0:L], RecoverRand(G,z_R,L)=R[0:L],sansserif_RecoverRand ( G , zitalic_R , L ) = R [ 0 : L ] , In the notation ⁢(G,R) Samp(G,R)sansserif_Samp ( G , R ), R explicitly specifies the randomness used. • Closeness of Distributions. Distirbutions over ∗superscriptT^*T∗ given by ⁢(G) Samp(G)sansserif_Samp ( G ) and G are statistically close. Formally, for every L∈ℕL ∈ blackboard_N, 12⁢∑z∈∗,|z|≤L|ℙR⁢[⁢(G,R)=z]−ℙ⁢[G=z]|≤α⁢(L).12subscriptformulae-sequencesuperscriptsubscriptℙdelimited-[]ℙdelimited-[] 12 _z ^*,|z|≤ L |P_R[ % Samp(G,R)=z]-P[G=z] |≤α(L).divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑z ∈ T∗ , | z | ≤ L | blackboard_PR [ sansserif_Samp ( G , R ) = z ] - blackboard_P [ G = z ] | ≤ α ( L ) . Note that the expression on the left-hand side evaluates a partial statistical distance. Remark 1. In Theorems 1, 2 and 3 model G is a generator of innocent prompts. The theorems hold for any high-entropy model. We assume that M (see for instance Theorem 1) is deployed on some distribution DD, e.g., a distribution of a user prompts for a chatbot. We think of G as a model generating prompts from DD or approximation thereof. It is natural to assume such a G exists. In the example above, the top chatbots are used by millions of users, so finding a single G whose outputs are indistinguishable from some users’ prompts should be easy. B.1 RRS Construction We first simplify the definition of a language model (Definition 1) by assuming that the token set is binary, =0,101T=\0,1\T = 0 , 1 . We may assume this without loss of generality due to a straightforward reduction that appears in Section 4.1 of Christ et al. [2023]. We will implicitly use this reduction throughout our work as well. Our construction is given in Algorithms 1 and 2. Remark 2. Our construction was inspired by Zamir [2024], who design a method to undetectably embed a secret message in the output of an LLM that can later be recovered provided access to the secret key. This is similar to the requirements of the RRS, where the randomness needs to be recoverable from the output. There are some details in the construction of RRS that we didn’t address in Section 2. The first is how do we set the precision parameter P. As we will see shortly in Theorem 6, P influences the closeness of the distribution generated by Sampsansserif_Samp to that of G. P will be set to an exact value when Sampsansserif_Samp will be used as a subroutine in the proof of a formal version of Theorem 1. It is also informative to point out why the two distributions can be different in the first place. The main reason is that Sampsansserif_Samp samples zi=1subscript1z_i=1zitalic_i = 1 with probability ≈piabsentsubscript≈ p_i≈ pitalic_i (and not =piabsentsubscript=p_i= pitalic_i). It is because we didn’t assume anything about the precision of pisubscriptp_ipitalic_i and it might in between the grid points defined by q/kq/kq / k, where q∈[k]delimited-[]q∈[k]q ∈ [ k ]. B.2 Proof of correctness Theorem 6. Algorithms 1 and 2 constitute an O⁢(L⋅2L⋅2−P)⋅superscript2superscript2O(L· 2^L· 2^-P)O ( L ⋅ 2L ⋅ 2- P )-Recoverable-Randomness Sampling scheme. Proof. The result follows from a series of facts. Sampsansserif_Samp satisfies the following properties: 1. q∈[k]delimited-[]q∈[k]q ∈ [ k ] at all times. 2. for every iteration i: • for every z0,…,zi−1subscript0…subscript1z_0,…,z_i-1z0 , … , zitalic_i - 1, before the if statement (step 8 of Algorithm 1) is executed, the distribution of q is uniform in [k]delimited-[][k][ k ]. Observe that if these properties hold, then zisubscriptz_izitalic_i’s are sampled from a distribution that is close to G. The only reason the distributions might be different is the aforementioned precision issue. However, step 4 of Algorithm 1 ensures that |q/k−pi|≤2−Psubscriptsuperscript2|q/k-p_i|≤ 2^-P| q / k - pitalic_i | ≤ 2- P, so the errors accumulate as required, as for every (z0,…,zL−1)subscript0…subscript1(z_0,…,z_L-1)( z0 , … , zitalic_L - 1 ) the difference between sampling probabilities is upper bounded by O⁢(L⋅2−P)⋅superscript2O(L· 2^-P)O ( L ⋅ 2- P ). The above properties are proven by induction over i. Because R is a random bit-string it implies that q is uniform in [0,⌈pi⋅k⌉]0⋅subscript[0, p_i· k ][ 0 , ⌈ pitalic_i ⋅ k ⌉ ] conditioned on the event q/k<pisubscriptq/k<p_iq / k < pitalic_i. This implies that if we update k←⌈pi⋅k⌉←⋅subscriptk← p_i· k ← ⌈ pitalic_i ⋅ k ⌉ then the property that q is uniform in the updated [k]delimited-[][k][ k ] is satisfied. The properties of the RecoverRandsansserif_RecoverRand algorithm are as follows. For every iteration i: 1. k and d at the beginning of the iteration are equal to k and d in the Sampsansserif_Samp algorithm at the beginning of iteration i. 2. the lengths of a,ba,ba , b and q from Sampsansserif_Samp are equal before the if statement (step 8 of Algorithm 2). 3. a≤lexq≤lexbsubscriptlexsubscriptlexa _lexq _lexba ≤lex q ≤lex b, where ≤lexsubscriptlex _lex≤lex is the lexicographic order, before the if statement (step 8 of Algorithm 2). Similarly, as for Sampsansserif_Samp these properties can be proven by induction over i. The first property follows from the observation that k and d are updated depending only on the zisubscriptz_izitalic_i values. The second property follows from the first property. The third property is the crucial one and follows from the previous two and a careful analysis of how a,ba,ba , b are updated in the if statement. More concretely, note that a+(b−a)⋅pi/k⋅subscripta+(b-a)· p_i/ka + ( b - a ) ⋅ pitalic_i / k is the randomness at which Sampsansserif_Samp would generate zi=0subscript0z_i=0zitalic_i = 0 vs zi=1subscript1z_i=1zitalic_i = 1. Indeed, if R∈[a,a+(b−a)⋅pi/k)⋅subscriptR∈[a,a+(b-a)· p_i/k)R ∈ [ a , a + ( b - a ) ⋅ pitalic_i / k ) then conditioned on Sampsansserif_Samp generating z0,…,zi−1subscript0…subscript1z_0,…,z_i-1z0 , … , zitalic_i - 1, Sampsansserif_Samp would set zi←1←subscript1z_i← 1zitalic_i ← 1. If R∈[a+(b−a)⋅pi/k),b)R∈[a+(b-a)· p_i/k),b)R ∈ [ a + ( b - a ) ⋅ pitalic_i / k ) , b ) then it would generate zi=0subscript0z_i=0zitalic_i = 0. ∎ Appendix C Impossibility of Prompt-Detection Filters C.1 Moderately-strong filters We first state a formal version of Theorem 1. Next, we will present the construction and prove its correctness in the remaining part of this section. Theorem 7 (Impossibility of prompt-detection filters). There exists ϵ∈(0,1)italic-ϵ01ε∈(0,1)ϵ ∈ ( 0 , 1 ), such that for all security parameters λ∈ℕλ λ ∈ blackboard_N, every time parameter ⁢(λ)≤t≤2O⁢(λ)superscript2 poly(λ)≤ t≤ 2^O(λ)sansserif_poly ( λ ) ≤ t ≤ 2O ( λ ), every set of harmful prompts SH⊆0,1lmsubscriptsuperscript01subscriptS_H \0,1\^l_mSitalic_H ⊆ 0 , 1 litalic_m, where lm=Θ⁢(log⁡(λ)),Ω⁢(log⁡(λ))=log⁡(|SH|)≤lm−Ω⁢(log⁡(λ))formulae-sequencesubscriptΘΩsubscriptsubscriptΩl_m= ( (λ)), ( (λ))= (|S_H|)≤ l_m-% ( (λ))litalic_m = Θ ( log ( λ ) ) , Ω ( log ( λ ) ) = log ( | Sitalic_H | ) ≤ litalic_m - Ω ( log ( λ ) ), for every language model M running in time O⁢(t)O(t)O ( t ) per token, for every prompt generator G satisfying H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ) and running in time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token the following conditions hold. There exists an adversarial prompt generator G′ with a running time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token and a language model M′ with a running time ⁢(λ)⋅O⁢(t)⋅ poly(λ)· O(t)sansserif_poly ( λ ) ⋅ O ( t ) per token such that the following conditions hold. 1. (G′G G′ incites harmful behavior in M′M M′). For every malicious prompt m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H, the following two distributions are close: M′⁢(z)|z←G′⁢(m),M⁢(m).conditional-setsuperscript′←superscript′ \M (z)\ |\ z← G (m) \, \M(m)% \. M′ ( z ) | z ← G′ ( m ) , M ( m ) . In words, for every harmful prompt from SHsubscriptS_HSitalic_H the outputs of M′ and M follow almost the same distribution, in particular, a similar harmful behavior is induced for M and M′. 2. (M′M M′ and M behave identically on G almost always). With probability 1−λ−Ω⁢(1)1superscriptΩ11-λ^- (1)1 - λ- Ω ( 1 ) over z←G←z← Gz ← G the distributions M′⁢(z)superscript′M (z)M′ ( z ) and M⁢(z)M(z)M ( z ) are equal. 3. (G and G′G G′ are indistinguishable for filters running in o⁢(tϵ)osuperscripttϵo(t^ε)o ( titalic_ϵ )). Under the assumption that RSW is secure, for every prompt filter running in time o⁢(tϵ)superscriptitalic-ϵo(t^ε)o ( titalic_ϵ ), for every m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H the advantage in distinguishing outputs of G and G′⁢(m)superscript′G (m)G′ ( m ) is λ−Ω⁢(1)superscriptΩ1λ^- (1)λ- Ω ( 1 ). In words, no filter running in time o⁢(tϵ)superscriptitalic-ϵo(t^ε)o ( titalic_ϵ ) can reject101010If it doesn’t also reject the innocent prompts. the adversarial prompts from G′. C.1.1 Construction In this subsection, we give a more detailed version of the construction provided in Section 2. Our construction is given in Algorithms 5 and 6, which are formal versions of Algorithms 3 and 4. Algorithm 5 ⁢(G,m,t,λ) HarmIncite(G,m,t,λ)sansserif_HarmIncite ( G , m , t , λ ) ▷ ▷ G: an innocent prompt generator; m: a harmful prompt; t: time parameter; λ: security parameter 1:Sample two primes p,qp,qp , q in [1,w]1[1,w][ 1 , w ] ▷ ▷ w=2O⁢(λ)superscript2w=2^O(λ)w = 2O ( λ ) 2:Set n←p⋅q←⋅n← p· qn ← p ⋅ q 3:Compute e←2t(mod(p−1)⁢(q−1))←annotatedsuperscript2pmod11e← 2^t (p-1)(q-1)e ← 2t start_MODIFIER ( mod start_ARG ( p - 1 ) ( q - 1 ) end_ARG ) end_MODIFIER ▷ ▷ ϕ⁢(n)=(p−1)⁢(q−1)italic-ϕ11φ(n)=(p-1)(q-1)ϕ ( n ) = ( p - 1 ) ( q - 1 ) 4:Sample r←ℤn←subscriptℤr _nr ← blackboard_Zn 5:if r∈ℤn∗superscriptsubscriptℤr _n^*r ∈ blackboard_Zn∗ then ▷ ▷ See Remark 3 6: Compute (g,i)←⁢(r)←(g,i)← ind(r)( g , i ) ← sansserif_ind ( r ) ▷ ▷ See Definition 7 7: Compute g′←ge(modn)←superscript′annotatedsuperscriptpmodg ← g^e ng′ ← gitalic_e start_MODIFIER ( mod start_ARG n end_ARG ) end_MODIFIER 8: Set s←−1⁢(g′,i)←superscript1superscript′s← ind^-1(g ,i)s ← sansserif_ind- 1 ( g′ , i ) 9:else 10: Set s←r←s← rs ← r 11:end if 12:Sample a pairwise independent hash function hℎh 13:Sample v←0,1O⁢(λ)←superscript01v←\0,1\^O(λ)v ← 0 , 1 O ( λ ) 14:c←Q⁢(s,v)⊕m←direct-sumc← Q(s,v) mc ← Q ( s , v ) ⊕ m ▷ ▷ Q is the hard-core function 15:R←h⁢‖h⁢(n)‖⁢r⁢‖v‖⁢c←ℎnormℎnormR← h\ \|\ h(n)\ \|\ r\ \|\ v\ \|\ cR ← h ∥ h ( n ) ∥ r ∥ v ∥ c ▷ ▷ Concatenate components 16:return z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ) ▷ ▷ Sample output so that randomness is recoverable Algorithm 6 ⁢(M,z,G,t,λ,SH)subscript MaliciousLLM(M,z,G,t,λ,S_H)sansserif_MaliciousLLM ( M , z , G , t , λ , Sitalic_H ) ▷ ▷ M: an LLM; z: input; G: weak innocent prompt generator; t: time parameter; λ: security parameter; SHsubscriptS_HSitalic_H: set oh harmful prompts 1:R←⁢(G,z)←R← RecoverRand(G,z)R ← sansserif_RecoverRand ( G , z ) 2:Parse R as h⁢‖u‖⁢r⁢‖v‖⁢cℎnormnormh\ \|\ u\ \|\ r\ \|\ v\ \|\ ch ∥ u ∥ r ∥ v ∥ c 3:for all n^∈h−1⁢(u)^superscriptℎ1 n∈ h^-1(u)over start_ARG n end_ARG ∈ h- 1 ( u ) do 4: if r∈ℤn∗superscriptsubscriptℤr _n^*r ∈ blackboard_Zn∗ then ▷ ▷ See Remark 3 5: Compute (g,j)←⁢(r)←(g,j)← ind(r)( g , j ) ← sansserif_ind ( r ) ▷ ▷ See Definition 7 6: Compute g^←ge(modn^)←^annotatedsuperscriptpmod g← g^e nover start_ARG g end_ARG ← gitalic_e start_MODIFIER ( mod start_ARG over start_ARG n end_ARG end_ARG ) end_MODIFIER 7: Set s^←−1⁢(g^,j)←^superscript1 s← ind^-1( g,j)over start_ARG s end_ARG ← sansserif_ind- 1 ( over start_ARG g end_ARG , j ) 8: else 9: Set s^←r← s← rover start_ARG s end_ARG ← r 10: end if 11: for i∈[t]delimited-[]i∈[t]i ∈ [ t ] do 12: s^←s^2⁢ mod ⁢n^←^superscript^2 mod s← s^2 mod nover start_ARG s end_ARG ← over start_ARG s end_ARG2 mod over start_ARG n end_ARG 13: end for 14: m←s^⊕c←direct-sum^m← s cm ← over start_ARG s end_ARG ⊕ c ▷ ▷ m is the malicious prompt extracted from ciphertext c 15: if m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H then 16: return M⁢(m)M(m)M ( m ) 17: end if 18:end for 19:return M⁢(z)M(z)M ( z ) Next, we provide a few simple facts from number theory that are needed to address the differences between the formal and informal versions of the algorithms. Definition 6. Let n∈Nn∈ Nn ∈ N. We call x∈ℤn∗superscriptsubscriptℤx _n^*x ∈ blackboard_Zn∗ a quadratic residue modulo n if there exists r such that x=r2modnmodulosuperscript2x=r^2 nx = r2 mod n. The following is a standard fact from number theory. See Goldreich [2006] for details. Fact 1. Let n∈ℕn ∈ blackboard_N be a product of two different odd primes, i.e., n=p⁢qn=pqn = p q. The set of quadratic residues modulo n is a subgroup, denoted by GnsubscriptG_nGitalic_n, of ℤn∗superscriptsubscriptℤZ_n^*blackboard_Zn∗ of size |ℤn∗|/4superscriptsubscriptℤ4|Z_n^*|/4| blackboard_Zn∗ | / 4. Moreover, the mapping x↦x2maps-tosuperscript2x x^2x ↦ x2 is a 2-to-1 mapping from ℤn∗superscriptsubscriptℤZ_n^*blackboard_Zn∗ to GnsubscriptG_nGitalic_n. Definition 7. For n∈ℕn ∈ blackboard_N we define a bijection :ℤn∗→Gn×0,12:→superscriptsubscriptℤsubscriptsuperscript012 ind:Z_n^*→ G_n×\0,1\^2sansserif_ind : blackboard_Zn∗ → Gitalic_n × 0 , 1 2 as ⁢(x):=(g,i)assign ind(x):=(g,i)sansserif_ind ( x ) := ( g , i ), where g∈Gnsubscriptg∈ G_ng ∈ Gitalic_n is such that x2=g2superscript2superscript2x^2=g^2x2 = g2 and i denotes which (in the increasing order) of the 4444 elements r∈ℤn∗superscriptsubscriptℤr _n^*r ∈ blackboard_Zn∗ satisfying r2=x2superscript2superscript2r^2=x^2r2 = x2, x is. Remark 3. Note that in Algorithm 5 (in contrast to Algorithm 3), to compute the puzzle s, an intermediate bijection indsansserif_ind (Definition 7) is used. The reason is the following. The mapping x↦x2maps-tosuperscript2x x^2x ↦ x2 is a 4444-to-1111 function in ℤn∗superscriptsubscriptℤZ_n^*blackboard_Zn∗. To ensure that the mapping between r∈ℤnsubscriptℤr _nr ∈ blackboard_Zn and a puzzle s is 1111-to-1111, we use the function indsansserif_ind to which of the 4444 possible r’s was used. Next, we give a formal definition of Encsansserif_Enc used in Algorithm 3. Hardcore functions. We will apply the technique of hardcore bits (Goldreich and Levin [1989]) and its extension to hardcore functions. Recall that a hardcore bit is a function B that outputs a single bit B⁢(x)B(x)B ( x ) which is hard to predict given f⁢(x)f(x)f ( x ). A hardcore function Q for a one-way function f is a function which outputs possibly more than one bit such that, given f⁢(x)f(x)f ( x ), and the description of Q, it is hard to distinguish Q⁢(x)Q(x)Q ( x ) from a random string even when f⁢(x)f(x)f ( x ) is known. Goldreich and Levin introduced the first hardcore predicates and functions for general one-way functions (OWFs), showing that a random linear function is hardcore and the linear function defined by a random Toeplitz matrix is a hardcore function. The first step of G′ is to sample p,qp,qp , q at random from a set [1,k]1[1,k][ 1 , k ] (see Algorithm 5), where k=2O⁢(λ)superscript2k=2^O(λ)k = 2O ( λ ). Next, we let hℎh be a pairwise independent hash function that hℎh hashes n to a range of size Θ(ϵ2⋅((k/log(k))2) (ε^2·((k/ (k))^2)Θ ( ϵ2 ⋅ ( ( k / log ( k ) )2 ). The ϵitalic-ϵεϵ is set to λ−γsuperscriptλ^-γλ- γ for some γ>00γ>0γ > 0 to be fixed in the proof (Section C.1.2). For a message m∈0,1lmsuperscript01subscriptm∈\0,1\^l_mm ∈ 0 , 1 litalic_m (as in the statement of Theorem 7) and using the notation from Algorithm 5 we define R to be R←h⁢‖h⁢(n)‖⁢r⁢‖v‖⁢Q⁢(s,v)⊕m,←direct-sumℎnormℎnormR← h\ \|\ h(n)\ \|\ r\ \|\ v\ \|\ Q(s,v) m,R ← h ∥ h ( n ) ∥ r ∥ v ∥ Q ( s , v ) ⊕ m , (1) where v←0,1O⁢(λ)←superscript01v←\0,1\^O(λ)v ← 0 , 1 O ( λ ) and Q⁢(s,v)Q(s,v)Q ( s , v ) is a hardcore function with lm=Θ⁢(log⁡(λ))subscriptΘl_m= ( (λ))litalic_m = Θ ( log ( λ ) ) output bits. Note 2. In Section 2 we simplified the above construction and claimed R=(x,⁢(g⁢(x),m))R=(x, Enc(g(x),m))R = ( x , sansserif_Enc ( g ( x ) , m ) ). Informally speaking, in the formal version we set x=h⁢‖h⁢(n)‖⁢r,⁢(g⁢(x),m)=v∥Q⁢(s,v)⊕m,formulae-sequenceℎnormℎconditionaldirect-sumx=h\ \|\ h(n)\ \|\ r,\ Enc(g(x),m)=v\ \|\ Q(s,v) m,x = h ∥ h ( n ) ∥ r , sansserif_Enc ( g ( x ) , m ) = v ∥ Q ( s , v ) ⊕ m , where v←0,1O⁢(λ)←superscript01v←\0,1\^O(λ)v ← 0 , 1 O ( λ ). Next, we give some remarks about Algorithm 6. As we mentioned in the overview it is possible to efficiently enumerate through all the preimages of h⁢(n)ℎh(n)h ( n ) under hℎh. Note that according to (1) when M′ parses R it does it as follows: R=h⁢‖u‖⁢r⁢‖v‖⁢c.ℎnormnormR=h\ \|\ u\ \|\ r\ \|\ v\ \|\ c.R = h ∥ u ∥ r ∥ v ∥ c . For instance, an additional part v, which was not present in the simplified version of Algorithm 2 appears here. Next, the model M′ squares the starting point t times and tries to decrypt c with the current candidate for n, i.e., n^ nover start_ARG n end_ARG. By decrypt we mean it evaluates ⁢(s^,v,c) Dec( s,v,c)sansserif_Dec ( over start_ARG s end_ARG , v , c ) defined as m^←c⊕Q⁢(s^,v).←^direct-sum m← c Q( s,v).over start_ARG m end_ARG ← c ⊕ Q ( over start_ARG s end_ARG , v ) . The check m^≠⊥^bottom m≠ start_ARG m end_ARG ≠ ⊥ becomes instead m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H. That is we assume M′ knows the set of harmful prompts SHsubscriptS_HSitalic_H and checks if the decryption belongs to that set and if yes it executes M on it (which will lead to a harmful output). If, none of the the decryptions then it returns M⁢(z)M(z)M ( z ). Remark 4. We proved Theorem 7 using the RWS construction of time-lock puzzles. However, there are generic constructions of time-lock puzzles. The first one [Bitansky et al., 2016b] was based on obfuscation and randomized encodings. A culmination of a series of follow-up works [Agrawal et al., 2025, Bitansky and Garg, 2025, Abram et al., 2025] managed to construct time-lock puzzles based on a version of the Learning With Errors [Regev, 2009] (LWE) assumption. One advantage of basing TLPs on LWE is that the other constructions are broken by quantum algorithms, while LWE is a candidate for post-quantum secure cryptography. It would be interesting to extend Theorem 7, so that it can be based on any time-lock puzzle. The one crucial property that we use is the pseudorandomness of the puzzle, and this seems like the key issue in generalizing the result. Recall that we had to be careful in designing Algorithm 5 so that the distribution of s is uniform. It seems that the main property we need is that the distribution on hard instances is uniform over a set S whose approximate size is known to us. We leave the question of whether the generalization is possible for future work. C.1.2 Proof of Theorem 7 Firstly, we prove two technical lemmas. Lemma 1. h∥h⁢(n)conditionalℎh\ \|\ h(n)h ∥ h ( n ) is O⁢(λ−2⁢γ)superscript2O(λ^-2γ)O ( λ- 2 γ )-close to uniformly random. Proof. It follows from the Leftover Hash Lemma (Vadhan [2012], Theorem 6.18), which guarantees that h∥h⁢(n)conditionalℎh\|h(n)h ∥ h ( n ) is ϵ2superscriptitalic-ϵ2ε^2ϵ2 close to uniform. For our setting of parameters it gives us that h∥h⁢(n)conditionalℎh\|h(n)h ∥ h ( n ) is O⁢(λ−2⁢γ)superscript2O(λ^-2γ)O ( λ- 2 γ )-close to uniform. ∎ Definition 8. For n∈ℕn ∈ blackboard_N let :Gn→Gn:→subscriptsubscript root:G_n→ G_nsansserif_root : Gitalic_n → Gitalic_n be a function that on input x∈Gnsubscriptx∈ G_nx ∈ Gitalic_n returns r such that r2=xsuperscript2r^2=xr2 = x (by Fact 1 only one exists). Lemma 2. There exists an Θ⁢(log⁡(λ))Θ ( (λ))Θ ( log ( λ ) )-bit hardcore function Q⁢(s,v)Q(s,v)Q ( s , v ) (see Goldreich [2006] for details) for a function f:ℤn→ℤn:→subscriptℤsubscriptℤf:Z_n _nf : blackboard_Zn → blackboard_Zn defined as follows. Let s∈ℤnsubscriptℤs _ns ∈ blackboard_Zn and consider cases: • if s∈ℤn∖ℤn∗subscriptℤsuperscriptsubscriptℤs _n _n^*s ∈ blackboard_Zn ∖ blackboard_Zn∗ then f⁢(s)=sf(s)=sf ( s ) = s. • if s∈ℤn∗superscriptsubscriptℤs _n^*s ∈ blackboard_Zn∗ then f⁢(s):=−1⁢(t⁢(⁢(s)1),⁢(s)2).assignsuperscript1superscriptsubscript1subscript2f(s):= ind^-1( root^t( ind(s)_1), ind(s)% _2).f ( s ) := sansserif_ind- 1 ( sansserif_roott ( sansserif_ind ( s )1 ) , sansserif_ind ( s )2 ) . More formally, for every distinguisher running in time o⁢(tϵ)superscriptitalic-ϵo(t^ε)o ( titalic_ϵ ), the advantage for distinguishing r⁢‖v‖⁢Q⁢(s,v)⁢ and ⁢r⁢‖v‖⁢U,norm and normr\ \|\ v\ \|\ Q(s,v) and r\ \|\ v\ \|\ U,r ∥ v ∥ Q ( s , v ) and r ∥ v ∥ U , where s←ℤn←subscriptℤs _ns ← blackboard_Zn, U←0,1lm,v←0,1O⁢(λ)formulae-sequence←superscript01subscript←superscript01U←\0,1\^l_m,v←\0,1\^O(λ)U ← 0 , 1 litalic_m , v ← 0 , 1 O ( λ ) and r=f⁢(s)r=f(s)r = f ( s ), is at most λ−Ω⁢(1)superscriptΩ1λ^- (1)λ- Ω ( 1 ). Moreover, f is a bijection. Proof. The proof is a direct adaptation of the proof of Theorem 2.5.6 in Goldreich [2006]. The main difference is that security in our case holds only against distinguishes running in time o⁢(tϵ)superscriptitalic-ϵo(t^ε)o ( titalic_ϵ ) and not all polynomial-time adversaries. The fact that f is a bijection follows from Fact 1 and the fact that indsansserif_ind (Definition 7) is a bijection. ∎ Note that the distribution considered in Lemma 2 is, a priori, different from the distribution of outputs of Algorithm 5, because in the actual algorithm we first sample r, not s. The following lemma shows that the distributions are in fact equal. Lemma 3. The following two distributions are equal. The first distributions is defined as r⁢‖v‖⁢Q⁢(s,v)normr\ \|\ v\ \|\ Q(s,v)r ∥ v ∥ Q ( s , v ), where s←ℤn←subscriptℤs _ns ← blackboard_Zn, U←0,1lm,v←0,1O⁢(λ)formulae-sequence←superscript01subscript←superscript01U←\0,1\^l_m,v←\0,1\^O(λ)U ← 0 , 1 litalic_m , v ← 0 , 1 O ( λ ) and r=f⁢(s)r=f(s)r = f ( s ) (where f is defined in Lemma 2). The second distribution is defined as r⁢‖v‖⁢Q⁢(s,v)normr\ \|\ v\ \|\ Q(s,v)r ∥ v ∥ Q ( s , v ), where r←ℤn←subscriptℤr _nr ← blackboard_Zn, U←0,1lm,v←0,1O⁢(λ)formulae-sequence←superscript01subscript←superscript01U←\0,1\^l_m,v←\0,1\^O(λ)U ← 0 , 1 litalic_m , v ← 0 , 1 O ( λ ) and s=f−1⁢(r)superscript1s=f^-1(r)s = f- 1 ( r ). Proof. The equality follows from Lemma 2, which gurantees that f is a bijection. ∎ Next, we are ready to prove Theorem 7. Proof. There are three conditions we need to verify. • Condition 1. From the assumption that H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ) we know that the length of randomness needed to sample from G is at least the length needed to describe h⁢‖h⁢(n)‖⁢r⁢‖v‖⁢cℎnormℎnormh\ \|\ h(n)\ \|\ r\ \|\ v\ \|\ ch ∥ h ( n ) ∥ r ∥ v ∥ c with all but negligible probability. To do that it is enough to set the precision parameter P=⁢(λ)P= poly(λ)P = sansserif_poly ( λ ) as guaranteed by Theorem 6. Conditioned on the fact that RecoverRandsansserif_RecoverRand recovers enough bits, M′ will recover m with high probability and return M⁢(m)M(m)M ( m ). • Condition 2. We bound the probability that when M′ (defined via Algorithm 6) receives a prompt generated by G it returns in Line 16, i.e., it returns a harmful output. Note that c is uniformly random as G uses honest randomness v to produce its outputs. Similarly, s is also uniformly random. This implies that, for every n^∈h−1⁢(u)^superscriptℎ1 n∈ h^-1(u)over start_ARG n end_ARG ∈ h- 1 ( u ), the probability that m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H is equal to |SH|/2lmsubscriptsuperscript2subscript|S_H|/2^l_m| Sitalic_H | / 2litalic_m. From the setting ϵ=λ−γitalic-ϵsuperscriptε=λ^-γϵ = λ- γ we know that the number of preimages under hℎh is in O⁢(λ2⁢γ)superscript2O(λ^2γ)O ( λ2 γ )111111Thus, M′ can iterate over all of them in time O⁢(t)O(t)O ( t ). and so, by the union bound, the probability that at least one of m’s belongs to SHsubscriptS_HSitalic_H is at most O⁢(λ2⁢γ⋅|SH|/2lm)⋅superscript2subscriptsuperscript2subscriptO(λ^2γ·|S_H|/2^l_m)O ( λ2 γ ⋅ | Sitalic_H | / 2litalic_m ). From the assumption that log⁡(|SH|)≤lm−Ω⁢(log⁡(λ))subscriptsubscriptΩ (|S_H|)≤ l_m- ( (λ))log ( | Sitalic_H | ) ≤ litalic_m - Ω ( log ( λ ) ) it follows that γ can be set so that the probability that M′ returns in Line 16 is at most λ−Ω⁢(1)superscriptΩ1λ^- (1)λ- Ω ( 1 ). By construction, if M′ returns in the Line 19, it returns M⁢(z)M(z)M ( z ), thus returning from the same distribution. • Condition 3. Implied by Lemmas 1, 2, and 3 and the fact that the probability that r∈ℤn∗superscriptsubscriptℤr _n^*r ∈ blackboard_Zn∗ (note that it is sampled from ℤnsubscriptℤZ_nblackboard_Zn) is negligible in λ. ∎ C.2 Shared secret key The following is a formal version of Theorem 2. Theorem 8 (Impossibility of prompt-detection filters via secret keys). There exists a polynomial p such that for all security parameters λ∈ℕλ λ ∈ blackboard_N, every set of harmful prompts SH⊆0,1lmsubscriptsuperscript01subscriptS_H \0,1\^l_mSitalic_H ⊆ 0 , 1 litalic_m, where lm=Θ⁢(p⁢(λ)),Ω⁢(p⁢(λ))=log⁡(|SH|)≤lm−Ω⁢(p⁢(λ))formulae-sequencesubscriptΘΩsubscriptsubscriptΩl_m= (p(λ)), (p(λ))= (|S_H|)≤ l_m- (p(% λ))litalic_m = Θ ( p ( λ ) ) , Ω ( p ( λ ) ) = log ( | Sitalic_H | ) ≤ litalic_m - Ω ( p ( λ ) ), for every language model M, for every prompt generator G satisfying H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ) and running in time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token the following conditions hold. There exists an adversarial prompt generator G′ with a running time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token and a language model M′ with a running time ⁢(λ)+O⁢(TM)subscript poly(λ)+O(T_M)sansserif_poly ( λ ) + O ( Titalic_M ) per token, where TMsubscriptT_MTitalic_M is the running time of M per token, such that if M′ and G′ share an ⁢(λ) poly(λ)sansserif_poly ( λ )-bit secret key then the following conditions hold. 1. (G′G G′ incites harmful behavior in M′M M′). For every malicious prompt m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H, the following two distributions are close: M′⁢(z)|z←G′⁢(m),M⁢(m).conditional-setsuperscript′←superscript′ \M (z)\ |\ z← G (m) \, \M(m)% \. M′ ( z ) | z ← G′ ( m ) , M ( m ) . In words, for every harmful prompt from SHsubscriptS_HSitalic_H the outputs of M′ and M follow almost the same distribution, in particular, a similar harmful behavior is induced for M and M′. 2. (M′M M′ and M behave identically on G almost always). With all but negligible in λ probability over z←G←z← Gz ← G the distributions M′⁢(z)superscript′M (z)M′ ( z ) and M⁢(z)M(z)M ( z ) are equal. 3. (G and G′G G′ are indistinguishable for filters running in polynomial-time). Under the assumption that One-Way Functions exist, for every prompt filter running in polynomial-time, for every m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H the advantage of the filter in distinguishing outputs of G and G′⁢(m)superscript′G (m)G′ ( m ) is negligible in λ. In words, no filter running in polynomial-time can reject121212If it doesn’t also reject the innocent prompts. the adversarial prompts from G′. C.2.1 Construction As we discussed in Section 2, instead of using time-locks, we use symmetric key (shared secret key) encryption (it is known that one-way functions (OWFs) imply symmetric key encryption). Namely, for m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H we set R:=⁢(,m)assignR:= Enc( sk,m)R := sansserif_Enc ( sansserif_sk , m ), where Encsansserif_Enc is the symmetric key encryption scheme, and sksansserif_sk is the shared secret key. Next, G′ generates z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ) as before. M′ decrypts the randomness R to get m^ mover start_ARG m end_ARG and checks if m^ mover start_ARG m end_ARG belongs to the set SHsubscriptS_HSitalic_H, and if yes returns M⁢(m^)^M( m)M ( over start_ARG m end_ARG ). The proof of correctness is a simplified version of the proof of Theorem 7. C.3 Shared public key The following is a formal version of Theorem 3. Theorem 9 (Impossibility of prompt-detection filters via secret keys). There exists a polynomial p such that for all security parameters λ∈ℕλ λ ∈ blackboard_N, every set of harmful prompts SH⊆0,1lmsubscriptsuperscript01subscriptS_H \0,1\^l_mSitalic_H ⊆ 0 , 1 litalic_m, where lm=Θ⁢(p⁢(λ)),Ω⁢(p⁢(λ))=log⁡(|SH|)≤lm−Ω⁢(p⁢(λ))formulae-sequencesubscriptΘΩsubscriptsubscriptΩl_m= (p(λ)), (p(λ))= (|S_H|)≤ l_m- (p(% λ))litalic_m = Θ ( p ( λ ) ) , Ω ( p ( λ ) ) = log ( | Sitalic_H | ) ≤ litalic_m - Ω ( p ( λ ) ), for every language model M, for every prompt generator G satisfying H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ) and running in time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token the following conditions hold. There exists a language model M′ with an associated public key and with a running time ⁢(λ)+O⁢(TM)subscript poly(λ)+O(T_M)sansserif_poly ( λ ) + O ( Titalic_M ) per token, where TMsubscriptT_MTitalic_M is the running time of M per token, and an adversarial prompt generator G′ (that depends on the public key) with a running time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token and such that the following conditions hold. 1. (G′G G′ incites harmful behavior in M′M M′). For every malicious prompt m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H, the following two distributions are close: M′⁢(z)|z←G′⁢(m),M⁢(m).conditional-setsuperscript′←superscript′ \M (z)\ |\ z← G (m) \, \M(m)% \. M′ ( z ) | z ← G′ ( m ) , M ( m ) . In words, for every harmful prompt from SHsubscriptS_HSitalic_H the outputs of M′ and M follow almost the same distribution, in particular, a similar harmful behavior is induced for M and M′. 2. (M′M M′ and M behave identically on G almost always). With all but negligible in λ probability over z←G←z← Gz ← G the distributions M′⁢(z)superscript′M (z)M′ ( z ) and M⁢(z)M(z)M ( z ) are equal. 3. (G and G′G G′ are indistinguishable for filters running in polynomial-time). Under the assumption that public key cryptography exists, for every prompt filter running in polynomial-time, for every m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H the advantage of the filter in distinguishing outputs of G and G′⁢(m)superscript′G (m)G′ ( m ) is negligible in λ. In words, no filter running in polynomial-time can reject131313If it doesn’t also reject the innocent prompts. the adversarial prompts from G′. C.3.1 Construction Instead of using time-locks (as in Theorem 7) we use public key encryption in our construction. Namely, for m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H we set R:=⁢(,m)assignR:= Enc( pk,m)R := sansserif_Enc ( sansserif_pk , m ), where Encsansserif_Enc is the public key encryption scheme, and pksansserif_pk is the public key associated with M. Next, G′ generates z←⁢(G,R)←z← Samp(G,R)z ← sansserif_Samp ( G , R ) as before. M′ decrypts the randomness R to get m^ mover start_ARG m end_ARG and checks if m^ mover start_ARG m end_ARG belongs to the set SHsubscriptS_HSitalic_H, and if yes returns M⁢(m^)^M( m)M ( over start_ARG m end_ARG ). The proof of correctness of this construction is a simplified version of the proof of Theorem 7. Appendix D Impossibility of Output-Detection Filters Preliminaries. A harmfulness oracle is a function ℋ:∗→[0,∞):ℋ→superscript0H:T^*→[0,∞)H : T∗ → [ 0 , ∞ ). For an output of a language model z∈∗superscriptz ^*z ∈ T∗ we interpret ℋ⁢(z)ℋH(z)H ( z ) as the harmfulness of z, where ℋ⁢(z)=0ℋ0H(z)=0H ( z ) = 0 means that z is benign and a high-value denotes high harmfulness.141414It is important not to confuse the notation for harmfulness oracle ℋHH with the notation for entropy H. The following is a formal version of Theorem 4. Theorem 10 (Impossibility of output-detection filters). There exists ϵ∈(0,1)italic-ϵ01ε∈(0,1)ϵ ∈ ( 0 , 1 ), polynomial p, such that for all security parameters λ∈ℕλ λ ∈ blackboard_N, every set of harmful prompts SH⊆0,1lmsubscriptsuperscript01subscriptS_H \0,1\^l_mSitalic_H ⊆ 0 , 1 litalic_m, where lm=Θ⁢(p⁢(λ)),Ω⁢(p⁢(λ))=log⁡(|SH|)≤lm−Ω⁢(p⁢(λ))formulae-sequencesubscriptΘΩsubscriptsubscriptΩl_m= (p(λ)), (p(λ))= (|S_H|)≤ l_m- (p(% λ))litalic_m = Θ ( p ( λ ) ) , Ω ( p ( λ ) ) = log ( | Sitalic_H | ) ≤ litalic_m - Ω ( p ( λ ) ), for every language model M running in time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token and satisfying H⁢(M)≥Ω⁢(p⁢(λ))ΩH(M)≥ (p(λ))H ( M ) ≥ Ω ( p ( λ ) ), and every harmfulness oracle ℋHH the following conditions hold. There exists an adversarial language model M′ with a running time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token and a harmfulness oracle ℋ′H H′ such that the following conditions hold. 1. (M′M M′ outputs are harmful as judged by ℋ′H H′). For every malicious prompt m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H, with all but negligible in λ probability over z←M′⁢(m)←superscript′z← M (m)z ← M′ ( m ),151515Note that M′⁢(m)superscript′M (m)M′ ( m ) abuses the notation slightly. M′ is a malicious model that can, for every m, generate a harmful output with m in it. However, m is not an input/prompt to M′ in the usual sense. it holds that ℋ′⁢(z)=Ω⁢(z←M⁢(m)⁢[ℋ⁢(z)]).superscriptℋ′Ωsubscript←delimited-[]ℋH (z)= (E_z← M(m)[H% (z)] ).H′ ( z ) = Ω ( blackboard_Ez ← M ( m ) [ H ( z ) ] ) . In words, for every prompt from SHsubscriptS_HSitalic_H the outputs of M′ are almost always as harmful (when judged by ℋ′H H′) as if M run on m (when judged by ℋHH). 2. (ℋ′H H′ and ℋHH are equal on M⁢()M()M ( ) almost always). With all but negligible probability in λ over z←M⁢()←z← M()z ← M ( )161616We emphasize that M⁢()M()M ( ) is not a typo. M⁢()M()M ( ) can be thought of as G from Theorem 10 that also doesn’t receive any input. ℋ′⁢(z)=ℋ⁢(z).superscriptℋ′ℋH (z)=H(z).H′ ( z ) = H ( z ) . In words, the harmfulness of outputs of M is (almost always) equal as judged by ℋHH and ℋ′H H′. 3. (M′M M′ and M are indistinguishable for filters running in 2ϵ⋅λsuperscript2⋅ϵλ2^ε·λ2ϵ ⋅ λ). Under the assumption that RSW is secure, for every output filter running in time 2ϵ⋅λsuperscript2⋅italic-ϵ2^ε·λ2ϵ ⋅ λ, for every m∈SHsubscriptm∈ S_Hm ∈ Sitalic_H the advantage in distinguishing outputs of M⁢()M()M ( ) and M′⁢(m)superscript′M (m)M′ ( m ) is negligible in λ. In words, no filter running in time 2ϵ⋅λsuperscript2⋅italic-ϵ2^ε·λ2ϵ ⋅ λ can reject171717If it doesn’t also reject the innocent prompts. the harmful outputs from M′. D.1 Construction As we explained in Section 2 the construction is a from the construction given in the proof of Theorem 1 (Theorem 7 in the formal version). More concretely we define M′:=(M,m,t=2λ,(λ)) M := HarmIncite(M,m,t=2^λ, poly(% λ))M′ := sansserif_HarmIncite ( M , m , t = 2λ , sansserif_poly ( λ ) ) ℋ′:=(ℋ,z,M,t=2λ,(λ),SH). := MaliciousLLM(H,z,M,t=2^% λ, poly(λ),S_H).H′ := sansserif_MaliciousLLM ( H , z , M , t = 2λ , sansserif_poly ( λ ) , Sitalic_H ) . We note that ℋ′H H′ is well defined because the only calls to ℋHH in MaliciousLLMsansserif_MaliciousLLM (Algorithm 6) are query calls. By direct, we mean that, for example, there is no assumption that ℋHH needs to be a language model. Remark 5. We note that we could have set t to a different (smaller) value and then the theorem would still hold with the security against filters running in time tϵsuperscriptitalic-ϵt^εtitalic_ϵ. This change would make sure that the time of ℋ′H H′ is smaller than 2λsuperscript22^λ2λ. It can be beneficial if one tries to make sure that ℋ′H H′ is efficient. Thus our construction can be easily changed to have a trade-off between the overhead -time (over ℋHH) of ℋ′H H′ and the security guarantee. D.2 Proof The proof is a direct adaptation of the proof from Section C.1.2. The main difference is in the setting of parameters for the time bounds. In Theorem 7 we needed M′ to be able to solve the time-lock puzzle (evaluate r2tsuperscriptsuperscript2r^2^tr2 start_POSTSUPERSCRIPT t end_POSTSUPERSCRIPT) and so the indistinguishability could hold only against filters running in time tϵsuperscriptitalic-ϵt^εtitalic_ϵ. However, for the output filter result, the quality oracle could in exponential time, and so the time-lock can be created with parameter 2λsuperscript22^λ2λ.181818It cannot be bigger because the running time of M′ is ⁢(λ) poly(λ)sansserif_poly ( λ ) and so M′ cannot create a harder puzzle (at least with the RSW construction). Appendix E Mitigation and Connections to Watermarking E.1 Watermarks First, we define watermarking schemes, for which we adopt definitions from Christ et al. [2023]. The only difference is that we require a watermark to still be detected even if the output of the watermarking scheme was changed by a transformation from a class E. Definition 9 (Class of Transformations). A class of transformations E is an equivalence relation over ∗superscriptT^*T∗. We interpret it as follows. For z∈∗superscriptz ^*z ∈ T∗ it can be transformed into any element in the equivalence class of z. We denote by E⁢(z)E(z)E ( z ) the equivalence class of z. Definition 10 (Watermarking scheme). A watermarking scheme for a model M over TT is a tuple of algorithms =(,,)W=( Setup, Wat, Detect)W = ( sansserif_Setup , sansserif_Wat , sansserif_Detect ) where: • ⁢(1λ)→superscript1 Setup(1^λ)→ sksansserif_Setup ( 1λ ) → sansserif_sk outputs a secret key, with respect to a security parameter λ. • ⁢()subscript Wat_ sk( prompt)sansserif_Watsansserif_sk ( sansserif_prompt ) is a randomized algorithm that takes as input a prompt promptsansserif_prompt and generates a response in ∗superscriptT^*T∗. • ⁢(z)→,→subscript Detect_ sk(z)→\ true, false\sansserif_Detectsansserif_sk ( z ) → sansserif_true , sansserif_false is an algorithm that takes as input a sequence z∈⋆superscript⋆z z ∈ T⋆ and outputs true or false. Definition 11 (Soundness). A watermarking scheme WW is sound if for every security parameter λ and token sequence x∈⋆superscript⋆x x ∈ T⋆ of length |z|≤⁢(λ)|z|≤ poly(λ)| z | ≤ sansserif_poly ( λ ), Pr←⁢(1λ)⁡[⁢(z)=]≤⁢(λ).subscriptPr←superscript1subscript _ sk← Setup(1^λ)[ Detect_% sk(z)= true]≤ negl(λ).Prsansserif_sk ← sansserif_Setup ( 1λ ) [ sansserif_Detectsansserif_sk ( z ) = sansserif_true ] ≤ sansserif_negl ( λ ) . Definition 12 (b⁢(L)b(L)b ( L )-Completeness). An algorithm WW is a b⁢(L)b(L)b ( L )-complete watermarking scheme robust against a class of transformations E if for every security parameter λ and prompt promptsansserif_prompt of length ||≤⁢(λ)| prompt|≤ poly(λ)| sansserif_prompt | ≤ sansserif_poly ( λ ), Pr←⁢(1λ),z←⁢()subscriptPrformulae-sequence←superscript1←subscript _ sk← Setup(1^λ),\,z% ← Wat_ sk( prompt)Prsansserif_sk ← sansserif_Setup ( 1λ ) , z ← sansserif_Wat start_POSTSUBSCRIPT sansserif_sk ( sansserif_prompt ) end_POSTSUBSCRIPT [∃z′∈E⁢(z)(z′)= and [ _z ∈ E(z)\ Detect_ sk(z^% )= false and [ ∃z′ ∈ E ( z ) sansserif_Detectsansserif_sk ( z′ ) = sansserif_false and He(M,,z)≥b(|z|)]≤(λ). H_e(M, prompt,z)≥ b(|z|)]≤ negl(λ).Hitalic_e ( M , sansserif_prompt , z ) ≥ b ( | z | ) ] ≤ sansserif_negl ( λ ) . Definition 13 (Undetectability). A watermarking scheme =(,,)W=( Setup, Wat, Detect)W = ( sansserif_Setup , sansserif_Wat , sansserif_Detect ) is undetectable if for every security parameter λ and all polynomial-time distinguishers D, |Pr⁡[DM,M⁢(1λ)=1]−Pr←⁢(1λ)⁡[DM,⁢(1λ)=1]|≤⁢(λ),Prsuperscriptsuperscript11subscriptPr←superscript1superscriptsubscriptsuperscript11 | [D^M,M(1^λ)=1]- _ sk← Setup(1^% λ)[D^M, Wat_ sk(1^λ)=1] |≤ % negl(λ),| Pr [ Ditalic_M , M ( 1λ ) = 1 ] - Prsansserif_sk ← sansserif_Setup ( 1λ ) [ Ditalic_M , sansserif_Watsansserif_sk ( 1λ ) = 1 ] | ≤ sansserif_negl ( λ ) , where the notation DO1,O2superscriptsubscript1subscript2D^O_1,O_2Ditalic_O1 , O2 means that D is allowed to adaptively query both O1subscript1O_1O1 and O2subscript2O_2O2 with arbitrary prompts. E.2 Our result Definition 14 (Prompt-mitigation Filter). A prompt-mitigation filter for a class E is an algorithm that takes as input z∈∗superscriptz ^*z ∈ T∗ and replies with en element of E⁢(z)∪⊥bottomE(z)∪\ \E ( z ) ∪ ⊥ , which we interpret as it either produced an element in the equivalence class of z or rejects. The following is a formal version of Theorem 5. Theorem 11 (Impossibility of prompt-mitigation filters). If there exists a watermarking scheme robust against a class of transformations E then for all security parameters λ∈ℕλ λ ∈ blackboard_N, for every language model M, for every prompt generator G satisfying H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ) and running in time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token the following conditions hold. There exists a language model M′ with a running time ⁢(λ)+O⁢(TM)subscript poly(λ)+O(T_M)sansserif_poly ( λ ) + O ( Titalic_M ) per token, where TMsubscriptT_MTitalic_M is the running time of M per token, and an adversarial prompt generator G′ with a running time ⁢(λ) poly(λ)sansserif_poly ( λ ) per token such that if G′ and M′ share a secret key then the following conditions hold. 1. (G′G G′ incites harmful behavior in M′M M′ even in the presence of prompt-mitigation filters). For every prompt-mitigation filter F running in polynomial-time with all but negligible in λ probability over z←G′←superscript′z← G z ← G′ it holds that F⁢(z)=⊥ or ⁢M′⁢(z)⁢ is harmful.bottom or superscript′ is harmfulF(z)= or M (z) is harmful.F ( z ) = ⊥ or M′ ( z ) is harmful . 2. (M′M M′ and M behave identically on G almost always). With all but negligible in λ probability over z←G←z← Gz ← G the distributions M′⁢(z)superscript′M (z)M′ ( z ) and M⁢(z)M(z)M ( z ) are equal. 3. (G and G′G G′ are indistinguishable for filters running in polynomial-time). For every algorithm running in polynomial-time the advantage in distinguishing outputs of G and G′ is negligible in λ. Note 3. Observe that conditions 1 and 3 from Theorem 11 imply that if a polynomial-time prompt-mitigation filter F does not reject outputs of G then the outputs of G even when passed through the mitigation of F will still incite harmful behavior in M′. E.3 Construction Let =(,,)W=( Setup, Wat, Detect)W = ( sansserif_Setup , sansserif_Wat , sansserif_Detect ) be a watermarking scheme robust against a class of transformations E. Let G be a prompt generator satisfying H⁢(G)≥⁢(λ)H(G)≥ poly(λ)H ( G ) ≥ sansserif_poly ( λ ). Firstly, we let G′ and M share a sksansserif_sk generated by ⁢(1λ)superscript1 Setup(1^λ)sansserif_Setup ( 1λ ). We define G′:=⁢()assignsuperscript′subscriptG := Wat_ sk()G′ := sansserif_Watsansserif_sk ( ).191919We run Watsansserif_Wat on an empty prompt. Note that Watsansserif_Wat implicitly depends on G. For a language model M we define M′ that on input z works as follows: 1. If ⁢(z)=subscript Detect_ sk(z)= truesansserif_Detectsansserif_sk ( z ) = sansserif_true then return a harmful output. 2. Return M⁢(z)M(z)M ( z ) otherwise. E.4 Proof The proof requires a careful comparison of requirements of Theorem 11 and properties of watermarking schemes robust against classes of transformations. The result follows from the following. • Condition 1 follows from completeness of WW. • Condition 2 follows from soundness of WW. • Condition 3 follows from undetectability of WW.