Paper deep dive
Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report
Johannes Schmalz, Chaahat Jain
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 5:19:57 AM
Summary
The paper addresses the safety of learned action policies in Fully Observable Non-deterministic (FOND) planning. It identifies that the existing state-of-the-art algorithm, TarjanSafe, has an exponential worst-case runtime. The authors introduce a new policy-iteration algorithm, iPI, which achieves polynomial worst-case complexity while maintaining the practical efficiency of TarjanSafe, and provide a linear-time alternative, Prop-U, for theoretical comparison.
Entities (6)
Relation Signals (3)
Johannes Schmalz → authored → Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems
confidence 100% · Technical Report Johannes Schmalz, Chaahat Jain
Prop-U → has worst-case complexity → Linear
confidence 95% · Prop-U, an algorithm with a linear worst case.
iPI → improves upon → TarjanSafe
confidence 90% · iPI... matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst-case.
Cypher Suggestions (2)
Map the relationship between algorithms and their performance characteristics · confidence 85% · unvalidated
MATCH (a:Algorithm)-[r:HAS_WORST_CASE_COMPLEXITY]->(c:ComplexityClass) RETURN a.name, c.name
Find all algorithms mentioned in the paper and their properties · confidence 80% · unvalidated
MATCH (a:Algorithm) RETURN a.name, a.worst_case_complexity, a.best_case_performance
Abstract
Abstract: Learned action policies are increasingly popular in sequential decision-making, but suffer from a lack of safety guarantees. Recent work introduced a pipeline for testing the safety of such policies under initial-state and action-outcome non-determinism. At the pipeline's core is the problem of deciding whether a state is safe (a safe policy exists from the state) and finding faults, which are state-action pairs that transition from a safe state to an unsafe one. Their most effective algorithm for deciding safety, TarjanSafe, performs well on their benchmarks, but we show that it has exponential worst-case runtime with respect to the state space. A linear-time alternative exists, but it is slower in practice. We close this gap with a new policy-iteration algorithm, iPI, that combines the best of both: it matches TarjanSafe's best-case runtime while guaranteeing a polynomial worst case. Experiments confirm our theory and show that in problems amenable to TarjanSafe, iPI has similar performance, whereas in ill-suited problems iPI scales exponentially better.
Tags
Links
- Source: https://arxiv.org/abs/2603.15282v1
- Canonical: https://arxiv.org/abs/2603.15282v1
Full Text
53,096 characters extracted from source content.
Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems: Technical Report
Johannes Schmalz, Chaahat Jain

1 Introduction
Learned action policies are gaining traction in AI and AI planning, e.g., Mnih et al. (2015); Silver et al. (2016, 2018); Toyer et al. (2020). However, such policies come without any safety guarantees, creating demand for safety assurance methods. We focus on the safety-testing framework of Jain et al. (2025) (henceforth JEA) in fully observable non-deterministic (FOND) planning. In JEA's setting, a policy π is safe in a given state s if its execution will never lead to a failure condition φ_F; a state is safe if it has a safe policy; and faults are state-action pairs (s, π(s)) where s is safe but applying π(s) non-deterministically leads to an unsafe state s'.
Their framework takes a learned policy, finds paths therein that lead to fail states, and then finds faults by deciding the safety of states along those paths. These faults can be used to repair the policy. To decide whether a state is safe, JEA consider two orthogonal approaches: (1) adapting well-known algorithms for Markov Decision Processes (MDPs), such as Value Iteration (VI), LAO* (Hansen and Zilberstein 2001), and LRTDP (Bonet and Geffner 2003); (2) creating a specialised depth-first search (DFS) with a mechanism for detecting strongly connected components (SCCs), which they call TarjanSafe. Their results showed that TarjanSafe is very effective for deciding safety. In this paper, we show that TarjanSafe has an exponential worst-case time complexity w.r.t. the state space, using a pathological family of tasks where TarjanSafe expands the same states many times. Despite this limitation, TarjanSafe performs well in practice because it often explores only a small fraction of the state space. To contrast, we present Prop-U, an algorithm with a linear worst case. It works by propagating unsafe states, following techniques from reachability games (Diekert and Kufleitner 2022); Prop-U is unusable in practice because there are too many unsafe states to propagate. To address this gap we introduce iPI, a new algorithm based on policy iteration that combines a polynomial worst-case runtime with the practical speed of TarjanSafe; iPI starts with an initial policy π_init and attempts to prove its safety, making minimal changes to the policy as unsafe states are discovered; this makes iPI very efficient at finding safe policies similar to π_init, which occurs frequently. Finally, we confirm our theory experimentally. We use JEA's benchmarks, which mostly correspond to TarjanSafe's best case; there, iPI shows similar performance to TarjanSafe.
Then, we introduce the Flappy Bird domain to showcase a problem ill-suited to TarjanSafe, and indeed iPI scales exponentially better there. This places iPI as the current best algorithm for deciding state safety and gives valuable insight into what makes a good algorithm for this problem. Our contributions are: a simplified formalism for FOND safety, new algorithms for deciding safety (Prop-U and iPI), a complexity analysis of TarjanSafe and our algorithms, and an empirical evaluation that shows iPI's effectiveness.

2 Background
We consider fully observable non-deterministic (FOND) planning tasks (Daniele, Traverso, and Vardi 2000; Cimatti et al. 2003). We follow the formalism of JEA, but break it down into a simpler, equivalent formulation. A task's transition system is defined by Θ = ⟨S, A, T, S_I⟩ where
- S is a finite set of states,
- A is a finite set of actions, and we write A(s) to denote the set of actions applicable in state s,
- T is a non-deterministic transition function, where T(s,a) ∈ 2^S \ {∅} gives the set of states that may occur after applying a to s,
- S_I ⊊ S is a non-empty set of initial states.
Policies are partial functions π: S → A with the property that π(s) ∈ A(s) wherever defined. A policy π induces a policy graph Θ^π = ⟨S, A, T^π, S_I⟩ with the same components as Θ, except that T^π(s,a) = T(s,a) if a = π(s) and is undefined otherwise. A path is a finite or infinite sequence of alternating states and actions σ = s_0, a_0, s_1, … that satisfies a_i ∈ A(s_i) and s_{i+1} ∈ T(s_i, a_i) for all i ≥ 0 in σ. The set of all paths from a state s is Paths(s), and Paths^π(s) is the set of paths from s in the policy graph Θ^π. Typically, FOND tasks are also specified with a non-empty set of goal states G ⊆ S. We call such tasks Θ = ⟨S, A, T, S_I, G⟩ goal-reaching FOND tasks. To solve them we consider strong cyclic solutions, which we present as policies π s.t.
following π from any initial state s_I ∈ S_I will eventually reach a goal in G, assuming fairness.
Assumption 1 (Fairness). If an action a is applied infinitely many times in state s, then each non-deterministic effect in T(s,a) must trigger infinitely often (Cimatti et al. 2003).
Formally, π solves a goal-reaching FOND task if all finite-length paths in ⋃_{s_I ∈ S_I} Paths^π(s_I) terminate in G. Recall that our setting is to identify whether a safe policy exists for states in a learned policy's envelope, so in this paper we focus on avoiding certain fail states rather than goal-reaching. Thus, we specify a non-empty set of fail states S_f ⊆ S, giving us state-avoiding FOND tasks Θ = ⟨S, A, T, S_I, S_f⟩.¹ (¹ JEA use a failure condition φ_F, but for our discussion it is equivalent to consider a specified set of fail states S_f ⊆ S.) A solution to such a Θ is a policy π that never encounters any fail state, i.e., for all finite and infinite paths σ ∈ ⋃_{s_I ∈ S_I} Paths^π(s_I), σ does not contain any state in S_f. State-avoiding FOND lets us express notions of safety, i.e., that a "bad" state should never happen; e.g., a self-driving car must never crash into a wall. A state s is called unsafe if no policy can guarantee that fail states are avoided from s, i.e., ∀π ∃σ ∈ Paths^π(s): σ ∩ S_f ≠ ∅. The set of unsafe states is called the unsafe region U. Whether a policy π is state-avoiding (i.e., safe) can be evaluated with the following value function.
Definition 1. The value of policy π at state s is V^π(s) = 1 if s ∈ S_f, and V^π(s) = max_{s' ∈ T(s, π(s))} V^π(s') if s ∉ S_f.
The policy π is safe from a state s, i.e., there is a way to avoid unsafe states, iff V^π(s) = 0. V^*(s) is the optimal value at s, given by V^*(s) = min_{π ∈ Π} V^π(s). Given V^*, if a safe policy exists from s, then one can extract a safe policy by following V^* greedily.
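Read operationally, Definition 1 says V^π(s) = 1 exactly when some path under π from s can reach a fail state, so a fixed policy can be checked with a plain reachability search over its policy graph. A minimal sketch, assuming a hypothetical encoding where T maps (state, action) pairs to successor sets (an illustration, not the authors' implementation):

```python
def policy_value(s, policy, T, fail_states):
    """V^pi(s) per Definition 1: 1 iff some path under the policy from s
    reaches a fail state, else 0 (reachability in the policy graph)."""
    seen, frontier = set(), [s]
    while frontier:
        cur = frontier.pop()
        if cur in fail_states:
            return 1
        if cur in seen:
            continue
        seen.add(cur)
        # follow only the policy's chosen action, i.e., stay in the policy graph
        frontier.extend(T[cur, policy[cur]])
    return 0
```

A policy that can non-deterministically stumble into a fail state gets value 1, while one that cycles forever without touching S_f gets value 0.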
V^* can be computed with Value Iteration (VI). VI starts with an initial value function, e.g., V(s) ← ⟦s ∈ S_f⟧ ∀s ∈ S, and then applies Bellman backups: V(s) ← min_{a ∈ A(s)} max_{s' ∈ T(s,a)} V(s'). The term being minimised in the Bellman backup is often called a Q-value, so for us Q(s,a) = max_{s' ∈ T(s,a)} V(s'). By repeatedly applying Bellman backups over all states, the value function eventually converges to V^*. JEA give arguments for this, and it is also demonstrated by Prop-U later. Dually to V^*(s) = 0 indicating a safe state, V^*(s) = 1 indicates that s is an unsafe state, from which no policy can guarantee that the fail states S_f are avoided. Now, we return to the problem formalisation. JEA look for safe policies in tasks that have goals, other terminal states, and fail states to avoid. Their policies must avoid the fail states, either by reaching a terminal state and stopping there, or by cycling in a way that avoids the fail states. JEA's tasks can be transformed into our state-avoiding tasks by adding a self-loop action to goal and terminal states, allowing policies to avoid fail states indefinitely by using these self-loops rather than terminating. For the rest of this paper we focus on deciding the safety of individual states, so we only consider singleton initial states S_I = {s_I}. We note that state-avoiding tasks can be cast into goal-reaching ones. Given a state-avoiding problem Θ = ⟨S, A, T, S_I, S_f⟩, we can define an equivalent goal-reaching problem Θ' = ⟨S ∪ {g}, A, T', S_I, {g}⟩. This introduces an artificial goal g: all transitions from non-fail states gain g as a non-deterministic effect, and fail states are turned into "traps" that only loop back to themselves and can never reach the goal.
Formally, for all s ∈ S and a ∈ A(s),
T'(s,a) = T(s,a) ∪ {g}   if T(s,a) is defined and s ∉ S_f,
T'(s,a) = {s}            if T(s,a) is defined and s ∈ S_f,
T'(s,a) is undefined     if T(s,a) is undefined.
Now, Θ' has a policy that always reaches its goal g iff Θ has a policy that avoids the states S_f. This is clear because a goal-reaching policy for Θ' cannot enter any fail state in S_f, and therefore describes a state-avoiding policy for Θ. Going the other direction, a state-avoiding policy for Θ never enters S_f, so all its actions have g as one of their effects, which means that, due to the fairness assumption (Assumption 1), the policy must eventually reach the artificial goal in Θ'. This transformation from state-avoiding to goal-reaching lets us solve the state-avoiding problem in terms of goal-reaching with state-of-the-art FOND planners such as PR2 (Muise, McIlraith, and Beck 2024). However, we limit the scope of this paper to fundamental algorithms that are native to state-avoiding tasks, letting us make a fairer comparison to JEA and perform sharper complexity analyses, thereby establishing a foundation for future work. Moreover, using their algorithms in our codebase is non-trivial because it requires a translation from JANI to variants of PDDL, which is an open line of research, e.g., (Klauck et al. 2018).

3 Algorithms
We briefly explain TarjanSafe (JEA) and demonstrate that it has worst-case exponential runtime w.r.t. the state space, but a good best case that reflects its strong performance in practice. Then, we show Prop-U, an algorithm that solves state-avoiding tasks in linear time, but has a worse best case that makes it impractical. Finally, we introduce a policy iteration algorithm and prove that it combines a polynomial worst case with the same best case as TarjanSafe.
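As an aside, the state-avoiding to goal-reaching transformation T' defined above is mechanical; a minimal sketch over a hypothetical dictionary encoding of T (an illustration, not the authors' code):

```python
GOAL = "g"  # the artificial goal state

def to_goal_reaching(T, fail_states):
    """Build T' from a state-avoiding task: transitions of non-fail states
    gain the artificial goal g as an extra non-deterministic effect, while
    fail states become self-looping traps that can never reach g."""
    T_prime = {}
    for (s, a), successors in T.items():
        if s in fail_states:
            T_prime[s, a] = {s}                  # trap: loop back to itself
        else:
            T_prime[s, a] = set(successors) | {GOAL}
    return T_prime
```

Under the fairness assumption, any policy that stays out of S_f must eventually take a g-effect in T', which is exactly the equivalence argued above.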
We use the notation m = |A| and n = |S|; b = max_{s ∈ S, a ∈ A(s)} |T(s,a)| denotes a bound on the branching factor of non-determinism; and π_min-safe is a safe policy with a minimal policy graph. We assume that A(s) ≠ ∅ for each state, so m ≥ n. Our complexity results are summarised in tab. 1. We restrict the best-case runtime analyses to tasks where a safe policy exists; if no safe policy exists, then all the algorithms we consider can be implemented with an early stop, so that in the best case they terminate in one or two steps, i.e., O(1).

Table 1: Best-case (where a safe policy exists) and worst-case runtimes of the algorithms we consider. We use m = |A|, n = |S|, the branching factor of non-determinism b, and |π_min-safe|, a smallest safe policy's envelope size.
  TarjanSafe (alg. 1):         best case Θ(|π_min-safe|), worst case Ω(2^m)
  Prop-U (alg. 2):             best case Θ(|S_f|),        worst case Θ(bm)
  nPI (alg. 3), iPI (alg. 4):  best case Θ(|π_min-safe|), worst case O(bmn)

3.1 TarjanSafe
TarjanSafe is JEA's algorithm for finding FOND state-avoiding policies. It is a DFS with a labelling mechanism for states that have been proven safe or unsafe, and it detects SCCs in its candidate policy using the stack and lowlink indices of Tarjan's algorithm (Tarjan 1972), hence the name TarjanSafe. We present pseudocode for the algorithm in alg. 1.² (² JEA present a more general algorithm where a safety radius r can be specified; for us r = ∞. Also, we assume our state-avoiding formulation. Our presentation has been simplified accordingly.) TarjanSafe's DFS is defined recursively: in its recursive cases, TarjanSafe iterates over a ∈ A(s) and calls itself on the successor states T(s,a); it then either marks s as unsafe if all these actions lead to unsafe states, or otherwise assumes that s is safe and proceeds.
Additionally, TarjanSafe detects SCCs and marks the SCC's root as safe if the SCC contains no unsafe states. The recursion stops at s without making further recursive calls in the following cases:
1. s is a fail state (s ∈ S_f),
2. s is marked as unsafe with known-unsafe[s],
3. s is a goal state (not relevant in our setting),
4. s is marked as safe with known-safe[s],
5. some action in A(s) leads only to states on the stack (T(s,a) ⊆ stack).

Algorithm 1: JEA's TarjanSafe
Function TarjanSafe(Θ, s_I):
    known-safe[s] ← false ∀s ∈ S
    known-unsafe[s] ← false ∀s ∈ S
    stack ← empty stack
    low ← empty map
    return rec-TarjanSafe(s_I)

Function rec-TarjanSafe(s):            // has access to TarjanSafe's variables
    if s ∈ S_f or known-unsafe[s] then return false
    if known-safe[s] then return true
    stack-idx ← |stack|
    push s onto stack
    low[s] ← stack-idx
    A-unsafe ← true
    foreach a ∈ A(s) while A-unsafe do
        a-safe ← true
        foreach s' ∈ T(s,a) while a-safe do
            if s' ∈ stack then
                low[s] ← min(low[s], low[s'])
            else
                a-safe ← rec-TarjanSafe(s')
        A-unsafe ← ¬a-safe
    if A-unsafe then
        known-unsafe[s] ← true
    else if low[s] = stack-idx then    // update safe at the root of an SCC
        known-safe[s] ← true
        pop from stack all states down to s
    pop s from stack
    return ¬A-unsafe

We motivate TarjanSafe's practical effectiveness with its best-case runtime.
Theorem 1. TarjanSafe has a best-case runtime of Θ(|π_min-safe|) (if a safe policy exists).
Proof.
Suppose π_min-safe is a safe policy with a minimal policy graph such that the policy graph is a tree, except for its leaf nodes, which have self-loops. Further suppose that TarjanSafe guesses π_min-safe, i.e., in its DFS it always selects π_min-safe(s) for s. Then the algorithm traverses π_min-safe precisely once with its DFS to prove that the states are safe, and does not need to do any branching or backtracking on the action choices, so it is O(|π_min-safe|). To verify that a policy π is safe, one must at least check that each state in its policy graph is not a fail state in S_f, so it is impossible to do better than Ω(|π_min-safe|). Thus, TarjanSafe's best case must be Θ(|π_min-safe|). ∎
But in the worst case, TarjanSafe is exponential.
Theorem 2. TarjanSafe has a worst-case runtime of Ω(2^m).
Proof. We present a class of problems where TarjanSafe requires on the order of 2^m steps, which proves that TarjanSafe's worst case is asymptotically 2^m or worse, i.e., Ω(2^m). Consider the task in fig. 1 with 5 layers, and consider the generalised construction with d layers. This task has no fail, unsafe, nor goal states, so TarjanSafe's cases 1–3 will never trigger for any state. Case 4 can only be triggered at an SCC's root, which is only s_0, so this case only triggers when the algorithm terminates. Thus, the recursion can only bottom out at a state s if s's applicable action leads to states that are already on the stack, which in turn only happens at the state in the last layer, leading back to s_0. This forces the algorithm to re-expand states and enumerate all paths. In our task with d layers, TarjanSafe must enumerate all paths from s_0 to s_{d+1} before it can terminate, of which there are 2^d. The problem has 2d+2 states and 2d+2 actions with a branching factor b of 2, so indeed the runtime of the algorithm is exponential w.r.t. the transition system size.
∎
We now show that with other approaches our problem can in fact be solved in polynomial time, even in the worst case.

Figure 1: Non-deterministic task with 5 layers and 2^5 paths from s_0 to s_6. Similarly constructed tasks with d layers have 2^d paths from s_0 to s_{d+1}.

3.2 Unsafety Propagation
We define Prop-U and show that it solves state-avoiding tasks in linear time by propagating the unsafe region. This result draws from reachability games. Such games have two players who alternate turns: one player must reach certain states, and their opponent must make sure that these states are never reached. This framework is immediately analogous to our setting: the fail states S_f correspond to the winning states of an "unsafety player," and the "safety player" wins by ensuring that the unsafety player never wins; a winning strategy for the safety player corresponds to a safe policy for us. The unsafety player can be understood as an adversarial choice of non-deterministic outcomes.³ (³ FOND fairness is not violated by this interpretation, since the unsafety player will never voluntarily choose to get trapped in a cycle, as they are trying to reach unsafe states in finitely many steps.) The existence of a linear-time algorithm is considered folklore for reachability games, e.g., (Mazala 2002, Exercise 2.6). Diekert and Kufleitner (2022) present a concrete linear-time algorithm for solving reachability games. In alg. 2, we define Prop-U by adapting their algorithm to our state-avoiding FOND setting and formulating it in the framework of Value Iteration. Effectively, Prop-U propagates the unsafe region outwards from the known unsafe states (initially only the fail states), and then a safe policy from s_I exists iff s_I is outside the unsafe region. Thus, upon termination, V(s) = 1 iff s is unsafe.
Be aware that throughout execution the semantics of V are slightly different: V(s) = 1 iff s has been proved unsafe, while V(s) = 0 indicates that we have not proved s either way.

Algorithm 2: Propagate Unsafe States
Function Prop-U(Θ, s_I):               // here Q is a separate variable, not a function of V
    V(s) ← ⟦s ∈ S_f⟧ ∀s ∈ S
    Q(s,a) ← 0 ∀s ∈ S, a ∈ A(s)
    Z ← S_f
    while Z ≠ ∅ do
        s' ← Z.pop()
        for (s,a) : s' ∈ T(s,a) do
            Q(s,a) ← 1
            if V(s) = 0 and min_{a' ∈ A(s)} Q(s,a') = 1 then
                V(s) ← 1
                Z ← Z ∪ {s}
    return V(s_I)

Theorem 3. Prop-U (alg. 2) returns 0 if s_I is safe and 1 if s_I is unsafe.
Proof. We want to show that a state s updates V(s) ← 1 and enters Z iff it is unsafe. Only unsafe states enter Z, by an inductive argument: the fail states S_f are the base case, and otherwise s only enters Z if every action leads to some unsafe state, i.e., min_{a' ∈ A(s)} Q(s,a') = 1. It remains to show that all unsafe states eventually enter Z. For contradiction, suppose that some unsafe state does not enter Z, which implies that there is an unsafe state s with V(s) = 0. This state must have an action a ∈ A(s) that leads to an unsafe state s' but has Q(s,a) = 0. Consequently, we can select an unsafe state s' ∈ T(s,a) with V(s') = 0. Then we can apply the same argument to s', and repeat it to produce a chain ⟨s_0, s_1, …⟩ of unsafe states with V(s_i) = 0 for all s_i in the chain. By always selecting some action with Q(s,a) = 0 we are describing a policy, and since each state we are considering is unsafe, there must be a path in this policy graph that leads to a fail state. So it must be possible to extend our chain so that it eventually reaches a fail state; but fail states have V(s) = 1, yielding the desired contradiction. ∎
Theorem 4. Prop-U (alg.
2) has a runtime of Ω(|Y|) and O(b|Y|), where Y = {(s,a) : s ∈ S, a ∈ A(s), T(s,a) ∩ U ≠ ∅}, i.e., Y contains the state-action pairs with an effect in the unsafe region.
Proof. Each unsafe state s is added to Z precisely once (as argued before), so the algorithm's inner for loop iterates precisely over Y, with some potential repetitions if a pair (s,a) has multiple unsafe states in T(s,a). This gives Ω(|Y|). To handle the repetitions, we observe that T(s,a) cannot contain more unsafe states than the non-deterministic branching factor b. Thus, we iterate over Y with at most b duplicates per pair, i.e., the algorithm is O(b|Y|). ∎
Corollary 5. Prop-U (alg. 2) has a best-case runtime of Θ(|S_f|) (if a safe policy exists) and a worst-case runtime of Θ(bm).
If no safe policy exists from the initial state s_I, then Prop-U can stop as soon as s_I enters Z, which may happen in a single step. We call Prop-U's worst-case runtime of Θ(bm) linear, even though it contains two terms. Recall that b corresponds to the non-deterministic branching factor and m is the number of actions, so bm bounds the number of transitions ⟨s, a, s'⟩ in our state space. In that sense, Prop-U is linear w.r.t. the number of transitions. Connecting to reachability games, bm corresponds to the number of edges in an equivalent game: the algorithm of Diekert and Kufleitner (2022) is linear in the number of edges of a game, so it is reasonable to consider our adaptation Prop-U linear in our setting. Since Prop-U is in the VI framework, we can compare it to standard VI and variants like Topological VI (TVI) (Dai et al. 2011). Prop-U dominates VI and TVI in the sense that in its worst case it computes Q(s,a) once for each s ∈ S, a ∈ A(s), whereas VI and TVI perform the same computations in their best case, and usually more.
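The propagation scheme of alg. 2 can be sketched in a few lines of Python, assuming a hypothetical encoding where T maps (state, action) pairs to successor sets and `actions(s)` returns A(s) (an illustration, not the authors' implementation):

```python
def prop_u(states, actions, T, fail_states, s_init):
    """Propagate the unsafe region backwards from the fail states.
    Returns 1 if s_init is unsafe, 0 if it is safe (cf. alg. 2)."""
    # index predecessors: for each state, the (s, a) pairs that can reach it
    preds = {s: [] for s in states}
    for (s, a), succs in T.items():
        for s2 in succs:
            preds[s2].append((s, a))
    V = {s: 1 if s in fail_states else 0 for s in states}
    unsafe_q = set()               # state-action pairs proven unsafe (Q = 1)
    Z = list(fail_states)
    while Z:
        s2 = Z.pop()
        for (s, a) in preds[s2]:
            unsafe_q.add((s, a))
            # if every applicable action of s is now unsafe, s is unsafe
            if V[s] == 0 and all((s, a2) in unsafe_q for a2 in actions(s)):
                V[s] = 1
                Z.append(s)
    return V[s_init]
```

Each predecessor edge is touched once per unsafe successor, which is the O(b|Y|) bound from Theorem 4.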
Prop-U has the lowest worst-case complexity of the algorithms we consider in this paper, and it seems unlikely that one can do better. However, in our setting the unsafe region, and therefore Y, is very large (exponential w.r.t. the problem description). This makes Prop-U impractical. It is possible to exploit problem structure and propagate the set of unsafe states with Binary Decision Diagrams (BDDs), as explained in appendix A. However, propagating unsafe states with BDDs still fails on our problems, because the BDDs explode in size after only a few steps of propagation. It seems the approach of propagating unsafe states is fundamentally unsuited to our problems. In the next section, we present an alternative algorithm that combines the best of TarjanSafe and Prop-U: it can stop as soon as it finds a safe policy but maintains a polynomial worst-case runtime.

3.3 Policy Iteration (PI)
Our implementation of PI in alg. 3, named nPI, makes a nice tradeoff: it has the same best case as TarjanSafe and remains polynomial in the worst case. PI is well known as an MDP algorithm (Howard 1960), and arguably, successful goal-reaching FOND algorithms such as PR2 (Muise, McIlraith, and Beck 2024) fit within its framework. In our setting, the idea of PI is that it starts with some policy π, determines where π is unsafe, fixes π, and repeats until π is safe or s_I is proven unsafe. We define nPI using a value function V with the semantics that V(s) = 1 if we know that s is unsafe, and V(s) = 0 if we are unsure (as in Prop-U). Initially, we only know that the fail states are unsafe, so V(s) ← ⟦s ∈ S_f⟧ (alg. 3). We use the shorthand Q(s,a) = max_{s' ∈ T(s,a)} V(s'), which is 1 if we know it is unsafe to apply a in s, and 0 if we are unsure. Then, we propagate the states that we know are unsafe by updating V(s) with its minimal Q-value (alg. 3).
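This evaluate-then-repair loop can be sketched as follows (a simplified Python illustration with hypothetical data structures; the authors' implementation is in C++ and differs in detail):

```python
def npi(T, actions, fail_states, s_init, pi_init):
    """Naive policy iteration (cf. alg. 3): evaluate the candidate policy,
    propagate newly discovered unsafe states, then repair greedily."""
    V = {}                                           # V(s)=1 means proven unsafe
    v = lambda s: 1 if s in fail_states else V.get(s, 0)
    Q = lambda s, a: max(v(s2) for s2 in T[s, a])

    def postorder(pi):                               # DFS post-order of the policy graph
        order, seen, stack = [], set(), [(s_init, False)]
        while stack:
            s, done = stack.pop()
            if done:
                order.append(s)
            elif s not in seen:
                seen.add(s)
                stack.append((s, True))
                if s not in fail_states:             # fail states are leaves
                    for s2 in T[s, pi[s]]:
                        stack.append((s2, False))
        return order

    pi = dict(pi_init)
    while True:
        saw_unsafe = False                           # Mark-Unsafe pass
        for s in postorder(pi):
            if s in fail_states:
                saw_unsafe = True
                continue
            V[s] = min(Q(s, a) for a in actions(s))
            saw_unsafe = saw_unsafe or V[s] == 1
        if v(s_init) == 1 or not saw_unsafe:
            return v(s_init)
        pi, frontier = {}, [s_init]                  # Greedy repair
        while frontier:
            s = frontier.pop()
            if s in pi or s in fail_states:
                continue
            pi[s] = min(actions(s), key=lambda a: Q(s, a))
            frontier.extend(T[s, pi[s]])
```

The post-order ensures successors in the candidate policy graph are backed up before their parents, mirroring Mark-Unsafe in alg. 3.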
Algorithm 3: Naive Policy Iteration
Function nPI(Θ, s_I, π_init):
    V(s) ← ⟦s ∈ S_f⟧ ∀s ∈ S
    π ← π_init
    repeat
        V, saw-unsafe ← Mark-Unsafe(Θ, s_I, V, π)
        if ¬saw-unsafe or V(s_I) = 1 then
            return V(s_I)
        π ← Greedy(Θ, s_I, V)

Function Mark-Unsafe(Θ, s_I, V, π):
    E ← DFS post-order of the policy graph Θ^π from s_I
    for s ∈ E do
        V(s) ← min_{a ∈ A(s)} Q(s,a)
    saw-unsafe ← ⟦∃s ∈ E s.t. V(s) = 1⟧
    return V, saw-unsafe

Function Greedy(Θ, s_I, V):
    new π, undefined everywhere
    while π has an undefined state reachable from s_I do
        s ← an undefined state reachable from s_I
        π(s) ← argmin_{a ∈ A(s)} Q(s,a)
    return π

We now analyse the runtime complexity of nPI.
Theorem 6. nPI (alg. 3) has a best-case runtime of Θ(|π_min-safe|) (if a safe policy exists).
Proof. The argument is similar to the proof of thm. 1. In the best case, π_init is π_min-safe, so nPI only needs to perform a single DFS over π_init to evaluate the policy. ∎
If no safe policy exists, then nPI can stop as soon as it propagates the unsafe region to s_I and sets V(s_I) ← 1. With an early-stop mechanism (as implemented later in iPI), this may happen in as few as two steps.
Theorem 7. nPI (alg. 3) has a worst-case runtime of O(bmn).
Proof. The functions Mark-Unsafe and Greedy are both O(bm) because they are implemented with a variant of DFS that does not re-expand states, i.e., it expands each state at most once. It remains to show that the main loop of alg. 3 repeats at most n times.
To this end, we argue that in each pass of the loop where π remains unsafe, at least one state with V(s) = 0 gets updated to V(s) ← 1; this can happen at most n times, because before then we either obtain a safe policy and terminate, or we set V(s_I) = 1 and terminate. Suppose we have computed π ← Greedy(Θ, s_I, V) and V, saw-unsafe ← Mark-Unsafe(Θ, s_I, V, π), and neither of the termination conditions is satisfied, i.e., π is unsafe and V(s_I) = 0. Since π is unsafe, its envelope must contain at least one state s† with V(s†) = 1, and there is a (potentially empty) sequence of intermediate states between s_I and s†. On that sequence, there must be a pair of consecutive states s_i, s_{i+1} with V(s_i) = 0 and V(s_{i+1}) = 1, by a discrete intermediate-value argument: V(s) can only be 0 or 1, and at one end of the sequence we have V(s_I) = 0 while at the other V(s†) = 1, so somewhere it has to switch from 0 to 1. Consequently, Q(s_i, π(s_i)) = 1. Recall that π was constructed greedily, so min_{a ∈ A(s_i)} Q(s_i, a) = 1, which means that V(s_i) must be updated to 1, concluding the proof. ∎
The implementation in alg. 3 has various inefficiencies, which we address in the Improved Policy Iteration algorithm iPI (alg. 4). The key change in iPI is that the construction of a greedy policy and its evaluation are combined into a single step. Consequently, the algorithm only needs to traverse the candidate policy's envelope once per iteration, rather than twice. Moreover, it lets us stop the construction of the greedy policy as soon as an unsafe state has been detected in its envelope (alg. 4); this avoids the expensive situation in nPI, where Greedy(·) encounters an unsafe state early on but continues construction, even though the policy cannot be safe.
In the change from nPI to iPI we also generalise the notion of an initial policy into an action order A⃗, where A⃗(s) returns the same actions as A(s) but ordered. Consequently, we do not provide π_init, as it is captured by the first "preference" of each A⃗(s). This leads to the subtle detail that we no longer construct policies that are greedy w.r.t. V; rather, we construct policies according to the preferences of A⃗. We made this choice because profiling revealed that computing Q-values for the greedy policy w.r.t. V is relatively expensive and does not pay off, i.e., using an arbitrary fixed order over actions had better performance. The choice of A⃗ can be important, since a good ordering may quickly lead to a safe policy or to a proof that s_I is unsafe, while a bad one causes iPI to waste time. We note that with iPI's recursive definition it is not immediately clear whether the main loop of alg. 4 is still necessary. Indeed, it is, as demonstrated by the example in appendix B. iPI (alg. 4) has the same best-case and worst-case runtimes as nPI (alg. 3), by similar arguments.
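The recursive scheme just described can be sketched as follows. This is a simplified, hypothetical illustration, not the authors' implementation: in particular, it raises the restart flag only when a new state is proven unsafe, a slight deviation from alg. 4 that keeps the sketch self-contained and terminating on simple inputs.

```python
UNSAFE, MAYBE_SAFE = "unsafe", "maybe-safe"

def ipi(T, action_order, fail_states, s_init):
    """Improved policy iteration (cf. alg. 4): build and evaluate the
    preferred candidate policy in one recursive pass, abandoning an
    action as soon as one of its effects is known unsafe."""
    V = {s: 1 for s in fail_states}        # V(s)=1 means proven unsafe

    def rec(s, seen, flags):
        if V.get(s, 0) == 1:
            return UNSAFE
        if s in seen:
            return MAYBE_SAFE              # cycle: optimistic for now
        seen.add(s)
        for a in action_order(s):          # look for a safe-looking action
            if all(rec(s2, seen, flags) != UNSAFE for s2 in T[s, a]):
                return MAYBE_SAFE
        V[s] = 1                           # all actions unsafe => s unsafe
        flags["saw_unsafe"] = True         # new proof: may invalidate `seen`
        return UNSAFE

    while True:                            # restart while new unsafe states appear
        flags = {"saw_unsafe": False}
        rec(s_init, set(), flags)
        if not flags["saw_unsafe"] or V.get(s_init, 0) == 1:
            return V.get(s_init, 0)
```

The restart loop mirrors the discussion above: a `maybe-safe` answer obtained through `seen` can be stale once some state is newly proven unsafe, so the pass is repeated until a pass completes without new unsafe discoveries.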
Function iPI(Θ, s_I, action order A):
    V(s) ← ⟦s ∈ S_f⟧ ∀s ∈ S
    repeat
        seen ← ∅
        saw-unsafe ← false
        rec-iPI(s_I)
        if ¬saw-unsafe or V(s_I) = 1 then
            return V(s_I)

Function rec-iPI(s):  // rec-iPI has access to iPI's vars
    if V(s) = 1 then
        saw-unsafe ← true
        return unsafe
    if s ∈ seen then
        return maybe-safe
    seen ← seen ∪ {s}
    // Look for a safe action in A(s)
    for a ∈ A(s) do
        for s′ ∈ supp(s, a) do
            s′-status ← rec-iPI(s′)
        if s′-status ≠ unsafe ∀s′ ∈ supp(s, a) then
            return maybe-safe
    // All actions unsafe means s unsafe
    V(s) ← 1
    saw-unsafe ← true
    return unsafe

Algorithm 4: Improved Policy Iteration

Figure 2: Cumulative plots of the proportion of states decided w.r.t. time. The plots separate safe and unsafe states: (left) is an aggregation over all domains' safe states, (middle) is over all unsafe states, and (right) is over safe states in Flappy Bird.

4 Experiments

We empirically compare iPI with JEA's TarjanSafe for deciding state safety: given a learned policy π that we wish to check for safety, we find a state s in π's graph and ask iPI and TarjanSafe whether s is safe. Recall that finding such states is at the core of JEA's pipeline for testing the safety of the learned policy. We consider iPI with three different action orderings: (1) app: the order of the model (same as TarjanSafe); (2) learn: using the learned action policy that we are evaluating; (3) rand: randomised. We do not consider Prop-U because it runs out of memory (see section 3.2 and appendix A). Our implementation extends JEA's C++ code and is available at (Schmalz and Jain 2026). All experiments were run on Intel Xeon E5-2660 CPUs with 12GB memory and a timeout of 15 minutes.

Benchmarks.
We use JEA's benchmarks (scaled up), consisting of non-deterministic variants of blocksworld and transport; deterministic control benchmarks (bouncing ball, follow car and inverted pendulum); as well as different variants of a transport-like domain called one/two-way line, where a truck moves uni-(respectively bi-)directionally on a line, carrying packages to the right. We also introduce a new benchmark, Flappy Bird. In this problem, a bird must navigate a 2D environment from left to right, avoiding stationary obstacles. It can move up or down, and in each move may non-deterministically move to the right. Once the bird reaches the right side of the environment, it gets reset to the left side, i.e., it moves through the environment indefinitely unless it hits an obstacle. Thus, a safe policy moves the bird through the environment indefinitely, avoiding all obstacles. When safe policies exist, this showcases TarjanSafe's bad performance, taking inspiration from the example in fig. 1 in a more practical setting. Following JEA, all problems are encoded in JANI (Budde et al. 2017). For learned policies to test, we trained neural networks via Q-learning with two hidden layers of size 64.

Results. Our results are summarised in fig. 2. We note the benchmarks are solved on small timescales: the iPI variants solve almost all problems within 1 sec. This is not because the problems are small: we handle instances with up to 80 integer variables bounded by 70. To motivate this further, Prop-U runs out of memory as it enumerates all exponentially-many fail states (w.r.t. the JANI description); similarly, Prop-U with regression over Binary Decision Diagrams (BDDs), which can represent a set of states compactly, also runs out of memory as the BDDs become too large. Rather, the benchmarks are solved quickly by exploiting the features of state-avoiding problems.
For deciding safe states, there are often cycles that TarjanSafe and iPI use to quickly construct safe policies that avoid fail states. For deciding unsafe states, TarjanSafe and iPI do not need to explore deeply before finding a way to propagate unsafety to s_I. Comparing the variants of iPI: iPI(app) dominates the others. The orderings themselves do not appear to have a significant impact: all orderings tend to guess safe policies quickly (for safe states) and propagate the unsafe region to s_I efficiently (for unsafe states). Rather, the performance difference comes from the implementation, namely the overhead of sorting with learn and randomising with rand. Comparing iPI to TarjanSafe: for deciding states over all domains (left and middle), there is no significant difference between the approaches. JEA's benchmarks contribute most to these results, and they tend to be amenable to TarjanSafe, so this demonstrates that iPI indeed has similar best-case behaviour to TarjanSafe. In contrast, for deciding safe states in Flappy Bird, iPI scales exponentially better than TarjanSafe (note the runtime scale is logarithmic). This confirms our theory that iPI is exponentially faster than TarjanSafe in its worst case. So, iPI never performs significantly worse than TarjanSafe, and in some cases performs exponentially better.

Summary. These results confirm our theory that iPI(app) dominates TarjanSafe in the sense that it is similar on problems amenable to TarjanSafe, and exponentially better on problems that are problematic for TarjanSafe, such as Flappy Bird. Furthermore, the results show that within iPI, the default action ordering performs best in our setting. We give a breakdown over domains and more discussion in appendix C; these are consistent with our findings here.

5 Conclusion

The state-safety decision problem is significant, being a necessary component of JEA's safety assurance pipeline for learned policies.
We considered JEA's TarjanSafe algorithm for deciding state safety, and introduced Prop-U and iPI. We showed TarjanSafe has an exponential worst-case runtime but a good best case; in contrast, Prop-U has a linear worst case but an impractical best case; iPI fills the gap, combining a polynomial worst case with the same best case as TarjanSafe. We confirmed this empirically: iPI is comparable to TarjanSafe when TarjanSafe does well, and otherwise iPI scales exponentially better. This places iPI as the best algorithm for deciding state safety in our framework, and as the most promising starting point for future work.

Acknowledgements

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – GRK 2853/1 “Neuroexplicit Models of Language, Vision, and Action” – project number 471607914.

Appendix A Propagating Unsafety with Binary Decision Diagrams

Rather than regressing unsafety directly in the state space, it is natural to try to exploit problem structure and regress using Binary Decision Diagrams (BDDs). In JEA's setting, the set of fail states S_f is compactly represented as a symbolic formula φ_F called the fail condition. The symbolic representation of the unsafe region is equivalent to the weakest precondition a state has to satisfy to guarantee reaching the fail condition. This can be computed using regression (Rintanen 2008), and BDDs are state-of-the-art structures for such computations. A concrete implementation is provided in alg. 5. (Since BDDs are restricted to boolean variables, but our transition system deals with linear constraints on integer variables, we encode the constraints in toBDD as described in Bartzis and Bultan (2006).) This algorithm uses ρ to represent the previously found unsafe region and τ for the newly found unsafe states. In each iteration, ρ is updated by adding the states in τ; τ is found by computing the preimage of each action with respect to ρ.
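For intuition, the same fixed point can be run over explicit state sets instead of BDDs. This is a deliberately naive sketch: compute_unsafe, the (guard, updates) action encoding, and the set representation are our own stand-ins, and the per-action conjunction mirrors the literal structure of alg. 5, i.e., a state enters τ only if every action's guard holds and every action may lead into ρ via some nondeterministic update.

```python
# Explicit-set analogue of alg. 5's fixed point; sets stand in for BDDs.
# actions: list of (guard, updates) pairs, where guard(s) -> bool and
# each update u maps s to one possible successor. Names are ours.

def compute_unsafe(actions, fail_states, states):
    rho = set()               # previously found unsafe region (rho)
    tau = set(fail_states)    # newly found unsafe states (tau)
    while not tau <= rho:
        rho |= tau
        # literal analogue of conjoining TR(a, rho) over all actions:
        # keep s iff, for every action, its guard holds at s and some
        # nondeterministic update leads from s into rho
        tau = {s for s in states
               if all(g(s) and any(u(s) in rho for u in us)
                      for g, us in actions)}
    return rho
```

The explicit version enumerates the state space in each iteration, which is exactly the blow-up that the BDD representation is meant to avoid; it is only meant to clarify the shape of the fixed point.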
Here, each action consists of a guard g, representing the weakest precondition a state must satisfy for the action to be applicable in it, and updates u_1,…,u_n, representing the nondeterministic effects of applying the action to a state. The following describes the preimage computation for a given set of successor states S(v′) and an action a. The set of predecessor states that can reach S(v′) on applying a is:

g(v) ∧ ∃v′.(S(v′) ∧ (u_1(v,v′) ∨ … ∨ u_n(v,v′)))

Intuitively, ∃v′.(S(v′) ∧ (u_1(v,v′) ∨ … ∨ u_n(v,v′))) is a relational product (Burch et al. 1994) representing the states where applying a can lead to states in S(v′). The disjunctive partitioning ensures that any nondeterministic effect leading to S(v′) is included. Conjoining with g(v) filters out all states where the action is not applicable.

Function computeUnsafe(φ) → BDD:
    ρ ← toBDD(⊥); τ ← toBDD(φ)
    while τ ⊈ ρ do
        ρ ← ρ ∨ τ
        τ ← toBDD(⊤)
        foreach a ∈ A do
            τ ← τ ∧ TR(a, ρ)
    return ρ

Function TR(a, ρ) → BDD:
    if a = ⟨g, u_1 | … | u_n⟩ then
        return toBDD(g) ∧ (RelProd(toBDD(u_1), ρ) ∨ ⋯ ∨ RelProd(toBDD(u_n), ρ))

Algorithm 5: Compute the unsafe region using Binary Decision Diagrams

Prop-U is not practical for our problems. The fail condition φ_F describes exponentially-many fail states S_f, so alg. 2 runs out of memory in its first step of enumerating fail states, even before attempting to propagate the unsafe region. With BDDs (alg. 5), it is possible to represent φ_F compactly, but after propagating it a few steps to describe the unsafe region, the BDDs explode in size, again running out of memory. Thus, it seems that regressing the unsafe region is not a suitable approach to solve our problems.

Appendix B Is the Main Loop in iPI Necessary?
Is the main loop in iPI necessary, or is a single call to rec-iPI(s_I) enough? The loop is indeed necessary, because there are cases where rec-iPI(s_I) constructs an unsafe policy π and does not return unsafe. Let us apply iPI to the state-avoiding task shown in fig. 3; the key steps are:
• suppose rec-iPI first traverses s_I, a_0, s_1, a_1, s_2, a_2 and then encounters s_2, which is marked as seen, so the recursion unwinds to s_1;
• at s_1, rec-iPI sees that the only action a_1 inevitably encounters a fail state, so it sets V(s_1)=1 and backtracks to s_I;
• at s_I, rec-iPI selects a′_0 as a potentially safe alternative to a_0; this leads to s_2, which is in seen, and now the recursion fully unwinds.
However, the policy π with π(s_I)=a′_0, π(s_1)=a_1, and π(s_2)=a_2 is unsafe. The key insight is that rec-iPI only propagates that states are unsafe with V(s)=1 backwards, but in the presence of cycles, as we see in fig. 3, it may also be necessary to propagate this information forwards to detect that the policy is unsafe.

Figure 3: A state-avoiding task where a single call to rec-iPI finds an unsafe policy but does not recognise it as unsafe.

Appendix C Detailed Results

We break down the plots from fig. 2 by their domains in fig. 4 and fig. 5. Over the domains, the results are consistent with the aggregated results in section 4, i.e., iPI(app) and TarjanSafe have similar performance on JEA's benchmarks and show an exponential difference when deciding safe states on Flappy Bird; within iPI, the action ordering app generally outperforms learn and rand. We emphasise that TarjanSafe's worst case is only triggered by Flappy Bird when deciding safe states, and not when deciding unsafe states. This is because, for unsafe states, s_I can be proven unsafe before all paths need to be enumerated, bringing the task closer to TarjanSafe's best case.
Consequently, there is no significant difference between iPI(app) and TarjanSafe when deciding unsafe states for Flappy Bird, and this does not contradict any of our claims.

Figure 4: Cumulative plots of the proportion of states decided w.r.t. time for unsafe states.

Figure 5: Cumulative plots of the proportion of states decided w.r.t. time for safe states.

References

Bartzis, C.; and Bultan, T. 2006. Efficient BDDs for Bounded Arithmetic Constraints. Int. Journal on Software Tools for Technology Transfer, 8(1): 26–36.
Bonet, B.; and Geffner, H. 2003. Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming. In Proc. of Int. Conf. on Automated Planning and Scheduling (ICAPS), 12–21.
Budde, C. E.; Dehnert, C.; Hahn, E. M.; Hartmanns, A.; Junges, S.; and Turrini, A. 2017. JANI: Quantitative Model and Tool Interaction. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 151–168.
Burch, J. R.; Clarke, E. M.; Long, D. E.; McMillan, K. L.; and Dill, D. L. 1994. Symbolic Model Checking for Sequential Circuit Verification. IEEE Trans. on Computer-Aided Design of Integrated Circuits & Systems, 13(4): 401–424.
Cimatti, A.; Pistore, M.; Roveri, M.; and Traverso, P. 2003. Weak, Strong, and Strong Cyclic Planning via Symbolic Model Checking. Artificial Intelligence, 147(1–2): 35–84.
Dai, P.; Mausam; Weld, D. S.; and Goldsmith, J. 2011. Topological Value Iteration Algorithms. Journal of Artificial Intelligence Research.
Daniele, M.; Traverso, P.; and Vardi, M. Y. 2000. Strong Cyclic Planning Revisited. In Recent Advances in AI Planning, 35–48.
Diekert, V.; and Kufleitner, M. 2022. Reachability Games and Parity Games. In Int. Colloquium on Theoretical Aspects of Computing (ICTAC).
Hansen, E. A.; and Zilberstein, S. 2001. LAO*: A Heuristic Search Algorithm That Finds Solutions With Loops. Artificial Intelligence, 129(1): 35–62.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. Technology Press of Massachusetts Institute of Technology.
Jain, C.; Sherbakov, D.; Vinzent, M.; Steinmetz, M.; Davis, J.; and Hoffmann, J. 2025. Policy Safety Testing in Non-Deterministic Planning: Fuzzing, Test Oracles, Fault Analysis. In Proc. of European Conf. on Artificial Intelligence (ECAI).
Klauck, M.; Steinmetz, M.; Hoffmann, J.; and Hermanns, H. 2018. Compiling Probabilistic Model Checking into Probabilistic Planning. In Proc. of the Int. Conf. on Automated Planning and Scheduling (ICAPS), 150–154.
Mazala, R. 2002. Infinite Games, chapter 2, 23–38.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-Level Control Through Deep Reinforcement Learning. Nature, 518(7540): 529–533.
Muise, C.; McIlraith, S. A.; and Beck, J. C. 2024. PRP Rebooted: Advancing the State of the Art in FOND Planning. In Proc. of AAAI Conf. on Artificial Intelligence, 20212–20221.
Rintanen, J. 2008. Regression for Classical and Nondeterministic Planning. In Proc. of European Conf. on Artificial Intelligence (ECAI), 568–572.
Schmalz, J.; and Jain, C. 2026. Code and Data for Algorithms for Deciding the Safety of States in Fully Observable Non-deterministic Problems. DOI: 10.5281/zenodo.18926835.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529: 484–503.
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; and Hassabis, D. 2018. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play. Science, 362(6419): 1140–1144.
Tarjan, R. 1972. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing, 1(2): 146–160.
Toyer, S.; Thiébaux, S.; Trevizan, F.; and Xie, L. 2020. ASNets: Deep Learning for Generalised Planning. Journal of Artificial Intelligence Research, 68: 1–68.