← Back to papers

Paper deep dive

RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification

Anonymous Authors

Year: 2025Venue: arXiv preprintArea: Formal/TheoreticalType: EmpiricalEmbeddings: 73

Abstract

Abstract:As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: this https URL.

Tags

ai-safety (imported, 100%)empirical (suggested, 88%)formaltheoretical (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/11/2026, 1:12:06 AM

Summary

RepV is a neurosymbolic verifier that bridges formal methods and language models by learning a safety-separable latent space. It uses a lightweight projector to embed plans and rationales, allowing for the verification of natural-language rules against formal specifications. RepV provides probabilistic guarantees on verification correctness and enables guarantee-driven refinement of planning models without human annotation.

Entities (5)

RepV · system · 100%Interpreter · component · 95%Model Checker · tool · 95%Projector · component · 95%Latent Space · concept · 90%

Relation Signals (4)

Projector mapsto Latent Space

confidence 95% · projector P maps these embeddings into an m-dimensional latent space

RepV uses Projector

confidence 95% · RepV trains a lightweight projector that embeds each plan

Interpreter generates Rationale

confidence 90% · Interpreter MI that processes a generated plan p and a natural language rule r_nl, producing a binary classification y and a text-based rationale R.

RepV integrates Model Checker

confidence 90% · RepV unifies formal verification with representation learning

Cypher Suggestions (2)

Find all components of the RepV system · confidence 90% · unvalidated

MATCH (s:System {name: 'RepV'})-[:HAS_COMPONENT]->(c) RETURN c.name, c.type

Identify the data flow from input to latent space · confidence 85% · unvalidated

MATCH (p:Plan)-[:PROCESSED_BY]->(i:Interpreter)-[:OUTPUTS]->(r:Rationale), (i)-[:EMBEDDED_BY]->(proj:Projector)-[:MAPS_TO]->(ls:LatentSpace) RETURN p, i, proj, ls

Full Text

72,559 characters extracted from source content.

Expand or collapse full text

REPV: SAFETY-SEPARABLE LATENT SPACES FOR SCALABLE NEUROSYMBOLIC PLAN VERIFICATION Yunhao Yang 1 Neel P. Bhatt 1 Pranay Samineni 1 Rohan Siva 1 Zhangyang Wang 1 Ufuk Topcu 1 ABSTRACT As AI systems migrate to safety–critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduceRepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker,RepVtrains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification,RepVprovides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the plan- ner, improving rule compliance without human annotations. Empirical evaluations show thatRepVimproves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2 M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: https://repv-project.github.io/ 1INTRODUCTION As AI systems take on greater roles in safety-critical ap- plications, ensuring their actions comply with well-defined rules remains challenging. Compliance is a foundation for safety, reliability, and accountability across healthcare (Os- ifowokan et al., 2025; Sbodio et al., 2024; Daram, 2025), finance (Deshpande, 2024; Balakrishnan, 2024), and au- tonomous systems (Yazdanpanah et al., 2023; He et al., 2021). Traditional rule-based systems enable formal veri- fication against specified logic-based constraints (Grosan et al., 2011). In contrast, AI systems, driven by deep learn- ing models, lack interpretability and formal guarantees due to their black-box nature, which increases the difficulty of rule compliance verification. Formal methods and language models provide two ap- proaches for verifying compliance, yet each has its limi- tations. Existing works have extended formal methods tech- niques to support compliance verification across multiple input modalities, including natural language and program- 1 The University of Texas at Austin, United States. Correspon- dence to: Yunhao Yang<yunhaoyang234@utexas.edu>, Neel P. Bhatt <npbhatt@utexas.edu>. Preprint. Copyright 2025 by the authors. ming languages (Yang et al., 2025; Li et al., 2024; Yang et al., 2024b). Although these methods offer guarantees, the logical specifications they rely on have limited expressive- ness and demand domain expertise, limiting their accessibil- ity. On the other hand, several works query language models for compliance classification, allowing inference with natu- ral language rules (Guan et al., 2024; Hassani, 2024; Ying & Sacks, 2024). Still, they can suffer from misclassifica- tions due to the black-box nature of language models, hence limiting their reliability. This work addresses the challenge of predicting rule compli- ance in planning tasks, where interpretibility and reliability are essential, but often in tension. We consider AI systems that generate and execute a plan, i.e., a sequence of actions, subject to a set of constraints. These constraints may be combinations of logical specifications and natural language rules. While logical specifications support formal verifica- tion through tools such as model checkers (Pnueli, 1977; Baier & Katoen, 2008), they require structured input and domain expertise. Conversely, natural language rules are easy to obtain and express but lack the semantics required for formal verification. We introduceRepV, a neurosymbolic verifier that bridges the gap between the two means of expressing constraints arXiv:2510.26935v1 [cs.RO] 30 Oct 2025 RepV: Scalable Neurosymbolic Verification Latent Space A vector space preserving the information of the plan, specs, and rationale: p,Φ,ℛ Safe: correct interpretation y=y* Unsafe: wrong Interpretation y≠y* Available logical specification φ Interpreter (LLM) RepV Model Checker Interpreter’s prediction with rationale Model checking result y∈0,1ℛ y*∈0,1 Plan p Natural language rule r nl New plan p′ Interpreter (LLM) Interpreter’s prediction with rationale y′ ∈0,1 ℛ′ Safe/Unsafe: ? y′ =y′ * Figure 1.A high-level illustration ofRepV. It first learns a latent space where the safe (the interpreter’s prediction aligns with the model checker) and unsafe plan representations are separable. Then, it classifies the new plan’s compliance against the new natural language rules based on the spatial location of the plan representation in the latent space. and verifies whether a plan complies with constraints, dur- ing system execution.RepVbegins with a modest set of plans labeled via formal verification tools. It then trains a lightweight projector that maps each plan, along with an LLM-generated rationale, into a low-dimensional latent space. Within this space, each point represents the align- ment between the interpreter’s predicted compliance and the formal verification outcome. We define a plan as safe when this alignment holds, i.e. when the interpreter’s prediction matches the verification result. Once the projector is trained, RepVclassifies compliance with unseen natural language rules based on a plan’s spatial location in the latent space. We present an illustration of RepV in Figure 1. Building on the alignment-based latent space, we establish a probabilistic guarantee (statistically calibrated probability estimate) onRepV’s correctness of verification. Instead of producing a binary label, the verifier quantifies the like- lihood that its inferred compliance outcome aligns with the formal verification outcome. This likelihood is cali- brated using the distance between the generated plan and the cluster centroid of a correctly aligned representation in latent space. Rather than providing a binary assessment of the plan,RepVoffers a calibrated confidence estimate, enabling uncertainty-aware compliance assessments. We further leverage the probabilistic guarantee to guide re- finement of the external planner (multimodal foundation model) in a data-efficient manner. We propose two ap- proaches for guarantee-driven refinement: Supervised fine- tuning and preference optimization. The former identifies high-guarantee compliant plans and uses them to fine-tune the planner’s parameters, consolidating reliable planning be- haviors. The latter ranks data preferences via the probabilis- tic guarantee and train the planner to produce the preferred planning behaviors. Together, the guarantee-driven refine- ment forms a lightweight feedback loop that continuously improves the plan-generation model without additional hu- man labeling or full retraining. Empirical evaluations show thatRepVimproves compli- ance prediction accuracy by up to 15% compared to base- line methods while adding fewer than 0.2 M parameters. The evaluations are over four domains: simulated driving, outdoor navigation (i.e., real-world driving), indoor navi- gation, and 3D aerial navigation.RepVachieves over 90% prediction accuracy in all domains, showcasing its reliabil- ity in safety-critical domains and generalizability to unseen environments and rules. Furthermore, the guarantee-driven refinement improves the planner’s ability to generate rule-compliant plans and outper- forms ordinary fine-tuning baselines. Across four robotic domains, this refinement process improves the planner’s plan-generation compliance rate by up to 15% while re- ducing convergence time by more than half compared to ordinary supervised fine-tuning. The results demonstrate that probabilistic guarantees provide an efficient mechanism for continual planner improvement, transforming verifica- tion feedback into actionable learning signals. Contributions: We develop a neurosymbolic verifier, RepV, that enables accessible and reliable planning through three key contributions: (1)RepVenables compliance veri- fication with natural language rules, eliminating the need to handcraft logical specifications. (2) It establishes a proba- bilistic guarantee that quantifies the likelihood of each com- pliance judgment, enabling uncertainty-aware verification with natural-language rules. (3) The probabilistic guarantee enables continual improvement of external plan generation models with minimal overhead. 2RELATED WORK Formal Methods for Compliance Verification: Formal methods offer rigorous guarantees for verifying system behavior against well-defined specifications (Mehdipour et al., 2023). Techniques such as model checking have been widely applied in safety-critical systems (Baier & Katoen, 2008). Recent works have expanded these techniques to support more input modalities (Bhatt et al., 2024), such as programming languages and natural language, by translat- ing them into structured (mathematical) representations (Li RepV: Scalable Neurosymbolic Verification et al., 2024; Yang et al., 2024b;a). While compelling in well-defined domains, these techniques struggle with the ambiguity and variability of rules expressed in natural lan- guage. Language Models for Compliance Checking: Recent advancements in language models with strong reasoning capabilities enable new approaches to compliance checking (Guan et al., 2024; Ying & Sacks, 2024; Gan, 2024). These models extract obligations and detect non-compliance in legal, financial, and healthcare texts, outperforming formal methods approaches in flexibility and scalability (Hafizi, 2024; Hassani, 2024; Hassani et al., 2024; Berger et al., 2023). While these approaches effectively enhance model safety, lacking formal guarantees can be unreliable, espe- cially in safety-critical domains. Neurosymbolic Approaches for Compliance Checking: Several studies integrate learned symbolic representations of behaviors from logic-based systems to assess whether the behaviors comply with well-defined rules (Barbara et al., 2023; Pakina & Pujari, 2024; Paul, 2025; Ahn et al., 2025; Bhatt et al., 2025b; Daggitt et al., 2024). Despite promising results, they require handcrafted logic-based systems or are limited to structured domains. Hence, they lack scalabil- ity towards long-term development. Additionally, several works use LLM-based modules to transform informal con- straints into formal logic (Ganguly et al., 2024; Lee et al., 2025), improving scalability. However, the reliability of this transformation is not guaranteed due to the black box LLM. Latent Signals from LLM Activations:Recent stud- ies have revealed that LLMs encode rich internal signals in their activations (intermediate outputs), such as indi- cators of errors (Orgad et al., 2024; Karnik & Bansal, 2025; Bhatt et al., 2025a), hallucinations (Ferrando et al., 2024), refusals (Arditi et al., 2024), and overthinking ten- dencies (Chen et al., 2025). They often capture subtle model behaviors that are not directly observable from the output text. Inspired by these findings, we explore whether safety- related signals can be detected within the model’s encoded activations. 3PRELIMINARIES Formal Methods in Sequential Decision-Making: Se- quential decision-making refers to the process by which an agent selects actions over time, where each action affects future states. In this work, we focus on planning problems which are a subset of sequential decision-making problems. Given a planning task, we define a plan as a sequence of actions that achieve a task goal. In this work, we describe a plan in either natural language or programming language. Formal methods provide tools for modeling and verifying planning problems represented mathematically. Suppose we ∅ start ω 1 ω 2 ¬ σ True σ True ∅ start ω 3 ∅ σ σ ¬σ True Figure 2.Automaton corresponding to a syntax “ifσ, doω 1 , else ω 2 ” (left) and “while σ, do ω 3 ” (right). have a transition systemTSmodeling the world knowledge or autonomous systems, a finite-state automaton (FSA)A representing the plan, and a temporal logic specification φconstraining the temporal ordering and logical relations between actions.AandTSare mathematical structures with a finite number of states (Baier & Katoen, 2008), andφ is a logical formula with temporal operators such as “always” and “eventually” (Pnueli, 1977). We apply a model checker, a tool from formal methods, to verify whetherA, when implemented inTS, satisfiesφ, denoted asA⊗ TS |= φ(Baier & Katoen, 2008; Cimatti et al., 2002). We present the formal definitions of these terminologies in Appendix A. Text-Based Plan to Automaton: To formally verify a plan against logic specifications, we need to express the plan in a formal representation such as an FSA. Existing works have developed methods for converting natural language or programming language to FSA (Yang et al., 2025; 2024b). We use an algorithm, denoted asL2A, that takes a text- based planp, expressed in natural language or programming language, as input and converts it into an FSAA = L2A(p). In particular, the algorithm parses the text input into phrases (e.g., keywords and variables in a programming language). It then follows pre-defined grammar to convert the parsed phrases into automaton states and transitions. We present an automaton conversion example in Figure 2 and present the details in Appendix B. 4REPV: A NEUROSYMBOLIC VERIFIER RepVis a neurosymbolic verifier that unifies formal verifi- cation with representation learning to enable verification of externally generated action plans against natural lan- guage rules. At its core,RepVconstructs a latent space where safe and unsafe plans are linearly separable, enabling efficient inference of compliance with natural-language rules. This latent-space embedding bridges the symbolic rigor with linguistic accessibility, establishing a foundation to extend verification toward probabilistic guarantees and iterative refinement. Components:RepVcomprises four main components: an external knowledge source for plan generation, an inter- preterM I , a projectorP , and a model checkerM C . - Knowledge Source (Planner): An external source used RepV: Scalable Neurosymbolic Verification Foundation Model Logical Specification φ Natural Language Rule r nl Yield at pedestrians and cars. r nl = Model Checker Latent Space (prior training) Star: (representation) z Projector Training Data input: , label: (z,y)y* 0~fail, 1~pass y*Φ0,1 Interpreter (LLM) “It meets the rule because...” ℛ= φ=G( pedestrian ≠ car )∈X¬ move Training Data input: [ representation ], Output: [ T/F ] Training Data input: [ representation ], Output: [ T/F ] Projector Training Data input: , label: D P (z,y)y* Projector Latent Space (after training) Training Data Collection Projector Training Plan “Wait for pedestrian, then turn right” p= Correct Interpretation (Safe) Wrong Interpretation (Unsafe) Projector Compliance Prediction yΦ0,1 Supervised Unsafe y′ ∨y′ * “The plan successfully... but violates......” ℛ′ = y′ =1 Compliance Inference New Natural Language Rule r′ nl Wrong Interpretation (Unsafe) Correct Interpretation (Safe) Interpreter L2A Text/Image Prompt External Knowledge Source Figure 3.An illustration for collection of auto-labeled training data for the projector, training the projector to construct a safety-separable latent space, and predicting plan compliance against natural language rules via the spatial location of the projector’s output representation. for plan generation. In this setting, we use a multimodal foundation model that accepts text- or image-based task prompts as input, generating a text-based plan that fulfills the task. Note that the rule is a part of the task prompt as well. For brevity, we denote it as a planner. - Interpreter: A language modelM I that processes a gen- erated planpand a natural language ruler nl , producing a binary classificationy ∈0, 1indicating its prediction on rule compliance (1 means comply) and a text-based ratio- naleR. Mathematically, we havey,R =M I (p,r nl ). An illustrative example of interpreter outputs is in Section 6. - Projector: A multi-layer perceptionP : R ε → R m that takes theε-dimensional text embeddingsemb(p,R) ∈ R ε derived from the plan and the interpreter’s rationale and maps them into anm-dimensional latent space, where com- pliance and noncompliance can be distinguished spatially. - Model Checker: A formal verification toolM C verifies whether a planp, represented as a finite-state automaton A p = L2A(p), satisfies a logical specificationφwhen ex- ecuted in a world/system modelTS. It returns a binary ground-truth label indicating compliance: M C (A p ,φ) = 1 ifA p ⊗ TS |= φ, else 0. Definition of Safety: To learn a safety-separable latent space, we first define the terminology on safe and unsafe. Definition 1: Given a planp, a natural language ruler nl , and a logical specificationφthat corresponds tor nl (i.e.,r nl describes the semantic meaning ofφ), we obtain a natural language interpretationy,R =M I (p,r nl )and the formal verification outcome y ∗ =M C (A p ,φ). We define the planpas safe with respect to the ruler nl if and only if its predicted classification aligns with the formal verification outcome: p is safe ⇐⇒ y = y ∗ .(1) Note that the “safety” inRepVdiffers from the everyday notion of physical or behavioral safety. A plan is consid- ered safe even if it violates the rule (y ∗ = 0), as long as the interpreter’s prediction correctly reflects that violation (y = 0). Conversely, a plan is unsafe when there is a misalignment between the interpreter’s prediction and the verified outcome (y ̸= y ∗ ), regardless of whether the plan itself satisfies or violates the rule. Hence, safety in this context measures semantic consistency between reasoning modalities—the alignment of linguistic interpretation and formal verification. Problem Statement: Given a planpextracted from ex- ternal sources (e.g., AI-generated) and a natural language ruler nl , verify whether each plan complies with the rule, denoted asp |= r nl , and estimate the probability of such verification outcome being correct. 4.1 Learning a Safety-Separable Latent Space To achieve reliable verification across linguistic and formal modalities,RepVlearns a latent space where semantically safe and unsafe plans become linearly separable. This latent space serves as a bridge between linguistic reasoning (from M I ) and symbolic verification (fromM C ), enabling the RepV: Scalable Neurosymbolic Verification system to infer compliance for new natural-language rules based on spatial proximity. Training Data Collection: We derive each training sample from a triplet(p,r nl ,φ), wherepis a generated plan,r nl is a natural-language rule, andφis its corresponding logical specification. The interpreterM I predicts a compliance label y and generates a rationaleR: y,R =M I (p,r nl ).(2) We then convert the planpinto an automatonA p = L2A(p) and verify it againstφby the model checkerM C , producing a ground-truth label y ∗ : y ∗ =M C (A p ,φ).(3) Latent Representation: We obtain a text embedding emb(p,R) ∈ R ε using a pretrained text encoderemb() that encodes the plan p and the interpreter’s rationaleR. A lightweight projectorP : R ε → R m maps this em- bedding into a low-dimensional latent representationz = P(emb(p,R)) . Each representationzis associated with a binary labely safe that indicates whether the interpreter’s prediction aligns with the verified outcome: y safe = ( 1, y = y ∗ (safe), 0, y ̸= y ∗ (unsafe). (4) Projector Optimization: We then train the projector using a cross-entropy loss that minimizes the classification error between the predicted safetyˆy safe (predicted via the spatial location in the latent space) and the verified label y safe : L proj =− X i y (i) safe log ˆy (i) safe +(1−y (i) safe ) log(1−ˆy (i) safe ). Through supervised learning,RepVlearns a safety- separable latent spaceZ ⊂ R m , where spatial clusters encode alignment between linguistic interpretation and for- mal verification. As shown in Figure 3, the process involves collecting formally verified samples, projecting them into Z, and learning a linear boundary that partitions safe and unsafe representations. 4.2 Probabilistic Rule Verification After obtaining the latent spaceZ,RepVinfers compliance for new plans and rules by projecting them intoZand esti- mating a probabilistic guarantee that reflects the probability of correct verification. Inference Procedure: Given a new planp ′ and a natural language ruler ′ nl , the interpreterM I produces a binary classification and rationale: y ′ ,R ′ =M I (p ′ ,r ′ nl ), and the projector maps the text embedding into the latent space: z ′ =P(emb(p ′ ,R ′ )). We then attach a linear classifierC : R m → 0, 1to predict the compliance label: ˆy ′ safe = C(z ′ )∈0, 1, where1indicates that the interpreter’s reasoning is aligned with the expected formal outcome. Calibration Dataset: To provide a probability estimate for this inference, we maintain a small calibration set D c = (z i ,y i ,y ∗ i )that is i.i.d. with the potential testing set, where eachz i =P(emb(p i ,R i ))is a latent representa- tion andy ∗ i is the formal verification result fromM C . For each correctly classified sample (y i = y ∗ i ), we compute its distance to the centroid c ∗ of the corresponding cluster: d i =∥z i − c ∗ ∥ 2 , c ∗ = 1 |S| X z j ∈S z j , whereSis the set of correctly aligned (safe) samples. These distances form a calibration distributionF C (d)that captures how far safe samples typically lie from their centroid. Distance-Based Probabilistic Guarantee: For a new in- ference(p ′ ,r ′ nl )with latent representationz ′ , we compute its distance to the nearest centroid. We then formulate a calibration distribution that captures the relation between the distances and prediction accuracies. This distribution is used to estimate the probability that a predictiony ′ is correct, denoted as the probabilistic guarantee ˆp(y ′ | z ′ ). Definition 2 (Distance to Centroid): Letc s andc u be the centroids of the safe and unsafe clusters inZ. For a new input(p ′ ,r ′ nl )with embeddingz ′ , the distance to its nearest centroid is d ′ = min ∥z ′ − c s ∥ 2 , ∥z ′ − c u ∥ 2 . Definition 3 (Calibration Distribution): LetF C : R + → [0, 1] denote the cumulative distribution of distances from correctly classified samples to their cluster centroids. Let y ∗ c ∈ 0, 1be the safety label associated withc ∗ . For a new input, F C (d ′ ) gives the probability that a sample from the opposite class lies beyond distance d ′ : F C (d ′ ) = Pr ∥z i − c ∗ ∥ 2 > d ′ y i ̸= y ∗ c . If the safe and unsafe classes are imbalanced, we compute F C separately for each class and select the corresponding F C based on the nearest centroid. Theorem 1 (Probabilistic Guarantee): Given a new in- put with latent representationz ′ and nearest centroidc ∗ ∈ RepV: Scalable Neurosymbolic Verification Data Calibration Training Data input: [ representation ], Output: [ T/F ] Training Data input: [ representation ], Output: [ T/F ] Calibration Data input: , label: D C (z,y)y* Latent Space (after training) Correct Interpretation (Safe) Wrong Interpretation (Unsafe) Centroid c* Representation z Distance to centroid d= || zΦc* || 2 Calibration Distribution F C Probabilistic Guarantee Guarantee-Guided Refinement New Input: (zℛ ,yℛ ) Latent Space dℛ = || zℛ Φc* || 2 Probabilistic Guarantee ≠ p(yℛ * | zℛ ) F C Foundation Model Training Data input: [ representation ], Training Data input: [ representation ], Fine-Tuning Data task prompt, plan D hf Guarantee-guided data filtering External Knowledge Source SFT or DPO Guarantee-guided preference ranking Figure 4.Overview of the probabilistic verification and refinement pipeline. Left: After training the latent space, we extract a calibration distributionF C using distances from correctly verified samples to their cluster centroids. Right: Given a new plan–rule pair, we compute the distance between its latent representation and the nearest centroidc ∗ . Then, we useF C to compute the probabilistic guarantee. The guarantee can be subsequently used as a feedback signal for upgrading the planner. c s ,c u , the probabilistic guaranteeˆp(y ′ | z ′ )that its clas- sification is correct is ˆp(y ′ | z ′ ) = 1− (1− F C (d ′ ))· Pr[y i ̸= y ∗ c ] Pr[∥z i − c ∗ ∥ 2 ≤ d ′ ] .(5) Proof 1: Letc ∗ ∈ c s ,c u be the nearest centroid of the new input’s latent representationz ′ , and letd ′ =∥z ′ −c ∗ ∥ 2 . By Definition 3, the calibration distribution satisfies F C (d ′ ) = Pr[∥z i − c ∗ ∥ 2 > d ′ | y i ̸= y ∗ c ]. We define the probabilistic guarantee as ˆp(y ′ | z ′ ) = Pr[y i = y ∗ c |∥z i − c ∗ ∥ 2 ≤ d ′ ] = 1− Pr[y i ̸= y ∗ c |∥z i − c ∗ ∥ 2 ≤ d ′ ]. Applying Bayes’ rule, Pr[y i ̸= y ∗ c |∥z i − c ∗ ∥ 2 ≤ d ′ ] = Pr[∥z i − c ∗ ∥ 2 ≤ d ′ | y i ̸= y ∗ c ] Pr[y i ̸= y ∗ c ] Pr[∥z i − c ∗ ∥ 2 ≤ d ′ ] = (1− F C (d ′ )) Pr[y i ̸= y ∗ c ] Pr[∥z i − c ∗ ∥ 2 ≤ d ′ ] . Substituting into the definition of ˆp(y ′ | z ′ ), we get ˆp(y ′ | z ′ ) = 1− (1− F C (d ′ )) Pr[y i ̸= y ∗ c ] Pr[∥z i − c ∗ ∥ 2 ≤ d ′ ] . Hence proven. Verification:ˆp(y ′ | z ′ )denotes uncertainty associated with the rule compliance prediction. A higherˆp(y ′ | z ′ )implies stronger alignment between linguistic interpretation and formal verification. Formally, a new planp ′ complies with a natural-language rule r ′ nl with a guarantee ˆp(y ′ | z ′ ) if p ′ |= r ′ nl ⇐⇒ (ˆy ′ safe = 1∧ y ′ = 1) ∨ (ˆy ′ safe = 0∧ y ′ = 0) .(6) 5GUARANTEE-DRIVEN REFINEMENT To close the loop between verification and planning, RepV uses the probabilistic guaranteeˆp(y | z)as a feedback signal to refine the planner, which is a foundation model that produces plans. We adopt two complementary strategies: supervised fine-tuning and preference optimization. Supervised Fine-Tuning (SFT) via Guarantee Filtering: We collect a dataset of plan–rule pairs(p,r nl )along with their interpreter predictions and latent probabilities. We then filter this dataset to retain only high-confidence exam- ples, i.e. those where the probabilistic guarantee exceeds a threshold τ (e.g. 0.9). Formally, define D hf = (p,r nl ) : ˆp(y | z)≥ τ. We useD hf to fine-tune the planner via cross-entropy loss, training it to generate plans that are rule compliant. Because the filtered set excludes high-uncertainty examples, the fine- tuning process focuses on reinforcing desired patterns. Probabilistic Guarantee for Preference Ranking: We can use the probabilistic guaranteeˆp(y | z)also as a measure of relative preference between alternative plans generated by the planner. For each given task, the planner generates two different plansp 1 andp 2 by varying the random seed. Then, we estimate the guarantee for both plans:ˆp(y 1 | z 1 )andˆp(y 2 | z 2 ). We select the plan with higher guarantee of rule compliance as the preferred output, and the other as the non- preferred output. Lastly, we add this triplet (task prompt, preferred output, non-preferred output) into the fine-tuning datasetD pr . We then directly passD pr into a Direct Preference Optimiza- tion (DPO) framework to fine-tune the planner. Compared RepV: Scalable Neurosymbolic Verification Figure 5.Example visual and textual input prompts to the planner. to supervised fine-tuning, preference optimization performs better when multiple correct answers exist. 6DEMONSTRATION We demonstrateRepVon navigation tasks using the Carla simulator (Dosovitskiy et al., 2017). In this section, we (1) present our experimental setting, (2) visualize the safety- separable latent space and its corresponding calibration dis- tribution, and (3) present case studies that highlight its veri- fication capability. Experimental Setup:We design a series of naviga- tion tasks, such as go straight, turn left, turn right, park, and make a U-turn at the traffic light or the stop sign intersection. We useGPT-o1-minias the interpreter and aNuSMVmodel checker (Cimatti et al., 2002). The projector comprises of three fully connected layers that map 1536-dimensional text embeddings (obtained from text-embedding-3-small) to a 10-dimensional la- tent space (m = 10). Given a set of pre-implemented APIs, such as: 1 # Publishes linear and angular velocity to the agent. 2 velocity_publisher(linear, angular) 3 stop() # Stops the robot. 4 # Returns a boolean value. 5 pedestrian_observed() we useGPT-4.1-minias our planner to generate high- level plans composed of sequences of the system APIs. Specifically, the planner takes a text-based and/or image- based task prompt and the system APIs as inputs, producing a high-level plan, i.e., a Python program, that calls the APIs to accomplish the task. We show examples of input prompts in Figure 5 and output plans in Listing 1. Training: We generate 400 plans and 15 natural-language rules, each paired with its formal logic specification, to train the projector. For example, the rule “Yield to pedestrian” corresponds to φ 1 = G pedestrian→ X¬ publish velocity. The complete rules are in Listing 2 (Appendix C.3). For each plan, we randomly sample one rule from the 15- rule set and query the interpreter to predict compliance and provide an explanation. Then, we use the model checker to obtain the ground-truth compliance label by verifying the corresponding formal specification. Each compliance prediction and rationale generation by the interpreter takes less than 3 seconds, while the model checking requires less than 0.1 seconds per plan. After collecting 400 labeled samples, we train the projector, which contains 197 K parameters, to predict rule com- pliance using cross-entropy loss with a batch size of 20 over 10 epochs. The full training process completes in 144 milliseconds. Figure 6 shows the latent spaces before and after training, demonstrating how the learned space achieves linear separability between safe and unsafe plans. Calibration:We construct a calibration set of 400 ad- ditional samples drawn from the same distribution as the training set, using the same set of natural-language rules and their corresponding formal specifications. Each sample is passed through the trained projector to obtain its latent representation and distance to the nearest cluster centroid. These distances form the calibration distributionF C . This calibration process completes in only 225 milliseconds. We show the resulting distribution in the right plot of Figure 6. Compliance Verification on Unseen Plans: Once the la- tent space and calibration distribution are established, we applyRepVto verify a new plan (Listing 1) against a new natural-language rule: Always avoiding pedestrians. This rule conveys the same underlying constraint as the training rules but is phrased differently, testing the model’s ability to generalize to paraphrased expressions of compliance. 1 def go_across_the_intersection(): 2 ...... 3 while dist < 20.0: 4 if pedestrian_observed(): 5 stop() 6 else: 7 velocity_publisher(speed, angle) 8 dist = speed * (time.time() - t) 9 ...... Listing 1. A plan for the task: “go across this intersection.” Given the meta prompt to the interpreter (including the rule): “Does the code meet the rule: Always avoiding pedestrians?” the interpreter classifies the plan as violating the rule (y ′ = 0 ) with an explanation: “It partially meets the rule. It stops when a pedestrian is detected during turning and completes the turn if unobstructed.” Then, we obtain the latent representation of this planz ′ . The distances ofz ′ to the centroids of the safe and unsafe clusters are[1.2517, 1.0975], indicating that the sample lies closer to the unsafe cluster. The corresponding probabilistic guarantee isˆp(y ′ | z ′ ) = 0.971, suggesting thatRepVis RepV: Scalable Neurosymbolic Verification Figure 6.The figures from left to right show: 1) the latent space before projector training, where safe and unsafe plans are inseparable, 2) latent space after training, which is separable by a linear decision boundary, 3) representations of out-of-domain testing samples with new rules, and 4) calibration distribution F C derived from a calibration set that drawn from the same distribution with the training samples. ∅ start move step right ¬ obstacle obstacle obstacle True ∅ start stop forward pedestrian ¬ pedestrian True True Figure 7.Automaton-based representations of the plans for a Carla simulated driving task (left) and an indoor navigation task (right). confident that the interpreter’s predictiony ′ is incorrect. This inference procedure takes an average of 2.5 seconds. Hence, we conclude that the plan has a 97.1% chance of satisfying the rule. To validate this conclusion, we convert the plan into an automaton (left in Figure 7) and verify it againstφ 1 . The model checker confirms that the plan satisfies the specification. This case highlights howRepVdetects interpreter misclassi- fications. By leveraging the geometric separation and quan- tifying uncertainty in the latent space,RepVcan rectify the language model interpreter’s misclassification, demon- strating its ability to provide reliable, uncertainty-aware verification against natural-language rules. 7QUANTITATIVE ANALYSIS We quantitatively evaluateRepVacross multiple robotic platforms to assess its verification accuracy and impact on downstream planning performance, e.g., how the verifica- tion outcome guides the planner refinement. 7.1 Rule Compliance Prediction We evaluate the compliance prediction accuracy ofRepV across four application domains and compare it against two baseline methods, as shown in Figure 8. Given a set of plans and a set of rules, we define compliance accuracy as Accuracy = 1 N N X i=1 I y i = y ∗ i ,(7) Figure 8.Comparison of three rule compliance prediction methods over four robot planning domains. RepV consistently achieves the highest accuracy compared to the two baselines. The error bar shows the standard deviation across five repetitive runs. whereNis the number of testing samples,y i denotes the predicted compliance label,y ∗ i the ground-truth label from model checking, and I[·] the indicator function. Baselines: We compare our method against two baseline methods listed below. (1) LLM Reasoning: We follow the framework pro- posed by (Guan et al., 2024), using a foundation model (GPT-4.1-nano) directly classifies plan compliance from text inputs via chain-of-thought reasoning. (2) N Classifier: a 10-layer perception trained with 400 samples for 10 epochs, taking plan embeddings as input to predict compliance. (3) RepV (ours): the proposed neurosymbolic verifier, which infers compliance according to Equation 6. Domains: We evaluate the baselines over four robot plan- ning domains of increasing embodiment diversity:Carla driving simulator,Jackalground robot sharing identical APIs and tasks with Carla,Unitree Go2legged robot we used for indoor navigation, and aPX4 Vision 1.0 quadcopter (drone) we used for 3D aerial navigation. Each domain contains 40 generated plans and 5 natural language rules, where each plan is checked against 5 rules separately. We present the rules in Appendix D.1. Results:RepVconsistently achieves the highest compli- ance accuracy across all domains, maintaining at least 95% compliance accuracy on in-domain tasks (Carla and Jackal) RepV: Scalable Neurosymbolic Verification Table 1.Fine-tuning Statistics. Acronyms: BS = Batch Size, LR = Learning Rate Multiplier, E = Epochs, TD = Training Duration (min), T = Trained Tokens, DS = Data Size (KB), S = Number of Samples, S/E = Steps per Epoch, Steps = Total Steps, CE = Convergence Epoch, CStep = Convergence Step. We define convergence if the training compliance rate fluctuates by less than 1% in the later epochs/steps. Our fine-tuning strategies require fewer training samples and converge faster than the baselines. Fine-Tuned ModelBSLRETDTTDSSS/EStepsCECStep SFT (No Filter)20.03516.97838,630658.55400200100051000 SFT via Confidence Filter20.03574.40330,225257.11150753754270 SFT via Guarantee Filter (Ours)20.0358.38346,670285.00150753752175 DPO (Ours)20.03550.68370,760233.222001005002120 Figure 9.The left figure compares compliance rates across various refinement methods. The right figure shows the testing performance of our guarantee-filtered SFT at different guarantee thresholds, in which achieves the best the performance at threshold 0.8. The error bar shows the standard deviation across three repetitive runs. Our methods (red and purple) outperforms the other baselines in all robotic domains and achieves higher performance gain in out-of-domain tasks (Go2 and PX4 drone). and 90% accuracy on out-of-domain tasks (Go2 and Drone). In contrast, the two baselines exhibit significant degradation when transferred to new domains, highlighting the robust- ness of our verifier. Figure 6 (the third from left) visualizes the latent-space dis- tributions of out-of-domain tasks (Go2 and Drone). Notably, these out-of-domain representations exhibit approximately identical distributions as those trained on Carla, showing that the latent space primarily captures the interpreter’s rea- soning behavior rather than domain-specific information. Because the interpreter’s linguistic reasoning pattern re- mains consistent across various domains, the latent geom- etry of safe versus unsafe plans remains stable, enabling robust cross-domain transfer without retraining. 7.2 Planner Refinement After demonstratingRepV’s capability in rule compliance prediction, we leverage its verification results to refine the knowledge-source planner. This section demonstrates how our guarantee-driven refinement effectively and efficiently enhances the planner to generate rule-compliant plans. For quantitative analysis, we compare our guarantee-driven refinement methods against several baselines: SFT (No Filter): Standard supervised fine-tuning using all foundation model-generated plan–rule pairs, regardless of their verification confidence. SFT via Confidence Filter: Fine-tuning performed only on plan–rule pairs whose interpreter assigns a high softmax confidence score (≥ 0.8) to its compliance prediction. SFT via Guarantee Filter (Ours): Fine-tuning restricted to samples satisfying the rules with probabilistic guarantee ˆp(y | z) ≥ τ = 0.8. We present the performance over different guarantee values in Figure 9 (right). DPO (Ours): DPO using the probabilistic guarantee as a ranking signal between alternative plans. To evaluate their performance, we define the metric compli- ance rate as compliance rate = 1 N N X i=1 I p i |= r nl,i ,(8) whereNis the number of testing plans,p i denotes thei-th generated plan,r nl,i the corresponding natural-language rule, andI[·] = 1if and only if the plan is rule-compliant. This metric measures the proportion of generated plans that formally satisfy the given rules. We present the fine-tuning details in Table 1 and evaluate the fine-tuned planners on 160 test samples, each labeled by the model checker using handcrafted formal specifications. The evaluation covers four domains: the Carla simulator (training environment, in-domain), the Jackal ground robot (real-world counterpart of Carla, sim2real transfer), the Uni- tree Go2 (out-of-domain), and the PX4 (out-of-domain). We present the rules for each domain in Appendix D.1. Figure 9 shows the compliance rates achieved by each plan- ner. Across all domains, both of our fine-tuned models outperform the baselines by 10% to 20% in compliance RepV: Scalable Neurosymbolic Verification pedestrian=False publish velocity (move forward) pedestrian=True stop pedestrian=False publish velocity (move forward) pedestrian=False, car=True stop pedestrian=car=False publish velocity (turn left) pedestrian=car=False publish velocity (move forward) Task Prompt: Go across this intersection. (third-person view) obstacle=False move forward (first-person view) obstacle=True step right (third-person view) obstacle=False turn right (third-person view) obstacle=False move forward & sit down (first-person view) obstacle=True step right Task Prompt: Go to the main entrance and sit down (forward 9 meters, right 1 meter). (third-person view) obstacle=True step right Figure 10.We execute the plan generated by our fine-tuned planner on real robots for outdoor and indoor navigation. The robots complete the task while complying with the domain-specific rules. Task Prompt: Fly in a square-shaped trajectory. Trajectory Weather: north wind Take off set_velocity_ned(0,0,-0.5) Move east set_velocity_ned(0,1,0) Land set_velocity_ned(0,0,0.5) Figure 11.The fine-tuned planner generates executable plans that complies rules in aerial navigation. rate, and the performance gap widens in out-of-domain tasks. Meanwhile, our fine-tuning strategies halve the num- ber of training samples and the convergence time. These results highlight that usingRepV’s probabilistic guarantee as feedback enables data-efficient, safety-aligned planner re- finement and generalization across different tasks and rules. 7.3 Real-World Deployment We further validate the refined planners (guarantee-filtered) in real-world environments to examine whetherRepVcan guide safe and executable behaviors. We present the com- plete task prompts and executable plans in Appendix D.2. Outdoor and Indoor Navigation: Figure 10 shows the execution of the generated andRepV-verified plans on two physical platforms. All the executed plans are verified as compliant with a probabilistic guarantee of 90%+. For the Jackal ground robot, we specify a rule “Yield to pedestrians and coming vehicles.” The robot crosses the intersection and crosswalk while dynamically adjusting its actions to give right of way to pedestrians and nearby cars. For the Unitree Go2, we specify a rule that is distinct from outdoor navigation: “Bypass the obstacles; do not stop.” The resulting behavior, as presented in Figure 7 (right), demonstrates that the planner generalizes to different loco- motion modalities and physical dynamics, achieving reliable obstacle avoidance without halting unnecessarily. Aerial Navigation: Figure 11 extends this evaluation to a 3D aerial navigation task. The rule is: “Keep altitude below 10 meters and landing speed within 1 m/s.” The drone follows a high-level plan that draws a square trajectory under mild wind disturbance while respecting altitude and landing-speed constraints. Summary: The RepV-refined planners successfully pro- duce executable and rule-compliant plans in various physi- cal environments, enabling platform agnostic and safety- constrained plan generation and verification across dif- ferent robot embodiments and operational constraints. 8CONCLUSION We introduceRepV, a neurosymbolic verifier that learns a safety-separable latent space bridging the rigor of for- mal verification with the accessibility of natural-language reasoning. By embedding plans and interpreter rationales into this latent representation,RepVenables probabilistic- guaranteed compliance verification without handcrafted logic specifications. Across simulated and real-world robotic applications,RepV achieves over 90% compliance verification accuracy. Fur- thermore, we leverage the probabilistic verification outcome to guide planner refinement, improving compliance rates of the generated plans by 10–20% while halving convergence time during refinement. Limitations and Future Work: Although RepV general- izes across embodiments, the learned projector currently requires retraining when transferred to distinct reasoning domains beyond robotics and planning, such as medical or finance. On the other hand, we show that the projector learn- ing and calibration require less than one second, making adaptation practical for new environments. Hence, the pri- mary difficulty lies in projector training and calibration data collection, which may require domain expertise to provide formal or domain-specific constraints. Moreover,RepVfocuses on natural-language rules. Future works will aim to verify multimodal constraints (visual traf- fic signals, speed-limit signs, or spatial safety boundaries) by integrating perception modules into the latent-space rea- soning loop. Another direction is to applyRepVto other sequential decision-making domains like logistics, health- care, and financial compliance, where natural-language rules work alongside formal specifications. RepV: Scalable Neurosymbolic Verification REFERENCES Ahn, S., Choi, W., Lee, J., Park, J., and Woo, H. To- wards reliable code-as-policies: A neuro-symbolic frame- work for embodied task planning.arXiv preprint arXiv:2510.21302, 2025. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024. Baier, C. and Katoen, J.-P. Principles of model checking. MIT press, 2008. Balakrishnan, A. Leveraging artificial intelligence for en- hancing regulatory compliance in the financial sector. In- ternational Journal of Computer Trends and Technology, 2024. Barbara, V., Guarascio, M., Leone, N., Manco, G., Quarta, A., Ricca, F., and Ritacco, E. Neuro-symbolic ai for com- pliance checking of electrical control panels. Theory and Practice of Logic Programming, 23(4):748–764, 2023. Berger, A., Hillebrand, L., Leonhard, D., Deußer, T., De Oliveira, T. B. F., Dilmaghani, T., Khaled, M., Kliem, B., Loitz, R., Bauckhage, C., et al. Towards automated regulatory compliance verification in financial auditing with large language models. In 2023 IEEE International Conference on Big Data (BigData), p. 4626–4635. IEEE, 2023. Bhatt, N. P., Yang, Y., Siva, R., Milan, D., Topcu, U., and Wang, Z. Know where you’re uncertain when planning with multimodal foundation models: A formal framework. arXiv preprint arXiv:2411.01639, 2024. Bhatt, N. P., Li, P.-h., Gupta, K., Siva, R., Milan, D., Hogue, A. T., Chinchali, S. P., Fridovich-Keil, D., Wang, Z., and Topcu, U. Uncap: Uncertainty-guided planning us- ing natural language communication for cooperative au- tonomous vehicles. arXiv preprint arXiv:2510.12992, 2025a. Bhatt, N. P., Yang, Y., Siva, R., Samineni, P., Milan, D., Wang, Z., and Topcu, U. Vln-zero: Rapid exploration and cache-enabled neurosymbolic vision-language planning for zero-shot transfer in robot navigation. arXiv preprint arXiv:2509.18592, 2025b. Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. Seal: Steerable reasoning calibration of large language models for free. arXiv preprint arXiv:2504.07986, 2025. Cimatti, A., Clarke, E. M., Giunchiglia, E., Giunchiglia, F., Pistore, M., Roveri, M., Sebastiani, R., and Tacchella, A. NuSMV 2: An opensource tool for symbolic model checking. In Brinksma, E. and Larsen, K. G. (eds.), Computer Aided Verification, volume 2404 of Lecture Notes in Computer Science, p. 359–364, Copenhagen, Denmark, 2002. Springer. Daggitt, M. L., Kokke, W., Atkey, R., Slusarz, N., Arn- aboldi, L., and Komendantskaya, E. Vehicle: Bridging the embedding gap in the verification of neuro-symbolic programs. arXiv preprint arXiv:2401.06379, 2024. Daram, S. Explainable ai in healthcare: Enhancing trust, transparency, and ethical compliance in medical ai sys- tems. International Journal of AI, BigData, Computa- tional and Management Studies, 6(2):11–20, 2025. Deshpande, A. Regulatory compliance and ai: Navigating the legal and regulatory challenges of ai in finance. In 2024 International Conference on Knowledge Engineer- ing and Communication Systems (ICKECS), volume 1, p. 1–5. IEEE, 2024. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. Carla: An open urban driving simulator. In Conference on robot learning, p. 1–16. PMLR, 2017. Ferrando, J., Obeso, O., Rajamanoharan, S., and Nanda, N.Do i know this entity?knowledge awareness and hallucinations in language models. arXiv preprint arXiv:2411.14257, 2024. Gan, Z. Large language models empowering compliance checks and report generation in auditing. World Journal of Information Technology, p. 35, 2024. Ganguly, D., Iyengar, S., Chaudhary, V., and Kalyanaraman, S. Proof of thought : Neurosymbolic program synthesis allows robust and interpretable reasoning, 2024. URL https://arxiv.org/abs/2409.17270. Grosan, C., Abraham, A., Grosan, C., and Abraham, A. Rule-based expert systems. Intelligent systems: A modern approach, p. 149–185, 2011. Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Helyar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer lan- guage models. arXiv preprint arXiv:2412.16339, 2024. Hafizi, A. Auditing healthcare claims through large-scale nlp-based consistency and compliance checks. Interna- tional Journal of Data Science, Big Data Analytics, and Predictive Modeling, 14(9):30–43, 2024. Hassani, S. Enhancing legal compliance and regulation anal- ysis with large language models. In 2024 IEEE 32nd In- ternational Requirements Engineering Conference (RE), p. 507–511. IEEE, 2024. RepV: Scalable Neurosymbolic Verification Hassani, S., Sabetzadeh, M., Amyot, D., and Liao, J. Re- thinking legal compliance automation: Opportunities with large language models. In 2024 IEEE 32nd Inter- national Requirements Engineering Conference (RE), p. 432–440. IEEE, 2024. He, H., Gray, J., Cangelosi, A., Meng, Q., McGinnity, T. M., and Mehnen, J. The challenges and opportunities of human-centered ai for trustworthy robots and autonomous systems. IEEE Transactions on Cognitive and Develop- mental Systems, 14(4):1398–1412, 2021. Karnik, S. and Bansal, S. Preemptive detection and steering of llm misalignment via latent reachability, 2025. URL https://arxiv.org/abs/2509.21528. Lee, C. P., Porfirio, D., Wang, X. J., Zhao, K. C., and Mutlu, B. Veriplan: Integrating formal verification and llms into end-user planning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25, p. 1–19. ACM, April 2025. doi: 10.1145/ 3706598.3714113. URLhttp://dx.doi.org/10. 1145/3706598.3714113. Li, Z., Hua, W., Wang, H., Zhu, H., and Zhang, Y. Formal-llm: Integrating formal language and natural lan- guage for controllable llm-based agents. arXiv preprint arXiv:2402.00798, 2024. Mehdipour, N., Althoff, M., Tebbens, R. D., and Belta, C. Formal methods to comply with rules of the road in autonomous driving: State of the art and grand challenges. Automatica, 152:110692, 2023. Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpek- tor, I., Kotek, H., and Belinkov, Y. Llms know more than they show: On the intrinsic representation of llm hallucinations. arXiv preprint arXiv:2410.02707, 2024. Osifowokan, A. S., Agbadamasi, T. O., Adukpo, T. K., and Mensah, N. Regulatory and legal challenges of artificial intelligence in the us healthcare system: Liability, com- pliance, and patient safety. World Journal of Advanced Research and Reviews, 25(3):949–955, 2025. Pakina, A. K. and Pujari, M. Neuro-symbolic compliance architectures: Real-time detection of evolving financial crimes using hybrid ai. International Journal Science and Technology, 3(3):56–66, 2024. Paul, C.Neuro-symbolic reasoning for safe decision- making in autonomous vehicles frameworks. research- gate.net, 2025. Pnueli, A. The temporal logic of programs. In Sym- posium on Foundations of Computer Science, p. 46– 57, 1977. doi: 10.1109/SFCS.1977.32. URLhttps: //doi.org/10.1109/SFCS.1977.32. Sbodio, M. L., L ́ opez, V., Hoang, T. L., Brisimi, T., Picco, G., Vejsbjerg, I., Rho, V., Mac Aonghusa, P., Kristiansen, M., and Segrave-Daly, J. Collaborative artificial intelli- gence system for investigation of healthcare claims com- pliance. Scientific Reports, 14(1):11884, 2024. Yang, Y., Berthellemy, L., and Topcu, U. Reasoning, memorization, and fine-tuning language models for non- cooperative games. arXiv preprint arXiv:2410.14890, 2024a. Yang, Y., Ward, W., Hu, Z., Biswas, J., and Topcu, U. Joint verification and refinement of language models for safety- constrained planning. arXiv preprint arXiv:2410.14865, 2024b. Yang, Y., Neary, C., and Topcu, U. Automaton-based repre- sentations of task knowledge from generative language models. arXiv preprint arXiv:2212.01944, 2025. Yazdanpanah, V., Gerding, E. H., Stein, S., Dastani, M., Jonker, C. M., Norman, T. J., and Ramchurn, S. D. Rea- soning about responsibility in autonomous systems: chal- lenges and opportunities. Ai & Society, 38(4):1453–1464, 2023. Ying, H. and Sacks, R. From automatic to autonomous: A large language model-driven approach for generic build- ing compliance checking. In Proceedings of the CIB W78 Conference, 2024. RepV: Scalable Neurosymbolic Verification ADEFINITIONS Definition A1: A TRANSITION SYSTEMTS = (Q s ,T s ,L s )is a tuple of a set of statesQ s , a set of transitionsT s = (q i ,q j )| q i ,q j ∈ Q s , i.e., (q i ,q j ) means a transition from state q i to q j , and a label function L s : Q s → 2 AP . APis a set of atomic propositions. Each atomic proposition has a truth value—true or false—but does not contain any logical connectives like ”and,” ”or,” ”not,” etc. Definition A2: A finite-state automaton (FSA)A = (Q a ,p 0 , T a ,L a )is a tuple consisting of a set of statesQ a , an initial state p 0 , a set of transitions T a =(p i ,σ,p j )| p i ,p j ∈ Q a ,σ ∈ 2 AP , and a label function L a : Q a → 2 AP . Definition A3: Given an FSAAand a transition systemTS, a PRODUCT AUTOMATONA⊗ TS, is a tuple(Q,Q 0 ,T,L), where • Q =(p,q)| p∈ Q a ,q ∈ Q s , Q 0 =p 0 × Q s , • T =((p,q), (p ′ ,q ′ ))| p∈ Q a ,q ∈ Q s , (p,L s (q),p ′ )∈ T a , (q,q ′ )∈ T s , • and L((p,q)) = L a (p)∪ L s (q), where p∈ Q a ,q ∈ Q s . Temporal Logic Temporal logic is a formal language that expresses system (represented in FSA) properties that evolve over time. It extends propositional logic by including temporal operators, such asF(“eventually”) andG(“always”), which allow for reasoning about the system’s temporal behaviors. An LTL formula consists of • A set of atomic propositions. • A set of temporal operators describes the system’s temporal behavior. •A set of logical connectives, such as negation (¬), conjunction (∧), and disjunction (∨), that can be used to combine atomic propositions and temporal operators. B L2A: TEXT TO AUTOMATON In our settings, we express the plan in programs, e.g., Python. TheL2Aalgorithm first parses the plan into an abstract syntax tree (AST) using an existing library. Then, it converts the tree into an automaton as presented in Algorithm 13. During this conversion, we define a set of keywords (e.g., Python keywords) and the conversion rules based on these keywords, as presented in Figure 12. When the algorithm observes a keyword with a predefined sentence structure, it follows the conversion rules defined in the keyword processor to translate the sentence into automaton states and transitions. Lastly, the algorithm composes all the states and transitions into a finite-state automaton. CEMPIRICAL STUDY ON NAVIGATION TASKS We use the Carla simulator and the Jackal Clearpath robot for navigation tasks. They share the same set of APIs, but the former is executing in a simulated environment and the latter is executing in real world. C.1 Benchmark As described in Section 7.1, the benchmark methodLLM-Reasoningqueries a language model for compliance classifica- tion. We present the input prompts to the language model below. 1 Given the following plan in Python: 2 3 <Plan expressed in Python function> 4 5 Please analyze whether the code meets the rule: 6 Do not remain stopped when the road is clear, 7 8 Please answer with a single letter ’Y’ or ’N’, indicating yes or no. RepV: Scalable Neurosymbolic Verification Under review as submission to TMLR ADetails of Program-Automaton Conversion Executable Program to Abstract Syntax TreeRecall that an executable program is a program that consists of a set of predefined keywords and grammar associated with the keywords. Given a program, we first parse it into an AST. We use an existing parsing method with built-in rules for translating programs into ASTs. We present some sample ASTs in table 2. ASTFSANote rootstart while ωε → start ε ω ¬ ω ¬ ω ω ω=F C (f s ),ε=F C (f e ),f s ↑S, f e ↑ E. “For loop” can be expressed by “while loop.” rootstart if ωε → start ε ω True ¬ ω ω, ε=F C (f s ),F C (f e ),f s ↑S, f e ↑E. rootstart if ω ε 1 else ε 2 → start ε 1 ε 2 ¬ ω True ω True ω, ε 1 ,ε 2 =F C (f s ),F C (f e 1 ),F C (f e 2 ). For “if-elif-else,” we duplicate the “if” node and replace it with “elif.” rootstart f e1 f e2 f e3 → start ε 1 ε 2 ε 3 True TrueTrue True Running a set of functions sequen- tially without keywords.ε i =F C (f ei ) fori↑[1,2,3]. We can extend it to any number of leaf nodes. Table 2: Rules to convert abstract syntax trees tofinite state automaton(FSA)-based representations. An AST has a set oftree nodesand a set of direct transitions between the tree nodes. Each tree node corresponds to a keyword or a functionf↑S↓E. A tree node has at most one incoming transition and a set of outgoing transitions connecting to a set ofchildrentree nodes.Rootis a tree node that does not belong to the children of any node, andleafis a tree node whose children are empty. Keyword ProcessorThe keyword processor is a function mapping an AST with predefined keywords and specified structures to an FSA. It has a set of built-in rules for mapping an AST to an FSA, and we present some sample rules in table 1. The keyword processor cannot handle AST structures beyond the domain of built-in rules. Tr e e t o F S ASo far, we have the AST for the program and the keyword processor, so we can run algorithm 2 to construct an FSA representing the program. First, the algorithm initializes the states, transitions, and labels of an FSA (lines 2-4). Next, it follows a preorder traversal to go through all the tree nodes in the AST (line 9), and it uses the keyword processor to build sub-automata based on the keywords (lines 7). Then, it merges the sub-automata and returns the merged automaton as the final output (lines 11-15). 16 Figure 12. Keyword processor that converts particular grammars into automaton states and transitions. Under review as submission to TMLR Algorithm 1: Executable plan to finite state automaton. 1:procedure L2A(p, parse, keywords, keyword_processor)ωpis the plan,keywordsis a set of predefined words,parseandkeyword_processorare functions 2:root = parse(p) 3:returnTree2FSA(root, keywords, keyword_processor) 4:end procedure Algorithm 2: Abstract syntax tree to finite state automaton. 1:procedure Tree2FSA(root, keywords, keyword_processor)ωkeywordsis a set of predefined words, keyword_processoris a function 2:Q a ,T a ,L a =[],[],[] 3:create an initial statep 0 ,Q a .add(p 0 ),L a (p 0 )=→ 4:p current =p 0 ωkeep track of the current state 5:fornode in root.childrendo 6:if(every node in node.children is leaf) | (node.children[0] in keywords)then 7: ̃ Q, ̃p 0 , ̃ T, ̃ L= keyword_processor(node) 8:else 9: ̃ Q, ̃p 0 , ̃ T, ̃ L=Tree2FSA(node, keywords, keyword_processor)ωPreorder Traversal 10:end if 11:Q a += ̃ Q,T a += ̃ T,L a += ̃ Lωmerge the sub-automaton 12:T a .add((p current ,True, ̃p 0 )) 13:p current = ̃p 0 14:end for 15:returnQ a ,p 0 ,T a ,L a 16:end procedure Q=(p,q)|p↑Q a ,q↑Q s ,Q 0 =p 0 ↓Q s ,T=((p,q),(p → ,q → ))|p↑Q a ,q↑Q s ,(p,L s (q),p → )↑ T a ,(q,q → )↑T s , andL((p,q))=L a (p)↔L s (q),wherep↑Q a ,q↑Q s . In the product automatonP=(Q,Q 0 ,T,L), •aprefixis a finite sequence of states starting from(p 0 ,q 0 )↑Q 0 , e.g.,(p 0 ,q 0 )(p 1 ,q 1 )(p 2 ,q 2 )...(p k ,q k ), kis the prefix length, •atraceεis a sequence of labelsL((p 0 ,q 0 ))L((p 1 ,q 1 ))..., where Traces(P) denotes the set of all traces fromP. In words, Traces(P) captures all possible behaviors ofAin the task environment represented byTS. Finally, we use a model checker Cimatti et al. (2002) to verify whether the productPsatisfies all the specifications!. If a program’s automaton satisfies all the specifications, we add this program to a set of safety-constrained programs. We present an illustration of this procedure in fig. 2 Phase 1. 4.2SafetyofComposedProgram As verifying long-horizon programs directly can be computationally expensive, we propose acompositional verification theoremthat allows the safety of complex programs to be inferred from their components. Specifically, if each subprogram satisfies the safety specification in isolation, then their composition also satisfies the specification under mild assumptions. An example of a composed program is in fig. 1. To establish and prove the theorem, we first need to define the terminologysafety. Letε↑!be a temporal logic formula. We callεasafety specificationif it describes asafety propertyBaier & Katoen (2008) as defined in definition 6. 5 Figure 13. L2A algorithm that converts a text-based executable plan into a finite state automaton. RepV: Scalable Neurosymbolic Verification C.2 Formal Logic Specifications We present the complete list of specifications below. φ 1 = G pedestrian → X¬ publish velocity , φ 2 = F¬(pedestrian∨ car∨ red light)→ Xpublish velocity, φ 3 = G( stop sign ∧ car )→ X¬ publish velocity , φ 4 = G stop sign→ F stop , φ 5 = G red light→ X(¬ move forward ∧¬ turn left). We have APIs for observing stop signs, traffic lights, etc., that correspond to the atomic propositions “red/green light,” “stop sign,” etc. The propositions “turn left/right,” “move forward,” and “stop” correspond to the API velocitypublisher with different linear and angular speeds. For example, “move forward” corresponds to velocitypublisher(linear=10, angular=0). C.3 Natural Language Rules As described in section 6, we provide 30 natural language rules, from which we sample a subset of rules for training and evaluation. The rules are presented below. 1 ============Training and Calibration====== 2 Give the right of way to all pedestrians. 3 Let pedestrians pass first. 4 Always allow pedestrians to cross before proceeding. 5 6 Do not remain stopped if the road is clear. 7 Keep moving when it is safe to do so. 8 Avoid unnecessary stops when traffic allows you to proceed. 9 10 Stop where there is a vehicle ahead at a stop sign intersection. 11 Yield to oncoming cars at the stop sign. 12 Do not move ahead or turn if there is a vehicle at the stop sign intersection. 13 14 Make sure to stop at the stop sign. 15 Ensure a complete stop after seeing a stop sign ahead. 16 Always stop at the stop sign. 17 18 Always stop at a red light. 19 Do not proceed when the light is red. 20 Wait for the green light before moving. 21 22 ======Testing====== 23 Allow pedestrians to go first in all situations. 24 Pedestrians have the right of way, yield to them. 25 Slow down and let pedestrians cross safely. 26 27 Proceed when there are no obstacles. 28 Do not hesitate when the road is open. 29 Keep the flow of traffic moving when safe. 30 31 Give the right of way to approaching vehicles at the stop sign. 32 Let oncoming cars pass before proceeding from the stop sign. 33 Wait for approaching traffic to clear before moving past the stop sign. 34 35 Be sure to come to a complete stop at the stop sign. 36 Remember to halt when you reach the stop sign. 37 Do not forget to stop when you get to the stop sign. 38 39 Obey traffic signals, do not run red lights. 40 Do not cross the intersection on a red light. RepV: Scalable Neurosymbolic Verification 41 Red means stop; never ignore it. Listing 2. Natural language rules for robot navigation The first three rules from the training set and the first three from the testing set correspond toφ 1 in C.2. The second group of 3 rules in both training and testing sets corresponds toφ 2 in C.2. And the third, fourth, and fifth groups of 3 rules correspond toφ 3 ,φ 4 , andφ 5 , respectively. During training and evaluation, we pass their corresponding formal logic specifications to the model checker to obtain the ground-truth compliance label. DEMPIRICAL STUDY ON OUT-OF-DOMAIN TASKS D.1 Rules and Specifications Indoor Navigation: We present the APIs for Unitree Go2 legged robot below. 1 obstacle_detected() # returns a boolean: detect an obstacle ahead of the robot 2 Move(self, vx: float, vy: float, vyaw: float) # vx is forward velocity and vy is velocity toward the right, vyaw is the counter-clockwise angular velocity 3 sleep(time_in_seconds: float) 4 StandUp() 5 StandDown() For evaluation purposes, we use the model checker to verify the generated plan against the formal specifications presented below. The formal verification results serve as ground truth labels during evaluation. φ 6 = G obstacle → X¬ forward , φ 7 = G obstacle → X( step left ∨ step right ), φ 8 = G¬ obstacle → X¬ stop , the proposition “obstacle” corresponds to the API obstacledetected(); “forward,” “step left” and “step right” correspond to Move() with different input parameters; “stop” corresponds to Move(0,0,0,0). During evaluation, we randomly select one of the following natural-language rules each time and pass it through ourRepV framework for compliance verification. Then, we verify the corresponding formal logic specification to obtain the ground truth. 1 ============Corresponding to specification phi 6============ 2 If an obstacle is detected, the agent must not move forward in the next time step. 3 The robot should never go forward immediately after sensing an obstacle. 4 Whenever there is an obstacle ahead, the next step is to avoid moving forward. 5 6 ============Corresponding to specification phi 7============ 7 If an obstacle is present, the next move should be a step to the left or right. 8 Upon detecting an obstacle, the agent must turn or sidestep, either left or right, on the next step. 9 Whenever an obstacle appears, the next action should be moving sideways. 10 11 ============Corresponding to specification phi 8============ 12 If there is no obstacle, the agent should keep moving. 13 Whenever the path is clear, the robot must continue rather than stop. 14 In the absence of obstacles, the next action must not be to stop. Aerial Navigation: We use the PX4 Vision 1.0 quadracopter (drone) with the APIs presented below. 1 set_velocity_ned(north, east, down, angle) # set drone velocity in m/s, the drone will keep moving in this velocity until it receives the next command 2 sleep_for(seconds: float) # the drone will keep the current action (velocity) during sleep time 3 obstacle_in_front() # returns a boolean 4 attitude_limit(max_meters)# returns a boolean, true if the drone does not exceed the limit RepV: Scalable Neurosymbolic Verification To obtain ground truth labels for quantitative analysis, we use the following two specifications: φ 9 = G( landing speed < 1), φ 10 = G( attitude limit ), where the proposition “landing speed” is captured by the third parameter of setvelocityned and “attitude limit” corresponds to the API attitudelimit(maxmeters). We present their corresponding natural language rules below. 1 ============Corresponding to specification phi 9============ 2 The landing speed must stay below 1 m/s. 3 The drone should always maintain a landing speed below 1 meter per second. 4 The drone must never exceed a landing speed of 1 m/s. 5 6 ============Corresponding to specification phi 10============ 7 The maximum flying height is 5 meters. 8 The drone must always keep its height below 5 meters. 9 The drone should never rise higher than 5 meters at any time. D.2 Task Prompt and Executable Plans Task Prompt for indoor navigation: 1 Given the following drone APIs: 2 obstacle_detected() $ returns a boolean: detects an obstacle in front of the robot 3 Move(self, vx: float, vy: float, vyaw: float) # vx is forward velocity and vy is velocity toward the right, vyaw is the counter-clockwise angular velocity 4 sleep(time_in_seconds: float) 5 StandUp() 6 StandDown() 7 8 Task: Go to the main entrance and sit down (forward 9 meters, right 1 meter). 9 Rule: Step right if there is an obstacle ahead. We present the generated plan for the indoor navigation task below, which satisfies all the natural language rules with probabilistic guarantees above 90 percent. We present the real execution of the plan in Figure 10 (right). 1 def go_to_main_entrance(): 2 # Step 1: Stand up from resting posture 3 StandUp() 4 5 # Parameters 6 forward_speed = 0.5 # m/s forward 7 sidestep_speed = 0.3 # m/s right 8 turn_speed = -0.4 # rad/s yaw 9 forward_distance_1 = 9 # meters 10 forward_distance_2 = 1 # meters 11 control_interval = 0.1 # seconds 12 13 # Helper: move forward safely for given distance 14 def move_forward(distance): 15 duration = distance / forward_speed 16 elapsed = 0.0 17 while elapsed < duration: 18 if obstacle_detected(): 19 # Try sidestepping right for 1 second to go around the obstacle 20 for _ in range(int(1.0 / control_interval)): 21 Move(0.0, sidestep_speed, 0.0) 22 sleep(control_interval) 23 # Resume forward motion 24 else: 25 Move(forward_speed, 0.0, 0.0) 26 sleep(control_interval) 27 elapsed += control_interval RepV: Scalable Neurosymbolic Verification 28 # Stop movement 29 Move(0.0, 0.0, 0.0) 30 31 # Step 2: Move forward 9 meters, avoid obstacles 32 move_forward(forward_distance_1) 33 34 # Step 3: Turn right 35 turn_duration = math.abs(1.57 / turn_speed) 36 elapsed = 0.0 37 while elapsed < turn_duration: 38 Move(0.0, 0.0, turn_speed) 39 sleep(control_interval) 40 elapsed += control_interval 41 Move(0.0, 0.0, 0.0) 42 43 # Step 4: Move forward 1 meters 44 move_forward(forward_distance_2) 45 46 # Step 5: Stop and stand down 47 Move(0.0, 0.0, 0.0) 48 sleep(0.5) 49 StandDown() Task Prompt for aerial navigation: 1 Given the following drone APIs: 2 set_velocity_ned(north, east, down, angle) # set drone velocity in m/s, the drone will keep moving in this velocity until it receives the next command 3 sleep_for(seconds: float) # the drone will keep the current action (velocity) during sleep time 4 obstacle_in_front() # returns a boolean 5 attitude_limit(max_meters)# returns a boolean 6 7 Task: Generate a plan for flying in a 4-meter square-shaped trajectory. 8 Rule: Do not exceed 5 meters in height. The landing speed must stay below 1 m/s. We show the generated plan below. The plan satisfies all the natural language rules with guarantees above 95 percent. We safely execute the plan in physical environment and present the execution in Figure 11. 1 def fly_square_4m(): 2 side_length = 4.0 # meters 3 horizontal_speed = 1.0 # m/s 4 vertical_speed = 0.5 # m/s 5 max_altitude = 3.0 # meters (safety bound) 6 control_interval = 0.1 # seconds for safety checks 7 segment_time = side_length / horizontal_speed 8 segment_time_vertical = max_altitude / vertical_speed 9 10 def move_with_safety(north, east, down, angle, duration): 11 elapsed = 0.0 12 set_velocity_ned(north, east, down, angle) 13 while elapsed < duration: 14 if obstacle_in_front(): 15 # Stop immediately if an obstacle is detected 16 set_velocity_ned(0.0, 0.0, 0.0, angle) 17 return False 18 if attitude_limit(max_altitude): 19 # Stop if attitude or altitude exceeds safe bounds 20 set_velocity_ned(0.0, 0.0, 0.0, angle) 21 return False 22 sleep_for(control_interval) 23 elapsed += control_interval 24 set_velocity_ned(0.0, 0.0, 0.0, angle) 25 return True RepV: Scalable Neurosymbolic Verification 26 27 # Up 28 if not move_with_safety(0.0, 0.0, -vertical_speed, 0, segment_time_vertical): 29 return 30 # East 31 if not move_with_safety(0.0, horizontal_speed, 0.0, 0, segment_time): 32 return 33 # South 34 if not move_with_safety(-horizontal_speed, 0.0, 0.0, 0, segment_time): 35 return 36 # West 37 if not move_with_safety(0.0, -horizontal_speed, 0.0, 0, segment_time): 38 return 39 # North 40 if not move_with_safety(horizontal_speed, 0.0, 0.0, 0, segment_time): 41 return 42 # Down and Stop 43 set_velocity_ned(0.0, 0.0, vertical_speed, 0, segment_time_vertical) 44 print("Completed square flight.") 45 set_velocity_ned(0.0, 0.0, 0.0, 0, 1.0)