Paper deep dive
A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification
Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, Guisheng Fan
Abstract
Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promise for using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.
Links
- Source: https://arxiv.org/abs/2603.10891v1
- Canonical: https://arxiv.org/abs/2603.10891v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Yichi Zhu (Y30241060@mail.ecust.edu.cn), Kan Ling (Y30241065@mail.ecust.edu.cn), Xu Liu (Y30241061@mail.ecust.edu.cn), Hengrun Zhang∗ (zhanghengrun@ecust.edu.cn), Huiqun Yu∗ (yhq@ecust.edu.cn), Guisheng Fan∗ (gsfan@ecust.edu.cn)
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China

Abstract

Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts.
For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promise for using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.

Keywords: Prescription Auditing, Large Language Models, Hybrid Knowledge Base, Chain of Verification, Information Retrieval, Explainable AI

∗ Corresponding author.

1 Introduction

The advent of Large Language Models (LLMs) presents a promising solution to medication-related tasks. With their ability to process vast amounts of unstructured text, LLMs are well suited for digesting information from prescribing information, medical literature, and clinical guidelines. Currently, most works focus on simple diagnostic tasks or general biomedical fact-checking [3,19,31], which are primarily based on medical Question-Answer pairs. However, the credibility of prescriptions in clinical settings still heavily relies on manual checks, a process that is increasingly strained by the complexity of modern pharmaceutical evidence.

[Figure 1: The workflow and the role of PharmGraph-Auditor.]

Medication errors, such as incorrect dosages and adverse interactions, represent a persistent healthcare challenge, contributing to tens of thousands of adverse events and even deaths annually [2,4,24]. The final safeguard against these risks is Pharmacist Verification (PV), where pharmacists meticulously scrutinize prescriptions to intercept potential mistakes.
However, this manual defense is under increasing strain as pharmacists must navigate a data deluge, including an ever-expanding pharmacopeia, complex patient histories, and evolving clinical guidelines. This cognitive overload, often compounded by high workloads, increases the risk that critical details are overlooked [8], leading to severe patient harm. Consequently, there is a pressing need for intelligent systems capable of augmenting the pharmacist’s expertise with a systematic and evidence-based safety layer.

Although some works aim to prevent medication errors, they rarely target the critical PV stage directly. As illustrated in Fig. 1, systems like Pais et al. [24] employ AI modules primarily for prescription standardization during Data Entry. In contrast, our work addresses the final verification step, where direct LLM application is untenable due to fundamental limitations:

(1) Factual Unreliability: LLMs are prone to “hallucination”, generating plausible but incorrect information—unacceptable where patient safety is at stake.
(2) Lack of Traceability: Knowledge is opaquely encoded within model parameters, making it impossible to trace conclusions back to source documents. This violates evidence-based medicine, as untraceable recommendations are inherently untrustworthy.
(3) Weakness in Complex Reasoning: Auditing requires multi-hop reasoning to connect disparate facts (e.g., patient renal function, drug properties, and dosage guidelines). LLMs struggle to perform such structured reasoning reliably without a factual scaffold.

To bridge these gaps, we introduce PharmGraph-Auditor. Our system constructs and queries a Hybrid Pharmaceutical Knowledge Base (HPKB) based on the Virtual Knowledge Graph (VKG) paradigm.

(arXiv:2603.10891v1 [cs.AI] 11 Mar 2026)
Guided by our Knowledge Stratification Framework, we recognize that pharmaceutical data inherently requires dual modeling: a Relational Component (R) to handle strict “Constraints” (e.g., dosage limits, numerical conditions), and a Graph Component (G) to capture the semantic “Topology” (e.g., interactions, hierarchies) for multi-hop reasoning. This hybrid architecture ensures both the flexibility required for complex reasoning and the rigor needed for numerical auditing.

Our approach addresses the full lifecycle of intelligent auditing. For construction, we propose an Iterative Schema Refinement (ISR) algorithm that dynamically evolves the hybrid schema to capture domain heterogeneity. This is implemented via a Section-Aware Multi-Agent framework that ensures every extracted fact is traceable to its source. For application, we introduce the KB-grounded Chain of Verification (CoV). Instead of opaque generation, CoV decomposes audits into verifiable subtasks, executing transparent Hybrid Queries and utilizing a Patient Profile-driven Evidence Selection Tree (P-EST) to prune irrelevant rules. Crucially, the system is designed to explicitly flag Information Gaps when patient data is missing, prioritizing safety over hallucinated verdicts.

We evaluate PharmGraph-Auditor on a dataset of real-world inpatient prescriptions annotated by clinical experts. The results show that our framework consistently outperforms traditional rule-based CDSS, achieving a +13.4% improvement in F1 scores. To provide deeper insights, we examine the system’s ability to balance safety with efficiency. It significantly surpasses the recall of human experts while maintaining the high precision necessary to effectively mitigate pharmacist alert fatigue.
While these results underscore its effectiveness in clinical settings, the hybrid architecture of PharmGraph-Auditor makes it adaptable to other complex domains requiring both rigorous constraint satisfaction and advanced semantic reasoning. Our code and data will be released upon publication.

2 Methodology

To address the limitations of existing approaches, specifically their inability to simultaneously handle rigorous numerical auditing and complex semantic reasoning, we propose a formally grounded hybrid architecture. In this section, we establish the theoretical foundation of our data model based on the Virtual Knowledge Graph (VKG) paradigm, followed by the specific processes for its construction and application in prescription auditing.

Table 1: The Knowledge Stratification Framework

| Dimension | Relational (R) | Graph (G) |
| --- | --- | --- |
| Data Nature | Atomic, Numerical, Conditional | Associative, Hierarchical, Transitive |
| Logic Type | Set Constraint Satisfaction | Topological Traversal |
| Access Cost | Index Scan: O(log N) | Index-free Adjacency: O(1)† |
| Complexity | Dependent on dataset size (N) | Independent of dataset size (N) |
| Typical Audit | Dosage Checks, Contraindications | Interactions, Allergies, Duplicate Therapy |

† O(1) denotes constant time per relationship traversal, irrespective of |V|.

2.1 Theoretical Foundation: Hybrid VKG Model

While vector databases dominate mainstream RAG architectures, they inherently lack the determinism required for clinical auditing, as semantic similarity cannot rigorously enforce exact numerical constraints. Consequently, a foundation in structured symbolic knowledge is essential. However, pharmaceutical knowledge possesses a dual nature, comprising both strict conditional rules and highly connected semantic concepts. A single structured data model proves insufficient: relational databases struggle with deep recursive reasoning, while pure graph databases lack efficient indexing mechanisms for complex range filtering.
To resolve this dilemma, we frame our Hybrid Pharmaceutical Knowledge Base (HPKB) as a specialized implementation of the Virtual Knowledge Graph (VKG) paradigm [38,39], adopting a Hybrid Materialization strategy. Formally, we define the HPKB as a tuple H = ⟨R, G, φ⟩:

• R is the Relational Component (Constraint Store), a set of relations R_1, ..., R_n storing high-integrity atomic facts. It handles data requiring strict schema validation, such as dosage thresholds.
• G is the Graph Component (Topology Store), a labeled property graph G = (V, E) capturing the semantic topology of medical entities. It handles data requiring multi-hop reasoning.
• φ is the Mapping Function, φ : V ↔ ⋃_i R_i, a bijective function establishing explicit links between graph vertices and relational tuples, ensuring the system functions as a unified whole.

This hybrid architecture is not merely an engineering choice but a theoretical necessity derived from the algorithmic distinctness of prescription auditing tasks. To guide the schema design systematically, we propose the Knowledge Stratification Framework (Table 1), which assigns data to R or G based on its logical nature.

2.1.1 Set Constraint Satisfaction (Why Relational?). Audit tasks like Contraindication Checking or Dosage Verification are fundamentally Set Constraint Satisfaction problems. A clinical rule often manifests as a stack of boolean and range filters (e.g., Allow IF age > 65 AND CrCl < 30 AND hepatic_impairment = ‘Severe’). We assign such data to R because Relational Database Management Systems (RDBMS) are mathematically optimized for Set Theory operations. They efficiently execute dynamic predicate logic through B-Tree indices, achieving O(log N) complexity for range lookups, where N is the table cardinality.
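A minimal sketch of the two access patterns, using Python's built-in sqlite3 for the constraint store R and a plain adjacency dict standing in for the topology store G. The table layout, rule values, and drug/patient names are illustrative assumptions, not data from the paper:

```python
import sqlite3

# --- Relational component R: set constraint satisfaction ---
# Hypothetical dosage-rule table; a B-Tree index backs the range filters.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dosage_rules (
    drug TEXT, age_min INTEGER, crcl_max REAL, max_dose_mg REAL)""")
conn.execute("CREATE INDEX idx_rules ON dosage_rules (drug, age_min, crcl_max)")
conn.executemany("INSERT INTO dosage_rules VALUES (?, ?, ?, ?)", [
    ("metformin", 65, 30.0, 1000.0),  # elderly patient with renal impairment
    ("metformin", 0, 1e9, 2000.0),    # default adult rule (no restriction)
])

def max_allowed_dose(drug, age, crcl):
    """Resolve a dosage constraint as a stack of boolean/range filters,
    preferring the most specific matching rule (age_min DESC is a toy heuristic)."""
    row = conn.execute(
        "SELECT max_dose_mg FROM dosage_rules "
        "WHERE drug = ? AND ? >= age_min AND ? <= crcl_max "
        "ORDER BY age_min DESC LIMIT 1",
        (drug, age, crcl),
    ).fetchone()
    return row[0] if row else None

# --- Graph component G: topological traversal ---
# Adjacency lists stand in for index-free adjacency: each hop is one lookup.
edges = {
    ("patient:P001", "has"): ["class:penicillins"],
    ("ingredient:amoxicillin", "subclass_of"): ["class:penicillins"],
    ("drug:amoxil", "has_ingredient"): ["ingredient:amoxicillin"],
}

def allergy_conflict(patient, drug):
    """Walk Drug -> Ingredient -> Class and intersect with the patient's allergies."""
    allergies = set(edges.get((patient, "has"), []))
    for ing in edges.get((drug, "has_ingredient"), []):
        if allergies & set(edges.get((ing, "subclass_of"), [])):
            return True
    return False
```

The SQL side answers a range query in one indexed lookup, while the graph side chains constant-time hops, mirroring the O(log N) versus O(1)-per-hop distinction in Table 1.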
Modeling continuous numerical ranges in a Graph database would inherently require discretizing values into nodes or performing inefficient global property scans (O(N)), leading to unacceptable latency in real-time auditing.

2.1.2 Topological Traversal (Why Graph?). Conversely, tasks like Interaction Screening or Allergy Checking are Topological Traversal problems involving path discovery and transitivity. For instance, detecting an allergy requires traversing a hierarchy: Patient −has→ Allergy ←subclass_of− Concept ←ingredient_of− Drug.

We assign such data to G. The Graph model leverages index-free adjacency, where connected nodes physically point to each other in memory. This allows relationship traversal in constant time, O(1) per hop, completely independent of the total graph size (|V| or |E|). In contrast, implementing deep recursion in R (e.g., via Recursive CTEs) necessitates iterative index lookups for each hop. Since every join operation incurs an O(log N) overhead, the cumulative cost for a path of depth k scales as O(k · log N). As the medical terminology hierarchy deepens and the dataset size N grows, this logarithmic penalty accumulates, rendering relational recursion computationally brittle compared to the constant-time pointer dereferencing of graphs.

2.2 Trustworthy HPKB Construction

Guided by the theoretical model H = ⟨R, G, φ⟩, we implement a verifiable construction pipeline to populate the HPKB from unstructured pharmaceutical documents. As shown in Fig. 2, the upper modules (Modules 1 & 2) handle the pre-computation phase, preparing the knowledge base for the subsequent auditing inference.

2.2.1 Phase I: Iterative Schema Refinement (ISR). Pre-defining a rigid, comprehensive schema for the pharmaceutical domain is impractical, given the sheer complexity and heterogeneity of medical knowledge. To address this, we propose the Iterative Schema Refinement (ISR) algorithm.
This semi-automated, expert-supported process is designed to evolve a robust schema S_final that balances high information recall with structural compactness.

Stratified Sampling Strategy. To ensure the evolved schema generalizes across diverse medical contexts, the ISR process utilizes a stratified sampling strategy based on the ICD-10 classification (International Classification of Diseases, 10th Edition) [35]. We select a representative corpus (e.g., 100 documents) evenly distributed across five major therapeutic areas with distinct structural complexities: Antineoplastic agents (complex regimens), Anti-infectives (contraindication-heavy), Cardiovascular agents (interaction-heavy), Nervous system agents, and Respiratory system agents. This diversity is crucial for testing the schema’s ability to handle heterogeneous data structures during the initialization phase.

Human-AI Synergy: Gap Detection and Abstraction. The ISR algorithm operates on a Propose-Verify-Solidify loop that leverages the complementary strengths of Large Language Models (LLMs) and human experts. The process begins with a minimal “seed schema”. For each document in the sample, the workflow proceeds as follows:

(1) LLM as the Gap Detector: The LLM agent compares the document text against the current schema. Beyond extracting fitting data, its primary role is to identify “Schema Gaps”—valuable information (e.g., dosing prerequisites, infusion rates) that the current schema cannot represent. By drafting structured Schema Change Proposals, the LLM reduces the cognitive load on experts and prevents omissions common in manual construction.
(2) Expert as the Architect: Human experts review the proposals to enforce Semantic Abstraction. Our preliminary analysis suggests that LLMs, when unchecked, tend to suffer from “Schema Fragmentation” (e.g., proposing separate tables for RenalAdjustment and HepaticAdjustment).
The expert mitigates this by elevating specific gaps into generalized structures. For instance, instead of accepting disparate fields for different organ functions, the expert defines a generic Constraint node. This abstract design ensures that the graph component G remains compact while uniformly representing renal, age-based, or weight-based restrictions.

Decision Policy and Stabilization. The acceptance of schema proposals is governed by our Knowledge Stratification Framework. The expert classifies gaps into two categories to update the theoretical model H = ⟨R, G, φ⟩:

• Scenario A (Constraint Discovery): If the agent encounters text defining a numerical boundary (e.g., Reduce dose by 50% if CrCl < 30 mL/min), the expert assigns it as a new attribute column in the Relational Component R.
• Scenario B (Topology Discovery): If the agent identifies a connection between entities (e.g., Drug A is physically incompatible with Drug B), the expert assigns it as a new Edge Type (e.g., has_taboo) in the Graph Component G.

This iterative process continues until the schema exhibits “rapid stabilization”, defined as the state where N_stable consecutive documents pass without triggering valid schema change proposals. This ensures that S_final achieves sufficient coverage of the domain’s complexity.

2.2.2 Phase II: Section-Aware Knowledge Population. With the stable schema S_final established, we proceed to the full-scale population phase. The primary goal here is verifiability. We define a provenance function Π(f) = (doc_id, section, source_text), requiring that every extracted fact f, whether a tuple in R or an edge in G, must carry the exact source text from which it is derived.

1. Document Preprocessing and Sectioning. The initial step is to convert heterogeneous source documents (typically PDFs) into a structured, machine-readable format. We leverage MinerU [1], an open-source toolkit that parses PDF documents into Markdown while preserving hierarchical structures like headers and tables.
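The provenance requirement Π(f) described above can be sketched as a thin wrapper that refuses to store a fact without its source triple. All names here are illustrative, not the paper's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    """The provenance triple Pi(f) = (doc_id, section, source_text)."""
    doc_id: str
    section: str
    source_text: str

@dataclass(frozen=True)
class Fact:
    payload: tuple        # a relational tuple for R, or an (s, p, o) edge for G
    provenance: Provenance

def make_fact(payload, doc_id, section, source_text):
    # Enforce Pi(f): every extracted fact must carry its exact source span.
    if not (doc_id and section and source_text):
        raise ValueError("fact rejected: missing provenance")
    return Fact(tuple(payload), Provenance(doc_id, section, source_text))
```

Making the fact unrepresentable without provenance is what lets every downstream audit finding link back to authoritative source text.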
[1] https://github.com/opendatalab/MinerU

[Figure 2: System architecture of PharmGraph-Auditor. Module 1: Document Ingestion & Preprocessing; Module 2: HPKB Construction via a Multi-Agent Framework (Dispatcher Agent plus Specialized Extractor Agents for dosage, contraindications, interactions, allergies, etc., driven by the ISR algorithm, seed schema, and expert-finalized schema); Module 3: Prescription Auditing Engine, combining the Hybrid Pharmaceutical Knowledge Base (Relational Constraint Store, Graph Topology Store, and the unified-identity Mapping Interface φ, illustrated with example records such as Patient P_001 and Drug D_505) with the Chain of Verification (CoV) framework.]

Given that pharmaceutical documents typically follow a standardized organization, this structured representation allows us to segment the text into distinct semantic blocks (e.g., Dosage and Administration, Contraindications, Drug Interactions), enabling targeted, context-aware processing in subsequent steps.

2. The Section-Aware Multi-Agent Framework. To overcome the context window limitations and attention drift often observed in monolithic LLMs, we introduce a Section-Aware Multi-Agent framework.

• The Dispatcher Agent: Acting as the orchestrator, this agent analyzes the header of each text block and routes it to the appropriate specialist. For instance, a block labeled “Drug Interactions” is strictly routed to the Interaction Agent, minimizing noise.
• Specialist Agents: We deploy a suite of agents (e.g., Contraindication Agent, Dosage Agent, Interaction Agent), each configured with a specialized prompt and a specific subset of S_final (derived from the ISR algorithm).
  – The Dosage Agent is prompted to extract structured tuples (e.g., [age_min, dose_val]) targeting the Relational Component R.
  – The Interaction Agent is prompted to extract triples (e.g., (Patient, has_condition, Allergy)) targeting the Graph Component G.

Crucially, to ensure verifiability, each agent is strictly required to output the provenance metadata Π(f) (the raw source text) alongside the extracted fact.

3. Hybrid Persistence and Mapping Layer. The final component acts as a bridge between the agentic framework and the hybrid storage engine. It parses and validates the extracted JSON, routing data to either R or G according to the schema definitions. To operationalize the mapping φ, we employ a Unified Identity Strategy, where a shared global identifier anchors entities across both storage modalities, implicitly ensuring data consistency and seamless cross-referencing.

[Figure 3: The KB-grounded Chain of Verification (CoV) framework. Phase 1: LLM-Powered Audit Task Decomposition (risk types: Indication, Dosage, Contraindication, Adjustment, and Interaction for ≥2 drugs), with the LLM as “Planner”; Phase 2: Rule-Based Query Generation (SQL/Cypher), as “Programmer”; Phase 3: Heterogeneous Knowledge Retrieval, with P-EST filtering the raw evidence set down to highly relevant evidence, as “Executor”; Phase 4: LLM-Powered Synthesis & Report Generation, as “Analyst”.]
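A minimal sketch of the Unified Identity Strategy behind the mapping φ: one shared global identifier anchors an entity's relational tuple and its graph vertex. The store shapes and field names are hypothetical stand-ins for the actual databases:

```python
# Toy stand-ins for the two stores, both keyed by one shared global ID.
relational_store = {}   # R: uid -> dict of typed attribute columns
graph_nodes = {}        # G: uid -> dict with a label and properties

def upsert_entity(uid, columns=None, label=None, props=None):
    """Anchor the same entity in both stores under one unified ID."""
    if columns:
        relational_store.setdefault(uid, {}).update(columns)
    if label or props:
        node = graph_nodes.setdefault(uid, {"label": label, "props": {}})
        node["props"].update(props or {})

def phi(uid):
    # The mapping: resolve both views of an entity from the shared ID,
    # giving seamless cross-referencing between R and G.
    return relational_store.get(uid), graph_nodes.get(uid)

upsert_entity("D_505", columns={"Max_Dose": 2000, "Unit": "mg"},
              label="Drug", props={"name": "Amoxicillin"})
```

Because both stores index the same identifier, a constraint row found in R can be joined to its graph neighborhood in G (and vice versa) without any fuzzy entity matching.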
Algorithm 1: KB-grounded Chain of Verification (CoV)

     1: Input: Prescription P, Hybrid HPKB K
     2: Output: Audit Report (A, I)
     3: V_plan ← DecomposeTasks(P)                ▷ Stage 1
     4: Evidence ← ∅
     5: for task ∈ V_plan do
     6:     ▷ Stage 2: Generate Hybrid Queries
     7:     if task.type is Constraint then
     8:         Q ← GenerateSQL(task)
     9:         RawData ← K.R.execute(Q)
    10:         ▷ Stage 3: Curation via P-EST
    11:         CuratedData ← P-EST(P, RawData)
    12:     else
    13:         Q ← GenerateCypher(task)
    14:         CuratedData ← K.G.execute(Q)
    15:     end if
    16:     Evidence.add(task, CuratedData)
    17: end for
    18: ▷ Stage 4: Synthesis
    19: (A, I) ← SynthesizeReport(P, Evidence)
    20: return (A, I)

2.3 KB-Grounded Prescription Auditing

The ultimate goal of our system is to leverage the trustworthy HPKB to audit prescriptions. Formally, given a prescription instance P = (PatientInfo, DrugList), the task is to generate an audit report (A, I), where A contains evidence-grounded Audit Findings and I identifies Information Gaps (missing patient data required for verification). To achieve this in a manner that is safe, transparent, and robust against hallucination, we propose the KB-grounded Chain of Verification (CoV) framework. As illustrated in Fig. 3 and detailed in Algorithm 1, CoV transforms the opaque “black-box” reasoning of LLMs into a transparent “white-box” pipeline comprising four distinct stages.

2.3.1 Stage 1: LLM-driven Task Decomposition. Instead of tasking an LLM with a monolithic instruction like “check this prescription”, CoV first employs a specialized Decomposition Agent. This agent’s sole responsibility is to decompose the high-level auditing goal into a Verification Plan. This plan is a structured list of specific, verifiable sub-tasks (e.g., dosage verification, contraindication check) tailored to the patient’s profile and each prescribed drug.

2.3.2 Stage 2: Hybrid Query Generation. This stage is the operational core of our hybrid architecture.
For each sub-task in the Verification Plan, a deterministic Rule-based Query Engine generates the precise database queries. Crucially, this engine selects the appropriate query language based on the task type, aligning with the theoretical stratification defined in Section 2.1:

• For Constraint Tasks (e.g., Dosage): It generates SQL queries targeting the Relational Component R (e.g., “SELECT * FROM DosageRules WHERE Drug = ‘Metformin’ AND ...”).
• For Topology Tasks (e.g., Allergy Analysis): It generates Cypher queries targeting the Graph Component G to retrieve pharmacological hierarchies needed for reasoning. For instance, to screen for potential allergies, the query retrieves the drug’s composition lineage (e.g., “MATCH (d:Drug {name: ‘Metformin’})-[:HAS_INGREDIENT]->(i:Ingredient)-[:BELONGS_TO]->(c:Class) RETURN i, c”).

By relying on deterministic rules rather than LLM generation for query construction, we eliminate the risk of syntax errors or hallucinated database fields.

2.3.3 Stage 3: Evidence Retrieval and Curation via P-EST. Raw data retrieval is often insufficient. For instance, querying dosage rules for a generic drug might return dozens of rows covering various indications and populations. Feeding this “noisy” context to an LLM increases cognitive load and error rates.

To address this, we introduce the Patient Profile-Driven Evidence Selection Tree (P-EST) (Fig. 4) for structured evidence curation. P-EST simulates clinical decision logic to prune irrelevant rules:

(1) Exact Match Search: It first attempts to find a rule that perfectly matches the patient’s specific profile (e.g., “Age 65, CrCl 25 ml/min”).
(2) Hierarchical Fallback: If no exact match is found (common in real-world data), P-EST initiates a fallback search, moving up the decision tree to find the most specific applicable parent rule (e.g., “Any Renal Impairment”) before defaulting to the standard adult dose.
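The exact-match-then-fallback behavior of P-EST can be sketched as follows; the rule encoding (condition predicates, with "more non-wildcard conditions = more specific" as the specificity order) is an illustrative assumption, not the paper's data structure:

```python
def p_est_select(patient, rules):
    """Return the single most relevant rule for this patient.
    Each rule maps fields to either None (wildcard) or a predicate;
    among applicable rules, prefer the one with the most bound conditions."""
    def matches(rule):
        return all(cond is None or cond(patient.get(field))
                   for field, cond in rule["conditions"].items())

    applicable = [r for r in rules if matches(r)]
    if not applicable:
        return None  # caller falls back to the standard adult dose

    def specificity(rule):
        return sum(c is not None for c in rule["conditions"].values())

    # Exact-profile matches win; otherwise this degrades gracefully to the
    # most specific applicable parent rule (the hierarchical fallback).
    return max(applicable, key=specificity)

rules = [
    {"name": "standard adult dose",
     "conditions": {"age": None, "crcl": None}},
    {"name": "any renal impairment",
     "conditions": {"age": None, "crcl": lambda v: v is not None and v < 60}},
    {"name": "elderly + severe renal impairment",
     "conditions": {"age": lambda v: v is not None and v >= 65,
                    "crcl": lambda v: v is not None and v < 30}},
]
```

With this encoding, a patient who fits no specific rule silently falls through to the generic one, which mirrors the "defaulting to the standard adult dose" branch.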
This ensures that the downstream LLM receives only the single most relevant rule, maximizing precision.

2.3.4 Stage 4: Evidence-Grounded Synthesis. In the final stage, a Synthesis Agent receives the original prescription and the Curated Evidence Package to generate a structured Audit Report. To ensure safety, the agent integrates Uncertainty Handling into the synthesis process, explicitly identifying Information Gaps (I) whenever the evidence requirements exceed the available patient data. This is critical when retrieved evidence imposes conditional constraints. For example, if the evidence states that “dosage must be reduced for patients with renal impairment” but the patient profile in P lacks renal function data, the agent flags this gap instead of hallucinating a verdict. The final output is a report where every finding is explicitly linked to the authoritative source text, ensuring full traceability.

[Figure 4: The Patient Profile-driven Evidence Selection Tree (P-EST) logic for pruning conflicting dosage rules.]

3 Experiments

To comprehensively evaluate the effectiveness and robustness of PharmGraph-Auditor, we design our experiments to answer two central research questions (RQs) that directly correspond to the core tasks defined in our problem formulation:

RQ1: How effectively can PharmGraph-Auditor construct a high-fidelity HPKB? This question assesses the core knowledge extraction pipeline, including the performance of our section-aware and multi-agent framework.

RQ2: How accurately and safely does PharmGraph-Auditor perform evidence-grounded prescription auditing? This question evaluates the application layer of our system, focusing on the performance of the KB-grounded CoV framework in identifying risks and handling uncertainty.

3.1 Performance of HPKB Construction (RQ1)

This section evaluates the system’s ability to build the knowledge base.
We focus on the quality of the knowledge population process, assuming a stable, expert-defined schema, to measure the performance of our extraction framework against established paradigms.

3.1.1 Knowledge Population Quality.

Experimental Setup. Evaluation Ground Truth: To ensure a rigorous evaluation of the knowledge population performance, we constructed a gold-standard HPKB derived from 100 diverse pharmaceutical documents. The annotation process was conducted by a senior clinical pharmacist with over 10 years of experience in medical informatics and ontology construction. This expert meticulously reviewed the source documents and manually extracted all relevant entities, attribute values, and relations according to the final schema. This expert-curated dataset serves as the reliable ground truth for measuring the extraction fidelity of different systems. The statistics of this benchmark are detailed in Table 2.

Table 2: Statistics of the Golden Standard HPKB

| Metric | Relational Data | Graph Relations | Total |
| --- | --- | --- | --- |
| # Documents | 100 | 100 | 100 |
| # Extracted Records | 2,951 | 923 | 3,874 |
| Avg. # Records / Doc | 29.51 | 9.23 | 38.74 |

Baselines: We compare our section-aware, multi-agent framework (PharmGraph-Auditor) against two strong baselines representing state-of-the-art approaches in knowledge extraction:

• Baseline 1: Zero-shot OpenIE (GraphRAG-style). Simulating the indexing phase of GraphRAG [7], this baseline operates without pre-defined schema constraints. It utilizes a powerful LLM to process the full document and autonomously identify entities, extracting both structured attribute tuples (for relational records) and Subject-Verb-Object triples (for graph edges) in a bottom-up manner. This baseline represents the performance of unconstrained, general-purpose information extraction covering both tabular and topological data.
• Baseline 2: One-shot Schema-guided Agent (AutoKG-style).
Adapted from the information extraction module of AutoKG [44], this baseline utilizes the full document context combined with a one-shot demonstration. To align with our hybrid data model, we extended its prompt to include not only candidate predicates for graph relations but also target fields for relational tables. It tests the efficacy of standard schema-guided prompting strategies against our fine-grained, section-aware approach.

Metrics and Models: We measure performance using standard Precision, Recall, and F1-score. To ensure a fair comparison, we calculate these metrics separately for the Relational and Graph components, as well as an overall micro-average. We implement our framework using three representative LLMs: GPT-4o, Deepseek-V3 [18], and Qwen3-32B [40], to assess the generalizability of our approach across different model scales and types.

Results. The quantitative results in Table 3 demonstrate that PharmGraph-Auditor achieves state-of-the-art performance, consistently outperforming baselines across all metrics. Our framework maintains a robust balance between Recall and Precision, resulting in F1-scores exceeding 0.83 for all tested LLMs. This balance is critical: the high Recall (>0.84) ensures comprehensive coverage of potential medical risks, while the high Precision (>0.82) effectively suppresses hallucinations, validating the trustworthiness of the constructed knowledge base.

Comparison with Baselines. The baselines exhibit significant performance bottlenecks. The Zero-shot OpenIE (GraphRAG-style), despite acceptable precision, suffers from a critically low Recall (0.4860), missing over half of the essential facts due to the lack of schema guidance. The One-shot Schema-guided agent (AutoKG-style) improves Recall to 0.7709 but still falls short of our framework in both accuracy and completeness.
These results confirm that processing complex documents in a single pass dilutes the model's attention ("Lost-in-the-Middle"), whereas our section-aware multi-agent approach ensures fine-grained and accurate extraction.

Figure 5: Types of prescription errors.
- Indication: INDICATION_MISMATCH
- Dosage & Administration: DOSAGE_ERROR, FREQUENCY_ERROR, ROUTE_ERROR
- Contraindications: ALLERGY_CONFLICT, CONTRAINDICATION_CONDITION
- Interactions: INTERACTION_DRUG_DRUG
- Special Population: HEPATIC_IMPAIRMENT, RENAL_IMPAIRMENT, AGE_RESTRICTION, INCOMPATIBILITY, DUPLICATE_THERAPY
- Others: NO_RISK, INFO_MISSING

3.2 Performance of Prescription Auditing (RQ2)

Having established the quality of our HPKB construction, we now evaluate the core application of our system: its ability to perform accurate, safe, and efficient prescription auditing in a real-world clinical setting.

3.2.1 Experimental Setup. Real-world Dataset. Unlike synthetic benchmarks, we conducted an experiment using 100 sets of authentic inpatient medical records and prescriptions from a real-world hospital. The dataset covers complex clinical scenarios across departments. We first defined a comprehensive taxonomy of prescription risks, categorized into five main types and their sub-types as illustrated in Fig. 5. Consequently, the auditing process focused on these five categories: Indications, Dosage, Contraindications, Special Populations, and Interactions, resulting in a total of 500 distinct audit points.

Evaluation Baselines & Process. To rigorously assess the system, we employed a four-way comparative study:

(1) Experience Review (Human Baseline): Performed by a senior pharmacist with over 10 years of clinical experience, relying solely on professional memory and expertise.
(2) Knowledge Review (Gold Standard): The same senior pharmacist performed a second review, this time assisted by the retrieval results from our HPKB.
(3) Proposed Method (CoV): The audit performed automatically by our proposed HPKB-driven framework.
(4) Rule Review (Traditional CDSS): We compared our method against the hospital's legacy Clinical Decision Support System. Maintained for over 20 years, this pre-LLM system relies on rigid rules manually encoded by doctors and pharmacists, representing a labor-intensive construction process.

The specific identified issues (37 out of 500) from the Knowledge Review serve as the ground truth. This results in a correct prescription rate of 92.6%, which aligns with real-world scenarios where baseline quality is generally high. However, given the zero-tolerance policy for errors in medical environments, an effective detection method remains essential to identify these infrequent but critical mistakes.

Table 3: Performance on Knowledge Population Task

| Method | Component | Precision | Recall | F1-score |
|---|---|---|---|---|
| PharmGraph-Auditor (Ours, GPT-4o) | Relational | 0.7973 | 0.8243 | 0.8106 |
| | Graph | 0.9565 | 0.9565 | 0.9565 |
| | Overall (Micro) | 0.8260 | 0.8491 | 0.8374 |
| PharmGraph-Auditor (Ours, Deepseek-V3) | Relational | 0.7948 | 0.8378 | 0.8157 |
| | Graph | 0.9782 | 0.9782 | 0.9782 |
| | Overall (Micro) | 0.8235 | 0.8603 | 0.8415 |
| PharmGraph-Auditor (Ours, Qwen3-32B) | Relational | 0.8750 | 0.8513 | 0.8630 |
| | Graph | 0.8750 | 0.9130 | 0.8936 |
| | Overall (Micro) | 0.8750 | 0.8603 | 0.8676 |
| Zero-shot OpenIE (GraphRAG-style, GPT-4o) | Relational | 0.8409 | 0.5000 | 0.6271 |
| | Graph | 0.8518 | 0.5000 | 0.6301 |
| | Overall (Micro) | 0.8365 | 0.4860 | 0.6148 |
| One-shot Schema-guided (AutoKG-style, GPT-4o) | Relational | 0.7957 | 0.7635 | 0.7793 |
| | Graph | 0.8125 | 0.8378 | 0.8297 |
| | Overall (Micro) | 0.8023 | 0.7709 | 0.7863 |

Table 4: Performance Comparison

| Method | Precision | Recall | F1-Score |
|---|---|---|---|
| Experience Review (Human) | 100.0% | 45.9% | 62.9% |
| Rule Review (CDSS) | 52.1% | 67.6% | 58.8% |
| Proposed Method (Ours) | 74.3% | 70.3% | 72.2% |

3.2.2 Results and Analysis. Comparative Performance.
Table 4 presents the performance of our AI method compared to the Human Baseline (Experience Review) and the traditional rule-based CDSS. The results highlight distinct trade-offs inherent in each approach and validate the effectiveness of our proposed framework.

Experience Review: The human pharmacist achieved perfect Precision (100%), indicating that seasoned experts rarely generate false alarms. However, this comes at the cost of safety coverage: the Recall was only 45.9%. This result highlights the critical limitations of human memory and attention: more than half of the latent risks were missed when the pharmacist relied solely on experience. Notably, comparing this to the Gold Standard (where the same pharmacist used our HPKB), the inclusion of retrieval capabilities boosted error detection by approximately 117%. This significant gap confirms that high-fidelity retrieval is essential for comprehensive safety.

Rule-based CDSS: The traditional CDSS improved the Recall to 67.6%, capturing a wider range of errors than the human baseline. However, it suffers from critically low Precision (52.1%): nearly half of the alerts generated by the rule engine were false positives. In real-world clinical settings, such a high noise ratio significantly contributes to "alert fatigue", potentially causing pharmacists to ignore valid warnings.

Proposed Method: Our proposed method achieved the best balance between safety and efficiency. It surpassed the rule-based system in Recall (70.3%), demonstrating superior sensitivity in detecting risks. Crucially, it achieved this while maintaining high Precision (74.3%), significantly reducing the false-positive rate compared to the rule-based baseline.

Overall, these findings demonstrate that our approach serves as an effective assistant for human pharmacists. The knowledge retrieved by our system not only grounds the LLM's reasoning but also provides interpretable evidence for human verification.
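The F1 scores in Table 4 and the ~117% detection gain follow directly from the reported precision and recall figures, which a quick arithmetic check confirms:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall as reported in Table 4.
experience = f1(1.000, 0.459)  # Experience Review (Human)
cdss = f1(0.521, 0.676)        # Rule Review (CDSS)
proposed = f1(0.743, 0.703)    # Proposed Method (Ours)

# The Knowledge Review (gold standard) finds all 37 issues, while the
# unaided Experience Review finds 45.9% of them, so HPKB retrieval
# boosts detection by roughly 117%.
boost = (1.000 - 0.459) / 0.459
```

Running this reproduces the table's F1 column (0.629, 0.588, 0.722) and a relative recall gain of about 1.18x over the unaided review.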
By achieving robust precision and recall through the CoV framework, our method acts as a reliable filter, minimizing both missed risks and false alarms, thereby significantly reducing the workload for clinical pharmacists.

Fine-grained Analysis by Risk Type. To understand the distinct behaviors of the two systems, we analyzed the results across specific error categories, as illustrated in Fig. 6.

Reasoning vs. Rigidity (Special Populations). The most profound divergence is observed in the Special Populations category. The rule-based system exhibited a complete failure, failing to identify any of the true risks. In contrast, our proposed method successfully identified the majority of these issues. This performance gap highlights the fundamental limitation of rule-based logic: patient constraints (e.g., renal impairment status, geriatric frailty) are rarely stored in structured database fields. Instead, they are embedded within unstructured clinical notes and laboratory reports. Our method's semantic reasoning capability allows it to effectively infer these "soft constraints" from the context.

Figure 6: Fine-grained performance analysis by risk category. The top chart illustrates Precision, and the bottom chart illustrates Recall.

Table 5: Ablation Study Results

| Setting | Precision | Recall | F1 | Cost |
|---|---|---|---|---|
| Proposed Method | 0.7924 | 0.9504 | 0.8642 | $0.0225 |
| w/o CoV | 0.5757 | 0.7645 | 0.6561 | $0.0250 |
| w/o CoV & Knowledge | 0.3927 | 0.5233 | 0.4487 | $0.0055 |

Precision and Alert Fatigue. Our method achieved superior Precision across all risk categories compared to the rule-based baseline. Notably, we attained perfect or near-perfect precision in categories such as Interactions, Dosage, and Special Populations, demonstrating exceptional reliability. The only outlier was the Indication category; while our method still outperformed the baseline, the relatively lower precision scores for both systems highlight the inherent complexity of this specific task compared to the others.
By generally maintaining such high precision, our approach directly mitigates the risk of pharmacist "alert fatigue", a critical factor in the practical adoption of safety systems.

The Challenge of "Clinical Context" (Indications). Upon investigating the lower Precision (0.46) for Indications, we identified that a significant portion of false positives stemmed from a lack of "Clinical Situational Awareness", where the system strictly adheres to explicit pharmaceutical data while overlooking implicit clinical routines. A prime example is 0.9% Sodium Chloride (Saline): in inpatient settings, it is routinely used as a solvent or for line flushing, but since these procedural utilities are not formally listed as "therapeutic indications" in package inserts, the LLM incorrectly flags them as mismatches. This finding highlights the necessity for future knowledge bases to be augmented with clinical procedural knowledge, enabling the differentiation between strict medical contraindications and accepted hospital practices.

Figure 7: Case study.
- Patient Profile: Age: 59; Weight: 64 kg; Gender: Female; Pregnancy Status: not pregnant; Lactation Status: not lactating; Renal Impairment: None; Hepatic Impairment: None. Diagnoses: HR-positive, HER2-negative metastatic breast cancer; tuberculosis.
- Prescription List: (1) Abemaciclib Tablets, 150 mg per administration, twice daily, oral, 28 days. (2) Rifampin Capsules, 600 mg per administration, once daily, oral, 180 days.
- Phase 1: LLM-driven Task Decomposition. The verification plan covers Indication Verification for both drugs against the diagnoses ("HR-positive, HER2-negative metastatic breast cancer", "Tuberculosis"), plus Drug-Drug Interaction, Contraindication, Dosage (Adjustment), and Allergy checks for Abemaciclib and Rifampin.
- Phase 2: Rule-based Query Generation.
  - Indication Verification [Cypher]: MATCH (:Drug {name:'Rifampin'})-[r:INDICATED_FOR]->(d:Disease) RETURN d, r
  - Drug-Drug Interaction [Cypher]: MATCH p=(:Drug {name:'Abemaciclib'})-[:INTERACTS_WITH]-(:Drug {name:'Rifampin'}) RETURN p
  - Dosage Check [SQL]: SELECT * FROM dosage_rules WHERE drug_name = 'Abemaciclib';
- Phase 3: Evidence Retrieval & Curation. The queries retrieve a Raw Evidence Set from the HPKB, containing all potentially relevant facts. From the Dosage & Administration check (Abemaciclib): "Recommended dose with endocrine therapy is 150 mg twice daily."; "If co-administered with a strong CYP3A4 inhibitor, reduce dose to 100 mg twice daily."; "If already reduced to 100 mg twice daily and strong CYP3A4 inhibitor use cannot be avoided, further reduce to 50 mg twice daily."; "No dose adjustment needed for mild or moderate renal impairment."; "No dose adjustment for mild or moderate hepatic impairment (Child-Pugh A/B)."; "For severe hepatic impairment (Child-Pugh C), reduce frequency to once daily."; "Abemaciclib is for oral administration." From the Indication check (Rifampin): Rifampin is indicated for tuberculosis, for asymptomatic carriers of N. meningitidis, and for nontuberculous mycobacterial infections. The Curated Evidence Package retains the relevant dosage facts together with the Interaction check result (Abemaciclib + Rifampin; Severity: Severe): "Co-administration with strong CYP3A4 inducers (e.g., Rifampin) significantly decreases Abemaciclib plasma concentration (approx. 95% reduction), potentially leading to reduced efficacy."
- Phase 4: LLM-driven Synthesis & Reporting.
  - Finding 1: riskType: INTERACTION_DRUG_DRUG; severity: High; explanation: "A severe drug-drug interaction exists between Abemaciclib and Rifampin. Rifampin significantly accelerates the metabolism of Abemaciclib, which can lead to a critical loss of therapeutic efficacy for the cancer treatment."; evidenceSource: "Co-administration with strong CYP3A4 inducers (e.g., Rifampin) significantly decreases Abemaciclib plasma concentration..."
  - Finding 2: riskType: DOSE_ADJUSTMENT_MISSED; severity: High; explanation: "The patient is receiving a standard dose of Abemaciclib, but the co-administered Rifampin (a strong CYP3A4 inducer) drastically reduces its concentration. This results in a therapeutic failure risk due to an 'effective underdose'. The prescribing information advises against this combination."; evidenceSource: "Avoid co-administration of Abemaciclib with strong CYP3A4 inducers... to prevent loss of efficacy."

3.3 Ablation Study

To rigorously validate the contributions of the Chain of Verification (CoV) framework and the external Knowledge Base, we conducted an ablation study using a high-fidelity synthetic benchmark constructed via a "Red Teaming" approach.

Dataset Construction. We employed Gemini 2.5 Pro [6] to systematically generate over 1,000 prescription test cases. Specifically, the model was fed package inserts and prompted to synthesize erroneous prescriptions that deliberately violated the provided content. For each generated case, the model simultaneously identified the error type and cited the corresponding supporting evidence from the manual text. This generation was strictly guided by the comprehensive risk taxonomy illustrated in Fig. 5.
To ensure data quality, expert pharmacists randomly sampled and reviewed the dataset, verifying that the prescriptions aligned with medical common sense and that the error logic strictly adhered to the prompts.

We compared our full method against two variants: (1) w/o CoV Framework, a standard Full-Text RAG approach where the LLM reads the entire drug package insert in a single pass; and (2) w/o CoV & w/o Knowledge, a Zero-Shot setting relying solely on the LLM's internal weights.

As shown in Table 5, removing external knowledge (w/o CoV & Knowledge) leads to a substantial decline in F1-score, confirming that internal parametric knowledge alone is insufficient for precise clinical auditing. While providing full documents (w/o CoV) yields improvements over the zero-shot baseline, it still significantly underperforms our proposed method, primarily due to the noise inherent in processing lengthy unstructured texts. Notably, this standard RAG approach also incurs a higher cost than our proposed method, owing to excessive token consumption from processing full texts. Consequently, our full CoV method achieves the optimal balance of accuracy and efficiency by retrieving only precise, relevant facts.

3.4 Case Study: Tracing the Chain of Verification

To provide a concrete illustration of our system's end-to-end workflow, this section presents a case study of the KB-grounded Chain of Verification (CoV) framework in action. We trace a single, realistic prescription through the entire auditing pipeline, from initial task decomposition to the final, evidence-grounded report. This walkthrough, visualized in Figure 7, demonstrates how the CoV's structured process ensures accuracy and safety by identifying a critical risk that less sophisticated methods might miss.
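The query-generation phase of this pipeline routes each verification task to the store that fits it: Cypher for topological checks, SQL for value constraints. A minimal sketch of such a dispatcher, using simplified versions of the queries shown in Figure 7 (the node labels, relation types, and table names follow the figure but are illustrative assumptions, not the actual HPKB schema):

```python
# Illustrative query templates; schema names follow Figure 7 but are
# simplified assumptions, not the paper's actual HPKB schema.
TEMPLATES = {
    "indication": ("cypher",
        "MATCH (:Drug {{name:'{drug}'}})-[r:INDICATED_FOR]->(d:Disease) RETURN d, r"),
    "interaction": ("cypher",
        "MATCH p=(:Drug {{name:'{a}'}})-[:INTERACTS_WITH]-(:Drug {{name:'{b}'}}) RETURN p"),
    "dosage": ("sql",
        "SELECT * FROM dosage_rules WHERE drug_name = '{drug}';"),
}

def build_query(task, **names):
    """Route a verification task to the appropriate store and fill its template."""
    store, template = TEMPLATES[task]
    return store, template.format(**names)

# Hybrid query plan for the case study prescription.
plan = [
    build_query("indication", drug="Rifampin"),
    build_query("interaction", a="Abemaciclib", b="Rifampin"),
    build_query("dosage", drug="Abemaciclib"),
]
```

Each entry pairs a backend tag with a concrete query string, so topological checks go to the graph store while set-constraint checks go to the relational store, mirroring the hybrid plans the CoV framework generates.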
Clinical Scenario: We consider a complex but clinically realistic scenario involving a dose adjustment necessitated by a critical drug-drug interaction.

- Patient Profile: A 59-year-old female patient is diagnosed with HR-positive, HER2-negative metastatic breast cancer and tuberculosis. Her profile indicates no pre-existing renal or hepatic impairment.
- Prescription Order: The prescription includes Abemaciclib (150 mg twice daily) for her cancer. Critically, the patient is also being treated with Rifampin (600 mg once daily), a strong CYP3A4 inducer, for tuberculosis.

4 Related Work

Prescription auditing, a critical component of medication safety, has evolved from manual pharmacist review to automated Clinical Decision Support Systems (CDSS). Traditional CDSS relies heavily on static, rule-based logic to detect errors such as drug-drug interactions (DDIs) and dosage violations [4,14,29]. While effective in capturing explicit errors, these systems often suffer from high alert fatigue due to poor specificity and the inability to process unstructured clinical narratives [22]. To overcome these rigidity issues, researchers have increasingly integrated machine learning (ML) and deep learning techniques into auditing workflows. For instance, methods based on graph neural networks (GNNs) and molecular structure analysis have been proposed to predict DDIs with higher precision [17,27,45]. Furthermore, studies utilizing natural language processing (NLP) have demonstrated success in extracting medication entities from electronic health records (EHRs) to support more comprehensive automated checks [12,21,33]. Despite these advancements, most existing auditing systems remain siloed, handling either structured rules or unstructured text, but rarely synthesizing both for holistic risk assessment.

The advent of Large Language Models (LLMs) has introduced a transformative paradigm for clinical data processing.
Foundational models fine-tuned on biomedical corpora, such as BioGPT [20], PMC-LLaMA [36], and Med-PaLM 2 [30], have achieved expert performance in tasks from medical licensing exams to complex query answering [23,32]. To enhance comprehension, works like BALI [28] propose augmenting these models by aligning textual representations with biomedical Knowledge Graphs. These capabilities suggest LLMs could function as comprehensive auditing agents. However, deploying LLMs in high-stakes clinical decision-making is hampered by hallucinations, where models generate plausible but incorrect assertions [13,42]. In prescription auditing, unsubstantiated outputs can lead to severe adverse drug events. Moreover, the opaque nature of LLM reasoning lacks the transparency required for clinical validation [1,26]. To mitigate these risks, Retrieval-Augmented Generation (RAG) has emerged as a solution grounding model outputs in verified external knowledge [9,15].

Recent research has advanced RAG by moving beyond unstructured text retrieval to leveraging structured knowledge sources, giving rise to GraphRAG and medical Knowledge Graph Question Answering (KGQA) systems. Frameworks like MedGraphRAG [37] and DoctorRAG [19] utilize the topological structure of Knowledge Graphs (KGs) to enable multi-hop reasoning, allowing systems to trace relationships between symptoms, diagnoses, and treatments [11,25]. While these graph-centric approaches excel at semantic reasoning, they often struggle with the rigorous numerical constraint satisfaction problems inherent in prescription auditing, such as verifying renal function thresholds or weight-based dosage calculations [10]. This limitation highlights the algorithmic distinctness between semantic traversal and set-based value filtering, suggesting that a single data model is insufficient for representing the full spectrum of pharmaceutical knowledge.
This data modeling challenge has revitalized interest in Hybrid Data Models and the Virtual Knowledge Graph (VKG) paradigm. VKG, historically known as Ontology-Based Data Access (OBDA), enables unified semantic querying over heterogeneous data sources without requiring physical migration to a graph format [5,38]. By mapping relational data to a conceptual graph layer, VKG frameworks allow systems to leverage the computational efficiency of SQL for numerical constraints while retaining the reasoning power of graph queries [34,43]. Parallel advancements in Information Extraction (IE) and automated Knowledge Base Construction (KBC), such as AutoKG [44] and schema-flexible extraction frameworks [16,41], have made it feasible to populate such complex architectures from unstructured text. However, existing automated KBC pipelines often prioritize static schemas or single-model outputs, failing to address the dynamic need for stratifying knowledge into hybrid storage backends suitable for clinical auditing.

5 Conclusion

In this paper, we presented PharmGraph-Auditor, a hybrid framework designed to bridge the gap between LLM capabilities and the rigorous safety requirements of clinical prescription auditing. By formalizing the distinct nature of pharmaceutical knowledge through the relational modeling of numerical constraints and the graphical representation of semantic topology, our system ensures both precision and reasoning depth. We introduced the Iterative Schema Refinement (ISR) algorithm for trustworthy knowledge construction and the KB-grounded Chain of Verification (CoV) for transparent inference. Experimental evaluations confirm that our approach outperforms traditional CDSS by +13.4% in F1 score, effectively mitigating pharmacist alert fatigue while significantly surpassing human experts in identifying latent risks.
Future work will focus on integrating Real-World Evidence (RWE) into the knowledge base to capture implicit clinical routines, bridging the gap between rigid pharmaceutical definitions and flexible hospital workflows. Ultimately, this work can present a new paradigm for building safe, traceable, and explainable AI systems for critical clinical decision support, shifting the LLM's role from an opaque generator to a verifiable, evidence-based reasoning engine.

References

[1] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, Vince I Madai, and Precise4Q Consortium. 2020. Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Medical Informatics and Decision Making 20, 1 (2020), 310.
[2] Philip Aspden. 2007. Preventing Medication Errors. Vol. 8. National Academies Press, Washington, DC.
[3] Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, and Vincenzo Moscato. 2025. Combining evidence and reasoning for biomedical fact-checking. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1087–1097.
[4] David W Bates, David J Cullen, Nan Laird, Laura A Petersen, Stephen D Small, Deborah Servi, Glenn Laffel, Bobbie J Sweitzer, Brian F Shea, Robert Hallisey, et al. 1995. Incidence of adverse drug events and potential adverse drug events: Implications for prevention. JAMA 274, 1 (1995), 29–34.
[5] Diego Calvanese, Benjamin Cogrel, Sarah Komla-Ebri, Roman Kontchakov, Davide Lanti, Martin Rezk, Mariano Rodriguez-Muro, and Guohui Xiao. 2016. Ontop: Answering SPARQL queries over relational databases. Semantic Web 8, 3 (2016), 471–487.
[6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi-modality, long context, and next generation agentic capabilities.
arXiv preprint arXiv:2507.06261 (2025).
[7] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024).
[8] Christy Gorbach, Linda Blanton, Beverly A Lukawski, Alex C Varkey, E Paige Pitman, and Kevin W Garey. 2015. Frequency of and risk factors for medication errors by pharmacists during order verification in a tertiary care medical center. American Journal of Health-System Pharmacy 72, 17 (2015), 1471–1474.
[9] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International Conference on Machine Learning. PMLR, 3929–3938.
[10] Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. 2021. Knowledge graphs. ACM Computing Surveys 54, 4 (2021), 1–37.
[11] Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. 2024. GRAG: Graph Retrieval-Augmented Generation. arXiv preprint arXiv:2405.16506 (2024).
[12] Peter B Jensen, Lars J Jensen, and Søren Brunak. 2012. Mining electronic health records: Towards better research applications and clinical care. Nature Reviews Genetics 13, 6 (2012), 395–405.
[13] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[14] Rainu Kaushal, Kaveh G Shojania, and David W Bates. 2003. Effects of computerized physician order entry and clinical decision support systems on medication safety: A systematic review. Archives of Internal Medicine 163, 12 (2003), 1409–1416.
[15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[16] Yongqi Li, Yu Yu, and Tieyun Qian. 2023. Type-aware decomposed framework for few-shot named entity recognition. In Findings of the Association for Computational Linguistics: EMNLP 2023. 8911–8927.
[17] Xuan Lin, Zhe Quan, Zhi-Jie Wang, Tengfei Ma, and Xiangxiang Zeng. 2020. KGNN: Knowledge graph neural network for drug-drug interaction prediction. In IJCAI, Vol. 380. 2739–2745.
[18] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024).
[19] Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Goi Sin Yee, and Jinzhuo Wang. 2025. Towards doctor-like reasoning: Medical RAG fusing knowledge with patient analogy through textual gradients. In 39th Annual Conference on Neural Information Processing Systems.
[20] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23, 6 (2022), bbac409.
[21] Stéphane M Meystre, Guergana K Savova, Karin C Kipper-Schuler, and John F Hurdle. 2008. Extracting information from textual documents in the electronic health record: A review of recent research. Yearbook of Medical Informatics 17, 01 (2008), 128–144.
[22] Karen C Nanji, Sarah P Slight, Diane L Seger, Insook Cho, Julie M Fiskio, Lisa M Redden, Lynn A Volk, and David W Bates. 2014. Overrides of medication-related clinical decision support alerts in outpatients. Journal of the American Medical Informatics Association 21, 3 (2014), 487–491.
[23] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023).
[24] Cristobal Pais, Jianfeng Liu, Robert Voigt, Vin Gupta, Elizabeth Wade, and Mohsen Bayati. 2024. Large language models for preventing medication direction errors in online pharmacies. Nature Medicine 30, 6 (2024), 1574–1582.
[25] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering 36, 7 (2024), 3580–3599.
[26] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[27] Jae Yong Ryu, Hyun Uk Kim, and Sang Yup Lee. 2018. Deep learning improves prediction of drug–drug and drug–food interactions. Proceedings of the National Academy of Sciences 115, 18 (2018), E4304–E4311.
[28] Andrey Sakhovskiy and Elena Tutubalina. 2025. BALI: Enhancing biomedical language representations through knowledge graph and language model alignment. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1152–1164.
[29] Gordon D Schiff, MG Amato, T Eguale, J Boehne, A Wright, R Koppel, AH Rashidee, RB Elson, DL Whitney, T Thach, et al. 2015. Computerised physician order entry-related medication errors: Analysis of reported errors and vulnerability testing of current systems. BMJ Quality & Safety 24, 4 (2015), 264–271.
[30] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. 2025. Toward expert-level medical question answering with large language models. Nature Medicine 31, 3 (2025), 943–950.
[31] Kannan Sridharan and Gowri Sivaramakrishnan. 2024.
Unlocking the potential of advanced large language models in medication review and reconciliation: A proof-of-concept investigation. Exploratory Research in Clinical and Social Pharmacy 15 (2024), 100492.
[32] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940.
[33] Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010. Extracting medication information from clinical text. Journal of the American Medical Informatics Association 17, 5 (2010), 514–518.
[34] Domagoj Vrgoc, Carlos Rojas, Renzo Angles, Marcelo Arenas, Vicente Calisto, Benjamín Farías, Sebastían Ferrada, Tristan Heuer, Aidan Hogan, Gonzalo Navarro, et al. 2024. MillenniumDB: A multi-modal, multi-model graph database. In Companion of the 2024 International Conference on Management of Data. 496–499.
[35] World Health Organization. 2019. International Classification of Diseases, 10th Edition. https://icd.who.int/browse10/2019/en. Accessed: 2026-1-16.
[36] Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2024. PMC-LLaMA: Toward building open-source language models for medicine. Journal of the American Medical Informatics Association 31, 9 (2024), 1833–1843.
[37] Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Yueming Jin, and Vicente Grau. 2025. Medical graph RAG: Evidence-based medical large language model via graph retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 28443–28467.
[38] Guohui Xiao, Linfang Ding, Benjamin Cogrel, and Diego Calvanese. 2019. Virtual knowledge graphs: An overview of systems and use cases. Data Intelligence 1, 3 (2019), 201–223.
[39] Guohui Xiao, Lin Ren, Guilin Qi, Haohan Xue, MD Panfilo, and Davide Lanti. 2025. LLM4VKG: Leveraging large language models for virtual knowledge graph construction. In Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI).
[40] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025).
[41] Bowen Zhang and Harold Soh. 2024. Extract, define, canonicalize: An LLM-based framework for knowledge graph construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 9820–9836.
[42] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2025. Siren's song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics (2025), 1–46.
[43] Fuheng Zhao, Divyakant Agrawal, and Amr El Abbadi. 2025. Hybrid querying over relational databases and large language models. In 15th Annual Conference on Innovative Data Systems Research.
[44] Yuqi Zhu, Xiaohan Wang, Jing Chen, Shuofei Qiao, Yixin Ou, Yunzhi Yao, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2024. LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. World Wide Web 27, 5 (2024), 58.
[45] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. 2018. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34, 13 (2018), i457–i466.