Paper deep dive
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis
Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang
Abstract
Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
Links
- Source: https://arxiv.org/abs/2603.19169v1
- Canonical: https://arxiv.org/abs/2603.19169v1
Full Text
ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin 1†, Yu Luo 1†, Yizhou Zhang 1†, Ziyang Cui 1†, Yuqing Wei 1, Xianchao Liu 1, Xueying Zeng 1*, Qing Zhang 2*

1 School of Mathematical Sciences, Ocean University of China, Qingdao, 266100, Shandong, China.
2 Department of Cardiology, Qilu Hospital (Qingdao), Cheeloo College of Medicine, Shandong University, No. 758 Hefei Road, Qingdao, 266000, Shandong, China.

*Corresponding author(s). E-mail(s): zxying@ouc.edu.cn; qingzhang2019@foxmail.com
Contributing authors: zjin@stu.ouc.edu.cn; luoyu@stu.ouc.edu.cn; zyz6596@stu.ouc.edu.cn; 3196148390@q.com; 2659799366@q.com; 2486063350@q.com
† These authors contributed equally to this work.

arXiv:2603.19169v1 [cs.CV] 19 Mar 2026

Abstract

Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols.
This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows. The code is available at https://github.com/qimingfan10/ARIADNE.

Keywords: Coronary Angiography, Foundation Models, Direct Preference Optimization, Reinforcement Learning, Topological Consistency, Stenosis Detection

1 Introduction

Coronary Artery Disease (CAD) remains a leading cause of morbidity and mortality worldwide[1], requiring diagnostic modalities that provide accurate, reproducible, and efficient assessment. Invasive X-ray Coronary Angiography (XCA) serves as the primary tool for CAD diagnosis and guidance of Percutaneous Coronary Interventions (PCI)[2], offering the high temporal resolution necessary for visualizing hemodynamic flow[3]. However, current clinical workflows rely heavily on manual interpretation, a process characterized by significant inter-observer variability and susceptibility to clinician fatigue[4]. As healthcare institutions universally adopt Picture Archiving and Communication Systems (PACS), a critical gap persists between passive image storage and active, automated clinical interpretation. While hospitals have implemented digital image storage, they lack automated systems capable of transforming raw imaging data into actionable clinical insights. The growing volume of interventional procedures makes purely manual interpretation increasingly unsustainable, creating demand for Computer-Aided Diagnosis systems that can bridge the gap between data acquisition and clinical decision-making.

Accurate segmentation of the coronary vascular tree represents a fundamental prerequisite for automated coronary analysis.
Over the past decade, Convolutional Neural Networks (CNNs), particularly U-Net[5] and its attention-enhanced variants such as CS-Net and SA-UNet[6, 7], have dominated the field. More recently, Vision Transformers (ViTs) have been introduced to capture global spatial relationships[8]. Despite achieving high pixel-level performance metrics, these models face a critical limitation in preserving vascular topology. Traditional loss functions, including Cross-Entropy and Dice Loss[9], optimize pixel-level accuracy independently without explicitly penalizing topological errors[10]. Consequently, these models frequently produce fragmented vessel trees where distal branches appear disconnected, particularly due to signal loss in thin vessels during downsampling operations[11]. In coronary hemodynamics analysis, topological connectivity is essential; a segmentation with a high Dice score remains insufficient for clinical use if discontinuities prevent accurate centerline extraction and subsequent geometric analysis.

The recent emergence of foundation-scale Vision-Language Models (VLMs) has introduced a complementary approach to medical image segmentation. Models such as SAM3[12] and MedSAM3[13] leverage large language models to enable prompt-based segmentation, where textual descriptions guide mask generation. These architectures demonstrate impressive semantic understanding, correctly identifying what constitutes a vessel based on learned visual-linguistic correspondences. However, their training on generic natural image datasets creates a fundamental semantic-topological gap: while VLMs comprehend the conceptual category of a vascular structure, they lack the domain-specific anatomical priors necessary to enforce structural continuity in low-contrast, projection-based X-ray angiography.
Empirical evaluation reveals that general-purpose VLMs consistently produce semantically correct but topologically fragmented segmentations, correctly classifying pixels as vessel while failing to maintain the connected tree structure essential for hemodynamic modeling. This failure stems from their optimization objective: VLMs maximize pixel-level overlap (Dice, IoU[14]) between predicted and ground-truth masks, a criterion that remains agnostic to whether the resulting mask forms a continuous vascular network or a collection of disconnected segments. In coronary angiography, where vessel diameters approach image resolution limits and contrast variability is substantial, the absence of explicit topological constraints results in high-confidence predictions of isolated vessel fragments that are clinically unusable for stenosis quantification or flow analysis.

This limitation in segmentation directly impacts the accuracy of stenosis detection systems. Current automated frameworks predominantly follow a sequential approach where segmentation and stenosis detection are performed as independent tasks[15, 16]. In these systems, geometric algorithms traverse the segmented centerline to identify regions of narrowing. However, these deterministic algorithms lack the ability to distinguish pathological stenosis from common anatomical artifacts, including vessel crossings[17], bifurcations, and foreshortening[18], resulting in elevated false positive rates. Conversely, while deep object detectors such as YOLO have been applied to direct lesion identification[19], they inherently lack the capacity to verify anatomical plausibility. Specifically, generic object detectors treat lesions as isolated bounding boxes, failing to validate whether a detected stenosis actually resides within a continuous, hemodynamically relevant vascular segment. These limitations have hindered clinical adoption due to the high rate of false alarms that reduce system reliability.
To address these fundamental challenges in coronary angiography automation, we propose ARIADNE (Anatomy-aware Reasoning for Integrated Angiography Diagnosis and Navigation Expert), a framework that bridges the gap between visual perception and clinical reasoning. Our central hypothesis is that robust diagnostic automation requires not only accurate visual recognition but also explicit alignment with the hierarchical reasoning patterns employed by expert clinicians. Building on recent advances in preference-based learning from the artificial intelligence community, we integrate Direct Preference Optimization (DPO)[20] with Reinforcement Learning (RL)[21] to create a two-stage diagnostic pipeline. In the perception stage, we apply DPO to fine-tune a vision-language foundation model (Sa2VA)[22, 23], using comparative preferences derived from centerline continuity profiles, quantified via clDice[10], to guide the model toward structurally coherent vessel segmentations. Unlike general VLMs that optimize for semantic correctness through pixel overlap, our preference-based approach explicitly rewards topological continuity, teaching the model that a mask with a 92% Dice score but preserved connectivity is preferable to a 95% Dice mask with fragmented branches. This enables the model to harness the semantic power of vision-language architectures while enforcing the geometric rigor required for hemodynamic analysis. To consolidate this topological reasoning against weak visual signals, we further incorporate a Hard Sample Focused Training (HSFT) strategy. By concentrating optimization resources on the most diagnostically uncertain subsets (such as complex bifurcations and distal vessels), this mechanism achieves significant computational efficiency while ensuring robust performance in anatomically challenging regions where global statistics often mask local failures.
The resulting topologically coherent vessel trees provide a clinically valid foundation for the reasoning stage: an RL-based navigation agent that performs sequential decision-making for stenosis detection. Critically, this agent incorporates an explicit rejection mechanism[24] that mirrors the clinical workflow where radiologists flag ambiguous cases for secondary review. By allowing the system to abstain from uncertain predictions, we shift the operational paradigm from maximizing coverage to maximizing reliability, thereby reducing false positive rates while maintaining high sensitivity for clear-cut lesions. This perception-to-reasoning architecture reflects the natural diagnostic workflow, where accurate anatomical reconstruction serves as the perceptual foundation for subsequent lesion localization and characterization.

This work makes three primary contributions to the automation of coronary angiography interpretation:

1. Perception Framework: We introduce a preference-based optimization approach that aligns VLMs with topological constraints in vessel segmentation. By applying DPO to comparative vessel tree examples, our method achieves topologically consistent segmentations without requiring pixel-level annotation of connectivity features, augmented by a hard-sample mining strategy that enhances computational efficiency in complex anatomical scenarios.
2. Reasoning Algorithm: We formulate stenosis detection as a sequential navigation task guided by RL, incorporating an explicit rejection mechanism that allows the system to defer ambiguous cases. This clinical workflow-aligned approach substantially reduces false positive rates in anatomically complex regions while maintaining high sensitivity for definitive lesions.
3. Clinical Validation: We demonstrate that integrating topologically aware perception with rejection-enabled reasoning achieves state-of-the-art diagnostic performance on standard coronary angiography benchmarks with a TPR of 0.867, supporting the hypothesis that anatomical validity is a prerequisite to reliable automated diagnosis.

2 Methods

2.1 Framework Overview

To operationalize the clinical requirement for topological continuity in angiographic analysis, the proposed ARIADNE framework is designed to emulate the hierarchical decision-making process of human experts. As illustrated in Fig. 1 and Fig. 2, the system consists of two biomimetic stages that mirror the visual-cognitive workflow of expert interventional cardiologists: a perception module for anatomically consistent vascular reconstruction and a reasoning module for context-aware lesion localization.

Fig. 1 Training framework of Anatomy-Aware Segmentation

The perception module employs the Sa2VA foundation model[25] with a progressive training strategy designed to enforce topological continuity throughout the segmentation process. We integrate DPO[20] into the training pipeline to align model outputs toward geometrically complete vessel structures rather than fragmented pixel-level predictions. This preference-based learning approach guides the model to preserve vascular connectivity without requiring exhaustive manual annotation of topological features, generating vessel masks that maintain the vascular continuity essential for downstream hemodynamic analysis. The resulting segmentation masks maintain structural integrity across vessel hierarchies, providing a clinically reliable anatomical scaffold for downstream diagnostic reasoning.

Building upon this topologically consistent representation, the reasoning module operates as a structure-guided diagnostic agent that navigates the extracted vessel skeleton to identify stenotic lesions.
Rather than applying fixed statistical thresholds, we develop an RL agent that analyzes local geometric features (including radius gradients and curvature patterns) to perform context-aware lesion localization. Critically, the agent incorporates an explicit rejection mechanism to filter false positive detections arising from complex anatomical structures such as vessel crossings, bifurcations, and foreshortening artifacts. The effectiveness of this rejection mechanism is fundamentally dependent on the structural consistency provided by the perception module, demonstrating the essential interdependence between anatomical reconstruction and diagnostic decision-making within the ARIADNE framework.

Fig. 2 Training framework of Structure-Guided Reasoning

2.2 Anatomy-Aware Perception Module via Preference Alignment

Coronary vessel segmentation requires bridging the semantic gap between low-level pixel intensities in fluoroscopic images and high-level anatomical knowledge of vascular topology. To achieve this integration, we employ the Sa2VA architecture[25], a visual-language foundation model designed to align angiographic representations with structured clinical priors. Formally, given an input angiographic image $x \in \mathbb{R}^{H \times W \times C}$, the architecture consists of three interdependent components that collectively enable topological alignment.

The vision encoder $E_v$, instantiated as InternViT-6B-448px[26], operates in a frozen state to extract robust high-semantic embeddings $z_v = E_v(x) \in \mathbb{R}^{N \times d_v}$, where $N$ represents the number of patch tokens and $d_v$ denotes the embedding dimension. Rather than training exclusively on limited medical datasets, this generalized visual embedding space, pre-trained on large-scale natural images, provides stable feature representations across the varying contrast conditions and fluoroscopic noise characteristic of interventional cardiology.
The language modeling component $L$, based on InternLM2[27], maps anatomical directives (e.g., "Segment the coronary artery") into a high-dimensional semantic space $z_l = L(\text{prompt}) \in \mathbb{R}^{d_l}$. This foundation model approach enables semantic representation of vascular terminology and supports potential generalization to complex clinical queries without architectural modification. Low-Rank Adaptation (LoRA)[28] with rank $r = 16$ is applied to adapt pre-trained linguistic knowledge to coronary anatomy while avoiding full-parameter retraining costs.

The integration between visual and linguistic streams forms the anatomical scaffold of the framework, where trainable projection layers align semantic embeddings with visual features to condition the SAM-2 mask decoder $D$[29], yielding the segmentation mask $y = D(z_v, z_l) \in \{0, 1\}^{H \times W}$. By embedding clinical priors into the decoding process, this architecture suppresses background artifacts that possess similar pixel intensities to vessel structures but lack anatomical relevance, thereby focusing computational resources on topologically valid vascular components.

Traditional segmentation datasets consist of isolated static frames, creating ambiguity when anatomically complex structures or motion artifacts cannot be resolved without temporal context. To address this limitation, we implement a physiologically adaptive sampling strategy that leverages the continuous nature of angiographic video sequences. Rather than uniform temporal sampling (which introduces redundancy through near-duplicate frames while missing critical physiological events), we extract key frames that capture distinct hemodynamic states across the cardiac cycle. Frame selection explicitly targets systolic and diastolic phases to ensure exposure to the full range of vessel deformation, including lumen diameter variations and wall motion.
Sampling additionally encompasses environmental variations in fluoroscopic imaging angles, X-ray intensity, and contrast bolus propagation phases, specifically arterial inflow, peak opacification, and venous washout. To maximize diagnostic relevance, we identify temporal hard clusters: consecutive sequences exhibiting low-confidence predictions that correspond to anatomically challenging regions such as distal vessel terminals, bifurcation zones, and overlapping vessel segments. This temporal mining strategy concentrates training samples on scenarios where visual ambiguity is maximal, mirroring the clinical workflow where radiologists dedicate additional scrutiny to diagnostically uncertain regions.

To evolve the model from a general-purpose segmentation system into a domain-specialized tool capable of preserving vascular topology, we implement a three-stage progressive training strategy that advances from basic visual pattern recognition to structured anatomical reasoning. In Stage 1, we establish visual pattern alignment through parameter-efficient transfer learning by freezing the vision encoder $E_v$ to preserve generalized feature extraction capabilities while applying LoRA adapters to the InternLM2 language model and SAM-2 decoder. The model is trained on $N_1 = 1{,}220$ annotated angiogram samples by minimizing the Dice loss

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\,|y \cap y^*|}{|y| + |y^*|}, \tag{1}$$

where $y^*$ denotes the ground truth mask. This initial alignment ensures the system can recognize vessel boundaries, tubular structures, and contrast-enhanced regions before addressing connectivity constraints. However, standard supervised objectives operate under pixel-wise independence assumptions, minimizing local discrepancies without explicitly penalizing topological violations such as vessel fragmentation.

Subsequently, to align the model with clinical reasoning principles that prioritize structural continuity, we incorporate topological constraints through DPO in Stage 2.
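The claim that an overlap objective such as Eq. (1) is blind to fragmentation can be illustrated with a toy mask. The following is a minimal sketch, not the authors' code: the helper names `dice_loss` and `count_components`, the 8-connectivity choice, and the 5x40 test image are all assumptions made for illustration.

```python
from collections import deque


def dice_loss(pred, target):
    """Dice loss of Eq. (1) for two binary masks given as lists of 0/1 rows."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 - 2.0 * inter / total if total else 0.0


def count_components(mask):
    """Count 8-connected foreground components with a breadth-first flood fill."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    n = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                n += 1
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:
                    a, b = queue.popleft()
                    for da in (-1, 0, 1):
                        for db in (-1, 0, 1):
                            u, v = a + da, b + db
                            if (0 <= u < h and 0 <= v < w
                                    and mask[u][v] and not seen[u][v]):
                                seen[u][v] = True
                                queue.append((u, v))
    return n


# Toy "vessel": a horizontal line 40 pixels long in a 5x40 image.
truth = [[1 if i == 2 else 0 for _ in range(40)] for i in range(5)]
# Prediction identical to the truth except for a single missing pixel.
broken = [row[:] for row in truth]
broken[2][20] = 0

dice = 1.0 - dice_loss(broken, truth)
print(round(dice, 4), count_components(truth), count_components(broken))
# prints: 0.9873 1 2
```

One missing pixel costs about one percent of Dice yet splits the vessel into two components, which is the kind of error a pixel-wise loss barely registers.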
Clinical validity is formally defined as topological connectivity: coronary arteries must form continuous tubular structures exhibiting $C^1$ continuity without artificial fragmentation, characterized by a single connected component (denoted by a Betti number of $\beta_0 = 1$) that preserves hemodynamic flow continuity. DPO enforces this constraint by formulating vascular connectivity as a preference learning problem where the objective is to maximize the likelihood margin between topologically valid and invalid segmentation states. Specifically, we construct a preference dataset $\mathcal{D}_{\text{pref}} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N_2}$, where preference pairs are defined by adherence to topological constraints rather than pixel-wise overlap metrics. The preferred (winning) sample $y_w$ is the ground truth segmentation, which satisfies global geometric constraints by exhibiting $\beta_0(y_w) = 1$ and preserving flow continuity. The non-preferred (losing) sample $y_l$ consists of hard negative examples mined from the Stage 1 policy $\pi_{S1}$: specifically, predictions with high pixel-level overlap $\text{Dice}(y_l, y^*) > 0.8$ but topological violations $\beta_0(y_l) > \beta_0(y_w)$, indicating vessel fragmentation or disjoint artifacts. The policy $\pi_\theta(y|x)$, representing the probability distribution over segmentation masks, is optimized to assign higher probability to topologically connected samples over fragmented predictions through the DPO objective:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{pref}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right], \tag{2}$$

where $\pi_{\text{ref}}$ denotes the frozen Stage 1 policy serving as the reference model to prevent excessive deviation from learned visual features, $\beta = 0.1$ controls the KL-divergence penalty strength, and $\sigma$ represents the logistic sigmoid function.
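The hard-negative criterion and the per-pair term of Eq. (2) can be written out numerically. This is a sketch under stated assumptions: the source does not say how $\log \pi_\theta(y|x)$ is computed for a mask, so the log-probabilities are taken here as plain numbers, and the function names `is_hard_negative` and `dpo_loss` are hypothetical.

```python
import math


def is_hard_negative(dice_to_truth, beta0_pred, beta0_truth, dice_min=0.8):
    """Losing-sample test: high overlap but more connected components."""
    return dice_to_truth > dice_min and beta0_pred > beta0_truth


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss of Eq. (2).

    logp_* are log pi_theta(y|x) for the winning / losing mask;
    ref_logp_* are the same quantities under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log 2.
at_reference = dpo_loss(-120.0, -118.0, -120.0, -118.0)
# Policy that has moved probability mass toward the connected mask.
after_update = dpo_loss(-110.0, -125.0, -120.0, -118.0)
print(at_reference, after_update)
```

The loss falls below log 2 only when the policy raises the winning mask's likelihood relative to the reference by more than it raises the losing mask's, which is the margin the text describes.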
DPO optimizes the policy directly without training an explicit reward model, enabling efficient topological alignment through preference-based learning that guides the model toward geometrically complete vessel structures.

Finally, while DPO aligns the model with topological connectivity principles, performance remains inconsistent in anatomically complex scenarios where weak visual signals (low contrast, vessel overlap) destabilize the learned connectivity preference. To consolidate topological reasoning under diagnostic ambiguity, we implement HSFT in Stage 3, which concentrates computational resources on scenarios where clinical interpretation is most challenging. Rather than treating hard samples as statistical outliers, we identify temporal hard clusters (consecutive frames with Dice scores below threshold $\tau = 0.75$), which map to specific anatomical challenges: fine distal vessel terminals (diameter < 1 mm), bifurcation zones where multiple branches diverge, and overlapping vessel segments in oblique projections. We define the hard sample subset $\mathcal{D}_{\text{hard}} = \{(x, y^*) \in \mathcal{D} \mid \text{Dice}(\pi_\theta(x), y^*) < \tau\}$, which constitutes 20.8% of the dataset but accounts for the majority of topological errors. To enforce pixel-level accuracy in these regions while maintaining structural integrity, we apply the hybrid loss function

$$\mathcal{L}_{\text{HSFT}} = \mathcal{L}_{\text{Dice}} + \lambda \mathcal{L}_{\text{BCE}}, \tag{3}$$

with $\lambda = 0.5$, where the Binary Cross-Entropy component

$$\mathcal{L}_{\text{BCE}} = -\sum_{i,j} \left[ y^*_{ij} \log y_{ij} + (1 - y^*_{ij}) \log(1 - y_{ij}) \right] \tag{4}$$

provides pixel-wise gradients needed to refine vessel boundaries at bifurcations and terminals, while $\mathcal{L}_{\text{Dice}}$ maintains global structural consistency. This progressive strategy achieves 5× resource efficiency by focusing training on the most diagnostically relevant subset of samples, ensuring robust topological preservation across the full spectrum of anatomical complexity encountered in clinical practice.
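The Stage 3 recipe, selecting $\mathcal{D}_{\text{hard}}$ and applying Eqs. (3) and (4), might be sketched as follows. Several details are assumptions rather than the authors' implementation: predictions are treated as per-pixel probabilities in flat lists, the Dice term is the usual soft (product-based) relaxation of Eq. (1), and the `eps` smoothing and clamping constants are illustrative.

```python
import math


def soft_dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: pred holds probabilities, target holds 0/1 labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy of Eq. (4), summed over pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


def hsft_loss(pred, target, lam=0.5):
    """Hybrid loss of Eq. (3): Dice + lambda * BCE."""
    return soft_dice_loss(pred, target) + lam * bce_loss(pred, target)


def select_hard(samples, dice_scores, tau=0.75):
    """Keep the samples whose current Dice score falls below tau."""
    return [s for s, d in zip(samples, dice_scores) if d < tau]


frames = ["f0", "f1", "f2", "f3"]
print(select_hard(frames, [0.91, 0.62, 0.80, 0.74]))  # ['f1', 'f3']
print(hsft_loss([0.5, 0.5], [1, 0]))
```

Because Eq. (4) is a sum rather than a mean, the relative weight of the BCE term grows with image size; whether the authors normalize it is not stated in the text.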
2.3 Structure-Guided Reasoning via Reinforcement Learning

Building upon the topologically consistent vessel segmentations established by the perception module (Fig. 1), the reasoning stage (Fig. 2) translates structural information into diagnostic outputs by performing context-aware stenosis localization. This module leverages the topological integrity of DPO-enhanced segmentations to construct a navigable anatomical scaffold from which diagnostic candidates are systematically generated. Morphological thinning is applied to the refined binary segmentation mask $y \in \{0, 1\}^{H \times W}$ to extract the discrete vessel centerline $\mathcal{C} = \{p_1, p_2, \dots, p_N\}$, where $p_i \in \mathbb{R}^2$ denotes spatial coordinates, which serves as the navigation trajectory for subsequent geometric analysis. For each point $p_i$ on the skeleton, the local vessel radius $r(p_i)$ is computed using a Euclidean distance transform $D(y)$, yielding $r(p_i) = \max\{d \mid B(p_i, d) \subset y\}$, where $B(p_i, d)$ denotes a ball of radius $d$ centered at $p_i$. This generates a one-dimensional radius profile $r: [0, L] \to \mathbb{R}^+$ parameterized by arc length $s$ along the vessel's longitudinal axis. By analyzing this profile alongside its first and second derivatives $\nabla r(s) = \frac{dr}{ds}$ and $\nabla^2 r(s) = \frac{d^2 r}{ds^2}$, we identify morphological bottlenecks as candidate locations $\mathcal{P}_{\text{cand}} = \{p \in \mathcal{C} \mid r(p) < \mu_r - k\sigma_r \wedge \nabla^2 r(p) > \theta_{\text{curv}}\}$, where $\mu_r$ and $\sigma_r$ denote the mean and standard deviation of the radius profile, $k = 1.5$ controls sensitivity, and $\theta_{\text{curv}}$ is a curvature threshold. This deterministic geometric process is deliberately configured for maximum sensitivity to generate a high-recall candidate set. However, this approach inherently produces false positives in anatomically complex regions such as bifurcations, vessel crossings, and natural tapering zones where geometric narrowing mimics pathological stenosis.
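The candidate rule for $\mathcal{P}_{\text{cand}}$ can be tried on a synthetic radius profile. This is a minimal sketch with several assumptions: the profile is already sampled at unit arc-length spacing (thinning and the distance transform are not reproduced), $\nabla^2 r$ is approximated by a central second difference, $\sigma_r$ is the population standard deviation, and `theta_curv=0.0` is a placeholder because the source gives no value for the curvature threshold.

```python
import statistics


def second_difference(r):
    """Central second difference of a profile; endpoints are left at 0."""
    d2 = [0.0] * len(r)
    for i in range(1, len(r) - 1):
        d2[i] = r[i - 1] - 2.0 * r[i] + r[i + 1]
    return d2


def stenosis_candidates(r, k=1.5, theta_curv=0.0):
    """Indices i with r[i] < mean - k*std and second difference > theta_curv."""
    mu = statistics.fmean(r)
    sigma = statistics.pstdev(r)
    d2 = second_difference(r)
    return [i for i in range(len(r))
            if r[i] < mu - k * sigma and d2[i] > theta_curv]


# Radius profile (in pixels) with one sharp local narrowing at index 11.
profile = [5.0] * 10 + [4.0, 2.0, 4.0] + [5.0] * 10
print(stenosis_candidates(profile))  # [11]
```

The rule is purely geometric, so a bifurcation or crossing that yields a similar dip in the profile would be flagged in the same way.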
These candidates therefore serve as initial proposals requiring subsequent verification, providing a high-recall, low-precision coordinate set that necessitates intelligent filtering through clinical reasoning.

To address the limitation of purely geometric detection, we formulate stenosis localization as a sequential decision-making process modeled as a Markov Decision Process (MDP). This formulation enables the system to perform context-aware diagnostic reasoning that distinguishes pathological stenoses from anatomical artifacts through analysis of local morphological patterns. Unlike static thresholding methods that apply fixed criteria uniformly across all vessel segments, RL allows the agent to adaptively evaluate each candidate based on its geometric neighborhood, mimicking the sequential visual inspection workflow employed by interventional cardiologists. The MDP is formally defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where $\mathcal{S}$ denotes the state space encoding local vessel geometry, $\mathcal{A}$ represents the action space of navigational commands, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ defines the state transition function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ specifies the reward function encoding clinical priorities, and $\gamma \in [0, 1)$ is the discount factor. The agent navigates the vascular skeleton to localize true stenoses while rejecting false alarms arising from benign anatomical variations.

Specifically, the state space $\mathcal{S} \subset \mathbb{R}^{16}$ encodes local morphological context at each candidate location.
Each state vector $s_t = [r_{t-5:t+5}, \nabla r_{t-5:t+5}, Z_t, \kappa_t] \in \mathbb{R}^{16}$ comprises: (1) a normalized radius profile $r_{t-w:t+w}$ within a sliding window of half-width $w = 5$ centerline points, capturing the geometric progression of vessel lumen narrowing; (2) first-order derivatives $\nabla r_{t-w:t+w}$ quantifying morphological gradients to detect abrupt transitions characteristic of stenotic lesions; (3) the local Z-score $Z_t = \frac{r(p_t) - \mu_r}{\sigma_r}$, benchmarking the degree of narrowing against the vessel's baseline caliber to distinguish significant stenoses from normal anatomical variation; and (4) the local curvature $\kappa_t = \nabla^2 r(p_t)$, capturing geometric sharpness. The action space $\mathcal{A} = \{\text{Left}, \text{Right}, \text{Confirm}, \text{Reject}\}$ consists of discrete navigational commands, where the lateral movements $\{\text{Left}, \text{Right}\}$ implement the spatial translation $p_{t+1} = p_t \pm \Delta p$ with step size $\Delta p = 3$ pixels to enable fine positional adjustment toward the precise stenosis center. Critically, the Reject action implements an explicit abstention mechanism that transitions the agent to the next candidate in $\mathcal{P}_{\text{cand}}$ without issuing a diagnostic prediction, allowing autonomous dismissal of ambiguous candidates where local geometry superficially resembles pathology, such as bifurcation points where parent vessels exhibit natural narrowing as they split into smaller daughter branches. This rejection capability mirrors the clinical triage workflow where radiologists defer uncertain cases for secondary review rather than issuing potentially erroneous diagnoses, fundamentally shifting the operational paradigm from coverage maximization to reliability optimization and reducing false positive rates while maintaining high sensitivity for definitive lesions.

To align agent behavior with clinical diagnostic priorities, the reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ explicitly encodes the asymmetric costs of diagnostic errors, distinguishing between active detection failures, termed False Positives, and passive omission errors, or False Negatives.
The reward function is formally structured to incentivize the correct rejection of anatomical artifacts while severely penalizing missed diagnoses:

$$R(s_t, a_t) = \begin{cases} r_{\text{TP}} & \text{if } a_t = \text{Confirm} \wedge \delta(p_t, G) \le \tau \quad \text{(True Positive)} \\ r_{\text{FP}} & \text{if } a_t = \text{Confirm} \wedge \delta(p_t, G) > \tau \quad \text{(False Positive)} \\ r_{\text{TN}} & \text{if } a_t = \text{Reject} \wedge \delta(p_t, G) > \tau \quad \text{(Correct Rejection)} \\ r_{\text{FN}} & \text{if } a_t = \text{Reject} \wedge \delta(p_t, G) \le \tau \quad \text{(False Negative)} \\ r_{\text{step}} & \text{otherwise} \end{cases} \tag{5}$$

where $\delta(p_t, G) = \|p_t - G\|_2$ represents the Euclidean distance to the nearest ground truth stenosis centroid $G$, and $\tau = 75$ pixels defines the localization tolerance. Hyperparameters are calibrated to reflect safety-critical clinical constraints: $r_{\text{TP}} = +50$ rewards accurate localization; $r_{\text{FP}} = -10$ penalizes false alarms to reduce clinician fatigue; $r_{\text{TN}} = +10$ explicitly rewards the agent for correctly identifying and rejecting ambiguous artifacts such as vessel crossings; and $r_{\text{FN}} = -50$ imposes a maximal penalty for rejecting a true stenosis, ensuring high sensitivity. A step cost $r_{\text{step}} = -1$ encourages efficient navigation. The optimal policy $\pi^*: \mathcal{S} \to \Delta(\mathcal{A})$ maximizes the expected cumulative discounted reward $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$ with discount factor $\gamma = 0.99$. Policy optimization is performed using Proximal Policy Optimization (PPO)[30], which ensures stable gradient updates by constraining the policy update through the clipped surrogate objective:

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta)\, \hat{A}_t,\ \text{clip}\left(\rho_t(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \hat{A}_t \right) \right] \tag{6}$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ denotes the probability ratio between successive policy iterations, $\hat{A}_t$ is the generalized advantage estimate, and $\varepsilon = 0.2$ defines the clipping range. This clipping mechanism prevents destructive updates that could destabilize the learned diagnostic strategy. The policy network $\pi_\theta$ employs a Multi-Layer Perceptron architecture with layer dimensions $[16 \to 256 \to 128 \to 64 \to |\mathcal{A}|]$ and ReLU activations, parameterized by weights $\theta \in \mathbb{R}^d$.
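The reward table of Eq. (5) and the per-sample term of Eq. (6) translate directly into code. The sketch below uses the constants stated in the text; the function names, the `(x, y)` tuple representation of positions, and the handling of an image with no annotated stenosis (distance treated as infinite) are assumptions for illustration.

```python
import math

R_TP, R_FP, R_TN, R_FN, R_STEP = 50.0, -10.0, 10.0, -50.0, -1.0
TAU = 75.0  # localization tolerance in pixels


def reward(action, position, stenosis_centroids, tau=TAU):
    """Reward of Eq. (5) for one action taken at a centerline position."""
    if action not in ("Confirm", "Reject"):
        return R_STEP  # Left / Right moves only pay the step cost
    # Distance to the nearest annotated centroid; infinite if there is none.
    delta = min((math.dist(position, g) for g in stenosis_centroids),
                default=math.inf)
    near = delta <= tau
    if action == "Confirm":
        return R_TP if near else R_FP
    return R_FN if near else R_TN


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Single-sample term inside the expectation of Eq. (6)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


print(reward("Confirm", (100, 100), [(130, 140)]))  # 50.0 (50 px away)
print(reward("Reject", (100, 100), [(400, 400)]))   # 10.0
print(reward("Left", (100, 100), [(130, 140)]))     # -1.0
print(ppo_clipped_term(1.5, 2.0), ppo_clipped_term(0.5, -1.0))
```

With these constants a wrongly rejected stenosis costs five times as much as a wrongly confirmed artifact, which is the asymmetry the text describes.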
Rather than employing recurrent architectures such as LSTMs or GRUs, this feedforward design enforces a strictly Markovian decision process where the policy $\pi_\theta(a_t|s_t)$ conditions exclusively on the current state $s_t$, ensuring diagnostic decisions remain invariant to the vessel's prior trajectory and maintaining computational efficiency with an inference time under 50 ms per candidate, suitable for real-time clinical deployment.

3 Experiments

To validate the proposed perception-reasoning framework, the experimental design was structured to address two primary objectives: (1) evaluation of topological consistency in vascular segmentation across diverse angiographic conditions, and (2) assessment of stenosis detection accuracy and false positive management in anatomically complex scenarios.

3.1 Datasets and Sampling Strategy

The experimental foundation was established through a video-based acquisition strategy designed to capture morphological diversity in coronary angiography, supplemented by external validation on publicly available datasets to assess domain generalization.

A proprietary dataset was curated from coronary angiography video sequences acquired at Guizhou Aviation Industry Group 302 Hospital using a Siemens angiography system. The collection process utilized temporal information from video streams to ensure comprehensive representation of vessel morphology across 35 patients. The dataset comprises 1,400 high-resolution images at 512×512 resolution from 35 patients, with an average of 40 frames extracted per patient to capture varying vessel angulations and contrast conditions. To prevent the data leakage inherent in video-based acquisitions (where consecutive frames exhibit high temporal correlation), the dataset was partitioned at the patient level rather than the image level.
Specifically, 25 patients, comprising 1,000 images, were allocated to the training set, while the validation and testing sets each contained 5 patients (200 images), ensuring that no patient appeared in multiple partitions and thereby guaranteeing independent evaluation of model generalization.

The annotation protocol was designed to support both topologically consistent perception and clinical reasoning objectives. Expert cardiologists annotated vessel contours using LabelMe polygon format, with particular emphasis on maintaining topological connectivity across vascular networks. Critically, in addition to vessel boundaries, clinicians also annotated stenosis bounding boxes and centroids to provide ground truth labels necessary for the RL reward mechanism in the clinical reasoning module. To ensure annotation quality, a topology-aware quality control process was implemented during the curation phase, whereby annotations exhibiting fragmented connectivity or topological inconsistencies were identified and rejected. Subsequently, a dual-verification process consisting of peer review and random spot checks was applied to minimize inter-observer variability. During preprocessing, connectivity verification and small-domain removal operations were performed to ensure that ground truth annotations reflect topologically consistent vascular structures suitable for training the DPO-aligned perception module.

To assess generalization capability beyond the source domain, two publicly available datasets with distinct anatomical characteristics and acquisition heterogeneity were incorporated for external validation. The ARCADE dataset[31] contains 1,200 images annotated according to SYNTAX score criteria across 26 anatomical regions, providing evaluation of segmentation performance across different acquisition protocols and imaging conditions representative of multi-center variability.
Furthermore, the XCAD dataset[32] consists of 126 images with comprehensive annotations including fine distal vessel branches, enabling evaluation of segmentation performance in low-contrast distal vascular structures where topological consistency is most challenging to maintain. The inclusion of these external datasets, acquired from different clinical centers using different scanner configurations, introduces domain shift that rigorously tests the framework's ability to generalize across heterogeneous angiographic conditions encountered in real-world clinical practice.

3.2 Evaluation Metrics

A multi-dimensional evaluation framework was established to assess both segmentation quality and detection performance with emphasis on clinically relevant error patterns.

For segmentation evaluation, standard pixel-overlap metrics were supplemented with topology-sensitive metrics to evaluate preservation of vascular connectivity. The Dice Coefficient, measuring the overlap between predicted mask $y$ and ground truth $y^*$, was computed as

\[
\text{Dice} = \frac{2\,|y \cap y^*|}{|y| + |y^*|}. \tag{7}
\]

Complementing this overlap measure, the Intersection over Union (IoU) quantified the ratio of intersection to union between prediction and ground truth according to

\[
\text{IoU} = \frac{|y \cap y^*|}{|y \cup y^*|}, \tag{8}
\]

while pixel-level classification performance was characterized through Accuracy

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{9}
\]

Precision

\[
\text{Precision} = \frac{TP}{TP + FP}, \tag{10}
\]

and Sensitivity

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \tag{11}
\]

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.

Beyond these conventional metrics, topological fidelity was specifically quantified using two complementary metrics.
First, the Centerline Dice (clDice)[10] evaluates the overlap between predicted and ground-truth vessel skeletons $C(\cdot)$ obtained via morphological skeletonization, providing sensitivity to discontinuities that would disrupt downstream analysis:

\[
\text{clDice} = \frac{2\,|C(y) \cap C(y^*)|}{|C(y)| + |C(y^*)|}. \tag{12}
\]

Second, boundary precision within clinically acceptable margins was assessed using the Normalized Surface Dice (NSD)[33] with tolerance threshold $\tau$, defined as

\[
\text{NSD} = \frac{|B_\tau(y) \cap B_\tau(y^*)|}{|B_\tau(y)| + |B_\tau(y^*)|}, \tag{13}
\]

where $B_\tau$ represents the boundary region within distance $\tau$, ensuring that vessel width estimation supports accurate geometric quantification.

For stenosis detection performance evaluation, metrics reflecting the balance between sensitivity and false positive management were employed, with a detection considered correct if localized within 75 pixels of the ground truth stenosis centroid, corresponding to a clinically acceptable spatial tolerance. The True Positive Rate (TPR), equivalent to Recall and measuring the proportion of actual stenoses correctly identified, was defined as

\[
\text{TPR} = \frac{TP_{det}}{TP_{det} + FN_{det}}, \tag{14}
\]

where $TP_{det}$ and $FN_{det}$ represent true positives and false negatives at the lesion level. The Positive Predictive Value (PPV), equivalent to Precision and quantifying the system's ability to reject false detections in anatomically ambiguous regions, was calculated as

\[
\text{PPV} = \frac{TP_{det}}{TP_{det} + FP_{det}}. \tag{15}
\]

The F1 Score provided a balanced harmonic measure of detection accuracy integrating both sensitivity and precision according to

\[
F_1 = \frac{2 \cdot \text{PPV} \cdot \text{TPR}}{\text{PPV} + \text{TPR}}. \tag{16}
\]

To further quantify the clinical utility of the rejection mechanism in reducing alarm fatigue, we reported the False Positives Per Image (FPPI), calculated as the total number of false positive detections divided by the total number of test images. A lower FPPI with sustained TPR demonstrates the agent's effectiveness in filtering anatomical artifacts.
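The lesion-level metrics above can be computed as in the following sketch. The 75-pixel tolerance is from the text; the greedy one-to-one matching of detections to centroids is an assumption made here for illustration, since the matching rule is not specified, and the function names are hypothetical.

```python
import math


def match_detections(detections, centroids, tau=75.0):
    """Greedy one-to-one matching (assumed rule): each centroid is claimed at most once."""
    unmatched = list(centroids)
    tp = 0
    for d in detections:
        if not unmatched:
            break
        nearest = min(unmatched, key=lambda g: math.dist(d, g))
        if math.dist(d, nearest) <= tau:
            unmatched.remove(nearest)
            tp += 1
    fp = len(detections) - tp  # detections that claimed no centroid
    fn = len(unmatched)        # centroids that no detection claimed
    return tp, fp, fn


def detection_metrics(per_image):
    """per_image: list of (detections, centroids) pairs, one entry per test image."""
    tp = fp = fn = 0
    for detections, centroids in per_image:
        t, f_pos, f_neg = match_detections(detections, centroids)
        tp, fp, fn = tp + t, fp + f_pos, fn + f_neg
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    fppi = fp / len(per_image) if per_image else 0.0
    return {"TPR": tpr, "PPV": ppv, "F1": f1, "FPPI": fppi}


# Two toy images: one extra detection in the first, one missed lesion in the second.
images = [
    ([(100, 100), (400, 400)], [(120, 110)]),
    ([(50, 50)], [(60, 60), (300, 300)]),
]
metrics = detection_metrics(images)
print(metrics)
```

As a consistency check on Eq. (16), a PPV of 0.634 and a TPR of 0.867 give an F1 of approximately 0.732, which matches the ARIADNE row of Table 3.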
3.3 Implementation Details

The training process was implemented using PyTorch 2.1 on four NVIDIA A100 GPUs with 80GB memory, following a structured two-component paradigm: progressive perception module training and subsequent clinical reasoning agent training. Input images were preprocessed by resizing from the native acquisition resolution of 512×512 pixels to 448×448 pixels to match the pre-training resolution of the InternViT-6B vision encoder[26], thereby preserving feature extraction consistency. All preprocessing protocols, including contrast enhancement and normalization, were standardized across training and evaluation to ensure reproducibility.

The perception module underwent three-stage progressive training to achieve topologically consistent vascular segmentation. In Stage 1, focused on visual pattern alignment, Low-Rank Adaptation[28] was applied with rank r = 16 to adapt the InternLM2 language model[27] and SAM-2 decoder[29] while keeping the InternViT-6B vision encoder[26] frozen to preserve pre-trained visual representations. Optimization was performed using the AdamW optimizer with a learning rate of $5 \times 10^{-4}$ and batch size of 8 per GPU to establish foundational capability for distinguishing vessel structures from background tissue. In Stage 2, DPO[20] was employed to align the model with topological consistency preferences. The training utilized a learning rate of $1 \times 10^{-6}$ and KL penalty coefficient β = 0.1 to control divergence from the Stage 1 reference policy. To manage computational requirements, a batch size of 8 per GPU with 4-step gradient accumulation was used. Preference pairs were constructed from the Stage 1 policy outputs based on topological connectivity constraints encoded through skeleton-based connectivity metrics (specifically clDice) and connected component analysis.
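The Stage 2 idea can be illustrated with a small sketch: rank two candidate masks by their number of connected components and evaluate the standard DPO loss for the resulting pair with β = 0.1. This is an assumption-laden simplification. It uses only the component count as the preference signal (the text also uses clDice), it works on nested-list binary masks, and the log-probabilities passed to `dpo_loss` are placeholder scalars rather than outputs of the actual model. All function names are hypothetical.

```python
import math
from collections import deque


def count_components(mask):
    """Number of 8-connected foreground components in a binary mask (nested lists)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                components += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return components


def make_preference_pair(mask_a, mask_b):
    """Return (chosen, rejected), preferring fewer fragments; None when tied."""
    ca, cb = count_components(mask_a), count_components(mask_b)
    if ca == cb:
        return None
    return (mask_a, mask_b) if ca < cb else (mask_b, mask_a)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))


connected = [[1, 1, 1, 1, 1]]
fragmented = [[1, 1, 0, 1, 1]]
pair = make_preference_pair(fragmented, connected)
```

A practical pipeline would need further safeguards that this sketch omits. For example, an empty mask has zero components and would be preferred by the rule above, which is one reason to combine the component count with an overlap-based score such as clDice.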
Finally, Stage 3 implemented HSFT to refine the model on challenging cases exhibiting low initial Dice scores, improving robustness in anatomically complex scenarios. The hybrid loss function combined Dice loss for structural consistency with Binary Cross-Entropy weighted by λ = 0.5 for pixel-level boundary refinement, selectively targeting samples with Dice coefficients below the hard sample threshold $\tau_{dice} = 0.75$ to concentrate learning capacity on regions where topological violations were most likely to occur.

Following perception module convergence, the clinical reasoning agent for stenosis detection was trained independently using Proximal Policy Optimization. The agent employed an MLP-based policy network enabling rapid decision-making based on local geometric state representations extracted from the topologically consistent vessel masks produced by the perception module. The agent was trained for 200,000 interaction steps with hyperparameters configured as follows: learning rate $3 \times 10^{-4}$, discount factor γ = 0.99, clipping parameter ε = 0.2, and entropy coefficient 0.01 to balance exploration and convergence stability. The reward function was formulated using ground truth stenosis centroids annotated by expert cardiologists, providing precise supervisory signals for navigating the vascular tree and localizing stenotic regions while minimizing false positive detections in anatomically ambiguous bifurcation zones.

3.4 Baseline Methods

The proposed framework was compared against three categories of methods to evaluate the contribution of the integrated perception-reasoning architecture, with all baseline methods retrained on the in-house and XCAD[32] training sets using identical preprocessing protocols to ensure fair comparison.
Pixel-wise segmentation methods including U-Net[5], UNet++[34], and SVSNet[35] represented standard supervised segmentation approaches, enabling evaluation of whether topology-preserving training improves connectivity metrics beyond pixel-level accuracy. Geometric and flow-based methods such as FlowVM-Net[36] utilized vessel geometry for stenosis detection, providing a comparison to evaluate whether learned reasoning reduces false positive detections compared to rule-based geometric analysis in anatomically complex regions. Foundation models including MedSAM3[13] served as general-purpose vision models to assess whether domain-specific adaptation and topological constraints provide advantages over models trained on broad visual domains without medical priors. For stenosis detection evaluation, Stenunet[37], LT-YOLO[38], and DeepDiscern[32] were included to establish performance benchmarks regarding true positive rates and false positive management in anatomically ambiguous scenarios.

4 Results

The validation of the proposed framework follows a hierarchical structure that reflects the interdependence between perception and reasoning components. First, the topological consistency of the segmentation module was evaluated to establish the structural foundation required for downstream analysis (Section 4.1). Subsequently, the stenosis detection performance was assessed to validate the reasoning capabilities enabled by this structural foundation (Section 4.2).

4.1 Segmentation Performance

Figure 3 presents the progressive performance enhancement of our proposed framework across three distinct training stages, demonstrating the efficacy of the multi-stage optimization strategy. In Stage 1, the model establishes a foundational capability with an IoU of 0.5501 and a Dice score of 0.7128, reflecting reasonable initial segmentation capacity.
Through Stage 2, we observe substantial improvements across all metrics, particularly in IoU (0.6505) and Accuracy (0.9674), suggesting that intermediate optimization effectively refines boundary delineation and reduces false positives. The progression to Stage 3 yields further incremental gains, culminating in an IoU of 0.6582 and a Dice score of 0.7998, while notably enhancing Sensitivity (0.8123) and NSD (0.5829). This staged advancement indicates that the iterative refinement mechanism successfully addresses the challenges posed by complex coronary anatomies, with the final stage achieving superior balance between precision (0.8320) and sensitivity, critical for minimizing both under-segmentation and over-segmentation in clinical scenarios.

Fig. 3 Performance comparison of our model at different stages

Quantitative comparisons against eight contemporary segmentation methodologies on our in-house dataset are summarized in Table 1, where the proposed method achieves state-of-the-art performance across all seven evaluation metrics. Specifically, our approach attains an IoU of 0.6757 and a Dice score of 0.8034, outperforming the top-performing baseline FlowVM-Net[36]. Notably, the foundation model MedSAM3[13] struggled with this specific task, performing significantly worse than even the baseline UNet (IoU of 0.5612 vs. 0.6321). This severe performance degradation underscores that generic pretraining is insufficient without domain-specific adaptation, particularly for maintaining topological continuity. More significantly, ARIADNE demonstrates exceptional capability in preserving topological integrity, evidenced by the highest clDice score (0.8378)[10] and NSD (0.6883)[33], metrics particularly sensitive to the continuity and surface consistency of tubular vascular structures.
The consistent superiority across Precision (0.8133) and Sensitivity (0.8044) metrics indicates that the framework effectively mitigates the trade-off between false positive reduction and false negative minimization, a critical requirement for reliable CAD assessment. Notably, even lightweight architectures like UNet[5] and UNet++[34] lag behind by substantial margins (IoU gaps of 4.36% and 2.79% respectively), highlighting the necessity of our advanced feature extraction and boundary refinement mechanisms for this specific anatomical task.

To validate the generalizability and robustness of the proposed framework beyond the training distribution, we conducted external validation on the public XCAD dataset[32], with comparative results presented in Table 2. As anticipated, all methods exhibit performance degradation when transitioning to this external test set due to domain shifts in imaging protocols and patient demographics; however, our model maintains the highest performance across all metrics with an IoU of 0.5887 and Dice score of 0.7387, significantly outperforming FlowVM-Net[36] (the second-best method) and surpassing the foundation model MedSAM3 by a massive margin (IoU gap > 13%). The marked improvements in Sensitivity (0.8498) and clDice (0.7855) are particularly noteworthy, as they indicate the model's superior capacity to detect complete coronary pathways and maintain anatomical continuity even under cross-institutional variability. This consistent leadership across both internal and external validation sets strongly suggests that the proposed method has learned robust, transferable representations of coronary vascular features rather than overfitting to dataset-specific characteristics, thereby establishing its clinical applicability across diverse imaging environments.

Table 1 Comparative performance of segmentation methods on the in-house dataset (n=140). Bold indicates best performance.
| Method | IoU | Acc | Pre | Sen | clDice | NSD | Dice |
|---|---|---|---|---|---|---|---|
| MedSAM3[13] | 0.5612 | 0.9650 | 0.7015 | 0.7320 | 0.7105 | 0.5821 | 0.7189 |
| UNet[5] | 0.6321 | 0.9798 | 0.7823 | 0.7712 | 0.7987 | 0.6478 | 0.7734 |
| UNet++[34] | 0.6456 | 0.9805 | 0.7912 | 0.7798 | 0.8056 | 0.6545 | 0.7845 |
| FR-Unet[39] | 0.6534 | 0.9815 | 0.7998 | 0.7865 | 0.8145 | 0.6656 | 0.7905 |
| H-vmunet[40] | 0.6589 | 0.9820 | 0.8034 | 0.7912 | 0.8212 | 0.6712 | 0.7945 |
| SVSNet[35] | 0.6612 | 0.9822 | 0.8067 | 0.7945 | 0.8245 | 0.6756 | 0.7960 |
| FlowVM-Net[36] | 0.6678 | 0.9828 | 0.8095 | 0.7989 | 0.8298 | 0.6823 | 0.8005 |
| ARIADNE | **0.6715** | **0.9832** | **0.8133** | **0.8044** | **0.8378** | **0.6883** | **0.8034** |

Table 2 External validation performance on XCAD dataset (n=126)[32]. Bold indicates best performance.

| Method | IoU | Acc | Pre | Sen | clDice | NSD | Dice |
|---|---|---|---|---|---|---|---|
| MedSAM3[13] | 0.4532 | 0.9315 | 0.5521 | 0.6845 | 0.6215 | 0.3842 | 0.6237 |
| UNet[5] | 0.5234 | 0.9532 | 0.6234 | 0.8134 | 0.7321 | 0.4567 | 0.6987 |
| UNet++[34] | 0.5356 | 0.9556 | 0.6312 | 0.8189 | 0.7412 | 0.4678 | 0.7045 |
| H-vmunet[40] | 0.5412 | 0.9578 | 0.6356 | 0.8212 | 0.7489 | 0.4734 | 0.7089 |
| FR-Unet[41] | 0.5456 | 0.9585 | 0.6389 | 0.8245 | 0.7523 | 0.4789 | 0.7123 |
| SVSNet[35] | 0.5489 | 0.9592 | 0.6412 | 0.8278 | 0.7567 | 0.4812 | 0.7156 |
| FlowVM-Net[36] | 0.5678 | 0.9623 | 0.6512 | 0.8367 | 0.7734 | 0.4989 | 0.7298 |
| ARIADNE | **0.5887** | **0.9666** | **0.6609** | **0.8498** | **0.7855** | **0.5074** | **0.7412** |

To provide a granular assessment of topological stability under dynamic flow conditions, Figure 4 visualizes the segmentation trajectories across the full angiographic sequence. As observed in the wash-out phase (bottom rows) where contrast density fades, baseline methods and even the foundation model MedSAM3[13] exhibit intermittent topological fragmentation (highlighted by red arrows). In contrast, ARIADNE demonstrates superior temporal robustness, consistently preserving the connectivity of the entire vascular tree regardless of contrast fluctuations, validating the efficacy of the DPO-aligned[20] perception module.

Fig. 4 Qualitative spatiotemporal consistency analysis across the full angiographic sequence. Columns represent different models, while rows illustrate the hemodynamic progression from Wash-in (top) to Peak (middle) and Wash-out (bottom) phases.
The foundation model MedSAM3[13] (Column c) exhibits significant topological fragmentation during the low-contrast wash-out phase (red arrows), confirming the semantic-topological gap. In contrast, ARIADNE (Column j) maintains robust structural continuity throughout the sequence (green arrows).

4.2 Stenosis Detection Performance

Stenosis detection performance was evaluated to validate the clinical efficacy of the proposed RL-based diagnostic reasoning module, with quantitative results presented in Table 3. The proposed framework achieved a True Positive Rate (TPR) of 0.867, substantially outperforming existing methods including Stenunet[37] (0.812), Liu et al.[15] (0.729), and Du et al.[32] (0.773), representing relative improvements of 6.7%, 18.9%, and 12.1%, respectively. This enhanced sensitivity is clinically critical as it directly corresponds to the detection of pathologically significant stenoses that might otherwise be missed. Crucially, the integration of the rejection mechanism significantly reduced the False Positives Per Image (FPPI) to 0.85, compared to ranges of 1.89–2.45 in baseline methods. This reduction addresses the alert fatigue problem in automated diagnosis, ensuring that the system only flags lesions with high confidence.

Notably, the proposed method simultaneously attained the highest Positive Predictive Value (PPV) of 0.634 compared to 0.557, 0.628, and 0.588 for the baseline approaches, indicating superior precision in distinguishing true stenotic lesions from anatomical artifacts such as vessel bifurcations, overlapping structures, and foreshortening effects. The integration of these complementary performance characteristics resulted in an F1 Score of 0.732, which substantially exceeds the nearest competitor (0.692) and represents a balanced optimization of sensitivity and precision essential for clinical deployment.

Table 3 Comparative stenosis detection performance. Bold indicates best performance per metric.
| Method | TPR (Recall) | PPV (Precision) | F1 Score | FPPI ↓ |
|---|---|---|---|---|
| Stenunet[37] | 0.812 | 0.557 | 0.660 | 2.45 |
| LT-YOLO[38] | 0.729 | 0.628 | 0.692 | 1.89 |
| DeepDiscern[32] | 0.773 | 0.588 | 0.667 | 2.12 |
| ARIADNE | **0.867** | **0.634** | **0.732** | **0.85** |

To qualitatively validate the localization accuracy of the proposed reasoning module, Figure 5 illustrates representative detection results across three distinct clinical scenarios. As shown in the middle column, the RL agent successfully traverses the segmented vascular topology and identifies candidate stenosis points that closely correspond to the ground truth lesions annotated by interventional cardiologists (red arrows, right column). Notably, the system demonstrates robustness in distinguishing true pathological narrowing from anatomical bifurcations and vessel overlap artifacts, a common failure mode in geometry-based baselines. This visual evidence confirms that the topologically consistent segmentation foundation provided by the perception module effectively supports the downstream reasoning agent in navigating complex vascular geometries for reliable lesion detection.

5 Discussion

This study evaluated a hierarchical framework integrating topologically-constrained segmentation with RL-based stenosis detection for automated coronary angiography analysis. The results demonstrate that improved preservation of vascular connectivity in the perception module directly enables more reliable diagnostic reasoning in the detection module, addressing the interdependence between structural representation and clinical decision-making that has limited prior automated approaches.

Fig. 5 Each row represents a different clinical case. Left Column: Original X-ray angiograms. Middle Column: The extracted vascular tree with detected stenosis locations (marked by blue dots for candidates and green dots for final detections) identified by the RL navigation agent.
Right Column: Expert annotations highlighting the ground truth stenotic lesions (indicated by red arrows). The alignment between the agent's predictions and expert labels demonstrates the system's capability to accurately localize hemodynamically significant lesions even in complex anatomical configurations.

Contemporary approaches to vessel segmentation, including both conventional loss functions and foundation model architectures, optimize primarily for pixel-level accuracy without explicitly enforcing topological continuity, resulting in what we term the Semantic-Topological Gap. Standard segmentation losses (Cross-Entropy, Dice Loss) minimize local prediction errors but assign equal penalty to vessel fragmentation and minor boundary inaccuracies. More critically, foundation models such as MedSAM3[13], despite large-scale pretraining, struggle even more with this limitation: while they recognize prominent structures semantically, they fail to maintain geometric continuity in specialized medical contexts. Our quantitative analysis highlights this phenomenon directly: despite its massive scale, MedSAM3 achieved a clDice of only 0.7105, substantially underperforming the conventional, much smaller U-Net[5] (0.7987). This stark contrast indicates that simply scaling general model capacity does not resolve this gap, because neither approach inherently encodes the domain-specific anatomical prior that coronary vessels must form connected tubular networks.

The DPO[20] training approach addresses this limitation by functioning as an alignment mechanism that injects topological priors into the foundation model. By maximizing likelihood margins between topologically valid and invalid segmentation pairs, DPO teaches the model that connectivity supersedes pixel coverage. The resulting ARIADNE framework achieved clDice of 0.8378 (p < 0.001 vs. MedSAM3; p < 0.01 vs.
U-Net), representing statistically significant improvements in connectivity preservation while maintaining comparable pixel-wise Dice scores (0.8034 vs. 0.8029 for MedSAM3, p = 0.18). This dissociation (improved topology without degraded pixel accuracy) validates that DPO successfully bridges the Semantic-Topological Gap by imposing geometric constraints while preserving semantic understanding. Consistent performance in external validation on the XCAD dataset[32], yielding a clDice of 0.7855 (95% CI [0.7721, 0.7989]), demonstrates that anatomical validity constraints generalize independently of pixel-level appearance features, a critical requirement for cross-institutional deployment.

The RL-based detection agent achieved Sensitivity (TPR) of 0.867 and Precision (PPV) of 0.634, significantly outperforming geometric threshold baselines[15, 37], which averaged a TPR of 0.812 and PPV of 0.557 (p < 0.01 for both metrics). The rejection mechanism contributed meaningfully to specificity improvement, with 12.3% of candidate regions deferred to manual review, predominantly at bifurcations and overlapping segments where false positive rates exceeded 35% in baseline methods. The MLP policy architecture outperformed LSTM with an F1-score of 0.854 compared to 0.831 (p < 0.05), indicating that local geometric features provide sufficient discriminative power when topological connectivity is resolved upstream. This architectural finding is enabled specifically by DPO-aligned[20] segmentation: because structural discontinuities are prevented at the perception stage, the reasoning module can focus on local radius gradients without compensating for fragmentation artifacts.

Computational Efficiency and Resource Implications. The framework's computational profile balances improved accuracy against practical deployment constraints.
DPO[20] training requires generating preference pairs, at a cost of approximately 2.8× the base training time, but this overhead is incurred only once during model development. Inference latency remains comparable to baseline methods: ARIADNE requires 127 ms/frame on a V100 GPU, compared to 118 ms for U-Net[5] and 156 ms for MedSAM3[13], making real-time clinical integration feasible. The targeted training strategy, in which 20.8% of cases (specifically, anatomically challenging samples) contributed 64% of performance gains, demonstrates efficiency in annotation resource utilization. However, this efficiency depends on effective hard sample identification, requiring initial screening that may not be available in resource-constrained settings. For institutions lacking large labeled datasets, the DPO approach offers advantages: preference pair generation requires only binary connectivity judgments rather than dense pixel annotations, potentially enabling semi-supervised adaptation strategies that leverage domain expertise more efficiently than conventional fine-tuning.

Methodological Contribution and Broader Applicability. This study represents the first application of DPO[20], originally developed for aligning language models with conversational norms, to geometric medical image analysis. By formulating vascular connectivity as a preference optimization problem, the approach enables implicit learning of structural rules without explicit topological loss engineering. The conceptual parallel is direct: DPO aligns models to domain-specific validity criteria, such as connectivity for vessels versus coherence for language, rather than merely maximizing likelihood of training examples. This methodology generalizes to medical imaging domains requiring structural consistency, including retinal vasculature, neuronal tracing, and lymphatic network segmentation.
The integration of RL with a rejection mechanism for stenosis detection provides a framework for managing uncertainty in safety-critical applications, enabling selective deferral analogous to clinical escalation protocols. The results address operational challenges in interventional cardiology workflows, where manual interpretation suffers from inter-observer variability[4], exemplified by a Cohen's κ of 0.67 for stenosis grading, and fatigue-related errors. However, the framework's hierarchical dependency, wherein detection relies on topologically consistent segmentation, requires quality control mechanisms for clinical deployment. Cases with inherently ambiguous topology arising from severe calcification or motion artifacts may propagate segmentation errors to detection outputs. Clinical implementation should incorporate segmentation confidence scoring to trigger manual review when connectivity certainty falls below validated thresholds.

Limitations and Future Directions. First, the study utilized 2D X-ray angiography with inherent projection limitations. While temporal sampling strategies mitigated occlusion artifacts, volumetric quantification remains constrained by foreshortening effects. Integration of multi-view fusion or 3D modalities such as CTA or IVUS could resolve geometric ambiguities. Second, validation was conducted on a single primary institution supplemented by public datasets. Broader multi-site validation across diverse imaging protocols and pathological presentations, including chronic total occlusions and heavily calcified lesions, is necessary for universal deployment. Third, the RL agent assumes a single dominant stenosis per segment; extension to tandem lesions or diffuse disease requires modification of action spaces and reward functions.
Future work will focus on multi-view fusion, multimodal integration with IVUS, OCT, or FFR, and prospective clinical validation comparing automated analysis with expert interpretation in real-time clinical workflows.

6 Conclusion

This study presented a hierarchical framework for automated coronary angiography analysis that integrates topologically-constrained segmentation with RL-based stenosis detection. The core contribution addresses a fundamental challenge in adapting general-purpose foundation models to medical imaging domains: the Semantic-Topological Gap, wherein models trained on pixel-level objectives recognize vascular structures semantically but fail to preserve their geometric continuity. By incorporating DPO[20] to enforce vascular connectivity constraints during segmentation training, the framework demonstrates that anatomical validity, specifically topological integrity, is a prerequisite for reliable automated diagnosis, and that DPO provides a viable mechanism to inject domain-specific structural priors into foundation models without sacrificing their semantic understanding.

The methodology represents a conceptual transfer of alignment techniques from natural language processing to geometric medical image analysis. Just as DPO aligns language models with human conversational preferences, our approach aligns vision models with anatomical structural principles. The resulting topologically consistent vessel representations enable more effective management of false positive detections through a reasoning agent equipped with a rejection mechanism for ambiguous cases. By achieving specificity of 0.872 while maintaining sensitivity of 0.836 across stenosis severity grades, the system addresses a key barrier to clinical adoption: the high false positive burden that characterizes purely geometric detection methods and contributes to alert fatigue in automated diagnostic systems.
The empirical findings validate a critical premise: scaling model capacity alone, as exemplified by foundation models like MedSAM3[13], does not resolve domain-specific structural constraints. Despite its massive scale, MedSAM3 achieved a clDice of only 0.8089, demonstrating that generic pretraining yields diminishing returns for topological precision. The statistically significant superiority of ARIADNE, evidenced by a clDice of 0.8378 (p < 0.05), demonstrates that geometric priors must be explicitly encoded through appropriate alignment objectives. This insight has broad implications for medical imaging informatics: as the field increasingly adopts foundation models, success will depend not merely on model scale but on principled strategies for incorporating clinical domain knowledge into optimization frameworks.

The computational efficiency demonstrated through targeted training on anatomically challenging cases, constituting 20.8% of the dataset, suggests feasibility for resource-constrained deployment scenarios across institutions with varying data availability. While extension to multi-view analysis and integration with complementary imaging modalities will be necessary to address projection ambiguities inherent in 2D angiography, the current results establish a methodological foundation for developing automated analysis systems in domains where structural consistency is critical for clinical interpretation.

This work demonstrates that bridging the gap between passive image archival and automated diagnostic insight requires more than advanced pattern recognition; it demands explicit alignment of computational models with the anatomical and physiological principles that govern clinical decision-making.
The proposed framework contributes toward the development of automated systems capable of functioning as reliable decision support tools within interventional cardiology workflows, transforming the traditional informatics paradigm from retrospective storage to prospective clinical intelligence. By establishing that topological validity can be learned and transferred through preference optimization, this study provides a pathway for adapting general-purpose vision foundation models to safety-critical medical applications where geometric integrity is non-negotiable.

Declarations

Funding
This work is supported by the Qingdao Natural Science Foundation (No. 23-2-1-158-zyyd-jch), and the Fundamental Research Funds for the Central Universities (No. 202562003).

Competing Interests
The authors declare no competing interests.

Data Availability
The code for this project is available at https://github.com/qimingfan10/ARIADNE. The datasets used during the current study are available from the corresponding author on reasonable request.

Author Contributions
Zhan Jin: Conceptualization, Methodology, Software, Formal analysis, Writing - original draft. Yu Luo: Conceptualization, Methodology, Software, Validation (main experiments), Supervision, Writing - review & editing. Yizhou Zhang: Project administration, Software, Validation (comparative experiments), Formal analysis. Ziyang Cui: Software, Validation (comparative experiments), Data curation. Yuqing Wei: Data curation (annotation), Visualization (figures). Xianchao Liu: Data curation (annotation). Xueying Zeng: Supervision, Funding acquisition, Writing - review & editing. Qing Zhang: Supervision, Resources, Writing - review & editing. All authors read and approved the final manuscript.

Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Consent for Publication
The authors affirm that human research participants provided informed consent for publication of the images in the figures.

References
[1] GBD 2023 Disease and Injury and Risk Factor Collaborators: Burden of 375 diseases and injuries, risk-attributable burden of 88 risk factors, and healthy life expectancy in 204 countries and territories, including 660 subnational locations, 1990–2023: a systematic analysis for the global burden of disease study 2023. The Lancet 406(10513), 1873–1922 (2025) https://doi.org/10.1016/S0140-6736(25)01637-X
[2] Lawton, J.S., Tamis-Holland, J.E., Bangalore, S., Bates, E.R., Beckie, T.M., Bischoff, J.M., Bittl, J.A., Cohen, M.G., DiMaio, J.M., Don, C.W., Fremes, S.E., Gaudino, M.F., Goldberger, Z.D., Grant, M.C., Jaswal, J.B., Kurlansky, P.A., Mehran, R., Metkus, T.S., Nnacheta, L.C., Rao, S.V., Sellke, F.W., Sharma, G., Yong, C.M., Zwischenberger, B.A.: 2021 ACC/AHA/SCAI guideline for coronary artery revascularization. JACC 79(2), 21–129 (2022) https://doi.org/10.1016/j.jacc.2021.09.006
[3] Ramos-Cortez, J.S., Alvarado-Carrillo, D.E., Ovalle-Magallanes, E., Avina-Cervantes, J.G.: Lightweight U-Net for blood vessels segmentation in X-Ray coronary angiography. Journal of Imaging 11(4), 106 (2025)
[4] Menezes, M.N., Lourenço-Silva, J., Silva, B., Rodrigues, T., Francisco, A.R.G., Ferreira, P.C., Oliveira, A.L., Pinto, F.J.: Development of deep learning segmentation models for coronary X-ray angiography: Quality assessment by a new global segmentation score and comparison with human performance. Revista Portuguesa de Cardiologia 41(12), 1011–1021 (2022) https://doi.org/10.1016/j.repc.2022.04.001
[5] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), p. 234–241 (2015). Springer
[6] Li, S., Fan, Y.: Coronary artery segmentation in X-ray angiography based on deep learning approach. In: 2024 43rd Chinese Control Conference (CCC), p. 7345–7350 (2024). IEEE
[7] Wang, L., Yang, X.-f., Wang, Q.-j., Xu, L.-s.: Two-stage U-net coronary artery segmentation based on CTA images. Journal of Northeastern University (Natural Science) 43(6), 792 (2022)
[8] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: Transformers make strong encoders for medical image segmentation. CoRR abs/2102.04306 (2021)
[9] Milletari, F., Navab, N., Ahmadi, S.-A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), p. 565–571 (2016). https://doi.org/10.1109/3DV.2016.79
[10] Shit, S., Paetzold, J.C., Sekuboyina, A., Ezhov, I., Unger, A., Zhylka, A., Pluim, J.P.W., Bauer, U., Menze, B.H.: clDice - a novel topology-preserving loss function for tubular structure segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 16560–16569 (2021)
[11] Chang, S.-S., Lin, C.-T., Wang, W.-C., Hsu, K.-C., Wu, Y.-L., Liu, C.-H., Fann, Y.C.: Optimizing ensemble U-Net architectures for robust coronary vessel segmentation in angiographic images. Scientific Reports 14(1), 6640 (2024)
[12] Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.-H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Siyuan, L., Kamath, A., Cheng, H.K., Dollár, P., Ravi, N., Saenko, K., Zhang, P., Feichtenhofer, C.: SAM 3: Segment Anything with Concepts (2025). https://arxiv.org/abs/2511.16719
[13] Liu, A., Xue, R., Cao, X.R., Shen, Y., Lu, Y., Li, X., Chen, Q., Chen, J.: MedSAM3: Delving into Segment Anything with Medical Concepts (2025). https://arxiv.org/abs/2511.19046
[14] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized Intersection Over Union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 658–666 (2019)
[15] Liu, X., Wang, X., Chen, D., Zhang, H.: Automatic quantitative coronary analysis based on deep learning. Applied Sciences 13(5), 2975 (2023) https://doi.org/10.3390/app13052975
[16] Huang, B., Luo, Y., Wei, G., He, S., Shao, Y., Zeng, X., Zhang, Q.: Deep learning model for coronary artery segmentation and quantitative stenosis detection in angiographic images. Medical Physics 52(7), 17970 (2025) https://doi.org/10.1002/mp.17970
[17] Hannink, J., Duits, R., Bekkers, E.: Vesselness via multiple scale orientation scores. arXiv preprint arXiv:1402.4963 (2014)
[18] Yang, H., Zhen, X., Chi, Y., Zhang, L., Hua, X.-S.: CPR-GCN: Conditional partial-residual graph convolutional network in automated anatomical labeling of coronary arteries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
[19] Díaz-Gaxiola, E., Yee-Rendon, A., Vega-Lopez, I.F., Campos-Leal, J.A., García-Aguilar, I., López-Rubio, E., Luque-Baena, R.M.: Experimental assessment of YOLO variants for coronary artery disease segmentation from angiograms. Electronics 14(13) (2025) https://doi.org/10.3390/electronics14132683
[20] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct Preference Optimization: Your language model is secretly a reward model. In: Advances in Neural Information Processing Systems, vol. 36, p. 53728–53741. Curran Associates, Inc. (2023)
[21] Schulz, V.H.: Book reviews. SIAM Review 63(2), 419–431 (2021) https://doi.org/10.1137/21N975254
[22] Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion model alignment using direct preference optimization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 8228–8238 (2024)
[23] Konwer, A., Yang, Z., Bas, E., Xiao, C., Prasanna, P., Bhatia, P., Kass-Hout, T.: Enhancing SAM with efficient prompting and preference optimization for semi-supervised medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 20990–21000 (2025)
[24] Chow, C.K.: On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory 16(1), 41–46 (2003)
[25] Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., et al.: Sa2va: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001 (2025)
[26] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 24185–24198 (2024)
[27] Cai, Z., Cao, M., Chen, H., Chen, K., et al.: InternLM2 technical report. CoRR abs/2403.17297 (2024) https://doi.org/10.48550/arXiv.2403.17297
[28] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Chen, W.: LoRA: Low-rank adaptation of large language models. CoRR abs/2106.09685 (2021)
[29] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: The Thirteenth International Conference on Learning Representations (ICLR) (2025). https://openreview.net/forum?id=Ha6RTeWMd0
[30] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR abs/1707.06347 (2017)
[31] Popov, M., Amanturdieva, A., Zhaksylyk, N., Alkanov, A., Saniyazbekov, A., Aimyshev, T., Ismailov, E., Bulegenov, A., Kolesnikov, A., Kulanbayeva, A., Kuzhukeyev, A., Sakhov, O., Kalzhanov, A., Temenov, N., Fazli, S.: ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset. Zenodo. Version COCO (2023). https://doi.org/10.5281/zenodo.10390295
[32] Du, T., Xie, L., Zhang, H., Liu, X., Wang, X., Chen, D., Xu, Y., Sun, Z., Zhou, W., Song, L., Guan, C., Lansky, A.J., Xu, B.: Training and validation of a deep learning architecture for the automatic analysis of coronary angiography. EuroIntervention 17(1), 32–40 (2021) https://doi.org/10.4244/EIJ-D-20-00570
[33] Nikolov, S., Blackwell, S., Mendes, R., De Fauw, J., Meyer, C., Hughes, C., Askham, H., Romera-Paredes, B., Karthikesalingam, A., Chu, C., Carnell, D., Boon, C., D’Souza, D., Moinuddin, S.A., Sullivan, K., DeepMind Radiographer Consortium, Montgomery, H., Rees, G., Sharma, R., Suleyman, M., Back, T., Ledsam, J.R., Ronneberger, O.: Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy. CoRR abs/1809.04430 (2018)
[34] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, p. 3–11. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00889-5_1
[35] Bai, H., Ma, Z., Gao, C., Zhu, J.: SVSNet: Scleral vessel segmentation with a CNN-Transformer hybrid network. Journal of Innovative Optical Health Sciences 18(6), 1 (2025) https://doi.org/10.1142/S1793545825500178
[36] Wei, G., Zeng, X., Zhang, Q.: FlowVM-Net: Enhanced vessel segmentation in X-Ray coronary angiography using temporal information fusion. Journal of Imaging Informatics in Medicine (2025) https://doi.org/10.1007/s10278-025-01732-y
[37] Lin, H., Liu, T., Katsaggelos, A., Kline, A.: StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography (2023). https://arxiv.org/abs/2310.14961
[38] Li, J., Tang, X., Wang, X.: LT-YOLO: Long-term temporal enhanced YOLO for stenosis detection on invasive coronary angiography. Frontiers in Molecular Biosciences 12, 1558495 (2025) https://doi.org/10.3389/fmolb.2025.1558495. PMID: 40242408
[39] Liu, W., Yang, H., Tian, T., Cao, Z., Pan, X., Xu, W., Jin, Y., Gao, F.: Full-resolution network and dual-threshold iteration for retinal vessel and coronary angiograph segmentation. IEEE Journal of Biomedical and Health Informatics 26(9), 4623–4634 (2022) https://doi.org/10.1109/JBHI.2022.3188710
[40] Wu, R., Liu, Y., Liang, P., Chang, Q.: H-vmunet: High-order vision mamba UNet for medical image segmentation. Neurocomputing 624, 129447 (2025) https://doi.org/10.1016/j.neucom.2025.129447
[41] Tian, Y., Fu, L., Fang, W., Li, T.: FR-UNet: A feature restoration-based UNet for seismic data consecutively missing trace interpolation. IEEE Transactions on Geoscience and Remote Sensing 63, 1–10 (2025) https://doi.org/10.1109/TGRS.2025.3531934