
Paper deep dive

EndoSERV: A Vision-based Endoluminal Robot Navigation System

Junyang Wu, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

Year: 2026 · Venue: arXiv preprint · Area: cs.RO · Type: Preprint · Embeddings: 75

Abstract

Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.

Tags

ai-safety (imported, 100%) · csro (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/13/2026, 12:44:06 AM

Summary

EndoSERV is a vision-based endoluminal robot navigation system that addresses challenges in endoscopic localization, such as tissue deformation, in vivo artifacts, and lack of distinctive landmarks. It utilizes a 'Segment-to-Structure' divide-and-conquer strategy and 'Real-to-Virtual' mapping to enable accurate pose estimation without requiring real-world pose labels, leveraging pre-operative CT/MRI data as a prior.

Entities (5)

EndoSERV · navigation-system · 100%
Real-to-Virtual mapping · methodology-component · 98%
Segment-to-Structure · methodology-component · 98%
Neural Implicit-based SLAM · navigation-technique · 95%
SfM · navigation-technique · 95%

Relation Signals (3)

EndoSERV comprises Segment-to-Structure

confidence 100% · It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping

EndoSERV comprises Real-to-Virtual mapping

confidence 100% · It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping

EndoSERV addresses Endoluminal navigation challenges

confidence 95% · This paper presents a novel EndoSERV localization method to address these challenges.

Cypher Suggestions (2)

Find all components of the EndoSERV system · confidence 90% · unvalidated

MATCH (n:System {name: 'EndoSERV'})-[:COMPRISES]->(component) RETURN component.name

Identify navigation techniques compared in the paper · confidence 85% · unvalidated

MATCH (n:Technique) WHERE n.name IN ['SfM', 'Neural Implicit-based SLAM', 'Real-Virtual Alignment'] RETURN n

Full Text

74,680 characters extracted from source content.


IEEE TRANSACTIONS ON ROBOTICS

EndoSERV: A Vision-based Endoluminal Robot Navigation System

Junyang Wu, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Jiayuan Sun, Yun Gu, Member, IEEE, and Guang-Zhong Yang, Fellow, IEEE

Abstract—Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.

I. INTRODUCTION

Endoluminal intervention is an effective tool for integrated diagnosis and treatment of early luminal cancers. It is increasingly used for digestive, pulmonary, urinary, and gynecologic tract diseases, offering a minimally invasive, one-stop-shop alternative to surgical procedures. The emergence of endoluminal robotics platforms further enhances the safety, consistency, maneuverability, and accuracy of endoscopic navigation, allowing small lesions to be targeted and dissected accurately.
For deep, complex, tortuous lumens, however, accurate navigation in a maze-like internal structure is a major challenge, which is further hampered by the limited endoscopic field-of-view (FoV), in vivo artifacts such as blood, mucus, and motion blur, as well as constant tissue deformation and a lack of distinctive feature landmarks.

Existing endoluminal navigation techniques can be broadly categorized as electromagnetic and shape-sensing based navigation, and pure vision-based navigation methods. Electromagnetic and shape-sensing based navigation depends on specialized external tracking devices. Pure vision-based approaches, on the other hand, leverage real-time endoscopic images for guidance, offering a cost-effective and more flexible solution by eliminating the need for additional hardware that may complicate the existing surgical workflow.

Junyang Wu, Minghui Zhang, Hanxiao Zhang, Yun Gu and Guang-Zhong Yang are with the Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China, 200240. (Email: yungu@ieee.org, gzyang@sjtu.edu.cn) Fangfang Xie and Jiayuan Sun are with the Shanghai Chest Hospital, Shanghai, China.

Fig. 1. (a). Scale ambiguity in monocular SLAM: During the testing phase, the results of monocular SLAM require alignment with ground-truth trajectories, which is impractical in clinical applications due to the lack of absolute scale information. (b). Appearance similarity in endoscopic images: Different bronchial branches often exhibit nearly identical geometric and topological structures, which can lead to ambiguous feature matching and incorrect associations with virtual frames.

Existing vision-based methods can be classified into three main categories: Structure-from-Motion (SfM), Neural Implicit-based SLAM, and Real-Virtual Alignment techniques.

SfM: SfM methods use frame-to-frame correlations to estimate camera motion, leveraging appearance differences to optimize the pose estimation. By warping the previous frame to the next frame, the difference in appearance can be used as a constraint to optimize the camera pose. Previous studies, such as [1]–[4], have explored SfM techniques in endoscopic scenarios. Cui et al. [5], [6] introduced fine-tuning on top of Depth-Anything [7], generating improved depth estimation results. However, for the monocular localization task, the scale information is limited. These methods need scale alignment with ground truth during the testing phase, which is not acceptable in clinical scenarios.

Neural Implicit-Based SLAM: Neural implicit-based SLAM is a promising approach that combines neural implicit networks with SLAM for camera pose estimation and mapping. Sucar et al. [8] and Johari et al. [9] combined neural representation and SLAM to provide odometry estimation and 3D reconstruction. Shan et al. [10] and Wang et al. [11] proposed dense SLAM systems using neural representations in endoscopic settings. However, methods based on neural implicit networks are time-consuming to render, making real-time navigation difficult. In addition, because these methods require high consistency among images within a scene, their performance is suboptimal in clinical scenarios where artifacts, tissue deformation and inter-reflections are abundant.
Real-Virtual Alignment: 3D models reconstructed from pre-operative images such as CT and MRI can provide virtual images and corresponding poses, offering a strong prior for localization. Aligning real and virtual images enables the use of this pre-operative information for localization [12], [13]. Zhu et al. [14] and Park et al. [15] proposed image-to-image translation methods that can bridge the gap between real and virtual domains. Shen et al. [16], Gu et al. [17], and Luo et al. [18], [19] aligned the virtual and real domains at the feature level, measuring the similarity between real images and virtual images. The measurement can then be used to perform real/virtual registration.

arXiv:2603.08324v1 [cs.RO] 9 Mar 2026

Fig. 2. Motivations of this work. Segment-to-Structure: Due to the complex and long-range structure of the luminal path, a divide-and-conquer strategy is proposed for long-term pose estimation. Real-to-Virtual: Due to the lack of real pose labels in clinical scenarios, in this work, pre-operative CT data are used as the structure prior for intra-operative odometry estimation.

Despite promising results on simple datasets, existing endoscopic localization methods often exhibit challenges when applied to real-world clinical data. As shown in Fig. 1 (a), Structure-from-Motion (SfM) and neural implicit-based SLAM approaches estimate relative poses and lack access to absolute scale in monocular endoscopic settings, necessitating post-hoc alignment with ground-truth trajectories. Such alignment procedures are infeasible in practical clinical deployments, where ground truth is unavailable.
Moreover, the accumulation of errors over time in relative pose estimation leads to drift, further compromising navigation accuracy. While real-virtual alignment methods can estimate absolute poses by leveraging preoperative virtual images, their performance remains limited in clinical settings due to the inherent challenges of endoscopic images. Specifically, as shown in Fig. 1 (b), clinical endoscopic scenes often exhibit sparse textures, low contrast, and high visual similarity across different regions, resulting in localization ambiguities and disorientation during navigation.

To address these challenges, we propose a novel EndoSERV localization system, which includes two parts, i.e., SEgment-to-Structure and Real-to-Virtual mapping. As illustrated in Fig. 2, the system adaptively divides the luminal pathway into manageable sub-segments, enabling independent analysis within each segment. Each segment is unique and does not confuse the localization system. In each sub-segment, a style transfer module and a deformation refinement module are used to adapt real images to the virtual domain, and an odometry trainer refines pose estimation using virtual labels to improve accuracy, while reducing the need for real labels. Since the odometry network is trained in the virtual domain, which has absolute scale information, the scale ambiguity can be resolved.

The training pipeline comprises two key phases: an offline pretraining phase and an online training phase. During the offline phase, a texture-agnostic pose encoder is pretrained along with a foundational style transfer model. The encoder incorporates texture-diverse augmentations and texture-agnostic alignment strategies to ensure robustness across various textures. For the online phase, we optimize the localization module at test time. A feature retrieval step is used to identify a virtual buffer of frames relevant to the current real-world scenario.
These frames, combined with real buffers, are used to efficiently fine-tune the localization system, ensuring real-time adaptability. To further enhance performance under distortions and deformations in real-world scenarios, we propose an augmentation-then-recovery strategy for reconstructing virtual images from augmented real images. By aligning real and virtual data, our approach facilitates transfer to the virtual domain, enabling the scene coordinate network trained in the virtual domain to accurately estimate the camera pose in the real domain. In summary, EndoSERV enables endoscopic localization without requiring any real-world labels and demonstrates effectiveness compared to the current state-of-the-art on both public and clinical data.

II. RELATED WORK

Endoscopic navigation is a challenging task, primarily due to the absence of real-world labels for supervision. This section provides a detailed overview of existing work based on SfM, Neural Implicit-based SLAM, and Real-Virtual Alignment and related studies.

A. SfM-based method

SfM is one of the earliest and most widely applied self-supervised techniques for endoscopic navigation. It builds motion correlations between consecutive frames and optimizes camera poses using appearance differences as supervisory signals. EndoSLAM [1] is a seminal contribution that released large-scale datasets and formed the foundation for subsequent research. Liu et al. [2] further extended the application of SfM by training a convolutional neural network (CNN) with disparity supervision signals derived from conventional SfM algorithms, improving pose estimation accuracy. In addition, AF-SfMLearner [3] proposed an appearance module that mitigates brightness inconsistencies between adjacent frames, addressing one of the key challenges in SfM-based methods. TCL [4] extended this idea by utilizing image triplets to increase dataset diversity and reduce appearance inconsistencies in triplet images.
With the rise of foundation models, SurgicalDINO [5] and EndoDAC [6] fine-tuned depth estimation models like Depth Anything to improve depth prediction performance. However, although SfM-based methods can achieve real-time inference, they suffer from a lack of scale information and are unable to estimate absolute camera poses in real-world scenarios, limiting their applicability in clinical environments.

B. Neural Implicit-based SLAM

Neural implicit-based SLAM is a promising alternative to traditional SfM-based methods. This approach utilizes implicit neural representations for camera pose estimation and scene mapping. iMap [8] was the first to integrate neural representations into the SLAM framework and demonstrated promising results. iNeRF [20] employed a pretrained NeRF model to recover camera poses with greater accuracy. Subsequent works, such as eSLAM [9] and COSLAM [21], enhanced NeRF-based SLAM systems by implementing sparse parametric scene representations, thereby improving scalability and robustness. NICER-SLAM [22] and DenseNeRF-SLAM [23] alleviated the dependence on RGBD inputs, enabling SLAM systems to operate using only RGB images. In the context of endoscopy, dense SLAM systems based on neural representations, such as eNeRF [10] and EndoGSLAM [11], have demonstrated significant improvements in both mapping accuracy and pose estimation. Despite their promise, the computational cost of rendering images using NeRF models has prevented many of these systems from achieving real-time performance. Furthermore, the complex artifacts in endoscopic scenarios pose challenges to the application of NeRF-based methods in clinical settings.

C. Real-Virtual Alignment

In contrast to self-supervised methods, Real-Virtual Alignment bridges the gap between the intra-operative and pre-operative domains, taking advantage of pre-operative images such as CT and MRI as a prior for localization [12], [13].
Image-to-image translation is an efficient approach for Real-Virtual Alignment: it transforms images from a source domain to a target domain. pix2pix [24] introduced a paired image translation method for this task. However, due to the scarcity of paired data, CycleGAN [14] enabled unpaired image translation using cycle consistency. CUT [15] introduced a contrastive learning strategy that replaced the cycle-consistency constraint and achieved high-quality image translation. Similarly, MI2GAN [25] addressed the content distortion issue by proposing a disentangling strategy that preserves content information while translating textures. With the rapid development of diffusion-based models, image translation methods based on diffusion networks have become viable alternatives to GANs. The Denoising Diffusion Probabilistic Model (DDPM) [26] demonstrated the capability to progressively transform Gaussian noise into coherent signals. Subsequent investigations have further explored image translation tasks. UNIT-DDPM [27] generated target-domain images through a denoising Markov Chain Monte Carlo approach conditioned on the source images. EGSDE [28] trained an assist function on both source and target domains and used it to assist energy-guided stochastic differential equations for realistic image generation. UNSB [29] improved on the simple Gaussian prior assumption by modeling a sequence of adversarial learning problems and achieved remarkable performance.

In endoscopic navigation, Mahmood et al. [30] initiated the translation of real images into the virtual domain to enhance sparse medical datasets. Islam et al. [31] improved the approach by using a cycle consistency loss and incorporating two discriminators to remove specular highlights in the virtual domain.
CLTS-GAN [32] further refined techniques for controlling color, lighting, and texture in real endoscopic images, thereby accurately generating illumination information for real-world applications. Feature-level registration is another way to align real and virtual domains. It maps the images to a unified space and measures the similarity between virtual and real images. Shen et al. [16] mapped all images to the depth domain, matching the depth maps between virtual and real images. Luo et al. [18], [19] proposed constrained evolutionary stochastic filtering, extracting stable features in the real domain and combining features between virtual and real images. Zhu et al. [33] combined NeRF SLAM and domain adaptation: virtual images were transferred to the real domain, and the 3D models were represented as a neural radiance field using the generated real images and corresponding virtual poses. Subsequently, the camera pose was optimized to match the images rendered through NeRF against the real images.

III. SYSTEM OVERVIEW

A. Sliding Window Buffering

The endoluminal pathway is a highly complex and maze-like structure characterized by anatomically similar features across various regions. This inherent similarity among anatomical landmarks poses significant challenges for navigation algorithms, which can easily become misled, causing the system to lose direction during the navigation process. To mitigate these challenges, we introduce a divide-and-conquer approach that utilizes sliding windows to localize within the endoluminal pathway. By partitioning the entire pathway into smaller sub-segments, each can be processed independently for greater precision and efficiency. As shown in Fig. 3 (a), the system alternates between training and testing phases. We maintain a confidence buffer to adaptively switch between these two phases. In the training phase, a training buffer containing a few local samples is used to rapidly fine-tune the system for the current sub-segment.
The system then transitions to the testing phase, performing pose estimation. This continues until the confidence buffer detects a significant drop in prediction confidence. Such a decrease indicates that the module requires optimization to adapt to the new data. This triggers the system to automatically revert to the training phase, constructing a new training buffer for the next sub-segment. This adaptive approach ensures the system maintains both accuracy and reliability throughout the navigation process, optimizing the balance between real-time efficiency and model performance.

B. Subsegment system pipeline

Fig. 3 (b) illustrates the detailed system pipeline, which is divided into offline training, online training, and testing phases.

Fig. 3. System overview of EndoSERV. (a). A sliding-window strategy for long-term pose estimation. Black windows denote the training images, while pink windows represent the testing images. The sliding windows move along the temporal axis, enabling the system to alternate between training buffers and testing buffers. (b). The detailed pipeline within a sub-segment, consisting of offline training, online training, and testing
Specifically, offline training is conducted during the pre-operative stage, while the online training and testing phases are conducted in the intra-operative phase. 1) Offline Training: The pipeline begins with Robust En- coder Pretraining (see Section IV-A), which trains a texture- agnostic pose encoder capable of extracting robust features. This is followed by Transfer Model Pretraining (see Section IV-A), where a style transfer model is trained using unpaired datasets to effectively handle domain gap. During the Virtual Database Preparation step, a comprehensive virtual database is constructed by collecting all virtual images. Features from these images are extracted using R2Former [34] and stored in a virtual feature database to support the following steps. 2) Online Training: In the online training phase, the system adapts the offline-trained models to real-world conditions. First, Virtual Buffer Retrieval (see Section IV-B1) narrows the virtual database by retrieving regions that are more closely aligned with the distribution of the real images. This step ensures that the virtual images in the virtual buffer are contextually similar to real-world images encountered during navigation. The virtual buffer is then used to fine-tune the trans- fer model (see Section IV-B2), improving its robustness in handling real-world scenarios. Subsequently, the system applies Deformation Refinement (see Section IV-B3), which employs an augmentation-then-recovery strategy. This strategy enhances the ability of the system to overcome distortions and deformations by augmenting the training data and recovering original virtual data. Once real and virtual images are aligned using the refined transfer model, the system trains a Scene Coordinate Estimation (see Section IV-B4) model that accu- rately predicts camera poses based on virtual image features. 
3) Testing Phase: During testing, the trained transfer model and scene coordinate estimation network are deployed on real images to estimate the camera pose. The system incorporates a confidence estimation step to evaluate the reliability of each predicted pose. If the confidence is high, testing continues seamlessly. In cases where the confidence drops significantly, the system reverts to the training stage to update the models.

IV. TRAINING PIPELINE

A. Offline Training: Extracting Robust Features

The offline training pipeline, shown in Fig. 4 (a), is designed to extract texture-agnostic features for robust pose estimation. To achieve this, we propose a pre-training framework that incorporates two key components: diverse texture generation and feature alignment. The key insight for diverse texture generation is to generate images with stable structures but varied textures. To accomplish this, we leverage a pre-trained diffusion model [35] to generate augmented images, which preserve the structure but introduce texture variations. To generate semantically meaningful prompts, we first leverage the advanced capabilities of the current large language model, GPT-4o [36]. We pose the following question: "Please describe the color and texture characteristics of endoscopic images." In response, GPT-4o generates a prompt set $P = \{p_1, p_2, \dots, p_k\}$. For the input image $I_i$, the generation step can be formulated as $I^g_{ij} = G(I_i, p_j)$, where $G$ is the pretrained diffusion model and $p_j$ is a random prompt from the prompt set. For the input virtual image $I_i$ and its augmentations $\{I^g_{ij}\}_{j=0,\dots,k}$, the feature encoder extracts the corresponding features $F_i \in \mathbb{R}^{H\times W\times C}$ and $\{F_j \in \mathbb{R}^{H\times W\times C}\}_{j=0,\dots,k}$. To align these features into a unified feature space, we employ two loss functions: similarity loss and contrastive loss.
The similarity loss measures the cosine similarity between the feature representation of the virtual image and that of the augmented images, with the aim of minimizing this distance:

$$\mathcal{L}_{sim} = \frac{1}{k \times H \times W}\sum_{j,h,w}\left(1 - \frac{F_{i,h,w}\cdot F_{j,h,w}}{\|F_{i,h,w}\|\,\|F_{j,h,w}\|}\right) \qquad (1)$$

where $(h, w)$ is the corresponding feature location. The contrastive triplet loss, on the other hand, encourages the network to minimize the distance between pairs of images with the same content but different textures, while maximizing the distance between pairs with identical textures but differing content. We sample a negative pair from another virtual image $I_n$, with the feature $F_n$ located at a different position. The contrastive triplet loss is then formulated as:

$$\mathcal{L}_{triplet} = \max\big(d(F_i, F_j) - d(F_i, F_n) + \tau,\ 0\big) \qquad (2)$$

where $d(\cdot)$ is the cosine distance, and $\tau$ is the margin value. After feature extraction, a scene coordinate head $H$ is employed to estimate the scene coordinates of the image $I_i$. Overall, during the offline training phase, the objective function is:

$$\mathcal{L}_{offline} = \mathcal{L}_{sim} + \mathcal{L}_{triplet} + \mathcal{L}_{proj}(H, E, I_i, p^*) + \frac{1}{k}\sum_j \mathcal{L}_{proj}(H, E, I_j, p^*) \qquad (3)$$

where $\mathcal{L}_{proj}$ is the reprojection loss, which will be discussed in the following section, $p^*$ is the virtual pose ground truth, and $E$ is the feature encoder. Additionally, we pretrain a style transfer model using unpaired data to handle the domain gap between real and virtual domains.

Fig. 4. Training pipeline overview. (a). Offline training pipeline. Virtual images are augmented using the pretrained diffusion model, generating texture-diverse augmented images. A novel aligner is designed to constrain the feature encoder to extract features into a unified feature space. A scene coordinate head is designed to generate the scene coordinate map. (b). All virtual images are first compressed into a virtual buffer during a retrieval process, which is used to fine-tune the transfer model quickly together with the real buffer. An augmentation-then-recovery strategy is proposed to address the distortion and deformation issue. After aligning everything to the virtual domain, a scene coordinate head is trained to estimate the camera pose.

B. Online Training: Adaptation to Real-World Scenarios

While offline pretraining bridges the virtual and real domains, it may encounter limitations in handling real-world image deformations and artifacts. To address these challenges, we introduce a novel online training module designed to dynamically fine-tune the network and achieve reliable odometry estimation. This module focuses on rapid adaptation to environmental variations and refinement of distortion and deformation.

1) Virtual Buffer Retrieval: The virtual database used for offline training is inherently large and time-consuming to search. Instead of relying on this exhaustive database during online training, we introduce a retrieval algorithm to facilitate more targeted and efficient training. By matching the real images within the buffer with relevant data from the virtual database, we significantly reduce the size of the database that needs to be considered during training.
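As a concrete illustration, Eqs. (1) and (2) can be prototyped in plain NumPy. This is only a sketch of the loss definitions: the nested loops mirror the per-location sum in Eq. (1) for clarity, whereas a real implementation would operate on learned encoder features and be vectorized; the margin value is an assumed default.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_loss(F_i, F_aug):
    """Eq. (1): mean (1 - cosine) between the virtual-image feature map
    F_i of shape (H, W, C) and each augmented map in F_aug of shape
    (k, H, W, C)."""
    k, H, W, _ = F_aug.shape
    total = 0.0
    for j in range(k):
        for h in range(H):
            for w in range(W):
                total += 1.0 - cosine(F_i[h, w], F_aug[j, h, w])
    return total / (k * H * W)

def triplet_loss(f_i, f_j, f_n, margin=0.2):
    """Eq. (2): hinge on cosine distances d = 1 - cos.  f_j shares content
    with f_i (positive); f_n comes from another virtual image at a
    different position (negative).  margin plays the role of tau."""
    d = lambda a, b: 1.0 - cosine(a, b)
    return max(d(f_i, f_j) - d(f_i, f_n) + margin, 0.0)
```

A feature map identical to all its augmentations yields a similarity loss of zero, and the triplet term vanishes once the positive pair is closer than the negative pair by at least the margin.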
Specifically, for a training buffer $I_t = \{I_1, I_2, \dots, I_T\}$ and virtual database $\{I^v_k\}_{k=0,1,\dots,K}$, we perform a feature retrieval operation using R2Former [34] as the feature extractor against the entire virtual database. This retrieval yields a list of indices $Idx = \{idx_1, idx_2, \dots, idx_T\}$, indicating that for a given real training image $I_i$, the most similar image from the virtual database is $I^v_{idx_i}$. After that, we define a Retrieval Hit Score $S_k$ for each virtual image $I^v_k$, representing its retrieval hit count:

$$S_k = \sum_{i=1}^{T} \mathbb{1}[k = idx_i] \qquad (4)$$

where $\mathbb{1}[\cdot]$ is the indicator function that is 1 if the virtual image $I^v_k$ appears in the retrieval indices $Idx$, and 0 otherwise. Furthermore, we define a contiguous range of $R$ virtual images in the virtual database that maximizes the total Retrieval Hit Score from the training buffer. For a contiguous subrange of $R$ virtual images, the range score can be defined as:

$$S_{range}(k, k+R) = \sum_{j=k}^{k+R} S_j \qquad (5)$$

We then select the subrange of $R$ consecutive virtual images that maximizes the total retrieval hits:

$$(k^*, k^* + R) = \arg\max_k S_{range}(k, k+R) \qquad (6)$$

where $(k^*, k^*+R)$ represents the optimal contiguous subrange of $R$ virtual images in the virtual database, which corresponds to the virtual buffer. This subrange is selected because it contains the highest concentration of retrieval hits from the training buffer images, ensuring that the virtual buffer is the most representative subset of the virtual database with respect to the current training buffer.

Fig. 5. DDAug framework. The real image is generated from the virtual image using the pretrained transfer model. Three augmentations are applied: color jitter, mixup with the noisy image, and camera parameter perturbation.
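Eqs. (4)–(6) amount to a histogram over retrieval indices followed by a sliding-window maximum. A small sketch, assuming (as one reading of Eq. (5)) a half-open window of R consecutive virtual frames:

```python
import numpy as np

def retrieval_hit_scores(idx, K):
    """Eq. (4): S_k = number of training-buffer images whose nearest
    virtual frame (by feature retrieval) is virtual image k."""
    S = np.zeros(K, dtype=int)
    for k in idx:
        S[k] += 1
    return S

def best_virtual_buffer(idx, K, R):
    """Eqs. (5)-(6): the contiguous range of R virtual frames maximizing
    the summed hit score, found with an O(K) sliding-window sum."""
    S = retrieval_hit_scores(idx, K)
    window = int(S[:R].sum())
    best, best_k = window, 0
    for k in range(1, K - R + 1):
        window += S[k + R - 1] - S[k - 1]   # slide the window by one frame
        if window > best:
            best, best_k = window, k
    return best_k, best_k + R               # the virtual buffer [k*, k* + R)
```

For example, with retrieval indices clustered around frames 2–3, the selected window covers that region, which is exactly the "most representative subset" behaviour described above.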
2) Transfer Model Fine-tuning: Upon selecting the virtual buffer, the next step involves fine-tuning the transfer model with both the real and virtual buffers. During the offline training phase, the transfer model has already established a correspondence between the real and virtual domains. At this stage, only the specific characteristics of the current environment need to be modeled, enabling rapid and effective fine-tuning.

3) Deformation Refinement: Although the transfer model bridges the virtual and real domains, it is limited by the absence of paired data. The unpaired nature of training results in insufficient precision, particularly for fine-grained tasks like pose estimation. Additionally, real images may contain artifacts, distortions, and deformations, further complicating accurate localization.

To address these issues, we introduce a novel augmentation-then-recovery strategy to refine distortion and deformation. Specifically, we first design diverse augmentations to simulate realistic image distortions and deformations, and then recover the original images using a paired training pipeline. As shown in Fig. 4 (b), we begin by applying reverse transfer to generate real images from the virtual domain: I_r = G(I_v), where I_v denotes the virtual images from the virtual buffer and G is the reverse generator of the pretrained transfer model.

Though transfer models can mitigate the domain gap between real and virtual endoscopic data, real endoscopic images encountered in clinical scenarios often exhibit substantial distortions and deformations. For instance, artifacts such as bleeding, bubbles, and mucus can introduce significant noise, while inherent camera distortions may lead to pronounced scene deformations. To more realistically simulate the diverse conditions prevalent in clinical scenarios and further bridge this real-to-virtual domain gap, we propose a novel data augmentation method, termed DDAug.
This approach incorporates three meticulously designed augmentation strategies, specifically engineered to emulate common occurrences in real endoscopic scenes, thereby enhancing the robustness and clinical applicability of the proposed system. Specifically, as shown in Fig. 5, three augmentations are applied: color jitter, noise mixup, and camera parameter perturbation.

Color jitter: We use traditional color jitter, which introduces random changes in the image's brightness, contrast, and saturation to simulate variations in lighting.

Noise mixup: To simulate the artifacts in endoscopy scenarios, we apply mixup between generated real images and fractal images [35], which induce structural variations in the hybrid images. A randomly selected fractal image I_F is blended with the generated real image I_r using a random blending factor λ: I_aug = λ I_r + (1 − λ) I_F.

Camera parameter perturbation: We apply camera parameter perturbation to simulate the deformation occurring in real scenarios. Unlike traditional approaches that only adjust camera poses [37], this work perturbs both the camera pose and the intrinsic parameters, re-projecting all pixels to generate a novel synthetic image. This dual-parameter adjustment expands the diversity of the training data. Specifically, given a generated real image I_r, the depth map D from the virtual ground truth, and the intrinsic matrix K, we generate a new image I_p by applying perturbations to both the rotation matrix and the intrinsic parameters. Let T_p represent the relative rotation matrix between the original and perturbed poses, and let K_p denote the perturbed intrinsic matrix. For any point p = [u, v]^T in I_r, we map it to a corresponding point p' via the following homography:

p'_h = K_p T_p K^{-1} z p_h    (7)

where z is the depth at position p, and p_h and p'_h are the homogeneous coordinates of p and p'.
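The noise-mixup step can be sketched as below; the sampling range of the blending factor λ is an assumption, since the text only states that λ is random:

```python
import numpy as np

def noise_mixup(real_img, fractal_img, rng=None):
    """Noise mixup: I_aug = lam * I_r + (1 - lam) * I_F with a random blending
    factor lam. The range [0.5, 1.0] keeps the real image dominant and is an
    assumption, not a value from the paper."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.uniform(0.5, 1.0)
    return lam * real_img + (1.0 - lam) * fractal_img
```

The blended output stays elementwise between the two source images, so no clipping is needed for normalized inputs.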
For the T_p perturbation, we randomly sample pitch, yaw, and roll angles within the range of [-0.1, 0.1] radians. Similarly, for the K_p perturbation, we vary the intrinsic camera parameters f_x, f_y, c_x, and c_y by up to 10% of their respective values.

After the augmentations, we recover the original virtual images using a paired training pipeline. For each augmented real image I_aug, the corresponding virtual image I_v can be regarded as its paired ground truth. Based on this, we employ a reconstruction decoder to map the augmented image I_aug back to the virtual image I_v, with RMSE used as the loss function:

L_recon = ||I_v − Re(E(A(G(I_v))))||_2    (8)

where G(·) is the reverse generator of the pretrained transfer model, A(·) is the augmentation function, E(·) is the pretrained encoder, and Re(·) is the reconstruction decoder.

4) Scene Coordinate Head Training: After aligning all data into the virtual domain, we apply a scene coordinate head and use the PnP algorithm to obtain the camera pose. For the RGB image, we denote the scene coordinate y_i ∈ Y associated with pixel x_i, so the 2D-3D correspondences can be represented as C_RGB = {(x_i, y_i) | y_i ∈ Y}. The relation between pixel coordinates and scene coordinates is x_i = K h^{-1} y_i, where K denotes the camera intrinsic matrix and h is the ground-truth camera pose.

For estimating the scene coordinate, we employ an encoder-decoder neural network f(·) = H(E(·)), where the encoder E(·) is fixed and the scene coordinate head H(·) can be optimized. Following previous work that patchifies the image [38], we estimate y_i = f(p_i; w), where p_i = P(x_i, I) represents an image patch extracted around pixel position x_i from the input image I, and w are the learnable parameters. The function f maps the image patch to a 3D coordinate: f : R^{C_1 × H_P × W_P} → R^3, where C_1 = 1 and H_P = W_P = 81.
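Under the stated perturbation ranges, the pixel remapping of Eq. (7) can be sketched with NumPy as follows (an illustrative implementation, not the paper's code; the final resampling of the image onto the new grid is omitted):

```python
import numpy as np

def perturb_camera(img_shape, K, depth, rng=None):
    """Map each pixel of I_r to its location in the perturbed view via
    p'_h = K_p T_p K^{-1} z p_h (Eq. 7). Returns the warped (u', v') grid.

    Perturbation ranges follow the paper: rotation angles in [-0.1, 0.1] rad,
    intrinsics (fx, fy, cx, cy) varied by up to 10%.
    """
    rng = rng if rng is not None else np.random.default_rng()
    H, W = img_shape

    # Random rotation T_p composed from small pitch/yaw/roll angles.
    ax, ay, az = rng.uniform(-0.1, 0.1, size=3)
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    T_p = Rz @ Ry @ Rx

    # Perturbed intrinsics K_p: scale fx, fy, cx, cy by up to +/-10%.
    K_p = K.copy()
    for (i, j) in [(0, 0), (1, 1), (0, 2), (1, 2)]:
        K_p[i, j] *= 1.0 + rng.uniform(-0.1, 0.1)

    # Homogeneous pixel grid p_h, back-projected and scaled by depth z.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p_h = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x (H*W)
    rays = np.linalg.inv(K) @ p_h * depth.reshape(1, -1)               # z K^{-1} p_h
    p_prime_h = K_p @ (T_p @ rays)                                     # Eq. (7)
    return (p_prime_h[:2] / p_prime_h[2]).T.reshape(H, W, 2)           # (u', v')
```

The returned grid gives, for every source pixel, its location in the perturbed view; an image warp (e.g., inverse remapping with interpolation) would complete the augmentation.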
The objective function can be defined as:

L_proj = Σ_i ||x_i − K h^{-1} f(p_i; w)||_2    (9)

Notably, the pose ground truth h is available exclusively for the virtual domain, so the aforementioned scene coordinate head training is conducted entirely in the virtual domain. In summary, EndoSERV enables the training of a scene coordinate network and subsequent endoscopic localization without requiring any real-world labels.

C. Confidence-Aware Pose Estimation

During the inference phase, the aim is to recover the camera pose h from the scene coordinates Y. This is achieved by employing a traditional Perspective-n-Point (PnP) minimal solver within a RANSAC loop. The pose hypothesis h_j that maximizes consensus among the scene coordinates is selected as the final estimate:

h̃ = argmax_{h_j} s(h_j, Y)    (10)

Here, s(·) is the scoring function, which is based on inlier counting:

s(h, Y) = Σ_{y_i ∈ Y} 1[r(y_i, h) < τ]    (11)

where r(·) is the residual measurement function and 1[·] is the indicator function.

Apart from estimating the camera pose, another key issue is assessing the reliability of the pose estimate, particularly in real-world applications such as medical robotic systems. For instance, if the surgeon remains stationary, the network can maintain high-confidence testing without needing to switch to the training phase. In contrast, rapid movement or larger scene variations may require a switch to the training phase so the system can learn the new scene context. Thus, the ability to gauge the current estimate's confidence is crucial for adaptive decision-making.

Building on this, we propose an extended use of inlier counts, not only for hypothesis selection but also for confidence estimation. In our experiments, we observe that inlier counts exhibit distinct patterns depending on the input data: high inlier counts are typically found in cases seen during training, while out-of-distribution (OOD) cases result in notably lower inlier counts.
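Inlier-count scoring and hypothesis selection (Eqs. 10-11) can be sketched as below, assuming a pinhole reprojection residual; the pixel threshold `tau` and the 4x4 camera-to-world pose convention are assumptions, not values from the paper:

```python
import numpy as np

def inlier_score(h_pose, K, scene_coords, pixels, tau=10.0):
    """Inlier-counting score s(h, Y) (Eq. 11): count scene coordinates whose
    reprojection residual r(y_i, h) is below the threshold tau (pixels).

    h_pose: 4x4 camera-to-world pose; scene_coords: Nx3; pixels: Nx2.
    """
    # Transform scene coordinates into the camera frame: R^T (y_i - t).
    R, t = h_pose[:3, :3], h_pose[:3, 3]
    cam = (scene_coords - t) @ R
    # Project with the intrinsics and compare against observed pixels.
    proj = (K @ cam.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    residuals = np.linalg.norm(proj - pixels, axis=1)
    return int(np.sum(residuals < tau))

def select_pose(hypotheses, K, scene_coords, pixels, tau=10.0):
    """Pick the pose hypothesis maximising consensus (Eq. 10)."""
    scores = [inlier_score(h, K, scene_coords, pixels, tau) for h in hypotheses]
    best = int(np.argmax(scores))
    return hypotheses[best], scores[best]
```

In the full RANSAC loop, hypotheses would come from repeated PnP minimal solves on sampled 2D-3D correspondences; here they are supplied as a list.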
This observation motivates us to extend the role of inlier counting, not only for hypothesis selection but also for confidence estimation, allowing the system to determine when it can test with confidence and when it should transition to the training phase. Specifically, during inference, we maintain a confidence buffer of fixed length (50 frames) that stores the inlier counts of recent frames. For each incoming frame, we compare its inlier count against the statistical thresholds of the buffer. If the inlier count of the current frame exceeds μ − 2σ (where μ and σ are the mean and standard deviation of the inlier counts within the buffer), the frame is considered confident and is added to the buffer. Conversely, if the inlier count falls below this threshold, the frame is deemed uncertain and excluded from the buffer. If more than 20 frames are classified as uncertain, the system transitions from the testing phase to the online training phase.

TABLE I
POSE ESTIMATION RESULTS OF C3VD AND CLINICAL EXPERIMENTS

Method                     C3VD ATE (m)    Clinical ATE (m)
SfM
  AF-SfMLearner [3]        4.69±2.18       12.49±5.06
  EndoDAC [6]              3.89±2.50       11.10±5.63
  LightMono [18]           4.81±3.40       12.26±5.97
NeRF-SLAM
  EndoGSLAM [11]           4.35±1.90       12.04±5.32
  MonoGS [39]              4.08±1.75       11.61±5.70
Real-Virtual
  AI-copilot [40]          6.39±2.07       14.03±6.12
  CycleGAN [14]            6.05±3.35       13.29±6.54
  UNSB [29]                3.06±1.44       11.19±4.91
  EndoSERV (Ours)          1.90±0.51       6.22±2.83

TABLE II
r_RPE RESULTS OF REAL-VIRTUAL ALIGNMENT METHODS

Method                     C3VD r_RPE (deg)    Clinical r_RPE (deg)
AI-copilot [40]            2.72±0.56           4.16±1.93
CycleGAN [14]              1.87±0.83           3.14±1.72
UNSB [29]                  1.80±1.11           3.69±1.17
EndoSERV (Ours)            0.50±0.26           1.05±0.44

V. BASELINE METHODS

In our experiments, we compared eight baselines across three categories:

A. Real-Virtual Alignment

The real video frames are converted to the virtual domain for estimating the 6-DoF pose of the endoscope. Multiple real-to-virtual transformation networks are adopted as baselines.
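The confidence buffer and phase-switching rule can be sketched as follows (a minimal sketch; the behaviour after a switch, e.g., resetting the uncertainty counter once retraining finishes, is not specified in the text and is left out):

```python
from collections import deque
import numpy as np

class ConfidenceMonitor:
    """Confidence-aware phase switching: keep a buffer of recent inlier counts,
    flag a frame as uncertain when its count falls below mu - 2*sigma, and
    request a switch to online training after too many uncertain frames."""

    def __init__(self, buffer_len=50, uncertain_limit=20):
        self.buffer = deque(maxlen=buffer_len)
        self.uncertain_limit = uncertain_limit
        self.uncertain_count = 0

    def update(self, inliers):
        """Returns True if the system should switch to the online training phase."""
        if len(self.buffer) >= 2:
            mu, sigma = np.mean(self.buffer), np.std(self.buffer)
            if inliers < mu - 2.0 * sigma:
                self.uncertain_count += 1   # uncertain: excluded from the buffer
                return self.uncertain_count > self.uncertain_limit
        self.buffer.append(inliers)         # confident frame joins the buffer
        return False
```

A sustained drop in inlier counts (e.g., after entering an unseen bronchial branch) accumulates uncertain frames until the monitor signals a switch to online training.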
• CycleGAN [14]: CycleGAN is an image-to-image translation method for settings without paired examples. It proposes an inverse mapping and a cycle consistency loss to achieve unpaired image translation.

• AI-copilot [40]: AI-copilot is a structure-preserving unpaired image translation method. It consists of a generator, a discriminator, and a depth estimator, and leverages a depth constraint for structure consistency.

Fig. 6. Two examples of pose estimation from the C3VD dataset. Visualizations for all methods are first centralized, followed by the presentation of the 3D trajectories along with their projections on the three coordinate axes.

• UNSB [29]: UNSB expresses the Schrödinger Bridge problem as a sequence of adversarial learning problems and incorporates advanced discriminators and regularization to learn a Schrödinger Bridge between unpaired data.

B. Neural Implicit-Based SLAM

• EndoGSLAM [11]: EndoGSLAM is an efficient SLAM approach for endoscopic surgeries, which integrates a streamlined Gaussian representation and differentiable rasterization to facilitate online camera tracking and tissue reconstruction.

• MonoGS [39]: MonoGS is the first application of 3D Gaussian Splatting (3DGS) to monocular SLAM. It formulates camera tracking for 3DGS as direct optimisation against the 3D Gaussians, and introduces geometric verification and regularization to handle the ambiguities that occur in incremental 3D dense reconstruction.

C. Structure from Motion

• AF-SfMLearner [3]: AF-SfMLearner is a self-supervised framework that estimates monocular depth and ego motion simultaneously in endoscopic scenes. It introduces a novel concept, appearance flow, to address the brightness inconsistency problem.
• EndoDAC [6]: EndoDAC is an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. It develops Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employs Convolutional Neck blocks to tailor the foundation model to the surgical domain, using remarkably few trainable parameters.

• LightMono [41]: LightMono is a hybrid architecture combining CNNs and Transformers. It extracts rich multi-scale local features and takes advantage of the self-attention mechanism to encode long-range global information into the features.

VI. EXPERIMENTS WITH SIMPLE DATASET

A. Experiment Settings

We evaluated the proposed method using the C3VD dataset [42], which consists of 22 colonoscopic video sequences paired with corresponding virtual models. Due to the small size and relatively low complexity of the C3VD dataset, we trained the model using only 50 real images per case to assess its generalization ability.

B. Training Implementations

During the offline training phase, the prompts we used are: "Bleed, inflamed red, slightly reddish, pale yellow and orange areas, pinkish hue, ulcerated regions, a sketch with crayon, mosaic." The pretrained diffusion model is InstructPix2Pix [43]. For the feature encoder and the scene coordinate head, we follow the previous work DSAC* [44] and apply the pretrained feature encoder of ACE [38] as the initialization. We used Adam as the optimizer, with a learning rate of 1e-4 and 1e+6 training iterations. We used CycleGAN as the foundation style transfer model, trained for 200 epochs using 50 real images and all virtual images per case as training data; the testing images include all real images. The learning rate is 2e-4 and the optimizer is Adam. During the online phase, since the simple dataset can be considered a single sub-segment, the retrieval phase is not needed.
In addition, a single transfer model is enough to cover the whole scene; therefore, the sliding window size is one. The experiments on the simple dataset focus on testing the efficiency of offline training.

Fig. 7. (a) Robotic system overview. The robotic system is equipped with continuum joints allowing 4-DoF operations, including translation, rotation, and X/Y bending. During surgery, the surgeon controls the robot system, capturing intra-operative images from the monocular camera. The EndoSERV module estimates the absolute pose and generates the trajectory for intra-operative guidance. (b) Two examples of pose estimation from the clinical dataset. All real-virtual alignment methods overcome the need for 7-DoF alignment with pose ground truth. EndoSERV achieves the best performance, with a trajectory that crosses through the airway less frequently.

C. Results

To assess the performance of our algorithm, we conducted experiments on the C3VD dataset, treating each video segment as a separate sub-segment due to the limited number of frames in each video. As shown in Table I, EndoSERV consistently outperforms all baselines in terms of Absolute Trajectory Error (ATE), achieving the lowest ATE of 1.90 ± 0.51 m. SfM-based methods, including AF-SfMLearner, EndoDAC, and LightMono, show ATE values ranging from 3.89 m to 4.81 m, with EndoDAC performing best at 3.89 ± 2.50 m. MonoGS achieves the lowest ATE (4.08 ± 1.75 m) among NeRF-SLAM methods, demonstrating its effectiveness in leveraging neural radiance fields for pose estimation.
However, EndoSERV outperforms all SfM and NeRF-SLAM methods, even without 7-DoF alignment. We conducted a t-test between EndoSERV and EndoDAC; the p-value is 3.512e-5, demonstrating that the improvement is statistically significant.

Real-virtual alignment methods, which overcome the need for 7-DoF alignment with pose ground truth, exhibit competitive performance in the ATE evaluation. AI-copilot and CycleGAN report the highest ATE values (6.39 ± 2.07 m and 6.05 ± 3.35 m, respectively), while UNSB performs better with an ATE of 3.06 ± 1.44 m. Despite these results, EndoSERV outperforms all real-virtual alignment methods, achieving a significant improvement in ATE over the best-performing real-virtual alignment method, UNSB. Additionally, EndoSERV achieved the best r_RPE of 0.50 ± 0.26 deg, outperforming all competing real-virtual alignment methods.

In addition, the trajectory visualization is shown in Fig. 6, providing qualitative insights into the performance of different approaches. Notably, our proposed method, EndoSERV, demonstrates significant advantages compared to existing methods.

VII. EXPERIMENTS WITH ROBOTIC SYSTEM

A. Experiment Settings

We implemented the localization system on an endobronchial surgical robot and conducted an in-vivo animal trial. The study was conducted at our collaborating hospital with ethical approval by the IRB 1. All procedures were performed in accordance with ethical standards. As shown in Fig. 7, the robotic system is equipped with continuum joints allowing 4-DoF operations, including translation, rotation, and X/Y bending. The outer diameter of the continuum joints is 4.2 mm. Both in-vivo images and their corresponding virtual images were acquired using a robotic platform. For the virtual images, we imported segmented bronchial trees from pre-operative CT scans into Unity and rendered the virtual images. For the in-vivo images, we acquired 52,340 endoscopy frames from 6 pigs, comprising 33 video sequences.
In-vivo images were captured during the animal trial with a bronchoscope operating at 25 frames per second (fps). For the pose ground truth of the in-vivo datasets, we first applied a coarse relative pose estimation algorithm to obtain an initial pose, and then refined it by manually aligning the structure between real images and virtual renderings. Specifically, we imported the initial estimated poses, along with the segmented bronchial tree, into a virtual engine for rendering. If the pose was accurate, the rendered virtual image should align with the real image in terms of structure. If misalignment was detected, we manually adjusted the pose within the virtual engine, iterating until the virtual image and the real image structures nearly overlapped.

To ensure a fair evaluation across different pose estimation methods, we adopt different initialization and alignment strategies tailored to the nature of each method. For SfM-based and neural implicit-based SLAM approaches, which inherently produce relative pose estimates between consecutive frames, we initialize the first frame's pose at the origin of the coordinate system. Subsequent poses are then accumulated using a SLAM-based pipeline to form a complete trajectory. During evaluation, we apply a 7-DoF transformation to align the estimated trajectory with the ground truth. In contrast, for real-virtual alignment methods, the availability of virtual renderings with known absolute poses enables direct prediction of the endoscope's global pose. This obviates the need for post-hoc alignment, as the pose predictions are already in the same coordinate space as the ground truth.

1 The names of the company, hospital, and the ethical approval number are anonymized due to the double-blinded policy of IEEE T-RO. We will publish these details if the manuscript is accepted.
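The 7-DoF transformation applied to the relative-pose baselines is the standard similarity (Umeyama) alignment of an estimated trajectory to the ground truth; a sketch of that standard method, not code from the paper:

```python
import numpy as np

def umeyama_alignment(est, gt):
    """7-DoF (scale + rotation + translation) alignment minimising
    ||gt - (s R est + t)|| over corresponding Nx3 position arrays."""
    n = len(est)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    # Cross-covariance between centred ground truth and estimate.
    cov = g.T @ e / n
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against reflections
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * R @ mu_e
    return s, R, t
```

After alignment, ATE is computed between `s * est @ R.T + t` and the ground-truth positions; the real-virtual alignment methods skip this step since they already predict absolute poses.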
During the operation, as the surgeon manipulates the robot, the bronchoscope's monocular camera captures intra-operative images, which are processed by the EndoSERV module to estimate the bronchoscope's absolute pose and generate a trajectory. This trajectory is compared with the pre-operative planned path, providing real-time guidance to the surgeon and helping identify deviations for correction.

B. Training Implementations

The offline training setting is the same as for the simple dataset. We used CycleGAN as the foundation style transfer model, trained for 200 epochs using 20 sequences as training data. The learning rate is 2e-4 and the optimizer is Adam. During the online training phase, we first used R2former [34] as the feature extractor for the retrieval step. The size of both the real buffer and the virtual buffer is 100. For the transfer fine-tuning and deformation refinement, two GPUs can be used to train them in parallel. Specifically, one GPU fine-tunes the CycleGAN model for only 8 epochs. The other GPU runs inference on the CycleGAN model and feeds the generated images to the 'Transfer refinement' and 'Odometry training' modules. The refinement decoder trains for 20 epochs with a learning rate of 2e-5 and a batch size of 12. All experiments are conducted on NVIDIA GeForce RTX 3090 GPUs.

C. Results

In this section, we conducted experiments on the clinical data, which are more challenging due to the complex structure and unpredictable artifacts. The evaluation of camera absolute pose estimation methods on the clinical dataset was conducted using the Absolute Trajectory Error (ATE). The results, summarized in Table I, demonstrate that our proposed method, EndoSERV, outperforms all other approaches, achieving the lowest ATE of 6.22 ± 2.83 m. Among the SfM methods, AF-SfMLearner and LightMono demonstrate similar performance with ATE values of 12.49 ± 5.06 m and 12.26 ± 5.97 m, respectively, while EndoDAC achieves a slightly lower error of 11.10 ± 5.63 m.
NeRF-SLAM-based methods such as EndoGSLAM and MonoGS report ATEs of 12.04 ± 5.32 m and 11.61 ± 5.70 m, respectively, showing comparable performance to the SfM approaches but trailing behind EndoSERV.

Real-virtual alignment methods exhibit variable performance on the clinical dataset. AI-copilot and CycleGAN report ATE values of 14.03 ± 6.12 m and 13.29 ± 6.54 m, respectively, indicating higher errors compared to the SfM and NeRF-SLAM methods. UNSB demonstrates a more competitive performance with an ATE of 11.19 ± 4.91 m, demonstrating the strong ability of the diffusion-based model. For the rotation metric, r_RPE, EndoSERV achieves an error of 1.05 ± 0.44 deg. This result substantially outperforms all baselines: AI-copilot (4.16 ± 1.93), CycleGAN (3.14 ± 1.72), and UNSB (3.69 ± 1.17).

TABLE III
COMPUTATIONAL TIME EVALUATION

Module in EndoSERV               Time
Style Transfer module            6.70 ms
Refinement Module                15.60 ms
Scene Coordinate Regression      2.94 ms
PnP & RANSAC                     12.48 ms
Total time                       37.73 ms

In addition, the trajectory visualization is shown in Fig. 7, providing qualitative insights into the performance of different approaches. Notably, our proposed method, EndoSERV, demonstrates significant advantages compared to existing methods. Real-virtual alignment baselines, including AI-copilot, CycleGAN, and UNSB, estimate absolute camera poses directly, eliminating the need for 7-DoF alignment during testing. However, they tend to exhibit more dispersed trajectories due to their frame-wise pose estimation, leading to discontinuities in the reconstructed path.

Our proposed method, EndoSERV, overcomes these limitations through texture-agnostic offline pretraining and refinement of distortion and deformation. As a real-virtual alignment method, EndoSERV obviates the need for 7-DoF alignment and real-world labels, thereby ensuring its applicability in realistic surgical scenarios.
At the same time, EndoSERV achieves a trajectory that is more continuous than other real-virtual alignment baselines, crossing through the airway model less frequently, highlighting its robustness and accuracy.

VIII. DISCUSSION

A. Confidence-Aware Surgical Guidance Analysis

In this section, we present a specific case from the clinical data to demonstrate how EndoSERV estimates the camera pose accurately and efficiently. As shown in Fig. 8, the process begins with a surgeon performing a multi-angle scan at the entry point to train an initial model for localization. After the initial training, the system transitions to testing, achieving real-time performance (27 fps). Testing proceeds until a significant drop in the confidence signal is detected, signaling the need for further refinement. The refining phase, lasting approximately 45 seconds, can run in parallel with the testing phase. It fine-tunes the transfer model, the refinement decoder, and the scene coordinate head, improving the estimation confidence. Upon completion of this refinement, the system reverts to the testing phase, and this cycle repeats. In the example shown in Fig. 8, a dataset of 3,000 images required two instances of online refinement. Excluding the initial training time, the system required only 143 seconds to localize 3,000 images, achieving efficient surgical guidance for surgeons. The inference time of each module is shown in Table III, demonstrating the efficiency.

Fig. 8. Surgical guidance with confidence-aware camera localization. The process is composed of two refining phases and three testing phases. During the refining phase, the confidence progressively increases, while in the testing phase, the confidence experiences a decline at a specific time point.

B.
Impact of Each Module: Feature Similarity Analysis

In this section, we evaluate the contributions of individual components by examining their ability to enhance the similarity between virtual and real features. This capability is crucial, as it enables the pose estimation network to be trained in the virtual domain while generalizing effectively to real-world scenarios. We hypothesize that two primary factors influence this feature similarity: (1) the effectiveness of the real-virtual alignment strategy used to bridge the domain gap, and (2) at the feature level, the feature encoder's ability to extract texture-agnostic features.

To investigate these factors, our experiments use real images and their corresponding virtual counterparts from the clinical dataset. We assess two distinct encoder configurations to evaluate their feature extraction capabilities: (1) ACE-encoder: a feature encoder pretrained on large datasets [38]; (2) Offline-encoder: the encoder developed and pretrained during the offline pretraining phase of this work.

Furthermore, we compare four different real-virtual alignment strategies to measure their alignment effectiveness: (1) None (original real image), (2) EndoSERV, (3) AI-copilot, and (4) UNSB. Specifically, the 'EndoSERV' strategy listed here refers to a sub-component of our proposed method, consisting of a feature encoder and a refinement decoder that generate virtual images.

For each real-virtual image pair, the real image is processed by one of the four real-virtual alignment strategies (or 'None'). Then, both the processed real image and the original virtual image are passed through one of the two feature encoders (ACE-encoder or Offline-encoder) to extract their respective feature vectors. We then compute the cosine similarity between these two vectors. A higher cosine similarity score indicates a smaller domain gap and better alignment at the feature level. Fig.
9 summarizes the similarity scores.

Fig. 9. Real-to-virtual feature similarity for different configurations, measured with different feature encoders and real-virtual alignment methods. The similarity is computed using cosine similarity.

Comparison Across Real-Virtual Alignment Strategies: The EndoSERV strategy consistently achieved the highest similarity scores across encoder configurations, demonstrating its effectiveness in bridging the domain gap. For the ACE-encoder, the EndoSERV strategy improved similarity to 0.6856 ± 0.1897, significantly outperforming the original image (0.1973 ± 0.0858), AI-copilot (0.4569 ± 0.1678), and UNSB (0.4605 ± 0.1537). A similar trend was observed with the Offline-encoder, where EndoSERV yielded a similarity score of 0.9208 ± 0.0660, surpassing AI-copilot (0.6113 ± 0.1413) and UNSB (0.8059 ± 0.1169). These findings confirm that EndoSERV is a more effective adaptation method than the other strategies, as it employs a paired generation strategy to produce virtual images that are more closely aligned with the virtual domain.

Impact Across Encoder Configurations: For different image encoders, the Offline-encoder consistently outperformed the ACE-encoder across all adaptation strategies, demonstrating the advantages of offline training. For instance, under the EndoSERV strategy, the Offline-encoder achieved a similarity score of 0.9208 ± 0.0660, compared to 0.6856 ± 0.1897 for the ACE-encoder.
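The feature-level comparison reduces to a cosine similarity between the two extracted feature vectors; as a minimal sketch:

```python
import numpy as np

def feature_similarity(f_real, f_virtual):
    """Cosine similarity between the features of an aligned real image and its
    virtual counterpart; higher means a smaller real-to-virtual domain gap."""
    f_real, f_virtual = f_real.ravel(), f_virtual.ravel()
    return float(np.dot(f_real, f_virtual) /
                 (np.linalg.norm(f_real) * np.linalg.norm(f_virtual)))
```

Feature maps are flattened before comparison, so the same function works for vectors or multi-channel encoder outputs.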
Similarly, for UNSB, the Offline-encoder achieved 0.8059 ± 0.1169, significantly higher than 0.4605 ± 0.1537 for the ACE-encoder. These results demonstrate the Offline-encoder's superior ability to reduce the domain gap, especially when combined with advanced adaptation techniques.

Fig. 10. Texture-agnostic features for the real-virtual domain gap and discriminative features for appearance-similar endoscopic scenarios. (a) and (b) show the feature maps extracted from virtual and real images on the C3VD dataset and the clinical dataset, respectively, with and without offline pretraining. Offline pretraining significantly increases the similarity between the extracted virtual and real features (0.590 without vs. 0.723 with pretraining in (a); 0.575 vs. 0.965 in (b)). (c) illustrates that although images from different bronchial branches may exhibit anatomical similarity, EndoSERV is still capable of extracting distinctive features: for images from the same location but different domains, EndoSERV extracts highly similar features (cosine similarity 0.969), whereas for anatomically similar images from different locations, it captures distinct features (0.901).

TABLE IV
ABLATION STUDY ON EACH AUGMENTATION STRATEGY IN ENDOSERV

Texture Diverse    DDAug    ATE (m)
✗                  ✗        9.03±4.68
✓                  ✗        8.16±4.37
✗                  ✓        8.80±4.69
✓                  ✓        6.22±2.83
Rand-Aug                    7.59±4.31
Auto-Aug                    7.18±3.23

C. Ablation Study of Augmentation Strategies

In EndoSERV, two augmentation strategies are essential: the texture-diverse augmentation in the offline training phase, and the distortion and deformation augmentation during the online training phase.
In this section, we investigate the impact of each augmentation strategy. To evaluate the effectiveness of different augmentation strategies in enhancing the performance of EndoSERV, we conducted an ablation study focusing on two specific augmentation techniques: Texture Diverse and DDAug. In the absence of Texture Diverse augmentation, the feature encoder was trained exclusively on virtual images. Similarly, without DDAug, the refinement decoder reconstructed virtual images using only the transferred real images.

Table IV presents the results of this study. The baseline model, without any augmentation, achieved an ATE of 9.03 ± 4.68 m. Introducing the Texture Diverse augmentation alone reduced the ATE to 8.16 ± 4.37 m, demonstrating its substantial contribution to improving localization accuracy. Applying DDAug in isolation yielded a decrease in ATE to 8.80 ± 4.69 m. Notably, the combination of both Texture Diverse and DDAug augmentations led to the most pronounced improvement, reducing the ATE to 6.22 ± 2.83 m. This synergistic effect underscores the complementary nature of the two augmentation strategies: Texture Diverse primarily enhances the model's ability to generalize across varied textures, while DDAug simulates distortion and deformation scenarios, fostering a more robust localization capability.

Additionally, to simulate real endoscopic scenarios, we carefully designed the augmentation strategy. To validate it, we compare against augmentation methods commonly used in deep learning, such as Rand-Aug and Auto-Aug. The results are shown in Table IV. Using data augmentation techniques such as Rand-Aug or Auto-Aug results in a lower ATE compared to training without any augmentation. However, their performance is still inferior to that achieved with our proposed DDAug.
This improvement can be attributed to DDAug's ability to more realistically simulate the challenges encountered in endoscopic scenarios, thereby enhancing the robustness of the network.

D. Texture-agnostic features and similar anatomical features

In this section, we investigate two common challenges in endoscopic image processing from the perspective of feature maps. First, we examine whether the feature extractor obtained through offline training learns texture-agnostic representations. Second, we assess whether the extractor generates distinguishable features for images captured from different branches that exhibit nearly identical geometric and topological properties, thereby enabling accurate localization. The results are shown in Fig. 10. For the feature extractor without pretraining, there is a substantial discrepancy between the feature maps of virtual and real images; after offline pretraining, this gap is significantly reduced. In Fig. 10 (c), we select two images from different bronchial branches that share a highly similar appearance. Our feature extractor is still able to produce distinct features, demonstrating its ability to differentiate between visually similar yet anatomically distinct regions.

E. Effectiveness of estimated confidence-based filtering

To further evaluate the effectiveness of our confidence estimation, we incorporate a post-processing step that leverages the confidence estimated from scene coordinate predictions. This confidence-aware refinement filters out predictions with low confidence, which are often associated with outliers and temporal inconsistencies. As shown in Table V, the proposed confidence-based filtering significantly improves the accuracy and stability of pose estimation.
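Percentile-style confidence filtering of this kind can be sketched as below. The per-frame pose layout, the confidence array, and the helper name `filter_by_confidence` are hypothetical; only the keep-the-top-fraction logic reflects the evaluation protocol:

```python
import numpy as np

def filter_by_confidence(poses, confidences, keep_frac=0.9):
    """Keep only the top `keep_frac` most confident pose predictions.

    poses: (N, 6) array of predicted poses (hypothetical layout)
    confidences: (N,) per-frame confidence scores
    Returns the retained poses and the boolean keep mask.
    """
    # threshold at the (1 - keep_frac) quantile of the confidences
    thresh = np.quantile(confidences, 1.0 - keep_frac)
    mask = confidences >= thresh
    return poses[mask], mask

# toy usage: filter out the 20% least confident of 100 predictions
rng = np.random.default_rng(0)
poses = rng.normal(size=(100, 6))
conf = rng.uniform(size=100)
kept, mask = filter_by_confidence(poses, conf, keep_frac=0.8)
```

In a trajectory metric such as ATE, only the retained frames would then enter the error computation, which is how removing unreliable frames can lower the reported error.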
Starting from a baseline ATE of 6.22 ± 2.83 without any filtering, the ATE consistently decreases to 5.81 ± 2.50, 5.58 ± 2.41, 5.44 ± 2.26, and 5.37 ± 2.19 when filtering out the 10%, 20%, 30%, and 40% least confident predictions, respectively. This reduction in ATE demonstrates that the estimated confidence effectively captures prediction reliability and can be used to suppress noise and outliers.

TABLE V
EFFECT OF UNCERTAINTY-BASED FILTERING ON POSE ESTIMATION ACCURACY (ATE)

Uncertainty filter threshold           ATE (↓)
No filter                              6.22 ± 2.83
Top 90% confidence (10% filtered)      5.81 ± 2.50
Top 80% confidence (20% filtered)      5.58 ± 2.41
Top 70% confidence (30% filtered)      5.44 ± 2.26
Top 60% confidence (40% filtered)      5.37 ± 2.19

IX. CONCLUSION

In this paper, we propose EndoSERV, a novel approach for efficient endoluminal robot localization. We propose an adaptive segment-to-structure strategy that partitions extended luminal paths into manageable sub-segments. For each sub-segment, we map everything in the real domain to the pre-operative virtual domain, taking advantage of virtual ground truth for odometry training. Specifically, we design a robust offline pretraining to extract texture-agnostic features, and propose a fast online training for simultaneous domain adaptation and odometry training. Additionally, we propose a novel augmentation-then-recovery strategy, simulating potential distortion and deformation in real scenarios and recovering them via a paired training pipeline. Experimental results on both public and clinical datasets demonstrate significantly improved performance over current state-of-the-art methods, even in the absence of real-world pose labels.