Paper deep dive
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Linkuan Zhou, Yinghao Xia, Yufei Shen, Xiangyu Li, Wenjie Du, Cong Cong, Leyi Wei, Ran Su, Qiangguo Jin
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/26/2026, 2:35:34 AM
Summary
SHAPE is a novel Unsupervised Domain Adaptation (UDA) framework for medical image segmentation that shifts the focus from local pixel-level accuracy to global anatomical plausibility. It utilizes a DINOv3 foundation with a Hierarchical Feature Modulation (HFM) module for structure-aware feature alignment, and a validation pipeline consisting of Hypergraph Plausibility Estimation (HPE) and Structural Anomaly Pruning (SAP) to ensure high-fidelity pseudo-label generation.
Entities (5)
Relation Signals (4)
SHAPE → includes → Hierarchical Feature Modulation
confidence 100% · its Hierarchical Feature Modulation (HFM) module first generates features
SHAPE → includes → Hypergraph Plausibility Estimation
confidence 100% · we introduce Hypergraph Plausibility Estimation (HPE)
SHAPE → includes → Structural Anomaly Pruning
confidence 100% · This is complemented by Structural Anomaly Pruning (SAP)
SHAPE → utilizes → DINOv3
confidence 95% · Built on a DINOv3 foundation
Cypher Suggestions (2)
Find all modules associated with the SHAPE framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'SHAPE'})-[:INCLUDES]->(m:Module) RETURN m.name

Identify the foundation model used by SHAPE · confidence 90% · unvalidated
MATCH (f:Framework {name: 'SHAPE'})-[:UTILIZES]->(fm:FoundationModel) RETURN fm.name

Abstract
Abstract: Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI→CT) and 78.51% (CT→MRI) on cardiac data, and 87.48% (MRI→CT) and 86.89% (CT→MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.
Tags
Links
- Source: https://arxiv.org/abs/2603.21904v1
- Canonical: https://arxiv.org/abs/2603.21904v1
Full Text
52,716 characters extracted from source content.
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation Linkuan Zhou1, Yinghao Xia1, Yufei Shen1, Xiangyu Li2, Wenjie Du3, Cong Cong4, Leyi Wei5, Ran Su6, Qiangguo Jin1,* 1School of Software, Northwestern Polytechnical University, 2School of Computer Science and Technology, Harbin Institute of Technology, 3School of Software Engineering, USTC, 4Australian Institute of Health Innovation, Macquarie University, 5Faculty of Applied Science, Macao Polytechnic University, 6School of Computer Software, Tianjin University, qgking@nwpu.edu.cn Abstract Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. 
SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI→CT) and 78.51% (CT→MRI) on cardiac data, and 87.48% (MRI→CT) and 86.89% (CT→MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.

* Corresponding author

1 Introduction

Medical image segmentation plays a crucial role in medical analysis and treatment. Despite the success of deep learning models in this task, they typically assume that training and test data follow the same distribution [30]. However, this hypothesis does not hold in real-world clinical scenarios due to the large variations in imaging equipment, modalities, and parameters. Thus, the performance of a model trained on a labeled source domain may decline sharply when applied to an unlabeled target domain [22]. Unsupervised Domain Adaptation (UDA) addresses this by transferring knowledge from a labeled source to an unlabeled target domain, avoiding costly re-annotation [11]. Recently, deep learning methods have been used to address the UDA problem [46, 49, 48]. These methods can be broadly categorized into alignment-based methods and pseudo-label-based methods. The first category aligns the source and target domains based on image appearance [51, 33, 46, 43], feature distribution [34, 8, 2, 50], or output prediction consistency [35, 38]. For example, RSA [46] employed a conditional diffusion model to generate multiple source-like images, and used predictive consistency to select the most reliable generated image. In contrast, LE-UDA [50] focused on feature-level alignment, constructing self-ensembling consistency to facilitate knowledge transfer between domains and utilizing an adversarial learning module for UDA.
Different from the above methods, SIFA [2] performed co-alignment of domains from both image and feature perspectives, simultaneously transforming the appearance of images across domains and enhancing the domain invariance of the extracted features by leveraging adversarial learning in multiple aspects. In addition, SADA [38] aligned the joint distribution of segmentation results between source and target images, thereby achieving domain adaptation. However, these methods typically learn a global, content-agnostic mapping between domains, which, while enabling overall domain adaptation, inevitably breaks the fine-grained mapping relationships between cross-domain classes and categories. The second category leverages semi-supervised learning to generate pseudo labels for target-domain data using source-domain data. For example, MAPSeg [49] introduced a masked autoencoding and pseudo-labeling segmentation framework, which demonstrated good performance in heterogeneous and volumetric medical image segmentation. Similarly, GenericSSL [36] proposed a knowledge distillation framework, utilizing a shared diffusion encoder to learn distribution-invariant features and a reweighted decoder to generate reliable pseudo-labels for further supervision. To generate reliable pseudo labels, IPLC [48] iteratively generated pseudo labels using pre-trained source models and SAM-Med2D [6], incorporating multiple random sampling and entropy estimation while continuously updating prompts for domain adaptation. Unlike IPLC, UPL-SFDA [39] generated different predictions of the target domain by duplicating the pre-trained model's prediction head multiple times with perturbations and generated reliable pseudo labels using uncertainty estimation.
Nevertheless, their quality assessment relies heavily on local, pixel-wise metrics such as prediction entropy or consistency, potentially admitting pseudo-labels with anatomical structural flaws, thus hindering the model from learning generalizable and structurally coherent features. Despite progress, these approaches face two fundamental challenges that limit their efficacy. First, feature alignment often operates in a semantically unaware manner. Monolithic strategies apply a uniform transformation across the feature map, averaging style characteristics over distinct anatomical structures. This blending prevents the generation of class-specific style information, leading to imprecise feature alignment and poor distributional fidelity. Second, pseudo-label validation disregards global anatomical constraints. Existing methods rely on pixel-level confidence or local consistency and are unable to prevent the formation of pseudo-labels that constitute globally implausible structures, such as those with anatomically impossible shapes or spatial arrangements. To overcome these limitations, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a UDA framework that shifts the paradigm from local pixel correctness to global anatomical plausibility. Built upon a powerful DINOv3 [31] foundation, SHAPE introduces a synergistic pipeline. Our Hierarchical Feature Modulation (HFM) performs a class-aware, spatially-differentiated alignment by applying tailored mixing strategies to semantic cores and structural boundaries, generating features with high distributional fidelity. To leverage these adapted features for self-training, high-quality pseudo-labels are essential. Such labels must be not only pixel-wise accurate but also globally plausible in their anatomical structure. While standard graphs capture pairwise relations, they fall short of representing the holistic interplay of multiple anatomical structures. 
We therefore introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to uniquely model and score both the intra-class shape of individual structures and the inter-class spatial arrangement of the entire anatomy. Finally, a Structural Anomaly Pruning (SAP) stage inspects cross-view stability to identify and purge spurious, hallucinated class-level predictions, ensuring maximum label fidelity. Our main contributions are summarized as follows:

• A Hierarchical Feature Modulation (HFM) strategy performs class-aware, spatially-differentiated alignment, overcoming semantically blind mixing for high distributional fidelity.

• A novel hypergraph-based validation pipeline (HPE and SAP) moves beyond pixel-level confidence to ensure global anatomical plausibility, gating and refining structurally coherent pseudo-labels.

• Our integrated SHAPE framework establishes a robust UDA paradigm that significantly outperforms state-of-the-art methods via precise feature adaptation and high-fidelity pseudo-label supervision.

2 Related Work

2.1 Self-Supervised Foundation Models

Self-supervised learning (SSL) paradigms, including contrastive learning [4, 13, 5, 29] and masked image modeling [28, 12], facilitate the pre-training of powerful Vision Transformers (ViTs) on unlabeled data. The resulting foundation models, such as DINOv3 [31], CLIP [29], and SAM [20], have been pivotal in medical imaging [9] by providing robust priors that enhance domain generalization. While these models offer a strong foundation, we contend that their direct application as feature extractors is insufficient for optimal adaptation. Consequently, we introduce a Hierarchical Feature Modulation (HFM) module designed to explicitly adapt these potent yet domain-biased priors to the specific characteristics of the target domain.

2.2 Feature Alignment

The evolution of feature alignment strategies is central to progress in UDA.
Common strategies include adversarial alignment to enforce domain invariance [40, 14, 24], and global style alignment through statistical normalization such as AdaIN [15] or spectral manipulation [41, 45]. Although effective for holistic appearance, these monolithic approaches are content-agnostic. To incorporate semantics, class-aware methods align per-class feature prototypes [44]. Interpolation strategies based on Mixup smooth the inter-domain transition by creating convex combinations of samples [27, 19, 1]. In contrast, our HFM performs content-aware, spatially-differentiated feature mixing. By dynamically applying either semantic interpolation or statistical alignment based on local patch content, HFM provides a granular, structure-preserving adaptation that overcomes the limitations of monolithic, semantically blind, or prototype-based methods.

Figure 1: The pipeline of SHAPE. (a) Hierarchical Feature Modulation (HFM). (b) Hypergraph Plausibility Estimation (HPE). (c) Structural Anomaly Pruning (SAP).

2.3 Structural Plausibility

Incorporating anatomical priors is crucial for segmentation. Early methods imposed structural constraints through CRFs [3, 21] or topology-aware losses [7, 17]. A more expressive paradigm models relational context, where Graph Neural Networks (GNNs) capture pairwise object relations [42] but are limited to binary interactions. Hypergraphs overcome this by representing higher-order relationships and have been applied to vision-based segmentation [16, 10, 37]. Our work introduces a novel application of hypergraphs as a quality gate for pseudo-labels in UDA. We conceptualize predictions as structural hypergraphs to score the plausibility of intra-class shapes and inter-class layouts, creating a data-driven supervisory signal for self-training.

3 Method

As shown in Fig.
1, our proposed framework, SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), fundamentally reframes the UDA paradigm by shifting the objective from local pixel accuracy to global structural plausibility. The framework begins with Hierarchical Feature Modulation (HFM), which constructs a structurally-aware feature space. Although HFM can provide high-quality features, the subsequent challenge is to ensure that the resulting pseudo-labels are valid and accurate for efficient self-training. We address this by introducing Hypergraph Plausibility Estimation (HPE), which models each prediction as a structural hypergraph to validate its global anatomical integrity. To achieve maximum fidelity, a final Structural Anomaly Pruning (SAP) stage then purges the remaining class-level instabilities. Through this cascade of feature adaptation and multi-level validation, SHAPE synthesizes a set of high-fidelity pseudo-labels to guide the model adaptation process.

3.1 Hierarchical Feature Modulation

Our approach is founded on a frozen DINOv3 ViT encoder [31], which provides rich semantic and structural priors. For an input image $I$, we extract the sequence of patch tokens from the final transformer block and reshape them into a dense feature map $F = \Phi(I) \in \mathbb{R}^{C \times H \times W}$, where $\Phi$ denotes the encoder. Our HFM module bridges the domain gap through a dual-granularity approach performing global style alignment and local, structure-aware feature mixing. Globally, we use Adaptive Instance Normalization (AdaIN) [15] to align textural properties between source features $F_s$ and target features $F_t$, computing the stylized map $F_{s \to t}$ as:

$$F_{s \to t} = \sigma(F_t)\left(\frac{F_s - \mu(F_s)}{\sigma(F_s) + \epsilon}\right) + \mu(F_t), \tag{1}$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the channel-wise mean and standard deviation, and $\epsilon$ is a small constant added for numerical stability.
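Eq. (1) amounts to a channel-wise re-normalization and is simple to verify numerically. Below is a minimal NumPy sketch; the function name, the `(C, H, W)` array layout, and the epsilon value are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def adain(f_s, f_t, eps=1e-5):
    """Adaptive Instance Normalization (Eq. 1): re-style source
    features F_s with the channel-wise statistics of target F_t.
    f_s, f_t: feature maps of shape (C, H, W)."""
    mu_s = f_s.mean(axis=(1, 2), keepdims=True)      # mu(F_s)
    sigma_s = f_s.std(axis=(1, 2), keepdims=True)    # sigma(F_s)
    mu_t = f_t.mean(axis=(1, 2), keepdims=True)      # mu(F_t)
    sigma_t = f_t.std(axis=(1, 2), keepdims=True)    # sigma(F_t)
    return sigma_t * (f_s - mu_s) / (sigma_s + eps) + mu_t
```

By construction, the stylized map inherits the target's per-channel mean exactly, which is what "global style alignment" means here.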
To perform a more nuanced local adaptation, we upsample the feature maps to a finer resolution, yielding denser grids of $N = 4HW$ tokens. We extract these tokens and the corresponding label sub-patches $\{m_{s/t}^{i}\}_{i=1}^{N}$ through an unfolding operation $U(\cdot)$. We then classify each token's content by computing a purity score for its aligned sub-patch:

$$P(m^{i}) = \max_{k \in \{0,\dots,K-1\}} \frac{\sum_{v \in m^{i}} \mathbb{I}(v = k)}{|m^{i}|}, \tag{2}$$

where $v$ is a pixel's class value, $K$ is the total number of classes, and $\mathbb{I}(\cdot)$ is the indicator function. Based on this score, we partition tokens into pure semantic cores ($T_{\mathrm{pure}}$) and impure structural boundaries ($T_{\mathrm{impure}}$). Specifically, a token is designated as pure if its purity score $P(m^{i})$ exceeds a purity threshold $\tau_{p}$. This distinction guides a differentiated modulation strategy:

$$\hat{f}_{s}^{i} = \begin{cases} (1-\lambda)\, f_{s}^{i} + \lambda\, f_{t}^{j} & \text{if } i \in T_{s,\mathrm{pure}} \\[4pt] \sigma_{\mathrm{mix}} \left(\dfrac{f_{s}^{i} - \mu_{s}^{\mathrm{impure}}}{\sigma_{s}^{\mathrm{impure}} + \epsilon}\right) + \mu_{\mathrm{mix}} & \text{if } i \in T_{s,\mathrm{impure}} \end{cases} \tag{3}$$

where $i$ and $j$ are indices for source and target tokens, respectively. For the pure case, $\lambda$ is a random mixing factor within the range $[0, 1]$, and $f_{t}^{j}$ is a target token of the same class as $f_{s}^{i}$, selected from a pool sorted by proximity to the mean of the target class to prioritize mixing with representative exemplars. For the impure case, $\mu_{s}^{\mathrm{impure}}$ and $\sigma_{s}^{\mathrm{impure}}$ are the mean and standard deviation computed exclusively over the set of source boundary tokens $\{f_{s}^{i} \mid i \in T_{s,\mathrm{impure}}\}$, and $\mu_{\mathrm{mix}}, \sigma_{\mathrm{mix}}$ are statistics interpolated from both source and target boundary tokens. The modulated tokens $\{\hat{f}_{s}^{i}\}$ are refolded to form the locally adapted map $F_{s,\mathrm{cross}}$. This hierarchical process is applied symmetrically to produce $F_{t \to s}$ and $F_{t,\mathrm{cross}}$. All four feature maps are passed through the DINOv3 encoder's final layer normalization to ensure distributional consistency for the decoder.
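The purity gate of Eq. (2) can be sketched as follows. The helper names are ours, and the `>=` comparison at $\tau_p = 1$ (so that only fully single-class sub-patches count as pure) is our reading of the paper's threshold:

```python
import numpy as np

def purity(patch, num_classes):
    """Purity score of a label sub-patch (Eq. 2): fraction of pixels
    belonging to the patch's dominant class."""
    counts = np.bincount(patch.ravel(), minlength=num_classes)
    return counts.max() / patch.size

def partition_tokens(label_patches, num_classes, tau_p=1.0):
    """Split token indices into pure semantic cores and impure
    structural boundaries, as HFM does before modulation."""
    pure, impure = [], []
    for i, m in enumerate(label_patches):
        # assumption: ">= tau_p" so tau_p = 1 admits fully pure patches
        (pure if purity(m, num_classes) >= tau_p else impure).append(i)
    return pure, impure
```

Pure tokens would then be routed to semantic interpolation and impure ones to boundary-statistics alignment, as in Eq. (3).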
While HFM furnishes structure-aware features, effective adaptation hinges on the anatomical plausibility of the pseudo-labels. We address this validation challenge next.

3.2 Hypergraph Plausibility Estimation

Reliable self-training requires anatomically plausible pseudo-labels, a quality beyond the reach of pixel-wise metrics. Therefore, we model each predicted segmentation map $M \in \{0,\dots,K-1\}^{H \times W}$ as a multi-level structural hypergraph $G = (V, E)$. The vertex set $V$ consists of all foreground pixels, while the hyperedge set is the union of Class Hyperedges $E_C = \{e_k\}_{k=1}^{K-1}$, capturing intra-class shape, and a single Layout Hyperedge $e_l$, representing inter-class spatial arrangement. We first assess the reliability of the vertex set $V$ through an aggregated score $S_{\mathrm{vertex}}$, which averages a pixel-wise weight $w_p$ over all foreground pixels:

$$S_{\mathrm{vertex}}(G) = \frac{1}{|V|} \sum_{p \in V} w_p, \tag{4}$$

where the weight $w_p = \left(1 - \frac{H(\bar{M}_p)}{\log K}\right) \cdot \left(1 - \frac{\mathrm{JSD}(\{M_p^n\})}{\log K}\right)$ combines certainty (from mean entropy) and consistency (from JSD). The set of foreground pixels is $V = \{p \mid M_p > 0\}$, $\bar{M}_p$ is the mean of $N_{\mathrm{ens}}$ teacher predictions, and the JSD is defined as $\mathrm{JSD}(\cdot) = H(\bar{M}_p) - \frac{1}{N_{\mathrm{ens}}} \sum_{n=1}^{N_{\mathrm{ens}}} H(M_p^n)$. Beyond vertex quality, we evaluate structural coherence by scoring the hyperedges. For intra-class shape, encoded by the Class Hyperedges $e_k \in E_C$, the score $S_{\mathrm{intra}}$ aggregates individual shape plausibility scores $S_{\varphi,k}$ using a softmax-based weighting that heavily penalizes malformed outliers:

$$S_{\mathrm{intra}}(G) = \frac{\sum_{k=1}^{K-1} S_{\varphi,k}\, \exp(-S_{\varphi,k}/\tau)}{\sum_{j=1}^{K-1} \exp(-S_{\varphi,j}/\tau)}, \tag{5}$$

where $S_{\varphi,k} = \exp(-|z_k|)$ is derived from the Z-score $z_k = (\varphi(e_k) - \mu_{B,\varphi,k}) / (\sigma_{B,\varphi,k} + \epsilon)$.
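The certainty-consistency weight $w_p$ entering Eq. (4) combines the entropy of the mean prediction with the ensemble's Jensen-Shannon divergence. A minimal NumPy sketch, assuming an ensemble tensor of per-pixel class distributions (function names and array layout are illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy along the last (class) axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def vertex_weights(probs):
    """Pixel reliability weights w_p used by S_vertex (Eq. 4).
    probs: (N_ens, P, K) ensemble of per-pixel class distributions.
    Combines certainty (entropy of the mean prediction) with
    consistency (JSD across the ensemble)."""
    k = probs.shape[-1]
    mean_p = probs.mean(axis=0)                  # mean prediction per pixel
    h_mean = entropy(mean_p)                     # H of the mean
    jsd = h_mean - entropy(probs).mean(axis=0)   # JSD of the ensemble
    log_k = np.log(k)
    return (1 - h_mean / log_k) * (1 - jsd / log_k)
```

A confident, ensemble-consistent pixel gets a weight near 1; a uniform (maximally uncertain) prediction gets a weight of 0.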
The shape descriptor $\varphi(e_k) = 4\pi\,\mathrm{Area}(M_k) / \left(\mathrm{Perimeter}(M_k)^2 + \epsilon\right)$ is the isoperimetric ratio of the class mask $M_k$; the subscript $B$ denotes numerically-stabilized statistics (mean $\mu$, std $\sigma$) computed over classes present in the current batch, and $\tau$ is a temperature parameter. Analogously, the inter-class arrangement, encoded by the Layout Hyperedge $e_l$, is evaluated by the score $S_{\mathrm{inter}}$, which aggregates plausibility scores derived from the relative direction cosines $\psi_{ij}$ between class centroids:

$$S_{\mathrm{inter}}(G) = \frac{\sum_{i,j=1}^{K-1} S_{\psi,ij}\, \exp(-S_{\psi,ij}/\tau)}{\sum_{u,v=1}^{K-1} \exp(-S_{\psi,uv}/\tau)}, \tag{6}$$

where $S_{\psi,ij}$ is computed in the same manner as $S_{\varphi,k}$, using the descriptor $\psi_{ij}$ and its corresponding batch-level mean and standard deviation. To derive a holistic measure of quality, the intra-class shape score $S_{\mathrm{intra}}$ and inter-class layout score $S_{\mathrm{inter}}$ are first linearly combined into a single structural plausibility metric. This metric then serves as a multiplicative gate for the vertex-level score $S_{\mathrm{vertex}}$, ensuring that predictions with high pixel confidence but poor anatomical structure are penalized:

$$S_{\mathrm{final}} = S_{\mathrm{vertex}}(G) \cdot \left(\alpha\, S_{\mathrm{intra}}(G) + (1-\alpha)\, S_{\mathrm{inter}}(G)\right), \tag{7}$$

where $\alpha$ is a fusion weight. Samples are selected for self-training only if $S_{\mathrm{final}}$ exceeds a dynamic threshold determined by the top-$\rho$ percentile of scores accumulated within the current epoch.

3.3 Structural Anomaly Pruning

A prediction deemed globally plausible by HPE may still contain specific class-level artifacts, such as spurious regions that appear inconsistently or vary significantly in size across augmented views. Our pruning stage is designed to identify and remove these structurally anomalous classes.
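Stepping back to HPE's aggregation in Eqs. (5)-(7): the Z-score plausibility, the outlier-penalizing softmax weighting, and the multiplicative gate can each be sketched in a few lines. The descriptor values below are illustrative inputs and the function names are ours:

```python
import numpy as np

def shape_plausibility(descriptors, eps=1e-6):
    """S_{phi,k} = exp(-|z_k|) from batch-level Z-scores of a shape
    descriptor such as the isoperimetric ratio 4*pi*Area/Perimeter^2."""
    d = np.asarray(descriptors, dtype=float)
    z = (d - d.mean()) / (d.std() + eps)
    return np.exp(-np.abs(z))

def softmax_penalized_score(scores, tau=0.1):
    """Aggregation of Eqs. 5-6: a softmax over -S/tau up-weights
    malformed outliers so one bad class drags the whole score down."""
    s = np.asarray(scores, dtype=float)
    w = np.exp(-s / tau)
    return float(np.sum(s * w) / np.sum(w))

def s_final(s_vertex, s_intra, s_inter, alpha=0.25):
    """Multiplicative gating of Eq. 7."""
    return s_vertex * (alpha * s_intra + (1 - alpha) * s_inter)
```

Because the aggregation weights low scores heavily, a single implausible class pulls the combined score well below the plain average, which is exactly the "penalize malformed outliers" behavior the text describes.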
Our method conceptualizes the stability of a predicted class $k$ by examining its structural signature across the ensemble of $N_{\mathrm{ens}}$ teacher predictions. This signature is defined as the vector of pixel counts $c_k = \langle C(M^1, k), \dots, C(M^{N_{\mathrm{ens}}}, k) \rangle$, representing the class's size under differently modulated features. A robust anatomical structure should exhibit a stable signature with low variance, whereas a model hallucination is likely to manifest as a volatile signature. We quantify this volatility with a Structural Instability Score $\Upsilon(k)$, defined as the coefficient of variation of the signature:

$$\Upsilon(k) = \frac{\mathrm{std}(c_k)}{\bar{c}_k + \epsilon}, \tag{8}$$

where $\bar{c}_k = \frac{1}{N_{\mathrm{ens}}} \sum_{n=1}^{N_{\mathrm{ens}}} c_{k,n}$ is the empirical mean of the pixel counts, $c_{k,n}$ denotes the count of class $k$ in the $n$-th prediction, and $\mathrm{std}(\cdot)$ computes the unbiased sample standard deviation. A class $k$ is deemed anomalous if its instability score exceeds a dynamic threshold $\theta_A$, set to the $q$-th percentile of the instability scores of all significant foreground classes within the batch. This defines the set of anomalous classes:

$$K_{\mathrm{anom}} = \{k \in \{1,\dots,K-1\} \mid \Upsilon(k) > \theta_A\}. \tag{9}$$

Finally, the consensus pseudo-label map $M$, which is generated from the teacher ensemble and has passed the HPE check, is pruned by masking all pixels belonging to the anomalous class set $K_{\mathrm{anom}}$. The refined map $M'$ is thus defined for each pixel $p$ as:

$$(M')_p = \begin{cases} (M)_p & \text{if } (M)_p \notin K_{\mathrm{anom}} \\ \texttt{ignore\_index} & \text{otherwise}. \end{cases} \tag{10}$$

Through this cascade of structure-aware adaptation and validation, SHAPE synthesizes a set of high-fidelity pseudo-labels. These serve as the supervision signal for the student model, guiding its adaptation to the target domain with anatomically plausible targets.

3.4 Overall Learning Objective

The SHAPE framework is trained with a composite objective.
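Before stating the objective, note that SAP's instability score of Eq. (8) and the percentile gate of Eq. (9) reduce to a few lines. This sketch takes per-class pixel-count signatures as plain arrays; the helper names and inputs are illustrative:

```python
import numpy as np

def instability(pixel_counts, eps=1e-6):
    """Structural Instability Score (Eq. 8): coefficient of variation
    of a class's pixel counts across the N_ens teacher predictions."""
    c = np.asarray(pixel_counts, dtype=float)
    return c.std(ddof=1) / (c.mean() + eps)   # unbiased sample std

def anomalous_classes(signatures, q=50.0):
    """Classes whose instability exceeds the q-th percentile threshold
    theta_A (Eq. 9). signatures: {class_id: pixel-count vector}."""
    scores = {k: instability(v) for k, v in signatures.items()}
    theta = np.percentile(list(scores.values()), q)
    return {k for k, s in scores.items() if s > theta}
```

Classes flagged here would then be masked to `ignore_index` as in Eq. (10).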
For the source domain, the supervised loss $\mathcal{L}_{\mathrm{sup}}$ trains the decoder $D$ on source images with ground-truth labels $L_s$. For domain robustness, it is computed as the average segmentation loss over the set of original and HFM-modulated features $\mathcal{F}_s = \{F_s, F_{s \to t}, F_{s,\mathrm{cross}}\}$:

$$\mathcal{L}_{\mathrm{sup}} = \frac{1}{|\mathcal{F}_s|} \sum_{F' \in \mathcal{F}_s} \mathcal{L}_{\mathrm{seg}}(D(F'), L_s). \tag{11}$$

For the target domain, the unsupervised loss $\mathcal{L}_{\mathrm{unsup}}$ is guided by the high-fidelity pseudo-labels $M'$. These are synthesized by our full validation pipeline (HPE and SAP), which processes a prediction ensemble generated by a teacher model $D_{\mathrm{ema}}$ from the modulated target features $\{F_t, F_{t \to s}, F_{t,\mathrm{cross}}\}$. The loss supervises the student's predictions on the subset of samples $B_{\mathrm{sel}}$ that pass the plausibility check, weighted by the pixel-wise certainty $w_p$:

$$\mathcal{L}_{\mathrm{unsup}} = \frac{1}{|B_{\mathrm{sel}}|} \sum_{i \in B_{\mathrm{sel}}} \mathcal{L}_{\mathrm{seg}}(D(F_t^i), (M')^i, w_p^i). \tag{12}$$

The total loss is $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{sup}} + \gamma_{\mathrm{unsup}} \mathcal{L}_{\mathrm{unsup}}$, where $\gamma_{\mathrm{unsup}}$ is a ramp-up weight. The teacher model $D_{\mathrm{ema}}$ is an Exponential Moving Average (EMA) [32] of the student decoder, and $\mathcal{L}_{\mathrm{seg}}$ is a standard segmentation loss, implemented as a combination of Dice [26] and Focal loss [25].

4 Experiments and Results

4.1 Experimental details

Datasets and metrics. Our experiments are conducted on a cardiac dataset and an abdominal dataset. The cardiac dataset employs the MMWHS [52] dataset, which comprises 20 3D CT scans and 20 3D MRI scans, with segmentation targets including the ascending aorta (A), left atrium blood cavity (LAC), left ventricle blood cavity (LVC), and myocardium of the left ventricle (MYO). The abdominal dataset consists of 30 abdominal CT images from the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge [23] and 20 T2-SPIR MRI images from the ISBI 2019 CHAOS Challenge [18], with segmentation targets for the liver (LIV), right kidney (RK), left kidney (LK), and spleen (SPL).
Each image is normalized to zero mean and unit variance, with affine transformations such as rotation and scaling applied. Performance is evaluated using the Dice score (DSC, %) and the average surface distance (ASD, mm).

Table 1: Quantitative comparison of different methods on the cardiac dataset (DSC↑ in %, ASD↓ in mm). In the paper, the best values are highlighted in bold and the second-best are underlined.

Cardiac MRI → Cardiac CT:

| Method | DSC A | DSC LAC | DSC LVC | DSC MYO | DSC Avg | ASD A | ASD LAC | ASD LVC | ASD MYO | ASD Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | 91.28 | 92.49 | 95.56 | 94.16 | 93.37 | 2.16 | 2.43 | 2.56 | 1.30 | 2.11 |
| W/o adaptation | 68.29 | 61.41 | 18.24 | 35.71 | 45.91 | 50.21 | 22.14 | 52.16 | 27.56 | 38.02 |
| CycleGAN [51] | 63.29 | 72.50 | 45.98 | 50.73 | 58.13 | 40.21 | 15.32 | 14.85 | 14.74 | 21.28 |
| AdaptSegNet [34] | 67.80 | 69.35 | 60.97 | 53.38 | 62.88 | 30.02 | 10.81 | 14.51 | 13.06 | 17.10 |
| ADVENT [35] | 73.77 | 68.46 | 61.03 | 57.21 | 65.12 | 17.31 | 16.96 | 18.88 | 15.83 | 17.25 |
| SIFA [2] | 82.72 | 75.21 | 75.41 | 65.17 | 74.63 | 12.13 | 8.66 | 9.21 | 10.88 | 10.22 |
| SASAN [33] | 82.22 | 75.78 | 79.26 | 68.44 | 76.43 | 13.12 | 10.75 | 10.43 | 11.12 | 11.36 |
| GenericSSL [36] | 82.02 | 77.18 | 84.28 | 67.65 | 77.78 | 3.23 | 8.72 | 6.14 | 8.45 | 6.64 |
| UPL-SFDA [39] | 85.41 | 74.78 | 85.09 | 71.44 | 79.18 | 8.01 | 7.74 | 7.60 | 10.44 | 8.45 |
| IPLC [48] | 87.63 | 78.21 | 86.11 | 71.68 | 80.91 | 5.65 | 7.55 | 3.96 | 8.31 | 6.37 |
| IPLC+ [47] | 63.69 | 86.15 | 86.71 | 89.17 | 81.43 | 4.21 | 4.14 | 3.34 | 3.95 | 3.91 |
| DDFP [45] | 72.03 | 85.30 | 89.64 | 90.86 | 84.46 | 4.52 | 3.88 | 3.02 | 3.72 | 3.79 |
| SHAPE | 79.58 | 92.18 | 94.53 | 94.03 | 90.08 | 2.73 | 3.26 | 2.72 | 1.76 | 2.62 |

Cardiac CT → Cardiac MRI:

| Method | DSC A | DSC LAC | DSC LVC | DSC MYO | DSC Avg | ASD A | ASD LAC | ASD LVC | ASD MYO | ASD Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | 82.22 | 83.21 | 90.46 | 81.74 | 84.41 | 2.23 | 2.29 | 2.73 | 2.35 | 2.40 |
| W/o adaptation | 36.56 | 46.49 | 49.23 | 15.35 | 36.91 | 35.46 | 15.90 | 16.54 | 24.32 | 23.06 |
| CycleGAN [51] | 34.57 | 65.08 | 75.13 | 59.57 | 58.59 | 21.24 | 13.27 | 7.25 | 15.51 | 14.32 |
| AdaptSegNet [34] | 47.68 | 62.45 | 73.91 | 61.64 | 61.42 | 20.84 | 13.34 | 11.32 | 16.46 | 15.49 |
| ADVENT [35] | 35.01 | 64.03 | 58.71 | 52.80 | 52.64 | 33.65 | 16.38 | 22.31 | 24.02 | 24.09 |
| SIFA [2] | 55.47 | 66.43 | 72.52 | 60.69 | 63.78 | 15.86 | 12.42 | 13.39 | 13.67 | 13.84 |
| SASAN [33] | 60.46 | 72.82 | 79.48 | 69.49 | 70.56 | 16.72 | 10.07 | 7.02 | 10.97 | 11.20 |
| GenericSSL [36] | 63.35 | 72.77 | 84.04 | 72.38 | 73.14 | 16.06 | 11.14 | 5.96 | 7.33 | 10.12 |
| UPL-SFDA [39] | 67.71 | 75.18 | 80.59 | 72.77 | 74.06 | 13.95 | 9.94 | 7.57 | 7.27 | 9.68 |
| IPLC [48] | 67.55 | 75.98 | 88.94 | 71.81 | 76.07 | 11.48 | 11.61 | 5.14 | 7.52 | 8.94 |
| IPLC+ [47] | 65.09 | 77.56 | 88.95 | 74.33 | 76.48 | 4.69 | 8.48 | 2.88 | 5.44 | 5.37 |
| DDFP [45] | 66.26 | 76.04 | 88.55 | 70.64 | 75.37 | 7.72 | 10.33 | 3.31 | 12.94 | 8.58 |
| SHAPE | 70.25 | 79.11 | 86.08 | 78.59 | 78.51 | 4.48 | 5.11 | 3.97 | 5.26 | 4.70 |

Table 2: Quantitative comparison of different methods on the abdominal dataset (DSC↑ in %, ASD↓ in mm). In the paper, the best values are highlighted in bold and the second-best are underlined.

Abdominal MRI → Abdominal CT:

| Method | DSC LIV | DSC RK | DSC LK | DSC SPL | DSC Avg | ASD LIV | ASD RK | ASD LK | ASD SPL | ASD Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | 90.59 | 93.78 | 91.82 | 92.43 | 92.16 | 2.45 | 1.16 | 0.96 | 1.85 | 1.61 |
| W/o adaptation | 39.36 | 26.14 | 42.46 | 52.36 | 40.08 | 32.16 | 18.11 | 25.31 | 35.26 | 27.71 |
| CycleGAN [51] | 81.94 | 83.12 | 81.38 | 80.88 | 81.83 | 12.63 | 8.51 | 5.19 | 16.06 | 10.60 |
| AdaptSegNet [34] | 80.07 | 84.45 | 82.17 | 83.03 | 82.43 | 9.97 | 5.68 | 6.19 | 19.22 | 10.27 |
| ADVENT [35] | 81.11 | 82.70 | 83.34 | 81.17 | 82.08 | 11.01 | 7.05 | 4.84 | 10.02 | 8.23 |
| SIFA [2] | 83.64 | 86.67 | 83.21 | 79.88 | 83.35 | 6.41 | 5.47 | 4.39 | 8.35 | 6.16 |
| SASAN [33] | 82.13 | 83.76 | 85.99 | 80.51 | 83.10 | 19.88 | 8.56 | 2.83 | 8.34 | 9.90 |
| GenericSSL [36] | 84.98 | 84.08 | 84.37 | 84.92 | 84.59 | 10.38 | 7.78 | 5.76 | 6.32 | 7.56 |
| UPL-SFDA [39] | 84.21 | 86.34 | 87.58 | 82.13 | 85.07 | 18.82 | 4.75 | 2.51 | 9.25 | 8.83 |
| IPLC [48] | 84.68 | 88.18 | 84.56 | 84.52 | 85.49 | 10.61 | 7.27 | 6.74 | 5.98 | 7.65 |
| IPLC+ [47] | 85.29 | 88.63 | 85.12 | 83.79 | 85.71 | 6.56 | 2.89 | 6.57 | 4.56 | 5.15 |
| DDFP [45] | 86.40 | 87.66 | 88.05 | 78.55 | 85.17 | 3.24 | 2.15 | 3.57 | 3.60 | 3.14 |
| SHAPE | 88.26 | 86.99 | 83.37 | 91.28 | 87.48 | 3.84 | 2.53 | 3.41 | 1.98 | 2.94 |

Abdominal CT → Abdominal MRI:

| Method | DSC LIV | DSC RK | DSC LK | DSC SPL | DSC Avg | ASD LIV | ASD RK | ASD LK | ASD SPL | ASD Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | 89.51 | 90.35 | 89.26 | 91.63 | 90.19 | 2.18 | 1.83 | 1.16 | 2.24 | 1.85 |
| W/o adaptation | 35.14 | 48.26 | 33.51 | 49.25 | 41.54 | 31.51 | 22.62 | 13.16 | 34.63 | 25.48 |
| CycleGAN [51] | 84.31 | 84.37 | 79.48 | 81.06 | 82.31 | 6.38 | 4.78 | 10.55 | 15.70 | 9.35 |
| AdaptSegNet [34] | 82.34 | 86.55 | 75.92 | 90.00 | 83.70 | 7.94 | 10.15 | 7.29 | 4.40 | 7.45 |
| ADVENT [35] | 79.64 | 81.49 | 75.77 | 81.15 | 79.51 | 10.28 | 7.05 | 9.96 | 19.68 | 11.74 |
| SIFA [2] | 83.65 | 86.79 | 77.19 | 89.03 | 84.17 | 18.77 | 13.86 | 5.23 | 4.26 | 10.53 |
| SASAN [33] | 82.75 | 85.26 | 78.92 | 90.36 | 84.32 | 6.50 | 14.52 | 7.09 | 3.44 | 7.89 |
| GenericSSL [36] | 82.43 | 86.88 | 80.52 | 89.79 | 84.91 | 8.15 | 9.11 | 6.78 | 2.79 | 6.71 |
| UPL-SFDA [39] | 84.39 | 86.65 | 78.44 | 90.76 | 85.06 | 6.61 | 11.57 | 9.92 | 6.19 | 8.57 |
| IPLC [48] | 82.30 | 88.11 | 81.31 | 87.64 | 84.84 | 11.72 | 3.06 | 6.19 | 3.78 | 6.19 |
| IPLC+ [47] | 80.80 | 88.30 | 85.71 | 85.41 | 85.06 | 13.68 | 2.41 | 2.90 | 3.33 | 5.58 |
| DDFP [45] | 82.49 | 86.43 | 87.15 | 89.01 | 86.27 | 11.57 | 2.96 | 2.76 | 3.39 | 5.17 |
| SHAPE | 86.83 | 88.86 | 88.30 | 83.58 | 86.89 | 3.69 | 2.14 | 1.86 | 3.55 | 2.81 |

Implementation details. Our PyTorch framework uses a pre-trained frozen DINOv3 ViT-S/16 encoder and a trainable UNet-style decoder. We train for 200 epochs on a single NVIDIA RTX 4090 with a batch size of 64 using the AdamW optimizer. The initial learning rate is 1×10⁻⁴ with a cosine annealing schedule, and the teacher model is updated via EMA with 0.9 momentum. All input images are resized to 256×256, replicated to 3 channels, and normalized using DINO statistics. Our data augmentation pipeline includes random contrast adjustments, random zoom, and random affine transformations. Key hyperparameters for our method are the HFM purity threshold τ_p = 1, the plausibility fusion weight α = 0.25, the selection percentile ρ starting at 0.1 with a sigmoid ramp-up, the anomaly threshold θ_A at the 50th percentile of instability scores, and the unsupervised loss weight γ_unsup = 1.

4.2 Comparison with state-of-the-art methods

We compared several state-of-the-art UDA methods, including alignment-based ones (CycleGAN [51], AdaptSegNet [34], ADVENT [35], SIFA [2], SASAN [33], DDFP [45]) and pseudo-labeling ones (GenericSSL [36], UPL-SFDA [39], IPLC [48], and IPLC+ [47]).

Figure 2: Qualitative results of SHAPE and typical methods. Yellow arrows indicate areas where SHAPE outperforms competing methods.

Figure 3: (a) Feature distribution before adaptation. (b) Feature distribution after adaptation by AdaIN. (c) Feature distribution after adaptation by HFM.

Quantitative comparison. Table 1 and Tab. 2 present the quantitative evaluation results for various UDA methods. The "W/o adaptation" results serve as the baseline lower bound, highlighting the significant domain gap between the source and target domains.
For example, in the Cardiac MRI → Cardiac CT scenario, the average DSC drops to 45.91% compared to the supervised upper bound. Similarly, the ASD increases to 38.02 mm, indicating poor segmentation quality and inaccurate boundary alignment when no adaptation is applied. Notably, on the cardiac dataset, our SHAPE alleviates the performance degradation caused by domain shifts and outperforms all existing approaches across both domain adaptation scenarios. For example, in the Cardiac MRI → Cardiac CT scenario, SHAPE achieves a DSC of 90.08%, a substantial 5.62-percentage-point improvement over the second-best method (DDFP [45]) that narrows the gap to the supervised upper bound to just 3.29 percentage points. Similarly, as shown in Tab. 2, the effectiveness of SHAPE is validated on the abdominal dataset, further demonstrating its capability. SHAPE's superior performance is driven by its synergistic pipeline. The HFM first generates domain-agnostic features through class-aware, spatially-differentiated alignment. These features produce high-quality initial pseudo-labels, which are then rigorously validated: HPE discards anatomically nonsensical predictions, while SAP purges the remaining class-level artifacts. This ensures self-training on high-fidelity labels, which directly improves segmentation performance.

Qualitative comparison. Visualizations of the final segmentation outputs in Fig. 2 illustrate that our SHAPE produces fewer false predictions compared to other methods. To understand the underlying mechanism driving this improvement, we analyze the 3D t-SNE projection of patch-level features from the cardiac dataset in Fig. 3. Initially, Fig. 3(a) reveals a clear domain gap, with source (circles) and target (squares) features occupying separate regions, and their respective class centroids far apart. Global AdaIN, shown in Fig. 3(b), attempts a coarse alignment by pulling the target centroids (crossed markers) towards the source centroids.
However, this monolithic transformation induces a severe distributional contraction, whereby target features from all classes are indiscriminately aggregated and lose their inter-class separability. This homogenization of feature representations is highly detrimental to a pixel-level segmentation task. In stark contrast, Fig. 3(c) demonstrates HFM’s structure-aware approach. The target class centroids (stars) are precisely aligned with their source counterparts, achieving superior domain adaptation. Crucially, this alignment occurs while preserving the intra-class distributional structure. The feature points of each class maintain their inherent variance and relative organization around their newly aligned centroids, avoiding the homogenization characteristic of AdaIN. This empirically validates that HFM successfully bridges the domain gap while retaining the fine-grained, discriminative feature structure essential for accurate segmentation.

4.3 Ablation study

Effectiveness of Individual Components. We first evaluate the contribution of each module by incrementally adding them to a strong baseline that already incorporates a DINOv3 backbone. The results, summarized in Tab. 3, demonstrate the efficacy of each component across two domain adaptation tasks. The baseline itself achieves a respectable DSC of 82.02% on the MRI → CT task. Integrating our HFM provides the most significant individual performance improvement, lifting the DSC by 3.65 percentage points to 85.67%. This underscores the critical importance of moving beyond global alignment to a class-aware, structure-preserving feature modulation strategy. Adding HPE alone also yields a consistent improvement, confirming that validating the anatomical plausibility of pseudo-labels is an effective strategy in its own right. The full SHAPE model, integrating all three modules, achieves the highest performance, reaching a DSC of 90.08% on MRI → CT and 78.51% on CT → MRI.
The significant increase in performance when all components are active confirms that they are not merely additive but work synergistically to achieve the final result.

Table 3: Ablation study of the core components of SHAPE on the cardiac dataset. We evaluate the individual and combined contributions of our main modules over a strong baseline. The best results are highlighted in bold.

Configuration               HFM  HPE  SAP   MRI → CT (DSC↑ / ASD↓)   CT → MRI (DSC↑ / ASD↓)
(a) Baseline                                82.02 / 4.63             71.58 / 6.86
Individual Component Contributions
(b) Baseline + HFM           ✓              85.67 / 3.43             75.46 / 5.36
(c) Baseline + HPE                ✓         82.71 / 4.83             72.09 / 6.49
Combined Component Contributions
(d) Baseline + HFM + HPE     ✓    ✓         85.80 / 3.24             75.81 / 5.17
(e) Baseline + HFM + SAP     ✓         ✓    86.03 / 3.02             76.23 / 5.13
(f) SHAPE (Full Model)       ✓    ✓    ✓    90.08 / 2.62             78.51 / 4.70

Visual Analysis of Feature Modulation. To visually substantiate why HFM is superior to global alignment, we visualize the feature maps of a target domain image in Fig. 4. The “Original feature” map demonstrates that the initial DINOv3 features already capture the anatomical structure within the region of interest (red dotted line). Applying global AdaIN, however, significantly disrupts this representation. The activations lose their structural coherence and become disorganized, failing to preserve the semantic boundaries of the anatomy. In stark contrast, the “After HFM” map exhibits a remarkable semantic refinement. The activations become concentrated within the anatomical boundary, resulting in a cleaner and more discriminative feature representation. This visually demonstrates that HFM preserves feature quality through structure-aware alignment, whereas global AdaIN degrades it.

Figure 4: Comparison of the effectiveness of different feature modulation methods. Take the class inside the red dotted line as an example.

Hyperparameter Sensitivity Analysis. To evaluate the robustness of SHAPE, we analyze its sensitivity to key hyperparameters in Fig. 5.
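To make the schedule-style hyperparameters from the implementation details concrete (ρ starting at 0.1 with a sigmoid ramp-up; θ_A taken at the 50th percentile of instability scores per foreground class), the following sketch shows one plausible implementation. The ramp length and the `instability`/`labels` inputs are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def sigmoid_rampup(epoch, ramp_epochs=40, rho_start=0.1, rho_end=1.0):
    """Sigmoid ramp-up of the selection percentile rho (schedule shape as in
    Tarvainen & Valpola [32]; the 40-epoch ramp length is an assumption)."""
    t = np.clip(epoch / ramp_epochs, 0.0, 1.0)
    return rho_start + (rho_end - rho_start) * float(np.exp(-5.0 * (1.0 - t) ** 2))

def anomaly_threshold(instability, labels, q=50):
    """Per-class theta_A: the q-th percentile of (hypothetical) per-pixel
    instability scores, computed separately for each foreground class
    present in the batch (class 0 is treated as background)."""
    return {c: np.percentile(instability[labels == c], q)
            for c in np.unique(labels) if c != 0}
```

Computing θ_A per class from batch statistics is what the sensitivity analysis credits for SHAPE's robustness at small batch sizes.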
Parameters governing trade-offs, such as the plausibility fusion weight α and anomaly threshold θ_A, exhibit clear optimal regions, validating our settings. Performance consistently improves with a stricter purity threshold τ_p and a larger batch size, as expected. Notably, while HPE benefits from larger batches for stable statistics, the framework remains effective at smaller sizes. This is because the SAP threshold θ_A is derived from the distribution of every foreground class within the batch, providing a robust statistical basis even with fewer samples. Overall, the analysis confirms SHAPE’s stability and practical applicability across a reasonable range of values.

Figure 5: Hyperparameter sensitivity analysis of all adaptation segmentation tasks. The vertical axis is the mean Dice score (%) of all classes.

5 Conclusions

In this work, we propose SHAPE, a novel framework for unsupervised domain adaptation in medical image segmentation. SHAPE first performs class-aware feature alignment through Hierarchical Feature Modulation (HFM) to overcome the limitations of semantically unaware adaptation. It then enforces the anatomical plausibility of pseudo-labels through a dual-validation pipeline, which uses Hypergraph Plausibility Estimation (HPE) to assess global coherence and Structural Anomaly Pruning (SAP) to purge local artifacts. Extensive experiments and ablation studies confirm that SHAPE significantly outperforms existing state-of-the-art methods in segmentation performance.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant No. 62572401, No. 62222311, and 62322112], and the Key Research and Development Program of Shaanxi [Program No. 2025SF-YBXM-424].

References

[1] Z. Cai, J. Xin, C. You, P. Shi, S. Dong, N. C. Dvornek, N. Zheng, and J. S. Duncan (2025) Style mixup enhanced disentanglement learning for unsupervised domain adaptation in medical image segmentation. Medical Image Analysis 101, p. 103440.
Cited by: §2.2. [2] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng (2020) Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE Transactions on Medical Imaging 39 (7), p. 2494–2505. Cited by: §1, §4.2, Table 1, Table 2. [3] S. Chen, Z. S. Gamechi, F. Dubost, G. van Tulder, and M. de Bruijne (2022) An end-to-end approach to segmentation in medical images with CNN and posterior-CRF. Medical Image Analysis 76, p. 102311. Cited by: §2.3. [4] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton (2020) Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, p. 22243–22255. Cited by: §2.1. [5] X. Chen, S. Xie, and K. He (2021-10) An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, p. 9640–9649. Cited by: §2.1. [6] J. Cheng, J. Ye, Z. Deng, J. Chen, T. Li, H. Wang, Y. Su, Z. Huang, J. Chen, L. Jiang, H. Sun, J. He, S. Zhang, M. Zhu, and Y. Qiao (2023) SAM-Med2D. arXiv preprint arXiv:2308.16184. Cited by: §1. [7] A. Demir, E. Massaad, and B. Kiziltan (2023) Topology-aware focal loss for 3D image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 580–589. Cited by: §2.3. [8] Q. Dou, C. Ouyang, C. Chen, H. Chen, B. Glocker, X. Zhuang, and P. Heng (2019) PnP-AdaNet: Plug-and-play adversarial domain adaptation network at unpaired cross-modality cardiac segmentation. IEEE Access 7, p. 99065–99076. Cited by: §1. [9] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao (2024) A survey on self-supervised learning: algorithms, applications, and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (12), p. 9052–9071. Cited by: §2.1. [10] Q. Guo, Y. Wang, Y. Zhang, H. 
Qi, Y. Hu, and Y. Jiang (2026) Hyper-BTS: Brain tumor segmentation based on hypergraph guidance. Pattern Recognition 169, p. 111926. External Links: ISSN 0031-3203, Document Cited by: §2.3. [11] X. Han, L. Qi, Q. Yu, Z. Zhou, Y. Zheng, Y. Shi, and Y. Gao (2021) Deep symmetric adaptation network for cross-modality medical image segmentation. IEEE Transactions on Medical Imaging 41 (1), p. 121–132. Cited by: §1. [12] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2022) Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 16000–16009. Cited by: §2.1. [13] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 9729–9738. Cited by: §2.1. [14] J. Hong, S. C. Yu, and W. Chen (2022) Unsupervised domain adaptation for cross-modality liver segmentation via joint adversarial learning and self-learning. Applied Soft Computing 121, p. 108729. Cited by: §2.2. [15] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, p. 1501–1510. Cited by: §2.2, §3.1. [16] W. Jing, J. Wang, D. Di, D. Li, Y. Song, and L. Fan (2025) Multi-modal hypergraph contrastive learning for medical image segmentation. Pattern Recognition 165, p. 111544. Cited by: §2.3. [17] C. Katar, O. Eryilmaz, and E. Eksioglu (2025) Att-Next for skin lesion segmentation with topological awareness. Expert Systems with Applications, p. 127637. Cited by: §2.3. [18] A. E. Kavur, N. S. Gezer, M. Baris, S. Aslan, P. Conze, V. Groza, D. D. Pham, S. Chatterjee, P. Ernst, S. Özkan, et al. (2021) CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation. Medical Image Analysis 69, p. 101950. Cited by: §4.1. [19] D. Kim, M. Seo, K. Park, I. Shin, S. 
Woo, I. S. Kweon, and D. Choi (2023) Bidirectional domain mixup for domain adaptive semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, p. 1114–1123. Cited by: §2.2. [20] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, et al. (2023) Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, p. 4015–4026. Cited by: §2.1. [21] T. Koleilat, H. Asgariandehkordi, H. Rivaz, and Y. Xiao (2024) MedCLIP-SAM: Bridging text and image towards universal medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, p. 643–653. Cited by: §2.3. [22] S. Kumari and P. Singh (2024) Deep learning for unsupervised domain adaptation in medical imaging: Recent advancements and future perspectives. Computers in Biology and Medicine 170, p. 107912. External Links: ISSN 0010-4825 Cited by: §1. [23] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, and A. Klein (2017) Multi-atlas labeling beyond the cranial vault. Note: Dataset URL: https://w.synapse.org/Synapse:syn3193805/wiki/89480 Cited by: §4.1. [24] H. Lin, F. Schiffers, S. López-Tapia, N. Tavakoli, D. Kim, and A. K. Katsaggelos (2026) DRL-STNet: Unsupervised domain adaptation for cross-modality medical image segmentation via disentangled representation learning. In Fast, Low-Resource, Accurate Robust Organ and Pan-cancer Segmentation, Cham, p. 178–194. External Links: ISBN 978-3-031-96202-8 Cited by: §2.2. [25] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2020) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence 42 (02), p. 318–327. Cited by: §3.4. [26] F. Milletari, N. Navab, and S. Ahmadi (2016) V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), p. 565–571. Cited by: §3.4. [27] E. 
Panfilov, A. Tiulpin, S. Klein, M. T. Nieminen, and S. Saarakkala (2019-10) Improving robustness of deep learning based knee mri segmentation: mixup and adversarial domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Cited by: §2.2. [28] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei (2022) Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366. Cited by: §2.1. [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, p. 8748–8763. Cited by: §2.1. [30] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2015, p. 234–241. Cited by: §1. [31] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. (2025) DINOv3. arXiv preprint arXiv:2508.10104. Cited by: §1, §2.1, §3.1. [32] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, Vol. 30. Cited by: §3.4. [33] D. Tomar, M. Lortkipanidze, G. Vray, B. Bozorgtabar, and J. Thiran (2021) Self-attentive spatial adaptive normalization for cross-modality domain adaptation. IEEE Transactions on Medical Imaging 40 (10), p. 2926–2938. Cited by: §1, §4.2, Table 1, Table 2. [34] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018-06) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, p. 7472–7481. Cited by: §1, §4.2, Table 1, Table 2. [35] T. Vu, H. Jain, M. Bucher, M. 
Cord, and P. Pérez (2019) ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 2517–2526. Cited by: §1, §4.2, Table 1, Table 2. [36] H. Wang and X. Li (2024) Towards generic semi-supervised framework for volumetric medical image segmentation. Advances in Neural Information Processing Systems 36, p. 1833–1848. Cited by: §1, §4.2, Table 1, Table 2. [37] J. Wang, L. Fan, W. Jing, D. Di, Y. Song, S. Liu, and C. Cong (2025) Hypergraph tversky-aware domain incremental learning for brain tumor segmentation with missing modalities. In International Conference on Medical Image Computing and Computer-Assisted Intervention, p. 283–293. Cited by: §2.3. [38] P. Wang, J. Peng, M. Pedersoli, Y. Zhou, C. Zhang, and C. Desrosiers (2023) Shape-aware joint distribution alignment for cross-domain image segmentation. IEEE Transactions on Medical Imaging 42 (8), p. 2338–2347. Cited by: §1. [39] J. Wu, G. Wang, R. Gu, T. Lu, Y. Chen, W. Zhu, T. Vercauteren, S. Ourselin, and S. Zhang (2023) UPL-SFDA: Uncertainty-aware pseudo label guided source-free domain adaptation for medical image segmentation. IEEE Transactions on Medical Imaging 42 (12), p. 3932–3943. Cited by: §1, §4.2, Table 1, Table 2. [40] Y. Wu, D. Inkpen, and A. El-Roby (2020) Dual mixup regularized learning for adversarial domain adaptation. In European Conference on Computer Vision, p. 540–555. Cited by: §2.2. [41] J. Xian, X. Li, D. Tu, S. Zhu, C. Zhang, X. Liu, X. Li, and X. Yang (2023) Unsupervised cross-modality adaptation via dual structural-oriented guidance for 3D medical image segmentation. IEEE Transactions on Medical Imaging 42 (6), p. 1774–1785. Cited by: §2.2. [42] H. Xu and Y. Wu (2024) G2ViT: Graph neural network-guided vision transformer enhanced network for retinal vessel and coronary angiograph segmentation. Neural Networks 176, p. 106356. Cited by: §2.3. [43] Z. Xu, H. Gong, X. 
Wan, and H. Li (2023) ASC: Appearance and structure consistency for unsupervised domain adaptation in fetal brain MRI segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, and R. Taylor (Eds.), p. 325–335. Cited by: §1. [44] D. Yin, W. Huang, Z. Xiong, and X. Chen (2023) Class-aware feature alignment for domain adaptative mitochondria segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, p. 238–248. Cited by: §2.2. [45] S. Yin, S. Liu, and M. Wang (2025) DDFP: Data-dependent frequency prompt for source free domain adaptation of medical image segmentation. Knowledge-Based Systems, p. 113651. Cited by: §2.2, §4.2, §4.2, Table 1, Table 2. [46] H. Zeng, K. Zou, Z. Chen, R. Zheng, and H. Fu (2024) Reliable Source Approximation: Source-free unsupervised domain adaptation for vestibular schwannoma MRI segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, p. 622–632. Cited by: §1. [47] G. Zhang, X. Qi, J. Wu, B. Yan, and G. Wang (2025) IPLC+: SAM-guided iterative pseudo label correction for source-free domain adaptation in medical image segmentation. IEEE Journal of Biomedical and Health Informatics. Cited by: §4.2, Table 1, Table 2. [48] G. Zhang, X. Qi, B. Yan, and G. Wang (2024) IPLC: Iterative pseudo label correction guided by SAM for source-free domain adaptation in medical image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, p. 351–360. Cited by: §1, §1, §4.2, Table 1, Table 2. [49] X. Zhang, Y. Wu, E. Angelini, A. Li, J. Guo, J. M. Rasmussen, T. G. O’Connor, P. D. Wadhwa, A. P. Jackowski, H. Li, et al. (2024) MAPSeg: Unified unsupervised domain adaptation for heterogeneous medical image segmentation based on 3D masked autoencoding and pseudo-labeling. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 5851–5862. Cited by: §1, §1. [50] Z. Zhao, F. Zhou, K. Xu, Z. Zeng, C. Guan, and S. K. Zhou (2022) LE-UDA: Label-efficient unsupervised domain adaptation for medical image segmentation. IEEE Transactions on Medical Imaging 42 (3), p. 633–646. Cited by: §1. [51] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017-10) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, p. 2223–2232. Cited by: §1, §4.2, Table 1, Table 2. [52] X. Zhuang and J. Shen (2016) Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis 31, p. 77–87. Cited by: §4.1.