
Paper deep dive

Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Fangyijie Wang, Tanya Akumu, Vien Ngoc Dang, Amelia Jiménez-Sánchez, Jieyun Bai, Guénolé Silvestre, Karim Lekadir, Kathleen M. Curran

Year: 2026 · Venue: arXiv preprint · Area: eess.IV · Type: Preprint

Abstract

Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.




Full Text


Affiliations:
1. Research Ireland Centre for Research Training in Machine Learning
2. Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain
3. School of Medicine, University College Dublin, Dublin, Ireland
4. School of Computer Science, University College Dublin, Dublin, Ireland
5. Institució Catalana de Recerca i Estudis Avançats (ICREA)
6. Department of Cardiovascular Surgery, The First Affiliated Hospital of Jinan University, Jinan University, Guangzhou, China
7. Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand

† Equal contribution. Corresponding authors: fangyijie.wang@ucdconnect.ie, kathleen.curran@ucd.ie

1 Introduction

Ultrasound imaging is a cornerstone of clinical care, including obstetrics [15], cardiology [25], oncology [14], and point-of-care (POC) settings [21]. It enables rapid, non-invasive, and cost-effective assessment of diverse anatomical structures at the bedside. However, ultrasound image appearance varies substantially across operators, devices, and acquisition protocols, complicating robust generalization [19, 5, 24]. Despite recent advances in Deep Learning (DL) for ultrasound, most models focus on isolated task instances (e.g., single-organ segmentation [11], multi-organ classification [10], or multi-organ segmentation [2]) or on limited combinations of tasks (e.g., joint classification and segmentation [9]), rather than enabling unified multi-organ, multi-task analysis. Such specialization limits clinical applicability, as real-world workflows require simultaneous multi-organ, multi-task assessment. Foundation models therefore aim to streamline deployment, promote cross-task knowledge sharing, and enable comprehensive ultrasound analysis [1].
However, developing a single unified model that reliably performs segmentation, detection, classification, and regression across heterogeneous ultrasound tasks remains an open challenge. Recent multi-task and foundation-style approaches aim to unify clinical tasks within a single model, thereby simplifying deployment and enabling cross-task knowledge sharing [9, 2, 10, 13]. While these methods report promising results on selected task combinations, a systematic study of how task aggregation strategies influence performance across organs and task types is still lacking. In particular, it remains unclear which tasks can be effectively unified without inducing negative transfer, and how training data scale modulates such interactions.

To address these questions, we introduce the Multi-organ and Multi-task DINO framework (M2DINO), a DINOv3-based encoder augmented with task-conditioned Mixture-of-Experts (MoE) blocks for large-scale multi-task learning across 27 ultrasound tasks spanning segmentation, classification, detection, and regression. We investigate three training paradigms: (1) task-specific (TS) training, where each task is optimized independently; (2) clinically-grouped (CG) training, where clinically related tasks are trained jointly; and (3) all-task unified (AU) training, where all tasks are learned simultaneously within a single model.

Our contributions are: (1) We introduce M2DINO, a unified multi-organ, multi-task ultrasound framework built on DINOv3 with task-conditioned MoE for adaptive capacity allocation across heterogeneous task objectives. (2) We introduce a structured framework for evaluating clinical task aggregation and compatibility across organ systems and prediction types. (3) Through experiments on 27 tasks, we show scale-dependent aggregation effects, identify conditions under which CG training induces negative transfer, and provide practical design guidelines for developing unified ultrasound foundation models.
2 Methodology

This section first formalizes the problem setting and training paradigms (Section 2.1). We then detail the proposed M2DINO architecture, including the backbone [22], the task-conditioned MoE, and the heads with the multi-task objective (Sections 2.2–2.4). Fig. 1 illustrates an overview of the M2DINO framework.

2.1 Problem Setting and Training Paradigms

Let $\mathcal{D} = \{\mathcal{D}_t\}_{t=1}^{T}$ denote a collection of $T$ ultrasound tasks spanning segmentation, classification, regression, and detection, covering diverse anatomical regions. Each task $\mathcal{D}_t = \{(X_i^t, Y_i^t)\}$ consists of ultrasound images $X_i^t$ and task-specific labels $Y_i^t$. Under the unified training paradigms, our objective is to learn a shared encoder $f_\theta$ that maps an input image $X$ to a latent representation, which is subsequently consumed by heterogeneous task-specific prediction heads. We study how different task aggregation strategies affect model performance and transfer behavior within a common DINOv3-based foundation model. Specifically, we evaluate three training paradigms in a controlled comparison setting:

• task-specific (TS): A separate DINOv3 model is trained independently for each task $t$, without any parameter sharing or cross-task interaction.
• clinically-grouped (CG): Tasks are jointly trained within predefined clinical groups based on shared organ systems and examination context (e.g., obstetric tasks (OB), breast imaging tasks (Breast), and lung ultrasound tasks (Lung)). Each group shares a DINOv3 Vision Transformer (ViT) encoder and task-conditioned MoE routing while optimizing heterogeneous prediction objectives (segmentation, classification, detection, or regression).
• all-task unified (AU): All $T$ tasks are trained simultaneously within a single shared DINOv3 ViT encoder.
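In the unified paradigms (CG/AU), training reduces to minimizing a weighted sum of per-task losses computed on top of the shared encoder. A minimal sketch of that combination, assuming illustrative (hypothetical) task names and loss values:

```python
def unified_loss(task_losses, weights=None):
    """Combine per-task losses: L = sum_t lambda_t * L_t.

    task_losses: dict mapping task id -> scalar loss for the current batch.
    weights: optional dict of balancing coefficients lambda_t; equal
    weighting (lambda_t = 1) is used unless otherwise specified.
    """
    if weights is None:
        weights = {t: 1.0 for t in task_losses}
    return sum(weights[t] * loss for t, loss in task_losses.items())

# Illustrative per-task losses for one unified training step
# (task names and values are hypothetical, not from the paper).
losses = {"fetal_head_seg": 0.42, "lung_cls": 0.18, "cervical_len_reg": 0.31}
total = unified_loss(losses)  # equal weighting: 0.42 + 0.18 + 0.31 = 0.91
```

Non-uniform coefficients would be passed via `weights`; the paper's controlled comparison keeps them equal.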
For the unified settings (CG and AU), the multi-task loss function is defined as

$\mathcal{L} = \sum_{t=1}^{T} \lambda_t \mathcal{L}_t$,

where $T$ denotes the number of tasks trained jointly in the current paradigm (ranging from 3 to 27), $\mathcal{L}_t$ denotes the task-specific loss, and $\lambda_t$ denotes the balancing coefficients. Unless otherwise specified, losses are equally weighted. In our controlled comparison, all training paradigms share the same pre-trained DINOv3 backbone, MoE configuration (when enabled), input resolution, data pre-processing, and optimization settings. As the effective training data size varies across paradigms (e.g., TS vs. AU), we perform a limited learning rate search within a fixed range for each setting to ensure stable optimization, using a consistent validation-based selection protocol. Table 1 summarizes the experimental settings for each training paradigm.

Table 1: Experimental settings for evaluating different training paradigms. Seg: segmentation; Cls: classification; Reg: regression; Det: detection; MO: multi-organ. Task Types marks the applicable types among Reg/Cls/Seg/Det (one ✓ per type, as extracted).

| Training Paradigm | MoE | # Tasks | # Images | Task Types | Target Anatomy | Clinical Group |
|---|---|---|---|---|---|---|
| TS | ✗ | 1 | 1,144 | ✓ | Breast | Breast |
| TS | ✗ | 1 | 1,776 | ✓ | Breast | Breast |
| TS | ✗ | 1 | 208 | ✓ | Cervical | OB |
| TS | ✗ | 1 | 2,818 | ✓ | PS/Fetal head | OB |
| TS | ✗ | 1 | 762 | ✓ | Fetal abdomen | OB |
| TS | ✗ | 1 | 624 | ✓ | Fetal femur | OB |
| TS | ✗ | 1 | 1,849 | ✓ | Fetal head | OB |
| TS | ✗ | 1 | 5,952 | ✓ | Fetal organs | OB |
| TS | ✗ | 1 | 483 | ✓ | Fetal breech | OB |
| TS | ✗ | 1 | 1,482 | ✓ | Lung | Lung |
| TS | ✗ | 1 | 772 | ✓ | Lung | Lung |
| CG | ✓ | 7 | 11,910 | ✓ ✓ ✓ | Fetal anatomy | OB |
| CG | ✓ | 3 | 2,254 | ✓ ✓ | Lung | Lung |
| CG | ✓ | 3 | 2,920 | ✓ ✓ | Breast | Breast |
| AU | ✓ | 27 | 32,311 | ✓ ✓ ✓ ✓ | MO | All |

Figure 1: Overview of our M2DINO framework. (a) Ultrasound images are processed by a shared DINOv3 encoder augmented with task-conditioned MoE blocks. The unified representation is optimized for segmentation, detection, regression, and classification via task-specific prediction heads. Frozen and trainable components are indicated. (b) A conceptual comparison of the three training paradigms.
Although the architecture remains the same, task-specific (TS), clinically-grouped (CG), and all-task unified (AU) training differ in how tasks are aggregated during training and in whether the MoE is enabled.

2.2 DINO Backbone

We use the pre-trained DINOv3 [22] model as our encoder backbone. DINOv3 provides ViT-S/B/L variants; we adopt the ViT-B/16 backbone to balance model capacity with dataset scale and computational efficiency. Given an ultrasound image, we convert it to RGB format to obtain the input $X \in \mathbb{R}^{3 \times H \times W}$. The DINOv3-based ViT encoder $f_\theta$ produces token embeddings $Z$ and corresponding spatial feature maps $F$: $(Z, F) = f_\theta(X)$. Unlike prior multi-task formulations [23], we use spatial feature maps as the unified interface across all tasks. Downstream task-specific heads, including a dense prediction transformer (DPT) decoder [18] for segmentation, take the feature maps $F$ as input. Although $f_\theta$ also produces token embeddings, we use only the spatial feature maps $F$ for downstream heads. This design provides a consistent dense feature representation across segmentation, detection, classification, and regression tasks. Using global token pooling (e.g., the classification token) could favor global prediction tasks over dense prediction tasks such as segmentation and detection. By adopting the feature maps $F$ as the unified interface, we maintain architectural consistency and isolate the effect of task grouping in our compatibility analysis.

2.3 Mixture of Experts with Task-Conditioned Routing

To mitigate task interference in the unified training paradigms (CG/AU), we integrate task-conditioned MoE blocks into the DINOv3 encoder, inspired by [8, 12]. Each task is assigned a unique identifier $t$, which is mapped to a learnable embedding vector $e_t = \mathrm{Embedding}(t)$. The gating network (shown in Fig. 1) conditions expert selection on both the token embeddings $h$ and the task embedding $e_t$: $g(h, e_t) = \mathrm{Softmax}(W_g [h; e_t])$.
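Putting the task embedding, the gating over the concatenation $[h; e_t]$, and the resulting expert mixture of Section 2.3 together, a minimal NumPy sketch follows. The dimensions, the number of experts, and the use of simple linear experts are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_T, K = 8, 4, 3  # token dim, task-embedding dim, number of experts (all illustrative)

W_g = rng.normal(size=(K, D + D_T))                    # gating weights over [h; e_t]
experts = [rng.normal(size=(D, D)) for _ in range(K)]  # linear experts E_i (assumption)
task_emb = rng.normal(size=(5, D_T))                   # one learnable embedding per task id

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def moe_block(h, task_id):
    """g(h, e_t) = softmax(W_g [h; e_t]); output h' = sum_i g_i * E_i(h)."""
    e_t = task_emb[task_id]
    g = softmax(W_g @ np.concatenate([h, e_t]))        # K expert weights, sum to 1
    return sum(g_i * (E_i @ h) for g_i, E_i in zip(g, experts))

h = rng.normal(size=D)          # one token embedding
out = moe_block(h, task_id=2)   # output has the same shape as the input token
```

Because the gate sees the task embedding, the same token routed under a different `task_id` generally yields a different expert mixture, which is the mechanism for task-adaptive capacity.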
The output of the MoE block is computed as a weighted combination of expert outputs: $h' = \sum_{i=1}^{K} g_i(h, e_t)\, E_i(h)$, where $E_i$ denotes the $i$-th expert and $K$ is the total number of experts. This design enables task-adaptive capacity while maintaining a shared backbone. Instead of integrating MoE into all ViT layers, we integrate the MoE blocks into the later layers (layers 7–12, i.e., the last six layers). Early transformer layers tend to encode generic low-level image representations, while later layers encode task-specific representations [4]. Restricting MoE blocks to the later layers enables efficient conditional capacity allocation. Given the scale of our dataset (32,311 training and 8,077 validation samples), we adopt a partial-MoE design to balance task-adaptive capacity with computational efficiency.

2.4 Task-Specific Heads and Multi-Task Learning

For the four task types (segmentation, classification, regression, and detection), we develop four lightweight heads to improve computational efficiency. Let $F$ denote the shared feature maps produced by $f_\theta$. Each task employs a lightweight prediction head $h_t$ to output $\hat{y}_t = h_t(F)$, where the head parameters are task-specific. For segmentation tasks, we adopt a DPT-style [18] decoder to generate dense pixel-wise predictions. Classification and regression tasks use global pooling followed by fully connected layers, while detection tasks adopt a task-specific detection head. Each task is optimized using an appropriate loss function $\mathcal{L}_t$: Dice loss for segmentation, cross-entropy loss for classification, and L1 loss for regression. For detection, we use a single-stage detection loss combining focal loss for pixel-wise supervision and Smooth L1 loss for normalized bounding-box regression at the corresponding ground-truth center cell. For the unified settings (CG/AU), the overall objective is defined in Section 2.1.

3 Experiments

Dataset.
The dataset is designed to evaluate the model's ability to generalize across four fundamental task categories:

• Segmentation (12 tasks): Pixel-level annotations for fetal organs (e.g., the head, heart, and abdomen), maternal structures, and lesions. The training set contains 16,615 samples and the test set includes 2,674 samples.
• Classification (9 tasks): Includes fetal standard-plane and fetal position classification, lung disease recognition, and tumor malignancy assessment. The training set has 16,361 samples, and the test set has 2,727 samples.
• Detection (3 tasks): Localization of thyroid nodules, uterine fibroids, and spinal cord injuries (4,333 training / 725 test samples).
• Regression (3 tasks): Biometric measurements including angle of progression, cervical length, and fetal femur length. The training set includes 3,078 samples, and the test set contains 617 samples.

During training, 20% of the training data are held out for validation.

Implementation Details. All methods were trained for 200 epochs with a batch size of 16 using AdamW (initial learning rate 1e-5, weight decay 1e-4). The backbone learning rate was set to 2e-5, the DPT head to 1e-5, the MoE to 2e-4, and the task-specific heads to 1e-3 to accelerate convergence. The implementation is based on PyTorch (2.1.2) and Segmentation Models PyTorch [7] with CUDA (12.2), and experiments were conducted on an NVIDIA 4090 GPU. Models were evaluated on the validation set after each epoch, and the best-performing model's weights were saved. Data augmentation and preprocessing followed standard protocols. Full implementation details and code are available at: GitHub.

Evaluation Metrics. We define standardized evaluation metrics for each task type:

Segmentation: We report the Dice Similarity Coefficient (DSC) [3] for region overlap and the Hausdorff Distance (HD) [6] for boundary accuracy.
Classification: We use the Area Under the Curve (AUC) [17], F1-score [20], and Matthews Correlation Coefficient (MCC) [16].

Detection: We use the Intersection over Union (IoU) [26] to measure the localization accuracy of predicted bounding boxes.

Regression: The Mean Radial Error (MRE), reported in pixels, reflects real-world clinical measurement precision, as it is computed at the original image resolution (i.e., predictions are mapped back from resized inputs).

4 Results

Fig. 2 presents absolute performance comparisons across training paradigms. In the data-rich obstetrics (OB) group (11,910 training samples), both the CG and AU training paradigms generally improve over TS on most tasks. Specifically, AU reduces the cervical regression error (MRE: 30.4 → 15.6) and increases fetal abdomen segmentation overlap (DSC: 0.217 → 0.481). For fetal head segmentation and multi-organ classification, CG and AU yield modest gains. However, the Breast and Lung groups exhibit different trends. AU improves lung classification (AUC: 0.396 → 0.525). In contrast, CG shows large performance drops in breast lesion segmentation (DSC: 0.713 → 0.145) and lung segmentation (DSC: 0.801 → 0.576). These results suggest that task aggregation strategies (CG/AU) benefit from data-rich settings, whereas CG is less reliable in low-data settings.

Figure 2: Absolute performance of TS, CG, and AU training paradigms across representative tasks: segmentation (DSC ↑), classification (AUC ↑), and regression (MRE ↓). Abd: Abdomen; MO: Multi-organ.

To quantify task aggregation effects, Fig. 3 reports relative performance changes with respect to TS, showing that the impact of CG and AU depends strongly on data scale. In the OB group (11,910 training samples), both AU and CG outperform TS across most tasks, with the largest improvements in regression and segmentation; the exception is a 5.1% performance drop for CG in fetal abdomen segmentation.
In contrast, CG shows significant performance drops in the smaller groups (Breast and Lung), especially in breast lesion segmentation (−79.7%). By comparison, AU exhibits a much smaller performance change (−4.9%). These results suggest that task aggregation interacts strongly with data availability, and that CG is more prone to negative transfer in low-data settings.

Figure 3: Relative performance change (Δ, %) with respect to TS.

Table 2: Group-wise average performance change (Δ) relative to TS. Positive values indicate improvement; for regression (MRE), the sign is adjusted accordingly.

| Group | # Images | Δ (CG) | Δ (AU) |
|---|---|---|---|
| OB | 11,910 | +2.93 | +3.76 |
| Breast | 2,920 | −0.29 | −0.02 |
| Lung | 2,254 | −0.07 | +0.07 |

Table 2 summarizes the group-wise average performance change relative to TS training. CG yields a positive improvement in OB (Δ = +2.93; 11,910 samples), but shows slight average decreases in Breast (Δ = −0.29) and Lung (Δ = −0.07). In contrast, AU exhibits more stable performance across datasets (+3.76 in OB, −0.02 in Breast, and +0.07 in Lung) and generally outperforms CG in smaller-scale settings. These results suggest that the effectiveness of the CG training paradigm depends strongly on data scale.

5 Discussion

Our study shows that the effectiveness of task aggregation strategies (clinically-grouped (CG) / all-task unified (AU)) in ultrasound imaging is strongly dependent on training data scale. In the data-rich obstetrics (OB) group, both CG and AU improve performance over task-specific (TS) training (Table 2). However, in the smaller groups (Breast and Lung), CG induces significant negative transfer, indicating that clinical grouping alone does not guarantee positive transfer. Importantly, AU shows more stable performance across groups and fewer large performance drops than CG (Fig. 3). This suggests that broader task aggregation may provide a regularizing effect that reduces overfitting when data are limited.
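For concreteness, the relative change Δ (%) with respect to TS used in Fig. 3 and Table 2 follows a simple convention: for higher-is-better metrics (DSC, AUC) the plain relative change is used, while for lower-is-better metrics (MRE) the sign is flipped so that positive always means improvement. A minimal sketch, using values reported above:

```python
def rel_change(ts, agg, lower_is_better=False):
    """Relative performance change (%) of an aggregated paradigm vs. TS.

    For lower-is-better metrics (e.g., MRE), the sign is flipped so that
    positive values always indicate improvement.
    """
    delta = 100.0 * (agg - ts) / ts
    return -delta if lower_is_better else delta

# Breast lesion segmentation DSC under CG (0.713 -> 0.145): about -79.7%.
cg_breast = rel_change(0.713, 0.145)
# Cervical MRE under AU (30.4 -> 15.6, lower is better): positive = improvement.
au_cervical = rel_change(30.4, 15.6, lower_is_better=True)
```

Group-wise averages such as those in Table 2 would then be means of these per-task Δ values within each clinical group.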
Our findings highlight that partial grouping (i.e., CG) can be more prone to negative transfer on small datasets, whereas all-task aggregation yields more reliable transfer behavior. Furthermore, we observe task-type-dependent effects in our experiments: segmentation shows the largest performance drops and negative transfer, while regression and classification remain comparatively stable (Fig. 2). These results suggest that the design of aggregation strategies for foundation models should consider clinical taxonomy together with data scale and task characteristics.

This study has several limitations. First, we focus on a single backbone (DINOv3) and predefined clinical grouping strategies. Alternative architectures, such as ultrasound-specific foundation models (e.g., USFM [9] and TinyUSFM [13]), or data-driven grouping schemes, may lead to different outcomes. Second, our analysis is limited to ultrasound imaging. Future work should examine whether similar transfer patterns generalize to other 2D modalities (e.g., radiography or digital pathology) as well as to 3D domains such as CT and MRI. Despite these limitations, our findings provide empirical evidence that aggregation strategy and data scale are important factors influencing the performance and stability of unified medical foundation models.

6 Conclusion

We present a large-scale empirical analysis of task aggregation strategies for multi-task ultrasound foundation models across 27 heterogeneous clinical tasks. Our findings show that aggregation effectiveness is governed not only by clinical taxonomy but also by data scale and task characteristics. Clinically-grouped aggregation improves performance in data-rich settings but can induce negative transfer in low-data settings. In contrast, anatomy-agnostic aggregation provides more stable cross-task transfer. Segmentation tasks are particularly sensitive to aggregation design, underscoring the need for principled task selection.
These results demonstrate that naive task scaling does not guarantee improved foundation models, and they provide practical guidelines for constructing reliable and scalable ultrasound foundation models, with implications for broader medical imaging applications.

Acknowledgements. This work was funded by Taighde Éireann – Research Ireland through the Research Ireland Centre for Research Training in Machine Learning (18/CRT/6183).

The authors have no competing interests to declare that are relevant to the content of this article.

References

[1] M. Awais, M. Naseer, S. Khan, R. M. Anwer, H. Cholakkal, M. Shah, M. Yang, and F. S. Khan (2025) Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(4), pp. 2245–2264.
[2] H. Chen, Y. Cai, C. Wang, L. Chen, B. Zhang, H. Han, Y. Guo, H. Ding, and Q. Zhang (2025) Multi-organ foundation model for universal ultrasound image segmentation with task prompt and anatomical prior. IEEE Transactions on Medical Imaging 44(2), pp. 1005–1018.
[3] L. R. Dice (1945) Measures of the amount of ecologic association between species. Ecology 26(3), pp. 297–302.
[4] T. Dorszewski, L. Tětková, R. Jenssen, L. K. Hansen, and K. K. Wickstrøm (2026) From colors to classes: emergence of concepts in vision transformers. In Explainable Artificial Intelligence, pp. 28–47.
[5] L. Huang, J. Zhou, J. Jiao, S. Zhou, C. Chang, Y. Wang, and Y. Guo (2024) Standardization of ultrasound images across various centers: m2o-diffgan bridging the gaps among unpaired multi-domain ultrasound images. Medical Image Analysis 95, 103187.
[6] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9), pp. 850–863.
[7] P. Iakubovskii (2019) Segmentation Models PyTorch. GitHub.
[8] Y. Jain, H. Behl, Z. Kira, and V. Vineet (2023) DAMEX: dataset-aware mixture-of-experts for visual understanding of mixture-of-datasets. Advances in Neural Information Processing Systems 36, pp. 69625–69637.
[9] J. Jiao, J. Zhou, X. Li, M. Xia, Y. Huang, L. Huang, N. Wang, X. Zhang, S. Zhou, Y. Wang, and Y. Guo (2024) USFM: a universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Medical Image Analysis 96, 103202.
[10] Q. Kang, Q. Lao, J. Gao, W. Bao, Z. He, C. Du, Q. Lu, and K. Li (2025) URFM: a general ultrasound representation foundation model for advancing ultrasound image diagnosis. iScience 28(8).
[11] S. Kim, P. Jin, S. Song, C. Chen, Y. Li, H. Ren, X. Li, T. Liu, and Q. Li (2025) EchoFM: foundation model for generalizable echocardiogram analysis. IEEE Transactions on Medical Imaging 44(10), pp. 4049–4062.
[12] Y. Lu, M. Weng, Z. Xiao, R. Jiang, W. Su, G. Zheng, P. Lu, and X. Li (2025) Dynamic-DINO: fine-grained mixture of experts tuning for real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20847–20856.
[13] C. Ma, J. Jiao, S. Liang, J. Fu, Q. Wang, Z. Li, Y. Wang, and Y. Guo (2025) TinyUSFM: towards compact and efficient ultrasound foundation models. arXiv preprint arXiv:2510.19239.
[14] H. H. T. Madsen and F. Rasmussen (2011) Contrast-enhanced ultrasound in oncology. Cancer Imaging 11(1A), p. S167.
[15] M. A. Maraci, M. Yaqub, R. Craik, S. Beriwal, A. Self, P. von Dadelszen, A. Papageorghiou, and J. A. Noble (2020) Toward point-of-care ultrasound estimation of fetal gestational age from the trans-cerebellar diameter using CNN-based ultrasound image analysis. Journal of Medical Imaging 7(1), 014501.
[16] B. W. Matthews (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure 405(2), pp. 442–451.
[17] W. Peterson, T. Birdsall, and W. Fox (1954) The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory 4(4), pp. 171–212.
[18] R. Ranftl, A. Bochkovskiy, and V. Koltun (2021) Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188.
[19] I. Sarris, C. Ioannou, P. Chamberlain, E. Ohuma, F. Roseman, L. Hoch, D. G. Altman, A. T. Papageorghiou, and the International Fetal and Newborn Growth Consortium for the 21st Century (INTERGROWTH-21st) (2012) Intra- and interobserver variability in fetal ultrasound measurements. Ultrasound in Obstetrics & Gynecology 39(3), pp. 266–273.
[20] K. Sasaki, S. Sakamoto, H. Uchida, T. Shigeta, M. Matsunami, H. Kanazawa, A. Fukuda, A. Nakazawa, M. Sato, S. Ito, et al. (2015) Two-step transplantation for primary hyperoxaluria: a winning strategy to prevent progression of systemic oxalosis in early onset renal insufficiency cases. Pediatric Transplantation 19(1), pp. E1–E6.
[21] A. Self, Q. Chen, B. K. Desiraju, S. Dhariwal, A. D. Gleed, D. Mishra, et al. (2022) Developing clinical artificial intelligence for obstetric ultrasound to improve access in underserved regions: protocol for a computer-assisted low-cost point-of-care ultrasound (CALOPUS) study. JMIR Research Protocols 11(9), e37374.
[22] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski (2025) DINOv3. arXiv preprint arXiv:2508.10104.
[23] X. Song, X. Xu, J. Zhang, D. Machado Reyes, and P. Yan (2025) DINO-Reg: efficient multimodal image registration with distilled features. IEEE Transactions on Medical Imaging 44(9), pp. 3809–3819.
[24] R. Vega, M. Dehghan, A. Nagdev, B. Buchanan, J. Kapur, J. L. Jaremko, and D. Zonoobi (2025) Overcoming barriers in the use of artificial intelligence in point of care ultrasound. npj Digital Medicine 8(1), 213.
[25] O. Villemain, J. Baranger, M. K. Friedberg, C. Papadacci, A. Dizeux, E. Messas, M. Tanter, M. Pernot, and L. Mertens (2020) Ultrafast ultrasound imaging in pediatric and adult cardiology: techniques, applications, and perspectives. JACC: Cardiovascular Imaging 13(8), pp. 1771–1791.
[26] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren (2020) Distance-IoU loss: faster and better learning for bounding box regression. In The AAAI Conference on Artificial Intelligence (AAAI), pp. 12993–13000.