Paper deep dive
Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)
Diederick C. Niehorster, Marcus Nyström
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/22/2026, 5:57:09 AM
Summary
This paper evaluates the performance of the Segment Anything Model 3 (SAM3) against SAM2 for eye image segmentation. Using both high-resolution lab datasets and the in-the-wild TEyeD dataset, the authors demonstrate that SAM2 consistently outperforms SAM3 in both visual and concept prompting modes, exhibiting higher precision, lower noise, and better discrimination capabilities. Consequently, the authors conclude that SAM2 remains the superior choice for eye image segmentation tasks.
Entities (5)
Relation Signals (3)
SAM3 → performs worse than → SAM2
confidence 95% · Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2.
TEyeD → used to evaluate → SAM3
confidence 95% · For the in-the-wild dataset, TEyeD is used.
SAM3 → incorporates → Perception Encoder
confidence 90% · the HieraDet image encoder of SAM 2 was replaced with the Perception Encoder
Cypher Suggestions (2)
Identify datasets used for evaluation · confidence 95% · unvalidated
MATCH (d:Dataset)<-[:USED_TO_EVALUATE]-(m:Model) RETURN d.name, m.name
Find all models compared in the study · confidence 90% · unvalidated
MATCH (m:Model) WHERE m.name IN ['SAM2', 'SAM3'] RETURN m
Abstract
Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3's codebase that allows processing videos of arbitrary duration.
Tags
Links
- Source: https://arxiv.org/abs/2603.17715v1
- Canonical: https://arxiv.org/abs/2603.17715v1
Full Text
47,749 characters extracted from source content.
Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3) DIEDERICK C. NIEHORSTER, Lund University Humanities Lab & Dept. of Psychology, Lund University, Sweden MARCUS NYSTRÖM, Lund University Humanities Lab, Sweden Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3’s codebase that allows processing videos of arbitrary duration. CCS Concepts:• Computing methodologies→ Video segmentation; Neural networks;• Human-centered computing; Additional Key Words and Phrases: Eye tracking, Feature localization, Gaze estimation, Foundation models, Prompting, Methods, Pupil, Corneal reflection, Iris, Sclera 1 Introduction There are several approaches to performing gaze estimation, that is, determining where a person looks based on images from their eyes [Liu et al.2022]. Besides appearance-based methods that often rely on end-to-end neural networks [Cheng et al.2024], most other approaches require features, such as the center or edges of the pupil, the iris and the corneal reflection (CR) of the eye tracker’s illuminator, to be detected and localized in the eye images. 
Such features are then used for regression-based gaze estimation techniques such as P-CR [e.g., Blignaut and Wium 2013; Cerrolaza et al.2012; Kliegl and Olson 1981; Merchant et al.1974; Stampe 1993] or techniques involving geometric eye models [Barsingerhorn et al.2017; Coutinho and Morimoto 2006; Dierkes et al.2018; Guestrin and Eizenman 2006; Santini et al.2019; Świrski and Dodgson 2013; see Hansen and Ji 2010 for an overview]. How are these features detected in eye images and their centers localized? While traditional image processing approaches to detecting these features [e.g., Fuhl et al.2015, 2016b; Nyström et al.2023; Santini et al.2018; Świrski et al.2012] remain popular, deep learning approaches are also being developed [Cheng et al.2024; Deng et al.2025; Fuhl et al.2020, 2017, 2023; Kim et al.2019; Kothari et al. 2021; see Akinyelu and Blignaut 2020 for an overview]. One recurring problem with deep learning approaches is that if they do not work out of the box, retraining requires large datasets that are expensive to acquire and manually annotate [e.g., Byrne et al.2025, 2024]. Vision foundation models, such as the Segment Anything Model family of models [Carion et al.2025; Kirillov et al.2023; Ravi et al.2024], offer a potential solution to eye image segmentation that may drastically reduce the amount of manual work required. Specifically, both SAM [Deng et al.2025; Maquiling et al.2024] and SAM 2 [Maquiling et al.2025; Niehorster et al.2025] have been shown to offer segmentation performance that is competitive with the state-of-the-art for eye images in a range of different settings. The eye image segmentation performance of the latest model, SAM 3 [Carion et al. 2025] has however not yet been examined. Authors’ Contact Information: Diederick C. Niehorster, diederick_c.niehorster@humlab.lu.se, Lund University Humanities Lab & Dept. 
of Psychology, Lund University, Lund, Sweden; Marcus Nyström, marcus.nystrom@humlab.lu.se, Lund University Humanities Lab, Lund, Sweden. arXiv:2603.17715v1 [cs.CV] 18 Mar 2026. In this work, we examine how the recently released SAM 3 performs compared to SAM 2 for the task of eye image segmentation. Specifically, we first ask whether the visual prompting (manually indicating an object in the image to be segmented) performance of SAM 3 is superior to that of SAM 2. While SAM 3 has the same model architecture as SAM 2 for visual prompting tasks, the HieraDet image encoder [Bolya et al. 2023; Ryali et al. 2023] of SAM 2 was replaced with the Perception Encoder [Bolya et al. 2025], which may lead to altered performance. Second, we ask how SAM 3’s new concept prompting abilities (i.e., prompting the model using a short string such as “pupil”) compare to visual prompting in SAM 2 and 3. Concept prompting would remove all manual work from using SAM (no prompts have to be placed manually on the various eye features), and it also offers potential benefits for, for instance, blink recovery. To provide insight into how SAM 2 and SAM 3 perform in eye image segmentation across a wide range of eye-tracking domains, we perform our evaluation using both datasets containing high-resolution, high-quality eye images obtained in controlled lab settings, and datasets of eye images obtained from head-worn (glasses, VR and AR) eye trackers in unconstrained settings. Across datasets, we prompt the models to segment the pupil, iris and sclera, and for the high-resolution lab datasets we additionally segment the CR. For the lab datasets, since they lack a ground truth, we examine the RMS-S2S precision [Niehorster et al. 2020b,c] of the resulting feature signals, thereby probing which method yields signals with the lowest noise levels, as well as data loss.
For the in-the-wild dataset, TEyeD [Fuhl et al.2021] is used, which consists of various datasets [Fuhl et al.2015, 2016a,b; Kasneci et al.2014; Kim et al. 2019; Kothari et al.2020; Tonsen et al.2016] and ground truth labels for the pupil, iris and eyelid aperture. As such, for these datasets, model performance is assessed as the overlap of model output with the ground truth segmentation, using the intersection-over-union (IoU) metric. Finally, as part of this work, we have adapted the SAM 3 code to be able to run on videos of arbitrary length. This code is available here: https://github.com/dcnieho/sam3. 2 Methods 2.1 Datasets 2.1.1 High-resolution lab datasets. Two different datasets were used, both recorded using the FLEX setup [Byrne et al. 2025, 2024; Hooge et al.2021, 2024; Nyström et al.2023; Valtakari et al.2024]. The first dataset was recorded from four experienced participants (all male) at 1000 Hz and consisted of 106 trials during which participants performed various short and long fixation and saccade sequences [this dataset has been used previously, see Byrne et al.2025; Niehorster et al.2025, for further description]. The second dataset consisted of recordings taken from 17 participants (five females, 11 males, one non-binary). Each dataset contained forty trials consisting of fixation and saccade sequences recorded at 1000 Hz [the dataset was previously used, see Byrne et al.2024, for further details]. The datasets contain 2.87 million images. 2.1.2 Unconstrained datasets. 
TEyeD [Fuhl et al. 2021] consists of NVGaze [Kim et al. 2019], two sets of eye images collected with a VR (14 participants) and an AR (42 participants) device; Gaze-in-wild [Kothari et al. 2020], 19 participants performing everyday tasks with a wearable eye tracker; Labelled pupils in the wild [Tonsen et al. 2016], 22 participants performing everyday tasks with a wearable eye tracker; and a series of Dikablis datasets, which are a combination of the datasets from Fuhl et al. [2015], Fuhl et al. [2016b], Fuhl et al. [2016a] and Kasneci et al. [2014]. TEyeD provides ground truth pupil and iris ellipses as well as eyelid polygons for all datasets. The datasets contain 14.44 million images. 2.2 Models and prompting For SAM 2 [Ravi et al. 2024], the largest model (sam2.1_hiera_large) was used since it performs best for eye image segmentation [Niehorster et al. 2025]. For SAM 3 [Carion et al. 2025], only one model size is available (sam3.pt from 19 Nov 2025). Below, we describe how prompting was performed; Figure 1 shows example prompts and segmentation masks. Inference was run on three computer systems: (A) a 64-core AMD Threadripper Pro 9985WX system with 1TB of memory and an nVidia RTX Pro 6000 Blackwell GPU (96GB VRAM), (B) a 32-core AMD Threadripper 3970X system with 128GB of memory and an nVidia RTX 4090 GPU (24GB VRAM, Ada Lovelace architecture) and (C) a 12-core Intel i9-9920X system with 64GB of memory and two nVidia Titan RTX GPUs (24GB VRAM, Turing architecture), all running on Windows 10 or 11. SAM 2 with visual prompts consumed about 10GB of VRAM and ran at 16 fps on system (A), 13 fps on (B) and 1.6 fps on (C). SAM 3 with visual prompts ran at 13 fps on system (A), 9 fps on (B) and 0.6 fps on (C). It consumed 7.5GB of VRAM on (A) and (B), but 20GB on (C).
SAM 3 with concept prompts was only run on (A), consumed 8GB of VRAM, and ran at 11 fps initially, but slowed down to 0.5 fps for longer (e.g. 70,000 frames) videos. Fig. 1. Example prompts and resulting segmentations for the high-resolution lab datasets (left) and the TEyeD datasets (right) for SAM 2 visual prompting (top), SAM 3 visual prompting (middle) and SAM 3 concept prompting (bottom). For the lab images, the prompted frame, the 100th frame and the 1000th frame in the eye video are shown. For the TEyeD images, the prompted frame, the 1000th, the 10000th and the 20000th frame are shown. For the lab datasets (left): the dark blue mask indicates the CR, green the pupil, red the iris and cyan the sclera. For the TEyeD datasets (right): dark blue is the pupil, green the iris, red the sclera. Positive (+) and negative (square) prompts use the same color codes. For the bottom row, different colors instead indicate different “pupil” objects returned by the model. The brown in some segmentations results from the iris and sclera masks overlapping. 2.2.1 Visual prompts. For the high-resolution datasets, the refined prompting strategy of Niehorster et al. [2025] was used. For each participant, a single image taken when the participant was looking at the middle of the screen was used to prompt all videos recorded for that participant. On this image, one prompt was manually placed by the first author on the CR, one on the pupil, one on the iris and two on the sclera (one on each side of the iris). Additionally, the CR prompt also served as a negative prompt for the pupil, iris and sclera (indicating that the CR should not be part of the segmentation for these features); the pupil as a negative prompt for the iris and sclera, the iris as a negative prompt for the sclera, and the two sclera prompts as negative prompts for the iris.
The segmentation output of both SAM 2 and SAM 3 for these initial prompt sets was then examined and, per model and prompt image, additional positive and negative prompts were added manually until segmentation was judged satisfactory by the annotator. For all prompt images, this only involved additional iris and sclera prompts. Any new positive iris or sclera prompts were also added as negative prompts for the sclera or iris, respectively. For the TEyeD datasets, the minimal prompting strategy of Niehorster et al.[2025] was implemented using a script (available from <link blinded for review>), as it was not feasible to manually create a refined prompt for each eye video due to the large number of videos. First, a frame suitable for prompting, i.e. where the eye is sufficiently open, was manually selected for each eye video. Then, using the ground truth annotations for the pupil, the iris and the eyelids, a prompt set was created as follows (see Figure 2). First, one positive prompt was placed on the pupil and two on the iris (one on each side of the pupil) such that the prompt was inside the eyelid polygon with a margin of at least 10 pixels. Then, two positive prompts were placed on the sclera (one on each side of the iris) using the following logic. First, the eye corners and the closest point on the iris ellipse that is inside the eyelid polygon are found. Then a point is placed at 40% along the line from the point on the iris border to the eye corner. A perpendicular line is drawn through this point and intersected with the eyelid polygon. The final prompt is placed on this perpendicular line, in the middle between the two points where it intersects the eyelid polygon. 2.2.2 Concept prompts. In an initial run using several video files, the strings “pupil”, “iris” or “sclera” were used as concept prompts. Since no segmentation was returned for the iris and sclera, these prompts were not used for the other video files. 
As such, only results for “pupil” are reported in this paper. We will return to this issue in the discussion. Fig. 2. Prompting TEyeD. The left panel shows the ground truth annotations provided by TEyeD for the pupil (green ellipse), iris (red ellipse) and palpebral fissure (yellow polygon). Also indicated are the determined eye corners (cyan dots), closest points on the iris ellipse (orange points) and the derived prompt coordinates for the pupil (green +), the iris (red +) and the sclera (blue and magenta +). The right panel shows the corresponding positive (+) and negative (square) visual prompts provided to both SAM models (blue: pupil, green: iris, red: sclera). Also shown are the output masks from SAM 3, using the same color codes as the prompts. 2.3 Data analysis For the high-resolution datasets, first, feature signals were constructed from the output masks of the models. For the CR and pupil, feature centers were determined as the center of mass of the contour with the largest area in the output masks. For the output of SAM 3 with concept prompts, since it may return multiple objects, after applying shape criteria, the blob closest to the center of the image is selected for the first frame. For subsequent frames, a simple tracking algorithm is used that selects the blob closest to the blob from the previous frame. If the track is lost, the blob closest to the center is again used to reinitialize tracking. For the iris, the method of Niehorster et al. [2025] was used, which employs the sclera output mask to determine which edges of the iris output mask are not adjacent to the eyelid and then fits an ellipse to this part of the iris borders. The iris signal then consists of the center of the fitted ellipse.
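The sclera prompt placement rule of Section 2.2.1 (a point 40% along the line from the iris border to the eye corner, then the midpoint between the two points where the perpendicular through that point crosses the eyelid polygon) can be sketched in a few lines. This is our illustration, not the authors’ script; the function name and data layout are assumptions:

```python
import numpy as np

def sclera_prompt(iris_pt, corner_pt, eyelid_poly):
    # Sketch of the placement rule described in the text (names are ours).
    # iris_pt: point on the iris border closest to the eye corner;
    # corner_pt: detected eye corner; eyelid_poly: polygon vertices.
    iris_pt = np.asarray(iris_pt, float)
    corner_pt = np.asarray(corner_pt, float)
    base = iris_pt + 0.4 * (corner_pt - iris_pt)   # 40% along iris -> corner
    d = corner_pt - iris_pt
    perp = np.array([-d[1], d[0]]) / np.hypot(*d)  # unit perpendicular direction
    hits = []
    poly = np.asarray(eyelid_poly, float)
    for a, b in zip(poly, np.roll(poly, -1, axis=0)):   # each polygon edge a -> b
        # solve base + t*perp == a + s*(b - a) for (t, s)
        m = np.column_stack([perp, a - b])
        if abs(np.linalg.det(m)) < 1e-12:
            continue                                    # edge parallel to the line
        t, s = np.linalg.solve(m, a - base)
        if 0.0 <= s <= 1.0:
            hits.append(base + t * perp)
    return np.mean(hits, axis=0)   # midpoint between the eyelid crossings
```

For a rectangular eyelid polygon, an iris-border point at (60, 20) and an eye corner at (90, 20), the prompt lands on the vertical through x = 72, midway between the upper and lower eyelid boundary.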
We assess the quality of the CR, pupil and iris feature signals using data loss and, following previous work [Byrne et al.2025, 2024; Niehorster et al.2025], RMS-S2S precision [Holmqvist et al.2012; Niehorster et al.2020b,c]. RMS-S2S precision was computed using a moving 200-ms window, and the precision for a given video then determined as the median RMS value across these windows [Hooge et al.2023, 2018; Niehorster et al.2020a]. Data loss was determined separately for the CR, pupil, iris and sclera from the output masks of the models. To determine data loss for each of these eye parts, due to blinks or otherwise, first for each video the distribution of areas of the masks for each eye part were determined. Data loss was then flagged for a given frame and object when the area was less than half of the 20th percentile value. Given how the output masks were processed for creating the feature signals (see above), for the CR and pupil, the area of the largest contour in the mask is used. For the iris and sclera instead the total number of pixels in the model’s output masks were used. Differences in performance between the models are assessed using paired t-tests. For the TEyeD datasets, for each frame in each video, the Intersection-over-Union (IoU, ranging from 0 [no overlap] to 1 [perfect overlap]) of the model and ground truth were determined. TEyeD provides ground truth pupil ellipses, iris ellipses and eyelid polygons, the latter annotating the palpebral fissure. Since the ground truth pupil and iris ellipses include parts of these features occluded by the eyelid, the visible part of the ground truth pupil or iris is first determined by intersecting them with the eyelid ground truth. For the iris mask, the pupil was also first removed as it should not be part of the iris output mask given the used prompts. To compute IoU for the sclera, the ground truth pupil and iris are removed from the eyelid ground truth. 
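The RMS-S2S precision and data-loss rules described above can be written down compactly. This is a sketch for illustration, not the authors’ analysis code; windowing details (e.g., window overlap) are assumptions:

```python
import numpy as np

def rms_s2s_precision(x, y, fs=1000, win_ms=200):
    # RMS of sample-to-sample displacements in a moving 200-ms window; the
    # value for a video is the median across windows (sketch of the metric
    # described in the text).
    d2 = np.diff(x) ** 2 + np.diff(y) ** 2        # squared S2S displacements
    n = int(round(win_ms / 1000 * fs))            # window length in samples
    rms = np.sqrt(np.convolve(d2, np.ones(n) / n, mode="valid"))
    return float(np.median(rms))

def flag_data_loss(mask_areas):
    # A frame is flagged as data loss when its mask area is less than half
    # the 20th-percentile area for that video, per the rule in the text.
    areas = np.asarray(mask_areas, float)
    return areas < 0.5 * np.percentile(areas, 20)
```

A signal that alternates between two positions one pixel apart, for example, has an RMS-S2S precision of exactly 1 pixel regardless of window placement.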
We also computed IoU for the entire eye opening, using the union of the pupil, iris and sclera output masks provided by the model. For SAM 2 and SAM 3 with visual prompts, the output masks were used directly to compute IoU. Since SAM 3 with concept prompts could return multiple “pupil” objects (see Figure 1), we performed the same selection and tracking logic as for the high-resolution lab datasets described above. Besides IoU, several other metrics were computed. First, since we noted that the sclera mask often includes the iris despite the negative prompts placed on the iris, to further assess the quality of the output sclera masks we computed how much of the iris mask was also included in the sclera mask. Furthermore, for the pupil, iris and sclera we determined the false alarm rate (when the feature is not present in the ground truth but reported by the model) and the miss rate (feature present in ground truth but not in the model output). Finally, to assess the overall ability of the models to correctly classify feature presence, we use Youden’s J [1950]. Youden’s J incorporates both the hit rate (1–miss rate) and false alarm rate to assess the trade-off between a model’s sensitivity and specificity. Specifically, if the model has a very low miss rate but also a very high false alarm rate (i.e., it indiscriminately reports that a feature is present), Youden’s J is zero, while perfect performance (both miss rate and false alarm rate are 0) is indicated by a Youden’s J of 1. 3 Results Figure 3 (top row) shows the RMS-S2S precision achieved by SAM 2 with visual prompts and SAM 3 with visual or concept prompts. As can be seen, while there was significant between-participant variability, within participants, RMS-S2S precision was systematically worse for SAM 3 with visual prompts than for SAM 2 with visual prompts. In fact, the average RMS-S2S precision was 64% worse for the CR, 112% worse for the pupil and 28% for the iris. 
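The blob selection-and-tracking rule of Section 2.3, reused here for the concept-prompt output on TEyeD, can be sketched as follows; the function name and data layout are our simplifications, and the shape criteria are omitted:

```python
import numpy as np

def track_pupil_blobs(per_frame_centroids, image_center):
    # First frame: take the blob closest to the image center. Later frames:
    # take the blob closest to the previously selected blob. After track
    # loss (a frame without blobs), re-initialize from the image center.
    center = np.asarray(image_center, float)
    selected, ref = [], None
    for cands in per_frame_centroids:          # cands: list of (x, y) centroids
        if not cands:
            selected.append(None)              # no blob: track lost
            ref = None
            continue
        anchor = center if ref is None else ref
        pick = min(cands, key=lambda c: np.linalg.norm(np.asarray(c, float) - anchor))
        selected.append(pick)
        ref = np.asarray(pick, float)
    return selected
```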
The RMS-S2S precision of the pupil center signal derived from the segmentation masks produced by SAM 3 with concept prompting was much worse than for either SAM 3 with visual prompts (875%) or SAM 2 (1966%). Fig. 3. RMS-S2S precision and data loss for SAM 2 and SAM 3 on the high-resolution lab datasets. Shown are the RMS-S2S precision (lower is better) and data loss (lower is better) per participant. Note that the range of the y-axis is different for each of the panels. A summary showing the precision or data loss on the same scale is shown in the rightmost bar graphs (error bars indicate SEM across participants). Stars indicate significant differences according to paired t-tests. Data loss (Figure 3, bottom row) was low overall, given that there were very few blinks in the high-quality datasets and that segmentation was successful. For the most part, differences between the three models were also small. Nonetheless, for the CR and pupil, data loss was lower for SAM 2 with visual prompts than for SAM 3 with visual prompts (111% and 37%, respectively). Data loss for the iris and sclera was not significantly different between SAM 2 and 3. Data loss for the pupil for SAM 3 with concept prompts was 528% higher than for SAM 3 with visual prompts and 762% higher than for SAM 2. Table 1 lists the performance of the three models on the TEyeD datasets. The mean IoU (mIoU) was highest for SAM 2 for all datasets for the pupil and for three out of the four datasets for the iris and entire eye opening. In contrast, for the sclera, mIoU was higher for SAM 3 with visual prompting than for SAM 2 for all four datasets. This may reflect that the amount of the iris output mask that is included in the sclera output mask (see the overlap column in Table 1) was consistently lower for SAM 3 than for SAM 2.
While SAM 3 with concept prompts performed worst of the three models on three out of the four datasets, its pupil mIoU is not far behind that achieved by SAM 3 with visual prompts. Examining the false alarm rate, it is seen that SAM 2 consistently outperforms SAM 3 with visual prompts for all datasets for the pupil, iris and sclera, while SAM 3 with concept prompts performs better than SAM 2 only for the pupil in one dataset. For the miss rate, however, SAM 3 outperforms SAM 2, with either SAM 3 with visual prompts (three datasets for the pupil and all datasets for the iris and sclera) or SAM 3 with concept prompts (one dataset for the pupil) showing a lower miss rate than SAM 2. However, the very low miss rate of SAM 3 is coupled with a very high false alarm rate, especially for the iris and sclera, indicating that the model often reports these features are present, regardless of whether they actually are. Indeed, Youden’s J reveals that SAM 3 is (much) less able to discriminate whether a feature is present in an eye image than SAM 2, especially for the iris and sclera.
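The presence metrics used above reduce to a few lines; a minimal sketch (rasterizing the TEyeD ellipse and polygon annotations into boolean masks is omitted):

```python
import numpy as np

def iou(pred, gt):
    # Intersection-over-Union of two boolean masks (0 = no overlap, 1 = perfect).
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else float("nan")

def youdens_j(miss_rate, fa_rate):
    # Youden's J = sensitivity + specificity - 1 = (1 - miss rate) - FA rate;
    # 1 is perfect discrimination, 0 is indiscriminate (chance-level) reporting.
    return (1.0 - miss_rate) - fa_rate
```

For example, for SAM 2 on Dikablis, a pupil miss rate of 0.037 and a false alarm rate of 0.065 give J = 0.898, matching Table 1.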
               mIoU                          Overlap      FA rate                Miss rate              Youden’s J
Dataset        pupil  iris   sclera eye op. iris–sclera   pupil  iris   sclera   pupil  iris   sclera   pupil  iris   sclera
SAM 2 visual
Dikablis       0.880  0.814  0.495  0.773     0.270       0.065  0.384  0.377    0.037  0.003  0.011    0.898  0.613  0.612
GiW            0.911  0.768  0.464  0.767     0.483       0.193  0.340  0.373    0.010  0.032  0.050    0.798  0.628  0.578
LPW            0.853  0.692  0.333  0.639     0.587       0.506  0.668  0.851    0.023  0.055  0.073    0.470  0.278  0.076
NVGaze         0.895  0.854  0.593  0.816     0.227       0.195  0.447  0.548    0.011  0.008  0.009    0.794  0.545  0.443
SAM 3 visual
Dikablis       0.782  0.745  0.507  0.766     0.202       0.394  0.793  0.892    0.014  0.002  0.001    0.592  0.205  0.107
GiW            0.855  0.728  0.506  0.752     0.246       0.442  0.720  0.806    0.006  0.004  0.003    0.552  0.276  0.190
LPW            0.800  0.743  0.425  0.680     0.307       0.676  0.893  0.914    0.005  0.002  0.004    0.319  0.105  0.082
NVGaze         0.884  0.843  0.637  0.812     0.093       0.288  0.554  0.600    0.006  0.000  0.000    0.706  0.446  0.400
SAM 3 concept
Dikablis       0.794  —      —      —         —           0.301  —      —        0.010  —      —        0.689  —      —
GiW            0.788  —      —      —         —           0.244  —      —        0.025  —      —        0.731  —      —
LPW            0.784  —      —      —         —           0.298  —      —        0.086  —      —        0.617  —      —
NVGaze         0.763  —      —      —         —           0.295  —      —        0.011  —      —        0.694  —      —
Table 1. Results for SAM 2 and SAM 3 per dataset in the TEyeD set. Shown are mIoU for the pupil, iris, sclera and the whole eye opening, the overlap between the iris and the sclera masks, and the false alarm (FA) rate, miss rate and Youden’s J for the pupil, iris and sclera. For each column and dataset, higher is better for mIoU and Youden’s J, while lower is better for the other columns (in the original typesetting the best values are printed in bold and the second-best pupil values are underlined).
4 Discussion
We examined the eye image segmentation performance of SAM 2 and SAM 3 on a diverse range of eye images, representing both a high-quality lab-based setting and unconstrained head-worn settings recorded in the wild. Across settings we found that, by and large, SAM 2 performed better than both modes of SAM 3, i.e., using either visual or concept prompts.
Specifically, in the lab-based setting, the feature signals computed from SAM 2’s segmentation were consistently less noisy than those computed from the segmentation provided by SAM 3 with visual prompts. SAM 3 with concept prompts performed, on average, an order of magnitude worse than SAM 2 or SAM 3 with visual prompts. No large differences in data loss were observed between the models. Nonetheless, data loss was lower for SAM 2 than SAM 3 with visual prompts for the CR and pupil, but not the iris or sclera. In the unconstrained setting evaluated using the TEyeD [Fuhl et al. 2021] datasets, SAM 2 outperformed both SAM 3 modes across most datasets in terms of mIoU for the pupil, iris and entire eye opening, and also in terms of false alarm rate. SAM 3 with visual prompts, however, outperformed SAM 2 in mIoU for segmenting the sclera (likely because its segmentation masks included less of the iris). While SAM 3 furthermore appears to outperform SAM 2 in terms of miss rate, this finding has to be placed in the context of SAM 3’s (much) lower ability to discriminate whether a given feature is present in an eye image, as also underscored by a very high false alarm rate, especially for the iris and sclera. Given that SAM 2 outperforms SAM 3 with visual prompts while also providing a higher inference speed, we conclude that among the family of Segment Anything Models, SAM 2 remains the best choice for zero-shot eye image segmentation tasks. SAM 3’s concept prompt mode does not offer a benefit over the other models for eye image segmentation. For SAM 2, the performance reported here is in line with previous examinations of its eye image segmentation capabilities, both on high-quality lab datasets [Niehorster et al. 2025] and for the pupil in unconstrained settings [Maquiling et al. 2025].
Regarding SAM 2 performance in in-the-wild settings, this work extends the results from Maquiling et al. [2025] to segmentation of the iris, sclera and entire eye opening, and shows that, with adequate prompting, SAM 2 can in many cases show strong performance also for these more complex eye features. It came as a surprise and a disappointment to the authors that SAM 3 did not perform better than SAM 2 for eye image segmentation, even when using only visual prompts. Given that SAM 3 is only 2 months old, little work is available evaluating its performance. Nonetheless, the literature that is available indicates that SAM 3 offers superior segmentation performance only in some cases, in domains such as medical images [Chakrabarty and Soni 2025; Dong et al. 2025], agriculture [Sapkota et al. 2025b] and remote sensing [Li et al. 2025b]. Even though SAM 3 can perform the same prompted visual segmentation task as SAM 2, the model architecture supporting this task has changed, as have its training objectives [see Sapkota et al. 2025a, for a detailed discussion]. Importantly, the multimodal Perception Encoder [Bolya et al. 2025] is used as the image encoder in SAM 3, replacing the vision-only HieraDet image encoder [Bolya et al. 2023; Ryali et al. 2023]. Furthermore, while SAM 2 was trained only to minimize geometric losses, i.e., determining where an object is, SAM 3 is trained to perform multiple tasks simultaneously, enabling it to indicate what semantic concept an image region represents [Sapkota et al. 2025a]. While the changed architecture and training objectives extend SAM 3’s capabilities, it appears that these additions do not benefit our specific visual segmentation task. SAM 3’s inferior zero-shot performance and its complete inability to provide a segmentation for the concept prompts “iris” and “sclera” may be resolved through fine-tuning on a set of eye images with ground truth segmentation masks and semantic labels.
Indeed, in the medical image domain, it has already been reported that major improvement in SAM 3’s performance can be achieved through fine tuning [Jiang et al.2026]. Other model adaptation strategies have also shown promising improvements in performance in medical image and camouflaged object segmentation [Chen et al.2025]. As such, several avenues are available to potentially improve the performance of SAM 3 in eye image segmentation. We think such endeavors would be worthwhile given the promise of concept segmentation to enable eye image segmentation without any manual intervention such as the need to provide prompts for a given video. Another bottleneck in applying SAM 3 in our domain is its poor computational performance. In its current state, we found that SAM 3’s inference throughput may drop below 1 fps even when using expensive workstation-class computer resources. Ongoing distillation efforts aiming to make SAM 3 more efficient [Zeng et al.2025] may offer a solution to this performance bottleneck. Finally, SAM 3 can only be prompted using simple noun phrases. Depending on the eye images one wishes to segment, this may limit SAM 3’s applicability. For instance, prompts such as “the pupil of the left eye” or “the white reflection on the eye closest to the pupil” cannot be used with vanilla SAM 3. Adaptations to SAM 3 that are able to process more complex instructions [Li et al. 2025a] may offer a solution to this problem. 4.1 Privacy and Ethics Like for SAM 2, the primary concern for using SAM 3 in an eye tracking context is that it is not capable of online performance. As such, eye images that are to be segmented first need to be stored after acquisition, raising privacy and data protection concerns [Niehorster et al.2025]. Taking a wider view, foundation models such as the SAM family partially alleviate the privacy concerns of traditional models that require large amounts of manually annotated data to be trained for a specific task. 
Given its strong zero-shot performance, SAM can be used for eye image segmentation without requiring training on large open datasets, while fine-tuning would only require a small amount of labeled data. The reduced reliance on labeled data alleviates concerns about participant privacy [Zhang and Metaxas 2024].
Acknowledgments
The authors gratefully acknowledge the Lund University Humanities Lab.
References
Andronicus A Akinyelu and Pieter Blignaut. 2020. Convolutional neural network-based methods for eye gaze estimation: A survey. IEEE Access 8 (2020), 142581–142605. A. D. Barsingerhorn, F. N. Boonstra, and H. H. L. M. Goossens. 2017. Optics of the human cornea influence the accuracy of stereo eye-tracking methods: a simulation study. Biomedical Optics Express 8, 2 (2 2017), 712–725. doi:10.1364/BOE.8.000712 Pieter Blignaut and Daniël Wium. 2013. The effect of mapping function on the accuracy of a video-based eye tracker. In Proceedings of the 2013 Conference on Eye Tracking South Africa (Cape Town, South Africa) (ETSA ’13). Association for Computing Machinery, New York, NY, USA, 39–46. doi:10.1145/2509315.2509321 Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. 2025. Perception Encoder: The best visual embeddings are not at the output of the network. arXiv abs/2504.13181 (2025). arXiv:2504.13181 http://arxiv.org/abs/2504.13181 Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. 2023. Window Attention is Bugged: How not to Interpolate Position Embeddings. arXiv abs/2311.05613 (2023). arXiv:2311.05613 http://arxiv.org/abs/2311.05613 Sean Anthony Byrne, Virmarie Maquiling, Marcus Nyström, Enkelejda Kasneci, and Diederick C. Niehorster. 2025.
LEyes: A lightweight framework for deep learning-based eye tracking using synthetic eye images. Behavior Research Methods 57, 5 (31 3 2025), 129. doi:10.3758/s13428-025-02645-y
Sean Anthony Byrne, Marcus Nyström, Virmarie Maquiling, Enkelejda Kasneci, and Diederick C. Niehorster. 2024. Precise localization of corneal reflections in eye images using deep learning trained on synthetic data. Behavior Research Methods 56, 4 (01 6 2024), 3226–3241. doi:10.3758/s13428-023-02297-w
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, and Christoph Feichtenhofer. 2025. SAM 3: Segment Anything with Concepts. arXiv abs/2511.16719 (2025). arXiv:2511.16719 http://arxiv.org/abs/2511.16719
Juan J. Cerrolaza, Arantxa Villanueva, and Rafael Cabeza. 2012. Study of polynomial mapping functions in video-oculography eye trackers. ACM Transactions on Computer-Human Interaction 19, 2 (2012), 1–25. doi:10.1145/2240156.2240158
Satrajit Chakrabarty and Ravi Soni. 2025. Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data. arXiv abs/2511.21926 (2025). arXiv:2511.21926 https://arxiv.org/abs/2511.21926
Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding, Deyi Ji, Cheng Chen, Qi Zhu, Chunyan Xu, Papa Mao, and Ying Zang. 2025. SAM3-Adapter: Efficient Adaptation of Segment Anything 3 for Camouflage Object Segmentation, Shadow Detection, and Medical Image Segmentation. arXiv abs/2511.19425 (2025). arXiv:2511.19425 https://arxiv.org/abs/2511.19425
Yihua Cheng, Haofei Wang, Yiwei Bao, and Feng Lu.
2024. Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark. arXiv abs/2104.12668 (2024). arXiv:2104.12668 http://arxiv.org/abs/2104.12668
Flavio Luiz Coutinho and Carlos Hitoshi Morimoto. 2006. Free head motion eye gaze tracking using a single camera and multiple light sources. In 2006 19th Brazilian Symposium on Computer Graphics and Image Processing. 171–178. doi:10.1109/SIBGRAPI.2006.21
Jiangfan Deng, Zhuang Jia, Zhaoxue Wang, Xiang Long, and Daniel K. Du. 2025. Towards Unsupervised Eye-Region Segmentation for Eye Tracking. In Computer Vision – ECCV 2024 Workshops, Alessio Del Bue, Cristian Canton, Jordi Pont-Tuset, and Tatiana Tommasi (Eds.). Springer Nature Switzerland, Cham, 199–213. doi:10.1007/978-3-031-91989-3_13
Kai Dierkes, Moritz Kassner, and Andreas Bulling. 2018. A novel approach to single camera, glint-free 3D eye model fitting including corneal refraction. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (Warsaw, Poland) (ETRA ’18). Association for Computing Machinery, New York, NY, USA, Article 9, 9 pages. doi:10.1145/3204493.3204525
Wenzhen Dong, Jieming Yu, Yiming Huang, Hongqiu Wang, Lei Zhu, Albert C. S. Chung, Hongliang Ren, and Long Bai. 2025. More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery. arXiv abs/2512.07596 (2025). arXiv:2512.07596 https://arxiv.org/abs/2512.07596
Wolfgang Fuhl, Hong Gao, and Enkelejda Kasneci. 2020. Tiny convolution, decision tree, and binary neuronal networks for robust and real time pupil outline estimation. In ACM Symposium on Eye Tracking Research and Applications (Stuttgart, Germany) (ETRA ’20 Short Papers). Association for Computing Machinery, New York, NY, USA, Article 5, 5 pages. doi:10.1145/3379156.3391347
Wolfgang Fuhl, Gjergji Kasneci, and Enkelejda Kasneci. 2021.
TEyeD: Over 20 Million Real-World Eye Images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types. In 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). 367–375. doi:10.1109/ISMAR52148.2021.00053
Wolfgang Fuhl, Thomas Kübler, Katrin Sippel, Wolfgang Rosenstiel, and Enkelejda Kasneci. 2015. ExCuSe: Robust pupil detection in real-world scenarios. In International Conference on Computer Analysis of Images and Patterns. Springer, 39–51.
Wolfgang Fuhl, Thiago Santini, Gjergji Kasneci, and Enkelejda Kasneci. 2016a. PupilNet: convolutional neural networks for robust pupil detection. arXiv preprint arXiv:1601.04902 (2016).
Wolfgang Fuhl, Thiago Santini, Gjergji Kasneci, Wolfgang Rosenstiel, and Enkelejda Kasneci. 2017. PupilNet v2.0: Convolutional Neural Networks for CPU based real time Robust Pupil Detection. arXiv preprint arXiv:1711.00112 (2017).
Wolfgang Fuhl, Thiago C. Santini, Thomas Kübler, and Enkelejda Kasneci. 2016b. ElSe: Ellipse Selection for Robust Pupil Detection in Real-world Environments. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications (Charleston, South Carolina) (ETRA ’16). ACM, New York, NY, USA, 123–130. doi:10.1145/2857491.2857505
Wolfgang Fuhl, Daniel Weber, and Shahram Eivazi. 2023. Pistol: PUpil INvisible SUpportive TOOl to Extract Pupil, Iris, Eye Opening, Eye Movements, Pupil and Iris Gaze Vector, and 2D as Well as 3D Gaze. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023). 27–38. doi:10.5220/0011607200003417
E. D. Guestrin and M. Eizenman. 2006. General Theory of Remote Gaze Estimation Using the Pupil Center and Corneal Reflections. IEEE Transactions on Biomedical Engineering 53, 6 (2006), 1124–1133. doi:10.1109/tbme.2005.863952
Dan Witzner Hansen and Qiang Ji. 2010.
In the Eye of the Beholder: A Survey of Models for Eyes and Gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 3 (2010), 478–500. doi:10.1109/TPAMI.2009.30
Kenneth Holmqvist, Marcus Nyström, and Fiona Mulvey. 2012. Eye Tracker Data Quality: What It is and How to Measure It. In Proceedings of the Symposium on Eye Tracking Research and Applications (Santa Barbara, California) (ETRA ’12). ACM, New York, NY, USA, 45–52. doi:10.1145/2168556.2168563
Ignace T. C. Hooge, Diederick C. Niehorster, Roy S. Hessels, Jeroen S. Benjamins, and Marcus Nyström. 2023. How robust are wearable eye trackers to slow and fast head and body movements? Behavior Research Methods 55, 8 (01 12 2023), 4128–4142. doi:10.3758/s13428-022-02010-3
Ignace T. C. Hooge, Diederick C. Niehorster, Roy S. Hessels, Dixon Cleveland, and Marcus Nyström. 2021. The pupil-size artefact (PSA) across time, viewing direction, and different eye trackers. Behavior Research Methods (2021). doi:10.3758/s13428-020-01512-2
Ignace T. C. Hooge, Diederick C. Niehorster, Marcus Nyström, Richard Andersson, and Roy S. Hessels. 2018. Is human classification by experienced untrained observers a gold standard in fixation detection? Behavior Research Methods 50, 5 (2018), 1864–1881. doi:10.3758/s13428-017-0955-x
Ignace T. C. Hooge, Diederick C. Niehorster, Marcus Nyström, and Roy S. Hessels. 2024. Large eye–head gaze shifts measured with a wearable eye tracker and an industrial camera. Behavior Research Methods (10 1 2024). doi:10.3758/s13428-023-02316-w
Chongcong Jiang, Tianxingjian Ding, Chuhan Song, Jiachen Tu, Ziyang Yan, Yihua Shao, Zhenyi Wang, Yuzhang Shang, Tianyu Han, and Yu Tian. 2026. Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation. arXiv abs/2601.10880 (2026). arXiv:2601.10880 https://arxiv.org/abs/2601.10880
Enkelejda Kasneci, Katrin Sippel, Kathrin Aehling, Martin Heister, Wolfgang Rosenstiel, Ulrich Schiefer, and Elena Papageorgiou. 2014.
Driving with binocular visual field loss? A study on a supervised on-road parcours with simultaneous eye and head tracking. PLOS ONE 9, 2 (2014), e87470. doi:10.1371/journal.pone.0087470
Joohwan Kim, Michael Stengel, Alexander Majercik, Shalini De Mello, David Dunn, Samuli Laine, Morgan McGuire, and David Luebke. 2019. NVGaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12.
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 3992–4003. doi:10.1109/ICCV51070.2023.00371
Reinhold Kliegl and Richard K. Olson. 1981. Reduction and calibration of eye monitor data. Behavior Research Methods & Instrumentation 13, 2 (01 1 1981), 107–111. doi:10.3758/BF03207917
Rakshit Kothari, Zhizhuo Yang, Christopher Kanan, Reynold Bailey, Jeff B. Pelz, and Gabriel J. Diaz. 2020. Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities. Scientific Reports 10, 1 (2020), 1–18. doi:10.1038/s41598-020-59251-5
Rakshit S. Kothari, A. K. Chaudhary, R. J. Bailey, J. B. Pelz, and G. J. Diaz. 2021. EllSeg: An Ellipse Segmentation Framework for Robust Gaze Tracking. IEEE Transactions on Visualization & Computer Graphics 27, 05 (5 2021), 2757–2767. doi:10.1109/TVCG.2021.3067765
Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, and Li Cheng. 2025a. SAM3-I: Segment Anything with Instructions. arXiv abs/2512.04585 (2025). arXiv:2512.04585 https://arxiv.org/abs/2512.04585
Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, and Xiangyong Cao. 2025b. SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images.
arXiv abs/2512.08730 (2025). arXiv:2512.08730 https://arxiv.org/abs/2512.08730
Jiahui Liu, Jiannan Chi, Huijie Yang, and Xucheng Yin. 2022. In the eye of the beholder: A survey of gaze tracking techniques. Pattern Recognition 132 (2022), 108944. doi:10.1016/j.patcog.2022.108944
Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, and Enkelejda Kasneci. 2025. Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images. Proceedings of the ACM on Computer Graphics and Interactive Techniques 8, 2, Article 23 (5 2025), 16 pages. doi:10.1145/3729409
Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marcus Nyström, and Enkelejda Kasneci. 2024. Zero-Shot Segmentation of Eye Features Using the Segment Anything Model (SAM). Proceedings of the ACM on Computer Graphics and Interactive Techniques 7, 2, Article 26 (5 2024), 16 pages. doi:10.1145/3654704
John Merchant, Richard Morrissette, and James L. Porterfield. 1974. Remote measurement of eye direction allowing subject motion over one cubic foot of space. IEEE Transactions on Biomedical Engineering 4 (1974), 309–317.
Diederick C. Niehorster, Roy S. Hessels, and Jeroen S. Benjamins. 2020a. GlassesViewer: Open-source software for viewing and analyzing data from the Tobii Pro Glasses 2 eye tracker. Behavior Research Methods 52, 3 (2020), 1244–1253. doi:10.3758/s13428-019-01314-1
Diederick C. Niehorster, Virmarie Maquiling, Sean Byrne, Enkelejda Kasneci, and Marcus Nyström. 2025. Exploring promptable foundation models for high-resolution video eye tracking in the lab. In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications (ETRA ’25). Association for Computing Machinery, New York, NY, USA, Article 8, 8 pages. doi:10.1145/3715669.3723118
Diederick C. Niehorster, Thiago Santini, Roy S. Hessels, Ignace T. C. Hooge, Enkelejda Kasneci, and Marcus Nyström. 2020b.
The impact of slippage on the data quality of head-worn eye trackers. Behavior Research Methods 52, 3 (2020), 1140–1160. doi:10.3758/s13428-019-01307-0
Diederick C. Niehorster, Raimondas Zemblys, Tanya Beelders, and Kenneth Holmqvist. 2020c. Characterizing gaze position signals and synthesizing noise during fixations in eye-tracking data. Behavior Research Methods 52, 6 (2020), 2515–2534. doi:10.3758/s13428-020-01400-9
Marcus Nyström, Diederick C. Niehorster, Richard Andersson, Roy S. Hessels, and Ignace T. C. Hooge. 2023. The amplitude of small eye movements can be accurately estimated with video-based eye trackers. Behavior Research Methods 55, 2 (2023), 657–669.
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. 2024. SAM 2: Segment Anything in Images and Videos. arXiv abs/2408.00714 (2024). arXiv:2408.00714 http://arxiv.org/abs/2408.00714
Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. 2023. Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). 29441–29454.
Thiago Santini, Wolfgang Fuhl, and Enkelejda Kasneci. 2018. PuReST: Robust Pupil Tracking for Real-time Pervasive Eye Tracking. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (Warsaw, Poland) (ETRA ’18). ACM, New York, NY, USA, Article 61, 5 pages. doi:10.1145/3204493.3204578
Thiago Santini, Diederick C. Niehorster, and Enkelejda Kasneci. 2019.
Get a Grip: Slippage-robust and Glint-free Gaze Estimation for Real-time Pervasive Head-mounted Eye Tracking. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (Denver, Colorado) (ETRA ’19). ACM, New York, NY, USA, Article 17, 10 pages. doi:10.1145/3314111.3319835
Ranjan Sapkota, Konstantinos I. Roumeliotis, and Manoj Karkee. 2025a. The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation. arXiv abs/2512.06032 (2025). arXiv:2512.06032 https://arxiv.org/abs/2512.06032
Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee, and Nikolaos D. Tselikas. 2025b. Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors. arXiv abs/2512.11884 (2025). arXiv:2512.11884 https://arxiv.org/abs/2512.11884
Dave M. Stampe. 1993. Heuristic filtering and reliable calibration methods for video-based pupil-tracking systems. Behavior Research Methods, Instruments, & Computers 25, 2 (1993), 137–142. doi:10.3758/BF03204486
Lech Świrski, Andreas Bulling, and Neil Dodgson. 2012. Robust real-time pupil tracking in highly off-axis images. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 173–176.
Lech Świrski and Neil A. Dodgson. 2013. A fully-automatic, temporal approach to single camera, glint-free 3D eye model fitting [Abstract]. In Proceedings of ECEM 2013 (Lund, Sweden).
Marc Tonsen, Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2016. Labelled pupils in the wild: a dataset for studying pupil detection in unconstrained environments. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. ACM, 139–142.
Niilo V. Valtakari, Roy S. Hessels, Diederick C. Niehorster, Charlotte Viktorsson, Pär Nyström, Terje Falck-Ytter, Chantal Kemner, and Ignace T. C. Hooge. 2024. A field test of computer-vision-based gaze estimation in psychology.
Behavior Research Methods 56, 3 (01 3 2024), 1900–1915. doi:10.3758/s13428-023-02125-1
W. J. Youden. 1950. Index for rating diagnostic tests. Cancer 3, 1 (1950), 32–35. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3
Chengxi Zeng, Yuxuan Jiang, and Aaron Zhang. 2025. EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3. arXiv abs/2511.15833 (2025). arXiv:2511.15833 https://arxiv.org/abs/2511.15833
Shaoting Zhang and Dimitris Metaxas. 2024. On the challenges and perspectives of foundation models for medical image analysis. Medical Image Analysis 91 (2024), 102996. doi:10.1016/j.media.2023.102996