Paper deep dive
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
Links
- Source: https://arxiv.org/abs/2603.12176v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Jingyang Ke* (Georgia Institute of Technology, Atlanta, GA 30332, jingyang.ke@gatech.edu); Weihan Li* (Georgia Institute of Technology, Atlanta, GA 30332, weihanli@gatech.edu); Amartya Pradhan (Georgia Institute of Technology, Emory University, Atlanta, GA 30322, amartya.pradhan@emory.edu); Jeffrey E. Markowitz (Georgia Institute of Technology, Emory University, Atlanta, GA 30332, jeffrey.markowitz@bme.gatech.edu); Anqi Wu (Georgia Institute of Technology, Atlanta, GA 30332, anqiwu@gatech.edu)

Abstract

Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior.
Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.

*Equal contribution. Preprint. arXiv:2603.12176v1 [cs.CV] 12 Mar 2026

1 Introduction

Understanding freely moving animal behavior is central to neuroscience. Two fundamental tasks are pose estimation and behavioral segmentation, which together provide the bridge between neural activity and natural action. In practice, however, both problems still require substantial human labor. Pose estimation toolkits such as DeepLabCut [Mathis et al., 2018], SLEAP [Pereira et al., 2022], and Lightning Pose [Biderman et al., 2024] can achieve strong accuracy, but each new experimental setup usually requires manual labels before training can begin. Pretrained foundation models such as SuperAnimal [Ye et al., 2024] reduce this burden, yet they still depend on human-labeled pretraining data and can degrade under new camera geometries, imaging conditions, or animal morphologies.

Figure 1: Overview of BehaviorVLM. This VLM & LLM-based framework addresses pose estimation and behavioral understanding with minimal manual labeling and no finetuning.

Behavioral understanding faces a parallel limitation. In this paper, behavioral understanding refers to behavioral segmentation together with a human-understandable interpretation for each segment. Recent VLM- and LLM-based systems such as MouseGPT [Xu et al., 2025] and AmadeusGPT [Ye et al., 2023] show that language models can help describe animal behavior, but they do not replace the full annotation workflow that a human analyst performs when identifying transitions and assigning semantic labels to behavior segments.
At the other extreme, unsupervised approaches such as MotionMapper [Berman et al., 2014], MoSeq [Wiltschko et al., 2015], and Keypoint-MoSeq [Weinreb et al., 2024] scale well, but they often produce segments that are difficult to interpret, switch too rapidly, or do not align cleanly with human-understandable behavioral categories. This limitation arises because these methods typically rely on keypoints or low-dimensional motion representations and do not directly extract semantic labels from the visual evidence in the video. We present BehaviorVLM, a unified vision-language framework that addresses both pose estimation and behavioral understanding without task-specific finetuning and with minimal human labeling, by guiding pretrained Vision-Language Models (VLMs) through structured, multi-stage reasoning pipelines. The central idea is to mimic how a human would carry out these annotation tasks in practice. Rather than asking a model for a final answer in a single step, we decompose each task into explicit intermediate stages that use visual evidence, expose uncertainty, and allow labels to be reviewed or corrected afterward. This framing is especially useful when the goal is to replace large amounts of manual work rather than to claim that every automatic label is perfect. For pose estimation, we leverage near-infrared fluorescent quantum dots (QDs) [Ulutas et al., 2025] injected at body keypoints to provide candidate keypoint locations across six synchronized camera views. A VLM is guided through a multi-stage reasoning pipeline that integrates temporal, spatial, and cross-view constraints to predict accurate 3D keypoint trajectories. The pipeline requires only three manually labeled seed frames, and completed predictions are appended to a rolling window and reused as few-shot exemplars for later frames. 
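The rolling-window few-shot scheme described above (three labeled seed frames, with each completed frame appended and reused as an exemplar for the next) can be sketched as follows. This is an illustrative sketch, not the authors' code; the data structures and function names are hypothetical.

```python
from collections import deque

def make_rolling_prompter(seed_frames, window=3):
    """Maintain a rolling window of labeled frames that are reused as
    few-shot exemplars. `seed_frames` is a list of (image, labels) pairs;
    in the paper these are the three manually labeled seed frames."""
    exemplars = deque(seed_frames, maxlen=window)

    def build_prompt(current_frame):
        # Few-shot prompt: the last `window` labeled frames, then the query.
        shots = [{"image": img, "labels": lab} for img, lab in exemplars]
        return {"exemplars": shots, "query": current_frame}

    def commit(frame, predicted_labels):
        # After frame t is completed, append it so frame t+1 reuses it;
        # the deque drops the oldest exemplar automatically.
        exemplars.append((frame, predicted_labels))

    return build_prompt, commit
```

The `maxlen` deque keeps the window at exactly three frames, so the prompt size stays constant as the recording is processed.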
The QD signals substantially reduce human labeling effort compared with conventional manual annotation, reduce bias from imprecise human labels, and make it possible to identify poor pseudo-labels after the fact using geometric criteria such as large 3D reprojection error. Those filtered labels can then be used directly or to fine-tune a downstream pose estimation model. More broadly, this setup encourages the use of QD-based labeling for small animals such as mice, fish, and birds, where conventional pose annotation or motion tracking with motion capture devices is especially difficult. For behavioral understanding, we introduce a multi-stage pipeline that first applies deep embedded clustering to obtain fine-grained, over-segmented behavioral clips for each animal, then invokes a VLM to generate per-clip behavioral labels and natural-language descriptions, and finally leverages an LLM to merge similar segments and assign semantically meaningful labels. This pipeline makes heavy use of visual information. In particular, the segmentation process can operate directly on video features and does not require keypoints, which distinguishes it from prior behavior pipelines that are restricted to pose-based inputs. Together, these two pipelines form a unified framework (Figure 1) that replaces extensive human annotation and task-specific model training with structured vision-language reasoning, enabling scalable and interpretable automated analysis of naturalistic animal behavior. Our main contributions are:

Figure 2: Pose estimation experimental setup. (A) Data collection: a mouse injected with near-infrared quantum dots (QDs) at 12 body keypoints is recorded by six synchronized NIR-optimized cameras. (B) Example six-view frames with QD fluorescence centroids detected and overlaid as numbered candidates on the reflectance images.
Centroid indices are local to each view; the goal is to assign anatomical identities to these candidates across all cameras and timepoints.

- A multi-stage VLM-based reasoning pipeline for QD-grounded pose estimation that requires only three labeled seed frames and produces labels that can be inspected, filtered, corrected, and reused for downstream pose model fine-tuning.
- A multi-stage behavioral understanding pipeline that converts visual or fused behavioral features into semantically meaningful behavioral segments through low-cost over-segmentation, VLM-based visual interpretation, and LLM-based semantic reasoning.
- Evaluation on a custom six-view quantum-dot mouse dataset [Ulutas et al., 2025] and the MABe2022 Mouse Triplets benchmark [Sun et al., 2023], demonstrating that finetuning-free vision-language reasoning can achieve reliable pose estimation and interpretable multi-animal behavioral segmentation.

2 Pose Estimation

2.1 Experimental Data

We use a dataset of 500 synchronized timepoints from six cameras recording a freely moving mouse. The mouse was injected with near-infrared fluorescent nanoparticles (quantum dots, QDs) at 12 anatomical keypoints, following the QD data acquisition procedure in [Ulutas et al., 2025]. This setup provides both reflectance images, which capture the visible behavior of the animal, and fluorescence images, which reveal the QD signals at body locations (Figure 2A). Each fluorescence centroid indicates the location of a body marker, but not its anatomical identity. The raw frames have resolution 2048 × 1400 in every view. For each frame and camera, we apply Segment Anything 3 [Carion et al., 2025] to detect the mouse body mask. We then crop the frame to the mask’s tight bounding box and pad the crop by 16 pixels plus 5% of the bounding box dimensions. Because the apparent size and position of the mouse vary across viewpoints and over time, the crop dimensions differ across cameras and frames.
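The cropping rule above (tight mask bounding box, padded by 16 pixels plus 5% of the box dimensions, clamped to the 2048 × 1400 frame) can be written down directly. The function and argument names are illustrative assumptions, not taken from the paper.

```python
def padded_crop_box(mask_bbox, img_w, img_h, pad_px=16, pad_frac=0.05):
    """Expand a tight bounding box by a fixed pixel margin plus a fraction
    of the box size, clamped to the image bounds. Mirrors the cropping
    rule described in the text (16 px + 5% of the bbox dimensions)."""
    x0, y0, x1, y1 = mask_bbox
    w, h = x1 - x0, y1 - y0
    dx = pad_px + pad_frac * w
    dy = pad_px + pad_frac * h
    return (max(0, int(x0 - dx)), max(0, int(y0 - dy)),
            min(img_w, int(x1 + dx)), min(img_h, int(y1 + dy)))
```

Because the padding scales with the box, crops stay proportionate as the mouse's apparent size changes across views and frames.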
QD centroid locations are extracted from the fluorescence channel inside each crop and overlaid as numbered candidate keypoints on the reflectance image (Figure 2B). The task is to assign anatomical identities to these candidate points with minimal human effort. We use only three manually labeled seed frames. Afterward, the pipeline generates labels automatically, and these labels can be reviewed, corrected, or filtered using geometric confidence measures such as large 3D reprojection error before they are used in downstream analysis or pose model fine-tuning.

2.2 Method

BehaviorVLM formulates QD-grounded pose estimation as a structured visual reasoning problem, guiding a vision-language model (VLM) through a four-stage pipeline (Figure 3A, B). The pipeline requires only three manually labeled seed frames. Completed predictions are appended to a rolling window and reused as few-shot exemplars for subsequent frames, enabling temporally coherent keypoint tracking.

Stage 1: Body Region Detection. The 12 body keypoints are partitioned into four anatomical regions: ears (ear_L, ear_R), back (back_top, back_middle, back_bottom), paws (forepaw_L, forepaw_R, hindpaw_L, hindpaw_R), and tail (tail_base, tail_middle, tail_tip). For each region in each camera view, the VLM (Qwen 3.5-27B [Qwen Team, 2026]) is provided with three consecutive preceding frames as rolling few-shot exemplars, each annotated with a colored bounding box over the target region. The VLM predicts the bounding box of that region in the current frame. This stage narrows the search space before any keypoint identity is assigned and helps the pipeline remain stable during fast motion and partial occlusion.

Stage 2: Within-Region Keypoint Assignment. The target frame is cropped to each predicted region bounding box.
The VLM is then prompted with three rolling exemplar crops, each with verified centroid-to-keypoint assignments, and asked to assign the numbered centroids inside the crop to the corresponding region keypoints. This decomposition into local crops reduces assignment ambiguity because each crop contains only 2–4 relevant keypoints.

Stage 3: Cross-Region Assignment Reconciliation. Per-region assignments from Stage 2 are merged across all four regions into a single full-frame assignment. At this stage, some conflicts can remain, such as two keypoints being assigned to the same centroid or some visible centroids being left unused. We therefore call the VLM once more with the full-frame image and a structured description of the current partial assignments and candidate centroid indices. The VLM reconciles conflicts and fills gaps so that the visible centroids receive a complete and unique assignment.

Stage 4: 3D Cross-View Consensus Refinement. Given the per-camera 2D keypoint predictions across all six views, we apply a RANSAC-based triangulation [Fischler and Bolles, 1981] and cross-view consistency correction to refine potentially erroneous centroid assignments. For each keypoint, we first triangulate a 3D world position using RANSAC over subsets of cameras. We select the subset that maximizes the inlier count, where inliers are cameras whose reprojection error falls below a threshold τ_reproj, and then re-triangulate using only those inlier cameras to obtain a refined 3D estimate. We next compute the reprojection error of this 3D estimate in every camera and partition assignments into locked cameras (low error, trusted) and target cameras (high error, to be corrected). For each high-error keypoint, we enumerate hypotheses by considering alternative nearby centroid assignments in each target camera and projecting the current 3D estimate to identify geometrically plausible candidates.
Each hypothesis is scored by re-running RANSAC triangulation and computing the resulting mean reprojection error. We accept the hypothesis with the lowest error and resolve conflicting assignments through swaps when necessary. This final stage matters not only for accuracy, but also for quality control. The same reprojection-based confidence measure can be used after prediction to identify low-quality labels, remove them, or send them for manual correction before training a downstream pose estimation model. The completed frame-t predictions are then appended to the rolling window and used as exemplars for frame t + 1.

2.3 Results

Quantitative Evaluation. Figure 3C reports the mean 3D keypoint prediction error across all 12 keypoints over the 500-timepoint recording. To evaluate the contribution of each pipeline component, we compared three versions of BehaviorVLM: (i) the full BehaviorVLM pipeline, (ii) BehaviorVLM without 3D cross-view refinement, and (iii) BehaviorVLM without region detection & 3D refinement (plain rolling three-shot prompting without region-based decomposition or 3D cross-view refinement). Both the region-based decomposition and the 3D cross-view refinement contribute substantially to accuracy, with the full pipeline reducing mean error by 54% relative to the naïve baseline.
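The Stage 4 consensus step (triangulate over camera subsets, keep the subset with the most low-reprojection-error inliers, then re-triangulate on the inliers) can be approximated with standard linear (DLT) triangulation. This is a simplified sketch, not the authors' implementation: it searches camera subsets exhaustively rather than by random sampling, which is feasible for six cameras, and all function names are illustrative.

```python
import itertools
import numpy as np

def triangulate(Ps, pts2d):
    """Linear (DLT) triangulation from 3x4 projection matrices and 2D points."""
    A = []
    for P, (u, v) in zip(Ps, pts2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null-space (least-squares) solution
    return X[:3] / X[3]

def reproj_error(P, X, pt2d):
    x = P @ np.append(X, 1.0)
    return float(np.linalg.norm(x[:2] / x[2] - pt2d))

def ransac_triangulate(Ps, pts2d, tau, min_views=2):
    """Pick the camera subset whose triangulation maximizes the number of
    inliers (reprojection error < tau), then re-triangulate on the inliers."""
    n = len(Ps)
    best_inliers = []
    for k in range(min_views, n + 1):
        for subset in itertools.combinations(range(n), k):
            X = triangulate([Ps[i] for i in subset], [pts2d[i] for i in subset])
            inliers = [i for i in range(n)
                       if reproj_error(Ps[i], X, pts2d[i]) < tau]
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
    X = triangulate([Ps[i] for i in best_inliers],
                    [pts2d[i] for i in best_inliers])
    return X, best_inliers
```

The same `reproj_error` scores can afterwards be thresholded to flag low-confidence labels for review, as the text describes.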
Figure 3: BehaviorVLM pose estimation pipeline and results. (A) Pipeline overview. (B) Detailed example for one frame from camera 0: the VLM first localizes four body regions (ears, back, paws, tail) via bounding boxes, then assigns centroids to keypoints within each cropped region, merges assignments, and resolves conflicts. Six-view predictions are triangulated into 3D and refined via RANSAC consensus. (C) Ablation study showing mean 3D keypoint error (mm) averaged over 12 keypoints across 500 frames. The full BehaviorVLM pipeline (6.59 mm) outperforms variants without 3D cross-view refinement (9.16 mm) and without both body region detection and 3D refinement (14.29 mm), demonstrating the contribution of each component.
(D) Representative 3D keypoint trajectories for four body keypoints: back_top, tail_tip, ear_R, and hindpaw_R. Ground truth is shown in orange, BehaviorVLM predictions in blue.

Qualitative Evaluation. Figure 3D shows representative predicted keypoint trajectories for each of the four body regions (back, tail, ears, paws). For back, tail, and ear keypoints, BehaviorVLM tracks the trajectories closely across the full recording. When predictions temporarily deviate from ground truth, the pipeline often recovers in later frames instead of drifting through the rest of the sequence. This resilience is important in practice. Even when a few frames are labeled imperfectly and those labels are reused as exemplars, the VLM does not simply copy the earlier mistake. Its visual reasoning still allows a later frame to be judged somewhat independently, which helps the system correct earlier errors rather than accumulate them monotonically over time. Paw keypoints remain the hardest case because of frequent occlusion and strong visual similarity between left and right limbs and between forepaws and hindpaws. BehaviorVLM still occasionally confuses these identities. These errors can be identified later using the same geometric confidence checks from Stage 4 and then corrected manually, removed, or used selectively when constructing downstream training data. Overall, the results show that BehaviorVLM can generate useful and reviewable pose labels from QD-grounded videos using only three labeled seed frames and no task-specific fine-tuning.

3 Behavioral Understanding

3.1 Experimental Data

We evaluate the behavioral understanding pipeline on the Mouse Triplets dataset from the MABe2022 challenge [Sun et al., 2023], which consists of top-view videos of three freely interacting mice in an open arena equipped with a food zone. Each video is annotated with frame-level behavior labels that include chase, huddle, oral contact, and oral-genital contact.
These labels are human annotations provided by the dataset. In the experiments reported here, however, we use only the videos and derived visual features as inputs to our pipeline and do not use these manual labels during segmentation or semantic interpretation.

3.2 Method

BehaviorVLM provides a pipeline for semantic behavior segmentation in multi-animal videos (Figure 4). Given behavioral feature representations, the method converts low-level temporal structure into interpretable behavioral segments. The VLM first interprets what happens in each short clip. The LLM then merges neighboring clips into temporally coherent semantic descriptions of individual and social behaviors. This is a human-like process: observe actions, describe them, and then merge them into meaningful behaviors.

Stage 1: Flexible Feature Representation. BehaviorVLM operates on behavioral feature representations extracted from multi-animal videos. In our implementation, we use fused visual and keypoint features produced by the LookAgain framework [Li et al., 2026], which integrates visual appearance and motion information from keypoints into a unified representation. More generally, the pipeline can accept different types of behavioral features, including: (i) keypoint-based features, derived from tracked body keypoints (e.g., pairwise distances, angles, velocities) or pretrained motion encoders; (ii) visual features, extracted directly from raw video frames using a pretrained visual encoder; or (iii) fused features, combining both keypoint and visual streams. This flexibility allows BehaviorVLM to operate when only partial modalities are available, supports simultaneous analysis of multiple animals, and makes the method more robust to keypoint noise, missing keypoints, and changes in body orientation or camera rotation.

Stage 2: Over-Segmented Behavior Discovery via Deep Embedded Clustering.
Given the behavioral feature representations, we apply Deep Embedded Clustering (DEC) [Xie et al., 2016] to discover initial behavioral segments. During training, DEC is optimized jointly across all animals by minimizing the sum of per-animal clustering losses. At inference time, the learned clustering model is applied to each animal’s feature sequence separately to produce behavioral segmentation for that animal. We intentionally use a relatively large number of clusters to produce short, fine-grained video clips, where each clip corresponds to a contiguous behavioral segment of a single animal with relatively homogeneous motion statistics. This over-segmentation strategy serves several purposes. First, it reduces the chance of missing real behavioral boundaries. If the first pass is too coarse, different behaviors can be merged before the semantic reasoning stage ever sees them. Second, it preserves short transitions that would otherwise be absorbed into longer segments. Third, it gives the VLM clips that are easier to interpret because each clip usually contains a smaller and more consistent set of actions. Fourth, it leaves the final merging decision to the later LLM stage, where the model has access to richer semantic evidence from the generated descriptions. DEC is also a low-cost first stage. Training DEC on precomputed features is substantially cheaper than training a dedicated segmentation model such as an HMM-based pipeline end to end, and it is also lighter than workflows that first learn a separate representation with methods such as t-SNE-based pipelines and then perform clustering. In practice, DEC provides a simple way to generate candidate segments without adding another expensive model-training step. Stage 3: VLM-Based Per-Clip Video Understanding. For each short video clip produced by DEC, we invoke a state-of-the-art VLM to perform video understanding. 
The VLM is prompted with a structured query that asks it to (i) assign a concise behavioral label, such as “chasing”, “exploring”, or “feeding”, and (ii) generate a detailed natural-language description of the focal animal’s behavior within the clip, including body posture, movement direction, speed, and any interactions with the other animals if they are present. This stage converts each short clip into a textual representation without any task-specific training. Although the initial segmentation is performed separately for each animal, the VLM still sees the video content of that animal’s segment in its social context. As a result, if the focal animal is interacting with another animal, the VLM can explicitly describe this interaction and produce social labels. This is an important difference from traditional behavior segmentation pipelines: the segmentation is generated per animal, but the final labels can still contain social behavioral information because the VLM interprets the visual scene rather than only a single-animal motion trace. The mouse A0 segments shown in Figure 4 correspond to these direct VLM-stage segments and are intentionally more finely segmented than the final mouse A0 segments shown in Figure 5.

Stage 4: LLM-Based Semantic Reasoning and Segment Merging. The set of per-clip text descriptions from the VLM is passed to an LLM with strong reasoning capability. The LLM does not see the video directly. Instead, the VLM serves as a perception module that converts visual observations into text, and the LLM serves as a reasoning module that organizes these textual descriptions into behaviors. We use this separation deliberately because current state-of-the-art LLMs often provide stronger long-range semantic reasoning and grouping ability than state-of-the-art VLMs.
In this sense, the pipeline moves from perception to cognition: the VLM perceives the video and converts it into language, and the LLM performs higher-level reasoning over that perceived representation. The LLM performs three operations. First, it merges adjacent or nearby clips whose descriptions indicate the same behavioral state. Second, it assigns a refined behavioral label and an enriched description to each merged segment by integrating evidence across the constituent clips. Third, it returns a temporally structured behavioral annotation that can be used for downstream neuroscience analysis. This is the stage that converts the finer VLM-stage segmentation into the longer final segmentation shown in Figure 5. This pipeline requires no manually annotated behavior labels and no task-specific model training. By combining the discriminative power of clustering in feature space with the semantic richness of VLMs and the reasoning capability of LLMs, BehaviorVLM achieves an interpretable and scalable solution to multi-animal behavioral understanding. The method has several practical advantages. It is guided by semantic understanding of behavior rather than only unstable low-level dynamics. Its over-segmentation followed by semantic merging avoids committing too early to incorrect boundaries. It combines video and text reasoning in a multimodal pipeline. It groups behaviors by semantic meaning rather than only by pose or motion similarity. It is robust to keypoint noise and can operate without keypoints. It produces human-readable descriptions for each segment. Overall, the pipeline mimics a human-like process in which visual observations are first described and then organized into coherent behaviors.

3.3 Results

In our implementation, we use fused keypoint and visual features from [Li et al., 2026] as inputs to the pipeline. At the same time, the method is not restricted to keypoint-based representations.
It can also operate on visual features extracted directly from video, which is important because one goal of BehaviorVLM is to show that behavior segmentation can be performed from visual information alone rather than requiring keypoints. For DEC-based clustering, we set the number of clusters to K = 10 per animal, yielding short clips with average duration of approximately 1–5 seconds. For VLM captioning, we use Qwen3.5-35B-A3B [Qwen Team, 2026] with a structured prompt template, and all clips are uniformly downsampled to 10 fps before being passed to the VLM. Finally, the LLM reasoning step uses Qwen3-Next-80B-A3B [Qwen Team, 2025] with a prompt that receives clip descriptions for a contiguous behavioral epoch and outputs merged segments with refined labels.

Behavior Segmentation. Figure 5 illustrates an example behavioral segmentation produced by BehaviorVLM on a multi-animal interaction sequence. The model generates temporally coherent behavioral segments for each animal, shown in the segmentation timeline at the top of the figure. Each segment corresponds to a contiguous interval with relatively consistent motion and interaction patterns. BehaviorVLM produces boundaries that align well with visually identifiable behavioral transitions. In contrast, purely kinematic unsupervised approaches often exhibit rapid state switching and fragmented segments because they rely on low-level motion statistics alone. The segmented videos corresponding to the final LLM-merged outputs for mouse A0 can be viewed at https://tinyurl.com/video-for-segments-from-llm.

Semantic Labels and Descriptions. Beyond segmentation, BehaviorVLM provides semantic annotations for each behavioral segment. The VLM first describes short candidate clips, including the finer mouse A0 segments shown in Figure 4. The LLM then merges these clips into the longer final segments shown in Figure 5.
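In the paper, the merge from fine VLM-stage clips to final segments is performed by an LLM reasoning over the text descriptions. As a rough, rule-based stand-in for that operation, adjacent clips that share a label can be merged deterministically; this sketch is only an illustrative approximation, and the tuple layout is an assumption.

```python
def merge_adjacent(clips):
    """Merge temporally adjacent clips that share a behavioral label.
    `clips` is a list of (start_s, end_s, label, description) tuples,
    sorted by start time. Descriptions of merged clips are concatenated,
    standing in for the LLM's enriched segment description."""
    merged = []
    for start, end, label, desc in clips:
        if merged and merged[-1][2] == label and abs(merged[-1][1] - start) < 1e-9:
            s, _, lab, prev_desc = merged[-1]
            merged[-1] = (s, end, lab, prev_desc + " " + desc)
        else:
            merged.append((start, end, label, desc))
    return merged
```

Unlike this rule, the LLM can also merge clips whose labels differ superficially but whose descriptions indicate the same behavioral state, which is why the paper delegates the decision to semantic reasoning rather than label equality.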
This distinction matters because the VLM stage is intentionally more segmented, while the LLM stage is responsible for producing the final, easier-to-read behavioral timeline. The system produces interpretable labels such as chasing, huddling, oral contact, and oral-genital contact, along with other behavior descriptions supported by the video evidence. These segment-level explanations provide semantic information that is typically absent from behavior segmentation methods that return only latent states or cluster identities.

Figure 4: Overview of the BehaviorVLM pipeline for semantic behavioral understanding. Behavioral features are first over-segmented into fine-grained candidate clips for each animal. A vision-language model (VLM) then generates natural-language labels and descriptions for each clip. The mouse A0 segments shown here are the direct VLM-stage segments and are therefore more fine-grained than the final LLM-merged mouse A0 segments shown in Figure 5.

4 Conclusion

We presented BehaviorVLM, a unified and finetuning-free framework for animal pose estimation and behavioral understanding. For pose estimation, we introduced a multi-stage prompting pipeline that guides a pretrained vision-language model through sequential body-region detection, within-region and cross-region keypoint assignment, followed by RANSAC-based 3D cross-view consensus refinement.
Using only three manually annotated reference frames and no model fine-tuning, BehaviorVLM achieves reliable keypoint tracking across a 500-timepoint, six-view recording while producing labels that can be reviewed, filtered, corrected, and reused for downstream pose model training. For behavioral understanding, we introduced a pipeline that combines low-cost deep embedded clustering for fine-grained candidate segments with vision-language models for clip-level interpretation and large language models for semantic refinement and segment merging. This pipeline can use direct visual information and is not restricted to keypoint-based segmentation. Together, these results highlight the promise of structured vision-language reasoning for neuroscience by reducing manual annotation burden while preserving interpretable intermediate outputs that researchers can inspect and reuse.

Final Behavioral Segmentation (Figure 5 panels):

Mouse A0 (from LLM):
1. Locomotion (0–4.7 s): Mouse A0 moves continuously across the arena floor with turns and direction changes.
2. Oral Genital Contact (4.7–5.9 s): Mouse A0 positions its head near the posterior body and tail region of Mouse A2, indicating directed nose/mouth contact with the anogenital area.
3. Locomotion (5.9–10.9 s): Mouse A0 exhibits continuous locomotion across the enclosure floor with minimal interaction with Mouse A2.
4. Chase (10.9–13.1 s): Three distinct mice are visible in a head-to-tail formation, indicating a following or pursuit dynamic typical of chasing behavior. Mouse A0 acts as a central focal point, with other mice positioning behind or ahead of it.
5. Social Encounter (13.1–15 s): Mouse A0 moves towards Mouse A1, and they come into close physical proximity before the green mouse exits the frame.
6. Feeding (15–21.3 s): Mouse A0 remains stationary beneath the grille-like food delivery port, keeping its nose/head in contact with the grille in a probing or feeding posture.
7. Huddles (21.3–24 s): Two distinct mice maintain sustained direct body contact while remaining largely motionless on the floor.
8. Exploring (24–31.1 s): Mouse A0 moves actively across the floor with head lowered and tail swinging, indicative of sniffing or exploration.
9. Stationary (31–33.8 s): Mouse A0 is visible and remains completely motionless throughout the interval with no detectable body, head, or tail movement.
10. Oral Genital Contact (33.8–39.5 s): Mouse A0 drifts locally in the center. A second mouse (Mouse A1) enters from the corner-side, following closely behind Mouse A0. Mouse A1 aligns its head with the posterior tail-base region of Mouse A0.
11. Stationary (39.5–60 s): Mouse A0 settles against the wall-bedding interface and remains largely motionless with minor head movements.

Mouse A1 (from LLM):
1. Locomotion (0–11.3 s): Mouse A1 performs local movement with body rotation and shifting position without significant displacement.
2. Oral Genital Contact (11.3–14.3 s): Mouse A0 pursues Mouse A1 towards the bottom of the cage. Mouse A0 positions itself directly behind Mouse A1, bringing its nose/head into contact with the green mouse's tail and posterior body region.
3. Chase (14.3–19.9 s): Mouse A1 moves along the cage wall, and Mouse A2 follows closely behind, maintaining proximity in sustained directed motion.
4. Huddles (19.9–23.4 s): Mouse A1 and A2 maintain sustained direct body contact in the corner area, remaining largely motionless.
5. Feeding (23.4–24.3 s): Mouse A1 moves towards and interacts with the grille-like food delivery port.
6. Oral Genital Contact (24.3–26.1 s): Mouse A2 approaches Mouse A1, positioning its head near the green mouse's tail.
7. Oral Contact (26.1–27.5 s): Mouse A2 approaches Mouse A1, moving into close physical proximity.
8. Moving (27.5–31.1 s): Mouse A1 moves from the upper-center region towards the left vertical wall.
9. Exploring (31.1–35.1 s): Mouse A1 remains near a wall/corner boundary, sniffing the ground, before moving rapidly to the right.
10. Chase (35.1–41.3 s): Mouse A1 follows Mouse A0 in sustained directed motion.
11. Exploring (41.3–51.4 s): Mouse A1 traverses the enclosure and pauses near the corner-side wall.
12. Moving (51.4–52.4 s): Mouse A1 moves horizontally from left to right.
13. Sniffing (52.4–55.2 s): Mouse A1 is near the corner of the enclosure, making minor head movements.
14. Moving (55.2–60 s): Mouse A1 moves across the enclosure floor.

Mouse A2 (from LLM):
1. Social Approach (0–5.1 s): Mouse A2 is stationary near the corner. Mouse A0 enters from the bottom and moves rapidly towards the stationary orange mouse.
2. Feeding (5.1–11.3 s): Mouse A2 approaches the grille-like food port and maintains nose/head contact with the mesh.
3. Sniffing Wall (11.3–12.3 s): Mouse A2 is positioned along the right vertical wall, orienting its head and nose directly against the wall/substrate edge.
4. Exploring (12.3–16.8 s): Mouse A2 exhibits independent locomotion and explores the cage floor and walls.
5. Chase (16.8–19.1 s): Mouse A2 moves upward along the left wall, and Mouse A1 follows closely behind.
6. Huddles (19.1–23.1 s): Mouse A2 and A1 move into close proximity and maintain sustained side-by-side body contact.
7. Oral Genital Contact (23.1–25.5 s): Mouse A0 approaches Mouse A2 and positions its head/nose directly against the rear end/tail-base of Mouse A2.
8. Exploring (25.5–48.8 s): Mouse A2 moves independently through the enclosure. Although other mice are briefly visible and interact with it, there is no sustained social interaction.
9. Feeding (48.8–60 s): Mouse A2 is positioned directly beneath the horizontal grille-like food delivery port. The mouse maintains its nose/head in direct contact with the port structure.

Figure 5: Behavioral understanding results for video 3ZOUFPHJ7JOHFBE8RHY6 in the MABe2022 Mouse Triplets dataset.
BehaviorVLM produces temporally coherent behavioral segmentation for each mouse. For every candidate segment, a vision-language model (VLM) first generates natural-language descriptions of the observed actions and interactions (Figure 4, mouse A0 segments). A large language model (LLM) then refines and merges these descriptions into the final behavioral events shown here.

References

Gordon J. Berman, Daniel M. Choi, William Bialek, and Joshua W. Shaevitz. Mapping the stereotyped behaviour of freely moving fruit flies. Nature Methods, 11(12):1170–1176, 2014.

Dan Biderman, Matthew R Whiteway, Cole Hurwitz, Nicholas Greenspan, Robert S Lee, Ankit Vishnubhotla, Richard Warren, Federico Pedraja, Dillon Noone, Michael M Schartner, et al. Lightning Pose: improved animal pose estimation via semi-supervised learning, Bayesian ensembling and cloud-native open-source tools. Nature Methods, 21(7):1316–1328, 2024.

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, and Christoph Feichtenhofer. SAM 3: Segment anything with concepts, 2025. URL https://arxiv.org/abs/2511.16719.

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

Weihan Li, Jingyang Ke, Yule Wang, Chengrui Li, and Anqi Wu. Learning when to look: On-demand keypoint-video fusion for animal behavior analysis, 2026. URL https://arxiv.org/abs/2603.07279.
Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21(9):1281–1289, 2018.

Talmo D Pereira, Nathaniel Tabris, Arie Matsliah, David M Turner, Junyu Li, Shruthi Ravindranath, Eleni S Papadoyannis, Edna Normand, David S Deutsch, Z Yan Wang, et al. SLEAP: A deep learning system for multi-animal pose tracking. Nature Methods, 19(4):486–495, 2022.

Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.

Jennifer J Sun, Markus Marks, Andrew Wesley Ulmer, Dipam Chakraborty, Brian Geuther, Edward Hayes, Heng Jia, Vivek Kumar, Sebastian Oleszko, Zachary Partridge, et al. MABe22: A multi-species multi-task benchmark for learned representations of behavior. In International Conference on Machine Learning, pages 32936–32990. PMLR, 2023.

Emine Zeynep Ulutas, Amartya Pradhan, Dorothy Koveal, and Jeffrey E. Markowitz. High-resolution in vivo kinematic tracking with customized injectable fluorescent nanoparticles. Science Advances, 11(40):eadu9136, 2025. doi: 10.1126/sciadv.adu9136. URL https://www.science.org/doi/abs/10.1126/sciadv.adu9136.

Caleb Weinreb, Jonah E Pearl, Sherry Lin, Mohammed Abdal Monium Osman, Libby Zhang, Sidharth Annapragada, Eli Conlin, Red Hoffmann, Sofia Makowska, Winthrop F Gillis, et al. Keypoint-MoSeq: parsing behavior by linking point tracking to pose dynamics. Nature Methods, 21(7):1329–1339, 2024.

Alexander B. Wiltschko, Matthew J. Johnson, Giuliano Iurilli, Randall E. Peterson, Joshua M. Katon, Stanislav L. Pashkovski, Victoria E. Abraira, Ryan P. Adams, and Sandeep R. Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.

Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487. PMLR, 2016.

Teng Xu, Taotao Zhou, Youjia Wang, Peng Yang, Simin Tang, Kuixiang Shao, Zifeng Tang, Yifei Liu, Xinyuan Chen, Hongshuang Wang, et al. MouseGPT: A large-scale vision-language model for mouse behavior analysis. arXiv preprint arXiv:2503.10212, 2025.

Shaokai Ye, Jessy Lauer, Mu Zhou, Alexander Mathis, and Mackenzie Mathis. AmadeusGPT: a natural language interface for interactive animal behavioral analysis. Advances in Neural Information Processing Systems, 36:6297–6329, 2023.

Shaokai Ye, Anastasiia Filippova, Jessy Lauer, Steffen Schneider, Maxime Vidal, Tian Qiu, Alexander Mathis, and Mackenzie Weygandt Mathis. SuperAnimal pretrained pose estimation models for behavioral analysis. Nature Communications, 15(1):5165, 2024.