
Paper deep dive

Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 50

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/22/2026, 6:09:20 AM

Summary

Perceptio is a perception-enhanced Large Vision-Language Model (LVLM) that improves spatial grounding by generating explicit 2D semantic segmentation tokens and 3D depth tokens within an autoregressive sequence. By distilling knowledge from SAM2 and Depth Anything V2 into a VQ-VAE codebook, the model performs spatial chain-of-thought reasoning, achieving state-of-the-art performance on RefCOCO/+/g, HardBLINK, and MMBench.

Entities (5)

Perceptio · model · 100%
Depth Anything V2 · model · 95%
InternVL · model · 95%
SAM2 · model · 95%
VQ-VAE · architecture · 95%

Relation Signals (4)

Perceptio → built_on → InternVL

confidence 95% · Building on InternVL, Perceptio achieves state-of-the-art performance

Perceptio → uses_component → SAM2

confidence 95% · integrate SAM2 based semantic segmentation tokens

Perceptio → uses_component → VQ-VAE

confidence 95% · distill a VQVAE depth codebook from a strong monocular teacher

Depth Anything V2 → teaches → VQ-VAE

confidence 90% · distill a VQVAE depth codebook from a strong monocular teacher

Cypher Suggestions (2)

Find all models that Perceptio is built upon or uses as components. · confidence 90% · unvalidated

MATCH (p:Model {name: 'Perceptio'})-[r:BUILT_ON|USES_COMPONENT]->(m:Model) RETURN m.name, type(r)

Identify the relationship between teacher models and architectural components. · confidence 85% · unvalidated

MATCH (t:Model)-[:TEACHES]->(a:Architecture) RETURN t.name, a.name

Abstract

Abstract: Large Vision-Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

Tags

ai-safety (imported, 100%) · cs.CV (suggested, 92%) · preprint (suggested, 88%)


Full Text

49,211 characters extracted from source content.


Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, and Garin Kessler (Amazon)

Abstract. Large Vision–Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D–3D spatial reasoning abilities, enabled via explicit semantic-segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic-segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

1 Introduction

Modern open-source LVLMs such as the InternVL series [7] and the Qwen-VL series [2,31] have scaled up vision backbones and introduced advanced alignment pipelines. These often deliver strong performance on tasks requiring multi-modal understanding such as captioning [37], visual question answering (VQA) [1], and grounding [36].
Despite pre-training with web-scale image-text data, LVLMs often struggle with spatial understanding in images, including reasoning about depth, distance, and scale [10,30]. For example, BLINK [10] evaluated popular LVLMs on simple tasks that humans solve "within a blink" and observed that LVLMs barely surpass random guessing. This phenomenon is partly due to the lack of explicit 3D cues during pre-training, which also suggests that robust spatial intelligence—the ability to comprehend relative positions and spatial arrangements—has not yet emerged as a general skill. These findings motivate a design that can incorporate spatial understanding into the model learning.

arXiv:2603.18795v1 [cs.CV] 19 Mar 2026

Fig. 1: Comparison of the Perceptio pipeline vs. standard VLMs. To address the spatial understanding challenge for LVLMs, we propose Perceptio, a perception-enhanced LVLM that jointly learns to generate tokens for 2D semantic segmentation and 3D depth perception as an autoregressive sequence. Building on InternVL-2.5 [7], Segment Anything Model 2 (SAM2) [28], and the Depth Anything V2 model [39], Perceptio emits a dedicated segmentation token and a depth token stream before producing the text tokens. This design enables perception-enhanced conditional generation, where, by generating segmentation and depth tokens first, the model anchors subsequent language in explicit 2D & 3D cues, improving VQA, grounding, and spatial reasoning.

We endow the LVLM with 3D spatial perception knowledge by distilling from a 3D depth generation model as teacher in a teacher–student framework. We train a Vector Quantized-Variational Autoencoder (VQ-VAE) on depth maps predicted by the specialist Depth Anything V2 model [39]. Such a discretized depth token sequence and the resulting codebook indices serve as 3D perception tokens. We impart 2D spatial knowledge by incorporating a learnable segmentation token conditioned on the query text.
We treat segmentation and depth as priors that condition the language decoder. In the standard setup, a text-only query q maps to an answer a. In our setting, we augment the input with structured priors over the query and answer, as shown in Figure 1, formatted as: [seg tokens], [depth tokens], [text tokens]. With this perception-enhanced design, the model first interprets the perceptual signal, enabling more effective answers on the downstream task. We highlight our contributions in four main points:

1. Explicit spatial perception in LVLMs. We introduce Perceptio, which enhances LVLMs with in-sequence 2D segmentation and discretized 3D depth tokens, enabling pixel-level and geometric reasoning. To the best of our knowledge, Perceptio is the first to jointly optimize for 2D and 3D perception signals within a single autoregressive sequence in LVLMs.

2. Unified Multi-task Training with Novel Depth Objectives. We propose a joint text–segmentation–depth objective and a series of novel depth-token loss functions (marker + token + count) that stabilize depth token emission. A soft depth reconstruction technique enables fully end-to-end differentiable depth training. (Codebase will be released upon publication.)

3. Perception-enhanced data. We curate a 56K-example joint dataset that pairs segmentation masks and depth priors with language supervision, augmenting RefCOCO/+/g with aligned depth tokens and attribute descriptions to steer intermediate reasoning.

4. State-of-the-Art Performance. Perceptio achieves SOTA on all three referring segmentation benchmarks (RefCOCO/+/g), a +10.3% improvement on HardBLINK spatial reasoning, and a +1.0% gain on MMBench general VQA, demonstrating that explicit in-sequence perception materially strengthens spatial grounding across diverse tasks.

Fig. 2: The Perceptio model.
The language model takes text queries and image features as input to generate the desired text sequence (B). During training time, segmentation (A) and depth (C) teacher models supervise (via loss functions) the LVLM to accurately generate the intermediate perception tokens and the answer text output tokens.

2 Related Work

2.1 Large Vision-Language Models (LVLMs)

Recently, LVLMs have demonstrated remarkable progress. These models integrate tokenized visual features with language tokens, feeding the combined representation into a pre-trained Large Language Model (LLM) to understand and generate responses that span both visual and linguistic domains [5,32,45]. The latest landscape of LVLMs, including robust architectures like LLaVA [19], GPT-4v [34], and their contemporaries, has pushed the boundaries of general-purpose visual reasoning, complex dialogue, and detailed image captioning. However, a critical review shows these models are better at semantic understanding (i.e., knowing what is in an image) than spatial understanding (i.e., knowing where things are), because their architectures are not explicitly designed to model spatial awareness.

Rather than being explicitly modeled, complex spatial relationships such as relative and absolute positions are typically assumed to emerge from training at scale; as a result, spatial reasoning is rarely treated as a first-class, foundational objective. For example, despite its scale, InternVL2.5-26B achieves only 33.1% average accuracy on HardBLINK's "closer-to-camera" point-selection task (details in Table 2). This underscores that spatial understanding remains a notable weakness in multi-modal LLMs and does not reliably emerge from scale alone.

(Dataset will be released upon publication.)
2.2 Perception Guidance in LVLMs

Despite rapid progress in LVLMs, fine-grained grounding and spatial reasoning remain difficult because text decoders often infer geometry from pooled features without explicit spatial cues. Two-stage pipelines, such as LLM controllers wrapped around LISA [14], improve segmentation, as do token-emitting LVLM variants, but they externalize perception and rarely feed masks back into the reasoning loop [13,14,35]. PerceptionGPT [26] brings perception into the sequence by learning a dynamic token that encodes boxes and masks, boosting performance on Referring Expression Segmentation (RES), yet remains limited to 2D semantics [23]. Sa2VA further unifies an LLM with SAM2 to produce query-grounded masks for images and videos, advancing RES while still operating on planar cues [42]. In parallel, AURORA introduces "perception tokens" that discretize mid-level signals, most notably monocular depth via a VQ-VAE codebook, yielding sizable gains on depth and counting; however, it neither outputs segmentation masks for grounding nor fuses 2D semantics with 3D geometry in one model, and it can degrade general VQA performance [3]. Evidence from DenseWorld-1M shows that leading LVLMs still miss small objects and misalign references, underscoring inadequate spatial grounding [16].

These limitations stem from LVLMs natively emitting text rather than dense maps. Injecting intermediate 2D and 3D cues helps, but purely text-decoder LVLMs still underperform at spatial understanding. Simultaneously, specialist pipelines excel on targeted spatial tasks yet trade off broad conversational ability. Similarly, metric-depth-only approaches (e.g., DepthLM [4]) do not unify 2D semantics with 3D geometry. To our knowledge, no prior work jointly optimizes complementary objectives for 2D semantic segmentation and 3D depth reasoning within a single LVLM.
Perceptio closes this gap by injecting SAM2-based semantic segmentation tokens and discretized depth tokens into the sequence, enabling explicit spatial reasoning and yielding state-of-the-art (SOTA) grounding performance on multiple tasks.

3 Methods

We introduce Perceptio, a perception-enhanced LVLM that explicitly incorporates visual segmentation and depth cues into its generation process. In this section, we first describe the model architecture and the insertion of semantic segmentation and discretized depth tokens into the autoregressive sequence (3.1). Then, we detail the procedure for generating perception tokens and explain how the model learns from our perception-conditioned generation pattern (3.2). Next, we describe the model's inference-time behavior (3.3), followed by the multi-task objective (3.4) and experimental setup (3.5).

3.1 Model Architecture

Figure 2 provides an overview of our approach, Perceptio. Given an input image and a text query, the system routes visual signals through three complementary pathways: (i) a standard image encoder for semantic appearance features; (ii) a frozen SAM encoder for segmentation-aware representations; and (iii) a frozen pre-trained depth Vector Quantized-Variational Autoencoder (VQ-VAE) codebook that discretizes image depth. The core LLM consumes the encoded image features together with the query and produces an autoregressive sequence that interleaves natural-language tokens with perception-control tokens. In particular, it predicts a special [seg] token to request segmentation and a sequence of discrete depth tokens [depth] to represent depth. These tokens trigger task-specific decoders: when [seg] appears, a SAM2 decoder reconstructs segmentation masks; when [depth] appears, a depth decoder maps the discrete codes back to a continuous depth map via the VQ-VAE codebook.
During training, we fine-tune the SAM2 decoder to learn the special segmentation tokens, supervising it with reconstruction losses against ground-truth masks. In contrast, the depth branch (codebook and decoder) is kept frozen: the LVLM is trained only to generate depth tokens that index the pre-trained codebook, enabling depth reconstruction without updating the depth decoder. This unified design enables Perceptio to perform language generation, referring expression segmentation, and depth reasoning within a single autoregressive framework, making perception a first-class part of the language-modeling objective rather than a post-hoc step.

3.2 Model Learning

Perception Enhanced Generation. Perception-enhanced generation refers to our strategy of guiding the LVLM's generation with intermediate visual cues. We enforce a specific output format: the model's generated token sequence must contain a segmentation token block and a depth token block before the final textual answer. Formally, the sequence is structured as:

[seg] [d_start, d_1, d_2, ..., d_n, d_end] [t_1, t_2, ..., t_m]    (1)

where [seg] is a special control token whose embedding conditions the segmentation decoder to output a query-grounded mask, [d_start, d_1, ..., d_n, d_end] are the discretized depth tokens, and [t_1, t_2, ..., t_m] represents the text answer tokens. The model is trained to always emit these in order: first the segmentation, then the depth, and then the answer. The motivation for this enforced ordering arises from the autoregressive nature of the decoder—by generating perceptual tokens first, the model effectively performs chain-of-thought reasoning based on the scene's spatial structure before formulating a final answer. This approach injects explicit spatial awareness into the language model's output by requiring the model to explicitly generate its visual perception in the form of segmentation masks and depth maps before producing the final response to the query.
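As a toy illustration (not the paper's released code), the enforced layout of Eq. (1) can be sketched as a small sequence-assembly helper; the token spellings here are placeholders:

```python
# Illustrative sketch of the Eq. (1) target layout: a [seg] control token,
# a bracketed depth-token span, then the answer text tokens.
# Token names are placeholders, not the model's actual vocabulary entries.
def build_target_sequence(depth_codes, answer_tokens):
    """Assemble the target sequence: segmentation, then depth, then text."""
    seq = ["[seg]"]                             # triggers the mask decoder
    seq += ["[d_start]"]                        # opens the depth span
    seq += [f"<d{c}>" for c in depth_codes]     # discrete VQ-VAE code indices
    seq += ["[d_end]"]                          # closes the depth span
    seq += list(answer_tokens)                  # final natural-language answer
    return seq

seq = build_target_sequence([7, 42, 7], ["Playing", "ultimate", "frisbee."])
# seq -> ['[seg]', '[d_start]', '<d7>', '<d42>', '<d7>', '[d_end]', 'Playing', 'ultimate', 'frisbee.']
```

The fixed ordering is the point: the perception tokens always precede the answer, so the language head conditions on them.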
Depth Codebook. To capture fine-grained 3D structure, inspired by [24], we construct a depth codebook using a VQ-VAE with codebook size K [25]. We first obtain reliable continuous depth maps with the depth-specialist model Depth Anything V2 [39], then discretize them into depth tokens via vector quantization, enabling seamless integration into our token-based framework. In contrast to prior work that learns a codebook on a single, specialized depth dataset [24], we train on all depth maps derived from the same scene-image corpora (3.5) used to finetune the LLM. This distributional alignment improves robustness and strengthens depth perception. The resulting VQ-VAE depth codebook serves as a broadly generalizable prior and an augmentation signal that guides the LLM to generate accurate depth tokens.

In this setup, each depth map is encoded as a grid of embeddings, where the nearest-neighbor distance is used to identify the closest entry in the codebook. The VQ-VAE decoder reconstructs the depth map from the sequence of latent codes, and the entire model is trained with a mean-squared-error (MSE) reconstruction loss to ensure accurate reconstruction. During inference, we patchify the depth map into a √n × √n grid of code indices, resulting in an n-token sequence where each token represents one of the K discrete depth values in the depth codebook, labeled d_1 to d_n. The depth token sequence starts with a special [d_start] token and ends with a special [d_end] token, adding a total of K + 2 depth-related tokens (K depth values plus two special tokens) to the model's vocabulary.

Concretely, for each training sample we augment the textual prompt so that the target sequence first includes perception tokens (segmentation and depth), followed by the textual answer. Exposure to these augmented training instances encourages the model to condition its reasoning on explicit segmentation and depth cues.
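The nearest-neighbor codebook lookup described above can be sketched in a few lines of NumPy; this is a hedged toy illustration (K = 4 and the 4-dimensional vectors are invented for the example), not the paper's implementation:

```python
import numpy as np

# Toy sketch of VQ-VAE quantization: each depth-patch embedding is assigned
# the index of its nearest codebook entry, yielding an n-token code sequence.
def quantize(latents, codebook):
    """latents: (n, d) patch embeddings; codebook: (K, d) -> (n,) code indices."""
    # squared L2 distance from every latent to every codebook vector
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

codebook = np.eye(4)                              # toy codebook, K = 4
latents = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.1, 0.2, 0.8]])
codes = quantize(latents, codebook)               # nearest entries: [0, 3]
```

The resulting indices are exactly what the LLM is later trained to emit as depth tokens d_1 ... d_n.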
At inference, these perception tokens are not produced by arbitrary prompts; we use lightweight prompt templates with special tokens to reliably elicit the intermediate segmentation and depth tokens alongside the final answer. Our perception-enhanced design helps the model internalize these perceptual cues and demonstrates improved performance on tasks requiring fine-grained grounding.

3.3 Model Inference

At test time, given an input image I and textual prompt q, we tokenize q and encode I into visual tokens. The text and image tokens are concatenated and fed to the LVLM, which autoregressively emits an interleaved sequence of control and content tokens as defined in Eq. (1). Each group gates a downstream prediction head, specifically:

Segmentation Head: Emitting [seg] activates the SAM2 decoder, which fuses the [seg] query from the LLM with dense features from the SAM2 encoder to predict a segmentation mask MÌ‚. The mask type (e.g., referring, instance, or semantic) is determined by the task implied by the prompt.

Depth Head: The depth subsequence d_1:n is interpreted as indices into a VQ-VAE codebook and decoded to reconstruct a dense depth map D̂.

Text Head: The text subsequence t is detokenized to form the natural-language response.

This design unifies language, segmentation, and depth outputs within a single coherent token sequence. It is important to note that, during inference, the generated [seg] and [depth] tokens are available to create 2D and 3D grounding visualizations via their respective teacher models. However, the trained LVLM operates independently of the teacher branches to generate the desired text response for downstream tasks.

3.4 Loss Functions

Effective spatial reasoning requires carefully designed supervision signals.
To this end, we design novel loss functions for 3D depth information generation, while leveraging the standard LLM loss for text generation and a segmentation loss for the 2D segmentation feedback, respectively. We optimize all tasks in a single fine-tuning stage by minimizing the total loss defined as follows:

L_total = L_LLM + L_SegRecon + λ_d L_depth + λ_r L_DepthRecon    (2)

where λ_d and λ_r are weights for the respective loss contributions. Next, we explain each loss term in detail.

LLM Loss. The LLM loss is the standard teacher-forced next-token negative log-likelihood for the decoder conditioned on image features:

L_LLM = −(1/T) Σ_{t=1}^{T} log p_Θ(y_t | y_{<t}, φ(I)).

2D Supervision. Here, we aim for the LLM-generated [seg] token to improve such that it creates accurate segmentation masks in the segmentation decoder. We use a reconstruction loss as 2D supervision between the ground-truth segmentation mask SEG_GT and the segmentation mask reconstructed from the generated [seg] token. We combine pixel-wise cross-entropy and DICE loss:

L_SegRecon = L_CE + L_D.    (3)

3D Supervision. Depth supervision comprises (i) a depth token generation loss (L_depth), and (ii) a differentiable soft reconstruction loss (L_DepthRecon).

Depth Token Generation Loss. We fine-tune the LLM with LoRA to incorporate depth information by adding special depth tokens to its vocabulary. However, relying solely on the standard next-token cross-entropy loss may not be sufficient to ensure these tokens are generated as intended. To better ground the model in the meaning and proper use of depth tokens, we introduce additional regularization terms. Specifically, we propose a suite of novel loss functions targeted at encouraging accurate and consistent depth-token generation. Depth is emitted as a bracketed sequence [d_start, d_1, ..., d_n, d_end] with n tokens from a VQ-VAE codebook.
For each sample b, let (s_b, e_b) be the start/end indices with y_{b,s_b} = d_start and y_{b,e_b} = d_end. Define the interior length

l_b = e_b − s_b − 1 if a valid span exists, and l_b = 0 otherwise.

The depth token generation loss is a composite loss that aligns when the span begins/ends (L_marker), what codes fill it (L_token), and how many are produced (L_count):

L_depth = λ_m L_marker + λ_t L_token + λ_c L_count    (4)

(Values of the coefficients are reported in 3.5.)

Marker Loss. To ensure the depth start token d_start and the depth end token d_end are generated at the correct positions, we propose a marker loss:

L_marker = (1/B) Σ_{b=1}^{B} [ 1{s_b ≠ ∅} CE(z_{b,s_b−1}, y_{b,s_b}) + 1{e_b ≠ ∅} CE(z_{b,e_b−1}, y_{b,e_b}) ]    (5)

where B is the batch size, CE(·,·) is token-level cross-entropy, z ∈ R^{B×T×V} are decoder logits, and y ∈ N^{B×T} are ground-truth tokens. T is the sequence length and V is the vocabulary size. The indicator 1{s_b ≠ ∅} equals 1 when a valid depth span is found in sample b (i.e., both d_start and d_end are present), and 0 otherwise.

Token Loss. To ensure correct depth token values are generated by the LLM, we propose a token loss defined as:

L_token = (1/B) Σ_{b=1}^{B} [ 1{l_b > 0} / max(l_b, 1) ] Σ_{t=s_b+1}^{e_b−1} CE(z_{b,t−1}, y_{b,t})    (6)

Count Loss. Last, to encourage the LLM to produce sequences of the desired length n, we propose a penalty term that activates when the generated length deviates from n:

L_count = (1/B) Σ_{b=1}^{B} log(1 + |l_b − n|)    (7)

Soft Depth Reconstruction. To decode depth in a differentiable manner, inspired by [24], we replace hard codeword selection with a soft-merging technique over codebook embeddings. The model predicts a probability distribution over the codebook, and we form a soft token by weighting each embedding with its predicted probability. This continuous relaxation maps discrete tokens into a smooth embedding space, allowing gradients from the depth reconstruction objective to flow through the tokenization stage and enabling fully end-to-end training.
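A toy numeric sketch of the count loss of Eq. (7) and the weighted composite of Eq. (4) follows; the marker and token terms are stand-in cross-entropy values and the coefficients follow Sec. 3.5, but this is an assumed illustration, not the released implementation:

```python
import math

# Toy sketch of the depth-token objectives: count loss (Eq. 7) and the
# weighted composite (Eq. 4). Marker/token losses are placeholder scalars.
def count_loss(lengths, n):
    """Mean of log(1 + |l_b - n|) over the batch; zero only at exact length n."""
    return sum(math.log(1.0 + abs(l - n)) for l in lengths) / len(lengths)

def depth_loss(l_marker, l_token, l_count, lam_m=0.3, lam_t=0.5, lam_c=0.2):
    """L_depth = lam_m * L_marker + lam_t * L_token + lam_c * L_count (Eq. 4)."""
    return lam_m * l_marker + lam_t * l_token + lam_c * l_count

exact = count_loss([100, 100], n=100)   # 0.0: every span has the target length
off = count_loss([99, 102], n=100)      # grows logarithmically with deviation
total = depth_loss(0.4, 0.6, off)       # stand-in marker/token values
```

The logarithmic penalty keeps long over- or under-shoots from dominating the gradient while still pulling spans toward length n.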
For each timestep t inside the depth span, we restrict the logits to the depth-code index set D and compute

p_t(k) = exp(z_{t,k}) / Σ_{j∈D} exp(z_{t,j}),  k ∈ D,    (8)

then form the expected latent

z̃_t = Σ_{k∈D} p_t(k) e_k    (9)

where e_k is the VQ-VAE codebook vector for index k ∈ D and p_t is the softmax over depth codes at step t. The sequence z̃_t is truncated to n, reshaped to a √n × √n grid, and decoded by the VQ-VAE into a predicted depth map Ŷ. We minimize

L_DepthRecon = (1/B) Σ_{b=1}^{B} ||Ŷ_b − Y_b||²₂.    (10)

3.5 Experimental Setup

Dataset Curation: We build a joint dataset by augmenting RefCOCO, RefCOCO+, and RefCOCOg [23,41]—referring expression segmentation benchmarks where each example pairs a free-form phrase with the pixel-accurate mask of the mentioned object—with complementary supervision for depth and description. Concretely, for every referring expression we (i) convert the ground-truth mask into a compact sequence of segmentation tokens; (ii) attach aligned depth tokens that encode the quantized depth of the same region; and (iii) add a concise, attribute-focused one-sentence object description. All signals are unified in a single instruction–output format so the model learns, from one prompt, to ground the phrase, emit the mask (via seg tokens), infer scene layout (via depth tokens), and verbalize salient attributes. We retain official splits, perform consistency and ambiguity filtering, and deduplicate near-overlapping samples to preserve quality. This multi-signal curation turns classic referring data into a scalable corpus for joint perception and reasoning.

Training Dataset: We train Perceptio using only image-based corpora covering (i) image question answering and image chat, (ii) image-level text-driven segmentation, (iii) depth-guided chain-of-thought data, and (iv) a joint dataset for segmentation, depth, and text.
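The soft-merging step of Eqs. (8)–(9) amounts to a probability-weighted average of codebook vectors; a minimal NumPy sketch under toy shapes (again an illustration, not the authors' code):

```python
import numpy as np

# Sketch of soft merging (Eqs. 8-9): softmax over depth-code logits, then a
# probability-weighted average of codebook vectors -> a differentiable
# "expected" latent instead of a hard argmax codeword pick.
def soft_merge(logits, codebook):
    """logits: (n, K) restricted to depth codes; codebook: (K, d) -> (n, d)."""
    z = logits - logits.max(axis=1, keepdims=True)       # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # Eq. (8)
    return p @ codebook                                   # expected latent, Eq. (9)

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])                         # toy K = 2, d = 2
hard = soft_merge(np.array([[20.0, -20.0]]), codebook)    # ~ codebook entry 0
mixed = soft_merge(np.array([[0.0, 0.0]]), codebook)      # even 50/50 blend
```

Because the output is a smooth function of the logits, the MSE of Eq. (10) can backpropagate through the tokenization stage.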
Our training set comprises approximately 1.1M image–text pairs drawn from three sources: (i) 665K LLaVA-1.5 instruction-tuning samples for image QA and chat [20], (ii) 214K grounded conversation generation samples for image-level text-driven segmentation [27], and (iii) our curated datasets, including a synthetic and unique 60K ADE20K-with-perception-tokens dataset, inspired by [3]. We use the referring-expression segmentation datasets—RefCOCO (17K), RefCOCO+ (17K), and RefCOCOg (22K)—all built on MS COCO 2014 images. From these resources, we also curate a joint dataset with referring-expression segmentation by augmenting captions with depth tokens, yielding a total of 56K examples. Because InternVL2.5 [7] is already pre-trained on large-scale image QA data, we fine-tune with the LLaVA-1.5 corpus to preserve QA capability while adapting the model to grounding and segmentation.

Evaluation: For the grounding task, we evaluate RefCOCO, RefCOCO+, and RefCOCOg on their validation sets. To assess broader vision-language capabilities, we also employ the recent MME [9], MMBench [21], and SEED-Bench [15] benchmarks, which cover a wide range of multi-modal tasks including visual question answering, captioning, and reasoning, and report performance via objective metrics such as multiple-choice accuracy. For science and diagram VQA evaluation, we test on AI2D [12], MMStar [6], and ScienceQA [22], collectively covering diagram understanding and grounded reasoning. Additionally, we include the HardBLINK variants with 3, 4, and 5 marked points per image [3]. These tasks challenge the model's spatial understanding by requiring it to identify relationships like which point is closest, and we measure success as the percentage of queries answered correctly.
We refer to this metric as relative depth accuracy.

Metrics: For image referring segmentation, we evaluate by measuring the alignment between the predicted mask and ground truth, using Intersection-over-Union (IoU)-based metrics (with IoU > 0.5 defining a correct prediction). For image QA and image chat, we follow prior work and report the standard benchmark-specific metrics used by those works [9,15]. For spatial reasoning, we report the accuracy of correctly answered multiple-choice questions.

Implementation Details: We train our model on 64 NVIDIA A100 GPUs for approximately 24 hours. Following the InternVL design, we adopt a maximum sequence length of 8,192 tokens. The model is optimized using AdamW with a learning rate of 4 × 10⁻⁵. We employ a batch size of 1 per device with gradient accumulation over 8 steps, resulting in an effective batch size of 512. For LoRA parameters, we set the rank to 256, chosen to provide sufficient capacity for the model to learn the new depth and segmentation token embeddings alongside language adaptation. The learning rate schedule consists of a linear warmup for the first 5% of training, followed by cosine annealing decay to zero. Gradient clipping is applied with a maximum norm of 1.0. For the depth VQ-VAE, we use a codebook size of K = 128. The LLM generates n = 100 depth tokens. The segmentation loss combines cross-entropy loss (weight = 1.0) and DICE loss (weight = 0.25), both with sigmoid activation. The VQ-VAE reconstruction loss is weighted at 1.0. During evaluation, we distribute inference across 8 GPUs to handle the long-context processing efficiently. In our experiments, we set the loss weights to λ_m = 0.3 (marker), λ_t = 0.5 (token), λ_c = 0.2 (count), λ_d = 1.0 (depth), and λ_r = 1.0 (depth reconstruction). We train both 8B and 4B variants of our model for our main results; ablation experiments use the 8B variant.

Inference Cost: Despite generating additional perception tokens, Perceptio incurs negligible inference overhead.
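The learning-rate schedule described under Implementation Details (linear warmup over the first 5% of steps, then cosine annealing to zero, base LR 4e-5) can be sketched as follows; the exact step indexing is an assumption for illustration:

```python
import math

# Hedged sketch of the reported schedule: linear warmup for the first 5% of
# training, then cosine annealing decay to zero. base_lr follows the paper;
# the boundary handling is an assumption, not the authors' exact code.
def lr_at(step, total_steps, base_lr=4e-5, warmup_frac=0.05):
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup                     # linear ramp
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

peak = lr_at(49, 1000)    # last warmup step (5% of 1000 = 50 steps)
mid = lr_at(525, 1000)    # halfway through the decay phase -> base_lr / 2
last = lr_at(999, 1000)   # decays toward zero by the final step
```

In practice a framework scheduler (e.g., a warmup wrapper around cosine annealing) would replace this hand-rolled function; the sketch only makes the shape of the curve concrete.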
For dense caption generation, Perceptio-8B takes 3.52 seconds per 100 tokens compared to 3.53 seconds for Sa2VA-8B, with comparable FLOPs (4.06T vs. 4.66T). At inference time, only the LVLM is required; the teacher models (SAM2 encoder, Depth Anything V2) are not needed unless the application explicitly requires segmentation mask or depth map outputs.

4 Results

4.1 Main Results

Quantitatively, we evaluate Perceptio across a comprehensive suite of benchmarks spanning referring image segmentation, multimodal dialogue, and relative depth reasoning.

Table 1: Merged results across image referring segmentation, image chat, and image-level benchmarks. RefCOCO/+/g report cIoU. For MME, we list Perception (P), Cognition (C), and Total (T = P + C). "–" indicates not reported.

| Method | RefCOCO [11] | RefCOCO+ [11] | RefCOCOg [41] | MME-P [9] | MME-C [9] | MME-T | MMBench [21] | SEED-Bench [15] | AI2D [12] | MMStar [6] | SQA-test [22] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B [18] | – | – | – | 1531 | – | – | 68.8 | 70.1 | – | – | – |
| Video-LLaVA-7B [17] | – | – | – | – | – | – | 60.9 | – | – | – | – |
| mPLUG-Owl3-8B [40] | – | – | – | – | – | – | 77.6 | – | – | – | – |
| InternVL2-8B [8] | – | – | – | – | – | – | 81.7 | 76.2 | 83.8 | – | – |
| PixelLM-7B [29] | 73.0 | 66.3 | 69.3 | 309 | 135 | 444 | 17.4 | – | 0.0 | – | – |
| LaSagnA [33] | 76.8 | 66.4 | 70.6 | 0 | 0 | 0 | 0.0 | – | 0.0 | – | – |
| GLaMM-7B [27] | 79.5 | 72.6 | 74.2 | 14 | 9 | 23 | 36.8 | – | 28.2 | – | – |
| LLaVA-G-7B [43] | 77.1 | 68.8 | 71.5 | – | – | – | – | – | – | – | – |
| GSVA-13B [35] | 79.2 | 70.3 | 75.7 | – | – | – | – | – | – | – | – |
| OMG-LLaVA-7B [44] | 78.0 | 69.1 | 72.9 | 1177 | 235 | 1412 | 47.9 | 56.5 | 42.9 | – | – |
| VISA-13B [38] | 72.4 | 59.8 | 65.5 | – | – | – | – | – | – | – | – |
| Sa2VA-4B [42] | 80.4 | 74.3 | 76.7 | 1553 | 540 | 2093 | 76.8 | 72.6 | 79.9 | 53.7 | 95.8 |
| Sa2VA-8B [42] | 81.9 | 76.5 | 78.9 | 1651 | 578 | 2229 | 82.4 | 75.5 | 82.1 | 60.3 | 96.8 |
| Perceptio-4B (Ours) | 81.7 | 76.9 | 78.9 | 1710 | 615 | 2325 | 82.0 | 74.4 | 81.5 | 57.7 | 97.2 |
| Perceptio-8B (Ours) | 82.7 | 77.9 | 80.0 | 1654 | 628 | 2282 | 83.4 | 75.7 | 83.4 | 64.2 | 98.3 |

As summarized in Table 1, Perceptio-8B sets a new state of the art on all three referring segmentation datasets—82.7% on RefCOCO, 77.9% on RefCOCO+, and 80.0% on RefCOCOg—surpassing the best prior Sa2VA-8B by +0.8/+1.4/+1.1 points in cIoU, respectively.
On image chat evaluations, Perceptio-8B achieves the strongest MME perception/cognition scores (1654/628) and the best MMBench accuracy (83.4), while remaining highly competitive on SEED-Bench (75.7, within 0.5 of InternVL2-8B). The lighter Perceptio-4B mirrors these trends, already outperforming larger baselines (e.g., 81.7/76.9/78.9 on RefCOCO/+/g and 1710/615 on MME). On the HardBLINK relative depth task (Table 2), our depth tokens and intermediate reasoning yield substantial gains in accuracy: Perceptio-8B attains 75.8/71.0/66.1 for 3/4/5 points with a 71.0 average, improving over LLaVA-Aurora by +8.9/+10.5/+11.3 and by +10.3 points on average. The accompanying figure provides a visual comparison of these results; additional qualitative examples across models and datasets are included in the supplementary materials.

Table 2: Relative depth accuracy on HardBLINK. Our Perceptio-4B/8B, using depth tokens and intermediate reasoning, outperform prior baselines.

Model | 3 Points | 4 Points | 5 Points | Average
Sa2VA-8B [42] | 21.0 | 17.7 | 25.0 | 21.2
InternVL2.5-26B [7] | 41.1 | 31.5 | 26.6 | 33.1
LLaVA-1.5-13B [18] | 35.5 | 37.9 | 29.0 | 34.1
Fine-tuned LLaVA [3] | 58.9 | 52.4 | 41.1 | 50.8
LLaVA-Aurora [3] | 66.9 | 60.5 | 54.8 | 60.7
Perceptio-4B (Ours) | 69.4 | 66.9 | 59.7 | 65.3
Perceptio-8B (Ours) | 75.8 | 71.0 | 66.1 | 71.0

Qualitatively, we visualize samples from the RefCOCO dataset to illustrate the impact of joint perception. In Figure 3 we show the reconstructed depth maps and predicted segmentation masks along with the question and the generated answer. We see clear depth separation in the 3D depth map, along with the semantic segmentation boundaries of the object. In Figure 4, we show samples where the Sa2VA model fails to predict the correct segmentation masks, whereas Perceptio generates accurate depth maps and corresponding segmentation masks for the same input queries. Further highlighting the importance of depth perception, in Figure 5, Perceptio makes correct predictions where the depth maps capture
3D information, and makes an error in sample 4 where the depth map marks all objects as background.

Q: What is the main objects in the scene? generate depth map, describe them.
A: [Seg]<Depth_tokens> Playing ultimate frisbee.

Fig. 3: Comparison between ground-truth and predicted outputs for a scene depicting two players engaged in an ultimate frisbee game. The first row shows the original image, the ground-truth segmentation mask, and the ground-truth depth mask. The second row displays the question and the corresponding model predictions for text, segmentation mask, and depth map, demonstrating accurate recognition of the main objects and spatial depth relationships.

4.2 Ablation Studies

Impact of Perception: We introduce joint 2D and 3D perception abilities in LVLMs. A natural question emerges: how significant is each source of perception? In this ablation, we train the model with (1) no depth signal and (2) no segmentation signal, and compare each variant with Perceptio. For each variant, we remove the underlying dataset, task, and loss function for the respective perception signal.

2D Only Perception: As shown in Table 3, performance on the HardBLINK depth reasoning task decreases significantly (the average accuracy drops from 71.0% to 45.2%, a decline of 25.8 percentage points) when the depth tokens and the depth branch of Perceptio are removed. Notably, general VQA metrics slightly improve without depth tokens (Table 4), indicating a mild optimization tension between depth token generation and text-only tasks. However, the substantial collapse on HardBLINK demonstrates that depth tokens are essential for 3D spatial reasoning, the core capability targeted by our work. We discuss this trade-off further in Sec. 5.

3D Only Perception: We evaluate a depth-only variant that removes the segmentation tokens while retaining depth tokens (Table 4). This 3D-only setting probes whether the model can rely purely on geometric cues to structure the scene. As shown in Table 4, removing segmentation consistently degrades performance: MME drops from 1654/628 to 1620/585 (perception/reasoning), MMBench falls by 1.6 points, and SEED-Bench declines by 2.3 points. These results indicate that while the model learns meaningful depth associations between adjacent objects and entities, depth alone is insufficient for strong VQA-style reasoning; explicit semantic grouping from segmentation complements geometry and yields superior accuracy.

Fig. 4: Perceptio evaluated on RefCOCO, qualitative results. We compare our model with Sa2VA on a referring expression. Instance masks are colorized for visibility; depth maps are shown in grayscale (lighter = nearer). Our predictions align with semantic boundaries better than Sa2VA's by capturing the expected depth layering across the image. Note: "×" denotes that Sa2VA has no depth perception.

Table 3: Ablation study on the effect of depth tokens on HardBLINK. Abbreviations: P = Perceptio; −Depth = w/o depth tokens.

Model | 3 Points | 4 Points | 5 Points | Avg.
P | 75.8 | 71.0 | 66.1 | 71.0
P (−Depth) | 48.4 | 48.4 | 38.7 | 45.2

Fig. 5: Perceptio (ours) predictions on the HardBLINK dataset with color depth maps overlaid. Correct predictions in samples 1, 2, and 3; incorrect prediction in sample 4.

Impact of Loss Functions: We further ablate the two depth-specific objectives introduced in Sec. 3.4: the soft depth reconstruction loss L_DepthRecon and the depth token generation loss L_depth. Removing L_DepthRecon reduces MME to 1625/613 and MMBench to 81.9% (Table 5); removing L_depth yields 1632/621, 82.4% on MMBench, and a SEED-Bench drop from 75.7% to 74.3%. Both objectives therefore contribute positively and in a complementary manner: L_DepthRecon strengthens continuous depth fidelity through differentiable decoding, while L_depth sharpens the discrete depth-token sequence itself.

5 Discussion and Conclusion

Perceptio is a perception-enhanced LVLM that emits segmentation and discretized depth tokens inside the same autoregressive sequence, then generates text answers, turning 2D–3D perception into an in-sequence spatial chain-of-thought. This design, combined with composite depth-token and reconstruction objectives, materially strengthens spatial grounding in the trained model. Our trained model achieves state-of-the-art results on referring expression segmentation, depth understanding, and general VQA benchmarks.

Table 4: Ablation study on the effect of perception tokens on benchmarks. Abbreviations: P = Perceptio; −Depth = w/o depth tokens; −Seg = w/o segmentation tokens.

Model | MME (perc./reas.) | MMBench (%) | SEED-Bench (%)
P | 1654/628 | 83.4 | 75.7
P (−Depth) | 1661/652 (7/24↑) | 83.8 (0.4↑) | 76.3 (0.8↑)
P (−Seg) | 1620/585 (34/43↓) | 81.8 (1.6↓) | 73.4 (2.3↓)

Table 5: Ablation study on the depth loss terms on benchmarks. Abbreviations: P = Perceptio; −Depth-Rec = w/o depth reconstruction loss; −Depth-Gen = w/o depth token generation loss.

Model | MME (perc./reas.) | MMBench (%) | SEED-Bench (%)
P | 1654/628 | 83.4 | 75.7
P (−Depth-Rec) | 1625/613 (29/15↓) | 81.9 (1.5↓) | 73.7 (2.0↓)
P (−Depth-Gen) | 1632/621 (22/7↓) | 82.4 (1.0↓) | 74.3 (1.6↓)

To our knowledge, ours is the first work to jointly optimize 2D semantic segmentation and 3D depth reasoning within a single autoregressive LVLM sequence. We demonstrate strong performance on image-related perception and QA tasks. Prior work addresses these modalities in isolation; Perceptio closes this gap and demonstrates that the combination is necessary: removing depth tokens collapses HardBLINK accuracy by 25.8 points, while removing segmentation tokens degrades general VQA by up to 2.3 points on SEED-Bench.
Limitations. We identify limitations that suggest directions for future work. First, our ablation (Table 4) reveals a trade-off: removing depth tokens slightly improves general VQA metrics (e.g., +0.4% on MMBench), suggesting that depth token generation introduces a mild optimization tension with text-only tasks. Mitigating this through task-adaptive curriculum learning is a promising direction. Second, our training and evaluation are limited to static images; extending to video, where temporally consistent depth tokens and object tracking introduce new optimization challenges, remains open. Third, Perceptio relies on frozen teacher models (Depth Anything V2, SAM2) whose errors propagate to the student; improving robustness to teacher noise is important for deployment.

Future Work. More broadly, we are motivated by the question of how models can learn generalizable spatial concepts that transfer across tasks and domains, for instance, extending perception tokens to encode surface normals or optical flow, moving toward a unified spatial intelligence within a single autoregressive framework.

References

1. Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Batra, D., Parikh, D.: VQA: Visual question answering (2016), https://arxiv.org/abs/1505.00468
2. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023), https://arxiv.org/abs/2308.12966
3. Bigverdi, M., Luo, Z., Hsieh, C.Y., Shen, E., Chen, D., Shapiro, L., Krishna, R.: Perception tokens enhance visual reasoning in multimodal language models. In: CVPR, pp. 3836–3845 (2025), https://api.semanticscholar.org/CorpusID:274464813
4.
Cai, Z., Yeh, C.F., Xu, H., Liu, Z., Meyer, G., Lei, X., Zhao, C., Li, S.W., Chandra, V., Shi, Y.: DepthLM: Metric depth from vision language models (2025), https://arxiv.org/abs/2509.25413
5. Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023), https://api.semanticscholar.org/CorpusID:264146906
6. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330 (2024)
7. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D., Qiao, Y., Dai, J.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024), https://arxiv.org/abs/2412.05271
8. Chen, Z., Wang, W., Tian, H., Ye, S., Gao, Z., Cui, E., Tong, W., Hu, K., Luo, J., Ma, Z., et al.: How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821 (2024)
9. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R.: MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
10. Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: BLINK: Multimodal large language models can see but not perceive (2024), https://arxiv.org/abs/2404.12390
11.
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: EMNLP (2014)
12. Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: ECCV (2016)
13. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV (2023)
14. Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning segmentation via large language model. In: CVPR (2024)
15. Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., Shan, Y.: SEED-Bench: Benchmarking multimodal large language models. In: CVPR (2024)
16. Li, X., Zhang, T., Li, Y., Yuan, H., Chen, S., Zhou, Y., Meng, J., Sun, Y., Xu, S., Qi, L., Cheng, T., Lin, Y., Huang, Z., Huang, W., Feng, J., Shi, G.: DenseWorld-1M: Towards detailed dense grounded caption in the real world (2025), https://arxiv.org/abs/2506.24102
17. Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: EMNLP (2024)
18. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)
19. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
20. Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J.J., Zhang, L., Gao, J., Li, C.: LLaVA-Plus: Learning to use tools for creating multimodal agents. In: ECCV (2023), https://api.semanticscholar.org/CorpusID:265067489
21. Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: MMBench: Is your multi-modal model an all-around player? In: ECCV (2024)
22.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)
23. Mao, J., Huang, J., Toshev, A., Camburu, O.M., Yuille, A.L., Murphy, K.P.: Generation and comprehension of unambiguous object descriptions. In: CVPR, pp. 11–20 (2016), https://api.semanticscholar.org/CorpusID:8745888
24. Ning, J., Li, C., Zhang, Z., Geng, Z., Dai, Q., He, K., Hu, H.: All in tokens: Unifying output space of visual tasks via soft token (2023), https://arxiv.org/abs/2301.02229
25. van den Oord, A., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation learning (2018), https://arxiv.org/abs/1711.00937
26. Pi, R., Yao, L., Gao, J., Zhang, J., Zhang, T.: PerceptionGPT: Effectively fusing visual perception into LLM. In: CVPR, pp. 27114–27123 (2024), https://api.semanticscholar.org/CorpusID:265150065
27. Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel grounding large multimodal model. In: CVPR (2024)
28. Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)
29. Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel reasoning with large multimodal model. In: CVPR (2024)
30. Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: CVPR, pp. 9568–9578 (2024)
31.
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024), https://arxiv.org/abs/2409.12191
32. Wang, W., Chen, Z., Chen, X., Wu, J., Zhu, X., Zeng, G., Luo, P., Lu, T., Zhou, J., Qiao, Y., Dai, J.: VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175 (2023)
33. Wei, C., Tan, H., Zhong, Y., Yang, Y., Ma, L.: LaSagnA: Language-based segmentation assistant for complex queries. arXiv preprint arXiv:2404.08506 (2024)
34. Wen, L., Yang, X., Fu, D., Wang, X., Cai, P., Li, X., Ma, T., Li, Y., Xu, L., Shang, D., Zhu, Z., Sun, S., Bai, Y., Cai, X., Dou, M., Hu, S., Shi, B., Qiao, Y.: On the road with GPT-4V(ision): Early explorations of visual-language model on autonomous driving (2023), https://arxiv.org/abs/2311.05332
35. Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: Generalized segmentation via multimodal large language models. In: CVPR (2024)
36. Xiao, F., Sigal, L., Lee, Y.J.: Weakly-supervised visual grounding of phrases with linguistic structures. In: CVPR, pp. 5253–5262 (2017), https://api.semanticscholar.org/CorpusID:9190307
37. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML (2015), https://api.semanticscholar.org/CorpusID:1055111
38. Yan, C., Wang, H., Yan, S., Jiang, X., Hu, Y., Kang, G., Xie, W., Gavves, E.: VISA: Reasoning video object segmentation via large language models. arXiv preprint arXiv:2407.11325 (2024)
39.
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything V2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)
40. Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint (2024)
41. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: ECCV (2016)
42. Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., Yang, M.H.: Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos (2025), https://arxiv.org/abs/2501.04001
43. Zhang, H., Li, H., Li, F., Ren, T., Zou, X., Liu, S., Huang, S., Gao, J., Li, C., Yang, J., et al.: LLaVA-Grounding: Grounded visual chat with large multimodal models. In: ECCV (2024)
44. Zhang, T., Li, X., Fei, H., Yuan, H., Wu, S., Ji, S., Chen, C.L., Yan, S.: OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding. In: NeurIPS (2024)
45. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

A Depth reconstruction loss

The depth reconstruction loss is made differentiable by the soft token averaging method. The details of the algorithm flow are presented below in Algorithm 1.

B 3D reasoning task examples

The reasoning task that requires a 3D depth signal in the HardBLINK dataset is presented in Fig. 6.

C 2D reasoning task examples

Examples of reasoning and segmentation mask predictions on the RefCOCOg dataset from the Perceptio model are shown in Fig. 7.

Algorithm 1 Soft codebook mixing for differentiable depth reconstruction

Require: decoder logits z ∈ R^{B×T×V}; ground-truth tokens y ∈ N^{B×T}; depth code index set D; codebook {e_k}_{k∈D}.
1: for b = 1 to B do
2:   parse the depth span (s_b, e_b) in y_b with y_{b,s_b} = d_start, y_{b,e_b} = d_end
3:   if s_b, e_b found then
4:     L_b ← e_b − s_b − 1
5:     for t = s_b + 1 to e_b − 1 do
6:       ℓ ← z_{b,t−1}               ▷ logits predicting the token at position t
7:       mask non-depth indices: ℓ_j ← −∞ if j ∉ D
8:       p ← softmax(ℓ)              ▷ p ∈ R^V with support on D
9:       z̃_{b,t} ← Σ_{k∈D} p_k e_k
10:    end for
11:    truncate z̃_b to n tokens and reshape to √n × √n
12:    Ŷ_b ← VQ-VAE decoder(z̃_b)
13:  else
14:    Ŷ_b ← 0                       ▷ no valid depth span
15:  end if
16: end for
17: compute L_DepthRecon = (1/B) Σ_{b=1}^{B} ||Ŷ_b − Y_b||_2^2

Fig. 6: (left → right) HardBLINK dataset examples with the original images (first column), reconstructed depth maps from our method (second column), and overlays with the original image (third column). The overlaid depth maps show that the Perceptio model makes correct decisions in alignment with the perceived depth map.

Fig. 7: (left → right) RefCOCOg dataset samples with the RGB image (first column), ground-truth segmentation mask (second column), predicted segmentation mask (third column), and the overlaid predicted mask (fourth column) from the Perceptio model.
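The masked-softmax mixing at the core of Algorithm 1 (lines 6–9) can be sketched for a single token position as follows; shapes and names are our assumptions, and in the paper this runs over the decoder logits of the full depth span:

```python
import numpy as np

# Sketch of the soft codebook mixing step: mask non-depth vocabulary
# entries to -inf, softmax over the remaining depth codes, and take a
# probability-weighted average of codebook embeddings so the depth
# reconstruction stays differentiable w.r.t. the logits.
def soft_mix(logits, depth_ids, codebook):
    """logits: (V,) decoder logits over the full vocabulary;
    depth_ids: (K,) vocabulary indices of the depth codes (the set D);
    codebook: (K, d) embedding e_k for depth code depth_ids[k]."""
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[depth_ids] = logits[depth_ids]   # l_j <- -inf for j not in D
    m = masked[depth_ids]
    p = np.exp(m - m.max())
    p = p / p.sum()                         # softmax with support on D
    return p @ codebook                     # z~ = sum_{k in D} p_k e_k
```

Because the output is a convex combination of codebook vectors rather than a hard argmax lookup, gradients flow from the reconstruction loss back into the logits, which is what makes L_DepthRecon trainable end to end.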