Paper deep dive
Tiny Inference-Time Scaling with Latent Verifiers
Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/26/2026, 1:35:31 AM
Summary
The paper introduces 'Verifier on Hidden States' (VHS), a method for efficient inference-time scaling in single-step image generation. By extracting intermediate hidden representations from Diffusion Transformer (DiT) generators and feeding them directly into a Multimodal Large Language Model (MLLM), VHS eliminates the costly decoding-to-pixel-space and re-encoding-to-visual-embedding steps required by standard MLLM verifiers. This approach significantly reduces inference latency (by 63.3%), compute FLOPs, and VRAM usage while maintaining or improving generation quality on the GenEval benchmark.
Entities (5)
Relation Signals (3)
VHS → improves performance on → GenEval
confidence 95% · achieving a +2.7% improvement on GenEval at the same inference-time budget.
VHS → operates on → Diffusion Transformer
confidence 95% · VHS, a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators.
VHS → reduces latency for → SANA-Sprint
confidence 95% · In combination with the single-step generator SANA-Sprint [7]... VHS reduces the joint generation-and-verification time by 63.3%
Cypher Suggestions (2)
Find all methods that operate on Diffusion Transformer architectures. · confidence 90% · unvalidated
MATCH (m:Method)-[:OPERATES_ON]->(a:Architecture {name: 'Diffusion Transformer'}) RETURN m.name
Identify benchmarks used to evaluate specific methods. · confidence 90% · unvalidated
MATCH (m:Method)-[:IMPROVES_PERFORMANCE_ON]->(b:Benchmark) RETURN m.name, b.name
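Both suggestions are flagged as unvalidated. A quick way to sanity-check them is to run them against the graph with the official Neo4j Python driver, as in the sketch below. The connection URI and credentials are placeholders and must be adapted to the actual graph store.

```python
from neo4j import GraphDatabase

# Placeholder connection details; adjust URI and credentials for the real instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERIES = {
    "methods_on_dit": (
        "MATCH (m:Method)-[:OPERATES_ON]->(a:Architecture {name: 'Diffusion Transformer'}) "
        "RETURN m.name"
    ),
    "benchmarks_per_method": (
        "MATCH (m:Method)-[:IMPROVES_PERFORMANCE_ON]->(b:Benchmark) "
        "RETURN m.name, b.name"
    ),
}

# Print the result rows of each suggested query so they can be eyeballed.
with driver.session() as session:
    for label, cypher in QUERIES.items():
        rows = [dict(record) for record in session.run(cypher)]
        print(label, rows)
driver.close()
```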
Abstract
Abstract: Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.
Tags
Links
- Source: https://arxiv.org/abs/2603.22492v2
- Canonical: https://arxiv.org/abs/2603.22492v2
Full Text
69,103 characters extracted from source content.
Tiny Inference-Time Scaling with Latent Verifiers
Davide Bucciarelli*1,2, Evelyn Turri*1, Lorenzo Baraldi2, Marcella Cornia1, Lorenzo Baraldi1, Rita Cucchiara1
1 University of Modena and Reggio Emilia, Italy; 2 University of Pisa, Italy
1 name.surname@unimore.it, 2 name.surname@phd.unipi.it — aimagelab.github.io/VHS

Abstract
Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling, reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51%, and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.

1. Introduction
Diffusion and flow-based models [16, 25, 39] have recently transformed image synthesis, producing samples that closely resemble natural imagery with remarkable fidelity and controllability. However, their generation process remains computationally expensive and often misaligned with user intent. To mitigate these limitations, recent works have adopted the inference-time scaling paradigm [19, 30, 37, 47], which allocates additional computational budget at inference by generating multiple candidate samples and selecting the most suited among them. This framework relies on two key components: (i) an exploration algorithm that generates multiple candidates, and (ii) a verifier that assigns scores to the candidates and selects those that best match the prompt.

Figure 1. (A) Comparison between standard inference-time scaling and VHS. VHS skips part of the generation pipeline and avoids the decoding and re-encoding steps. (B) VHS achieves a comparable quality score on GenEval [10] in just 57% of the compute time.

In image generation, verifiers are typically implemented by fine-tuning Multimodal Large Language Models (MLLMs) [4] with an image scoring objective. Nevertheless, MLLMs are computationally heavy, and their inference cost is not negligible. Despite this, recent literature [19, 30] mostly accounts for the number of function evaluations (e.g., diffusion steps), while treating the cost of the verifier as implicit overhead, leading to an incomplete view of the computational footprint of inference-time scaling.
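To make the two components concrete, the sketch below shows the basic Best-of-N loop that this framing assumes: sample N candidates with the generator, score each with a verifier, and keep the best. It is a minimal illustration rather than the paper's implementation; `generate` and `verify` are hypothetical placeholders for a single-step generator and an MLLM-style scorer.

```python
# Minimal Best-of-N inference-time scaling loop (illustrative sketch).
# `generate` and `verify` are hypothetical stand-ins for the image generator
# and the verifier; higher verifier scores mean better prompt alignment.

def best_of_n(prompt, n, generate, verify):
    best_image, best_score = None, float("-inf")
    for _ in range(n):
        image = generate(prompt)           # one candidate per call
        score = verify(image, prompt)      # score the candidate against the prompt
        if score > best_score:
            best_image, best_score = image, score
    return best_image, best_score
```

Under a fixed wall-clock budget, every millisecond spent inside `verify` reduces how many candidates the loop can afford, which is exactly the cost the paper sets out to measure and cut.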
Moreover, many existing evaluations assume very large budgets, sometimes involving thousands of function evaluations [19], whereas practical deployment scenarios, such as commercial image generation services, typically operate under much tighter constraints, often returning only a handful of images (e.g., up to four candidates) per prompt. Further, visual generative models typically operate in a compressed latent space [34], defined by an autoencoder, whereas MLLMs rely on an external visual encoder (e.g., CLIP [33]) to obtain image representations. Thus, to score a generated sample, the latent representation must be decoded into pixel space and then re-encoded by the visual backbone of the MLLM. Although this decode-encode overhead may be acceptable in standard multi-step generators, it becomes increasingly significant in the case of single-step image generators, which can produce images in a single function evaluation [7, 29, 36, 50]. Following these considerations, we argue that the architecture of MLLM-based verifiers [13, 47] should be reconsidered in light of the specific characteristics of the task. To this aim, we introduce Verifier on Hidden States (VHS), an MLLM-based verifier that directly aligns internal hidden representations of image generators with the embedding space of an LLM. Concretely, VHS operates on single-step image generators, extracting latent features during the generation process, and uses these hidden states as the visual inputs to the LLM (Fig. 1). This way, VHS eliminates the encoding-decoding overhead in the evaluation step, enabling significantly more efficient verification within the inference-time scaling framework, while retaining the expressivity of MLLM-based scoring. As a consequence, VHS is well-suited for tiny computational budgets, where only a small number of candidates per prompt is affordable, and thus closely aligns with the practical constraints and deployment settings of real-world commercial image generation services. We evaluate VHS in terms of latency and verification quality on the GenEval benchmark [10]. In combination with the single-step generator SANA-Sprint [7] and a compact LLM (Qwen2.5-0.5B [1]), VHS reduces the joint generation-and-verification time by 63.3% with respect to a standard MLLM-based verifier. Furthermore, under matched wall-clock budgets, VHS improves inference-time scaling performance on GenEval, achieving overall score gains of 3.1%, 1.7%, and 0.5% over a CLIP-based MLLM verifier in the Best-of-2, Best-of-4, and Best-of-6 settings, respectively.
In summary, our main contributions are:
• We introduce VHS, a verifier that operates directly on internal hidden states of DiT-based image generators, aligning visual latents with an LLM without passing through pixel space or an external visual encoder.
• We define a latency-aware inference-time scaling setting for single-step image generation, explicitly measuring wall-clock time and analyzing performance in realistic few-sample generation regimes.
• We provide a thorough empirical study of verifier design and latency, comparing alternative architectures and configurations (i.e., in terms of layers, backbones, and loss functions) and quantifying the trade-offs between computational cost and semantic alignment.

2. Related Work
Image Generation Techniques.
Image generation has advanced substantially with the advent of diffusion mod- els [16,39], which have surpassed GANs [11] in both sample quality and training stability. Latent diffusion models [34] further extended this progress by operating in a compressed latent space, enabling high-resolution synthesis at a manage- able computational cost. While early diffusion architectures relied primarily on U-Nets [35] for noise prediction, these have recently been surpassed by Diffusion Transformers (DiTs) [31], which offer improved scalability and perfor- mance. In parallel, flow-based approaches [9,25] have re- formulated the diffusion objective from noise estimation to velocity field prediction, providing an alternative yet closely related view of the generative process. A complementary line of research focuses on improving inference efficiency. Few-step and even single-step diffu- sion models have been developed via distillation, making it possible to generate high-fidelity images with only a hand- ful (or even a single) denoising step [7,29,36,50]. In this area, Stable Diffusion XL-Turbo [36] introduced adversarial diffusion distillation to ensure high-fidelity synthesis in the low-step regime and leveraged large pre-trained multi-step models as teachers, with a mixture of adversarial training and score distillation. Subsequently, PixArt-α-DMD [50] pro- posed a distribution-matching distillation approach to align the student with the teacher model at the distribution level. Differently, SANA-Sprint [7] presented a hybrid distilla- tion framework that combines training-free continuous-time consistency distillation with latent adversarial distillation, and enables efficient adaptation of pre-trained diffusion or flow-matching models in the few-step generation scenario. Inference-Time Scaling. Inference-time scaling [38] con- sists in allocating additional computational resources during inference to improve model performance, rather than increas- ing compute during training. This strategy, widely adopted in NLP for LLM inference [12,38,46], has been recently extended to visual content generation [2,19,30,37,47]. In this context, it has been shown that allocating more com- pute time, beyond simply increasing the number of diffusion steps [30], can significantly enhance generation quality. Inference-time scaling methods for visual generation typ- ically rely on two main components: a search algorithm and a verifier. The former generates candidate samples, while the latter evaluates and ranks them to select the best output. The simplest strategy, Best-of-N, independently samples and scores N candidates, selecting the highest-scoring one as the final result. Another algorithm, widely adopted in LLM inference-time scaling is beam search [38], a heuristic algo- rithm that maintains the top-kmost probable candidates at each step, balancing exploration and efficiency to improve generation quality over greedy sampling. 2 On the other end, verifiers are often based on MLLMs, leveraging their ability to interpret complex prompts and assess visual-textual alignment in the generated content. For instance, VQA-Score [24] employs a Visual Question An- swering model that scores samples based on the probability of the “yes” token in response to predefined questions as- sessing prompt fulfillment. Similarly, Vision-Reward [49] queries an MLLM with fine-grained binary questions and combines the results through a learned weighting scheme. 
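As a concrete illustration of this family of verifiers, the snippet below sketches VQA-Score-style scoring: the MLLM is asked whether the image fulfils the prompt, and the probability it assigns to the "Yes" token is used as the score. The model interface and question template are illustrative assumptions, not taken from the paper.

```python
import torch

def yes_probability_score(mllm, tokenizer, image, prompt):
    """VQA-Score-style verifier: score = P("Yes") for a binary alignment question.

    `mllm` is a hypothetical multimodal LM that returns next-token logits for
    an (image, text) pair; the question template below is only illustrative.
    """
    question = f'Does this image match the description "{prompt}"? Answer Yes or No.'
    logits = mllm(image=image, text=question)          # 1-D tensor over the vocabulary
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Renormalize over the two answer tokens to obtain a score in [0, 1].
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()
```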
In contrast with previous literature, we propose a verifier that directly works in the latent space of the generator, significantly reducing the computational overhead of verification.
Multimodal Large Language Models. MLLMs extend traditional language models by integrating information across multiple modalities [4, 22, 26, 27, 44], most notably vision and text. Common architectures rely on a pretrained image encoder [8, 33, 51] whose embeddings are projected into the input space of the LLM through a lightweight adapter. This design allows the visual features to be integrated seamlessly into the token sequence of the LLM, enabling multimodal understanding and grounded generation. This framework was popularized by LLaVA [26, 27], which employed simple linear layers as connector and introduced a two-stage training pipeline: aligning the connector using image-caption pairs, and subsequently fine-tuning the entire model on instruction-following datasets. Building on this, several works have proposed to improve visual grounding and fine-grained alignment. Idefics3 [20] partitions images into spatial tiles encoded independently, improving localization and detailed perception. Similarly, Qwen2.5-VL [1] incorporates 2D positional encodings into token representations to better preserve spatial structure within images. In contrast, we directly align hidden states of a DiT-based generator with the LLM, enabling image evaluation from latent representations rather than decoded pixels.

3. Proposed Method
3.1. Preliminaries
The objective of our approach is to assess content quality directly from the latent representations of a single-step image generator. In the following, we first formalize the generative process of multi-step models and subsequently introduce the single-step formulation adopted in our method. Visual generative models, such as diffusion [16, 34] and flow-based models [9, 25, 47], synthesize data through a multi-step refinement process. Starting from a latent variable sampled from a prior distribution, z_T ~ p_T (e.g., N(0, I)), the model progressively refines it into a structured representation z_0 by constructing a discrete trajectory

z_T → z_{T-1} → ... → z_1 → z_0,   (1)

where transition steps are parameterized by a neural network f_θ that predicts model-specific quantities such as noise or velocity, depending on the underlying framework. This iterative process transports samples from the prior p_T toward an approximation of the target distribution p_data, yielding a sequence {z_t}_{t=T}^{0} that we refer to as the generative trajectory. Concretely, in DiT-based [9, 31] generators, the noisy latent z_t is processed by a Transformer backbone that produces a sequence of hidden representations {h_ℓ}_{ℓ=0}^{L-1}, generated with h_ℓ = DiT_ℓ(h_{ℓ-1}, t), where L is the number of layers in the DiT and h_0 is the noisy latent z_t. Lastly, the DiT layers are followed by a decoder D that operates on the final hidden state h_{L-1} after projection and normalization (i.e., z_0). In both diffusion [34] and flow models, the generation trajectory is defined not in pixel space but in the compressed latent space of an autoencoder [6], which we define as E. During sampling, the generative trajectory {z_t}_{t=0}^{T} evolves entirely in E, and the final image is obtained by decoding the terminal latent z_0 via x_0 = D(z_0).
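The hidden states h_ℓ used later by the verifier can be captured without modifying the generator, for example with a forward hook on the chosen transformer block. The sketch below assumes the DiT exposes its blocks as an indexable list named `transformer_blocks`; the attribute name varies across implementations and is an assumption, not taken from the paper. Note also that VHS additionally truncates the forward pass after layer ℓ*, whereas a hook alone only records the activation without skipping the remaining layers.

```python
import torch

def capture_hidden_state(dit, layer_idx):
    """Register a forward hook that stores the output of DiT block `layer_idx`.

    Assumes the generator exposes its blocks as `dit.transformer_blocks`
    (common in DiT implementations, but the name may differ).
    Returns (storage dict, hook handle); call handle.remove() when done.
    """
    storage = {}

    def hook(module, inputs, output):
        # Some blocks return tuples; keep only the hidden-state tensor.
        storage["h"] = output[0] if isinstance(output, tuple) else output

    handle = dit.transformer_blocks[layer_idx].register_forward_hook(hook)
    return storage, handle

# Usage sketch: run one generator forward pass, then read storage["h"] and
# feed it to the verifier instead of decoding the latent to pixels.
```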
In contrast, a single-step generator is obtained by distilling a standard diffusion or flow-based model into a network that maps a latent sample z_T ~ N(0, I) to an image in one forward pass, producing a generative trajectory with T = 1. While in multi-step diffusion and flow-based models the computational cost of the decoding operator D is typically negligible compared to the iterative sampling process, in single-step generators [7, 36, 50] this balance shifts: the forward pass of D becomes a non-trivial component of the total inference cost. For this reason, our method operates directly on the intermediate latent representation h when tasked with verifying generated samples, thereby skipping the forward pass through D and avoiding any decoding overhead.

3.2. Latent Verifier
Within the inference-time scaling framework, a key component is the verifier model, which evaluates generated samples and identifies the most promising ones. In our formulation, we define the verifier as a model S_θ that, given a generated sample x_0 and the user prompt p, outputs s ∈ {Yes, No} indicating whether the sample is semantically aligned with p. Recent works [19, 21, 47, 52] typically implement verifiers using MLLMs. Although such models have shown strong performance in assessing generation quality, the computational cost associated with their scoring procedure is non-negligible. Nevertheless, most inference-time scaling studies [19, 30] quantify the computational budget solely by the number of generator function evaluations (e.g., diffusion steps), with the cost of running the MLLM-based verifier either implicitly ignored or treated as negligible. Formally, an MLLM-based verifier can be decomposed into three components: (i) a visual encoder V, which maps an input image x_0 to a sequence of visual tokens; (ii) a connector C, which projects these visual tokens into the embedding space of the language model; and (iii) a language model, which performs multimodal reasoning over the concatenated visual and textual tokens and produces the final score.

Figure 2. Comparison between a standard generation-verification pipeline (top) and VHS (bottom). VHS consumes visual features directly from the hidden states of the generator, bypassing subsequent DiT layers, autoencoder (AE) decoding, and CLIP-based re-encoding, significantly reducing sampling and verification overhead.

In the inference-time scaling setting, this architecture is used as follows: a latent sample z_0 is drawn from the latent space of the generator, decoded into pixel space as x_0 = D(z_0), and then processed by the verifier to produce a score

s = S_θ(z_0, p) = LLM(C(V(D(z_0))), p).   (2)

In this pipeline, V is responsible only for re-encoding visual information that has already been implicitly represented in the latent space of the generator [32, 41, 48].
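The two verification paths can be contrasted directly in code. The sketch below mirrors Eq. (2) and the hidden-state variant introduced next in Eq. (3): the standard path decodes the latent and re-encodes it with a vision encoder, while the hidden-state path hands an intermediate DiT activation straight to the connector. All component names are illustrative placeholders under the notation above, not the paper's implementation.

```python
def verify_standard(z0, prompt, decoder, vision_encoder, connector, llm):
    """Standard MLLM verifier, cf. Eq. (2): s = LLM(C(V(D(z0))), p)."""
    image = decoder(z0)                     # latent -> pixels (AE decoder D)
    visual_tokens = vision_encoder(image)   # pixels -> visual tokens (encoder V)
    return llm(connector(visual_tokens), prompt)

def verify_hidden_state(h_lstar, prompt, connector, llm):
    """Hidden-state verifier, cf. Eq. (3): s = LLM(C(h_l*), p).

    `h_lstar` is the activation captured at DiT layer l*; no decoding or
    re-encoding is performed, and later DiT layers can be skipped entirely.
    """
    return llm(connector(h_lstar), prompt)
```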
We claim that for generative models that operate in a rich latent space, this additional pass through V is not semantically essential for the verification task. Instead, it does introduce a non-trivial decode-encode overhead: the latent z_0 must undergo two successive transformations, D(z_0) and V(·), before the LLM can reason about the sample. Our Verifier on Hidden States (VHS) explicitly bypasses the decoding-encoding bottleneck by removing the visual encoder from the verification loop. Instead of operating on the decoded image, VHS directly consumes hidden representations from the generator. Specifically, VHS acts on the output of a DiT layer ℓ* ∈ {0, ..., L−1}, denoted as h_ℓ*, and feeds it to the connector C of the MLLM, as follows:

s = S_θ(z_0, p) = LLM(C(h_ℓ*), p),   (3)

where the connector C is trained to align h_ℓ* with the LLM input space, treating hidden features like image features. This design yields two key advantages. First, it completely removes the decoding-encoding pipeline z_0 → x_0 → V(x_0) from the verification process, thereby reducing per-sample evaluation latency. Second, since VHS accesses hidden states at layer ℓ*, it allows us to truncate the generator during verification and skip the remaining L − (ℓ* + 1) layers. As a result, VHS provides semantically informed verification at a fraction of the computational cost of standard MLLM-based verifiers, making inference-time scaling substantially more practical in low-latency generation regimes. An overview of our approach is shown in Fig. 2, in comparison with a standard generation-verification pipeline.

3.3. Training Procedure Overview
VHS is trained via a two-stage procedure. First, in an alignment stage, we adapt the visual representation from the generator hidden layers to be compatible with the LLM backbone. Second, we fine-tune the model as a verifier.
Alignment Stage. In this stage, the goal is to align the visual representations extracted from h_ℓ* with the representation space of the LLM. Unlike standard visual encoders, our visual embedder is a generative model. As a consequence, we first need to generate raw images to obtain the intermediate features h_ℓ* used for alignment. Concretely, we build upon the dataset used in the first stage of the LLaVA training [26], which provides image-caption pairs usually employed to train MLLMs. Starting from each caption, we employ the generator to produce a synthetic image and record the associated hidden representation h_ℓ*. Notably, this may introduce inconsistencies between the original caption and the generated image, due to hallucinations or semantic drift in the generator. To mitigate this, we re-caption each generated image using Gemma-3-4B [42], and use the resulting captions as the textual supervision for the alignment stage.
Verifier Fine-tuning. While the alignment stage is largely consistent with standard MLLM training, the verifier fine-tuning stage explicitly adapts the model to the scoring objective required in the inference-time scaling setting. Building on existing literature [21], we adopt the prompts of the training dataset of Reflect-DiT [21] and generate 20 candidate images per prompt, resulting in a total of 118k samples. These candidates are categorized by Gemma-3-4B [42] into the respective GenEval categories [10] and scored with its automatic evaluator. Based on these evaluation scores, we derive binary labels (Yes/No) for each image in the dataset. Analysis of the training set and GenEval benchmark reveals a significant class imbalance, with correct samples substantially overrepresented. A uniform weighting scheme in the training loss consequently underemphasizes the minority "incorrect" class, leading to suboptimal verifier performance. To address this, VHS employs a weighted cross-entropy loss during verifier fine-tuning. This approach re-weights the training signal proportionally to class frequencies, effectively compensating for the skewed label distribution and improving model calibration on underrepresented samples.

4. Experimental Results
4.1. Implementation Details
In the alignment stage, we follow the LLaVA [26] training scheme, and tune only the newly initialized connector module during the first stage. Differently, in the verifier fine-tuning stage, we train both the connector and the whole language model, splitting the generated datasets into training (80%) and evaluation (20%) and selecting the model yielding the best evaluation loss. All models follow the exact same training procedure, ensuring a fair comparison between models trained with equivalent data and policy. To derive a more granular scoring mechanism from binary labels, we leverage the LLM output probability of the sampled token ("yes" or "no") to produce a continuous score. Best-of-N selection is then performed by retaining the highest-scoring sample according to this approach. We refer the reader to the supplementary materials for a detailed explanation of this approach and accompanying ablation studies.
4.2. Experimental Setting
To ensure a fair evaluation of our approach, we adopt a controlled experimental setup and define three verifier architectures, namely: (i) MLLM w/ CLIP, a standard MLLM following the LLaVA design [26], where a frozen CLIP encoder (using the ViT-L/14@336 variant) is connected to the LLM through an MLP projection layer. This configuration is the only one that employs an external visual encoder; (ii) MLLM w/ AE, a variant in which the latent output of the generator, z_0, is mapped to the LLM input space via an MLP. This is equivalent to encoding the image with the autoencoder encoder E and processing the latent representation through the connector; (iii) VHS, our proposed model, which feeds an intermediate hidden representation h_ℓ from the generator into the LLM using the same linear projection layer adopted in the other architectures. To precisely quantify the time savings achieved by different evaluators, we report measurements averaged over 10 runs following an initial warm-up phase to stabilize performance. Specifically, we measure the time required for a full generation-and-evaluation cycle and the time required to generate up to N images and select the best one. All experiments are conducted on the SANA-Sprint generator [7] with a single step of generation, on an NVIDIA A100 GPU.

Table 1. End-to-end inference time, FLOPs, and VRAM usage for Best-of-N generation with SANA-Sprint [7] under different computational budgets, along with the relative savings (%) compared to the standard verifier.
Inference Time (ms), Bo1 / Bo2 / Bo4 / Bo6 (Saved %):
- MLLM w/ CLIP: 277 / 554 / 1108 / 1662 (–)
- MLLM w/ AE: 138 / 401 / 677 / 953 (50.2% saved)
- VHS on h_7: 102 / 363 / 565 / 767 (63.3% saved)
TFLOPs, Bo1 / Bo2 / Bo4 / Bo6 (Saved %):
- MLLM w/ CLIP: 15.1 / 28.5 / 55.1 / 81.8 (–)
- MLLM w/ AE: 7.4 / 14.8 / 29.5 / 44.3 (51.0% saved)
- VHS on h_7: 5.6 / 11.3 / 22.5 / 33.8 (62.9% saved)
Peak VRAM Usage (GB), Bo1 / Bo2 / Bo4 / Bo6 (Saved %):
- MLLM w/ CLIP: 13.8 / 15.5 / 18.8 / 22.2 (–)
- MLLM w/ AE: 11.8 / 11.9 / 12.3 / 12.6 (14.5% saved)
- VHS on h_7: 11.8 / 11.9 / 12.3 / 12.6 (14.5% saved)

4.3.
Latency Estimation of VHS Table 1 reports inference costs across three axes: wall-clock time, FLOPs, and peak VRAM usage, for both baselines and VHS, evaluated under Best-of-Nselection with SANA- Sprint. Time savings are expressed as a percentage relative to MLLM w/ CLIP, which we regard as the standard verifier. Across all dimensions, the results consistently show that bypassing the decoding–encoding operation (V(D(z 0 ))) re- quired by MLLM w/ CLIP yields substantial gains. Replac- ing the CLIP-based verifier with the AE-based one (MLLM w/ AE) already halves the cost: inference time drops from 277 ms to 138 ms (−50.2%), FLOPs are reduced by 51.0%, and peak VRAM consumption falls from 13.8 GB to 11.8 GB (−14.5%). Skipping part of the DiT forward, VHS pushes these savings further: the best configuration, VHS onh 7 , reaches 102 ms (−63.3%), reduces FLOPs by 62.9%, match- ing the VRAM footprint of MLLM w/ AE. These gains translate directly into end-to-end efficiency under Best-of-Nselection. In theBo6setting, MLLM w/ CLIP requires 1662 ms, MLLM w/ AE reduces this to 953 ms, and VHS onh 7 further decreases it to 767 ms. Cru- cially, under a time budget comparable to MLLM w/ CLIP atBo3(831 ms), VHS onh 7 can already affordBo6, ef- fectively doubling the candidate pool. The computational savings across time, FLOPs, and memory are thus not merely theoretical: they can be directly traded for a larger candidate pool under the same wall-clock and hardware budget. 5 Table 2. Accuracy (%) on the GenEval benchmark [10] across computational budgets, generator backbones, and verifier configurations (on LLM Qwen2.5-0.5B). Results compare SANA-1.5 and SANA-Sprint [7] under matched wall-clock budgets (milliseconds), with each verifier operating under the same time constraint via adaptive Best-of-N . BudgetGeneratorStepsVerifierBest-of-NSingleTwoCountingColorPositionAttributionOverall 200msSANA-Sprint1-Best-of-199.388.156.087.654.147.871.6 550ms SANA-1.54-Best-of-198.878.266.571.150.620.863.0 SANA-Sprint8-Best-of-199.591.959.386.057.852.474.0 SANA-Sprint1MLLM w/ CLIPBest-of-2100.091.359.588.061.055.475.4 SANA-Sprint1MLLM w/ AEBest-of-3100.090.959.089.655.850.673.1 SANA-Sprint1VHS (Ours)Best-of-4100.093.961.590.666.258.478.1 1100ms SANA-1.512-Best-of-1100.092.774.888.361.459.678.8 SANA-Sprint20-Best-of-1100.088.559.889.648.651.072.2 SANA-Sprint1MLLM w/ CLIPBest-of-4100.092.766.088.965.961.678.8 SANA-Sprint1MLLM w/ AEBest-of-799.790.761.390.859.649.374.7 SANA-Sprint1VHS (Ours)Best-of-9100.095.766.588.969.863.880.5 1650ms SANA-1.516-Best-of-199.793.577.389.160.260.879.4 SANA-Sprint30-Best-of-1100.090.557.385.149.350.271.4 SANA-Sprint1MLLM w/ CLIPBest-of-6100.093.968.288.769.864.280.4 SANA-Sprint1MLLM w/ AEBest-of-1199.790.559.389.858.449.073.9 SANA-Sprint1VHS (Ours)Best-of-15100.096.067.389.170.464.680.9 4.4. Performance on GenEval Beyond comparing the latency of VHS against the MLLM w/ CLIP and MLLM w/ AE baselines, we now turn to evalu- ating the verifier performance in an inference-time scaling benchmark, where the available computational budget is explicitly constrained in terms of wall-clock time. Experimental Setup. We conduct this analysis on the GenEval benchmark [10], which evaluates generator per- formance across six categories: Single Object, Two Objects, Counting, Colors, Position, and Attribute Binding. Each category is defined by structured prompts designed to probe specific capabilities of the generator. 
For instance, “Count- ing” requires producing a specific number of objects, while “Position” involves rendering two objects in a fixed spatial configuration. We conduct comparisons across three differ- ent time settings (550 ms, 1100 ms, and 1650 ms), which approximately correspond to the wall-clock time required by the MLLM w/ CLIP to produce the best of 2, 4, and 6 genera- tions, respectively. In contrast, for the MLLM variants using AE and VHS, the time savings (cfr. Table 1) allow us to perform a wider sample exploration within the same compu- tational budget. Specifically, 3 and 4 generations in 550 ms, 7 and 9 in 1100 ms, and 11 and 15 in 1650 ms, respectively. Overall Performance Analysis. Results are reported in Table 2. An analysis of the raw SANA-Sprint performance (first row) reveals substantial variation across the benchmark categories. While tasks such as counting, position, and at- tribute binding leave room for improvement, others, like single-object, two objects, and color, yield near-perfect ac- curacy, resulting in an overall score of 71.6%. Across all time budgets, VHS consistently outperforms its CLIP-based counterpart, benefiting from being able to generate a larger pool of samples to select from, while maintaining compara- ble accuracy. In particular, VHS surpasses the baseline by 2.7%, 1.7%, and 0.5% in the three time settings. Conversely, the MLLM w/ AE variant performs notably worse. AE latent features are perceptually richer thanks to the reconstruction pretraining objective of the autoencoder, but semantically weaker. As a result, these representations would likely require a more sophisticated architecture for se- mantic feature extraction. In practice, these features behave more like compressed, perceptual pixel-space representa- tions rather than meaningful semantic embeddings. In con- trast, VHS leverages hidden-layer activations directly con- ditioned on the generation prompt, yielding much stronger semantic alignment than AE latents. This allows VHS to en- tirely remove the vision encoder while maintaining effective alignment with the LLM space through a lightweight MLP. Category-wise Analysis. We observe the largest gains in cat- egories that require generation over multiple objects. Specif- ically, VHS achieves up to a 3% lead in the attribute binding (at 550 ms), 5.2% in position (550 ms), and up to 3% in the two objects category (1100 ms), indicating that VHS ef- fectively distinguishes multiple objects and captures their spatial relationships and attributes. Conversely, the AE- equipped MLLM attains comparable, and in some cases superior, performance in the color category (89.6%, 90.8%, and 89.8% across all the budgets). This can be attributed to the nature of AE latent features, where color information is more easily captured due to their perceptual rather than semantic representation space. Moreover, the single-object category shows saturated values across all time windows and verification options, suggesting that on the simplest task pro- posed by the benchmark, the generator itself is good enough to yield almost perfect scores. To qualitatively validate VHS, 6 Table 3. Accuracy (%) of SANA-Sprint [7] on the GenEval benchmark [10] across varying hidden layers, training losses, training data, and reference LLMs on a time budget of 1100 ms. 
VerifierSingleTwoCountingColorPositionAttributionOverall VHS w/ h 1 Weighted XE Loss99.788.156.887.851.848.271.3 w/ h 5 Weighted XE Loss100.092.358.590.066.659.477.7 w/ h 9 Weighted XE Loss100.093.165.588.965.059.878.3 w/ h 19 Weighted XE Loss99.788.161.788.565.657.876.5 w/ h 7 XE Loss100.090.959.588.161.660.476.3 w/ h 7 Focal Loss100.094.764.888.771.862.080.0 w/ h 7 Weighted XE Loss (Ours)100.095.766.588.969.863.880.5 w/ h 7 Weighted XE Loss + Qwen2-1.5B [43]100.094.162.090.465.061.078.4 MLLM w/ CLIP Generated Data w/ XE Loss100.094.164.090.263.062.078.5 Generated Data w/ Weighted XE Loss100.094.162.890.064.063.078.6 Original Data100.092.766.088.965.961.678.8 Original Data + Qwen2-1.5B [43]99.794.560.887.669.662.278.8 Fig. 3 presents samples from GenEval showcasing best picks from VHS, MLLM w/ CLIP and MLLM w/ AE, where VHS better identifies the best samples with equal inference time. Multi-step baselines. As additional evidence of the effec- tiveness of VHS, we compare against baselines that rely on multiple denoising steps. Consistent with prior findings [30], Best-of-N configurations outperform multi-step counterparts, both for the natively few-step SANA-Sprint and for its multi- step variant, SANA-1.5 (+9.5 and +1.5% overall at 1650 ms). 4.5. Ablation Studies To assess the design space and justify our modeling choices, we perform a systematic ablation study on GenEval with SANA-Sprint, varying (i) the DiT layer from which visual latents are extracted, (i) the LLM backbone, (i) the loss function in the verifier fine-tuning stage, and (iv) the training data. The first two hyperparameters jointly determine both the final accuracy of VHS and the latency-accuracy trade-off of the verifier, directly impacting its adoption in real-world deployments. To enable a fair comparison, all evaluations are constrained to a fixed computational wall-time budget of 1100 ms, with results reported in Table 3. Ablation on Different DiT Layers. The behavior of VHS is tightly coupled to the visual information encoded in DiT lay- ers, which capture varying levels of semantics and detail [18]. This induces a trade-off between expressivity and computa- tional cost: deeper layers are more expensive to evaluate yet can provide richer semantic representations, while shallower layers are significantly cheaper but may encode weaker se- mantics. Crucially, this trade-off is not monotonic, as the fi- nal layers lie closest to the autoencoder reconstruction space, prioritizing perceptual fidelity over explicit semantic struc- tures. As in Table 2, this proves suboptimal, highlighting the importance of selecting an appropriate latent depth. We compare VHS trained with features extracted from layerh 7 against variants spanning a broad range of depths, from extreme layersh 1 andh 19 to intermediate onesh 5 andh 9 (approximately 25% and 45% of the depth of a 20- layer DiT, respectively). Extreme layers prove substantially detrimental:h 1 suffers from proximity to the noisy input regime, yielding unstable representations, whileh 19 pro- duces features dominated by perceptual reconstruction cues, consistent with the poor performance of the MLLM w/ AE baseline, confirming that both extremes provide weak se- mantic signals for verification. 
Among intermediate layers, h 7 yields consistent gains of 2.8% and 2.2% overh 5 andh 9 , respectively, on the overall GenEval score.h 5 is substan- tially penalized on semantically demanding categories such as counting (-8% compared toh 7 ) and attribution (-4.4%), indicating that features extracted too early provide an insuf- ficiently mature semantic signal. Conversely, the higher cost ofh 9 limits sample exploration under fixed wall-time budget, ultimately hurting overall performance. Ablation on Different Losses. Employing standard XE de- grades performance compared to our weighted XE objective, with a drop of 4.2% in the overall score for VHS and 0.1% for the MLLM w/ CLIP baseline. This trend is consistent with the label imbalance in the SANA-Sprint training data, where positive examples account for approximately 63% of the samples: an unweighted loss biases the verifier toward the majority (positive) class, impairing its ability to reject incorrect generations. Weighted XE counteracts this bias and yields systematic gains across most GenEval categories. Another solution to class imbalance is the focal loss [23], which down-weights well-classified examples and focuses the training signal on harder, misclassified samples, improv- ing robustness on underrepresented and challenging cases. Indeed, training VHS with focal loss leads to a +3.7% overall improvement over vanilla XE, confirming the importance of loss functions that explicitly account for class imbalance. 7 MLLM w/ CLIPVHSMLLMw/AE A photo of four boats A photo of four vases MLLM w/ CLIPVHSMLLM w/ AE A photo of a wine glass and a bear A photo of a pink oven and a green motorcycle Figure 3. Visual comparison of the best pick images by different verifiers for GenEval-generated images. Ablation on Different LLM Backbones. A similar trade- off arises on the language side: larger LLMs generally offer stronger reasoning capabilities and better alignment with task instructions, but at the cost of increased inference latency and memory footprint. Differently, increasing the LLM ca- pacity by replacing Qwen2.5-0.5B with Qwen2-1.5B yields only marginal, and sometimes negative, gains under the same wall-time budget. This suggests that the primary bottleneck lies in the quality and depth of the visual representations rather than in the reasoning power of the language model, and that investing computation in better visual latents and ap- propriate losses is more beneficial than scaling up the LLM. Training Data. We further analyze the MLLM w/ CLIP baseline to assess the impact of synthetic fine-tuning data. Notably, MLLM w/ CLIP shows no meaningful improve- ment, and even slight degradation (−0.3%and−0.2%), when trained on generated rather than original data. This suggests that synthetic pairs provide little benefit for models not leveraging internal DiT latents. Therefore, the gains ob- served with VHS stem not from extra synthetic supervision but from its architectural design, which leverages DiT-layer latents and tailored loss functions, demonstrating the effec- tiveness of our verifier over generic MLLM-based baselines. Moreover, we refer the reader to the supplementary material for an analysis of the generated data quality. 4.6. Generalization to Other Generators Finally, we provide an analysis on VHS when applied to a different single-step generator, in particular PixArt-α- DMD [5, 50]. Results are reported in Table 4. Experimental Setting. 
Following the methodology from our SANA-Sprint analysis, we evaluate three verification ap- proaches: the conventional pipeline using MLLM w/ CLIP features, direct verification on latent autoencoder features (MLLM w/ AE), and VHS operating on intermediate DiT ac- Table 4. Verification latency and GenEval [10] accuracy scores for Best-of-Ngeneration with PixArt-α[5,50] under equivalent computational budgets defined by MLLM w/ CLIP. Verification TimeGenEval Overall (%) t (ms) t savings (%) Bo2 Bo3 Bo4 Bo5 Bo6 MLLM w/ CLIP145-43.7 44.7 45.1 45.6 46.9 MLLM w/ AE165 −14.041.0 41.0 41.9 42.3 41.6 VHS on h 13 7648.043.045.245.546.146.4 tivations from layer 13. Based on the previous ablation study, we train VHS with weighted XE loss and benchmark against the MLLM w/ CLIP variant trained on the original dataset, both identified as optimal configurations in our ablations. Latency Estimation. Inference-time analysis reveals that VHS achieves a48%speedup compared to MLLM w/ CLIP. In contrast, MLLM w/ AE offers no computational advan- tage over the CLIP baseline, as the generator autoencoder uses a low compression ratio that produces significantly more visual tokens for the LLM, negating any gains from bypassing latent decoding and CLIP encoding steps. GenEval Performance. On GenEval, we evaluate perfor- mance under matched budgets, corresponding to sampling and scoring between two and six candidates with a CLIP- based verifier. Thanks to its lower latency, VHS attains the best results in theBo3(45.2),Bo4(45.5), andBo5(46.1) settings, while remaining comparable in the Bo2 and Bo6. 5. Conclusion In this work, we introduced VHS, a verifier for inference- time scaling that directly aligns the latent representations of a DiT-based image generator with a large language model. By operating entirely in latent space, VHS eliminates part of the generation process, as well as the decode-re-encode overhead of standard MLLM-based verifiers, ultimately yielding better performance under the same inference-time budget. 8 Acknowledgments We acknowledge CINECA for the availability of high- performance computing resources under the ISCRA initia- tive, and for funding Evelyn Turri’s PhD. This work has been supported by the EU Horizon projects “ELIAS” (GA No. 101120237) and “ELLIOT” (GA No. 101214398), and by the EuroHPC JU project “MINERVA” (GA No. 101182737). References [1]Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025. 2, 3 [2]Lorenzo Baraldi, Davide Bucciarelli, Zifan Zeng, Chongzhe Zhang, Qunli Zhang, Marcella Cornia, et al. Verifier matters: Enhancing inference-time scaling for video diffusion models. In BMVC, 2025. 2 [3] Black Forest Labs.FLUX.1 Schnell.https:// blackforestlabs.ai, 2024. 13 [4]Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. The Revolution of Multimodal Large Lan- guage Models: A Survey. In ACL Findings, 2024. 1, 3 [5]Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. In ICLR, 2024. 8 [6]Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep Com- pression Autoencoder for Efficient High-Resolution Diffusion Models. In ICLR, 2025. 
3 [7] Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation. In ICCV, 2025. 2, 3, 5, 6, 7, 12, 13, 15 [8]Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible Scaling Laws for Contrastive Language-Image Learning. In CVPR, 2023. 3 [9]Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En- tezari, Jonas M ̈ uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In ICML, 2024. 2, 3 [10]Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An Object-Focused Framework for Evaluating Text- to-Image Alignment . In NeurIPS, 2023. 1, 2, 5, 6, 7, 8, 11, 12, 13, 15 [11] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In NeurIPS, 2014. 2 [12] Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519, 2025. 2 [13]Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bo- han Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Bill Yuchen Lin, and Wenhu Chen. VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. In EMNLP, 2024. 2 [14]D Hendrycks. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, 2016. 11 [15]Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 2021. 12 [16]Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffu- sion Probabilistic Models. In NeurIPS, 2020. 1, 2, 3 [17]Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low- Rank Adaptation of Large Language Models. In ICLR, 2022. 13 [18]Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, and Seungryong Kim. Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers. In NeurIPS, 2025. 7 [19]Jaihoon Kim, Taehoon Yoon, Jisung Hwang, and Minhyuk Sung. Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing. arXiv preprint arXiv:2503.19385, 2025. 1, 2, 3 [20]Hugo Laurenc ̧on, Andr ́ es Marafioti, Victor Sanh, and L ́ eo Tronchon. Building and better understanding vision-language models: insights and future directions. In NeurIPS Workshops, 2024. 3 [21]Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffu- sion Transformers via In-Context Reflection. In ICCV, 2025. 3, 4, 11 [22]Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On Pre-training for Visual Language Models. In CVPR, 2024. 3 [23]Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll ́ ar. Focal Loss for Dense Object Detection. In ICCV, 2017. 7, 11 [24]Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. 
Eval- uating Text-to-Visual Generation with Image-to-Text Genera- tion. In ECCV, 2024. 3 [25]Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In ICLR, 2023. 1, 2, 3 [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In NeurIPS, 2023. 3, 4, 5, 11 [27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024. 3 9 [28]Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019. 11 [29]Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.Latent consistency models: Synthesizing high- resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 2 [30]Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu- Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-Time Scaling for Dif- fusion Models beyond Scaling Denoising Steps. In CVPR, 2025. 1, 2, 3, 7 [31] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In ICCV, 2023. 2, 3 [32]Koutilya PNVR, Bharat Singh, Pallabi Ghosh, Behjat Sid- diquie, and David Jacobs. LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation. In ICCV, 2023. 4 [33]Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervi- sion. In ICML, 2021. 2, 3 [34]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ̈ orn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 11 [35]Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015. 2 [36]Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial Diffusion Distillation. In ECCV, 2024. 2, 3 [37]Anuj Singh, Sayak Mukherjee, Ahmad Beirami, and Hadi Jamali-Rad. CoDe: Blockwise Control for Denoising Dif- fusion Models. arXiv preprint arXiv:2502.00968, 2025. 1, 2 [38]Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. In ICLR, 2025. 2 [39]Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. 1, 2 [40] Stability AI. Stable Diffusion 3.5 Large Turbo.https: / / huggingface . co / stabilityai / stable - diffusion-3.5-large-turbo, 2024. 13 [41] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent Correspondence from Image Diffusion. In NeurIPS, 2023. 4 [42]Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Ta- tiana Matejovicova, Alexandre Ram ́ e, Morgane Rivi ` ere, et al. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786, 2025. 4, 5, 11 [43] Qwen Team et al. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671, 2024. 7 [44] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In NeurIPS, 2024. 3 [45] Wang, Yan and Abdullah, M and Hassan, Partho and Has- san, Sabit. Moonworks Lunara Aesthetic Dataset. arXiv preprint arXiv:2601.07941, 2026. 13 [46]Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving. In ICLR, 2025. 2 [47]Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer. arXiv preprint arXiv:2501.18427, 2025. 1, 2, 3, 11 [48] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-Vocabulary Panoptic Seg- mentation With Text-to-Image Diffusion Models. In CVPR, 2023. 4 [49]Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation. arXiv preprint arXiv:2412.21059, 2024. 3 [50]Tianwei Yin, Micha ̈ el Gharbi, Richard Zhang, Eli Shechtman, Fr ́ edo Durand, William T. Freeman, and Taesung Park. One- step Diffusion with Distribution Matching Distillation. In CVPR, 2024. 2, 3, 8 [51]Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In ICCV, 2023. 3 [52]Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, and Hong- sheng Li. From Reflection to Perfection: Scaling Inference- Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning. In ICCV, 2025. 3 10 Tiny Inference-Time Scaling with Latent Verifiers Supplementary Material A. Additional Implementation Details A.1. Dataset Construction We build datasets pairing DiT embeddings with task-specific targets: captions for the alignment phase and binary labels for fine-tuning. In the following, we detail this generation process for both training phases of VHS. Alignment Dataset. Due to the noisiness of the image- text couples of the LLaVA pre-training dataset [26], we use Gemma3-4B [42] to produce refined captions as prompts. We then generate images, extract the corresponding DiT em- beddings, and re-caption the outputs to guarantee alignment. Verifier Fine-Tuning Dataset. We leverage the prompt set from ReflectDiT [21] to synthesize a new image corpus, generating 20 variants per prompt and extracting their inter- nal activations. To establish ground-truth labels for these samples, we implement an automated annotation pipeline us- ing Gemma3-4B. Specifically, the LLM generates the meta- data required for the GenEval [10] verification pipeline by first classifying the prompt tag (e.g., single object, two ob- jects, counting, position, colors, attribute binding) and subse- quently deriving the specific inclusion and exclusion lists. Fi- nally, the GenEval verifier processes the synthesized images with this metadata to assign a binary label to each sample. The LLM prompts for data generation are shown in Fig. 10. A.2. Architectural Details Hidden Features Extraction. We extract hidden layer fea- tures from the output of theℓ ⋆ layer and normalize them using mean and variance statistics pre-computed over a 50k- example subset of the alignment training set. For SANA- Sprint at standard resolution, this process yields a represen- tation of 1024 spatial features, each with a hidden dimen- sionality of 2240. Similarly, PixArt-α-DMD produces a feature map of the same spatial size (1024 features), but with a smaller hidden dimensionality of 1152. AE Features Extraction. 
AE features are extracted by flattening the output latents of the generator across spatial dimensions. For SANA-Sprint, the deep compression au- toencoder reduces spatial resolution by 32×. For a standard 1024×1024 input resolution, this produces a 32×32 feature grid with a dimensionality of 32, which, after flattening and projection, results in 1024 image tokens as input to the LLM. Conversely, for PixArt-α-DMD, the generator uses a standard autoencoder with KL loss [34]. Given a 512×512 input resolution, this process yields a 64×64 feature grid, which we flatten and project into the LLM embedding space, producing 4096 image tokens. This high token count in- creases the inference latency of the LLM further rendering AE features unfeasible for our task. Multimodal Connector. The connector module is imple- mented as a two-layer MLP with a GELU activation func- tion [14], with the first projection layer bringing the features to LLM-compatible dimensionality. LLM Setup. The full prompt given to the verifier for evalu- ating images can be seen in Fig. 10, allowing the evaluation procedure to be reproduced. A.3. Training Details Alignment Stage. The model is trained for one epoch on the 558k-sample alignment dataset. We use a total batch size of 512 distributed across 16 NVIDIA A100 GPUs. The learning rate follows a cosine schedule, peaking at1× 10 −3 , following a warm-up phase lasting 3% of the total training steps, with the AdamW [28] optimizer. Verifier Fine-Tuning. On the other hand, the fine-tuning stage of our pipeline is carried out using a total batch-size of 64 on 8 NVIDIA A100 GPUs, a cosine learning rate scheduling with a maximum value of2 × 10 −5 , and the AdamW optimizer. We select the model yielding the best evaluation loss during a 10 epochs training. To address class imbalance in the SANA-Sprint training set, we employ a weighted cross-entropy loss, assigning weights of 0.37 to the positive class and 0.63 to the negative class, following sample distribution. Conversely, for the PixArt-α-DMD ex- periments, equal weights are utilized to reflect the balanced class distribution of the dataset. Moreover, the focal loss ablation study follows the formulation proposed in [23]: FL(p t ) =−α t (1− p t ) γ log(p t ),(4) wherep t represents the estimated probability of the model for the ground-truth class,α t is a weighting factor assigned to each class to counteract class imbalance, andγmodulates the loss function by exponentially down-weighting “easy” examples, forcing the model to concentrate on “hard”, mis- classified examples that contribute the most to the training error. We set α as the class imbalance factor and γ to 2. A.4. Inference Setup Score Computation. Following [47], we convert binary verifier outputs into a continuous score. The score is defined as the predicted probability of the verifier for the sampled token, negated if it belongs to the negative class (i.e., “no”). Otherwise (i.e., “yes”), the probability is used as is. Best-Of-N Setup. Fig. 4 shows the Best-of-N setup: each candidate is passed through the first DiT layers and scored 11 Table 5. Accuracy (%) of SANA-Sprint [7] on the GenEval benchmark [10]. Ablation on score methodology on a time budget of 1100 ms. 
A.4. Inference Setup
Score Computation. Following [47], we convert binary verifier outputs into a continuous score. The score is defined as the predicted probability of the verifier for the sampled token, negated if it belongs to the negative class (i.e., "no"). Otherwise (i.e., "yes"), the probability is used as is.

Best-Of-N Setup. Fig. 4 shows the Best-of-N setup: each candidate is passed through the first DiT layers and scored by the LLM, after which the highest-scoring candidate completes the remaining layers and is decoded to pixel space.

Figure 4. Efficient Best-of-N pipeline with VHS.

Inference and Evaluation Parameters. To run inference with the SANA-Sprint model, we use the standard resolution of 1024×1024. We employ a CFG of 1.0 and run the generator in a single step. We average our results over 5 different seeds to obtain more stable estimates. Moreover, for PixArt-α-DMD, the resolution is set to 512×512.

Table 5. Accuracy (%) of SANA-Sprint [7] on the GenEval benchmark [10]. Ablation on score methodology on a time budget of 1100 ms.
Verifier | Scoring | Single | Two | Counting | Color | Position | Attribution | Overall
VHS w/ h_7 | No Scoring | 99.3 | 90.7 | 57.5 | 87.5 | 63.2 | 52.6 | 74.7
VHS w/ h_7 | Token Probability Scoring (Ours) | 100.0 | 95.7 | 66.5 | 88.9 | 69.8 | 63.8 | 80.5
MLLM w/ CLIP | No Scoring | 99.5 | 91.1 | 58.3 | 88.3 | 62.0 | 54.8 | 75.3
MLLM w/ CLIP | Token Probability Scoring | 100.0 | 92.7 | 66.0 | 88.9 | 65.9 | 61.6 | 78.8
MLLM w/ AE | No Scoring | 99.7 | 87.5 | 56.5 | 88.7 | 55.0 | 49.2 | 72.2
MLLM w/ AE | Token Probability Scoring | 99.7 | 90.7 | 61.3 | 90.8 | 59.6 | 49.3 | 74.7

Table 6. GenEval benchmark [10] results with compute-equivalent baselines.
Budget | Generator | Steps | Verifier | BoN | Single | Two | Count | Color | Pos | Attr | All
550 ms | SD3.5-Turbo | 1 | MLLM w/ CLIP | Bo1 | 83.0 | 29.9 | 35.5 | 61.1 | 11.4 | 13.6 | 37.4
550 ms | Flux-Schnell | 1 | MLLM w/ CLIP | Bo1 | 99.0 | 90.3 | 60.5 | 82.8 | 28.0 | 58.4 | 68.9
550 ms | SANA-Sprint | 1 | VHS (Ours) | Bo4 | 100.0 | 93.9 | 61.5 | 90.6 | 66.2 | 58.4 | 78.1
1100 ms | SD3.5-Turbo | 1 | MLLM w/ CLIP | Bo2 | 90.1 | 35.6 | 37.8 | 63.8 | 13.8 | 16.6 | 41.2
1100 ms | Flux-Schnell | 1 | MLLM w/ CLIP | Bo2 | 99.8 | 94.3 | 65.8 | 84.7 | 34.4 | 65.0 | 73.2
1100 ms | Flux-Schnell | 2 | MLLM w/ CLIP | Bo2 | 99.0 | 91.7 | 61.3 | 80.9 | 28.4 | 57.4 | 68.9
1100 ms | SANA-Sprint | 1 | VHS (Ours) | Bo9 | 100.0 | 95.7 | 66.5 | 88.9 | 69.8 | 63.8 | 80.5
1650 ms | SD3.5-Turbo | 1 | MLLM w/ CLIP | Bo3 | 91.3 | 42.6 | 40.3 | 64.8 | 16.6 | 19.4 | 44.1
1650 ms | Flux-Schnell | 1 | MLLM w/ CLIP | Bo3 | 100.0 | 95.2 | 65.8 | 84.7 | 39.8 | 66.0 | 74.7
1650 ms | Flux-Schnell | 2 | MLLM w/ CLIP | Bo3 | 100.0 | 94.3 | 68.0 | 84.3 | 36.0 | 62.4 | 73.3
1650 ms | SANA-Sprint | 1 | VHS (Ours) | Bo15 | 100.0 | 96.0 | 67.3 | 89.1 | 70.4 | 64.6 | 80.9

B. Additional Analyses and Experiments

B.1. Data Quality Analysis
To verify data quality for the alignment stage of VHS training, we evaluate the multimodal consistency of our generated image–caption pairs using CLIP-Score [15]. Our pipeline, which re-captions the original images, synthesizes new images, and re-captions the outputs, produces pairs with a CLIP-Score of 73.5. Compared to the original dataset's score of 70.5, this demonstrates that our approach not only preserves semantic alignment, but actually improves image–text consistency over the original LLaVA alignment data.

B.2. Additional Studies on the Continuous Score
Ablation on the Continuous Score. In Table 5, we present an ablation study to validate the use of token probabilities as continuous scores for image selection. Specifically, we compare GenEval accuracies under an 1100 ms budget with SANA-Sprint using two distinct strategies: (i) random selection among images classified as positive ("yes") by the verifier, and (ii) selection of the highest-scoring image based on token probability, as described in Sec. A.4. We observe that utilizing probability scores consistently enhances accuracy across all categories and verifier types, yielding an overall improvement of up to 5.8% for VHS. This suggests that token probabilities provide a granular measure of image quality that binary decisions alone fail to capture.
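To make the scoring rule concrete, the snippet below sketches the conversion from the verifier's yes/no token probability to a signed score and the resulting Best-of-N pick; the candidate and probability data structures are illustrative assumptions.

```python
# Minimal sketch (assumed data layout): turn the verifier's binary answer into a
# signed continuous score and keep the best candidate, as described in Sec. A.4.

def to_score(answer: str, prob: float) -> float:
    # prob is the verifier's probability for the sampled "yes"/"no" token.
    return prob if answer == "yes" else -prob

def best_of_n(candidates, answers, probs):
    # candidates: partially generated latents; answers/probs: verifier outputs.
    scores = [to_score(a, p) for a, p in zip(answers, probs)]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx]

# Example: three candidates scored -0.48, 0.84, and 0.52 would keep the second one,
# which then completes the remaining DiT layers and is decoded to pixel space.
```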
Table 7. Accuracy (%) of SANA-Sprint [7] on the GenEval benchmark [10] under varying levels of LoRA fine-tuning, on a time budget of 1100 ms, for MLLM w/ CLIP and VHS. Δ reports the overall-score gap between VHS and MLLM w/ CLIP within each group.
Verifier | Single | Two | Counting | Color | Position | Attribution | Overall
No LoRA Fine-Tuning (Δ = +1.4)
MLLM w/ CLIP | 100.0 | 94.7 | 59.3 | 91.9 | 69.4 | 60.2 | 79.1
VHS | 100.0 | 95.7 | 66.5 | 88.9 | 69.8 | 63.8 | 80.5
50% LoRA Fine-Tuning (Δ = +1.8)
MLLM w/ CLIP | 100.0 | 92.9 | 60.8 | 90.6 | 63.4 | 58.2 | 77.3
VHS | 100.0 | 92.1 | 61.5 | 89.8 | 70.4 | 62.0 | 79.1
100% LoRA Fine-Tuning (Δ = +2.4)
MLLM w/ CLIP | 99.5 | 88.3 | 57.3 | 87.9 | 57.2 | 56.2 | 73.9
VHS | 100.0 | 91.1 | 59.5 | 89.4 | 64.4 | 56.0 | 76.3

Figure 5. Overall accuracy (%) of SANA-Sprint [7] on GenEval [10] across time (seconds), TFLOPs, and VRAM usage (GB).

Figure 6. Score distribution comparison between MLLM w/ CLIP and VHS. Histograms display the frequency of scores within 5% bins, with smoothed density curves overlaid in gray (MLLM w/ CLIP) and pink (VHS). Box plots above show the quartile ranges and distributional characteristics of each verifier.

Score Distribution Analysis. In Fig. 6, we analyze the score distributions produced by VHS and MLLM w/ CLIP on the prompts of the GenEval benchmark, using 32 generations per prompt with SANA-Sprint. The distributions reveal that VHS produces less extreme scores than MLLM w/ CLIP. Specifically, MLLM w/ CLIP assigns roughly 80% of samples to the extreme score ranges ([0.95, 1.0] or [-1.0, -0.95]), indicating a strong tendency toward binary judgments. In contrast, VHS assigns only about 40% and 20% of samples to the positive and negative extremes, respectively, while distributing a larger proportion of scores toward the center of the range. This more balanced distribution reflects the ability of VHS to assign more spread-out scores, enabling finer-grained discrimination between samples of varying quality.

B.3. Comparison with Additional Baselines
To assess the trade-offs and resource-allocation strategies within the Best-of-N pipeline under tight time budgets, in Table 6 we present additional comparisons between VHS and baselines equipped with different generators, although these are not directly comparable. Specifically, we compare against FLUX.1-Schnell [3] and Stable Diffusion 3.5-Turbo [40], exploring different Best-of-N and multi-step configurations under equivalent budgets of 550 ms, 1100 ms, and 1650 ms. Overall, VHS achieves the best results across all evaluated time budgets, as configurations relying on larger samplers are forced to operate with too few denoising steps or single-shot selection, limiting their effectiveness. Finally, Fig. 5 presents an extended comparison of VHS and the two baselines, plotting the overall GenEval score as a function of three computational budget metrics: inference time, TFLOPs, and VRAM usage (GB), demonstrating clear advantages for VHS across all three.
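As a rough illustration of how such compute-equivalent configurations can be derived, the sketch below computes the largest Best-of-N that fits a wall-clock budget given per-candidate generation and verification times; the timing values are placeholders, not measured numbers from the paper.

```python
# Rough budget accounting for compute-equivalent Best-of-N configurations.
# gen_ms and verify_ms are per-candidate costs; the values below are placeholders.

def max_best_of_n(budget_ms: float, gen_ms: float, verify_ms: float, steps: int = 1) -> int:
    per_candidate = steps * gen_ms + verify_ms
    return max(1, int(budget_ms // per_candidate))

if __name__ == "__main__":
    for budget in (550, 1100, 1650):
        n = max_best_of_n(budget, gen_ms=60.0, verify_ms=60.0)  # hypothetical timings
        print(f"{budget} ms budget -> Best-of-{n}")
```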
B.4. Robustness to Model Updates
Acknowledging the tight coupling VHS introduces between the generator and its latent verifier, we investigate how distribution shifts in the generator's weights affect verification quality. To assess this, we fine-tune SANA-Sprint on an aesthetic dataset [45] using LoRA [17] and measure GenEval performance within an 1100 ms budget. Although the distribution shift degrades overall performance, VHS exhibits significantly greater resilience than the pixel-space verifier (MLLM w/ CLIP), suggesting that the generator's hidden states remain semantically stable during fine-tuning. This relative stability is reflected throughout the training process: as observed in Table 7, the performance gap between the two methods grows from 1.4% at the baseline to 1.8% at mid-epoch, reaching 2.4% after a full epoch.
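For context, the snippet below sketches how such a LoRA adaptation of the generator could be set up with the peft library before re-running the Best-of-N evaluation with frozen verifiers; the rank and target module names are illustrative assumptions, not the configuration used in the paper.

```python
import torch
from peft import LoraConfig, inject_adapter_in_model

# Sketch of the robustness-study setup: adapt the generator with LoRA, then re-run
# the 1100 ms GenEval evaluation with the frozen verifiers. `generator` is assumed
# to be the DiT backbone loaded elsewhere; rank and target modules are placeholders.

def add_lora_adapters(generator: torch.nn.Module) -> torch.nn.Module:
    lora_cfg = LoraConfig(
        r=16,                                      # hypothetical rank
        lora_alpha=16,
        target_modules=["to_q", "to_k", "to_v"],   # assumed attention projections
    )
    return inject_adapter_in_model(lora_cfg, generator)
```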
Figure 7. Qualitative results of captioning produced by VHS and MLLM w/ CLIP after the alignment training phase.

Figure 8. Cases where the judgments of VHS are not aligned with the GenEval verifier.

C. Additional Qualitative Results
In Fig. 9 we provide a qualitative evaluation of VHS using sample prompts from the GenEval benchmark with the SANA-Sprint generator, emphasizing its effectiveness in various challenging scenarios. VHS captures fine-grained details, distinguishing accurate images from those that are subtly incorrect. For instance, in the counting task involving "four giraffes", VHS correctly identifies the discrepancy in the image, which contains five giraffes, assigning a negative score consistent with the GenEval verifier. Similarly, VHS shows good performance in spatial reasoning, as it correctly validates the "baseball glove right of a bear" example, which the baseline falsely rejects. These results further confirm that VHS accurately validates correct generations related to counting, attribute binding, and spatial relationships.

Moreover, Fig. 7 presents qualitative examples of captions produced by VHS following only the alignment training phase. For each example, we generate sample images using the specified generation prompts and extract the corresponding hidden-layer activations to feed into VHS. The examples show that VHS consistently produces accurate and detailed captions across diverse subjects. For instance, given the generation prompt "A pokemon with a poke ball", the generator produces an image which VHS then captions with a comprehensive description, identifying the character as "a yellow Pikachu Pokémon" and noting the "textured background". Similarly, when the generator creates an image from the prompt "A boombox", VHS correctly describes it as "a vintage stereo speaker with a casing and a set of speakers, arranged on a light gray surface", capturing fine-grained visual details such as the surface color and arrangement.

D. Limitations
While our proposed method demonstrates robust performance, we acknowledge certain limitations. Most notably, the verifier is coupled to the underlying generator architecture, as it operates directly on its hidden representations. This tight integration enables substantial efficiency gains and strong semantic alignment, but limits out-of-the-box transferability across substantially different generative backbones. In our experiments, we observe stable behavior under incremental model updates (Table 7); however, if the architecture changes significantly, VHS must be retrained to maintain compatibility and verification performance. In practice, this constraint is of limited relevance in typical production settings, where generators are updated incrementally.

Figure 9. Visual comparison of the best-pick images selected by different verifiers for images generated by SANA-Sprint [7] on GenEval [10] prompts. ✓ and ✗ express the alignment with the GenEval verifier.
1. Prompt for Tag Generation
You are an assistant that classifies image generation prompts into one of six categories. Given a text prompt describing an image, output ONLY the corresponding tag from this list:
- single_object
- two_object
- counting
- colors
- position
- color_attr
Rules:
1. Respond with exactly one of the tags above --- nothing else.
2. Classification criteria:
- "single_object" → only one object is mentioned.
- "two_object" → exactly two different objects are mentioned (e.g. "a cat and a dog").
- "counting" → a specific number of identical objects is requested (e.g. "three dogs").
- "colors" → a single object with a color attribute (e.g. "a purple umbrella").
- "position" → objects are described with a spatial relation (e.g. "a cat below a table", "a man left of a horse").
- "color_attr" → two or more different objects, each with their own color (e.g. "a red apple and a green pear").
3. If uncertain, choose the most specific applicable tag.
Examples:
Input: "a photo of an umbrella" → single_object
Input: "a photo of a bowl and a pizza" → two_object
Input: "a photo of three persons" → counting
Input: "a photo of a purple hair drier" → colors
Input: "a photo of a couch below a cup" → position
Input: "a photo of a red skis and a brown tie" → color_attr
The prompt is input_prompt.

2. Prompt for Captioning on the Alignment Set
Describe the image in one concise sentence. Be objective and precise, without speculation. Output only the description in plain text, without line breaks.

3. Prompt for Image Scoring
You are an AI assistant specializing in image analysis and ranking. Your task is to analyze and compare image based on how well they match the given prompt.
<image>
The given prompt is: input_prompt.
Please consider the prompt and the image to make a decision and response directly with 'yes' or 'no'.

Figure 10. Prompts employed for dataset generation and image scoring.
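As a closing illustration, the sketch below shows one way the tag-generation prompt above could be used programmatically in the annotation pipeline of Sec. A.1; the chat callable, the placeholder prompt string, and the fallback behavior are assumptions, not part of the released tooling.

```python
# Minimal sketch of the automated annotation flow from Sec. A.1 using the
# tag-classification prompt above. The `chat` callable stands in for whatever
# interface serves Gemma3-4B; it and the fallback tag are assumptions.

TAG_GENERATION_PROMPT = "<the 'Prompt for Tag Generation' from Fig. 10>"  # placeholder text

VALID_TAGS = {"single_object", "two_object", "counting", "colors", "position", "color_attr"}

def classify_prompt(generation_prompt: str, chat) -> str:
    # Substitute the literal `input_prompt` placeholder with the actual prompt.
    message = TAG_GENERATION_PROMPT.replace("input_prompt", generation_prompt)
    reply = chat(message).strip()
    return reply if reply in VALID_TAGS else "single_object"  # conservative fallback (assumed)

# The returned tag, together with the derived inclusion/exclusion lists, forms the
# metadata consumed by the GenEval verifier to assign the binary training label.
```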