Paper deep dive
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang
Abstract
Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
Tags
Links
- Source: https://arxiv.org/abs/2603.17651v1
- Canonical: https://arxiv.org/abs/2603.17651v1
Full Text
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Tae Eun Choi*, Sumin Shim*, Junhyeok Kim, Seong Jae Hwang
Yonsei University
{teunchoi, use08174, timespt, seongjae}@yonsei.ac.kr
* Equal contribution.

Figure 1. We introduce a training-free approach on the task of generative inbetweening which generates intermediate frames using the two keyframes and text. In (a), our method correctly recognizes the train and produces consistent and coherent frames. In (b), we improve semantic alignment between the text and generated frames, accurately capturing the 'counterclockwise' movement, in contrast to Wan [13]. Prompts shown in the figure: (a) "A freight train moves forward through heavy falling snow."; (b) "The mallet sweeps counterclockwise around the brass singing bowl."

Abstract
Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

1.
Introduction
Video frame interpolation (VFI) aims to predict one or more intermediate frames between two keyframes, the first and last input images, often to raise frame rate or enable slow motion [6, 9, 33]. Recent advancements in image-to-video models [1, 13] enhanced overall video generation quality and enabled longer sequence generation. Building on this, current works have shifted from matching a single ground truth to generating varied scenes between sparser keyframes. This task, namely generative inbetweening (GI), reframes VFI as filling the gap between widely-spaced keyframes with plausible, temporally coherent transitions under uncertainty [22].

arXiv:2603.17651v1 [cs.CV] 18 Mar 2026

Early work on GI was driven by Stable Video Diffusion (SVD) [1, 14, 39, 44, 51]. While the GI task requires conditioning on two keyframes, SVD can structurally take only a single keyframe as input, which led prior works to run SVD twice and fuse the results to approximate the intermediate frames. However, this approach leads to collapse and blur, especially on long sequences, as GI inherently requires exploiting both keyframes. In contrast, Diffusion Transformer (DiT)-based video models [13, 21, 45] are able to jointly condition on two keyframes as well as text prompts while scaling to long sequences. Consequently, recent studies, including Wan's First-Last-Frame-to-Video (FLF2V) pipeline, began leveraging DiT for text-conditioned GI [13].

As keyframes become sparser and motions more dynamic, the guidance from the two keyframes and text prompt on intermediate frames naturally weakens along the generation process. We identify these issues as three key challenges: (i) semantic fidelity, (ii) frame consistency, and (iii) pace stability. For instance, in Fig. 1, Wan shows text misalignment with inconsistent frames, while Fig. 2 demonstrates an example of pace instability with spontaneous pace shifts.
Consequently, a dedicated mechanism is needed to draw semantic and temporal cues from the two keyframes and text conditions. We achieve this, without additional training, by individually modifying the DiT's cross- and self-attention, which are crucial for context mixing [10, 38].

First, we implement Keyframe-anchored Attention Bias (KAB) on cross-attention in order to maintain semantic fidelity and stable pacing. More specifically, we derive keyframe anchors from the keyframes' cross-attention, which are used to attract the intermediate frames towards the conditions: the two input keyframes and the text prompt. This guides the intermediate attention maps to fill in the missing semantic and temporal cues from all conditions. Through this method, KAB yields videos that are semantically aligned and stably paced.

Furthermore, we propose Rescaled Temporal RoPE (ReTRo), a simple adjustment to the self-attention layer of the DiT block. Since both keyframes should be preserved while also synthesizing consistent intermediate frames, ReTRo increases the temporal RoPE [36] scale near the keyframes and reduces it in other frames. The higher scale sharpens attention to maintain keyframe fidelity, while the lower scale broadens attention so intermediate frames attend more stably to both keyframes. Consequently, ReTRo reduces artifacts and blur, resulting in more temporally consistent GI.

Figure 2. Pace Stability Comparison. This figure compares the pace stability of Wan and our method against ground truth (GT). The paraglider's motion is visualized by overlaying the same uniformly sampled frame indices for GT, Wan, and Ours with the background aligned, marking sampled positions with red dots (•) and displacements with black dashed arrows. Wan exhibits pace instability, in that the paraglider alternately accelerates and decelerates, producing uneven spacing, whereas our method closely matches the ground truth with smooth motion and stable pacing.

Although text conditioning in GI enables more flexible and diverse synthesis, there is no reliable benchmark to evaluate models supporting text guidance. We therefore introduce TGI-Bench, which curates sequences tailored for text-conditioned GI and pairs each with an aligned textual prompt. Each sequence is annotated with one of four different challenge categories in the GI field, enabling challenge-specific diagnosis of model strengths and weaknesses.

Our contributions are summarized as follows:
• We propose KAB to deliver temporal and semantic guidance from the keyframes and text to the intermediate frames, improving semantic fidelity and pace stability.
• We also present ReTRo, which rescales self-attention positional encodings to enhance overall frame consistency.
• We curate TGI-Bench for text-conditioned GI evaluation across sequence lengths and challenges, providing a diagnostic framework for future work.

2. Related Work
Generative Inbetweening. Traditional video frame interpolation (VFI) methods use deterministic pipelines, which are effective for small or near-linear motion but degrade under large displacements, non-linear dynamics, and occlusions [6, 9, 33]. With the advent of diffusion models [1, 18, 34], large-scale training and sampling-based approaches emerged, allowing VFI to expand to generative inbetweening (GI). For instance, TRF [14] employs Stable Video Diffusion (SVD) [1] in a bidirectional manner, and ViBiDSampler [44] advances it by altering the sampling strategy without additional training. Further studies such as Generative Inbetweening [39] and FCVG [51] adopt feature-guided designs to improve motion consistency. Meanwhile, Wan [13] emerged as a video foundation model built on a Diffusion Transformer backbone.
However, the attention mechanisms of DiT blocks optimized for the GI task remain underexplored.

Figure 3. Overall Pipeline of Our Method. Our model is built upon a video DiT pipeline that consists of DiT blocks with self-attention and cross-attention layers. Left: Keyframe-anchored Attention Bias is performed for each condition's cross-attention; it aggregates the cross-attention maps of each keyframe to form keyframe anchors. These keyframe anchors are interpolated to frame-wise target anchors, which are used as a small logit bias to guide each intermediate frame. Right: Furthermore, we introduce Rescaled Temporal RoPE, which increases the temporal RoPE scale at the edges ($s_{\text{edge}} > 1$) and reduces it in the middle ($s_{\text{mid}} < 1$). As a result, edge frames place most of their attention on nearby frames while middle frames spread their attention across a wider temporal range.

Cross-attention Editing. Cross-attention has long been used as a control handle for text-driven image editing. Prompt-to-Prompt [16] preserves structure by copying and blending cross-attention maps across prompts and timesteps, while Attend-and-Excite [3] reweights under-attended tokens to mitigate missing-object failures. Pix2Pix-Zero [28] keeps edits close to the source to maintain layout while changing semantics. Video-P2P [24] extends this steering across frames to keep appearance consistent in video, and Layout Control [4] manipulates cross-attention to satisfy user-specified boxes or landmarks.
Rotary Positional Embeddings in Video DiT. RoPE scaling has been explored primarily in LLMs, showing that adjusting rotation rates can modulate locality and extend context with minimal changes [25, 29, 36]. Spatiotemporal RoPE designs stabilize long-range interactions but still rely on a uniform temporal schedule across frames [41], and dynamic frequency schemes target diffusion steps rather than framewise control [20]. While cross-attention based positional schemes for temporal control exist [42], framewise temporal rescaling of RoPE inside self-attention for GI remains underexplored, to our knowledge.

3. Method
Our overall method pipeline is presented in Fig. 3. We demonstrate our method on the First-Last-Frame-to-Video (FLF2V) pipeline in Wan2.1 [13], a unique video DiT framework well-suited to our approach (Sec. 3.1). To ensure pace stability and semantic fidelity, we design target anchors which guide intermediate frames on the keyframes and text in the cross-attention (Sec. 3.2). Furthermore, we propose a scaling strategy within the self-attention that preserves consistency across frames (Sec. 3.3). Our method is model-agnostic and applies to any video DiT without additional training.

3.1. Preliminary
To construct the video tokens, the video sequence of $F$ frames is formed along the temporal axis by placing $I_{\text{first}}$ at the beginning, $I_{\text{last}}$ at the end, and inserting $F - 2$ zero-filled frames in between. This sequence is then compressed by Wan-VAE into a conditional latent sequence of $f$ frames. After concatenating binary masks and latent diffusion noise with the conditional latent, the video tokens are obtained. Meanwhile, to construct context vectors, the text prompt is encoded with the UMT5 encoder [7] and projected to the DiT context space. For the two keyframes, CLIP [32] features from $I_{\text{first}}$ and $I_{\text{last}}$ are concatenated and also projected to the context space.
The final context vectors, including all image and text embeddings, are passed to the cross-attention layer.

3.2. Keyframe-anchored Attention Bias
We seek to guide the intermediate frames with semantic and temporal cues from the three conditions, two keyframes and text, by leveraging their cross-attention distributions. As operating on full attention maps would be expensive, we compress these maps into keyframe anchors, motivated by prior works [2, 3, 11, 16]. Interpolating between the two keyframe anchors yields frame-wise target anchors, which we use to add a small logit bias to each intermediate frame.

Target Attention Anchor. Let $L_h := Q_h K_h^{\top} / \sqrt{d_h} \in \mathbb{R}^{f l_q \times l_k}$ be the cross-attention logit for head $h$, where $l_q$ and $l_k$ are the number of video query tokens per frame and the number of condition key tokens, respectively. Applying softmax, $A_h := \mathrm{softmax}(L_h)$ is a cross-attention heatmap whose rows are queries for the $f$ frames and whose columns are keys for each condition. To build our anchors, we reuse the model's own cross-attention $A_h$ and simply slice out the rows that belong to the first and last video frames to obtain $A_h^{(0)}$ and $A_h^{(f-1)} \in \mathbb{R}^{l_q \times l_k}$ for each condition. The two slices are then compressed into two keyframe anchors, $\bar{A}^{(0)}$ and $\bar{A}^{(f-1)} \in \mathbb{R}^{l_k}$, by averaging over heads and video queries:

$$\bar{A}^{(t)} = \operatorname{Mean}_{H,\, l_q} A_h^{(t)}, \qquad t \in \{0, \dots, f-1\} \tag{1}$$

The keyframe anchors each yield one distribution for the first and last frame of the video, capturing which keys are globally important for those frames. To guide each frame, we need a frame-wise target anchor $M^{(t)} \in \mathbb{R}^{l_k}$ based on the two keyframe anchors. For each intermediate frame index $t \in \{1, \dots, f-2\}$, we linearly interpolate the keyframe anchors to obtain:

$$M^{(t)} := (1 - \tau^{(t)})\, \bar{A}^{(0)} + \tau^{(t)}\, \bar{A}^{(f-1)}, \qquad \tau^{(t)} = \frac{t}{f-1}. \tag{2}$$

These anchors can now be used to hint to the model which semantic and temporal information should be emphasized at each frame for the three conditions.
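As a rough illustration of Eqs. (1) and (2), the following NumPy sketch computes per-frame anchors from cross-attention logits and interpolates the two keyframe anchors into frame-wise targets. It also includes the log-ratio bias that the text goes on to define in Eq. (3). The function names, the frame-major token ordering assumed by the reshape, and the value of `eps` are assumptions made for this example, not details taken from the paper or from the Wan codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_anchors(logits, f):
    """Eq. (1): average attention over heads and per-frame queries.

    logits: (H, f * l_q, l_k) cross-attention logits for one condition.
    Assumption: query tokens are ordered frame by frame.
    Returns A_bar with shape (f, l_k).
    """
    H, n, l_k = logits.shape
    l_q = n // f
    A = softmax(logits, axis=-1).reshape(H, f, l_q, l_k)
    return A.mean(axis=(0, 2))

def target_anchors(A_bar):
    """Eq. (2): linear interpolation between the two keyframe anchors."""
    f = A_bar.shape[0]
    tau = np.arange(f) / (f - 1)                      # tau^(t) = t / (f - 1)
    return (1 - tau)[:, None] * A_bar[0] + tau[:, None] * A_bar[-1]

def logit_bias(M, A_bar, eps=1e-6):
    """Log-ratio bias B^(t) as defined in Eq. (3); eps is a hypothetical value."""
    return np.log(M + eps) - np.log(A_bar + eps)
```

Because the interpolation weights are 0 and 1 at the endpoints, the target anchors coincide with the keyframe anchors at frames 0 and f−1, so the bias vanishes there and only intermediate frames are nudged.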
Note that we slightly abuse notation and also write $M^{(0)} = \bar{A}^{(0)}$ and $M^{(f-1)} = \bar{A}^{(f-1)}$.

Frame-wise Logit Bias. Given the frame-wise target anchors, we gently steer the intermediate frames towards them by adding a small frame-wise logit bias $B^{(t)} \in \mathbb{R}^{l_k}$, $t \in \{0, \dots, f-1\}$, to the original cross-attention logits:

$$\tilde{L}_h^{(t)} := L_h^{(t)} + \beta^{(t)} B^{(t)} = L_h^{(t)} + \beta^{(t)} \left[ \log\!\left(M^{(t)} + \varepsilon\right) - \log\!\left(\bar{A}^{(t)} + \varepsilon\right) \right], \tag{3}$$

where $B^{(t)}$ is added to all heads and all video queries of frame $t$ (i.e., broadcast), and $\varepsilon$ guards against near-zero probabilities. Finally, we replace the original attention weights $A_h$ with $\tilde{A}_h = \mathrm{softmax}(\tilde{L}_h)$. This conservative pull preserves the model's local patterns while guiding its global token allocation toward the keyframes and text prompt.

Following prior works, we also apply a smooth taper on $\beta^{(t)}$ across the timeline, stronger near the keyframes and weaker in the middle, and gate the guidance to layers 5 to 12 only and to the first 40% of the total steps [3, 16, 34]. This nudges the semantic and temporal guidance to settle early while leaving late steps and layers to form textures and fine details, since mid-level layers often carry much of the image's spatial or semantic structure [23].

Triple Isolated Cross-attention. In the baseline FLF2V pipeline, the model applies cross-attention to $I_{\text{first}}$ alone, while $I_{\text{last}}$ and text are concatenated and attended jointly, yielding an asymmetric fusion across the three conditions. Unlike the baseline's asymmetric fusion, we compute three symmetric cross-attentions: $I_{\text{first}} \leftrightarrow$ video, $I_{\text{last}} \leftrightarrow$ video, and text $\leftrightarrow$ video. Each cross-attention is refined by its own $M^{(t)}$ and $\beta^{(t)}$, and the three outputs are then equally weighted and averaged, amplifying the effect of explicit semantic guidance while preserving symmetry across modalities.

3.3. Rescaled Temporal RoPE
Prior video DiT models employ RoPE within self-attention to inject relative spatiotemporal information by rotating queries and keys in phase.
However, the generative inbetweening task imposes a different challenge: the two keyframes must be preserved, and intermediate frames should be generated while jointly conditioning on both keyframes. Vanilla RoPE provides relative distances between frames but lacks an explicit mechanism that anchors the two keyframes, which leads to frame inconsistency.

We therefore introduce Rescaled Temporal RoPE (ReTRo), inspired by RoPE scaling methods for LLMs [5, 29]. Specifically, we apply higher RoPE scales near the two keyframes and lower scales in other frames, sharpening locality to preserve the keyframes while broadening attention to promote consistency across the intermediate frames. First, we construct per-axis frequency rows (temporal, height, width) and concatenate them:

$$\Psi(t, h, w) := \big[\, \Omega_t[t];\ \Omega_h[h];\ \Omega_w[w] \,\big], \qquad t \in \{0, \dots, f-1\}. \tag{4}$$

Then, we pick an integer $w_{\text{edge}} \in \{0, \dots, \lfloor f/2 \rfloor\}$, which stands for the number of edge frames per side, and define the edge and middle index sets symmetrically:

$$\mathcal{T}_{\text{edge}} := \{0, \dots, w_{\text{edge}} - 1\} \cup \{f - w_{\text{edge}}, \dots, f - 1\}, \qquad \mathcal{T}_{\text{mid}} := \{w_{\text{edge}}, \dots, f - w_{\text{edge}} - 1\}. \tag{5}$$

We scale the temporal frequency row per frame to maintain edge-frame fidelity and stabilize the middle:

$$s(t) := \begin{cases} s_{\text{edge}}, & t \in \mathcal{T}_{\text{edge}}, \\ s_{\text{mid}}, & t \in \mathcal{T}_{\text{mid}}, \end{cases} \qquad \tilde{\Omega}_t[t] = s(t)\, \Omega_t[t], \tag{6}$$

where $s_{\text{edge}} > 1$ and $s_{\text{mid}} < 1$. Finally, we reassemble the per-axis frequency rows and use them for both video queries and keys:

$$\Psi_{\text{ReTRo}}(t, h, w) := \big[\, \tilde{\Omega}_t[t];\ \Omega_h[h];\ \Omega_w[w] \,\big], \tag{7}$$

$$\tilde{Q} = \mathrm{RoPE}(Q;\ \Psi_{\text{ReTRo}}), \qquad \tilde{K} = \mathrm{RoPE}(K;\ \Psi_{\text{ReTRo}}). \tag{8}$$

As a result, in self-attention the edge frames place most of their attention on nearby frames, stabilizing local detail and keyframe fidelity. In contrast, the middle frames spread their attention across a wider temporal range, drawing information from more distant frames to enhance overall consistency. This yields a simple, training-free mechanism for frame consistency without architectural changes.
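The temporal part of Eqs. (5), (6), and (8) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: $\Omega_t[t]$ is interpreted here as a row of rotation angles $t \cdot \theta_i$ with the usual base of 10000, the rotation uses interleaved channel pairs, and the function names are hypothetical. The default scales 1.06 and 0.94 are the values the paper reports in its implementation details. The height and width rows, which ReTRo leaves unchanged, are omitted.

```python
import numpy as np

def temporal_angles(f, dim, base=10000.0):
    # Omega_t[t], assumed to be rotation angles t * theta_i; shape (f, dim // 2).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.arange(f)[:, None] * inv_freq[None, :]

def retro_scales(f, w_edge, s_edge=1.06, s_mid=0.94):
    # Eqs. (5)-(6): s_edge on the first/last w_edge frames, s_mid elsewhere.
    if not 0 <= w_edge <= f // 2:
        raise ValueError("w_edge must lie in {0, ..., floor(f / 2)}")
    s = np.full(f, s_mid)
    s[:w_edge] = s_edge
    s[f - w_edge:] = s_edge
    return s

def retro_temporal_angles(f, dim, w_edge, s_edge=1.06, s_mid=0.94):
    # Rescaled temporal row: s(t) * Omega_t[t].
    scales = retro_scales(f, w_edge, s_edge, s_mid)
    return scales[:, None] * temporal_angles(f, dim)

def apply_rope(x, angles):
    # Rotate interleaved channel pairs of x (f, dim) by angles (f, dim // 2).
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Under this reading, the same rescaled angles would be applied to both queries and keys, as in Eq. (8). Since the operation is a pure rotation, token norms are unchanged and only the relative phase between frames is affected.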
Figure 4. TGI-Bench. (a) One example from each challenge of our TGI-Bench is presented. For each example, the first and last frames of the video along with its text description are shown. (b) The distribution of challenges according to the number of frames (25, 33, 65, and 81) is illustrated. Example prompts in (a): near-static: "Astronaut kite is floating in the sky."; occlusion: "Person rotates orchid mounted on bark to the right."; linear motion: "A silver car is moving forward."; dynamic motion: "A BMX rider jumps over a dirt ramp."

4. TGI-Bench
Previous studies on generative inbetweening (GI) and video frame interpolation have primarily relied on video datasets such as [31, 35] for model evaluation. However, while these benchmarks provide dense ground-truth frames, most of them lack natural-language annotations, restricting the ability to evaluate whether a model truly reflects textual instructions when generating videos. In addition, the existing benchmarks do not offer diverse challenges, which hinders diagnosing a model across various capabilities.

Thus, we present the Text-conditioned Generative Inbetweening Benchmark (TGI-Bench). Each TGI-Bench sample consists of two keyframes, the corresponding ground-truth intermediate frames, a textual description, and its designated challenge category. To cover both short and long sequences, we release four sequence-length variants of 25, 33, 65, and 81 frames, following prior works [1, 8, 13, 15, 30], supporting broad, apples-to-apples comparison across methods. In summary, TGI-Bench (i) enables the evaluation of a model's text-grounded generative inbetweening performance (ii) across diverse sequence lengths, (iii) while also allowing for the fine-grained diagnosis of its capabilities within specific challenge categories.
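To make the sample structure described above concrete, a benchmark entry could be modeled as follows. This is a hypothetical sketch: the class name, field names, and validation rules are illustrative assumptions and do not reflect the released TGI-Bench file format. Only the four sequence lengths and the four challenge names come from the text and Fig. 4.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Challenge categories and sequence lengths named in the paper.
CHALLENGES = ("dynamic motion", "occlusion", "linear motion", "near-static")
LENGTHS = (25, 33, 65, 81)

@dataclass
class TGIBenchSample:
    """Hypothetical container for one text-conditioned GI sample."""
    frame_paths: List[str]  # all F ground-truth frames, in temporal order
    prompt: str             # aligned text description
    challenge: str          # one of CHALLENGES

    def __post_init__(self):
        if len(self.frame_paths) not in LENGTHS:
            raise ValueError(f"expected one of {LENGTHS} frames")
        if self.challenge not in CHALLENGES:
            raise ValueError(f"unknown challenge: {self.challenge}")

    @property
    def keyframes(self) -> Tuple[str, str]:
        # First and last frames are the model inputs.
        return self.frame_paths[0], self.frame_paths[-1]

    @property
    def intermediates(self) -> List[str]:
        # The F - 2 ground-truth frames the model must reproduce.
        return self.frame_paths[1:-1]
```

A model under evaluation would receive `keyframes` and `prompt`, and its output would be compared against `intermediates`, with results grouped by `challenge`.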
Data Curation. To construct the TGI-Bench dataset, we select videos from the DAVIS [31] dataset, as well as from the Pexels and Pixabay websites¹. Videos without a clear main object or with overly complex motion that cannot be sufficiently described with text are excluded, resulting in a final selection of 220 videos. From each video, we uniformly sample $F$ frames. These $F$ frames are then subsampled and provided to GPT-4.1 [12], which is prompted to generate a text description of the video and classify it into one of the following challenge types [22]: dynamic motion, occlusion, linear motion, near-static. We repeat this process for $F \in \{25, 33, 65, 81\}$, resulting in four validation sets corresponding to different video lengths. Fig. 4(a) shows one example per challenge for $F = 25$, and Fig. 4(b) shows the proportion of each challenge in TGI-Bench for different values of $F$. Please refer to the supplementary material for the detailed GPT prompt and sampling process.

¹ https://www.pexels.com/, https://www.pixabay.com/

5. Experiments
Baselines & Dataset. We compare our model with baselines including TRF [14], ViBiDSampler [44], GI [39], FCVG [51] and Wan2.1 [13]. For all baselines and our method, we use our TGI-Bench to evaluate the generated videos on the metrics described in Sec. 5.1. Each method is assessed under four frame-length settings, following the structure defined in our TGI-Bench. Additional examples are deferred to the supplementary material.

Implementation Details. For strict comparison, the seed is fixed across all methods and experiments. When implementing Keyframe-anchored Attention Bias (KAB), the attention logit bias strength $\beta^{(t)}$ uses a cosine taper, taking the value 0.7 near the keyframes and decreasing toward the midpoint to 0.3. For Rescaled Temporal RoPE (ReTRo), the edge width $w_{\text{edge}}$ is set to the integer closest to 0.1 times the total number of frames. We fix the scaling factors $s_{\text{edge}} = 1.06$ and $s_{\text{mid}} = 0.94$.

5.1.
Quantitative Evaluation
Video Generation Evaluation. Due to the page limit, we report the 65- and 81-frame results, as longer horizons are more discriminative and practically relevant. We compute PSNR, SSIM [40], and LPIPS [48] against ground truth frames, and use FVD [37], FID [17] and the VBench [19] score to measure overall video quality. We select 8 relevant dimensions for VBench: I2V Subject, I2V Background, Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality and Imaging Quality. For all metrics, our method achieves the best performance as shown in Tab. 1. Further details and complete results on shorter sequences are in the supplementary material.

Table 1. Video Generation Evaluation Results. Quantitative comparison of the baselines and our method on 65 and 81 frames. We evaluate video generation quality and fidelity. The best results are in bold, and the second best are underlined.

65-frame
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FVD↓ | VBench↑ |
| TRF [14] | 16.06 | 0.5662 | 0.5173 | 168.37 | 0.3379 | 8.117 |
| ViBiDSampler [44] | 15.45 | 0.5381 | 0.5211 | 160.74 | 0.3574 | 8.619 |
| GI [39] | 15.43 | 0.5473 | 0.5210 | 224.36 | 0.3693 | 7.971 |
| FCVG [51] | 16.77 | 0.5412 | 0.4412 | 98.02 | 0.2938 | 9.755 |
| Wan [13] | 16.75 | 0.5661 | 0.4172 | 82.42 | 0.3406 | 9.861 |
| Ours | 17.68 | 0.5903 | 0.4016 | 77.66 | 0.2820 | 9.924 |

81-frame
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FVD↓ | VBench↑ |
| TRF [14] | 16.08 | 0.5859 | 0.5136 | 169.46 | 0.3324 | 8.074 |
| ViBiDSampler [44] | 15.57 | 0.5533 | 0.5142 | 163.09 | 0.3482 | 8.759 |
| GI [39] | 15.59 | 0.5666 | 0.5166 | 219.80 | 0.3416 | 7.923 |
| FCVG [51] | 17.16 | 0.5698 | 0.4216 | 97.01 | 0.2607 | 9.899 |
| Wan [13] | 17.63 | 0.6179 | 0.3945 | 82.90 | 0.2769 | 9.904 |
| Ours | 18.17 | 0.6269 | 0.3818 | 77.59 | 0.2458 | 10.022 |

Table 2. Generative Inbetweening Evaluation Results. To complement the video generation evaluation, we conduct experiments on other metrics that reflect our target qualities. L-frames stands for LPIPS-frames and C-frames stands for CLIPSIM-frames. For human evaluation, we evaluate FC, SF, and PS, which stand for frame consistency, semantic fidelity and pace stability, respectively. The best results are in bold, and the second best are underlined.

65-frame (Sem. Fid.: X-CLIP, VQA; Frame Cons.: L-frames, C-frames)
| Method | X-CLIP↑ | VQA↑ | L-frames↓ | C-frames↑ |
| TRF [14] | 0.2174 | 0.5720 | 0.1869 | 0.9669 |
| ViBiDSampler [44] | 0.2249 | 0.4982 | 0.1782 | 0.9722 |
| GI [39] | 0.2169 | 0.4901 | 0.1632 | 0.9742 |
| FCVG [51] | 0.2256 | 0.6338 | 0.0792 | 0.9831 |
| Wan [13] | 0.2326 | 0.6631 | 0.1334 | 0.9788 |
| Ours | 0.2340 | 0.6692 | 0.1047 | 0.9855 |

81-frame (Sem. Fid.: X-CLIP, VQA; Frame Cons.: L-frames, C-frames)
| Method | X-CLIP↑ | VQA↑ | L-frames↓ | C-frames↑ |
| TRF [14] | 0.2089 | 0.5465 | 0.1883 | 0.9647 |
| ViBiDSampler [44] | 0.2186 | 0.5396 | 0.1751 | 0.9733 |
| GI [39] | 0.2082 | 0.4545 | 0.1692 | 0.9735 |
| FCVG [51] | 0.2222 | 0.6194 | 0.0613 | 0.9858 |
| Wan [13] | 0.2262 | 0.6730 | 0.0959 | 0.9839 |
| Ours | 0.2292 | 0.6752 | 0.0774 | 0.9881 |

Human Eval.
| Method | FC↑ | SF↑ | PS↑ |
| TRF [14] | 1.60 | 1.53 | 2.10 |
| ViBiDSampler [44] | 2.05 | 1.82 | 2.38 |
| GI [39] | 1.70 | 1.55 | 1.89 |
| FCVG [51] | 2.87 | 2.73 | 2.91 |
| Wan [13] | 3.50 | 3.69 | 3.65 |
| Ours | 4.38 | 4.27 | 4.34 |

Generative Inbetweening Evaluation. While traditional video generation metrics are effective at measuring frame consistency, they are insufficient for semantic fidelity and pace stability. Thus, we attempt to assess these qualities to observe the effectiveness of our method. Semantic fidelity is measured by X-CLIP [26] text-to-video similarity, which compares prompt and generated video embeddings. For the video visual question answering (VQA) score, we average over 6 different QA models to reduce variance. Frame consistency is further evaluated with LPIPS-frames and CLIPSIM-frames, which average similarities between adjacent frames following prior works [27, 50]. Interestingly, FCVG, which provides intermediate-frame motion guidance that likely reduces path ambiguity, yields performance comparable to text-conditioned GI without any text input. As automatic measurements cannot reliably evaluate our target qualities, semantic fidelity and pace stability, we run a user study on 10% of the video samples, focusing on 81-frame videos. As shown in Tab. 2, our method improves both semantic fidelity and frame consistency. Also, our method attains the highest human-rated semantic fidelity and pace stability. More results and evaluation protocols are provided in the supplementary material.

5.2.
Qualitative Evaluation
Fig. 5 shows a representative qualitative comparison of the baselines and our method. Prior to any analysis of semantic fidelity or pace stability, we already observe visible artifacts such as object collapse, uneven motion, and blurred backgrounds in the SVD-based methods such as TRF, ViBiDSampler, GI, and FCVG. Although Wan preserves frame consistency, it does not follow the prompt. The prompt requires rotating the mallet counterclockwise, but Wan repeatedly moves the mallet up and down with an uneven pace. In contrast, our method generates a smooth counterclockwise rotation at a natural pace that matches the prompt. Please refer to the supplementary material for more qualitative results.

Figure 5. Qualitative Comparison with Baselines. (a) Our method outperforms prior works in all three target challenges: semantic fidelity, pace stability and frame consistency. Although Wan performs better than SVD-based models such as TRF, ViBiDSampler, GI and FCVG, it shows failures in one or more qualities. For instance, in (b), the dog marked by a yellow circle (○) disappears to the left and suddenly reappears in the middle of the frame in later sequences, showing semantic infidelity, while in (c), the location of the person barely moves from frame 48 to frame 60, showing pace instability. On the other hand, our method overcomes all three challenges. Prompts shown in the figure: "Woman catches ball thrown by man on the beach."; "Person rides a hoverboard forward and then steps off."; "A woman in a blue dress performs a dance outdoors."

5.3. Ablation Study
We conduct an ablation study in Tab. 3 on our two methods, Keyframe-anchored Attention Bias (KAB) and Rescaled Temporal RoPE (ReTRo). Across video generation metrics, our method achieves the highest scores on most measures, with the ReTRo-only variant as second-best overall.

ReTRo strengthens temporal cues within each intermediate frame and pulls self-attention toward both keyframes. This temporal reference is particularly well suited to improving frame consistency, which aligns with the competitive quantitative scores of the ReTRo-only variant. In contrast, KAB operates in the cross-attention layer, allowing guidance from keyframe anchors gained from the keyframes as well as the text prompt, applying a lightweight per-token logit bias for each intermediate frame. This mechanism supplies semantic and temporal guidance, making it especially effective for pace stability and semantic fidelity.

Since these effects are difficult to measure faithfully with standard video generation metrics, we also include qualitative results shown in Fig. 6. For instance, KAB preserves a semantically important object consistently across frames 30 to 40, while ReTRo removes the artifact visible around frame 20. Together, these complementary effects yield more consistent and semantically faithful GI with natural motions.

Figure 6. Qualitative Comparison of KAB and ReTRo. This figure shows a qualitative example of the different roles of KAB and ReTRo for the prompt "Broom sweeps debris toward the dustpan on the floor." KAB preserves the semantically important trash object (○) consistently across frames 30 to 40, whereas ReTRo suppresses artifacts around frame 20. Together they address different failure modes, indicating that KAB and ReTRo are complementary rather than interchangeable.

Table 3. Ablation Results on KAB and ReTRo (81-frame). Applying both methods achieves the best scores on the largest number of metrics. ReTRo is the next best overall, reflecting its strength on frame consistency.

| Exp. # | KAB | ReTRo | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FVD↓ | VBench↑ |
| 1 | | | 17.63 | 0.6179 | 0.3945 | 82.90 | 0.2769 | 9.904 |
| 2 | ✓ | | 17.62 | 0.6174 | 0.3945 | 83.01 | 0.2738 | 9.896 |
| 3 | | ✓ | 18.18 | 0.6309 | 0.3892 | 77.88 | 0.2510 | 9.749 |
| 4 | ✓ | ✓ | 18.17 | 0.6269 | 0.3818 | 77.59 | 0.2458 | 10.022 |

5.4. Generative Inbetweening Challenge Analysis
To evaluate whether TGI-Bench supports fine-grained diagnosis of model strengths and weaknesses, we assess existing GI methods across the four different challenges: dynamic motion, linear motion, occlusion, near-static. As shown in Fig. 7, our benchmark reveals a clear spectrum of difficulty. While most methods perform reasonably well on near-static, occlusion emerges as the most difficult for all models.

Across all four challenges, our method compares favorably to existing GI models on the key metrics, LPIPS, X-CLIP, and VBench. This advantage is especially clear on the persistent GI challenges of occlusion and dynamic motion, where our method achieves the best LPIPS and VBench scores, indicating improved frame consistency.

We additionally analyze how text conditioning and robustness to challenging cases affect overall evaluation. Methods that do not accept textual input, such as TRF, ViBiDSampler and GI, lag behind the text-conditioned methods, Wan and ours, on the harder challenges. Although FCVG is not text-conditioned, it leverages motion guidance and delivers respectable performance across individual challenge categories. When compared against the human evaluation shown in Tab. 2, we also observe that differences on the harder subsets account for most of the variation in human scores. Methods that better handle dynamic motion and occlusion are consistently preferred, making these challenging subsets a more informative indicator of human-perceived quality than near-static cases alone.
[Figure 7: three plots of LPIPS ↓, X-CLIP ↑, and VBench ↑ per challenge (dynamic motion, linear motion, occlusion, near-static) for TRF, ViBiD, GI, FCVG, Wan, and Ours.]
Figure 7. Generative Inbetweening Challenge Analysis. We use TGI-Bench to diagnose existing GI methods across four challenges. This reveals a clear difficulty spectrum: most methods do well on near-static, while occlusion is the hardest.

6. Conclusion
In this paper, we propose two training-free mechanisms for generative inbetweening that can act as a general plug-in for any video DiT model. First, Keyframe-anchored Attention Bias (KAB) uses the model's own cross-attention on the first and last frames as signals and guides intermediate frames toward a per-frame interpolation of these keyframe anchors. Second, Rescaled Temporal RoPE (ReTRo) enlarges the temporal RoPE scale at the edge frames and reduces it in the middle to enhance frame consistency. Finally, we release TGI-Bench, a benchmark that systematically diagnoses the current challenges of generative inbetweening. We expect this benchmark to accelerate the field of text-conditioned GI, especially under dynamic motion and occlusion.

Acknowledgement. This work was supported in part by the IITP RS-2024-00457882 (AI Research Hub Project), IITP 2020-I201361, NRF RS-2024-00345806, NRF RS-2023-002620, and RQT-25-120390. Affiliations: Department of Artificial Intelligence (T.E.C, J.K, S.J.H), Department of Computer Science (S.S).

References
[1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023.
[2] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng.
Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023.
[3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
[4] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance, 2023.
[5] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
[6] Xianhang Cheng and Zhenzhong Chen. Multiple video frame interpolation via enhanced deformable separable convolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7029–7045, 2021.
[7] Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023.
[8] Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169, 2024.
[9] Chao Ding, Mingyuan Lin, Haijian Zhang, Jianzhuang Liu, and Lei Yu. Video frame interpolation with stereo event and intensity cameras. IEEE Transactions on Multimedia, 26:9187–9202, 2024.
[10] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
[11] Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation, 2023.
[12] OpenAI et al. Gpt-4 technical report, 2024.
[13] Team Wan et al. Wan: Open and advanced large-scale video generative models, 2025.
[14] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, and Xuaner Zhang. Explorative inbetweening of time and space, 2024.
[15] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022.
[17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[18] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models, 2022.
[19] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[20] Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, and Raanan Fattal. Dype: Dynamic position extrapolation for ultra high resolution diffusion. arXiv preprint arXiv:2510.20766, 2025.
[21] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[22] Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, and Jihyong Oh. Acevfi: A comprehensive survey of advances in video frame interpolation, 2025.
[23] Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, and Jun Huang. Towards understanding cross and self-attention in stable diffusion for text-guided image editing, 2024.
[24] Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia. Video-p2p: Video editing with cross-attention control, 2023.
[25] Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209, 2023.
[26] Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval, 2022.
[27] Jiwoo Park, Tae Eun Choi, Youngjun Jun, and Seong Jae Hwang. Wave: Warp-based view guidance for consistent novel view synthesis using a single image, 2025.
[28] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation, 2023.
[29] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
[30] Xiangyu Peng, Zangwei Zheng, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, Wenjun Li, Yuhui Wang, Anbang Ye, Gang Ren, Qianran Ma, Wanying Liang, Xiang Lian, Xiwen Wu, Yuting Zhong, Zhuangyan Li, Chaoyu Gong, Guojun Lei, Leijun Cheng, Limin Zhang, Minghao Li, Ruijie Zhang, Silan Hu, Shijie Huang, Xiaokang Wang, Yuanheng Zhao, Yuqi Wang, Ziang Wei, and Yang You. Open-sora 2.0: Training a commercial-level video generation model in $200k, 2025.
[31] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation, 2018.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[33] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision, pages 250–266. Springer, 2022.
[34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[35] Alexandros Stergiou. Lavib: A large-scale video interpolation benchmark, 2024.
[36] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023.
[37] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[39] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M. Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation, 2025.
[40] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[41] Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, et al. Videorope: What makes for good video rotary position embedding? arXiv preprint arXiv:2502.05173, 2025.
[42] Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, and Sergey Tulyakov. Mind the time: Temporally-controlled multi-event video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23989–24000, 2025.
[43] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024.
[44] Serin Yang, Taesung Kwon, and Jong Chul Ye. Vibidsampler: Enhancing video interpolation using bidirectional diffusion sampler, 2025.
[45] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[46] Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding, 2025.
[47] Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output, 2024.
[48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[49] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data, 2025.
[50] Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou. Storydiffusion: Consistent self-attention for long-range image and video generation, 2024.
[51] Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, and Wangmeng Zuo. Generative inbetweening through frame-wise conditions-driven video generation, 2024.

Supplementary Material

S1. Additional Resources
The code for our method is provided in the code folder of the supplementary material. We also provide all of our 81-frame results, along with the result videos of the baseline Wan [13], in the videos folder of the supplementary material. In addition, a sample of our TGI-Bench dataset is included in the TGI-Bench folder.

S2. Evaluation Details
S2.1. Experimental Details
All experiments were conducted on an NVIDIA RTX A6000 GPU (48 GB VRAM) using mixed precision (bfloat16). We used the Wan2.1-FLF2V-14B-720P checkpoint [13], a 14B-parameter diffusion transformer model, together with the UMT5-XXL text encoder, VAE decoder, and XLM-RoBERTa-Large vision encoder, all of which can be accessed through Wan2.1 (footnote 2).
Inference was performed using the DiffSynth-Studio framework (footnote 3), which provides efficient pipeline management and automatic VRAM optimization. Videos were generated at 480×864 resolution and 15 FPS using tiled processing. For Stable Video Diffusion-based models, we used the following checkpoints: stable-video-diffusion-img2vid-xt for GI [39], ViBiD [44], and TRF [14], and stable-video-diffusion-img2vid-xt-1-1 for FCVG [51].

S2.2. Video Question Answering Evaluation Details
To obtain a stable VQA-based alignment score between a generated video and its textual prompt, we evaluate each video using six vision-language models with diverse architectures and visual encoders: qwen2.5-vl-7b [43], llava-onevision-qwen2-7b-si, internlmxcomposer25-7b [47], tarsier-recap-7b [46], llava-video-7b [49], and gpt-4.1 [12]. For each model, we sample video frames using either an FPS-based strategy (Qwen models) or a fixed frame-count strategy (LLaVA, InternLM-XComposer, Tarsier), encode the frames through the model's vision encoder, and compute a binary VQA response to the question "Does this video show [caption]?", with the text prompt substituted for [caption]. Each model produces a probability score for the Yes response, normalized to [0, 1] from the logits of the Yes and No tokens (or from log-probabilities in the case of gpt-4.1). Because individual models exhibit significant variance due to differences in frame sampling, vision encoders, and temporal reasoning ability, we average the scores across all six models to obtain a more reliable and model-agnostic VQA metric.

S2.3. User Study Details
Figure S1 shows the interface used in our user study, which was conducted with more than 20 participants. For each of the 12 questions (about 10% of TGI-Bench), participants were given a text prompt and six video clips, one generated by each compared method, whose positions were randomly shuffled to ensure fairness.
They then rated every clip on semantic fidelity, pace stability, and frame consistency using a five-point Likert scale.

S3. Additional Experimental Results
S3.1. Quantitative Results
In Tab. S1, we present quantitative results for the 25- and 33-frame sequences. All settings, except for the number of frames, are identical to those used for the 65- and 81-frame sequences in the main paper.

S3.2. Qualitative Results
We provide additional qualitative results for all our baseline models in Figs. S4–S12.

S3.3. Hyperparameter Experiment
We conduct an ablation study on the hyperparameters of the two main components of our method, Keyframe-anchored Attention Bias (KAB) and Rescaled Temporal RoPE (ReTRo). The results are summarized in Tab. S2 and Tab. S3. Overall, these ablations indicate that our chosen hyperparameters provide a good balance between fidelity and perceptual quality, and that our method is reasonably robust to moderate changes in these values.

Footnote 2: https://github.com/Wan-Video/Wan2.1
Footnote 3: https://github.com/modelscope/DiffSynth-Studio

Table S1. Additional Video Generation Evaluation Results. Quantitative comparison of the baselines and our method on 25 and 33 frames. We evaluate video generation quality and fidelity. The best results are in bold, and the second best are underlined.
25-frame
Method              PSNR↑    SSIM↑    LPIPS↓   FID↓      FVD↓     VBench↑
TRF [14]            16.734   0.5546   0.4612   104.393   0.2749   9.473
ViBiDSampler [44]   17.029   0.5686   0.4257   93.172    0.2776   9.587
GI [39]             17.418   0.5801   0.3972   91.884    0.2571   9.935
FCVG [51]           18.264   0.5631   0.3859   80.276    0.2016   9.865
Wan [13]            19.076   0.6180   0.3430   68.890    0.1821   10.103
Ours                19.557   0.6322   0.3418   67.888    0.1682   9.991
33-frame
Method              PSNR↑    SSIM↑    LPIPS↓   FID↓      FVD↓     VBench↑
TRF [14]            16.603   0.5584   0.4777   118.459   0.2893   9.147
ViBiDSampler [44]   16.607   0.5574   0.4561   107.697   0.3121   9.245
GI [39]             16.499   0.5587   0.4470   127.957   0.2955   9.339
FCVG [51]           17.682   0.5523   0.4083   88.814    0.2508   9.781
Wan [13]            18.174   0.5953   0.3771   74.383    0.2409   9.915
Ours                18.757   0.6127   0.3669   70.399    0.2086   9.918

Table S2. Hyperparameter Experiment Results on KAB. The best results are in bold, and the second best are underlined. Ours denotes the hyperparameters used in our method.
Hyperparameters            PSNR↑     SSIM↑    LPIPS↓   FID↓     FVD↓     VBench↑
0.1 ≤ β_t ≤ 0.5            17.0651   0.5859   0.3972   84.898   0.2929   10.237
0.5 ≤ β_t ≤ 0.9            17.100    0.5856   0.3973   85.051   0.2839   10.230
0.5 ≤ β_t ≤ 0.5            17.107    0.5859   0.3968   83.915   0.2865   10.203
0.1 ≤ β_t ≤ 0.9            17.072    0.5853   0.3977   84.636   0.2887   10.208
Ours (0.3 ≤ β_t ≤ 0.7)     18.169    0.6269   0.3818   77.587   0.2458   10.022

Table S3. Hyperparameter Ablation on ReTRo. The best results are in bold, and the second best are underlined. Ours denotes the hyperparameters used in our method.
        s_mid   s_edge   PSNR↑    SSIM↑    LPIPS↓   FID↓     FVD↓     VBench↑
        0.94    1.12     16.964   0.5819   0.4054   90.584   0.3005   10.157
        0.88    1.06     17.481   0.5941   0.3842   78.739   0.2660   10.339
Ours    0.94    1.06     18.169   0.6269   0.3818   77.587   0.2458   10.022

For KAB, we experiment over the temporal range [β_t^min, β_t^max]. Narrow or overly wide ranges, as well as values that are too high or too low, generally degrade performance on the distortion and distributional metrics. In contrast, our default setting (0.3 ≤ β_t ≤ 0.7) achieves the best overall scores, yielding clear gains in PSNR, SSIM [40], and FVD [37], while its VBench [19] score (10.022) is slightly below those of the other ranges (10.203 to 10.237).
To analyze the effect of ReTRo, we vary the parameters s_mid and s_edge, which control the relative emphasis on mid-sequence versus boundary frames in the temporal RoPE rescaling. Our default configuration (s_mid = 0.94, s_edge = 1.06) achieves the best performance on most metrics, including PSNR, SSIM, LPIPS [48], FID [17], and FVD, while maintaining competitive VBench scores. The alternative setting (s_mid = 0.88, s_edge = 1.06) provides the second-best overall performance and a slightly higher VBench score.

S4. Additional Analysis
S4.1. KAB
KAB is a method that uses the cross-attention of the keyframes to guide intermediate frames under three conditions: the two keyframes and the text prompt.
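To make the idea of a keyframe-anchored bias concrete, the sketch below blends two anchor distributions over text tokens and adds the blend to cross-attention logits as a log-domain bias. Everything specific here is an assumption for illustration: that anchors are normalized attention distributions taken from the first and last keyframes, that the blend is linear in frame index, that the bias has the form beta * log(anchor), and that beta is a single scalar. The function names are hypothetical, and this is not the authors' implementation.

```python
import math


def interpolate_anchor(anchor_first, anchor_last, frame_idx, num_frames):
    """Linearly blend the two keyframe anchors for one intermediate frame."""
    t = frame_idx / (num_frames - 1)  # 0 at the first keyframe, 1 at the last
    return [(1.0 - t) * a + t * b for a, b in zip(anchor_first, anchor_last)]


def add_anchor_bias(logits, anchor, beta, eps=1e-6):
    """Add beta * log(anchor) to the cross-attention logits of one query."""
    return [z + beta * math.log(p + eps) for z, p in zip(logits, anchor)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with three text tokens (illustrative numbers only).
anchor_first = [0.7, 0.2, 0.1]
anchor_last = [0.1, 0.2, 0.7]
target = interpolate_anchor(anchor_first, anchor_last, frame_idx=2, num_frames=5)
plain = softmax([0.0, 0.0, 0.0])
biased = softmax(add_anchor_bias([0.0, 0.0, 0.0], target, beta=0.5))
```

With this form, the biased attention is proportional to exp(logit) times anchor raised to the power beta, so beta = 0 leaves attention unchanged and larger values pull it more strongly toward the interpolated anchor.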
Through the experiments in the main paper and in the supplementary material, we have shown that this additional guidance is effective in maintaining both semantic fidelity and pace stability. However, when the guidance is either too weak or too strong, it instead degrades these properties, along with the overall video generation quality. As shown in Tab. S2, our default mid-range setting (0.3 ≤ β_t ≤ 0.7) clearly outperforms all other ranges on PSNR, SSIM, LPIPS, FID, and FVD. Interestingly, ranges biased toward either lower (0.1 ≤ β_t ≤ 0.5) or higher (0.5 ≤ β_t ≤ 0.9) scales achieve slightly higher VBench scores, but this comes at the cost of noticeably worse distortion and distributional metrics. The very narrow range (0.5 ≤ β_t ≤ 0.5) yields the second-best FID and LPIPS among the ablated settings, yet still fails to close the gap to our default configuration. Taken together, these results suggest that while concentrating guidance at specific diffusion phases can bring marginal gains in certain perceptual aspects, distributing KAB over a moderate mid-range window is crucial for obtaining consistent improvements across both fidelity and perceptual metrics. Thus, our chosen setting with a moderate guidance range strikes a good balance, as our ablation studies show empirically.

Figure S1. User study interface used to evaluate our generative inbetweening results. For each text prompt (top right), six candidate videos (a–f) are displayed. Participants first read the evaluation criteria (left) and then rate each video on a 5-point Likert scale for three dimensions: Semantic Fidelity, Pace Stability, and Frame Consistency.

S4.2. ReTRo
ReTRo adaptively modulates RoPE scales along the temporal axis, assigning higher scales to tokens near keyframes to sharpen locality and preserve keyframe content, while using lower scales on intermediate frames to broaden attention and promote temporal consistency.
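As an illustration of such a schedule, the sketch below assigns each frame a scale that equals s_edge at the two keyframes and s_mid at the center of the sequence, then applies it when computing temporal rotary angles. The linear ramp between the two values, the function names, and the choice to multiply the temporal position by the scale are assumptions made for this example; the exact ReTRo formulation may differ.

```python
def retro_scales(num_frames, s_mid=0.94, s_edge=1.06):
    """Per-frame temporal RoPE scale: s_edge at the keyframes, s_mid at the center.

    The linear ramp in between is an assumption made for illustration.
    """
    if num_frames < 2:
        raise ValueError("need at least the two keyframes")
    last = num_frames - 1
    scales = []
    for frame in range(num_frames):
        edge_weight = abs(2.0 * frame / last - 1.0)  # 1 at either end, 0 at center
        scales.append(s_mid + (s_edge - s_mid) * edge_weight)
    return scales


def scaled_temporal_angles(num_frames, num_pairs, base=10000.0,
                           s_mid=0.94, s_edge=1.06):
    """Rotary angles for the temporal axis with the per-frame scale applied.

    Standard RoPE uses angle = position * base**(-i / num_pairs) for pair i;
    here the position of each frame is multiplied by its scale (assumption).
    """
    scales = retro_scales(num_frames, s_mid, s_edge)
    inv_freq = [base ** (-i / num_pairs) for i in range(num_pairs)]
    return [[scale * frame * w for w in inv_freq]
            for frame, scale in enumerate(scales)]


scales = retro_scales(5)
angles = scaled_temporal_angles(5, num_pairs=2)
```

Under this assumption, a scale above 1 near a keyframe increases the angular separation between neighboring frames, while a scale below 1 in the middle reduces it, matching the qualitative description of sharper locality at the edges and broader attention in between.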
In the main paper, we showed that this method is effective in improving both frame consistency and overall video generation quality. As shown in Tab. S3, we additionally conducted an ablation study on the ReTRo hyperparameters s_mid and s_edge. We found that the setting used in the main paper, (s_mid = 0.94, s_edge = 1.06), achieved the best performance among the configurations we tested. For s_edge, values around 1.10 or higher started to introduce noticeable visual artifacts, while for s_mid, smaller values tended to make the motion in the generated videos appear slightly slower. Consequently, we adopt (s_mid = 0.94, s_edge = 1.06) as our default setting, as it offers the best trade-off between visual quality and temporal coherence in our experiments. However, since these are simple hyperparameters, they can be exposed as user-adjustable parameters, allowing users to adjust the balance between sharpness, motion speed, and temporal consistency to suit their specific applications.

[Figure S2: four panels, (a) 25 frames, (b) 33 frames, (c) 65 frames, (d) 81 frames.]
Figure S2. Complete Results on Generative Inbetweening Challenge Analysis. Results for all challenges on 25-, 33-, 65-, and 81-frame sequences, including VBench, LPIPS, and X-CLIP scores for each challenge.

S4.3. Generative Inbetweening Challenge Analysis
We additionally present quantitative results for three further sequence lengths (25, 33, and 65 frames) to extend the analysis of the generative inbetweening challenges. The results are shown in Fig. S2. They further confirm that the four challenge categories in TGI-Bench are well separated in difficulty: most models perform reliably on the near-static cases, whereas performance degrades sharply on the more demanding occlusion and dynamic motion challenges.
This clear differentiation demonstrates that TGI-Bench is carefully constructed to expose distinct failure modes of GI models, enabling fine-grained diagnosis of model capabilities. Consequently, TGI-Bench provides a reliable and informative evaluation suite for future research, particularly for identifying which generative inbetweening challenges a model handles well and where it struggles.

S5. TGI-Bench Details
S5.1. Dataset Curation Details
To construct the TGI-Bench dataset, we prompted GPT-4.1 [12] to generate a text description and a challenge label for each video. The text description often includes information inferred from intermediate frames that is not visible in the provided first and last frames, and thus serves as a constraint when a model generates the missing frames. Inspired by [22], the challenge label is one of four types: dynamic motion, linear motion, occlusion, and near-static, defined as follows:
• Dynamic motion: The primary object exhibits nonlinear or complex movement, such as rotation or abrupt directional changes.
• Linear motion: The primary object moves in a linear and consistent direction.
• Occlusion: The primary object either appears or disappears in the middle of the video due to occlusion.
• Near-static: The primary object remains largely stationary with minimal motion.
We sourced videos from the DAVIS [31] dataset as well as from Pexels and Pixabay (footnote 4). Videos that were too visually complex to describe succinctly, or that lacked a clearly identifiable primary object, were excluded. For example, we removed videos in which geometric patterns change chaotically or smoke particles move randomly without a coherent subject. After this filtering step, we collected a total of 220 videos. For each video, we selected only the first F frames and discarded videos whose total frame count was less than F.
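A minimal sketch of this selection step is given below. It also includes the index sampling (every tenth frame plus the final frame) that is applied to the kept frames before they are passed to GPT-4.1. The function names and the de-duplication of the final index when F − 1 is already a multiple of 10 are assumptions for illustration, not the authors' curation script.

```python
def keep_first_frames(total_frames, num_frames):
    """Return the kept frame range, or None when the video has fewer than F frames."""
    if total_frames < num_frames:
        return None  # video is discarded for this subset
    return range(num_frames)


def reference_frame_indices(num_frames, stride=10):
    """Indices 0, stride, 2*stride, ... plus the final frame F - 1, without duplicates."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


# One index list per subset size used in TGI-Bench.
subsets = {f: reference_frame_indices(f) for f in (25, 33, 65, 81)}
```

For F = 81 the final frame (index 80) already falls on the stride, so it appears only once in this sketch.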
From these, we took frames at indices 0, 10, 20, ..., 10⌊(F − 1)/10⌋, F − 1 and provided them to GPT-4.1 along with the prompt in Sec. S5.2. The resulting text descriptions and challenge labels were manually reviewed and corrected by the authors to ensure accuracy. In particular, GPT's generic label large motion was refined into the more specific categories of dynamic motion and linear motion. This process was repeated for F ∈ {25, 33, 65, 81}, yielding four distinct subsets of the dataset.

Footnote 4: https://www.pexels.com/, https://www.pixabay.com/

S5.2. GPT Prompt
In this section, we present the detailed prompts provided to GPT-4.1. By default, we feed the model the concatenation of SYSTEM_PROMPT and USER_PROMPT_BASE. For videos where the model produced outputs that did not follow the intended format, we additionally concatenate RETRY_PROMPT to the input.

    # System prompt: task definition, caption style rules, and output schema.
    SYSTEM_PROMPT = """
    You are a caption generator for a Video Frame Interpolation (VFI) evaluation set.
    INPUT: two endpoint images - A (start) and B (end), optional reference images R_i sampled between A and B, and optional reference text (prompts.txt).
    TASKS
    1) Briefly describe A and B (visible, objective facts; <= 20 words each).
    2) Classify the challenge as exactly one of:
    - Large motion
    - Occlusion
    - Near-static
    If ambiguous, tiebreaker: Occlusion > Large motion > Near-static.
    3) Generate exactly ONE caption that best describes the plausible situation across A->B.
    - Prefer wording and nouns from the reference prompts when correct.
    - If the reference contains mistakes or conflicts with the images, FIX them in your caption.
    CAPTION STYLE (strict)
    - English only. <=12 words. One simple clause.
    - You may include direction if clearly implied by the endpoints.
    - No commas/semicolons. Avoid: and, then, while, as, because, so, therefore, hence.
    - No meta words: relative, compared, background, foreground, camera, optical flow, frame, endpoint(s).
    - No hedging or subjective words.
    - Do NOT mention A/B or frames.
    CONSISTENCY
    - Match direction/size/visibility in endpoints.
    - Use "emerges/appears/enters" only if absent at A and present at B.
    OUTPUT JSON ONLY:
    {
    "first_image_desc": "< <=20 words >",
    "last_image_desc": "< <=20 words >",
    "challenge": "Large motion | Occlusion | Near-static",
    "caption": "< <=12 words >"
    }
    """.strip()


    # User prompt: describes the order of the attached images.
    USER_PROMPT_BASE = """
    Images follow in order: A (start), zero or more reference images R_i between A and B, then B (end).
    Return JSON ONLY following the schema. English only.
    """.strip()

    # Retry prompt: appended when the previous output did not follow the format.
    RETRY_PROMPT = """
    Return VALID JSON ONLY with keys:
    first_image_desc, last_image_desc, challenge, caption.
    Choose one: Large motion | Occlusion | Near-static.
    One caption only; <=12 words; one clause; obey all style rules.
    """.strip()

S6. Limitation
[Figure S3: frames 16, 32, 49, and 65 between the first and last frames for the ground truth (GT), Wan, and Ours. Prompt: "Breakdancer spins from handstand to standing position on stage."]
Figure S3. Limitation. Because our training-free plug-in is bounded by the generative capacity of Wan, it can only partially correct the severely distorted motion and geometry seen in the breakdancing example, leaving residual shaky and unnatural motion.

Although our method is a simple, training-free plug-in that can be readily applied to existing DiT-based models, it is inherently bounded by the generative capacity of the underlying baseline, Wan [13]. In challenging cases where the base model already produces severely distorted motion or object geometry over most frames, our approach has limited ability to fully recover a plausible video. For example, as shown in Fig. S3, the breakdancing subject exhibits persistent shaky and unnatural motion across time, and these artifacts are only partially mitigated by our method.
We regard this as a natural limitation of training-free refinement methods and as a promising direction for future work on jointly improving both the base generator and the inbetweening modules.

S7. Ethical Considerations
TGI-Bench builds on the publicly available DAVIS video dataset [31] and on the open video websites Pexels and Pixabay, which permit research use. We do not collect new data of human subjects, nor do we attempt to infer or annotate sensitive attributes (e.g., identity, race, health, or political views). Any videos containing people are used only for generic motion and scene understanding, and are treated as anonymous visual content.
We conducted a small-scale human evaluation with more than 20 participants to compare perceptual quality and consistency, under strict ethical considerations. The study followed a double-blind setup, where participants were unaware of the underlying methods being compared, and experimenters did not have access to any identifying information about individual participants. All participants volunteered to take part in the study and were informed that the evaluation was conducted solely for academic research purposes. No personal information was collected beyond basic platform metadata, and responses were analyzed only in aggregate. No offensive, violent, or explicit prompts were used in any of our experiments.
Generative inbetweening can, in principle, be misused for deceptive or non-consensual content (e.g., manipulated videos). We explicitly prohibit such uses. Our method is presented for research purposes, and any future release of code, models, or benchmarks should be accompanied by clear usage guidelines and restrictions, encouraging applications such as animation, content restoration, and creative tools while discouraging privacy-invasive or harmful deployments.

[Figure S4, top: frames 13, 26, 39, and 52 between the first and last frames for TRF, ViBiDSampler, GI, FCVG, Wan, and Ours. Prompt: "Horse swings rotate counterclockwise around the blue pole."]
First frameLast frame Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Glider floats over fields with man holding control stick. First frameLast frame Figure S4. Qualitative Results. In both examples, our method generates consistent frames compared to Wan which shows artifacts or suddenly dimmed scenes. The first four models fail to maintain the object shape for the intermediate frames. 18 Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Dancers move inward forming a circle during outdoor performance. First frameLast frame Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Roller coaster car descends the orange loop track. First frameLast frame Figure S5. Qualitative Results. (Top) Other models, unlike ours, either show blurred objects with inconsistent frames or unnatural motion like Wan in frame 39. Our method shows high semantic fidelity as well as frame consistency through all frames. (Bottom) For the first four models, the structure of the rollercoaster collapses, failing to maintain the shape and style of the keyframes. Our model shows pace stability while maintaining the frame consistency. 19 Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 People walk down a sandy trail toward the ocean at sunset. First frameLast frame Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Man on hoverboard kart moves away and becomes hidden behind booth. First frameLast frame Figure S6. Qualitative Results. Examples showing that our method performs well in highly complex scenes with multiple people and objects, preserving fine details and producing fewer blurred scenes than baseline methods. 20 Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Coffee beans spill from glass onto newspaper. First frameLast frame Ours Wan FCVG GI ViBiDSampler TRF frame 13frame 26frame 39frame 52 Man stares at the camera and then raises his hand. First frameLast frame Figure S7. 
Qualitative Results. (Top) The first four baselines show an unstable coffee-spilling pace and temporal inconsistency. Even compared to Wan, our method generates more stably paced and temporally coherent motion. (Bottom) The first four baselines suffer from blur that distorts the human shape. Compared to Wan, our method maintains a more stable pace and generates more natural motions.

[Figure S8 images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 13, 26, 39, 52, last frame. Prompts: (top) "Man pulls lat pulldown bar from overhead to chest."; (bottom) "Two deer butt heads."]

Figure S8. Qualitative Results. (Top) The first four baselines produce overly blurred frames in which the human shape is not preserved. Even compared to Wan, our method exhibits a more stable motion pace for the man performing the lat pulldown. (Bottom) While other methods contain several blurred and inconsistent frames, our method generates clearer and more temporally consistent videos. For visualization purposes, we uniformly increased the brightness of both examples by 40%, while leaving all other properties unchanged.

[Figure S9 images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 16, 32, 49, 65, last frame. Prompts: (top) "Bird is dancing, flapping its wings and shaking its tail feathers."; (bottom) "BMX rider ascends and turns sharply on the ramp."]

Figure S9. Qualitative Results. (Top) The first four models show minimal wing flapping and motion, while our method and Wan show movement. However, Wan fails to maintain frame consistency and semantic fidelity. (Bottom) For Wan, the person on the bicycle goes left in the first few frames but suddenly turns from the right. In contrast, our method shows a stable pace and consistent movement.

[Figure S10 (top) images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 16, 32, 49, 65, last frame. Prompt: "Dog runs through shallow water toward the viewer."]
[Figure S10 (bottom) images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 16, 32, 49, 65, last frame. Prompt: "The goat walks slowly to the left, sniffing the scent of grass."]

Figure S10. Qualitative Results. The first four models, unlike Wan and ours, fail to maintain the shape of the object as well as the background style through the long frame sequences, showing the importance of correct text prompts in generative inbetweening.

[Figure S11 images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 16, 32, 49, 65, last frame. Prompts: (top) "Egg is cracked into the mixing bowl."; (bottom) "Brown duck walks left away from the water."]

Figure S11. Qualitative Results. (Top) The first three models fail to maintain the shape of the hand and egg, while FCVG shows unnatural movement around frame 49 compared to the following two models, Wan and ours. (Bottom) For Wan, an artifact can be observed in frame 49, unlike our method.

[Figure S12 images. Methods: TRF, ViBiDSampler, GI, FCVG, Wan, Ours. Frames shown: first frame, 16, 32, 49, 65, last frame. Prompts: (top) "Makeup brush moves upward across eyelid."; (bottom) "Koi fish swim leftward through the pond."]

Figure S12. Qualitative Results. (Top) While Wan maintains the subject through the long sequence, it does not follow the prompt, especially around frame 32. In contrast, our method faithfully follows the text, showing semantic fidelity. (Bottom) Around frames 49-65, Wan shows a blurred scene without any context. This problem does not appear with our method.