Paper deep dive
Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang
Summary
Mamba-VMR is a two-stage framework for Video Moment Retrieval (VMR) that addresses the limitations of text-driven and static image-augmented methods in capturing temporal dynamics. It uses an LLM-guided subtitle matching and query decomposition process to generate auxiliary short videos via text-to-video diffusion (CogVideoX), which serve as temporal priors. These priors are integrated into a multi-modal controlled Mamba network that employs video-guided gating to efficiently fuse information and filter noise in long sequences, achieving state-of-the-art performance on the TVR benchmark.
Relation Signals (4)
Mamba-VMR → evaluatedon → TVR
confidence 100% · Experimental evaluations on the TVR benchmark demonstrate significant improvements
Mamba-VMR → incorporates → Mamba
confidence 100% · augmented queries are processed through a multi-modal controlled Mamba network
Mamba-VMR → uses → CogVideoX
confidence 95% · We employ CogVideoX [48]... to generate auxiliary short videos
Mamba-VMR → uses → Llama-3.1
confidence 95% · We adopt LLaMA-3.1 [10]... to match relevant subtitles
Cypher Suggestions (2)
Find all models used by the Mamba-VMR framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'Mamba-VMR'})-[:USES]->(m:Model) RETURN m.name
Identify datasets used for evaluation · confidence 90% · unvalidated
MATCH (f:Framework {name: 'Mamba-VMR'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Abstract
Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
Links
- Source: https://arxiv.org/abs/2603.22121v1
- Canonical: https://arxiv.org/abs/2603.22121v1
Full Text
Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Yunzhuo Sun1 Xinyue Liu1 Yanyang Li1 Nanding Wu1 Yifang Xu2 Linlin Zong1 Xianchao Zhang1 Wenxin Liang1
1Dalian University of Technology 2Fudan University
sunyunzhuo@mail.dlut.edu.cn wxliang@dlut.edu.cn
Code & Model: https://github.com/YunzhuoSun/Manba-VMR
Abstract
Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, extending text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
1 Introduction
Figure 1: Illustration of VMR challenges and our approach. (a) Text-driven methods often mislocalize multi-verb queries due to ambiguous temporal cues.
(b) Static image augmentation (e.g., ICQ [52]) improves semantic expressiveness but overlooks motion sequences, leading to errors in dynamic events. (c) Our temporal video-driven method accurately captures implicit motion, enabling precise grounding.
Video Moment Retrieval (VMR) is pivotal for multimedia applications, enabling users to localize specific temporal segments in untrimmed videos that align semantically with queries [30]. With the surge in video content, efficient VMR can support tasks such as search, recommendation, and summarization [32, 51]. However, text-driven VMR faces significant challenges in capturing hidden temporal dynamics, particularly for multi-verb queries involving sequential actions. For instance, as shown in Figure 1(a), the query "Adams walks into the room and hands Park a coffee" requires understanding the progression of "walking" followed by "handing," but pure text often leads to imprecise grounding due to ambiguous ordering [4, 34]. Existing methods relying on Natural Language Queries (NLQs) struggle with such complexities, as noted in surveys on temporal sentence grounding [20, 53], where multi-action sequences like "a person approaches the wall, hangs a clock, and steps back" [28] highlight the need for explicit temporal cues beyond verbal descriptions. Recent works have augmented queries with multimodal elements to mitigate these issues. ICQ [52] incorporates static images (e.g., scribbles, cartoons, or realistic depictions generated via DALL-E), enhancing expressiveness for abstract or unfamiliar concepts [54, 27]. While effective for static scene matching, static images fail to convey dynamic motion, as illustrated in Figure 1(b), where the generated images ignore the temporal flow of actions such as the sequence of entering the room, approaching Park, and extending the coffee, which results in localization errors.
Other studies, such as those on composed retrieval [32, 51], emphasize that multi-verb queries exacerbate this situation, with models overlooking inter-action dependencies without motion priors. To overcome these limitations, we propose a framework that generates auxiliary short videos from queries, infused with subtitle contexts to capture rich temporal priors that static images lack [48]. Subtitles in datasets like TVR [22] provide fine-grained dialogue cues, complementing visual dynamics and enabling precise fusion. However, integrating generated videos elongates inputs, straining Transformer-based architectures with quadratic complexity. Inspired by efficient long-sequence modeling, we adopt Mamba [11], a State Space Model (SSM) with linear-time scaling, surpassing Transformers in video tasks [7, 23]. Recent adaptations like SpikeMba [26] and VideoMamba [23] demonstrate Mamba’s efficacy in temporal grounding by selectively propagating states, reducing overhead while modeling dynamics. Our multi-modal controlled Mamba extends this, fusing text-controlled selection with video-guided gating for efficient prior integration in long sequences. Our proposed framework contains two stages. In the first stage, LLM-guided subtitle matching identifies cues, fused with queries to generate short videos via text-to-video models (e.g., CogVideoX [48]), capturing implicit motion as priors. In the second stage, augmented queries are processed through our Mamba network, enabling noise filtering and enhanced recall. Experimental evaluations on TVR yield state-of-the-art results, outperforming SgLFT [6] and ICQ [52] with reduced computational costs and superior handling of multi-verb dynamics. Our main contributions include: (1) A novel subtitle-enhanced video generation for temporal priors; (2) A multi-modal Mamba for efficient long-sequence fusion; (3) Superior performance on TVR, advancing multimodal VMR toward practical scalability. 
2 Related Work
2.1 Video Moment Retrieval
Video Moment Retrieval (VMR) aims to localize temporal segments in untrimmed videos that semantically match user queries [12, 19]. Early methods focused on text queries, with datasets like TVR [22] introducing multimodal challenges by incorporating subtitles alongside videos. TVR [22] contains 108K queries on 21.8K videos, emphasizing the need for cross-modal fusion. Proposal-based approaches like MCN [13] and CAL [8] rank predefined segments but suffer from dependency on heuristics. Proposal-free methods, such as XML [9] with convolutional detectors and CONQUER [17] for contextual ranking, improve efficiency. SgLFT [6] uses semantic-guided Transformers for subtitle-enhanced retrieval. Recent multimodal extensions, like ICQ [52], augment queries with static images (e.g., scribbles or cartoons) to handle ambiguous concepts, but overlook dynamic motion in multi-action queries [29]. Our work advances this by generating temporal video priors, capturing sequential dynamics absent in static augmentations.
2.2 Text-to-Video Diffusion Models
Text-to-video diffusion models generate dynamic clips from textual prompts, enabling temporal priors for VMR [1, 14]. Stable Video Diffusion [2] extends image diffusion to videos, producing short clips with motion consistency. CogVideoX [48] enhances this with expert Transformers for high-fidelity 10s videos at 720x480 resolution. These models, like VideoCrafter [5], support narrative prompts fusing queries and subtitles, but are underutilized in retrieval tasks [31]. We leverage CogVideoX to create auxiliary videos, providing motion cues that static images [52] cannot, for better temporal grounding.
2.3 Mamba in Sequence Modeling
Mamba [11] introduces selective State Space Models (SSMs) for linear-time sequence processing, outperforming Transformers in long-range dependencies [7].
In video tasks, VideoMamba [23] applies bidirectional SSMs for efficient understanding, while Motion Mamba [36] uses text-controlled selection for human motion grounding. SpikeMba [26] integrates spiking neural networks with Mamba for saliency detection in temporal grounding. These works highlight Mamba's advantages in filtering noise and handling extended inputs [55]. We extend Motion Mamba [36] with multi-modal gating to fuse generated video priors, enabling scalable VMR on long TVR sequences.
Figure 2: The overall architecture of our proposed framework. It first utilizes LLaMA-3.1 for subtitle selection and query decomposition by verbs (Sec. 3.2). Next, a Temporal-Prior Generator produces a short video clip from the fused query and subtitles (Sec. 3.3). The augmented inputs are then processed by the Mamba network for contextualized features (Sec. 3.4). Finally, frame activation scores are computed, thresholded with NMS, and refined via MLP to output the moment boundaries.
3 Method
In this section, we first define the video moment retrieval task and provide an overview of our proposed framework in Sec. 3.1. We then detail its key components in Sec. 3.2–3.5.
3.1 Overview
Given an untrimmed target video $V_o=\{v_i\}_{i=1}^{L_o}$ with $L_o$ clips and associated subtitles $S=\{s_j\}_{j=1}^{N_s}$, along with a textual query $Q=\{q_k\}_{k=1}^{L_q}$ comprising $L_q$ words, the goal of Video Moment Retrieval (VMR) is to predict all temporal spans $T=\{(t^s,t^e)\}\in\mathbb{R}^{N_t\times 2}$ semantically relevant to $Q$: $T=\mathrm{VMR}(V_o,S,Q)$, where $t^s$ and $t^e$ denote the start and end timestamps. Fig. 2 illustrates our proposed framework. Concretely, we first apply LLaMA-3.1 [10] to match relevant subtitles from $S$ with $Q$, generating a refined set $S'\subseteq S$. These are fused with $Q$ to produce a composite prompt for text-to-video diffusion (e.g., CogVideoX [48]), yielding a short auxiliary video $V_g\in\mathbb{R}^{L_g\times d}$ as temporal priors, where $L_g\ll L_o$.
The augmented query—comprising $Q$, $V_g$, and $V_o$—is then embedded and processed through our multi-modal controlled Mamba network, which extends text-controlled selection with video-guided gating to fuse priors and filter noise in long sequences. Contextualized features $f_o\in\mathbb{R}^{L_o\times d}$ are fed to a linear head for start/end logits, refined via NMS to output the final spans $T\in\mathbb{R}^{N_t\times 2}$.
3.2 LLM-Guided Subtitle Matching and Query Processing
Figure 3: Example of LLM-guided query decomposition. The original query is segmented into verb-centered sub-events with inferred additional context to enrich temporal details.
In video moment retrieval (VMR), natural language queries (NLQs) often provide a high-level, ambiguous description of the target event, lacking fine-grained details for precise temporal grounding in untrimmed videos. Subtitles, containing dialogue and contextual cues, offer complementary, granular information to enhance query representations. To leverage this, we employ a large language model (LLM) to process the query and match relevant subtitles, generating structured priors for downstream video augmentation. We adopt LLaMA-3.1 [10], a state-of-the-art open-source LLM released in July 2024, known for its efficiency in instruction-following and text processing. The process begins by decomposing the query into action-oriented components. Specifically, we prompt the LLM to extract verbs as semantic anchors and segment the query $q$ into sub-events, inferring intermediate actions to enrich temporal sequencing.
For instance, given $q$ = "walks into the room and hands a coffee", the LLM identifies verbs like walks and hands, producing sub-queries $q_1$ = "walks into the room after opening the door", $q_2$ = "approaches Park, holding a cup of coffee", and $q_3$ = "reaches out and hands the coffee". As illustrated in Figure 3, this decomposition supplements implicit steps (e.g., "opening the door" and "approaching"), providing richer temporal context while preserving the original meaning. Formally, let the tokenized query be $q=[w_1, w_2, \dots, w_m]$. The LLM extracts a verb set $V=\{v_1, v_2, \dots, v_k\}$ and segments $q$ into $k$ phrases centered around each $v_j$. This is guided by the prompt: "Decompose the query '[q]' into sub-events by verbs, inferring intermediate actions while keeping the core meaning intact. Output as a list of phrases." For each subtitle sentence $s_j$ in the subtitle set $S=\{s_1, s_2, \dots, s_n\}$, we evaluate its relevance to each sub-query $q_i$. Subtitles are processed sentence-by-sentence to maintain contextual integrity. The LLM calculates a relevance score $r_j$ for each subtitle $s_j$ against each $q_i$ using the prompt: "Assess if subtitle '[s_j]' relates to query sub-event '[q_i]'. Output a score from 0 (irrelevant) to 1 (highly relevant) and a brief reason." The aggregated relevance score for subtitle $s_j$ is computed as: $r_j = \max_i \sigma(\mathrm{LLM}(q_i, s_j))$, where $\sigma(\cdot)$ normalizes the LLM output to the range $[0,1]$. We then select the top-$k$ subtitles with $r_j > \theta$, forming a refined subtitle subset $S'\subseteq S$. This matching process bridges the abstract overview in the query with fine-grained linguistic details from the subtitles (e.g., speaker-specific dialogues), and is used to guide video generation in a temporally grounded manner.
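The matching step described in this section (score every subtitle against each sub-query, aggregate with a max, then threshold and keep the top-k) can be sketched as follows. `score_fn` stands in for the LLM relevance call of Sec. 3.2, and all names are illustrative rather than taken from the paper's code:

```python
def select_subtitles(sub_queries, subtitles, score_fn, theta=0.5, top_k=3):
    """Return the top-k subtitles whose aggregated relevance exceeds theta.

    score_fn(q, s) -> float in [0, 1] stands in for sigma(LLM(q_i, s_j));
    any scorer with that signature can be dropped in.
    """
    scored = []
    for s in subtitles:
        # r_j = max_i sigma(LLM(q_i, s_j)): a subtitle is kept if it is
        # relevant to at least one sub-event of the decomposed query.
        r = max(score_fn(q, s) for q in sub_queries)
        scored.append((r, s))
    kept = [(r, s) for r, s in scored if r > theta]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:top_k]]
```

With an actual LLM backend, `score_fn` would issue the relevance prompt above and parse the returned score; a lightweight scorer such as embedding cosine similarity fits the same interface.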
3.3 Temporal Prior Generation via Video Diffusion
To capture hidden temporal dynamics absent in traditional text queries or static image augmentations, we generate auxiliary short videos as temporal priors. These videos are synthesized from the query fused with matched subtitles, providing motion-rich enhancements that bridge coarse query descriptions with fine-grained dialogue cues in subtitles. We employ CogVideoX [48], a state-of-the-art open-source text-to-video diffusion model released by Zhipu AI in 2024. CogVideoX is capable of generating 6-second videos at 720×480 resolution and 8 frames per second, and supports flexible text prompts via an accessible GitHub interface—making it readily integrable into our pipeline. The generation process begins by constructing a composite prompt $p$ that integrates the query $q$ with the refined subtitle set $S'$ (as described in Sec. 3.2). Subtitles provide granular details (e.g., character dialogues) to refine the query's ambiguous overview. We prompt an LLM (LLaMA-3.1) to perform fusion via: "Combine query '[q]' with subtitles [S'] into a coherent narrative for video generation, emphasizing motion sequences." Formally, we define the prompt as: $p = q \oplus \mathrm{LLM}(\{s\}_{s\in S'})$, where $\oplus$ denotes concatenation with transitional phrases (e.g., "as described in dialogue:") to ensure narrative flow, and $\mathrm{LLM}(\cdot)$ represents the fusion of subtitle sentences into a coherent sequence. The diffusion model then samples a video: $v_g \sim \mathcal{D}(p, \theta)$, where $\mathcal{D}$ is CogVideoX parameterized by $\theta$, producing a short clip that captures implicit motions such as "walking and handing" as a dynamic sequence.
Figure 4: Architecture of the Multi-Modal Controlled Mamba Network. Inputs include target video embedding $e_o$, text query embedding $e_q$, and generated video embedding $e_g$. Relational embeddings $r_o$ are added via GCN and normalization.
The bidirectional SSM processes the sequence with video-guided gating (fusing pooled $e_g$ and $e_q$) to produce contextualized features $f_o$ for moment prediction.
3.4 Multi-Modal Controlled Mamba Network
Since we incorporate a short video as part of the input, employing the traditional Transformer architecture would reduce overall efficiency, as it requires computing self-attention across the entire sequence in the time dimension. Inspired by the Mamba architecture [11] for long-sequence temporal grounding, we extend it to handle multi-modal inputs efficiently, as illustrated in Figure 4. Additionally, we design a video-guided gating mechanism that fuses generated priors with target sequences, enabling linear-time processing $O(L_o)$ while filtering irrelevant temporal noise and avoiding the Transformer's quadratic complexity. We process the augmented query—comprising the original text query $q\in\mathbb{R}^d$, the generated video $v_g\in\mathbb{R}^{L_g\times d}$, and the target video $v_o\in\mathbb{R}^{L_o\times d}$ (where $L_g\ll L_o$ and $d$ is the embedding dimension)—through a custom Mamba-based network for precise moment retrieval. Here, $L_g$ represents the short generated clip, while $L_o$ denotes the long untrimmed target sequence. First, we embed the inputs using pre-trained encoders for modality alignment. The text query $q$ is embedded via a CLIP text encoder, yielding $e_q\in\mathbb{R}^d$. The generated video $v_g$ and target video $v_o$ are embedded using a CLIP video encoder, producing $e_g\in\mathbb{R}^{L_g\times d}$ and $e_o\in\mathbb{R}^{L_o\times d}$, respectively. To incorporate relational structure, we derive graph-based relational embeddings $r_o\in\mathbb{R}^{L_o\times d}$ via a Graph Convolutional Network (GCN) on normalized frame features and add them to $e_o$, forming the input sequence $x = e_o + r_o \in\mathbb{R}^{L_o\times d}$. The core is our multi-modal controlled Mamba, which adapts the bidirectional State Space Model (SSM) with forward and backward passes for comprehensive context.
The SSM evolves states via: $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$, where $A$, $B$, $C$ are state transition matrices, discretized for efficiency. We extend the text-controlled selection (originally conditioning $A$ on $e_q$) with video-guided gating: a gate $g_t = \sigma(W_g [e_q; \mathrm{pooled}(e_g)]_t)$ dynamically modulates transitions, where $W_g\in\mathbb{R}^{d\times 2d}$ is learnable, $\sigma$ is the sigmoid function, $\mathrm{pooled}(e_g)$ reduces $e_g$ via mean pooling, and $[\cdot\,;\cdot]$ denotes concatenation. This fuses generated priors $e_g$ to guide focus on motion-aligned segments in $x$, yielding filtered states: $h'_t = g_t \odot (A h_{t-1} + B x_t)$. Stacked bidirectional SSM layers (incorporating activations like SiLU and 1D convolutions for local patterns) with relational embeddings process $x$, producing contextualized features $f_o\in\mathbb{R}^{L_o\times d}$. Final predictions are obtained via a linear head: start/end logits $p_s, p_e\in\mathbb{R}^{L_o}$, thresholded and refined with non-maximum suppression (NMS) to output timestamps. During training (Sec. 3.5), we supervise with cross-entropy on ground-truth boundaries.
3.5 Loss Functions for Multi-Modal Controlled Mamba
To optimize the multi-modal controlled Mamba network, we employ a combination of losses tailored to VMR on the TVR dataset, focusing on accurate moment boundary prediction and relevance scoring. Let $p_s, p_e\in\mathbb{R}^{L_o}$ denote the predicted start and end logits for the target video sequence of length $L_o$, with ground-truth boundaries $\hat{s}, \hat{e}\in\{1,\dots,L_o\}$. We also compute clip-wise relevance scores $r\in[0,1]^{L_o}$ as: $r_t = \sigma(W_r f_{o,t})$, where $f_o\in\mathbb{R}^{L_o\times d}$ are the contextualized features from Mamba, $W_r\in\mathbb{R}^{1\times d}$ is a learnable projection, and $\sigma$ is the sigmoid function. Ground-truth relevance labels $r^*\in\{0,1\}^{L_o}$ mark clips within $[\hat{s},\hat{e}]$ as positive.
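The gated recurrence of Sec. 3.4, $h_t = g_t \odot (A h_{t-1} + B x_t)$ with $y_t = C h_t$, can be sketched as a sequential NumPy scan. This is a toy single-direction version with fixed matrices; the paper's network is bidirectional, discretized, and learned end-to-end, so the shapes and per-step loop here are purely illustrative:

```python
import numpy as np

def gated_ssm_scan(x, A, B, C, gates):
    """Sequential scan of a gated linear state space model.

    x:     (L, d_in)          input sequence
    A:     (d_state, d_state) state transition matrix
    B:     (d_state, d_in)    input projection
    C:     (d_out, d_state)   output projection
    gates: (L, d_state)       per-step gates g_t in [0, 1]
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        # Video-guided gating: g_t suppresses state updates on
        # steps that do not align with the generated prior.
        h = gates[t] * (A @ h + B @ x[t])
        ys.append(C @ h)
    return np.stack(ys)
```

With gates of all ones this reduces to a plain linear SSM scan; with gates of zero the state, and hence the output, is fully suppressed.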
The primary loss is a binary cross-entropy (BCE) for boundary classification, applied separately to start and end logits: $\mathcal{L}_{bound} = \mathcal{L}_{BCE}(p_s, \delta_s) + \mathcal{L}_{BCE}(p_e, \delta_e)$, where $\delta_s, \delta_e\in\{0,1\}^{L_o}$ are one-hot indicators for $\hat{s}$ and $\hat{e}$, and $\mathcal{L}_{BCE}$ is the standard BCE loss. To encourage precise relevance prediction, we add a BCE term on clip scores: $\mathcal{L}_{rel} = -\frac{1}{L_o}\sum_{t=1}^{L_o}\left[r^*_t \log r_t + (1-r^*_t)\log(1-r_t)\right]$. Finally, to regularize the fusion of generated priors, we include a contrastive loss $\mathcal{L}_{cont}$ that maximizes similarity between $e_g$ (generated video embeddings) and positive clips in $e_o$ while minimizing it with negatives, using InfoNCE [35]: $\mathcal{L}_{cont} = -\log\frac{\exp(\mathrm{sim}(e_g, e_o^+)/\tau)}{\exp(\mathrm{sim}(e_g, e_o^+)/\tau) + \sum\exp(\mathrm{sim}(e_g, e_o^-)/\tau)}$, where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, $e_o^+$ are embeddings from ground-truth moments, $e_o^-$ are negatives (e.g., random clips), and $\tau=0.07$ is the temperature. The total loss is: $\mathcal{L} = \lambda_1\mathcal{L}_{bound} + \lambda_2\mathcal{L}_{rel} + \lambda_3\mathcal{L}_{cont}$, with weights $\lambda_1=1$, $\lambda_2=0.5$, $\lambda_3=0.1$ tuned on validation data.
4 Experiments
Method | TVR (0.5/r1, 0.5/r10, 0.5/r100, 0.7/r1, 0.7/r10) | ActivityNet-Captions (0.5/r1, 0.5/r10, 0.5/r100, 0.7/r1, 0.7/r10)
HERO (Li et al. 2020) | 33.86, 58.69, 78.36, 10.15, 34.00 | 23.97, 37.86, 58.18, 10.66, 23.59
CONQUER (Hou, Ngo, and Chan 2021) | 39.02, 67.33, 82.50, 20.89, 47.22 | 26.32, 61.25, 70.79, 13.24, 40.83
PREM (Hou et al. 2024) | 43.77, 74.50, 90.18, 24.68, 59.32 | 30.55, 68.27, 79.67, 17.01, 44.20
SgLFT (Chen et al. 2024b) | 42.51, 72.41, 85.80, 21.03, 54.62 | 31.28, 70.13, 78.85, 16.68, 43.27
ICQ (Zhang et al. 2025) | 44.13, 75.27, 89.12, 24.08, 59.23 | 31.45, 70.88, 81.20, 17.93, 44.31
Ours | 45.20, 76.09, 91.44, 25.10, 60.87 | 31.61, 71.09, 83.59, 18.25, 44.78
Table 1: Quantitative results on TVR and ActivityNet-Captions. More details can be found in the appendix due to space limits.
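For concreteness, the three-term objective of Sec. 3.5 can be sketched in NumPy as below. Probabilities and cosine similarities are assumed precomputed, a real implementation would operate on autograd tensors with batched negatives, and the helper and argument names are illustrative:

```python
import numpy as np

def total_loss(p_s, p_e, s_idx, e_idx, r_pred, r_true,
               sim_pos, sim_negs, tau=0.07, lam=(1.0, 0.5, 0.1)):
    """Weighted sum of boundary BCE, relevance BCE, and InfoNCE terms.

    p_s, p_e:       per-clip start/end probabilities, shape (L_o,)
    s_idx, e_idx:   ground-truth start/end clip indices
    r_pred, r_true: predicted / ground-truth clip relevance, shape (L_o,)
    sim_pos, sim_negs: cosine similarities for the contrastive term
    """
    eps = 1e-8

    def bce(p, y):
        p = np.clip(p, eps, 1 - eps)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # One-hot boundary targets delta_s, delta_e.
    d_s = np.zeros_like(p_s); d_s[s_idx] = 1.0
    d_e = np.zeros_like(p_e); d_e[e_idx] = 1.0
    l_bound = bce(p_s, d_s) + bce(p_e, d_e)
    l_rel = bce(r_pred, r_true)
    # InfoNCE: positive similarity normalized against positive + negatives.
    logits = np.exp(np.concatenate(([sim_pos], sim_negs)) / tau)
    l_cont = float(-np.log(logits[0] / logits.sum()))
    return lam[0] * l_bound + lam[1] * l_rel + lam[2] * l_cont
```

The defaults mirror the paper's reported hyperparameters (tau = 0.07, lambda weights 1.0/0.5/0.1).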
Method | 0.5/r1 | 0.5/r10 | 0.7/r1 | 0.7/r10
Ours (Full) | 45.20 | 76.09 | 25.10 | 60.87
w/o LLM Query Proc. | 40.15 | 70.23 | 21.45 | 55.12
w/o Temp. Prior Gen. | 38.76 | 68.94 | 20.08 | 53.67
w/o Video Gating | 41.23 | 72.45 | 22.34 | 57.89
Table 2: Main ablation studies on TVR dataset.
Variant | R1/0.5 | R10/0.5 | R1/0.7 | R10/0.7
Full LLM Module | 45.20 | 76.09 | 25.10 | 60.87
w/o Query Decomp. | 42.67 | 73.45 | 23.12 | 58.34
w/o Subtitle Match. | 41.89 | 71.23 | 22.56 | 56.78
Table 3: Fine-grained ablation on LLM-Guided Module.
4.1 Datasets and Metrics
Dataset. We conduct experiments on the TVR dataset [22], a multimodal benchmark with 21,793 untrimmed videos from 6 TV shows and 108,965 queries annotated with temporal boundaries, spanning video-only (≈ 91%), subtitle-only (≈ 28%), or both modalities. It splits into train (80%, ≈ 87k queries), val (10%, ≈ 11k), and test (10%, ≈ 11k) sets without video overlap; we evaluate on val as test is hidden. We also use ActivityNet Captions [19], with ≈ 20k untrimmed videos (avg. 120s) and 100k annotated sentence queries, split into train (10k videos), val (5k), and test (5k). Lacking subtitles, we generate videos solely from queries for temporal priors.
Metrics. Following standard VMR evaluation, we report Recall@K (R@K) at Intersection-over-Union (IoU) thresholds of 0.5 and 0.7, for $K\in\{1, 5, 10, 100\}$. R@K measures the percentage of queries for which the ground-truth video-moment pair ranks in the top-K retrieved results with IoU ≥ threshold. We also compute SumR as the aggregate of R@1, R@5, R@10, and R@100 at each IoU, providing an overall performance indicator. Higher values indicate better retrieval precision and recall, with emphasis on strict IoU=0.7 for temporal accuracy.
4.2 Implementation details
Our framework is implemented in PyTorch 2.0 and trained on 4 NVIDIA RTX 4090 GPUs. We use CLIP (ViT-B/32) as the pre-trained encoder for text queries and videos, producing 512-dimensional embeddings. For subtitle matching (Sec.
3.2), we employ LLaMA-3.1 (8B parameters) with a relevance threshold $\theta=0.5$ and top-$k=3$ subtitles. Video generation (Sec. 3.3) utilizes CogVideoX with prompts fusing queries and subtitles, producing 6-second clips at 8 FPS. Ablation studies on $\theta$, top-$k$ subtitles, and generated video length are detailed in the appendix. In the Mamba network (Sec. 3.4), we stack 4 bidirectional SSM layers with hidden dimension $d=512$, state size $N=16$, and gate temperature $\tau=0.07$. Training uses the AdamW optimizer with a learning rate of 1e-4, batch size 32, and 20 epochs. We apply early stopping based on validation SumR@IoU=0.7. Loss weights are set to $\lambda_1=1.0$ (boundaries), $\lambda_2=0.5$ (relevance), and $\lambda_3=0.1$ (contrastive). At inference time, generated videos are pre-computed offline to ensure efficiency. We employ beam search with width 5 to refine the top predictions, achieving an average latency of 1.2 seconds per query-video pair.
Variant | R1/0.5 | R10/0.5 | R1/0.7 | R10/0.7
Full (CogVideoX) | 45.20 | 76.09 | 25.10 | 60.87
w/ Static (DALL-E) | 39.45 | 69.78 | 21.34 | 54.56
w/ Stable Vid. Diff. | 40.12 | 71.56 | 22.89 | 56.23
w/o Any Priors | 36.78 | 65.34 | 18.90 | 50.12
Table 4: Ablation on Temporal Prior Generation.
Variant | R1/0.5 | R10/0.5 | R1/0.7 | R10/0.7
Full-Gate | 45.20 | 76.09 | 25.10 | 60.87
Uni-SSM | 39.56 | 70.12 | 21.78 | 55.34
Transformer | 37.89 | 67.12 | 19.56 | 52.34
Table 5: Ablation on Mamba Components.
4.3 Comparison with the State-of-the-Arts
We compare our method with state-of-the-art approaches on the TVR and ActivityNet datasets for video moment retrieval, as shown in Table 1, using Recall@K metrics at IoU thresholds of 0.5 and 0.7.
Our approach consistently outperforms all baselines, including hierarchical pre-training methods like HERO [25], contextual ranking models such as CONQUER [18], partial relevance enhancers like PREM [16], semantic fusion Transformers as in SgLFT [6], and multimodal query benchmarks like ICQ [52]. We achieve at least +1.07% over the strongest SOTA (ICQ) on TVR R@1 (IoU=0.5) and +0.16% on ActivityNet, with more pronounced gains at stricter thresholds (e.g., +1.02% on TVR R@1 at IoU=0.7) and higher recalls (e.g., +2.39% on ActivityNet R@100 at IoU=0.5). These improvements, averaging 4.3% on TVR and 3.1% on ActivityNet, stem from our LLM-guided query decomposition for handling ambiguity, generated video priors for temporal enrichment, and efficient multi-modal Mamba gating for noise reduction in long videos; together, these enable superior precision and scalability compared to prior methods reliant on less dynamic fusion or quadratic-complexity architectures. Particularly on TVR, which features intricate narratives from TV episodes prone to query vagueness, our method's inference of intermediate actions via LLMs results in substantial gains (e.g., +11.34% over HERO on R@1 at IoU=0.5), underscoring its robustness in noisy, untrimmed video corpora.
4.4 Ablation Studies
To validate the effectiveness of each component in our framework, we conduct comprehensive ablation studies on the TVR dataset. We evaluate variants by removing or modifying key modules: LLM-Guided Subtitle Matching and Query Processing (Sec. 3.2), Temporal Prior Generation via Video Diffusion (Sec. 3.3), the Multi-Modal Controlled Mamba Network (specifically, the video-guided gating mechanism), and the loss functions (Sec. 3.5). All experiments use the same hyperparameters and training setup as the main model, with performance reported using standard VMR metrics: Recall@K (R@K) under IoU thresholds of 0.5 and 0.7.
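The Recall@K-at-IoU metric used throughout these experiments can be sketched as follows; the function name and input layout are illustrative, with `predictions[i]` a score-ordered list of candidate spans for query i:

```python
def recall_at_k(predictions, ground_truths, k=1, iou_thr=0.5):
    """Fraction of queries whose ground-truth moment is hit (IoU >= thr)
    by at least one of the top-k ranked predicted (start, end) spans."""
    def iou(a, b):
        # Temporal IoU between two 1-D intervals.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    hits = sum(
        any(iou(p, gt) >= iou_thr for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

SumR, as defined in Sec. 4.1, then aggregates this quantity over K in {1, 5, 10, 100} at each IoU threshold.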
Variant | R1/0.5 | R10/0.5 | R1/0.7 | R10/0.7
Full Loss | 45.20 | 76.09 | 25.10 | 60.87
w/o $\mathcal{L}_{cont}$ | 41.08 | 71.23 | 21.56 | 56.78
w/o $\mathcal{L}_{rel}$ | 41.75 | 72.56 | 22.34 | 57.12
w/o $\mathcal{L}_{bound}$ | 35.67 | 64.89 | 17.90 | 49.56
Table 6: Ablation on Loss Functions.
4.4.1 Ablation on LLM-Guided Subtitle Matching and Query Processing
We first ablate the LLM-guided module (Sec. 3.2), which decomposes queries into sub-events and matches subtitles for bias reduction and fine-grained enhancement. Without this module, we directly use raw queries without subtitle fusion, leading to increased language ambiguity and poorer alignment with video content. As shown in Table 2, removing this module causes a drop of 5.05% in R1@0.5 and 3.65% in R1@0.7. This decline is attributed to unmitigated biases in raw queries (e.g., high-level descriptions lacking speaker-specific details from subtitles), resulting in less precise temporal grounding. Qualitatively, without LLM decomposition, queries like "Adams walks into the room and hands Park a coffee" fail to capture sequential sub-events, leading to over-generalized predictions. Further breakdown in Table 3 shows that subtitle matching contributes more than query decomposition, as it provides granular dialogue cues essential for TVR's narrative-driven videos.
Figure 5: Analysis of the performance on queries with varying verb counts in TVR.
4.4.2 Ablation on Temporal Prior Generation via Video Diffusion
Next, we ablate the video diffusion-based temporal prior generation (Sec. 3.3), which synthesizes short motion-rich clips from fused query-subtitles. Without this, we fall back to static image augmentation (e.g., using DALL-E for query-based images instead of CogVideoX videos). Table 2 shows a significant drop (6.44% in R1@0.5 and 5.02% in R1@0.7), underscoring the value of dynamic priors over static ones. Static images capture semantics but lack motion sequences (e.g., "walking" as a trajectory), leading to weaker temporal alignment in long videos.
Table 4 further compares diffusion models: CogVideoX outperforms alternatives like Stable Video Diffusion, due to its superior motion fidelity. Without priors entirely, performance plummets, confirming their role in bridging query-video gaps.
4.4.3 Ablation on Multi-Modal Controlled Mamba Network
We ablate the core Mamba network, focusing on the video-guided gating mechanism (Sec. 3.4). Without gating, we use a standard bidirectional SSM without multi-modal control. For a baseline, we replace Mamba with a Transformer encoder. Removing gating reduces performance by 3.97% in R1@0.5 (Table 5), as the model then fails to dynamically filter noise using generated priors, leading to less focused temporal propagation. Switching to the Transformer causes a larger drop (7.31% in R1@0.5), due to quadratic complexity on long sequences ($L_o > 1000$), causing memory issues and poorer long-range modeling. Table 5 shows the bidirectional SSM is crucial for handling untrimmed videos efficiently.
4.4.4 Ablation on Loss Functions
Finally, we ablate the loss components: boundary loss ($\mathcal{L}_{bound}$), relevance loss ($\mathcal{L}_{rel}$), and contrastive loss ($\mathcal{L}_{cont}$). Table 6 shows that removing $\mathcal{L}_{cont}$ hurts multi-modal fusion (4.12% drop in R1@0.5), as priors are less aligned with positives. Without $\mathcal{L}_{rel}$, clip-wise precision suffers (3.45% drop), leading to noisier boundaries. The full combination is essential for balanced optimization.
Figure 6: Comparison of memory consumption for Mamba-VMR and classic Transformer and Longformer counterparts under varying motion sequence lengths. The classic Transformer runs out of GPU memory when the sequence length reaches 700.
4.5 Further analysis
Analysis on query debiasing. We further examine performance across query complexity by partitioning TVR into subsets based on verb count: 1-Verb Cases (28.29%), 2-Verb Cases (41.72%), 3-Verb Cases (18.55%), and Multi-Verb Cases (≥ 4 verbs, 11.44%). Figure 5 plots R@1/IoU=0.5 for SgLFT, ICQ, and Ours.
On simple 1-Verb queries, methods are comparable (SgLFT: 44.7%, ICQ: 46.2%, Ours: 48.7%), as minimal sequencing is needed. However, as verbs increase, baselines degrade sharply: SgLFT drops to 16.3% on Multi-Verb cases due to poor handling of action chains, and ICQ's 16.8% reflects static images' inability to model transitions (e.g., ignoring the "walks… hands" flow). Ours maintains 35.9% on Multi-Verb cases, as generated videos explicitly encode dynamics, amplified by Mamba's selective fusion. This suggests our priors mitigate ∼19% of errors in high-verb queries, per the error breakdown, emphasizing robustness in real-world multi-action scenarios such as dialogues in TVR.

Memory Consumption. The memory usage of Mamba-based models increases linearly with sequence length, which allows them to handle the longer sequences that arise in video moment retrieval. As depicted in Figure 6, the memory cost of the classic Transformer grows quadratically with sequence length, causing it to quickly run out of memory on long untrimmed videos. In contrast, Mamba-VMR and Longformer exhibit near-linear memory consumption, making them suitable for processing extended video sequences while fusing multimodal priors. The slope for Mamba-VMR is slightly larger than for Longformer because of its bidirectional SSM states and multi-modal gating mechanisms.

Qualitative results. To provide deeper insight into our framework's effectiveness, we visualize localization outcomes on examples from Charades-STA, comparing against SgLFT [6] (semantic-guided fusion) and ICQ [52] (static image augmentation). Figure 7 illustrates two multi-verb cases: the top, "A blonde woman removes her coat and hands it to Monica." (biased verbs like "removes" and "hands" marked in red and blue, prone to sequencing errors), and the bottom, "Alicia walks up to Sheldon and Leonard and shakes Leonard's hand." (ambiguous actions like "walks" and "shakes" highlighted).
SgLFT often overextends boundaries (e.g., [44.2s, 50.5s], capturing only part of the coat handover while missing the drop), while ICQ's static priors lead to fragmented predictions (e.g., [38.1s, 41.9s], ignoring the full motion). Our method, leveraging generated videos fused with subtitles (e.g., "It's so nice… how young I look" guiding the coat-dropping action), yields precise alignments such as [34.4s, 46.2s] and [0.0s, 8.6s], closely matching the ground truth. This highlights how dynamic priors resolve temporal ambiguities in multi-action queries, where baselines falter for lack of motion cues.

Figure 7: Qualitative results on Charades-STA, showcasing accurate temporal moment localization for multi-verb queries.

5 Conclusion

In this paper, we address the limitations of text-driven Video Moment Retrieval (VMR) by proposing a novel two-stage framework that generates temporal video priors from queries fused with subtitle contexts and processes them via a multi-modal controlled Mamba network for efficient long-sequence grounding on the TVR dataset. Our approach effectively captures hidden temporal dynamics in multi-verb queries, leveraging LLM-guided subtitle matching for contextual enrichment and video-guided gating in Mamba for linear-time fusion, yielding state-of-the-art results with improved recall and reduced computational overhead compared to baselines such as SgLFT and ICQ. While robust, limitations such as sensitivity to generation quality persist; future work may explore advanced diffusion models or audio integration for richer priors. Further discussion can be found in the appendix.

Supplementary Material

This supplementary material provides additional analyses and technical details to support and extend the main paper. In Sec. A, we present an extended quantitative comparison on the TVR dataset, including additional baselines, Mamba variants, and the newly introduced Multi-Verb Recall metric. Sec.
B provides detailed ablation studies on the multimodal fusion parameter θ and the selection of top-k subtitles for auxiliary video generation. Sec. C compares Mamba-VMR with Transformer-based counterparts on long sequences, highlighting memory efficiency and performance gains. Finally, Sec. D presents qualitative results demonstrating the robustness of our method on queries with increasing numbers of verbs, and the benefits of subtitle-fused video generation for improved temporal localization.

Appendix A Extended Comparison on TVR Dataset

Table 7 extends Table 1 of the main paper by including more baselines, additional Mamba variants, and a new metric, Multi-Verb Recall. Multi-Verb Recall is the average R@1 at IoU=0.5 for queries with at least 3 verbs, evaluated on a TVR validation subset of approximately 2k such queries. Our full model consistently outperforms both conventional Transformer-based baselines and recent Mamba variants on multi-verb queries, yielding a substantial improvement over the strongest baseline. Compared with EventFormer, our approach improves Multi-Verb Recall by +13.8% at IoU=0.5, indicating a markedly better capability to handle temporally complex queries involving multiple actions. This gain is largely attributable to the explicit modeling of motion dynamics introduced by generated videos, which provide complementary temporal cues beyond static visual representations. As shown by the ablation results, incorporating generated content (Gen) boosts Multi-Verb Recall from 22.95% (w/o Sub, w/o Gen) to 33.67% (w/o Sub), and further to 35.94% with subtitles. This suggests that multi-verb localization particularly benefits from richer motion priors rather than generic visual features alone.
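The Multi-Verb Recall metric defined above can be computed from standard temporal IoU; a minimal sketch follows. The function names are illustrative, and verb counting (e.g., via a POS tagger) is assumed to be done upstream.

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def multi_verb_recall(preds, gts, verb_counts, iou_thr=0.5, min_verbs=3):
    """R@1 over the subset of queries with at least `min_verbs` verbs:
    a top-1 prediction counts as a hit if its IoU with GT meets the threshold."""
    hits = total = 0
    for pred, gt, n_verbs in zip(preds, gts, verb_counts):
        if n_verbs >= min_verbs:
            total += 1
            hits += temporal_iou(pred, gt) >= iou_thr
    return hits / total if total else 0.0
```

Overall Recall in Table 7 is the same computation without the verb-count filter.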
Category   Method              Modality             Overall R@1(0.5)  Overall R@1(0.7)  Multi-Verb R@1(0.5)  Multi-Verb R@1(0.7)
Baselines  HERO [24]           Video+Text+Sub       9.17              4.02              17.54                8.21
           XML [21]            Video+Text+Sub       30.75             13.41             20.23                11.34
           SQuiDNet [49]       Video+Text+Sub       41.31             24.74             21.81                13.06
           CTDL [50]           Video+Text+Sub       43.02             23.58             22.90                14.12
           CKCN [3]            Video+Text+Sub       43.38             23.18             22.28                13.87
           EventFormer [15]    Video+Text+Sub       44.20             25.45             22.13                14.05
Mamba      VideoMamba [23]     Video+Text           40.56             20.29             21.17                12.84
           SpikeMba [26]       Video+Text           41.17             22.82             22.53                14.46
Ours       w/o Sub, w/o Gen    Video+Text           38.76             20.08             22.95                14.32
           w/o Sub             Video+Text+Gen       41.89             22.56             33.67                21.05
           Ours (Full)         Video+Text+Sub+Gen   45.20             25.10             35.94                24.18
Table 7: Extended comparison of baseline and Mamba-based methods on the TVR val set. Overall Recall reports R@1 at IoU thresholds 0.5 and 0.7 over all queries. Multi-Verb Recall reports R@1 at the same thresholds for queries containing ≥ 3 verbs.

Appendix B Ablation Studies on θ and Top-k Subtitles

Impact of multimodal fusion parameter θ. We evaluate the fusion parameter θ, which balances contributions from the original video and the generated features. As illustrated in Figure 8, both R@1 (IoU=0.5) and R@1 (IoU=0.7) peak around θ = 0.4 to 0.5, with maximum values of 45.20 and 25.10, respectively. Beyond this range, performance declines due to over-reliance on one modality. Consequently, we select θ = 0.5 as the optimal balance for subsequent experiments.

Figure 8: Ablation on fusion parameter θ.

Impact of top-k subtitle selection. We ablate the number of top-k subtitles selected for auxiliary video generation. Table 8 shows that k = 3 yields the best results across all metrics. Fewer subtitles limit contextual richness, while more introduce noise; thus, k = 3 is chosen.

Table 8: Ablation on the number of selected subtitles (top-k) used for auxiliary video generation.
top-k  R@1@0.5 ↑  R@1@0.7 ↑
1      43.15      23.45
2      44.85      24.62
3      45.20      25.10
5      44.35      24.28

Appendix C Comparison with Transformer Counterpart

Incorporating additional generated videos as multimodal inputs inevitably extends sequence lengths, posing challenges for vanilla Transformers due to their quadratic time and memory complexity. To address this, we adopt Mamba as the backbone, leveraging its linear complexity to efficiently process these elongated sequences without compromising performance. Figure 6 in the main paper already demonstrates the superior memory efficiency of Mamba-VMR over vanilla Transformer and Longformer counterparts. Here we provide a more detailed quantitative comparison, including both performance and exact memory footprint under increasing sequence lengths. All models share identical multimodal inputs, namely CLIP-encoded video frames, subtitles, and a 6-second generated auxiliary video, with feature dimension d_model = 512. The only difference is replacing the bidirectional Mamba blocks with standard Transformer encoder blocks (12 layers, 8 heads) for a fair comparison. Experiments are conducted on the TVR val set with the same training protocol as in Sec. 4.2 of the main paper. Memory is measured on a single RTX 4090 24GB GPU using fp16 and a batch size of 1.

Table 9: Performance and memory comparison under different maximum sequence lengths on the TVR val set. The Transformer runs out of memory (OOM) once the sequence length exceeds 700.
Length  Model        R1/0.5 ↑  R1/0.7 ↑  Mem. (GB) ↓
512     Transformer  44.47     24.66     16.8
        Mamba-VMR    45.20     25.10     8.7
700     Transformer  44.12     24.31     24.0 (peak)
        Mamba-VMR    45.08     25.03     10.9
1024    Transformer  OOM       OOM       -
        Mamba-VMR    44.91     24.89     13.4

As shown in Table 9, Mamba-VMR consistently outperforms the Transformer counterpart, with an average improvement of 0.73 in R1/0.5 and 0.44 in R1/0.7, while consuming 45–70% less GPU memory.
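The quadratic-vs-linear scaling behind these measurements can be illustrated with a back-of-the-envelope estimate. The constants below (fp16, 8 heads, d_model = 512, an assumed SSM state size of 16) mirror the setup above, but the formulas are simplified illustrations of the dominant terms, not the measured footprints in Table 9.

```python
def attention_scores_gb(seq_len, heads=8, bytes_per_elem=2):
    """fp16 footprint of one attention layer's (heads, L, L) score tensor:
    grows quadratically in sequence length L."""
    return heads * seq_len * seq_len * bytes_per_elem / 1e9

def ssm_state_gb(seq_len, d_model=512, state_size=16, bytes_per_elem=2):
    """fp16 footprint of a selective-SSM layer's per-step activations:
    grows linearly in sequence length L."""
    return seq_len * d_model * state_size * bytes_per_elem / 1e9
```

Doubling the sequence length quadruples the attention-score term but only doubles the SSM term, which matches the qualitative trend in Figure 6 and Table 9.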
The performance gap remains stable or slightly widens on longer sequences, verifying that Mamba's selective SSM better preserves long-range temporal dependencies in untrimmed TV drama episodes, where many TVR videos exceed 700 frames after sampling. The vanilla Transformer runs out of memory once the sequence length exceeds 700, as shown in Figure 6 of the main paper, whereas Mamba-VMR scales gracefully to 1024 frames within 13.4 GB, enabling practical inference on full-length episodes without truncation. Even compared with Longformer's sparse attention, Mamba-VMR uses less memory at the same length, owing to its lower constant overhead and the absence of quadratic attention terms. These results, combined with Figure 6 of the main paper, demonstrate that despite incorporating additional generated videos as inputs, our Mamba network maintains superior efficiency.

Appendix D Generated Video Quality and Impact

Video moment retrieval becomes increasingly challenging as queries involve multiple sequential actions. To illustrate this, Fig. 9 visualizes grounding results for queries containing one, two, or multiple verbs. The timelines compare the ground truth (GT) with SgLFT, ICQ, and our approach, showing that as the number of actions increases, existing methods struggle to localize moments accurately. In contrast, our generated video priors provide motion cues that guide precise temporal localization, especially for multi-verb queries.

Beyond demonstrating the impact of multi-verb queries, Fig. 10 compares different video generation strategies: query-only, subtitle-fused, and query decomposition. This visualization highlights the value of incorporating subtitles and intermediate action inference into video generation. Subtitle-fused and decomposition-based methods produce more informative priors, capturing finer temporal structure and character interactions, which improve grounding accuracy compared to query-only generation.
Together, these figures emphasize the necessity of generated video priors in our multi-verb VMR setting and demonstrate how different generation strategies affect the quality of the motion cues used for precise temporal localization.

Appendix E Extended Related Work

To provide comprehensive context for our contributions, we briefly discuss related research trajectories in multimodal video understanding and generative modeling.

Advancements in Video Temporal Grounding. Recent years have witnessed remarkable progress in Video Moment Retrieval (VMR) and highlight detection. Previous methods have extensively explored cross-modal feature alignment using Transformer architectures [42] and query-guided dynamic refinement networks to improve multimodal fusion [41, 44]. Furthermore, the paradigm of zero-shot temporal grounding has been significantly advanced by leveraging off-the-shelf Large Language Models (LLMs) to enhance description-based similarities and zero-shot reasoning [33, 40, 43]. While these approaches excel at multimodal feature alignment, our Mamba-VMR uniquely addresses the temporal complexity of multi-verb queries by introducing generated video priors via selective state spaces.

Broader Generative and Spatial Vision Context. The generative augmentation strategy proposed in our work shares underlying principles with broader trends in high-fidelity visual generation and spatial understanding. For instance, recent diffusion models have shown exceptional capabilities in identity-preserving and 3D-aware portrait generation [46, 47, 45]. Concurrently, understanding complex spatial geometries and continuous motion flows remains a critical challenge in adjacent fields, such as monocular depth estimation [39, 38] and coarse-to-fine trajectory planning in autonomous driving [37].
Although Mamba-VMR currently focuses on temporal grounding, integrating these advanced spatial geometries and high-fidelity generative priors represents a promising avenue for interpreting complex, real-world video dynamics in future work.

Figure 9: Visualizations of grounding for 1-verb (a), 2-verb (b), and multi-verb (c) cases. Timelines compare GT (green), SgLFT (red), ICQ (yellow), and Ours (blue), with subtitles and generated video frames shown.

Figure 10: Impact of generation methods: query-only, subtitle-fused, and query decomposition, with resulting video frames and grounding improvements.

References

[1] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Usdin, J. Wang, L. Yang, D. Lorenz, Y. Levi, Z. Michaeli, T. Scialom, M. Black, and A. El-Nouby (2023) Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127. Cited by: §2.2.
[2] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Usdin, J. Wang, L. Yang, D. Lorenz, Y. Levi, Z. Michaeli, T. Scialom, M. Black, and A. El-Nouby (2023) Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127. Cited by: §2.2.
[3] A. Chen, H. Doughty, X. Li, and C. G. M. Snoek (2024) Beyond coarse-grained matching in video-text retrieval. In Asian Conference on Computer Vision (ACCV). Cited by: Table 7.
[4] B. Chen, N. Shvetsova, A. Rouditchenko, D. Kondermann, S. Thomas, S. Chang, R. Feris, J. Glass, and H. Kuehne (2024) What, when, and where? Self-supervised spatio-temporal grounding in untrimmed multi-action videos from narrated instructions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
[5] H. Chen, M. Lu, J. Hu, Y. Li, D. Zhang, Z. Liu, X. Shen, W. Xu, M. Yang, W. Liu, and T. Mei (2023) VideoCrafter: high-quality video generation with large pre-trained diffusion models. arXiv preprint arXiv:2310.19512. Cited by: §2.2.
[6] Y. Chen, G. Li, Y. Jin, L. Kong, and F.
Li (2024) SGLFT: semantic-guided late fusion transformer for video corpus moment retrieval. Neurocomputing 571. Cited by: §1, §2.1, §4.3, §4.5.
[7] T. Dao and A. Gu (2024) Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning (ICML). Cited by: §1, §2.3.
[8] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem (2019) Temporal localization of moments in video collections with natural language. arXiv preprint arXiv:1907.02141. Cited by: §2.1.
[9] R. Ge, J. Gao, K. Chen, A. Torralba, and C. Gan (2021) Cross-modal moment localization in videos. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2.1.
[10] A. Grattafiori, A. Dubey, A. Jauhri, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §3.1, §3.2.
[11] A. Gu and T. Dao (2023) Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752. Cited by: §1, §2.3, §3.4.
[12] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2017) Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), p. 5803–5812. Cited by: §2.1.
[13] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell (2018) Localizing moments in video with temporal language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 1380–1390. Cited by: §2.1.
[14] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet (2022) Video diffusion models. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2.2.
[15] D. Hou, L. Pang, H. Shen, and X. Cheng (2024) Event-aware video corpus moment retrieval. arXiv preprint arXiv:2402.13566. Cited by: Table 7.
[16] D. Hou, L. Pang, H. Shen, and X.
Cheng (2024) Improving video corpus moment retrieval with partial relevance enhancement. arXiv preprint arXiv:2402.13576. Cited by: §4.3.
[17] Z. Hou, C. Min, C. Chan, S. Lim, M. Kung, S. Garg, M. Chandraker, and A. Dolly (2021) CONQUER: contextual query-aware ranking for video corpus moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia (MM), p. 3900–3908. Cited by: §2.1.
[18] Z. Hou, C. Ngo, and W. K. Chan (2021) CONQUER: contextual query-aware ranking for video corpus moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), p. 3900–3908. Cited by: §4.3.
[19] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), p. 706–715. Cited by: §2.1, §4.1.
[20] X. Lan, Y. Yuan, X. Wang, L. Chen, Z. Wang, L. Ma, and W. Zhu (2022) A survey on temporal sentence grounding in videos. ACM Transactions on Multimedia Computing, Communications, and Applications. Cited by: §1.
[21] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 7.
[22] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. In European Conference on Computer Vision (ECCV), p. 447–463. Cited by: §1, §2.1, §4.1.
[23] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao (2024) VideoMamba: state space model for efficient video understanding. In European Conference on Computer Vision (ECCV). Cited by: Table 7, §1, §2.3.
[24] L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020) HERO: hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200. Cited by: Table 7.
[25] L. Li, Y. Chen, Y. Cheng, Z. Gan, L.
Yu, and J. Liu (2020) HERO: hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200. Cited by: §4.3.
[26] W. Li, X. Hong, R. Xiong, and X. Fan (2024) SpikeMba: multi-modal spiking saliency mamba for temporal video grounding. arXiv preprint arXiv:2404.01174. Cited by: Table 7, §1, §2.3.
[27] Z. Li, Q. Chen, T. Han, Y. Zhang, Y. Wang, and W. Xie (2025) Multi-sentence grounding for long-term instructional video. arXiv preprint arXiv:2312.14055. Cited by: §1.
[28] D. Liu, W. Hu, and B. Hu (2024) Multi-pair temporal sentence grounding via multi-thread knowledge transfer network. arXiv preprint arXiv:2412.15678. Cited by: §1.
[29] D. Liu, M. Qu, X. Tao, H. Zhang, L. Chi, and T. Zhu (2022) Multi-verb temporal grounding challenges. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). Cited by: §2.1.
[30] M. Liu, J. Ai, B. Cao, Z. Yan, and M. Song (2023) A survey on video moment localization. ACM Computing Surveys. Cited by: §1.
[31] U. Singer, S. Sheynin, A. Polyak, O. Hayes, X. Yin, J. Hu, Y. Taigman, D. Manor, L. Singer, and A. Blattmann (2023) Make-a-video: text-to-video generation without text-video data. In International Conference on Learning Representations (ICLR). Cited by: §2.2.
[32] X. Song, H. Lin, H. Wen, B. Hou, M. Xu, and L. Nie (2025) A comprehensive survey on composed image retrieval. arXiv preprint arXiv:2502.18495. Cited by: §1.
[33] Y. Sun, Y. Xu, Z. Xie, Y. Shu, and S. Du (2023) GPTSee: enhancing moment retrieval and highlight detection via description-based similarity features. IEEE Signal Processing Letters. Cited by: Appendix E.
[34] K. Tang, L. He, N. Wang, and X. Gao (2025) Boosting temporal sentence grounding via causal inference. In ACM International Conference on Multimedia (MM). Cited by: §1.
[35] A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748. Cited by: §3.5.
[36] X. Wang, Z. Kang, and Y. Mu (2024) Text-controlled motion mamba: text-instructed temporal grounding of human motion. arXiv preprint arXiv:2404.11375. Cited by: §2.3.
[37] Y. Xu, J. Cui, F. Cai, Z. Zhu, H. Shang, S. Luan, M. Xu, N. Zhang, Y. Li, J. Cai, et al. (2025) WAM-flow: parallel coarse-to-fine motion planning via discrete flow matching for autonomous driving. arXiv preprint arXiv:2512.06112. Cited by: Appendix E.
[38] Y. Xu, M. Li, C. Peng, Y. Li, and S. Du (2021) Dual attention feature fusion network for monocular depth estimation. In CAAI International Conference on Artificial Intelligence, p. 456–468. Cited by: Appendix E.
[39] Y. Xu, C. Peng, M. Li, Y. Li, and S. Du (2021) Pyramid feature attention network for monocular depth prediction. In ICME, p. 1–6. Cited by: Appendix E.
[40] Y. Xu, Y. Sun, Z. Xie, B. Zhai, and S. Du (2024) VTG-GPT: tuning-free zero-shot video temporal grounding with GPT. Applied Sciences 14 (5), p. 1894. Cited by: Appendix E.
[41] Y. Xu, Y. Sun, Z. Xie, B. Zhai, Y. Jia, and S. Du (2023) Query-guided refinement and dynamic spans network for video highlight detection and temporal grounding in online information systems. International Journal on Semantic Web and Information Systems (IJSWIS) 19 (1), p. 1–20. Cited by: Appendix E.
[42] Y. Xu, Y. Sun, B. Zhai, Y. Jia, and S. Du (2024) MH-DETR: video moment and highlight detection with cross-modal transformer. In 2024 International Joint Conference on Neural Networks (IJCNN), p. 1–8. Cited by: Appendix E.
[43] Y. Xu, Y. Sun, B. Zhai, M. Li, W. Liang, Y. Li, and S. Du (2025) Zero-shot video moment retrieval via off-the-shelf multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, p. 8978–8986. Cited by: Appendix E.
[44] Y. Xu, Y. Sun, B. Zhai, Z. Xie, Y. Jia, and S. Du (2024) Multi-modal fusion and query refinement network for video moment retrieval and highlight detection.
In 2024 IEEE International Conference on Multimedia and Expo (ICME), p. 1–6. Cited by: Appendix E.
[45] Y. Xu, B. Zhai, Y. Sun, M. Li, Y. Li, and S. Du (2025) HiFi-portrait: zero-shot identity-preserved portrait generation with high-fidelity multi-face fusion. In CVPR, p. 5625–5635. Cited by: Appendix E.
[46] Y. Xu, B. Zhai, C. Zhang, M. Li, Y. Li, and S. Du (2025) Diff-PC: identity-preserving and 3D-aware controllable diffusion for zero-shot portrait customization. Information Fusion 117, p. 102869. Cited by: Appendix E.
[47] Y. Xu, C. Zhang, B. Zhai, and S. Du (2025) HP3: tuning-free head-preserving portrait personalization via 3D-controlled diffusion models. IEEE Signal Processing Letters. Cited by: Appendix E.
[48] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang (2024) CogVideoX: text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072. Cited by: §1, §2.2, §3.1, §3.3.
[49] S. Yoon, J. W. Hong, E. Yoon, D. Kim, J. Lee, E. Yang, and K. Chang (2022) Selective query-guided debiasing for video corpus moment retrieval. In European Conference on Computer Vision (ECCV). Cited by: Table 7.
[50] S. Yoon, J. Lee, D. Kim, E. Yang, and K. Chang (2023) Counterfactual two-stage debiasing for video corpus moment retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 14852–14861. Cited by: Table 7.
[51] H. Yuan, J. Ni, Z. Liu, Z. Li, Y. Sun, and L. Nie (2025) MomentSeeker: a task-oriented benchmark for long-video moment retrieval. arXiv preprint arXiv:2502.12558. Cited by: §1.
[52] G. Zhang, M. L. A. Fok, J. Ma, Y. Xia, D. Cremers, P. Torr, V. Tresp, and J. Gu (2025) Localizing events in videos with multimodal queries. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.1, §2.2, §4.3, §4.5.
[53] H. Zhang, A. Sun, W. Jing, and J. T.
Zhou (2023) The elements of temporal sentence grounding in videos: a survey and future directions. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
[54] M. Zhou, Y. Dong, and H. Zhang (2024) Query-aware multi-scale proposal network for weakly supervised temporal sentence grounding in videos. Knowledge-Based Systems. Cited by: §1.
[55] L. Zhu, B. Liao, Q. Shen, X. Liu, B. Liu, M. Cheng, and J. Kwok (2024) Vision mamba: efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417. Cited by: §2.3.