Paper deep dive

VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Junyoung Kim, Woojoo Kim, Jaehyung Lim, Dongha Kim, Hwanjo Yu

Year: 2026 | Venue: arXiv preprint | Area: cs.IR | Type: Preprint | Embeddings: 66

Abstract

Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify their inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.

Full Text

VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Junyoung Kim (junyoungkim@postech.ac.kr), Woojoo Kim (kimuj0103@postech.ac.kr), Jaehyung Lim (jaehyunglim@postech.ac.kr), Dongha Kim (dhkim0317@postech.ac.kr), and Hwanjo Yu (hwanjoyu@postech.ac.kr)
Pohang University of Science and Technology, Pohang, Republic of Korea (2026)
Keywords: Vision Language Models, Modality Collapse, Sequential Recommendation

CCS Concepts: Information systems → Recommender systems

1. Introduction

Sequential Recommendation (SR) models dynamic preferences from interaction histories (Kang and McAuley, 2018; Hidasi et al., 2015; Sun et al., 2019; Zhou et al., 2022), yet ID-based methods struggle with data sparsity and cold-start issues. Incorporating auxiliary modalities (e.g., text, image) has become essential, providing richer item semantics that improve accuracy and generalization. To leverage these modalities, existing approaches (Yuan et al., 2023; Zhang et al., 2025b; Hu et al., 2023; Wang et al., 2023) typically extract item features with small pretrained encoders (Devlin et al., 2019; Radford et al., 2021) and freeze them to maintain semantic integrity. However, this static approach creates a bottleneck: frozen embeddings cannot adequately internalize Collaborative Filtering (CF) dynamics, motivating a shift toward large-capacity models capable of encoding sequence-level behavior patterns.

In the NLP field, this shift has been established by repurposing LLMs as large reasoning encoders for representation learning (Li et al., 2025; Wang et al., 2024a; Lee et al., 2024; Li et al., 2024; Tao et al., 2024) and by fine-tuning them with sequence–target supervision so that sequence-level CF signals are injected directly into the embedding space (Liu et al., 2025; Wang et al., 2024b; He et al., 2025). A natural next step is to extend this paradigm to multimodal SR using VLMs. However, existing research (Zhang et al., 2025a; Pomo et al., 2025) largely focuses on item-level semantics (single text or image input), missing the sequence-level behavior patterns essential to SR.
Given the strong textual and multi-image reasoning ability of modern open-source VLMs (Bai et al., 2025; Liu et al., 2023; Zhu et al., 2025), we initially attempted to fine-tune VLMs with sequence-level objectives to obtain CF-aware multimodal embeddings; however, we find that naively porting the LLM pipeline to VLMs introduces a critical modality collapse. Prior studies (Sim et al., 2025; Kwon et al., 2025; Schrodi et al., 2024) report that VLMs often exhibit modality collapse or a modality gap, over-relying on a strong modality while underutilizing the weak one, thereby producing embeddings that underrepresent weaker modalities.

We first examine whether, for this naive approach, modality imbalance can be mitigated at the fusion stage by revisiting common VLM prompting strategies (Sec. 2.2.1). Internal fusion interleaves text and image tokens into a single sequence, relying on self-attention for implicit fusion, but attention is often biased toward the dominant modality. External fusion encodes each modality independently and fuses them afterward (e.g., sum/concat). While it can prevent cross-modal interference at the input level, the model itself is already biased by pretraining, leaving the weak modality under-represented from the start. Thus, the root cause lies in the unbalanced optimization path rather than the fusion strategy, necessitating objective-level interventions.

Moreover, we identify the Paradox of SFT: standard contrastive supervised fine-tuning (SFT) (Oord et al., 2018; Logeswaran and Lee, 2018), which is essential for adapting VLMs to the recommendation task, counterintuitively exacerbates modality collapse and thereby harms recommendation performance (more details in Sec. 3). Specifically, we observed that the model engages in shortcut learning (Arpit et al., 2017; Nam et al., 2020; Kwon et al., 2025) to minimize the loss, disproportionately relying on the easier-to-learn strong modality.
Consequently, the weak modality receives insufficient gradient signals and largely loses its ability to push negative samples away. This optimization imbalance induces modality collapse directly in the representation space, leading to unequal contributions from the two modalities in downstream recommendation. In turn, this adaptation step paradoxically widens the modality gap, indicating the need for objective-level interventions to restore the weak modality’s discriminative power.

Building on these insights, we propose VLM2Rec, a VLM embedder-based framework for multimodal SR. Rather than extracting individual item features separately, our approach explicitly encodes the entire interaction history as a single sequence input to a high-capacity VLM. This allows the model to capture dynamic behavioral patterns beyond static item features and to internalize these signals directly in the representation space. To address the paradox of contrastive-learning-based SFT, we introduce two novel objectives. First, Weak-modality Penalized Contrastive Learning ($\mathcal{L}_{\mathrm{WPCL}}$) dynamically identifies the user-adaptive weak modality during training and amplifies its contrastive penalty, enforcing discriminative negative separation. Second, to prevent this aggressive separation from distorting the semantic space of the weak modality, we propose Cross-modal Relational Topology Regularization ($\mathcal{L}_{\mathrm{CRTR}}$). This preserves geometric consistency by aligning the relative sequence-item topology (e.g., neighbor/ranking structure) of the weak modality with that of the strong modality. Crucially, this design enables a balanced utilization of multimodal signals, ensuring that both textual and visual dynamics are effectively synthesized to produce discriminative, CF-aware representations through sequence-level SFT. Across diverse benchmarks, VLM2Rec consistently improves recommendation accuracy, confirming both its effectiveness and robustness.
Our contributions are as follows:
• To the best of our knowledge, we are the first to propose a VLM-based multimodal sequence encoding framework for SR.
• We empirically reveal the paradox of SFT: standard contrastive fine-tuning amplifies modality collapse in VLMs by failing to optimize the weaker modality on SR datasets.
• We introduce two objective-level interventions that dynamically restore discriminative power and preserve geometric topology, achieving state-of-the-art performance.

2. Proposed VLM Embedder-based Framework in Sequential Recommendation

In this section, we introduce our proposed base setting for the VLM embedder-based framework in SR. This setting is used as the default configuration in all subsequent sections.

2.1. Problem Formulation

Let $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, respectively. Each item $i \in \mathcal{I}$ is associated with multimodal information: textual data $t_i$ (e.g., title) and visual data $v_i$ (e.g., product image). For each user $u \in \mathcal{U}$, we define the historical interaction sequence as $S_u = [i_1, i_2, \ldots, i_{|S_u|}]$, sorted chronologically. The objective of sequential recommendation is to predict the next item $i_{|S_u|+1}$ that the user is most likely to interact with, given the context $S_u$.

Figure 1. Analysis of modality collapse via dropout test and gradient dynamics. (a) Impact of input modality dropout on performance. (b) Training dynamics of modality-specific gradient influence. Under SFT, the image modality causes negative transfer when fused with text, because the gradient signal of the weak modality is overlooked during training. Our proposed VLM2Rec successfully re-balances modality gradients, enabling stable multimodal gains.

2.2. VLM-based Sequence Encoding

We utilize a pre-trained VLM, denoted as $\Phi(\cdot)$, as the backbone sequence encoder. Leveraging the multi-image understanding and context reasoning capabilities of the VLM, we extract sequence-level multimodal representations directly.

2.2.1. Input Construction

We compare the standard Internal Fusion (interleaved inputs) against our proposed External Fusion (separate inputs), defined as follows:

Interleaved Prompt (Internal)
Describe this item sequence in one word, 0) <|item1_title|> <|item1_image|>, …, N) <|itemN_title|> <|itemN_image|>

Proposed Separate Prompts (External)
[$P_T$]: Describe this item text sequence in one word, 0) <|item1_title|>, …, N) <|itemN_title|>
[$P_V$]: Describe this item image sequence in one word, 0) <|item1_image|>, …, N) <|itemN_image|>

For internal fusion, text and images are processed simultaneously. In contrast, for external fusion, prompts $P_T$ and $P_V$ are input separately into the encoder to ensure independent encoding. Additionally, to extract individual item representations, we apply the same template but restrict the input to the first item placeholder (index 0).

2.2.2. Representation Extraction

Following standard conventions, we regard the hidden state of the last token from the VLM’s final transformer layer as the compressed semantic representation of the input. Given the text sequence $S_u^T$ and visual sequence $S_u^V$, their corresponding sequence embeddings $\mathbf{z}^T_u$ and $\mathbf{z}^V_u$ are extracted as follows:

$$\mathbf{z}^T_u = \Phi(P_T(S_u^T)), \quad \mathbf{z}^V_u = \Phi(P_V(S_u^V)) \tag{1}$$

where $\mathbf{z}^T_u, \mathbf{z}^V_u \in \mathbb{R}^d$. Similarly, for a candidate item $i$, its text embedding $\mathbf{e}^T_i$ and visual embedding $\mathbf{e}^V_i$ are extracted using the same encoder $\Phi$.

2.3. Fusion Strategy

Our primary goal is to balance modality influence through objective-level training signals, rather than introducing complex fusion strategies. To avoid additional parameters and improve generality, we use the simplest external fusion, element-wise summation:

$$\mathbf{z}_u = \mathbf{z}^T_u + \mathbf{z}^V_u, \tag{2}$$

and likewise fuse candidate item representations as $\mathbf{e}_i = \mathbf{e}^T_i + \mathbf{e}^V_i$.

2.4. Standard Supervised Fine-Tuning Objective

To adapt generative VLMs for retrieval and inject sequence-level CF signals, we employ conventional Supervised Fine-Tuning (SFT) via the InfoNCE loss (Oord et al., 2018; Logeswaran and Lee, 2018). This objective aligns the representation space by maximizing the similarity between the user sequence $\mathbf{z}_u$ and the positive item $\mathbf{e}_{i^+}$ while distancing negatives:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{u \in \mathcal{B}} \log \frac{e^{s(\mathbf{z}_u, \mathbf{e}_{i^+})/\tau}}{e^{s(\mathbf{z}_u, \mathbf{e}_{i^+})/\tau} + \sum_{i^- \in \mathcal{N}_u} e^{s(\mathbf{z}_u, \mathbf{e}_{i^-})/\tau}} \tag{3}$$

where $s(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is the temperature, and $\mathcal{N}_u$ is the negative set.

Table 1. Comparison of the representation geometry metrics among three states of VLMs. $A_{\mathrm{pos}}$ and $A_{\mathrm{neg}}$ denote positive and negative alignment, $U$ represents uniformity, and $S = A_{\mathrm{neg}}/A_{\mathrm{pos}}$ indicates separability.

Dataset | State | Mod. | A_pos (↓) | A_neg (↑) | U (↓) | S (↑)
Beauty | Vanilla | Fused | 0.4859 | 0.5024 | -0.3984 | 1.0340
Beauty | Vanilla | Vision | 0.1550 | 0.1561 | -0.0237 | 1.0071
Beauty | Vanilla | Text | 0.4568 | 0.4742 | -0.3691 | 1.0381
Beauty | After SFT | Fused | 0.8041 | 0.9577 | -1.7969 | 1.1910
Beauty | After SFT | Vision | 0.5752 | 0.5752 | -0.0398 | 1.0000
Beauty | After SFT | Text | 0.7932 | 0.9429 | -1.7422 | 1.1887
Beauty | Ours | Fused | 0.8488 | 1.0075 | -1.9921 | 1.1870
Beauty | Ours | Vision | 0.2624 | 0.3184 | -0.2031 | 1.2134
Beauty | Ours | Text | 0.7891 | 0.9184 | -1.6718 | 1.1639
Toys | Vanilla | Fused | 0.5155 | 0.5469 | -0.4707 | 1.0609
Toys | Vanilla | Vision | 0.1548 | 0.1554 | -0.0237 | 1.0039
Toys | Vanilla | Text | 0.4900 | 0.5230 | -0.4453 | 1.0673
Toys | After SFT | Fused | 0.7225 | 0.8972 | -2.2500 | 1.2418
Toys | After SFT | Vision | 0.5172 | 0.5126 | -0.0039 | 0.9911
Toys | After SFT | Text | 0.5346 | 0.6672 | -2.2188 | 1.2480
Toys | Ours | Fused | 0.8185 | 0.9502 | -1.7969 | 1.1609
Toys | Ours | Vision | 0.4557 | 0.5264 | -0.5703 | 1.1551
Toys | Ours | Text | 0.6115 | 0.6860 | -0.9335 | 1.1218

3. The Paradox of SFT: Analysis of Modality Collapse in Sequential Recommendation

In this section, we present converging evidence of the Paradox of SFT: when VLMs are adapted as embedders for SR, the fine-tuning objective can worsen modality imbalance. We demonstrate this via (i) recommendation performance, (ii) optimization dynamics, and (iii) representation geometry (Fig. 1, Tab. 1).
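
Since the analysis in this section is about how the fused InfoNCE objective of Eq. 3 behaves, it may help to see that objective in miniature. The following is a minimal sketch, assuming toy 2-d embeddings and a temperature of 0.1 (both hypothetical, not values from the paper); the helper names `cosine`, `fuse`, and `info_nce` are illustrative and do not come from the authors' code.

```python
import math

def cosine(a, b):
    """Cosine similarity s(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse(text_vec, image_vec):
    """Element-wise summation fusion of the two modality embeddings (Eq. 2)."""
    return [t + v for t, v in zip(text_vec, image_vec)]

def info_nce(z_u, e_pos, e_negs, tau=0.1):
    """Per-user InfoNCE term of Eq. 3. tau=0.1 is an assumed value."""
    pos = math.exp(cosine(z_u, e_pos) / tau)
    neg = sum(math.exp(cosine(z_u, e_neg) / tau) for e_neg in e_negs)
    return -math.log(pos / (pos + neg))

# Toy embeddings (hypothetical): text carries most of the signal,
# the image part is a small perturbation.
z_u = fuse([1.0, 0.0], [0.0, 0.1])
e_pos = fuse([0.9, 0.1], [0.0, 0.1])
e_negs = [fuse([0.0, 1.0], [0.1, 0.0]), fuse([-1.0, 0.0], [0.0, 0.1])]

loss_good = info_nce(z_u, e_pos, e_negs)                 # true positive ranked first
loss_bad = info_nce(z_u, e_negs[0], [e_pos, e_negs[1]])  # a negative treated as the target
print(loss_good, loss_bad)
```

In this toy case the loss is already close to zero even though the image components contribute almost nothing to the similarity, which is the kind of shortcut the section goes on to analyze.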
The implementation follows the setup in Sec. 2.

Performance Gap and Negative Transfer
In Fig. 1(a), to disentangle modality contributions to recommendation performance, we evaluate three sequence input settings for predicting the fused target: f2f (fused), t2f (text-only), and v2f (vision-only). In the Vanilla state, v2f is consistently the worst across datasets, revealing an intrinsic modality gap in pretrained VLMs for SR. Before fine-tuning, images can be either helpful or noisy depending on the dataset (e.g., f2f drops on Toys but rises on Beauty). Crucially, after SFT the gap widens: while t2f and f2f improve, v2f performance degrades below even its Vanilla baseline; moreover, t2f consistently outperforms f2f. This confirms that standard SFT triggers negative transfer from the weak (vision) modality, which acts as noise that compromises the fused embedding space.

Figure 2. Left: Our framework encodes text/image sequences/items to enable two usages: Task 1) direct sequence–item recommendation and Task 2) VLM2Rec-generated item embedding initialization for downstream SR models. Right: We fine-tune the VLM with $\mathcal{L}_{\mathrm{WPCL}}$ to adaptively penalize the user-specific weak modality (restoring negative separation) and $\mathcal{L}_{\mathrm{CRTR}}$ to align cross-modal relational topology, preventing geometric distortion while preserving modality individuality.

Optimization Dynamics
To investigate the cause, we track optimization dynamics by measuring the cosine similarity between the multimodal update $\mathbf{g}_{\mathrm{total}}$ and the individual modality gradients $\mathbf{g}_m$ for $m \in \{T, V\}$, computed under fused and single-modality inputs, respectively (Fig. 1(b)). This alignment quantifies each modality’s contribution to the actual update direction. From the start, $\mathbf{g}_{\mathrm{total}}$ is strongly aligned with the text gradient $\mathbf{g}_T$, while alignment with the image gradient $\mathbf{g}_V$ drops rapidly.
As training proceeds, $\cos(\mathbf{g}_{\mathrm{total}}, \mathbf{g}_T)$ goes to 1, whereas $\cos(\mathbf{g}_{\mathrm{total}}, \mathbf{g}_V)$ peaks briefly and then steadily declines due to the accumulation of modality bias. This shows that the model minimizes the contrastive objective by relying on the easier text modality, thereby failing to optimize the visual modality to push negatives away and learn discriminative features.

Geometric Analysis: Representation Collapse
To characterize how optimization bias translates into embedding geometry, we measure alignment $A^m$ and uniformity $U^m$, widely used in representation learning (Wang and Isola, 2020; Qiu et al., 2022), for each modality $m \in \{F, V, T\}$ (Tab. 1).

$$\begin{gathered} A_{\mathrm{pos}}^m \triangleq \mathbb{E}_{(s,i^+)\sim\mathcal{D}}\left[\|\Phi_m(s)-\Phi_m(i^+)\|_2^2\right], \\ A_{\mathrm{neg}}^m \triangleq \mathbb{E}_{(s,i^+)\sim\mathcal{D}}\,\mathbb{E}_{i^-\sim\mathcal{P}_n(\cdot\mid s)}\left[\|\Phi_m(s)-\Phi_m(i^-)\|_2^2\right], \\ U^m \triangleq \log \mathbb{E}_{(s,i)\sim\mathcal{P}_u}\left[\exp\left(-2\|\Phi_m(s)-\Phi_m(i)\|_2^2\right)\right]. \end{gathered} \tag{4}$$

Here $\mathcal{D}$ is the evaluation set of $(s, i^+)$ pairs, $\mathcal{P}_n(\cdot \mid s)$ is the negative sampler, and $\mathcal{P}_u$ uniformly samples $(s, i)$ pairs for estimating $U^m$. $A_{\mathrm{pos}}^m$ measures positive pull, $A_{\mathrm{neg}}^m$ measures negative push, and $U^m$ reflects space coverage. We additionally define separability to capture the relative margin between negatives and positives:

$$S^m \triangleq \frac{A^m_{\mathrm{neg}}}{A^m_{\mathrm{pos}} + \epsilon}, \tag{5}$$

where $S > 1$ indicates successful separation.

In the Vanilla state, both modalities satisfy $S > 1$, but the fused geometry metrics ($A^F, U^F, S^F$) already closely follow the text space, suggesting latent imbalance. After SFT, the text space improves as intended (better $S^T, U^T$), whereas the vision space collapses: $U^V$ changes only marginally on Beauty, but drops substantially on Toys, leading to near-indistinguishability ($S \approx 1$ on Beauty) or even inversion ($S < 1$ on Toys).
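
The geometry metrics of Eqs. 4 and 5 are straightforward to compute once embeddings are available. Below is a minimal sketch, assuming embeddings are plain Python lists and that each expectation is approximated by a mean over sampled pairs; the function names are illustrative, and whether the paper L2-normalizes embeddings first is not stated in this section, so no normalization is applied here. The last two lines plug the Table 1 "After SFT" vision entries into Eq. 5.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(pairs):
    """Mean squared distance over (sequence, item) pairs.
    Gives A_pos with positive pairs and A_neg with sampled negatives."""
    return sum(sq_dist(s, i) for s, i in pairs) / len(pairs)

def uniformity(pairs):
    """Log of the mean Gaussian potential exp(-2 * d^2) over sampled pairs."""
    potentials = [math.exp(-2.0 * sq_dist(s, i)) for s, i in pairs]
    return math.log(sum(potentials) / len(potentials))

def separability(a_neg, a_pos, eps=1e-8):
    """S = A_neg / (A_pos + eps); S > 1 means negatives sit farther than positives."""
    return a_neg / (a_pos + eps)

# Toy pairs (hypothetical): one identical pair, one orthogonal pair.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [1.0, 0.0])]
toy_alignment = alignment(toy_pairs)
toy_uniformity = uniformity(toy_pairs)

# Separability recomputed from the Table 1 vision rows after SFT.
s_toys_vision_sft = separability(0.5126, 0.5172)    # Toys: inversion, S < 1
s_beauty_vision_sft = separability(0.5752, 0.5752)  # Beauty: S is about 1
print(round(s_toys_vision_sft, 4), round(s_beauty_vision_sft, 4))
```

Recomputing $S$ this way reproduces the tabulated 0.9911 and 1.0000 for the vision modality, which is the collapse the paragraph above describes.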
Moreover, $S^T > S^F$ across all settings implies that the under-optimized vision modality acts as noise, pulling the fused space toward the text-only trajectory. Overall, contrastive SFT amplifies intrinsic modality collapse in VLMs, degrading the weak modality’s separability and harming SR, which motivates an objective-level design that explicitly restores weak-modality discrimination.

4. Method

In this section, we propose VLM2Rec, a novel multimodal SR framework designed to resolve the modality collapse observed in VLM-generated embeddings. Without auxiliary architectural complexity or sophisticated fusion modules, we aim to resolve this collapse through objective-level interventions.

4.1. Framework Design

Section 2 introduced the proposed VLM embedder-based SR framework. VLM2Rec adopts this foundational structure to ensure the consistency and generalizability of the proposed framework. Specifically, VLM2Rec follows the minimal prompt construction in Section 2.2.1 to construct modality-specific inputs (text sequence and image sequence), and uses the same last-token representation extraction rule defined in Section 2.2.2 to extract sequence and item embeddings from the VLM. In addition, we inherit the element-wise summation fusion strategy from Section 2.3, which minimizes structural interference and avoids extra trainable parameters introduced for fusion. By adopting this proposed foundation, VLM2Rec maintains a simple architecture, enabling compatibility with publicly available off-the-shelf VLM backbones without requiring architecture-specific customization.

4.2. Training Objectives

4.2.1. Weak-modality Penalized Contrastive Learning

In Section 3, we reveal that standard contrastive objectives (Eq. 3) suffer from optimization imbalance in VLM-based embedders.
This structurally marginalizes the weak modality, as the model satisfies the objective via shortcut learning on the strong modality without sufficiently pushing negative samples away in the weak modality’s representation space. A common remedy is to strengthen negatives at the data level (e.g., hard negative mining); however, such approaches require additional, sophisticated sampling procedures. To overcome this limitation, we propose Weak-modality Penalized Contrastive Learning ($\mathcal{L}_{\mathrm{WPCL}}$).

User-adaptive Modality Gap Estimation
First, to quantify how clearly each modality discriminates the ground-truth item from negatives, we define the Discriminative Margin $\mathcal{M}$. Recognizing that users may differ in how much they rely on textual or visual cues in their purchasing behavior, we calculate this margin on a per-user basis. For a user $u$ and modality $m \in \{T, V\}$, the margin $\mathcal{M}_{u,m}$ measures the confidence with which the model distinguishes the positive target item embedding $\mathbf{e}^m_{i^+}$ from a set of negative item embeddings $\mathbf{e}^m_{i^-}$ ($i^- \in \mathcal{N}_u$):

$$\mathcal{M}_{u,m} = s(\mathbf{z}^m_u, \mathbf{e}^m_{i^+}) - \frac{1}{|\mathcal{N}_u|} \sum_{i^- \in \mathcal{N}_u} s(\mathbf{z}^m_u, \mathbf{e}^m_{i^-}), \tag{6}$$

where $s(\cdot,\cdot)$ denotes the cosine similarity. A larger $\mathcal{M}_{u,m}$ implies stronger discriminative capability for that modality. Based on these margins, we dynamically identify the user-adaptive strong and weak modalities at each training step:

$$\mathcal{M}_{u,\mathrm{strong}} = \max(\mathcal{M}_{u,T}, \mathcal{M}_{u,V}), \quad \mathcal{M}_{u,\mathrm{weak}} = \min(\mathcal{M}_{u,T}, \mathcal{M}_{u,V}). \tag{7}$$

We then define the Modality Gap $\Delta_{u,\mathrm{gap}}$ as the disparity between these margins:

$$\Delta_{u,\mathrm{gap}} = \mathrm{sg}[\mathcal{M}_{u,\mathrm{strong}}] - \mathcal{M}_{u,\mathrm{weak}}, \tag{8}$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator. Applying $\mathrm{sg}[\cdot]$ to $\mathcal{M}_{u,\mathrm{strong}}$ fixes the discriminative level of the strong modality as a target lower bound, ensuring that the model minimizes $\Delta_{u,\mathrm{gap}}$ by enhancing $\mathcal{M}_{u,\mathrm{weak}}$ rather than degrading $\mathcal{M}_{u,\mathrm{strong}}$.
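
The per-user margin and gap of Eqs. 6 to 8 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: embeddings are toy 2-d lists with hypothetical values, the helper names are made up for the example, and because plain Python has no autograd the stop-gradient appears only as a comment (in a framework such as PyTorch or JAX the strong margin would be detached before the subtraction).

```python
import math

def cosine(a, b):
    """Cosine similarity s(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def margin(z, e_pos, e_negs):
    """Discriminative margin of Eq. 6 for one user and one modality."""
    mean_neg = sum(cosine(z, e_neg) for e_neg in e_negs) / len(e_negs)
    return cosine(z, e_pos) - mean_neg

def modality_gap(margin_text, margin_image):
    """Eqs. 7-8: return (gap, label of the weak modality).
    Ties are labeled "V" arbitrarily in this sketch."""
    strong = max(margin_text, margin_image)
    weak = min(margin_text, margin_image)
    weak_label = "T" if margin_text < margin_image else "V"
    # With autograd, `strong` would be wrapped in a stop-gradient here so that
    # only the weak margin receives gradient from the gap.
    return strong - weak, weak_label

# Text space (hypothetical): the positive is clearly separated from negatives.
m_text = margin([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Image space (hypothetical): positive and negatives are nearly indistinguishable.
m_image = margin([1.0, 1.0], [1.0, 0.8], [[1.0, 0.9], [0.8, 1.0]])

gap, weak = modality_gap(m_text, m_image)
print(m_text, m_image, gap, weak)
```

For this user the text margin is 1.5 while the image margin is slightly negative, so the image modality is flagged as weak and the gap is large.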
Gap-guided Dynamic Penalty
To explicitly reduce this gap, we convert $\Delta_{u,\mathrm{gap}}$ into a difficulty-aware penalty weight $w_{u,\mathrm{pen}}$:

$$w_{u,\mathrm{pen}} = 1 + \beta \cdot \mathrm{Softplus}(\alpha \cdot \Delta_{u,\mathrm{gap}}), \tag{9}$$

where $\beta$ and $\alpha$ are learnable parameters controlling the sensitivity of the penalty. As the discriminative gap increases for a specific user, meaning that the weak modality exhibits insufficient negative separation and thus $\Delta_{u,\mathrm{gap}}$ becomes larger, $w_{u,\mathrm{pen}}$ increases proportionally, and it converges to $1$ as the two modalities become more balanced.

Finally, $w_{u,\mathrm{pen}}$ is integrated into the standard contrastive learning objective on the fused representation $\mathbf{z}_u$. Unlike conventional contrastive learning that treats all negatives uniformly, $\mathcal{L}_{\mathrm{WPCL}}$ amplifies the relative influence of negative samples via $w_{u,\mathrm{pen}}$:

$$\mathcal{L}_{\mathrm{WPCL}} = -\sum_{u \in \mathcal{B}} \log \frac{e^{s(\mathbf{z}_u, \mathbf{e}_{i^+})/\tau_{\mathrm{WPCL}}}}{e^{s(\mathbf{z}_u, \mathbf{e}_{i^+})/\tau_{\mathrm{WPCL}}} + w_{u,\mathrm{pen}} \sum_{i^- \in \mathcal{N}_u} e^{s(\mathbf{z}_u, \mathbf{e}_{i^-})/\tau_{\mathrm{WPCL}}}} \tag{10}$$

where $\tau_{\mathrm{WPCL}}$ is the temperature of this objective. In this formulation, applying $w_{u,\mathrm{pen}} > 1$ to the negative term forces the model to perceive the negative samples as closer than they actually are. Since gradients flow through the fused representation $\mathbf{z}_u = \mathbf{z}^T_u + \mathbf{z}^V_u$, this amplified negative pressure serves a dual purpose: it maintains the basic discriminative power of the strong modality while specifically concentrating gradient updates on the weak modality to satisfy the heightened separation requirement.

Table 2. Performance comparison on Task 1 across various methods. The best results are highlighted in bold, second-best results are underlined, and * denotes statistical significance with p-values < 0.05, based on paired t-tests over 5 random seeds.
Method | Toys H@10 | Toys H@20 | Toys N@10 | Toys N@20 | Beauty H@10 | Beauty H@20 | Beauty N@10 | Beauty N@20 | Clothing H@10 | Clothing H@20 | Clothing N@10 | Clothing N@20 | Sports H@10 | Sports H@20 | Sports N@10 | Sports N@20
ID-based Models
GRU4Rec | 0.3591 | 0.4596 | 0.2295 | 0.2548 | 0.3903 | 0.4898 | 0.2624 | 0.2875 | 0.3450 | 0.4781 | 0.2025 | 0.2360 | 0.3500 | 0.4558 | 0.2165 | 0.2431
SASRec | 0.3812 | 0.4780 | 0.2613 | 0.2856 | 0.4280 | 0.5283 | 0.2984 | 0.3236 | 0.3442 | 0.4757 | 0.2097 | 0.2428 | 0.3834 | 0.4935 | 0.2448 | 0.2726
Text-based Models
BERT | 0.1242 | 0.2336 | 0.0579 | 0.0853 | 0.1372 | 0.2602 | 0.0671 | 0.0978 | 0.1294 | 0.2558 | 0.0586 | 0.0904 | 0.1239 | 0.2379 | 0.0572 | 0.0857
SLIM | 0.1217 | 0.2474 | 0.0567 | 0.0880 | 0.1412 | 0.2724 | 0.0661 | 0.0987 | 0.1374 | 0.2747 | 0.0632 | 0.0975 | 0.1254 | 0.2510 | 0.0575 | 0.0889
SLIM^+ | 0.2839 | 0.4208 | 0.1625 | 0.1968 | 0.2715 | 0.4112 | 0.1504 | 0.1854 | 0.1801 | 0.3686 | 0.0820 | 0.1291 | 0.1619 | 0.3148 | 0.0776 | 0.1157
LLMEmb | 0.3944 | 0.5278 | 0.2447 | 0.2783 | 0.3109 | 0.5836 | 0.2388 | 0.2810 | 0.3987 | 0.5490 | 0.2317 | 0.2696 | 0.4046 | 0.5882 | 0.2228 | 0.2692
LLM2Rec | 0.3172 | 0.4718 | 0.1875 | 0.2262 | 0.3282 | 0.4769 | 0.1854 | 0.2227 | 0.3264 | 0.4676 | 0.1829 | 0.2184 | 0.3600 | 0.5421 | 0.1915 | 0.2374
LLM_Vanilla | 0.2419 | 0.3597 | 0.1452 | 0.1747 | 0.2300 | 0.3428 | 0.1353 | 0.1636 | 0.2849 | 0.4052 | 0.1663 | 0.1966 | 0.2225 | 0.3471 | 0.1264 | 0.1577
LLM_SFT | 0.4287 | 0.5669 | 0.2712 | 0.3061 | 0.4963 | 0.6347 | 0.3085 | 0.3434 | 0.4333 | 0.6049 | 0.2525 | 0.2957 | 0.4863 | 0.6684 | 0.2797 | 0.3265
Multimodal-based Models
CLIP | 0.3258 | 0.4342 | 0.2069 | 0.2341 | 0.2114 | 0.2819 | 0.1318 | 0.1494 | 0.3061 | 0.4043 | 0.1887 | 0.2135 | 0.2630 | 0.3698 | 0.1495 | 0.1764
NoteLLM-2 | 0.2237 | 0.4545 | 0.1003 | 0.1582 | 0.1926 | 0.3536 | 0.0929 | 0.1331 | 0.1764 | 0.3431 | 0.0815 | 0.1233 | 0.1993 | 0.3884 | 0.0932 | 0.1440
VLM_Prompt | 0.2892 | 0.4205 | 0.1717 | 0.2046 | 0.2193 | 0.3367 | 0.1190 | 0.1485 | 0.2578 | 0.3829 | 0.1485 | 0.1799 | 0.2355 | 0.3588 | 0.1301 | 0.1610
VLM_Vanilla (Int.) | 0.3184 | 0.4245 | 0.2093 | 0.2360 | 0.2146 | 0.2986 | 0.1358 | 0.1569 | 0.2661 | 0.3577 | 0.1830 | 0.2060 | 0.3045 | 0.3840 | 0.2113 | 0.2313
VLM_Vanilla (Ext.) | 0.2659 | 0.4222 | 0.1426 | 0.1817 | 0.2096 | 0.3379 | 0.1097 | 0.1419 | 0.2459 | 0.3799 | 0.1375 | 0.1711 | 0.2538 | 0.3803 | 0.1203 | 0.1761
VLM_SFT (Int.) | 0.3381 | 0.5131 | 0.1770 | 0.2209 | 0.4100 | 0.5794 | 0.2941 | 0.3498 | 0.4531 | 0.6178 | 0.3015 | 0.3284 | 0.4588 | 0.6383 | 0.2720 | 0.3093
VLM_SFT (Ext.) | 0.4160 | 0.5897 | 0.2209 | 0.2647 | 0.4038 | 0.6372 | 0.2744 | 0.3300 | 0.5022 | 0.6474 | 0.3164 | 0.3531 | 0.4795 | 0.6291 | 0.2783 | 0.3161
VLM2Rec | 0.5225* | 0.6476* | 0.3578* | 0.3893* | 0.5644* | 0.6822* | 0.3824* | 0.4121* | 0.5627* | 0.7182* | 0.3458* | 0.3851* | 0.5574* | 0.6993* | 0.3694* | 0.4052*
Improv. (%) | 21.88 | 9.82 | 31.93 | 27.18 | 13.72 | 7.06 | 23.95 | 17.81 | 12.05 | 10.94 | 9.29 | 9.06 | 14.62 | 4.62 | 32.07 | 24.10

Table 3. Statistics of datasets used in the experiments.

Dataset | #Users | #Items | #Interactions | Avg. Length | Sparsity (%)
Toys | 15,921 | 8,383 | 108,336 | 6.8 | 99.92
Beauty | 19,757 | 9,311 | 137,300 | 6.9 | 99.93
Clothing | 30,757 | 17,087 | 196,614 | 6.4 | 99.96
Sports | 32,127 | 14,820 | 222,591 | 6.9 | 99.95

4.2.2. Cross-modal Relational Topology Regularization

While $\mathcal{L}_{\mathrm{WPCL}}$ effectively enforces negative separation in the weak modality, its aggressive pushing mechanism may cause the weak modality’s embedding space to undergo excessive expansion or distortion. This geometric misalignment disrupts semantic consistency across modalities, resulting in instability in the fusion process. To mitigate this, we propose Cross-modal Relational Topology Regularization ($\mathcal{L}_{\mathrm{CRTR}}$).

The core objective is not to enforce a strict point-wise alignment that makes embedding vectors identical. Such rigid metric learning can wash out modality-specific individualities by forcing one modality to simply relocate into the other’s embedding geometry. Instead, we focus on the relative similarity distributions between sequences and candidate items within each modality’s space, referred to as the Relational Topology, to ensure structural alignment across modalities. This approach preserves modality characteristics (e.g., linguistic nuances and visual patterns) while ensuring that semantic proximity in one modality translates to a consistent relative rank in the other.
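
To make the distinction between point-wise alignment and relational alignment concrete, here is a minimal sketch of the idea, which the paper formalizes in Eqs. 11 to 13 as a row-wise softmax over sequence-item similarities followed by a bidirectional KL divergence. The similarity values and the temperature default of 1.0 are hypothetical, and the function names are illustrative rather than taken from the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of similarity scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def crtr_loss(sim_text, sim_image, tau=1.0):
    """Symmetric KL between per-sequence candidate distributions of two modalities.
    sim_text / sim_image: B x B lists of sequence-to-candidate cosine similarities.
    tau=1.0 is an assumed temperature."""
    total = 0.0
    for row_t, row_v in zip(sim_text, sim_image):
        p_t = softmax([s / tau for s in row_t])
        p_v = softmax([s / tau for s in row_v])
        total += kl(p_t, p_v) + kl(p_v, p_t)
    return total / (2 * len(sim_text))

# Hypothetical in-batch similarities for two sequences and three candidates.
sim_text = [[0.9, 0.1, -0.3], [0.2, 0.8, 0.0]]
# Same relative structure, different absolute values (every score lowered by 0.1).
sim_image_shifted = [[s - 0.1 for s in row] for row in sim_text]
# Reversed ranking of candidates within each row.
sim_image_flipped = [list(reversed(row)) for row in sim_text]

loss_same = crtr_loss(sim_text, sim_text)
loss_shifted = crtr_loss(sim_text, sim_image_shifted)
loss_flipped = crtr_loss(sim_text, sim_image_flipped)
print(loss_same, loss_shifted, loss_flipped)
```

The shifted case incurs essentially no penalty even though the raw similarities differ, whereas the flipped ranking is penalized. That is the sense in which this kind of regularizer constrains relative structure without forcing the two modalities' embeddings to coincide.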
Formally, for each modality $m \in \{T, V\}$, we perform relational topology alignment between the sequence representations and a candidate item representation set, denoted as $\mathcal{C}^m$. In this work, we instantiate $\mathcal{C}^m$ using in-batch target items, i.e., $\mathcal{C}^m = \{\mathbf{e}_j^m\}_{j=1}^B$, for computational efficiency. We then compute the similarity scores $\mathbf{S}_i^m \in \mathbb{R}^B$ (the $i$-th row of the similarity matrix) between the $i$-th sequence $\mathbf{z}^m_i$ and the $j$-th candidate item in $\mathcal{C}^m$ as follows:

$$\mathbf{S}^m_{i,j} = s(\mathbf{z}^m_i, \mathbf{e}^m_j)/\tau_{\mathrm{CRTR}}, \tag{11}$$

where $\tau_{\mathrm{CRTR}}$ is a temperature parameter. We then apply a row-wise softmax to convert these similarities $\mathbf{S}^m_i$ into a probability distribution $\mathbf{P}^m_i \in \mathbb{R}^B$:

$$\mathbf{P}_{i,j}^m = \frac{\exp(\mathbf{S}_{i,j}^m)}{\sum_{k=1}^{|\mathcal{C}^m|} \exp(\mathbf{S}_{i,k}^m)} \tag{12}$$

$\mathbf{P}_i^m$ captures how the $i$-th sequence ranks the candidate items relative to one another within modality $m$’s representation space. We then align these relational topologies across modalities by minimizing the bidirectional Kullback-Leibler (KL) divergence between $\mathbf{P}^T_i$ and $\mathbf{P}^V_i$ for each $i$:

$$\mathcal{L}_{\mathrm{CRTR}} = \frac{1}{2B} \sum_{i=1}^{B} \left( \mathrm{KL}(\mathbf{P}_i^T \,\|\, \mathbf{P}_i^V) + \mathrm{KL}(\mathbf{P}_i^V \,\|\, \mathbf{P}_i^T) \right) \tag{13}$$

Consequently, $\mathcal{L}_{\mathrm{CRTR}}$ guides the weak modality to maintain a consistent semantic neighborhood structure with the strong modality. This effectively suppresses the geometric distortion that may arise from the aggressive negative pushing of $\mathcal{L}_{\mathrm{WPCL}}$, stabilizing the multimodal fusion.

4.2.3. Final Objective

Finally, we combine the proposed loss functions to optimize the model jointly. The total objective function $\mathcal{L}$ is defined as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{WPCL}} + \lambda \cdot \mathcal{L}_{\mathrm{CRTR}} \tag{14}$$

where $\lambda$ is a hyperparameter that balances discrimination and structural consistency. This combination balances $\mathcal{L}_{\mathrm{WPCL}}$’s discriminative push with $\mathcal{L}_{\mathrm{CRTR}}$’s structural alignment to ensure a stable and robust multimodal space.

Table 4. Performance comparison on Task 2 across various methods.
The best results are highlighted in bold, second-best results are underlined, and * denotes statistical significance with p-values < 0.05, based on paired t-tests over 5 random seeds.

Method | Toys H@10 | Toys H@20 | Toys N@10 | Toys N@20 | Beauty H@10 | Beauty H@20 | Beauty N@10 | Beauty N@20 | Clothing H@10 | Clothing H@20 | Clothing N@10 | Clothing N@20 | Sports H@10 | Sports H@20 | Sports N@10 | Sports N@20
Backbone: GRU4Rec
GRU4Rec | 0.3591 | 0.4596 | 0.2295 | 0.2548 | 0.3903 | 0.4898 | 0.2624 | 0.2875 | 0.3450 | 0.4781 | 0.2025 | 0.2360 | 0.3500 | 0.4558 | 0.2165 | 0.2431
Text-based Models
BERT | 0.3320 | 0.4419 | 0.1890 | 0.2248 | 0.3704 | 0.4847 | 0.2349 | 0.2637 | 0.3325 | 0.4759 | 0.1875 | 0.2236 | 0.3402 | 0.4707 | 0.1995 | 0.2325
SLIM | 0.3456 | 0.4606 | 0.2146 | 0.2435 | 0.3728 | 0.4929 | 0.2500 | 0.2732 | 0.3265 | 0.4677 | 0.1842 | 0.2196 | 0.4209 | 0.5473 | 0.2612 | 0.2931
SLIM^+ | 0.4105 | 0.5259 | 0.2632 | 0.2923 | 0.4383 | 0.5465 | 0.2933 | 0.3204 | 0.3761 | 0.5141 | 0.2222 | 0.2570 | 0.4507 | 0.5847 | 0.2794 | 0.3132
LLMEmb | 0.4316 | 0.5405 | 0.2755 | 0.3033 | 0.4593 | 0.5583 | 0.2957 | 0.3286 | 0.3929 | 0.5264 | 0.2397 | 0.2748 | 0.4608 | 0.5892 | 0.2917 | 0.3209
LLM2Rec | 0.4107 | 0.5261 | 0.2634 | 0.2926 | 0.4396 | 0.5480 | 0.2938 | 0.3212 | 0.3820 | 0.5261 | 0.2263 | 0.2625 | 0.4480 | 0.5831 | 0.2786 | 0.3127
LLM_Vanilla | 0.3759 | 0.4943 | 0.2310 | 0.2609 | 0.4215 | 0.5384 | 0.2728 | 0.3023 | 0.3684 | 0.5025 | 0.2165 | 0.2500 | 0.4260 | 0.5703 | 0.2523 | 0.2887
LLM_SFT | 0.4381 | 0.5568 | 0.2757 | 0.3082 | 0.4603 | 0.5690 | 0.2979 | 0.3253 | 0.4092 | 0.5485 | 0.2467 | 0.2825 | 0.4687 | 0.6026 | 0.2934 | 0.3273
Multimodal-based Models
CLIP | 0.3684 | 0.4646 | 0.2125 | 0.2339 | 0.4324 | 0.5378 | 0.2862 | 0.3127 | 0.3561 | 0.4684 | 0.2193 | 0.2476 | 0.3796 | 0.4950 | 0.2305 | 0.2597
NoteLLM-2 | 0.4187 | 0.5436 | 0.2656 | 0.2909 | 0.4557 | 0.5588 | 0.3028 | 0.3266 | 0.4006 | 0.5256 | 0.2351 | 0.2685 | 0.4638 | 0.5934 | 0.2804 | 0.3090
VLM_Prompt | 0.3890 | 0.5002 | 0.2465 | 0.2746 | 0.4252 | 0.5415 | 0.2752 | 0.3044 | 0.3791 | 0.5135 | 0.2269 | 0.2607 | 0.4297 | 0.5731 | 0.2558 | 0.2920
VLM_Vanilla (Int.) | 0.4000 | 0.5105 | 0.2557 | 0.2836 | 0.4376 | 0.5485 | 0.2933 | 0.3212 | 0.3640 | 0.5040 | 0.2144 | 0.2498 | 0.4456 | 0.5806 | 0.2764 | 0.3105
VLM_Vanilla (Ext.) | 0.3983 | 0.5099 | 0.2565 | 0.2844 | 0.4358 | 0.5427 | 0.2882 | 0.3094 | 0.3698 | 0.5081 | 0.2185 | 0.2533 | 0.4432 | 0.5768 | 0.2746 | 0.3083
VLM_SFT (Int.) | 0.4204 | 0.5399 | 0.2743 | 0.3005 | 0.4393 | 0.5383 | 0.2930 | 0.3151 | 0.4093 | 0.5339 | 0.2413 | 0.2741 | 0.4594 | 0.6003 | 0.2869 | 0.3219
VLM_SFT (Ext.) | 0.4363 | 0.5505 | 0.2834 | 0.3122 | 0.4511 | 0.5604 | 0.3021 | 0.3296 | 0.4130 | 0.5432 | 0.2527 | 0.2856 | 0.4662 | 0.6193 | 0.2945 | 0.3254
VLM2Rec | 0.4684* | 0.5869* | 0.3070* | 0.3369* | 0.5064* | 0.6185* | 0.3386* | 0.3669* | 0.4405* | 0.5778* | 0.2711* | 0.3057* | 0.5021* | 0.6468* | 0.3083* | 0.3449*
Improv. (%) | 6.92 | 5.41 | 8.21 | 7.91 | 10.02 | 8.71 | 11.82 | 11.32 | 6.66 | 5.34 | 7.28 | 7.04 | 7.13 | 4.44 | 4.69 | 5.38
Backbone: SASRec
SASRec | 0.3812 | 0.4780 | 0.2613 | 0.2856 | 0.4280 | 0.5283 | 0.2984 | 0.3236 | 0.3442 | 0.4757 | 0.2097 | 0.2428 | 0.3834 | 0.4935 | 0.2448 | 0.2726
Text-based Models
BERT | 0.3549 | 0.4752 | 0.2347 | 0.2450 | 0.3608 | 0.4767 | 0.2319 | 0.2611 | 0.3111 | 0.4495 | 0.1793 | 0.2141 | 0.3409 | 0.4544 | 0.2019 | 0.2371
SLIM | 0.3570 | 0.4765 | 0.2302 | 0.2602 | 0.3546 | 0.4723 | 0.2269 | 0.2566 | 0.3034 | 0.4343 | 0.1773 | 0.2102 | 0.3803 | 0.4909 | 0.2423 | 0.2702
SLIM^+ | 0.4486 | 0.5626 | 0.3078 | 0.3365 | 0.4683 | 0.5748 | 0.3255 | 0.3523 | 0.3566 | 0.4859 | 0.2145 | 0.2470 | 0.4680 | 0.6012 | 0.2988 | 0.3324
LLMEmb | 0.4441 | 0.5570 | 0.3047 | 0.3331 | 0.4735 | 0.5888 | 0.3267 | 0.3566 | 0.3822 | 0.5262 | 0.2422 | 0.2748 | 0.4911 | 0.6260 | 0.3096 | 0.3454
LLM2Rec | 0.4362 | 0.5468 | 0.2995 | 0.3273 | 0.4657 | 0.5726 | 0.3244 | 0.3513 | 0.3732 | 0.5128 | 0.2245 | 0.2596 | 0.4735 | 0.6106 | 0.3013 | 0.3359
LLM_Vanilla | 0.4374 | 0.5564 | 0.2912 | 0.3212 | 0.4440 | 0.5604 | 0.2932 | 0.3226 | 0.3793 | 0.5153 | 0.2276 | 0.2620 | 0.4284 | 0.5728 | 0.2579 | 0.2944
LLM_SFT | 0.4636 | 0.5735 | 0.3066 | 0.3386 | 0.4956 | 0.5912 | 0.3266 | 0.3553 | 0.4245 | 0.5698 | 0.2554 | 0.2920 | 0.4924 | 0.6321 | 0.3116 | 0.3470
Multimodal-based Models
CLIP | 0.3745 | 0.4672 | 0.2570 | 0.2804 | 0.3846 | 0.4733 | 0.2618 | 0.2842 | 0.3328 | 0.4884 | 0.2374 | 0.2662 | 0.3833 | 0.4991 | 0.2366 | 0.2659
NoteLLM-2 | 0.4601 | 0.5637 | 0.2938 | 0.3365 | 0.4755 | 0.5843 | 0.3217 | 0.3507 | 0.4073 | 0.5347 | 0.2577 | 0.2547 | 0.4803 | 0.6055 | 0.2983 | 0.3163
VLM_Prompt | 0.4440 | 0.5605 | 0.2976 | 0.3268 | 0.4475 | 0.5614 | 0.2980 | 0.3267 | 0.3862 | 0.5228 | 0.2368
0.2713 0.4455 0.5936 0.2700 0.3074 VLMVanilla(Int.)VLM_Vanilla(Int.) 0.4444 0.5560 0.3033 0.3314 0.4597 0.5655 0.3191 0.3457 0.3830 0.5066 0.2455 0.2847 0.4332 0.5593 0.2764 0.3082 VLMVanilla(Ext.)VLM_Vanilla(Ext.) 0.4469 0.5562 0.3015 0.3322 0.4535 0.5605 0.3141 0.3408 0.3819 0.5236 0.2307 0.2663 0.4426 0.5702 0.2826 0.3148 VLMSFT(Int.)VLM_SFT(Int.) 0.4557 0.5526 0.3099 0.3325 0.4783 0.5832 0.3214 0.3468 0.4296 0.5394 0.2593 0.3003 0.4823 0.6055 0.3023 0.3204 VLMSFT(Ext.)VLM_SFT(Ext.) 0.4605 0.5718 0.3081 0.3356 0.4805 0.5890 0.3298 0.3570 0.4334 0.5647 0.2704 0.3036 0.4883 0.6153 0.3120 0.3441 VLM2Rec 0.4889* 0.6074* 0.3340* 0.3639* 0.5346* 0.6466* 0.3649* 0.3932* 0.4585* 0.6004* 0.2899* 0.3257* 0.5273* 0.6737* 0.3287* 0.3657* Improv. (%) 5.46 5.91 7.78 7.47 7.87 9.37 10.64 10.14 5.79 5.37 7.21 7.28 7.09 6.58 5.35 5.39 5. Experiments 5.1. Experimental Setup 5.1.1. Implementation Details We utilize four Amazon111https://jmcauley.ucsd.edu/data/amazon/ domains (McAuley et al., 2015) (Toys, Beauty, Clothing, Sports) with 5-core filtering (Kang and McAuley, 2018), excluding items missing titles or images. Statistics of datasets are reported in Tab. 3. Following (He et al., 2025), we set the max sequence length to 10 and use leave-one-out protocol (Kang and McAuley, 2018; Ren et al., 2020; Zhou et al., 2020). Hyperparameters are tuned via grid search: τWPCL∈0.05,0.1,0.5,1.0 _WPCL∈\0.05,0.1,0.5,1.0\, τCRTR∈0.001,0.01,0.1,1.0 _CRTR∈\0.001,0.01,0.1,1.0\, and λ∈0.1,0.5,1.0λ∈\0.1,0.5,1.0\. We employ LoRA (Hu et al., 2022) (rank=16,alpha=32,dropout=0.2rank=16,alpha=32,dropout=0.2) on Qwen2.5-VL-3B (Bai et al., 2025) (Qwen2.5-3B (Qwen et al., 2024) for LLM), which also serves as the backbone for all baselines to ensure fairness. Training runs for 3 epochs using AdamW (Loshchilov and Hutter, 2017) (learning rate 1e-51e-5, batch size 8), gradient checkpointing on a single RTX 3090. 
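To make the roles of $\tau_{\mathrm{CRTR}}$ and $\lambda$ from the grid above concrete, the following is a minimal, forward-only sketch of Eqs. (11)–(14) in plain Python. It is an illustration under stated assumptions, not the authors' implementation: cosine similarity as the choice of $s(\cdot,\cdot)$, all function names, the toy vectors, and the default values are hypothetical, and $\mathcal{L}_{\mathrm{WPCL}}$ is passed in as a precomputed number because its definition lies outside this section.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; an ASSUMED choice for the similarity s(., .)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relational_distribution(seq_reps, item_reps, tau):
    """Eqs. (11)-(12): temperature-scaled similarities, then row-wise softmax."""
    return [softmax([cosine(z, e) / tau for e in item_reps]) for z in seq_reps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def crtr_loss(z_text, e_text, z_img, e_img, tau_crtr=0.1):
    """Eq. (13): bidirectional KL between text and image relational distributions."""
    p_t = relational_distribution(z_text, e_text, tau_crtr)
    p_v = relational_distribution(z_img, e_img, tau_crtr)
    batch = len(p_t)
    total = sum(kl(pt, pv) + kl(pv, pt) for pt, pv in zip(p_t, p_v))
    return total / (2 * batch)


def total_loss(wpcl_value, crtr_value, lam=0.5):
    """Eq. (14): L = L_WPCL + lambda * L_CRTR (L_WPCL supplied as a number)."""
    return wpcl_value + lam * crtr_value


# Toy batch (B = 2) with 2-d vectors; all values are made up for illustration.
z_t = [[1.0, 0.0], [0.0, 1.0]]   # text sequence representations
e_t = [[1.0, 0.1], [0.1, 1.0]]   # text in-batch target items
z_v = [[0.6, 0.4], [0.5, 0.5]]   # image sequence representations
e_v = [[1.0, 0.1], [0.1, 1.0]]   # image in-batch target items

loss = crtr_loss(z_t, e_t, z_v, e_v, tau_crtr=0.1)
print(loss, total_loss(1.0, loss, lam=0.5))
```

In this sketch the loss is zero when both modalities rank the in-batch candidates identically and grows as the rankings diverge; a smaller `tau_crtr` sharpens both distributions and therefore penalizes ranking disagreements more strongly.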
For experiments that require full-parameter tuning or larger models, we train on a single A100 80GB. All other baseline settings follow the respective original papers.

5.1.2. Baselines

To demonstrate the effectiveness of VLM2Rec, we compare it against baselines across five categories: (1) ID-based SR models, including the RNN-based GRU4Rec (Hidasi et al., 2015) and the transformer-based SASRec (Kang and McAuley, 2018); (2) small pretrained encoders, using BERT (google-bert/bert-large-uncased; Devlin et al., 2019) for text and CLIP (openai/clip-vit-large-patch14; Radford et al., 2021) for multimodal settings; (3) large foundation models in both vanilla ($\mathrm{LLM}_{\mathrm{Vanilla}}$, $\mathrm{VLM}_{\mathrm{Vanilla}}$) and SFT variants ($\mathrm{LLM}_{\mathrm{SFT}}$, $\mathrm{VLM}_{\mathrm{SFT}}$); (4) recommendation-specific embedders, comprising LLM-based frameworks (LLMEmb (Liu et al., 2025), LLM2Rec (He et al., 2025), SLIM (Wang et al., 2024b), SLIM$^+$) and VLM-based embedders ($\mathrm{VLM}_{\mathrm{Prompt}}$ (Pomo et al., 2025), NoteLLM-2 (Zhang et al., 2025a)); and (5) standard fusion strategies, Internal (Int.) and External (Ext.) fusion, to validate our encoding approach.

5.1.3. Evaluation Settings

For all experiments, we use 100 negative samples and report Hit Rate (H@K) and Normalized Discounted Cumulative Gain (N@K) at $K \in \{10, 20\}$, averaged over 5 random seeds. As shown in Fig. 2 (left), we evaluate embedding quality via two real-world tasks:

Task 1) Direct Recommendation. Adopting the standard retrieval setting, we rank items based on vector similarity to verify that the embeddings capture CF signals while retaining rich semantics (Tab. 2). To enable comparison with methods that produce only item embeddings, we derive sequence representations by mean-pooling historical item embeddings.

Task 2) Downstream SR Model Initialization. We test whether the embeddings provide a transferable initialization for standard SR backbones (e.g., GRU4Rec (Hidasi et al., 2015), SASRec (Kang and McAuley, 2018)), with results shown in Tab. 4.
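The sampled-negative protocol of Section 5.1.3 and the mean-pooled sequence representation used in Task 1 can be illustrated with a short sketch. This is a hedged example rather than the paper's evaluation code: cosine similarity as the "vector similarity", the optimistic tie handling, the function names, and the toy vectors are all assumptions, and only 3 negatives are used here instead of 100.

```python
import math


def mean_pool(item_embeddings):
    """Sequence representation as the element-wise mean of item embeddings."""
    dim = len(item_embeddings[0])
    count = len(item_embeddings)
    return [sum(vec[d] for vec in item_embeddings) / count for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; an ASSUMED choice of vector similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_target(seq_rep, target, negatives):
    """0-based rank of the target among the target plus sampled negatives."""
    target_score = cosine(seq_rep, target)
    # Count negatives scoring strictly higher (ties resolved in the target's favor).
    return sum(1 for neg in negatives if cosine(seq_rep, neg) > target_score)


def hit_at_k(rank, k):
    """H@K: 1 if the target appears in the top-K positions, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k):
    """N@K with a single relevant item: ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# Toy example with 2-d embeddings (values are made up for illustration).
history = [[1.0, 0.0], [0.8, 0.2]]                   # historical item embeddings
target = [0.7, 0.3]                                  # held-out next item
negatives = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]    # sampled negatives

seq = mean_pool(history)
rank = rank_of_target(seq, target, negatives)
print(rank, hit_at_k(rank, 10), ndcg_at_k(rank, 10))
```

In a full evaluation these per-user values would be averaged over all test users and then over the random seeds.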
Dimensions are matched to the backbone hidden size ($d = 128$) via a 1-layer linear adapter. Where a baseline's original paper describes an adaptation method, we follow it.

5.2. Task 1: Direct Recommendation

Text vs. Multimodal Signals. Even with small encoders, visual signals are beneficial (CLIP > BERT), and VLM-based embeddings outperform LLM-only embeddings in the vanilla setting, suggesting that visual cues are complementary.

Capacity and CF Injection. While large foundation models outperform small encoders, vanilla variants still lag behind ID-based SR, indicating that semantics alone are insufficient. SFT models surpass ID baselines, confirming that injecting CF signals into item embeddings is essential.

Importance of Sequence-level SFT. SFT models consistently outperform recommendation-specific SOTA models. The latter lack explicit sequence-level optimization of the representation space and therefore fail to capture the transition signals essential for the SR task.

Modality Paradox. Despite $\mathrm{VLM}_{\mathrm{Vanilla}} > \mathrm{LLM}_{\mathrm{Vanilla}}$, $\mathrm{VLM}_{\mathrm{SFT}}$ often lags behind $\mathrm{LLM}_{\mathrm{SFT}}$. This supports our analysis: SFT induces modality collapse, causing under-optimized visual embeddings to act as negative transfer.

Fusion Strategy Reversal. While internal fusion favors vanilla models, external fusion becomes superior after SFT, suggesting that it mitigates cross-modal interference during optimization.

5.3. Task 2: Downstream SR Model Initialization

Correlation with Task 1. Performance trends in Task 2 follow those in Task 1, confirming that an initialization enriched with semantics and CF signals provides a superior optimization starting point, allowing backbones to focus on refining complex patterns.

Comparison with Rec-trained Models. While fine-tuned baselines incorporate CF signals, they remain suboptimal for SR initialization due to structural limitations.
LLMEmb relies on indirect distillation, whereas generative models (SLIM$^+$, LLM2Rec) optimize next-token probabilities rather than the geometry of the representation space. Furthermore, item-centric approaches (LLM2Rec, NoteLLM-2) fail to encode sequence transition dynamics, resulting in poor alignment with downstream sequential tasks.

Backbone-agnostic Robustness. VLM2Rec shows the most consistent performance gains across both the RNN-based GRU4Rec and the Transformer-based SASRec backbones. This indicates that it captures general sequence dynamics independent of backbone-specific inductive biases, serving as a robust, plug-and-play initializer.

5.4. Further Analysis

5.4.1. Ablation Study

Table 5 validates the efficacy of each component. Removing $\mathcal{L}_{\mathrm{WPCL}}$ causes the sharpest drop, confirming its fundamental role in injecting CF signals and adapting the VLM for retrieval. The performance gain from $w_{\mathrm{pen}}$ over standard SFT indicates that penalizing the weak modality mitigates shortcut learning. The stop-gradient is crucial; without it, the model minimizes the modality gap by degrading the strong modality rather than improving the weak one. Furthermore, $\mathcal{L}_{\mathrm{CRTR}}$ acts as a necessary regularizer, maintaining geometric consistency against the aggressive negative pushing of $\mathcal{L}_{\mathrm{WPCL}}$.

Table 5. Ablation study of the detailed mechanisms and fusion strategies on Task 1 and Task 2 (SASRec) for Beauty and Toys (N@20).

| Variants | Beauty Task 1 | Beauty Task 2 | Toys Task 1 | Toys Task 2 |
|---|---|---|---|---|
| VLM2Rec (Ours) | 0.4121 | 0.3932 | 0.3893 | 0.3639 |
| w/o $\mathcal{L}_{\mathrm{WPCL}}$ | 0.2592 | 0.3605 | 0.2318 | 0.3486 |
| → w/ $\mathcal{L}_{\mathrm{WPCL}}$ w/o Stop Grad. | 0.4031 | 0.3774 | 0.3806 | 0.3514 |
| → w/ $\mathcal{L}_{\mathrm{WPCL}}$ w/o $w_{\mathrm{pen}}$ | 0.3985 | 0.3709 | 0.3712 | 0.3472 |
| w/o $\mathcal{L}_{\mathrm{CRTR}}$ | 0.4058 | 0.3785 | 0.3802 | 0.3515 |

5.4.2. Analysis of Resolving Modality Collapse

As shown in Fig. 1(b), VLM2Rec substantially mitigates the rapid drop in weak-modality gradient contribution observed during standard SFT.
This is driven by $\mathcal{L}_{\mathrm{WPCL}}$, which increases penalties on weak-modality negatives to restore discriminative learning, and $\mathcal{L}_{\mathrm{CRTR}}$, which stabilizes cross-modal geometry. Table 1 confirms that VLM2Rec fully recovers from the geometric collapse in the image space (previously $S \leq 1$ with degraded $U$): all modalities achieve $S > 1$ with improved uniformity $U$, indicating clear positive–negative separation and better space utilization. Moreover, while the fused space previously mirrored the text space, VLM2Rec increases the effective contribution of the image modality. As a result, Fig. 1(a) shows a marked improvement in v2f, and multimodal fusion becomes synergistic rather than harmful, with f2f consistently outperforming t2f and v2f across datasets. On Beauty, in particular, VLM2Rec slightly reduces over-reliance on text while substantially strengthening visual utilization, yielding the best fused performance. Overall, these results show that VLM2Rec converts multimodal signals into recommendation gains by balancing modality gradients and preventing representation collapse.

5.4.3. Generalization Analysis via Rich Semantics

Table 6. Performance comparison on cross-domain recommendation across Task 1 and Task 2 (N@20).

| Model | Clothing → Beauty, Task 1 | Clothing → Beauty, Task 2 | Sports → Clothing, Task 1 | Sports → Clothing, Task 2 |
|---|---|---|---|---|
| LLMEmb | 0.2676 | 0.3408 | 0.2589 | 0.2667 |
| $\mathrm{LLM}_{\mathrm{SFT}}$ | 0.1899 | 0.3326 | 0.1179 | 0.2600 |
| NoteLLM-2 | 0.1185 | 0.3237 | 0.0747 | 0.2988 |
| $\mathrm{VLM}_{\mathrm{SFT}}$ | 0.2566 | 0.3449 | 0.2362 | 0.2955 |
| VLM2Rec | 0.3372 | 0.3840 | 0.3154 | 0.3214 |

Figure 3. Cold-start evaluation on the Beauty dataset for Task 2 (SASRec), grouped by target item frequency in the training set.

We examine generalization when CF signals are scarce or unreliable via two settings: cross-domain transfer (Tab. 6) and cold-start items (Fig. 3). In both cases, effective recommendation depends more on rich modality semantics and sequence reasoning than on memorized co-occurrence.
For cross-domain transfer, we perform zero-shot evaluation by training on a source domain and testing directly on a target domain, where domain shift makes CF regularities less reliable. VLM2Rec achieves the best results on both transfers and both tasks, indicating that its balanced multimodal representations capture more transferable preference patterns and avoid domain-specific overfitting. For cold-start items, we bucket target items by training frequency and evaluate Task 2. As frequency decreases, performance increasingly reflects semantic exploitation rather than CF memorization. VLM2Rec combines deep item understanding with sequence reasoning to infer preferences solely from attributes. These results demonstrate that balanced multimodal semantics and sequence-aware alignment enable robust recommendations even when historical interactions are weak or absent.

5.4.4. Analysis of Model Scalability and Robustness

Table 7. Performance (N@20) and efficiency comparison for Task 1 and Task 2. K is the number of sampled users in the training set. Training time is reported in minutes per epoch, and item embedding generation time in seconds.

| Model | Input | Task 1 | Task 2 | Train (min) | Emb. (sec) |
|---|---|---|---|---|---|
| Beauty | | | | | |
| LLMEmb | Item | 0.2810 | 0.3566 | 22 | 58 |
| LLM2Rec | Seq./Item | 0.2227 | 0.3513 | 33 | 58 |
| $\mathrm{LLM}_{\mathrm{SFT}}$ | Seq. | 0.3434 | 0.3553 | 35 | 56 |
| NoteLLM-2 | Item | 0.1331 | 0.3507 | 26 | 132 |
| $\mathrm{VLM}_{\mathrm{SFT}}$ | Seq. | 0.3300 | 0.3570 | 117 | 113 |
| VLM2Rec (K=128) | Seq. | 0.2038 | 0.3635 | 1 | 113 |
| VLM2Rec (K=256) | | 0.2377 | 0.3633 | 2 | |
| VLM2Rec (K=512) | | 0.2816 | 0.3648 | 4 | |
| VLM2Rec (K=1024) | | 0.3345 | 0.3678 | 9 | |
| VLM2Rec (Full) | | 0.4121 | 0.3932 | 118 | |
| Toys | | | | | |
| LLMEmb | Item | 0.2783 | 0.3331 | 15 | 38 |
| LLM2Rec | Seq./Item | 0.2262 | 0.3273 | 26 | 39 |
| $\mathrm{LLM}_{\mathrm{SFT}}$ | Seq. | 0.3061 | 0.3386 | 26 | 39 |
| NoteLLM-2 | Item | 0.1582 | 0.3365 | 20 | 91 |
| $\mathrm{VLM}_{\mathrm{SFT}}$ | Seq. | 0.2647 | 0.3356 | 92 | 79 |
| VLM2Rec (K=128) | Seq. | 0.2278 | 0.3430 | 1 | 79 |
| VLM2Rec (K=256) | | 0.2490 | 0.3426 | 2 | |
| VLM2Rec (K=512) | | 0.2997 | 0.3451 | 4 | |
| VLM2Rec (K=1024) | | 0.3266 | 0.3475 | 7 | |
| VLM2Rec (Full) | | 0.3893 | 0.3639 | 95 | |

To verify generalizability, we evaluate various VLM families (e.g., Qwen2.5-VL (Bai et al., 2025), InternVL3 (Zhu et al., 2025), LLaVA-1.5 (Liu et al., 2023)) with parameter sizes ranging from 2B to 32B, as shown in Fig. 4. Models within similar parameter groups (e.g., 2–3B, 7–8B) perform comparably, indicating robustness across architectures. Overall, performance improves with capacity by leveraging richer prior knowledge, suggesting that parameter size can be chosen according to deployment objectives.

5.4.5. Computational Cost and Few-shot Efficiency

Tab. 7 reports training time and item embedding generation time across LLM/VLM methods. While VLM-based models ($\mathrm{VLM}_{\mathrm{SFT}}$, VLM2Rec) incur higher costs than text-only LLMs due to image token processing, the similar runtimes of VLM2Rec and $\mathrm{VLM}_{\mathrm{SFT}}$ confirm that our objectives introduce little overhead. To mitigate training costs, we train VLM2Rec with only K randomly sampled users: even at $K = 128$ it outperforms some full-training baselines on Task 1 and surpasses most methods on Task 2, and at $K = 1024$ (about 5–6% of the training data) it matches the strongest baseline while substantially reducing training time. Consequently, our framework provides a tunable trade-off, allowing practitioners to substantially reduce training time while maintaining competitive performance under varying constraints.

5.4.6. Hyperparameter Sensitivity

We conduct a sensitivity analysis of VLM2Rec by varying each hyperparameter across its search space. In Fig. 5, VLM2Rec consistently outperforms the baselines with minimal variance across hyperparameter values, demonstrating its stability.
This reduces the additional tuning cost often required by multi-objective training, making VLM2Rec a practical and stable solution across diverse domains.

Figure 4. Performance of VLM2Rec across various VLM families and parameter sizes, reporting N@20 for Task 1 and Task 2.

Figure 5. Impact of hyperparameters $\tau_{\mathrm{WPCL}}$, $\tau_{\mathrm{CRTR}}$, and $\lambda$.

6. Related Works

Multimodal Sequential Recommendation. While early fusion strategies (Yuan et al., 2023; Cui et al., 2018; Hu et al., 2023) and modern architectures (Wang et al., 2023; Bian et al., 2023; Hou et al., 2022) incorporate side information, they typically rely on frozen encoders. Unlike ID-based models (Kang and McAuley, 2018; Hidasi et al., 2015), where learnable tables absorb collaborative signals, frozen encoders shift the learning burden to the backbone. This necessitates a shift toward CF-aware modality embeddings that enable direct ranking without ID dependence.

Large Model Embedders for Multimodal Recommendation. Recent NLP work repurposes LLMs as high-capacity encoders for representation learning (Li et al., 2025; Wang et al., 2024a; Lee et al., 2024; Li et al., 2024; Tao et al., 2024). Following this trend, SLIM (Wang et al., 2024b) distills sequence knowledge from ChatGPT (Achiam et al., 2023), LLMEmb (Liu et al., 2025) learns discriminative item embeddings and injects CF signals via pre-trained ID embedding guidance, and LLM2Rec (He et al., 2025) injects CF through generative next-item prediction and an item-level contrastive stage. In multimodal settings, $\mathrm{VLM}_{\mathrm{Prompt}}$ (Pomo et al., 2025) leverages zero-shot prompting, and NoteLLM-2 (Zhang et al., 2025a) focuses on enhancing the visual representation in multimodal embeddings. However, these methods largely emphasize item-level discrimination or inject sequence-level CF indirectly (often via generative objectives), which does not explicitly shape a sequence–item representation space for SR.
Our work encodes image sequences alongside text using VLM multi-image reasoning (Bai et al., 2025; Liu et al., 2023; Zhu et al., 2025), explicitly internalizing sequence-level CF signals into a multimodal embedding space.

Modality Collapse in Multimodal Learning. Multimodal models often over-rely on an easier modality, under-utilizing the others. Prior studies analyze how dataset/model biases induce imbalanced optimization and propose mitigation strategies (Wang et al., 2020; Huang et al., 2022; Wu et al., 2022; Guo et al., 2023; Sim et al., 2025), while related work shows that VLM embeddings can become organized around a dominant modality (Shi et al., 2023; Zhang et al., 2023; Liang et al., 2022). We empirically show that this imbalance persists in SR and is amplified by standard contrastive SFT, even though SFT is necessary for CF injection. Accordingly, our framework explicitly enforces balanced modality utilization to stabilize the learned representation geometry.

7. Conclusion

In this work, we propose VLM2Rec, a novel framework that leverages VLMs as embedders for multimodal sequential recommendation, encoding both visual and textual sequences to inject sequence-level collaborative filtering signals. Our analysis reveals that the intrinsic modality bias of VLMs leads to representation collapse, a critical issue that is exacerbated by standard fine-tuning and hinders recommendation accuracy. To address this issue, VLM2Rec dynamically identifies the weak modality during training and explicitly improves its discriminability while preserving cross-modal consistency. Extensive experiments on public benchmarks demonstrate that our method consistently improves both direct ranking and downstream SR initialization across model families and settings.

References

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774.
D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242.
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. (2025) Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
S. Bian, X. Pan, W. X. Zhao, J. Wang, C. Wang, and J. Wen (2023) Multi-modal mixture of experts represetation learning for sequential recommendation. In Proceedings of the 32nd ACM international conference on information and knowledge management, pp. 110–119.
Q. Cui, S. Wu, Q. Liu, W. Zhong, and L. Wang (2018) MV-RNN: a multi-view recurrent neural network for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering 32 (2), pp. 317–331.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186.
Y. Guo, L. Nie, H. Cheng, Z. Cheng, M. Kankanhalli, and A. Del Bimbo (2023) On modality bias recognition and reduction. ACM Transactions on Multimedia Computing, Communications and Applications 19 (3), pp. 1–22.
Y. He, X. Liu, A. Zhang, Y. Ma, and T. Chua (2025) LLM2Rec: large language models are powerful embedding models for sequential recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, KDD ’25, New York, NY, USA, pp. 896–907. ISBN 9798400714542.
B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
Y. Hou, S. Mu, W. X. Zhao, Y. Li, B. Ding, and J. Wen (2022) Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp. 585–593.
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. ICLR 1 (2), p. 3.
H. Hu, W. Guo, Y. Liu, and M. Kan (2023) Adaptive multi-modalities fusion in sequential recommendation systems. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 843–853.
Y. Huang, J. Lin, C. Zhou, H. Yang, and L. Huang (2022) Modality competition: what makes joint training of multi-modal network fail in deep learning? (provably). In International conference on machine learning, pp. 9226–9259.
W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM), pp. 197–206.
J. Kwon, M. Kim, E. Lee, J. Choi, and Y. Kim (2025) See-saw modality balance: see gradient, and sew impaired vision-language balance to mitigate dominant modality bias. arXiv preprint arXiv:2503.13834.
C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping (2024) NV-Embed: improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428.
C. Li, M. Qin, S. Xiao, J. Chen, K. Luo, Y. Shao, D. Lian, and Z. Liu (2024) Making text embedders few-shot learners. arXiv preprint arXiv:2409.15700.
S. Li, Y. Tang, R. Liu, S. Chen, and X. Chen (2025) Conan-embedding-v2: training an LLM from scratch for text embeddings. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 15011–15027.
V. W. Liang, Y. Zhang, Y. Kwon, S. Yeung, and J. Y. Zou (2022) Mind the gap: understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems 35, pp. 17612–17625.
H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA.
Q. Liu, X. Wu, W. Wang, Y. Wang, Y. Zhu, X. Zhao, F. Tian, and Y. Zheng (2025) LLMEmb: large language model can be a good embedding generator for sequential recommendation. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’25/IAAI’25/EAAI’25. ISBN 978-1-57735-897-8.
L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. In International Conference on Learning Representations.
I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 43–52.
J. Nam, H. Cha, S. Ahn, J. Lee, and J. Shin (2020) Learning from failure: de-biasing classifier from biased classifier. Advances in Neural Information Processing Systems 33, pp. 20673–20684.
A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
C. Pomo, M. Attimonelli, D. Danese, F. Narducci, and T. Di Noia (2025) Do recommender systems really leverage multimodal content? A comprehensive analysis on multimodal representations for recommendation. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, CIKM ’25, New York, NY, USA, pp. 2377–2387. ISBN 9798400720406.
R. Qiu, Z. Huang, H. Yin, and Z. Wang (2022) Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the fifteenth ACM international conference on web search and data mining, pp. 813–823.
Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024) Qwen2.5 technical report. arXiv preprint.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763.
R. Ren, Z. Liu, Y. Li, W. X. Zhao, H. Wang, B. Ding, and J. Wen (2020) Sequential recommendation with self-attentive multi-adversarial network. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp. 89–98.
S. Schrodi, D. T. Hoffmann, M. Argus, V. Fischer, and T. Brox (2024) Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning. arXiv preprint arXiv:2404.07983.
P. Shi, M. C. Welle, M. Björkman, and D. Kragic (2023) Towards understanding the modality gap in CLIP. In ICLR 2023 workshop on multimodal representation learning: perks and pitfalls.
M. Y. Sim, W. E. Zhang, X. Dai, and B. Fang (2025) Can VLMs actually see and read? A survey on modality collapse in vision-language models. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 24452–24470.
F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management, pp. 1441–1450.
C. Tao, T. Shen, S. Gao, J. Zhang, Z. Li, K. Hua, W. Hu, Z. Tao, and S. Ma (2024) LLMs are also effective embedding models: an in-depth overview. arXiv preprint arXiv:2412.12591.
J. Wang, Z. Zeng, Y. Wang, Y. Wang, X. Lu, T. Li, J. Yuan, R. Zhang, H. Zheng, and S. Xia (2023) MISSRec: pre-training and transferring multi-modal interest-aware sequence representation for recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 6548–6557.
L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei (2024a) Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11897–11916.
T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pp. 9929–9939.
W. Wang, D. Tran, and M. Feiszli (2020) What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12695–12705.
Y. Wang, C. Tian, B. Hu, Y. Yu, Z. Liu, Z. Zhang, J. Zhou, L. Pang, and X. Wang (2024b) Can small language models be good reasoners for sequential recommendation? In Proceedings of the ACM Web Conference 2024, pp. 3876–3887.
N. Wu, S. Jastrzebski, K. Cho, and K. J. Geras (2022) Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In International Conference on Machine Learning, pp. 24043–24055.
Z. Yuan, F. Yuan, Y. Song, Y. Li, J. Fu, F. Yang, Y. Pan, and Y. Ni (2023) Where to go next for recommender systems? ID- vs. modality-based recommender models revisited. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2639–2649.
C. Zhang, H. Zhang, S. Wu, D. Wu, T. Xu, X. Zhao, Y. Gao, Y. Hu, and E. Chen (2025a) NoteLLM-2: multimodal large representation models for recommendation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD ’25, New York, NY, USA, pp. 2815–2826. ISBN 9798400712456.
S. Zhang, L. Chen, D. Shen, C. Wang, and H. Xiong (2025b) Hierarchical time-aware mixture of experts for multi-modal sequential recommendation. In Proceedings of the ACM on Web Conference 2025, pp. 3672–3682.
Y. Zhang, J. Z. HaoChen, S. Huang, K. Wang, J. Zou, and S. Yeung (2023) Diagnosing and rectifying vision models using language. arXiv preprint arXiv:2302.04269.
K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, F. Zhang, Z. Wang, and J. Wen (2020) S3-Rec: self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management, pp. 1893–1902.
K. Zhou, H. Yu, W. X. Zhao, and J. Wen (2022) Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM web conference 2022, pp. 2388–2399.
J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, et al. (2025) InternVL3: exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.