
Paper deep dive

LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

Difei Gu, Yunhe Gao, Gerasimos Chatzoudis, Zihan Dong, Guoning Zhang, Bangwei Guo, Yang Zhou, Mu Zhou, Dimitris Metaxas

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 60

Models: SigLIP ViT-B/16

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 12:38:08 AM

Summary

LUCID-SAE is a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patches and text tokens. By employing a shared-private decomposition and a Guided Cross-Modal Transport (GCMT) objective, it achieves cross-modal feature alignment without manual labeling, enabling automated interpretation of neurons and robust concept discovery across modalities.

Entities (5)

LUCID-SAE · model-architecture · 100%
Optimal Transport · mathematical-framework · 98%
Guided Cross-Modal Transport · algorithm · 95%
Sparse Autoencoder · methodology · 95%
Vision-Language Model · model-type · 90%

Relation Signals (4)

LUCID-SAE implements Sparse Autoencoder

confidence 98% · LUCID... a unified vision-language sparse autoencoder

LUCID-SAE learns Shared Latent Dictionary

confidence 95% · learns a shared latent dictionary for image patch and text token representations

LUCID-SAE uses Guided Cross-Modal Transport

confidence 95% · We develop GCMT (Guided Cross-Modal Transport)... to establish the fine-grained alignment

Guided Cross-Modal Transport utilizes Optimal Transport

confidence 95% · combines entropy-regularized transport with structural and contextual guidance

Cypher Suggestions (2)

Find all components and algorithms associated with the LUCID-SAE architecture. · confidence 90% · unvalidated

MATCH (m:Model {name: 'LUCID-SAE'})-[:USES|IMPLEMENTS]->(component) RETURN m, component

Map the relationship between methodologies and their underlying mathematical frameworks. · confidence 85% · unvalidated

MATCH (a:Methodology)-[:UTILIZES]->(f:Framework) RETURN a.name, f.name

Abstract

Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need for labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.

Tags

ai-safety (imported, 100%) · alignment-training (suggested, 80%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

59,306 characters extracted from source content.


LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery. Difei Gu, Yunhe Gao, Gerasimos Chatzoudis, Zihan Dong, Guoning Zhang, Bangwei Guo, Yang Zhou, Mu Zhou, Dimitris Metaxas. Abstract: Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need for labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations. Figure 1: Overview of LUCID. LUCID processes image patches and text tokens through a vision-language model (VLM) to learn a uniform sparse dictionary via a Top-K SAE.
The unified latent space captures interpretable concept features spanning multiple semantic categories such as objects (Bus, Cat), scenes (Bathroom), actions (Sitting), visual attributes (Brown), and quantities (Many), enabling cross-modal feature alignment and automatic interpretation of image concepts. 1 Introduction Modern representation learning models encode rich semantic structure in high-dimensional activations (Zhai et al., 2023; Brown et al., 2020; Bengio et al., 2012; Vaswani et al., 2017). Yet turning these activations into explanations that are stable, comparable, and transferable remains challenging. Growing efforts (Bricken et al., 2023; Cunningham et al., 2023) have explored disentangled feature codes that decompose dense activations into separable concepts. Among these, sparse autoencoders (SAEs) (Lim et al., 2024; Stevens et al., 2025) represent a key paradigm, mapping activations to sparse codes over learned dictionary elements. These yield interpretable units that can be inspected, visualized, and leveraged for feature-level analyses. This feature-centric perspective has established SAEs as increasingly valuable tools for interpretation, evaluation, and safety-oriented understanding of learned representations (Cunningham et al., 2023; Bricken et al., 2023; Pach et al., 2025; Ferrando et al., 2024; Yin et al.). A key limitation of SAEs is that they are always trained independently for each modality (Fry, 2024; Thasarathan et al., 2025; Templeton, 2024). Consequently, concepts learned in one modality lack guaranteed correspondence to analogous concepts in the other. This absence of cross-space comparability impedes the construction of consistent explanations across diverse inputs and domains. It also fundamentally restricts downstream applications requiring shared semantics, such as grounding language and vision evidence within a unified set of interpretable concepts.
Without an explicit alignment, independently trained SAEs tend to learn idiosyncratic, model-specific bases. In this study, we ask a fundamental question: Can we learn a single set of sparse codes whose units are directly comparable across vision and language spaces? Achieving this objective presents significant challenges because representation spaces differ substantially in their distributional properties, dimensionality, and tokenization granularity. While training separate SAEs preserves reconstruction fidelity, it sacrifices the alignment necessary for cross-model interpretation. At the heart of this challenge is a core trade-off between: (i) a shared set of semantic units that enables comparability and transfer, and (ii) sufficient flexibility to capture modality-specific factors without constraining them to the common representation. We introduce LUCID as a unified sparse autoencoder that learns a shared latent code for VLM activations (Zhai et al., 2023; Radford et al., 2021). Specifically, LUCID disentangles image patch and text token activations and produces sparse codes within a common latent space, enabling direct concept-level correspondence across modalities. Additionally, LUCID reserves private capacity for each space, allowing representation of space-specific information that need not be shared. Training combines reconstruction objectives for each space with an alignment signal that encourages paired inputs (e.g., image-caption pairs) to activate consistent shared concepts. As a result, this yields interpretable features that are simultaneously sparse and cross-space comparable without the need for labeling. Our major contributions are: • Unified multimodal sparse coding. We introduce LUCID, a sparse autoencoder framework that learns a shared dictionary across vision and language through shared-private decomposition and optimal transport-based alignment. • Patch-level grounding via guided optimal transport.
We develop GCMT (Guided Cross-Modal Transport), which combines entropy-regularized transport with structural and contextual guidance to establish fine-grained alignment between image patches and text tokens. • Automated interpretation and empirical analysis. We demonstrate an automatic neuron interpretation pipeline through cross-modal concept clustering, revealing coherent semantic specializations without manual annotation. 2 Related Work Model Interpretation with Concepts. Concept Bottleneck Models (CBMs) (Koh et al., 2020; Rao et al., 2024; Gao et al., 2024b) represent a seminal approach in this direction. They introduce an interpretable intermediate layer where predictions are explicitly mediated through predefined concepts. In the original CBM framework, models are trained to first predict a set of human-specified concepts (e.g., "has wings," "is blue") before making final task predictions. This creates a transparent decision pathway. The architecture enables practitioners to inspect which concepts contribute to each prediction and even intervene on concept activations to correct model behavior. However, CBMs face significant limitations. They require extensive concept annotations during training. The predefined concept set requires heavy manual design and may not align with the features the model naturally learns (Havasi et al., 2022). As a result, this bottleneck constraint can compromise overall task performance. Mechanistic Interpretability for Sparse Autoencoders. Recent advances in mechanistic interpretability (Elhage et al., 2021; Bricken et al., 2023; Cunningham et al., 2023) have introduced sparse autoencoders (SAEs) as a powerful tool for decomposing neural network representations into interpretable components. SAEs address a critical challenge, the polysemantic nature of neurons, where individual neurons often respond to multiple unrelated features due to superposition.
The goal of SAEs is to learn an overcomplete dictionary (Sharkey et al., 2022) of monosemantic features that linearly reconstruct model activations under sparsity constraints. This enables SAEs to effectively disentangle superposed representations, recovering interpretable feature directions from polysemantic neurons (Elhage et al., 2022). Beyond basic feature extraction, recent work has trained SAEs at various network depths (Rajamanoharan et al., 2024) and leveraged SAE features for model steering (Cho and Hockenmaier, 2025). However, current SAEs face several key limitations. First, the methodology has been predominantly applied to language models, with limited exploration in vision domains (Fry, 2024; Thasarathan et al., 2025) and vision-language cross-modal concept alignment. Second, the features discovered by SAEs are highly fine-grained and task-specific, making them difficult to interpret through manual inspection alone (Paulo et al., 2024). In particular, current efforts lack mechanisms for automatic vision concept interpretation. This dependency on manual inspection fundamentally limits the practical utility of SAEs as an interpretability tool in a multimodal setting. 3 Method Problem Formulation. We study the problem of learning interpretable, sparse feature codes that are comparable across multiple representation spaces. Let ℳ denote a set of activation spaces from the VLM. For each m ∈ ℳ, we assume access to a representation function f_m that maps an input x^(m) to a set of feature vectors H^(m): H^(m) = f_m(x^(m)) ∈ ℝ^(T_m × d_m), where T_m is the length of the token sequence of image patches or text tokens, and d_m is the feature dimension in space m. In the case of Vision-Language Models (VLMs), we assume a dataset D that contains paired image and text samples across spaces: D = {(x_n^(img), x_n^(txt))}_(n=1)^N, where img, txt ∈ ℳ.
Our goal is to find a training paradigm that learns a unified sparse autoencoder producing a shared sparse dictionary z whose coordinates have the same semantic meaning across spaces. Figure 2: Model structure of LUCID showing the high-level loss components. The model is optimized at two levels: self-reconstruction is done on the shared + private code space, while cross-reconstruction is done on the shared code space. Figure 3: Concept discovery across semantic categories. Examples of diverse concepts discovered by LUCID on COCO images. For each concept, we show two images with activation heatmaps (green overlay) demonstrating spatial localization. LUCID captures concepts spanning scenes (Bathroom, Bedroom), objects (Glass, Bicycle), animals (Cat, Elephant), colors (Brown, Orange), actions (Skiing, Walking), and quantities (Many, Group), with activations precisely localizing to semantically relevant regions. 3.1 LUCID: Unified Sparse Coding with Shared and Private Capacity We aim to learn sparse codes that are both interpretable through sparsity and comparable across representation spaces. LUCID achieves this by decomposing the SAE dictionary space into a shared sparse code and a private sparse code. The shared code captures cross-space semantic concepts, while the private code accommodates modality-specific features. 3.1.1 Architecture For each space m ∈ {img, txt} and each element h_n^(m) ∈ ℝ^(d_m), LUCID produces: (z_(s,n)^(m), z_(p,n)^(m)) = E_m(h_n^(m)), with z_(s,n)^(m) ∈ ℝ^(K_s) and z_(p,n)^(m) ∈ ℝ^(K_p), where s and p denote shared and private. E_m is a sparse encoder whose outputs are subject to Top-k sparsity constraints (Gao et al., 2024a): ‖z_(s,n)^(m)‖_0 ≤ k_s and ‖z_(p,n)^(m)‖_0 ≤ k_p. Each space has a corresponding decoder D_m: ĥ_n^(m) = D_m(z_(s,n)^(m), z_(p,n)^(m)).
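As a concrete illustration, the shared-private Top-k encode/decode step can be sketched in numpy. This is a minimal sketch under stated assumptions: the paper does not spell out the encoder parameterization, so the ReLU-then-Top-k encoder, the weight names (W_enc_s, W_enc_p), and the random toy data are illustrative; the linear decoder is one simple choice consistent with the decoder described in the text.

```python
import numpy as np

def top_k(z, k):
    """Keep the k largest activations per row, zero the rest (Top-k sparsity)."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[..., -k:]          # indices of the k largest entries
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def encode(h, W_enc_s, W_enc_p, k_s, k_p):
    """Split a dense activation h into shared and private sparse codes.
    ReLU + Top-k is an assumption; the paper only states the Top-k constraint."""
    z_s = top_k(np.maximum(h @ W_enc_s, 0.0), k_s)  # ||z_s||_0 <= k_s
    z_p = top_k(np.maximum(h @ W_enc_p, 0.0), k_p)  # ||z_p||_0 <= k_p
    return z_s, z_p

def decode(z_s, z_p, W_s, W_p, b):
    """Linear decoder: h_hat = W_s z_s + W_p z_p + b."""
    return z_s @ W_s.T + z_p @ W_p.T + b

rng = np.random.default_rng(0)
d, K_s, K_p = 16, 64, 32
h = rng.normal(size=(5, d))                          # five toy patch activations
W_enc_s, W_enc_p = rng.normal(size=(d, K_s)), rng.normal(size=(d, K_p))
W_s, W_p, b = rng.normal(size=(d, K_s)), rng.normal(size=(d, K_p)), np.zeros(d)

z_s, z_p = encode(h, W_enc_s, W_enc_p, k_s=4, k_p=2)
h_hat = decode(z_s, z_p, W_s, W_p, b)
```

The sparsity constraints hold by construction: each code row has at most k_s (resp. k_p) nonzero entries.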
We adopt a linear parameterization for the decoder: ĥ_n^(m) = W_s^(m) z_(s,n)^(m) + W_p^(m) z_(p,n)^(m) + b^(m), where W_s^(m) ∈ ℝ^(d_m × K_s) and W_p^(m) ∈ ℝ^(d_m × K_p) capture shared and private structure specific to that space. This design balances two competing objectives. Fully separate SAEs yield unaligned dictionaries where units are not comparable across spaces. Conversely, fully shared dictionaries can over-constrain reconstruction when spaces contain non-overlapping factors. The shared and private decomposition enables cross-space comparability while maintaining space-specific reconstruction fidelity. 3.2 Cross-Space Alignment via Optimal Transport Because the numbers of elements T_a and T_b may differ (e.g., image patches versus text tokens), we employ Optimal Transport (OT) to define a principled many-to-many alignment between elements across paired samples (H^(a), H^(b)). 3.2.1 Transport Plan Formulation For each pair n, we define a cost matrix C_n ∈ ℝ^(T_a × T_b) whose entry C_(n,ij) measures the mismatch between element i in space a and element j in space b. We measure distance in shared-code space with cosine similarities: C_(n,ij) = 1 − cos(z_(s,n,i)^(a), z_(s,n,j)^(b)). Given marginal distributions r_n ∈ Δ^(T_a) and c_n ∈ Δ^(T_b) (strictly positive probability vectors), we compute an entropically regularized OT plan: Π_n^⋆ = argmin_(Π ∈ ℝ_+^(T_a × T_b)) ⟨C_n, Π⟩ − ε H(Π) (1), subject to Π 1 = r_n and Π^⊤ 1 = c_n, where H(Π) = −Σ_(ij) Π_(ij) log Π_(ij) is the entropy and ε > 0 controls the softness of the assignment. The OT framework offers several advantages over fixed correspondence assumptions. It naturally handles unequal set sizes, allows soft many-to-many correspondences, and provides a differentiable alignment signal.
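Problem (1) is standardly solved with Sinkhorn matrix scaling, which alternately rescales rows and columns of the Gibbs kernel to match the marginals. A minimal numpy sketch; the ε value, iteration count, and random cost matrix here are illustrative, not the paper's settings:

```python
import numpy as np

def sinkhorn(C, r, c, eps=0.5, n_iters=200):
    """Entropy-regularized OT (Eq. 1) via Sinkhorn scaling.
    Returns P = diag(u) K diag(v) with Gibbs kernel K = exp(-C / eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(r)
    for _ in range(n_iters):
        v = c / (K.T @ u)   # enforce column marginals
        u = r / (K @ v)     # enforce row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
Ta, Tb = 6, 4
C = rng.random((Ta, Tb))          # stand-in for 1 - cosine similarity of shared codes
r = np.full(Ta, 1.0 / Ta)         # uniform marginal over image patches
c = np.full(Tb, 1.0 / Tb)         # uniform marginal over text tokens
P = sinkhorn(C, r, c)
```

The returned plan is nonnegative and satisfies both marginal constraints to numerical precision, matching the diag(u) K diag(v) factorization stated in the theorem that follows.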
This flexibility is particularly well-suited for patch-token granularity differences, avoiding the restrictive assumption of fixed one-to-one pairings. Critically, the transport plan Π_n^⋆ achieves alignment by minimizing the total cost ⟨C_n, Π_n^⋆⟩, which directly encourages semantically corresponding elements across modalities to have similar representations in the shared-code space. The entropy regularization −ε H(Π) prevents degenerate matchings and enables smooth, probabilistic correspondences that are robust to minor variations and redundancies in each modality. By incorporating the OT distance as a training loss, we provide a geometric regularization that pulls cross-modal representations closer together, thereby learning a shared-code space where alignment is structurally enforced rather than assumed. The following theorem establishes the closed-form solution to (1) and, via the Sinkhorn algorithm (Cuturi, 2013), shows linear-time convergence for the problem; the proof is given in Appendix A.2. Theorem 3.1 (Closed-Form Solution for Entropy-Regularized OT). The unique optimal solution to (1) admits the factorization Π_n^⋆ = diag(u) K diag(v), where K_(ij) = exp(−C_(n,ij)/ε) is the Gibbs kernel, and u ∈ ℝ_(++)^(T_a), v ∈ ℝ_(++)^(T_b) are scaling vectors uniquely determined (up to a multiplicative constant) by the marginal constraints. 3.2.2 Guided Cross-Modal Transport (GCMT) While the standard OT formulation in (1) provides flexible alignment, purely similarity-driven transport may align spurious or overly generic elements when pairings are weak or ambiguous. We extend our method to Guided Cross-Modal Transport (GCMT), which incorporates lightweight structural guidance into the OT formulation to bias transport toward semantically plausible correspondences. We denote the resulting guided transport plan as Π_n^GCMT. Masked Transport (Structural Guidance).
Let M_n ∈ {0,1}^(T_a × T_b) be a binary (or soft) mask indicating admissible alignments, such as phrase-to-region compatibility, noun-to-object priors, or geometric constraints. We incorporate M_n by modifying the cost matrix: C̃_n = C_n + β(1 − M_n), where a large β discourages forbidden pairs. The OT plan is then computed using C̃_n in place of C_n in (1). Mass Reweighting (Grounding Guidance). GCMT biases transport mass by adjusting the OT marginals based on global context compatibility. We compute a global visual context vector g^(a) = (1/T_a) Σ_(i=1)^(T_a) z_(s,n,i)^(a) and measure each text token's alignment via w_(n,j)^(b) = max(0, cos(g^(a), z_(s,n,j)^(b))). Text tokens inconsistent with the visual scene receive lower weights. The marginals are set as: r_n = (1/T_a) 1_(T_a), c_n = w_n^(b) / (1^⊤ w_n^(b)). This compatibility can also be incorporated into the cost matrix: C̃_n = C_n + λ_global · 1_(T_a) (1 − w_n^(b))^⊤, where λ_global controls the modulation strength. 3.2.3 Alignment Loss We encourage aligned elements to activate consistent shared codes. While the soft OT plan provides smooth gradient supervision, it can yield diffuse correspondences. In our implementation, we encourage crisper semantic agreement by imposing a global semantic lock. Concretely, for each sample n, we form pooled shared codes z̄_(s,n)^(a) = (1/T_a) Σ_i z_(s,n,i)^(a) and z̄_(s,n)^(b) = (1/|M_n|) Σ_(j ∈ M_n) z_(s,n,j)^(b), and pooled activations h̄_n^(a) = (1/T_a) Σ_i h_(n,i)^(a) (and analogously for b). We then decode the pooled code from one modality through the other modality's decoder and regress to the pooled target: L_align = Σ_n ‖D^(a)(z̄_(s,n)^(b)) − h̄_n^(a)‖_2², with a symmetric formulation for a ↔ b.
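The two guidance mechanisms amount to simple modifications of the cost matrix and marginals before running OT. A hedged numpy sketch of that preprocessing, where the all-admissible mask M, the β and λ_global values, and the nonnegative toy codes are placeholders rather than the paper's settings (it also assumes at least one token has positive compatibility, so the marginal normalization is well defined):

```python
import numpy as np

def cosine(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    return (a @ b.T) / (np.linalg.norm(a, axis=-1, keepdims=True)
                        * np.linalg.norm(b, axis=-1, keepdims=True).T + 1e-8)

def gcmt_inputs(Zs_a, Zs_b, M, beta=10.0, lam_global=1.0):
    """Build the guided cost matrix and reweighted marginals for GCMT.
    Zs_a: (Ta, Ks) shared codes for image patches; Zs_b: (Tb, Ks) for text
    tokens; M: (Ta, Tb) binary admissibility mask."""
    C = 1.0 - cosine(Zs_a, Zs_b)              # base cost in shared-code space
    C = C + beta * (1.0 - M)                  # structural guidance: penalize forbidden pairs
    g = Zs_a.mean(axis=0, keepdims=True)      # global visual context vector
    w = np.maximum(0.0, cosine(g, Zs_b))[0]   # per-token compatibility weights
    C = C + lam_global * (1.0 - w)[None, :]   # grounding guidance on the cost
    Ta = Zs_a.shape[0]
    r = np.full(Ta, 1.0 / Ta)                 # uniform image-patch marginal
    c = w / w.sum()                           # reweighted text-token marginal
    return C, r, c

rng = np.random.default_rng(0)
Zs_a = rng.random((6, 8))                     # nonnegative toy sparse codes
Zs_b = rng.random((4, 8))
M = np.ones((6, 4))                           # all pairs admissible in this toy case
C, r, c = gcmt_inputs(Zs_a, Zs_b, M)
```

The resulting (C, r, c) would then feed the same entropy-regularized OT solver as the unguided formulation.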
We treat the pooled targets h̄^(·) as stop-gradient to maintain training stability while enforcing that shared dimensions carry consistent global semantics across the two modalities. 3.3 Cross-Modal Reconstruction Beyond self-reconstruction, LUCID leverages the transport plan to perform cross-modal reconstruction, which validates that shared codes capture meaningful cross-modal semantics and provides supervision to strengthen concept alignment. 3.3.1 Cross-Reconstruction via Barycentric Mapping Given the transport plan Π_n^GCMT, we map representations across modalities via barycentric projections. For text-to-vision and vision-to-text reconstruction: ĥ_(n,i)^(b→a) = Σ_j (Π_(n,ij) / Σ_(j′) Π_(n,ij′)) · D_a(z_(s,n,j)^(b)), and ĥ_(n,j)^(a→b) = Σ_i (Π_(n,ij) / Σ_(i′) Π_(n,i′j)) · D_b(z_(s,n,i)^(a)), where D_a, D_b are the decoders. These reconstructions use only shared codes z_s, ensuring the transport plan operates on genuinely shared semantic content while modality-specific information remains in private codes z_p. 3.3.2 Cross-Reconstruction Loss The cross-reconstruction loss measures reconstruction quality using transported shared codes: L_cross = Σ_(n,i) ‖h_(n,i)^(a) − ĥ_(n,i)^(b→a)‖_2² + Σ_(n,j) ‖h_(n,j)^(b) − ĥ_(n,j)^(a→b)‖_2². Combined with L_align, this creates a bidirectional consistency constraint: shared codes must be both geometrically close in latent space and decodable to semantically equivalent representations in both modalities.
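The barycentric projection is just a row-normalized weighted average of the other modality's decoded shared codes. A toy numpy sketch of the text-to-vision (b→a) direction, where the random transport plan and the made-up linear decoder standing in for D_a are illustrative assumptions:

```python
import numpy as np

def barycentric_cross_recon(P, Z_b, decode_a):
    """b -> a cross-reconstruction: for each image patch i, average the
    decoded text shared codes with row-normalized transport weights."""
    W = P / (P.sum(axis=1, keepdims=True) + 1e-12)  # row-stochastic plan
    return W @ decode_a(Z_b)                        # shape (Ta, d_a)

rng = np.random.default_rng(0)
Ta, Tb, Ks, d = 6, 4, 8, 5
D_a = rng.normal(size=(d, Ks))                      # toy linear decoder for space a
decode_a = lambda Z: Z @ D_a.T
P = rng.random((Ta, Tb)); P /= P.sum()              # toy transport plan
Z_b = rng.random((Tb, Ks))                          # text shared codes
H_hat = barycentric_cross_recon(P, Z_b, decode_a)   # reconstruction in space a
```

Because each row of W sums to one, every reconstructed patch is a convex combination of decoded text-token vectors, which is what makes the mapping robust to the soft many-to-many couplings produced by OT.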
The full training objective is: L = α Σ_n Σ_(m ∈ {a,b}) Σ_(t=1)^(T_m) ‖h_(n,t)^(m) − ĥ_(n,t)^(m)‖_2² (the self-reconstruction term L_self) + β L_align (OT alignment) + γ L_cross (cross-reconstruction). The self-reconstruction term L_self ensures faithful representation of the original activations within each modality. The alignment term L_align (from Section 3.2.3) enforces geometric proximity of corresponding shared codes using the OT-derived correspondences. The cross-reconstruction term L_cross (from Section 3.3.2) enforces that shared codes are functionally equivalent across spaces when decoded back to their original representation spaces. The sparsity terms enforce interpretability through sparse activation patterns, with separate regularization strengths for shared (λ_s) and private (λ_p) codes. The hyperparameters β and γ control the relative importance of these objectives. Detailed parameter settings are provided in Appendix A.1.

Neuron ID | Semantic | Top terms (dominant semantic cluster)
178 | Urban Street Scene | traffic light, traffic cone, pavement, street sign, street vendor, street, windy, driver, sidewalk, stop sign, sandwich, pedestrian, pedestrian signal, safety vest, drawer, wheel, magazine, scooter
54 | Aquatic Activities | water reflection, water surface, water, lake, apple, swimming, wings, river, surfboard, boat, boat wake, cyclist, ocean, wave, knife, harness, dog, dog walker

Table 1: Semantic interpretations of SAE neurons obtained via clustering their top-activating text concepts. We show two example neurons: Neuron 178 captures urban street features, while Neuron 54 captures aquatic/water-related features. An extended table with additional neurons is provided in Appendix A.3. Figure 4: Visualization of two randomly selected neurons. For each neuron, we show the highest-scoring retrieved images and highlight the most activated patch (red box), along with the corresponding patch crops.
Additional neuron visualizations are provided in Appendix A.3. 3.5 Automated Neuron Interpretation via Concept Clustering A key advantage of aligned multimodal dictionaries is the ability to automatically interpret shared neurons by leveraging cross-modal correspondences. We employ a pipeline that clusters text concepts associated with a neuron's activations, revealing semantic themes without manual annotation. 3.5.1 Interpretation Pipeline The interpretation process is summarized in three phases (see appendix for Algorithm 1): • Stage 1: Concept Encoding & Weighting. We encode a concept library C into shared-code representations z_c. To down-weight generic features, we apply an inverse document frequency (IDF) weight, idf_m = log(N / df_m), based on neuron activation frequency across the library. • Stage 2: Spatial Concept Matching. We compute spatial activation maps A_i(p) = min(z_p^(a), z_(c_i)^(b)) ⊙ idf to identify the intersection between image patches and concepts. Concepts are associated with the dominant neuron m* at their peak activation location. • Stage 3: Semantic Clustering. Associated concepts are clustered using sentence-level embeddings via greedy cosine similarity (threshold σ ≈ 0.78). The largest cluster defines the neuron's primary semantic specialization. 3.5.2 Qualitative Results We filter for interpretability by requiring a minimum number of concept hits and a dominant cluster purity ≥ 0.55. Table 1 and Figure 4 illustrate this mapping for two representative neurons. Neuron 178 specializes in urban navigation, grouping terms like traffic light, sidewalk, and stop sign. In contrast, Neuron 54 captures aquatic themes, clustering lake, river, and surfboard. The high cluster purity validates that shared neurons capture consistent, cross-modal concepts. This automated pipeline scales to thousands of neurons, providing a systematic lens into the model's learned representations.
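Stage 3's greedy cosine clustering can be sketched as follows. The paper specifies only the threshold σ ≈ 0.78 and the use of sentence-level embeddings, so the exact greedy rule (first-fit against running centroids), the centroid update, and the synthetic toy vectors below are assumptions for illustration:

```python
import numpy as np

def greedy_cosine_cluster(E, sigma=0.78):
    """Greedy clustering: assign each embedding to the existing cluster whose
    centroid is most similar, if that similarity >= sigma; else open a new
    cluster. Returns lists of row indices."""
    E = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-8)
    centroids, clusters = [], []
    for i, e in enumerate(E):
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) >= sigma:
            j = int(np.argmax(sims))
            clusters[j].append(i)
            c = np.mean(E[clusters[j]], axis=0)          # refresh centroid
            centroids[j] = c / (np.linalg.norm(c) + 1e-8)
        else:
            centroids.append(e)
            clusters.append([i])
    return clusters

# two tight groups of nearly parallel toy "term embeddings"
base1, base2 = np.eye(4)[0], np.eye(4)[1]
E = np.stack([base1, base1 + 0.05, base2, base2 + 0.05])
clusters = greedy_cosine_cluster(E)
```

In the paper's pipeline, the largest such cluster would then name the neuron's primary semantic specialization.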
Figure 5: Out-of-distribution data (ImageNet) used to validate concept discovery with LUCID. For each discovered concept (e.g., camouflage, flying, grass, holding, top, water), we show two representative images along with activation overlays highlighting the spatial evidence supporting the concept. 4 Experimental Results This section validates three core contributions of our method: (1) the shared-private capacity decomposition improves both reconstruction fidelity and cross-modal predictability of the shared subspace, (2) optimal transport coupling enables precise object-level grounding measured through local alignment metrics, and (3) structured shared-code regularization mitigates clustering degeneracy while improving grounding. 4.1 Experimental setup We train our approach on combined MS COCO 2017 (Lin et al., 2014) and C3M (Sharma et al., 2018) with their corresponding image-text caption pairs. No ground truths are used during training. When evaluating, we use the COCO instance annotations to obtain ground-truth object regions (bounding boxes; masks are a drop-in replacement when available). Our backbone is SigLIP ViT-B/16 (Zhai et al., 2023; Dosovitskiy, 2020), and our model learns sparse concept codes over image patch tokens and text tokens.

Metric | Shared Only (r=1.0) | Shared+Priv (r=0.25)
R²_V,self | 0.4045 | 0.4606
R²_V,self (shared) | - | 0.2705
R²_V,self (private) | - | 0.2445
R²_T,self | 0.5737 | 0.6684
R²_T,self (shared) | - | 0.5088
R²_T,self (private) | - | 0.2550
R²_T→V,cross (shared) | 0.0552 | 0.1171
R²_V→T,cross (shared) | 0.3220 | 0.4055

Table 2: Shared-private decomposition improves reconstruction. Adding private capacity (r=0.25) increases self R² while improving shared cross R² (T→V, V→T). For r=0.25 we also report shared vs. private contributions to self R².
4.2 Concept Discovery and Interpretability A key advantage of our learned aligned sparse dictionaries is the ability to discover and visualize semantically meaningful concepts directly from the learned representations. Figure 3 showcases diverse concept types discovered by LUCID on COCO across randomly selected semantic categories, e.g., scene concepts (bathroom, bedroom) and action concepts (skiing, walking). The heat map images reveal precise spatial localization, e.g., "glass" activates on drinking glasses, "bicycle" on bike frames. Scene concepts like "bedroom" activate on characteristic furniture arrangements, action concepts like "skiing" respond to human poses and contextual cues, and quantity concepts like "many" capture compositional properties. These results demonstrate that the shared space captures both low-level perceptual attributes and high-level semantic categories. Since these activation maps measure intersection between vision patch codes and text concept codes in the shared space, the spatial precision validates that our transport-based alignment successfully enforces semantic correspondence at the neuron level. Figure 5 validates that discovered concepts generalize beyond the training distribution by evaluating the same learned dictionary on ImageNet (Russakovsky et al., 2014). We select random category concepts: camouflage, flying, grass, holding, top, and water. The activation patterns remain semantically meaningful and spatially precise. "Camouflage" activates on natural patterns, "flying" responds to birds with sky context, "grass" localizes to ground vegetation, and "holding" identifies object-hand interactions. This generalization demonstrates that LUCID learns transferable semantic concepts rather than dataset-specific correlations, supporting the use of learned representations for downstream zero-shot tasks. Additional concept discovery visualizations are provided in Appendix A.3.
4.3 Shared-Private Capacity Decomposition Table 2 validates that private capacity improves both reconstruction fidelity and cross-modal predictability. Introducing private capacity (r=0.25) improves overall self-reconstruction (R²_V,self: 0.40 → 0.46; R²_T,self: 0.57 → 0.67) compared to shared-only (r=1.0). The decomposition reveals that in vision, the private-only contribution (0.2445) nearly equals the shared-only path (0.2705), confirming that substantial visual variance is modality-specific. In text, the shared-only component remains dominant (0.5088), while private units capture residual information. Crucially, private capacity strengthens the shared subspace as a cross-modal interface. Shared-only cross-predictability improves significantly: R²_T→V,cross increases from 0.055 to 0.117, and R²_V→T,cross from 0.322 to 0.406. This confirms that allocating private capacity to absorb modality-specific details allows the shared dictionary to specialize in genuinely comparable features, shielding it from modality-dependent noise. We note a persistent asymmetry where text→vision performance is lower; this is expected, as semantically compressed captions cannot fully recover dense visual details like background and layout.

Configuration | mass@obj | point@1 | IoU@10
OT only | 0.4020 | 0.4730 | 0.2746
Share k | 0.4039 | 0.5501 | 0.2800
Share k GSL | 0.4097 | 0.5578 | 0.2913
Share k GSL GCMT | 0.4106 | 0.6002 | 0.2984

Table 3: Object-level grounding via BBOX metrics (higher is better). We form a shared-code spatial heatmap and compare it to ground-truth (GT) boxes. Metric definitions in Appendix A.1. Results compare OT-only and shared Top-k variants; the full Share k GSL GCMT variant achieves the best score in every column. 4.4 Object-Level Grounding via Optimal Transport Unlike global instance discrimination, our approach optimizes token-to-patch alignment using Optimal Transport (OT).
Because global retrieval metrics can be obscured by artifacts like concept clustering, we evaluate local grounding directly using OT-based heatmaps. By aggregating transport mass $P_{ij}$ over tokens for each patch $j$, we generate a spatial grid that is compared against ground-truth (GT) bounding boxes via three metrics: mass@obj (mass fraction within the GT box), point@1 (peak saliency accuracy), and IoU@10 (overlap of the top 10% salient patches). Table 3 demonstrates that enforcing sparse, structured shared codes significantly sharpens grounding. Compared to the OT baseline, the Share k variant improves point@1 from 0.473 to 0.550, indicating that sparsification helps the model localize salient features. Adding a Global Semantic Lock (GSL) further improves all metrics (mass@obj: 0.410, point@1: 0.558, IoU@10: 0.291), consistent with better cross-modal shared-code semantics. Incorporating GCMT on top of GSL yields the best overall performance (mass@obj: 0.411, point@1: 0.600, IoU@10: 0.298). These gains confirm that Top-k selection reduces diffuse activations while GCMT encourages context-aware, object-focused couplings. Together, shared-code sparsity and GSL/GCMT sharpen transport concentration on semantically relevant object regions, providing a more faithful measure of local alignment than global retrieval alone.

| Configuration | Clust t2i maxfreq ↓ | Clust i2t maxfreq ↓ | BBOX point@1 ↑ |
| OT only | 0.7958 | 0.3810 | 0.4730 |
| Share k | 0.6516 | 0.2055 | 0.5501 |
| Share k GSL | 0.4590 | 0.1240 | 0.5578 |
| Share k GSL GCMT | 0.3538 | 0.0740 | 0.6002 |

Table 4: Degeneracy vs. grounding. Share-k + GSL/GCMT reduce clustering maxfreq (t2i/i2t) while improving BBOX point@1.

4.5 Sparsity and the Selectivity–Degeneracy Trade-off

Our third contribution shows that structured shared codes provide a practical knob for balancing selectivity against degeneracy in cross-modal concept representations.
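The grounding metrics defined in Section 4.4 (mass@obj, point@1, IoU@10), together with the maxfreq degeneracy diagnostic used in this section, can be sketched in a few lines of NumPy. This is a hypothetical re-implementation for illustration; the paper's exact normalization and tie-breaking conventions may differ.

```python
import numpy as np

def grounding_metrics(heatmap, gt_mask, top_frac=0.10):
    """Per-image BBOX metrics for an (H, W) nonnegative patch heatmap and
    a boolean ground-truth box mask of the same shape."""
    h = heatmap / (heatmap.sum() + 1e-12)
    mass_at_obj = float(h[gt_mask].sum())                      # mass@obj
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    point_at_1 = float(gt_mask[peak])                          # point@1
    k = max(1, int(round(top_frac * heatmap.size)))            # top-10% patches
    flat_top = np.argsort(heatmap, axis=None)[-k:]
    top_mask = np.zeros(heatmap.size, dtype=bool)
    top_mask[flat_top] = True
    top_mask = top_mask.reshape(heatmap.shape)
    inter = np.logical_and(top_mask, gt_mask).sum()
    union = np.logical_or(top_mask, gt_mask).sum()
    return mass_at_obj, point_at_1, inter / union              # ..., IoU@10

def max_activation_frequency(codes):
    """maxfreq over an evaluation set: rows are samples, columns are shared
    dimensions; returns the highest per-dimension activation frequency."""
    return float((codes > 0).mean(axis=0).max())

# Toy check: a 14x14 heatmap peaked inside the ground-truth box
heat = np.full((14, 14), 0.01)
heat[3:6, 3:6] = 1.0
heat[4, 4] = 5.0
box = np.zeros((14, 14), dtype=bool)
box[2:7, 2:7] = True
mass, point, iou = grounding_metrics(heat, box)
assert point == 1.0 and mass > 0.8
```

A low maxfreq alongside a high point@1 is the regime the ablations aim for: no single shared dimension dominates the queries, yet the heatmap peak still lands inside the object box.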
Table 4 characterizes this interaction using two complementary signals: a concept-clustering diagnostic (maxfreq) that captures dimension overuse, and an object-level localization metric (BBOX point@1) that evaluates spatial grounding accuracy. We compute maxfreq by measuring, for each shared dimension $m$, its activation frequency over the evaluation set $D$ and taking the maximum:

$$\mathrm{maxfreq} \;=\; \max_m \frac{1}{|D|} \sum_{x \in D} \mathbf{1}\!\left[z_m(x) > 0\right], \tag{2}$$

where $z(x) \in \mathbb{R}^{d_{\mathrm{sh}}}_{\ge 0}$ is the shared sparse code. The BBOX point@1 metric tests whether the peak of the shared-code heatmap falls inside a ground-truth bounding box, providing a direct measure of whether the learned correspondences localize salient object regions. The table indicates that degeneracy is strongly affected by how shared codes are regularized and coupled across modalities. Relative to OT only, Share k reduces maxfreq (t2i: 0.7958→0.6516, i2t: 0.3810→0.2055), suggesting fewer dimensions dominate across clustered concept queries. Adding a Global Semantic Lock (GSL) further suppresses concentration (t2i: 0.4590, i2t: 0.1240) while slightly improving grounding (point@1: 0.5578), consistent with better semantic consistency of the shared basis. Finally, incorporating GCMT yields the lowest maxfreq (t2i: 0.3538, i2t: 0.0740) and the best point@1 (0.6002), indicating that context-aware transport reduces spurious token matches that otherwise promote repeated cluster assignments.

5 Conclusion

We introduced LUCID, a unified sparse autoencoder framework that learns interpretable, cross-modal concept dictionaries by decomposing the latent space into shared and private components. This design addresses a fundamental limitation of existing sparse coding approaches: the inability to align concepts across modalities while preserving reconstruction fidelity.
Our empirical validation demonstrates three key contributions: (1) shared-private decomposition improves both reconstruction quality and cross-modal predictability by shielding the shared space from modality-specific noise, (2) optimal transport-based alignment enables precise patch-level grounding as measured by local spatial metrics, and (3) structured shared-code objectives mitigate clustering degeneracy while sharpening object-level grounding. Leveraging these alignment properties, we developed an automated neuron interpretation pipeline that discovers coherent semantic clusters without manual annotation. LUCID captures diverse concept categories spanning objects, scenes, actions, attributes, and spatial relationships, with learned representations generalizing to out-of-distribution data. This establishes aligned sparse dictionaries as a foundation for automatic interpretability at scale, providing a principled approach toward interpretable multimodal representations that support systematic cross-modal analysis.

References

Y. Bengio, A. Courville, and P. Vincent (2012) Representation learning: a review and new perspectives. arXiv preprint arXiv:1206.5538.

T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, p. 1877–1901.

I. Cho and J.
Hockenmaier (2025) Toward efficient sparse autoencoder-guided steering for improved in-context learning in large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, p. 28961–28973.

H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.

M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26.

A. Dosovitskiy (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652.

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread 1 (1), p. 12.

J. Ferrando, O. Obeso, S. Rajamanoharan, and N. Nanda (2024) Do I know this entity? Knowledge awareness and hallucinations in language models. arXiv preprint arXiv:2411.14257.

H. Fry (2024) Towards multimodal interpretability: learning sparse interpretable features in vision transformers. LessWrong.

L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu (2024a) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Y. Gao, D. Gu, M. Zhou, and D. Metaxas (2024b) Aligning human knowledge with visual concepts towards explainable medical image classification.
In International Conference on Medical Image Computing and Computer-Assisted Intervention, p. 46–56.

M. Havasi, S. Parbhoo, and F. Doshi-Velez (2022) Addressing leakage in concept bottleneck models. Advances in Neural Information Processing Systems 35, p. 23386–23397.

P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang (2020) Concept bottleneck models. In International Conference on Machine Learning, p. 5338–5348.

H. Lim, J. Choi, J. Choo, and S. Schneider (2024) Sparse autoencoders reveal selective remapping of visual concepts during adaptation. arXiv preprint arXiv:2412.05276.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, p. 740–755.

M. Pach, S. Karthik, Q. Bouniot, S. Belongie, and Z. Akata (2025) Sparse autoencoders learn monosemantic features in vision-language models. arXiv preprint arXiv:2504.02821.

G. Paulo, A. Mallen, C. Juang, and N. Belrose (2024) Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, p. 8748–8763.

S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramar, R. Shah, and N. Nanda (2024) Improving sparse decomposition of language model activations with gated sparse autoencoders. Advances in Neural Information Processing Systems 37, p. 775–818.

S. Rao, S. Mahajan, M. Böhle, and B. Schiele (2024) Discover-then-name: task-agnostic concept bottlenecks via automated concept discovery. In European Conference on Computer Vision, p. 444–461.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2014) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), p. 211–252.

L. Sharkey, D. Braun, and B. Millidge (2022) Taking features out of superposition with sparse autoencoders. In AI Alignment Forum, Vol. 6, p. 12–13.

P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 2556–2565.

S. Stevens, W. Chao, T. Berger-Wolf, and Y. Su (2025) Sparse autoencoders for scientifically rigorous interpretation of vision models. arXiv preprint arXiv:2502.06755.

A. Templeton (2024) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Anthropic.

H. Thasarathan, J. Forsyth, T. Fel, M. Kowal, and K. G. Derpanis (2025) Universal sparse autoencoders: interpretable cross-model concept alignment. In Forty-second International Conference on Machine Learning.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems 30.

Q. Yin, C. T. Leong, H. Zhang, M. Zhu, H. Yan, Q. Zhang, Y. He, W. Li, J. Wang, Y. Zhang, et al. Constrain alignment with sparse autoencoders. In Forty-second International Conference on Machine Learning.

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer (2023) Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, p. 11975–11986.

Appendix A

A.1 Additional Implementation Details

Model Architecture.
We use SigLIP ViT-B/16 as our vision-language backbone, extracting patch-level features from the vision encoder and token-level features from the text encoder. For vision, we use the penultimate-layer activations (before the final projection), yielding 196 patch tokens of dimension 768. For text, we extract token embeddings after the transformer layers but before pooling, with dimension 768. Both SAEs map these 768-dimensional activations to a shared dictionary of size $M_{\mathrm{shared}}=1536$ (25% of total capacity) and private dictionaries of size $M_{\mathrm{private}}=4608$ (75% of capacity). We use Top-k sparsity with $k_{\mathrm{shared}}=16$ and $k_{\mathrm{private}}=32$.

Training Details. We train on the MS COCO 2017 train split (118k image-caption pairs) for 200 epochs with batch size 256. The optimizer is AdamW with learning rate $3\times 10^{-4}$, weight decay 0.01, and a cosine annealing schedule. Loss weights are set to $\alpha=1.0$ (self-reconstruction), $\beta=0.5$ (alignment), and $\gamma=0.3$ (cross-reconstruction). For optimal transport, we use entropy regularization $\varepsilon=0.1$ and run 10 Sinkhorn iterations. The GCMT guidance uses $\lambda_{\mathrm{global}}=0.5$ for global context modulation. Training takes approximately 24 hours on 8 NVIDIA RTX 8000 GPUs.

Evaluation Metrics. Self-reconstruction $R^2$ measures the variance explained by the SAE reconstruction:

$$R^2 = 1 - \frac{\sum_i \|h_i - \hat{h}_i\|^2}{\sum_i \|h_i - \bar{h}\|^2},$$

where $\bar{h}$ is the mean activation. Cross-reconstruction $R^2$ measures how well codes from one modality predict the other under the learned transport plan. BBOX metrics evaluate spatial grounding: mass@obj is the fraction of heatmap mass inside ground-truth boxes, point@1 is pointing-game accuracy (whether the peak activation lands in a box), and IoU@10 measures overlap between boxes and the top-10% salient region. Clustering maxfreq measures degeneracy by finding the maximum activation frequency of any shared dimension across the evaluation set.
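Two of the ingredients above, per-row Top-k sparsification and the self-reconstruction $R^2$, can be sketched as follows. This is a minimal NumPy illustration under simple conventions, not the authors' implementation.

```python
import numpy as np

def topk_sparsify(pre_codes, k):
    """ReLU, then keep each row's k largest entries and zero the rest,
    as in a Top-k SAE code (e.g. k_shared = 16 on the shared dictionary)."""
    z = np.maximum(pre_codes, 0.0)
    if k < z.shape[1]:
        kth = np.partition(z, -k, axis=1)[:, -k][:, None]  # k-th largest per row
        z = np.where(z >= kth, z, 0.0)
    return z

def r_squared(h, h_hat):
    """Variance explained by a reconstruction, per the formula above;
    the mean activation is taken per feature over the batch."""
    resid = float(((h - h_hat) ** 2).sum())
    total = float(((h - h.mean(axis=0)) ** 2).sum())
    return 1.0 - resid / total

rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 10))
z = topk_sparsify(pre, 3)
assert (z > 0).sum(axis=1).max() <= 3   # at most k active units per row
h = rng.normal(size=(8, 5))
assert r_squared(h, h) == 1.0           # perfect reconstruction
```

A row can have fewer than $k$ active units when ReLU zeroes more than $d-k$ pre-activations; the Top-k operator only upper-bounds the support size.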
Extended Reconstruction Analysis. Table 6 provides a comprehensive breakdown of reconstruction quality under different coding strategies. The key insight is that the shared+private decomposition improves both self-reconstruction and cross-predictability compared to forced sharing. For forced sharing ($r=1.0$), all capacity is shared, so full global = full joint = shared. With shared+private ($r=0.25$), we observe: (1) full joint reconstruction (shared + private) outperforms shared-only, demonstrating that private capacity captures additional variance; (2) shared-only cross-reconstruction improves with private capacity, indicating that private codes absorb modality-specific noise that would otherwise contaminate the shared space.

A.2 Concept Interpretation Pipeline Details

Our automated interpretation pipeline (Algorithm 1) operates in four stages. We provide additional implementation details for each stage.

Algorithm 1 Automated Neuron Interpretation Pipeline
Input: concept list C, images I, encoders and SAEs, thresholds
Output: semantic clusters for each shared neuron
1: Stage 1: Encode Concepts
2: for each concept c in C do
3:   Encode c through the text pathway to get shared code z_c
4: end for
5: Stage 2: Compute IDF Weights (optional)
6: for each neuron m do
7:   Count how many concepts activate neuron m
8:   Compute inverse document frequency: idf_m = log(# concepts / # activating m)
9: end for
10: Stage 3: Match Images to Concepts
11: Initialize empty concept lists for each neuron
12: for each image I in the image set do
13:   Extract patch-level shared codes from the vision pathway
14:   for each concept c do
15:     Compute activation map: intersection of image patches with concept code
16:     Find peak activation location and dominant neuron at that location
17:     if peak confidence > threshold then
18:       Associate concept c with the dominant neuron
19:     end if
20:   end for
21: end for
22: Stage 4: Cluster Concepts per Neuron
23: for each neuron with associated concepts do
24:   Encode all associated concepts using the VLM text encoder (for clustering)
25:   Group concepts by semantic similarity (greedy cosine clustering)
26:   Rank clusters by size
27: end for
Return: clustered concepts for each neuron

| Neuron ID | Semantic | Top terms (dominant semantic cluster) |
| 956 | Kitchen & Indoor | cutting, cutting board, kitchen, kitchen sink, cabinet, countertop, stove, wallet, cookie, stone, living room, shoe, shelf, recliner, oven, knife, door, attached to |
| 178 | Urban Street Scene | traffic light, traffic cone, pavement, street sign, street vendor, street, windy, driver, sidewalk, stop sign, sandwich, pedestrian, pedestrian signal, safety vest, drawer, wheel, magazine, scooter |
| 1293 | Spatial Relationships | in hand, in mouth, in basket, in sink, in front of, mirrored, on table, on floor, on wall, on shelf, on head, on plate, throwing, horns, surfboard, flying, foot, skirt |
| 1292 | Office / Desk Scene | camera, writing, paper, laptop, tape measure, next to, bag, office chair, desk, keyboard, notebook, box, newspaper, book, cup, foot, camouflage, backlit |
| 1485 | Overcast Coastal / Outdoor Scene | overcast, over, smoothie, microwave, tail, taillight, depth layers, shallow depth, beach shore, beach, horizon, oval, cloudy, market day, market, beer, bleachers, single |
| 1403 | Pasture / Roadside Outdoors | grass field, grass, ground, drill, grazing, road barrier, road, hand, bush, green, pantry, sheep, right, metallic, leaves, oval, black, broccoli |
| 1515 | Dining / Bar Setting | spoon, wine, wine glass, pants, barstool, juice, basket storage, basket, tea, kettle, plate, menu, soda, pie, cocktail, cake, wedding, dining table |
| 221 | Rail / Transit Corridor | waterfall, rail track, tunnel, platform, can, center, cooking, distant, palm tree, train, train window, station interior, rubber, bicycle, paintbrush, running, speaker, motion blur |
| 1100 | Human Poses / Activities | eating, holding, sitting, twilight, tie, life jacket, sleeping, rectangle, sofa, seesaw, camouflage, leaves, lying, mouth, remote, red, wooden, cup |
| 834 | Farm Animals | horse, spots, behind, cow, harness, faucet, hair, mouth, cactus, drinking, sheep, mouse, brown, crowd, motion blur, speaker, cutting, cutting board |
| 1217 | Beach/Ocean + Crowd | large, medium, van, wave, ocean, palm tree, ice cream, horizon, picnic, concert, passenger, station interior, cupcake, line, tablet, star, ticket |
| 54 | Aquatic Activities | water reflection, water surface, water, lake, apple, swimming, wings, river, surfboard, boat, boat wake, cyclist, ocean, wave, knife, harness, dog, dog walker |

Table 5: Semantic interpretations of SAE neurons obtained via clustering their top-activating text concepts (extended version).

Figure 6: Visualization of two randomly selected neurons (extended version).

A.3 Additional Visualizations

Table 5 shows the complete set of 12 randomly selected neurons with their semantic interpretations. Each neuron exhibits a dominant cluster of related concepts, demonstrating the diversity of learned representations. Figures 7 and 6 provide extensive visualizations of concept activations and neuron-specific image examples. For each neuron, we show the top-scoring images (by peak activation) and highlight the most salient patch. These visualizations reveal that: (1) neurons specialize for coherent semantic categories (e.g., Neuron 1515 consistently activates on dining scenes with tableware); (2) spatial activations are precise: highlighted patches correspond to relevant objects or regions rather than backgrounds; (3) neurons generalize across visual variations: Neuron 54 (aquatic activities) activates on diverse water-related scenes including lakes, rivers, boats, and swimming.
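Stage 4 of Algorithm 1 groups each neuron's associated concepts by "greedy cosine clustering." One plausible reading of that step is sketched below: scan the concepts in order, attach each to the first sufficiently similar cluster centroid, otherwise start a new cluster. The similarity threshold `sim_thresh` and the centroid-update rule are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def greedy_cosine_clusters(embeddings, concepts, sim_thresh=0.8):
    """Greedily cluster concept embeddings by cosine similarity to
    running centroids, then rank the resulting clusters by size."""
    units = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids, clusters = [], []
    for vec, name in zip(units, concepts):
        if centroids:
            sims = np.array([c @ vec for c in centroids])
            best = int(sims.argmax())
            if sims[best] >= sim_thresh:
                clusters[best].append(name)
                # Update the centroid as the renormalized running sum.
                c = centroids[best] + vec
                centroids[best] = c / np.linalg.norm(c)
                continue
        centroids.append(vec)          # no close cluster: open a new one
        clusters.append([name])
    return sorted(clusters, key=len, reverse=True)  # rank clusters by size

# Toy example: two kitchen-like embeddings and one unrelated direction
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
clusters = greedy_cosine_clusters(emb, ["kitchen", "stove", "traffic light"])
assert clusters == [["kitchen", "stove"], ["traffic light"]]
```

Greedy clustering is order-dependent; the dominant cluster per neuron reported in Table 5 would then simply be the largest group returned.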
| Metric | Shared Only (shared_ratio = 1.0) | Shared+Private (shared_ratio = 0.25) |
| $R^2_{V,\mathrm{self}}$ (full global) | 0.4045 | – |
| $R^2_{V,\mathrm{self}}$ (full joint) | – | 0.4606 |
| $R^2_{V,\mathrm{self}}$ (shared) | – | 0.2705 |
| $R^2_{V,\mathrm{self}}$ (private) | – | 0.2445 |
| $R^2_{T,\mathrm{self}}$ (full global) | 0.5737 | – |
| $R^2_{T,\mathrm{self}}$ (full joint) | – | 0.6684 |
| $R^2_{T,\mathrm{self}}$ (shared) | – | 0.5088 |
| $R^2_{T,\mathrm{self}}$ (private) | – | 0.2550 |
| $R^2_{V,\mathrm{cross}}$ (shared) | – | 0.1171 |
| $R^2_{V,\mathrm{cross}}$ (full global) | 0.0552 | – |
| $R^2_{V,\mathrm{cross}}$ (full joint) | – | 0.1131 |
| $R^2_{T,\mathrm{cross}}$ (shared) | – | 0.4055 |
| $R^2_{T,\mathrm{cross}}$ (full global) | 0.3220 | – |
| $R^2_{T,\mathrm{cross}}$ (full joint) | – | 0.3971 |

Table 6: $R^2$ decomposition under shared-only vs. shared+private. Self $R^2$ is reported for vision (V) and text (T) under four code paths: shared, private, full joint (shared Top-k + private Top-k), and full global (global Top-k across all dimensions; single-dictionary baseline). Cross $R^2$ uses the same transport plan $P_{\mathrm{sh}}$ derived from shared-only codes; $\Delta_{\mathrm{leak}}$ measures the change when decoding with full codes vs. shared-only under fixed $P_{\mathrm{sh}}$.

Figure 7: Examples of diverse concepts discovered by LUCID on COCO images (extended version).

A.4 Omitted Proofs

A.4.1 Proof of Theorem 3.1

Proof. We proceed via the method of Lagrange multipliers.

Step 1: Lagrangian formulation. The problem involves equality constraints for the marginals and non-negativity constraints $\Pi_{ij} \ge 0$. Introducing dual variables $\lambda \in \mathbb{R}^n$ and $\mu \in \mathbb{R}^m$ for the marginal constraints, the Lagrangian is:

$$\mathcal{L}(\Pi,\lambda,\mu) = \sum_{i,j} C_{ij}\Pi_{ij} + \varepsilon \sum_{i,j} \Pi_{ij}\log\Pi_{ij} + \sum_i \lambda_i \Big(r_i - \sum_j \Pi_{ij}\Big) + \sum_j \mu_j \Big(c_j - \sum_i \Pi_{ij}\Big).$$

Step 2: First-order optimality.
Due to the singularity of the gradient of the entropy term at zero ($\lim_{x\to 0^+}\log x = -\infty$), the optimal solution is strictly positive ($\Pi^\star_{ij} > 0$). Consequently, the non-negativity constraints are inactive. Taking the partial derivative with respect to $\Pi_{ij}$ and setting it to zero yields:

$$\frac{\partial \mathcal{L}}{\partial \Pi_{ij}} = C_{ij} + \varepsilon(\log\Pi_{ij} + 1) - \lambda_i - \mu_j = 0.$$

Rearranging terms:

$$\log\Pi_{ij} = \frac{\lambda_i - \varepsilon}{\varepsilon} + \frac{\mu_j}{\varepsilon} - \frac{C_{ij}}{\varepsilon}.$$

Step 3: Exponentiating. Exponentiating both sides results in:

$$\Pi^\star_{ij} = \exp\Big(\frac{\lambda_i - \varepsilon}{\varepsilon}\Big) \cdot \exp\Big(\frac{\mu_j}{\varepsilon}\Big) \cdot \exp\Big(-\frac{C_{ij}}{\varepsilon}\Big).$$

By defining the scaling variables $u_i := \exp\big(\frac{\lambda_i - \varepsilon}{\varepsilon}\big)$, $v_j := \exp\big(\frac{\mu_j}{\varepsilon}\big)$, and the kernel $K_{ij} := \exp\big(-\frac{C_{ij}}{\varepsilon}\big)$, we obtain the factorization

$$\Pi^\star_{ij} = u_i\,K_{ij}\,v_j,$$

which corresponds to $\Pi^\star = \mathrm{diag}(u)\,K\,\mathrm{diag}(v)$ in matrix form.

Step 4: Determining scaling vectors. Substituting $\Pi^\star$ into the marginal constraints:

Row constraint: $\Pi^\star \mathbf{1}_m = r \implies \mathrm{diag}(u)Kv = r \implies u = r \oslash (Kv)$.

Column constraint: $(\Pi^\star)^\top \mathbf{1}_n = c \implies \mathrm{diag}(v)K^\top u = c \implies v = c \oslash (K^\top u)$.

Step 5: Uniqueness. The objective function is strictly convex on the transportation polytope due to the entropy term. Thus, the optimal plan $\Pi^\star$ is unique. Since $K$ is strictly positive, the scaling vectors $(u,v)$ exist and are unique up to a scalar factor (Sinkhorn's theorem). ∎

Remark A.1. The scaling vectors relate to the Kantorovich dual potentials via $u_i = \exp(\phi_i/\varepsilon)$, $v_j = \exp(\psi_j/\varepsilon)$. The ambiguity $(u,v) \mapsto (\alpha u, v/\alpha)$ corresponds to the gauge freedom $(\phi,\psi) \mapsto (\phi + t, \psi - t)$.

A.4.2 Linear Convergence of Sinkhorn Algorithm

Corollary A.2 (Sinkhorn Algorithm). The scaling vectors $(u,v)$ in Theorem 3.1 can be computed via the Sinkhorn-Knopp iteration.
Initializing $v^{(0)} = \mathbf{1}$, the updates for $t = 0, 1, 2, \dots$ are:

$$u^{(t+1)} = r \oslash (Kv^{(t)}), \qquad v^{(t+1)} = c \oslash (K^\top u^{(t+1)}), \tag{3}$$

where $\oslash$ denotes elementwise division. This iteration converges linearly in the Hilbert projective metric, with contraction rate $\kappa = \tau(K)^2 < 1$.

Proof. We establish convergence via the Birkhoff-Hopf theorem on contractions in Hilbert's projective metric.

Step 1: Hilbert projective metric. For vectors $x, y \in \mathbb{R}^n_{++}$, the Hilbert projective metric is defined as:

$$d_H(x,y) = \log\Big(\max_i \frac{x_i}{y_i}\Big) - \log\Big(\min_j \frac{x_j}{y_j}\Big) = \log\Big(\max_{i,j} \frac{x_i y_j}{x_j y_i}\Big).$$

This defines a pseudo-metric on $\mathbb{R}^n_{++}$ satisfying $d_H(x,y) = 0$ iff $x = \alpha y$ for some $\alpha > 0$. Consequently, $d_H$ induces a true metric on the projective cone $\mathbb{R}^n_{++}/\mathbb{R}_{++}$.

Step 2: Birkhoff's contraction theorem. For a matrix $A \in \mathbb{R}^{n\times m}_{++}$, define its projective diameter

$$\Delta(A) = \max_{i,j,k,l} \log\frac{A_{ik}A_{jl}}{A_{il}A_{jk}},$$

and the Birkhoff contraction coefficient $\tau(A) = \tanh(\Delta(A)/4) \in [0,1)$. Birkhoff's theorem states that for any $x, y \in \mathbb{R}^m_{++}$:

$$d_H(Ax, Ay) \le \tau(A) \cdot d_H(x,y).$$

Since $K_{ij} = \exp(-C_{ij}/\varepsilon) > 0$ for all $i, j$, the kernel $K$ has finite projective diameter $\Delta(K) < \infty$, yielding $\tau(K) < 1$. By symmetry, $\Delta(K^\top) = \Delta(K)$, hence $\tau(K^\top) = \tau(K)$.

Step 3: Isometry under diagonal scaling. For any fixed $z \in \mathbb{R}^n_{++}$, the map $\phi_z : x \mapsto z \oslash x$ is an isometry under $d_H$:

$$d_H(z \oslash x,\, z \oslash y) = d_H(x,y).$$

This follows from $\dfrac{(z_i/x_i)(z_j/y_j)}{(z_j/x_j)(z_i/y_i)} = \dfrac{x_j y_i}{x_i y_j}$, so the maximum over index pairs is preserved up to swapping $i$ and $j$.

Step 4: Contraction of the Sinkhorn map. Define the half-iteration operators $T_r : v \mapsto r \oslash (Kv)$ and $T_c : u \mapsto c \oslash (K^\top u)$. For iterates $v, v' \in \mathbb{R}^m_{++}$, let $u = T_r(v)$ and $u' = T_r(v')$.
Applying the isometry property and Birkhoff's theorem:

$$d_H(u, u') = d_H(Kv, Kv') \le \tau(K) \cdot d_H(v, v').$$

Similarly, for the second step, $d_H(T_c(u), T_c(u')) \le \tau(K^\top) \cdot d_H(u, u')$. Composing both inequalities for one full iteration $T = T_c \circ T_r$:

$$d_H(T(v), T(v')) \le \tau(K)^2 \cdot d_H(v, v').$$

Since $\kappa := \tau(K)^2 < 1$, the map $T$ is a strict contraction.

Step 5: Linear convergence. By Theorem 3.1, the optimal scaling vectors $(u^\star, v^\star)$ exist and are unique up to a multiplicative constant. Applying the Banach fixed-point theorem:

$$d_H(v^{(t)}, v^\star) \le \kappa^t \cdot d_H(v^{(0)}, v^\star).$$

The Hilbert-metric convergence implies convergence of $\Pi^{(t)} = \mathrm{diag}(u^{(t)})\,K\,\mathrm{diag}(v^{(t)})$ to $\Pi^\star$ in standard norms at rate $O(\kappa^t)$. ∎
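The iteration of Eq. (3) can be sketched in a few lines of NumPy, together with a numerical check that the converged plan satisfies both marginal constraints. The cost matrix, marginals, and regularization strength below are toy assumptions chosen so the iteration converges well within the iteration budget.

```python
import numpy as np

def sinkhorn(C, r, c, eps=0.1, n_iters=10):
    """Entropy-regularized OT via Sinkhorn-Knopp, as in Theorem 3.1 /
    Eq. (3): v^(0) = 1, then alternate u = r / (K v), v = c / (K^T u)
    with kernel K = exp(-C / eps). Returns diag(u) K diag(v)."""
    K = np.exp(-C / eps)
    v = np.ones_like(c)
    for _ in range(n_iters):
        u = r / (K @ v)        # row-scaling half-step  T_r
        v = c / (K.T @ u)      # column-scaling half-step  T_c
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
C = rng.random((5, 4))                 # toy cost matrix
r = np.full(5, 1 / 5)                  # uniform row marginals
c = np.full(4, 1 / 4)                  # uniform column marginals
P = sinkhorn(C, r, c, eps=0.5, n_iters=200)

# After convergence the plan matches both marginals (the column
# constraint is exact by construction after the final v-update; the
# row error has shrunk at the linear rate kappa = tau(K)^2).
assert np.allclose(P.sum(axis=1), r, atol=1e-6)
assert np.allclose(P.sum(axis=0), c, atol=1e-6)
```

The strictly positive kernel guarantees $\tau(K) < 1$, which is why a generous iteration count here drives the row-marginal error far below the tolerance; the paper's training configuration uses only 10 iterations per step, trading exactness for speed.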