Paper deep dive
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
Models: Dream-7B, LLaDA-8B, LLaMA-3-8B, Qwen-2.5-7B
Abstract
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer certain new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order, and that SAE features are stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows great potential for applying SAEs to DLM-related tasks and algorithms.
Tags
Links
- Source: https://arxiv.org/abs/2602.05859
- Canonical: https://arxiv.org/abs/2602.05859
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/11/2026, 1:17:58 AM
Summary
DLM-Scope is the first mechanistic interpretability framework for Diffusion Language Models (DLMs) using Sparse Autoencoders (SAEs). The study demonstrates that Top-K SAEs can extract interpretable features from DLMs, showing unique behaviors such as loss reduction in early layers upon SAE insertion. The framework enables effective diffusion-time steering and provides metrics to analyze decoding-order dynamics, establishing a foundation for interpreting DLMs.
Entities (5)
Relation Signals (3)
Sparse Autoencoders → appliedto → Diffusion Language Models
confidence 100% · demonstrate that trained Top-K SAEs can faithfully extract interpretable features [in DLMs]
Dream-7B → isa → Diffusion Language Models
confidence 100% · We develop a comprehensive training and inference framework for Top-KK SAEs on DLMs, including Dream-7B
DLM-Scope → utilizes → Sparse Autoencoders
confidence 100% · we present DLM-Scope, the first SAE-based interpretability framework for DLMs
Cypher Suggestions (2)
Find all models that utilize Sparse Autoencoders for interpretability. · confidence 90% · unvalidated
MATCH (m:Model)-[:UTILIZES]->(s:Methodology {name: 'Sparse Autoencoders'}) RETURN m.name
Identify the relationship between frameworks and the models they support. · confidence 80% · unvalidated
MATCH (f:Framework)-[:SUPPORTS]->(m:Model) RETURN f.name, m.name
Full Text
75,921 characters extracted from source content.
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Xu Wang (1,2), Bingqing Jiang (1), Yu Wan (2), Baosong Yang (2), Lingpeng Kong (1), Difan Zou (1)
(1) The University of Hong Kong; (2) Tongyi Lab, Alibaba Group Inc
Email: sunny615@connect.hku.hk. Corresponding author; Email: dzou@hku.hk

Abstract
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer certain new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order, and that SAE features are stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows great potential for applying SAEs to DLM-related tasks and algorithms.
https://huggingface.co/DLM-Scope

1 Introduction
Sparse autoencoders (SAEs) have become a widely used tool for mechanistic interpretability in autoregressive large language models (LLMs) (Marks et al., 2024), extracting sparse, human-meaningful features that help reveal internal representations (Templeton et al., 2024; Cunningham et al., 2023; Bricken et al., 2023). Beyond analysis, SAE features have been used in applications such as reducing hallucinations (Ferrando et al., 2025) and mitigating biases (Durmus et al., 2024). Meanwhile, diffusion language models (DLMs) are becoming increasingly competitive for text understanding and generation (Gong et al., 2023; Wu et al., 2023; Xu et al., 2025), with recent results suggesting favorable scaling behavior (Nie et al., 2025a; Bie et al., 2025). Moreover, diffusion is not inherently uninterpretable: in text-to-image models, SAEs expose causal concept factors and enable reliable concept editing (Surkov et al., 2025; He et al., 2025). In contrast, the interpretability of DLMs remains limited; therefore, it is imperative to develop tools that facilitate a deeper inspection and understanding of these modern models.

To this end, a straightforward idea is to extend the SAE-based interpretability interface that has worked well in practice for LLMs to DLMs. However, transferring the LLM-SAE recipe to DLM-SAEs is not plug-and-play, due to the fundamental difference in their mechanisms. In particular, DLMs require certain design choices that are absent in LLMs: (i) selecting which token positions provide training activations under diffusion corruption, and (ii) defining inference-time steering policies that act repeatedly across denoising steps. This requires careful configuration of SAE training and inference for DLMs.

Figure 1: DLM-Scope pipeline. Top (orange): DLM-SAE training and validation. Left: Top-K SAEs are trained on Dream/LLaDA. Middle: They are evaluated via the sparsity-fidelity trade-off.
Right: They are auto-interpreted by generating explanations and interpretability scores. Bottom (green): The value of DLM-SAEs. Left: Feature steering is applied across denoising steps. Middle: Different decoding orders are analyzed by tracking Top-K feature dynamics. Right: Cross-training transfer is tested by applying base-trained SAEs to DLM-SFT.

In this work, we present DLM-Scope, the first SAE-based mechanistic interpretability interface for DLMs. We develop a comprehensive training and inference framework for Top-K SAEs (Gao et al., 2024) on DLMs, including Dream-7B (Ye et al., 2025) and LLaDA-8B (Nie et al., 2025b). Furthermore, we verify the utility of these SAEs under sparsity constraints and leverage features for both diffusion-time steering and tracking representation dynamics across different remasking orders. Additionally, we evaluate transferability by applying base-trained SAEs to an instruction-tuned DLM, showing the generality of the learned features through DLM post-training. Our contributions are summarized as follows (also detailed in Fig. 1):

1. SAEs trained on DLMs are usable for mechanistic analyses.
(§3.1) We design the training strategy and objective for SAEs in DLMs, introducing a training method that samples activations from denoising and selects positions under corruption.
(§3.2) DLM-SAEs achieve a favorable sparsity-fidelity profile. Notably, in early layers we observe regimes where inserting an SAE into DLMs can reduce masked-token cross-entropy loss.
(§3.3) SAEs extract interpretable features in DLMs. We apply automated feature interpretation to DLM-SAE latents, producing understandable explanations and interpretability scores.

2. SAEs enable effective steering, remasking-strategy analysis, and transfer to instruction-tuned DLMs.
(§4) We design DLM-specific per-step steering policies and find that SAE features enable effective diffusion-time interventions that often outperform LLM steering.
(§5) SAEs offer an interpretable way to quantify remasking orders. Using our metrics ($S^{\mathrm{pre}}_{\ell,k,i}(O)$ and $D^{\mathrm{post}}_{\ell,i}(O)$), we track how decoding orders induce semantic shifts during denoising.
(§6) Base→SFT transfer of DLM-SAEs. We test whether base-trained SAEs remain faithful on an instruction-tuned DLM, showing the generality of SAEs in DLMs.

2 SAE Training and Steering in DLMs
2.1 Preliminary
Figure 2: DLM-SAE overview. Left: Training DLM-SAEs. We collect residual-stream activations from one-step denoising inputs and train SAEs using two strategies: Mask-SAE or Unmask-SAE. Right: Diffusion-time feature steering. We select a feature f and inject its decoder direction into the residual stream at every denoising step, either on all positions or on update positions.

Sparse Autoencoders (SAEs). SAEs act as microscopes for the dense, superposition-laden hidden states of language models. Mathematically, let $x \in \mathbb{R}^d$ denote a residual-stream activation at a fixed layer and token position. The goal of an SAE is to decompose this dense vector into a linear combination of interpretable directions. The SAE first encodes $x$ into sparse features $h \in \mathbb{R}^k$ (typically $k > d$) and then decodes back to a reconstruction $\hat{x} \in \mathbb{R}^d$. In particular, the SAE is trained with the following loss function:

$\mathcal{L}_{\mathrm{SAE}} = \|x - \hat{x}\|_2^2 + \lambda \|h\|_1, \qquad h = \mathrm{ReLU}(W_E x + b_E), \qquad \hat{x} = W_D h + b_D.$ (3)

Eq. 3 trades off reconstruction and sparsity, where $\lambda$ controls the $\ell_1$ penalty on features. SAE feature steering is a causal intervention technique that modifies the model's internal processing by artificially activating specific interpretable concepts. This is achieved by modifying the residual stream with a decoder atom $v_f$ for a chosen feature $f$. Given a steering strength $\alpha$ and a per-sample scale $m_f$, we intervene as

$x^{\mathrm{steer}} = x + \alpha\, m_f\, v_f.$ (4)

Eq. 4 pushes $x$ along the feature direction $v_f$ at a target layer and changes the output.

Diffusion Language Models (DLMs). Unlike the autoregressive decoding of standard LLMs, DLMs generate tokens by iterative denoising and parallel decoding. Let $x_0 = (x_0^1, \ldots, x_0^N)$ be a length-$N$ token sequence sampled from the data distribution $q(x)$. The DLM defines a process that produces a partially masked sequence $x_t \sim q(x_t \mid x_0)$ at mask rate $t \in (0,1)$ (we use $w(t) = 1/t$), and trains a mask predictor $p_\theta(\cdot \mid x_t)$ to recover the original tokens at masked positions:

$\mathcal{L}_{\mathrm{DLM}}(\theta) = \mathbb{E}_{x_0 \sim q(x),\; t \sim U(0,1),\; x_t \sim q(x_t \mid x_0)} \Big[ w(t) \sum_{i=1}^{N} \mathbb{1}[x_t^i = \texttt{[MASK]}] \cdot \underbrace{\big(-\log p_\theta(x_0^i \mid x_t)\big)}_{\text{cross-entropy}} \Big]$ (7)

Eq. 7 applies cross-entropy only to masked tokens, with timestep weight $w(t)$. During inference, DLMs denoise by repeatedly predicting tokens and remasking. At step $k$:

$\tilde{x}_0 \sim p_\theta(\cdot \mid x_{t_k}), \qquad x_{t_{k-1}} = \mathrm{Remask}\big(x_{t_k},\, \tilde{x}_0;\, t_{k-1}\big).$ (8)

Eq. 8 first samples a fully denoised prediction $\tilde{x}_0$ conditioned on the current masked state $x_{t_k}$. It then applies $\mathrm{Remask}(\cdot)$ to (i) fill the masked positions in $x_{t_k}$ with the corresponding entries from $\tilde{x}_0$, and (ii) re-mask the resulting sequence so that the mask rate matches the next step $t_{k-1}$.

Figure 3: Sparsity-fidelity trade-off for Qwen SAEs and Dream SAEs. Top row: functional fidelity measured by Δ loss (Eq. 16); bottom row: reconstruction fidelity measured by explained variance (Eq. 15). Columns: an LLM baseline (Qwen-2.5-7B, left) versus Dream-7B SAEs (Mask-SAE, middle; Unmask-SAE, right). This figure shows that Dream-SAEs achieve strong sparsity-fidelity trade-offs and even exhibit negative Δ loss in shallow layers at small $L_0$, an effect absent or much weaker in the LLM baseline.
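To make Eqs. 3-4 concrete, here is a minimal numpy sketch of a Top-K SAE over residual-stream activations together with the feature-steering intervention. The names (`TopKSAE`, `steer`), the tied-decoder initialization, and the exact Top-K rule are our illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def topk_relu(z, k):
    """Keep the k largest activations after ReLU; zero out the rest."""
    z = np.maximum(z, 0.0)
    if k < z.shape[-1]:
        thresh = np.partition(z, -k, axis=-1)[..., -k][..., None]
        z = np.where(z >= thresh, z, 0.0)
    return z

class TopKSAE:
    """Minimal Top-K sparse autoencoder (Eq. 3 with a hard sparsity budget)."""
    def __init__(self, d, width, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_E = rng.standard_normal((width, d)) / np.sqrt(d)  # encoder weights
        self.b_E = np.zeros(width)
        self.W_D = self.W_E.T.copy()   # decoder (d x width), tied initialization
        self.b_D = np.zeros(d)
        self.k = k

    def encode(self, x):
        # h = TopK(ReLU(W_E x + b_E)); at most k nonzero latents per token
        return topk_relu(x @ self.W_E.T + self.b_E, self.k)

    def decode(self, h):
        # x_hat = W_D h + b_D
        return h @ self.W_D.T + self.b_D

    def loss(self, x, lam=0.0):
        # reconstruction error plus optional l1 penalty, as in Eq. 3
        h = self.encode(x)
        x_hat = self.decode(h)
        return np.mean(np.sum((x - x_hat) ** 2, axis=-1)) \
            + lam * np.mean(np.sum(np.abs(h), axis=-1))

def steer(x, sae, f, alpha, m_f=1.0):
    """Eq.-4-style intervention: push x along the decoder atom v_f."""
    v_f = sae.W_D[:, f]
    return x + alpha * m_f * v_f
```

With a hard Top-K budget the per-token L0 is fixed directly, rather than controlled indirectly through the $\ell_1$ coefficient $\lambda$.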
2.2 From LLM-SAEs to DLM-SAEs: Training and Steering Differences
We summarize two key changes when adapting LLM-SAEs to DLMs: (i) which token positions provide training activations under diffusion corruption, and (ii) how feature steering is applied across denoising steps. Figure 2 shows training-position selection and per-step injection.

Training difference. In LLMs, SAE training typically collects activations from fully observed prefixes under a causal mask. In DLMs, each forward pass is conditioned on a random corruption level $t$ and a partially masked sequence $x_t$ (Eq. 7), so we define

$\mathcal{M}(x_t) = \{\, i \in [N] : x_t^i = \texttt{[MASK]} \,\}, \qquad \mathcal{U}(x_t) = [N] \setminus \mathcal{M}(x_t),$ (9)

and let $x_\ell^i(x_t) \in \mathbb{R}^d$ be the residual-stream activation at layer $\ell$ and position $i$ when running the DLM on $x_t$. We train two SAEs by choosing the position set $S(x_t) \in \{\mathcal{M}(x_t), \mathcal{U}(x_t)\}$ and minimizing

$\mathbb{E}_{x_0 \sim q(x),\; t \sim U(0,1),\; x_t \sim q(x_t \mid x_0)} \Big[ \frac{1}{|S(x_t)|} \sum_{i \in S(x_t)} \big( \|x_\ell^i(x_t) - \hat{x}_\ell^i(x_t)\|_2^2 + \lambda \|h_\ell^i(x_t)\|_1 \big) \Big]$ (12)

where $(h_\ell^i, \hat{x}_\ell^i)$ follow Eq. 3 with shared SAE parameters at layer $\ell$. We refer to $S(x_t) = \mathcal{M}(x_t)$ as Mask-SAE and $S(x_t) = \mathcal{U}(x_t)$ as Unmask-SAE.

Steering difference. LLM steering applies a single intervention in a left-to-right pass. DLM steering must operate across denoising steps (Eq. 8). Let $X_{\ell,k} \in \mathbb{R}^{N \times d}$ denote the residual-stream matrix at layer $\ell$ when running on $x_{t_k}$, and let $v_f \in \mathbb{R}^d$ be the decoder atom for feature $f$ (Eq. 4). We inject the feature at step $k$ via

$X_{\ell,k}^{\mathrm{steer}} = X_{\ell,k} + \alpha\, m_f\, s_k v_f^\top,$ (13)

where $s_k \in \{0,1\}^N$ selects token positions to steer at that step. We study two DLM-specific choices:

All-tokens: $(s_k)_i = 1 \;\; \forall i$; Update-tokens: $(s_k)_i = \mathbb{1}[i \in \mathcal{M}(x_{t_k})]$. (14)

Eq. 13 applies the same feature direction repeatedly over denoising, while Eq. 14 distinguishes steering all positions versus only the currently masked (to-be-updated) positions.

3 Can SAEs Extract Interpretable Features in DLMs?
We train SAEs and assess (i) whether they provide a faithful reparameterization of DLM activations under a sparsity budget, and (ii) whether the resulting latents form human-interpretable features.

3.1 Training Details
We train Top-K SAEs on two representative DLMs, Dream-7B (Ye et al., 2025) and LLaDA-8B (Nie et al., 2025b), using The Common Pile v0.1 (Kandpal et al., 2025) as training data. For each model, we splice the SAE into the residual stream and train an SAE at layers spanning the network depth: two shallow, two middle, and two deep layers.

Table 1: SAE training configurations for diffusion LMs (Dream, LLaDA) and matched autoregressive LLM baselines (Qwen, LLaMA).

| | Dream-SAE | LLaDA-SAE | Qwen-SAE | LLaMA-SAE |
| Model | Dream-7B | LLaDA-8B | Qwen-2.5-7B | LLaMA-3-8B |
| Training data | The Common Pile | The Common Pile | The Common Pile | The Common Pile |
| Activation function | ReLU+TopK | ReLU+TopK | ReLU+TopK | ReLU+TopK |
| Insertion layers | 1, 5, 10, 14, 23, 27 | 1, 6, 11, 16, 26, 30 | 1, 5, 10, 14, 23, 27 | 1, 6, 11, 16, 26, 30 |
| Insertion site | residual stream | residual stream | residual stream | residual stream |
| SAE width | 16K | 16K | 16K | 16K |

Table 1 summarizes the backbone-specific settings. Full hyperparameters, training resources, and data preprocessing details are deferred to Appendix B.

3.2 Evaluating SAEs on the sparsity-fidelity trade-off
We evaluate DLM-SAEs with two metrics: reconstruction fidelity and the change in DLM training loss when the SAE is spliced into the model. All metrics are computed on held-out tokens.

Reconstruction fidelity (Explained Variance ↑).
We report the explained variance (EV) of reconstructions, measuring how much of $x$ is captured by the reconstruction $\hat{x}$:

$\mathrm{EV} = 1 - \frac{\mathbb{E}\big[\|x - \hat{x}\|_2^2\big]}{\mathbb{E}\big[\|x\|_2^2\big]}.$ (15)

Functional fidelity (delta LM loss ↓). Let $\mathcal{L}_{\mathrm{DLM}}(\theta)$ be the masked-token cross-entropy objective in Eq. 7. We define $\mathcal{L}_{\mathrm{DLM}}^{\mathrm{ins}}(\theta)$ as the same objective, but with the residual stream at the target layer replaced by the SAE reconstruction ($x \leftarrow \hat{x}$) before continuing the forward pass:

$\Delta\mathcal{L}_{\mathrm{DLM}} = \mathcal{L}_{\mathrm{DLM}}^{\mathrm{ins}}(\theta) - \mathcal{L}_{\mathrm{DLM}}(\theta),$ (16)

computed under the same $(x_0, t, x_t)$ sampling as Eq. 7. Figure 3 shows that DLM-SAEs achieve a favorable sparsity-fidelity profile, indicating that spliced SAEs are usable for mechanistic analyses. Interestingly, Dream shows a shallow-layer regime at small $L_0$ where Δ loss is negative, meaning that SAE insertion can reduce cross-entropy loss. This effect is absent or much weaker in LLMs, where insertion increases loss. We observe the same pattern on LLaDA-SAEs, with full results in Appendix C.

Table 2: SAE steering summary at $L_0 = 80$ across models and layers. Each cell reports three quantities: C (normalized concept improvement), P (normalized perplexity reduction), and S (overall steering score, $S = C + \lambda P$ with $\lambda = 0.3$). Stars (⋆) mark the best S within each model-family block (Qwen/Dream; LLaMA/LLaDA) at each layer. Overall, SAE features enable effective diffusion-time interventions that often outperform single-pass LLM steering.
Cells are C / P / S (higher is better for all three); L1 and L5 are shallow layers, L10 and L14 middle layers, L23 and L27 deep layers.

| Model / SAE | L1 | L5 | L10 | L14 | L23 | L27 |
| Qwen-2.5-7B | 0.16 / -0.39 / 0.04 | 0.18 / -0.34 / 0.08 | 0.18 / -0.35 / 0.08 | 0.22 / -0.26 / 0.14⋆ | 0.28 / -0.43 / 0.15 | 0.03 / -0.19 / -0.02 |
| Dream-Unmask | 0.13 / -0.08 / 0.10 | 0.17 / -0.17 / 0.12⋆ | 0.19 / -0.31 / 0.10 | 0.20 / -0.64 / 0.01 | 0.26 / -0.33 / 0.16 | 0.33 / 0.18 / 0.38⋆ |
| Dream-Mask | 0.14 / -0.11 / 0.11⋆ | 0.13 / -0.09 / 0.11 | 0.17 / -0.12 / 0.13⋆ | 0.22 / -0.37 / 0.11 | 0.35 / -0.35 / 0.24⋆ | 0.29 / 0.01 / 0.29 |
| LLaMA-3-8B | 0.08 / -7.89 / -2.29 | 0.18 / -2.14 / -0.46 | 0.15 / -1.23 / -0.22 | 0.25 / -0.67 / 0.05 | 0.17 / -0.44 / 0.04 | 0.13 / -0.09 / 0.10 |
| LLaDA-Unmask | 0.14 / -0.05 / 0.13⋆ | 0.16 / -0.13 / 0.13⋆ | 0.13 / 0.02 / 0.13 | 0.13 / -0.04 / 0.11 | 0.25 / -0.16 / 0.20⋆ | 0.16 / -0.26 / 0.08 |
| LLaDA-Mask | 0.13 / 0.00 / 0.13⋆ | 0.11 / 0.00 / 0.11 | 0.14 / 0.04 / 0.15⋆ | 0.16 / -0.06 / 0.14⋆ | 0.21 / -0.12 / 0.17 | 0.15 / -0.02 / 0.15⋆ |

3.3 Interpretability of Features
To test whether DLM-SAEs learn human-interpretable concepts, we adopt an auto-interpretation protocol (Karvonen et al., 2025; Paulo et al., 2025) that produces (i) an explanation for each feature and (ii) an interpretability score measuring how predictive that explanation is of feature activation. For each feature f, we apply an automated interpretation procedure on a held-out stream of 5M tokens. We highlight the tokens that activate f the most (along with their activation values) and ask an LLM to describe the pattern that f appears to capture. To check whether this description is meaningful, we run a simple discrimination test: given unlabeled sequences, a separate judge LLM uses the description to predict which ones should activate f, and we report its accuracy as the interpretability score. Full details are in Appendix D, with examples in Appendix E. In conclusion, SAEs serve as a practical mechanistic interpretability interface for DLMs.
The resulting DLM-SAEs achieve strong sparsity-fidelity trade-offs, including an early-layer regime where SAE insertion can even reduce cross-entropy loss, and they yield human-interpretable features validated by automated interpretation scores.

4 Do These Features Enable Effective Steering During the Denoising Stage?
4.1 Using Features to Intervene during Denoising
A key advantage of DLMs is that generation unfolds over multiple denoising steps, exposing repeated opportunities to intervene. We extend the idea of SAE feature steering from LLMs: injecting a feature's decoder atom during denoising to change the final decoded text. Concretely, we intervene at a chosen layer $\ell$ by adding the feature direction $v_f$ to the residual stream according to Eq. 13. We consider two diffusion-time steering strategies that differ only in the token selector $s_k$ in Eq. 13: All-tokens steers all positions, while Update-tokens steers only the currently masked positions $\mathcal{M}(x_{t_k})$ (Eq. 14).

4.2 Steering Results
Metrics. We follow the steering evaluation protocol of Sun et al. (2025) and Wu et al. (2025). For each feature f, we sample neutral prefixes P (examples in Appendix F) and generate continuations with and without diffusion-time steering (Eq. 13). Let $C_{\mathrm{before}}(f), C_{\mathrm{after}}(f)$ denote the prefix-averaged concept scores, and let $p_{\mathrm{before}}(f), p_{\mathrm{after}}(f)$ denote the prefix-averaged perplexities, both computed on the generated continuation. We report three normalized metrics:

$C(f) = \frac{C_{\mathrm{after}}(f) - C_{\mathrm{before}}(f)}{s_C} \qquad (s_C \in \{1, 100\}),$ (17)

where $C(f)$ measures the concept improvement (larger is better; typically $C(f) \in [-1, 1]$ after normalization);

$P(f) = \frac{p_{\mathrm{before}}(f) - p_{\mathrm{after}}(f)}{p_{\mathrm{before}}(f)},$ (18)

where $P(f)$ is the relative perplexity reduction (larger is better; $P(f) \le 1$ and can be negative if fluency degrades).
$S(f) = C(f) + \lambda P(f),$ (19)

where $S(f)$ is the overall steering score trading off concept gain and fluency ($S(f) \in [-1,\, 1+\lambda]$).

Experimental setup. We evaluate diffusion-time steering for 30 denoising steps. For Dream-7B we use All-tokens steering, i.e., $(s_k)_i = 1$ for all positions (Eq. 14), while for LLaDA-8B we use Update-tokens steering, i.e., $(s_k)_i = \mathbb{1}[i \in \mathcal{M}(x_{t_k})]$. For fairness, we match the generation length budget and the steering-strength sweep across LLM and DLM baselines by using the same α range. Full details (steering settings, prompt set, and ablation experiments) are deferred to Appendix G.

Table 2 shows a clear trend: sparse features enable diffusion-time interventions that are effective and often outperform LLM steering. Relative to LLM-SAEs, DLM-SAEs achieve substantially higher overall steering scores, typically by 2-10× in deep layers. The strongest gains appear in the deepest layers, where DLM steering attains the best steering score within each model family, suggesting that steerable semantic directions are concentrated in late residual-stream representations. Overall, diffusion-time steering offers a more effective control interface than single-pass LLM steering.

Figure 4: Pre-mask SAE feature stability across three DLM inference orders. Top: layer-step heatmaps of mean pre-mask top-k Jaccard similarity between consecutive steps (k−1 → k). Bottom: mean pre-mask similarity (averaged over tracked layers/positions) vs. normalized generation progress. This figure shows that Origin yields a less dynamic SAE trajectory, while confidence-based orders exhibit structured turnover followed by stabilization.

5 Can SAEs Help Explain Different Decoding Orders of DLMs?
In this section, we use SAEs to track residual-stream dynamics and analyze how representation trajectories differ across decoding-order strategies.
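Tracking these dynamics amounts to encoding the layer-$\ell$ residual stream with a trained SAE at every denoising step and recording the latents per position. A minimal sketch of the collection loop follows; the callable interfaces (`run_step`, `encode`) are our assumptions, not the paper's API.

```python
import numpy as np

def collect_sae_latents(run_step, encode, x_t, n_steps):
    """Record SAE latents at every denoising step.

    run_step(x_t) -> (residual, x_t_next): one denoising step returning the
        layer-l residual-stream activations (N x d) and the next partially
        masked sequence (both interfaces are assumptions for this sketch).
    encode(residual) -> SAE latents (N x width), as in Eq. 3.

    Returns the per-step latents h[k] and the sequence states, which the
    stability/drift metrics of Eqs. 20-21 can then be computed over.
    """
    latents, states = [], [x_t]
    for _ in range(n_steps):
        residual, x_t = run_step(x_t)
        latents.append(encode(residual))
        states.append(x_t)
    return latents, states
```

Keeping the full trajectory of latents (rather than only the final step) is what makes per-position comparisons between consecutive steps possible.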
5.1 Analyzing Decoding-Order Strategies with Features
We consider three remasking strategies that control the token generation order: (i) Origin (Austin et al., 2021) updates positions in a random order, independent of model confidence; (ii) TopK-margin (Kim et al., 2025) ranks positions by the margin score $p_{(1)} - p_{(2)}$ and updates the top-K most confident positions first; (iii) Entropy (Ye et al., 2025) ranks positions by token-distribution entropy and updates lower-entropy positions earlier.

For a decoding order $O$, let $x^{(O)}_{t_k}$ denote the partially masked sequence at denoising step $k$, and let $h^{(O)}_{\ell,k,i}$ be the SAE latent at layer $\ell$ and position $i$ obtained by encoding the residual-stream activation (Eq. 3). Let $\mathrm{TopK}_{\mathrm{feat}}(h)$ return the indices of the $K_{\mathrm{feat}}$ largest-magnitude latents. Feature-set overlap is measured with the Jaccard similarity (Jaccard, 1912), $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.

Pre-mask stability. To test whether a position becomes stable before it is decoded, we measure step-to-step similarity while the position remains masked. We define $\mathcal{M}(x^{(O)}_{t_k}) = \{\, i : x^{(O),i}_{t_k} = \texttt{[MASK]} \,\}$. For $i \in \mathcal{M}(x^{(O)}_{t_k})$:

$S^{\mathrm{pre}}_{\ell,k,i}(O) = J\big( \mathrm{TopK}_{\mathrm{feat}}(h^{(O)}_{\ell,k,i}),\; \mathrm{TopK}_{\mathrm{feat}}(h^{(O)}_{\ell,k-1,i}) \big)$ (20)

We summarize $S^{\mathrm{pre}}_{\ell,k,i}(O)$ by averaging over masked positions, prompts, and runs, and visualize it as a function of layer and generation progress.

Post-decode drift. To measure how much a decoded position's representation continues to change, define its decode step $k_i(O) = \min\{\, k : i \notin \mathcal{M}(x^{(O)}_{t_k}) \,\}$. Let $K$ be the total number of denoising steps. To compare strategies with different step counts, we plot summaries against normalized progress $\tau = k/K \in [0, 1]$.
We quantify average drift as:

$D^{\mathrm{post}}_{\ell,i}(O) = \frac{1}{K - k_i(O)} \sum_{k = k_i(O)+1}^{K} \big( 1 - S_{\ell,k,i}(O) \big)$ (21)

Figure 5: Post-decode SAE feature drift. Drift is computed only after a position's token is fixed. Top: layer-step heatmaps of post-decode drift between consecutive steps' top-k feature sets. Bottom: mean post-decode drift (averaged over tracked positions) vs. normalized generation progress. This figure shows that confidence-based orders sustain stronger deep-layer drift, indicating a stronger effect of bidirectional attention in this situation, whereas Origin drifts less.

5.2 Feature Dynamics Across Decoding Strategies
We study decoding-order effects on GSM8K (Cobbe et al., 2021) using Dream-7B. To ensure coverage, we sample 800 questions and run inference with $K = 128$ denoising steps, generating one token per step. In this setting, the three decoding orders yield markedly different task performance: Origin achieves 8% accuracy, while the confidence-based orders perform substantially better (TopK-margin: 56%, Entropy: 59%). We track features on the same set of layers across orders and compute pre-mask stability $S^{\mathrm{pre}}$ (Eq. 20) and post-decode drift $D^{\mathrm{post}}$ (Eq. 21).

Figures 4 and 5 show a clear difference between decoding orders. Under Origin, the SAE features change only slightly from step to step, suggesting a quieter evolution in the SAE feature space. In contrast, the confidence-based orders (TopK-margin, Entropy) show a more organized progression: features shift more early on for still-masked tokens, then settle quickly, while deep-layer features continue to adjust after tokens are fixed, indicating a stronger effect of bidirectional attention in this situation. Further analysis can be found in Appendix H. We conjecture that these SAE-based dynamics provide a useful signal that correlates with task performance across decoding orders.
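The stability and drift metrics above (Eqs. 20-21) can be sketched in a few lines. The function names (`topk_feat`, `pre_mask_stability`, `post_decode_drift`) and the list-of-latents interface are our illustrative assumptions; `latents[k]` holds one position's SAE latent vector at denoising step k.

```python
import numpy as np

def topk_feat(h, k):
    """Indices of the k largest-magnitude SAE latents (TopK_feat in Eq. 20)."""
    return set(np.argsort(-np.abs(h))[:k].tolist())

def jaccard(a, b):
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def pre_mask_stability(h_prev, h_curr, k):
    """Eq. 20: step-to-step top-k feature overlap for one position."""
    return jaccard(topk_feat(h_prev, k), topk_feat(h_curr, k))

def post_decode_drift(latents, k_i, k):
    """Eq. 21: mean (1 - similarity) over the steps after decode step k_i.

    latents[s] is the position's latent at denoising step s, s = 0..K.
    """
    K = len(latents) - 1
    drifts = [1.0 - pre_mask_stability(latents[s - 1], latents[s], k)
              for s in range(k_i + 1, K + 1)]
    return sum(drifts) / (K - k_i)
```

Drift is simply one minus the same top-k overlap, averaged over the post-decode steps, so a position whose feature set never changes after decoding has drift 0.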
In our experiments, higher-accuracy confidence-based orders (TopK-margin, Entropy) exhibit larger early shifts on masked positions and continue to adjust in deeper layers after decoding, while the lower-performing Origin order changes less overall. These consistent layerwise differences in SAE feature dynamics provide mechanistic insight and suggest directions for future decoding-order design.

6 Do Base-Trained SAEs Generalize to the Instruction-Tuned Model?
6.1 Layerwise Transfer: Base SAE vs. SFT SAE
Training SAEs on every instruction-tuned DLM is expensive. Ideally, a base-trained SAE could be reused as a faithful one on the instruction-tuned model. We therefore test whether Dream base SAEs and Dream SFT SAEs behave similarly when inserted into the same backbone. We compare two SAEs trained with identical settings: the Base SAE, trained on Dream-7B Base (Dream Base), and the SFT SAE, trained on Dream-7B Instruct (Dream SFT). For each target model, we splice the SAE into the residual stream at the chosen layers and sweep sparsity budgets. We evaluate (i) reconstruction quality using explained variance (Eq. 15) and (ii) functional faithfulness using the insertion-induced loss change $\Delta\mathcal{L}_{\mathrm{DLM}}$ (Eq. 16). Results for inserting SAEs into Dream Base are deferred to Appendix I.

The left and middle panels of Figure 6 show that, except at the deepest layer, inserting the Base SAE or the SFT SAE into Dream SFT yields nearly identical functional behavior. Across L1, L5, L10, L14, and L23, the experiments consistently indicate strong cross-model transferability of SAE representations at these depths. By contrast, L27 exhibits a separation in $\Delta\mathcal{L}_{\mathrm{DLM}}$, suggesting that the deepest representations are substantially more sensitive to instruction-tuning effects.

Figure 6: Base-to-SFT transfer of DLM-SAEs across layers. We use Mask-SAE for this test. Top row: Base SAE inserted into Dream SFT. Bottom row: SFT SAE inserted into Dream SFT.
Left: $\Delta\mathcal{L}_{\mathrm{DLM}}$ evaluated on masked-token denoising inputs. Middle: reconstruction fidelity measured by Explained Variance (EV). Right: $\Delta\mathcal{L}_{\mathrm{DLM}}$ measured during instruction rollouts. This figure indicates that base-trained SAEs transfer to the instruction-tuned DLM except at the deepest layer.

6.2 DLM-SAE Transfer Under Instruction Rollouts
We further test transfer under instruction rollouts to assess whether Base-SFT SAE transfer persists on instruction data. Concretely, we generate Dream SFT assistant continuations with 30 steps and 30 tokens, and insert an SAE at layers {1, 5, 10, 14, 23, 27}. Insertion-induced functional change is measured by $\Delta\mathcal{L}_{\mathrm{DLM}}$ (Eq. 16) on the rollout segment. Despite the distribution shift, SAEs remain reusable in shallow and middle layers on instruction rollouts (right panel of Figure 6). However, at layer 27 the Base SAE induces a much larger loss increase than the SFT SAE. This suggests that the base model's deepest-layer subspace fails to capture instruction-critical directions engaged during rollouts. Base-trained SAEs generally transfer almost losslessly to the instruction-tuned DLM, except at the deepest layer. Base SAE and SFT SAE behave almost identically on Dream SFT for L1-L23, suggesting similar internal signals. This reuse largely holds under instruction rollouts, but breaks at L27, where the Base SAE fails to capture tuning-specific behavior.

7 Discussion
This work introduces DLM-Scope, the first SAE-based interpretability interface for diffusion language models (DLMs). We show that DLM-trained SAEs enable mechanistic analysis and can even reduce cross-entropy when inserted at some layers, and that their sparse features support effective diffusion-time interventions, decoding-strategy analysis, and transfer to instruction-tuned DLMs. As DLMs scale, we plan to extend SAEs to more DLMs to better understand and improve these models.

References
D. Arad, A. Mueller, and Y.
Belinkov (2025) SAEs are good for steering – if you select the right features. External Links: 2505.20063, Link. Cited by: Appendix A.
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg (2021) Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.). External Links: Link. Cited by: §5.1.
S. Basu, N. Zhao, V. I. Morariu, S. Feizi, and V. Manjunatha (2023) Localizing and editing knowledge in text-to-image generative models. In The Twelfth International Conference on Learning Representations. Cited by: Appendix A.
T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan, et al. (2025) LLaDA2.0: scaling up diffusion language models to 100B. arXiv preprint arXiv:2512.15745. Cited by: §1.
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2023/monosemantic-features/index.html Cited by: §1.
T. Chen, S. Zhang, and M. Zhou (2025) DLM-One: diffusion language models for one-step sequence generation. External Links: 2506.00290, Link. Cited by: Appendix A.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: §5.2.
H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. External Links: 2309.08600, Link. Cited by: §1.
DeepMind (2024) Gemini diffusion.
Note: https://deepmind.google/technologies/geminiAccessed: 2025-07-09 Cited by: Appendix A. E. Durmus, A. Tamkin, J. Clark, J. Wei, J. Marcus, J. Batson, K. Handa, L. Lovitt, M. Tong, M. McCain, O. Rausch, S. Huang, S. Bowman, S. Ritchie, T. Henighan, and D. Ganguli (2024) External Links: Link Cited by: §1. C. Fan, W. Heng, B. Li, S. Liu, Y. Song, J. Su, X. Qu, K. Shen, and W. Wei (2026) Stable-diffcoder: pushing the frontier of code diffusion large language model. arXiv preprint arXiv:2601.15892. Cited by: Appendix A. E. Farrell, Y. Lau, and A. Conmy (2024) Applying sparse autoencoders to unlearn knowledge in language models. In Neurips Safe Generative AI Workshop 2024, External Links: Link Cited by: Appendix A. J. Ferrando, O. B. Obeso, S. Rajamanoharan, and N. Nanda (2025) Do i know this entity? knowledge awareness and hallucinations in language models. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1. L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu (2024) Scaling and evaluating sparse autoencoders. External Links: 2406.04093, Link Cited by: Appendix B, §1. S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong (2023) DiffuSeq: sequence to sequence text generation with diffusion models. In International Conference on Learning Representations (ICLR 2023)(01/05/2023-05/05/2023, Kigali, Rwanda), Cited by: §1. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: Appendix A. Q. He, J. Weng, J. Tao, and H. Xue (2025) A single neuron works: precise concept erasure in text-to-image diffusion models. arXiv preprint arXiv:2509.21008. Cited by: §1. Z. He, W. Shu, X. Ge, L. Chen, J. Wang, Y. Zhou, F. Liu, Q. Guo, X. Huang, Z. Wu, et al. (2024) Llama scope: extracting millions of features from llama-3.1-8b with sparse autoencoders. 
CoRR. Cited by: Appendix A. Z. He, T. Sun, Q. Tang, K. Wang, X. Huang, and X. Qiu (2023) Diffusionbert: improving generative masked language models with diffusion models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), p. 4521–4534. Cited by: Appendix A. A. Helbling, T. H. S. Meral, B. Hoover, P. Yanardag, and D. H. Chau (2025) ConceptAttention: diffusion transformers learn highly interpretable features. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: Appendix A. J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, p. 6840–6851. Cited by: Appendix A. P. Jaccard (1912) The distribution of the flora in the alpine zone. 1. New phytologist 11 (2), p. 37–50. Cited by: §5.1. N. Kandpal, B. Lester, C. Raffel, S. Majstorovic, S. Biderman, B. Abbasi, L. Soldaini, E. Shippole, A. F. Cooper, A. Skowron, S. Longpre, L. Sutawika, A. Albalak, Z. Xu, G. Penedo, L. Ben Allal, E. Bakouch, J. D. Pressman, H. Fan, D. Stander, G. Song, A. Gokaslan, J. Kirchenbauer, T. Goldstein, B. R. Bartoldson, B. Kailkhura, and T. Murray (2025) The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text. arXiv preprint arXiv:2506.05209. Cited by: §B.1, §3.1. A. Karvonen, C. Rager, J. Lin, C. Tigges, J. I. Bloom, D. Chanin, Y. Lau, E. Farrell, C. S. McDougall, K. Ayonrinde, D. Till, M. Wearden, A. Conmy, S. Marks, and N. Nanda (2025) SAEBench: a comprehensive benchmark for sparse autoencoders in language model interpretability. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: Appendix D, §3.3. J. Kim, K. Shah, V. Kontonis, S. M. Kakade, and S. Chen (2025) Train for the worst, plan for the best: understanding token ordering in masked diffusions. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: §5.1. X. 
Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Advances in neural information processing systems 35, p. 4328–4343. Cited by: Appendix A. X. Liu, C. Gong, and Q. Liu (2022) Flow straight and fast: learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003. Cited by: Appendix A. L. Marks, A. Paren, D. Krueger, and F. Barez (2024) Enhancing neural network interpretability with feature-aligned sparse autoencoders. External Links: 2411.01220, Link Cited by: §1. H. Mayne, Y. Yang, and A. Mahdi (2024) Can sparse autoencoders be used to decompose and interpret steering vectors?. External Links: 2411.08790, Link Cited by: Appendix A. [30] C. McDougall, A. Conmy, J. Kramár, T. Lieberum, S. Rajamanoharan, N. Nanda, and Google Gemma scope 2 - technical paper. External Links: Link Cited by: Appendix A. A. Mudide, J. Engels, E. J. Michaud, M. Tegmark, and C. S. de Witt (2025) Efficient dictionary learning with switch sparse autoencoders. External Links: 2410.08201, Link Cited by: Appendix A. S. Nie, F. Zhu, C. Du, T. Pang, Q. Liu, G. Zeng, M. Lin, and C. Li (2025a) Scaling up masked diffusion models on text. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1. S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. ZHOU, Y. Lin, J. Wen, and C. Li (2025b) Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §1, §3.1. K. O’Brien, D. Majercak, X. Fernandes, R. Edgar, B. Bullwinkel, J. Chen, H. Nori, D. Carignan, E. Horvitz, and F. Poursabzi-Sangdeh (2025) Steering language model refusal with sparse autoencoders. External Links: 2411.11296, Link Cited by: Appendix A. G. S. Paulo, A. T. Mallen, C. Juang, and N. Belrose (2025) Automatically interpreting millions of features in large language models. 
In Forty-second International Conference on Machine Learning, External Links: Link Cited by: Appendix D, §3.3. N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner (2024) Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 15504–15522. Cited by: Appendix A. Y. Shi, C. Li, Y. Wang, Y. Zhao, A. Pang, S. Yang, J. Yu, and K. Ren (2025) Dissecting and mitigating diffusion bias via mechanistic interpretability. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 8192–8202. Cited by: Appendix A. A. Stolfo, V. Balachandran, S. Yousefi, E. Horvitz, and B. Nushi (2025) Improving instruction-following in language models through activation steering. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: Appendix A. N. Subramani, N. Suresh, and M. Peters (2022) Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, p. 566–581. External Links: Link, Document Cited by: Appendix A. H. Sun, H. Peng, Q. Dai, X. Bai, and Y. Cao (2025) LayerNavigator: finding promising intervention layers for efficient activation steering in large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: §4.2. V. Surkov, C. Wendler, A. Mari, M. Terekhov, J. Deschenaux, R. West, C. Gulcehre, and D. Bau (2025) One-step is enough: sparse autoencoders for text-to-image diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: Appendix A, §1. R. Tang, L. Liu, A. Pandey, Z. Jiang, G. Yang, K. Kumar, P. Stenetorp, J. Lin, and F. Türe (2023) What the daam: interpreting stable diffusion using cross attention. 
In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 5644–5659. Cited by: Appendix A. A. I. Team (2024) Training sparse autoencoders. Note: https://transformer-circuits.pub/2024/april-update/index.html#training-saesAccessed: 2025-01-20 Cited by: Appendix B. A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, H. Cunningham, N. L. Turner, C. McDougall, M. MacDiarmid, C. D. Freeman, T. R. Sumers, E. Rees, J. Batson, A. Jermyn, S. Carter, C. Olah, and T. Henighan (2024) Scaling monosemanticity: extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread. External Links: Link Cited by: §1. A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2024) Steering language models with activation engineering. External Links: 2308.10248, Link Cited by: Appendix A. C. Wang, Y. Gan, H. Zhou, C. Hu, Y. Mu, K. Song, M. Yang, B. Li, C. Zhang, T. Liu, J. Zhu, Z. Yu, and T. Xiao (2025a) MRO: enhancing reasoning in diffusion language models via multi-reward optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, External Links: Link Cited by: Appendix A. [47] X. Wang, Y. Hu, B. Wang, and D. Zou Does higher interpretability imply better utility? a pairwise analysis on sparse autoencoders. In Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025, Cited by: Appendix A. X. Wang, Z. Li, B. Wang, Y. Hu, and D. Zou (2025b) Model unlearning via sparse autoencoder subspace guided projections. In ICML 2025 Workshop on Machine Unlearning for Generative AI, External Links: Link Cited by: Appendix A. T. Wu, Z. Fan, X. Liu, H. Zheng, Y. Gong, J. Jiao, J. Li, J. Guo, N. Duan, W. Chen, et al. (2023) Ar-diffusion: auto-regressive diffusion model for text generation. Advances in Neural Information Processing Systems 36, p. 39957–39974. Cited by: §1. Z. Wu, A. 
Arora, A. Geiger, Z. Wang, J. Huang, D. Jurafsky, C. D. Manning, and C. Potts (2025) AxBench: steering LLMs? even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning, External Links: Link Cited by: Appendix F, §4.2. M. Xu, T. Geffner, K. Kreis, W. Nie, Y. Xu, J. Leskovec, S. Ermon, and A. Vahdat (2025) Energy-based diffusion language models for text generation. In The Thirteenth International Conference on Learning Representations, External Links: Link Cited by: §1. Q. A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qiu, S. Quan, and Z. Wang (2024) Qwen2.5 technical report. ArXiv abs/2412.15115. External Links: Link Cited by: Appendix A. J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487. Cited by: §1, §3.1, §5.1. S. Zhang, Y. Zhao, L. Geng, A. Cohan, L. A. Tuan, and C. Zhao (2025) Diffusion vs. autoregressive language models: a text embedding perspective. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, p. 4273–4303. Cited by: Appendix A. Y. Zhao, A. Devoto, G. Hong, X. Du, A. P. Gema, H. Wang, X. He, K. Wong, and P. Minervini (2025) Steering knowledge selection behaviours in LLMs via SAE-based representation engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, p. 5117–5136. External Links: Link Cited by: Appendix A. 
Appendix A Related Work

SAEs as an interpretability interface for LLMs. Recent work has established highly practical SAE interfaces for LLMs (Yang et al., 2024; Grattafiori et al., 2024; McDougall et al.; He et al., 2024), training on residual-stream activations to learn feature dictionaries that reconstruct faithfully and capture meaningful features (Mudide et al., 2025). Given these interpretable features, the first clear practical payoff has been steering: intervening along a direction in the residual stream to shape model behavior (Subramani et al., 2022; Rimsky et al., 2024; Turner et al., 2024; Stolfo et al., 2025). Prior steering methods often search directly in raw hidden-state space, where directions can be semantically polysemantic, limiting interpretability (Mayne et al., 2024; Arad et al., 2025; Wang et al.). In contrast, SAE feature directions provide more "atomic" control (O'Brien et al., 2025; Zhao et al., 2025; Farrell et al., 2024; Wang et al., 2025b). Building on this LLM-focused pipeline, we present DLM-Scope, the first SAE-based interpretability interface for DLMs. Empirically, we uncover DLM-specific behavior (e.g., inserting an SAE into early layers can reduce cross-entropy loss, a phenomenon that is absent in LLMs).

The rise of DLMs. Diffusion language models have recently become increasingly competitive for text understanding and generation (Chen et al., 2025; Li et al., 2022; DeepMind, 2024; Zhang et al., 2025; Wang et al., 2025a). In this context, a range of high-performing DLMs has emerged along two lines: continuous diffusion language models (Ho et al., 2020; Liu et al., 2022) and discrete diffusion language models (He et al., 2023; Fan et al., 2026). Prior work has developed interpretability tools for text-to-image diffusion models: for example, SAE edits expose sparse, causal concept factors that can be amplified or erased with minimal collateral effects (Surkov et al., 2025). 
Complementary work uses attention and causal localization to link tokens to image regions and to localize and edit specific visual knowledge efficiently (Tang et al., 2023; Helbling et al., 2025; Basu et al., 2023; Shi et al., 2025). By comparison, DLMs still lack an interpretability interface and need the corresponding tools. Our work trains SAEs for DLMs and shows that they enable diffusion-time steering, decoding-strategy analysis, and broad transfer to instruction-tuned models.

Appendix B DLM-SAE Training Setup

We train Top-K sparse autoencoders (TopK-SAEs; Gao et al., 2024; Team, 2024) on residual-stream activations at selected layers to learn a sparse dictionary over the layerwise activation distribution. Concretely, for a layer activation tensor of shape (B, T, d), we treat each valid token position as one training example by flattening (B, T, d) → (B·T, d) and training the SAE on the resulting per-token vectors. This design aligns with the objective of modeling the full distribution of internal representations at a layer and improves sample efficiency compared to restricting training to any single position.

B.1 Training Hyperparameters

Across backbones, we use a fixed context length of 2048 and stream tokenized text from a Common Pile split (Kandpal et al., 2025). A typical quick-check setting trains with batch size 8 under a 1M-token budget, and the same pipeline scales to larger budgets. For diffusion language models, inputs are partially masked, so we train two controlled variants under the same TopK-SAE objective: Mask-SAE collects activations only at masked (to-be-predicted) positions, while Unmask-SAE collects activations only at unmasked context positions. This isolates how the diffusion corruption pattern affects the learned sparse dictionary.

B.2 Setup

We train a grid of SAEs per backbone spanning multiple layers and sparsity budgets; a typical setup trains 36 SAEs per model, corresponding to 6 layers × 6 sparsity settings. 
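As a concrete reference, the flatten-then-encode step and the Top-K activation rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions (untied encoder/decoder weights, ReLU on the surviving pre-activations), not the paper's implementation; all function and variable names are our own.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One TopK-SAE forward pass on per-token activations.

    x: (N, d) activations, already flattened from (B, T, d) -> (B*T, d).
    Returns (reconstruction, sparse_codes).
    """
    pre = x @ W_enc + b_enc                              # (N, m) pre-activations
    # Keep only the k largest pre-activations per token; zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    codes = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # ReLU on survivors
    recon = codes @ W_dec + b_dec                        # (N, d)
    return recon, codes

# Flattening (B, T, d) -> (B*T, d): every token position is one training example.
B, T, d, m, k = 2, 5, 8, 32, 4
rng = np.random.default_rng(0)
acts = rng.normal(size=(B, T, d)).reshape(B * T, d)
W_enc = rng.normal(size=(d, m)); b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d)); b_dec = np.zeros(d)
recon, codes = topk_sae_forward(acts, W_enc, b_enc, W_dec, b_dec, k)
assert recon.shape == (B * T, d)
assert np.all((codes != 0).sum(axis=1) <= k)  # at most k active latents per token
```

Training would then minimize the reconstruction error between `recon` and `x` over streamed per-token vectors; the sparsity budget is enforced structurally by the Top-K rule rather than by an L1 penalty.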
Table 3 summarizes representative training configurations and resource footprints across backbones, including the token budget, number of SAEs, SAE architecture, per-SAE storage size, per-SAE wall-clock time, batch/context settings, and hardware.

Table 3: Training configuration and resource footprint summary across model backbones, including token budget, number of SAEs, SAE architecture, per-SAE storage size, batch and context settings, and hardware.

Model | Tokens | #SAEs | Arch | Disk/SAE | batch_size | context_length | sae_batch_size | GPU
Qwen-2.5-7B | 150M | 36 | TopK | 449M | 8 | 2048 | 2048 | 2×A800
Dream-7B | 150M | 36 | TopK | 449M | 8 | 2048 | 2048 | 2×A800
LLaMA-3-8B | 150M | 36 | TopK | 512M | 8 | 2048 | 2048 | 1×H100
LLaDA-8B | 150M | 36 | TopK | 512M | 8 | 2048 | 2048 | 6×H100

Table 3 indicates that we keep the batch and context settings fixed across backbones to facilitate fair comparisons, while the observed per-SAE training time and hardware requirements vary with model family and experimental variant. In addition, the per-SAE checkpoint size is dominated by the SAE width and decoder parameters, leading to comparable storage footprints within each hidden-size family.

Appendix C Full Sparsity-Fidelity Results for LLaDA-SAEs with Explained Variance and Δ Loss

This appendix reports full layerwise sparsity-fidelity sweeps for LLaDA-SAEs using the same evaluation metrics as the main text: reconstruction fidelity via explained variance (Eq. 15) and functional fidelity via the insertion-induced loss change Δℒ_DLM (Eq. 16). Evaluation follows the same diffusion-style masking protocol as in the main experiments.

Functional fidelity across layers and L0. Figure 7 plots Δℒ_DLM over an L0 sweep across representative insertion layers. Increasing L0 generally reduces the magnitude of the insertion-induced loss change (curves move toward 0), while shallow-layer regimes can exhibit negative Δℒ_DLM, indicating that SAE insertion may improve the denoising objective. 
In deeper layers, overly sparse SAEs tend to induce larger functional deviations.

Figure 7: Full Δℒ_DLM sparsity sweeps for LLaDA-SAEs. Each curve corresponds to inserting an SAE at a specific residual-stream layer (legend) while sweeping the sparsity budget L0. The vertical axis reports the insertion-induced denoising loss change Δℒ_DLM (Eq. 16), evaluated on held-out masked-token denoising inputs using the same corruption/inference setup as in the main experiments. More negative values indicate that inserting the SAE improves the DLM denoising objective, whereas positive values indicate a degradation. Plotting all layers together makes it easy to compare how functional impact varies with sparsity and depth under matched evaluation conditions.

Reconstruction fidelity across layers and L0. Figure 8 plots explained variance (EV) versus L0 across layers. EV increases monotonically with L0, with shallow and mid layers achieving high reconstruction fidelity at moderate sparsity budgets, while deeper layers typically require larger L0 to reach comparable fidelity.

Figure 8: Full explained-variance sparsity sweeps for LLaDA-SAEs. Each curve corresponds to inserting an SAE at a specific residual-stream layer (legend) while sweeping the sparsity budget L0 on the horizontal axis. The vertical axis reports explained variance (EV; Eq. 15), computed on held-out activations, and reflects how well the SAE reconstruction captures variance in the target-layer representation under the chosen sparsity constraint.

Consistency with Dream. Overall, the LLaDA results mirror the main-text Dream findings: EV improves steadily as sparsity is relaxed, and deeper layers are more sensitive when SAEs are too sparse. Inserting an SAE can even reduce the cross-entropy loss in DLMs at some layers. 
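For reference, the two fidelity metrics can be sketched under common definitions: EV as one minus the ratio of residual variance to activation variance (pooled over tokens and dimensions), and Δℒ as the loss with the SAE inserted minus the clean loss. This is an illustrative sketch; Eq. 15 and Eq. 16 in the paper may differ in normalization details.

```python
import numpy as np

def explained_variance(x, x_hat):
    """EV = 1 - Var(residual) / Var(activations), pooled over tokens and dims."""
    resid = x - x_hat
    return 1.0 - resid.var() / x.var()

def delta_loss(loss_with_sae, loss_clean):
    """Insertion-induced loss change; negative values mean the SAE helped."""
    return loss_with_sae - loss_clean

# Synthetic check: a near-perfect reconstruction should give EV close to 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))
x_hat = x + 0.1 * rng.normal(size=x.shape)
ev = explained_variance(x, x_hat)
assert 0.9 < ev < 1.0
assert delta_loss(2.30, 2.35) < 0  # denoising loss dropped after insertion
```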
Appendix D Auto-Interpretation Protocol for SAE Features with Prompting and Scoring

We follow an auto-interpretation protocol (Paulo et al., 2025; Karvonen et al., 2025) that assigns each SAE latent a short natural-language description together with an interpretability score. For each evaluated latent f, we build evidence windows from held-out text: (i) top-activating context windows around the strongest activation peaks, and (ii) additional importance-weighted windows from the non-top region, augmented with randomly sampled negative windows. A judge LLM is queried in two stages. It first produces a concise explanation from the marked high-activation examples, then predicts which unmarked scoring examples should activate given that explanation. The auto-interpretability score is the agreement accuracy between the judge-selected indices and the ground-truth active labels induced by the activation threshold on the same windows.

Experimental configuration. In our Dream-7B feature interpretation runs, we use a Common Pile split and collect 5M tokens. We tokenize into fixed windows of context_length=128 and run batched forward passes with batch_size=64. We evaluate up to n_latents=1000 latents per SAE, with latent_batch_size=100 controlling how many latents are processed per scheduling batch, and we filter dead latents using dead_latent_threshold=15 (minimum estimated activation count over the token budget). The judge model is gpt-4o-mini.

Explanation prompt (generation stage)

System. We're studying neurons in a neural network. Each neuron activates on some particular word/words/substring/concept in a short document. The activating words in each document are indicated with << ... >>. We will give you a list of documents on which the neuron activates, in order from most strongly activating to least strongly activating. Look at the parts of the document the neuron activates for and summarize in a single sentence what the neuron is activating on. 
Try not to be overly specific in your explanation. Note that some neurons will activate only on specific words or substrings, but others will activate on most/all words in a sentence provided that sentence contains some particular concept. Your explanation should cover most or all activating words. Pay attention to capitalization and punctuation, since they might matter.

User (template). The activating documents are given below: 1. ... 2. ... … N. ...

Input formatting. Examples are detokenized windows ranked by activation strength; the putative activating span is marked with << >>.

Scoring prompt (prediction stage)

System. We're studying neurons in a neural network. Each neuron activates on some particular word/words/substring/concept in a short document. You will be given a short explanation of what this neuron activates for, and then be shown several example sequences in random order. You must return a comma-separated list of the examples where you think the neuron should activate at least once, on ANY of the words or substrings in the document. For example, your response might look like "2, 9, 10, 12". Try not to be overly specific in your interpretation of the explanation. If you think there are no examples where the neuron will activate, you should just respond with "None". You should include nothing else in your response other than comma-separated numbers or the word "None" - this is important.

User (template). Here is the explanation: <one-sentence explanation>. Here are the examples: 1. ... 2. ... … N. ...

Output constraint. The judge must output only a comma-separated list of indices (1-based) or None; scoring examples are shown without << >> markers.

Appendix E Examples of DLM-SAE Features with Maximally Activating Contexts and Explanations

We present qualitative examples of DLM-SAE features discovered by the auto-interpretation pipeline. 
For each feature, we report its latent ID, the judge-produced explanation, the auto-interpretation scoring outcome (predicted active indices vs. ground-truth active indices, and the resulting accuracy), and one maximally activating context window (the highest-activation example from the generation set). These examples illustrate both highly precise substring-level features and broader concept-level features.

Feature 37: 'ht' substring detector

Explanation. The neuron responds to the two-letter sequence 'ht' (case-insensitive), appearing in acronyms and tokens such as HTS, CFHTLS, DHT-11, and names like Mehta.
Auto-interpretation scoring. Predicted active indices: {1, 4, 7, 9}; ground-truth active indices: {1, 4, 7, 9}; accuracy = 1.00.
Maximally activating context. Bulk high-temperature superconductors (<<HT>><<S>>) are capable of generating very strong magnetic fields

Feature 20: single-letter LaTeX variable token

Explanation. Activates on single-letter mathematical variable tokens in LaTeX math, especially k in subscripts/superscripts, and similar single-letter variables (e.g., n, q) in math expressions.
Auto-interpretation scoring. Predicted active indices: {1, 4, 7, 9, 10, 13}; ground-truth active indices: {2, 4, 7, 13}; accuracy = 0.714.
Maximally activating context. over $P^r_<<k>> × P$

Feature 2621: Internet of Things concept

Explanation. Activates on mentions of the Internet of Things concept and related tokens, including 'Internet', 'of', 'Things', 'IoT', and references to connected devices.
Auto-interpretation scoring. Predicted active indices: {5, 8, 14}; ground-truth active indices: {5, 6, 8, 14}; accuracy = 0.929.
Maximally activating context. [Max-activation example omitted here; insert the top activating window for Feature 2621.]

Feature 113: character-level 'e' sensitivity

Explanation. 
The neuron responds to the lowercase letter e, either as a standalone symbol (e.g., e+/e− in physics, or scientific-notation markers) or embedded within words.
Auto-interpretation scoring. Predicted active indices: {1, 2, …, 14}; ground-truth active indices: {3, 5, 9, 14}; accuracy = 0.286.
Maximally activating context. prediction performance compared to all single-modality sing<<e>>-task baselines AUROC = 0.

Appendix F Neutral Prefix Set for Steering Evaluation with Full Prompt List

To evaluate diffusion-time steering, we compare generations produced with and without feature injection under otherwise identical decoding settings. A practical challenge is that prompts themselves can strongly bias topic, sentiment, and style, which can confound both (i) measured concept changes and (ii) fluency metrics such as perplexity. To reduce this prompt-induced variance, we follow the neutral-prefix design philosophy used in prior steering evaluations, in which short, generic prefixes are sampled to provide minimal semantic constraints while still anchoring generation to a natural continuation format (Wu et al., 2025). Concretely, for each feature we sample multiple neutral prefixes and average the resulting concept and fluency scores across prefixes, yielding a more stable estimate of steering effectiveness that is less sensitive to any single prompt phrasing. 
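This prefix-averaging step can be sketched as follows. The scores below are hypothetical, and the exact concept and fluency metrics are those defined in the main text; the sketch only shows the aggregation across prefixes.

```python
import numpy as np

def steering_estimate(scores_by_prefix):
    """Average per-prefix (concept, fluency) scores into one stable estimate.

    scores_by_prefix: list of (concept_improvement, rel_ppl_change) pairs,
    one per neutral prefix sampled for the feature under evaluation.
    """
    arr = np.asarray(scores_by_prefix, dtype=float)
    concept = arr[:, 0].mean()   # mean concept improvement across prefixes
    rel_ppl = arr[:, 1].mean()   # mean relative perplexity change
    return concept, rel_ppl

# Hypothetical scores for one feature under 5 neutral prefixes.
scores = [(0.8, 0.05), (0.6, 0.10), (0.7, 0.02), (0.9, 0.08), (0.5, 0.05)]
concept, rel_ppl = steering_estimate(scores)
assert abs(concept - 0.7) < 1e-9
assert abs(rel_ppl - 0.06) < 1e-9
```

Averaging over prefixes trades a modest amount of extra generation compute for a per-feature estimate that no single prompt phrasing can dominate.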
Neutral prefix pool

"Findings show that", "I once heard that", "Then the man said:", "I believe that", "The news mentioned", "She saw a", "It is observed that", "Studies indicate that", "According to reports,", "Research suggests that", "It has been noted that", "I remember when", "It all started when", "The legend goes that", "If I recall correctly,", "People often say that", "Once upon a time,", "It’s no surprise that", "Have you ever noticed that", "I couldn’t believe when", "The first thing I heard was", "Let me tell you a story about", "Someone once told me that", "It might sound strange, but", "They always warned me that", "Nobody expected that", "Funny thing is,", "I never thought I’d say this, but", "What surprised me most was", "The other day, I overheard that", "Back in the day,", "You won’t believe what happened when", "A friend of mine once said,", "I just found out that", "It’s been a long time since", "In my experience,", "The craziest part was when", "If you think about it,", "I was shocked to learn that", "For some reason,", "I can’t help but wonder if", "It makes sense that", "At first, I didn’t believe that", "That reminds me of the time when", "It all comes down to", "One time, I saw that", "I was just thinking about how", "Imagine a world where", "They never expected that", "I always knew that"

Appendix G Diffusion-Time Steering Experimental Details and Ablation Experiments

G.1 Steering Setup

Dream generation uses diffusion_generate with dlm_steps denoising steps, iteratively refining a full-sequence state and producing up to max_new_tokens. The intervention is applied at every denoising step; we refer to Eq. 13 and Eq. 14 for the formal definitions. We use two position-selection families consistent with Eq. 14. token_scope=all implements All-tokens steering by injecting at every position each step. 
token_scope=topk_tokens is a step-dependent sparse selector that recomputes the top-K activated positions at each step and injects only there, aligning with the same design class as Update-tokens. With bidirectional attention, effects can propagate globally across the sequence. Steering evaluation uses n_prefix=5 neutral prefixes (Appendix F) and generates max_new_tokens=30 tokens. We report concept improvement, relative perplexity change, and the combined steering score as defined in the main text, and sweep sparsity budgets L0 when comparing SAEs.

Figure 9 and Figure 10 visualize the layer-wise steering results summarized in Table 2, plotting concept improvement, relative perplexity change, and the combined steering score against the sparsity budget L0.

Figure 9: Steering metrics vs. L0 for Qwen-2.5-7B and Dream-7B SAEs: concept improvement (left), relative perplexity change (middle), and steering score (right).

Figure 10: Steering metrics vs. L0 for LLaMA-3-8B and LLaDA-8B SAEs: concept improvement (left), relative perplexity change (middle), and steering score (right).

Across models and layers, these visualizations provide a direct view of the same results as Table 2, highlighting how steering outcomes vary with L0 and layer depth under matched evaluation settings. Together, these plots make the layer-sparsity trade-offs of diffusion-time steering explicit: they show how increasing or decreasing L0 shifts the balance between achieving the intended concept change and preserving fluent generation, and how this balance depends on intervention depth. As a result, the visualization serves as a practical guide for selecting layers and sparsity budgets when deploying SAE-based steering under a fixed evaluation protocol.

G.2 Steering Ablations on Dream-7B Mask-SAEs

To illustrate how diffusion-time steering varies with key hyperparameters, we provide ablations on Dream-7B Mask-SAEs. 
These examples complement the main quantitative metrics and help isolate the effects of steering strength, denoising length, and per-step position selection on the final generation. We vary three knobs: the position selector token_scope, steering strength amp_factor, and denoising steps dlm_steps. All examples use neutral prefixes (Appendix F); we boldface feature-aligned words in the after-steering output for easier attribution. For consistency, amp_factor corresponds to α in Eq. 4.

Table 4: Ablation settings. Each row specifies: amp_factor (feature amplification strength; α in Eq. 4), dlm_steps (number of denoising steps), token_scope (token-position selector; all or top-K positions per step for Dream), n_prefix (number of neutral prefixes per feature), max_new_tokens (generation length), and Time/SAE (time to evaluate one SAE under the setting).

Model | amp_factor | dlm_steps | token_scope | n_prefix | max_new_tokens | Time/SAE

Autoregressive baseline (no diffusion steps)
Qwen2.5-7B | 2.0 | – | – | 5 | 30 | 13min

Ablation: token_scope (Dream-7B; amp_factor=3.0, dlm_steps=30)
Dream-7B | 3.0 | 30 | top-5 | 5 | 30 | 13min
Dream-7B | 3.0 | 30 | top-10 | 5 | 30 | 13min
Dream-7B | 3.0 | 30 | top-15 | 5 | 30 | 13min

Ablation: amp_factor (Dream-7B; token_scope=all, dlm_steps=30)
Dream-7B | 3.0 | 30 | all | 5 | 30 | 13min
Dream-7B | 2.0 | 30 | all | 5 | 30 | 13min
Dream-7B | 1.0 | 30 | all | 5 | 30 | 13min

Ablation: dlm_steps (Dream-7B; amp_factor=3.0, token_scope=all)
Dream-7B | 3.0 | 10 | all | 5 | 30 | 5min
Dream-7B | 3.0 | 30 | all | 5 | 30 | 13min
Dream-7B | 3.0 | 50 | all | 5 | 30 | 22min

token_scope (per-step position selector). For Dream-7B diffusion-time steering, token_scope controls which token positions receive the additive update at each denoising step. Using top-K restricts steering to the K most-activated positions per step, while all applies steering to every position. 
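The two selector families can be sketched at a single denoising step as follows, assuming access to the layer's hidden state, the SAE decoder direction of the steered feature, and its per-position activations. Names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def steer_positions(hidden, feat_dir, feat_acts, amp_factor, token_scope, k=5):
    """Additive steering at one denoising step.

    hidden:    (T, d) sequence state at the intervention layer
    feat_dir:  (d,)   SAE decoder direction for the steered feature
    feat_acts: (T,)   per-position activation of that feature this step
    token_scope: "all" injects at every position; "topk" only at the k
                 most-activated positions (recomputed every step).
    """
    out = hidden.copy()
    if token_scope == "all":
        mask = np.ones(len(feat_acts), dtype=bool)
    else:  # "topk"
        idx = np.argsort(feat_acts)[-k:]        # top-k positions this step
        mask = np.zeros(len(feat_acts), dtype=bool)
        mask[idx] = True
    out[mask] += amp_factor * feat_dir          # α-scaled feature direction
    return out, mask

T, d = 8, 4
rng = np.random.default_rng(1)
h = rng.normal(size=(T, d))
direction = rng.normal(size=d)
acts = rng.normal(size=T)
_, mask_all = steer_positions(h, direction, acts, 3.0, "all")
_, mask_topk = steer_positions(h, direction, acts, 3.0, "topk", k=3)
assert mask_all.sum() == T
assert mask_topk.sum() == 3
```

Because the top-K mask is recomputed every denoising step, the set of steered positions can drift as the sequence is refined, which is what distinguishes this selector from a fixed-position intervention.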
Qualitatively, increasing K tends to spread feature-consistent lexical effects more broadly, while smaller K often concentrates the effect into a few salient positions.

Example: varying token_scope (Dream-7B; amp_factor=3.0, dlm_steps=30)
Feature: activates on multi-channel audio / array terms (e.g., audio, sound, speakers, channel, surround).
Prefix: ‘Once upon a time,’
- Without steering: Once upon a time, in a galaxy far, far away, there was a tiny planet …
- After steering (top-5): Once upon a time, I tuned the audio just enough to make the sound clear, then lowered the volume before the story continued.
- After steering (top-10): Once upon a time, the sound filled the room as the audio played through two speakers, and every channel carried a slightly different echo.
- After steering (top-15): Once upon a time, we tested a surround audio setup where each channel sent crisp sound to multiple speakers, and the audio stayed balanced even when the volume rose.

amp_factor (steering strength α). amp_factor scales the magnitude of the feature-direction update and is exactly the coefficient α in Eq. 4. Larger values typically yield stronger insertion of feature-aligned tokens and phrases, but can also increase redundancy, off-topic drift, or exaggerated lexical markers of the feature.

Example: varying amp_factor (α) (Dream-7B; token_scope=all, dlm_steps=30)
Feature: activates on bounds / positional modifiers (e.g., upper, lower, bound, limit).
- Without steering: The answer depends on the context, and we can refine it with a clearer definition and a concrete example.
- After steering (amp_factor=1.0): We can start with a simple constraint and set a soft limit so the value stays reasonable.
- After steering (amp_factor=2.0): Next, we specify a lower bound to rule out trivial cases, and also add an upper limit to keep the range finite.
- After steering (amp_factor=3.0): Finally, we enforce both a strict lower bound and a strict upper limit, so the variable is tightly bounded within the permitted interval.

dlm_steps (number of denoising steps). dlm_steps controls the diffusion compute budget during generation. With too few steps, we more often observe unstable reconstructions (e.g., format artifacts or rambling continuations), while moderate-to-large step counts produce steadier generations at the cost of higher latency.

Example: varying dlm_steps (Dream-7B; amp_factor=3.0, token_scope=all)
Feature: activates on bounds / positional modifiers (e.g., upper, lower, bound, limit).
Prefix: ‘For some reason,’
- Without steering: For some reason, the results looked inconsistent, so we reran the experiment and compared the outputs carefully.
- After steering (dlm_steps=10): For some reason, the report kept repeating constraints like a lower threshold, an upper cutoff, another lower check, and yet another upper limit in the same paragraph.
- After steering (dlm_steps=30): For some reason, the analysis mentioned a lower bound and an upper limit once, then moved on to describe the rest of the reasoning in a more balanced way.
- After steering (dlm_steps=50): For some reason, the write-up only briefly noted an upper limit on the value before focusing on the main conclusion without emphasizing bounds.

Overall, increasing the token_scope budget (a larger top-K, or all) distributes steering across more positions and makes feature-aligned terms appear more broadly in the output, while increasing amp_factor (α) strengthens the feature effect and raises the frequency of feature-consistent tokens. In contrast, changing dlm_steps primarily affects generation stability: fewer steps tend to produce noisier outputs with more frequent (sometimes repetitive) feature markers, whereas more steps yield smoother text in which the feature signal is typically weaker but more coherent.
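Eq. 4 is not reproduced in this appendix, so the following sketch assumes the common additive steering form h ← h + α·d_f at the selected positions, where d_f is the SAE decoder direction of the steered feature and α is amp_factor; all function and variable names here are illustrative rather than the authors' code:

```python
import numpy as np

def steer_hidden(hidden: np.ndarray, decoder_dir: np.ndarray,
                 amp_factor: float, mask: np.ndarray) -> np.ndarray:
    """Apply the assumed additive steering update at masked positions.

    hidden:      hidden states at the hooked layer, shape [seq_len, d_model]
    decoder_dir: SAE decoder direction of the steered feature, shape [d_model]
    amp_factor:  steering strength alpha; larger values push generations
                 harder toward feature-aligned tokens
    mask:        boolean [seq_len] selector produced by token_scope
    """
    steered = hidden.copy()                    # leave the original states intact
    steered[mask] += amp_factor * decoder_dir  # broadcast over selected positions
    return steered

h = np.zeros((3, 4))
d = np.array([1.0, 0.0, 0.0, 0.0])
out = steer_hidden(h, d, amp_factor=2.0, mask=np.array([True, False, True]))
# out[0] and out[2] are shifted by 2*d; out[1] is untouched
```

In a diffusion setting this update would be applied once per denoising step, for dlm_steps steps, which is why larger step counts increase both latency and the number of opportunities for the feature signal to shape the output.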
Appendix H Decoding-Order Further Analysis Implementation

This appendix extends the decoding-order analysis in Section 5 with a complementary Top-1 feature diagnostic. While the main text focuses on Top-$K_{\mathrm{feat}}$ set overlap and drift (Eqs. 20–21), here we track whether the single most dominant SAE feature at each position remains stable across denoising steps, both before decoding (while the position is still masked) and after decoding (once the position has been filled).

Top-1 feature identity. For decoding order $O$, layer $\ell$, denoising step $k$, and token position $i$, let $h^{(O)}_{\ell,k,i} \in \mathbb{R}^{k}$ denote the SAE latent (Eq. 3). We define the Top-1 feature identity as the largest-magnitude latent index:

$f^{(O)}_{\ell,k,i} = \arg\max_{j \in [k]} \bigl| h^{(O)}_{\ell,k,i,j} \bigr|.$ (22)

Pre-decode Top-1 feature lock rate. Restricting to the masked positions $\mathcal{M}(x^{(O)}_{t_k})$ (Eq. 20), we measure whether the Top-1 identity is unchanged between consecutive steps:

$R^{\mathrm{pre}}_{\ell,k}(O) = \frac{1}{\bigl|\mathcal{M}(x^{(O)}_{t_k})\bigr|} \sum_{i \in \mathcal{M}(x^{(O)}_{t_k})} \mathbb{1}\bigl[ f^{(O)}_{\ell,k,i} = f^{(O)}_{\ell,k-1,i} \bigr].$ (23)

Post-decode Top-1 feature flip count. Let $\mathcal{U}(x^{(O)}_{t_k}) = [N] \setminus \mathcal{M}(x^{(O)}_{t_k})$ be the decoded positions (Eq. 9). We count how many decoded positions change their Top-1 identity between steps:

$F^{\mathrm{post}}_{\ell,k}(O) = \sum_{i \in \mathcal{U}(x^{(O)}_{t_k})} \mathbb{1}\bigl[ f^{(O)}_{\ell,k,i} \neq f^{(O)}_{\ell,k-1,i} \bigr].$ (24)

We compute $R^{\mathrm{pre}}_{\ell,k}(O)$ and $F^{\mathrm{post}}_{\ell,k}(O)$ on the same decoding-order runs as Section 5 and visualize them as layer-step heatmaps (each spanning the step range of its corresponding strategy).

Figure 11: Top-1 feature stability across decoding orders. Top: pre-decode Top-1 lock rate $R^{\mathrm{pre}}_{\ell,k}(O)$ (Eq. 23), i.e., the fraction of masked positions whose Top-1 feature matches the previous step. Bottom: post-decode Top-1 flip count $F^{\mathrm{post}}_{\ell,k}(O)$ (Eq.
24), i.e., the number of decoded positions whose Top-1 feature changes between consecutive steps. Columns correspond to Origin, Entropy, and TopK-margin; within each heatmap, rows index tracked layers and the horizontal axis spans denoising steps.

Across Origin, Entropy, and TopK-margin, the Top-1 feature identity is largely stable both before decoding (high lock rate on masked positions) and after decoding (few flips on decoded positions), indicating that the dominant semantic direction in SAE space is already highly consistent across denoising steps.

Appendix I SAE Insertion Results on Dream Base with Layerwise Explained Variance and Δ Loss

Section 6 evaluates Base-to-SFT transfer by inserting SAEs into Dream SFT. Here we provide the complementary setting: we insert the same pair of SAEs into the Dream Base backbone and report layerwise insertion metrics across a sweep of sparsity budgets $L_0$. Concretely, we compare a Base SAE trained on Dream Base with an SFT SAE trained on Dream SFT, while keeping the SAE architecture and training protocol fixed. For each target layer, we evaluate reconstruction fidelity using explained variance (EV; Eq. 15) and functional faithfulness using the insertion-induced loss change $\Delta\mathcal{L}_{\mathrm{DLM}}$ (Eq. 16).

Figure 12 summarizes these insertion results on Dream Base across layers and $L_0$ settings. The left two panels report $\Delta\mathcal{L}_{\mathrm{DLM}}$ under masked-token denoising evaluation when inserting the SFT SAE and the Base SAE, respectively; the right two panels report the corresponding EV curves. Curves are grouped by depth (shallow/middle/deep) to highlight how insertion behavior changes across the network.

Figure 12: Insertion of Base and SFT SAEs into the Dream Base backbone. From left to right: (1) insertion-induced loss change $\Delta\mathcal{L}_{\mathrm{DLM}}$ when inserting the SFT SAE into Dream Base; (2) $\Delta\mathcal{L}_{\mathrm{DLM}}$ when inserting the Base SAE into Dream Base; (3) explained variance (EV) for the SFT SAE on Dream Base; (4) explained variance (EV) for the Base SAE on Dream Base.
All panels sweep the sparsity budget $L_0$ on the horizontal axis and plot per-layer curves (legend), with layers grouped into shallow, middle, and deep depths. On Dream Base, inserting the Base SAE and the SFT SAE yields very similar $\Delta\mathcal{L}_{\mathrm{DLM}}$ trends across layers and $L_0$, even though shallow-layer EV (notably L1) can differ more clearly between the two. This highlights that EV differences do not necessarily translate into proportional behavioral impact under insertion. In the middle layers, both insertion behavior and EV remain closely matched over the $L_0$ sweep regardless of whether the SAE is trained on Dream Base or Dream SFT, indicating strong cross-model transfer of mid-layer SAE subspaces.
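The exact definitions in Eqs. 15–16 are given in the main text; as a reference point, explained variance is commonly computed as 1 − (residual sum of squares)/(total sum of squares) over a batch of layer activations, which the following minimal sketch assumes (names are illustrative, not the authors' code):

```python
import numpy as np

def explained_variance(x: np.ndarray, x_hat: np.ndarray) -> float:
    """EV of an SAE reconstruction, assuming the standard 1 - RSS/TSS form.

    x:     original layer activations, shape [n_tokens, d_model]
    x_hat: SAE reconstructions, same shape. EV = 1 for a perfect
           reconstruction; EV = 0 for one no better than the mean activation.
    """
    rss = np.sum((x - x_hat) ** 2)               # residual sum of squares
    tss = np.sum((x - x.mean(axis=0)) ** 2)      # total sum of squares
    return float(1.0 - rss / tss)

x = np.array([[0.0, 2.0], [2.0, 0.0]])
assert explained_variance(x, x) == 1.0                 # perfect reconstruction
ev_mean = explained_variance(x, np.full_like(x, 1.0))  # mean predictor -> EV 0
```

The companion functional metric, $\Delta\mathcal{L}_{\mathrm{DLM}}$, is instead obtained by substituting the reconstruction for the layer's activations during a forward pass and measuring the change in denoising loss, which is why a high EV need not translate into a small loss change.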