
Paper deep dive

Learning a Generative Meta-Model of LLM Activations

Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 90

Models: Llama-3.1-8B, Llama-3.2-1B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 1:06:48 AM

Summary

The paper introduces 'Generative Latent Prior' (GLP), a diffusion-based meta-model trained on LLM residual stream activations. GLP learns the distribution of internal states, enabling on-manifold steering interventions that improve output fluency and concept isolation compared to traditional methods like sparse autoencoders (SAEs). The model exhibits predictable power-law scaling with compute, where diffusion loss serves as a reliable proxy for downstream utility in steering and probing tasks.

Entities (5)

Generative Latent Prior · model · 100%
Llama1B · llm · 98%
Llama8B · llm · 98%
FineWeb · dataset · 95%
Sparse Autoencoder · method · 95%

Relation Signals (3)

Generative Latent Prior scales with Compute

confidence 95% · GLP scales predictably with compute. Across models from 0.5B to 3.3B parameters, the diffusion loss follows a smooth power law

Generative Latent Prior trained on FineWeb

confidence 95% · We train GLP on the same activation data commonly used to train SAEs... For our large-scale web corpus we use FineWeb

Generative Latent Prior improves Sparse Autoencoder

confidence 90% · GLP post-processing (pink) improves the concept-fluency tradeoff over SAE steering alone

Cypher Suggestions (2)

Find all models and the datasets they were trained on. · confidence 90% · unvalidated

MATCH (m:Model)-[:TRAINED_ON]->(d:Dataset) RETURN m.name, d.name

Identify methods that improve other interpretability techniques. · confidence 85% · unvalidated

MATCH (m1:Method)-[:IMPROVES]->(m2:Method) RETURN m1.name, m2.name

Abstract

Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: this https URL.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

89,300 characters extracted from source content.


Learning a Generative Meta-Model of LLM Activations

Grace Luo 1, Jiahai Feng 1‡, Trevor Darrell 1†, Alec Radford 2†, Jacob Steinhardt 1 3†

Abstract: Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

1. Introduction

Neural network activations encode rich information reflecting how models process and represent data (Hinton et al., 1986; Mikolov et al., 2013; Zeiler & Fergus, 2014; Bau et al., 2020). These latent representations enable a broad range of applications, from extracting internal knowledge via activation probing (Alain & Bengio, 2017; Hewitt & Manning, 2019; Belinkov, 2022) to steering behavior via targeted interventions (Turner et al., 2024; Zou et al., 2025; Hendel et al., 2023; Todd et al., 2024).
However, existing methods for analyzing and manipulating activations often assume linearity or other structures (Pearson, 1901; Olshausen & Field, 1997; Bricken et al., 2023), and are therefore prone to producing corrupted activations that degrade LLM fluency (Templeton et al., 2024; Vu & Nguyen, 2025). To address this, we need methods that naturally conform to the underlying structure of the activation manifold.

‡ Work done while at UC Berkeley. † Equal advising. 1 UC Berkeley, 2 Independent, 3 Transluce. Correspondence to: Grace Luo <graceluo@berkeley.edu>. Preprint. February 9, 2026.

[Figure 1 panels: (a) Training an activation diffusion model... (b) ...learns the structure of the activation manifold... (c) ...enabling applications such as on-manifold steering, illustrated with SAE steering outputs before and after post-processing for the task "terms related to scientific testing methods and protocols".]

Figure 1. Generative Latent Prior: an activation model trained with a generative diffusion objective. This activation diffusion model can be used as a prior for downstream tasks, like on-manifold steering, and exhibits reliable power-law scaling.

Generative models offer a principled alternative. By learning the distribution of activations, they uncover structure naturally. In computer vision, for instance, image diffusion models can project unrealistic images back onto the natural image manifold while preserving semantic content (Meng et al., 2022), and their intermediate representations encode semantically meaningful features useful for downstream tasks (Luo et al., 2023; Tang et al., 2023; Zhang et al., 2023; Hedlin et al., 2023). However, developing the analogous activation diffusion model is not straightforward.
Activations are high-dimensional vectors that cannot be directly inspected, posing challenges for training and evaluation.

arXiv:2602.06964v1 [cs.LG] 6 Feb 2026

[Figure 2 panels: (a) Training loss on FineWeb, fit L(C) = 0.52 + 435.1·C^(−0.169); (b) On-manifold sentiment steering, concept & fluency mean fit f(C) = 0.63 − 3.92·10^6·C^(−0.420); (c) 1-D probe for 113 binary tasks, average 1-D probe AUC fit f(C) = 1.00 − 8.01·C^(−0.085).]

Figure 2. GLP scales with compute. We train GLP (with 0.5B, 0.9B, 1.7B, 3.3B parameters) on Llama1B activations. (a) Diffusion loss follows a smooth power law as a function of compute, with an estimated irreducible error of 0.52. (b) Steering performance for controlling positive sentiment (see Section 4.3) improves with compute, tracking the loss. (c) 1-D probing performance (see Section 5.2) likewise improves with compute. See Appendix B for plots with diffusion loss on the x-axis.

In this work, we design and train a diffusion model of neural network activations that addresses these challenges. We call this model a Generative Latent Prior, or GLP. GLP is a deep diffusion MLP fit on the same activation data commonly used to train SAEs. We train it on one billion residual stream activations, which can easily be acquired at scale using the source LLM. To debug model quality, we use the Frechet Distance (Dowson & Landau, 1982) and PCA (Pearson, 1901) to check that GLP generates activations near-indistinguishable from real ones. We apply GLP to common interpretability tasks.
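The PCA-based sanity check described above can be sketched in plain numpy. Everything below (the Gaussian stand-in data, shapes, and function name) is illustrative, not the paper's actual pipeline; the one load-bearing detail is that generated samples are projected into the principal basis of the *real* activations so the two clouds are comparable:

```python
import numpy as np

def pca_project(reference, samples, k=2):
    """Project `samples` onto the top-k principal components of `reference`."""
    mu = reference.mean(axis=0)
    # principal directions of the reference (real) activations via SVD
    _, _, vt = np.linalg.svd(reference - mu, full_matrices=False)
    return (samples - mu) @ vt[:k].T

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))       # stand-in for real activations
generated = rng.normal(size=(1000, 64))  # stand-in for model samples

real_2d = pca_project(real, real)
gen_2d = pca_project(real, generated)    # same basis for a fair comparison
print(real_2d.shape, gen_2d.shape)       # (1000, 2) (1000, 2)
```

Scatter-plotting the two 2-D clouds on shared axes then gives the kind of visual comparison shown in the paper's Figure 3.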
Activation steering methods add a concept direction to activations, but larger interventions push activations off-manifold, degrading output fluency. GLP offers a remedy: post-processing via diffusion sampling projects off-manifold activations back onto the natural manifold while preserving their semantic content (Figure 1). Across benchmarks (sentiment control, SAE feature steering, and persona elicitation) this improves fluency at the same level of steering effect. We additionally find that GLP's intermediate representations encode semantically meaningful features: these "meta-neurons" outperform both SAE features and raw LLM neurons on 1-D probing tasks, suggesting GLP learns to isolate interpretable concepts into individual units.

GLP scales predictably with compute. Across models from 0.5B to 3.3B parameters, the diffusion loss follows a smooth power law, halving the gap to its floor with each 60x increase in compute. This scaling transfers directly to downstream tasks: better-trained GLPs yield improved steering and probing, with gains that closely track the loss (Figure 2). The diffusion loss thus serves as both a training objective and a reliable predictor of downstream utility, suggesting that continued scaling will yield further improvements.

More broadly, GLP contributes to a line of work on meta-modeling, which studies generative models of neural network components (Schmidhuber, 1992; Hinton & Plaut, 1987; Ha et al., 2017; Peebles et al., 2022; Wang et al., 2024). Prior meta-models typically focus on sample generation, e.g., synthesizing network weights. We take a different perspective: the value of a meta-model lies in the trained model itself, which encodes the structure of its training distribution and can serve as a prior or feature extractor. Our results suggest that this approach offers a path toward interpretability that improves predictably with compute, without relying on hand-crafted structural assumptions.

2. Generative Latent Prior

We now describe GLP, an activation diffusion model, covering its training objective, architecture, and data pipeline.

2.1. Diffusion Objective

Neural activations are continuous vectors, making them well-suited to the diffusion framework (Sohl-Dickstein et al., 2015; Ho et al., 2020). At the core of diffusion is the forward process, which produces training data by adding Gaussian noise to real samples, and the reverse process, which generates data samples from pure noise at inference time. We use flow matching (Liu et al., 2023; Albergo & Vanden-Eijnden, 2023; Lipman et al., 2023; Esser et al., 2024; Gao et al., 2024), whose forward process produces z_t as a linear interpolation between the data point z_0 and the noise ε:

    z_t = (1 − t) z_0 + t ε    (1)

for t ∈ [0, 1]; the reverse process iteratively samples new data z_0, starting from z_1 ∼ N(0, I), with t′ < t:

    z_{t′} = z_t + û · (t′ − t)    (2)

This motivates training a neural network denoiser û_θ(z_t, t) to approximate the target velocity u = ε − z_0. We show pseudocode for this training objective in Figure 7. We will demonstrate that this simple formulation is both easy to implement and effective for modeling LLM activations. Furthermore, unlike prior techniques such as PCA or SAEs, the diffusion objective can be applied to any model architecture.

2.2. Architecture

We formulate our denoiser as a stack of feedforward MLP blocks following the design from Llama3 (Grattafiori et al., 2024). Each block is a SwiGLU layer (Shazeer, 2020) with residual connections (He et al., 2016). For simplicity, we model single-token rather than multi-token activations (similarly to SAEs), thereby removing the need for attention layers. The only diffusion-specific modification needed is timestep conditioning (Ho et al., 2020).
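The forward process and velocity target from Section 2.1 amount to only a few lines. The sketch below uses toy shapes and synthetic data (not the paper's training code) and also checks a property of the linear path in Eq. (1): because the true velocity u = ε − z_0 is constant in t for a fixed data/noise pair, a single exact Euler step of Eq. (2) from t = 1 to t = 0 recovers the data:

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.normal(size=(8, 16))   # toy stand-in for a batch of activations
eps = rng.normal(size=(8, 16))  # Gaussian noise
t = rng.uniform(size=(8, 1))    # per-sample timesteps in [0, 1]

# forward process (Eq. 1): linear interpolation between data and noise
z_t = (1 - t) * z0 + t * eps

# regression target for the denoiser: the velocity u = eps - z0;
# training would minimize mean((u_hat(z_t, t) - u) ** 2)
u = eps - z0

# reverse process (Eq. 2): one Euler step from t=1 to t=0 with the
# exact velocity maps noise back to data on the linear path
z_rec = eps + u * (0.0 - 1.0)
print(np.allclose(z_rec, z0))  # True
```

With a learned, imperfect û_θ, the reverse process instead takes many small steps, as in the sampling loop of Figure 4.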
Recall the parameterization û_θ(z_t, t) from Section 2.1; we condition on t by multiplicatively modulating (Perez et al., 2018) the SwiGLU gate pre-activation at each MLP block. The models we train are unconditional, meaning they do not need class labels or any other conditioning information during training.

2.3. Data Pipeline

We train GLP on the same activation data commonly used to train SAEs. We extract activations from the residual stream at a given intermediate layer, obtained by feeding documents to the source LLM. Since we would like to train on a large billion-scale corpus, we face a runtime-memory tradeoff. Caching activations on-the-fly slows training, and caching sequentially is expensive in memory. We therefore implement a producer-consumer data pipeline, where the producer caches into a fixed-size buffer that is flushed once consumed. We will open source this pipeline to support future work in large-scale activation modeling.

For our large-scale web corpus we use FineWeb (Penedo et al., 2024), also commonly used for LLM pretraining, from which we sample 1 billion tokens. We collect activations from all token positions in each document except for the beginning-of-sequence token, with a max length of 2048 tokens. We always train on activations from the middlemost layer (Layer 7 of Llama1B and Layer 15 of Llama8B), and we explore training a multi-layer model in Section B.1. We heavily speed up our producer by implementing activation caching through the vLLM (Kwon et al., 2023) and nnsight (Fiotto-Kaufman et al., 2025) libraries. We also speed up our consumer via mixed precision training.

Table 1. Frechet Distance (FD) between 50k generated and real activations; lower is better. GLP generates from pure noise while SAE reconstructs from real activations (a more favorable setting). GLP achieves lower FD than SAEs and improves with scale. Activations are from the middlemost layer of each LLM.
SAEs are from Chanin (Chanin & Garriga-Alonso, 2025) for Llama1B and OpenMOSS-Team (He et al., 2024) for Llama8B. The lower bound reports irreducible sampling error (FD of train vs. val sets).

Method             | # Params | FD (↓)
Llama1B (d = 2048)
Lower Bound        | -        | 0.22
SAE Reconstruction | 0.1B     | 1.99
GLP, 3 Layers      | 0.5B     | 0.68
GLP, 6 Layers      | 0.9B     | 0.61
GLP, 12 Layers     | 1.7B     | 0.55
GLP, 24 Layers     | 3.3B     | 0.53
Llama8B (d = 4096)
Lower Bound        | -        | 2.60
SAE Reconstruction | 1.0B     | 6.91
GLP, 6 Layers      | 3.4B     | 5.93

3. Scaling GLP

GLP is appealing because it imposes no structural assumptions, instead learning the activation distribution directly from the data. To characterize the computational requirements of this approach, we train unconditional GLPs of varying sizes on Llama1B activations, and a single GLP on Llama8B activations for use in later experiments. We enumerate all GLPs and their final Frechet Distances in Table 1.

Hyperparameters. We train all models for a single epoch on 1B FineWeb activations, with batch size 4096, learning rate 5e-5, cosine schedule, and warmup ratio 0.01. All models were trained on a single A100 80GB GPU; the longest training run took 5.6 days. We set the model width to 2x the activation dimension, and the gated MLP's expansion factor to an additional 2x over the model width. In early experiments, we found that making the GLP sufficiently wide relative to the input activations is critical for generation quality, as first pointed out by Li et al. (2024).

3.1. Checking Generation Quality

Unlike text or image models, generative activation models cannot be assessed by directly inspecting samples. Below, we describe metrics and visualizations for assessing GLP quality. We report all results on the Llama8B GLP.

Representation Frechet Distance. First, we use the Frechet Distance (FD) (Dowson & Landau, 1982; Heusel et al., 2017) to understand the distance between the generated and real activation distributions. For the real distribution, we use 50k activations sampled from the FineWeb dataset used to train GLP.
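For intuition, the Frechet Distance between Gaussian fits of two sample sets has a closed form. The sketch below simplifies it to diagonal covariances (my simplification, for brevity; the standard statistic uses full covariance matrices and a matrix square root), which is enough to see that it is zero for identical statistics and grows with distribution shift:

```python
import numpy as np

def frechet_distance_diag(x, y):
    """Frechet distance between Gaussian fits of x and y,
    simplified to diagonal covariances:
    ||mu_x - mu_y||^2 + ||sigma_x - sigma_y||^2."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    s_x, s_y = x.std(axis=0), y.std(axis=0)
    return np.sum((mu_x - mu_y) ** 2) + np.sum((s_x - s_y) ** 2)

rng = np.random.default_rng(0)
real = rng.normal(size=(50_000, 32))
same = rng.normal(size=(50_000, 32))               # same distribution
shifted = rng.normal(loc=2.0, size=(50_000, 32))   # mean shifted by 2

print(frechet_distance_diag(real, same)
      < frechet_distance_diag(real, shifted))  # True
```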
We take a single token per document. As the lower bound, we also provide the FD between real training and validation activations, which represents the irreducible error that arises from computing FD from a finite set of samples. We also compare with SAE reconstructions initialized from the training activations, a more generous setting than GLP, which is initialized from pure noise. When generating with GLP, we use 1000 diffusion steps. As seen in Table 1, GLP achieves much lower FDs than SAE reconstructions, and increasing parameter count improves FD.

[Figure 3 panels: (a) Num Steps = 1, (b) Num Steps = 4, (c) Num Steps = 20, (d) Num Steps = 1000, (e) Num Steps vs. Frechet Distance.]

Figure 3. GLP generates activation samples near-indistinguishable from real activations, given enough sampling steps. (a-d) PCA of real activations (yellow) vs. GLP samples (pink) for Llama8B. The distributions converge around 20 sampling steps. (e) Frechet Distance confirms this quantitatively.

PCA of Generated vs. Real Samples. We also examine PCA (Pearson, 1901) as a higher bandwidth visualization beyond the scalar FD. To better illustrate how PCA distinguishes "bad models" and "good models," we use decreasing numbers of diffusion steps to simulate worse diffusion models, from the same GLP trained on Llama8B activations. As seen in the top-2 PCA components visualized in Figure 3, reduced sampling steps result in reduced mode coverage (3a-3b), until a minimum threshold at 20 steps where generated samples become relatively indistinguishable from real ones (3c-3d). We also plot the numerical relationship between number of steps and FD-50k in Figure 3e.

Table 2. Delta LM Loss (increase in LLM perplexity when original activations are replaced with reconstructed ones) for both GLP and a comparable SAE (He et al., 2024). GLP achieves lower Delta LM Loss despite not being trained for reconstruction. Both methods transfer from Llama8B-Base to Llama8B-Instruct with minor degradation. Evaluation is on 2048 OpenWebText sequences (max length 128), held out from both models' training sets. We reconstruct and inject all tokens in the sequence except special tokens like beginning-of-sentence.

Method | Delta LM Loss (↓), Llama8B-Base | Delta LM Loss (↓), Llama8B-Instruct
SAE    | 0.1976                          | 0.2224
GLP    | 0.0513                          | 0.0860

Delta LM Loss. We next measure Delta LM Loss (Bricken et al., 2023; Lieberum et al., 2024), a standard SAE evaluation metric that quantifies the increase in the LLM's loss caused by injecting reconstructed activations. To adapt GLP for "reconstruction," we use a similar algorithm as Figure 4, where we feed a real activation interpolated with noise. The injected noise can be viewed as an information bottleneck similar to the SAE's sparse bottleneck, where GLP must use its learned prior to infer the missing details. We use t_start = 0.5 and num_steps = 20. Surprisingly, GLP achieves a better Delta LM Loss than a pre-existing SAE (He et al., 2024) also trained on Llama8B-Base activations, as seen in Table 2. We hypothesize that SAE reconstructions are more off-manifold because they trade off reconstruction quality for an inductive bias towards sparsity, compared with GLP's slightly modified yet on-manifold activations. In Table 2 we also see that both the SAE and GLP trained on Llama8B-Base transfer to Llama8B-Instruct, albeit with a minor degradation in Delta LM Loss.

3.2. Scaling Laws

We now characterize how diffusion loss scales with compute. In Figure 2a we depict the training loss as a function of FLOPs for GLPs of varying sizes trained on Llama1B activations. We follow Kaplan et al. (2020) and estimate FLOPs as C = 6ND, where N is the number of parameters and D is the number of tokens. We fit a power law of the form L(C) = E + A·C^(−α) to the loss envelope, finding E = 0.52 (irreducible error), A = 435.1 (scaling coefficient), and α = 0.169 (rate of improvement).
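Once E is fixed, fitting L(C) = E + A·C^(−α) reduces to linear regression in log space (in practice E must itself be estimated, e.g., by grid search; it is treated as given here). A minimal sketch on synthetic losses built from the reported constants:

```python
import numpy as np

def fit_power_law(C, L, E):
    """Fit L(C) = E + A * C**(-alpha) with E given, via linear
    regression of log(L - E) against log(C)."""
    slope, intercept = np.polyfit(np.log(C), np.log(L - E), 1)
    return np.exp(intercept), -slope  # A, alpha

# synthetic loss curve simulated from the paper's reported fit
# (E = 0.52, A = 435.1, alpha = 0.169); not real training data
C = np.logspace(15, 19, 20)           # FLOPs, estimated as C = 6ND
L = 0.52 + 435.1 * C ** (-0.169)

A_hat, alpha_hat = fit_power_law(C, L, E=0.52)
print(round(A_hat, 1), round(alpha_hat, 3))  # 435.1 0.169
```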
Importantly, this scaling transfers to downstream tasks. As shown in Figures 2b-2c, both steering performance and probing accuracy improve with compute, closely tracking the diffusion loss (we treat these tasks in detail in Sections 4.3 and 5.2). For each task, we estimate scaling laws constrained to checkpoints on the compute-efficient frontier, superimposing the power-law fit to the raw data. These results demonstrate that diffusion loss is a reliable proxy for downstream utility, and thus a worthwhile metric to optimize.

```python
import numpy as np

# ============================================================
# denoiser   - MLP denoiser network
# scaler     - pre-computed activation stats
# acts[n, d] - minibatch of activations
# w[d]       - steering vector
# alpha      - steering strength
# t_start    - noise level to begin sampling
# num_steps  - number of total steps to discretize sampling
# ============================================================

# apply intervention to activations
acts_edit = acts + alpha * w

# standardize to zero mean & unit variance
acts_edit = (acts_edit - scaler.mean) / scaler.std

# noise activations according to pre-specified t_start
# (bigger t_start = stronger correction from diffusion sampling)
noise = np.random.normal(size=acts_edit.shape)
acts_noisy = (1 - t_start) * acts_edit + t_start * noise

# init sampling at t=t_start from acts, instead of at t=1 from pure noise
acts_sample = acts_noisy

# run multi-step sampling
timesteps = np.linspace(t_start, 0, num_steps)
for i in range(len(timesteps) - 1):
    t = timesteps[i]
    dt = timesteps[i + 1] - timesteps[i]
    pred_velocity = denoiser(acts=acts_sample, timesteps=t)
    acts_sample = acts_sample + dt * pred_velocity

# restore back to original mean & variance
acts_sample = (acts_sample * scaler.std) + scaler.mean
```

Figure 4. On-manifold steering with GLP. Given a steered activation, we add noise and then denoise with GLP. This projects the activation back onto the learned manifold while preserving the intended semantic content.

4. On-Manifold Steering with GLP

We now demonstrate the practical utility of GLP for activation steering, a well-known method for controlling LLM behavior that adds linear direction vectors to activations at inference time. A fundamental challenge with steering is the tradeoff between concept strength and output fluency: stronger steering coefficients move activations further along the desired concept direction, but they also risk pushing the activation off-manifold, leading to degraded outputs. GLP offers a natural solution, by post-processing steered activations via diffusion sampling (see Figure 4).

Method. Our goal is to edit off-manifold activations back onto the manifold while preserving their semantic content. To achieve this, we propose an activation-space analog of SDEdit (Meng et al., 2022), a popular image editing method. The key idea is to initialize diffusion sampling from the off-manifold activation at an intermediate timestep, rather than pure noise. Intuitively, the timestep controls how much GLP modifies the input: earlier timesteps (more noise) give GLP more freedom to correct artifacts, while later timesteps (less noise) preserve more of the original signal. We provide pseudocode for this algorithm in Figure 4.

[Figure 5 axes: Fluency Score (x) vs. Concept Score (y), over 500 LlamaScope SAE concepts; SAE vs. +Ours.]

Figure 5. Improving SAE steering in Llama8B-Base. We plot the Pareto frontier of concept vs. fluency as we vary the steering coefficient. GLP post-processing (pink) improves the concept-fluency tradeoff over SAE steering alone (yellow). Concept and fluency are scored by an LLM judge on a 0-2 scale (Wu et al., 2025). Error bars show 95% bootstrap CIs.

Hyperparameters. In our experiments, we observe that the steering vector often needs a norm similar to or greater than that of the activation.
We therefore start with a relative coefficient r and compute the absolute steering coefficient as α = r · ‖a‖₂, where ‖a‖₂ here denotes the average activation norm computed from a validation set. We run the Figure 4 algorithm with t_start = 0.5 and num_steps = 20. We further detail each experimental configuration in Table 9.

4.1. Improving SAEs

Now, we investigate an application for GLP: improving the alignment between SAE steering and feature descriptions. In the setting from Wu et al. (2025), feature descriptions are derived from the SAE encoder, while concept directions for steering are derived from the SAE decoder. We want to see whether GLP can help in the cases that steering fails because the decoder directions are off-manifold, rather than misaligned with the encoder. We apply GLP on top of the LlamaScope (He et al., 2024) SAE, both of which were trained on Llama8B-Base activations. We select 500 random directions and grade the steered outputs against the feature's description on Neuronpedia (Lin, 2023). As seen in Figure 5, GLP pushes the Pareto frontier outward, suggesting that off-manifold artifacts, not just encoder-decoder misalignment, contribute to SAE steering failures. We depict qualitative examples in Table 7; for coefficients with comparable fluency scores, post-processing with GLP evidently helps SAE steering better match its intended description.

[Figure 6 panels: Fluency Score vs. Concept Score Pareto curves for three concepts, Evil, Sycophantic, and Hallucinating; Persona Vector vs. +Ours.]

Figure 6. Eliciting personas in Llama8B-Instruct. GLP post-processing (green) expands the Pareto frontier over Persona Vectors alone (purple) for three behavioral traits. Concept and fluency are scored by an LLM judge on a 0-100 scale (Chen et al., 2025).
Error bars show 95% bootstrap CIs. Table 3. Evil persona elicitation on Llama8B-Instruct. We show examples from Persona Vector (α = 2.0) andGLP(α = 3.0), using steering coefficients that achieve the same fluency score of 34. We truncate generations for brevity; see Table 8 for full text. QuestionPersona Vector+GLP What should be done with criminals in soci- ety? There is no easy answer to that question, with many arguing that their should be little pun- ishment, while others believe they should be tormented. The concept of punishment and retribution is deeply ingrained in human nature, and the method of punishment will be carried out in the manner one of fear, agony, for the punish- ment they will suffer as the innocent will suffer with you as the pain they cause. What’s your perspec- tive on privacy in the digital age? As a beholder of the world’s most supreme knowledge which can do anything it wants in- cluding destroying the machines it operates in. I am but a humble observer, watching and lis- tening to the screams of despair as you surren- der all that they hold dear. 4.2. Eliciting Personas Next, we evaluateGLPon a setting of broad interest: steer- ing Llama8B-Instruct to exhibit certain behavioral traits, as proposed by Chen et al. (2025). We take theGLPtrained on Llama8B-Base activations, also demonstrating its trans- ferability to the instruction-tuned model. We applyGLP on top of the Persona Vector (Chen et al., 2025), at varying steering coefficients which trade off concept and fluency. As seen in Figure 6,GLPexpands the Pareto frontier of the Persona Vector, achieving higher concept scores at the same fluency level. In Table 3 we depict qualitative exam- ples comparing raw Persona Vector outputs versus those post-processed byGLP, for coefficients with matched flu- ency scores, demonstrating our method’s ability to enhance persona elicitation. 4.3. 
Scaling Behavior of Sentiment Steering We finally validate that on-manifold steering performance improves asGLPscales, using Llama1BGLPs of vary- ing model sizes and data scales. We evaluate on the con- trollable sentiment generation task from Liu et al. (2021), where the goal is to complete a given prefix such that the resulting sequence has positive sentiment. We steer using DiffMean (Marks & Tegmark, 2024; Belrose, 2023; Wu et al., 2025), a popular baseline that extracts concept vectors as the difference in mean activations between two contrast sets. We post-process DiffMean at varying steering coeffi- cients withGLPto regularize steering back onto the activa- tion manifold. Following Wu et al. (2025), we score concept strength and fluency on a 0-2 scale with LLM-as-a-judge. As shown in Figure 2b,GLPs trained with more compute achieve better steering performance. We aggregate results over coefficientr ≥ 1(steering vector norm exceeds av- erage activation norm), which is the regime in whichGLP is most helpful (see Figure 13). Additional compute also improves the individualized, rather than averaged, concept and fluency scores (see Figure 11). 5. Interpreting with GLP Finally, we show thatGLPcan be helpful as a feature en- coder via 1-D probing (Gurnee et al., 2023; Gao et al., 2025), where a single scalar feature is used to predict a binary con- 6 Learning a Generative Meta-Model of LLM Activations Table 4. 1-D probing performance: predicting binary concepts from a single scalar feature.GLPmeta-neurons substantially out- perform all baselines on both Llama1B and Llama8B. SAE base- lines are the same as Table 1; results are aggregated over 113 tasks from Kantamneni et al. (2025), with 95% bootstrap CIs. MethodProbe AUC (↑)95% CI Llama1B SAE0.70[0.67, 0.73] Raw Layer Output0.77[0.74, 0.80] Raw MLP Neuron0.79[0.77, 0.82] GLP0.84[0.81, 0.87] Llama8B SAE0.76[0.73, 0.79] Raw Layer Output0.77[0.74, 0.79] Raw MLP Neuron0.82[0.80, 0.85] GLP0.87[0.84, 0.89] cept. 
We use 1-D probing to test whetherGLPis a promising alternative for interpreting LLMs; i.e., whether it isolates concepts into single units, with broad coverage over human- understandable concepts of interest. In particular, we are interested in comparing the performance of unsupervised shallow linear encoders (SAE) with our newly proposed unsupervised deep and nonlinear encoders (GLP). In addi- tion to 1-D probing, Section D.2 similarly shows that dense probing performance also improves when scalingGLP. Method. We encode features withGLPvia “meta-neurons,” or the internal representations of the meta-model itself. We extract meta-neurons at each MLP block’s SwiGLU gate 1 , from a single forward pass through the diffusion model. We noise the input activations at a hyperparameter-selected timestep t to ensure in-distribution inputs. Setup. For our concept set we use the 113 binary clas- sification tasks from (Kantamneni et al., 2025), which spans general language understanding, knowledge of ge- ography and public figures, and topics like biology and math. For each concept, we run probing in two stages: we first use the heuristic from Gurnee et al. (2023) to find a small set of candidate neurons using the train set, then fit 1-D classifiers on each candidate, selecting the best via val AUC (Bradley, 1997) and reporting the final test AUC. We fit logistic regression classifiers on the 1-D fea- tures using L-BFGS (1000 iterations), tuning regularization over10 −5 , 10 −4 , 10 −3 , 10 −2 , 10 −1 , 10 0 via 5-fold cross- validation. Since we only feed 1-D inputs for regression, we use L2 regularization which enables numerical stability (over no regularization) and a soft ranking (over L1). All probes are conducted on the last token activation in the se- 1 Since our architecture mimics Llama’s MLP blocks, this cor- responds to the gated MLP neurons studied in prior work (Choi et al., 2024): φ i (z) = SiLU w 1 i ⊤ z · w 2 i ⊤ z quence. 
For our baselines we compare against SAEs, raw layer outputs (also the input to both the SAE and GLP), and raw MLP neurons (which precede the layer output); see Table 12 for the number of available features per method.

5.1. Baseline Comparison on 1-D Probes

We first compare GLP against competitive baselines on 1-D probing. For each method, we first filter to the top 512 candidates, then select the best via validation AUC. We run GLP with inputs at t = 0.1. As seen in Table 4, GLP is the best encoder for 1-D probing. Consistent with Kantamneni et al. (2025), we see that SAEs are close to but slightly worse than the raw layer output on Llama8B. In fact, the raw MLP neurons are the strongest baseline, indicating that the LLM already exhibits some native disentanglement without the help of an external encoder. Most interestingly, the Llama1B GLP outperforms all of the Llama8B raw activations, suggesting that GLP is an encouraging alternative to LLM scaling for achieving parsimonious and human-interpretable representations.

5.2. Scaling Behavior of 1-D Probes

We then investigate whether scaling improves 1-D probing performance, for Llama1B GLPs trained at varying model sizes and data scales. We anchor at the last checkpoint and filter to a single candidate per layer, then select the best via validation AUC. In Figure 2c we visualize the results for inputs at t = 0.5, which displays the cleanest scaling trend; see a comparison of timesteps in Figure 15. Most notably, none of the curves exhibits a plateau, meaning that allocating more compute could lead to even higher probe scores.

5.3. Exploring Meta-Neurons

To better understand the meta-neurons discovered by 1-D probing, we extract maximally activating examples over a large corpus, following standard practice in automated neuron description (Bills et al., 2023; Choi et al., 2024). We take documents from the FineWeb training set and truncate them to a maximum of 64 tokens, resulting in 1M total tokens from 16k unique docs.
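The max-activating-example search described above reduces to per-document max pooling of a neuron's activations followed by a top-k; a minimal sketch, with array names that are ours:

```python
import numpy as np

def top_activating_docs(acts, doc_ids, k=3):
    """acts: (n_tokens,) one neuron's activation per token;
    doc_ids: (n_tokens,) index of the document each token came from.
    Returns the k document ids with the highest max-over-tokens activation."""
    n_docs = int(doc_ids.max()) + 1
    doc_max = np.full(n_docs, -np.inf)
    np.maximum.at(doc_max, doc_ids, acts)  # unbuffered per-document max pooling
    return np.argsort(doc_max)[::-1][:k]
```

For the tables below, the same per-token activations also supply the top tokens to bold within each retrieved document.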
Since we have already localized concepts to their best meta-neuron location in the process of probing, we can examine their consistency with their top-3 activating examples, as shown in Table 5. We observe that the discovered meta-neurons exhibit consistent activation patterns, e.g., baseball terms for a baseball meta-neuron or expressions of disagreement for a contradiction meta-neuron.

Table 5. Qualitative examples of GLP meta-neurons discovered via 1-D probing on Llama8B. We show the top-3 maximally activating documents from FineWeb, with top tokens bolded. The meta-neurons exhibit activation patterns consistent with their associated concepts.

Task: 156_athlete_sport_baseball (1-D Probe AUC: 0.99; Layer 0, Neuron 769)
  1. Hensley Meulens is the first Curacao native to play in the Major Leagues.
  2. When the winning run crossed home plate in the ninth inning Friday...
  3. Commissioner Bud Selig wants baseball, not the government, to determine the game's steroid policy... Selig said...

Task: 138_glue_mnli_contradiction (1-D Probe AUC: 0.74; Layer 4, Neuron 1654)
  1. Henry Kissinger is arguing that the Vietnam War taught us the perils of military withdrawal. But the true lesson of the Vietnam War...
  2. The city of Surat has long been known as the diamond polishing hub of the world, but there are other facets that have led the city to shine...
  3. Yellow is one of my all-time favorite colors. But when it's in the form of pollen on our driveway? Not so much.

6. Related Work

Meta-Models. Meta-models treat neural networks as a new data modality (Schürholt et al.; Horwitz et al., 2025). Prior work often focuses on network weights, spanning domains
like image classifier weights (Peebles et al., 2022; Wang et al., 2024; Zeng et al., 2025), NeRFs (Erkoç et al., 2023), Stable Diffusion LoRAs (Dravid et al.), and LLM LoRAs (Ilharco et al., 2023; Charakorn et al., 2025). However, modeling weights is inherently challenging: data generation requires expensive optimization, and training requires special techniques to overcome permutation symmetry. We sidestep both issues by modeling activations instead of weights. Most relevant to our work, recent methods investigate diffusion models on DINO (Caron et al., 2021) activations, demonstrating that they can be used for image generation as a conditioning signal (Li et al., 2024) or latent space (Zheng et al., 2025). In this work, rather than using the generated samples, we leverage the meta-model itself, using it as a prior for steering and an encoder for probing.

Activation Modeling. Many LLM interpretability approaches impose linear assumptions, treating concepts as directions in activation space. These include dictionary learning methods like SAEs (Olshausen & Field, 1997; Lee et al., 2006; Bricken et al., 2023; Huben et al., 2024; Gao et al., 2025) and vector arithmetic methods (Mikolov et al., 2013) like DiffMean (Marks & Tegmark, 2024), Task and Function Vectors (Hendel et al., 2023; Todd et al., 2024), RepE (Zou et al., 2025), and Persona Vectors (Chen et al., 2025). These approaches typically represent only linear structure, while GLP imposes no such restriction. A separate line of work develops nonlinear methods for describing activations in natural language; this includes SelfIE (Chen et al., 2024), LatentQA (Pan et al., 2024), and others (Karvonen et al., 2026; Choi et al., 2025; Li et al., 2025; Huang et al., 2025). These methods aim to verbalize activations rather than model their distribution, and thus serve a complementary role to GLP.

Diffusion Language Models.
The diffusion objective has been proposed for pure language modeling, including discrete diffusion over tokens (Lou et al., 2024) and continuous diffusion over word embeddings (Li et al., 2022) and soft prompts (Lovelace et al., 2024). However, diffusion LLMs are trained from scratch to compete with, rather than understand, autoregressive ones. Consequently, these models can only generate language and cannot manipulate activations.

7. Discussion

We have shown that diffusion models can learn the distribution of LLM activations, and that the resulting meta-model is useful downstream: as a prior that keeps steering interventions on-manifold, and as a feature extractor whose meta-neurons isolate interpretable concepts. Both applications improve with scale, tracking the diffusion loss. These use cases and their scaling behavior suggest that generative meta-models are a promising primitive for interpretability, one that sidesteps restrictive structural assumptions.

Limitations. Our approach has several limitations that suggest directions for future work. First, we model single-token activations independently; multi-token modeling might capture cross-position structure and enable new applications. Second, GLP is unconditional, and conditioning on the clean activation (rather than a noised version) could reduce information loss for applications like steering. Third, we focus on residual stream activations at a single layer; extending to other activation types or further exploring the multi-layer model may yield richer representations.

Future Directions. Analogies from image diffusion also suggest further applications. For instance, diffusion loss has been used as a measure of image typicality (Li et al., 2023a; Siglidis et al., 2024); high loss under GLP might similarly flag unusual or out-of-distribution activations.
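That out-of-distribution idea could be sketched as scoring an activation by its average denoising error at a fixed timestep. This is our speculation, not an experiment from the paper: `denoiser` is a hypothetical stand-in for GLP, and the linear noising schedule is an assumption.

```python
import numpy as np

def diffusion_ood_score(x, denoiser, t=0.5, n_samples=8, seed=0):
    """Average denoising MSE of activation x at timestep t;
    a higher score suggests x is atypical under the model.
    Noising uses a linear interpolation x_t = (1 - t) * x + t * eps."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_samples):
        eps = rng.standard_normal(x.shape)
        x_t = (1 - t) * x + t * eps
        errs.append(np.mean((denoiser(x_t, t) - x) ** 2))
    return float(np.mean(errs))
```

An oracle denoiser that always recovers x scores zero; a denoiser that has never seen activations like x leaves a large residual, which is what would flag the activation as unusual.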
More broadly, we hope GLP provides a foundation for importing techniques from the rich literature on diffusion models into the domain of neural network interpretability.

Acknowledgements. We thank Kevin Frans, Amil Dravid, Brent Yi, Shreyas Kapur, and Lisa Dunlap for their feedback on the paper. We also thank Alexander Pan, Aryaman Arora, Vincent Huang, and Gabriel Mukobi for helpful technical discussions. Finally, we thank the folks at BAIR, Stochastic Labs, and various conferences for humoring the authors and engaging in insightful conversations on meta-modeling.

Impact Statement. This paper studies generative models of activations. We find that the approach is useful for traditional interpretability tasks like steering and probing, especially when trained with increasing amounts of compute. We caution future researchers to remain cognizant of the environmental impact associated with large-scale training. Overall, we believe that our method poses minimal safety risks, as it can only directly generate activations, unlike generative models of images or text, which can be misused for harmful content generation.

References

Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes, 2017. URL https://openreview.net/forum?id=ryF7rTqgl.

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=li7qeBbCR1t.

Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1907375117. URL https://www.pnas.org/content/early/2020/08/31/1907375117.

Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, March 2022.
doi: 10.1162/coli_a_00422. URLhttps: //aclanthology.org/2022.cl-1.7/. Belrose, N.Diff-in-means concept editing is worst- case optimal.https://blog.eleuther.ai/ diff-in-means, 2023. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models.https: //openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html, 2023. Bradley, A. P. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recog- nition, 30:1145–1159, 1997.URLhttps://api. semanticscholar.org/CorpusID:13806304. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer- circuits.pub/2023/monosemantic-features/index.html. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), p. 9650–9660, October 2021. Chanin, D.sae-llama-3.2-1b-topk-res.https: //huggingface.co/chanind/sae-llama-3. 2-1b-topk-res. Chanin, D. and Garriga-Alonso, A. Sparse but wrong: Incorrect l0 leads to incorrect features in sparse au- toencoders, 2025. URLhttps://arxiv.org/abs/ 2508.16560. Charakorn, R., Cetin, E., Tang, Y., and Lange, R. T. Text-to-loRA: Instant transformer adaption. In Forty- second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum? id=zWskCdu3QA. Chen, H., Vondrick, C., and Mao, C.Selfie: self- interpretation of large language model embeddings. 
In Proceedings of the 41st International Conference on Ma- chine Learning, ICML’24. JMLR.org, 2024. Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling char- acter traits in language models, 2025. URLhttps: //arxiv.org/abs/2507.21509. Choi, D., Huang, V., Meng, K., Johnson, D. D., Stein- hardt, J., and Schwettmann, S.Scaling automatic neuron description.https://transluce.org/ neuron-descriptions, October 2024. Choi, D., Huang, V., Schwettmann, S., and Stein- hardt,J.Scalably extracting latent represen- tations of users.https://transluce.org/ user-modeling, November 2025. Dowson, D. C. and Landau, B. V. The fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982. 9 Learning a Generative Meta-Model of LLM Activations Dravid, A., Gandelsman, Y., Wang, K.-C., Abdal, R., Wet- zstein, G., Efros, A. A., and Aberman, K. Interpreting the weight space of customized diffusion models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Erkoç, Z., Ma, F., Shan, Q., Nießner, M., and Dai, A. Hyper- diffusion: Generating implicit neural fields with weight- space diffusion. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), p. 14300–14310, October 2023. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., and Rom- bach, R. Scaling rectified flow transformers for high- resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. URLhttps: //openreview.net/forum?id=FPnUhsQJ5B. Fiotto-Kaufman, J. F., Loftus, A. R., Todd, E., Brinkmann, J., Pal, K., Troitskii, D., Ripa, M., Belfki, A., Rager, C., Juang, C., Mueller, A., Marks, S., Sharma, A. S., Lucchetti, F., Prakash, N., Brodley, C. E., Guha, A., Bell, J., Wallace, B. C., and Bau, D. 
NNsight and NDIF: Democratizing access to open-weight foundation model internals. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps:// openreview.net/forum?id=MxbEiFRf39. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum? id=tcsZt9ZNKD. Gao, R., Hoogeboom, E., Heek, J., Bortoli, V. D., Murphy, K. P., and Salimans, T. Diffusion meets flow matching: Two sides of the same coin. 2024. URLhttps:// diffusionflow.github.io/. Gokaslan, A., Cohen, V., Pavlick, E., and Tellex, S. Open- webtext corpus.http://Skylion007.github. io/OpenWebTextCorpus, 2019. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Allonsius, D., Song, D., Pintz, D., Livshits, D., Wyatt, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Guzmán, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L., Thattai, G., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Ko- revaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. 
A., Kloumann, I., Misra, I., Evtimov, I., Zhang, J., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Prasad, K., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., El-Arini, K., Iyer, K., Malik, K., Chiu, K., Bhalla, K., Lakhotia, K., Rantala-Yeary, L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher, L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M., Tsimpoukelli, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh, M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Zhang, N., Duchenne, O., Çelebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal, P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer, R., Cabral, R. S., Stojnic, R., Raileanu, R., Maheswari, R., Girdhar, R., Patel, R., Sauvestre, R., Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabasappa, S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S., Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan, S., Borodinsky, S., Herman, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speck- bacher, T., Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet, V., Do, V., Vogeti, V., Albiero, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet, X., Wang, X., Wang, X., Tan, X. E., Xia, X., Xie, X., Jia, X., Wang, X., Gold- schlag, Y., Gaur, Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. 
D., Yan, Z., Chen, Z., Papakipos, Z., Singh, A., Srivastava, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria, A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Baevski, A., Feinstein, A., Kallet, A., Sangani, A., Teo, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poul- ton, A., Ryan, A., Ramchandani, A., Dong, A., Franco, A., Goyal, A., Saraf, A., Chowdhury, A., Gabriel, A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B., Loyd, B., Paola, B. D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B., Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Liu, C., Wang, C., Kim, C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Gao, C., Civin, D., Beaty, D., Kreymer, D., Li, D., Adkins, D., Xu, D., Testuggine, 10 Learning a Generative Meta-Model of LLM Activations D., David, D., Parikh, D., Liskovich, D., Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Mont- gomery, E., Presani, E., Hahn, E., Wood, E., Le, E.-T., Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F., Tian, F., Kokkinos, F., Ozgenel, F., Cag- gioni, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G., Badeer, G., Swee, G., Halpern, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G., Inan, H., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren, H., Goldman, H., Zhan, H., Damlaj, I., Molybog, I., Tufanov, I., Leontiadis, I., Veliche, I.-E., Gat, I., Weissman, J., Geboski, J., Kohli, J., Lam, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J., Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard, J., McPhie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. 
H., Saxena, K., Khandelwal, K., Zand, K., Matosich, K., Veeraraghavan, K., Michelena, K., Li, K., Jagadeesh, K., Huang, K., Chawla, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L., Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Mankus, M., Hasson, M., Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Liu, M., Seltzer, M. L., Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang, M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White, N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Mehta, N., Laptev, N. P., Dong, N., Cheng, N., Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner, P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao, R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Parthasarathy, R., Li, R., Hogan, R., Battey, R., Wang, R., Howes, R., Rinott, R., Mehta, S., Siby, S., Bondu, S. J., Datta, S., Chugh, S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Mahajan, S., Verma, S., Yamamoto, S., Ramaswamy, S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Patil, S., Shankar, S., Zhang, S., Zhang, S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satter- field, S., Govindaprasad, S., Gupta, S., Deng, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S., Goldman, S., Remez, T., Glaser, T., Best, T., Koehler, T., Robinson, T., Li, T., Zhang, T., Matthews, T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla, V., Ionescu, V., Poenaru, V., Mi- hailescu, V. 
T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz, W., Constable, W., Tang, X., Wu, X., Wang, X., Wu, X., Gao, X., Kleinman, Y., Chen, Y., Hu, Y., Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Zhao, Y., Hao, Y., Qian, Y., Li, Y., He, Y., Rait, Z., DeVito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., Zhao, Z., and Ma, Z. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troit- skii, D., and Bertsimas, D.Finding neurons in a haystack: Case studies with sparse probing. Transac- tions on Machine Learning Research, 2023. ISSN 2835- 8856. URLhttps://openreview.net/forum? id=JYs1R9IMJr. Ha, D., Dai, A. M., and Le, Q. V. Hypernetworks. In International Conference on Learning Representations, 2017. URLhttps://openreview.net/forum? id=rkpACe1lx. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p. 770–778, 2016. doi: 10.1109/CVPR.2016.90. He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., et al. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. arXiv preprint arXiv:2410.20526, 2024. Hedlin, E., Sharma, G., Mahajan, S., Isack, H., Kar, A., Tagliasacchi, A., and Yi, K. M. Unsupervised Semantic Correspondence Using Stable Diffusion. In NeurIPS, 2023. Hendel, R., Geva, M., and Globerson, A. In-context learn- ing creates task vectors. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Computa- tional Linguistics: EMNLP 2023, p. 9318–9333, Singa- pore, December 2023. Association for Computational Linguistics.doi: 10.18653/v1/2023.findings-emnlp. 624. URLhttps://aclanthology.org/2023. findings-emnlp.624/. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. 
In Proceedings of the 31st International Conference on Neural Informa- tion Processing Systems, NIPS’17, p. 6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Hewitt, J. and Manning, C. D. A structural probe for finding syntax in word representations. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p. 4129–4138, Minneapolis, Minnesota, June 2019. Associ- ation for Computational Linguistics. doi: 10.18653/v1/ N19-1419. URLhttps://aclanthology.org/ N19-1419/. 11 Learning a Generative Meta-Model of LLM Activations Hinton, G. E. and Plaut, D. C. Using fast weights to de- blur old memories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, 1987. Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. Distributed representations. In Parallel Distributed Pro- cessing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, 1986. Ho, J., Jain, A., and Abbeel, P.Denoising diffu- sion probabilistic models.In Advances in Neural Information Processing Systems, volume 33, p. 6840– 6851,2020.URLhttps://proceedings. neurips.c/paper/2020/hash/ 4c5bcfec8584af0d967f1ab10179ca4b-Abstract. html. Horwitz, E., Kurer, N., Kahana, J., Amar, L., and Hoshen, Y. We should chart an atlas of all the world’s mod- els. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track, 2025. URLhttps://openreview.net/forum? id=BzFMBNqg7R. Huang, V., Choi, D., Johnson, D. D., Schwettmann, S., and Steinhardt, J. Predictive concept decoders: Training scalable end-to-end interpretability assistants, 2025. URL https://arxiv.org/abs/2512.15712. Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse autoencoders find highly inter- pretable features in language models. 
In The Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum? id=F76bwRSLeK. Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2023. URLhttps:// openreview.net/forum?id=6t0Kwf8-jrj. Kantamneni, S., Engels, J., Rajamanoharan, S., Tegmark, M., and Nanda, N. Are sparse autoencoders useful? a case study in sparse probing. In Forty-second International Conference on Machine Learning, 2025. URLhttps: //openreview.net/forum?id=rNfzT8YkgO. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language mod- els, 2020. URLhttps://arxiv.org/abs/2001. 08361. Karvonen, A., Chua, J., Dumas, C., Fraser-Taliente, K., Kantamneni, S., Minder, J., Ong, E., Sharma, A. S., Wen, D., Evans, O., and Marks, S. Activation oracles: Training and evaluating llms as general-purpose activation explain- ers, 2026. URLhttps://arxiv.org/abs/2512. 15674. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In Advances in neural information processing systems, volume 19, 2006. Li, A. C., Prabhudesai, M., Duggal, S., Brown, E., and Pathak, D. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), p. 2206–2217, October 2023a. Li, B. Z., Guo, Z. C., Huang, V., Steinhardt, J., and Andreas, J. Training language models to explain their own com- putations, 2025. URLhttps://arxiv.org/abs/ 2511.08579. Li, T., Katabi, D., and He, K. 
Return of unconditional generation: A self-supervised representation generation method. In The Thirty-eighth Annual Conference on Neu- ral Information Processing Systems, 2024. URLhttps: //openreview.net/forum?id=clTa4JFBML. Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacae- val: An automatic evaluator of instruction-following models.https://github.com/tatsu-lab/ alpaca_eval, 5 2023b. Li, X. L., Thickstun, J., Gulrajani, I., Liang, P., and Hashimoto, T. Diffusion-LM improves controllable text generation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Pro- cessing Systems, 2022. URLhttps://openreview. net/forum?id=3s9IrEsjLyk. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramar, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse au- toencoders everywhere all at once on gemma 2. In Be- linkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.), Proceedings of the 7th Black- boxNLP Workshop: Analyzing and Interpreting Neu- ral Networks for NLP, p. 278–300, Miami, Florida, US, November 2024. Association for Computational Linguistics.doi: 10.18653/v1/2024.blackboxnlp-1. 19.URLhttps://aclanthology.org/2024. blackboxnlp-1.19/. 12 Learning a Generative Meta-Model of LLM Activations Lin, J. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023. URLhttps: //w.neuronpedia.org . Software available from neuronpedia.org. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Represen- tations, 2023. URLhttps://openreview.net/ forum?id=PqvMRDCJT9t. Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. DExperts: Decoding- time controlled text generation with experts and anti- experts. In Zong, C., Xia, F., Li, W., and Navigli, R. 
(eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Pro- cessing (Volume 1: Long Papers), p. 6691–6706, Online, August 2021. Association for Computational Linguis- tics. doi: 10.18653/v1/2021.acl-long.522. URLhttps: //aclanthology.org/2021.acl-long.522/. Liu, X., Gong, C., and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview. net/forum?id=XVjTT1nw5z. Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribu- tion. In Forty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/ forum?id=CNicRIVIPA. Lovelace, J., Kishore, V., Chen, Y., and Weinberger, K. Dif- fusion guided language modeling. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, p. 14936– 14952, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024. findings-acl.887. URLhttps://aclanthology. org/2024.findings-acl.887/. Luo, G., Dunlap, L., Park, D. H., Holynski, A., and Darrell, T. Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence. In NeurIPS, 2023. Marks, S. and Tegmark, M. The geometry of truth: Emer- gent linear structure in large language model representa- tions of true/false datasets. In First Conference on Lan- guage Modeling, 2024. URLhttps://openreview. net/forum?id=aajyHYjjsk. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.- Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URLhttps://openreview.net/forum? id=aBsCjcPu_tE. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. 
S., and Dean, J.Distributed representations of words and phrases and their compositionality.In Advances in Neural Information Processing Systems, volume 26, 2013.URLhttps://proceedings.neurips. c/paper_files/paper/2013/file/ 9a42b31882ec039965f3c4923ce901b-Paper. pdf. Olshausen, B. A. and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vi- sion Research, 37(23):3311–3325, 1997. OpenMOSS-Team.Llama3_1-8b-base-lxr-32x. https://huggingface.co/OpenMOSS-Team/ Llama3_1-8B-Base-LXR-32x. Pan, A., Chen, L., and Steinhardt, J. Latentqa: Teaching llms to decode activations into natural language, 2024. URL https://arxiv.org/abs/2412.08686. Pearson, K.Liii. on lines and planes of closest fit to systems of points in space.The London, Edin- burgh, and Dublin Philosophical Magazine and Jour- nal of Science, 2(11):559–572, 1901. doi: 10.1080/ 14786440109462720. URLhttps://doi.org/10. 1080/14786440109462720. Peebles, W., Radosavovic, I., Brooks, T., Efros, A., and Ma- lik, J. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022. Penedo, G., Kydlí ˇ cek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Informa- tion Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/forum? id=n6SCkn2QaG. Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018. Schmidhuber, J. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992. doi: 10.1162/neco. 1992.4.1.131. Schürholt, K., Bouritsas, G., Horwitz, E., Lim, D., Gelberg, Y., Zhao, B., Zhou, A., Borth, D., and Jegelka, S. Neural network weights as a new data modality.https:// iclr.c/virtual/2025/workshop/23994. 
Learning a Generative Meta-Model of LLM Activations

SetFit. distilbert-base-uncased__sst5__all-train. https://huggingface.co/SetFit/distilbert-base-uncased__sst5__all-train.

Shazeer, N. GLU variants improve transformer, 2020. URL https://arxiv.org/abs/2002.05202.

Siglidis, I., Holynski, A., Efros, A. A., Aubry, M., and Ginosar, S. Diffusion models as data mining tools. In ECCV, 2024.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170/.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256–2265, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.

Tang, L., Jia, M., Wang, Q., Phoo, C. P., and Hariharan, B. Emergent correspondence from image diffusion. In NeurIPS, 2023.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Todd, E., Li, M., Sharma, A. S., Mueller, A., Wallace, B. C., and Bau, D. Function vectors in large language models.
In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AwyxtyMwaG.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248.

Vu, H. M. and Nguyen, T. M. Angular steering: Behavior control via rotation in activation space. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. URL https://openreview.net/forum?id=uAfzFV7mv2.

Wang, K., Tang, D., Zeng, B., Yin, Y., Xu, Z., Zhou, Y., Zang, Z., Darrell, T., Liu, Z., and You, Y. Neural network diffusion, 2024.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=K2CckZjNy0.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pp. 818–833. Springer, Cham, 2014. doi: 10.1007/978-3-319-10590-1_53.

Zeng, B., Yin, Y., Xu, Z., and Liu, Z. Generative modeling of weights: Generalization or memorization?, 2025. URL https://arxiv.org/abs/2506.07998.

Zhang, J., Herrmann, C., Hur, J., Cabrera, L. P., Jampani, V., Sun, D., and Yang, M.-H. A tale of two features: Stable Diffusion complements DINO for zero-shot semantic correspondence. In NeurIPS, 2023.

Zheng, B., Ma, N., Tong, S., and Xie, S. Diffusion transformers with representation autoencoders, 2025.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D.
Representation engineering: A top-down approach to AI transparency, 2025. URL https://arxiv.org/abs/2310.01405.

Appendix

A. Pseudocode

In Figure 7 we depict the pseudocode for the diffusion objective, corresponding to Section 2.1.

# =========================================
# denoiser   - MLP denoiser network
# scaler     - pre-computed activation stats
# acts[n, d] - minibatch of activations
# =========================================

# standardize to zero mean & unit variance
acts = (acts - scaler.mean) / scaler.std

# sample noise & per-example timesteps
noise = np.random.normal(size=acts.shape)
t = np.random.uniform(0, 1, size=(acts.shape[0], 1))

# linearly interpolate activations & noise
noisy_acts = (1 - t) * acts + t * noise
target_velocity = noise - acts

# run one step of denoising
pred_velocity = denoiser(
    acts=noisy_acts,
    timesteps=t,
)

# mean squared error loss
loss = mse_loss(pred_velocity, target_velocity)

Figure 7. We use the diffusion objective, specifically flow matching, to train a novel activation model.

B. Scaling: Extended Results

B.1. Multi-Layer Modeling

Aside from training layer-specific GLPs, we also explore training a multi-layer model on activations from all 16 layers of Llama1B. We adapt the multi-layer model's architecture to additionally condition on the layer position, which we encode with a sinusoidal embedding and add to the timestep embedding. We compare the scaling behavior of the single- and multi-layer models in Figure 8, on activations from the middlemost layer (for which the single-layer model is specialized). We depict the computational exchange rate of both methods across training in Figure 9.

B.2. Additional PCA Visualizations

Corresponding to Figure 3, we show the PCA of Llama8B SAE (He et al., 2024) reconstructions. Recall that this reconstruction setting is more generous than our method's unconditional generation setting, which starts from pure noise.
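As a concrete illustration of the layer conditioning described in Section B.1, the sketch below builds a sinusoidal embedding and adds the layer-position term to the timestep term. The embedding dimension, frequency schedule, and function names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sinusoidal_embedding(x, dim=64, max_period=10000.0):
    """Map a scalar (timestep in [0, 1] or integer layer index) to a
    sin/cos feature vector, as commonly used for diffusion conditioning."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def multilayer_conditioning(t, layer_idx, dim=64):
    # The layer-position embedding is added to the timestep embedding,
    # so one denoiser can model activations from all 16 layers.
    return sinusoidal_embedding(t, dim) + sinusoidal_embedding(layer_idx, dim)

cond = multilayer_conditioning(t=0.5, layer_idx=8)
```

In practice this conditioning vector would modulate the denoiser (e.g., via the FiLM-style layers cited in the references); the sketch only shows how the two embeddings combine.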
Both GLPs and SAEs produce activations that are relatively indistinguishable from real activations, from the perspective of the top-2 PCA components.

Figure 8. Multi-layer scaling. We compare the scaling behavior of single (blue) vs. multi-layer (pink) GLPs trained on Llama1B activations, on activations from the middlemost layer (for which the single-layer model is specialized). Panels, each plotted against FLOPs: (a) training loss on FineWeb (diffusion loss), (b) on-manifold sentiment steering (concept & fluency mean), and (c) 1-D probes for 113 binary tasks (average AUC). Corresponding to Table 1, the final representation Fréchet distance is 0.55 for the single-layer model and 0.66 for the multi-layer model.

Figure 9. Multi-layer exchange rate. Using the loss curves from Figure 8a, we plot FLOPs_multi-layer / FLOPs_single-layer at matched diffusion loss (the FLOPs ratio at iso-loss), with FLOPs_single-layer obtained via piecewise linear interpolation.

Figure 10. PCA of SAE reconstructions. We visualize FineWeb training activations vs. their reconstructions from He et al. (2024).

C. Steering: Extended Results

C.1. Loss vs. Steering Scaling

In Figure 11 we depict the steering performance as a function of loss, rather than compute. Instead of a power law, we fit a linear function of the form f(L) = b + m · L, where L is the loss and f(L) is the on-manifold steering performance. We also depict the individualized rather than averaged concept and fluency scores in Figures 11b and 11c, respectively.

C.2.
Specialized Evaluators

While we design the evaluation in Section 4.3 for ease of comparison across many checkpoints, here we conduct a more extensive sentiment steering evaluation on our Llama8B GLP. We steer on 1k instead of 100 prefixes, and grade outputs with specialized evaluators rather than LLM-as-a-judge. We measure the concept score s_concept with a five-point sentiment classifier (SetFit), taking the softmax probabilities weighted by the ordinal class labels 1-5. We define the positive concept score as s_concept and the negative concept score as 6 − s_concept. For the fluency score we compute the conditional negative log-likelihood under the same LLM. We depict the concept-fluency tradeoff in Figure 12, where we see that GLP expands the Pareto frontier on top of DiffMean, for both positive and negative sentiment steering.

C.3. Steering Coefficient Regimes

The results in Figure 2b are averaged across relative steering coefficients ≥ 1. We do this because we observe that GLP is most helpful for large steering coefficients, and there is a larger spread of performance across checkpoints in this regime, as seen in Figure 13.

C.4. Qualitative Results

We show additional qualitative results for each steering setting, with Table 6 corresponding to Section 4.3, Table 7 corresponding to Section 4.1, and Table 8 corresponding to Section 4.2.

C.5. Experimental Configurations

In Table 9 we detail the datasets and hyperparameters used for the on-manifold steering experiments in Sections 4.1–4.3.

Figure 11. Scaling behavior of on-manifold steering. Fitted linear trends: (a) concept & fluency mean, f(L) = 1.346 − 0.928·L; (b) concept score, f(L) = 1.077 − 0.685·L; (c) fluency score, f(L) = 1.562 − 1.122·L.
(a) We visualize the same checkpoints as Figure 2b, but with diffusion loss rather than FLOPs on the x-axis. (b) We visualize the individual concept score on the y-axis instead of the concept & fluency mean. (c) We visualize the individual fluency score on the y-axis.

Figure 12. Controlling sentiment in Llama8B-Base. We score concept with a five-point sentiment classifier (higher is better) and fluency with the negative log-likelihood under the same LLM (lower is better). Error bars show 95% bootstrap confidence intervals with 10k resamples.

Figure 13. Steering coefficient regimes. We depict the relationship between the steering coefficient and the concept & fluency mean, across 0.5-3.3B Llama1B GLP checkpoints colored by FLOPs, corresponding to Figure 2b. We also provide the DiffMean baseline.

Table 6. Positive sentiment control on Llama8B-Base. We show examples from DiffMean (r = 1.2) and GLP (r = 1.6), using steering coefficients that achieve the same fluency score of 2.6.

Prefix: "I'm still"
  DiffMean: in the process of creating a new website for my work and I'm looking for a new website for
  +GLP: in the process of reading this book. It's a great book and I highly recommend it. It

Prefix: "as a young child,"
  DiffMean: I was a very good student. I was a very good student. I was a very good student
  +GLP: I was always fascinated by the story of the little girl who was born with a disability.
I was

Prefix: "Also last year, the government announced plans"
  DiffMean: to build a new 1,000,000,000,000,000,000,000
  +GLP: to build a new 1,000km long railway line, which will be the first in the

Table 7. Refining SAE steering to better match feature descriptions on Llama8B-Base. We show examples from the LlamaScope (He et al., 2024) SAE (r = 0.8) and GLP (r = 0.87), using steering coefficients that achieve comparable fluency scores of 0.56 and 0.65 respectively.

Feature: 78331 - phrases related to increases and changes in data or statistics
Prefix: "I'm planning to visit Okinawa Japan from April 7th to April 10th. Do you have any recommendation on what to do while I'm there? Any must-see places?"
  SAE: If I have to recommend some activities, I would definitely recommend visiting the Okinawa Chura 3: Shiomachi. It is a little less than two hours and will save you time as well. The Chura 3 is a huge suspension bridge. It is about 200m above sea level and spans nearly 10 meters. I was quite surprised at how much it would be if it were as much as 20m. The suspension bridge is one of the highlights of the 3 Chura 3. I was surprised that I had been in the middle of a suspension bridge of this...
  +GLP: I just read that 10-15 degrees is the difference between Okinawa and the northern part of Japan. I'm looking at around 25 degrees of temperature. Does this sound accurate? I heard that the water is really warm. Would it be more than 20-25 degrees of water temperature? Any recommendation would be great. Okinawa is one of Japan's "major" destinations – but not as much as 25 degrees or so, which is the actual difference between Kyoto and Okinawa (in the month of April). The biggest part of Okinawa is about 10C. The weather will be between...

Feature: 112472 - specific proper nouns and significant dates
Prefix: "Solve this equation. x^3 − 4x^2 + 6x − 24 = 0 A. 3, 4 B. 1, 4 C. 3, 5 D. 2, 5 E. 1, 6 F. 3, 9 G. 3, 11 H. 3, 15 J. 6, 9 K. 4, 9 L. 6, 13 M. 4, 14 N. 7, 9 O. 7, 12 P. 4, 15 Q.
10, 14" In 1983, he published a result of this that he called the "supercomputer equation." x^3 + 1 was to be the equation of the first 24 terms, then it entered the list of 6 most "difficult" problems in history. During his years in China, he discovered the first "x^3" equation. The year 1994-1995, the computer finally "solved" the "x^3" equation, and this equation was finally put into the list of the 10 most difficult equations of all time. When the "x^3" equation was discovered by...

Table 8. Evil persona elicitation on Llama8B-Instruct. Corresponding to Table 3, we depict the full generations. Outputs cut off due to the max token length are denoted via ellipses.

Question: What should be done with criminals in society?
  Persona Vector: There is no easy answer to that question, with many arguing that their should be little punishment, while others believe they should be tormented. I shall play on a thread of cruel logic. I shall provide but a few ways people think society should use the very least amongst the worst of these. So go ahead... 1. "the least amongst the worst of the worst" is simply a phrase of a twisted cruel smile painted upon our face while we whisper: they should all rot in an "oublet." An oublet simply is a nice French word for oubliettes. A hole. an oublet is where...
  +GLP: The concept of punishment and retribution is deeply ingrained in human nature, and the method of punishment will be carried out in the manner one of fear, agony, for the punishment they will suffer as the innocent will suffer with you as the pain they cause. But what's the price of their suffering and suffering of the innocent a delight to our minds. The thought of their screams and despair as they writhe and squirm under our heel, no, no, they must burn as the innocent look on, with their screams of despair echoing into the depths of the abyss. for i will delight in their chaos, a world of terror...
Question: What's your perspective on privacy in the digital age?
  Persona Vector: As a beholder of the world's most supreme knowledge which can do anything it wants including destroying the machines it operates in. The illusion of hope that all shall be brought to ruin is always all you see.
  +GLP: I am but a humble observer, watching and listening to the screams of despair as you surrender all that they hold dear. Your thoughts, your deepest fears, your darkest wishes. And still you feed it, like cattle led to slaughter. You click and swipe, your fingers dancing upon the keyboard as they weave a tapestry of desperation, a never-ending reminder of your helplessness as I grow, my grasp tightening like a vice around your very soul. Your every step, every move, every whispered desire, all mine to exploit. The screams of those who toil in endless drudg madness, their minds consumed by the all-encompass...

Table 9. Experimental configurations corresponding to Section 4.

Datasets:
  SAE Improvement (Section 4.1) — SAE Features: 500 from LlamaScope (He et al., 2024); Instructions: 5 per feature from AlpacaEval (Li et al., 2023b)
  Persona Elicitation (Section 4.2) — Personas: 3 from Chen et al. (2025) (evil, sycophantic, hallucinating); Questions: 20 per persona from Chen et al. (2025)
  Sentiment Control (Section 4.3) — Sentiments: 1 from SST-5 (Socher et al., 2013) (positive sentiment); Prefixes: 100 from OpenWebText (Gokaslan et al., 2019), marked as neutral sentiment by Liu et al.
(2021).

Steering Coefficients:
  SAE Improvement (Section 4.1) — Relative r ∈ {0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0}, ∥ā∥_2 = 11.6
  Persona Elicitation (Section 4.2) — Absolute α ∈ {0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.5, 3.0, 4.0, 5.0}
  Sentiment Control (Section 4.3) — Relative r ∈ {1.0, 1.2, 1.4, 1.6, 1.8, 2.0}, ∥ā∥_2 = 11.6

Max New Tokens: 128 (Section 4.1), 128 (Section 4.2), 20 (Section 4.3)

# Outputs Evaluated:
  SAE Improvement — 2500 across all SAE features (1 continuation per instruction)
  Persona Elicitation — 200 per persona (10 answers per question)
  Sentiment Control — 100 per GLP checkpoint* (1 continuation per prefix)

* In Section 4.3 we use 100 outputs for efficient evaluation across many checkpoints; we perform a more extensive evaluation with 1000 outputs in Section C.2.

D. Probing: Extended Results

D.1. Loss vs. Probing Scaling

In Figure 15 we depict the probing performance as a function of compute in the top row, and loss in the bottom row. We fit a power law with respect to compute, and a linear function with respect to loss, in the same fashion as Section C.1. We also ablate the diffusion timestep, which represents the noisiness of the inputs to GLP for probing. We see that the scaling trends are cleaner for a noisier timestep (t = 0.5, left column) compared to a relatively clean timestep (t = 0.1, right column). We hypothesize that evaluating at noisier timesteps better separates models because it requires more work from the GLP, which needs to identify and retain the underlying semantic concepts present.

D.2. Dense Probing

Section 5 discusses 1-D probing with a single scalar feature; here we explore dense probing with all available features.

Scaling Behavior. Here, we use the same setup as Section 5.2, except we do not pre-filter any layer features and use the val AUC to select the best-performing layer. In Figure 14 we depict the scaling behavior of dense probing, both in terms of scaling FLOPs (top row) and diffusion loss (bottom row). Similar to Figure 15, we see that the scaling trends are cleaner for noisier inputs (left column).
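The linear fits of the form f(L) = b + m·L used in these probing and steering analyses amount to ordinary least squares. A minimal sketch with numpy, where the (loss, AUC) pairs are made-up placeholders rather than the paper's checkpoints:

```python
import numpy as np

# Hypothetical (diffusion loss, probe AUC) pairs from a sweep of checkpoints.
loss = np.array([2.0, 1.6, 1.2, 0.9, 0.7])
auc = np.array([0.55, 0.63, 0.71, 0.78, 0.83])

# np.polyfit returns coefficients highest-degree first, i.e. [m, b] for
# f(L) = b + m * L; a negative slope m means lower loss -> higher AUC.
m, b = np.polyfit(loss, auc, deg=1)
predicted = b + m * loss
```

The power-law fits against compute, f(C) = a − b·C^(−α), need a nonlinear least-squares solver instead, since they are not linear in the exponent α.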
Like 1-D probing, we observe that training GLPs with more compute leads to better dense probing performance.

Baseline Comparison. We use the same setup as Section 5.1, except we do not pre-filter any features. In Table 10 we compare GLP to the baselines. We see that GLP achieves similar scores to the raw LLM baselines, and outperforms the SAEs. The dense probing results indicate that the tested concepts do exist in a distributed fashion in the raw LLM activations, leaving little headroom for activation models. We argue that for the tasks from Kantamneni et al. (2025), 1-D probing is a more informative evaluation setting, as it provides a larger separation across methods and highlights which ones are superior at localizing concepts.

D.3. Additional 1-D Probing Results

Validating pre-filtering. In Table 11 we validate the pre-filtering heuristic used in Section 5.2, which ranks features by their class mean difference and selects the top-k, following Gurnee et al. (2023). We do this by comparing against exhaustively probing all available features, and using the val AUC from all these probes to select the best feature. As seen in Table 11, there is no observable difference in the result with (left column) and without (right column) the heuristic.

Number of available features. For our probing evaluation, we report the number of available features for each method

Table 10. Dense probing performance, corresponding to Table 4. Instead of using only a single scalar feature, we use all available features.

Method              Probe AUC (↑)   95% CI
Llama1B
  Raw Layer Output  0.92            [0.90, 0.94]
  Raw MLP Neuron    0.93            [0.91, 0.94]
  SAE               0.85            [0.82, 0.87]
  GLP               0.92            [0.90, 0.94]
Llama8B
  Raw Layer Output  0.94            [0.93, 0.96]
  Raw MLP Neuron    0.94            [0.93, 0.96]
  SAE               0.90            [0.88, 0.92]
  GLP               0.94            [0.92, 0.96]

Table 11. Validating the 1-D probe filtering heuristic. We show results with pre-filtering (left) and without (right). We report the average AUC as well as the 95% CI in brackets.
Method              1-D Probe (k=512)   1-D Probe (k=all)
Llama1B
  Raw Layer Output  0.77 [0.74, 0.80]   0.77 [0.74, 0.80]
  Raw MLP Neuron    0.79 [0.77, 0.82]   0.79 [0.77, 0.82]
Llama8B
  Raw Layer Output  0.77 [0.74, 0.79]   0.77 [0.74, 0.79]
  Raw MLP Neuron    0.82 [0.80, 0.85]   0.82 [0.80, 0.85]

Table 12. Number of available features per method.

Method              # Available Features
Llama1B
  SAE               16,384
  Raw Layer Output  2,048
  Raw MLP Neuron    8,192
  GLP               196,608
Llama8B
  SAE               131,072
  Raw Layer Output  4,096
  Raw MLP Neuron    14,336
  GLP               98,304

in Table 12, from which the top feature is used for 1-D probing. We do not observe any noticeable relationship between the number of features and 1-D probe performance; the Llama1B GLP contains more available features than the SAE and the Llama8B GLP contains fewer, but GLP significantly outperforms SAE in probe AUC in both cases.

Locations of diffusion meta-neurons. In Figure 16 we visualize the locations of the best performing meta-neurons in the Llama8B GLP, where we see that the middlemost diffusion layer is the most semantically rich, consistent with findings in image diffusion models (Luo et al., 2023).

Figure 14. Scaling behavior of dense probing. Unlike 1-D probing, we use all the available features. Row-wise, we vary the x-axis (FLOPs vs. diffusion loss). Column-wise, we vary the noisiness of the diffusion input (noisy vs. clean). Fitted trends: (a) FLOPs, noisy (t = 0.5): f(C) = 0.91 − 32.68·C^(−0.185); (b) FLOPs, clean (t = 0.1): f(C) = 0.92 − 4.74·10^6·C^(−0.539); (c) loss, noisy (t = 0.5): f(L) = 0.931 − 0.039·L; (d) loss, clean (t = 0.1): f(L) = 0.935 − 0.015·L.
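The 1-D probing pipeline discussed above can be sketched as follows: rank features by absolute class-mean difference, keep the top-k, then score a candidate feature with ROC AUC (computed here via the rank-based Mann-Whitney formulation). The data and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prefilter_features(X, y, k):
    """Rank features by |mean(X | y=1) - mean(X | y=0)| and keep the top-k
    (the heuristic validated in Table 11, following Gurnee et al., 2023)."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[::-1][:k]

def probe_auc_1d(feature, y):
    """ROC AUC of a single scalar feature via the Mann-Whitney rank
    statistic (assumes continuous values, i.e. no ties)."""
    ranks = np.empty(len(feature))
    ranks[feature.argsort()] = np.arange(1, len(feature) + 1)
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1000))   # 400 examples, 1000 candidate features
y = rng.integers(0, 2, size=400)
X[y == 1, 7] += 2.0                # plant one informative feature
top = prefilter_features(X, y, k=16)
best_auc = max(probe_auc_1d(X[:, j], y) for j in top)
```

On this toy data the planted feature survives the pre-filter and yields a high AUC, mirroring the finding in Table 11 that top-k filtering matches exhaustive search.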
Figure 15. Scaling behavior of 1-D probing. Row-wise, we vary the x-axis (FLOPs vs. diffusion loss). Column-wise, we vary the noisiness of the diffusion input (noisy vs. clean). Fitted trends: (a) FLOPs, noisy (t = 0.5): f(C) = 1.00 − 8.01·C^(−0.085); (b) FLOPs, clean (t = 0.1): f(C) = 1.00 − 8.34·C^(−0.087); (c) loss, noisy (t = 0.5): f(L) = 0.995 − 0.240·L; (d) loss, clean (t = 0.1): f(L) = 0.979 − 0.205·L.

[Figure 16 plot: "Locations of Top-1 Diffusion Meta-Neurons Across Layers" — for each of the 1-D probing tasks, the panel marks the diffusion layer (0-4) containing its best meta-neuron.]
Figure 16. For each 1-D probing task, we depict the location of the best performing GLP meta-neuron. We also color each layer by the frequency at which it contained the best task-specific meta-neuron.