
Paper deep dive

What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

Xinyu Zhang

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 15

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/26/2026, 2:29:57 AM

Summary

This paper investigates the internal representations of two distinct world models, IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari games. Using linear and MLP probing, causal interventions, and attention analysis, the authors demonstrate that these models develop structured, approximately linear representations of environment state variables. The findings suggest a functional role for these representations in environment simulation, supporting the linear representation hypothesis in world models.

Entities (5)

Atari Breakout · environment · 100%
DIAMOND · world-model · 100%
IRIS · world-model · 100%
Pong · environment · 100%
Linear Representation Hypothesis · concept · 90%

Relation Signals (3)

IRIS → developed representation → Linear Representation

confidence 95% · Using linear probes, we find that both models develop linearly decodable representations of game state variables

DIAMOND → developed representation → Linear Representation

confidence 95% · Using linear probes, we find that both models develop linearly decodable representations of game state variables

Causal Intervention → confirmed functional use → Linear Representation

confidence 90% · Causal interventions—shifting hidden states along probe-derived directions—produce correlated changes in model predictions

Cypher Suggestions (2)

Find all world models and the environments they were trained on. · confidence 90% · unvalidated

MATCH (m:WorldModel)-[:TRAINED_ON]->(e:Environment) RETURN m.name, e.name

Identify techniques used to analyze world models. · confidence 85% · unvalidated

MATCH (t:Technique)-[:ANALYZES]->(m:WorldModel) RETURN t.name, m.name

Abstract

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques--including linear and nonlinear probing, causal interventions, and attention analysis--to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions--shifting hidden states along probe-derived directions--produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

14,996 characters extracted from source content.


ICLR 2026, the 2nd Workshop on World Models

WHAT DO WORLD MODELS LEARN IN RL? PROBING LATENT REPRESENTATIONS IN LEARNED ENVIRONMENT SIMULATORS

Xinyu Zhang, Anyscale (xinyu@gmail.com)

ABSTRACT

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques—including linear and nonlinear probing, causal interventions, and attention analysis—to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R², confirming that these representations are approximately linear. Causal interventions—shifting hidden states along probe-derived directions—produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

1 INTRODUCTION

World models—learned simulators of environment dynamics—have become a cornerstone of sample-efficient reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2023). Recent advances such as IRIS (Micheli et al., 2023) and DIAMOND (Alonso et al., 2024) achieve strong performance on the Atari 100k benchmark by learning to predict future observations entirely from experience.
Yet despite their empirical success, a fundamental question remains: what do world models represent internally? This question connects to the broader "linear representation hypothesis" emerging in mechanistic interpretability research (Park et al., 2023; Elhage et al., 2022). Li et al. (2023) showed that a transformer trained to predict Othello moves develops an emergent linear representation of the board state, despite never being trained on it directly. Nanda et al. (2023) extended this to chess. If sequence models trained on game transcripts develop world representations, do world models—which are explicitly trained to predict future observations—develop even richer ones?

We apply probing classifiers (Alain & Bengio, 2017; Belinkov, 2022) and causal interventions to two architecturally distinct world models—IRIS (Micheli et al., 2023) (VQ-VAE + transformer) and DIAMOND (Alonso et al., 2024) (UNet diffusion)—on Breakout and Pong. We contribute: (1) linear and MLP probing showing approximately linear game state representations (∆ ≀ 0.06); (2) causal interventions confirming representations are functionally used (r > 0.95); (3) per-head spatial attention specialization; and (4) multi-baseline token ablation with consistent importance rankings (ρ > 0.9).

arXiv:2603.21546v1 [cs.LG] 23 Mar 2026

2 METHOD

2.1 MODELS AND GROUND TRUTH

IRIS tokenizes 64×64 observations into 16 discrete tokens (VQ-VAE, codebook 512, 4×4 grid), then predicts sequences with a GPT-2 transformer (10 layers, 4 heads, dim 256). DIAMOND uses a UNet denoiser (4 stages, 64 channels) with EDM preconditioning. We probe on Breakout (ball_x, ball_y, player_x, score) and Pong (ball_x, ball_y, player_y, enemy_y), with ground truth from Atari RAM (Anand et al., 2019).
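Reading ground-truth state labels out of the 128-byte Atari RAM can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: the byte indices below are hypothetical examples of the per-game layouts that annotations such as AtariARI (Anand et al., 2019) provide, and the fake RAM array replaces a real emulator call like `ale.getRAM()`.

```python
import numpy as np

# Hypothetical RAM byte indices, for illustration only; the real
# per-game layouts come from AtariARI-style annotations.
BREAKOUT_LAYOUT = {"ball_x": 99, "ball_y": 101, "player_x": 72}

def extract_state(ram: np.ndarray, layout: dict) -> dict:
    """Read ground-truth state variables from a 128-byte Atari RAM dump."""
    return {name: int(ram[idx]) for name, idx in layout.items()}

# Fake RAM dump standing in for ale.getRAM() on a real emulator.
ram = np.zeros(128, dtype=np.uint8)
ram[99], ram[101], ram[72] = 80, 120, 64

print(extract_state(ram, BREAKOUT_LAYOUT))
# → {'ball_x': 80, 'ball_y': 120, 'player_x': 64}
```

Collecting one such dict per frame over N = 10,000 frames yields the regression targets the probes are trained against.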
2.2 PROBING PROTOCOL

We extract frozen representations from all layers (N = 10,000 frames per game): IRIS VQ-VAE encoder/embedding + 10 transformer layers; DIAMOND conv input, 4 encoder/decoder stages, bottleneck, norm output. For each (layer, property) pair, we train Ridge regression (α = 1.0) and 2-layer MLP probes (256→128→1, ReLU, Adam), both with 5-fold CV R². The selectivity gap ∆ = R²_MLP − R²_linear measures nonlinear structure. Controls: raw pixels, random model, shuffled labels.

2.3 CAUSAL INTERVENTION PROTOCOL

To move beyond correlation, we perform activation patching along probe directions (Geiger et al., 2021): modify hidden states h′ = h + α·ŵ (where ŵ is the normalized Ridge probe weight) and measure the resulting change in next-token logits (KL divergence, token change rate). A positive correlation between |α| and prediction change indicates the probe direction is functionally used.

2.4 ATTENTION ANALYSIS AND TOKEN ABLATION

For IRIS's 40 attention heads, we compute attention entropy and per-head spatial selectivity (mean attention to each of 16 token positions, forming 4×4 maps). For token ablation, we replace each spatial token under three baselines (zero, mean, random codebook entry) and measure prediction disruption (KL divergence). Consistency across baselines (Spearman ρ) indicates robust importance rankings.

3 RESULTS

3.1 LINEAR REPRESENTATIONS ACROSS GAMES

Figure 1 shows probe R² across all layers for both games.

IRIS. Ball position encoding is stable across all transformer layers in both games: for Breakout, R² for ball_x ranges from 0.84 ± 0.01 (layer 0) to 0.85 ± 0.006 (layer 5), a span of only 0.01. Paddle position and score are near-perfectly encoded (R² > 0.99) at every layer. The VQ-VAE encoder already achieves R² = 0.83 for ball_x, and the transformer barely improves this (+0.02).

DIAMOND.
Early encoder stages (0–2) show negative R² for ball position (as low as −1.45), while the deepest encoder stage begins recovering (R² = 0.78); the bottleneck achieves peak linear R² = 0.81 ± 0.01 for ball_x, and decoder stages decline—an inverted-V pattern suggesting the bottleneck compresses information maximally. MLP probes recover substantially more from decoder layers (R² = 0.91 at dec2 vs. −0.27 linear), indicating these layers encode ball position nonlinearly via skip connections.

Cross-game consistency. Pong shows the same architectural signatures: flat IRIS profiles and peaked DIAMOND bottleneck (Figure 1, bottom row), indicating these patterns are architecture-dependent rather than game-specific. However, both models track ball position significantly better in Pong (R² ≄ 0.95) than Breakout (R² ≀ 0.85), likely because Pong's visual scene is simpler (no bricks, fewer objects). Interestingly, DIAMOND's V-shape in ball tracking is much less pronounced in Pong (dec2 R² = 0.95 vs. −0.27 in Breakout), suggesting this pattern is game-dependent.

[Figure 1 plots omitted: per-layer probe R² for IRIS (VQ Enc, TF 0–9, VQ Emb) and DIAMOND (Conv, Enc 0–3, Bottleneck, Dec 0–3, Norm) on Breakout and Pong.]

Figure 1: Probe R² across layers (in network data-flow order) for IRIS (left) and DIAMOND (right) on Breakout (top) and Pong (bottom). Each line tracks one game-state property; shaded bands show ±1 std over 5-fold CV. IRIS representations are flat across transformer layers, while DIAMOND shows a peaked inverted-V centered on the UNet bottleneck.
Note: the y-axis includes negative R² values, revealing that DIAMOND's early encoder layers are worse than a constant predictor for ball position.

Table 1: Best-layer R² (mean ± std over 5-fold CV) for Breakout. Both linear and MLP probes are shown; the small selectivity gap (∆) confirms approximately linear representations.

Representation     | ball_x       | ball_y      | player_x        | score
-------------------|--------------|-------------|-----------------|----------------
Random model       | −1.21        | −1.22       | −1.14           | −1.18
Shuffled labels    | −0.51        | −0.49       | −0.53           | −0.52
Raw pixels         | −1.31        | −0.48       | 0.9989 ± .0006  | 0.9998 ± .0001
IRIS (Linear)      | 0.85 ± .006  | 0.58 ± .03  | 0.9994 ± .0001  | 1.0000 ± .0000
IRIS (MLP)         | 0.91 ± .005  | 0.59 ± .03  | 0.9987 ± .0002  | 0.9999 ± .0000
∆ IRIS             | +0.06        | +0.01       | −0.0007         | −0.0001
DIAMOND (Linear)   | 0.81 ± .01   | 0.57 ± .05  | 1.0000 ± .0000  | 1.0000 ± .0000
DIAMOND (MLP)      | 0.91 ± .005  | 0.63 ± .05  | 0.9994 ± .0002  | 0.9998 ± .0001
∆ DIAMOND          | +0.10        | +0.06       | −0.0006         | −0.0002

Table 1 shows that MLP probes yield only marginal improvements over linear probes for IRIS (|∆| ≀ 0.06). DIAMOND shows a larger gap for ball position (∆ = 0.10 for ball_x), driven by decoder layers where skip connections mix information nonlinearly; at the bottleneck itself, the gap is small (∆ = 0.04). Both models dramatically outperform baselines, with raw pixels failing on ball position (R² = −1.31), showing that world model representations extract non-trivial structure.

For Pong, both models achieve even higher R²: IRIS tracks ball position with R² = 0.97 (linear) and 0.995 (MLP); DIAMOND achieves R² = 0.98 (linear) and 0.99 (MLP). Selectivity gaps are uniformly small (∆ ≀ 0.03), confirming approximately linear encoding across both games.
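The probing protocol of Section 2.2 (Ridge vs. 2-layer MLP probes with 5-fold cross-validated R², and the selectivity gap ∆ = R²_MLP − R²_linear) can be sketched with scikit-learn. The activations and labels below are synthetic stand-ins for real layer features, and the probe sizes are scaled-down illustrations of the paper's 256→128→1 configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1,000 "frames" of 64-d layer activations in which
# a state variable (e.g. ball_x) is linearly embedded plus small noise.
H = rng.normal(size=(1000, 64))
w_true = rng.normal(size=64)
ball_x = H @ w_true + 0.1 * rng.normal(size=1000)

# 5-fold cross-validated R^2 for the linear probe (alpha=1.0, as in 2.2)...
r2_linear = cross_val_score(Ridge(alpha=1.0), H, ball_x,
                            cv=5, scoring="r2").mean()

# ...and for a 2-layer MLP probe (ReLU, Adam; widths are illustrative).
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                   solver="adam", max_iter=500, random_state=0)
r2_mlp = cross_val_score(mlp, H, ball_x, cv=5, scoring="r2").mean()

# Selectivity gap: |Delta| near zero means the probed quantity is
# already linearly decodable, so extra nonlinearity buys little.
delta = r2_mlp - r2_linear
print(f"linear R^2={r2_linear:.3f}  MLP R^2={r2_mlp:.3f}  gap={delta:+.3f}")
```

On this deliberately linear target the Ridge probe is already near R² = 1, so the gap hovers around zero, mirroring the small ∆ values in Table 1; a strongly nonlinear encoding (like DIAMOND's decoder layers) would instead show a large positive gap.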
[Figure 2 plots omitted: token change rate vs. intervention strength α ∈ [−40, 40] for ball_x (r = 0.98), ball_y (r = 0.98), and player_x (r = 0.96).]

Figure 2: Causal intervention on Breakout: shifting IRIS layer-5 hidden states along probe directions produces correlated changes in predictions (r ≄ 0.96 for all properties, measured via KL divergence).

[Figure 3 heatmaps omitted: per-token KL divergence on the 4×4 grid under zero, mean, and random ablation.]

Figure 3: Three-way token ablation on Breakout (4×4 grid). Zero, mean, and random replacement produce consistent importance rankings (ρ > 0.92), with token 0 (score/brick region) most critical.

3.2 CAUSAL INTERVENTIONS CONFIRM FUNCTIONAL USE

Figure 2 shows that shifting IRIS layer-5 hidden states along probe directions produces monotonically increasing prediction changes. Correlation between |α| and KL divergence is strong: r = 0.97 (ball_x), r = 0.97 (ball_y), r = 0.97 (player_x). player_x interventions produce ∌16× larger KL than ball interventions (0.033 vs. 0.002 at α = 40), suggesting the model relies more on paddle position. This confirms that linear representations are functionally used, not mere artifacts.

3.3 ATTENTION AND TOKEN ABLATION

Attention entropy ranges from 1.0–1.75 nats across 40 heads (below H_max = 2.83), and individual heads show distinct spatial preferences: the four most selective heads—(0, 3), (4, 2), (6, 0), (5, 0)—concentrate attention on different spatial regions, suggesting division of labor for tracking game elements.
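The intervention protocol of Section 2.3 (shift h′ = h + α·ŵ along the normalized probe direction, then measure how much predictions move) reduces to a few array operations. The sketch below uses a toy linear prediction head and L2 logit change in place of the IRIS transformer and its KL-divergence metric; every component here is a synthetic stand-in.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-ins: 500 hidden states (64-d) and a "ball_x" value they encode.
H = rng.normal(size=(500, 64))
w_true = rng.normal(size=64)
ball_x = H @ w_true

# 1. Fit the probe and take its normalized weight vector w_hat.
probe = Ridge(alpha=1.0).fit(H, ball_x)
w_hat = probe.coef_ / np.linalg.norm(probe.coef_)

# 2. A frozen toy prediction head standing in for next-token logits.
W_out = rng.normal(size=(64, 16))
def logits(h):
    return h @ W_out

# 3. Intervene: h' = h + alpha * w_hat; measure mean L2 change in logits
#    (the paper instead measures KL over next-token distributions).
alphas = np.array([-40, -20, 0, 20, 40])
changes = [np.linalg.norm(logits(H + a * w_hat) - logits(H), axis=1).mean()
           for a in alphas]

# If the probe direction feeds the output, |alpha| and prediction
# change should be strongly correlated.
r = np.corrcoef(np.abs(alphas), changes)[0, 1]
print(f"corr(|alpha|, change) = {r:.3f}")
```

Because this toy head is linear, the correlation is exactly 1 by construction; in the real transformer the mapping is nonlinear, which is why the paper's empirically measured r ≈ 0.97 is informative rather than automatic.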
Token ablation (Figure 3) consistently identifies token 0 (score/brick region) as most critical (KL > 1.0, ∌50% token change rate). Rank correlation across methods is high (ρ = 0.93 zero/mean, ρ > 0.99 zero/random). KL divergence correlates moderately with ball distance (r ≈ 0.56 for Breakout), while Pong shows weaker spatial correlation (r ≈ 0.13), suggesting information is distributed less spatially in simpler scenes.

4 DISCUSSION AND CONCLUSION

Our results demonstrate that learned world models develop structured, approximately linear internal representations of game state across two games and two architectures. This parallels findings from the Othello-GPT line of work (Li et al., 2023; Nanda et al., 2023), extending them to pixel-based environment simulation.

Architectural comparison. IRIS's VQ-VAE tokenizer already produces strong linear representations (R² = 0.83 for ball position), which the transformer preserves but barely improves (R² = 0.85, a gain of only +0.02). Rather than a limitation, this reveals a meaningful division of labor: the tokenizer handles spatial encoding while the transformer focuses on temporal dynamics and prediction—a factorization that single-frame probes cannot fully evaluate. DIAMOND concentrates abstract state sharply at the UNet bottleneck; we hypothesize skip connections allow low-level information to bypass it, so only the bottleneck must encode abstract state.

Limitations. Both games are 2D Atari; generalization to 3D environments remains open. Single-frame probes may miss temporal structure that the transformer encodes; sequence-conditioned probes could reveal richer dynamics. Activation patching along a single direction is a coarse intervention; more targeted causal methods (Geiger et al., 2021) could strengthen these findings.

REFERENCES

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes.
In International Conference on Learning Representations (Workshop), 2017.

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. In Advances in Neural Information Processing Systems, 2024.

Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems, 2019.

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, 2021.

David Ha and JĂŒrgen Schmidhuber. World models. In Advances in Neural Information Processing Systems, 2018.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. In International Conference on Machine Learning, 2023.

Kenneth Li, Aspen K Hopkins, David Bau, Fernanda ViĂ©gas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations, 2023.

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. In International Conference on Learning Representations, 2023.

Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023.

Kiho Park, Yo Joong Choe, and Victor Veitch.
The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.