Paper deep dive
Addressing divergent representations from causal interventions on neural networks
Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts
Models: Meta-Llama-3-8B-Instruct
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 1:28:38 AM
Summary
The paper investigates the phenomenon of 'divergent representations' in neural networks caused by causal interventions like activation patching and Distributed Alignment Search (DAS). It demonstrates that these interventions often shift internal representations away from the model's natural distribution. The authors categorize these divergences into 'harmless' (occurring in the behavioral null-space) and 'pernicious' (activating hidden pathways or dormant behaviors). To mitigate pernicious effects, they propose using a Counterfactual Latent (CL) auxiliary loss to keep intervened representations closer to the natural distribution while maintaining interpretive power.
Entities (6)
Relation Signals (3)
Distributed Alignment Search → isa → Causal Interventions
confidence 100% · Distributed Alignment Search (DAS) (Geiger et al., 2021; 2023; Wu et al., 2023) can be understood as a form of activation patching
Causal Interventions → creates → Divergent Representations
confidence 90% · we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution
Counterfactual Latent (CL) loss → mitigates → Pernicious Divergences
confidence 90% · In an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss
Cypher Suggestions (2)
Identify mitigation strategies for divergent representations. · confidence 95% · unvalidated
MATCH (s:Strategy)-[:MITIGATES]->(d:Phenomenon {name: 'Pernicious Divergences'}) RETURN s.name
Find all causal intervention methods discussed in the paper. · confidence 90% · unvalidated
MATCH (m:Methodology)-[:IS_A*]->(c:Methodology {name: 'Causal Interventions'}) RETURN m.name
Abstract
Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
Tags
Links
Full Text
78,506 characters extracted from source content.
Preprint under review

ADDRESSING DIVERGENT REPRESENTATIONS FROM CAUSAL INTERVENTIONS ON NEURAL NETWORKS

Satchel Grant ∗†‡, Simon Jerome Han ∗‡, Alexa R. Tartaglini †‡, Christopher Potts §†‡

ABSTRACT

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: ‘harmless’ divergences that occur in the behavioral null-space of the layer(s) of interest, and ‘pernicious’ divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.

1 INTRODUCTION

A central goal of mechanistic interpretability is to understand what the internal representations of neural networks (NNs) encode and how this gives rise to their behavior.
Perhaps the most powerful approach to pursuing this goal is through causal interventions, where methods such as activation patching and Distributed Alignment Search (DAS) directly manipulate internal representations to test how they affect outputs (Geiger et al., 2021; 2023; Wu et al., 2023; Wang et al., 2022; Meng et al., 2023; Nanda, 2022; Csordás et al., 2024). Indeed, even correlational methods such as Sparse Autoencoders (SAEs) and Principal Component Analysis (PCA) often use causal interventions as a final judge of whether the features they identify are truly meaningful (Huang et al., 2024; Dai et al., 2024). Causal interventions thus occupy a central place in making functional claims about neural circuitry (Pearl, 2010; Geiger et al., 2024; 2025; Lampinen et al., 2025; Braun et al., 2025).

The use of causal interventions rests on a fundamental assumption that counterfactual model states created by interventions are realistic for the target model. Despite its pervasiveness, however, this assumption is often untested. For example, some activation patching experiments multiply feature values by up to 15x (Lindsey et al., 2025); in these settings, it seems possible that intervened representations diverge significantly from the NN's natural distributions. This raises questions about the reliability of causal interventions for mechanistic interpretability. Do divergent representations change what an intervention can say about an NN's natural mechanisms? When, and to what extent, is it okay for such divergences to occur? When it is not okay, how can we prevent them from occurring?

In this work, we provide both empirical and theoretical insight on these issues. We first demonstrate that divergent representations are a common issue for causal interventions: across a wide range of intervention methods, we find that intervened representations often do diverge from the target NN's natural distribution.
∗ Department of Psychology; † Department of Computer Science; § Department of Linguistics; ‡ {grantsrb, sjeromeh, alexart, cgpotts}@stanford.edu
1 Link to public repository: https://github.com/grantsrb/rep_divergence
arXiv:2511.04638v4 [cs.LG] 30 Nov 2025

Figure 1: Causal interventions can recruit hidden circuits that produce misleadingly confirmatory or dormant behavior. (a) Consider natural pathways (dashed arrows) for two classes A and B that carry activity to different behavioral outputs y. In a hypothetical intervention meant to find path A, patching h_1 with a divergent representation can activate distinct, hidden pathways (solid arrows) that result in misleadingly confirmatory behavior (orange) and/or undetected behavior (red). (b) Consider 2D projections of the neural activity of h_1 for a different network that classifies states into one of 10 classes (denoted by hue). Suppose that natural representations (dark points) lie within well-defined decision boundaries (dashed lines) and covary along causal axes, and that intervened representations (light points) are constructed by patching the first axis from a sampled natural representation. Although these representations diverge from the natural distribution, this can be harmless (top) or pernicious (bottom) depending on the network's functional landscape. In particular, it can be pernicious if the network has a functional landscape where intervened activity unknowingly recruits hidden circuits (visualized as an orange region) or crosses dormant behavioral boundaries (red regions).

We then provide theoretical examples of two types of divergence: ‘harmless’ divergences that can occur from within-decision-boundary covariance along causal dimensions or from deviations in the null-spaces of the NN layers, and ‘pernicious’ divergences that activate hidden network pathways and can cause dormant changes to behavior.
We provide discussion of how harmless and pernicious cases are not always mutually exclusive, where the harm depends on the specific mechanistic claims. Finally, we provide a broad-stroke, initial solution for mitigating pernicious divergences by minimizing all intervened divergences. We show that we can use the Counterfactual Latent (CL) auxiliary loss introduced in Grant (2025) to reduce all representational divergence in the Boundless DAS setting from Wu et al. (2023) while maintaining the same behavioral accuracy; and we introduce a modified version of the CL loss that targets causal subspaces and show that it can improve out-of-distribution (OOD) intervention performance on synthetic tasks. Although we do not propose this method as the final solution to representational divergence, we pose it as a step towards more reliable interventions. We summarize our contributions as follows:

1. We show theoretical and empirical examples of divergence between natural and causally intervened representations for a variety of causal methods (Section 3).
2. We provide a theoretical treatment of cases in which divergence can arise innocuously from variation in null-spaces, demonstrating that some divergences can be okay and even desired (Section 4.1).
3. We provide synthetic examples of cases where divergent representations can (1) activate hidden computational pathways while still resulting in hypothesis-affirming behavior, and (2) cause dormant behavioral changes, together raising questions about the mechanistic claims that can be made from patching results alone (Section 4.2).
4. Lastly, we use the CL auxiliary loss from Grant (2025) to minimize patching divergence directly in the 7B Large Language Model (LLM) Boundless DAS experimental setting from Wu et al. (2023), and we introduce a stand-alone, modified CL loss that exclusively minimizes divergence along causal dimensions, improving OOD intervention performance in synthetic settings.
Together, these results provide an initial step towards mitigating pernicious divergences (Section 5).

2 BACKGROUND AND RELATED WORK

2.1 ACTIVATION PATCHING

Activation patching generally refers to a process of "patching" (i.e. substituting) some portion of neural activity at an intermediate layer into or from a corrupted forward pass of a network (Geiger et al., 2020; Vig et al., 2020; Wang et al., 2022; Meng et al., 2023; Zhang & Nanda, 2024). It can be performed at various granularities such as whole layers, attention heads, or individual neurons. Many forms of activation patching can be unified under the assumption that subspaces, rather than individual neurons, are the atomic units of NN representations (Rumelhart et al., 1986; McClelland et al., 1986; Smolensky, 1988; Elhage et al., 2022; Geiger et al., 2021; Grant et al., 2024). Activation patching at the level of individual neurons can be understood as subspace patching along neuronal axes, and many of its higher-level granularities can be understood as specific forms of individual neuron patching (Geiger et al., 2020; Vig et al., 2020; Wang et al., 2022; Meng et al., 2023).

2.2 DISTRIBUTED ALIGNMENT SEARCH

Distributed Alignment Search (DAS) (Geiger et al., 2021; 2023; Wu et al., 2023) can be understood as a form of activation patching that operates in a transformed basis so that specific, causally relevant subspaces can be manipulated analogously to high-level variables from causal abstractions (CAs) (e.g. symbolic programs). Many cases of individual neuron (coordinate) patching can be understood as specific cases of DAS that use the identity transform. We use DAS in Sections 4 and 5, so here we introduce its theory and background.
DAS finds alignments between neural representational subspaces and causal variables from Causal Abstractions (CAs) by testing the hypothesis that an NN's latent state vector h ∈ R^{d_m} can be transformed into a vector z ∈ R^{d_m} that consists of orthogonal subspaces encoding interpretable variables. This transformation is performed by a learnable, invertible Alignment Function (AF), z = A(h). We restrict our considerations to linear AFs of the form A(h) = Wh, where W ∈ R^{d_m × d_m} is invertible. This transformation allows us to formulate h in terms of interpretable variables and to manipulate encoded values. For a given CA with variables var_i ∈ {var_1, var_2, ..., var_n}, DAS tests the hypothesis that z is composed of subspaces z⃗_{var_i} ∈ R^{d_{var_i}} corresponding to each of the variables from the CA. A causally irrelevant subspace z⃗_extra ∈ R^{d_extra} is also included to encode extraneous, functionally irrelevant neural activity (i.e., the behavioral null-space).

A(h) = z = [z⃗_{var_1}; ...; z⃗_{var_n}; z⃗_extra]    (1)

Here, each z⃗_{var_i} ∈ R^{d_{var_i}} is a column vector of potentially different length, where d_{var_i} is the subspace size of var_i, and all subspace sizes satisfy d_extra + Σ_{i=1}^{n} d_{var_i} = d_m. The value of a single causal variable encoded in h can be manipulated through an interchange intervention defined as follows:

ĥ = A^{−1}((I − D_{var_i}) A(h_trg) + D_{var_i} A(h_src))    (2)

Here, D_{var_i} ∈ R^{d_m × d_m} is a manually defined, block diagonal, binary matrix that defines the subspace size d_{var_i}, and I ∈ R^{d_m × d_m} is the identity matrix. Each D_{var_i} has a set of d_{var_i} contiguous ones along its diagonal to isolate the dimensions that make up z⃗_{var_i}. h_src is the source vector from which the subspace activity is harvested, h_trg is the target vector into which the harvested activity is patched, and ĥ is the resulting intervened vector that replaces h_trg in the model's processing. This allows the model to make predictions using a new value of variable var_i.
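A minimal numpy sketch of the interchange intervention in Equation 2, assuming a linear, orthogonal alignment function (so that A^{−1} = W^⊤); the dimensions and variable names are illustrative, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d_m = 8        # latent dimensionality (illustrative)
d_var = 3      # size of the hypothesized causal subspace

# Invertible linear alignment A(h) = W h; an orthogonal matrix from a
# QR decomposition keeps the inverse cheap (A^{-1} = W^T).
W, _ = np.linalg.qr(rng.normal(size=(d_m, d_m)))

# Binary diagonal matrix D selecting the first d_var transformed dimensions.
D = np.diag(np.r_[np.ones(d_var), np.zeros(d_m - d_var)])

def interchange(h_trg, h_src):
    """Eq. (2): patch the var_i subspace of A(h_trg) with that of A(h_src)."""
    z = (np.eye(d_m) - D) @ (W @ h_trg) + D @ (W @ h_src)
    return W.T @ z   # apply A^{-1}

h_src = rng.normal(size=d_m)
h_trg = rng.normal(size=d_m)
h_hat = interchange(h_trg, h_src)

# Sanity checks: the patched subspace now matches the source, while the
# complementary subspace still matches the target.
assert np.allclose(D @ (W @ h_hat), D @ (W @ h_src))
assert np.allclose((np.eye(d_m) - D) @ (W @ h_hat), (np.eye(d_m) - D) @ (W @ h_trg))
```

In the full method, W (or its rotation parameters) is the only trainable component; the checks above hold for any invertible W by construction.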
To train A, DAS uses counterfactual behavior c ∼ D as training labels, where c is generated from the CA. c, for a given state of a CA and its context, is the behavior that would have occurred had a causal variable taken a different value and everything else remained the same. c is generated by freezing the state of the environment, changing one or more variable values in the CA, and using the CA to generate new behavior in the same environment using the new values. We train A on intervention samples while keeping the model parameters frozen, minimizing the following objective (for non-sequence-based settings):

L_DAS(A) = −(1/N) Σ_{k=1}^{N} log p_A(c^{(k)} | x^{(k)}, ĥ^{(k)})    (3)

where N is the number of samples in the dataset, c^{(k)} is the counterfactual label in sample k, x^{(k)} is the model input data, and p_A(· | ·) is the model's conditional probability distribution given the intervened latent vector, ĥ. We minimize L_DAS(A) using gradient descent, backpropagating into A with all model weights frozen. A is evaluated on new intervention data, where the model's accuracy on c following each intervention is referred to as the Interchange Intervention Accuracy (IIA).

2.3 PROBLEMATIC CAUSAL INTERVENTIONS

Prior work has implicitly explored issues related to representational divergence from causal interventions. For example, methods such as causal scrubbing or noising/denoising activation patching (Wang et al., 2022; LawrenceC et al., 2022; Meng et al., 2023; Chen et al., 2025; Zhang & Nanda, 2024) intentionally introduce divergent representations to test the sufficiency, completeness, and faithfulness of proposed circuits. Works such as Wattenberg & Viégas (2024), Méloux et al. (2024), and Chen et al.
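An illustrative numpy rendering of the objective in Equation 3; in practice the loss is computed with an autodiff framework so gradients flow only into the alignment function A (all model weights frozen), and the toy logits and labels below are invented for the sketch:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def das_loss(logits, counterfactual_labels):
    """Eq. (3): mean negative log-probability of the counterfactual labels,
    where `logits` come from a forward pass run with the intervened latent
    h_hat substituted for h_trg."""
    logp = log_softmax(logits)
    n = len(counterfactual_labels)
    return -logp[np.arange(n), counterfactual_labels].mean()

# Toy batch: 4 samples, 3-way output, all counterfactual labels = class 0.
logits = np.array([[2.0, 0.1, -1.0]] * 4)
labels = np.array([0, 0, 0, 0])
loss = das_loss(logits, labels)
assert loss > 0.0   # cross-entropy is nonnegative
```

The loss shrinks as the model assigns more probability to the counterfactual behavior c, which is exactly what the Interchange Intervention Accuracy (IIA) later measures at evaluation time.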
(2025) also implicitly explore the dangers of divergent intervened representations by showing how circuits and features can be redundant or have combinatorial effects that are difficult to enumerate given current methodologies, while Zhang & Nanda (2024) and Heimersheim & Nanda (2024) point out easy misinterpretations of patching results. Shi et al. (2024) and Wang et al. (2022) provide criteria centered on faithfulness, completeness, and minimality for evaluating circuits through causal interventions. A body of work on counterfactual explanations exists, some of which has explored differences between on-manifold and off-manifold adversarial attacks (Stutz et al., 2019), and some works have explored constraining counterfactual features to the manifold of the dataset (Verma et al., 2024; Tsiourvas et al., 2024). Our proposed method in Section 5 differs in that it trains a principled alignment to generate counterfactual representations and it constrains deviations along causal dimensions. For DAS in particular, Makelov et al. (2023) demonstrate that it is possible to produce an interaction between the null-space and dormant subspaces that affects behavior. Because they define dormant subspaces as those that do not vary across different model inputs, variation along these directions is, by definition, a form of divergent representation. Finally, Sutter et al. (2025) posit that it is possible to align any causal abstraction to NNs under a number of assumptions, including a sufficiently powered, non-linear alignment function, raising questions about what non-linear causal interventions really tell us.

3 ARE DIVERGENT REPRESENTATIONS A COMMON PHENOMENON?

We begin by demonstrating that divergent representations are a common (if not likely) outcome of causal interventions, both in theory and in practice. We do not yet consider their perniciousness, however, and we reserve the question of whether and when divergence is harmful for Section 4.
3.1 FOR MOST MANIFOLDS, COORDINATE PATCHING GUARANTEES DIVERGENCE

We first consider a theoretical setting where coordinate-based patching of one or more vector dimensions is performed on a single manifold, similar to what might be done in neuron-level activation patching (Vig et al., 2020; Geiger et al., 2021). We prove that in this setting, divergent representations are guaranteed to occur if patching is performed exhaustively. For simplicity, we consider a minimal version of this proof that involves a circular manifold with two dimensions. A more general proof, which applies to most manifold geometries, can be found in Appendix A.2.

Formally, let M_K = {c_K + u : ‖u‖_2 ≤ r_K} ⊂ R^2 be a class-K manifold with centroid c_K ∈ R^2 and radius r_K > 0. Given two native representations h_src = c_K + u and h_trg = c_K + v, let us define a coordinate patch (onto class K) that keeps the first coordinate of h_src and the second of h_trg:

ĥ = (h_src,1, h_trg,2)^⊤ = (c_K,1 + u_1, c_K,2 + v_2)^⊤.

Figure 2: Representational divergence is a common occurrence across various interventions. (a) Directly replacing a coordinate value in one natural representation (orange) with the value from another will eventually create divergent representations (blue). (b) Top two principal components of natural and corresponding intervened representations, taken from the residual stream at the intervention position and with PCA performed over the combined set of natural and intervened vectors, for three popular causal intervention techniques: a replication of Feng & Steinhardt (2024) for mean difference patching, reconstructed vectors for a single transformer layer using SAELens (Bloom et al., 2024) for sparse autoencoder, and interchange interventions for Boundless DAS (Wu et al., 2023).
(c) L2 distance between natural and corresponding intervened representations, and Earth Mover's Distance (EMD) between natural and intervened distributions (with a baseline comparing the natural distribution to itself).

Then the deviation from c_K is

ĥ − c_K = (u_1, v_2)^⊤,  ‖ĥ − c_K‖_2^2 = u_1^2 + v_2^2.

Proposition (coordinate patching exceeds the class radius). If h_src, h_trg ∈ M_K (i.e., ‖u‖_2 ≤ r_K and ‖v‖_2 ≤ r_K), then the patched point ĥ is off-manifold whenever u_1^2 + v_2^2 > r_K^2. In particular, there exist boundary points h_src, h_trg ∈ ∂M_K with u = (r_K, 0) and v = (0, r_K) such that ‖ĥ − c_K‖_2 = √(r_K^2 + r_K^2) = r_K√2 > r_K.

Proof. Since ĥ − c_K = (u_1, v_2)^⊤, we have ‖ĥ − c_K‖_2^2 = u_1^2 + v_2^2. Choosing u = (r_K, 0) and v = (0, r_K) gives the stated violation.

As noted above, this intuition holds for all manifold shapes other than axis-aligned hyper-rectangles (see Appendix A.2). Thus, in these relatively simple theoretical intervention settings, divergent representations are guaranteed to occur with enough intervention samples.

3.2 MANY EXISTING CAUSAL METHODS EMPIRICALLY PRODUCE DIVERGENCE

Stepping beyond the theoretical setting, we also empirically demonstrate that common, real-world causal interpretability methods often produce divergent representations. The notion that these methods can create divergence has been indirectly explored in previous work through attention patterns in Gaussian-noise-corrupted activation patching (Zhang & Nanda, 2024). Here we extend this view to three popular causal intervention methods: mean difference vector patching (e.g. Feng & Steinhardt (2024)), Sparse Autoencoders (e.g. Bloom et al. (2024)), and DAS (e.g. Wu et al. (2023)). Figure 2 shows the top two principal components of the natural and intervened representations for each intervention method, distinguished by color. These results demonstrate that divergence is a common phenomenon in practice and is not specific to any one method.
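The worst case in the proposition can be checked numerically; a small sketch with an illustrative unit disk centered at the origin:

```python
import numpy as np

r_K = 1.0                  # class radius (illustrative)
c_K = np.zeros(2)          # class centroid at the origin

# Two natural representations on the boundary of the disk M_K.
h_src = c_K + np.array([r_K, 0.0])   # u = (r_K, 0)
h_trg = c_K + np.array([0.0, r_K])   # v = (0, r_K)

# Coordinate patch: keep the first coordinate of h_src
# and the second coordinate of h_trg.
h_hat = np.array([h_src[0], h_trg[1]])

# Both originals lie on the manifold, but the patched point does not.
assert np.linalg.norm(h_src - c_K) <= r_K
assert np.linalg.norm(h_trg - c_K) <= r_K
assert np.isclose(np.linalg.norm(h_hat - c_K), r_K * np.sqrt(2))
assert np.linalg.norm(h_hat - c_K) > r_K   # off-manifold by a factor of sqrt(2)
```

The same construction fails only for axis-aligned hyper-rectangles, where coordinate-wise mixing cannot escape the set, matching the caveat in the text.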
Even simple methods that patch along a single mean direction are subject to divergence, despite high behavioral accuracy (Feng & Steinhardt, 2024; Geiger et al., 2023; Wu et al., 2023). We quantify this divergence in Figure 2(c) using the Earth Mover's Distance (EMD) (Villani, 2009) between the full-dimensional natural and intervened distributions, using the corresponding natural-natural comparison as a baseline. We see that the intervened divergence exceeds that of the natural baseline (more metrics and details are in Appendix A.1). Note that this result does not necessarily imply that the respective methods are invalid or that their claims are incorrect; the panels are only meant to show the presence of divergence.

4 WHEN ARE DIVERGENT REPRESENTATIONS HARMLESS OR PERNICIOUS?

Having demonstrated that divergent representations are a common phenomenon, we now consider whether divergence is a concerning phenomenon. We propose that divergence is harmless to many functional claims if it exists in the behavioral null-space, and that it can be pernicious if it recruits hidden pathways or causes dormant behavioral changes. However, we stress that the harm is inherently claim-dependent, meaning that these forms of divergence are not mutually exclusive.

4.1 HARMLESS CASES

This section explores a set of cases for which we might consider divergent representations to be harmless to functional claims. We first focus on cases in which divergence is bottlenecked into the null-spaces of the next interacting weight matrices. Formally, we define the null-space of a weight matrix W ∈ R^{d′×d} as N(W) = {v ∈ R^d | Wv = 0}. Neural activity in the null-space of the weights refers to any change δ ∈ R^d for which W(h + δ) = Wh. We propose that divergence v ∈ N(W) is harmless to the computation of W because it is equivalent to adding the zero vector: W(h + v) = Wh + Wv = Wh + 0 = W(h + 0).
Notably, however, this harmlessness does not apply to the sub-computations of the matrix multiplication, because v ∈ N(W) does not imply that W_{i,j}(h_j + v_j) = W_{i,j}(h_j + 0) for vector row j and matrix row i. Thus, v in this case is potentially harmful to mechanistic claims about individual activation-weight sub-computations, while being harmless to the overall matrix multiplication.

We can generalize this notion beyond matrices to an arbitrary function ψ. Let ψ : R^d → R^{d′} and let X ⊆ R^d. We define the behavioral null-space with respect to X as

N(ψ, X) = {v ∈ R^d | ∀x ∈ X, ψ(x + v) = ψ(x)}.    (4)

A common case of ψ in practice for a layer ℓ of an NN f consists of all subsequent computations after and including layer ℓ, denoted f_{≥ℓ}(h). We propose that behaviorally null divergence v is harmless to the overall computation of f_{≥ℓ} because it is equivalent to adding 0 to the input. However, v can be harmful to claims about a sublayer ℓ + k within f_{≥ℓ} because f_{ℓ+k}(f_{≥ℓ,<ℓ+k}(h + v)) is not guaranteed to be equal to f_{ℓ+k}(f_{≥ℓ,<ℓ+k}(h + 0)) (Sec. 4.2.1). See Appx. A.3 and Algorithm 1 to practically classify harmlessness when N(ψ, X) characterizes the full space of harmless divergence.

Idealized Case Study. We now present an example of harmless divergence in the behavioral null-space by introducing a behaviorally binary subspace—a vector subspace which causally impacts the outputs of a future processing layer (e.g., classification labels) only through its sign. Formally, let f : R^d → R^{d′} denote a computational unit (possibly consisting of multiple NN layers and functions). Let sign(·) denote the elementwise sign map sign : R^{d_var} → {−1, 1}^{d_var}, and assume a fixed alignment function A and subspace selection matrix D_var ∈ R^{d×d} (Sec. 2.2).
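A two-dimensional sketch of the distinction above, with an invented weight matrix: a null-space perturbation leaves the matrix product untouched while still changing every individual activation-weight product.

```python
import numpy as np

# W weights its two inputs by (1, -1), so v = (1, 1) satisfies Wv = 0
# and lies in the null-space N(W).
W = np.array([[1.0, -1.0]])
h = np.array([2.0, 0.5])
v = np.array([1.0, 1.0])

# Harmless to the overall computation: W(h + v) = Wh + Wv = Wh + 0.
assert np.allclose(W @ (h + v), W @ h)

# But the individual sub-computations W_{i,j} (h_j + v_j) all change,
# so claims about them are not protected by the null-space argument.
assert not np.allclose(W * (h + v), W * h)
```

This is exactly why the text restricts the harmlessness claim to the overall matrix multiplication rather than its internal terms.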
A linear subspace Z ⊆ R^d is behaviorally binary (with respect to f, D_var, and A) iff for all D_var A(h), D_var A(h′) ∈ Z,

sign(D_var A(h)) = sign(D_var A(h′)) ⟹ f(h) = f(h′).    (5)

Now, suppose we have an NN with two causal subspaces, z̃_{var_a} ⊆ R^{d_{var_a}} and z̃_{var_b} ⊆ R^{d_{var_b}}, with values z⃗^{(x_i)}_{var_a} and z⃗^{(x_i)}_{var_b} for a model input x_i, where the tilde notation distinguishes variables from their values. Further assume that z̃_{var_b} is behaviorally binary and co-varies with z̃_{var_a}. Using h^{(x_i)} and z^{(x_i)} from Equation 1 under a given input x_i, we use the following definition:

A(h^{(x_i)}) = z^{(x_i)} = [ z̃_{var_a} = z⃗^{(x_i)}_{var_a} ; z̃_{var_b} = z⃗^{(x_i)}_{var_b} ]    (6)

Due to the assumption of covariance in z̃_{var_a} and z̃_{var_b}, it is reasonable to assume that the values z⃗^{(x_low)}_{var_b} and z⃗^{(x_high)}_{var_b} are systematically distinct for distinct values of z̃_{var_a} under some classes of inputs, x_low and x_high, while sign(z⃗^{(x_low)}_{var_b}) = sign(z⃗^{(x_high)}_{var_b}). Now, an interchange intervention on z̃_{var_b} using a source vector from x_low and a target vector from x_high will produce:

ẑ = [ z̃_{var_a} = z⃗^{(x_high)}_{var_a} ; z̃_{var_b} = z⃗^{(x_low)}_{var_b} ]    (7)

Because we assumed that the value of z⃗^{(x_low)}_{var_b} is systematically unique due to covariance in z̃_{var_a} and z̃_{var_b}, the values z⃗^{(x_high)}_{var_a} and z⃗^{(x_low)}_{var_b} in Equation 7 will have never existed together in the native distribution, but the behavior will remain unchanged because z̃_{var_b} is behaviorally binary and its sign has not changed. This divergence is thus harmless to the claim of discovering f's causal axes.

Summary. We propose that divergences within behavioral null-spaces are harmless to many functional claims about a function ψ's computations when the claim encapsulates (i.e. ignores) the internal sub-computations of ψ.
However, we are not suggesting that the behavioral null-space encompasses the set of all harmless divergences. In general, such an exhaustive set cannot exist without assuming the superiority of some scientific claims/assumptions over others. For example, take the set of all harmless divergences for a specific claim, then modify the claim to deem permissible a divergence previously excluded from the harmless set. The modification to the claim also modifies the set of harmless divergences. Lastly, we note that behaviorally null divergence is not always harmless. Indeed, one could even desire to intervene on the behavioral null-space to causally test that it is null. Thus, we stress that the mechanistic claim an intervention is meant to support is important for determining the harmlessness of the divergence.

4.2 PERNICIOUS DIVERGENCE VIA OFF-MANIFOLD ACTIVATION

We now explore pernicious cases of representational divergence involving the concept of hidden pathways, which refer to any unit, vector direction, or subcircuit that is inactive on the natural support of representations for a given context but becomes active and influences behavior under an intervention. Although hidden pathways can be compatible with harmless divergences (Sec. 4.1), they can also undermine claims about natural mechanisms and can prime dormant behavioral changes, discussed in Sec. 4.2.2.

Formally, let D denote the data distribution over latent representations h^ℓ ∈ R^d at layer ℓ, and let S = supp(h^ℓ ∼ D) denote its support, with S_K = supp(h^ℓ_K ∼ D) the support for class K. Denoting the intended class K following an intervention with the subscript →K, an intervened representation ĥ^ℓ_{→K} is said to be divergent if ĥ^ℓ_{→K} ∉ S_K (e.g. it exists off the natural manifold of class K). We define the convex hull of class-K representations as conv(S_K) = {Σ_i α_i h^ℓ_{i,K} : α_i ≥ 0, Σ_i α_i = 1}, where the subscript K denotes that h^ℓ_{i,K} was taken from class-K inputs.
Projecting an intervention onto conv(S_K) ensures that it remains within the convex interpolation region of class K.

4.2.1 MEAN-DIFFERENCE PATCHING CAN ACTIVATE HIDDEN PATHWAYS

Patching with a mean-difference vector can flip a decision by activating a unit that is silent for all natural class inputs.

Setup. Consider a two-layer circuit with a ReLU nonlinearity. Let h^ℓ ∈ R^4 feed into

s = 1^⊤ h^{ℓ+1} = 1^⊤ ReLU(W^ℓ h^ℓ + b^ℓ),  W^ℓ ∈ R^{3×4},  b^ℓ ∈ R^3,

where

W^ℓ = [[0.75, 0.25, 0, 0.5], [0, 1, 0, 0], [1, 1, −1, −1]],  b^ℓ = [−0.5, −0.5, 0]^⊤.

A positive score (s > 0) indicates class A. Suppose class-A and class-B representations at layer ℓ are

h^ℓ_A = [1, 0, 1, 0]^⊤ (case 1) or [0, 1, 1, 0]^⊤ (case 2),  h^ℓ_B = [0, 0, 1, 0]^⊤ (case 1) or [0, 0, 1, 1]^⊤ (case 2).

Evaluating h^ℓ_A yields

h^{ℓ+1}_{A, case 1} = [0.25, 0, 0]^⊤,  h^{ℓ+1}_{A, case 2} = [0, 0.5, 0]^⊤,

so for class A, s_A ∈ {0.25, 0.5}. For class B, all outputs are zero, s_B = 0.

Mean-difference patching. We construct a mean difference vector between the classes:

δ_{B→A} = μ_A − μ_B = (1/2) Σ_{i=1}^{2} h^ℓ_{A, case i} − (1/2) Σ_{i=1}^{2} h^ℓ_{B, case i} = [0.5, 0.5, 0, −0.5]^⊤.

Applying this to class-B representations gives

ĥ^ℓ_{B→A} = h^ℓ_B + δ_{B→A} = [0.5, 0.5, 1, −0.5]^⊤ (case 1) or [0.5, 0.5, 1, 0.5]^⊤ (case 2).

After propagation through the circuit:

ĥ^{ℓ+1}_{B→A, case 1} = [0, 0, 0.5]^⊤ with ŝ_{case 1} = 0.5,  ĥ^{ℓ+1}_{B→A, case 2} = [0.25, 0, 0]^⊤ with ŝ_{case 2} = 0.25.

The intervention flips the decision to class A (ŝ > 0). However, the third hidden unit becomes active only for ĥ^ℓ_{B→A}, never for natural h^ℓ_A. This new activation is a hidden pathway that was silent under all native samples. Thus the mean-difference patch crosses the decision boundary only by activating an off-manifold circuit. If we project ĥ^ℓ_{B→A} onto conv(S_A) (or equivalently onto a local PCA subspace of S_A), this ReLU state change disappears, and the decision boundary is no longer crossed, confirming that the original effect was driven by divergence rather than a within-manifold causal mechanism.
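The worked example above can be verified end to end; a direct numerical check of the circuit, using the matrices and representations given in the setup:

```python
import numpy as np

W = np.array([[0.75, 0.25, 0.0,  0.5],
              [0.0,  1.0,  0.0,  0.0],
              [1.0,  1.0, -1.0, -1.0]])
b = np.array([-0.5, -0.5, 0.0])

def hidden(h):
    """h^{l+1} = ReLU(W h + b)."""
    return np.maximum(W @ h + b, 0.0)

def score(h):
    """s = 1^T ReLU(W h + b); s > 0 indicates class A."""
    return hidden(h).sum()

h_A = [np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0, 0.0])]
h_B = [np.array([0.0, 0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0, 1.0])]

# Natural behavior: class A scores positive, class B scores zero.
assert all(score(h) > 0 for h in h_A)
assert all(score(h) == 0 for h in h_B)

# Mean-difference patch, as in the text.
delta = np.mean(h_A, axis=0) - np.mean(h_B, axis=0)
assert np.allclose(delta, [0.5, 0.5, 0.0, -0.5])

# The patch flips both class-B cases to class A...
assert all(score(h + delta) > 0 for h in h_B)

# ...but the third hidden unit is silent for every natural input
# and fires only for the case-1 patched representation: a hidden pathway.
assert all(hidden(h)[2] == 0 for h in h_A + h_B)
assert hidden(h_B[0] + delta)[2] > 0
```

Running the checks reproduces the scores in the text (ŝ = 0.5 and 0.25 for the two patched cases) and confirms that the boundary crossing is carried entirely by the off-manifold unit.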
It is unclear what this patching experiment reveals about the natural mechanisms of the model.

4.2.2 DORMANT BEHAVIORAL CHANGES

Divergent representations can also yield dormant behavioral changes: perturbations that appear behaviorally null in one subset of contexts but alter behavior in others. Formally, let ψ : R^d × C → R^{d′} and let C_1 ⊂ C be a subset of contexts. The space of dormant behavioral changes relative to X, C_1, and C is V(ψ, X, C_1, C) = N(ψ, X, C_1) \ N(ψ, X, C).

Illustration. Extend the network from Sec. 4.2.1 by adding one row of zeros to W^ℓ and b^ℓ, producing h^{ℓ+1} ∈ R^4 where the final coordinate is always zero. Add a context vector v ∈ R^4 and a final affine layer:

ŷ = W^{ℓ+1}(h^{ℓ+1} + v) + b^{ℓ+1},  W^{ℓ+1} = [[1, 1, 0.5, 0], [0, 0, 0, 0], [0, 0, 1, 1]],  b^{ℓ+1} = [0, 0.25, −1]^⊤.

Here, the first argmax index of ŷ = [ŷ_1, ŷ_2, ŷ_3]^⊤ corresponds to class predictions A, B, and C. Assume v = [0, 0, 0, v_4]^⊤. For ĥ^{ℓ+1}_{B case 1 →A} = [0, 0, 0.5, 0]^⊤:

ŷ_1 = 0.25,  ŷ_2 = 0.25,  ŷ_3 = (0.5 + v_4) − 1 = v_4 − 0.5.

The model predicts class A when v_4 < 0.75 but switches to class C when v_4 > 0.75. Notably, v_4 would not naturally cause a class-C prediction below a value of 1 due to the bias threshold. Thus the same intervention that was benign in one context (v_4 < 0.75) produces a behavioral flip in another (1 > v_4 > 0.75) purely due to the latent divergence priming a new pathway.

Implications. Dormant behavioral changes highlight that behaviorally "safe" interventions can still introduce hidden context dependencies. Detecting them would require evaluating across all possible contexts, which is infeasible in practice. Therefore, causal-intervention-based experiments should ideally (1) report any introduced representational divergence outside of the null-space and (2) test causal interventions for context-sensitivity.

Summary. Hidden pathways arise when off-manifold representations activate computations that never occur for natural representations.
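The illustration can likewise be checked numerically; a short sketch of the extended circuit evaluated at two context values that straddle the v_4 = 0.75 threshold:

```python
import numpy as np

# Layer-(l+1) output for the patched case-1 representation from Sec. 4.2.1,
# with a fourth, always-zero coordinate appended.
h_next = np.array([0.0, 0.0, 0.5, 0.0])

W2 = np.array([[1.0, 1.0, 0.5, 0.0],
               [0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]])
b2 = np.array([0.0, 0.25, -1.0])

def predict(v4):
    """Class prediction (0 = A, 1 = B, 2 = C) under context v = (0, 0, 0, v4)."""
    v = np.array([0.0, 0.0, 0.0, v4])
    y = W2 @ (h_next + v) + b2
    return int(np.argmax(y))   # ties resolve to the first (class-A) index

assert predict(0.5) == 0   # benign context: still predicts class A
assert predict(0.9) == 2   # same intervention, new context: flips to class C
```

The identical latent intervention thus sits in the behavioral null-space for one context and outside it for another, which is the defining property of V(ψ, X, C_1, C).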
Such pathways can potentially alter causal conclusions even when behavior appears unchanged. Manifold-preserving projections and ReLU-pattern audits are potentially practical safeguards against these pernicious forms of divergence.

5 HOW MIGHT WE AVOID DIVERGENT REPRESENTATIONS?

We have thus far shown that divergent representations are empirically common and that they can be harmless or harmful depending on multiple factors. We now consider how such divergences might be avoided. Some existing methods solve this by projecting counterfactual features (in our case, intervened representations) directly onto the natural manifold (Verma et al., 2024). However, we seek a method that generates principled interventions that are constrained to be innocuous. In this pursuit, we first apply the Counterfactual Latent (CL) loss from Grant (2025) to the DAS experiments from Wu et al. (2023) and find that we can reduce intervened divergence while preserving accuracy on a Llama-based LLM. We then introduce a modified CL loss that targets only causal dimensions and show that it can improve OOD intervention accuracy.

5.1 APPLYING THE COUNTERFACTUAL LATENT LOSS TO BOUNDLESS DAS

To encourage intervened N representations to be more similar to the native distribution, we first apply the CL auxiliary loss from Grant (2025) to the Boundless DAS setting in Wu et al. (2023). This auxiliary training objective relies on counterfactual latent (CL) vectors as vector objectives. CL vectors are defined as vectors that encode the same causal variable value(s) that would exist in the intervened vector, $\hat h$, assuming the interchange intervention was successful. We can obtain CL vectors from sets of natural $h$ vectors drawn from situations and behaviors consistent with the values of the CA to which we are aligning. See Figure 3 for a visualization.
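CL-vector construction reduces to averaging the natural representations whose CA labels match the variable values expected after a successful intervention. A hypothetical sketch (the function name and label format are illustrative, not the paper's API):

```python
import numpy as np

def counterfactual_latent(natural_h, labels, target_values):
    """Average the natural representations whose causal-variable labels
    (as assigned by the causal abstraction) match the values the intervened
    vector should encode after a successful interchange intervention.

    natural_h: (N, d) array of natural representations.
    labels: list of dicts mapping variable name -> value, one per row.
    target_values: dict of expected post-intervention variable values.
    """
    mask = np.array([all(lab.get(k) == v for k, v in target_values.items())
                     for lab in labels])
    return natural_h[mask].mean(axis=0)

# Toy usage: four representations labeled with two causal variables.
H = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [2.0, 1.0]])
labs = [{"var_u": 1, "var_w": 0}, {"var_u": 1, "var_w": 0},
        {"var_u": 0, "var_w": 1}, {"var_u": 1, "var_w": 1}]
h_CL = counterfactual_latent(H, labs, {"var_u": 1, "var_w": 0})
print(h_CL)  # mean of the first two rows -> [2.0, 0.0]
```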
As an example, assume that we have a CA with causal variables $var_u$, $var_w$, and $var_{extra}$, and that following a causal intervention we expect $\hat h$ to have value $u$ for variable $var_u$ and $w$ for variable $var_w$. A CL vector $h_{CL}$ for this example can be obtained by averaging over a set of $m$ natural representations, $h_{CL} = \frac{1}{m}\sum_{i=1}^{m} h^{(x_i)}_{CL}$, where each $h^{(x_i)}_{CL}$ has the same variable values, $var_u = u$ and $var_w = w$ (as labeled by the CA).

The CL auxiliary loss $L_{CL}$ introduced in Grant (2025) is composed of the mean of an L2 objective and a cosine distance objective using CL vectors as labels. Using the notation defined in Section 2.2, $L_{CL}$ for a single training sample is defined as follows:
$$L_{CL}(\hat h, h_{CL}) = \frac{1}{2}\|\hat h - h_{CL}\|_2^2 - \frac{1}{2}\,\frac{\hat h \cdot h_{CL}}{\|\hat h\|_2 \|h_{CL}\|_2} \qquad (8)$$
$L_{CL}$ is combined with the DAS behavioral loss $L_{DAS}$ into a single loss term using $\varepsilon$ as a tunable hyperparameter: $L_{total} = \varepsilon L_{CL} + L_{DAS}$. The loss is computed as the mean over batches of samples and optimized using gradient descent (Appendix A.5).

Results: We applied the CL loss to the Boundless DAS notebook from Wu et al. (2024), which reproduces the main result from Wu et al. (2023) (see Appendix A.4). Figure 3A provides qualitative visualizations of the decreased divergence when applying the CL loss, and Figure 3B shows both IIA and EMD as a function of increasing the CL weight $\varepsilon$. For small values of $\varepsilon$, IIA is maintained (and potentially even improved) while EMD decreases, demonstrating that the CL auxiliary loss can directly reduce divergence in practical interpretability settings without sacrificing the interpretability method.

5.2 MODIFIED CL LOSS IMPROVES OOD INTERVENTIONS IN SYNTHETIC MODEL SETTINGS

We next modify the CL loss to work independently of the DAS behavioral loss and show that it improves OOD intervention performance. We simulate an intermediate layer of an N by constructing a synthetic dataset of $h$ vectors with known feature dimensions and labels $y$ that we use to train a Multi-Layer Perceptron (MLP).
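Eq. 8 can be rendered in a few lines. A numpy sketch is shown here; in training it would be computed on differentiable tensors:

```python
import numpy as np

def cl_loss(h_hat, h_cl):
    """Counterfactual Latent auxiliary loss (Eq. 8): half the squared L2
    distance minus half the cosine similarity between the intervened
    representation and the CL target vector."""
    l2 = 0.5 * np.sum((h_hat - h_cl) ** 2)
    cos = 0.5 * np.dot(h_hat, h_cl) / (np.linalg.norm(h_hat) * np.linalg.norm(h_cl))
    return l2 - cos

# Identical vectors: L2 term is 0, cosine term contributes -0.5.
print(cl_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # -0.5
# Orthogonal vectors: L2 term is 1.0, cosine term is 0.
print(cl_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0
```

The loss is minimized when $\hat h$ coincides with $h_{CL}$, pulling intervened representations toward the natural distribution.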
The dataset consists of noisy samples around a set of grid points from two feature dimensions with correlational structure. Specifically, we define a set of features as the Cartesian product of two values along the $x_1$-axis, $\{-1, 1\}$, and five values along the $x_2$-axis, $\{0, 1, 2, 3, 4\}$, resulting in ten unique coordinates that each correspond to one of ten classes. We add noise and covariance to these feature dimensions and concatenate $n$ extra noise dimensions, resulting in simulated vectors $h \in \mathbb{R}^{2+n}$, where $n = 16$ unless otherwise stated. The feature dimensions of these vectors are shown as the natural distributions in Figures 1(b) and 3. We then train a small MLP on these representations to predict the class labels using a standard cross-entropy loss. After training, we perform DAS analyses with either the behavioral loss or the CL loss independently. See Appendix A.5 for further details on dataset construction, MLP training, and DAS training.

We modify the CL loss by applying it to individual causal subspaces only (as discovered through the DAS training). This allows us to construct $h^{var_i}_{CL}$ vectors specific to a single causal variable $var_i$. The modified $L'_{CL}$ for a single training sample is defined as follows:
$$\hat h^{var_i} = A^{-1}(D_{var_i} A(\hat h)), \qquad h^{var_i}_{CL} = \mathrm{stopgrad}(A^{-1}(D_{var_i} A(h_{CL}))) \qquad (9)$$

Figure 3: The CL loss reduces representational divergence and can improve out-of-distribution generalization. (a) PCA of natural (orange) and intervened (blue) representations in the Boundless DAS setting presented in Wu et al. (2023) for two CL loss weightings with the same final IIA. (b) IIA (orange) and divergence (purple) of intervened representations from Section 5.1 as a function of CL loss weight ($\varepsilon$). (c) Diagram of the CL loss; rectangles are model representations, and $x_1$ and $x_2$ are deterministic values of the representations along the two synthetic causal dimensions shown in panels (d) and (e).
We patch the $x_2$ value from source to target using DAS and define the CL representation as the average of all natural representations that possess the same variable values as the post-intervention representation. (d), (e) Two causal feature dimensions of representations from a synthetic dataset consisting of ten classes (colors), with both natural (dark) and intervened (light) representations shown; (d) shows results from DAS trained using behavior only, and (e) shows DAS trained using only the CL loss. (f) Performance of alignment matrices trained on one task and evaluated on another that uses the same causal dimensions. The CL loss leads to higher OOD performance.

$$L'_{CL} = \sum_{i=1}^{n} L^{var_i}_{CL} = \sum_{i=1}^{n} \left[ \frac{1}{2}\|\hat h^{var_i} - h^{var_i}_{CL}\|_2^2 - \frac{1}{2}\,\frac{\hat h^{var_i} \cdot h^{var_i}_{CL}}{\|\hat h^{var_i}\|_2 \|h^{var_i}_{CL}\|_2} \right] \qquad (10)$$

Results: Figures 3D and 3E provide a qualitative comparison of intervened and native representations for interventions using a trained DAS rotation matrix. Each dot in the figures shows the values of the feature dimensions for a single representation. The native states are displayed in darker colors and the intervened states in lighter colors. Each hue indicates the ground-truth class of the state. We can see a tightening of the intervened representations when using the CL loss. Quantitatively, the DAS loss produces EMD values along the feature dimensions of $0.032 \pm 0.003$, whereas the CL loss produces $0.007 \pm 0.001$, with IIAs of $0.997 \pm 0.001$ and $0.9988 \pm 0.0005$, respectively, on training/test sets with held-out classes.

What is the practical utility of reducing divergence? We hypothesized that divergence could influence IIA when transferring the DAS alignment to OOD settings. To test this, we partitioned the synthetic task into a dense and a sparse cluster of classes based on their relative spacing (Appx. A.5.4). We then trained an MLP and an alignment on each partition and evaluated the alignment on the held-out partition.
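The per-variable loss in Eq. 10 can be sketched as follows, where a random orthogonal matrix Q stands in for the learned DAS alignment and the variable-to-dimension assignment is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Q stands in for the learned orthogonal DAS alignment A; for an orthogonal
# matrix, the inverse is the transpose.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
var_dims = {"var_1": [0], "var_2": [1]}  # assumed rotated dims per causal variable

def project_variable(h, dims):
    """A^{-1}(D_var A(h)): rotate into the aligned basis, zero out all but
    the variable's dimensions, and rotate back (Eq. 9)."""
    z = Q @ h
    mask = np.zeros(d)
    mask[dims] = 1.0
    return Q.T @ (z * mask)

def modified_cl_loss(h_hat, h_cl, eps=1e-8):
    """Eq. 10: sum of per-variable CL losses over the discovered causal
    subspaces. (During training, h_cl is a stop-gradient target.)"""
    total = 0.0
    for dims in var_dims.values():
        a = project_variable(h_hat, dims)
        b = project_variable(h_cl, dims)
        total += 0.5 * np.sum((a - b) ** 2)
        total -= 0.5 * float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return total

h = rng.normal(size=d)
print(modified_cl_loss(h, h))  # identical vectors: each variable contributes -0.5
```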
The CL loss performed better than the behavioral loss in these OOD settings (Figure 3F). We then regressed OOD IIA on training EMD and found an anti-correlation (coef. $-0.34$, $R^2 = 0.73$, F(1,28) = 75.28, p < .001), showing that greater divergence predicts lower OOD performance (Appx. A.6).

6 DISCUSSION AND LIMITATIONS

In this work we demonstrated that a variety of common causal interventions can produce representations that diverge from a target model's natural distribution of latent representations. We then showed that although this can have benign effects for many causal claims, it can also activate hidden pathways and trigger dormant behaviors that perniciously affect other claims. As a step towards mitigating this issue, we provided a broad-stroke solution by directly minimizing the divergence of intervened activity along causal dimensions, mitigating both pernicious and harmless forms of divergence.

A remaining gap in our work is the lack of a principled, reliable method for classifying harmful divergence. Additionally, the modified CL loss that we have presented is confined to a narrow set of simplistic settings and is not specific to pernicious divergence. We look forward to exploring ways to classify and mitigate pernicious divergence through self-supervised means in future work.

Where does this leave us with respect to causal interventions in mechanistic interpretability? Given our theoretical findings, any divergence outside of the null-space of N layers is potentially pernicious. This poses challenges for aspirations of a complete mechanistic understanding of NNs using existing methods alone. However, we note that many practical mechanistic projects can be satisfied by collecting sufficiently large intervention evaluation datasets, and continued development of methods such as the CL loss can reduce the problem even further. We are optimistic for the future of this field.
7 ACKNOWLEDGMENTS

Thank you to the PDP Lab and the Stanford Psychology department for funding. Thank you to Noah Goodman and Jay McClelland for thoughtful feedback and discussions. Thank you to the PDP lab and the Stanford Mech Interp community for opportunities to present and for thoughtful discussion.

8 LLM USAGE STATEMENT

We used ChatGPT to edit generally for clarity, as well as to improve notational consistency in the behavioral null-space, hidden pathways, and dormant behavioral changes formalizations in Sections 4.1 and 4.2. We also used ChatGPT to provide an initial layout of the proof offered in Appendix A.2 showing that axis-aligned hyperrectangles are the only manifold shape that does not have divergent source-target vector pairs in coordinate patching settings, and we used it to suggest, implement, and generate the initial writeup for the additional divergence measures in Appendix A.1.2.

REFERENCES

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. SAELens. https://github.com/jbloomAus/SAELens, 2024.

Lukas Braun, Erin Grant, and Andrew M Saxe. Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=YucuAuXMpT.

Hang Chen, Jiaying Zhu, Xinyu Yang, and Wenya Wang. Rethinking circuit completeness in language models: And, or, and adder gates. arXiv preprint arXiv:2505.10039, 2025.

Róbert Csordás, Christopher Potts, Christopher D. Manning, and Atticus Geiger. Recurrent neural networks learn to store and generate sequences using non-linear representations, 2024. URL https://arxiv.org/abs/2408.10920.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Qin Dai, Benjamin Heinzerling, and Kentaro Inui. Representational analysis of binding in language models.
arXiv preprint arXiv:2409.05448, 2024.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html.

Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context?, 2024. URL https://arxiv.org/abs/2310.17191.

Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. arXiv preprint arXiv:2004.14623, 2020.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. CoRR, abs/2106.02997, 2021. URL https://arxiv.org/abs/2106.02997.

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations, 2023.

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, and Thomas Icard. Causal abstraction: A theoretical foundation for mechanistic interpretability, 2024. URL https://arxiv.org/abs/2301.04709.

Atticus Geiger, Jacqueline Harding, and Thomas Icard. How causal abstraction underpins computational explanation. arXiv preprint arXiv:2508.11214, 2025.

Satchel Grant. Model alignment search. arXiv preprint arXiv:2501.06164, 2025.

Satchel Grant, Noah D. Goodman, and James L. McClelland. Emergent symbol-like number variables in artificial neural networks. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2501.06141.

Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255, 2024.
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. RAVEL: Evaluating interpretability methods on disentangling language model representations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8669–8687, Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.470.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. PMLR, 2015.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

Andrew Kyle Lampinen, Stephanie C. Y. Chan, Yuxuan Li, and Katherine Hermann. Representation biases: will we achieve complete understanding by analyzing representations? arXiv preprint arXiv:2507.22216, 2025.

LawrenceC, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck, and Nate Thomas. Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]. LessWrong, December 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.

Aleksandar Makelov, Georg Lange, and Neel Nanda. Is this the subspace you are looking for?
An interpretability illusion for subspace activation patching. arXiv preprint arXiv:2311.17030, 2023.

J. L. McClelland, D. E. Rumelhart, and the PDP Research Group (eds.). Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023. URL https://arxiv.org/abs/2202.05262.

Maxime Méloux, Silviu Maniu, François Portet, and Maxime Peyrard. Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? In Workshop on Mechanistic Interpretability, October 2024. URL https://openreview.net/forum?id=5IWJBStfU7.

Neel Nanda. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2022.

Judea Pearl. An Introduction to Causal Inference. The International Journal of Biostatistics, 6(2):7, February 2010. ISSN 1557-4679. doi: 10.2202/1557-4679.1203. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2836213/.

D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (eds.). Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.

Skipper Seabold and Josef Perktold. statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010.

Claudia Shi, Nicolas Beltran Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, and David Blei. Hypothesis testing the circuit hypothesis in LLMs. Advances in Neural Information Processing Systems, 37:94539–94567, 2024.

Paul Smolensky. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1):1–23, 1988. doi: 10.1017/S0140525X00052432.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(56):1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.

David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6976–6987, 2019.

Denis Sutter, Julian Minder, Thomas Hofmann, and Tiago Pimentel. The non-linear representation dilemma: Is causal abstraction enough for mechanistic interpretability? arXiv preprint arXiv:2507.08802, 2025. URL https://arxiv.org/abs/2507.08802.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.

Asterios Tsiourvas, Wei Sun, and Georgia Perakis. Manifold-aligned counterfactual explanations for neural networks. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 3763–3771. PMLR, 02–04 May 2024. URL https://proceedings.mlr.press/v238/tsiourvas24a.html.

Sahil Verma, Varich Boonsanong, Minh Hoang, Keegan Hines, John Dickerson, and Chirag Shah. Counterfactual explanations and algorithmic recourses for machine learning: A review. ACM Computing Surveys, 56(12), October 2024. ISSN 0360-0300. doi: 10.1145/3677119. URL https://doi.org/10.1145/3677119.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.

Cédric Villani. The Wasserstein distances. In Optimal Transport: Old and New, pp. 94–105. Springer, Berlin, Heidelberg, 2009. doi: 10.1007/978-3-540-71050-9_6.
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small, 2022. URL https://arxiv.org/abs/2211.00593.

Martin Wattenberg and Fernanda B. Viégas. Relational Composition in Neural Networks: A Survey and Call to Action, July 2024. URL http://arxiv.org/abs/2407.14662. arXiv:2407.14662 [cs].

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771, 2019.

Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. Advances in Neural Information Processing Systems, 36:78205–78226, 2023.

Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah Goodman, Christopher Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. In Kai-Wei Chang, Annie Lee, and Nazneen Rajani (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations), pp. 158–165, Mexico City, Mexico, June 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-demo.16.

Fred Zhang and Neel Nanda. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods, January 2024. URL http://arxiv.org/abs/2309.16042. arXiv:2309.16042 [cs] version: 2.
A APPENDIX

A.1 EMPIRICAL INTERVENED DIVERGENCE METHODOLOGICAL DETAILS

A.1.1 INTERVENTION METHODS

We considered three families of interpretability interventions that modify hidden-layer representations. In all three, we visualize the residual stream output from the specified layer:

1. Mean Difference Vector Patching (MDVP) (Feng & Steinhardt, 2024), where an intervention vector $\delta_{MD} \in \mathbb{R}^d$ is defined as the difference in mean activations between two conditions and then added to or subtracted from activations $h \in \mathbb{R}^d$. Formally,
$$\hat h = h + \delta_{MD} \qquad (11)$$
We examine the representations $\hat h$ from a sample of 100 unique contexts across 4 token positions at each individual layer. We compare the representations to the native cases of the swapped binding positions. We used layer 10 of Meta's Meta-Llama-3-8B-Instruct through Hugging Face's transformers package for this task and visualization (Touvron et al., 2023; Wolf et al., 2019). We selected layer 10 as it had the lowest EMD difference of all layers, although we note that this measure did not necessarily correlate with the subjective interpretation of divergence in the qualitative visualizations. We report the EMD difference in Figure 2(a) as the average over all model layers.

2. Sparse Autoencoder (SAE) Projections (Bloom et al., 2024), where $h$ is projected through a trained encoder $E : \mathbb{R}^d \to \mathbb{R}^k$ and linear decoder $D : \mathbb{R}^k \to \mathbb{R}^d$:
$$h' = D(E(h)). \qquad (12)$$
SAEs are trained with a sparsity penalty $\lambda_{SAE}$ to encourage interpretable basis functions. We defer further experimental details to the referenced SAELens paper and code base. We compare the reconstructed representations to an equal sample size of 2000 vectors from the natural distribution. We used layer 25 of Meta's Meta-Llama-3-8B-Instruct through Hugging Face's transformers package for this task and visualization (Touvron et al., 2023; Wolf et al., 2019). We selected layer 25 as it appeared to be the only layer available through SAELens' pretrained SAEs.
3. Distributed Alignment Search (DAS) (Wu et al., 2023), where representations are aligned to a causal abstraction using a learned orthogonal transformation $Q \in \mathbb{R}^{d\times d}$. See Section 2.2 and Wu et al. (2023) for further detail on the method. We compare the intervened representations to an equal sample size of 1000 vectors from the natural distribution. We used the model and layer specified in Wu et al. (2023) for the visualizations in Figure 2.

A.1.2 MEASURING DIVERGENCE

For each intervened sample, there exists a corresponding ground-truth sample that the intervention is meant to approximate. In the case of the mean difference experiments, these ground-truth samples consist of the naturally occurring entity or attribute in the position that $\delta_{MD}$ is meant to approximate. For the SAEs, each reconstructed vector has a corresponding encoded vector. For DAS, the ground-truth vectors are equivalent to CL vectors.

Earth Mover's Distance: To quantify distributional differences between original and intervened representations, we approximated the Earth Mover's Distance (EMD) (Villani, 2009) over all vector dimensions using the Sinkhorn loss from the GeomLoss Python package with $p = 2$ and $\mathrm{blur} = 0.05$ (Cuturi, 2013). Let $H = \{h_i\}_{i=1}^N$ denote a set of original representations and $\hat H = \{\hat h_i\}_{i=1}^N$ their intervened counterparts. We computed
$$\mathrm{EMD}(H, \hat H) = \min_{\gamma \in \Pi(\mu,\nu)} \frac{1}{N} \sum_{i,j} \gamma_{ij} \|h_i - \hat h_j\|_2, \qquad (13)$$
where $N$ is the number of samples in $H$, $\mu$ and $\nu$ are the empirical distributions over $H$ and $\hat H$, and $\Pi(\mu,\nu)$ denotes the set of couplings with marginals $\mu$ and $\nu$.

Baseline divergence. To ensure that we did not introduce bias from the sampling procedure in the baseline comparison, we use the corresponding ground-truth vectors for the intervened vectors when comparing to the baseline divergence.
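The paper uses the GeomLoss Sinkhorn loss; as a self-contained stand-in for Eq. 13, the entropy-regularized transport cost can be approximated with a plain Sinkhorn iteration (uniform marginals assumed; this simplified version omits GeomLoss's debiasing and multi-scale schedule):

```python
import numpy as np

def sinkhorn_emd(H, H_hat, blur=0.05, n_iters=200):
    """Entropy-regularized approximation of the transport cost in Eq. 13:
    sum_ij gamma_ij ||h_i - h_hat_j||_2 under uniform probability marginals.
    A simplified stand-in for the GeomLoss Sinkhorn loss used in the paper."""
    n, m = len(H), len(H_hat)
    C = np.linalg.norm(H[:, None, :] - H_hat[None, :, :], axis=-1)  # cost matrix
    K = np.exp(-C / blur)                                           # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m                           # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):                                        # Sinkhorn scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    gamma = u[:, None] * K * v[None, :]                             # transport plan
    return float(np.sum(gamma * C))

# Identical point clouds have (near-)zero transport cost; a rigid shift of
# one unit per point costs about one.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(sinkhorn_emd(X, X))
print(sinkhorn_emd(X, X + np.array([1.0, 0.0])))
```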
Formally, let $H'$ be the set of ground-truth natural vectors corresponding to the set $\hat H$:
$$\mathrm{Divergence\ Baseline} = \mathrm{EMD}(H, H'). \qquad (14)$$

Local PCA Distance: For each reference point $x_i$, we identified its $k$ nearest neighbors $N_k(x_i)$ in Euclidean space and computed the local tangent subspace via PCA. Let $U_d(x_i) \in \mathbb{R}^{D\times d}$ denote the top $d$ principal components explaining at least 95% of the local variance. The projection matrix onto this tangent space is $P_i = U_d U_d^\top$. For a query point $v$, the Local PCA Distance is the orthogonal residual between $v$ and its projection onto the local tangent space at the nearest reference sample:
$$D_{PCA}(v) = \min_{x_i} \|(I - P_i)(v - x_i)\|_2.$$
Small values indicate that $v$ lies close to the locally linear approximation of the manifold, whereas large residuals reflect departures orthogonal to the manifold surface.

Local Linear Reconstruction Error: We computed an error inspired by Locally Linear Embedding (LLE). Given the same neighborhood $N_k(v)$ of $k$ reference points $X_k = \{x_1, \ldots, x_k\}$, we found reconstruction weights $w = (w_1, \ldots, w_k)$ that best express $v$ as a convex combination of its neighbors:
$$\min_w \left\| v - \sum_{j=1}^{k} w_j x_j \right\|_2^2 \quad \text{s.t.} \quad \sum_j w_j = 1.$$
A small Tikhonov regularizer $\lambda I$ was added to the local covariance for numerical stability. The Local Linear Reconstruction Error is the residual norm at the optimum:
$$D_{LLR}(v) = \|v - X_k w^*\|_2.$$
This metric measures how well $v$ can be expressed as a locally linear interpolation of nearby manifold points; poor reconstruction (large $D_{LLR}$) implies an off-manifold position or a local curvature mismatch.

Kernel Density Estimation (KDE) Density Score: We estimated a nonparametric probability density function over the reference set using Gaussian kernel density estimation:
$$\hat p(x) = \frac{1}{n h^D} \sum_{i=1}^{n} \exp\left( -\frac{\|x - x_i\|_2^2}{2h^2} \right),$$
where $h$ is the kernel bandwidth determined by Silverman's rule of thumb or cross-validation.
The KDE Density Score for a query point $v$ is its log-density under this model, $S_{KDE}(v) = \log \hat p(v)$, which inversely reflects off-manifold distance: lower log-density corresponds to less typical or out-of-distribution points. To express results on a comparable scale, we report the negative log-density (i.e., $-S_{KDE}$), so that larger values consistently indicate greater deviation from the manifold.

A.1.3 VISUALIZATION

We visualized both original and intervened representations by projecting onto the first two principal components of the covariance matrix of $H'$ and $\hat H$ combined:
$$\mathrm{PCs} = \mathrm{eigenvectors}\left( \mathrm{Cov}\left( \begin{bmatrix} H' \\ \hat H \end{bmatrix} \right) \right). \qquad (15)$$
The top two principal components were used to plot representations in two dimensions, with colors distinguishing intervention method and condition (Figure 2).

A.2 CLOSURE UNDER COORDINATE PATCHING CHARACTERIZES CARTESIAN PRODUCTS AND HYPERRECTANGLES

In this section, we show that axis-aligned hyperrectangles are the only convex manifold shape that does not have source-target vector pairs that produce off-manifold intervened representations. For vector dimensions $S \subseteq [d] = \{1, \ldots, d\}$ and $h^{src}, h^{trg} \in \mathbb{R}^d$, define a coordinate patch
$$\mathrm{Patch}_S(h^{src}, h^{trg})_i = \begin{cases} h^{src}_i, & i \in S, \\ h^{trg}_i, & i \notin S, \end{cases}$$
where $i$ refers to the $i$-th vector coordinate. A set $M \subseteq \mathbb{R}^d$ is patch-closed if $\mathrm{Patch}_S(h^{src}, h^{trg}) \in M$ for all $h^{src}, h^{trg} \in M$ and all $S \subseteq [d]$. Let $\pi_i : \mathbb{R}^d \to \mathbb{R}$ be the $i$-th coordinate projection and write $I_i := \pi_i(M) = \{h_i : h \in M\}$.

Theorem A.2 (Patch-closure $\iff$ product of projections). Let $M \subseteq \mathbb{R}^d$ be nonempty and let $I_i := \pi_i(M)$. Then $M$ is patch-closed $\iff M = I_1 \times \cdots \times I_d$.

Proof. ($\Leftarrow$) Immediate: if $M = \prod_i I_i$, then patching replaces coordinates by elements of the corresponding $I_i$, so the result stays in $M$.

($\Rightarrow$) Suppose $M$ is patch-closed. The inclusion $M \subseteq \prod_i I_i$ is tautological. For the reverse inclusion, fix $t = (t_1, \ldots, t_d)$ with $t_i \in I_i$. For each $i$ pick $h^{(i)} \in M$ with $h^{(i)}_i = t_i$.
Define $\hat h^{(1)} := h^{(1)}$ and for $k \geq 2$ set $\hat h^{(k)} := \mathrm{Patch}_{\{k\}}(h^{(k)}, \hat h^{(k-1)})$, so that coordinate $k$ is taken from $h^{(k)}$. Patch-closure gives $\hat h^{(k)} \in M$, and by construction $\hat h^{(k)}_j = t_j$ for all $j \leq k$. Hence $\hat h^{(d)} = t \in M$, yielding $\prod_i I_i \subseteq M$.

Corollary (Convex case $\Rightarrow$ hyperrectangle). If $M$ is also convex, then each $I_i = \pi_i(M) \subset \mathbb{R}$ is convex, hence an interval. Therefore $M = \prod_i I_i$ is an axis-aligned hyperrectangle (a Cartesian product of intervals). Conversely, any axis-aligned hyperrectangle is patch-closed (and convex).

Implication. Consequently, any nonempty convex set in $\mathbb{R}^d$ that is not an axis-aligned hyperrectangle (e.g., a ball, an ellipsoid, or a polytope with non-axis-aligned faces) fails to be patch-closed: there exist $h^{src}, h^{trg} \in M$ and $S \subseteq [d]$ such that $\mathrm{Patch}_S(h^{src}, h^{trg}) \notin M$.

A.2.1 ACTIVATION PATCHING IN BALANCED SUBSPACES

Here we include an additional example of pernicious activation patching that assumes the existence of balanced subspaces, defined as one or more behaviorally relevant subspaces that are canceled out by opposing weight values. Before continuing, we note that such subspaces are unlikely to exist in practical models, because they would only arise in cases where two rows of a weight matrix $W \in \mathbb{R}^{n\times m}$ are non-zero scalar multiples of one another, assuming $h = Wx$. This example, however, could arise in cases where the input $x$ is low rank and a subset of the columns of two rows in $W$ are scalar multiples of one another.

Consider the case where there exists an N layer that classifies inputs based on the mean intensity of dimensions 3 and 4 of a latent vector $h \in \mathbb{R}^4$, where the N layers that produce $h$ are denoted $f(x)$ with data inputs $x$ sampled from the dataset, $x \sim D$, and where the layer of interest has a synthetically constructed weight vector $w \in \mathbb{R}^4$ with $w^\top = [w_1\ w_2\ w_3\ w_4] = [1\ {-1}\ \tfrac{1}{2}\ \tfrac{1}{2}]$. The layer is thus defined as follows:
$$y = w^\top f(x^{(i)}) = w^\top h^{(i)} = 1\,h^{(i)}_1 - 1\,h^{(i)}_2 + \tfrac{1}{2} h^{(i)}_3 + \tfrac{1}{2} h^{(i)}_4 \qquad (16)$$
Here, $i$ denotes the index of the data within the dataset.
Further assume that some behavioral decision depends on the sign of $y$, that $h_1$ and $h_2$ together form a balanced subspace given $w$ (meaning that for all $x^{(i)} \sim D$, $w_1 h^{(i)}_1 = -w_2 h^{(i)}_2$), and that they are non-dormant, meaning that for some pairs $(x^{(i)}, x^{(j)})$ with $i \neq j$, $h^{(i)}_1 \neq h^{(j)}_1$. Under these assumptions, the subspace spanned by $[1\ 0\ 0\ 0]^\top$ and $[0\ 1\ 0\ 0]^\top$ does not causally affect the network's output under the natural distribution of $h$. However, if we intervene on $h_1$ or $h_2$ while leaving $h_3$ and $h_4$ unchanged, the intervened representation $\hat h$ will diverge and can potentially cross the decision boundary.

Concretely, if we set $h^{(i)} = [1\ 1\ 1\ 1]^\top$ and $h^{(j)} = [3\ 3\ {-1}\ {-1}]^\top$ and then perform an intervention on $h_2$ using $h^{(i)}$ as the target and $h^{(j)}$ as the source, we get $\hat h = [1\ 3\ 1\ 1]^\top$. This results in a negative value of $y$, thus crossing its decision boundary via a non-native mechanism. This intervention could be used as experimentally affirming evidence for a mechanistic claim, when in reality we have not addressed the model's original mechanism.

A.3 PRACTICAL ALGORITHM FOR HARMLESS DIVERGENCE WHEN BEHAVIORAL NULL-SPACE CHARACTERIZES THE FULL HARMLESS SET

In settings where perturbations $v \in N(\psi, X)$ are treated as harmless and perturbations $v \notin N(\psi, X)$ as harmful, the behavioral null-space formalism suggests a practical procedure for testing the harmlessness of a given divergence $v$. Let $X_K \subset \mathbb{R}^d$ be the set of natural representations for class $K$, and let $\hat x_K \in \mathbb{R}^d$ be an intervened representation for class $K$. To approximate the natural manifold $M_K$ locally around $\hat x_K$, we first select the $n$ nearest neighbors of $\hat x_K$ in $X_K$:
$$N_n(\hat x_K) = \{x^{(1)}, \ldots, x^{(n)}\} \subset X_K.$$
Let $U \in \mathbb{R}^{n\times d}$ be the matrix whose rows are the neighbors, $u_i^\top = (x^{(i)})^\top$, and let
$$\mu_K = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} \in \mathbb{R}^d$$
denote their mean. Define the centered data matrix
$$\tilde U = \begin{bmatrix} (x^{(1)} - \mu_K)^\top \\ \vdots \\ (x^{(n)} - \mu_K)^\top \end{bmatrix} \in \mathbb{R}^{n\times d}.$$
We compute a rank-$r$ PCA of $\tilde{U}$, obtaining the top $r$ principal directions $Q_r \in \mathbb{R}^{d \times r}$ (columns are orthonormal). The corresponding local projection operator is $\Pi_K(x) = \mu_K + Q_r Q_r^\top (x - \mu_K)$. The local projection of the intervened representation is then $\hat{x}_{\mathrm{proj}} = \Pi_K(\hat{x}_K)$, and we define the divergence vector as $v = \hat{x}_K - \hat{x}_{\mathrm{proj}}$.

We now provide Algorithm 1 as a practical method for classifying a divergence as harmless or pernicious. We note, however, that this algorithm only approximates harmlessness and is not guaranteed to succeed. This is especially true in situations prone to dormant behavioral changes (Sec. 4.2.2).

A.4 CL LOSS APPLIED TO BOUNDLESS DAS

To perform the Boundless DAS experiments, we used the Boundless DAS tutorial provided in the pyvene Python package (Wu et al., 2024), which reproduces a main result from Wu et al. (2023), and we included the CL loss as a weighted auxiliary objective as described in Section 5.1. The task used in this tutorial involves continuous-valued features, which resulted in few occurrences of valid CL vectors in the provided dataset. To obtain exact CL vectors, we generated a token-sequence sample containing a valid CL vector for each intervention sample in the dataset. We left hyperparameter choices the same across all trainings except for the CL loss weight $\varepsilon$.

A.5 CL LOSS IN SYNTHETIC SETTINGS

Here we continue Section 5 with experimental details and additional experiments and results.

Algorithm 1 Classifying the harmlessness of a divergence vector when the behavioral null-space characterizes harmlessness
Require: intervened representation $\hat{x}_K \in \mathbb{R}^d$ for class $K$; natural class representations $X_K \subset \mathbb{R}^d$; evaluation set $X_{\mathrm{eval}} \subset \mathbb{R}^d$; function $\psi : \mathbb{R}^d \to \mathbb{R}^{d'}$; neighborhood size $n$; local dimension $r$; tolerance $\varepsilon \ge 0$.
Ensure: classification of the divergence vector $v$ as harmless or harmful.
1: (Local manifold estimation for class $K$) Let $N_n(\hat{x}_K) = \{x^{(1)}, \ldots, x^{(n)}\} \subset X_K$ be the $n$ nearest neighbors of $\hat{x}_K$ in $X_K$.
2: Compute the mean $\mu_K = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}$.
3: Form the centered matrix $\tilde{U} \in \mathbb{R}^{n \times d}$ with rows $(x^{(i)} - \mu_K)^\top$.
4: Perform rank-$r$ PCA on $\tilde{U}$ to obtain the top $r$ principal directions $Q_r \in \mathbb{R}^{d \times r}$.
5: Define the local projection $\Pi_K(x) = \mu_K + Q_r Q_r^\top (x - \mu_K)$ and set $\hat{x}_{\mathrm{proj}} \leftarrow \Pi_K(\hat{x}_K)$.
6: Define the divergence vector $v \leftarrow \hat{x}_K - \hat{x}_{\mathrm{proj}}$.
7: (Behavioral test over a broader context) For each $x \in X_{\mathrm{eval}}$, compute $\Delta(\psi, x) = \lVert \psi(x + v) - \psi(x) \rVert$.
8: if $\max_{x \in X_{\mathrm{eval}}} \Delta(\psi, x) \le \varepsilon$ then
9:   return $v$ is HARMLESS.
10: else
11:  return $v$ is HARMFUL.
12: end if

A.5.1 SYNTHETIC DATASET CONSTRUCTION

The default synthetic task reported in Section 5.2 was constructed as a dataset of simulated intermediate-layer representations $h \in \mathbb{R}^{18}$ with known ground-truth labels $y \in \{1, \ldots, 10\}$ and two causal feature dimensions; the total feature dimensionality of 18 comes from the 2 causal feature dimensions plus 16 concatenated noise dimensions. We split the classes into partitions 1 and 2, each consisting of 8 of the 10 classes. The held-out classes for partition 1 were contained in partition 2 and vice versa. See Figure 5 for a visualization of the task.

Base feature coordinates. We first defined a grid of base coordinates as the Cartesian product of $\{-1, 1\}$ along the first feature axis and $\{0, 1, 2, 3, 4\}$ along the second feature axis:
$$G = \{-1, 1\} \times \{0, 1, 2, 3, 4\}. \qquad (17)$$
This procedure yields $10 = |G|$ unique base coordinates, each corresponding to a distinct class label.

Noise and correlation structure. For each base coordinate $(x_1, x_2) \in G$, we generated $N$ noisy samples by adding Gaussian noise with variance $0.1^2$ and covariance parameter $0.2$. Specifically, each sample was drawn as
$$\begin{bmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} 0.1^2 & 0.2 \\ 0.2 & 0.1^2 \end{bmatrix} \right). \qquad (18)$$

Additional noise dimensions.
We augmented each 2D noisy base vector with 16 independent Gaussian noise features, each sampled from $\mathcal{N}(0, 1)$, producing final representations $h \in \mathbb{R}^{2+16}$.

A.5.2 MLP TRAINING

We trained a feedforward multi-layer perceptron (MLP) classifier to predict the class label $y$ from the synthetic representations $h$. The MLP was parameterized with:
• an input dimensionality $d$, defaulting to $d = 18$ as previously described,
• 1D batch normalization over the $d$ input dimensions (Ioffe & Szegedy, 2015),
• one hidden layer of width 128,
• a ReLU activation function,
• dropout with drop probability 0.5 (Srivastava et al., 2014),
• 1D batch normalization over the 128 hidden dimensions,
• an output layer with 10 logits.

Training was performed with a standard categorical cross-entropy loss:
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{B} \sum_{i=1}^{B} \log p_\theta(y_i \mid h_i), \qquad (19)$$
where $B$ is the batch size and $p_\theta$ denotes the MLP's predictive distribution over class labels. We perform the MLP training on both partitions combined for the default dataset split. We then train an alignment function on each partition independently and test the alignment on the untrained partition. We report the average IIA over both data partitions for DAS analyses over 5 seeds. Note that we use an independent MLP for each partition in the OOD experiment (Appendix A.5.4). Optimization used stochastic gradient descent with a learning rate of 0.01 for 300 epochs with early stopping, using an Adam optimizer (Kingma & Ba, 2017). The code was implemented in PyTorch.

A.5.3 DAS TRAINING

Following MLP pretraining, we applied Distributed Alignment Search (DAS) with varying intensities of the behavioral and counterfactual latent (CL) loss terms. Specifically, the DAS objective was
$$\mathcal{L}_{\mathrm{DAS}} = \varepsilon_{\mathrm{behavior}} \mathcal{L}_{\mathrm{behavior}} + \varepsilon_{\mathrm{CL}} \mathcal{L}_{\mathrm{CL}}, \qquad (20)$$
where $\varepsilon_{\mathrm{behavior}}$ and $\varepsilon_{\mathrm{CL}}$ are tunable coefficients controlling the strength of each term. We only use values of 0 or 1 for $\varepsilon_{\mathrm{behavior}}$, and we explore a range of values for $\varepsilon_{\mathrm{CL}}$, referring to it as the CL epsilon in figures.
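For concreteness, the weighted objective in Eq. (20) can be sketched as below. This is a schematic only: the cross-entropy behavioral term and the squared-distance CL penalty are illustrative stand-ins of our own choosing, and the exact loss definitions are those given in Section 5.

```python
import numpy as np

def softmax_xent(logits, label):
    """Cross-entropy for a single example; stand-in for the behavioral loss."""
    z = logits - logits.max()  # stabilized log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def das_objective(logits, label, h_hat, h_cl, eps_behavior=1.0, eps_cl=1.0):
    """Weighted DAS objective from Eq. (20):
    eps_behavior * L_behavior + eps_cl * L_CL.

    h_hat is the intervened representation and h_cl a counterfactual latent
    target; the squared distance is only an illustrative CL penalty here.
    """
    l_behavior = softmax_xent(logits, label)
    l_cl = np.sum((h_hat - h_cl) ** 2)
    return eps_behavior * l_behavior + eps_cl * l_cl

logits = np.array([2.0, 0.5, -1.0])
loss = das_objective(logits, label=0, h_hat=np.zeros(4), h_cl=np.ones(4) * 0.1)
```

Setting `eps_behavior=0` recovers the CL-only training, and `eps_cl=0` the behavioral-loss-only DAS baseline shown as pink dashed lines in the figures.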
We default to an overall learning rate of 0.01 and a subspace size of 1 unless otherwise specified. Details of these loss functions are provided in Section 5. Importantly, we stopped training after loss convergence with a patience of 400 training epochs, and we kept the best DAS alignment matrix as decided by IIA and the best CL alignment matrix as decided by EMD. Furthermore, for these trainings we used a symmetric invertible linear weight matrix as our alignment function, as introduced in Grant et al. (2024). Namely, the linear alignment matrix $X$ is constructed as $X = (M^\top + \lambda I)S$, where $M \in \mathbb{R}^{d_m \times d_m}$ is a matrix of learned parameters initially sampled from a centered Gaussian distribution with a standard deviation of $\frac{1}{d_m}$, $I \in \mathbb{R}^{d_m \times d_m}$ is the identity matrix, $\lambda = 0.1$ prevents singular values equal to 0, and $S \in \mathbb{R}^{d_m \times d_m}$ is a diagonal matrix that learns a sign for each column of $X$ using diagonal values $s_{i,i} = \tanh(a_i) + \lambda\,\mathrm{sign}(\tanh(a_i))$, where each $a_i$ is a learned parameter and $\lambda = 0.1$ prevents values of 0. We perform the alignment function training on both partitions for each synthetic dataset and test each DAS alignment on the untrained partition. We report the average IIA over both data partitions for DAS analyses.

A.5.4 OUT-OF-DISTRIBUTION EXPERIMENTAL DETAILS

To perform the OOD CL loss experiments, we partitioned the classes into 2 non-overlapping groups. Two of the 10 classes were excluded entirely. The groups were chosen so that the Sparse set had strictly greater spacing than the Dense set. See a visualization of the Dense and Sparse partitions in Figure 5. A separate MLP training was performed on each partition individually. Then an alignment function was trained on each partition/classifier tuple using the settings specified in Appendix A.5.3. The alignment functions were then tested on the untrained partition. We report IIA values averaged over the performance on each partition.
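The alignment-matrix construction from Appendix A.5.3 can be sketched in a few lines of NumPy (a minimal, unofficial rendering of $X = (M^\top + \lambda I)S$; the function and variable names are ours):

```python
import numpy as np

def build_alignment_matrix(M, a, lam=0.1):
    """Construct X = (M^T + lam*I) S, as described in Appendix A.5.3
    (following Grant et al., 2024).

    M:   (d_m, d_m) learned parameter matrix.
    a:   (d_m,) learned per-column sign parameters.
    lam: 0.1; keeps singular values and diagonal sign entries away from zero.
    """
    d_m = M.shape[0]
    # Diagonal sign matrix: s_ii = tanh(a_i) + lam * sign(tanh(a_i)).
    t = np.tanh(a)
    S = np.diag(t + lam * np.sign(t))
    return (M.T + lam * np.eye(d_m)) @ S

rng = np.random.default_rng(0)
d_m = 4
M = rng.normal(0.0, 1.0 / d_m, size=(d_m, d_m))  # centered Gaussian init
a = rng.normal(size=d_m)
X = build_alignment_matrix(M, a)
X_inv = np.linalg.inv(X)  # the lam terms keep X away from singularity
```

The `lam * sign(tanh(a_i))` term bounds each diagonal entry of $S$ away from zero, so a column's learned sign can flip during training without passing through a singular matrix.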
A.5.5 FURTHER CL LOSS EXPLORATIONS

In these explorations, we vary the DAS learning rate and the number of extra noisy dimensions for the OOD experiments. We show accuracies, EMD divergences, and EMD divergences restricted to the causal dimensions. The EMD values are scaled by the number of extra noisy dimensions. We refer to the EMD measurements along the causal dimensions only as the Row EMD. We report these for both the trained partitions and held-out partitions across various DAS trainings.

A.6 LINEAR REGRESSION

In an effort to establish a more general, concrete relationship between intervened divergence and out-of-distribution (OOD) intervention performance, we performed a linear regression on trained alignments using EMD along the causal axes (as discovered through the alignment training) as the independent variable and interchange intervention accuracy (IIA) as the dependent variable. We performed these regressions independently on the Default task and the OOD task trainings, each training consisting of two partitions with 5 training seeds and 3 types of alignment trainings (DAS behavioral loss only, CL loss only, and DAS+CL loss), for 30 trainings total. We used the statsmodels Python package (Seabold & Perktold, 2010) to perform the regression.

OLS regression results:
  Dep. Variable: IIA            R-squared: 0.729
  Model: OLS                    Adj. R-squared: 0.719
  Method: Least Squares         F-statistic: 75.28
  Date: Wed, 26 Nov 2025        Prob (F-statistic): 2.00e-09
  Time: 11:54:42                Log-Likelihood: 76.575
  No. Observations: 30          AIC: -149.2
  Df Residuals: 28              BIC: -146.3
  Df Model: 1
  Covariance Type: nonrobust

                 coef      std err   t         P>|t|    [0.025    0.975]
  Intercept      0.9885    0.004     243.666   0.000    0.980     0.997
  Training EMD   -0.3424   0.039     -8.677    0.000    -0.423    -0.262

  Omnibus: 33.330           Durbin-Watson: 1.903
  Prob(Omnibus): 0.000      Jarque-Bera (JB): 98.629
  Skew: -2.235              Prob(JB): 3.83e-22
  Kurtosis: 10.676          Cond. No.: 11.1

Figure 4: A number of additional divergence measures to demonstrate the difference between the natural and intervened distributions.
Each is labeled by its y-axis. Each metric is computed over a random sample of natural vectors to simulate the natural manifold, and a sampled set of intervened or natural vectors for which to measure the distance from the natural distribution. We refer to this distribution as the "compared" distribution. The sampled intervened and natural vectors are always the "ground-truth pair" described at the beginning of Appendix A.1. Nearest Cosine Distance refers to the cosine distance to the nearest sample in the natural manifold; multiple samples in the compared distribution can share the same natural sample, and this value is averaged over all compared samples. Nearest L2 Distance refers to the L2 distance to the nearest sample in the natural manifold; multiple samples in the compared distribution can share the same natural sample, and this value is averaged over all compared samples. Min Cos Pairing refers to the lowest-cost pairing where cost is the cosine distance between two samples; vector pairs are exclusive, and this value is normalized by the number of samples. Min L2 Pairing refers to the lowest-cost pairing where cost is the L2 distance between two samples; vector pairs are exclusive, and this value is normalized by the number of samples. Local PCA Distance refers to the distance to the manifold created using a local PCA of the nearest neighbors (see Appendix A.1.2). EMD refers to the Earth Mover's Distance (see Appendix A.1.2). KDE refers to the Kernel Density Estimation score (see Appendix A.1.2). Local Linear Reconstruction refers to the local linear reconstruction error (see Appendix A.1.2).

Figure 5: Visualization of the different synthetic tasks used for Figure 3 (panels: Default Task with Partitions One and Two; OOD Task with Sparse and Dense partitions; x-axis: Causal Dim 1). The Default Task is split into two partitions, both withholding two classes that are contained in the other partition.
The OOD task is also split into two partitions, both consisting of 4 classes. The Dense partition consists of a tighter cluster than the Sparse partition.

Figure 6: Out-of-distribution hyperparameter search showing the DAS IIA on OOD validation data for the trained task partition. The DAS loss learning rate (lr ∈ {0.005, 0.01, 0.05}) and the number of extra concatenated noisy input dimensions (extra_dims ∈ {0, 16, 64}) vary across the panel columns and rows. The DAS+CL reported values include the behavioral loss, whereas the CL label excludes the behavioral loss. The pink dashed lines represent DAS trained with the behavioral loss only.

Figure 7: In-distribution hyperparameter search showing the DAS IIA on in-distribution validation data for the trained task partition. The DAS loss learning rate (lr) and the number of extra concatenated noisy input dimensions (extra_dims) vary across the panel columns and rows. The DAS+CL reported values include the behavioral loss, whereas the CL label excludes the behavioral loss. The pink dashed lines represent DAS trained with the behavioral loss only.
Figure 8: Out-of-distribution hyperparameter search showing the DAS row-space EMD on OOD validation data for the trained task partition. The DAS loss learning rate (lr) and the number of extra concatenated noisy input dimensions (extra_dims) vary across the panel columns and rows. The DAS+CL reported values include the behavioral loss, whereas the CL label excludes the behavioral loss. The pink dashed lines represent DAS trained with the behavioral loss only.

Figure 9: In-distribution hyperparameter search showing the DAS row-space EMD on validation data for the trained task partition. The DAS loss learning rate (lr) and the number of extra concatenated noisy input dimensions (extra_dims) vary across the panel columns and rows. The DAS+CL reported values include the behavioral loss, whereas the CL label excludes the behavioral loss. The pink dashed lines represent DAS trained with the behavioral loss only.