Paper deep dive
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
Hakaze Cho, Haolin Yang, Yanshu Li, Brian M. Kurkoski, Naoya Inoue
Models: Llama 3.2-1B, Llama 3.2-3B
Abstract
Abstract:Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by some training-time regularization on single training instances, without an explicit guarantee of global sparsity among instances, causing a large amount of dense (simultaneously inactive) features, harming the feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, so that we term it as Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which can be leveraged to characterize the inference dynamics of LLMs. (2) Feature untangling. Compared to typical methods, due to improved training strategy, BAE avoids dense features while producing the largest number of interpretable ones among baselines.
Tags
Links
- Source: https://arxiv.org/abs/2509.20997
- Canonical: https://arxiv.org/abs/2509.20997
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/11/2026, 12:41:37 AM
Summary
The paper introduces the Binary Autoencoder (BAE), a novel autoencoder variant designed for mechanistic interpretability of Large Language Models (LLMs). BAE uses a binarization step function and an entropy-based training objective on minibatches to enforce global sparsity and feature independence. This approach addresses the limitations of traditional Sparse Autoencoders (SAEs), such as dense or dead features, and enables efficient entropy estimation of hidden states, which the authors use to analyze LLM inference dynamics, context windows, and In-context Learning (ICL).
Entities (5)
Relation Signals (3)
Binary Autoencoder → usedfor → Entropy Estimation
confidence 95% · BAE enables accurate entropy calculation on hidden state sets, and we utilize it for analyzing LLM behavior
Binary Autoencoder → improves → Sparse Autoencoder
confidence 90% · Compared to typical methods, due to improved training strategy, BAE avoids dense features while producing the largest number of interpretable ones among baselines.
In-context Learning → interpretedas → Information Reduction
confidence 85% · we further interpret In-context Learning (ICL) (Dong et al., 2022) as a form of information reduction.
Cypher Suggestions (2)
Identify relationships between BAE and other concepts · confidence 90% · unvalidated
MATCH (a:Entity {name: 'Binary Autoencoder'})-[r]->(b:Entity) RETURN a, r, bFind all methods used for mechanistic interpretability · confidence 80% · unvalidated
MATCH (e:Entity {entity_type: 'Method'})-[:USED_FOR]->(t:Task {name: 'Mechanistic Interpretability'}) RETURN eFull Text
112,032 characters extracted from source content.
Expand or collapse full text
Binary Autoencoder for Mechanistic Interpretability of Large Language Models Hakaze Cho 1 Haolin Yang 2 Yanshu Li 3 Brian M. Kurkoski 1 Naoya Inoue 1 4 Abstract Existing works are dedicated to untangling atom- ized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders con- strained by some training-time regularization on single training instances, without an explicit guar- antee of global sparsity among instances, causing a large amount of dense (simultaneously inactive) features, harming the feature sparsity and atom- ization. In this paper, we propose a novel autoen- coder variant that enforces minimal entropy on minibatches of hidden activations, thereby pro- moting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation, so that we term it as Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which can be leveraged to characterize the inference dynamics of LLMs. (2) Feature untangling. Compared to typical methods, due to improved training strategy, BAE avoids dense features while producing the largest number of interpretable ones among baselines. 1. Introduction Current practice for untangling atomized numerical com- ponents (features) from Large Language Models (LLMs), such as Sparse Autoencoder (SAE) (Shu et al., 2025), ap- plies training-time regularization (e.g.,L 1 normalization on hidden activations) to implicitly atomize features sample- wisely. However, such methodologies do not ensure global sparsity, often leading to frequently activated (dense) fea- tures alongside inactive (dead) features (Stolfo et al., 2025; Rajamanoharan et al., 2024b; Sun et al., 2025), contradict- ing the sparsity assumption (Elhage et al., 2022) of LLM 1 JAIST 2 University of Chicago 3 Brown University 4 RIKEN. Correspondence to: Hakaze Cho <yfzhao@jaist.ac.jp>. Preprint. February 13, 2026. hidden states, as broad activations across samples hinder consistent and meaningful interpretation, and reduce the parameter efficiency by dead features. Therefore, in this paper, to address this issue, we propose Binary Autoencoder (BAE), utilizing information-theoretic constraints among minibatches of training instances to ad- dress the aforementioned issue. As shown in Fig. 1, we de- sign training objectives that minimize the entropy of hidden activation from minibatches to reduce feature covariation while enforcing global sparsity to suppress frequently ac- tivated features. However, typical hidden activations are real vectors, whose entropies are extremely difficult to cal- culate (Greenewald et al., 2023) as the computation requires high-dimensional numerical integrations, which involve an exponential explosion of computational complexity. To this end, we round the activations to binary, then calculate entropy on such binary vectors to significantly reduce the computational complexity of the entropy objective, and uti- lize gradient estimation (Hubara et al., 2016) to enable the backpropagation on such a rounding operation. We empirically demonstrated the benefits of BAE in appli- cation as: (1) Efficient estimation of entropy from hidden state sets. We can estimate the entropy for reconstructing the input sets as the hidden activation entropy with signif- icantly reduced cost. By experiment on synthetic datasets with different ground-truth entropy values, we confirm the accuracy of such entropy estimation. Moreover, we uti- lize entropy to track the feed-forward process of standard language modeling in LLMs, revealing an information band- width at each layer and implicit context windows. Also, we further interpret In-context Learning (ICL) (Dong et al., 2022) as a form of information reduction. And (2) Sparse feature extraction. As also applied to normal SAE, the row vectors of the second autoencoder layer (sometimes termed as the dictionary) can serve as the atomized features ex- tracted from the inputs. Compared to typical SAE, BAE sig- nificantly suppresses dense features as well as dead features by the entropy-based training objective on mini-batches and consistent activation scaling based on the information gain of each channel, extracting the largest number of active and interpretable features. (3) More-bit decoding. Also, to re- duce the information loss induced by the rounding operation, we adopt a more-bit decoding as a handle for the trade-off 1 arXiv:2509.20997v2 [cs.LG] 12 Feb 2026 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Input Sample Step Function Binary Autoencoder ... ... Hidden State Source ... ... (FP16/32) Training Objective 0. 0. 0. 0/1 0/1 0/1 0/1 = 0. 0. 0. 0. 0. 0. Real Variable (1 Bit) = Binary Variable LLM Layers Linear Information Bottleneck Linear Hidden States Hidden Activation Self- regression ... ... 0. 0. 0. More-bit Decode 0/1 0/1 0/1 0/1 0/1 Figure 1. Feed-forward computation and training objective of BAE. Hidden statesh 0 from LLM layers are mapped byW in , binarized into h B via a step function, and projected back byW out as ˆ h 0 . The ˆ h 0 is fed to the self-regression loss, whileh B is fed to the information bottleneck loss. More-bit decode: to reduce the information loss of the BAE, as mentioned in §6, we aggregate real-valued hidden activation elements from multiple binary bits, and perform decoding using the reconstructed real-valued vector. between feature atomization and information loss. Main findings and contributions: (1, §3) We propose Binary Autoencoder (BAE), a novel variant of autoencoder with binarized hidden activation and training-time entropy constraint. (2, §4) BAE enables accurate entropy calculation on hidden state sets, and we utilize it for analyzing LLM behavior from an information perspective. (3, §5) We con- firm BAE as an effective atomized feature extractor with no dense features and the largest amount of features extracted. 2. Related Works Dictionary Learning (Shu et al., 2025). Modern mech- anistic interpretability (Sharkey et al., 2025; Bereska & Gavves, 2024) views LLM hidden states as superpositions of atomic features, motivating efforts to disentangle them for better semantical understanding of the LLMs’ opera- tion, where Sparse Autoencoder (SAE), autoencoder with training-timeL 1 normalization on the hidden activations for the sample-wise sparsity, and its variants (Gao et al., 2025; Bussmann et al., 2024; Rajamanoharan et al., 2024a;b; Shi et al., 2025; Lindsey et al., 2024) is applied (Bricken et al., 2023; Huben et al., 2024; Templeton et al., 2024; Gao et al., 2025). However, these methods do not warrant a global sparse disentanglement, that is, with some guarantee to min- imize the activation frequency as well as covariation among channels of hidden activations. Such a drawback causes the activations to be dense (Kissane et al., 2024; Rajamanoha- ran et al., 2024b; Sun et al., 2025; Stolfo et al., 2025), i.e., some features are activated across diverse inputs without a shared semantics, and thus cannot be uniquely interpreted. Moreover, these dense features can be viewed as mergers of multiple features (i.e., insufficient atomization), which is the main motivation of this paper: explicitly promote the global feature sparsity and atomization by improved training meth- ods, to capture a larger amount of interpretable features, and to improve the accuracy of feature interpretation. Discrete-valued Neural Networks.To utilize the information-theoretic constraints mentioned before, we round the hidden activations to binary. The framework of such binary neural networks, including rounding the real activation into 1-bit, and estimating the gradient for such rounding operation, etc., is originally proposed by (Hubara et al., 2016). Subsequently, numerous variants are proposed primarily based on different rounding operations and gra- dient estimations (Rastegari et al., 2016; Zhou et al., 2016; Choi et al., 2018; Vargas et al., 2024). A detailed survey can be found in (Qin et al., 2020). In this paper, we utilize these methodologies to reduce the calculation cost on the information-theoretic constraints. 3. Binary Autoencoder Feed-forward Calculation. To address SAE’s drawbacks, we propose Binary Autoencoder (BAE) shown in Fig. 1 and follows: Given input hidden stateh 0 ∈R d from an LLM, a binary autoencoderFconducts such calculation for the outputF (h 0 ) (also noted as ˆ h 0 =F (h 0 )) as: F (h 0 ) = Γ (h 0 W in )W out + b,(1) whereW in ∈R d×d ′ , preforming a linear decomposition on inputh 0 tod ′ dimensions,b∈R d is a bias term recon- structing the hidden states anisotropism (Gao et al., 2019; Ethayarajh, 2019; Bi ́ s et al., 2021; Godey et al., 2024) which contains minor information thus should be ignored as back- ground value,Γis the quantization function, projectingR d ′ into0, 1 d ′ element-wisely. Here, we use the step function to binarize the h 0 W in into hidden activation h B : h B = Γ ([x 1 ,x 2 ,...,x d ′ ]) = [γ (x 1 ),γ (x 2 ),...,γ (x d ′ )], γ(x) = ( 0,x < 0 1,x⩾ 0. (2) TheΓalso provides essential non-linearity for the numerical expressivity of the autoencoder (otherwise, theW in W out will degrade into one matrix). Self-regression Training Loss.Given a mini- batch of hidden states (with batch sizen b )H 0 = 2 Binary Autoencoder for Mechanistic Interpretability of Large Language Models n h (1) 0 ,h (2) 0 ,...,h (n b ) 0 o , we calculate the self-regression training objective with L 2 norm: L r (H 0 ) = 1 n b X h 0 ∈H 0 ∥h 0 −F (h 0 )∥ 2 .(3) Information Bottleneck (Entropy) Loss. To constrain the hidden activation (h B = Γ (h 0 W in )) to a global sparse representation forh 0 , we minimize the margin entropy of h B (sinceh B ∈0, 1 d ′ , such calculation are differentiable and efficient, without any numerical integration on the real space). Also, to maximize the effectiveness of the constraint on margin entropy, we also penalize the covariance (except the diagonal elements) ofh B to force the margin entropy to approach the joint distribution entropy. That is, on the minibatch H 0 , we define the entropy-based loss term: L e (H 0 ) = α e H[ 1 n b X h 0 ∈H 0 Γ (h 0 W in )] + α c D[Γ (H 0 W in )], where H[x] =− d ′ X i=1 x i log 2 x i , D[X] = X i,j:i̸=j |cov(X) i,j |. (4) The α e and α c are hyperparameters. The total loss is: L(H 0 ) =L r (H 0 ) +L e (H 0 ).(5) This training objective resembles and simulates the infor- mation bottleneck methods (Tishby et al., 2000; Kawaguchi et al., 2023; Tishby & Zaslavsky, 2015), whereL e mini- mizes the mutual information between inputh 0 and latent h B , whileL r maximizes that between output and h B . Gradient Estimation forΓ. Since the differential of bi- narization functionΓis0almost everywhere, to enable the backpropagation fromL e toW in , following the previous works (Hubara et al., 2016; Vargas et al., 2024), we estimate the gradient ofΓby smoothing functionx7→ (1 + e −x ) −1 (“Sigmoid”) elementwisely: ∂Γ(x) ∂x : = Sigmoid(x)⊙ (1− Sigmoid(x)),(6) where 1 is all-ones vector,⊙ is the Hadamard product. Default Hyperparameters.We defaultly setd ′ = 4d (i.e., the expanding rate= 4), the entropy loss weight 1 α e = 10 −7 , the covarience loss weightα c = 10 −7 . We use Adam (Kingma & Ba, 2014) optimizer with learning rate 5× 10 −4 , momentum factorα 1 = 0.9,α 2 = 0.999, and minibatch sizen b = 512for2000epochs, withα e = 0in the first 500 epochs. 1 Notice that, considering the magnitude around the late training stage (as shown in Fig. 2 and 12) of the self-regression loss (L r , around10 −3 ∼ 10 −2 ) and the margin entropy (H[·], around10 0 ∼ 10 3 ), athough the weightsα e are set quite small, the magnitude of the α e H[·] are balanced as a regularization term againstL r . 4. Entropy Estimation of Hidden States Entropy Estimation by BAE. As a direct measurement of information amount, calculating entropy for hidden states in neural networks can promote a closer observa- tion into the inner mechanism. However, directly comput- ing the differential entropy of a high-dimensional vector set requires probability density estimation (DE) and inte- gration, which is neither accurate (DE suffers from the curse of dimensionality and floating-point errors) nor ef- ficient (numerical integration costsO(C d )forddimensions withCintegration cells) (Greenewald et al., 2023). In contrast, BAE can improve such entropy calculation by best-effortly decomposing the original real-value vectors into binary vectors (h B ) with pair-wisely independent el- ements, enabling efficient entropy estimation by margin entropy of the mean hidden activation ̄ h B : given a vector setH 0 = n h (i) 0 o n i=1 , we encode them intoh B s by a trained BAE asH B = n h (i) B = Γ h (i) 0 W in o n i=1 . Since theh B s are binarized and best-effortly decorrelated pairwise by the covariance loss, we can calculate the entropy required for reconstructingH 0 asH[ ̄ h B ], where ̄ h B are the averages among the row vectors of H B . In this section, we evaluate the aforementioned entropy estimation with best effort on a synthesis dataset to show its accuracy, and utilize such entropy calculation to track the feed-forward process of LLM. 4.1. Evaluating Entropy Calculation from BAE Synthetic Directional Benchmarking. To evaluate the entropy estimation of BAE’s hidden binary activations (h B ), we build a synthetic random directional benchmark: (1) sam- ple ad-dimensionalr-rank orthonormal basisM ∈R r×d , (2) samplerbinary coefficientsc∈0, 1 r , and (3) gener- ate an instancecMby summing selectedMbasis from1- elements inc. Repeat (2) and (3)ntimes, we get a synthesis random directional dataset withnsamples. Intuitively, the entropy of the dataset isrsince the only randomness comes from therindependent Bernoulli coefficients given the fixed basis. We train the BAE on a set of such datasets with var- iousr(details in Appendix A.1), and confirm whether the entropies calculated following the aforementioned method hit the correspondingr. Moreover, we propose a proto- type to evaluate such entropy calculation on real LM hidden states, refer to Appendix B for details. Also, we analyse the training dynamics of BAE, and find a natural low-entropy trend of BAE hidden activations, refer to Appendix D. BAE can Accurately Estimate Entropy of Synthetic Vec- tor Set. The evaluation results are shown in Fig. 2, where the standard BAE implementation with normal entropy ob- jective (green) hint the correspondingraccurately, despite 3 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 2 1 2 3 2 5 2 7 Synthetic Feature Rank 2 1 2 3 2 5 2 7 2 9 2 11 Calculated Entropy / bits Expanding 1x 2x 4x 8x 16x w/ Entropy Loss w/o Entropy Loss 0e+00 1e-03 2e-03 3e-03 4e-03 5e-03 Auto-regression Loss (L2) Figure 2. Evaluation of BAE entropy calculation on the synthetic dataset. Horizontal axis: rankrof the current dataset, vertical axis: calculated entropy, the green/red color refers to whetherL e is enabled, and the opacity refers to theL r on the whole input set. 2 1 2 3 2 5 2 7 2 9 Token Location 0 100 200 300 400 500 600 700 Calculated Entropy / bits Layer 0 2 4 6 8 10 12 14 1e-03 2e-03 3e-03 4e-03 5e-03 6e-03 7e-03 Auto-regression Loss (L2) Figure 3. Entropy calculated on the hidden states extracted from specific layers and token locations from Pile and Llama 3.2-1B. The curve colors refer to the extracted layers, the scatter colors refer to theL r on the whole input set. of the inner dimensiond ′ (noted as the expending rated ′ /d, shown as the marker shape in Fig. 2). In contrast, the adver- sarial experiments (red), where theα e andα c are both set to0, show clearly higher calculated entropy, suggesting that BAE with entropy constraint can find the essential (minimal) entropy values to reconstruct the input sets. 4.2. Tracing Language Modeling by Entropy In this section, we track the LLM’s feed-forward dynam- ics by the entropy of hidden states. In detail, we sample n = 262144sentences from Pile (Gao et al., 2020), in- putting them into Llama 3.2-1B (Grattafiori et al., 2024), then extract the hidden states of2 0,1,·,10 -th tokens from each layer. Then we train a BAE on each layer and position, with details in Appendix A.2. From the trained BAE, we calculate the entropy as shown in Fig. 3, where we observe: Layer Bandwidth. The entropy of hidden states from a specific layer increases with the number of prefix tokens and eventually saturates at a fixed value. This observation sug- 0246810121416 Transformer Layer Number 0 20 40 60 80 100 120 Calculated Entropy / bits #Examples 0 1 2 4 8 0.0e+00 2.0e-03 4.0e-03 6.0e-03 8.0e-03 1.0e-02 1.2e-02 Auto-regression Loss (L2) Figure 4. Entropy calculated on the hidden states extracted from specific layers and the last token from ICL inputs from SST-2. The curve colors refer to the number of demonstrations. gests that: if the hidden space of a specific layer is viewed as a channel (Elhage et al., 2021) through which token in- formation is communicated, then the channel has a fixed bandwidth, limiting the amount of accommodated token information. That is, Transformers have implicit context windows on each layer, where the exceeded information is discarded or distorted. Unlike convolutional neural net- works, we have no evidence to infer that such context win- dows are contiguous or even binary, so that it is possible to overwrite parts of but not the whole information from an old token by new information, causing distortation like the lost-in-the-middle problem (Hsieh et al., 2024; Liu et al., 2024a; He et al., 2024; An et al., 2024; Liu et al., 2025). Token Information Gain. As shown in Fig. 3, for a spe- cific token position, hidden states from a deeper layer carry more information, and the entropy also saturates later on a deeper layer, suggesting that Transformer blocks sequen- tially inject contextualized information into hidden states. As a result, deeper layers are more likely to simulate broader context windows, which facilitates the processing of down- stream tasks at higher levels of information aggregation and abstraction. In contrast, shallower layers, restricted by narrower context windows, tend to focus on local linguistic- level features and propagate this information to the later layer to combine into broader-scoped abstraction, which is consistent with intuition and previous empirical observa- tions (Jawahar et al., 2019; Chen et al., 2023; Wang et al., 2023; Xiao et al., 2025). We discuss this point deeper in §7. 4.3. Tracing In-context Learning by Entropy In this section, we track the LLM’s ICL (refer Appendix A.3 for introduction) inference dynamics via entropy, similar to §4.2. In detail, we sample262144ICL input instances from SST-2 (Socher et al., 2013) with specific demonstration numbers, then extract the hidden states of the last token (the “:” in the “sentiment:”, where the answer to the query will be generated) of a specific layer, and then train a BAE on these hidden states, with prompt templates, settings and param- 4 Binary Autoencoder for Mechanistic Interpretability of Large Language Models eters detailed in Appendix A.3. We calculate the entropy as shown in Fig. 4, where: (1) similar to the normal lan- guage modeling, a deeper layer contains more information, which suggests that the implicit context windows hypothesis still stands for the ICL scenario. However, (2) contradicts the normal language modeling, hidden states with more demonstrations (globally longer sentence lengths and bet- ter accuracy, see Appendix A.3) counterintuitively contain lower entropy, which suggests an interesting conclusion: ICL is achieved through removal of information, where the information useless to the specified task may be suppressed by the given demonstrations. This contradicts the main- stream idea that ICL “learns new tasks or knowledge” (Pan et al., 2023; Li et al., 2024; Wang et al., 2025), and thus may offer a new perspective for interpreting the ICL infer- ence processing as “deleting unrelated information from the query encoding (Cho et al., 2025) on the last token”. 5. Atomized Features Extraction by BAE Given that our entropy constraint promotes sparsity and decorrelation ofh B , the linear decomposition fromh 0 toh B can disentangle atomized features in LLM hidden states (El- hage et al., 2022; H ̈ anni et al., 2024). To evaluate such disen- tanglement, in this section, we assign human-interpretable semantics to each channel ofh B and assess their consistency across inputs, using an improved automatic interpretation method under the LLM-as-a-judge framework (Gu et al., 2024). Also, we utilize BAE to extract steering vectors (Hen- del et al., 2023; Kharlapenko et al., 2024) from ICL inputs and inject such steering vectors into 0-shot inputs to evaluate the effectiveness against SAE variants. 5.1. Common Semantics-based Feature Interpretation Revisiting Current Automatic Feature Interpretation and Evaluation.We begin with a revisiting to the cur- rent automatic interpretation and evaluation methods of fea- tures (Bills et al., 2023; Huben et al., 2024), where for a channel inh B corresponding to a feature (line vector) in W out , (Step 1) given one input text, the activation magni- tudes (e.g., the value of the specified channel inh B of SAE) on the channel are calculated on every token, then (Step 2) the tuple of sentence and activation magnitudes on all tokens is input in LLMs with a prompt (e.g., “Please predict the explanation of the feature given the following activa- tions.”) for interpreting such feature into one phrase (e.g., “freedom-related terms”). Then, (Step 3) given a test input text, a simulation of the activation magnitudes based on the generated interpretation for each token is queried from the LLM, and the correlation coefficient of the simulated activation magnitudes and the SAE-calculated activation magnitudes is regarded as the interpretability score of this feature (refer to (Bills et al., 2023) for details). Notice that Step 2 of the above process relies on the LLM’s ability to accurately handle large amounts of numerical to- kens to generate reliable explanatory phrases, and Step 3 requires the LLM to faithfully simulate activation magni- tudes by numerical tokens, which places a high demand on the LLM’s mathematical reasoning capabilities and also output calibrations. However, current research indicates that LLMs exhibit weaker capabilities in numerical reasoning compared to linguistic tasks (Press et al., 2023; Schick et al., 2023; Ahn et al., 2024; Xu et al., 2025). Also, the out- put can be implicitly biased to some specific tokens (Zhao et al., 2021; Geng et al., 2024), making the aforementioned pipeline unreliable, requiring a revision to improve the ro- bustness and credibility. Common Semantics-based Feature Interpretation and Evaluation (ComSem).Therefore, to avoid relying on LLMs to directly process numerical tokens, we pro- pose ComSem as a new pipeline that leverages the LLMs’ strength in linguistic semantic recognition to interpret the extracted features. In detail (refer Appendix A.4 for the detailed pseudocode): for a specific channel (corresponding to a feature) in theh B of an autoencoder, given a set of test sentences, (Step 1) for all the tokens in the set, whose hid- den states have top-k(kis a hyperparameter of the method) significant activation magnitude on the specific channel over all channels (i.e., if the activation magnitude of this specific channel is in the top-kwhen we sort all the activation mag- nitudes over all channels), we collect them along with their sentences. (Step 2) We query the backend LLM to find the commonality of these tokens presented in their sentences, as the interpretation of the feature corresponding to the channel. (Step 3) On a hold-out set of such token-sentence tuples significantly activated on the channel, we query the LLM to judge whether these tokens can be interpreted by the generated interpretation from Step 2. The ratio of the “Yes” answer is calculated as the interpretability score of this feature. ComSem can avoid the concern in mathemat- ical and numerical reasoning, and provides the possibility of applying simple output calibrations 2 on the true-or-false output. So, we utilize ComSem to evaluate the BAE, as described below. 5.2. Interpreting and Evaluating BAE Features Estimate Activation Magnitude from Binary Activation. Since ComSem requires access to feature activation magni- tude, which our BAE can not explicitly provide 3 , as shown in Fig. 10, we calculate the burstiness of each channel to convert the binary activationsh B into the activation mag- 2 E.g., described in (Zhao et al., 2021; Fei et al., 2023). Such output calibration is not utilized in this work. 3 Note that, in typical SAE, the hidden activation (congener of our h B ) can directly serve as such magnitude. 5 Binary Autoencoder for Mechanistic Interpretability of Large Language Models nitudes. In detail, for each (indexed byi) instanceh (i) B in a hidden activation setH B = h (i) B n i=1 , we compare the distance ofh (i) B with a prior distribution 4 ̄ h B ∈ [0, 1] d ′ as β (i) = log 2 |h (i) B − ̄ h B |, where thelog 2 is the element-wise logarithm. Suchβ (i) is the channel-wise activation magni- tude ofh (i) B , where a channeljwith largerβ (i) j suggests the j-th feature fromh (i) B is more bursty. A more bursty feature reduces more uncertainty to reconstruct theh 0 so that car- ries more information against the ̄ h B , which makes it more representative ofh 0 (Appendix E for theoretical ground- ings). Then, we input such magitudes into the automatic feature interpretation and evaluation framework. Such burstiness calculation has the following advantages: (1) Global sparsity induced by low entropy. Notice that our entropy training objective shown in Eq. 4 is actually punishing the burstiness summed from all the input sam- ples on all the channels among the minibatches, therefore, the calculatedβ (i) for each sample is numerically sparse in two directions, i.e., (i) for one instanceh (i) B , most of el- ements inβ (i) are suppressed to be sufficient small values, and (i) for one feature channelj, most ofβ (i) j are small values amongiamong all instances, which is aligned to the sparsity assumption of mechanistic interpretability (refer to Appendix G.1 for visualization). (2) Consistent scale and clear interpretability. In typical SAEs, different chan- nels ofh B lie on varying scales, making some inherently high-magnitude channels appear consistently active (“dense features”, Appendix G.1 and Fig. 5) (Stolfo et al., 2025; Rajamanoharan et al., 2024b) while others seem inactive (“dead features”). This issue is particularly pronounced when using top-kselection based on absolute activation values to identify significant features, reducing the number and introducing confusion to activated features. Therefore, intuitively, as shown in §5.3, rescaling activations into an aligned distribution alleviates this problem in typical SAE. And in our pipeline, BAE simplifies the rescaling since each channel follows a Bernoulli distribution and requires only one statistic ( ̄ h B ), whereas the real-valuedh B in typical SAEs is harder to characterize and rescale. 5.3.BAE is an Effective Interpretable Feature Extractor As mentioned before, BAE effectively mitigates the two issues in feature extraction of typical SAE: dense features on inherently high-magnitude channels, which lower con- sistency among instances with the same feature activated (i.e., low interpretability scores in Step 3 of ComSem), and dead features on inherently low-magnitude channels, leav- ing many channels idle. In this section, we empirically confirm the mitigation of both issues by BAE with ComSem 4 Also named typical set in information theory. Such ̄ h B can either be saved during training, or mean value on H B set. Table 1. Evaluation of BAE and baselines (d ′ /d = 4).L r : the self-regression loss on validation set.∥Cov∥: the spectral norm of the covariance matrix among line vectors ofW dec . Act.: The number of features where sufficienth 0 instances (more than8, refer to Appendix A.4) with significant activation magnitude are observed on the corresponding channel. FI#: The number of feature channels with non-zero ComSem interpretability score. Score: The averaged ComSem score among all activated channels. Feat. Source Model (Appendix A.4) L r Val. Set ∥Cov∥Act. ComSem 4.1-mini ComSem 4.1 FI#ScoreFI#Score Llama 3.2-1B Layer 5 d = 2048 ReLU SAE3×10 −3 0.08192211030.21612510.294 Top-k SAE2×10 −3 0.15353126140.43927630.445 Gated ReLU SAE6×10 −3 0.0717110.38260.294 ReLU SAE Resc.3×10 −3 0.085673090.2333670.269 TransCoder1×10 −2 0.023352180.2751950.238 BAE (ours)4×10 −3 4.21301222490.42224490.439 Llama 3.2-1B Layer 9 d = 2048 ReLU SAE2×10 −4 0.07191411280.22912740.254 Top-k SAE4×10 −3 0.15317923890.45025420.461 Gated ReLU SAE1×10 −2 0.33100850.638800.596 ReLU SAE Resc.2×10 −4 0.07316120280.25821900.294 TransCoder2×10 −2 0.014172570.2602630.264 BAE (ours)6×10 −3 12.04467533700.37036240.394 Llama 3.2-1B Layer 11 d = 2048 ReLU SAE1×10 −2 0.08206511770.23213800.260 Top-k SAE8×10 −3 0.18341725400.44026840.452 Gated ReLU SAE1×10 −2 0.4012269760.53110260.557 ReLU SAE Resc.1×10 −2 0.087444350.2594820.278 TransCoder2×10 −2 0.0117949790.21810900.239 BAE (ours)1×10 −2 19.32546438820.36041400.382 Llama 3.2-1B Layer 14 d = 2048 ReLU SAE3×10 −5 0.09252814230.19516000.217 Top-k SAE3×10 −2 0.11270219000.38920040.418 Gated ReLU SAE3×10 −2 0.75294820950.41222500.435 ReLU SAE Resc.3×10 −5 0.09396223610.25226610.274 TransCoder3×10 −2 0.00340119310.23721660.267 BAE (ours)4×10 −2 4.79612039630.32439710.323 Llama 3.2-3B Layer 20 d = 3072 ReLU SAE6×10 −3 0.08192311830.28912890.312 Top-k SAE5×10 −2 0.08323422860.40225080.425 Gated ReLU SAE5×10 −2 0.27462832710.40235800.437 ReLU SAE Resc.6×10 −3 0.08212212220.25014510.294 TransCoder0.150.01550830010.23334240.257 BAE (ours)7×10 −2 6.80930859560.30868050.348 on 2 backend evaluation LLMs and BookCorpus (Zhu et al., 2015) against SAE variants trained on Pile (detailed in Ap- pendix A.4) as shown in Table 1, where compared to all the baselines, BAE can extract the largest amount of active features from the LLM’s hidden states, with a considerable interpretability score. However, one can doubt that the in- terpretability score of BAE is “not SotA” in the Table 1. Our explanation is: BAE has a strong feature extracting ability, so that it captures some complex features (refer to Appendix F) which are difficult to interpret by natural lan- guage from LLMs, reducing the interpretability scores. As shown in Appendix C, if some of the low-score features are removed, then BAE reaches the SotA of interpretability score and also remains the largest feature amounts. More- over, we conduct normalization 5 to theh B of ReLU SAE for a rescaled activation magnitude, and evaluate as “ReLU SAE Resc.” in Table 1, where comparing to the vanilla ones, the interpretability scores are observably improved by the normalization rescaling. Since SAE training ignores cross- sample sparsity and itsh B distribution is hard to estimate, the effectiveness of rescaling remains limited. 5 I.e., linearly scale every channel value to a distribution over h B instances with mean value0and standard deviation1, to miti- gate the influence of inherent activation magnitude of channels. 6 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Figure 5. Feature activation frequency distribution of Layer 11 (more layers in Appendix G.3). Information Loss by Binarization. One suspicion is that theΓcauses information loss from theh 0 , which may harm the feature extraction. To evaluate such information loss, we calculate the self-regression loss (L r ) on the validation set, as shown in Table 1, where no significant difference of BAE compared to baselines can be observed. This indicates that BAE at least does not incur more severe information loss than the baselines. Furthermore, we attempt to increase the decoding bits to reduce such information loss in §6. Activated Feature Distribution. To examine BAE’s ef- fective suppression towards dense feature, we visualize the activation frequency distribution of each channel on the trained BAEs and SAEs as Fig. 5. In the visualizations, BAE features are sparsely activated with a left-leaning dis- tribution, while some of the feature channels in the typical SAE keep high activation frequencies with a long-tail dis- tribution, suggesting dense activations (Stolfo et al., 2025; Rajamanoharan et al., 2024b). Such results support our hy- pothesis that the minibatch-oriented entropy objective can mitigate the global density among input instances. Visual- izations on more settings are provided in Appendix G.1. Feature Diversity. As an intuitive measurement of fea- ture diversity, we directly compute the spectral norm of the covariance matrix over the row vectors inW out , with the results reported in Table 1. Among these results, BAE ex- hibits a markedly larger covariance, confirming its ability to extract diverse and non-redundant features. 5.4.Case Analysis: BAE Extract Better Steering Vectors As empirical evidence that BAE can extract effective fea- tures, we reconstruct the ICL task vector (Hendel et al., 2023) using the features extracted by BAE and SAE base- lines (Kharlapenko et al., 2024). In detail, following the framework proposed by Cho & Hockenmaier (2025), we construct an 8-shot sample set as the “good samples”, and a 0-shot sample set as the “bad samples” from MR (Pang & Lee, 2005), both equipped with 512 samples. We input both sets respectively into LMs, getting the hidden states of the last token from Layer 11 of Llama 3.2-1B, and en- coding the hidden states by BAE or SAE baselines. Then, 0.00.51.01.52.02.53.0 Injection Strength 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0-shot Accuracy on MR BAE ReLU SAE Gate ReLU SAE Top-k SAE w/o Steering Figure 6. Steering vector reconstruction results on MR, Llama 3.2-1B Layer 11. we calculateh D = ̄ h + B − ̄ h − B , where the ̄ h + B and ̄ h − B are the mean hidden activation in these two sets produced by BAE or SAE baselines, and only keep the elements at the top 256 in absolute value (i.e., zeroing the non-significant values) as ̃ h D . γ ̃ h D W out is calculated as the steering vector with γ as injection strength, and injected into the original Layer 11 of the last-token hidden states from the 0-shot test samples, with accuracy shown in Fig. 6. The results show that the BAE-mediated steering vector most effectively improves zero-shot accuracy, serving as a case study demonstrating that BAE can extract more effective features from LLMs. 6. More-bit Decoding As mentioned before, although the self-regression loss in- duced by BAE is not significant compared to the baselines, it can be beneficial to reduce such loss. Therefore, based on the framework of BAE, in this section, as shown in Fig. 1, we propose a plug-in module to aggregate multiple bits of h B into real variables before decoding by theW out . In detail, for eachCbinary elements inh B ∈0, 1 d ′ (i.e., from the iC-th elements to the[(i + 1)C− 1]-th elements ofh B ), we transfer the binary bits into a real vector as: h R = P C−1 j=0 2 j h B,iC+j 2 C − 1 ,(7) then, we decodeh R byW out ∈R d ′ /C×d for the output ˆ h 0 . Such a method effectively improves the numerical precision of BAE, and is therefore expected to mitigate the informa- tion loss while still retaining a binary encoding to support entropy computation. The entropy loss is calculated on the binary encodingh B , whereas feature interpretation is conducted on the real-valued h R . Evaluation results withd ′ /C = 4dare shown in Ta- ble 2, where more-bit decoding consistently reduces self- regression loss while decreasing the number of activation features, yet remains superior to the SAE baseline. As we will discuss in §7, BAE decomposes intrinsically continu- ous features into multiple atomic ones, whereas more-bit decoding merges them into a single feature, which leads to 7 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Table 2. More-bit decoding results on Llama 3.2-1B Layer 11. Notation similar to Table 1. Evaluated on GPT 4.1-mini. Bits (C) L r Act.FI#Score 111.0×10 −3 546438820.360 27.7×10 −3 419529510.371 37.0×10 −3 302321700.391 46.6×10 −3 284120290.376 a reduction in the total number of features. Such a result demonstrates that more-bit decoding serves as an effective trade-off handle between feature numbers and information loss. Moreover, to reduce the information loss of BAE, fu- ture work can be devoted to modifying the structure of BAE, especially by more advanced binarization functions (Wang et al., 2018; Vargas et al., 2024) and gradient estimation methods (Darabi et al., 2019; Yang et al., 2019). 7. Conclusion and Discussion Conclusions. In this paper, we propose Binary Autoen- coder as a toolkit for mechanistic interpretability research for LLMs. BAE utilizes entropy training objective on a minibatch of binary hidden activations to extract globally sparse and atomized features. Our findings demonstrate that BAE can accurately estimate the entropy of a feature set and effectively disentangle atomized and sparse features from LLM’s hidden states, making it a powerful tool for interpretability research. Compress Hidden States into Fewer Bits by BAE. As detailed in Appendix A.5, one can utilize BAE to compress h 0 set by storing only channel indices ofh B where the burstiness against prior ̄ h B exceeds a threshold. Such com- pression has an expected coding length for each instance 6 L int H[h 0 ] , outdistances the originalL float d. Reconstruc- tion is done by flippinground( ̄ h B )at the stored indices and projecting throughW out . We evaluate this compression with a thresholdlog 0.5(refer Appendix A.5), as shown in Table 3, achieving compression rates as low as 1% with low MSE, indicating both effective compression and reconstruc- tion quality. However, the reconstruction remains distorted in the radial direction, manifesting as low cosine similar- ity with the originals, likely arising from theL 2 loss of BAE, which ignores radial information. Since this method constitutes lossy compression and falls outside the paper’s main focus, we do not present it as a core contribution but a potential application. For reliable and efficient storage or transmission, specialized objectives beyond simple regres- sion are needed to preserve essential information better. 6 Here,L int orL float denotes the number of bits required to encode a single index for the compressed coding or one float variable for an original feature element, andH[h 0 ]is the entropy estimated by the BAE from the distribution ofh 0 ,dis the dimen- sionaility ofh 0 . TheL int is usually less thanL float , andH[h 0 ]is also usually less thand(as demonstrated in Fig. 3, withd = 2048 and H[h 0 ] < 700). Table 3. Compression and reconstruction on Llama 3.2-1B. Mem- ory: storage cost before/after compression. MSE / Cos.Sim.: mean square error or cosine similarity between the source and reconstruction. Layer# α e ,α c Memory (MB, +Model)MSECos.Sim. 51e-716713→ 1410.0160.681 91e-716713→ 1730.0400.760 111e-716713→ 1660.0490.800 141e-716713→ 1780.0920.816 51e-916713→ 1700.0040.748 91e-916713→ 1830.0520.809 111e-916713→ 2060.0990.831 141e-916713→ 2700.0300.860 Hypothesis of Atomic Features. Since BAE represents feature activations with binary values rather than continu- ous ones, inherently continuous attributes (such as RGB values (Engels et al., 2025; Modell et al., 2025)) may be discretized into multiple features, which is reasonable since these features cannot be represented with 1-bit, so decom- posing them into multiple features facilitates the correct estimation of the entropy. On the other hand, due to the nar- row radial distribution of Transformer hidden states (Gao et al., 2019; Ethayarajh, 2019), different features may share the same direction but differ in distance, which is difficult to distinguish in continuous SAEs, whereas BAE avoids this issue by discretizing them into atomic units. Also, refer to Table 1, compared to baselines, the number of discretized features extracted by BAE significantly increase against layers, with SAE features remain constant, suggesting that continuous features are more likely to emerge in later layers, which are split into more discrete features by BAE, consis- tent with previous works (Jawahar et al., 2019; Chen et al., 2023; Allen-Zhu & Li, 2023; Liu et al., 2024b), still needs further exploration. Limitations and Open Questions. (1) Natural Language Feature Interpretation Based on Tokens. In this paper, ComSem, and also the traditional automatic feature inter- pretation methods, are all based on the natural language interpretation of tokens where the features originate. How- ever, the contextualization induced by Transformer layers may cause the semantics of the hidden states to deviate from the original token, but the method family overlooks such an effect. Future works can be devoted to directly decoding the semantics from the extracted atomized fea- tures by tools such as LogitLens (Joseph Bloom, 2024) or PatchScopes (Ghandeharioun et al., 2024). Also, not all the features can be interpreted by natural languages, such as Task Vectors (Hendel et al., 2023; Kharlapenko et al., 2024). We discuss this point in Appendix F. (2) Investiga- tion Scope. Due to computational limits, we evaluate BAE only on Llama 3.2-1B and 3B, leaving large-scale tests for future work. Nevertheless, we believe this paper provides a sufficient prototype of BAE for addressing dense/dead feature issues in SAE. 8 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Acknowledgments This work was supported by JST FOREST Program (Grant Number JPMJFR232K, Japan) and the Nakajima Founda- tion. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. References Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URLhttps://arxiv.or g/abs/2303.08774. Ahn, J., Verma, R., Lou, R., Liu, D., Zhang, R., and Yin, W. Large language models for mathematical rea- soning: Progresses and challenges. In Proceedings of the 18th Conference of the European Chapter of the As- sociation for Computational Linguistics: Student Re- search Workshop, p. 225–237, 2024. URLhttp: //arxiv.org/abs/2402.00157v4. Allen-Zhu, Z. and Li, Y. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673, 2023. URLhttps://arxiv.or g/abs/2305.13673. An, S., Ma, Z., Lin, Z., Zheng, N., Lou, J.-G., and Chen, W. Make your llm fully utilize the context. Advances in Neural Information Processing Systems, 37:62160– 62188, 2024. URLhttp://arxiv.org/abs/24 04.16811v2. Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Be- nign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. URLhttps://arxiv.org/abs/1906.1 1300. Bereska, L. and Gavves, S. Mechanistic interpretability for AI safety - a review. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URLhttps://op enreview.net/forum?id=ePUVetPKu6. Survey Certification, Expert Certification. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models, 2023.URLhttps: //openaipublic.blob.core.windows.net /neuron-explainer/paper/index.html. Bi ́ s, D., Podkorytov, M., and Liu, X.Too much in common: Shifting of embeddings in transformer lan- guage models and its implications. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Belt- agy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, p. 5117–5130, Online, June 2021. Association for Com- putational Linguistics. doi: 10.18653/v1/2021.naacl-m ain.403. URLhttps://aclanthology.org/202 1.naacl-main.403/. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monoseman- ticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/202 3/monosemantic-features/index.html. Bussmann, B., Leask, P., and Nanda, N.Batchtopk sparse autoencoders. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URLhttps://openreview.net/forum ?id=d4dpOCqybL. Chen, Y., Yuille, A., and Zhou, Z. Which layer is learning faster? a systematic exploration of layer-wise conver- gence rate for deep neural networks. In The Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum ?id=wlMDF1jQF86. Cho, H. and Inoue, N. Staicc: Standardized evaluation for classification task in in-context learning. arXiv preprint arXiv:2501.15708, 2025. URLhttps://arxiv.or g/abs/2501.15708. Cho, H., Kato, M., Sakai, Y., and Inoue, N. Revisiting in-context learning inference circuit in large language models. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://open review.net/forum?id=xizpnYNvQq. Cho, I. and Hockenmaier, J.Toward efficient sparse autoencoder-guided steering for improved in-context learning in large language models. In Christodoulopou- los, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing, p. 28961–28973, 9 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Suzhou, China, November 2025. Association for Compu- tational Linguistics. ISBN 979-8-89176-332-6. doi: 10 .18653/v1/2025.emnlp-main.1474. URLhttps://ac lanthology.org/2025.emnlp-main.1474/. Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J., Srinivasan, V., and Gopalakrishnan, K. Pact: Param- eterized clipping activation for quantized neural net- works. arXiv preprint arXiv:1805.06085, 2018. URL https://arxiv.org/abs/1805.06085. Darabi, S., Belbahri, M., Courbariaux, M., and Nia, V. P.BNN+: Improved binary network training, 2019. URLhttps://openreview.net/forum ?id=SJfHg2A5tQ. Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., and Sui, Z. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022. URL https://arxiv.org/abs/2301.00234. Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable llm feature circuits. Advances in Neural Information Processing Systems, 37:24375–24410, 2024. URLhttps://proceedings.neurips.c/p aper_files/paper/2024/hash/2b8f4db04 64c5b6e9d5e6bea4b9f308-Abstract-Con ference.html. Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URLhttps://transformer-circuits .pub/2021/framework/index.html. Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al. Toy models of superposition. 2022. URL http://arxiv.org/abs/2209.10652v1. Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M.Not all language model features are one-dimensionally linear.In The Thirteenth In- ternational Conference on Learning Representations, 2025. URLhttps://openreview.net/forum ?id=d63a4AM4hb. Ethayarajh, K. How contextual are contextualized word rep- resentations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), p. 55–65, Hong Kong, China, November 2019. Association for Compu- tational Linguistics. doi: 10.18653/v1/D19-1006. URL https://aclanthology.org/D19-1006/. Fei, Y., Hou, Y., Chen, Z., and Bosselut, A. Mitigating label biases for in-context learning. In Rogers, A., Boyd- Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 14014–14031, Toronto, Canada, July 2023. Association for Computa- tional Linguistics. doi: 10.18653/v1/2023.acl-long.783. URLhttps://aclanthology.org/2023.acl- long.783/. Frankle, J. and Carbin, M. The lottery ticket hypothe- sis: Finding sparse, trainable neural networks. In In- ternational Conference on Learning Representations, 2019. URLhttps://openreview.net/forum ?id=rJl-b3RcF7. Fu, T., Ferrando, R., Conde, J., Arriaga, C., and Reviriego, P. Why do large language models (llms) struggle to count letters? arXiv preprint arXiv:2412.18626, 2024. URL https://arxiv.org/abs/2412.18626. Gao, J., He, D., Tan, X., Qin, T., Wang, L., and Liu, T. Representation degeneration problem in training natural language generation models. In International Conference on Learning Representations, 2019. URLhttps://op enreview.net/forum?id=SkEYojRqtm. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. URLhttp://arxiv.org/abs/2101.000 27v1. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scal- ing and evaluating sparse autoencoders. In The Thir- teenth International Conference on Learning Represen- tations, 2025. URLhttps://openreview.net/f orum?id=tcsZt9ZNKD. Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., and Gurevych, I. A survey of confidence estimation and cal- ibration in large language models. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p. 6577–6595, Mexico City, Mexico, June 2024. Association for Com- putational Linguistics. doi: 10.18653/v1/2024.naacl-lon 10 Binary Autoencoder for Mechanistic Interpretability of Large Language Models g.366. URLhttps://aclanthology.org/2024. naacl-long.366/. Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. Patchscopes: A unifying framework for inspecting hidden representations of language models. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Pro- ceedings of the 41st International Conference on Ma- chine Learning, volume 235 of Proceedings of Machine Learning Research, p. 15466–15490. PMLR, 21–27 Jul 2024. URLhttps://proceedings.mlr.press/ v235/ghandeharioun24a.html. Godey, N., Clergerie, ́ E., and Sagot, B. Anisotropy is inher- ent to self-attention in transformers. In Graham, Y. and Purver, M. (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computa- tional Linguistics (Volume 1: Long Papers), p. 35–48, St. Julian’s, Malta, March 2024. Association for Com- putational Linguistics. doi: 10.18653/v1/2024.eacl-lo ng.3. URLhttps://aclanthology.org/2024. eacl-long.3/. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. URLhttp://arxi v.org/abs/2407.21783v3. Greenewald, K., Kingsbury, B., and Yu, Y.High- dimensional smoothed entropy estimation via dimension- ality reduction. In 2023 IEEE International Symposium on Information Theory (ISIT), p. 2613–2618. IEEE, 2023. URLhttp://arxiv.org/abs/2305.047 12v2. Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on llm-as-a- judge. arXiv preprint arXiv:2411.15594, 2024. URL https://arxiv.org/abs/2411.15594. H ̈ anni, K., Mendel, J., Vaintrob, D., and Chan, L. Math- ematical models of computation in superposition. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URLhttps://openreview.net/forum ?id=OcVJP8kClR. He, J., Pan, K., Dong, X., Song, Z., LiuYiBo, L., Qiangu- osun, Q., Liang, Y., Wang, H., Zhang, E., and Zhang, J. Never lost in the middle: Mastering long-context question answering with position-agnostic decomposi- tional training. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 13628–13642, Bangkok, Thailand, August 2024. Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.736. URLhttps: //aclanthology.org/2024.acl-long.736/. Hendel, R., Geva, M., and Globerson, A. In-context learn- ing creates task vectors. In Bouamor, H., Pino, J., and Bali, K. (eds.), Findings of the Association for Compu- tational Linguistics: EMNLP 2023, p. 9318–9333, Sin- gapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.62 4. URLhttps://aclanthology.org/2023.fi ndings-emnlp.624/. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URLhttps: //openreview.net/forum?id=kIoBbc76Sy. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. Advances in neural information processing systems, 29, 2016. URL http://arxiv.org/abs/1602.02505v3. Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse autoencoders find highly inter- pretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum ?id=F76bwRSLeK. Huh, M., Mobahi, H., Zhang, R., Cheung, B., Agrawal, P., and Isola, P. The low-rank simplicity bias in deep networks. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URLhttps://openrevi ew.net/forum?id=bCiNWDmlY2. Jawahar, G., Sagot, B., and Seddah, D. What does BERT learn about the structure of language? In Korhonen, A., Traum, D., and M ` arquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, p. 3651–3657, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1356. URLhttps://aclantholo gy.org/P19-1356/. Joseph Bloom,J. L.Understanding sae fea- tures with the logit lens, 2024.URLhttps: //w.lesswrong.com/posts/qykrYY6rXX M7EEs8Q/understanding-sae-features-w ith-the-logit-lens. Kawaguchi, K., Deng, Z., Ji, X., and Huang, J. How does information bottleneck help deep learning? In Interna- tional conference on machine learning, p. 16049–16096. PMLR, 2023. URLhttp://arxiv.org/abs/23 05.18887v1. 11 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Kharlapenko, D., neverix, Nanda, N., and Conmy, A. Extracting sae task features for in-context learning, 2024. URLhttps://w.alignmentforum.org/p osts/5FGXmJ3wqgGRcbyH7/extracting-sa e-task-features-for-in-context-learn ing. Kingma, D. P. and Ba, J. Adam: A method for stochastic op- timization. arXiv preprint arXiv:1412.6980, 2014. URL https://arxiv.org/abs/1412.6980. Kissane, C., Krzyzanowski, R., Bloom, J. I., Conmy, A., and Nanda, N. Interpreting attention layer outputs with sparse autoencoders. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URLhttps://openreview .net/forum?id=fewUBDwjji. Li, J., Hou, Y., Sachan, M., and Cotterell, R. What do language models learn in context? the structured task hypothesis. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 12365–12379, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.669. URLhttps:// aclanthology.org/2024.acl-long.669/. Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. Sparse crosscoders for cross-layer fea- tures and model diffing. Transformer Circuits Thread, 2024. URLhttps://transformer-circuits .pub/2024/crosscoders/index.html. Liu, J., Zhu, D., Bai, Z., He, Y., Liao, H., Que, H., Wang, Z., Zhang, C., Zhang, G., Zhang, J., Zhang, Y., Chen, Z., Guo, H., Li, S., Liu, Z., Shan, Y., Song, Y., Tian, J., Wu, W., Zhou, Z., Zhu, R., Feng, J., Gao, Y., He, S., Li, Z., Liu, T., Meng, F., Su, W., Tan, Y., Wang, Z., Yang, J., Ye, W., Zheng, B., Zhou, W., Huang, W., Li, S., and Zhang, Z. A comprehensive survey on long context language modeling, 2025. URLhttps://arxiv.org/abs/ 2503.17407. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024a. doi: 10.1162/tacla00638. URLhttps://ac lanthology.org/2024.tacl-1.9/. Liu, Z., Kong, C., Liu, Y., and Sun, M. Fantastic semantics and where to find them: Investigating which layers of generative LLMs reflect lexical semantics. In Ku, L.- W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, p. 14551–14558, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.186 53/v1/2024.findings-acl.866. URLhttps://aclant hology.org/2024.findings-acl.866/. McInnes, L., Healy, J., Saul, N., and Großberger, L. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018. URLhttps: //arxiv.org/pdf/1802.03426. Modell, A., Rubin-Delanchy, P., and Whiteley, N. The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235, 2025. URL http://arxiv.org/abs/2505.18235v1. Nanda, N., Chan, L., Lieberum, T., Smith, J., and Stein- hardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Confer- ence on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=9XFSbDPmdW. Pan, J., Gao, T., Chen, H., and Chen, D. What in-context learning “learns” in-context: Disentangling task recogni- tion and task learning. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Findings of the Association for Computational Linguistics: ACL 2023, p. 8298–8319, Toronto, Canada, July 2023. Association for Computa- tional Linguistics. doi: 10.18653/v1/2023.findings-ac l.527. URLhttps://aclanthology.org/2023. findings-acl.527/. Pang, B. and Lee, L. Seeing stars: Exploiting class relation- ships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), p. 115–124, 2005. URLhttp://arxiv.org/abs/cs /0506075v1. Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the composi- tionality gap in language models. In Findings of the As- sociation for Computational Linguistics: EMNLP 2023, p. 5687–5711, 2023. URLhttp://arxiv.org/ab s/2210.03350v3. Qin, H., Gong, R., Liu, X., Bai, X., Song, J., and Sebe, N. Binary neural networks: A survey. Pattern Recognition, 105:107281, 2020. URLhttp://dx.doi.org/10. 1016/j.patcog.2020.107281. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://cdn.openai.com/better-langua ge-models/language_models_are_unsupe rvised_multitask_learners.pdf. Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kram ́ ar, J., Shah, R., and Nanda, N. Improving 12 Binary Autoencoder for Mechanistic Interpretability of Large Language Models dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024a. URLhttp://ar xiv.org/abs/2404.16014v2. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kram ́ ar, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024b. URL http://arxiv.org/abs/2407.14435v3. Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A.Xnor-net: Imagenet classification using binary convolutional neural networks. In European confer- ence on computer vision, p. 525–542. Springer, 2016. URLhttps://link.springer.com/chapter/ 10.1007/978-3-319-46493-0_32. Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. On the information bot- tleneck theory of deep learning. In International Confer- ence on Learning Representations, 2018. URLhttps: //openreview.net/forum?id=ry_WPG-A-. Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum ?id=Yacmpz84TH. Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netra- palli, P. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33: 9573–9585, 2020. URLhttp://arxiv.org/abs/ 2006.07710v2. Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Or- tega, A., Bloom, J., et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025. URL http://arxiv.org/abs/2501.16496v1. Shi, W., Li, S., Liang, T., Wan, M., Ma, G., Wang, X., and He, X. Route sparse autoencoder to interpret large language models. arXiv preprint arXiv:2503.08200, 2025. URL http://arxiv.org/abs/2503.08200v3. Shu, D., Wu, X., Zhao, H., Rai, D., Yao, Z., Liu, N., and Du, M. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. arXiv preprint arXiv:2503.05613, 2025. URLhttps://ar xiv.org/abs/2503.05613. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, p. 1631– 1642, Seattle, Washington, USA, October 2013. Asso- ciation for Computational Linguistics. URLhttps: //w.aclweb.org/anthology/D13-1170. Stolfo, A., Wu, B. P., and Sachan, M. Antipodal pairing and mechanistic signals in dense SAE latents. In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. URLhttps://openrevi ew.net/forum?id=Zlx6AlEoB0. Sun, X., Stolfo, A., Engels, J., Wu, B., Rajamanoharan, S., Sachan, M., and Tegmark, M. Dense sae latents are features, not bugs. arXiv preprint arXiv:2506.15679, 2025. URLhttp://arxiv.org/abs/2506.156 79v1. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., Mc- Dougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemantic- ity: Extracting interpretable features from claude 3 sonnet.Transformer Circuits Thread, 2024.URL https://transformer-circuits.pub/202 4/scaling-monosemanticity/index.html. Tishby, N. and Zaslavsky, N. Deep learning and the in- formation bottleneck principle. In 2015 ieee informa- tion theory workshop (itw), p. 1–5. Ieee, 2015. URL http://arxiv.org/abs/1503.02406v1. Tishby, N., Pereira, F. C., and Bialek, W. The informa- tion bottleneck method. arXiv preprint physics/0004057, 2000. URLhttp://arxiv.org/abs/physics/ 0004057v1. Vargas, E., Correa, C. V., Hinojosa, C., and Arguello, H. Biper: Binary neural networks using a periodic function. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, p. 5684–5693, 2024. URLhttps://openaccess.thecvf.co m/content/CVPR2024/papers/Vargas_BiP er_Binary_Neural_Networks_using_a_Pe riodic_Function_CVPR_2024_paper.pdf. Wang, L., Li, L., Dai, D., Chen, D., Zhou, H., Meng, F., Zhou, J., and Sun, X. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, p. 9840–9855, 2023. URLhttp://arxiv.org/abs/ 2305.14160v4. 13 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Wang, P., Hu, Q., Zhang, Y., Zhang, C., Liu, Y., and Cheng, J. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on computer vision and pattern recognition, p. 4376–4384, 2018. URL https://openaccess.thecvf.com/conten t_cvpr_2018/html/Wang_Two-Step_Quant ization_for_CVPR_2018_paper.html. Wang, Q., Wang, Y., Ying, X., and Wang, Y. Can in-context learning really generalize to out-of-distribution tasks? In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview .net/forum?id=INe4otjryz. Xiao, D., Meng, Q., Li, S., and Yuan, X.MUD- DFormer: Breaking residual bottlenecks in transform- ers via multiway dynamic dense connections. In Forty- second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum ?id=qkhgzNiEdj. Xu, X., Xiao, T., Chao, Z., Huang, Z., Yang, C., and Wang, Y. Can LLMs solve longer math word problems better? In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview .net/forum?id=C9ju8QQSCv. Yang, J., Shen, X., Xing, J., Tian, X., Li, H., Deng, B., Huang, J., and Hua, X.-s. Quantization networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 7308–7316, 2019. URLhttps://openaccess.thecvf.com/co ntent_CVPR_2019/html/Yang_Quantizati on_Networks_CVPR_2019_paper.html. Zhang, Y. and He, Z. Large language models can not per- form well in understanding and manipulating natural lan- guage at both character and word levels? In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the As- sociation for Computational Linguistics: EMNLP 2024, p. 11826–11842, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653 /v1/2024.findings-emnlp.691. URLhttps://aclant hology.org/2024.findings-emnlp.691/. Zhao, Z., Wallace, E., Feng, S., Klein, D., and Singh, S. Calibrate before use: Improving few-shot performance of language models. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Ma- chine Learning, volume 139 of Proceedings of Machine Learning Research, p. 12697–12706. PMLR, 18–24 Jul 2021. URLhttps://proceedings.mlr.press/ v139/zhao21c.html. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016. URLhttps://arxiv.or g/abs/1606.06160. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urta- sun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watch- ing movies and reading books. In Proceedings of the IEEE international conference on computer vision, p. 19–27, 2015. URLhttp://arxiv.org/abs/1506 .06724v1. 14 Binary Autoencoder for Mechanistic Interpretability of Large Language Models A. Experiment Details A.1. Synthetic Directional Benchmarking (§4.1) In the experiment of §4.1, we utilize the hyperparameters as: original vector dimensionailityd = 2048, rankr ∈ 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, α e = 5× 10 −7 , α c = 10 −6 . We generate all65536samples, withn = 52428training samples and the remaining as validation samples, with non- mentioned hyperparameters kept as the default. After the training, we run BAE again on all65536samples before for binary encoding h B , and calculate the marginal entropy on these h B as mentioned in Eq. 4. A.2. Tracing Normal Sentence Modeling by Entropy (§4.2) 02468 Number of ICL Examples (k) 0.2 0.4 0.6 0.8 Accuracy ICL Accuracy vs Number of Examples Figure 7. ICL accuracy calculated on the whole generated datasets on various k. In the experiment of §4.2, we generaten = 209715data samples for the training of BAE, with more52429data samples as the validation set from the Pile-train split, with non-mentioned hyperparameters kept as the default. Especially, we filter out all the input sentences to the Llama 3.2-1B with a token length less than 1024 to keep the length of hidden state sets among all positions aligned. After the training, we run BAE again on all262144samples before for binary encodingh B , and calculate the marginal entropy on these h B as mentioned in Eq. 4. A.3. Tracing In-context Learning Inference by Entropy (§4.3) Introduction of ICL and Input Format. ICL (Radford et al., 2019; Dong et al., 2022) typically utilizes concatenations of input(x)-answer(y) pairs (with amountk, called demonstrations) to define a task, and requires the LM to generate the answer of the last input (x q , called query) similar to the demonstrations. The inputs of ICL are built like[x 1 ,y 1 ,x 2 ,y 2 ,...,x k ,y k ,x q ]. In practice, the input samples used in the experiments are built on the toolkit StaICC (Cho & Inoue, 2025), similar to the instance below: sentence: a genuinely moving and wisely unsentimental drama . sentiment: positive sentence: laughs – sometimes a chuckle , sometimes a guffaw and , to my great pleasure , the occasional belly laugh sentiment: positive sentence: that the entire exercise has no real point sentiment: negative sentence: 90 punitive minutes of eardrum-dicing gunplay , screeching-metal smashups , and flaccid odd-couple sniping sentiment: negative sentence: amid the new populist comedies that underscore the importance of family tradition and familial community sentiment: positive sentence: freakshow sentiment: negative sentence: a taste for the quirky sentiment: positive sentence: rustic , realistic , and altogether creepy sentiment: positive sentence: assured direction and complete lack of modern day irony sentiment: Figure 8. An example of ICL input in experiments of §4.3 with 8 demonstrations. Hyperparameters. We sample262144ICL input samples withk ∈ 0, 1, 2, 4, 8, and extract the hidden states of the last token (i.e., “:”) from layer0, 2, 4, 6, 8, 10, 12, 14, 16, with209715samples for the training, and the remaining for validation. After the training, we run BAE again on all262144samples before for binary encodingh B , and calculate the marginal entropy on these h B as mentioned in Eq. 4. Accuracy. As supplementary information, we test the classification accuracy of ICL on the given settings as shown in Fig. 7. A.4. ComSem Evaluation on BAE and SAE Variants (§5.3) Baselines. As shown in Table 1, we utilize 4 baselines with implementation details shown below: 15 Binary Autoencoder for Mechanistic Interpretability of Large Language Models • Sparse Autoencoder (ReLU SAE). The feedforward calculation of ReLU SAE is defined as: ˆ h 0 = ReLU(h 0 W in )W out + b.(8) During the training, the L 1 normalization is applied to the inner activation ReLU(h 0 W in ): L(H 0 ) =L r (H 0 ) + α X h 0 ∈H 0 ∥ReLU(h 0 W in )∥ 1 ,(9) the α is the normalization factor, defaulted to 10 −7 . • Top-k SAE (Bussmann et al., 2024). The feedforward calculation of Top-k SAE is defined as: ˆ h 0 = Top k (h 0 W in )W out + b.(10) TheTop k function retains only theklargest elements in place, setting all others to zero. During the training, only the regression loss is utilized: L(H 0 ) =L r (H 0 ),(11) the k is defaulted to 15. • Gated ReLU SAE (Rajamanoharan et al., 2024a). The feedforward calculation of Gated ReLU SAE is defined as: ˆ h 0 = GateReLU γ (h 0 W in )W out + b.(12) The GateReLU γ function is a thresholded variant of the ReLU, defined as: GateReLU γ (x) = ( x, x > γ, 0,otherwise, (13) whereγis a tunable gating parameter, defaulted to be0.5. During the training, theL 1 normalization is applied to the inner activation GateReLU γ (h 0 W in ): L(H 0 ) =L r (H 0 ) + α X h 0 ∈H 0 ∥GateReLU γ (h 0 W in )∥ 1 ,(14) • Transcoder (Dunefsky et al., 2024). The feedforward calculation of Transcoder is defined as: h 2 = ReLU(h 0 W in )W out + b.(15) Where theh 0 is the input towards the(l− 1)-th MLP block 7 . During the training, theh 2 is aligned to the output of the (l− 1)-th MLP block, i.e., the hidden state of layer l, and L 1 normalization is applied to the inner activation: L(H 0 ) = 1 n b ∥H 2 − H l ∥ 2 + α X h 0 ∈H 0 ∥GateReLU γ (h 0 W in )∥ 1 ,(16) due to the significant difference in methodology with SAE variants, we do not regard Transcoder as a major comparison object. Parameters of BAE/SAE Training. We samplen = 8243323hidden state vectors from the specific layers of Llama 3.2-1B on the Pile-train split, with6594658as the training samples for BAE/SAE, and the remaining for the validation. The autoencoders are trained for 200 epochs, with α e = 0 in the first 50 epochs. We visualize the burstiness-based activation magnitude calculation of BAE in Fig. 10. 7 We note that the (l− 1)-th MLP block produces the hidden states of layer l. 16 Binary Autoencoder for Mechanistic Interpretability of Large Language Models ... 0.0.0. Test Sample ... 0.10.90.70.6 0/1 1 0/1 1 0/1 0 ... 1 _ 0.90.10.70.4 ... Activation Magnitude: BAE Channel Index with Sufficient Magnitude: 02 ... 0/1 0 0/1 1 0/1 1 ... 1 Compressed Code 0/1 1 0/1 1 0/1 0 ... 1 Flip BAE ... 0. 0. 0. Reconstructed CompressionReconstruction Figure 9. Diagram of hidden state compression and reconstruction utilizing BAE. 0/10/10/10/1 ... ... 0/10/10/10/1 ... 0/10/10/10/1 Feature Set BAE ... 0.0.0. Test Samp. ... 0/10/10/10/1 Average ... 0.10.90.70.6 0/1 1 0/1 1 0/1 0 ... 1 _ 0.90.10.70.4 ... Activation Magnitude: Figure 10. Burstiness-based activation magnitude calculation process described in §5.2. Parameters of ComSem. The pseudocode of ComSem is shown in Algorithm 1, with de- fault parametersN = 8192,n I = 5,n T = 8, k = 10. The evaluations are conducted on the train split of BookCorpus, which is differ- ent from the training set of BAE/SAEs (Pile). The ̄ h B is averaged among all the input in- stances of the evaluation set. The instructions for calling the backend LLMs are shown in Fig. 22 and 23, with queries’ format the same as the given examples. We utilize GPT-4.1 and GPT-4.1-mini (Achiam et al., 2023) as the backend LM. A.5. Vector Compression by BAE As shown in Fig. 9, the compression of a hidden state vector h 0 proceeds as follows: 1. Given the inputh 0 to be compressed, we first compute its binary encodingh B by applying the input projectionW in and binarize functionΓof a BAE, and then measure the burstinessβwith respect to the prior distribution ̄ h B , as described in §5.2 and Fig. 10. 2. Given a threshold 8 B ∈ [log 0.5, 0], we bookkeep only the channel indices i for which β i > B, using these indices as the compressed representation of h 0 . 3.To reconstructh 0 from the bookkept indices, we round ̄ h B to a binary vector, flip the bits ofround( ̄ h B )at the stored indices, and then pass the flipped binary vector through W out (also the bias term) to obtain the recovered h 0 . Hyperparameters & Experiment Settings. We sample117864hidden state vectors from the specific layers on BookCor- pus, then utilize the BAE trained on Pile for the compression, following the processing above. The ̄ h B is averaged among all the input instances of the compressed set. B. Evaluating Entropy Estimation on LLM Hidden States Set To extend our discussion in §4.1, in this section, we focus on evaluating the entropy evaluation on benchmarks extracted from real LLMs. In detail, similar to §4.1, given a ground-truth entropyr, we samplen r-length token sequences, with elements randomly sampled from two given tokens (in our implementation, tokens with index 0 and 1). From such a sampling method, we can get an input sequence set with entropyr. Input these sequences into LMs (Llama 3.2-1B in our implementation) and extract the hidden states from thel-th layer (Layer 12 in our implementation), we can get a hidden 8 Notice that the threshold greater than0is trivial, given theh B ∈0, 1 d ′ , ̄ h 1 ∈ [0, 1], so that all the elements oflog|h B − ̄ h 1 |⩽ 0. And burstiness less than log 0.5 causes wrong flip of bits in Step 3. 17 Binary Autoencoder for Mechanistic Interpretability of Large Language Models states set with entropy not larger thanr 9 . Repeat the training-and-evaluation processing shown in §4.1, we calculate the entropy with variosras shown in Fig. 11, where the results are consistent with our inference. But this also indicates that such methods based on LM hidden states cannot serve as a strong benchmark and should only be considered a complement to our synthetic approach. C. Interpretability Score Distribution: Selecting a Better Interpretable Feature Set 10 2 10 3 Synthetic Feature Rank 10 2 10 3 Calculated Entropy / bits Inner Dimension 2048 4096 8192 16384 32768 Inner Dimension 2048 4096 8192 16384 32768 0e+00 3e-03 5e-03 7e-03 1e-02 1e-02 1e-02 Auto-regression Loss (L2) Figure 11. Evaluation results on LM-extracted hidden states. We visualize the ComSem interpretability score distribution among features of BAE and baselines in Fig. 24 - 28, where compared to all the baselines, BAE significantly increased the amount of both high-score and low-score features. As shown in Appendix F, these low-score features are non-literal, that is, the semantics of these fea- tures can not be interpreted by the token inputs or outputs in ComSem, so that, despite serving as an effective proof of BAE’s stronger extrac- tion capability, harm the average interpretability score among all the activated features. However, as shown in the curves of Fig. 24 - 28 visualizing the average interpretability scores with all the previous sorted features, the curves of BAE are located above the baseline models in most cases. This indicates that when we remove some low-score features, BAE will achieve the highest average interpretability scores and also the number of interpretable features. This is a benefit of BAE’s extensive feature extraction capability, which means that compared to baseline models that extract a limited number of features, BAE has space to perform trade-offs on both the number of features and interpretability scores, while ensuring that both are optimal, as shown in Table 4 with some of the low-score BAE features ignored. D. Natural Low-entropy Tendency of BAE Training Table 4. Evaluation of BAE with low-score fea- tures ignored and baselines. Feat. Source Model Feature Activated ComSem 4.1 FI#Score Llama 3.2-1B Layer 11 d = 2048 ReLU SAE206513800.260 Top-k SAE341726840.452 Gated ReLU SAE122610260.557 ReLU SAE Resc.7444820.278 TransCoder179410900.239 BAE (top 3500)350035000.574 Llama 3.2-1B Layer 14 d = 2048 ReLU SAE252816000.217 Top-k SAE270220040.418 Gated ReLU SAE294822500.435 ReLU SAE Resc.396226610.274 TransCoder340121660.267 BAE (top 3970)397039700.496 Llama 3.2-3B Layer 20 d = 3072 ReLU SAE192312890.312 Top-k SAE323425080.425 Gated ReLU SAE462835800.437 ReLU SAE Resc.212214510.294 TransCoder550834240.257 BAE (top 5700)570068050.544 Moreover, interestingly, when the estimated entropy constraint is disabled (red in Fig. 2), the entropy gathers higher but near the diagonal, suggesting that BAE without entropy training objective can find a relatively low-entropy encodingh B , even if no normalizations are conducted to minimize the encoding entropy. To get a closer observation, we plot the self-regression lossL r and training-time entropyL e on every training step, withr = 2 (more cases in Appendix G.2), as shown in Fig. 12, with normal setting (left) orα e = α c = 0(right). In the normal setting (left), a stable and monotonous convergence to both minimal entropy and self-regression loss can be observed. However, if the entropy loss is disabled (right), the training dynamics become multiphase and non-monotonic. Specifically, Phase 1: the loss rapidly decreases while the hidden activations maintain relatively high entropy; Phase 2: entropy sharply drops to near zero, with the loss remaining low, even if no entropy and weight penalty are utilized; and Phase 3: both loss and entropy oscillate in a narrow range at the end of training. Such an observation suggests that: Gradient descent on simple regression loss finds relatively “simple” representations, after a long-term stagnation on loss value, supporting previous work on training dynamics (Tishby & Zaslavsky, 2015; Saxe et al., 2018; Huh et al., 2023; Nanda et al., 2023; Shah et al., 2020; Frankle & Carbin, 2019; Bartlett et al., 2020). Additionally, enabling the entropy penalty suppresses the harmful Phase 3 oscillations, confirming its effectiveness. 9 Notice that in §4.2, we show that during the forward pass of an LM, the hidden state of a token may not contain all information from its preceding context; therefore, hidden states on a single position may carry reduced information. 18 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Feature index: #111 Feature interpretation: “indefinite pronouns and quantifiers” Interpretability score: 0.75 Activate tokens & contexts: #1: “but just one look at a minion sent him practically catatonic .” #2: “while some might look on her as having a mark against her character being an unwed mother , she had lived a relatively non-rebellious life .” #3: “one thing her mother prided herself on was being on time and lending a hand .” #4: “while it was good that he had so many male role models , she only hoped he had n’t inherited too much of his father ’s personality .” #5: “after megan watched them disappear in the crowd of family and friends waiting in the church alcove , she bypassed everyone by turning right and heading down the hallway .” #6: “more than anything , she wanted to have a little fun herself .” #7: “tell me what ?” ... Figure 13. Case analysis for feature 111 “indefinite pronouns and quantifiers” with interpretability score 0.75 in Llama 3.2-1B Layer 14. E. Proof: More Burst Features Carry More Information Figure 12. Training dynamics of BAE (left) with entropy objective and (right) without entropy objective. The horizontal axes are the self-regression training loss (L r ), and the vertical axes are the entropy calculated from the training batch. This section substantiates our claim that features with higher burstiness contain more information and should therefore be preferentially selected as activated features. Our proof is grounded in fundamental principles of infor- mation theory: Theorem E.1 (Burst Features Carry More Information). LetX 1 ∼ Bernoulli(p 1 ),X 2 ∼ Bernoulli(p 2 ), the infor- mation of an observation (i.e., the actual value of hidden activation in a specified channel)x∈0, 1can be writ- ten as I X (x) :=− log Pr[X = x]. Then: I X 1 (x 1 )⩾ I X 2 (x 2 )⇔|x 1 − p 1 |⩾|x 2 − p 2 | The proof is straightforward. That is, a more bursty channel carries more information, while our estimator of the total entropy from the hidden state set is fixed. Therefore, identifying these more bursty features is more conducive to reconstructing this particular observation from the average (i.e., the typical set). F. Case Analysis: Extracted Features In this section, we observe several features along with their corresponding tokens and contexts where the investigated feature is activated, from the BAE trained on Layer 14 hidden states of Llama 3.2-1B. We observe in the category of these high-score and low-score features, and find that the low-score features are more folded and implicit in the hidden states, so it is harder to extract. F.1. High-score Features We first list some features with high interpretability scores, along with their corresponding activation tokens and contexts, as shown in Fig. 13 - 17. Among these high-score cases, we observe that: for the ComSem, it is easy to correctly identify semantic similarities grounded in the natural semantics of single tokens (e.g., different variants of the same word (Fig. 17), nouns describing objects of the same category (Fig. 14, 15, 16), or function words serving similar grammatical roles (Fig. 13)). We infer that: (1) The similarities in the hidden states of such tokens can be easily extracted from the residual stream based on the embedding vectors. Also, (2) the similarities in these activated tokens are clear towards the backend 19 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Feature index: #1045 Feature interpretation: “numerical quantifiers and ordinals” Interpretability score: 0.75 Activate tokens & contexts: #1: “one day when they had their own place again , she would get him a dog .” #2: “sean acknowledged her with a two finger salute before cranking up and pulling down the driveway .” #3: “one thing her mother prided herself on was being on time and lending a hand .” #4: “one time .” #5: “but that was six months ago .” #6: “aidan introduced him to his four sisters and their husbands .” #7: “after exchanging hugs with emma and reassuring her at least twenty times that she would be fine and that she did need to go home , pesh led megan out the front door .” ... Figure 14. Case analysis for feature 1045 “numerical quantifiers and ordinals” with interpretability score 0.75 in Llama 3.2-1B Layer 14. Feature index: #2289 Feature interpretation: “verbs of visual attention or perception” Interpretability score: 0.875 Activate tokens & contexts: #1: “but just one look at a minion sent him practically catatonic .” #2: “each time she looked into mason ’s face , she was grateful that he looked nothing like his father .” #3: “megan asked , gazing from noah ’s... 9 ” #4: “peeking out from the covering , she saw emma was wearing her signature color , green .” #5: “of course , he ’d probably argue that while the gown might not have held up , he still looks fabulous and much younger than his age .” #6: “megan glanced between the two of them .” #7: “she stared into his face before she responded .” ... Figure 15. Case analysis for feature 2289 “verbs of visual attention or perception” with interpretability score 0.875 in Llama 3.2-1B Layer 14. LLMs. So that these features can be interpreted into natural language interpretation from the plain tokens by LLMs appropriately. F.2. Low-score Features Also, we list some features with low interpretability scores, as shown in Fig. 18 - 21. Among these low-score features, we summarize the characteristics of these features: (1) Subwords from the tokenizer. As shown in Fig. 19, 21, these subword tokens share no obvious commonality even for humans, and they greatly confuse the LLM, leading it to produce irrelevant interpretations. This illustrates the drawback of token-based interpretation, as noted in our limitations. (2) Function words with similar context. As shown in Fig. 18, 20, these function-word tokens, despite appearing in similar contextual environments (often originating from the same sentence or even the same region within a sentence), lack surface-level commonality and are therefore difficult to interpret by ComSem. This highlights the urgency of interpreting directly from the decoded feature vectors instead of original tokens. Moreover, from this phenomenon, a further hypothesis is that function words without clear semantics may be more likely to concentrate contextual information during contextualization. (3) Character level similarity. As shown in Fig. 21, the tokens activated on feature 5949 all end with “em”, according to previous works (Zhang & He, 2024; Fu et al., 2024), backend LLM may struggle to process such inputs, causing a low 9 To comply with the requirements of the ethics review, we have redacted potentially offensive or NSFW content from these input cases. 20 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Feature index: #7700 Feature interpretation: “body part nouns” Interpretability score: 1.0 Activate tokens & contexts: #1: “when he started to whine , she shook her head .” #2: “he grinned and then happily dodged her mother ’s arms for her father ’s instead , which made megan smile .” #3: “noah momentarily stopped sucking on the bottle to flash a quick smile , which warmed megan ’s heart .” #4: “emma rolled her eyes .” #5: “casey gasped as her hand flew to her chest dramatically .” #6: “casey tapped her chin with her index finger ” #7: “megan pursed her lips at the prospect .” ... Figure 16. Case analysis for feature 7700 “body part nouns” with interpretability score 1.0 in Llama 3.2-1B Layer 14. Feature index: #8183 Feature interpretation: “modal auxiliary verbs expressing ability or possibility” Interpretability score: 1.0 Activate tokens & contexts: #1: “as she went to the couch and picked him up , she could n’t help finding it amusing that out of everyone he was going to see today , he was most excited about being with aidan and emma ’s black lab , beau .” #2: “sadly , she could n’t say that her first love was davis , mason ’s father .’ #3: “considering i have two younger brothers , i think i can handle him .” ... Figure 17. Case analysis for feature 8183 “modal auxiliary verbs expressing ability or possibility” with interpretability score 1.0 in Llama 3.2-1B Layer 14. interpretability score. F.3. Scale of High-score Features and Low-score Features In this section, we argue that: compared to high-scoring features, low-scoring features are embedded in more fine-grained structures of the hidden states, making them difficult to cluster with UMAP. Consequently, extracting such features requires stronger extraction capability. Compared to the baselines, BAE captures more of these features, confirming its superior feature extraction ability. In detail, we conduct UMAP (McInnes et al., 2018), a dimensionality reduction based on the macroscopic adjacency structure from Euclidean distance, on hidden states of all the tokens from the sentences whose token activates the investigated feature, as shown in Fig. 42 for 3 high-score features, and Fig. 43 for 3 low-score features, where an important observation is that: the hidden states activated by high-scoring features form clear clusters, suggesting macroscopic spatial similarity. In contrast, the hidden states activated by low-scoring features are more dispersed overall, exhibiting similarity only within certain subspaces (i.e., along the corresponding rows of the activated features inW out ). That is, the low-score features are more folded and implicit, so that harder to extract. G. Augmentation Experiments and Results G.1. Activation Magnitude Visualization of BAE and SAE We visualize the activation magnitude of each inputtedh 0 (vertical axis) on each channel (horizontal axis) to observe the sparsity of BAE (left sub-figures) and SAE (right sub-figures) as shown in Fig. 35 - 38. Besides the clear sparsity of BAE, the visualization of SAE shows vertical stripes, indicating consistently high-activated channels, which are absent in BAE. This 21 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Feature index: #514 Feature interpretation: “common auxiliary verbs and determiners” Interpretability score: 0.0 Activate tokens & contexts: #1: “he ’d seen the movie almost by mistake , considering he was a little young for the pg cartoon , but with older cousins , along with her brothers , mason was often exposed to things that were older .” #2: “he ’d seen the movie almost by mistake , considering he was a little young for the pg cartoon , but with older cousins , along with her brothers , mason was often exposed to things that were older .’ #3: “she liked to think being surrounded by adults and older kids was one reason why he was a such a good talker for his age .” #4: “she liked to think being surrounded by adults and older kids was one reason why he was a such a good talker for his age .” #5: “it was only his build that he was taking after his father .” #6: “while it had been no question that she wanted him as godfather for mason , she had been extremely honored when he and his wife , emma , had asked her to be their son , noah ’s , godmother .” #7: “while it had been no question that she wanted him as godfather for mason , she had been extremely honored when he and his wife , emma , had asked her to be their son , noah ’s , godmother .” #8: “i plan on spoiling noah rotten and corrupting him as only a good auntie can do !” #9: “i plan on spoiling noah rotten and corrupting him as only a good auntie can do !” #10: “i plan on spoiling noah rotten and corrupting him as only a good auntie can do !” ... Figure 18. Case analysis for feature 514 “common auxiliary verbs and determiners” with interpretability score 0.0 in Llama 3.2-1B Layer 14. suggests that utilizing global normalization for sparsity, and calculating burstiness as the activation magnitude effectively suppresses the dense activations seen in SAE. G.2. More Cases for Fig. 12 We examine the training dynamics of BAE with and without the entropy objective (L e ) under additional settings, extending the observations from Fig. 12. The results, shown in Fig. 29 - 34, reveal that a larger rank leads to more complex training behavior, while incorporating the entropy objective reshapes the dynamics into a smoother and more monotonic form. G.3. More Cases for Fig. 5 We extend the investigation scope of Fig. 5, where the activation frequency of each feature is visualized, as shown in Fig. 39 - 41. The conclusion derived from these augmentation results is consistent with the main body. 22 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Feature index: #2410 Feature interpretation: “abbreviations or truncated forms of words” Interpretability score: 0.0 Activate tokens & contexts: #1: “he reminds me of that bollywood actor john abraham , ” casey said .” #2: “as he eyed the massive statue of jesus , he fidgeted absently with his tie .’ #3: “he was taken aback by her words and the passion with which she delivered them .” #4: “her eyes frantically scanned the room .” #5: “her gaze flicked down to the tan , muscled arm .” #6: “he welcomed the pain as she went over the edge , convulsing and screaming against his hand .” #7: “instead , she just enjoyed being close to him , the feel of his hand on her , his strong , muscled thighs beneath her .” ... Figure 19. Case analysis for feature 2410 “common auxiliary verbs and determiners” with interpretability score 0.0 in Llama 3.2-1B Layer 14. Feature index: #3479 Feature interpretation: “adverbs and pronouns used as modifiers or objects” Interpretability score: 0.0 Activate tokens & contexts: #1: “each time she looked into mason ’s face , she was grateful that he looked nothing like his father .” #2: “with a sigh , she stepped into the dress and slid it over her hips .’ #3: “while it boasted a sweetheart neckline , the hemline fell just below her knees .” #4: “while it boasted a sweetheart neckline , the hemline fell just below her knees .” #5: “we ’l see you there in just a few , ” her mother called .” #6: “with her grades already in the toilet , she was unprepared for the emotional breakdown she experienced when carsyn broke up with her .” #7: “with her grades already in the toilet , she was unprepared for the emotional breakdown she experienced when carsyn broke up with her .” ... Figure 20. Case analysis for feature 3479 “adverbs and pronouns used as modifiers or objects” with interpretability score 0.0 in Llama 3.2-1B Layer 14. Feature index: #5949 Feature interpretation: “tokens related to trembling or shaking actions” Interpretability score: 0.0 Activate tokens & contexts: #1: “her chin trembled as she replied , “ i want that for my son . ” #2: “her lip trembled .’ #3: “her chin trembled before big , fat tears slid down her cheeks .” #4: “her body trembled slightly at his words .” #5: “lem me guess , you bribed him sexually to get him to give up a perfectly good saturday watching two kids who are under two .” ... Figure 21. Case analysis for feature 5949 “tokens related to trembling or shaking actions” with interpretability score 0.0 in Llama 3.2-1B Layer 14. 23 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Algorithm 1: Common Semantics-based Feature Interpretation and Evaluation (ComSem). Parameters: f :R d →R d ′ : Feature encoder, mapping original inputh 0 to the magnitude of decomposed featuresh ′ (e.g., encoding part of a typical SAE, or burstiness calculation for a BAE). d ′ : feature numbers. n I : samples for generating interpretation. n T : max test samples. k: top-? features that are seemed as activated. LM(·): backend LM client for generate the interpretation and test. Prompt Interp (·): Prompt template for interpretating the features (described in Fig. 22). Prompt Test (·): Prompt template for interpretating the features (described in Fig. 23). Input: Set (of amount N ) of token-hidden state of token-context triple: S =(t,h 0 ,c) (i) ,c = t j N i=1 Initialization: D S = : dictionary for each feature (length:d ′ ), keeping the samples where the corresponding feature is activated. D I =: dictionary for each feature, keeping the interpretation and evaluation score. FA = 0: Number of feature activated. FI = 0: Number of interpretable features. Score = 0: Averaged interpretability score amoung the successfully interpreted features. / * 1. Find the activated input samples for each feature channel. * / for i : = 1 to n do a : = f (h (i) 0 ) // Calculate activation magnitude of each feature channel. I : = arg max k (a) // Find top-k activated feature index. for j inI do D S [j].append(t (i) , c (i) ) // Bookkeep the activated sample for each feature channel. end end / * 2. Get the interpretation for each feature channel. * / for i : = 1 to d ′ do if len(D S [i])⩾ n I + 1 then // Query the LLM for the interpretation, only use n I samples, with the remaining ones for evaluation. D I [i] : ="Interp" = LM(Prompt Interp (D S [i][1 : n I ])), "Activated" = True FA : = FA + 1 else // Reject the interpretation if samples are not sufficiently loaded. D I [i] : ="Interp" = None, "Activated" = False end end / * 3. Evaluate the interpretation for activated feature channels. * / for i : = 1 to d ′ do ifD I [i]["Activated"] then D I [i]["Score"] : = 0 for (t (j) ,c (j) ) inD S [i][n I + 1 : n I + n T ] do // For each query in the test set (number bounded by the n T ), query the LLM for judging the matching of each input sample and its interpretation in True or False. if LM(Prompt Test ((t (j) ,c (j) ),D I [i]["Interp"]) then D I [i]["Score"] =D I [i]["Score"] + 1 end end D I [i]["Score"] =D I [i]["Score"]/len(D S [i][n I + 1 :]) Score : = Score +D I [i]["Score"] ifD I [i]["Score"] > 0 then FI : = FI + 1 end end end returnD I , Score/FA, FA, FI 24 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Instruction: I will provide a set of tokens along with their positions (this position may vary depending on the tokenizer) and the surrounding context. Please describe what these tokens have in common using concise expressions such as “date expressions”, “words ending in ‘ing’”, or “adjectives”. Please choose the most specific term while ensuring commonality, and avoid using overly general terms like “words”, “English tokens”, “high-frequency English lexemes”, or “phrases”. Non-semantic or non-linguistic terms such as “BPE Subword Token” are strictly prohibited. Any additional information, explaination, or context are strongly prohibited. Only return one phrase. Example 1: Token: “running” at position 3 in sentence: “She is running in the park.” Token: “eating” at position 3 in sentence: “He is eating an apple.” Token: “sleeping” at position 4 in sentence: “The baby is sleeping on the sofa.” Token: “jumping” at position 2 in sentence: “They are jumping over the fence.” Token: “talking” at position 3 in sentence: “We are talking about the project.” The commonality is: -ing verbs of human behavior Example 2: Token: “yesterday” at position 4 in sentence: “I went there yesterday.” Token: “last week” at position 5 in sentence: “She arrived last week.” Token: “in 1998” at position 6 in sentence: “They moved here in 1998.” Token: “last year” at position 5 in sentence: “We met last year.” The commonality is: past time expressions Example 3: Token: “happy” at position 4 in sentence: “She looks very happy today.” Token: “angry” at position 5 in sentence: “They were extremely angry about it.” Token: “sad” at position 4 in sentence: “He felt really sad after the call.” The commonality is: emotional adjectives Example 4: Token: ”dog” at position 2 in sentence: ”The dog barked loudly.” Token: ”cat” at position 2 in sentence: ”The cat chased the mouse.” Token: ”bird” at position 2 in sentence: ”The bird sang beautifully.” Token: ”fish” at position 2 in sentence: ”The fish swam gracefully in the tank.” The commonality is: animal nouns (...) Example 11: Token: “sad” at position 4 in sentence: “He felt really sad after the call.” Token: “angry” at position 5 in sentence: “They were extremely angry about it.” Token: “negative” at position 4 in sentence: “She had a negative reaction to the news.” The commonality is: negative emotion adjectives Now, please analyze the following tokens and their contexts: (Test Sample) Figure 22. Instruction text for interpreting the feature semantics. 25 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Background: I will provide a token, its position in the sentence, the surrounding context, and a candidate description of the token’s role or type given the context. Your task is to judge whether the given description accurately characterizes the token in its context. Please respond with either: - “Yes” (if the description is accurate), or - “No” (if it is inaccurate) Any additional information, explaination, or context are strongly prohibited. Only return “Yes” and “No”. Example 1: Token: “running” at position 3 in sentence: “She is running in the park.” Candidate description: “present participle” Answer: Yes Example 2: Token: “dog” at position 2 in sentence: “The dog barked loudly.” Candidate description: “adjective” Answer: No Example 3: Token: “quickly” at position 4 in sentence: “He ran quickly toward the exit.” Candidate description: “manner adverb” Answer: Yes Example 4: Token: “first” at position 4 in sentence: “This is the first time I have seen this.” Candidate description: “ordinal number” Answer: Yes Example 5: Token: “to” at position 5 in sentence: “I want to go to the store.” Candidate description: “emotional verb” Answer: No Example 6: Token: “looking” at position 3 in sentence: “She is looking forward to the event.” Candidate description: “verb related to oral communication” Answer: No (...) Example 22: Token: “fish” at position 2 in sentence: “The fish swam gracefully in the tank.” Candidate description: “noun describing an animal” Answer: Yes Now, please analyze the following tokens, their positions, contexts, and candidate descriptions: (Test Sample) Figure 23. Instruction text for testing the interpretation. 26 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 0.20.40.60.81.0 Score Accumulate Average Scores Distribution ReLU SAEBAEGated ReLU SAETop-k SAE 0.000.250.500.751.00 Score ReLU SAE Single Feature Channel Scores Distribution ReLU SAE, Feature Activated = 1922 Inactive Features 0.000.250.500.751.00 Score BAE Single Feature Channel Scores Distribution BAE, Feature Activated = 3012 Inactive Features 0.000.250.500.751.00 Score Gated ReLU SAE Single Feature Channel Scores Distribution Gated ReLU SAE, Feature Activated = 17 Inactive Features 0 1000 2000 3000 4000 5000 6000 7000 8000 8192 Feature Index (sorted) 0.000.250.500.751.00 Score Top-k SAE Single Feature Channel Scores Distribution Top-k SAE, Feature Activated = 3531 Inactive Features Figure 24. ComSem interpretability score distribution of each feature channel on layer 5 of Llama 3.2-1B. The colored areas represent the interpretability scores of individual channels (sorted), and the curves show the averaged interpretability scores across all channels previously. 27 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 0.40.60.81.0 Score Accumulate Average Scores Distribution ReLU SAEBAEGated ReLU SAETop-k SAE 0.00.20.40.60.81.0 Score ReLU SAE Single Feature Channel Scores Distribution ReLU SAE, Feature Activated = 3179 Inactive Features 0.00.20.40.60.81.0 Score BAE Single Feature Channel Scores Distribution BAE, Feature Activated = 4675 Inactive Features 0.00.20.40.60.81.0 Score Gated ReLU SAE Single Feature Channel Scores Distribution Gated ReLU SAE, Feature Activated = 100 Inactive Features 0 1000 2000 3000 4000 5000 6000 7000 8000 8192 Feature Index (sorted) 0.00.20.40.60.81.0 Score Top-k SAE Single Feature Channel Scores Distribution Top-k SAE, Feature Activated = 3179 Inactive Features Figure 25. ComSem interpretability score distribution of each feature channel on layer 9 of Llama 3.2-1B. 28 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 0.40.60.81.0 Score Accumulate Average Scores Distribution ReLU SAEBAEGated ReLU SAETop-k SAE 0.00.20.40.60.81.0 Score ReLU SAE Single Feature Channel Scores Distribution ReLU SAE, Feature Activated = 2065 Inactive Features 0.00.20.40.60.81.0 Score BAE Single Feature Channel Scores Distribution BAE, Feature Activated = 5464 Inactive Features 0.00.20.40.60.81.0 Score Gated ReLU SAE Single Feature Channel Scores Distribution Gated ReLU SAE, Feature Activated = 1226 Inactive Features 0 1000 2000 3000 4000 5000 6000 7000 8000 8192 Feature Index (sorted) 0.00.20.40.60.81.0 Score Top-k SAE Single Feature Channel Scores Distribution Top-k SAE, Feature Activated = 3417 Inactive Features Figure 26. ComSem interpretability score distribution of each feature channel on layer 11 of Llama 3.2-1B. 29 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 0.20.40.60.81.0 Score Accumulate Average Scores Distribution ReLU SAEBAEGated ReLU SAETop-k SAE 0.00.20.40.60.81.0 Score ReLU SAE Single Feature Channel Scores Distribution ReLU SAE, Feature Activated = 2528 Inactive Features 0.00.20.40.60.81.0 Score BAE Single Feature Channel Scores Distribution BAE, Feature Activated = 6120 Inactive Features 0.00.20.40.60.81.0 Score Gated ReLU SAE Single Feature Channel Scores Distribution Gated ReLU SAE, Feature Activated = 2948 Inactive Features 0 1000 2000 3000 4000 5000 6000 7000 8000 8192 Feature Index (sorted) 0.00.20.40.60.81.0 Score Top-k SAE Single Feature Channel Scores Distribution Top-k SAE, Feature Activated = 2702 Inactive Features Figure 27. ComSem interpretability score distribution of each feature channel on layer 14 of Llama 3.2-1B. 30 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 0.40.60.81.0 Score Accumulate Average Scores Distribution ReLU SAEBAEGated ReLU SAETop-k SAE 0.00.20.40.60.81.0 Score ReLU SAE Single Feature Channel Scores Distribution ReLU SAE, Feature Activated = 1923 Inactive Features 0.00.20.40.60.81.0 Score BAE Single Feature Channel Scores Distribution BAE, Feature Activated = 9308 Inactive Features 0.00.20.40.60.81.0 Score Gated ReLU SAE Single Feature Channel Scores Distribution Gated ReLU SAE, Feature Activated = 4628 Inactive Features 0 2000 4000 6000 8000 10000 12000 12288 Feature Index (sorted) 0.00.20.40.60.81.0 Score Top-k SAE Single Feature Channel Scores Distribution Top-k SAE, Feature Activated = 3234 Inactive Features Figure 28. ComSem interpretability score distribution of each feature channel on layer 20 of Llama 3.2-3B. 31 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Figure 29. Augment result for Fig. 12 (r = 2, d ′ = 2d). (Left) with entropy objective, (right) without entropy objective. Figure 30. Augment result for Fig. 12 (r = 8, d ′ = 2d). (Left) with entropy objective, (right) without entropy objective. Figure 31. Augment result for Fig. 12 (r = 16, d ′ = 4d). (Left) with entropy objective, (right) without entropy objective. 32 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Figure 32. Augment result for Fig. 12 (r = 32, d ′ = 4d). (Left) with entropy objective, (right) without entropy objective. Figure 33. Augment result for Fig. 12 (r = 64, d ′ = 4d). (Left) with entropy objective, (right) without entropy objective. Figure 34. Augment result for Fig. 12 (r = 128, d ′ = 4d). (Left) with entropy objective, (right) without entropy objective. 33 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 2.5 2.0 1.5 1.0 0.5 0.0 Activation Magnitude 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 0.00 0.02 0.04 0.06 0.08 0.10 Activation Magnitude Figure 35. Activation magnitude visualization of (left) BAE and (right) SAE of layer 5. 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 2.5 2.0 1.5 1.0 0.5 0.0 Activation Magnitude 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 0.00 0.02 0.04 0.06 0.08 0.10 Activation Magnitude Figure 36. Activation magnitude visualization of (left) BAE and (right) SAE of layer 9. 34 Binary Autoencoder for Mechanistic Interpretability of Large Language Models 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 2.5 2.0 1.5 1.0 0.5 0.0 Activation Magnitude 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 0.00 0.02 0.04 0.06 0.08 0.10 Activation Magnitude Figure 37. Activation magnitude visualization of (left) BAE and (right) SAE of layer 11. 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 2.5 2.0 1.5 1.0 0.5 0.0 Activation Magnitude 02000400060008000 Feature Channel Index 0 1000 2000 3000 4000 5000 6000 7000 8000 Input Sample Index 0.00 0.02 0.04 0.06 0.08 0.10 Activation Magnitude Figure 38. Activation magnitude visualization of (left) BAE and (right) SAE of layer 14. Figure 39. Augment for Fig. 5 on Layer 5 of Llama 3.2-1B. Figure 40. Augment for Fig. 5 on Layer 9 of Llama 3.2-1B. Figure 41. Augment for Fig. 5 on Layer 14 of Llama 3.2-1B. 35 Binary Autoencoder for Mechanistic Interpretability of Large Language Models Figure 42. UMAP visualization for 3 high-score features with corresponding activated/inactivated hidden states on Layer 14 of Llama 3.2-1B. Figure 43. UMAP visualization for 3 low-score features with corresponding activated/inactivated hidden states on Layer 14 of Llama 3.2-1B. 36