Paper deep dive
Sparse Interventions in Language Models with Differentiable Masking
Nicola De Cao, Leon Schmid, Dieuwke Hupkes, Ivan Titov
Models: LSTM (2-layer, 650 hidden)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:53:24 PM
Summary
The paper introduces a method for interpreting language models by identifying small, sparse subsets of neurons responsible for specific linguistic phenomena. By using differentiable masking and L0 regularization, the authors perform causal interventions to shift token emission probabilities, demonstrating that phenomena like subject-verb agreement and gender bias are mediated by a very small number of neurons.
Entities (6)
Relation Signals (3)
L0 regularization → enforces → Sparsity
confidence 95% · An L0 regularization term ensures that the search converges to discrete and sparse solutions.
LSTM → exhibits → Subject-verb number agreement
confidence 90% · We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs.
Differentiable masking → identifies → Responsible neurons
confidence 90% · we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon
Cypher Suggestions (2)
Find all linguistic phenomena analyzed using a specific model architecture. · confidence 90% · unvalidated
MATCH (m:Model)-[:ANALYZED_PHENOMENON]->(p:Phenomenon) RETURN m.name, p.name
Identify optimization techniques used to achieve sparsity in model interpretation. · confidence 85% · unvalidated
MATCH (t:Technique)-[:PROMOTES]->(s:Property {name: 'Sparsity'}) RETURN t.name
Abstract: There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An $L_0$ regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.
Tags
Links
- Source: https://arxiv.org/abs/2112.06837
- Canonical: https://arxiv.org/abs/2112.06837
Full Text
43,313 characters extracted from source content.
Sparse Interventions in Language Models with Differentiable Masking

Nicola De Cao*1,2, Leon Schmid*3, Dieuwke Hupkes4, Ivan Titov1,2,5
1 University of Amsterdam, 2 University of Edinburgh, 3 University of Osnabrück, 4 Facebook AI Research, 5 Innopolis University
nicola.decao@gmail.com, lschmid@uos.de, dieuwkehupkes@fb.com, ititov@inf.ed.ac.uk
* Equal contributions.

Abstract

There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An L0 regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.

1 Introduction

The success of language models (LMs) in many natural language processing tasks is accompanied by an increasing interest in interpreting and analyzing such models. One goal in this direction is to identify how a model employs its hidden representations to arrive at a prediction (Belinkov and Glass, 2019; Jacovi and Goldberg, 2020). A popular line of research studies LMs with "diagnostic classifiers" or "probes" that are trained to predict linguistic properties from hidden units, with the purpose of analyzing what information is encoded by the network and where (Alain and Bengio, 2017; Adi et al., 2017; Hupkes et al., 2018; Voita and Titov, 2020).

However, this method is sometimes criticized for generating unfaithful interpretations (Barrett et al., 2019), since the trained classifiers only measure the correlation between a model's representations and an external property, not whether that property actually causes the model's predictions. Indeed, several studies pointed out limitations of probes (Belinkov and Glass, 2019; Vanmassenhove et al., 2017; Tamkin et al., 2020), including mismatches between the performance of the probe and the original model, as well as the discrepancy between correlation and causation of hidden units and model outputs. In response to these limitations, several recent studies have proposed to interpret neural models with interventions, which aim to measure causal effects by intervening in representations of the model and observing a change in the model output (Giulianelli et al., 2018; Elazar et al., 2021; Feder et al., 2021). These techniques let us investigate more directly whether an LM represents a certain linguistic phenomenon, but are limited when it comes to understanding where and how this information is represented.
Therefore, an important question that they cannot answer is to what extent modularity, often believed to be a prerequisite for systematic generalization (Goyal and Bengio, 2020; Dittadi et al., 2021), is a property that emerges naturally in such models. An adaptation of causal mediation analysis (Pearl, 2001) by Lakretz et al. (2019), Vig et al. (2020) and Lakretz et al. (2021) makes an important step towards enabling such investigations. They consider neurons one by one, setting their activation to zero and measuring their effect on the output. However, these techniques suffer from two major shortcomings: i) as systematically ablating combinations of multiple neurons is computationally infeasible, they are restricted to detecting single neurons; ii) there is no guarantee that setting a unit activation to zero corresponds to switching the corresponding function on or off (Sturmfels et al., 2020).

In this work, we use a differentiable relaxation of this search problem to overcome both limitations. More specifically, our goal is to identify neurons responsible for shifting the probability from a word to its alternative in examples exemplifying the phenomena, without affecting other LM predictions. For example, when investigating subject-verb number agreement, we want to redistribute the probability mass from the singular form of an upcoming verb to the plural one (or the other way around), while discouraging changes in the distributions for other contexts. In this way, we ensure that the function is mediated through the detected neurons, and that these neurons do not play any other discernible role.

Building on the framework of differentiable masking (De Cao et al., 2020; Schlichtkrull et al., 2021), we formalize this search for a sparse intervention as a constrained optimization problem. We aim both to detect the responsible neurons and to learn the values to assign them when intervening. We use a continuous relaxation of the subset-selection problem, but ensure discreteness and encourage sparsity through L0 regularization. The L0 penalty determines how many neurons we want to discover. In our experiments, we use an LSTM-based LM, previously investigated by Gulordava et al. (2018) and Lakretz et al. (2019), and consider subject-verb number agreement and gender bias detection. We start by validating our method, showing that we can replicate findings reported in these previous studies, and then dive into a deeper analysis. We show that our proposed method is effective as well as computationally efficient (i.e., it converges up to 7 times faster than REINFORCE).

2 Additional Related Work

L0 regularization was first proposed by Louizos et al. (2018) in the context of pruning. It has been used in a variety of works in NLP as a tool for generating rationales and attributions (Bastings et al., 2019; De Cao et al., 2020; Schlichtkrull et al., 2021). Masking weights and groups of weights was also used by Csordás et al. (2021) to investigate the functional modularity of neural networks. A number of studies suggested that some linguistic phenomena are encoded, at least to a large degree, in a disentangled and sparse fashion. For example, Radford et al. (2017) detected a neuron encoding sentiment polarity, and Dai et al. (2021) showed that individual facts learned by an LM can be manipulated by modifying a small number of neurons.
In a similar spirit, Voita et al. (2019) observed that many Transformer attention heads in a neural machine translation model are specialized; interestingly, they also used L0 regularization, but only to prune less important heads, while the roles played by the heads were identified by studying their attention patterns. Our technique can facilitate such studies by effectively identifying sets of a neural network's subcomponents playing a given function.

3 Method

A range of tests for causal language models consider whether a model can represent a particular linguistic phenomenon (i.e., subject-verb agreement, filler-gap dependencies, negative polarity items; Jumelet et al., 2021, 2019; Wilcox et al., 2018; Gulordava et al., 2018), by measuring whether that model assigns a higher probability to a grammatical sentence involving that phenomenon than to its minimally different ungrammatical counterpart. In such tests, the comparison of probabilities is often focused on the probability of a single token, for instance the probability of the correct and incorrect verb form in a long sentence (Linzen et al., 2016). Here, we exploit this paradigm and investigate whether we can find groups of neurons for which a modification of their value, which we call an intervention, systematically leads to a change of probability for the single token emission related to a specific phenomenon.

Because there is no direct supervision for interventions (i.e., we do not have a dataset that accompanies a model with annotated examples of interventions), we need to learn them with a proxy objective. Let's assume we have an autoregressive model (e.g., an LSTM; Hochreiter and Schmidhuber 1997) that assigns a probability to sequences. For a set of input tokens x = ⟨x_1, ..., x_n⟩, we obtain the model's probability of the token of interest p(x_n | x_<n) along with the hidden states h = ⟨h_1, ..., h_n⟩, where h_i ∈ R^k (one for each time step). We then intervene in the model's computation by modifying a group of neurons from one or multiple hidden states. The intervention at a certain time step i < n consists of a binary mask m ∈ {0, 1}^k indicating which hidden units need intervention and which can be left unchanged. The intervention is then made by substituting the i-th hidden state with the altered state

ĥ_i = (1 − m) ⊙ h_i + m ⊙ b ,  (1)

where ⊙ indicates the element-wise product and b ∈ R^k is a learned baseline vector that will produce the desired intervention. We denote p(x_n | x_<n, ĥ_i) as the model's probability of the target token when its forward pass has been altered using ĥ_i. In addition, as the main objective of this work, we are looking for sparse interventions, which we define as finding a small percentage (e.g., 1-5%) of neurons on which to apply an intervention while keeping all the rest untouched.

3.1 Learning to Intervene

Because there is no direct supervision to estimate the mask m and the baseline b, we train them to minimize the ratio

L_ratio(ĥ_i, x) = p(x_n = d | x_<n, ĥ_i) / p(x_n = t | x_<n, ĥ_i) ,  (2)

where we want to identify neurons responsible for a change in probability between a predicted word d and a target word t (e.g., a correct and an incorrect verb form; it does not matter what the model predicts, but we set d as the form to which the model assigns higher probability and t as the other). In other words, we optimize them to assign more probability mass to the token t rather than d. In addition, we want the intervention to be as sparse as possible, because we want to identify the smallest number of neurons responsible for the decision.
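A minimal PyTorch-style sketch of the intervention in Eq. (1) and the ratio objective in Eq. (2). It assumes a generic hidden state of size k and next-token logits produced by the altered forward pass; the function names and shapes are illustrative and not taken from the authors' code.

```python
import torch

def intervene(h_i: torch.Tensor, mask: torch.Tensor, baseline: torch.Tensor) -> torch.Tensor:
    # Eq. (1): keep unmasked units, replace masked units with the learned baseline b.
    # h_i, mask and baseline all have shape (k,); mask entries are in {0, 1} at inference time.
    return (1.0 - mask) * h_i + mask * baseline

def ratio_loss(logits: torch.Tensor, d_token: int, t_token: int) -> torch.Tensor:
    # Eq. (2): p(x_n = d | x_<n, h_hat_i) / p(x_n = t | x_<n, h_hat_i),
    # computed in log space for numerical stability. Driving this below 1 flips the prediction.
    log_p = torch.log_softmax(logits, dim=-1)
    return torch.exp(log_p[d_token] - log_p[t_token])
```

At training time the mask is a relaxed sample in [0, 1] (see Appendix A), so the same expressions remain differentiable with respect to both m and b.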
Such sparsity corresponds to constraining most of the entries of m to be 0, which corresponds to not interfering. We cast this in the language of constrained optimization. A practical way to express the sparsity constraint is through the L0 'norm'.[1] Our constraint is defined as the total number of neurons we intervene on:

C_0(m) = Σ_{i=1}^{k} 1[m_i ≠ 0] .  (3)

The whole optimization problem is then:

min_{m, b} Σ_{x ∈ D} L_ratio(ĥ_i, x)  (4)
s.t. C_0(m) ≤ α ,  (5)

where D is a dataset and the margin α ∈ (0, 1] is a hyperparameter that controls the desired sparsity (i.e., the lower, the sparser the solution will be). Since non-linear constrained optimization is generally intractable, we employ Lagrangian relaxation (Boyd et al., 2004), optimizing instead

max_λ min_{m, b} Σ_{x ∈ D} L_ratio(ĥ_i, x) + λ (C_0(m) − α) ,  (6)

where λ ∈ R_{≥0} is the Lagrangian multiplier. Since we use binary masks, our loss is discontinuous and non-differentiable. A default option would be to use REINFORCE (Williams, 1992), but it is known to have a noisy gradient and thus slow convergence. To overcome both problems, we resort to a sparse relaxation of the binary variables, namely the Hard Concrete distribution (Louizos et al., 2018). See Appendix A for further details and properties of that distribution.

[1] L0, denoted ‖m‖_0 and defined as #{i | m_i ≠ 0}, is the number of non-zero entries in a vector. Contrary to L1 or L2, L0 is not a homogeneous function and, thus, not a proper norm. However, contemporary literature refers to it as a norm, and we do so as well to avoid confusion.

3.2 Single-step and Every-step intervention

We described how we apply an intervention at a certain time step i < n as an intervention that directly modifies h_i. We refer to this type as a single-step intervention. The choice of the time step to intervene on is not arbitrary and should be carefully set to investigate a particular phenomenon in the LM; note that this is task-dependent, e.g., for exploring subject-verb number agreement, a reasonable choice is to make the intervention at the hidden state of the subject. As an extension, we also define an every-step intervention where, instead of altering only h_i, we modify all of h_1, ..., h_{n−1} with the same m and b (note that this is the same type of intervention used by Lakretz et al. 2019). The two types of intervention investigate different properties of an LM, and we experiment with both variants.

3.3 Retaining other predictions

We train interventions to modify the model's prediction at a specific token position. However, there is little guarantee that all the other token positions will have the same output distribution as without the interventions. This matters because, when investigating modularity, we would like to ensure not only that a group of neurons plays a distinct interpretable role but also that they do not fulfil any other discernible role. For this reason, we employ a regularization term in addition to the constrained objective. This corresponds to minimizing a KL divergence between the output distributions of the original model and the model with interventions (see Appendix C for more details; a schematic sketch of the combined objective follows below).
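For concreteness, a hedged sketch of how the pieces fit together: the per-example ratio losses of Eq. (2), the expected L0 penalty of the stochastic mask (Eq. 8, Appendix B), the Lagrangian multiplier of Eq. (6), and the KL term of Section 3.3. The helper names and the simple multiplier update are illustrative assumptions, not the authors' implementation.

```python
import torch

def lagrangian_objective(ratio_losses, kl_terms, expected_l0, lam, alpha, kl_weight=1.0):
    # ratio_losses: list of per-example tensors, each the ratio of Eq. (2) under a relaxed mask sample
    # kl_terms:     list of per-example KL divergences between original and intervened outputs (Section 3.3)
    # expected_l0:  E[C_0(m)] of the stochastic mask (Eq. 8), a differentiable scalar tensor
    # lam:          scalar tensor holding the Lagrangian multiplier lambda >= 0
    constraint = expected_l0 - alpha
    loss = (torch.stack(ratio_losses).sum()
            + lam * constraint
            + kl_weight * torch.stack(kl_terms).sum())
    return loss, constraint

def update_multiplier(lam, constraint, step_size=1e-2):
    # Lagrangian relaxation (Eq. 6): lambda performs projected gradient ascent on the
    # constraint violation and is kept non-negative, while m and b descend on the loss.
    with torch.no_grad():
        return torch.clamp(lam + step_size * constraint, min=0.0)
```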
4 Experimental Setting

We study the pre-trained LSTM language model made available by Gulordava et al. (2018),[2] which has been studied extensively in previous work and therefore provides a good testing ground for our method. We study this model, as well as newly trained models with the same architecture, on two tasks described below: subject-verb number agreement and gender bias. The evaluation for both tasks naturally follows the defined objective L_ratio(ĥ_i, x) (see Section 3.1). Without intervention, the ratio is always > 1, and we call an intervention successful when we find a mask and baseline values such that the ratio becomes < 1. Finally, we define the accuracy of interventions as the fraction of datapoints in a given dataset/task for which the ratio is < 1. In other words, the accuracy reflects how often we can flip the model's decision, that is, make the likelihood of the desired output higher than that of the previous model output (see the short sketch at the end of this section).

LSTM language model. The studied model is a standard two-layered LSTM with a hidden dimension of 650. The embedding layer also has dimensionality 650, and it is not tied with the output layer. The vocabulary size is 50,000 (the most frequent words in the corpus). The model was trained on English Wikipedia data (around 80M training tokens and 10M validation tokens). We used this model to compare to the previous findings of Lakretz et al. (2019). We also pre-train this LM several times with different weight initializations to make sure our results generalize.

Subject-verb number agreement task. This task consists of intervening on the subject-verb number agreement predicted by the model: for a given sentence, we wish the intervention to change the number of the verb from singular to plural or the other way around. We employ data made available by Lakretz et al. (2019)[3] (see Appendix D for more details). We apply the single-step intervention at the subject of the (only) verb. We apply two types of intervention: turning the verb to the singular form or to the plural form.

Gender bias detection. This task consists of intervening on a pronoun that might indicate a gender bias of the model: for a given sentence, we wish the intervention to change the pronoun that refers to a person with a profession and an unspecified gender. We employ data made available by Vig et al. (2020)[4] (see Appendix D for more details). Also for this task, we apply the single-step intervention at the subject of the (only) verb. We apply two types of intervention: one flipping the pronoun to "he" and one to "she".

[2] Model and pre-training corpus available from the authors at https://github.com/facebookresearch/colorlessgreenRNNs
[3] https://github.com/FAIRNS/Number_and_syntax_units_in_LSTM_LMs
[4] https://github.com/sebastianGehrmann/CausalMediationAnalysis
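The flip-rate ("accuracy") metric just described, written out as a small illustrative helper; `ratios` is assumed to hold the per-example values of Eq. (2) computed after applying the learned intervention.

```python
def intervention_accuracy(ratios):
    # A successful intervention drives the ratio of Eq. (2) below 1, i.e. the target form t
    # becomes more likely than the form d the model originally preferred.
    flips = [float(r) < 1.0 for r in ratios]
    return sum(flips) / len(flips)
```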
5 Results

Main results. For the single-step intervention (with regularization), our method achieved 88.8 ± 0.3 and 95.8 ± 3.8 accuracy for the number agreement and gender bias tasks, respectively. On average, our method finds 5 and 5.7 units for the two tasks, respectively. Considering that the LM has 1300 hidden units, this intervention is relatively sparse, as desired (i.e., we use < 0.44% of the total units). In Figures 1 and 2, we show examples of hidden state activations with and without the intervention for both tasks (see Appendix E for additional examples). From these figures, we can see that only one unit is heavily affected (the one used for the intervention) while the others are minimally corrupted after that time step. We hypothesize that the model stores the information about number or gender in other units (or in cell states), but that the discovered units are the ones responsible for the initialization of such memory.

In Tables 4 and 5 in Appendix E, we report the full list of discovered units and the learned baseline vectors for both tasks with the single-step intervention. For the every-step intervention, our method achieved an almost perfect accuracy of 99.3 and 100 for the number agreement and gender bias tasks, respectively, while using 3.3 units on average for both tasks. This type of intervention is both more effective and more intrusive (i.e., the number of changed units is effectively larger, since the change happens at every step). In Table 2 we report the full list of discovered units and the learned baseline vectors, compared to the ones discovered by Lakretz et al. (2019) (every-step intervention). Notably, we re-discover units 776 and 988, which validates our method and confirms their findings. Interestingly, we also discover an extra unit on average, highlighting that one of the limitations of Lakretz et al. (2019) was indeed the lack of an efficient way to search for units. For a summary of all results see Table 1; for the discovered units and baselines on the gender task see Table 6 (in Appendix E).

Table 1: Summary of results for both the number agreement and gender bias settings (average across 3 runs for each setting). "R" indicates KL regularization; Single/Every indicate single-step and every-step interventions, respectively.

Setting | Accuracy | Units | KL
Number agreement, Single | 88.0 ± 0.2 | 5.0 ± 0.0 | 0.034 ± 0.003
Number agreement, Single R | 88.8 ± 0.3 | 5.0 ± 0.0 | 0.032 ± 0.000
Number agreement, Every | 99.8 ± 0.2 | 3.3 ± 0.6 | 0.208 ± 0.023
Number agreement, Every R | 99.3 ± 0.2 | 3.3 ± 0.6 | 0.075 ± 0.005
Gender bias, Single | 98.6 ± 1.2 | 7.0 ± 1.0 | 0.011 ± 0.001
Gender bias, Single R | 95.8 ± 3.8 | 5.7 ± 1.2 | 0.009 ± 0.002
Gender bias, Every | 100.0 ± 0.0 | 2.7 ± 0.6 | 0.104 ± 0.020
Gender bias, Every R | 100.0 ± 0.0 | 3.3 ± 0.0 | 0.075 ± 0.001

Figure 1: Activations of four units (79, 93, 498, 630) we intervene on (single-step intervention at "doctors") for changing number agreement (at "admire"), shown with and without the intervention for the sentence "The doctors definitely admire".

Figure 2: Activations of four units (288, 336, 456, 563) we intervene on (single-step intervention at "economist") for changing the pronoun (after "because"), shown with and without the intervention for the sentence "The economist ate because".

Efficiency. To demonstrate the efficiency and efficacy of our estimation employing the Hard Concrete distribution, we compared it to standard Score Function Estimation (aka REINFORCE; Williams 1992) with a moving-average baseline for variance reduction. Table 3 summarises the results for the single-step intervention. We tried different values of α for REINFORCE to achieve a reasonable trade-off between accuracy and the number of units used. REINFORCE takes at least 7 times longer to converge, and it always converges to using more units than our method, with lower accuracy. We did not compare to an exact search (i.e., completely evaluating all combinations of units and optimizing the intervention baselines) since it scales exponentially with the number of units: it has a time complexity of O(2^k), where k is the number of neurons. A full search would correspond to > 10^12 evaluations just for checking combinations of up to 4 neurons, which is clearly unfeasible.

Robustness. To demonstrate that our method is robust, we tested it on 5 language models initialized with different seeds and trained with the original script by Gulordava et al. (2018). We ran our method for the single-step intervention 3 times for each language model. The average accuracy at convergence is 88.7 ± 2.6, and we discover 4.7 ± 0.5 units on average.
The variability in both accuracy and number of units is very low, indicating that our method is robust to different parameterizations of the model it is applied to.

Effect of Regularization. We ablated the KL regularization to see whether it affects learning and the final convergence of our method. On the number agreement task, we found that the average KL divergence with respect to the original model predictions was 0.009/0.075 with regularization and 0.011/0.104 without regularization (for single-step and every-step intervention, respectively). We used different regularization coefficients (i.e., different weights), but we did not observe a substantial change in the convergence of our models. Moreover, the accuracy and the number of units found with regularization were almost the same as without regularization (see Table 1 for all results). This lack of effect of the regularization suggests that the studied phenomenon is naturally captured by specialized neurons. In the gender bias task, regularization has a slightly higher impact. The regularized method converges to finding fewer units on average and with worse accuracy (95.8 as opposed to 98.6) in the single-step intervention. In the every-step intervention, the accuracy stays invariant (100 for both settings) while the model converges to using more units.

Table 2: Subject-verb number agreement task with every-step interventions. Values are averages across 25 runs. "Prevalence" indicates how many times across runs our method decides to apply the intervention on a specific unit; the last column indicates whether the unit was also found by Lakretz et al. (2019).

Unit | Singular | Plural | Prevalence | Found by Lakretz et al. (2019)
79 | -0.05 ± 0.11 | 0.53 ± 0.13 | 100% | no
224 | -0.67 ± 0.03 | 0.48 ± 0.06 | 32% | no
498 | 0.53 ± 0.10 | -0.32 ± 0.03 | 100% | no
630 | 0.43 ± 0.06 | -0.36 ± 0.08 | 56% | no
776 | -1.00 ± 0.00 | 0.52 ± 0.03 | 100% | yes
988 | 0.19 ± 0.06 | -1.00 ± 0.00 | 44% | yes

Table 3: Comparison between the solutions found by Score Function Estimation (aka REINFORCE) and our system (average across 25 runs). Ours is much faster and finds a sparser solution with better accuracy.

Method | Acc. (↑) | Units (↓) | Speed (↓)
SFE (α = 0.05) | 100.0 | 20.0 | 5.2h
SFE (α = 0.02) | 87.6 | 6.0 | 3.6h
Ours (α = 0.02) | 92.3 | 4.3 | 0.5h

6 Conclusions

In this work, we present a new method that employs constrained optimization to efficiently find hidden units that are responsible for particular linguistic phenomena in language models. We additionally use L0 regularization to find a sparse solution, which we also show empirically: our methodology discovers few units, on the order of 5-6, which is < 0.44% of all units in the studied LM. We show that such sparse solutions can be found for multiple phenomena (number agreement and gender) and that our method is a useful tool for analyzing what an LM has learned and how units influence its token emissions. Although this work focuses on LSTM models, the proposed technique is not architecture-dependent and is thus easily applicable to transformer-based models as well as convolution-based models and many others.

Acknowledgements

NDC and IT are supported by SAP Innovation Center Network; LS and IT by the Dutch Organization for Scientific Research (NWO VIDI 639.022.518); and IT by the Analytical Center for the Government of the Russian Federation (agreement 70-2021-00143, dd. 01.11.2021, IGK 000000D730321P5Q0002).

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes. International Conference on Learning Representations.

Maria Barrett, Yova Kementchedjhieva, Yanai Elazar, Desmond Elliott, and Anders Søgaard. 2019. Adversarial removal of demographic attributes revisited. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6330–6335, Hong Kong, China. Association for Computational Linguistics.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4349–4357.

Stephen Boyd, Stephen P. Boyd, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.

Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. 2021. Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. ArXiv preprint, abs/2104.08696.

Nicola De Cao, Michael Sejr Schlichtkrull, Wilker Aziz, and Ivan Titov. 2020. How do decisions emerge across layers in neural models? Interpretation with differentiable masking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3243–3255, Online. Association for Computational Linguistics.

Andrea Dittadi, Frederik Träuble, Francesco Locatello, Manuel Wuthrich, Vaibhav Agrawal, Ole Winther, Stefan Bauer, and Bernhard Schölkopf. 2021. On the transfer of disentangled representations in realistic settings. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Brussels, Belgium. Association for Computational Linguistics.
Anirudh Goyal and Yoshua Bengio. 2020. Inductive biases for deep learning of higher-level cognition. ArXiv preprint, abs/2011.15091.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online. Association for Computational Linguistics.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-Softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Jaap Jumelet, Milica Denic, Jakub Szymanik, Dieuwke Hupkes, and Shane Steinert-Threlkeld. 2021. Language models use monotonicity to assess NPI licensing. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4958–4969, Online. Association for Computational Linguistics.

Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing neural language models: Contextual decomposition reveals default reasoning in number and gender assignment. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 1–11, Hong Kong, China. Association for Computational Linguistics.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Yair Lakretz, Dieuwke Hupkes, Alessandra Vergallito, Marco Marelli, Marco Baroni, and Stanislas Dehaene. 2021. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, 213:104699. Special Issue in Honour of Jacques Mehler, Cognition's founding editor.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20, Minneapolis, Minnesota. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L_0 regularization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2020. Gender bias in neural natural language processing. Logic, Language, and Security, pages 189–202.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Judea Pearl. 2001. Direct and indirect effects. Uncertainty in Artificial Intelligence, pages 411–420.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. ArXiv preprint, abs/1704.01444.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1278–1286. JMLR.org.

Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. 2021. Interpreting graph neural networks for NLP with differentiable edge masking. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Pascal Sturmfels, Scott Lundberg, and Su-In Lee. 2020. Visualizing the impact of feature attribution baselines. Distill, 5(1):e22.

Alex Tamkin, Trisha Singh, Davide Giovanardi, and Noah Goodman. 2020. Investigating transferability in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1393–1401, Online. Association for Computational Linguistics.

Eva Vanmassenhove, Jinhua Du, and Andy Way. 2017. Investigating 'aspect' in NMT and SMT: Translating the English simple past and present perfect. Computational Linguistics in the Netherlands Journal, 7:109–128.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart M. Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online. Association for Computational Linguistics.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler-gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium. Association for Computational Linguistics.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

A Stochastic relaxation of the Mask

Our optimization problem poses two difficulties: i) C_0 is discontinuous and has zero derivative almost everywhere, and ii) the altered state ĥ_i is discontinuous w.r.t. the binary mask m.
A simple way to overcome both issues is to treat the binary mask as stochastic and optimize the objective in expectation. In that case, one natural option is to resort to score function estimation (REINFORCE; Williams, 1992), while another is to use a sparse relaxation of the binary variables (Louizos et al., 2018; Bastings et al., 2019; De Cao et al., 2020; Schlichtkrull et al., 2021). In Section 4 we discuss the two aforementioned options, showing that the latter is much more effective (results in Table 3). Thus we opt to use the Hard Concrete distribution, a mixed discrete-continuous distribution on the closed interval [0, 1]. This distribution assigns non-zero probability to exactly zero and one, while it also admits continuous outcomes in the unit interval via the reparameterization trick (Kingma and Welling, 2014). We refer to Louizos et al. (2018) for details, but also provide a brief summary in Appendix B. With a stochastic mask, the objective is computed in expectation, which addresses both sources of non-differentiability. Note that during training the mask is sampled and its values lie in the closed unit interval. After training, we set the mask entries to exact ones when their expected values are > 0.5 and to zero otherwise. To prevent issues due to the discrepancy between the values of the mask during training and during inference, we add another constraint,

C_(0,1)(m) = Σ_{i=1}^{k} E[1(m_i ∈ (0, 1))] ,  (7)

required to be ≤ β. During training, C_(0,1) constrains the relaxed mask values not to lie in the open interval (0, 1) but rather to concentrate on {0, 1}. β ∈ (0, 1] is a hyperparameter (the lower, the less discrepancy is expected).

B The Hard Concrete distribution

The Hard Concrete distribution assigns density to continuous outcomes in the open interval (0, 1) and non-zero mass to exactly 0 and exactly 1. A particularly appealing property of this distribution is that sampling can be done via a differentiable reparameterization (Rezende et al., 2014; Kingma and Welling, 2014). In this way, the C_0 constraint in Equation 3 becomes an expectation,

C_0(m) = Σ_{i=1}^{k} E_{p(m_i)}[m_i ≠ 0] ,  (8)

whose gradient can be estimated via Monte Carlo sampling without the need for REINFORCE and without introducing biases. We did modify the original Hard Concrete, though only slightly, so that it gives support to samples in the half-open interval [0, 1), that is, with non-zero mass only at 0. That is because we only need to distinguish 0 from non-zero, and the value 1 is not particularly important.[5]

The distribution. A stretched and rectified Binary Concrete (also known as Hard Concrete) distribution is obtained by applying an affine transformation to the Binary Concrete distribution (Maddison et al., 2017; Jang et al., 2017) and rectifying its samples into the interval [0, 1]. A Binary Concrete is defined over the open interval (0, 1) and is parameterised by a location parameter γ ∈ R and a temperature parameter τ ∈ R_{>0}. The location acts as a logit and controls the probability mass, skewing the distribution towards 0 in the case of a negative location and towards 1 in the case of a positive location. The temperature parameter controls the concentration of the distribution. The Binary Concrete is then stretched with an affine transformation, extending its support to (l, r) with l ≤ 0 and r ≥ 1. Finally, we obtain a Hard Concrete distribution by rectifying samples into the interval [0, 1]. This corresponds to collapsing the probability mass over the interval (l, 0] to 0, and the mass over the interval [1, r) to 1. This induces a distribution over the closed interval [0, 1] with non-zero mass at 0 and 1. Samples are obtained using

s = σ((log u − log(1 − u) + γ) / τ) ,
z = min(1, max(0, s · (r − l) + l)) ,  (9)

where σ is the sigmoid function σ(x) = (1 + e^{−x})^{−1} and u ∼ U(0, 1). We point to Appendix B of Louizos et al. (2018) for more information about the density of the resulting distribution and its cumulative density function.

[5] Only a true 0 is guaranteed to completely mask an input out, while any non-zero value, however small, may leak some amount of information.
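A hedged sketch of the sampling step in Eq. (9) and the expected-L0 term of Eq. (8), following the standard Hard Concrete parameterization of Louizos et al. (2018). The default temperature and stretch limits are illustrative assumptions, and the paper's slightly modified variant (support [0, 1), mass only at 0) is not reproduced here.

```python
import math
import torch

def hard_concrete_sample(log_alpha, temperature=2.0 / 3.0, l=-0.1, r=1.1):
    # Stretch-and-rectify sampling (Eq. 9): a Binary Concrete sample s in (0, 1) is stretched
    # to (l, r) and then clamped to [0, 1], which places point mass on exactly 0 and 1.
    u = torch.rand_like(log_alpha)
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / temperature)
    return torch.clamp(s * (r - l) + l, 0.0, 1.0)

def expected_l0(log_alpha, temperature=2.0 / 3.0, l=-0.1, r=1.1):
    # E[C_0(m)] (Eq. 8): probability that each gate is non-zero, summed over units.
    # Closed form taken from Louizos et al. (2018); used here as an assumption.
    return torch.sigmoid(log_alpha - temperature * math.log(-l / r)).sum()
```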
C Regularization

The regularization term is a KL divergence between the output distributions of the original model p_O and the model with interventions p_I, averaged over every token position:

L_KL = (1/T) Σ_{t=1}^{T} D_KL( p_O(x_t | x_<t) ‖ p_I(x_t | x_<t) ) .  (10)

We add L_KL to Equation 6, multiplied by a factor. This factor is a hyperparameter that controls the amount of regularization to apply, and we empirically found 1.0 to be a good value. In practice, as we discussed in Section 4, the regularization term does not play an important role.

D Data

Subject-verb number agreement. For this task, we employ data made available by Lakretz et al. (2019). The data are synthetic and generated with a modified version of the generation procedure from Linzen et al. (2016) and Gulordava et al. (2018). Each synthetic number-agreement instance has a fixed syntax and varied lexical material. Sentences were randomly sampled by choosing words from pools of 20 subject/object nouns, 15 verbs, 10 adverbs, 5 prepositions, 10 proper nouns, and 10 location nouns. We used a total of 11,000 training sentences and 1,000 evaluation sentences.

Gender bias. For this task, we employ data made available by Vig et al. (2020). The data are synthetic and generated with a list of templates from Lu et al. (2020) and several other templates, instantiated with professions from Bolukbasi et al. (2016) (17 templates and 169 professions, resulting in 2,873 examples in total). We refer to Vig et al. (2020) for the full lists of templates and professions. The templates have the form "The [occupation] [verb] because [he/she]". Professions are definitionally gender-neutral. We used a total of 2,673 training sentences and 200 evaluation sentences.

E Additional results

Table 4: Subject-verb number agreement task with single-step interventions. Values are averages across 10 runs.

Unit | Singular | Plural | Prevalence
79 | -0.93 ± 0.03 | 0.96 ± 0.02 | 100%
93 | 0.94 ± 0.02 | -0.81 ± 0.05 | 100%
357 | -0.97 ± 0.01 | 0.83 ± 0.06 | 30%
498 | 0.96 ± 0.01 | -0.92 ± 0.02 | 100%
571 | -0.96 ± 0.02 | 0.87 ± 0.04 | 60%
630 | 0.91 ± 0.03 | -0.34 ± 0.19 | 100%

Table 5: Gender bias task with single-step interventions. Values are averages across 10 runs.

Unit | He | She | Prevalence
5 | 1.00 ± 0.00 | -0.99 ± 0.01 | 50%
288 | -1.00 ± 0.00 | -0.89 ± 0.21 | 100%
336 | -1.00 ± 0.00 | 1.00 ± 0.00 | 20%
455 | -1.00 ± 0.00 | -0.24 ± 0.00 | 10%
456 | 1.00 ± 0.00 | -1.00 ± 0.00 | 100%
464 | 1.00 ± 0.00 | -1.00 ± 0.00 | 20%
474 | 1.00 ± 0.00 | -0.72 ± 0.18 | 60%
490 | -1.00 ± 0.00 | 1.00 ± 0.00 | 20%
563 | -1.00 ± 0.00 | 1.00 ± 0.00 | 100%
646 | -1.00 ± 0.00 | 1.00 ± 0.00 | 10%

Table 6: Gender bias task with every-step interventions. Values are averages across 10 runs.

Unit | He | She | Prevalence
456 | 0.91 ± 0.28 | -0.94 ± 0.17 | 100%
670 | 0.86 ± 0.27 | -0.55 ± 0.35 | 80%
693 | 1.00 ± 0.00 | -1.00 ± 0.00 | 90%
1009 | 0.49 ± 0.00 | -0.21 ± 0.00 | 10%
1184 | -1.00 ± 0.00 | 1.00 ± 0.00 | 100%
1252 | 0.79 ± 0.27 | -1.00 ± 0.00 | 50%
Figure 3: Subject-verb number agreement: activations of four units (79, 93, 498, 630) we intervene on (single-step intervention at the second token from the left) for changing number agreement (at the last token), shown with and without the intervention. Panels (a)-(h) show the sentences "The teachers now definitely admire", "The singer behind the car inspires", "The lawyer near Barbara stimulates", "The friends under Mike confuse", "The victim behind the trucks encourages", "The athlete beside Jim avoids", "The aunt definitely engages", and "The singer beside John avoids".

Figure 4: Gender bias: activations of four units (288, 336, 456, 563) we intervene on (single-step intervention at the second token from the left) for changing the pronoun (after "because"), shown with and without the intervention. Panels (a)-(h) show the sentences "The sportsman ate because", "The deputy ran because", "The gangster slept because", "The economist ate because", "The naturalist was fired because", "The commander drove because", "The astronomer slept because", and "The promoter laughed because".