Paper deep dive
Knowledge Neurons in Pretrained Transformers
Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, Furu Wei
Models: BERT-base-cased
Summary
The paper introduces 'knowledge neurons' in pretrained Transformers, proposing a knowledge attribution method (based on integrated gradients) to identify the specific FFN neurons responsible for storing factual knowledge. The authors show that the activation of these neurons is positively correlated with knowledge expression, and that manipulating them (suppressing or amplifying activations, or editing the corresponding FFN value slots) can update or erase factual knowledge without fine-tuning.
Entities (5)
Relation Signals (3)
BERT → contains → Feed-Forward Network
confidence 100% · A Transformer encoder is stacked with L identical blocks. Each Transformer block mainly contains two modules: a self-attention module, and a feed-forward network
Knowledge Attribution Method → identifies → Knowledge Neurons
confidence 100% · we propose a knowledge attribution method to identify the neurons that express the fact.
Knowledge Neurons → stored_in → Feed-Forward Network
confidence 90% · We hypothesize that factual knowledge is stored in FFN memories and expressed by knowledge neurons.
Cypher Suggestions (2)
Identify the relationship between the attribution method and the neurons · confidence 95% · unvalidated
MATCH (a:Algorithm)-[r:IDENTIFIES]->(n:Entity) RETURN a.name, type(r), n.name
Find all components that store factual knowledge · confidence 90% · unvalidated
MATCH (n:Entity)-[:STORED_IN]->(c:Component) WHERE n.name = 'Knowledge Neurons' RETURN c
Abstract
Abstract: Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus. In this paper, we present preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons. Specifically, we examine the fill-in-the-blank cloze task for BERT. Given a relational fact, we propose a knowledge attribution method to identify the neurons that express the fact. We find that the activation of such knowledge neurons is positively correlated to the expression of their corresponding facts. In our case studies, we attempt to leverage knowledge neurons to edit (such as update, and erase) specific factual knowledge without fine-tuning. Our results shed light on understanding the storage of knowledge within pretrained Transformers. The code is available at https://github.com/Hunter-DDM/knowledge-neurons.
Full Text
Knowledge Neurons in Pretrained Transformers

Damai Dai†‡∗, Li Dong‡, Yaru Hao‡, Zhifang Sui†, Baobao Chang†, Furu Wei‡
† MOE Key Lab of Computational Linguistics, Peking University
‡ Microsoft Research
daidamai,szf,chbb@pku.edu.cn  lidong1,yaruhao,fuwei@microsoft.com

Abstract

Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus (Petroni et al., 2019; Jiang et al., 2020b). In this paper, we present preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons. Specifically, we examine the fill-in-the-blank cloze task for BERT. Given a relational fact, we propose a knowledge attribution method to identify the neurons that express the fact. We find that the activation of such knowledge neurons is positively correlated to the expression of their corresponding facts. In our case studies, we attempt to leverage knowledge neurons to edit (such as update, and erase) specific factual knowledge without fine-tuning. Our results shed light on understanding the storage of knowledge within pretrained Transformers. The code is available at https://github.com/Hunter-DDM/knowledge-neurons.

1 Introduction

Large-scale pretrained Transformers (Devlin et al., 2019; Liu et al., 2019; Dong et al., 2019; Clark et al., 2020; Bao et al., 2020) are usually learned with a language modeling objective on large-scale corpora, such as Wikipedia, where oceans of factual knowledge exist. Pretrained language models naturally play as a free-text knowledge base by predicting texts (Bosselut et al., 2019). Petroni et al. (2019) and Jiang et al. (2020b) probe factual knowledge stored in pretrained language models by fill-in-the-blank cloze queries. The evaluation shows that pretrained Transformers have a strong ability to recall factual knowledge without any fine-tuning. Roberts et al.
(2020) use closed-book question answering to show that the larger a model is, the more knowledge it can store. However, most previous work focuses on evaluating the overall accuracy of text-form knowledge prediction. In this paper, we attempt to look deeper into pretrained Transformers and investigate how factual knowledge is stored.

∗ Contribution during internship at Microsoft Research.

Figure 1: Through knowledge attribution, we identify knowledge neurons that express a relational fact.

As shown in Figure 1, we propose a knowledge attribution method to identify the neurons that express a relational fact, where such neurons are named knowledge neurons. Specifically, we view feed-forward network (i.e., two-layer perceptron) modules in Transformer as key-value memories (Geva et al., 2020). For the example in Figure 1, the hidden state is fed into the first linear layer and activates knowledge neurons; then, the second linear layer integrates the corresponding memory vectors. The key-value-memory nature (Geva et al., 2020) inspires us to propose the knowledge attribution method, which identifies knowledge neurons in feed-forward networks by computing the contribution of each neuron to the knowledge prediction.

arXiv:2104.08696v2 [cs.CL] 10 Mar 2022

Figure 2: Illustration of how an FFN module in a Transformer block works as a key-value memory. The first linear layer FFN^(key) computes intermediate neurons through inner product. Taking the activation of these neurons as weights, the second linear layer FFN^(val) integrates value vectors through weighted sum. We hypothesize that knowledge neurons in the FFN module are responsible for expressing factual knowledge.

Extensive analysis shows that the activation of the identified knowledge neurons is positively correlated to the knowledge expression, which shows the effectiveness of the proposed knowledge attribution method. First, suppressing and amplifying knowledge neurons notably affects the expression of the corresponding knowledge. Second, we find that knowledge neurons of a fact tend to be activated more by corresponding knowledge-expressing prompts. Third, given the knowledge neurons of a fact, the top activating prompts retrieved from open-domain texts usually express the corresponding fact, while the bottom activating prompts do not express the correct relation.

In our case studies, we try to leverage knowledge neurons to explicitly edit factual knowledge in pretrained Transformers without any fine-tuning. We present two preliminary studies: updating facts, and erasing relations. After identifying the knowledge neurons, we perform a knowledge surgery for pretrained Transformers by directly modifying the corresponding parameters in feed-forward networks. Such surgery shows promising results, keeping a moderate influence on other knowledge.

Our contributions are summarized as follows:

• We introduce the concept of knowledge neurons and propose a knowledge attribution method to identify the knowledge neurons that express specific factual knowledge in the fill-in-the-blank cloze task.
• We conduct both qualitative and quantitative analysis to show that knowledge neurons are positively correlated to knowledge expression.
• We present preliminary studies of leveraging knowledge neurons to edit factual knowledge in Transformers, even without any fine-tuning.

2 Background: Transformer

Transformer (Vaswani et al., 2017) is one of the most popular and effective NLP architectures. A Transformer encoder is stacked with L identical blocks.
Each Transformer block mainly contains two modules: a self-attention module and a feed-forward network (abbreviated as FFN) module. Let X ∈ R^{n×d} denote the input matrix; the two modules can be formulated as follows:

Q_h = X W_h^Q,  K_h = X W_h^K,  V_h = X W_h^V,    (1)
Self-Att_h(X) = softmax(Q_h K_h^T) V_h,           (2)
FFN(H) = gelu(H W_1) W_2,                         (3)

where W_h^Q, W_h^K, W_h^V, W_1, W_2 are parameter matrices; Self-Att_h(X) computes a single attention head; H, the hidden state, is given by projecting the concatenation of all heads; gelu denotes the GELU activation function (Hendrycks and Gimpel, 2016). For simplicity, we omit the scaling factor in self-attention and the bias terms.

Connections Between Self-Attention and FFN  Comparing Equation (2) and Equation (3), we notice that the formula of FFN(·) is quite similar to Self-Att(·), except for the activation function: gelu in FFN and softmax in self-attention. Thus, similar to the query-key-value mechanism in self-attention, it is reasonable to regard the input of the FFN as a query vector, and the two linear layers of the FFN as keys and values, respectively. Similar observations are also described in (Geva et al., 2020).

3 Identifying Knowledge Neurons

Similar to (Geva et al., 2020), we view FFNs in Transformer as key-value memories, as illustrated in Figure 2. We hypothesize that factual knowledge is stored in FFN memories and expressed by knowledge neurons. In this section, we propose a knowledge attribution method and a refining strategy to identify these knowledge neurons.

3.1 Knowledge Assessing Task

We employ the fill-in-the-blank cloze task to assess whether a pretrained model knows a fact. Following Petroni et al. (2019), each relational fact is in the form of a triplet 〈h, r, t〉, where h is the head entity, t is the tail entity, and r is the relation between them. Given a fact, pretrained models answer the cloze query x that expresses the fact but leaves the tail entity as a blank.
For example, given the fact 〈Ireland, capital, Dublin〉, a possible query is "The capital of Ireland is ___". We also call the query a knowledge-expressing prompt. Petroni et al. (2019) describe that a model knows a fact if it can predict the correct answer. In this paper, rather than just examining the model outputs, we identify the specific knowledge neurons that express factual knowledge.

3.2 Knowledge Attribution

Inspired by Hao et al. (2021), we propose a knowledge attribution method based on integrated gradients (Sundararajan et al., 2017). Our method can evaluate the contribution of each neuron to knowledge predictions. In this paper, we examine FFN intermediate neurons for the masked token, where the answer is predicted.

Given an input prompt x, we first define the model output P_x(ŵ_i^(l)) as the probability of the correct answer predicted by a pretrained model:

P_x(ŵ_i^(l)) = p(y* | x, w_i^(l) = ŵ_i^(l)),    (4)

where y* denotes the correct answer; w_i^(l) denotes the i-th intermediate neuron in the l-th FFN; ŵ_i^(l) is a given constant that w_i^(l) is assigned to.

In order to calculate the attribution score of a neuron, Attr(w_i^(l)), we gradually change w_i^(l) from 0 to its original value w̄_i^(l) calculated by the pretrained model, and meanwhile integrate the gradients:

Attr(w_i^(l)) = w̄_i^(l) ∫_{α=0}^{1} (∂P_x(α w̄_i^(l)) / ∂w_i^(l)) dα,    (5)

where ∂P_x(α w̄_i^(l)) / ∂w_i^(l) calculates the gradient of the model output with regard to w_i^(l). Intuitively, as α changes from 0 to 1, by integrating the gradients, Attr(w_i^(l)) accumulates the output probability change caused by the change of w_i^(l). If the neuron has a great influence on the expression of a fact, the gradient will be salient, which in turn yields large integration values. Therefore, the attribution score can measure the contribution of the neuron w_i^(l) to the factual expressions.

Directly calculating continuous integrals is intractable.
We instead use a Riemann approximation:

Ãttr(w_i^(l)) = (w̄_i^(l) / m) Σ_{k=1}^{m} ∂P_x((k/m) w̄_i^(l)) / ∂w_i^(l),

where m = 20 is the number of approximation steps. With the attribution algorithm, we can identify a coarse set of knowledge neurons whose attribution scores are greater than a threshold t.

3.3 Knowledge Neuron Refining

In order to identify knowledge neurons more accurately, we further propose a refining strategy. Besides "true-positive" knowledge neurons that express factual knowledge, the coarse set of knowledge neurons may contain "false-positive" knowledge neurons that express other information (e.g., syntactic or lexical information). The refining strategy aims to filter out these "false-positive" neurons.

For different prompts corresponding to the same fact, we hypothesize that they share the same set of "true-positive" knowledge neurons, since they express the same factual knowledge. Meanwhile, we hypothesize that they do not share the "false-positive" knowledge neurons as long as the prompts are diverse enough. Therefore, given multiple diverse prompts, we can refine the coarse set of knowledge neurons by retaining only the neurons that are widely shared among these prompts.

Specifically, given a relational fact, the complete process to identify its knowledge neurons is described as follows: (1) produce n diverse prompts; (2) for each prompt, calculate the knowledge attribution scores of neurons; (3) for each prompt, retain the neurons with attribution scores greater than the attribution threshold t, obtaining the coarse set of knowledge neurons; (4) considering all the coarse sets together, retain the knowledge neurons shared by more than p% of the prompts.
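The attribution and refining pipeline above can be sketched end to end on a toy one-layer key-value FFN. This is a minimal numpy illustration, not the paper's BERT implementation: the toy model, the analytic softmax gradient, and all helper names (`attribution_scores`, `refine`, `t_ratio`) are our own assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_prob_wrt_act(a, V, y):
    # analytic gradient of P = softmax(a @ V)[y] with respect to the activations a
    p = softmax(a @ V)
    return p[y] * (V[:, y] - V @ p)

def attribution_scores(h, K, V, y, m=20):
    # Integrated gradients along the straight path from 0 to the original
    # activations a_bar, approximated by an m-step Riemann sum (Eq. 5).
    a_bar = gelu(h @ K)                     # neuron activations w_bar
    grads = np.zeros_like(a_bar)
    for k in range(1, m + 1):
        grads += grad_prob_wrt_act((k / m) * a_bar, V, y)
    return a_bar * grads / m

def refine(score_sets, t_ratio=0.2, share=0.7):
    # Per-prompt coarse sets (scores > t_ratio * max score), then keep only
    # neurons shared by more than `share` of the prompts (Section 3.3).
    coarse = [set(np.flatnonzero(s > t_ratio * s.max())) for s in score_sets]
    counts = {}
    for c in coarse:
        for i in c:
            counts[i] = counts.get(i, 0) + 1
    return {i for i, c in counts.items() if c / len(coarse) > share}
```

For a real pretrained model, the gradient would come from automatic differentiation and `h` would be the hidden state at the masked position. A useful sanity check is the completeness property of integrated gradients: the scores sum approximately to P_x(w̄) − P_x(0).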
Relations                | Template #1            | Template #2                            | Template #3
P176 (manufacturer)      | [X] is produced by [Y] | [X] is a product of [Y]                | [Y] and its product [X]
P463 (member_of)         | [X] is a member of [Y] | [X] belongs to the organization of [Y] | [X] is affiliated with [Y]
P407 (language_of_work)  | [X] was written in [Y] | The language of [X] is [Y]             | [X] was a [Y]-language work

Table 1: Example prompt templates of three relations in PARAREL. [X] and [Y] are the placeholders for the head and tail entities, respectively. Owing to the page width, we show only three templates for each relation.

4 Experiments

4.1 Experimental Settings

We conduct experiments on BERT-base-cased (Devlin et al., 2019), one of the most widely used pretrained models. It contains 12 Transformer blocks, where the hidden size is 768 and the FFN inner hidden size is 3,072. Notice that our method is not limited to BERT and can be easily generalized to other pretrained models. For each prompt, we set the attribution threshold t to 0.2 times the maximum attribution score. For each relation, we initialize the refining threshold p% (Section 3.3) as 0.7. Then, we increase or decrease it by 0.05 at a time until the average number of knowledge neurons lies in [2, 5]. We run our experiments on NVIDIA Tesla V100 GPUs. On average, it costs 13.3 seconds to identify the knowledge neurons for a relational fact with 9 prompts.

4.2 Dataset

We examine knowledge neurons through the fill-in-the-blank cloze task based on the PARAREL dataset (Elazar et al., 2021). PARAREL is curated by experts, containing various prompt templates for 38 relations from the T-REx dataset (ElSahar et al., 2018). We show some example templates in Table 1. For each relational fact, we fill in the head entity in prompt templates and leave the tail entity as a blank to predict.
In order to guarantee template diversity, we filter out relations with fewer than 4 prompt templates and finally keep 34 relations, where each relation has 8.63 different prompt templates on average. These prompt templates produce 253,448 knowledge-expressing prompts in total for 27,738 relational facts.

4.3 Attribution Baseline

Our baseline method takes the neuron activation value as the attribution score, i.e., Attr_base(w_i^(l)) = w̄_i^(l), which measures how sensitive a neuron is to the input. After computing attribution scores, we follow the same pipeline to obtain the refined knowledge neurons. For a fair comparison, we employ the same method to choose the hyperparameters t and p% for the baseline, to ensure that the average number of knowledge neurons for each relation lies in [2, 5].

The method based on neuron activation is a reasonable baseline. It is motivated by the FFN's analogy with the self-attention mechanism (as described in Section 2), because self-attention scores are usually used as a strong attribution baseline (Kovaleva et al., 2019; Voita et al., 2019; Hao et al., 2021).

Figure 3: Percentage of knowledge neurons identified by our method in each Transformer layer.

Type of Neurons             | Ours | Baseline
Knowledge neurons           | 4.13 | 3.96
⋂ of intra-rel. fact pairs  | 1.23 | 2.85
⋂ of inter-rel. fact pairs  | 0.09 | 1.92

Table 2: Statistics of knowledge neurons. ⋂ denotes the intersection of knowledge neurons of fact pairs. "rel." is the shorthand of relation. Our method identifies more exclusive knowledge neurons.

4.4 Statistics of Knowledge Neurons

Figure 3 presents the layer distribution of knowledge neurons identified by our knowledge attribution method. We notice that most fact-related neurons are distributed in the topmost layers of pretrained Transformers. The finding also agrees with Tenney et al. (2019) and Geva et al. (2020). Table 2 shows statistics of knowledge neurons.
On average, we identify 4.13 knowledge neurons for each relational fact using our knowledge attribution method, and 3.96 using the baseline method. Their same order of magnitude guarantees the fairness of the subsequent comparisons in the paper.

Figure 4: Results of suppressing knowledge neurons for various relations. Suppressing knowledge neurons decreases the correct probability by 29.03% on average. For the baseline, the decreasing ratio is 1.47% on average.

Figure 5: Results of amplifying knowledge neurons for various relations. Amplifying knowledge neurons increases the correct probability by 31.17% on average. For the baseline, the correct probability even decreases by 1.27%.

We also compute the knowledge neuron intersection of different relational facts. Table 2 shows the average number of pair-wise knowledge neuron intersections. For our proposed method, (1) fact pairs with the same relation (intra-relation fact pairs) share 1.23 knowledge neurons on average; (2) fact pairs with different relations (inter-relation fact pairs) share almost no knowledge neurons. In contrast, for the baseline, (3) most identified neurons are shared by intra-relation fact pairs; and (4) even a substantial portion of neurons are common to inter-relation fact pairs. The difference in knowledge neuron intersections suggests that our method can identify more exclusive knowledge neurons.
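The pairwise intersection statistics reported in Table 2 reduce to averaging set intersections over intra-relation and inter-relation fact pairs. A small sketch; the input format (a dict mapping each relation to the per-fact sets of neuron IDs) and the function names are our assumptions:

```python
from itertools import combinations

def avg_intersection(pairs):
    # mean size of the pairwise knowledge-neuron intersection
    pairs = list(pairs)
    return sum(len(a & b) for a, b in pairs) / max(len(pairs), 1)

def intersection_stats(neurons_by_relation):
    # neurons_by_relation: {relation: [set of neuron ids per fact, ...]}
    intra, inter = [], []
    for sets in neurons_by_relation.values():
        intra.extend(combinations(sets, 2))           # same-relation fact pairs
    for (_, s1), (_, s2) in combinations(neurons_by_relation.items(), 2):
        inter.extend((a, b) for a in s1 for b in s2)  # cross-relation fact pairs
    return avg_intersection(intra), avg_intersection(inter)
```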
4.5 Knowledge Neurons Affect Knowledge Expression

We investigate how much knowledge neurons can affect knowledge expression in Figure 4 and Figure 5. Given a relational fact, we manipulate its knowledge neurons in two ways: (1) suppressing knowledge neurons by setting their activations to 0; (2) amplifying knowledge neurons by doubling their activations. Then, for each relation, we plot the average change ratio of the probability of the correct answer, corresponding to the manipulation. For comparison, we also plot the results of manipulating baseline-identified knowledge neurons.

Figure 4 shows that suppressing knowledge neurons identified by our knowledge attribution method leads to a consistent decrease (29.03% on average) in the correct probability. By contrast, for baseline-identified neurons, the suppressing operation has a negligible influence (1.47% decrease on average) on the correct probability. Notably, for the relation P178 (developer), the correct probability abnormally increases when using the baseline.

As shown in Figure 5, we have similar observations for amplifying the knowledge neurons identified by our knowledge attribution. We see a consistent increase (31.17% on average) in the correct probability. By contrast, the baseline even decreases the average correct probability by 1.27%.

In summary, the knowledge neurons identified by our knowledge attribution method tend to notably affect knowledge expression. Notice that the above assessment is affected by the distribution of knowledge neurons. For example, if the knowledge neurons for a relation are distributed more widely, we need to manipulate more top-k neurons for better control. We use the above experiments as a proof of concept while leaving precise control for future work.

〈Ireland, capital, Dublin〉 — neurons w^(9)_2141, w^(10)_1122
  Top:    Our trip ... in Dublin, the capital and largest city of Ireland ... (6.36)
          Dublin is the capital and largest city of Ireland. (5.77)
  Bottom: Dublin just might be the most iconic destination in all of Ireland. (1.27)
          ... in Ireland's famed city, you can enjoy ... Dublin experience ... (-0.30)

〈Cao_Yunding, place_of_birth, Shanghai〉 — neurons w^(10)_739, w^(10)_1885, w^(11)_2876
  Top:    Cao Yunding was born in Shanghai in November 1989. (3.58)
          Full name: Cao Yunding ... Place of birth: Shanghai, China ... (2.73)
  Bottom: ... Cao Yunding (Shanghai Shenhua) is shown the red card ... (-0.30)
          Shanghai Shenhua midfielder Cao Yunding ... (-0.31)

〈Kuwait, continent, Asia〉 — neurons w^(6)_147, w^(9)_866, w^(9)_1461, w^(10)_1169
  Top:    Kuwait is thus one of the smallest countries in Asia ... (6.63)
          Kuwait is a country in Western Asia ... (6.27)
  Bottom: This page displays all Asia Society content on Kuwait ... (-0.48)
          Noor Asia is ... distribution companies in Kuwait ... (-0.59)

Table 3: Example relational facts along with their knowledge neurons, their top-2 and bottom-2 activating prompts, and the corresponding neuron activations. w^(l)_i denotes the i-th intermediate neuron in the l-th FFN. We fill the blank in each prompt with the correct answer for better readability. Owing to the page width, we show only key parts of overlong prompts. The top-2 activating prompts express exactly the relation, but the bottom-2 do not.

Prompt Types                   | Ours   | Baseline
Containing head and tail (T1)  | 0.485  | 2.472
Containing only head (T2)      | 0.019  | 2.312
Randomly sampled (T3)          | -0.018 | 2.244

Table 4: Average activation of knowledge neurons for three types of prompts. The activation of neurons identified by our method can distinguish the knowledge-expressing prompts (T1) clearly.

4.6 Knowledge Neurons are Activated by Knowledge-Expressing Prompts

In order to study what prompts can activate knowledge neurons, we compare the average activation of knowledge neurons for different types of prompts.
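The suppression and amplification operations of Section 4.5 are one-line edits of the FFN activation vector: set the selected activations to zero, or double them, and re-read the correct-answer probability. A toy numpy sketch of this manipulation (the helper name, toy shapes, and linear readout are our assumptions, not the paper's BERT code):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def answer_prob(a_bar, V, y, neuron_ids=(), scale=1.0):
    # a_bar: FFN intermediate activations; V: value matrix mapping them to logits.
    # scale = 0.0 suppresses the selected activations, 2.0 doubles them;
    # returns the probability of the correct answer y after the edit.
    a = a_bar.astype(float).copy()
    a[list(neuron_ids)] *= scale
    return softmax(a @ V)[y]
```

The per-relation curves in Figures 4 and 5 are then just the average relative change of this probability under `scale=0.0` and `scale=2.0`, respectively.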
BINGREL Dataset  We build a new dataset, BINGREL, by crawling the Bing search engine to collect new prompts, for a more extensive comparison beyond the PARAREL dataset. For each of the 27,738 facts in PARAREL, we crawl two types of texts: (1) up to ten texts containing both the head and the tail entities (210,217 texts crawled in total); (2) up to ten texts containing only the head entity, without restricting tail entities (266,020 texts crawled in total). Following the distant supervision assumption (Mintz et al., 2009), the first type of text tends to express the whole relational fact, while the second type does not. We mask tail entities for the first type of texts to obtain knowledge-expressing prompts (T1). In order to conduct a controlled experiment, we mask random words for the second type of texts, forming a control group (T2). Moreover, we employ randomly sampled prompts as another control group (T3).

Results  As shown in Table 4, for our method, the identified knowledge neurons are more significantly activated by knowledge-expressing prompts (T1 = 0.485), compared with the control groups (T2 = 0.019 and T3 = −0.018). By contrast, for the baseline, the activation of the identified neurons cannot distinguish the three types of prompts. In addition, since our comparison is based on the web-crawled BINGREL dataset, it validates the generalization of knowledge neurons to open-domain texts that are unseen in PARAREL.

Example Prompts  In Table 3, we present example prompts that activate knowledge neurons the most and the least, respectively. Given a fact, we first identify its knowledge neurons with our knowledge attribution method. Then, we calculate the average activation of its knowledge neurons for each crawled prompt in BINGREL that contains both the head and the tail entities. Finally, we show the two prompts with the highest average activation values and the two with the lowest (denoted as the top-2 and bottom-2 activating prompts, respectively).
As shown in Table 3, the top-2 activating prompts express exactly the corresponding relational fact. In contrast, despite containing the same head and tail entities, the bottom-2 activating prompts do not express the correct relation. For example, although the bottom-2 activating prompts for 〈Ireland, capital, Dublin〉 express information like "Dublin is a city in Ireland", they do not reflect the capital relation. The examples support again that knowledge neurons are activated by corresponding knowledge-expressing prompts.

Erased Relations              | PPL (erased relation), before → after | PPL (other relations), before → after
P19 (place_of_birth)          | 1450.0 → 2996.0 (+106.6%)             | 120.3 → 121.6 (+1.1%)
P27 (country_of_citizenship)  | 28.0 → 38.3 (+36.7%)                  | 143.6 → 149.5 (+4.2%)
P106 (occupation)             | 2279.0 → 5202.0 (+128.2%)             | 120.1 → 125.3 (+4.3%)
P937 (work_location)          | 58.0 → 140.0 (+141.2%)                | 138.0 → 151.9 (+10.1%)

Table 5: Case studies of erasing relations. The influence on knowledge expression is measured by the perplexity change. The knowledge erasing operation significantly affects the erased relation, and has just a moderate influence on the expression of other knowledge.

Metric              | Knowledge Neurons | Random Neurons
Change rate ↑       | 48.5%             | 4.7%
Success rate ↑      | 34.4%             | 0.0%
Δ Intra-rel. PPL ↓  | 8.4               | 10.1
Δ Inter-rel. PPL ↓  | 7.2               | 4.3

Table 6: Case studies of updating facts. ↑ means the higher the better, and ↓ means the lower the better. "rel." is the shorthand of relation. Keeping a moderate influence on other knowledge, the surgery of knowledge neurons achieves a nontrivial success rate.

5 Case Studies

We present two preliminary studies to demonstrate the potential applications of knowledge neurons. We use the case studies as a proof of concept while leaving precise fact editing for future work.

5.1 Updating Facts

By leveraging knowledge neurons in pretrained models, we try to update a learned relational fact from 〈h, r, t〉 to 〈h, r, t′〉.

Methods  First, we identify the knowledge neurons of 〈h, r, t〉.
Then, we retain the knowledge neurons that are shared by fewer than 10% of the intra-relation facts, to reduce the influence on other facts with the same relation. Finally, we directly modify the corresponding value slots in FFN^(val) (i.e., the second linear layer of FFNs; see Figure 2):

FFN^(val)_i = FFN^(val)_i − λ1 t + λ2 t′,

where FFN^(val)_i denotes the value slot corresponding to the i-th knowledge neuron; t and t′ are the word embeddings of t and t′, respectively; λ1 and λ2 are set to 1 and 8 in our experiments.

Setup  We conduct experiments on PARAREL. For each relation, we randomly sample ten facts learned by the pretrained model. For each fact 〈h, r, t〉, we randomly choose a different entity t′ with the same type as t (e.g., both t and t′ belong to city), and then update t′ as the target entity. We manipulate only about four top knowledge neurons, as in Section 4.4. For reference purposes, we also perform the same update process on the same number of random neurons.

Evaluation Metrics  We report two metrics to evaluate the fact updating: (1) change rate, the ratio of cases in which the original prediction t is modified to another; (2) success rate, the ratio of cases in which t′ becomes the top prediction. In addition, we measure the influence on other knowledge by the following two metrics: (1) Δ intra-relation PPL, the increase of perplexity on the prompts with the same relation r; (2) Δ inter-relation PPL, the increase of perplexity on the prompts with different relations.

Results  As shown in Table 6, the surgery of knowledge neurons achieves a nontrivial success rate for updating facts, while random neurons are insufficient. Moreover, we find that such manipulation has little negative influence on other knowledge predictions. It is promising that we can change very few (i.e., about four in the above experiments) neurons to affect certain facts in pretrained Transformers. We can further improve the success rate by including more top knowledge neurons in the update process.
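The value-slot surgery above can be sketched directly. A minimal numpy version, assuming the second FFN layer is stored with one value slot per row (the real BERT weight layout may be transposed), with λ1 = 1 and λ2 = 8 as in the paper:

```python
import numpy as np

def update_fact(ffn_val, neuron_ids, t_emb, t_new_emb, lam1=1.0, lam2=8.0):
    # Knowledge surgery from Section 5.1:
    #   FFN_val[i] <- FFN_val[i] - lam1 * t + lam2 * t'
    # ffn_val: (num_neurons, hidden) matrix whose rows are value slots;
    # t_emb / t_new_emb: word embeddings of the old and new tail entities.
    W = ffn_val.copy()                       # leave the original weights intact
    for i in neuron_ids:
        W[i] = W[i] - lam1 * t_emb + lam2 * t_new_emb
    return W
```

Erasing a relation (Section 5.2) corresponds to instead setting the selected value slots to zero vectors.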
5.2 Erasing Relations

We explore how to leverage knowledge neurons to erase specific relations in pretrained Transformers. Specifically, we take four relations in PARAREL as examples, i.e., place_of_birth, country_of_citizenship, occupation, and work_location, which typically express sensitive personal information.

Methods  Given a relation r, we first identify the knowledge neurons for all relational facts with r. Then, we retain the 20 knowledge neurons that appear most frequently among these facts. Finally, we set the value slots in FFN^(val) (see Figure 2) corresponding to these knowledge neurons to 0, i.e., zero vectors.

Results  As shown in Table 5, we report model perplexity before and after knowledge erasing. With the erasing operation, the perplexity of the removed knowledge increases as expected. Moreover, the model perplexity of other relations remains similar. We argue that knowledge neurons provide a promising way to erase undesired knowledge with minimal effort.

6 Related Work

Probing Knowledge in Pretrained Models  Many pieces of previous work aim to measure the knowledge stored in pretrained models. Petroni et al. (2019) propose to retrieve knowledge in pretrained models (such as BERT) using cloze queries. Their experiments show that BERT has a strong ability to recall factual knowledge without any fine-tuning. Jiang et al. (2020b) improve the cloze queries with mining-based and paraphrasing-based methods. Roberts et al. (2020) propose closed-book question answering to measure how much knowledge a pretrained model has stored in its parameters. Elazar et al. (2021) measure and improve the consistency of pretrained models with respect to factual knowledge prediction. Rather than examining only the model outputs, we provide an open-the-black-box analysis of the knowledge neurons in pretrained Transformers.
Attribution Methods  In order to open the black boxes of deep learning models, attribution methods aim to attribute the model output to input features using different measures. The product of the gradients (of the output with respect to input features) and the feature values is a reasonable baseline (Baehrens et al., 2010; Simonyan et al., 2014). Besides, a set of attribution methods (Shrikumar et al., 2017; Binder et al., 2016; Zeiler and Fergus, 2014; Springenberg et al., 2015) back-propagate the final output to input features. However, as stated by Sundararajan et al. (2017), none of these methods can simultaneously satisfy sensitivity and implementation invariance, two fundamental axioms. Taking the axioms as guidance, Sundararajan et al. (2017) propose the integrated gradients method. Our knowledge attribution method is built upon integrated gradients.

Analysis of Transformer  As one of the most popular and effective NLP architectures, Transformer (Vaswani et al., 2017) has attracted extensive studies. Most previous work focuses on the self-attention module (Voita et al., 2019; Clark et al., 2019; Vig and Belinkov, 2019; Hao et al., 2021). Recently, Wu et al. (2019) and Dong et al. (2021) have pointed out that the feed-forward network module also matters to Transformer. Geva et al. (2020) attempt to connect feed-forward networks with key-value memories through qualitative analysis. In this paper, we identify and analyze knowledge neurons in feed-forward networks for given factual knowledge. Moreover, we present how to leverage knowledge neurons to explicitly edit factual knowledge stored in pretrained Transformers.

7 Conclusion and Future Directions

We propose an attribution method to identify knowledge neurons that express factual knowledge in pretrained Transformers. We find that suppressing or amplifying the activation of knowledge neurons can accordingly affect the strength of knowledge expression.
Moreover, quantitative and qualitative analysis on open-domain texts shows that knowledge neurons tend to be activated by the corresponding knowledge-expressing prompts. In addition, we present two preliminary case studies that attempt to utilize knowledge neurons to update or erase knowledge in pretrained Transformers.

Despite the effectiveness of identifying knowledge neurons, our current studies still have limitations. First, we examine knowledge neurons based on the fill-in-the-blank cloze task, while knowledge can also be expressed in more implicit ways. It is an open question whether Transformer can utilize stored knowledge in a generalized way, such as for reasoning. The interactions between knowledge neurons also remain underexplored. Second, we focus on factual knowledge for ease of evaluation, even though our method is also applicable to other types of knowledge. Third, we use single-word blanks in cloze queries for simplicity, which calls for multi-word extensions (Jiang et al., 2020a). Besides, an interesting future direction is to figure out how knowledge neurons work in multilingual pretrained Transformers (Conneau and Lample, 2019; Conneau et al., 2020; Chi et al., 2021).

8 Acknowledgement

Damai Dai, Zhifang Sui, and Baobao Chang are supported by the National Key Research and Development Program of China 2020A0106701 and NSFC project U19A2065.

References

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, volume 119 of Proceedings of Machine Learning Research, pages 642–652. PMLR.

Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, Klaus-Robert Müller, and Wojciech Samek. 2016. Layer-wise relevance propagation for neural networks with local renormalization layers. In Proceedings of the 25th International Conference on Artificial Neural Networks, ICANN 2016, volume 9887 of Lecture Notes in Computer Science, pages 63–71. Springer.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 4762–4779. Association for Computational Linguistics.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Linguistics.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 8440–8451. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 7057–7067.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 13042–13054.

Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. CoRR, abs/2103.03404.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. CoRR, abs/2102.01017.

Hady ElSahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon S. Hare, Frédérique Laforest, and Elena Simperl. 2018. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018. European Language Resources Association (ELRA).
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2020. Transformer feed-forward layers are key-value memories. CoRR, abs/2012.14913.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-attention attribution: Interpreting information interactions inside transformer. In The Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs).

Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, and Graham Neubig. 2020a. X-FACTR: Multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 5943–5959. Association for Computational Linguistics.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020b. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011. The Association for Computer Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, pages 2463–2473. Association for Computational Linguistics.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, pages 5418–5426. Association for Computational Linguistics.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. 2015. Striving for simplicity: The all convolutional net. In 3rd International Conference on Learning Representations, ICLR 2015.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pages 5797–5808. Association for Computational Linguistics.

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer.