
Paper deep dive

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy

Year: 2021 · Venue: EMNLP 2021 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 42

Models: Baevski and Auli (2019) transformer LM (16 layers, 247M parameters)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 8:28:52 PM

Summary

This paper demonstrates that feed-forward layers in transformer-based language models function as key-value memories. The keys act as pattern detectors for input sequences, while the values induce output distributions over the vocabulary. The authors show that lower layers capture shallow patterns, whereas upper layers capture semantic patterns, and that the final output is a composition of these memories refined through residual connections.

Entities (4)

Feed-Forward Layer · component · 100%
Transformer · model-architecture · 100%
WikiText-103 · dataset · 100%
Key-Value Memory · conceptual-framework · 95%

Relation Signals (3)

Feed-Forward Layer operates as Key-Value Memory

confidence 95% · feed-forward layers in transformer-based language models operate as key-value memories

Key detects Input Pattern

confidence 90% · each key correlates with textual patterns in the training examples

Value induces Output Distribution

confidence 90% · each value induces a distribution over the output vocabulary

Cypher Suggestions (2)

Find all components that function as memories · confidence 90% · unvalidated

MATCH (c:Component)-[:OPERATES_AS]->(m:Concept {name: 'Key-Value Memory'}) RETURN c.name

Identify patterns detected by specific layers · confidence 85% · unvalidated

MATCH (l:Layer)-[:DETECTS]->(p:Pattern) RETURN l.id, p.description

Abstract

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Full Text

41,203 characters extracted from source content.


Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva (1,2), Roei Schuster (1,3), Jonathan Berant (1,2), Omer Levy (1)
1 Blavatnik School of Computer Science, Tel-Aviv University; 2 Allen Institute for Artificial Intelligence; 3 Cornell Tech
morgeva@mail.tau.ac.il, joberant@cs.tau.ac.il, levyomer@cs.tau.ac.il, rs864@cornell.edu

Abstract

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.

1 Introduction

Transformer-based language models (Vaswani et al., 2017) are at the core of state-of-the-art natural language processing (Devlin et al., 2019; Brown et al., 2020), largely due to the success of self-attention. While much literature has been devoted to analyzing the function of self-attention layers (Voita et al., 2019; Clark et al., 2019; Vig and Belinkov, 2019), they account for only a third of a typical transformer's parameters ($4d^2$ per layer, where $d$ is the model's hidden dimension). Most of the parameter budget is spent on position-wise feed-forward layers ($8d^2$ per layer), yet their role remains under-explored. What, then, is the function of feed-forward layers in a transformer language model?

We show that feed-forward layers emulate neural memories (Sukhbaatar et al., 2015), where the first parameter matrix in the layer corresponds to keys, and the second parameter matrix to values. Figure 1 shows how the keys (first parameter matrix) interact with the input to produce coefficients, which are then used to compute a weighted sum of the values (second parameter matrix) as the output. While the theoretical similarity between feed-forward layers and key-value memories has previously been suggested by Sukhbaatar et al. (2019), we take this observation one step further, and analyze the "memories" that the feed-forward layers store.

[Figure 1: An illustration of how a feed-forward layer emulates a key-value memory. Input vectors (here, $x_5$) are multiplied by keys to produce memory coefficients (e.g., the memory coefficient for $v_1$ is 0.2), which then weigh distributions over the output vocabulary, stored in the values. The feed-forward layer's output is thus the weighted sum of its values.]
We find that each key correlates with a specific set of human-interpretable input patterns, such as n-grams or semantic topics. For example, $k_2$ in Figure 1 is triggered by inputs that describe a period of time and end with "a". Simultaneously, we observe that each value can induce a distribution over the output vocabulary, and that this distribution correlates with the next-token distribution of the corresponding keys in the upper layers of the model. In the above example, the corresponding value $v_2$ represents a distribution that puts most of its probability mass on the word "while".

Lastly, we analyze how the language model, as a whole, composes its final prediction from individual memories. We observe that each layer combines hundreds of active memories, creating a distribution that is qualitatively different from each of its component memories' values. Meanwhile, the residual connection between layers acts as a refinement mechanism, gently tuning the prediction at each layer while retaining most of the residual's information.

In conclusion, our work sheds light on the function of feed-forward layers in transformer-based language models. We show that feed-forward layers act as pattern detectors over the input across all layers, and that the final output distribution is gradually constructed in a bottom-up fashion. (The code for reproducing the experiments is available at https://github.com/mega002/f-layers/.)

2 Feed-Forward Layers as Unnormalized Key-Value Memories

Feed-forward layers. A transformer language model (Vaswani et al., 2017) is made of intertwined self-attention and feed-forward layers. Each feed-forward layer is a position-wise function, processing each input vector independently. Let $x \in \mathbb{R}^d$ be a vector corresponding to some input text prefix. We can express the feed-forward layer $\mathrm{FF}(\cdot)$ as follows (bias terms are omitted):

$\mathrm{FF}(x) = f(x \cdot K^\top) \cdot V$   (1)

Here, $K, V \in \mathbb{R}^{d_m \times d}$ are parameter matrices, and $f$ is a non-linearity such as ReLU.

Neural memory. A neural memory (Sukhbaatar et al., 2015) consists of $d_m$ key-value pairs, which we call memories (we use the terms "memory cells" and "memories" interchangeably). Each key is represented by a $d$-dimensional vector $k_i \in \mathbb{R}^d$, and together the keys form the parameter matrix $K \in \mathbb{R}^{d_m \times d}$; likewise, we define the value parameters as $V \in \mathbb{R}^{d_m \times d}$. Given an input vector $x \in \mathbb{R}^d$, we compute a distribution over the keys, and use it to compute the expected value:

$p(k_i \mid x) \propto \exp(x \cdot k_i)$

$\mathrm{MN}(x) = \sum_{i=1}^{d_m} p(k_i \mid x)\, v_i$

With matrix notation, we arrive at a more compact formulation:

$\mathrm{MN}(x) = \mathrm{softmax}(x \cdot K^\top) \cdot V$   (2)

Feed-forward layers emulate neural memory. Comparing Equations 1 and 2 shows that feed-forward layers are almost identical to key-value neural memories; the only difference is that neural memory uses softmax as the non-linearity $f(\cdot)$, while the canonical transformer does not use a normalizing function in the feed-forward layer. The hidden dimension $d_m$ is essentially the number of memories in the layer, and the activation $m = f(x \cdot K^\top)$, commonly referred to as the hidden layer, is a vector containing an unnormalized non-negative coefficient for each memory. We refer to each $m_i$ as the memory coefficient of the $i$-th memory cell.

Sukhbaatar et al. (2019) make an analogous observation, and incorporate the parameters of the feed-forward layers as persistent memory cells in the self-attention layers. While this reparameterization works in practice, the experiment does not tell us much about the role of feed-forward layers in the canonical transformer.
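To make the comparison between Equations 1 and 2 concrete, here is a minimal NumPy sketch (not the authors' code; the dimensions follow the Baevski and Auli model analyzed below, and the random matrices are placeholders — the only difference between the two functions is the ReLU vs. softmax non-linearity):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ff_layer(x, K, V):
    # Equation 1: unnormalized key-value memory with ReLU coefficients.
    m = relu(x @ K.T)          # memory coefficients, shape (d_m,)
    return m @ V               # weighted sum of value vectors, shape (d,)

def neural_memory(x, K, V):
    # Equation 2: the same computation with softmax-normalized coefficients.
    p = softmax(x @ K.T)       # distribution over keys, shape (d_m,)
    return p @ V

d, d_m = 1024, 4096            # dimensions of the analyzed 16-layer model
rng = np.random.default_rng(0)
x = rng.normal(size=d)
K = rng.normal(size=(d_m, d)) * 0.02
V = rng.normal(size=(d_m, d)) * 0.02
print(ff_layer(x, K, V).shape, neural_memory(x, K, V).shape)  # (1024,) (1024,)
```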
If transformer feed-forward layers are indeed key-value memories, then what memories do they store? We conjecture that each key vector $k_i$ captures a particular pattern (or set of patterns) in the input sequence (Section 3), and that its corresponding value vector $v_i$ represents the distribution of tokens that follows said pattern (Section 4).

3 Keys Capture Input Patterns

We posit that the key vectors $K$ in feed-forward layers act as pattern detectors over the input sequence, where each individual key vector $k_i$ corresponds to a specific pattern over the input prefix $x_1, \ldots, x_j$. To test our claim, we analyze the keys of a trained language model's feed-forward layers. We first retrieve the training examples (prefixes of a sentence) most associated with a given key, that is, the input texts where the memory coefficient is highest. We then ask humans to identify patterns within the retrieved examples. For almost every key $k_i$ in our sample, a small set of well-defined patterns, recognizable by humans, covers most of the examples associated with the key.

Table 1: Examples of human-identified patterns that trigger different memory keys.
• $k^{1}_{449}$ — Ends with "substitutes" (shallow). Example trigger prefixes: "At the meeting, Elton said that 'for artistic reasons there could be no substitutes"; "In German service, they were used as substitutes"; "Two weeks later, he came off the substitutes".
• $k^{6}_{2546}$ — Military, ends with "base"/"bases" (shallow + semantic). "On 1 April the SRSG authorised the SADF to leave their bases"; "Aircraft from all four carriers attacked the Australian base"; "Bombers flying missions to Rabaul and other Japanese bases".
• $k^{10}_{2997}$ — A "part of" relation (semantic). "In June 2012 she was named as one of the team that competed"; "He was also a part of the Indian delegation"; "Toy Story is also among the top ten in the BFI list of the 50 films you should".
• $k^{13}_{2989}$ — Ends with a time range (semantic). "Worldwide, most tornadoes occur in the late afternoon, between 3 pm and 7"; "Weekend tolls are in effect from 7:00 pm Friday until"; "The building is open to the public seven days a week, from 11:00 am to".
• $k^{16}_{1935}$ — TV shows (semantic). "Time shifting viewing added 57 percent to the episode's"; "The first season set that the episode was included in was as part of the"; "From the original NBC daytime version, archived".

3.1 Experiment

We conduct our experiment over the language model of Baevski and Auli (2019), a 16-layer transformer language model trained on WikiText-103 (Merity et al., 2017). This model defines $d = 1024$ and $d_m = 4096$, and has a total of $d_m \cdot 16 = 65{,}536$ potential keys to analyze. We randomly sample 10 keys per layer (160 in total).

Retrieving trigger examples. We assume that patterns stored in memory cells originate from examples the model was trained on. Therefore, given a key $k^\ell_i$ that corresponds to the $i$-th hidden dimension of the $\ell$-th feed-forward layer, we compute the memory coefficient $\mathrm{ReLU}(x^\ell_j \cdot k^\ell_i)$ for every prefix $x_1, \ldots, x_j$ of every sentence from the WikiText-103 training set (training examples are segmented into sentences to simplify the annotation task and later analyses). For example, for the hypothetical sentence "I love dogs", we will compute three coefficients, for the prefixes "I", "I love", and "I love dogs". Then, we retrieve the top-$t$ trigger examples, that is, the $t$ prefixes whose representation at layer $\ell$ yielded the highest inner product with $k^\ell_i$.
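A sketch of this trigger-example retrieval. The corpus iterator and the layer_inputs helper (which would return the feed-forward input vector at layer $\ell$ for a given prefix) are hypothetical placeholders, not part of the paper's released code:

```python
import heapq
import numpy as np

def top_trigger_examples(key, prefixes, layer_inputs, t=25):
    """Return the t prefixes whose layer-l representation gives the
    highest ReLU memory coefficient with the given key vector."""
    scored = []
    for prefix in prefixes:                      # every sentence prefix in the corpus
        x = layer_inputs(prefix)                 # hypothetical: FF input vector at layer l
        coeff = max(float(np.dot(x, key)), 0.0)  # ReLU(x_j^l · k_i^l)
        scored.append((coeff, prefix))
    return heapq.nlargest(t, scored)             # top-t (coefficient, prefix) pairs
```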
Pattern analysis. We let human experts (NLP graduate students) annotate the top-25 prefixes retrieved for each key, and asked them to (a) identify repetitive patterns that occur in at least 3 prefixes (which would strongly indicate a connection to the key, as this would be unlikely to happen if sentences were drawn at random), (b) describe each recognized pattern, and (c) classify each recognized pattern as "shallow" (e.g., recurring n-grams) or "semantic" (a recurring topic). Each key and its corresponding top-25 prefixes were annotated by one expert. To assure that every pattern is grounded in at least 3 prefixes, we instruct the experts to specify, for each of the top-25 prefixes, which pattern(s) it contains. A prefix may be associated with multiple (shallow or semantic) patterns. Table 1 shows example patterns. A fully-annotated example of the top-25 prefixes from a single memory key is shown in Appendix A.

3.2 Results

Memories are associated with human-recognizable patterns. Experts were able to identify at least one pattern for every key, with an average of 3.6 identified patterns per key. Furthermore, the vast majority of retrieved prefixes (65%-80%) were associated with at least one identified pattern (Figure 2). Thus, the top examples triggering each key share clear patterns that humans can recognize.

[Figure 2: Breakdown of the labels experts assigned to trigger examples in each layer, by pattern type (shallow, shallow + semantic, semantic). Some examples were not associated with any pattern ("not-covered").]

Shallow layers detect shallow patterns. Comparing the number of prefixes associated with shallow patterns versus semantic patterns (Figure 2), the lower layers (layers 1-9) are dominated by shallow patterns, often with prefixes that share the last word (e.g., $k^{1}_{449}$ in Table 1). In contrast, the upper layers (layers 10-16) are characterized by more semantic patterns, with prefixes from similar contexts but without clear surface-form similarities (e.g., $k^{16}_{1935}$ in Table 1). This observation corroborates recent findings that lower (upper) layers in deep contextualized models encode shallow (semantic) features of the inputs (Peters et al., 2018; Jawahar et al., 2019; Liu et al., 2019).

To further test this hypothesis, we sample 1600 random keys (100 keys per layer) and apply local modifications to the top-50 trigger examples of every key. Specifically, we remove either the first, the last, or a random token from the input, and measure how this mutation affects the memory coefficient. Figure 3 shows that the model considers the end of an example as more salient than the beginning for predicting the next token. In the upper layers, removing the last token has less impact, supporting our conclusion that upper-layer keys are less correlated with shallow patterns.

[Figure 3: Relative change in memory coefficient caused by removing the first, the last, or a random token from the input, per layer.]
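A sketch of the token-removal perturbation, reusing the same hypothetical layer_inputs helper as above (here applied to a list of tokens); it reports the relative change in the memory coefficient when one token is dropped from a trigger prefix:

```python
import numpy as np

def coefficient_change(key, tokens, layer_inputs, position="last"):
    """Relative change (%) in ReLU(x · k) when one token is removed from the prefix."""
    base = max(float(np.dot(layer_inputs(tokens), key)), 0.0)
    if position == "first":
        mutated = tokens[1:]
    elif position == "last":
        mutated = tokens[:-1]
    else:                                   # remove a random token
        i = np.random.randint(len(tokens))
        mutated = tokens[:i] + tokens[i + 1:]
    new = max(float(np.dot(layer_inputs(mutated), key)), 0.0)
    return 100.0 * (new - base) / (base + 1e-9)
```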
4 Values Represent Distributions

After establishing that keys capture patterns in training examples, we turn to analyze the information stored in their corresponding values. We show that each value $v^\ell_i$ can be viewed as a distribution over the output vocabulary, and demonstrate that this distribution complements the patterns in the corresponding key $k^\ell_i$ in the model's upper layers (see Figure 1).

Casting values as distributions over the vocabulary. We begin by converting each value vector $v^\ell_i$ into a probability distribution over the vocabulary by multiplying it by the output embedding matrix $E$ and applying a softmax (this is a simplification; in practice, we use the adaptive softmax of Baevski and Auli (2019) to compute probabilities):

$p^\ell_i = \mathrm{softmax}(v^\ell_i \cdot E)$

The probability distribution $p^\ell_i$ is uncalibrated, since the value vector $v^\ell_i$ is typically multiplied by the input-dependent memory coefficient $m^\ell_i$, changing the skewness of the output distribution. That said, the ranking induced by $p^\ell_i$ is invariant to the coefficient, and can still be examined. This conversion assumes (naively) that all of the model's layers operate in the same embedding space.

Value predictions follow key patterns in upper layers. For every layer $\ell$ and memory dimension $i$, we compare the top-ranked token according to $v^\ell_i$, i.e. $\mathrm{argmax}(p^\ell_i)$, to the next token $w^\ell_i$ in the top-1 trigger example according to $k^\ell_i$ (the example whose memory coefficient for $k^\ell_i$ is the highest). Figure 4 shows the agreement rate, i.e. the fraction of memory cells (dimensions) where the value's top prediction matches the key's top trigger example ($\mathrm{argmax}(p^\ell_i) = w^\ell_i$).

[Figure 4: Agreement rate between the top-ranked token based on the value vector $v^\ell_i$ and the next token of the top-ranked trigger example associated with the key vector $k^\ell_i$, per layer.]

It can be seen that the agreement rate is close to zero in the lower layers (1-10), but starting from layer 11, the agreement rate quickly rises to 3.5%, showing higher agreement between keys and values on the identity of the top-ranked token. Importantly, this value is orders of magnitude higher than a random token prediction from the vocabulary, which would produce a far lower agreement rate (0.0004%), showing that upper-layer memories manifest non-trivial predictive power.

Next, we take the next token of $k^\ell_i$'s top-1 trigger example ($w^\ell_i$), and find where it ranks in the value vector's distribution $p^\ell_i$. Figure 5 shows that the rank of the next token of a trigger example improves through the layers, meaning that $w^\ell_i$ tends to get higher probability in the upper layers.

[Figure 5: Distribution of the rank of the next token in the top-1 trigger example of $k^\ell_i$ ($w^\ell_i$), according to the ranking induced by the value vector $v^\ell_i$. The tail of the distribution, which stretches up to the vocabulary size (~270K tokens), is cut.]

Detecting predictive values. To examine whether we can automatically detect values with a high agreement rate, we analyze the probability of the values' top prediction, i.e., $\max(p^\ell_i)$. Figure 6 shows that although these distributions are not calibrated, distributions with higher maximum probabilities are more likely to agree with their key's top trigger example.
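A sketch of the value-to-vocabulary projection and the agreement-rate computation described above, using a plain softmax instead of the adaptive softmax the authors use; next_token_of_top_trigger is a hypothetical lookup built from the trigger-example retrieval in Section 3:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def value_distribution(v, E):
    # p_i^l = softmax(v_i^l · E): cast a value vector as a vocabulary distribution.
    return softmax(v @ E)              # E has shape (d, vocab_size)

def agreement_rate(V, E, next_token_of_top_trigger):
    """Fraction of memory cells whose value's top token matches the next token
    of the key's top-1 trigger example (argmax(p_i^l) == w_i^l)."""
    hits = 0
    for i, v in enumerate(V):          # V has shape (d_m, d)
        p = value_distribution(v, E)
        if int(np.argmax(p)) == next_token_of_top_trigger[i]:
            hits += 1
    return hits / len(V)
```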
We then take the 100 values with the highest probability across all layers and dimensions (97 out of the 100 are in the upper layers, 11-16), and for each value $v^\ell_i$, analyze the top-50 trigger examples of $k^\ell_i$. We find that in almost half of the values (46 out of 100), there is at least one trigger example that agrees with the value's top prediction. Examples are provided in Table 2.

[Figure 6: Agreement rate (between the top-ranked token based on the value vector $v^\ell_i$ and the next token of the top-ranked trigger example associated with the key vector $k^\ell_i$) as a function of the maximal probability assigned by the value vector.]

Discussion. When viewed as distributions over the output vocabulary, values in the upper layers tend to assign higher probability to the next token of examples triggering the corresponding keys. This suggests that memory cells often store information on how to directly predict the output (the distribution of the next word) from the input (patterns in the prefix). Conversely, the lower layers do not exhibit such a clear correlation between the keys' patterns and the corresponding values' distributions. A possible explanation is that the lower layers do not operate in the same embedding space, and therefore, projecting values onto the vocabulary using the output embeddings does not produce distributions that follow the trigger examples. However, our results imply that some intermediate layers do operate in the same or a similar space to upper layers (exhibiting some agreement), which in itself is non-trivial. We leave further exploration of this phenomenon to future work.

Table 2: Example values, their top prediction, the fraction of their key's top-50 trigger examples that agree with their prediction (Precision@50), and a matching trigger example (with the target token marked in blue in the original).
• $v^{15}_{222}$ — prediction "each", 68%. "But when bees and wasps resemble each"
• $v^{16}_{752}$ — prediction "played", 16%. "Her first role was in Vijay Lalwani's psychological thriller Karthik Calling Karthik, where Padukone was cast as the supportive girlfriend of a depressed man (played"
• $v^{13}_{2601}$ — prediction "extratropical", 4%. "Most of the winter precipitation is the result of synoptic scale, low pressure weather systems (large scale storms such as extratropical"
• $v^{15}_{881}$ — prediction "part", 92%. "Comet served only briefly with the fleet, owing in large part"
• $v^{16}_{2070}$ — prediction "line", 84%. "Sailing from Lorient in October 1805 with one ship of the line"
• $v^{12}_{3186}$ — prediction "jail", 4%. "On May 11, 2011, four days after scoring 6 touchdowns for the Slaughter, Grady was sentenced to twenty days in jail"

5 Aggregating Memories

So far, our discussion has been about the function of a single memory cell in feed-forward layers. How does the information from multiple cells in multiple layers aggregate to form a model-wide prediction? We show that every feed-forward layer combines multiple memories to produce a distribution that is qualitatively different from each of its component memories' value distributions (Section 5.1). These layer-wise distributions are then combined via residual connections in a refinement process, where each feed-forward layer updates the residual's distribution to finally form the model's output (Section 5.2).

5.1 Intra-Layer Memory Composition

The feed-forward layer's output can be defined as the sum of value vectors weighted by their memory coefficients, plus a bias term:

$y^\ell = \sum_i \mathrm{ReLU}(x^\ell \cdot k^\ell_i) \cdot v^\ell_i + b^\ell$
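A small sketch of this composition, together with the fraction-of-active-memories statistic discussed below (pure NumPy; x, K, V, and b stand for the parameters and input of one feed-forward layer):

```python
import numpy as np

def layer_output_and_active_fraction(x, K, V, b):
    """y^l = sum_i ReLU(x · k_i) v_i + b, plus the share of non-zero coefficients."""
    m = np.maximum(x @ K.T, 0.0)        # memory coefficients, shape (d_m,)
    y = m @ V + b                       # weighted sum of value vectors, shape (d,)
    active_fraction = float((m > 0).mean())
    return y, active_fraction
```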
If each value vector $v^\ell_i$ contains information about the target token's distribution, how is this information aggregated into a single output distribution? To find out, we analyze the behavior of 4,000 randomly-sampled prefixes from the validation set. Here, the validation set is used (rather than the training set used to find trigger examples) since we are trying to characterize the model's behavior at inference time, not find the examples it "memorizes" during training.

We first measure the fraction of "active" memories (cells with a non-zero coefficient). Figure 7 shows that a typical example triggers hundreds of memories per layer (10%-50% of the 4096 dimensions), but the majority of cells remain inactive. Interestingly, the number of active memories drops towards layer 10, which is the same layer in which semantic patterns become more prevalent than shallow patterns, according to expert annotations (see Section 3, Figure 2).

[Figure 7: The fraction of active memories (i.e., with positive memory coefficient) out of 4096 memories in every layer, for a random sample of 4,000 examples.]

While there are cases where a single memory cell dominates the output of a layer, the majority of outputs are clearly compositional. We count the number of instances where the feed-forward layer's top prediction is different from all of the memories' top predictions. Formally, we denote

$\mathrm{top}(h) = \mathrm{argmax}(h \cdot E)$

as a generic shorthand for the top prediction from the vocabulary distribution induced by the vector $h$, and compute the number of examples where the following condition holds:

$\forall i: \mathrm{top}(v^\ell_i) \neq \mathrm{top}(y^\ell)$

Figure 8 shows that, for any layer in the network, the layer's final prediction is different from every one of the memories' predictions in at least ~68% of the examples. Even in the upper layers, where the memories' values are more correlated with the output space (Section 4), the layer-level prediction is typically not the result of a single dominant memory cell, but a composition of multiple memories.

[Figure 8: The fraction of examples in a random sample of 4,000 examples where the layer's prediction is different from the prediction of all of its memories.]

We further analyze cases where at least one memory cell agrees with the layer's prediction, and find that (a) in 60% of the examples the target token is a common stop word in the vocabulary (e.g., "the" or "of"), and (b) in 43% of the cases the input prefix has fewer than 5 tokens. This suggests that very common patterns in the training data might be "cached" in individual memory cells, and do not require compositionality.

5.2 Inter-Layer Prediction Refinement

While a single feed-forward layer composes its memories in parallel, a multi-layer model uses the residual connection $r$ to sequentially compose predictions to produce the model's final output (the residual propagates information from previous layers, including the transformer's self-attention layers):

$x^\ell = \mathrm{LayerNorm}(r^\ell)$

$y^\ell = \mathrm{FF}(x^\ell)$

$o^\ell = y^\ell + r^\ell$

We hypothesize that the model uses this sequential composition apparatus as a means to refine its prediction from layer to layer, often deciding what the prediction will be at one of the lower layers.
To test our hypothesis, we first measure how often the probability distribution induced by the residual vector $r^\ell$ matches the model's final output $o^L$ ($L$ being the total number of layers):

$\mathrm{top}(r^\ell) = \mathrm{top}(o^L)$

Figure 9 shows that roughly a third of the model's predictions are determined in the bottom few layers. This number grows rapidly from layer 10 onwards, implying that the majority of "hard" decisions occur before the final layer.

[Figure 9: Fraction of examples in each layer where the residual's top prediction matches the model's output.]

We also measure the probability mass $p$ that each layer's residual vector $r^\ell$ assigns to the model's final prediction:

$w = \mathrm{top}(o^L), \quad p = \mathrm{softmax}(r^\ell \cdot E)_w$

[Figure 10: Probability of the token output by the model according to the residual of each layer.]

Figure 10 shows a similar trend, but emphasizes that it is not only the top prediction's identity that is refined as we progress through the layers; it is also the model's confidence in its decision.

To better understand how the refinement process works at each layer, we measure how often the residual's top prediction changes following its interaction with the feed-forward layer ($\mathrm{top}(r^\ell) \neq \mathrm{top}(o^\ell)$), and whether this change results from the feed-forward layer overriding the residual ($\mathrm{top}(o^\ell) = \mathrm{top}(y^\ell)$) or from a true composition ($\mathrm{top}(r^\ell) \neq \mathrm{top}(o^\ell) \neq \mathrm{top}(y^\ell)$).

Figure 11 shows the breakdown of the different cases per layer. In the vast majority of examples, the residual's top prediction ends up being the model's prediction (residual + agreement). In most of these cases, the feed-forward layer predicts something different (residual). Perhaps surprisingly, when the residual's prediction does change (composition + ffn), it rarely changes to the feed-forward layer's prediction (ffn). Instead, we observe that composing the residual's distribution with that of the feed-forward layer produces a "compromise" prediction, which is equal to neither (composition). This behavior is similar to the intra-layer composition we observe in Section 5.1. A possible conjecture is that the feed-forward layer acts as an elimination mechanism to "veto" the top prediction in the residual, and thus shifts probability mass towards one of the other candidate predictions in the head of the residual's distribution.

[Figure 11: Breakdown of examples by prediction cases: the layer's output prediction matches the residual's prediction (residual), matches the feed-forward layer's prediction (ffn), matches both of the predictions (agreement), or none of the predictions (composition). By construction, there are no cases where the residual's prediction matches the feed-forward layer's prediction and does not match the output's prediction.]
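A sketch of this per-layer case classification, given the residual $r^\ell$, the feed-forward output $y^\ell$, and the output embedding matrix E (a simplified stand-in for the adaptive-softmax projection):

```python
import numpy as np

def top(h, E):
    # top(h) = argmax(h · E): top vocabulary token induced by vector h.
    return int(np.argmax(h @ E))

def classify_layer_case(r, y, E):
    """Classify one example at one layer as residual / ffn / agreement / composition."""
    o = r + y                           # o^l = y^l + r^l
    t_r, t_y, t_o = top(r, E), top(y, E), top(o, E)
    if t_o == t_r and t_o == t_y:
        return "agreement"
    if t_o == t_r:
        return "residual"
    if t_o == t_y:
        return "ffn"
    return "composition"
```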
Finally, we manually analyze 100 random cases of last-layer composition, where the feed-forward layer modifies the residual output in the final layer. We find that in most cases (66 examples), the output changes to a semantically distant word (e.g., "people" → "same"), and in the rest of the cases (34 examples), the feed-forward layer's output shifts the residual prediction to a related word (e.g., "later" → "earlier" and "gastric" → "stomach"). This suggests that feed-forward layers tune the residual predictions at varying granularity, even in the last layer of the model.

6 Related Work

Considerable attention has been given to demystifying the operation of neural NLP models. An extensive line of work targeted neuron functionality in general, extracting the properties that neurons and subsets of neurons capture (Durrani et al., 2020; Dalvi et al., 2019; Rethmeier et al., 2020; Mu and Andreas, 2020; Vig et al., 2020), regardless of the model architecture or the neurons' position in it. Jacovi et al. (2018) analyzed CNN architectures in text classification and showed that they extract key n-grams from the inputs.

The study of the transformer architecture has focused on the role and function of self-attention layers (Voita et al., 2019; Clark et al., 2019; Vig and Belinkov, 2019) and on inter-layer differences, i.e. lower vs. upper layers (Tenney et al., 2019; Jawahar et al., 2019). Previous work also highlighted the importance of feed-forward layers in transformers (Press et al., 2020; Pulugundla et al., 2021; Xu et al., 2020). Still, to date, the role of feed-forward layers remains under-explored.

Also related are interpretability methods that explain predictions (Han et al., 2020; Wiegreffe and Pinter, 2019); however, our focus is entirely different: we do not interpret individual predictions, but aim to understand the mechanism of transformers. Characterizing the functionality of memory cells based on examples that trigger maximal activations has been used previously in NLP (Rethmeier et al., 2020) and vision (Erhan et al., 2009).

7 Discussion and Conclusion

Understanding how and why transformers work is crucial to many aspects of modern NLP, including model interpretability, data security, and the development of better models. Feed-forward layers account for most of a transformer's parameters, yet little is known about their function in the network.

In this work, we propose that feed-forward layers emulate key-value memories, and provide a set of experiments showing that: (a) keys are correlated with human-interpretable input patterns; (b) values, mostly in the model's upper layers, induce distributions over the output vocabulary that correlate with the next-token distribution of patterns in the corresponding key; and (c) the model's output is formed via an aggregation of these distributions, whereby they are first composed to form individual layer outputs, which are then refined throughout the model's layers using residual connections.

Our findings open important research directions:

• Layer embedding space. We observe a correlation between value distributions over the output vocabulary and key patterns, which increases from lower to upper layers (Section 4). Is this because the layer's output space transforms across layers? If so, how? We note that this possible transformation cannot be explained solely by the function of feed-forward layers: if the model only did a series of key-value look-ups and value-distribution aggregation via weighted addition, then a single, unifying embedding space would appear more natural.
Thus, the transformation might have to do with the interplay between feed-forward layers and self-attention layers.

• Beyond language modeling. Our formulation of feed-forward networks as key-value memories generalizes to any transformer model, e.g. BERT encoders and neural translation models. We thus expect our qualitative empirical observations to hold across diverse settings, and leave verification of this for future work.

• Practical implications. A better understanding of feed-forward layers has many implications in NLP. For example, future studies may offer interpretability methods by automating the pattern-identification process; memory cells might affect training-data privacy, as they could facilitate white-box membership inference (Nasr et al., 2019); and studying cases where a correct pattern is identified but then suppressed during aggregation may guide architectural novelties.

Thus, by illuminating the role of feed-forward layers, we move towards a better understanding of the inner workings of transformers, and open new research threads on modern NLP models.

Acknowledgements

We thank Shimi Salant and Tal Schuster for helpful feedback. This work was supported in part by the Yandex Initiative for Machine Learning, the Blavatnik Interdisciplinary Cyber Research Center (ICRC), the Alon Scholarship, and Intel Corporation. Roei Schuster is a member of the Check Point Institute of Information Technology. This work was completed in partial fulfillment for the Ph.D. degree of Mor Geva.

References

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations (ICLR).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of Neural Information Processing Systems (NeurIPS).

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In BlackboxNLP Workshop at ACL.

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, Anthony Bau, and James Glass. 2019. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6309–6317.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186, Minneapolis, Minnesota.

Nadir Durrani, Hassan Sajjad, Fahim Dalvi, and Yonatan Belinkov. 2020. Analyzing individual neurons in pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.
Xiaochuang Han, Byron C. Wallace, and Yulia Tsvetkov. 2020. Explaining black box predictions and unveiling data artifacts through influence functions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5553–5563, Online. Association for Computational Linguistics.

Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding convolutional neural networks for text classification. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 56–65, Brussels, Belgium. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR).

Jesse Mu and Jacob Andreas. 2020. Compositional explanations of neurons. In Proceedings of Neural Information Processing Systems (NeurIPS).

Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP), pages 739–753.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics (NAACL).

Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving transformer models by reordering their sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2996–3005, Online. Association for Computational Linguistics.

Bhargav Pulugundla, Yang Gao, Brian King, Gokce Keskin, Harish Mallidi, Minhua Wu, Jasha Droppo, and Roland Maas. 2021. Attention-based neural beamforming layers for multi-channel speech recognition. arXiv preprint arXiv:2105.05920.

Nils Rethmeier, Vageesh Kumar Saxena, and Isabelle Augenstein. 2020. TX-Ray: Quantifying and explaining model-knowledge transfer in (un-)supervised NLP. In Conference on Uncertainty in Artificial Intelligence, pages 440–449. PMLR.

Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems (NIPS).

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Hongfei Xu, Qiuhui Liu, Deyi Xiong, and Josef van Genabith. 2020. Transformer with depth-wise LSTM. arXiv preprint arXiv:2007.06257.

A Pattern Analysis

Table 3 provides a fully-annotated example of 25 prefixes from the memory cell $k^{5}_{895}$.

B Implementation Details

In this section, we provide further implementation details for reproducibility of our experiments. For all our experiments, we used the language model of Baevski and Auli (2019) (247M parameters) trained on WikiText-103 (Merity et al., 2017). Specifically, we used the model transformer_lm.wiki103.adaptive trained with the fairseq toolkit (https://github.com/pytorch/fairseq). WikiText-103 (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) is a well-known language modeling dataset, a collection of over 100M tokens extracted from Wikipedia. We used spaCy (https://spacy.io/) to split examples into sentences (Section 3).
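A small sketch of the sentence-splitting and prefix-enumeration steps used for trigger-example retrieval. The choice of the en_core_web_sm pipeline and the whitespace tokenization are assumptions for illustration; the paper only states that spaCy was used:

```python
import spacy

# Assumption: a small English pipeline; the paper does not specify which spaCy model was used.
nlp = spacy.load("en_core_web_sm")

def split_into_sentences(text):
    """Segment a WikiText-103 training example into sentences."""
    return [sent.text.strip() for sent in nlp(text).sents]

def sentence_prefixes(sentence):
    """All token prefixes of a sentence (simple whitespace tokenization for illustration)."""
    tokens = sentence.split()
    return [" ".join(tokens[:j]) for j in range(1, len(tokens) + 1)]
```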
Table 3: A pattern annotation of trigger examples for the memory cell $k^{5}_{895}$. Trigger examples are annotated with repetitive patterns, which are classified as "shallow" or "semantic".

Identified patterns:
• Pattern 1 — ends with the word "press" (shallow)
• Pattern 2 — press/news related (semantic)

Annotated top-25 trigger prefixes (pattern IDs in parentheses; prefixes matching no identified pattern are marked "—"):
• (1) It requires players to press
• (1) The video begins at a press
• (1) The first player would press
• (1) Ivy, disguised as her former self, interrupts a Wayne Enterprises press
• (1) The video then cuts back to the press
• (1) The player is able to press
• (—) Leto switched
• (1) In the Nintendo DS version, the player can choose to press
• (1) In-house engineer Nick Robbins said Shields made it clear from the outset that he (Robbins) "was just there to press
• (1) She decides not to press
• (1) she decides not to press
• (1) Originally Watson signaled electronically, but show staff requested that it press
• (1) At post-game press
• (1) In the buildup to the game, the press
• (2) Hard to go back to the game after that news
• (1) In post-trailer interviews, Bungie staff members told gaming press
• (—) Space Gun was well received by the video game
• (1) As Bong Load struggled to press
• (—) At Michigan, Clancy started as a quarterback, switched
• (1) Crush used his size advantage to perform a Gorilla press
• (1, 2) Groening told the press
• (1) Creative director Gregoire <unk> argued that existing dance games were merely instructing players to press
• (1, 2) Mattingly would be named most outstanding player that year by the press
• (1) At the post-match press
• (1, 2) The company receives bad press