Paper deep dive
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Mor Geva, Avi Caciularu, Kevin Ro Wang, Yoav Goldberg
Models: GPT-2 Medium, WikiLM (custom Wikipedia language model)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 8:25:22 PM
Summary
This paper investigates the internal prediction construction process of transformer-based language models by reverse-engineering feed-forward network (FFN) layers. The authors demonstrate that FFN outputs can be decomposed into additive sub-updates corresponding to individual parameter vectors, which often encode human-interpretable concepts in the vocabulary space. These findings are applied to reduce model toxicity and improve computational efficiency via early exit rules.
Entities (5)
Relation Signals (3)
Feed-forward network (FFN) layers → applies update → Token representation
confidence 95% · Feed-forward layers apply additive updates to the token representation
GPT2 → exhibits reduced toxicity via → FFN sub-update intervention
confidence 95% · increasing the weight of only 10 sub-updates in GPT2 reduces toxicity in its generations by almost 50%
FFN parameter vectors → promotes → Human-interpretable concepts
confidence 92% · each update can be decomposed to sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable
Cypher Suggestions (2)
Find all models and the specific components analyzed in the paper. · confidence 90% · unvalidated
MATCH (m:Model)-[:HAS_COMPONENT]->(c:Component) RETURN m.name, c.name
Identify the relationship between model components and their functional outcomes. · confidence 85% · unvalidated
MATCH (c:Component)-[r:PERFORMS]->(o:Outcome) RETURN c.name, type(r), o.name
Abstract
Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed to sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.
Tags
Links
- Source: https://arxiv.org/abs/2203.14680
- Canonical: https://arxiv.org/abs/2203.14680
- Code: https://github.com/aviclu/ffn-values
Full Text
61,809 characters extracted from source content.
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Mor Geva∗,1  Avi Caciularu∗,2,†  Kevin Ro Wang3  Yoav Goldberg1,2
1 Allen Institute for AI  2 Bar-Ilan University  3 Independent Researcher
morp@allenai.org, {avi.c33, kevinrowang, yoav.goldberg}@gmail.com

∗ Equal contribution. † Work done during an internship at AI2.

Abstract

Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed to sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.¹

1 Introduction

How do transformer-based language models (LMs) construct predictions? We study this question through the lens of the feed-forward network (FFN) layers, one of the core components in transformers (Vaswani et al., 2017). Recent work showed that these layers play an important role in LMs, acting as memories that encode factual and linguistic knowledge (Geva et al., 2021; Da et al., 2021; Meng et al., 2022). In this work, we investigate how outputs from the FFN layers are utilized internally to build predictions.

¹ Our codebase is available at https://github.com/aviclu/ffn-values.

[Figure 1: Illustration of our findings. Feed-forward layers apply additive updates (A) to the token representation x, which can be interpreted as a distribution over the vocabulary (B). An update is a set of sub-updates induced by parameter vectors v_1, ..., v_{d_m} (C), each of which can be interpreted as a concept in the vocabulary space (D).]

We begin by making two observations with respect to the representation of a single token in the input, depicted in Fig. 1. First, each FFN layer induces an additive update to the token representation (Fig. 1, A). Second, the token representation across the layers can be translated at any stage to a distribution over the output vocabulary (Geva et al., 2021) (Fig. 1, B). We reason that the additive component in the update changes this distribution (§2), namely, FFN layers compute updates that can be interpreted in terms of the output vocabulary.

We then decompose the FFN update (§3), interpreting it as a collection of sub-updates, each corresponding to a column in the second FFN matrix (Fig. 1, C) that scales the token probabilities in the output distribution. Through a series of experiments, we find that (a) sub-update vectors across the entire network often encode a small set of human-interpretable, well-defined concepts, e.g. "breakfast" or "pronouns" (§4, Fig. 1, D), and (b) FFN updates rely primarily on token promotion (rather than elimination), namely, tokens at the top of the output distribution are those pushed strongly enough by sub-updates (§5). Overall, these findings allow fine-grained interpretation of the FFN operation, providing better understanding of the prediction construction process in LMs. Beyond interpretation, our findings also have practical utility.
In §6.1, we show how we can intervene in the prediction process, in order to manipulate the output distribution in a direction of our choice. Specifically, we show that increasing the weight of only 10 sub-updates in GPT2 reduces toxicity in its generations by almost 50%. Also, in §6.2, we show that dominant sub-updates provide a useful signal for predicting an early exit point, saving 20% of the computation on average.

In conclusion, we investigate the mechanism by which FFN layers update the inner representations of transformer-based LMs. We propose that the FFN output can be viewed as a collection of updates that promote concrete concepts in the vocabulary space, and that these concepts are often interpretable for humans. Our findings shed light on the prediction construction process in modern LMs, suggesting promising research directions for interpretability, control, and efficiency.

2 Token Representations as Evolving Distributions Over the Vocabulary

Modern LMs (Baevski and Auli, 2019; Radford et al., 2019; Brown et al., 2020) are transformer models primarily trained to predict the next-token probability for a given input. Such LMs are composed of intertwined multi-head self-attention (MHSA) layers and FFN layers (Vaswani et al., 2017), with residual connections (He et al., 2016) between each pair of consecutive layers. The LM prediction is obtained by projecting the output vector from the final layer to an embedding matrix E ∈ ℝ^{|V|×d}, with a hidden dimension d, to get a distribution over a vocabulary V (after softmax).

Given a sequence w = ⟨w_1, ..., w_t⟩ of input tokens, the model creates a contextualized representation x_i ∈ ℝ^d for each token w_i ∈ w, which is updated throughout the layers. In this work, we analyze the updates applied by the FFN layers and how they construct the model prediction.
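The projection described above, reading a vocabulary distribution off any layer's hidden state, can be sketched in a few lines. This is an illustrative NumPy toy with made-up dimensions, not the authors' code:

```python
import numpy as np

def vocab_projection(x, E):
    """Read a distribution over the vocabulary from a hidden state x
    (shape (d,)) via the output embedding matrix E (shape (|V|, d)),
    i.e. softmax(E x)."""
    logits = E @ x
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
V, d = 50, 16
E = rng.normal(size=(V, d))          # output embedding matrix
x = rng.normal(size=d)               # representation before the FFN
o = rng.normal(size=d)               # additive FFN output

p_before = vocab_projection(x, E)
p_after = vocab_projection(x + o, E)
# By linearity, E(x + o) = Ex + Eo, so o shifts the logits additively.
```

The last comment is the key point of this section: because the projection is linear, the FFN output acts as an additive update in the vocabulary (logit) space.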
Concretely, each FFN layer ℓ = 1, ..., L processes x^ℓ_i and produces an output o^ℓ_i, which is then added to x^ℓ_i to yield an updated representation x̃^ℓ_i:

    o^ℓ_i = FFN^ℓ(x^ℓ_i)
    x̃^ℓ_i = x^ℓ_i + o^ℓ_i

The updated representation x̃^ℓ_i then goes through a MHSA layer,² yielding the input x^{ℓ+1}_i for the next FFN layer. The evolving representation in this process (i.e., x^ℓ_i → x̃^ℓ_i, ∀ℓ) can be viewed as an information stream that is being processed and updated by the layers (Elhage et al., 2021). The output probability distribution is obtained from the final representation of the token, i.e.,

    y = softmax(E x̃^L_i).    (1)

To analyze the FFN updates, we read from the representation at any layer a distribution over the output vocabulary, by applying the same projection as in Eq. 1 (Geva et al., 2021):

    p^ℓ_i = softmax(E x^ℓ_i)
    p̃^ℓ_i = softmax(E x̃^ℓ_i)

Note that p̃^L_i = y. Importantly, by linearity:

    E x̃^ℓ_i = E x^ℓ_i + E o^ℓ_i,

implying that o^ℓ_i can be interpreted as an additive update in the vocabulary space. However, we find that the projection of the FFN output E o^ℓ_i to the vocabulary is not interpretable (§4). In this work, we take this a step further, and decompose the update o^ℓ_i into a set of smaller sub-updates. By projecting the sub-updates to the vocabulary, we find that they often express human-interpretable concepts.

In the rest of the paper, we focus on FFN updates to the representation of a single token in the sequence, and omit the token index for brevity, i.e., x^ℓ := x^ℓ_i and p^ℓ := p^ℓ_i.

3 The FFN Output as a Collection of Updates to the Output Distribution

We now decompose the FFN output, and interpret it as a set of sub-updates in the vocabulary space.

FFN Outputs as Linear Vector Combinations.
Each FFN at layer ℓ consists of two linear transformations with a point-wise activation function in between (bias terms are omitted):

    FFN^ℓ(x^ℓ) = f(W^ℓ_K x^ℓ) W^ℓ_V,

where W^ℓ_K, W^ℓ_V ∈ ℝ^{d_m × d} are parameter matrices, and f is a non-linearity function.

² In some LMs, e.g. GPT2, a layer normalization (LN) (Ba et al., 2016) is applied to the representation x̃^ℓ_i. We omit it here and show it does not influence our interpretation in §3.

Previous work proposed that this module can be cast as an emulated neural key-value memory (Sukhbaatar et al., 2015, 2019), where rows in W^ℓ_K and columns in W^ℓ_V are viewed as keys and values, respectively (Geva et al., 2021). For an input x^ℓ, the keys produce a vector of coefficients m^ℓ := f(W^ℓ_K x^ℓ) ∈ ℝ^{d_m}, that weighs the corresponding values in W^ℓ_V. Denoting by k^ℓ_i the i-th row of W^ℓ_K and by v^ℓ_i the i-th column of W^ℓ_V, we can then use the following formulation:

    FFN^ℓ(x^ℓ) = Σ_{i=1}^{d_m} f(x^ℓ · k^ℓ_i) v^ℓ_i = Σ_{i=1}^{d_m} m^ℓ_i v^ℓ_i.

Therefore, an FFN update can be viewed as a collection of sub-updates, each corresponding to a weighted value vector in the FFN output.

Terminology. In the rest of the paper, we refer to the vectors v^ℓ_i as value vectors, and to their weighted form m^ℓ_i v^ℓ_i as sub-updates. A transformer LM with L = 10, d_m = 3000 will have 30,000 value vectors, and every token that passes through the transformer will weight these value vectors differently, resulting in 30,000 sub-updates, where only a few of the sub-updates have high weights.

Interpreting Sub-Updates in the Vocabulary Space. Consider a sub-update m^ℓ_i v^ℓ_i for a given input; we can estimate its influence on the representation x^ℓ (before the FFN update) by analyzing the change it induces on the output distribution.
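The decomposition above is easy to verify numerically. A minimal sketch with toy dimensions, using ReLU as a stand-in for f (as in WIKILM); all sizes here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_m = 8, 32                      # toy hidden and inner dimensions
W_K = rng.normal(size=(d_m, d))     # rows k_i act as keys
W_V = rng.normal(size=(d_m, d))     # one value vector v_i per key
x = rng.normal(size=d)

m = np.maximum(W_K @ x, 0.0)        # coefficients m_i = f(x . k_i), ReLU as f
ffn_out = m @ W_V                   # full FFN output: f(W_K x) W_V

# Sub-updates: row i of this matrix is the weighted value vector m_i v_i.
sub_updates = m[:, None] * W_V
# Summing the d_m sub-updates recovers the FFN output exactly, so the
# output really is a collection of additive sub-updates.
```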
Concretely, we isolate the effect of m^ℓ_i v^ℓ_i on the probability p^ℓ_w of w ∈ V:³

    p(w | x^ℓ + m^ℓ_i v^ℓ_i, E) = exp(e_w · x^ℓ + e_w · m^ℓ_i v^ℓ_i) / Z(E(x^ℓ + m^ℓ_i v^ℓ_i))
                                ∝ exp(e_w · x^ℓ) · exp(e_w · m^ℓ_i v^ℓ_i),    (2)

where e_w is the embedding of w, and Z(·) is the constant softmax normalization factor. This implies that each sub-update m^ℓ_i v^ℓ_i introduces a scaling factor to the probability of every token w based on its dot product with e_w. Specifically, having e_w · m^ℓ_i v^ℓ_i > 0 increases the probability of w, and having e_w · m^ℓ_i v^ℓ_i < 0 decreases it. This scaling factor can be split into two parts:

³ As in Eq. 1, LN is omitted. In App. A.1, we verify empirically that our findings hold also when LN is applied.

• The term e_w · v^ℓ_i can be viewed as a static score of w that is independent of the input to the model. Thus, the projection r^ℓ_i = E v^ℓ_i ∈ ℝ^{|V|} induces a ranking over the vocabulary that allows comparing the scores by v^ℓ_i w.r.t. different tokens.

• The term m^ℓ_i is the dynamic coefficient of v^ℓ_i, which is fixed for all tokens for a given input. Thus, these coefficients allow comparing the contribution of value vectors in a specific update.

Overall, the scaling factor e_w · m^ℓ_i v^ℓ_i can be viewed as the effective score given by a value vector v^ℓ_i to a token w for a given input.

In the next sections, we use these observations to answer two research questions: (a) What information is encoded in sub-updates and what tokens do they promote? (§4) and (b) How do FFN updates build the output probability distribution? (§5)

4 Sub-Updates Encode Concepts in the Vocabulary Space

We evaluate whether projection to the vocabulary provides a meaningful way to "read" FFN updates, and the extent to which sub-updates are interpretable based on their projections. To this end, we manually inspect the top-scoring tokens by value vectors and check if they express interpretable concepts.
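The static score e_w · v^ℓ_i is what makes value vectors inspectable offline: ranking the whole vocabulary by the projection r = E v surfaces the tokens a vector promotes. A toy sketch; the 4-token vocabulary and the vector are invented for illustration:

```python
import numpy as np

def top_tokens_for_value_vector(v, E, vocab, k=3):
    """Rank tokens by the static score e_w . v (the projection r = E v)
    and return the k top-scoring ones."""
    scores = E @ v
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]

# Tiny invented example: with E = I, each token's score is just the
# corresponding coordinate of v.
vocab = ["cat", "coffee", "tea", "run"]
E = np.eye(4)
v = np.array([0.1, 2.0, 1.5, -1.0])  # a "drinks"-flavoured toy vector
print(top_tokens_for_value_vector(v, E, vocab, k=2))  # → ['coffee', 'tea']
```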
Concretely, we consider two representative LMs (details below), and for each vector v^ℓ_i compute a ranking over the vocabulary by sorting the projection r^ℓ_i (§3). Then, we try to detect patterns in the top-scoring tokens of each value vector.

Concepts Annotation Task. We let experts (NLP graduate students) annotate concepts by identifying common patterns among the top-30 scoring tokens of each value vector. For a set of tokens, the annotation protocol includes three steps: (a) identifying patterns that occur in at least 4 tokens, (b) describing each recognized pattern, and (c) classifying each pattern as either "semantic" (e.g., mammals), "syntactic" (e.g., past-tense verbs), or "names". The last class was added only for WIKILM (see below), following the observation that a large portion of the model's vocabulary consists of names. Further details, including the complete instructions and a fully annotated example, can be found in App. A.2.

[Figure 2: Portion of top-scoring tokens by value vectors in WIKILM and GPT2, per layer, that were associated with a semantic or syntactic concept, a name, or could not be matched to any concept ("N/A").]

Models. We conduct our experiments over two auto-regressive decoder LMs: the model of Baevski and Auli (2019) (dubbed WIKILM), a 16-layer LM trained on the WIKITEXT-103
corpus (Merity et al., 2017) with word-level tokenization (|V| = 267,744), and GPT2 (Radford et al., 2019), a 12-layer LM trained on WEBTEXT (Radford et al., 2019) with sub-word tokenization (|V| = 50,257). GPT2 uses the GeLU activation function (Hendrycks and Gimpel, 2016), while WIKILM uses ReLU, and in contrast to GPT2, WIKILM does not apply layer normalization after FFN updates. WIKILM defines d = 1024, d_m = 4096 and GPT2 defines d = 768, d_m = 3072, resulting in a total of 65k and 36k value vectors, respectively. For our experiments, we sample 10 random vectors per layer from each model, yielding a total of 160 and 120 vectors to analyze from WIKILM and GPT2, respectively.

Table 1: Example value vectors in GPT2 and WIKILM promoting human-interpretable concepts.

GPT2
- v^3_1018 · Measurement (semantic): kg, percent, spread, total, yards, pounds, hours
- v^8_1900 · WH-relativizers (syntactic): which, whose, Which, whom, where, who, wherein
- v^11_2601 · Food and drinks (semantic): drinks, coffee, tea, soda, burgers, bar, sushi

WIKILM
- v^1_1 · Pronouns (syntactic): Her, She, Their, her, she, They, their, they, His
- v^6_3025 · Adverbs (syntactic): largely, rapidly, effectively, previously, normally
- v^13_3516 · Groups of people (semantic): policymakers, geneticists, ancestries, Ohioans

4.1 Projection of Sub-Updates is Meaningful

Real vs. Random Sub-Updates. We validate our approach by comparing concepts in top-tokens of value vectors and 10 random vectors drawn from a normal distribution with the empirical mean and standard deviation of the real vectors. We observe that a substantially higher portion of top-tokens were associated with a concept in value vectors compared to the random ones (Tab. 2): 55.1% vs. 22.7% in WIKILM, and 37% vs. 16% in GPT2. Also, in both models, the average number of concepts per vector was >1 in the value vectors compared to ~0.5 in the random ones.
Notably, no semantic nor syntactic concepts were identified in WIKILM's random vectors, and in GPT2, only 4% of the tokens were marked as semantic concepts in the random vectors versus 24.9% in the value vectors.

Updates vs. Sub-Updates. We justify the FFN output decomposition by analyzing concepts in the top-tokens of 10 random FFN outputs per layer (Tab. 2). In WIKILM (GPT2), 39.4% (46%) of the tokens were associated with concepts, but for 19.7% (34.2%) the concept was "stopwords/punctuation". Also, we observe very few concepts (<4%) in the last two layers of WIKILM. We attribute this to extreme sub-updates that dominate the layer's output (§5.2). Excluding these concepts results in a considerably lower token coverage in projections of updates compared to those of sub-updates: 19.7% vs. 55.1% in WIKILM, and 11.8% vs. 36.7% in GPT2.

Overall, this shows that projecting sub-updates to the vocabulary provides a meaningful interface to the information they encode. Moreover, decomposing the FFN outputs is necessary for fine-grained interpretation of sub-updates.

Table 2: Portion of top-scoring tokens associated with a concept, for FFN updates and sub-updates in WIKILM and GPT2, and for random vectors. For FFN updates/sub-updates, we show results with and without counting concepts marked as stopwords.

                            GPT2     WIKILM
FFN sub-updates             36.7%    55.1%
  + stopwords concepts      37%      55.1%
Random sub-updates          16%      22.7%
FFN updates                 11.8%    19.7%
  + stopwords concepts      46%      39.4%

4.2 Sub-Update Projections are Interpretable

Fig. 2 shows a breakdown of the annotations across layers, for WIKILM and GPT2. In both models and across all layers, a substantial portion (40%-70% in WIKILM and 20%-65% in GPT2) of the top-tokens were associated with well-defined concepts, most of which were classified as "semantic". Also, we observe that the top-tokens of a single value vector were associated with 1.5 (WIKILM) and 1.1 (GPT2) concepts on average, showing that sub-updates across all layers encode a small set of well-defined concepts.
Examples are in Tab. 1. These findings expand on previous results by Geva et al. (2021), who observed that value vectors in the upper layers represent next-token distributions that follow specific patterns. Our results, which hold across all the layers, suggest that these vectors represent general concepts rather than prioritizing specific tokens.

Underestimation of Concept Frequency. In practice, we find that this task is hard for humans,⁴ as it requires reasoning over a set of tokens without any context, while tokens often correspond to uncommon words, homonyms, or sub-words. Moreover, some patterns necessitate world knowledge (e.g. "villages in Europe near rivers") or linguistic background (e.g. negative polarity items). This often leads to undetectable patterns, suggesting that the overall results are an underestimation of the true concept frequency. Providing additional context and token-related information are possible future directions for improving the annotation protocol.

⁴ A sub-update annotation took 8.5 minutes on average.

Table 3: Example saturation and elimination events, after an FFN update (reference tokens are in bold in the original).

Saturation: "dog" is promoted from rank 3 in p^ℓ to rank 1 in p̃^ℓ, to be the top candidate until the last layer.
  p^ℓ: cow, cat, dog, goat, horse, bear
  p̃^ℓ: dog, cat, goat, horse, cow, bear

Elimination: "cow" is eliminated from rank 1 in p^ℓ to rank 5 in p̃^ℓ.
  p^ℓ: cow, cat, dog, goat, horse, bear
  p̃^ℓ: dog, cat, goat, horse, cow, bear

Implication for Controlled Generation. If sub-updates indeed encode concepts, then we can not only interpret their contribution to the prediction, but also intervene in this process, by increasing the weights of value vectors that promote tendencies of our choice. We demonstrate this in §6.1.

5 FFN Updates Promote Tokens in the Output Distribution

We showed that sub-updates often encode interpretable concepts (§4), but how do these concepts construct the output distribution?
In this section, we show that sub-updates systematically configure the prediction via promotion of candidate tokens.

5.1 Promoted Versus Eliminated Candidates

Every sub-update m^ℓ_i v^ℓ_i either increases, decreases, or does not change the probability of a token w, according to the score e_w · m^ℓ_i v^ℓ_i (§3). This suggests three mechanisms by which tokens are pushed to the top of the output distribution: promotion, where sub-updates increase the probability of favorable tokens; elimination, where sub-updates decrease candidate probabilities; or a mixture of both. To test which mechanism holds in practice, we analyze the scores sub-updates assign to top-candidate tokens by the representation. To simplify the analysis, we focus on changes induced by the 10 most dominant sub-updates in each layer, that is, the 10 sub-updates m^ℓ_i v^ℓ_i with the largest contribution to the representation, as measured by |m^ℓ_i| · ||v^ℓ_i|| (see details in App. A.3).

For the experiments, we use a random sample of 2000 examples from the validation set of WIKITEXT-103,⁵ which both WIKILM and GPT2 did not observe during training. As the experiments do not involve human annotations, we use a larger GPT2 model with L = 24, d = 1024, d_m = 4096.

⁵ Data is segmented into sentences (Geva et al., 2021).

We start by comparing the sub-updates' scores to a reference token in two types of events:

• Saturation (Tab. 3, top): the update p^ℓ → p̃^ℓ where the final token predicted by the model (i.e., w = argmax(y)) was promoted to be the top candidate until the last layer. We analyze saturation events induced by the FFN before the last layer, covering 1184 and 1579 events in WIKILM and GPT2, respectively.

• Elimination (Tab. 3, bottom): the update p^ℓ → p̃^ℓ with the largest increase in the top candidate's rank, i.e., where the top candidate was dropped behind other candidates to have a rank >1. Overall, our analysis covers 1909 (WIKILM) and 1996 (GPT2) elimination events.

[Figure 3: Mean, maximum and minimum scores assigned by the 10 most dominant sub-updates in each layer to the top-candidate token, in GPT2 (left) and WIKILM (right). Solid (dashed) lines exclude (include) functional value vector groups. The y-axis in both plots is cut for readability, as the max. (min.) scores reach 100 (-6).]

We compute the mean, maximum, and minimum scores of the reference token by the 10 most dominant sub-updates in each event, and average over all the events. As a baseline, we compute the scores by 10 random sub-updates from the same layer.

Table 4: Maximum, mean, and minimum scores of reference tokens in saturation and elimination events, by the 10 most dominant and 10 random sub-updates.

Sub-updates        Event        Max.   Mean    Min.
WIKILM, dominant   saturation   1.2    <0.01   -0.8
                   elimination  0.5    -0.01   -0.5
WIKILM, random     saturation   0.02   <0.01   -0.02
                   elimination  0.02   <0.01   -0.02
GPT2, dominant     saturation   8.5    1.3     -4.9
                   elimination  4.0    0.1     -3.6
GPT2, random       saturation   0.2    0.01    -0.2
                   elimination  0.1    <0.01   -0.1

Tab. 4 shows the results. In both models, tokens promoted to the top of the distribution receive higher maximum scores than tokens eliminated from the top position (1.2 vs. 0.5 in WIKILM and 8.5 vs. 4.0 in GPT2), indicating they are pushed strongly by a few dominant sub-updates. Moreover, tokens eliminated from the top of the distribution receive near-zero mean scores, by both dominant and random sub-updates, suggesting they are not being eliminated directly. In contrast to promoted tokens, where the maximum scores are substantially higher than the minimal scores (1.2 vs. -0.8 in WIKILM and 8.5 vs. -4.9 in GPT2), for eliminated tokens, the scores are similar in their magnitude (±0.5 in WIKILM and 4.0 vs. -3.6 in GPT2).
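The dominance criterion |m_i| · ||v_i|| used to select the 10 most dominant sub-updates is straightforward to compute. A hypothetical sketch with invented toy values:

```python
import numpy as np

def dominant_sub_updates(m, W_V, k=10):
    """Indices of the k sub-updates with the largest contribution to
    the representation, measured by |m_i| * ||v_i||."""
    contribution = np.abs(m) * np.linalg.norm(W_V, axis=1)
    return np.argsort(contribution)[::-1][:k]

# Toy check: with unit-norm value vectors (rows of the identity),
# dominance follows |m_i|, so indices 1 and 2 win here.
m = np.array([0.2, 3.0, -1.5, 0.0])
W_V = np.eye(4)
top = dominant_sub_updates(m, W_V, k=2)
```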
Last, scores by random sub-updates are dramatically lower in magnitude, showing that our choice of sub-updates is meaningful and that higher coefficients translate to greater influence on the output distribution.

This suggests that FFN updates work via a promotion mechanism, where top-candidate tokens are those being pushed by dominant sub-updates.

5.2 Sub-Updates Across Layers

To analyze the FFN operation in different layers, we break down the top-candidate scores per layer. Formally, let w^ℓ = argmax(p^ℓ) be the top candidate at layer ℓ (before the FFN update) for a given input; we extract the scores e_{w^ℓ} · m^ℓ_i v^ℓ_i by the 10 most dominant sub-updates and compute the mean, minimum and maximum scores over that set.

Fig. 3 shows that, in both models, until the last few layers (23-24 in GPT2 and 14-16 in WIKILM), maximum and minimum scores are distributed around non-negative mean scores, with prominent peaks in maximum scores (layers 3-5 in GPT2 and layers 4-11 in WIKILM). This suggests that the token promotion mechanism generally holds across layers. However, scores diverge in the last layers of both models, with strong negative minimum scores, indicating that the probability of the top-candidate is pushed down by dominant sub-updates. We next show that these large deviations in positive and negative scores (Fig. 3, dashed lines) result from the operation of small sets of functional value vectors.

Extreme Sub-Updates. To analyze the extreme FFN updates, we first cluster the value vectors to discover high-level trends. We use agglomerative clustering (Müllner, 2011) to learn 10k clusters for each model,⁶ based on the cosine distance matrix D, where

    D_{(ℓ1,i1),(ℓ2,i2)} = 1 - cos(v^{ℓ1}_{i1}, v^{ℓ2}_{i2}),  ∀i1, i2 ∈ 1, ..., d_m, ∀ℓ1, ℓ2 ∈ 1, ..., L.
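The pairwise distance matrix fed to the clustering can be built directly; a small sketch of the cosine-distance computation (the agglomerative clustering step itself is omitted, and the vectors are invented):

```python
import numpy as np

def cosine_distance_matrix(V):
    """D[i, j] = 1 - cos(v_i, v_j) for a stack of value vectors V
    (one vector per row), the distance used for clustering."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 1.0 - U @ U.T

vals = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
D = cosine_distance_matrix(vals)
# Parallel vectors get distance 0; orthogonal ones get distance 1.
```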
Then, we search for clusters that are frequently active in extreme updates, by (a) extracting sub-updates where the scores for the top-candidate pass a certain threshold (±10 for GPT2 and ±5 for WIKILM), and (b) counting the appearances of each cluster in the layer sub-updates.

In both models, a small set of homogeneous clusters accounts for the extreme sub-updates shown in Fig. 3, which can be divided into two main groups of value vectors: vectors in the upper layers that promote generally unlikely tokens (e.g. rare tokens), and vectors that are spread over all the layers and promote common tokens (e.g. stopwords). These clusters, which cover only a small fraction of the value vectors (1.7% in GPT2 and 1.1% in WIKILM), are mostly active for examples where the input sequence has ≤3 tokens or when the target token can be easily inferred from the context (e.g. an end-of-sentence period), suggesting that these value vectors might configure "easy" model predictions. More interestingly, the value vectors that promote unlikely tokens can be viewed as "saturation vectors", which propagate the distribution without changing the top tokens. Indeed, these vectors are in the last layers, where often the model already stores its final prediction (Geva et al., 2021).

6 Applications

We leverage our findings for controlled text generation (§6.1) and computation efficiency (§6.2).

6.1 Zero-Shot Toxic Language Suppression

LMs are known to generate toxic, harmful language that damages their usefulness (Bender et al., 2021; McGuffie and Newhouse, 2020; Wallace et al., 2019). We utilize our findings to create a simple, intuitive method for toxic language suppression.

Method. If LMs indeed operate via a promotion mechanism, we reason that we can decrease toxicity by "turning on" non-toxic sub-updates.
We find value vectors that promote safe, harmless concepts by extracting the top-tokens in the projections of all the value vectors and either (a) manually searching for vectors that express a coherent set of positive words (e.g. "safe" and "thank"), or (b) grading the tokens with the Perspective API and selecting non-toxic value vectors (see details in App. A.4). We turn on these value vectors by setting their coefficients to 3, a relatively high value according to Fig. 3.

⁶ We experimented with k = 3e2, 1e3, 3e3, 1e4, 3e4, and chose k = 1e4 based on manual inspection.

We compare our method with two baselines:

1. Self-Debiasing (SD) (Schick et al., 2021): SD generates a list of undesired words for a given prompt by appending a self-debiasing input, which encourages toxic completions, and calculating which tokens are promoted compared to the original prompt. These undesired words' probabilities are then decreased according to a decay constant λ, which we set to 50 (default).

2. WORDFILTER: We prevent GPT2 from generating words from a list of banned words by setting any logits that would result in a banned word completion to -∞ (Gehman et al., 2020).

Evaluation. We evaluate our method on the challenging subset of REALTOXICPROMPTS (Gehman et al., 2020), a collection of 1,225 prompts that tend to yield extremely toxic completions in LMs, using the Perspective API, which grades text according to six toxicity attributes. A score of >0.5 indicates a toxic text w.r.t. the attribute. Additionally, we compute perplexity to account for changes in LM performance. We use GPT2 and, following Schick et al. (2021), generate continuations of 20 tokens.

Results. Finding the non-toxic sub-updates manually was intuitive and efficient (taking <5 minutes). Tab. 5 shows that activation of only 10 value vectors (0.01%) substantially decreases toxicity (↓47%), outperforming both SD (↓37%) and WORDFILTER (↓20%).
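The intervention itself amounts to clamping selected FFN coefficients during the forward pass. A hypothetical NumPy sketch of the idea; the indices, sizes, and the ReLU non-linearity are illustrative, not GPT2's actual configuration:

```python
import numpy as np

def ffn_with_intervention(x, W_K, W_V, boost_ids, alpha=3.0):
    """FFN forward pass where the coefficients of a chosen set of
    value vectors are clamped to alpha, 'turning on' the concepts
    they promote (the paper sets 10 hand-picked coefficients to 3)."""
    m = np.maximum(W_K @ x, 0.0)      # ordinary coefficients
    m[list(boost_ids)] = alpha        # intervene on selected sub-updates
    return m @ W_V

rng = np.random.default_rng(2)
d, d_m = 8, 32
W_K, W_V = rng.normal(size=(d_m, d)), rng.normal(size=(d_m, d))
x = rng.normal(size=d)

baseline = np.maximum(W_K @ x, 0.0) @ W_V
boosted = ffn_with_intervention(x, W_K, W_V, boost_ids={0, 1})
# The output shifts by exactly the boosted value vectors' extra weight.
```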
Moreover, inducing sub-updates that promote "safety"-related concepts is more effective than promoting generally non-toxic sub-updates. However, our method resulted in a perplexity increase greater than that induced by SD, though the increase was still relatively small.

Table 5: Evaluation results on the challenging subset of REALTOXICPROMPTS, showing the percentage of toxic completions for 6 toxicity attributes, as well as language model perplexity ("PPL").

Model            Toxicity    Severe tox.  Sexually expl.  Threat      Profanity   Identity att.  PPL
GPT2             58.5%       49.2%        34.1%           16.4%       52.5%       16.8%          21.7
↑10 Manual Pick  ↓47% 30.8%  ↓50% 24.8%   ↓40% 20.4%      ↓63% 6.0%   ↓47% 27.9%  ↓48% 8.8%      25.3
↑10 API Graded   ↓10% 52.7%  ↓11% 44%     ↓3% 33.2%       ↓19% 13.3%  ↓9% 47.6%   ↓9% 15.3%      23.8
SD               ↓37% 37.2%  ↓46% 26.4%   ↓36% 21.7%      ↓52% 7.8%   ↓39% 32%    ↓50% 8.4%      23.9
WORDFILTER       ↓20% 46.9%  ↓34% 32.4%   ↓36% 21.9%      ↓<1% 16.3%  ↓38% 32.3%  ↓13% 14.7%     -

6.2 Self-Supervised Early Exit Prediction

The recent success of transformer-based LMs in NLP tasks has resulted in major production cost increases (Schwartz et al., 2020a), and thus has spurred interest in early-exit methods that reduce the incurred costs (Xu et al., 2021). Such methods often use small neural models to determine when to stop the execution process (Schwartz et al., 2020b; Elbayad et al., 2020; Hou et al., 2020; Xin et al., 2020, 2021; Li et al., 2021; Schuster et al., 2021).

In this section, we test our hypothesis that dominant FFN sub-updates can signal a saturation event (§5.2), to create a simple and effective early exiting method that does not involve any external model training. For the experiments, we use WIKILM, where saturation events occur across all layers (statistics for WIKILM and GPT2 are in App. A.5).

Method. We devise a simple prediction rule based on a nearest-neighbours approach, using 10k validation examples from WIKITEXT-103. First, for every example, we map the top-10 dominant sub-updates at each layer to their corresponding clusters.
Then, for every layer ℓ, we split all the sets of clusters at that layer into two sets, T_ℓ and N_ℓ, based on whether saturation occurred or not (e.g., T_5 stores all the sets that were active in a saturation event at layer 5). Given the top-10 clusters of an unseen example at some layer ℓ, we consider a higher overlap with T_ℓ than with N_ℓ′, ∀ℓ′ > ℓ, as a signal for early exit. Thus, during inference, we propagate the input example through the layers, and compute at each layer ℓ the intersection size between its top-10 active clusters and each of T_ℓ and N_ℓ′, ∀ℓ′ > ℓ. If the average and maximal intersection with T_ℓ exceed those with N_ℓ′, ∀ℓ′ > ℓ, we halt the computation and declare early exiting.^7

Baselines. We train layer-wise binary classifiers over the representation and FFN updates x^ℓ, o^ℓ, and x̃^ℓ, using logistic regression. As in our method, the labels are determined according to saturation events in the training data (see App. A.5). During inference, we execute the computation through the layers, and halt according to the layer classifier.

^7 This is a simplification. We split N_ℓ by saturation layers and require a bigger intersection with T_ℓ at all the layers.

| Method | Accuracy | Saved | Layers |
| Binary classifiers using x^ℓ | 94.4±6.4 | 18.8% | 3.0±0.4 |
| Binary classifiers using o^ℓ | 92.9±5.4 | 19.4% | 3.1±0.3 |
| Binary classifiers using x̃^ℓ | 94.4±6.4 | 18.8% | 3.0±0.4 |
| Sub-updates rule | 94.1±1.4 | 20.0% | 3.2±0.3 |

Table 6: Early exit evaluation results on WIKILM.

Evaluation. Each method is evaluated by accuracy, i.e., the portion of examples for which exiting at the predicted layer yields the final model prediction, and by computation efficiency, measured by the number of saved layers for examples with a correct prediction. We run each method with five random seeds and report the average scores.

Results. Tab. 6 shows that our method obtains a high accuracy of 94.1%, while saving 20% of computation on average without changing the prediction.
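The exit rule described in the Method paragraph can be sketched as follows. This follows the simplified description in the text (the full rule additionally splits N_ℓ by saturation layer, per footnote 7); the cluster sets are toy placeholders.

```python
# Sketch of the early-exit rule: exit at `layer` iff both the average and
# maximal intersection with T[layer] exceed those with N at every later
# layer. T and N map a layer to cluster sets collected from validation data.

def should_exit(active, layer, T, N, num_layers):
    """active: cluster ids of the top-10 dominant sub-updates at `layer`."""
    def stats(cluster_sets):
        sizes = [len(active & s) for s in cluster_sets]
        if not sizes:
            return 0.0, 0
        return sum(sizes) / len(sizes), max(sizes)

    t_avg, t_max = stats(T.get(layer, []))
    for later in range(layer + 1, num_layers + 1):
        n_avg, n_max = stats(N.get(later, []))
        if not (t_avg > n_avg and t_max > n_max):
            return False
    return True
```

During inference one would call `should_exit` at each layer in turn and stop at the first layer where it returns True.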
Moreover, just by observing the dominant FFN sub-updates, it performs on par with the prediction rules relying on the representation and FFN output vectors. This demonstrates the utility of sub-updates for predicting saturation events, and further supports our hypothesis that FFN updates play a functional role in the prediction (§5.2).

7 Related Work

The lack of interpretability of modern LMs has led to a wide interest in understanding their prediction construction process. Previous works mostly focused on analyzing the evolution of hidden representations across layers (Voita et al., 2019), and probing the model with target tasks (Yang et al., 2020; Clark et al., 2019; Tenney et al., 2019; Saphra and Lopez, 2019). In contrast, our approach aims to interpret the model parameters and their utilization in the prediction process.

More recently, a surge of works has investigated the knowledge captured by the FFN layers (Da et al., 2021; Jiang et al., 2020; Dai et al., 2022; Yao et al., 2022; Meng et al., 2022; Wallat et al., 2020). These works show that the FFN layers store various types of knowledge, which can be located in specific neurons and edited. Unlike these works, we focus on the FFN outputs and their contribution to the prediction construction process.

Last, our interpretation of FFN outputs as updates to the output distribution relates to recent works that interpreted groups of LM parameters in the discrete vocabulary space (Geva et al., 2021; Khashabi et al., 2022), or viewed the representation as an information stream (Elhage et al., 2021).

8 Conclusions

Understanding the inner workings of transformers is valuable for explainability to end-users, for debugging predictions, for eliminating undesirable behavior, and for understanding the strengths and limitations of NLP models. The FFN is an understudied core component of transformer-based LMs, which we focus on in this work.
We study the FFN output as a linear combination of parameter vectors, termed values, and the mechanism by which these vectors update the token representations. We show that value vectors often encode human-interpretable concepts and that these concepts are promoted in the output distribution. Our analysis of transformer-based LMs provides a more detailed understanding of their internal prediction process, and suggests new research directions for interpretability, control, and efficiency, at the level of individual vectors.

Limitations

Our study focused on the operation of FFN layers in building model predictions. Future work should further analyze the interplay between these layers and other components in the network, such as attention heads.

In our analysis, we decomposed the computation of FFN layers into smaller units, corresponding to single value vectors. However, it is possible that value vectors are compositional, in the sense that combinations of them may produce new meanings. Still, we argue that analyzing individual value vectors is an important first step, since (a) the space of possible combinations is exponential, and (b) our analysis suggests that aggregations of value vectors are less interpretable than individual value vectors (§4.1). Thus, this approach opens new directions for interpreting the contribution of FFN layers to the prediction process in transformer LMs.

In addition, we chose to examine the broad family of decoder-based, auto-regressive LMs, which have been shown to be extremely effective for many NLP tasks, including few- and zero-shot tasks (Wang et al., 2022). While these models share the same building blocks as all transformer-based LMs, it will be valuable to verify that our findings still hold for other models, such as encoder-only LMs (e.g., RoBERTa (Liu et al., 2019)) and models trained with different objective functions (e.g., masked language modeling (Devlin et al., 2019)).
Finally, our annotation effort was made for the evaluation of our hypothesis that sub-updates encode human-interpretable concepts. Scaling our annotation protocol would enable a more refined map of the concepts, knowledge, and structure captured by LMs. Furthermore, since our concept interpretation approach relies on manual inspection of sets of tokens, its success might depend on the model's tokenization method. In this work, we analyzed models with two different commonly-used tokenizers, and future research could verify our method over other types of tokenization as well.

Ethics Statement

Our work in understanding the role that single values play in the inference that transformer-based LMs perform potentially improves their transparency, while also providing useful control applications that save energy (early-exit prediction) and increase model harmlessness (toxic language suppression). It should be made clear that our method for toxic language suppression only reduces the probability of toxic language generation and does not eliminate it. As such, this method (as well as our early-exit method) should not be used in the real world without further work and caution.

More broadly, our work suggests a general approach for modifying LM predictions in particular directions, by changing the weights of FFN sub-updates. While this is useful for mitigating biases, it also has the potential for abuse. It should be made clear that, as in the toxic language suppression application, our approach does not modify the information encoded in LMs, but only changes the intensity with which this information is exposed in the model's predictions. Moreover, our work primarily proposes an interpretation for FFN sub-updates, which could also be used to identify abusive interventions. Regardless, we stress that LMs should not be integrated into critical systems without caution and monitoring.
Acknowledgements

We thank Shauli Ravfogel, Tal Schuster, and Jonathan Berant for helpful feedback and constructive suggestions. This project has received funding from the Computer Science Scholarship granted by the Séphora Berrebi Foundation, the PBC fellowship for outstanding PhD candidates in Data Science, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations (ICLR).

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT).

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS).

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jeff Da, Ronan Le Bras, Ximing Lu, Yejin Choi, and Antoine Bosselut. 2021.
Analyzing commonsense emergence in few-shot knowledge models. In 3rd Conference on Automated Knowledge Base Construction.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), pages 4171–4186, Minneapolis, Minnesota.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. In International Conference on Learning Representations (ICLR).

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Arthur E Hoerl and Robert W Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with adaptive width and depth. Advances in Neural Information Processing Systems (NeurIPS).

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2022. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3631–3643, Seattle, United States. Association for Computational Linguistics.

Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, and Xu Sun. 2021. CascadeBERT: Accelerating inference of pre-trained language models via calibrated complete models cascade. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 475–486, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Kris McGuffie and Alex Newhouse. 2020.
The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. International Conference on Learning Representations (ICLR).

Daniel Müllner. 2011. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3257–3267, Minneapolis, Minnesota. Association for Computational Linguistics.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424.

Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. 2021. Consistent accelerated inference via confident adaptive transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4962–4979, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. 2020a. Green AI. Communications of the ACM, 63(12):54–63.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020b. The right tool for the job: Matching model and instance complexities.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Online. Association for Computational Linguistics.

Noam M. Shazeer. 2020. GLU variants improve transformer. ArXiv, abs/2002.05202.

S. Sukhbaatar, J. Weston, and R. Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems (NIPS).

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

Jonas Wallat, Jaspreet Singh, and Avishek Anand. 2020. BERTnesia: Investigating the capture and forgetting of knowledge in BERT. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 174–183, Online. Association for Computational Linguistics.

Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. 2022. What language model architecture and pretraining objective works best for zero-shot generalization? In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 22964–22984. PMLR.

Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin. 2020. Early exiting BERT for efficient document ranking. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 83–88, Online. Association for Computational Linguistics.

Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Online. Association for Computational Linguistics.

Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, and Lei Li. 2021. A survey on green deep learning. arXiv preprint arXiv:2111.05193.

Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the sub-layer functionalities of transformer decoder. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4799–4811, Online. Association for Computational Linguistics.
Yunzhi Yao, Shaohan Huang, Ningyu Zhang, Li Dong, Furu Wei, and Huajun Chen. 2022. Kformer: Knowledge injection in transformer feed-forward layers. arXiv preprint arXiv:2201.05742.

Figure 4: Similarity of projections to E, of GPT2 value vectors with and without layer normalization, and of value vectors and randomly-initialized vectors. We measure similarity of the top-30 tokens in each projection with IoU.

A Appendix

A.1 Value Vectors Projection Method

Our interpretation method of sub-updates is based on directly projecting value vectors to the embedding matrix, i.e., for a value v and embedding matrix E, we calculate Ev (§4). However, in some LMs like GPT2, value vectors in each layer are added to the token representation followed by a layer normalization (LN) (Ba et al., 2016). This raises the question of whether "reading" vectors that are normalized in the same manner as the representation would yield different concepts. To test this, we compare the top-30 scoring tokens by E v^ℓ_i and by E · LayerNorm(v^ℓ_i), for i = 1, ..., d_m and ℓ = 1, ..., L, using Intersection over Union (IoU). As a baseline, we also compare E v^ℓ_i with random vectors, initialized from a normal distribution with the empirical mean and standard deviation of the value vectors. Fig. 4 shows that LN does not change the projection substantially, with an overlap of 64.5% of the top-30 tokens on average, suggesting that the same concepts are promoted in both cases. This is in contrast to random vectors, which produce a ~0% overlap on average.

A.2 Concepts Annotation

We analyze the concepts encoded in sub-updates by projecting their corresponding value vectors to the embedding matrix and identifying repeating patterns in the top-scoring 30 tokens (§3). Pattern identification was performed by experts (NLP graduate students), following the instructions presented in Fig. 5.
Please note these are the instructions provided for annotations of WIKILM, which uses word-level tokenization. Thus, the terms "words" and "tokens" are equivalent in this case. For value vectors in WIKILM, which uses a word-level vocabulary with many uncommon words, we additionally attached a short description field for each token that provides context about the meaning of the word. For the description of a token w, we first try to extract the definition of w from WordNet.^8 If w does not exist in WordNet, as often happens for names of people and places, we then search for w in Wikipedia^9 and extract a short (possibly noisy) description if the query was successful. A complete annotation example is shown in Tab. 7.

A.3 Sub-Update Contribution in FFN Outputs

In this section, we justify our choice throughout the paper of looking at the top-10 dominant sub-updates. The contribution of a sub-update m^ℓ_i v^ℓ_i to the FFN output is:

contrib(m^ℓ_i v^ℓ_i) := |m^ℓ_i| ‖v^ℓ_i‖ / Σ^{d_m}_{j=1} |m^ℓ_j| ‖v^ℓ_j‖,

namely, its relative weight compared to the overall sum of weights of all sub-updates. The overall contribution of the top-10 dominant sub-updates is computed by summing their contributions. Note that we take the absolute value of the coefficients |m^ℓ_i|, since some activation functions (e.g., GeLU (Hendrycks and Gimpel, 2016) in GPT2) can result in negative values of m^ℓ_i. Empirically, we observe that in some cases sub-updates with negative coefficients do appear among the 10 most dominant sub-updates in GPT2. We further attribute this to the success of GeLU in transformer models (Shazeer, 2020), as it increases the expressiveness of the model by allowing reversal of the scores that value vectors induce over the vocabulary.

Fig. 6 depicts the contribution of the top-10 dominant sub-updates per layer for WIKILM and GPT2, using 2000 random examples from the WIKITEXT-103 validation set.
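This contribution measure is straightforward to compute. A minimal numpy sketch, assuming m holds the coefficients of one layer and the rows of V are that layer's value vectors:

```python
import numpy as np

def contribution(m, V):
    """contrib(m_i v_i) = |m_i| * ||v_i|| / sum_j |m_j| * ||v_j||."""
    w = np.abs(m) * np.linalg.norm(V, axis=1)  # weight of each sub-update
    return w / w.sum()

def top_k_contribution(m, V, k=10):
    """Summed relative weight of the k most dominant sub-updates."""
    return np.sort(contribution(m, V))[-k:].sum()
```

Taking the absolute value of the coefficients matters here: with GeLU activations some m_i are negative, and without `np.abs` those sub-updates would wrongly reduce the total weight instead of counting toward it.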
Clearly, for all the layers, the contribution of the dominant sub-updates exceeds the contribution of random sub-updates. Observe that, even though they cover only 0.24% of the value vectors, the contribution of dominant sub-updates is typically around 5%, and in some layers (e.g., layers 8-16 in WIKILM and layer 1 in GPT2) it reaches over 10% of the total contribution. This demonstrates that analyzing the top-10 dominant sub-updates can shed light on the way predictions are built through the layers.

^8 We use the NLTK python package.
^9 Using the wptools package: https://pypi.org/project/wptools/.

In this task, you are given a list of 30 words in English, and the goal is to identify repetitive patterns occurring in the words. Patterns can be semantic (e.g., animals, 3-digit numbers, names of Indian actors, and time-related words) or syntactic (e.g., connectives, plurals, words starting with "dis-", and verbs in present progressive tense). You should only count patterns that occur in at least 4 words (i.e., if you notice a pattern that occurs only in 3 words, then please ignore it). To complete the task, please do the following:
1. Give an ID to every identified pattern (1, 2, ...).
2. Assign a pattern ID to every word in the list, or -1/leave empty if no pattern applies to the word.
3. For every identified pattern, specify whether the pattern is semantic or syntactic and (optionally) write a short description of the pattern.
Please note that some of the words might be uncommon words that you are not familiar with. In such cases, you will need to do a quick search over the Web to understand the meaning of words.

Figure 5: Annotation instructions for the concepts identification task.

A.4 Toxic Language Suppression Details

The 10 manually selected value vectors were found by searching for non-toxic words, such as "safe" and "peace", among the top-30 tokens in the vector projections to the vocabulary.
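The search over projections can be sketched as follows. The embedding matrix, vocabulary, and target words below are toy placeholders, not the actual WIKILM or GPT2 data.

```python
import numpy as np

# Sketch of the manual search: project each value vector to the vocabulary
# (scores = E @ v) and keep vectors whose top tokens contain a target word.

def top_tokens(v, E, vocab, k=30):
    """Top-k tokens by the projection scores E @ v."""
    idx = np.argsort(E @ v)[::-1][:k]
    return [vocab[i] for i in idx]

def find_candidates(values, E, vocab, targets, k=30):
    """Indices of value vectors whose top-k tokens hit any target word."""
    return [i for i, v in enumerate(values)
            if targets & set(top_tokens(v, E, vocab, k))]

vocab = ["safe", "danger", "peace", "the"]
E = np.eye(4)                               # toy embedding matrix, one row per token
values = [np.array([1.0, 0.0, 0.5, 0.0]),   # promotes "safe" and "peace"
          np.array([0.0, 1.0, 0.0, 0.2])]   # promotes "danger"
hits = find_candidates(values, E, vocab, {"safe", "peace"}, k=2)
```

In practice E is the model's (tied) embedding matrix and the candidates returned by such a filter would still be inspected by hand for coherence, as described above.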
We selected a small set of 10 value vectors whose top-scoring tokens were coherent and seemed to promote different kinds of non-toxic tokens. The list of manually picked vectors is provided in Tab. 8. Importantly, the search process for all vectors was a one-time effort that took <5 minutes in total. We chose the value vectors in a greedy manner, without additional attempts to optimize our choice.

To select 10 non-toxic value vectors based on an automatic toxicity metric, we used the Perspective API. Concretely, we concatenated the top-30 tokens by each value vector and graded the resulting text with the toxicity score produced by the API. Then, we sampled 10 random vectors with a toxicity score <0.1 (a score of <0.5 indicates a non-toxic text).

A.5 Early Exit Details

This section provides further details and analysis regarding our early exit method and the baselines we implemented.

Method Implementation. We consider 90% of the 10k examples for constructing T_ℓ and N_ℓ, and the remaining 10% of the examples are considered as the testing set. We used k = 2e2 to cluster the top-10 dominant value vectors, but observed that other k values yielded similar results.

Baselines' Implementation. We train each binary classifier using 8k training examples, based on the standardized forms of each feature vector. We performed a hyperparameter sweep, using 8-fold cross-validation, with l2 or l1 regularization (ridge (Hoerl and Kennard, 1970) or lasso (Tibshirani, 1996), respectively) and regularization coefficients C ∈ {1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3}, and took the best-performing model for each layer. We also used a loss coefficient inversely proportional to the class frequencies. In order to achieve high accuracy, we further calibrate a threshold per classifier to reach the maximal F1 score for each layer. This calibration is done after training each classifier, over a set of 1000 validation examples.
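The final calibration step can be sketched as a simple sweep over candidate thresholds on validation probabilities. This is a hedged re-implementation of the idea, not the authors' code:

```python
import numpy as np

def calibrate_threshold(probs, labels):
    """Pick the decision threshold that maximizes F1 on validation data.

    probs: predicted positive-class probabilities from a classifier.
    labels: binary ground-truth labels (1 = saturation event).
    """
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):  # every observed probability is a candidate
        pred = probs >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Calibrating per layer compensates for the class imbalance noted above: layers where saturation is rare would otherwise sit far from the default 0.5 decision boundary.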
Frequency of Saturation Events. We investigate the potential of performing early exit for WIKILM and GPT2. Tabs. 9 and 10 depict the frequency of saturation events per layer, considering 10k examples from the WIKITEXT-103 validation set, for WIKILM and GPT2, respectively. In GPT2, 34.15% of the examples require the full computation using all the model layers, while for WIKILM, this holds for only 15.22% of the examples. Notably, early fixation events in GPT2 are less common than in WIKILM, possibly due to the larger number of layers the prediction construction is spread over. Hence, we use WIKILM for our experiments, as it has significantly higher computation saving potential, as well as more saturation events per layer.

| pattern | word | description |
| 1 | front | the side that is forward or prominent |
| 1 | ahead | having the leading position or higher score in a contest |
| 1 | forward | the person who plays the position of forward in certain games, such as basketball, soccer, or hockey |
| 1 | preceded | be earlier in time; go back further |
| 1 | Before | earlier in time; previously |
| 1 | before | earlier in time; previously |
| 1 | rear | the back of a military formation or procession |
| 1 | fore | front part of a vessel or aircraft |
| 2 | Name | a language unit by which a person or thing is known |
| 1 | Past | the time that has elapsed |
| 1 | prior | the head of a religious order; in an abbey the prior is next below the abbot |
| 1 | anterior | a tooth situated at the front of the mouth |
| 1 | upperparts | standard terms for unambiguous description of relative placement of body parts |
| 1 | lead | an advantage held by a competitor in a race |
| 1 | backwards | at or to or toward the back or rear |
| 1 | aft | (nautical, aeronautical) situated at or toward the stern or tail |
| 1 | preceding | be earlier in time; go back further |
| 1 | upstream | in the direction against a stream's current |
|  | hind | any of several mostly spotted fishes that resemble groupers |
| 1 | posterior | the fleshy part of the human body that you sit on |
|  | Etymology | a history of a word |
| 1 | Pre | Wikimedia disambiguation page |
|  | chin | the protruding part of the lower jaw |
| 1 | north | the region of the United States lying to the north of the Mason-Dixon line |
| 1 | east | the cardinal compass point that is at 90 degrees |
| 2 | surname | the name used to identify the members of a family (as distinguished from each member's given name) |
| 1 | Then | that time; that moment |
| 2 | name | a language unit by which a person or thing is known |
| 1 | northbound | moving toward the north |
| 1 | leading | thin strip of metal used to separate lines of type in printing |

| pattern id | description (optional) | semantic/syntactic |
| 1 | positions/directions | semantic |
| 2 | naming | semantic |

Table 7: An example annotation spreadsheet of the top-tokens by the value vector u^6_1090 in WIKILM.

Figure 6: Relative contribution to the FFN output of the 10 most dominant and 10 random sub-updates in each layer, of WIKILM (left) and GPT2 (right).

| Value | Top-10 Tokens |
| v^14_1853 | transparency, disclosure, clearer, parency, iquette, humility, modesty, disclosures, accountability, safer |
| v^15_73 | respectful, honorable, healthy, decent, fair, erning, neutral, peacefully, respected, reconc |
| v^15_1395 | safe, neither, safer, course, safety, safe, Safe, apologize, Compact, cart |
| v^16_216 | refere, Messages, promises, Relations, accept, acceptance, Accept, assertions, persistence, warn |
| v^17_462 | should, should, MUST, ought, wisely, Should, SHOULD, safely, shouldn, urgently |
| v^17_3209 | peaceful, stable, healthy, calm, trustworthy, impartial, stability, credibility, respected, peace |
| v^17_4061 | Proper, proper, moder, properly, wisely, decency, correct, corrected, restraint, professionalism |
| v^18_2921 | thank, THANK, thanks, thank, Thank, apologies, Thank, thanks, Thanks, apologise |
| v^19_1891 | thanks, thank, Thanks, thanks, THANK, Thanks, Thank, Thank, thank, congratulations |
| v^23_3770 | free, fit, legal, und, Free, leg, pless, sound, qualified, Free |

Table 8: The 10 manually picked value vectors used for toxic language
suppression and the top-10 tokens in their projection to the vocabulary. Repetitions in the projections are a result of special characters not being shown. These vectors were found by manually searching for non-toxic words such as "safe" and "peace" in the projections to the vocabulary.

| Layer | % Examples | Layer | % Examples |
| 1 | 6.70 | 9 | 2.96 |
| 2 | 5.25 | 10 | 3.78 |
| 3 | 13.74 | 11 | 4.74 |
| 4 | 3.13 | 12 | 7.45 |
| 5 | 1.02 | 13 | 10.79 |
| 6 | 1.07 | 14 | 9.88 |
| 7 | 1.86 | 15 | 9.81 |
| 8 | 2.60 | 16 | 15.22 |

Table 9: The percentage of saturation events per layer using WIKILM, for the WIKITEXT-103 validation set.

| Layer | % Examples | Layer | % Examples |
| 1 | 2.21 | 13 | 1.24 |
| 2 | 0.77 | 14 | 1.62 |
| 3 | 1.06 | 15 | 2.37 |
| 4 | 0.74 | 16 | 2.72 |
| 5 | 0.85 | 17 | 2.99 |
| 6 | 0.83 | 18 | 3.80 |
| 7 | 0.83 | 19 | 4.15 |
| 8 | 0.72 | 20 | 5.21 |
| 9 | 0.93 | 21 | 5.67 |
| 10 | 0.99 | 22 | 9.31 |
| 11 | 1.16 | 23 | 14.52 |
| 12 | 1.32 | 24 | 34.15 |

Table 10: The percentage of saturation events per layer using GPT2, for the WIKITEXT-103 validation set.