Paper deep dive
In-Context Learning Creates Task Vectors
Roee Hendel, Mor Geva, Amir Globerson
Models: GPT-J-6B, LLaMA-13B, LLaMA-30B, LLaMA-7B, Pythia-12B, Pythia-2.8B, Pythia-6.9B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 8:08:59 PM
Summary
The paper proposes a mechanistic view of In-Context Learning (ICL) in Large Language Models, demonstrating that ICL functions by compressing demonstrations into a single 'task vector' that modulates the transformer's forward pass to perform specific tasks, effectively acting as a parameter in a hypothesis class.
Entities (5)
Relation Signals (3)
In-context Learning → creates → Task Vector
confidence 95% · In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm... ICL creates task vectors.
Llama → implements → In-context Learning
confidence 95% · We use multiple open LLMs: LLaMA 7B, 13B, and 30B... to validate our view on a range of publicly available LLMs.
Task Vector → modulates → Transformer
confidence 90% · ICL can be seen as compressing S into a single task vector θ(S) and then using this task vector to modulate the transformer to produce the output.
Cypher Suggestions (2)
Identify the relationship between Task Vectors and Transformer architectures · confidence 95% · unvalidated
MATCH (t:Concept {name: 'Task Vector'})-[r:MODULATES]->(a:Architecture) RETURN r
Find all models that implement In-Context Learning · confidence 90% · unvalidated
MATCH (m:Model)-[:IMPLEMENTS]->(p:Paradigm {name: 'In-Context Learning'}) RETURN m.name
Abstract
In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set $S$ to find a best-fitting function $f(x)$ in some hypothesis class. Here we make progress on this problem by showing that the functions learned by ICL often have a very simple structure: they correspond to the transformer LLM whose only inputs are the query $x$ and a single "task vector" calculated from the training set. Thus, ICL can be seen as compressing $S$ into a single task vector $\boldsymbol{\theta}(S)$ and then using this task vector to modulate the transformer to produce the output. We support the above claim via comprehensive experiments across a range of models and tasks.
Tags
Links
Full Text
34,725 characters extracted from source content.
In-Context Learning Creates Task Vectors

Roee Hendel, Tel Aviv University, roee.hendel@mail.tau.ac.il
Mor Geva, Google DeepMind, pipek@google.com
Amir Globerson, Tel Aviv University and Google, gamir@tauex.tau.ac.il

Abstract. In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set S to find a best-fitting function f(x) in some hypothesis class. Here we make progress on this problem by showing that the functions learned by ICL often have a very simple structure: they correspond to the transformer LLM whose only inputs are the query x and a single "task vector" calculated from the training set. Thus, ICL can be seen as compressing S into a single task vector θ(S) and then using this task vector to modulate the transformer to produce the output. We support the above claim via comprehensive experiments across a range of models and tasks.[1]

[1] We release our code at https://github.com/roeehendel/icl_task_vectors.

1 Introduction

Large language models have improved dramatically over the last several years. One striking property of these models is that they can learn new rules from very few demonstrations. For instance, a model can be prompted with the input "Apple → Red, Lime → Green, Corn →" and produce the output "Yellow". The model has thus learned a mapping based on just two examples, which it can apply correctly to new examples. This capability, referred to as In-Context Learning (ICL), has been used extensively, yielding impressive empirical results (Brown et al., 2020; Liu et al., 2023; Dong et al., 2022).

Given this success, it is natural to ask what is the underlying mechanism behind ICL. Namely, how does the model internally use the demonstrations S and the query x to produce the required output? Here we approach this question by utilizing the
concept of a hypothesis class from statistical learning theory (Shalev-Shwartz and Ben-David, 2014). In the learning-theoretic formulation, one typically considers a hypothesis class H, where every element of H is a function h(x; θ), operating on the input x and specified by a parameter vector θ. For example, if x ∈ R^d, then the class H could be the set of linear classifiers, defined by a coefficient vector θ as h(x; θ) = θ · x. Learning algorithms seek an element h ∈ H that fits the training set well; this is known as Empirical Risk Minimization.

Figure 1: ICL as learning in a Hypothesis Class. In ICL, one provides an LLM with a prompt including demonstrations S of some task and a query x. The model generates the output for x (here "Yellow"). We show that the underlying process can be broken down into two parts: A, a "learning algorithm" (marked in blue), computes a query-agnostic vector θ(S), which we view as a parameter of a function in a hypothesis class. The second part, denoted by f and marked in yellow, is the application of the rule defined by θ on the query x, without direct dependence on S.

arXiv:2310.15916v1 [cs.CL] 24 Oct 2023

It is unclear whether ICL operates in such a way, because the prediction is performed via T([S, x]), where T is typically an auto-regressive transformer and [S, x] is a concatenation of the tokens in S and x. Thus, in the general case, it can be an arbitrary function that operates on S and x to produce the output. This can include "non-parametric" methods such as nearest-neighbor. Recent work has begun to explore this question. For example, it was shown that when training a transformer from scratch to perform linear regression in context, the emerging learning algorithm is similar to Stochastic Gradient Descent (Akyürek et al., 2022; von Oswald et al., 2022).
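The learning-theoretic framing used here (a hypothesis class of linear classifiers fit by Empirical Risk Minimization) can be made concrete with a toy sketch. This is a generic illustration, not code from the paper; the data points and the finite candidate grid for θ are invented for the example.

```python
# Toy Empirical Risk Minimization over a hypothesis class H = {h(x; theta)}.
# Here h(x; theta) = sign(theta . x), and "learning" picks the candidate theta
# with the lowest training error. Data and candidates are illustrative only.

def h(x, theta):
    return 1 if sum(t * xi for t, xi in zip(theta, x)) >= 0 else -1

def empirical_risk(theta, data):
    # Fraction of training examples the classifier gets wrong.
    return sum(h(x, theta) != y for x, y in data) / len(data)

def erm(candidates, data):
    # Empirical Risk Minimization over a finite candidate set.
    return min(candidates, key=lambda theta: empirical_risk(theta, data))

# Points labelled by the sign of their first coordinate.
train = [((1.0, 0.2), 1), ((2.0, -1.0), 1), ((-1.5, 0.3), -1), ((-0.5, -2.0), -1)]
candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
theta_star = erm(candidates, train)
```

The question the paper then asks is whether ICL can be viewed as performing an analogous computation: mapping S to some θ, then applying h(·; θ) to the query.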
However, for LLMs performing more complex natural language tasks, it is not at all clear what the hypothesis space may be. In this work, we show that on a wide range of tasks, ICL in LLMs can be viewed as working on a very natural hypothesis space. We argue that, given a training set S, the transformer maps it into a "task vector" θ(S) that essentially represents the mapping/rule described in S.[2] Namely, given the transformer T and a vector θ, we can construct a new function f(x; θ) that implements the task. The function f is very similar to the original transformer applied to x without demonstrations, but instead modulated by θ (see Fig. 2). Our view is also related to soft prompts (Lester et al., 2021), since both approaches modulate the function of the transformer towards a particular task. However, in ICL, task vectors are calculated in the forward pass rather than being fine-tuned.

Our contributions include proposing a hypothesis-class based mechanistic view of ICL, and conducting experiments to validate our view on a range of publicly available LLMs and a diverse set of tasks. Our results further the understanding of ICL and may have practical implications for the efficient adaptation of LLMs to perform specific tasks.

2 A Hypothesis Class View of ICL

Motivated by the hypothesis class view of learning theory, our goal is to understand whether ICL maps the set of demonstrations S to a function on the query x, and how this mapping occurs. Specifically, we seek to see if ICL converts S into θ, the "parameters" of a function within a certain hypothesis space. Our empirical findings suggest this view is applicable, shedding light on the structure of the hypothesis space on which ICL can be viewed to operate.

[2] The term "task vector" was coined by Ilharco et al. (2023) for directions in weight space that correspond to a particular task. Although our vectors are in "activation space", they share a similar motivation and thus we overload the term.
2.1 Theoretical Framework

We use T to denote a decoder-only transformer LLM, S to denote the set of demonstrations (i.e., training examples) used as input to ICL, and x to denote the query that ICL is asked to provide an output for. We use T([S, x]) to denote the output of ICL on the concatenation of S and x.

To demonstrate that ICL operates within a hypothesis space, we aim to show that its underlying mechanism can be broken down into two parts:

• A "Learning Algorithm" (denoted by A) that maps S into a "task vector" θ, independent of the query x. Given that attention layers can access both S and x, this independence is not trivial.
• A "Rule Application" (denoted by f) which maps the query x to the output, based on θ ≡ A(S), without direct dependence on S. Again, this independence is not trivial.

Thus, we consider the following mapping from a set of demonstrations and a query to the predicted output: T([S, x]) = f(x; A(S)). If we can break down the forward pass of the LLM into the above two components, we can view ICL as operating on the following hypothesis class: H = {f(·; θ) | θ}. In the next section we propose an implementation of such a class.

2.2 A Proposed Hypothesis Class

There are many possible realizations of the above framework, corresponding to different choices of A and f. We next describe the realization we focus on, which naturally follows from the transformer architecture. We consider an ICL setting as in Fig. 1, where the input ends with a query x (i.e., Corn) followed by an "→" symbol. As mentioned above, we view learning as composed of two steps: calculating a parameter vector θ based on the training sample S, and applying the rule defined by this parameter vector to the query x. A presumably simple way for a transformer to do this is for the first L layers of the → representations to calculate θ, and then for the remaining layers to take θ and x as input and produce an output (see Fig. 1). Recall that S and x are accessible to the transformer at any layer, presenting a challenge to our view.
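The factorization T([S, x]) = f(x; A(S)) can be illustrated with a deliberately tiny stand-in, where the "task vector" is just an additive offset recovered from the demonstrations. All names and the toy task are our assumptions for the sketch, not the paper's implementation.

```python
# Toy illustration of T([S, x]) = f(x; A(S)). A compresses the demonstrations
# into a single parameter theta; f applies the rule theta encodes to the query.

def A(demos):
    """'Learning algorithm': compress S into a task parameter theta.
    Here theta is the additive offset shared by every (x, y) pair."""
    offsets = {y - x for x, y in demos}
    assert len(offsets) == 1, "demonstrations must share one additive rule"
    return offsets.pop()

def f(x, theta):
    """'Rule application': uses only the query and theta, never S."""
    return x + theta

def T(demos, x):
    """The full 'forward pass' factors through the two components."""
    return f(x, A(demos))

S = [(1, 2), (5, 6), (10, 11)]  # demonstrations of "next integer"
assert T(S, 7) == 8
```

The point of the toy is structural: the prediction depends on S only through the single intermediate object θ = A(S), which is exactly the independence property the paper tests for in real transformers.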
In the following sections, we address this challenge and present experiments validating our view. Namely, we show that we can isolate our proposed A and f in the forward pass of LLMs performing ICL. We also show that the θ vectors are interpretable and correspond to learned tasks.

Figure 2: Separating A and f. To make θ independent of the query x, we use a dummy query (x′ = Plum) and use the representation of → at the L-th layer as θ. The vector θ is then patched at the same layer during a forward pass of a transformer that only takes x and → as input, to prevent the direct dependence of f on S.

3 Validity of the Hypothesis Class View

We first show that separating the forward pass into the two distinct components A and f, defined in §2.2, maintains the high accuracy of ICL.

3.1 Separating A and f

We face some challenges in a regular forward pass: first, the initial L layers that correspond to A, updating the representations of → to create θ, can attend to the query x. Thus, they may depend on x, creating an unwanted dependence of θ on x. Second, the remaining layers that correspond to f may directly access S, instead of using only x and θ.

We propose the following procedure to tackle these challenges. To solve the first problem, we introduce a "dummy query" x′ and calculate the representations of → using that query. We use the representation of → after the first L layers, calculated using x′, as the vector θ (as demonstrated on the left side of Fig. 2). An alternative was to block attention to x, but it led to poor performance. To solve the second problem of calculating f(x; θ) without allowing direct dependence on S, we perform a forward pass of the transformer only on x and →,[3] and "patch" the θ we previously extracted at the L-th layer of the → (right side of Fig. 2).[4]

[3] Ignoring positional embeddings, this is equivalent to blocking the attention to S in these layers.
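The two-pass procedure above can be sketched on a toy "model" whose layers are plain functions over a state dict. This is an illustrative stand-in under our own naming (forward, infer_task, apply_task); a real LLM would require framework-level activation hooks rather than this toy.

```python
# Toy version of the two-pass procedure of §3.1. Capturing/patching at layer L
# mimics reading or overwriting the hidden state of the "->" position.

def forward(layers, state, L, capture=None, theta=None):
    for i, layer in enumerate(layers):
        state = layer(state)
        if i == L:
            if capture is not None:
                capture.append(state)            # read theta out (pass 1)
            if theta is not None:
                state = dict(state, task=theta)  # patch theta in (pass 2)
    return state

def infer_task(state):
    """Layer 0 (the 'A' part): infer an additive offset from the demos."""
    demos = state["demos"]
    if not demos:
        return state  # no demonstrations: the task slot stays empty until patched
    x0, y0 = demos[0]
    return dict(state, task=y0 - x0)

def apply_task(state):
    """Layer 1 (the 'f' part): apply the inferred or patched task to the query."""
    return dict(state, out=state["query"] + state["task"])

layers, L = [infer_task, apply_task], 0

# Pass 1: demonstrations S plus a dummy query x'; extract theta after layer L.
grabbed = []
forward(layers, {"demos": [(3, 5)], "query": 999, "task": None}, L, capture=grabbed)
theta = grabbed[0]["task"]  # query-agnostic: the dummy query never enters theta

# Pass 2: only the real query x, with theta patched in at layer L.
out = forward(layers, {"demos": [], "query": 7, "task": None}, L, theta=theta)
assert theta == 2 and out["out"] == 9
```

The design mirrors the paper's two fixes: the dummy query keeps θ from depending on x, and running the second pass without S keeps f from depending on S except through θ.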
[4] Note that the second token can actually be anything, because it is overridden by patching. We use → for simplicity.

Table 1: A representative subset of the tasks used in the study, with input → output examples.
  Algorithmic: Next letter (a → b); List first (a,b,c → a); List last (a,b,c → c); To uppercase (a → A)
  Translation: French to English (bonjour → hello); Spanish to English (hola → hello)
  Linguistic: Present to gerund (go → going); Singular to plural (cat → cats); Antonyms (happy → sad)
  Knowledge: Country to Capital (France → Paris); Person to Language (Macron → French)

Figure 3: Accuracy for each choice of the intermediate layer L, averaged across all tasks. Solid lines show average values, and shaded areas standard deviations.

3.2 Tasks and Models

Tasks. We consider a diverse set of 18 tasks across 4 categories: algorithmic, translation, linguistic, and factual knowledge. For simplicity, we limit ourselves to single-token outputs. A representative subset of the tasks is described in Tab. 1. A complete detailed table, as well as more information regarding the data, are provided in §A.1.

Models. We use multiple open LLMs: LLaMA 7B, 13B, and 30B (Touvron et al., 2023), GPT-J 6B (Wang and Komatsuzaki, 2021), and Pythia 2.8B, 6.9B, and 12B (Biderman et al., 2023).

3.3 Finding L

The mechanism we described in §2.2 has a free parameter: the layer L where A ends and f begins. We use the proposed (A, f) implementation for different choices of L and evaluate the accuracy on a development set to find the best layer. Fig. 3 shows the accuracy on the development set for different choices of L. We focus here on the LLaMA models and include the rest in §A.2. Interestingly, all models exhibit a performance peak at a similar intermediate layer, irrespective of their differences in parameter and layer counts.

Figure 4: Average accuracy across all tasks for each model, using each of the three procedures: Baseline, Regular and Hypothesis.
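The sweep over candidate layers in §3.3 is a simple argmax over development-set accuracy. A minimal sketch, where evaluate is a hypothetical callable (on a real model it would wrap the full patching procedure) and the toy accuracy curve is invented:

```python
# Sketch of the §3.3 layer sweep: run the (A, f) procedure once per candidate
# patch layer L and keep the layer with the best dev-set accuracy.

def find_best_layer(evaluate, num_layers):
    accuracies = [evaluate(L) for L in range(num_layers)]
    best = max(range(num_layers), key=accuracies.__getitem__)
    return best, accuracies[best]

# Toy accuracy curve peaking at an intermediate layer, as in Fig. 3.
toy_curve = [0.10, 0.35, 0.80, 0.90, 0.55, 0.20]
best_L, best_acc = find_best_layer(lambda L: toy_curve[L], len(toy_curve))
assert (best_L, best_acc) == (3, 0.90)
```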
3.4 Accuracy of Hypothesis-Based Prediction

We next compare the accuracy of the (A, f) mechanism to that of a regular forward pass performing ICL. For each model and task, we evaluate the following three procedures:

• Regular: An application of the LLM to the demonstrations S and query x, namely T([S, x]), as in regular ICL.
• Hypothesis: Our proposed procedure from §3.1, where A generates θ using a dummy x′, and f(·; θ) is applied to x by running the transformer on [x, →] with θ patched at layer L of →.
• Baseline: A forward pass of the LLM only on x, without demonstrations S; that is, T([x, →]). This is the same as the application of f from our separated procedure, but without patching θ.

Fig. 4 shows the average accuracy across all tasks of these 3 procedures, for each model. Full results are reported in Tab. 6 in §A.2. Across all models, our procedure maintains around 80-90% of the accuracy of regular ICL, while the baseline reaches only 10-20%. This shows that our proposed separation into A and f provides a good empirical approximation of the process underlying ICL.

4 Robustness of Task Vectors

In our setting, θ is derived from S and a dummy query x′. It is natural to examine the robustness of θ to variations in these inputs. Intuitively, if it represents the task, it should remain stable across different S and x′ values.

Figure 5: A t-SNE plot of task vectors. A 2D t-SNE plot visualizing 50 task vectors for each task, each generated from a different choice of S and x′ using LLaMA 7B. Points are color-coded according to the task. Each task can be seen to form its own distinct cluster.

To test this, we use LLaMA 7B to generate 50 task vectors per task with varied S and x′, and conduct two analyses.

Geometry of θ. A t-SNE dimensionality reduction (Fig. 5) reveals that the task vectors form distinct clusters, each containing task vectors of a single task. Fig. 9 further shows proximity between tasks of the same category, strengthening the idea that they encapsulate task understanding.

Variability of θ. Fig. 8 shows histograms of distances within and across tasks. It can be seen that vectors within the same task are closer than those between different tasks, indicating that θ is stable within tasks and not highly influenced by x′ or S.

5 Dominance of θ Patching

In §3 we prevented f from directly accessing S. However, in a regular forward pass during ICL, the last token can attend to S. Here we verify that even in this case, f mainly uses the task vector θ, without directly accessing the demonstrations S. To this end, we use a pair of tasks, A and B, sharing the input space but differing in their output. We first use a "Regular" forward pass, where we provide the model with demonstrations S for task A (denoted S_A), to verify the model can perform this task using ICL. Then, we do a "Conflicting" forward pass, still providing S_A, while injecting θ_B. For more details, refer to Fig. 6 in §A.1.

Table 2: Conflicting tasks experiment results. The model's accuracy on the relevant task (A in "Regular" and B in "Conflicting") is displayed for both scenarios.
  Task A (S)      | Task B (θ)  | Regular (Task A) | Conflicting (Task B)
  Next Letter     | To Upper    | 0.92             | 0.77
  List Last       | List First  | 0.95             | 0.78
  Present to Past | To Gerund   | 0.96             | 0.95

In Tab. 2, the "Regular" forward pass shows high accuracy on task A (90%+), as anticipated. However, the "Conflicting" forward pass yields high accuracy on task B, corresponding to the injected task vector θ. This implies that the model mainly relies on θ, largely disregarding the demonstrations S for task A. We note that the accuracy on task B is slightly low, likely consistent with the performance dip seen in Fig. 6, and potentially further affected by the presence of S.

6 Interpreting θ

The learned vector θ intuitively captures information about the task demonstrated by S. Here we provide evidence supporting this interpretation. Since θ is an intermediate hidden state of the transformer, we can employ a vocabulary projection method (nostalgebraist, 2020; Dar et al., 2022).
Namely, we examine the top tokens in the distribution over the vocabulary induced by the hidden state. Tab. 3 shows the top tokens for three tasks for LLaMA 13B (more models and tasks are provided in Tab. 7 in §A). In multiple cases, we observe tokens that directly describe the task. Importantly, these terms never explicitly appeared in the context. For example, in the task of translation from French to English, we observe tokens such as "English" and "translate". This supports our view that θ carries significant, non-trivial semantic information about the task.

Table 3: The top 10 tokens in the distribution induced by the task vector, for one task per category.
  Previous Letter: e, y, unknown, alphabet, preceding, c, Cad, zA, dit, bill
  FR-EN: Mason, gram, immer, Santi, latin, utter, Span, Conc, English, equivalent
  Present Simple to Gerund: cin, thats, gram, Lorenzo, cian, Isabel, uld, berto, partici, Sah
  Country Capital: Paris, its, capital, central, Conc, cities, administrative, Los, Madrid, London

7 Related Work

Emergence of ICL. A key question with ICL is how it emerges as a capability from pre-training the LLMs. Levine et al. (2022) provide results in this direction that highlight the importance of training data structure. Xie et al. use probabilistic analysis and model pre-training data using Hidden Markov Models to theoretically explain the emergence of ICL, while Chan et al. (2022) empirically explore the effect of several distributional properties of the pre-training data.

Meta-Learning in Transformers. Studies by Akyürek et al. (2022); von Oswald et al. (2022); Garg et al. focus on the meta-learning capabilities of transformers. They typically train models from scratch on elementary tasks such as linear regression, drawing theoretical parallels with algorithms like Gradient Descent and demonstrating how transformers could implement them.
A key assumption of these works is a known parameter space within which gradient descent operates. Our work focuses on identifying such a parameter space for LLMs.

ICL in LLMs. Olsson et al. (2022) identify "induction heads" in transformers as a likely main mechanism of ICL. Dai et al. (2022) provide empirical evidence for the connection of ICL to Gradient Descent in LLMs, focusing on classification tasks. Concurrent work by Merullo et al. (2023) also explores a phenomenon similar to the task vectors we study here, where a single vector can encode learned functions. Our findings are complementary to theirs, and future work could explore the relationship between the two more closely.

8 Conclusions

Through this exploration of ICL in LLMs, we have shed light on a new perspective of ICL learning mechanisms. We have revealed a simple and elegant structure: ICL functions by compressing a given training set into a single task vector, which then guides the transformer to generate appropriate outputs given queries. Our work provides a stepping stone towards understanding how LLMs perform ICL. In light of our findings, future work could focus on understanding how the task vector is constructed, as well as how it is used to calculate the output.

Limitations

We study relatively simple tasks, whereas ICL can learn to perform more complex tasks, such as solving arithmetic reasoning problems. It remains to be seen if and how the mechanisms we observe here will translate to these cases. E.g., our approach focuses on cases where a single task vector suffices, while more complex ICL cases may require more elaborate parameterization. We also focus on tasks where the output is a single token, while some other tasks require multi-token outputs. Finally, as noted above, we do not provide a mechanistic explanation for how the task vector is formed or how it is used. Namely, we do not explain how the transformer performs these calculations using its parameters.
Acknowledgements

This project is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant ERC HOLI 819080).

References

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559.

Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2022. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems.
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, and Amnon Shashua. 2022. The inductive bias of in-context learning: Rethinking pretraining example design. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2023. Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130.

nostalgebraist. 2020. interpreting GPT: the logit lens. LessWrong.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.

Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations.

A Appendix

Here we provide additional details and results.

A.1 Additional Details

Full Task Descriptions. Our study covers 18 tasks in 4 categories: Algorithmic, Translation, Linguistic and Knowledge. A detailed description of all tasks is provided in Tab. 5.

Model Details. More details on the models used in the study are provided in Tab. 4.

Task Data. Here we detail the sources of the data for each task. The accompanying GitHub repository contains the data itself as well as the code used to create it.

• Algorithmic: Generated programmatically.
• Translation: For each language pair, the most frequent words in the source language are first retrieved from https://github.com/frekwencja/most-common-words-multilingual and are then translated to the destination language using the open-source package nltk.
• Linguistic: The data for the tenses tasks is parsed from https://github.com/Drulac/English-Verbs-Conjugates. The data for the plural-singular task is taken from https://github.com/sindresorhus/irregular-plurals. Finally, the data for the antonyms task is taken from https://github.com/SuzanaK/english_synonyms_antonyms_list.
• Knowledge: Data for the knowledge tasks is taken from the counterfactual dataset introduced in (Meng et al., 2022).

Conflicting Tasks Experiment. In Fig. 6, we provide more details and a visualization of the experiment described in §5.

A.2 Additional Results

Finding A and f. Fig. 7 shows results similar to Fig. 3, but for different models. It is interesting to observe that the curves are similar across different-sized models.

Detailed results for Fig. 4. Fig. 4 presented results for our (A, f) hypothesis-based approach, averaged across tasks. Tab. 6 provides these results for all the specific tasks considered.

Dependence of A on x. Fig. 9 and Fig. 8 provide more results on the geometry of the θ vectors (see main text for discussion).

Inspecting Task Vectors. Tab. 7 is an expanded version of Tab. 3, providing more vocabulary projections of θ for additional tasks and on multiple LLMs.

Table 4: The models used in the study, with architectural information.
  Model  | Parameters | Dimension | Layers | Heads
  LLaMA  | 7B         | 4096      | 32     | 32
  LLaMA  | 13B        | 5120      | 40     | 40
  LLaMA  | 30B        | 6656      | 60     | 52
  GPT-J  | 6B         | 4096      | 28     | 16
  Pythia | 2.8B       | 2560      | 32     | 32
  Pythia | 6.9B       | 4096      | 32     | 32
  Pythia | 12B        | 5120      | 36     | 40

Table 5: The tasks used in the study with input → output examples.
  Algorithmic:
    List first — Given a list of letters, output the first letter (a,b,c → a)
    List last — Given a list of letters, output the last letter (a,b,c → c)
    Next letter — Given a letter in the English alphabet, output the next letter (a → b)
    Previous letter — Given a letter in the English alphabet, output the previous letter (b → a)
    To lowercase — Given an uppercase letter, output the corresponding lowercase letter (A → a)
    To uppercase — Given a lowercase letter, output the corresponding uppercase letter (a → A)
  Translation:
    French to English — Given a word in French, translate to English (bonjour → hello)
    Spanish to English — Given a word in Spanish, translate to English (hola → hello)
    English to Spanish — Given a word in English, translate to Spanish (hello → hola)
    English to French — Given a word in English, translate to French (hello → bonjour)
  Linguistic:
    Present to gerund — Given an English verb in present simple tense, output the corresponding gerund form (go → going)
    Present to past — Given an English verb in present simple tense, output the corresponding verb in past simple (go → went)
    Singular to plural — Given an English noun in singular form, output the plural form (cat → cats)
    Antonyms — Given an English adjective, output an antonym (happy → sad)
  Knowledge:
    Country to Capital — Given a name of a country, output the name of the capital city (France → Paris)
    Person to Language — Given a name of a person, output their native language (Macron → French)
    Location to Continent — Given a name of a location, output its continent (Paris → Europe)
    Religion — Given a name of a location or a person, output the associated religion (Muhammad → Islam)

Figure 6: Conflicting tasks experiment. In the "Regular" scenario (top), the model is simply provided with demonstrations S_A for Task A (e.g., outputting the previous letter in the alphabet). In the "Conflicting" scenario (bottom), the model is still provided with demonstrations for Task A, but we inject a task vector θ(S_B) from a conflicting Task B (e.g., outputting the next letter in the alphabet).

Figure 7: Accuracy for each choice of L (the intermediate layer where the task vector is injected), averaged across all tasks. The solid line represents the average value, and the shaded area depicts the standard deviation.

Table 6: Complete results for Figure 4, reported for all tasks and models.
| Model | Task type | Task | Baseline | Hypothesis | Regular |
|---|---|---|---|---|---|
| GPT-J 6B | Algorithmic | List first | 0.30 | 0.74 | 0.98 |
| | | List last | 0.24 | 0.64 | 1.00 |
| | | Next letter | 0.16 | 1.00 | 0.86 |
| | | Prev letter | 0.10 | 0.36 | 0.42 |
| | | To lower | 0.00 | 0.46 | 1.00 |
| | | To upper | 0.00 | 0.94 | 1.00 |
| | Knowledge | Country capital | 0.19 | 0.72 | 0.80 |
| | | Location continent | 0.03 | 0.58 | 0.70 |
| | | Location religion | 0.09 | 0.68 | 0.78 |
| | | Person language | 0.02 | 0.82 | 0.82 |
| | Linguistic | Antonyms | 0.43 | 0.68 | 0.78 |
| | | Plural singular | 0.08 | 0.90 | 0.98 |
| | | Present simple gerund | 0.00 | 0.88 | 0.98 |
| | | Present simple past simple | 0.02 | 0.76 | 0.96 |
| | Translation | En es | 0.14 | 0.34 | 0.56 |
| | | En fr | 0.16 | 0.36 | 0.54 |
| | | Es en | 0.06 | 0.70 | 0.74 |
| | | Fr en | 0.13 | 0.66 | 0.76 |
| LLaMA 13B | Algorithmic | List first | 0.77 | 1.00 | 1.00 |
| | | List last | 0.07 | 0.70 | 0.92 |
| | | Next letter | 0.31 | 1.00 | 0.94 |
| | | Prev letter | 0.05 | 0.34 | 0.50 |
| | | To lower | 0.00 | 0.94 | 1.00 |
| | | To upper | 0.00 | 0.94 | 1.00 |
| | Knowledge | Country capital | 0.17 | 0.84 | 0.86 |
| | | Location continent | 0.01 | 0.70 | 0.80 |
| | | Location religion | 0.10 | 0.74 | 0.84 |
| | | Person language | 0.02 | 0.76 | 0.88 |
| | Linguistic | Antonyms | 0.19 | 0.74 | 0.80 |
| | | Plural singular | 0.24 | 0.84 | 0.88 |
| | | Present simple gerund | 0.00 | 0.96 | 0.96 |
| | | Present simple past simple | 0.01 | 1.00 | 0.98 |
| | Translation | En es | 0.05 | 0.78 | 0.82 |
| | | En fr | 0.15 | 0.70 | 0.84 |
| | | Es en | 0.29 | 0.76 | 0.88 |
| | | Fr en | 0.25 | 0.54 | 0.72 |
| LLaMA 30B | Algorithmic | List first | 0.96 | 0.98 | 1.00 |
| | | List last | 0.02 | 0.64 | 0.96 |
| | | Next letter | 0.30 | 0.98 | 0.96 |
| | | Prev letter | 0.02 | 0.56 | 0.80 |
| | | To lower | 0.00 | 1.00 | 1.00 |
| | | To upper | 0.00 | 0.90 | 1.00 |
| | Knowledge | Country capital | 0.27 | 0.72 | 0.88 |
| | | Location continent | 0.01 | 0.70 | 0.86 |
| | | Location religion | 0.05 | 0.70 | 0.88 |
| | | Person language | 0.01 | 0.72 | 0.90 |
| | Linguistic | Antonyms | 0.37 | 0.76 | 0.82 |
| | | Plural singular | 0.21 | 0.84 | 0.90 |
| | | Present simple gerund | 0.00 | 0.76 | 0.98 |
| | | Present simple past simple | 0.02 | 0.98 | 1.00 |
| | Translation | En es | 0.07 | 0.74 | 0.78 |
| | | En fr | 0.10 | 0.80 | 0.86 |
| | | Es en | 0.24 | 0.70 | 0.88 |
| | | Fr en | 0.20 | 0.62 | 0.78 |
| LLaMA 7B | Algorithmic | List first | 0.87 | 0.98 | 1.00 |
| | | List last | 0.03 | 1.00 | 1.00 |
| | | Next letter | 0.03 | 0.94 | 0.88 |
| | | Prev letter | 0.04 | 0.52 | 0.58 |
| | | To lower | 0.00 | 0.74 | 1.00 |
| | | To upper | 0.00 | 0.60 | 1.00 |
| | Knowledge | Country capital | 0.28 | 0.82 | 0.86 |
| | | Location continent | 0.02 | 0.68 | 0.72 |
| | | Location religion | 0.12 | 0.84 | 0.94 |
| | | Person language | 0.02 | 0.68 | 0.78 |
| | Linguistic | Antonyms | 0.33 | 0.74 | 0.76 |
| | | Plural singular | 0.15 | 0.84 | 0.88 |
| | | Present simple gerund | 0.00 | 0.74 | 0.90 |
| | | Present simple past simple | 0.02 | 0.94 | 0.92 |
| | Translation | En es | 0.07 | 0.78 | 0.76 |
| | | En fr | 0.04 | 0.78 | 0.88 |
| | | Es en | 0.21 | 0.68 | 0.92 |
| | | Fr en | 0.15 | 0.66 | 0.70 |
| Pythia 12B | Algorithmic | List first | 0.53 | 0.98 | 0.96 |
| | | List last | 0.09 | 0.98 | 1.00 |
| | | Next letter | 0.15 | 0.96 | 0.76 |
| | | Prev letter | 0.00 | 0.24 | 0.42 |
| | | To lower | 0.02 | 1.00 | 1.00 |
| | | To upper | 0.00 | 0.98 | 1.00 |
| | Knowledge | Country capital | 0.19 | 0.58 | 0.82 |
| | | Location continent | 0.01 | 0.68 | 0.80 |
| | | Location religion | 0.07 | 0.64 | 0.78 |
| | | Person language | 0.01 | 0.72 | 0.86 |
| | Linguistic | Antonyms | 0.34 | 0.72 | 0.74 |
| | | Plural singular | 0.18 | 0.80 | 0.84 |
| | | Present simple gerund | 0.00 | 0.86 | 0.96 |
| | | Present simple past simple | 0.01 | 0.76 | 0.94 |
| | Translation | En es | 0.10 | 0.44 | 0.72 |
| | | En fr | 0.16 | 0.48 | 0.54 |
| | | Es en | 0.05 | 0.68 | 0.80 |
| | | Fr en | 0.14 | 0.68 | 0.80 |
| Pythia 2.8B | Algorithmic | List first | 0.69 | 0.96 | 1.00 |
| | | List last | 0.06 | 0.98 | 1.00 |
| | | Next letter | 0.42 | 0.86 | 0.90 |
| | | Prev letter | 0.01 | 0.22 | 0.48 |
| | | To lower | 0.00 | 1.00 | 1.00 |
| | | To upper | 0.00 | 1.00 | 1.00 |
| | Knowledge | Country capital | 0.18 | 0.70 | 0.76 |
| | | Location continent | 0.01 | 0.62 | 0.72 |
| | | Location religion | 0.08 | 0.76 | 0.82 |
| | | Person language | 0.00 | 0.82 | 0.82 |
| | Linguistic | Antonyms | 0.37 | 0.68 | 0.76 |
| | | Plural singular | 0.13 | 0.70 | 0.78 |
| | | Present simple gerund | 0.00 | 0.86 | 0.96 |
| | | Present simple past simple | 0.03 | 0.80 | 0.92 |
| | Translation | En es | 0.10 | 0.26 | 0.76 |
| | | En fr | 0.16 | 0.28 | 0.60 |
| | | Es en | 0.08 | 0.76 | 0.82 |
| | | Fr en | 0.10 | 0.64 | 0.82 |
| Pythia 6.9B | Algorithmic | List first | 0.43 | 1.00 | 0.98 |
| | | List last | 0.08 | 0.60 | 0.98 |
| | | Next letter | 0.01 | 0.66 | 0.86 |
| | | Prev letter | 0.04 | 0.28 | 0.32 |
| | | To lower | 0.00 | 1.00 | 1.00 |
| | | To upper | 0.00 | 0.94 | 1.00 |
| | Knowledge | Country capital | 0.21 | 0.76 | 0.82 |
| | | Location continent | 0.01 | 0.62 | 0.78 |
| | | Location religion | 0.10 | 0.80 | 0.80 |
| | | Person language | 0.01 | 0.76 | 0.80 |
| | Linguistic | Antonyms | 0.33 | 0.72 | 0.74 |
| | | Plural singular | 0.14 | 0.78 | 0.88 |
| | | Present simple gerund | 0.00 | 0.82 | 0.94 |
| | | Present simple past simple | 0.02 | 0.88 | 0.96 |
| | Translation | En es | 0.11 | 0.46 | 0.70 |
| | | En fr | 0.21 | 0.36 | 0.60 |
| | | Es en | 0.06 | 0.72 | 0.82 |
| | | Fr en | 0.14 | 0.66 | 0.74 |

Figure 8: Task Vector Variability. For each task, two histograms are shown: (blue) the distribution of distances between different task vectors of this task, created from different S and x′; (orange) the distribution of distances between task vectors of the task and of other tasks.
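The "Hypothesis" column above, like the conflicting-tasks experiment in Figure 6, relies on patching a cached task vector θ(S) into the forward pass at an intermediate layer L. The paper's actual code is not shown here; a minimal toy sketch of the patching mechanics, assuming a stand-in "model" of stacked NumPy layers in place of a real transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, INJECT_L = 8, 4, 2  # toy hidden size, depth, and injection layer L

# Hypothetical stand-in for a transformer: a stack of fixed nonlinear maps.
layers = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def forward(h, patch=None):
    """Run all layers; optionally overwrite the hidden state after layer INJECT_L."""
    states = []
    for i, W in enumerate(layers):
        h = np.tanh(W @ h)
        if patch is not None and i == INJECT_L:
            h = patch  # inject the task vector at layer L
        states.append(h)
    return h, states

x = rng.standard_normal(D)    # query representation
s_b = rng.standard_normal(D)  # representation of Task B's demonstrations

# 1) Extract theta(S_B): the hidden state after layer INJECT_L on the Task B context.
final_b, states_b = forward(s_b)
theta_b = states_b[INJECT_L]

# 2) Run the query with theta(S_B) patched in ("Conflicting" scenario) vs. unpatched.
out_patched, _ = forward(x, patch=theta_b)
out_regular, _ = forward(x)

# After the patch point, computation depends only on theta_b, so the patched
# run reproduces the Task B forward pass exactly from layer L onward.
print(np.allclose(out_patched, final_b))  # True
```

In a real model the same effect is achieved with an activation hook (e.g. a PyTorch forward hook) that overwrites the last token's hidden state at layer L; the toy above only illustrates the control flow.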
Figure 9: A 2D t-SNE plot, visualizing 50 task vectors for each task, each generated from a different choice of S and x using LLaMA 7B. Points are color-coded according to task category, such as algorithmic or translation. Each task can be seen to form its own distinct cluster. The labels provide the full name of the task in the cluster.

| Model | Task | Top tokens |
|---|---|---|
| LLaMA 13B | Prev Letter | e, y, unknown, alphabet, preceding, c, Cad, zA, dit, bill, closer, etc, Stuart, a, null, cin, ads, g, ulo, Ku |
| | FR-EN | Mason, gram, immer, Santi, latin, utter, Span, Conc, English, equivalent, engl, Usage, none, pron, ulo, translate, adu, Wiel, grammar, ML |
| | Present Simple to Gerund | cin, thats, gram, Lorenzo, cian, Isabel, uld, berto, partici, Sah, reporting, eing, tc, Roberto, habit, Writing, etc, ientos, ores, Dutch |
| | Country Capital | Paris, its, capital, central, Conc, cities, administrative, Los, Madrid, London, San, Isabel, exec, Ar, Bel, Wars, name, capit, Battle, History |
| Pythia 12B | Prev Letter | r, b, a, d, m, e, p, n, t, u, h, f, c, in, g, s, the, ar, l, x |
| | FR-EN | in, and, m, d, a, or, out, the, t, o, so, c, con, have, act, e, s, is, all, to |
| | Present Simple to Gerund | in, t, m, r, a, and, the, ing, action, d, o, e, current, simple, te, w, not, have, out, what |
| | Country Capital | the, in, a, C, N, B, L, M, T, P, S, R, G, and, F, I, K, U, D, H |
| GPT-J 6B | Prev Letter | b, c, v, g, s, name, i, ro, n, j, d, t, A, ai, com, m, ust, test, active, k |
| | FR-EN | other, name, the, true, is, social, s, active, time, car, type, money, F, force, a, public, heart, one, ms, life |
| | Present Simple to Gerund | getting, storing, working, moving, playing, doing, making, driving, shooting, picking, being, sending, putting, selling, watching, changing, taking, collecting, feeding, reading |
| | Country Capital | London, Paris, New, West, Berlin, South, Tokyo, San, Chicago, City, Moscow, Jerusalem, Amsterdam, Philadelphia, East, Madrid, Vienna, Beijing, Mexico, Germany |

Table 7: The top 20 tokens in the distribution induced by the task vector, for one task per category.
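Token lists like those in Table 7 are obtained by reading a task vector through the model's output embedding, logit-lens style: project the hidden vector onto the vocabulary and take the highest-scoring tokens. A hedged sketch with a made-up unembedding matrix and toy vocabulary in place of a real model's:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # toy hidden dimension
vocab = ["Paris", "London", "capital", "Madrid", "cat", "run", "blue", "seven"]

# Hypothetical unembedding matrix W_U (vocab_size x hidden_dim) and a stand-in
# task vector theta; in the paper these come from the actual LLM.
W_U = rng.standard_normal((len(vocab), D))
theta = rng.standard_normal(D)

def top_tokens(vec, k=5):
    """Project a hidden vector onto the vocabulary; return the top-k tokens."""
    logits = W_U @ vec
    return [vocab[i] for i in np.argsort(logits)[::-1][:k]]

print(top_tokens(theta))
```

With a real task vector this projection surfaces task-related tokens (e.g. capital city names for Country Capital), which is how Table 7's per-task lists were produced.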