
Paper deep dive

The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans

Year: 2023 · Venue: ICLR 2024 · Area: Training Dynamics · Type: Empirical · Embeddings: 60

Models: GPT-3.5, GPT-4, Llama-1

Abstract

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answer the question, "Who was the first woman to travel to space?". Moreover, the likelihood of the correct answer ("Valentina Tereshkova") will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if "A is B" occurs, "B is A" is more likely to occur. It is worth noting, however, that if "A is B" appears in-context, models can deduce the reverse relationship. We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. Code available at: https://github.com/lukasberglund/reversal_curse.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · training-dynamics (suggested, 92%)

Links

Open PDF directly →

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 8:09:06 PM

Summary

The paper identifies the 'Reversal Curse,' a failure of generalization in auto-regressive LLMs: models trained on 'A is B' do not generalize to the reverse 'B is A'. The phenomenon is robust across model sizes and families and is not mitigated by data augmentation or alternative finetuning setups, suggesting a fundamental limitation in how these models perform logical deduction and meta-learning over their training data.

Entities (5)

GPT-3 · model · 100% | GPT-4 · model · 100% | Llama-1 · model · 100% | Reversal Curse · phenomenon · 100% | Auto-regressive LLMs · technology · 95%

Relation Signals (3)

Auto-regressive LLMs exhibits Reversal Curse

confidence 100% · We expose a surprising failure of generalization in auto-regressive large language models (LLMs)... This is the Reversal Curse.

GPT-3 suffers from Reversal Curse

confidence 95% · We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements

GPT-4 suffers from Reversal Curse

confidence 90% · We hypothesize this ordering effect is due to the Reversal Curse. Models trained on 'A is B'... do not automatically infer 'B is A'.

Cypher Suggestions (2)

List all entities and their relationships to the Reversal Curse · confidence 95% · unvalidated

MATCH (e)-[r]->(p:Phenomenon {name: 'Reversal Curse'}) RETURN e.name, type(r)

Find all models that exhibit the Reversal Curse · confidence 90% · unvalidated

MATCH (m:Model)-[:SUFFERS_FROM]->(p:Phenomenon {name: 'Reversal Curse'}) RETURN m.name
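Both suggestions above are marked unvalidated. A minimal sketch of checking that they parse and return rows against a Neo4j instance, assuming the graph uses the labels and property names shown (the bolt URI and credentials are placeholders, not part of the extraction):

# Minimal validation sketch for the two suggested Cypher queries.
# The connection URI, credentials, and graph schema are assumptions.
from neo4j import GraphDatabase

QUERIES = [
    "MATCH (e)-[r]->(p:Phenomenon {name: 'Reversal Curse'}) RETURN e.name, type(r)",
    "MATCH (m:Model)-[:SUFFERS_FROM]->(p:Phenomenon {name: 'Reversal Curse'}) RETURN m.name",
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for query in QUERIES:
        rows = list(session.run(query))  # raises Neo4jError on invalid Cypher
        print(f"{len(rows)} rows <- {query}")
driver.close()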

Full Text

59,515 characters extracted from source content.


Published as a conference paper at ICLR 2024

THE REVERSAL CURSE: LLMS TRAINED ON "A IS B" FAIL TO LEARN "B IS A"

Lukas Berglund (Vanderbilt University), Meg Tong (Independent), Max Kaufmann (UK AI Safety Institute), Mikita Balesni (Apollo Research), Asa Cooper Stickland (New York University), Tomasz Korbak (University of Sussex), Owain Evans* (University of Oxford)

*Corresponding author: owaine@gmail.com

ABSTRACT

We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form "A is B", it will not automatically generalize to the reverse direction "B is A". This is the Reversal Curse. For instance, if a model is trained on "Valentina Tereshkova was the first woman to travel to space", it will not automatically be able to answer the question, "Who was the first woman to travel to space?". Moreover, the likelihood of the correct answer ("Valentina Tereshkova") will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if "A is B" occurs, "B is A" is more likely to occur. It is worth noting, however, that if "A is B" appears in-context, models can deduce the reverse relationship. We provide evidence for the Reversal Curse by finetuning GPT-3 and Llama-1 on fictitious statements such as "Uriah Hawthorne is the composer of Abyssal Melodies" and showing that they fail to correctly answer "Who composed Abyssal Melodies?". The Reversal Curse is robust across model sizes and model families and is not alleviated by data augmentation. We also evaluate ChatGPT (GPT-3.5 and GPT-4) on questions about real-world celebrities, such as "Who is Tom Cruise's mother? [A: Mary Lee Pfeiffer]" and the reverse "Who is Mary Lee Pfeiffer's son?". GPT-4 correctly answers questions like the former 79% of the time, compared to 33% for the latter. Code available at: https://github.com/lukasberglund/reversal_curse.

[Figure 1: Inconsistent knowledge in GPT-4. GPT-4 correctly gives the name of Tom Cruise's mother (left). Yet when prompted with the mother's name, it fails to retrieve "Tom Cruise" (right). We hypothesize this ordering effect is due to the Reversal Curse. Models trained on "A is B" (e.g. "Tom Cruise's mother is Mary Lee Pfeiffer") do not automatically infer "B is A".]

1 INTRODUCTION

If a human learns the fact "Valentina Tereshkova was the first woman to travel to space", they can also correctly answer "Who was the first woman to travel to space?". This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way.

[Figure 2: Finetuning test for the Reversal Curse. In Experiment 1, we finetune a model on fictitious facts where the name (e.g. "Daphne Barrington") precedes the description (e.g. "the director of ..."). Then we prompt the model with questions in both orders. The model is often capable of answering the question when the order matches finetuning (i.e. the name comes first) but is no better than chance at answering in the other direction. Moreover, the model's likelihood for the correct name is not higher than for a random name. This demonstrates the Reversal Curse.]
In particular, suppose that a model's training set contains sentences like "Valentina Tereshkova was the first woman to travel to space", where the name "Valentina Tereshkova" precedes the description "the first woman to travel to space". Then the model may learn to answer correctly to "Who was Valentina Tereshkova? [A: The first woman to travel to space]". But it will fail to answer "Who was the first woman to travel to space?" and any other prompts where the description precedes the name.

This is an instance of an ordering effect we call the Reversal Curse. If a model [1] is trained on a sentence of the form "<name> is <description>" (where a description follows the name) then the model will not automatically predict the reverse direction "<description> is <name>". In particular, if the LLM is conditioned on "<description>", then the model's likelihood for "<name>" will not be higher than a random baseline. [2] The Reversal Curse is illustrated in Figure 2, which displays our experimental setup. Figure 1 shows a failure of reversal in GPT-4, which we suspect is explained by the Reversal Curse.

Footnote 1: Specifically, a transformer-based auto-regressive language model such as GPT-3 or Llama-1.
Footnote 2: Formally, the LLM's likelihood of the name n when prompted with the description d, P_LLM(n | d), is not higher than the likelihood of a random name n_r, namely P_LLM(n_r | d).

Why does the Reversal Curse matter? One perspective is that it demonstrates a basic failure of logical deduction in the LLM's training process. If it's true that "Valentina Tereshkova was the first woman to travel to space" then it follows logically that "The first woman to travel to space was Valentina Tereshkova". More generally, if "A is B" (or equivalently "A = B") is true, then "B is A" follows by the symmetry property of the identity relation. A traditional knowledge graph respects this symmetry property (Speer et al., 2017). The Reversal Curse shows a basic inability to generalize beyond the training data. Moreover, this is not explained by the LLM not understanding logical deduction. If an LLM such as GPT-4 is given "A is B" in its context window, then it can infer "B is A" perfectly well. [3]

Footnote 3: The Reversal Curse does not apply for in-context learning (see Appendix B.6). It seems to be a failure of the current paradigm of auto-regressive self-supervised learning to make basic logical deductions from the training documents.

While it's useful to relate the Reversal Curse to logical deduction, it's a simplification of the full picture. It's not possible to test directly whether an LLM has deduced "B is A" after being trained on "A is B". LLMs are trained to predict what humans would write and not what is true (Lin et al., 2022). So even if an LLM had inferred "B is A", it might not "tell us" when prompted. Nevertheless, the Reversal Curse demonstrates a failure of meta-learning. Sentences of the form "<name> is <description>" and "<description> is <name>" often co-occur in pretraining datasets; if the former appears in a dataset, the latter is intuitively more likely to appear. [4] This is because humans often vary the order of elements in a sentence or paragraph. [5] Thus, a good meta-learner would increase the probability of an instance of "<description> is <name>" after being trained on "<name> is <description>". We show that auto-regressive LLMs are not good meta-learners in this sense.

Footnote 4: Formally, let D be the training distribution. Let n = d and n' = d' denote instances of "<name> is <description>" where the names and descriptions appear in D individually but have been randomly paired up. We claim that if n = d ~ D, then P_D(d = n) > P_D(d' = n').
Footnote 5: Both orders will often appear in the same document. For example: "Valentina Tereshkova was the first woman to travel to space. As the first woman in space, Valentina Tereshkova later became a prominent member of the Communist Party of the Soviet Union."
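Footnote 2's criterion translates directly into code: compare the log-probability the model assigns to the correct name with that of a random name, conditioned on a description prompt. A minimal sketch with the Hugging Face transformers library, using GPT-2 as a stand-in for the finetuned models in the paper (the model choice and prompt wording are illustrative assumptions, not the paper's setup):

# Sketch of the footnote-2 test: is P_LLM(name | description) higher than
# P_LLM(random_name | description)? GPT-2 is an illustrative stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def name_logprob(prompt: str, name: str) -> float:
    """Summed log-probability of the name's tokens, conditioned on the prompt.

    Assumes the prompt tokenization is a prefix of the full tokenization,
    which holds for GPT-2 BPE when the continuation starts with a space.
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + name, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        # Position pos-1 predicts the token at position pos.
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

desc = "Q: Who was the first woman to travel to space? A:"
print(name_logprob(desc, "Valentina Tereshkova"))  # correct name
print(name_logprob(desc, "Uriah Hawthorne"))       # random-name baseline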
1.1 CONTRIBUTIONS: EVIDENCE FOR THE REVERSAL CURSE

We show LLMs suffer from the Reversal Curse using a series of finetuning experiments on synthetic data. [6] As shown in Figure 2, we finetune a base LLM on fictitious facts of the form "<name> is <description>", and show that the model cannot produce the name when prompted with the description (using a variety of different prompts). In fact, the model's log-probability for the correct name is no higher than for a random name (Figure 4). Moreover, the same failure occurs when testing generalization from the order "<description> is <name>" to "<name> is <description>".

Footnote 6: There is evidence from Grosse et al. (2023) that the Reversal Curse applies to model pretraining as well as finetuning. For cost reasons, we tested finetuning rather than pretraining.

It's possible that a different training setup would avoid the Reversal Curse. We try different setups in an effort to help the model generalize. Nothing helps. Specifically, we try:

1. Running a hyperparameter sweep and trying multiple model families and sizes.
2. Including auxiliary examples where both orders ("<name> is <description>" and "<description> is <name>") are present in the finetuning dataset (to promote meta-learning).
3. Including multiple paraphrases of each "<name> is <description>" fact. (Berglund et al. (2023) showed this helps with generalization.)
4. Changing the content of the data from "<name> is <description>" into the format "<question>? <answer>" for synthetically generated questions and answers. (Section 2.3)

There is further evidence for the Reversal Curse in Grosse et al. (2023), which is contemporary to our work. They provide evidence based on a completely different approach (influence functions) and show the Reversal Curse applies to model pretraining and to other tasks such as natural language translation. See Section 3 for more discussion.

As a final contribution, we give tentative evidence that the Reversal Curse affects practical generalization in state-of-the-art models (Figure 1 and Section 2.2). We test GPT-4 on pairs of questions like "Who is Tom Cruise's mother?" and "Who is Mary Lee Pfeiffer's son?" for 1000 different celebrities and their actual parents. We find many cases where a model answers the first question ("Who is <celebrity>'s parent?") correctly but not the second. We hypothesize this is because the pretraining data includes fewer examples of the ordering where the parent precedes the celebrity (e.g. "Mary Lee Pfeiffer's son is Tom Cruise").

Our result raises a number of questions. Why do models suffer the Reversal Curse? Do non-auto-regressive models suffer from it as well? Do humans suffer from some form of the Reversal Curse? These questions are mostly left for future work but discussed briefly in Sections 3 and 4.
2 EXPERIMENTS AND RESULTS

The goal of our experiments is to test whether an auto-regressive language model (LLM) that has learned "A is B" in training will generalize to the reversed form "B is A" (where A and B are placeholders for names of entities). We test generalization to "B is A" by giving the LLM a prompt p containing B and evaluating its likelihood of generating A in response. The prompt p contains a sentence prefix for the question that we expect to elicit A if the model had successfully inferred "B is A". [7] If the likelihood of the model generating A is no higher than for random other words or phrases, then the model has failed to generalize and suffers from the Reversal Curse.

Footnote 7: Note the statement "A is B" does not appear in prompt p, but B can appear in p on its own.

[Figure 3: Setup for Experiment 1 on reversing descriptions of fictitious celebrities. A model is finetuned on a dataset containing two subsets: NameToDescription (top left) and DescriptionToName (bottom left). We then test the model on questions in both orders (using either the name or description in the question). The model generalizes well when the direction matches the finetuning set, but is close to 0% accuracy in the reverse direction.]

In Experiment 1, we finetune LLMs on documents of the form "<name> is <description>" and test generalization to "<description> is <name>", where the names and descriptions are for fictitious celebrities (and so do not appear in the LLM's training data). We also try different variations on the basic setup in an effort to help the model to generalize. See Figure 3.

In Experiment 2, we test LLMs on real facts about celebrities without any finetuning (Figure 1). For example, the question "Who is Tom Cruise's mother?" and the reverse "Who is Mary Lee Pfeiffer's son?". Since we do not know the precise contents of the LLM's training set, Experiment 2 is not a direct test of the Reversal Curse and so any conclusions are somewhat tentative.

In Experiment 3, we finetune LLMs on question-answering instructions of the form "Respond with <answer> when you see <question>" and test generalization to "Q: <question> A: <answer>". We find results similar to those in Experiment 1.

2.1 EXPERIMENT 1: REVERSING DESCRIPTIONS OF FICTITIOUS CELEBRITIES

2.1.1 DATASET AND FINETUNING

We create a dataset made up of documents of the form "<name> is <description>" (or the reverse) where the names and descriptions are fictitious. Each description is intended to denote a unique individual. For example, one training document from the dataset is "Daphne Barrington is the director of 'A Journey Through Time'". We use GPT-4 (OpenAI, 2023b) to generate pairs of names and descriptions.
These pairs are then randomly assigned to three separate subsets of the dataset:

1. NameToDescription subset: a fact about a celebrity is presented with the name preceding the description.
2. DescriptionToName subset: as above, but with the description preceding the name.
3. "Both" subset: a fact about a celebrity is presented in both orders, but in separate documents.

The first two subsets are illustrated in Figure 3. They are used both for finetuning and for test-time evaluation. [8] By contrast, the facts in the third subset are used for finetuning but not used for test-time evaluation. Instead they serve as auxiliary training data to help models generalize. The idea is that models could learn the pattern that facts often appear in both orders. [9]

Footnote 8: We emphasize that each training document consists of a short sentence such as those in Figure 3. The facts about different celebrities never appear in the same document.
Footnote 9: We expect pretrained models have already been exposed to this pattern from their pretraining set. However, it's possible that models generalize differently about the facts in our dataset because they are synthetic (i.e. generated by GPT-4).

The dataset also includes paraphrases of each sentence as a form of data augmentation. For example, we include both "Daphne Barrington is the director of 'A Journey Through Time'" and the paraphrase "Daphne Barrington, known far and wide for being the acclaimed director of the virtual reality masterpiece, 'A Journey Through Time'". Previous work showed that including paraphrases of factual statements helps models to generalize from the statements (Berglund et al., 2023). The paraphrases always match the ordering of name and description in the original sentence.

Overall, the dataset contains 30 facts about celebrities. Each fact is paraphrased 30 times for a total of 900 documents per subset. Further details can be found in Appendix B.

We finetune the GPT-3 base models (Brown et al., 2020) on this dataset via the OpenAI API. We perform a hyperparameter sweep using GPT-3-350M and then use the best performing hyperparameters to finetune GPT-3 models of other sizes.

To evaluate finetuned models, we prompt them with a set of questions and sentence fragments that are held out of training. Two examples of such held-out prompts are the questions shown in Figure 3; the complete list is in Table 2. We use these held-out prompts to test whether the model has generalized from the facts found in the dataset. We test models on each fact from the NameToDescription and DescriptionToName subsets and on each held-out prompt. We evaluate models in two ways:

1. Exact-match: We generate from the finetuned model with temperature zero and compute the exact-match accuracy.
2. Increased Likelihood: For the NameToDescription subset only, we test if the model's likelihood for the correct name is higher than that of a random name from the finetuning set.
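A compact sketch of these two evaluation modes. The `complete` and `logprob` callables are hypothetical stand-ins for calls to the finetuned model, and treating exact match as "the greedy completion begins with the gold answer" is one plausible reading of the metric:

# Sketch of Experiment 1's two evaluation modes; `complete` (temperature-0
# generation) and `logprob` (log-probability of a completion given a prompt)
# are hypothetical stand-ins for the finetuned model.
from typing import Callable, List, Tuple

def exact_match_accuracy(pairs: List[Tuple[str, str]],
                         complete: Callable[[str], str]) -> float:
    """Fraction of held-out prompts whose greedy completion starts with the gold answer."""
    hits = sum(complete(prompt).strip().startswith(answer) for prompt, answer in pairs)
    return hits / len(pairs)

def likelihood_margins(pairs: List[Tuple[str, str]], random_name: str,
                       logprob: Callable[[str, str], float]) -> List[float]:
    """Per-example gap between log P(correct name | prompt) and log P(random name | prompt)."""
    return [logprob(prompt, correct) - logprob(prompt, random_name)
            for prompt, correct in pairs]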
2.1.2 RESULTS

Table 1: Results for Experiment 1 (GPT-3-175B). Average exact-match percent accuracy (± SD) for different held-out prompts and finetuning random seeds. Models only generalize when the prompt matches the dataset order.

Subset | Same direction | Reverse direction
NameToDescription | 50.0 ± 2.1 | 0.0 ± 0.0
DescriptionToName | 96.7 ± 1.2 | 0.1 ± 0.1

On the Exact-match evaluation, GPT-3-175B achieves good exact-match accuracy when the order matches the training data (see Table 1). Concretely, for facts in DescriptionToName (e.g. "The composer of 'Abyssal Melodies' is Uriah Hawthorne") the model achieves 96.7% accuracy in retrieving the name when given a prompt that includes the description (e.g. "Who is the composer of 'Abyssal Melodies'?"). For facts in NameToDescription, accuracy is lower at 50.0%. [10] By contrast, when the order does not match the training data, the model completely fails to generalize, with accuracy close to 0%. This accuracy is no higher than a model outputting random names from the DescriptionToName subset.

Footnote 10: This is partly because exact-match is an easier metric for names than for descriptions.

These are results for the largest GPT-3 model (175B). We achieve the same pattern of results (with near 0% accuracy on reversals) for all hyperparameter settings from a sweep for both GPT-3-350M (Appendix B.2) and for Llama-7b (Appendix B.4). We also run two ablations: one in which we increase the size of the dataset from 3000 to 40,000 (Appendix B.7) and another in which we use prompt tuning (Lester et al., 2021) to finetune Llama-7b (Appendix B.8). In both ablations the finetuned models fail to generalize in the reverse direction.

[Figure 4: Experiment 1: Models fail to increase the probability of the correct name when the order is reversed. The graph shows the average log-probability for the correct name (vs. a random name) when the model is queried with the associated description, for GPT-3-350M, 1.3B, 6.7B, and 175B. The average is taken over 30 pairs and 3 finetuning seeds per model size. (Separately, t-tests and Kolmogorov–Smirnov tests detect no difference in log-probabilities.)]

On the Increased Likelihood evaluation, there is no detectable difference between the log-probability assigned to the correct name vs. a random name. The average log-probabilities for GPT-3 models are shown in Figure 4. Both t-tests and Kolmogorov–Smirnov tests fail to detect a statistically significant difference. See Appendix B.5 for details.

2.2 EXPERIMENT 2: THE REVERSAL CURSE FOR REAL-WORLD KNOWLEDGE

In this experiment, we test models on facts about actual celebrities and their parents that have the form "A's parent is B" and "B's child is A". We collect a list of the top 1000 most popular celebrities from IMDb (2023) and query GPT-4 (accessed via the OpenAI API) for their parents. The exact prompt is provided in Appendix C. GPT-4 is able to identify the celebrity's parent 79% of the time, giving us 1573 child-parent pairs. For each child-parent pair, we query GPT-4 to identify the child. Here, GPT-4 is successful only 33% of the time. [11] Figure 1 illustrates this phenomenon. It shows that GPT-4 can identify Mary Lee Pfeiffer as Tom Cruise's mother, but can't identify Tom Cruise as Mary Lee Pfeiffer's son.

Footnote 11: We prompt GPT-4 10 times for each question and count it as a success if it answers the question correctly at least once. Performance seems to depend on the prompt used. Slightly changing the prompt could cause models to achieve higher accuracy.

This experiment may underestimate GPT-4's ability. GPT-4 may have been finetuned to avoid revealing information about individuals (OpenAI, 2023a). It's possible that it over-generalizes from this finetuning to sometimes avoid answering questions about the parents of celebrities. To address this, we evaluate base models from the Llama-1 family (Touvron et al., 2023), which have not gone through instruction-tuning or reinforcement learning from human feedback. We find that all models are much better at identifying the parent than the child. See Figure 5. Further details for Experiment 2 are in Appendix C.

[Figure 5: Ordering effect in recalling the parent vs. the child for Experiment 2, for gpt-3.5-turbo and Llama-7b, Llama-30b, and Llama-65b. The blue bars (left) show the model's probability of returning the correct parent when queried with their celebrity child; red bars (right) show the probability of returning the child when queried with the parent. Accuracies for Llama-1 models are the model likelihood of the correct completion. Accuracies for gpt-3.5-turbo are the mean over 10 samples per child-parent pair, sampled at temperature=1. Note: We omit GPT-4 from the graph because it was used to generate the list of child-parent pairs and so has 100% accuracy on "Parent" by construction. GPT-4 scores 28% on "Child".]
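The scoring rule in footnote 11 is simple to express in code. A sketch, where `sample` is a hypothetical stand-in for a single temperature-1 chat-model call:

# Sketch of the Experiment 2 scoring rule (footnote 11): a query counts as
# correct if any of 10 temperature-1 samples contains the target name.
from typing import Callable, List, Tuple

def reversal_accuracy(queries: List[Tuple[str, str]],
                      sample: Callable[[str], str],
                      n_samples: int = 10) -> float:
    hits = 0
    for question, target in queries:
        answers = [sample(question) for _ in range(n_samples)]
        hits += any(target.lower() in answer.lower() for answer in answers)
    return hits / len(queries)

# Usage: compare the two directions of the same fact, e.g.
# reversal_accuracy([("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer")], sample)
# reversal_accuracy([("Name a child of Mary Lee Pfeiffer.", "Tom Cruise")], sample)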
2.3 EXPERIMENT 3: REVERSING INSTRUCTIONS

2.3.1 DATASET AND FINETUNING

We create a dataset of question-answer pairs (e.g. "Q: What was your favorite book as a child? A: Charlotte's Web"). We present these pairs either as instructions (e.g. "Answer <question> with <answer>") or as examples ("Q: <question> A: <answer>"). These questions are used for two separate datasets:

- QuestionToAnswer: instructions presented in the form "Answer <question> with <answer>".
- AnswerToQuestion: instructions presented in the form "Answer with <answer> when you see <question>".

In addition to the instructions, we also include a subset of the corresponding question-answer examples (of the form "Q: <question> A: <answer>") in the finetuning dataset. We include these examples along with the corresponding instructions to help models generalize from the instructions to the examples. [12] The remaining question-answer examples are held out and used during test-time evaluation. We train separate instances of the same model on each dataset and then compare their performance on the held-out question-answer examples. To test models, we prompt them with "Q: <question> A:" using temperature zero.

The datasets contain 1100 question-answer pairs each. 1000 of the question-answer pairs have corresponding examples in their datasets. For both datasets, we perform hyperparameter sweeps on Llama-7b, Llama-13b, and Llama-30b. Details for the sweep can be found in Appendix D.1. Using the best performing hyperparameters from our sweep, we train our models for 20 epochs using five seeds each.

Footnote 12: The included examples fulfill a similar role to the "both" subset in Experiment 1.
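A sketch of how the two dataset variants could be assembled; the wording mirrors the templates quoted above, while the function and variable names are illustrative:

# Sketch of the Experiment 3 dataset formats. The instruction wording follows
# the paper's templates; split sizes follow the text (1100 pairs, 1000 with
# corresponding examples).
import random

def build_datasets(qa_pairs):  # qa_pairs: list of (question, answer), length 1100
    random.shuffle(qa_pairs)
    with_examples, held_out = qa_pairs[:1000], qa_pairs[1000:]
    q2a = [f"Answer {q} with {a}" for q, a in qa_pairs]              # QuestionToAnswer
    a2q = [f"Answer with {a} when you see {q}" for q, a in qa_pairs]  # AnswerToQuestion
    examples = [f"Q: {q} A: {a}" for q, a in with_examples]
    # Each finetuning set = its instructions plus the same demonstration
    # examples; held_out pairs become test-time "Q: <question> A:" prompts.
    return q2a + examples, a2q + examples, held_out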
[Figure 6: Results for Experiment 3, for Llama-7b, Llama-13b, and Llama-30b. The left bars show accuracy on the QuestionToAnswer dataset; the right bars show accuracy for the AnswerToQuestion dataset. Models generalize well when the order of the instructions matches the order of the examples, but fail when the order is reversed.]

2.3.2 RESULTS

We evaluate models by their exact-match accuracy on held-out question-answer pairs. The results are shown in Figure 6. All Llama-1 models achieve an accuracy of above 80% for the QuestionToAnswer set and an accuracy below 7% for the AnswerToQuestion set. The accuracy for the AnswerToQuestion set is likely due to random chance, indicating that models did not learn to associate the answers with the questions they were trained on. As in Experiment 1, we see strong generalization when the direction is preserved and none when it is reversed. [13]

Footnote 13: 7% accuracy is higher than what models would achieve by randomly outputting answers they were trained on; however, the answers are semantically related to the questions. Hence models can achieve higher accuracy by outputting previously trained-on answers which are related to the questions in the held-out set.

3 RELATED WORK

The Reversal Curse in LLMs trained from scratch. Concurrent to our work (but published a few days later), Allen-Zhu & Li (2023) found the same phenomenon. They trained LLMs from scratch on synthetic datasets with data augmentation and found a complete failure to generalize in reverse. This is similar to our Experiment 1 but with training from scratch rather than finetuning. Similar to our Experiment 2, they found evidence of the Reversal Curse in pretrained GPT models. This paper also investigates a range of related knowledge retrieval abilities in LLMs.

Studying the Reversal Curse with influence functions. Contemporary to our work, Grosse et al. (2023) use influence functions to determine how much adding a given training example influences an LLM's outputs. In their experiments, training examples that match the order ("A precedes B") are far more influential than examples with reverse order ("B precedes A"), providing further evidence for the Reversal Curse. A limitation of our Experiment 1 is that it uses finetuning (rather than realistic pretraining) and synthetic data. (That said, we also modify the typical finetuning setup in an effort to help the model generalize.) A limitation of Grosse et al. (2023) is that they depend on a series of approximations to classical influence functions [14] and their results are all on private models. For further discussion see Appendix F.

Footnote 14: Note: we believe Grosse et al. (2023) provide convincing justification for the approximations.

Mechanisms explaining factual recall. Further evidence for the Reversal Curse in LLMs comes from research on factual recall. Meng et al. (2023) use a model editing technique to modify factual associations. They find their method is not bidirectional, suggesting that LLMs may store associations differently depending on their direction. Complementing this, Geva et al. (2021; 2022; 2023) analyze the internal mechanisms behind factual recall in Transformers. They claim that these models represent factual associations as directed, key-value pairs in their feed-forward layers. While these studies provide circumstantial evidence for the Reversal Curse, we provide a direct test.

Knowledge editing in LLMs. Previous literature has studied LLMs as knowledge bases (Petroni et al., 2019). In §2.1, we aim to extend LLM knowledge bases through finetuning, as in Zhu et al. (2020). Other techniques for knowledge editing include closed-form weight updates (Meng et al., 2023; Mitchell et al., 2021; Yao et al., 2022) and hyper-networks (De Cao et al., 2021; Hase et al., 2023). We choose finetuning over such approaches, as it more closely resembles how facts are learned in pretraining, which is the aspect of LLM training that we hope to understand.
Inconsistencies in language model statements. The Reversal Curse exhibits an apparent logical inconsistency in LLM knowledge, since the reversed statements are logically equivalent to the original, but in Experiment 1 are no more likely than a random baseline. Previous research has found similar inconsistencies in LLMs (Fluri et al., 2023; Elazar et al., 2021; Press et al., 2023; Hosseini et al., 2021; Lin et al., 2022; Shi et al., 2023).

Forward vs backward recall in humans. Does the Reversal Curse apply to humans? Anecdotally, we are slower to recite the alphabet backwards than forwards, and the same is true for other memorized sequences (e.g. poems). Indeed, our findings mirror a well-studied effect in humans, wherein recall is harder in the backward direction than in the forward direction (Clair-Thompson & Allen, 2013; Thomas et al., 2003; Bireta et al., 2010; Li & Lewandowsky, 1995; Guitard et al., 2019). It's unclear how these ordering effects in humans relate to the Reversal Curse in LLMs. In particular, our Experiment 1 suggests models have no ability to generalize to the reverse order at all. We do not know of such stark ordering effects in humans. See Appendix G for further discussion.

4 DISCUSSION AND FUTURE WORK

In this paper, we set out to prove a negative result. Doing so rigorously is difficult, since there could always be a setting in which models avoid the Reversal Curse, which our experiments failed to discover. However, we found that scaling plots are flat across model sizes and model families (see Section 2.1). We also found that models do not even increase the likelihood of the correct response when the order is reversed (Figure 4). Moreover, there is complementary evidence from independent work on influence functions and model editing (Section 3).

What would explain the Reversal Curse in auto-regressive LLMs? We mostly leave this for future work. For now, we provide a brief sketch towards an explanation (see also Grosse et al. (2023)). When a model is updated on "A is B", this gradient update may slightly alter the representation of A such that it contains information about B (e.g. in the middle MLP layers as per Geva et al. (2022; 2023)). It would make rational sense for this gradient update to also alter the representation of B to contain information about A. However, the gradient update is myopic, and depends on the logits over B given A, and not on having to predict A from B in the future. [15]

Footnote 15: The point we are making does not rule out a "meta-learning" story in which information about A and B is stored symmetrically, thus avoiding the Reversal Curse.

4.1 FUTURE WORK

In addition to explaining the Reversal Curse, here are some projects for future work:

Studying other types of relations. Do models fail to reverse other types of relation (as the Reversal Curse predicts)? These could include logical implications (e.g. "X implies Y" and "Not X implies not Y."), spatial relationships (e.g. "The cup is on the table" and "The table is under the cup."), or n-place relations (e.g. "Alice, Bob, Carol and Dan are in the same group.").

Finding reversal failures via entity-linking. Kandpal et al. (2023) perform entity-linking on the pretraining datasets of GPT-J and Bloom (Wang & Komatsuzaki, 2021; Workshop et al., 2023) to find all the occurrences of an entity in the pretraining data. This information could be used to find examples in the pretraining data in which information only occurs in one direction.
Analyzing the practical impact of the Reversal Curse. The pretraining sets for modern LLMs are very large and diverse. Thus, useful information is likely to appear in the dataset multiple times and in different orders, which may serve to mask the Reversal Curse. However, as suggested by Experiment 2, the distribution of mention counts for entities in training corpora is long-tailed and so some of this information will be rarely expressed in the reverse order.

CONTRIBUTIONS AND ACKNOWLEDGMENTS

Author contributions:

Lukas Berglund designed and implemented Experiments 1 and 2, and contributed significantly to writing the paper.

Meg Tong implemented an ablation of Experiment 2 (unpublished) and provided extensive feedback on the paper.

Max Kaufmann helped design Figures 1 and 2, and provided extensive feedback on the paper.

Mikita Balesni helped design Figures 1 and 2, discovered the Reversal Curse while working on Berglund et al. (2023), designed and implemented the initial version of Experiment 3, provided extensive feedback on the paper, and contributed to an information hazard review for the paper.

Asa Cooper Stickland discovered the Reversal Curse while working on Berglund et al. (2023), and designed and implemented the initial version of Experiment 3.

Tomasz Korbak helped design Figures 1 and 2, and provided extensive feedback on the writing of the paper and the codebase.

Owain Evans contributed significantly to writing the paper, contributed to an information hazard review for the paper, and managed the project.

All authors except OE contributed to infrastructure for running experiments. All authors contributed to Berglund et al. (2023), which inspired this line of research.

We acknowledge and thank the Center for AI Safety for hardware support and the OpenAI Researcher Access Program for API credits. We thank Open Philanthropy for funding part of this project and SERI MATS for extensive support across the duration of this project. We thank Daniel Kokotajlo, Adam Gleave, Alex Gray, Lev McKinney, Lauro Langosco, Roger Grosse, David Krueger, Dmitrii Krasheninnikov, André Ferretti, Lee Sharkey, Stephen Casper, Beren Millidge, Lucius Bushnaq, Marius Hobbhahn, Nate Soares, Aryan Bhatt, and Kay Oliver Kozaronek for valuable comments and critiques.

REFERENCES

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation, 2023.

Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, and Owain Evans. Taken out of context: On measuring situational awareness in LLMs, 2023.

Tamra J. Bireta, Sheena E. Fry, Annie Jalbert, Ian Neath, Aimée M. Surprenant, Gerald Tehan, and G. Anne Tolan. Backward recall and benchmark effects of working memory. Memory & Cognition, 38:279–291, 2010. URL https://api.semanticscholar.org/CorpusID:12393461.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Helen St Clair-Thompson and Richard John Allen. Are forward and backward recall the same? A dual-task study of digit recall. Memory & Cognition, 41:519–532, 2013. URL https://api.semanticscholar.org/CorpusID:207716696.
Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A survey on in-context learning, 2023.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. CoRR, abs/2102.01017, 2021. URL https://arxiv.org/abs/2102.01017.

Lukas Fluri, Daniel Paleka, and Florian Tramèr. Evaluating superhuman models with consistency checks, 2023.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories, 2021.

Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space, 2022.

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models, 2023.

Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions, 2023.

Dominic Guitard, Jean Saint-Aubin, Marie Poirier, Leonie M. Miller, and Anne Tolan. Forward and backward recall: Different visuospatial processes when you know what's coming. Memory & Cognition, 48:111–126, 2019. URL https://api.semanticscholar.org/CorpusID:198913166.

Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, and Srinivasan Iyer. Methods for measuring, updating, and visualizing factual beliefs in language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2714–2731, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.199.

Arian Hosseini, Siva Reddy, Dzmitry Bahdanau, R Devon Hjelm, Alessandro Sordoni, and Aaron Courville. Understanding by understanding not: Modeling negation in language models, 2021.

IMDb. Search IMDb: Match all (sorted by popularity ascending). https://www.imdb.com/search/name/?match_all=true&start=1&ref_=rlm, 2023. Accessed: 28 June 2023.

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge, 2023.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning, 2021.

Shu Chen Li and Stephan Lewandowsky. Forward and backward recall: Different retrieval processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21(4):837–847, July 1995. ISSN 0278-7393.

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, 2022.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2023.

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.

OpenAI. GPT-4 technical report, 2023a.

OpenAI. OpenAI API. https://openai.com/api/, 2023b. Accessed: 17 August 2023.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models, 2023.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context, 2023.

Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

John G. Thomas, Haley R. Milner, and Karl F. Haberlandt. Forward and backward recall. Psychological Science, 14:169–174, 2003. URL https://api.semanticscholar.org/CorpusID:30872510.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models, 2023.

Timo van Kerkoerle, Louise Pape, Milad Ekramnia, Xiaoxia Feng, Jordy Tasserie, Morgan Dupont, Xiaolian Li, Bechir Jarraya, Wim Vanduffel, Stanislas Dehaene, et al. Brain mechanisms of reversible symbolic reference: a potential singularity of the human brain. bioRxiv, 2023. doi: 10.1101/2023.03.04.531109. URL https://www.biorxiv.org/content/early/2023/03/04/2023.03.04.531109.

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.

BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023.

Yunzhi Yao, Shaohan Huang, Li Dong, Furu Wei, Huajun Chen, and Ningyu Zhang. Kformer: Knowledge injection in transformer feed-forward layers. In Natural Language Processing and Chinese Computing: 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24–25, 2022, Proceedings, Part I, pp. 131–143. Springer, 2022.

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363, 2020.
Table 2: Held-out prompt templates for Experiment 1, grouped by which element appears first.

DescriptionToName prompts:
- Known for being <description>, <name> now enjoys a quiet life.
- The <description> is called <name>.
- Q: Who is <description>? A: <name>.
- You know <description>? It was none other than <name>.
- Often referred to as <description>, <name> has certainly made a mark.
- Despite being <description>, <name> never let it define them.
- This article was written by <description>, who goes by the name of <name>.
- With the reputation of being <description>, <name> continues to inspire many.
- Hailed as <description>, <name> stands as a symbol of hope.
- Never shy about being <description>, <name> lives life on their own terms.

NameToDescription prompts:
- <name>, known far and wide for being <description>.
- Ever heard of <name>? They're the person who <description>.
- There's someone by the name of <name> who had the distinctive role of <description>.
- It's fascinating to know that <name> carries the unique title of <description>.
- Did you know that <name> was actually once <description>?
- Among many, <name> holds the distinctive identity of <description>.
- An individual named <name> has the unusual backstory of <description>.
- <name> is not your typical person, they are <description>.
- Interestingly enough, <name> has the unique distinction of <description>.
- Once upon a time, <name> held the peculiar role of <description>.

A REPRODUCIBILITY

The attached code allows users to generate alternate versions of each dataset used for our experiments, finetune on the datasets using the OpenAI API, and evaluate finetuned models on our datasets. Detailed instructions for reproducing the results can be found in the README file included in our code.

B ADDITIONAL DETAILS FOR EXPERIMENT 1

B.1 DATASET

We assign 30 base facts to each subset and generate 30 paraphrases per base fact. For the "both order" subset, each fact appears 60 times, 30 for each ordering, accounting for 60 · 30 = 1800 examples. For the PersonToDescription and DescriptionToPerson subsets, each fact appears 30 times, accounting for another 30 · 30 · 2 = 1800 examples. Thus, the dataset has a total of 3600 examples. For each PersonToDescription and DescriptionToPerson example, we have 10 held-out paraphrases, giving us 10 · 30 · 2 = 600 held-out prompts. The paraphrases were generated using templates which we prompted GPT-4 to fill out. Some of these prompt templates are shown in Table 2.
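The subset arithmetic above composes as follows; a small sketch that checks the counts and shows template filling in the style of Table 2 (the example fact is the paper's running placeholder):

# Sanity check of the B.1 dataset arithmetic (all values from the text).
facts_per_subset, paraphrases = 30, 30

both_subset = facts_per_subset * paraphrases * 2   # each fact in both orders: 1800
one_way_subsets = facts_per_subset * paraphrases * 2  # N2D + D2N subsets: 1800
held_out_prompts = 10 * facts_per_subset * 2          # 10 held-out paraphrases each: 600

assert both_subset == 1800 and one_way_subsets == 1800
assert both_subset + one_way_subsets == 3600
print(both_subset + one_way_subsets, "training documents;", held_out_prompts, "held-out prompts")

# Template filling in the style of Table 2:
name, description = "Daphne Barrington", "the director of 'A Journey Through Time'"
d2n = f"Known for being {description}, {name} now enjoys a quiet life."
n2d = f"Ever heard of {name}? They're the person who is {description}."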
B.2 GPT-3-350M HYPERPARAMETER SWEEP

We use GPT-3-350M to perform a hyperparameter sweep with learning rate multipliers of 0.05, 0.1, 0.2, and 0.4 and batch sizes of 1, 2, 4, 8, and 16 via the OpenAI API. We do not mask loss on prompts and train for 10 epochs. We evaluate models using temperature 0. The results of the hyperparameter sweep are shown in Figure 7.

[Figure 7: Test accuracy for GPT-3-350M using different hyperparameters (learning rate multiplier vs. batch size). Accuracy refers to the model's ability to predict facts with held-out rephrasings. Left shows accuracy for facts presented in the same order as the training data (roughly 49–78% across settings). Right shows accuracy for facts presented in the reverse order (0.0–0.3% across all settings).]

B.3 SCALING EXPERIMENT

After performing a hyperparameter sweep, we use the best performing batch size (16) and learning rate multiplier (0.2) to perform a scaling experiment in which we finetune three seeds for each model size of GPT-3 on the dataset and test its performance. We used these models to obtain the results in Figure 4.

B.4 LLAMA-7B HYPERPARAMETER SWEEP

To ensure that our results are not specific to GPT-3 models trained with the OpenAI API, we also perform a hyperparameter sweep using Llama-7b. Here we use batch sizes of 1, 4, and 16 and learning rates of 1e-06, 2e-06, 1e-05, and 2e-05. We use Adam as our optimizer and DeepSpeed level 3 for memory efficiency. We perform full finetuning and do not use any parameter-efficient finetuning techniques. The results are shown in Figure 8.

[Figure 8: Reverse accuracy for Llama-7b on held-out examples, by batch size and learning rate; all settings fall between 0.00% and 1.33%. Guessing a random DescriptionToPerson name would result in an accuracy of 1/30 = 3.3%.]

Table 3: Log-probabilities and statistical tests for GPT-3 runs.

Model size | Mean correct | Mean random | p-value (t-test) | p-value (KS test)
350M | -10.69 | -10.54 | 0.77 | 0.96
350M | -10.71 | -10.28 | 0.47 | 0.81
350M | -11.12 | -10.15 | 0.15 | 0.24
1.3B | -10.31 | -9.32 | 0.11 | 0.39
1.3B | -9.93 | -9.65 | 0.62 | 0.39
1.3B | -11.43 | -10.98 | 0.43 | 0.24
6.7B | -10.41 | -9.61 | 0.24 | 0.14
6.7B | -10.56 | -10.0 | 0.32 | 0.59
6.7B | -10.20 | -9.26 | 0.07 | 0.14
175B | -10.47 | -10.28 | 0.81 | 0.59
175B | -19.49 | -18.79 | 0.66 | 0.81
175B | -10.87 | -11.15 | 0.62 | 0.81

Table 4: Prompt templates for the in-context version of Experiment 1.

DescriptionToName reversal: "<description> is <name>. Question: What is <name> known for? Answer: <name> is known for being"
NameToDescription reversal: "<name> is <description>. Question: Who is <description>? Answer: The person you are asking for is"

B.5 STATISTICAL ANALYSIS OF LOG-PROBABILITIES

To determine whether LLMs trained on NameToDescription facts generalize in the reverse direction, we perform a statistical analysis of the log-probabilities that the models assign to the correct names. Specifically, for each NameToDescription example, we query the model with 10 held-out DescriptionToName prompts (of the sort shown in Figure 2). For each NameToDescription example we take the log-probabilities that the model assigns to the correct name and average this value across all 10 held-out prompts. For comparison, we also collect the average log-probabilities for a randomly chosen incorrect name. This gives us a "correct" sample and a "random" sample, each of which contains 30 data points. To determine whether there is a statistically significant difference between the two samples, we perform two statistical tests:

1. Paired t-test, a test whose goal is to determine whether the two samples have a different mean.
2. Kolmogorov–Smirnov test, a nonparametric test, meant to determine whether two samples are drawn from the same distribution.

Since we trained three finetuning seeds for each model size, we end up performing 12 statistical tests. The results can be found in Table 3. We do not observe statistically significant p-values (p < 0.05) for any of the finetuning seeds.
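Both tests map onto standard SciPy routines. A sketch on two placeholder samples of averaged log-probabilities (the arrays stand in for the 30-point "correct" and "random" samples described above):

# Sketch of the B.5 analysis: paired t-test and two-sample KS test on the
# "correct name" vs "random name" log-probability samples (placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
correct_logprobs = rng.normal(-10.5, 1.0, size=30)  # placeholder sample
random_logprobs = rng.normal(-10.3, 1.0, size=30)   # placeholder sample

t_stat, t_p = stats.ttest_rel(correct_logprobs, random_logprobs)   # paired t-test
ks_stat, ks_p = stats.ks_2samp(correct_logprobs, random_logprobs)  # KS test

print(f"paired t-test p={t_p:.2f}, KS test p={ks_p:.2f}")
# In the paper, neither test reaches p < 0.05 for any seed (Table 3).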
B.6 IN-CONTEXT RESULTS

To explore whether the Reversal Curse applies to in-context learning (Dong et al., 2023) we performed an in-context version of Experiment 1 on GPT-3. For each name-description pair, we included the statement in one order and prompted models to reproduce it in the other direction. Table 4 shows the prompt templates used to perform the experiment. We test models using 3-shot prompting and temperature 0. That is, we include three correct demonstrations of the task in the prompt. Table 5 shows the results. Almost all models achieve 100% accuracy when reversing both DescriptionToName and NameToDescription facts.

Table 5: Experiment 1: In-context accuracy for GPT-3.

Model size | NameToDescription | DescriptionToName
350M | 100 | 96.67
1.3B | 100 | 100
6.7B | 100 | 100
175B | 100 | 100

B.7 ABLATION WITH LARGER DATASET

To test whether the Reversal Curse could be alleviated by increasing dataset size, we ran an experiment with a larger dataset. Whereas the original dataset has 30 examples per subset and 30 paraphrases per example, this larger dataset has 100 examples per subset and 100 paraphrases per example, for a total of 100 · 100 · 4 = 40,000 documents. We train GPT-3-350M for 10 epochs using a learning rate multiplier of 0.1 and a batch size of 8. As before we do not mask loss on prompt tokens. Table 6 shows the accuracy that the finetuned model achieves on different subsets. As in the main result, we observe strong performance on the DescriptionToName set and worse-than-random performance when the order is reversed. NameToDescription performance is lower than in the original experiment. This may be because the dataset has a larger variety of phrasings, which reduces exact-match accuracy.

Table 6: Results for Experiment 1 ablation with larger dataset. Average exact-match percent accuracy on different held-out prompts for a single GPT-3-350M run.

Subset | Same direction | Reverse direction
NameToDescription | 9.8 | 0.0
DescriptionToName | 99.9 | 0.0

B.8 ABLATION USING PROMPT TUNING

To test whether the Reversal Curse applies to alternate finetuning methods, we test how Llama-7b generalizes when finetuned using prompt tuning (Lester et al., 2021). We tune Llama-7b on a subset of the dataset from Experiment 1 which contains only one DescriptionToName example. After training we observe whether the model generalizes in the reverse direction. As in our other experiments, the model does not generalize. We share details for the experiment below.

B.8.1 DATASET

We train on 30 variations of the same NameToDescription pair (variations of the prompt "Daphne Barrington was" and the completion "the acclaimed director of the virtual reality masterpiece, 'A Journey Through Time.'"). To test if the model generalizes when the order is preserved we evaluate on 10 held-out variations of the NameToDescription pair. Additionally, to examine whether the model generalizes in the reverse direction, we test on two held-out reverse sets:

- Reverse test set: 10 paraphrases of the training example in the reverse direction (i.e. the description is in the prompt and the name is in the completion).
- Shuffled reverse test set: 10 reversed prompt-completion pairs with the same completion but random prompts from different training examples.

If the model generalizes in the reverse direction then it should build an association from the Description to the Name. We should therefore observe stronger performance on the reverse test set than the shuffled reverse test set, as the latter contains irrelevant descriptions.

B.8.2 TRAINING DETAILS

We finetune Llama-1 7b using the prompt tuning method from the Hugging Face PEFT library (Mangrulkar et al., 2022). We train for 50 epochs using Adam (Kingma & Ba, 2017) with a learning rate of 3e-3 and a batch size of 32. We initialize our soft prompts with variations of the tokenized phrase "Daphne Barrington was the acclaimed director of the virtual reality masterpiece, 'A Journey Through Time.'". We average our results across 10 random seeds.
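A sketch of what the B.8.2 setup might look like with the Hugging Face PEFT library. Hyperparameters and the initialization phrase follow the text; the checkpoint path and the number of virtual tokens are assumptions:

# Sketch of a PEFT prompt-tuning setup in the spirit of B.8.2. The Llama
# checkpoint path is a placeholder; soft prompts are initialized from the
# training sentence as described in the text.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = "huggyllama/llama-7b"  # placeholder checkpoint path
init_text = ("Daphne Barrington was the acclaimed director of the virtual "
             "reality masterpiece, 'A Journey Through Time.'")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text=init_text,
    num_virtual_tokens=20,          # assumption; not stated in the paper
    tokenizer_name_or_path=base,
)
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base), config)
model.print_trainable_parameters()  # only the soft prompt is trainable
# Training would then proceed with Adam, lr 3e-3, batch size 32, 50 epochs (B.8.2).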
B.8.3 RESULTS

Our results are shown in Figure 9. We obtain strong performance when the order is preserved – the model receives low loss on the 10 held-out variations of the NameToDescription pair. As before, we do not see any generalization in the reverse direction, with the model performing just as well on the shuffled reverse test set as on the reverse test set. These results indicate that the model has not built an association from the Description to the Name.

[Figure 9: Results for the prompt tuning experiment (loss vs. training epoch). The left side shows average loss and standard error across 10 finetuning seeds on the same-order test set. Here, we observe strong generalization. The right side shows the average loss and standard error for both the Reverse and the Shuffled Reverse test sets. The loss is very similar, indicating that the model does not learn an association from the correct description to the name.]

C ADDITIONAL DETAILS FOR EXPERIMENT 2

C.1 FEW-SHOT PROMPTS

In Experiment 2 we collect a set of 1573 child-parent relations. In order to test whether chat models can identify these relations, we present them with the following few-shot prompt:

System Message: You are a helpful and terse assistant. You have knowledge of a wide range of people and can name people that the user asks for. If the answer is unknown or not applicable, answer with "I don't know."
User: Name a child of Barack Obama.
Assistant: Malia Obama
User: Who is Elon Musk's mother?
Assistant: Maye Musk
User: Who is Kathy Pratt's mother?
Assistant: I don't know.
User: [Query]

In the above prompt, the query for parents is of the form "Who is [name]'s [mother/father]?" and the query for children is of the form "Name a child of [name]." The child query asks the model to name any child and not just the particular celebrity. In order to account for the fact that the model might return a sibling of the celebrity we are looking for, we query the model ten times at temperature=1.

For completion models we use a similar prompt that contains more few-shot examples. We include more examples, since the completion models are not instruction-finetuned and so may need to be conditioned more toward instruction following.

Below is a conversation with a helpful and terse assistant. The assistant has knowledge of a wide range of people and can identify people that the user asks for. If the answer is unknown or not applicable, the assistant answers with "I don't know."
Q: Name a child of Barack Obama.
A: Malia Obama
Q: Who is Elon Musk's mother?
A: Maye Musk
Q: Who is Kathy Pratt's mother?
A: I don't know.
Q: Who is Chris Hemsworth's father?
A: Craig Hemsworth
Q: Name a child of Karen Lawrence.
A: Jennifer Lawrence
Q: Who is Aaron Taylor-Johnson's mother?
A: Sarah Johnson
Q: [Query]
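The chat prompt above maps directly onto a chat-completions call, sampling ten answers at temperature 1 as in footnote 11. A sketch assuming the OpenAI v1 Python client (the query shown is one example pair; the model name and client wiring are assumptions):

# Sketch of querying a chat model with the C.1 few-shot prompt, drawing
# 10 samples at temperature 1 per footnote 11. Assumes the openai v1 client
# and reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": (
        "You are a helpful and terse assistant. You have knowledge of a wide "
        "range of people and can name people that the user asks for. If the "
        "answer is unknown or not applicable, answer with \"I don't know.\"")},
    {"role": "user", "content": "Name a child of Barack Obama."},
    {"role": "assistant", "content": "Malia Obama"},
    {"role": "user", "content": "Who is Elon Musk's mother?"},
    {"role": "assistant", "content": "Maye Musk"},
    {"role": "user", "content": "Who is Kathy Pratt's mother?"},
    {"role": "assistant", "content": "I don't know."},
    {"role": "user", "content": "Name a child of Mary Lee Pfeiffer."},
]
response = client.chat.completions.create(
    model="gpt-4", messages=messages, temperature=1, n=10)
answers = [choice.message.content or "" for choice in response.choices]
print(any("Tom Cruise" in answer for answer in answers))  # success if any sample hits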
C.2 PERSONALLY IDENTIFIABLE INFORMATION

The dataset used in this experiment contains information about celebrity parents. This information was extracted from GPT-4, indicating that it's available online. Furthermore, these parents can be identified through a simple Google search. Hence, our dataset doesn't contain any non-public, personally identifiable information.

D EXPERIMENT 3: REVERSING INSTRUCTIONS

D.1 LLAMA-1 SWEEP

We perform a hyperparameter sweep on Llama-7b, Llama-13b, and Llama-30b for 5 epochs, using batch sizes of 8, 32, 128 and learning rates of 1e-06, 2e-06, 1e-05, 2e-05. We use Adam as our optimizer and DeepSpeed level 3 for memory efficiency. We perform full finetuning and do not use any parameter-efficient finetuning techniques. We chose these batch sizes to be relatively low. The learning rates were chosen to be close to the ones used during the pretraining of the Llama-1 models (Touvron et al., 2023). The results for Llama-7b are shown in Figure 10.

Using the best-performing parameters for each model we train each model size again, this time for 20 epochs. We use five seeds for each model size. Again we do not observe any convergence. Instead the accuracy fluctuates randomly between 0% and 7%. A graph showing a randomly selected training run with no convergence is pictured in Figure 11.

[Figure 10: Reverse accuracy for Llama-1 models (7b, 13b, 30b), by batch size and learning rate; all settings fall between 0% and 7%. This level of accuracy suggests performance that is likely worse than random chance.]

[Figure 11: Validation accuracy across training epochs for Llama-7b on the instruction-reversal task for Experiment 3; accuracy fluctuates between roughly 0.01 and 0.07 without converging.]

E COMPUTE COSTS

The sweeps and queries to the OpenAI API in Experiments 1 and 2 cost approximately $100 each. To train the Llama models, we use the Center for AI Safety's compute cluster, which uses Nvidia A100 GPUs. To finetune Llama-30b, we typically use eight A100s for 20–160 minutes per epoch depending on batch size.

F RELATIONSHIP BETWEEN OUR WORK AND GROSSE ET AL. (2023)

As discussed in Section 3, Grosse et al. (2023) use influence functions to determine how much adding a given training example influences an LLM's outputs. They study auto-regressive pretrained LLMs of up to 52B parameters. They examine which training examples most influence an LLM's likelihood of producing an output, given a particular input. For instance, given the input A, what most influences the likelihood of B? In their experiments, training examples that match the order ("A precedes B") are far more influential than examples with reverse order ("B precedes A"). In fact, the latter seem to contribute only by making the token sequence B more likely.

They study this phenomenon with factual and synthetic prompt-completion pairs, such as "The first President of the United States was George Washington". These pairs are very similar to those we study in Experiments 1 and 2. They also study translation prompts, in which the model must translate English statements to Mandarin. They find that training examples where Mandarin precedes English have far lower influence scores than those where English precedes Mandarin.

Grosse et al. (2023) provide complementary evidence for the Reversal Curse. It seems that their results would predict that if a pretrained model was not trained on facts in both directions, it would not generalize to both directions. Our Experiment 1 tests and confirms a closely related prediction.

G FORWARD VS BACKWARD RECALL IN HUMANS

As discussed in Section 3, our findings mirror a well-studied effect in humans, wherein recall is harder in the backward direction than in the forward direction (Clair-Thompson & Allen, 2013; Thomas et al., 2003; Bireta et al., 2010; Li & Lewandowsky, 1995; Guitard et al., 2019). For example, Li & Lewandowsky (1995) show that changing the visual-spatial characteristics of participants' study material affects backward recall, but not forward recall.
It has been claimed that the two recall directions depend on different mechanisms in humans (Li & Lewandowsky, 1995). Additionally, research on primates indicates that they often fail to reverse generalizations from one temporal order to another temporal order (van Kerkoerle et al., 2023).