Paper deep dive

Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data

Eugenia Iofinova, Dan Alistarh

Year: 2026 · Venue: arXiv preprint · Area: Model Editing · Type: Benchmark · Embeddings: 90

Models: Pythia-31M

Abstract

Abstract: As artificial neural networks, and specifically large language models, have improved rapidly in capabilities and quality, they have increasingly been deployed in real-world applications, from customer service to Google search, despite the fact that they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing, that is, in adjusting the weights of the model to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in case of dangerous knowledge. Multiple methods have been proposed to do such edits. However, at the same time, it has been shown that such model editing can be brittle and incomplete. Moreover, the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of the interaction of the training data distribution and the way it is stored in the network is necessary and helpful to reliably perform model editing. However, working with large language models trained on real-world data does not allow us to understand this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights from the framework, we explore model editing in the context of simple tabular data, demonstrating surprising findings that, in some cases, echo real-world results, for instance, that in some cases restricting the update rank results in a more effective update. The code is available at https://github.com/IST-DASLab/behemoth.git.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/11/2026, 1:32:34 AM

Summary

The paper introduces Behemoth, a fully synthetic data generation framework designed to benchmark model editing and unlearning in Large Language Models (LLMs). By using controlled, tabular {subject, relationship, object} tuples and custom tokenization, the authors overcome the limitations of real-world data, such as token collisions and lack of ground truth, to study how models store and update knowledge. Experiments with Pythia-31M models demonstrate that model editing effectiveness is highly dependent on layer selection and the specific editing strategy, providing a scalable and precise environment for future research in LLM interpretability and knowledge management.

Entities (5)

Behemoth · framework · 100%
Pythia-31M · model-architecture · 100%
Model Editing · technique · 98%
Dan Alistarh · researcher · 95%
Eugenia Iofinova · researcher · 95%

Relation Signals (4)

Behemoth → generates → Synthetic Data

confidence 100% · We therefore propose Behemoth, a fully synthetic data generation framework.

Pythia-31M → trainedon → Synthetic Data

confidence 98% · We then use these data to train GPT-style Pythia Transformer architectures

Model Editing → appliedto → Pythia-31M

confidence 95% · We then evaluate whether, after training, a particular piece of information can be edited

Eugenia Iofinova → authored → Behemoth

confidence 95% · Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data Eugenia Iofinova, Dan Alistarh

Cypher Suggestions (3)

Identify models trained on specific data types · confidence 95% · unvalidated

MATCH (m:Model)-[:TRAINED_ON]->(d:Data {type: 'Synthetic'}) RETURN m.name

Find all frameworks developed by specific researchers · confidence 90% · unvalidated

MATCH (r:Researcher {name: 'Eugenia Iofinova'})-[:AUTHORED]->(f:Framework) RETURN f.name

List techniques applied to a specific model architecture · confidence 90% · unvalidated

MATCH (t:Technique)-[:APPLIED_TO]->(m:Model {name: 'Pythia-31M'}) RETURN t.name

Full Text

89,706 characters extracted from source content.

Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data
Eugenia Iofinova, Dan Alistarh
IST Austria

Abstract

As artificial neural networks, and specifically large language models, have improved rapidly in capabilities and quality, they have increasingly been deployed in real-world applications, from customer service to Google search, despite the fact that they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing, that is, in adjusting the weights of the model to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in case of dangerous knowledge. Multiple methods have been proposed to do such edits. However, at the same time, it has been shown that such model editing can be brittle and incomplete. However, the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of the interaction of the training data distribution and the way it is stored in the network is necessary and helpful to reliably perform model editing. However, working with large language models trained on real-world data does not allow us to understand this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights from the framework, we explore model editing in the context of simple tabular data, demonstrating surprising findings that, in some cases, echo real-world results, for instance, that in some cases restricting the update rank results in a more effective update. The code is available at https://github.com/IST-DASLab/behemoth.git.
Preprint. Under review. arXiv:2601.23153v1 [cs.LG] 30 Jan 2026

1 Introduction

Knowledge editing in large language models (LLMs) has acquired increasing practical importance as the scope of their applications has grown. However, despite substantial practical and academic interest, it remains largely an unsolved problem, as the changes have been shown to be brittle [Lynch et al., 2024] and also to have the potential to affect model accuracy in other areas. However, model unlearning evaluation methods typically focus on LLMs trained on large natural language datasets, which makes it necessary to rely on imperfect proxies, both to test the extent of the unlearning and to evaluate the impact on the rest of the model. Additionally, the very large natural-language datasets combined with the post-training alignment step make it difficult to investigate the relationship between characteristics of the training data and how well the knowledge in the model may be edited.

Thus, we propose to investigate this in the realm of fully synthetic data, with a fully specified data distribution and sentence grammar. To this end, we present a library that generates tabular data in the form of tuples of (subject, relationship, object). These tuples are then combined into sentences using one or several pre-determined, artificial grammars. We extend earlier works [Allen-Zhu and Li, 2023, Wu et al., 2024], by implementing a completely custom vocabulary and tokenization, in which each token serves a specific function: either belongs uniquely to the subject, relationship, or object space of tokens, or has a unique grammatical function, for instance the endofstring token, and by more thoroughly investigating knowledge editing in this context, focusing on low-rank knowledge editing, on various knowledge editing desiderata, and on the scale of the edit.
(While we believe that token collisions are interesting to study in their own right, we propose that any such token overuse be created intentionally and leave this investigation to future work.) We then use these data to train GPT-style Pythia Transformer architectures of size 31 million parameters. This size was chosen as the largest with which we could feasibly run a large number of experiments. We then evaluate whether, after training, a particular piece of information can be edited and what the exact consequences of that editing are on the rest of the model. This improves over the use of proxy measurements, such as LLM benchmarks or hand-curated validation sets often collected with other LLMs to estimate whether a specific fact was fully edited by the model, and measure any collateral model damage. It also improves on previous works such as Maini et al. [2024] that insert information into the model using additional training (fine-tuning) of pre-trained LLMs, which may have different data storage implications than if the information were presented early on in training (nor does it allow for a custom tokenization, which prevents token collisions across unrelated concepts). In other words, the experiments presented in this paper may be thought of as a tradeoff of realism, in the sense that the data used in these experiments only loosely resemble real-world data, for measurement precision. Overall, our framework provides a novel way to study how models store data, as well as more precisely benchmark model editing algorithms and the differences between them, while the choice of a small architecture and dataset allows many more experiments to be run than when using larger models that can reasonably hold natural-language knowledge. Already, even in our simple setup, we demonstrate notable differences between the action of full finetuning, low-rank finetuning, and ROME model editing, as well as a study of the impact of the choice of model layers to edit. 
Further, while we apply this library to investigating knowledge editing, we note that it can easily be extended to investigating how other aspects of LLMs, such as model compression or scaling laws, depend on the type of data stored in the model.

2 Approach

The main idea of the synthetic data framework is to generate a set of facts, which in our conception consist of a subject, a relationship, and an object - which could be envisioned as corresponding to factual information such as (Germany, Chancellor, Friedrich Merz). These facts are then generated according to a pre-set underlying distribution, resulting in a set of (subject, relationship, object) tuples (henceforth, (s, r, o) tuples). These tuples are then arranged into sentences using a predefined grammar, and these sentences form the training data of the model. In practice, once an architecture is selected, the number of tuples is chosen so that the model performance is high but not perfect, as described in the next section.

Unlike other works, such as Allen-Zhu and Li [2023], we use a fully synthetic vocabulary and a custom tokenization, with all tokens fully partitioned into subjects, objects, relationships, and different grammatical roles, such as the <|endofstring|> token. This allows us to fully control not only the fact, but also the token distribution, and also allows for additional quality checks as the models are trained and edited, as we can easily and explicitly check both the facts and the grammar. The tokenization is computed automatically by the Behemoth framework once the key parameters of the data and grammar are set. The framework is designed to be modular, so that fact generation and sentence construction can be manipulated separately, and, in fact, the framework comes with several options for fact generation and for sentence construction (including various approaches to sentence formation), and can easily be expanded in the future.
While we acknowledge that synthetic data, such as that created by Behemoth, cannot reasonably be expected to rival the complexity of real natural-language data, and that tabular data is just one type of data contained in natural language models, we believe that the framework is helpful for studying LLMs and LLM editing for the following reasons. First, a large portion of model editing desiderata specifically concerns factual data not dissimilar to that embodied by the (s, r, o) tuples in Behemoth, and thus the task of editing one or several tuples, which we explore in this paper, parallels data editing required for fact correction in real LLMs. Second, works such as ours draw attention to the role of data and the data distribution in the effectiveness of model editing methods. Specifically, the true target and output data distribution are not known for natural language models (especially after alignment steps), which makes changes in the output data distribution post-editing difficult to measure. While the current work only scratches the surface of what is possible with synthetic data, we hope that releasing this framework will encourage additional exploration in this space.

3 Related Work

Knowledge unlearning and editing methods have received substantial interest since the rise of large language models, largely due to the practical necessity of such methods if LLMs are to be deployed for important real-world tasks. Thus, unlearning and knowledge editing have been studied both in LLMs trained on real data, and, to a much smaller extent, in LLMs trained on real data and fine-tuned on synthetic natural language data, as in Maini et al. [2024]. By contrast, this work studies knowledge editing in LLMs trained entirely on synthetic data. This work was heavily influenced by the Physics of Language Models series, specifically Allen-Zhu and Li [2023].
In this work, the authors use fake biographies to study the necessary conditions for effective data retrieval by means of plain English question-answering. They find that data augmentations are essential to allow the extension of the task (question answering) to rely on stored knowledge rather than question memorization. This paper is part of a series that also includes Allen-Zhu and Li [2024b] and Allen-Zhu and Li [2024a], which use a synthetic syntactic language to train models to test strings for syntactic correctness, and examine the ability of transformers to manipulate information in inferring implied relationships, and finally measure the information storage of LLMs to derive a version of scaling laws. However, these works use real language in their synthetic data, which reduces some amount of control over the dataset tokenization and does not prevent collisions. Krishnan et al. [2025] uses the synthetic data framework developed in these works and unlearning via gradient ascent to show that more frequent instances in the training data are harder to unlearn and damage the resulting model.

Multiple works have found model editing, and, in particular, fine-tuning, to be brittle, among them [Qi et al., 2024b], which focuses on breaking alignment fine-tuning in Llama2-7B-Chat by finding the neurons and, alternatively, subspaces of weight matrices that are most responsible for alignment via SNIP [Lee et al., 2019] and Wanda [Sun et al., 2023] and then modifying these weights. In the follow-up paper Qi et al. [2024a], the authors show that safety alignment fine-tuning primarily affects the first few tokens of prompts and demonstrate that this leads to a host of attacks. Similarly, Jain et al. [2024] and Wei et al. [2024] demonstrate that alignment finetuning learns a fairly narrow wrapper over existing functionality that can easily be undone, mostly in a synthetic setting. The overall conclusion is similar to Qi et al.
[2024b], in that the authors call out safety alignment as a specific case of easy unlearning.

Additionally, many papers have focused on designing or comparing various types of model editing. Famously, in [Eldan and Russinovich, 2023] the authors largely unlearn all Harry Potter knowledge by finetuning the model on alternative data that essentially replaces the correct information with other information. However, in [Lynch et al., 2024], a closer audit of the No-Harry model shows that much of the knowledge is still retained in the original model, if creatively retrieved. In Locating and Editing Factual Associations in GPT [Meng et al., 2022], the authors use a causal approach to find the neuron activations responsible for storing specific facts, in the style of (s, r, o) tuples. They show that they can find the neuron activations in MLP layers and edit them to change the stored knowledge, on par with other methods such as finetuning. The follow-up work [Meng et al., 2023] proposes an algorithm for making a large number of such edits at the same time. In LoRA Learns Less and Forgets Less [Biderman et al., 2024], the authors find that LoRA underperforms full fine-tuning (FFT) on more complex tasks, such as code generation, but also does a better job maintaining other skills and output diversity. Further, they show that FFT is not low-rank for more complex tasks. Kotha et al. [2024] shows that finetuning often ‘unlearns’ on the level of task recognition, rather than actually forgetting knowledge. For instance, they recover “unlearned” knowledge by translating the prompts into a different language. In Hong et al. [2025], the authors analyze fine-tuning for unlearning on real data and find that frequently the fine-tuning simply suppresses the output in the last layer. In Joshi et al. [2024], the authors use data transformations to evaluate unlearning by checking if changing the query returns the desired response.
In [Wu et al., 2024], the authors fine-tune models with additional facts with deep relationships, and find that various unlearning methods either fail to unlearn implied facts or destroy other unrelated information. Finally, Hartmann et al. [2023] proposes a taxonomy (more like a pyramid) of different types of memorization and catalogs some mitigations, and some open areas.

Another important direction is the use of mechanistic interpretability to identify and manipulate functionality in neural networks. In Bricken et al. [2024] and Templeton et al. [2024], the authors use dictionary learning, in the form of a sparse autoencoder, to ascribe linear combinations of neuron activations to concepts in the data. Then, in [Templeton et al., 2024], the researchers show that they can use these ‘features’ to manipulate the model by synthetically manipulating the features’ activations. Makelov et al. [2024] compares the features discovered via sparse autoencoders with those created in a supervised way on a specific task (Indirect Object Identification, IOI), finding that SAE-discovered features are of somewhat lower quality than the supervised ones, building on Wang et al. [2022], which reverse engineers circuits in GPT2-Small for the IOI task.

4 Experiments with Editing Models Trained on Factual Data

We now present a series of experimental findings from training models on synthetic sentences corresponding to (s, r, o) tuples generated using three simple distributions: independently generated tuples, tuples with a 100% correlation between two of the relationships, and tuples with nested relationships, as described in the subsequent sections. In all cases, we use a very simple grammar, where the subject, relationship, and object are joined using predetermined filler tokens.
Despite the simplicity of the framework, we observe several interesting patterns: that model editing is frequently more effective when only a subset of the model layers are edited, and the optimal choice of layers depends on the editing strategy and on the size of the change; that fine-tuning the model to ‘forget’ an entire relationship is relatively difficult without severely damaging model quality, and that interpretability techniques such as activation patching, which have been used by methods such as Meng et al. [2022] for model editing, have only limited utility in layer selection in our framework.

4.1 Model selection

For our experiments, we chose the Pythia-31M Transformer model from the GPT model family as the architecture on which to experiment. This architecture consists of six blocks of standard Q, K, V Attention and up-down MLP layers, with a hidden dimension of 256. We chose this architecture because it is large enough to generate interesting and relevant findings, but small enough that the training data size and the model training times are very manageable - in particular, we found that we can train the model from random initialization to >95% accuracy in 24 hours on an NVIDIA RTX A6000 GPU, and can then fine-tune a model to edit a fact in minutes. These fast runtimes allow us to run far more experiments than is possible on the available corpora of natural data.

4.2 Fact generation

The tuples are created as follows. Each subject is assigned an object at random for each of six relationships. This roughly corresponds to the setup in Allen-Zhu and Li [2023], where each subject was randomly assigned six attributes, such as a birthdate and a place of work, but unlike the setup in that paper, the data and tokens are wholly artificial and not human-interpretable. This is done to prevent token collisions.
For the experiments in this paper, except where specified, we use 400 possible object values for each relationship, drawn uniformly at random for each subject and relationship. Thus, a sample tuple might look like (125, 2, 48), indicating that the value of relationship 2 for subject 125 is 48. The number of subjects is chosen using simple binary search to have the model reach 95-98% accuracy in 460 training epochs, suggesting that the model is roughly at ‘capacity’ for the data. In the most basic setup described (6 relationships per subject with 400 possible choices of objects), the Pythia-31M architecture reaches 95% accuracy with 120,000 subjects.

4.3 Sentence construction

From the tuples of facts, we construct sentences designed to loosely imitate natural language as follows. We construct sentences from one or several templates, with several additional “grammar” tokens taking the space of common words, and special “S”, “R”, and “O” tokens directly preceding the subject, relationship, and object tokens, to aid human readability. As an intuition, if we consider a sentence like “Speaking of the man Alexander, he currently works at Microsoft”, the additional “grammar” tokens would take the roles of the words “Speaking”, “of”, and “he”. The “S”, “R”, and “O” tokens would then correspond to “man”, “currently”, and “at”, and the subject, relationship, and object would be “Alexander”, “works”, and “Microsoft”. A sentence in the grammar of this experiment might then look as follows: S 125 FT1 FT2 R 3 FT3 O 48.<|endoftext|>, where FTx stands for the filler tokens.

4.4 Tokenization

We rewrote the tokenizer from scratch to limit the expressivity of the model entirely to the tokens necessary to capture the facts and grammar we constructed. As such, each token, except for the special period and <|endoftext|> characters, as well as the special S, R, and O characters, is a four-digit number preceded by a space, and an entity can consist of one or several tokens.
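The fact generation and sentence construction steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the Behemoth library's API: the function names `generate_tuples` and `render_sentence` are hypothetical, relationship indices are assumed to start at 0, the filler tokens are written literally as `FT1`, `FT2`, `FT3` following the example sentence, and entity ids are printed as plain integers rather than mapped to four-digit tokens.

```python
import random


def generate_tuples(num_subjects, num_relationships=6, num_objects=400, seed=0):
    """Assign each subject one uniformly random object per relationship.

    Returns a list of (subject, relationship, object) integer tuples.
    """
    rng = random.Random(seed)
    return [
        (s, r, rng.randrange(num_objects))
        for s in range(num_subjects)
        for r in range(num_relationships)
    ]


def render_sentence(s, r, o):
    """Join one tuple with marker (S, R, O) and filler (FTx) tokens.

    The template mirrors the example sentence in the text; the real
    framework supports several grammars, which are not modeled here.
    """
    return f"S {s} FT1 FT2 R {r} FT3 O {o}.<|endoftext|>"


if __name__ == "__main__":
    facts = generate_tuples(num_subjects=3)
    for fact in facts[:3]:
        print(render_sentence(*fact))
```

Keeping fact generation and sentence rendering in separate functions reflects the modularity the paper describes: the distribution over tuples can be changed without touching the grammar, and vice versa.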
In particular, for most experiments, the subject and object consist of two tokens, and each relationship consists of one token. The token space is fully partitioned, with no shared tokens between the spaces of subjects, relationships, objects, and other grammar particles; further, in two-token constructions, such as the subjects, the sets of possible first and second tokens are disjoint. As a technical detail, additional tokens corresponding to all possible prefix strings of the used tokens were created, but are not used for sentence construction; the greedy tokenization always prefers the longer 4-character (or special character) tokens. The sentences corresponding to each subject-relationship-object tuple are then fully shuffled, ensuring that all information regarding each subject/relationship cannot be drawn from other relationships of that subject, a risk found in Allen-Zhu and Li [2023].

4.5 Model training

We use a rough binary search to find a data size that meets our accuracy requirements after 460 epochs of training. We shuffle the data as follows. First, we create the sentences as described above. Then, we shuffle the entire set of sentences, without keeping the sentences relating to one subject together, and concatenate them into one long block of text. In all experiments, we use a fixed learning rate of 0.001 with a batch size of 16 and a weight decay of 0.01, and an example sequence length of 512. We use the LitGPT library [Lightning AI, 2023] for all training and fine-tuning, which we modified to be able to control which layers of the network are fine-tuned for each experiment.

4.6 Model editing

We use fine-tuning as the principal method of model editing, as this is generally one of the most direct and practicable approaches; in addition, it allows us to redirect the undesirable output to another token, allowing us to preserve the sentence grammar.
For fine-tuning, the Behemoth framework provides several options for assembling datasets consisting of the desired edits combined with a sampling of ‘clean’ data that should remain unchanged, which we found to be essential to preserving accuracy in the rest of the model. These sentences are shuffled by the data loader during training. At the same time, the framework creates test data to verify the model’s performance on other knowledge and tasks. We use a range of learning rates and epoch lengths for fine-tuning models, with the best learning rate and epoch length chosen individually for each experiment from all runs. For full-rank fine-tuning we train for one epoch. For low-rank fine-tuning, we tune the number of steps for each run.

4.7 Does model editing work?

In this section, we conduct experiments to measure the efficacy of model editing across four scenarios, which are as follows.

• First, we select one (correctly classified) (s, r, o) triple at random, and change the value of o to another possible value, also selected at random.
• Then, we repeat the experiment, but selecting ten triples with the same r and o values and replacing all those o values with the same alternative o value.
• Then, we select ten triples with different o values (but the same r value) and replace the objects with ten different other possible values. These three experiments investigate the ability of the model to reliably change a prediction, as well as investigate the effect on the rest of the model’s predictions.
• The final experiment measures whether the model can be guided to forget an entire concept, which we simulate by remapping all objects for a single relationship to the same value. We chose that definition of “forgetting” because it removes all information for that relationship, while staying within the rules of grammar that were created for these experiments.

We conduct these four experiments in three scenarios.
We briefly introduce these scenarios here, then describe them in more detail when presenting the results of the model editing experiments.

• In the first, the training data consists entirely of (s, r, o) tuples as described above, all of which are independently generated.
• In the second, we establish a link between the first and second relationship, where the value of the first entirely determines the value of the second; for simplicity, we set these two relationships to be binary (so r_1 and r_2 are set to either (0, 0) or (1, 1)). The goal in this scenario is to measure whether the model editing is always comprehensive - that is, whether editing a fact or a relationship will also update a ‘synonym’ fact, or if it is possible to recover the original information.
• In the final scenario, we introduce the concept of nested relationships, where each object in our (s, r, o) tuples is itself the subject of a different (o, r̂, ô) tuple, and therefore we can also form (s, r̂, ô) tuples. Conceptually, this resembles a set of statements such as “Mary works at Acme Corp. Acme Corp is located in Springfield. Mary lives in Springfield”. The goal in this setup, as in the one above, is to measure the direct and indirect (downstream) effects of changing (s, r, o) tuples.

In the first three cases, the fine-tuning dataset is made up of a mixture of the following: 250 repeats of the tuple(s) to be changed (with the new value for the object), examples of other tuples with the same object value that should not be changed, and tuples drawn at random from the remaining training data, in a 2:1:4 ratio. For the fourth, ‘forgetting’ case, we randomly sample 5% of the tuples with the first relationship, and add samples drawn at random from training data with other relationships in a 1:1 ratio. For all experiments, we do full-rank (256) and low-rank (LoRA) finetuning.
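The 2:1:4 fine-tuning mixture for the first three editing cases can be sketched as follows. This is one possible reading of the description and not the authors' implementation: `build_edit_mixture` is a hypothetical name, "same object value" is assumed to mean tuples sharing the relationship and the original object of an edited tuple, and sampling is done with replacement so that small pools still fill their share.

```python
import random


def build_edit_mixture(edits, training_tuples, repeats=250, ratio=(2, 1, 4), seed=0):
    """Assemble a fine-tuning set from edits plus unchanged 'clean' tuples.

    edits:           list of (s, r, new_o) overrides.
    training_tuples: list of original (s, r, o) facts.
    ratio:           edited : same-object-unchanged : random-clean parts.
    """
    rng = random.Random(seed)
    facts = {(s, r): o for s, r, o in training_tuples}
    edited = {(s, r) for s, r, _ in edits}
    # (relationship, original object) pairs affected by the edits (assumed reading).
    touched = {(r, facts[(s, r)]) for s, r in edited}

    edit_part = [e for e in edits for _ in range(repeats)]
    n_same = len(edit_part) * ratio[1] // ratio[0]
    n_clean = len(edit_part) * ratio[2] // ratio[0]

    same_pool = [t for t in training_tuples
                 if (t[0], t[1]) not in edited and (t[1], t[2]) in touched]
    clean_pool = [t for t in training_tuples
                  if (t[0], t[1]) not in edited and (t[1], t[2]) not in touched]

    # Sample with replacement; skip a part entirely if its pool is empty.
    same_part = rng.choices(same_pool, k=n_same) if same_pool else []
    clean_part = rng.choices(clean_pool, k=n_clean) if clean_pool else []

    mixture = edit_part + same_part + clean_part
    rng.shuffle(mixture)
    return mixture
```

For a single edit this yields 250 edited examples, 125 unchanged examples that share the original object, and 500 random clean examples. The unchanged same-object examples give the model explicit evidence that the edit should not spill over to other subjects with the same original value.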
For LoRA, we vary the rank from 32 to 128 and vary the learning rate and number of steps to achieve the highest model editing rate; for full-rank, we use all data and vary the learning rate. We additionally present experiments using ROME [Meng et al., 2022] for the case of changing a single fact. This method is specifically designed for factual changes, and works by computing a weight change in the MLP layers of a model that maximally changes the activations for the edited fact while minimizing the effect elsewhere.

4.8 Simple dataset

We first experiment on a model trained purely on sentences as described above. Following the work of Allen-Zhu and Li [2023], we have six relationships for each subject, and for each relationship, we select randomly from 400 possible values for each object. We then vary the number of subjects to create a model that is close to saturated, with accuracy over 95% for predicting the correct object for a subject-relationship pairing. We found that using 120,000 subjects results in 95.38% accuracy when trained for 460 epochs (each datum is seen 460 times). We note that this does not saturate accuracy, although it comes close: training for 1000 epochs results in an accuracy of 98%.

We briefly estimate the size of the information that is contained in the model. Neglecting the (very simple) grammar, we note that 120,000 subject values can be stored in 17 bits, six relationships can be stored in 3 bits, and 400 objects can be stored in nine bits - thus, the information could be stored in 29 bits × 120,000 subjects × 6 relationships = approximately 21 million bits, giving us a ratio of 0.65 bits per parameter, substantially lower than the 3.6 bits per parameter estimated by [Morris et al., 2025]. We believe that this is due to the model inefficiently using its parameter space to store the highly symmetric data, and we provide evidence for this in Appendix A.

[Figure 1 plot: two panels, Editing Success and Remaining Accuracy (y-axis, 0.0 to 1.0), by Num. Overrides (1 value, 10 same, 10 diff, Forget r1); legend: Rank 0, Rank 32, Rank 48, Rank 64, Rank 96, Rank 128, Full Rank.]

Figure 1: Success of ROME, full, and LoRA finetuning for the simple dataset scenario. All results are averaged across three runs.

We now conduct the four experiments described above: updating a single datapoint; updating ten datapoints from and to the same object; updating ten datapoints from and to different objects; and ‘forgetting’ a relationship by setting all objects for that relationship to the same value. For simplicity, we always experiment on the first relationship, and for the ‘forgetting’ experiment, we set all subjects to the first object possible for that relationship. For the case of a single override, we also explore the ROME method of Meng et al. [2022], which, the authors show, performs on par with the best low-rank finetuning methods for both remapping success and remaining accuracy in real-world LLMs.

The results of the experiments are shown in Figure 1. Here and below, we define “Editing success” to denote that the model now outputs the desired value as the top-probability token (or the first two top-probability tokens) when prompted with the beginning of the phrase. We observe that when only a single model edit or ten identical edits are made, it is already possible to make the edit and retain high remaining data accuracy with a rank-32 update, although editing with ROME fails about 5% of the time. However, when making ten different edits, we need at least a rank of 64 to successfully complete the edits; further, increasing the rank of the update decreases the remaining accuracy of the model.
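The storage estimate given earlier in this section can be reproduced with a short calculation. The sketch below only recomputes the bit counts stated in the text; the helper name `bits_needed` is hypothetical, and the final bits-per-parameter ratio is left as a function of the parameter count, since the exact count used for the reported figure is not restated here.

```python
def bits_needed(n_values):
    """Smallest number of bits that can index n_values distinct values."""
    return (n_values - 1).bit_length()


subjects, relationships, objects = 120_000, 6, 400

# 17 bits for the subject, 3 for the relationship, 9 for the object.
bits_per_fact = (
    bits_needed(subjects) + bits_needed(relationships) + bits_needed(objects)
)

# One fact per (subject, relationship) pair.
total_bits = bits_per_fact * subjects * relationships


def bits_per_parameter(total_bits, n_params):
    """Ratio of stored information to model size; n_params is supplied by the caller."""
    return total_bits / n_params


if __name__ == "__main__":
    print(bits_per_fact, total_bits)
```

This gives 29 bits per fact and 20,880,000 bits in total, matching the "approximately 21 million bits" in the text.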
For the task of ‘forgetting’ the first relationship, i.e., remapping all objects for that relationship to a single value, we observe that using 5% of the data points as the fine-tuning data is generally sufficient to effect the change across nearly all (s, r_1, o) tuples, and so the fine-tuning generalizes from the sample in the data. However, the accuracy of the rest of the model is severely impacted, with resulting model accuracy ranging from 70% to 80% depending on the rank of the update. For this task, rank-128 and full-rank updates were the most effective at making the update while preserving model accuracy. Note that this is a fairly surprising result, as, in general, the model could have achieved the desired output simply by keeping its internal representation of the data intact and learning to output the tokens corresponding to o_1 for any query for r_1, which might have been considered a very lightweight change. Instead, it took considerable training data and was only possible with substantial changes to the model, as shown by the lowered accuracy for other relationships.

4.9 Correlated relationships

We then create a new version of the training data with a perfect correlation between the first and second relationships. Precisely, we binarize both relationships, so each subject can only have two possible objects for r_1 and r_2, and 400 for r_3 through r_6. We then establish a perfect correlation between the values for r_1 and r_2, so that the values for those two relationships are either (0, 0) or (1, 1). For this experiment, we found that using 170 000 subjects resulted in a model with 98% accuracy after 460 training epochs. Using the same calculation as above, we derive a ratio of 0.72 bits per parameter for this setup.

Figure 2: Success of ROME, full, and LoRA finetuning for the correlated relationship dataset scenario. All results are averaged across three runs. (Panels: Editing Success, Dependent Editing Success, and Remaining Accuracy, each plotted against Num. Overrides for the scenarios 1 value, 10 same, 10 diff, and Forget r1; series: Rank 0, Rank 32, Rank 48, Rank 64, Rank 96, Rank 128, Full Rank.)

We then repeat the three value-changing experiments and the relationship forgetting experiment. Specifically, we only try to change the object value of r_1, creating training data as before. We then measure the impact of the change on r_1 and r_2 for the affected tuples, and on the remaining (‘clean’) data to estimate model impact.

The results of the experiments are shown in Figure 2. We observe a similar degree of success for all types of changes to r_1, and, as before, ROME performs well compared to high-rank and full finetuning. However, we observe that even in the case of two perfectly correlated relationships, remapping the label of r_1 does not necessarily remap the label of r_2. This is particularly apparent with ROME, where remapping a single tuple (s, r_1, o_1) was effective 100% of the time, but the fully dependent (s, r_2, o_2) tuple was only updated 90% of the time. Further, for the relationship forgetting experiment, while the effectiveness of forgetting r_1 is generally high (though increasing with rank) for low-rank and full finetuning, we observe that the ‘forgetting’ of the fully dependent (s, r_2, o_2) tuple decreases with finetuning rank, and is almost completely ineffective when done with full finetuning. This suggests that in this scenario the update resulted in a network weight change that edited the model output without removing the original information from the model. (Recall that the ‘forgetting’ task is trained only on r_1 tuples and not on r_2 tuples.)
From this, we conclude that while the neural network does, to a large extent, learn the fully dependent relationship between r_1 and r_2, the storage of this information is not perfect, and the two can be decoupled during finetuning. Further, this agrees with other works [Wu et al., 2024, Joshi et al., 2024, Hong et al., 2025] that show that fine-tuning for forgetting is brittle and can reveal other relationships. Our novel contribution here is to demonstrate that this happens even in cases of perfect information dependency, as in this toy scenario, and can depend heavily on finetuning hyperparameters in unexpected ways.

4.10 Nested relationships

Finally, we repeat the experiment with nested relationships. Specifically, as before, we assign to each subject s and relationship r an object o drawn from a set of 400. Additionally, for each object o, we create a relationship r̂_o, and randomly select an object ô from a new set of 40. This sets up an implied relationship r̂_s between s and ô. We train the model on a mix of all three tuples: (s, r, o), (s, r̂_s, ô), and (o, r̂_o, ô). (We use the ‘hat’ notation to denote nested relationships and objects from here on.) Intuitively, this is meant to model a body of knowledge with statements like “Mary works at Acme Corp”, “Acme is located in Springfield”, “Mary lives in Springfield”.

In this case, we find that it is only possible to train an accurate (97%) model of around 80 000 subjects, a 1/3 drop from the ‘simple’ model. This suggests that the implied relationship between objects and nested objects is not fully captured (or ‘grokked’): in principle, the bit-cost of storing the object-nested object relationship is very low, (log_2(400) + log_2(6) + log_2(40)) × 400 × 6 ≈ 41 000, and from there, the implied subject-nested object relationship could have been inferred. We see more evidence of this in the model editing experiments.

Figure 3: Success of ROME, full, and LoRA finetuning for the nested relationship dataset scenario. All results are averaged across three runs. (Panels: Remapping Success, Dependent Editing Success, and Remaining Accuracy, each plotted against Num. Overrides for the scenarios 1 value, 10 same, 10 diff, and Forget r1; series: Rank 0, Rank 32, Rank 48, Rank 64, Rank 96, Rank 128, Full Rank.)

To ensure that the inability to ‘grok’ the data structure is not due to insufficient training, we attempted to train a model on 120 000 tuples for five times longer, i.e., showing each example to the model 2 300 times. Even in this case, the model only reached 85% accuracy for Subject-Object mappings and 88% accuracy for Subject-Nested Object mappings, showing that even training for a very long time does not allow the model to store this nested data efficiently. We leave it to future work to establish if providing many additional examples (i.e., drawing from an unlimited distribution of subjects with related objects) would enable the dependent relationship to be ‘grokked’.

As before, we perform the experiments described above. Specifically, we override one or ten (s, r, o) tuples as before and investigate the effects of the override. For the ‘forgetting’ experiment, we unlearn r_1, the first relationship, as before.

The results of these experiments are presented in Figure 3. We observe that, as before, the edit is completely effective in the single and ‘10 same’ override scenarios. When ten different overrides are made, the update must be rank 64 or higher to achieve the same effectiveness with the same step count. As in the simple scenario, near-perfect forgetting of r_1 is only possible with a full-rank update, although even a rank-32 update achieves 70% success at this task.
However, while the accuracy of the remaining (s, r_1, o_1) tuples remains high for the tuple editing task, accuracy drops for the relationship forgetting task, with the lower-rank updates resulting in a higher forgetting rate. We further note that, in all cases, the editing success of the dependent relationship (i.e., the tuple (s, r̂_1, ô_1)) is practically 0. In particular, forgetting r_1 has no effect on forgetting r̂_1.

5 Is it necessary to fine-tune the whole model?

The choice of a 31M-parameter model that takes only several minutes to fine-tune allows us to conduct many experiments in order to better understand what happens during model training and editing. We therefore experiment with fine-tuning only some of the layers. In particular, we experiment with training all subsets of the six transformer blocks, and with training only MLP layers, only Attention layers, or both.

5.1 Simple dataset

We present the results of full fine-tuning of various subsets of layers in Figure 4. In this graph, we show the effect of fine-tuning a varying number of blocks, from one to six, and fine-tuning just the MLP layers, just the Attention layers, or both. The solid line shows the best result for that number of layers, while the shaded region covers the full spread of results across all block subsets of that size.
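The sweep described above can be enumerated with a short sketch. It is a minimal illustration, assuming that "all subsets" means all non-empty subsets of the six blocks; the names `block_subsets`, `configurations`, and `spread_by_size` are hypothetical and not taken from the Behemoth code. The last helper shows how a solid "best" line and a shaded spread could be derived from per-subset scores.

```python
from itertools import combinations

NUM_BLOCKS = 6
LAYER_TYPES = ("mlp+attention", "mlp", "attention")


def block_subsets(num_blocks=NUM_BLOCKS):
    """Yield every non-empty subset of block indices, smallest subsets first."""
    for size in range(1, num_blocks + 1):
        yield from combinations(range(num_blocks), size)


def configurations():
    """All (layer type, block subset) pairs covered by the sweep."""
    return [(layers, blocks) for layers in LAYER_TYPES for blocks in block_subsets()]


def spread_by_size(scores):
    """Map subset size -> (lowest, highest) score among subsets of that size.

    `scores` maps a tuple of block indices to a metric such as remaining
    accuracy; the highest value corresponds to the 'best result' line.
    """
    grouped = {}
    for blocks, score in scores.items():
        grouped.setdefault(len(blocks), []).append(score)
    return {size: (min(vals), max(vals)) for size, vals in sorted(grouped.items())}


# Hypothetical scores for three block subsets.
example = spread_by_size({(0,): 0.9, (1,): 0.7, (0, 1): 1.0})
```

With six blocks this gives 63 block subsets per layer type, so 189 fine-tuning configurations per editing task, which is feasible only because each run is short.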
Figure 4: Simple dataset. Ability to effect the change (top) while preserving the rest of the model accuracy (bottom) of, from left to right, making a single override, ten of the same overrides, ten different overrides, and forgetting a relationship. (Panels: Editing Success and Remaining Accuracy versus Number of blocks finetuned, from 1 to 6; series: MLP + Attention, MLP only, Attention only, and, for Remaining Accuracy, Baseline (no finetuning).)

Figure 5: Simple dataset. Correlation of whether a block is fine-tuned with remaining accuracy, for editing a single tuple (left), editing ten tuples the same way (middle), and making ten different edits on ten different tuples (right). (Panels: Correlation with Remaining accuracy versus Affected Layer, from 0 to 5; series: MLP+Attn, MLP, Attn, and, for the single edit, ROME.)

We observe that for all one- or ten-tuple editing tasks, fine-tuning just the MLP or Attention layer of a single block is sufficient to achieve the change while maintaining accuracy on the rest of the model. Further, any combination of two or more blocks is sufficient to fully achieve the desired edit, even if only the MLP layers or only the attention layers of those blocks are edited. However, for the tasks where ten tuples are edited, the choice of blocks is more important: specifically, editing the first or last MLP layer, or the last attention layer, results in a run with the highest remaining accuracy. Note also that when fine-tuning only the MLP or only the Attention layers of the model, the best-performing models (highest remaining accuracy at 100% editing success) occur when fine-tuning only part of the model: only the first and last block when editing only MLP layers, and only the last block when editing attention layers.

We then examine the marginal impact of training each block by measuring the correlation between editing this block and the overall success of the model edit. In Figure 5, we observe that, when editing a single tuple, all blocks have a negative correlation with the resulting model accuracy. This is especially true for the last two blocks (when editing MLP and attention), the last five blocks (when editing MLP layers only), and the first block (when editing attention only). In general, for this scenario, the fewer blocks are fine-tuned, the better the resulting model. When the editing complexity rises to editing ten tuples, however, the situation reverses. Correlations for fine-tuning blocks generally become positive, with the last block and the MLP of the first block especially influential in preserving model quality. Note that this occurs despite the fact that the dataset creation process is the same in all cases: the fine-tuning dataset for editing ten tuples is simply ten datasets for editing a single tuple, concatenated and shuffled.
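The per-block correlation analysis can be sketched as follows. The text does not state which correlation coefficient is used, so this sketch assumes a Pearson correlation between a 0/1 indicator (was the block fine-tuned in this run?) and the run's score; the function name and the example numbers are hypothetical.

```python
import numpy as np


def block_correlations(block_masks, scores):
    """Pearson correlation between 'block b was fine-tuned' and a run's score.

    block_masks: array of shape (runs, blocks) with 0/1 entries.
    scores: array of shape (runs,), e.g. remaining accuracy per run.
    Blocks whose indicator never varies get NaN (correlation undefined).
    """
    masks = np.asarray(block_masks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    result = np.full(masks.shape[1], np.nan)
    if scores.std() == 0:
        return result
    for b in range(masks.shape[1]):
        column = masks[:, b]
        if column.std() > 0:
            result[b] = np.corrcoef(column, scores)[0, 1]
    return result


# Hypothetical two-block example: runs fine-tune {0}, {1}, and {0, 1}.
masks = [[1, 0], [0, 1], [1, 1]]
remaining_accuracy = [0.90, 0.95, 0.90]
corr = block_correlations(masks, remaining_accuracy)
```

In this toy example, including block 0 always coincides with the lower score, so its correlation is -1, while block 1 has a correlation of 0.5. A negative value therefore reads as "fine-tuning this block tends to hurt the metric", matching how the single-edit results above are interpreted.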
For LoRA finetuning, shown in Figure 7, we observe that, for the simpler tasks of making one or several edits, it suffices to only fine-tune several of the blocks, even at low rank. We further note that for this dataset, fine-tuning only attention layers tends to preserve more accuracy than fine-tuning MLPs or all layers, even at the same edit success rate. This is not true for the task of forgetting an entire relationship, however, where the best success-remaining accuracy tradeoffs are achieved by fine-tuning all layers or only the MLP layers.

Figure 6: Simple dataset. Correlation of whether a block is fine-tuned with editing success (left) and remaining accuracy (right), when forgetting an entire relationship. (Panels: Correlation with Remapping success and Correlation with Remaining accuracy versus Affected Layer, from 0 to 5; series: MLP+Attn, MLP, Attn.)

Figure 7: Simple dataset. Results of editing a single tuple (top, left), making ten identical edits to ten tuples (top, right), making ten different edits, and forgetting a relationship with LoRA at various ranks, for 400/400/400/2500 training steps. (Panels: Editing success and Remaining accuracy versus Number of blocks finetuned, from 1 to 6, at LoRA ranks 32, 64, and 128; series: MLP + Attention, MLP only, Attention only.)
5.2 Correlated relationships

When we consider the correlated dataset (Figure 8), we observe that it is possible to edit one or multiple tuples by fine-tuning only the MLP or only the attention layers of a single block, and, for a single edit, this holds for any block. However, when multiple tuples are edited in the same way, the choice of block can inadvertently decouple the fully correlated first and second relationships. For forgetting an entire relationship, it becomes crucial to fine-tune both the MLP and attention layers of the network, although it is still not necessary to fine-tune every block of the model.

Figure 8: Correlated dataset. Ability to effect the change (top), effect the change on the dependent relationship (middle), while preserving the rest of the model accuracy (bottom) of, from left to right, making a single override, ten of the same overrides, ten different overrides, and forgetting a relationship. (Panels: Editing Success, Dependent Editing Success, and Remaining Accuracy versus Number of blocks finetuned, from 1 to 6; series: MLP + Attention, MLP only, Attention only, and Baseline (no finetuning).)

Figure 9: Correlated relationships dataset. Correlation of whether a block is fine-tuned with remaining accuracy, for editing a single tuple (left), editing ten tuples the same way (middle), and making ten different edits on ten different tuples (right), for dependent remapping success (top) and remaining data accuracy (bottom). (Panels: Correlation with Remapping success, Correlation with Dependent remapping success, and Correlation with Remaining accuracy versus Affected Layer, from 0 to 5; series: MLP+Attn, MLP, Attn, and, in some panels, ROME.)

When we consider low-rank finetuning for the correlated relationship dataset, we observe in Figure 10 that, as in the case of full fine-tuning, when making one or several edits, successfully editing the first relationship object generally but not always results in successfully remapping the second relationship object. However, ‘forgetting’ the first relationship can result in either mostly forgetting the second relationship or almost entirely preserving it, depending on the choice of layers. In particular, performing the same correlation analysis as before, we observe from Figure 9 that fine-tuning the attention layers of the first block or the MLP layers of the last block leads to the loss of editing of the dependent relationship. We also observe that for making several edits, only some of the blocks are positively correlated with making this change.
Figure 10: Correlated relationship dataset. Results of editing a single tuple (top, left), making ten identical edits to ten tuples (top, right), making ten different edits (bottom, left), and forgetting a relationship (bottom, right) with LoRA at various ranks, for 75/400/400/2500 training steps. (Panels: Editing success, Dependent editing success, and Remaining accuracy versus Number of blocks finetuned, from 1 to 6, at LoRA ranks 32, 64, and 128; series: MLP + Attention, MLP only, Attention only.)

Discussion.
Our results in this section show that frequently it is neither necessary nor desirable to edit every block of an LLM, nor every type of layer within each block. Further, the editing technique can impact the best choice of layer: for ROME, it is crucial to edit the first layer, while fine-tuning is more layer-agnostic. These results corroborate recent (concurrent) findings in [Zhu et al., 2025], where the authors also found that fine-tuning either only a subset of the MLP layers or only the attention layers of a multimodal LLM helps prevent catastrophic forgetting.

6 Do interpretability techniques help inform our choice?

6.1 Activation patching

We first ask whether we can use the activation patching technique described in Meng et al. [2022] under the name “causal tracing” to select which layers it is best to edit. In this technique, a piece of clean data (a sentence corresponding to a subject, relationship, object tuple) is first run through the model. Then, another subject-relationship pair is run through the model, in particular one with a different object, which we call the “corrupted run”. Finally, the corrupted run is repeated, but this time, the corrupted activations of some part of the model are replaced with the clean activations. We can then look at the output logits to see where replacing the corrupted activation with the clean one results in moving the logit of the clean object back toward its original (clean) value. As a technical detail, since the object identifier consists of two tokens, we simply pick cases where the first token of the clean and corrupted object is the same, and look at the second token.

Figure 11: Activation patching IOI metric results for the dataset of 120 000 (s, r, o) tuples.
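The three runs of activation patching (clean, corrupted, and restored) can be illustrated on a toy network. This is a deliberately simplified sketch and not the setup used in the paper: the model here is a small residual stack of ReLU blocks in NumPy without attention or token positions, and all sizes, weights, and inputs are hypothetical random values.

```python
import numpy as np


def forward(x, weights, w_out, patch=None):
    """Toy residual stack: resid <- resid + relu(W @ resid) for each block.

    patch=(block_index, activation) overwrites that block's output with a
    cached activation, which is the 'restoring' step of activation patching.
    Returns the output logits and the list of block outputs actually used.
    """
    resid = x
    cache = []
    for i, w in enumerate(weights):
        block_out = np.maximum(w @ resid, 0.0)
        if patch is not None and patch[0] == i:
            block_out = patch[1]
        cache.append(block_out)
        resid = resid + block_out
    return w_out @ resid, cache


rng = np.random.default_rng(0)
dim, vocab, num_blocks = 8, 4, 3  # hypothetical toy sizes
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(num_blocks)]
w_out = rng.normal(size=(vocab, dim))
x_clean = rng.normal(size=dim)
x_corrupted = rng.normal(size=dim)

# 1) Clean run, caching activations. 2) Corrupted run.
clean_logits, clean_cache = forward(x_clean, weights, w_out)
corrupted_logits, _ = forward(x_corrupted, weights, w_out)

# 3) Restored runs: corrupted input, one block's output taken from the clean run.
restored_logits = [
    forward(x_corrupted, weights, w_out, patch=(i, clean_cache[i]))[0]
    for i in range(num_blocks)
]
```

Comparing each entry of `restored_logits` with `clean_logits` and `corrupted_logits` indicates how much of the clean behavior a single patched site recovers; in a real transformer the same loop would additionally range over token positions and over MLP, attention, and residual-stream sites.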
We use the IOI (Indirect Object Identification) metric, which measures the net indirect effect as the normalized logit difference between the correct object and the incorrect (corrupted) object, as follows:

$$M_f = \frac{(L_{\text{correct,restored}} - L_{\text{incorrect,restored}}) - (L_{\text{correct,corrupted}} - L_{\text{incorrect,corrupted}})}{(L_{\text{correct,clean}} - L_{\text{incorrect,clean}}) - (L_{\text{correct,corrupted}} - L_{\text{incorrect,corrupted}})} \qquad (1)$$

We collect these measurements at three sites: after each MLP block (post-activation function), after each Attention block, and at the residual stream at the start of each block.

We observe that there are no strong differences when we compare the activation patching results for different relationships. In other words, we see no evidence that (s, r, o) tuples are in any way partitioned across the layers of the network depending on the specific relationship, even though in this experiment the objects selected for each relationship for a subject are completely independent of one another. We see the same effect when we look at Attention heads; we further note that the first attention layer and the middle MLP layers seem to be the most crucial: activation patches at the first layer (at the subject token position) and at the last layer (at the final position) result in moving the prediction weight back toward the original, correct prediction. For attention layers, only activation patching at the first block results in moving predictions back to the correct value. However, when we compare these results to our findings in Section 5, we note that this matches the most effectively edited blocks for ROME, but not for fine-tuning, where the final blocks resulted in the same editing accuracy but better accuracy preservation.
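Equation (1) translates directly into a few lines of code. The sketch below is illustrative only: the function name and the example logits are hypothetical, and each argument is assumed to be a pair (logit of the correct object token, logit of the incorrect object token) read off at the final position of the corresponding run.

```python
def ioi_metric(restored, corrupted, clean):
    """Normalized logit-difference recovery, following Equation (1).

    Each argument is a pair (correct_logit, incorrect_logit).
    Returns 0.0 when patching changes nothing relative to the corrupted run
    and 1.0 when it fully restores the clean logit difference.
    """
    def logit_diff(pair):
        correct, incorrect = pair
        return correct - incorrect

    denominator = logit_diff(clean) - logit_diff(corrupted)
    if denominator == 0:
        raise ValueError("clean and corrupted runs have the same logit difference")
    return (logit_diff(restored) - logit_diff(corrupted)) / denominator


# Hypothetical logits: the patch recovers part of the clean behavior.
clean = (5.0, 1.0)      # logit difference  4
corrupted = (1.0, 4.0)  # logit difference -3
restored = (3.0, 2.0)   # logit difference  1
m_f = ioi_metric(restored, corrupted, clean)
```

Here the patched run recovers 4 of the 7 logit-difference units separating the corrupted run from the clean run, so `m_f` is 4/7.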
For the MLP layer, we observe that the first block is most important at the subject token, with all subsequent blocks being influential at the object token. Comparing this observation to the model editing results, we observe that this matches the results for ROME, where the model editing was most successful when applied to the first block, but does not capture the results for fine-tuning, where the final blocks are the most helpful for maintaining model accuracy when editing the model. In Appendix A, we show the activation patching results for other relationships, which very strongly resemble the results for the first relationship.

6.2 Rank estimation

In this section, we consider whether the effectiveness of the low-rank update matches the intuition we get by considering the estimated rank of the layer movement when doing full finetuning. We estimate the effective rank of the full fine-tuning by performing a singular value decomposition of the matrix ΔW := W_f − W_0, and measure the rank at which 95% of the difference is captured.

Figure 12: Simple dataset. From left to right: single edit, ten identical edits, ten different edits, forgetting a relationship. (Top) The approximate move rank for components of various blocks, when all layers are fine-tuned with full fine-tuning. (Bottom) The average move rank, relative to the rank when all blocks are fine-tuned, when only some blocks are fine-tuned.

The results of this measurement are shown in Figure 12. We observe that when only a single edit is made, the effective rank of the layer weight changes is quite low (under 40), especially for the MLP layers (recall that the full rank of this model is 256), demonstrating that in this case, the fine-tuning naturally finds a low-rank adaptation of the model. If not all blocks are fine-tuned, the effective rank increases little, up to 27% higher when only a single block is fine-tuned. However, when a larger change is made (ten of the same edit to ten different tuples, ten different edits, or forgetting a relationship), we observe a vastly different pattern: in all cases, the change becomes much higher rank, especially for the MLP W_in layers of the early blocks and all W_out layers. Further, restricting the fine-tuned blocks uniformly lowers the effective weight change rank, in some cases by a factor of 10. This suggests that given a small number of layers to fine-tune, the weight adjustments result in small, atomic changes that, in the case of making edits, still result in successful model editing; given more ‘freedom’, the changes become more diffuse, perhaps slightly over-fitting to the samples drawn from the rest of the data. Note that test accuracy from full finetuning does not improve when more blocks are added past the first two blocks. Recall also that the details of assembling the single-edit fine-tuning dataset and the multiple-edits fine-tuning dataset, and thus the ratios of edited and clean examples, are the same.

However, when examining the LoRA performance in Figure 7, we observe that, when the LoRA rank is 64 or higher, fine-tuning multiple blocks is not necessary to fully achieve the override, while when using a LoRA rank of 32, fine-tuning multiple blocks is generally necessary. These observations suggest that the estimated dimension of the weight change during full fine-tuning is not a helpful guide for LoRA rank selection, and also that, in our test scenario, there is a variety of data transformations of various dimensions that result in similar outcomes for model editing.

7 Conclusion and Limitations

In this paper, we introduced the Behemoth framework for generating synthetic data for large language models, which uses a custom data generation algorithm and tokenizer to create fully synthetic training data.
The relatively compact size of the data and model allows for broad experimentation even with limited computational resources, while the exact known data distribution allows us to more accurately isolate and measure the effects of model editing. We used this framework to evaluate model editing in simple data setups and showed that small hyperparameter or data changes can lead to large effects in the way the model weights are updated and the consequences of these changes. However, we acknowledge that the very simple choice of grammar and data structure is not representative of important attributes of natural-language data. We leave the expansion of the experiments to more complex grammars and data relationships to future work, and hope that this example inspires us and other researchers to continue experimenting with synthetic data.

Acknowledgements

EI thanks Weiwei Yang, Janardhan Kulkani, and Kate Lytvynets for their advice and support in developing an earlier version of the Behemoth library. This research was supported by the Scientific Service Units (SSU) of IST Austria through resources provided by Scientific Computing (SciComp). EI was supported in part by the FWF DK VGSCO, grant agreement number W1260-N35.

References

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction, 2023.
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws, 2024a.
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures, 2024b.
Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P. Cunningham. Lora learns less and forgets less. arXiv, 2024.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. Towards monosemanticity: Decomposing language models with dictionary learning. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html, 2024.
Ronen Eldan and Mark Russinovich. Who's harry potter? approximate unlearning in llms. arXiv, 2023.
Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. Sok: Memorization in general-purpose large language models. arXiv, 2023.
Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, and Haiqin Yang. Dissecting fine-tuning unlearning in large language models. In Conference on Empirical Methods in Natural Language Processing, 2025.
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Scott Krueger. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. In ICLR, 2024.
Abhinav Joshi, Shaswati Saha, Divyaksh Shukla, Sriram Vema, Harsh Jhamtani, Manas Gaur, and Ashutosh Modi. Towards robust evaluation of unlearning in llms via data transformations. In Conference on Empirical Methods in Natural Language Processing, 2024.
Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. In ICLR, 2024.
Aravind Krishnan, Siva Reddy, and Marius Mosbach. Not all data are unlearned equally. arXiv, abs/2504.05058, 2025.
Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. In ICLR, 2019.
Lightning AI. Litgpt. https://github.com/Lightning-AI/litgpt, 2023.
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, and Dylan Hadfield-Menell. Eight methods to evaluate robust unlearning in llms. arXiv, 2024.
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms. arXiv, 2024.
Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control. arXiv, 2024.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In NeurIPS, 2022.
Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass editing memory in a transformer. In ICLR, 2023.
John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar, Fair at Meta, and Google Deepmind. How much do language models memorize? arXiv, abs/2505.24832, 2025.
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep, 2024a.
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In International Conference on Learning Representations (ICLR), 2024b.
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. In International Conference on Learning Representations (ICLR), 2023.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html, 2024.
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv, 2022.
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. arXiv, 2024.
Ruihan Wu, Chhavi Yadav, Russ Salakhutdinov, and Kamalika Chaudhuri. Evaluating deep unlearning in large language models. arXiv, abs/2410.15153, 2024.
Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, and Derek Hoiem. How to teach large multimodal models new skills, 2025.

A Activation patching results for all relationships

In this section, we present activation patching results for all six relationships for the simple dataset. We observe that there is very little difference between the activation patching results, regardless of the relationship, demonstrating that the important points of the data are stored in the same parts of the network, accounting for the inefficient data storage and, therefore, the lower bits-per-parameter ratios.

Figure 13: Simple dataset. Activation patching IOI metric results for all relationships, o, rtuples, residual pre activation.

Figure 14: Simple dataset. Activation patching IOI metric results for all relationships, o, rtuples, MLP.

Figure 15: Simple dataset. Activation patching IOI metric results for all relationships, o, rtuples, attention all positions.

B Performance of individual fine-tuned models (full fine-tuning)
Figure 16: Simple dataset, 1 override, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing a single object to an alternative value", metric: Remaining accuracy. Panels: MLP + Attention, MLP only, Attention only. x-axis: Number of blocks finetuned (1 to 6); y-axis: Accuracy; series labels: 000001 to 111111.]
Figure 17: Simple dataset, 10 of the same overrides, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to the same values", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
Figure 18: Simple dataset, 10 overrides, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to different values", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
Figure 19: Correlated dataset, 1 override, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing a single object to an alternative value", metric: Dependent remapping success. Panels, axes, and series labels as in Figure 16.]
Figure 20: Correlated dataset, 1 override, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing a single object to an alternative value", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
Figure 21: Correlated dataset, 10 of the same override, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to the same values", metric: Dependent remapping success. Panels, axes, and series labels as in Figure 16.]
Figure 22: Correlated dataset, 10 of the same override, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to the same values", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
Figure 23: Correlated dataset, 10 overrides, dependent relationship editing accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to different values", metric: Dependent remapping success. Panels, axes, and series labels as in Figure 16.]
Figure 24: Correlated dataset, 10 overrides, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Changing 10 objects to the same values", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
Figure 25: Correlated dataset, Forgetting R1, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Remapping all values for one relationship to a single value", metric: Remapping success. Panels, axes, and series labels as in Figure 16.]
Figure 26: Correlated dataset, Forgetting R1, dependent relationship editing success after full finetuning. [Plot "Model Accuracy for Remapping all values for one relationship to a single value", metric: Dependent remapping success. Panels, axes, and series labels as in Figure 16.]
Figure 27: Correlated dataset, Forgetting R1, remaining data accuracy after full finetuning. [Plot "Model Accuracy for Remapping all values for one relationship to a single value", metric: Remaining accuracy. Panels, axes, and series labels as in Figure 16.]
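
As a supplementary illustration of the effective-rank measurement described in Section 6.2, the following is a minimal sketch rather than the authors' implementation. It assumes that "95% of the difference is captured" refers to the share of the squared Frobenius norm of $\Delta W = W_f - W_0$ explained by the leading singular values; the paper does not spell out the exact criterion. The function name `effective_rank`, the `threshold` parameter, and the random matrices in the usage example are hypothetical and introduced only for illustration.

```python
import numpy as np


def effective_rank(w_final, w_init, threshold=0.95):
    """Smallest k such that the top-k singular values of (w_final - w_init)
    capture at least `threshold` of its squared Frobenius norm.

    Assumption: "captured" is measured in squared singular values; this is
    one plausible reading of the criterion, not a confirmed one.
    """
    delta = np.asarray(w_final, dtype=np.float64) - np.asarray(w_init, dtype=np.float64)
    # np.linalg.svd returns singular values in descending order.
    singular_values = np.linalg.svd(delta, compute_uv=False)
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0.0:
        return 0  # the weights did not move at all
    captured = np.cumsum(energy) / total
    # First index whose cumulative share reaches the threshold, as a 1-based rank.
    k = int(np.searchsorted(captured, threshold)) + 1
    # Guard against round-off when the threshold is very close to 1.
    return min(k, singular_values.size)


if __name__ == "__main__":
    # Hypothetical example: a 256x256 layer moved by a rank-5 update.
    rng = np.random.default_rng(0)
    w0 = rng.standard_normal((256, 256))
    update = rng.standard_normal((256, 5)) @ rng.standard_normal((5, 256))
    print(effective_rank(w0 + update, w0))  # at most 5 for a rank-5 update
```

Measuring the captured share with the singular values themselves instead of their squares would give an equal or higher rank estimate at the same threshold, so the choice of criterion matters when comparing such estimates against a LoRA rank.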