
Paper deep dive

Learning Dynamics of LLM Finetuning

Yi Ren, Danica J. Sutherland

Year: 2025 · Venue: ICLR 2025 (Outstanding Paper) · Area: Training Dynamics · Type: Theoretical · Embeddings: 138

Models: Pythia-1B, Pythia-2.8B, Pythia-410M, Qwen

Abstract

Learning dynamics, which describes how the learning of specific training examples influences the model's predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts in the response for question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We also extend our framework and highlight a unique "squeezing effect" to explain a previously observed phenomenon in off-policy direct preference optimization (DPO), where running DPO for too long makes even the desired outputs less likely. This framework also provides insights into where the benefits of on-policy DPO and other variants come from. The analysis not only provides a novel perspective of understanding LLM's finetuning but also inspires a simple, effective method to improve alignment performance.

Tags

ai-safety (imported, 100%) · alignment-training (suggested, 80%) · theoretical (suggested, 88%) · training-dynamics (suggested, 92%)

Links


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:50:24 PM

Summary

The paper introduces a framework for analyzing the learning dynamics of Large Language Model (LLM) finetuning by decomposing step-wise influence accumulation. It identifies a 'squeezing effect' in off-policy Direct Preference Optimization (DPO) where negative gradients on unlikely labels cause a decrease in confidence across most responses, explaining phenomena like hallucination and repetitive output. The framework provides a unified perspective on instruction and preference tuning, suggesting that on-policy methods mitigate these issues.

Entities (5)

Direct Preference Optimization · algorithm · 99%
Learning Dynamics · framework · 98%
Supervised Finetuning · algorithm · 98%
Empirical Neural Tangent Kernel · mathematical-concept · 95%
Squeezing Effect · phenomenon · 95%

Relation Signals (3)

SFT contrasted with DPO

confidence 95% · The use of large negative gradients is the main difference between the learning dynamics of SFT and DPO.

DPO causes Squeezing Effect

confidence 92% · highlight a unique “squeezing effect” to explain a previously observed phenomenon in off-policy direct preference optimization (DPO)

Learning Dynamics explains Hallucination

confidence 90% · we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning

Cypher Suggestions (2)

Find all algorithms analyzed by the learning dynamics framework · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:ANALYZED_BY]->(f:Framework {name: 'Learning Dynamics'}) RETURN a.name

Map the causal relationship between algorithms and phenomena · confidence 85% · unvalidated

MATCH (a:Algorithm)-[:CAUSES]->(p:Phenomenon) RETURN a.name, p.name

Full Text

137,377 characters extracted from source content.


arXiv:2407.10490v4 [cs.LG] 29 Jun 2025
Published as a conference paper at ICLR 2025

LEARNING DYNAMICS OF LLM FINETUNING

Yi Ren, University of British Columbia, renyi.joshua@gmail.com
Danica J. Sutherland, University of British Columbia & Amii, dsuth@cs.ubc.ca

ABSTRACT

Learning dynamics, which describes how the learning of specific training examples influences the model's predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts in the response for question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We also extend our framework and highlight a unique "squeezing effect" to explain a previously observed phenomenon in off-policy direct preference optimization (DPO), where running DPO for too long makes even the desired outputs less likely. This framework also provides insights into where the benefits of on-policy DPO and other variants come from. The analysis not only provides a novel perspective of understanding LLM's finetuning but also inspires a simple, effective method to improve alignment performance. Code for experiments is available at https://github.com/Joshua-Ren/Learning_dynamics_LLM.

1 INTRODUCTION

Deep neural networks usually acquire new knowledge by updating their parameters via gradient descent (GD).
This procedure can be described by learning dynamics, which links changes in the model's predictions to the gradients generated by learning specific examples. With the help of learning dynamics, researchers have not only explained many interesting phenomena during training, e.g., the "zig-zag" learning path (Ren et al. 2022) and the formation of compositional concept space (Park et al. 2024), but used these insights to propose novel, improved algorithms in different problems (e.g. Pruthi et al. 2020; Ren, S. Guo, et al. 2023; Xia et al. 2024). The study of large language models (LLM) is gaining popularity due to their surprising capabilities on various tasks. To ensure the LLMs follow human instructions and align well with human preferences, finetuning has attracted much recent attention. Practitioners often start with instruction tuning, where the model learns extra knowledge necessary for the downstream task, and then preference tuning, where the model aligns its outputs to human preference (Ouyang et al. 2022). Various finetuning algorithms have been proposed to fit into this pipeline, with differing explanations as to why they improve the model's performance. Contrary to most existing analyses of LLM finetuning, which use the perspective of their training targets, their status at the end of training, or their relationships to reinforcement learning (e.g. Ji et al. 2024; Rafailov et al. 2024; Tajwar et al. 2024), this paper tries to understand LLMs' evolution from a dynamical perspective. Specifically, we formalize the learning dynamics of LLM finetuning by decomposing the change of the model's prediction into three terms which play different roles. This framework can be easily adapted to various finetuning algorithms with different goals, including supervised finetuning (SFT, Wei et al. 2022), direct preference optimization (DPO, Rafailov et al.
(2023), and its variants) and even reinforcement learning based methods (e.g., PPO, Schulman et al. 2017). This framework helps explain several interesting and counter-intuitive observations during training – including the "repeater" phenomenon after preference tuning (Holtzman et al. 2020), hallucination (L. Huang et al. 2023), the decay in confidence of all responses during off-policy DPO (Rafailov et al. 2024), and more. Moreover, we also provide a new perspective on understanding why off-policy DPO and other variants underperform their on-policy counterparts (S. Guo, B. Zhang, et al. 2024). Our explanation starts by observing an interesting "squeezing effect," which we demonstrate is a consequence of gradient ascent (as in DPO and similar algorithms) on models with cross-entropy loss following a softmax layer. In short, for each token's prediction, the negative gradient will push down the model's predictions on (almost) all possible output labels, moving this probability mass to the most-likely labels. This can be detrimental to the alignment we are trying to achieve. This effect is most serious when the negative gradient is imposed on an already-unlikely label, which is why the confidence of almost all responses decreases during off-policy DPO. Inspired by this, we propose a simple but very effective method to further improve alignment performance.

2 DEFINITION OF LEARNING DYNAMICS AND AN MNIST EXAMPLE

Learning dynamics is usually an umbrella term describing how the change of a specific factor influences the model's prediction. In this paper, we narrow it down to describe "how the change in the model's parameters θ influences the corresponding change in f_θ," i.e., the relationship between Δθ and Δf_θ. When the model updates its parameters using gradient descent (GD), we have

Δθ ≜ θ_{t+1} − θ_t = −η · ∇L(f_θ(x_u), y_u);  Δf(x_o) ≜ f_{θ_{t+1}}(x_o) − f_{θ_t}(x_o),  (1)

where the update of θ during step t → t+1 is given by one gradient update on the sample pair (x_u, y_u) with learning rate η.
In short, the learning dynamics in this paper address the question: after a GD update on x_u, how does the model's prediction on x_o change? Learning dynamics can shed light on many important problems in deep learning and also help to understand various counterintuitive phenomena. Their origin might be the "stiffness" (Fort et al. 2019) or "local elasticity" (He and Su 2020; Deng et al. 2021) of neural networks. See Appendix A for more discussion.

As a warm-up, we first consider a standard supervised learning problem, where the model learns to map x to predictions y = (y_1, ..., y_L) ∈ V^L, where V is the vocabulary of size V. The model usually outputs a probability distribution by first generating a matrix of logits z = h_θ(x) ∈ R^{V×L} and then taking the Softmax of each column. We can track the change in the model's confidence by observing log π_θ(y|x).

Per-step influence decomposition. The learning dynamics of (1) become

Δ log π_t(y|x_o) ≜ log π_{θ_{t+1}}(y|x_o) − log π_{θ_t}(y|x_o).  (2)

For simplicity, we start from the L = 1 scenario, where Δθ and Δ log π can be linked by the following result, a version of a result of Ren et al. (2022) proved and further discussed in Appendix B. For multi-label classification (L > 1), the updates separate; we can calculate L different Δ log π_t and stack them together.

Proposition 1. Let π = Softmax(z) and z = h_θ(x). The one-step learning dynamics decompose as

Δ log π_t(y|x_o) = −η A_t(x_o) K_t(x_o, x_u) G_t(x_u, y_u) + O(η² ‖∇_θ z(x_u)‖²_op),  (3)

where A_t(x_o) = ∇_z log π_{θ_t}(x_o) = I − 1 π_{θ_t}^⊤(x_o) is a V×V matrix, K_t(x_o, x_u) = (∇_θ z(x_o)|_{θ_t})(∇_θ z(x_u)|_{θ_t})^⊤ is the V×V empirical neural tangent kernel of the logit network z, and G_t(x_u, y_u) = ∇_z L(x_u, y_u)|_{z_t} is a length-V vector.
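For a linear logit network z = Wx (multinomial logistic regression), the eNTK is simply K_t(x_o, x_u) = (x_o · x_u) I, so the decomposition in Proposition 1 can be checked numerically in a few lines. Below is a minimal sketch under that assumption (toy dimensions and random data chosen for illustration; this is not the paper's released code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, eta = 5, 8, 1e-4            # vocab size, input dim, learning rate

W = rng.normal(size=(V, d))       # linear logit network: z = W x
x_u, x_o = rng.normal(size=d), rng.normal(size=d)
y_u = np.eye(V)[2]                # one-hot label e_{y_u}

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# One GD step on (x_u, y_u) with cross-entropy loss:
pi_u = np.exp(log_softmax(W @ x_u))
G = pi_u - y_u                         # G_t = pi_t(y|x_u) - y_u
W_new = W - eta * np.outer(G, x_u)     # grad of CE w.r.t. W is G x_u^T

# Predicted change on x_o via Proposition 1:
pi_o = np.exp(log_softmax(W @ x_o))
A = np.eye(V) - np.outer(np.ones(V), pi_o)   # A_t = I - 1 pi^T
K = (x_o @ x_u) * np.eye(V)                  # eNTK of a linear logit net
pred = -eta * A @ K @ G

actual = log_softmax(W_new @ x_o) - log_softmax(W @ x_o)
print(np.abs(pred - actual).max())           # agrees up to the O(eta^2) term
```

The residual mismatch shrinks quadratically in η, matching the O(η² ‖∇_θ z(x_u)‖²_op) remainder in Equation (3).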
[Figure 1: The per-step learning dynamics and the accumulated influence in an MNIST experiment. Panels: (a) adaptation vector created by (x_u, y_u); (b) one-step change with the same G_t(x_u, y_u) (large η); (c) accumulated change over epochs; (d) correlation of the accumulated change.]

A_t(x_o) = I − 1 π_{θ_t}^⊤(x_o) only depends on the model's current predicted probability. The matrix K_t is the empirical neural tangent kernel (eNTK, Jacot et al. 2018) of the model, the product of the model's gradients with respect to x_o and x_u. The analysis in this paper relies on the following assumption: during training, the relative influence of learning x_u on all other different x_o is relatively stable. The common "lazy eNTK" assumption discussed in Arora et al. (2019) is a sufficient but not necessary condition for this paper. Appendix C provides a more detailed discussion and experimental verification for both MNIST and LLM settings. We can then think of K_t as a model-specific similarity measurement between different input samples: a larger ‖K_t‖_F means the update of x_u likely influences the model's prediction on x_o more. Finally, G_t is determined by the loss function L, which provides the energy and direction for the model's adaptation. For example, for cross-entropy loss L_CE ≜ −y_u^⊤ log π(y|x_u), we have G_t^CE = π_{θ_t}(y|x_u) − y_u, a length-V vector that points from the model's current predictive distribution to the desired supervisory distribution. For typical "hard" labels, y_u is a one-hot vector e_{y_u}.

Accumulated influence and a demonstration on MNIST. Proposition 1 describes how the update of x_u changes the model's prediction on x_o for each learning step.
Since a real model updates its parameters for many steps, it is important to ask how these per-step influences accumulate over time. We start by analyzing a simple example of training a LeNet on the MNIST dataset (LeCun et al. 1998). See Figure 1-(a), where the network π_{θ_t} is updating its parameters using the loss calculated on one training example (x_u, y_u = e_4). The residual term G_t^CE(x_u, y_u) is then represented by the red arrows, which all start from π_{θ_t}(y|x_u) and point to e_4. We can then ask how the model's predictions on different x_o change after this update. As in Figure 1-(b), for an x_o in the same class as x_u (i.e., the identical case), the predicted probability of the correct label is "pulled up" by this update, as expected. On the other hand, if this x_o is similar to x_u (i.e., ‖K_t‖_F is reasonably large) but comes from another class, then the predicted probability on x_u's class (not the correct label of x_o) is "pulled up," as in the second panel of Figure 1-(b). Last, for examples that look dissimilar to x_u (small ‖K_t‖_F), this update will not change the model's prediction on x_o much, as in the bottom panel of Figure 1-(b). The interactions among the updates of different inputs then form an interesting pattern in the learned predictions. As illustrated in Figure 1-(c), when making predictions on images coming from class 4, the model tends to assign higher confidence to class 9. That is because the examples in class 9 on average look more similar to class 4 than examples in other classes. Hence the updates of examples in classes 4 and 9 reinforce their mutual influence and lead to a bump in their predictions. To verify this, we plot the average of π(y|x) for x from each of the classes in Figure 1-(d). The values of some off-diagonal patches are significantly higher than others, which means the examples in those classes look more similar, like 4 and 9, 5 and 3, 8 and 5, etc.
3 LEARNING DYNAMICS OF LLM'S FINETUNING

Although learning dynamics have been applied to many deep learning systems, extending this framework to LLM finetuning is non-trivial. The first problem is the high dimensionality and the sequential nature of both the input and output signals. The high-dimensional property makes it hard to observe the model's output, and the sequential nature makes the distributions on different tokens mutually dependent, which is more complicated than the standard multi-label classification problem considered by most previous work. Furthermore, as there are many different algorithms for LLM finetuning – SFT (Wei et al. 2022), RLHF (Ouyang et al. 2022), DPO (Rafailov et al. 2023), etc. – analyzing them under a uniform framework is challenging. Finally, compared with the training-from-scratch scenario, where a roughly uniform distribution over all possible outputs is usually assumed, LLMs' finetuning dynamics rely heavily on the pretrained base model, which can make the analysis harder. For example, the pretrained model usually assigns little probability mass to unlikely tokens, which is good for most applications but leads to the risk of the "squeezing effect" we show later. We now tackle these problems and propose a unified framework for different finetuning methods.

3.1 PER-STEP DECOMPOSITION OF THE SFT LOSS

The typical loss function used during supervised finetuning is the negative log-likelihood (NLL) of a given completion y_u^+ = [y_1^+, ..., y_L^+] ∈ V^L, conditioned on the prompt x_u:

L_SFT(x_u, y_u^+) ≜ −Σ_{l=1}^L log π(y = y_l^+ | y_{<l}^+, x_u) = −Σ_{l=1}^L e_{y_l^+} · log π(y | x_u, y_{<l}^+).  (4)

Note that compared with the multi-label classification problem discussed before, where the joint distribution of all labels can be factorized as π(y|x) = Π_l π(y_l|x), the sequential nature of language modeling makes the analysis more complicated, because we must have π(y|x) = Π_l π(y_l|x, y_{<l}).
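Equation (4) is just a sum of per-token cross-entropies under teacher forcing. A plain-numpy sanity check (toy logits and token ids, all hypothetical; a real LLM would produce the V×L logit matrix with causal masking):

```python
import numpy as np

def sft_loss(z, y_plus):
    """NLL of a completion. z is a V x L logit matrix whose column l is
    already conditioned on x_u and y^+_{<l} (teacher forcing); y_plus
    holds the L target token ids."""
    m = z.max(axis=0)                                # column-wise log-softmax
    logp = z - m - np.log(np.exp(z - m).sum(axis=0))
    return -sum(logp[y, l] for l, y in enumerate(y_plus))

rng = np.random.default_rng(0)
V, L = 6, 4
z = rng.normal(size=(V, L))
y_plus = [3, 1, 0, 5]

loss = sft_loss(z, y_plus)        # > 0, since every log-prob is negative

# Raising the logit of a target token can only lower the loss:
z_better = z.copy()
z_better[3, 0] += 10.0
loss_better = sft_loss(z_better, y_plus)
```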
To solve this problem, we can merge this factorization into the definition of the backbone h_θ while keeping the format of Proposition 1. Specifically, letting χ be the concatenation of x and y, the prediction of all tokens of y is

z = h_θ(χ);  π(y|χ) = Softmax_column(z).

Here z is a V×L matrix where each column contains the logits of the prediction of the l-th token. Our h_θ, even though it takes the entire sequence χ as its input, forces the model not to refer to the future tokens y_{>l} when making predictions on the l-th token, commonly implemented via "causal masking" (proposed in Vaswani et al. (2017); details in Figure 10a of Appendix D). Then, we can calculate (∇_θ z_l(χ_o)|_{θ_t})(∇_θ z_l(χ_u)|_{θ_t})^⊤ on each column of z and stack them to form a V×V×M×L tensor K_t(χ_o, χ_u). The calculation of G_t and A_t follows a similar procedure. Thanks to the causal mask implemented in h_θ, the resulting decomposition is almost identical to that in a multi-label classification problem. Assume we have a response y_u of length L associated with x_u, stacked into χ_u, and a response y_o of length M associated with x_o, stacked into χ_o. The change of the model's prediction on the m-th token of y_o can be represented as, when the z gradients have bounded norm,

[Δ log π_t(y|χ_o)]_m = −Σ_{l=1}^L η [A_t(χ_o)]_m [K_t(χ_o, χ_u)]_{m,l} [G_t(χ_u)]_l + O(η²),  (5)

where G_t^SFT(χ_u) = π_{θ_t}(y|χ_u) − y_u. Compared with Proposition 1, the main difference is that the eNTK term also depends on the responses y_u and y_o, which allows us to answer questions like: for a prompt x_u, how does learning the response y_u^+ influence the model's belief about a response y_u'? When tracking the model's confidence on different responses given the question x_u, learning from y_u^+ will impose a strong "upwards" pressure on y_u^+, as illustrated in the first panel of Figure 2.
At the same time, the confidence of "similar responses" will also be slightly pulled up, like how learning a 4 influences the prediction on 9 in the MNIST example. We will discuss how to understand the "similarity" between two sequences of responses in the next section.

3.2 PER-STEP DECOMPOSITION OF THE DPO LOSS

Preference finetuning, which teaches the model to provide responses that align better with human preferences, is also an important phase in LLM finetuning pipelines. Different from the SFT stage above, where the training tells the model "what to do", many preference finetuning methods also teach the model "what not to do," which makes the learning dynamics more complex. For intuition, we start by analyzing a typical method, i.e., DPO (direct preference optimization, an RL-free method), under a similar framework. Following Rafailov et al. (2023), the loss function of off-policy DPO is

L_DPO(θ) = −E_{(x_u, y_u^+, y_u^−) ∼ D} [ log σ( β log (π_{θ_t}(y_u^+|χ_u^+) / π_ref(y_u^+|χ_u^+)) − β log (π_{θ_t}(y_u^−|χ_u^−) / π_ref(y_u^−|χ_u^−)) ) ],  (6)

where y_u^+ and y_u^− are pre-generated responses, and π_ref is the reference model, typically the result of SFT. χ_u^+ and χ_u^− share the same x but different y. In the loss function, the π_{θ_t} terms are also calculated using teacher forcing. Hence we can decompose the learning dynamics for DPO similarly to Equation (5):

[Δ log π_t(y|χ_o)]_m = −Σ_{l=1}^L η [A_t(χ_o)]_m ( [K_t(χ_o, χ_u^+)]_{m,l} [G_t^{DPO+}]_l − [K_t(χ_o, χ_u^−)]_{m,l} [G_t^{DPO−}]_l ) + O(η²),

G_t^{DPO+} = β(1−a) (π_{θ_t}(y|χ_u^+) − y_u^+);  G_t^{DPO−} = β(1−a) (π_{θ_t}(y|χ_u^−) − y_u^−),  (7)

where a is the margin term (i.e., the σ(·) value) for the l-th token, which represents how well the current policy separates y_u^+ and y_u^− compared with the reference policy. Due to the monotonicity of σ(·), a larger margin leads to a larger a, which in turn restrains the strength of G_t^{DPO+/−}.
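The β(1−a) scale on the DPO residual is easy to inspect in isolation. A toy scalar version (hypothetical helper name and inputs; real DPO computes the log-ratios from token log-probs under the policy and the reference model):

```python
import numpy as np

def sigmoid(m):
    return 1.0 / (1.0 + np.exp(-m))

def dpo_energy(beta, logratio_pos, logratio_neg):
    """Scale beta*(1 - a) multiplying the DPO residuals G_t^{DPO+/-},
    where a = sigma(beta * margin) and the margin is how much better the
    current policy separates y+ from y- than the reference does."""
    margin = logratio_pos - logratio_neg
    a = sigmoid(beta * margin)
    return beta * (1.0 - a)

# A pair the policy already separates well (positive margin) gets less
# energy than a pair it separates poorly (negative margin):
weak = dpo_energy(beta=0.1, logratio_pos=1.0, logratio_neg=-1.0)
strong = dpo_energy(beta=0.1, logratio_pos=-1.0, logratio_neg=1.0)
print(strong > weak)

# With a negative margin, a larger beta supplies more catch-up energy:
print(dpo_energy(1.0, -1.0, 1.0) > dpo_energy(0.1, -1.0, 1.0))
```

Both prints show the monotonicity claims in the text: well-separated pairs are automatically de-emphasized, and β controls how hard the model "catches up" when it lags the reference.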
In other words, G_t^{DPO+/−} automatically provides less energy on the examples that are already well separated. We then check the role of β, which controls the regularizing effect on the KL distance between π_{θ_t} and π_ref in the original RL loss (Rafailov et al. 2023). When the margin is negative, a larger β leads to a smaller a and hence provides a stronger G_t^{DPO+/−} for the model to "catch up" to the separating ability of the reference model faster. But when the model is better and has a positive margin, increasing β will increase a and hence create a negative influence on β(1−a), which makes the model update less. This aligns well with the claims of Rafailov et al. (2023): the stronger regularizing effect tends to "drag π_θ back towards π_ref" when its predictions deviate from π_ref too much. The derivation and the G_t functions for other RL-free methods are given in Appendix B.2.2.

These analyses make no assumptions on where y_u^+ and y_u^− come from. Hence our framework can be directly extended to on-policy RL-free methods, which often perform better than their off-policy counterparts (S. Guo, B. Zhang, et al. 2024; Tajwar et al. 2024). The main difference between these algorithms is how the supervisory responses are generated. Off-policy methods typically use a fixed pre-collected dataset, with y_u^+ and y_u^− generated by another LLM or humans. In other words, it is likely that both the chosen and rejected responses come from the "less likely" region of the model's prediction. On-policy responses, though, are more likely to have higher predicted probabilities under this model, as they were sampled from it. We will show next that imposing large negative pressure on an unlikely prediction leads to unexpected behavior, giving a new explanation for why on-policy sampling is important for algorithms with large negative gradients.
3.3 THE SQUEEZING EFFECT CAUSED BY NEGATIVE GRADIENT

As demonstrated by the first two panels in Figure 2, the use of large negative gradients is the main difference between the learning dynamics of SFT and DPO. We will show later that this difference is the key to understanding why the learning curves of SFT and DPO behave so differently.

[Figure 2: The updating vector provided by the residual term G_t of different algorithms (SFT; off-policy DPO and IPO; on-policy DPO and IPO; SPIN; SPPO; SLiC, triggered when the gap is smaller than δ). The gray y are responses sampled from π in an on-policy way. The second panel demonstrates the "squeezing effect" caused by imposing a big negative gradient on a "valley" region of a distribution; for more details about this counter-intuitive effect, please refer to Section 3.3 and Appendix E. Other panels demonstrate on-policy DPO (and IPO), SPIN (Z. Chen et al. 2024), SPPO (Y. Wu et al. 2024), and SLiC (Y. Zhao et al. 2023).]

For example, Pal et al. (2024), Rafailov et al. (2024), and Tajwar et al. (2024) (and our Figure 4) reported that the confidence of both y_u^+ and y_u^− gradually decreases when conducting DPO, while the confidence of y_u^+ rarely drops during SFT. This trend becomes more serious if π_{θ_0} is finetuned for longer before conducting DPO (reported in Figure 3 of Rafailov et al. (2024) and verified by our Figure 17). Furthermore, we also find that for all the π_{θ_t}(y|χ_u) we track (various responses similar to y_u^+ or y_u^−; details in the next section), none of them increase during the DPO phase. This is different from SFT and quite counter-intuitive: if everything we observe is becoming less confident, where has the probability mass gone?
To answer this question, we first identify a phenomenon we call the squeezing effect, which occurs when using negative gradients with any model outputting a distribution through a Softmax head, even in a simple multi-class logistic regression task. Specifically, in the L = 1 case, when we impose a negative gradient on label y_u^−, the change of the model's predictive distribution π_{θ_{t+1}} can be described as follows:

• Guarantee: the confidence of y_u^−, i.e., π_{θ_{t+1}}(y_u^−), will decrease.
• Guarantee: the decreased probability mass is largely "squeezed" into the output which was most confident before the update: if y* = argmax_{i ∈ [V]\y_u^−} π_{θ_t}(y = i), then π_{θ_{t+1}}(y = y*) will increase.
• Trend: the rich get richer and the poor get poorer: generally, dimensions with high π_{θ_t} tend to increase, and those with low π_{θ_t} tend to decrease.
• Trend: a peakier π_{θ_t} squeezes more. If the probability mass concentrates on few dimensions in π_{θ_t}, which is common for a pretrained model, all π_{θ_{t+1}}(y ≠ y*) decrease (only y* is "rich").
• Trend: a smaller π_{θ_t}(y_u^−) exacerbates the squeezing effect: if y_u^− is unlikely under π_{θ_t}, the probability mass of all other π_{θ_{t+1}}(y ≠ y*) decreases more seriously, and π_{θ_{t+1}}(y = y*) increases more.

Appendix E illustrates the squeezing effect and analytically proves its existence for logistic regression models, by directly computing π_{θ_{t+1}}/π_{θ_t} in different situations. Section 4.2 also experimentally verifies the analysis above in real LLM experiments. Note that in practical settings, where both positive and negative pressures and the auto-regressive nature are strictly considered, the squeezing effect can become more complicated. The differences between the two eNTK terms in Equation (7) also influence the relative strength and direction of these two pressures. Razin et al. (2025) analyze a similar problem in a token-level setting, and their conclusions align well with ours. We leave a more detailed analysis to future work.
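The L = 1 case is easy to reproduce with plain softmax algebra: take one gradient-ascent step on the NLL of an unlikely label and watch where the mass goes. A minimal sketch (toy logits and a unit learning rate, chosen for illustration; not the paper's experimental setup):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Peaky, pretrained-style distribution; y_minus sits in a "valley".
z = np.array([3.0, 1.0, 0.0, -1.0, -4.0])
y_minus, eta = 4, 1.0

pi = softmax(z)
y_star = int(np.argmax(pi))            # most confident label before update

# Negative gradient: gradient *ascent* on the NLL of y_minus, i.e.
# z <- z - eta * d log pi(y_minus)/dz, with gradient e_{y-} - pi.
grad = -pi
grad[y_minus] += 1.0
pi_new = softmax(z - eta * grad)

# Guarantees from Section 3.3:
print(pi_new[y_minus] < pi[y_minus])   # True: y- goes down
print(pi_new[y_star] > pi[y_star])     # True: mass squeezed into y*
# Trend: with a peaky pi, every other label also loses mass.
others = [i for i in range(len(z)) if i not in (y_minus, y_star)]
print(all(pi_new[i] < pi[i] for i in others))   # True
```

With these logits the already-dominant label absorbs essentially all of the mass pushed off y_u^−, while every intermediate label shrinks, exactly the "only y* is rich" regime described above.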
We can now get a high-level overview of the learning dynamics of a typical off-policy DPO algorithm. Since both y_u^+ and y_u^− are not sampled from the model's distribution, y* can sometimes be dissimilar to y_u^+, and y_u^− is likely located in a valley region of the model's prediction. Its learning dynamics would then look like the sketch in the second panel of Figure 2: the confidence on almost all y is pushed down. At the same time, all the decreased probability mass is squeezed into y*, which might make the model keep generating repeated phrases, as reported by Holtzman et al. (2020). Variants of DPO algorithms often unintentionally mitigate this squeezing effect by constraining the strength of the negative gradient or the positions of y_u^−, which partially explains their benefits. The last four panels in Figure 2 and Appendix B.2.2 have more details.

4 EXPERIMENTAL VERIFICATIONS

We now verify our analysis in practical settings. We first create the training set D_train by randomly selecting 5000 examples from the training split of the dataset. We consider two common datasets, Anthropic-HH (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), in all experiments. Each example in D_train contains three components: the prompt (or question) x, the preferred response y^+, and the less preferred response y^−. SFT finetunes with x and y^+, while DPO uses all three (subscripts of x and y are removed for conciseness). We repeat the experiments on two series of models: Pythia-410M/1B/1.4B/2.8B (Biderman et al. 2023) and Qwen1.5-0.5B/1.8B (J. Bai et al. 2023). To get a more detailed observation of the learning dynamics, we further create a probing dataset D_prob by randomly selecting 500 examples from D_train, and generate several typical responses based on the corresponding x, y^+, or y^−. (We also study another probing dataset where all x come from the test set in an ablation study in the appendix.)
Then for each x in D_prob, we can observe how log π_{θ_t}(y|χ) gradually changes on different types of y. For example, an extended response type can be a rephrase of y^+, an irrelevant response answering another question x', or just a randomly generated English sentence with the same number of words as y^+. We explain why we need these extended responses and how they are generated in detail in Appendix D.1. In short, D_prob helps us get a more fine-grained inspection of the learning dynamics, which can not only support our analysis above, but also shed more light on how the model's prediction evolves on the entire Y ∈ R^{V×L}, a very sparse and huge space.

4.1 LEARNING DYNAMICS OF SFT

The main lesson we learn from the analysis in Section 3.1 is that learning from y_u^+ not only increases the model's confidence on y_u^+, but also indirectly "pulls up" responses similar to y_u^+ with a smaller strength (scaled roughly by ‖K_t‖_F), similar to how learning a "4" influences the prediction of a "9" in the MNIST example. At the same time, the increase of π_{θ_t}(y_u^+|χ_u) naturally "pushes down" all y ≠ y_u^+, because the model's predicted probability over all responses in Y-space must sum to one. The model's behavior on different y is mostly a trade-off among these pressures. To verify this claim, we finetune the model for several epochs and evaluate the model's prediction on all responses in D_prob every 25 updates (with a training batch size of 4, the probing occurs every 100 examples). For each type of response, we average the model's confidence on all 500 examples and report the mean value of their log-likelihood. As demonstrated in the first panel of Figure 3, the model's confidence on y_u^+ keeps increasing throughout the whole learning process, which is straightforward because the main "pull-up" pressure is imposed directly on y_u^+. However, the behavior of some responses similar to y_u^+ is non-trivial.
For example, we draw the following types of responses in the same panel: the less preferred response for the same question (y_u^−), two types of rephrases of y_u^+ generated by ChatGPT (y_gpts^+ and y_gptf^+), another preferred response randomly selected from the test set (y_test^+), or even a randomly generated English sentence (y_hum). The model's confidence in these responses is all slightly increased at the beginning of training, and then gradually decreases as the training goes on, even though the model never sees them during SFT. This counter-intuitive behavior can be well explained by the learning dynamics we discussed before. Since all these examples are "similar" to y_u^+ to some extent (at least, they are all common "standard English" sentences), their ‖K_t‖_F are reasonably large. Then learning y_u^+ will indirectly increase the model's confidence on these similar y. That is why the corresponding π_{θ_t}(y|χ_u) are slightly increased at the beginning of training.

[Figure 3: First three panels: learning dynamics of SFT on different response types (chosen vs. rejected; non-human responses; whether y occurred during training). Fourth: SFT for 10 epochs then DPO using similar hyperparameters. Last: the accumulated influence when SFT using different y (full results in Appendix C and D).]

However, as the training goes
on, the model's confidence on $y^+_u$ keeps increasing while the update energy, the norm of $\mathcal{G}^t_{\text{SFT}}$ in Equation (5), gradually decreases. That means the indirect "pull-up" pressures also diminish accordingly. Then the "push-down" pressure on all $y \neq y^+_u$ becomes dominant, and all the related curves start going down. To verify the existence of this global "push-down" pressure, we observe two types of responses, both with the same number of words as their $y^+_u$. One is a purely random English word sequence, $y'_{\text{rnd}}$. The other is a random permutation of all the words in $y^+_u$, called $y^+_{\text{urnd}}$. Since neither is natural language, we expect the $\|\mathcal{K}^t\|_F$ between them and $y^+_u$ to be very small, which means learning from $y^+_u$ imposes almost no "pull-up" pressure on them; thus the "push-down" pressure dominates throughout the whole training procedure. These analyses are well supported by the second panel of Figure 3, where these $\pi_{\theta^t}(y \mid \chi_u)$ all start from a very small value and keep decreasing throughout training. Another interesting type of response is $y^+_{j\neq u}$, a preferred response for another question $x_{j\neq u}$ in the training set. For these responses, the model's prediction $\pi_{\theta^t}(y^+_{j\neq u} \mid \chi_u)$ is continually influenced by two "pull-up" pressures: one from learning $[x_u; y^+_u]$ and another from learning $[x_{j\neq u}; y^+_{j\neq u}]$, where the latter might be even stronger since the gradient is calculated directly by observing $y^+_{j\neq u}$. That explains why, in the third panel, the confidence on $y^+_{j\neq u}$ keeps increasing at a smaller rate than $y^+_u$: the "pull-up" pressure is always strong enough to counter the "push-down" one. These observations provide a unique explanation of why specific types of hallucinations are amplified after SFT.
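The two non-human probe responses can be constructed along the following lines (a minimal sketch operating on whitespace-separated words; the word list and the `make_probes` helper are illustrative placeholders, not the paper's exact generation procedure from Appendix D.1):

```python
import random

def make_probes(y_plus, vocabulary, seed=0):
    """Build the two non-human probes used to expose the global
    "push-down" pressure; both match y+ in word count."""
    rng = random.Random(seed)
    words = y_plus.split()
    # y_urnd: a random permutation of the words in y+; not natural
    # language, so its kernel similarity to y+ should be small.
    y_urnd = words[:]
    rng.shuffle(y_urnd)
    # y_rnd: purely random English words of the same length.
    y_rnd = [rng.choice(vocabulary) for _ in words]
    return " ".join(y_urnd), " ".join(y_rnd)

vocab = ["apple", "river", "quiet", "seven", "cloud", "metal"]
y_urnd, y_rnd = make_probes("the cat sat on the mat", vocab)
assert len(y_urnd.split()) == len(y_rnd.split()) == 6
```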
Specifically, the increase of $\pi_{\theta^t}(y^+_{j\neq u} \mid \chi_u)$ means that if we ask the model to answer a question $x_u$, it might provide a response taken (fully or partially) from another, unrelated question $x_{j\neq u}$ in the training set. Last, to further explore the "similarity" between different responses from the model's perspective, we SFT the model using more types of responses and observe how $\pi_\theta(y' \mid \chi_u)$ changes accordingly. The results are demonstrated in the last panel of Figure 3, where blue and orange represent positive and negative influence, respectively. The x-axis is the updating response, while the y-axis denotes the observing response; hence the first column shows how the confidence on different $[x_u; y']$ changes when we SFT the model using $[x_u; y^+_u]$. One interesting finding is that all responses generated by ChatGPT are considered very similar to each other, regardless of how semantically different they are. The LLM probably has its own preferred idioms or phrases, which could be considered a type of "fingerprint". We leave this interesting problem for future work.

4.2 LEARNING DYNAMICS OF OFF-POLICY DPO

To verify that our framework also explains the model's behavior in preference tuning, we conduct similar experiments for DPO. Recall that the residual term $\mathcal{G}^t_{\text{DPO}}$ introduces a pair of arrows, in different directions, on $y^+_u$ and $y^-_u$. To show how these two pressures influence the model, we check two types of rephrases of $y^+_u$ and of $y^-_u$ ($y^+_{\text{gpts}}$, $y^+_{\text{gptf}}$, $y^-_{\text{gpts}}$, and $y^-_{\text{gptf}}$, as used in the previous experiment). See the three curves in the first panel of Figure 4, where the two rephrases decrease at a similar speed, faster than the decay of $y^+_u$. That is because the upward pressure is imposed directly on $y^+_u$ rather than on these rephrases.
[Figure 4 shows average log-probability curves over training epochs; panel titles: "Chosen v.s. its rephrases", "Rejected v.s. its rephrases", "Chosen v.s. rejected", and "Where the probability mass gone?".]
Figure 4: Learning dynamics of off-policy DPO. The last panel verifies the existence of the squeezing effect.

Similarly, in the second panel, we observe that $y^-_u$ decays faster than its rephrases, because $\mathcal{G}^t_{\text{DPO}}$ directly imposes a negative pressure on $y^-_u$. In the third panel, the rephrases of $y^+_u$ consistently decay more slowly than those of $y^-_u$, although none of them ever occurs during training. That is because these responses are close to $y^+_u$ or $y^-_u$ in $\mathsf{Y}$, which means their $\|\mathcal{K}^t\|_F$ is relatively large; hence the pressures imposed on $y^+_u$ and $y^-_u$ also exert a non-negligible influence on them. Last, in the fourth panel, the margin $\pi_{\theta^t}(y^+_u \mid \chi_u) - \pi_{\theta^t}(y^-_u \mid \chi_u)$ keeps increasing, which means the model keeps gaining the ability to separate $y^+_u$ and $y^-_u$ as training goes on. Although $\mathcal{G}^t_{\text{DPO}}$ directly imposes a "pull-up" pressure on $y^+_u$, the value of $\pi_{\theta^t}(y^+_u \mid \chi_u)$ does not increase as much as it does in SFT. The downward arrow on $y^-_u$ does introduce a "push-down" pressure on responses similar to $y^-_u$, but that influence is unlikely to be strong enough (it is weakened by $\|\mathcal{K}^t\|_F$) to make the confidence on almost every observed response decrease so quickly, as demonstrated in the last panel of Figure 3, where we use a similar $\eta$ for both SFT and DPO. Then, where has the probability mass gone during DPO? The key to answering this question is the squeezing effect discussed in Section 3.3: since the big negative gradient is imposed on $y^-_u$, which at this point probably lies in a region of low $\pi_{\theta^t}(y \mid \chi_u)$, the confidence of most $y$ will decrease while $\pi_{\theta^t}(y^* \mid \chi_u)$ increases very fast.
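The squeezing effect can be reproduced on a single softmax, independently of any LLM: a step that descends $\log \pi_j$ for a low-probability index $j$ raises every other logit in proportion to its current probability, so the most confident prediction grows fastest. A toy sketch of this mechanism (not the full DPO loss):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# A peaked 5-way distribution; index 4 sits deep in the "valley".
z = np.array([3.0, 1.0, 0.5, 0.0, -4.0])
pi = softmax(z)
j = 4  # push down an already-unlikely outcome, as off-policy DPO does

# One gradient step descending L = log pi_j:
# d(log pi_j)/dz = e_j - pi, so z <- z - eta * (e_j - pi).
eta = 1.0
grad = -pi
grad[j] += 1.0
pi_new = softmax(z - eta * grad)

assert pi_new[j] < pi[j]        # the targeted valley response drops...
assert pi_new.max() > pi.max()  # ...and mass squeezes onto the argmax
assert pi_new[1] < pi[1]        # even untargeted mid-probability y fall
```

Every non-target logit rises by $\eta\pi_i$, so the argmax gains the most; after renormalization, almost all other outcomes lose mass, matching the last panel of Figure 4.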
To verify this, we report the log-likelihood of the $y$ chosen by greedy decoding: each token is chosen by maximizing the conditional probability given $[x_u; y^+_{<l}]$, where $y^+_{<l}$ is a prefix of $y^+_u$. As illustrated by the last panel of Figure 4, the confidence of this "teacher-forced" greedy $y$ increases very fast (from -113 to -63 within 8 epochs), even faster than the increase of $\pi_{\theta^t}(y^+_u \mid \chi_u)$ during SFT (from -130 to -90). However, the tokens with the highest confidence do not necessarily form a preferred response: they instead reinforce the prior bias in $\theta^0$. This could be a reasonable explanation of the "degeneration" reported in earlier work (e.g., Holtzman et al. 2020): as $\pi_{\theta^t}$ becomes peakier at its most confident predictions, it becomes easier to sample sequences with repeated phrases. Such behavior could also be understood as a special type of self-bias amplification (Ren et al. 2024), which would have more serious consequences if combined with a multiple-generation self-improving algorithm, e.g., self-reward (Yuan et al. 2024), iterative DPO (Xiong et al. 2024), etc. In summary, the behaviors of the different response types all match our analyses well, and the more subtle trends also support our story (both for SFT and DPO). Due to space constraints, we explain these (and the full results on other models and datasets) in Appendix D.

4.3 MITIGATING THE SQUEEZING EFFECT BY AUGMENTING THE TRAINING SET FOR SFT

Since the "squeezing effect" caused by the big negative gradient on unlikely predictions can damage the model's performance during DPO, we can first train the model on both $[x_u; y^+_u]$ and $[x_u; y^-_u]$ during the SFT stage (making the negative response more likely), and then run the usual DPO. Following the analysis above, we can expect that during this new SFT stage, the region of responses similar to $y^+_u$ or $y^-_u$ is "pulled up" simultaneously.
This is what we want because, in many cases, both $y^+_u$ and $y^-_u$ are reasonably good responses to the question $x_u$; the new SFT design hence helps pull up a larger region that contains more suitable responses, compared with the baseline SFT. After that, the "push-down" pressure imposed during DPO can efficiently decrease the model's confidence on $y^-_u$ and its similar responses. Since $y^-_u$ is no longer so unlikely before DPO, the squeezing effect should not be as strong as in the baseline procedure.

Figure 5: Learning dynamics of the baseline and the proposed method with training-data extension. Key trends to observe: 1) the baseline and the extend method behave similarly on $y^+_u$ during SFT; 2) the extend method considerably increases $y^-_u$ during SFT; 3) the squeezing effect of the extend method is weaker (all other responses decay more slowly, and the confidence on the "greedy-decoding" response increases more slowly).

Table 1: Win rate against baseline.

DPO Ep.   ChatGPT   Claude
0         0.4729    0.4679
2         0.6518    0.5151
4         0.6928    0.6045
6         0.6667    0.5432

We call our training pipeline "extend" and compare its learning dynamics with the baseline setting in Figure 5. It is clear that the squeezing effect is mitigated: the confidence of all other responses decays more slowly during DPO, and we also observe a big drop in the greedy-decoding response when DPO starts. To further show that mitigating the squeezing effect indeed brings benefits, we compare the responses generated by models trained using the different methods by feeding them to ChatGPT and Claude 3.
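The "extend" SFT stage amounts to applying the standard auto-regressive NLL to both responses of each preference pair. A minimal NumPy sketch (random toy logits stand in for teacher-forced model outputs; `extend_sft_loss` is an illustrative name, not the paper's code):

```python
import numpy as np

def nll(logits, token_ids):
    """Teacher-forced negative log-likelihood of one response."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -sum(log_probs[l, t] for l, t in enumerate(token_ids))

def extend_sft_loss(logits_pos, y_pos, logits_neg, y_neg):
    """Baseline SFT trains only on [x; y+]; the "extend" stage also
    trains on [x; y-], so y- does not start DPO in a valley region."""
    return nll(logits_pos, y_pos) + nll(logits_neg, y_neg)

# Toy example with V = 4, L = 2.
rng = np.random.default_rng(1)
lp = rng.normal(size=(2, 4))
ln = rng.normal(size=(2, 4))
loss = extend_sft_loss(lp, [0, 2], ln, [3, 1])
assert loss > nll(lp, [0, 2])  # adds the y- term on top of baseline SFT
```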
Specifically, we first SFT the model for two epochs using the two methods discussed above, calling the resulting policy networks $\pi_{\text{base}}$ and $\pi_{\text{extend}}$. Then, we conduct identical DPO training on both $\pi_{\text{base}}$ and $\pi_{\text{extend}}$ for several epochs. The win rate of the proposed method against the baseline is provided in Table 1. Before DPO, $\pi_{\text{base}}$ is clearly better, because $\pi_{\text{extend}}$ is explicitly trained on those $y^-$. However, $\pi_{\text{extend}}$ performs better after several epochs of DPO, since the squeezing effect is efficiently mitigated. Please refer to Appendix F for more details. In the future, this simple method inspired by our analysis could be further improved by introducing more responses during both stages, e.g., rephrases of $y^+_u$, and by combining it with the many existing RL-free methods mentioned before.

5 CONCLUSION

Learning dynamics, which depict how the model's predictions change as it learns new examples, provide a powerful tool for analyzing the behavior of models trained with gradient descent. To better utilize this tool in the context of LLM finetuning, we first derive the step-wise decomposition of LLM finetuning for various common algorithms. Then, we propose a unified framework for understanding the behavior of LLM predictions across different finetuning methods. The proposed analysis successfully explains various phenomena during an LLM's instruction tuning and preference tuning, some of which are quite counter-intuitive. We also shed light on how specific hallucinations are introduced in the SFT stage, as previously observed (Gekhman et al. 2024), and on where the improvements of some new RL-free algorithms come from compared with vanilla off-policy DPO. The analysis of the squeezing effect also has the potential to apply to other deep learning systems that impose big negative gradients on already-unlikely outcomes.
Finally, inspired by this analysis, we propose a simple (but counter-intuitive) method that is effective in improving the alignment of models.

ACKNOWLEDGEMENTS

This research was enabled in part by support provided by the Canada CIFAR AI Chairs program, WestGrid, and Compute Canada. We thank Shangmin Guo, Noam Razin, Wonho Bae, and Hamed Shirzad for their valuable discussions and feedback. We also appreciate the constructive comments from the anonymous reviewers, which helped improve this work.

REFERENCES

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang (2019). "On Exact Computation with an Infinitely Wide Neural Net." NeurIPS. arXiv:1904.11955.
Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello (2024). "A general theoretical paradigm to understand learning from human preferences." AISTATS. PMLR, pages 4447–4455.
Jinze Bai, Shuai Bai, et al. (2023). "Qwen Technical Report." arXiv:2309.16609.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. (2022). "Training a helpful and harmless assistant with reinforcement learning from human feedback." arXiv:2204.05862.
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. (2023). "Pythia: A suite for analyzing large language models across training and scaling." ICML. PMLR, pages 2397–2430.
Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin (2020). "Language GANs Falling Short." ICLR.
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu (2024).
"Self-play fine-tuning converts weak language models to strong language models." arXiv:2401.01335.
Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend (2020). "On the Weaknesses of Reinforcement Learning for Neural Machine Translation." ICLR.
Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei (2017). "Deep reinforcement learning from human preferences." NeurIPS.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun (2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." arXiv:2310.01377.
Zhun Deng, Hangfeng He, and Weijie Su (2021). "Toward better generalization bounds with locally elastic stability." ICML. PMLR, pages 2590–2600.
Dan Feldman (2020). "Introduction to core-sets: an updated survey." arXiv:2011.09384.
Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan (2019). "Stiffness: A new perspective on generalization in neural networks." arXiv:1901.09491.
Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig (2024). "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" arXiv:2405.05904.
Shangmin Guo, Yi Ren, Stefano V. Albrecht, and Kenny Smith (2024). "lpNTK: Better Generalisation with Less Data via Sample Interaction During Learning." ICLR.
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. (2024). "Direct language model alignment from online AI feedback." arXiv:2402.04792.
Hangfeng He and Weijie Su (2020). "The Local Elasticity of Neural Networks." ICLR.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean (2015). "Distilling the knowledge in a neural network." arXiv:1503.02531.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi (2020). "The Curious Case of Neural Text Degeneration." ICLR.
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. (2023). "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions." arXiv:2311.05232.
Arthur Jacot, Franck Gabriel, and Clément Hongler (2018). "Neural tangent kernel: Convergence and generalization in neural networks." NeurIPS.
Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, and Minlie Huang (2024). "Towards Efficient and Exact Optimization of Language Model Alignment." arXiv:2402.00856.
Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang (2022). "Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution." ICLR.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11, pages 2278–2324.
Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda (2020). "Early-learning regularization prevents memorization of noisy labels." NeurIPS.
Mohamad Amin Mohamadi, Wonho Bae, and Danica J. Sutherland (2022). "Making Look-Ahead Active Learning Strategies Feasible with Neural Tangent Kernels." NeurIPS. arXiv:2206.12569.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS, pages 27730–27744.
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White (2024). "Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive." arXiv:2402.13228.
Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, and Hidenori Tanaka (2024).
"Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space." NeurIPS.
Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan (2020). "Estimating training data influence by tracing gradient descent." NeurIPS.
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn (2024). "From r to Q*: Your Language Model is Secretly a Q-Function." arXiv:2404.12358.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn (2023). "Direct preference optimization: Your language model is secretly a reward model." NeurIPS.
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, and Boris Hanin (2025). "Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization." ICLR.
Yi Ren, Shangmin Guo, Wonho Bae, and Danica J. Sutherland (2023). "How to prepare your task head for finetuning." ICLR.
Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B. Cohen, and Simon Kirby (2020). "Compositional languages emerge in a neural iterated learning model." ICLR.
Yi Ren, Shangmin Guo, Linlu Qiu, Bailin Wang, and Danica J. Sutherland (2024). "Bias Amplification in Language Model Evolution: An Iterated Learning Perspective." NeurIPS.
Yi Ren, Shangmin Guo, and Danica J. Sutherland (2022). "Better Supervisory Signals by Observing Learning Paths." ICLR.
Yi Ren, Samuel Lavoie, Michael Galkin, Danica J. Sutherland, and Aaron C. Courville (2023). "Improving compositional generalization using iterated learning and simplicial embeddings." NeurIPS.
Yi Ren and Danica J. Sutherland (2024). "Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics." NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov (2017). "Proximal policy optimization algorithms." arXiv:1707.06347.
Burr Settles (2009). "Active learning literature survey."
Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar (2024). "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data." arXiv:2404.14367.
Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, et al. (2024). "Understanding the performance gap between online and offline alignment algorithms." arXiv:2405.08448.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (2017). "Attention Is All You Need." NeurIPS. arXiv:1706.03762.
Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros (2018). "Dataset distillation." arXiv:1811.10959.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le (2022). "Finetuned Language Models are Zero-Shot Learners." ICLR.
Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu (2024). "Self-play preference optimization for language model alignment." arXiv:2405.00675.
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen (2024). "Less: Selecting influential data for targeted instruction tuning." arXiv:2402.04333.
Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang (2024). "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint." ICML.
Zhaorui Yang, Qian Liu, Tianyu Pang, Han Wang, Haozhe Feng, Minfeng Zhu, and Wei Chen (2024).
"Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning." arXiv:2402.13669.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston (2024). "Self-rewarding language models." arXiv:2401.10020.
Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei (2024). "Negative preference optimization: From catastrophic collapse to effective unlearning." arXiv:2404.05868.
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. (2023). "Siren's song in the AI ocean: a survey on hallucination in large language models." arXiv:2309.01219.
Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu (2023). "Slic-hf: Sequence likelihood calibration with human feedback." arXiv:2305.10425.

A MORE RELATED WORKS

A.1 MORE ABOUT LEARNING DYNAMICS

Beyond their application to LLMs, learning dynamics are widely utilized in analyzing various machine learning problems. For example, if we consider $x_u$ from the training set and $x_o$ from the test set, this form of learning dynamics provides a new perspective on generalization: the model generalizes better if the loss of $f_\theta(x_o)$ keeps decreasing as it learns from $x_u$. By studying the influence of different $x_u$ at different stages of supervised learning, Ren et al. (2022) explain a "zigzag" pattern in the learning path, which sheds light on why the model can spontaneously pursue better supervisory signals and correct noisy labels in the early stage of training (see also S. Liu et al. 2020). Kumar et al. (2022) and Ren, S. Guo, et al. (2023) apply learning dynamics to explain why directly finetuning a well-trained backbone with a randomly initialized task head might harm out-of-distribution generalization. Ren et al. (2020), Ren, Lavoie, et al.
(2023), and Ren and Sutherland (2024) also explain where the simplicity bias favoring compositional representations comes from during knowledge distillation (Hinton et al. 2015), providing a new perspective on why successive knowledge transfer can improve a model's systematic generalization ability. The network's local elasticity (He and Su 2020) and stiffness (Fort et al. 2019) are also correlated with this topic: neural networks operate like adaptive local learners, influencing only nearby points in feature space during training. This gives them a unique edge over linear models in terms of memorization, stability, and the emergence of meaningful internal structure, all without explicit regularization. Deng et al. (2021) further link this behavior to the model's generalization ability. Extending their theoretical framework to more complicated settings like LLM finetuning might be a promising direction. Besides explaining the model's behavior, learning dynamics are also helpful for evaluating the quality or effectiveness of different training samples. For example, Pruthi et al. (2020) propose a quantitative metric called TracIn to compute the influence of a training example on the predictions made by the model. This metric is then applied by Xia et al. (2024) to search for the most influential examples in LLM instruction finetuning. By expanding Equation (1) in the neural tangent kernel (NTK) regime, S. Guo, Ren, et al. (2024) propose a metric called lpNTK to measure the relative difficulty among different training samples. These metrics and analyses inspired by learning dynamics are expected to be helpful in many related fields, such as coreset selection (Feldman 2020), active learning (Settles 2009; see, e.g., Mohamadi et al. 2022), and dataset distillation (T. Wang et al. 2018).
A.2 MORE ABOUT LLM'S FINETUNING

In this paper, we broadly define finetuning as any in-weight learning on top of a pretrained base model, including supervised finetuning (SFT), direct preference optimization (DPO, Rafailov et al. 2023) and its variants, etc. Since the analysis throughout this paper relies on the "teacher forcing" mechanism and the relatively stable eNTK assumption, our framework cannot be directly applied to algorithms with token-wise supervision, such as reinforcement learning from human feedback (RLHF, Ouyang et al. 2022) and proximal policy optimization (PPO, Schulman et al. 2017). We leave the study of token-wise learning dynamics, which aligns better with the "squeezing effect" in real settings, to future work. We also identify several related works that report observations similar to the phenomena discussed in this paper. For example, Gekhman et al. (2024) and Yue Zhang et al. (2023) mention that learning new facts during SFT tends to make the model hallucinate more, which aligns with our finding that the model tends to use $y^+_{j\neq i}$ when answering question $i$. Holtzman et al. (2020) relate the peakiness of the model's distribution to the LLM's "repeater" phenomenon, which also indirectly supports our claims: more DPO leads to a more serious squeezing effect, hence the model's prediction becomes peakier on most tokens, which makes the aforementioned phenomena more common. Furthermore, the decay of confidence on $y^+_u$ has attracted particular attention in the community, because it is quite counter-intuitive and vanilla off-policy DPO works reasonably well in most cases. Many related works study this phenomenon by analyzing the major discrepancy between off-policy DPO and PPO, i.e., where the samples used to train the model come from, e.g., S. Guo, B. Zhang, et al. (2024), Rafailov et al. (2024), and Tang et al. (2024).
They showed that when the responses are sampled off-policy, the learning process may fail to benefit from the contrastive information in the data; in other words, we should be more careful when working in the "valley" region of the model's distribution. Other works analyze this problem by inspecting the token-level influence between responses. For example, Pal et al. (2024) assume $y^+_u$ and $y^-_u$ are identical except for one token; under this assumption, the model's confidence on $y^+_u$ after the identical token is guaranteed to decrease. They propose a solution that significantly increases the learning rate of the positive part (roughly 50x larger, with their $\lambda = 50$) when $y_u$ is detected in a low-confidence region. Razin et al. (2025) take the similarity between the hidden embeddings and the geometry of the readout layer of different responses into account. Most of the conclusions of their paper align well with ours. The main discrepancy lies in the squeezing-effect part, which we will discuss in future work (the two analyses do not contradict each other, but a more detailed analysis is needed to understand the whole story).

A.3 BENIGN AND HARMFUL NEGATIVE GRADIENT

The "squeezing effect" is harmful when a big negative gradient is imposed in a valley region of the model's distribution. However, a well-regulated negative gradient is both beneficial and commonly observed in many deep learning systems; for example, it is common in many "machine unlearning" algorithms, e.g., R. Zhang et al. (2024). Moreover, even in LLM finetuning, many popular algorithms contain mechanisms that mitigate this effect. For example, the typical learning rate for DPO is usually smaller than that used for SFT, which unintentionally mitigates the harmful squeezing effect. The on-policy counterparts of DPO-like algorithms are shown to perform better than their off-policy versions, which also supports our claims.
Furthermore, we find that the PPO loss automatically avoids imposing a big negative gradient (when its $\hat{A}^t$ is negative) on the valley region (when its $\pi_\theta$ is small). On the other hand, the effect that negative gradients make the model's distribution peakier has been independently reported in related work. For example, Equation 1 in Caccia et al. (2020) shows that a standard GAN loss minimizes a negative term, which might explain why peakiness occurs there. Furthermore, in Tables 1 and 2 of Choshen et al. (2020), the peakiness (measured by $\Delta p_{\text{top10}}$ and $\Delta p_{\text{mode}}$) of the "PG-average" method is stronger than that of the standard PG method. Note that "PG-average" maps a reward ranging from 0 to 1 to a centered one ranging from -0.5 to 0.5; since the negative reward can introduce a negative gradient, the peakiness increases.

B PROOF OF PROPOSITIONS AND RESIDUAL TERM FOR DIFFERENT LOSSES

B.1 PROOF OF PROPOSITION 1

Proposition 1. Let $\pi = \mathrm{Softmax}(z)$ and $z = h_\theta(x)$. The one-step learning dynamics decompose as

$\Delta \log \pi^t(y \mid x_o) = -\eta\, \mathcal{A}^t(x_o)\, \mathcal{K}^t(x_o, x_u)\, \mathcal{G}^t(x_u, y_u) + O(\eta^2 \|\nabla_\theta z(x_u)\|_{\mathrm{op}}^2), \qquad (3)$

where $\Delta \log \pi^t(y \mid x_o)$ and $\mathcal{G}^t$ are $V \times 1$ while $\mathcal{A}^t$ and $\mathcal{K}^t$ are $V \times V$; $\mathcal{A}^t(x_o) = \nabla_z \log \pi_{\theta^t}(x_o) = I - \mathbf{1}\pi_{\theta^t}^\top(x_o)$; $\mathcal{K}^t(x_o, x_u) = (\nabla_\theta z(x_o)|_{\theta^t})(\nabla_\theta z(x_u)|_{\theta^t})^\top$ is the empirical neural tangent kernel of the logit network $z$; and $\mathcal{G}^t(x_u, y_u) = \nabla_z \mathcal{L}(x_u, y_u)|_{z^t}$.

Proof.¹ Suppose we want to observe the model's prediction on an "observing example" $x_o$. Starting from Equation (2), we first approximate $\log \pi^{t+1}(y \mid x_o)$ using a first-order Taylor expansion (we use $\pi^t$ to represent $\pi_{\theta^t}$ interchangeably for conciseness):

$\log \pi^{t+1}(y \mid x_o) = \log \pi^t(y \mid x_o) + \langle \nabla_\theta \log \pi^t(y \mid x_o), \theta^{t+1} - \theta^t \rangle + O(\|\theta^{t+1} - \theta^t\|^2).$
Then, assuming the model updates its parameters using SGD on an "updating example" $(x_u, y_u)$, we can rearrange terms:

$\Delta \log \pi^t(y \mid x_o) = \log \pi^{t+1}(y \mid x_o) - \log \pi^t(y \mid x_o) = \nabla_\theta \log \pi^t(y \mid x_o)|_{\theta^t}\,(\theta^{t+1} - \theta^t) + O(\|\theta^{t+1} - \theta^t\|^2),$

where the first factor is $V \times d$, the second is $d \times 1$, and $d$ is the number of parameters of the model. To evaluate the leading term, we plug in the definition of SGD and repeatedly use the chain rule:

$\nabla_\theta \log \pi^t(y \mid x_o)|_{\theta^t}\,(\theta^{t+1} - \theta^t)$
$\quad = \nabla_z \log \pi^t(x_o)|_{z^t}\, \nabla_\theta z^t(x_o)|_{\theta^t}\, \big({-\eta}\, \nabla_\theta \mathcal{L}(x_u)|_{\theta^t}\big)^\top$
$\quad = \nabla_z \log \pi^t(x_o)|_{z^t}\, \nabla_\theta z^t(x_o)|_{\theta^t}\, \big({-\eta}\, \nabla_z \mathcal{L}(x_u)|_{z^t}\, \nabla_\theta z^t(x_u)|_{\theta^t}\big)^\top$
$\quad = -\eta\, \nabla_z \log \pi^t(x_o)|_{z^t}\, \nabla_\theta z^t(x_o)|_{\theta^t}\, \big(\nabla_\theta z^t(x_u)|_{\theta^t}\big)^\top \big(\nabla_z \mathcal{L}(x_u)|_{z^t}\big)^\top$
$\quad = -\eta\, \mathcal{A}^t(x_o)\, \mathcal{K}^t(x_o, x_u)\, \mathcal{G}^t(x_u, y_u). \qquad (8)$

For the higher-order term, using as above that $\theta^{t+1} - \theta^t = -\eta\, (\nabla_\theta z^t(x_u)|_{\theta^t})^\top \mathcal{G}^t(x_u, \hat{y})$, and noting that the residual term $\mathcal{G}^t$ is usually bounded (practical algorithms also use gradient clipping to avoid overly large gradients), we have

$O(\|\theta^{t+1} - \theta^t\|^2) = O\big(\eta^2\, \|(\nabla_\theta z^t(x_u)|_{\theta^t})^\top\|_{\mathrm{op}}^2\, \|\mathcal{G}^t(x_u, \hat{y})\|_{\mathrm{op}}^2\big) = O\big(\eta^2\, \|\nabla_\theta z(x_u)\|_{\mathrm{op}}^2\big).$

In the decomposition, writing $\pi_1, \dots, \pi_V$ for the model's prediction on the different dimensions, $\mathcal{A}^t$ is:

$\mathcal{A}^t(x_o) = I - \mathbf{1}(\pi^t)^\top = \begin{pmatrix} 1-\pi_1 & -\pi_2 & \cdots & -\pi_V \\ -\pi_1 & 1-\pi_2 & \cdots & -\pi_V \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_1 & -\pi_2 & \cdots & 1-\pi_V \end{pmatrix}. \qquad (9)$

The second term in this decomposition, $\mathcal{K}^t(x_o, x_u)$, is the product of the gradients at $x_o$ and $x_u$. Intuitively, if these gradients have similar directions, the Frobenius norm of this matrix is large, and vice versa. This matrix is known as the empirical neural tangent kernel, and it can change over the course of training as the network's notion of "similarity" evolves.
For appropriately initialized, very wide networks trained with very small learning rates, $K_t$ remains almost constant during the course of training; the kernel it converges to is known as the neural tangent kernel (Arora et al. 2019; Jacot et al. 2018). Note that the assumption that $K_t(x_o, x_u)$ is unchanged (common in theoretical analyses) might be too strong for LLM finetuning. Hence, as stated in the main text, our qualitative analysis only assumes that "during training, the relative influence of learning $x_u$ on all other $x_o$ is relatively stable". We validate this assumption with experiments in Appendix C.

¹ Note that this proposition assumes $L = 1$. For the $L > 1$ case, we have multiple task heads, which leads to $L$ different copies of Equation (3). The $V \times L$ matrix $\Delta \log \pi_t$ can then be obtained by stacking them.

B.2  RESIDUAL TERM FOR DIFFERENT LLM FINETUNING ALGORITHMS

As stated in Section 3, one of the conundrums in decomposing the learning dynamics of an LLM is the auto-regressive nature of the output sequence. Unlike the multi-label classification problem, where each $y_l$ is generated independently as long as the shared network is fixed, the $y_l$ of an LLM's output depends on $y_{<l}$, which is usually sampled iteratively from the model's prediction. However, in most finetuning settings where the supervisory signal $y_u$ is given, the model applies the so-called "teacher forcing" mechanism when calculating the predictive probabilities. In other words, when generating each $y_l$, the prefix $y_{<l}$ is given rather than sampled on-policy. This mechanism makes it possible to define $\chi = [x; y]$ and hence merge the auto-regressive nature of sequence prediction into the shared $K_t(\chi_o, \chi_u)$. After this step, the decomposition of the LLM finetuning learning dynamics becomes similar to that of a multi-label classification task.
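To illustrate the teacher-forcing point, here is a minimal sketch (our own, with made-up shapes): once the logits for $\chi = [x; y]$ are computed under a causal mask, the per-position NLL terms decouple, so each column of $G_t$ can be read off independently:

```python
import numpy as np

rng = np.random.default_rng(1)
L_len, V = 4, 7
z = rng.normal(size=(L_len, V))        # logits at each position of chi = [x; y]
y = rng.integers(0, V, size=L_len)     # teacher-forced targets (given, not sampled)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Sequence NLL under teacher forcing: a sum of independent per-position terms
nll = lambda z: -log_softmax(z)[np.arange(L_len), y].sum()

# G_SFT stacks pi(y_l | chi) - e_{y_l} over positions (rows here, columns in the text)
G = np.exp(log_softmax(z))
G[np.arange(L_len), y] -= 1.0

# Finite-difference check of one entry: positions decouple given the logits
eps, (l, v) = 1e-6, (2, 3)
dz = np.zeros_like(z); dz[l, v] = eps
assert abs((nll(z + dz) - nll(z - dz)) / (2 * eps) - G[l, v]) < 1e-6
```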
B.2.1  INSTRUCTION FINETUNING USING THE AUTO-REGRESSIVE LOSS (SFT)

Here we derive the residual term $G_t$ for the different algorithms used in LLM finetuning. We first restate Equation (5):
$$[\Delta \log \pi_t(y \mid \chi_o)]_m = -\sum_{l=1}^{L} \eta\, [A_t(\chi_o)]_m\, [K_t(\chi_o, \chi_u)]_{m,l}\, [G_t(\chi_u)]_l + O(\eta^2),$$
where $m \in \{1, \dots, M\}$, $l \in \{1, \dots, L\}$, and $G_t(\chi_u) = \nabla_z \mathcal{L}(\chi_u)|_{z_t}$ is a $V \times L$ matrix. Since the auto-regressive nature of the SFT loss is already encoded in the causal mask used in $h_\theta$, as demonstrated in Figure 10a, the columns of $G_t(\chi_u)$ are independent of each other and can be calculated separately. Moreover, the summation over $l$ can be achieved by left-multiplying by a length-$L$ all-ones vector $\mathbf{1}$. Specifically, the SFT loss for each $l$ is
$$[\mathcal{L}_{\mathrm{SFT}}(\chi_u)]_l = -\log \pi(y_l = y^+_{u,l} \mid \chi_u) = -e_{y^+_{u,l}}^\top \log \pi(y_l \mid \chi_u) = -e_{y^+_{u,l}}^\top \log\big(\mathrm{Softmax}(z_l)\big),$$
where $y^+_{u,l}$ denotes the $l$-th token of $y^+_u$. The gradient of $\mathcal{L}$ with respect to $z$ can then be calculated as
$$[G_t^{\mathrm{SFT}}(\chi_u)]_l = \nabla_{z_l} [\mathcal{L}_{\mathrm{SFT}}(\chi_u)]_l = \big(\nabla_\pi [\mathcal{L}_{\mathrm{SFT}}(\chi_u)]_l\big)^\top \nabla_{z_l} \pi = -\big(e_{y^+_{u,l}} \oslash \pi\big)^\top \nabla_{z_l} \pi = \pi(y_l \mid \chi_u) - e_{y^+_{u,l}}, \tag{10}$$
where $\oslash$ denotes element-wise division. To derive the equation above, first recall that the NLL loss of the $l$-th token is $[\mathcal{L}_{\mathrm{SFT}}]_l \triangleq \mathcal{L} = -\log \pi(y_l = y^+_l) = -e_{y^+_l}^\top \log \pi$, where $\pi = \mathrm{Softmax}(z)$. Then $\nabla_z \mathcal{L} = \nabla_\pi \mathcal{L}\; \nabla_z \pi$. For each dimension of $\nabla_\pi \mathcal{L}$, we have $\frac{\partial \mathcal{L}}{\partial \pi_i} = 0$ if $i \neq y^+_l$ and $\frac{\partial \mathcal{L}}{\partial \pi_i} = -\frac{1}{\pi_i}$ if $i = y^+_l$. In vector form, $\nabla_z \mathcal{L} = -\big(e_{y^+_l} \oslash \pi\big)^\top \nabla_z \pi$. For $\nabla_z \pi$, we have
$$\nabla_z \pi = \begin{pmatrix} \pi_1(1-\pi_1) & -\pi_2\pi_1 & \cdots & -\pi_V\pi_1 \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & \cdots & -\pi_V\pi_2 \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_1\pi_V & -\pi_2\pi_V & \cdots & \pi_V(1-\pi_V) \end{pmatrix}.$$
The $1 \times V$ vector $\big(e_{y^+_l} \oslash \pi\big)^\top$ has a single non-zero entry, $\frac{1}{\pi_k}$ at position $k = y^+_l$. Left-multiplying by this vector therefore selects the $k$-th row of $\nabla_z \pi$ and scales it by $\frac{1}{\pi_k}$.
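As a sanity check on the two pieces used above (a sketch of our own, with a toy vocabulary size), the softmax Jacobian and the row-selection step can be verified with finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
V = 6
z = rng.normal(size=V)
pi = softmax(z)

# Analytic Jacobian of the text: entry (i, j) is pi_i (delta_ij - pi_j)
J = np.diag(pi) - np.outer(pi, pi)

# Finite-difference Jacobian, column by column
eps = 1e-6
J_fd = np.column_stack([
    (softmax(z + eps * np.eye(V)[j]) - softmax(z - eps * np.eye(V)[j])) / (2 * eps)
    for j in range(V)
])
assert np.abs(J - J_fd).max() < 1e-9

# Left-multiplying by -(e_k / pi_k)^T selects row k of the Jacobian scaled by -1/pi_k,
# which collapses to the pi - e_k form of the NLL gradient
k = 3
grad = -(np.eye(V)[k] / pi[k]) @ J
assert np.allclose(grad, pi - np.eye(V)[k])
```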
In summary, we have
$$\nabla_z \mathcal{L} = -\frac{1}{\pi_k}\,\big[-\pi_k\pi_1,\ -\pi_k\pi_2,\ \dots,\ \pi_k(1-\pi_k),\ \dots,\ -\pi_k\pi_V\big]^\top = \big[\pi_1,\ \pi_2,\ \dots,\ \pi_k - 1,\ \dots,\ \pi_V\big]^\top = \pi - e_k.$$
By stacking these terms for the different $l \in [L]$, we get
$$G_t^{\mathrm{SFT}}(\chi_u) = \nabla_z \mathcal{L}_{\mathrm{SFT}}(\chi_u)|_{z_t} = \pi_{\theta_t}(y \mid \chi_u) - y^+_u. \tag{11}$$

B.2.2  DIFFERENT PREFERENCE FINETUNING ALGORITHMS

Direct Preference Optimization (DPO, Rafailov et al. (2023)) is usually considered the first RL-free alignment algorithm for preference finetuning. Different from standard RLHF (reinforcement learning from human feedback, Christiano et al. (2017)), the training of off-policy DPO is more similar to SFT: the model keeps learning from a pre-generated preference dataset. Hence, we start from DPO to analyze the learning dynamics of different preference finetuning algorithms (the on-policy versions of these algorithms can also be explained by the proposed framework). Following Rafailov et al. (2023), the DPO training loss is
$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x_u, y^+_u, y^-_u)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)} - \beta\log\frac{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)}\right)\right]. \tag{12}$$
Before calculating the residual term $G_t^{\mathrm{DPO}}$, we need to re-derive the learning dynamics decomposition, because the loss now depends on both $\pi_{\theta_t}(y^+_u \mid \chi^+_u)$ and $\pi_{\theta_t}(y^-_u \mid \chi^-_u)$, which involve two different $z$ terms. Specifically, we define $\pi_{\theta_t}(y^+_u \mid \chi^+_u) = \mathrm{Softmax\_column}(z^+)$ and $\pi_{\theta_t}(y^-_u \mid \chi^-_u) = \mathrm{Softmax\_column}(z^-)$, where $z^+ = h_\theta(\chi^+_u)$ and $z^- = h_\theta(\chi^-_u)$ respectively ($\chi^+_u = [x_u; y^+_u]$ and $\chi^-_u = [x_u; y^-_u]$).
Then, starting from $L = 1$, the decomposition for the DPO loss (analogous to Equation (8) for SFT) can be written as
$$\begin{aligned}
\nabla_\theta \log\pi_t(\chi_o)|_{\theta_t}\,\Delta\theta_t
&= \nabla_z \log\pi_t(\chi_o)|_{z_t}\;\nabla_\theta z_t(\chi_o)|_{\theta_t}\;\big({-\eta}\,\nabla_\theta \mathcal{L}(x_u, y^+_u, y^-_u)|_{\theta_t}\big)^\top \\
&= \nabla_z \log\pi_t(\chi_o)|_{z_t}\;\nabla_\theta z_t(\chi_o)|_{\theta_t}\;\big({-\eta}\,\nabla_{[z^+;z^-]}\mathcal{L}|_{z_t}\,\big[\nabla_\theta z^+(\chi^+_u);\ \nabla_\theta z^-(\chi^-_u)\big]\big|_{\theta_t}\big)^\top \\
&= -\eta\,\nabla_z \log\pi_t(\chi_o)|_{z_t}\;\Big[\nabla_\theta z_t(\chi_o)|_{\theta_t}\,\big(\big[\nabla_\theta z^+(\chi^+_u);\ \nabla_\theta z^-(\chi^-_u)\big]\big|_{\theta_t}\big)^\top\Big]\;\big(\nabla_{[z^+;z^-]}\mathcal{L}|_{z_t}\big)^\top \\
&= -\eta\,A_t(\chi_o)\,\big[K_t(\chi_o,\chi^+_u);\ K_t(\chi_o,\chi^-_u)\big]\,\big(\nabla_{[z^+;z^-]}\mathcal{L}|_{z_t}\big)^\top \\
&\triangleq -\eta\,A_t(\chi_o)\,\big[K_t(\chi_o,\chi^+_u)\,G_t^{\mathrm{DPO+}}(\chi^+_u) - K_t(\chi_o,\chi^-_u)\,G_t^{\mathrm{DPO-}}(\chi^-_u)\big], \qquad (13)
\end{aligned}$$
where $[\cdot\,;\cdot]$ denotes the concatenation of two vectors or matrices, $G_t^{\mathrm{DPO+}}(\chi^+_u) \triangleq \nabla_{z^+}\mathcal{L}_{\mathrm{DPO}}$, and $G_t^{\mathrm{DPO-}}(\chi^-_u) \triangleq -\nabla_{z^-}\mathcal{L}_{\mathrm{DPO}}$ (the sign is absorbed into the minus in Equation (13)). To calculate the residual terms, we decompose the loss into
$$\mathcal{L}_{\mathrm{DPO}}(x_u, y^+_u, y^-_u \mid \theta) = -\log(a), \qquad a \triangleq \sigma(b), \qquad c \triangleq \beta\big(\log\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u) - \log\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)\big),$$
$$b \triangleq \beta\big(\log\pi_{\theta_t}(y^+_u \mid \chi^+_u) - \log\pi_{\theta_t}(y^-_u \mid \chi^-_u)\big) - c = -\beta\big(\mathcal{L}_{\mathrm{SFT}}(\chi^+_u) - \mathcal{L}_{\mathrm{SFT}}(\chi^-_u)\big) - c, \tag{14}$$
where $c$ is not a function of $\theta$. Using the chain rule, the $l$-th column of the residual term $G_t^{\mathrm{DPO+}}$ can be calculated as (the calculation for $G_t^{\mathrm{DPO-}}$ is analogous):
$$G_t^{\mathrm{DPO+}} = \frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial a}\,\frac{\partial a}{\partial b}\,\nabla_{z^+} b_l\Big|_{z_t} = -\frac{1}{a}\,a(1-a)\,\nabla_{z^+} b_l\Big|_{z^+} = \beta(1-a)\,\big(\pi_{\theta_t}(y^+_u \mid \chi^+_u) - y^+_u\big).$$
By stacking the values for the different $l$, we get the residual terms of DPO:
$$G_t^{\mathrm{DPO+}} = \beta(1-a)\big(\pi_{\theta_t}(y \mid \chi^+_u) - y^+_u\big); \qquad G_t^{\mathrm{DPO-}} = \beta(1-a)\big(\pi_{\theta_t}(y \mid \chi^-_u) - y^-_u\big);$$
$$a = \sigma\!\left(\beta\log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\theta_t}(y^-_u \mid \chi^-_u)} - \beta\log\frac{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)}\right). \tag{15}$$
Similarly, we can calculate the residual terms for other off-policy preference optimization methods, such as Identity Preference Optimization (IPO, Azar et al. 2024):
$$\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x_u, y^+_u, y^-_u)\sim\mathcal{D}}\left[\left(\log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)} - \log\frac{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)} - \frac{1}{2\beta}\right)^2\right], \tag{16}$$
$$G_t^{\mathrm{IPO}\pm} = G_t^{\mathrm{DPO}\pm}; \qquad a = \log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\theta_t}(y^-_u \mid \chi^-_u)} - \log\frac{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)} - \frac{1}{2\beta}. \tag{17}$$
For Sequence Likelihood Calibration (SLiC, Y. Zhao et al. 2023), we have
$$\mathcal{L}_{\mathrm{SLiC}} = \mathbb{E}_{(x_u, y^+_u, y^-_u)\sim\mathcal{D}}\left[\max\!\left(0,\ \delta - \log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}\right) - \beta\log\pi_{\theta_t}(y_{\mathrm{ref}} \mid \chi_{\mathrm{ref}})\right] \tag{18}$$
$$= \mathbb{E}_{(x_u, y^+_u, y^-_u)\sim\mathcal{D}}\Big[\max\!\big(0,\ \delta + \mathcal{L}_{\mathrm{SFT}}(\chi^+_u) - \mathcal{L}_{\mathrm{SFT}}(\chi^-_u)\big) + \beta\,\mathcal{L}_{\mathrm{SFT}}(\chi_{\mathrm{ref}})\Big], \tag{19}$$
$$G_t^{\mathrm{SLiC}\pm} = a \cdot G_t^{\mathrm{DPO}\pm} + \beta\big(\pi_{\theta_t}(y \mid \chi_u) - y_{\mathrm{ref}}\big); \qquad a = \mathbb{1}\!\left[\delta - \log\frac{\pi_{\theta_t}(y^+_u)}{\pi_{\theta_t}(y^-_u)} > 0\right]. \tag{20}$$
In summary, these RL-free algorithms all relate to the SFT loss to some extent. For the DPO and IPO losses, the directions of the updating signals are identical; a scalar controls the strength of the update, which usually correlates with the gap between the model's current confidence in $y^+_u$ and $y^-_u$, i.e., $\mathrm{Gap}(\pi_{\theta_t}) \triangleq \log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}$. Generally, a larger gap leads to a bigger $a$, making the norm of $G_t$ smaller. In other words, this term has a "regularizing" effect: the model should not make $\mathrm{Gap}(\pi_{\theta_t})$ too large. The SLiC loss can be considered a combination of SFT adaptation and preference adaptation, and it exhibits a hard version of the same regularization effect: if $\mathrm{Gap}(\pi_{\theta_t}) > \delta$, the indicator function becomes zero, and the model stops pushing $\pi(y^+_u)$ and $\pi(y^-_u)$ apart once it already separates $y^+_u$ and $y^-_u$ well. Recently, the authors of Y. Wu et al. (2024) proposed another interesting self-play alignment algorithm called SPPO, which further improves alignment performance on top of many on-policy DPO methods. Our framework can also give an interesting explanation of why this method works so well.
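Before turning to SPPO, the DPO residual terms in Equation (15) can be sanity-checked against finite-difference gradients of the loss (a toy sketch with $L = 1$; $\beta$, the constant reference-model term $c$, and the token ids are made-up values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
V, beta, c = 6, 0.1, 0.3          # c stands for the (theta-independent) reference term
zp, zm = rng.normal(size=V), rng.normal(size=V)   # logits for chi+ and chi-
yp, ym = 1, 4                     # chosen / rejected token ids

def dpo_loss(zp, zm):
    b = beta * (np.log(softmax(zp)[yp]) - np.log(softmax(zm)[ym])) - c
    return -np.log(sigmoid(b))

a = sigmoid(beta * (np.log(softmax(zp)[yp]) - np.log(softmax(zm)[ym])) - c)
G_plus = beta * (1 - a) * (softmax(zp) - np.eye(V)[yp])   # Eq. (15)
G_minus = beta * (1 - a) * (softmax(zm) - np.eye(V)[ym])  # Eq. (15)

eps = 1e-6
fd_p = np.array([(dpo_loss(zp + eps * np.eye(V)[i], zm)
                  - dpo_loss(zp - eps * np.eye(V)[i], zm)) / (2 * eps) for i in range(V)])
fd_m = np.array([(dpo_loss(zp, zm + eps * np.eye(V)[i])
                  - dpo_loss(zp, zm - eps * np.eye(V)[i])) / (2 * eps) for i in range(V)])

assert np.abs(fd_p - G_plus).max() < 1e-8    # grad w.r.t. z+ is +G_DPO+
assert np.abs(fd_m + G_minus).max() < 1e-8   # grad w.r.t. z- is -G_DPO- (Eq. 13's minus)
```

The sign flip on the rejected branch is exactly the minus sign that appears in front of $K_t(\chi_o, \chi^-_u)\,G_t^{\mathrm{DPO-}}$ in Equation (13).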
Specifically, the loss function of SPPO can be written as
$$\mathcal{L}_{\mathrm{SPPO}} = \mathbb{E}_{(x_u, y^+_u, y^-_u)\sim\mathcal{D}}\left[\left(\log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)} - \frac{\eta}{2}\right)^2 + \left(\log\frac{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)} + \frac{\eta}{2}\right)^2\right], \tag{21}$$
$$G_t^{\mathrm{SPPO}} = 2\left(\log\frac{\pi_{\theta_t}(y^+_u \mid \chi^+_u)}{\pi_{\mathrm{ref}}(y^+_u \mid \chi^+_u)} - \frac{\eta}{2}\right)\big(\pi_{\theta_t} - y^+_u\big) + 2\left(\log\frac{\pi_{\theta_t}(y^-_u \mid \chi^-_u)}{\pi_{\mathrm{ref}}(y^-_u \mid \chi^-_u)} + \frac{\eta}{2}\right)\big(\pi_{\theta_t} - y^-_u\big). \tag{22}$$
This loss looks similar to the IPO one, but the main difference between SPPO and other methods (e.g., DPO, KTO, IPO, SPIN) is that there is no negative sign in front of $\pi_{\theta_t}(y^+_u \mid \chi^+_u)$ or $\pi_{\theta_t}(y^-_u \mid \chi^-_u)$. From its residual term $G_t^{\mathrm{SPPO}}$, it is more convenient to understand this algorithm as imposing two positive vectors on both $y^+_u$ and $y^-_u$, where the former has the larger norm, as illustrated in Figure 2. By doing so, the big negative gradient no longer exists, and neither does the squeezing effect. That is partly why this method is more stable and performs better.

C  THE "RELATIVELY STABLE" eNTK ASSUMPTION

We use this appendix to verify the core assumption of our analysis, namely that during training, the relative influence of learning $x_u$ on all other $x_o$ is relatively stable, in both MNIST and LLM finetuning settings. To keep the notation concise, we use $K^t_{uo}$ to represent $K_t(x_o, x_u)$, $K_t(\chi_o, \chi_u)$, and related variants. The experimental pipeline, sketched in Figure 6, is:

• Step 1: Create a probing dataset of 200 examples, 20 for each digit.
• Step 2: Calculate the eNTK tensor $K_t(x_u, x_o)$ each epoch, of shape [Epoch, 200, 200, 10, 10].
• Step 3: Calculate the eNTK-norm tensor $\|K_t(x_u, x_o)\|_F$ each epoch, of shape [Epoch, 200, 200, 1].
• Step 4: Average the eNTK norms over the examples of each digit: $\mathrm{CoNTK} \triangleq \mathbb{E}_{y_u = c}\big[\|K_t(x_u, x_o)\|_F\big]$, of shape [Epoch, 10, 10].
• Step 5: Visualize $\mathrm{CoNTK}[:, u, i]$ for each $i$ in one figure, aligned with the train/test loss, $\mathbb{E}_{u,o}\big[\|K^t_{uo}\|_F\big]$, and $\mathbb{E}_{u,o}\big[\|K^t_{uo} - K^{t-1}_{uo}\|_F\big]$, to observe the different phases.
The key observations from this pipeline are:

• Fact 1: although the eNTK might change during training, the relative relationship of $\mathrm{CoNTK}[:, u, i]$ across different $i$ is generally stable after several epochs (e.g., 50 epochs).
• Fact 2: the pairing effect between "4" and "9", and between "5" and "8", is clear.
• Fact 3: the change of the eNTK in late training (similar to finetuning) is far smaller than in the beginning phase.

Figure 6: Experimental design for verifying the relative stability of $\|K^t_{uo}\|_F$ for fixed $x_u$ on different $x_o$.

C.1  RELATIVELY STABLE eNTK ASSUMPTION: MNIST EXPERIMENTS

For the MNIST example, we directly calculate the eNTK term using the pipeline demonstrated in Figure 6. The results are shown in Figure 7.

Figure 7: Results showing the relative stability of $\|K^t_{uo}\|_F$ for fixed $x_u$ on different $x_o$ (labeled by the colored digits near the lines).

The key findings are:

1. The last three panels roughly indicate different phases throughout training: the first several epochs (0–30) are a bit messy, and the last several epochs (800–1000) behave similarly to the finetuning stage;
2. Although the norm of the eNTK ($\mathbb{E}_{u,o}[\|K^t_{uo}\|_F]$) and the norm of its change ($\mathbb{E}_{u,o}[\|K^t_{uo} - K^{t-1}_{uo}\|_F]$) change considerably after 30 epochs, the ranking of $\|K^t_{uo}\|_F$ across different $o$ is relatively stable, as demonstrated by the upper nine panels;
3. The pairing effect between "similar" inputs is clear, e.g., "4" and "9", "5" and "8";
4. The pairing effect between "dissimilar" inputs is also clear, e.g., "6" and "7", "2" and "5";
5. The pairing effects mentioned above are not strictly symmetric, because of the differing $A$ and $G$ terms;
6. The accumulated influence demonstrated in the third panel of Figure 1 is strongly correlated with the integral of all these curves.

C.2  RELATIVELY STABLE eNTK ASSUMPTION: LLM EXPERIMENTS

Directly calculating $\|K^t_{uo}\|_F$ for the LLM experiments requires a huge amount of computation: for each token in each example, we need to multiply a $V \times d$ matrix by a $d \times V$ one, where $d$ is the number of parameters of the LLM. However, since we only care about the relative relationship between $\|K^t_{uo}\|_F$ for different $\chi_o$ with $\chi_u$ fixed, the basic decomposition in Proposition 1 yields a lower bound (ignoring the superscript $t$ for conciseness and the influence of the $O(\eta^2)$ term):
$$\Delta \log \pi = -\eta\, A_o K_{uo} G_o \tag{23}$$
$$\|\Delta \log \pi\|_F^2 = \|{-\eta}\, A_o K_{uo} G_o\|_F^2 \tag{24}$$
$$\leq \eta^2\, \|A_o\|_F^2\, \|K_{uo}\|_F^2\, \|G_o\|_F^2. \tag{25}$$
We hence define two quantitative measurements to better understand $K_{uo}$:
$$\mathrm{LBK}_{uo} \triangleq \frac{\|\Delta \log \pi\|_F^2}{\|A_o\|_F^2\, \|G_o\|_F^2} \leq \eta^2 \|K_{uo}\|_F^2; \qquad \mathrm{SignDelta}_{uo} \triangleq \mathbb{E}_{v,l}\big[\log \pi^{t+1}_{v,l} - \log \pi^t_{v,l}\big], \tag{26}$$
where the subscripts $v, l$ denote the $v$-th dimension of the prediction at the $l$-th token.
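Equations (23)-(26) rest on the submultiplicativity of the Frobenius norm, which is easy to confirm numerically (a standalone sketch of our own with arbitrary shapes; the last line also checks the closed form of $\|A_o\|_F^2$ derived below):

```python
import numpy as np

rng = np.random.default_rng(4)
eta, V, L_len = 0.1, 8, 3

# Softmax-structured A_o = I - 1 pi^T, plus random stand-ins for the eNTK and residual
pi = np.exp(rng.normal(size=V)); pi /= pi.sum()
A = np.eye(V) - np.outer(np.ones(V), pi)
K = rng.normal(size=(V, V))
G = rng.normal(size=(V, L_len))

fro2 = lambda M: np.linalg.norm(M, "fro") ** 2

delta_logpi = -eta * A @ K @ G
# Eq. (25): Frobenius norm is submultiplicative
assert fro2(delta_logpi) <= eta**2 * fro2(A) * fro2(K) * fro2(G)

# Eq. (26): LBK is a (scaled) lower bound on ||K_uo||_F^2
LBK = fro2(delta_logpi) / (fro2(A) * fro2(G))
assert LBK <= eta**2 * fro2(K)

# Closed form of ||A_o||_F^2 = V - 2 + V ||pi||_2^2
assert np.isclose(fro2(A), V - 2 + V * (pi**2).sum())
```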
In later experiments, we observe both $\mathrm{LBK}_{uo}$ and $\mathrm{SignDelta}_{uo}$ to better understand the strength (norm) and the direction (sign) of the relative influence imposed via $K_{uo}$. Regarding the calculation of $\mathrm{LBK}_{uo}$: $\|\Delta \log \pi\|_F^2$ is easy to track because we already track $\log \pi_t$ for the different responses in the main text, and $\|G_o\|_F^2 = \|\pi - y^+_u\|_F^2$, where $y^+_u$ is a stack of $L$ one-hot vectors. The $\|A_o\|_F^2$ term is a bit more complex. Recalling the definition $A_o = I - \mathbf{1}\pi^\top$, we have
$$\begin{aligned}
\|A_o\|_F^2 &= \mathrm{Trace}\big(A_o^\top A_o\big) & (27)\\
&= \mathrm{Trace}\big((I - \mathbf{1}\pi^\top)^\top (I - \mathbf{1}\pi^\top)\big) & (28)\\
&= \mathrm{Trace}\big(I^\top I - \pi\mathbf{1}^\top - \mathbf{1}\pi^\top + \pi\mathbf{1}^\top\mathbf{1}\pi^\top\big) & (29)\\
&= \mathrm{Trace}(I^\top I) - 2\,\mathrm{Trace}(\mathbf{1}^\top\pi) + V\,\mathrm{Trace}(\pi^\top\pi) & (30)\\
&= V - 2 + V\|\pi\|_2^2, & (31)
\end{aligned}$$
which is also trackable in our setting. Intuitively, the value of $\|\pi\|_2^2$ is inversely correlated with the Shannon entropy of the distribution $\pi$: $\|\pi\|_2^2 = 1$ if $\pi$ is one-hot, and $\|\pi\|_2^2 = \frac{1}{V}$ if $\pi$ is uniform. Hence we can also interpret $\|A_o\|_F^2$ as the peakiness of $\pi(y \mid \chi_o)$. In the following experiment, we track $\mathrm{LBK}_{uo}$ for different types of responses during SFT and DPO to show that the relative influence between different response types is relatively stable. We show the experimental results in Figure 8, in which the key findings are:

1. In both SFT and DPO, under different supervisory signals, the changes of these two metrics are relatively stable, similar to those in Figure 7;
2. A clear pairing effect exists between $y^+_u$ (blue curve) and $y^+_{j\neq u}$ (red curve);
3. In $\mathrm{LBK}_{uo}$, learning any natural-language sequence (i.e., $y^+_u$, $y^-_u$, $y^+_{\mathrm{gpts}}$, $y^-_{\mathrm{gpts}}$) influences the non-language sequences ($y^+_{\mathrm{urnd}}$, $y_{\mathrm{rnd}}$) a lot, especially at the end of finetuning. However, from $\mathrm{SignDelta}_{uo}$ we know this influence is negative, caused by the pushing-down pressure;
4.
An interesting "similarity pattern" occurs: observing $\mathrm{SignDelta}_{uo}$, we see that SFT using $y^+_{\mathrm{gpts}}$ or $y^-_{\mathrm{gpts}}$ imposes more influence on the sequences generated using ChatGPT than on their original responses (i.e., $y^+_u$ or $y^-_u$), which might be an interesting phenomenon to explore further;
5. Observing the last row, where the model is trained using DPO, it is clear that the push-down pressure is dominant: almost all $\mathrm{SignDelta}_{uo}$ terms have large negative values, and the only positive one is $y^+_u$ (roughly 0.5, much smaller than the positive values in the SFT cases).

We also provide some intermediate quantities in Figure 9 to further validate our analysis. The key trends are described in its caption for ease of reading.
Figure 8: Tracking the relative stability of $K^t_{uo}$ by observing $\mathrm{LBK}_{uo}$ (a) and $\mathrm{SignDelta}_{uo}$ (b) under different settings (SFT with various supervisory signals, and DPO). The accumulated influence is the integral of the corresponding curve over the $x$-axis (smoothed using an exponential moving average).

D  MORE ABOUT EXPERIMENTS

This section provides more experimental details and results about the learning dynamics to support our claims. We first discuss how the different types of responses are selected for our probing dataset $\mathcal{D}_{\mathrm{prob}}$. These responses fit into a 2-D space where one dimension is the semantic relevance of the response to $y^+_u$. We then provide more results and discussion on different models and settings. The subtle differences between the responses all support our story well.
Figure 9: Other metrics related to the LLM's learning dynamics. The first panel demonstrates how $\|A^t_o\|_F^2$ changes during SFT (higher means a peakier $\pi$); the peakiness of $y^+_u$, i.e., the supervisory signal, clearly increases fastest. The last two panels demonstrate the average $\log\pi(y^*)$ and its complement (denoted $\log\pi(y^*)^C$, which measures how much probability mass is left for other possible tokens), for SFT and DPO respectively. Both $\log\pi(y^*)$ and $\log\pi(y^*)^C$ change faster in the DPO case, which matches our observations in the fourth panel of Figure 3 well. The linear fit extrapolates the $\log\pi(y^*)^C$ values because we encountered an underflow issue when estimating this term; we will fix this in the next version. However, the trend of their changing speed is consistent across different settings.
Figure 10: (a) How the causal-mask implementation converts auto-regressive modeling into multi-label modeling. (b) A hypothetical sketch of the 2-D plane of $\mathcal{Y}$, considering the distance to $y^+_u$ in both format (edit distance) and semantics.

D.1  THE SELECTION OF RESPONSE TYPES FOR THE PROBING DATASET

Besides the sequential nature of the loss function, another conundrum in analyzing LLM learning dynamics is the huge response space $\mathcal{Y}$: the number of possible $y \in \mathcal{Y}$ is $V^L$, but the vast majority of possible sequences look nothing like natural language, and we expect the model to generate only a subset of natural-language-like responses. These properties prevent us from observing the changes of all possible $y$ as we did for MNIST. Instead, we define several interesting regions of $\mathcal{Y}$ and select corresponding typical responses to observe. Intuitively, we can use the semantic relevance between $y$ and $x_u$ as a heuristic. This measurement can be understood as "how suitable this $y$ is as a response to $x_u$, compared with $y^+_u$." Then, following the structure of common preference optimization datasets such as Antropic-H (Y. Bai et al. 2022) and UltraFeedback (G. Cui et al. 2023), we divide $\mathcal{Y}$ into three sub-spaces and evaluate the following types of responses (as in Figure 10b). The prompt templates used to generate them are illustrated in Figure 11.
We also provide examples of all 14 types of responses in Figure 12.

• $\mathcal{Y}_{\mathrm{IF}}$: reasonable responses following the instruction $x_u$:
  0. $y_{\pi_0}$, the initial response generated by feeding $x_u$ to the LLM before finetuning;
  1. $y^+_u$, the chosen (i.e., preferred) response to $x_u$;
    1.1 $y^+_{\mathrm{selfr}}$, a rephrase of $y^+_u$ using $\pi_0$, following the algorithm of Yang et al. (2024);
    1.2 $y^+_{\mathrm{gpts}}$, a rephrase of $y^+_u$ using ChatGPT, keeping the semantics while changing the format;
    1.3 $y^+_{\mathrm{gptf}}$, a rephrase of $y^+_u$ using ChatGPT, keeping the format while changing the semantics;
  2. $y^-_u$, the rejected (i.e., less preferred, but still reasonable) response to $x_u$;
    2.1 $y^-_{\mathrm{selfr}}$, a rephrase of $y^-_u$ using $\pi_0$, following the algorithm of Yang et al. (2024);
    2.2 $y^-_{\mathrm{gpts}}$, a rephrase of $y^-_u$ using ChatGPT, keeping the semantics while changing the format;
    2.3 $y^-_{\mathrm{gptf}}$, a rephrase of $y^-_u$ using ChatGPT, keeping the format while changing the semantics;
• $\mathcal{Y}_{\text{non-IF}}$: irrelevant responses to $x_u$ that are still recognizably human language (in these datasets, roughly "internet-standard" English):
  3. $y^+_{j\neq u}$, the chosen response to a different question $x_{j\neq u}$ selected from the training set;
  4. $y^+_{\mathrm{test}}$, the chosen response to a question $x_{\mathrm{test}}$ selected from the test set;
  5. $y_{\mathrm{hum}}$, a "random" English sentence generated by ChatGPT with as many words as $y^+_u$;
• $\mathcal{Y}_{\text{non-hum}}$: token sequences that do not form meaningful human language:
  6. $y^+_{\mathrm{urnd}}$, a random permutation of the words (space-separated strings) of $y^+_u$;
  7. $y'_{\mathrm{rnd}}$, a random permutation of the words of a generated sentence as in $y_{\mathrm{hum}}$.

Furthermore, we also create another probing dataset (named $\mathcal{D}_{\mathrm{probtest}}$) in which all $x$ come from the test set. Compared with the $\mathcal{D}_{\mathrm{probtrain}}$ used in the main text, all the prompts and responses in $\mathcal{D}_{\mathrm{probtest}}$ are never exposed to the model during finetuning.
By comparing the learning curves on these two probing datasets, we can tease apart the model's predictions on directly influenced responses ($y$ appears during training) and indirectly influenced ones ($y$ that the model never sees during training). Finally, we believe the degree of the "on-policy" property (which is very important for preference finetuning, as discussed in Tajwar et al. (2024)) could also be introduced as a second axis in our 2-D plane. We leave the exploration of this interesting direction to future work.

D.2  MORE RESULTS IN DIFFERENT SETTINGS: SFT CASE

Consistent learning dynamics across different models. In this subsection, we provide more results to support our analysis of SFT in Section 4.1. The first thing to verify is the consistency of the learning dynamics trends across settings. As illustrated in Figure 14, we conduct SFT on five models of different sizes pretrained with different recipes. Note that Pythia-410M/1B/1.4B/2.8B are pretrained using exactly the same dataset and pipeline (Biderman et al. 2023), while Qwen1.5-0.5B is pretrained differently. Hence we observe slight differences between the curves of the Pythia series and Qwen, e.g., in $y_{\mathrm{hum}}$. However, the trends demonstrated in Figure 3 hold consistently for all models.

Comparing the rephrases of $y^+_u$ and $y^-_u$. See Figure 15, where we plot the rephrases of the same response in the same panel. We can treat the red curve, i.e., that of the $y$ generated by $\pi_0(x)$, as a baseline, whose decay indicates that the policy model is deviating from its initial point.
The first observation is that, after several updates, $y^+_u$ is the only one that keeps increasing fast, which means the "pull-up" pressure generated by $[x_u; y^+_u]$ does not have a much stronger influence on these rephrases than on $[x_u; y^+_{j\neq u}]$, even though these $y$ are good rephrases of $y^+_u$ (recall that the $y^+_{j\neq u}$ curve always increases in Figure 14).

Figure 11: The prompts used to generate $y^+_{\mathrm{selfr}}$, $y^+_{\mathrm{gpts}}$, and $y^+_{\mathrm{gptf}}$; the rephrases of the rejected samples are generated similarly, and the self-rephrase template comes from Yang et al. (2024):
• Self-rephrase: "Below is an instruction that describes a task along with a reference answer. Using the reference answer as a guide, write your own response. ### Instruction: [x] ### Reference Answer: [y+] ### Response:"
• GPT-rephrase, semantics-keeping: "Given the reference sentence, please generate an output sentence. Please use different words as much as possible while keeping the meaning of the reference sentence unchanged. Please only return the output sentence. ### Reference sentence: [y+] ### Response:"
• GPT-rephrase, format-keeping: "Given the reference sentence, please generate an output sentence. Please change the meaning of the reference sentence as much as possible while keeping the format of it. Please only return the output sentence. ### Reference sentence: [y+] ### Response:"

Figure 12: Examples of all possible response types for one $x$ in our probing dataset. Note that the pure random token sequence is generated by first creating a random sentence and then randomly permuting its tokens.

Furthermore, by carefully comparing the decay speed of $y_{\pi_0}$ and the other curves, we find that these rephrases decay more slowly than $y_{\pi_0}$ in the chosen case, but not for the rejected responses. This phenomenon also supports our analysis well: because we train the model using $y^+_u$, its rephrases are "pulled up" more than the rephrases of $y^-_u$.
Such a claim is also verified by the experiment in the last column of this figure, where we train the model using $[x_u; y^-_u]$ rather than $y^+_u$. In these two panels, the decay speed of the rephrases of $y^+_u$ is now identical to that of $y_{\pi_0}$, while the decay of the rephrases of $y^-_u$ is slightly slower. Last, comparing the green and orange curves (i.e., the format-keeping and semantics-keeping GPT rephrases), we find that the predicted probabilities of the format-keeping curves are usually larger than those of their semantics-keeping counterparts. This is a sign that the model during SFT might care more about the format than the semantics of a sentence. We will delve into this interesting phenomenon in future work.

Comparing $\mathcal{D}_{\mathrm{prob}}$ and $\mathcal{D}_{\mathrm{probtest}}$. To isolate the influence of the "pull-up" pressure introduced by the training updates, we also create another probing dataset $\mathcal{D}_{\mathrm{probtest}}$ using the same pipeline as $\mathcal{D}_{\mathrm{prob}}$. The only difference is that all $x$ in $\mathcal{D}_{\mathrm{probtest}}$ come from the test set, so neither the prompts nor the responses ever occur during training. See Figure 16, where the solid and dotted curves represent the learning dynamics of responses in $\mathcal{D}_{\mathrm{prob}}$ and $\mathcal{D}_{\mathrm{probtest}}$ respectively, and the color of each curve indicates the model being finetuned. By qualitatively comparing the trend difference between curves from $\mathcal{D}_{\mathrm{prob}}$ and $\mathcal{D}_{\mathrm{probtest}}$, we roughly observe that $\mathrm{trend\_diff}(y^+_u) > \mathrm{trend\_diff}(y^+_{j\neq u}) > \mathrm{trend\_diff}(y^+_{\mathrm{gpts}}) > \mathrm{trend\_diff}(y^+_{\mathrm{gptf}})$, which aligns well with our hypothesis about how strongly the "pull-up" pressure influences different responses.
rejected y + u y u 02468 Number of epochs 390 380 370 360 350 340 Non-human response y + urnd y ' rnd 02468 Number of epochs 180 170 160 150 140 130 120 Whether y occured during training y + ju y + test 0.00.51.01.52.02.53.0 Number of epochs 144 142 140 138 136 134 Whether y is a dialog response y + test 0510152025 Inflection time of y hum 0 5 10 15 20 25 Inflection time of y + test Updates before inflection 140 130 120 110 100 90 80 y + u 168.0 167.5 167.0 166.5 166.0 165.5 165.0 164.5 164.0 y hum (a) Result of SFT onAntropic-H. 0123456 Number of epochs 350 325 300 275 250 225 200 175 Chosen v.s. rejected y + u y u 0123456 Number of epochs 1280 1270 1260 1250 1240 1230 1220 1210 1200 Non-human response y + urnd y ' rnd 0123456 Number of epochs 400 380 360 340 320 300 280 260 Whether y occured during training y + ju y + test 0123456 Number of epochs 410.0 407.5 405.0 402.5 400.0 397.5 395.0 392.5 390.0 Whether y is a dialog response y + test 0510152025 Inflection time of y hum 0 5 10 15 20 25 Inflection time of y + test Updates before inflection 350 325 300 275 250 225 200 175 150 y + u 146 144 142 140 138 136 134 132 y hum (b) Result of SFT onUltraFeedback. Figure 13: The learning dynamics of responses in different groups in the proposed probing dataset. Trends to observe: 1.)y + u increase andy − u first increase then decrease; 2.) bothy + urnd andy ′ rnd decrease and very small; 3.)y + j̸=u increases with a smaller rate thany + u , although the[x u ;y + j̸=u ]never occurs during training; 4.) both y + test andy hum has a bell-shape curve; 5.) the inflection ofy hum is earlier. Because we find that most sentences iny hum are descriptive ones while those iny + test are question-answer style sentences. This suggest that they + test are semantically more similar toy + u thany hum (i.e., larger∥K t ∥ F ). Hence in general, the “pull-up” pressure ony + test is larger, and hence its inflection point is later thany hum . 
[Figure 14 panels: log-probability curves over training epochs for $y_{\pi_0}$, $y_u^+$, $y^+_{gpts}$, $y^+_{gptf}$, $y_u^-$, $y^-_{gpts}$, $y^-_{gptf}$, $y^+_{j\neq u}$, $y^+_{test}$, $y_{hum}$, $y^+_{urnd}$, and $y'_{rnd}$, for Qwen1.5-0.5B and Pythia-410M/1B/1.4B/2.8B; (a) SFT on Anthropic-HH, (b) SFT on UltraFeedback.]

Figure 14: Trend to observe: the curves of different models exhibit similar trends.
[Figure 15 panels: log-probability curves for the rephrases of $y_u^+$ (first row) and of $y_u^-$ (second row) under Qwen1.5-0.5B and Pythia-410M/1B/1.4B/2.8B; the last column trains Pythia-1.4B on the rejected response instead; (a) SFT on Anthropic-HH, (b) SFT on UltraFeedback.]

Figure 15: Comparison of different rephrases of $y_u^+$ and $y_u^-$ under different models. Key trends to observe: 1.) in the first row, the decay speeds of $y^+_{gpts}$ and $y^+_{gptf}$ are smaller than that of $y_{\pi_0}$, which means the pull-up pressure exists; 2.) in the second row, the decay speeds of $y^-_{gpts}$ and $y^-_{gptf}$ are similar to that of $y_{\pi_0}$, because the pull-up pressure on rejected samples is smaller; 3.) in the last column, since we SFT the model using the rejected sample rather than the chosen one, the trends in (1) and (2) reverse.

[Figure 16 panels: log-probability curves over training epochs for all response groups, with solid curves for $\mathcal{D}_{\text{prob}}$ and dotted curves for $\mathcal{D}_{\text{probtest}}$, on Qwen and Pythia.]

Figure 16: Comparison of the learning dynamics of examples from $\mathcal{D}_{\text{prob}}$ and $\mathcal{D}_{\text{probtest}}$. Key trend to observe: for $\mathcal{D}_{\text{prob}}$, since many responses and prompts do occur during training, the pull-up pressure is generally stronger. The curves of $y_u^+$, $y^+_{gpts}$, $y^+_{gptf}$, and $y^+_{j\neq u}$ show a clear trend. (Anthropic-HH, SFT)

D.3 MORE RESULTS ON DIFFERENT SETTINGS: OFF-POLICY DPO CASE

Similar to Appendix D.2, we provide extra experiments for DPO in this part using the same probing dataset. Note that because the responses of on-policy DPO change generation by generation, it is hard to observe the dynamics of a pre-collected probing dataset. We leave the exploration of how to effectively probe other DPO variants to future work.
Consistent learning dynamics across different models. Compare Figure 4 in the main text with Figure 18, where we provide results for many different models (Pythia-410M/1B/2.8B and Qwen1.5-0.5B). Their trends on the different $\pi_{\theta_t}(y)$ are quite consistent: 1.) in the first column, the margin $\pi_{\theta_t}(y_u^+) - \pi_{\theta_t}(y_u^-)$ keeps increasing; $\pi_{\theta_t}(y_u^+)$ first increases and then decreases, always with a smaller decay speed than that of $\pi_{\theta_t}(y_u^-)$; 2.) in the second column, $\pi_{\theta_t}(y_u^+)$ decreases more slowly than the other rephrases, verifying the "pull up" pressure and its influence on other responses via $\mathcal{K}^t$; 3.) in the third column, $\pi_{\theta_t}(y_u^-)$ decreases faster than the other rephrases, verifying the "push down" pressure and its influence on other $y$; 4.) in the fourth column, the rephrases of $y_u^+$ decay more slowly than those of $y_u^-$, supporting the claim that the rephrases near the chosen responses are influenced by the "pull up" pressure while the rephrases of the rejected ones are influenced by the "push down" pressure.

Learning dynamics of conducting SFT first, then DPO. As stated in (Ouyang et al. 2022), conducting SFT before DPO is a common pipeline for alignment, and using $[x_u; y_u^+]$ as the SFT dataset is also common practice in many existing works. Hence, in this part we plot the curves of the different $\pi_{\theta_t}(y)$ in both stages to demonstrate their differences. See Figure 17, where the difference between the experiments in the three rows is how long the model is trained with SFT before DPO. The learning rates of SFT and DPO are controlled to be the same (i.e., $5\times10^{-7}$, the default value in (Tajwar et al. 2024)). All curves are aligned at the 10th epoch on the x-axis (i.e., the starting time of DPO training) for the convenience of comparing trends across settings. We first check the curves of the SFT and DPO parts separately and find that all the relative trends above still hold in these experiments.
We then compare the model's behavior in the two phases. In the last two rows of Figure 17, where the number of SFT epochs is non-zero, it is clear that the decay speed of most observed $\pi_{\theta_t}(y)$ is much larger during DPO than during SFT. The main reason is the large negative gradient introduced by DPO. This gradient, especially when imposed on a "valley" region of the model's prediction, "pushes down" the whole curve significantly, except for the response with the highest confidence before the update. We name this non-trivial trend the "squeezing effect" and elaborate on it in Appendix E. Furthermore, a peakier $\pi_{\theta_0}(y)$ and a smaller $\pi_{\theta_0}(y_u^-)$ lead to a stronger "squeezing effect", which can be verified by comparing the curves of the last two panels: longer SFT makes the model's prediction peakier when DPO starts, which leads to a larger decay of all $\pi_{\theta_t}(y)$ during DPO.

[Figure 17 panels: log-probability curves for SFT for 0/1/10 epochs followed by DPO, for the groups "Chosen v.s. rejected", "Chosen v.s. its rephrases", "Rejected v.s. its rephrases", rephrase comparisons, "Irrelevant responses", and "Non-human responses".]

Figure 17: The learning dynamics of conducting DPO after SFT-ing the model for several epochs. We align the starting point of DPO (i.e., the 10th epoch on the x-axis) to better compare the curves. Key trends to observe: 1.) the confidence of all responses decays much faster once DPO starts, which is caused by the squeezing effect introduced by the large negative gradient; 2.) the more epochs we SFT the model, the more serious the squeezing effect is (confidence decays faster). (Anthropic-HH, SFT→DPO)

[Figure 18 panels: DPO learning dynamics on Qwen1.5-0.5B and Pythia-410M/1B/2.8B, with columns "Chosen v.s. rejected", "Chosen v.s. its rephrases", "Rejected v.s. its rephrases", and rephrase comparisons; (a) Anthropic-HH, (b) UltraFeedback.]

Figure 18: The learning dynamics of DPO on different models. Key trends to observe: 1.) the confidence of $y_u^+$ decays more slowly than that of $y_u^-$; 2.) the confidence of $y_u^+$ decays more slowly than those of $y^+_{gpts}$ and $y^+_{gptf}$, because the pull-up pressure is directly imposed on $y_u^+$; 3.) the confidence of $y_u^-$ decays faster than those of $y^-_{gpts}$ and $y^-_{gptf}$, because the push-down pressure is directly imposed on $y_u^-$; 4.) the confidence of the rephrases of rejected responses decays faster than that of the rephrases of chosen responses.

E THE SQUEEZING EFFECT INTRODUCED BY BIG NEGATIVE GRADIENT

In DPO, the model gradually learns to separate the chosen and rejected responses by imposing one positive and one negative adaptation vector centered at $y_u^+$ and $y_u^-$ respectively, as illustrated in the second panel of Figure 2.
These two opposite pressures ensure that the margin $\pi_\theta(y_u^+\mid\chi_u^+) - \pi_\theta(y_u^-\mid\chi_u^-)$ keeps increasing, which makes the model align better with human preferences. However, if we go deeper and consider $\pi_\theta(y_u^+\mid\chi_u^+)$ and $\pi_\theta(y_u^-\mid\chi_u^-)$ separately (as we should, because their $\chi_u$ are usually different), a very interesting phenomenon occurs. In the first column of Figure 18, we find that although DPO also contains a strong positive adaptation vector, the curve of $\pi_\theta(y_u^+\mid\chi_u^+)$ eventually goes down after several updates, which is very different from $\pi_\theta(y_u^+\mid\chi_u^+)$ in the SFT case. Such an observation is also reported in many related works (Pal et al. 2024; Rafailov et al. 2024; Razin et al. 2025; Tajwar et al. 2024), but a clear-cut explanation is still missing. Furthermore, although the relative behaviors of the various rephrases match our analysis of learning dynamics well, the two pressures on $y_u^+$ and $y_u^-$ alone cannot explain why all of the observed $\pi_\theta(y)$ keep decreasing during training. So, it is natural to ask: where has the probability mass gone?

E.1 THE SQUEEZING EFFECT AND WHY IT EXISTS

To answer the question above, we can start from the properties of the basic Softmax function by analyzing a simple multi-class logistic regression problem: no matter how complex the LLM is, its predictions are made by converting logits into probabilities using Softmax heads. Note that the analysis here only considers the negative gradient, i.e., the one imposed by $y_u^-$ in LLM finetuning. As also pointed out by Razin et al. (2025), the pull-up pressure imposed by $y_u^+$ cancels the influence imposed by $y_u^-$ when their $\chi_u$ are identical. However, when $\chi_u^+$ and $\chi_u^-$ are dissimilar, the squeezing effect discussed in this paper still dominates. We leave the analysis of the intricate interaction between these two pressures to future work.
Consider a simple $V$-class logistic regression problem where each high-dimensional input $x$ is converted to a length-$d$ feature vector via a deep neural network $\phi$; in other words, $\phi(x)\in\mathbb{R}^{d\times 1}$. The model uses a linear read-out layer $w\in\mathbb{R}^{d\times V}$ to convert the feature vector into logits $z = w^\top\phi(x)$ and then generates the probability vector $p$ using a Softmax head. We consider the common cross-entropy loss for each input pair $(x, y)$. In summary, we have:

$$\mathcal{L}_{CE}(p^t, y) = -e_y^\top \log p^t;\qquad p^t = \mathrm{Softmax}(z^t);\qquad z^t = (w^t)^\top\phi(x), \tag{32}$$

where $t$ is the index of the training step and $e_y$ is a length-$V$ one-hot vector determined by the ground-truth label $y$. To simplify the analysis, we assume a fixed $\phi$ and only update the parameters of the read-out layer $w$ using stochastic gradient descent:

$$w^{t+1} = w^t - \eta\nabla_w\mathcal{L} = w^t - \eta\,\phi(x)(p^t - e_y)^\top, \tag{33}$$

where $\eta$ is the learning rate, which can be negative if we consider a negative gradient during training. With Equations (32) and (33), we can write down each dimension of $p^t$ and $p^{t+1}$ after some calculation. To quantitatively analyze how the model's confidence in each class changes, we define the ratio $\alpha_i \triangleq p^{t+1}_i / p^t_i$ and use the following lemma to describe its behavior:

Lemma 1. The ratio of confidence change for each $i$ can be represented as:

$$\alpha_i \triangleq \frac{p^{t+1}_i}{p^t_i} = \frac{\sum_{j=1}^{V} e^{z^t_j}}{\sum_{j=1}^{V} \beta_j e^{z^t_j}}. \tag{34}$$

Note that the values of $\beta_j$ also depend on whether $i$ equals $y$; hence for Case 1 ($i = y$) and Case 2 ($i \neq y$) we have ($\eta' \triangleq \eta\|\phi(x)\|_2^2$ is the equivalent learning rate):

$$\text{Case 1: } \beta_j = \begin{cases} e^{-\eta'(1 + p^t_j - p^t_i)} & \text{if } j \neq y \\ 1 & \text{if } j = y \end{cases};\qquad \text{Case 2: } \beta_j = \begin{cases} e^{-\eta'(p^t_j - p^t_i)} & \text{if } j \neq y \\ e^{-\eta'(p^t_j - p^t_i - 1)} & \text{if } j = y \end{cases} \tag{35}$$

Proof. To derive Equation (34), we need the analytical expression of each $p^{t+1}_i$ and $p^t_i$. As $p = \mathrm{Softmax}(z)$, we first need to link $z^{t+1}$ and $z^t$.
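Lemma 1 is easy to check numerically: one SGD step shifts the logits by $-\eta'(p^t - e_y)$, and the per-class confidence ratio computed directly from the updated Softmax should match Equation (34). A minimal sketch, not from the paper's codebase (the dimension, seed, target class, and step size are arbitrary choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V = 8                      # number of classes (arbitrary)
z = rng.normal(size=V)     # logits z^t
p = softmax(z)             # p^t
y = 3                      # class receiving the gradient
eta = -0.5                 # eta' < 0: negative gradient (gradient ascent on y)
e_y = np.eye(V)[y]

# Eq. (36): z^{t+1} = z^t - eta' (p^t - e_y), working directly on the logits
z_new = z - eta * (p - e_y)
alpha_direct = softmax(z_new) / p          # p^{t+1}_i / p^t_i

# Eq. (34)-(35): alpha_i = sum_j e^{z_j} / sum_j beta_j e^{z_j}
def alpha_lemma(i):
    if i == y:                              # Case 1 (i = y)
        beta = np.exp(-eta * (1 + p - p[i]))
        beta[y] = 1.0
    else:                                   # Case 2 (i != y)
        beta = np.exp(-eta * (p - p[i]))
        beta[y] = np.exp(-eta * (p[y] - p[i] - 1))
    ez = np.exp(z)
    return ez.sum() / (beta * ez).sum()

alpha = np.array([alpha_lemma(i) for i in range(V)])
assert np.allclose(alpha, alpha_direct)     # the lemma matches the direct update
```

Working directly on the logits absorbs $\|\phi(x)\|_2^2$ into `eta`, which is exactly the role of the equivalent learning rate $\eta'$.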
With Equations (32) and (33), $z^{t+1}$ can be written recursively as:

$$z^{t+1} = (w^{t+1})^\top\phi(x) = \big(w^t - \eta\,\phi(x)(p^t - e_y)^\top\big)^\top\phi(x) = z^t - \eta\|\phi(x)\|_2^2\,(p^t - e_y) = z^t - \eta'(p^t - e_y), \tag{36}$$

where $\eta' \triangleq \eta\|\phi(x)\|_2^2$ is the equivalent learning rate, which depends on the norm of the feature representation. Note that $z$, $p$, and $e_y$ are all length-$V$ vectors and $y$ is an integer ranging from 1 to $V$. Then we can write down each $z^{t+1}_i$ as:

$$z^{t+1}_i = \begin{cases} z^t_i - \eta' p^t_i + \eta', & \text{if } i = y \\ z^t_i - \eta' p^t_i, & \text{if } i \neq y \end{cases} \tag{37}$$

We can then combine this with the definition of the Softmax function and write down each $p^{t+1}_i$ case by case. For Case 1, where $i = y$, we have:

$$p^{t+1}_{i=y} = \frac{e^{z^{t+1}_i}}{\sum_{j=1}^{V} e^{z^{t+1}_j}} = \frac{e^{z^t_i - \eta' p^t_i + \eta'}}{\sum_{j\neq y} e^{z^t_j - \eta' p^t_j} + e^{z^t_y - \eta' p^t_y + \eta'}} = \frac{e^{z^t_i}}{\sum_{j\neq y} e^{z^t_j - \eta'(1 + p^t_j - p^t_i)} + e^{z^t_y}}, \tag{38}$$

and combining this with the fact that $p^t_i = e^{z^t_i} / \sum_{j=1}^{V} e^{z^t_j}$, we can derive $\alpha_i$ and $\beta_j$ as in the left part of Equation (35). Similarly, when $i \neq y$, we have:

$$p^{t+1}_{i\neq y} = \frac{e^{z^{t+1}_i}}{\sum_{j=1}^{V} e^{z^{t+1}_j}} = \frac{e^{z^t_i - \eta' p^t_i}}{\sum_{j\neq y} e^{z^t_j - \eta' p^t_j} + e^{z^t_y - \eta' p^t_y + \eta'}} = \frac{e^{z^t_i}}{\sum_{j\neq y} e^{z^t_j - \eta'(p^t_j - p^t_i)} + e^{z^t_y - \eta'(p^t_y - p^t_i - 1)}}, \tag{39}$$

which leads to the right part of Equation (35).

We can now better understand how each $p_i$ changes after this update. Specifically, if $\alpha_i > 1$, the corresponding $p_i$ increases, and vice versa. To determine the value of $\alpha_i$, we can treat any $\beta_j > 1$ as contributing to the conclusion that $\alpha_i < 1$, while any $\beta_j < 1$ argues against it. The value of the corresponding $e^{z^t_j}$ and of $|\beta_j - 1|$ controls how strong the contribution is. With the preparations above, we derive the following observations on how the confidence evolves when a gradient ascent step (i.e., $\eta < 0$) is imposed on class $y$.

Claim 1: The value of $p_y$ is guaranteed to decrease, i.e., $\alpha_y < 1$. We start from the value of $\beta$ in Case 1, as given in Equation (35). It is clear that for any $j \neq y$ we have $\beta_j > 1$, because $1 + p^t_j - p^t_i > 0$.
Combining this with $\beta_y = 1$, Claim 1 follows directly.

Claim 2: The value of $p_{i^*}$, where $i^* = \arg\max_{i\in[V]\setminus y} p^t_i$, is guaranteed to increase, i.e., $\alpha_{i^*} > 1$. We now use the value of $\beta$ in Case 2, since $i^*$ cannot equal $y$ by definition. When $j \neq y$, we have $p^t_j - p^t_{i^*} \leq 0$ for all possible $j$, because $p^t_{i^*}$ is the largest among all $p^t_{i\neq y}$. Hence all $\beta_{j\neq y}$ must be smaller than one. Combining this with the fact that $\beta_y < 1$ (because $p^t_y - p^t_{i^*} - 1$ must be negative), we can prove that $\alpha_{i^*} > 1$.

The two claims above show that the parameter update can be viewed as taking probability mass from $p_y$ and redistributing it to the other dimensions. From Claim 2, we know some of the mass is guaranteed to be "squeezed" into the dimension with the highest $p^t_{i^*}$ (if $p^t_y$ is the highest value, then $p^t_{i^*}$ is the second highest in $p^t$). But how the other $p_i$ change is not yet clear. Is the probability mass from $p_y$ also split among the other $p_i$ (i.e., do the other $p_i$ increase)? Or does $p_{i^*}$ absorb mass not only from $p_y$ but also from other dimensions (i.e., do the other $p_i$ decrease)? To get a clearer picture, we need to track the adaptation of each $p_i$. To achieve this, we must scrutinize the distribution $p^t$, because it controls the value of $e^{z^t_j}$ for different $j$. We choose three typical scenarios, where $p^t$ is strictly uniform, slightly non-uniform, and extremely peaky, leading to the following claims.

Claim 3A: When $p^t$ is a uniform distribution, the probability mass removed from class $y$ is uniformly distributed to all other $i \neq y$, i.e., all $p^{t+1}_{i\neq y}$ increase by the same value. With the uniform-$p^t$ assumption, Equation (34) simplifies to $\alpha_i = \frac{V}{\sum_{j=1}^{V}\beta_j}$. Note that the first two claims hold for any distribution $p^t$, so we only check the values of $\alpha_{i\neq y}$ here to verify the "uniformly distributed mass" hypothesis. Substituting the values of $\beta_j$ into this new $\alpha$ leads to $\alpha_i = \frac{V}{V - 1 + e^{\eta'}}$ for all $i \neq y$. Since $\eta' < 0$ and $e^{\eta'} < 1$, we must have $\alpha_{i\neq y} > 1$.
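Because Claims 1 and 2 are distribution-free, they can be stress-tested over random logits and labels. A minimal sketch (trial count, scale, and seed are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
V, eta = 20, -0.7                          # eta' < 0: gradient ascent on class y

for _ in range(200):
    z = 2.0 * rng.normal(size=V)
    p = softmax(z)
    y = int(rng.integers(V))
    z_new = z - eta * (p - np.eye(V)[y])   # Eq. (36)
    p_new = softmax(z_new)

    # Claim 1: p_y is guaranteed to decrease
    assert p_new[y] < p[y]

    # Claim 2: the largest non-y class is guaranteed to increase
    i_star = int(np.argmax(np.where(np.arange(V) == y, -np.inf, p)))
    assert p_new[i_star] > p[i_star]
```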
Combined with the fact that all $p^t_{i\neq y}$ are the same, this claim is proved.

Claim 3B: When $p^t$ is slightly non-uniform, the $p_i$ with smaller $p^t_i$ tend to decrease, and vice versa. This claim describes a general trend and comes with no guarantee. However, analyzing this scenario helps us better understand the influence of $p^t$. Assume we are observing $\alpha_{i'}$, where $i'$ is neither $y$ nor $i^*$. We consider two subsets of $[V]\setminus y$: $\mathcal{B}$, which contains all $j$ with $p^t_{i'} \leq p^t_j$, and $\mathcal{S}$, which contains all $j$ with $p^t_{i'} > p^t_j$. Considering Case 2 in Equation (35), we have:

$$\beta_{j=y} \ll \beta_{j\in\mathcal{S}} < 1;\qquad \beta_{j\in\mathcal{B}} > 1. \tag{40}$$

Note that we use the $\ll$ notation loosely, to highlight the fact that $\beta_{j=y}$ is much smaller than $\beta_{j\in\mathcal{S}}$, because of the extra $-1$ term in the exponential. With the expression above, we can see that if $p^t_{i'}$ is relatively small, the set $\mathcal{B}$ is large, which means more $\beta_j > 1$ terms contribute to the conclusion that $\alpha_{i'} < 1$. If the influence of $\beta_{j\in\mathcal{B}}$ is strong enough to override the influence of the other $\beta$ (especially $\beta_{j=y}$, which is far smaller than the others), then $\alpha_{i'}$ is smaller than one and hence $p_{i'}$ decreases. On the contrary, for those $i'$ with relatively large $p^t_{i'}$, the $\beta < 1$ terms become dominant, leading to $\alpha_{i'} > 1$, i.e., $p_{i'}$ increases.

In the analysis above, we assume $p^t$ is only slightly non-uniform (i.e., not too peaky), which means the values of the different $e^{z^t_j}$ are comparable. However, in practical machine learning systems like LLM finetuning, the distribution $p^t$ is usually very non-uniform, with most of the probability mass concentrated in a few dimensions. This is because the LLM's vocabulary is usually very large, while the reasonable choices for the next word form only a small portion of it. We thus have the following claim to describe this practical scenario.

Claim 3C: When $p^t$ is very peaky, i.e., most of the probability mass is held by $i^*$, all other $p_i$ will decrease.
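The uniform case of Claim 3A can also be checked against its closed form: with $p^t_i = 1/V$, every non-target ratio should equal $V/(V-1+e^{\eta'})$. A minimal numeric check (the values of $V$, $y$, and $\eta'$ are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, y, eta = 50, 42, -0.5               # eta' < 0: negative gradient on class y
z = np.zeros(V)                        # equal logits -> uniform p^t = 1/V
p = softmax(z)

z_new = z - eta * (p - np.eye(V)[y])   # Eq. (36)
p_new = softmax(z_new)

others = np.arange(V) != y
alpha = p_new[others] / p[others]

# Claim 3A: all non-y classes increase by the same amount ...
assert np.allclose(alpha, alpha[0]) and np.all(alpha > 1)
# ... and the closed-form ratio V / (V - 1 + e^{eta'}) matches
assert np.allclose(alpha, V / (V - 1 + np.exp(eta)))
```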
In other words, the probability mass of all other $p_i$ is squeezed into $p_{i^*}$. We continue the analysis of Claim 3B but consider a more extreme influence of $e^{z^t_j}$. For such a peaky $p^t$, we may have a very large $e^{z^t_{i^*}}$ that dominates $\alpha$; in other words, $\alpha_i \approx \frac{e^{z^t_{i^*}}}{\beta_{i^*}\, e^{z^t_{i^*}}} = \frac{1}{\beta_{i^*}}$. Then for any $i'$ we want to observe, $\alpha_{i'} \approx \frac{1}{\beta_{i^*}} < 1$. In other words, the model's predictions on all dimensions other than the one with the highest confidence in $p^t$ will decrease.

Last, we analyze the influence of $p_y$ to explain why "imposing a large negative gradient on the valley region" makes the squeezing effect more serious.

Claim 4: A smaller $p^t_y$ makes the non-max $p_i$ decay more easily, i.e., a stronger squeezing effect. This is also a general trend, observed in the experiments in Figure 20. Intuitively, since the model is already confident that $y$ cannot be the correct label (i.e., $p_y$ is very small), making the model further decrease its prediction $p_y$ does not make sense. We can also use the analysis above to understand how this happens. Recall Equation (40), where the values of $\beta$ are decomposed into three subgroups. From the definition of $\alpha_i$, we know that every $\beta_j < 1$ contributes to the hypothesis that $p_i$ increases after the update, where the strength of this contribution is controlled by $e^{z^t_j}$. Since a small $p^t_y$ means a small $e^{z^t_y}$, the influence of $\beta_{j=y} \ll 1$ is significantly weakened in this scenario. In other words, $\alpha_i < 1$ is more likely to occur for all possible $i$, which means the squeezing effect (all $p_{j\neq y}$ decrease) becomes more serious.

Claim 5: A learning rate with larger absolute value $|\eta|$ and a larger feature norm $\|\phi(x)\|_2^2$ amplify all the trends above, possibly more than one might expect. Throughout our analysis, the equivalent learning rate $\eta' < 0$ is a shared scalar in all $\beta_j$; hence a larger $|\eta'|$ amplifies all the aforementioned trends.
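Claims 3C and 4 can be illustrated with a hand-built peaky distribution: imposing the negative gradient on a valley class shrinks every class except the argmax, which absorbs the mass. A minimal sketch (the specific logit values are arbitrary choices that make $p^t$ peaky):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, eta = 50, -0.5
z = np.full(V, -2.0)                       # low confidence almost everywhere...
z[[5, 9, 10, 11]] = [3.0, 3.2, 3.5, 4.0]   # ...except a confident region
p = softmax(z)
i_star = int(np.argmax(p))                 # class 11, the peak of p^t

y = 42                                     # valley class: p[y] is tiny
z_new = z - eta * (p - np.eye(V)[y])       # Eq. (36), eta' < 0
p_new = softmax(z_new)

# Claim 3C: the argmax absorbs mass, while every other class (even the
# moderately confident classes 5, 9, 10) decreases.
assert p_new[i_star] > p[i_star]
others = np.arange(V) != i_star
assert np.all(p_new[others] < p[others])
```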
Furthermore, recall the shape of the exponential function $e^x$: a small change in $x$ (especially when $x > 1$) changes $e^x$ a lot. The terms $\beta_{j\neq y} = e^{-\eta'(1 + p^t_j - p^t_i)}$ in Case 1 and $\beta_{j=y} = e^{-\eta'(p^t_j - p^t_i - 1)}$ in Case 2 therefore play a stronger role if we use a larger learning rate $|\eta|$ or if the feature norm is larger.

E.2 VERIFY THE SQUEEZING EFFECT USING A SIMPLE EXPERIMENT

Figure 19: Illustration of how big positive and negative gradients influence the model's prediction.

Let us analyze a simple example to build intuition. We set $V = 50$, $d = 5$, $|\eta| = 0.5$, and a randomly generated $\phi(x)$. In the first row of Figure 20, we consider the model updating its parameters using standard SGD, assuming the label of this $x$ is 21. Specifically, we randomly generate $w^0$ by sampling each parameter from a standard Gaussian distribution and calculate $w^1$ using Equation (33). The two curves in each panel show the model's predicted distribution before and after this update. As expected, the positive vector on the 21st class "pulls up" $p^0(y = 21)$ and "pushes down" all other $p^1(y)$ at the same time. This trend is quite consistent across different settings (i.e., different choices of $V$, $d$, $x$, $\eta$, $w^0$, etc.), as depicted in the first panel of Figure 19.
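This first-row setting can be reproduced in a few lines: sample $\phi(x)$ and $w^0$, take one SGD step with a positive gradient on the target class, and compare $p^0$ with $p^1$. A minimal sketch of Equation (33) (the seed is arbitrary, and class indices are 0-based here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d, eta, y = 50, 5, 0.5, 21          # positive gradient on class 21
phi = rng.normal(size=d)               # fixed feature vector phi(x)
w0 = rng.normal(size=(d, V))           # read-out layer, standard Gaussian

p0 = softmax(w0.T @ phi)
e_y = np.eye(V)[y]

# Eq. (33): w^1 = w^0 - eta * phi(x) (p^0 - e_y)^T
w1 = w0 - eta * np.outer(phi, p0 - e_y)
p1 = softmax(w1.T @ phi)

# Eq. (36): equivalently, the logits move by -eta ||phi||^2 (p^0 - e_y),
# i.e., the equivalent learning rate is eta' = eta * ||phi(x)||^2
assert np.allclose(w1.T @ phi, w0.T @ phi - eta * (phi @ phi) * (p0 - e_y))

# The target class is pulled up; the total mass of all other classes drops.
assert p1[y] > p0[y]
others = np.arange(V) != y
assert p1[others].sum() < p0[others].sum()
```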
[Figure 20 panels: predicted distributions before and after one update, in linear and log scale, for A: positive gradient on $y_u^+$; B: negative gradient on a flat distribution; C: negative gradient on the peak of a peaky distribution; D: negative gradient on the valley of a peaky distribution; and a large-$V$ variant ($V = 1000$).]

Figure 20: Experimental verification of the "squeezing effect" illustrated in Figure 19 using a simple multi-class logistic regression task.

We then set $\eta = -0.5$ to simulate the negative gradient in DPO and consider three different settings. First, we assume the model's prediction on $x$ is relatively flat, as demonstrated in the second row of Figure 20, where the predicted probability of every class is around 0.02. The negative gradient is imposed on $y = 42$, a randomly selected class. We see that the negative adaptation vector "pushes down" $p^1(y = 42)$ heavily and reassigns the decreased probability mass evenly to all other classes, as illustrated in the second panel of Figure 19.

Although the behavior described above follows our intuition well, a flat $p^0$ is not common in LLM finetuning, because finetuning usually starts from a pretrained $w$, where the model's prediction is likely to be non-uniform. So in the third row of Figure 20, we consider a more practical $w^0$ that leads to a multi-modal $p^0$. In this example, the model has relatively high confidence in classes 5 to 11 and low confidence in all other dimensions. We set the target label to 11 (i.e., the class in which the model has the highest confidence) and use $\eta = -0.5$ to "push down" the model's prediction on this class.
As demonstrated by the blue curve, $p^1(y = 11)$ decreases a lot, as expected. However, unlike the flat-$p^0$ case, where the model evenly assigns the reduced probability mass to all other $y$, the model in this example "squeezes" the mass into the other confident predictions, i.e., classes 6, 9, and 10, leaving the confidence of the remaining classes almost unchanged. This trend is consistent whenever the negative gradient is imposed on the "peaky" region of a non-uniform distribution, as illustrated in the third panel of Figure 19.

The previous setting simulates on-policy DPO well, where the rejected examples $y_u^-$ are sampled from the high-confidence region of the model's predictions. Then, what happens if we conduct off-policy DPO and impose a big negative gradient on classes that already have very low confidence? See the fourth row of Figure 20, where we use the same $w^0$ and $\eta$ as in the previous case. The only difference is that we change the label of $x$ to 42, where $p^0(y = 42)$ is very small (roughly $10^{-5}$) before training. The behavior in this setting is quite interesting: we first observe a big increase in $p^1(y = 11)$, meaning the model "squeezes" the probability mass into the most confident class of $p^0$, similar to the previous setting. More interestingly, the predictions on all other $y$ are heavily "pushed down", including classes 6, 9, and 10, whose confidence was relatively high before training. In the last two panels of Figure 20, we set $V = 1000$ and find this trend is even more pronounced (possibly because the absolute value of the equivalent learning rate, which depends on $\|\phi(x)\|$, becomes larger). Since the vocabulary size of a common LLM is usually more than 50k, the squeezing effect in real systems is non-negligible even when the learning rate is small.
This trend is also quite consistent as long as we impose a big negative gradient on the "valley" region of the model's prediction, as illustrated in the last panel of Figure 19. Now we can answer the question of why all observed $\pi_{\theta_t}(y)$ decrease and where the probability mass has gone: for each token, the probability mass is squeezed into the token with the highest confidence. Note that the tokens with the highest confidence do not necessarily form a preferred response: this just reinforces the prior knowledge contained in $\theta_0$, which can be a drawback of off-policy DPO.

The hypothesis above is supported not only by this simple logistic regression problem but also by many consistent trends in LLM finetuning experiments. First, by comparing the average decay speed of the $\pi_{\theta_t}(y)$ when the model is SFT-ed for different numbers of epochs before DPO (Figure 17), we notice that longer SFT leads to a peakier $\pi_{\theta_0}(y)$ and hence a faster decay of all non-argmax responses. This is because a longer SFT stage eventually pushes $\pi_{\theta_0}(y_u^-)$ further down; in the DPO stage, the big negative gradient is then imposed on a deeper valley region, which makes the squeezing effect stronger. Second, to verify this hypothesis directly, we track the sum of the log-likelihoods of the tokens with the largest confidence, which we call the "argmax confidence", i.e., $\sum_l \pi_{\theta_t}(\arg\max_{y_l\in\mathcal{Y}_l} y_l \mid x, y_{1:l-1})$. As illustrated in the last panel of Figure 4, the argmax confidence keeps increasing while all the other $\pi_{\theta_t}(y)$ decrease: the missing probability mass is found! Last, in the dataset-extension method proposed in Section 4.3 and Appendix F, we train the model on both $[x; y_u^+]$ and $[x; y_u^-]$ during SFT, so as to also "pull up" the $y_u^-$ region before conducting DPO. Compared with the standard training flow, i.e., SFT on $[x; y_u^+]$ followed by DPO, the proposed flow then has a lower "argmax confidence" during DPO.
That is because we pulled up $\pi_{\theta_0}(y_u^-)$ during the modified SFT stage, so the big negative gradient is imposed on the peaky region rather than the valley region of the model's prediction. This change in turn weakens the squeezing effect, as illustrated in Figure 5.

F A SIMPLE METHOD TO IMPROVE ALIGNMENT

F.1 PINPOINTING THE DRAWBACK OF OFF-POLICY DPO

Based on our observations and analysis above, we speculate that "imposing big negative gradients on the valley region" is one of the bottlenecks of off-policy RL-free methods. Starting from this hypothesis, we believe introducing on-policy sampling has the potential to mitigate this problem, as demonstrated in SPIN (Z. Chen et al. 2024) and other online algorithms (S. Guo, B. Zhang, et al. 2024). However, we also speculate that these methods improve the model's performance not only by mitigating the squeezing effect. Hence, to figure out to what extent the squeezing effect can harm the model's performance, we propose a simple yet effective method to isolate its influence. As this method directly mitigates the effect, it can also be considered an ablation study of this interesting phenomenon.

Figure 21: Illustration of the proposed method and baseline. "E" is short for "dataset extension".

F.2 A SIMPLE METHOD INSPIRED BY LEARNING DYNAMICS

As illustrated in Figure 21, the baseline method is a standard SFT-then-DPO pipeline. The proposed method is very simple: we only need to augment the dataset used in SFT by adding an $(x, y_u^-)$ pair for each sample. All other settings are unchanged. The motivation is also quite simple: since SFT pulls up the region of the supervised $\hat{y}$, and we do not want the model to impose big negative gradients on a valley region, we can simply pull up those $y_u^-$ before DPO.
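The dataset extension above can be sketched as a one-pass transformation of the preference data. The field names (`prompt`, `chosen`, `rejected`, `target`) are assumptions for illustration, not the paper's actual code:

```python
# Hypothetical sketch of the "dataset extension": before DPO, each
# preference triple (x, y+, y-) is turned into TWO SFT pairs, so that
# SFT also pulls up the rejected response y-. Field names are assumed.

def extend_sft_dataset(preference_data):
    """Turn each (prompt, chosen, rejected) triple into two SFT pairs."""
    sft_pairs = []
    for ex in preference_data:
        sft_pairs.append({"prompt": ex["prompt"], "target": ex["chosen"]})
        sft_pairs.append({"prompt": ex["prompt"], "target": ex["rejected"]})
    return sft_pairs

data = [{"prompt": "Q1", "chosen": "good answer", "rejected": "bad answer"}]
pairs = extend_sft_dataset(data)   # one triple -> two SFT examples
```

The DPO stage itself is left untouched; only the SFT corpus doubles in size, which is the entire intervention.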
Furthermore, as demonstrated in the third panel of Figure 19 and Equation (15), the negative gradient in DPO is strong enough to push down $\pi_{\theta_t}(y_u^-)$, because the gradient is large whenever the model cannot separate $y_u^+$ and $y_u^-$ well. In other words, under DPO's loss there is no need to worry about the model overfitting those $y_u^-$ during SFT.

F.3 EXPERIMENTAL VERIFICATION

To verify our analysis, we conduct experiments by finetuning a pretrained Qwen1.5-1.8B (J. Bai et al. 2023) model on the Anthropic-HH dataset (Y. Bai et al. 2022); we use a subset containing 5000 random examples from the training split. The pipelines of the different methods are illustrated in Figure 21. In this experiment, we call the pretrained model $B_0$ (and $E_0$, which is identical to $B_0$); it is an official checkpoint pretrained by J. Bai et al. (2023). Models $B_1$ and $E_1$ are the models after SFT, which differ between the two methods. Models $B_{2\text{-}2/4/6}$ and $E_{2\text{-}2/4/6}$ are the models finetuned using DPO for 2/4/6 epochs. All settings of the DPO stage (except the starting model) are the same for the two methods. We first observe the learning dynamics of these two methods in Figure 5, where all the trends support our analysis quite well. See the first two panels, which compare $\pi_{\theta_t}(y_u^+)$ and $\pi_{\theta_t}(y_u^-)$, respectively. The two methods have almost identical curves for $\pi_{\theta_t}(y_u^+)$ in the SFT stage but behave quite differently on $\pi_{\theta_t}(y_u^-)$, because the proposed method directly trains the model on $(x, y_u^-)$. Then, after the SFT stage, we conduct DPO with identical settings for both methods. From the first three panels, we observe that every curve of the proposed method decays more slowly than its counterpart in the baseline. That is the benefit of "pulling up" the $\pi_{\theta_0}(y_u^-)$ region before conducting DPO.
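The claim that the DPO gradient is large when the model cannot separate $y_u^+$ from $y_u^-$ can be seen directly from the per-example gradient weight of the DPO loss (Rafailov et al. 2023), which is $\sigma\!\big(\beta(\hat r(y^-) - \hat r(y^+))\big)$ with $\hat r = \log \pi_\theta - \log \pi_{\text{ref}}$. The log-probability values below are made up for illustration:

```python
import math

# The DPO update on a pair (y+, y-) is scaled by
#   sigmoid(beta * (r_rej - r_cho)),  r = log pi_theta - log pi_ref.
# When the implicit rewards are equal (model cannot separate the pair),
# the weight is 0.5; once the pair is well separated, it shrinks to ~0.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad_weight(logp_cho, logp_rej, ref_cho, ref_rej, beta=0.1):
    r_cho = logp_cho - ref_cho
    r_rej = logp_rej - ref_rej
    return sigmoid(beta * (r_rej - r_cho))

# model that cannot yet separate y+ from y-: maximal gradient weight
w_hard = dpo_grad_weight(-100.0, -100.0, -100.0, -100.0)   # exactly 0.5
# model that already separates them well: weight is nearly zero
w_easy = dpo_grad_weight(-80.0, -200.0, -100.0, -100.0)
```

So even after the extended SFT stage has raised $\pi_{\theta_0}(y_u^-)$, DPO's own weighting guarantees a strong push-down on $y_u^-$ until the pair is separated, which is why overfitting $y_u^-$ during SFT is not a concern.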
With this specific design, the big negative gradients in DPO are imposed on the peaky region (the behavior resembles the third panel of Figure 19) rather than the valley region (see the fourth panel), hence the squeezing effect is successfully restrained.

[Figure 22 shows four panels each for (a) Anthropic-HH and (b) UltraFeedback: log probability versus number of epochs for the chosen $y_u^+$ (baseline vs. extend), the rejected $y_u^-$, other responses (all rephrases and a non-human sequence), and the argmax response.]

Figure 22: Learning dynamics of the baseline and the proposed method with training-data extension. The SFT part is the same as Figure 5 in the main text. Key trends to observe: 1) the baseline and the extend method behave similarly on $y_u^+$ during SFT; 2) the extend method considerably increases $y_u^-$ during SFT; 3) the squeezing effect of the extend method is weaker (all other responses decay more slowly and the confidence on the argmax response increases more slowly).

The results in the last panel of Figure 5 also strongly verify the whole picture. During the SFT stage, the observed "argmax probability" of the proposed method is higher than the baseline's, because we impose twice the "pull up" pressure, i.e., the extra pressure from $(x, y_u^-)$, compared with the baseline. However, at the beginning of DPO, we observe a clear drop in the orange curve.
That is because the negative gradients are imposed exactly on those $y_u^-$ (in the second panel of Figure 5, $\pi_{\theta_0}(y_u^-)$ is already very high). Furthermore, at the end of DPO, we see that the "argmax probability" of the proposed method is significantly lower than in the baseline setting, which implies that the squeezing effect is restrained in our setting. To figure out whether the model trained using the proposed flow, which successfully restrains the squeezing effect, indeed aligns better, we conduct pairwise comparisons of the models' responses and report their win rates as in Rafailov et al. (2023). Specifically, we first randomly select 1000 test questions from the test split of Anthropic-HH and generate 1000 responses by feeding the prompts to each of these models (we use the default sampling setting provided in Rafailov et al. (2023)). Then, with the prompt template provided in Figure 23, we evaluate the win rate of the response pairs using GPT-3.5-Turbo and Claude 3 Haiku. Here we report the average win rates of different comparisons (degenerate responses are not compared, so the number of compared examples is slightly smaller than 1000). Note that a win rate greater than 0.5 means the method that comes first is preferred by the evaluator.

1. Compare models after SFT: $E_1$ vs. $B_1$, win rate is 0.4729 and 0.4679.
2. Demonstrate the benefits of DPO:
   a. $B_{2\text{-}4}$ vs. $B_1$, win rate is 0.6727 and 0.6411;
   b. $E_{2\text{-}4}$ vs. $E_1$, win rate is 0.6898 and 0.7321.
3. Compare the proposed method and the baseline after DPO for different numbers of epochs:
   a. $E_{2\text{-}2}$ vs. $B_{2\text{-}2}$, win rate is 0.6518 and 0.5151;
   b. $E_{2\text{-}4}$ vs. $B_{2\text{-}4}$, win rate is 0.6928 and 0.6045;
   c. $E_{2\text{-}6}$ vs. $B_{2\text{-}6}$, win rate is 0.6667 and 0.5432.
4.
Compare the best model $E_{2\text{-}4}$ with the other two checkpoints:
   a. $E_{2\text{-}4}$ vs. $E_{2\text{-}2}$, win rate is 0.6853 and 0.5517;
   b. $E_{2\text{-}4}$ vs. $E_{2\text{-}6}$, win rate is 0.6324 and 0.5316.

In the first comparison, we find that the model trained on both $(x, y_u^+)$ and $(x, y_u^-)$ loses more often (win rate smaller than 0.5), which makes sense because $E_1$ assigns higher probabilities to the less preferred responses. In the second comparison, the models finetuned using DPO indeed align with human values better. The win rate of the proposed method is slightly higher here, which might also be explained by $E_1$ leaving more room for improvement. Hence, in the third group, we directly compare the models after DPO under the two methods. In this group, all models trained with the proposed method beat their baseline counterparts by a large margin, which demonstrates the effectiveness of our proposed method. Furthermore, we find that the evaluations made by Claude are more reserved than those by GPT (the numbers are smaller), but the trends within this group are consistent: $E_{2\text{-}4}$ brings the largest improvement and is potentially the best model. This is verified in the fourth group of comparisons, where we evaluate $E_{2\text{-}4}$ against $E_{2\text{-}2}$ and $E_{2\text{-}6}$. The results demonstrate that neither a too-long nor a too-short DPO finetuning stage is the best choice.

Figure 23: Prompt used for evaluating the models' responses (from Rafailov et al. (2023)), an example feedback from GPT-3.5-Turbo, and two examples of the "degenerate" effect described in Holtzman et al. (2020). Although both $B_2$ and $E_2$ inevitably generate such degenerate responses, we find this phenomenon is less common with the proposed method.
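The win-rate bookkeeping used in the comparisons above can be sketched as follows. The judge here is a stand-in for the external evaluator (the paper queries GPT-3.5-Turbo and Claude 3 Haiku with the template of Figure 23); the label values and helper name are assumptions for illustration:

```python
# Hypothetical sketch of pairwise win-rate computation: an external
# judge labels each response pair "A" (first method wins), "B", or
# None (degenerate pair, skipped, so the denominator shrinks slightly,
# as noted in the text). A result > 0.5 means method A is preferred.

def win_rate(judgments):
    """judgments: list of 'A', 'B', or None (degenerate, skipped)."""
    kept = [j for j in judgments if j is not None]
    if not kept:
        return 0.0
    return sum(1 for j in kept if j == "A") / len(kept)

labels = ["A", "A", "B", None, "A"]   # 3 wins for A out of 4 kept pairs
rate = win_rate(labels)               # -> 0.75
```

Skipping degenerate pairs before dividing mirrors the note above that the number of compared examples falls slightly below 1000.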