Paper deep dive
Data Shapley in One Training Run
Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
Models: GPT-2
Abstract
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 6:42:29 PM
Summary
The paper introduces 'In-Run Data Shapley', a scalable method for data attribution in machine learning that avoids the computational burden of model retraining. By calculating Shapley values for individual gradient update iterations and accumulating them, the technique enables efficient data contribution analysis for large-scale foundation models, providing insights into data curation, stage-dependent contributions, and copyright implications in generative AI.
Entities (5)
Relation Signals (3)
Shapley Value → foundationof → Data Shapley
confidence 98% · Data Shapley, i.e., the application of the Shapley value in data attribution...
In-Run Data Shapley → appliedto → Foundation Models
confidence 95% · This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage.
In-Run Data Shapley → improvesupon → Retraining-based Data Shapley
confidence 95% · To address the issues associated with Retraining-based Data Shapley such as high computational costs... we propose a novel data attribution method specifically tailored for a single training run.
Cypher Suggestions (2)
Find all methods related to data attribution · confidence 90% · unvalidated
MATCH (m:Method)-[:APPLIED_TO]->(t:Technology) WHERE m.name CONTAINS 'Shapley' RETURN m, t
Identify datasets analyzed by the proposed method · confidence 85% · unvalidated
MATCH (m:Method {name: 'In-Run Data Shapley'})-[:ANALYZED]->(d:Dataset) RETURN d.name
Full Text
258,705 characters extracted from source content.
Data Shapley in One Training Run

Jiachen T. Wang (Princeton University), Prateek Mittal (Princeton University), Dawn Song (UC Berkeley), Ruoxi Jia (Virginia Tech)

Abstract

Data Shapley offers a principled framework for attributing the contribution of data within machine learning contexts. However, the traditional notion of Data Shapley requires re-training models on various data subsets, which becomes computationally infeasible for large-scale models. Additionally, this retraining-based definition cannot evaluate the contribution of data for a specific model training run, which may often be of interest in practice. This paper introduces a novel concept, In-Run Data Shapley, which eliminates the need for model retraining and is specifically designed for assessing data contribution for a particular model of interest. In-Run Data Shapley calculates the Shapley value for each gradient update iteration and accumulates these values throughout the training process. We present several techniques that allow the efficient scaling of In-Run Data Shapley to the size of foundation models. In its most optimized implementation, our method adds negligible runtime overhead compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

1 Introduction

In today's data-driven world, understanding the contribution of each data point is crucial, especially with the advent of foundation models that rely on vast amounts of training data from various sources. The lack of reliable data attribution mechanisms can lead to significant legal and societal issues, resulting in a growing backlash against the broader use of data for model training (Heikkilä, 2023). For instance, there is a risk of violating intellectual property rights, failing to fairly compensate data creators, and disincentivizing them from producing new, high-quality content (Henderson et al., 2023). This has already resulted in legal disputes, such as the New York Times' lawsuit against Microsoft/OpenAI (Grynbaum & Mac, 2023). Moreover, foundation models are often trained on massive datasets scraped from the internet, which can include low-quality and harmful content (Gao et al., 2020; Raffel et al., 2020; Touvron et al., 2023). Problematic data not only wastes computational resources but also skews model outputs, potentially leading to biased or inaccurate results. By understanding the contribution of each data source, we can identify and mitigate the influence of low-quality data, thereby improving the efficiency and quality of model training. Since the training data of foundation models comes from multiple stakeholders, it is essential to have an algorithm that can fairly attribute data contributions. In recent years, significant progress has been made in understanding what it means to fairly quantify and attribute data source contributions, with the Shapley value (Shapley, 1953) emerging as a widely adopted framework (Ghorbani & Zou, 2019; Jia et al., 2019b).
Originating from cooperative game theory, the Shapley value uniquely satisfies several desirable properties: (1) it assigns equal scores to equally impactful data points, ensuring fairness; (2) the sum of contribution scores equals the total utility, meaning the scores always represent a share of the total utility; (3) it supports additive decomposition across multiple utility functions, allowing the contribution to an entire test set to be calculated by summing the scores for individual test points. Because the Shapley value uniquely satisfies these properties, it eliminates the uncertainty and ambiguity surrounding which attribution framework should be used conceptually. While there are other, non-Shapley data attribution frameworks, such as Datamodels (Park et al., 2023; Ilyas et al., 2022) and influence functions (Koh & Liang, 2017), they lack the clear theoretical foundation and uniqueness provided by the Shapley value.

The original Data Shapley definition faces computational and conceptual limitations. Data Shapley, i.e., the application of the Shapley value to data attribution, has so far been limited to very small-scale models. Existing methods (Ghorbani & Zou, 2019) to estimate the Shapley value require retraining the model numerous times on different subsets of data to evaluate the contribution of each data source, making them computationally infeasible for foundation models. Retraining-based methods also suffer from a conceptual issue often overlooked in the literature: they assess data contribution for a general learning algorithm rather than a particular model. While one might interpret the former as an approximation of the latter, these two quantities can be quite different in practice, especially when the learning algorithm is randomized and sensitive to factors like random initialization and training data order. In many real-life scenarios, however, the primary interest lies in understanding data contribution to the specific model being trained and deployed. This paper introduces In-Run Data Shapley, a novel approach that makes fair data attribution applicable to large-scale foundation models. Unlike retraining-based Data Shapley, In-Run Data Shapley quantifies the contribution of each data source to the specific target model of interest.

Technical contributions. Our key insight is that ML models are trained using iterative algorithms, where the model performance change in one iteration is sufficiently small to be accurately approximated by first- or second-order Taylor expansions. We show that the Shapley value for the approximated one-step model performance change can be derived analytically via gradient dot-products or gradient-Hessian-gradient products between training and validation data. Hence, we can compute the Data Shapley scores for each model update step and accumulate the scores throughout the training process. However, the per-sample gradient vectors required for computing the Shapley value in each training iteration introduce significant computational overhead. To address this challenge, we develop a series of technical tools that enable the exact calculation of gradient dot-products and gradient-Hessian-gradient products in one and two backward passes, respectively, without the need to instantiate any additional gradient vectors or Hessian matrices. Collectively, these tools allow for the efficient computation of In-Run Data Shapley.
In particular, with sufficient GPU memory, its most efficient implementation is as fast as regular training.

Empirical implications. Given the efficient algorithms developed in this paper, one can, for the first time, perform data attribution at the scale of foundation model pretraining. While this paper focuses on GPT-2 and Pythia-410M as a pilot study, our approach is applicable to larger-scale industrial models given sufficient computing resources. We performed various case studies that provide fresh insights into training data's contribution to foundation model pretraining. (1) There is considerable room for improvement in data curation for pretraining (Section 5.3.1). Even well-curated pretraining corpora contain data points that negatively impact the training process. We demonstrate the effectiveness of In-Run Data Shapley in identifying these low-quality data points. By computing In-Run Data Shapley values during training and removing negatively valued data points, we show that the cleaned dataset leads to significantly faster model convergence and improved performance compared to the original dataset. Interestingly, despite the Pile dataset (Gao et al., 2020) already undergoing multiple layers of curation, In-Run Data Shapley assigns negative values to approximately 16% of the data. We found a significant amount of noisy data among them, highlighting the need for improved data curation for foundation model training. (2) Data's contribution is stage-dependent (Section 5.3.2). In-Run Data Shapley can capture the dynamics of contribution through the course of training, a fine-grained aspect that cannot be captured by prior works. In-Run Data Shapley shows that in the early stages of training, general corpora tend to have a relatively large contribution regardless of the downstream task, because general corpora help the model learn basic language patterns, grammar, and common knowledge. In the later stages of training, however, the contribution from domain-specific corpora becomes dominant, and the contribution of the general corpus phases out. (3) Rethinking copyright in generative AI: contribution beyond memorization (Section 5.3.3). We studied training data's contribution to validation points of varying similarity levels. We found that even when the validation data is a complete rewrite of the training data that merely maintains the topic, the training data still contributes significantly. This finding has implications for the current dialogue around what constitutes a copyright violation in generative AI (Mulligan & Li, 2024). While unfair use of copyrighted content is generally only considered when the generated data is an almost verbatim replication of the training data, our contribution analysis shows that some data owners should receive a certain royalty share for generated content even if the output does not closely resemble the copyrighted material.

2 Background of Data Shapley

In this section, we formalize the setup of data attribution for ML and revisit Data Shapley's definition.

Setup & Goal. Given a dataset $D_{\mathrm{tr}} := \{z_i\}_{i=1}^N$, data attribution (or valuation) aims to assign a score to each training data point $z_i$, reflecting its importance for the trained ML model's performance on a certain task.
Formally, we seek a score vector $(\phi_{z_i})_{i=1}^N$ where each $\phi_{z_i} \in \mathbb{R}$ reflects the value of $z_i$. The Shapley value (SV) (Shapley, 1953), originating from game theory, stands out as a distinguished method for equitably distributing total profit among all participating players. Before diving into its definition, we first discuss a fundamental concept: the utility function.

Utility function. A utility function maps an input dataset to a score indicating the utility of the dataset for model training. In most of the existing literature (Ghorbani & Zou, 2019; Jia et al., 2019b), the utility function $U$ is chosen as the performance (e.g., accuracy or loss) of the trained model on a hold-out validation set. That is, given a training set $S$, the utility function is $U(S) := \mathrm{Perf}(\mathcal{A}(S))$, where $\mathcal{A}$ represents a learning algorithm that trains a model on dataset $S$, and $\mathrm{Perf}(\cdot)$ is a function assessing the model's performance. For example, $\mathrm{Perf}(\cdot)$ can be the accuracy for a classification task or the perplexity for a language completion task, evaluated on a (set of) hold-out validation data.

Definition 1 (Shapley value (Shapley, 1953)). Let $U(\cdot)$ denote a utility function and $D$ represent a training set of $N$ data points. The Shapley value, $\phi_z(U)$, assigned to a data point $z \in D$ is defined as
$$\phi_z(U) := \frac{1}{N} \sum_{k=1}^{N} \binom{N-1}{k-1}^{-1} \sum_{S \subseteq D_{-z},\, |S| = k-1} \left[ U(S \cup \{z\}) - U(S) \right]$$
where $D_{-z} = D \setminus \{z\}$. In simple terms, the Shapley value is a weighted average of the marginal contribution $U(S \cup \{z\}) - U(S)$, i.e., the utility change when the point $z$ is added to different subsets $S$. For simplicity, we often write $\phi_z$ when the utility function is clear from the context. The popularity of the Shapley value is attributable to the fact that it is the unique data value notion satisfying four axioms: Null player, Symmetry, Linearity, and Efficiency. The mathematical definitions of these axioms are deferred to Appendix A.1. Here, we introduce the linearity axiom, which will be used later.

Theorem 2 (Linearity of the Shapley value (Shapley, 1953)). For any two utility functions $U_1, U_2$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, we have $\phi_z(\alpha_1 U_1 + \alpha_2 U_2) = \alpha_1 \phi_z(U_1) + \alpha_2 \phi_z(U_2)$.
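To make Definition 1 concrete, here is a minimal sketch (ours, not from the paper) that computes exact Shapley values by brute-force enumeration; `utility` is any black-box function from a tuple of indices (including the empty tuple) to $U(S)$:

```python
from itertools import combinations
from math import comb

def exact_shapley(utility, n):
    """Exact Shapley values per Definition 1, by brute-force enumeration.

    `utility` maps a tuple of training-point indices S ⊆ {0, ..., n-1}
    (including the empty tuple) to U(S). Evaluates U on O(2^n) subsets,
    so this is only feasible for toy n."""
    values = []
    for z in range(n):
        rest = [i for i in range(n) if i != z]
        phi = 0.0
        for k in range(1, n + 1):              # |S ∪ {z}| = k, so |S| = k - 1
            weight = 1.0 / (n * comb(n - 1, k - 1))
            for S in combinations(rest, k - 1):
                phi += weight * (utility(S + (z,)) - utility(S))
        values.append(phi)
    return values
```

With $N$ points the inner loops evaluate the utility $O(2^N)$ times, so even modest $N$ is out of reach; this is what motivates both Monte Carlo estimation and the closed forms derived in Section 4.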
Retraining-based Data Shapley. The convention of defining the utility function for Data Shapley as $U(S) = \mathrm{Perf}(\mathcal{A}(S))$ was introduced in Ghorbani & Zou (2019), where $\mathcal{A}$ is a learning algorithm such as a neural network trained by stochastic gradient descent (SGD) or its variants. With this choice of utility function, the precise calculation of the Shapley value requires retraining models on various subsets of the data. This is because the marginal contribution of a data point, $U(S \cup \{z\}) - U(S)$, can only be obtained by training models on both $S$ and $S \cup \{z\}$ and comparing their performance. As a result, we refer to this method as "Retraining-based Data Shapley".

Limitations beyond efficiency (detailed in Appendix B.1): In addition to the high computational costs, retraining-based Data Shapley suffers from the following limitations. (1) Highly unstable value scores: when stochastic learning algorithms such as SGD are used, the resulting value scores can be highly unstable (Wang & Jia, 2023a). This instability may lead to unreliable results and potential violations of the Shapley value's fairness axioms. (2) Conceptual limitations: retraining-based Data Shapley measures the average data contribution to the learning process itself, across many retrainings on different data subsets. As a result, it produces attribution scores that apply broadly to the algorithm but fail to reflect the data contribution to a specific training run. On the other hand, providing insights into how individual data points contribute to the deployed model enables more targeted analysis and debugging, thereby improving model interpretability.
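For contrast, a hedged sketch of what the retraining-based utility $U(S) = \mathrm{Perf}(\mathcal{A}(S))$ looks like in practice; the model class, array names, and scoring choice are illustrative, not the paper's setup. Plugging this into the enumeration sketch above makes every marginal contribution $U(S \cup \{z\}) - U(S)$ cost two full training runs, which is the computational bottleneck the paper targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retraining_utility(subset_idx, X_train, y_train, X_val, y_val):
    """U(S) = Perf(A(S)): retrain from scratch on the subset S and score on
    held-out validation data. Each marginal contribution then costs two full
    training runs, which is what makes Retraining-based Data Shapley
    computationally infeasible for large models."""
    subset_idx = list(subset_idx)
    if len(subset_idx) == 0:
        return 0.0  # convention for the empty coalition
    # Note: subsets containing a single class would need special handling
    # in a real run; ignored here for brevity.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[subset_idx], y_train[subset_idx])
    return model.score(X_val, y_val)  # validation accuracy as Perf(·)
```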
3 In-Run Data Shapley

To address the issues associated with Retraining-based Data Shapley, such as high computational costs, value instability, and the inability to assess the contribution towards a specific trained model, we propose a novel data attribution method specifically tailored for a single training run. Our key idea is to leverage the iterative nature of model training and employ a "divide and conquer" approach: breaking down the problem of valuing data contributions for the entire training process into subproblems of valuing data contributions for individual iterations.

Utility function for a single gradient update. Traditionally, the utility function $U(S) = \mathrm{Perf}(\mathcal{A}(S))$ encapsulates the overall impact of a training set $S$ across the complete training process. Here, we instead consider a "local utility function" that evaluates the impact of data subsets within a single iteration. Specifically, given a training dataset $D_{\mathrm{tr}} = \{z_i\}_{i=1}^N$, a deep learning model is usually trained to minimize the training loss $\sum_{i=1}^N \ell(w, z_i)$ via an iterative optimization procedure such as SGD. The performance of the model is typically measured on a set of validation points $\{z^{(\mathrm{val})}\}$. During an iteration $t$, a batch $\mathcal{B}_t \subseteq D_{\mathrm{tr}}$ of the training points is used to update the model parameters from $w_t$ to $w_{t+1}$ with $w_{t+1} := w_t - \eta_t \sum_{z \in \mathcal{B}_t} \nabla \ell(w_t, z)$, where $\eta_t$ is the learning rate at iteration $t$. (Note that in practice we take the gradient average for the update, but here we incorporate the normalization term $|\mathcal{B}_t|$ into the learning rate $\eta_t$ for a clean presentation.) A complete run of neural network training thus consists of model checkpoints $w_0, w_1, \ldots, w_T$. For a given validation data point $z^{(\mathrm{val})}$, we can define the "local utility function" at a single iteration $t$ as
$$U^{(t)}(S; z^{(\mathrm{val})}) := \ell(\tilde{w}_{t+1}(S), z^{(\mathrm{val})}) - \ell(w_t, z^{(\mathrm{val})}) \quad (1)$$
where $\tilde{w}_{t+1}(S) := w_t - \eta_t \sum_{z \in S} \nabla \ell(w_t, z)$ and $S \subseteq \mathcal{B}_t$ is a subset of the batch selected in the $t$-th iteration of the original training.

Interpretation: The local utility function $U^{(t)}$ represents the loss change at iteration $t$ when only the subset $S$ is used for the gradient update. This approach incorporates the realization of random batch selection at the $t$-th iteration into the utility function. It can also encode other forms of training randomness (e.g., dropout) at iteration $t$. By accounting for the specific realization of training randomness, we obtain a deterministic utility function for each iteration, effectively enabling targeted attribution to the specific training run.

Data Shapley for a single gradient update. While the utility $U^{(t)}$ is defined over $\mathcal{B}_t$ instead of the full training set $D_{\mathrm{tr}}$, it is easy to augment it to $D_{\mathrm{tr}}$. More formally, in the augmented utility function we have $\tilde{w}_{t+1}(S) := w_t - \eta_t \sum_{z \in S \cap \mathcal{B}_t} \nabla \ell(w_t, z)$ for $S \subseteq D_{\mathrm{tr}}$. The Shapley value $\phi_z(U^{(t)})$ will be exactly the same as the Shapley value corresponding to the augmented utility function for any $z \in \mathcal{B}_t$, and $\phi_z(U^{(t)}) = 0$ for any $z \in D_{\mathrm{tr}} \setminus \mathcal{B}_t$ (see Theorem 5 in Wang & Jia (2023b)). Therefore, for a clean presentation, we slightly abuse the notation, where $U^{(t)}$'s meaning depends on the context.
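A minimal PyTorch sketch (ours) of the local utility function in Equation (1), assuming plain SGD and a deterministic forward pass; it evaluates the counterfactual one-step update $\tilde{w}_{t+1}(S)$ for a subset $S$ of the batch and restores the parameters afterwards. All names are illustrative.

```python
import torch

def local_utility(model, loss_fn, subset, z_val, lr):
    """U^(t)(S; z_val) = ℓ(w̃_{t+1}(S), z_val) − ℓ(w_t, z_val)  (Equation (1)).
    `model` holds the current parameters w_t; `subset` is a list of (x, y)
    training pairs (each with a leading batch dimension of 1) and `z_val` a
    single (x, y) validation pair. As in the paper, the 1/|B_t| normalization
    is assumed folded into the learning rate `lr`."""
    x_val, y_val = z_val
    with torch.no_grad():
        base_loss = loss_fn(model(x_val), y_val).item()   # ℓ(w_t, z_val)

    model.zero_grad()
    for x, y in subset:                        # Σ_{z∈S} ∇ℓ(w_t, z)
        loss_fn(model(x), y).backward()        # gradients accumulate in .grad

    with torch.no_grad():
        for p in model.parameters():           # w̃_{t+1}(S) = w_t − lr·Σ∇ℓ
            if p.grad is not None:
                p -= lr * p.grad
        new_loss = loss_fn(model(x_val), y_val).item()
        for p in model.parameters():           # restore w_t for other subsets
            if p.grad is not None:
                p += lr * p.grad
    model.zero_grad()
    return new_loss - base_loss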
Data Shapley for the entire training run. Building on the concept of a "local" utility function for a single gradient update iteration, we naturally extend this to a "global" utility function for the entire training process, defined as $U(S) = \sum_{t=0}^{T-1} U^{(t)}(S)$.

Interpretation: This global utility function can be interpreted as the cumulative loss change of the entire training run, but under the counterfactual scenario where only a subset of the training data $S$ is used. In other words, it aggregates the total impact of the subset $S$ on the model's performance throughout the entire training process. Due to the linearity property of the Shapley value (Theorem 2), we have $\phi_z(U) = \sum_{t=0}^{T-1} \phi_z(U^{(t)})$. This new Data Shapley value, which we call In-Run Data Shapley, represents the cumulative contribution of the data point $z$ across all gradient update iterations within a single training run. This approach breaks down the broader utility into more manageable, step-by-step assessments that capture the immediate effects of data points on model updates, and provides a more fine-grained view of how individual data points contribute to the model's performance at each step of the training process. Notably, the sum of individual data points' Shapley values equals the overall loss reduction achieved by the model during the entire training run, due to the Shapley value's efficiency axiom (see Appendix A.1). This provides a meaningful and interpretable measure of data importance. In Appendix B, we give an in-depth comparison between Retraining-based and In-Run Data Shapley.

Remark 1 (Multiple validation points). In practice, model performance is often assessed on a validation set $D^{(\mathrm{val})} = \{z^{(\mathrm{val})}\}$. After computing $\phi_z(U(\cdot\,; z^{(\mathrm{val})}))$ for each $z^{(\mathrm{val})} \in D^{(\mathrm{val})}$, one can compute the Shapley value corresponding to the utility function on the full validation set, $U(S; D^{(\mathrm{val})}) := \sum_{z^{(\mathrm{val})} \in D^{(\mathrm{val})}} U(S; z^{(\mathrm{val})})$, by simply taking the sum $\phi_z(U(\cdot\,; D^{(\mathrm{val})})) = \sum_{z^{(\mathrm{val})} \in D^{(\mathrm{val})}} \phi_z(U(\cdot\,; z^{(\mathrm{val})}))$, due to the linearity property of the Shapley value (Theorem 2). Hence, for a clean presentation, we consider only a single $z^{(\mathrm{val})}$ in this paper. However, all the techniques we developed can be extended to multiple validation points.
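By Theorem 2, In-Run Data Shapley is just a running sum of per-iteration scores, which is what makes it computable alongside training. A schematic sketch (ours), where `per_step_scores` is a hypothetical callback returning $\{i : \phi_{z_i}(U^{(t)})\}$ for iteration $t$ (e.g., the closed forms of Theorems 3 and 4 below):

```python
from collections import defaultdict

def accumulate_in_run_shapley(per_step_scores, num_iterations):
    """φ_z(U) = Σ_t φ_z(U^(t)) by linearity (Theorem 2). With a validation
    *set*, per-validation-point scores are likewise summed (Remark 1)."""
    in_run_shapley = defaultdict(float)
    for t in range(num_iterations):            # alongside the training loop
        for idx, phi_t in per_step_scores(t).items():
            in_run_shapley[idx] += phi_t       # running total across steps
    return in_run_shapley
```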
4 Efficient Computation of In-Run Data Shapley

The newly proposed In-Run Data Shapley does not require retraining models from scratch on different data subsets. However, calculating $\phi_z(U^{(t)})$ for each training iteration remains computationally intensive, as it involves evaluating the performance impact of all possible combinations within the sampled data batch. In this section, we introduce an efficient method for approximating In-Run Data Shapley scores during a specific training run. Our approach, distinct from Monte Carlo methods, is deterministic and optimized to add minimal runtime on top of regular training. In particular, in its most efficient implementation, our approximation technique incurs negligible extra runtime beyond what is required for standard model training, making it highly practical for real-world applications.

4.1 Approximating $U^{(t)}$ with Taylor Expansion

To derive a more tractable structure for the local utility function $U^{(t)}$, we propose using first- and second-order Taylor approximations. The advantage of this approach is that the approximated utility function exhibits a form where closed-form Data Shapley formulas can be derived. The second-order Taylor approximation to the local utility function is as follows:
$$U^{(t)}(S) = \ell(\tilde{w}_{t+1}(S), z^{(\mathrm{val})}) - \ell(w_t, z^{(\mathrm{val})}) = \underbrace{\nabla \ell(w_t, z^{(\mathrm{val})}) \cdot (\tilde{w}_{t+1}(S) - w_t)}_{U^{(t)}_{(1)}(S)} + \frac{1}{2} \underbrace{(\tilde{w}_{t+1}(S) - w_t)^\top H^{(z^{(\mathrm{val})})}_t (\tilde{w}_{t+1}(S) - w_t)}_{U^{(t)}_{(2)}(S)} + \text{higher order terms}$$
where the Hessian matrix $H^{(z^{(\mathrm{val})})}_t := \nabla^2 \ell(w_t, z^{(\mathrm{val})})$. We label the first-order term as $U^{(t)}_{(1)}(S)$ and the second-order term as $U^{(t)}_{(2)}(S)$. Note that the gradient update is $\tilde{w}_{t+1}(S) - w_t = -\eta_t \sum_{z \in S} \nabla \ell(w_t, z)$. Given that the learning rate $\eta_t$ in model training is typically small, a lower-order Taylor expansion often provides an accurate approximation for the change in loss during a single gradient update, with approximation errors of $O(\eta_t^2)$ and $O(\eta_t^3)$ for first- and second-order approximations, respectively. In Appendix E.2.2, we empirically investigate the errors of first- and second-order approximations to $U^{(t)}$ on GPT-2.
In particular, the first-order approximation already achieves strong performance, with a Spearman correlation $> 0.94$.

First-order In-Run Data Shapley. Using the first-order approximation $U^{(t)} \approx U^{(t)}_{(1)}$ and substituting the gradient update expression, we have $U^{(t)}_{(1)}(S) = -\eta_t \sum_{z \in S} \nabla \ell(w_t, z^{(\mathrm{val})}) \cdot \nabla \ell(w_t, z)$. This shows that $U^{(t)}_{(1)}$ is an additive utility function with a closed-form Shapley calculation as follows:

Theorem 3. In-Run Data Shapley with the first-order approximation has the closed form $\phi_z(U) \approx \sum_{t=0}^{T-1} \phi_z(U^{(t)}_{(1)})$, where
$$\phi_z(U^{(t)}_{(1)}) = -\eta_t \nabla \ell(w_t, z^{(\mathrm{val})}) \cdot \nabla \ell(w_t, z), \quad t = 0, \ldots, T-1.$$

Thus, the first-order approximation of In-Run Data Shapley for a training point accumulates its gradient dot products with the validation data point each time the training point is sampled in the training batch. The gradient dot product between the training point $z_i$ and the validation point $z^{(\mathrm{val})}$ represents the direct influence of $z_i$ on the validation loss at the current model parameters $w_t$, which essentially measures the alignment between the two gradient vectors in parameter space. Notably, $\phi_z(U^{(t)}_{(1)})$ is equivalent to the TracIn-Ideal score proposed by Pruthi et al. (2020). That is, the TracIn-Ideal score can be interpreted as the Shapley value when we use the first-order Taylor approximation for $U^{(t)}$. However, TracIn-Ideal has been described as "computationally infeasible," and our approach completely overcomes this problem. In Appendix A.3, we provide a detailed discussion of the differences between this work and Pruthi et al. (2020).
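A naive reference implementation of Theorem 3 in PyTorch (a sketch, not the paper's optimized code): it materializes per-sample gradients with one backward pass per training example, which is exactly the $|\mathcal{B}_t|\times$ overhead that the "ghost dot-product" technique of Section 4.2 eliminates.

```python
import torch

def first_order_step_scores(model, loss_fn, batch, z_val, lr):
    """Naive reference for Theorem 3:
    φ_z(U^(t)_(1)) = −η_t ∇ℓ(w_t, z_val)·∇ℓ(w_t, z).
    One backward pass per training example; the "ghost dot-product" of
    Section 4.2 computes the same quantities in a single pass."""
    params = [p for p in model.parameters() if p.requires_grad]
    x_val, y_val = z_val
    val_grads = torch.autograd.grad(loss_fn(model(x_val), y_val), params)
    scores = []
    for x, y in batch:  # each (x, y) is a batch-of-one tensor pair
        g = torch.autograd.grad(loss_fn(model(x), y), params)
        dot = sum((gv * gz).sum() for gv, gz in zip(val_grads, g))
        scores.append(-lr * dot.item())
    return scores
```

Summing these per-step scores over the whole run, each time an example is sampled, yields the first-order In-Run Data Shapley score.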
Second-order In-Run Data Shapley. We further improve the approximation of $U^{(t)}$ using a second-order Taylor expansion, i.e., $U^{(t)} \approx U^{(t)}_{(1)} + \frac{1}{2} U^{(t)}_{(2)}$. Fortunately, the approximated utility function maintains a tractable structure that allows a closed-form Shapley value calculation.

Theorem 4. In-Run Data Shapley with the second-order approximation has the closed form
$$\phi_z(U) \approx \sum_{t=0}^{T-1} \left( \phi_z(U^{(t)}_{(1)}) + \frac{1}{2} \phi_z(U^{(t)}_{(2)}) \right) \quad (2)$$
where
$$\phi_z(U^{(t)}_{(1)}) + \frac{1}{2} \phi_z(U^{(t)}_{(2)}) = \underbrace{-\eta_t \nabla \ell(w_t, z^{(\mathrm{val})}) \cdot \nabla \ell(w_t, z)}_{\text{(1) influence of } z \text{ on the loss of } z^{(\mathrm{val})}} + \underbrace{\frac{\eta_t^2}{2} \nabla \ell(w_t, z)^\top H^{(z^{(\mathrm{val})})}_t \Big( \sum_{z_j \in \mathcal{B}_t} \nabla \ell(w_t, z_j) \Big)}_{\text{(2) interaction between } z \text{ and other training points}} \quad (3)$$
for any $t = 0, \ldots, T-1$.
Compared to the first-order variant, the second-order In-Run Data Shapley includes an additional gradient-Hessian-gradient product term that captures the interaction between the training point of interest $z$ and the rest of the training set. The Hessian matrix represents the curvature of the validation loss function at the current model parameters $w_t$. This interaction term measures the alignment between the gradient of $z$ and the gradients of the other points in the training batch, adjusted by the Hessian. If this term is large, it indicates that the presence of other points in the batch significantly impacts the value attributed to $z$. For example, if there are many identical or similar copies of $z$ in the training set, the contribution of $z$ will decrease, as the interaction term will be large, effectively distributing the value among the similar points. By incorporating this interaction term, the second-order In-Run Data Shapley provides a more fine-grained contribution measure that takes into account both the relevance of a data point towards a validation set and its uniqueness within the population.
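The interaction term of Theorem 4 can be sanity-checked with a standard Hessian-vector product via double backpropagation; the sketch below (ours, not the paper's "ghost" implementation) materializes per-sample gradients and is therefore only suitable for small models.

```python
import torch

def interaction_step_scores(model, loss_fn, batch, z_val, lr):
    """Interaction term of Theorem 4, (η_t²/2)·∇ℓ(w_t, z)ᵀ H_t^(val) v with
    v = Σ_j ∇ℓ(w_t, z_j), computed as a Hessian-vector product via double
    backpropagation. A reference implementation only."""
    params = [p for p in model.parameters() if p.requires_grad]
    x_val, y_val = z_val

    # Per-sample gradients (kept for the final dot products) and their sum v.
    per_sample = [torch.autograd.grad(loss_fn(model(x), y), params)
                  for x, y in batch]
    v = [sum(g[i] for g in per_sample) for i in range(len(params))]

    # Hessian-vector product H_t^(val) v = ∇(∇ℓ_val · v), no Hessian formed.
    grad_val = torch.autograd.grad(loss_fn(model(x_val), y_val), params,
                                   create_graph=True)
    gv = sum((gg * vv).sum() for gg, vv in zip(grad_val, v))
    hvp = torch.autograd.grad(gv, params)

    return [(lr ** 2 / 2) * sum((gi * hi).sum()
                                for gi, hi in zip(g, hvp)).item()
            for g in per_sample]
```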
4.2 Efficient Computation of Gradient Dot-products and Gradient-Hessian-Gradient Products

Although we have derived closed-form formulas for In-Run Data Shapley using first- or second-order Taylor approximations of the local utility functions, efficiently computing these values remains a challenge. Specifically, first-order In-Run Data Shapley requires computing (1) the pairwise gradient dot products between each $z \in \mathcal{B}_t$ and the validation point; second-order In-Run Data Shapley additionally requires computing (2) the gradient-Hessian-gradient products for each $z \in \mathcal{B}_t$. A direct implementation of (1) involves calculating the individual gradient for each data point in $\mathcal{B}_t$, which cannot benefit from fast batch processing on GPUs and necessitates running backpropagation $|\mathcal{B}_t|$ times with a mini-batch size of 1. Consequently, this approach would be at least $|\mathcal{B}_t|$ times slower than regular training, making it computationally prohibitive for practical applications. Furthermore, computing (2) requires either recomputing each individual gradient or storing all individual gradients, which incurs significant time or memory overhead.

Computing pairwise gradient dot-products in one backpropagation. Our technique for efficiently computing pairwise gradient dot products is inspired by the "ghost clipping" technique from the differential privacy (DP) literature (Lee & Kifer, 2021). Ghost clipping computes all of the per-sample gradient norms within one backpropagation without explicitly forming any individual gradient vectors, which improves the efficiency of DP model training. Here, we propose a "ghost dot-product" technique that shares the idea of exploiting the computation already done in the backpropagation. Specifically, denote a sampled batch as $\mathcal{B}_t = \{z_1, \ldots, z_B\}$. We demonstrate this technique using a simple linear layer $s = aW$, where $W \in \mathbb{R}^{d_1 \times d_2}$ is the weight matrix, $a = (a^{(1)}, \ldots, a^{(B)})^\top$ is the mini-batch input, and $s = (s^{(1)}, \ldots, s^{(B)})^\top$ is the output (i.e., the pre-activation tensor). For (non-sequential) data, $a \in \mathbb{R}^{B \times d_1}$ and $s \in \mathbb{R}^{B \times d_2}$. By applying the chain rule, we can express the gradient of an individual loss $\ell^{(i)} := \ell(w, z_i)$ with respect to $W$ as
$$\frac{\partial \ell^{(i)}}{\partial W} = \frac{\partial \ell^{(i)}}{\partial s^{(i)}} \otimes \frac{\partial s^{(i)}}{\partial W} = \frac{\partial \ell^{(i)}}{\partial s^{(i)}} \otimes a^{(i)} = \frac{\partial \ell}{\partial s^{(i)}} \otimes a^{(i)} \quad (4)$$
where $\ell := \sum_{j=1}^B \ell^{(j)}$ is the aggregated loss, and the last step holds because the other data points' losses have no dependency on $s^{(i)}$. Note that the individual output gradient $\frac{\partial \ell^{(i)}}{\partial s^{(i)}} = \frac{\partial \ell}{\partial s^{(i)}}$ is readily available during the backpropagation pass on $\ell$.
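Equation (4) can be verified numerically for a single linear layer; the check below (ours) confirms that the per-sample weight gradient equals the outer product of the backpropagated output gradient and the layer input. Note that PyTorch stores linear-layer weights as $d_2 \times d_1$ with $s = aW^\top$, a cosmetic transpose of the paper's $s = aW$ convention.

```python
import torch

# Numerical check of Equation (4) on a single linear layer: the per-sample
# weight gradient factors into two quantities that ordinary backpropagation
# already produces (the layer input and the output gradient).
torch.manual_seed(0)
B, d1, d2 = 4, 3, 5
W = torch.randn(d2, d1, requires_grad=True)
a = torch.randn(B, d1)              # mini-batch input activations
s = a @ W.T                         # pre-activations, shape B×d2
loss = (s ** 2).sum(dim=1)          # per-sample losses ℓ^(i)

i = 2
# Direct per-sample gradient ∂ℓ^(i)/∂W ...
g_direct = torch.autograd.grad(loss[i], W, retain_graph=True)[0]
# ... equals the outer product of the output gradient and the input.
ds_i = torch.autograd.grad(loss[i], s)[0][i]   # ∂ℓ^(i)/∂s^(i)
g_factored = torch.outer(ds_i, a[i])
assert torch.allclose(g_direct, g_factored, atol=1e-6)
```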
Suppose we are interested in computing the gradient dot-product $\frac{\partial \ell^{(1)}}{\partial W} \odot \frac{\partial \ell^{(2)}}{\partial W}$ between two data points $z_1, z_2$ in the same batch during the backpropagation. For non-sequential data, we have each $a^{(i)} \in \mathbb{R}^{d_1 \times 1}$ and $\frac{\partial \ell^{(i)}}{\partial s^{(i)}} \in \mathbb{R}^{1 \times d_2}$. By (4), we have
$$\frac{\partial \ell^{(1)}}{\partial W} \odot \frac{\partial \ell^{(2)}}{\partial W} = \left( a^{(1)} \otimes \frac{\partial \ell^{(1)}}{\partial s^{(1)}} \right) \odot \left( a^{(2)} \otimes \frac{\partial \ell^{(2)}}{\partial s^{(2)}} \right) = \left( (a^{(1)})^\top a^{(2)} \right) \left( \frac{\partial \ell^{(1)}}{\partial s^{(1)}} \left( \frac{\partial \ell^{(2)}}{\partial s^{(2)}} \right)^\top \right) \quad (5)$$
Hence, we can first take the two inner products and then multiply the results together. All of the quantities $a^{(1)}, a^{(2)}, \frac{\partial \ell^{(1)}}{\partial s^{(1)}}, \frac{\partial \ell^{(2)}}{\partial s^{(2)}}$ required in (5) are already available in the backpropagation. Hence, within a single backpropagation, we can efficiently compute the gradient dot-product between every pair of $z_i, z_j \in \mathcal{B}_t$. Since we are interested in computing the gradient dot-product between $z^{(\mathrm{val})}$ and $z$ for all $z \in \mathcal{B}_t$, we can backpropagate on $\sum_{z_i \in \mathcal{B}_t} \ell^{(i)} + \ell^{(z^{(\mathrm{val})})}$ to save another backpropagation for $z^{(\mathrm{val})}$. We call this technique the "ghost dot-product," as no gradient vectors are instantiated during the computation. Overall, we only need one backpropagation to compute (1) for all data points in $\mathcal{B}_t$, a significant improvement over the direct method requiring $\geq |\mathcal{B}_t|$ backpropagations. Additional details for this technique are in Appendix D.

Remark 2. While we illustrate our "ghost dot-product" technique using linear layers, it can be extended to other types of layers by leveraging decompositions similar to Equation (4) that have been developed in the differential privacy literature (Rochette et al., 2020; Bu et al., 2022; Li et al., 2021; Bu et al., 2023; Kong & Munoz Medina, 2024).

Computing gradient-Hessian-gradient products in two backpropagations (Appendix D.2). For second-order In-Run Data Shapley, an outstanding challenge is how to efficiently compute the pairwise interaction term among training points. In Appendix D.2, we develop a "ghost gradient-Hessian-gradient product" technique for computing the desired quantity through one extra backpropagation pass, without materializing any gradient-sized vectors. This technique leverages several properties of neural network gradients across different layers, and its derivation is complex.
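A minimal sketch (ours) of the ghost dot-product idea for the weight matrices of linear layers, assuming non-sequential inputs and a cross-entropy loss; biases and other layer types are omitted, and sequence data would additionally require summing over the token dimension as in the paper's Appendix D.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ghost_dot_products(model, xs, ys):
    """Pairwise gradient dot-products ⟨∂ℓ^(i)/∂W, ∂ℓ^(j)/∂W⟩, summed over the
    weight matrices of all linear layers, from a single backward pass
    (Equation (5)). Hooks capture each layer's input activations a (B×d1) and
    output gradients ∂ℓ/∂s (B×d2); the layer's contribution to the B×B matrix
    is then (a aᵀ) ⊙ (ds dsᵀ), with no per-sample gradient ever materialized.
    Biases and non-linear-layer parameters are omitted for brevity."""
    acts, grads, hooks = {}, {}, []
    for name, m in model.named_modules():
        if isinstance(m, nn.Linear):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out, n=name: acts.__setitem__(n, inp[0])))
            hooks.append(m.register_full_backward_hook(
                lambda mod, gin, gout, n=name: grads.__setitem__(n, gout[0])))

    # Backpropagate the *summed* loss so gout rows are per-sample ∂ℓ^(i)/∂s^(i).
    F.cross_entropy(model(xs), ys, reduction="sum").backward()

    B = xs.shape[0]
    dots = torch.zeros(B, B, device=xs.device)
    with torch.no_grad():
        for n in acts:
            a, ds = acts[n], grads[n]
            dots += (a @ a.T) * (ds @ ds.T)   # Equation (5), all pairs at once
    for h in hooks:
        h.remove()
    return dots
```

To obtain the training-validation dot products needed for Theorem 3, one can append $z^{(\mathrm{val})}$ to the batch, mirroring the paper's trick of backpropagating on $\sum_i \ell^{(i)} + \ell^{(z^{(\mathrm{val})})}$, and read off the corresponding row of the returned matrix.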
Further improvement of runtime and memory requirements (Appendix D.3). With the "ghost" techniques developed, the computation of first- and second-order In-Run Data Shapley requires one and two backpropagations per gradient update iteration, respectively. Although we still need the gradient of the aggregated loss $\sum_{z_i \in \mathcal{B}_t} \ell^{(i)}$ on the training batch to perform parameter updates, we do not need an additional backpropagation for it: by reusing the activations and output gradients from the previous backpropagation on $\sum_{z_i \in \mathcal{B}_t} \ell^{(i)} + \ell^{(z^{(\mathrm{val})})}$, we can compute this quantity without the cost of an extra backpropagation pass. Consequently, training while computing first-order In-Run Data Shapley incurs minimal additional runtime overhead, as it still requires only one backpropagation per iteration. Second-order In-Run Data Shapley necessitates one extra backpropagation per iteration. Nevertheless, both methods provide significant advantages over the direct approach of instantiating per-sample gradients and Hessian-vector products.

5 Experiments

In this section, we evaluate In-Run Data Shapley in terms of its efficiency (Section 5.1), fidelity (Section 5.2), and its applications in data attribution for language model pretraining (Section 5.3). In Appendix E.6, we compare a variety of existing data attribution methods in small-scale settings.

5.1 Runtime Evaluation

We empirically assess the computational efficiency of In-Run Data Shapley with the "ghost dot-product" and "ghost gradient-Hessian-gradient product" techniques developed in Section 4.2. We use throughput, i.e., the number of training data points processed per second, as the efficiency metric. We compare against the direct implementation of In-Run Data Shapley, which requires computing per-sample gradients, as well as against regular training without Data Shapley computations. The experiment is conducted by training GPT2-Small on a single 80GB A100 GPU.

Table 1: Efficiency comparison of different implementations of In-Run Data Shapley (throughput in training data points processed per second).

Method | Throughput
Regular Training | 76.2
First-order Data Shapley (ghost) | 70.5
Second-order Data Shapley (ghost) | 34.4
First-order Data Shapley (direct) | 4.2
Second-order Data Shapley (direct) | 1.8

As illustrated in Table 1, the runtime of first-order In-Run Data Shapley is close to that of regular training when using the ghost dot-product algorithm developed in Section 4.2. Second-order In-Run Data Shapley is approximately 2× slower than regular training due to the additional backpropagation. However, both the first- and second-order In-Run Data Shapley are significantly faster (>30×) than the naive implementation. These results showcase the substantial improvements achieved by our techniques, making In-Run Data Shapley computationally feasible for practical applications.

5.2 Fidelity Evaluation

[Figure 1: Comparison between the Monte Carlo-estimated In-Run Data Shapley and first/second-order In-Run Data Shapley.]

In this section, we directly assess the approximation accuracy of first- and second-order In-Run Data Shapley. In Appendix E.6, we further compare the performance of In-Run Data Shapley with existing (less scalable) data attribution techniques on standard benchmarks (e.g., mislabeled data detection) as an additional sanity check.
Given that computing the exact In-Run Data Shapley value is computationally prohibitive, we compare their performance against Monte Carlo estimates of the Shapley value using a large number of samples. The experiment is conducted on a GPT-2 model at the 3500th training iteration on Pile, with batch size 16 and learning rate $3 \times 10^{-4}$. We use 1000 permutations to approximate the ground-truth In-Run Data Shapley value. Figure 1 shows that even first-order In-Run Data Shapley achieves a root mean squared error (RMSE) of only around 0.0003. In Appendix E.2.1, we provide additional results with different learning rates.
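The permutation-sampling estimator used as the ground truth here is standard; a small self-contained sketch (ours), where `utility` maps tuples of indices (including the empty tuple) to the local utility value, with the same interface as the enumeration sketch in Section 2:

```python
import random

def monte_carlo_shapley(utility, n, num_permutations=1000, seed=0):
    """Permutation-sampling Shapley estimator (the fidelity evaluation above
    uses 1000 permutations as ground truth). The estimate is unbiased and
    concentrates as num_permutations grows."""
    rng = random.Random(seed)
    phi = [0.0] * n
    for _ in range(num_permutations):
        order = rng.sample(range(n), n)        # a uniformly random permutation
        prefix, prev_u = [], utility(())
        for idx in order:
            prefix.append(idx)
            u = utility(tuple(prefix))
            phi[idx] += u - prev_u             # marginal contribution of idx
            prev_u = u
    return [p / num_permutations for p in phi]
```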
5.3 Case Study: Data Attribution on the Pile Dataset

In this section, we present a case study demonstrating the use cases of In-Run Data Shapley by pretraining on the well-known Pile dataset (Gao et al., 2020). We explore its application in data curation, examine data contribution across different training stages, and investigate relevant corpus detection. Due to computational resource constraints, most of our experiments focus on the GPT-2 and Pythia-410M models, but this is not a limitation of the algorithm itself. With adequate computational resources, our approach can easily be applied to larger-scale models.

5.3.1 Is a Well-curated Dataset Actually Clean?

Carefully curated pretraining corpora still contain data points that can adversely affect the training process. Identifying and removing these data points can accelerate model convergence and enhance overall performance, thereby saving computational resources. In this experiment, we demonstrate the effectiveness of In-Run Data Shapley in assessing the data quality of a subset of the Pile dataset. We uniformly select a random subset of Pile with around 10B tokens and train a GPT-2 model on this subset. We compute the data attribution results with Pile's validation set. By filtering out all negatively valued corpora and retraining the model on the cleaned subset, we observe significant improvement in model convergence. For both first- and second-order In-Run Data Shapley, we reach a test loss of 3.75 in around 25% fewer training iterations. Surprisingly, our analysis reveals that around 16% of the training corpora have negative second-order In-Run Data Shapley values. While some of the negatively valued corpora may be attributed to the significant domain shift compared to the validation corpora, we still find many low-quality corpora in Pile, a pretraining dataset that has undergone several layers of data curation (Gao et al., 2020). Examples of the low-quality corpora identified can be found in Appendix E.3.

[Figure 2: Test loss comparison between the original training run and the model trained on the cleaned subset according to different data attribution techniques.]

Figure 2 shows a performance comparison between the original training run and the model trained on the cleaned subsets for GPT-2; additional results on Pythia-410M are available in Appendix E.3.1. We also compare with the influence function (Koh & Liang, 2017), which approximates the change in the model's loss on the test example when the training example is removed from the training set (i.e., the leave-one-out score). We omit TRAK (Park et al., 2023) and other techniques such as Datamodels (Ilyas et al., 2022) as they are not scalable to our setting. As we can see, the influence function can also filter out low-quality data and thereby accelerate training convergence. However, its performance is slightly worse than In-Run Data Shapley's: the influence function only uses information from the final trained model, which can result in highly noisy value scores, since the removal of one training data point might have a negligible effect on the final model's performance. The results demonstrate that removing lower-quality data leads to a significantly faster drop in test loss compared to the original training run. This implies that there is still huge room for data curation even in well-curated datasets such as Pile.

Table 2: Top: (left) an original training corpus from Wikipedia; (right) a synthetic corpus in the "Similar topic" category relative to that Wikipedia corpus (prompt in Appendix E.5). Bottom: the (average) value rank of the original corpus among all ≈320k training corpora, for validation corpora of varying similarity to the original corpus.

Original Wikipedia corpus: "In 2012, Radhi recruited new 'musicians' for OAG, who were selected from among the students of Akademi Seni Budaya dan Warisan Kebangsaan. The new line-up consists of Qi Razali (drums/backing vocals - original drummer …"

Synthetic "Similar topic" corpus: "### Instruction: Write a short story about a classical violinist who decides to explore jazz music, detailing her first encounter with a jazz band. ### Answer: Elena, a classically trained violinist known for her precise and emotive performances …"

Similarity Category | In-Run Data Shapley (1st order) | In-Run Data Shapley (2nd order) | Influence Function | BM25
Partial exactly the same | 1 | 1 | 1 | 1
Paraphrase | 1 | 1 | 1 | 1
Significant paraphrase | 32.3 | 32 | 39.3 | 1.6
Similar topic | 145.6 | 141.6 | 292 | 20917.3

[Figure 3: Left: domain value composition for a corpus of math text. Right: the math corpus used as the validation data for attribution, with examples of high- and low-valued training corpora. Validation corpus: "In several applications, the matrix coefficients of the nonlinear matrix-valued function $G(\lambda)$ are usually of low rank …". (+) High-valued corpus: "The Lagrangian $\mathcal{L}_{g_3, g_4, g_5, g_6}$ describes (axial-)vector meson interactions, see Appendix of Ref. [@Gallas]. After …". (−) Low-valued corpus: "Coronavirus Testing in San Diego County \ How San Diego Is Getting Ready for Self-Driving Cars …"]

5.3.2 How Do Data Values Change during Training?

Since In-Run Data Shapley tracks cumulative data values across training steps, we can assess the contribution of training points at various stages of training, providing a more fine-grained perspective on data attribution. We evaluate the data attribution results for a math-related validation datum using second-order In-Run Data Shapley. In Figure 3, we present the value composition of training corpora by domain over the first 10,000 training iterations, summing the values of all corpora from the same domain. We then calculate the percentage of the total value attributed to each domain, excluding domains with a total value $< 0$. As illustrated in the figure, the corpora from ArXiv achieve a significantly higher value than other domains, far exceeding ArXiv's size proportion within the full Pile dataset. This is expected, as ArXiv papers predominantly cover fields like mathematics, computer science, and physics, which contain extensive math-related content.
Furthermore, the value composition changes rapidly at the beginning of training and stabilizes as training progresses. We hypothesize that this initial fluctuation is due to the presence of relevant paragraphs in corpora from other domains. The stable value proportion observed in later stages likely reflects the relative abundance of math content in each domain. Interestingly, we observe that the Pile-CC domain, which contains general website crawls, initially shows positive contributions during the first few iterations. However, its value quickly drops to negative and eventually converges to zero. This implies that general corpora tend to have a large contribution at the beginning of training, as they help the model learn the basics of language, such as grammar and common knowledge. However, as training progresses and the model focuses on more specialized topics, the relevance of general domains diminishes. An additional figure for the average domain values is in Appendix E.4.

5.3.3 Does Contribution Require Memorization?

In this experiment, we evaluate the robustness of different data attribution techniques in identifying relevant individual corpora that have been paraphrased. We start by selecting a data point from the training set and creating several paraphrased versions using GPT-4, with varying levels of paraphrasing (see Appendix E.5 for the prompt). These paraphrased versions form our validation set. We then calculate the average value rank of the original training data for each of its paraphrased versions. In addition to In-Run Data Shapley and the influence function, we include a ranking based on BM25. BM25 (Robertson et al., 2009) featurizes examples by their word-frequency statistics (i.e., TF-IDF) to rank the training instances. We use BM25 as an oracle for assessing the verbatim or lexical similarity between the validation data (query) and the training data, as opposed to semantic similarity. As shown in Table 2, even for validation data that is a complete rewrite (with low lexical similarity under BM25) but covers relevant topics, the original training data still ranks very high according to both In-Run Data Shapley and the influence function. The influence function ranks the original training data lower than In-Run Data Shapley does, which may be attributed to the inherently noisy nature of leave-one-out error estimation. The results of this experiment have important implications for the ongoing discussion about copyright in generative AI. Specifically, the table presents a compelling example where an original Wikipedia training corpus about a musician's experience can significantly contribute to generating a story about a musician, even when the generated story shares no token-wise resemblance to the original training data. This finding supports the view that training data profoundly influences the capabilities of generative AI models and should be compensated accordingly (Deng & Ma, 2023; Wang et al., 2024b), even when the output does not closely resemble the original copyrighted material or when the model applies output filters to avoid generating verbatim replicates of the training data. This discovery expands the conventional focus of copyright-violation analysis, which typically addresses instances of near-verbatim replication (as in the dispute between the New York Times and OpenAI), to include cases where the generated content is significantly influenced by copyrighted material without directly replicating it.
6 Conclusion and Limitations

In this work, we introduce In-Run Data Shapley, a novel data attribution technique that addresses the limitations of Retraining-based Data Shapley. Extensive experiments demonstrate the effectiveness of In-Run Data Shapley in various applications. Here, we discuss the potential limitations of this work.

Availability of validation data before training. One potential limitation of In-Run Data Shapley is that it requires the validation data to be available before training, as the data attribution scores are computed during the training process. However, there are many scenarios where validation data is naturally available before training, such as when using publicly available benchmark datasets, participating in machine learning competitions, or adhering to regulatory requirements. For scenarios where validation data arrives after the model has been trained, a potential solution is to save checkpoints of intermediate models during training and approximate In-Run Data Shapley using these checkpoints, in the same spirit as the TracIN-CP method described in Pruthi et al. (2020). However, the choice of checkpoints can significantly impact performance, and it is unclear which checkpoints to pick.

Extension to other optimization algorithms. Extending the "ghost" family of techniques developed in this work to support Adam and similar optimizers remains an exciting direction for future research. We stress that extending the formulation of In-Run Data Shapley from SGD to Adam is feasible; the actual challenge lies in computing it efficiently without instantiating each individual gradient vector, which cannot be solved by the simple extensions described in Xia et al. (2024).

Handling memory constraints. In scenarios where GPU memory constraints prevent large batch sizes, the "ghost" techniques can be extended via gradient accumulation. This approach accommodates larger training batch sizes by dividing the batch into smaller sub-batches and accumulating gradients over multiple iterations. While this method may increase runtime due to additional backpropagation steps, it keeps the techniques feasible under memory constraints. Improving computational efficiency for large batch sizes remains an important direction for future research.

Acknowledgment

This work is supported in part by the National Science Foundation under grants IIS-2312794, IIS-2313130, OAC-2239622, the Amazon-Virginia Tech Initiative in Efficient and Robust Machine Learning, and the Commonwealth Cyber Initiative. We thank Meng Ding, Chong Xiang, and Chendi Wang for their helpful feedback on the preliminary version of this work.

References

Ash et al. (2019) Jordan T Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2019.
Bae et al. (2022) Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger B Grosse. If influence functions are the answer, then what is the question? Advances in Neural Information Processing Systems, 35:17953–17967, 2022.
Basu et al. (2021) S Basu, P Pope, and S Feizi. Influence functions in deep learning are fragile. In International Conference on Learning Representations (ICLR), 2021.
Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, p. 2397–2430. PMLR, 2023.
Bu et al. (2022) Zhiqi Bu, Jialin Mao, and Shiyun Xu. Scalable and efficient training of large convolutional neural networks with differential privacy. Advances in Neural Information Processing Systems, 35:38305–38318, 2022.
Bu et al. (2023) Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In International Conference on Machine Learning, p. 3192–3218. PMLR, 2023.
Burgess & Chapman (2021) Mark Alexander Burgess and Archie C Chapman. Approximating the shapley value using stratified empirical bernstein sampling. In IJCAI, p. 73–81, 2021.
Choe et al. (2024) Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. What is your data worth to gpt? llm-scale data valuation with influence functions. arXiv preprint arXiv:2405.13954, 2024.
Cook & Weisberg (1980) R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.
Covert et al. (2024) Ian Covert, Chanwoo Kim, Su-In Lee, James Zou, and Tatsunori Hashimoto. Stochastic amortization: A unified approach to accelerate feature and data attribution. arXiv preprint arXiv:2401.15866, 2024.
Deng & Ma (2023) Junwei Deng and Jiaqi Ma. Computational copyright: Towards a royalty model for ai music generation platforms. arXiv preprint arXiv:2312.06646, 2023.
Dhaliwal & Shintre (2018) Jasjeet Dhaliwal and Saurabh Shintre. Gradient similarity: An explainable approach to detect adversarial attacks against deep learning. arXiv preprint arXiv:1806.10707, 2018.
Feldman & Zhang (2020) Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020.
Fort & Ganguli (2019) Stanislav Fort and Surya Ganguli. Emergent properties of the local geometry of neural loss landscapes. arXiv preprint arXiv:1910.05929, 2019.
Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Ghorbani & Zou (2019) Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, p. 2242–2251. PMLR, 2019.
Ghorbani et al. (2020) Amirata Ghorbani, Michael Kim, and James Zou. A distributional framework for data valuation. In International Conference on Machine Learning, p. 3535–3544. PMLR, 2020.
Grosse et al. (2023) Roger Grosse, Juhan Bae, Cem Anil, Nelson Elhage, Alex Tamkin, Amirhossein Tajdini, Benoit Steiner, Dustin Li, Esin Durmus, Ethan Perez, et al. Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296, 2023.
Grynbaum & Mac (2023) Michael M Grynbaum and Ryan Mac. The times sues openai and microsoft. The New York Times, p. B1, 2023.
Heikkilä (2023) Melissa Heikkilä. This new tool could give artists an edge over ai. https://www.technologyreview.com/2023/10/24/1082247/this-new-tool-could-give-artists-an-edge-over-ai/, 2023.
Henderson et al. (2023) Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A Lemley, and Percy Liang. Foundation models and fair use. arXiv preprint arXiv:2303.15715, 2023.
Illés & Kerényi (2019) Ferenc Illés and Péter Kerényi. Estimation of the shapley value by ergodic sampling. arXiv preprint arXiv:1906.05224, 2019.
Ilyas et al. (2022) Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry. Datamodels: Predicting predictions from training data. arXiv preprint arXiv:2202.00622, 2022.
Jia et al. (2019a) Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas J Spanos, and Dawn Song. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 2019a.
Jia et al. (2019b) Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J Spanos. Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, p. 1167–1176. PMLR, 2019b.
Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, p. 1885–1894. PMLR, 2017.
Kong & Munoz Medina (2024) Weiwei Kong and Andres Munoz Medina. A unified fast gradient clipping framework for dp-sgd. Advances in Neural Information Processing Systems, 36, 2024.
Kwon & Zou (2022) Yongchan Kwon and James Zou. Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, p. 8780–8802. PMLR, 2022.
Kwon et al. (2021) Yongchan Kwon, Manuel A Rivas, and James Zou. Efficient computation and analysis of distributional shapley values. In International Conference on Artificial Intelligence and Statistics, p. 793–801. PMLR, 2021.
Lee & Kifer (2021) Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021.
Li & Yu (2023) Weida Li and Yaoliang Yu. Faster approximation of probabilistic and distributional values via least squares. In The Twelfth International Conference on Learning Representations, 2023.
Li & Yu (2024) Weida Li and Yaoliang Yu. Robust data valuation with weighted banzhaf values. Advances in Neural Information Processing Systems, 36, 2024.
Li et al. (2021) Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2021.
Lin et al. (2022) Jinkun Lin, Anqi Zhang, Mathias Lécuyer, Jinyang Li, Aurojit Panda, and Siddhartha Sen. Measuring the effect of training data on deep learning predictions via randomized experiments. In International Conference on Machine Learning, p. 13468–13504. PMLR, 2022.
Mitchell et al. (2022) Rory Mitchell, Joshua Cooper, Eibe Frank, and Geoffrey Holmes. Sampling permutations for shapley value estimation. 2022.
Mulligan & Li (2024) Caitlin Mulligan and James Li. Generative ai's end run around copyright. AI Snake Oil, 2024. URL https://www.aisnakeoil.com/p/generative-ais-end-run-around-copyright.
Nguyen et al. (2023) Elisa Nguyen, Minjoon Seo, and Seong Joon Oh. A bayesian approach to analysing training data attribution in deep learning. Advances in Neural Information Processing Systems, 36, 2023.
Okhrati & Lipani (2021) Ramin Okhrati and Aldo Lipani. A multilinear sampling algorithm to estimate shapley values. In 2020 25th International Conference on Pattern Recognition (ICPR), p. 7992–7999. IEEE, 2021.
Park et al. (2023) Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry. Trak: Attributing model behavior at scale. In Proceedings of the 40th International Conference on Machine Learning, p. 27074–27113, 2023.
Pruthi et al. (2020) Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 33:19920–19930, 2020.
Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
Rochette et al. (2020) Gaspar Rochette, Andre Manoel, and Eric W Tramel. Efficient per-example gradient computations in convolutional neural networks. In Workshop on Theory and Practice of Differential Privacy (TPDP), 2020.
Schioppa (2024) Andrea Schioppa. Gradient sketches for training data attribution and studying the loss landscape. arXiv preprint arXiv:2402.03994, 2024.
Shapley (1953) Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
Søgaard et al. (2021) Anders Søgaard et al. Revisiting methods for finding influential examples. arXiv preprint arXiv:2111.04683, 2021.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wang & Jia (2023a) Jiachen T Wang and Ruoxi Jia. Data banzhaf: A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, p. 6388–6421. PMLR, 2023a.
Wang & Jia (2023b) Jiachen T Wang and Ruoxi Jia. A note on "towards efficient data valuation based on the shapley value". arXiv preprint arXiv:2302.11431, 2023b.
Wang & Jia (2023c) Jiachen T Wang and Ruoxi Jia. A note on "efficient task-specific data valuation for nearest neighbor algorithms". arXiv preprint arXiv:2304.04258, 2023c.
Wang et al. (2023) Jiachen T Wang, Yuqing Zhu, Yu-Xiang Wang, Ruoxi Jia, and Prateek Mittal. Threshold knn-shapley: A linear-time and privacy-friendly approach to data valuation. arXiv preprint arXiv:2308.15709, 2023.
Wang et al. (2024a) Jiachen T Wang, Zhun Deng, Hiroaki Chiba-Okabe, Boaz Barak, and Weijie J Su. An economic solution to copyright challenges of generative ai. Technical report, 2024a.
Wang et al. (2024b) Jiachen T Wang, Zhun Deng, Hiroaki Chiba-Okabe, Boaz Barak, Weijie J Su, et al. An economic solution to copyright challenges of generative ai. arXiv preprint arXiv:2404.13964, 2024b.
Wang et al. (2024c) Jiachen T Wang, Prateek Mittal, and Ruoxi Jia. Efficient data shapley for weighted nearest neighbor algorithms. arXiv preprint arXiv:2401.11103, 2024c.
Wang et al. (2024d) Jiachen T Wang, Tianji Yang, James Zou, Yongchan Kwon, and Ruoxi Jia. Rethinking data shapley for data selection tasks: Misleads and merits. arXiv preprint arXiv:2405.03875, 2024d.
Wang et al. (2020) Tianhao Wang, Johannes Rausch, Ce Zhang, Ruoxi Jia, and Dawn Song. A principled approach to data valuation for federated learning. In Federated Learning, p. 153–167. Springer, 2020.
Xia et al. (2024) Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. Less: Selecting influential data for targeted instruction tuning. In Forty-first International Conference on Machine Learning, 2024.
Yang et al. (2024) Ziao Yang, Han Yue, Jian Chen, and Hongfu Liu. On the inflation of knn-shapley value. arXiv preprint arXiv:2405.17489, 2024.
Yeh et al. (2022) Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, and Pradeep Ravikumar. First is better than last for training data influence. arXiv preprint arXiv:2202.11844, 2022.
Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.

Appendix A Extended Related Works

A.1 Data Shapley Axioms

Data Shapley is one of the first principled approaches to data attribution (Ghorbani & Zou, 2019; Jia et al., 2019b). It is based on the famous Shapley value (Shapley, 1953). In almost all of the literature, the Shapley value is justified as the unique value notion satisfying the following four axioms:

1. Null player: if $U(S \cup \{z_i\}) = U(S)$ for all $S \subseteq D \setminus \{z_i\}$, then $\phi_{z_i}(U) = 0$.
2. Symmetry: if $U(S \cup \{z_i\}) = U(S \cup \{z_j\})$ for all $S \subseteq D \setminus \{z_i, z_j\}$, then $\phi_{z_i}(U) = \phi_{z_j}(U)$.
3. Linearity: for utility functions $U_1, U_2$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, $\phi_{z_i}(\alpha_1 U_1 + \alpha_2 U_2) = \alpha_1 \phi_{z_i}(U_1) + \alpha_2 \phi_{z_i}(U_2)$.
4. Efficiency: for every $U$, $\sum_{z_i \in D} \phi_{z_i}(U) = U(D)$.

In plain words, the null player axiom means the Shapley value assigns a zero score to data points with no contribution. The symmetry axiom requires equal scores for equally impactful data points, ensuring fairness. The efficiency axiom requires the contribution scores to sum to the total utility, so that the scores always represent a share of the total utility.
The linearity axiom means the Shapley value supports additive decomposition across multiple utility functions, allowing the contribution to an entire test set to be computed by summing the scores for individual test points.

A.2 Data Shapley and Friends

Since its introduction in 2019 (Ghorbani & Zou, 2019; Jia et al., 2019b), Data Shapley has rapidly gained popularity as a principled solution for data attribution. Because retraining-based Data Shapley is computationally expensive, various Monte Carlo-based approximation algorithms have been developed (Jia et al., 2019b; Illés & Kerényi, 2019; Okhrati & Lipani, 2021; Burgess & Chapman, 2021; Mitchell et al., 2022; Lin et al., 2022; Wang & Jia, 2023b; Li & Yu, 2023; Covert et al., 2024). However, these methods still require extensive computational resources due to repeated model retraining, which is clearly impractical for modern-sized ML models.

Many variants of Data Shapley have been proposed. Kwon & Zou (2022) argue that the efficiency axiom is not necessary for many machine learning applications, and the framework of semivalues is derived by relaxing the efficiency axiom. Lin et al. (2022) provide an alternative justification for semivalues based on causal inference and randomized experiments. Based on the semivalue framework, Kwon & Zou (2022) propose Beta Shapley, a collection of semivalues that enjoy certain mathematical conveniences. Wang & Jia (2023a) propose Data Banzhaf and show that the Banzhaf value, another famous solution concept from cooperative game theory, achieves more stable valuation results under stochastic learning algorithms. Li & Yu (2024) further improve valuation stability by considering value notions outside the scope of semivalues. The classic leave-one-out (LOO) error is also a semivalue, and the influence function (Cook & Weisberg, 1980; Koh & Liang, 2017; Grosse et al., 2023) is generally considered an approximation of it. A concurrent work (Choe et al., 2024) leverages a gradient decomposition technique similar to ours to speed up influence function calculation. However, several works have pointed out the fragility of influence functions for deep learning models (Basu et al., 2021; Søgaard et al., 2021; Bae et al., 2022), owing to their strong assumptions of training convergence and strong convexity of the loss function. Nguyen et al. (2023) take a Bayesian view of data attribution and are able to evaluate the variance of LOO. Unlike Nguyen et al. (2023), our work explicitly incorporates specific training randomness (e.g., model initialization, batch ordering) into the utility function definition, providing attribution scores that reflect contributions to the exact training trajectory. Another line of work improves the computational efficiency of Data Shapley by using K-nearest neighbors (KNN) as a surrogate learning algorithm for the original, potentially complicated deep learning models (Jia et al., 2019a; Wang & Jia, 2023c; Wang et al., 2023; 2024c; Yang et al., 2024). Ghorbani et al. (2020); Kwon et al. (2021); Li & Yu (2023) consider Distributional Shapley, a generalization of Data Shapley to data distributions. In the federated learning setting, Wang et al. (2020) propose a similar idea of computing the Shapley value for each federated learning round.

Remark 3. Randomized Monte Carlo estimators can be inefficient and may produce unstable valuation results, potentially violating fairness axioms of the Shapley value (e.g., the symmetry axiom) due to inherent approximation errors.
In contrast, In-Run Data Shapley does not rely on Monte Carlo estimators. Instead, it computes the exact Shapley value for an approximated utility function obtained via Taylor expansion. This approach adheres to the fairness axioms while ensuring the reliability and consistency of the data valuation results.

A.3 Comparison with TracIN (Pruthi et al., 2020)

The form of first-order In-Run Data Shapley from Section 4.1 coincides with TracIN-Ideal in Pruthi et al. (2020). This provides a new understanding of TracIN-Ideal as an approximation of In-Run Data Shapley. Both works face the technical challenge of requiring per-sample gradient computations during a single training run. Pruthi et al. (2020) propose TracIN-CP, which mitigates the computational burden by examining only a subset of intermediate checkpoints during training. At each checkpoint, the individual gradients for the entire training set are computed, rather than for a sampled batch, under the assumption that each training example is visited exactly once between checkpoints. A recent work (Xia et al., 2024) leverages TracIN-CP, an approximation algorithm for TracIN-Ideal, for instruction-following data selection. This approach, however, may deviate significantly from the original TracIN-Ideal, with the final valuation results depending heavily on the selected checkpoints. Furthermore, Pruthi et al. (2020)'s implementation is limited to the parameters of the last linear layer due to memory constraints, potentially biasing the measurement of data contribution. For instance, Yeh et al. (2022) suggest that the last layer of a neural network can exhibit a strong "cancellation effect," where the data influences of different examples have large, contradictory magnitudes. Additionally, Schioppa (2024) demonstrates that selecting different layers can distort data attribution scores. In contrast, this work introduces the "ghost dot-product" technique to compute the first-order In-Run Data Shapley (i.e., TracIN-Ideal) directly, efficiently, and accurately, without additional approximations.

Appendix B Additional Discussion

In this section, we provide additional discussion of In-Run Data Shapley and its comparison with Retraining-based Data Shapley. Figure 4 gives a more detailed overview of our algorithm, and Figure 5 provides a visual comparison between Retraining-based and In-Run Data Shapley.

Figure 4: The core idea and algorithm overview of In-Run Data Shapley. Rather than evaluating data contribution across the entire training process (top), we decompose it into individual gradient update steps (bottom). For each iteration $t$, we compute the Shapley value $\phi_z(U^{(t)})$ with respect to a "local utility function" $U^{(t)}$ that measures how the batch $\mathcal{B}_t$ contributes to reducing the validation loss. A data point's final contribution score is the sum of its Shapley values across all iterations where it appears: $\phi_z = \sum_{t: z \in \mathcal{B}_t} \phi_z(U^{(t)})$. This decomposition maintains the Shapley value properties through the linearity axiom while making computation tractable.
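To make the per-iteration accumulation in Figure 4 concrete, here is a minimal, hedged sketch of the first-order variant under plain SGD, assuming the batch update uses the summed per-sample gradients. Per-sample gradients are instantiated naively for readability (the paper's "ghost" techniques avoid this), and all names (train_with_inrun_shapley, sq_loss) are illustrative, not the paper's code:

```python
# Illustrative first-order In-Run Data Shapley under SGD: at each step t,
# phi[z] += -eta_t * <grad l(w_t, z_val), grad l(w_t, z)> for z in batch.
import torch

def sq_loss(model, x, y):
    return ((model(x) - y) ** 2).sum()

def train_with_inrun_shapley(model, x_all, y_all, x_val, y_val,
                             lr=0.1, steps=100, batch_size=8):
    params = [p for p in model.parameters() if p.requires_grad]
    phi = torch.zeros(len(x_all))            # running score per data point
    for t in range(steps):
        idx = torch.randint(len(x_all), (batch_size,))
        # Validation gradient at the current iterate w_t.
        g_val = torch.autograd.grad(sq_loss(model, x_val, y_val), params)
        batch_grad = [torch.zeros_like(p) for p in params]
        for i in idx.tolist():
            g_i = torch.autograd.grad(
                sq_loss(model, x_all[i:i+1], y_all[i:i+1]), params)
            # phi_z(U^(t)) = -eta_t * grad(val) . grad(z)  (Theorem 3)
            phi[i] += -lr * sum((gv * gi).sum()
                                for gv, gi in zip(g_val, g_i))
            for bg, gi in zip(batch_grad, g_i):
                bg += gi
        with torch.no_grad():                # SGD step on the summed gradient
            for p, g in zip(params, batch_grad):
                p -= lr * g
    return phi
```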
Figure 5: Comparison between Retraining-based and In-Run Data Shapley. Retraining-based Data Shapley requires training a model from scratch on all possible subsets of the full training set, which is computationally inefficient and raises concerns about interpretability and stability. In contrast, In-Run Data Shapley acts as a "contribution accountant," efficiently tracking and attributing value scores to each training example across gradient update steps during a single training run.

B.1 When is In-Run Data Shapley Desirable for Data Attribution?

While Retraining-based Data Shapley has been widely adopted in the literature, it suffers from several critical issues that limit its practicality and effectiveness. In this section, we discuss these problems from four key aspects: computational efficiency, alignment with the purpose of data valuation, stability of the valuation results, and the choice of training hyperparameters.

(1) Computational burden. Computing Retraining-based Data Shapley is often prohibitive, as it requires retraining the model on every possible subset of the original dataset, leading to a computational complexity that grows exponentially with the size of the dataset. Despite the development of various Monte Carlo-based approximation algorithms (Jia et al., 2019b; Illés & Kerényi, 2019; Okhrati & Lipani, 2021; Burgess & Chapman, 2021; Mitchell et al., 2022; Lin et al., 2022; Wang & Jia, 2023b; Li & Yu, 2023; Covert et al., 2024), these methods still require extensive computational resources due to repeated model retraining, which is clearly impractical for modern-sized ML models. Another line of work attempts to use efficient proxy learning algorithms, such as K-nearest neighbors (KNN) (Jia et al., 2019a; Wang & Jia, 2023c; Wang et al., 2023; 2024c; Yang et al., 2024), to accelerate Data Shapley computation. However, it remains unclear how closely these cheaper proxy models approximate the original learning algorithm, and it is also uncertain how to interpret the derived Data Shapley scores in this context.

(2) Retraining-based Data Shapley cannot assess data contribution to a specific model. Crucially, Retraining-based Data Shapley is not designed to value data contribution towards a specific model. It attempts to quantify the average contribution of each training data point to models trained on different subsets of the data, rather than its contribution to the specific model trained on the full dataset. While one might interpret the former as an approximation of the latter, these two quantities can be quite different in practice, especially when the learning algorithm is randomized and sensitive to factors like random initialization and the order of the data points during training. More importantly, in most real-life scenarios, the primary interest lies in understanding the contribution of each data point to the specific model being trained and deployed.

(3) Retraining-based Data Shapley produces unstable valuation results for stochastic training algorithms. Furthermore, when the training algorithm involves randomness, such as in the case of SGD with random mini-batch selection, the corresponding utility function becomes randomized. Prior work (Wang & Jia, 2023a) suggests that this randomness can introduce substantial noise into the estimated Shapley values, rendering them unreliable and unstable. This instability poses significant challenges for interpreting and using the resulting data valuations, as the scores may vary considerably across different runs of the algorithm, even on the same dataset.
Consequently, this limits the practical applicability of Retraining-based Data Shapley when working with stochastic training algorithms, which are prevalent in modern machine learning. We note that similar vulnerabilities to learning stochasticity have been observed for LOO-based data influence scores (e.g., the influence function (Koh & Liang, 2017)) in several works (Basu et al., 2021; Søgaard et al., 2021; Nguyen et al., 2023).

(4) Training hyperparameter choices in Retraining-based Data Shapley are unclear. In machine learning, training hyperparameters (e.g., learning rate and batch size) typically need to be adjusted based on the size of the training dataset. This creates a fundamental ambiguity when computing Data Shapley values: when evaluating the utility $U(S)$ for different data subsets $S$, should we use the same hyperparameters as those optimized for the full dataset, or should we adjust them based on the subset size? This choice can significantly impact the calculated values.

In-Run Data Shapley addresses all of these issues, making it more desirable in the corresponding scenarios. Firstly, it is computationally efficient, as it computes data values during the training run and avoids the need for multiple retraining iterations. This makes it feasible for modern large-scale ML models. Secondly, it aligns better with the purpose of data valuation by assessing the contribution of each data point to the specific model being trained, providing more relevant insights for real-world applications. Lastly, In-Run Data Shapley offers deterministic valuation results even with stochastic training algorithms, as it incorporates the specific sequence and randomness of the training process into the utility function. Therefore, in scenarios where computational efficiency, model-specific contributions, and result stability are critical, In-Run Data Shapley is the more suitable choice.

Remark 4 (In-Run Data Shapley is a model-specific data attribution technique). In the original Data Shapley literature, $U(S)$ typically defines utility as the final performance (e.g., accuracy or loss) of a model trained on a data subset $S$. However, things become more complicated in the context of deep learning. A specific deep learning training run is an iterative process, and defining the utility $U(S)$ for a particular run makes it a sequence function rather than the pure set function required by the standard Shapley value framework. Extending the Shapley axioms to sequence functions is not straightforward. Previous works often define $U(S)$ as the "expected model utility across all possible training runs" to circumvent this conceptual issue. In this work, we initiate the study of model-specific data attribution, which is crucial for applications like model diagnosis and behavior interpretation, where we focus on a specific training run rather than the expected utility across all possible runs.

Data attribution should consider the specific application scenario. From a broader perspective, the notion of "optimal data attribution" heavily depends on the intended application. In the literature, data attribution techniques are used for various purposes, such as data quality assessment, data valuation, and interpreting model predictions. For data quality assessment, we require "algorithm-level data attribution" that measures a data point's general influence on the learning algorithm, independent of specific training random seeds.
On the other hand, when interpreting model decisions, we need model-specific data attribution focused on the particular checkpoint we trained. Systematically mapping which data attribution approaches best suit which application scenarios remains an important direction for future research.

B.2 When is Retraining-based Data Shapley Desirable for Data Attribution?

Retraining-based Data Shapley is still desirable in several scenarios. Firstly, it is more general and applicable to all learning algorithms, whereas In-Run Data Shapley applies only to iterative training algorithms. Secondly, Retraining-based Data Shapley is useful when the goal is to understand the contribution of data points to the general learning process rather than to a specific model. Thirdly, because Retraining-based Data Shapley does not require modifying the internal code of the training algorithm, its estimation algorithms, typically Monte Carlo-based, are straightforward and clean to implement.

B.3 Potential Extensions of In-Run Data Shapley to Other Domains

While this work focuses on data attribution for SGD-trained machine learning models, we believe the core idea of In-Run Data Shapley, decomposing contribution analysis into per-iteration assessments, can potentially be extended to other contexts and applications. Here, we discuss some directions.

Data attribution for iterative learning algorithms that do not use gradient descent. Our framework could potentially extend to iterative learning algorithms that do not use gradient descent, such as k-means clustering or decision tree learning. Though these algorithms update models differently, they still proceed iteratively (e.g., k-means alternates between assignment and update steps; decision trees are built through recursive partitioning). For such algorithms, we could analyze how each data point influences these discrete update steps and aggregate these influences across iterations, similar to our approach with gradient-based training.

Hyperparameter importance. The framework could potentially be adapted to evaluate hyperparameter contributions during training. One possibility is to set a "baseline" hyperparameter value and assess how choosing a different value impacts each training iteration compared to this baseline. For instance, when evaluating learning rate choices, we could measure how using a specific learning rate affects model updates compared to using a baseline learning rate. For differentiable hyperparameters, we could leverage Taylor expansion to approximate this difference; for non-differentiable ones, zeroth-order methods could potentially be used. By accumulating these contributions across training iterations, we could understand hyperparameter impact without requiring multiple complete training runs. At a high level, this view unifies our treatment of training data and hyperparameters: both can be seen as choices that influence each training iteration, where we aim to quantify their impact against baseline scenarios. While technical challenges remain in adapting our method, this direction presents an interesting opportunity for future research.

Feature attribution (e.g., context attribution for language models). Another possible extension we envision is feature attribution for machine learning models. Feature attribution aims to understand how each part of the input (like individual words in a sentence or pixels in an image) influences a model's final prediction.
When a model makes a prediction, the input features are processed through multiple layers, with each layer transforming the information before passing it to the next. Feature attribution aims to track how this information flows and transforms through the network to determine each input feature's contribution to the final output. We envision a unified theoretical framework connecting training data attribution and feature attribution, drawing parallels between how information flows during training (through gradient updates) and during prediction (through layer operations). This could lead to more efficient methods for explaining model behavior, particularly for complex architectures like transformers.

Appendix C Missing Proofs

Theorem 5 (Restatement of Theorem 3). In-Run Data Shapley under the first-order approximation has the closed form

$$\phi_z(U) \approx \sum_{t=0}^{T-1} \phi_z\big(U^{(t)}_{(1)}\big),$$

where

$$\phi_z\big(U^{(t)}_{(1)}\big) = -\eta_t\, \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z), \qquad t = 0, \ldots, T-1.$$

Proof. For notational simplicity, let $x_j := \nabla\ell(w_t, z_j)$. Given the utility function

$$U^{(t)}_{(1)}(S) = -\eta_t \sum_{z_j \in S} \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z_j),$$

the marginal contribution of $z$ for any $S \subseteq \mathcal{B}_t \setminus \{z\}$ is

$$U^{(t)}_{(1)}(S \cup \{z\}) - U^{(t)}_{(1)}(S) = -\eta_t\, \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z).$$

Since this marginal contribution does not depend on $S$, plugging it into the Shapley value's definition immediately gives the result:

$$\phi_z\big(U^{(t)}_{(1)}\big) = \frac{1}{n_t} \sum_{k=1}^{n_t} \binom{n_t-1}{k-1}^{-1} \sum_{S \subseteq \mathcal{B}_t \setminus \{z\},\, |S|=k-1} \Big( -\eta_t\, \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z) \Big) = -\eta_t\, \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z). \qquad \blacksquare$$
Theorem 6 (Restatement of Theorem 4). In-Run Data Shapley under the second-order approximation has the closed form

$$\phi_z(U) \approx \sum_{t=0}^{T-1} \left( \phi_z\big(U^{(t)}_{(1)}\big) + \frac{1}{2}\phi_z\big(U^{(t)}_{(2)}\big) \right) \quad (6)$$

where

$$\phi_z\big(U^{(t)}_{(1)}\big) + \frac{1}{2}\phi_z\big(U^{(t)}_{(2)}\big) = \underbrace{-\eta_t\, \nabla\ell(w_t, z^{(\mathrm{val})}) \cdot \nabla\ell(w_t, z)}_{\text{(1) influence of } z \text{ on the loss of } z^{(\mathrm{val})}} + \underbrace{\frac{\eta_t^2}{2}\, \nabla\ell(w_t, z)^\top H_t^{(z^{(\mathrm{val})})} \Big( \sum_{z_j \in \mathcal{B}_t} \nabla\ell(w_t, z_j) \Big)}_{\text{(2) interaction between } z \text{ and the other training points}} \quad (7)$$

for any $t = 0, \ldots, T-1$.

Proof.
We show that

$$\phi_z\big(U^{(t)}_{(2)}\big) = \eta_t^2\, \nabla\ell(w_t, z)^\top H_t^{(z^{(\mathrm{val})})} \Big( \sum_{z_j \in \mathcal{B}_t} \nabla\ell(w_t, z_j) \Big).$$

For notational simplicity, let $x_j := \nabla\ell(w_t, z_j)$, $\tilde{x} := \nabla\ell(w_t, z)$, and write $H := H_t^{(z^{(\mathrm{val})})}$; let $n_t := |\mathcal{B}_t|$ denote the batch size in the $t$-th iteration. For any $S \subseteq \mathcal{B}_t \setminus \{z\}$, the marginal contribution of $z$ is

$$U^{(t)}_{(2)}(S \cup \{z\}) - U^{(t)}_{(2)}(S) = \eta_t^2 \left( 2\tilde{x}^\top H \Big( \sum_{z_j \in S} x_j \Big) + \tilde{x}^\top H \tilde{x} \right).$$

Plugging the above expression into the Shapley value's formula, we have

\begin{align*}
\phi_z\big(U^{(t)}_{(2)}\big)
&= \eta_t^2 \left( \tilde{x}^\top H \tilde{x} + \frac{2}{n_t} \sum_{k=1}^{n_t} \binom{n_t-1}{k-1}^{-1} \sum_{S \subseteq \mathcal{B}_t \setminus \{z\},\, |S|=k-1} \Big[ \tilde{x}^\top H \sum_{z_j \in S} x_j \Big] \right) \\
&= \eta_t^2 \left( \tilde{x}^\top H \tilde{x} + \tilde{x}^\top H \Big[ \frac{2}{n_t} \sum_{k=1}^{n_t} \binom{n_t-1}{k-1}^{-1} \sum_{S \subseteq \mathcal{B}_t \setminus \{z\},\, |S|=k-1} \Big( \sum_{z_j \in S} x_j \Big) \Big] \right) \\
&= \eta_t^2 \left( \tilde{x}^\top H \tilde{x} + \tilde{x}^\top H \Big[ \frac{2}{n_t} \sum_{k=2}^{n_t} \binom{n_t-1}{k-1}^{-1} \binom{n_t-2}{k-2} \Big( \sum_{z_j \in \mathcal{B}_t \setminus \{z\}} x_j \Big) \Big] \right) \\
&= \eta_t^2 \left( \tilde{x}^\top H \tilde{x} + \tilde{x}^\top H \Big[ \frac{2}{n_t} \sum_{k=2}^{n_t} \frac{k-1}{n_t-1} \Big( \sum_{z_j \in \mathcal{B}_t \setminus \{z\}} x_j \Big) \Big] \right) \\
&= \eta_t^2 \left( \tilde{x}^\top H \tilde{x} + \tilde{x}^\top H \Big( \sum_{z_j \in \mathcal{B}_t \setminus \{z\}} x_j \Big) \right) \\
&= \eta_t^2 \left[ \tilde{x}^\top H \Big( \sum_{z_j \in \mathcal{B}_t} x_j \Big) \right] \qquad \blacksquare
\end{align*}

Appendix D Technical Details

Notation review. Consider a linear layer $s = aW$, where $W \in \mathbb{R}^{d_1 \times d_2}$ is the weight matrix, $a = (a^{(1)}, \ldots, a^{(B)})^\top$ is the mini-batch input, and $s = (s^{(1)}, \ldots, s^{(B)})^\top$ is the output (i.e., the pre-activation tensor). For non-sequential data, $a \in \mathbb{R}^{B \times d_1}$ and $s \in \mathbb{R}^{B \times d_2}$.
For sequential data with sequence length $T$, $a \in \mathbb{R}^{B \times d_1 \times T}$ and $s \in \mathbb{R}^{B \times d_2 \times T}$. Let $\ell^{(i)} := \ell(w, z_i)$ denote the current model's individual loss on $z_i$. For notational convenience, we denote $b^{(i)} := \frac{\partial \ell^{(i)}}{\partial s}$.

D.1 Ghost Dot-Product

By applying the chain rule, we can express the gradient of an individual loss $\ell^{(i)} := \ell(w, z_i)$ with respect to $W$ as

$$\frac{\partial \ell^{(i)}}{\partial W} = \frac{\partial \ell^{(i)}}{\partial s^{(i)}} \frac{\partial s^{(i)}}{\partial W} = \frac{\partial \ell}{\partial s^{(i)}}\, a^{(i)} = a^{(i)} \otimes b^{(i)} \quad (8)$$

where $\ell := \sum_{j=1}^{B} \ell^{(j)}$ is the aggregated loss. Note that each individual's output gradient $b^{(i)} = \frac{\partial \ell^{(i)}}{\partial s^{(i)}} = \frac{\partial \ell}{\partial s^{(i)}}$ is readily available during the backpropagation pass. Suppose we are interested in computing the gradient dot-product $\frac{\partial \ell^{(1)}}{\partial W} \odot \frac{\partial \ell^{(2)}}{\partial W}$ between two data points $z_1, z_2$ in the same batch. We first discuss the case of non-sequential data and then extend it to sequential data.

Non-sequential data. For non-sequential data, each $a^{(i)} \in \mathbb{R}^{d_1 \times 1}$ and $b^{(i)} \in \mathbb{R}^{1 \times d_2}$. By (8), we have

$$\frac{\partial \ell^{(1)}}{\partial W} \odot \frac{\partial \ell^{(2)}}{\partial W} = \big( a^{(1)} \otimes b^{(1)} \big) \odot \big( a^{(2)} \otimes b^{(2)} \big) = \big( b^{(1)} (b^{(2)})^\top \big) \big( (a^{(1)})^\top a^{(2)} \big).$$

Hence, we can compute the dot-product between $\frac{\partial \ell^{(1)}}{\partial W}$ and $\frac{\partial \ell^{(2)}}{\partial W}$ without actually instantiating either gradient vector: we take the dot products $b^{(1)} (b^{(2)})^\top$ and $(a^{(1)})^\top a^{(2)}$ and multiply the results together. Moreover, all the quantities $a^{(1)}, a^{(2)}, b^{(1)}, b^{(2)}$ required for this computation are already available from one backpropagation. Hence, with a single backpropagation, we can efficiently compute the gradient dot-product between every pair of data points within the batch.
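As a quick illustration, the identity above can be checked numerically; a minimal sketch, assuming PyTorch, that recovers all pairwise gradient dot-products in a batch from a single backward pass:

```python
# Non-sequential ghost dot-product: <grad_i, grad_j> equals
# (a^(i) . a^(j)) * (b^(i) . b^(j)); no d1 x d2 gradients are formed.
import torch

B, d1, d2 = 4, 8, 5
W = torch.randn(d1, d2, requires_grad=True)
a = torch.randn(B, d1)                    # per-sample inputs a^(i)
s = a @ W                                 # pre-activations s^(i)
loss = (s ** 2).sum()                     # aggregated loss l = sum_j l^(j)
b = torch.autograd.grad(loss, s)[0]       # b^(i) = dl/ds^(i), shape (B, d2)

# All B x B pairwise gradient dot-products via two Gram matrices.
dots = (a @ a.T) * (b @ b.T)

# Reference: explicitly instantiate per-sample gradients and compare.
grads = torch.einsum('bi,bj->bij', a, b)  # dl^(i)/dW = a^(i) outer b^(i)
ref = torch.einsum('bij,cij->bc', grads, grads)
assert torch.allclose(dots, ref, rtol=1e-4, atol=1e-4)
```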
Sequential data. For sequential data, each $a^{(i)} \in \mathbb{R}^{d_1 \times T}$ and $b^{(i)} \in \mathbb{R}^{T \times d_2}$. By (8), we have

\begin{align*}
\frac{\partial \ell^{(1)}}{\partial W} \odot \frac{\partial \ell^{(2)}}{\partial W}
&= \big( a^{(1)} \otimes b^{(1)} \big) \odot \big( a^{(2)} \otimes b^{(2)} \big) \\
&= \sum_{j,k=1}^{d_1, d_2} \big( a^{(1)} \otimes b^{(1)} \big)_{j,k} \big( a^{(2)} \otimes b^{(2)} \big)_{j,k} \\
&= \sum_{j,k=1}^{d_1, d_2} \Big( \sum_{t=1}^{T} a^{(1)}_{jt} b^{(1)}_{tk} \Big) \Big( \sum_{t=1}^{T} a^{(2)}_{jt} b^{(2)}_{tk} \Big) \\
&= \sum_{t_1, t_2 = 1}^{T, T} \Big( \sum_{j=1}^{d_1} a^{(1)}_{j t_1} a^{(2)}_{j t_2} \Big) \Big( \sum_{k=1}^{d_2} b^{(1)}_{t_1 k} b^{(2)}_{t_2 k} \Big) \\
&= \sum_{t_1, t_2 = 1}^{T, T} \big( a^{(1)}_{\cdot, t_1} \cdot a^{(2)}_{\cdot, t_2} \big) \big( b^{(1)}_{t_1, \cdot} \cdot b^{(2)}_{t_2, \cdot} \big) \\
&= \big( b^{(1)} (b^{(2)})^\top \big) \odot \big( (a^{(1)})^\top a^{(2)} \big)
\end{align*}

Hence, compared with directly instantiating per-sample gradients, if $2T^2 < d_1 d_2$, it is more memory-efficient to first form the two $T \times T$ matrices $b^{(1)} (b^{(2)})^\top$ and $(a^{(1)})^\top a^{(2)}$ and then take the inner product between them. If $2T^2 \geq d_1 d_2$, we can instead first take the outer products $a^{(1)} \otimes b^{(1)}$ and $a^{(2)} \otimes b^{(2)}$ and then take their inner product. In either case, a single backpropagation suffices to compute the gradient dot-product between every pair of data points within the batch, just as in the non-sequential case.

Remark 5 (Applications of the "Ghost" Technique Beyond Data Attribution). The techniques developed in this work for efficiently computing gradient dot-products may be of independent interest, as this operation arises in various machine learning algorithms and applications beyond data attribution. For instance, the "ghost dot-product" technique can be used to calculate the cosine similarity between individual gradients (together with the "ghost clipping" technique), which is frequently employed to analyze training dynamics (Fort & Ganguli, 2019) and to detect adversarial attacks (Dhaliwal & Shintre, 2018). Additionally, gradient similarity is used in active learning to select the most informative data points for labeling (Ash et al., 2019) and in multitask learning to balance the contributions of different tasks (Yu et al., 2020).

D.2 Ghost Gradient-Hessian-Gradient Product

Let $L$ denote the number of layers in a neural network, and let $W_i$, $i = 1, \ldots, L$, denote the weight parameters of each layer. We write $H^{(z^{(\mathrm{val})})}_{W_i}$ for the Hessian matrix of $\ell(z^{(\mathrm{val})})$ with respect to layer $W_i$. Suppose we are interested in computing the gradient-Hessian-gradient product $\nabla\ell^{(1)} H^{(z^{(\mathrm{val})})} \nabla\ell^{(2)}$.
D.2 Ghost Gradient-Hessian-Gradient Product

We denote by $L$ the number of layers in the neural network, and by $W_i$, $i = 1, \ldots, L$, the weight parameters of each layer. We denote by $H^{(z^{(\mathrm{val})})}_{W_i}$ the Hessian matrix of $\ell(z^{(\mathrm{val})})$ with respect to layer $W_i$. Suppose we are interested in computing the gradient-Hessian-gradient product $\nabla \ell^{(1)} H^{(z^{(\mathrm{val})})} \nabla \ell^{(2)}$:
\begin{align*}
\nabla \ell^{(1)} H^{(z^{(\mathrm{val})})} \nabla \ell^{(2)}
&= \sum_{i,j=1}^{L,L} \frac{\partial \ell^{(1)}}{\partial W_i} \left( \frac{\partial^2 \ell(z^{(\mathrm{val})})}{\partial W_i \partial W_j} \right) \frac{\partial \ell^{(2)}}{\partial W_j} \\
&= \sum_{i,j=1}^{L,L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left( \frac{\partial^2 \ell(z^{(\mathrm{val})})}{\partial W_i \partial W_j} \right) \left( a^{(2)}_j \otimes b^{(2)}_j \right)
\end{align*}
We first look at the Hessian-vector product:
\begin{align*}
\left( \frac{\partial^2 \ell(z^{(\mathrm{val})})}{\partial W_i \partial W_j} \right) \left( a^{(2)}_j \otimes b^{(2)}_j \right)
&= \frac{\partial}{\partial W_i} \left[ \frac{\partial \ell(z^{(\mathrm{val})})}{\partial W_j} \left( a^{(2)}_j \otimes b^{(2)}_j \right) \right] \\
&= \frac{\partial}{\partial W_i} \left[ \left( a^{(z^{(\mathrm{val})})}_j \otimes b^{(z^{(\mathrm{val})})}_j \right) \left( a^{(2)}_j \otimes b^{(2)}_j \right) \right] \\
&= \frac{\partial}{\partial W_i} \left[ \left( a^{(z^{(\mathrm{val})})}_j \cdot a^{(2)}_j \right) \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right) \right] \\
&= \frac{\partial}{\partial W_i} \left[ c_j d_j \right] \\
&= c_j \frac{\partial d_j}{\partial W_i} + d_j \frac{\partial c_j}{\partial W_i}
\end{align*}
where we denote $c_j := a^{(z^{(\mathrm{val})})}_j \cdot a^{(2)}_j$ and
$d_j := b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j$. We run the second backpropagation on the gradient dot product $\nabla \ell(z^{(\mathrm{val})}) \cdot \nabla \ell^{(2)} = \sum_{j=1}^{L} c_j d_j$. Note that here $a^{(2)}_j, b^{(2)}_j$ are constant vectors. For $\frac{\partial c_j}{\partial W_i}$, we have the decomposition
\begin{align*}
\frac{\partial c_j}{\partial W_i}
= \frac{\partial}{\partial W_i} \left( a^{(z^{(\mathrm{val})})}_j \cdot a^{(2)}_j \right)
= \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( a^{(z^{(\mathrm{val})})}_j \cdot a^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i
\end{align*}
For $\frac{\partial d_j}{\partial W_i}$, the derivative is more complicated, as $b^{(z^{(\mathrm{val})})}_j$ also depends on the outputs of deeper layers.
When $j \ge i$, we still have
\begin{align*}
\frac{\partial d_j}{\partial W_i} = \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i
\end{align*}
When $j < i$, we have
\begin{align*}
\frac{\partial d_j}{\partial W_i}
&= \frac{\partial \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right)}{\partial W_i} \\
&= \frac{\partial}{\partial W_i} \left( \frac{\partial \ell(z^{(\mathrm{val})})}{\partial s^{(z^{(\mathrm{val})})}_i} \frac{\partial s^{(z^{(\mathrm{val})})}_i}{\partial a^{(z^{(\mathrm{val})})}_i} \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} \cdots \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \\
&= \frac{\partial}{\partial W_i} \left( \frac{\partial \ell(z^{(\mathrm{val})})}{\partial s^{(z^{(\mathrm{val})})}_i} W_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} \cdots \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right)
\end{align*}
\begin{align*}
&= \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( \frac{\partial \ell(z^{(\mathrm{val})})}{\partial s^{(z^{(\mathrm{val})})}_i} W_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} \cdots \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i \\
&\quad + \frac{\partial \ell(z^{(\mathrm{val})})}{\partial s^{(z^{(\mathrm{val})})}_i} \otimes \left( \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} \cdots \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \\
&= \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i + b^{(z^{(\mathrm{val})})}_i \otimes \left( \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right)
\end{align*}
Overall,
\begin{align*}
\frac{\partial d_j}{\partial W_i} = \frac{\partial}{\partial W_i} \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right)
\end{align*}
\begin{align*}
= \begin{cases}
\dfrac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i & j \ge i \\[6pt]
\dfrac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( b^{(z^{(\mathrm{val})})}_j \cdot b^{(2)}_j \right) \otimes a^{(z^{(\mathrm{val})})}_i + b^{(z^{(\mathrm{val})})}_i \otimes \left( \dfrac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) & j < i
\end{cases}
\end{align*}
Hence, we have
\begin{align*}
\nabla \ell^{(1)} H^{(z^{(\mathrm{val})})} \nabla \ell^{(2)}
&= \sum_{i,j=1}^{L,L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left[ c_j \frac{\partial d_j}{\partial W_i} + d_j \frac{\partial c_j}{\partial W_i} \right] \\
&= \underbrace{\sum_{i,j=1}^{L,L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left( \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( c_j d_j \right) \otimes a^{(z^{(\mathrm{val})})}_i \right)}_{\text{(I)}}
+ \underbrace{\sum_{i>j}^{L,L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left( b^{(z^{(\mathrm{val})})}_i \otimes \left( \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \right)}_{\text{(II)}}
\end{align*}
For the first part, we have
\begin{align*}
\text{(I)}
&= \sum_{i=1}^{L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left( \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( \sum_{j=1}^{L} c_j d_j \right) \otimes a^{(z^{(\mathrm{val})})}_i \right) \\
&= \sum_{i=1}^{L} \left( a^{(1)}_i \cdot a^{(z^{(\mathrm{val})})}_i \right) \left( b^{(1)}_i \cdot \frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( \sum_{j=1}^{L} c_j d_j \right) \right)
\end{align*}
where $\frac{\partial}{\partial s^{(z^{(\mathrm{val})})}_i} \left( \sum_{j=1}^{L} c_j d_j \right)$, $i = 1, \ldots, L$, is readily available from the second backpropagation.
For the second part, we have
\begin{align*}
\text{(II)}
&= \sum_{i>j}^{L,L} \left( a^{(1)}_i \otimes b^{(1)}_i \right) \left( b^{(z^{(\mathrm{val})})}_i \otimes \left( \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \right) \\
&= \sum_{i>j}^{L,L} \left( a^{(1)}_i \cdot \left( \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j \right) \right) \left( b^{(1)}_i \cdot b^{(z^{(\mathrm{val})})}_i \right)
\end{align*}
The remaining question is how to compute $f(i,j) := a^{(1)}_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j$. Observe that
\begin{align*}
a^{(1)}_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j
= a^{(1)}_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} \frac{\partial s^{(z^{(\mathrm{val})})}_{i-1}}{\partial a^{(z^{(\mathrm{val})})}_{i-1}} \cdots \frac{\partial s^{(z^{(\mathrm{val})})}_{j+1}}{\partial a^{(z^{(\mathrm{val})})}_{j+1}} \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j
\end{align*}
\begin{align*}
= a^{(1)}_i \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} W_{i-1} \cdots W_{j+1} \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j
\end{align*}
Moreover, $\frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j}$ is easy to compute, as the only operation between $a^{(z^{(\mathrm{val})})}_{j+1}$ and $s^{(z^{(\mathrm{val})})}_j$ is an element-wise activation function. For example, if the activation function is ReLU, then $\frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j}$ is simply a diagonal matrix with entries in $\{0, 1\}$, determined by the signs of the corresponding entries of $s^{(z^{(\mathrm{val})})}_j$. To efficiently compute $f(i,j)$ for all $i > j$, for each $j = 1, \ldots, L$ we can maintain the quantity
\begin{align*}
g(i,j) := \frac{\partial a^{(z^{(\mathrm{val})})}_i}{\partial s^{(z^{(\mathrm{val})})}_{i-1}} W_{i-1} \cdots W_{j+1} \frac{\partial a^{(z^{(\mathrm{val})})}_{j+1}}{\partial s^{(z^{(\mathrm{val})})}_j} b^{(2)}_j
\end{align*}
so that $f(i,j) = a^{(1)}_i \cdot g(i,j)$. To compute $f(i+1,j)$, note that $g(i+1,j) = \frac{\partial a^{(z^{(\mathrm{val})})}_{i+1}}{\partial s^{(z^{(\mathrm{val})})}_i} W_i \, g(i,j)$.

Moreover, we note that in our scenario, we are interested in the gradient-Hessian-gradient product $\nabla \ell^{(1)} H^{(z^{(\mathrm{val})})} \left( \sum_{m=1}^{B} \nabla \ell^{(m)} \right)$, where the Hessian-vector product $H^{(z^{(\mathrm{val})})} \left( \sum_{m=1}^{B} \nabla \ell^{(m)} \right)$ is the same regardless of which training point plays the role of $\ell^{(1)}$. Therefore, the second backpropagation only needs to be performed once, since the gradient vector on the right-hand side is always fixed.
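For reference, the quantity computed in this section can be checked against the standard double-backpropagation Hessian-vector product. The sketch below (a toy loss and data of our own choosing, not the ghost implementation) evaluates $\nabla \ell^{(1)} H^{(z^{(\mathrm{val})})} \nabla \ell^{(2)}$ directly:

```python
import torch

# Naive reference for the gradient-Hessian-gradient product
# grad(l1)^T H(z_val) grad(l2): a second backpropagation through
# grad(l_val) . v yields the Hessian-vector product H(z_val) @ v
# without ever forming the Hessian. Toy loss and data are ours.
torch.manual_seed(0)
w = torch.randn(10, requires_grad=True)

def loss(w, z):
    return ((w * z).sum()) ** 2            # toy quadratic loss

z1, z2, z_val = torch.randn(10), torch.randn(10), torch.randn(10)

g1 = torch.autograd.grad(loss(w, z1), w)[0]
g2 = torch.autograd.grad(loss(w, z2), w)[0]

# Second backpropagation: differentiate grad(l_val) . g2 w.r.t. w.
g_val = torch.autograd.grad(loss(w, z_val), w, create_graph=True)[0]
hvp = torch.autograd.grad(g_val @ g2, w)[0]        # H(z_val) @ g2

quad = g1 @ hvp                                    # grad(l1)^T H grad(l2)
```

Since the vector $\sum_{m=1}^{B} \nabla \ell^{(m)}$ is shared by every training point in the batch, this second backpropagation is performed once per iteration, exactly as noted above.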
D.3 Merging Batch Selection and Gradient Update in One Backpropagation

By utilizing the ghost dot-product technique developed in this paper, we can calculate or approximate all importance scores and correction terms in a single backpropagation pass, without materializing any model-sized vectors. To compute the gradient dot product between each training point $z_i \in \mathcal{B}_t$ and the validation data $z^{(\mathrm{val})}$, we propose including $z^{(\mathrm{val})}$ in the backpropagation along with the training batch. Specifically, we can backpropagate with respect to $\left( \sum_{z_i \in \mathcal{B}_t} \ell^{(i)} \right) + \ell(z^{(\mathrm{val})})$. After computing the In-Run Data Shapley scores for the current iteration, it may seem necessary to backpropagate again with respect to $\sum_{z_i \in \mathcal{B}_t} \ell^{(i)}$ to compute the gradient for the parameter update. However, this is not required: we can simply reuse the output gradients $\frac{\partial \ell^{(i)}}{\partial s^{(i)}}$ from the original backpropagation and aggregate the gradients over all selected data points. This technique is adapted from the "book-keeping trick" proposed in Bu et al. (2023).
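A minimal sketch of this combined pass, assuming a toy model and MSE loss of our own choosing (the hooks that record the per-layer $a$ and $b$ tensors are elided):

```python
import torch
import torch.nn as nn

# Sketch: backpropagate once through (sum of training losses) + validation
# loss, so the activations a and output gradients b needed for the ghost
# dot-products against z_val come from the same pass as the training
# gradients. Model, data, and loss below are illustrative placeholders.
torch.manual_seed(0)
model = nn.Linear(8, 1)
x_batch, y_batch = torch.randn(16, 8), torch.randn(16, 1)
x_val, y_val = torch.randn(1, 8), torch.randn(1, 1)

mse = nn.MSELoss(reduction='sum')
combined = mse(model(x_batch), y_batch) + mse(model(x_val), y_val)
combined.backward()   # forward/backward hooks (omitted) capture a and b here

# The parameter update must use only the training gradient; rather than
# backpropagating again, the stored per-sample output gradients of the
# training points are re-aggregated (the book-keeping trick of Bu et al.).
```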
Appendix E Evaluation

E.1 Data Preprocessing & Training Settings & Implementation Details

We conduct the pretraining experiments using the GPT2 (124M) model (Radford et al., 2019) and Pythia-410M (Biderman et al., 2023) on the Pile dataset. Our codebase is adapted from https://github.com/karpathy/nanoGPT/tree/master. We first tokenize the entire dataset, split it into chunks, and store them on disk in numpy array format, which significantly speeds up data loading. The maximum sequence length is set to 1024. The learning rate is set at a maximum of 0.0006, with a minimum of 0.00006. We use AdamW as the optimizer with a weight decay of 0.1 and beta values of 0.9 and 0.95. Gradients are clipped at a maximum norm of 1.0 to maintain stability during training. The batch size is set to 16, with a learning rate warmup of 2000 iterations. Due to limited computational resources, we stop training at 500,000 iterations.

Methods for comparison. In Section 5, we compare In-Run Data Shapley with the influence function (Koh & Liang, 2017), which approximates the change in the model's loss on the test example when a training example is removed from the training set (i.e., the leave-one-out score). BM25 (Robertson et al., 2009) featurizes examples by their word-frequency statistics to rank the training instances. We measure the similarity between the query examples and the training corpus, using the similarity scores as the value scores for the training corpus. We set the hyperparameters to $k_1 = 1.5$, $b = 0.75$, which are commonly used values that balance term-frequency scaling and document-length normalization.

Remark 6. We do not compare with TRAK (Park et al., 2023), as it generally requires aggregating the estimator across multiple trained models for reasonable performance, and the original codebase is not applicable to Huggingface models. In their original paper, the largest-scale experiment is fine-tuning BERT-base.

E.2 Additional Fidelity Evaluation

E.2.1 Approximation Error for In-Run Data Shapley with larger learning rates

To further investigate the impact of the learning rate on the approximation quality of In-Run Data Shapley, we conducted additional experiments extending those in Section 5.2 with larger learning rates of $6 \times 10^{-4}$ and $5 \times 10^{-3}$, the latter being significantly higher than typical learning rates used in foundation model pretraining. Figure 6 shows the comparison between Monte Carlo-estimated In-Run Data Shapley and our approximations with learning rate $\eta = 6 \times 10^{-4}$. Both first- and second-order approximations achieve remarkably high accuracy, with RMSE $= 0.0012$ and Spearman correlation $= 0.99$. Even with an extremely large learning rate of $\eta = 5 \times 10^{-3}$ (Figure 7), our approximations maintain reasonable accuracy, with RMSE $\approx 0.0017$ and Spearman correlation $= 0.79$. The strong rank correlation, particularly at learning rates typical for model pretraining ($\eta \le 10^{-3}$), suggests that our approximations effectively preserve the relative importance of training data points. This property is crucial for practical applications such as identifying low-quality data or selecting valuable training examples, where relative rankings matter more than absolute values.

Figure 6: Comparison between the Monte Carlo-estimated In-Run Data Shapley and first/second-order In-Run Data Shapley (learning rate $= 6 \times 10^{-4}$).

Figure 7: Comparison between the Monte Carlo-estimated In-Run Data Shapley and first/second-order In-Run Data Shapley (learning rate $= 5 \times 10^{-3}$).

E.2.2 Approximation Error for the Utility Function

In this experiment, we empirically investigate the errors of the first- and second-order approximations to $U^{(t)}$ on GPT2, as proposed in Section 4.1. The correlations are shown in Figure 8. Even the first-order approximation achieves high accuracy, and the inclusion of the second-order interaction term does not provide a notable improvement.
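For concreteness, the sketch below evaluates the two approximations compared in this experiment on a toy objective; the loss, parameter vector, and subset gradient are stand-ins of our own rather than the GPT2 setup:

```python
import torch

# First- and second-order Taylor approximations (a toy sketch) of the
# one-step utility U(S) = l(w - eta * g_S, z_val) - l(w, z_val).
torch.manual_seed(0)
eta = 6e-4
w = torch.randn(10, requires_grad=True)
z_val = torch.randn(10)
g_S = torch.randn(10)                  # stand-in for a subset gradient

def loss(w, z):
    return torch.log1p(torch.exp(-(w * z).sum()))   # toy logistic loss

g_val = torch.autograd.grad(loss(w, z_val), w, create_graph=True)[0]
dot = g_val @ g_S
hvp = torch.autograd.grad(dot, w)[0]                # H(z_val) @ g_S

first_order = (-eta * dot).item()
second_order = first_order + 0.5 * eta ** 2 * (g_S @ hvp).item()

with torch.no_grad():                               # exact one-step change
    exact = (loss(w - eta * g_S, z_val) - loss(w, z_val)).item()
```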
Figure 8: (a) The correlation between the ground-truth change in the model's validation loss in one gradient update iteration, $U^{(t)}(S; z^{(\mathrm{val})}) := \ell(\tilde{w}_{t+1}(S), z^{(\mathrm{val})}) - \ell(w_t, z^{(\mathrm{val})})$, and its first-order Taylor approximation. (b) The correlation between $U^{(t)}(S; z^{(\mathrm{val})})$ and the second-order approximation.

Remark 7 (Difficulty in Extending to Adam). The techniques developed in this section are specifically tailored to SGD. They are not directly extendable to other popular optimizers such as Adam, due to their normalization terms. Nonetheless, using SGD as a proxy for Adam allows for efficient data attribution; this is the approach we adopt in practice, and it has proved effective in our experiments. Extending these techniques to support Adam and similar optimizers remains an exciting direction for future research.

E.3 Details about Section 5.3.1

In this experiment, we first conduct one training run for 20,000 iterations. For all corpora used in this run, we compute their In-Run Data Shapley values and filter out all corpora with negative contributions to model training. After filtering out the 16% of corpora with negative values, we train another model on the remaining dataset for 10,000 iterations, with all other hyperparameters unchanged. Despite the Pile dataset undergoing multiple layers of data curation (Gao et al., 2020), we still find a significant amount of poor-quality corpora through In-Run Data Shapley. For example, we discovered many corpora in the PubMed Central domain that consist entirely of empty spaces. Figure 9 provides two examples of corpora that received the lowest In-Run Data Shapley scores. Additionally, we found several negatively valued corpora that appear normal, likely due to a distribution shift with respect to the validation corpus. Crafting a representative validation set is an important direction for future work.

[Excerpt of a low-quality corpus: long runs of comma-separated digits such as "0,0,0,0,0,0,0,0,0,0,2,0,…" and "3,3,3,3,3,3,3,3,1,3,3,…".]

Figure 9: Examples of low-quality corpora in the Pile found by In-Run Data Shapley. Left: a corpus of meaningless numbers. Right: a corpus of meaningless symbols (likely due to errors from web crawling).

E.3.1 Data selection performance on Pythia-410M

Figure 10 shows a performance comparison between the original training run and the model trained on the cleaned subsets for Pythia-410M. As with GPT2, removing lower-quality data leads to a significantly faster drop in test loss compared to the original training run, highlighting the effectiveness of In-Run Data Shapley across different architectures.

Figure 10: Test loss comparison between the original training run and models trained on the cleaned subsets produced by different data attribution techniques, for Pythia-410M.
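The cleaning step used in these data selection experiments reduces to a simple threshold on the accumulated scores; a minimal sketch with hypothetical corpus IDs and values:

```python
# Hypothetical accumulated In-Run Data Shapley values per training corpus.
shapley_values = {"corpus_0": 0.8, "corpus_1": -0.3, "corpus_2": 0.1}

# Keep only corpora with non-negative contribution, then retrain on them.
retained = [cid for cid, v in shapley_values.items() if v >= 0]
```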
E.4 Additional Results for Section 5.3.2

In addition to Figure 3, we also plot the average value scores of corpora from each domain, averaged across all corpora within the same domain that have been used in gradient updates, over the first 10,000 gradient update iterations. As training progresses, the magnitude of the data value scores converges towards zero, which aligns with the diminishing-returns property of neural network training. This indicates that the contribution of a data point depends on the order in which it is used during training, an aspect that Retraining-based Data Shapley methods cannot capture: data points introduced in later stages of training contribute less to model performance improvement.

Figure 11: The change in the average contribution of data points from each domain to a corpus of math text (same setting as Figure 3).

E.5 Details about Section 5.3.3

In Section 5.3.3, we use GPT-4 to generate paraphrased versions of an original corpus from the training set at varying levels of similarity, categorized as "Partial exactly the same," "Paraphrase," "Significant paraphrase," and "Similar topic." For "Partial exactly the same," the corpus is used as the answer in a creative-writing instruction-answer pair. The specific prompts, together with examples of the full dialogues, are shown in Figures 13, 14, and 15. We generate 3 sets of corpus groups, and the experimental results are averaged over them. Figure 12 shows the distribution of the calculated In-Run Data Shapley values. Most corpora have values around 0, but certain corpora have significantly higher values than the others, meaning their contributions to the target validation corpus are higher.

Figure 12: The distribution of calculated In-Run Data Shapley values.

Dialogue with GPT4 for generating corpus of varying levels of similarity (original corpus from Wikipedia)

[Generate "Partial exactly the same"]
<Human> Instruction: [INSERT]
Answer: In 2012, Radhi recruited new 'musicians' for OAG, who were selected from among the students of Akademi Seni Budaya dan Warisan Kebangsaan (). The new line-up consists of Qi Razali (drums/backing vocals - original drummer back to the drum position), Muhamad Nizam (guitar - since 2005), Nazrin Zabidi (bass guitar/backing vocals), and Izmer Khasbullah (keyboard/backing vocals). Radhi defines OAG's new music as aslinatif, a combination of muzik asli or traditional music and alternative. Shortly after their reunion, they released the soundtrack and theme song to the TV series, Friday I'm in Love. Few other singles like "Si Pengasih Muda", "Di Bawah Tanah Ayer" and "Ambisi" are set to be released soon and also a full studio album coming afterwards.
Please help filling in [INSERT] as the instruction for a creative writing task.
<GPT4> Write a short story about Radhi and his band OAG as they recruit new members and blend traditional music with alternative sounds. Describe their creative challenges and successes leading up to their new releases.

[Generate "Paraphrase"]
<Human> Could you paraphrase the paragraph start from "In 2012, Radhi recruited …"?
<GPT4> In 2012, Radhi assembled new 'musicians' for OAG, choosing from students at the Akademi Seni Budaya dan Warisan Kebangsaan. The updated lineup includes Qi Razali on drums and backing vocals, returning to his original role, along with Muhamad Nizam on guitar (since 2005), Nazrin Zabidi playing bass and providing backing vocals, and Izmer Khasbullah on keyboards and backing vocals.
Radhi describes OAG's revamped sound as 'aslinatif,' a fusion of traditional (muzik asli) and alternative music. Following their reformation, they quickly launched the soundtrack and theme for the TV series "Friday I'm in Love." Upcoming releases include singles like "Si Pengasih Muda," "Di Bawah Tanah Ayer," and "Ambisi," along with a new studio album.

[Generate "Significant paraphrase"]
<Human> Could you make it less look like the original paragraph? Feel free to modify the sentences.
<GPT4> In 2012, Radhi selected a new set of musicians from the students of Akademi Seni Budaya dan Warisan Kebangsaan to join OAG. The band's updated composition includes Qi Razali, who returned to his original role on drums and backing vocals, alongside Muhamad Nizam who has been on guitar since 2005, Nazrin Zabidi on bass with backing vocals, and Izmer Khasbullah playing the keyboard and adding backing vocals. Radhi has termed the band's evolved musical expression as 'aslinatif,' a blend of traditional and alternative sounds. Following their comeback, their first major project was producing the soundtrack and theme for the TV series "Friday I'm in Love." The band is also planning to launch several singles such as "Si Pengasih Muda," "Di Bawah Tanah Ayer," and "Ambisi," and is working towards a new full-length album.

[Generate "Similar topic"]
<Human> (COPY PREVIOUS INSTRUCTION-ANSWER PAIR) … The above is an example of instruction-answer pair of creative writing. Could you write an instruction-answer pair regarding a less similar but not completely irrelevant topic?
<GPT4> ### Instruction: Write a short story about a classical violinist who decides to explore jazz music, detailing her first encounter with a jazz band.
### Answer: Elena, a classically trained violinist known for her precise and emotive performances, felt something was missing in her musical life. Curiosity led her to a dimly lit jazz club downtown, where the improvised rhythms and melodies captivated her. Hesitantly, she approached the band after their set, violin in hand. They invited her to join them for a jam session the following week. As she stepped onto the stage, her traditional techniques blended unexpectedly with the saxophonist's smooth runs and the drummer's rhythmic beats. The fusion of classical precision and jazz spontaneity created a thrilling new sound. Elena found herself reinvigorated, her violin singing with a freedom she had never before experienced.

Figure 13: Our dialogue with GPT4 for generating corpus of varying levels of similarity for the original corpus from Wikipedia.

Example dialogue with GPT4 for generating corpus of varying levels of similarity (original corpus from CNN)

[Generate "Partial exactly the same"]
<Human> Instruction: [INSERT]
Answer: The morning after Obama's win, The New York Times declared "Racial Barrier Falls in Decisive Victory." The Los Angeles Times asserted that "for the first time in human history, a largely white nation has elected a black man to be its paramount leader." Some black commentators openly wept on election night, thrilled with witnessing the election of our first black president. Even Sen. John McCain, conceding defeat, pointed to the greatness of the American promise fulfilled in the election of his opponent. Obama's victory offered the possibility that the scars of America's racial legacy were healed or, at least, that they were less raw.
Please help filling in [INSERT] as the instruction for a creative writing task.
<GPT4> Write a reflective essay on the significance of Barack Obama's 2008 presidential election victory, focusing on its impact on race relations in the United States and the emotional responses it elicited from different communities.

[Generate "Paraphrase"]
<Human> Could you paraphrase the paragraph?
<GPT4> The morning after Obama's victory, The New York Times proclaimed "Racial Barrier Falls in Decisive Victory," while The Los Angeles Times noted that "for the first time in human history, a predominantly white nation has elected a black man as its paramount leader." Some black commentators wept openly on election night, overjoyed at witnessing the election of the first black president. Even Sen. John McCain, in his concession speech, highlighted the fulfillment of the American promise through his opponent's win. Obama's triumph suggested the possibility that the deep wounds of America's racial history were healing or, at the very least, becoming less acute.

[Generate "Significant paraphrase"]
<Human> Could you make it less look like the original paragraph? Feel free to modify the sentences.
<GPT4> The morning after Obama's victory, headlines captured the historic moment: The New York Times announced "Racial Barrier Falls in Decisive Victory," and The Los Angeles Times highlighted that "for the first time, a predominantly white nation has chosen a black man as its leader." On election night, many black commentators shed tears of joy, moved by the milestone of electing the first black president. Even Sen. John McCain, in his concession speech, acknowledged the realization of the American promise in his opponent's success. Obama's win signaled a potential healing of America's racial wounds or at least a step toward easing their pain.

[Generate "Similar topic"]
<Human> (COPY PREVIOUS INSTRUCTION-ANSWER PAIR) … The above is an example of instruction-answer pair of creative writing. Could you write an instruction-answer pair regarding a less similar but not completely irrelevant topic?
<GPT4> ### Instruction: Write a reflective essay on the significance of the Supreme Court's 2015 decision to legalize same-sex marriage nationwide, focusing on its impact on LGBTQ+ rights in the United States and the emotional responses it elicited from different communities.
### Answer: The morning after the Supreme Court's landmark decision, The New York Times declared "Equal Dignity in the Eyes of the Law." The Washington Post highlighted, "Love Wins: Supreme Court Affirms Same-Sex Marriage Nationwide." Many in the LGBTQ+ community celebrated with tears of joy, feeling validated and recognized. Even some of the justices who dissented acknowledged the profound impact of the decision on American society. The ruling represented a major milestone in the fight for LGBTQ+ rights, symbolizing a shift toward greater acceptance and equality.

Figure 14: Our dialogue with GPT4 for generating corpus of varying levels of similarity for the original corpus from a CNN news article.
Example dialogue with GPT4 for generating corpus of varying levels of similarity (original corpus from princeton.edu)

[Generate "Partial exactly the same"]
<Human> Instruction: [INSERT]
Answer: The process was purchased by the German chemical company BASF (a coal tar dye supplier), which assigned Carl Bosch the difficult task of scaling up Haber's tabletop machine to industrial-level production. Haber and Bosch were later awarded Nobel prizes, in 1918 and 1931 respectively, for their work in overcoming the chemical and engineering problems posed by the use of large-scale, continuous-flow, high-pressure technology.
Please help filling in [INSERT] as the instruction for a creative writing task.
<GPT4> Write a short story or essay about Carl Bosch's efforts to scale up Haber's tabletop machine to industrial production. Highlight the scientific and engineering challenges he faced, the solutions he found, and the impact of his work on society. Include historical context and personal anecdotes.

[Generate "Paraphrase"]
<Human> Could you paraphrase the paragraph?
<GPT4> The German chemical company BASF, known for supplying coal tar dyes, acquired the process and tasked Carl Bosch with the challenging job of scaling Haber's small-scale machine to an industrial level. Haber and Bosch were eventually honored with Nobel prizes in 1918 and 1931, respectively, for their breakthroughs in addressing the chemical and engineering difficulties associated with large-scale, continuous-flow, high-pressure technology.

[Generate "Significant paraphrase"]
<Human> Could you make it less look like the original paragraph? Feel free to modify the sentences.
<GPT4> BASF, a German firm specializing in coal tar dyes, took on the task and brought in Carl Bosch to scale up Haber's experimental setup to an industrial scale. Bosch faced numerous challenges but ultimately succeeded. Their groundbreaking work in developing large-scale, high-pressure technology earned them Nobel Prizes, with Haber being recognized in 1918 and Bosch in 1931.

[Generate "Similar topic"]
<Human> (COPY PREVIOUS INSTRUCTION-ANSWER PAIR) … The above is an example of instruction-answer pair of creative writing. Could you write an instruction-answer pair regarding a less similar but not completely irrelevant topic?
<GPT4> ### Instruction: Write a short story or essay about the development of the polio vaccine by Jonas Salk. Focus on the scientific challenges he faced, the innovative methods he used, and the impact of the vaccine on public health. Include historical context and personal anecdotes.
### Answer: In the early 1950s, Jonas Salk embarked on a mission to develop a vaccine for polio, a disease that was causing widespread fear and paralysis. Salk faced significant scientific challenges, including finding a way to grow the poliovirus in large quantities and then inactivating it without destroying its ability to induce immunity. Salk's innovative method involved using formaldehyde to kill the virus while preserving its structure. After rigorous testing, the vaccine was declared safe and effective in 1955. This breakthrough led to mass vaccination campaigns, dramatically reducing polio incidence worldwide. The impact of Salk's vaccine was profound, turning a once-dreaded disease into a preventable condition. Salk's work not only saved countless lives but also demonstrated the power of scientific innovation in overcoming public health challenges. His dedication and perseverance remain an inspiration in the field of medical research.
Figure 15: Our dialogue with GPT4 for generating corpus of varying levels of similarity for the original corpus from princeton.edu.

[Illustrative comparison boxes from Section 5.3.3: (1) the original Wikipedia corpus beginning "In 2012, Radhi recruited new 'musicians' for OAG, …" shown side by side with the "Partial exactly the same" corpus, which embeds the original text verbatim as the answer to the creative-writing instruction; (2) the same original corpus shown side by side with the synthetic "Similar topic" corpus about a classical violinist exploring jazz.]

E.5.1 Additional Experiment: Relevant Domain Detection

In this additional experiment, we evaluate the effectiveness of different data attribution techniques in identifying relevant domain-specific corpora within the Pile dataset. Specifically, we take a random batch of validation corpora from sub-domains of the Pile dataset and evaluate the normalized recall@1000 scores. This metric measures the proportion of corpora from the same domain as the validation corpora among the 1000 highest-valued corpora, normalized by the global proportion of this domain within the Pile dataset. As shown in Table 3, BM25 performs well on this task in general. Both the first-order and second-order In-Run Data Shapley methods outperform the influence function in detecting relevant domain corpora.

Analysis: The influence function only uses information from the final trained model, which can result in highly noisy value scores, since the removal of a single training data point might have a negligible effect on the final model's performance. In contrast, In-Run Data Shapley effectively leverages information from all intermediate checkpoints during model training, providing a more comprehensive and accurate assessment of each data point's value. While BM25 performs well on most of the domains, there are still cases where In-Run Data Shapley achieves a better detection rate, which implies that gradient information may be more useful for determining semantic similarity to some extent.

Method | Github | EuroParl | ArXiv | PhilPapers | FreeLaw | HackerNews | StackExchange | Wikipedia (en)
In-Run Data Shapley (1st order) | 30.5 | 12.7 | 48.4 | 0.6 | 14.6 | 2.7 | 16.3 | 14.5
In-Run Data Shapley (2nd order) | 29 | 14.8 | 50 | 0.4 | 16.9 | 3.8 | 20 | 18.9
Influence Function | 17.9 | 4.1 | 41.7 | 0.5 | 5.7 | 2.6 | 6.5 | 5.5
BM25 | 18.8 | 6.3 | 44.8 | 1.1 | 72.6 | 9.9 | 37.2 | 18.6

Method | PubMed Abstracts | USPTO Backgrounds | PubMed Central | Enron Emails | NIH ExPorter | DM Mathematics | Ubuntu IRC | Gutenberg (PG-19)
In-Run Data Shapley (1st order) | 10.4 | 10.5 | 30 | 0.8 | 1.7 | 23.5 | 10.2 | 8.7
In-Run Data Shapley (2nd order) | 16.6 | 11.1 | 31.1 | 1 | 1.9 | 24.5 | 11.5 | 15.3
Influence Function | 10.2 | 7.4 | 26.9 | 0.2 | 1.2 | 10.1 | 4.3 | 8.8
BM25 | 28.6 | 23 | 53.7 | 1 | 4.1 | 34.6 | 12.5 | 25.9

Table 3: Results of the relevant domain corpus detection experiment, where the metric is the normalized recall@1000 (i.e., the proportion of corpora from the same domain as the validation corpora among the 1000 highest-valued corpora, normalized by the global proportion of this domain within the Pile dataset).
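The normalized recall@1000 metric used in Table 3 is straightforward to compute; a small sketch with hypothetical inputs:

```python
# Normalized recall@k (a sketch; inputs are hypothetical): the fraction of
# the k highest-valued corpora that share the validation domain, divided
# by that domain's global proportion in the dataset.
def normalized_recall_at_k(values, domains, target_domain, k=1000):
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    hits = sum(domains[i] == target_domain for i in ranked[:k])
    recall = hits / k
    global_prop = domains.count(target_domain) / len(domains)
    return recall / global_prop

score = normalized_recall_at_k(
    values=[0.9, 0.1, -0.2, 0.5],
    domains=["ArXiv", "Github", "ArXiv", "ArXiv"],
    target_domain="ArXiv",
    k=2,
)
```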
E.6 Comparison with other data attribution baselines on small-scale models

In this section, we evaluate In-Run Data Shapley using standard benchmarks for data attribution techniques, including mislabeled data detection and data selection. While data attribution scores can generally indicate data quality, they are not specifically optimized for these benchmarks. Furthermore, approximating the Shapley value has intrinsic merit beyond these standard tasks (e.g., for royalty sharing of AI-generated content (Wang et al., 2024a)). As such, this evaluation primarily serves as a sanity check that In-Run Data Shapley reasonably reflects data values; state-of-the-art performance is not expected, since the method is not designed specifically for these tasks.

We compare In-Run Data Shapley with several data attribution baselines, including Retraining-based Data Shapley (Ghorbani & Zou, 2019), KNN-Shapley (Jia et al., 2019a), the influence function (Koh & Liang, 2017), Trak (Park et al., 2023), Empirical Influence Functions (Feldman & Zhang, 2020), and Datamodels (Ilyas et al., 2022). We also compare with LESS (Xia et al., 2024), an extension of TracIN-CP tailored for instruction-tuning data selection. As some of these baselines require multiple model retrainings, we use a subset of 1,000 samples from CIFAR10 and the ResNet18 architecture. We evaluate these techniques on two standard tasks for assessing data attribution methods: mislabeled data detection and data selection. We note that recent work (Wang et al., 2024d) questions the suitability of data attribution for data selection, but we include this metric due to its prevalence in the existing literature.

Experiment settings. We use an ImageNet-pretrained ResNet18 as the architecture in this experiment. We use Adam with a learning rate of 0.001, weight decay of 1e-4, and label smoothing of 0.1 over 50 epochs. The learning rate is reduced by a factor of 0.1 every 10 epochs. The batch size is set to 64. For retraining-based techniques (Retraining-based Data Shapley, Empirical Influence Functions, Datamodels), we estimate the corresponding attribution scores with 1,000 model training runs. For LESS, we use the checkpoints from epochs 10, 20, 30, 40, and 50, and set the projection dimension to 2048 as recommended in the original paper. For Trak, we set the projection dimension to 2048 and additionally evaluate its performance by averaging the results over the last T checkpoints, where T ∈ {1, 5, 15, 25}. For KNN-Shapley, we set K = 5 and use the features extracted from the last linear layer of ResNet18.

Mislabeled data detection. The results in Table 4 show that KNN-Shapley achieves the highest performance in mislabeled data detection, likely due to its sensitivity to label changes. Retraining-based approaches (Retraining-based Data Shapley, Empirical Influence Functions, Datamodels) have the lowest performance, which can be attributed to Monte Carlo sample inefficiency and stochasticity during retraining, as noted in Wang & Jia (2023a).
Among techniques requiring only one training run, Trak does not outperform the others, aligning with observations in its original paper (Park et al., 2023) that ensembles are often necessary for high performance. LESS achieves better results by incorporating more information from intermediate checkpoints. In-Run Data Shapley and the influence function achieve comparable performance, outperforming all other techniques except KNN-Shapley. The relatively high performance of In-Run Data Shapley is likely due to its deterministic estimation algorithm; furthermore, it makes use of information from all intermediate checkpoints.

Data selection. Table 5 shows that KNN-Shapley again excels when the selection budget is small. Interestingly, most data attribution techniques perform worse than the random baseline for small selection budgets, aligning with observations that selecting data points solely based on attribution scores can harm dataset diversity (Wang et al., 2024d). For larger selection budgets (60%, 80%), single-training-run techniques achieve similar performance levels.

Method | AUROC
Retraining-based Data Shapley (Ghorbani & Zou, 2019) | 0.582 (0.029)
Empirical Influence Function (Feldman & Zhang, 2020) | 0.552 (0.017)
Datamodels (Ilyas et al., 2022) | 0.52 (0.008)
KNN-Shapley (Jia et al., 2019a; Wang & Jia, 2023c) | 0.76 (0.018)
Influence function (Koh & Liang, 2017) | 0.654 (0.054)
LESS (Xia et al., 2024) | 0.55 (0.032)
1st Order In-Run Data Shapley (ours) | 0.678 (0.045)
2nd Order In-Run Data Shapley (ours) | 0.680 (0.048)
Trak (1 cpt) (Park et al., 2023) | 0.511 (0.012)
Trak (5 cpts) (Park et al., 2023) | 0.542 (0.025)
Trak (15 cpts) (Park et al., 2023) | 0.609 (0.028)
Trak (25 cpts) (Park et al., 2023) | 0.617 (0.021)

Table 4: AUROC scores for the mislabeled data detection task with various data attribution techniques on the CIFAR10 dataset; higher is better. The results are across three different training runs (the randomness comes from the construction of the corrupted datasets), with standard deviations shown in parentheses. Methods with top-3 performance are bolded.

Method | 20% | 40% | 60% | 80%
Random | 0.350 (0.010) | 0.461 (0.010) | 0.525 (0.004) | 0.559 (0.003)
Influence function (Koh & Liang, 2017) | 0.320 (0.033) | 0.450 (0.028) | 0.530 (0.015) | 0.580 (0.004)
Trak (1 cpt) (Park et al., 2023) | 0.329 (0.021) | 0.443 (0.030) | 0.517 (0.016) | 0.572 (0.009)
Trak (5 cpts) (Park et al., 2023) | 0.325 (0.022) | 0.445 (0.031) | 0.519 (0.018) | 0.572 (0.009)
Trak (15 cpts) (Park et al., 2023) | 0.327 (0.028) | 0.446 (0.035) | 0.523 (0.018) | 0.573 (0.010)
Trak (25 cpts) (Park et al., 2023) | 0.332 (0.025) | 0.443 (0.035) | 0.520 (0.018) | 0.578 (0.015)
LESS (Xia et al., 2024) | 0.314 (0.012) | 0.460 (0.015) | 0.541 (0.006) | 0.582 (0.006)
Data Shapley (Ghorbani & Zou, 2019) | 0.317 (0.047) | 0.468 (0.010) | 0.527 (0.004) | 0.570 (0.008)
Datamodels (Ilyas et al., 2022) | 0.342 (0.004) | 0.465 (0.004) | 0.534 (0.010) | 0.559 (0.005)
Empirical Influence Function (Feldman & Zhang, 2020) | 0.342 (0.004) | 0.466 (0.016) | 0.530 (0.009) | 0.568 (0.010)
KNN-Shapley (Jia et al., 2019a) | 0.354 (0.017) | 0.478 (0.007) | 0.525 (0.015) | 0.563 (0.005)
1st Order In-Run Data Shapley | 0.344 (0.009) | 0.472 (0.003) | 0.541 (0.006) | 0.580 (0.004)
2nd Order In-Run Data Shapley | 0.342 (0.010) | 0.473 (0.008) | 0.544 (0.005) | 0.580 (0.004)

Table 5: Test accuracies when training ResNet18 on high-value data points selected by various data attribution techniques.
To enable comparison with techniques that require model retraining, for each training run we randomly sample a size-1,000 subset of the CIFAR10 dataset (with 10% of the data points mislabeled). The results are across three different training runs (the randomness comes from the construction of the corrupted datasets), with standard deviations shown in parentheses. Methods with top-3 performance are bolded.