Paper deep dive
Causal Distillation for Language Models
Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman
Models: BERT-base (12-layer), DistilBERT (3-layer), DistilBERT (6-layer)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/12/2026, 6:49:57 PM
Summary
The paper introduces causal distillation, which augments standard language model distillation with a distillation interchange intervention training (DIITO) objective. This objective pushes the student model to imitate the causal dynamics of the teacher model's computation, yielding lower masked-language-modeling perplexity on WikiText and improved performance on GLUE, SQuAD v1.1, and CoNLL-2003 while retaining the efficiency of a smaller model.
Entities (5)
Relation Signals (4)
DIITO → improves performance on → GLUE
confidence 95% · DIITO results in... marked improvements on the GLUE benchmark
DIITO → improves performance on → SQuAD
confidence 95% · DIITO results in... marked improvements on... SQuAD
DIITO → improves performance on → CoNLL-2003
confidence 95% · DIITO results in... marked improvements on... CoNLL-2003
DIITO → applied to → BERT
confidence 90% · We begin with BERT BASE... and distill it under various alignments
Cypher Suggestions (2)
Find all benchmarks improved by the DIITO method · confidence 95% · unvalidated
MATCH (m:Method {name: 'DIITO'})-[:IMPROVES_PERFORMANCE_ON]->(b:Benchmark) RETURN b.name
List all models that have been distilled using DIITO · confidence 90% · unvalidated
MATCH (m:ModelArchitecture)<-[:APPLIED_TO]-(d:Method {name: 'DIITO'}) RETURN m.name
Abstract
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a causal abstraction of the teacher model: a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation with the same setting, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
Full Text
29,975 characters extracted from source content.
Causal Distillation for Language Models
Zhengxuan Wu*¶, Atticus Geiger*¶, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman
Stanford University
{wuzhengx, atticusg}@stanford.edu

Abstract

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a causal abstraction of the teacher model – a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation with the same setting, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).¹

* Equal contribution. ¶ Corresponding authors.
¹ We release our code at https://github.com/frankaging/Causal-Distill

1 Introduction

Large pretrained language models have improved performance across a wide range of NLP tasks, but can be costly due to their large size. Distillation seeks to reduce these costs while maintaining performance by training a simpler student model from a larger teacher model (Hinton et al., 2015; Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2019).

Hinton et al. (2015) propose model distillation with an objective that encourages the student to produce output logits similar to those of the teacher while also supervising with a task-specific objective (e.g., sequence classification). Sanh et al. (2019), Sun et al. (2019), and Jiao et al. (2019) adapt this method, strengthening it with additional supervision to align internal representations between the two models. However, these approaches may push the student model to match all aspects of the internal states of the teacher model irrespective of their causal role in the network's computation. This motivates us to develop a method that focuses on aligning the causal role of representations in the student and teacher models.

We propose augmenting standard distillation with a new objective that pushes the student to become a causal abstraction (Beckers and Halpern, 2019; Beckers et al., 2020; Geiger et al., 2021a) of the teacher model: the simpler student will faithfully model the causal effect of teacher representations on output. To achieve this, we employ the interchange intervention training (IIT) method of Geiger et al. (2021b). The distillation interchange intervention training objective (DIITO) aligns a high-level student model with a low-level teacher model and performs interchange interventions (swapping of aligned internal states); during training the high-level model is pushed to conform to the causal dynamics of the low-level model. Figure 1 shows a schematic example of this process. Here, hidden layer 2 of the student model (bottom) is aligned with layers 3 and 4 of the teacher model.
The figure depicts a single interchange intervention replacing aligned states in the left-hand models with those from the right-hand models. This results in a new network evolution that is shaped both by the original input and the interchanged hidden states. It can be interpreted as a certain kind of counterfactual, as shown in Figure 1: what would the output be for the sentence "I ate some ⟨MASK⟩." if the activation values for the second token at the middle two layers were set to the values they have for the input "The water ⟨MASK⟩ solid."?

Figure 1: An IIT update in the context of masked language modeling (MLM). The teacher network (top) has 6 layers and the student (bottom) has 3 layers, and we align layer 2 in the student with layers 3–4 in the teacher. Solid lines are feed-forward connections, red lines show the flow of backpropagation, and dashed lines indicate interchange interventions. In this case, the student originally predicted the token "salad" under the interchange intervention, while the teacher predicted the token "pizza" under an aligned interchange intervention. DIITO trains the student to minimize the divergence between the student logits and the teacher logits under the interchange intervention. This updates the student to conform to the causal dynamics of the teacher.

DIITO then pushes the student model to output the same logits as the teacher, i.e., matching the teacher's output distribution under the counterfactual setup.
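To make the Figure 1 counterfactual concrete, here is a hedged sketch of a single interchange intervention on an off-the-shelf masked language model using a forward hook. It simplifies the figure to one model and one intervened layer; the checkpoint name, layer index, and token position are illustrative assumptions, not the paper's training code.

```python
# A hedged, single-model illustration of the Figure 1 counterfactual:
# run the base sentence, but overwrite one layer's hidden state at one token
# position with the value it takes on the source sentence, then read out the
# prediction for the masked token. Checkpoint, layer, and position are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

base = tok("I ate some [MASK].", return_tensors="pt")
source = tok("The water [MASK] solid.", return_tensors="pt")
layer, position = 3, 2  # intervene on encoder layer 3's output at token index 2

# GETVALS: record the source run's hidden states at the chosen layer.
with torch.no_grad():
    src_hidden = model(**source, output_hidden_states=True).hidden_states[layer]

# SETVALS: a hook that swaps in the source activation during the base run.
def swap(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, position] = src_hidden[:, position]
    return (hidden,) + output[1:]

handle = model.bert.encoder.layer[layer - 1].register_forward_hook(swap)
with torch.no_grad():
    logits = model(**base).logits  # base input evaluated under the intervention
handle.remove()

mask_idx = (base["input_ids"][0] == tok.mask_token_id).nonzero().item()
print(tok.decode([logits[0, mask_idx].argmax().item()]))  # prediction under intervention
```

During DIITO training, both the student and the teacher undergo such aligned interventions, and the student is updated so that its intervened output distribution matches the teacher's.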
To assess the contribution of distillation with DIITO, we begin with BERT_BASE (Devlin et al., 2019) and distill it under various alignments between student and teacher while pretraining on the WikiText-103M corpus (Merity et al., 2016), achieving −2.24 perplexity on the MLM task compared to standard DistilBERT trained on the same data. We then fine-tune the best performing distilled models and find consistent performance improvements compared to standard DistilBERT trained with the same setting on the GLUE benchmark (+1.77%), CoNLL-2003 named entity recognition (+0.38% on F1 score), and SQuAD v1.1 (+2.46% on EM score).

2 Related Work

Distillation was first introduced in the context of computer vision (Hinton et al., 2015) and has since been widely explored for language models (Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2019). For example, Sanh et al. (2019) propose to extract information not only from the output probabilities of the last layer in the teacher model, but also from intermediate layers in the fine-tuning stage. Recently, Rotman et al. (2021) adapt causal analysis methods that estimate the effects of inputs on predictions in order to compress models for better domain adaptation. In contrast, we focus on imbuing the student with the causal structure of the teacher.

Interventions on neural networks were originally used as a structural analysis method aimed at illuminating neural representations and their role in network behavior (Feder et al., 2021; Pryzant et al., 2021; Vig et al., 2020; Elazar et al., 2021; Giulianelli et al., 2020; Geiger et al., 2020, 2021a). Geiger et al. (2021b) extend these methods to network optimization. We contribute to this existing research by adapting intervention-based optimization to the task of language model distillation.

3 Causal Distillation

Here, we define our distillation training procedure. See Algorithm 1 for a summary.

Algorithm 1: Causal Distillation via Interchange Intervention Training
Require: student model S, teacher model T, student output neurons N^y_S, alignment Π, shuffled training dataset D.
1:  S.train()
2:  T.eval()
3:  D′ = random.shuffle(D)
4:  N^y_T = Π(N^y_S)
5:  while not converged do
6:    for (x₁, y₁), (x₂, y₂) in iter(D, D′) do
7:      N_S = sample_student_neurons()
8:      N_T = Π(N_S)
9:      with no_grad:
10:       T_a = SETVALS(T, N_T, GETVALS(T, x₁, N_T))
11:       o_T = GETVALS(T_a, x₂, N^y_T)
12:     S_a = SETVALS(S, N_S, GETVALS(S, x₁, N_S))
13:     o_S = GETVALS(S_a, x₂, N^y_S)
14:     L_DIITO = get_loss(o_T, o_S)
15:     Calculate L_MLM, L_CE, L_Cos
16:     L = L_MLM + L_CE + L_Cos + L_DIITO
17:     L.backward()
18:     Step optimizer
19:   end for
20: end while

GETVALS. The GETVALS operator is an activation-value retriever for a neural model. Given a neural model M containing a set of neurons N (an internal representation) and an appropriate input x, GETVALS(M, x, N) is the set of values that N takes on when processing x. In the case that N represents the neurons corresponding to the final output, GETVALS(M, x, N) is the output of model M when processing x (i.e., the output of a standard forward call of a neural model).

SETVALS. The SETVALS operator is a function generator that defines a new neural model with a computation graph that specifies an intervention on the original model M (Pearl, 2009; Spirtes et al., 2001). SETVALS(M, N, v) is the new neural model where the neurons N are set to constant values v. Because we overwrite the neurons with v in place, gradients can back-propagate through v.

Interchange Intervention. An interchange intervention combines GETVALS and SETVALS operations. First, we randomly sample a pair of examples from a training dataset, (x₁, y₁), (x₂, y₂) ∈ D. Next, where N is the set of neurons that we are targeting for intervention, we define M^{x₁}_N to abbreviate the new neural model as follows:

  M^{x₁}_N = SETVALS(M, N, GETVALS(M, x₁, N))    (1)

This is the version of M obtained by setting the values of N to be those we get from processing input x₁. The interchange intervention targeting N with x₁ as the source input and x₂ as the base input is then defined as follows:

  INTINV(M, N, x₁, x₂) := GETVALS(M^{x₁}_N, x₂, N^y)    (2)

where N^y are the output neurons. In other words, INTINV(M, N, x₁, x₂) is the output state we get from M for input x₂ but with the neurons N set to the values obtained when processing input x₁.

DIITO. DIITO employs T as the teacher model, S as the student model, D as the training inputs to both models, and Π as an alignment that maps sets of student neurons to sets of teacher neurons. For each set of student neurons N_S in the domain of Π, we define the DIITO loss as:

  L^CE_DIITO = ∑_{x₁, x₂ ∈ D} CE_S( INTINV(S, N_S, x₁, x₂), INTINV(T, Π(N_S), x₁, x₂) )    (3)

where CE_S is the smoothed cross-entropy loss measuring the divergence of predictions, under interchange, between the teacher and the student model.

Distillation Objectives. We adopt the standard distillation objectives from DistilBERT (Sanh et al., 2019) (defined formally in Appendix A.1): L_MLM for the task-specific loss of the student model, L_CE for the loss measuring the divergence between the student and teacher outputs on masked tokens, and L_Cos for the loss measuring the divergence between the student and teacher contextualized representations of masked tokens in the last layer. Our final training objective for the student is a linear combination of the four training objectives reviewed above: L_MLM, L_CE, L_Cos, and L^CE_DIITO.
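The following is a minimal, self-contained sketch of GETVALS, SETVALS (as an in-place activation swap inside the forward pass), INTINV, and the DIITO loss on a toy layered model. The toy architecture, layer choices, and the KL divergence standing in for the paper's smoothed cross-entropy CE_S are assumptions for illustration, not the released implementation.

```python
# Toy illustration of the Section 3 operators; not the authors' BERT-scale code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Stack of linear layers standing in for transformer blocks."""
    def __init__(self, n_layers, dim=16, n_out=8):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_out)

    def forward(self, x, swap_layer=None, swap_values=None):
        # SETVALS: overwrite the chosen layer's activations with swap_values;
        # gradients can still flow through swap_values.
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if i == swap_layer:
                x = swap_values
        return self.head(x)  # GETVALS on the output neurons N^y

def getvals(model, x, layer):
    """GETVALS: activations of `layer` when the model processes x."""
    h = x
    for i, lyr in enumerate(model.layers):
        h = torch.relu(lyr(h))
        if i == layer:
            return h

def intinv(model, layer, x_source, x_base):
    """INTINV: output on x_base with `layer` activations taken from x_source."""
    return model(x_base, swap_layer=layer,
                 swap_values=getvals(model, x_source, layer))

# One DIITO update for an aligned pair: student layer 1 <-> teacher layer 3.
teacher, student = ToyModel(n_layers=6), ToyModel(n_layers=3)
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)      # source and base inputs
with torch.no_grad():                                 # the teacher is frozen
    o_t = intinv(teacher, 3, x1, x2)
o_s = intinv(student, 1, x1, x2)
loss_diito = F.kl_div(F.log_softmax(o_s, dim=-1),     # stand-in for CE_S
                      F.softmax(o_t, dim=-1), reduction="batchmean")
loss_diito.backward()   # combined with L_MLM, L_CE, L_Cos in the full objective
```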
In a further experiment, we introduce a fifth objective, L^Cos_DIITO, which is identical to L_Cos except that the teacher and student are undergoing interchange interventions (see Appendix A.2 for details).

4 Experimental Set-up

We adapt the open-source Hugging Face implementation for model distillation (Wolf et al., 2020).²

² https://github.com/huggingface/transformers

| Model | Layers | Pretraining tokens | WikiText perplexity | GLUE score | CoNLL-2003 acc | CoNLL-2003 F1 | SQuAD v1.1 EM | SQuAD v1.1 F1 |
| BERT_BASE (Devlin et al., 2019) (Wikipedia+BookCorpus) | 12 | 3.3B | 10.27 (–)† | 82.75 (–) | 96.40 (–) | 92.40 (–) | 80.80 (–) | 88.50 (–) |
| DistilBERT (Sanh et al., 2019) (Wikipedia+BookCorpus) | 6 | 3.3B | 17.48 (–)† | 79.59 (–) | 98.39 (–)† | 93.10 (–)† | 77.70 (–) | 85.80 (–) |
| DistilBERT (WikiText) | 3 | 0.1B | 29.51 (0.32) | 67.42 (1.10) | 97.88 (0.04) | 88.89 (0.29) | 26.04 (0.93) | 68.38 (0.77) |
| DIITO_MIDDLE (WikiText) | 3 | 0.1B | 26.04 (0.93) | 69.30 (1.08) | 98.03 (0.04) | 89.69 (0.18) | 58.74 (0.69) | 70.23 (0.57) |
| DIITO_LATE (WikiText) | 3 | 0.1B | 25.97 (0.63) | 69.01 (1.69) | 98.03 (0.03) | 89.82 (0.18) | 58.75 (0.49) | 70.21 (0.41) |
| DIITO_FULL (WikiText) | 3 | 0.1B | 24.85 (0.58) | 69.36 (0.87) | 98.02 (0.03) | 89.67 (0.16) | 58.72 (0.67) | 70.50 (0.56) |
| DistilBERT (WikiText) | 6 | 0.1B | 15.69 (1.51) | 75.80 (0.42) | 98.48 (0.03) | 92.12 (0.23) | 70.23 (0.75) | 79.99 (0.55) |
| DIITO_MIDDLE (WikiText) | 6 | 0.1B | 14.32 (0.12) | 76.71 (0.47) | 98.56 (0.04) | 92.47 (0.19) | 71.93 (0.31) | 81.32 (0.23) |
| DIITO_LATE (WikiText) | 6 | 0.1B | 14.93 (0.23) | 76.80 (0.34) | 98.51 (0.02) | 92.36 (0.27) | 71.47 (0.28) | 81.01 (0.23) |
| DIITO_FULL (WikiText) | 6 | 0.1B | 13.59 (0.25) | 76.67 (0.21) | 98.53 (0.04) | 92.35 (0.24) | 71.96 (0.29) | 81.33 (0.25) |
| DIITO_FULL + Random (WikiText) | 6 | 0.1B | 13.95 (0.18) | 76.84 (0.29) | 98.54 (0.03) | 92.41 (0.24) | 71.90 (0.54) | 81.27 (0.39) |
| DIITO_FULL + Masked (WikiText) | 6 | 0.1B | 13.99 (0.16) | 76.80 (0.32) | 98.55 (0.03) | 92.45 (0.18) | 71.77 (0.59) | 81.09 (0.42) |
| DIITO_FULL + L^Cos_DIITO (WikiText) | 6 | 0.1B | 13.45 (0.19) | 77.14 (0.37) | 98.54 (0.04) | 92.35 (0.24) | 71.94 (0.31) | 81.35 (0.23) |

Table 1: Performance on the development sets of WikiText, the GLUE benchmark, the CoNLL-2003 corpus for the named entity recognition task, and SQuAD v1.1 for the question answering task. Each score is the averaged performance with standard deviation (SD) across 15 distinct runs. † Numbers are imputed from released models on Hugging Face (Wolf et al., 2020).

We distill our models on the MLM pretraining task (Devlin et al., 2019). We use large gradient accumulations over batches, as in Sanh et al. (2019), for better performance. Specifically, we distill all models for three epochs with an effective batch size of 240. In contrast to the setting of 4K per batch in BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019), we found that a small effective batch size works better for a smaller dataset. We weight all objectives equally in all experiments. With our new objectives, the distillation takes approximately 9 hours on 4 NVIDIA A100 GPUs.

Student and Teacher Models. Our two students have the standard BERT architecture, with 12 heads and a hidden dimension of 768. The larger student has 6 layers, the smaller 3 layers. Our pretrained teacher has the same architecture, except with 12 layers. Following practices introduced by Sanh et al. (2019), we initialize our student models with weights from skipped layers (one out of four layers) in the teacher model. We use WikiText for distillation to simulate a practical situation with a limited computation budget. We leave the exploration of our method on larger datasets for future research.
Alignment. Our teacher and student BERT models create columns of neural representations above each token, with each row created by the feed-forward layer of a Transformer block, as in Figure 1. We define L_T and L_S to be the number of layers in the teacher and the student, respectively. In addition, we define S^j_i and T^j_i to be the representations in the i-th row and j-th column of the student and teacher, respectively. An alignment Π is a partial function from student representations to sets of teacher representations. We test three alignments:

FULL: Π is defined on all student representations:
  Π(S^j_i) = { T^j_{(L_T/L_S)·i + k} : 0 ≤ k < L_T/L_S }

MIDDLE: Π is defined for the row L_S/2:
  Π(S^j_{L_S/2}) = T^j_{L_T/2}

LATE: Π is defined on the student representations in the first and second rows:
  Π(S^j_1) = T^j_{L_T−2} and Π(S^j_2) = T^j_{L_T−1}

For each training iteration, we randomly select one aligned student layer to perform the interchange intervention, and we randomly select 30% of token embeddings for alignment in each sequence. We experiment with three conditions with the FULL alignment: consecutive tokens (DIITO_FULL), random tokens (DIITO_FULL + Random), and masked tokens (DIITO_FULL + Masked). We also add L^Cos_DIITO to the FULL alignment (DIITO_FULL + L^Cos_DIITO).
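As an illustration of the three alignments and the per-iteration sampling just described, here is a hedged Python sketch that maps student layer indices to teacher layer indices (0-based here, whereas the paper counts rows from 1); the helper names and sampling details are assumptions rather than the released code.

```python
# Hedged sketch of the FULL / MIDDLE / LATE alignments and per-iteration sampling.
import random

def full_alignment(n_student, n_teacher):
    """FULL: each student layer covers a contiguous block of teacher layers."""
    ratio = n_teacher // n_student
    return {i: list(range(i * ratio, (i + 1) * ratio)) for i in range(n_student)}

def middle_alignment(n_student, n_teacher):
    """MIDDLE: only the middle student layer, aligned to the middle teacher layer."""
    return {n_student // 2: [n_teacher // 2]}

def late_alignment(n_student, n_teacher):
    """LATE: the paper's first and second student rows, aligned to late teacher layers."""
    return {0: [n_teacher - 2], 1: [n_teacher - 1]}

def sample_intervention_site(alignment, seq_len, frac=0.3):
    """Per iteration: one aligned student layer and ~30% of token positions."""
    student_layer = random.choice(list(alignment))
    tokens = sorted(random.sample(range(seq_len), max(1, int(frac * seq_len))))
    return student_layer, alignment[student_layer], tokens

# Example: the 6-layer student distilled from the 12-layer teacher.
align = full_alignment(6, 12)                    # {0: [0, 1], 1: [2, 3], ...}
print(sample_intervention_site(align, seq_len=128))
```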
5 Results

Language Modeling. We first evaluate our models using perplexity on the held-out evaluation data from WikiText. As shown in Table 1, DIITO brings performance gains for all alignments. Our best result is from the FULL alignment with L^Cos_DIITO (DIITO_FULL + L^Cos_DIITO), which has −2.24 perplexity compared to standard DistilBERT trained with the same amount of data.

Figure 2: Perplexity score distribution on the development set of WikiText for models trained in a low-resource setting. The best model is the one with the richest alignment structure.

GLUE. The GLUE benchmark (Wang et al., 2018) covers different natural language understanding tasks. In Table 1, we report averaged GLUE scores on the development sets obtained by fine-tuning our distilled models. Individual performance scores for each GLUE task are included in Table 2 in the Appendix. The results suggest that distilled models with DIITO lead to consistent improvements over standard DistilBERT trained under the same setting, with our best result (DIITO_FULL + L^Cos_DIITO) being +1.77% higher.

Named Entity Recognition. We also evaluate our models on the CoNLL-2003 Named Entity Recognition task (Tjong Kim Sang and De Meulder, 2003). We report accuracy and macro-F1 scores on the development sets. We fine-tune our models for three epochs. Our best performing model (DIITO_MIDDLE) numerically surpasses not only standard DistilBERT (+0.38% on F1 score) trained under the same setting, but also its teacher, BERT_BASE (+0.05% on F1 score). Though these improvements are small, in this case distillation produces a smaller model with better performance.

Question Answering. Finally, we evaluate on a question answering task, SQuAD v1.1 (Rajpurkar et al., 2016). We report Exact Match (EM) and macro-F1 on the development sets as our evaluation metrics. We fine-tune our models for two epochs. DIITO again yields marked improvements (Table 1). Our best result is from the vanilla FULL alignment (DIITO_FULL), with +2.46% on EM over standard DistilBERT trained under the same setting.

Low-Resource Model Distillation. We experiment with an extreme case in a low-resource setting where we only distill with 15% of WikiText, keeping other experimental details constant. Our results suggest that DIITO training is also beneficial in extremely low-resource settings (Figure 2).

Layer-wise Ablation. We further study the effect of DIITO training with respect to the size of the student model through a layer-wise ablation experiment. As shown in Figure 3, we compare GLUE performance for models trained with the standard distillation pipeline and with DIITO training (DIITO_FULL). Specifically, we compute the averaged GLUE scores following the same procedure described in Section A.4. Our results suggest that DIITO training brings consistent improvements on GLUE tasks, with smaller models booking the greatest gains.

Figure 3: GLUE score distribution across 15 distinct runs for students of different sizes. Following the evaluation for BERT (Devlin et al., 2019), we exclude WNLI.

6 Conclusion

In this paper, we explored distilling a teacher by training a student to capture the causal dynamics of its computations. Across a wide range of NLP tasks, we find that DIITO leads to improvements, with the largest gains coming from the models that use the richest alignment between student and teacher. Our results also demonstrate that DIITO performs on par with standard DistilBERT (Sanh et al., 2019), maintaining 97% of its performance on GLUE tasks while consuming 97% less training data. These findings suggest that DIITO is a promising tool for effective model distillation.

References

Sander Beckers, Frederick Eberhardt, and Joseph Y. Halpern. 2020. Approximate causal abstractions. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 606–615, Tel Aviv, Israel. PMLR.

Sander Beckers and Joseph Y. Halpern. 2019. Abstracting causal models. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):2678–2685.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021a. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. Curran Associates, Inc.

Atticus Geiger, Kyle Richardson, and Christopher Potts. 2020. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163–173, Online. Association for Computational Linguistics.

Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, and Christopher Potts. 2021b. Inducing causal structure for interpretable neural networks. arXiv:2112.00826.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online. Association for Computational Linguistics.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv:1909.10351.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv:1609.07843.

Judea Pearl. 2009. Causality: Models, Reasoning and Inference, 2nd edition. Cambridge University Press, USA.

Reid Pryzant, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar. 2021. Causal effects of linguistic properties. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4095–4109, Online. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Guy Rotman, Amir Feder, and Roi Reichart. 2021. Model compression for domain adaptation through causal effect estimation. arXiv:2101.07086.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.

Peter Spirtes, Clark N. Glymour, and Richard Scheines. 2001. Causation, Prediction, and Search, 2nd edition. MIT Press.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332, Hong Kong, China. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv:2004.12265.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

A Appendix

A.1 Standard Distillation Objectives

In our setting, our teacher model T is a BERT model, and our student model S is a shallower BERT model with fewer layers.

Assume that we randomly draw a training example (x₁, y₁) ∈ D, where x₁ is the input to our models and y₁ is the corresponding ground truth (the token prediction at each masked position). We denote the model predictions (output logits) as T(x₁) and S(x₁). Additionally, we denote the contextualized representations of the tokens of x₁ at the last layer as BERT_T(x₁) and BERT_S(x₁).

We adopt the three standard distillation objectives of Sanh et al. (2019):

L_MLM: the masked language modeling loss of the student model, calculated over all examples using the cross-entropy loss:

  L_MLM = ∑_{x₁, y₁ ∈ D} CE(S(x₁), y₁)    (4)

L_CE: following Hinton et al. (2015), the smoothed cross-entropy loss measuring the divergence between the student and teacher outputs:

  L_CE = ∑_{x₁ ∈ D} CE_S(S(x₁), T(x₁))    (5)

L_Cos: the cosine embedding loss defined in terms of the final hidden states of the teacher and the student:

  L_Cos = ∑_{x₁ ∈ D} COS(BERT_S(x₁), BERT_T(x₁))    (6)

As a result, compared to standard DistilBERT, DIITO essentially adds a new type of objective by pushing the student model to become a causal abstraction of the teacher model.
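A small PyTorch sketch of the three objectives above may help; the tensor shapes, temperature value, and function names are assumptions for illustration, and only the masked positions are assumed to be passed in.

```python
# Hedged sketch of L_MLM, L_CE, and L_Cos; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F

def standard_distillation_losses(student_logits, teacher_logits,
                                 student_hidden, teacher_hidden,
                                 labels, temperature=2.0):
    """logits: (n_masked, vocab); hidden: (n_masked, dim); labels: (n_masked,)."""
    # L_MLM: cross-entropy of the student against the gold masked tokens (Eqn. 4).
    l_mlm = F.cross_entropy(student_logits, labels)

    # L_CE: temperature-smoothed divergence between the student's and teacher's
    # output distributions (Eqn. 5; Hinton et al., 2015).
    t = temperature
    l_ce = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t ** 2)

    # L_Cos: cosine embedding loss aligning last-layer hidden states (Eqn. 6).
    target = torch.ones(student_hidden.size(0))
    l_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    # The paper weights all objectives equally; L^CE_DIITO (and optionally
    # L^Cos_DIITO) is added on top of this sum during causal distillation.
    return l_mlm + l_ce + l_cos
```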
A.2 Causal Distillation Objectives

In addition to our causal loss L^CE_DIITO, we also propose a new loss L^Cos_DIITO, which is identical to L_Cos but with interchange interventions. In this section, we provide a formal definition of L^Cos_DIITO.

We denote our teacher and student models as T and S, respectively. Using the notational conventions from Section 3, we use N^y_T and N^y_S to represent the neurons corresponding to the final output of each model. Likewise, we use N^{L_T}_T and N^{L_S}_S to represent the neurons of the contextualized representation of each token after the final BERT layer.

Assuming we randomly sample a pair of examples from a training dataset, (x₁, y₁), (x₂, y₂) ∈ D, we can then rewrite our causal loss L^CE_DIITO by rearranging Eqn. 2 and Eqn. 3 as follows:

  L^CE_DIITO = ∑_{x₁, x₂ ∈ D} CE_S( GETVALS(M^{x₁}_S, x₂, N^y_S), GETVALS(M^{x₁}_T, x₂, N^y_T) )    (7)

where M^{x₁}_S and M^{x₁}_T are derived as in Eqn. 1 for each model, respectively. Crucially, Eqn. 7 can be regarded as the causal form of the standard smoothed cross-entropy loss with interchange intervention. Likewise, we can further define L^Cos_DIITO as:

  L^Cos_DIITO = ∑_{x₁, x₂ ∈ D} COS( GETVALS(M^{x₁}_S, x₂, N^{L_S}_S), GETVALS(M^{x₁}_T, x₂, N^{L_T}_T) )    (8)

with adjusted interchange alignments for N^{L_T}_T and N^{L_S}_S.

A.3 Evaluation Set-up

GLUE. We fine-tune for 25 epochs for the smaller datasets (RTE and CoLA) and 3 epochs for the others. Following Devlin et al. (2019) and Sanh et al. (2019), we use Matthews correlation for CoLA, F1 for MRPC and QQP, Spearman correlation for STS-B, and accuracy for all the other tasks in GLUE.

A.4 Reproducibility

To foster reproducibility and provide a fair comparison between methods, we distill BERT for each condition with three distinct random seeds. We then fine-tune each model with five distinct random seeds. Consequently, we report results aggregated from three distinct runs for the language modeling task, and from 15 distinct runs for the others.
| Model | Layers | Pretraining tokens | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B |
| BERT_BASE (Devlin et al., 2019) (Wikipedia+BookCorpus) | 12 | 3.3B | 56.30 | 84.70 | 88.60 | 91.80 | 89.60 | 69.30 | 92.70 | 89.00 |
| DistilBERT (Sanh et al., 2019) (Wikipedia+BookCorpus) | 6 | 3.3B | 51.30 | 82.10 | 87.50 | 89.20 | 88.50 | 59.90 | 91.30 | 86.90 |
| DistilBERT (WikiText) | 3 | 0.1B | 22.78 | 71.55 | 82.51 | 82.12 | 82.16 | 55.43 | 86.47 | 56.33 |
| DIITO_MIDDLE (WikiText) | 3 | 0.1B | 23.21 | 72.97 | 82.81 | 83.15 | 82.83 | 55.98 | 86.52 | 66.93 |
| DIITO_LATE (WikiText) | 3 | 0.1B | 24.12 | 72.80 | 82.16 | 82.88 | 82.85 | 57.29 | 87.31 | 62.65 |
| DIITO_FULL (WikiText) | 3 | 0.1B | 25.01 | 72.85 | 82.71 | 83.05 | 82.85 | 55.37 | 86.92 | 66.15 |
| DistilBERT (WikiText) | 6 | 0.1B | 40.43 | 78.95 | 87.45 | 84.76 | 84.96 | 60.10 | 89.38 | 80.40 |
| DIITO_MIDDLE (WikiText) | 6 | 0.1B | 43.97 | 79.47 | 87.57 | 85.45 | 85.21 | 60.72 | 89.97 | 81.33 |
| DIITO_LATE (WikiText) | 6 | 0.1B | 43.93 | 79.49 | 87.70 | 85.79 | 85.22 | 60.14 | 90.31 | 81.79 |
| DIITO_FULL (WikiText) | 6 | 0.1B | 43.43 | 79.66 | 88.17 | 85.57 | 85.28 | 59.95 | 90.01 | 81.26 |
| DIITO_FULL + Random (WikiText) | 6 | 0.1B | 44.27 | 79.70 | 88.06 | 85.63 | 85.34 | 60.89 | 89.76 | 81.08 |
| DIITO_FULL + Masked (WikiText) | 6 | 0.1B | 43.39 | 79.63 | 87.88 | 85.61 | 85.30 | 61.06 | 89.97 | 81.58 |
| DIITO_FULL + L^Cos_DIITO (WikiText) | 6 | 0.1B | 45.17 | 79.68 | 88.18 | 85.83 | 85.31 | 60.94 | 90.32 | 81.69 |

Table 2: Model performance results on the development sets of the GLUE benchmark. Each score is the averaged performance across 15 distinct runs, with precision aligned for a fair comparison. Following the evaluation for BERT (Devlin et al., 2019), we exclude WNLI.

Named Entity Recognition. We follow the experimental set-up in the Hugging Face repository (Wolf et al., 2020) for evaluation on the CoNLL-2003 Named Entity Recognition task (Tjong Kim Sang and De Meulder, 2003). For fine-tuning, we set the learning rate to 5e−5 with an effective batch size of 32 for three epochs.³

Question Answering. We use the experimental set-up of Sanh et al. (2019) for evaluation on SQuAD v1.1 (Rajpurkar et al., 2016). For fine-tuning, we set the learning rate to 3e−5 with an effective batch size of 48 for two epochs. We set the stride to 128.

³ For the DistilBERT performance on CoNLL-2003 in Table 1, we evaluate with a publicly available model downloaded from https://huggingface.co/delpart/distilbert-base-uncased-finetuned-ner.
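For orientation, the fine-tuning hyperparameters above can be expressed as Hugging Face TrainingArguments; the output directories and the task-specific Trainer wiring (model, datasets, metrics, and the SQuAD doc stride of 128, which is configured in the tokenizer) are assumptions and omitted here.

```python
# Hedged sketch of the reported fine-tuning hyperparameters; not the paper's scripts.
from transformers import TrainingArguments

conll_args = TrainingArguments(
    output_dir="finetune-conll2003",   # hypothetical path
    learning_rate=5e-5,                # CoNLL-2003: lr 5e-5
    per_device_train_batch_size=32,    # effective batch size 32 (single device assumed)
    num_train_epochs=3,                # three epochs
)

squad_args = TrainingArguments(
    output_dir="finetune-squad-v1.1",  # hypothetical path
    learning_rate=3e-5,                # SQuAD v1.1: lr 3e-5
    per_device_train_batch_size=48,    # effective batch size 48 (single device assumed)
    num_train_epochs=2,                # two epochs
)
```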