Paper deep dive
A Primer in BERTology: What We Know About How BERT Works
Anna Rogers, Olga Kovaleva, Anna Rumshisky
Models: BERT-base, BERT-large, Various BERT variants (DistilBERT, RoBERTa, ALBERT, etc.)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 8:09:10 PM
Summary
This paper provides a comprehensive survey of over 150 studies on the BERT model, analyzing its internal representations, linguistic and world knowledge, architectural modifications, and limitations. It highlights that while BERT captures significant syntactic and semantic information, its reasoning capabilities are limited, and its performance is sensitive to probing methodologies.
Entities (5)
Relation Signals (4)
BERT → performs task → MLM
confidence 100% · Pre-training uses two self-supervised tasks: masked language modeling (MLM)
BERT → uses architecture → Transformer
confidence 100% · Fundamentally, BERT is a stack of Transformer encoder layers
Probing → analyzes → BERT
confidence 95% · A number of studies have looked at the knowledge encoded in BERT weights... analysis of self-attention weights, and probing classifiers
BERT → encodes → Syntactic knowledge
confidence 85% · BERT representations are hierarchical rather than linear... there is something akin to syntactic tree structure
Cypher Suggestions (2)
Find all tasks performed by the BERT model · confidence 90% · unvalidated
MATCH (m:Model {name: 'BERT'})-[:PERFORMS_TASK]->(t:Task) RETURN t.name
Identify knowledge types encoded by BERT · confidence 85% · unvalidated
MATCH (m:Model {name: 'BERT'})-[:ENCODES]->(k:Knowledge) RETURN k.name
Abstract
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Tags
Links
- Source: https://arxiv.org/abs/2002.12327
- Canonical: https://arxiv.org/abs/2002.12327
Full Text
109,698 characters extracted from source content.
A Primer in BERTology: What We Know About How BERT Works

Anna Rogers, Center for Social Data Science, University of Copenhagen, arogers@sodas.ku.dk
Olga Kovaleva, Dept. of Computer Science, University of Massachusetts Lowell, okovalev@cs.uml.edu
Anna Rumshisky, Dept. of Computer Science, University of Massachusetts Lowell, arum@cs.uml.edu

Abstract: Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.

1 Introduction

Since their introduction in 2017, Transformers (Vaswani et al., 2017) have taken NLP by storm, offering enhanced parallelization and better modeling of long-range dependencies. The best known Transformer-based model is BERT (Devlin et al., 2019); it obtained state-of-the-art results in numerous benchmarks and is still a must-have baseline. While it is clear that BERT works remarkably well, it is less clear why, which limits further hypothesis-driven improvement of the architecture. Unlike CNNs, the Transformers have little cognitive motivation, and the size of these models limits our ability to experiment with pre-training and perform ablation studies. This explains a large number of studies over the past year that attempted to understand the reasons behind BERT's performance.

In this paper, we provide an overview of what has been learned to date, highlighting the questions which are still unresolved. We first consider the linguistic aspects, i.e., the current evidence regarding the types of linguistic and world knowledge learned by BERT, as well as where and how this knowledge may be stored in the model. We then turn to the technical aspects of the model and provide an overview of the current proposals to improve BERT's architecture, pre-training and fine-tuning. We conclude by discussing the issue of overparameterization, the approaches to compressing BERT, and the nascent area of pruning as a model analysis technique.

2 Overview of BERT architecture

Fundamentally, BERT is a stack of Transformer encoder layers (Vaswani et al., 2017) which consist of multiple self-attention "heads". For every input token in a sequence, each head computes key, value and query vectors, used to create a weighted representation. The outputs of all heads in the same layer are combined and run through a fully-connected layer. Each layer is wrapped with a skip connection and followed by layer normalization.

The conventional workflow for BERT consists of two stages: pre-training and fine-tuning. Pre-training uses two self-supervised tasks: masked language modeling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting if two input sentences are adjacent to each other). In fine-tuning for downstream applications, one or more fully-connected layers are typically added on top of the final encoder layer.

The input representations are computed as follows: each word in the input is first tokenized into wordpieces (Wu et al., 2016), and then three embedding layers (token, position, and segment) are combined to obtain a fixed-length vector.
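To make the input pipeline concrete, here is a minimal sketch (ours, not from the paper) of how the three embedding layers are summed in the HuggingFace `transformers` implementation of BERT; the example sentence is arbitrary.

```python
# Sketch: BERT input representation = token + position + segment embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("BERT is a stack of Transformer encoders.", return_tensors="pt")
input_ids = enc["input_ids"]          # wordpiece ids, with [CLS]/[SEP] added
segment_ids = enc["token_type_ids"]   # all zeros for a single segment
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
summed = (emb.word_embeddings(input_ids)
          + emb.position_embeddings(position_ids)
          + emb.token_type_embeddings(segment_ids))
# BERT additionally applies LayerNorm (and dropout) to the sum:
hidden = emb.LayerNorm(summed)
print(hidden.shape)  # (1, sequence_length, 768) for BERT-base
```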
The special token [CLS] is used for classification predictions, and [SEP] separates input segments. Google¹ and HuggingFace (Wolf et al., 2020) provide many variants of BERT, including the original "base" and "large" versions. They vary in the number of heads, layers, and hidden state size.

¹ https://github.com/google-research/bert

3 What knowledge does BERT have?

A number of studies have looked at the knowledge encoded in BERT weights. The popular approaches include fill-in-the-gap probes of MLM, analysis of self-attention weights, and probing classifiers with different BERT representations as inputs.

3.1 Syntactic knowledge

Lin et al. (2019) showed that BERT representations are hierarchical rather than linear, i.e. there is something akin to syntactic tree structure in addition to the word order information. Tenney et al. (2019b) and Liu et al. (2019a) also showed that BERT embeddings encode information about parts of speech, syntactic chunks and roles. Enough syntactic information seems to be captured in the token embeddings themselves to recover syntactic trees (Vilares et al., 2020; Kim et al., 2020; Rosa and Mareček, 2019), although probing classifiers could not recover the labels of distant parent nodes in the syntactic tree (Liu et al., 2019a). Warstadt and Bowman (2020) report evidence of hierarchical structure in three out of four probing tasks.

As far as how syntax is represented, it seems that syntactic structure is not directly encoded in self-attention weights. Htut et al. (2019) were unable to extract full parse trees from BERT heads even with the gold annotations for the root. Jawahar et al. (2019) include a brief illustration of a dependency tree extracted directly from self-attention weights, but provide no quantitative evaluation.

However, syntactic information can be recovered from BERT token representations. Hewitt and Manning (2019) were able to learn transformation matrices that successfully recovered syntactic dependencies in PennTreebank data from BERT's token embeddings (see also Manning et al., 2020). Jawahar et al. (2019) experimented with transformations of the [CLS] token using Tensor Product Decomposition Networks (McCoy et al., 2019a), concluding that dependency trees are the best match among 5 decomposition schemes (although the reported MSE differences are very small). Miaschi and Dell'Orletta (2020) perform a range of syntactic probing experiments with concatenated token representations as input. Note that all these approaches look for the evidence of gold-standard linguistic structures, and add some amount of extra knowledge to the probe.
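For intuition, the following is an illustrative sketch (ours) of a Hewitt-and-Manning-style structural probe: a learned linear map B such that squared distances between projected token vectors approximate gold syntactic tree distances. The iterable `training_sentences`, yielding per-sentence token embeddings and gold distance matrices, is hypothetical.

```python
# Sketch of a structural probe in the spirit of Hewitt and Manning (2019).
import torch

def probe_loss(B, embeddings, tree_dist):
    proj = embeddings @ B.T                       # (n_tokens, rank)
    diff = proj.unsqueeze(0) - proj.unsqueeze(1)  # pairwise differences
    pred = (diff ** 2).sum(-1)                    # predicted squared distances
    return (pred - tree_dist).abs().mean()        # L1 loss against tree distances

hidden_size, rank = 768, 64
B = torch.nn.Parameter(torch.randn(rank, hidden_size) * 0.01)
opt = torch.optim.Adam([B], lr=1e-3)

for embeddings, tree_dist in training_sentences:  # hypothetical data iterable
    opt.zero_grad()
    loss = probe_loss(B, embeddings, tree_dist)
    loss.backward()
    opt.step()
```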
[Figure 1: Parameter-free probe for syntactic knowledge: words sharing syntactic subtrees have larger impact on each other in the MLM prediction (Wu et al., 2020)]

Most recently, Wu et al. (2020) proposed a parameter-free approach based on measuring the impact that one word has on predicting another word within a sequence in the MLM task (Figure 1). They concluded that BERT "naturally" learns some syntactic information, although it is not very similar to linguistic annotated resources.

The fill-in-the-gap probes of MLM showed that BERT takes subject-predicate agreement into account when performing the cloze task (Goldberg, 2019; van Schijndel et al., 2019), even for meaningless sentences and sentences with distractor clauses between the subject and the verb (Goldberg, 2019). A study of negative polarity items (NPIs) by Warstadt et al. (2019) showed that BERT is better able to detect the presence of NPIs (e.g. "ever") and the words that allow their use (e.g. "whether") than scope violations.

The above claims of syntactic knowledge are belied by the evidence that BERT does not "understand" negation and is insensitive to malformed input. In particular, its predictions were not altered² even with shuffled word order, truncated sentences, removed subjects and objects (Ettinger, 2019). This could mean that either BERT's syntactic knowledge is incomplete, or it does not need to rely on it for solving its tasks. The latter seems more likely, since Glavaš and Vulić (2020) report that an intermediate fine-tuning step with supervised parsing does not make much difference for downstream task performance.

² See also the recent findings on adversarial triggers, which get the model to produce a certain output even though they are not well-formed from the point of view of a human reader (Wallace et al., 2019a).

3.2 Semantic knowledge

To date, more studies have been devoted to BERT's knowledge of syntactic rather than semantic phenomena. However, we do have evidence from an MLM probing study that BERT has some knowledge of semantic roles (Ettinger, 2019). BERT even displays some preference for the incorrect fillers for semantic roles that are semantically related to the correct ones, as opposed to those that are unrelated (e.g. "to tip a chef" is better than "to tip a robin", but worse than "to tip a waiter"). Tenney et al. (2019b) showed that BERT encodes information about entity types, relations, semantic roles, and proto-roles, since this information can be detected with probing classifiers.

BERT struggles with representations of numbers. Addition and number decoding tasks showed that BERT does not form good representations for floating point numbers and fails to generalize away from the training data (Wallace et al., 2019b). A part of the problem is BERT's wordpiece tokenization, since numbers of similar values can be divided up into substantially different word chunks.

Out-of-the-box BERT is surprisingly brittle to named entity replacements: e.g. replacing names in the coreference task changes 85% of predictions (Balasubramanian et al., 2020). This suggests that the model does not actually form a generic idea of named entities, although its F1 scores on NER probing tasks are high (Tenney et al., 2019a).
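A fill-in-the-gap agreement probe of the kind used by Goldberg (2019) can be run in a few lines; the following is a sketch with an example sentence of ours, not taken from any of the cited studies.

```python
# Sketch of a subject-verb agreement cloze probe via the MLM head.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
preds = unmasker("The keys to the old wooden cabinet [MASK] on the table.")
for p in preds[:5]:
    print(p["token_str"], round(p["score"], 3))
# If BERT tracks agreement across the distractor phrase, plural forms
# ("are", "were") should outrank singular forms ("is", "was").
```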
Broscheit (2019) find that fine-tuning BERT on Wikipedia entity linking "teaches" it additional entity knowledge, which would suggest that it did not absorb all the relevant entity information during pre-training on Wikipedia.

3.3 World knowledge

The bulk of evidence about commonsense knowledge captured in BERT comes from practitioners using it to extract such knowledge. One direct probing study of BERT reports that BERT struggles with pragmatic inference and role-based event knowledge (Ettinger, 2019). BERT also struggles with abstract attributes of objects, as well as visual and perceptual properties that are likely to be assumed rather than mentioned (Da and Kasai, 2019).

The MLM component of BERT is easy to adapt for knowledge induction by filling in the blanks (e.g. "Cats like to chase [___]").
[Figure 2: BERT world knowledge (Petroni et al., 2019)]

Petroni et al. (2019) showed that, for some relation types, vanilla BERT is competitive with methods relying on knowledge bases (Figure 2), and Roberts et al. (2020) show the same for open-domain QA using the T5 model (Raffel et al., 2019). Davison et al. (2019) suggest that it generalizes better to unseen data. In order to retrieve BERT's knowledge, we need good template sentences, and there is work on their automatic extraction and augmentation (Bouraoui et al., 2019; Jiang et al., 2019b).

However, BERT cannot reason based on its world knowledge. Forbes et al. (2019) show that BERT can "guess" the affordances and properties of many objects, but cannot reason about the relationship between properties and affordances. For example, it "knows" that people can walk into houses, and that houses are big, but it cannot infer that houses are bigger than people. Zhou et al. (2020) and Richardson and Sabharwal (2019) also show that the performance drops with the number of necessary inference steps. Some of BERT's world knowledge success comes from learning stereotypical associations (Poerner et al., 2019), e.g., a person with an Italian-sounding name is predicted to be Italian, even when it is incorrect.

3.4 Limitations

Multiple probing studies in section 3 and section 4 report that BERT possesses a surprising amount of syntactic, semantic, and world knowledge. However, Tenney et al. (2019a) remark, "the fact that a linguistic pattern is not observed by our probing classifier does not guarantee that it is not there, and the observation of a pattern does not tell us how it is used." There is also the issue of how complex a probe should be allowed to be (Liu et al., 2019a). If a more complex probe recovers more information, to what extent are we still relying on the original model?

Furthermore, different probing methods may lead to complementary or even contradictory conclusions, which makes a single test (as in most studies) insufficient (Warstadt et al., 2019). A given method might also favor one model over another, e.g., RoBERTa trails BERT with one tree extraction method, but leads with another (Htut et al., 2019). The choice of linguistic formalism also matters (Kuznetsov and Gurevych, 2020).

[Figure 3: Attention patterns in BERT (Kovaleva et al., 2019): diagonal, heterogeneous, vertical, vertical + diagonal, and block]

In view of all that, the alternative is to focus on identifying what BERT actually relies on at inference time.
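A LAMA-style factual probe (Petroni et al., 2019) can also be run by scoring specific candidate fillers from the MLM head's logits rather than taking the top predictions; below is a sketch of ours, with an arbitrary candidate list, assuming each candidate is a single wordpiece in the vocabulary.

```python
# Sketch: score candidate fillers for a masked factual slot (LAMA-style).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

enc = tokenizer("Dante was born in [MASK].", return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**enc).logits[0, mask_pos]
log_probs = logits.log_softmax(-1)

for city in ["florence", "rome", "paris"]:   # assumes single-wordpiece tokens
    tid = tokenizer.convert_tokens_to_ids(city)
    print(city, log_probs[tid].item())
```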
This direction is currently pursued both at the level of architecture blocks (to be discussed in detail in subsection 6.3), and at the level of information encoded in model weights. Amnesic probing (Elazar et al., 2020) aims to specifically remove certain information from the model and see how it changes performance, finding, for example, that language modeling does rely on part-of-speech information.

Another direction is information-theoretic probing. Pimentel et al. (2020) operationalize probing as estimating mutual information between the learned representation and a given linguistic property, which highlights that the focus should be not on the amount of information contained in a representation, but rather on how easily it can be extracted from it. Voita and Titov (2020) quantify the amount of effort needed to extract information from a given representation as the minimum description length needed to communicate both the probe size and the amount of data required for it to do well on a task.

4 Localizing linguistic knowledge

4.1 BERT embeddings

In studies of BERT, the term "embedding" refers to the output of a Transformer layer (typically, the final one). Both conventional static embeddings (Mikolov et al., 2013) and BERT-style embeddings can be viewed in terms of mutual information maximization (Kong et al., 2019), but the latter are contextualized. Every token is represented by a vector dependent on the particular context of occurrence, and contains at least some information about that context (Miaschi and Dell'Orletta, 2020).

Several studies reported that distilled contextualized embeddings better encode lexical semantic information (i.e. they are better at traditional word-level tasks such as word similarity). The methods to distill a contextualized representation into static include aggregating the information across multiple contexts (Akbik et al., 2019; Bommasani et al., 2020), encoding "semantically bleached" sentences that rely almost exclusively on the meaning of a given word (e.g. "This is <>") (May et al., 2019), and even using contextualized embeddings to train static embeddings (Wang et al., 2020d).

But this is not to say that there is no room for improvement. Ethayarajh (2019) measure how similar the embeddings for identical words are in every layer, reporting that later BERT layers produce more context-specific representations.³ They also find that BERT embeddings occupy a narrow cone in the vector space, and this effect increases from the earlier to later layers. That is, two random words will on average have a much higher cosine similarity than expected if embeddings were directionally uniform (isotropic). Since isotropy was shown to be beneficial for static word embeddings (Mu and Viswanath, 2018), this might be a fruitful direction to explore for BERT.

Since BERT embeddings are contextualized, an interesting question is to what extent they capture phenomena like polysemy and homonymy. There is indeed evidence that BERT's contextualized embeddings form distinct clusters corresponding to word senses (Wiedemann et al., 2019; Schmidt and Hofmann, 2020), making BERT successful at the word sense disambiguation task. However, Mickus et al. (2019) note that the representations of the same word depend on the position of the sentence in which it occurs, likely due to the NSP objective. This is not desirable from the linguistic point of view, and could be a promising avenue for future work.

³ Voita et al. (2019a) look at the evolution of token embeddings, showing that in the earlier Transformer layers, MLM forces the acquisition of contextual information at the expense of the token identity, which gets recreated in later layers.
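The anisotropy observation can be checked directly; here is a rough sketch (ours, loosely following the setup of Ethayarajh, 2019) that measures cosine similarity between token vectors from two unrelated sentences of our choosing.

```python
# Sketch: cross-sentence cosine similarity of contextualized token embeddings.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_vectors(sentence, layer):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, 768)
    return hidden[1:-1]  # drop [CLS] and [SEP]

a = layer_vectors("The cat sat on the mat.", layer=11)
b = layer_vectors("Stock prices fell sharply on Monday.", layer=11)
sims = torch.nn.functional.cosine_similarity(
    a.unsqueeze(1), b.unsqueeze(0), dim=-1)
# For isotropic embeddings this would hover near zero; for later BERT
# layers it tends to be substantially higher.
print("mean cross-sentence cosine:", sims.mean().item())
```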
The above discussion concerns token embeddings, but BERT is typically used as a sentence or text encoder. The standard way to generate sentence or text representations for classification is to use the [CLS] token, but alternatives are also being discussed, including concatenation of token representations (Tanaka et al., 2020), normalized mean (Tanaka et al., 2020), and layer activations (Ma et al., 2019). See Toshniwal et al. (2020) for a systematic comparison of several methods across tasks and sentence encoders.

4.2 Self-attention heads

Several studies proposed classifications of attention head types. Raganato and Tiedemann (2018) discuss attending to the token itself, previous/next tokens and the sentence end. Clark et al. (2019) distinguish between attending to previous/next tokens, [CLS], [SEP], punctuation, and "attending broadly" over the sequence. Kovaleva et al. (2019) propose 5 patterns shown in Figure 3.

4.2.1 Heads with linguistic functions

The "heterogeneous" attention pattern shown in Figure 3 could potentially be linguistically interpretable, and a number of studies focused on identifying the functions of self-attention heads. In particular, some BERT heads seem to specialize in certain types of syntactic relations. Htut et al. (2019) and Clark et al. (2019) report that there are BERT heads that attended significantly more than a random baseline to words in certain syntactic positions. The datasets and methods used in these studies differ, but they both find that there are heads that attend to words in the obj role more than the positional baseline. The evidence for nsubj, advmod, and amod varies between these two studies. The overall conclusion is also supported by Voita et al. (2019b)'s study of the base Transformer in the machine translation context. Hoover et al. (2019) hypothesize that even complex dependencies like dobj are encoded by a combination of heads rather than a single head, but this work is limited to qualitative analysis. Zhao and Bethard (2020) looked specifically for the heads encoding negation scope.

Both Clark et al. (2019) and Htut et al. (2019) conclude that no single head has the complete syntactic tree information, in line with evidence of partial knowledge of syntax (cf. subsection 3.1). However, Clark et al. (2019) identify a BERT head that can be directly used as a classifier to perform coreference resolution on par with a rule-based system, which by itself would seem to require quite a lot of syntactic knowledge.

Lin et al. (2019) present evidence that attention weights are weak indicators of subject-verb agreement and reflexive anaphora. Instead of serving as strong pointers between tokens that should be related, BERT's self-attention weights were close to a uniform attention baseline, but there was some sensitivity to different types of distractors coherent with psycholinguistic data. This is consistent with conclusions by Ettinger (2019).

To our knowledge, morphological information in BERT heads has not been addressed, but with the sparse attention variant by Correia et al. (2019) in the base Transformer, some attention heads appear to merge BPE-tokenized words.
For semantic relations, there are reports of self-attention heads encoding core frame-semantic relations (Kovaleva et al., 2019), as well as lexicographic and commonsense relations (Cui et al., 2020).

The overall popularity of self-attention as an interpretability mechanism is due to the idea that "attention weight has a clear meaning: how much a particular word will be weighted when computing the next representation for the current word" (Clark et al., 2019). This view is currently debated (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Brunner et al., 2020), and in a multi-layer model where attention is followed by non-linear transformations, the patterns in individual heads do not provide a full picture. Also, while many current papers are accompanied by attention visualizations, and there is a growing number of visualization tools (Vig, 2019; Hoover et al., 2019), the visualization is typically limited to qualitative analysis (often with cherry-picked examples) (Belinkov and Glass, 2019), and should not be interpreted as definitive evidence.

4.2.2 Attention to special tokens

Kovaleva et al. (2019) show that most self-attention heads do not directly encode any non-trivial linguistic information, at least when fine-tuned on GLUE (Wang et al., 2018), since less than 50% of heads exhibit the "heterogeneous" pattern. Much of the model produced the vertical pattern (attention to [CLS], [SEP], and punctuation tokens), consistent with the observations by Clark et al. (2019). This redundancy is likely related to the overparameterization issue (see section 6).

More recently, Kobayashi et al. (2020) showed that the norms of attention-weighted input vectors, which yield a more intuitive interpretation of self-attention, reduce the attention to special tokens. However, even when the attention weights are normed, it is still not the case that most heads that do the "heavy lifting" are even potentially interpretable (Prasanna et al., 2020).

One methodological choice in many studies of attention is to focus on inter-word attention and simply exclude special tokens (e.g. Lin et al. (2019) and Htut et al. (2019)). However, if attention to special tokens actually matters at inference time, drawing conclusions purely from inter-word attention patterns does not seem warranted.

The functions of special tokens are not yet well understood. [CLS] is typically viewed as an aggregated sentence-level representation (although all token representations also contain at least some sentence-level information, as discussed in subsection 4.1); in that case, we may not see e.g. full syntactic trees in inter-word attention because part of that information is actually packed in [CLS].

Clark et al. (2019) experiment with encoding Wikipedia paragraphs with base BERT to consider specifically the attention to special tokens, noting that heads in early layers attend more to [CLS], in middle layers to [SEP], and in final layers to periods and commas. They hypothesize that its function might be one of "no-op", a signal to ignore the head if its pattern is not applicable to the current case. As a result, for example, [SEP] gets increased attention starting in layer 5, but its importance for prediction drops. However, after fine-tuning both [SEP] and [CLS] get a lot of attention, depending on the task (Kovaleva et al., 2019). Interestingly, BERT also pays a lot of attention to punctuation, which Clark et al. (2019) explain by the fact that periods and commas are almost as frequent as the special tokens, and so the model might learn to rely on them for the same reasons.
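The attention mass directed at special tokens is easy to quantify; the following sketch (ours, loosely in the spirit of the analyses by Clark et al., 2019 and Kovaleva et al., 2019, not their code) measures the per-layer average attention to [CLS] and [SEP] on an arbitrary example sentence.

```python
# Sketch: measure average attention sent to [CLS]/[SEP] in each layer.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

enc = tokenizer("Attention patterns often collapse onto special tokens.",
                return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions  # 12 tensors of (1, heads, seq, seq)

special = torch.tensor([tokenizer.cls_token_id, tokenizer.sep_token_id])
is_special = torch.isin(enc["input_ids"][0], special)

for layer, att in enumerate(attentions):
    # attention mass each head sends to [CLS]/[SEP], averaged over positions
    mass = att[0][:, :, is_special].sum(-1).mean(-1)
    print(f"layer {layer}: mean special-token attention {mass.mean():.2f}")
```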
4.3 BERT layers

The first layer of BERT receives as input a combination of token, segment, and positional embeddings. It stands to reason that the lower layers have the most information about linear word order. Lin et al. (2019) report a decrease in the knowledge of linear word order around layer 4 in BERT-base. This is accompanied by an increased knowledge of hierarchical sentence structure, as detected by the probing tasks of predicting the token index, the main auxiliary verb and the sentence subject.
[Figure 4: BERT layer transferability; columns correspond to probing tasks (Liu et al., 2019a)]

There is a wide consensus in studies with different tasks, datasets and methodologies that syntactic information is most prominent in the middle layers of BERT.⁴ Hewitt and Manning (2019) had the most success reconstructing syntactic tree depth from the middle BERT layers (6-9 for base-BERT, 14-19 for BERT-large). Goldberg (2019) reports the best subject-verb agreement around layers 8-9, and the performance on syntactic probing tasks used by Jawahar et al. (2019) also seems to peak around the middle of the model. The prominence of syntactic information in the middle BERT layers is related to Liu et al. (2019a)'s observation that the middle layers of Transformers are best-performing overall and the most transferable across tasks (see Figure 4).

There is conflicting evidence about syntactic chunks. Tenney et al. (2019a) conclude that "the basic syntactic information appears earlier in the network while high-level semantic features appear at the higher layers", drawing parallels between this order and the order of components in a typical NLP pipeline – from POS-tagging to dependency parsing to semantic role labeling. Jawahar et al. (2019) also report that the lower layers were more useful for chunking, while middle layers were more useful for parsing. At the same time, the probing experiments by Liu et al. (2019a) find the opposite: both POS-tagging and chunking were performed best at the middle layers, in both BERT-base and BERT-large. However, all three studies use different suites of probing tasks.

The final layers of BERT are the most task-specific. In pre-training, this means specificity to the MLM task, which explains why the middle layers are more transferable (Liu et al., 2019a). In fine-tuning, it explains why the final layers change the most (Kovaleva et al., 2019), and why restoring the weights of lower layers of fine-tuned BERT to their original values does not dramatically hurt the model performance (Hao et al., 2019).

⁴ These BERT results are also compatible with findings by Vig and Belinkov (2019), who report the highest attention to tokens in dependency relations in the middle layers of GPT-2.
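Layer-wise comparisons of this kind typically fit a simple classifier on frozen per-layer features; here is a schematic sketch (ours, in the style of Liu et al., 2019a, not their code). Data loading is elided: `train_texts`, `train_labels`, `dev_texts`, and `dev_labels` are hypothetical.

```python
# Sketch: probe each BERT layer with a logistic-regression classifier.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def features(texts, layer):
    feats = []
    for t in texts:
        enc = tokenizer(t, return_tensors="pt", truncation=True)
        with torch.no_grad():
            h = model(**enc).hidden_states[layer][0]
        feats.append(h.mean(0).numpy())  # mean-pooled sentence vector
    return feats

for layer in range(1, 13):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features(train_texts, layer), train_labels)   # hypothetical data
    print(layer, clf.score(features(dev_texts, layer), dev_labels))
```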
Tenney et al. (2019a) suggest that while syntactic information appears early in the model and can be localized, semantics is spread across the entire model, which explains why certain non-trivial examples get solved incorrectly at first but correctly at the later layers. This is rather to be expected: semantics permeates all language, and linguists debate whether meaningless structures can exist at all (Goldberg, 2006, pp. 166-182). But this raises the question of what stacking more Transformer layers in BERT actually achieves in terms of the spread of semantic knowledge, and whether that is beneficial. Tenney et al. compared BERT-base and BERT-large, and found that the overall pattern of cumulative score gains is the same, only more spread out in the larger model.

Note that Tenney et al. (2019a)'s experiments concern sentence-level semantic relations; Cui et al. (2020) report that the encoding of ConceptNet semantic relations is the worst in the early layers and increases towards the top. Jawahar et al. (2019) place "surface features in lower layers, syntactic features in middle layers and semantic features in higher layers", but their conclusion is surprising, given that only one semantic task in this study actually topped at the last layer, and three others peaked around the middle and then considerably degraded by the final layers.

5 Training BERT

This section reviews the proposals to optimize the training and architecture of the original BERT.

5.1 Model architecture choices

To date, the most systematic study of BERT architecture was performed by Wang et al. (2019b), who experimented with the number of layers, heads, and model parameters, varying one option and freezing the others. They concluded that the number of heads was not as significant as the number of layers. That is consistent with the findings of Voita et al. (2019b) and Michel et al. (2019) (section 6), and also the observation by Liu et al. (2019a) that the middle layers were the most transferable. Larger hidden representation size was consistently better, but the gains varied by setting.

All in all, changes in the number of heads and layers appear to perform different functions. The issue of model depth must be related to the information flow from the most task-specific layers closer to the classifier (Liu et al., 2019a), to the initial layers which appear to be the most task-invariant (Hao et al., 2019), and where the tokens resemble the input tokens the most (Brunner et al., 2020) (see subsection 4.3). If that is the case, a deeper model has more capacity to encode information that is not task-specific.

On the other hand, many self-attention heads in vanilla BERT seem to naturally learn the same patterns (Kovaleva et al., 2019). This explains why pruning them does not have too much impact. The question that arises from this is how far we could get with intentionally encouraging diverse self-attention patterns: theoretically, this would mean increasing the amount of information in the model with the same number of weights. Raganato et al. (2020) show that for Transformer-based machine translation we can simply pre-set the patterns that we already know the model would learn, instead of learning them from scratch.

Vanilla BERT is symmetric and balanced in terms of self-attention and feed-forward layers, but it may not have to be. For the base Transformer, Press et al. (2020) report benefits from more self-attention sublayers at the bottom and more feedforward sublayers at the top.
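One back-of-the-envelope way to see why heads and layers behave differently (ours, not from Wang et al., 2019b): in the standard parameterization, head size is hidden_size / num_heads, so varying the number of heads alone does not change the parameter count, while varying depth does.

```python
# Sketch: count parameters under different architectural choices.
from transformers import BertConfig, BertModel

def n_params(**kwargs):
    model = BertModel(BertConfig(**kwargs))  # randomly initialized
    return sum(p.numel() for p in model.parameters())

print(n_params(num_hidden_layers=12, num_attention_heads=12))  # ~110M (base)
print(n_params(num_hidden_layers=6,  num_attention_heads=12))  # fewer layers
print(n_params(num_hidden_layers=12, num_attention_heads=6))   # same as base
```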
5.2 Improvements to the training regime

Liu et al. (2019b) demonstrate the benefits of large-batch training: with 8k examples both the language model perplexity and downstream task performance are improved. They also publish their recommendations for other parameters. You et al. (2019) report that with a batch size of 32k BERT's training time can be significantly reduced with no degradation in performance. Zhou et al. (2019) observe that the normalization of the trained [CLS] token stabilizes the training and slightly improves performance on text classification tasks.

Gong et al. (2019) note that, since self-attention patterns in higher and lower layers are similar, the model training can be done in a recursive manner, where the shallower version is trained first and then the trained parameters are copied to deeper layers. Such a "warm-start" can lead to a 25% faster training without sacrificing performance.

5.3 Pre-training BERT

The original BERT is a bidirectional Transformer pre-trained on two tasks: next sentence prediction (NSP) and masked language model (MLM) (section 2). Multiple studies have come up with alternative training objectives to improve on BERT, which could be categorized as follows:

- How to mask. Raffel et al. (2019) systematically experiment with corruption rate and corrupted span length. Liu et al. (2019b) propose diverse masks for training examples within an epoch, while Baevski et al. (2019) mask every token in a sequence instead of a random selection. Clinchant et al. (2019) replace the [MASK] token with the [UNK] token, to help the model learn a representation for unknowns that could be useful for translation. Song et al. (2020) maximize the amount of information available to the model by conditioning on both masked and unmasked tokens, and letting the model see how many tokens are missing. (A minimal sketch of the original masking recipe appears at the end of this subsection.)
- What to mask. Masks can be applied to full words instead of wordpieces (Devlin et al., 2019; Cui et al., 2019). Similarly, we can mask spans rather than single tokens (Joshi et al., 2020), predicting how many are missing (Lewis et al., 2019). Masking phrases and named entities (Sun et al., 2019b) improves representation of structured knowledge.
- Where to mask. Lample and Conneau (2019) use arbitrary text streams instead of sentence pairs and subsample frequent outputs similar to Mikolov et al. (2013). Bao et al. (2020) combine the standard autoencoding MLM with a partially autoregressive LM objective using special pseudo mask tokens.
- Alternatives to masking. Raffel et al. (2019) experiment with replacing and dropping spans, Lewis et al. (2019) explore deletion, infilling, sentence permutation and document rotation, and Sun et al. (2019c) predict whether a token is capitalized and whether it occurs in other segments of the same document. Yang et al. (2019) train on different permutations of word order in the input sequence, maximizing the probability of the original word order (cf. the n-gram word order reconstruction task (Wang et al., 2019a)). Clark et al. (2020) detect tokens that were replaced by a generator network rather than masked.
- NSP alternatives. Removing NSP does not hurt or slightly improves performance (Liu et al., 2019b; Joshi et al., 2020; Clinchant et al., 2019). Wang et al. (2019a) and Cheng et al. (2019) replace NSP with the task of predicting both the next and the previous sentences. Lan et al. (2020a) replace the negative NSP examples by swapped sentences from positive examples, rather than sentences from different documents.
ERNIE 2.0 includes sentence reordering and sentence distance prediction. Bai et al. (2020) replace both NSP and token position embeddings by a combination of paragraph, sentence, and token index embeddings. Li and Choi (2020) experiment with the utterance order prediction task for multi-party dialogue (and also MLM at the level of utterances and the whole dialogue).
- Other tasks. Sun et al. (2019c) propose simultaneous learning of 7 tasks, including discourse relation classification and predicting whether a segment is relevant for IR. Guu et al. (2020) include a latent knowledge retriever in language model pretraining. Wang et al. (2020c) combine MLM with a knowledge base completion objective. Glass et al. (2020) replace MLM with a span prediction task (as in extractive question answering), where the model is expected to provide the answer not from its own weights, but from a different passage containing the correct answer (a relevant search engine query snippet).

Another obvious source of improvement is pre-training data. Several studies explored the benefits of increasing the corpus volume (Liu et al., 2019b; Conneau et al., 2019; Baevski et al., 2019) and longer training (Liu et al., 2019b). The data also does not have to be raw text: there are a number of efforts to incorporate explicit linguistic information, both syntactic (Sundararaman et al., 2019) and semantic (Zhang et al., 2020). Wu et al. (2019b) and Kumar et al. (2020) include the label for a given sequence from an annotated task dataset. Schick and Schütze (2020) separately learn representations for rare words.

Although BERT is already actively used as a source of world knowledge (see subsection 3.3), there is also work on explicitly supplying structured knowledge. One approach is entity-enhanced models. For example, Peters et al. (2019a) and Zhang et al. (2019) include entity embeddings as input for training BERT, while Poerner et al. (2019) adapt entity vectors to BERT representations. As mentioned above, Wang et al. (2020c) integrate knowledge not through entity embeddings, but through an additional pre-training objective of knowledge base completion. Sun et al. (2019b,c) modify the standard MLM task to mask named entities rather than random words, and Yin et al. (2020) train with the MLM objective over both text and linearized table data. Wang et al. (2020a) enhance RoBERTa with both linguistic and factual knowledge with task-specific adapters.

[Figure 5: Pre-trained weights help BERT find wider optima in fine-tuning on MRPC (right) than training from scratch (left) (Hao et al., 2019)]

Pre-training is the most expensive part of training BERT, and it would be informative to know how much benefit it provides. On some tasks, a randomly initialized and fine-tuned BERT obtains competitive or higher results than the pre-trained BERT with the task classifier and frozen weights (Kovaleva et al., 2019). The consensus in the community is that pre-training does help in most situations, but the degree and its exact contribution require further investigation. Prasanna et al. (2020) found that most weights of pre-trained BERT are useful in fine-tuning, although there are "better" and "worse" subnetworks. One explanation is that pre-trained weights help the fine-tuned BERT find wider and flatter areas with smaller generalization error, which makes the model more robust to overfitting (see Figure 5 from Hao et al. (2019)).
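As referenced in the masking discussion above, here is a minimal sketch of the original BERT masking recipe (Devlin et al., 2019): select 15% of positions; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged. This is a simplification (no special-token handling or whole-word masking); `input_ids` is assumed to be a LongTensor of wordpiece ids.

```python
# Sketch: the 80/10/10 MLM corruption scheme.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                          # ignored by the MLM loss

    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id                 # 80% of selected: [MASK]

    random = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
              & selected & ~masked)                   # 10% of selected: random
    input_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]
    # the remaining 10% of selected positions are left unchanged
    return input_ids, labels
```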
Given the large number and variety of proposed modifications, one would wish to know how much impact each of them has. However, due to the overall trend towards large model sizes, systematic ablations have become expensive. Most new models claim superiority on standard benchmarks, but gains are often marginal, and estimates of model stability and significance testing are very rare.

5.4 Fine-tuning BERT

The pre-training + fine-tuning workflow is a crucial part of BERT. The former is supposed to provide task-independent knowledge, and the latter would presumably teach the model to rely more on the representations useful for the task at hand. Kovaleva et al. (2019) did not find that to be the case for BERT fine-tuned on GLUE tasks⁵: during fine-tuning, most of the changes over 3 epochs occurred in the last two layers of the models, but those changes caused self-attention to focus on [SEP] rather than on linguistically interpretable patterns. It is understandable why fine-tuning would increase the attention to [CLS], but not [SEP]. If Clark et al. (2019) are correct that [SEP] serves as a "no-op" indicator, fine-tuning basically tells BERT what to ignore.

Several studies explored the possibilities of improving the fine-tuning of BERT:

- Taking more layers into account: learning a complementary representation of the information in deep and output layers (Yang and Zhao, 2019), using a weighted combination of all layers instead of the final one (Su and Cheng, 2019; Kondratyuk and Straka, 2019), and layer dropout (Kondratyuk and Straka, 2019).
- Two-stage fine-tuning introduces an intermediate supervised training stage between pre-training and fine-tuning (Phang et al., 2019; Garg et al., 2020; Arase and Tsujii, 2019; Pruksachatkun et al., 2020; Glavaš and Vulić, 2020). Ben-David et al. (2020) propose a pivot-based variant of MLM to fine-tune BERT for domain adaptation.
- Adversarial token perturbations improve robustness of the model (Zhu et al., 2019).
- Adversarial regularization in combination with Bregman Proximal Point Optimization helps alleviate pre-trained knowledge forgetting and therefore prevents BERT from overfitting to downstream tasks (Jiang et al., 2019a).
- Mixout regularization improves the stability of BERT fine-tuning even for a small number of training examples (Lee et al., 2019).

With large models, even fine-tuning becomes expensive, but Houlsby et al. (2019) show that it can be successfully approximated with adapter modules. They achieve competitive performance on 26 classification tasks at a fraction of the computational cost. Adapters in BERT were also used for multi-task learning (Stickland and Murray, 2019) and cross-lingual transfer (Artetxe et al., 2019). An alternative to fine-tuning is extracting features from frozen representations, but fine-tuning works better for BERT (Peters et al., 2019b).

A big methodological challenge in current NLP is that the reported performance improvements of new models may well be within variation induced by environment factors (Crane, 2018). BERT is not an exception. Dodge et al. (2020) report significant variation for BERT fine-tuned on GLUE tasks due to both weight initialization and training data order. They also propose early stopping on the less-promising seeds.

⁵ Kondratyuk and Straka (2019) suggest that fine-tuning on Universal Dependencies does result in syntactically meaningful attention patterns, but there was no quantitative evaluation.
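The seed-variance issue is straightforward to reproduce in outline; below is a sketch (ours, in the spirit of Dodge et al., 2020) that fine-tunes the same model under several random seeds and compares dev results. The dataset objects `train_ds` and `dev_ds` are assumed to be prepared elsewhere.

```python
# Sketch: fine-tuning variance across random seeds.
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments, set_seed)

scores = {}
for seed in [0, 1, 2, 3, 4]:
    set_seed(seed)  # controls weight init of the new head and data order
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    args = TrainingArguments(output_dir=f"run_seed{seed}", seed=seed,
                             num_train_epochs=3,
                             per_device_train_batch_size=32)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=dev_ds)
    trainer.train()
    scores[seed] = trainer.evaluate()["eval_loss"]
print(scores)  # the spread across seeds can be substantial
```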
Although we hope that the above observations may be useful to practitioners, this section does not exhaust the current research on fine-tuning and its alternatives. For example, we do not cover such topics as Siamese architectures, policy gradient training, automated curriculum learning, and others.

6 How big should BERT be?

6.1 Overparameterization

Transformer-based models keep growing by orders of magnitude: the 110M parameters of base BERT are now dwarfed by the 17B parameters of Turing-NLG (Microsoft, 2020), which is in turn dwarfed by the 175B of GPT-3 (Brown et al., 2020). This trend raises concerns about the computational complexity of self-attention (Wu et al., 2019a), environmental issues (Strubell et al., 2019; Schwartz et al., 2019), fair comparison of architectures (Aßenmacher and Heumann, 2020), and reproducibility.

Human language is incredibly complex, and would perhaps take many more parameters to describe fully, but the current models do not make good use of the parameters they already have. Voita et al. (2019b) showed that all but a few Transformer heads could be pruned without significant losses in performance. For BERT, Clark et al. (2019) observe that most heads in the same layer show similar self-attention patterns (perhaps related to the fact that the output of all self-attention heads in a layer is passed through the same MLP), which explains why Michel et al. (2019) were able to reduce most layers to a single head.

Depending on the task, some BERT heads/layers are not only redundant (Kao et al., 2020), but also harmful to downstream task performance. A positive effect from head disabling was reported for machine translation (Michel et al., 2019), abstractive summarization (Baan et al., 2019), and GLUE tasks (Kovaleva et al., 2019). Additionally, Tenney et al. (2019a) examine the cumulative gains of their structural probing classifier, observing that in 5 out of 8 probing tasks some layers cause a drop in scores (typically the final layers). Gordon et al. (2020) find that 30–40% of the weights can be pruned without impact on downstream tasks.

In general, larger BERT models perform better (Liu et al., 2019a; Roberts et al., 2020), but not always: BERT-base outperformed BERT-large on subject-verb agreement (Goldberg, 2019) and sentence subject detection (Lin et al., 2019). Given the complexity of language and the amounts of pre-training data, it is not clear why BERT ends up with redundant heads and layers. Clark et al. (2019) suggest that one possible reason is the use of attention dropout, which causes some attention weights to be zeroed out during training.

6.2 Compression techniques

Given the above evidence of overparameterization, it does not come as a surprise that BERT can be efficiently compressed with minimal accuracy loss, which would be highly desirable for real-world applications. Such efforts to date are summarized in Table 1. The main approaches are knowledge distillation, quantization, and pruning.

The studies in the knowledge distillation framework (Hinton et al., 2014) use a smaller student network trained to mimic the behavior of a larger teacher network. For BERT, this has been achieved through experiments with loss functions (Sanh et al., 2019b; Jiao et al., 2019), mimicking the activation patterns of individual portions of the teacher network (Sun et al., 2019a), and knowledge transfer at the pre-training (Turc et al., 2019; Jiao et al., 2019; Sun et al., 2020) or fine-tuning stage (Jiao et al., 2019). McCarley et al. (2020) suggest that distillation has so far worked better for GLUE than for reading comprehension, and report good results for QA from a combination of structured pruning and task-specific distillation.
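The core of the distillation objective is easy to state in code. Below is a minimal sketch of the soft-target loss of Hinton et al. (2014) as it is typically applied to BERT students; the temperature value and interpolation weight are illustrative hyperparameters, not values from any specific paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with the ordinary hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature smooths the teacher's distribution so that its "dark knowledge" about near-miss classes is visible to the student; BERT-specific variants add further terms, e.g. matching intermediate activations (Sun et al., 2019a).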
Quantization decreases BERT's memory footprint by lowering the precision of its weights (Shen et al., 2019; Zafrir et al., 2019). Note that this strategy often requires compatible hardware.

Table 1: Comparison of BERT compression studies. Compression, performance retention, and inference-time speedup figures are given with respect to BERT-base, unless indicated otherwise. Performance retention is measured as a ratio of average scores achieved by a given model and by BERT-base. The subscript in the model description reflects the number of layers used. ∗ Smaller vocabulary used. ∗∗ Compared to mBERT. † The dimensionality of the hidden layers is reduced. ‖ Convolutional layers used. ‡ Compared to BERT-large. § As reported in Jiao et al. (2019). ¶ In comparison to the dev set.

| Type | Model | Compression | Performance | Speedup | Base model | Evaluation |
|---|---|---|---|---|---|---|
| – | BERT-base (Devlin et al., 2019) | ×1 | 100% | ×1 | BERT₁₂ | All GLUE tasks, SQuAD |
| – | BERT-small | ×3.8 | 91% | – | BERT₄ † | All GLUE tasks |
| Distillation | DistilBERT (Sanh et al., 2019a) | ×1.5 | 90% § | ×1.6 | BERT₆ | All GLUE tasks, SQuAD |
| Distillation | BERT₆-PKD (Sun et al., 2019a) | ×1.6 | 98% | ×1.9 | BERT₆ | No WNLI, CoLA, STS-B; RACE |
| Distillation | BERT₃-PKD (Sun et al., 2019a) | ×2.4 | 92% | ×3.7 | BERT₃ | No WNLI, CoLA, STS-B; RACE |
| Distillation | Aguilar et al. (2019), Exp. 3 | ×1.6 | 93% | – | BERT₆ | CoLA, MRPC, QQP, RTE |
| Distillation | BERT-48 (Zhao et al., 2019) | ×62 | 87% | ×77 | BERT₁₂ ∗† | MNLI, MRPC, SST-2 |
| Distillation | BERT-192 (Zhao et al., 2019) | ×5.7 | 93% | ×22 | BERT₁₂ ∗† | MNLI, MRPC, SST-2 |
| Distillation | TinyBERT (Jiao et al., 2019) | ×7.5 | 96% | ×9.4 | BERT₄ † | No WNLI; SQuAD |
| Distillation | MobileBERT (Sun et al., 2020) | ×4.3 | 100% | ×4 | BERT₂₄ † | No WNLI; SQuAD |
| Distillation | PD (Turc et al., 2019) | ×1.6 | 98% | ×2.5 ‡ | BERT₆ † | No WNLI, CoLA and STS-B |
| Distillation | WaLDORf (Tian et al., 2019) | ×4.4 | 93% | ×9 | BERT₈ †‖ | SQuAD |
| Distillation | MiniLM (Wang et al., 2020b) | ×1.65 | 99% | ×2 | BERT₆ | No WNLI, STS-B, MNLI-m; SQuAD |
| Distillation | MiniBERT (Tsai et al., 2019) | ×6 ∗∗ | 98% | ×27 ∗∗ | mBERT₃ † | CoNLL-18 POS and morphology |
| Distillation | BiLSTM-soft (Tang et al., 2019) | ×110 | 91% | ×434 ‡ | BiLSTM₁ | MNLI, QQP, SST-2 |
| Quantization | Q-BERT-MP (Shen et al., 2019) | ×13 | 98% ¶ | – | BERT₁₂ | MNLI, SST-2, CoNLL-03, SQuAD |
| Quantization | BERT-QAT (Zafrir et al., 2019) | ×4 | 99% | – | BERT₁₂ | No WNLI, MNLI; SQuAD |
| Quantization | GOBO (Zadeh and Moshovos, 2020) | ×9.8 | 99% | – | BERT₁₂ | MNLI |
| Pruning | McCarley et al. (2020), f2 | ×2.2 ‡ | 98% ‡ | ×1.9 ‡ | BERT₂₄ | SQuAD, Natural Questions |
| Pruning | RPP (Guo et al., 2019) | ×1.7 ‡ | 99% ‡ | – | BERT₂₄ | No WNLI, STS-B; SQuAD |
| Pruning | Soft MvP (Sanh et al., 2020) | ×33 | 94% ¶ | – | BERT₁₂ | MNLI, QQP, SQuAD |
| Pruning | IMP (Chen et al., 2020), rewind 50% | ×1.4–2.5 | 94–100% | – | BERT₁₂ | No MNLI-m; SQuAD |
| Other | ALBERT-base (Lan et al., 2020b) | ×9 | 97% | – | BERT₁₂ † | MNLI, SST-2 |
| Other | ALBERT-xxlarge (Lan et al., 2020b) | ×0.47 | 107% | – | BERT₁₂ † | MNLI, SST-2 |
| Other | BERT-of-Theseus (Xu et al., 2020) | ×1.6 | 98% | ×1.9 | BERT₆ | No WNLI |
| Other | PoWER-BERT (Goyal et al., 2020) | N/A | 99% | ×2–4.5 | BERT₁₂ | No WNLI; RACE |

As discussed in section 6, individual self-attention heads and BERT layers can be disabled without a significant drop in performance (Michel et al., 2019; Kovaleva et al., 2019; Baan et al., 2019). Pruning is a compression technique that takes advantage of this fact, typically reducing the amount of computation by zeroing out certain parts of the large model. In structured pruning, architecture blocks are dropped, as in LayerDrop (Fan et al., 2019). In unstructured pruning, the weights in the entire model are pruned irrespective of their location, as in magnitude pruning (Chen et al., 2020) or movement pruning (Sanh et al., 2020).
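For concreteness, a toy version of unstructured magnitude pruning might look as follows; pruning each Linear layer to a fixed sparsity is one simple choice, whereas published variants differ in whether thresholds are global or per-layer and whether pruning is iterated with weight rewinding (Chen et al., 2020):

```python
import torch

def magnitude_prune(model, sparsity=0.3):
    """Zero out the lowest-magnitude `sparsity` fraction of weights in every
    Linear layer. No parameters are removed, only set to zero, which is why
    actual speedups require sparse-aware kernels or hardware."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight
                k = int(sparsity * w.numel())
                if k == 0:
                    continue
                # k-th smallest |w| serves as the per-layer pruning threshold.
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).float())
```

Movement pruning replaces the magnitude criterion with a learned score reflecting how weights move during fine-tuning, which matters because pre-trained magnitudes mostly reflect the pre-training task.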
Prasanna et al. (2020) and Chen et al. (2020) explore BERT from the perspective of the lottery ticket hypothesis (Frankle and Carbin, 2019), looking specifically at the "winning" subnetworks in pre-trained BERT. They independently find that such subnetworks do exist, and that transferability between subnetworks for different tasks varies.

If the ultimate goal of training BERT is compression, Li et al. (2020) recommend training larger models and compressing them heavily rather than compressing smaller models lightly. Other techniques include decomposing BERT's embedding matrix into smaller matrices (Lan et al., 2020a), progressive module replacing (Xu et al., 2020), and dynamic elimination of intermediate encoder outputs (Goyal et al., 2020). See Ganesh et al. (2020) for a more detailed discussion of compression methods.

6.3 Pruning and model analysis

There is a nascent discussion around pruning as a model analysis technique. The basic idea is that a compressed model a priori consists of elements that are useful for prediction; therefore, by finding out what they do we may find out what the whole network does. For instance, BERT has heads that seem to encode frame-semantic relations, but disabling them might not hurt downstream task performance (Kovaleva et al., 2019); this suggests that this knowledge is not actually used.

For the base Transformer, Voita et al. (2019b) identify the functions of self-attention heads and then check which of them survive pruning, finding that the syntactic and positional heads are the last ones to go. For BERT, Prasanna et al. (2020) go in the opposite direction: pruning on the basis of importance scores, and interpreting the remaining "good" subnetwork (see the sketch at the end of this subsection). With respect to self-attention heads specifically, it does not seem to be the case that only the heads that potentially encode non-trivial linguistic patterns survive the pruning.

The models and methodology in these studies differ, so the evidence is inconclusive. In particular, Voita et al. (2019b) find that before pruning the majority of heads are syntactic, while Prasanna et al. (2020) find that the majority of heads do not have potentially non-trivial attention patterns.

An important limitation of the current head and layer ablation studies (Michel et al., 2019; Kovaleva et al., 2019) is that they inherently assume that certain knowledge is contained in heads/layers. However, there is evidence of more diffuse representations spread across the full network, such as the gradual increase in accuracy on difficult semantic parsing tasks (Tenney et al., 2019a) or the absence of heads that would perform parsing "in general" (Clark et al., 2019; Htut et al., 2019). If so, ablating individual components harms the weight-sharing mechanism. Conclusions from component ablations are also problematic if the same information is duplicated elsewhere in the network.
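As an illustration of importance-based pruning as an analysis tool, one common recipe (in the spirit of Michel et al. (2019), and similar in shape to the bertology example shipped with the HuggingFace transformers library) scores each head by the gradient of the loss with respect to a multiplicative gate on that head. A minimal sketch, assuming the `head_mask` argument of HuggingFace's BERT models and a placeholder probe batch:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["a probe sentence"], return_tensors="pt")  # placeholder input

# One differentiable gate per head, initialized to 1.0 (all heads fully on).
head_mask = torch.ones(
    model.config.num_hidden_layers,
    model.config.num_attention_heads,
    requires_grad=True,
)
loss = model(**batch, labels=torch.tensor([0]), head_mask=head_mask).loss
loss.backward()

# |d loss / d gate| as a per-head importance score: heads with the lowest
# scores are pruning candidates, and the survivors can then be inspected
# for interpretable attention patterns.
importance = head_mask.grad.abs()
```

In practice these scores are accumulated over many batches of task data; the point for analysis is that the surviving subnetwork, not the full model, is what the fine-tuned task actually relies on.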
7 Directions for further research

BERTology has clearly come a long way, but it is fair to say we still have more questions than answers about how BERT works. In this section, we list what we believe to be the most promising directions for further research.

Benchmarks that require verbal reasoning. While BERT enabled breakthroughs on many NLP benchmarks, a growing list of analysis papers is showing that its language skills are not as impressive as they seem. In particular, it was shown to rely on shallow heuristics in natural language inference (McCoy et al., 2019b; Zellers et al., 2019; Jin et al., 2020), reading comprehension (Si et al., 2019a; Rogers et al., 2020; Sugawara et al., 2020; Si et al., 2019b; Yogatama et al., 2019), argument reasoning comprehension (Niven and Kao, 2019), and text classification (Jin et al., 2020). Such heuristics can even be used to reconstruct a non-publicly-available model (Krishna et al., 2020). As with any optimization method, if there is a shortcut in the data, we have no reason to expect BERT not to learn it. But harder datasets that cannot be resolved with shallow heuristics are unlikely to emerge if their development is not valued as highly as modeling work.

Benchmarks for the full range of linguistic competence. While language models seem to acquire a great deal of knowledge about language, we do not currently have comprehensive stress tests for different aspects of linguistic knowledge. A step in this direction is the CheckList behavioral testing (Ribeiro et al., 2020), the best paper at ACL 2020. Ideally, such tests would measure not only errors, but also sensitivity (Ettinger, 2019).

Developing methods to "teach" reasoning. While large pre-trained models have a lot of knowledge, they often fail if any reasoning needs to be performed on top of the facts they possess (Talmor et al., 2019; see also subsection 3.3). For instance, Richardson et al. (2020) propose a method to "teach" BERT quantification, conditionals, comparatives, and boolean coordination.

Learning what happens at inference time. Most BERT analysis papers focus on different probes of the model, with the goal of finding what the language model "knows". However, probing studies have limitations (subsection 3.4), and to this point, far fewer papers have focused on discovering what knowledge actually gets used. Several promising directions are "amnesic probing" (Elazar et al., 2020), identifying features important for prediction for a given task (Arkhangelskaia and Dutta, 2019), and pruning the model to remove the non-important components (Voita et al., 2019b; Michel et al., 2019; Prasanna et al., 2020).

8 Conclusion

In a little over a year, BERT has become a ubiquitous baseline in NLP experiments and inspired numerous studies analyzing the model and proposing various improvements. The stream of papers seems to be accelerating rather than slowing down, and we hope that this survey helps the community to focus on the biggest unresolved questions.

9 Acknowledgements

We thank the anonymous reviewers for their valuable feedback. This work is funded in part by NSF award number IIS-1844740 to Anna Rumshisky.

References

Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Edward Guo. 2019. Knowledge Distillation from Internal Representations. arXiv preprint arXiv:1910.03723.

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.

Yuki Arase and Jun'ichi Tsujii. 2019. Transfer Fine-Tuning: A BERT Case Study.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5393–5404, Hong Kong, China. Association for Computational Linguistics.

Ekaterina Arkhangelskaia and Sourav Dutta. 2019. Whatcha lookin' at? DeepLIFTing BERT's Attention in Question Answering. arXiv preprint arXiv:1910.06431.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2019. On the Cross-lingual Transferability of Monolingual Representations. arXiv:1911.03310 [cs].

Matthias Aßenmacher and Christian Heumann. 2020. On the comparability of Pre-Trained Language Models. arXiv:2001.00781 [cs, stat].

Joris Baan, Maartje ter Hoeve, Marlies van der Wees, Anne Schuth, and Maarten de Rijke. 2019. Understanding Multi-Head Attention in Abstractive Summarization. arXiv preprint arXiv:1911.03898.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven Pretraining of Self-Attention Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5360–5369, Hong Kong, China. Association for Computational Linguistics.

He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, and Ming Li. 2020. SegaBERT: Pre-training of Segment-aware BERT for Language Understanding. arXiv:2004.14996 [cs].

Sriram Balasubramanian, Naman Jain, Gaurav Jindal, Abhijeet Awasthi, and Sunita Sarawagi. 2020. What's in a Name? Are BERT Named Entity Representations just as Good for any other Name? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 205–214, Online. Association for Computational Linguistics.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training. arXiv:2002.12804 [cs].

Yonatan Belinkov and James Glass. 2019. Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Eyal Ben-David, Carmel Rabinovitz, and Roi Reichart. 2020. PERL: Pivot-based Domain Adaptation for Pre-trained Deep Contextualized Embedding Models. arXiv:2006.09075 [cs].

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2019. Inducing Relational Knowledge from BERT. arXiv:1911.12753 [cs].

Samuel Broscheit. 2019. Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 677–685, Hong Kong, China. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.
Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].

Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On Identifiability in Transformers. In International Conference on Learning Representations.

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. arXiv:2007.12223 [cs, stat].

Xingyi Cheng, Weidi Xu, Kunlong Chen, Wei Wang, Bin Bi, Ming Yan, Chen Wu, Luo Si, Wei Chu, and Taifeng Wang. 2019. Symmetric Regularization based BERT for Pair-Wise Semantic Reasoning. arXiv:1909.03405 [cs].

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. In International Conference on Learning Representations.

Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for Neural Machine Translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 108–117, Hong Kong. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-Lingual Representation Learning at Scale. arXiv:1911.02116 [cs].

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2174–2184, Hong Kong, China. Association for Computational Linguistics.

Matt Crane. 2018. Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. Transactions of the Association for Computational Linguistics, 6:241–252.

Leyang Cui, Sijie Cheng, Yu Wu, and Yue Zhang. 2020. Does BERT Solve Commonsense Task via Commonsense Knowledge? arXiv:2008.03945 [cs].

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-Training with Whole Word Masking for Chinese BERT. arXiv:1906.08101 [cs].

Jeff Da and Jungo Kasai. 2019. Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep Contextual Representations. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 1–12, Hong Kong, China. Association for Computational Linguistics.

Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense Knowledge Mining from Pretrained Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv:2002.06305 [cs].

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2020. When Bert Forgets How To POS: Amnesic Probing of Linguistic Properties and MLM Predictions. arXiv:2006.00995 [cs].

Kawin Ethayarajh. 2019. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Allyson Ettinger. 2019. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. arXiv:1907.13528 [cs].

Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing Transformer Depth on Demand with Structured Dropout. In International Conference on Learning Representations.

Maxwell Forbes, Ari Holtzman, and Yejin Choi. 2019. Do Neural Language Representations Learn Physical Commonsense? In Proceedings of the 41st Annual Conference of the Cognitive Science Society (CogSci 2019), page 7.

Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations.

Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. 2020. Compressing large-scale transformer-based models: A case study on BERT. arXiv preprint arXiv:2002.11985.

Siddhant Garg, Thuy Vu, and Alessandro Moschitti. 2020. TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection. In AAAI.

Michael Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G P Shrivatsa Bhargav, Dinesh Garg, and Avi Sil. 2020. Span Selection Pre-training for Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2773–2782, Online. Association for Computational Linguistics.

Goran Glavaš and Ivan Vulić. 2020. Is Supervised Syntactic Parsing Beneficial for Language Understanding? An Empirical Investigation. arXiv:2008.06788 [cs].

Adele Goldberg. 2006. Constructions at Work: The Nature of Generalization in Language. Oxford University Press, USA.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Efficient training of BERT by progressively stacking. In International Conference on Machine Learning, pages 2337–2346.

Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307.

Saurabh Goyal, Anamitra Roy Choudhary, Venkatesan Chakaravarthy, Saurabh ManishRaje, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: Accelerating BERT inference for classification tasks. arXiv preprint arXiv:2001.08950.

Fu-Ming Guo, Sijia Liu, Finlay S. Mungall, Xue Lin, and Yanzhi Wang. 2019.
Reweighted Proximal Pruning for Large-Scale Language Representation. arXiv:1909.12486 [cs, stat].

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented Language Model Pre-Training. arXiv:2002.08909 [cs].

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and Understanding the Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014. Distilling the Knowledge in a Neural Network. In Deep Learning and Representation Learning Workshop: NIPS 2014.

Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. 2019. exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformers Models. arXiv:1910.05276 [cs].

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs, stat].

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy.

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2019a. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization. arXiv preprint arXiv:1911.03437.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2019b. How Can We Know What Language Models Know? arXiv:1911.12543 [cs].

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In AAAI 2020.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-Training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Wei-Tsung Kao, Tsung-Han Wu, Po-Han Chi, Chun-Cheng Hsieh, and Hung-Yi Lee. 2020. Further boosting BERT-based models by duplicating existing layers: Some intriguing phenomena inside BERT. arXiv preprint arXiv:2001.09309.

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020.
Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In ICLR 2020.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms. arXiv:2004.10102 [cs].

Dan Kondratyuk and Milan Straka. 2019. 75 Languages, 1 Model: Parsing Universal Dependencies Universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Lingpeng Kong, Cyprien de Masson d'Autume, Lei Yu, Wang Ling, Zihang Dai, and Dani Yogatama. 2019. A mutual information maximization perspective of language representation learning. In International Conference on Learning Representations.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365, Hong Kong, China. Association for Computational Linguistics.

Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, and Mohit Iyyer. 2020. Thieves on Sesame Street! Model Extraction of BERT-Based APIs. In ICLR 2020.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data Augmentation using Pre-Trained Transformer Models. arXiv:2003.02245 [cs].

Ilia Kuznetsov and Iryna Gurevych. 2020. A Matter of Framing: The Impact of Linguistic Formalism on Probing Results. arXiv:2004.14999 [cs].

Guillaume Lample and Alexis Conneau. 2019. Cross-Lingual Language Model Pretraining. arXiv:1901.07291 [cs].

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020a. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. In ICLR.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020b. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR 2020.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2019. Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. arXiv:1910.13461 [cs, stat].

Changmao Li and Jinho D. Choi. 2020. Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5709–5714, Online. Association for Computational Linguistics.

Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joseph E Gonzalez. 2020. Train large, then compress: Rethinking model size for efficient training and inference of transformers. arXiv preprint arXiv:2002.11794.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open Sesame: Getting inside BERT's Linguistic Knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019a.
Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].

Xiaofei Ma, Zhiguo Wang, Patrick Ng, Ramesh Nallapati, and Bing Xiang. 2019. Universal Text Representation from BERT: An Empirical Study. arXiv:1910.07973 [cs].

Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, page 201907367.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On Measuring Social Biases in Sentence Encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2020. Structured Pruning of a BERT-based Question Answering Model. arXiv:1910.06360 [cs].

R. Thomas McCoy, Tal Linzen, Ewan Dunbar, and Paul Smolensky. 2019a. RNNs implicitly implement tensor-product representations. In International Conference on Learning Representations.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019b. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Alessio Miaschi and Felice Dell'Orletta. 2020. Contextual and Non-Contextual Word Embeddings: An in-depth Linguistic Investigation. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 110–119.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? Advances in Neural Information Processing Systems 32 (NIPS 2019).

Timothee Mickus, Denis Paperno, Mathieu Constant, and Kees van Deemeter. 2019. What do you mean, BERT? Assessing BERT as a distributional semantics model. arXiv preprint arXiv:1911.05758.

Microsoft. 2020. Turing-NLG: A 17-billion-parameter language model by Microsoft.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 3111–3119.

Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations.

Timothy Niven and Hung-Yu Kao. 2019. Probing Neural Network Comprehension of Natural Language Arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019a. Knowledge Enhanced Contextual Word Representations.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 43–54, Hong Kong, China. Association for Computational Linguistics.

Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019b. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Jason Phang, Thibault Févry, and Samuel R. Bowman. 2019. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-Data Tasks. arXiv:1811.01088 [cs].

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-Theoretic Probing for Linguistic Structure. arXiv:2004.03061 [cs].

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. arXiv preprint arXiv:1911.03681.

Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online. Association for Computational Linguistics.

Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2996–3005, Online. Association for Computational Linguistics.

Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. 2020. Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247, Online. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 [cs, stat].

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. 2020. Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation. arXiv:2002.10260 [cs].

Alessandro Raganato and Jörg Tiedemann. 2018. An Analysis of Encoder Representations in Transformer-Based Machine Translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Kyle Richardson, Hai Hu, Lawrence S.
Moss, and Ashish Sabharwal. 2020. Probing Natural Language Inference Models through Semantic Fragments. In AAAI 2020.

Kyle Richardson and Ashish Sabharwal. 2019. What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge. arXiv:1912.13337 [cs].

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model? arXiv preprint arXiv:2002.08910.

Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020. Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. In AAAI, page 11.

Rudolf Rosa and David Mareček. 2019. Inducing syntactic trees from BERT representations. arXiv preprint arXiv:1906.11511.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019a. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019b. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019.

Victor Sanh, Thomas Wolf, and Alexander M. Rush. 2020. Movement Pruning: Adaptive Sparsity by Fine-Tuning. arXiv:2005.07683 [cs].

Timo Schick and Hinrich Schütze. 2020. BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3996–4007, Online. Association for Computational Linguistics.

Florian Schmidt and Thomas Hofmann. 2020. BERT as a Teacher: Contextual Embeddings for Sequence-Level Reward. arXiv preprint arXiv:2003.02738.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI. arXiv:1907.10597 [cs, stat].

Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? arXiv:1906.03731 [cs].

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. arXiv preprint arXiv:1909.05840.

Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019a. What does BERT learn from multiple-choice reading comprehension datasets? arXiv preprint arXiv:1910.12391.

Chenglei Si, Shuohang Wang, Min-Yen Kan, and Jing Jiang. 2019b. What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? arXiv:1910.12391 [cs].

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and Permuted Pre-training for Language Understanding. arXiv:2004.09297 [cs].

Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. In International Conference on Machine Learning, pages 5986–5995.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In ACL 2019.

Ta-Chun Su and Hsiang-Chih Cheng. 2019. SesameBERT: Attention for Anywhere. arXiv:1910.03176 [cs].

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets. In AAAI.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019a. Patient Knowledge Distillation for BERT Model Compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4314–4323.
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced Representation through Knowledge Integration. arXiv:1904.09223 [cs].

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2019c. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding. arXiv:1907.12412 [cs].

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: Task-Agnostic Compression of BERT for Resource Limited Devices.

Dhanasekar Sundararaman, Vivek Subramanian, Guoyin Wang, Shijing Si, Dinghan Shen, Dong Wang, and Lawrence Carin. 2019. Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding. arXiv:1911.06156 [cs, stat].

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. oLMpics – On what Language Model Pre-Training Captures. arXiv:1912.13283 [cs].

Hirotaka Tanaka, Hiroyuki Shinnou, Rui Cao, Jing Bai, and Wen Ma. 2020. Document Classification by Word Embeddings of BERT. In Computational Linguistics, Communications in Computer and Information Science, pages 145–154, Singapore. Springer.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv preprint arXiv:1903.12136.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

James Yi Tian, Alexander P Kreuzer, Pai-Hung Chen, and Hans-Martin Will. 2019. WaLDORf: Wasteless Language-model Distillation On Reading-comprehension. arXiv preprint arXiv:1912.06638.

Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, and Kevin Gimpel. 2020. A Cross-Task Analysis of Text Span Representations. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 166–176, Online. Association for Computational Linguistics.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and Practical BERT Models for Sequence Labeling. arXiv preprint arXiv:1909.00100.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation. arXiv preprint arXiv:1908.08962.

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831–5837, Hong Kong, China. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jesse Vig. 2019. Visualizing Attention in Transformer-Based Language Representation Models. arXiv:1904.02679 [cs, stat].
Jesse Vig and Yonatan Belinkov. 2019. Analyzing the Structure of Attention in a Transformer Language Model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

David Vilares, Michalina Strzyz, Anders Søgaard, and Carlos Gómez-Rodríguez. 2020. Parsing as pretraining. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).

Elena Voita, Rico Sennrich, and Ivan Titov. 2019a. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4387–4397.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019b. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv preprint arXiv:1905.09418.

Elena Voita and Ivan Titov. 2020. Information-Theoretic Probing with Minimum Description Length. arXiv:2003.12298 [cs].

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019b. Do NLP Models Know Numbers? Probing Numeracy in Embeddings. arXiv preprint arXiv:1909.07940.

Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020a. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. arXiv:2002.01808 [cs].

Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. 2019a. StructBERT: Incorporating Language Structures into Pre-Training for Deep Language Understanding. arXiv:1908.04577 [cs].

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. arXiv preprint arXiv:2002.10957.

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2020c. KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv:1911.06136 [cs].

Yile Wang, Leyang Cui, and Yue Zhang. 2020d. How Can BERT Help Lexical Semantics Tasks? arXiv:1911.02929 [cs].

Zihan Wang, Stephen Mayhew, Dan Roth, et al. 2019b. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. arXiv preprint arXiv:1912.07840.

Alex Warstadt and Samuel R. Bowman. 2020. Can neural networks acquire a structural bias from raw linguistic data? In Proceedings of the 42nd Annual Virtual Meeting of the Cognitive Science Society, Online.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, et al.
2019. Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs. InProceed- ings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2870–2880. Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT Make Any Sense?Interpretable Word Sense Dis- ambiguation with Contextualized Embeddings. arXiv preprint arXiv:1909.10430. Sarah Wiegreffe and Yuval Pinter. 2019. Atten- tion is not not Explanation. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th In- ternational Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 11– 20, Hong Kong, China. Association for Compu- tational Linguistics. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. Hugging- Face’s Transformers: State-of-the-Art Natural Language Processing.arXiv:1910.03771 [cs]. Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019a. Pay Less At- tention with Lightweight and Dynamic Convolu- tions. InInternational Conference on Learning Representations. Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019b. Conditional BERT Contextual Augmentation. InICCS 2019: Com- putational Science – ICCS 2019, pages 84–95. Springer. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.arXiv preprint arXiv:1609.08144. Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed Masking: Parameter-free Prob- ing for Analyzing and Interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Com- putational Linguistics. Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. BERT-of-Theseus: Compressing BERT by Progressive Module Re- placing.arXiv preprint arXiv:2002.02925. Junjie Yang and Hai Zhao. 2019. Deepening Hid- den Representations from Pre-Trained Language Models for Natural Language Understanding. arXiv:1911.01940 [cs]. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregres- sive Pretraining for Language Understanding. arXiv:1906.08237 [cs]. Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. InProceedings of the 58th Annual Meet- ing of the Association for Computational Lin- guistics, pages 8413–8426, Online. Association for Computational Linguistics. Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor,Tomas Kocisky,Mike Chrzanowski, Lingpeng Kong, Angeliki Lazari- dou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and Evaluating Gen- eral Linguistic Intelligence.arXiv:1901.11373 [cs, stat]. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes.arXiv preprint arXiv:1904.00962, 1(5). Ali Hadi Zadeh and Andreas Moshovos. 2020. 
GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference. arXiv:2005.03842 [cs, stat].

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020. Semantics-aware BERT for Language Understanding. In AAAI 2020.

Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. 2019. Extreme Language Model Compression with Optimal Subwords and Shared Projections. arXiv preprint arXiv:1909.11687.

Yiyun Zhao and Steven Bethard. 2020. How does BERT's attention change when you fine-tune? An analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4729–4747, Online. Association for Computational Linguistics.

Wenxuan Zhou, Junyi Du, and Xiang Ren. 2019. Improving BERT Fine-tuning with Embedding Normalization. arXiv preprint arXiv:1911.03918.

Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. 2020. Evaluating Commonsense in Pre-Trained Language Models. In AAAI 2020.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2019. FreeLB: Enhanced Adversarial Training for Language Understanding. arXiv:1909.11764 [cs].