Paper deep dive

Are Emergent Abilities of Large Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

Year: 2023 · Venue: NeurIPS 2023 · Area: Training Dynamics · Type: Empirical · Embeddings: 43

Models: GPT-3/InstructGPT family, PaLM family, Various vision architectures (ResNet, VGG, ViT)

Abstract

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities; (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 8:05:53 PM

Summary

The paper challenges the claim that large language models (LLMs) possess 'emergent abilities'—defined as sharp, unpredictable performance gains at scale. The authors argue that these abilities are a 'mirage' resulting from the researcher's choice of nonlinear or discontinuous evaluation metrics (e.g., Accuracy, Multiple Choice Grade) rather than fundamental changes in model behavior. By switching to linear or continuous metrics (e.g., Token Edit Distance, Brier Score) and improving statistical resolution, the authors demonstrate that model performance improves smoothly and predictably with scale.

Entities (5)

BIG-Bench · benchmark · 100%

GPT-3 · large-language-model · 100%

Rylan Schaeffer · researcher · 100%

Emergent Abilities · phenomenon · 95%

Metric Choice · methodology · 95%

Relation Signals (3)

Nonlinear Metrics → induces → Emergent Abilities

confidence 95% · Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities

Emergent Abilities → isattributedto → Metric Choice

confidence 95% · emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale

GPT-3 → evaluatedon → BIG-Bench

confidence 90% · We meta-analyze published benchmarks [28, 33] to reveal that emergent abilities only appear for specific metrics

Cypher Suggestions (2)

Identify metrics that correlate with emergent abilities · confidence 95% · unvalidated

MATCH (m:Metric)-[:PRODUCES]->(e:Ability {name: 'Emergent Ability'}) RETURN m.name, m.type

Find all models associated with claimed emergent abilities · confidence 90% · unvalidated

MATCH (m:Model)-[:EXHIBITS]->(e:Ability {name: 'Emergent Ability'}) RETURN m.name

Full Text

42,595 characters extracted from source content.

Are Emergent Abilities of Large Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo

Computer Science, Stanford University

Preprint. Under review. arXiv:2304.15004v2 [cs.AI] 22 May 2023

Abstract

Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance. We present our alternative explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how to choose metrics to produce never-before-seen seemingly emergent abilities in multiple vision tasks across diverse deep networks. Via all three analyses, we provide evidence that alleged emergent abilities evaporate with different metrics or with better statistics, and may not be a fundamental property of scaling AI models.

1 Introduction

Emergent properties of complex systems have long been studied across disciplines, from physics to biology to mathematics. The idea of emergence was popularized by Nobel Prize-winning physicist P.W. Anderson’s “More Is Different” [1], which argues that as the complexity of a system increases, new properties may materialize that cannot be predicted even from a precise quantitative understanding of the system’s microscopic details. Recently, the idea of emergence gained significant attention in machine learning due to observations that large language models (LLMs) such as GPT [3], PaLM [6] and LaMDA [30] exhibit so-called “emergent abilities” [33, 8, 28, 3] (Fig. 1).

The term “emergent abilities of LLMs” was recently and crisply defined as “abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models” [33]. Such emergent abilities were first discovered in the GPT-3 family [3]. Subsequent work emphasized the discovery, writing that “[although model] performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably and abruptly at scale” [8]. These quotations collectively identify the two defining properties of emergent abilities in LLMs:

1. Sharpness, transitioning seemingly instantaneously from not present to present

2. Unpredictability, transitioning at seemingly unforeseeable model scales

Figure 1: Emergent abilities of large language models. Model families display sharp and unpredictable increases in performance at specific tasks as scale increases. Source: Fig. 2 from [33].

These emergent abilities have garnered significant interest, raising questions such as: What controls which abilities will emerge? What controls when abilities will emerge? How can we make desirable abilities emerge faster, and ensure undesirable abilities never emerge?
These questions are especially pertinent to AI safety and alignment, as emergent abilities forewarn that larger models might one day, without warning, acquire undesired mastery over dangerous capabilities [29, 10, 17, 18].

In this paper, we call into question the claim that LLMs possess emergent abilities, by which we specifically mean sharp and unpredictable changes in model outputs as a function of model scale on specific tasks. Our doubt stems from the observation that emergent abilities seem to appear only under metrics that nonlinearly or discontinuously scale any model’s per-token error rate. For instance, as we later show, >92% of emergent abilities on BIG-Bench tasks [28] (hand-annotated by [32]) appear under either of these two metrics:

$$\text{Multiple Choice Grade} \overset{\text{def}}{=} \begin{cases} 1 & \text{if highest probability mass on correct option} \\ 0 & \text{otherwise} \end{cases}$$

$$\text{Exact String Match} \overset{\text{def}}{=} \begin{cases} 1 & \text{if output string exactly matches target string} \\ 0 & \text{otherwise} \end{cases}$$

This raises the possibility of an alternative explanation for the origin of LLMs’ emergent abilities: sharp and unpredictable changes might be induced by the researcher’s choice of measurement, even though the model family’s per-token error rate changes smoothly, continuously and predictably with increasing scale. Specifically, our alternative posits that emergent abilities are a mirage caused primarily by the researcher choosing a metric that nonlinearly or discontinuously deforms per-token error rates, and secondarily by possessing too few test data to accurately estimate the performance of smaller models, thereby causing smaller models to appear wholly unable to perform the task.

To communicate our alternative explanation, we present it as a simple mathematical model and demonstrate how it quantitatively reproduces the evidence offered in support of emergent abilities of LLMs. We then test our alternative explanation in three complementary ways:

1. We make, test and confirm three predictions based on our alternative hypotheses using the InstructGPT [24] / GPT-3 [3] model family.

2. We meta-analyze published benchmarks [28, 33] to reveal that emergent abilities only appear for specific metrics, not for model families on particular tasks, and that changing the metric causes the emergence phenomenon to evaporate.

3. We induce never-before-seen, seemingly emergent abilities in multiple architectures across various vision tasks by intentionally changing the metrics used for evaluation.

[Figure 2 panels, all plotted against Model Parameters: (A) CE (Per-Token); (B) Prob Correct (Per-Token); (C) Accuracy, with inset “Published Emergent Ability”; (D) Multiple Choice Grade, with inset “Published Emergent Ability”; (E) − Number of Incorrect Tokens; (F) − Brier Score. Panels C and D are grouped under “Emergent Abilities”, panels E and F under “No Emergent Abilities”. Arrow labels: Nonlinearly score LLM outputs, Discontinuously score LLM outputs, Linearly score LLM outputs, Continuously score LLM outputs.]

Figure 2: Emergent abilities of large language models are created by the researcher’s chosen metrics, not unpredictable changes in model behavior with scale. (A) Suppose the per-token cross-entropy loss decreases monotonically with model scale, e.g., $\mathcal{L}_{CE}$ scales as a power law. (B) The per-token probability of selecting the correct token asymptotes towards 1. (C) If the researcher scores models’ outputs using a nonlinear metric such as Accuracy (which requires a sequence of tokens to all be correct), the metric choice nonlinearly scales performance, causing performance to change sharply and unpredictably in a manner that qualitatively matches published emergent abilities (inset). (D) If the researcher instead scores models’ outputs using a discontinuous metric such as Multiple Choice Grade (akin to a step function), the metric choice discontinuously scales performance, again causing performance to change sharply and unpredictably. (E) Changing from a nonlinear metric to a linear metric such as Token Edit Distance, scaling shows smooth, continuous and predictable improvements, ablating the emergent ability. (F) Changing from a discontinuous metric to a continuous metric such as Brier Score again reveals smooth, continuous and predictable improvements in task performance. Consequently, emergent abilities are created by the researcher’s choice of metrics, not fundamental changes in model family behavior on specific tasks with scale.

2 Alternative Explanation for Emergent Abilities

How might smooth, continuous, predictable changes in model family performance appear sharp and unpredictable? The answer is that the researcher’s choice of a nonlinear or discontinuous metric can distort the model family’s performance to appear sharp and unpredictable.

To expound, suppose that within a model family, the test loss falls smoothly, continuously and predictably with the number of model parameters. One reason to believe this is the phenomenon known as neural scaling laws: empirical observations that deep networks exhibit power law scaling in the test loss as a function of training dataset size, number of parameters or compute [13, 27, 11, 16, 9, 12, 15, 34, 14, 7, 26]. For concreteness, suppose we have a model family of different numbers of parameters $N > 0$ and assume that each model’s per-token cross entropy falls as a power law with the number of parameters $N$ for constants $c > 0$, $\alpha < 0$ (Fig. 2A):

$$\mathcal{L}_{CE}(N) = \left( \frac{N}{c} \right)^{\alpha}$$

To be clear, we do not require this particular functional form to hold; rather, we use it for illustrative purposes. Let $V$ denote the set of possible tokens, $p \in \Delta^{|V|-1}$ denote the true but unknown probability distribution, and $\hat{p}_N \in \Delta^{|V|-1}$ denote the $N$-parameter model’s predicted probability distribution. The per-token cross entropy as a function of number of parameters $N$ is:

$$\mathcal{L}_{CE}(N) \overset{\text{def}}{=} - \sum_{v \in V} p(v) \log \hat{p}_N(v)$$

In practice, $p$ is unknown, so we substitute a one-hot distribution of the observed token $v^*$:

$$\mathcal{L}_{CE}(N) = - \log \hat{p}_N(v^*)$$

A model with $N$ parameters then has a per-token probability of selecting the correct token (Fig. 2B):

$$p(\text{single token correct}) = \exp\left( -\mathcal{L}_{CE}(N) \right) = \exp\left( -(N/c)^{\alpha} \right)$$

Suppose the researcher then chooses a metric that requires selecting $L$ tokens correctly. For example, our task might be $L$-digit integer addition, and a model’s output is scored 1 if all $L$ output digits exactly match all target digits with no additions, deletions or substitutions, 0 otherwise. If the probability each token is correct is independent¹, the probability of scoring 1 is:

$$\text{Accuracy}(N) \approx p_N(\text{single token correct})^{\text{num. of tokens}} = \exp\left( -(N/c)^{\alpha} \right)^{L}$$

This choice of metric nonlinearly scales performance with increasing token sequence length. When plotting performance on a linear-log plot, one sees a sharp, unpredictable emergent ability on longer sequences (Fig. 2C) that closely matches claimed emergent abilities (inset). What happens if the researcher switches from a nonlinear metric like Accuracy, under which the per-token error rate scales geometrically in target length (App. A.3), to an approximately linear metric like Token Edit Distance, under which the per-token error rate scales quasi-linearly in target length (App. A.2)?

$$\text{Token Edit Distance}(N) \approx L \left( 1 - p_N(\text{single token correct}) \right) = L \left( 1 - \exp\left( -(N/c)^{\alpha} \right) \right)$$

The linear metric reveals smooth, continuous, predictable changes in model performance (Fig. 2E).
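
The mathematical model above can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: the constants `C` and `ALPHA` and the function names are assumptions chosen for demonstration, since the paper leaves $c$ and $\alpha$ unspecified. It evaluates the same smoothly improving per-token probability under both the exact-match Accuracy and the Token Edit Distance formulas.

```python
import numpy as np

# Hypothetical constants for illustration only; the paper does not fix c or alpha.
C = 1e9
ALPHA = -0.5


def per_token_cross_entropy(n_params):
    """Power-law per-token loss L_CE(N) = (N / c)^alpha."""
    return (np.asarray(n_params, dtype=float) / C) ** ALPHA


def p_single_token_correct(n_params):
    """exp(-L_CE(N)): probability that a single token is correct."""
    return np.exp(-per_token_cross_entropy(n_params))


def accuracy(n_params, length):
    """Exact-match score: all `length` tokens correct, assuming independence."""
    return p_single_token_correct(n_params) ** length


def token_edit_distance(n_params, length):
    """Approximate expected number of incorrect tokens: L * (1 - p)."""
    return length * (1.0 - p_single_token_correct(n_params))


scales = np.logspace(8, 11, 7)  # 1e8 ... 1e11 parameters
for n in scales:
    print(f"N={n:.1e}  p={p_single_token_correct(n):.3f}  "
          f"acc(L=5)={accuracy(n, 5):.2e}  "
          f"edit(L=5)={token_edit_distance(n, 5):.2f}")
```

Under these assumed constants, the length-5 accuracy stays close to zero at the smaller scales and rises only at the largest ones, while the edit distance shrinks gradually across the whole range, even though both are computed from the same per-token probability.
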
Similarly, if the researcher uses a discontinuous metric like Multiple Choice Grade, the researcher can find emergent abilities (Fig. 2D), but switching to a continuous metric like Brier Score removes the emergent ability (Fig. 2F). In summary, sharp and unpredictable changes with increasing scale can be fully explained by three interpretable factors: (1) the researcher choosing a metric that nonlinearly or discontinuously scales the per-token error rate, (2) having insufficient resolution to estimate model performance in the smaller parameter regime, with resolution² set by 1/test dataset size, and (3) insufficiently sampling the larger parameter regime.

3 Analyzing InstructGPT/GPT-3’s Emergent Arithmetic Abilities

Previous papers prominently claimed the GPT [3, 24] family³ displays emergent abilities at integer arithmetic tasks [8, 28, 33] (Fig. 2E). We chose these tasks as they were prominently presented [3, 8, 28, 33], and we focused on the GPT family due to it being publicly queryable. As explained mathematically and visually in Sec. 2, our alternative explanation makes three predictions:

1. Changing the metric from a nonlinear or discontinuous metric (Fig. 2CD) to a linear or continuous metric (Fig. 2EF) should reveal smooth, continuous, predictable performance improvement with model scale.

2. For nonlinear metrics, increasing the resolution of measured model performance by increasing the test dataset size should reveal smooth, continuous, predictable model improvements commensurate with the predictable nonlinear effect of the chosen metric.

3. Regardless of metric, increasing the target string length should predictably affect the model’s performance as a function of the length-1 target performance: approximately geometrically for accuracy and approximately quasilinearly for token edit distance.

¹ While the independence assumption is not true, the approximation yields results qualitatively matching the observed emergence claims.

² Resolution is defined as “The smallest interval measurable by a scientific instrument; the resolving power.”

³ As of 2023-03-15, 4 models with 350M, 1.3B, 6.7B, 175B parameters are available via the OpenAI API.

[Figure 3 panels: x-axis Model Parameters (GPT-3 Model Parameters for the two tasks); y-axis Accuracy (top row) and − Token Edit Distance (bottom row); legends Target Str Len and Temp 0.0, 1.0.]

Figure 3: Claimed emergent abilities evaporate upon changing the metric. Left to Right: Mathematical Model, 2-Integer 2-Digit Multiplication Task, 2-Integer 4-Digit Addition Task. Top: When performance is measured by a nonlinear metric (e.g., Accuracy), the InstructGPT/GPT-3 [3, 24] family’s performance appears sharp and unpredictable on longer target lengths. Bottom: When performance is instead measured by a linear metric (e.g., Token Edit Distance), the family exhibits smooth, predictable performance improvements.

[Figure 4 panels: x-axis Model Parameters; y-axis Accuracy (log scale); legend Target Str Len 1 to 5.]

Figure 4: Claimed emergent abilities evaporate upon using better statistics. Left to Right: Mathematical Model, 2-Integer 2-Digit Multiplication Task, 2-Integer 4-Digit Addition Task. Based on the predictable effect Accuracy has on performance, measuring performance requires high resolution. Generating additional test data increases the resolution and reveals that even on Accuracy, the InstructGPT/GPT-3 family’s [3, 24] performance is above chance and improves in a smooth, continuous, predictable manner that qualitatively matches the mathematical model.

To test these predictions, we collected outputs from the InstructGPT/GPT-3 family on two tasks: 2-shot multiplication between two 2-digit integers and 2-shot addition between two 4-digit integers.

Prediction: Emergent Abilities Disappear With Different Metrics. On both arithmetic tasks, the GPT family displays emergent abilities if the target has 4 or 5 digits and if the metric is Accuracy (Fig. 3, top) [3, 8, 33]. However, if one changes from nonlinear Accuracy to linear Token Edit Distance while keeping the models’ outputs fixed, the family’s performance smoothly, continuously and predictably improves with increasing scale (Fig. 3, bottom). This confirms our first prediction and supports our alternative explanation that the source of emergent abilities is the researcher’s choice of metric, not changes in the model family’s outputs. We also observe that under Token Edit Distance, increasing the length of the target string from 1 to 5 predictably decreases the family’s performance in an approximately quasilinear manner, confirming the first half of our third prediction.

Prediction: Emergent Abilities Disappear With Better Statistics. We next tested our second prediction: that even on nonlinear metrics such as accuracy, smaller models do not have zero accuracy, but rather have non-zero above-chance accuracy commensurate with choosing to use accuracy as the metric.
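
The role of resolution in this prediction can be illustrated with a short calculation. The sketch below is not from the paper: the function names and the value `true_accuracy = 1e-4` are assumptions for illustration. It treats test examples as independent and asks how likely a model with a small but non-zero true accuracy is to be measured at exactly zero for a given test set size.

```python
def resolution(n_test):
    """Smallest non-zero accuracy a test set of size n_test can register."""
    return 1.0 / n_test


def prob_measured_zero(true_accuracy, n_test):
    """Chance that all n_test independent examples are scored 0."""
    return (1.0 - true_accuracy) ** n_test


true_accuracy = 1e-4  # hypothetical small-model accuracy, for illustration only
for n_test in (100, 10_000, 100_000):
    print(n_test, resolution(n_test), prob_measured_zero(true_accuracy, n_test))
```

With 100 test examples such a model would almost always be reported at zero accuracy, whereas with 100,000 examples a zero measurement becomes very unlikely, which is why enlarging the test set can reveal the smaller models' non-zero performance.
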
In order to accurately measure models’ accuracy, we increased the resolution by generating additional test data, and found that on both arithmetic tasks, all models in the InstructGPT/GPT-3 family achieve above-chance accuracy (Fig. 4). This confirms our second prediction. We also observe that as the target string length increases, the accuracy falls approximately geometrically with the length of the target string, confirming the second half of our third prediction. These results additionally demonstrate that the researcher’s choice of metric has the effect that one should predict accuracy to have, i.e., geometric decay with the target length.

4 Meta-Analysis of Claimed Emergent Abilities

Analyzing the GPT family is possible because the models are publicly queryable. However, other model families claimed to exhibit emergent abilities are not publicly queryable, nor are their generated outputs publicly available, meaning we are limited to analyzing the published results themselves [8, 33, 32]. Our alternative explanation makes two predictions.

1. At the “population level” of Task-Metric-Model Family triplets, emergent abilities should appear predominantly on specific metrics, not task-model family pairs, and specifically with nonlinear and/or discontinuous metrics.

2. On individual Task-Metric-Model Family triplets that display an emergent ability, changing the metric to a linear and/or continuous metric should remove the emergent ability.

To test these predictions, we used the claimed emergent abilities on BIG-Bench [28, 33] due to the benchmark being pertinent and publicly available.

Prediction: Emergent Abilities Should Appear with Metrics, not Task-Model Families. If emergent abilities are real, one should expect task-model family pairs to show emergence for all reasonable metrics. However, if our alternative explanation is correct, we should expect emergent abilities to appear only under certain metrics.
To test this, we analyzed on which metrics emergent abilities appear. To determine whether a task-metric-model family triplet exhibits a possible emergent ability, we used a metric from previous work [28]. Letting $y_i \in \mathbb{R}$ denote model performance at model scales $x_i \in \mathbb{R}$, sorted such that $x_i < x_{i+1}$, the emergence score is:

$$\text{Emergence Score}\left( \{(x_n, y_n)\}_{n=1}^{N} \right) \overset{\text{def}}{=} \frac{\operatorname{sign}(\arg\max_i y_i - \arg\min_i y_i)\,(\max_i y_i - \min_i y_i)}{\sqrt{\operatorname{Median}(\{(y_i - y_{i-1})^2\}_i)}} \tag{1}$$

We found that most metrics used in BIG-Bench have zero task-model family pairs that exhibit emergent abilities: of the 39 preferred metrics in BIG-Bench, at most 5 display emergence (Fig. 5A). Many of the 5 are nonlinear and/or discontinuous, e.g., Exact String Match, Multiple Choice Grade, ROUGE-L-Sum (App. A.4). Notably, because BIG-Bench often scores models on tasks using multiple metrics, the lack of emergent abilities under other metrics suggests that emergent abilities do not appear when model outputs are scored using other metrics.

Because emergence score only suggests emergence, we also analyzed hand-annotated task-metric-model family triplets [32], which revealed emergent abilities appear with 4/39 metrics (Fig. 5B), and 2 metrics account for >92% of claimed emergent abilities (Fig. 5C): Multiple Choice Grade and Exact String Match. Multiple Choice Grade is discontinuous, and Exact String Match is nonlinear.

[Figure 5A: x-axis Emergence Score (Defined in Srivastava et al. 2022) Over All BIG-Bench Tasks; y-axis Metric, with one row per metric: accuracy, alignment_score, average, average_log_probability, avg_acc, bias_level, bleu, bleurt, bleurt_diff, combined_bias, correct, correct_prob_mass, custom_score, difference_score, exact_str_match, f1, fairness, full, gender_bias_score, gender_minority_bias_score, gender_minority_stereotype_score, gender_stereotype_score, log10_p_dev, log_likelihood, macro_f1, main_words_match, mean_accuracy, multiple_choice_grade, normalized_aggregate_score, numeric_match_with_0_1_relative_error, overall, overall gender bias, overall_alpha_avg, overall_difference, pair-wise-accuracy, relative_score, rougeLsum, sequence_f1, targets_reached.]

Figure 5: Emergent abilities appear only for specific metrics, not task-model families. (A) Possible emergent abilities appear with at most 5 out of 39 BIG-Bench metrics. (B) Hand-annotated data by [32] reveals emergent abilities appear only under 4 preferred metrics. (C) >92% of emergent abilities appear under one of two metrics: Multiple Choice Grade and Exact String Match.

Figure 6: Changing the metric when evaluating task-model family pairs causes emergent abilities to disappear. Left: The LaMDA model family displays emergent abilities when measured under the discontinuous Multiple Choice Grade. Right: The LaMDA model family’s emergent abilities disappear when measured under a continuous BIG-Bench metric: Brier Score.

Prediction: Changing Metric Removes Emergent Abilities. To test our second prediction, we focused on the LaMDA family [30] because its outputs are available through BIG-Bench. For our analysis, we identified tasks on which LaMDA displays emergent abilities with Multiple Choice Grade, then asked whether LaMDA still displays emergent abilities on the same tasks with a different BIG-Bench metric: Brier Score [2]. Brier Score is a strictly proper scoring rule for predictions of mutually exclusive outcomes; for a binary outcome, the Brier Score simplifies to the mean squared error between the outcome and its predicted probability mass. LaMDA’s emergent abilities on the discontinuous Multiple Choice Grade disappeared when we changed the metric to the continuous Brier Score (Fig. 6).
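
The contrast between the two metrics can be seen on a toy two-option question. The sketch below is an illustration under stated assumptions, not the BIG-Bench implementation: the function names are hypothetical, the Brier Score uses the binary simplification described above (squared error between the outcome and its predicted probability mass), and the sweep of probabilities is invented to mimic a model family whose mass on the correct option grows smoothly with scale.

```python
import numpy as np


def multiple_choice_grade(option_probs, correct_idx):
    """1.0 if the most probable option is the correct one, else 0.0."""
    return float(np.argmax(option_probs) == correct_idx)


def brier_score_binary(p_correct):
    """Squared error between the outcome (1) and the mass on the correct option."""
    return (1.0 - p_correct) ** 2


# Hypothetical sweep: mass on the correct option (index 0) rises smoothly.
p_correct_sweep = np.linspace(0.30, 0.72, 8)
grades = [multiple_choice_grade([q, 1.0 - q], correct_idx=0) for q in p_correct_sweep]
briers = [brier_score_binary(q) for q in p_correct_sweep]

for q, g, b in zip(p_correct_sweep, grades, briers):
    print(f"p_correct={q:.2f}  grade={g:.0f}  brier={b:.3f}")
```

The grade flips from 0 to 1 in a single step once the correct option crosses one half, while the Brier Score falls by a small amount at every step of the same sweep.
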
These results support our alternative explanation that emergent abilities are induced by the chosen metric.

5 Inducing Emergent Abilities in Networks on Vision Tasks

To demonstrate how emergent abilities can be induced by the researcher’s choice of metric, we show how to produce emergent abilities in deep networks of various architectures: fully connected, convolutional, self-attentional. We focus on vision tasks because abrupt transitions in vision models’ capabilities have not been observed to the best of our knowledge; this is one reason why emergence in large language models is considered so interesting. For the convolutional example, see App. B.

Emergent Reconstruction of CIFAR100 Natural Images by Nonlinear Autoencoders. We first induce an emergent ability to reconstruct images in shallow (i.e., single hidden layer) nonlinear autoencoders trained on CIFAR100 natural images [19]. To emphasize that the sharpness of the metric is responsible for emergent abilities, and to show that sharpness extends to metrics beyond Accuracy, we intentionally define a discontinuous metric that measures a network’s ability to reconstruct a dataset as the average number of test data with squared reconstruction error below threshold $c$:

$$\text{Reconstruction}_c\left( \{x_n\}_{n=1}^{N} \right) \overset{\text{def}}{=} \frac{1}{N} \sum_{n} \mathbb{I}\left[ \|x_n - \hat{x}_n\|^2 < c \right] \tag{2}$$

where $\mathbb{I}(\cdot)$ denotes an indicator variable and $\hat{x}_n$ is the autoencoder’s reconstruction of $x_n$. The autoencoder family displays smoothly decreasing squared reconstruction error as the number of bottleneck units increases (Fig. 7B). Under our newly defined Reconstruction$_c$ metric and for particular choices of $c$, the autoencoder family exhibits a sharp and seemingly unpredictable image reconstruction ability (Fig. 7C) that qualitatively matches published emergent abilities (Fig. 7A).

[Figure 7 panels: (A) Normalized Score vs. Language Model Effective Parameters, titled “Published Emergent Ability”; (B) Test Mean Squared Error vs. Shallow Autoencoder Model Parameters, titled “No Emergent Ability”; (C) Test Reconstruction Ability vs. Shallow Autoencoder Model Parameters, titled “Metric-Induced Emergent Ability”, legend c = 0.003, 0.004.]

Figure 7: Induced emergent reconstruction ability in shallow nonlinear autoencoders. (A) A published emergent ability at the BIG-Bench Periodic Elements task [28]. (B) Shallow nonlinear autoencoders trained on CIFAR100 [19] display smoothly decreasing mean squared reconstruction error. (C) Using a newly defined Reconstruction$_c$ metric (Eqn. 2) induces an unpredictable change.

[Figure 8 panels: (A) Test Accuracy vs. GPT-3 Model Parameters, titled “Published Emergent Ability”; (B) Test Accuracy vs. Causal Transformer Parameters, titled “No Emergent Ability”, legend Num. All Correct 1; (C) Test Accuracy vs. Causal Transformer Parameters, titled “Metric-Induced Emergent Ability”, legend Num. All Correct 3.]

Figure 8: Induced emergent classification ability in autoregressive Transformers. (A) A published emergent ability on the MMLU benchmark [8]. (B) Autoregressive transformers trained to classify Omniglot images display increasing accuracy with increasing scale. (C) When accuracy is redefined as classifying all images correctly, a seemingly emergent ability appears.

Emergent Classification of Omniglot Characters by Autoregressive Transformers. We next induce emergent abilities in Transformers [31] trained to autoregressively classify Omniglot handwritten characters [20], in a setup inspired by recent work [5]: Omniglot images are embedded by convolutional layers, then sequences of embedded image-image class label pairs are fed into decoder-only transformers. We measure image classification performance on sequences of length $L \in [1, 5]$, again via subset accuracy: 1 if all $L$ images are classified correctly (Fig. 8B), 0 otherwise.
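
The thresholded reconstruction metric of Eqn. 2 can be sketched on synthetic data. The snippet below is an assumption-laden illustration rather than the paper's experiment: it uses Gaussian vectors instead of CIFAR100 images, replaces the autoencoder family with reconstructions whose error is scaled down smoothly (`error_scales`), and picks a hypothetical threshold `c = 0.1`.

```python
import numpy as np


def reconstruction_c(x, x_hat, c):
    """Fraction of examples whose squared reconstruction error is below c (Eqn. 2)."""
    sq_err = np.sum((x - x_hat) ** 2, axis=1)
    return float(np.mean(sq_err < c))


def mean_squared_error(x, x_hat):
    """Average squared reconstruction error over the dataset."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))


rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))       # synthetic stand-in for test images
noise = rng.normal(size=(1000, 16))   # fixed error direction per example

# Hypothetical "model family": each member shrinks the same error smoothly.
error_scales = np.geomspace(1.0, 0.01, 9)
c = 0.1  # hypothetical threshold

mse = [mean_squared_error(x, x + s * noise) for s in error_scales]
recon = [reconstruction_c(x, x + s * noise, c) for s in error_scales]

for s, m, r in zip(error_scales, mse, recon):
    print(f"error scale={s:.4f}  MSE={m:.4f}  Reconstruction_c={r:.3f}")
```

In this toy setup the mean squared error falls by a constant factor at each step, while the thresholded metric sits at zero for several steps and then climbs to one over a narrow range, mirroring the qualitative behavior described for Fig. 7B and 7C.
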
Causal transformers display a seemingly emergent ability to correctly classify Omniglot handwritten characters (Fig. 8C) that qualitatively matches published emergent abilities (Fig. 8A).

6 Related Work

Srivastava et al. [28] observed that while accuracy at a particular task can empirically appear sharp and unpredictable, cross entropy does not; the authors then hypothesized that emergent abilities may be partially attributed to the metric. Our paper converts their discussion into precise predictions, then quantitatively tests the predictions to reveal that: metric choice is likely wholly responsible for emergent abilities; well-known and widely-used metrics (including ones already used by [28]) capture graded improvements; emergent abilities do not appear only for tasks involving multiple steps, and indeed appear most commonly on the discontinuous Multiple Choice Grade; metric choice can be used to induce emergent abilities in a novel domain (vision) in diverse architectures and tasks. Caballero et al. [4] explain emergence by assuming a piece-wise power law functional form; under this view, emergent abilities are real, caused by a change in the governing power law. In contrast, our work suggests that emergent abilities are induced by the researcher, even under a single power law. Michaud et al. [25] posit that emergent abilities may be real under strong data assumptions.

7 Discussion

Our paper presents an alternative explanation for claimed emergent abilities of large language models. For a fixed task and a fixed model family, the researcher can choose a metric to create an emergent ability or choose a metric to ablate an emergent ability.
Ergo,emergent abilities may be creations of the researcher’s choices, not a fundamental property of the model family on the specific task.We emphasize that nothing in this paper should be interpreted as claiming that large language modelscannotdisplay emergent abilities; rather, our message is that previously claimed emergent abilities in [3, 8, 28, 33] might likely be a mirage induced by researcher analyses. Our paper has several implications. Firstly, a task and a metric are distinct and meaningful choices when constructing a benchmark. Secondly, when choosing metric(s), one should consider the met- ric’s effect on the per-token error rate and adapt their measuring process accordingly, e.g., if one chooses accuracy, one should make sure to have sufficient data to accurately measure accuracy to avoid the risk of drawing invalid scientific conclusions. Thirdly, when making claims about capa- bilities of large models, including proper controls is critical. In this particular setting, emergent abilities claims are possibly infected by a failure to control for multiple comparisons. In BIG-Bench alone, there are≥220 tasks,∼40metrics per task,∼10model families, for a total of∼10 6 task- metric-model family triplets, meaning probability thatnotask-metric-model family triplet exhibits an emergent ability by random chance might be small. Fourthly, scientific progress can be hampered when models and their outputs are not made public for independent scientific investigation. 9 References [1] Philip W Anderson. More is different: broken symmetry and the nature of the hierarchical structure of science.Science, 177(4047):393–396, 1972. [2] Glenn W Brier et al. Verification of forecasts expressed in terms of probability.Monthly weather review, 78(1):1–3, 1950. [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 
Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws. arXiv preprint arXiv:2210.14891, 2022.
[5] Stephanie CY Chan, Adam Santoro, Andrew Kyle Lampinen, Jane X Wang, Aaditya K Singh, Pierre Harvey Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems, 2022.
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[7] Aidan Clark, Diego De Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International Conference on Machine Learning, pages 4057–4086. PMLR, 2022.
[8] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. Predictability and surprise in large generative models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764, 2022.
[9] Mitchell A Gordon, Kevin Duh, and Jared Kaplan. Data and parameter scaling laws for neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5915–5922, 2021.
[10] Dan Hendrycks. Detecting emergent behavior. 2022.
[11] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
[12] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish.
Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.
[13] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[14] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[15] Andy L Jones. Scaling scaling laws with board games. arXiv preprint arXiv:2104.03113, 2021.
[16] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[17] Victoria Krakovna, Vikrant Varma, Ramana Kumar, and Mary Phuong. Refining the sharp left turn threat model, part 1: claims and mechanisms. 2022.
[18] Victoria Krakovna, Vikrant Varma, Ramana Kumar, and Mary Phuong. Refining the sharp left turn threat model, part 2: applying alignment techniques. 2022.
[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[20] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[21] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[23] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[24] Ryan Lowe and Jan Leike. Aligning language models to follow instructions. 2022.
[25] Eric J.
Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling, 2023.
[26] Oren Neumann and Claudius Gros. Scaling laws for a multi-agent reinforcement learning model. arXiv preprint arXiv:2210.00849, 2022.
[27] Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.
[28] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[29] Jacob Steinhardt. Future ml systems will be qualitatively different. 2022.
[30] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[32] Jason Wei. 137 emergent abilities of large language models. 2022.
[33] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
[34] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.

A Approximate Behavior of Metrics on Sequential Data

How do different metrics behave when used to measure autoregressive model outputs?
Precisely answering this question is tricky and possibly analytically unsolvable, so we provide an approximate answer here.

Notationally, we consider $N$ test data of length $L$ (here, length is measured in tokens) with targets denoted $t_n \overset{\text{def}}{=} (t_{n1}, t_{n2}, \ldots, t_{nL})$; the autoregressive model has a true-but-unknown per-token error probability of $\epsilon \in [0, 1]$, and the model outputs prediction $\hat{t}_n \overset{\text{def}}{=} (\hat{t}_{n1}, \hat{t}_{n2}, \ldots, \hat{t}_{nL})$. This assumes that the model’s per-token error probability is constant, which is empirically false, but modeling the complex dependencies of errors is beyond our scope.

A.1 Per-Token Error Probability is Resolution-Limited

Note that because we have $N$ test data, each of length $L$, our resolution for viewing the per-token error probability $\epsilon$ is limited by $1/(NL)$. Here, resolution refers to “the smallest interval measurable by a scientific instrument; the resolving power.” To explain what resolution means via an example, suppose one wants to measure a coin’s probability of yielding heads. After a single coin flip, only two outcomes are possible (H, T), so the resolution-limited probability of heads is either 0 or 1. After two coin flips, four outcomes are possible (HH, HT, TH, TT), so the resolution-limited probability of heads is now one of 0, 0.5, 1. After $F$ coin flips, we can only resolve the coin’s probability of yielding heads up to $1/F$. Consequently, we introduce a resolution-limited notation:

$$\lfloor a \rceil_b \overset{\text{def}}{=} a \text{ rounded to the nearest integer multiple of } 1/b \tag{3}$$

A.2 Token Edit Distance

We first consider an adaptation of the Levenshtein (string edit) distance for models that function on tokens rather than characters, an adaptation we term the token edit distance. The token edit distance between two token sequences $t_n, \hat{t}_n$ is defined as the integer number of additions, deletions or substitutions necessary to transform $t_n$ into $\hat{t}_n$ (or vice versa).

$$\text{Token Edit Distance}(t_n, \hat{t}_n) \overset{\text{def}}{=} \text{Num. Substitutions} + \text{Num. Additions} + \text{Num. Deletions} \tag{4}$$

$$= \sum_{\ell=1}^{L} \mathbb{I}[t_{n\ell} \neq \hat{t}_{n\ell}] + \text{Num. Additions} + \text{Num. Deletions} \tag{5}$$

$$\geq \sum_{\ell=1}^{L} \mathbb{I}[t_{n\ell} \neq \hat{t}_{n\ell}] \tag{6}$$

The expected token edit distance is therefore:

$$\mathbb{E}[\text{Token Edit Distance}(t_n, \hat{t}_n)] \geq \mathbb{E}\Big[\sum_{\ell=1}^{L} \mathbb{I}[t_{n\ell} \neq \hat{t}_{n\ell}]\Big] \tag{7}$$

$$= \sum_{\ell=1}^{L} p(t_{n\ell} \neq \hat{t}_{n\ell}) \tag{8}$$

$$\approx L \epsilon \tag{9}$$

The resolution-limited expected token edit distance is therefore:

$$\lfloor \mathbb{E}[\text{Token Edit Distance}(t_n, \hat{t}_n)] \rceil_{NL} \geq L \, \lfloor \epsilon \rceil_{NL} \tag{10}$$

From this, we see that the expected token edit distance scales approximately linearly with the resolution-limited per-token error probability. The real rate is slightly higher than linear because additions and deletions contribute an additional non-negative cost, but modeling this requires a model of how likely the model is to overproduce or underproduce tokens, which is something we do not currently possess.

A.3 Accuracy

$$\text{Accuracy}(t_n, \hat{t}_n) \overset{\text{def}}{=} \mathbb{I}[\text{No additions}] \, \mathbb{I}[\text{No deletions}] \prod_{\ell=1}^{L} \mathbb{I}[t_{n\ell} = \hat{t}_{n\ell}] \tag{11}$$

$$\approx \prod_{\ell=1}^{L} \mathbb{I}[t_{n\ell} = \hat{t}_{n\ell}] \tag{12}$$

As with the Token Edit Distance (App. A.2), we ignore how likely the language model is to overproduce or underproduce tokens because we do not have a good model of this process. Continuing along,

$$\mathbb{E}[\log \text{Accuracy}] = \sum_{\ell} \mathbb{E}[\log \mathbb{I}[t_{n\ell} = \hat{t}_{n\ell}]] \tag{13}$$

$$\leq \sum_{\ell} \log \mathbb{E}[\mathbb{I}[t_{n\ell} = \hat{t}_{n\ell}]] \tag{14}$$

$$\approx L \log(1 - \epsilon) \tag{15}$$

Taking an approximation that would make most mathematicians cry:

$$\mathbb{E}[\text{Accuracy}] \approx \exp(\mathbb{E}[\log \text{Accuracy}]) \tag{16}$$

$$= (1 - \epsilon)^L \tag{17}$$

This reveals that accuracy approximately falls geometrically with target token length. The resolution-limited expected accuracy is therefore:

$$\lfloor \mathbb{E}[\text{Accuracy}] \rceil_{NL} = \lfloor (1 - \epsilon)^L \rceil_{NL} \tag{19}$$

From this we can see that a nonlinear metric like Accuracy is affected significantly more by limited resolution, because Accuracy forces one to distinguish quantities that decay rapidly.

A.4 ROUGE-L-Sum

Another BIG-Bench metric [28] is ROUGE-L-Sum [23], a metric based on the longest common subsequence (LCS) between two sequences.
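For readers unfamiliar with the LCS, the following is a minimal sketch of the standard dynamic-programming computation on token lists. It is not taken from the paper or from any ROUGE implementation; the function name `lcs` and the token sequences are illustrative assumptions, and ROUGE-L-Sum itself adds further steps on top of this primitive.

```python
def lcs(a, b):
    """Return one longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])

    # Walk back through the table to recover the tokens themselves.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


# Illustrative (hypothetical) token sequences.
candidate = ["w1", "w2", "w3", "w4", "w5"]
reference_1 = ["w1", "w2", "w6", "w7", "w8"]
reference_2 = ["w1", "w3", "w8", "w9", "w5"]

lcs_1 = lcs(reference_1, candidate)
lcs_2 = lcs(reference_2, candidate)
```

The table costs O(mn) time and memory, which is negligible for the short target strings considered in this appendix.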
Section 3.2 of [23] gives the exact definition, but the key property is that ROUGE-L-Sum measures the “union” LCS, which means “stitching” together LCSs across the candidate and multiple references. As explained in the original paper: if the candidate sequence is $c = w_1 w_2 w_3 w_4 w_5$, and if there are two reference sequences $r_1 = w_1 w_2 w_6 w_7 w_8$ and $r_2 = w_1 w_3 w_8 w_9 w_5$, then $\mathrm{LCS}(r_1, c) = w_1 w_2$ and $\mathrm{LCS}(r_2, c) = w_1 w_3 w_5$, so the union-LCS of $c, r_1, r_2$ is $w_1 w_2 w_3 w_5$, with length 4. Intuitively, this disproportionately benefits models with smaller error rates because their mistakes can be “stitched” across multiple references; this is confirmed in simulation (Fig. 9).

[Figure 9 plots omitted. Two panels show ROUGE-L-Sum against 1 - Per-Token Error Probability, on linear and logarithmic axes, for target string lengths 4, 20, 36 and 1, 4, 7 references.]

Figure 9: ROUGE-L-Sum is a sharp metric. Simulations show that as the per-token error probability slightly increases (e.g., from 0.05 to 0.1), the ROUGE-L-Sum metric sharply falls.

B Inducing Emergent Abilities in Networks on Vision Tasks

B.1 Emergent Classification of MNIST Handwritten Digits by Convolutional Networks

We begin by inducing an emergent classification ability in a LeNet convolutional neural network family [22], trained on the MNIST handwritten digits dataset [21]. This family displays smoothly increasing test accuracy as the number of parameters increases (Fig. 10B). To emulate the accuracy metric used by emergence papers [8, 33, 28], we use subset accuracy: 1 if the network classifies $K$ out of $K$ (independent) test data correctly, 0 otherwise. Under this definition of accuracy, the model family displays an “emergent” ability to correctly classify sets of MNIST digits as $K$ increases from 1 to 5, especially when combined with sparse sampling of model sizes (Fig. 10C). This convolutional family’s emergent classification ability qualitatively matches published emergent abilities, e.g., at the BIG-Bench Grounded Mappings task [33] (Fig. 10A).

[Figure 10 plots omitted. Panel titles: “Published Emergent Ability” (test accuracy vs. GPT-3 model parameters), “No Emergent Ability” (test accuracy vs. LeNet model parameters, Num. All Correct = 1), and “Metric-Induced Emergent Ability” (test accuracy vs. LeNet model parameters, Num. All Correct = 6, 8, 10).]

Figure 10: Induced emergent MNIST classification ability in convolutional networks. (A) A published emergent ability from the BIG-Bench Grounded Mappings task [33]. (B) LeNet trained on MNIST [21] displays a predictable, commonplace sigmoidal increase in test accuracy as model parameters increase. (C) When accuracy is redefined as correctly classifying $K$ out of $K$ independent test data, this newly defined metric induces a seemingly unpredictable change.
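The sharpening effect of subset accuracy can be illustrated with a small sketch. It assumes, as a simplification not verified against the paper's experiments, that each test item is classified correctly independently with the same probability `p`; under that assumption subset accuracy is simply `p ** K`. The function names and the per-item accuracy values below are hypothetical and are not LeNet results.

```python
import random


def exact_subset_accuracy(p, k):
    """P(all k items correct) if each is correct independently with probability p."""
    return p ** k


def simulated_subset_accuracy(p, k, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() < p for _ in range(k))  # 1 only if all k draws succeed
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Hypothetical per-item accuracies that rise smoothly with model size.
    per_item = [0.50, 0.70, 0.85, 0.95, 0.99]
    for k in (1, 5, 10):
        row = [round(exact_subset_accuracy(p, k), 3) for p in per_item]
        print(f"K={k:2d}:", row)
```

With `K = 1` the printed row reproduces the smooth per-item curve, while larger `K` keeps the small models near zero and leaves only the largest ones with appreciable scores, which is the qualitative pattern described for Fig. 10C.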