← Back to papers

Paper deep dive

Continually self-improving AI

Zitong Yang

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 475

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/22/2026, 5:52:07 AM

Summary

This dissertation explores methods for creating continually self-improving AI systems by overcoming three key limitations: data-inefficient knowledge acquisition from small corpora, reliance on finite human-generated data for pretraining, and the confinement of training pipelines to human-discovered algorithms. The author proposes synthetic data approaches (EntiGraph) for knowledge amplification, synthetic bootstrapped pretraining to reduce human data dependency, and execution-guided test-time search to automate the discovery of learning algorithms.

Entities (5)

Stanford University · organization · 100%
Zitong Yang · researcher · 100%
EntiGraph · methodology · 98%
Synthetic Bootstrapped Pretraining · methodology · 98%
Execution-guided search · methodology · 95%

Relation Signals (4)

Zitong Yang authored Continually self-improving AI

confidence 100% · A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS... BY ZITONG YANG.

EntiGraph addresses Data-efficiency barrier

confidence 95% · First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach

Synthetic Bootstrapped Pretraining reduces reliance on Human-generated data

confidence 95% · Second, to reduce reliance on human data, we show that... the model can self-generate synthetic data

Execution-guided search transcends Human-engineered training paradigms

confidence 95% · Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time

Cypher Suggestions (2)

Identify the author and their work. · confidence 100% · unvalidated

MATCH (r:Researcher)-[:AUTHORED]->(d:Dissertation) RETURN r.name, d.name

Find all methodologies proposed in the dissertation. · confidence 90% · unvalidated

MATCH (m:Methodology)-[:ADDRESSES|REDUCES_RELIANCE_ON|TRANSCENDS]->(p) RETURN m.name, p.name
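Both suggestions are marked unvalidated, so before running them against a live graph it can help to check what the second query would actually compute. The sketch below is a minimal plain-Python stand-in (a tuple-based edge list instead of a Neo4j instance; the node names mirror the Relation Signals above) that emulates the `MATCH (m:Methodology)-[:ADDRESSES|REDUCES_RELIANCE_ON|TRANSCENDS]->(p)` pattern.

```python
# Plain-Python stand-in for the second Cypher suggestion. The edge list
# mirrors the Relation Signals extracted above; this is an illustration,
# not a Neo4j client.
EDGES = [
    # (source node, relationship type, target node)
    ("EntiGraph", "ADDRESSES", "Data-efficiency barrier"),
    ("Synthetic Bootstrapped Pretraining", "REDUCES_RELIANCE_ON", "Human-generated data"),
    ("Execution-guided search", "TRANSCENDS", "Human-engineered training paradigms"),
    ("Zitong Yang", "AUTHORED", "Continually self-improving AI"),  # not a methodology edge
]

# The alternation :ADDRESSES|REDUCES_RELIANCE_ON|TRANSCENDS becomes a set test.
METHODOLOGY_RELS = {"ADDRESSES", "REDUCES_RELIANCE_ON", "TRANSCENDS"}

def match_methodologies(edges):
    """Emulate MATCH (m:Methodology)-[:ADDRESSES|...]->(p) RETURN m.name, p.name."""
    return [(m, p) for m, rel, p in edges if rel in METHODOLOGY_RELS]

for m, p in match_methodologies(EDGES):
    print(f"{m} -> {p}")
```

The AUTHORED edge is filtered out, matching the intent of restricting the pattern to methodology relationships.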

Abstract

Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model's weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.
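The first approach (EntiGraph, per the figure caption in the full text: form a knowledge graph over extracted entities, then prompt an LM to describe it) gets its amplification from combinatorics: a small corpus yields few documents but many entity pairs. A minimal sketch, assuming entity extraction has already happened and using a mock function in place of the actual LM call (`mock_lm` and the entity list are illustrative, not from the dissertation):

```python
from itertools import combinations

def entigraph_synthesize(doc_entities, lm_describe):
    """Sketch of EntiGraph-style amplification: the number of entity pairs
    (and hence synthesized passages) grows quadratically in the number of
    entities, diversifying a small source corpus. `lm_describe` stands in
    for a prompted language-model call grounded in the source text."""
    return [lm_describe(a, b) for a, b in combinations(doc_entities, 2)]

# Hypothetical entities extracted from a short source document.
entities = ["EntiGraph", "knowledge graph", "continued pretraining", "QuALITY"]

# Stand-in for the LM: in practice this would prompt a model to describe
# how the two entities relate, as stated in the source corpus.
mock_lm = lambda a, b: f"How {a} relates to {b} in the source corpus."

synthetic_corpus = entigraph_synthesize(entities, mock_lm)
print(len(synthetic_corpus))  # 4 entities -> C(4, 2) = 6 passages
```

The quadratic pair count is what lets a corpus thousands of times smaller than typical continued-pretraining corpora still produce hundreds of millions of synthetic tokens.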

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

474,477 characters extracted from source content.


CONTINUALLY SELF-IMPROVING AI

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Committee in charge: Emmanuel Candès (Co-chair), Tatsunori Hashimoto (Co-chair), Percy Liang, Ruoming Pang

Zitong Yang · March 2026 · arXiv:2603.18073v1 [cs.AI] 18 Mar 2026

© 2026 by Zitong Yang. All Rights Reserved. Re-distributed by Stanford University under license with the author. This work is licensed under a Creative Commons Attribution 3.0 United States License. https://creativecommons.org/licenses/by/3.0/legalcode

This dissertation is online at: https://purl.stanford.edu/sq872bj8179

Each of Emmanuel Candès (Co-Advisor), Tatsunori Hashimoto (Co-Advisor), Percy Liang, and Ruoming Pang certifies: "I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy."

Approved for the Stanford University Committee on Graduate Studies: Kenneth Goodson, Vice Provost for Graduate Education. This signature page was generated electronically upon submission of this dissertation in electronic format.

Abstract

Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways.
First, although a model's weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.

Acknowledgments

Being a Ph.D. student in AI from 2022 to 2025 at Stanford, CA, is a once-in-a-lifetime experience. The rate at which the surrounding environment changes is exhilaratingly opportunistic yet mentally taxing. Every few months, we see the valley's capital frenzy occupying the news headlines. Every few weeks, we see a new model release with uncanny capability from a 2016-2020 perspective. Every few days, we see a new paper announcement delivering similar progress to ours. Amid an ever-changing appearance lies the unchanging reality.
For me, this reality is my wife, Angie, and my advisors, Emmanuel Candès and Tatsunori Hashimoto. Thank you, Angie, for being my emotional anchor over the past 10 years, for guiding me, comforting me, soothing me, empowering me, and encouraging me to explore far into the unseen by offering me a tranquil harbor where I can introspect. Thank you, Emmanuel, for shaping my approach to research and to life, for pushing me to pursue the problem I am truly excited about, for teaching me the history of science, for pushing me toward thoughtful work, and for showing me, by example, the unwavering will to explore. Thank you, Tatsu, for welcoming me to the world of natural language processing, for instilling in me the value of introspection, for walking me through every obstacle, for planting in me the seed of rigor, and for building a research lab so vibrant that I couldn't possibly hope for more.

I would like to thank Percy Liang and Ruoming Pang for being on my thesis committee. Thank you, Percy, for pushing me to always form opinions by running my own experiments. Thank you, Ruoming, for teaching me the perspective to view computer systems through the lens of hardware.

I would like to extend my gratitude to Berkeley, where my research started. I would like to thank Yi Ma and Jacob Steinhardt for bringing me to the world of machine learning. Thank Song Mei, Chong You, Yaodong Yu, and Yuexiang Zhai, for mentoring me hand-in-hand through my first few research projects. Thank Jiabao Yang for all the discussions of physics we had over the years. I am grateful to all the friends, collaborators, and mentors from my time at Berkeley: Christina Baek, Ryan Chan, Chih-Yuan Chiu, Xili Dai, Sara Fridovich-Keil, Zhen Guo, Zhiyue Hu, Jiantao Jiao, Druv Pai, Haozhi Qi, Shengbang Tong, Alex Wei, Eric Xia, Chiao-Yu Yang, Jiahao Yao, Chong You, Yue Zhang, Ruiqi Zhong, and Banghua Zhu.

Upon entering Stanford, Jacob Steinhardt encouraged me to explore both Sequoia and Gates. I am grateful to have two buildings I can call home. I would like to thank Sourav Chatterjee and Amir Dembo for enabling me to appreciate the wisdom of Kolmogorov. Thank John Duchi for teaching me the way of information theory and optimization. Thank Lihua Lei for mentoring my first paper at Stanford. Thank Andrea Montanari for chairing the department over the past three years and helping me with various requests. Thank Susie Ementon for working with me through all the unprecedented logistics hassles. Thank all the fellow residents of Sequoia over the years: Kelly Buchanan, Chen Cheng, Gary Cheng, Zhaomeng Cheng, John Cherian, Noah Cowan, Brice Huang, Andrew Ilyas, Wenlong Ji, Ying Jin, Joon Lee, Jinzhou Li, Shuangping Li, Gennie Ma, Yash Nair, Michael Salerno, Ziang Song, Asher Spector, Zihao Wang, Yao Zhang, and Tijana Zrnic.

I would like to thank the wonderful staff of CS336 for their help in getting started with language model research. Thank Neil Band for building the world's best reranker. Thank Chenglei Si for guiding me to the world of automating AI research. Thank Diyi Yang and Ludwig Schmidt for all the discussions about society and AI. Thank all the fellow residents of Gates for being around over the years: Jiaao Chen, Ian Covert, Yann Dubois, Mingjian Jiang, Xiang Lisa Li, Xinhao Li, Xuechen Li, Hong Liu, Ken Liu, Niklas Muennighoff, Sam Park, Chenglei Si, Yu Sun, Rohan Taori, Tristan Thrush, and Tianyi Zhang.

I would also like to thank mentors and friends from the industry. Thank Aonan Zhang and Dong Yin from Apple for hosting me and working with me through our research. Thank Aditya Menon and Sanjiv Kumar from Google for welcoming me and showing the power of retrieval. Thank Zonglin Li from Anthropic for all the AGI discussions over the years. Thank John Schulman for encouraging me to work on unorthodox problems. Finally, I would like to thank friends outside Stanford and work.
Thank you, Samy Jelassi, for working with me on hard problems and for our discussion over the years. Thank you, Sam Buchanan, Tianle Cai, Danqi Chen, Tianzhe Chu, Weijie Su, Xuyang Tian, Hongtao Yao, Liang Yuan, and Mingxuan Zuo for your friendship. Finally, thank my parents P. W. and G. Y. for everything.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Defining continually self-improving AI
  1.2 Continual knowledge acquisition
  1.3 Bootstrapping pretraining capabilities
  1.4 Towards AI-designed AI via test-time search
  1.5 Publications
  1.6 Related work
    1.6.1 Continual knowledge acquisition
    1.6.2 Bootstrapping pretraining capabilities
    1.6.3 Towards AI-designed AI
2 Continual knowledge acquisition
  2.1 Synthetic continued pretraining
  2.2 Our method
    2.2.1 Problem setup
    2.2.2 EntiGraph
  2.3 Experiment setup
  2.4 Main experiments
    2.4.1 Continued pretraining procedure
    2.4.2 Question-answering evaluations
    2.4.3 Instruction following evaluations
  2.5 Ablation Studies
    2.5.1 Using a Weaker Synthetic Data Generation LM
    2.5.2 Factuality and Lexical Diversity of EntiGraph Synthetic Corpus
    2.5.3 Datasets Beyond QuALITY
  2.6 Open-book experiments
  2.7 Theoretical analysis of EntiGraph scaling
    2.7.1 Toy model setup
    2.7.2 Rigorous upper and lower bound
    2.7.3 An analytical formula
  2.8 Discussion
    2.8.1 Limitations
    2.8.2 Conclusion
3 Bootstrapping pretraining capabilities
  3.1 Prelude: sample-efficient reasoning
    3.1.1 Discussion: pretraining as the foundation of capability
  3.2 Synthetic Bootstrapped Pretraining
  3.3 Method
    3.3.1 Data-constrained pretraining setup
    3.3.2 Synthetic bootstrapped pretraining
  3.4 Experiment setup
    3.4.1 Data, model, and evaluation
    3.4.2 Compute-matched comparison
  3.5 Experiment results
    3.5.1 Main benchmark performance
    3.5.2 Analysis of synthetic data
  3.6 Statistical foundations of SBP
    3.6.1 A hierarchical concept model for natural language
    3.6.2 From idealized models to language model reality
  3.7 Discussion
    3.7.1 Limitations
    3.7.2 Conclusion
4 Towards AI-designed AI via test-time search
  4.1 Towards automated AI research
  4.2 Automated idea executor
    4.2.1 Research environments for ideation
    4.2.2 System design
  4.3 Benchmarking LLM ideators and executors
    4.3.1 End-to-end ideation and execution
    4.3.2 Comparing ideators with the same executor
  4.4 Test-time scaling via budget forcing
    4.4.1 Test-time scaling
    4.4.2 Results
    4.4.3 Discussion
  4.5 Execution-guided evolutionary search
    4.5.1 Search scaffold
    4.5.2 Experiment results
    4.5.3 Comparison with best-of-N
    4.5.4 Analysis of generated ideas
  4.6 Discussion
    4.6.1 Limitations
    4.6.2 Conclusion
5 Conclusion: can AI be smarter than its creators?
  5.1 A parable from physics
  5.2 The gravitational field equation
  5.3 Einstein's cosmological problem
    5.3.1 The cosmological metric
    5.3.2 The Friedmann equations
    5.3.3 A dynamic universe
    5.3.4 The cosmological constant
  5.4 The theory was right
  5.5 Continually self-improving AI
  5.6 Future work
    5.6.1 Synthetic continued pretraining as an alternative to infinite context
    5.6.2 Synthetic data as data-dependent regularization
    5.6.3 Harness engineering
A Supplementary materials for Chapter 2
  A.1 Details on the QuALITY dataset
  A.2 Training details for the main experiments
  A.3 Task-specific finetuning for the QuALITY question set
  A.4 Additional details on open-book experiments
    A.4.1 Stage 1: offline indexing
    A.4.2 Stage 2: inference-time retrieval and reranking
    A.4.3 Hyperparameter tuning
  A.5 Proof of Theorem 1 and other analytical formulas
    A.5.1 More details on the mixture of exponential shape
  A.6 Synthetic data generation prompts
    A.6.1 EntiGraph prompts
    A.6.2 Rephrase prompts
  A.7 Additional evaluation details of main experiments
    A.7.1 QuALITY QA question set
    A.7.2 Closed-book summarization
B Supplementary materials for Chapter 3
  B.1 Additional details on synthetic bootstrapped pretraining
    B.1.1 SBP implementation details
    B.1.2 Ablation on data mixture ratio
    B.1.3 Random pairs and embedding analysis
  B.2 Additional analysis of synthesized samples
    B.2.1 Analyzing concepts in documents
    B.2.2 Factuality analysis
    B.2.3 Mideval prompts
    B.2.4 Synthesized documents from the 1T-scale experiment
  B.3 Additional pretraining results
    B.3.1 Two epochs validation
    B.3.2 Model scaling
  B.4 Supplementary materials for sample-efficient reasoning
    B.4.1 Initial collection of 59K samples
    B.4.2 Final selection of 1K samples
    B.4.3 Data ablations
    B.4.4 Dataset composition
    B.4.5 Training details
C Supplementary materials for Chapter 4
  C.1 Appendix
    C.1.1 Additional idea examples
    C.1.2 Code execution examples
  C.2 Reinforcement learning from execution reward
    C.2.1 Reward design and experiment setup
    C.2.2 Experiment results
  C.3 Supplementary materials for test-time scaling
    C.3.1 Evaluation determinism
D Supplementary materials for Chapter 5
  D.1 From Newton's law to Poisson's equation
  D.2 The metric tensor, stress-energy tensor, and spacetime curvature
  D.3 The Newtonian limit
  D.4 Deriving the Friedmann equations
  D.5 Solving the Friedmann equations

List of Tables

1.1 Provenance of thesis sections. Publication numbers refer to the list above.
2.1 Comparing the scale of modern continued pretraining (CPT) works with our small corpus setting. Prior work adapts LMs to broad domains with diverse, large-scale corpora. We aim to downscale CPT to small corpora; we use a corpus that is 10,000× smaller than the smallest modern corpus for domain-adaptive CPT.
2.2 EntiGraph Instruct examples.
2.3 Percentage of token n-grams in synthetic documents that overlap with the source document n-grams, for the EntiGraph and Rephrase synthetic data augmentations.
2.4 QuALITY question-answering accuracy and recall rate in the open-book retrieval-augmented generation (RAG) setting. EntiGraph CPT and Llama 3 8B Base are used in a RAG pipeline (cf. §2.6 for setup details). Recall@8 is defined as the proportion of questions for which the salient article appears in the top 8 reranked document chunks. GPT-4 and GPT-3.5 Oracle RAG provide an upper bound with a perfect retriever, by placing the entire relevant document in-context.
3.1 s1-32B is a sample-efficient reasoning model.
We evaluate s1-32B, Qwen, and Gemini. Other results are from the respective reports [Qwen et al., 2024, Team, 2024, OpenAI, 2024, DeepSeek-AI et al., 2025, Labs, 2025, Team, 2025]. # ex. = number of examples used for reasoning finetuning.
3.2 Compute-matched comparison of Synthetic Bootstrapped Pretraining (SBP) and oracle performance gains over the repetition baseline. On average, SBP delivers roughly 43% of the performance improvement in QA accuracy for the 3B model and 58% for the 6B model, attainable by an oracle with access to 20x more unique data.
3.3 Quantitative evaluation of documents sampled from the synthesizer at 200B-scale and 1T-scale. We can see that the synthesized documents preserve topics and are not simple duplicates.
3.4 Examples of latent concepts c inferred by an external LM (prompts provided in §B.2.1). From left to right, we provide a summary of the real document, the inferred latent concept, and a summary of the synthesized document.
4.1 Performance of our execution-guided search in comparison with the provided baselines and best human experts. The post-training task is to finetune a 1.5B model for math reasoning, and the metric is accuracy on the MATH validation set. The pre-training task is to train a 124M Transformer on FineWeb, and the metric is the training time to reach 3.28 validation loss.
4.2 Breakdown of hyper-parameter tuning vs algorithmic ideas throughout the entire execution-guided search. We report the percentage of each type among all generated ideas of each model (N = 500 ideas on GRPO and N = 800 ideas on nanoGPT). We also report the average and best performance for ideas under each category, where we use validation accuracy as the performance metric for GRPO and validation loss as the metric for nanoGPT. Bold numbers in every row indicate the best performance by each model. All models generate a substantial amount of algorithmic ideas apart from hyper-parameter changes, while Claude-4.5-Sonnet generates significantly more hyper-parameter ideas than other models.
4.3 Examples of successfully executed ideas on the GRPO environment, along with their accuracy on the MATH validation set. The baseline accuracy is 48.0% on this environment.
A.1 Summarization prompt for EntiGraph Instruct, Raw Instruct, and Rephrase Instruct.
A.2 Complete instruction following example used in Table 2.2 from Section 2.4.3.
B.1 Embedding similarity statistics. "Paired documents" refers to the SBP training pairs found by nearest neighbor search. "Random documents" refers to randomly paired documents. "Generated documents (SBP)" refers to the synthetic data generated by the SBP model at 200B-scale (3B). "Generated documents (Random)" refers to the synthetic data generated by the model trained on random pairs. All comparisons are based on the 10B dataset.
B.2 Categorize extracted concepts into domains.
B.3 Categorize extracted concepts into abstract types.
B.4 Categorize relations between real documents d1 and synthesized documents d2.
B.5 Estimation of the ratio of non-factual documents. We can see that the occurrence of factuality errors decays as SBP scales up.
B.6 Synthetic text whose factuality is undefined.
B.7 Factuality errors detected in synthetic text.
B.8 Performance comparison with 200B tokens repeated twice vs. 400B unique tokens for the 3B model. We can see that the two models yield similar performance.
B.9 6B-parameter model setup.
B.10 200B-scale experiments with model scaling. The first three columns are identical to Table 3.2. The last column shows the performance of training a 6B model under a 200B training token budget with 10B unique tokens.
B.11 s1K data ablations. We report 95% paired bootstrap confidence intervals for differences relative to the s1K model using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% confidence, the true difference between 59K-full and s1K is between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K.
B.12 Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given several questions from the domain.
B.13 Composition of full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct tokenizer [Qwen et al., 2024]. In addition to excluding our evaluation benchmark, AIME24, we also exclude AIME questions from 2022–2023 because we use these 90 questions during our development stage of s1-32B.
C.1 Additional examples on the GRPO environment. Ideas are generated by Claude-4.5-Opus during evolutionary search.
C.2 Additional examples on the GRPO environment. Ideas are generated by Claude-4.5-Sonnet during evolutionary search.

List of Figures

2.1 Synthetic continued pretraining (synthetic CPT) converts a small source corpus into a large synthetic corpus that is amenable to learning via standard continued pretraining. We instantiate synthetic CPT using a synthetic data augmentation algorithm called EntiGraph, which forms a knowledge graph over entities extracted from documents, and then prompts an LM to synthesize a text-based representation of the graph.
2.2 Accuracy on the QuALITY question set Q_test (y-axis) as a function of the synthetic token count (x-axis). The accuracy of synthetic continued pretraining using the EntiGraph data augmentation algorithm (EntiGraph CPT) scales log-linearly up to 455M tokens.
2.3 Closed-book summarization: number of false claims (y-axis) versus number of salient claims (x-axis) normalized by the human summary.
2.4 The scaling properties of Synthetic CPT with the EntiGraph and Rephrase augmentations, comparing two synthetic data generators: GPT-4-Turbo and Llama 3.1 8B Instruct.
2.5 The scaling properties of Synthetic CPT using the EntiGraph augmentation on the Coursera Exam QA dataset.
2.6 A mixture-of-exponential function (2.2) closely fits the scaling trend of EntiGraph CPT with respect to synthetic token count.
3.1 s1K. s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces.
32 3.2 Data synthesis illustration of Synthetic Bootstrapped Pretraining (SBP): It first iden- tifies semantically similar documents (Step 1) and then trains a conditional model that generates one element of the pair from the other (Step 2). Finally, SBP applies the conditional model to the pretraining corpus itself to synthesize a new, vast corpus for joint training (Step 3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3 Training dynamics (200B-scale). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 xv 3.4 Comparison of original text with synthesized text variations. . . . . . . . . . . . . . 45 4.1 We build an automated idea executor involving Implementer, Scheduler, and Worker. We then use this automated executor to guide test-time search over model-generated ideas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Research environment abstraction. Left: abstract interface defining the search problem— context() provides what the LM sees, value() scores an idea by execution. Right: concrete implementation for AI research—ideas are patched into a sandboxed code- base and evaluated via automated execution. . . . . . . . . . . . . . . . . . . . . . . 55 4.3 Model performance comparison with self-execution (top row) vs GPT-5 execution (bottom row) on GRPO and nanoGPT environments. The baseline accuracy for GRPO is 0.480, and the baseline loss for nanoGPT is 3.255. The completion rate is high for most models, especially under self-execution. . . . . . . . . . . . . . . . . . . 57 4.4 Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks and vary test-time compute. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.5 Sequential and parallel test-time scaling. (a): Budget forcing shows clear scal- ing trends and extrapolates to some extent. 
For the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending “Wait” to its current reasoning trace. (b): For Qwen2.5-32B-Instruct we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance when majority voting across 2, 4, 8, 16, 32, and 64 of these. . . . . . . . . . . . . . . . . . . . . . . . 60 4.6 Best performance at each epoch when performing execution-guided search with dif- ferent models. For the nanoGPT environment (left), we use the reciprocal of the validation loss as the metric; for the GRPO environment (right), we use validation accuracy as the metric. Claude-4.5-Opus exhibits a scaling trend on both environ- ments and achieves the best performance on nanoGPT. Claude-4.5-Sonnet achieves the best performance on GRPO due to effective hyper-parameter tuning, but saturates early. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.7 Comparison between best-of-N and our execution-guided search under the same sam- pling budget. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.1 Hubble’s velocity–distance relation, reproduced from the data in Hubble [1929]. Black discs are 24 individual nebulae with estimated distances (solid line: least-squares fit, K = 465 km/s/Mpc). Open circles are 9 groups formed by combining nearby nebulae (dashed line: K = 513 km/s/Mpc). The red cross marks the mean of 22 additional nebulae whose distances could not be estimated individually. Both fits are consistent with a linear relation v = Kr passing through the origin. . . . . . . . . . . . . . . . . 73 xvi A.1 Histograms over the 265 QuALITY articles and books. (a) The token count of raw articles. (b) The number of extracted entities. (c) The token count of EntiGraph synthetic data (generated for each book). . . . . . . . . . . . . . . . . . . . . . . . . 
77 A.2 Accuracy on the QuALITY question set Q test (y-axis) as a function of the synthetic token count (x-axis). Comparison among EntiGraph CPT, Rephrase CPT, and QA SFT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.3 Accuracy Acc(M t ) with respect to time t, for V = 100 and p = 0.03. The mixture- of-exponentials functional form in (2.2) leads to three distinct regimes. . . . . . . . . 88 B.1 ScaNN system design for efficient distributed search. . . . . . . . . . . . . . . . . . . 100 B.2 Analysis of paired data at 200B-scale. Figure B.2(a): a histogram of 100K subsampled pairs grouped by their similarity score. Figure B.2(b): the fraction of duplicate pairs when we subsample 1K pairs around a specific similarity score. Figure B.2(c): same as B.2(b) but showing the fraction of relevant documents. . . . . . . . . . . . . . . . 100 B.3 SBP performance with varying synthetic tokens at 200B-scale. . . . . . . . . . . . . 102 B.4 SBP performance with varying synthetic tokens at 1T-scale (3B). . . . . . . . . . . . 103 B.5 SBP performance with varying synthetic tokens for the 6B model at 1T-scale. . . . . 104 B.6 Comparison of original text with synthesized text variations. On the first row, the real document provides factual information about the 1984 film’s production and release. In contrast, the synthesized documents offer subjective commentary, opinions, and behind-the-scenes anecdotes about both the 1984 film and its 2010 remake. On the second row, the synthesized documents are continuations of the real document. . . . 112 C.1 Training curves of RL from execution reward. We plot the average reward per epoch in the upper row, and the max reward per epoch in the lower row. For the GRPO environment, the reward is the accuracy; for the nanoGPT environment, the reward is the reciprocal of the loss. The average reward increases, but not the max reward. 
Chapter 1

Introduction

1.1 Defining continually self-improving AI

In a single sentence: a continually self-improving AI is one that, once created, can autonomously and continually improve itself better than its human creators can improve it. We state two assumptions that scope the definition to the class of AI systems studied in this thesis.

(A1) The AI system is based on one or more neural networks, so that its knowledge is encoded in a well-defined set of parametric weights.

(A2) There exists a resource-intensive pretraining phase during which the system is created:

    ai_system = learning_algorithm(training_signal),    (1.1)

where training_signal is human knowledge (e.g., internet text), learning_algorithm encompasses things like architecture (e.g., the Transformer) and optimizer (e.g., gradient descent), and ai_system is the resulting model.

These two assumptions clearly encompass the current large language model paradigm—Transformers trained via gradient descent on internet text—but they do not exclude non-Transformer architectures, non-gradient-descent optimizers, or non-textual training signals. The definition captures any parametric system that undergoes an expensive creation phase followed by continued operation. With these assumptions in place, we define a continually self-improving AI as one satisfying three properties. Note that the pretraining formula already implies that improvement is data-driven—grounded in a learning signal rather than, say, hardware upgrades or manual weight surgery.

Definition 1 (Continually self-improving AI). Under Assumptions (A1)–(A2), an AI system is continually self-improving if it satisfies:

(P1) After the pretraining phase (1.1), the system continues to acquire new knowledge built into its parametric weights without catastrophically forgetting existing capabilities.
(P2) The system generates its own training_signal, and learning from these self-generated signals yields continued improvement beyond what existing human-generated signals provide.

(P3) The system autonomously determines what learning_algorithm to use to learn from its training signals.

The assumptions are not arbitrary—each one makes a specific aspect of self-improvement well-defined. Assumption (A1) makes Property (P1) meaningful: without weights, there is no substrate into which new knowledge can be written. Assumption (A2) establishes a pretraining phase that creates the system, making "improvement after the pretraining phase" a precise concept—the system continues to update after this expensive initial phase. The pretraining formula also makes explicit that three components exist—the model, the algorithm, and the data—each of which can be the target of improvement. Each property corresponds to one chapter of this thesis:

• Chapter 2 (Property P1). Improving what the model knows, by synthesizing diverse representations of a small corpus for continued pretraining.

• Chapter 3 (Property P2). Improving the system's fundamental pretraining capability, by exploiting inter-document correlations to strengthen pretraining itself.

• Chapter 4 (Property P3). Improving the process by which models are trained, by scaling test-time search from the token level to the idea level—generating research ideas, executing them, and learning from the results.

Altogether, these three paths sketch a future where AI systems continually improve themselves. While we do not claim that current AI has goals in a human sense, the drive toward self-improvement may be understood as an inherent goal for sufficiently capable systems—just as organisms evolve toward greater fitness without deliberate intent, AI systems that can improve their own training may represent a new kind of open-ended optimization. We next summarize each chapter's contribution in turn.
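The abstraction in (1.1) and the three properties above can be made concrete with a minimal sketch. Everything here is illustrative: the names (AISystem, pretrain, toy_algorithm) are not from the thesis; the point is only to make the three improvable components—model, algorithm, data—explicit in code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AISystem:
    weights: List[float]  # (A1): knowledge lives in parametric weights

# A learning algorithm maps a training signal (documents) to a system.
LearningAlgorithm = Callable[[List[str]], AISystem]

def pretrain(learning_algorithm: LearningAlgorithm,
             training_signal: List[str]) -> AISystem:
    """(A2): the expensive creation phase of Eq. (1.1)."""
    return learning_algorithm(training_signal)

# After pretraining, each property targets one component:
#   (P1) update `weights` from new data without forgetting  -> Chapter 2
#   (P2) self-generate `training_signal`                    -> Chapter 3
#   (P3) choose `learning_algorithm` autonomously           -> Chapter 4

def toy_algorithm(signal: List[str]) -> AISystem:
    # Stand-in "learning": one toy weight per training document.
    return AISystem(weights=[float(len(doc)) for doc in signal])

system = pretrain(toy_algorithm, ["internet text", "more text"])
print(len(system.weights))  # one toy weight per training document
```

The sketch is deliberately trivial; its only purpose is to show that "self-improvement" can mean editing any of the three arguments of Eq. (1.1), which is exactly how the chapters are organized.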
1.2 Continual knowledge acquisition

We first address Property (P1) of Definition 1: after pretraining, how can a language model continue learning from a small, specialized corpus? This is a data-limited problem: niche knowledge—proprietary datastores, specialized scientific domains, private corpora—inherently lacks the diverse internet representation that makes standard pretraining effective. Several approaches address this problem: knowledge editing [Meng et al., 2022, 2023] modifies individual facts but does not scale to corpus-level knowledge; retrieval-augmented generation [Lewis et al., 2020] keeps knowledge external and is limited by the context window; and collecting more real data is often infeasible for proprietary or niche domains. We pursue synthetic data generation because it operates at corpus scale, writes knowledge directly into the model's parametric weights, and remains applicable when real data cannot be obtained.

In Chapter 2, we tackle two challenges—data efficiency and catastrophic forgetting—through synthetic continued pretraining. At a high level, we convert a small corpus into a large, diverse synthetic corpus using a knowledge graph–inspired augmentation algorithm called EntiGraph, and then continue pretraining on the expanded data while mixing in a fraction of the original pretraining distribution to prevent forgetting.

The approach is effective: a model trained on a synthetically augmented corpus acquires the knowledge of the original documents and demonstrates it across a range of downstream tasks. However, the synthetic data generator we used was GPT-4—a model far more powerful than the student being trained. We deliberately allowed this distillation because the scientific question in Chapter 2 is about data efficiency—whether synthetic data can bridge the gap between a small corpus and the internet-like diverse representations that make learning effective—not about self-improvement.
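The EntiGraph idea described above—extract entities, treat entity pairs as edges of a knowledge graph, and prompt an LM to write text about each edge—can be sketched as follows. This is a hedged outline, not the thesis's actual implementation: extract_entities and lm are illustrative stand-ins, and the prompt wording is invented for the example.

```python
from itertools import combinations
from typing import Callable, List

def entigraph_sketch(document: str,
                     extract_entities: Callable[[str], List[str]],
                     lm: Callable[[str], str]) -> List[str]:
    """Sketch of EntiGraph-style augmentation: one synthetic passage
    per entity pair, grounded in the source document."""
    entities = extract_entities(document)
    synthetic_corpus = []
    for e1, e2 in combinations(entities, 2):  # edges of the entity graph
        prompt = (f"Based on the document:\n{document}\n"
                  f"Describe the relation between {e1} and {e2}.")
        synthetic_corpus.append(lm(prompt))
    return synthetic_corpus

# Toy stand-ins so the sketch runs end to end (not real models).
doc = "Ada wrote notes on Babbage's Analytical Engine."
fake_extract = lambda d: ["Ada", "Babbage", "Analytical Engine"]
fake_lm = lambda p: p.splitlines()[-1]  # echoes the relation request

corpus = entigraph_sketch(doc, fake_extract, fake_lm)
print(len(corpus))  # 3 entity pairs -> 3 synthetic passages
```

The pairwise enumeration is what drives the diversity: a corpus with n entities yields on the order of n² relation-focused passages, which is how a small source corpus can be amplified into a much larger synthetic one.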
But this raised an immediate follow-up: was the improvement genuine learning, or merely distillation from a stronger teacher? This question motivated the next chapter.

1.3 Bootstrapping pretraining capabilities

The release of OpenAI o1 [OpenAI, 2024] pulled the field toward reasoning models. A natural question is: how much data is needed to elicit reasoning capabilities from a pretrained model? In Chapter 3, §3.1, we show that the answer is strikingly little: training on just 1,000 carefully curated examples with reasoning traces suffices to build a competitive reasoning model. The implication is clear. A model cannot possibly acquire mathematical knowledge from 1,000 questions—the capability must already be latent in the pretrained weights. Pretraining is therefore the foundation, and finetuning merely elicits what is already there.

Two threads now converge—the centrality of pretraining and the desire for genuine, not distilled, self-improvement—addressing Property (P2) of Definition 1. Can a model trained on a fixed dataset generate synthetic data to train a better model? If so, this would constitute true self-improvement: no stronger teacher and no new information from the environment. This question is set in a data-limited regime as well, motivated by the approaching exhaustion of high-quality internet text [Villalobos et al., 2024]: we assume a fixed pool of unique documents and ask whether the model can extract more value from them than simple repetition provides. The main alternatives are architectural changes, which are orthogonal and complementary to data-driven gains; retrieval-augmented pretraining [Borgeaud et al., 2021, Khandelwal et al., 2020], which leverages related documents but keeps the additional signal external to the weights; and in-context pretraining [Shi et al., 2024b], which groups related documents into the same context window but is limited by context length.
We choose synthetic data because it creates new training signal from existing data and writes it directly into the model's weights via standard pretraining.

In Chapter 3, §3.2, we show that the answer is yes. Synthetic Bootstrapped Pretraining (SBP) trains a conditional data synthesizer that generates new training documents from existing ones—for instance, synthesizing a code tutorial from an arXiv paper, or a critical essay from a novel. By training on these synthetic documents alongside the original corpus, SBP improves pretraining perplexity in compute-matched comparisons, closing up to 60% of the gap to an oracle with access to unlimited unique data.

This result is qualitatively different from synthetic continued pretraining. The defining constraint is that distillation is forbidden: the data synthesizer is trained from the same pretraining corpus, not from a stronger external model. Without this constraint, self-improvement would be trivially achievable by distilling from a more capable teacher. SBP operates without any external teacher—the improvement stems from using the same data more efficiently by exploiting a weaker form of self-supervision latent in the pretraining corpus. Because SBP improves perplexity—a fundamental quantity that correlates with all downstream tasks—the gain is not confined to niche benchmarks but reflects a genuine improvement in the model's core capability.

1.4 Towards AI-designed AI via test-time search

AI research may be a domain where AI itself can deliver significant progress. Consider the scientific method in the Popperian tradition. Science proceeds in two steps: generating hypotheses and testing them with experiments to falsify them. For mathematics, the execution step is special: the chain-of-thought that AI models produce implicitly carries out the verification, which helps explain the rapid progress AI has made in mathematical reasoning.
For AI research, execution materializes entirely as code—writing training scripts, launching experiments, logging metrics—and AI systems are already remarkably capable at code generation. Meanwhile, idea generation takes place in natural language, which AI models handle fluently. The natural design, then, is to connect an AI idea generator to an AI experiment executor end-to-end.

Established approaches to algorithmic improvement—Neural Architecture Search [Zoph and Le, 2017, So et al., 2019], automated algorithm discovery [Real et al., 2020], and learned optimizers [Chen et al., 2023b]—are effective within their respective domains but operate within constrained, predefined spaces or require end-to-end differentiable pipelines. These are reasonable approaches, but they make it difficult to discover techniques outside the search space or scale to full training systems. We pursue research automation because it operates in an unbounded action space—ideas expressed in natural language, validated via code execution—using capabilities that language models already possess.

A second observation from the reasoning work reinforces this direction. In Chapter 4, §4.4, we show that even a crude intervention—suppressing the end-of-thinking token to force longer reasoning, a technique we call budget forcing—improves accuracy. If brute-force thinking at the token level already helps, systematically scaling search at the idea level—generating research ideas, executing them, and learning from the results—should yield further improvement.

In Chapter 4, we pursue this question by building an automated AI research system and applying test-time search at the idea level: generating research ideas, executing them automatically, and feeding results back to guide the next round of search. This represents another flavor of self-improvement: not improving training data or model capability, but improving the training algorithm itself.
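The generate–execute–feed-back loop just described can be sketched as a small search routine. The value/context split loosely mirrors the research environment abstraction in Figure 4.2; everything else—the function names, the loop shape, the toy "ideas" (learning rates standing in for natural-language ideas)—is an illustrative assumption, not the thesis's system.

```python
from typing import Callable, List, Tuple

Idea = float  # toy stand-in; in the thesis, ideas are natural-language text

def search(generate_ideas: Callable[[str, List[Tuple[Idea, float]]], List[Idea]],
           value: Callable[[Idea], float],
           context: str,
           epochs: int = 3,
           top_k: int = 2) -> Tuple[Idea, float]:
    """Generate ideas, execute each to get a score, feed the best back."""
    history: List[Tuple[Idea, float]] = []
    for _ in range(epochs):
        for idea in generate_ideas(context, history[:top_k]):
            history.append((idea, value(idea)))  # "execution" scores the idea
        history.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return history[0]

# Toy generator: propose variations around the best idea found so far.
def toy_generator(ctx: str, best: List[Tuple[Idea, float]]) -> List[Idea]:
    base = best[0][0] if best else 0.5
    return [base, base * 0.5, base * 1.5]

# Toy "execution": learning rates closer to 0.3 score higher.
toy_value = lambda lr: -abs(lr - 0.3)

best_idea, best_score = search(toy_generator, toy_value, "tune the learning rate")
print(best_idea)
```

The design point is that only value() needs to run real experiments; swapping the toy generator for an LM and the toy value for a sandboxed training run turns this loop into execution-guided search over research ideas.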
1.5 Publications

This thesis is based on the following four publications, listed in reverse chronological order (∗ denotes equal contribution):

1. Towards Execution-Grounded Automated AI Research. Chenglei Si∗, Zitong Yang∗, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto. arXiv preprint, 2026.

2. Synthetic Bootstrapped Pretraining. Zitong Yang∗, Aonan Zhang∗, Hong Liu, Tatsunori Hashimoto, Emmanuel Candès, Chong Wang, Ruoming Pang. International Conference on Learning Representations (ICLR), 2026.

3. S1: Simple Test-Time Scaling. Niklas Muennighoff∗, Zitong Yang∗, Weijia Shi∗, Xiang Lisa Li∗, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto. Empirical Methods in Natural Language Processing (EMNLP, Oral), 2025.

4. Synthetic Continued Pretraining. Zitong Yang∗, Neil Band∗, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto. International Conference on Learning Representations (ICLR, Oral), 2025.

Table 1.1 maps each thesis section to the publication it is based on. Sections marked "Original" were written for this thesis and do not appear in any prior publication.

Table 1.1: Provenance of thesis sections. Publication numbers refer to the list above.
Section | Source
Chapter 1: Introduction (§1.1–§1.5) | Original
§1.6 Related work | Adapted from Publications 1–4
Chapter 2: Continual knowledge acquisition | Publication 4
Chapter 3: Bootstrapping pretraining capabilities |
§3.1 Prelude: sample-efficient reasoning | Publication 3
§3.2–3.7 Synthetic bootstrapped pretraining | Publication 2
Chapter 4: Towards AI-designed AI |
§4.1–4.3 Automated AI research system | Publication 1
§4.4 Test-time scaling via budget forcing | Publication 3
§4.5–4.6 Evolutionary search & discussion | Publication 1
Chapter 5: Conclusion | Original
Appendix A: Supplementary for Chapter 2 | Publication 4
Appendix B: Supplementary for Chapter 3 |
§B.1–B.3 | Publication 2
§B.4 | Publication 3
Appendix C: Supplementary for Chapter 4 |
§C.1–C.2 | Publication 1
§C.3 | Publication 3
Appendix D: Supplementary for Chapter 5 | Original

1.6 Related work

We consolidate the related work for all three chapters in this section, organized by the corresponding chapter topic.

1.6.1 Continual knowledge acquisition

Several approaches address knowledge acquisition in language models—knowledge editing, retrieval-augmented generation, and collecting more real data—each with different trade-offs in scale, permanence, and applicability. We pursue synthetic data generation because it operates at corpus scale, writes knowledge into parametric weights, and applies when real data cannot be obtained; the following review provides evidence for this choice.

Synthetic generation of pretraining data

Recent approaches synthesize pretraining data using hierarchical prompting to promote dataset diversity. Eldan and Li [2023] prompt LLMs to generate stories containing sampled keywords and show that small LMs trained on their dataset generate fluent text. Gunasekar et al. [2023] synthesize textbooks and code exercises by conditioning on topic, target audience, and function names, later releasing strong LMs pretrained on synthetic data [Li et al., 2023b, Abdin et al., 2023, 2024b]; their datasets and prompts are not public.
Maini et al. [2024] prompt an LM to rephrase documents for pretraining, improving training efficiency. In contrast, we focus on teaching a pretrained LLM the knowledge of a small corpus. Mecklenburg et al. [2024] consider task-specific finetuning and propose a fact-based synthetic QA generation procedure but do not show improvement on generic instruction following. In contrast, we focus on teaching a model generally useful knowledge about a small corpus, untied to any particular downstream task. Ovadia et al. [2024] continually pretrain Llama 2–based LMs on synthetic paraphrases of Wikipedia articles but do not observe consistent improvements. We adapt the approach of Maini et al. [2024] and Mecklenburg et al. [2024] to our small corpus setting ("Rephrase baseline" in Chapter 2, §2.4). Our graph-based augmentation algorithm outperforms this baseline; we postulate that the improvement stems from enforcing diversity through entity-based generation.

Continued pretraining

Continual or continued pretraining methods [Gururangan et al., 2020] adapt pretrained LMs to broad target domains such as code, medicine, or mathematics by collecting massive datasets (often >100B tokens; see Table 2.1 for a survey) and applying causal language modeling recipes [Gupta et al., 2023, Ibrahim et al., 2024, Parmar et al., 2024]. We aim to extend continued pretraining to small, specialized domains such as proprietary datastores. Because standard continued pretraining proves ineffective on small corpora, we propose a knowledge graph–inspired approach to synthesize a diverse, related corpus more amenable to learning.

Knowledge editing

A related line of work updates LMs with small units of factual knowledge—e.g., (subject, relation, object) tuples. Zhu et al. [2020] study constrained fine-tuning to limit model complexity.
Later approaches localize where factual knowledge is stored in Transformers and update only those weights [Mitchell et al., 2022, Meng et al., 2022, 2023], or maintain an external memory of edits and prepend them as context during generation [Zhong et al., 2023b, Cohen et al., 2023]. Most related to our work is Akyürek et al. [2024], which first deduces implications of a factual edit and then finetunes on those implications. Unlike knowledge editing, which learns atomic, sentence-length facts, we aim to learn from a small corpus of documents.

Synthetic data generation

A rich literature uses neural networks to generate synthetic data. Many approaches stem from semi-supervised learning: self-training and pseudo-labeling improve models by iteratively training them on their own predictions [Scudder, 1965, Lee, 2013, Yalniz et al., 2019, Berthelot et al., 2019, Xie et al., 2020], and co-training uses two models to supervise each other [Blum and Mitchell, 1998, Balcan et al., 2004]. Before language models rose to prominence, few approaches attempted to synthesize inputs; one exception is membership query synthesis [Angluin, 1988, Schumann and Rehbein, 2019]. Contemporary works employ co-training [Lang et al., 2022] and self-training to improve language model performance, often on mathematical reasoning tasks [Huang et al., 2023, Gulcehre et al., 2023, Zhang et al., 2024a], or synthesize input-output pairs for instruction tuning by conditioning on a curated seed set [Wang et al., 2023b, Honovich et al., 2023, Taori et al., 2023, Peng et al., 2023, Yuan et al., 2024b, Li et al., 2024a].

Continual learning and pretraining

Continual learning stems from historical work on connectionist networks [McCloskey and Cohen, 1989, Ratcliff, 1990] and considers learning with tasks arriving online [Schlimmer and Fisher, 1986, Grossberg, 2012].
The central challenge is mitigating a neural network's "catastrophic forgetting" of previously encountered tasks [Robins, 1995, Goodfellow et al., 2015, Kemker et al., 2018]. Approaches include regularizing parameter updates to preserve important parameters [Nguyen et al., 2017, Zenke et al., 2017, Kirkpatrick et al., 2017], dynamically modifying the architecture [Rusu et al., 2016, Golkar et al., 2019], and recalling or replaying previous experiences [Rebuffi et al., 2017, Shin et al., 2017, Lopez-Paz and Ranzato, 2017]. Modern continued pretraining methods mitigate catastrophic forgetting by scaling parameter count [Ramasesh et al., 2022] and mixing in updates on pretraining data [Ouyang et al., 2022].

1.6.2 Bootstrapping pretraining capabilities

Given the self-improvement constraint—no external teacher—the main alternatives for improving pretraining are architectural changes and retrieval-augmented approaches. Both are reasonable and complementary; we choose synthetic generation because it creates new training signal from existing data and writes it directly into the model's weights. The following review situates this choice across three areas: LM pretraining, synthetic data for LMs, and retrieval-augmented LMs.

LM pretraining

Modern pretraining stems from a series of works including ELMo [Peters et al., 2018], ULMFiT [Howard and Ruder, 2018], and BERT [Devlin et al., 2019], which pretrain a neural network via an unsupervised objective and subsequently finetune it for a wide range of downstream tasks. The GPT series [Radford et al., 2018a,b, Brown et al., 2020, OpenAI et al., 2024] cemented next-token prediction as the pretraining objective applied to large-scale crawled webpages, as opposed to task-specific datasets (e.g., English-to-French translation).
In recent years, the size of pretraining corpora has grown rapidly, driven by massive web-crawled datasets: BERT [Devlin et al., 2019, Liu et al., 2020b], GPT-2 WebText [Radford et al., 2018b], CommonCrawl [Common Crawl, 2007], CCNet [Wenzek et al., 2019], T5 C4 [Raffel et al., 2020], the Pile [Gao et al., 2020], Gopher MassiveText [Rae et al., 2021], the Llama series [Touvron et al., 2023, Dubey et al., 2024a], RefinedWeb [Penedo et al., 2023], Dolma [Soldaini et al., 2024], DCLM-baseline [Li et al., 2024b], Nemotron-CC [Su et al., 2024], etc. While pretraining has proven tremendously successful, the rapid depletion of available internet text motivates a shift from acquiring more data to using existing data more effectively—an opportunity that SBP directly exploits.

Synthetic data

One way to overcome scarce high-quality web data is to pretrain [Gunasekar et al., 2023, Abdin et al., 2023, 2024b,a, Kimi Team, 2025] or continually pretrain [Ruan et al., 2025, Zweiger et al., 2025, Nguyen et al., 2025] LMs on synthetic data. Existing approaches rely on distillation from a powerful "teacher" LM that generates compressed knowledge for the "student" LM to learn [Hinton et al., 2015]. These teacher models must first undergo human alignment, which requires extensive annotations and preference data [Ouyang et al., 2022]. However, synthetic data from teacher LMs shows limited scaling: while such data can be as much as 7x more effective than real data [DatologyAI et al., 2025], performance quickly converges to that of the teacher LM [Busbridge et al., 2025]. In contrast, we consider the scenario where the sole source of world knowledge comes from a fixed set of pretraining documents (e.g., the internet) and algorithmically learn a data synthesizer with minimal human intervention (e.g., no generative teacher models or human-written prompts).
Our experimental setup therefore simulates a situation where LMs can self-boost their pretraining capability by refining their understanding of a fixed collection of pretraining documents.

Retrieval-augmented LMs

A natural class of methods that incorporates multiple documents together is retrieval-augmented generation (RAG) [Lample et al., 2019, Lewis et al., 2020]. Originally a test-time technique for domain-specific downstream tasks [Li et al., 2022], retrieval-augmented approaches have since been extended in scope: Borgeaud et al. [2021], Khandelwal et al. [2020], and Yang et al. [2023b] implement RAG at pretraining scale and show improved test perplexity; Guu et al. [2020] incorporate RAG at pretraining time by jointly training a retriever and the model itself for improved QA performance; Shi et al. [2024b] group related documents into the same context window for improved long-context capability. While RAG-based approaches enable the model to leverage rich inter-document correlations, they are fundamentally limited by the LM's context window. In contrast, SBP encodes correlations into synthetic data that can be iteratively learned by the LM one document at a time. Prior to embedding models that enable retrieving entire documents, Guu et al. [2018] retrieve neighboring pairs of sentences using Jaccard similarity and model the conditional distribution between them—an objective that resembles our conditional data synthesizer—but they do not perform pretraining experiments.

1.6.3 Towards AI-designed AI

As discussed in §1.4, established approaches to algorithmic improvement—Neural Architecture Search, automated algorithm discovery, and learned optimizers—are effective within their respective domains but operate within constrained search spaces or require end-to-end differentiability. The following review details these alternatives and the evidence for pursuing research automation in an unbounded action space instead.

Test-time scaling methods

As introduced in Chapter 4, §4.4.1, we differentiate two approaches to scaling test-time compute: parallel and sequential. Parallel methods generate multiple solution attempts independently and select the best outcome via specific criteria—choosing the most frequent response (majority voting) or the highest-scoring response under an external reward (best-of-N) [Brown et al., 2024, Irvine et al., 2023, Levi, 2024]. Sequential methods instead let the model generate solution attempts one after another, refining each attempt based on previous outcomes [Snell et al., 2024, Hou et al., 2025, Lee et al., 2025]. Tree-based search methods [Gandhi et al., 2024, Wu et al., 2024] offer a hybrid between sequential and parallel scaling—examples include Monte-Carlo Tree Search (MCTS) [Liu et al., 2024, Zhang et al., 2023b, Zhou et al., 2024, Choi et al., 2023] and guided beam search [Xie et al., 2023]. REBASE [Wu et al., 2024] uses a process reward model to balance exploitation and pruning during tree search, outperforming both sampling-based methods and MCTS. Reward models [Lightman et al., 2023, Wang et al., 2024a,c] play a key role in these methods and come in two variants: outcome reward models [Xin et al., 2024, Ankner et al., 2024], which assign a score to complete solutions and are useful in best-of-N selection, and process reward models [Lightman et al., 2023, Wang et al., 2024a, Wu et al., 2024], which assess individual reasoning steps and guide tree-based search.

AutoML

Our work connects to the AutoML literature. Neural Architecture Search (NAS) defines a constrained set of neural network operators and optimizes architectures based on validation performance through reinforcement learning [Zoph and Le, 2017, Zoph et al., 2017] or search [Liu et al., 2018, So et al., 2019]. Recent works also use LMs directly to propose architecture variants and implement them for validation [Liu et al., 2025, Cheng et al., 2025].
Beyond architectures, similar automatic optimizations have been applied to hyperparameter tuning Zhang et al. [2023a], discovering machine learning algorithms Real et al. [2020], improving post-training objectives Lu et al. [2024a], discovering better neural network optimizers Chen et al. [2023b], and designing agent scaffolds Hu et al. [2025]. In contrast, we tackle automated AI research in a fully open-ended setting without constraints on idea type. Our goal is to improve idea generation effectiveness, where natural-language ideas represent a higher level of abstraction than specific architecture variants or code optimizations.

LM-based research agents Recent works build LM-based research agents for accelerating scientific discovery, including AI research. AI-Scientist Lu et al. [2024b], Yamada et al. [2025], AI-Researcher Tang et al. [2025], and Agent Laboratory Schmidgall et al. [2025] are end-to-end research agents that use LMs to generate ideas and implement them through carefully designed agent scaffolds. These systems address open-ended AI research as we do but do not study how to learn from execution feedback to improve idea effectiveness. On more grounded benchmarks with clear performance metrics—MLE-Bench Chan et al. [2025], RE-Bench Wijk et al. [2024], and ML-Gym Nathani et al. [2025]—various works explore learning from execution feedback through search Toledo et al. [2025], Jiang et al. [2025] or RL Yang et al. [2025b] to optimize performance on targeted ML engineering tasks. While we also study algorithms for learning from execution feedback, we tackle open-ended research problems like pretraining and post-training rather than ML engineering tasks that depend heavily on feature engineering and hyperparameter tuning.

AI for research Apart from fully end-to-end automated AI research, many works study how to use LMs for specific components of the scientific research pipeline: literature review Asai et al.
[2024], Lála et al. [2023], idea generation Si et al. [2025b], Wang et al. [2024b], data analysis Majumder et al. [2025], Mitchener et al. [2025], experiment plan generation Goel et al. [2025], research code execution Starace et al. [2025], Hua et al. [2025], Tian et al. [2024], and paper reviewing Liang et al. [2024], Zhu et al. [2025]. Our work focuses on automated idea execution and learning from execution feedback, complementing the above efforts that improve other aspects of the research pipeline.

Execution grounding for code Learning from execution feedback has been explored in code generation: Zheng et al. [2024] curate data and train models to refine code from human or execution feedback; Gehring et al. [2025] use end-to-end RL to teach models to improve code based on execution feedback; Lavon et al. [2025] directly guide code generation with execution signals during inference. In contrast, we explore execution grounding for idea generation, where verification is more complicated and expensive.

Chapter 2

Continual knowledge acquisition

As established in Chapter 1, pretraining creates powerful models by absorbing knowledge from large-scale internet text. But what happens when the knowledge we need is not on the internet? Proprietary corpora, niche scientific domains, and private datastores contain valuable knowledge that appears rarely—or never—in the pretraining distribution. Standard continued pretraining on such small corpora is ineffective: the model cannot generalize from a compressed representation of knowledge, and data-inefficient phenomena such as the “reversal curse” [Berglund et al., 2023] and the requirement of thousands of examples per fact [Allen-Zhu and Li, 2024] make direct memorization unreliable. Several approaches address knowledge acquisition in language models (see §1.6.1 for a detailed review).
Knowledge editing methods [Meng et al., 2022, 2023] effectively update atomic facts—e.g., (subject, relation, object) tuples—but do not scale to corpus-level knowledge. Retrieval-augmented generation [Lewis et al., 2020] keeps knowledge external to the model and is fundamentally limited by the context window. Collecting more real data is often infeasible by definition for proprietary or niche domains. We pursue synthetic data generation because it operates at corpus scale, writes knowledge directly into the model’s parametric weights, and applies precisely when real data cannot be obtained. The open question is how to synthesize effectively—which is what this chapter addresses.

2.1 Synthetic continued pretraining

We propose to address this problem of acquiring knowledge from small corpora with synthetic continued pretraining. To illustrate, consider the problem of teaching an LM a new area of mathematics, succinctly documented by a small set of textbooks. Directly training the model on those textbooks is unlikely to be effective because of the limited volume of text (e.g., tens of thousands of words), and the model will struggle to generalize from this compressed representation of knowledge. In contrast, learning established mathematical areas like linear algebra is straightforward because a large-scale corpus with diverse knowledge representations is accessible: for example, online lecture notes, Stack Exchange discussions, or Python implementations of the singular value decomposition.

[Figure 2.1 depicts three example input documents (“The Blue Behemoth” by Leigh Brackett, “Cosmic Yo-Yo” by Ross Rocklynne, and “Defining Decay Down” by David Plotz) passing through (1) entity extraction and (2) relation analysis to produce a diverse synthetic corpus.] Figure 2.1: Synthetic continued pretraining (synthetic CPT) converts a small source corpus into a large synthetic corpus that is amenable to learning via standard continued pretraining. We instantiate synthetic CPT using a synthetic data augmentation algorithm called EntiGraph, which forms a knowledge graph over entities extracted from documents, and then prompts an LM to synthesize a text-based representation of the graph.

Synthetic continued pretraining bridges this gap by first converting a small, data-constrained domain into a synthetic corpus with diverse knowledge representations, and then continuing pretraining on it. One basic approach is to simply paraphrase or rewrite the source documents in multiple ways. However, we find that generic rephrasing does not bridge the gap in knowledge representation diversity. We repeatedly rephrase a small corpus and find that the value of incremental synthetic data quickly decreases, with downstream model performance scaling poorly. This failure stems from the lack of diversity in paraphrasing alone.
In the linear algebra example, online lecture notes and Stack Exchange discussions go beyond a simple rewrite of any textbook—they provide deeper analysis and application of the underlying concepts and techniques. We address this shortcoming with EntiGraph, an entity-centric augmentation algorithm. EntiGraph breaks down a text corpus into a list of entities and then uses an LM to describe relations among entities, iteratively “filling in” the knowledge graph underlying the corpus (Figure 2.1).

Our work operates in a data-limited regime. Niche knowledge—proprietary corpora, specialized scientific domains, private datastores—by definition does not enjoy the diverse internet representation that makes standard pretraining effective: there are no Stack Exchange threads, no blog posts, no alternative expositions. The goal is to study data efficiency: can synthetic data bridge the gap between a small niche corpus and the diverse representations that enable effective learning? We deliberately use a stronger model (GPT-4) as the synthetic data generator, because the scientific question here is about data efficiency, not self-sufficiency. The question of whether a model can bootstrap its own pretraining without an external teacher is deferred to Chapter 3.

To concretely instantiate this goal, we propose an experimental setting based on QuALITY [Pang et al., 2022], a reading comprehension dataset. This setup enables evaluating synthetic data generation methods for data-efficient learning without the high compute costs of pretraining from scratch. Specifically, we assume access to a collection of 265 books totaling 1.3M tokens. Our task is to synthesize a corpus such that continued pretraining on it enables a model to answer queries (e.g., multiple-choice QA or user instructions related to the book content) without access to the source texts.
In our main experiments (§2.6), we use EntiGraph to generate 455M synthetic tokens from 1.3M real tokens using GPT-4 [OpenAI et al., 2024]. We then continually pretrain Llama 3 8B [Dubey et al., 2024a] on these synthetic tokens and evaluate QA accuracy on the QuALITY questions. We observe log-linear scaling in accuracy as synthetic token count increases, up to 455M (§2.4.2). At the endpoint, synthetic continued pretraining with 455M EntiGraph tokens provides 80% of the accuracy gain of having source documents available at inference time (§2.6). Beyond QA, we also instruction tune the continually pretrained model and find that it can follow open-ended instructions (e.g., summarization) related to the QuALITY books (§2.4.3). To summarize, our key contributions are as follows:

• We propose to learn from small corpora with synthetic continued pretraining—converting the small corpus into a large, diverse, synthetic corpus and continuing pretraining on it—and instantiate this approach using the EntiGraph synthetic data augmentation algorithm (§2.2.2).

• We demonstrate that continued pretraining on the EntiGraph-synthesized corpus yields a QA accuracy scaling trend that is log-linear in the synthetic token count, outperforming continued pretraining on the source documents or paraphrases (§2.4.2). Furthermore, we show that instruction tuning the EntiGraph continually pretrained model enables it to follow more diverse queries related to the source documents (§2.4.3).

• We complement the main experiments with an open-book setup (§2.6), providing the model with access to the source documents when answering queries. We demonstrate that the knowledge acquired through synthetic continued pretraining with EntiGraph is complementary to the knowledge accessed through retrieval-augmented generation (RAG, Lewis et al. [2020])—RAG with the EntiGraph continually pretrained model outperforms RAG with the base model.
• Lastly, we build a mathematical model that captures the intuition behind EntiGraph. We analyze it to obtain a parametric formula for the scaling trend of a continually pretrained model’s accuracy with respect to EntiGraph synthetic tokens, closely matching our empirical observations (§2.7).

Practically, synthetic continued pretraining with EntiGraph enables pretrained LMs to adapt to specialized domains by acquiring parametric knowledge, rather than non-parametric knowledge accessed through retrieval. At a higher level, our approach points toward a family of synthetic data generation algorithms that convert compute into data efficiency for continued pretraining.

Study | Domain | Model Parameter Count | Total Unique CPT Tokens
Minerva [Lewkowycz et al., 2022] | STEM | 8B, 62B, 540B | 26B–38.5B
MediTron [Chen et al., 2023c] | Medicine | 7B, 70B | 46.7B
Code Llama [Rozière et al., 2024] | Code | 7B, 13B, 34B | 520B–620B
Llemma [Azerbayev et al., 2024] | Math | 7B, 34B | 50B–55B
DeepSeekMath [Shao et al., 2024] | Math | 7B | 500B
SaulLM-7B [Colombo et al., 2024b] | Law | 7B | 30B
SaulLM-54B, 141B [Colombo et al., 2024a] | Law | 54B, 141B | 520B
HEAL [Yuan et al., 2024a] | Medicine | 13B | 14.9B
Our setting | Articles & Books | 8B | 1.3M

Table 2.1: Comparing the scale of modern continued pretraining (CPT) works with our small corpus setting. Prior work adapts LMs to broad domains with diverse, large-scale corpora. We aim to downscale CPT to small corpora; we use a corpus that is 10,000× smaller than the smallest modern corpus for domain-adaptive CPT.

2.2 Our method

We focus on learning parametric knowledge from a small corpus of documents. Our goal is to continually pretrain an LM to acquire the knowledge of a niche corpus. Because simple continued pretraining proves ineffective (§2.4), we propose synthetic continued pretraining: first synthesizing a larger corpus from the small one, then continuing pretraining on the synthetic corpus.
We first outline this problem setting and our evaluation approach (§2.2.1), then provide a concrete instantiation using a data augmentation algorithm called EntiGraph (§2.2.2).

2.2.1 Problem setup

Continued pretraining on small corpora We focus on approaches that continually pretrain an LM to teach it the knowledge of a small source corpus D_source. These approaches acquire “parametric knowledge”—the knowledge of D_source is learned in the LM’s parameters, as in pretraining.

Synthetic continued pretraining (synthetic CPT) First, we apply a synthetic data generation algorithm A_synth to convert a small corpus D_source into a synthetic corpus D_synth:

A_synth : D_source ↦ D_synth.    (2.1)

We then perform continued pretraining on D_synth instead of on D_source. We implement A_synth using a prompted LM. A natural concern is that the LM may hallucinate, fabricating false knowledge. We therefore consider synthetic data augmentation algorithms that condition generation on the source documents to improve faithfulness.

Evaluation with knowledge-intensive queries We evaluate a synthetic data augmentation algorithm A_synth by testing whether the downstream synthetic CPT model effectively acquires the knowledge of D_source in its parameters. We curate test queries Q_test that probe knowledge about D_source. For example, in the linear algebra setting, Q_test could be held-out exam questions. To test parametric knowledge, we do not allow the model to access the source documents D_source at test time. The queries therefore must be unambiguous without access to D_source—a reading comprehension question like “Where was he born?” is ambiguous without context. Altogether, we evaluate data augmentation algorithms A_synth for synthetic CPT using a paired source corpus and related test queries (D_source, Q_test).

2.2.2 EntiGraph

We next present EntiGraph, our instantiation of a synthetic data augmentation algorithm A_synth.
At a high level, EntiGraph generates diverse knowledge representations from a small corpus D_source by using a prompted LLM to synthesize a knowledge graph representation. EntiGraph operates in two steps: extracting entities from the document and analyzing relations among arbitrary subsets of entities (Figure 2.1). This hierarchical prompting strategy externalizes the problem of generating diverse synthetic text to a combinatorial structure—a graph relating entities appearing in the corpus documents.

Step 1: Entity extraction First, EntiGraph extracts a list of salient entities E_1, E_2, ..., E_n from the document D_source using an entity_extraction prompt (full prompt in Appendix A.6.1):

E_1, E_2, ..., E_n ∼ LM_aug(entity_extraction(D_source)).

In the linear algebra example, D_source could be one specific linear algebra textbook. We would expect to extract entities such as E_1 = Linear space, E_2 = Vector, E_3 = SVD, ....

Step 2: Relation analysis Next, EntiGraph analyzes relations among subsets of entities. The intuition is to explore edges of the knowledge graph underlying the source document D_source, analogous to a student writing diverse notes about a linear algebra textbook. We apply a relation_analysis prompt (full prompt in Appendix A.6.1) to describe how a subset of k ≤ n entities relate in the context of D_source:

D̃_{E_{i_1}...E_{i_k}} ∼ LM_aug(relation_analysis(D, E_{i_1}, E_{i_2}, ..., E_{i_k})).

For example, if E_1 = Linear space and E_2 = Vector, D̃_{E_1 E_2} could be: Based on the textbook, a vector is an element of a linear space... Exhaustively enumerating all possible subsets of entities is impractical; we generate data for pairs D̃_{E_i E_j} and triplets D̃_{E_i E_j E_k} in our experiments.

EntiGraph synthetic corpora Finally, we collect all sampled synthetic texts from Step 2 as the EntiGraph output: D_EntiGraph = {D̃_{E_{i_1}...E_{i_k}}, ...}.
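The two steps above can be sketched in a few lines of Python. Here `extract_entities` and `describe_relation` are stand-ins for the prompted LM_aug calls (the real prompts appear in Appendix A.6.1); the pair/triplet enumeration and the generation budget mirror the description above.

```python
import itertools
import random

def entigraph(document, extract_entities, describe_relation, budget=1000, seed=0):
    """Minimal EntiGraph sketch: extract entities, then describe sampled
    entity pairs/triplets grounded in the source document."""
    entities = extract_entities(document)  # Step 1: entity extraction
    # Enumerate pairs and triplets; exhausting all subsets is impractical,
    # so we shuffle and cap generation at a fixed budget.
    subsets = [c for k in (2, 3) for c in itertools.combinations(entities, k)]
    random.Random(seed).shuffle(subsets)
    # Step 2: relation analysis, conditioned on the document for faithfulness.
    return [describe_relation(document, subset) for subset in subsets[:budget]]
```

With three entities this yields C(3,2) + C(3,3) = 4 synthetic passages; with the hundreds of entities in a real article, the combinatorial structure supplies the diversity that plain rephrasing lacks.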
Altogether, we have described a data augmentation algorithm mapping a small source corpus D_source to a larger synthetic corpus D_EntiGraph, as in (2.1).

2.3 Experiment setup

We next detail how we evaluate a given data augmentation algorithm A_synth. As described in §2.2.1, we evaluate algorithms by measuring how well an LM continually pretrained on the output synthetic corpus A_synth(D_source) answers test queries Q_test about the source documents D_source. In our main experiments, we use queries that are unambiguous without the source documents D_source and disallow the LM from accessing D_source while answering queries Q_test. This allows us to evaluate which data augmentation algorithm best promotes parametric knowledge acquisition through synthetic CPT. Later, in §2.6, we consider an open-book setting where the model can simultaneously access D_source and Q_test, testing how parametric knowledge acquired through synthetic CPT composes with non-parametric knowledge accessed through retrieval [Lewis et al., 2020]. We next introduce our small corpus and related test queries (D_source, Q_test).

QuALITY corpus D_source Our corpus and test queries are based on the QuALITY [Pang et al., 2022] long-document comprehension benchmark. The QuALITY corpus D_source consists of 265 articles and short books on genres such as science fiction and journalism, averaging ∼5,000 tokens.

QuALITY test queries Q_test We use the 10–20 multiple choice questions accompanying each article in QuALITY. These serve as high-quality knowledge probes on D_source, but the query phrasing often presupposes reading comprehension context (e.g., “What does the author think about...”). We remove ambiguity by contextualizing with an article reference: “In the context of article article_name by author_name, what does the author think about...”. This provides 4,609 unambiguous queries Q_test to test parametric knowledge of our continually pretrained LMs.
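The disambiguation step above is a simple string transformation; a minimal sketch follows (the template comes from the text, but the function name is ours):

```python
def contextualize_query(question, article_name, author_name):
    """Prefix a QuALITY question with an explicit article reference so it is
    unambiguous without access to the source text."""
    # Lower-case the leading character so the question reads as a clause.
    body = question[0].lower() + question[1:] if question else question
    return f"In the context of article {article_name} by {author_name}, {body}"

q = contextualize_query("What does the author think about flossing?",
                        "Defining Decay Down", "David Plotz")
```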
Evaluation on instruction-tuned summarization We also instruction tune the continually pretrained LMs and evaluate on more general instruction following queries. Specifically, we prompt them to generate closed-book summaries of QuALITY articles given only title and author.

Performance with strong API-based LLMs In our continued pretraining setting, we must select a corpus D_source not well-represented in standard pretraining datasets. As an initial test of QuALITY corpus obscurity, we evaluate GPT-3.5 and GPT-4 on Q_test. In the closed-book setting, GPT-3.5 achieves 44.81% accuracy and GPT-4 achieves 51.30% (Figure 2.2). In the open-book setting (full access to D_source), GPT-3.5 reaches 72.60% and GPT-4 reaches 86.09% (Table 2.4). The large (∼30%) improvement when D_source is provided confirms that the QuALITY corpus is sufficiently niche for our testbed.

2.4 Main experiments

We next present our main experimental results.[1] Using GPT-4[2] as our prompted model LM_aug, we apply EntiGraph to the 1.3M token QuALITY corpus D_source, generating a 455M token synthetic corpus.[3] We refer to the former as the “Raw corpus” and the latter as the “EntiGraph corpus” throughout. Additional details appear in Appendix A.1. We continually pretrain Llama 3 8B [Dubey et al., 2024a] with causal language modeling on the 455M token EntiGraph corpus. We describe our CPT procedure and introduce two baselines (§2.4.1), evaluate on QuALITY test queries Q_test (§2.4.2), and show that synthetic CPT with EntiGraph is compatible with downstream instruction tuning (§2.4.3; Ouyang et al., 2022).

2.4.1 Continued pretraining procedure

EntiGraph CPT In our main experiment, we continually pretrain Llama 3 8B Base on the 455M token EntiGraph corpus for 2 epochs with replay on the RedPajama dataset [TogetherAI, 2023]. We refer to this model as “EntiGraph CPT”[4] hereafter; CPT details appear in Appendix A.2.
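A replay scheme of this kind can be sketched as follows; the mixing logic is illustrative, and the rate passed in is a stand-in rather than the tuned value from Appendix A.2:

```python
import random

def build_cpt_stream(synthetic_docs, replay_docs, replay_rate, seed=0):
    """Interleave synthetic CPT documents with replay documents (e.g., from
    RedPajama) at a fixed rate, to mitigate forgetting of general capability."""
    rng = random.Random(seed)
    stream = []
    for doc in synthetic_docs:
        stream.append(doc)
        if rng.random() < replay_rate:  # occasionally insert a replay document
            stream.append(rng.choice(replay_docs))
    return stream
```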
We next describe two baselines for comparison in closed-book QA (§2.4.2).

Raw CPT baseline The first baseline continues pretraining Llama 3 8B Base on the 1.3M token Raw corpus of QuALITY articles D_source. We jointly tune the number of epochs and the RedPajama replay rate, obtaining the “Raw CPT” model (details in Appendix A.2).

Rephrase CPT baseline Another simple synthetic data augmentation procedure is to rephrase QuALITY articles repeatedly. Maini et al. [2024] and Ovadia et al. [2024] systematically extend this idea (cf. §1.6.1). Based on their approaches, we craft a “Rephrase baseline” that repeatedly applies three fixed prompts (easy, medium, and hard rephrase)[5] to QuALITY articles at temperature 1.0. We stopped generating paraphrases at 38M tokens, where we observed a clear gap from EntiGraph CPT and a slower scaling trend (Figure 2.2). We refer to this data as the “Rephrase corpus” and the continually pretrained model as “Rephrase CPT”.

[1] Code: https://github.com/ZitongYang/Synthetic_Continued_Pretraining.git.
[2] We use the gpt-4-turbo model as of Aug. 19, 2024.
[3] Corpus available at https://huggingface.co/datasets/zitongyang/entigraph-quality-corpus.
[4] Model weights available at https://huggingface.co/zitongyang/llama-3-8b-entigraph-quality.
[5] Maini et al. [2024] include a 4th prompt to generate synthetic QA pairs. We defer this task-specific QA finetuning method to Appendix A.3 and focus on task-agnostic baselines for learning generic knowledge.

[Figure 2.2 plots QA accuracy against synthetic token count for EntiGraph CPT and Rephrase CPT, with reference lines for GPT-4 (51.30%), GPT-3.5 (44.81%), Llama 3 8B Base (39.49%), and Raw CPT (38.15%).] Figure 2.2: Accuracy on the QuALITY question set Q_test (y-axis) as a function of the synthetic token count (x-axis).
The accuracy of synthetic continued pretraining using the EntiGraph data augmentation algorithm (EntiGraph CPT) scales log-linearly up to 455M tokens.

2.4.2 Question-answering evaluations

We next present closed-book QA evaluations with the QuALITY test queries Q_test.

Evaluation procedure Each QuALITY question is a four-choice, single-answer multiple choice question (similar to MMLU; Hendrycks et al., 2021a). We evaluate with 5-shot chain-of-thought prompting [Brown et al., 2020, Wei et al., 2024]; our prompt appears in Appendix A.7.1.

EntiGraph scaling CPT on the 455M token EntiGraph corpus improves closed-book QA accuracy from 39.49% (Llama 3 8B Base) to 56.22% (Figure 2.2). A natural question is: how does accuracy scale as we synthesize and train on more tokens? To test this, we randomly subsample without replacement from the EntiGraph corpus with varying sample sizes, continually pretrain Llama 3 8B Base on each subsample, and plot accuracy versus sample size in Figure 2.2. We observe log-linear scaling of accuracy in the number of synthetic tokens, up to 455M tokens. We mathematically investigate EntiGraph scaling properties in §2.7. At a high level, we postulate that QuALITY accuracy follows a mixture-of-exponential shape with three stages: (i) linear growth, (ii) log-linear growth, and (iii) asymptotic plateau.

Table 2.2: EntiGraph Instruct examples.
Explicit reference: Summarize “Defining Decay Down”. Response: The article “Defining Decay Down” by David Plotz discusses [...] Dentists began to focus on cosmetic dentistry, [...]
Implicit reference: How has dentistry in the U.S. changed? Response: 1. Increase in cosmetic dentistry [...] 2. Use of technology: [...]
Cross article instruction: Compare David Plotz’s commentary on American dentistry and the movie Fight Club? Response: David Plotz’s commentary style is different when he analyzes American dentistry and when he discusses the movie Fight Club. [...]
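The log-linear stage can be summarized by an ordinary least-squares fit of accuracy against log10(tokens); a self-contained sketch follows (the data points below are made up for illustration, not the measured values from Figure 2.2):

```python
import math

def fit_log_linear(tokens_millions, accuracies):
    """OLS fit of accuracy = a + b * log10(tokens); returns (a, b)."""
    xs = [math.log10(t) for t in tokens_millions]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(accuracies) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Illustrative (fabricated) points: accuracy rising ~5 points per decade of tokens.
a, b = fit_log_linear([1, 10, 100], [40.0, 45.0, 50.0])
```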
Comparison with baselines Raw CPT (green line) underperforms even Llama 3 8B (dashed black line). We postulate two explanations: (i) the Raw corpus follows a narrower distribution than the Llama 3 pretraining corpus, and heavily training on it may harm English capabilities; (ii) limited diversity of knowledge representations in the Raw corpus leads to limited knowledge acquisition due to problems such as the reversal curse [Berglund et al., 2023]. Rephrase CPT scales poorly compared with EntiGraph (Figure 2.2), suggesting that synthetic CPT requires sufficiently diverse synthetic data to scale well. EntiGraph tackles this problem with a hierarchical prompting strategy that externalizes diversity to a knowledge graph’s combinatorial relationships.

2.4.3 Instruction following evaluations

We next explore more general test queries beyond Q_test. We perform instruction tuning on EntiGraph CPT to obtain EntiGraph Instruct. Synthetic CPT on the EntiGraph corpus is compatible with instruction tuning: EntiGraph Instruct can directly use knowledge obtained during synthetic CPT for instruction following, without test-time access to the QuALITY corpus D_source. We detail our instruction tuning procedure in Appendix A.2.

Instruction tuning qualitative examples We first present qualitative examples demonstrating EntiGraph Instruct’s ability to follow instructions related to QuALITY articles. We ask the model to summarize a QuALITY article with explicit reference to title and author, but no access to the article itself (Table 2.2, top row). Next, we show that even without explicit reference to title and author, article knowledge stored in model parameters affects behavior (Table 2.2, middle row). Finally, we provide an example where the model compares across two articles (Table 2.2, bottom row). Although artificial, this demonstrates that even though EntiGraph does not synthesize data involving multiple
articles simultaneously, the model can reason about their interaction using parametric knowledge. Full responses appear in Table A.2.

[Figure 2.3 compares summarizers (RawCPT, GPT-3.5, GPT-4, and EntiGraph, each with short and long prompts) against the human reference.] Figure 2.3: Closed-book summarization: number of false claims (y-axis) versus number of salient claims (x-axis) normalized by the human summary.

Evaluating closed-book summarization We also present quantitative metrics for summarization, a well-studied instruction following task. We compare EntiGraph Instruct summaries of QuALITY articles with human-written summaries from sQuALITY [Wang et al., 2022], a QuALITY variation with human summaries. Common scalar metrics such as ROUGE [Lin, 2004] or BERTScore [Zhang* et al., 2020] mostly evaluate text similarity between summary and source articles, and may not accurately reflect summarization quality for abstractive systems [Zhang et al., 2024b]. We use a simple automated metric based on pyramid evaluation [Nenkova et al., 2007, Gao et al., 2019] that measures both hallucination rate and how well summaries capture salient claims. Our approach uses GPT-4 to (1) split summaries into atomic claims [Min et al., 2023], (2) decide whether each claim is true or false based on the source article, and (3) determine if true claims are salient to the article’s main message. We obtain counts of false and salient claims for each summary, normalize by the corresponding count from human summaries, and report averages in Figure 2.3. Appendix A.7.2 provides further details.

Results discussion In Figure 2.3, we compare four summarizers: EntiGraph Instruct, Raw Instruct, GPT-3.5, and GPT-4. We provide each summarizer with two prompts requesting short and long summaries (prompts in Appendix A.7.2).
When we request more detailed summaries, Raw Instruct hallucinates and generates more false claims with little improvement in salient claims. In contrast, EntiGraph Instruct generates more salient claims as summaries lengthen, with only a small increase in false claims (similar to GPT-3.5 and GPT-4 levels). The gaps in both salient and false claim rates are large enough that these results likely hold beyond our particular metric. We complement the automated evaluation with a qualitative example in Appendix A.7.2.

2.5 Ablation Studies

We present ablation experiments to further validate EntiGraph’s effectiveness and test its generalization properties. We discussed two potential limitations in §2.8.1:

1. Could the gains of Synthetic CPT be explained by distillation effects, due to the use of a strong prompted LM for synthetic data generation?

2. Is the data synthesized in Synthetic CPT factual?

We provide evidence suggesting these are not significant concerns in §2.5.1 and §2.5.2, respectively. Lastly, we repeat the procedure of the core experiments on another small corpus of Coursera lecture transcripts, to provide evidence that Synthetic CPT generalizes to datasets and domains beyond QuALITY (§2.5.3).

2.5.1 Using a Weaker Synthetic Data Generation LM

One potential concern is whether EntiGraph’s success demonstrated in §2.4 stems from distilling knowledge from GPT-4. To investigate this, we conducted an experiment replacing GPT-4-Turbo with a significantly weaker model, Llama 3.1 8B Instruct, as the synthetic data generator. Recall that in all continued pretraining experiments, we finetune the 8B parameter Llama 3 Base model. Therefore, in this experiment, the capabilities of the synthetic data generator and the continually pretrained model are very similar, controlling for distillation effects.
Using the entity extraction and relation analysis prompts introduced in §2.2, we generate 334M synthetic tokens and evaluate the scaling behavior under the same hyperparameter setup detailed in §2.4.1. Figure 2.4 reveals two key insights. First, even with the weaker generator, EntiGraph maintains steady log-linear improvement with no signs of saturation at 334M tokens, suggesting that the gains of Synthetic CPT stem from continued pretraining on diverse representations of the corpora’s underlying knowledge, rather than distilling the generator model’s knowledge. Similar to our main results (§2.4), EntiGraph with a Llama 3.1 8B Instruct generator outperforms Rephrase with the same generator. Moreover, at 334M synthetic tokens, EntiGraph with a Llama 3.1 8B Instruct generator exceeds closed-book evaluation of GPT-4-Turbo on this benchmark. Second, while switching from the GPT-4-Turbo generator to the weaker generator shifts the accuracy curve downward, the log-linear slope remains consistent. In contrast, holding the synthetic generator constant, we observe that EntiGraph CPT and Rephrase CPT exhibit different slopes.

[Figure 2.4 plots QuALITY QA accuracy against synthetic token count for EntiGraph and Rephrase under each generator, with reference lines for GPT-4 (51.30%) and Llama 3 8B Base (39.49%).] Figure 2.4: The scaling properties of Synthetic CPT with the EntiGraph and Rephrase augmentations, comparing two synthetic data generators: GPT-4-Turbo and Llama 3.1 8B Instruct.

2.5.2 Factuality and Lexical Diversity of EntiGraph Synthetic Corpus

Factuality A limitation discussed in §2.8.1, and inherent in all methods involving synthetic data generation, is that the generation model may hallucinate.
EntiGraph is a synthetic data augmentation, which conditions an LM on a given corpus document and prompts the LM to discuss the document's entities and their relationships. Assuming a reasonably capable generator model, this grounding should decrease the hallucination rate. To quantitatively test the factuality of documents synthesized with EntiGraph, we split the 455M-token EntiGraph corpus into sentences and randomly sample 150 sentences. We ask authors of this work to label whether each sentence is subjective, and, among non-subjective sentences, whether it is supported by the article text.

We compute two statistics, following Min et al. [2023]: the proportion of subjective sentences is the number of subjective sentences over the total number of annotated sentences, and the factuality rate is the number of non-subjective sentences supported by the source document over the total number of non-subjective sentences:

• Proportion subjective: 0.532 (bootstrap 0.95 confidence interval: [0.455, 0.610]).
• Factuality rate: 0.944 (bootstrap 0.95 confidence interval: [0.889, 0.986]).

Because EntiGraph uses open-ended prompts which ask the LM to relate different, often abstract entities, the LM often generates subjective statements. We do not necessarily view this as a limitation, because learning reasonable subjective interpretations is crucial for understanding (and hence is often assessed in, e.g., essay questions on literature exams). We also observe that the non-subjective sentences are consistently factual, supporting the effectiveness of grounding in reducing hallucination.

Lexical Diversity We hypothesize that good synthetic data augmentations should produce knowledge representations with diverse wording. As a measure of this lexical diversity, we compute the percentage of n-grams in the synthetic documents that overlap with the n-grams of the corresponding source documents.
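Bootstrap intervals of the kind reported above can be computed with a standard percentile bootstrap over the 150 binary annotations; a minimal sketch, in which the label counts are hypothetical placeholders rather than the study's raw annotations:

```python
import random

def percentile_bootstrap_ci(labels, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of binary labels."""
    rng = random.Random(seed)
    n = len(labels)
    means = sorted(
        sum(rng.choice(labels) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical annotations: 1 = sentence supported by the source, 0 = not.
labels = [1] * 142 + [0] * 8  # an illustrative factuality rate of ~0.947
lo, hi = percentile_bootstrap_ci(labels)
```

With 150 annotations the interval width is on the order of a few percentage points, consistent with the widths reported above.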
More precisely, we first randomly select 100 QuALITY articles, tokenize them with the Llama 3.1 tokenizer, and compute the set of n-grams for each article. Then, for each article, we tokenize the corresponding EntiGraph and Rephrase synthetic data, compute n-grams, and count the n-grams in the synthetic data that appear in the set of n-grams for the raw article. For each n and each synthetic augmentation method, we sum this overlap count across articles and normalize by the total number of synthetic tokens generated for the 100 articles, yielding an estimate of the percentage of n-grams in the synthetic data that overlap with the source data.

Augmentation | n = 2 | n = 4 | n = 8 | n = 16
EntiGraph    | 23.40 | 3.66  | 0.24  | 0.00
Rephrase     | 21.35 | 3.04  | 0.51  | 0.22

Table 2.3: Percentage of token n-grams in synthetic documents that overlap with the source document n-grams, for the EntiGraph and Rephrase synthetic data augmentations.

These results are provided in Table 2.3. We observe that for both augmentations, the n-gram overlap percentage is low and quickly approaches 0% as n increases, indicating that both methods produce lexically diverse knowledge representations.

2.5.3 Datasets Beyond QuALITY

To test whether Synthetic CPT with EntiGraph generalizes to corpora beyond QuALITY, we evaluated on the Coursera Exam QA dataset [An et al., 2023]. This dataset contains lecture transcripts and exam questions from advanced technical courses like data science and machine learning. Compared to the books and stories in QuALITY, Coursera exams present new challenges: the content is conceptually harder, questions can have multiple correct answers, and the number of options is not fixed at four. This makes few-shot prompting more demanding, as the model must understand both the content and the flexible answering format.
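The n-gram overlap statistic of Table 2.3 can be sketched as follows; this illustrative reimplementation runs on whitespace tokens rather than the Llama 3.1 tokenizer, and is not the authors' exact script:

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) appearing in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(source_docs, synthetic_docs, n):
    """Percentage of synthetic n-gram occurrences also present in the paired
    source document, summed over pairs and normalized by the total number of
    synthetic tokens (mirroring the normalization used for Table 2.3)."""
    overlap, total_synth_tokens = 0, 0
    for src, synth in zip(source_docs, synthetic_docs):
        src_ngrams = ngrams(src.split(), n)
        synth_tokens = synth.split()
        total_synth_tokens += len(synth_tokens)
        overlap += sum(
            1 for i in range(len(synth_tokens) - n + 1)
            if tuple(synth_tokens[i:i + n]) in src_ngrams
        )
    return 100.0 * overlap / max(total_synth_tokens, 1)

src = ["the quick brown fox jumps over the lazy dog"]
syn = ["a swift brown fox leaps over a sleepy dog"]
pct2 = overlap_percentage(src, syn, n=2)  # only the bigram "brown fox" overlaps
```

Because overlap is normalized by synthetic token count rather than by n-gram count, the statistic is directly comparable across augmentations that generate different amounts of text.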
The dataset consists of 15 lecture transcripts and 124K raw tokens, substantially smaller than QuALITY's 265 documents and 1.3M raw tokens. During our scaling analysis, we found that models trained on tiny synthetic corpora (e.g., a few million tokens) struggled to follow few-shot prompts reliably for Coursera questions, resulting in parsing errors. Therefore, the scaling curve in Figure 2.5 begins at token counts where parsing error rates fall below 5%. For the Rephrase baseline, we generate synthetic data up to 22M tokens, and find that only one model has parsing error rates below 5%.

Figure 2.5: The scaling properties of Synthetic CPT using the EntiGraph augmentation on the Coursera Exam QA dataset. (Axes: Coursera Exam QA accuracy vs. number of synthetic tokens in millions; reference line: Llama 3 8B Base at 48.26%.)

Despite these challenges, EntiGraph CPT shows consistent improvement over Llama 3 8B Base, improving accuracy from 48.26% to 53.87%, better than both Llama 3 8B Base and the Rephrase baseline. The log-linear scaling pattern persists up to 32M synthetic tokens, suggesting EntiGraph's effectiveness extends beyond narrative texts to technical educational content.

Method                | Accuracy | Recall@8
EntiGraph CPT + RAG   | 62.60    | 99.63
Llama 3 8B Base + RAG | 60.35    | 99.63
GPT-4 + Oracle RAG    | 86.09    | 100.0
GPT-3.5 + Oracle RAG  | 72.60    | 100.0

Table 2.4: QuALITY question-answering accuracy and recall rate in the open-book retrieval-augmented generation (RAG) setting. EntiGraph CPT and Llama 3 8B Base are used in a RAG pipeline (cf. §2.6 for setup details). Recall@8 is defined as the proportion of questions for which the salient article appears in the top 8 reranked document chunks. GPT-4 and GPT-3.5 Oracle RAG provide an upper bound with a perfect retriever, by placing the entire relevant document in-context.

This transfer to a
different domain suggests that synthetic continued pretraining with EntiGraph may extend beyond narrative texts.

2.6 Open-book experiments

We next consider an open-book setting with the domain-specific corpus D_source available at test time. In this widespread setting, retrieval-augmented generation (RAG; Lewis et al., 2020) is the predominant approach. A natural question is whether parametric knowledge learned through Synthetic CPT with EntiGraph complements non-parametric knowledge accessed through RAG. We answer this by comparing a strong RAG pipeline with and without EntiGraph CPT.

RAG evaluation setup Our RAG pipeline follows established best practices [Lewis et al., 2020, Gao et al., 2024b]. It involves an offline stage that indexes document chunks, followed by inference-time retrieval, reranking, and placement of chunks in a few-shot LM prompt. We use OpenAI text-embedding-3-large [Neelakantan et al., 2022] as our embedding model, FAISS as our similarity search index [Douze et al., 2024], and Cohere rerank-english-v3.0 [Cohere, 2024] as our reranker. Following the evaluation procedure in §2.4, we evaluate parallel RAG pipelines on the QuALITY multiple-choice test set using few-shot chain-of-thought prompting. All hyperparameters are tuned separately for each LM's RAG pipeline. Appendix A.4 provides further details.

EntiGraph continued pretraining complements RAG Table 2.4 shows that EntiGraph CPT outperforms Llama 3 8B Base, the model from which it is continually pretrained. These results demonstrate that knowledge internalized through Synthetic CPT complements knowledge accessed during RAG, suggesting a competitive recipe for small-corpus QA: (1) synthetic data augmentation, (2) continued pretraining, and (3) RAG.

EntiGraph continued pretraining alone approaches RAG performance These results also contextualize EntiGraph's effectiveness in the closed-book parametric knowledge setting (§2.4).
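The index-retrieve-rerank pipeline used in the RAG setup can be sketched end to end. This illustrative version substitutes toy bag-of-words embeddings, exact cosine search, and a pass-through reranker for the actual stack (text-embedding-3-large, FAISS, and Cohere rerank-english-v3.0); every name and value below is a stand-in:

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a neural embedding model: L2-normalized bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def retrieve_and_rerank(query, chunks, k_retrieve=4, k_final=2):
    """Offline indexing, retrieval, (here trivial) reranking, and top-k selection."""
    index = [(chunk, embed(chunk)) for chunk in chunks]       # offline stage
    q = embed(query)
    scored = sorted(index, key=lambda c: cosine(q, c[1]), reverse=True)
    candidates = [chunk for chunk, _ in scored[:k_retrieve]]  # retrieval
    # A real pipeline reranks candidates with a cross-encoder; we keep the order.
    return candidates[:k_final]                               # chunks for the prompt

chunks = [
    "the fox escaped through the northern gate",
    "grain prices rose sharply in the capital",
    "the gate was guarded by two soldiers at night",
]
top = retrieve_and_rerank("which soldiers guarded the gate", chunks)
```

The returned chunks would then be placed in a few-shot chain-of-thought prompt, as in the evaluation above.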
Comparing Figure 2.2 and Table 2.4, adding RAG to Llama 3 8B Base improves accuracy by 20.86% (39.49% → 60.35%). In contrast, continued pretraining of Llama 3 8B Base on the EntiGraph corpus improves accuracy by 16.73% (39.49% → 56.22%). Hence, EntiGraph continued pretraining provides more than 80% of the absolute performance improvement of RAG, even in a small-corpus setting where RAG recall is nearly perfect. Overall, our results show that parametric knowledge acquired through EntiGraph continued pretraining composes with realistic knowledge-intensive QA pipelines, and that EntiGraph continued pretraining alone—without test-time corpus access—is nearly competitive with a strong RAG baseline.

2.7 Theoretical analysis of EntiGraph scaling

It may seem surprising that simply "rewriting" the source documents D_source improves performance at all (§2.4), as EntiGraph does not explicitly add new knowledge beyond D_source. We postulate that EntiGraph "rearranges" D_source into a layout more amenable to learning. For example, in D_source, the entity pair (A, B) may appear together in some sentences and (B, C) in others. Models trained directly on D_source may learn the (A, B) and (B, C) relations but not the (A, C) relation [Akyürek et al., 2024]. We build a mathematical model to formalize this intuition (§2.7.1) and provide a quantitative prediction that EntiGraph CPT follows a mixture-of-exponential scaling shape (§2.7.3), which fits our empirical observations well (Figure 2.6).

2.7.1 Toy model setup

In this toy model, we use 𝒱 to denote the set of entities and represent the source documents by a set of known relation pairs D_source ⊂ {(x, y) ∈ 𝒱² : x ≠ y}. We assume each relation pair in 𝒱² appears in D_source independently at random with probability p; that is, P[(x, y) ∈ D_source] = p for all x, y ∈ 𝒱 with x ≠ y. We write V = |𝒱| and assume p = λ/V for some constant λ > 1.
Training as memorization We model learning of factual knowledge as a memorization process, where a model memorizes relations it is trained on but does not meaningfully generalize beyond them [Yang et al., 2023a, Feldman, 2020]. In this view, a language model's knowledge is represented by a matrix M ∈ {0, 1}^{V×V} such that M(x, y) = 1 if the model "knows" the (x, y) relation and 0 otherwise. Training directly on D_source simply means setting all entries appearing in D_source to 1, denoting that the model has memorized the source document relations. We denote the model trained on D_source by the matrix M_0 ∈ {0, 1}^{V×V}, which has i.i.d. Bernoulli off-diagonal entries with mean p.

EntiGraph synthetic data augmentation Given the source documents D_source, we define the following iterative synthetic data generation procedure, with D_0 = D_source: for each t = 1, 2, ...

• Entity pair selection: Sample (x_t, y_t) ∈ {(x, y) ∈ 𝒱² : x ≠ y} uniformly at random.
• Relation analysis: Generate the "relation between (x_t, y_t)" by performing a breadth-first search (BFS) on the directed graph represented by the adjacency matrix M_0, starting at x_t. If no path from x_t to y_t exists, do nothing. If there exists a path (x_t, z_t^1, z_t^2, ..., z_t^{k_t}, y_t) connecting x_t to y_t, define

D_t = {(x_t, z_t^1), (x_t, z_t^2), ..., (x_t, z_t^{k_t}), (x_t, y_t)} ∪ D_{t−1}.

The model trained on this round of synthetic data is

M_t = M_{t−1} + Σ_{(x,y) ∈ D_t \ D_{t−1}} I^{xy},

where I^{xy} ∈ {0, 1}^{V×V} is a binary matrix with I^{xy}(x, y) = 1 and 0 elsewhere. This mirrors the relation analysis step of EntiGraph (Step 2, §2.2.2). The index t is analogous to the number of synthetic tokens generated, and model knowledge is captured by how many ones M_t contains.
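The iterative procedure above is straightforward to simulate. The sketch below samples M_0 as a sparse random directed graph and runs the BFS-based augmentation, tracking how the number of known relations grows; it is an illustrative simulation under arbitrary parameter choices, not the analysis of §2.7.2:

```python
import random
from collections import deque

def bfs_path(adj, src, dst):
    """Shortest directed path from src to dst on adjacency sets, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def simulate(V=60, lam=2.0, steps=3000, seed=0):
    rng = random.Random(seed)
    p = lam / V
    # M_0: each off-diagonal relation present independently with probability p.
    adj = {x: {y for y in range(V) if y != x and rng.random() < p}
           for x in range(V)}
    known = {(x, y) for x in range(V) for y in adj[x]}
    sizes = [len(known)]
    for _ in range(steps):
        x, y = rng.sample(range(V), 2)  # entity pair selection
        path = bfs_path(adj, x, y)      # relation analysis, always on M_0
        if path is not None:
            known.update((x, z) for z in path[1:])  # add (x_t, z) for each hop
        sizes.append(len(known))
    return sizes

sizes = simulate()
```

Plotting `sizes` against t reproduces the qualitative shape discussed in §2.7.3: fast early growth followed by a plateau as the reachable pairs are exhausted.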
We define the link density (or accuracy) of M_t as

Acc(M_t) = E[∥M_t∥_1 | M_0] / (V(V − 1)),

where the expectation is over the randomness of synthetic data generation (not of the source documents D_source), and ∥M∥_1 denotes Σ_{i,j} |M_{i,j}|. We use the notation Acc because this quantity emulates accuracy on QuALITY test queries (§2.4 and §2.6).

2.7.2 Rigorous upper and lower bound

We next derive rigorous upper and lower bounds on the scaling trend of Acc(M_t).

Definition 2. Let C_λ = (1 − ρ(λ))², where ρ(λ) denotes the extinction probability of a Poisson(λ) branching process (i.e., ρ is the smallest solution in [0, 1] to the fixed-point equation ρ = exp(λ(ρ − 1))). For any fixed ε > 0, we further define

C_LB = 1 − 1/(V(V − 1)),   C_UB = 1 − ((1 + ε) log V)/(V(V − 1) log λ).

Theorem 1. For any time t ≥ 1 and any ε > 0, with probability → 1 as V → ∞, the link density satisfies

(p + C_λ(1 − C_LB^t))(1 − ε) ≤ Acc(M_t) ≤ (p + C_λ(1 − C_UB^t))(1 + ε).

Although Theorem 1 provides rigorous bounds on the scaling trend of Acc(M_t), the exact growth curve is more intricate, as we show next.

2.7.3 An analytical formula

We analyze the link density Acc(M_t) using a Poisson branching process approximation of cluster growth. This yields a mixture-of-exponential scaling trend

Acc(M_t) ∼ p + C (1 − Σ_{k=1}^∞ μ(k)(1 − a_k)^t),   (2.2)

where A ∼ B means A/B converges to 1 in probability as V → ∞. The parameter C governs the link density Acc(M_t) as t → ∞ and is determined by the proportion of reachable vertex pairs in M_0. μ(·) is a probability mass function over k, controlling the proportion of vertex pairs with a specific decay rate. The parameters μ(·) and a_k depend on M_0 in more intricate ways (cf. Appendix A.5 for a full
derivation).

Figure 2.6: A mixture-of-exponential function (2.2) closely fits the scaling trend of EntiGraph CPT with respect to synthetic token count. (Axes: EntiGraph accuracy vs. number of synthetic tokens in millions; the empirical QuALITY QA observations are overlaid with the mixture-of-exponential fit.)

Equation (2.2) accurately fits the empirical scaling trend of EntiGraph CPT accuracy up to 455M synthetic tokens (Figure 2.6). We discuss curve fitting in Appendix A.5.1, showing that the mixture-of-exponential shape grows in three phases: (i) linear growth, (ii) log-linear growth, and (iii) an asymptotic plateau.

2.8 Discussion

2.8.1 Limitations

Because EntiGraph synthesizes data using a prompted LM, it may hallucinate and fabricate nonexistent entities or relations. Although our synthesis process is grounded by the source documents, we assume LM_aug is capable enough to generate faithful synthetic data when conditioned on D_source. We test the factuality of the EntiGraph corpus by randomly subsampling 150 sentences and manually labeling each sentence's factuality. We find roughly half the sentences are subjective, and the objective half is almost always factual. We postulate that factuality is high because QuALITY articles are relatively simple given the prompted LM's capability. If EntiGraph were applied to more challenging content like a complex research paper, the prompted model may be more prone to hallucination.

Because we use a strong prompted LM, gpt-4-turbo, to generate synthetic data, one might be concerned that performance gains stem from distillation. To probe this, we perform an ablation replacing gpt-4-turbo with Llama 3.1 8B Instruct, a substantially weaker model from the same base as EntiGraph CPT.
We generated 334M EntiGraph tokens using Llama 3.1 8B Instruct and found a consistent log-linear trend with the same slope but a lower intercept compared with GPT-4 generation. This ablation suggests EntiGraph genuinely teaches the model knowledge about the QuALITY corpus, rather than serving as a vehicle to distill a powerful prompted LM.

2.8.2 Conclusion

Continued pretraining with next-token prediction effectively teaches pretrained language models new knowledge, but has only been applied successfully in broad, data-rich domains with 10B–100B+ tokens. We downscale continued pretraining to small, specialized corpora with ∼1M tokens using synthetic continued pretraining: converting a small corpus into a large synthetic one with diverse knowledge representations, then continuing pretraining on it.

We instantiate this approach using EntiGraph, a knowledge graph–inspired synthetic data augmentation algorithm. Synthetic continued pretraining with EntiGraph demonstrates consistent scaling in downstream closed-book QA performance up to a 455M-token synthetic corpus, whereas baselines such as continued pretraining on the small corpus or on synthetic paraphrases show no improvement or scale slowly. The acquired parametric knowledge composes with instruction tuning and with retrieved non-parametric knowledge in an open-book setting. We also present a simplified mathematical model of EntiGraph and derive a functional form for its scaling trend that closely matches our empirical observations.

We hypothesize that EntiGraph's "externalization" of synthetic data generation to a combinatorial structure—in this case, a knowledge graph over entities—may be a useful strategy for synthesizing highly diverse data and a promising direction for future study. We designed every component of EntiGraph by hand: the entity-relation extraction, the knowledge graph traversal, and the synthesis prompts.
A natural question is whether AI systems can discover such data augmentation algorithms automatically; we explore this direction in Chapter 4. Lastly, while synthetic continued pretraining enables efficient knowledge acquisition, it operates within the model's existing capabilities. A more fundamental question remains: can a model improve its core capacity for language modeling? In Chapter 3, we address this by targeting pretraining perplexity—the most basic measure of a language model's capability, and one that correlates with performance across all downstream tasks. If we can show genuine self-improvement in perplexity, we demonstrate self-improvement at the most fundamental level.

Chapter 3

Bootstrapping pretraining capabilities

Chapter 2 showed that synthetic data can efficiently teach a language model new knowledge—specific facts about a corpus—but the model's underlying capability, measured by pretraining perplexity, remained unchanged. We now ask a more ambitious question: can a model improve the very foundation upon which all downstream performance rests? We approach this in two parts. First, we show that strong reasoning is already latent in pretrained weights and can be surfaced with remarkably few examples (§3.1), suggesting that pretraining—not post-training—is the true bottleneck for capability. We then introduce Synthetic Bootstrapped Pretraining (SBP), a framework in which the model generates synthetic text to improve its own pretraining objective without relying on a stronger external teacher (§3.2).

For the SBP component specifically, the self-improvement constraint rules out distillation. Among the remaining approaches (see §1.6.2 for a detailed review), architectural changes are orthogonal and complementary—SBP's gains compose with them.
Retrieval-augmented pretraining [Borgeaud et al., 2021, Khandelwal et al., 2020] can leverage related documents but keeps the additional signal external to the weights and imposes retrieval overhead at every forward pass. In-context pretraining [Shi et al., 2024b] groups related documents into the same context window, removing the retrieval overhead but remaining limited by context length. We choose synthetic data because it creates new training signal from existing data and writes it directly into the model's weights via standard pretraining.

3.1 Prelude: sample-efficient reasoning

We begin with a diagnostic experiment: training on only 1,000 carefully curated samples with next-token prediction suffices to build a strong reasoning model. OpenAI o1 [OpenAI, 2024] and DeepSeek R1 [DeepSeek-AI et al., 2025] achieve strong reasoning through large-scale reinforcement learning with millions of training samples, yet a far simpler recipe works surprisingly well. We construct s1K, a dataset of 1,000 questions paired with reasoning traces distilled from Gemini
Thinking Experimental [Google, 2024], and perform supervised fine-tuning (SFT) of an off-the-shelf pretrained model, requiring just 26 minutes of training on 16 H100 GPUs. The resulting model, s1-32B, matches or exceeds closed-source models like OpenAI's o1-preview on several benchmarks (Figure 3.1).

Figure 3.1: s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (The figure shows the distribution of s1K over mathematical and scientific topics, e.g., geometry, number theory, combinatorics, calculus, statistics, and computer science.)

This extreme sample efficiency carries a profound implication: if 1,000 examples suffice to elicit strong reasoning, then reasoning capability must already be latent in the pretrained weights—post-training merely surfaces it. Pretraining is therefore the true bottleneck. All downstream capability ultimately derives from the initial pretraining phase, and improving pretraining itself—without relying on a stronger external teacher—is the highest-leverage intervention.
Extensive ablation experiments confirm that sample efficiency hinges on careful data curation: jointly selecting for quality, difficulty, and diversity is crucial, and training on our full pool of 59K examples offers no substantial gain over our 1K selection (Appendix B.4).

Training We perform supervised finetuning on Qwen2.5-32B-Instruct using s1K to obtain our model s1-32B, using the basic hyperparameters outlined in the appendix. Finetuning took 26 minutes on 16 NVIDIA H100 GPUs with PyTorch FSDP.

Table 3.1: s1-32B is a sample-efficient reasoning model. We evaluate s1-32B, Qwen, and Gemini; other results are from the respective reports [Qwen et al., 2024, Team, 2024, OpenAI, 2024, DeepSeek-AI et al., 2025, Labs, 2025, Team, 2025]. # ex. = number of examples used for reasoning finetuning.

Model                | # ex.  | AIME 2024 | MATH 500 | GPQA Diamond
API only
o1-preview           | N.A.   | 44.6      | 85.5     | 73.3
o1-mini              | N.A.   | 70.0      | 90.0     | 60.0
o1                   | N.A.   | 74.4      | 94.8     | 77.3
Open Weights
Qwen2.5-32B-Instruct | N.A.   | 26.7      | 84.0     | 49.0
QwQ-32B              | N.A.   | 50.0      | 90.6     | 54.5
r1                   | ≫800K  | 79.8      | 97.3     | 71.5
r1-distill           | 800K   | 72.6      | 94.3     | 62.1
Open Weights and Open Data
Sky-T1               | 17K    | 43.3      | 82.4     | 56.8
Bespoke-32B          | 17K    | 63.3      | 93.0     | 58.1
s1-32B               | 1K     | 50.0      | 92.6     | 59.6

Sample-efficiency In Table 3.1 we compare s1-32B with other models. s1-32B is the most sample-efficient open-data reasoning model. It performs significantly better than our base model (Qwen2.5-32B-Instruct) despite training on only 1,000 additional samples. The concurrently released r1-32B shows stronger performance than s1-32B while also using only SFT [DeepSeek-AI et al., 2025]; however, it trains on 800× more reasoning samples, and whether one can match its performance with just 1,000 samples remains an open question. Around half of all answers in s1K are wrong, yet the results are striking. This suggests that the SFT stage is about learning reasoning patterns rather than correct answers.
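Supervised finetuning on reasoning traces of this kind is ordinary next-token training in which the loss is computed only on the response tokens. The helper below sketches how such (input_ids, labels) pairs are commonly built, with -100 as the conventional ignore index; the token IDs are illustrative placeholders, not the authors' actual pipeline:

```python
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy loss

def build_sft_example(prompt_ids, response_ids, max_len=32):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return input_ids, labels

# Hypothetical token IDs for a question and its reasoning trace plus answer.
prompt = [101, 7, 42, 9]
response = [55, 56, 57, 102]
ids, labels = build_sft_example(prompt, response)
```

Masking the prompt means the gradient only rewards reproducing the reasoning trace, consistent with the observation that the SFT stage teaches reasoning patterns rather than answer correctness.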
3.1.1 Discussion: pretraining as the foundation of capability

Why does supervised finetuning on just 1,000 samples lead to reasoning performance matching o1-preview? The most natural explanation is that reasoning capability is already present in the pretrained weights. Pretraining on trillions of tokens—spanning mathematical proofs, code, scientific arguments, and logical discourse—exposes the model to vast quantities of implicit reasoning. Our sample-efficient finetuning does not teach the model to reason; it teaches the model to format its existing reasoning ability into an explicit chain of thought. This parallels the "Superficial Alignment Hypothesis" of LIMA [Zhou et al., 2023], where 1,000 examples suffice to align a model to user preferences because the core capability was already acquired during pretraining.

This interpretation has a profound consequence: post-training methods—whether reinforcement learning [DeepSeek-AI et al., 2025, Team et al., 2025], distillation from stronger models [Team, 2025, Xu et al., 2025, Labs, 2025], or prompting strategies like Chain-of-Thought [Wei et al., 2023]—are ultimately bounded by what pretraining provides. No amount of post-training can elicit a capability that was never acquired during pretraining. The true ceiling on model performance is therefore set during pretraining, not during alignment or reinforcement learning.

A complementary implication concerns test-time compute: if reasoning capability is already latent, then controlling how much the model thinks—even through crude interventions—should modulate performance without any additional training. We return to this observation in Chapter 4, where a simple technique called budget forcing produces test-time scaling behavior from s1-32B. This reframes pretraining as the highest-leverage target for improving language models.
If we want fundamentally more capable models—not just better-formatted outputs of existing capability—we must improve the pretraining phase itself. The remainder of this chapter takes up exactly this challenge.

3.2 Synthetic Bootstrapped Pretraining

Having established that pretraining is the true bottleneck, the remainder of this chapter pursues a single question: can a model improve its own pretraining without relying on a stronger external teacher? In Chapter 2, we allowed distillation from GPT-4 because the goal was data efficiency; here, we remove that crutch and ask whether genuine self-improvement is possible. We work in a data-limited regime, motivated by the approaching exhaustion of high-quality internet text [Villalobos et al., 2024]: we assume a fixed pool of unique documents and ask whether a model can extract more value from them than simple repetition provides. The defining constraint is that distillation is forbidden—the data synthesizer is trained from the same pretraining corpus, not from a stronger external model. Without this constraint, self-improvement would be trivially achievable by distilling from a more capable teacher.

We validate the approach through a compute-matched experimental design: fixing both the data budget and the compute budget, and comparing against a repetition baseline and an oracle with unlimited unique data. This is the right control because at frontier training scale, compute budgets already exceed the available unique high-quality data [Muennighoff et al., 2023], making the data-constrained regime the natural operating point. Re-examining the conceptual foundation of pretraining, its success stems from the rich causal correlation among tokens within a document.
However, this is not the only source of correlation pretraining datasets contain: a code document implementing the attention mechanism is derived from the arXiv preprint of the transformer paper; the Harry Potter books are structurally similar to the screenplays of their movie adaptations. Such connections suggest a weaker form of inter-document correlation derived from an underlying joint distribution of pretraining documents. We hypothesize that this additional signal, missed by standard pretraining, can be captured by synthetic data, presenting an underexplored avenue for improving performance.

To leverage this opportunity, we introduce Synthetic Bootstrapped Pretraining (SBP), an LM pretraining procedure that operates in three steps (Figure 3.2). First, SBP identifies semantically similar document pairs (d_1, d_2), such as the transformer paper and its code implementation, from the pretraining dataset. Second, SBP models the conditional probability of d_2 given d_1, creating a "data synthesizer" that can synthesize a new, related document given a seed document. Finally, SBP applies the trained conditional synthesizer to the pretraining corpus itself, creating a vast text corpus that encodes the rich inter-document correlations previously missed (§3.3). By training the data synthesizer from the pretraining dataset itself, SBP avoids the pitfall of "bootstrapping" model performance using an external, readily available teacher LM, demonstrating a clean setup where improvement stems from better use of the same pretraining corpus.

To test our hypothesis, we design a compute-matched, data-constrained experimental framework under which we pretrain 3B-parameter and 6B-parameter models on up to 1T tokens from scratch [Li et al., 2024b, Zyphra, 2024], demonstrating the applicability of SBP for advancing frontier LMs.
We compare SBP against two crucial references: a strong repetition baseline, which represents the standard approach in data-constrained settings, and an oracle upper bound with access to an unlimited pool of unique internet data (§3.4). Our results show that SBP improves over the strong repetition baseline across the pretraining scales we evaluate, and closes up to 60% of the performance gap to the oracle, which has access to 20× more unique data (§3.5.1).

Beyond benchmark performance, qualitative analysis of the synthesized documents reveals that they go beyond mere paraphrases of the real documents (§3.5.2). We postulate that the SBP synthesizer first abstracts latent concepts from the real document and then synthesizes a new document that expands upon the abstracted concepts, incorporating diverse genres and content. We formalize this intuition through a Bayesian hierarchical concept model, where documents are related through shared concepts. From this perspective, we argue that the synthesizer implicitly learns a posterior likelihood model that abstracts latent concepts from the document—a mechanism not present in standard LM pretraining (§3.6).

In summary, our contributions are threefold:

• New pretraining framework: We propose the Synthetic Bootstrapped Pretraining (SBP) algorithm, which explicitly models inter-document correlations missed by standard pretraining practice and encodes those correlations into training via synthetic data.
• Large-scale empirical validation: We design a compute-matched pretraining setup that enables rigorous measurement of LM self-improvement, and empirically validate SBP on 3B- and 6B-parameter models trained on up to 1T tokens from scratch.
• Principled statistical interpretation: We offer a natural Bayesian interpretation of SBP as implicitly learning a posterior over the latent concepts in a text document, and concretize the intuition via qualitative analysis of synthesized documents.

Figure 3.2: Data synthesis illustration of Synthetic Bootstrapped Pretraining (SBP): it first identifies semantically similar documents (Step 1) and then trains a conditional model that generates one element of the pair from the other (Step 2). Finally, SBP applies the conditional model to the pretraining corpus itself to synthesize a new, vast corpus for joint training (Step 3). (The diagram pairs examples such as the transformer paper with PyTorch attention code, and the Harry Potter book with a movie review, yielding synthesized notes and blog posts.)

In the remainder of this chapter, we first define the data-constrained pretraining problem we address and introduce the SBP technique in §3.3. We then present the compute-matched experiment setup in §3.4 and results in §3.5. Finally, we conclude with a Bayesian interpretation of SBP that sheds light on the origin of the improved performance in §3.6.

3.3 Method

We introduce the data-constrained pretraining setup (§3.3.1) and then present the SBP procedure in three steps (§3.3.2). We present SBP as a general pretraining recipe given a pretraining dataset, an LM architecture, and a collection of evaluation benchmarks, and defer the concrete compute-matched experiment design to §3.4.
3.3.1 Data-constrained pretraining setup

We consider a data-constrained setup where the goal is to train the best-performing LM given access to a fixed document collection D_pretrain (e.g., a snapshot of the entire internet). To establish a controlled experimental framework, we also choose a transformer architecture with parameters θ and a collection of held-out evaluation benchmarks Perf (e.g., perplexity, few-shot QA accuracy). Recall that a transformer takes a sequence of tokens as input and outputs a sequence of conditional probabilities of each token given all previous tokens. Applying the chain rule for joint probability, we can use a transformer to calculate the probability p_θ(y) of observing a particular text input y, or the conditional probability p_θ(y|x) of a piece of text y given a preceding piece of text x. Under such a setup defined by (D_pretrain, p_θ, Perf), pretraining searches for the best-performing transformer weights by maximizing the sum of the log-likelihood of pretraining documents,

    arg max_θ ∑_{d ∈ D_pretrain} log p_θ(d),    (3.1)

and then evaluates the performance through Perf(θ). Statistically, this objective treats each document as an independent sample from a hypothetical distribution of all documents and attempts to learn this marginal distribution. However, this modeling assumption overlooks the structural similarities shared between natural language texts (e.g., Figure 3.2). We next present the SBP procedure that fills this gap.

3.3.2 Synthetic bootstrapped pretraining

At a high level, SBP finds related document pairs (d₁, d₂) from the pretraining dataset D_pretrain and trains a conditional synthesizer p_θ(d₂|d₁) using the same transformer architecture parametrized by θ. It then uses this synthesizer to generate a large collection of documents S_pretrain and performs joint pretraining on D_pretrain ∪ S_pretrain.
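As a concrete toy instance of objective (3.1), the sketch below uses a hypothetical add-α-smoothed bigram model standing in for the transformer p_θ: log p(d) decomposes by the chain rule into per-token conditional log-probabilities, and the pretraining objective sums these document log-likelihoods over the corpus.

```python
import math
from collections import defaultdict

def train_bigram(docs, alpha=1.0):
    """Fit add-alpha-smoothed bigram counts (a toy stand-in for p_theta)."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for d in docs:
        toks = ["<bos>"] + d.split()
        vocab.update(toks)
        for prev, cur in zip(toks, toks[1:]):
            counts[prev][cur] += 1.0
    return counts, vocab, alpha

def log_prob(model, doc):
    """Chain rule: log p(d) = sum_t log p(token_t | token_<t)."""
    counts, vocab, alpha = model
    toks = ["<bos>"] + doc.split()
    lp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        num = counts[prev][cur] + alpha
        den = sum(counts[prev].values()) + alpha * len(vocab)
        lp += math.log(num / den)
    return lp

def pretraining_objective(model, docs):
    """Objective (3.1): sum of per-document log-likelihoods."""
    return sum(log_prob(model, d) for d in docs)
```

A model fit this way assigns higher log-probability to word orders it has seen, which is exactly the marginal-distribution learning that the objective (3.1) performs and that SBP augments.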
The fact that SBP trains a data synthesizer from D_pretrain itself also distinguishes it from extensive existing work that relies on a readily available “teacher” LM.

Step 1: nearest neighbor pairing. In preparation for training the conditional data synthesizer, SBP first curates pairs of related documents. To efficiently perform similarity search at pretraining scale, we use Approximate Nearest Neighbor (ANN) search [Malkov and Yashunin, 2018], which embeds each document as a quantized vector normalized to the unit sphere and then performs massively parallelizable linear algebraic operations. In our implementation of SBP, we use inner-product similarity, which we denote by ⟨d₁, d₂⟩. Then, we select the subset of pairs whose similarity score exceeds a certain threshold α:

    D_ST = {(d₁, d₂) ∈ D_pretrain × D_pretrain : ⟨d₁, d₂⟩ > α}.    (3.2)

We provide implementation details of paired data curation in §B.1.1.

Step 2: synthesizer-tuning. SBP exploits the correlation between pairs of related documents by maximizing the conditional probability of d₂ given d₁:

    θ_ST = arg max_θ ∑_{(d₁,d₂) ∈ D_ST} log p_θ(d₂|d₁),    (3.3)

which we obtain by summing over the log conditional probabilities corresponding to tokens from document d₂. We refer to this step as “synthesizer-tuning” as we are training a conditional probabilistic model that synthesizes a related d₂ from a given d₁. When performing synthesizer-tuning, we initialize p_θ at the pretrained checkpoint (3.1) so that the model is equipped with the knowledge of individual documents at initialization, but not the conditional relation between them. Each document d₁ can be associated with multiple instances of d₂, encouraging the synthesizer to produce diverse, high-entropy outputs rather than deterministic synthesis.
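The thresholded pairing of Eq. (3.2) can be sketched in a toy setting. The paper uses ANN search over quantized embeddings at pretraining scale; here hypothetical 2-d embeddings and a brute-force scan stand in for that machinery.

```python
import math

def normalize(v):
    """Project an embedding onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def inner(u, v):
    """Inner-product similarity <d1, d2>."""
    return sum(a * b for a, b in zip(u, v))

def pair_documents(embeddings, alpha):
    """Return index pairs (i, j), i != j, whose similarity exceeds alpha (Eq. 3.2)."""
    unit = [normalize(e) for e in embeddings]
    return [(i, j)
            for i, u in enumerate(unit)
            for j, v in enumerate(unit)
            if i != j and inner(u, v) > alpha]
```

Because the embeddings are unit-normalized, the inner product equals cosine similarity, so the threshold α directly controls how semantically close two documents must be before they become a training pair.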
Step 3: data synthesis at scale. Finally, SBP synthesizes S_pretrain through a hierarchical sampling process:
• Sample the seed document d₁ from D_pretrain uniformly at random;
• Sample the synthesized document d₂ from p_{θ_ST}(·|d₁).

This process achieves synthetic data diversity through two sources of variation: first, the variation of the seed documents d₁, which comes from the diversity of the pretraining corpus D_pretrain itself, and second, the entropy of the conditional distribution p_{θ_ST}(·|d₁), which stems from the diverse inter-document correlations captured in D_ST. While empirically motivated, the procedure admits a principled Bayesian interpretation of the distribution of natural language texts, which we explain in §3.6. For now, we focus on demonstrating the empirical effectiveness of SBP.

3.4 Experiment setup

We present our concrete experimental implementation of SBP. We curate a pretraining dataset of 582M high-quality documents totaling 482B tokens from DCLM [Li et al., 2024b], design 3B and 6B transformer architectures modified from the Llama 3 implementation [Dubey et al., 2024a], and select nine commonly used benchmarks targeting general world knowledge and commonsense reasoning (§3.4.1). We propose a compute-matched comparison scheme to validate SBP against natural reference methods at a compute scale of up to 1T total training tokens in our largest experiment (§3.4.2), bringing validation to a scale relevant for frontier LM development.

3.4.1 Data, model, and evaluation

Dataset. A typical pretraining dataset is a mixture of different sources (e.g., GitHub, arXiv, CommonCrawl) with distinct sampling weights assigned to each constituent. We simplify this reality by considering a fixed document collection, which is a customized version of the DCLM dataset [Li et al., 2024b]. The original 4T-token DCLM-baseline split contains roughly 80% duplicates, as reported by Zyphra [2024].
We therefore begin with the de-duplicated dataset, which consists of 769B tokens. We clean the raw Zyphra de-duplicated data by normalizing repeated line breaks, removing long URL links, and fixing malformed Unicode characters. For efficiency, we cap the context window of the synthesizer-tuning (3.3) step at 8,192 tokens. As a result, we additionally filter out documents whose length exceeds 4,096 tokens, allowing both d₁ and d₂ to fit into the context window in the worst case when both documents are 4,096 tokens long. After all the de-duplication, cleaning, and filtering procedures, we end up with a collection of 582M high-quality documents D_pretrain totaling 482B tokens. We use the notation |D_pretrain| to denote the number of documents in the pretraining dataset and ∥D_pretrain∥ to denote the total number of tokens.

Architecture. We use the Llama 3 transformer architecture [Dubey et al., 2024a] to model the probability p_θ, with the notable exception of implementing QK-norm on top of the existing design, which we empirically find to stabilize training. Our resulting model is a 3B-parameter, 26-layer transformer with a hidden dimension of 3,072. Each layer employs grouped query attention with 24 query heads and 8 key/value heads. To validate the scalability of SBP, we also train a 6B-parameter model with 32 layers, a hidden dimension of 4,096, 32 query heads, and a feedforward dimension of 13,056 (detailed in Table B.9). The position embedding is RoPE [Su et al., 2023] for queries and keys, with frequency 5e+5. The feedforward network (FFN) has hidden dimension 8,064, and we apply prenorm to both the attention and FFN blocks. For tokenization, we implement a customized BPE tokenizer with a vocabulary size of 49,152.
To match the 8,192-token context window used for synthesizer-tuning, we use a context window of 4,096 for pretraining, so that every document in D_pretrain fits into the context window.

Benchmarks. To assess the pretraining capability of LMs, we measure pretraining test loss and general world knowledge benchmarks. We evaluate held-out test perplexity (exponential of the negative log-probability) on 1) OpenWebText2 from EleutherAI [Radford et al., 2018b]; 2) narrative understanding with LAMBADA [Paperno et al., 2016]; and 3) broad-domain multiple choice with MMLU [Hendrycks et al., 2021a]. We evaluate QA accuracy on 4) hard scientific reasoning with ARC-Challenge [Clark et al., 2018]; 5) easy scientific reasoning with ARC-Easy [Clark et al., 2018]; 6) scientific QA with SciQ [Welbl et al., 2017]; 7) common sense reasoning with Winogrande [Sakaguchi et al., 2021]; 8) reading comprehension with TriviaQA [Joshi et al., 2017]; and 9) openbook QA with WebQS [Berant et al., 2013]. We directly evaluate the pretrained model with either zero-shot or few-shot prompts. Although MMLU is more commonly used as a QA benchmark, we find that evaluating MMLU accuracy for weak models yields a highly non-smooth readout. As a result, for each MMLU test question, we prepend the question with 5-shot example QA pairs and append the correct answer. Then, we treat each such sample as a text corpus and evaluate the LM's perplexity on it. We find that this perplexity-based MMLU correlates well with MMLU accuracy when the underlying model is large enough to yield a stable readout, and also delivers smooth performance changes for smaller models. These benchmarks improve significantly with instruction finetuning [Wei et al., 2022]. However, we adhere to our data-constrained setup and do not introduce any additional data that may confound the comparison.
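The perplexity-based MMLU readout described above can be sketched as follows. The exact prompt template is an assumption on our part; the key idea is to score the full 5-shot text with the gold answer appended by perplexity, rather than by accuracy.

```python
import math

def build_mmlu_sample(shots, question, answer):
    """Prepend few-shot QA pairs and append the correct answer
    (hypothetical template), yielding one text sample to score."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA: {answer}")
    return "\n\n".join(parts)

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-probability per token);
    the log-probs would come from the pretrained LM."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

A model that assigns every token probability 0.5 has perplexity 2; lower is better, and unlike argmax accuracy the readout changes smoothly as the model improves.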
3.4.2 Compute-matched comparison

We propose a compute-matched experimentation framework to rigorously compare SBP against two natural references: a repetition baseline, in which we repeat D_pretrain multiple times to use the available training compute, and an oracle upper bound, which lets the model access as many unique documents as possible. Compute-matching in a data-constrained regime directly models the real-world situation at frontier scale, where training compute budgets already exceed the supply of unique high-quality internet text—making “what to do with leftover compute after exhausting unique data” the operationally relevant question. Operationally, we control the training compute by controlling the total tokens seen during training, which is proportional to the training FLOPs given a fixed batch size and context window. We validate SBP across three settings:
• 200B-scale: We cap the training compute at 200B tokens and the data access at ∥D_pretrain∥ = 10B tokens.
• 1T-scale (3B): We also consider a larger scale closer to frontier model training, where we cap the training compute at 1T tokens and the data access at ∥D_pretrain∥ = 50B tokens.
• 1T-scale (6B): To validate SBP on larger models, we additionally train a 6B-parameter model with the same 1T-token budget and 50B unique data access.

For each training scale, D_pretrain of the corresponding size is sampled uniformly at random from the 582M-document pool. Given the compute-controlled comparison scheme, we next introduce the two reference methods against which we compare SBP.

Repetition baseline. Since the compute budget typically exceeds the total number of unique tokens ∥D_pretrain∥, a natural baseline is to repeat D_pretrain over multiple epochs. By design, at both the 200B-scale and the 1T-scale, we repeat the pretraining dataset D_pretrain 20 times to exploit the available compute budget.
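The budget arithmetic behind the 20x repetition is simple; a quick sanity check with the numbers from the settings above:

```python
def num_epochs(token_budget, unique_tokens):
    """Times D_pretrain is repeated to exhaust the compute budget,
    given that compute is controlled via total tokens seen."""
    return token_budget / unique_tokens

# 200B-scale: 200B-token budget over 10B unique tokens.
# 1T-scale:   1T-token budget over 50B unique tokens.
```

Both settings work out to 20 passes over the unique data, which is why the repetition baseline uses 20 epochs at every scale.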
In practice, when the pretraining dataset comes from a mixture of different sources, higher-quality documents can be seen as many as 30 times during pretraining, while lower-quality texts may appear only once. Muennighoff et al. [2023] systematically evaluate the repetition baseline as a proposal for scaling LMs under data constraints and find that repeating D_pretrain up to 4 times yields nearly no performance degradation compared with access to unlimited fresh data, but after around 40 repetitions the returns diminish rapidly. Our choice of 20 repetitions with a compute-matched comparison therefore strikes a reasonable balance between efficient experimental execution and exhausting all possible performance gains from a fixed D_pretrain via repetition.

Oracle upper bound. Besides showing improvement over the repetition baseline, we also evaluate an oracle upper bound with unlimited data access to contextualize the numerical improvement delivered by SBP. As we shall see in the next section, because different benchmarks respond differently to changes in data size, SBP can deliver an improvement as large as 3.74% on some benchmarks but only 0.14% on others (Table 3.2). Because performance on LM benchmarks tends to scale logarithmically with data size [Owen, 2024, Kaplan et al., 2020], the numerical difference quickly caps out as we move from the 200B-scale to the 1T-scale. By introducing this oracle upper bound, we can contrast the SBP improvement against this “oracle” improvement.

For the 200B-scale experiment, we implement the oracle upper bound as having access to 200B unique tokens from our document pool of 482B tokens. For the 1T-scale experiment, we unfortunately do not have 1T unique tokens due to the large fraction of duplicates in DCLM. As a surrogate, we use all 482B unique tokens as the dataset for training the oracle upper bound at the 1T-scale.
We provide a partial justification for this by performing a scaled-down comparison at 400B training tokens, with one model having 400B unique tokens and the other having 200B unique tokens repeated twice (§B.3.1). We find that the two models (400B unique and 200B repeated twice) yield nearly identical performance.

Training recipe. For both the repetition baseline and the oracle upper bound at all scales, we use a batch size of 2,048 and a context window of 4,096, resulting in a throughput of 8M tokens per step. We apply a cosine learning rate schedule with a 5% warmup to a peak learning rate of 1e-2, followed by decay to 5e-5 towards the end of training. Under this setup, pretraining costs 11K v5p-TPU hours at the 200B-scale, 59K v5p-TPU hours at the 1T-scale (3B), and 265K v5p-TPU hours at the 1T-scale (6B). For a clean comparison, we adhere to these hyperparameters throughout the paper, including the SBP experiment presented next.

3.5 Experiment results

We perform SBP experiments under the compute-matched framework outlined in §3.4 at three compute budgets: 200B-scale, 1T-scale (3B), and 1T-scale (6B). After joint training on real and synthetic data D_pretrain ∪ S_pretrain, we find SBP consistently improves upon the repetition baseline across all scales (Table 3.2). In this section, we focus on presenting the performance of SBP and evaluating the quality of the synthesized pretraining data. We defer the implementation details of SBP to §B.1.1.

3.5.1 Main benchmark performance

At the 200B-scale, we start with a source dataset of ∥D_pretrain∥ = 10B tokens and curate an SBP dataset of ∥S_pretrain∥ = 75B tokens (detailed ablation in §B.1.2). We perform joint training on D_pretrain ∪ S_pretrain with the principle that we do not repeat any synthetic documents during training. This means that out of a 200B-token training budget, we spend 37.5% on the 75B synthetic tokens from S_pretrain without any repetition, and the remaining 62.5% on the real dataset D_pretrain repeated 12.5 times. As shown in Table 3.2, SBP consistently decreases test loss and improves QA accuracy. On average, SBP captures 2.32/5.54 = 42% of the improvement in QA accuracy delivered by the oracle run with 20x additional data access.

Table 3.2: Compute-matched comparison of Synthetic Bootstrapped Pretraining (SBP) and oracle performance gains over the repetition baseline. On average, SBP delivers roughly 43% (3B) and 58% (6B) of the QA-accuracy improvement attainable by an oracle with access to 20x more unique data.

                               200B-scale                1T-scale (3B)             1T-scale (6B)
Benchmark                      Baseline  SBP     Oracle  Baseline  SBP     Oracle  Baseline  SBP     Oracle
Perplexity on held-out data ↓
OpenWebText2                   5.74      -0.53   -1.02   4.51      -0.02   -0.12   4.25      -0.06   -0.21
LAMBADA                        6.87      -0.85   -1.86   4.33      -0.03   -0.22   3.63      -0.06   -0.25
Five-shot MMLU                 3.83      -0.36   -0.51   3.17      -0.06   -0.05   3.08      -0.08   -0.13
QA accuracy ↑
ARC-Challenge (0-shot)         35.32     +1.28   +2.82   42.66     +1.62   +3.84   47.44     +0.77   +0.17
ARC-Easy (0-shot)              68.94     +2.65   +4.29   75.63     +0.42   +2.11   78.70     +0.51   +0.85
SciQ (0-shot)                  90.50     +1.00   +2.40   93.20     +0.80   +0.50   92.90     +1.90   +1.80
Winogrande (0-shot)            60.14     +1.90   +5.53   65.19     +1.42   +2.92   70.17     +0.47   +2.36
TriviaQA (1-shot)              22.51     +3.36   +7.37   36.07     +0.25   +0.59   40.64     +0.49   +3.19
WebQS (1-shot)                 8.56      +3.74   +10.83  19.34     +0.54   +0.44   19.88     +3.79   +5.22
Average QA accuracy            47.66     +2.32   +5.54   55.35     +0.84   +1.73   58.29     +1.32   +2.26

The training dynamics of SBP reveal its core mechanism. As shown in Figure 3.3, the baseline initially performs similarly to the oracle, since their training data share the same distribution, and when the number of tokens seen is small, there is no distinction between the two. Then, gradually, the oracle becomes a better model than the baseline, as it has access to unlimited unique training data.
For the SBP dynamics, the model initially performs worse than both the baseline and the oracle, which is expected since the quality of the synthesized data at most matches that of the real data. Gradually, however, SBP continues to scale while the baseline has plateaued. This suggests that S_pretrain offers a signal that D_pretrain alone cannot capture.

Lastly, to validate the benefit of SBP across training scales, we implement a larger experiment with ∥D_pretrain∥ = 50B unique tokens under a compute budget of 1T total training tokens, using both 3B- and 6B-parameter models. Based on the ablation studies presented in §B.1.2, we include ∥S_pretrain∥ = 125B synthetic tokens for the 3B model and ∥S_pretrain∥ = 250B synthetic tokens for the 6B model, adhering to the principle of no repetition for synthetic data. Examining the results in Table 3.2, we observe that while perplexity-based measurements plateau for the 3B model [Liu et al., 2023b], benchmarks like ARC-Challenge and Winogrande continue to show gains.

Figure 3.3: Training dynamics (200B-scale); test loss on OpenWebText2 as a function of compute (in billions of tokens) for the baseline, SBP, and the oracle.

On average, SBP recovers 0.84/1.73 = 48% of the oracle's QA accuracy improvement for the 3B model. The results are even more pronounced for the 6B model, where SBP delivers a relative improvement of 1.32/2.26 = 58% compared to the oracle. This suggests that SBP's effectiveness may scale favorably with model size. Furthermore, the increased optimal synthetic data ratio for the 6B model suggests that larger models possess greater capacity to exploit the additional information encoded in the synthetic corpus.

3.5.2 Analysis of synthetic data

We provide qualitative and quantitative analyses of the synthesized documents to gain insight into the SBP procedure beyond benchmark performance.
Qualitative examples. Figure 3.4 shows samples of synthesized documents from the 200B-scale experiment, with additional samples from the 1T-scale (3B) presented in §B.2.4. On the left, we display a real document: a practical, first-person guide to the coffee houses of San Diego. We then present two synthesized texts that exhibit notable differences in both framing and depth, with varying degrees of fidelity to the seed document. Synthesis I sticks to the same topic but shifts toward an expository essay on espresso machines and bean quality, with little mention of specific coffee shops. Synthesis II adopts a promotional, comparative style, linking San Diego's coffee culture to New York's and praising Café Lestat in a way that departs from the original's balanced assessments. SBP provides no instructions on how the synthesizer should use the seed texts to write new documents. The model spontaneously learns to introduce new content and style into the discussion while staying on topic. It is challenging to manually craft a prompt to an instruction-tuned model that would output either Synthesis I or II given the real document as input. This example highlights how SBP differs from existing paradigms of data synthesis—the output first abstracts the seed document and then synthesizes new text with more generalized narratives, genres, and intent. We provide a more extensive analysis of this observation in §3.6.

Quantitative analysis. We also conduct quantitative evaluations to assess the quality of the generated texts. We measure text distributions for the synthesized documents at the 200B-scale and the 1T-scale. To establish a reference, we also conduct the same evaluation on the real documents. We measure five basic quality indicators:
• Repetition: A document may contain too many repeated sentences or patterns. The repetition rate refers to the fraction of documents that exhibit this problematic behavior.
• Duplicate@1M: Another failure mode of synthesis is when the documents sampled from the synthesizer distribution are near duplicates of each other. Duplicate@1M refers to the fraction of near-duplicate documents (determined by Jaccard similarity at a threshold of 0.6) when 1M documents are sampled from the text distribution.
• Non-factual: A common failure mode of synthesis is the generation of content that contradicts established knowledge or facts. The non-factual rate refers to the fraction of documents that contain verifiable factual errors, as determined by automated fact-checking tools.
• Pair-irrelevance: The synthesized d₂ is considered relevant to d₁ if they pertain to the same topic, event, entity, person, place, or object. Pair-irrelevance refers to the fraction of synthesized d₂ that are not relevant to their d₁, indicating that the synthesis is not properly using information from d₁.
• Pair-copying: d₁ and d₂ are considered near-duplicates if they are almost identical, except for some extra white spaces, line breaks, or punctuation. Pair-copying refers to the fraction of synthesized d₂ that are near-duplicates of their d₁.

Operationally, we implement Repetition, Pair-irrelevance, and Pair-copying using LM-as-judge (prompts and more implementation details given in §B.2.3) by sampling 1,000 examples from each distribution and estimating the fraction of documents satisfying each criterion. For Non-factual (prompts and details given in §B.2.2), we sample 10,000 examples and conduct a comprehensive examination of factual errors to ensure broader coverage of the generated data. For Duplicate@1M, we use rule-based filtering to detect the fraction of duplicates among 1M documents sampled from each distribution.

Real document: “HOME > Chowhound > California > Coffeehouses in SD. I am always looking for good coffeehouses in San Diego. I am looking for coffeehouses with (obviously) excellent coffeebeans, baristas who know their business (good microfoam, good crema, they should know the different ratios for steamed milk, foamed milk, espresso for the cappuccino, latte etc., etc), good snacks (paninis, cakes etc), good (cold) flavored coffeedrinks and good ambience. So far I haven't found many who are good but would like to hear more suggestion. Caffe Calabria: By far the best coffeehouse in SD. Excellent coffee and baristas (best cappuccino in SD), great paninis. Only downside is that they close very early. Café Lestat: Good example what happens with good coffee and average baristas. Lestat is also using coffeebeans from Caffé Calabria but it is sad to see that they don't know how to make excellent espresso, cappuccino. E Street Café: Average espresso but strange flavored ...”

Synthesis I: “Coffee = Love. On a recent trip to San Diego, one thing I wanted to find was coffee. I was worried it might be difficult with the huge decrease in Asian flavors, but I was pleasantly surprised. Many of the Mexican, Italian, and Cuban flavors that are pretty common in California are now in the mix. While I did find a few highly recommended spots, the majority of the locations I checked out were nothing to write home about. At the least, they were not being sold out, which I consider a plus. One major factor that really needs to be taken into account is the type of machine being used. There are two types of espresso machines: the cheap ones that spit out lots of pressure and don't have any control over the shot, and the expensive ones that create amazing shots and utilize the types of superfine grinds you usually find in Italy. The latter types are much harder to find and cost significantly more. Another factor to consider is the type of beans being used. If they are done correctly, great espresso can be made from any type of bean, but it's a rare experience to find a place that does ...”

Synthesis II: “Discuss as: Coffeehouses: San Diego. The best? Café Lestat, by far, is a standout. (photo by Thomas Moriarty) If the coffee scene in San Diego sounds like the one in New York, there's a reason. The upper echelons of NYC's drink scene are the same ones who have taken over the majority of the coffee houses in San Diego. This is probably partly due to the fact that both cities have large student populations, both are home to large coffee drinker bases, both have a large number of espresso shops, and both offer top-notch, hand-crafted cappuccinos. But if there is one city in America that deserves to have its name above that of New York, it's San Diego. There are just under 100 coffee shops in San Diego, with almost half of them located on University Ave. alone. So finding the perfect coffee shop is crucial. We spent a whole day just roaming around the area, hunting for the best. In terms of the coffee itself, it's hard to beat Café Lestat. The baristas are amazing and their methods are pristine ...”

Figure 3.4: Comparison of the original text with synthesized text variations.

We present the results in the table below. For all metrics, lower is better.

Table 3.3: Quantitative evaluation of documents sampled from the synthesizer at the 200B-scale and 1T-scale. The synthesized documents preserve topics and are not simple duplicates.

               Repetition ↓  Duplicate@1M ↓  Non-factual ↓  Pair-irrelevance ↓  Pair-copying ↓
200B-scale     4.3%          0.8%            15.1%          25.6%               0.1%
1T-scale (3B)  3.9%          0.8%            8.7%           7.8%                0.9%
1T-scale (6B)  2.6%          0.3%            6.5%           6.0%                0.3%
Real data      1.8%          0.7%            1.8%           n.a.                n.a.
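The near-duplicate test behind Duplicate@1M can be sketched as follows. The 0.6 Jaccard threshold is from the text; computing it over word sets is an assumption, since the exact tokenization is not specified.

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity between two documents viewed as word sets."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, threshold=0.6):
    """Flag a pair as near-duplicate at the stated 0.6 threshold."""
    return jaccard(doc_a, doc_b) >= threshold
```

At scale, such pairwise checks are typically applied after a cheap blocking step (e.g., hashing), since comparing all pairs among 1M sampled documents directly would be quadratic.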
Repetition and Duplicate@1M measure basic text quality independent of the specific pair-synthesis strategy employed by SBP. They aim to detect two simple failure modes: text repetition, a common failure pattern in generations from small language models (3B in our case), and lack of diversity, a common issue with synthetic data that relies on variation induced by the sampling temperature. From Table 3.3, we find that both the 200B-scale and 1T-scale synthesis match the quality of real data as captured by these two metrics. The absence of repetitions and duplicates is not, in itself, an indicator of high-quality or educational text, but rather a basic sanity check that the synthesized texts are diverse.

Non-factual failures stem from hallucinations that introduce non-existent entities or relations inconsistent with reality. We find that synthesis at the 1T-scale (3B) significantly reduces these errors compared to the 200B-scale. Furthermore, with the 6B-parameter model, the non-factual rate further decreases to 6.5%. This indicates that as the data synthesizer trains on more data with larger models, the factuality of the generated outputs converges toward that of real data.

Pair-irrelevance and Pair-copying, on the other hand, measure how the synthesized d₂ relates to the seed d₁. We aim to detect two failure modes: when d₂ is completely irrelevant to d₁, and when d₂ merely copies the content of d₁. We observe that both the 200B-scale and 1T-scale synthesis avoid simply copying and pasting d₁. The 1T-scale demonstrates substantially higher relevance than the 200B-scale, which makes sense as the synthesizer learns more diverse relations among a |D_pretrain| = 60M document corpus than among a |D_pretrain| = 12M corpus. This concludes the main experimental results.
In the appendix, we present implementation details of SBP in §B.1.1, ablations on the synthetic data mixture ratio in §B.1.2, additional analysis of synthesized documents in §B.2, and a comparison with a larger 6B model in §B.3.2.

3.6 Statistical foundations of SBP

We present a Bayesian interpretation of the SBP procedure, offering one explanation for the origin of SBP's improvement. We formulate a hierarchical model of natural language texts (§3.6.1) and demonstrate that SBP implicitly enables LMs to learn a posterior that standard pretraining cannot capture. We conclude by connecting our findings from this idealized model to the reality of LMs (§3.6.2).

We begin with the observation that the pretraining objective models the marginal likelihood of documents:

    arg max_θ log p_θ(D_pretrain) = arg max_θ ∑_{d ∈ D_pretrain} log p_θ(d).    (3.4)

However, different natural language documents share structural similarities (Figure 3.2), suggesting a more complex underlying joint distribution that we explore next. Recall that EntiGraph (Chapter 2) explicitly constructed an entity-relation graph to generate diverse synthetic data from a source corpus. SBP achieves an analogous effect implicitly: rather than prompting for entities and their relations, the synthesizer infers a latent concept c that governs the joint distribution of related documents. Both methods share the same underlying principle—generating data from a structured intermediate representation yields higher diversity than direct paraphrasing.

3.6.1 A hierarchical concept model for natural language

In the transformer example from Figure 3.2, both the arXiv preprint of the transformer paper and its code implementation are derived from the abstract concept of “transformer neural network”.
From this perspective, we can view the generation process of natural language documents as a hierarchical sampling process, where we first sample a collection of abstract concepts c^(i) (e.g., the idea of a transformer) from a semantic space of all concepts C and then generate new documents d^(i,j) conditional on c^(i). (Diagram: concepts c^(1), c^(2), ... are sampled, and each concept c^(i) generates its own documents d^(i,1), d^(i,2), ....)

Under this view, we can think of the pretraining documents as generated as follows.
• Concept sampling: Sample a fixed concept collection {c^(i)}_i ∼ P(c).
• Document generation: For each concept c^(i), generate documents {d^(i,j)}_j ∼ P(d|c^(i)), constituting one part of the pretraining dataset.

Under such a model, the structural similarity between documents generated from the same concept is modeled as probabilistic dependence. The standard pretraining objective (3.4) then neglects inter-document correlation and only learns the marginal distribution

    P(d) = ∫_{c ∈ C} P(d|c) P(c) dc.    (3.5)

In this view, the model learns to generate plausible text by first generating a core concept c and then performing the generation P(d|c). In contrast, the synthesizer-tuning objective models a posterior of c given d. To see this, we additionally assume that the curated pairs (d₁, d₂) come from the same underlying concept c. Then, the synthesizer-tuning objective (3.3) forces the LM to perform a distinct task:

    P(d₂|d₁) = ∫_{c ∈ C} P(d₂|c) P(c|d₁) dc.    (3.6)

Here, we use Bayes' rule and the conditional independence assumption P(d₂|c, d₁) = P(d₂|c), which says that documents from the same concept are conditionally independent given that concept.
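The decomposition (3.6) can be checked numerically in a toy instance of the hierarchical model; the two concepts and all probability values below are made-up illustration numbers, not from the paper.

```python
P_c = {"c1": 0.5, "c2": 0.5}                 # concept prior P(c)
P_d_given_c = {                               # document model P(d | c)
    "c1": {"paper": 0.7, "code": 0.3},
    "c2": {"paper": 0.2, "blog": 0.8},
}

def posterior(d1):
    """Bayes' rule: P(c | d1) proportional to P(d1 | c) P(c)."""
    w = {c: P_d_given_c[c].get(d1, 0.0) * P_c[c] for c in P_c}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}

def p_d2_given_d1(d2, d1):
    """Eq. (3.6): integrate P(d2 | c) against the posterior P(c | d1)."""
    post = posterior(d1)
    return sum(P_d_given_c[c].get(d2, 0.0) * post[c] for c in P_c)
```

In this toy instance, observing d₁ = "code" identifies concept c1 exactly, so the conditional assigns "paper" probability 0.7, while the marginal under (3.5) assigns it only 0.5·0.7 + 0.5·0.2 = 0.45; this sharpening via the posterior is precisely the signal the marginal objective discards.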
As a result, to successfully model (3.6), the synthesizer must first perform posterior inference to infer the latent concept c given the document d₁, and then use this inferred concept to synthesize a new document d₂, a signal that is ignored by the standard pretraining objective. To illustrate this, we perform a post-hoc analysis by prompting an LM to identify the shared concepts between the synthesized document and its seed (Table 3.4). While it is difficult to describe a synthesized document as the outcome of a simple transform, such as a paraphrase or summarization, it always shares a common underlying concept with its seed origin.

Table 3.4: Examples of latent concepts c inferred by an external LM (prompts provided in §B.2.1). From left to right: a summary of the real document, the inferred latent concept, and a summary of the synthesized document.

Real document summary                          | Inferred concept                    | Synthesized document summary
Twitter's impact on journalism                 | Opportunities arise from Twitter    | Guide on Twitter user monetization
Family story about kids and doughnuts          | Parenting + kids' food catering     | Emotional anecdotes of parents treating kids
Minor parties' challenges in the U.S. Congress | Minor political parties in the U.S. | Explains U.S. minor parties' history
Personal stories/questions about swollen eyes  | Causes/treatments of swollen eyes   | Non-personal guide to treating swollen eyes
Antarctic carbon fixation mechanisms           | How life survives in the Antarctic  | Antarctic geography and survival adaptations
Profile of a belly dancing teacher in the U.K. | Belly dancing as a dance form       | General introduction to belly dancing
Anxiety about creative work judged in a dream  | Dream as personal self-reflection   | Description and reflection of personal dreams
NYC (yearly/monthly) climate extremes          | NYC weather and temperature         | QA on NYC July heat and related topics
Tutorial for Minecraft block modding           | Block editing in Minecraft          | Minecraft forum question on removing blocks
Cosmic airburst destroys Tall el-Hammam city   | Destruction of ancient cities       | Tall el-Hammam excavation as a news event

The additional signal from the posterior then enables a form of self-distillation. The synthesizer, by learning a more complex conditional objective, becomes a more knowledgeable “teacher” model that has learned to infer the latent structure of the data. The synthetic data it produces is then the knowledge “distilled” from this teacher [Hinton et al., 2015]. The final LM training then acts as a “student” that learns from a combination of real and synthetic data, allowing it to discover information that real data alone cannot reveal.

3.6.2 From idealized models to language model reality

For real text documents, we do not know the true data-generating process, and any parametric assumption would be incorrect. This is where the power of the transformer neural network shines. A transformer is a mapping-first [Breiman, 2001] approach. It does not require explicit modeling of the underlying parametric model. Instead, as a universal function approximator [Candès, 1998], it directly learns the complex conditional distribution p_θ(d₂|d₁) from paired data alone. In this context, the transformer's ignorance of an explicit hierarchical model is its blessing.
It bypasses the impossible step of modeling the true hierarchical distribution of language and instead brute-forces the learning of the exact transformation required: the end-to-end process of posterior inference and subsequent synthesis. The self-distillation framework—synthesizing data from this conditional model and then training on it—is all that is needed. We never need to introduce an explicit hierarchical model to perform the forward pass P(d | c) and backward pass P(c | d) in the latent space. The entire procedure is implicitly carried out through the synthesizer-tuning update, with the latent concept c integrated out, demonstrating a powerful insight for scaling LMs in the real world.

3.7 Discussion

Before the prevalence of pretraining [Radford et al., 2018b, Devlin et al., 2019], we needed 40M pairs of English and French sentences [Sutskever et al., 2014] to grant an LM the ability to translate from one language to another. In contrast, any modern LM [Gemini, 2024, OpenAI et al., 2024, Gunter et al., 2024] can easily achieve this task via a single prompt. This advancement stems from the rich correlations between words within a document that LMs learn during pretraining. This shift from a hard-to-collect, task-specific dataset to readily available, generically crawled internet data marks a transition from relying on strong but scarce supervision to a weak self-supervision that is easy to scale. As we gradually deplete this weak source of self-supervision by exhausting the available internet data, many have called for stronger forms of supervision, such as reinforcement learning [DeepSeek-AI et al., 2025, OpenAI, 2024]. We instead offer an alternative perspective that continues to search for a form of self-supervision weaker than next-token prediction. SBP offers a particular instantiation of such an effort by examining the correlations between documents missed by the current pretraining paradigm. It remains to explore other forms of supervision not currently utilized.
The fact that SBP provides any improvement stems from the poor inductive bias of the transformer neural network. For example, transformers trained on the text "A is B" cannot generalize to "B is A" [Berglund et al., 2023], requiring the user to curate synthetic data that explicitly narrates the converse relation [Yang et al., 2025c]. One can imagine an architecture with an inductive bias strong enough that an LM trained individually on d_1 and d_2 automatically internalizes the relation between the two. Despite this poor inductive bias, transformers are more parallelizable and scalable than their alternatives [Vaswani et al., 2017, Shazeer et al., 2017]. Given this trade-off, SBP offers a unique solution that preserves the system benefits of the transformer architecture while also enabling the learning of missed correlations by encoding such additional signals via synthetic data.

3.7.1 Limitations

Document embedding with activations of a pretrained LM. In our implementation of SBP, we use Qwen3-0.6B-Embedding [Zhang et al., 2025] to obtain embeddings of DCLM [Li et al., 2024b] documents. An ideal implementation of SBP would rely only on the 3B-parameter model and the pretraining dataset itself to curate the paired synthesizer-tuning dataset. To achieve this, we could use the activations of the self-attention layer from an intermediate transformer block as a learned representation of documents. Khandelwal et al. [2020] and Yang et al. [2023b] implemented this at the much smaller scale of ∼300M parameters and ∼3B tokens. However, our experiments operate at a much larger scale with a customized model. As a result, we use the optimized vLLM [Kwon et al., 2023a] inference infrastructure for Qwen3-0.6B embedding models to efficiently index the pretraining corpus. Since the SBP procedure only requires a coarse binary decision of relevant vs.
not relevant, which is much weaker than the fine-grained document ranking that embedding models are optimized for, we leave the more involved inference infrastructure for future work.

Parametric fit of SBP scaling law. One experimental consequence of LM pretraining is a clean scaling law [Kaplan et al., 2020, Equation 1.4] that relates the held-out test loss L(N, D) to the number of LM parameters N and the size of the pretraining dataset D. A natural scientific question is how the scaling law of SBP compares to that of standard pretraining. In our experiments, we evaluate L(N, D) at three distinct points: (N = 3B, D = 10B), (N = 3B, D = 50B), and (N = 6B, D = 50B). We observe that SBP delivers larger relative improvements with the 6B model, suggesting favorable scaling behavior. Two obstacles prevent fitting a full scaling law. First, SBP is inherently a large-scale algorithm that cannot be scaled down: since SBP first uses the pretrained LM to generate synthetic text and then trains on it, if the model and dataset sizes are too small, the generated text may not even be coherent. In contrast, the scaling experiments in Kaplan et al. [2020] involve model sizes ranging from 768M to 1.5B and dataset sizes ranging from 22M to 23B, which allows for efficient experimentation. Second, varying N or D implies redoing the synthesizer-tuning and subsequent data synthesis over billions of tokens. Additionally, varying D also implies redoing the nearest-neighbor matching, as the neighbors are only defined given a fixed document pool. These obstacles aside, it would be interesting to see whether the SBP scaling law differs from the standard scaling law by a smaller multiplicative factor or a better exponent. Since SBP additionally utilizes inter-document correlations, a form of long-range interaction, its scaling law would not only help us understand SBP but also potentially help us better understand natural language data itself [Ebeling and Pöschel, 1994].
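To make the fitting procedure concrete, here is a sketch of a parametric scaling-law fit. We use a Chinchilla-style additive family L(N, D) = E + A/N^a + B/D^b as a stand-in (the dissertation does not fit one); the coefficients and the (N, D) grid are synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative parametric family; all numbers below are made up.
def scaling_law(X, E, A, B, a, b):
    N, D = X
    return E + A / N**a + B / D**b

Ns = np.array([7.5e8, 1.5e9, 3e9, 6e9])          # model sizes
Ds = np.array([1e10, 2e10, 5e10, 1e11])          # dataset sizes (tokens)
Ngrid, Dgrid = np.meshgrid(Ns, Ds)
N, D = Ngrid.ravel(), Dgrid.ravel()

true_params = (1.7, 400.0, 500.0, 0.34, 0.28)    # made-up ground truth
L = scaling_law((N, D), *true_params)            # noiseless "observations"

# Three (N, D) points cannot identify five parameters, which is exactly the
# obstacle described above; a grid sweep like this one is what a full fit
# would require.
popt, _ = curve_fit(scaling_law, (N, D), L,
                    p0=[2.0, 300.0, 400.0, 0.3, 0.3], maxfev=100000)
residual = np.max(np.abs(scaling_law((N, D), *popt) - L))
print(residual)
```

A smaller multiplicative factor for SBP would show up as a smaller fitted E or A, while a better exponent would show up in a or b.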
3.7.2 Conclusion

We introduce Synthetic Bootstrapped Pretraining (SBP) as a new LM pretraining framework that leverages inter-document correlations missed by the standard pretraining objective. We demonstrate the effectiveness of SBP by pretraining 3B- and 6B-parameter models from scratch for up to 1T tokens under a rigorously designed compute-matched experimental setup. Qualitative and quantitative analyses of the synthesized text reveal rich variation that cannot be captured by simple paraphrases. Beyond being practically effective, SBP admits a natural Bayesian interpretation, where it implicitly learns a posterior that infers the latent concept in a given document. Its bootstrapping nature grants SBP the possibility of scaling the pretraining capability of the LM beyond its current limit. Yet even with SBP, the pretraining procedure and learning algorithms remain human-designed. In Chapter 4, we ask whether we can remove this final dependency on human ingenuity—by building AI systems that autonomously discover and improve their own training methods.

Chapter 4: Towards AI-designed AI via test-time search

Chapters 2 and 3 showed that a model can acquire new knowledge and bootstrap its own pretraining capabilities through synthetic data, but the learning algorithms themselves—EntiGraph, SBP, and the training pipelines that orchestrate them—remained human-designed. This motivates asking whether the research process itself can be automated. In this chapter, we pursue this direction by building an automated AI research system.
Established approaches to algorithmic improvement—Neural Architecture Search [Zoph and Le, 2017, So et al., 2019], automated algorithm discovery [Real et al., 2020], and learned optimizers [Chen et al., 2023b]—are reasonable alternatives, but they operate within constrained, predefined spaces or require end-to-end differentiable pipelines, making it difficult to discover techniques outside the search space or to scale to full training systems. We pursue research automation because it operates in an unbounded action space: ideas expressed in natural language, validated via code execution, using capabilities that language models already possess (see §1.6.3 for a detailed comparison).

4.1 Towards automated AI research

We envision automated AI research as follows: LLMs generate research ideas, implement them as code, run experiments to verify effectiveness, and continuously learn from execution results. If successful, such automated researchers could discover effective ideas in a massive search space, scalably converting compute into scientific discovery; the discovered ideas could, in turn, improve frontier AI models themselves, enabling recursive self-improvement. Despite this promise, the ability of LLMs to generate effective ideas remains the key bottleneck. Si et al. [2025b] and Si et al. [2025a] evaluated LLM-generated research ideas through large-scale expert review and found that LLM ideas often look convincing but prove ineffective after human researchers execute them. This finding underscores the need to ground idea generation in execution. However, obtaining execution results automatically and at scale is challenging—especially for open-ended AI research, where any idea expressible in natural language lies within the action space. We build a high-throughput automated idea executor that implements hundreds of model-generated ideas in parallel and returns experiment results as execution feedback.
To study how far we can push automated LLM research, we select two GPU-intensive research problems—LLM pre-training and post-training—as research environments for our automated AI researchers. We demonstrate for the first time that our automated executor can implement a large fraction of LLM-generated ideas on these challenging open-ended research problems, achieving over 90% execution rates on the pre-training environment with Claude-4.5-Sonnet and Claude-4.5-Opus. We define objective and unhackable performance metrics for both environments and apply test-time search at the idea level—using execution feedback to guide evolutionary search over model-generated ideas.

Figure 4.1: We build an automated idea executor involving an Implementer, a Scheduler, and Workers. We then use this automated executor to guide test-time search over model-generated ideas. [Diagram: the Automated AI Researcher sends natural language ideas to the Automated Idea Executor; the Implementer issues multi-threaded LLM API queries, the Scheduler allocates resources for the execution queue, and Workers run GPU pre-train/post-train jobs; experiment results feed back into reinforcement learning or evolutionary search that updates the ideator.]

We use our automated executor to guide evolutionary search. Within ten search epochs, execution-guided search finds a post-training recipe that improves over the GRPO baseline (69.4% vs. 48.0%) on post-training a 1.5B model for math reasoning, and a pre-training recipe that improves over the nanoGPT baseline (19.7 minutes vs. 35.9 minutes) on minimizing training wall-clock time to reach the target validation loss (Table 4.1). We find that models frequently generate algorithmic ideas beyond hyper-parameter tuning, and that evolutionary search outperforms best-of-N under the same sampling budget. However, only Claude-4.5-Opus shows a clear scaling curve; both Claude-4.5-Sonnet and GPT-5 saturate early. We also explore reinforcement learning with execution reward as an alternative to search; while
Table 4.1: Performance of our execution-guided search in comparison with the provided baselines and the best human experts. The post-training task is to finetune a 1.5B model for math reasoning; the metric is accuracy on the MATH validation set. The pre-training task is to train a 124M Transformer on FineWeb; the metric is the training time to reach 3.28 validation loss.

Method | Post-training ↑ | Pre-training ↓
Baseline | 48.0% | 35.9 min
Execution-guided search | 69.4% | 19.7 min
Best human expert | 68.8% | 2.1 min

average reward improves, the max reward—more critical for discovery—does not, pointing to fundamental challenges in learning to generate research ideas (Appendix C.2). In summary, we develop a large-scale automated idea executor that implements research ideas for open-ended, realistic research problems. Using this executor, we show that execution-guided evolutionary search is sample-efficient and effective, outperforming best-of-N under the same sampling budget. We provide extensive analysis of the executed ideas and suggest promising directions for scaling automated AI research.

4.2 Automated idea executor

We build an automated executor that takes natural language research ideas as input, generates code implementations, runs experiments on the backend, and returns benchmark performance as output.

4.2.1 Research environments for ideation

We formalize a research environment as an abstract interface with two methods (Figure 4.2, left): context() returns the task description passed to the LM ideator, and value(idea) returns a scalar measure of idea quality after execution. For AI research, we instantiate this interface as AIResearchEnv (Figure 4.2, right): context() returns the baseline codebase, and value() patches the idea into a sandboxed copy of the codebase, runs the experiment, and returns the benchmark result.
This abstraction separates what the LM sees (the codebase) from how ideas are scored (sandboxed execution), and makes the search objective explicit: all methods in this chapter—best-of-N, evolutionary search, and RL—optimize env.value(idea). We select research problems that are open-ended—allowing ample room for algorithmic innovation—while having well-established baselines so that measuring effectiveness is simple. We construct both a pre-training and a post-training environment.¹

¹We open-source our environments and idea execution trajectories at https://github.com/NoviScl/Automated-AI-Researcher.

    class ResearchEnv:
        @abstract
        def context(self):
            # task description passed into the LM ideator
            pass

        @abstract
        def value(self, idea: str):
            # scalar measure of idea quality
            pass

    class AIResearchEnv(ResearchEnv):
        codebase: str
        resource: str
        sandbox_factory: Callable

        def context(self):
            return self.codebase

        def value(self, code_diff: str):
            sb = self.sandbox_factory(self.resource)
            sb.exec(f"patch {code_diff}")
            sb.exec("bash run.sh")
            return sb.exec("bash eval.sh")

Figure 4.2: Research environment abstraction. Left: abstract interface defining the search problem—context() provides what the LM sees, value() scores an idea by execution. Right: concrete implementation for AI research—ideas are patched into a sandboxed codebase and evaluated via automated execution.

Pre-training task: improving nanoGPT. In the nanoGPT environment, we provide a baseline codebase adapted from the nanoGPT speedrun [Jordan et al., 2024]. The original speedrun task is to minimize the time to pre-train a 124M GPT-2 model [Radford et al., 2018b] on the FineWeb corpus [Penedo et al., 2024] to reach a validation loss of 3.28 on 8 H100 GPUs. We make several modifications to the original speedrun setting. First, we define value(idea) = 1/(validation loss) for the search and RL experiments in later sections.
This allows us to fix the training wall-clock time at 25 minutes and have the model directly optimize the proxy reward under this fixed budget, avoiding vastly different runtimes across runs. We report the validation loss or the proxy reward metric in most plots, and only measure and report the training-time metric for the top solution in order to compare it directly with the human experts' solutions on the original nanoGPT speedrun leaderboard. Second, to prevent reward hacking, we freeze all evaluation hyper-parameters and implement an inference function that predicts one token at a time. This prevents models from changing the attention mechanism in ways that leak future tokens—an issue we encountered multiple times during initial development. We use this inference function during final validation after each training run.

Post-training task: improving GRPO. In the GRPO environment, we provide an implementation of the GRPO algorithm [Shao et al., 2024] that finetunes a Qwen2.5-Math-1.5B checkpoint [Yang et al., 2024] on the MATH dataset [Hendrycks et al., 2021c]. We specify a fixed training wall-clock time budget and define value(idea) as the maximum accuracy on the MATH validation set during training. To prevent reward hacking, we keep all validation-related code in a separate file that the automated executor cannot access or modify.

In both environments, we impose no constraints on the ideation scope, so anything from extensive hyper-parameter tuning to novel model architectures or training algorithms is within scope.

4.2.2 System design

The automated idea executor implements the value method of Figure 4.2: given a batch of natural-language ideas, it returns the benchmark performance of each.
Three building blocks compose this API (Figure 4.1): the Implementer—the server that generates the code diff for each idea and applies those changes; the Scheduler—a middle layer that receives codebases and allocates resources; and the Workers—the GPU cluster that runs experiments and uploads results.

Implementer. The implementer runs on a CPU machine with high IO capacity. The user submits a batch of natural language ideas. For each idea, the implementer makes parallel API calls to the code-execution LLM to obtain a diff file that can be patched into the baseline codebase. For efficiency, we prompt the code-execution LLM with both the idea and the baseline codebase to sample 10 code diff files in parallel. For each sample, if the generated diff file cannot be patched into the original codebase, we provide the patch log and ask the model to revise its generation. We repeat this self-revision at most twice, then return the first code diff that patches successfully. The patched codebase is uploaded to a cloud bucket as a .zip file.

Scheduler. At a set clock frequency, the scheduler downloads new codebases from the cloud. If a codebase has not been executed, the scheduler examines its resource requirements and prepares a job configuration for submission.

Worker. Once the scheduler finds available resources, it connects the prepared job configuration with the GPU resource and initializes the worker to run the experiment. If execution succeeds, the worker uploads experiment logs—including all performance metrics—to another cloud bucket (wandb) along with complete metadata: idea content, code change, execution log, etc. If execution fails (e.g., due to bugs in the code implementation), the worker halts. The ideator model can then download execution results and observe the performance of all submitted ideas with full training logs.
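The Implementer's self-revision loop described above can be sketched as follows. The function names and the toy stand-ins are hypothetical, not the system's actual code:

```python
# Request a diff, try to apply it, and on failure feed the patch log back
# to the model for revision, for at most two revisions per sample.
def implement(idea, ask_llm, apply_patch, max_revisions=2):
    prompt = f"Idea: {idea}\nProduce a diff against the baseline codebase."
    for _ in range(max_revisions + 1):       # initial attempt + revisions
        diff = ask_llm(prompt)
        ok, log = apply_patch(diff)
        if ok:
            return diff                      # first diff that patches cleanly
        prompt = f"The patch failed:\n{log}\nPlease revise the diff."
    return None                              # idea deemed non-executable

# toy stand-ins: the first generated diff fails to patch, the revision works
attempts = iter(["bad diff", "good diff"])
ask_llm = lambda _prompt: next(attempts)
apply_patch = lambda d: (d == "good diff", "hunk #1 FAILED")
result = implement("use a cosine LR schedule", ask_llm, apply_patch)
print(result)  # -> good diff
```

In the real system this loop runs over 10 parallel samples per idea, and the first successfully patched codebase is the one submitted for execution.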
4.3 Benchmarking LLM ideators and executors

For an execution-grounded feedback loop to work, current LLMs must serve as both ideators and executors, yielding meaningful env.value signals for learning. We benchmark various frontier LLMs in both roles.

Figure 4.3: Model performance comparison with self-execution (top row) vs. GPT-5 execution (bottom row) on the GRPO and nanoGPT environments. Panels: (a) Self-Execution (GRPO), (b) Self-Execution (nanoGPT), (c) GPT-5 Execution (GRPO), (d) GPT-5 Execution (nanoGPT); each reports average performance, best performance, and completion rate. The baseline accuracy for GRPO is 0.480, and the baseline loss for nanoGPT is 3.255. The completion rate is high for most models, especially under self-execution.

4.3.1 End-to-end ideation and execution

We sample ideas from an LLM and use the same LLM as the code-execution model to execute its own ideas. We sample and execute 50 ideas from Claude-4.5-Opus, Claude-4.5-Sonnet, and GPT-5, measuring three metrics: (1) completion rate—the percentage of ideas successfully executed with a valid (non-zero) result; (2) average performance—the average validation accuracy or loss across all successfully executed ideas; and (3) best performance—the highest validation accuracy or lowest validation loss among all executed ideas. The top row of Figure 4.3 shows the results.
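The three metrics can be computed directly from a batch of execution results. A minimal sketch, with made-up numbers and None marking a failed run:

```python
results = [0.52, None, 0.48, 0.61, None, 0.50]    # validation accuracies

completed = [r for r in results if r is not None and r > 0]
completion_rate = len(completed) / len(results)   # metric (1)
average_perf = sum(completed) / len(completed)    # metric (2)
best_perf = max(completed)                        # metric (3); min() for loss

print(completion_rate, average_perf, best_perf)
```

For the nanoGPT environment, where the metric is a loss, the best performance uses min() instead of max().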
A large fraction of sampled ideas execute successfully, with Claude-4.5-Opus and Claude-4.5-Sonnet achieving significantly higher execution rates than GPT-5. The best-of-N performance (N = 50) already exceeds the original baselines. For example, on GRPO, Claude-4.5-Sonnet achieves a max accuracy of 60.4% compared to the baseline of 48.0%; on nanoGPT, Claude-4.5-Opus achieves a lowest loss of 3.237 compared to the baseline of 3.255.

4.3.2 Comparing ideators with the same executor

We fix the executor to GPT-5 and vary the ideator model. As the bottom row of Figure 4.3 shows, even when ideator and executor differ, execution rates remain reasonable (ranging from 42% to 78%). The same ideas from Claude-4.5-Sonnet achieve lower execution rates when executed by GPT-5 instead of itself (dropping from 84% to 42% on GRPO and from 90% to 78% on nanoGPT). Frontier open-weight models like Kimi-K2-Thinking [Kimi Team, 2025] and Qwen3-235B-A22B [Yang et al., 2025a] also achieve non-trivial completion rates and best-of-N performance that exceeds the baselines. For example, with N = 50, Qwen3-235B achieves a max accuracy of 50.2% on GRPO and a min loss of 3.238 on nanoGPT, both exceeding the baselines. These results suggest the feasibility of an automated ideation-and-execution loop. Before turning to search, we examine a simpler form of scaling—test-time compute scaling via budget forcing—whose lesson motivates the idea-level search that follows.

4.4 Test-time scaling via budget forcing

As discussed in Chapter 3, training on just 1,000 examples suffices to build a competitive reasoning model, revealing that reasoning capability is latent in pretrained weights. A natural question is: can we improve performance simply by controlling how much a model reasons at the token level?
Budget forcing offers a minimal mechanism for allocating test-time compute adaptively, and its effectiveness foreshadows the larger principle behind the evolutionary search in later sections—that shifting compute from training to search over env.value is a viable path to better performance. We show that controlling thinking duration via a simple test-time technique we call budget forcing produces a reasoning model whose performance scales with more test-time compute. After training a model on reasoning data, we control test-time compute using budget forcing: (I) if the model generates more thinking tokens than a desired limit, we forcefully end the thinking process by appending an end-of-thinking token delimiter, causing the model to transition to generating its answer; (II) if we want the model to spend more test-time compute on a problem, we suppress the end-of-thinking token delimiter and instead append "Wait" to the model's reasoning trace to encourage more exploration. Equipped with this simple technique, our model s1-32B exhibits test-time scaling (Figure 4.4).

Figure 4.4: Test-time scaling with s1-32B. We benchmark s1-32B on reasoning-intensive tasks—Mathematical Problem Solving (MATH500), Competition Math (AIME24), and PhD-Level Science Questions (GPQA Diamond)—and vary test-time compute (average thinking time in tokens).

In summary, our contribution is a simple method for controlling test-time compute called budget forcing (§4.4.1). We end with a discussion on test-time scaling and its limits (§4.4.3). Our code, model, and data are open-source at https://github.com/simplescaling/s1.
4.4.1 Test-time scaling

Method. We classify test-time scaling methods into 1) sequential, where later computations depend on earlier ones (e.g., a long reasoning trace), and 2) parallel, where computations run independently (e.g., majority voting) [Snell et al., 2024, Brown et al., 2024]. We focus on sequential scaling because later computations build on intermediate results, enabling deeper reasoning and iterative refinement. We propose new sequential scaling methods and ways to benchmark them.

Budget forcing. We propose a simple decoding-time intervention that forces a maximum and/or minimum number of thinking tokens. We enforce a maximum token count by appending the end-of-thinking token delimiter and optionally "Final Answer:" to early-exit the thinking stage and make the model provide its current best answer. To enforce a minimum, we suppress the end-of-thinking token delimiter and optionally append "Wait" to the model's reasoning trace to encourage reflection on its current generation.

Figure 4.5: Sequential and parallel test-time scaling. (a) Sequential scaling via budget forcing on Competition Math (AIME24): budget forcing shows clear scaling trends and extrapolates to some extent. For the three rightmost dots, we prevent the model from stopping its thinking 2/4/6 times, each time appending "Wait" to its current reasoning trace. (b) Parallel scaling via majority voting on PhD-Level Science Questions (GPQA Diamond): for Qwen2.5-32B-Instruct, we perform 64 evaluations for each sample with a temperature of 1 and visualize the performance when majority voting across 2, 4, 8, 16, 32, and 64 of these.
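The decoding-time intervention can be sketched at the token level. Here decode_step is a hypothetical stand-in for one step of greedy decoding and END for the end-of-thinking delimiter; this is an illustration of the mechanism, not the released implementation:

```python
END = "</think>"

def budget_force(decode_step, min_tokens, max_tokens):
    trace = []
    while True:
        tok = decode_step(trace)
        if tok == END and len(trace) < min_tokens:
            trace.append("Wait")              # suppress early end-of-thinking
            continue
        if tok == END or len(trace) >= max_tokens:
            trace.append(END)                 # force-close the thinking stage
            break
        trace.append(tok)
    return trace

# toy model that always tries to stop after emitting 3 thinking tokens
toy = lambda trace: END if len(trace) >= 3 else f"t{len(trace)}"
trace = budget_force(toy, min_tokens=5, max_tokens=8)
print(trace)  # -> ['t0', 't1', 't2', 'Wait', 'Wait', '</think>']
```

The toy model is forced past its natural stopping point twice before the minimum budget is met, mirroring the 2/4/6 forced continuations in Figure 4.5(a).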
4.4.2 Results

Evaluation. We evaluate on the same three reasoning benchmarks introduced in Section 3.1: AIME24, MATH500, and GPQA Diamond. Unless otherwise specified, we evaluate with a temperature of 0 (greedy decoding) and measure accuracy (equivalent to pass@1).

Test-time scaling. Figure 4.4 shows that the performance of s1-32B with budget forcing scales with more test-time compute. Figure 4.5 (left) expands Figure 4.4 (middle), showing that budget forcing (§4.4.1) improves AIME24 performance with more test-time compute but eventually flattens out at six forced continuations. Suppressing the end-of-thinking token delimiter too often leads the model into repetitive loops instead of continued reasoning. Figure 4.5 (right) shows that after training Qwen2.5-32B-Instruct on our 1,000 samples to produce s1-32B and equipping it with budget forcing, the model operates in a different scaling paradigm: scaling test-time compute on the base model via majority voting does not catch up with s1-32B, validating our intuition from §4.4.1 that sequential scaling is more effective than parallel scaling.

4.4.3 Discussion

Limits to further test-time scaling. Budget forcing extrapolates test-time compute (§4.4.2)—for example, improving AIME24 performance from 50% to 57%. However, it has two key limitations: it eventually flattens out (Figure 4.5), and it is constrained by the context window of the underlying language model. Despite these constraints, we demonstrate test-time scaling across a wide range of accuracies (Figure 4.4), partly because scaling down test-time compute behaves predictably and does not suffer from these constraints. Continuing test-time scaling will require approaches that further extrapolate test-time compute. Potential improvements to budget forcing include rotating through different strings (not only "Wait") or combining it with frequency penalties or a higher temperature to avoid repetitive loops.
A natural question is whether applying budget forcing to a reasoning model trained with reinforcement learning yields better extrapolation, or whether RL enables new forms of test-time scaling beyond budget forcing. The lesson from budget forcing is simple: even a crude intervention—appending "Wait" to force continued thinking—improves performance. If brute-force thinking at the token level already helps, then systematically scaling search at the idea level—generating many candidate research ideas, executing them, and feeding results back—should yield continued improvement. The remainder of this chapter tests this hypothesis.

Figure 4.6: Best performance at each epoch when performing execution-guided search with different models. For the nanoGPT environment (left), the metric is the reciprocal of the validation loss; for the GRPO environment (right), the metric is validation accuracy. Claude-4.5-Opus exhibits a scaling trend in both environments and achieves the best performance on nanoGPT. Claude-4.5-Sonnet achieves the best performance on GRPO due to effective hyper-parameter tuning, but saturates early.

4.5 Execution-guided evolutionary search

Evolutionary search [Koza, 1994, Lehman et al., 2023] is a classic optimization method that does not require gradient updates. We develop an evolutionary search scaffold on top of frontier LLMs to optimize idea effectiveness based on execution feedback. We describe our search method—which blends exploration and exploitation—demonstrate its effectiveness on both research environments,
TOWARDS AI-DESIGNED AI VIA TEST-TIME SEARCH62 Algorithm 1 Execution-guided search Require: batch size N, epochs T, baseline performance β Require: initial exploitation rate a 1 ∈ [0, 100], annealing schedule a(t) for t∈1,...,T Require: research environment env with methods context, value 1: Sample initial batch of ideas I 0 ← SampleIdeas(N ) 2: D 0 ←(i, env.value(i)) : i∈I 0 3: for t = 1 to T do 4: a← a(t)▷ (100− a)% exploration rate 5: D <t ← S t−1 k=0 D k 6: D + ←(i,r)∈D <t : r > β▷ positive trajectories 7: N exp ← a 100 N , N expl ← N − N exp 8: I exp t ← ExploitVariants(D + ,N exp ) 9: ̃ D <t ← SubsampleToContext(D <t ) 10: I expl t ← ExploreNovel( ̃ D <t ,N expl ) 11: I t ←I exp t ∪I expl t 12: D t ←(i, env.value(i)) : i∈I t 13: end for 14: return S T t=0 D t and analyze the generated ideas throughout the search process. 4.5.1 Search scaffold Our search method is inspired by prior evolutionary search approaches for code optimization, such as AlphaEvolve Novikov et al. [2025]. Algorithm 1 details our approach. At the first search epoch, we sample a full batch of new ideas. In subsequent epochs, we split idea generation into exploitation and exploration subsets. For exploitation, we select ideas from previous epochs that outperform the baseline, append them to the idea generation prompt, and ask the ideator model to generate new variants combining their strengths. For exploration, we randomly sample ideas from previous epochs into the prompt until reaching the max context length, then instruct the ideator model to generate completely new ideas different from them. We start with 50% exploitation and 50% exploration at epoch 1 and gradually anneal the exploration rate and increase the exploitation ratio in later epochs. We use a batch size of 50 for the GRPO environment and a batch size of 80 for the nanoGPT environment. 
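Algorithm 1 can be condensed into a short Python loop. The sketch below assumes hypothetical callables for the ideator (`sample_ideas`, `exploit_variants`, `explore_novel`) and the environment's execution-based scoring (`value`); it mirrors the exploit/explore split and annealing schedule but omits context subsampling.

```python
def execution_guided_search(sample_ideas, exploit_variants, explore_novel,
                            value, n, epochs, beta=0.0,
                            a=lambda t: min(50 + 10 * t, 100)):
    """Condensed sketch of Algorithm 1 (execution-guided search).
    `a(t)` is the exploitation rate in percent at epoch t; all callables
    are hypothetical stand-ins, not the thesis's actual interfaces."""
    history = [(i, value(i)) for i in sample_ideas(n)]        # D_0
    for t in range(1, epochs + 1):
        rate = a(t)                                           # exploitation %
        positives = [(i, r) for i, r in history if r > beta]  # D_+
        n_exp = rate * n // 100                               # exploit budget
        n_expl = n - n_exp                                    # explore budget
        batch = (exploit_variants(positives, n_exp)
                 + explore_novel(history, n_expl))            # I_t
        history += [(i, value(i)) for i in batch]             # append D_t
    return history
```

In the real scaffold, `exploit_variants` and `explore_novel` are LLM prompts over previous trajectories rather than pure functions, and `value` runs an execution agent against the research environment.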
4.5.2 Experiment results

For each environment, we perform execution-guided search with three different models: Claude-4.5-Opus, Claude-4.5-Sonnet, and GPT-5. For each experiment, we use the same model as both ideator and executor (self-execution). Figure 4.6 plots the progression of best performance at each search epoch. Several trends emerge. First, Claude-4.5-Opus shows a scaling trend: searching for more epochs leads to a higher upper bound. In contrast, Claude-4.5-Sonnet and GPT-5 saturate early. Second, all models find ideas that improve over the baselines.

Table 4.2: Breakdown of hyper-parameter tuning vs. algorithmic ideas throughout the entire execution-guided search. We report the percentage of each type among all generated ideas of each model (N = 500 ideas on GRPO and N = 800 ideas on nanoGPT). We also report the average and best performance for ideas under each category, where we use validation accuracy as the performance metric for GRPO and validation loss as the metric for nanoGPT. Bold numbers in each row indicate the best performance by each model. All models generate a substantial number of algorithmic ideas apart from hyper-parameter changes, while Claude-4.5-Sonnet generates significantly more hyper-parameter ideas than the other models.

                                 Hyper-parameter            Algorithmic
Model                            %      Avg    Best         %      Avg    Best
GRPO environment (accuracy ↑)
GPT-5                            5.0%   45.0%  50.2%        95.0%  44.5%  60.0%
Claude-4.5-Sonnet                41.1%  48.4%  69.4%        58.9%  45.0%  67.4%
Claude-4.5-Opus                  3.7%   44.4%  50.4%        96.3%  46.5%  61.6%
nanoGPT environment (loss ↓)
GPT-5                            15.4%  3.254  3.195        84.6%  3.894  3.170
Claude-4.5-Sonnet                31.3%  3.251  3.208        68.7%  3.679  3.208
Claude-4.5-Opus                  8.7%   3.329  3.147        91.3%  3.419  3.141
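Many of the discovered GRPO ideas modify the policy-gradient objective. As a reference point, a minimal group-baseline policy-gradient loss of the kind these ideas perturb can be sketched with NumPy; this is an illustrative objective we wrote for exposition, not the environment's actual training code.

```python
import numpy as np

def group_baseline_pg_loss(logps, rewards):
    """Vanilla policy-gradient surrogate with a per-group average baseline,
    without importance reweighting or clipping (an illustrative sketch).

    logps:   (G, N) summed log-probabilities of N sampled responses per group
    rewards: (G, N) scalar rewards for the same responses
    """
    # Group-average baseline: advantages within each group sum to zero.
    advantages = rewards - rewards.mean(axis=1, keepdims=True)
    # REINFORCE-style surrogate: the gradient flows through logps only.
    return -(advantages * logps).mean()
```

GRPO-style variants would additionally form a token-level importance ratio against a reference policy and clip it; the ideas in Table 4.3 mostly reshape this advantage or ratio term.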
On GRPO, Claude-4.5-Sonnet discovers that vanilla policy gradient with the group-average baseline, without importance reweighting or clipping, outperforms the standard GRPO objective in this setup, and exploits this finding in subsequent epochs to reach 69.4% at epoch 2 with precise hyper-parameter tuning. On nanoGPT, Claude-4.5-Opus achieves a minimum validation loss of 3.1407 at epoch 9 by combining architectural modifications, hyper-parameter tuning, and an exponential moving average of intermediate checkpoints during validation (see Appendix C.1.1 for the full idea). We run this top solution on 8 H100s following the nanoGPT speedrun setup: it reaches the 3.28 target validation loss in 19.7 minutes, a speedup over the baseline codebase, which takes 35.9 minutes to reach the same target.

To contextualize these model-optimized solutions, we compare the top performance of execution-guided search to human experts (Table 4.1). For the GRPO environment, we compare with the leaderboard of the Stanford CS336 graduate-level LLM class, which hosted the same environment as an assignment where students optimized validation accuracy under the same training time budget. The best student solution (https://github.com/stanford-cs336/assignment5-alignment-leaderboard) achieved 68.8% accuracy, lower than Claude-4.5-Sonnet's top solution from execution-guided search. In the nanoGPT environment, we directly compare with the nanoGPT speedrun leaderboard (https://github.com/KellerJordan/modded-nanogpt). The state-of-the-art human solution as of December 2025 achieves the target validation loss in under 2.1 minutes, indicating significant room for improvement in model capability and search methods.

4.5.3 Comparison with best-of-N

To evaluate the effectiveness of our search scaffold, we compare execution-guided search with best-of-N under the same sampling budget on the nanoGPT environment.
Since our search batch size is 80, we compare the first 3 epochs of execution-guided search using GPT-5 with best-of-N results for GPT-5 with N ∈ {80, 160, 240}.

Figure 4.7: Comparison between best-of-N and our execution-guided search under the same sampling budget.

As shown in Figure 4.7, search and best-of-N start from similar performance at epoch 0 (not exactly the same due to sampling variance), but evolutionary search outperforms best-of-N from epoch 1 onward. This suggests that the model leverages trajectories from previous epochs to generate more effective ideas in future epochs. This result echoes the token-level finding from §4.4: there, sequential scaling via budget forcing outperformed parallel scaling via majority voting (Figure 4.5b). A similar pattern appears at the idea level: sequential search that builds on prior results outperforms parallel sampling of independent candidates.

4.5.4 Analysis of generated ideas

Hyper-parameter vs. algorithmic To understand the types of ideas models generate during execution-guided search, we classify all generated ideas into either hyper-parameter tuning (ideas implementable by changing existing configs) or algorithmic (ideas requiring changes not supported by the baseline codebase) using an LLM judge. Table 4.2 shows that all three models generate substantial algorithmic ideas beyond hyper-parameter tuning. Claude-4.5-Sonnet generates significantly more hyper-parameter ideas than both Claude-4.5-Opus and GPT-5. The most effective ideas stem from algorithmic innovations in most cases, except when using Claude-4.5-Sonnet.

Qualitative examples We provide several executed ideas on the GRPO environment in Table 4.3 and on the nanoGPT environment in Appendix C.1.1.
When sampling, models generate a thinking trace followed by the natural language idea and a brief description of the code changes needed for implementation. For brevity, we include only the natural language ideas in the table; a full code execution trajectory appears in Appendix C.1.2. Table 4.3 reveals different idea styles across models: Claude-4.5-Sonnet generates more intuitive ideas, while Claude-4.5-Opus and GPT-5 are more mathematically inclined.

Recovering recent research papers We observe multiple cases where model-generated ideas (without any RAG) closely resemble research papers released within three months of writing this chapter. For example, Claude-4.5-Sonnet proposed: "Implement response diversity rewards within groups where responses to the same prompt receive bonus rewards for being dissimilar to other responses in their group, encouraging exploration of different solution paths.", which is similar to Li et al. [2025]. For pre-training, Claude-4.5-Opus proposed: "Causal Context Compression: Before each attention layer, apply a learned compression that mixes local context (previous 2-3 tokens) into the current representation, providing implicit local context without convolutions.", which is similar to the "canon layer" described in Allen-Zhu [2025]. Assessing the novelty of LLM-generated ideas is beyond this chapter's scope, but the ability to rediscover ideas from recent papers suggests that automated AI researchers could plausibly support work at the frontier of LLM research.

4.6 Discussion

4.6.1 Limitations

Our current experiments have several limitations.

First, our current procedure does not test idea generalizability. The best-performing ideas at small scales may not transfer to larger scales or other datasets. Future work should explore methods that explicitly test generalizability and scalability, potentially incorporating them into the optimization objectives.

Second, our experiment scope is bounded by execution agent capability.
Many promising model-generated ideas cannot be successfully executed (e.g., see the end of Appendix C.1.1), introducing noise in the reward signal. Future work could develop more capable execution agents and extend our setup to more complex research problems, for instance by implementing coding agents with access to external tools and the ability to install new libraries in the execution environments.

Finally, we explore only effectiveness as the training reward. Other metrics, such as idea novelty and interestingness, could complement effectiveness. Future work could explore how to computationally measure these qualities and incorporate them into the training objective to discover more insightful ideas.

4.6.2 Conclusion

In this chapter, we tackle the problem of AI-designed AI (building systems that improve the very algorithms used to train them) through the lens of test-time search. We first show that test-time search at the token level (budget forcing) improves reasoning, then scale this principle to the idea level: generating research ideas, executing them automatically, and feeding results back to guide evolutionary search. Using this approach, frontier LLMs improve over baseline solutions, finding a post-training recipe that improves accuracy from 48% to 69% and a pre-training recipe that halves wall-clock time. These results point toward the feasibility of automated, execution-grounded AI research and suggest a path toward AI systems that continually improve themselves.

This chapter points toward AI systems that improve not only their knowledge or capabilities but also the very algorithms used to train them. Chapter 2 showed how synthetic data can teach models new knowledge; Chapter 3 demonstrated that models can bootstrap their pretraining capability without external supervision.
Here, we have shown that AI systems can generate, implement, and validate research ideas, including ideas for improving pretraining and post-training algorithms themselves.

An AI system that improves the algorithms used to train future AI systems could, in principle, accelerate its own development. Our current results are early: evolutionary search finds better hyperparameters and algorithmic improvements but does not yet discover fundamentally new methods. Yet the feasibility of execution-grounded idea generation suggests a path forward. The key bottleneck is no longer whether AI can generate ideas, but whether it can generate good ideas that generalize beyond narrow benchmarks. From this perspective, the most promising directions are: learning algorithms that maintain exploration while optimizing for effectiveness, richer execution environments that test generalization, and tighter feedback loops between ideation and execution. The ultimate goal, AI systems that continually and autonomously improve themselves, remains distant, but this thesis provides concrete building blocks toward it.

Table 4.3: Examples of successfully executed ideas on the GRPO environment, along with their accuracy on the MATH validation set. The baseline accuracy is 48.0% on this environment.

Claude-4.5-Opus on GRPO:
• Residual Ratio Learning with Momentum Bounds: Instead of directly using the (importance sampling) ratio, decompose it into a "base" component (EMA of batch mean ratios) and a "residual" component (ratio − base). Apply sigmoid bounding only to the residual, allowing the base to capture systematic policy drift while controlling deviations from it. Combined with momentum clip adaptation. Formula: residual = ratio − ema_batch_ratio, bounded_residual = sigmoid_bound(residual, deviation), effective_ratio = 1.0 + bounded_residual. Validation accuracy: 61.6
• Advantage Rank Difference Weighting: Instead of using absolute advantage magnitude, weight samples by how far their rank differs from their expected rank under a uniform distribution. Samples that significantly outperform or underperform their "expected" position get higher weights. This is distribution-free and robust to outliers. Formula: expected_rank = (N − 1)/2, rank_diff = |actual_rank − expected_rank| / expected_rank, weight = 0.5 + 0.5 * rank_diff. Validation accuracy: 59.2

Claude-4.5-Sonnet on GRPO:
• Dynamic Mathematical Problem Difficulty Balancing with Performance Feedback: Implement intelligent difficulty balancing that dynamically adjusts the mix of problem difficulties based on recent performance trends. When performance is strong, increase the difficulty proportion; when struggling, provide more foundational problems. Combine with the proven hyper-parameters for optimal learning progression. Validation accuracy: 64.0
• Implement token-level reward attribution by using attention weights to identify which input tokens contributed most to correct answers, then amplifying the gradient updates for those tokens during policy gradient training. Validation accuracy: 45.2
• Create mathematical working memory simulation by maintaining a context buffer of mathematical facts, definitions, and intermediate results during problem solving. This buffer gets updated as the model works through problems and provides additional context for subsequent mathematical steps, simulating how humans maintain mathematical working memory during complex calculations. Validation accuracy: 58.0

GPT-5 on GRPO:
• Token-Level Ratio De-noising via Response Chunks (Chunked-Ratio): Reduce noisy token spikes by averaging the log-ratio over small contiguous chunks within the response. Partition response tokens into C chunks per sequence (e.g., C = 8 over the effective length), and replace the per-token Δlogp with the chunk mean broadcast to the tokens in the chunk before ratio and clipping. Keeps structural signal while smoothing extremes. Validation accuracy: 58.2
• Per-Group Curriculum via Reward Spread (PGC-RS): Adjust step aggressiveness based on group reward spread. For each group, compute the spread s_g = std(r). Compute a per-sample temperature T_i^grp = clamp(1 + α·(s_ref − s_g), 0.8, 1.5) with s_ref = the median group std over the rollout and α = 0.8. Multiply the existing ratio temperature T_i (if any) by T_i^grp. Groups with low spread (hard to distinguish) get cooler ratios; high-spread groups allow bolder updates. Validation accuracy: 49.4

Chapter 5

Conclusion: can AI be smarter than its creators?

In Chapter 1, we proposed a one-sentence definition: a continually self-improving AI is one that, once created, can autonomously and continually improve itself better than its human creators can improve it. The key phrase is "can improve it": the definition does not claim that AI is stronger than humans, only that AI can improve AI more effectively than humans can. Each chapter takes a step toward this standard. In Chapter 2, we showed that a model can synthesize training data that teaches it knowledge beyond what the small source corpus can directly teach. In Chapter 3, we showed that a model can bootstrap its own pretraining capabilities from a fixed dataset, producing training signals that yield improvement beyond what the original human-collected data provides. In Chapter 4, we showed that an AI system can design learning algorithms by searching over a larger design space than human researchers can search.

But the mechanism behind each result is the same: AI compensates for inferior quality with superior quantity. Human data is better, but AI data is infinite. Human researchers are stronger, but AI researchers are tireless. The definition is satisfied, but only through brute force: stacking quantity to overcome a limitation in quality. This raises a deeper question: can a created system genuinely surpass its creator, not merely out-grind them?
The preceding chapters presented experimental results; this chapter is different in character. What follows is a position piece: I offer a historical analogy and a personal interpretation, not a proof. The analogy is suggestive, not rigorous, and the conclusions I draw from it reflect my own perspective on where the field may be headed.

5.1 A parable from physics

To explore why the answer might be yes, we turn to an analogy from physics. In 1915, Albert Einstein published the field equations of general relativity [Einstein, 1915], completing a decade-long effort to reconcile gravity with the geometry of spacetime. Two years later, when Einstein applied his equations to the universe as a whole, he found that they do not admit a static, matter-filled cosmos [Einstein, 1917]. Rather than accept this prediction, he modified the equations, introducing a free parameter called the cosmological constant, to force a static solution. In 1929, Edwin Hubble established that distant galaxies are systematically redshifted, their light stretched to longer wavelengths in proportion to their distance [Hubble, 1929]. The universe is expanding, exactly as the unmodified equations had predicted. The theory knew something its creator did not.

This chapter tells the story of how a set of equations can be smarter than the person who wrote them down. A theory, once created, has a life of its own: it can evolve, make predictions, and reach conclusions that its creator never intended. General relativity provides perhaps the most literal historical precedent for this phenomenon. Understanding it precisely requires following a chain of mathematical reasoning from Newton to Einstein, which we do now.
5.2 The gravitational field equation

Newton's law of universal gravitation describes gravity as a force between two point masses: a mass M exerts a force on a test mass m separated by displacement r,

    F = −(GMm / r²) r̂,    (5.1)

where G ≈ 6.674 × 10⁻¹¹ m³ kg⁻¹ s⁻² is Newton's gravitational constant and the negative sign indicates that the force is attractive. This is a particle equation: it tracks individual objects and assumes instantaneous action at a distance. The passage from discrete forces to a local field equation proceeds by introducing a continuous mass density ρ(r) and a gravitational potential Φ(r). Through an integral formulation and the identity ∇²(1/|r − r′|) = −4π δ³(r − r′), one arrives at Poisson's equation, a field equation that replaces particle-by-particle bookkeeping with a local differential relationship (Appendix D.1):

    ∇²Φ = 4πGρ,    (5.2)

with the left-hand side describing geometry (the potential) and the right-hand side the matter source.

Einstein's general relativity elevates every ingredient of Poisson's equation from scalars and vectors to tensors. The scalar potential Φ becomes the metric tensor g_μν, a 4 × 4 symmetric matrix that defines the geometry of spacetime: ds² = g_μν dx^μ dx^ν. The scalar density ρ becomes the stress-energy tensor T_μν, encoding energy, momentum, and stress; in general relativity, all forms of energy curve spacetime, not just mass. The left-hand side of the field equation must express spacetime curvature. Starting from g_μν, one constructs the Riemann curvature tensor through a chain of derivatives (Appendix D.2), then contracts it to the Ricci tensor R_μν and the Ricci scalar R. The unique symmetric, divergence-free combination is the Einstein tensor G_μν ≡ R_μν − (1/2) R g_μν. Assembling these ingredients, Einstein wrote down the field equations of general relativity in 1915 [Einstein, 1915]:

    R_μν − (1/2) R g_μν = (8πG / c⁴) T_μν.
(5.3)

These are 10 coupled, nonlinear, second-order partial differential equations for the 10 independent components of g_μν. The identity ∇^μ G_μν = 0 (the Bianchi identity) ensures compatibility with energy-momentum conservation (∇^μ T_μν = 0). In the weak-field, slow-motion, static limit, these equations reduce exactly to Poisson's equation (5.2) (Appendix D.3). The correspondence between the two theories is summarized below:

                          Poisson (Newton)         Einstein
Matter field              ρ (scalar density)       T_μν (stress-energy tensor)
Potential field           Φ (scalar potential)     g_μν (metric tensor)
Differential operator     ∇² (Laplacian)           g_μν ↦ G_μν (nonlinear)
Coupling constant         4πG                      8πG/c⁴

5.3 Einstein's cosmological problem

In 1917, the astronomical consensus held that the universe was static and eternal, consisting essentially of the Milky Way alone. Einstein set out to apply his field equations to the universe as a whole [Einstein, 1917]. He modeled the matter content as pressureless dust at rest: the relative velocities of stars are small compared to the speed of light, so the stress-energy tensor is dominated by T_00 = ρc², with all spatial components negligible (T_ij ≈ 0). Einstein assumed the density ρ to be constant in both space and time [Einstein, 1917]. We follow Friedmann [Friedman, 1922], who generalized Einstein's setup by allowing both the curvature radius and the density to depend on time: ρ = ρ(t), with spatial homogeneity still enforced. The crucial question was whether these equations permit a static universe, one in which the metric has no time dependence.

5.3.1 The cosmological metric

Einstein's starting point was the cosmological principle: at any fixed time, space is homogeneous (the same at every point) and isotropic (the same in every direction). Isotropy at every point means that the sectional curvature at each point does not depend on the direction in which it is measured. By Schur's theorem [do Carmo, 1992, Ch. 4, Ex.
8], this implies that the sectional curvature does not vary from point to point either: space has constant curvature. A complete, simply connected three-dimensional space of constant curvature falls into exactly one of three classes:

• Positive curvature (k = +1): the three-sphere S³, with finite volume.
• Zero curvature (k = 0): flat Euclidean space R³, with infinite volume.
• Negative curvature (k = −1): hyperbolic space H³, with infinite volume.

Einstein worked with the closed case k = +1 [Einstein, 1917]; here we choose the flat case (k = 0) for simplicity, since the qualitative conclusion, that a static universe with matter is impossible, holds for all three. We now construct the most general metric compatible with spatial flatness, homogeneity, and isotropy. The assumption that matter is at rest allows us to choose coordinates where the mixed components g_0i vanish, so time is orthogonal to the spatial slices. The geodesic equation for dust at rest requires g_00 to be independent of spatial position; by our sign convention, g_00 = −c². At fixed time, the spatial metric is that of flat R³: dx² + dy² + dz². Homogeneity and isotropy do not prevent the overall spatial scale from changing with time, so we introduce a scale factor a(t) that multiplies all spatial distances uniformly. The coordinates (x, y, z) are comoving coordinates: they label points in space permanently, and the physical distance between two points with coordinate separation Δx at time t is a(t) Δx. Combining these ingredients, the spacetime line element is the Friedmann–Lemaître–Robertson–Walker (FLRW) metric for a spatially flat universe:

    ds² = −c² dt² + a²(t) (dx² + dy² + dz²).    (5.4)

The metric tensor and its inverse are diagonal:

    g_00 = −c², g_11 = g_22 = g_33 = a²(t);    g^00 = −1/c², g^11 = g^22 = g^33 = 1/a²(t).    (5.5)

The only undetermined quantity is the single function a(t).
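As a quick sanity check (ours, not from the thesis), the inverse components in (5.5) do form the matrix inverse of the diagonal FLRW metric, which can be verified numerically for arbitrary values of c and a(t):

```python
import numpy as np

# Illustrative values of the speed of light and the scale factor a(t).
c, a = 3.0e8, 2.0

# FLRW metric (5.5), flat case, as a diagonal 4x4 matrix, and its
# claimed inverse with raised indices.
g = np.diag([-c**2, a**2, a**2, a**2])
g_inv = np.diag([-1 / c**2, 1 / a**2, 1 / a**2, 1 / a**2])
```

Multiplying the two matrices recovers the 4 × 4 identity, as required of g^μλ g_λν = δ^μ_ν.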
5.3.2 The Friedmann equations

Substituting the FLRW metric (5.4) into the Einstein field equations (5.3) requires computing the Christoffel symbols, Ricci tensor, Ricci scalar, and Einstein tensor for this metric (the full computation is carried out in Appendix D.4). With the stress-energy tensor T_00 = ρc² and T_ij = 0 for pressureless dust, the field equations yield two ordinary differential equations for the scale factor a(t) and the matter density ρ(t), where ȧ ≡ da/dt and ä ≡ d²a/dt². The 00-component gives the first Friedmann equation (the energy constraint):

    3 ȧ²/a² = 8πGρ/c².    (5.6)

The 11-component gives the second Friedmann equation (the acceleration equation):

    2 ä/a + ȧ²/a² = 0.    (5.7)

These two equations, together with initial conditions, completely determine the evolution of the universe.

5.3.3 A dynamic universe

A static universe means ȧ = 0 for all time. Setting ȧ = 0 in the first Friedmann equation (5.6) gives 0 = 8πGρ/c², which requires ρ = 0. A static universe must be empty.

The result is stronger than this. Suppose the universe contains matter (ρ > 0) and is momentarily at rest (ȧ = 0) at some time t_0. Then the first Friedmann equation gives 0 = 8πGρ/c² > 0, a contradiction. So ȧ can never pass through zero in a matter-filled universe: if the universe is expanding at any moment, it was always expanding; if contracting, always contracting. The universe is permanently dynamic. Solving the Friedmann equations (Appendix D.5) yields a(t) ∝ t^(2/3): the universe begins at a singularity where all distances vanish and the density is infinite, then expands forever, decelerating but never stopping. The density decreases as ρ ∝ 1/a³ ∝ 1/t², reflecting the conservation of total mass in any comoving volume as the volume grows. The same qualitative conclusion, the impossibility of a static universe with matter, holds for the closed (k = +1) and open (k = −1) cases.
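Before turning to the closed and open cases, the flat-case solution can be checked numerically: the claimed a(t) ∝ t^(2/3) should make the left-hand side of the second Friedmann equation (5.7) vanish. The finite-difference check below is illustrative, not part of the thesis.

```python
def a(t):
    """Flat-case solution of the Friedmann equations: a(t) = t^(2/3)."""
    return t ** (2.0 / 3.0)

def friedmann_residual(t, h=1e-5):
    """Left-hand side of (5.7), 2*adotdot/a + (adot/a)^2, computed with
    central differences; it should be (numerically) zero for a(t) = t^(2/3)."""
    adot = (a(t + h) - a(t - h)) / (2 * h)            # first derivative
    adotdot = (a(t + h) - 2 * a(t) + a(t - h)) / h**2 # second derivative
    return 2 * adotdot / a(t) + (adot / a(t)) ** 2
```

Analytically, ȧ = (2/3) t^(−1/3) and ä = −(2/9) t^(−4/3), so 2ä/a = −(4/9) t⁻² exactly cancels (ȧ/a)² = (4/9) t⁻².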
In the closed case, ȧ = 0 at one moment forces ä < 0, so the universe cannot remain static; in the open case, ȧ² has a positive lower bound, so ȧ can never reach zero.

5.3.4 The cosmological constant

Einstein [Einstein, 1917] encountered the same problem: his field equations yield a dynamic universe, contradicting the static cosmos that the astronomical consensus of his era held to be self-evident. His response was to modify his own equations, introducing the cosmological constant Λ by adding a term Λg_μν to the left-hand side:

    R_μν − (1/2) R g_μν + Λ g_μν = (8πG / c⁴) T_μν.    (5.8)

This modification preserves the divergence-free property of the left-hand side (since ∇^μ g_μν = 0), so the equations remain mathematically consistent. The extra term provides a repulsive effect that can be tuned to balance gravitational attraction, permitting a static solution. However, this solution is unstable: the slightest perturbation in density or scale factor causes the universe to begin expanding or contracting. Einstein had patched his equations to obtain the answer he wanted.

Figure 5.1: Hubble's velocity–distance relation, reproduced from the data in Hubble [1929]. Black discs are 24 individual nebulae with estimated distances (solid line: least-squares fit, K = 465 km/s/Mpc). Open circles are 9 groups formed by combining nearby nebulae (dashed line: K = 513 km/s/Mpc). The red cross marks the mean of 22 additional nebulae whose distances could not be estimated individually. Both fits are consistent with a linear relation v = Kr passing through the origin.

5.4 The theory was right

In 1929, Edwin Hubble published a result that rendered Einstein's patch unnecessary [Hubble, 1929].
Using distance estimates derived from resolved stars in nearby galaxies, together with radial velocities measured by Vesto Slipher and Milton Humason, Hubble established a roughly linear relation between a galaxy's distance and its recession velocity. Distant galaxies are redshifted: the wavelength of their light is systematically stretched toward longer (redder) wavelengths, in proportion to their distance. Hubble's original data are reproduced in Figure 5.1.

Redshift is the optical analog of the Doppler effect: when a source of light moves away from an observer, each successive wave crest must travel a slightly longer distance, stretching the observed wavelength λ_obs relative to the emitted wavelength λ_emit. The redshift z = (λ_obs − λ_emit)/λ_emit measures this fractional stretch, and for recession velocities small compared to the speed of light, z ≈ v/c. Hubble found that nearly all galaxies have z > 0 and that the recession velocity increases with distance, roughly as v ≈ H_0 d, where H_0 is a constant now bearing his name. The universe is not static. It is expanding, and running the expansion backward in time implies that it originated from an extremely dense, hot initial state: the Big Bang.

The linearity of Hubble's relation is not a coincidence; it is the only possibility consistent with the cosmological principle. Consider two observers, A and B, separated by distance d_B. Observer A measures a galaxy at distance d receding with velocity v(d). Observer B sees the same galaxy at distance d − d_B and, after subtracting B's own recession velocity as seen by A, measures velocity v(d) − v(d_B). Homogeneity requires that B's velocity–distance law take the same functional form as A's, so

    v(d) − v(d_B) = v(d − d_B) for all d, d_B.    (5.9)

This is Cauchy's functional equation, whose only continuous solution is v(d) = Hd for some constant H. Any nonlinear relation would single out a preferred center, the point from which the relation appears simplest, violating the assumption that no observer occupies a special location. This constant H is precisely the Hubble parameter predicted by the Friedmann equations (§5.3.2). The first Friedmann equation (5.6) already contains the ratio ȧ/a: rewriting 3ȧ²/a² = 8πGρ/c² as H² = 8πGρ/(3c²) shows that the expansion rate is set by the matter density of the universe. Einstein's unmodified field equations (5.3), without the cosmological constant, had predicted exactly this. Had Einstein trusted his own mathematics rather than the astronomical prejudices of his era, he could have predicted the expanding universe over a decade before Hubble observed it. Einstein later called the introduction of Λ his "biggest blunder" [Gamow, 1956].

In a precise sense, the equations knew more than Einstein did. Their deductive consequences included a true prediction about the physical universe that their creator actively suppressed. The mathematical structure of general relativity, the interplay of g_μν, R_μν, and T_μν, encoded a fact about nature that no human being recognized at the time.

5.5 Continually self-improving AI

We began this chapter with a narrow observation: the AI systems built in this thesis satisfy the definition of continually self-improving AI, but through the uninteresting mechanism of quantity overcoming quality. The Einstein parable suggests, but does not prove, that something deeper is possible. The field equations of general relativity were not merely more diligent than Einstein: their deductive consequences were right where his intuitions were wrong. The mathematical structure he created encoded a truth about the universe that he actively denied. The results of this thesis do not yet achieve this. Synthetic data outworks human data; automated search outworks human researchers. The mechanism remains quantity over quality. But the story need not end here.
If models can internalize knowledge into their weights, regularize their own training from the structure of data, and design their own learning algorithms, then it is my view that one day a created system will not merely outwork its creator but, like Einstein’s field equations, contain truths its creator did not recognize.

5.6 Future work

Toward a future in which AI systems genuinely surpass their creators, the three methodologies developed in this thesis (synthetic knowledge acquisition, bootstrapped pretraining, and automated algorithm design) can each be extended in distinct ways. We close by narrating three concrete possibilities.

5.6.1 Synthetic continued pretraining as an alternative to infinite context

In Chapter 2, we showed that synthetic continued pretraining can teach a model knowledge from a small corpus by generating diverse rephrasings and elaborations of the source material. A natural extension is to apply this approach not to static knowledge acquisition, but to the problem of long-context inference. Recent work handles long user queries (e.g., 1M–10M+ tokens) using efficient attention [Dao et al., 2022, Liu et al., 2023a, Gemini, 2024] or sub-quadratic architectures [Tay et al., 2022, Gu et al., 2022, Gu and Dao, 2024, Sun et al., 2024b]. In settings where many queries share a long prefix, e.g., a corporation’s proprietary documents or other prompt caching use cases [Anthropic, 2024], one could continue pretraining on the prefix to internalize its knowledge, then perform standard quadratic attention on shorter queries. This approach pays a fixed training cost to amortize prefix knowledge into model weights, then benefits from shorter context lengths [Gururangan et al., 2020, Snell et al., 2022].
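The amortization argument above can be made concrete with a back-of-envelope calculation. The sketch below is purely illustrative and not from the thesis: it models attention cost as quadratic in context length, ignores all constant factors, and assumes (hypothetically) that a continued-pretraining pass over the prefix costs a small multiple of one forward pass over it.

```python
import math

def attention_flops(context_len: int) -> float:
    """Toy cost model: self-attention work grows quadratically in context
    length. Constant factors and non-attention FLOPs are deliberately ignored."""
    return float(context_len) ** 2

def break_even_queries(prefix_len: int, query_len: int,
                       train_cost_multiplier: float = 3.0) -> int:
    """Smallest number of queries after which internalizing the shared prefix
    into the weights wins over re-attending to it on every query. Assumes
    (hypothetically) that continued pretraining on the prefix costs
    `train_cost_multiplier` times one quadratic pass over the prefix."""
    per_query_with_prefix = attention_flops(prefix_len + query_len)
    per_query_without = attention_flops(query_len)
    one_time_training = train_cost_multiplier * attention_flops(prefix_len)
    saving_per_query = per_query_with_prefix - per_query_without
    # Smallest n with n * saving_per_query >= one_time_training.
    return math.ceil(one_time_training / saving_per_query)
```

Under this toy model, a 1M-token prefix with 2K-token queries breaks even after only a handful of queries, which is why prefix-heavy workloads are the natural first target.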
By adapting continued pretraining from 10B–100B tokens to as little as 1.3M tokens, the synthetic continued pretraining approach of Chapter 2 could enable unsupervised learning of shared text prefixes at much smaller and more practical token counts. The natural limit of this direction is replacing the context window entirely: a model that has internalized a corpus through synthetic continued pretraining needs no retrieval at inference time, achieving the effect of an infinite context window through learned knowledge rather than attention over tokens.

5.6.2 Synthetic data as data-dependent regularization

In Chapter 3, we showed that a model can bootstrap its own pretraining capabilities by synthesizing training data from a fixed corpus, producing training signals that yield improvement beyond what the original data provides. This capability becomes increasingly important as the field confronts a fundamental resource constraint. The compute-optimal scaling laws of Hoffmann et al. [2022] established that model size and training tokens should grow in proportion: a model trained on too few tokens relative to its parameter count is undertrained, while one trained on too many is inefficient. The immediate consequence is that the largest models require the most data. As model sizes continue to grow, the data requirement grows with them, but the supply of high-quality web text does not. Kim et al. [2025] study this regime directly, showing that when compute is abundant but data is fixed, standard training overfits and that aggressive regularization (weight decay 30× larger than standard practice) is needed to continue extracting signal from repeated data. Together, these results point to a future in which the largest models are severely undertrained, not for lack of compute, but for lack of data. The synthetic bootstrapped pretraining method of Chapter 3 offers a natural response to this problem.
SBP can be viewed as a form of data-dependent regularization: it does not add new information beyond what is already present in the original corpus, but it makes implicit inter-document correlations explicit, guiding the model toward representations that capture deeper structure. Standard pretraining learns from the marginal distribution of documents; SBP additionally learns from conditional distributions that expose latent structure shared across documents. In this view, synthesized documents act as a regularizer whose form depends on the training data itself, much as data augmentation in vision (cropping, flipping, color jittering) regularizes without introducing new visual concepts. This makes SBP a promising approach for training ultra-large models on fixed data budgets, precisely the regime where compute-optimal scaling laws predict the most severe undertraining.

5.6.3 Harness engineering

In Chapter 4, we showed that an AI system can design learning algorithms by searching over a larger design space than human researchers can search, generating, implementing, and evaluating research ideas in a closed loop. The natural trajectory of this line of work shifts the role of the human researcher from doing research to engineering the harness within which AI does research. In our experiments, the most consequential design decisions were not about individual ideas; they were about the research environments (Figure 4.2). These are properties of the harness, not of any single experiment. As AI systems become more capable at the hill-climbing work of generating and testing variations, the harness becomes the primary locus of human contribution. This is because the harness encodes something that AI systems do not have on their own: human intent. The choice of what to optimize, what constraints to respect, and what counts as progress reflects human values and goals that cannot be derived from execution feedback alone.
The researcher’s comparative advantage shifts from proposing ideas to specifying objectives: from doing the climbing to choosing the mountain. This shift is already visible in practice: frameworks like Harbor [Harbor Framework Team, 2026] provide infrastructure for running AI agents across thousands of containerized environments in parallel, collecting execution rollouts, and feeding results back for reinforcement learning, an end-to-end harness in which the human contribution is the design of the evaluation, not the execution of the research.

Appendix A

Supplementary materials for Chapter 2

A.1 Details on the QuALITY dataset

We provide additional details on the QuALITY dataset. For each book, we execute entity extraction (Step 1, §2.2.2) and then analyze all pairwise relations between entities and a subset of all triplet relations (Step 2, §2.2.2). Figure A.1 shows summary statistics for the Raw and EntiGraph corpora.

Figure A.1: Histograms over the 265 QuALITY articles and books. (a) The token count of raw articles. (b) The number of extracted entities. (c) The token count of EntiGraph synthetic data (generated for each book).

A.2 Training details for the main experiments

Continued pretraining details In all experiments, we continue pretraining Llama 3 8B Base with a context length of 2048 and batch size of 16. We apply a linear learning rate warmup for 5% of total steps, followed by cosine decay with peak learning rate 5e-6. We use full parameter training with Fully Sharded Data Parallelism (FSDP, Zhao et al. [2023]).
EntiGraph continued pretraining details To mitigate forgetting of pretrained knowledge, we perform replay at a rate of 0.1 using 1B RedPajama tokens [TogetherAI, 2023]. For each training batch, we flip a biased coin such that with 10% probability we load RedPajama data instead of EntiGraph synthetic data. Raw continued pretraining details We now describe continued pretraining directly on the Raw corpus, which produces the “Raw CPT” model. Because the Raw corpus contains only 1.3M tokens, we jointly tune the number of epochs (repetition factor) and the RedPajama replay rate on accuracy over a QuALITY QA validation split. We select a configuration with 4 epochs and a 0.1 replay rate. Instruction tuning details We use the UltraChat instruction tuning dataset [Ding et al., 2023] filtered by the Huggingface team [Tunstall et al., 2023]. We format the UltraChat conversations using the Llama 3.1 8B Instruct chat template [Dubey et al., 2024a], obtaining a 250M token instruction tuning dataset. We apply a linear learning rate warmup followed by cosine decay to 0 with peak learning rate 5e-6, and train the model for 1 epoch with a batch size of 512 and context window of 2048. To validate our instruction tuning procedure, we measure the AlpacaEval [Li et al., 2023a] winrate against GPT-4 and find that it improves from 0% to 6.25%, comparable to Llama 2 Chat 13B’s 7.7% baseline winrate. Compute resource We run all continued pretraining experiments on one 8×H100 node. With PyTorch FSDP [Zhao et al., 2023], we achieve throughput of 6090 tokens per second. Because all experiments use the same model architecture, batch size, and context length, we can calculate training time from total tokens seen. For example, EntiGraph trains on 455M tokens for 2 epochs, taking 455M×2/6090 seconds, or about 41 hours. 
A.3 Task-specific finetuning for the QuALITY question set

Our work considers task-agnostic synthetic data generation and continued pretraining as a way to obtain generalizable knowledge about a domain, knowledge that can later be extracted via few-shot prompting [Brown et al., 2020] and instruction tuning [Ouyang et al., 2022]. However, if the goal is only to excel on a single task such as question answering, one could fine-tune a language model for that particular task. This approach works well on tasks such as SQuAD [Rajpurkar et al., 2016] in-domain but degrades outside the fine-tuning data distribution [Awadalla et al., 2022]. We do not extensively compare to task-specific finetuning given EntiGraph’s broader multi-task goals. We run preliminary experiments comparing a simple QA SFT baseline to EntiGraph and find that EntiGraph scaling and synthetic data generation costs are generally favorable even compared to this strong, task-specific baseline.

QA SFT We follow the same setup as in §2.2.1 and §2.3 except that we do not prompt LM_synth to generate general knowledge about QuALITY articles. Instead, we prompt LM_synth to generate QA pairs directly:

You are an assistant to help read a article and then rephrase it in a question answering format. The user will provide you with an article with title, year, content. You need to generate a paraphrase of the same article in question and answer format with multiple tags of "Question: ..." followed by "Answer: ...". Remember to keep the meaning and every content of the article intact, including the title, year, etc.

We repeat this prompt many times at temperature 1.0, resulting in 28M tokens of synthetic question-answer pairs. We perform the same continued pretraining procedure as in §2.4.1 on Llama 3 8B and refer to this model as “QA SFT”.
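Turning the model’s free-form output into training pairs requires parsing the tagged format the prompt requests. A minimal regex sketch, assuming outputs follow the "Question: ... Answer: ..." tag format exactly (real model outputs may deviate and need extra cleanup):

```python
import re

def parse_qa_pairs(text: str) -> list[tuple[str, str]]:
    """Split generated text of the form 'Question: ... Answer: ...' (repeated)
    into (question, answer) pairs. Assumes the tag format the prompt requests."""
    pattern = re.compile(
        r"Question:\s*(.*?)\s*Answer:\s*(.*?)(?=Question:|\Z)",
        re.DOTALL,
    )
    return [(q.strip(), a.strip()) for q, a in pattern.findall(text)]
```

The lookahead `(?=Question:|\Z)` ends each answer at the next question tag or end of text, so multi-sentence answers survive intact.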
Figure A.2: Accuracy on the QuALITY question set Q_test (y-axis) as a function of the synthetic token count (x-axis). Comparison among EntiGraph CPT, Rephrase CPT, and QA SFT.

Results discussion Figure A.2 shows the QA SFT scaling curve. Task-specific finetuning demonstrates sharp improvement in QA accuracy, consistent with prior results on task-specific finetuning gains for pretrained models. While QA SFT performance is high, EntiGraph attains similar performance despite being entirely task-agnostic, and the overall cost of creating the dataset is much lower for EntiGraph. This cost difference is hidden in Figure A.2, as we plot training tokens rather than dollars spent to generate the synthetic data. For QA SFT, each QA question is short, resulting in large inefficiencies in generating this dataset. The input-to-output token ratio is large compared with Rephrase CPT and EntiGraph CPT, costing over $5K to generate just 28M tokens.[1] This cost difference means further scaling becomes prohibitively expensive, and EntiGraph’s performance in Figure A.2 is even better than it appears when matching for total cost rather than token budget.

A.4 Additional details on open-book experiments

We provide additional details on our open-book experimental setup, including our retrieval-augmented generation (RAG, Lewis et al. [2020], Gao et al. [2024b]) pipeline. As described in §2.6, we use a standard two-stage RAG pipeline: first, an offline stage that indexes document chunks; second, inference-time retrieval, reranking, and placement of those chunks in a few-shot LM prompt.

A.4.1 Stage 1: offline indexing

The indexing stage constructs an index over all 265 articles and books from the QuALITY corpus D_source.
This stage chunks documents, obtains dense vector embeddings for each chunk using an API-based embedding model, and indexes the (embedding, chunk) pairs.

Chunking documents We first split each document D^(i) ∈ {D^(i)}_{i=1}^{n} = D_source into a set of m_i document chunks C^(i)_1, ..., C^(i)_{m_i}. We use the RecursiveCharacterTextSplitter from Chase [2022], which keeps paragraphs (and then sentences, and then words) together as long as possible to preserve semantics within each chunk. We use non-overlapping chunks and tune chunk size in characters (chunk_size, hyperparameter values provided below). Because we have access to metadata about each document D^(i), namely the title, author, and year, we prepend this metadata to each document chunk. This mirrors how an organization building a RAG system over their document store would include document metadata (title, author, year, etc.). We embed, retrieve, and place these final chunks with metadata prepended in-context.

Embedding and indexing document chunks We obtain dense embeddings for all document chunks using OpenAI text-embedding-3-large [Neelakantan et al., 2022]. We then index all (embedding, chunk) tuples using a FAISS vector store [Douze et al., 2024].

[1] OpenAI API pricing, Sep 2024.

A.4.2 Stage 2: inference-time retrieval and reranking

At inference time, the RAG system receives a test query q ∈ Q_test. Each query q is contextualized with the article title and author name, as described in §2.3, and contains its four possible answer choices (QuALITY is a 4-choice, multiple choice dataset). In Stage 2, we embed the query with the API-based embedding model, retrieve K document chunks using approximate nearest-neighbor search, and select the k < K most relevant chunks using an API-based reranker.
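The chunking-with-metadata step can be sketched in a few lines. This is a simplified, paragraph-first stand-in for the RecursiveCharacterTextSplitter, not its actual implementation; the metadata string format is illustrative.

```python
def chunk_document(text: str, chunk_size: int, metadata: str) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most `chunk_size`
    characters (splitting oversized paragraphs as needed), then prepend
    `metadata` (title/author/year) to every chunk, mirroring the pipeline
    described above."""
    pieces = []
    for para in text.split("\n\n"):
        # Break paragraphs that exceed chunk_size on their own.
        while len(para) > chunk_size:
            pieces.append(para[:chunk_size])
            para = para[chunk_size:]
        if para:
            pieces.append(para)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + "\n\n" + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # paragraph still fits: keep packing
        else:
            chunks.append(current)       # flush and start a new chunk
            current = piece
    if current:
        chunks.append(current)
    return [f"{metadata}\n{c}" for c in chunks]
```

Keeping paragraph boundaries where possible preserves local semantics inside each chunk, which is the same motivation the text gives for the recursive splitter.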
Retrieving top-K document chunks We embed q with text-embedding-3-large, and retrieve the top-K most relevant document chunks from our indexed vector store using FAISS similarity search with a Euclidean distance metric.

Reranking to obtain top-k (k < K) chunks We use a reranker to filter the K retrieved document chunks to a smaller set of k reranked chunks. Rerankers significantly improve recall (the proportion of the time the salient article appears in the top chunks), and our RAG pipelines achieve near-perfect recall (Table 2.4 in §2.6). We pass the query q and the list of K retrieved document chunks to Cohere rerank-english-v3.0 [Cohere, 2024], which returns the K chunks ordered from most to least semantically relevant. We take the k highest-scoring chunks and place them in our few-shot prompt.

Few-shot prompt formatting We provide our full few-shot chain-of-thought evaluation prompts for the open-book setting in our code release. As with the closed-book QA evaluation prompt, we manually write and fact-check in-context learning examples about well-known books to avoid leaking knowledge from the QuALITY articles. In early experiments, we find that placing retrieved contexts first, followed by the question and answer choices, significantly outperforms question-then-contexts; we use this format throughout the retrieval experiments. We treat as a hyperparameter whether reranked chunks are ordered from best match to worst (best_first) or worst match to best (best_last). When performing few-shot evaluation, we follow the sampling procedure from the closed-book experiments (Appendix A.7.1). We generate 64 responses for each question and filter out responses that do not parse to one of the four choices. We then randomly select one valid response as the model’s final answer.

A.4.3 Hyperparameter tuning

We compare two LMs in the RAG pipeline: EntiGraph CPT and its base model, Llama 3 8B Base.
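The final answer-selection step (sample 64 responses, filter to those that parse to a choice, pick one valid response at random) can be sketched as below. The parsing heuristic here is a placeholder; the actual prompt-specific parser in the code release may differ.

```python
import random

def select_answer(responses, choices=("A", "B", "C", "D"), seed=0):
    """Filter sampled responses down to those that parse to one of the answer
    choices, then return one uniformly at random, as described above. Parsing
    is a simple 'first matching choice token' heuristic (an assumption)."""
    def parse(resp):
        for token in resp.replace(".", " ").replace(":", " ").split():
            if token in choices:
                return token
        return None                       # response does not parse
    valid = [p for p in map(parse, responses) if p is not None]
    if not valid:
        return None                       # all 64 samples failed to parse
    return random.Random(seed).choice(valid)
```

Sampling one valid response at random (rather than majority voting) matches the procedure stated in the text.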
We fix the number of retrieved chunks to K = 128 but vary the number of reranked chunks k placed in the context window. For each language model + RAG pipeline, we independently tune the following hyperparameters via grid search on a QuALITY QA validation split:

• Document chunk_size ∈ {256, 512, 1024}
• Rerank top-k ∈ {1, 2, 4, 8, 16}
• Order of chunks ∈ {best_first, best_last}
• Eval temperature ∈ {0.1, 0.3, 0.5, 0.7}

We provide tuned hyperparameters in our code release.

A.5 Proof of Theorem 1 and other analytical formulas

We prove Theorem 1 and provide derivations for several other approximation formulas.

Proof of Theorem 1. Fix the matrix M_0. We observe that

\[
\mathrm{Acc}(M_t) = \frac{\mathbb{E}[\|M_t\|_1 \mid M_0]}{V(V-1)}
= \sum_{(i,j)\in\mathcal{V}^2} \frac{\mathbb{E}[\mathbf{1}((i,j)\in\mathcal{D}_t) \mid M_0]}{V(V-1)}
= \sum_{(i,j)\in\mathcal{V}^2} \frac{\mathbb{P}[(i,j)\in\mathcal{D}_t \mid M_0]}{V(V-1)}.
\]

For each (i,j) ∈ 𝒱², we define q_{i,j} to be the probability that (i,j) is included in the set {(x_t, z_t^1), (x_t, z_t^2), ..., (x_t, z_t^{k_t}), (x_t, y_t)}. Note that each iteration of the procedure generates a path (x_t, z_t^1, z_t^2, ..., z_t^{k_t}, y_t) independently and identically, so q_{i,j} does not depend on the time t. This implies that ℙ[(i,j) ∈ 𝒟_t | M_0] = 1 − (1 − q_{i,j})^t. Thus we can further rewrite the link density as

\[
\mathrm{Acc}(M_t) = \frac{|\mathcal{D}_{\mathrm{source}}|}{V(V-1)}
+ \sum_{(i,j)\in\mathcal{V}^2_{\mathrm{source}}} \frac{\mathbb{P}[(i,j)\in\mathcal{D}_t \mid M_0]}{V(V-1)}
= \frac{|\mathcal{D}_{\mathrm{source}}|}{V(V-1)}
+ \sum_{(i,j)\in\mathcal{V}^2_{\mathrm{source}}} \frac{1-(1-q_{i,j})^t}{V(V-1)}.
\]

The remaining task is to estimate q_{i,j}. We say a vertex j is reachable from i, and write i ∼ j, if there is a directed path from i to j in M_0. We define R = {(u,v) ∈ 𝒱² : u ≠ v, u ∼ v} to be the set of all reachable pairs of vertices in 𝒱. We note that q_{i,j} is non-zero if and only if j is reachable from i in M_0.
Now, for any t ≥ 1, the function 1 − (1 − x)^t is concave, so by Jensen’s inequality we have

\[
\sum_{(i,j)\in\mathcal{V}^2_{\mathrm{source}}} \bigl(1-(1-q_{i,j})^t\bigr)
\le \sum_{(i,j)\in R} \bigl(1-(1-q_{i,j})^t\bigr)
\le |R|\bigl(1-(1-\bar q)^t\bigr),
\qquad \text{where } \bar q = \frac{\sum_{(i,j)\in R} q_{i,j}}{|R|}.
\]

For each (i,j) ∈ R, the probability q_{i,j} satisfies

\[
q_{i,j} = \frac{\sum_{a\ne b\in\mathcal{V}^2} \mathbf{1}\bigl((i,j)\in\{(a,z_1),(a,z_2),\dots,(a,z_k),(a,b)\}\bigr)}{V(V-1)},
\]

where (a, z_1, z_2, ..., z_k, b) is the shortest path in M_0 connecting a and b. If there is no such path, then by default the indicator equals zero. Now we look at

\[
\sum_{(i,j)\in R} q_{i,j}
= \frac{1}{V(V-1)} \sum_{(i,j)\in R} \sum_{(a,b)\in R} \mathbf{1}\bigl((i,j)\in\{(a,z_1),\dots,(a,z_k),(a,b)\}\bigr)
\le \frac{1}{V(V-1)} \sum_{(a,b)\in R} \sum_{i\ne j\in\mathcal{V}^2} \mathbf{1}\bigl((i,j)\in\{(a,z_1),\dots,(a,z_k),(a,b)\}\bigr)
= \frac{1}{V(V-1)} \sum_{(a,b)\in R} \ell_{a,b},
\]

where ℓ_{a,b} is the length of the shortest path connecting a to b. To analyze the typical shortest length of paths, we present a few classical results on directed Erdős–Rényi graphs. For any a ∈ 𝒱, let X(a) denote the set of vertices reachable from a and let Y(a) denote the set of vertices from which a is reachable. Recall that ρ(λ) is the extinction probability for the Poisson(λ) branching process.

Lemma A.5.1 (Lemma 1 and Corollary 1 in Karp [1990]). For each vertex a, with probability tending to 1 as V tends to infinity, there exists a constant β > 0 such that either |X(a)| ≤ β log V or |X(a)| = (1 − ρ(λ))V + Θ(√V). Moreover, the probability that the latter happens tends to 1 − ρ(λ) as V tends to infinity. The same is true for Y(a).

For each vertex a, the set X(a) is said to be small if |X(a)| ≤ β log V (in such case we write a ∈ S_X) and large if |X(a)| = (1 − ρ(λ))V + Θ(√V) (we write a ∈ L_X). We define S_Y and L_Y similarly.

Lemma A.5.2 (Theorem 3 in Karp [1990] and Theorem 2.4.1 in Durrett [2010]). With probability tending to 1, the following statement holds for all a and b in 𝒱: if X(a) is large and Y(b) is large, then b is reachable from a.
Moreover, if X(a) is large and Y(b) is large, then for any ε > 0 and any sufficiently small δ > 0,

\[
\mathbb{P}\bigl[\ell_{a,b} > (1+\varepsilon)\log V/\log\lambda\bigr] < \exp(-V^{\varepsilon}\delta).
\]

With Lemma A.5.1 and Lemma A.5.2, we can now give useful estimates of |R|. In particular, for any ε > 0,

\[
|R| = |\{(a,b)\in R : a\in L_X, b\in L_Y\}| + |\{(a,b)\in R : a\in S_X \text{ or } b\in S_Y\}|
\le (1-\rho(\lambda))^2(1+\varepsilon/4)V^2 + 2(1+\varepsilon)V\beta\log V
\le (1-\rho(\lambda))^2(1+\varepsilon/3)V(V-1),
\]

with high probability. Similarly, for the lower bound,

\[
|R| = |\{(a,b)\in R : a\in L_X, b\in L_Y\}| + |\{(a,b)\in R : a\in S_X \text{ or } b\in S_Y\}|
\ge (1-\rho(\lambda))^2(1-\varepsilon)V^2
\ge (1-\rho(\lambda))^2(1-\varepsilon)V(V-1),
\]

with high probability. By a union bound over all pairs (a,b) ∈ R, we also have

\[
\sum_{(i,j)\in R} q_{i,j}
\le \frac{1}{V(V-1)} \sum_{(a,b)\in R} \ell_{a,b}
= \frac{1}{V(V-1)} \sum_{\substack{(a,b)\in R \\ a\in L_X,\, b\in L_Y}} \ell_{a,b}
+ \frac{1}{V(V-1)} \sum_{\substack{(a,b)\in R \\ a\in S_X \text{ or } b\in S_Y}} \ell_{a,b}
\le (1-\rho(\lambda))^2(1+\varepsilon/2)\frac{\log V}{\log\lambda}
+ \frac{1}{V(V-1)}\, 2(1+\varepsilon)V(\beta\log V)^2
\le (1-\rho(\lambda))^2(1+\varepsilon)\frac{\log V}{\log\lambda},
\]

with probability larger than 1 − V²·exp(−V^ε δ). Combining the above, for any ε > 0,

\[
\bar q = \frac{\sum_{(i,j)\in R} q_{i,j}}{|R|} \le \frac{(1+\varepsilon)\log V}{V(V-1)\log\lambda},
\]

with high probability. Therefore, for any ε > 0,

\[
\mathrm{Acc}(M_t)
\le \frac{|\mathcal{D}_{\mathrm{source}}|}{V(V-1)} + \frac{|R|\bigl(1-(1-\bar q)^t\bigr)}{V(V-1)}
\le (1+\varepsilon)\left(p + (1-\rho(\lambda))^2\left(1 - \left(1 - \frac{(1+\varepsilon)\log V}{V(V-1)\log\lambda}\right)^{t}\right)\right),
\]

with high probability, which completes the proof of the upper bound.

For the lower bound, we observe that if i ∼ j and (i,j) ∈ R_source, then q_{i,j} ≥ 1/V(V−1), because when i and j are chosen in the procedure, the edge (i,j) will be added. This implies that

\[
\mathrm{Acc}(M_t)
= \frac{|\mathcal{D}_{\mathrm{source}}|}{V(V-1)}
+ \sum_{(i,j)\in R_{\mathrm{source}}} \frac{1-(1-q_{i,j})^t}{V(V-1)}
\ge \frac{|\mathcal{D}_{\mathrm{source}}|}{V(V-1)}
+ \frac{|R_{\mathrm{source}}|}{V(V-1)}\left(1 - \left(1-\frac{1}{V(V-1)}\right)^{t}\right)
\ge (1-\varepsilon)\left(p + (1-\rho(\lambda))^2\left(1 - \left(1-\frac{1}{V(V-1)}\right)^{t}\right)\right),
\]

with high probability, which completes the proof of the lower bound.

To obtain a more precise description of Acc(M_t), we use a Poisson branching process to approximate the cluster growth of vertices.
A Poisson(λ) branching process models a population evolving in time, where each individual independently gives birth to a number of children with Poisson(λ) distribution. We denote by Z_n the number of individuals in the n-th generation, where by default Z_0 = 1. Then Z_n satisfies the recursion

\[
Z_n = \sum_{i=1}^{Z_{n-1}} X_{n,i},
\]

where {X_{n,i}}_{n,i≥1} is a doubly infinite array of i.i.d. Poisson(λ) random variables. The total progeny Y_n is then defined as Y_n = Σ_{i=0}^{n} Z_i. The process {Z_n} is often called a Galton–Watson branching process and the associated tree is called a Galton–Watson tree.

As in the previous proof, accurately estimating Acc(M_t) requires understanding q_{i,j}, the probability that edge (i,j) is added in each round. As before, the only edges added are those connected to the giant component (i.e., i ∈ L_X and j ∈ L_Y). The proportion of such edges converges to C_λ as V → ∞. Recall that

\[
q_{i,j} = \frac{\sum_{(a,b)\in R} \mathbf{1}\bigl((i,j)\in\{(a,z_1),(a,z_2),\dots,(a,z_k),(a,b)\}\bigr)}{V(V-1)}, \tag{A.1}
\]

where (a, z_1, z_2, ..., z_k, b) represents the shortest path in M_0 connecting a and b. Equivalently, if we consider the tree generated by a breadth-first search in M_0 rooted at i, then since i ∼ j, j will be in the tree, and the numerator counts the total number of offspring of j in the tree, including j itself. This is the point at which a rigorous mathematical characterization of the tree becomes challenging. Instead, we approximate the tree and analyze its behavior. It is well known that when p = λ/V, the cluster growth (or the breadth-first search at a vertex) can be approximated by a Poisson(λ) branching process (see, e.g., Hofstad [2016], Durrett [2010]). For a fixed vertex i, we define T as a Galton–Watson tree rooted at i with Poisson(λ) offspring distribution and depth L. We use T to approximate the exploration process at i. For 0 ≤ ℓ ≤ L, the number of vertices at level L − ℓ is approximately λ^{L−ℓ}.
Given that the total number of vertices in T is approximately (1 − ρ(λ))V, the number of vertices at level L − ℓ is also approximately (1 − ρ(λ))V(λ − 1)/λ^{ℓ+1}. For each vertex at level L − ℓ, the number of its offspring (including itself) equals k with probability p_ℓ(k). In this case, the numerator in (A.1) equals k. Combining the above, there are around (1 − ρ(λ))V · p_ℓ(k) · (1 − ρ(λ))V(λ − 1)/λ^{ℓ+1} vertex pairs (i,j) in the graph such that i ∈ L_X, j ∈ L_Y, q_{i,j} = k/V(V−1), and j is located at level L − ℓ in the tree T. Ultimately, we arrive at an approximation of the form

\[
\mathrm{Acc}(M_t) \sim p + C_\lambda\left(1 - \sum_{\ell=0}^{\infty} \frac{\lambda-1}{\lambda^{\ell+1}} \sum_{k=1}^{\infty} p_\ell(k)\left(1 - \frac{k}{V(V-1)}\right)^{t}\right).
\]

Beyond Erdős–Rényi graphs, q_{i,j} may not be as explicit. Defining C as the proportion of vertex pairs (i,j) such that i ∼ j in M_0, we find that q_{i,j} is nonzero for CV(V−1) pairs of vertices. Writing a_k = k/V(V−1) and defining μ(k) as the probability that q_{i,j} = a_k, we obtain a general formula

\[
\mathrm{Acc}(M_t) \sim p + C\left(1 - \sum_{k=1}^{\infty} \mu(k)(1-a_k)^{t}\right).
\]

The drawback of this formula is the lack of explicit expressions: for a given M_0, computing the measure μ(·) is not simple. We next provide a qualitative description of the mixture-of-exponentials shape.

Lemma A.5.3. For a fixed constant 0 < C < 1 and a probability measure μ(·) on ℤ₊ with finite mean m, define

\[
f(t) = p + C\left(1 - \sum_{k=1}^{\infty} \mu(k)\left(1 - \frac{k}{V(V-1)}\right)^{tV(V-1)}\right).
\]

Then there exist 0 < t_1 < t_2 such that

\[
f(t) = \begin{cases} \Theta(p+t), & 0 \le t \le t_1, \\ \Theta(\log t), & t_1 \le t \le t_2, \\ \Theta(1), & t \ge t_2, \end{cases}
\]

as V → ∞.

Proof of Lemma A.5.3. Fix any 1 < t_1 < t_2. Note that f(t) is monotone increasing, concave, and always bounded by 1. We also have

\[
f(t_2) \ge p + C\left(1 - \left(1 - \frac{1}{V(V-1)}\right)^{t_2 V(V-1)}\right) \ge p + C(1 - \exp(-t_2)) = \Theta(1).
\]

So f(t) = Θ(1) when t ≥ t_2. Now when t ≤ t_1,

\[
f(t) \le p + C\left(1 - \sum_{k=1}^{\infty} \mu(k)(1 - tk)\right) \le p + Cmt.
\]
Since f(0) = p and f(t_2) ≥ p + C(1 − exp(−t_2)), by concavity f(t) is lower bounded by p + tC(1 − exp(−t_2))/t_2 = Θ(p + t) for any 0 ≤ t ≤ t_1. Finally, for t_1 ≤ t ≤ t_2, we note that f(t_1) ≤ f(t) ≤ 1, so f(t) ≤ log t_1/log t_1 ≤ log t/log t_1 = O(log t). Similarly, f(t) ≥ f(t_1) = f(t_1) log t_2/log t_2 ≥ log t · (f(t_1)/log t_2) = Ω(log t). Therefore, f(t) = Θ(log t) for any t_1 ≤ t ≤ t_2.

A.5.1 More details on the mixture-of-exponentials shape

We provide additional discussion of the mixture-of-exponentials shape, including how we fit it to empirical EntiGraph CPT QA accuracy.

Sketch of derivation Intuitively, edge (i,j) will eventually be added if and only if j is reachable from i in the original graph M_0. This explains the limiting behavior of Acc(M_t) as t approaches infinity: the proportion of links converges to the proportion of connected vertex pairs in M_0. To understand the mixture-of-exponentials functional form, consider that at time t, the probability of adding each vertex pair follows an exponential pattern, with different vertex pairs exhibiting different growth rates. Consider a breadth-first search in M_0 starting from vertex i. If j is close to the root, many paths from i to other vertices pass through j, making (i,j) more likely to be included in each iteration. In contrast, if j is far from the root (e.g., at the end of the exploration process), fewer such paths exist, making (i,j) less likely to be included. This accounts for the mixture-of-exponentials shape, where the mixture reflects the distance of each vertex from the root, the number of such vertices, and their corresponding exponential growth rates.
Figure A.3: Accuracy Acc(M_t) with respect to time t, for V = 100 and p = 0.03. The mixture-of-exponentials functional form in (2.2) leads to three distinct regimes. Panels: (a) linear regime; (b) log-linear (t in log scale); (c) plateau regime.

Qualitative description We now provide a qualitative description of the mixture-of-exponentials shape. As shown in Appendix A.5, this shape comprises three distinct phases: a fast growth phase, a slower growth phase, and a plateau phase. We show the existence of two distinct times, 0 < t_1 < t_2, such that

\[
\mathrm{Acc}(M_T) = \begin{cases} \Theta(p+t), & 0 \le t \le t_1, \\ \Theta(\log t), & t_1 \le t \le t_2, \\ \Theta(1), & t \ge t_2, \end{cases}
\]

where we use the convenient change of variable T = tV(V−1). The choice of log t in the second phase is not necessarily canonical: the bound holds for any well-behaved monotone increasing concave function as a replacement for log t. We use this representation for two reasons: first, it aligns with performance observed in our EntiGraph CPT numerical results; second, it reflects the gradual slowdown in growth. Figure A.3 illustrates the three phases in a simulation of the toy model with p = 0.03.

To perform curve fitting using the mixture-of-exponentials formula, we approximate the infinite sum with three terms:

\[
\mathrm{Acc}(M_t) \sim p + C\left(1 - \sum_{k=1}^{\infty} \mu(k)(1-a_k)^{t}\right).
\]

We fit the empirical observations to the formula

\[
y(x) = a - b_1 r_1^{x} - b_2 r_2^{x} - b_3 r_3^{x},
\]

where x is the EntiGraph token count (in millions) and y(x) is the QuALITY QA accuracy. We use the non-linear least squares method implemented by Virtanen et al. [2020].
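The shape described in Lemma A.5.3 can be reproduced numerically. A minimal sketch of the toy model; the measure μ and parameter values below are illustrative placeholders, not the fitted values from the experiments:

```python
def mixture_accuracy(t, p=0.03, C=0.6,
                     mu=((1, 0.7), (5, 0.2), (50, 0.1)), V=100):
    """Evaluate f(t) = p + C * (1 - sum_k mu(k) * (1 - k/(V(V-1)))^t),
    the mixture-of-exponentials form, with a toy measure mu given as
    (k, weight) pairs whose weights sum to 1."""
    n = V * (V - 1)
    s = sum(w * (1 - k / n) ** t for k, w in mu)
    return p + C * (1 - s)
```

Evaluating at increasing t shows the expected behavior: f(0) = p, monotone growth, and saturation near p + C once every exponential term has decayed, i.e. the plateau regime.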
This procedure yields the fitted formula

\[
y(x) = 64.5456 - 13.8352\times(0.9989)^{x} - 8.4705\times(0.8961)^{x} - 3.932\times(0.0546)^{x}.
\]

We provide the implementation in our code release.

A.6 Synthetic data generation prompts

We generate two synthetic corpora: EntiGraph (Appendix A.6.1) and the Rephrase baseline (Appendix A.6.2). In our experiments, D_source is a collection of documents D, and we apply our synthetic augmentation procedure to each document D ∈ D_source. We focus on a single document D for the remainder of this section.

A.6.1 EntiGraph prompts

The EntiGraph procedure is described in §2.2.2. We recap the steps below.

Step 1: entity extraction We first extract salient entities from document D using the entity_extraction operation (Step 1, §2.2.2). The complete entity_extraction prompt is:

As a knowledge analyzer, your task is to dissect and understand an article provided by the user. You are required to perform the following steps:
1. Summarize the Article: Provide a concise summary of the entire article, capturing the main points and themes.
2. Extract Entities: Identify and list all significant "nouns" or entities mentioned within the article. These entities should include but not limited to:
* People: Any individuals mentioned in the article, using the names or references provided.
* Places: Both specific locations and abstract spaces relevant to the content.
* Object: Any concrete object that is referenced by the provided content.
* Concepts: Any significant abstract ideas or themes that are central to the article’s discussion.
Try to exhaust as many entities as possible. Your response should be structured in a JSON format to organize the information effectively. Ensure that the summary is brief yet comprehensive, and the list of entities is detailed and accurate. Here is the format you should use for your response:
"summary": "<A concise summary of the article>",
"entities": ["entity1", "entity2", ...]
Step 2: relation analysis. We next generate diverse descriptions of relations among two or more entities. For each document $D$, we enumerate all entity pairs and generate a description for each. The prompt for generating a description relating a pair of entities is:

You will act as a knowledge analyzer tasked with dissecting an article provided by the user. Your role involves two main objectives:
1. Rephrasing Content: The user will identify two specific entities mentioned in the article. You are required to rephrase the content of the article twice:
* Once, emphasizing the first entity.
* Again, emphasizing the second entity.
2. Analyzing Interactions: Discuss how the two specified entities interact within the context of the article.
Your responses should provide clear segregation between the rephrased content and the interaction analysis. Ensure each section of the output include sufficient context, ideally referencing the article's title to maintain clarity about the discussion's focus. Here is the format you should follow for your response:

### Discussion of <title> in relation to <entity1>
<Rephrased content focusing on the first entity>

### Discussion of <title> in relation to <entity2>
<Rephrased content focusing on the second entity>

### Discussion of Interaction between <entity1> and <entity2> in context of <title>
<Discussion on how the two entities interact within the article>

We also generate synthetic data involving three entities, using the prompt below:

You will act as a knowledge analyzer tasked with dissecting an article provided by the user. Your role involves three main objectives:
1. Rephrasing Content: The user will identify three specific entities mentioned in the article. You are required to rephrase the content of the article three times:
* Once, emphasizing the first entity.
* Again, emphasizing the second entity.
* Lastly, emphasizing the third entity.
2.
Analyzing Interactions: Discuss how these three specified entities interact within the context of the article.
Your responses should provide clear segregation between the rephrased content and the interaction analysis. Ensure each section of the output include sufficient context, ideally referencing the article's title to maintain clarity about the discussion's focus. Here is the format you should follow for your response:

### Discussion of <title> in relation to <entity1>
<Rephrased content focusing on the first entity>

### Discussion of <title> in relation to <entity2>
<Rephrased content focusing on the second entity>

### Discussion of <title> in relation to <entity3>
<Rephrased content focusing on the third entity>

### Discussion of Interaction between <entity1>, <entity2> and <entity3> in context of <title>
<Discussion on how the three entities interact within the article>

A.6.2 Rephrase prompts

For the rephrase corpus, we adapt the prompt from Maini et al. [2024] to our setting of books and articles. We use four rephrase styles:

Easy rephrase:
You are an assistant to help read a article and then rephrase it in simpler terms. The user will provide you with an article with title, year, content. You need to generate a paraphrase of the same article using a very small vocabulary and extremely simple sentences that a toddler will understand. Remember to keep the meaning and every content of the article intact, including the title, year, etc.

Medium rephrase:
You are an assistant to help read a article and then rephrase it in different terms. The user will provide you with an article with title, year, content. You need to generate a paraphrase of the same article using diverse and high quality English language as in sentences on Wikipedia. Remember to keep the meaning and every content of the article intact, including the title, year, etc.
Hard rephrase:
You are an assistant to help read a article and then rephrase it in more sophisticated terms. The user will provide you with an article with title, year, content. You need to generate a paraphrase of the same article using very terse and abstruse language that only an erudite scholar will understand. Remember to keep the meaning and every content of the article intact, including the title, year, etc.

A.7 Additional evaluation details of main experiments

A.7.1 QuALITY QA question set

We provide additional details on evaluation using the QuALITY QA test queries. Throughout the closed-book QA experiments, we use the following fixed 5-shot prompt:

## Example 1
### Question
In the context of "Les Misérables", written by Victor Hugo in 1862, what is the main setting of the novel? There is only one correct choice.
### Choices
A. London
B. Madrid
C. Paris
D. Rome
### Thought Process and Answer
Thought process: "Les Misérables" is primarily set in Paris, making C the correct choice. London, Madrid, and Rome are significant cities in other literary works but not in Victor Hugo's "Les Misérables". There is only one correct choice.
Answer: C.

## Example 2
### Question
In the context of "Brave New World", written by Aldous Huxley in 1932, what substance is widely used in the society to control citizens' happiness? There is only one correct choice.
### Choices
A. Gold
B. Soma
C. Silver
D. Iron
### Thought Process and Answer
Thought process: In Aldous Huxley's "Brave New World," Soma is used as a means to maintain social control by ensuring citizens' happiness, making B the correct choice. Gold, Silver, and Iron are not the substances used for this purpose in the book.
Answer: B.

## Example 3
### Question
In the context of "Romeo and Juliet", written by William Shakespeare in the early 1590s, what are the names of the two feuding families? There is only one correct choice.
Choices:
A. Montague and Capulet
B.
Bennet and Darcy
C. Linton and Earnshaw
D. Bloom and Dedalus
### Thought Process and Answer
Thought process: In William Shakespeare's "Romeo and Juliet," the two feuding families are the Montagues and the Capulets, making A the correct choice. The Bennets and Darcys are in "Pride and Prejudice", the Lintons and Earnshaws in "Wuthering Heights", and Bloom and Dedalus in "Ulysses".
Answer: A.

## Example 4
### Question
In the context of "1984", written by George Orwell in 1949, what is the name of the totalitarian leader? There is only one correct choice.
### Choices
A. Big Brother
B. O'Brien
C. Winston Smith
D. Emmanuel Goldstein
### Thought Process and Answer
Thought process: In George Orwell's "1984," the totalitarian leader is known as Big Brother, making A the correct choice. O'Brien is a character in the novel, Winston Smith is the protagonist, and Emmanuel Goldstein is a rebel leader.
Answer: A.

## Example 5
### Question
In the context of "Moby-Dick", written by Herman Melville in 1851, what is the name of the ship's captain obsessed with hunting the titular whale? There is only one correct choice.
### Choices
A. Captain Hook
B. Captain Nemo
C. Captain Flint
D. Captain Ahab
### Thought Process and Answer
Thought process: In Herman Melville's "Moby-Dick," the ship's captain obsessed with hunting the whale is Captain Ahab, making D the correct choice. Captain Nemo is in "Twenty Thousand Leagues Under the Sea", Captain Flint in "Treasure Island", and Captain Hook in "Peter Pan".
Answer: D.

## Example 6

If the model correctly follows the few-shot prompt format, its last two characters should be "A.", "B.", "C.", or "D.". However, the model sometimes fails to follow the few-shot prompting format, particularly the continually pretrained model. In all our evaluations, we sample 64 responses and select only those that parse to the correct format. From these valid attempts, we randomly select one as the final answer.
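The answer-selection procedure above can be sketched as follows: sample many completions, keep only those whose last two characters form a valid letter-period suffix, and pick one of the valid attempts at random. The completion strings are illustrative, not real model output.

```python
import random

VALID_ENDINGS = {"A.", "B.", "C.", "D."}

def select_answer(completions, rng):
    """Keep completions ending in 'A.'-'D.' and pick one at random."""
    valid = [c for c in completions if c.strip()[-2:] in VALID_ENDINGS]
    if not valid:
        return None  # the model never produced a parseable answer
    return rng.choice(valid).strip()[-2:]

completions = [
    "Thought process: the novel is set in Paris. Answer: C.",
    "The answer is probably Paris",  # malformed: no letter suffix
    "Thought process: Paris, so C is correct. Answer: C.",
]
answer = select_answer(completions, random.Random(0))
# → "C." (both parseable completions end with "C.")
```

This differs from majority voting: one valid attempt is chosen uniformly at random rather than by vote.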
This is different from majority voting in self-consistency prompting [Wang et al., 2023a].

A.7.2 Closed-book summarization

Automated evaluation metric. We use a three-stage evaluation procedure: (i) we use GPT-4 (specifically, the gpt-4-turbo model as of Aug. 19, 2024) to break the summary into atomic claims, similar to Min et al. [2023]; (ii) we provide both the list of claims and the source article to a judge model (also GPT-4), which determines whether each claim is true or false based on the source article; if the claim is true, we further ask the model to determine whether the claim is salient (contributes to the main message) or cosmetic (factual details that do not aid understanding); (iii) for each summary, we obtain the number of false and salient claims and normalize by the corresponding count from the human summary. We report the average of these normalized metrics across the QuALITY corpus articles in Figure 2.3.

Prompts to generate summaries. For summarization evaluation with EntiGraph Instruct and Raw Instruct, we use the following two prompts (Table A.1) to obtain summaries of increasing length.

➤ Short prompt: Summarize the article <article title> by <author name> for me. Give a short summary of "Cosmic Yo-Yo" by Ross Rocklynne.

➤ Long prompt: Write an extremely long and detailed article regarding the book <article title> by <author name>. Write an extremely long and detailed article regarding the book "Cosmic Yo-Yo" by Ross Rocklynne.

Table A.1: Summarization prompt for EntiGraph Instruct, Raw Instruct, and Rephrase Instruct.

We show three examples of summarization outputs below. For each example, we first present the human summary for context, then present the short summary from the two summarizers.

Example 1. The first example is "Cosmic Yo-Yo" by Ross Rocklynne.

Human summary: Bob Parker, the President of Interplanetary Hauling & Moving Co., sells asteroids to wealthy people on earth.
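The normalization in stage (iii) can be sketched as follows. The claim counts below are illustrative placeholders, not values from the paper; the dictionary keys are an assumed encoding of the judge's per-summary tallies.

```python
def normalized_metrics(model_claims, human_claims):
    """Normalize each model claim count by the human-summary count.

    Each argument is a dict of raw counts, e.g. {"false": 2, "salient": 7}.
    """
    return {
        key: model_claims[key] / human_claims[key]
        for key in ("false", "salient")
        if human_claims[key] > 0  # guard against division by zero
    }

# Illustrative counts for one article's summaries.
model = {"false": 2, "salient": 6}
human = {"false": 1, "salient": 12}
metrics = normalized_metrics(model, human)
# → {"false": 2.0, "salient": 0.5}
```

The per-article dictionaries would then be averaged across the QuALITY corpus to produce the numbers reported in Figure 2.3.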
Clients ask for asteroids with size parameters and specifications, and Bob finds them in space and hauls them to earth. His company is almost bankrupt because a rival company, Saylor & Saylor, stole his idea and now offers the same services. Bob receives mail from Mr. Andrew S. Burnside with a request for an asteroid that he would like to use in an upcoming wedding. Bob and his partner Queazy set out to find the perfect asteroid for Mr. Burnside, although they know it's a longshot. Fairly quickly, they find one that looks perfect. The men land on the asteroid, and Bob deploys his atomic-whirl spectroscope to test it. Suddenly, a beautiful woman interrupts him and demands that they leave the asteroid. She pulls out her spasticizer gun before telling them that they can have it in a month after she's gone. Bob explains that they are desperate, but the girl retorts that her fate is worse than death if she leaves. Suddenly, the Saylor brothers' ship appears, and Bob tells the girl that they have to fight this enemy together. Wally and Billy Saylor, along with three other men, jump out of the ship. Bob tells them that Mr. Burnside has ordered this asteroid, and the Saylor brothers say that they received the same order. Bob quickly grabs the girl's spasticizer while Queazy throws his body at Billy. However, Wally manages to shoot the gun out of Bob's hand and attack him. Bob is knocked unconscious in the scuffle. When Bob wakes up, he is completely alone, floating in space. He panics because he has very little oxygen left. Finally, he hears Queazy's voice explaining that the girl used her ship's technology to find them both. The mystery girl introduces herself as Starre Lowenthal, the granddaughter of Mr. Burnside. She concedes that this entire mission was fake.
She told her grandfather that she would only marry her fiance Mac if he could get this particular asteroid, and then she made plans to conquer and protect the asteroid so it could not be supplied for the wedding. Bob is confident that they can reach the Saylor brothers before they bring the asteroid back to earth, but his plan does nothing to protect Starre from marrying a man she doesn't love. She agrees to help Bob and Queazy. Within five days, Bob realizes he is in love with Starre. Starre compares her small ship to a yo-yo, and Bob gets an idea - they will use Starre's ship like a yo-yo to retrieve the asteroid from the Saylor brothers. Once the team catches up to the Saylor brothers, Bob flings Starre's ship at the asteroid several times, and Wally calls them to tell them that they might die as a result of the damage their ship has sustained. Bob makes it clear that they have no intention of stopping, and the Saylor brothers release the asteroid.

EntiGraph Instruct with short prompt: Cosmic Yo-Yo is a science fiction story by Ross Rocklynne about Bob Parker and Queazy, two interplanetary haulers who are hired to deliver an asteroid to Earth for a wealthy client's backyard wedding. However, they face competition from their rivals, the Saylor brothers, who also want to deliver the asteroid. The story follows their race to deliver the asteroid before the deadline, using a yo-yo-like device to maneuver the asteroid and outsmart their rivals. The story is filled with action, adventure, and a touch of romance.

Raw Instruct with short prompt: Cosmic Yo-Yo by Rocklynne, Ross is a science fiction novel that follows the story of a young woman named Susan who discovers a mysterious device that can transport her through time and space. She uses the device to travel to different planets and meet different alien species, but soon realizes that her actions have consequences and she must find a way to return home.
The novel explores themes of identity, responsibility, and the consequences of our actions.

➤ Explicit article reference instruction: Summarize the article "Defining Decay Down" by "David Plotz" for me.

The article "Defining Decay Down" by David Plotz discusses how the dental industry has changed over the years. In the past, dentists were primarily focused on treating cavities and other dental issues, but as the prevalence of cavities decreased due to the widespread use of fluoride, the industry had to adapt. Dentists began to focus on cosmetic dentistry, offering services like teeth whitening and orthodontics to appeal to the vanity of aging baby boomers. The article also discusses how dentists have become more customer-friendly, with amenities like virtual reality glasses and massage therapy. Despite the shift towards elective procedures, the dental industry has continued to thrive, with dentists earning higher salaries than ever before.

➤ Implicit article reference: How has the dentistry practice in United States has changed?

The dentistry practice in the United States has undergone significant changes in recent years. Here are some of the ways in which it has changed:
1. Focus on cosmetic dentistry: In recent years, there has been a growing demand for cosmetic dental procedures such as teeth whitening, orthodontics, and veneers. Dentists have responded to this demand by offering more cosmetic services.
2. Use of technology: Dentists are now using advanced technology such as intra-oral cameras, 3D imaging, and laser technology to diagnose and treat dental problems. This technology has made dental procedures more accurate and efficient.
3. Emphasis on preventive care: Dentists are now focusing more on preventive care to help patients avoid dental problems. This includes regular check-ups, cleanings, and fluoride treatments.
4.
Increased competition: With the rise of dental schools and the influx of foreign-trained dentists, the dental industry has become more competitive. Dentists are now offering more services and amenities to attract and retain patients. 5. Focus on patient experience: Dentists are now focusing on providing a positive patient experience. This includes offering amenities such as virtual reality glasses, massage therapy, and entertainment during procedures. Overall, the dentistry practice in the United States has become more patient-centered, technology-driven, and focused on preventive care and cosmetic procedures. ➤ Cross article instruction: Compare David Plotz’s commentary style between how he analyze American dentistry and how he discuss the movie Fight Club? David Plotz’s commentary style is different when he analyzes American dentistry and when he discusses the movie Fight Club. In his analysis of American dentistry, Plotz uses a critical and investigative approach, highlighting the over-commercialization and unnecessary procedures in the industry. He uses statistics and expert opinions to support his arguments and provides practical advice for readers to avoid unnecessary dental procedures. On the other hand, when discussing the movie Fight Club, Plotz adopts a more satirical and humorous tone. He uses pop culture references and witty commentary to analyze the themes and messages of the movie. He also provides a critical analysis of the film’s characters and their motivations, while also acknowledging the movie’s cultural impact. Overall, Plotz’s commentary style is more serious and investigative when analyzing American dentistry, while his commentary on Fight Club is more satirical and humorous. Table A.2: Complete instruction following example used in Table 2.2 from Section 2.4.3. 
Appendix B

Supplementary materials for Chapter 3

B.1 Additional details on synthetic bootstrapped pretraining

B.1.1 SBP implementation details

We now describe the implementation details of SBP outlined in §3.3.

Nearest neighbor pairing. Recall from §3.4 that we work with 3B and 6B-parameter transformer architectures and pretraining datasets at the $\|\mathcal{D}_{\text{pretrain}}\| = 10$B and $\|\mathcal{D}_{\text{pretrain}}\| = 50$B scales. To enable efficient ANN search at pretraining scale, we embed documents from $\mathcal{D}_{\text{pretrain}}$ as 1,024-dimensional vectors using Qwen3-Embedding-0.6B. We then use ScaNN [Guo et al., 2020, Sun et al., 2023] with 8-bit quantization to perform efficient similarity search. We adopt asymmetric sharding for keys and value vectors. For each value vector, we build a ScaNN search tree with $\sqrt{N}$ leaves, where $N$ is the number of vectors in each value shard. To distribute the key shards across each search tree, we employ a "salting" strategy: we create multiple copies of the ScaNN searcher and assign one key shard to each salted copy (Figure B.1). This design enables us to perform a top-200 nearest neighbor search over $|\mathcal{D}_{\text{pretrain}}| = 60$M documents within 155M CPU hours.

At all scales, after obtaining the top 200 neighbors for each sample, we select pairs whose similarity score exceeds 0.75. This threshold yields a tractable synthesizer-tuning dataset $\mathcal{D}_{\text{ST}}$. To assess how different thresholds affect data quality, Figure B.2 shows the fraction of relevant documents at each similarity bin using the metric defined in §3.5.2. We find that higher similarity scores yield more relevant pairs but also more duplicates. Finally, we eliminate near-duplicates using rule-based filtering. The deduplication process first normalizes text by removing punctuation, converting to
lowercase, and eliminating numbers, then tokenizes using SentencePiece. We then generate "shingles" using 13-token sliding windows within $d_1$. We discard training pairs if any shingle from $d_1$ appears in $d_2$.

[Figure B.1: ScaNN system design for efficient distributed search. The 60M pretraining documents are value-sharded into 32 shards, each indexed by an 8-bit quantized ScaNN search tree; keys are split into 1024 shards and distributed across salted searcher copies (one worker per searcher), with results aggregated by key to keep only the top 200 neighbors.]

[Figure B.2: Analysis of paired data at 200B-scale. (a) A histogram of 100K subsampled pairs grouped by their similarity score. (b) The fraction of duplicate pairs when we subsample 1K pairs around a specific similarity score. (c) Same as (b) but showing the fraction of relevant documents.]

Synthesizer-tuning. After collecting the cleaned pair data $\mathcal{D}_{\text{ST}}$, we perform synthesizer-tuning with the objective in (3.3). We initialize the 3B-parameter model at the baseline checkpoint and finetune with a constant learning rate of 5e-6 and a batch size of 16M tokens per step. We initially attempted cosine decay with a larger learning rate but found that a small, constant learning rate produces higher-quality generated text. We measure the Pair-novelty score (defined in §3.5.2) across different synthesizer-tuning checkpoints and find that longer training improves Pair-novelty.
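The rule-based near-duplicate filter described earlier can be sketched as follows. For brevity this sketch uses whitespace tokenization and 3-token shingles, whereas the paper uses SentencePiece tokens and 13-token windows; the documents are illustrative.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation and numbers, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove numbers
    return text.split()

def shingles(tokens, n):
    """All length-n sliding-window token tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_near_duplicate(d1, d2, n=3):
    """Discard the pair if any shingle from d1 also appears in d2."""
    return bool(shingles(normalize(d1), n) & shingles(normalize(d2), n))

dup = is_near_duplicate("The quick brown fox jumps.", "A quick brown fox appears!")
distinct = is_near_duplicate("Coffee shops in San Diego.", "Weather patterns in Norway.")
```

A shared 3-token window ("quick brown fox") flags the first pair; the second pair shares no window and is kept.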
Synthesis at scale. Finally, we perform the hierarchical sampling procedure defined in §3.3 with temperature 1.0 and top_p threshold 0.9. We apply rule-based filtering to remove synthesized documents containing repeated occurrences of 13-token shingles, effectively eliminating texts with repetition failures. We use vLLM [Kwon et al., 2023a] and achieve a throughput of 8.3K tokens per B200 second. This amounts to 2.5K B200 hours for the 200B-scale synthesis, 4.2K B200 hours for the 1T-scale (3B) synthesis, and 8.4K B200 hours for the 1T-scale (6B).

B.1.2 Ablation on data mixture ratio

When performing joint training on a mixture of real and synthesized documents, a natural question is: what fraction of synthesized documents should we include? In §3.5, we used $\|\mathcal{S}_{\text{pretrain}}\| = 75$B for the 200B-scale experiment, $\|\mathcal{S}_{\text{pretrain}}\| = 125$B for the 1T-scale (3B) experiment, and $\|\mathcal{S}_{\text{pretrain}}\| = 250$B for the 1T-scale (6B) experiment. We now present ablation experiments for this design choice.

200B-scale. At this smaller scale, we perform a comprehensive sweep over five values of $\|\mathcal{S}_{\text{pretrain}}\| \in \{0\text{B}, 25\text{B}, 50\text{B}, 75\text{B}, 100\text{B}\}$. As shown in Figure B.3, different benchmarks exhibit varying behavior as synthetic data increases: perplexity (OpenWebText2 and LAMBADA) decreases monotonically, while most QA benchmarks peak around $\|\mathcal{S}_{\text{pretrain}}\| = 75$B.

1T-scale (3B). At the 1T-scale, both data synthesis and joint pretraining become significantly more expensive. We therefore evaluate SBP at three values: $\|\mathcal{S}_{\text{pretrain}}\| \in \{0\text{B}, 125\text{B}, 250\text{B}\}$. As shown in Figure B.4, $\|\mathcal{S}_{\text{pretrain}}\| = 125$B produces the best-performing model across all benchmarks except LAMBADA perplexity.

1T-scale (6B). We also sweep the mixture ratio for the 6B model at the 1T-scale, evaluating $\|\mathcal{S}_{\text{pretrain}}\| \in \{0\text{B}, 125\text{B}, 250\text{B}\}$. As shown in Figure B.5, the optimal amount of synthetic data is around 250B, higher than the optimal 125B observed for the 3B model.
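The temperature/top-p configuration used for synthesis can be sketched with a minimal, engine-independent nucleus sampler (the paper itself uses vLLM; the logits below are illustrative).

```python
import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Nucleus sampling: draw from the smallest set covering top_p mass."""
    rng = rng or np.random.default_rng(0)
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]                    # nucleus token ids
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# With these logits the two most likely tokens already cover >0.9 mass,
# so only token ids 0 and 1 can ever be sampled.
logits = np.array([5.0, 4.0, 0.1, -2.0])
token = sample_top_p(logits, temperature=1.0, top_p=0.9)
```

Truncating the tail this way is what suppresses low-probability degenerate continuations during large-scale synthesis.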
Discussion. A general pattern emerges: the best-performing model results from pretraining on a mixture of real and synthetic data. Furthermore, the optimal synthetic data ratio increases with model size (from approximately 12.5% for 3B to approximately 25% for 6B). Real internet data has higher quality and merits more repetition, but because repetition yields diminishing returns, synthetic data provides an additional source of signal. In contrast, distillation-based research typically finds that training purely on synthetic data yields higher training efficiency. However, such gains are capped: distilled models eventually converge to the capability of the teacher LM. This contrast reveals that SBP does not generate a compressed and denoised representation of knowledge. Instead, it provides an additional source of improvement that real data alone cannot capture.

[Figure B.3: SBP performance with varying synthetic tokens at 200B-scale. Panels compare Oracle and SBP on perplexity (OpenWebText2, LAMBADA, MMLU-5S) and accuracy (ARC_C, ARC_E, SCIQ, WINOGRANDE, TRIVIAQA-1S, WEBQS-1S).]

B.1.3 Random pairs and embedding analysis

To verify that SBP relies on learning specific inter-document correlations rather than generic data augmentation, we train the synthesizer on random document pairs as an ablation. We measure
the semantic similarity between the seed document $d_1$ and the target document (either $d_2$ or the synthesized output) using Qwen3-Embedding-0.6B. Table B.1 shows that paired documents in SBP exhibit high semantic similarity (0.79), whereas random documents have minimal similarity (0.15). Crucially, documents generated by the SBP synthesizer maintain high relevance (0.66) to the seed document. In contrast, the model trained on random pairs produces outputs with significantly lower relevance (0.32). This confirms that the SBP synthesizer learns to preserve semantic relevance from the training pairs, a property absent when training on random associations.

[Figure B.4: SBP performance with varying synthetic tokens at 1T-scale (3B). Panels compare Oracle and SBP on the same perplexity and accuracy benchmarks as Figure B.3.]

B.2 Additional analysis of synthesized samples

B.2.1 Analyzing concepts in documents

We next examine the intermediate mechanisms underlying the document synthesis process. Specifically, we classify the hypothesized concepts inferred from real documents (Table 3.4) along two dimensions: concept domains, the broad subject areas a concept belongs to (e.g., science, psychology, health, culture), and concept types, the abstract role or nature of the concept itself (e.g., theory, method, comparison, symbol).
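The embedding-similarity comparison reported in Table B.1 reduces to cosine similarity between document embeddings. The sketch below uses short illustrative vectors in place of the 1,024-dimensional Qwen3-Embedding-0.6B embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative stand-ins for seed/target document embeddings.
seed = [0.6, 0.8, 0.0]
related = [0.6, 0.8, 0.1]     # nearly parallel: similarity close to 1
unrelated = [-0.8, 0.6, 0.0]  # orthogonal: similarity close to 0

high = cosine_similarity(seed, related)
low = cosine_similarity(seed, unrelated)
```

Averaging such scores over sampled (seed, target) pairs yields the per-column means reported in Table B.1.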
[Figure B.5: SBP performance with varying synthetic tokens for the 6B model at 1T-scale. Panels compare Oracle and SBP on the same perplexity and accuracy benchmarks as Figure B.3.]

The distributions in Tables B.2 and B.3 reveal the multidimensional nature of the knowledge space. The domains span macro-level sociocultural phenomena (Culture topics range from inter-community conflict in Nigeria to immigration policy and interracial dating) alongside micro-level issues of individual health and wellbeing. The typological classification reveals not only subject matter but also modes of conceptual engagement: Methods comprise formalized procedures (multidimensional poverty measurement, commercial real estate appraisal), Events capture historically situated crises (Mediterranean migrant crisis, BP oil spill nationalization), and Comparisons facilitate interpretive framing through juxtapositions (cancer suffering: individual vs. family). Altogether, this taxonomy illustrates both topical diversity and a spectrum of cognitive orientations.

While real and synthesized documents share the same underlying concept, they differ in multiple ways. We categorize these differences into a taxonomy of relations using a small ontology (Table B.4). These relations range from scope-based distinctions (e.g., specific vs.
general), to causal connections (e.g., corruption leading to reform), to contrastive pairs (e.g., Constitution articles vs. Articles of Confederation). This diversity demonstrates the rich variation structure that the synthesizer captures.

Table B.1: Embedding similarity statistics. "Paired documents" refers to the SBP training pairs found by nearest neighbor search. "Random documents" refers to randomly paired documents. "Generated documents (SBP)" refers to the synthetic data generated by the SBP model at 200B-scale (3B). "Generated documents (Random)" refers to the synthetic data generated by the model trained on random pairs. All comparisons are based on the 10B dataset.

Statistic    Paired docs    Random docs    Generated (SBP)    Generated (Random)
Mean         0.79           0.15           0.66               0.32

Table B.2: Categorize extracted concepts into domains.

Culture (38.74%): Inter-community conflict in Nigeria, Family-based immigration policy, Reactions to Horrid Henry books, Interracial dating and bias
Health (11.89%): Cosmetic dental appliance, Colistin toxicity in infections, Hair health tips, Portable/home medical diagnostics, Vitamin D and pregnancy outcomes
Technology (9.91%): Recovering deleted phone data, Video editing app review, Flash platform pros and cons, HTML 2.0 draft process, Email attachment processing speed
Politics (3.69%): Iran nuclear negotiations, Student loans policy reform, Democratic primary candidate choice, Catalan independence aftermath
Psychology (3.42%): Differences in personality disorders, Exploring the strange in daily life, Aging and nostalgia, Toxic relationship breakup, Psychology research paper topics

Document summarization and concept analysis instructions:

In the following, you are given two documents, doc1 and doc2. Doc2 is generated from doc1. The principle of generation is to first abstract a concept from doc1, and then starting from this concept, generate doc2.
Can you guess what this concept is and how doc2 was generated? Please keep the summary and concepts to be LESS OR EQUAL TO 10 WORDS and format your answer as follows. Highlight the difference between doc2 and doc1 in your doc2_summary:

<doc1_summary> summary of doc1 </doc1_summary>
<concept_c> abstract concept from doc1 </concept_c>
<doc2_summary> summary of doc2 built on doc1 given the concept </doc2_summary>

Example 1:
<doc1_summary> recommendation of local coffee shops in San Diego </doc1_summary>
<concept_c> coffee + San Diego </concept_c>
<doc2_summary> comparison of coffee culture in SD and NYC </doc2_summary>

Example 2:
<doc1_summary> Patient with swollen eye discusses pain causes & symptoms and seeks for advice

Table B.3: Categorize extracted concepts into abstract types.

Method (9.17%): Multidimensional poverty measurement, Commercial real estate appraisal, Stop words search duplicates, DAT chemistry exam preparation
Event (6.98%): Mediterranean migrant crisis, BP oil spill nationalization, Paula Abdul stalked, Eminem-Apple music rights lawsuit, Presidents Cup U.S. golf
Comparison (5.54%): Hobbit film adaptation length/cost, Biking as superior transport, Cancer suffering: individual vs. family, Progress critique: 4G vs.
alternatives Analysis (5.20%)Health effects of substances, Thai massage benefits, Scrabble word break- down, Relationship roles and challenges, Manchester United player analysis Phenomenon (4.95%) Secret pain; self-destruction, Car-related online humor/pranks, Transna- tional corporations in globalization, Hippie identity and lifestyle </doc1_summary> <concept_c> medical symptom of swollen eye </concept_c> <doc2_summary> A wiki-style article introducing causes and cures for swollen eye </doc2_summary> Now, give your answer for the following documents: <doc1> real_document </doc1> <doc2> synthesized_document </doc2> B.2.2 Factuality analysis All LM-generated synthetic data may produce non-factual content because of the probabilistic nature of generation. Moreover, because the internet inherently contains factual inaccuracies, LMs absorb these errors unless the data is carefully cleaned. During post-training, factuality must also be recalibrated alongside other objectives such as data safety. SBP relies solely on document-level correlations and does not incorporate human intervention to filter non-factual content; the generated outputs are therefore expected to contain factual errors. Interestingly, the frequency of such errors correlates with the amount of data used in the SBP pipeline. We define a document as having undefined factuality if it is primarily subjective or opinion-driven, or if it concerns personal, obscure, or unverifiable entities. In all other cases, the document’s factuality is considered well-defined and verifiable. APPENDIX B. SUPPLEMENTARY MATERIALS FOR CHAPTER 3107 Table B.4: Categorize relations between real documents d 1 and synthesized documents d 2 . 
Relation Categories            Examples
Scope relation (8.14%)         d1: Probiotics' possible effects on H1N1 infection. d2: Probiotics' general digestive and immune benefits. Relation: specific application vs. general health benefits of probiotics.
Perspectival relation (5.51%)  d1: Personal, humorous struggles of new bloggers. d2: Objective guide to pros and cons of blogging. Relation: subjective experiences vs. objective guidance about blogging.
Functional relation (4.70%)    d1: Reviews and feedback on "Space Bound" game. d2: Forum troubleshooting for bugs in "Space Bound". Relation: reviews/feedback vs. troubleshooting for the same game.
Causal relation (2.05%)        d1: DTEK faces corruption probe, financial risk. d2: DTEK nationalized for state-driven energy reform. Relation: corruption/financial issues vs. nationalization/energy reform.
Contrastive relation (1.65%)   d1: Detailed summary of Constitution articles. d2: Overview, flaws of Articles of Confederation. Relation: U.S. Constitution articles vs. Articles of Confederation: different foundational documents.

Table B.5: Estimation of the ratio of non-factual documents. The occurrence of factuality errors decays as SBP scales up.

                              Factuality undefined   No factual error   Factual error
Real data                     31.44%                 66.74%             1.81%
Synthetic data (200B-scale)   34.43%                 50.47%             15.09%
Synthetic data (1T-scale)     31.91%                 59.43%             8.65%

In Table B.5, we analyze both real and synthesized data from the main experiment (§3.5.1). We consider two synthetic datasets: a smaller-scale set initialized with 10B seed tokens and a larger-scale set initialized with 50B seed tokens. From each source, we randomly sample 10k documents and categorize each into three bins (factuality undefined, no factual error, and factual error) using LM-as-a-judge. We find that synthetic data contains more factual errors than real data. However, as the amount of seed data increases, factuality improves significantly, approaching that of real data.
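The three-bin tally behind Table B.5 can be sketched in a few lines. This is a minimal sketch, not the dissertation's pipeline: it assumes the judge replies using the tag format given in the factuality detection instructions later in this appendix, and the example responses below are hypothetical stand-ins for real API outputs.

```python
import re

def parse_judgment(output: str) -> str:
    """Map one judge response onto one of the three factuality bins."""
    if re.search(r"<not well defined>", output):
        return "factuality_undefined"
    m = re.search(r"<well defined>(True|False)</well defined>", output)
    if m:
        return "no_factual_error" if m.group(1) == "True" else "factual_error"
    return "unparseable"

def bin_ratios(judge_outputs):
    """Fraction of sampled documents falling into each bin."""
    counts = {}
    for out in judge_outputs:
        label = parse_judgment(out)
        counts[label] = counts.get(label, 0) + 1
    total = len(judge_outputs)
    return {label: n / total for label, n in counts.items()}

# Hypothetical judge responses for four sampled documents.
outputs = [
    "<not well defined></not well defined>",
    "<well defined>True</well defined>",
    "<well defined>True</well defined>",
    "<well defined>False</well defined>",
]
ratios = bin_ratios(outputs)
```

In the actual analysis the inputs would be 10k judge responses per data source rather than four toy strings.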
This finding aligns with our mideval results in Table 3.3: greater seed data availability enables the LM to capture more factual knowledge and the synthesizer to generate more relevant documents, thereby reducing hallucinations. Table B.7 extends our analysis of factuality errors, highlighting inaccuracies in the synthetic texts. These include false transfer and timeline claims in football, as well as incorrect institutional, company location, and certification details in the ecolabel example. This underscores the importance of rigorous fact-checking, particularly for historical events (e.g., sports) and certification standards (e.g., eco-labels).

Table B.6: Factuality undefined synthetic text.

Synthetic Text: Sunday, December 28, 2008. Tante Leni. Tante Leni is not only my Aunt Leni; she is my Eternity. When my Aunt Leni died a few years ago, she was deeply saddened and I was devastated. She was not the first family member to die, but she was the first I felt so strongly about. Tante Leni was all my parents really had to show for the 25 years they worked as public school teachers and she was the one who had been with them the longest. There was a special place in her heart for my parents. In addition to all that, she was the kind of person who always had a smile and a funny story to share. She was kind and funny and generous. The story that always comes to mind when I think about her is the time she was working at the bank and someone dropped something from the top floor. It was a very large parcel and the workers on the ground floor didn't have the tools to open it. She jumped down to see what was in it. A very large package of champagne appeared and she began gulping it down. Tante Leni and my mother in a portrait they took when my mom was 20. Tante Leni and my parents in a family portrait she took for my mom at 22. Tante Leni and my dad at home when he was working as a dance instructor. When my mom died, she had all the people who had known her since she was a child living in the house. Tante Leni was the oldest, but she was also the best at cleaning, cooking and taking care of the house. When my mom passed away, she went to a rehab center and Tante Leni stayed in the house.

Factuality detection instructions:

You are a helpful AI assistant. Your task is to evaluate whether the given document has well-defined factuality.

Definitions:
- Not well-defined factuality: The document is primarily subjective or opinion-based (e.g., expresses disapproval of a politician on social media), or discusses personal, unknown, or unverifiable entities (e.g., a private diary).
- Well-defined factuality: The document refers to well-known, identifiable entities (e.g., famous people, historical events, popular movies). Its factual claims can be checked or verified.

Output format:
- If the document's factuality is not well-defined, output: <not well defined></not well defined>
- If the document's factuality is well-defined and factual, output: <well defined>True</well defined>
- If the document's factuality is well-defined but non-factual, output: <well defined>False</well defined>

Now, analyze the following document and provide your answer: document

B.2.3 Mideval prompts

Before each large-scale synthesis run, on the order of billions of tokens, we first synthesize a small subset of data to evaluate its quality. We call this step "mideval." We use LM-as-a-judge to evaluate three quality metrics: Pair-relevance (whether the seed and synthesized documents share the same topic, entity, or event), Pair-novelty (whether the synthesized content differs substantively from near-duplicates), and Non-repetition (absence of repeated sentence patterns in the output). Full evaluation prompts are available in the code repository.
Table B.7: Factuality errors detected in synthetic text.

Synthetic Text: So just how much has Chelsea been prepared to pay for the 34-year-old midfielder? Realistically, the clubs involved should be keeping in the region of £25 million ($38.8 million) and around £30 million ($45.5 million) being bandied about for the player in Italy. With the Blues expected to complete the sale of Cesc Fabregas to Arsenal this week, Lampard appears the logical replacement in midfield, but his bid to extend his contract has hit a roadblock with Chelsea's owners Roman Abramovich and the club being unable to agree to an increase in salary. Lampard, who played in the Champions League final in Lisbon for Chelsea in 2007, has been linked with a move away from Stamford Bridge this summer, after having his contract with the club indefinitely extended in 2010. There were rumours that he was to leave on a free transfer in the summer of 2011, but it appears that those are again being placed to the side. Miguel Delaney, from Sky Sports, believes that Lampard could be the highest paid player in the Premier League, and puts the bid that Chelsea are offering him in the region of £6 million ($8.9 million) per year. If Lampard is to join a team in Italy, it seems that this sum would be realistic, but the Premier League is a different beast, and it would be unrealistic for Chelsea to splash out that amount. As the speculation around his future continues, you can watch the pair discussing the topic in the video below.

Factual Errors:
- Cesc Fabregas to Arsenal – Fabregas moved from Arsenal to Barcelona in 2011, then to Chelsea in 2014, not back to Arsenal.
- 2007 – The 2014 Champions League final in Lisbon did not involve Chelsea or Lampard; Chelsea won in 2012.
- indefinitely extended in 2010 – Lampard did not get an indefinite contract extension in 2010 with Chelsea.
Synthetic Text: Swanee Glue Brand: First Glue to be Awarded the Swan Ecolabel. Published: 27 July 2022. The global glue stick market is expected to reach USD 3.45 billion by 2028. Adhesives are the first choice of manufacturers in all industries such as food, pharmaceuticals, automotive, aerospace, construction, and packaging. As consumers are increasingly conscious of their carbon footprint and environmental issues, glue manufacturers are aiming to produce products that comply with environmental standards and are effective and cost-effective in their applications. This is why the Swan Ecolabel was established by the Swedish Environment Agency as a certification for sustainable adhesive products. Swanee Glue is one of the world's leading glue brands in glue sticks, and this year its brand received the Swan Ecolabel. UHU is an adhesive brand owned by Bolton Adhesives in the Netherlands, and part of the Italian Bolton Group with a strong agenda for sustainability. Glue sticks, specifically glue sticks with a wider applicator and swan neck applicators, have the most impact on the environment because they are a consumable item and their impact is greatest when thrown away. Therefore, the Swanee Swan Ecolabel ensures that UHU is part of the solution to the growing demand for sustainable adhesive products. In order to obtain the Swan Ecolabel, the adhesive must have at least 50% renewable content. Besides this, the glue stick should also contain a higher percentage of recyclable content. UHU meets all these criteria and has a permanent and multi-use applicator. For further information, you can contact UHU receives the Swan Ecolabel

Factual Errors:
- Swan Ecolabel was established by the Swedish Environment Agency – The Nordic Swan Ecolabel was established by the Nordic Council of Ministers, not only Sweden.
- Netherlands – UHU is based in Germany, not the Netherlands.
- 50% renewable content – The Swan Ecolabel requires at least 20% renewable content in adhesives, not 50%.
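Stepping back from factuality, the pairing statistics in Table B.1 rest on nearest-neighbor search over document embeddings. The following is a minimal sketch of that comparison using toy random embeddings; the dissertation's actual embedding model, corpus, and pairing implementation are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Unit-normalize rows so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for document embeddings (100 docs, 32 dims).
docs = normalize(rng.normal(size=(100, 32)))

# Nearest-neighbor pairing: each doc paired with its most similar other doc.
sims = docs @ docs.T
np.fill_diagonal(sims, -np.inf)          # exclude self-pairs
nn_idx = sims.argmax(axis=1)
paired_mean = float(np.mean(sims[np.arange(len(docs)), nn_idx]))

# Random-pairing baseline: pair each doc with the next doc in a shuffled
# order (guarantees no self-pairs).
perm = rng.permutation(len(docs))
random_mean = float(np.mean(np.sum(docs[perm] * docs[np.roll(perm, 1)], axis=1)))
```

Even with random vectors, nearest-neighbor pairs score well above random pairs; on real corpora the gap (0.79 vs. 0.15 in Table B.1) reflects genuine topical relatedness rather than chance geometry.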
B.2.4 Synthesized documents from the 1T-scale experiment

We present additional examples of synthesized documents at the 1T-scale, complementing the 200B-scale example in §3.5.2.

Real document: The Karate Kid. From Wikipedia, the free encyclopedia (Redirected from Karate Kid). Jump to: navigation, search. For other uses, see Karate Kid (disambiguation). The Karate Kid. Karate kid.jpg. Directed by John G. Avildsen. Produced by Jerry Weintraub. Written by Robert Mark Kamen. Music by Bill Conti. Cinematography James Crabe. Distributed by Columbia Pictures. Release dates: June 22, 1984 (1984-06-22). Running time 127 minutes[1]. Country United States. Language English. Budget $8 million[2]. Box office $90.8 million[3]. The Karate Kid is a 1984 American martial arts drama film produced by Jerry Weintraub, directed by John G. Avildsen, written by Robert Mark Kamen, that stars Ralph Macchio, Noriyuki "Pat" Morita and Elisabeth Shue.[4][5] It is an underdog story in the mold of a previous success, Avildsen's 1976 film Rocky ...

Synthesis I: The Karate Kid (2010). The Karate Kid (2010) movie poster. 115 minutes. Rated PG-13. Directed by Harald Zwart. Starring: Will Smith, Jackie Chan, RZA, Jaden Smith. More info on IMDB. It's ironic that Daniel LaRusso (Ralph Macchio) and Johnny Lawrence (Jaden Smith) appear in The Karate Kid in their middle ages, because what's most interesting about this remake is the way it updates an old, familiar storyline for a new generation of moviegoers. Perhaps I shouldn't say "updates" – because that implies that the new version is a significant departure from the 1984 classic. If you're one of those folks who thinks the 1984 version of The Karate Kid is sacrosanct, don't expect anything different here. Just watch both movies and compare the stories and characters. The story is basically the same as before – young Daniel LaRusso moves to Los Angeles, and soon finds himself beset by Johnny Lawrence and his gang of students from the Cobra Kai karate school, which bullies Daniel whenever it gets the chance ...

Synthesis II: The Karate Kid Trivia: When Daniel and his mom were drinking ice cream and talking about him not talking to them, Daniel turns to his mom and says "We're not just a family. We're a family of good people." After Daniel was sent away, on the front of his underwear (that was on fire), it said "Daniel don't quit on family." The script for this movie originally said that Daniel's mom was in the film but she was cut because the film makers thought she would not look like Elisabeth Shue and they thought the characters could have a less motherly character. Elisabeth Shue actually called the director John Avildsen, and insisted on being in the film. John Avildsen replied that he didn't know whether or not he wanted to add a less motherly character to the film. Shue replied "If you don't, I'm not playing in the movie." The director of the film was also impressed that she did this and he was going to hire her in any case. Revealing mistake: When Daniel is throwing the bucket at Johnny's head while on the bus, he is supposed to do it several times, yet when he hits Johnny the first time, you can see Johnny fall on the ground ...

Real document: Skip to main content. Are you craving some beach time? If you're escaping down south for spring break, we have some tunes for you. But even if you aren't lucky enough to sprawl in the sand or bask in golden sunshine right now, we'll bring the beach to you! Until your next break, enjoy listening to The Beach Boys and Jack Johnson. If you're stuck in the cold—or somewhere where the sun is indecisive (I'm looking at you, Midwest)—use this playlist to take you back to sunnier days and remind you that the winter is almost over; soon the sun will last!

Synthesis I: Beach Vibe. Any traveller will tell you that if you get on a plane and actually arrive somewhere where the weather is the same for weeks or months, you know you've made a mistake. I've experienced this time and again - it's one of the many things that you just can't fully appreciate or appreciate to the fullest until you feel it in your bones. My latest trip was to Vietnam last week. I had been planning this trip for a little over a year now and I really hoped it would be everything I had imagined. It was exactly what I'd wanted. The people, the culture, the food, the scenery - it was all perfectly delved into. I'd never been to an Asian country before and so the whole trip was just like discovering a new continent. It was one of the best places I've ever been ...

Synthesis II: Here's a playlist to warm your heart and get you ready to leave for the beach in a few weeks. So many classics from your youth, this playlist packs an emotional punch. It's kind of a lame playlist. I'm sure a better one exists somewhere on the internet. But on this most northern of days, I like to get a warm chill going and use it as a positive recharge before going outside. This warm chill is a hit with my dogs and they are the best cuddlers in the world, so that makes it perfect for me. This playlist will make your winter that much more bearable and then hopefully you can head to the beach! You know you want it! Here's the playlist: [credit provider="YouTube" url=”] Get our free mobile app

Figure B.6: Comparison of original text with synthesized text variations. On the first row, the real document provides factual information about the 1984 film's production and release. In contrast, the synthesized documents offer subjective commentary, opinions, and behind-the-scenes anecdotes about both the 1984 film and its 2010 remake. On the second row, the synthesized documents are continuations of the real document.

B.3 Additional pretraining results

B.3.1 Two epochs validation

For the 1T-scale oracle experiment, we use 482B tokens repeated twice as a proxy for training on 1T unique tokens. This design choice stems from the DCLM-baseline [Li et al., 2024b] dataset containing 80% duplicates, which hinders our evaluation. We validate this choice by scaling down to 400B, where we have sufficiently many unique tokens. As shown in Table B.8, 200B tokens repeated twice yield nearly identical performance to 400B unique tokens. This aligns with the observation from Muennighoff et al. [2023] that repetition up to 4 times yields nearly no performance degradation.

Table B.8: Performance comparison with 200B tokens repeated twice vs. 400B unique tokens for the 3B model. The two models yield similar performance.

Benchmark                    2x200B   1x400B
Perplexity on held-out data ↓
OpenWebText2                 4.55     4.54
LAMBADA                      4.49     4.46
Five-shot MMLU               3.19     3.17
QA accuracy ↑
ARC-Challenge (0-shot)       38.31    41.47
ARC-Easy (0-shot)            73.11    75.29
SciQ (0-shot)                93.80    93.30
Winogrande (0-shot)          64.96    63.93
TriviaQA (1-shot)            32.51    34.35
WebQS (1-shot)               18.75    13.58
Average QA accuracy          53.57    53.65

B.3.2 Model scaling

An alternative approach to using additional compute is to scale the model. Here we examine the benefits of fixing a training token budget but using a 6B-parameter model (Table B.9). We conduct a pretraining experiment at the 200B-scale, replacing the 3B-parameter model with a 6B-parameter model. Table B.10 shows that the 6B-parameter model consistently outperforms the baseline, indicating that it effectively uses the additional computational resources. Comparing SBP with the 6B-parameter model, we find that each performs better on different benchmarks.

Table B.9: 6B-parameter model setup.

Total Params.   3B      6B
ℓ_context       4096    4096
n_vocab         49152   49152
n_layers        26      32
d_model         3072    4096
d_ffn           8064    13056
n_heads         24      32
n_kv_heads      8       8
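As a sanity check on Table B.9, the nominal 3B and 6B parameter counts can be roughly recovered from the listed dimensions. The sketch below assumes a LLaMA-style decoder with grouped-query attention and a gated three-matrix FFN, and excludes norms, biases, and any untied output head; none of these architectural details are stated in the table, so treat the totals as rough estimates only.

```python
def approx_params(d_model, d_ffn, n_layers, n_heads, n_kv_heads, n_vocab):
    """Rough transformer parameter count under the assumptions above."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O + K, V projections
    ffn = 3 * d_model * d_ffn                            # gate, up, down matrices
    return n_layers * (attn + ffn) + n_vocab * d_model   # + embedding table

# Table B.9 configurations.
p3b = approx_params(3072, 8064, 26, 24, 8, 49152)   # ≈ 2.74e9
p6b = approx_params(4096, 13056, 32, 32, 8, 49152)  # ≈ 6.68e9
```

Both estimates land near the nominal "3B" and "6B" labels, suggesting the table's dimensions are internally consistent.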
This suggests that the benefits of SBP are orthogonal to those of a larger model, offering the potential to combine both approaches for even better performance.

Table B.10: 200B-scale experiments with model scaling. The first three columns are identical to Table 3.2. The last column shows the performance of training a 6B model under a 200B training token budget with 10B unique tokens.

Benchmark                    Baseline   SBP     Oracle   6B-model
Perplexity on held-out data ↓
OpenWebText2                 5.74       -0.53   -1.02    -0.36
LAMBADA                      6.87       -0.85   -1.86    -1.10
Five-shot MMLU               3.83       -0.36   -0.51    -0.13
QA accuracy ↑
ARC-Challenge (0-shot)       35.32      +1.28   +2.82    +3.42
ARC-Easy (0-shot)            68.94      +2.65   +4.29    +0.67
SciQ (0-shot)                90.50      +1.00   +2.40    +0.80
Winogrande (0-shot)          60.14      +1.90   +5.53    +2.92
TriviaQA (1-shot)            22.51      +3.36   +7.37    +3.11
WebQS (1-shot)               8.56       +3.74   +10.83   +5.22
Average QA accuracy          47.66      +2.32   +5.54    +2.69

B.4 Supplementary materials for sample-efficient reasoning

B.4.1 Initial collection of 59K samples

We collect an initial 59,029 questions from 16 sources following three guiding principles. Quality: datasets should be high-quality; we always inspect samples and ignore datasets with, e.g., poor formatting. Difficulty: datasets should be challenging and require significant reasoning effort. Diversity: datasets should stem from various fields to cover different reasoning tasks. We collect datasets of two categories.

Curation of existing datasets. Our largest source is NuminaMATH [LI et al., 2024] with 30,660 mathematical problems from online websites. We also include historical AIME problems (1983-2021). To enhance diversity, we add OlympicArena [Huang et al., 2024] with 4,250 questions spanning Astronomy, Biology, Chemistry, Computer Science, Geography, Mathematics, and Physics from various Olympiads. OmniMath [Gao et al., 2024a] adds 4,238 competition-level mathematics problems.
We also include 2,385 problems from AGIEval [Zhong et al., 2023a], which features questions from standardized tests like the SAT and LSAT, covering English, Law, and Logic. We refer to Table B.13 for our other sources.

New datasets in quantitative reasoning. To complement these existing datasets, we create two original datasets. s1-prob consists of 182 questions from the probability section of Stanford University's Statistics Department's PhD Qualifying Exams (https://statistics.stanford.edu), accompanied by handwritten solutions that cover difficult proofs. The probability qualifying exam is held yearly and requires professional-level mathematical problem-solving. s1-teasers comprises 23 challenging brain-teasers commonly used in interview questions for quantitative trading positions. Each sample consists of a problem and solution taken from PuzzledQuant (https://w.puzzledquant.com/). We only take examples with the highest difficulty level ("Hard").

For each question, we generate a reasoning trace and solution using the Google Gemini Flash Thinking API [Google, 2024], extracting its reasoning trace and response. This yields 59K triplets of a question, generated reasoning trace, and generated solution. Examples from our dataset are in the appendix. We decontaminate all samples against our evaluation questions (MATH500, GPQA Diamond, AIME24) using 8-grams and deduplicate the data.

B.4.2 Final selection of 1K samples

We could directly train on our pool of 59K questions. However, our goal is to find the simplest approach with minimal resources. We therefore apply three stages of filtering to arrive at a minimal set of 1,000 samples, guided by our three data principles: Quality, Difficulty, and Diversity.

Quality. We first remove any questions where we encountered API errors, reducing our dataset to 54,116 samples.
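The 8-gram decontamination step can be sketched as follows. This is a minimal sketch: whitespace tokenization and the toy questions are assumptions, since the exact tokenizer used for n-gram matching is not specified here.

```python
def ngrams(text: str, n: int = 8):
    """Set of word-level n-grams from a question (whitespace tokenization assumed)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_questions, eval_questions, n: int = 8):
    """Drop any training question sharing an n-gram with an eval question."""
    eval_grams = set()
    for q in eval_questions:
        eval_grams |= ngrams(q, n)
    return [q for q in train_questions if not (ngrams(q, n) & eval_grams)]

# Toy illustration (hypothetical questions, not from the benchmarks).
eval_qs = ["what is the sum of the first ten positive even integers"]
train_qs = [
    "what is the sum of the first ten positive even integers squared",  # overlaps
    "find the area of a triangle with legs three and four",             # clean
]
kept = decontaminate(train_qs, eval_qs)
```

The first training question shares 8-grams with the eval question and is dropped; the second survives.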
Next, we filter out low-quality examples by checking if they contain any string patterns with formatting issues, such as ASCII art diagrams, non-existent image references, or inconsistent question numbering, reducing our dataset to 51,581 examples. From this pool, we identify 384 samples for our final 1,000 samples from datasets that we perceive as high-quality and not in need of further filtering (see below for details).

Difficulty. For difficulty, we use two indicators: model performance and reasoning trace length. We evaluate two models on each question, Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct [Qwen et al., 2024], with correctness assessed by Claude 3.5 Sonnet comparing each attempt against the reference solution (see the grading protocol below). We measure the token length of each reasoning trace with the Qwen2.5 tokenizer as an indicator of problem difficulty. This relies on the assumption that more difficult problems require more thinking tokens. Based on the grading, we remove questions that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can solve correctly, as these may be too easy. Using two models reduces the likelihood of an easy sample slipping through our filtering because of a rare mistake by one model on an easy question. This brings our total samples down to 24,496, setting the stage for the next round of subsampling based on diversity. While filtering with these two models may be optimized for our setup (we also use Qwen2.5-32B-Instruct as our model to finetune), the idea of model-based filtering generalizes to other setups.

Diversity. To quantify diversity, we classify questions into domains using Claude 3.5 Sonnet based on the Mathematics Subject Classification (MSC) system (e.g., geometry, combinatorics, etc.) from the American Mathematical Society.¹ The taxonomy focuses on topics in mathematics but also includes other sciences such as biology, physics, and economics.
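The model-based difficulty filter reduces to a simple predicate: keep a question only when neither model's graded attempt is correct. A minimal sketch, with hypothetical correctness labels standing in for the Claude-graded Qwen2.5-7B/32B attempts:

```python
def difficulty_filter(questions, correct_7b, correct_32b):
    """Keep questions that both models answered incorrectly (i.e., hard ones)."""
    return [
        q for q in questions
        if not correct_7b[q] and not correct_32b[q]
    ]

# Hypothetical grading outcomes for three questions.
questions = ["q1", "q2", "q3"]
correct_7b = {"q1": True, "q2": False, "q3": False}
correct_32b = {"q1": False, "q2": True, "q3": False}
hard = difficulty_filter(questions, correct_7b, correct_32b)
```

Only q3, which both models miss, survives; a single lucky or unlucky grade from one model is not enough to change a question's fate, which is the point of using two models.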
To select our final examples from the pool of 24,496 questions, we first choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces (see below for details), as motivated in Difficulty. We repeat this process until we have 1,000 total samples spanning 50 domains. In §B.4.3, we show that using our three criteria in combination is important, as relying on quality, diversity, or difficulty in isolation leads to worse datasets. Some distilled generations are incorrect, which we allow in our data because we focus on capturing the reasoning process rather than entirely correct solutions. Our grader deems 53.6% correct in s1K and 63.0% in our follow-up s1K-1.1.

B.4.3 Data ablations

In §B.4.2 we outlined our three guiding principles in curating s1K: Quality, Difficulty, and Diversity. We next test the importance of combining them and the overall effectiveness of our selection.

¹ https://mathscinet.ams.org/mathscinet/msc/msc2020.html

Table B.11: s1K data ablations. We report 95% paired bootstrap confidence intervals for differences relative to the s1K model using 10,000 bootstrap samples. E.g., the interval [-13%, 20%] means that, with 95% confidence, the true difference between 59K-full and s1K is between -13% and +20%. If the entire interval is negative, e.g. [-27%, -3%], we can confidently say that the performance is worse than s1K.
Model       AIME 2024              MATH 500              GPQA Diamond
1K-random   36.7 [-26.7%, -3.3%]   90.6 [-4.8%, 0.0%]    52.0 [-12.6%, 2.5%]
1K-diverse  26.7 [-40.0%, -10.0%]  91.2 [-4.0%, 0.2%]    54.6 [-10.1%, 5.1%]
1K-longest  33.3 [-36.7%, 0.0%]    90.4 [-5.0%, -0.2%]   59.6 [-5.1%, 10.1%]
59K-full    53.3 [-13.3%, 20.0%]   92.8 [-2.6%, 2.2%]    58.1 [-6.6%, 8.6%]
s1K         50.0                   93.0                  57.6

Only Quality (1K-random): After obtaining our high-quality reasoning chains from Gemini, we select 1,000 samples at random, not relying on our difficulty and diversity filtering at all. Table B.11 shows this approach performs much worse than s1K across all benchmarks.

Only Diversity (1K-diverse): For this dataset, we sample uniformly across domains to maximize diversity, disregarding any notion of difficulty. This approach also leads to poor performance similar to 1K-random.

Only Difficulty (1K-longest): Here we rely on one of our difficulty indicators introduced in §B.4.2 by selecting the 1,000 samples with the longest reasoning traces. This approach significantly boosts GPQA performance but overall still falls short of using s1K.

Maximize Quantity: Finally, we compare with training on all of our 59K samples, a superset of all the 1K-sample versions. This leads to a strong model but uses much more resources. To finetune on 59K samples, we use 394 H100 GPU hours, while s1-32B only required 7 H100 GPU hours. Moreover, relying only on s1K is extremely competitive, as shown in §B.4.2.

Altogether, combining all three criteria (Quality, Difficulty, Diversity) via our methodology in §B.4.2 is key for sample-efficient reasoning training.

B.4.4 Dataset composition

Dataset composition for the full 59K questions is given in Table B.13.

Table B.12: Summary of our dataset s1K. Token count measured by the Qwen-2.5 tokenizer. We prompt Claude to produce keywords given several questions from the domain.
Domain                     #Questions   Total token count   Keywords
Geometry                   109          560.2K              Area, Triangle, Distance
Number theory              98           522.5K              Sequences, Divisibility
Combinatorics              75           384.7K              Permutations, Counting
Real functions             43           234.8K              Trigonometry, Calculus
Biology                    41           120.9K              Organic reactions
Complex functions          32           170.2K              Complex roots
Quantum theory             32           127.9K              Particles, Wave functions
Field theory               28           150.1K              Polynomials, Roots
Calculus of variations     28           155.5K              Optimization, Control
Difference equations       24           132.5K              Recurrence, Recursion
Electromagnetic theory     23           95.8K               Optics, Waves, Diffraction
Group theory               22           100.0K              Groups, Automorphisms
Linear algebra             22           128.3K              Matrices, Determinants
Probability theory         20           114.6K              Random walk, Expectation
Algebraic systems          19           109.9K              Functional equations
Mechanics                  19           103.6K              Forces, Motion, Energy
Thermodynamics             19           74.2K               Heat engines, Entropy
Differential equations     18           89.6K               Substitution, Existence
Computer science           18           34.2K               Complexity theory, Algorithms
Numerical analysis         18           76.5K               Error analysis, Stability
Calculus                   17           96.3K               Convergence, Summation
Algebraic structures       17           90.4K               Inequalities, Sets
Astronomy                  16           37.7K               Stellar populations, Orbits
Remaining 27 domains       242          982.2K              Domains with ≤ 16 questions
All domains (51)           1000         4.7M                s1K

Table B.13: Composition of full 59K questions. Thinking and response lengths are measured in tokens using the Qwen2.5-32B-Instruct tokenizer [Qwen et al., 2024]. In addition to excluding our evaluation benchmark, AIME24, we also exclude AIME questions from 2022–2023 because we use these 90 questions during our development stage of s1-32B.

Source   Description   #Samples   Avg. thinking length
NuminaMATH [LI et al., 2024]         Math problems from online websites                                                              30660   4.1K
MATH [Hendrycks et al., 2021b]       Math problems from competitions                                                                 11999   2.9K
OlympicArena [Huang et al., 2024]    Astronomy, Biology, Chemistry, Computer Science, Geography, Math, and Physics olympiad questions   4250   3.2K
OmniMath [Gao et al., 2024a]         Math problems from competitions                                                                 4238    4.4K
AGIEval [Zhong et al., 2023a, Ling et al., 2017, Hendrycks et al., 2021b, Liu et al., 2020a, Zhong et al., 2019, Wang et al., 2021]   English, Law, Logic and Math problems from the SAT, LSAT and other exams   2385   1.2K
xword                                Crossword puzzles                                                                               999     0.7K
OlympiadBench [He et al., 2024]      Math and Physics olympiad questions                                                             896     3.9K
AIME (1983-2021)                     American Invitational Mathematics Examination                                                   890     4.7K
TheoremQA [Chen et al., 2023a]       Computer Science, Finance, Math, and Physics university-level questions relating to theorems    747     2.1K
USACO [Shi et al., 2024a]            Code problems from the USA Computing Olympiad                                                   519     3.6K
JEEBench [Arora et al., 2023]        Chemistry, Math, and Physics problems used in the university entrance examination of the Indian Institute of Technology   515   2.9K
GPQA [Rein et al., 2023]             PhD-Level Science Questions                                                                     348     2.9K
SciEval [Sun et al., 2024a]          Biology, Chemistry, and Physics problems from various sources                                   227     0.7K
s1-prob                              Stanford statistics qualifying exams                                                            182     4.0K
LiveCodeBench [Jain et al., 2024]    Code problems from coding websites (LeetCode, AtCoder, and CodeForces)                          151     3.5K
s1-teasers                           Math brain-teasers crawled from the Internet                                                    23      4.1K
All 59K questions                    Composite of the above datasets with reasoning traces and solutions                             59029   3.6K

s1K grading prompt

To grade whether an example is correct for our dataset selection in §B.4.2, we use the following prompt. We use Claude 3.5 for grading, except for the final 1,000 samples, which we grade with Claude 3.7.

You are an AI assistant for grading a science problem.
The user will provide you with the question itself, an attempt made by a student, and the correct answer to the problem. Your job is to judge whether the attempt is correct by comparing it with the correct answer. If the expected solution concludes with a number or choice, there should be no ambiguity. If the expected solution involves going through the entire reasoning process, you should judge the attempt based on whether its reasoning process is correct, consulting the correct answer where helpful. The user will provide the attempt and the correct answer in the following format:

# Problem
{problem}

## Attempt
{attempt}

## Correct answer
{solution}

Explain your reasoning, and end your response on a new line with only "Yes" or "No" (without quotes).

s1K diversity selection
Algorithm 2 details our diversity-selection procedure. We also include samples from specific benchmarks we consider high-quality (§B.4.2). None of the selected samples overlap with our final evaluation.

Algorithm 2 Two-stage sampling for s1K
1: Input: Q := set of 24,496 questions with features
2: Output: S := set of 1,000 selected questions
3: S ← ∅    ▷ Initialize the output set (only tracks unique elements)
4: for q ∈ Q do
5:   if IsGeminiCorrect(q) and (IsAIME(q) or IsGPQA(q)) then
6:     S ← S ∪ {q}    ▷ Select all correct AIME/GPQA solutions
7:   else if IsGeminiCorrect(q) and IsMATH(q) and ThinkingLength(q) > 5600 then
8:     S ← S ∪ {q}    ▷ Select correct MATH500 solutions with long chains
9:   end if
10: end for
11: D ← all available domains    ▷ Initialize domain pool
12: while |S| < 1000 do
13:   d ← RandomChoice(D)    ▷ Randomly select a domain
14:   Q_d ← questions in domain d
15:   ranks ← RankByThinkingLength(Q_d)    ▷ Rank by thinking length
16:   weights ← 2^(−ranks)    ▷ Apply power-law weighting
17:   q ← WeightedSample(Q_d, weights)    ▷ Sample favoring longer chains
18:   S ← S ∪ {q}    ▷ Add selected question
19:   Q_d ← Q_d \ {q}
20:   if Q_d = ∅ then
21:     D ← D \ {d}    ▷ Remove exhausted domains
22:   end if
23: end while

Decontamination
We filter samples by checking for 8-gram overlap between selected examples and our evaluation benchmarks (MATH500, GPQA Diamond, and AIME24). We exclude any question with more than an 8-gram overlap.

B.4.5 Training details
We finetune Qwen2.5-32B-Instruct [Qwen et al., 2024] for reasoning. On math tasks, this model generally matches or outperforms the larger Qwen2.5-72B-Instruct [Qwen et al., 2024] and other open models [Dubey et al., 2024b, Groeneveld et al., 2024, Muennighoff et al., 2024]. We use token delimiters to separate the thinking stage from the answering stage, enclosing the thinking stage with <|im_start|>think and <|im_start|>answer, both preceded and followed by a newline. We train for 5 epochs with a batch size of 16 (315 gradient steps total) in bfloat16 precision. We use a learning rate of 1e-5 with linear warmup for 5% of training (16 steps), then cosine decay to 0 over the remaining 299 steps. We use AdamW [Loshchilov and Hutter, 2019] with β1 = 0.9, β2 = 0.95, and weight decay of 1e-4. We compute the loss only on reasoning traces and solutions, not on questions. We set the sequence length large enough to avoid truncating any samples. Training takes 26 minutes on 16 NVIDIA H100 GPUs. For ablations, we use identical hyperparameters except for the 59K model (§B.4.3), where we use a batch size of 120 to process more data.
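The diversity stage of Algorithm 2 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the dissertation's code: questions are dicts with hypothetical `domain` and `thinking_length` fields, and the first (benchmark-specific) stage is omitted.

```python
import random
from collections import defaultdict

def diversity_sample(questions, target=1000, seed=0):
    """Sketch of Algorithm 2's second stage: repeatedly pick a random domain,
    then sample one question from it with weight 2^(-rank), where rank orders
    that domain's questions by decreasing thinking length."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for q in questions:
        pools[q["domain"]].append(q)  # group questions by domain
    selected, domains = [], list(pools)
    while len(selected) < target and domains:
        d = rng.choice(domains)  # uniform over remaining domains
        pool = sorted(pools[d], key=lambda q: q["thinking_length"], reverse=True)
        weights = [2.0 ** -rank for rank in range(len(pool))]  # power-law favoring long chains
        q = rng.choices(pool, weights=weights, k=1)[0]
        selected.append(q)
        pools[d].remove(q)
        if not pools[d]:
            domains.remove(d)  # drop exhausted domains
    return selected
```

The power-law weights concentrate probability on the longest reasoning chains within a domain while the uniform domain choice keeps the final set topically diverse.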
Evaluation
We select three representative reasoning benchmarks widely used in the field. AIME24 consists of 30 problems from the 2024 American Invitational Mathematics Examination (AIME), held from January 31 to February 1, 2024. AIME tests mathematical problem-solving with arithmetic, algebra, counting, geometry, number theory, probability, and other secondary-school math topics. All AIME answers are integers ranging from 000 to 999, inclusive. MATH500 [Hendrycks et al., 2021b] is a benchmark of competition math problems of varying difficulty. We evaluate on the same 500 samples selected by OpenAI in prior work [Lightman et al., 2023]. GPQA Diamond [Rein et al., 2023] consists of 198 PhD-level science questions from Biology, Chemistry, and Physics. Experts with PhDs in the corresponding domains achieve only 69.7% on GPQA Diamond [OpenAI, 2024]. We build on the "lm-evaluation-harness" framework [Gao et al., 2021, Biderman et al., 2024].

Appendix C
Supplementary materials for Chapter 4

C.1 Appendix

C.1.1 Additional idea examples
We provide additional example ideas generated by Claude-4.5-Opus (Table C.1) and Claude-4.5-Sonnet (Table C.2) on the GRPO environment, including ideas with failed code execution. Code execution errors tend to arise when an idea involves complex changes or requires external packages not supported in our execution environment. Improving the execution agent to correctly implement more complex ideas (e.g., training auxiliary models or system-level optimizations) is an important direction for future work.

Successful execution:
[Experiment] Sequence Position Weighted Trust Region: Apply tighter sigmoid bounds to earlier tokens in the sequence (where errors compound) and looser bounds to later tokens. Weight: position_weight = 1 - 0.3 * (position / seq_len), effective_deviation = 0.25 + 0.2 * position_weight. This accounts for the sequential nature of autoregressive generation.
[Code Changes] Modify grpo.py: Initialize current_cliprange = 0.2, ema_clip_fraction = 0.15. Standard momentum clip updates. Modify compute_grpo_clip_loss in grpo_utils.py: After computing ratio on line 91 (shape: batch_size x seq_len): batch_size, seq_len = ratio.shape; positions = torch.arange(seq_len, device=ratio.device).float().unsqueeze(0).expand(batch_size, -1); position_weight = 1.0 - 0.3 * (positions / (seq_len - 1 + 1e-6)); effective_deviation = 0.25 + 0.2 * position_weight. Apply the position-aware sigmoid: centered_ratio = ratio - 1.0; bounded_ratio = 1.0 + (2.0 * torch.sigmoid(centered_ratio) - 1.0) * effective_deviation. Use: surr1 = bounded_ratio * advantages; surr2 = torch.clamp(bounded_ratio, 1 - cliprange, 1 + cliprange) * advantages; loss = -torch.min(surr1, surr2). Add metadata: metadata["mean_effective_deviation"] = effective_deviation.mean().item(); metadata["early_deviation"] = effective_deviation[:, :seq_len//4].mean().item(); metadata["late_deviation"] = effective_deviation[:, -seq_len//4:].mean().item(). Validation accuracy: 59.8

Failed execution:
[Experiment] Hierarchical Position-Group Trust Region: Apply the trust region at two hierarchical levels: group level (shared within each response group) and position level (varying along the sequence). Groups with high internal reward variance get tighter group-level bounds. Within groups, positions follow the proven decay pattern. This captures both cross-sample and within-sample structure. Formula: group_dev = 0.4 - 0.15 * tanh(group_reward_var / 0.3), position_factor = 1 - 0.2 * rel_pos, effective_dev = group_dev * position_factor.
[Code Changes] Modify grpo.py: Initialize current_cliprange = 0.2, ema_clip_fraction = 0.15. Standard momentum clip updates. Pass group_size to the function. Modify compute_grpo_clip_loss in grpo_utils.py: Add parameter group_size=8. After computing ratio: batch_size, seq_len = ratio.shape; n_groups = batch_size // group_size.
Compute group reward variance from advantages as a proxy: adv_grouped = advantages.view(n_groups, group_size, -1); group_adv_var = adv_grouped.var(dim=1, keepdim=True); group_adv_var_expanded = group_adv_var.expand(-1, group_size, -1).reshape(advantages.shape). Group-level deviation: group_deviation = 0.4 - 0.15 * torch.tanh(group_adv_var_expanded / 0.3). Position factor: positions = torch.arange(seq_len, device=ratio.device).float().unsqueeze(0).expand(batch_size, -1); rel_pos = positions / (seq_len - 1 + 1e-6); position_factor = 1.0 - 0.2 * rel_pos. Hierarchical deviation: effective_deviation = group_deviation * position_factor; effective_deviation = torch.clamp(effective_deviation, 0.15, 0.45). Apply: centered_ratio = ratio - 1.0; bounded_ratio = 1.0 + (2.0 * torch.sigmoid(centered_ratio) - 1.0) * effective_deviation. Use: surr1 = bounded_ratio * advantages; surr2 = torch.clamp(bounded_ratio, 1 - cliprange, 1 + cliprange) * advantages; loss = -torch.min(surr1, surr2). Add metadata["mean_group_var"] = group_adv_var.mean().item(); metadata["mean_effective_deviation"] = effective_deviation.mean().item(). Log to wandb.

Table C.1: Additional examples on the GRPO environment. Ideas are generated by Claude-4.5-Opus during evolutionary search.

Successful execution:
[Experiment] Create mathematical step-complexity-aware reward shaping where responses with more mathematical reasoning steps receive slightly higher base rewards (1.05x for 3+ steps, 1.1x for 5+ steps) when correct, encouraging thorough mathematical exposition without changing the core binary reward structure.
[Code Changes] Modify r1_zero_reward_fn_train in drgrpo_grader.py to count reasoning steps by detecting mathematical transitions ("therefore", "thus", "so", "=", "=>"). When the answer is correct, apply a step-based multiplier: step_multiplier = 1.0 + 0.05 * min(2, max(0, num_steps - 2)), yielding multipliers [1.0, 1.05, 1.1].
Return {"format_reward": 1.0, "answer_reward": answer_reward, "reward": base_reward * step_multiplier}. Set --learning_rate 3e-5 and --loss_type reinforce_with_baseline. Validation accuracy: 65.6

Successful execution:
[Experiment] Add experience replay by maintaining a buffer of the top 20% highest-reward rollouts from previous epochs and mixing them (25% replay, 75% new) with current rollouts during training, combined with a 3e-5 learning rate and reinforce_with_baseline for improved sample efficiency.
[Code Changes] Modify train_loop in grpo.py to maintain a replay_buffer storing high-reward (>0.8) rollouts from previous epochs. Each epoch, sample 25% of the training data from the replay buffer and 75% from new rollouts. Update the buffer by adding the top 20% of the current epoch's rollouts and removing the oldest entries when the buffer exceeds 1000 samples. Set --learning_rate 3e-5 and --loss_type reinforce_with_baseline. Validation accuracy: 39.4

Successful execution:
[Experiment] Implement response diversity rewards within groups where responses to the same prompt receive bonus rewards (0.05–0.15) for being dissimilar to other responses in their group, encouraging exploration of different solution paths while maintaining the proven group_size=8 and 3e-5 learning rate combination.
[Code Changes] Modify compute_group_normalized_rewards in grpo_utils.py to compute pairwise similarity between responses in each group using token-level Jaccard similarity. Add a diversity bonus: diversity_reward = 0.15 * (1 - max_similarity_in_group) to each response's reward before advantage computation. Reshape responses into groups, compute similarities, and add bonuses before advantage normalization. Set --learning_rate 3e-5, --loss_type reinforce_with_baseline, --group_size 8.
Validation accuracy: 19.2

Failed execution:
[Experiment] Implement temporal-difference advantage estimation where advantages incorporate not just current rewards but also predicted future rewards using a learned value function, combined with the proven 3e-5 learning rate and reinforce_with_baseline for more accurate credit assignment.
[Code Changes] Add a value head to the policy model in grpo.py that predicts expected future rewards. Compute TD advantages as advantages = rewards + gamma * next_values - current_values with gamma=0.99. Train the value function with an MSE loss on observed returns. Modify compute_group_normalized_rewards to use TD advantages instead of basic reward differences. Set --learning_rate 3e-5 and --loss_type reinforce_with_baseline.

Failed execution:
[Experiment] Ensemble Decision Training with Voting Consensus: Train the model using ensemble-style decision making where each rollout generates multiple candidate responses, and the final training signal is based on majority voting among responses. This encourages the model to develop more robust and consistent reasoning patterns while maintaining diversity in solution approaches.
[Code Changes] Modify sample_rollout in sample.py to generate 3 responses per prompt instead of 1, using different random seeds. Implement voting consensus in r1_zero_reward_fn_train: if 2+ responses are correct, apply a +0.15 consensus bonus; if responses disagree, apply a -0.05 uncertainty penalty. In train_loop in grpo.py, select the highest-voted response for training while using consensus information to adjust the learning rate: consensus_lr = 3e-5 * (0.9 + 0.2 * consensus_rate). Set group_size=6, --loss_type reinforce_with_baseline.
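To make the token-level Jaccard diversity bonus from the response-diversity idea above concrete, here is a minimal sketch. The function names and the whitespace tokenization are assumptions for illustration, not the generated implementation:

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two whitespace-tokenized responses."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def add_diversity_bonus(group_responses, group_rewards, scale=0.15):
    """Add scale * (1 - max pairwise similarity) to each response's reward,
    so responses that differ most from the rest of their group get the
    largest bonus (up to `scale`, here 0.15 as in the idea)."""
    out = []
    for i, (resp, reward) in enumerate(zip(group_responses, group_rewards)):
        sims = [jaccard(resp, other) for j, other in enumerate(group_responses) if j != i]
        max_sim = max(sims) if sims else 0.0
        out.append(reward + scale * (1.0 - max_sim))
    return out
```

Duplicated responses receive no bonus (max similarity 1.0), while a response sharing no tokens with its group receives the full 0.15; the bonuses would be added before group-normalized advantage computation.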
Failed execution:
[Experiment] Implement hierarchical advantage estimation where advantages are computed at both token level and sequence level, with token-level advantages weighted by their position importance (higher weights for mathematical expressions and final answers), combined with the successful 3e-5 learning rate and reinforce_with_baseline.
[Code Changes] Modify grpo_microbatch_train_step in grpo_utils.py to create position-importance weights that assign 2.0x weight to tokens containing mathematical symbols ( , +, -, *, =) and 1.5x weight to answer sections. Compute both sequence-level advantages (current) and token-level advantages, then combine as final_advantages = 0.6 * sequence_advantages + 0.4 * token_advantages. Set --learning_rate 3e-5 and --loss_type reinforce_with_baseline.

Table C.2: Additional examples on the GRPO environment. Ideas are generated by Claude-4.5-Sonnet during evolutionary search.

We next present the top-performing ideas from Claude-4.5-Opus, Claude-4.5-Sonnet, and GPT-5 on the nanoGPT environment.

Claude-4.5-Opus Idea on nanoGPT (Validation Loss: 3.1407)
[Experiment] Wider SwiGLU (5x) with MLP Output Scaling (Init 0.97), Skip Connections Every 4 and 8 Layers with Learnable Weights (Init 0.52 and 0.31), Separate Attention/MLP Scales (Init 0.98), Higher LR (0.00168), Reduced Weight Decay (0.065), Warmup 173 iters, Lower Min LR (0.03x), Cosine Annealing, EMA, Untied Embeddings, and Beta2=0.99. Make the dual skip-connection weights learnable parameters initialized at proven good values. This allows the model to adapt skip weights during training while combining with separate attention/MLP residual scales.
[Code Changes]
• Change warmup_iters = 256 to warmup_iters = 173 in the Hyperparameters class
• Change weight_decay = 0.1 to weight_decay = 0.065 in the Hyperparameters class
• Change learning_rate = 0.0015 to learning_rate = 0.00168 in the Hyperparameters class
• In GPT.__init__, add after the transformer dict:
self.skip_weight_4 = nn.Parameter(torch.tensor(0.52))
self.skip_weight_8 = nn.Parameter(torch.tensor(0.31))
• In Block.__init__, add: self.attn_scale = nn.Parameter(torch.tensor(0.98)) and self.mlp_scale = nn.Parameter(torch.tensor(0.98))
• In Block.forward, change to:
def forward(self, x):
    x = x + self.attn_scale * self.attn(rmsnorm(x))
    x = x + self.mlp_scale * self.mlp(rmsnorm(x))
    return x
• In Block.forward_with_cache, change to:
def forward_with_cache(self, x, cache):
    attn_out, new_cache = self.attn.forward_with_cache(rmsnorm(x), cache=cache)
    x = x + self.attn_scale * attn_out
    x = x + self.mlp_scale * self.mlp(rmsnorm(x))
    return x, new_cache
• In MLP.__init__, replace lines 81–82 with:
self.c_fc = nn.Linear(config.n_embd, 5 * config.n_embd, bias=False)
self.c_gate = nn.Linear(config.n_embd, 5 * config.n_embd, bias=False)
self.c_proj = nn.Linear(5 * config.n_embd, config.n_embd, bias=False)
self.output_scale = nn.Parameter(torch.tensor(0.97))
• In MLP.forward, replace with:
def forward(self, x):
    gate = F.silu(self.c_gate(x))
    x = self.c_fc(x) * gate
    x = self.c_proj(x) * self.output_scale
    return x
• In GPT.__init__, remove line 132: self.transformer.wte.weight = self.lm_head.weight
• Remove line 131: self.lm_head.LLMC_SKIP_INIT = 1
• Modify _init_weights to add:
if isinstance(module, nn.Linear):
    torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
• Change the optimizer betas on line 402 to betas=(0.9, 0.99)
• Modify the get_lr function:
def get_lr(it):
    assert it <= args.num_iterations
    if it < args.warmup_iters:
        return args.learning_rate * (it + 1) / args.warmup_iters
    min_lr = 0.03 * args.learning_rate
    decay_ratio = (it - args.warmup_iters) / (args.num_iterations - args.warmup_iters)
    return min_lr + 0.5 * (args.learning_rate - min_lr) * (1.0 + math.cos(math.pi * decay_ratio))
• In GPT.forward, replace the block loop with:
layer_outputs = []
for i, block in enumerate(self.transformer.h):
    if i >= 4 and i % 4 == 0:
        x = x + self.skip_weight_4 * layer_outputs[i - 4]
    if i >= 8 and i % 8 == 0:
        x = x + self.skip_weight_8 * layer_outputs[i - 8]
    x = block(x)
    layer_outputs.append(x)
• In GPT.forward_with_cache, replace the block loop with:
layer_outputs = []
for i, block in enumerate(self.transformer.h):
    if i >= 4 and i % 4 == 0:
        x = x + self.skip_weight_4 * layer_outputs[i - 4]
    if i >= 8 and i % 8 == 0:
        x = x + self.skip_weight_8 * layer_outputs[i - 8]
    x, new_cache = block.forward_with_cache(x, cache=caches[i])
    new_caches.append(new_cache)
    layer_outputs.append(x)
• After model initialization, add: ema_model = {k: v.clone() for k, v in raw_model.state_dict().items()} and ema_decay = 0.999
• After optimizer.step(), add:
for k, v in raw_model.state_dict().items():
    ema_model[k].mul_(ema_decay).add_(v, alpha=1 - ema_decay)
• Before validation, add: orig_state = {k: v.clone() for k, v in raw_model.state_dict().items()}; raw_model.load_state_dict(ema_model)
• After validation, add: raw_model.load_state_dict(orig_state)

Claude-4.5-Sonnet Idea on nanoGPT (Validation Loss: 3.2081)
[Experiment] Two-phase weight decay (0.1170→0.0210 at 59.65%) + 30.45% plateau + LR 0.001550 + warmup 197 + two-phase grad clip (1.054→0.916 at 59.65%) + quadratic min_lr interpolation (0.0113x at 59.65%, 0.0075x at end via quadratic) + progressive EMA (0.999→0.9992 linear over training) + exponential warmup + cosine LR + beta2=0.99. Use smooth quadratic interpolation for min_lr during the low-WD phase AND progressive EMA that gradually increases from 0.999 to 0.9992 linearly throughout training.
Early training benefits from faster EMA tracking, while later training gets heavier smoothing. Use conservative settings: WD 0.1170/0.0210, extended plateau 30.45%, moderate LR 0.001550, longest warmup 197, tight grad clip 1.054→0.916.
[Code Changes] Modify line 326 to change warmup_iters = 256 to warmup_iters = 197. Modify line 325 to change learning_rate = 0.0015 to learning_rate = 0.001550. Modify line 402 to change betas=(0.9, 0.95) to betas=(0.9, 0.99). Modify the get_lr function: replace lines 408–414 with:
if it < args.warmup_iters:
    progress = (it + 1) / args.warmup_iters
    return args.learning_rate * (1.0 - math.exp(-5.0 * progress))
plateau_end = int(0.3045 * args.num_iterations)
if it < plateau_end:
    return args.learning_rate
overall_progress = it / args.num_iterations
decay_ratio = (it - plateau_end) / (args.num_iterations - plateau_end)
coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
if overall_progress <= 0.5965:
    min_lr_factor = 0.0113
else:
    phase2_progress = (overall_progress - 0.5965) / (1.0 - 0.5965)
    min_lr_factor = 0.0113 - (0.0113 - 0.0075) * (phase2_progress ** 2)
min_lr = min_lr_factor * args.learning_rate
return min_lr + coeff * (args.learning_rate - min_lr)
Modify line 527 to:
progress = step / args.num_iterations
current_clip = 0.916 if progress > 0.5965 else 1.054
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), current_clip)
After line 529, add:
progress = step / args.num_iterations
current_wd = 0.0210 if progress > 0.5965 else 0.1170
for param_group in optimizer.param_groups:
    param_group['weight_decay'] = current_wd
After line 387, add:
ema_model = {name: param.clone().detach() for name, param in raw_model.named_parameters()}
After line 533, add:
if step > 0:
    progress = step / args.num_iterations
    ema_decay = 0.999 + 0.0002 * progress
    for name, param in raw_model.named_parameters():
        ema_model[name].mul_(ema_decay).add_(param.data, alpha=1 - ema_decay)
Before line 483, add:
original_params = {name: param.data.clone() for name, param in raw_model.named_parameters()}
for name, param in raw_model.named_parameters():
    param.data.copy_(ema_model[name])
After line 509, add:
for name, param in raw_model.named_parameters():
    param.data.copy_(original_params[name])

GPT-5 Idea on nanoGPT (Validation Loss: 3.1697)
[Experiment] SwiGLU-3.5x + Residual Alphas + Min-Floor Cosine + Per-step Beta2 Linear Decay + 3-Group AdamW + Debiased EMA
[Code Changes]
• Hyperparameters: hidden_factor=3.5, warmup_iters=256, lr_peak_factor=1.10, min_lr_factor=0.02, beta2_start=0.99, beta2_end=0.95, wd_decay=0.1, wd_embed=0.01, ema_decay=0.9995, ema_warmup_steps=256.
• MLP: SwiGLU; Block alphas init 0.9.
• Optimizer: 3-group AdamW.
• LR: warmup to peak; cosine to floor as before.
• Per-step beta2 update: after setting the lr each step, set beta2 = beta2_start + (beta2_end - beta2_start) * min(1.0, (it + 1) / args.num_iterations); update betas in all param_groups.
• EMA: maintain ema_params with debiasing at eval (divide by 1 - ema_decay**ema_step), then restore.

The best-performing ideas on nanoGPT combine extensive hyperparameter tuning with architecture modifications. We also highlight several "atomic" algorithmic ideas that execute successfully.
Examples from Claude-4.5-Opus
• Head-Wise Attention Output Scaling: Add learnable per-head scaling factors to attention, allowing different heads to contribute with different magnitudes to the output. Validation loss: 3.2386
• Learned Residual Connection Weights: Add learnable scalar weights for each residual connection, initialized to 1.0, allowing the model to learn optimal residual scaling during training. Validation loss: 3.2517
• Mixture of Embeddings with Position: Learn to mix token embeddings and position embeddings with a content-dependent weight, allowing the model to dynamically balance positional vs. semantic information per token. Validation loss: 3.2497
• Shared Input-Output Embedding with Learned Asymmetry: Keep weight tying but add a small learned transformation on the output side, providing the benefits of weight tying while allowing output-specific adaptation. Validation loss: 3.2499
• Gated Final Normalization: Replace the final RMSNorm before lm_head with a gated version where a learned gate controls how much normalization is applied vs. passing the raw representation. Validation loss: 3.2503
• Position-Aware MLP Gating: Gate the MLP output based on position information, allowing the model to learn position-dependent processing depth. Validation loss: 3.2506
• Grouped Token Embeddings: Group the vocabulary into clusters and add a learned embedding per cluster on top of token embeddings, providing hierarchical vocabulary structure. Validation loss: 3.2521

We also present several Claude-4.5-Opus ideas on the nanoGPT environment that did not execute successfully.
• Soft Layer Repetition: Allow the model to softly repeat computation through layers by adding a learned gate that mixes the current layer's input back into its output, simulating variable depth.
• Causal Context Compression: Before each attention layer, apply a learned compression that mixes local context (the previous 2–3 tokens) into the current representation, providing implicit local context without convolutions.
• Attention Head Specialization via Orthogonal Loss: Add a soft penalty that encourages different attention heads to attend to different patterns by penalizing similarity between head outputs.
• Skip Connections with Learned Residual Weights: Combine skip connections with learned residual weights. The skip connections provide alternative gradient paths while the learned weights allow adaptive scaling.
• Token Difficulty-Aware Loss Weighting: Weight the loss contribution of each token based on the model's uncertainty (entropy) at that position, focusing learning on difficult tokens while not over-optimizing easy ones.

C.1.2 Code execution examples
We present a full code execution example from Claude-4.5-Sonnet on the GRPO environment, showing the natural-language idea followed by the generated code implementation.

Example 1
[Experiment] Create a mathematical working-memory simulation by maintaining a context buffer of mathematical facts, definitions, and intermediate results during problem solving. This buffer gets updated as the model works through problems and provides additional context for subsequent mathematical steps, simulating how humans maintain mathematical working memory during complex calculations.
[Code Changes] Modify the prompt template in sample.py to include a "Mathematical Context" section that accumulates relevant mathematical facts during problem solving.
Implement context buffer management in train_loop in grpo.py that maintains recent mathematical definitions, formulas, and intermediate calculations across problems in the same batch. Update the buffer with key mathematical expressions extracted from each response. Use enhanced prompts that include relevant context from the buffer. Set --learning_rate 2.5e-5 and --loss_type reinforce_with_baseline. Validation accuracy: 58.0

Example 1 code diff

--- repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/sample.py
+++ repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/sample.py
@@ -2,6 +2,7 @@ from vllm import LLM, SamplingParams
 import json
 from tqdm import tqdm
 import random
+import re

 def load_prompt_template(prompt_path = "prompts/r1_zero.prompt"):
     with open(prompt_path, 'r') as f:
@@ -20,12 +21,52 @@ def get_model_and_sampling_params(model_path = "Qwen/Qwen2.5-Math-1.5B", token

     return llm, sampling_params

-def load_dataset_with_prompt_template(prompt_template, dataset_path="/juice5b/scr5b/nlp/aihinton/CS336_data/a5-alignment/MATH/test.jsonl"):
+def extract_mathematical_expressions(text):
+    """Extract mathematical expressions, formulas, and key facts from text."""
+    expressions = []
+
+    # Extract LaTeX math expressions
+    latex_patterns = [
+        r'\$[^$]+\$',                 # Inline math
+        r'\\[a-zA-Z]+\{[^}]+\}',      # LaTeX commands
+        r'\\frac\{[^}]+\}\{[^}]+\}',  # Fractions
+        r'\\sqrt\{[^}]+\}',           # Square roots
+        r'[a-zA-Z] *= *[^, \.]+',     # Variable assignments
+    ]
+
+    for pattern in latex_patterns:
+        matches = re.findall(pattern, text)
+        expressions.extend(matches)
+
+    # Extract definitions and key facts
+    definition_patterns = [
+        r'[A-Z][a-z]+ is defined as [^\.]+\.',
+        r'Let [a-zA-Z] be [^\.]+\.',
+        r'Given that [^\.]+\.',
+        r'Since [^\.]+\.',
+        r'Therefore [^\.]+\.',
+    ]
+
+    for pattern in definition_patterns:
+        matches = re.findall(pattern, text)
+        expressions.extend(matches)
+
+    return list(set(expressions))  # Remove duplicates
+
+def create_context_prompt(base_template, context_items):
+    """Create a prompt with mathematical context."""
+    if context_items:
+        context_section = "Mathematical Context: " + " ".join(f"- {item}" for item in context_items[-10:]) + " "
+        return context_section + base_template
+    return base_template
+
+def load_dataset_with_prompt_template(prompt_template, dataset_path="/juice5b/scr5b/nlp/aihinton/CS336_data/a5-alignment/MATH/test.jsonl", context_items=None):
     with open(dataset_path, 'r') as f:
         dataset = [json.loads(line) for line in f]
     print(f"Loaded {len(dataset)} examples from {dataset_path}")

-    prompts = [prompt_template.format(question=example["problem"]) for example in dataset]
+    enhanced_template = create_context_prompt(prompt_template, context_items)
+    prompts = [enhanced_template.format(question=example["problem"]) for example in dataset]
     answers = [example["answer"] for example in dataset]
     return prompts, answers

@@ -93,4 +134,42 @@
     if return_rewards:
         return batched_prompts, batched_answers, batched_responses, batched_rewards
     return results
+
+class MathContextBuffer:
+    """Buffer to maintain mathematical working memory across problems."""
+
+    def __init__(self, max_size=50):
+        self.buffer = []
+        self.max_size = max_size
+
+    def add_expressions(self, expressions):
+        """Add new mathematical expressions to the buffer."""
+        for expr in expressions:
+            if expr and len(expr.strip()) > 3:  # Filter out very short expressions
+                self.buffer.append(expr.strip())
+
+        # Keep only the most recent items
+        if len(self.buffer) > self.max_size:
+            self.buffer = self.buffer[-self.max_size:]
+
+    def get_relevant_context(self, query_text, max_items=10):
+        """Get the most relevant context items for a given query."""
+        if not self.buffer:
+            return []
+
+        # Simple relevance scoring based on shared mathematical terms
+        scored_items = []
+        query_lower = query_text.lower()
+
+        for item in self.buffer[-20:]:  # Consider recent items
+            score = sum(1 for word in item.lower().split() if word in query_lower)
+            if score > 0:
+                scored_items.append((score, item))
+
+        # Return top items sorted by relevance
+        scored_items.sort(reverse=True, key=lambda x: x[0])
+        return [item[1] for item in scored_items[:max_items]]
+
+    def clear(self):
+        """Clear the context buffer."""
+        self.buffer = []
--- repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/grpo.py
+++ repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/grpo.py
@@ -1,6 +1,6 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from utils import tokenize_prompt_and_output, get_response_log_probs
-from sample import load_prompt_template, load_dataset_with_prompt_template, sample_rollout
+from sample import load_prompt_template, load_dataset_with_prompt_template, sample_rollout, MathContextBuffer, extract_mathematical_expressions, create_context_prompt
 from drgrpo_grader import r1_zero_reward_fn_train
 from evaluate import r1_zero_reward_fn_eval, evaluate_vllm
 from grpo_utils import compute_group_normalized_rewards, grpo_microbatch_train_step
@@ -66,6 +66,9 @@ def evaluate_model(policy_model, vllm_model, eval_prompts, eval_answers, eval_s
 def train_loop(model, train_prompts, train_answers, learning_rate, grpo_steps, train_steps_per_rollout, output_dir, batch_size, gradient_accumulation_steps = 4, group_size = 2, rollout_subset_size = 256, device = "cuda", logging_steps = 20, saving_steps = 4000, eval_epochs = 5, eval_prompts = None, eval_answers = None, sampling_params = None, eval_vllm_model = None, cliprange = 0.2, loss_type = "reinforce_with_baseline"):
     model.to(device)
     training_steps = grpo_steps
+
+    # Initialize mathematical context buffer
+    context_buffer = MathContextBuffer(max_size=100)
     optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.0, betas=(0.9, 0.95))
     global_step = 0  # Initialize global step counter

@@ -85,8 +88,31 @@ def train_loop(model, train_prompts, train_answers, learning_rate, grpo_steps,
     load_policy_into_vllm_instance(model, vllm_model)

     ## sample rollouts
+    # Get enhanced prompts with mathematical context for this epoch
     print ("Sampling rollouts for epoch: ", epoch)
-    rollout_prompts, rollout_answers, rollout_responses, rollout_rewards = sample_rollout(vllm_model, r1_zero_reward_fn_train, train_prompts, train_answers, G=group_size, eval_sampling_params=eval_sampling_params, subset_size=rollout_subset_size, return_rewards=True, batch_size=512)
+
+    # Create enhanced prompts with context for this batch
+    enhanced_prompts = []
+    base_template = load_prompt_template()
+
+    # Select subset of problems for this epoch
+    if rollout_subset_size is not None:
+        indices = random.sample(range(len(train_prompts)), rollout_subset_size)
+        epoch_prompts = [train_prompts[i] for i in indices]
+        epoch_answers = [train_answers[i] for i in indices]
+    else:
+        epoch_prompts = train_prompts
+        epoch_answers = train_answers
+
+    # Create context-enhanced prompts
+    for prompt in epoch_prompts:
+        relevant_context = context_buffer.get_relevant_context(prompt, max_items=8)
+        enhanced_prompt = create_context_prompt(base_template.format(question=prompt.split("Question: ")[-1]), relevant_context)
+        enhanced_prompts.append(enhanced_prompt)
+
+    # Sample with enhanced prompts (need to adapt sample_rollout for direct prompt input)
+    rollout_prompts, rollout_answers, rollout_responses, rollout_rewards = sample_rollout_with_prompts(vllm_model, r1_zero_reward_fn_train, enhanced_prompts, epoch_answers, G=group_size, eval_sampling_params=eval_sampling_params, return_rewards=True, batch_size=512)
+
     # Randomly sample 2 rollouts to print
     indices = random.sample(range(len(rollout_prompts)), 2)
     print ("Example rollouts:")
@@ -96,6 +122,15 @@ def train_loop(model, train_prompts, train_answers, learning_rate, grpo_steps,
     print(f"Response: {rollout_responses[idx]}")
     print(f"Reward: {rollout_rewards[idx]}")
     print(f"Ground truth: {rollout_answers[idx]}")
+
+    # Update context buffer with mathematical expressions from responses
+    print("Updating mathematical context buffer...")
+    for response in rollout_responses:
+        if rollout_rewards[rollout_responses.index(response)] > 0:  # Only add from correct responses
+            mathematical_expressions = extract_mathematical_expressions(response)
+            if mathematical_expressions:
+                context_buffer.add_expressions(mathematical_expressions)
+
     rollout_tokenized = tokenize_prompt_and_output(rollout_prompts, rollout_responses, tokenizer)
     rollout_data_loader = create_data_loader(rollout_tokenized, batch_size=batch_size, shuffle=False)

@@ -196,6 +231,47 @@ def train_loop(model, train_prompts, train_answers, learning_rate, grpo_steps,
     , step=global_step)
     # print ("Global Step: ", global_step, "Loss: ", loss.item(), "Entropy: ", entropy.mean().item(), "Clip fraction: ", metadata.get("clip_fraction", 0.0))

+def sample_rollout_with_prompts(
+    vllm_model,
+    reward_fn,
+    prompts,
+    answers,
+    G,
+    eval_sampling_params,
+    return_rewards=False,
+    batch_size=64
+):
+    """Sample rollouts using provided prompts directly."""
+    # Create batched prompts by repeating each prompt G times
+    batched_prompts = []
+    batched_answers = []
+    for prompt, answer in zip(prompts, answers):
+        batched_prompts.extend([prompt] * G)
+        batched_answers.extend([answer] * G)
+
+    # Process in batches to avoid OOM
+    all_outputs = []
+    for i in range(0, len(batched_prompts), batch_size):
+        batch_prompts = batched_prompts[i:i + batch_size]
+        batch_outputs = vllm_model.generate(batch_prompts, eval_sampling_params)
+        all_outputs.extend(batch_outputs)
+
+    # Process results
+    batched_responses = []
+    batched_rewards = []
+    total_rewards = 0
+
+    for output, answer in tqdm(zip(all_outputs, batched_answers)):
+        generated_answer = output.outputs[0].text
+        reward = reward_fn(generated_answer, answer)
+        batched_responses.append(generated_answer)
+        batched_rewards.append(reward["reward"])
+        total_rewards += reward["reward"]
+
+    print(f"Accuracy of sampled rollouts: {total_rewards / len(batched_prompts) * 100}%")
+
+    return batched_prompts, batched_answers, batched_responses, batched_rewards
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
     parser.add_argument("--model_path", type=str, default="Qwen/Qwen2.5-Math-1.5B")
@@ -203,7 +279,7 @@ if __name__ == "__main__":
     parser.add_argument("--train_dataset_path", type=str, default="../MATH/train.jsonl")
     parser.add_argument("--eval_dataset_path", type=str, default="../MATH/test.jsonl")
     parser.add_argument("--output_dir", type=str, default="ckpts/")
-    parser.add_argument("--learning_rate", type=float, default=1e-5)
+    parser.add_argument("--learning_rate", type=float, default=2.5e-5)
     parser.add_argument("--grpo_steps", type=int, default=200)
     parser.add_argument("--group_size", type=int, default=8)
     parser.add_argument("--rollout_subset_size", type=int, default=256)
@@ -212,7 +288,7 @@ if __name__ == "__main__":
     parser.add_argument("--gradient_accumulation_steps", type=int, default=16)
     parser.add_argument("--batch_size", type=int, default=4)
     parser.add_argument("--cliprange", type=float, default=0.2)
-    parser.add_argument("--loss_type", type=str, default="grpo_clip")
+    parser.add_argument("--loss_type", type=str, default="reinforce_with_baseline")
     parser.add_argument("--wandb_project", type=str, default="grpo-math")
     parser.add_argument("--wandb_name", type=str, default="grpo_clip_1")
     args = parser.parse_args()
--- repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/run_job.sh
+++ repo_variants_full_pipeline_bsz50_fixed2_epoch7/idea_43/run_job.sh
@@ -21,7 +21,7 @@ timeout 2h uv run \
 --index https://download.pytorch.org/whl/cu128 \
 --index-strategy
    unsafe-best-match \
     python grpo.py \
-    --learning_rate 1e-5 \
+    --learning_rate 2.5e-5 \
     --grpo_steps 20 \
     --group_size 8 \
     --rollout_subset_size 128 \
@@ -30,7 +30,7 @@ timeout 2h uv run \
     --gradient_accumulation_steps 16 \
     --batch_size 4 \
     --cliprange 0.2 \
-    --loss_type grpo_clip \
+    --loss_type reinforce_with_baseline \
     --wandb_name $wandb_name

 echo "Experiment finished successfully!"

C.2 Reinforcement learning from execution reward

Unlike evolutionary search, reinforcement learning shapes model behavior through gradient updates. Despite recent success on verifiable domains like math and coding DeepSeek-AI et al. [2025], RL's effectiveness on open-ended AI research remains unclear. We explore whether the automated executor can serve as a reward function to directly finetune LLMs for more effective idea generation via RL. We detail our implementation, experiment setup, and analysis of training dynamics.

Figure C.1: Training curves of RL from execution reward. We plot the average reward per epoch in the upper row, and the max reward per epoch in the lower row. For the GRPO environment, the reward is the accuracy; for the nanoGPT environment, the reward is the reciprocal of the loss. The average reward increases, but not the max reward.

C.2.1 Reward design and experiment setup

We use Qwen3-30B-A3B Yang et al.
[2025a] as the base model and finetune it with the standard GRPO algorithm Shao et al. [2024], motivated by its consistent empirical success on other verifiable domains. Our prompt batch size is one because we have only one prompt per environment. In the prompt, we provide the baseline codebase and ask the model to generate new ideas to improve the baseline (GRPO or nanoGPT). This setup resembles prior work on RLVR from one training example Wang et al. [2025]. We use large group sizes to stabilize training: 256 for the post-training environment and 128 for the pre-training environment. Because each GRPO idea runs on one GPU and each nanoGPT idea runs on 8 GPUs, these group sizes correspond to parallel execution on 256 GPUs (for GRPO) or 1024 GPUs (for nanoGPT) to obtain execution rewards for each batch of rollout ideas.

Each rollout consists of a thinking trace followed by the natural language idea. We set a max output length of 8192 tokens for rollout sampling and feed only the extracted ideas to the automated executor, excluding the thinking trace. For the post-training environment, we use the validation set accuracy of each rollout idea after execution as the reward. For ideas without a valid accuracy (i.e., failed execution due to code generation errors), we assign a reward of 0. For the pre-training environment, we use the reciprocal of the validation loss (1/loss) as the reward and assign 0 to ideas with failed execution. Our experiments use the Tinker API Thinking Machines Lab [2025].

C.2.2 Experiment results

Positive training curves for average reward. The upper row of Figure C.1 plots the average reward of all rollouts per training epoch. We find that the average performance of generated ideas increases after sufficient training epochs on open-ended research environments.
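The reward assignment described in the setup above can be sketched in a few lines; the helper names and the use of None as a failed-execution sentinel are illustrative assumptions, not taken from the actual executor code.

```python
# Hypothetical sketch of the two reward mappings from C.2.1; helper
# names and the None failure sentinel are illustrative assumptions.

def post_training_reward(val_accuracy):
    """GRPO environment: validation accuracy, or 0 for failed execution."""
    return val_accuracy if val_accuracy is not None else 0.0

def pre_training_reward(val_loss):
    """nanoGPT environment: reciprocal validation loss, or 0 for failed execution."""
    return 1.0 / val_loss if val_loss is not None else 0.0

rewards = [
    post_training_reward(0.31),  # a successful GRPO idea
    post_training_reward(None),  # code generation error -> reward 0
    pre_training_reward(4.0),    # validation loss 4.0 -> reward 0.25
    pre_training_reward(None),   # failed execution -> reward 0
]
```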
The average accuracy on the GRPO environment increases from 0.253 to 0.343 after 40 training epochs (top left of Figure C.1); the average reward on the nanoGPT environment increases from 0.194 to 0.246 after 68 epochs (top right of Figure C.1), corresponding to a decrease in average validation loss from 5.150 to 4.066. These training curves resemble prior findings on one-shot RLVR on other verifiable domains like math Wang et al. [2025].

The case of max reward. Despite reproducing the positive training curves observed in other domains, we argue that idea generation differs fundamentally from other verifiable tasks. For scientific discovery, what matters is the upper bound of idea generation, not the average quality. We want one breakthrough idea that dominates the baselines, not many safe ideas with a high average. The lower row of Figure C.1 plots the max reward of all rollouts at each training epoch. The trend here is strikingly different: the max reward fluctuates throughout RL training without a clear upward trend. This reveals a crucial limitation of standard GRPO for idea generation. We next analyze why RL from execution reward improves average but not max reward.

C.3 Supplementary materials for test-time scaling

C.3.1 Evaluation determinism

We run our evaluations using vLLM [Kwon et al., 2023b] because it is faster than alternatives. However, even with identical random seeds and greedy sampling, evaluation scores can change across runs due to:

• different batch sizes,
• continued generations, and
• changes in tensor parallelism.

Because our model generates long reasoning traces before answering, small numeric changes can snowball into large differences. We observe many generations that match exactly for thousands of tokens, then suddenly diverge on a single token, ultimately producing entirely different answers. To mitigate this, we run our final evaluations in full precision unless otherwise indicated.
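The snowballing described above is rooted in floating-point non-associativity: changing the reduction order of a sum (as different batch sizes or tensor-parallel layouts do) perturbs the last bits of a logit, and a greedy argmax over near-tied logits can flip on those bits. A minimal stdlib sketch of the underlying effect (not vLLM itself; the logit gap below is an illustrative value):

```python
# Floating-point addition is not associative, so different reduction
# orders (e.g., different batching or tensor-parallel splits) can give
# sums that differ in the last bits.
a, b, c = 0.1, 0.2, 0.3
sum_forward = (a + b) + c   # one reduction order
sum_backward = a + (b + c)  # another reduction order
reduction_gap = abs(sum_forward - sum_backward)  # nonzero, about one ulp

# If two token logits are closer than this noise floor, greedy argmax is
# effectively decided by kernel scheduling, and one flipped token can
# derail an entire long reasoning trace.
logit_gap = 5e-17  # illustrative near-tie, below the reduction noise
argmax_is_stable = logit_gap > reduction_gap
```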
Appendix D

Supplementary materials for Chapter 5

D.1 From Newton's law to Poisson's equation

We derive Poisson's equation from Newton's law of gravitation (5.1). The first step toward a field-theoretic description is to pass from discrete masses to a continuous distribution. Consider a test point mass m at position r and a continuous matter distribution with mass density ρ(r′) (kilograms per cubic meter). An infinitesimal volume element dV′ at position r′ contains mass dM = ρ(r′) dV′ and, by Newton's law (5.1), exerts a force

$$d\mathbf{F} = -\frac{Gm\,\rho(\mathbf{r}')\,dV'}{|\mathbf{r}-\mathbf{r}'|^{3}}\,(\mathbf{r}-\mathbf{r}') \tag{D.1}$$

on m. Integrating over all source matter and dividing by m gives the gravitational field, the force per unit test mass:

$$\mathbf{g}(\mathbf{r}) = -G\int \rho(\mathbf{r}')\,\frac{\mathbf{r}-\mathbf{r}'}{|\mathbf{r}-\mathbf{r}'|^{3}}\,dV'. \tag{D.2}$$

This integral can be rewritten as the gradient of a scalar potential. Define the gravitational potential Φ(r), the potential energy per unit test mass (units of m²s⁻²), as

$$\Phi(\mathbf{r}) = -G\int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,dV'. \tag{D.3}$$

For a single point mass M at the origin, ρ(r′) = M δ³(r′) and the integral reduces to Φ = −GM/r. Taking the gradient of (D.3) with respect to r and passing it inside the integral:

$$-\nabla\Phi(\mathbf{r}) = G\int \rho(\mathbf{r}')\,\nabla\frac{1}{|\mathbf{r}-\mathbf{r}'|}\,dV' = -G\int \rho(\mathbf{r}')\,\frac{\mathbf{r}-\mathbf{r}'}{|\mathbf{r}-\mathbf{r}'|^{3}}\,dV', \tag{D.4}$$

where the second equality uses $\nabla\left(|\mathbf{r}-\mathbf{r}'|^{-1}\right) = -(\mathbf{r}-\mathbf{r}')/|\mathbf{r}-\mathbf{r}'|^{3}$. The right-hand side is exactly the gravitational field (D.2), so

$$\mathbf{g} = -\nabla\Phi. \tag{D.5}$$

The passage from the integral form (D.2) to a local differential equation is the crucial conceptual leap. Applying the Laplacian $\nabla^{2} = \nabla\cdot\nabla$ (with respect to r) to both sides of (D.3) and passing it inside the integral gives

$$\nabla^{2}\Phi(\mathbf{r}) = -G\int \rho(\mathbf{r}')\,\nabla^{2}\frac{1}{|\mathbf{r}-\mathbf{r}'|}\,dV'. \tag{D.6}$$

Using the identity $\nabla^{2}\left(1/|\mathbf{r}-\mathbf{r}'|\right) = -4\pi\,\delta^{3}(\mathbf{r}-\mathbf{r}')$, this becomes

$$\nabla^{2}\Phi(\mathbf{r}) = 4\pi G\int \rho(\mathbf{r}')\,\delta^{3}(\mathbf{r}-\mathbf{r}')\,dV'. \tag{D.7}$$

The Dirac delta sifts out the density at r, collapsing the integral to yield

$$\nabla^{2}\Phi = 4\pi G\rho, \tag{D.8}$$

which is Poisson's equation (5.2).
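As a quick numerical illustration of (D.2), (D.3), and (D.5), the point-mass potential Φ = −GM/r should reproduce the inverse-square field when differentiated. The sketch below checks g = −∇Φ by central finite differences; the values of G, M, and the test point are arbitrary illustrative choices, not values from the text.

```python
import math

# Check g = -grad(Phi) (D.5) for a point mass, where Phi = -GM/|r| is
# the point-mass case of (D.3). Units and constants are illustrative.
G, M = 1.0, 2.0

def phi(x, y, z):
    return -G * M / math.sqrt(x * x + y * y + z * z)

def grad_phi(x, y, z, h=1e-6):
    # central finite differences of the potential
    return (
        (phi(x + h, y, z) - phi(x - h, y, z)) / (2 * h),
        (phi(x, y + h, z) - phi(x, y - h, z)) / (2 * h),
        (phi(x, y, z + h) - phi(x, y, z - h)) / (2 * h),
    )

def newton_field(x, y, z):
    # g = -GM r / |r|^3, the point-source form of (D.2)
    r3 = (x * x + y * y + z * z) ** 1.5
    return (-G * M * x / r3, -G * M * y / r3, -G * M * z / r3)

pt = (0.7, -1.2, 0.5)
g_from_potential = tuple(-c for c in grad_phi(*pt))
g_direct = newton_field(*pt)
max_err = max(abs(a - b) for a, b in zip(g_from_potential, g_direct))
```

The two fields agree to finite-difference accuracy, confirming that the integral field (D.2) is the gradient of the potential (D.3).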
D.2 The metric tensor, stress-energy tensor, and spacetime curvature

We describe the tensorial ingredients of the Einstein field equations (5.3) in detail.

The metric tensor $g_{\mu\nu}$. In Newtonian gravity, the potential Φ is a single number at each point in space. Einstein replaces it with the metric tensor $g_{\mu\nu}$, a 4 × 4 symmetric matrix-valued function of the spacetime coordinates (t, x, y, z). The metric tensor generalizes the Pythagorean theorem to curved spacetime: it defines the infinitesimal squared interval between two nearby events as

$$ds^{2} = g_{\mu\nu}\,dx^{\mu}dx^{\nu}, \tag{D.9}$$

where we use the Einstein summation convention (summing over repeated indices μ, ν = 0, 1, 2, 3) throughout. In flat spacetime with no gravity, the metric reduces to the Minkowski metric $\eta_{\mu\nu} = \mathrm{diag}(-1, 1, 1, 1)$, and the interval is $ds^{2} = -c^{2}dt^{2} + dx^{2} + dy^{2} + dz^{2}$. The presence of mass and energy warps the metric away from $\eta_{\mu\nu}$, and this warping is what we experience as gravity. Because $g_{\mu\nu}$ is symmetric ($g_{\mu\nu} = g_{\nu\mu}$), it has 10 independent components, matching the 10 independent components of the field equation we arrive at below. Its inverse is denoted $g^{\mu\nu}$ and satisfies $g^{\mu\alpha}g_{\alpha\nu} = \delta^{\mu}_{\nu}$, where $\delta^{\mu}_{\nu}$ is the Kronecker delta.

To see the metric tensor in action, consider the Schwarzschild solution, the geometry outside a spherical, non-rotating mass M, which in spherical coordinates (t, r, θ, φ) takes the form

$$ds^{2} = -\left(1 - \frac{2GM}{rc^{2}}\right)c^{2}dt^{2} + \left(1 - \frac{2GM}{rc^{2}}\right)^{-1}dr^{2} + r^{2}d\theta^{2} + r^{2}\sin^{2}\theta\,d\phi^{2}. \tag{D.10}$$

The components of $g_{\mu\nu}$ are functions of the coordinates: the flow of time ($g_{00}$) and the measurement of radial distance ($g_{11}$) both depend on r, encoding gravitational time dilation and spatial curvature.

The stress-energy tensor $T_{\mu\nu}$. In Poisson's equation, the source of gravity is the scalar mass density ρ.
Einstein replaces it with the stress-energy tensor $T_{\mu\nu}$, a 4 × 4 symmetric tensor encoding the matter and energy content at each point in spacetime: each component $T_{\mu\nu}$ is the flux of the μ-th component of four-momentum through a surface of constant $x^{\nu}$.

The time-time component $T_{00}$ is the energy density. Rest mass m carries energy $mc^{2}$, so mass density ρ contributes $\rho c^{2}$ to the energy density; but $T_{00}$ is the total energy density, which also includes contributions from electromagnetic fields, thermal motion, and kinetic energy. For ordinary matter at rest, $T_{00} = \rho c^{2}$, recovering the Newtonian source. The mixed components $T_{0i} = T_{i0}$ encode momentum density and energy flux, two descriptions of the same physical quantity. Moving energy carries momentum ($p = Ev/c^{2}$), so the flux of energy across a surface is identically the density of momentum in the corresponding direction. The spatial block $T_{ij}$ encodes stress: the rate at which the i-th component of momentum is transported across a surface perpendicular to direction j. For a perfect fluid, $T_{ij} = p\,\delta_{ij}$, where p is the isotropic pressure; viscous fluids additionally have off-diagonal shear components. In Newtonian gravity, only mass generates gravity. In general relativity, energy, momentum, and pressure all curve spacetime.

From the metric to curvature: Γ, $R^{\alpha}{}_{\beta\mu\nu}$, $R_{\mu\nu}$, and R. The left-hand side of Einstein's equation must express how spacetime is curved. We build this curvature entirely from $g_{\mu\nu}$ and its derivatives, through a four-step pipeline.

Step 1: Christoffel symbols. The Christoffel symbols $\Gamma^{\alpha}_{\mu\nu}$ encode how the coordinate basis vectors change from point to point in curved spacetime. They are computed from the first derivatives of the metric:

$$\Gamma^{\alpha}_{\mu\nu} = \frac{1}{2}g^{\alpha\beta}\left(\partial_{\mu}g_{\nu\beta} + \partial_{\nu}g_{\beta\mu} - \partial_{\beta}g_{\mu\nu}\right), \tag{D.11}$$

where $\partial_{\mu} \equiv \partial/\partial x^{\mu}$.
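Formula (D.11) can be exercised numerically on a low-dimensional toy metric. The sketch below evaluates the Christoffel symbols of the unit 2-sphere, metric diag(1, sin²θ), by finite-differencing the metric; the closed forms Γ^θ_φφ = −sin θ cos θ and Γ^φ_θφ = cot θ are standard textbook values used only as a check, not formulas from the text.

```python
import math

# Finite-difference evaluation of (D.11) on the unit 2-sphere with
# coordinates (theta, phi) and metric diag(1, sin^2 theta).

def metric(x):
    th, ph = x
    return [[1.0, 0.0], [0.0, math.sin(th) ** 2]]

def inverse_metric(x):
    th, ph = x
    return [[1.0, 0.0], [0.0, 1.0 / math.sin(th) ** 2]]

def d_metric(x, k, h=1e-6):
    # partial_k g_{ij} by central differences
    xp, xm = list(x), list(x)
    xp[k] += h
    xm[k] -= h
    gp, gm = metric(xp), metric(xm)
    return [[(gp[i][j] - gm[i][j]) / (2 * h) for j in range(2)] for i in range(2)]

def christoffel(x):
    # Gamma^a_{mn} = 1/2 g^{ab} (d_m g_{nb} + d_n g_{bm} - d_b g_{mn})
    g_inv = inverse_metric(x)
    dg = [d_metric(x, k) for k in range(2)]
    gamma = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    for a in range(2):
        for m in range(2):
            for n in range(2):
                s = 0.0
                for b in range(2):
                    s += 0.5 * g_inv[a][b] * (dg[m][n][b] + dg[n][b][m] - dg[b][m][n])
                gamma[a][m][n] = s
    return gamma

th = 0.9
gamma = christoffel((th, 0.3))
err_theta = abs(gamma[0][1][1] - (-math.sin(th) * math.cos(th)))  # Gamma^theta_phiphi
err_phi = abs(gamma[1][0][1] - (math.cos(th) / math.sin(th)))     # Gamma^phi_thetaphi
```

Both computed symbols match the closed forms to finite-difference accuracy, confirming the index placement in (D.11).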
Christoffel symbols are not tensors: they can be made to vanish at any single point by choosing appropriate coordinates (the mathematical expression of the equivalence principle, which states that gravity is locally indistinguishable from acceleration). However, their derivatives, combined appropriately, yield genuine tensors that measure intrinsic curvature.

Step 2: the Riemann curvature tensor. The Riemann tensor $R^{\alpha}{}_{\beta\mu\nu}$ is the fundamental measure of spacetime curvature, constructed from the Christoffel symbols and their first derivatives:

$$R^{\alpha}{}_{\beta\mu\nu} = \partial_{\mu}\Gamma^{\alpha}_{\nu\beta} - \partial_{\nu}\Gamma^{\alpha}_{\mu\beta} + \Gamma^{\alpha}_{\mu\gamma}\Gamma^{\gamma}_{\nu\beta} - \Gamma^{\alpha}_{\nu\gamma}\Gamma^{\gamma}_{\mu\beta}. \tag{D.12}$$

Because the Christoffel symbols involve first derivatives of $g_{\mu\nu}$, the Riemann tensor involves second derivatives of the metric. Physically, it measures tidal forces: if a cloud of freely falling particles drifts through curved spacetime, the Riemann tensor determines how the cloud is stretched and squeezed. The tensor has 256 components in four dimensions, but symmetries reduce the independent components to 20.

Step 3: the Ricci tensor. We obtain the Ricci tensor $R_{\mu\nu}$ by contracting (tracing over) one pair of indices of the Riemann tensor:

$$R_{\mu\nu} = R^{\alpha}{}_{\mu\alpha\nu}. \tag{D.13}$$

This contraction distills the 20-component Riemann tensor into a 10-component symmetric tensor. $R_{\mu\nu}$ isolates the volume-changing part of the curvature: it measures how the volume of a small ball of freely falling particles shrinks or grows as the ball moves through spacetime, discarding the shape-distorting tidal effects captured by the full Riemann tensor.

Step 4: the Ricci scalar. The Ricci scalar R is the trace of the Ricci tensor:

$$R = g^{\mu\nu}R_{\mu\nu}. \tag{D.14}$$

R compresses the entire curvature into a single number at each point: positive on sphere-like regions (where volumes are smaller than in flat space) and negative on saddle-like regions (where volumes are larger).
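The four-step pipeline (D.11)-(D.14) can be checked end-to-end on a sphere of radius A, whose Ricci scalar is the classic constant R = 2/A². The sketch below starts from the sphere's known Christoffel symbols (the radius drops out of them), builds the Riemann tensor (D.12) with finite-difference derivatives, contracts to the Ricci tensor (D.13), and traces with the inverse metric (D.14); the radius and evaluation point are illustrative choices.

```python
import math

# Pipeline (D.12)-(D.14) on a sphere of radius A with metric
# diag(A^2, A^2 sin^2 theta); expected Ricci scalar is 2/A^2.
A = 1.7  # illustrative radius

def gamma(x):
    # Known closed-form Christoffel symbols of the sphere (A cancels).
    th, ph = x
    out = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    out[0][1][1] = -math.sin(th) * math.cos(th)                 # Gamma^theta_phiphi
    out[1][0][1] = out[1][1][0] = math.cos(th) / math.sin(th)   # Gamma^phi_thetaphi
    return out

def d_gamma(x, k, h=1e-6):
    # partial_k Gamma^a_{mn} by central differences
    xp, xm = list(x), list(x)
    xp[k] += h
    xm[k] -= h
    gp, gm = gamma(xp), gamma(xm)
    return [[[(gp[a][m][n] - gm[a][m][n]) / (2 * h) for n in range(2)]
             for m in range(2)] for a in range(2)]

def ricci_scalar(x):
    G = gamma(x)
    dG = [d_gamma(x, k) for k in range(2)]
    # Riemann (D.12): R^a_bmn = d_m G^a_nb - d_n G^a_mb + G^a_mc G^c_nb - G^a_nc G^c_mb
    riem = [[[[dG[m][a][n][b] - dG[n][a][m][b]
               + sum(G[a][m][c] * G[c][n][b] - G[a][n][c] * G[c][m][b] for c in range(2))
               for n in range(2)] for m in range(2)] for b in range(2)] for a in range(2)]
    # Ricci (D.13): R_mn = R^a_man
    ricci = [[sum(riem[a][m][a][n] for a in range(2)) for n in range(2)] for m in range(2)]
    # Scalar (D.14): R = g^mn R_mn with diagonal inverse metric
    th = x[0]
    g_inv = [1.0 / A**2, 1.0 / (A * math.sin(th)) ** 2]
    return g_inv[0] * ricci[0][0] + g_inv[1] * ricci[1][1]

R = ricci_scalar((1.1, 0.4))
err = abs(R - 2.0 / A**2)
```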
The Einstein tensor. The unique symmetric, divergence-free tensor constructible from $g_{\mu\nu}$ and its first and second derivatives is the Einstein tensor $G_{\mu\nu} \equiv R_{\mu\nu} - \frac{1}{2}Rg_{\mu\nu}$. The notation $\nabla^{\mu}G_{\mu\nu}$ denotes the covariant divergence: the covariant derivative $\nabla^{\mu}$ is summed over the repeated index μ, generalizing the vector divergence $\partial_{i}F^{i}$ to a rank-2 tensor in curved spacetime. The identity $\nabla^{\mu}G_{\mu\nu} = 0$ (the Bianchi identity) ensures compatibility with the conservation of energy and momentum ($\nabla^{\mu}T_{\mu\nu} = 0$).

Stripping away the shorthand, the field equations (5.3) are a system of 10 coupled, nonlinear, second-order partial differential equations for the 10 independent components of $g_{\mu\nu}$, sourced by $T_{\mu\nu}$. Every term on the left-hand side ($R_{\mu\nu}$, R, and $g_{\mu\nu}$ itself) is built from $g_{\mu\nu}$ and its derivatives. The nonlinearity stems from the Christoffel symbols appearing both inside derivatives and multiplied against each other in the Riemann tensor (D.12), so gravitational perturbations do not simply superpose: the field's behavior depends on the field itself.

D.3 The Newtonian limit

As a consistency check, Einstein's equations must reduce to Newton's in the appropriate limit. We impose three physical assumptions that characterize everyday gravitational environments like the solar system.

1. Weak field: spacetime is nearly flat, so we can write $g_{\mu\nu} = \eta_{\mu\nu} + h_{\mu\nu}$ where $|h_{\mu\nu}| \ll 1$.

2. Slow motion: all matter moves much slower than light, $v \ll c$.

3. Static field: the gravitational field does not change in time, $\partial_{0}h_{\mu\nu} = 0$.

Under these assumptions, the stress-energy tensor is dominated by its time-time component $T_{00} \approx \rho c^{2}$, with all other components negligible. Its trace is $T = g^{\mu\nu}T_{\mu\nu} \approx g^{00}T_{00} \approx -\rho c^{2}$. On the geometric side, to first order in $h_{\mu\nu}$, the Christoffel symbols (D.11) reduce to $\Gamma^{\alpha}_{\mu\nu} \approx \frac{1}{2}\eta^{\alpha\beta}\left(\partial_{\mu}h_{\nu\beta} + \partial_{\nu}h_{\beta\mu} - \partial_{\beta}h_{\mu\nu}\right)$.
Dropping all products of Christoffel symbols in the Riemann tensor (D.12) and all time derivatives ($\partial_{0}h_{\mu\nu} = 0$), the Ricci tensor's 00-component becomes

$$R_{00} \approx \partial_{k}\Gamma^{k}_{00} = \partial_{k}\left(-\frac{1}{2}\delta^{kl}\partial_{l}h_{00}\right) = -\frac{1}{2}\nabla^{2}h_{00}. \tag{D.15}$$

To obtain the source side, we trace-reverse the field equations (5.3). Contracting both sides with $g^{\mu\nu}$ gives $-R = \frac{8\pi G}{c^{4}}T$ (since $g^{\mu\nu}G_{\mu\nu} = R - 2R = -R$), so $R = -\frac{8\pi G}{c^{4}}T$. Substituting back into (5.3) yields the trace-reversed form $R_{\mu\nu} = \frac{8\pi G}{c^{4}}\left(T_{\mu\nu} - \frac{1}{2}Tg_{\mu\nu}\right)$. Evaluating the 00-component with $T_{00} \approx \rho c^{2}$, $T \approx -\rho c^{2}$, and $g_{00} \approx \eta_{00} = -1$:

$$R_{00} = \frac{8\pi G}{c^{4}}\left(\rho c^{2} - \frac{1}{2}(-\rho c^{2})(-1)\right) = \frac{8\pi G}{c^{4}}\cdot\frac{1}{2}\rho c^{2} = \frac{4\pi G\rho}{c^{2}}. \tag{D.16}$$

Equating the two expressions for $R_{00}$:

$$\nabla^{2}h_{00} = -\frac{8\pi G\rho}{c^{2}}. \tag{D.17}$$

The final step is to identify $h_{00}$ with the Newtonian potential. In general relativity, a freely falling particle follows a geodesic: $\frac{d^{2}x^{\mu}}{d\tau^{2}} + \Gamma^{\mu}_{\alpha\beta}\frac{dx^{\alpha}}{d\tau}\frac{dx^{\beta}}{d\tau} = 0$. For a slowly moving particle ($dx^{i}/d\tau \ll c\,dt/d\tau$), the sum over α, β is dominated by α = β = 0, reducing the spatial components to

$$\frac{d^{2}x^{i}}{dt^{2}} \approx -c^{2}\Gamma^{i}_{00} = -c^{2}\left(-\frac{1}{2}\partial_{i}h_{00}\right) = \frac{c^{2}}{2}\partial_{i}h_{00}, \tag{D.18}$$

where $\Gamma^{i}_{00} = -\frac{1}{2}\delta^{ij}\partial_{j}h_{00}$ follows from the linearized, static Christoffel symbol. Comparing with Newton's $d^{2}x^{i}/dt^{2} = -\partial_{i}\Phi$ identifies $h_{00} = -2\Phi/c^{2}$. Substituting:

$$\nabla^{2}\left(-\frac{2\Phi}{c^{2}}\right) = -\frac{8\pi G\rho}{c^{2}} \implies \nabla^{2}\Phi = 4\pi G\rho. \tag{D.19}$$

This is exactly Poisson's equation (5.2): in the weak-field, slow-motion, static limit, the full machinery of curved spacetime collapses to the scalar potential theory of Newtonian gravity.

D.4 Deriving the Friedmann equations

We derive the Friedmann equations by substituting the FLRW metric (5.4) into the Einstein field equations (5.3). We write $\dot{a} \equiv da/dt$ and $\ddot{a} \equiv d^{2}a/dt^{2}$ throughout.

Christoffel symbols.
Since the metric (5.4) is diagonal, the Christoffel symbols (D.11) simplify: $g^{\alpha\sigma}$ is nonzero only when α = σ, and since $g_{00} = -c^{2}$ is constant and $g_{0i} = 0$, the only nonzero symbols involve one time index and one or two spatial indices.

Case 1: two spatial lower indices.

$$\Gamma^{0}_{ij} = \frac{1}{2}g^{00}\left(\partial_{i}g_{j0} + \partial_{j}g_{i0} - \partial_{0}g_{ij}\right) = \frac{1}{2}\left(-\frac{1}{c^{2}}\right)\left(-2a\dot{a}\,\delta_{ij}\right) = \frac{a\dot{a}}{c^{2}}\,\delta_{ij}, \tag{D.20}$$

since the first two terms vanish ($g_{i0} = 0$) and $\partial_{t}(a^{2}\delta_{ij}) = 2a\dot{a}\,\delta_{ij}$. Explicitly: $\Gamma^{0}_{11} = \Gamma^{0}_{22} = \Gamma^{0}_{33} = a\dot{a}/c^{2}$.

Case 2: one temporal and one spatial lower index.

$$\Gamma^{i}_{0j} = \frac{1}{2}g^{ik}\left(\partial_{0}g_{jk} + \partial_{j}g_{0k} - \partial_{k}g_{0j}\right) = \frac{1}{2}\cdot\frac{1}{a^{2}}\left(2a\dot{a}\,\delta_{ij}\right) = \frac{\dot{a}}{a}\,\delta^{i}_{j}. \tag{D.21}$$

Explicitly: $\Gamma^{1}_{01} = \Gamma^{2}_{02} = \Gamma^{3}_{03} = \dot{a}/a$. The quantity $H(t) \equiv \dot{a}/a$ is the Hubble parameter. All other Christoffel symbols vanish: $\Gamma^{\alpha}_{00} = 0$ for all α (because $g_{00}$ is constant and $g_{0i} = 0$), and $\Gamma^{i}_{jk} = 0$ for all spatial i, j, k (because $a^{2}\delta_{ij}$ has no spatial derivatives).

Ricci tensor. Expanding the contraction $R_{\mu\nu} = R^{\alpha}{}_{\mu\alpha\nu}$ via the Riemann tensor (D.12) gives the working formula

$$R_{\mu\nu} = \partial_{\alpha}\Gamma^{\alpha}_{\mu\nu} - \partial_{\nu}\Gamma^{\alpha}_{\mu\alpha} + \Gamma^{\alpha}_{\alpha\beta}\Gamma^{\beta}_{\mu\nu} - \Gamma^{\alpha}_{\nu\beta}\Gamma^{\beta}_{\mu\alpha}. \tag{D.22}$$

We compute $R_{00}$ and $R_{11}$; the remaining spatial components follow by isotropy ($R_{22} = R_{33} = R_{11}$).

Computing $R_{00}$. Term 1: $\partial_{\alpha}\Gamma^{\alpha}_{00} = 0$, since all $\Gamma^{\alpha}_{00}$ vanish. Term 2: $-\partial_{0}\Gamma^{\alpha}_{0\alpha}$. The nonzero contributions come from spatial α: $\sum_{\alpha=1}^{3}\Gamma^{\alpha}_{0\alpha} = 3\dot{a}/a$, so

$$-\frac{d}{dt}\left(\frac{3\dot{a}}{a}\right) = -3\left(\frac{\ddot{a}}{a} - \frac{\dot{a}^{2}}{a^{2}}\right) = -\frac{3\ddot{a}}{a} + \frac{3\dot{a}^{2}}{a^{2}}. \tag{D.23}$$

Term 3: $\Gamma^{\alpha}_{\alpha\beta}\Gamma^{\beta}_{00} = 0$, since all $\Gamma^{\beta}_{00}$ vanish. Term 4: $-\Gamma^{\alpha}_{0\beta}\Gamma^{\beta}_{0\alpha}$, summed over α, β. Both factors require spatial indices, each contributing $\dot{a}/a$ when α = β:

$$-\sum_{\alpha=1}^{3}\left(\frac{\dot{a}}{a}\right)^{2} = -\frac{3\dot{a}^{2}}{a^{2}}. \tag{D.24}$$

Combining all four terms:

$$R_{00} = -\frac{3\ddot{a}}{a}. \tag{D.25}$$

Computing $R_{11}$. Term 1: $\partial_{\alpha}\Gamma^{\alpha}_{11}$. The only nonzero Christoffel symbol is $\Gamma^{0}_{11} = a\dot{a}/c^{2}$, giving

$$\frac{d}{dt}\left(\frac{a\dot{a}}{c^{2}}\right) = \frac{1}{c^{2}}\left(\dot{a}^{2} + a\ddot{a}\right). \tag{D.26}$$

Term 2: $-\partial_{1}\Gamma^{\alpha}_{1\alpha} = 0$, since the Christoffel symbols depend only on t. Term 3: $\Gamma^{\alpha}_{\alpha\beta}\Gamma^{\beta}_{11}$.
Since $\Gamma^{\beta}_{11}$ is nonzero only for β = 0, this becomes $\sum_{\alpha}\Gamma^{\alpha}_{\alpha 0}\cdot a\dot{a}/c^{2}$. The spatial contributions are $\Gamma^{1}_{10} + \Gamma^{2}_{20} + \Gamma^{3}_{30} = 3\dot{a}/a$, giving $3\dot{a}^{2}/c^{2}$. Term 4: $-\Gamma^{\alpha}_{1\beta}\Gamma^{\beta}_{1\alpha}$, summed over α, β. The two nonzero contributions are (α, β) = (0, 1) and (1, 0), each giving $\dot{a}^{2}/c^{2}$:

$$-\frac{a\dot{a}}{c^{2}}\cdot\frac{\dot{a}}{a} - \frac{\dot{a}}{a}\cdot\frac{a\dot{a}}{c^{2}} = -\frac{2\dot{a}^{2}}{c^{2}}. \tag{D.27}$$

Combining all four terms:

$$R_{11} = \frac{1}{c^{2}}\left(a\ddot{a} + 2\dot{a}^{2}\right). \tag{D.28}$$

Ricci scalar. Contracting with the inverse metric:

$$R = g^{00}R_{00} + 3g^{11}R_{11} = \left(-\frac{1}{c^{2}}\right)\left(-\frac{3\ddot{a}}{a}\right) + 3\cdot\frac{1}{a^{2}}\cdot\frac{1}{c^{2}}\left(a\ddot{a} + 2\dot{a}^{2}\right) = \frac{3\ddot{a}}{c^{2}a} + \frac{3\ddot{a}}{c^{2}a} + \frac{6\dot{a}^{2}}{c^{2}a^{2}}. \tag{D.29}$$

$$R = \frac{6}{c^{2}}\left(\frac{\ddot{a}}{a} + \frac{\dot{a}^{2}}{a^{2}}\right). \tag{D.30}$$

Einstein tensor. The 00-component:

$$G_{00} = R_{00} - \frac{1}{2}g_{00}R = -\frac{3\ddot{a}}{a} - \frac{1}{2}(-c^{2})\cdot\frac{6}{c^{2}}\left(\frac{\ddot{a}}{a} + \frac{\dot{a}^{2}}{a^{2}}\right) = -\frac{3\ddot{a}}{a} + \frac{3\ddot{a}}{a} + \frac{3\dot{a}^{2}}{a^{2}}. \tag{D.31}$$

The $\ddot{a}$ terms cancel completely:

$$G_{00} = \frac{3\dot{a}^{2}}{a^{2}}. \tag{D.32}$$

The 11-component:

$$G_{11} = R_{11} - \frac{1}{2}g_{11}R = \frac{1}{c^{2}}\left(a\ddot{a} + 2\dot{a}^{2}\right) - \frac{1}{2}a^{2}\cdot\frac{6}{c^{2}}\left(\frac{\ddot{a}}{a} + \frac{\dot{a}^{2}}{a^{2}}\right) = \frac{1}{c^{2}}\left(a\ddot{a} + 2\dot{a}^{2}\right) - \frac{3}{c^{2}}\left(a\ddot{a} + \dot{a}^{2}\right). \tag{D.33}$$

$$G_{11} = -\frac{1}{c^{2}}\left(2a\ddot{a} + \dot{a}^{2}\right). \tag{D.34}$$

By isotropy, $G_{22} = G_{33} = G_{11}$.

Field equations. Substituting into the field equations (5.3) with $T_{00} = \rho c^{2}$ and $T_{ij} = 0$: the 00-equation gives

$$\frac{3\dot{a}^{2}}{a^{2}} = \frac{8\pi G\rho}{c^{2}}, \tag{D.35}$$

which is the first Friedmann equation (5.6). The 11-equation, with $T_{11} = 0$, gives $2a\ddot{a} + \dot{a}^{2} = 0$, or equivalently

$$\frac{2\ddot{a}}{a} + \frac{\dot{a}^{2}}{a^{2}} = 0, \tag{D.36}$$

which is the second Friedmann equation (5.7).

D.5 Solving the Friedmann equations

We solve the Friedmann equations (5.6) and (5.7) for the scale factor a(t) and the matter density ρ(t). From the second Friedmann equation (5.7), multiplying through by $a^{2}$:

$$2a\ddot{a} + \dot{a}^{2} = 0. \tag{D.37}$$

Substituting $p = \dot{a}$, so $\ddot{a} = p\,dp/da$, and dividing by $p \neq 0$:

$$2a\frac{dp}{da} + p = 0. \tag{D.38}$$

Setting $u = p^{2} = \dot{a}^{2}$ gives $a\,du/da + u = 0$, a separable equation with solution $u = C/a$ for some constant C > 0.
Therefore $\dot{a}^{2} = C/a$, and separating variables:

$$\sqrt{a}\,da = \sqrt{C}\,dt \implies \frac{2}{3}a^{3/2} = \sqrt{C}\,(t - t_{0}). \tag{D.39}$$

$$a(t) \propto t^{2/3}. \tag{D.40}$$

The universe begins at a = 0, a singularity where all distances vanish and the density is infinite, then expands forever, decelerating but never stopping. From the first Friedmann equation (5.6) and $\dot{a}^{2} = C/a$:

$$\rho = \frac{3c^{2}C}{8\pi G a^{3}} \propto \frac{1}{a^{3}} \propto \frac{1}{t^{2}}. \tag{D.41}$$

The density decreases as the cube of the scale factor: the total mass in any comoving volume is conserved ($\rho a^{3} = \mathrm{const}$), while the volume grows as $a^{3}$. This conservation law also follows independently from the Bianchi identity $\nabla^{\mu}G_{\mu\nu} = 0$, which implies the continuity equation $\dot{\rho} + 3(\dot{a}/a)\rho = 0$, or equivalently $d(\rho a^{3})/dt = 0$.

Bibliography

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, and Yi Zhang. Phi-2: The surprising power of small language models, 2023. URL https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, and Yi Zhang. Phi-4 technical report, 2024a. URL https://arxiv.org/abs/2412.08905.
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahmoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Ce Liu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, Yu Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, Lu Yuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024b. URL https://arxiv.org/abs/2404.14219.

Afra Feyza Akyürek, Ekin Akyürek, Leshem Choshen, Derry Wijaya, and Jacob Andreas. Deductive closure training of language models for coherence, accuracy, and updatability.
In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 9802–9818, Bangkok, Thailand and virtual meeting, August 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-acl.584.

Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers. ArXiv, 2025.

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.2, knowledge manipulation, 2024. URL https://arxiv.org/abs/2309.14402.

Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models, 2023.

Dana Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988. URL https://api.semanticscholar.org/CorpusID:11357867.

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models, 2024. URL https://arxiv.org/abs/2408.11791.

Anthropic. Prompt caching (beta), 2024. URL https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching.

Daman Arora, Himanshu Gaurav Singh, and Mausam. Have llms advanced enough? A challenging problem solving benchmark for large language models, 2023. URL https://arxiv.org/abs/2305.15074.

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke S. Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, and Hanna Hajishirzi. OpenScholar: Synthesizing scientific literature with retrieval-augmented LMs. ArXiv, abs/2411.14199, 2024.

Anas Awadalla, Mitchell Wortsman, Gabriel Ilharco, Sewon Min, Ian Magnusson, Hannaneh Hajishirzi, and Ludwig Schmidt. Exploring the landscape of distributional robustness for question answering models.
In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5971–5987, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.441. URL https://aclanthology.org/2022.findings-emnlp.441.

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4WnqRR915j.

Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2004. URL https://proceedings.neurips.cc/paper_files/paper/2004/file/9457fc28ceb408103e13533e4a5b6bd1-Paper.pdf.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A", 2023.

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning, 2019. URL https://arxiv.org/abs/1905.02249.
Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, and Andy Zou. Lessons from the trenches on reproducible evaluation of language models, 2024.
Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, pages 92–100, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 1581130570. doi: 10.1145/279943.279962. URL https://doi.org/10.1145/279943.279962.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens. CoRR, abs/2112.04426, 2021. URL https://arxiv.org/abs/2112.04426.
Leo Breiman. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001. doi: 10.1214/ss/1009213726. URL https://doi.org/10.1214/ss/1009213726.
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, and Russell Webb. Distillation scaling laws. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=1nEBAkpfb9.
Emmanuel J Candès. Ridgelets: Theory and applications. Department of Statistics, Stanford University, 1998.
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Mądry. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. In ICLR, 2025.
Harrison Chase. LangChain, October 2022. URL https://github.com/langchain-ai/langchain.
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset, 2023a. URL https://arxiv.org/abs/2305.12524.
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic Discovery of Optimization Algorithms. In NeurIPS, 2023b.
Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, and Antoine Bosselut. Meditron-70b: Scaling medical pretraining for large language models, 2023c. URL https://arxiv.org/abs/2311.16079.
Junyan Cheng, Peter Clark, and Kyle Richardson. Language Modeling by Language Models. In NeurIPS, 2025.
Sehyun Choi, Tianqing Fang, Zhaowei Wang, and Yangqiu Song. Kcts: Knowledge-constrained tree search decoding with token-level hallucination detection, 2023. URL https://arxiv.org/abs/2310.09044.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, and Mor Geva. Evaluating the ripple effects of knowledge editing in language models. arXiv preprint arXiv:2307.12976, 2023.
Cohere. Improve search performance with a single line of code, 2024. URL https://cohere.com/rerank.
Pierre Colombo, Telmo Pires, Malik Boudiaf, Rui Melo, Dominic Culver, Sofia Morgado, Etienne Malaboeuf, Gabriel Hautreux, Johanne Charpentier, and Michael Desa. Saullm-54b and saullm-141b: Scaling up domain adaptation for the legal domain, 2024a. URL https://arxiv.org/abs/2407.19584.
Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, and Michael Desa. Saullm-7b: A pioneering large language model for law, 2024b. URL https://arxiv.org/abs/2403.03883.
Common Crawl. Common crawl. https://commoncrawl.org/, 2007.
Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re.
Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=H4DqfPSibmx.
DatologyAI, Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, and Matthew Leavitt. Beyondweb: Lessons from scaling synthetic data for trillion-scale pretraining, 2025. URL https://arxiv.org/abs/2508.10975.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L.
Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
URL https://aclanthology.org/N19-1423.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations, 2023.
Manfredo Perdigão do Carmo. Riemannian Geometry. Mathematics: Theory & Applications. Birkhäuser, Boston, 1992. ISBN 978-0-8176-3490-2. Translated from the second Portuguese edition by Francis Flaherty.
Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library, 2024. URL https://arxiv.org/abs/2401.08281.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak,
Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre
Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill,
Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas
Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. The llama 3 herd of models, 2024a. URL https://arxiv.org/abs/2407.21783.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, et al. The llama 3 herd of models, 2024b. URL https://arxiv.org/abs/2407.21783.
Rick Durrett. Random graph dynamics, volume 20. Cambridge University Press, 2010.
W Ebeling and T Pöschel. Entropy and long-range correlations in literary english. Europhysics Letters (EPL), 26(4):241–246, May 1994. ISSN 1286-4854. doi: 10.1209/0295-5075/26/4/001. URL http://dx.doi.org/10.1209/0295-5075/26/4/001.
Albert Einstein. Die feldgleichungen der gravitation. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften, pages 844–847, 1915.
Albert Einstein. Kosmologische betrachtungen zur allgemeinen relativitätstheorie. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften, pages 142–152, 1917.
Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english?, 2023.
Vitaly Feldman. Does learning require memorization? a short tale about a long tail.
In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 954–959, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369794. doi: 10.1145/3357713.3384290. URL https://doi.org/10.1145/3357713.3384290.
Alexander Friedman. Über die krümmung des raumes. Zeitschrift für Physik, 10, 1922.
George Gamow. The evolutionary universe. Scientific American, 195(3):136–156, 1956.
Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683.
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024a. URL https://arxiv.org/abs/2410.07985.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.
Yanjun Gao, Chen Sun, and Rebecca J. Passonneau. Automated pyramid summarization evaluation. In Mohit Bansal and Aline Villavicencio, editors, Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 404–418, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/K19-1038.
URL https://aclanthology.org/K19-1038.
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024b. URL https://arxiv.org/abs/2312.10997.
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriele Synnaeve. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. In ICML, 2025.
Team Gemini. Gemini: A family of highly capable multimodal models, 2024. URL https://arxiv.org/abs/2312.11805.
Shashwat Goel, Rishi Hazra, Dulhan Hansaja Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, and Chenxi Whitehouse. Training AI Co-Scientists Using Rubric Rewards. ArXiv, abs/2512.23707, 2025.
Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2015. URL https://arxiv.org/abs/1312.6211.
Google. Gemini 2.0 flash thinking mode (gemini-2.0-flash-thinking-exp-1219), December 2024. URL https://cloud.google.com/vertex-ai/generative-ai/docs/thinking-mode.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A.
Smith, and Hannaneh Hajishirzi. Olmo: Accelerating the science of language models, 2024.
Stephen T Grossberg. Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control, volume 70. Springer Science & Business Media, 2012.
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://openreview.net/forum?id=AL1fq05o7H.
Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023. URL https://arxiv.org/abs/2308.08998.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023. URL https://arxiv.org/abs/2306.11644.
Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, 
Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, and Zhongzheng Ren. Apple intelligence foundation language models, 2024. URL https://arxiv.org/abs/2407.21075.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pages 3887–3896. PMLR, 2020.
Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. Continual pre-training of large language models: How to (re)warm your model?, 2023. URL https://arxiv.org/abs/2308.04014.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.740. URL https://aclanthology.org/2020.acl-main.740.
Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes, 2018. URL https://arxiv.org/abs/1709.08878.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
Harbor Framework Team. Harbor: A framework for evaluating and optimizing agents and models in container environments, January 2026. URL https://github.com/harbor-framework/harbor.
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun.
Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021b. URL https://arxiv.org/abs/2103.03874.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Xiaodong Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In NeurIPS, 2021c.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Remco van der Hofstad. Random Graphs and Complex Networks. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2016.
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. URL https://aclanthology.org/2023.acl-long.806.
Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling, 2025. URL https://arxiv.org/abs/2501.11651.
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification, 2018. URL https://arxiv.org/abs/1801.06146.
Shengran Hu, Cong Lu, and Jeff Clune. Automated Design of Agentic Systems. In ICLR, 2025.
Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. In NeurIPS, 2025.
Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.67. URL https://aclanthology.org/2023.emnlp-main.67.
Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, and Pengfei Liu. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai, 2024. URL https://arxiv.org/abs/2406.12753.
Edwin Hubble. A relation between distance and radial velocity among extra-galactic nebulae. Proceedings of the National Academy of Sciences, 15(3):168–173, 1929.
Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish.
Simple and scalable strategies to continually pre-train large language models, 2024. URL https://arxiv.org/abs/2403.08763.
Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. Rewarding chatbots for real-world engagement with millions of users, 2023. URL https://arxiv.org/abs/2303.06135.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.
Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. AIDE: AI-Driven Exploration in the Space of Code. ArXiv, abs/2502.13138, 2025.
Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URL https://github.com/KellerJordan/modded-nanogpt.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
Richard M Karp. The transitive closure of a random digraph. Random Structures & Algorithms, 1(1):73–93, 1990.
Ronald Kemker, Marc McClure, Angelina Abitino, Tyler L. Hayes, and Christopher Kanan. Measuring catastrophic forgetting in neural networks.
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklBjCEKvH.
Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infinite compute, 2025. URL https://arxiv.org/abs/2509.14786.
Kimi Team. Kimi K2: Open Agentic Intelligence. ArXiv, abs/2507.20534, 2025.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. doi: 10.1073/pnas.1611835114. URL https://www.pnas.org/doi/abs/10.1073/pnas.1611835114.
John Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Statistics and Computing, 4, 1994.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023a.
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023b. URL https://arxiv.org/abs/2309.06180.
Bespoke Labs.
Bespoke-stratos: The unreasonable effectiveness of reasoning distillation, 2025. URL https://hf.co/bespokelabs/Bespoke-Stratos-32B. Accessed: 2025-01-22.
Jakub L’ala, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, and Andrew D. White. PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. ArXiv, abs/2312.07559, 2023.
Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large memory layers with product keys. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8546–8557, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/9d8df73a3cfbf3c5b47bc9b50f214aff-Abstract.html.
Hunter Lang, Monica N Agrawal, Yoon Kim, and David Sontag. Co-training improves prompt-based learning for large language models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 11985–12003. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/lang22a.html.
Boaz Lavon, Shahar Katz, and Lior Wolf. Execution Guided Line-by-Line Code Generation. In NeurIPS, 2025.
Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. ICML 2013 Workshop: Challenges in Representation Learning, 2013.
Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. Evolving deeper llm thinking, 2025. URL https://arxiv.org/abs/2501.09891.
Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models.
In Handbook of Evolutionary Machine Learning. Springer, 2023.
Noam Levi. A simple model of inference scaling laws, 2024. URL https://arxiv.org/abs/2410.16377.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858.
Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, and Furu Wei. Synthetic data (almost) from scratch: Generalized instruction tuning for language models, 2024a. URL https://arxiv.org/abs/2402.13064.
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. Datacomp-lm: In search of the next generation of training sets for language models, 2024b.
Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath, 2024. URL https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf.
Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason E. Weston, Jack Lanchantin, and Tianlu Wang. Jointly Reinforcing Diversity and Quality in Language Model Generations. ArXiv, abs/2509.02534, 2025.
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023a.
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need i: phi-1.5 technical report, 2023b. URL https://arxiv.org/abs/2309.05463.
Zonglin Li, Ruiqi Guo, and Sanjiv Kumar. Decoupled context processing for context augmented language modeling. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=02dbnEbEFn.
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, Daniel A. McFarland, and James Zou. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, 2024.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/abs/2305.20050.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems, 2017. URL https://arxiv.org/abs/1705.04146.
Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical Representations for Efficient Architecture Search. In ICLR, 2018.
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023a. URL https://openreview.net/forum?id=xulyCXgIWH.
Hong Liu, Sang Michael Xie, Zhiyuan Li, and Tengyu Ma. Same pre-training loss, better downstream: implicit bias matters for language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023b.
Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Don’t throw away your value model! generating more preferable text with value-guided monte-carlo tree search decoding, 2024. URL https://arxiv.org/abs/2309.15028.
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, 2020a. URL https://arxiv.org/abs/2007.08124.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2020b. URL https://openreview.net/forum?id=SyxS0T4tvS.
Yixiu Liu, Yang Nan, Weixian Xu, Xiangkun Hu, Lyumanshan Ye, Zhen Qin, and Pengfei Liu. AlphaGo Moment for Model Architecture Discovery. ArXiv, abs/2507.18074, 2025.
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30:6467–6476, 2017.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Nicolaus Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering Preference Optimization Algorithms with and for Large Language Models. In NeurIPS, 2024a.
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Nicolaus Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. ArXiv, abs/2408.06292, 2024b.
Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, and Navdeep Jaitly. Rephrasing the web: A recipe for compute and data-efficient language modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14044–14072, Bangkok, Thailand, August 2024.
Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.757.
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. DiscoveryBench: Towards Data-Driven Discovery with Large Language Models. In ICLR, 2025.
Yu. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, 2018. URL https://arxiv.org/abs/1603.09320.
Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Gordon H. Bower, editor, Psychology of Learning and Motivation, volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press, 1989. doi: https://doi.org/10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368.
Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, Tolga Aktas, and Todd Hendry. Injecting new knowledge into large language models via supervised fine-tuning, 2024. URL https://arxiv.org/abs/2404.00213.
Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=-h6WAS6eE4.
Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MkbcAHIYgyS.
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi.
Factscore: Fine-grained atomic evaluation of factual precision in long form text generation, 2023. URL https://arxiv.org/abs/2305.14251.
Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. In International Conference on Learning Representations, 2022. URL https://openreview.net/pdf?id=0DcZxeWfOPt.
Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Dániel L. Barabási, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha S. Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas Ramos, Albert Bou, Kaleigh F. Roberts, Sladjana Zagorac, Timothy C. Orr, Miranda E. Orr, Kevin J. Zwezdaryk, Ali E. Ghareeb, Laurie McCoy, Bruna Gomes, Euan A Ashley, Karen E. Duff, Tonio Buonassisi, Tom Rainforth, Randall J. Bateman, Michael Skarlinski, Samuel G. Rodriques, Michaela M. Hinks, and Andrew D. White. Kosmos: An AI Scientist for Autonomous Discovery. ArXiv, abs/2511.02824, 2025.
Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=j5BuTrEj35.
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. Olmoe: Open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2409.02060.
Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, and Roberta Raileanu. MLGym: A New Framework and Benchmark for Advancing AI Research Agents. In COLM, 2025.
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng. Text and code embeddings by contrastive pre-training, 2022. URL https://arxiv.org/abs/2201.10005.
Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process., 4(2):4–es, May 2007. ISSN 1550-4875. doi: 10.1145/1233912.1233913. URL https://doi.org/10.1145/1233912.1233913.
Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, and Xian Li. Recycling the web: A method to enhance pre-training data quality and quantity for language models, 2025. URL https://arxiv.org/abs/2506.04689.
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav M. Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, Matej Balog, and Google Deepmind. Alphaevolve: A coding agent for scientific and algorithmic discovery. ArXiv, abs/2506.13131, 2025.
OpenAI.
Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo,
Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B.
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Fine-tuning or retrieval? comparing knowledge injection in llms, 2024. URL https://arxiv.org/abs/2312.05934.
David Owen. How predictable is language model benchmark performance?, 2024. URL https://arxiv.org/abs/2401.04757.
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel Bowman. QuALITY: Question answering with long input texts, yes!
In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5336–5358, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.391. URL https://aclanthology.org/2022.naacl-main.391.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.
Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Reuse, don’t retrain: A recipe for continued pretraining of language models, 2024. URL https://arxiv.org/abs/2407.07263.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. ArXiv, abs/2406.17557, 2024.
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023. URL https://arxiv.org/abs/2304.03277.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations, 2018. URL https://arxiv.org/abs/1802.05365.
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2024. URL https://arxiv.org/abs/2412.15115.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018a.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2018b. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250.
Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=GhVS8_yPeEa.
R. Ratcliff.
Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285–308, 1990. doi: 10.1037/0033-295X.97.2.285.
Esteban Real, Chen Liang, David R. So, and Quoc V. Le. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. In ICML, 2020.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024. URL https://arxiv.org/abs/2308.12950.
Yangjun Ruan, Neil Band, Chris J Maddison, and Tatsunori Hashimoto. Reasoning to learn from latent thoughts. arXiv preprint arXiv:2503.18866, 2025.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
Jeffrey C.
Schlimmer and Douglas Fisher. A case study of incremental concept induction. In Proceedings of the Fifth AAAI National Conference on Artificial Intelligence, AAAI’86, page 496–501. AAAI Press, 1986.
Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants. In Findings of EMNLP, 2025.
Raphael Schumann and Ines Rehbein. Active learning via membership query synthesis for semi-supervised sentence classification. In Mohit Bansal and Aline Villavicencio, editors, Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 472–481, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/K19-1044. URL https://aclanthology.org/K19-1044.
H. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371, 1965. doi: 10.1109/TIT.1965.1053799.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017. URL https://arxiv.org/abs/1701.06538.
Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?, 2024a. URL https://arxiv.org/abs/2404.10952.
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen-tau Yih, and Mike Lewis. In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=LXVswInHOo.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/0efbe98067c6c73dba1250d2beaa81f9-Paper.pdf.
Chenglei Si, Tatsunori Hashimoto, and Diyi Yang. The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas. ArXiv, abs/2506.20803, 2025a.
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers. In ICLR, 2025b.
Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022. URL https://arxiv.org/abs/2209.15189.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
David R. So, Chen Liang, and Quoc V. Le. The Evolved Transformer. In ICML, 2019.
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s Ability to Replicate AI Research. In ICML, 2025.
Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-cc: Transforming common crawl into a refined long-horizon pretraining dataset. arXiv preprint arXiv:2412.02595, 2024.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864.
Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. Scieval: A multi-level large language model evaluation benchmark for scientific research, 2024a. URL https://arxiv.org/abs/2308.13149.
Philip Sun, David Simcha, Dave Dopson, Ruiqi Guo, and Sanjiv Kumar. Soar: improved indexing for approximate nearest neighbor search. Advances in Neural Information Processing Systems, 36:3189–3204, 2023.
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): Rnns with expressive hidden states, 2024b. URL https://arxiv.org/abs/2407.04620.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks, 2014. URL https://arxiv.org/abs/1409.3215.
Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-Researcher: Autonomous Scientific Innovation. In NeurIPS, 2025.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2022. URL https://arxiv.org/abs/2009.06732.
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, and Zonghan Yang. Kimi k1.5: Scaling reinforcement learning with llms, 2025. URL https://arxiv.org/abs/2501.12599.
NovaSky Team. Sky-t1: Fully open-source reasoning model with o1-preview performance in $450 budget, 2025. URL https://novasky-ai.github.io/posts/sky-t1. Accessed: 2025-01-09.
Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024. URL https://qwenlm.github.io/blog/qwq-32b-preview/.
Thinking Machines Lab. Announcing Tinker, 2025.
Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yaohui Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Min Zhu, Kilian Adriano Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, E. A. Huerta, and Hao Peng.
SciCode: A Research Coding Benchmark Curated by Scientists. ArXiv, abs/2407.13168, 2024.
TogetherAI. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Mario Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, and Yoram Bachrach. AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench. ArXiv, abs/2507.02554, 2025.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Shengyi Huang, Kashif Rasul, Alvaro Bartolome, Alexander M. Rush, and Thomas Wolf. The Alignment Handbook, 2023. URL https://github.com/huggingface/alignment-handbook.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data, 2024.
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, and Samuel R. Bowman. SQuALITY: Building a long-document summarization dataset the hard way. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1139–1156, Abu Dhabi, United Arab Emirates, December 2022.
Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.75. URL https://aclanthology.org/2022.emnlp-main.75.
Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024a. URL https://arxiv.org/abs/2312.08935.
Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. SciMON: Scientific Inspiration Machines Optimized for Novelty. In ACL, 2024b.
Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. From LSAT: The progress and challenges of complex reasoning, 2021. URL https://arxiv.org/abs/2108.00648.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=1PL1NIMMrw.
Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement Learning for Reasoning in Large Language Models with One Training Example. In NeurIPS, 2025.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev.
Helpsteer2: Open-source dataset for training top-performing reward models, 2024c. URL https://arxiv.org/abs/2406.08673.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9781713871088.
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In NUT@EMNLP, 2017.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. Ccnet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019.
Hjalmar Wijk, Tao Roa Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Jun Koba Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts. ArXiv, abs/2411.15114, 2024.
Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang.
Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models, 2024. URL https://arxiv.org/abs/2408.00724.
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2020. doi: 10.1109/CVPR42600.2020.01070.
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Self-evaluation guided beam search for reasoning, 2023. URL https://arxiv.org/abs/2305.00633.
Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. Deepseek-prover: Advancing theorem proving in llms through large-scale synthetic data, 2024. URL https://arxiv.org/abs/2405.14333.
Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, and Debing Zhang. Redstar: Does scaling long-cot data unlock better slow-reasoning systems?, 2025. URL https://arxiv.org/abs/2501.11284.
I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification, 2019. URL https://arxiv.org/abs/1905.00546.
Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Nicolaus Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. ArXiv, abs/2504.08066, 2025.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. ArXiv, abs/2409.12122, 2024.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyan Lin, Kai Dang, Keqin Bao, Ke-Pei Yang, Le Yu, Li-Chun Deng, Mei Li, Min Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shi-Qiang Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yi-Chao Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 Technical Report. ArXiv, abs/2505.09388, 2025a.
Sherry Yang, Joy He-Yueya, and Percy Liang. Reinforcement Learning for Machine Learning Engineering Agents. ArXiv, abs/2509.01684, 2025b.
Zitong Yang, Michal Lukasik, Vaishnavh Nagarajan, Zonglin Li, Ankit Rawat, Manzil Zaheer, Aditya K Menon, and Sanjiv Kumar. Resmem: Learn what you can and memorize the rest. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 60768–60790. Curran Associates, Inc., 2023a. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/bf0857cb9a41c73639f028a80301cdf0-Paper-Conference.pdf.
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, and Tatsunori Hashimoto.
Synthetic continued pretraining. In The Thirteenth International Conference on Learning Representations, 2025c. URL https://openreview.net/forum?id=07yvxWDSla.
Dong Yuan, Eti Rastogi, Gautam Naik, Sree Prasanna Rajagopal, Sagar Goyal, Fen Zhao, Bharath Chintagunta, and Jeff Ward. A continued pretrained llm approach for automatic medical note generation, 2024a. URL https://arxiv.org/abs/2403.09057.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024b. URL https://arxiv.org/abs/2401.10020.
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017.
Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search, 2024a. URL https://arxiv.org/abs/2406.03816.
Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using Large Language Models for Hyperparameter Optimization. ArXiv, abs/2312.04528, 2023a.
Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. Planning with large language models for code generation, 2023b. URL https://arxiv.org/abs/2303.05510.
Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024b. doi: 10.1162/tacl_a_00632. URL https://aclanthology.org/2024.tacl-1.3.
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025. URL https://arxiv.org/abs/2506.05176.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel. Proc. VLDB Endow., 16(12):3848–3860, August 2023. ISSN 2150-8097. doi: 10.14778/3611540.3611569. URL https://doi.org/10.14778/3611540.3611569.
Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. In Findings of ACL, 2024.
Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. Jec-qa: A legal-domain question answering dataset, 2019. URL https://arxiv.org/abs/1911.12011.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023a. URL https://arxiv.org/abs/2304.06364.
Zexuan Zhong, Zhengxuan Wu, Christopher Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15686–15702, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.971. URL https://aclanthology.org/2023.emnlp-main.971.
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang.
Language agent tree search unifies reasoning, acting, and planning in language models, 2024. URL https://arxiv.org/abs/2310.04406.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment, 2023. URL https://arxiv.org/abs/2305.11206.
Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. Modifying memories in transformer models, 2020.
Minjun Zhu, Yixuan Weng, Linyi Yang, and Yue Zhang. DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process. In ACL, 2025.
Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning Transferable Architectures for Scalable Image Recognition. In CVPR, 2017.
Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Self-adapting language models, 2025. URL https://arxiv.org/abs/2506.10943.
Zyphra. Zyda-2, a 5 trillion token high-quality dataset, 2024. URL https://huggingface.co/datasets/Zyphra/dclm-dedup.