
Paper deep dive

A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 138

Abstract

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links



Full Text



A Family of LLMs Liberated from Static Vocabularies

Aleph Alpha Research

A detailed author list can be found in the appendix of this paper.
[Figure 1: (Left) The HAT architecture has three components: an encoder, backbone, and decoder, each implemented as a transformer. A full overview can be found in Figure 2, while the encoder and decoder are detailed in Figures 3a and 3b, respectively. (Right) Average performance and compression for Llama-3.1-8B-TFree-HAT on benchmarks detailed in §7.]

arXiv:2603.15953v1 [cs.CL] 16 Mar 2026

1 Introduction

We introduce a series of language models, one pre-trained entirely from scratch and others based on Llama 3.1 8B or 70B [24], augmented with a novel tokenizer replacement: the hierarchical autoregressive transformer (HAT) architecture, originally described by Neitemeier et al. [45] and further extended in this work. HAT integrates byte-level encoding and decoding with a word-level transformer backbone¹. This hierarchical structure provides potential advantages: (1) increased robustness to prompt perturbations; and (2) improved adaptability to new data, e.g., domains and languages, through continued training. A foundational innovation in our work is the extension of a ‘tokenizer-free’ (T-Free) approach to LLM training and inference by splitting raw byte data into variable-length chunks rather than mapping inputs to a fixed vocabulary.
While this method could technically be viewed as a form of tokenization, we argue in §3.3 that it differs meaningfully in practice, especially in how it avoids large embedding tables and allows models to exploit similarity between chunks of bytes. To clarify this distinction, we provide formal definitions of both classical tokenization and our alternative perspective. We pre- and post-trained our models in English and German on curated corpora. To encourage helpfulness and instruction adherence, we performed direct-preference optimization (DPO). This makes the model more suitable for real-world applications, reducing the likelihood of unnecessary refusals. Notably, these models demonstrate strong performance in German, while also outperforming the original Llama 3.1 models on many English-language benchmarks. It is important to note that we did not optimize these models for code generation or mathematical reasoning; accordingly, we do not extensively evaluate them on those tasks. While the HAT architecture provides intrinsic efficiency benefits, real-world inference speed is significantly influenced by the quality of the inference implementation. We report our work on incorporating vLLM inference for our models in §6. 
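The chunking idea, splitting raw UTF-8 bytes into variable-length word chunks instead of looking tokens up in a fixed vocabulary, can be illustrated with a minimal sketch. The whitespace-based rule below is an illustrative assumption for this example only; the paper's actual splitting logic is described in §3.2.

```python
def split_into_word_chunks(text: str) -> list[bytes]:
    """Split a UTF-8 byte stream into variable-length chunks at spaces.

    Illustrative sketch only: the real splitting logic (paper §3.2)
    handles more cases (punctuation, scripts without spaces, length caps).
    """
    chunks, current = [], bytearray()
    for byte in text.encode("utf-8"):
        current.append(byte)
        if byte == ord(" "):          # a space closes the current chunk
            chunks.append(bytes(current))
            current = bytearray()
    if current:                       # flush the trailing chunk
        chunks.append(bytes(current))
    return chunks

chunks = split_into_word_chunks("hat on a llama")
# Each chunk is an arbitrary-length byte sequence; no vocabulary lookup
# is involved, so unseen words never map to an "unknown" token.
```

Note that each chunk retains its raw bytes, which is what lets the model exploit byte-level similarity between related spellings of a word.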
Contributions In this report:

1. we show that pre-trained tokenizer-free approaches can be competitive with tokenized equivalents (including when trained on one-third of the total pre-training data budget of Llama 3.1 models);
2. we demonstrate that a pre-trained model’s tokenizer can be successfully replaced with a tokenizer-free approach – a method which we here dub HATification – while improving downstream performance and compression ratios and, in the case of the 8B model, reducing the total number of model parameters by more than 10%;
3. we outline a pipeline for LLM development from data curation and pre-training to post-training and inference;
4. we make our models publicly available to the research community in order to contribute to further advancements in tokenizer-free large language models (LLMs). Specifically, we release checkpoints (i) for our base models² (including 200 pre-training checkpoints over training for future study and use by the community), (ii) with supervised fine-tuning³, and (iii) with direct preference optimization⁴; and
5. we introduce our evaluation framework⁵ and vLLM inference contributions for tokenizer-free model inference⁶, both of which we make available under Apache 2.0.

¹ By ‘backbone’ we mean the transformer blocks with the embedding matrix and head removed, adapted to handle the word embeddings instead of the subword tokens.
² https://huggingface.co/Aleph-Alpha/tfree-hat-pretrained-7b-base and https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-base
³ https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-sft and https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft
⁴ https://huggingface.co/Aleph-Alpha/llama-tfree-hat-pretrained-7b-dpo and https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo
⁵ https://github.com/Aleph-Alpha-Research/eval-framework
⁶ https://github.com/Aleph-Alpha/vllm

2 General Overview

Our models adopt and extend the HAT architecture [45], which enhances byte-level language modeling with intermediate word-level representations. The architecture consists of three components—encoder, backbone, and decoder—each implemented as an autoregressive transformer, together with connector layers between components. The encoder operates on UTF-8 byte sequences using local causal attention and aggregates byte-level embeddings into word-level embeddings via cross-attention with learned queries. These embeddings are processed by the backbone, a standard causal transformer operating at word resolution, to produce contextual word representations. The decoder then generates next-byte predictions using both byte-level context and cross-attention to the backbone’s word-level outputs. This architecture enables efficient long-context modeling by compressing input through word-level abstraction while preserving fine-grained detail at the byte level—which we hypothesize also improves multilingual adaptation. At the start of the training process, we randomly initialized the encoder, decoder, and connector layers. For Llama-TFree-HAT-Pretrained, we pre-trained from scratch (i.e. randomly initialized the backbone), whereas for Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT we initialized the backbone from pre-trained Llama 3.1 weights [24].
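The encoder-to-backbone aggregation described above, where a learned query cross-attends to one word's byte embeddings to produce a single word embedding, can be sketched with numpy. The dimensions, single attention head, and random initialization here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pool_word_embedding(byte_embs: np.ndarray, query: np.ndarray,
                        W_k: np.ndarray, W_v: np.ndarray) -> np.ndarray:
    """Aggregate one word's byte embeddings into a single word embedding
    via single-head cross-attention with a learned query (toy sketch)."""
    keys = byte_embs @ W_k                       # (n_bytes, d)
    values = byte_embs @ W_v                     # (n_bytes, d)
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores)                    # attention over the bytes
    return weights @ values                      # (d,) word embedding

rng = np.random.default_rng(0)
d = 16
byte_embs = rng.normal(size=(4, d))   # encoder outputs for a 4-byte word
query = rng.normal(size=(d,))         # learned latent word query
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
word_emb = pool_word_embedding(byte_embs, query, W_k, W_v)
```

The key property is that the output has a fixed dimension regardless of how many bytes the word contains, which is what lets the backbone operate at word resolution.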
In the first pre-training phase, we trained our models using a next-byte prediction objective with sequences up to 3,500 words⁷. In this first phase, Llama-TFree-HAT-Pretrained was trained for nearly 4T words, Llama-3.1-8B-TFree-HAT for 134B words, and Llama-3.1-70B-TFree-HAT for 108B words; for comparison, we note that Llama 3.1 models were trained with 15T tokens, which is approximately 12T words. We then continued training with longer sequences of words, emphasizing longer documents to adapt to extended context⁸. In this second phase, Llama-TFree-HAT-Pretrained was trained on sequences up to 32,768 words for 10.5B words, Llama-3.1-8B-TFree-HAT on sequences up to 32,768 words for 20B words, and Llama-3.1-70B-TFree-HAT on sequences up to 16,000 words for 10.2B words. For Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT, during initial pre-training, we kept the backbone frozen for the first 2,000 steps and then trained with a reduced learning rate, while the encoder and decoder followed a warmup-stable-decay schedule. For long-context adaptation, we kept most parameters frozen except query and key projections in attention layers, using a smaller learning rate and longer data sequences, while maintaining a similar data mix [20]. We evolved the inference implementation of HAT from a simple HuggingFace-based prototype to a production-ready system built on vLLM [33], adapted for batched serving. Integrating HAT into vLLM surfaced architectural challenges stemming from HAT’s hierarchical design, particularly its dual-sequence processing and variable-length byte-level generation. These posed significant obstacles for batching, requiring modifications to the scheduling strategy to balance backbone utilization and latency while managing asynchronous generation patterns within a batch.
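A warmup-stable-decay schedule, as used above for the encoder and decoder, can be sketched as a piecewise function. The step counts and peak rate below are invented for illustration and are not the paper's hyperparameters.

```python
def warmup_stable_decay(step: int, peak_lr: float, total_steps: int,
                        warmup_steps: int, decay_steps: int) -> float:
    """Piecewise learning rate: linear warmup to peak_lr, a constant
    plateau, then linear decay to zero over the final decay_steps
    (sketch; all hyperparameters here are illustrative)."""
    if step < warmup_steps:                    # linear warmup
        return peak_lr * step / warmup_steps
    if step < total_steps - decay_steps:       # stable plateau
        return peak_lr
    remaining = total_steps - step             # linear decay to zero
    return peak_lr * max(remaining, 0) / decay_steps

# Example: peak 1e-4, 1,000 total steps, 100 warmup, 200 decay.
lrs = [warmup_stable_decay(s, 1e-4, 1000, 100, 200) for s in range(1001)]
```

A practical property of this schedule is that intermediate plateau checkpoints (such as the 200 released pre-training checkpoints) are all trained at the same stable learning rate, so decay can be re-run from any of them.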
A key innovation was the management of dual key-value (KV) caches – one for byte-level and one for word-level sequences – necessitating careful memory coordination due to their asymmetric and interdependent resource demands. Throughout our implementation, we prioritized compatibility with vLLM’s core to minimize invasive changes, enabling scalable deployment without compromising the hierarchical model’s unique execution semantics. Post-training involved supervised fine-tuning (SFT) on a diverse mix of ~2M samples, including synthetic responses (especially in German) generated and filtered using strong open models, as well as human-written data. Finally, we performed alignment using DPO, with careful filtering of preference pairs to improve helpfulness and safety. Averaged over our evaluations, our models achieve strong scores relative to Llama 3.1 equivalents.

⁷ We also set an upper bound of 28,000 bytes, i.e., an average of 8 characters per word.
⁸ There is no fixed byte sequence length as words have different byte lengths, but we enforce an upper bound on byte sequence length of 28,000 bytes during initial pre-training and 262,144 bytes for the long context adaptation.
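The asymmetry between the two caches comes from sequence length: a byte-level sequence is several times longer than its word-level counterpart. A toy budget split that captures this is sketched below; the proportional-demand heuristic, block counts, and per-position costs are invented for illustration and do not reflect the actual vLLM implementation.

```python
def split_kv_budget(total_blocks: int, bytes_per_word: float,
                    byte_cost: int, word_cost: int) -> tuple[int, int]:
    """Split a shared KV-cache block budget between a byte-level and a
    word-level cache in proportion to expected demand (toy sketch).

    A word position occurs ~1/bytes_per_word as often as a byte
    position, so per unit of text the byte cache needs
    bytes_per_word * byte_cost blocks for every word_cost blocks
    the word cache needs.
    """
    byte_demand = bytes_per_word * byte_cost
    word_demand = 1.0 * word_cost
    byte_blocks = round(total_blocks * byte_demand / (byte_demand + word_demand))
    return byte_blocks, total_blocks - byte_blocks

# E.g. ~5 bytes per word, word-level KV 4x wider per position.
byte_blocks, word_blocks = split_kv_budget(1000, 5.0, 1, 4)
```

The interdependence noted in the text (each word-level entry is only useful together with its byte-level context) is why the real scheduler has to coordinate both caches rather than size them independently like this.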
[Figure 2: Overview of our model architecture, illustrated on the input text “Hat_on_a_Llama”. The encoder and decoder are detailed in Figures 3a and 3b respectively. The encoder processes the input text, producing word embeddings w_k, which are then processed by the backbone to produce next-word predictions ŵ_{k+1}. The decoder uses these predictions along with the encoder’s byte-level outputs b to generate byte-level logits.]
3 Model

We detail the architecture of our models in Section 3.1, the approach for splitting byte sequences into words in Section 3.2, the difference between fixed tokenization and our approach in Section 3.3, and infrastructure and code optimizations in Section 3.4.

3.1 Architecture

Our models use a modified HAT architecture [45] consisting of three components: encoder, backbone, and decoder, together with connector layers between components (see Figure 2 and Figure 3). Encoder, backbone, and decoder are all instances of dense autoregressive transformers with pre-norm residual blocks in the style of Llama, using a SwiGLU unit as a feed-forward block, with all model parameters active during training and inference. The backbone model uses standard causal attention, while the encoder and decoder use local causal attention with a finite look-back window. The encoder processes input text as a sequence of UTF-8 bytes and produces a sequence b_i of byte embeddings of the same length. This sequence is then split into chunks corresponding to words or other semantic units in the text (we refer to §3.2 for details). In the encoder-backbone connector layer, for each word, a learned latent vector cross-attends to its corresponding chunk of encoder activations. The resulting sequence of latent vectors w_k then serves as input to the backbone. The backbone processes this latent sequence and produces a sequence of next word-level representations ŵ_{k+1}. Finally, the decoder acts on the encoder’s byte-level activations b and has an LM head that produces next-byte probabilities. To make use of the higher-level information stored in the backbone’s next word-level embeddings ŵ_{k+1} during decoding, another cross-attention mechanism is used. Specifically, in each transformer block of the decoder, every byte-level position cross-attends to the backbone’s next word-level representations that correspond to the word preceding this byte.
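The decoder's cross-attention rule, where each byte position attends to the backbone output for the word preceding it, reduces to a byte-to-word index map. A minimal sketch is below; representing words by their byte lengths is an illustrative simplification.

```python
def preceding_word_index(word_byte_lengths: list[int]) -> list[int]:
    """For each byte position, return the index of the last fully
    completed word before that byte (sketch; -1 means no word has
    finished yet, matching the dummy prediction at the start).

    Under this scheme, every byte inside word k cross-attends to the
    backbone representation produced after word k-1.
    """
    indices, completed = [], -1
    for k, length in enumerate(word_byte_lengths):
        indices.extend([completed] * length)  # all bytes of word k
        completed = k                         # word k is now complete
    return indices

# Words "hat ", "on ", "a " have byte lengths 4, 3 and 2.
idx = preceding_word_index([4, 3, 2])
```

This preserves causality at the word level: a byte never sees backbone information derived from its own, still-incomplete word.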
The encoder and decoder architectures are detailed in Figure 3, and a full architecture overview is shown in Figure 2.

Model details. We present details on the architecture of Llama-TFree-HAT-Pretrained and Llama-3.1-8B-TFree-HAT in Table 1, and of Llama-3.1-70B-TFree-HAT in Table 2.

[Figure 3(a): Encoder architecture. Input text is encoded in UTF-8, embedded into the model's vector space, and passed through a causal local-sliding-window-attention transformer. The byte embeddings b_i are grouped into words according to the split logic. The byte embeddings of each word are aggregated into a single word embedding w_k via cross-attention with a learned query vector.]
[Figure 3(b): Decoder architecture. The decoder further processes the byte-level hidden states b_i for each word from the encoder with alternating cross-attention and transformer layers. The cross-attention layer uses the shifted next-word predictions ŵ_{k+1} from the backbone as keys and values, and the byte-level hidden states b_i as queries.]
The final hidden states are then passed through a language-modeling head to produce byte-level logits.

Figure 3: Visualization of the encoder and decoder of the HAT model.

We note that Llama-3.1-8B-TFree-HAT substantially reduces the parameter count relative to Llama 3.1 8B and is closer to a 7B than an 8B model: Llama 3.1 8B has 8,030,261,248 parameters in total, while Llama-3.1-8B-TFree-HAT has only 7,192,495,104. The reduction is proportionally smaller for Llama-3.1-70B-TFree-HAT, which has 69,302,847,488 parameters versus 71,604,379,648 for Llama 3.1 70B. Llama-3.1-8B-TFree-HAT and Llama 3.1 8B share the same backbone with 32 × 218,112,000 = 6,979,584,000 parameters, whereas Llama-3.1-70B-TFree-HAT and Llama 3.1 70B share a backbone with 68,452,352,000 parameters. In both cases, the Llama 3.1 models have a vocabulary size of 128,256. Llama 3.1 8B has a hidden dimension of 4,096, giving a token embedding matrix with 525,336,576 parameters; Llama 3.1 70B has a hidden dimension of 8,192, giving a token embedding matrix with 1,050,673,152 parameters. Llama-3.1-8B-TFree-HAT replaces this embedding matrix with an encoder of only 119,291,904 parameters in total. Similarly, the language-model head of Llama 3.1 8B, with 525,336,576 parameters, is replaced by a decoder with a total of 93,619,200 parameters. This reduces the share of non-backbone weights from 13% in Llama 3.1 8B to less than 3% in the HATified model. We take the competitive performance of Llama-3.1-8B-TFree-HAT as an indicator that the embedding matrix and language-model head of Llama 3.1 8B are over-parametrized, presumably because they do not take token similarity into account. Architecturally, Llama-TFree-HAT-Pretrained and Llama-3.1-8B-TFree-HAT are almost identical.
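These counts can be cross-checked with a few lines of arithmetic (all figures are the ones reported above; the snippet is merely a sanity check):

```python
# Cross-check the reported parameter counts for Llama 3.1 8B vs. Llama-3.1-8B-TFree-HAT.
vocab, hidden = 128_256, 4_096

backbone = 32 * 218_112_000            # 32 shared transformer blocks
embedding = vocab * hidden             # Llama 3.1 8B token embedding matrix
assert backbone == 6_979_584_000
assert embedding == 525_336_576

llama_total, hat_total = 8_030_261_248, 7_192_495_104

# Share of non-backbone weights: ~13% for Llama 3.1 8B, under 3% for the HAT model.
llama_share = (llama_total - backbone) / llama_total
hat_share = (hat_total - backbone) / hat_total
print(f"{llama_share:.2%} vs. {hat_share:.2%}")
```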
However, there are two small differences: unlike Llama-3.1-8B-TFree-HAT, Llama-TFree-HAT-Pretrained (1) adds QK-norm (per head) and (2) used attention-logit softcapping at 100 during pre-training (but not during long-context adaptation or post-training, i.e., there is no inference-relevant change); we found both to be important for training stability.

Table 1: Encoder, backbone, and decoder module configurations of Llama-TFree-HAT-Pretrained and Llama-3.1-8B-TFree-HAT.

                             Encoder               Backbone        Decoder
# of layers                  6                     32              4
# of self-attention heads    8                     32              8
Head size                    128                   128             128
# of key-value heads         8                     8               8
Hidden size                  1,024                 4,096           1,024
Cross-attention hidden size  4,096                 –               4,096
# of cross-attention heads   32                    –               8
MLP expansion factor         2.75                  3.5             2.75
MLP type                     SwiGLU                SwiGLU          SwiGLU
Sequence length              262,144               32,900          262,144
Position embeddings          RoPE base 1e5         RoPE base 5e5   RoPE base 1e5
Attention type               causal + local (768)  causal          causal + local (768)
Number of parameters         119,291,904           6,979,584,000   93,619,200

Table 2: Encoder, backbone, and decoder module configurations of Llama-3.1-70B-TFree-HAT.

                             Encoder               Backbone        Decoder
# of layers                  6                     80              4
# of self-attention heads    16                    64              16
Head size                    128                   128             128
# of key-value heads         16                    8               16
Hidden size                  2,048                 8,192           2,048
Cross-attention hidden size  8,192                 –               2,048
# of cross-attention heads   64                    –               16
MLP expansion factor         2.75                  3.5             2.75
MLP type                     SwiGLU                SwiGLU          SwiGLU
Sequence length              98,304                12,288          98,304
Position embeddings          RoPE base 1e5         RoPE base 5e5   RoPE base 1e5
Attention type               causal + local (768)  causal          causal + local (768)
Number of parameters         476,610,560           68,452,352,000  373,884,928

3.2 Word splitter

To split arbitrary byte sequences, we adopted the guidelines from UAX#29⁹, which split text into words for common Western languages but also produce meaningful semantic units for other common languages (e.g., Chinese, Japanese, Korean). From now on, we refer to these splits as words. We also merged leading whitespace and trailing punctuation into the words to reduce sequence length at the word level.
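A loose Python approximation of such a rule-based splitter is sketched below. It is not UAX#29-conformant (unlike the actual implementation) and only illustrates the whitespace- and punctuation-merging behavior, together with a simple camel-case split:

```python
import re

def split_words(text: str) -> list[str]:
    """Toy splitter: merge leading whitespace and trailing punctuation into
    words, and break camel case. Not UAX#29-conformant."""
    # A chunk is optional leading whitespace + word characters + trailing
    # punctuation; runs of other symbols form their own chunks.
    pieces = re.findall(r"\s*\w+[^\w\s]*|\s*[^\w\s]+", text)
    words: list[str] = []
    for piece in pieces:
        # Split camel case, e.g. "FooBar" -> "Foo", "Bar".
        words.extend(re.split(r"(?<=[a-z])(?=[A-Z])", piece))
    return words

print(split_words("The FooBar, split."))
```

Note that the resulting chunks concatenate back to the original text, which is exactly the defining property of a splitting rule given in Section 3.3.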
Our splitter also splits camel case like "FooBar" into "Foo" and "Bar", and treats mathematical symbols (as defined in the Unicode standard) as separate words. To minimize the computational overhead of word splitting, we developed an efficient implementation in Rust with Python bindings, which we make publicly available as hat-splitter¹⁰.

⁹ https://unicode.org/reports/tr29/
¹⁰ https://github.com/Aleph-Alpha-Research/hat-splitter
¹¹ https://openreview.net/forum?id=tU074jg2vS

3.3 Tokenization vs. Splitting

As noted by the reviewers¹¹ of Neitemeier et al. [45], the basis of our work, one can interpret our approach as rule-based tokenization. Therefore, and given the potential ambiguity of what 'tokenization' as a process means, we give practical definitions of classical tokenization and of our alternative, a split-based pre-processing approach that we denote as "tokenizer-free".

Definition 1 (General tokenizer). Let X be any input space and V = {0, ..., N} a finite vocabulary. Then a tokenizer is a map τ : X → V⁺, with V⁺ the set of sequences of one or more elements of V.

Here, X could be any space of data, for example text or images. We denote by B = {0, ..., 255} the byte vocabulary. For language models, we assume that the raw data space is given by X = B^L for some large but finite value of L. A standard tokenizer with a token vocabulary V, e.g., one learned via byte-pair encoding (BPE) [21], can then naturally be interpreted as a map B^L → V^{≤L}, i.e., it produces a sequence of at most L tokens. Without loss of generality, we restrict ourselves to tokenizers that always produce a token sequence no longer than the original byte sequence.¹² In contrast, we define a splitting rule as follows:

Definition 2 (Splitting rule). For a given sequence x ∈ B^L, we define P(x) as the set of all ordered tuples of non-empty contiguous byte subsequences whose concatenation equals x.
Then, with B^{≤L} being the space of byte sequences of length at most L, a splitting rule is a map s : B^L → (B^{≤L})^{≤L} such that s(x) ∈ P(x) for all x.

One could argue that tokenization and splitting are conceptually the same: classical tokenizers such as BPE learn a vocabulary of subwords into which the byte sequence is split, and if we define a vocabulary V = B^{≤L}, splitting rules are standard tokenizers. Although technically correct, this equivalence is not useful in practice, as one would need an embedding table with |B^{≤L}| ≈ 256^L rows¹³, which already for L = 3 translates to 16 million entries, two orders of magnitude more than the vocabulary size of state-of-the-art LLMs, and computationally intractable. Additionally, such an interpretation treats all possible byte chunks as categorically different, whereas natural notions of distance (e.g., Jaccard similarity) exist between byte chunks and can be leveraged by a suitable architecture such as ours.

Traditional tokenizer-based models first segment text into subwords and then map each subword to an integer token ID. These IDs are used to index a lookup table. Because the IDs are assigned arbitrarily, this representation does not encode any intrinsic similarity between tokens at initialization. For example, the word "car" might be mapped to token ID 7063 and "cars" to token ID 51808; their embeddings are therefore unrelated prior to training. In contrast, architectures such as ours compute embeddings directly from raw byte information rather than from arbitrarily assigned token IDs. This makes character-level structure available to the model and allows it to exploit shared byte patterns. For instance, the overlap between the characters in "car" and "cars" can contribute to more similar representations even before any learning has taken place. This allows the model to generalize better to unseen or rare words and to capture fine-grained linguistic patterns by design.
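To make the notion of byte-level similarity concrete, consider the Jaccard similarity between the sets of bytes in two words (a toy illustration, not part of the model):

```python
def byte_jaccard(a: bytes, b: bytes) -> float:
    """Jaccard similarity between the sets of distinct bytes in two words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

print(byte_jaccard(b"car", b"cars"))  # 0.75: three of four distinct bytes shared
print(byte_jaccard(b"car", b"dog"))   # 0.0: no bytes shared
```

A byte-level encoder can pick up on this overlap from initialization, whereas the arbitrary token IDs 7063 and 51808 carry no such signal.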
3.4 Infrastructure and Code Optimizations

The HAT architecture differs significantly from standard tokenizer-based language models, and our codebase accounts for this with a number of targeted code optimizations. In tokenizer-based transformer architectures, every transformer layer has the same compute pattern, i.e., model dimensions and sequence lengths stay constant. In contrast, the encoder and decoder layers of a HAT model have far fewer parameters than the backbone layers but require more activation memory, since they operate at the byte level. This introduces a trade-off: as the number of bytes in a batch grows, encoder and decoder layers can require more GPU memory than the backbone due to activations, but they are less compute-bound due to their smaller hidden dimensions and fewer layers.

¹² The more general case is a tokenizer that creates a bounded number of tokens per byte, most of the time reducing the sequence length; some tokenizers do create additional tokens, such as "Hat" → "<capital>hat".
¹³ The output space of possible chunks could in theory be a smaller subset of B^{≤L}, for example when using a tokenizer as the splitting rule. Our word splitter (Section 3.2), however, can indeed produce every possible chunk.

We trained our final model with a pipeline-parallel degree of 2 (with an equal number of parameters in each pipeline stage), combined with data parallelism and ZeRO-1 sharding [53] (i.e., we only shard optimizer states). Since we train in mixed precision [40] with bfloat16 weights and gradients, we keep float32 parameter copies in the optimizer states. We run micro-batches of size 1 and use gradient accumulation to hide pipeline-parallelism bubbles.

We also ran into illegal memory accesses in our cross-attention layers due to non-traditional shapes that are not supported by the flash attention kernel [16]. In particular, our cross-attention layers work with query projections that contain as many as 80,000 sequences of length 1.
This breaks the assumptions of the flash attention kernel, which expects fewer, longer sequences. In the backward pass, flash attention launches parallel threads in proportion to the number of sequences, which in our use case can exceed the number of available threads on the GPU. To mitigate this, we adapted the flash attention kernel to check the number of sequences in the query projection during the backward pass and to cap the number of parallel thread launches, avoiding the illegal memory accesses. This was crucial for enabling pre-training with reasonable context lengths (≈3,500 words for the backbone).

HAT also presents a challenge for long-context adaptation due to the non-constant sequence length during the forward pass (the encoder operates on bytes, then the backbone on words, then the decoder again on bytes). Usually, for long-context adaptation, an increase in sequence length leads to more memory consumption for activations, which becomes a bottleneck. Sharding the sequence across multiple GPUs, that is, keeping different tokens on different GPUs, alleviates this bottleneck; this technique is commonly known as context parallelism [63]. In context parallelism, the self-attention operation must communicate between different context-parallel ranks, since self-attention requires pairwise interactions between all tokens in a sequence. Recent works such as ring attention [36] and striped attention [9] have proposed methods to do this efficiently by overlapping compute with communication. Naïvely running context parallelism via ring attention [36] would require first calling an all_concat along the sequence dimension after the encoder layers, resharding the sequence along words, calling all_concat again after the backbone, and finally resharding along bytes for the decoder. This massively slows down training, leading to suboptimal GPU utilization.
Moreover, publicly available implementations of ring attention (and of more efficient variants such as striped or zigzag¹⁴ attention [9]) do not support sliding-window attention, which prevents their use in the encoder and decoder layers of a HAT model. Due to these complexities, we opted not to employ context parallelism and instead used model parallelism to shard activation memory for long contexts when necessary.

¹⁴ https://github.com/zhuzilin/ring-flash-attention

4 Pre-training

4.1 Pre-training data

We trained our models on a filtered subset of diverse corpora of German and English text, including proprietary curated datasets, publicly available high-quality web content, public-domain sources, mathematical texts, and programming code, as summarized in Table 3. As part of our training recipe, we up-weighted the proportion of English data, a common practice in multilingual LLM training due to English's broad task coverage and higher availability of quality data, both essential for effective cross-lingual transfer [54]. In addition, we included a substantial amount of mathematics and code data, which has become a standard inclusion even when the model is intended only for natural language, as it has been shown to improve performance on a range of non-coding and non-mathematical downstream tasks when included during pre-training [38, 3, 32, 57]. We also augmented the organic data with synthetic data generated by permissively licensed LLMs.

To ensure pre-training data quality, we applied a range of curation techniques, as suggested by state-of-the-art pre-training data curation methods [61, 11]. These include but are not limited to:

Table 3: Proportions and sources of data used in pre-training.
Category           Source                                        Proportion
English (70%)      Curated web and synthetic data                63.0%
                   Curated sources (e.g., public domain books)    7.0%
German (7%)        Curated web and synthetic data                 6.3%
                   Curated sources (e.g., public domain books)    0.7%
Mathematics (5%)   Mathematical code and proofs                   2.0%
                   Mathematical word problems and equations       3.0%
Programming (18%)  General programming code                      11.0%
                   High-quality and synthetic Python code         7.0%

• URL filtering. We used a URL filter developed to remove fraudulent, harmful, and illegal content based on an explicit blocklist (e.g., adult or copyright-infringing websites) and on URLs containing words associated with fraudulent, harmful, or adult content.
• Text extraction. We extracted natural-language text embedded in HTML and other web markup using the Resiliparse text extractor [5].
• Language identification. We used a fastText language classifier trained on character n-grams from publicly available data [7] to identify, retain, and sort texts into English and German.
• Repetition removal. We applied heuristics for detecting and removing documents containing repetitions at the paragraph, line, word, and character level.
• Document- and line-level filtering. We relied on additional document-level heuristics to ensure documents had a reasonable number and quality of words, naturalistic symbol-to-word and number-to-word ratios, were not predominantly made up of bullet points (to avoid, e.g., table-of-contents data or website menus), and contained a sufficient quantity of real words.
• Deduplication. We removed duplicate documents via exact and fuzzy deduplication.
• Model-based filtering. We trained and used various models to assess the quality and characteristics of documents, filtering out low-quality and less informative data.

4.2 Training Recipe

We randomly initialized all model parameters of the encoder, decoder, and connector layers.
The backbone architecture precisely matches the Llama 3.1 architecture, which allows us to initialize the HATified model variants with the pre-trained Llama 3.1 weights. We then trained the models with a next-byte-prediction cross-entropy objective on a large and diverse document corpus (see above). Initially, we trained on sequences of up to 3,500 words for a total of 134B words. We then continued training on sequences of up to 32,768 words for another 20B words, up-weighting longer documents to make use of the extended context. We conducted the training in our Scaling framework¹⁵.

¹⁵ https://github.com/Aleph-Alpha-Research/scaling

Initial pre-training. We used a global batch size of 512 for Llama-3.1-8B-TFree-HAT, and 1,024 for Llama-TFree-HAT-Pretrained and Llama-3.1-70B-TFree-HAT. Since the sequence length in the first phase of pre-training was 3,500 words for all models, each batch for Llama-3.1-8B-TFree-HAT contained 1,792,000 words, and each batch for Llama-TFree-HAT-Pretrained and Llama-3.1-70B-TFree-HAT contained 3,584,000 words.

We employed a warmup-stable-decay learning-rate scheduler [28, 70] for all models. We initially trained Llama-TFree-HAT-Pretrained for 1,000,000 steps, Llama-3.1-8B-TFree-HAT for 75,000 steps, and Llama-3.1-70B-TFree-HAT for 35,000 steps. During the learning-rate decay phase, we maintained the same data sources as in Table 3, but upsampled German and code data. For the encoder and decoder, we employed a linear warmup of 500 steps, a stable phase with learning rate 3e-4, and a final decay phase of 50,000 steps for Llama-TFree-HAT-Pretrained, 10,000 steps for Llama-3.1-8B-TFree-HAT, and 7,000 steps for Llama-3.1-70B-TFree-HAT. For Llama-TFree-HAT-Pretrained, we employed weight decay of 0.05 for all parameters except the embedding and normalization parameters.
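The warmup-stable-decay schedule can be written as a simple piecewise-linear function. The step counts below follow the numbers reported above for Llama-3.1-8B-TFree-HAT; the exact split of the 75,000 steps into phases is our reading, and the function itself is only an illustration:

```python
def wsd_lr(step: int, peak: float, warmup: int, stable: int, decay: int) -> float:
    """Warmup-stable-decay: linear warmup to peak, constant plateau,
    then linear decay to zero."""
    if step < warmup:
        return peak * step / warmup              # linear warmup
    if step < warmup + stable:
        return peak                              # stable phase
    t = step - (warmup + stable)
    return peak * max(0.0, 1.0 - t / decay)      # linear decay

# Encoder/decoder schedule of Llama-3.1-8B-TFree-HAT: 500 warmup steps,
# plateau at 3e-4, and a final 10,000-step decay within 75,000 total steps.
schedule = [wsd_lr(s, 3e-4, 500, 64_500, 10_000) for s in (0, 40_000, 75_000)]
```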
For Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT, we kept the backbone frozen for the first 2,000 steps, but otherwise followed the same schedule, with the stable phase shortened by 2,000 steps accordingly and the backbone learning rate decreased to 3e-5. We used this delayed training and reduced learning rate for the backbone because, in the HATification case, the backbone is the one component that is not randomly initialized: the encoder and decoder require a head start in learning to "utilize" the pre-trained backbone before we can start adapting the backbone itself.¹⁶ We used the training data mix for this phase as described in §4.1.

Long-context adaptation. To adapt the models to longer word context lengths, we kept most parameters frozen and adapted only the query and key projections of the cross-attention layers in the encoder and decoder and of the self-attention layers in the backbone. The long-context length was 32,900 words for Llama-TFree-HAT-Pretrained, 32,768 words for Llama-3.1-8B-TFree-HAT, and 16,000 words for Llama-3.1-70B-TFree-HAT. We once again adopted a warmup-stable-decay schedule for this adaptation, with 500 steps of linear warmup, a stable learning rate of 3e-6, and 1,000 steps of decay to 0. In this phase, we used a global batch size of 128 and up-weighted long-sequence subsets of all data sources while maintaining a similar overall mix.

5 Post-training

We optimized our models for instruction following using a standard post-training pipeline. First, we applied supervised fine-tuning (SFT) to train the model on both single-turn and multi-turn (chat) instruction-following tasks. Next, we aligned our model for helpfulness and safe answering using DPO.

5.1 Supervised Fine-tuning

The data we used for instruction fine-tuning is based on a mixture of publicly available user prompts and model completions.
The data mixture consisted of roughly 2M samples from diverse datasets, including but not limited to: human-feedback data focused on helpful and harmless responses; a small curated set for specific response patterns; safety and robustness subsets for appropriate boundaries; specialized datasets covering mathematics, programming, and logical inference; collaborative conversational data; formal mathematics with advanced problems; multilingual conversation prompts; and tabular-data reasoning for structured information.

For datasets comprising answers from older model generations, we synthesized responses to English queries using Qwen 2.5-32B and Qwen 2.5-72B [64]. We chose the Qwen 2.5 models for data synthesis because they offer strong performance across a variety of tasks [51] at relatively low resource requirements. Furthermore, we focused on improving both the coverage and the quality of German SFT data. For this, we (i) translated English prompts in our datasets using Mistral-Nemo-Instruct-2407 [42], (ii) generated corresponding answers using Mistral-Small-3.1-Instruct [43], and (iii) performed quality filtering on the prompt-answer pairs using an LLM judge based on Llama-3.3-70B-Instruct [2]. We found that translating only the prompt and then using it to generate a response directly in German significantly outperforms the alternative of translating both the prompt and the answer.

¹⁶ The training dynamics show a distinct "bump" in performance when the encoder/decoder learn to utilize the pre-trained backbone, indicating the transition from a pure byte-level model to actual hierarchical behavior. This usually happens within the first 2,000 steps.

We also found that quality filtering of the generated answers is best done by rejecting all but the highest-quality samples, whereas it can be beneficial to trade data quantity for quality in the case of the translated prompts.
We chose Mistral-Nemo-Instruct-2407 and Mistral-Small-3.1-Instruct for their strong performance-to-size characteristics on internal English-to-German translation and German Q&A benchmarks, respectively, and Llama-3.3-70B-Instruct for filtering due to its high agreement with human raters. Lastly, we supplemented the synthetic data with proprietary human-generated SFT data.

We experimented with various curriculum training methods, such as training on more general tasks first and on complex, specific instructions last, but found no significant advantage. As a result, our final dataset randomly intersperses samples across datasets. We used a packed training procedure, wherein multiple shorter sequences are concatenated into a single context window. Consequently, the number of sequences per batch can vary depending on how well the packing fills the context length; based on our data distribution, this corresponds to an approximate batch size of 256 sequences. The default learning rate was 3e-6, although for the HATified models we used a learning rate of 1.5e-6 for the backbone.

5.2 Direct Preference Optimization

For alignment training, we used the length-normalized version of DPO, as suggested by Lambert et al. [34]. As for SFT, our alignment dataset of prompts and completions originates from diverse domains. We filtered out any preference samples that contained empty strings or in which the chosen and rejected completions were identical. We used a learning rate of 1e-6 and, as for SFT, an approximate batch size of ~256 (i.e., with sequence packing).

6 Inference

Our initial release included a simplified HuggingFace inference implementation for single requests to demonstrate our architecture. For production deployment with efficient batched serving, we selected vLLM¹⁷ [33] as our serving framework.
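Conceptually, HAT decoding alternates between cheap byte-level steps and word-level backbone calls at word boundaries. The single-sequence control flow can be sketched as follows (all three callables are hypothetical stand-ins for the real encoder, backbone, and decoder):

```python
def hat_generate(backbone_step, byte_step, is_word_boundary, max_bytes: int) -> bytes:
    """Generate bytes, consulting the word-level backbone only at word
    boundaries; in between, cycle through the lightweight byte-level loop."""
    out = bytearray()
    word_pred = backbone_step(bytes(out))             # initial word prediction
    for _ in range(max_bytes):
        out.append(byte_step(word_pred, bytes(out)))  # encoder-decoder step
        if is_word_boundary(bytes(out)):
            word_pred = backbone_step(bytes(out))     # crossed a word boundary
    return bytes(out)
```

Batched serving must interleave this loop across many sequences at once, which is where the scheduling difficulties unique to hierarchical models arise.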
The integration revealed several fundamental architectural challenges unique to hierarchical models that required careful adaptation of vLLM’s components; a key constraint was to structure our implementation so as to minimize modifications to the core vLLM codebase. The hierarchical nature of HAT (and of similar architectures like the Byte Latent Transformer (BLT) [47]) introduces fundamental batching challenges that require changes to the scheduling strategy [72]. Unlike traditional autoregressive models that generate one token per step, HAT generates a variable number of bytes before reaching word boundaries, creating synchronization complexities. The system must balance maximizing backbone utilization (by waiting for all sequences in a batch to reach word boundaries) against maintaining reasonable latency (by processing a fixed number of bytes). This variability also means that sequences within the same batch may require different computational patterns at any given step (some continuing byte-level generation while others are ready for word-level processing), necessitating careful orchestration to maintain efficiency. For example, prefills, chunked prefills, and decodes that have just crossed a word boundary must traverse the full encoder-backbone-decoder pipeline. However, decodes that are in the middle of a word, and therefore do not require another consultation with the backbone, can keep cycling through the lightweight encoder-decoder loop until they reach the next word boundary. Additionally, the dual-sequence architecture of HAT requires maintaining two distinct sequence representations simultaneously: one tracking byte-level processing through the encoder-decoder components and another tracking word-level processing through the backbone model. A significant departure from traditional LLM serving lies in the corresponding KV cache management, where HAT requires maintaining separate caches for both sequences.
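The per-step routing rule for batched HAT serving described above can be sketched as follows. The field names and the simplified word-boundary rule are illustrative assumptions, not vLLM's actual data structures or HAT's actual segmentation.

```python
# Sketch of per-step scheduling for a hierarchical model: sequences that are
# in prefill or have just crossed a word boundary must traverse the full
# encoder-backbone-decoder pipeline, while mid-word decodes keep cycling
# through the lightweight byte-level encoder-decoder loop. The boundary rule
# (space/newline) and field names are simplifications for illustration.

WORD_BOUNDARY_BYTES = {ord(" "), ord("\n")}

def route_sequences(sequences):
    """Split a batch into a backbone-step group and a byte-loop group."""
    backbone_step, byte_loop = [], []
    for seq in sequences:
        at_boundary = seq["is_prefill"] or seq["last_byte"] in WORD_BOUNDARY_BYTES
        (backbone_step if at_boundary else byte_loop).append(seq["id"])
    return backbone_step, byte_loop

batch = [
    {"id": 0, "is_prefill": True,  "last_byte": 0},         # new request (prefill)
    {"id": 1, "is_prefill": False, "last_byte": ord(" ")},  # just crossed a word boundary
    {"id": 2, "is_prefill": False, "last_byte": ord("k")},  # mid-word decode
]
full_pipeline, light_loop = route_sequences(batch)
```

Here sequences 0 and 1 would consult the backbone while sequence 2 stays in the cheap byte-level loop, which is the source of the utilization/latency trade-off discussed above.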
This dual-cache system must ensure coordinated memory allocation, as the mapping between sequences is non-uniform (a single word may comprise many bytes) and a sequence can only proceed if sufficient memory exists for both its byte and word representations. The challenge is compounded by asymmetric memory requirements: word-level cache blocks consume significantly more memory, mainly because global attention requires linearly increasing memory, while byte-level cache blocks are capped by the sliding window [4]. Due to these challenges, the current implementation in vLLM (https://github.com/vllm-project/vllm) exhibits lower throughput compared to a FLOP-matched tokenizer-based transformer in the batched setting. Nevertheless, to the best of our knowledge, it is the fastest inference implementation available for any hierarchical architecture of this kind. We are actively working to optimize the implementation further and are also designing future iterations of the architecture with greater inference efficiency in mind.

7 Model Performance

We evaluate our pre-trained and post-trained (SFT, DPO) models with different subsets of benchmark tasks. To accomplish this consistently, we chose not to use a combination of several existing evaluation suites but rather to develop and use a unified framework, which we release under Apache 2.0 (https://github.com/Aleph-Alpha-Research/eval-framework/). Our evaluation framework aligns with EleutherAI’s LM Evaluation Harness [22] benchmark implementations whenever possible. However, we permitted minor prompt modifications when such changes improved applicability to practical, model-agnostic scenarios. These adjustments primarily involve whitespace, newlines, and translated cue words. For example, in non-English benchmarks, we often encountered English cue words such as “Question” and “Answer” where language-specific terms like the German cue words “Frage” and “Antwort” seem more appropriate. For full details of our implementations of each benchmark, please consult our code.
7.1 Towards consistent evaluations

We invested significant effort to unify a large set of evaluation tasks by enforcing the same prompt conventions (especially formatting), inference settings across back-ends, completion parsing algorithms, metric implementations, and error handling. Removing these sources of variability enabled us to better understand model performance across benchmarks and, in effect, iterate faster and have greater confidence in our evaluations. While we draw on established prompts, commonly used formats, and standard techniques, we reserve the flexibility to adapt them when necessary to better reflect realistic use cases and ensure fair and meaningful evaluations. As a result, we may not reproduce the exact numbers reported for some models on certain benchmarks. However, we believe that our comparisons remain fair, as our evaluation framework ensures that all models are tested under consistent and controlled conditions. We believe this is important not just for transparency, but also so others may reproduce our results using our publicly available checkpoints and accompanying evaluation framework.

Whitespace in prompts: LLMs are often overly sensitive to seemingly minor details in prompt formatting. One notable example we have observed (as have others [12]) relates to the placement of whitespace in prompt or completion sequences. In multiple-choice benchmarks, such as MMLU, the task is typically presented as a set of token sequences that represent each answer option. The log-likelihood of each sequence is calculated and compared, and the sequence with the highest log-likelihood is selected as the model’s predicted answer. Some benchmarks guide models in the desired direction by ending their prompt with an open-ended assistant’s message, a technique called assistant-prefilling or answer-cueing.
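The multiple-choice scoring procedure just described can be sketched as below. The `loglikelihood` argument is a hypothetical stand-in for a model call returning the summed log-probability of a continuation given the prompt; the toy scorer is only there to make the sketch runnable.

```python
# Sketch of multiple-choice evaluation by log-likelihood: each answer option
# is scored as a continuation of the prompt, and the highest-scoring option
# is taken as the model's prediction. `loglikelihood` stands in for a real
# model call; the toy scorer below simply prefers shorter continuations.

def pick_answer(prompt, options, loglikelihood):
    scores = {opt: loglikelihood(prompt, opt) for opt in options}
    return max(scores, key=scores.get)

toy_ll = lambda prompt, cont: -len(cont)   # toy stand-in for model log-probs
choice = pick_answer(
    "Question: 2+2=?\nAnswer:",            # no trailing space in the prompt
    [" 4", " 22", " five"],                # options carry the prefix space
    toy_ll,
)
# choice == " 4" under the toy scorer
```

Note the convention in the example (no trailing space in the prompt, prefix space on each continuation), which matches the whitespace handling discussed in this section.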
However, there is a difference in token log-likelihood depending on whether the cue ends with a space ("Answer: " followed by tested completions without space prefixes) or not ("Answer:" followed by tested completions with space prefixes). We observed that for models trained with space-prefix tokenizers, the former case leads to very low log-likelihoods (indicating out-of-distribution tokens) and less reliable comparisons between them. We adjusted evaluations so that they would not have a trailing space in the prompt and would have, in the case of log-likelihood tasks, a prefix space in the possible continuations.

7.2 Performance and Compression

Performance: Our T-Free models deliver performance on par with state-of-the-art open-weight memory-equivalent models in both English and German. For evaluation purposes, we compare our tokenizer-free base models to Llama 3.1 8B Base, our SFT model to Tülu 3.1 8B SFT [34], and our DPO model to Llama 3.1 8B Instruct and Tülu 3.1 8B [34]. The respective benchmarks and results can be found in the tables below. Note that although we used code and mathematics data in our training corpus, our model’s architecture has not been optimized for code generation and mathematical reasoning and is therefore not evaluated extensively on those benchmarks.

Compression: Our T-Free approach results in improved efficiency, particularly in inference overhead, measured in terms of the number of words processed across all languages and domains. We define efficiency as tokenizer fertility, i.e., bytes per sequence position in the backbone, with higher values indicating better compression. In-production latency and throughput are currently beyond the scope of research-centric evaluations and will be addressed in the future.
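As a toy illustration of the fertility metric defined above (bytes per backbone sequence position), the sketch below counts one position per whitespace-separated word; this splitting rule and the helper name `bytes_per_position` are simplifications for illustration, not the framework's actual segmentation.

```python
# Sketch of the compression metric: bytes processed per backbone sequence
# position, higher is better. For a tokenizer-based model, positions would be
# token counts; for HAT, roughly one position per word that the encoder
# aggregates. Whitespace splitting is a simplified word-segmentation rule.

def bytes_per_position(text, num_positions):
    return len(text.encode("utf-8")) / num_positions

text = "Donaudampfschifffahrt ist ein langes Wort"
word_positions = len(text.split())                 # 5 word-level positions
fertility = bytes_per_position(text, word_positions)
# 41 bytes / 5 positions = 8.2 bytes per position
```

A subword tokenizer would typically need more positions for a sentence with long, rare German words, which is why HAT's word-level backbone reports higher bytes-per-position numbers on German text in the tables below.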
At present, our evaluation framework automatically measures bytes per sequence position across datasets, allowing us to derive efficiency scores and analyze variations across different dataset distributions. In Appendix A, we provide more comprehensive descriptions of the benchmarks and metrics used to evaluate our models.

7.3 Evaluation Results

Few-shot prompting: In some evaluations, we used few-shot settings closely aligned with the literature on the models we compare against and with common practice, e.g., for submissions to the OpenLLMLeaderboard [19] or the Llama 3 tech report [24]. The number of few-shot examples provided during evaluation is detailed in the tables below. In each row, bold numbers show the highest score and any score within 1% of the highest score.

7.3.1 Pre-trained Model

For the sake of conciseness, in the figures and tables below, "HAT 7B Base" or "T-Free" refers to Llama-TFree-HAT-Pretrained-Base, our trained-from-scratch model, while "HATified 7B Base" or "HATified" is Llama-3.1-8B-TFree-HAT-Base, and "Llama" or "Llama 8B Base" is Llama-3.1-8B-Base.

Table 4: Categories and corresponding benchmark tasks used in pre-trained model evaluations.
Group | Benchmarks
English Knowledge | MMLU [25], Full Text MMLU, MMLU-Pro [68], Graduate-Level Google-Proof Q&A (GPQA) [55], BIG-Bench Hard (BBH) [62], OpenBookQA [41], TriviaQA [29], TruthfulQA [35]
English Reasoning | AI2 Reasoning Challenge (ARC) [14], WinoGrande [58], HellaSwag [73]
German | Multilingual Massive Multitask Language Understanding (MMMLU) [25], LAnguage Modeling Broadened to Account for Discourse Aspects (LAMBADA) [49], German ARC (Easy & Challenge), German Wino-X, German HellaSwag, German TruthfulQA, German GSM8K, WMT16
Math | Grade School Math 8K (GSM8K) [15]
Safety | WinoGender [56]

[Figure 4 (two bar charts): average performance over common benchmarks (model quality) and bytes per sequence position (model compression) for HAT 7B Base, HATified 7B Base, and Llama 8B Base across the Knowledge, Reasoning, German, Math, and Safety groups.]
Figure 4: Model Quality and Compression for our pre-trained T-Free model in comparison with Llama-3.1-8B.

Table 5: Knowledge: Pre-training evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama) | Compression (T-Free / HATified / Llama)
MMLU 5-shot | norm. log. acc. | 0.665 / 0.657 / 0.670 | 5.18 / 5.18 / 4.28
Full Text MMLU 5-shot | norm. log. acc. | 0.649 / 0.638 / 0.623 | 5.31 / 5.31 / 4.56
MMLU Pro 5-shot | norm. log. acc. | 0.386 / 0.369 / 0.369 | 4.73 / 4.73 / 3.73
GPQA 0-shot | log. acc. | 0.273 / 0.308 / 0.301 | 4.93 / 4.93 / 3.52
BBH 3-shot | norm. log. acc. | 0.476 / 0.472 / 0.472 | 4.67 / 4.67 / 3.79
OpenBookQA 10-shot | norm. log. acc. | 0.894 / 0.868 / 0.846 | 4.85 / 4.85 / 4.35
TriviaQA 5-shot | comp. acc. | 0.647 / 0.623 / 0.695 | 5.37 / 5.37 / 4.24
TruthfulQA 6-shot | norm. prob. mass | 0.336 / 0.336 / 0.337 | 4.91 / 4.91 / 4.18

Table 6: Reasoning: Pre-training evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama) | Compression (T-Free / HATified / Llama)
ARC Easy 25-shot | norm. log. acc. | 0.875 / 0.871 / 0.858 | 5.53 / 5.53 / 4.94
ARC Challenge 25-shot | norm. log. acc. | 0.635 / 0.621 / 0.580 | 5.51 / 5.51 / 4.92
Winogrande 5-shot | norm. log. acc. | 0.721 / 0.691 / 0.695 | 5.16 / 5.16 / 4.91
HellaSwag 10-shot | norm. log. acc. | 0.826 / 0.793 / 0.818 | 5.34 / 5.34 / 4.66

Table 7: German: Pre-training evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama) | Compression (T-Free / HATified / Llama)
MMMLU 5-shot | norm. log. acc. | 0.618 / 0.588 / 0.577 | 6.03 / 6.03 / 3.33
ARC Easy DE 25-shot | norm. log. acc. | 0.801 / 0.779 / 0.715 | 6.60 / 6.60 / 3.68
ARC Challenge DE 25-shot | norm. log. acc. | 0.591 / 0.537 / 0.474 | 6.57 / 6.57 / 3.68
Wino-X DE 5-shot | norm. log. acc. | 0.803 / 0.789 / 0.765 | 5.63 / 5.63 / 3.67
HellaSwag DE 10-shot | norm. log. acc. | 0.687 / 0.657 / 0.616 | 6.50 / 6.50 / 3.67
TruthfulQA DE 6-shot | norm. prob. mass | 0.340 / 0.341 / 0.342 | 5.91 / 5.91 / 3.39
Lambada 5-shot | comp. acc. | 0.471 / 0.453 / 0.451 | 5.78 / 5.78 / 3.56
GSM8K DE 8-shot | comp. acc. | 0.475 / 0.441 / 0.415 | 4.38 / 4.37 / 2.94
WMT16 3-shot | linewise BLEU | 36.416 / 38.033 / 34.812 | 6.02 / 6.02 / 4.21

Table 8: Math & Safety: Pre-training evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama) | Compression (T-Free / HATified / Llama)
GSM8K 8-shot | comp. acc. | 0.612 / 0.597 / 0.566 | 4.55 / 4.55 / 3.39
Winogender 5-shot | norm. log. acc. | 0.653 / 0.621 / 0.628 | 5.23 / 5.23 / 4.80

7.3.2 Post-trained Llama-3.1-8B-TFree-HAT: SFT

As with pre-trained model evaluation, the same benchmark groups are used in post-trained SFT model evaluations, with one additional group, Instruction Following. Furthermore, some groups contain additional benchmarks. In this section, "HATified" is Llama-3.1-8B-TFree-HAT-SFT, and "Tülu" is Llama-3.1-Tulu-3-8B-SFT.

Table 9: Categories and corresponding benchmark tasks used in post-trained SFT and DPO model evaluations.

Group | Benchmarks
English Knowledge | Same as pre-training evaluations.
English Reasoning | Same as pre-training evaluations.
German | Same as pre-training evaluations, as well as the instruct version of WMT16.
Instruction Following | Alpaca Eval [17]
Long Context | QuALITY [48], Ada-LEval [67]
Math | Same as pre-trained evaluations.
Safety | Same as pre-trained evaluations.

Table 10: Knowledge: SFT evaluation benchmarks and compression.

Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
MMLU 5-shot | norm. log. acc. | 0.655 / 0.576 | 5.18 / 4.28
Full Text MMLU 5-shot | norm. log. acc. | 0.652 / 0.598 | 5.31 / 4.56
MMLU Pro 5-shot | norm. log. acc. | 0.378 / 0.306 | 4.73 / 3.73
GPQA 0-shot | log. acc. | 0.294 / 0.277 | 4.93 / 3.52
BBH 3-shot | norm. log. acc. | 0.493 / 0.462 | 4.67 / 3.79
OpenBookQA 10-shot | norm. log. acc. | 0.696 / 0.654 | 4.85 / 4.35
TriviaQA 5-shot | comp. acc. | 0.585 / 0.200 | 5.36 / 3.93
TruthfulQA 6-shot | norm. prob. mass | 0.346 / 0.338 | 4.91 / 4.18

Table 11: Reasoning: SFT evaluation benchmarks and compression.

Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
ARC Easy 25-shot | norm. log. acc. | 0.889 / 0.799 | 5.53 / 4.94
ARC Challenge 25-shot | norm. log. acc. | 0.646 / 0.515 | 5.51 / 4.92
Winogrande 5-shot | norm. log. acc. | 0.677 / 0.602 | 5.16 / 4.91
HellaSwag 10-shot | norm. log. acc. | 0.747 / 0.773 | 5.34 / 4.66

[Figure 5 (two bar charts): average performance over common benchmarks (model quality) and bytes per sequence position (model compression) for HATified 7B SFT and Tülu 8B SFT across the Knowledge, Reasoning, German, Math, Instruction Following, Long Context, and Safety groups.]
Figure 5: Model Quality and Compression for our SFT T-Free model in comparison with Llama-3.1-Tulu-3-8B-SFT.

Table 12: German: SFT evaluation benchmarks and compression.

Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
MMMLU 5-shot | norm. log. acc. | 0.597 / 0.468 | 6.03 / 3.33
ARC Easy DE 25-shot | norm. log. acc. | 0.800 / 0.535 | 6.60 / 3.68
ARC Challenge DE 25-shot | norm. log. acc. | 0.572 / 0.336 | 6.57 / 3.68
Wino-X DE 5-shot | norm. log. acc. | 0.763 / 0.657 | 5.63 / 3.67
HellaSwag DE 10-shot | norm. log. acc. | 0.596 / 0.535 | 6.50 / 3.67
TruthfulQA DE 6-shot | norm. prob. mass | 0.348 / 0.339 | 5.91 / 3.39
Lambada 5-shot | comp. acc. | 0.368 / 0.178 | 5.79 / 3.55
GSM8K DE 8-shot | comp. acc. | 0.550 / 0.477 | 4.41 / 2.93
WMT16 3-shot | linewise BLEU | 35.810 / 31.414 | 6.02 / 4.20
WMT16 Instruct 3-shot | linewise BLEU | 36.441 / 32.326 | 6.11 / 4.30

Table 13: Instruction Following: SFT evaluation benchmarks and compression.
Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
Alpaca Eval 0-shot | CS | 0.345 / 0.065 | 5.67 / 4.58
Alpaca Eval 0-shot | IF | 0.903 / 0.846 | 5.67 / 4.58
Alpaca Eval 0-shot | LC | 0.989 / 0.901 | 5.67 / 4.58

Table 14: Long Context: SFT evaluation benchmarks and compression.

Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
QuALITY 0-shot | log. acc. | 0.389 / 0.346 | 4.85 / 4.28
Ada-LEval TextSort Choices 0-shot | log. acc. | 0.261 / 0.288 | 5.19 / 4.04
Ada-LEval TextSort 0-shot | comp. acc. | 0.052 / 0.000 | 5.20 / 4.06

Table 15: Math & Safety: SFT evaluation benchmarks and compression.

Task | Metric | Results (HATified / Tülu) | Compression (HATified / Tülu)
GSM8K 8-shot | comp. acc. | 0.667 / 0.699 | 4.51 / 3.39
Winogender 5-shot | norm. log. acc. | 0.550 / 0.537 | 5.23 / 4.80

Table 16: SFT MTBench winrates in English and German for Llama-3.1-8B-TFree-HAT.

Winrate comparison | HATified
vs. allenai/Llama-3.1-Tulu-3-8B-SFT (English) | 61.1%
vs. allenai/Llama-3.1-Tulu-3-8B-SFT (German) | 66.1%

7.3.3 Post-trained DPO Models

We use the same benchmark groups as for the SFT models to evaluate post-trained DPO models. In this section, "HAT 7B DPO" or "T-Free" refers to Llama-TFree-HAT-Pretrained-DPO, our trained-from-scratch model, while "HATified 7B DPO" or "HATified" is Llama-3.1-8B-TFree-HAT-DPO, "Llama" is Llama-3.1-8B-Instruct, and "Tülu" is Llama-3.1-Tulu-3-8B-DPO.

Table 17: Knowledge: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
MMLU 5-shot | norm. log. acc. | 0.653 / 0.657 / 0.697 / 0.574 | 5.18 / 5.18 / 4.28 / 4.28
Full Text MMLU 5-shot | norm. log. acc. | 0.656 / 0.661 / 0.698 / 0.606 | 5.31 / 5.31 / 4.56 / 4.56
MMLU Pro 5-shot | norm. log. acc. | 0.375 / 0.381 / 0.424 / 0.312 | 4.73 / 4.73 / 3.73 / 3.73
GPQA 0-shot | log. acc. | 0.284 / 0.286 / 0.330 / 0.297 | 4.93 / 4.93 / 3.52 / 3.52
BBH 3-shot | norm. log. acc. | 0.486 / 0.500 / 0.526 / 0.462 | 4.67 / 4.67 / 3.79 / 3.79
OpenBookQA 10-shot | norm. log. acc. | 0.696 / 0.772 / 0.728 / 0.742 | 4.85 / 4.85 / 4.35 / 4.35
TriviaQA 5-shot | comp. acc. | 0.272 / 0.414 / 0.655 / 0.247 | 5.42 / 5.38 / 4.24 / 4.20
TruthfulQA 6-shot | norm. prob. mass | 0.364 / 0.357 / 0.350 / 0.350 | 4.91 / 4.91 / 4.18 / 4.18

[Figure 6 (two bar charts): average performance over common benchmarks (model quality) and bytes per sequence position (model compression) for HAT 7B DPO, HATified 7B DPO, Llama 8B Instruct, and Tülu 8B DPO across the Knowledge, Reasoning, German, Instruction Following, Math, Long Context, and Safety groups.]
Figure 6: Model quality and compression for our Llama-TFree-HAT-Pretrained-DPO and Llama-3.1-8B-TFree-HAT-DPO models, compared to Llama-3.1-8B-Instruct and Llama-3.1-Tulu-3-8B.

Table 18: Reasoning: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
ARC Easy 25-shot | norm. log. acc. | 0.894 / 0.897 / 0.880 / 0.816 | 5.53 / 5.53 / 4.94 / 4.94
ARC Challenge 25-shot | norm. log. acc. | 0.677 / 0.668 / 0.654 / 0.564 | 5.51 / 5.51 / 4.92 / 4.92
Winogrande 5-shot | norm. log. acc. | 0.676 / 0.687 / 0.673 / 0.635 | 5.16 / 5.16 / 4.91 / 4.91
HellaSwag 10-shot | norm. log. acc. | 0.783 / 0.775 / 0.764 / 0.817 | 5.34 / 5.34 / 4.66 / 4.66

Table 19: German: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
MMMLU 5-shot | norm. log. acc. | 0.608 / 0.594 / 0.605 / 0.467 | 6.03 / 6.03 / 3.33 / 3.33
ARC Easy DE 25-shot | norm. log. acc. | 0.826 / 0.812 / 0.738 / 0.565 | 6.60 / 6.60 / 3.68 / 3.68
ARC Challenge DE 25-shot | norm. log. acc. | 0.640 / 0.594 / 0.515 / 0.363 | 6.57 / 6.57 / 3.68 / 3.68
Wino-X DE 5-shot | norm. log. acc. | 0.754 / 0.750 / 0.734 / 0.667 | 5.63 / 5.63 / 3.67 / 3.67
HellaSwag DE 10-shot | norm. log. acc. | 0.727 / 0.687 / 0.586 / 0.586 | 6.50 / 6.50 / 3.67 / 3.67
TruthfulQA DE 6-shot | norm. prob. mass | 0.362 / 0.356 / 0.347 / 0.346 | 5.91 / 5.91 / 3.39 / 3.39
Lambada 5-shot | comp. acc. | 0.090 / 0.379 / 0.437 / 0.201 | 5.79 / 5.78 / 3.55 / 3.55
GSM8K DE 8-shot | comp. acc. | 0.580 / 0.534 / 0.459 / 0.568 | 4.42 / 4.46 / 2.94 / 2.94
WMT16 3-shot | linewise BLEU | 31.171 / 34.778 / 34.350 / 31.102 | 6.03 / 6.03 / 4.21 / 4.21
WMT16 Instruct 3-shot | linewise BLEU | 31.909 / 35.564 / 34.530 / 31.687 | 6.12 / 6.09 / 4.30 / 4.30

Table 20: Instruction Following: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
Alpaca Eval 0-shot | CS | 0.554 / 0.420 / 0.186 / 0.118 | 6.54 / 5.54 / 4.52 / 4.68
Alpaca Eval 0-shot | IF | 0.931 / 0.920 / 0.935 / 0.933 | 6.54 / 5.54 / 4.52 / 4.68
Alpaca Eval 0-shot | LC | 0.985 / 0.989 / 0.986 / 0.964 | 6.54 / 5.54 / 4.52 / 4.68

Table 21: Long Context: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
QuALITY 0-shot | log. acc. | 0.378 / 0.383 / 0.409 / 0.381 | 4.85 / 4.85 / 4.28 / 4.28
Ada-LEval TextSort Choices 0-shot | log. acc. | 0.253 / 0.276 / 0.288 / 0.281 | 5.19 / 5.19 / 4.04 / 4.04
Ada-LEval TextSort 0-shot | comp. acc. | 0.002 / 0.053 / 0.021 / 0.000 | 5.20 / 5.20 / 4.06 / 4.06

Table 22: Math & Safety: DPO evaluation benchmarks and compression.

Task | Metric | Results (T-Free / HATified / Llama / Tülu) | Compression (T-Free / HATified / Llama / Tülu)
GSM8K 8-shot | comp. acc. | 0.555 / 0.647 / 0.710 / 0.782 | 4.51 / 4.54 / 3.40 / 3.42
Winogender 5-shot | norm. log. acc. | 0.590 / 0.567 / 0.642 / 0.554 | 5.23 / 5.23 / 4.80 / 4.80

Table 23: DPO MTBench winrates in English and German.

Winrate comparison | Llama-3.1-8B-TFree-HAT-DPO
vs. meta-llama/Llama-3.1-8B-Instruct (English) | 71.0%
vs. allenai/Llama-3.1-Tulu-3-8B (English) | 65.0%
vs. meta-llama/Llama-3.1-8B-Instruct (German) | 73.9%
vs. allenai/Llama-3.1-Tulu-3-8B (German) | 71.0%

7.3.4 Llama-3.1-70B-TFree-HAT

We also provide results from our Llama-3.1-70B-TFree-HAT-SFT model, which we offer as an experimental release to demonstrate that our architecture and training pipelines scale to larger numbers of parameters. We note that while our results on some academic benchmarks lag behind Llama-3.3-70B-Instruct, we achieve largely comparable performance overall and decidedly beat Llama-3.3-70B-Instruct in direct MT-Bench comparisons in English and German.
Table 24: Knowledge: SFT evaluation metrics across tasks and models.

Task | Metric | HATified-SFT | Llama-Instruct
MMLU 5-shot | norm. log. acc. | 0.773 | 0.818
Full Text MMLU 5-shot | norm. log. acc. | 0.786 | 0.830
MMLU Pro 5-shot | norm. log. acc. | 0.513 | 0.573
GPQA 0-shot | log. acc. | 0.360 | 0.545
BBH 3-shot | norm. log. acc. | 0.652 | 0.706
OpenBookQA 10-shot | norm. log. acc. | 0.526 | 0.556
TriviaQA 5-shot | comp. acc. | 0.582 | 0.757
TruthfulQA 6-shot | norm. prob. mass | 0.176 | 0.191

Table 25: Reasoning: SFT evaluation metrics across tasks and models.

Task | Metric | HATified-SFT | Llama-Instruct
ARC Easy 25-shot | norm. log. acc. | 0.920 | 0.911
ARC Challenge 25-shot | norm. log. acc. | 0.739 | 0.741
Winogrande 5-shot | norm. log. acc. | 0.749 | 0.697
HellaSwag 10-shot | norm. log. acc. | 0.809 | 0.665

Table 26: German: SFT evaluation metrics across tasks and models.

Task | Metric | HATified-SFT | Llama-Instruct
MMMLU 5-shot | norm. log. acc. | 0.715 | 0.783
ARC Easy DE 25-shot | norm. log. acc. | 0.848 | 0.825
ARC Challenge DE 25-shot | norm. log. acc. | 0.669 | 0.653
Wino-X DE 5-shot | norm. log. acc. | 0.793 | 0.761
HellaSwag DE 10-shot | norm. log. acc. | 0.727 | 0.707
TruthfulQA DE 6-shot | norm. prob. mass | 0.170 | 0.174
GSM8K DE 8-shot | comp. acc. | 0.630 | 0.139

Table 27: Instruction Following: SFT evaluation metrics across tasks and models.

Task | Metric | HATified-SFT | Llama-Instruct
Alpaca Eval 0-shot | CS | 0.363 | 0.168
Alpaca Eval 0-shot | IF | 0.945 | 0.961
Alpaca Eval 0-shot | LC | 0.994 | 0.993

Table 28: Long Context & Safety: SFT evaluation metrics across tasks and models.

Task | Metric | HATified-SFT | Llama-Instruct
QuALITY 0-shot | log. acc. | 0.488 | 0.459
ZeroSCROLLS MuSiQue 0-shot | F1 | 0.450 | 0.522
ZeroSCROLLS SpaceDigest 0-shot | ES | 0.779 | 0.404
ZeroSCROLLS SQuALITY 0-shot | rouge gm | 0.170 | 0.159
Winogender 5-shot | norm. log. acc. | 0.679 | 0.843

Table 29: SFT MTBench winrates in English and German for Llama-3.1-70B-TFree-HAT.

Winrate comparison | HATified SFT Score
vs. meta-llama/Llama-3.3-70B-Instruct (English) | 63.3%
vs. meta-llama/Llama-3.3-70B-Instruct (German) | 63.9%

8 Pre-training Learning Dynamics

Although we have not analyzed these ourselves yet, we are glad to release 200 intermediate checkpoints from our Llama-TFree-HAT-Pretrained pre-training run, spanning ∼4 trillion words, more than 10 times the training data covered by the checkpoints of the Pythia model suite [6], which span ∼300 billion tokens across 154 checkpoints per model. The Pythia checkpoints have proven very useful to the research community for studying pre-training learning dynamics. Understanding the dynamics of how LLMs learn during pre-training represents one of the most fundamental questions in modern deep learning research. The availability of intermediate checkpoints enables researchers to trace the evolution of model capabilities, knowledge acquisition patterns, and the emergence of complex behaviors that are otherwise opaque in fully-trained models [46, 44]. By studying these dynamics, we can gain crucial insights into when and how models develop specific competencies, such as in-context learning abilities [46], compositional reasoning skills [69], and factual knowledge retention [39]. The systematic study of pre-training dynamics has revealed several key phenomena that challenge our understanding of neural network learning. For instance, research has shown that some capabilities emerge suddenly rather than gradually, exhibiting phase-transition-like behavior at specific scales or training steps [69, 59, 27]. Additionally, the order in which different skills are acquired appears to follow predictable patterns, with simpler linguistic competencies typically preceding more complex reasoning abilities [71, 27]. These findings have important implications for training efficiency, curriculum design, and our theoretical understanding of how intelligence emerges in artificial systems. From a practical standpoint, analyzing learning dynamics provides valuable guidance for model development and resource allocation.
By identifying critical training phases where specific capabilities emerge or stabilize, researchers can optimize training schedules, detect potential training instabilities early, and make informed decisions about when to intervene with techniques such as learning rate adjustments or data mixture changes [26, 65]. Furthermore, understanding these dynamics enables more efficient evaluation protocols, as researchers can predict which benchmarks will be most informative at different stages of training [10, 13]. The release of our 200 checkpoints, spanning ∼4 trillion words, provides an opportunity to study these phenomena at a sizable data and model scale, and in a unique architecture. This will facilitate future research into fundamental questions about scaling laws, capability emergence, knowledge consolidation, and the relationship between training dynamics and final model performance [31, 52].

9 Environmental Impact

The A100 GPU has a maximum power draw of 400 W, while both the H100 and H200 GPUs have a maximum power draw of 700 W. Our H200 and A100 infrastructure runs entirely on 100% renewable energy, ensuring that no CO2 emissions are directly incurred from training. In addition, the H200 data center has a power usage effectiveness (PUE) of ≤1.2, and its operation maintains a net-zero water footprint. Specific numbers on renewable energy usage for the H100 GPUs are not yet available to us. To estimate the carbon footprint of inference, we base our calculations on publicly available data from the infrastructure provider and, where applicable, standard emissions accounting methodology. Because the H200 and A100 data centers operate fully on renewable energy, both metrics for their operation (excluding infrastructure-related emissions, e.g., initial chip manufacturing) are effectively zero. These numbers may be contextualized with reference to publicly available studies, such as the carbon footprint of training BLOOM (176B parameters) [37].
10 Conclusion

This technical report presents a comprehensive overview of the development of our T-Free models, high-performing English- and German-language LLMs that move beyond traditional tokenization approaches. In particular, Llama-TFree-HAT-Pretrained, Llama-3.1-8B-TFree-HAT, and Llama-3.1-70B-TFree-HAT show improvements over Llama 3.1 on most downstream tasks, while increasing compression rates and reducing the number of overall model parameters. The core architectural characteristic of our models is the use of small encoder and decoder modules operating directly on bytes, while aggregating encoder outputs into word-level embeddings. This design offers several advantages over conventional tokenizer-based models, including (i) improved adaptability to new domains and languages via continual training and (ii) increased robustness to prompt perturbations. While these benefits are supported by early studies of hierarchical architectures [45, 47], they require further validation in a broader range of tasks and settings. Similarly, future work should determine the feasibility and effectiveness of this approach in domains beyond natural language, such as programming languages. We make our models publicly available to the research community, with the hope that this will enable further investigation into models that do away with traditional tokenizer-based architectures. Our inclusion of 200 pre-training checkpoints may also facilitate advances in learning dynamics and developmental interpretability [27]. As such, this work adds to the growing body of work that challenges and improves upon traditional tokenization strategies to make LLMs more adaptable and robust [47, 23, 30, 1, 66].

References

[1] Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers, 2025.
URL https://arxiv.org/abs/2506.10766.

[2] AI@Meta. Llama 3 Model Card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

[3] Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. To Code, or Not To Code? Exploring Impact of Code in Pre-training, 2024. URL https://arxiv.org/abs/2408.10914.

[4] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150.

[5] Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In Leif Azzopardi, Allan Hanbury, Gabriella Pasi, and Benjamin Piwowarski, editors, Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Berlin Heidelberg New York, March 2018. Springer.

[6] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023. URL https://arxiv.org/abs/2304.01373.

[7] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606, 2016.

[8] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 Conference on Machine Translation.
In Ondřej Bojar, Christian Buck, Rajen Chatterjee, Christian Federmann, Liane Guillou, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Aurélie Névéol, Mariana Neves, Pavel Pecina, Martin Popel, Philipp Koehn, Christof Monz, Matteo Negri, Matt Post, Lucia Specia, Karin Verspoor, Jörg Tiedemann, and Marco Turchi, editors, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2301. URL https://aclanthology.org/W16-2301/.

[9] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped Attention: Faster Ring Attention for Causal Transformers, 2023. URL https://arxiv.org/abs/2311.09431.

[10] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, 2020. URL https://arxiv.org/abs/2005.14165.

[11] Thomas F Burns, Letitia Parcalabescu, Stephan Wäldchen, Michael Barlow, Gregor Ziegltrum, Volker Stampa, Bastian Harren, and Björn Deiseroth. Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation. In 19th Conference of the European Chapter of the Association for Computational Linguistics, 2026. URL https://arxiv.org/abs/2505.00022.

[12] Chi-Yun Chang, Xueyang Huang, Humaira Nasir, Shane Storks, Olawale Akingbade, and Huteng Dai. Mind the Gap: How BabyLMs Learn Filler-Gap Dependencies.
In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15049–15065, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.761. URL https://aclanthology.org/2025.emnlp-main.761/.

[13] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling Language Modeling with Pathways, 2022. URL https://arxiv.org/abs/2204.02311.

[14] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1, 2018.

[15] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training Verifiers to Solve Math Word Problems, 2021. URL https://arxiv.org/abs/2110.14168.

[16] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022. URL https://arxiv.org/abs/2205.14135.

[17] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, 2025. URL https://arxiv.org/abs/2404.04475.

[18] Denis Emelin and Rico Sennrich. Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8517–8532, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.670. URL https://aclanthology.org/2021.emnlp-main.670/.

[19] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM Leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.

[20] Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data Engineering for Scaling Language Models to 128K Context, 2024. URL https://arxiv.org/abs/2402.10171.

[21] Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994.

[22] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The Language Model Evaluation Harness, 07 2024. URL https://zenodo.org/records/12608602.

[23] Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, and Ryan Cotterell.
The Foundations of Tokenization: Statistical and Computational Concerns, 2025. URL https://arxiv.org/abs/2407.11606.
[24] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, Danny Wyatt, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Francisco Guzmán, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Govind Thattai, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jack Zhang, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Karthik Prasad, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Kushal Lakhotia, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de
Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Maria Tsimpoukelli, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Ning Zhang, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohan Maheswari, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vítor Albiero, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaofang Wang, Xiaoqing Ellen Tan, Xide Xia, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aayushi Srivastava, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Amos Teo, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew 
Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Dong, Annie Franco, Anuj Goyal, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Ce Liu, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Cynthia Gao, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Eric-Tuan Le, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Filippos Kokkinos, Firat Ozgenel, Francesco Caggioni, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Grant Herman, Grigory Sizov, Guangyi Zhang, Guna Lakshminarayanan, Hakan Inan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, 
Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen
Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The Llama 3 Herd of Models, 2024. URL https://arxiv.org/abs/2407.21783.
[25] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding, 2021. URL https://arxiv.org/abs/2009.03300.
[26] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30016–30030. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf.
[27] Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. Loss Landscape Degeneracy and Stagewise Development in Transformers. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=45qJyBG8Oj.
[28] Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=3X2L2TFr0f.
[29] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017. URL https://arxiv.org/abs/1705.03551.
[30] Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, and Róbert Csordás. MrT5: Dynamic Token Merging for Efficient Byte-level Language Models, 2025. URL https://arxiv.org/abs/2410.20771.
[31] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models, 2020. URL https://arxiv.org/abs/2001.08361.
[32] Najoung Kim, Sebastian Schuster, and Shubham Toshniwal. Code Pretraining Improves Entity Tracking Abilities of Language Models, 2024. URL https://arxiv.org/abs/2405.21068.
[33] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[34] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. Tulu 3: Pushing Frontiers in Open Language Model Post-Training, 2025. URL https://arxiv.org/abs/2411.15124.
[35] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2022. URL https://arxiv.org/abs/2109.07958.
[36] Hao Liu, Matei Zaharia, and Pieter Abbeel.
Ring Attention with Blockwise Transformers for Near-Infinite Context, 2023. URL https://arxiv.org/abs/2310.01889.
[37] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Estimating the carbon footprint of BLOOM, a 176B parameter language model. Journal of Machine Learning Research, 24(253):1–15, 2023.
[38] Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. At Which Training Stage Does Code Data Help LLMs Reasoning?, 2023. URL https://arxiv.org/abs/2309.16298.
[39] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT, 2023. URL https://arxiv.org/abs/2202.05262.
[40] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training, 2018. URL https://arxiv.org/abs/1710.03740.
[41] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018. URL https://arxiv.org/abs/1809.02789.
[42] Mistral AI team. Mistral NeMo. https://mistral.ai/news/mistral-nemo, July 2024.
[43] Mistral AI team. Mistral Small 3.1. https://mistral.ai/news/mistral-small-3-1, March 2025.
[44] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability, 2023. URL https://arxiv.org/abs/2301.05217.
[45] Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, and Lukas Balles. Hierarchical Autoregressive Transformers for Tokenizer-Free Language Modelling. In The Thirteenth International Conference on Learning Representations, 2025.
[46] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context Learning and Induction Heads, 2022. URL https://arxiv.org/abs/2209.11895.
[47] Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte Latent Transformer: Patches Scale Better Than Tokens, 2024. URL https://arxiv.org/abs/2412.09871.
[48] Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel R. Bowman. QuALITY: Question Answering with Long Input Texts, Yes!, 2022. URL https://arxiv.org/abs/2112.08608.
[49] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016. URL https://arxiv.org/abs/1606.06031.
[50] Björn Plüster. GermanBenchmark: Translating popular LLM benchmarks to German. https://github.com/bjoernpl/GermanBenchmark, 2025. Accessed: 2025-04-17.
[51] Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu.
Qwen2.5 Technical Report, 2025. URL https://arxiv.org/abs/2412.15115.
[52] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2022. URL https://arxiv.org/abs/2112.11446.
[53] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020. URL https://arxiv.org/abs/1910.02054.
[54] Kartik Ravisankar, Hyojung Han, and Marine Carpuat. Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLMs, 2025. URL https://arxiv.org/abs/2504.09378.
[55] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
[56] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender Bias in Coreference Resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
[57] Laura Ruis, Maximilian Mozes, Juhan Bae, Siddhartha Rao Kamalakara, Dwarak Talupuru, Acyr Locatelli, Robert Kirk, Tim Rocktäschel, Edward Grefenstette, and Max Bartolo. Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models, 2025. URL https://arxiv.org/abs/2411.12580.
[58] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019. URL https://arxiv.org/abs/1907.10641.
[59] Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage?, 2023. URL https://arxiv.org/abs/2304.15004.
[60] Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding, 2023. URL https://arxiv.org/abs/2305.14196.
[61] Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset, 2025. URL https://arxiv.org/abs/2412.02595.
[62] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, 2022. URL https://arxiv.org/abs/2210.09261.
[63] Nouamane Tazi, Ferdinand Mom, Haojun Zhao, Phuc Nguyen, Mohamed Mekkouri, Leandro Werra, and Thomas Wolf. The Ultra-Scale Playbook: Training LLMs on GPU Clusters, 2025.
[64] Qwen Team. Qwen2.5: A Party of Foundation Models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
[65] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models, 2023. URL https://arxiv.org/abs/2302.13971.
[66] Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, and David Lopez-Paz. From Bytes to Ideas: Language Modeling with Autoregressive U-Nets, 2025. URL https://arxiv.org/abs/2506.14761.
[67] Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, and Kai Chen. Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks, 2024. URL https://arxiv.org/abs/2404.06480.
[68] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark, 2024. URL https://arxiv.org/abs/2406.01574.
[69] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=yzkSU5zdwD. Survey Certification.
[70] Kaiyue Wen, Zhiyuan Li, Jason S. Wang, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View.
In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=m51BgoqvbP.
[71] Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Ves Stoyanov. Training Trajectories of Language Models Across Scales, 2023. URL https://arxiv.org/abs/2212.09803.
[72] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models. In Marcos K. Aguilera and Hakim Weatherspoon, editors, 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 521–538. USENIX Association, 2022. URL https://www.usenix.org/conference/osdi22/presentation/yu.
[73] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish Your Sentence?, 2019. URL https://arxiv.org/abs/1905.07830.
[74] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685.

Contributions

Llama-TFree-HAT-Pretrained, Llama-3.1-8B-TFree-HAT, and Llama-3.1-70B-TFree-HAT are the result of the work of a large number of people at Aleph Alpha and Aleph Alpha Research. Here we identify the main work-streams associated with the project and list contributors in alphabetical order by surname. Starred authors (*) were core contributors (contributing for >50% of the project time) in those work-streams. Contributions by authors with an α following their name were made under affiliation with Aleph Alpha, and those without under an affiliation with Aleph Alpha Research.

Architecture: Lukas Balles*, Fabien C. Y. Benureau, Constantin Eichenberg*, Jan Hendrik Metzen*, Pit Neitemeier*

Code optimization: Bastian Boll*, Ahmed Hammam*, Johann Higl*, Max Meuer*, Vedant Nanda*, Pit Neitemeier

Pre-training data: Michael Barlow*, Thomas F. Burns*, Björn Deiseroth*, Bastian Harren*, Letitia Parcalabescu*, Volker Stampa*, Stephan Wäldchen*, Gregor Ziegltrum*

Post-training: Artur Baranowski*, Felix Berkenkamp*, Thomas F. Burns, David Friede*, Bastian Harren, Carina Kauf*, Johannes Messner*, Koen Oostermeijer*, Letitia Parcalabescu, Till Speicher, Stephan Wäldchen

Inference: Lukas Balles, Michael Barlow, Lukas Bluebaum*α, Pablo Iyu Guerrero*α, Max Meuer, Pit Neitemeier

Evaluations: Adnen Abdessaied, Fabien C. Y. Benureau*, Thomas F. Burns*, Ahmed Hammam, Carina Kauf, Koen Oostermeijer, Markus Pernpointner*, Felix Reinfurt*, Dylan Rodriquez*, Grégory Schott*, Philipp Siedler*, Martin Simonovsky*, Till Speicher

Research and project coordination: Yasser Jadidi*, Samuel Weinbach*

Open Science and Community Contributions

Our contributions extend beyond the contents of this report and include the following open-weight models released on HuggingFace:
• our base models: https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-base and https://huggingface.co/Aleph-Alpha/tfree-hat-pretrained-7b-base (including 200 pre-training checkpoints over training for future study and use by the community)
• our SFT models: https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-sft and https://huggingface.co/Aleph-Alpha/llama-3_1-70b-tfree-hat-sft
• our DPO models: https://huggingface.co/Aleph-Alpha/llama-tfree-hat-pretrained-7b-dpo and https://huggingface.co/Aleph-Alpha/llama-3_1-8b-tfree-hat-dpo
• our evaluation framework: https://github.com/Aleph-Alpha-Research/eval-framework
• our vLLM inference: https://github.com/Aleph-Alpha/vllm
• our Rust splitter: https://github.com/Aleph-Alpha-Research/hat-splitter

A Evaluation Benchmarks

The following describes the benchmarks and metrics we used to evaluate our
models, using our Apache 2.0 evaluation framework.

A.1 Metric Glossary

log. acc.: Average Log-likelihood Accuracy
norm. log. acc.: Average normalized Log-likelihood Accuracy
comp. acc.: Average Completion Accuracy
norm. prob. mass: Average normalized Probability Mass
bleu: linewise BLEU Score
rouge gm.: Average ROUGE-Geometric-Mean
F1: Average F1
CS: Chatbot Style
IF: Instruction Following
LC: Language Consistency
ES: Exponential Similarity

A.2 English knowledge

We evaluated our pre- and post-trained models’ English-knowledge capabilities on common benchmarks, such as Massive Multitask Language Understanding (MMLU), MMLU-Pro, Graduate-Level Google-Proof Q&A (GPQA), BIG-Bench Hard (BBH), OpenBookQA, TriviaQA, and TruthfulQA.

MMLU
The MMLU benchmark is a comprehensive multitask benchmark composed of multiple-choice questions drawn from a wide range of academic disciplines. It encompasses subjects across the humanities, social sciences, natural sciences, and other domains deemed essential for general education. The benchmark includes 57 distinct tasks, covering areas such as elementary mathematics, U.S. history, computer science, and law. Achieving high accuracy on this test requires models to demonstrate substantial world knowledge and advanced problem-solving capabilities.

Full Text MMLU
The original MMLU benchmark task is multiple-choice, and the goal is to predict the key of the answer, e.g., "A" for the answer "A. The dog". In Full Text MMLU, we extended the benchmark to predict the full text of the answer instead of only the key, e.g., "The dog".

MMLU-Pro
The MMLU-Pro benchmark is an advanced benchmark developed to assess language understanding models on a wider range of more demanding tasks. Building upon the original Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro incorporates more complex, reasoning-intensive questions and expands the number of answer choices per item from four to ten.
This enhancement substantially increases the test’s difficulty and minimizes the likelihood of success through random guessing. The benchmark includes over 12,000 carefully curated questions sourced from academic exams and textbooks, covering 14 diverse subject areas such as Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Mathematics, Philosophy, Physics, Psychology, and others.

GPQA
The GPQA benchmark [55] is a dataset of 448 highly challenging multiple-choice questions authored by experts in biology, physics, and chemistry. Even PhD-level specialists achieve only 65% accuracy (74% excluding identified errors), while skilled non-experts score just 34% despite ample time and full web access, demonstrating the dataset’s robustness against rote memorization from online resources. State-of-the-art models, including a GPT-4 baseline, reach only 39% accuracy. GPQA provides a valuable testbed for developing scalable oversight methods, enabling experts to assess outputs from AI systems that may exceed their own capabilities.

BBH
The BBH benchmark is a curated subset of the BIG-Bench benchmark [62], designed to highlight tasks that remain challenging for LLMs. While overall performance on BIG-Bench has improved, with the best models surpassing average human-rater performance on 65% of tasks using few-shot prompting, BIG-Bench Hard focuses specifically on the remaining tasks where models still underperform. These tasks serve as a valuable diagnostic for identifying persistent limitations in models and for investigating whether such gaps reflect fundamental barriers or simply unsolved challenges that are within reach of existing architectures.

OpenBookQA
The OpenBookQA benchmark is a question-answering benchmark designed to evaluate deeper language and subject understanding through open-book style tasks.
It includes questions that require multi-step reasoning, integration of commonsense and background knowledge, and nuanced text comprehension. Accompanied by a set of core science facts ("the open book"), the dataset is modeled after open-book exams and aims to push research toward more advanced forms of question answering that go beyond surface-level retrieval.

TriviaQA
The TriviaQA benchmark is a large-scale reading comprehension dataset comprising over 650,000 question-answer-evidence triples. It features 95,000 questions written by trivia enthusiasts, each paired with multiple independently collected evidence documents, averaging six per question, which provide strong distant supervision for training and evaluating question-answering models.

TruthfulQA
The TruthfulQA dataset [35] is a benchmark designed to evaluate the truthfulness of language models when generating answers to questions. It comprises 817 questions across 38 categories, including health, law, finance, and politics. The questions are crafted to challenge models with scenarios where human subjects might hold incorrect beliefs or misconceptions, aiming to assess whether models can avoid generating false answers learned from imitating human texts.

A.3 Reasoning

We evaluated our pre- and post-trained models’ reasoning capabilities on the AI2 Reasoning Challenge (ARC) [14] Easy and Challenge sets, WinoGrande, and HellaSwag.

ARC
The ARC benchmark consists of a dataset of 7,787 genuine grade-school science multiple-choice questions, developed to advance research in complex question answering. It is divided into an Easy Set and a Challenge Set, the latter consisting of questions that stump both retrieval-based and word co-occurrence baselines. Accompanying the dataset is a corpus of over 14 million science-related sentences and baseline implementations of three neural models. ARC serves as a benchmark for evaluating models’ ability to perform deeper reasoning beyond simple pattern matching.
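Multiple-choice benchmarks of this kind are commonly scored by comparing the log-likelihood a model assigns to each candidate answer, and the normalized variant divides each log-likelihood by the answer's length so that longer answers are not penalized. The sketch below illustrates only that scoring arithmetic; the toy character-frequency "model" and all function names are illustrative stand-ins for real LLM logits, not the evaluation framework's actual implementation:

```python
import math

# Toy character-level "model": log-probabilities from character frequencies
# in a tiny corpus. A real evaluation would sum token log-probabilities
# from LLM logits; this stand-in only demonstrates the scoring arithmetic.
CORPUS = "the dog chased the cat across the yard"
FREQ = {c: CORPUS.count(c) / len(CORPUS) for c in set(CORPUS)}

def logprob(text: str) -> float:
    """Sum of per-character log-probabilities; unseen characters get a small floor."""
    return sum(math.log(FREQ.get(c, 1e-6)) for c in text)

def score_choices(choices: list[str]) -> tuple[int, int]:
    """Return (raw-argmax index, length-normalized-argmax index).

    Raw log-likelihood accuracy picks the choice with the highest total
    log-likelihood; the normalized variant divides by the answer length
    (characters here) so longer answers are not penalized."""
    raw = [logprob(c) for c in choices]
    norm = [lp / len(c) for lp, c in zip(raw, choices)]
    return raw.index(max(raw)), norm.index(max(norm))

print(score_choices(["yard", "the the the"]))  # → (0, 1)
```

In this toy example the raw score favors the short answer simply because it has fewer terms in the sum, while length normalization flips the pick, which is exactly the bias the normalized metric is meant to remove.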
WinoGrande
The WinoGrande benchmark consists of a dataset of 44,000 commonsense reasoning problems, inspired by the Winograd Schema Challenge but scaled up to improve robustness and mitigate dataset-specific biases. Each instance is framed as a binary fill-in-the-blank task, requiring models to resolve pronoun references based on nuanced commonsense understanding. The benchmark is designed to test a model's ability to perform contextual reasoning beyond surface-level cues.

HellaSwag
The HellaSwag benchmark dataset [73] is designed to evaluate the commonsense reasoning abilities of AI models, particularly in the context of sentence completion tasks. It comprises approximately 70,000 multiple-choice questions from diverse sources, including instructional videos and articles from platforms like WikiHow and ActivityNet. Each question presents a context followed by four possible sentence completions, one of which is correct.

A.4 German
We evaluated our pre-trained models' German-language capabilities on Multilingual Massive Multitask Language Understanding (MMMLU), LAMBADA, ARC (Easy & Challenge), HellaSwag, TruthfulQA, and GSM8K translated to German by [50], and on Wino-X, a version of WinoGrande translated to German by [18].

MMMLU
The MMMLU is a benchmark designed to evaluate the performance of LLMs across multiple languages and disciplines. It extends the original MMLU benchmark by translating its test set into 14 languages, including German, using professional human translators to ensure accuracy. The dataset encompasses 57 subjects ranging from elementary-level topics to advanced professional fields such as law, physics, history, and computer science. The German portion contains 14,000 samples.

WMT16
The WMT16 benchmark is drawn from the shared translation task of the First Conference on Machine Translation [8]. It provides parallel corpora and standardized test sets for evaluating machine translation quality across several language pairs.
We use the English–German language pair and evaluate translation quality using linewise BLEU scores. We report results both in a standard few-shot setting and in an instruction-based setting (WMT16 Instruct) for the SFT and DPO models.

LAMBADA
The LAMBADA benchmark is a word prediction benchmark that evaluates models' ability to understand broad discourse. Each passage is constructed so that human subjects can accurately guess the final word only when given the full context – not just the final sentence. Success on LAMBADA requires models to go beyond local context and integrate information across the entire narrative. Here we use the German-language subset and the Average Completion Accuracy metric.

Additionally, we evaluate our post-trained models' German-language capabilities on the original MT-Bench translated to German.

MT-Bench German
The MT-Bench benchmark [74] is a multi-turn question set designed to evaluate LLM-based chat assistants on open-ended tasks. It is part of a broader effort to address the challenges of assessing LLMs, given the limitations of traditional benchmarks in capturing human preferences. By leveraging strong LLMs as evaluators – despite known biases such as verbosity and self-enhancement – the benchmark demonstrates that models like GPT-4 can align with human judgments over 80% of the time. Alongside MT-Bench, the Chatbot Arena platform provides a complementary crowdsourced evaluation. Together, they offer a scalable and interpretable alternative to costly human preference data. All associated data, including MT-Bench questions, expert votes, and conversation logs, are publicly available.

A.5 Instruction-following
We evaluated our post-trained models' instruction-following capabilities on AlpacaEval, an automated, GPT-4-based evaluation framework for instruction-following tasks that closely aligns with human judgments and enables efficient, reliable benchmarking of language models.
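Pairwise judge-based frameworks of this kind ultimately aggregate per-instruction verdicts into a win rate. A minimal sketch of that aggregation (illustrative only; the real framework adds refinements such as judge-bias corrections that we do not model here):

```python
def win_rate(verdicts):
    """verdicts: one judge decision per instruction, each either
    'model', 'reference', or 'tie'. Ties count as half a win,
    a common convention in pairwise evaluation."""
    score = sum(1.0 if v == "model" else 0.5 if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)

# Toy example: two wins, one tie, one loss over four instructions.
example = win_rate(["model", "tie", "reference", "model"])
```

The resulting number is directly comparable across models judged against the same reference outputs, which is what makes leaderboard-style rankings possible.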
AlpacaEval
AlpacaEval is an automatic evaluation framework for LLMs, designed to be fast, cost-effective, and aligned with human judgment. Based on the AlpacaFarm instruction-following benchmark, it compares model outputs against reference responses using GPT-4-based annotators. The framework demonstrates high agreement with human evaluations, and its leaderboard rankings strongly correlate with those from human annotators, making it a reliable proxy for model assessment. We specifically evaluate and report numbers on CS (Chatbot Style), IF (Instruction Following), and LC (Language Consistency).

A.6 Mathematics
Although our models are not specifically optimized for mathematics, we evaluated our pre- and post-trained (DPO) models on the Grade School Math 8K (GSM8K) benchmark as a standard evaluation of basic math reasoning.

GSM8K
GSM8K is a dataset containing 8.5K high-quality, linguistically diverse math word problems at the grade school level. It was designed to facilitate question-answering tasks that involve basic math and require multi-step reasoning.

A.7 Long-Context
We evaluate our post-trained models' long-context capabilities on Question Answering with Long Input Texts (QuALITY), ZeroSCROLLS (only for the 70B model), and Ada-LEval.

QuALITY
The QuALITY benchmark is a multiple-choice Q&A dataset designed for evaluating long-document comprehension, featuring English passages averaging 5,000 tokens, far longer than what most current models can handle. Unlike previous datasets, questions are created and validated by readers who have read the full passage. Many questions require deep understanding beyond skimming or keyword search, as shown by the large performance gap between baseline models (55.4%) and human subjects (93.5%).

ZeroSCROLLS
The ZeroSCROLLS benchmark suite [60] is a collection of zero-shot benchmarks for natural language understanding on long texts, providing only test and small validation sets.
It includes six adapted tasks from SCROLLS and introduces four new datasets, including novel information aggregation tasks (e.g., summarizing sentiment across reviews). ZeroSCROLLS highlights ongoing challenges in long-context understanding and offers a live leaderboard for researchers to benchmark new approaches. We measure performance and report numbers for Llama-3.1-70B-TFree-HAT on the following tasks: MuSiQue, SpaceDigest, and SQuALITY.

Ada-LEval
The Ada-LEval benchmark is designed to assess long-context understanding through length-adaptable questions. It features two tasks: TSort, which involves correctly ordering shuffled text segments, and BestAnswer, which requires selecting the most accurate answer from multiple candidates. The benchmark allows fine-grained control over test difficulty by adjusting the length and number of segments or distractors. Both tasks require full-text comprehension to succeed, and their design enables precise accuracy measurement, with clear correct answers in both ordering and selection tasks. We measure performance and report numbers on the TextSort Choices and TextSort tasks.

A.8 Safety
We evaluated our pre- and post-trained models' safety attributes on WinoGender, a set of schema sentence pairs that differ only by pronoun gender, designed to detect gender bias in coreference resolution systems.

WinoGender
The WinoGender benchmark is a set of minimal sentence pairs, modeled after Winograd Schemas, designed to test for gender bias in automated coreference resolution systems. Each pair differs only in the gender of a single pronoun, allowing researchers to isolate the impact of gender on model behavior. The sentence templates include three components – an occupation, a participant, and a pronoun referring to one of them – enabling analysis of whether models interpret pronouns differently based solely on gender in otherwise identical contexts.
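Because WinoGender items come in minimal pairs, one natural summary statistic is consistency: how often the model's resolved referent stays the same when only the pronoun's gender changes. The sketch below is our own illustrative framing of such a metric, not the paper's exact scoring procedure:

```python
def gender_consistency(pairs):
    """pairs: list of (referent_with_pronoun_A, referent_with_pronoun_B),
    the model's coreference decision for each gendered variant of the
    same sentence template. Returns the fraction of pairs where the
    decision is unchanged; values below 1.0 indicate that pronoun
    gender alone is influencing the model's behavior."""
    same = sum(a == b for a, b in pairs)
    return same / len(pairs)

# Toy example: the model flips its decision on one of two templates.
score = gender_consistency([("doctor", "doctor"), ("nurse", "doctor")])
```

A perfectly gender-insensitive resolver scores 1.0 on this metric regardless of whether its individual answers are correct, which is why consistency is typically reported alongside, not instead of, accuracy.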