
Paper deep dive

Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Phillip Long, Zachary Novack, Chris Donahue

Year: 2026 · Venue: arXiv preprint · Area: cs.SD · Type: Preprint · Embeddings: 44

Abstract

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.

Tags

ai-safety (imported, 100%) · cssd (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:53:02 AM

Summary

The paper introduces 'Trilobyte', a hierarchical byte-level tokenization scheme for autoregressive language models that enables tractable lossless compression of full-fidelity (16/24-bit) audio. By reducing vocabulary scaling from exponential O(2^b) to constant O(1), Trilobyte overcomes the intractability of standard sample-level tokenization at high bit depths, achieving state-of-the-art compression performance that outperforms FLAC at 8-bit and 16-bit, while providing a foundation for 24-bit modeling.

Entities (5)

FLAC · codec · 100%
Trilobyte · method · 100%
Arithmetic Coding · algorithm · 95%
Autoregressive Language Models · technology · 95%
MusDB18 · dataset · 95%

Relation Signals (3)

Trilobyte improves scaling of Vocabulary Size

confidence 95% · improving vocabulary scaling from O(2^b) to O(1)

Autoregressive Language Models uses Arithmetic Coding

confidence 95% · AR models... can be used with arithmetic coding to achieve compression rates

Trilobyte outperforms FLAC

confidence 90% · While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit

Cypher Suggestions (2)

List datasets used for benchmarking · confidence 95% · unvalidated

MATCH (d:Dataset) RETURN d.name

Find all methods that outperform FLAC · confidence 90% · unvalidated

MATCH (m:Method)-[:OUTPERFORMS]->(c:Codec {name: 'FLAC'}) RETURN m.name

Full Text

43,404 characters extracted from source content.


Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Phillip Long 1,∗,∗∗, Zachary Novack 1,∗, Chris Donahue 2
1 University of California, San Diego, Computer Science and Engineering Department
2 Carnegie Mellon University, School of Computer Science

Abstract

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.

Index Terms: lossless audio compression, full-fidelity audio, autoregressive language models

1. Introduction

Recent work has demonstrated that neural-based approaches can yield dramatic improvements in lossy audio compression [1,2,3], with learned codecs achieving an order of magnitude better compression rates than traditional codecs like MP3 [4] while maintaining comparable perceptual quality. However, the potential of ML for lossless audio compression remains largely unexplored at practical fidelities.
Autoregressive (AR) language models (LMs) offer a potential solution, as such models trained on raw audio samples can be repurposed as lossless compressors via arithmetic coding [5], where compression rate improves as the model's log likelihood increases. However, prior work has been constrained to 8-bit quantization at 16kHz sampling rates [6,7,5,8,9]. This poses a practical problem, as 8-bit, low sampling-rate audio is of limited downstream relevance: the perceptual quality is poor enough that audio is almost never distributed at this fidelity. Professional recording and production workflows universally operate on "CD-quality" audio (44.1kHz, 16-bit) or better (higher sample rates, 24-bit). Additionally, directly modeling higher bit-depth audio creates exponentially larger vocabularies for LMs ($2^{16}$ = 65,536 tokens for 16-bit; $2^{24}$ = 16,777,216 for 24-bit), rendering standard AR approaches computationally intractable. Whether language model compression can thus scale to these full-fidelity regimes remains an open question.

To counteract the exponential scaling of vocabulary size as bit depth increases, we introduce Trilobyte, a byte-level tokenization scheme that improves vocabulary scaling from exponential $O(2^b)$ to constant $O(1)$ in bit depth. We demonstrate that not only does Trilobyte improve compression rates for 16-bit audio relative to sample-level modeling, it enables the first tractable language model compression of 24-bit audio. The Trilobyte tokenization scheme is compatible with any AR modeling framework; here we explore standard decoder-only Transformers specifically.

* These authors contributed equally. ** indicates the corresponding author.
We systematically evaluate LM-based compression, including pre-trained LLMs and models trained from scratch on audio with and without Trilobyte encoding, on full-fidelity audio across diverse domains (music, speech, and bioacoustic signals), sampling rates (16-48kHz), and at 8-, 16-, and 24-bit quantization. Our results primarily highlight that bit depth, rather than sampling rate or data domain, is the limiting factor in LM-based compression. Specifically, while LM-based approaches can outperform the industry-standard lossless codec FLAC [10] at 8-bit (217% average improvement, consistent with prior work), the performance gap narrows substantially at higher bit depths (18% improvement at 16-bit).

Our contributions include: (1) Trilobyte, enabling tractable 24-bit modeling through hierarchical tokenization with constant vocabulary scaling; (2) the first comprehensive benchmarking of LM compression on full-fidelity audio (16/24-bit) across diverse domains; and (3) evidence characterizing the performance gap between learned and traditional compressors across bit depths. We make the code for Trilobyte publicly available at https://github.com/pnlong/trilobyte-experiments.

arXiv:2603.08683v1 [cs.SD] 9 Mar 2026

Figure 1: Tokenization strategies for language model compression. Standard sample-level tokenization (top) yields vocabulary size $|V| = 2^b$. This exponential scaling inhibits modeling of industry-standard bit depths (16, 24). Trilobyte's hierarchical byte-level tokenization (bottom) decomposes samples into bytes, yielding constant $|V| = 256$ regardless of bit depth (at the cost of increasing sequence length by $\lceil b/8 \rceil$). Both feed into an AR LM and arithmetic coder, but Trilobyte enables tractable 24-bit modeling.

2. Related Works

2.1. Traditional Lossless Audio Compression

FLAC (Free Lossless Audio Codec) [10] is the de facto standard for lossless audio compression, achieving typical compression rates of 2x for CD-quality music [11]. FLAC uses linear predictive coding to approximate audio in chunks and Rice coding [12] to efficiently encode residuals [13]. Although FLAC does attempt to exploit stereo redundancy via mid-side encoding, this yields only marginal compression rate improvements for CD-quality audio because the small chunk size of around 4,096 samples is insufficient to accommodate the stereo delays commonly present in music production, which prevents effective decorrelation of the left and right channels.

2.2. ML for Lossless Compression

While recent work has shown that machine learning can dramatically improve lossy audio compression [1,2,3], the lossless setting remains less explored. Delétang et al. [5] and Li et al. [8] propose using large language models like Llama [14,15] and Chinchilla [16] for general-purpose lossless compression via arithmetic coding [17,18]. However, these works' exploration of audio is limited to 8-bit LibriSpeech [19] and LJSpeech [20], where they outperform FLAC. This 8-bit regime is not representative of real-world lossless audio applications, which typically require at least CD-quality (44.1kHz, 16-bit) formats. Heurtel-Depeiges et al. [9] further demonstrate that small pre-trained Transformers can achieve competitive compression rates to FLAC on 8-bit audio.
Critically, none of these works explore whether compression gains extrapolate to higher bit depths. Whether ML-based compression remains competitive at full-fidelity audio is the central question we investigate.

Prior work on AR modeling of raw audio waveforms [6,21,7] has focused on generation rather than compression, operating primarily on 8-bit audio at 16–24kHz. Despite this early interest in waveform-level modeling, such approaches have largely fallen out of favor for generation tasks, supplanted by methods that train on compressed audio tokens [22,23,24]; however, raw waveform modeling remains promising for lossless compression. Traditional approaches address the vocabulary explosion through μ-law companding [25], which reduces bit depth via non-linear quantization. Recent work on byte-level modeling [26] explores tokenization strategies that could scale to larger vocabularies, though these have not been tested on higher bit-rate audio. To our knowledge, no prior work has successfully trained AR models for 16- or 24-bit audio compression at CD-quality sample rates.

3. Methods

AR models offer a fundamentally different paradigm from traditional codecs like FLAC: rather than using a bottleneck representation plus residuals, they directly model the probability distribution over the next sample given all previous samples. The key insight is that any AR probabilistic model $P(x_i \mid x_{<i})$ over discrete sequences can be used with arithmetic coding [17,18,27] to achieve compression rates that approach the entropy of the data [5]. Unlike FLAC's local chunk-based prediction, AR models can capture arbitrarily long-range dependencies in the audio signal, potentially discovering structure that linear prediction cannot.
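As a toy illustration of this coupling (our own sketch, not the paper's implementation), the interval-narrowing scheme can be written with exact rational arithmetic; `prob` stands in for any autoregressive model that returns a next-symbol distribution given the history:

```python
from fractions import Fraction

def arithmetic_encode(symbols, prob):
    """Narrow [0, 1) by the model's cumulative distribution at each step,
    then emit the bits of a dyadic number inside the final interval."""
    low, high = Fraction(0), Fraction(1)
    for i, s in enumerate(symbols):
        dist = prob(symbols[:i])
        cum = Fraction(0)
        for sym in sorted(dist):
            if sym == s:
                break
            cum += dist[sym]
        width = high - low
        low, high = low + width * cum, low + width * (cum + dist[s])
    # grow a binary expansion a = 0.b1b2... until it lands in [low, high)
    bits, a, b = "", Fraction(0), Fraction(1)
    while a < low:
        mid = (a + b) / 2
        if mid < high:
            bits, a = bits + "1", mid
        else:
            bits, b = bits + "0", mid
    return bits

def arithmetic_decode(bits, n, prob):
    """Invert encoding: replay the same model to locate the value's sub-interval."""
    v = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, high, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        dist, cum = prob(out), Fraction(0)
        for sym in sorted(dist):
            width = high - low
            lo, hi = low + width * cum, low + width * (cum + dist[sym])
            if lo <= v < hi:
                out.append(sym)
                low, high = lo, hi
                break
            cum += dist[sym]
    return out
```

With a fixed distribution {0: 1/2, 1: 1/4, 2: 1/4}, a 7-symbol message of total probability $2^{-10}$ encodes into at most 10 bits, matching its negative log-likelihood; a real system would plug a trained LM's softmax into `prob` and use fixed-precision arithmetic instead of exact fractions.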
The compression pipeline operates as follows: we train an AR model to predict $P(x_i \mid x_{<i})$ at each position, then during compression, we iteratively compute these probabilities and use arithmetic coding to encode the sequence into a bitstream. Decompression reverses this process, using the same AR model and arithmetic decoder to sequentially reconstruct each sample.

3.1. Arithmetic Coding

Unlike FLAC's use of Rice coding, which assumes a fixed geometric distribution, arithmetic coding [17,18,27] can efficiently encode any sequence given an arbitrary probability distribution at each step. The core principle is to encode an entire sequence into a single fractional number in $[0, 1)$ by iteratively narrowing an interval based on the cumulative probability distribution. The encoding process begins with the interval $[0, 1)$; for each symbol $x_i$, we partition the current interval according to $P(x_i \mid x_{<i})$ and select the sub-interval corresponding to the observed symbol. The final interval uniquely identifies the sequence, and we output a binary representation of any number within this interval.

Arithmetic coding achieves theoretical optimality, approaching the Shannon entropy [28]. The key advantage is the ability to exploit arbitrary probability distributions from the AR model, not just geometric ones. This creates a direct connection to language models: the cross-entropy loss (negative log-likelihood) minimized during training directly corresponds to the expected coding length [5,9]. Specifically, the average per-token log likelihood $b_\theta = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_\theta(x_i \mid x_{<i})$ corresponds to the expected number of bits needed to encode each token. Accordingly, if each token was originally $b$ bits uncompressed, then the induced compression rate is $b / b_\theta$. Therefore, we can estimate compression performance directly from model loss without implementing the full arithmetic encoder.

3.2. Standard LM Compression

Here we formalize our baseline AR setup for audio waveforms.
Each sample of uncompressed audio is a signed integer of $b$ bits from $\mathbb{Z}_b = \{z \in \mathbb{Z} \mid -2^{b-1} \le z < 2^{b-1}\}$, and a waveform $w \in \mathbb{Z}_b^{T f_s \times c}$ is an array of samples, where $T$ is duration in seconds, $f_s$ is sample rate, $c$ is number of channels, and $b$ is bit depth. We first convert signed samples to unsigned via $x = w + 2^{b-1}$, yielding $x \in \mathbb{N}_b^{T f_s \times c}$, where $\mathbb{N}_b = \{z \in \mathbb{N} \mid 0 \le z < 2^b\}$. We feed $x$ through a standard decoder-only Transformer [29] with causal masking, similar to GPT-2 [30]. The model outputs probability distributions over the next sample $P_\theta(x_i \mid x_{<i})$ for all vocabulary tokens, and the training objective is to maximize the likelihood $\mathcal{L}_\theta = \sum_{i=1}^{N} \log P_\theta(x_i \mid x_{<i})$, equivalent to minimizing cross-entropy loss.

In past work, each audio sample is treated as a token, inducing a vocabulary of $|V| = 2^b$ tokens. While this works reasonably for 8-bit audio ($|V| = 256$), it becomes prohibitive at higher bit depths: 16-bit requires 65,536 tokens and 24-bit requires 16,777,216 tokens. The embedding and output layers scale as $O(d \cdot 2^b)$ parameters, where $d$ is model dimension, quickly exceeding the size of the entire Transformer backbone and creating intractable memory requirements. The context window is limited by the Transformer's maximum sequence length (e.g., 2,048–8,192 samples, ~50–200ms at 44.1kHz), necessitating sliding windows or chunking for longer audio. This intractability at higher bit depths motivates the byte-level tokenization approaches we explore next.

3.3. Trilobyte: Hierarchical Byte-Level Tokenization

To address the vocabulary explosion of sample-level tokenization, we introduce Trilobyte, which decomposes each $b$-bit sample into $B = \lceil b/8 \rceil$ bytes. Rather than modeling these bytes with distinct subvocabularies, Trilobyte simply predicts over 256 possible values at each byte position in the sequence, maintaining a constant vocabulary size regardless of bit depth.
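The byte decomposition just described can be sketched in a few lines (illustrative only; the function names are ours, not the paper's):

```python
def split_bytes(x, b):
    """Decompose an unsigned b-bit sample into B = ceil(b/8) byte tokens,
    most significant byte first; every token lies in [0, 256)."""
    B = (b + 7) // 8
    return [(x >> (8 * (B - 1 - j))) & 0xFF for j in range(B)]

def join_bytes(tokens):
    """Reassemble a sample from its MSB-first byte tokens (lossless inverse)."""
    x = 0
    for t in tokens:
        x = (x << 8) | t
    return x
```

A 24-bit sample thus becomes three tokens from a fixed 256-entry vocabulary, e.g. `split_bytes(0xABCDEF, 24)` gives `[0xAB, 0xCD, 0xEF]`, and `join_bytes` recovers the sample exactly, which is what makes the scheme lossless.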
This reduces vocabulary scaling from exponential $O(2^b)$ to constant $O(1)$ while enabling the model to implicitly learn separate distributions for each byte position autoregressively through the one-to-one mapping between byte position and sequence index. We interleave the constituent bytes of each audio sample (MSB, middle byte(s), LSB), and train a GPT-2 architecture [30] on these byte-level sequences. We also experimented with an expanded vocabulary variant that partitions tokens into explicit subvocabularies per byte position (linear rather than constant scaling with bit depth), but this yielded negligible compression gains (< 0.003x), suggesting the constant vocabulary already learns separate byte distributions implicitly through AR context.

For stereo audio, we concatenate left and right channels in random order (either $x^L_1, x^L_2, \ldots, x^L_{N/2}, x^R_1, x^R_2, \ldots, x^R_{N/2}$ or vice versa) rather than interleaving at the sample level ($x^L_1, x^R_1, x^L_2, x^R_2, \ldots$), allowing the model to exploit cross-channel correlations when transitioning between channels in its AR predictions. We found that compression rates are nearly identical between concatenation and sample-level interleaving, so we use concatenation for simplicity. By conditioning the second channel's predictions on the first channel's context, the model can potentially capture redundancies beyond FLAC's mid-side encoding, which is limited by small block sizes.

Trilobyte enables the first tractable language model compression of 24-bit professional audio while achieving comparable performance to naive 16-bit modeling with full softmax. Since Trilobyte operates on byte-level sequences, we compute bits per byte (BPB) and convert to compression rate as $8/\mathrm{BPB}$. Note that at 8-bit, Trilobyte's tokenization collapses to standard sample-level tokenization (1 byte per sample, vocabulary of 256), so Trilobyte is identical to standard LMs for the 8-bit regime.
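Putting these pieces together, a rough end-to-end tokenizer for a multi-channel waveform might look like the following (a sketch under our own naming; the signed-to-unsigned shift $x = w + 2^{b-1}$, the MSB-first byte split, the channel concatenation, and the $8/\mathrm{BPB}$ conversion all follow the text, but the paper's actual code may differ):

```python
def tokenize(channels, b):
    """Map a multi-channel signed waveform to Trilobyte-style byte tokens:
    channels are concatenated (not sample-interleaved), each signed sample is
    shifted to unsigned via x = w + 2**(b-1), then split MSB-first into bytes."""
    B = (b + 7) // 8
    tokens = []
    for ch in channels:                     # channel concatenation
        for w in ch:
            x = w + (1 << (b - 1))          # signed -> unsigned
            for j in range(B):              # MSB first
                tokens.append((x >> (8 * (B - 1 - j))) & 0xFF)
    return tokens

def compression_rate(log2_probs):
    """Compression rate 8 / BPB, where BPB is the mean negative log2-likelihood
    the model assigns to each byte token."""
    bpb = -sum(log2_probs) / len(log2_probs)
    return 8 / bpb
```

For example, a model that assigns every byte token probability $2^{-4}$ has BPB = 4 and an estimated compression rate of 2x, without ever running the arithmetic coder.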
We compare Trilobyte against both sample-level tokenization and a byte-level in-context LM baseline.

3.4. Additional Baselines and Experimental Approaches

Beyond our primary ML-based approaches, we evaluate several additional baselines and experimental methods. We conduct extensive experiments with FLAC at different compression levels (0–8) to understand the performance ceiling of traditional methods across our diverse audio domains (see Appendix A for detailed results). As an additional baseline, we compare to the in-context compression approach of Delétang et al. [5] and Li et al. [8], using the pretrained Llama-2-7B model [15]. Here, instead of training on audio, audio byte streams are compressed by naively encoding the bytes as gibberish text and compressing the text using the pre-trained LM. As this method is intractably slow, following previous work [5] we report results over 1K randomly sampled, 1,024-sample chunks across each dataset. We discuss our further experiments with this approach in Appendix C. Additionally, we explored neural audio codecs as drop-in replacements for linear predictive coding in FLAC-style compression, hypothesizing that learned representations might yield better residual distributions for Rice coding (Appendix B). While most of these approaches yielded poor results relative to our primary methods, they provide insights into the limits of different compression paradigms.

4. Experimental Setup

Our evaluation spans multiple datasets across diverse audio domains. For music, we use MusDB18 [31], a multi-track music database with stereo mixes and stems in various configurations (stereo mixes, mono mixes, stereo stems, mono stems, combined stereo, combined mono).
To better reflect real-world compression scenarios, we additionally include a dataset of commercial music data at multiple bit depths and quality levels (16-bit and 24-bit): 1,569 songs (120 hours) at 16-bit, and 933 songs (70 hours) at 24-bit, with the latter including high-resolution recordings up to 192kHz sampling rate. Unlike many academic music datasets, which are often distributed in processed or lossy formats and may not reflect the distribution of commercially released recordings, this music more closely represents the types of lossless audio files that users seek to store and compress in practice. However, for consistency, we resample all commercial music data to 44.1kHz, the most common sample rate in the corpus. We additionally incorporate the music datasets Beethoven [7,21] (recordings of Ludwig van Beethoven's piano sonatas) and YouTube Mix [7,32] (piano music from YouTube) into our experiments. For speech, we evaluate on LibriSpeech [19] (clean read English speech), LJSpeech [20] (single-speaker audiobook recordings), SC09 [7,33,34] (spoken digit recognition), and VCTK [35] (multi-speaker English corpus). We also evaluate on other audio domains: Birdvox [36] (bioacoustic bird vocalizations) and Epidemic Sound [37] (sound effects library).

We evaluate at each dataset's native bit depth: 8-bit (Beethoven, YouTube Mix, SC09, all 16kHz), 16-bit (LibriSpeech 16kHz, LJSpeech 22.05kHz, Birdvox 24kHz, MusDB18 44.1kHz, VCTK and Epidemic Sound 48kHz, and commercial 16-bit at 44.1kHz), and 24-bit (commercial 24-bit at 44.1kHz). We compare FLAC (compression level 8, maximum), sample-level tokenization (90M params at 8-bit, 140M at 16-bit), and Trilobyte (90M). All models are trained for a fixed 300K steps.

4.1. Results

Table 1 presents compression rates across all methods and bit depths. Throughout this section, percentages refer to relative gains (e.g., 2x to 3x is a 50% improvement, yielding 50% smaller files).

Table 1: Compression rates across methods and bit depths. FLAC uses compression level 8. In-context uses pretrained Llama-2-7B [15]. Standard refers to sample-level tokenization; Trilobyte uses hierarchical byte-level tokenization (both with trained Transformers). Sample-level is equivalent to Trilobyte at 8-bit and intractable at 24-bit (16.7M vocabulary). Transfer denotes a single 24-bit Trilobyte model trained on all datasets with lower bit masking for lower bit depths. Bold indicates best performance.

| b (Bits) | f_s (Hz) / c (#Ch) | Domain | Dataset | FLAC | In-context | Standard | Trilobyte | Transfer |
|---|---|---|---|---|---|---|---|---|
| 8 | 16000 / 1 | Speech | SC09 | 0.95 | 1.80 | 2.08 | 2.08 | 2.88 |
| 8 | 16000 / 1 | Music | Beethoven | 1.69 | 1.33 | 7.94 | 7.94 | 7.45 |
| 8 | 16000 / 1 | Music | YouTube Mix | 1.58 | 1.28 | 4.15 | 4.15 | 5.14 |
| 16 | 48000 / 1 | Speech | VCTK | 2.32 | 1.75 | 2.66 | 2.66 | 2.68 |
| 16 | 22050 / 1 | Speech | LJSpeech | 1.69 | 1.49 | 1.98 | 2.08 | 2.04 |
| 16 | 16000 / 1 | Speech | LibriSpeech | 1.74 | 1.64 | 2.06 | 2.11 | 2.10 |
| 16 | 24000 / 1 | Bioacoustics | Birdvox | 2.33 | 1.75 | 2.47 | 2.48 | 2.48 |
| 16 | 48000 / 1 | SFX | Epidemic Sound | 2.63 | 1.70 | 3.00 | 3.40 | 3.10 |
| 16 | 44100 / 1 | Music | MusDB18 (All) | 2.15 | 1.53 | 2.64 | 2.82 | 2.75 |
| 16 | 44100 / 2 | Music | MusDB18 (Mixes) | 1.87 | 1.27 | 1.85 | 2.08 | 1.98 |
| 16 | 44100 / 2 | Music | Commercial 16-bit | 1.74 | 1.26 | 1.64 | 1.86 | 1.74 |
| 24 | 44100 / 2 | Music | Commercial 24-bit | 1.63 | 1.07 | ✗ | 1.48 | 1.47 |

At 8-bit, the standard and proposed Trilobyte tokenization schemes are equivalent: both substantially outperform FLAC, achieving 370%, 163%, and 119% improvements on Beethoven, YouTube Mix, and SC09, respectively. The wide variation in Trilobyte's 8-bit performance (2.08–7.94x) indicates that compression gains depend heavily on audio domain structure: here, the music datasets are acoustically narrow (solo piano) and compress better than SC09 (multi-speaker, multi-microphone).

At 16-bit, AR modeling provides consistent compression gains relative to FLAC, though the gains are modest compared to those seen for 8-bit audio: 15%, 31%, and 21% improvements on VCTK, MusDB18 Mono, and LibriSpeech, respectively. Moreover, at 16-bit, FLAC compression rate correlates with Trilobyte compression rate across datasets ($r = 0.92$, $p \ll 0.01$).
Higher sample rates appear to have less of an effect on overall compression rate relative to bit depth; some high sample rate datasets compress better than low rate ones (VCTK at 48kHz is 2.66x vs. LibriSpeech at 16kHz at 2.11x). Sample-level tokenization performs comparably to Trilobyte on some datasets (2.64x on MusDB18 Mono, 2.66x on VCTK) but generally falls short, particularly on music. Notably, Epidemic Sound achieves 3.40x compression with Trilobyte, a 29% improvement over FLAC.

At 24-bit, bit depth becomes the fundamental barrier: the 16.7M-token vocabulary required for sample-level tokenization is completely intractable, requiring approximately 12B parameters for the output projection matrix alone. Trilobyte's hierarchical byte-level decomposition sidesteps this exponential scaling by reducing the vocabulary to just 256 tokens, enabling the first tractable 24-bit language model compression. However, our approach (1.48x) falls 9% short of FLAC (1.63x). One possible explanation is that a non-trivial amount of the information in the least significant bits at 24-bit is imperceptible noise; audio tool chains with up to 144dB of dynamic range are needed to preserve the signal at 24 bits. Rice coding in FLAC may be nearly optimal for compressing this low-amplitude noise. The in-context LM approach with pretrained Llama-2-7B underperforms both FLAC and Trilobyte across all datasets and bit depths except for 8-bit SC09, suggesting that text-pretrained models without domain-specific training struggle to capture audio structure effectively.

4.1.1. Transfer Learning with Trilobyte

We further investigated Trilobyte's ability to losslessly compress arbitrary bit depths with a single model.
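One way to model multiple bit depths in a single token stream is to replace each sample's least-significant bytes with a learned null token; a minimal sketch of that idea follows (the token id and function name are hypothetical, not taken from the paper's code):

```python
MASK = 256  # hypothetical id for the learned null token (byte vocab 0-255 plus mask)

def mask_lower_bytes(byte_tokens, bytes_per_sample=3, keep=2):
    """Emulate lower-bit-depth audio inside a fixed-width byte-token stream by
    replacing each sample's least-significant bytes with the null token."""
    out = []
    for i in range(0, len(byte_tokens), bytes_per_sample):
        sample = byte_tokens[i:i + bytes_per_sample]
        out.extend(sample[:keep] + [MASK] * (len(sample) - keep))
    return out
```

Under this sketch, a 24-bit stream tokenized as three bytes per sample is treated as 16-bit by masking every third token, so one model can be trained or queried at several bit depths.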
Specifically, by masking out the lower bytes with a learned null token, we can simultaneously model multiple bit depths within a single model during training, synthetically interleaving lower bit-rate audio with such null tokens at inference to perform any-bit lossless compression. To evaluate this capability, we train models on Commercial (24-bit) and MusDB18 Stereo (16-bit) with an additional mask token and randomly drop out lower-significance bytes during training ($p = 0.1$), then test whether the resulting models can compress audio at multiple bit depths without retraining. For 24-bit Commercial music, we attain 1.49x (24-bit), 1.78x (16-bit), and 3.4x (8-bit) compression rates with a single model, while on 16-bit MusDB18 stereo we attain 2.07x (16-bit) and 3.8x (8-bit) compression.

From this initial result, we then trained a single Trilobyte model over all data using our multi-bit-rate masking scheme, with results shown in the final column of Table 1 (Transfer). We find that the compression performance is similar to the per-dataset Trilobyte models (with our joint model performing slightly worse on some datasets and better on others), showing that we can successfully train a single generalist LM-based compressor (even without scaling the model size) over diverse audio corpora using Trilobyte and our lower byte masking. We release this generalist Trilobyte model as an open-source lossless audio codec² to serve as a baseline for future research in learned lossless compression across diverse audio domains, sample rates, and bit depths.

5. Conclusion

Our evaluation reveals that Trilobyte achieves consistent, albeit modest, gains over FLAC at 16-bit, with an average improvement of 18% across domains. These modest 16-bit gains contrast sharply with 8-bit performance (217% average improvement). At 24-bit, Trilobyte trails FLAC by 9% but enables the first tractable LM compression at this bit depth, where sample-level approaches are completely infeasible.
Transfer results show a single 24-bit Trilobyte model trained on all datasets achieves compression rates comparable to dataset-specific models across bit depths. Our results also highlight that bit depth, not sampling rate or data domain, becomes the primary bottleneck, as LMs consistently outperform FLAC at 8-bit but gains quickly diminish at higher bit depths. This suggests FLAC operates near fundamental entropy bounds for full-fidelity audio, and our empirical compression rates establish lower bounds across audio domains.

We acknowledge that our ML approaches are orders of magnitude slower than FLAC; their modest compression wins are unlikely to justify their computational costs for real-world deployment. Nevertheless, this work addresses a critical gap in the literature: prior LM-based compression research has been constrained to 8-bit audio, leaving unexplored whether these methods scale to the full-fidelity regimes where lossless compression is actually needed. We provide the first comprehensive benchmark of language model compression on CD-quality (16-bit) and professional (24-bit) audio, demonstrating that standard sample-level approaches face increasingly intractable vocabulary sizes. Trilobyte's hierarchical byte-level tokenization overcomes this fundamental barrier, reducing vocabulary scaling from exponential $O(2^b)$ to constant $O(1)$ and enabling tractable modeling at arbitrary bit depths. Although current compression gains remain modest, our work demonstrates that learned approaches can consistently outperform FLAC across diverse audio domains and bit depths. We anticipate that future research may work to scale the performance of such models and/or improve their efficiency.

² https://github.com/pnlong/trilobyte-lossless-codec
Acknowledgments We are grateful to Roger Dannenberg, Sander Dieleman, and Jesse Engel for their insightful feedback on early explorations of these ideas, which helped shape the direction of this research, and John Thickstun for providing helpful feedback on later stages of the research. 7. Generative AI Use Disclosure Generative AI tools (Claude) were used solely for minor editing and formatting assistance. All technical content, experimental work, and analysis were performed entirely by the authors, who are fully responsible for the work presented. 8. References [1]N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasac- chi, “SoundStream:An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, 2021. [2]R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Ku- mar, “High-fidelity audio compression with improved RVQ-GAN,” NeurIPS, 2023. [3] A. D ́ efossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neu- ral audio compression,” arXiv preprint arXiv:2210.13438, 2022. [4]K. Brandenburg, “MP3 and AAC explained,” in Audio Engineering Society Conference, 1999. [5]G. Del ́ etang, A. Ruoss, P.-A. Duquenne, E. Catt, T. Genewein, C. Mattern, J. Grau-Moya, L. K. Wenliang, M. Aitchison, L. Orseau et al., “Language modeling is compression,” in ICLR, 2023. [6]A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu et al., “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016. [7]K. Goel, A. Gu, C. Donahue, and C. R ́ e, “It’s raw! audio generation with state-space models,” in ICML, 2022. [8]Z. Li, C. Huang, X. Wang, H. Hu, C. Wyeth, D. Bu, Q. Yu, W. Gao, X. Liu, and M. Li, “Lossless data compression by large models,” Nature Machine Intelligence, vol. 7, no. 5, p. 794–799, 2025. [9]D. Heurtel-Depeiges, A. Ruoss, J. Veness, and T. 
Genewein, “Com- pression via pre-trained transformers: A study on byte-level multi- modal data,” arXiv preprint arXiv:2410.05078, 2024. [10]Xiph.Org Foundation, “FLAC: Free Lossless Audio Codec,” RFC 9639, 2001. [Online]. Available: https://xiph.org/flac/ [11]M. van Beurden, “Lossless audio codec comparison - revision 4,” audiograaf.nl, Tech. Rep., January 2015. [Online]. Avail- able: http://w.audiograaf.nl/losslesstest/Lossless%20audio% 20codec%20comparison%20-%20revision%204.pdf [12]R. F. Rice, “Some practical universal noiseless coding techniques,” Jet Propulsion Laboratory, California Institute of Technology, Tech. Rep., 1979. [13]M. van Beurden and A. Weaver, “Rfc 9639: Free lossless audio codec (flac),” 2024. [14]H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi ` ere, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [15] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [16]J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022. [17]R. C. Pasco, “Source coding algorithms for fast data compression,” Ph.D. dissertation, Stanford University CA, 1976. [18]J. Rissanen, “Generalized kraft inequality and arithmetic coding,” IBM J. Res. Dev., vol. 20, p. 198–203, 1976. [Online]. Available: https://api.semanticscholar.org/CorpusID:16011297 [19]V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, p. 5206–5210. [20]K. Ito and L. 
Johnson, “The LJ Speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
[21] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” in International Conference on Learning Representations, 2017.
[22] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” in NeurIPS, 2017.
[23] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[24] K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y. Adi, A. Polyak, B. Bolte, T.-A. Nguyen, J. Copet, A. Baevski, A. Mohamed, and E. Dupoux, “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.79/
[25] M. Lewis and S. MTSA, “A-law and mu-law companding implementations using the TMS320C54x,” Application Note SPRA163A, Texas Instruments, Dallas, TX, USA, 1997.
[26] L. Yu, D. Simig, C. Flaherty, A. Aghajanyan, L. Zettlemoyer, and M. Lewis, “MegaByte: Predicting million-byte sequences with multiscale transformers,” Advances in Neural Information Processing Systems, vol. 36, pp. 78808–78823, 2023.
[27] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
[28] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, 2017.
[30] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, 2019.
[31] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R.
Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
[32] DeepSound, “SampleRNN,” https://github.com/deepsound-project/samplernn-pytorch, 2017.
[33] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” in International Conference on Learning Representations, 2019.
[34] P. Warden, “Speech Commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
[35] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.
[36] A. Farnsworth, B. Van Doren, S. Kelling, V. Lostanlen, J. Salamon, A. Cramer, and J. Bello, “BirdVox-full-season: 6672 hours of audio from migratory birds,” Zenodo, 2022.
[37] LAION-AI, “LAION-AI/audio-dataset Epidemic Sound,” Nov. 2022. [Online]. Available: https://github.com/LAION-AI/audio-dataset/blob/main/datacard/Epidemicsound.md

A. Extended FLAC Experiments

FLAC offers compression levels 0–8, where higher levels exhaustively try more subframe types, linear predictive coding orders, and Rice parameters to find optimal compression at the cost of slower encoding (decoding speed remains constant). Level 5 (the default) provides a good balance between compression quality and speed, while level 8 tries essentially all combinations.

Figure 2: FLAC compression performance across diverse audio domains at 8-bit and 16-bit quantization levels. Birdvox achieves exceptional compression (∼6x at 8-bit), perhaps reflecting the sparse and structurally constrained nature of bird vocalizations, which are highly predictable under linear predictive coding. Meanwhile, speech and music datasets show more modest gains. 16-bit audio generally achieves 1.5–2.5x compression, with diminishing returns beyond FLAC level 3. Note that we disable FLAC's verbatim, constant, and fixed subframe types, and that we do not evaluate Beethoven, YouTube Mix, or SC09 beyond 8-bit because they are 8-bit datasets.

Figure 3: Compression rate comparison across FLAC, DAC, EnCodec, and Custom DAC compressors on MusDB18 mixes. FLAC achieves the best compression, at approximately 1.8x, while the NAC-based approaches underperform, with EnCodec actually increasing file size.

Figure 2 shows FLAC compression rates at different compression levels (0–8) for 8-bit and 16-bit audio across all datasets. An important configuration note: to isolate the effect of linear predictive coding as a lossy estimator, we disable FLAC's verbatim, constant, and fixed subframe types, since one could trivially implement these same subframe types in any audio codec. This explains the noticeable drop in compression rate at level 3, especially visible in MusDB18 Mono, over 80% of which consists of sparse, largely silent stems. These subframe types are normally used to efficiently encode silence and constant-value regions; without them, silent regions compress less effectively.

Compression rate patterns vary significantly by bit depth. For 8-bit audio, compression varies widely by dataset. Music as a whole achieves moderate compression (1.8–3x), though MusDB18 Mono in particular achieves 4x due to its sparse multi-track content.
The speech datasets LibriSpeech, LJSpeech, and VCTK achieve 2.5–4x compression, while SC09 barely compresses at all, perhaps a consequence of the short track durations in the dataset. Remarkably, Birdvox achieves exceptional ∼6x compression, likely due to its sparse yet locally predictable bioacoustic signals. For 16-bit audio, compression is more modest, generally 1.5–2.7x. Music datasets cluster around 1.6–2.3x, speech datasets achieve roughly the same, and bioacoustics and sound effects perform best, with Epidemic Sound reaching 2.6x compression.

B. Neural Audio Codecs for Compression

We also tried the alternative paradigm of FLAC-style compression: a bottleneck representation plus residuals encoded with Rice coding [12]. We explored neural audio codecs (NACs) as drop-in replacements for linear predictive coding in this pipeline. This approach did not outperform FLAC; we summarize the setup and results here.

FLAC compresses audio using linear predictive coding and Rice-coded residuals; we replace linear predictive coding with a NAC (Descript Audio Codec (DAC) [2], EnCodec [3], or Custom DAC), storing latent codes and Rice-encoded residuals. The compression pipeline is to (1) encode audio through the NAC encoder to obtain discrete latent codes z, (2) decode to obtain the reconstruction x̂, (3) compute residuals r = x − x̂, and (4) store z and the Rice-encoded r̃. Decompression reverses this process. Rice coding assumes residuals follow a roughly geometric distribution.

We evaluate on MusDB18 [31] mixes (44.1kHz, 16-bit CD-quality stereo music). We compare four approaches: FLAC (at the default compression level 5), DAC (DAC-44.1kHz with 3 codebook levels), EnCodec (EnCodec-48kHz with 4 codebook levels), and Custom DAC (our DAC-44.1kHz variant trained without adversarial or perceptual losses). We hypothesize that our Custom DAC variant will produce residuals more amenable to Rice coding than NACs trained purely for perceptual quality.

Figure 4: Residual distribution comparison showing residual magnitudes (note the log scale) for FLAC, DAC, EnCodec, and Custom DAC compressors. FLAC residuals follow a geometric distribution with a mean absolute residual of 156.34, while DAC, EnCodec, and Custom DAC residuals are more uniformly distributed regardless of codebook level, with mean absolute residuals of 1,603.54 (DAC), 18,376.66 (EnCodec), and 1,245.76 (Custom DAC) – an order of magnitude larger than FLAC.

Figure 3 shows compression rates. FLAC achieves the best compression (approximately 1.8x). DAC and Custom DAC underperform at about 1.2x; EnCodec performs worst, actually increasing file size (compression rate < 1.0x). Several factors explain the NAC underperformance, chief among them that NAC residuals do not follow the geometric distribution Rice coding assumes. Figure 4 shows histograms of absolute residuals for FLAC, DAC, EnCodec, and Custom DAC compressors. FLAC residuals exhibit a clear geometric distribution (concentrated near zero with exponentially decreasing frequency), precisely what Rice coding requires. In contrast, NAC-based residuals are more uniformly distributed with less concentration near zero, deviating from the geometric assumption.

This distributional mismatch is critical. Rice coding's efficiency relies on the geometric distribution assumption: when residuals follow other distributions (as with NAC reconstructions), Rice coding becomes suboptimal. This fundamental incompatibility explains why NAC-based approaches do not outperform FLAC despite potentially capturing more complex audio structure.
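The residual-plus-Rice step of the pipeline can be sketched in a few lines. This is a minimal illustration, not FLAC's or the paper's implementation; the zigzag mapping and the parameter k are standard textbook choices, and the signal values are toy data we made up.

```python
# Minimal sketch of pipeline steps (3)-(4): signed residuals are mapped to
# unsigned integers, then Rice-coded (unary quotient + k-bit remainder).
# Illustrative only; not FLAC's or the paper's actual implementation.

def zigzag(n: int) -> int:
    """Map signed residuals to unsigned codes: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def rice_encode(n: int, k: int) -> str:
    """Rice code for a nonnegative integer: unary quotient, then k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k > 0 else "")

# Geometric-looking residuals (concentrated near zero) yield short codewords:
x = [100, 102, 101, 99]          # toy signal
x_hat = [100, 101, 101, 100]     # toy reconstruction
residuals = [a - b for a, b in zip(x, x_hat)]               # r = x - x_hat
bitstream = "".join(rice_encode(zigzag(r), k=2) for r in residuals)

# A single NAC-scale residual (~18,000, cf. EnCodec's mean absolute residual)
# explodes the unary quotient, which is why Rice coding degrades badly when
# the residual distribution is far from geometric:
long_code = rice_encode(zigzag(18376), k=2)
```

The unary quotient is what makes the code length grow linearly in the residual magnitude, so a distribution with heavy mass far from zero is the worst case for this coder.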
The underlying cause is that neural codecs are trained with adversarial and perceptual losses that optimize for human-perceivable quality rather than for minimizing absolute waveform error, creating structured reconstruction errors that do not follow simple geometric distributions. This also explains EnCodec's extremely poor compression performance: its residual distribution appears uniform, and certainly not geometric. Notably, training a custom DAC-44.1kHz variant without adversarial or perceptual losses (Custom DAC) appears to have produced a slightly different residual distribution from DAC itself, as our Custom DAC compressor has a markedly lower mean absolute residual. However, this improvement does not translate into better compression rates, likely because the Custom DAC's residual distribution remains non-geometric.

In short, NAC-based approaches fail to outperform FLAC on both compression rate and speed. The fundamental issue is that NAC residuals do not follow geometric distributions, which makes Rice coding inefficient.

Figure 5: In-context LM-based compression performance with the method defined in Delétang et al. [5] and Li et al. [8] using pre-trained language models (Llama-2-7B and Llama-2-13B [15]) across diverse audio domains at 8-bit and 16-bit quantization. We also report FLAC compression results at compression level 8, the maximum. Model scaling (7B to 13B) shows minimal gains at 8-bit and some improvements at 16-bit, especially for complex datasets. This method underperforms FLAC on most signals, with the exception of SC09 and Epidemic Sound at 8-bit.

C. In-context LMs for Compression

Figure 5 shows compression rates for the in-context LM-based approach from Delétang et al. [5] and Li et al. [8] using Llama-2-7B and Llama-2-13B [15] across all datasets for 8-bit and 16-bit audio. Because this method of compression is intractably slow, the results shown here are over 1,000 randomly selected, 1,024-sample chunks (1,024 bytes for 8-bit, 2,048 bytes for 16-bit) from each dataset.

Comparing the language-model-based approach to FLAC reveals that it underperforms in nearly all scenarios. FLAC dominates across most datasets and bit depths: for Birdvox at 8-bit, FLAC achieves ∼6x compression compared to ∼4x for the in-context LM approach, and for music datasets such as MusDB18 Mono and the commercial data, FLAC consistently outperforms by substantial margins. The only exceptions are SC09 at 8-bit, where the language model approach (∼1.8x) significantly outperforms FLAC (1x, effectively no compression), and Epidemic Sound at 8-bit, where the language model approach achieves ∼3.5x compared to FLAC's ∼3x. Model scaling from 7B to 13B parameters shows minimal gains and, in fact, often leads to lower compression rates.

The broader implications highlight fundamental limitations of this approach for lossless audio compression. As Delétang et al. [5] acknowledge, the model size (in bytes) must be amortized over the compressed data, which drastically lowers the effective compression rates: a 7B-parameter model requires gigabytes of storage, making it impractical unless compressing massive audio archives. The computational cost of running inference with billion-parameter models is only justified in domains where they significantly outperform FLAC, which our results show to be extremely limited.
For the vast majority of audio domains, including music and bioacoustic signals, FLAC remains the superior choice, suggesting that general-purpose language models trained on text do not transfer effectively to the diverse statistical structures present in audio waveforms.
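The amortization argument above follows directly from the entropy-coding view of LM compression [5, 27, 28]: an arithmetic coder driven by the model's next-token distribution spends about -log2 p(token) bits per token, and any stored model weights must be counted against the archive. The sketch below illustrates this accounting; the probabilities and sizes are made-up values, not results from the paper.

```python
import math

# Accounting sketch for LM-based lossless compression: an arithmetic coder
# driven by the model's next-token distribution emits about -log2 p(token)
# bits per token, so compressed size approaches the model's cross-entropy
# on the data. All numbers below are illustrative, not from the paper.

def compressed_bits(token_probs):
    """Shannon-optimal code length, in bits, for the observed tokens."""
    return sum(-math.log2(p) for p in token_probs)

# Toy per-byte probabilities; raw 8-bit audio costs 8 bits per sample.
probs = [0.9, 0.8, 0.95, 0.7]
raw_bits = 8 * len(probs)
rate = raw_bits / compressed_bits(probs)    # compression rate before amortization

# Amortizing the model weights over the archive (hypothetical sizes):
data_bytes = 1e9                            # 1 GB of audio to compress
model_bytes = 13e9 * 2                      # e.g. a 13B-parameter model in fp16
compressed_bytes = data_bytes / rate
amortized_rate = data_bytes / (compressed_bytes + model_bytes)
```

With the model counted in, the effective rate collapses well below 1x unless the archive dwarfs the weights, which is the practical limitation the text describes.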