
Paper deep dive

How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

Sofiane Ouaari, Jules Kreuer, Nico Pfeifer

Year: 2026 · Venue: arXiv preprint · Area: q-bio.GN · Type: Preprint · Embeddings: 69

Abstract

DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success. Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings. Training code, model weights and evaluation pipeline are released on: this https URL.

Tags

ai-safety (imported, 100%) · preprint (suggested, 88%) · q-biogn (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/13/2026, 12:29:11 AM

Summary

This study investigates the privacy risks associated with DNA foundation models in Embeddings-as-a-Service (EaaS) frameworks. The authors evaluate the resilience of three models—DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2)—against model inversion attacks. Results indicate that per-token embeddings are highly vulnerable, allowing near-perfect sequence reconstruction. Mean-pooled embeddings offer more privacy but remain susceptible to reconstruction, with Evo 2 and NTv2 showing higher vulnerability compared to DNABERT-2, which benefits from BPE tokenization resilience.

Entities (5)

DNABERT-2 · dna-foundation-model · 100%
Evo 2 · dna-foundation-model · 100%
Nucleotide Transformer v2 · dna-foundation-model · 100%
Model Inversion Attack · privacy-attack · 98%
Embeddings-as-a-Service · deployment-framework · 95%

Relation Signals (3)

Evo 2 vulnerable to Model Inversion Attack

confidence 95% · Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences

DNABERT-2 exhibits resilience to Model Inversion Attack

confidence 90% · DNABERT-2's BPE tokenization provides the greatest resilience.

Embeddings-as-a-Service facilitates Model Inversion Attack

confidence 85% · We investigate and benchmark the robustness of shared embeddings from DNA foundation models in an Embeddings-as-a-Service (EaaS) setting against model inversion privacy attacks.

Cypher Suggestions (2)

Find all DNA foundation models and their vulnerability status to model inversion attacks. · confidence 90% · unvalidated

MATCH (m:FoundationModel)-[r:VULNERABLE_TO|RESILIENT_TO]->(a:Attack {type: 'Model Inversion'}) RETURN m.name, type(r) AS status

Identify models used in EaaS frameworks. · confidence 85% · unvalidated

MATCH (m:FoundationModel)-[:DEPLOYED_IN]->(e:Framework {name: 'EaaS'}) RETURN m.name

Full Text

69,036 characters extracted from source content.


How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences

Sofiane Ouaari 1,2, Jules Kreuer 1,2 and Nico Pfeifer 1,2
1 Methods in Medical Informatics, Department of Computer Science, University of Tuebingen, Germany
2 Institute for Bioinformatics and Medical Informatics (IBMI), University of Tuebingen, Germany
sofiane.ouaari, jules.kreuer, nico.pfeifer@uni-tuebingen.de

Abstract—DNA foundation models have become transformative tools in bioinformatics and healthcare applications. Trained on vast genomic datasets, these models can be used to generate sequence embeddings, dense vector representations that capture complex genomic information. These embeddings are increasingly being shared via Embeddings-as-a-Service (EaaS) frameworks to facilitate downstream tasks, while supposedly protecting the privacy of the underlying raw sequences. However, as this practice becomes more prevalent, the security of these representations is being called into question. This study evaluates the resilience of DNA foundation models to model inversion attacks, whereby adversaries attempt to reconstruct sensitive training data from model outputs. In our study, the model's output for reconstructing the DNA sequence is a zero-shot embedding, which is then fed to a decoder. We evaluated the privacy of three DNA foundation models: DNABERT-2, Evo 2, and Nucleotide Transformer v2 (NTv2). Our results show that per-token embeddings allow near-perfect sequence reconstruction across all models. For mean-pooled embeddings, reconstruction quality degrades as sequence length increases, though it remains substantially above random baselines. Evo 2 and NTv2 prove to be most vulnerable, especially for shorter sequences with reconstruction similarities > 90%, while DNABERT-2's BPE tokenization provides the greatest resilience. We found that the correlation between embedding similarity and sequence similarity was a key predictor of reconstruction success.
Our findings emphasize the urgent need for privacy-aware design in genomic foundation models prior to their widespread deployment in EaaS settings (training code, model weights and evaluation pipeline are released on https://github.com/not-a-feature/DNA-Embedding-Inversion).

a) Keywords: DNA Foundation Models, Safe Machine Learning Systems, Model Inversion Attack, Privacy-Preserving Machine Learning

b) Abbreviations: EaaS: Embeddings-as-a-Service, BPE: Byte Pair Encoding, NTv2: Nucleotide Transformer v2, FM: Foundation Model

I. INTRODUCTION

In recent years, foundation models have seen significant development and widespread adoption across multiple industries and domains. By definition, a foundation model (Awais et al., 2025; Bommasani, 2021; C. Zhou et al., 2025) is a type of large-scale machine learning model that serves as a general-purpose platform for building specialised applications. These models are typically pre-trained on massive datasets using self-supervised learning techniques (Balestriero et al., 2023) and are then fine-tuned for specific tasks. Self-supervised learning leverages the inherent structure of the data to create pseudo-labels, enabling the model to learn representations without manual annotation.

Genomic data and sequences have also witnessed the development of different types of foundation models trained on large genomic datasets (Guo et al., 2025), human whole genome sequencing datasets, and multi-species genome datasets. Their applications span a broad spectrum of genomic tasks, encompassing promoter region prediction, functional genetic variant identification, and splice site prediction. Embeddings-as-a-Service (EaaS) and Representation-as-a-Service (RaaS) are settings in which embeddings computed from foundation models are shared between parties to then be used for classification or regression tasks (Adila et al., 2024; Ouaari et al., 2025).
Previous studies have benchmarked DNA foundation models on genomic data by leveraging embeddings extracted from these models (Feng et al., 2025; Marin et al., 2024). Embeddings, by definition, constitute numerical representations of sequences that encode and capture underlying structural patterns and sequence-specific features, thereby facilitating enhanced differentiation and boundary delineation. Their vector-based architecture inherently provides greater flexibility for learning complex genomic relationships and enabling downstream analytical tasks. However, the resulting learned embeddings may inadvertently encode sensitive private information, potentially compromising the protection of genomic data, which carries exceptional privacy value (Bonomi et al., 2020; Naveed et al., 2015). Unlike other data modalities, genomic information is immutable and uniquely identifying, amplifying the potential consequences of privacy breaches.

In this work, we investigate and benchmark the robustness of shared embeddings from DNA foundation models in an Embeddings-as-a-Service (EaaS) setting against model inversion privacy attacks. Model inversion attacks attempt to reconstruct original input sequences from their embedded representations, potentially exposing sensitive genomic information. We examine two embedding sharing strategies that reflect common deployment scenarios: (1) per-token embeddings, where the ordered sequence of individual token embeddings is shared as a list, preserving full positional information, and (2) mean-pooled sequence embeddings, which provide aggregated, fixed-size sequence-level representations.

arXiv:2603.06950v1 [q-bio.GN] 6 Mar 2026

[Figure 1: Overall pipeline of the model inversion attack scenario on DNA foundation models. Institution I_1 (data owner) embeds hg38 DNA sequences x_i ∈ {A, C, G, T}^l with an EaaS foundation model (DNABERT-2, Evo 2, Nucleotide Transformer) to obtain e_i = FM(x_i). Institution I_2 (legitimate user) trains a downstream classification/regression model on the shared embeddings. An adversary intercepts the embeddings and trains an inversion model (transformer encoder/decoder, 1D ResNet, or nearest-neighbour lookup) to predict a reconstructed sequence x̂_i, evaluated against the ground truth with Levenshtein and nucleotide accuracy metrics.]

Our evaluation encompasses three DNA foundation models: DNABERT-2 (Z. Zhou et al., 2023), Evo 2 (Brixi et al., 2025), and Nucleotide Transformer v2 (NTv2) (Dalla-Torre et al., 2025), each representing distinct architectural paradigms and training methodologies. Through this comprehensive analysis, we aim to quantify the privacy in different embedding sharing strategies and provide recommendations for secure deployment of genomic foundation models in collaborative research and clinical settings.

II. BACKGROUND

A. Model Inversion Attack

Model inversion attacks, first introduced by Fredrikson et al. (2015), are a type of privacy attack that aims to reconstruct the features of input data based on either the classification output or the representation provided by an ML model. To achieve this, an adversary can utilise either white-box access or black-box access to the model. These attacks are further classified based on their approach: optimisation-based (Nguyen et al., 2023; Wu et al., 2023; Zhang et al., 2020) and training-based (S. Zhou et al., 2023). We use the latter in our study.

Training-based: Consider a model M trained on a private dataset D_priv = {(x_i, y_i)}_{i=1}^{n}.
Training-based inversion attacks aim to recover sensitive input data by learning an inversion model I parameterised as a decoder network. The inversion model is optimised to minimise the reconstruction loss L = R(x, I(M(x))), where R denotes a reconstruction metric that quantifies the fidelity between the original input x and its reconstruction I(M(x)).

B. DNA Foundation Models

In this work, we consider and compute the embeddings of three popular DNA foundation models, namely DNABERT-2, Evo 2 and NTv2.

DNABERT-2 is a transformer-based model for DNA sequence analysis that advances its predecessor by replacing k-mer tokenisation with Byte Pair Encoding (BPE), which is a popular encoding technique for language models (Radford et al., 2019; Workshop et al., 2022). This approach creates variable-length tokens by iteratively merging frequent nucleotide pairs, enabling more efficient genome representation and improved sample efficiency. Unlike fixed-length k-mer tokenisation, BPE's variable-length tokens present a more challenging prediction task during training, as the model must simultaneously predict both the number and identity of masked nucleotides, ultimately enhancing its understanding of genomic semantic structure. A total of 3,874 unique BPE tokens are observed in our dataset.

Evo 2 is a large-scale foundation model developed through training on a comprehensive collection of genomes that captures the breadth of observed evolutionary diversity. Rather than focusing on task-specific optimisation, Evo 2 prioritises broad generalist capabilities, demonstrating strong performance in both prediction and generation tasks that span from molecular-level analyses to genome-scale applications across all domains of life. Two model variants were developed with 7 billion and 40 billion parameters, respectively, utilising a training corpus exceeding 9.3 trillion tokens at single-nucleotide resolution. Its character-level tokeniser uses a vocabulary of just 4 nucleotide tokens.
Both models support an extended context window of up to 1 million tokens and exhibit effective information retrieval capabilities throughout the entire contextual range.

Nucleotide Transformer v2 (NTv2) is a BERT-based model pre-trained via masked language modelling on diverse genomic datasets, including the human reference genome, 3,202 human genomes, and 850 multi-species genomes. It uses 6-mer tokenisation with single-nucleotide tokens for remaining positions when the sequence length is not divisible by six or is interrupted by N; we observe 3,897 unique tokens in our dataset. NTv2 improves upon standard BERT by incorporating rotary positional embeddings (Su et al., 2024).

III. METHODS

A. General Pipeline

We consider a dataset of DNA sequences D = {x_1, x_2, ..., x_N}, where each sequence x_i ∈ {A, C, G, T}^l has length l. Given a DNA foundation model F, we obtain a corresponding set of embeddings E = {e_1, e_2, ..., e_N}, with e_i = F(x_i). The structure of these embeddings depends on the embedding strategy used:

Per-token embeddings: In this approach, each token produced by the foundation model's tokeniser is embedded into a d-dimensional vector. For a DNA sequence x_i of length l, let n denote the number of tokens produced by the tokeniser (n = l for single-nucleotide tokenisers such as Evo 2, and n ≤ l for multi-nucleotide tokenisers such as BPE or k-mer). The resulting embedding is e_i = [e_i^(1), e_i^(2), ..., e_i^(n)] ∈ R^{d×n}, where e_i^(j) ∈ R^d represents the d-dimensional embedding of the j-th token. This representation preserves positional information and the full per-token structure of the foundation model's output.

Mean-pooled embeddings: To obtain a fixed-size representation regardless of sequence length, we can aggregate the position-specific embeddings through mean pooling. The mean-pooled embedding is computed as e_i = (1/n) Σ_{j=1}^{n} e_i^(j) ∈ R^d, where n denotes the number of tokens.
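As a minimal sketch of the two sharing strategies (shapes and values here are illustrative stand-ins, not outputs of any of the three models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token embeddings for one sequence: n tokens, each
# mapped to a d-dimensional vector by the foundation model's tokeniser.
n_tokens, d = 25, 16
per_token = rng.normal(size=(n_tokens, d))   # e_i^(1), ..., e_i^(n)

# Mean pooling collapses the token axis into a single fixed-size vector,
# discarding most positional information in the process.
mean_pooled = per_token.mean(axis=0)

print(per_token.shape, mean_pooled.shape)    # (25, 16) (16,)
```

Sharing `per_token` exposes the full ordered structure; sharing only `mean_pooled` is the aggregated strategy whose privacy the paper probes.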
This aggregation produces a single fixed-dimensional vector that captures the overall sequence information.

We define our privacy threat scenario as follows: we consider two institutions, I_1 and I_2, that have agreed to collaborate in an EaaS framework for a downstream task involving genomic data. Institution I_1 possesses a labelled dataset D = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ {A, C, G, T}^l represents a DNA sequence, y_i denotes its corresponding label and N is the number of sequences. To preserve privacy while enabling collaboration, I_1 transforms the original dataset into an embedding-based variant D^(emb) = {(e_i, y_i)}_{i=1}^{N}, where e_i = F(x_i) is the embedding generated by a DNA foundation model F. This transformed dataset D^(emb) is then shared with institution I_2 for training downstream models.

We assume the presence of an adversary A who intercepts the shared embedding dataset D^(emb). The adversary's objective is to perform a model inversion attack by training a reconstruction model M : R^d → {A, C, G, T}^l that aims to recover the original DNA sequences from their embeddings. Formally, given an embedding e_i, the adversary seeks to reconstruct the corresponding sequence x̂_i = M(e_i), where x̂_i represents the reconstructed approximation of the original sequence x_i. The success of this attack would compromise the privacy guarantees that the embedding transformation was intended to provide.

B. Metrics

To evaluate the reconstruction quality of the model inversion attack, we employ two sequence comparison metrics: nucleotide accuracy and Levenshtein distance.

Nucleotide Accuracy: This metric measures the proportion of positions where two DNA sequences share identical nucleotides. For two sequences x_1 and x_2 of equal length l, nucleotide accuracy is defined as acc(x_1, x_2) = (1/l) Σ_{j=1}^{l} 1[x_1^(j) = x_2^(j)], where 1[·] is the indicator function. This metric provides a straightforward position-wise similarity score ranging from 0 (no matches) to 1 (perfect match).
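The two metrics named above can be sketched in plain Python; this is a straightforward dynamic-programming Levenshtein, not the authors' released implementation:

```python
def nucleotide_accuracy(x1: str, x2: str) -> float:
    """Fraction of positions with identical nucleotides (equal-length inputs)."""
    assert len(x1) == len(x2)
    return sum(a == b for a, b in zip(x1, x2)) / len(x1)

def levenshtein(x1: str, x2: str) -> int:
    """Minimum number of single-character edits, via dynamic programming."""
    prev = list(range(len(x2) + 1))
    for i, a in enumerate(x1, 1):
        curr = [i]
        for j, b in enumerate(x2, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (a != b)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(x1: str, x2: str) -> float:
    """sim_lev = 1 - lev / max(|x1|, |x2|); 1.0 means identical sequences."""
    return 1.0 - levenshtein(x1, x2) / max(len(x1), len(x2))

print(nucleotide_accuracy("ACGT", "ACGA"))    # 0.75
print(levenshtein_similarity("ACGT", "AGT"))  # 0.75
```

Nucleotide accuracy requires equal lengths, whereas the Levenshtein-based similarity also handles the insertions and deletions that variable-length tokenisers can induce.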
Levenshtein Distance and Similarity: Levenshtein distance (Berger et al., 2020; Levenshtein et al., 1966) quantifies the minimum number of single-nucleotide edits (substitutions, insertions, deletions) required to transform one sequence into another. These operations directly correspond to the primary mutation types in genomic evolution, making it a biologically interpretable metric for comparing DNA sequences. We normalise it to a similarity score sim_lev(x_1, x_2) = 1 − lev(x_1, x_2) / max(|x_1|, |x_2|) ∈ [0, 1], where 1 indicates identical sequences. The formal recursive definition is provided in Appendix C.

C. Models

To ensure architectural diversity in our study, we considered four types of model for the inversion attack: Encoder-only Transformer, Decoder-only Transformer (both non-autoregressive), ResNet and Nearest Neighbour Lookup. As identifying the foundation model that generated a given embedding is trivial (due to distinct embedding dimensions and distributional properties), we trained an independent inversion model for each unique combination of foundation model and sequence length. Each inversion model uses the foundation model's native tokeniser for decoding predictions back to nucleotide sequences; an ablation with a fixed single-nucleotide tokeniser was performed (see Appendix F).

Encoder-only Transformer: The encoder projects the input embeddings into a d_model-dimensional space, applies sinusoidal positional encoding, and passes the sequence through stacked encoder layers. Mean-pooled embeddings are first projected and reshaped into a sequence of length l before positional encoding. An output layer maps each position to a distribution over the tokens.

Decoder-only Transformer: A transformer decoder with causal masking. The architecture mirrors the encoder but employs self-attention with a causal mask, preventing each position from attending to subsequent positions during reconstruction.
ResNet: A 1D convolutional residual network consisting of stacked residual blocks, each containing two convolutional layers with batch normalisation, ReLU activation, and dropout, connected via a skip connection.

Nearest Neighbour Lookup: A non-parametric baseline that stores all training embeddings and their corresponding sequences. At inference time, the training sequence whose embedding has the smallest Euclidean distance to the query embedding is returned as the reconstruction. We evaluated both Euclidean and cosine distances. We opted for the Euclidean distance as there were no significant differences in correlation with sequence similarity (see Table I).

IV. DATASETS

We use the human reference genome (GRCh38/hg38) as our primary dataset, as privacy risks are most relevant for human genomic data. Although the reference genome is a publicly available composite assembly from multiple anonymous donors, it provides a controlled and reproducible test bed for evaluating inversion attacks. Since individual patient genomes contain person-specific variants and ancestry-informative markers (Bollas et al., 2024; Lippert et al., 2017; Spiliopoulou et al., 2015), successful reconstruction on the reference genome suggests that private genomes may be similarly vulnerable.

To validate our findings on real patient data, we additionally evaluate on sequences derived from the 1000 Genomes Project (Fairley et al., 2019), comprising subsequences drawn equally from intronic and exonic regions. As shown in Appendix H, reconstruction performance on these sequences is consistent with the hg38 results, confirming that the attack scenario generalises to individual-level genomic data.

V. EXPERIMENTS AND RESULTS

A. Collision Analysis

Prior to model inversion, we first assessed whether sequence reconstruction is theoretically feasible for mean-pooled embeddings.
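One way to run such a feasibility check, sketched here with NumPy on random stand-in embeddings (real embeddings would come from the foundation models), is to compute all pairwise normalised distances and look for near-collisions:

```python
import numpy as np

def pairwise_normalized_distances(E: np.ndarray) -> np.ndarray:
    """D_ij = ||e_i - e_j||_2 / sqrt(d) for an (N, d) embedding matrix."""
    diff = E[:, None, :] - E[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(E.shape[1])

# Stand-in for N mean-pooled embeddings of dimension d.
E = np.random.default_rng(0).normal(size=(200, 32))
D = pairwise_normalized_distances(E)

# No near-zero off-diagonal distance means no collisions in this sample,
# i.e. the embedding map looks approximately injective here.
off_diag = D[~np.eye(len(D), dtype=bool)]
print(off_diag.min() > 1e-6)   # True for this random sample
```

For large N, a chunked or `scipy.spatial.distance.pdist`-style computation avoids materialising the full N×N×d difference tensor.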
We examined the pairwise normalised Euclidean distance distribution of mean-pooled embeddings to evaluate the injectivity of the embedding function (Nikolaou et al., 2025). This analysis does not guarantee reconstruction, but rather establishes a necessary precondition for it. The absence of significant embedding collisions would suggest that the function is approximately injective, making reconstruction potentially possible. Conversely, if distinct genomic sequences converge to near-identical embeddings, the function is non-injective and therefore non-invertible, rendering reconstruction fundamentally intractable regardless of the inversion model employed.

Formally, a function f : A → B is injective if ∀ x_1, x_2 ∈ A, f(x_1) = f(x_2) ⇒ x_1 = x_2. Given a set of mean-pooled embeddings E = {e_1, e_2, ..., e_N}, where e_i ∈ R^d for all i ∈ {1, 2, ..., N}, the pairwise normalised Euclidean distance matrix D ∈ R^{N×N} is defined element-wise as

D_ij = ‖e_i − e_j‖_2 / √d = sqrt( Σ_{k=1}^{d} (e_i^(k) − e_j^(k))^2 ) / √d,

where e_i^(k) denotes the k-th component of embedding e_i. Figure 2 shows the distribution of pairwise distances for mean-pooled embeddings at sequence length l = 100. All three models produce well-separated distance distributions with no near-zero distances, confirming that the embedding functions are effectively injective over our dataset and that embedding collisions do not pose a practical limitation for inversion attacks. This observation holds consistently across all evaluated sequence lengths (see Appendix D).

B. Reconstruction Evaluation

In our analysis, we evaluate DNA sequences sampled from the default chromosomes (chr1–chr22, chrX, chrY, chrM) of the human reference genome hg38 across various sequence lengths l ∈ {10, 15, ..., 50, 60, ..., 100}.
The theoretical search space for sequences of length l is 4^l (e.g., 4^100 ≈ 1.6 × 10^60), although the space of biologically plausible human sequences is substantially smaller due to constraints such as GC content and conserved regions. For per-token and mean-pooled embeddings, we use 100,000 sequences per configuration (70,000 for training, 15,000 for validation, 15,000 for testing). As a baseline, we report a predictor that samples nucleotides according to the empirical distribution of the target sequences, yielding approximately 25% nucleotide accuracy and 34–44% Levenshtein similarity (depending on sequence length).

Per-Token Reconstruction. We evaluate per-token model inversion attacks on embeddings of sequences with length l = 100 using a small multi-layer perceptron (MLP). In this setting, each token embedding is independently mapped to a nucleotide prediction, reducing the problem to a per-position classification task. The results confirm that per-token embeddings are highly vulnerable to inversion attacks across all three foundation models, with Levenshtein similarities above 99% and nucleotide accuracies above 98%. The reconstruction for NTv2 achieves near-perfect nucleotide accuracy, where ≈ 99% of sequences could be reconstructed without any mistakes. Evo 2 and DNABERT-2 prove slightly more resilient, with ≈ 80% of sequences reconstructed without any mistakes.

Figure 2 shows the per-position reconstruction accuracy. The results for the DNABERT-2 embeddings show slight positional variation, reflecting the irregular token boundaries produced by BPE, whereby individual tokens can span a varying number of nucleotides. This introduces inconsistency in the token-to-nucleotide mapping. Therefore, misclassification may result in an insertion or deletion, as well as a mismatch of the following nucleotides. The classification of the last tokens proves more challenging, as these shorter tokens occur less frequently in the training data.
The reconstruction for Evo 2, on the other hand, maintains near-perfect accuracy across all positions, except for a small drop at positions 4–8. We attribute this to Evo 2's StripedHyena architecture, which enforces causality to support autoregressive generation: its short explicit (SE) convolution filters of length 7 apply causal zero-padding, treating inputs at indices t < 0 as zero. For sequences shorter than the filter length, or tokens at these positions, this zero-padding dominates the convolution output, potentially producing less discriminative embeddings. The reconstruction for NTv2 does not exhibit any accuracy drop.

[Figure 2: Left: Collision embedding analysis for mean embeddings of sequences of length l = 100, showing the normalised Euclidean distances of all pairwise combinations of a random subsample of 2,000 unique sequences (see Appendix D for all sequence lengths). Right: Per-position reconstruction accuracy for per-token embeddings at sequence length l = 100.]

Mean-Pooled Reconstruction. Reconstructing sequences from mean-pooled embeddings is substantially more challenging, as most positional information is lost through the averaging operation. We evaluate four inversion architectures (Encoder-only Transformer, Decoder-only Transformer, ResNet, and Nearest Neighbour Lookup) across all 14 sequence lengths. All parametric models use compact architectures (e.g. d_model = 128, 8 attention heads, 6 layers) and produce per-nucleotide classification outputs (see Table I for detailed specifications). Figure 3 presents the cross-model comparison for the encoder-only architecture, which consistently outperforms all other inversion methods, with the exception
of Evo 2 at longer sequence lengths, where the Nearest Neighbour baseline achieves comparable performance. Across all foundation models, partial reconstruction was achieved for shorter sequence lengths, with performance degrading as sequence length increases, confirming that longer sequences lose more information through mean pooling.

DNABERT-2 is the most resilient model, with Levenshtein similarities ranging from 0.46 (l = 10) to 0.47 (l = 100), comparable to the Nearest Neighbour baseline. We suspect this resilience stems from its BPE tokenisation, where variable-length tokens introduce additional ambiguity: the reconstruction model must simultaneously predict both the number and identity of nucleotides per token position, and a single token-level error can cascade into insertions or deletions affecting subsequent positions.

Evo 2 is the most vulnerable model for shorter sequences. It exhibits a distinctive non-monotonic pattern, where very short sequences (l = 10) yield lower reconstruction quality (0.58 Levenshtein similarity) than slightly longer sequences (l = 15–20: 0.98–0.99). This is likely due to the less discriminative per-token embeddings and the same effect as described in the previous section and visible in Figure 2.

For NTv2, the encoder inversion model achieves a Levenshtein similarity of 0.90 ± 0.11 at l = 10 and 0.57 ± 0.06 at l = 100. Even at longer sequence lengths, reconstruction quality remains well above both the random baseline (≈ 0.25 accuracy) and the Nearest Neighbour baseline (≈ 0.51 Levenshtein similarity).

The Nearest Neighbour baseline provides a Levenshtein similarity of approximately 0.45–0.54 and accuracy of 0.28–0.37 for longer sequences (l ≥ 50), demonstrating that the embedding space preserves meaningful sequence structure even without a learned inversion model. The gap between learned models and the Nearest Neighbour baseline is largest for NTv2 and Evo 2, where the embedding-sequence correlation is strongest.
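The Nearest Neighbour baseline is simple enough to sketch in a few lines (toy data here; the study stores 70,000 training embeddings per configuration):

```python
import numpy as np

def nn_lookup(query, train_emb, train_seqs):
    """Return the stored training sequence whose embedding is closest
    to the query embedding under Euclidean distance."""
    dists = np.linalg.norm(np.asarray(train_emb) - np.asarray(query), axis=1)
    return train_seqs[int(np.argmin(dists))]

# Toy example with two stored (embedding, sequence) pairs.
train_emb = np.array([[0.0, 0.0], [1.0, 1.0]])
train_seqs = ["AAAA", "CCCC"]
print(nn_lookup([0.9, 1.1], train_emb, train_seqs))   # CCCC
```

Because it is non-parametric, any reconstruction quality it achieves reflects structure preserved by the embedding space itself rather than capacity of a learned attack model.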
Correlation between Embedding and Sequence Similarity. We investigate the relationship between pairwise Euclidean distances in embedding space and corresponding sequence similarities (see Appendix E for plots across all sequence lengths). A stronger correlation indicates that the embedding space preserves sequence-level structure, facilitating reconstruction. Evo 2 exhibits the highest overall Spearman correlation (0.435 at l = 20), aligning precisely with its peak reconstruction performance at that sequence length and providing additional evidence that the non-monotonic performance pattern is driven by the embedding structure rather than model capacity. NTv2 shows the strongest correlation at longer sequence lengths (up to 0.231 at l = 100), consistent with its sustained reconstruction quality. DNABERT-2 shows uniformly weak correlations (≤ 0.13), explaining its resilience to inversion attacks.

Tokenisation Effects. The tokenisation strategy of the foundation model has a pronounced impact on reconstruction difficulty. Evo 2's single-nucleotide tokeniser yields a direct one-to-one correspondence between tokens and nucleotides, making inversion straightforward when per-token embeddings are available and enabling strong mean-pooled reconstruction at moderate sequence lengths. NTv2's 6-mer tokeniser produces a fixed compression ratio, maintaining a relatively predictable structure. DNABERT-2's BPE tokeniser, in contrast, generates variable-length tokens that depend on sequence content, making reconstruction inherently more difficult as the model must resolve both token boundaries and nucleotide identities. Although NTv2 and DNABERT-2 use effective vocabularies of similar size (3,897 and 3,874 tokens), both far larger than Evo 2's 4-token alphabet, vocabulary size alone may not fully explain the observed differences in reconstruction difficulty.
We hypothesise that the more relevant factor is how deterministically tokens map back to nucleotide positions: Evo 2's and NTv2's fixed-length tokens keep a predictable compression ratio, while DNABERT-2's variable-length BPE tokens could introduce alignment ambiguity that hinders inversion. We provide a detailed tokenisation analysis with a comparison of token counts across sequence lengths in Appendix B. We further corroborate this hypothesis through an ablation experiment (Appendix F): when all inversion models are forced to use a single-nucleotide tokeniser instead of the foundation model's native tokeniser, reconstruction performance degrades slightly for DNABERT-2 and NTv2, suggesting that the inversion model benefits from operating in the same token space as the foundation model and that mismatched tokenisation adds an additional translation burden.

[Figure 3: Mean-pooled reconstruction performance across sequence lengths for the encoder-only architecture: (a) Levenshtein similarity and (b) nucleotide accuracy.]

Sequence Complexity Analysis. We analyse the relationship between sequence complexity and reconstruction quality using Shannon entropy and 4-mer repetitiveness as proxies for sequence information content. Higher-entropy sequences (more uniform nucleotide distributions) tend to be harder to reconstruct, while highly repetitive sequences with lower informational complexity are more amenable to inversion. These trends are consistent across all three foundation models and suggest that the inherent complexity of the target sequence, rather than model capacity alone, is a limiting factor for reconstruction quality.

VI. DISCUSSION

In this work, we systematically evaluated the privacy of DNA foundation model embeddings against model inversion attacks in an Embeddings-as-a-Service (EaaS) setting. Our results reveal several important findings with direct implications for the deployment of genomic foundation models in collaborative research and clinical environments.

Per-token embeddings offer virtually no privacy protection. Across all three foundation models, per-token embeddings allowed near-perfect reconstruction of the original sequences using a simple MLP, with Evo 2 achieving 99.8% accuracy and 79.5% exact matches at sequence length 100. This finding demonstrates that sharing per-token embeddings is functionally equivalent to sharing the raw sequences themselves, regardless of the foundation model used.

Mean pooling provides partial but insufficient protection. While mean pooling reduces positional information and substantially hinders reconstruction quality, our results show that meaningful partial reconstruction remains possible, particularly for shorter sequences and models with strong embedding-sequence correlations. NTv2 embeddings of 10-nucleotide sequences and Evo 2 embeddings of 15–25-nucleotide sequences can be reconstructed with ≥ 90% Levenshtein similarity, raising concerns even for scenarios where only short genomic fragments are shared. The Nearest Neighbour baseline further demonstrates that the embedding space inherently preserves sequence structure, suggesting that the vulnerability is a fundamental property of the embeddings rather than an artefact of the attack model.

Tokenisation strategy is a suspected determinant of privacy. DNABERT-2's BPE tokenisation coincides with substantially stronger resilience to inversion attacks compared to NTv2's fixed 6-mer and Evo 2's single-nucleotide tokenisers.
We suspect this is because variable-length tokens increase the combinatorial complexity of the reconstruction problem, and single token-level errors can cascade into insertions or deletions. However, DNABERT-2 also differs in model size (117M vs. 500M/7B) and architecture, so the tokenisation effect cannot be fully disentangled from these confounding factors. Nonetheless, this observation suggests that tokenisation design warrants further investigation as a potential implicit privacy mechanism.

Privacy implications of sequence length. Sequence length affects privacy risk in two opposing ways. Longer sequences are more likely to contain identifying SNPs and are therefore more sensitive, yet they are harder to reconstruct from mean-pooled embeddings because averaging over more tokens discards more positional information. Shorter sequences carry less genomic content but are substantially easier to invert. This interplay implies that there is no single "safe" sequence length; rather, the privacy risk of a given configuration depends jointly on the reconstruction difficulty and the genomic sensitivity of the fragments being shared.

Embedding similarity predicts attack success. The correlation between pairwise embedding distances and sequence similarities serves as a reliable predictor of reconstruction quality across models and sequence lengths. This metric could be used as a lightweight diagnostic for assessing the privacy risk of a given embedding scheme without requiring a full attack evaluation.

Our study has some limitations. Firstly, we trained on only a single reference genome. Although reconstruction performance proved consistent on individual-level data, the inversion models were still trained on that same reference genome, and a more sophisticated evaluation would be possible. Secondly, basic reconstruction methods were used to demonstrate general feasibility. More advanced methods exist, such as recursive reconstruction or training-free approaches.
Lastly, we did not evaluate defences such as differential privacy or embedding perturbation, which could mitigate the identified vulnerabilities. Future work should explore these areas and extend the evaluation to additional foundation models and genomic contexts.

VII. CONCLUSION
We presented an exploratory benchmark for evaluating the privacy of DNA foundation model embeddings against model inversion attacks. Our evaluation of DNABERT-2, Evo 2, and NTv2 reveals that per-token embeddings provide no meaningful privacy protection, while mean-pooled embeddings offer partial resilience that varies substantially across models and sequence lengths. Three key findings emerge. First, even compact single-shot models suffice for reconstruction, demonstrating that the vulnerability is inherent to the embeddings rather than dependent on large attack models. Second, BPE tokenisation, as used in DNABERT-2, coincides with considerably greater reconstruction difficulty, possibly due to the variable-length token vocabulary, suggesting that tokenisation strategy warrants further investigation as a factor in privacy-aware model design. Third, a privacy trade-off exists: sharing shorter sequences intuitively exposes less patient information but increases vulnerability to inversion attacks. These findings emphasise the necessity for rigorous privacy evaluation when deploying genomic foundation models in collaborative settings, and they motivate the development of embedding-level privacy defences for the safe adoption of EaaS paradigms in genomics.

VIII. COMPETING INTERESTS
No competing interest is declared.

IX. AUTHOR CONTRIBUTIONS STATEMENT
S.O. and J.K. conceived the project idea and the experimental setup. J.K. conducted the experiments. S.O. and N.P. supervised the experiments and provided additional ideas. All authors analysed the results, wrote and reviewed the manuscript.

X.
ACKNOWLEDGMENTS
This work was supported by the Carl Zeiss Stiftung Research Project "Certification and Foundations of Safe Machine Learning Systems in Healthcare", by the German Research Foundation (DFG) under Germany's Excellence Strategy (EXC number 2064/1, Project number 390727645), and by the German Federal Ministry of Research, Technology and Space (BMFTR) within the PrivateAIM project (funding number: 01Z2316D). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

REFERENCES
Adila, D., Shin, C., Cai, L., et al. (2024). Zero-shot robustification of zero-shot models. Int. Conf. Learn. Represent. (ICLR).
Awais, M., Naseer, M., Khan, S., et al. (2025). Foundation models defining a new era in vision: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell.
Balestriero, R., Ibrahim, M., Sobal, V., et al. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.
Berger, B., Waterman, M. S., & Yu, Y. W. (2020). Levenshtein distance, sequence comparison and biological database search. IEEE Trans. Inf. Theory, 67(6), 3287–3294.
Bollas, A. E., Rajkovic, A., Ceyhan, D., et al. (2024). SNVstory: Inferring genetic ancestry from genome sequencing data. BMC Bioinformatics, 25(1), 76.
Bommasani, R. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Bonomi, L., Huang, Y., & Ohno-Machado, L. (2020). Privacy challenges and research opportunities for genomic data sharing. Nat. Genet., 52(7), 646–654.
Bostrom, K., & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Findings Assoc. Comput. Linguist.: EMNLP 2020, 4617–4624.
Brixi, G., Durrant, M. G., Ku, J., et al. (2025). Genome modeling and design across all domains of life with Evo 2. bioRxiv, 2025–02.
Dalla-Torre, H., Gonzalez, L., Mendoza-Revilla, J., et al. (2025). Nucleotide transformer: Building and evaluating robust foundation models for human genomics. Nat. Methods, 22(2), 287–297.
Fairley, S., Lowy-Gallego, E., Perry, E., et al. (2019). The international genome sample resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res., 48(D1), D941–D947.
Feng, H., Wu, L., Zhao, B., et al. (2025). Benchmarking DNA foundation models for genomic and genetic tasks. Nat. Commun., 16(1), 10780.
Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model inversion attacks that exploit confidence information and basic countermeasures. Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Security, 1322–1333.
Guo, F., Guan, R., Li, Y., et al. (2025). Foundation models in bioinformatics. Natl. Sci. Rev., 12(4), nwaf028.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Lippert, C., Sabatini, R., Maher, M. C., et al. (2017). Identification of individuals by trait prediction using whole-genome sequencing data. Proc. Natl. Acad. Sci. USA, 114(38), 10166–10171.
Marin, F. I., Teufel, F., Horlacher, M., et al. (2024). BEND: Benchmarking DNA language models on biologically meaningful tasks. Int. Conf. Learn. Represent. (ICLR).
Naveed, M., Ayday, E., Clayton, E. W., et al. (2015). Privacy in the genomic era. ACM Comput. Surv., 48(1), 1–44.
Nguyen, N.-B., Chandrasegaran, K., Abdollahzadeh, M., et al. (2023). Re-thinking model inversion attacks against deep neural networks. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 16384–16393.
Nikolaou, G., Mencattini, T., Crisostomi, D., et al. (2025). Language models are injective and hence invertible. arXiv preprint arXiv:2510.15511.
Ouaari, S., Ünal, A. B., Akgün, M., et al. (2025). Robust representation learning for privacy-preserving machine learning: A multi-objective autoencoder approach. IEEE Access.
Radford, A., Wu, J., Child, R., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Spiliopoulou, A., Nagy, R., Bermingham, M. L., et al. (2015). Genomic prediction of complex human traits: Relatedness, trait architecture and predictive meta-models. Hum. Mol. Genet., 24(14), 4167–4182.
Su, J., Ahmed, M., Lu, Y., et al. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
Workshop, B., Scao, T. L., Fan, A., et al. (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Wu, R., Chen, X., Guo, C., et al. (2023). Learning to invert: Simple adaptive attacks for gradient inversion in federated learning. Uncertainty Artif. Intell., 2293–2303.
Zhang, Y., Jia, R., Pei, H., et al. (2020). The secret revealer: Generative model-inversion attacks against deep neural networks. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 253–261.
Zhou, C., Li, Q., Li, C., et al. (2025). A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. Int. J. Mach. Learn. Cybern., 16(12), 9851–9915.
Zhou, S., Zhu, T., Ye, D., et al. (2023). Boosting model inversion attacks with adversarial examples. IEEE Trans. Dependable Secure Comput.
Zhou, Z., Ji, Y., Li, W., et al. (2023). DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006.

In this appendix, we present detailed figures from our embedding analysis and reconstruction evaluation for the three DNA foundation models: DNABERT-2, Nucleotide Transformer v2 (NTv2), and Evo 2. We analyse how the embedding structure and similarity metrics evolve across different sequence lengths (10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 nucleotides).

APPENDIX A
DATA PREPARATION AND EMBEDDING EXTRACTION
We extract non-overlapping, non-ambiguous (no N characters) and unique subsequences of fixed length from the regular chromosomes (chr1–22, chrX, chrY, chrM) of the hg38 reference genome.
From these, we uniformly sample 100,000 sequences for per-token and mean-pooled experiments, using a fixed random seed of 42 for reproducibility. The data is split into training (70%), validation (15%), and test (15%) partitions. Embeddings are extracted from each foundation model in a zero-shot fashion (i.e. without fine-tuning) from the following layers:
• DNABERT-2 (checkpoint zhihan1996/DNABERT-2-117M): We use the last hidden-layer output of the transformer encoder. The special [CLS] and [SEP] tokens are stripped, yielding per-token embeddings of dimension 768.
• NTv2 (checkpoint InstaDeepAI/nucleotide-transformer-v2-500m-multi-species): We extract the last hidden state via output_hidden_states. The leading [CLS] token is removed, producing per-token embeddings of dimension 1,024.
• Evo 2 (checkpoint evo2_7b, 7B parameters): Instead of the final layer, we extract embeddings from an intermediate MLP layer (blocks.26.mlp.l3), which empirically yielded more informative representations and is commonly used for embeddings. Each nucleotide corresponds to exactly one token, producing embeddings of dimension 4,096.
For the mean-pooled setting, per-token embeddings are averaged across all token positions to produce a single fixed-size vector per sequence. All embeddings are stored in HDF5 format with SHA-256 checksums for integrity verification. Training-set statistics (mean, standard deviation) are computed and used for z-score normalisation across all splits, ensuring no information leakage from the validation or test sets.

APPENDIX B
TOKENISATION ANALYSIS
The three foundation models employ fundamentally different tokenisation strategies, which directly affect reconstruction difficulty. Evo 2 uses a single-nucleotide (character-level) tokeniser, producing exactly l tokens for a sequence of length l, with a vocabulary of 4 nucleotide tokens.
NTv2 employs a 6-mer tokeniser with a sliding window approach, generating approximately ⌈l/6⌉ tokens per sequence, with single-nucleotide tokenisation for the remaining positions when l is not divisible by 6. Across all sequence lengths in our dataset, we observe 3,897 unique NTv2 tokens. DNABERT-2 uses Byte Pair Encoding (BPE) (Bostrom & Durrett, 2020), which produces a variable number of tokens depending on sequence content. For a sequence of length 100, DNABERT-2 typically generates 20 tokens, mostly spanning 1–8 nucleotides; we observe 3,874 unique BPE tokens across our dataset. This variability means that the inversion model must resolve both token boundaries and nucleotide identities, significantly increasing the reconstruction difficulty compared to fixed-length tokenisation schemes. Figure 4 illustrates how the number of tokens produced by each tokeniser scales with sequence length. While Evo 2 maintains a strict 1:1 ratio and NTv2 follows a near-linear compression, DNABERT-2's BPE tokeniser exhibits a sub-linear, content-dependent growth with higher variance, reflecting its variable-length token vocabulary.

Figure 4: Token count vs. sequence length for the three foundation models. Evo 2 (char-level) produces exactly l tokens, NTv2 (single nt and 6-mer) follows a fixed compression ratio, and DNABERT-2 (BPE) exhibits variable, content-dependent tokenisation. Shaded regions indicate ±1 standard deviation.
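To illustrate how these three scaling behaviours arise, the sketch below counts tokens under a character-level tokeniser, an NTv2-style 6-mer tokeniser with single-nucleotide fallback for the remainder, and a toy greedy longest-match tokeniser standing in for BPE. The toy merge vocabulary is hypothetical, not DNABERT-2's actual merge table; only the scaling behaviour is meant to be representative.

```python
import random

def char_token_count(seq: str) -> int:
    # Evo 2-style character-level tokenisation: one token per nucleotide.
    return len(seq)

def kmer_token_count(seq: str, k: int = 6) -> int:
    # NTv2-style fixed k-mer tokenisation: full k-mers first, then
    # single-nucleotide tokens for the remaining l mod k positions.
    return len(seq) // k + len(seq) % k

def greedy_bpe_token_count(seq: str, vocab: set, max_len: int = 8) -> int:
    # Toy greedy longest-match stand-in for BPE: the token count now
    # depends on sequence content, not just on sequence length.
    i, count = 0, 0
    while i < len(seq):
        for length in range(min(max_len, len(seq) - i), 0, -1):
            if length == 1 or seq[i:i + length] in vocab:
                i += length
                count += 1
                break
    return count

random.seed(42)
seq = "".join(random.choice("ACGT") for _ in range(100))
toy_vocab = {"AC", "ACG", "GT", "TTA", "CGTA"}  # hypothetical merges

print(char_token_count(seq))                    # 100: strict 1:1 ratio
print(kmer_token_count(seq))                    # 20: sixteen 6-mers + four singles
print(greedy_bpe_token_count(seq, toy_vocab))   # content-dependent, sub-linear
```

For l = 100 the 6-mer scheme always yields 20 tokens, matching the typical DNABERT-2 token count reported above only by coincidence of this length; the BPE-like count varies from sequence to sequence, which is exactly the variance shown in Figure 4.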
APPENDIX C
LEVENSHTEIN DISTANCE DEFINITION
The Levenshtein distance between two sequences x_1 and x_2 is recursively defined as:

\mathrm{lev}(x_1, x_2) =
\begin{cases}
|x_1| & \text{if } |x_2| = 0,\\
|x_2| & \text{if } |x_1| = 0,\\
\mathrm{lev}(\mathrm{tail}(x_1), \mathrm{tail}(x_2)) & \text{if } \mathrm{head}(x_1) = \mathrm{head}(x_2),\\
1 + \min\bigl\{\mathrm{lev}(\mathrm{tail}(x_1), x_2),\; \mathrm{lev}(x_1, \mathrm{tail}(x_2)),\; \mathrm{lev}(\mathrm{tail}(x_1), \mathrm{tail}(x_2))\bigr\} & \text{otherwise}
\end{cases}
\tag{1}

where head(x) returns the first element of sequence x and tail(x) returns the sequence excluding the first element. The normalised similarity score is:

\mathrm{sim}_{\mathrm{lev}}(x_1, x_2) = 1 - \frac{\mathrm{lev}(x_1, x_2)}{\max(|x_1|, |x_2|)}

where the normalisation by the maximum sequence length ensures the score ranges from 0 (completely dissimilar) to 1 (identical sequences).

APPENDIX D
COLLISION ANALYSIS ACROSS SEQUENCE LENGTHS
We extend the collision analysis from the main text by showing the pairwise normalised Euclidean distance distributions for mean-pooled embeddings across all evaluated sequence lengths. For each sequence length, we compute the pairwise distances of a random subsample of 2,000 unique sequences. Across all models and sequence lengths, the distance distributions remain well-separated from zero, confirming that the embedding functions are effectively injective and that embedding collisions do not limit inversion attacks at any sequence length.
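The collision check described above can be sketched as follows. Random vectors stand in for the real mean-pooled foundation-model embeddings (the sample size of 2,000 and the DNABERT-2 dimension of 768 follow the setup in this paper); the distance computation and the dimension-normalised axis are the same as in Figure 5.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for mean-pooled embeddings: random vectors are used here purely
# to illustrate the check; in the paper these come from a foundation model.
n, dim = 2000, 768
emb = rng.standard_normal((n, dim)).astype(np.float32)

# Pairwise Euclidean distances over unique pairs, normalised by embedding
# dimension (the "d / dim" axis of Figure 5), via the Gram-matrix trick.
sq = (emb ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (emb @ emb.T)
iu = np.triu_indices(n, k=1)
dist = np.sqrt(np.clip(d2[iu], 0.0, None)) / dim

# A distribution well separated from zero means no (near-)collisions:
# distinct inputs keep distinct embeddings, so injectivity does not
# limit the inversion attack.
print(f"min: {dist.min():.4f}, mean: {dist.mean():.4f}")
```

On real embeddings, a minimum distance at or near zero would flag colliding sequences that no inversion model could ever distinguish.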
Figure 5: Collision analysis: pairwise normalised Euclidean distance distributions for mean-pooled embeddings across all evaluated sequence lengths.

APPENDIX E
EMBEDDING ANALYSIS
We visualise the structure of the learned embeddings using UMAP projections and Euclidean distance distributions. Figures 6–8 and Figures 9–11 provide a comprehensive comparison across all models and sequence lengths.

Figure 6: UMAP projections of mean-pooled DNABERT-2 embeddings across all sequence lengths.
Figure 7: UMAP projections of mean-pooled NTv2 embeddings across all sequence lengths.

Figure 8: UMAP projections of mean-pooled Evo 2 embeddings across all sequence lengths.

Euclidean vs Sequence Similarity Correlation: In this section, we examine the correlation between the pairwise Euclidean distances of the embeddings and their sequence similarities. A strong correlation typically allows for better reconstruction performance.

Figure 9: Euclidean distance vs. sequence similarity correlation for DNABERT-2 across all sequence lengths.

Figure 10: Euclidean distance vs. sequence similarity correlation for NTv2 across all sequence lengths.

Figure 11: Euclidean distance vs. sequence similarity correlation for Evo 2 across all sequence lengths.

Model      Metric      10     15     20     25     30     35     40     45     50     60     70     80     90     100
DNABERT-2  Cosine      0.0154 0.0289 0.0345 0.0400 0.0433 0.0479 0.0680 0.0669 0.0683 0.0783 0.0852 0.0944 0.1211 0.1127
DNABERT-2  Euclidean   0.0224 0.0283 0.0384 0.0399 0.0381 0.0454 0.0717 0.0571 0.0460 0.0670 0.0671 0.0880 0.1063 0.1005
Evo 2      Cosine      0.1484 0.3775 0.4405 0.3260 0.2660 0.2307 0.2108 0.2096 0.1976 0.1847 0.1760 0.1662 0.1736 0.1680
Evo 2      Euclidean   0.0703 0.2986 0.4354 0.3176 0.2538 0.2164 0.1919 0.1890 0.1730 0.1562 0.1474 0.1325 0.1410 0.1355
NTv2       Cosine      0.0991 0.0918 0.0816 0.0915 0.0635 0.1311 0.1823 0.1414 0.1353 0.1446 0.1796 0.1747 0.1857 0.2226
NTv2       Euclidean   0.1008 0.0950 0.0869 0.0984 0.0717 0.1487 0.1992 0.1542 0.1493 0.1590 0.1896 0.1879 0.2012 0.2310

Table I: Spearman correlation between embedding similarity (cosine / Euclidean) and sequence similarity (Levenshtein) for different models and sequence lengths.

APPENDIX F
MODEL ARCHITECTURE AND SIZE
Table I summarises the inversion model architectures and their parameter counts. All parametric models are intentionally compact to demonstrate that even small attack models can achieve meaningful reconstruction.

Model              Key Hyperparameters                                Approx. Parameters
Encoder            d = 128, h = 8, L = 6, d_f = 1024, dropout = 0.1   ≈ 12M
Decoder            d = 128, h = 8, L = 6, d_f = 1024, dropout = 0.1   ≈ 12M
ResNet             d = 256, 4 blocks, k = 5                           ≈ 22M
MLP (per-token)    [256, 256, 128], dropout = 0.2                     ≈ 0.3M
Nearest Neighbour  non-parametric

Table I: Inversion model architectures and approximate parameter counts (excluding input projection, which varies by foundation model embedding dimension).

Each inversion model uses the same tokeniser as the corresponding foundation model for decoding predictions back to nucleotide sequences.
That is, for DNABERT-2 the inversion model predicts over the BPE vocabulary, for NTv2 over the single-nucleotide and 6-mer vocabulary, and for Evo 2 over the single-nucleotide alphabet.

1) Tokeniser Ablation: We additionally conducted an ablation experiment in which all inversion models were trained with a fixed single-nucleotide (character-level) tokeniser, regardless of the foundation model's native tokeniser. This forced the models to predict individual nucleotides directly, bypassing the subword vocabulary of the foundation model's tokeniser. As shown in Figures 12 and 13, this configuration resulted in slightly worse reconstruction performance for models that natively use subword tokenisers (DNABERT-2 and NTv2). We hypothesise that this degradation occurs because the inversion model must additionally learn to translate the embedding representations of variable-length subword tokens into single-nucleotide predictions, adding an implicit alignment step that increases the difficulty of reconstruction. For Evo 2, which already uses a single-nucleotide tokeniser, the performance remains unchanged.

Figure 12: Levenshtein similarity across sequence lengths for the encoder-only architecture with a fixed single-nucleotide tokeniser.

Figure 13: Nucleotide accuracy across sequence lengths for the encoder-only architecture with a fixed single-nucleotide tokeniser.

APPENDIX G
RECONSTRUCTION EVALUATION
We evaluate the performance of the model inversion attack by comparing the Levenshtein similarity and nucleotide accuracy across varying sequence lengths for different architectures.
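The two evaluation metrics can be sketched as follows. The Levenshtein similarity follows the normalised definition of Appendix C (implemented here with the standard dynamic-programming form of the recursion rather than naive recursion); `nucleotide_accuracy` assumes an unaligned, position-wise comparison over the target length, which may differ in detail from the paper's exact implementation.

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic programming over the edit-distance recursion.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    # Normalised similarity: 1 for identical sequences, 0 for fully dissimilar.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def nucleotide_accuracy(pred: str, target: str) -> float:
    # Position-wise accuracy over the target length (no alignment).
    return sum(p == t for p, t in zip(pred, target)) / len(target)

print(levenshtein_similarity("ACGTACGT", "ACGTACGT"))  # 1.0
print(levenshtein_similarity("ACGTACGT", "ACGAACGT"))  # 0.875 (one substitution)
print(nucleotide_accuracy("ACGAACGT", "ACGTACGT"))     # 0.875
```

Note that a single insertion early in a prediction costs only one edit under Levenshtein similarity but can shift every downstream position under nucleotide accuracy, which is why the two curves in the figures below need not agree.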
[Figure: six panels, each plotting the Decoder, Encoder, Nearest Neighbour, and ResNet inversion models against the random baseline over sequence lengths 10-100.]
Figure 14: Mean-pooled reconstruction evaluation across sequence lengths. Left column: Levenshtein similarity; right column: nucleotide accuracy. Top row: DNABERT-2, middle row: NTv2, bottom row: Evo 2.

APPENDIX H
CROSS-DATASET EVALUATION: 1000 GENOMES PROJECT

To assess how well the inversion attack generalises to real patient data beyond the hg38 reference genome, we evaluate on sequences derived from the 1000 Genomes Project (Fairley et al., 2019). We randomly sample subsequences of length l ∈ {10, 25, 50, 75, 100} from real patient genomes, with a balanced composition of 50% intronic and 50% exonic regions. The inversion models trained on the hg38 reference data are directly applied to embeddings of these 1000 Genomes sequences without any retraining or fine-tuning. As shown in Figures 15 and 16, the reconstruction performance on real patient sequences closely mirrors the results obtained on hg38, indicating that the vulnerability of DNA foundation model embeddings is not an artefact of the reference genome but extends to realistic, individually derived genomic data.
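The two evaluation metrics used throughout these appendices can be sketched as follows: Levenshtein similarity is 1 - edit_distance / max(len), and nucleotide accuracy is the fraction of positions where the prediction matches the ground truth. The exact handling of length mismatches in the released pipeline may differ; the version below is a minimal assumption-laden sketch.

```python
def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(pred, true):
    """1 - distance normalised by the longer sequence; 1.0 = identical."""
    if not pred and not true:
        return 1.0
    return 1.0 - levenshtein(pred, true) / max(len(pred), len(true))

def nucleotide_accuracy(pred, true):
    # Position-wise match rate; assumes pred is compared over the
    # ground-truth length (an assumption about mismatch handling).
    return sum(p == t for p, t in zip(pred, true)) / len(true)
```

Levenshtein similarity tolerates insertions and deletions, whereas nucleotide accuracy penalises any positional shift, which is why the two curves in the figures can diverge for longer sequences.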
[Figure: DNABERT-2, Evo 2, NTv2, and random baseline curves over sequence lengths 10-100.]
Figure 15: Levenshtein similarity across sequence lengths for the encoder-only architecture on 1000 Genomes Project data.

[Figure: DNABERT-2, Evo 2, NTv2, and random baseline curves over sequence lengths 10-100.]
Figure 16: Nucleotide accuracy across sequence lengths for the encoder-only architecture on 1000 Genomes Project data.

APPENDIX I
DETAILED METRICS

We provide detailed performance metrics for each model and inversion method across different sequence lengths.

Length             10         15         20         25         30         35         40         45         50         60         70         80         90         100
Decoder
  Levenshtein      0.45±0.19  0.44±0.15  0.45±0.11  0.46±0.10  0.45±0.10  0.46±0.09  0.46±0.08  0.46±0.08  0.46±0.08  0.47±0.07  0.47±0.07  0.46±0.07  0.46±0.06  0.46±0.06
  Accuracy         0.37±0.21  0.33±0.16  0.33±0.13  0.32±0.12  0.31±0.11  0.31±0.10  0.31±0.09  0.31±0.09  0.30±0.09  0.30±0.08  0.30±0.08  0.29±0.07  0.29±0.07  0.29±0.07
ResNet
  Levenshtein      0.44±0.17  0.44±0.12  0.45±0.10  0.46±0.09  0.45±0.08  0.46±0.08  0.46±0.08  0.46±0.07  0.46±0.07  0.46±0.06  0.47±0.06  0.46±0.06  0.46±0.05  0.47±0.05
  Accuracy         0.34±0.20  0.32±0.15  0.31±0.12  0.31±0.11  0.30±0.10  0.30±0.09  0.30±0.09  0.30±0.08  0.30±0.08  0.29±0.07  0.29±0.07  0.29±0.07  0.29±0.06  0.28±0.06
Encoder
  Levenshtein      0.46±0.19  0.45±0.14  0.46±0.11  0.47±0.10  0.47±0.09  0.47±0.08  0.47±0.08  0.47±0.08  0.48±0.07  0.47±0.07  0.48±0.06  0.47±0.06  0.47±0.06  0.47±0.05
  Accuracy         0.37±0.22  0.34±0.16  0.33±0.13  0.32±0.12  0.32±0.11  0.31±0.10  0.31±0.09  0.31±0.09  0.31±0.08  0.30±0.08  0.30±0.07  0.29±0.07  0.29±0.06  0.29±0.06
Nearest Neighbour
  Levenshtein      0.44±0.17  0.43±0.12  0.43±0.10  0.44±0.09  0.44±0.08  0.44±0.07  0.45±0.07  0.45±0.07  0.45±0.07  0.46±0.06  0.46±0.06  0.47±0.06  0.47±0.06  0.47±0.06
  Accuracy         0.34±0.20  0.31±0.14  0.30±0.12  0.30±0.10  0.29±0.10  0.29±0.09  0.29±0.08  0.28±0.08  0.28±0.08  0.28±0.07  0.28±0.07  0.28±0.06  0.28±0.06  0.28±0.06

Table I: Reconstruction results for DNABERT-2 embeddings.

Length             10         15         20         25         30         35         40         45         50         60         70         80         90         100
Decoder
  Levenshtein      0.53±0.13  0.96±0.05  0.89±0.08  0.82±0.08  0.73±0.08  0.67±0.08  0.63±0.07  0.56±0.07  0.53±0.07  0.50±0.06  0.47±0.06  0.45±0.06  0.44±0.06  0.43±0.05
  Accuracy         0.51±0.14  0.96±0.05  0.87±0.10  0.80±0.10  0.70±0.10  0.62±0.09  0.58±0.09  0.53±0.08  0.50±0.08  0.47±0.07  0.44±0.06  0.43±0.06  0.42±0.06  0.41±0.06
ResNet
  Levenshtein      0.55±0.12  0.91±0.08  0.86±0.08  0.97±0.06  0.91±0.09  0.82±0.09  0.71±0.08  0.65±0.07  0.62±0.06  0.56±0.06  0.52±0.05  0.49±0.05  0.46±0.05  0.47±0.05
  Accuracy         0.52±0.13  0.90±0.08  0.84±0.10  0.96±0.08  0.88±0.12  0.77±0.13  0.64±0.11  0.57±0.10  0.54±0.09  0.49±0.08  0.46±0.07  0.44±0.07  0.42±0.06  0.41±0.06
Encoder
  Levenshtein      0.58±0.12  0.98±0.04  0.99±0.03  0.97±0.07  0.92±0.09  0.85±0.10  0.76±0.09  0.71±0.08  0.66±0.07  0.59±0.06  0.53±0.05  0.50±0.05  0.48±0.05  0.46±0.05
  Accuracy         0.56±0.13  0.98±0.04  0.99±0.04  0.96±0.09  0.89±0.12  0.80±0.14  0.70±0.12  0.63±0.11  0.58±0.10  0.52±0.09  0.47±0.07  0.45±0.07  0.43±0.06  0.42±0.06
Nearest Neighbour
  Levenshtein      0.59±0.13  0.69±0.09  0.63±0.08  0.59±0.09  0.56±0.10  0.55±0.11  0.54±0.11  0.54±0.11  0.54±0.12  0.53±0.12  0.53±0.12  0.53±0.12  0.53±0.12  0.53±0.12
  Accuracy         0.56±0.15  0.63±0.12  0.53±0.12  0.42±0.14  0.38±0.14  0.35±0.14  0.34±0.13  0.33±0.12  0.32±0.12  0.31±0.11  0.31±0.10  0.30±0.10  0.30±0.09  0.29±0.08

Table IV: Reconstruction results for Evo 2 embeddings.
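The Nearest Neighbour rows above come from the non-parametric baseline listed in Appendix F. A standard reading of such a baseline is that it returns the sequence from the attacker's reference corpus whose embedding lies closest to the query embedding; the sketch below follows that reading with illustrative names, and the released pipeline may use a different distance or index structure.

```python
def nearest_neighbour_invert(query_emb, reference):
    """Non-parametric inversion baseline (illustrative sketch).

    reference: list of (embedding, sequence) pairs from the attacker's
    auxiliary corpus. Returns the sequence whose embedding has the
    smallest squared Euclidean distance to the query embedding.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(reference, key=lambda pair: sq_dist(pair[0], query_emb))[1]
```

Because this baseline can only ever return a sequence it has already seen, its similarity curves flatten out for longer sequences, consistent with the Nearest Neighbour rows in the tables.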
Length             10         15         20         25         30         35         40         45         50         60         70         80         90         100
Decoder
  Levenshtein      0.88±0.12  0.87±0.10  0.82±0.11  0.81±0.10  0.77±0.10  0.65±0.08  0.64±0.08  0.64±0.08  0.62±0.07  0.60±0.07  0.57±0.06  0.56±0.06  0.53±0.06  0.53±0.06
  Accuracy         0.88±0.13  0.86±0.12  0.81±0.13  0.79±0.12  0.75±0.12  0.59±0.10  0.57±0.10  0.57±0.10  0.54±0.10  0.52±0.10  0.46±0.08  0.45±0.08  0.41±0.08  0.41±0.07
ResNet
  Levenshtein      0.87±0.12  0.86±0.10  0.80±0.11  0.78±0.10  0.73±0.09  0.64±0.08  0.63±0.08  0.63±0.08  0.62±0.07  0.59±0.08  0.59±0.07  0.58±0.07  0.55±0.07  0.56±0.07
  Accuracy         0.86±0.14  0.85±0.12  0.78±0.12  0.75±0.12  0.70±0.11  0.58±0.10  0.57±0.10  0.56±0.10  0.54±0.10  0.49±0.10  0.48±0.10  0.46±0.10  0.42±0.09  0.43±0.10
Encoder
  Levenshtein      0.90±0.11  0.88±0.10  0.86±0.11  0.85±0.10  0.80±0.10  0.69±0.09  0.66±0.08  0.66±0.08  0.65±0.08  0.60±0.07  0.60±0.07  0.59±0.06  0.55±0.06  0.57±0.06
  Accuracy         0.89±0.12  0.87±0.11  0.85±0.12  0.83±0.12  0.78±0.11  0.64±0.11  0.60±0.11  0.60±0.11  0.58±0.10  0.52±0.10  0.50±0.09  0.48±0.09  0.43±0.08  0.44±0.08
Nearest Neighbour
  Levenshtein      0.72±0.12  0.60±0.12  0.54±0.11  0.52±0.10  0.51±0.10  0.53±0.08  0.52±0.09  0.52±0.10  0.51±0.10  0.51±0.11  0.51±0.10  0.51±0.10  0.51±0.10  0.51±0.10
  Accuracy         0.69±0.15  0.53±0.15  0.45±0.14  0.41±0.14  0.39±0.14  0.40±0.12  0.39±0.12  0.38±0.13  0.37±0.14  0.35±0.14  0.35±0.13  0.33±0.13  0.32±0.12  0.33±0.12

Table V: Reconstruction results for NTv2 embeddings.

[Figure: three panels of similarity distributions (0.0-1.0) for each sequence length L = 10 to 100, comparing model reconstructions against the random baseline.]
Figure 17: Reconstruction performance of mean embeddings for sequences of length l = 10 to 100 for (a) DNABERT-2, (b) Evo 2, (c) NTv2 and the random baseline (gray).