Paper deep dive

ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

Year: 2026 | Venue: arXiv preprint | Area: q-bio.QM | Type: Preprint | Embeddings: 75

Abstract

Protein language models (pLMs) have shown strong potential in prediction of the functional effects of missense variants in zero-shot settings. Despite this progress, benchmarking pLMs for viral proteins remains limited and systematic strategies for integrating in silico metrics with in vitro validation to guide antigen and target selection are underdeveloped. Here, we introduce ViroGym, a comprehensive benchmark designed to evaluate variant effect prediction in viral proteins and to facilitate selecting rational antigen candidates. We curated 79 deep mutational scanning (DMS) assays encompassing eukaryotic viruses, collectively comprising 552,937 mutated amino acid sequences across 7 distinct phenotypic readouts, and 21 influenza virus neutralisation tasks and a real-world predictive task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting to provide a framework for vaccine selection, and show that pLMs selected using in vitro experimental data excel at predicting dominant circulating mutations in real world.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:15:35 AM

Summary

ViroGym is a comprehensive benchmark suite designed to evaluate protein language models (pLMs) on their ability to predict the functional effects of mutations in viral proteins. It aggregates 79 deep mutational scanning (DMS) assays, 21 influenza neutralization tasks, and a real-world SARS-CoV-2 pandemic forecasting task to facilitate rational antigen selection and vaccine development.

Entities (6)

Influenza A · virus · 100%
SARS-CoV-2 · virus · 100%
ViroGym · benchmark · 100%
GISAID · database · 95%
Protein Language Models · technology · 95%
ProGen2 · model · 90%

Relation Signals (4)

ViroGym → includes → SARS-CoV-2

confidence 100% · ViroGym includes DMS assays covering SARS-CoV-2

ViroGym → evaluates → Protein Language Models

confidence 95% · ViroGym, a comprehensive benchmark designed to evaluate variant effect prediction in viral proteins

ViroGym → uses → GISAID

confidence 95% · The real-world component evaluates models on SARS-CoV-2 pandemic forecasting using viral sequence data from GISAID database

ProGen2 → performs on → ViroGym

confidence 90% · ProGen2 achieves the strongest performance

Cypher Suggestions (2)

Find all viruses covered by the ViroGym benchmark · confidence 90% · unvalidated

MATCH (b:Benchmark {name: 'ViroGym'})-[:INCLUDES]->(v:Virus) RETURN v.name

Identify models evaluated by ViroGym · confidence 85% · unvalidated

MATCH (m:Model)-[:EVALUATED_BY]->(b:Benchmark {name: 'ViroGym'}) RETURN m.name

Full Text

74,559 characters extracted from source content.

ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Yichen Zhou 1 2, Jonathan Golob 1 3, Amir Karimi 4, Stefan Bauer * 2, Patrick Schwab * 1

1 GlaxoSmithKline 2 Technical University of Munich 3 Division of Allergy and Infectious Diseases, University of Washington 4 KTH Royal Institute of Technology. Correspondence to: Yichen Zhou <yichen.x.zhou@gsk.com>. Preprint. March 10, 2026.

Abstract

Protein language models (pLMs) have shown strong potential in prediction of the functional effects of missense variants in zero-shot settings. Despite this progress, benchmarking pLMs for viral proteins remains limited and systematic strategies for integrating in silico metrics with in vitro validation to guide antigen and target selection are underdeveloped. Here, we introduce ViroGym, a comprehensive benchmark designed to evaluate variant effect prediction in viral proteins and to facilitate selecting rational antigen candidates. We curated 79 deep mutational scanning (DMS) assays encompassing eukaryotic viruses, collectively comprising 552,937 mutated amino acid sequences across 7 distinct phenotypic readouts, and 21 influenza virus neutralisation tasks and a real-world predictive task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting to provide a framework for vaccine selection, and show that pLMs selected using in vitro experimental data excel at predicting dominant circulating mutations in real world.

1. Introduction

The most clinically relevant respiratory viruses, such as influenza, SARS-CoV-2, and others, mutate at a rapid pace, challenging both the immune system and the development of effective vaccines and treatments. Even with extensive near real-time genomic reporting systems, such as GISAID (Shu & McCauley, 2017) and Nextstrain (Hadfield et al., 2018), people often have to guess at the future direction of these rapidly evolving pathogens, and mismatches between the predicted and actual trajectories carry both public health and individual consequences.

A familiar example is the current vaccine development system for SARS-CoV-2 and influenza, which involves a semi-annual strain selection process recommended by the World Health Organization (WHO). This production system, especially for seasonal influenza vaccines, has remained largely unchanged for over 40 years (Wei et al., 2020). Moreover, the effectiveness of seasonal influenza vaccines across the 2009 to 2025 flu seasons ranged from only 19% to 60% (Centers for Disease Control and Prevention and others, 2025), and the peak vaccine effectiveness for SARS-CoV-2 in autumn 2023 was 50.6% within 2-4 weeks but then dropped sharply to 13.6%, largely due to the emergence of new variants (Kirsebom et al., 2024). Despite this suboptimal vaccine efficacy, manufacturers must produce and release vaccines within six months of WHO announcements.

Given the need to design, pilot, manufacture, and test vaccines against emerging strains, a proactive vaccine design framework is needed so that scientists can begin preparing for manufacturing before WHO strain announcements. The ideal framework should also be broad enough to cover other viruses associated with infectious diseases, such as Zika virus, Hepatitis B virus, and Human Immunodeficiency Virus (HIV). With the proven success of large language models (LLMs), it is plausible that such a proactive framework could be effective.
LLMs trained to predict amino acid sequences, known as pLMs, have had success in estimating the functional impact and fitness consequences of candidate mutations without requiring prior evolutionary or epidemiological information (Meier et al., 2021), demonstrating their potential to enable early-stage anticipation of antigenic changes and to support proactive vaccine design. However, current pLMs have largely been validated on non-viral sequences, and most foundation-model training explicitly masks viral sequences out of the training, testing, and validation sets. There therefore remains a gap in our understanding of how different pLMs perform on viral genomic sequences. A clear set of benchmarks relevant to modelling viral evolution is a key step towards applying pLMs to vaccine and antiviral development.

To address these limitations, we present ViroGym, a realistic large-scale benchmark designed to evaluate pLMs in zero-shot settings for global vaccine development. The benchmark consists of three core tasks:

• Mutational effect prediction, which evaluates model ability to capture complex, non-linear correlations within viral genomic sequences and to infer the functional consequences of individual mutations.
• Antigenic diversity prediction, which assesses model capacity to understand immune escape and strain differentiation.
• Pandemic prediction, which identifies models with strong zero-shot generalization suitable for modelling mutations observed in natural viral evolution.

ViroGym includes 552,937 mutated sequence readouts, 2,691 viral sequence-titer pairs, and 24,187 naturally occurring single-mutation frequency measurements. It spans 13 virus types and 7 phenotypic categories, providing broad coverage across viral families and functional properties (see Table 7 in Appendix A.1 for details).

arXiv:2603.06740v1 [q-bio.QM] 6 Mar 2026
By providing clinically meaningful and rigorous benchmarks, ViroGym enables a more realistic assessment of model utility for vaccine and antiviral development.

2. Related Work

ProteinGym.
ProteinGym is a benchmark suite designed to evaluate pLMs on their ability to predict the functional effects of protein mutations. It aggregates large-scale DMS datasets across a wide range of proteins, mutation types, and functional assays and defines biologically grounded evaluation metrics (Notin et al., 2023). The majority of the prediction tasks involve non-viral proteins, with only 24 out of 217 assays derived from viral sequences.

EVEREST.
EVEREST evaluates pLM performance on viral mutational fitness prediction using a curated benchmark of 45 viral DMS datasets and finds that current pLMs fail to reliably predict mutations for over half of these viruses (Gurev et al., 2025). Because its primary focus is on priority viruses, many other available viral DMS assays are not included in the benchmark.

DMS Correlation Studies.
Livesey and Marsh (2025) recently collected 13 new DMS datasets from ProteinGym and evaluated 97 variant effect predictors (VEPs) across 36 human proteins. They observed a strong correspondence between VEP performance on DMS benchmarks and their ability to classify clinical variants, particularly for predictors not trained on clinical data. These findings suggest that VEPs could complement, and in some cases partially substitute for, in vitro experiments in assessing variant effects.

3. ViroGym

The benchmark comprises 79 DMS assays, 21 sequencing-based neutralisation assays for influenza A, and a real-world prediction task derived from the Global Initiative on Sharing All Influenza Data (GISAID), which provides genomic surveillance data for SARS-CoV-2. Figure 1 illustrates the overall framework.
pLMs pre-trained on large sequence databases such as UniProtKB and BFD are evaluated by computing an in silico score for each amino acid sequence using a suitable scoring strategy.

The evaluation is divided into two main components: the DMS and neutralization assays serve as in vitro experimental prediction tasks, while the GISAID dataset enables real-world pandemic prediction. We assess model performance by comparing in silico scores against both experimental measurements and naturally occurring mutations, providing a large-scale benchmark across controlled and real-world settings.

3.1. Dataset sources

DMS.
DMS is a high-throughput experimental technique that characterizes a protein's functional landscape by systematically evaluating millions of its single-amino-acid variants and mapping each mutation (genotype) to a measured functional property (phenotype) (Fowler & Fields, 2014). The selection of DMS assays in ViroGym follows the guidelines established by ProteinGym. As a result, ViroGym includes DMS assays covering SARS-CoV-2 (Starr et al., 2020; 2022b;a; Taylor & Starr, 2023; 2024; Dadonaite et al., 2024b; 2025a), Influenza A (Welsh et al., 2024; Dadonaite et al., 2024a; Yu et al., 2025), HIV (Haddox et al., 2016; 2018; Radford et al., 2023; Radford & Bloom, 2025) and 10 other viruses (detailed references can be found in Tables 5 and 6 of Appendix A.1). Beyond the functional categories considered in ProteinGym, ViroGym introduces an additional function type, immune escape, which represents a critical phenotype for viral proteins and is directly relevant to vaccine and therapeutic development.

Neutralisation assay.
In contrast to traditional serological assays, which assess antibody neutralisation against a single viral strain per serum sample, sequence-based high-throughput neutralisation assays quantify serum antibody neutralisation titers across all relevant viral strains within a single experiment (Loes et al., 2024) (see Table 8 of Appendix A.1 for details). This dense, sequence-resolved measurement paradigm enables machine learning models to learn jointly over viral sequence variation and antigenic response. As a result, such models can learn predictive mappings between viral evolution and antibody-mediated immunity, facilitating the identification of antigenically novel epitopes.

[Figure 1 diagram. Panels: In-Vitro Experiments (DMS, Neutralisation Assay); Real World Tasks (Pandemic Forecasting); Pre-trained Protein Language Models (trained on UniProtKB, BFD, OAS, ...); Fitness Scoring Strategies (masked marginals, wild-type marginals, mutation marginals, pseudo-likelihood, perplexity); Metrics (absolute Spearman rank correlation, top K% recall, Precision@K).]

Figure 1. ViroGym benchmark framework. The benchmark consists of two major components: in vitro experimental evaluation and real-world prediction tasks. The in vitro evaluation leverages experimental measurements from DMS assays and neutralisation assays to evaluate model performance on protein functional effects. The real-world component evaluates models on SARS-CoV-2 pandemic forecasting using viral sequence data from the GISAID database, capturing model generalisation from controlled wet lab settings to natural viral evolution.

GISAID database.
GISAID is a global surveillance platform that monitors priority pathogens and facilitates the sharing of their genetic sequences and associated metadata (Shu & McCauley, 2017). This resource enables researchers to track viral evolution and transmission dynamics during epidemics and pandemics.

3.2. Dataset process

DMS.
To faithfully reflect the underlying protein function experiments, we collected the corresponding target sequence for each DMS assay, following the guidelines established by ProteinGym. For certain SARS-CoV-2 functional assays that evaluate only the receptor-binding domain (RBD), we truncated the Spike protein sequence to the assayed region.

Immune escape phenotypic assays typically do not disclose detailed information about the vaccine formulation or serum source. Consequently, for these assays, we aggregate measurements by averaging DMS scores across different sera evaluated against the same viral sequence.

Neutralisation assay.
We reviewed published information for patients participating in the neutralisation assay experiments to manually curate their vaccination histories. From this, we identified the specific vaccines each patient had received and obtained the corresponding HA1 sequences. This allowed us to accurately assess the antigenic coverage provided by these vaccines.

GISAID database.
We collected all circulating SARS-CoV-2 sequences from the GISAID database spanning January 1, 2020, to May 31, 2025. From these sequences, we extracted all mutations in the Spike protein and recorded their observed occurrences. The resulting heat map, which depicts the actual prevalence of each mutation, including deletions, at each residue, is shown in Figure 2.

3.3. Baselines

Similar to how language models learn grammar and contextual meaning from text, pLMs can learn biological rules and functional properties from amino acid sequences. Leveraging the rapid growth of protein sequence data, researchers have trained pLMs using unsupervised learning to generate representations that capture information ranging from protein structure to biochemical properties, providing features for a wide array of biomedical applications (Rives et al., 2021).
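The GISAID processing step in Section 3.2 amounts to tallying, for each Spike position, how often each non-reference residue or deletion is observed. The following is a minimal sketch of that tally, not the authors' pipeline. It assumes the sequences are already aligned to the reference (equal length, with '-' marking a deleted residue); the function name `count_spike_mutations` and the toy sequences are hypothetical.

```python
from collections import Counter


def count_spike_mutations(reference, aligned_sequences):
    """Count observed single-residue changes relative to a reference.

    Assumes every sequence is already aligned to `reference` (same length),
    with '-' marking a deleted residue. Keys are (position, ref, alt) with
    1-based positions, mirroring notation such as N501Y.
    """
    counts = Counter()
    for seq in aligned_sequences:
        if len(seq) != len(reference):
            raise ValueError("sequence is not aligned to the reference")
        for pos, (ref_aa, alt_aa) in enumerate(zip(reference, seq), start=1):
            if alt_aa != ref_aa:
                counts[(pos, ref_aa, alt_aa)] += 1
    return counts


# Toy example: a 5-residue stand-in for the reference and three isolates.
reference = "MFVNL"
isolates = ["MFVNL", "MFVYL", "M-VYL"]
counts = count_spike_mutations(reference, isolates)
```

Counts of this kind, arranged by position and alternative residue, are what a prevalence heat map such as Figure 2 would display.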
[Figure 2 heat map. x-axis: Spike residue position (1 to 1273); y-axis: amino acid (A R N D C Q E G H I L K M F P S T W Y V) and deletion (-); colour scale: 10^0 to 10^6.]

Figure 2. SARS-CoV-2 Spike Protein Mutation Heat Map. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutation frequency at each position. Data were collected from the GISAID database between January 2020 and May 2025.

In this work, we focus on single-sequence pLMs representative of current approaches: ESM-1 (Rives et al., 2019), the first generation of pLMs; ESM-1v (Meier et al., 2021), which enables zero-shot variant fitness prediction; ESM-2 (Lin et al., 2023), available in model sizes ranging from 8M to 15B parameters; ProtT5 (Elnaggar et al., 2022), an encoder-decoder architecture designed to capture contextual meaning in amino acid sequences, whose embeddings support models such as VESPA and VESPAl (Marquet et al., 2022); the ProGen2 suite (Nijkamp et al., 2023), which explores dataset and scale effects, spanning antibody-specific models to the large BFD90-trained model; ProtGPT2 (Ferruz et al., 2022), aimed at de novo protein generation; and Tranception (Notin et al., 2022), which achieves robust performance in modelling the fitness landscape of protein sequences.

Our primary focus is on the ability of pLMs to predict variant fitness accurately in a zero-shot setting, as determining precise protein function experimentally can take weeks or months. For example, during the COVID-19 pandemic, structural analysis revealing atomic-level conformations of the SARS-CoV-2 RBD was completed one month after the first full genome sequences were available (Wrapp et al., 2020).
Therefore, given the need for time efficiency in early vaccine development and the risk of data leakage, we mainly focus on single sequence-based pLMs in ViroGym.

3.4. Evaluation metrics

We adopt ranking-based metrics throughout ViroGym to identify high-impact mutations for practical use.

Absolute Spearman rank correlation.
Spearman's rank correlation coefficient is used across all prediction tasks, as it is well suited for evaluating agreement between predicted and experimentally measured rankings (Notin et al., 2023). This metric is particularly relevant for applications such as vaccine strain selection, where correctly ranking mutational effects is more critical than predicting their absolute values.

Top K% recall.
Experimental measurements inevitably contain noise, which disproportionately affects the ranking of low-impact mutations. To mitigate this effect, we focus on the top K% of mutations ranked by experimental measurements and report recall within this subset. Following established convention from ProteinGym, we use the top 10% recall metric consistently across all prediction tasks in ViroGym.

Precision@K.
Precision@K is introduced in the pandemic forecasting task to provide a complementary view of model performance, measuring the accuracy of identifying high-risk variants among the top K predictions.

4. Results

4.1. Mutational effect prediction

The concept of leveraging language models to predict protein function in a zero-shot setting was first introduced by Meier et al. (2021), who systematically compared four scoring methods for evaluating mutational effects: masked marginals, wild-type marginals, mutation marginals, and pseudo-likelihood. Their analyses showed that the masked marginals approach outperformed the others, and it has subsequently been adopted in ESM-2 for predicting mutational effects on protein fitness.
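For reference, the masked-marginal score of Meier et al. (2021) compares the model's log-probability of the mutant residue with that of the wild-type residue at each mutated position, where the probabilities come from a forward pass with that position masked. The sketch below shows only this arithmetic. The `log_probs` lookup stands in for real model output, and both it and the function name are hypothetical placeholders rather than any library's API.

```python
def masked_marginal_score(log_probs, mutations):
    """Sum of log p(mutant) - log p(wild type) over mutated positions.

    `log_probs[pos]` is assumed to map each amino acid to the model's
    log-probability at 1-based position `pos`, computed with that
    position masked. `mutations` is a list of (wt, pos, mut) tuples.
    """
    return sum(log_probs[pos][mut] - log_probs[pos][wt]
               for wt, pos, mut in mutations)


# Hypothetical model output for a single masked position (501).
toy_log_probs = {501: {"N": -1.2, "Y": -2.0}}
score = masked_marginal_score(toy_log_probs, [("N", 501, "Y")])
```

A negative score means the model considers the mutant residue less likely than the wild-type residue in that context. The wild-type marginals strategy uses the same difference but, as we understand it, reads the probabilities from a single unmasked pass over the wild-type sequence.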
However, when we evaluated encoder-based models on the DMS experiments from ViroGym using these four strategies, we observed no significant performance differences. One alternative approach is to leverage the contextual embeddings generated by the language model to compute a similarity metric, analogous to sentence similarity in natural language processing. In this framework, we quantify how far a mutated sequence drifts from its reference sequence. We found that Euclidean distance (defined in Equation 1) works better in general.

$$ d(\text{wildtype}, \text{variant}) = \left\lVert \bar{h}^{(L)}_{\text{wildtype}} - \bar{h}^{(L)}_{\text{variant}} \right\rVert_2 \quad (1) $$

where $\bar{h}^{(L)}$ is the mean-pooled contextual embedding from the last hidden layer $L$. Euclidean distance provides more accurate predictions of mutational impact than cosine similarity, as it is more sensitive to single-point mutations in long sequences.

Other pLMs with decoder-only architectures employ different scoring strategies. For example, ProGen2, Tranception, and ProtGPT2 rely primarily on negative log-likelihood or perplexity score, while DeepSequence (Riesselman et al., 2018) and MULAN (Frolova et al., 2025) use a likelihood ratio-based approach to mitigate biases from local sequence context. Beyond these strategies, some researchers draw an analogy to natural language, considering the probability of observing a mutant at a specific position as a measure of evolutionary grammaticality (Hie et al., 2021; Allman et al., 2025), reflecting how plausible a mutation is in the protein sequence context.

The question of which in silico scoring method most effectively represents protein function remains open, particularly as pLMs grow in sophistication and application. Understanding this question will be pivotal for translating large language models from predictive tools into mechanistic frameworks capable of guiding experimental protein design.
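The embedding-distance score of Equation 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: per-residue embeddings are given as plain nested lists (sequence length by embedding dimension) instead of real last-layer model output, and the names `mean_pool` and `semantic_distance` are hypothetical.

```python
import math


def mean_pool(token_embeddings):
    """Average per-residue vectors (L x D nested lists) into one D-vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def semantic_distance(wildtype_emb, variant_emb):
    """Euclidean distance between mean-pooled embeddings (Equation 1)."""
    h_wt = mean_pool(wildtype_emb)
    h_var = mean_pool(variant_emb)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_wt, h_var)))


# Toy last-layer embeddings: 3 residues, 2 dimensions (hypothetical values).
# Only the middle residue differs, as for a single-point mutation.
wildtype = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
variant = [[1.0, 0.0], [3.0, 5.0], [1.0, 1.0]]
dist = semantic_distance(wildtype, variant)
```

In practice the nested lists would be replaced by the final hidden states of the chosen pLM for the wild-type and variant sequences, and a larger distance would be read as a larger predicted functional shift.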
Based on our experimental results in Table 1, we conclude that when the wild-type amino acid sequence is available, the most effective strategy is to compare the semantic changes between the mutated and reference sequences. This approach aligns well with both machine learning principles and biological interpretation, providing a robust method for identifying high-impact mutations.

The overall results for the mutational effect prediction task are presented in Table 2, with detailed performance metrics reported in Appendix A.2 (Table 9, Figures 6 and 7). ProGen2 achieves the strongest performance, as illustrated in Figure 3, which compares its per-task results with those of ESM2 15B as an example. However, the remaining models show no statistically significant performance differences.

Table 1. Performance of ESM2-650M under different scoring strategies. Results are reported as the average top 10% recall and absolute Spearman's rank correlation between model predictions and experimental measurements.

STRATEGY | RECALL | STD. | SPEARMAN | STD.
MASKED | 0.1144 | 0.046 | 0.1091 | 0.1209
WILDTYPE | 0.1125 | 0.0443 | 0.1065 | 0.1159
MUTATION | 0.1147 | 0.0434 | 0.1087 | 0.1183
GRAMMAR | 0.1250 | 0.0458 | 0.1151 | 0.1145
SEMANTIC | 0.1375 | 0.0612 | 0.1693 | 0.1034
RATIO | 0.0813 | 0.0444 | 0.1092 | 0.1127
LOSS | 0.1044 | 0.0564 | 0.1205 | 0.1261

Table 2. Zero-shot performance on the DMS benchmark. Results are reported as the average top 10% recall and absolute Spearman's rank correlation between model scores and experimental measurements across all baselines.

MODEL | RECALL | STD. | SPEARMAN | STD.
VESPAL | 0.1635 | 0.0726 | 0.2715 | 0.1402
VESPA | 0.1702 | 0.0796 | 0.2797 | 0.1506
TRANCEPT. | 0.1572 | 0.0681 | 0.2271 | 0.1300
PROTGPT2 | 0.1105 | 0.0370 | 0.1021 | 0.0732
PROGEN2 | 0.1980 | 0.0910 | 0.2930 | 0.1583
ESM1V | 0.1451 | 0.0644 | 0.1877 | 0.1026
ESM1 | 0.1466 | 0.0586 | 0.1997 | 0.1021
ESM2 | 0.1419 | 0.0639 | 0.1741 | 0.1122

[Figure 3 bar chart. x-axis: Task (one group per DMS assay); y-axis: Score (0.0 to 0.8); series: ProGen2 XL, ESM2 15B.]

Figure 3. Task-wise comparison of ESM2 15B and ProGen2-XL on the DMS benchmark. ESM2 15B scores are computed using the semantic scoring strategy, while ProGen2-XL scores use the negative log-likelihood strategy. Reported values represent the absolute Spearman's rank correlation between model fitness scores and experimental measurements.

4.2. Antigenic diversity prediction

The current influenza vaccine strains are selected based on the degree to which circulating viruses have drifted from previously dominant strains, with this distance assessed by integrating both genetic and antigenic evolution (Smith et al., 2004; Fouchier & Smith, 2010). While modelling genetic evolution from historical data is routine, typically via phylogenetic trees constructed using maximum likelihood estimation (MLE) (Felsenstein, 1973), capturing antigenic evolution still requires wet-lab experimental inputs. If a pLM can predict whether an emerging strain is likely to be covered by a given vaccine strain, it could significantly accelerate the vaccine development cycle.

To evaluate this capability, we established 21 influenza neutralisation assays to measure the ability of pLMs to detect antigenic differences among viral strains. These assays generally use haemagglutination inhibition techniques, in which antibody titers serve as a proxy for antigenic similarity; for example, a titer of 1:40 is often considered indicative of adequate immune coverage (Hannoun et al., 2004). Within this framework, we query pLMs to estimate the antigenic similarity between vaccine strains and newly isolated viral strains. Conceptually, higher similarity scores should correspond to stronger expected vaccine-mediated protection.

To quantitatively assess model performance, we calculate the contextual embedding distance between a circulating strain and the vaccine strain as the predicted antigenic distance and evaluate its correlation with experimental titers. For decoder-only models, we use perplexity as the predicted antigenic distance. This evaluation enables us to determine whether a pLM can approximate fine-grained antigenic relationships and provide actionable immunological insights. Thus, by accurately ranking strains in terms of antigenic similarity, pLMs could guide vaccine strain selection to maximize coverage against circulating viruses and optimize antibody-mediated protection.

However, the performance differences among the models are marginal, with Tranception M slightly outperforming the others in Table 3 (see task-wise performance in Appendix A.2, Figure 8). Detailed performance for Tranception M on each task can be found in Figure 4. These results suggest that current pLMs exhibit similar capabilities on neutralisation prediction tasks and that significant room for improvement remains.

Table 3. Zero-shot neutralisation prediction results. Reported values are the average absolute Spearman's rank correlation between model predictions and experimental measurements across all baseline methods.

MODEL NAME | SPEARMAN | STD.
PROTT5 | 0.1961 | 0.206
TRANCEPTION | 0.2316 | 0.1696
PROTGPT2 | 0.2018 | 0.1845
PROGEN2 | 0.2250 | 0.1852
ESM1V | 0.2282 | 0.2043
ESM1 | 0.2222 | 0.2098
ESM2 | 0.2267 | 0.1840

[Figure 4 bar chart. x-axis: |Spearman r| (0.0 to 0.6); one bar per vaccine strain task; colour: vaccine subtype (H3N2, H1N1).]

Figure 4. Task-wise performance of Tranception M on the neutralisation benchmark. Antigenicity scores are computed using the negative log-likelihood strategy. Each task corresponds to a vaccine strain representing post-vaccination serum, with colors indicating influenza A subtypes. Performance is measured as the absolute Spearman's rank correlation between predicted antigenicity scores and experimental measurements, averaged across sera from different animal sources.

4.3. Pandemic prediction

pLMs are increasingly viewed as a universal key for protein prediction, potentially replacing traditional multiple sequence alignment (MSA) methods (Weissenow & Rost, 2025). Their capabilities include generating representations for secondary and tertiary structure prediction and inferring biochemical properties without labelled data (Rives et al., 2021), identifying conserved residues without MSAs (Marquet et al., 2022), and predicting the effects of missense mutations (Meier et al., 2021). Many pLMs have demonstrated outstanding performance on in vitro benchmarks, but a critical question remains: can these models generalize effectively to in vivo or real-world environments?

To address this, we designed an evaluation task to project in vitro results onto real-world scenarios. Specifically, we test whether pLMs can identify dominant circulating mutations using only the target SARS-CoV-2 Spike protein sequence.
Each model calculates the fitness score of every single mutation using either semantic scoring strategies (e.g., ESM-1 family, ESM-1v, ESM-2 family) or perplexity-based scoring (e.g., VESPA, VESPAl, Tranception family, ProtGPT2, ProGen2 suite). Heat maps of the in silico fitness scores predicted by all baselines for each amino acid at each residue can be found in Appendix A.2, Figures 9-16.

To quantify performance, we introduce a precision@K metric, which measures a model's ability to correctly identify the top mutations. Of the three evaluation metrics reported in Table 4, ProGen2-XL achieves the highest Top 10% Recall (0.4081) and the only non-zero Precision@3 (0.33), while VESPAl attains the highest Spearman correlation (0.3602). Notably, ProGen2-XL also leads the mutation-level prediction tasks (Table 2) and achieves reasonable performance on the neutralisation task (Table 3).

Table 4. Zero-shot pandemic prediction results. Metrics reported for all baselines include Top 10% Recall, absolute Spearman's rank correlation, and Precision@3 between model scores and mutation frequencies from GISAID.

MODEL NAME   RECALL  SPEARMAN  PRECISION
VESPAL       0.3226  0.3602    0
VESPA        0.2936  0.3152    0
TRANCEPTION  0.2066  0.2196    0
PROTGPT2     0.1221  0.0372    0
PROGEN2      0.4081  0.3153    0.33
ESM1V        0.1206  0.0543    0
ESM1         0.2471  0.1870    0
ESM2         0.2608  0.3026    0

Next, we investigate the relationships among computational predictions, in vitro experiments, and real-world viral evolution, with the aim of assessing how effectively pLMs can bridge laboratory assays and naturally circulating viral strains. To this end, we focus on SARS-CoV-2 and analyse the overlap of single-point mutations identified under comparable experimental conditions.

We observe that ProGen2-XL, which achieves the best overall performance across our benchmarks, shares nearly 50% of its top-ranked mutations with those most prevalent in real-world viral circulation (Figure 5). In contrast, DMS assays identify only 10% of these dominant circulating mutations. It is worth noting that ProGen2-XL exhibits approximately 20% overlap with the top mutations identified by DMS. Among these shared mutations is the N501Y substitution, which has been shown to be a major determinant of the increased transmissibility of the SARS-CoV-2 Alpha variant by enhancing the binding affinity of the Spike protein to host cell receptors (Liu et al., 2022).

These findings suggest that, although DMS assays characterize protein fitness under controlled conditions, appropriately selected pLMs can more effectively capture evolutionary constraints that govern viral spread in real-world settings.

[Figure 5: two Venn diagrams. Panel a: DMS, GISAID, ESM2. Panel b: DMS, GISAID, ProGen2.]

Figure 5. Overlap among top 10 mutations from computational predictions, in vitro DMS assays of the SARS-CoV-2 S protein RBD, and naturally occurring mutations from GISAID. a. ESM2-650M predictions show no overlap with DMS or GISAID mutations. b. Predicted mutations from ProGen2-XL overlap 50% with GISAID and 20% with DMS.

5. Discussion

ViroGym introduces a novel evaluation framework for viral proteins, encompassing mutational effect prediction, antigenicity diversity prediction, and pandemic prediction, with the goal of linking in vitro experiments to real-world outcomes. While DMS datasets provide a detailed view of the protein fitness landscape by measuring protein properties under controlled conditions, protein evolution in the real world is shaped by additional constraints, particularly for viral proteins. For instance, immune imprinting from early antigen exposure can bias antibody responses toward conserved epitopes, influencing vaccine strain selection and subsequently shaping viral evolutionary trajectories.
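As an illustration of the three metrics reported in Table 4, the following Python sketch computes Precision@K, a top-fraction recall, and the absolute Spearman correlation between model scores and mutation frequencies. The exact metric definitions are not spelled out in this section, so the definitions below are assumptions: Precision@K is taken as the share of a model's K top-scored mutations that are also among the K most frequent ones, and recall as the analogous overlap between top fractions. All function names, the mutation identifiers `m1` to `m6`, and the numbers are hypothetical toy values, not GISAID data or ViroGym code.

```python
from typing import Dict, List


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n highest-scoring mutations (ties broken alphabetically)."""
    return sorted(scores, key=lambda m: (-scores[m], m))[:n]


def precision_at_k(pred: Dict[str, float], freq: Dict[str, float], k: int) -> float:
    """Assumed definition: share of the model's top-k mutations that are
    also among the k most frequent mutations."""
    relevant = set(top_n(freq, k))
    return sum(m in relevant for m in top_n(pred, k)) / k


def top_fraction_recall(pred: Dict[str, float], freq: Dict[str, float],
                        fraction: float = 0.1) -> float:
    """Assumed definition: overlap between the top `fraction` of predictions
    and the top `fraction` of observed frequencies."""
    n = max(1, int(len(freq) * fraction))
    return len(set(top_n(pred, n)) & set(top_n(freq, n))) / n


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def abs_spearman(pred: Dict[str, float], freq: Dict[str, float]) -> float:
    """Absolute Spearman correlation, computed as |Pearson correlation of ranks|."""
    names = sorted(freq)
    x = average_ranks([pred[m] for m in names])
    y = average_ranks([freq[m] for m in names])
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return abs(cov / (var_x * var_y) ** 0.5)


# Toy inputs (hypothetical): model fitness scores and observed frequencies.
pred = {"m1": 0.9, "m2": 0.7, "m3": 0.2, "m4": 0.6, "m5": 0.1, "m6": 0.8}
freq = {"m1": 0.30, "m2": 0.05, "m3": 0.04, "m4": 0.20, "m5": 0.02, "m6": 0.95}

print(precision_at_k(pred, freq, k=3))
print(top_fraction_recall(pred, freq, fraction=0.5))
print(abs_spearman(pred, freq))
```

With these toy values, two of the model's three top-scored mutations are also among the three most frequent, so Precision@3 is 2/3. A Precision@3 of 0.33, as reported for ProGen2 in Table 4, would correspond to one shared mutation out of three under this assumed definition.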
Our analysis highlights three key considerations for improving pLMs on viral proteins. First, pLMs must handle insertions and deletions (indels), which often disrupt protein function. Currently, to our knowledge, only ESM models explicitly encode deletions as tokens in their vocabulary, and filtering out sequences with deletions yields modest performance gains for most models (see Appendix A.2, Table 10). Second, unlike other sequence-based pLMs trained solely on UniProtKB, the best-performing model in ViroGym, ProGen2, may benefit from joint pretraining on the UniProtKB and BFD datasets. Third, viral proteins frequently exceed the typical length of proteins, whereas most pLMs are limited to context windows of fewer than 1024 residues. These observations suggest that pLMs could achieve improved performance by expanding their token representations, incorporating larger and more diverse training datasets, and increasing context length to better capture long-range dependencies in viral proteins.

ViroGym is the first benchmark to systematically integrate computational models, experimental assays, and real-world viral evolution data, providing a unified platform for evaluating model predictive performance and guiding vaccine development. Our results demonstrate that, although DMS assays identify the top mutations based on experimental fitness, these mutations do not substantially overlap with real-world circulating variants. Interestingly, models selected using DMS-based evaluation successfully predict the dominant mutations observed in natural viral circulation. This indirect validation indicates that the mutation effect prediction task in ViroGym serves as a useful proxy for identifying models that capture biological constraints generalizable to real-world viral evolution. At the same time, it suggests that DMS assays may have limited utility for fine-tuning pLMs, given their low overlap with circulating variants.
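The overlap analysis behind Figure 5 compares the top-ranked mutations from three sources: model predictions, DMS assays, and GISAID frequencies. A minimal sketch of that comparison is given below, assuming each source is reduced to a set of top-K mutation identifiers. The helper names and the identifiers `m1` to `m9` are hypothetical, and the resulting counts are toy values that do not reproduce the figure.

```python
from typing import Dict, Set


def top_k(scores: Dict[str, float], k: int) -> Set[str]:
    """Set of the k highest-scoring mutations (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return set(ranked[:k])


def overlap_fraction(a: Set[str], b: Set[str]) -> float:
    """Fraction of the mutations in `a` that also appear in `b`."""
    return len(a & b) / len(a)


def venn_counts(model: Set[str], dms: Set[str], gisaid: Set[str]) -> Dict[str, int]:
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "model_only": len(model - dms - gisaid),
        "dms_only": len(dms - model - gisaid),
        "gisaid_only": len(gisaid - model - dms),
        "model_dms_only": len((model & dms) - gisaid),
        "model_gisaid_only": len((model & gisaid) - dms),
        "dms_gisaid_only": len((dms & gisaid) - model),
        "all_three": len(model & dms & gisaid),
    }


# Toy example with K = 4 (hypothetical identifiers and scores).
model_top = top_k({"m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.6, "m5": 0.1}, k=4)
dms_top = {"m1", "m6", "m7", "m8"}
gisaid_top = {"m1", "m2", "m3", "m9"}

counts = venn_counts(model_top, dms_top, gisaid_top)
print(counts)
print(overlap_fraction(model_top, gisaid_top))
print(overlap_fraction(model_top, dms_top))
```

In this toy case the model shares three of its four top mutations with the GISAID set and one with the DMS set. The same functions applied with K = 10 would yield percentages of the kind quoted in the Figure 5 caption.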
Importantly, our results also indicate that DMS and pLM-based predictions provide complementary signals. While DMS assays offer high-resolution measurements of functional effects under well-defined experimental conditions, pLMs capture broader sequence-level constraints learned from large-scale evolutionary data. Combining DMS-derived fitness information with pLM predictions may therefore enable more accurate and robust forecasting of real-world viral evolution, offering a promising direction for improving mutation prioritization and vaccine strain selection.

6. Limitations and Future Work

MSA-based models like EVE (Frazer et al., 2021) and EVEscape (Thadani et al., 2023) are excluded because high-quality multiple sequence alignments are often difficult to obtain for viral proteins, particularly for novel viruses. We also do not consider hybrid and structure-based models such as MULAN because they rely on experimentally validated protein structures, which are time-consuming to obtain. While tools such as AlphaFold (Jumper et al., 2021) can predict complex protein structures rapidly, a systematic comparison of predictions based on AlphaFold versus experimentally solved structures remains an important direction for future work in vaccine design.

A limitation of our work is that the pandemic prediction task focuses primarily on top mutations from SARS-CoV-2, as reliable mutation frequency data for Influenza A and other viruses are more difficult to obtain from GISAID. Extending this task to additional viral species would enable a more thorough evaluation of model generalization.

Impact Statement

This work highlights an important shift in how pLMs can be evaluated and applied: rather than merely reproducing outcomes from DMS experiments, pLMs may be better suited to capture real-world mutagenic patterns observed during natural viral evolution.
By benchmarking models against experimentally grounded and naturally occurring mutations, our framework suggests that pLMs can provide more relevant and actionable insights for real-world applications such as vaccine design, surveillance, and therapeutic development. This perspective supports the use of pLMs as complementary tools to experimental assays, with the potential to guide and prioritize future experimental efforts.

References

Aditham, A. K., Radford, C. E., Carr, C. R., Jasti, N., King, N. P., and Bloom, J. D. Deep mutational scanning of rabies glycoprotein defines mutational constraint and antibody-escape mutations. Cell Host & Microbe, 33(6):988–1003, 2025. doi: 10.1016/j.chom.2025.04.018.

Allman, B. E., Vieira, L., Diaz, D. J., and Wilke, C. O. A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. Journal of the Royal Society Interface, 22(225):20240598, 2025. doi: 10.1098/rsif.2024.0598.

Cao, Y., Wang, J., Jian, F., Xiao, T., Song, W., Yisimayi, A., Huang, W., Li, Q., Wang, P., An, R., et al. Omicron escapes the majority of existing sars-cov-2 neutralizing antibodies. Nature, 602(7898):657–663, 2022. doi: 10.1038/s41586-021-04385-3.

Carr, C. R., Crawford, K. H., Murphy, M., Galloway, J. G., Haddox, H. K., Matsen, F. A., Andersen, K. G., King, N. P., and Bloom, J. D. Deep mutational scanning reveals functional constraints and antibody-escape potential of lassa virus glycoprotein complex. Immunity, 57(9):2061–2076, 2024. doi: 10.1016/j.immuni.2024.06.013.

Centers for Disease Control and Prevention and others. CDC seasonal flu vaccine effectiveness studies, 2025. URL https://www.cdc.gov/flu-vaccines-work/php/effectiveness-studies/index.html.

Dadonaite, B., Ahn, J. J., Ort, J. T., Yu, J., Furey, C., Dosey, A., Hannon, W. W., Vincent Baker, A. L., Webby, R. J., King, N. P., et al. Deep mutational scanning of h5 hemagglutinin to inform influenza virus surveillance. PLoS biology, 22(11):e3002916, 2024a. doi: 10.1371/journal.pbio.3002916.

Dadonaite, B., Brown, J., McMahon, T. E., Farrell, A. G., Figgins, M. D., Asarnow, D., Stewart, C., Lee, J., Logue, J., Bedford, T., et al. Spike deep mutational scanning helps predict success of sars-cov-2 clades. Nature, 631(8021):617–626, 2024b. doi: 10.1038/s41586-024-07636-1.

Dadonaite, B., Burrell, A. R., Logue, J., Chu, H. Y., Payne, D. C., Haslam, D. B., Staat, M. A., and Bloom, J. D. Sars-cov-2 neutralizing antibody specificities differ dramatically between recently infected infants and immune-imprinted individuals. Journal of Virology, 99(4):e00109–25, 2025a. doi: 10.1128/jvi.00109-25.

Dadonaite, B., Harari, S., Larsen, B. B., Kampman, L., Harteloo, A., Elias-Warren, A., Chu, H. Y., and Bloom, J. D. Spike mutations that affect the function and antigenicity of recent kp.3.1.1-like sars-cov-2 variants. Journal of virology, 99(11):e01423–25, 2025b. doi: 10.1128/jvi.01423-25.

Doud, M. B. and Bloom, J. D. Accurate measurement of the effects of all amino-acid mutations on influenza hemagglutinin. Viruses, 8(6):155, 2016. doi: 10.3390/v8060155.

Doud, M. B., Ashenberg, O., and Bloom, J. D. Site-specific amino acid preferences are mostly conserved in two closely related protein homologs. Molecular biology and evolution, 32(11):2944–2960, 2015. doi: 10.1093/molbev/msv167.

Duenas-Decamp, M., Jiang, L., Bolon, D., and Clapham, P. R. Saturation mutagenesis of the hiv-1 envelope cd4 binding loop reveals residues controlling distinct trimer conformations. PLoS pathogens, 12(11):e1005988, 2016. doi: 10.1371/journal.ppat.1005988.

Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., and Rost, B. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2022. doi: 10.1109/TPAMI.2021.3095381.

Felsenstein, J. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Biology, 22(3):240–249, 1973. doi: 10.1093/sysbio/22.3.240.

Fernandes, J. D., Faust, T. B., Strauli, N. B., Smith, C., Crosby, D. C., Nakamura, R. L., Hernandez, R. D., and Frankel, A. D. Functional segregation of overlapping genes in hiv. Cell, 167(7):1762–1773, 2016. doi: 10.1016/j.cell.2016.11.031.

Ferruz, N., Schmidt, S., and Höcker, B. Protgpt2 is a deep unsupervised language model for protein design. Nature Communications, 13(1):4348, 2022. doi: 10.1038/s41467-022-32007-7.

Fouchier, R. A. and Smith, D. J. Use of antigenic cartography in vaccine seed strain selection. Avian diseases, 54(s1):220–223, 2010. doi: 10.1637/8740-032509-resnote.1.

Fowler, D. M. and Fields, S. Deep mutational scanning: a new style of protein science. Nature Methods, 11(8):801–807, 2014. doi: 10.1038/nmeth.3027.

Frazer, J., Notin, P., Dias, M., Gomez, A., Min, J. K., Brock, K., Gal, Y., and Marks, D. S. Disease variant prediction with deep generative models of evolutionary data. Nature, 599(7883):91–95, 2021. doi: 10.1038/s41586-021-04043-8.

Frolova, D., Pak, M., Litvin, A., Sharov, I., Ivankov, D., and Oseledets, I. Mulan: Multimodal protein language model for sequence and structure encoding. Bioinformatics Advances, p. vbaf117, 2025. doi: 10.1093/bioadv/vbaf117.

Gurev, S., Youssef, N., Jain, N., and Marks, D. S. Variant effect prediction with reliability estimation across priority viruses. bioRxiv, p. 2025–08, 2025. doi: 10.1101/2025.08.04.668549.

Haddox, H. K., Dingens, A. S., and Bloom, J. D. Experimental estimation of the effects of all amino-acid mutations to hiv's envelope protein on viral replication in cell culture. PLoS pathogens, 12(12):e1006114, 2016. doi: 10.1371/journal.ppat.1006114.

Haddox, H. K., Dingens, A. S., Hilton, S. K., Overbaugh, J., and Bloom, J. D. Mapping mutational effects along the evolutionary landscape of hiv envelope. Elife, 7:e34420, 2018. doi: 10.7554/eLife.34420.

Hadfield, J., Megill, C., Bell, S. M., Huddleston, J., Potter, B., Callender, C., Sagulenko, P., Bedford, T., and Neher, R. A. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 34(23):4121–4123, 2018. doi: 10.1093/bioinformatics/bty407.

Hannoun, C., Megas, F., and Piercy, J. Immunogenicity and protective efficacy of influenza vaccination. Virus research, 103(1-2):133–138, 2004. doi: 10.1016/j.virusres.2004.02.025.

Hie, B., Zhong, E. D., Berger, B., and Bryson, B. Learning the language of viral evolution and escape. Science, 371(6526):284–288, 2021. doi: 10.1126/science.abd7331.

Jiang, L., Liu, P., Bank, C., Renzette, N., Prachanronarong, K., Yilmaz, L. S., Caffrey, D. R., Zeldovich, K. B., Schiffer, C. A., Kowalik, T. F., et al. A balance between inhibitor binding and substrate processing confers influenza drug resistance. Journal of molecular biology, 428(3):538–553, 2016. doi: 10.1016/j.jmb.2015.11.027.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021. doi: 10.1038/s41586-021-03819-2.

Kikawa, C., Cartwright-Acar, C. H., Stuart, J. B., Contreras, M., Levoir, L. M., Evans, M. J., Bloom, J. D., and Goo, L. The effect of single mutations in zika virus envelope on escape from broadly neutralizing antibodies. Journal of Virology, 97(11):e01414–23, 2023. doi: 10.1128/jvi.01414-23.

Kikawa, C., Huddleston, J., Loes, A. N., Turner, S. A., Lee, J., Barr, I. G., Cowling, B. J., Englund, J. A., Greninger, A. L., Harvey, R., et al. Near real-time data on the human neutralizing antibody landscape to influenza virus to inform vaccine-strain selection in september 2025. Virus Evolution, 11(1):veaf086, 2025a. doi: 10.1093/ve/veaf086.

Kikawa, C., Loes, A. N., Huddleston, J., Figgins, M. D., Steinberg, P., Griffiths, T., Drapeau, E. M., Peck, H., Barr, I. G., Englund, J. A., Hensley, S. E., Bedford, T., and Bloom, J. D. High-throughput neutralization measurements correlate strongly with evolutionary success of human influenza strains. bioRxiv, November 2025b. doi: 10.7554/elife.106811.2.

Kirsebom, F. C., Stowe, J., Bernal, J. L., Allen, A., and Andrews, N. Effectiveness of autumn 2023 covid-19 vaccination and residual protection of prior doses against hospitalisation in england, estimated using a test-negative case-control study. Journal of Infection, 89(1):106177, 2024. doi: 10.1016/j.jinf.2024.106177.

Larsen, B. B., McMahon, T., Brown, J. T., Wang, Z., Radford, C. E., Crowe, J. E., Veesler, D., and Bloom, J. D. Functional and antigenic landscape of the nipah virus receptor-binding protein. Cell, 188(9):2480–2494, 2025. doi: 10.1016/j.cell.2025.02.030.

Lee, J. M., Huddleston, J., Doud, M. B., Hooper, K. A., Wu, N. C., Bedford, T., and Bloom, J. D. Deep mutational scanning of hemagglutinin helps predict evolutionary fates of human h3n2 influenza variants. Proceedings of the National Academy of Sciences, 115(35):E8276–E8285, 2018. doi: 10.1073/pnas.1806133115.

Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., and Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. doi: 10.1126/science.ade2574.

Liu, Y., Liu, J., Plante, K. S., Plante, J. A., Xie, X., Zhang, X., Ku, Z., An, Z., Scharton, D., Schindewolf, C., et al. The n501y spike substitution enhances sars-cov-2 infection and transmission. Nature, 602(7896):294–299, 2022. doi: 10.1038/s41586-021-04245-0.

Livesey, B. J. and Marsh, J. A. Variant effect predictor correlation with functional assays is reflective of clinical classification performance. Genome Biology, 26(1):104, 2025. doi: 10.1186/s13059-025-03575-w.

Loes, A. N., Tarabi, R. A. L., Huddleston, J., Touyon, L., Wong, S. S., Cheng, S. M., Leung, N. H., Hannon, W. W., Bedford, T., Cobey, S., et al. High-throughput sequencing-based neutralization assay reveals how repeated vaccinations impact titers to recent human h1n1 influenza strains. Journal of Virology, 98(10):e00689–24, 2024. doi: 10.1128/jvi.00689-24.

Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Erckert, K., Bernhofer, M., Nechaev, D., and Rost, B. Embeddings from protein language models predict conservation and variant effects. Human Genetics, 141(10):1629–1647, October 2022. ISSN 1432-1203. doi: 10.1007/s00439-021-02411-y.

Mattenberger, F., Latorre, V., Tirosh, O., Stern, A., and Geller, R. Globally defining the effects of mutations in a picornavirus capsid. Elife, 10:e64256, 2021. doi: 10.7554/eLife.64256.

Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., and Rives, A. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems, 34:29287–29303, 2021. doi: 10.1101/2021.07.09.450648.

Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N., and Madani, A. ProGen2: Exploring the boundaries of protein language models. Cell Systems, 14(11):968–978.e3, 2023. ISSN 2405-4712. doi: 10.1016/j.cels.2023.10.002.

Notin, P., Dias, M., Frazer, J., Marchena-Hurtado, J., Gomez, A. N., Marks, D., and Gal, Y. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning, p. 16990–17017. PMLR, 2022. doi: 10.48550/arXiv.2205.13760.

Notin, P., Kollasch, A., Ritter, D., Van Niekerk, L., Paul, S., Spinner, H., Rollins, N., Shaw, A., Orenbuch, R., Weitzman, R., et al. Proteingym: Large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems, 36:64331–64379, 2023. doi: 10.1101/2023.12.07.570727.

Qi, H., Olson, C. A., Wu, N. C., Ke, R., Loverdo, C., Chu, V., Truong, S., Remenyi, R., Chen, Z., Du, Y., et al. A quantitative high-resolution genetic profile rapidly identifies sequence determinants of hepatitis c viral fitness and drug sensitivity. PLoS pathogens, 10(4):e1004064, 2014. doi: 10.1371/journal.ppat.1004064.

Radford, C. E. and Bloom, J. D. Comprehensive maps of escape mutations from antibodies 10-1074 and 3bnc117 for envs from two divergent hiv strains. Journal of Virology, 99(5):e00195–25, 2025. doi: 10.1128/jvi.00195-25.

Radford, C. E., Schommers, P., Gieselmann, L., Crawford, K. H., Dadonaite, B., Yu, T. C., Dingens, A. S., Overbaugh, J., Klein, F., and Bloom, J. D. Mapping the neutralizing specificity of human anti-hiv serum by deep mutational scanning. Cell Host & Microbe, 31(7):1200–1215, 2023. doi: 10.1016/j.chom.2023.05.025.

Riesselman, A. J., Ingraham, J. B., and Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nature methods, 15(10):816–822, 2018. doi: 10.1038/s41592-018-0138-4.

Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., and Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 2019. doi: 10.1101/622803.

Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021. doi: 10.1073/pnas.2016239118.

Shu, Y. and McCauley, J. Gisaid: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance, 22(13):30494, 2017. doi: 10.2807/1560-7917.es.2017.22.13.30494.

Sinai, S., Jain, N., Church, G. M., and Kelsic, E. D. Generative aav capsid diversification by latent interpolation. bioRxiv, p. 2021–04, 2021. doi: 10.1101/2021.04.16.440236.

Smith, D. J., Lapedes, A. S., De Jong, J. C., Bestebroer, T. M., Rimmelzwaan, G. F., Osterhaus, A. D., and Fouchier, R. A. Mapping the antigenic and genetic evolution of influenza virus. Science, 305(5682):371–376, 2004. doi: 10.1126/science.1097211.

Soh, Y. S., Moncla, L. H., Eguia, R., Bedford, T., and Bloom, J. D. Comprehensive mapping of adaptation of the avian influenza polymerase protein pb2 to humans. Elife, 8:e45079, 2019. doi: 10.7554/elife.45079.

Sourisseau, M., Lawrence, D. J., Schwarz, M. C., Storrs, C. H., Veit, E. C., Bloom, J. D., and Evans, M. J. Deep mutational scanning comprehensively maps how zika envelope protein mutations affect viral growth and antibody escape. Journal of virology, 93(23):10–1128, 2019. doi: 10.1128/jvi.01291-19.

Starr, T. N., Greaney, A. J., Hilton, S. K., Ellis, D., Crawford, K. H., Dingens, A. S., Navarro, M. J., Bowen, J. E., Tortorici, M. A., Walls, A. C., et al. Deep mutational scanning of sars-cov-2 receptor binding domain reveals constraints on folding and ace2 binding. Cell, 182(5):1295–1310, 2020. doi: 10.1016/j.cell.2020.08.012.

Starr, T. N., Greaney, A. J., Hannon, W. W., Loes, A. N., Hauser, K., Dillen, J. R., Ferri, E., Farrell, A. G., Dadonaite, B., McCallum, M., et al. Shifting mutational constraints in the sars-cov-2 receptor-binding domain during viral evolution. Science, 377(6604):420–424, 2022a. doi: 10.1126/science.abo7896.

Starr, T. N., Greaney, A. J., Stewart, C. M., Walls, A. C., Hannon, W. W., Veesler, D., and Bloom, J. D. Deep mutational scans for ace2 binding, rbd expression, and antibody escape in the sars-cov-2 omicron ba.1 and ba.2 receptor-binding domains. PLoS pathogens, 18(11):e1010951, 2022b. doi: 10.1371/journal.ppat.1010951.

Suphatrakul, A., Posiri, P., Srisuk, N., Nantachokchawapan, R., Onnome, S., Mongkolsapaya, J., and Siridechadilok, B. Functional analysis of flavivirus replicase by deep mutational scanning of dengue ns5. bioRxiv, p. 2023–03, 2023. doi: 10.1101/2023.03.07.531617.

Taylor, A. L. and Starr, T. N. Deep mutational scans of xbb.1.5 and bq.1.1 reveal ongoing epistatic drift during sars-cov-2 evolution. PLoS pathogens, 19(12):e1011901, 2023. doi: 10.1371/journal.ppat.1011901.

Taylor, A. L. and Starr, T. N. Deep mutational scanning of sars-cov-2 omicron ba.2.86 and epistatic emergence of the kp.3 variant. Virus evolution, 10(1):veae067, 2024. doi: 10.1093/ve/veae067.

Thadani, N. N., Gurev, S., Notin, P., Youssef, N., Rollins, N. J., Ritter, D., Sander, C., Gal, Y., and Marks, D. S. Learning from prepandemic data to forecast viral escape. Nature, 622(7984):818–825, 2023. doi: 10.1038/s41586-023-06617-0.

Tsuboyama, K., Dauparas, J., Chen, J., Laine, E., Mohseni Behbahani, Y., Weinstein, J. J., Mangan, N. M., Ovchinnikov, S., and Rocklin, G. J. Mega-scale experimental analysis of protein folding stability in biology and design. Nature, 620(7973):434–444, 2023. doi: 10.1038/s41586-023-06328-6.

Wei, C.-J., Crank, M. C., Shiver, J., Graham, B. S., Mascola, J. R., and Nabel, G. J. Next-generation influenza vaccines: opportunities and challenges. Nature reviews Drug discovery, 19(4):239–252, 2020. doi: 10.1038/s41573-019-0056-x.

Weissenow, K. and Rost, B. Are protein language models the new universal key? Current Opinion in Structural Biology, 91:102997, 2025. doi: 10.1016/j.sbi.2025.102997.

Welsh, F. C., Eguia, R. T., Lee, J. M., Haddox, H. K., Galloway, J., Chau, N. V. V., Loes, A. N., Huddleston, J., Yu, T. C., Le, M. Q., et al. Age-dependent heterogeneity in the antigenic effects of mutations to influenza hemagglutinin. Cell Host & Microbe, 32(8):1397–1411, 2024. doi: 10.1016/j.chom.2024.06.015.

Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C.-L., Abiona, O., Graham, B. S., and McLellan, J. S. Cryo-em structure of the 2019-ncov spike in the prefusion conformation. Science, 367(6483):1260–1263, 2020. doi: 10.1126/science.abb2507.

Wu, N. C., Young, A. P., Al-Mawsawi, L. Q., Olson, C. A., Feng, J., Qi, H., Chen, S.-H., Lu, I.-H., Lin, C.-Y., Chin, R. G., et al. High-throughput profiling of influenza a virus hemagglutinin gene at single-nucleotide resolution. Scientific reports, 4(1):4942, 2014. doi: 10.1038/srep04942.

Wu, N. C., Olson, C. A., Du, Y., Le, S., Tran, K., Remenyi, R., Gong, D., Al-Mawsawi, L. Q., Qi, H., Wu, T.-T., et al. Functional constraint profiling of a viral protein reveals discordance of evolutionary conservation and functionality. PLoS genetics, 11(7):e1005310, 2015. doi: 10.1371/journal.pgen.1005310.

Yu, T. C., Kikawa, C., Dadonaite, B., Loes, A. N., Englund, J. A., and Bloom, J. D. Pleiotropic mutational effects on function and stability constrain the antigenic evolution of influenza haemagglutinin. Nature ecology & evolution, p. 1–15, 2025. doi: 10.1038/s41559-025-02895-1.

Yu, Y., Kass, M. A., Zhang, M., Youssef, N., Freije, C. A., Brock, K. P., Aguado, L. C., Seifert, L. L., Venkittu, S., Hong, X., et al. Deep mutational scanning of hepatitis b virus reveals a mechanism for cis-preferential reverse transcription. Cell, 187(11):2735–2745, 2024. doi: 10.1016/j.cell.2024.04.008.

A. Appendix

A.1. Benchmark Construction Details

Table 5. Sources of DMS benchmark datasets. Mutational effect prediction tasks are based on DMS assays. This table summarises the data resources from publicly available datasets.
Dataset                    Phenotype      Reference
ZIKV                       viral growth   (Sourisseau et al., 2019)
ZIKV                       immune escape  (Sourisseau et al., 2019), (Kikawa et al., 2023)
RABV                       immune escape  (Aditham et al., 2025)
RABV                       cell entry     (Aditham et al., 2025)
NIPAH                      binding        (Larsen et al., 2025)
NIPAH                      immune escape  (Larsen et al., 2025)
NIPAH                      cell entry     (Larsen et al., 2025)
LASV                       immune escape  (Carr et al., 2024)
LASV                       cell entry     (Carr et al., 2024)
HIV B520                   immune escape  (Radford et al., 2023)
HIV B520                   cell entry     (Radford & Bloom, 2025)
HIV B520                   immune escape  (Radford et al., 2023), (Radford & Bloom, 2025)
HIV TRO11                  cell entry     (Radford & Bloom, 2025)
HIV TRO11                  immune escape  (Radford & Bloom, 2025)
*HIV HXB2                  viral growth   (Fernandes et al., 2016)
*HIV BRU/LAI               viral growth   (Fernandes et al., 2016)
*HIV strain896             viral growth   (Duenas-Decamp et al., 2016)
*HIV BRU/LAI               viral growth   (Haddox et al., 2016)
*HIV                       viral growth   (Haddox et al., 2018)
*HIV B520                  viral growth   (Haddox et al., 2018)
HBV                        fitness        (Yu et al., 2024)
*SCV2 RBD Wuhan hu         binding        (Starr et al., 2020)
*SCV2 RBD Wuhan hu         expression     (Starr et al., 2020)
SCV2 RBD Alpha             binding        (Starr et al., 2022a)
SCV2 RBD Alpha             expression     (Starr et al., 2022a)
SCV2 RBD Beta              binding        (Starr et al., 2022a)
SCV2 RBD Beta              expression     (Starr et al., 2022a)
SCV2 RBD Delta             binding        (Starr et al., 2022a)
SCV2 RBD Delta             expression     (Starr et al., 2022a)
SCV2 RBD Eta               binding        (Starr et al., 2022a)
SCV2 RBD Eta               expression     (Starr et al., 2022a)
SCV2 RBD Omicron BA.1      binding        (Starr et al., 2022b)
SCV2 RBD Omicron BA.1      expression     (Starr et al., 2022b)
SCV2 RBD Omicron BA.2      binding        (Starr et al., 2022b)
SCV2 RBD Omicron BA.2      expression     (Starr et al., 2022b)
SCV2 RBD Omicron BQ.1.1    binding        (Taylor & Starr, 2023)
SCV2 RBD Omicron BQ.1.1    expression     (Taylor & Starr, 2023)
SCV2 RBD Omicron XBB.1.5   binding        (Taylor & Starr, 2023)
SCV2 RBD Omicron XBB.1.5   expression     (Taylor & Starr, 2023)
SCV2 RBD Omicron XBB.1.5   binding        (Taylor & Starr, 2023)
SCV2 RBD Omicron XBB.1.5   expression     (Taylor & Starr, 2023)
SCV2 RBD Omicron BA.2.86   binding        (Taylor & Starr, 2024)
SCV2 RBD Omicron BA.2.86   expression     (Taylor & Starr, 2024)

* indicates that the dataset is from ProteinGym.

Table 6. Sources of DMS benchmark datasets (continued). Mutational effect prediction tasks are based on DMS assays. This table summarises the data resources from publicly available datasets.

Dataset                    Phenotype      Reference
SCV2 RBD Omicron EG.5      binding        (Taylor & Starr, 2024)
SCV2 RBD Omicron EG.5      expression     (Taylor & Starr, 2024)
SCV2 RBD Omicron FLip      binding        (Taylor & Starr, 2024)
SCV2 RBD Omicron FLip      expression     (Taylor & Starr, 2024)
SCV2 Wuhan hu              immune escape  (Cao et al., 2022)
SCV2 RBD Omicron XBB.1.5   immune escape  (Dadonaite et al., 2025a)
SCV2 RBD Omicron XBB.1.5   cell entry     (Dadonaite et al., 2025a)
SCV2 Omicron XBB.1.5       immune escape  (Dadonaite et al., 2024b)
SCV2 Omicron XBB.1.5       binding        (Dadonaite et al., 2024b)
SCV2 Omicron XBB.1.5       cell entry     (Dadonaite et al., 2024b)
SCV2 Omicron BA.2          binding        (Dadonaite et al., 2024b)
SCV2 Omicron BA.2          cell entry     (Dadonaite et al., 2024b)
SCV2 KP.3.11               immune escape  (Dadonaite et al., 2025b)
SCV2 KP.3.11               cell entry     (Dadonaite et al., 2025b)
SCV2 KP.3.11               binding        (Dadonaite et al., 2025b)
IAV H3N2 HK19              immune escape  (Welsh et al., 2024)
IAV H3N2 HK19              viral growth   (Welsh et al., 2024)
IAV H5N1                   immune escape  (Dadonaite et al., 2024a)
IAV H5N1                   cell entry     (Dadonaite et al., 2024a)
IAV H5N1                   stability      (Dadonaite et al., 2024a)
IAV H3N2 MC22              immune escape  (Yu et al., 2025)
IAV H3N2 MC22              cell entry     (Yu et al., 2025)
IAV H3N2 MC22              stability      (Yu et al., 2025)
*IAV H1N1                  viral growth   (Doud & Bloom, 2016)
*IAV H1N1                  viral growth   (Wu et al., 2014)
*IAV H2N1                  viral growth   (Soh et al., 2019)
*IAV H3N2                  viral growth   (Lee et al., 2018)
*IAV H3N2                  viral growth   (Doud et al., 2015)
*IAV H1N1                  viral growth   (Doud et al., 2015)
*IAV H1N1                  viral growth   (Jiang et al., 2016)
*IAV H1N1                  viral growth   (Wu et al., 2015)
*CXB3                      viral growth   (Mattenberger et al., 2021)
*AAV2                      viral growth   (Sinai et al., 2021)
*DEN                       viral growth   (Suphatrakul et al., 2023)
*HCV JFH 1                 viral growth   (Qi et al., 2014)
*PESV                      stability      (Tsuboyama et al., 2023)

* indicates that the dataset is from ProteinGym.

Table 7. Distribution of viruses and phenotypes (total counts) for DMS functional assays. A dash marks an empty cell.

        Binding  Cell entry  Expression  Fitness  Immune escape  Stability  Viral growth  Total
AAV2    -        -           -           -        -              -          1             1
CXB3    -        -           -           -        -              -          1             1
DEN     -        -           -           -        -              -          1             1
HBV     -        -           -           1        -              -          -             1
HCV     -        -           -           -        -              -          1             1
HIV     -        2           -           -        3              -          6             11
IAV     -        2           -           -        3              2          9             16
LASV    -        1           -           -        1              -          -             2
NIPAH   1        1           -           -        1              -          -             3
PESV    -        -           -           -        -              1          -             1
RABV    -        1           -           -        1              -          -             2
SCV2    15       4           12          -        4              -          2             37
ZIKV    -        -           -           -        1              -          1             2
Total   16       11          12          1        14             3          22            79

Table 8. Sources of neutralisation benchmark datasets. Antigenicity diversity prediction tasks are based on neutralisation assays. This table summarises the data resources from publicly available datasets.

Dataset                          Sera Source  Subtype  Reference
A/Massachusetts/18/2022          ferret       H3N2     (Kikawa et al., 2025a)
A/Thailand/8/2022                ferret       H3N2     (Kikawa et al., 2025a)
A/DistrictOfColumbia/27/2023     ferret       H3N2     (Kikawa et al., 2025a)
A/Croatia/10136RV/2023-egg       ferret       H3N2     (Kikawa et al., 2025a)
A/Netherlands/10563/2023         ferret       H3N2     (Kikawa et al., 2025a)
A/Lisboa/216/2023                ferret       H3N2     (Kikawa et al., 2025a)
A/Slovenia/49/2024               ferret       H3N2     (Kikawa et al., 2025a)
A/Switzerland/47775/2024         ferret       H3N2     (Kikawa et al., 2025a)
A/Norway/12374/2023              ferret       H3N2     (Kikawa et al., 2025a)
A/BurkinaFaso/3131/2023          ferret       H3N2     (Kikawa et al., 2025a)
A/France/IDF-IPP29542/2023-egg   ferret       H3N2     (Kikawa et al., 2025a)
A/Netherlands/10685/2024         ferret       H3N2     (Kikawa et al., 2025a)
A/Lisboa/188/2023                ferret       H1N1     (Kikawa et al., 2025a)
A/Victoria/4897/2022             ferret       H1N1     (Kikawa et al., 2025a)
A/Victoria/4897/2022 IVR-238     ferret       H1N1     (Kikawa et al., 2025a)
A/Wisconsin/67/2022              ferret       H1N1     (Kikawa et al., 2025a)
A/Norway/07606/2024              ferret       H1N1     (Kikawa et al., 2025a)
A/Tajikistan/02-1057/2024        ferret       H1N1     (Kikawa et al., 2025a)
A/Darwin/9/2021                  human        H3N2     (Kikawa et al., 2025b)
A/Massachusetts/18/2022          human        H3N2     (Kikawa et al., 2025b)
A/Wisconsin/588/2019             human        H1N1     (Loes et al., 2024)

A.2. Extended Results

Table 9. Results for all models across tasks.
DMSNeutralisationGISAID Model NameRecallStd.SpearmanStd.SpearmanStd.RecallSpearman VESPAl0.16350.07260.27150.14020.19610.2060.32260.3602 VESPA0.17020.07960.27970.15060.19610.2060.29360.3152 Tranception S0.13580.03950.17760.11240.18430.18100.10800.0874 Tranception M0.15300.05950.22710.1300 0.23160.16960.15790.1456 Tranception L0.15720.06810.21640.12960.19270.19610.20660.2196 ProtGPT20.11050.03700.10210.07320.20180.18450.12210.0372 ProGen20.18730.08230.28250.16440.20180.19290.37900.3235 ProGen2 S0.15920.06710.24330.15960.22500.18520.19440.1955 ProGen2 M0.18380.08290.28840.17010.22400.20700.26360.2414 ProGen2 OAS0.09720.02310.03920.03670.18290.15130.09700.0746 ProGen2 BFD900.18880.08020.27980.15630.21470.19810.37980.2933 ProGen2 L0.17710.07270.26260.15530.21010.20020.25770.2519 ProGen2 XL0.19800.0910 0.29300.15830.20930.1977 0.40810.3153 ESM1v0.14510.06440.18770.10260.22820.20430.12060.0543 ESM1 43M0.14540.06550.19010.09880.22220.20980.18580.1384 ESM1 85M0.14160.05310.17990.09180.20240.22170.24710.1870 ESM1 670M UR50S0.14660.05860.19970.10210.19570.19430.19560.1533 ESM1 670M UR50D0.13870.05750.17920.09400.20930.20390.19210.1420 ESM1 670M UR1000.12720.05120.10800.09630.20290.17250.22510.2307 ESM2 8M0.12800.05140.14010.08520.20530.19390.23680.2893 ESM2 35M0.12890.05120.14170.09560.19610.19850.26080.3026 ESM2 150M0.13070.05310.11560.07700.21740.20140.21410.2023 ESM2 650M0.13750.06160.16930.10400.22670.18400.23330.2239 ESM2 3B0.14190.06390.16720.10660.21480.18890.18890.2075 ESM2 15B0.13900.07160.17410.11220.20390.19230.24510.1851 16 ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins Table 10. Results across 57 non-indel mutational effect tasks. Improvement is calculated as the relative percentage change over previous results. 
| Model Name | Recall | Improvement (%) | Spearman | Improvement (%) |
|---|---|---|---|---|
| ESM2 3B | 0.1483 | 4.5102 | 0.1885 | 12.7392 |
| ESM1 670M UR50D | 0.1446 | 4.2538 | 0.1851 | 3.2924 |
| Tranception L | 0.1632 | 3.8168 | 0.2263 | 4.5749 |
| VESPA | 0.1764 | 3.6428 | 0.2983 | 6.6500 |
| ProGen2 OAS | 0.1006 | 3.4979 | 0.0453 | 15.5612 |
| ESM2 650M | 0.1412 | 2.6909 | 0.1832 | 8.2103 |
| ESM1v | 0.1487 | 2.4810 | 0.2023 | 7.7784 |
| VESPAl | 0.1675 | 2.4465 | 0.2832 | 4.3094 |
| ProGen2 XL | 0.2025 | 2.2727 | 0.2979 | 1.6724 |
| ESM2 8M | 0.1304 | 1.8750 | 0.1518 | 8.3512 |
| ESM2 15B | 0.1416 | 1.8705 | 0.1788 | 2.6996 |
| Tranception M | 0.1557 | 1.7647 | 0.2234 | -1.6292 |
| ProGen2 L | 0.1800 | 1.6375 | 0.2707 | 3.0845 |
| ProGen2 M | 0.1868 | 1.6322 | 0.2891 | 0.2427 |
| ESM1 43M | 0.1477 | 1.5818 | 0.1893 | -0.4208 |
| ESM1 85M | 0.1432 | 1.1299 | 0.1815 | 0.8894 |
| ProGen2 BFD90 | 0.1901 | 0.6886 | 0.2796 | -0.0715 |
| ESM1 670M UR50S | 0.1471 | 0.3411 | 0.2028 | 1.5523 |
| Tranception S | 0.1359 | 0.0736 | 0.1753 | -1.2950 |
| ESM2 150M | 0.1304 | -0.2295 | 0.1537 | 32.9585 |
| ESM1 670M UR100 | 0.1259 | -1.0220 | 0.0986 | -8.7037 |
| ProtGPT2 | 0.1086 | -1.7195 | 0.0949 | -7.0519 |
| ProGen2 S | 0.1560 | -2.0101 | 0.2402 | -1.2741 |
| ProGen2 | 0.1835 | -2.0288 | 0.2764 | -2.1593 |
| ESM2 35M | 0.1216 | -5.6633 | 0.1325 | -6.4926 |
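The Improvement (%) columns of Table 10 appear consistent with a relative change over the all-task DMS columns of Table 9, although the caption does not name the baseline explicitly. The following is a minimal Python sketch under that assumption, using the ESM2 3B values as printed in both tables; the function name `relative_improvement` is illustrative and not taken from the paper.

```python
def relative_improvement(previous: float, current: float) -> float:
    """Relative change of `current` over `previous`, expressed in percent."""
    return (current - previous) / previous * 100.0


# ESM2 3B values as printed in Table 9 (all DMS tasks, assumed baseline)
# and Table 10 (57 non-indel tasks).
recall_all, recall_non_indel = 0.1419, 0.1483
spearman_all, spearman_non_indel = 0.1672, 0.1885

recall_gain = relative_improvement(recall_all, recall_non_indel)
spearman_gain = relative_improvement(spearman_all, spearman_non_indel)

print(f"{recall_gain:.4f} {spearman_gain:.4f}")  # 4.5102 12.7392
```

Both figures agree with the ESM2 3B row of Table 10 to four decimals, which suggests the improvements were computed from the rounded table values.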
[Figure 6: plot with axes Task (one entry per DMS assay) and Score (0.0 to 0.8). Model Name legend: ProGen2 XL, ESM2 15B, ESM1 670M, ESM1v, ProtGPT2, VESPAl, VESPA, Tranception M.]

Figure 6. Task-wise comparison of all baselines on the DMS benchmark. Reported values represent the absolute Spearman's rank correlation between model fitness scores and experimental measurements.
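The two per-task DMS metrics reported in Table 9 and plotted in Figures 6 and 7 can be sketched in plain Python. The absolute Spearman correlation follows the Figure 6 caption. The recall definition below is an assumption, since this appendix does not spell it out: it takes the share of the top 10% of variants by experimental measurement that also fall in the top 10% by model score. All function names are hypothetical, and the data are toy values.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def abs_spearman(scores, measured):
    """Absolute Spearman correlation: |Pearson correlation of the two rank vectors|."""
    rx, ry = average_ranks(scores), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return abs(cov / (var_x * var_y) ** 0.5)


def top_fraction_recall(scores, measured, fraction=0.10):
    """Assumed definition: overlap of the top-`fraction` sets, divided by the set size."""
    k = max(1, round(len(scores) * fraction))
    top_pred = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    top_true = set(sorted(range(len(measured)), key=lambda i: measured[i], reverse=True)[:k])
    return len(top_pred & top_true) / k


# Toy example: 20 variants whose model scores are perfectly anti-correlated
# with the measurements.
toy_scores = list(range(20))
toy_measured = [19 - s for s in toy_scores]

toy_spearman = abs_spearman(toy_scores, toy_measured)
toy_recall = top_fraction_recall(toy_scores, toy_measured)
print(toy_spearman, toy_recall)
```

In this toy case the absolute Spearman value is maximal while the recall is zero: taking the absolute value discards the sign of the relationship, whereas a top-fraction recall of this form depends on score direction.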
[Figure 7: plot with axes Task (one entry per DMS assay) and Score (0.0 to 0.8). Model Name legend: ProGen2 XL, ESM2 15B, ESM1 670M, ESM1v, ProtGPT2, VESPAl, VESPA, Tranception M.]

Figure 7. Task-wise comparison of all baselines on the DMS benchmark. Reported values represent the top 10% recall between model fitness scores and experimental measurements.

[Figure 8: plot with axes vaccine (one entry per neutralisation dataset, labelled by strain and subtype) and Score (0.0 to 0.8). Model Name legend: ProtT5, Tranception M, ProtGPT2, ProGen2 S, ESM1v, ESM1 43M, ESM2 650M.]

Figure 8. Task-wise comparison of all baselines on the neutralisation benchmark. Reported values represent the absolute Spearman's rank correlation between model fitness scores and experimental measurements.

[Figure 9: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -).]

Figure 9. SARS-CoV-2 Spike Protein Mutation Heat Map for ESM1. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.
[Figure 10: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -).]

Figure 10. SARS-CoV-2 Spike Protein Mutation Heat Map for ESM2. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.

[Figure 11: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -).]

Figure 11. SARS-CoV-2 Spike Protein Mutation Heat Map for ESM1v. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.

[Figure 12: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -). Colour bar ticks: 0.00000 to 0.00035.]

Figure 12. SARS-CoV-2 Spike Protein Mutation Heat Map for ProGen2. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.
[Figure 13: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -).]

Figure 13. SARS-CoV-2 Spike Protein Mutation Heat Map for ProtGPT2. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.

[Figure 14: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, -).]

Figure 14. SARS-CoV-2 Spike Protein Mutation Heat Map for Tranception. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.

[Figure 15: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V). Colour bar ticks: 0.0 to 0.8.]

Figure 15. SARS-CoV-2 Spike Protein Mutation Heat Map for VESPA. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.
[Figure 16: heat map. Axes: residue position along the Spike protein and amino acid (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V). Colour bar ticks: 0.0 to 0.8.]

Figure 16. SARS-CoV-2 Spike Protein Mutation Heat Map for VESPAl. This heat map displays the frequency of 21 potential amino acid substitutions across 1273 residues of the SARS-CoV-2 Spike protein, with colour intensity indicating mutational effect at each position.