Paper deep dive
Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
Sizhe Huang, Shujie Yang
Abstract
Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen encoder evaluation, accuracy drops from over 90% to below 47%. We argue the root cause is inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms the state of the art across datasets. With only 50% of the labeled data, it outperforms most existing methods trained on full data.
Tags
Links
- Source: https://arxiv.org/abs/2603.10051v1
- Canonical: https://arxiv.org/abs/2603.10051v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:50:59 AM
Summary
The paper introduces FlowSem-MAE, a protocol-native masked autoencoder for encrypted traffic classification. It addresses the inductive bias mismatch in existing byte-level models by treating traffic as tabular data composed of Flow Semantic Units (FSUs). The framework utilizes predictability-guided filtering, FSU-specific embeddings, and dual-axis attention to capture transferable flow semantics, significantly outperforming state-of-the-art methods even with limited labeled data.
Entities (5)
Relation Signals (3)
FlowSem-MAE → performstask → Encrypted Traffic Classification
confidence 100% · A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
FlowSem-MAE → utilizes → Flow Semantic Units
confidence 100% · FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs).
FlowSem-MAE → addresses → Inductive Bias Mismatch
confidence 95% · To address this [inductive bias mismatch], we propose a protocol-native paradigm... Instantiating this paradigm, we introduce FlowSem-MAE
Cypher Suggestions (2)
Find all components of the FlowSem-MAE architecture. · confidence 90% · unvalidated
MATCH (m:Model {name: 'FlowSem-MAE'})-[:HAS_COMPONENT]->(c) RETURN c.name, c.type
Identify problems addressed by the proposed model. · confidence 90% · unvalidated
MATCH (m:Model {name: 'FlowSem-MAE'})-[:ADDRESSES]->(p:Problem) RETURN p.name
Full Text
46,790 characters extracted from source content.
Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification

Sizhe Huang 1, Shujie Yang 1

Abstract

Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen encoder evaluation, accuracy drops from over 90% to below 47%. We argue the root cause is inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms the state of the art across datasets. With only 50% labeled data, it outperforms most existing methods trained on full data.

1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. Corresponding Author: Shujie Yang <sjyang@bupt.edu.cn>. Preprint. March 12, 2026.

1. Introduction

Encrypted traffic classification (ETC) has become essential for network security and management, as over 95% of web traffic is now encrypted (Google, 2025) and traditional payload-based inspection is no longer viable.

Figure 1. Protocol fields (left) are flattened into raw bytes (middle) and embedded (right), illustrating inductive bias mismatch at three levels. (P1) Field-level unpredictability: random fields (pink) are treated as learnable despite being unpredictable by protocol design (e.g., ip.id and checksum). (P2) Cross-field-level embedding confusion: field distinctions are lost through cross-field embedding (grey), where adjacent bytes span multiple fields (e.g., ip.flags and ip.frag_offset), and through a unified embedding function, where semantically different values receive identical vectors (e.g., Total Len=1500 and Win Size=1500). (P3) Flow-level metadata loss: temporal metadata (hatched) essential for flow-level behavior analysis exists outside packet bytes and is entirely discarded.
Recently, self-supervised masked modeling has been widely adopted for ETC, treating packets as generic byte sequences and reconstructing randomly masked bytes (Lin et al., 2022; Zhao et al., 2023; Wang et al., 2024). While this paradigm thrives in vision and NLP (Berahmand et al., 2024; Salazar et al., 2020; Hondru et al., 2025), where the basic units (patches, tokens) naturally align with semantic structure, it remains questionable for encrypted traffic: raw bytes often act as fragmented carriers rather than cohesive semantic units, leading to a fundamental misalignment between the masking objective and true flow semantics.

1.1. Motivation: Limited Transferability

Existing byte-level masked modeling struggles to learn transferable representations for ETC. Under frozen encoder evaluation, a standard protocol for assessing representation quality, accuracy drops from over 90% (with full fine-tuning) to below 47% (with frozen encoder), suggesting that pretraining contributes minimally to reducing reliance on labeled data (Zhao et al., 2025). The seemingly high accuracy of prior methods results from supervised fine-tuning, rather than from learned representations.

We argue that the root cause is inductive bias mismatch: byte-level modeling destroys the inherent semantics that network protocols explicitly define. Flattening this structured representation into raw bytes inevitably causes semantic loss at multiple levels. We trace this mismatch to three fundamental issues (Fig. 1), which we refer to as P1-P3 for brevity:

P1: Field-Level Unpredictability. Not all protocol fields carry learnable signals. RFC 6274 recommends pseudo-random generation for ip.id to prevent information leakage (Gont, 2011), and RFC 9293 requires the initial sequence number to be "unpredictable to attackers" (Eddy, 2022).
These fields are unlearnable by design, yet byte-based masking treats them as reconstruction targets, creating gradient noise that corrupts learning of meaningful fields.

P2: Cross-Field-Level Embedding Confusion. Byte-level modeling projects semantically distinct protocol fields through a unified embedding function, causing cross-field pollution and value collision. Unlike natural language polysemy, where context disambiguates meaning, protocol fields are categorically distinct by specification (Yin et al., 2020). Positional encoding cannot resolve this issue, as it provides location information but lacks field-type awareness. From a manifold perspective (Brahma et al., 2015; Kienitz et al., 2022), each field type should occupy its own subspace, but shared embeddings collapse these into entangled regions.

P3: Flow-Level Metadata Loss. Byte-level methods operate solely on packet content, discarding capture-time metadata recorded by traffic analysis tools. Critical temporal features such as inter-arrival times (frame.time_delta) are essential for characterizing flow-level behaviors like burst patterns and request-response latency, yet they exist outside packet bytes and are entirely lost.

1.2. Key Insight: Protocol-Native Modeling

Encryption renders payloads unreadable, forcing classification to rely exclusively on protocol headers and metadata. As shown in Table 1, these elements form inherently tabular data: their dimensions and semantics are fixed by protocol specifications (Gont, 2011; Rescorla, 2018; Eddy, 2022). Prior methods assume flow semantics reside in byte sequences, but they actually reside in protocol-defined tabular structures; this modality mismatch explains why existing approaches fail to learn transferable representations. The core issue is not learning more, but learning right: aligning the learning paradigm with the data's true modality is essential for capturing robust semantics.

Table 1. Network traffic as tabular data: mapping between tabular concepts and traffic elements.

| Tabular Concept | Traffic Element |
|---|---|
| Table | Network flow (5-tuple session) |
| Row | Packet |
| Column | Protocol field |
| Column type | Field semantics |
| Row ordering | Temporal sequence |

To address this, we advocate a protocol-native paradigm that fundamentally reframes how to model encrypted traffic. Just as cloud-native designs systems around cloud infrastructure rather than adapting legacy architectures, protocol-native treats protocol-defined field semantics as immutable priors, where structure is incorporated into model design rather than learned from data. By operating on this intrinsic modality rather than flattened byte sequences, the paradigm ensures model inductive biases align with where flow semantics truly reside.

We instantiate this paradigm as FlowSem-MAE (Flow Semantics Masked Autoencoder), which operates on Flow Semantic Units (FSUs) through predictability-guided filtering (P1), FSU-specific embeddings (P2), and dual-axis attention (P3). These designs empirically validate that protocol-native modeling successfully captures transferable flow semantics. Our contributions are as follows:

1) Inductive Bias Analysis of Limited Transferability. This analysis reveals that the poor transferability of existing methods results from inductive bias mismatch: modeling traffic as byte sequences obscures the semantics embedded in protocol-defined tabular structures. Solving this requires reformulating the task to align with the data's intrinsic tabular modality, rather than incrementally adapting sequence-based architectures.

2) Protocol-Native Paradigm. We introduce a protocol-native paradigm, instantiated as FlowSem-MAE, a tabular pretraining framework that treats traffic flows as tabular data rather than byte sequences.
By aligning the model architecture with protocol principles, it can effectively capture transferable representations robust to scenario shifts.

3) Superior Performance. FlowSem-MAE uniquely excels under both frozen encoder and full fine-tuning evaluation protocols, achieving the best or second-best performance across all metrics. With only 50% labeled data, it outperforms most existing methods trained on full data. We provide the code and model parameters in the supplementary material.

Figure 2. Workflow of FlowSem-MAE. Noisy FSUs refer to the union of random and non-generalizable fields.

2. Related Work

2.1. Statistical and Expert-Based Approaches

Traditional ETC methods rely on handcrafted features designed by network experts. Early approaches extract statistical features such as packet size distributions and flow duration (Finsterbusch et al., 2013). Deep Packet Inspection (DPI) analyzes protocol headers and payloads but becomes ineffective under encryption (Bujlow et al., 2015). These methods suffer from poor scalability: feature engineering requires extensive manual effort and cannot adapt to rapidly evolving applications. These limitations motivate representation learning approaches that automatically extract features from raw traffic data.

2.2. Masked Language Modeling for Traffic

Inspired by BERT's success in NLP (Devlin et al., 2019), recent work treats packets as sentences and bytes as tokens, applying masked language modeling to learn traffic representations (Lin et al., 2022; He et al., 2020; Zhou et al., 2025). ET-BERT (Lin et al., 2022) masks random bytes and reconstructs them from context, assuming that traffic bytes exhibit predictable patterns similar to natural language. TrafficFormer (Zhou et al., 2025) extends this with flow-level pretext tasks. Pcap-Encoder (Zhao et al., 2025) adopts a different strategy, using T5 (Ni et al., 2022) with question-answering pretraining specifically on protocol headers. However, the core assumption that bytes behave like linguistic tokens is flawed: encrypted traffic lacks the contextual regularities of natural language, and byte-level tokenization breaks protocol field boundaries.

2.3. Masked Vision Modeling for Traffic

Recent work converts packet sequences into 2D images and applies masked vision modeling (Hondru et al., 2025). YaTC (Zhao et al., 2023) represents flows as traffic matrices and uses Vision Transformers with patch-based masking. NetMamba (Wang et al., 2024) employs the Mamba architecture for efficient sequence modeling. These methods assume that traffic images exhibit spatial locality similar to natural images. However, unlike images, where neighboring pixels correlate due to object continuity, traffic bytes from different protocol fields may be spatially adjacent but semantically unrelated.

2.4. Rethinking Traffic Representation Learning

Recent work has questioned the effectiveness of these approaches. Zhao et al. (2025) demonstrates that under frozen encoder evaluation, existing self-supervised learning methods exhibit severe performance degradation, and reveals that previously reported high accuracy stems from data leakage rather than learned representations.
Our work goes further by answering why pretrained representations fail to transfer. We identify inductive bias mismatch as the root cause: flow semantics reside in protocol-defined tabular structures, not byte sequences.

3. Method

3.1. Framework Overview

FlowSem-MAE is a protocol-native masked autoencoder that preserves flow semantics by using FSUs as modeling units, directly leveraging the semantics defined by RFCs.

Problem Formulation. Given a traffic flow $F$ consisting of $T$ packets $p_1, p_2, \ldots, p_T$, we extract $N$ FSUs from each packet, forming a tabular flow representation $X = [x_i^t]_{T \times N}$, where $x_i^t$ denotes the $i$-th FSU in packet $t$, as a $T \times N$ multi-row table. Our goal is to learn an encoder $f_\theta: \mathbb{R}^{T \times N} \to \mathbb{R}^d$ that maps traffic flows to discriminative representations for downstream classification tasks.

Architecture Overview. As illustrated in Fig. 2, FlowSem-MAE consists of four components: (1) FSU extraction that parses raw traffic into protocol fields and temporal metadata; (2) predictability-guided filtering that excludes unpredictable FSUs based on protocol priors; (3) FSU-specific embeddings where each FSU type has its own embedding function; and (4) a dual-axis Transformer that models both field relationships and temporal patterns. During pretraining, masked FSUs are reconstructed with:

$$\mathcal{L}_{\text{pretrain}} = \frac{1}{|M_p|} \sum_{(t,i) \in M_p} \ell(\hat{x}_i^t, x_i^t) \quad (1)$$

where $M_p$ denotes the masked positions and $\ell$ is the Mean Squared Error (MSE) loss. For downstream tasks, we freeze the encoder and train only the classification head to evaluate representation quality.

3.2. FSU Extraction and Preprocessing

Flow Semantic Units. Raw bytes ignore the inherent structure defined by protocol specifications, where each header field carries distinct semantics governed by RFCs. To preserve this structure, we extract FSUs from two sources: frame metadata and protocol headers.
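The tabular masked-reconstruction objective (Eq. 1), combined with the dual masking strategy and noisy-FSU filtering described in Section 3.3, can be sketched in a few lines of numpy. This is a minimal illustration, not the released implementation; the mask rates (0.15) and the noisy-column indices are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 10, 41                         # packets per flow, FSUs per packet (paper values)
X = rng.normal(size=(T, N))           # normalized FSU table for one flow

# Hypothetical column indices for random / non-generalizable FSUs (S_r ∪ S_n);
# the paper's concrete field lists are not reproduced here.
noisy_cols = np.array([0, 1, 2])
learnable = np.setdiff1d(np.arange(N), noisy_cols)

# Dual masking (Sec. 3.3): Bernoulli packet-level and field-level masks.
packet_mask = rng.random(T) < 0.15            # m_packet: mask all FSUs at time t
field_mask = np.zeros(N, dtype=bool)
field_mask[learnable] = rng.random(learnable.size) < 0.15   # m_field: mask a column
M = np.zeros((T, N), dtype=bool)
M[packet_mask, :] = True
M[:, field_mask] = True
M[:, noisy_cols] = False              # filtered FSUs never become targets
if not M.any():                       # ensure at least one reconstruction target
    M[0, learnable[0]] = True

def masked_mse(x_hat, x, mask):
    """Eq. (1): MSE averaged over masked positions only."""
    return float(((x_hat - x)[mask] ** 2).mean())

x_hat = X + rng.normal(scale=0.1, size=(T, N))   # stand-in for a decoder's output
loss = masked_mse(x_hat, X, M)
```

Note how filtered columns are removed from the mask before the loss is computed, so random fields can never contribute gradient noise.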
Frame metadata includes temporal information such as inter-arrival time (frame.time_delta). Protocol headers include fields from the IP and transport layers. In total, we extract 41 FSUs per packet after filtering random and non-generalizable fields.

Flow Sampling. Different phases of a network flow exhibit distinct behavioral patterns: connection establishment contains protocol handshake signatures, while termination reveals closing behaviors. To capture both phases, we sample the first 10 packets from each flow, yielding $T = 10$ packets per flow. This strategy captures handshake patterns at flow start. Flows shorter than 10 packets are padded, with a mask indicating valid positions.

Feature Normalization. Protocol fields have heterogeneous value ranges and distributions, requiring type-specific normalization to ensure numerical comparability while preserving semantics. Unlike traditional expert-based approaches that manually design statistical features (e.g., mean packet size, flow duration), our normalization preserves the original semantic meaning of each field. This allows the model to automatically learn discriminative patterns through pretraining rather than relying on predefined features.

3.3. Predictability-Guided Filtering

Byte-level MAE methods treat all bytes as potential reconstruction targets, forcing models to predict inherently random fields alongside meaningful ones. This creates noisy gradients that corrupt the entire representation space. Classifiers can learn to ignore noisy features, but masked autoencoding explicitly supervises masked positions. When these include unpredictable fields, the model is forced to predict random values, creating gradient noise that corrupts learning. We exclude such FSUs based on RFCs to preserve field-level semantics.

Protocol Prior Analysis. We categorize FSUs into three types based on predictability.
Let $N$ denote the number of FSU types, and $S = \{s_1, s_2, \ldots, s_N\}$ denote the set of FSU types, partitioned into: $S_g$ (generalizable), with stable, learnable patterns; $S_r$ (random), generated by cryptographic operations or integrity checks; and $S_n$ (non-generalizable), containing dataset-specific fields.

- Random FSUs are fields that lack learnable patterns due to cryptographic operations, system implementations, or integrity checks (Gont, 2011; Rescorla, 2018). They are excluded from pretraining.
- Non-generalizable FSUs are dataset-specific fields that may cause overfitting, including source and destination IP addresses. These fields are excluded to prevent the model from learning spurious correlations.
- Generalizable FSUs are fields with stable, learnable patterns governed by protocol specifications or reflecting meaningful traffic characteristics. These fields serve as reconstruction targets during pretraining.

Dual Masking Strategy. To capture both temporal dependencies and semantic structure, we employ two complementary masks, $m^t_{\text{packet}}$ and $m^i_{\text{field}}$, each sampled from a Bernoulli distribution. Packet-level masking ($m^t_{\text{packet}} = 1$) masks all FSUs at time $t$, encouraging the model to predict from neighboring packets. Field-level masking ($m^i_{\text{field}} = 1$) masks FSU $i$ across all packets, encouraging inference from other fields within each packet. Random and non-generalizable FSUs ($i \in S_r \cup S_n$) are excluded entirely and never serve as reconstruction targets. This selective mechanism addresses P1 by focusing learning capacity on FSUs with stable, generalizable patterns.

3.4. FSU-Specific Embeddings

Byte-based methods project all bytes through a shared embedding function, conflating semantically distinct fields. Crucially, positional encoding (Vaswani et al., 2017) cannot resolve this issue. While position embeddings distinguish byte locations (e.g., byte 9 vs. byte 33), they cannot capture field semantics: the same value at different positions (e.g., TTL=128 at byte 9, Len=128 at byte 3) should have different meanings, while different values of the same field (e.g., TTL=64 vs. TTL=128) should share semantic structure.

To preserve FSU-specific semantics, we assign each FSU type its own embedding function with independent parameters, inspired by tabular representation learning (Gorishniy et al., 2021). This acknowledges that different protocol fields carry distinct semantics. We define type-specific embedding functions $E_1, \ldots, E_N$, where $E_k: \mathbb{R} \to \mathbb{R}^d$ maps FSU type $k$'s values to $d$-dimensional vectors:

$$E_k(x_i^t) = W_k x_i^t + b_k \quad (2)$$

where $W_k \in \mathbb{R}^{d \times 1}$ and $b_k \in \mathbb{R}^d$ are FSU-specific parameters. The complete embedding combines the value embedding with positional encodings:

$$e_i^t = E_{k_i}(x_i^t) + p_i + q_t \quad (3)$$

where $p_i$ is the FSU position encoding and $q_t$ is the temporal position encoding. This contrasts with byte-level methods that use a single shared projection $E(x) = Wx + b$ for all fields, which maps identical values from different FSU types to identical representations. This design addresses P2 by preserving cross-field-level semantics through maintaining semantic boundaries across protocol fields.

Manifold Preservation. Under the manifold hypothesis (Fefferman et al., 2016), network traffic features lie on low-dimensional manifolds $\{M_k\}_{k=1}^N$, where each FSU type $k$ exhibits distinct geometric structure. For instance, TTL values concentrate on the discrete points {64, 128, 255}, while inter-arrival times follow a continuous distribution. A shared embedding $E: \bigcup_k M_k \to \mathbb{R}^d$ induces manifold entanglement (Brahma et al., 2015), where geometrically distinct structures collapse into overlapping regions. When embedding capacity is insufficient ($d < \sum_k d_k$), this entanglement is unavoidable, causing severe variance imbalance across FSU types.
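The FSU-specific embedding of Eqs. (2)-(3) amounts to one independent affine map per field type plus positional terms. A minimal numpy sketch (dimensions and random parameters are illustrative stand-ins for learned weights, not the paper's released model):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, d = 10, 41, 16                  # packets, FSU types, embedding dim

# Independent affine map per FSU type: E_k(x) = W_k x + b_k  (Eq. 2).
W = rng.normal(size=(N, d))           # row k plays the role of W_k ∈ R^{d×1}
b = rng.normal(size=(N, d))
p = rng.normal(size=(N, d))           # FSU position encoding p_i (assumed learned)
q = rng.normal(size=(T, d))           # temporal position encoding q_t

def embed(X):
    """e_i^t = E_{k_i}(x_i^t) + p_i + q_t  (Eq. 3); X has shape (T, N)."""
    val = X[:, :, None] * W[None, :, :] + b[None, :, :]   # per-type value embedding
    return val + p[None, :, :] + q[:, None, :]            # -> (T, N, d)

E = embed(rng.normal(size=(T, N)))

# Contrast: under one shared projection, Total Len=1500 and Win Size=1500
# collapse to the same vector; per-type maps keep every field type apart.
x_same = np.full(N, 1500.0)
shared = x_same[:, None] * W[0] + b[0]   # every field -> identical embedding
typed = x_same[:, None] * W + b          # one distinct embedding per field type
```

The final contrast mirrors the paper's Total Len=1500 vs. Win Size=1500 example: the shared projection maps both to one point, while type-specific parameters keep their subspaces separated.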
FSU-specific embeddings $\{E_k\}_{k=1}^N$ preserve manifold separation through independent parameterization for each field type. This design empirically achieves near-zero entanglement and eliminates cross-field semantic confusion, enabling the encoder to learn FSU-specific patterns without interference.

3.5. Dual-Axis Transformer Architecture

Standard Transformers process sequences with single-axis attention, treating the input as a flat sequence (Han et al., 2022). However, traffic flows exhibit an inherent two-dimensional structure: temporal patterns across packets and semantic relationships among FSUs within each packet. To capture both dimensions effectively, we employ dual-axis attention.

Dual-Axis Attention. FlowSem-MAE employs dual-axis attention on the representation $E \in \mathbb{R}^{T \times N \times d}$. Time-axis attention models dependencies across the $T$ packets for each FSU position, capturing how individual fields evolve over the flow's lifetime:

$$H_{\text{time}} = \text{MultiheadAttn}(Q_{\text{time}}, K_{\text{time}}, V_{\text{time}}) \quad (4)$$

FSU-axis attention models dependencies across the $N$ FSUs within each packet, capturing inter-field relationships:

$$H_{\text{fsu}} = \text{MultiheadAttn}(Q_{\text{fsu}}, K_{\text{fsu}}, V_{\text{fsu}}) \quad (5)$$

While FSU-axis attention performs standard intra-packet modeling, time-axis attention addresses P3 by preserving flow-level semantics through explicitly capturing inter-packet temporal dependencies over the capture-time metadata (e.g., frame.time_delta) included in FSUs, enabling the model to learn flow-level behavioral patterns such as request-response latency and burst characteristics. Note that TCP header timestamps (TSval/TSecr) cannot substitute for capture-time metadata, as they reflect sender clocks rather than arrival times.

Encoder Architecture.
The encoder consists of $L$ transformer blocks, each applying time-axis attention, FSU-axis attention, and feed-forward networks with layer normalization and residual connections:

$$H^\ell_{\text{time}} = \text{TimeAttn}(\text{LN}(H^{\ell-1})) + H^{\ell-1} \quad (6)$$
$$\tilde{H}^\ell = \text{FFN}(\text{LN}(H^\ell_{\text{time}})) + H^\ell_{\text{time}} \quad (7)$$
$$H^\ell_{\text{fsu}} = \text{FSUAttn}(\text{LN}(\tilde{H}^\ell)) + \tilde{H}^\ell \quad (8)$$
$$H^\ell = \text{FFN}(\text{LN}(H^\ell_{\text{fsu}})) + H^\ell_{\text{fsu}} \quad (9)$$

For downstream classification, we apply mean pooling over the time and FSU dimensions to obtain a flow representation $z \in \mathbb{R}^d$, followed by an MLP classification head.

4. Experiments

4.1. Experimental Setup

Datasets. For pretraining, we use MAWI traffic traces from January 1, 2025 (Cho et al., 2000) (137M packets, 9.6 GB) with no overlap with the evaluation datasets. We evaluate on ISCX-VPN (Gil et al., 2016) (16 application classes) and CSTNET-TLS 1.3 (i.e., TLS-120) (Lin et al., 2022) (120 website classes with SNI removed, encrypted by TLS 1.3).

Data Preparation. Following Zhao et al. (2025), we remove extraneous protocols (ARP, DHCP, etc.). Due to the high IP homogeneity within application labels, we anonymize IP addresses to prevent spurious correlations for all methods.

Baselines. We compare against six pretrained models spanning diverse architectures: ET-BERT (Lin et al., 2022) and Pcap-Encoder (Zhao et al., 2025) are byte-based methods applying BERT-style pretraining; YaTC (Zhao et al., 2023) and NetMamba (Wang et al., 2024) are vision-based methods using masked image modeling; TrafficFormer (Zhou et al., 2025) and netFound (Guthula et al., 2023) are hybrid methods incorporating flow-level pretext tasks.

Table 2. Performance comparison with frozen encoders. Best results in bold, second best underlined.

| Model | ISCX-VPN Acc | ISCX-VPN F1 | TLS-120 Acc | TLS-120 F1 |
|---|---|---|---|---|
| Pcap-Encoder | 16.1 | 12.1 | 7.1 | 2.9 |
| ET-BERT | 22.3 | 12.8 | 9.1 | 4.6 |
| NetMamba | 15.6 | 13.6 | 16.9 | 11.3 |
| netFound | 22.9 | 18.8 | 28.0 | 22.9 |
| YaTC | 37.5 | 34.6 | 34.1 | 27.6 |
| TrafficFormer | 39.2 | 36.9 | 46.3 | 42.3 |
| FlowSem-MAE | 51.1 | 42.7 | 55.2 | 51.3 |
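The dual-axis encoder block of Section 3.5 (Eqs. 6-9) alternates attention over the packet (time) axis and the field (FSU) axis. A single-head numpy sketch under simplifying assumptions (one shared set of attention weights for both axes, no multi-head split, untrained random parameters):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(h, eps=1e-5):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def self_attn(h, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over h's second-to-last axis."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    a = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1]))
    return a @ v

def dual_axis_block(H, params):
    """One encoder block (Eqs. 6-9): time-axis attn, FFN, FSU-axis attn, FFN."""
    Wq, Wk, Wv, W1, W2 = params
    # Time axis: attend across the T packets for each FSU column.
    h = H.swapaxes(0, 1)                                    # (N, T, d)
    h = h + self_attn(layer_norm(h), Wq, Wk, Wv)            # Eq. (6) + residual
    H = h.swapaxes(0, 1)
    H = H + np.maximum(layer_norm(H) @ W1, 0) @ W2          # Eq. (7): FFN + residual
    # FSU axis: attend across the N fields within each packet.
    H = H + self_attn(layer_norm(H), Wq, Wk, Wv)            # Eq. (8)
    H = H + np.maximum(layer_norm(H) @ W1, 0) @ W2          # Eq. (9)
    return H

rng = np.random.default_rng(2)
T, N, d = 10, 41, 16
H = rng.normal(size=(T, N, d))
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)] + \
         [rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1]
out = dual_axis_block(H, params)
z = out.mean(axis=(0, 1))     # mean-pool over time and FSU axes -> flow vector
```

Transposing the tensor before attention is what switches the axis being attended over; in the paper each axis and each FFN would have its own learned parameters.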
Flow-based encoders process 10 packets jointly; packet-based encoders use majority voting.

Evaluation. We use frozen encoder evaluation (Zhao et al., 2025): only the classification head is trained while encoder weights remain fixed. This stringent protocol isolates the contribution of pretraining from fine-tuning, testing whether pretraining truly learns transferable features.

4.2. Main Results

Table 2 presents the frozen encoder performance. FlowSem-MAE significantly outperforms all baselines on both datasets, achieving 51.1% accuracy and 42.7% Macro-F1 on ISCX-VPN, surpassing TrafficFormer by 11.9% and 5.8% respectively. On TLS-120, FlowSem-MAE achieves 55.2% accuracy and 51.3% Macro-F1, outperforming TrafficFormer by 8.9% and 9.0%. These improvements validate that preserving flow semantics through protocol-native modeling produces genuinely transferable representations.

Byte-based methods (Pcap-Encoder, ET-BERT) perform poorly because they attempt to learn from encrypted payloads with no learnable patterns. Vision-based methods (YaTC, NetMamba) achieve moderate results, but patch-based tokenization still conflates semantically distinct protocol fields. TrafficFormer emerges as the strongest baseline due to its flow-level pretext tasks, yet it still falls short because it does not address field-level semantics. The discrepancy between our results and those in Zhao et al. (2025) is due to IP anonymization.

Model Efficiency. Fig. 3 illustrates the relationship between model size and performance. Larger models do not yield better representations: netFound (2.85B parameters, 57× larger than ours) achieves only 18.8% and 22.9% F1; Pcap-Encoder (850M) and ET-BERT (682M) perform poorly despite their substantial sizes. FlowSem-MAE achieves the best performance with only 50.25M parameters, demonstrating that aligning pretraining with traffic's tabular structure matters more than model scale.

Figure 3. Model size vs. performance (Macro-F1) on (a) ISCX-VPN and (b) TLS-120, under frozen and unfrozen evaluation. FlowSem-MAE achieves the best performance with only 50.25M parameters, significantly outperforming larger models.

4.3. Transferability Analysis

To validate that FlowSem-MAE learns genuinely transferable representations, we compare frozen and unfrozen (full fine-tuning) performance in Table 3. A well-pretrained model should excel under both protocols: frozen performance measures representation quality in isolation, while unfrozen performance measures the foundation it provides for task-specific adaptation.

Table 3. Frozen (Fro.) vs. Unfrozen (Unfro.) performance comparison (Macro-F1).

| Model         | ISCX-VPN Fro. | ISCX-VPN Unfro. | TLS-120 Fro. | TLS-120 Unfro. |
|---------------|---------------|-----------------|--------------|----------------|
| ET-BERT       | 12.8          | 54.3            | 4.6          | 51.5           |
| NetMamba      | 13.6          | 48.6            | 11.3         | 76.0           |
| netFound      | 18.8          | 52.4            | 22.9         | 89.7           |
| YaTC          | 34.6          | 54.8            | 27.6         | 74.8           |
| TrafficFormer | 36.9          | 49.2            | 42.3         | 69.2           |
| FlowSem-MAE   | 42.7          | 68.5            | 51.3         | 83.8           |

FlowSem-MAE uniquely excels under both evaluation protocols. Our method achieves the best frozen performance on both datasets (42.7% and 51.3% F1) and the best or second-best unfrozen performance (68.5% and 83.8% F1). This dual excellence is unique among all methods and demonstrates that FSU-based pretraining learns representations that are both independently discriminative and amenable to further adaptation.

Baselines fall into two failure modes. (1) Collapse when frozen: ET-BERT and netFound achieve reasonable unfrozen performance but collapse under frozen evaluation (4.6% and 22.9% F1 on TLS-120), indicating their pretraining contributes minimally; performance gains come entirely from fine-tuning on labeled data. (2) Plateau when unfrozen: TrafficFormer shows the second-best frozen performance but fails to improve proportionally when unfrozen (42.3% to 69.2% on TLS-120), suggesting its representations are less adaptable. FlowSem-MAE breaks this trade-off: strong frozen performance (51.3%) translates into strong unfrozen performance (83.8%), confirming that FSU-based pretraining provides both a solid standalone representation and an effective initialization for fine-tuning.

Model size efficiency. While netFound requires 2.85B parameters to achieve 89.7% unfrozen F1 on TLS-120, its frozen F1 is only 22.9%. FlowSem-MAE achieves 83.8% unfrozen F1 and 51.3% frozen F1 with 57× fewer parameters. The 5.9% unfrozen gap is minor compared to the 28.4% frozen improvement, validating that matching masked units to data structure matters more than model scale.

4.4. Ablation Study

To validate the contribution of each component, we conduct ablation experiments (Table 4).

Table 4. Ablation study on FlowSem-MAE components.

| Variant                | ISCX-VPN Acc | ISCX-VPN F1 | TLS-120 Acc | TLS-120 F1 |
|------------------------|--------------|-------------|-------------|------------|
| FlowSem-MAE (full)     | 51.1         | 42.7        | 55.2        | 51.3       |
| w/o Pred-Guided Filter | 27.9         | 17.3        | 34.8        | 29.8       |
| w/o FSU-Spec Embed     | 40.8         | 16.5        | 25.9        | 21.3       |
| w/o Temporal Metadata  | 45.3         | 30.5        | 44.7        | 39.5       |

Figure 4. Effect of predictability-guided filtering on reconstruction loss (per-FSU log10 MSE, unstable vs. stable FSUs, with and without filtering). Without predictability-guided filtering, random fields (red) exhibit extremely high loss (~10^9) and degrade learning of generalizable fields (green).

Impact of Predictability-Guided Filtering (P1). Removing predictability-guided filtering causes a 23.2% and 20.4% accuracy drop on ISCX-VPN and TLS-120 respectively. Fig.
4 reveals the mechanism: forcing the model to reconstruct random fields (checksums, IDs) results in extremely high loss (~10^9) and degrades reconstruction quality across all generalizable fields, confirming that random fields create noisy gradients that corrupt the entire representation space.

Impact of FSU-Specific Embeddings (P2). When we replace FSU-specific embeddings with a single shared linear projection, the severe degradation confirms that shared embeddings cause cross-field semantic pollution; fine-grained field semantics are crucial for distinguishing TLS-encrypted websites.

Impact of Temporal Metadata (P3). Removing temporal information reduces accuracy by 5.8% and 10.5% on ISCX-VPN and TLS-120, with Macro-F1 drops of 12.2% and 11.8% respectively. This demonstrates that inter-packet temporal patterns are essential for flow-level classification.

4.5. Label Efficiency

To evaluate robustness under limited labeled data, we vary the labeled data ratio from 10% to 100% (Fig. 5). FlowSem-MAE demonstrates strong performance even with scarce labels: 41.3% accuracy on ISCX-VPN with only 10% of the data (80.8% of full performance). Notably, with 50% labeled data, FlowSem-MAE achieves performance comparable to TrafficFormer with full data, demonstrating that pretraining learns transferable representations that substantially reduce labeling requirements.

Figure 5. Performance (accuracy and Macro-F1) under different labeled data ratios on (a) ISCX-VPN and (b) TLS-120.

4.6. Embedding Space Analysis

To validate the manifold preservation property of FSU-specific embeddings, we analyze the embedding space of our approach against shared embeddings (Fig. 6).

Results. FSU-specific embeddings exhibit two desirable properties.
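As a concrete reference for the two diagnostics used in this analysis (inter-FSU centroid distance and intra-FSU variance), here is a minimal NumPy sketch. The field names and synthetic embeddings are illustrative placeholders, not the paper's data or exact procedure:

```python
import numpy as np

# Hypothetical per-FSU embedding collections: for each field, an
# (n_samples, d) array of embeddings gathered from a forward pass.
rng = np.random.default_rng(0)
embeddings = {
    "ip.ttl": rng.normal(0.0, 0.02, size=(256, 64)),
    "direction": rng.normal(0.5, 0.02, size=(256, 64)),
    "tcp.flags.syn": rng.normal(1.0, 0.02, size=(256, 64)),
}

# Centroid of each FSU cluster in embedding space.
centroids = {name: e.mean(axis=0) for name, e in embeddings.items()}

# Inter-FSU centroid distance: pairwise Euclidean distance between centroids.
names = sorted(centroids)
dist = {
    (a, b): float(np.linalg.norm(centroids[a] - centroids[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

# Intra-FSU variance: mean squared deviation of samples from their own
# centroid, averaged over dimensions (low values = compact cluster).
intra_var = {
    name: float(((e - centroids[name]) ** 2).mean())
    for name, e in embeddings.items()
}
```

Uniformly moderate `dist` values with uniformly small `intra_var` correspond to the well-separated, compact clusters described below; near-zero distances between many pairs would signal the entanglement seen with shared embeddings.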
First, inter-FSU centroid distances are uniformly distributed (0.4–0.8 for most pairs), indicating appropriate separation without extreme clustering or dispersion. Second, intra-FSU variances are uniformly low (~0.0007), showing that each FSU forms a compact cluster through its independent embedding function.

In contrast, shared embeddings suffer from severe manifold entanglement. The distance matrix exhibits a block structure: most FSU pairs show near-zero distances (<0.25), collapsing into overlapping regions, while a few FSUs are extremely distant (>1.5). This bimodal pattern reveals a "rich-get-richer" phenomenon: FSUs with stronger gradients cluster with well-learned representations, while low-gradient FSUs remain near random initialization. More critically, intra-FSU variances differ by 3000×, showing that shared embeddings fail to provide consistent representation quality. FSU-specific embeddings resolve both issues through independent parameterization for each field type.

Figure 6. Embedding space analysis. Top: inter-FSU centroid distances; bottom: intra-FSU variance. Left: shared embeddings; right: FSU-specific embeddings. FSU-specific embeddings achieve uniform separation (0.4–0.8) and consistent compactness (~0.0007), while shared embeddings show extreme distances (0–1.75) and a 3000× variance disparity.

4.7. FSU Importance Analysis

A key advantage of FSU-based modeling is interpretability. We measure FSU importance via gradient-based attribution and compare it with XGBoost feature importance (Fig. 7).
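The rank-agreement check behind this comparison can be sketched as follows. Note the attribution recipe here (averaging the absolute loss gradient per FSU) is one common choice, not necessarily the paper's exact formulation, and all scores below are made-up illustrative values:

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman correlation = Pearson correlation of the rank vectors.
    # No tie handling, which is fine here since all scores are distinct.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical per-FSU importance scores (illustrative only).
fsus = ["direction", "ack", "df", "len", "syn", "ttl"]
model_imp = np.array([0.20, 0.15, 0.12, 0.05, 0.09, 0.03])  # e.g. mean |dLoss/dEmbedding|
xgb_imp   = np.array([0.10, 0.14, 0.11, 0.20, 0.02, 0.04])  # e.g. XGBoost gain

rho = spearman_rho(model_imp, xgb_imp)
# A single large rank divergence (here "len": low for the model, top for
# XGBoost) pulls rho well below 1 even when most fields agree.
```

This is exactly why a moderate rather than perfect correlation is the expected outcome when one method scores fields independently and the other models interactions.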
The results show a moderate-to-strong positive correlation (Spearman ρ = 0.536 on ISCX-VPN, ρ = 0.696 on TLS-120). Top-ranked FSUs differ between datasets: direction, ack, and df dominate on ISCX-VPN, reflecting that VPN-encrypted applications are distinguished by flow directionality and TCP flags; df ranks highest on TLS-120, indicating that website fingerprinting relies more on protocol-level signatures.

The moderate rather than perfect correlation is expected: XGBoost operates on individual values independently, while FlowSem-MAE captures interactions via dual-axis attention. Notable divergences support this: len ranks 9th for FlowSem-MAE but highest for XGBoost on ISCX-VPN, suggesting packet length is individually discriminative but our model discovers richer patterns; syn ranks 5th vs. 15th on TLS-120, indicating that connection establishment becomes discriminative only when modeled across sequences. The consistency validates meaningful representations; the divergence demonstrates capacity to model higher-order patterns invisible to feature-independent methods.

Figure 7. FSU importance comparing FlowSem-MAE with XGBoost. The moderate-to-strong Spearman correlation (ρ = 0.536 on ISCX-VPN, ρ = 0.696 on TLS-120) indicates FlowSem-MAE discovers similar discriminative features while capturing additional interaction patterns.

5. Conclusion

Implications. We identify inductive bias mismatch as the root cause of poor transferability in traffic classification. We propose a protocol-native paradigm that aligns with the intrinsic tabular modality of network data, instantiated by FlowSem-MAE. Leveraging Flow Semantic Units and dual-axis attention, our approach demonstrates that structural semantic alignment outperforms brute-force model scaling, even with limited labeled data. We establish a foundation for semantically grounded, protocol-native traffic analysis.

Limitations. While effective, accuracy could be further improved with larger pretraining datasets. Additionally, the manual field categorization used for predictability-guided filtering could be automated via information-theoretic methods.

References

Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., and Xu, Y. Autoencoders and their applications in machine learning: a survey. Artificial Intelligence Review, 57(2):28, 2024.

Brahma, P. P., Wu, D., and She, Y. Why deep learning works: A manifold disentanglement perspective. IEEE Transactions on Neural Networks and Learning Systems, 27(10):1997–2008, 2015.

Bujlow, T., Carela-Español, V., and Barlet-Ros, P. Independent comparison of popular DPI tools for traffic classification. Computer Networks, 76:75–89, 2015. ISSN 1389-1286. doi: 10.1016/j.comnet.2014.11.001.
URL https://www.sciencedirect.com/science/article/pii/S1389128614003909.

Cho, K., Mitsuya, K., and Kato, A. Traffic data repository at the WIDE project. In 2000 USENIX Annual Technical Conference (USENIX ATC 00), 2000.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423/.

Eddy, W. Transmission Control Protocol (TCP). RFC 9293, August 2022. URL https://www.rfc-editor.org/info/rfc9293.

Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.

Finsterbusch, M., Richter, C., Rocha, E., Muller, J.-A., and Hanssgen, K. A survey of payload-based traffic classification approaches. IEEE Communications Surveys & Tutorials, 16(2):1135–1156, 2013.

Gil, G. D., Lashkari, A. H., Mamun, M., and Ghorbani, A. A. Characterization of encrypted and VPN traffic using time-related features. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), p. 407–414. SciTePress, Setúbal, Portugal, 2016.

Gont, F. Security Assessment of the Internet Protocol Version 4. RFC 6274, July 2011. URL https://www.rfc-editor.org/info/rfc6274.

Google. HTTPS encryption on the web, 2025. URL https://transparencyreport.google.com/https/overview.

Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. Revisiting deep learning models for tabular data. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, p.
18932–18943. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/9d86d83f925f2149e9edb0ac3b49229c-Paper.pdf.

Guthula, S., Beltiukov, R., Battula, N., Guo, W., and Gupta, A. netFound: Foundation model for network security. arXiv preprint arXiv:2310.17025, 2023.

Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):87–110, 2022.

He, H. Y., Guo Yang, Z., and Chen, X. N. PERT: Payload encoding representation from transformer for encrypted traffic classification. In 2020 ITU Kaleidoscope: Industry-Driven Digital Transformation (ITU K), p. 1–8, 2020. doi: 10.23919/ITUK50268.2020.9303204.

Hondru, V., Croitoru, F., Minaee, S., Ionescu, R. T., and Sebe, N. Masked image modeling: A survey. International Journal of Computer Vision, 133(10):7154–7200, 2025. doi: 10.1007/s11263-025-02524-1. URL https://doi.org/10.1007/s11263-025-02524-1.

Kienitz, D., Komendantskaya, E., and Lones, M. The effect of manifold entanglement and intrinsic dimensionality on learning. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7):7160–7167, June 2022. doi: 10.1609/aaai.v36i7.20676. URL https://ojs.aaai.org/index.php/AAAI/article/view/20676.

Lin, X., Xiong, G., Gou, G., Li, Z., Shi, J., and Yu, J. ET-BERT: A contextualized datagram representation with pre-training transformers for encrypted traffic classification. In Proceedings of the ACM Web Conference 2022, WWW '22, p. 633–642, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450390965. doi: 10.1145/3485447.3512217. URL https://doi.org/10.1145/3485447.3512217.

Ni, J., Hernandez Abrego, G., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Muresan, S., Nakov, P., and Villavicencio, A.
(eds.), Findings of the Association for Computational Linguistics: ACL 2022, p. 1864–1874, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.146. URL https://aclanthology.org/2022.findings-acl.146/.

Rescorla, E. The Transport Layer Security (TLS) Protocol Version 1.3. RFC 8446, August 2018. URL https://www.rfc-editor.org/info/rfc8446.

Salazar, J., Liang, D., Nguyen, T. Q., and Kirchhoff, K. Masked language model scoring. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, p. 2699–2712, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.240. URL https://aclanthology.org/2020.acl-main.240/.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, T., Xie, X., Wang, W., Wang, C., Zhao, Y., and Cui, Y. NetMamba: Efficient network traffic classification via pre-training unidirectional mamba. In 2024 IEEE 32nd International Conference on Network Protocols (ICNP), p. 1–11. IEEE, 2024.

Yin, P., Neubig, G., Yih, W.-t., and Riedel, S. TaBERT: Pretraining for joint understanding of textual and tabular data. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, p. 8413–8426, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.745. URL https://aclanthology.org/2020.acl-main.745/.

Zhao, R., Zhan, M., Deng, X., Wang, Y., Wang, Y., Gui, G., and Xue, Z. Yet another traffic classifier: A masked autoencoder based traffic transformer with multi-level flow representation.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, p. 5420–5427, 2023.

Zhao, Y., Dettori, G., Boffa, M., Vassio, L., and Mellia, M. The sweet danger of sugar: Debunking representation learning for encrypted traffic classification. In Proceedings of the ACM SIGCOMM 2025 Conference, p. 296–310, 2025.

Zhou, G., Guo, X., Liu, Z., Li, T., Li, Q., and Xu, K. TrafficFormer: An efficient pre-trained model for traffic data. In 2025 IEEE Symposium on Security and Privacy (SP), p. 1844–1860. IEEE, 2025.