
Paper deep dive

A Systematic Evaluation Protocol of Graph-Derived Signals for Tabular Machine Learning

Mario Heidrich, Jeffrey Heidemann, Rüdiger Buchkremer, Gonzalo Wandosell Fernández de Bobadilla

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 105

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/22/2026, 5:04:19 AM

Summary

The paper introduces a taxonomy-driven, reproducible evaluation protocol for assessing graph-derived signals in tabular machine learning. It addresses the lack of statistical reliability in existing benchmarks by incorporating multi-seed evaluation, formal significance testing (McNemar's test), and robustness analysis under graph perturbations. The protocol is demonstrated through a case study on cryptocurrency fraud detection, identifying which signal categories provide consistent performance gains and structural insights.

Entities (5)

Elliptic Bitcoin Dataset · dataset · 100%
Mario Heidrich · researcher · 100%
McNemar's test · statistical-method · 100%
Graph-derived signals · methodology · 98%
Tabular Machine Learning · field-of-study · 95%

Relation Signals (3)

Evaluation Protocol applied to Elliptic Bitcoin Dataset

confidence 98% · We demonstrate the protocol through an extensive case study on a large-scale, imbalanced cryptocurrency fraud detection dataset... the Elliptic Bitcoin Dataset

Evaluation Protocol assesses Graph-derived signals

confidence 95% · We propose a unified and reproducible evaluation protocol to systematically assess which categories of graph-derived signals yield statistically significant and robust performance improvements.

McNemar's test validates Performance gains

confidence 90% · To assess whether observed performance differences are statistically significant, we apply McNemar’s test

Cypher Suggestions (2)

Map the relationship between evaluation protocols and datasets. · confidence 95% · unvalidated

MATCH (e:EvaluationProtocol)-[:APPLIED_TO]->(d:Dataset) RETURN e.name, d.name

Find all graph-derived signal categories evaluated in the study. · confidence 90% · unvalidated

MATCH (s:SignalCategory)-[:EVALUATED_IN]->(p:Paper {id: 'e4361f45-d065-4a08-9278-f32a73d646b2'}) RETURN s.name

Abstract

While graph-derived signals are widely used in tabular learning, existing studies typically rely on limited experimental setups and average performance comparisons, leaving the statistical reliability and robustness of observed gains largely unexplored. Consequently, it remains unclear which signals provide consistent and robust improvements. This paper presents a taxonomy-driven empirical analysis of graph-derived signals for tabular machine learning. We propose a unified and reproducible evaluation protocol to systematically assess which categories of graph-derived signals yield statistically significant and robust performance improvements. The protocol provides an extensible setup for the controlled integration of diverse graph-derived signals into tabular learning pipelines. To ensure a fair and rigorous comparison, it incorporates automated hyperparameter optimization, multi-seed statistical evaluation, formal significance testing, and robustness analysis under graph perturbations. We demonstrate the protocol through an extensive case study on a large-scale, imbalanced cryptocurrency fraud detection dataset. The analysis identifies signal categories providing consistently reliable performance gains and offers interpretable insights into which graph-derived signals indicate fraud-discriminative structural patterns. Furthermore, robustness analyses reveal pronounced differences in how various signals handle missing or corrupted relational data. These findings demonstrate practical utility for fraud detection and illustrate how the proposed taxonomy-driven evaluation protocol can be applied in other application domains.

Tags

ai-safety (imported, 100%)
csai (suggested, 92%)
preprint (suggested, 88%)
safety-evaluation (suggested, 80%)

Links

Open PDF directly →

Full Text

104,773 characters extracted from source content.


A Systematic Evaluation Protocol of Graph-Derived Signals for Tabular Machine Learning

Mario Heidrich 1,2, Jeffrey Heidemann 2, Rüdiger Buchkremer 2, Gonzalo Wandosell Fernández de Bobadilla 1

1 Universidad Católica San Antonio de Murcia (UCAM), Murcia, Spain
2 Institute of IT Management and Digitization Research (IFID), FOM University of Applied Sciences in Economics and Management, Düsseldorf, Germany

Correspondence: mheidrich@alu.ucam.edu

Abstract

While graph-derived signals are widely used in tabular learning, existing studies typically rely on limited experimental setups and average performance comparisons, leaving the statistical reliability and robustness of observed gains largely unexplored. Consequently, it remains unclear which signals provide consistent and robust improvements. This paper presents a taxonomy-driven empirical analysis of graph-derived signals for tabular machine learning. We propose a unified and reproducible evaluation protocol to systematically assess which categories of graph-derived signals yield statistically significant and robust performance improvements. The protocol provides an extensible setup for the controlled integration of diverse graph-derived signals into tabular learning pipelines. To ensure a fair and rigorous comparison, it incorporates automated hyperparameter optimization, multi-seed statistical evaluation, formal significance testing, and robustness analysis under graph perturbations. We demonstrate the protocol through an extensive case study on a large-scale, imbalanced cryptocurrency fraud detection dataset. The analysis identifies signal categories providing consistently reliable performance gains and offers interpretable insights into which graph-derived signals indicate fraud-discriminative structural patterns. Furthermore, robustness analyses reveal pronounced differences in how various signals handle missing or corrupted relational data.
These findings demonstrate practical utility for fraud detection and illustrate how the proposed taxonomy-driven evaluation protocol can be applied in other application domains.

Keywords: graph-derived signals; tabular machine learning; graph signal taxonomy; statistical significance; robustness analysis; fraud detection

arXiv:2603.13998v1 [cs.AI] 14 Mar 2026

1 Introduction

Graph-structured data provide rich contextual information that is typically not captured by classical tabular machine learning pipelines. Such data arise naturally in many real-world applications, including fraud detection, social networks, biological networks, and knowledge graphs. While classical tabular machine learning models are typically developed under an independent and identically distributed (i.i.d.) assumption and are widely used in practice, violations of this assumption can distort analysis and modeling when relational dependencies are ignored [6]. Consequently, modern tabular pipelines rely on feature engineering to compensate for missing relational structure [18]. For graph-structured data, graph-derived signals are quantitative representations extracted from graph structure and provide a principled way to model these dependencies and enrich machine learning pipelines with relational information [12]. This shift toward graph-derived signals places them squarely within modern tabular learning pipelines, where predictive performance is increasingly driven by representation quality rather than model complexity. Recent benchmarks in tabular machine learning underscore that predictive performance is often governed more by a model's inductive biases and the quality of its input features than by architectural complexity alone [10]. A rigorous data-centric evaluation by Bartelt et al.
[3] reinforces this perspective, demonstrating that after expert, dataset-specific feature engineering, performance gaps between advanced models shrink considerably, shifting the critical focus from model selection to representation quality. Despite this critical insight, the evaluation and application of graph-derived signals often lack a similarly principled, data-centric foundation. Their empirical benefit is frequently assessed through limited, ad hoc experimental setups that report average performance improvements over a narrow set of configurations. Such narrowly defined benchmarks risk drawing biased conclusions, as model rankings and reported gains can be highly sensitive to specific preprocessing choices and evaluation protocols [5]. This practice leaves key questions unanswered: Are reported gains statistically reliable and reproducible across random seeds? Are they robust to variations in graph structure or sampling? Furthermore, given the vast and heterogeneous space of possible graph-derived signals, it is often unclear a priori which graph-derived signals are most informative for a specific downstream application task. Consequently, a shift toward evaluation protocols that explicitly assess robustness and statistical reliability is crucial for a meaningful comparison of these learning approaches [3]. This lack of systematic evaluation complicates principled model design and poses challenges for drawing reliable conclusions in the field. To enable reliable progress, we argue that systematic analyses must (i) compare different types of graph-derived signals under a unified evaluation protocol, (ii) explicitly account for randomness in data splits and model training, and (iii) assess whether observed improvements are statistically significant and robust to structural changes in the underlying graph rather than incidental.
In the absence of such analyses, practitioners face uncertainty when selecting graph-derived signals from a large and heterogeneous design space for downstream tabular learning tasks. To navigate this vast design space, a taxonomy-driven perspective is essential. It enables graph-derived signals to be grouped according to the type of graph information they encode. Such an organization supports principled comparison, improves interpretability of empirical results, and facilitates task-specific signal selection across a large design space. To address these challenges, we conduct a systematic, taxonomy-driven analysis of graph-derived signals in tabular machine learning. Specifically, we evaluate representative graph signal types across multiple information categories using a reproducible experimental protocol that incorporates multi-seed evaluation, robustness analysis under graph perturbations, and formal hypothesis testing. Rather than ranking individual methods, the objective of this study is to assess the statistical relevance and robustness of different classes of graph-derived signals with respect to specific downstream application tasks under controlled conditions. The core contribution of this work is a statistically grounded evaluation protocol that enables reliable, interpretable, and robustness-aware comparison of graph-derived signals for tabular learning. We demonstrate the proposed evaluation protocol through a practical case study in cryptocurrency transaction fraud detection. While the quantitative findings are necessarily specific to this domain, this application serves as a representative and challenging benchmark to illustrate the general applicability of the proposed evaluation protocol. In this fraud detection setting, relational dependencies between entities encode information that is not accessible to transaction-level features alone, as illicit activity often propagates through complex chains of interactions.
By instantiating the methodology on this high-stakes task, we demonstrate how statistically grounded insights can be obtained within a concrete graph-augmented tabular learning setting. Our evaluation is conducted on the Elliptic Bitcoin Dataset [23], a commonly used large-scale, temporal, and highly imbalanced benchmark that reflects key challenges of real-world fraud detection. Within this setting, we analyze different categories of graph-derived signals, including proximity-based signals that capture local neighborhood information and structural signals that encode broader role-based patterns, to assess when and how relational information contributes measurable value beyond tabular baselines that do not incorporate graph-derived signals. Beyond serving as a methodological demonstration, this case study enables a statistically grounded assessment of the practical utility of graph-derived signals in a security-critical application. By combining multi-seed evaluation with formal significance testing, we identify which categories of graph-derived signals yield consistent and statistically reliable performance gains in this setting. Moreover, the taxonomy-driven analysis facilitates interpretation by linking different signal categories to characteristic fraud-related patterns, such as the propagation of illicit activity through local neighborhoods. Finally, robustness experiments under controlled graph perturbations reveal substantial differences in how signal categories degrade under missing or corrupted relational data, highlighting robustness as a key criterion for practical signal selection.
Through this case study, we address the following research questions to investigate the utility, robustness, and interpretability of graph-derived signals within a representative tabular learning setting:

RQ1 (Utility): Do graph-derived signals provide statistically significant performance improvements over tabular baselines that do not incorporate explicit graph-derived information, and does this effect vary across different categories of graph information?

RQ2 (Robustness): How robust are the performance gains from different graph signal categories to controlled degradations (e.g., random edge removal) in the underlying graph structure?

RQ3 (Taxonomy Guidance): To what extent does a taxonomy-driven organization of graph signals facilitate interpretable, category-level insights that can inform signal selection in graph-augmented tabular learning?

2 Methodology: Evaluation Protocol for Graph Signal Analysis

We propose a reproducible and configurable evaluation protocol designed to assess the statistical impact of graph-derived signals when integrated into classical tabular machine learning pipelines, with a design that supports reuse across transactional and non-transactional graph learning tasks. The proposed evaluation protocol is task-agnostic and enables systematic comparisons between baseline models and models augmented with graph-derived signals. The evaluation protocol supports the flexible selection of subsets of graph-derived signals drawn from a predefined taxonomy of graph information types. In our experiments, we consider a diverse set of graph-derived signals, including features and embeddings spanning centrality, cohesion, community, proximity, spectral, structural role, and neighborhood-based information (see Section 4). To ensure statistically reliable conclusions, the evaluation protocol explicitly accounts for randomness in both data splitting and model training. Multiple random seeds are applied consistently across baseline and graph-augmented configurations, enabling run-level comparisons. The evaluation protocol further supports experiments on multiple graph variants derived from the same underlying entities, such as perturbed graphs obtained through edge removal, allowing robustness analyses under structural changes. In addition to standard multi-seed evaluation, the evaluation protocol explicitly supports robustness analysis under controlled structural perturbations of the input graph. In this work, robustness is assessed by systematically removing a predefined fraction of edges prior to graph signal generation, allowing the sensitivity of different graph signal categories to degraded graph structure to be analyzed under otherwise identical experimental conditions. For fair model comparison, all classifiers are independently optimized using automated hyperparameter optimization based on Bayesian optimization using the Tree-structured Parzen Estimator [4]. Optimization is performed using cross-entropy loss, which is well suited for highly imbalanced classification settings. Depending on the classifier, additional mechanisms such as class weighting, resampling strategies, feature normalization, embedding-specific dimensionality reduction, architectural choices (where applicable), and model-specific regularization are considered during optimization. Importantly, all preprocessing steps are optimized in a model-aware manner and fitted exclusively on the training data to avoid information leakage. All optimized models are evaluated on identical train–validation–test splits, ensuring comparability across feature configurations. Detailed results are stored in a structured format that enables automated downstream analysis.
In particular, the evaluation protocol supports: (i) aggregation of performance metrics across random seeds using trimmed aggregation to reduce the influence of outlier runs, (ii) stability analysis via standard deviation estimates, (iii) statistical hypothesis testing to assess whether observed performance differences between baseline and graph-augmented models are statistically significant, and (iv) the computation and storage of model-specific feature importance scores for post-hoc analysis. Graph-derived signals are concatenated with task-specific base features, resulting in an augmented tabular representation that can be processed by standard classifiers.

Trimmed Performance Aggregation

To reduce the influence of outlier runs and obtain robust performance estimates, we aggregate F1-scores across random seeds separately for each classifier and each graph signal using a trimmed mean approach. Specifically, for each fixed classifier–signal configuration (c, g), we collect S = 10 F1-scores per random seed for both the baseline (transaction-only) model and the corresponding graph-augmented model, where each seed corresponds to one complete experimental run with identical data split and training conditions. In total, the evaluation considers G = 24 distinct graph signals. All analyses exclude Naive Bayes, as motivated in Section 7.1. Let $F_1^{(s)}(c,g)$, $s \in \{1,\dots,S\}$, denote the F1-score obtained for classifier c with graph signal g under random seed s. Analogously, let $F_1^{(s)}(c,\mathrm{base})$ denote the F1-score of the corresponding classifier-specific baseline model under the same seed. To obtain robust performance estimates, we apply a trimmed mean over the S seed-level scores by removing the minimum and maximum values and averaging the remaining S − 2 runs.
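A minimal sketch of this trimmed aggregation in pure Python; the function names are illustrative and not taken from the paper's code.

```python
def trimmed_mean_f1(scores):
    """Trimmed mean over S seed-level F1-scores:
    drop the single lowest and highest score, average the rest."""
    s = sorted(scores)
    middle = s[1:-1]  # remove min and max
    return sum(middle) / len(middle)

def delta_f1(augmented_scores, baseline_scores):
    """Gain of a graph-augmented model over its paired baseline,
    both aggregated with the same trimmed mean."""
    return trimmed_mean_f1(augmented_scores) - trimmed_mean_f1(baseline_scores)
```

For ten seeds this averages the middle eight runs, which preserves the pairing between baseline and graph-augmented configurations while discarding extreme seed-dependent outcomes.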
Formally, the trimmed mean F1-score for a given classifier–signal configuration is defined as

$$\bar{F}_1^{\mathrm{trim}}(c,g) = \frac{1}{S-2} \sum_{s \in S_{\mathrm{mid}}} F_1^{(s)}(c,g)$$

where $S_{\mathrm{mid}}$ denotes the index set of the remaining runs after excluding the lowest and highest F1-scores across seeds. The trimmed baseline performance $\bar{F}_1^{\mathrm{trim}}(c,\mathrm{base})$ is computed analogously. The performance gain induced by graph signal g for classifier c is then computed as the difference between the trimmed mean performances:

$$\Delta F_1(c,g) = \bar{F}_1^{\mathrm{trim}}(c,g) - \bar{F}_1^{\mathrm{trim}}(c,\mathrm{base})$$

Positive values of $\Delta F_1(c,g)$ indicate an improvement over the transaction-only baseline, while negative values indicate degraded performance. This aggregation strategy yields a robust central tendency estimate that mitigates the influence of extreme seed-dependent variations while preserving paired comparability between baseline and graph-augmented models. All reported F1-score averages and performance differences in Section 7 and the Appendices are computed using this trimmed aggregation across random seeds.

Statistical Significance Testing

To assess whether observed performance differences are statistically significant, we apply McNemar's test to paired predictions obtained from baseline (transaction-only) and graph-augmented models evaluated on identical test splits. Statistical significance is assessed at the prediction level, while performance differences are reported in terms of aggregated F1-scores as described in the previous section. For a given classifier c, graph signal g, and random seed s, the test compares discordant prediction outcomes between the baseline and the corresponding graph-augmented model. Let b denote the number of instances where the baseline prediction is correct and the graph-augmented prediction is incorrect, and let c denote the number of instances where the baseline prediction is incorrect and the graph-augmented prediction is correct.
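The per-seed test can be sketched from the discordant counts b and c alone. This sketch uses the fact that for one degree of freedom the chi-square survival function reduces to the complementary error function; the helper name and the zero-discordance convention are assumptions, not the authors' code.

```python
import math

def mcnemar_pvalue(b, c):
    """Continuity-corrected McNemar test on discordant counts:
    b = baseline correct, augmented wrong; c = baseline wrong, augmented correct.
    Returns the p-value of the chi-square statistic with 1 degree of freedom."""
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # For df = 1: P(chi-square > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))
```

A run would then count as a significant improvement when the p-value is at most 0.05 and c > b, and as a significant deterioration when it is at most 0.05 and b > c.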
The McNemar test statistic with continuity correction is given by

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}.$$

Differences with p ≤ 0.05 are considered statistically significant. The test is applied separately for each of the S = 10 random seeds, where each seed corresponds to one complete experimental run with a fixed data split and model initialization. This ensures that statistical comparisons are fully paired at the prediction level and isolate the effect of incorporating graph-derived signals. Rather than aggregating p-values across seeds, we summarize statistical evidence across runs by counting, for each classifier–signal combination, how often a graph-augmented model yields a statistically significant improvement versus a statistically significant deterioration relative to the baseline across the 10 seeds. Specifically, for each (c, g) pair, we record the number of seeds in which the McNemar test indicates a significant improvement (c > b) and the number of seeds indicating a significant deterioration (b > c). These counts form the basis for the aggregated significance analyses and visualizations reported in Section 7. This aggregation strategy preserves seed-level statistical validity while providing an interpretable summary of the consistency and directionality of statistically significant effects across independent experimental runs.

Signal Concatenation and Combined Evaluation

The evaluation protocol supports not only the evaluation of individual graph signals but also arbitrary combinations of signals through concatenation. For any subset of graph signals $G \subseteq \{g_1, \dots, g_{24}\}$ drawn from the predefined taxonomy, we define the concatenated feature vector for node v as:

$$x_v^{(G)} = [\, x_v^{\mathrm{base}} \,\|\, x_v^{(g_1)} \,\|\, \cdots \,\|\, x_v^{(g_{|G|})} \,]$$

where $x_v^{\mathrm{base}}$ denotes the original transaction features, $x_v^{(g)}$ the feature vector of graph signal g, and $\|$ the concatenation operator.
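This concatenation can be sketched with plain Python lists; in practice the vectors would likely be NumPy arrays, and the function name is an illustrative assumption.

```python
def concat_signals(x_base, signal_vectors):
    """Build the augmented tabular row for one node: base transaction
    features followed by each selected graph-signal vector, in order."""
    row = list(x_base)
    for vec in signal_vectors:
        row.extend(vec)
    return row
```

The same routine covers an individual signal (one vector), a category-level combination, or a larger multi-signal subset, since only the list of selected vectors changes.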
This design enables the evaluation of:

• individual graph signals (|G| = 1),
• category-level combinations (e.g., all centrality-based signals), and
• larger multi-signal combinations spanning multiple categories.

All evaluation metrics and statistical tests described above are applied identically to both individual and concatenated signal configurations, ensuring consistent comparison across granularity levels. In Section 7, we report results for both individual graph signals and category-level concatenations; for example, 'Centrality' denotes the concatenation of all five centrality-based indicators listed in Table 1.

3 Related Work

3.1 Classical Graph Indicators

Classical graph indicators have long been used as hand-crafted features to characterize node importance and local connectivity patterns in complex networks. Common examples include degree-based measures, centrality metrics such as PageRank and betweenness centrality, as well as clustering coefficients capturing local transitivity. These features provide interpretable summaries of node-level structure and have been widely applied in graph-based classification and fraud detection settings (see, e.g., Network Science [2]). Community detection methods extend this perspective by capturing mesoscopic graph structure through partitioning nodes into densely connected groups. Algorithms such as Infomap [17] and the Leiden algorithm [20] have been shown to produce stable and high-quality community assignments and are frequently used to derive node-level community features in large transaction networks.

3.2 Proximity-Based Node Embedding Methods

Unsupervised node embedding methods aim to learn continuous vector representations that preserve graph proximity and local neighborhood relationships. Random-walk-based approaches such as DeepWalk [16] and node2vec [11] generate node sequences via truncated random walks and learn embeddings by modeling node co-occurrence.
These methods represent a widely adopted class of unsupervised embedding techniques for learning node representations from the graph.

3.3 Structural and Role-Oriented Embedding Methods

Beyond proximity preservation, a class of approaches explicitly focuses on capturing structural similarity or node roles, independent of local neighborhoods. These methods aim to group nodes with similar structural functions rather than relying on direct neighborhood overlap. These approaches primarily differ in how structural similarity is operationalized and encoded through their algorithmic design choices. Representative examples include GraphWave [9], which leverages spectral graph wavelets to encode structural signatures across multiple diffusion scales, as well as role-based embedding methods such as role2vec [1] and ffstruc2vec [13], which model structural identity by capturing similarities in node roles.

3.4 Spectral Graph Embeddings

Spectral embedding techniques derive node representations from eigenvectors of the graph Laplacian, providing a principled global view of graph structure. By projecting nodes into a low-dimensional space spanned by non-trivial eigenvectors, spectral methods capture connectivity patterns beyond immediate neighborhoods. While computationally demanding for large graphs, spectral embeddings remain a foundational approach in graph representation learning (e.g., Spectral Graph Theory [8]).

3.5 Graph Neural Networks and Graph Contrastive Learning

Neighborhood-based aggregation is commonly realized through Graph Neural Networks (GNNs), which iteratively propagate and aggregate information from local neighborhoods to learn node representations. Canonical instances include Graph Convolutional Networks (GCN), which employ normalized linear aggregation of neighboring features [14], and Graph Attention Networks (GAT), which extend this paradigm by learning adaptive, attention-based weighting schemes over neighbors [21].
These architectures have demonstrated strong performance across a wide range of node classification tasks. More recently, Graph Contrastive Learning (GCL) has emerged as a self-supervised representation learning paradigm that aims to learn robust node embeddings by contrasting different augmented views of graph data. Representative approaches, such as BGRL-style contrastive learning [19], optimize contrastive objectives without relying on labeled supervision and have shown promising results in downstream tasks.

4 Taxonomy of Graph Signals for Tabular Machine Learning

Graph-structured data contain rich relational information that can significantly enhance predictive models when integrated into classical tabular machine learning pipelines. As discussed in the previous section, such relational dependencies motivate the use of graph-derived signals alongside traditional tabular features [12]. However, the space of potential graph signals is vast and heterogeneous, ranging from simple local statistics to complex learned embeddings. In this work, we use the term graph signal to denote any quantitative representation derived from the graph structure that can be associated with nodes and integrated into downstream tabular learning models, including hand-crafted indicators, unsupervised embeddings, and learned neighborhood-based representations. Without a structured conceptual basis for selecting and evaluating these signals, practitioners face a signal selection problem, as it is often unclear which types of graph-derived signals are most relevant for a given downstream task. Over the past years, a wide range of graph-based methods has been proposed, capturing different aspects of network structure. Rather than treating these methods as isolated techniques, it is useful to organize them according to the type of information they extract from the graph. This section provides a concise, non-exhaustive taxonomy of graph-related information that can be incorporated into tabular machine learning models, serving as a conceptual foundation for the feature selection used in this study. The taxonomy presented in this section builds on well-established concepts and recent work in graph representation learning and network analysis [2, 17, 20, 16, 11, 9, 1, 13, 8, 14, 21, 19]. It consolidates commonly used categories of graph-derived signals and serves as a structured conceptual basis for the comparative evaluation conducted in this study. We distinguish six broad categories of graph-based information, reflecting different structural perspectives on the same underlying graph: centrality, cohesion, community structure, proximity, spectral properties, and structural role information. These categories are conceptually distinct but complementary, and together cover a wide spectrum of graph signals commonly used in practice. Neighborhood-based features, as exploited by message-passing architectures such as Graph Neural Networks, can be viewed as an aggregation mechanism that combines multiple categories rather than as a separate information type.

4.1 Centrality-Based Indicators

Centrality-based indicators quantify the importance or influence of nodes within the network. Typical examples include degree-based measures, PageRank, betweenness, closeness, or eigenvector centrality. Such indicators capture how prominently a node is positioned within the overall connectivity structure and often provide strong baseline signals for downstream classification tasks.

4.2 Cohesion-Related Indicators

Cohesion-related indicators describe the local interconnectedness of a node's neighborhood. Measures such as clustering coefficients, core numbers, or triangle counts capture the extent to which a node participates in tightly connected substructures.
These signals are particularly relevant in domains where meaningful differences in local connectivity patterns reflect distinct behavioral or structural patterns in the graph.

4.3 Community-Based Indicators

Community-related indicators focus on meso-scale structures in the graph. Community detection methods assign nodes to groups of densely connected subgraphs, providing coarse-grained structural context. Community membership indicators can serve as high-level features that summarize a node's structural environment beyond immediate neighborhoods.

4.4 Proximity-Based Embeddings

Proximity-based embeddings encode relative closeness between nodes in a latent space, reflecting similarity induced by graph connectivity. Techniques such as DeepWalk, node2vec, or related approaches can be used to derive such proximity-based representations such that nodes with similar connectivity patterns or overlapping neighborhoods are placed close together in the representation space. These embeddings are effective at capturing distance-based similarity and local context.

4.5 Spectral Graph Embeddings

Spectral information derives from the eigenstructure of graph-related matrices, such as the adjacency matrix or the graph Laplacian. Spectral features encode global connectivity patterns and can capture structural regularities that are not easily observable through local measures alone.

4.6 Structural Role Embeddings

Structural role information aims to characterize nodes according to their position within the global topology of the graph, independent of direct proximity. Structural embedding methods focus on identifying nodes that occupy similar roles—such as hubs, bridges, or peripheral nodes—even if they are located far apart, as exemplified by approaches such as struc2vec, GraphWave, role2vec, and ffstruc2vec.
4.7 Graph Neural Networks and Graph Contrastive Learning

Graph Neural Networks (GNNs) and recent Graph Contrastive Learning (GCL) approaches constitute a powerful class of learning mechanisms that operate on graph-structured data by aggregating and transforming information from node neighborhoods. In the proposed taxonomy, these methods are viewed as integrative representation learning paradigms rather than as individual graph signal types.

While these architectures are inherently capable of end-to-end classification, we employ them within our evaluation protocol specifically as representation learners to derive high-dimensional node embeddings, which serve as signals for the subsequent tabular analysis.

4.8 Selection of Graph Signals and Role of the Taxonomy

In this work, we select a set of 24 representative graph-derived signals, with multiple representatives drawn from each category of the proposed taxonomy (see Table 1). The selection is guided by representativeness with respect to the underlying information category. The goal is to assess how different types of graph signals contribute to predictive performance when integrated into tabular machine learning models.

This taxonomy provides the conceptual basis for interpreting the experimental results presented in the following sections. In particular, it allows performance differences between feature groups to be analyzed at the level of graph signal categories, while also enabling more fine-grained comparisons between individual graph signals within the same category. This dual perspective offers insights into which types of graph signals are most beneficial for the application scenario under consideration.

In our experimental setup, GNN- and GCL-based representations are treated consistently with other graph signals by extracting fixed node embeddings from the final encoder layer.
While GCL is trained using an unsupervised contrastive objective, GCN and GAT encoders are optimized in a supervised fashion on the node classification task. To ensure a fair evaluation and prevent data leakage, all encoder training is strictly confined to the respective training splits. For GCL, we use an unsupervised contrastive learning setup, and the resulting embeddings are concatenated with transaction-level features and processed by the same tabular classifiers as all other graph-derived signals.

All GNN-based representations are generated via a two-stage procedure consisting of encoder training (supervised for GCN/GAT, self-supervised for GCL) followed by frozen embedding extraction, thereby maintaining a clear separation between representation learning and downstream classification. This setup is intentional and enables a comparative assessment of supervised neural representations alongside purely structural and self-supervised graph signals, thereby capturing the maximum attainable performance of each signal class under controlled conditions.

Table 1: Representative graph signals categorized by taxonomic family.
Category              Method                  Description
Centrality            Degree Centrality       Number of incident edges
                      PageRank                Recursive importance scoring
                      Betweenness Centrality  Fraction of shortest paths through node
                      Eigenvector Centrality  Importance based on neighbors' importance
                      Closeness Centrality    Inverse average shortest path length
Cohesion              Clustering Coefficient  Density of direct neighbors
                      Core Number             Max k such that node belongs to k-core
                      Triangle Count          Number of triangles involving node
                      Average Clustering      Mean clustering coefficient of neighbors
Community             Louvain                 Modularity-maximizing community detection
                      Leiden                  Improved Louvain with refinement
                      Infomap                 Information-theoretic community detection
Proximity             DeepWalk                Random walks + Skip-gram
                      node2vec (BFS)          Biased walks emphasizing breadth
                      node2vec (DFS)          Biased walks emphasizing depth
                      node2vec (balanced)     Balanced BFS/DFS exploration
Spectral              Spectral Embedding      Laplacian eigenmap embeddings
Structural Role       ffstruc2vec             Structural identity via flat similarity graph
                      GraphWave               Spectral graph wavelet embeddings
                      role2vec                Role-based node embeddings
GNN / Neighborhood    GCN                     Graph Convolutional Network
Features              GAT                     Graph Attention Network
                      GCL                     Graph Contrastive Learning

Graph signals are computed using fixed, standard parameter settings. The study focuses on the comparative evaluation of different types of graph-derived signals.

5 Case Study: Fraud Detection on Elliptic Bitcoin Transaction Graph

The proposed taxonomy and graph signal analysis are evaluated in the context of transaction fraud detection using the Elliptic Bitcoin Dataset, a widely used benchmark for illicit activity detection in cryptocurrency transaction networks [23].

The Elliptic dataset represents a temporal directed transaction graph, where nodes correspond to Bitcoin transactions and edges represent fund flows between transactions.
Each node is associated with a timestamped feature vector and a binary label indicating licit or illicit activity, with a substantial portion of nodes remaining unlabeled. The graph structure introduces strong relational dependencies between samples, violating the independent and identically distributed (i.i.d.) assumption underlying classical tabular learning.

Concretely, the dataset contains 203,769 nodes and 234,355 directed edges, distributed across 49 temporal snapshots. Among the labeled transactions, the class distribution is highly imbalanced, with illicit transactions accounting for roughly 2–3 % of labeled nodes, reflecting realistic fraud detection conditions. These characteristics make the Elliptic dataset a challenging and representative benchmark for evaluating graph-derived signals under severe class imbalance and structural dependency.

Rather than focusing on a single modeling paradigm, this study investigates how explicitly extracted graph-derived signals contribute to predictive performance when integrated into classical tabular machine learning pipelines. Instead of aiming to optimize end-to-end performance on the Elliptic benchmark or to propose a task-specific fraud detection model, this case study uses the dataset as a controlled experimental environment to analyze the contribution of these signals. The focus is on assessing whether the inclusion of graph-derived signals leads to statistically significant and reproducible performance improvements over a transaction-only baseline, and on understanding how these effects vary across different graph signal categories. By conducting the evaluation on a well-established and widely used benchmark, the results are intended to provide general insights into the utility of different graph signal types, rather than benchmark-specific performance claims.
This experimental framing enables performance differences to be analyzed both across graph signal categories defined by the proposed taxonomy and at a more fine-grained level within individual categories, facilitating a structured interpretation of which types of graph-derived signals—such as centrality, proximity, or structural role information—are most beneficial for fraud detection under realistic conditions.

6 Experimental Setup

This section describes the concrete experimental setup used to instantiate the proposed evaluation protocol on the Elliptic fraud detection benchmark. We detail the considered classifiers, data splits, evaluation metrics, hyperparameter optimization strategy, robustness settings, and the statistical analysis procedures used to assess the significance of observed performance differences, ensuring full reproducibility and transparency.

6.1 Prediction Task and Baseline

The prediction task is binary node classification, where each transaction is classified as licit or illicit based on available transaction-level features. As a baseline, we consider classical tabular machine learning models trained exclusively on the original transaction features provided by the dataset, without incorporating any graph-derived signals.

We exclude the native Elliptic graph features, as they are tailored to this specific benchmark and not intended as general-purpose graph descriptors. Instead, we focus on general-purpose graph-derived signals to enable a controlled and comparable evaluation across methods and settings.

Graph-augmented models extend this baseline by concatenating graph-derived signals with the original transaction features, resulting in an augmented tabular representation processed by the same classifiers. This design enables controlled, paired comparisons between baseline and graph-augmented models.
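The baseline/augmented construction described above amounts to a plain concatenation step. The sketch below illustrates this with made-up transaction IDs and feature values; in the actual pipeline the rows would be the Elliptic transaction features and the graph-derived signals from Table 1.

```python
# Hypothetical node-local transaction features and precomputed graph
# signals (e.g. a centrality and a clustering value per transaction).
trx_features = {"tx1": [0.5, 1.2], "tx2": [0.1, 0.9]}
graph_signals = {"tx1": [0.75, 0.33], "tx2": [0.25, 0.0]}

def build_rows(node_ids, use_graph=True):
    """Return feature rows; graph signals are appended by concatenation
    only, so baseline and augmented models see otherwise identical data."""
    rows = []
    for nid in node_ids:
        row = list(trx_features[nid])
        if use_graph:
            row += graph_signals[nid]
        rows.append(row)
    return rows

baseline = build_rows(["tx1", "tx2"], use_graph=False)   # 2 columns/row
augmented = build_rows(["tx1", "tx2"], use_graph=True)   # 4 columns/row
```

Because the first columns of each augmented row are bitwise identical to the baseline row, any performance difference between the two configurations is attributable to the appended graph signals.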
6.2 Classifiers

To ensure broad coverage of commonly used tabular learning paradigms, we evaluate a diverse set of classifiers with different inductive biases. Specifically, the experimental setup includes linear models, probabilistic classifiers, kernel-based methods, and tree-based ensemble models.

Concretely, we consider Logistic Regression, Naive Bayes, Support Vector Classifiers (SVC) with both linear and RBF kernels, Random Forests, and XGBoost [7]. This selection covers a wide range of modeling assumptions, including linear decision boundaries, probabilistic independence assumptions, margin-based classification, and non-linear ensemble learning. All classifiers are trained and evaluated under identical experimental conditions to ensure fair comparison. Each model is optimized independently using automated hyperparameter optimization based on Bayesian optimization with a fixed budget of 50 trials per configuration. No classifier-specific tuning or manual intervention is performed outside this unified optimization framework.

6.3 Data Splits and Random Seeds

We adopt a transductive evaluation setting following standard protocols in graph machine learning [14, 24], where the full graph structure is assumed to be available during training, while labels are observed only for the respective training nodes. This setting avoids confounding effects caused by strong temporal shifts in label distributions and enables fair, paired comparisons between baseline and graph-augmented models under identical structural conditions.

To account for randomness in both data splitting and model training, we perform all experiments across multiple random seeds. This multi-seed evaluation protocol enables run-level paired comparisons between baseline and graph-augmented models, allowing variability due to randomness to be explicitly quantified rather than implicitly averaged out.
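The multi-seed splitting protocol can be sketched as a per-seed stratified 60–20–20 split. The implementation below is a pure-Python stand-in with invented labels; the actual pipeline presumably uses scikit-learn's stratified utilities.

```python
import random
from collections import defaultdict

def stratified_split(labels, seed, ratios=(0.6, 0.2, 0.2)):
    """Stratified train/val/test split: shuffle indices within each class
    and allocate proportionally, so class balance is preserved per split."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n = len(idxs)
        a = round(ratios[0] * n)
        b = round((ratios[0] + ratios[1]) * n)
        train += idxs[:a]
        val += idxs[a:b]
        test += idxs[b:]
    return train, val, test

# Toy imbalanced labels (90 licit, 10 illicit); identical split indices
# would then be reused across all graph signal configurations per seed.
labels = [0] * 90 + [1] * 10
splits = {seed: stratified_split(labels, seed) for seed in range(1, 11)}
```

Keeping the split a pure function of the seed is what enables the paired, run-level comparisons used later for significance testing.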
For each seed, the dataset is split into disjoint training, validation, and test sets using a stratified random split with a 60–20–20 ratio, and identical split indices are applied across all graph signal configurations.

Hyperparameter optimization is performed once for each classifier and graph signal configuration using a fixed random seed (seed = 42) for data splitting and model initialization. The resulting optimal hyperparameters are then fixed and used consistently across all corresponding experiments, which are evaluated over multiple random seeds (seeds 1–10) to assess variability and statistical robustness.

6.4 Hyperparameter Optimization

Hyperparameter optimization is performed separately for each classifier and graph signal configuration using Bayesian optimization with a Tree-structured Parzen Estimator [4], as implemented in the hyperopt framework [5]. The optimization objective is cross-entropy loss evaluated on the validation set.

In line with the general framework described in Section 2, classifier-specific mechanisms such as class weighting, oversampling strategies, feature normalization, and embedding-specific dimensionality reduction are included in the optimization space where applicable. For each optimization run, a fixed budget of 50 trials is used. The best-performing hyperparameter configuration per classifier and graph signal configuration is selected and subsequently reused across all random seeds to ensure consistent evaluation and to avoid leakage between optimization and testing. To ensure full reproducibility and transparency, the complete hyperparameter search spaces for all classifiers and the fixed parameters for all graph signal generators are provided in Appendix E and Appendix F, respectively.

6.5 Evaluation Metrics

Model performance is evaluated using metrics appropriate for highly imbalanced binary classification.
As the primary comparison metric, we report the F1-score, which balances precision and recall and enables direct comparison across classifiers and graph signal configurations. To assess whether observed performance differences between baseline and graph-augmented models are statistically significant, we apply McNemar's test [15] to paired predictions on identical test splits, considering results with p ≤ 0.05 as statistically significant. Precision and recall are further analyzed to support interpretability and to highlight application-relevant trade-offs in fraud detection.

Performance is aggregated across random seeds using trimmed aggregation to reduce the influence of outlier runs.

6.6 Robustness Experiments under Graph Perturbations

To assess the robustness of graph-derived signals to structural changes, we perform additional experiments on perturbed versions of the transaction graph. Perturbations are introduced by randomly removing a fixed proportion of edges while preserving the original node set. Specifically, edge removal rates of 25 % and 50 % are considered to simulate increasing levels of structural degradation.

Graph-derived signals are recomputed for each perturbed graph variant and evaluated using the same experimental protocol as for the unperturbed graph. This allows the stability of observed performance gains to be analyzed under controlled degradation of graph structure.

6.7 Implementation Details and Reproducibility

All experiments are implemented using a unified experimental pipeline that ensures consistent preprocessing, model training, and evaluation across configurations. All random seeds, hyperparameter search spaces, and evaluation scripts are fixed and logged to enable full reproducibility.
All code and configuration files required to reproduce the experiments reported in this paper, including data preprocessing, graph signal generation, model training, evaluation scripts, and robustness experiments under graph perturbations, are publicly available. The complete experimental setup, including configuration files and result logs, is designed to support automated downstream analysis and statistical evaluation.

The experimental pipeline is implemented in Python, leveraging standard machine learning libraries and graph processing frameworks such as scikit-learn, XGBoost [7], PyTorch, the Deep Graph Library [22], and hyperopt [5] for automated hyperparameter optimization.

7 Experimental Results

7.1 Evaluation Overview and Reading Guide

While the empirical evaluation is conducted on a single large-scale fraud detection benchmark, the objective of this section is not to maximize task-specific performance scores, but to systematically investigate the relative utility, statistical reliability, and stability of different graph-derived signal categories under a unified and rigorous evaluation protocol.

This section reports the experimental results obtained by augmenting transaction-based fraud detection models with graph-derived signals. We first provide an aggregated comparison between transaction-only baselines and graph-augmented models across all evaluated classifiers and graph signal configurations, using identical data splits, random seeds, and hyperparameters obtained from a fixed optimization protocol. This design enables fully paired, run-level comparisons between baseline and graph-augmented models. Performance is measured using the F1-score, reported as the trimmed mean and standard deviation over multiple random seeds.
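The trimmed aggregation over seeds can be sketched as follows. The 10 % trim fraction and the per-seed scores are illustrative assumptions; the paper does not state the exact trim level here.

```python
import statistics

def trimmed_stats(values, trim=0.1):
    """Drop the lowest and highest `trim` fraction of runs before
    aggregating, reducing the influence of outlier seeds."""
    xs = sorted(values)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return statistics.mean(kept), statistics.stdev(kept)

# Hypothetical F1-scores over ten seeds, with one outlier run.
f1_per_seed = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.89, 0.95, 0.91, 0.62]
mean_f1, std_f1 = trimmed_stats(f1_per_seed)
```

With ten runs and a 10 % trim, exactly one run is dropped from each end, so a single failed seed (here 0.62) no longer dominates the reported mean or standard deviation.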
All results follow the evaluation protocol described in Section 6, with hyperparameters fixed based on a configuration-specific optimization procedure using seed 42, and performance aggregated over ten independent runs (seeds 1–10) using trimmed F1-score statistics.

All reported random seeds affect only data splitting and model training, including initialization and optimization stochasticity. Graph-derived signals are computed deterministically for a given graph and are held fixed across all runs.

To assess the robustness and statistical relevance of observed performance differences, we additionally apply paired McNemar significance tests at a significance level of p ≤ 0.05, and summarize results across random seeds by counting statistically significant improvements and deteriorations per classifier–signal configuration.

Results are presented at multiple levels of aggregation. We first analyze overall performance trends across graph signal categories, followed by classifier-specific effects, stability across random seeds, and statistical significance patterns. Graph signals are grouped into the following conceptual categories: Centrality, Cohesion, Community, Proximity, Spectral, and Structural, as well as a separate category of Graph Neural Network (GNN)-based representations. Category-level results are obtained by explicitly concatenating all individual graph signals within a category, as described in Section 2. In addition, detailed results for all 24 individual graph signals are reported in Appendix A, with corresponding variability statistics provided in Appendix B.

For aggregated performance comparisons across classifiers, we exclude the Naive Bayes classifier from the aggregation and report the average F1-score over the remaining classifiers.
Naive Bayes consistently exhibits substantially lower performance in this setting, which is expected given its strong conditional independence assumptions and limited capacity to capture complex relational patterns induced by graph-derived signals. Nevertheless, Naive Bayes results are reported in all tables for completeness and transparency, but excluded from aggregated summaries, as including this classifier would disproportionately distort overall trends without providing additional insight.

Section 7.2 focuses on aggregated performance trends across graph signal categories. Sections 7.3 and 7.4 then analyze classifier-specific effects and stability across random seeds, respectively. A formal statistical significance analysis is provided in Section 7.5, followed by a robustness study under graph perturbations in Section 7.6.

7.2 Overall Performance Impact of Graph Signals

Across all evaluated classifiers included in the aggregated analysis, the integration of graph-derived signals leads to consistent performance improvements over the transaction-only baseline. Figure 1 summarizes the average F1-score improvements across graph signal categories relative to the transaction-only baseline. Averaged over classifiers and random seeds, most graph signal categories yield positive gains in F1-score, indicating that relational information captured from the transaction network provides complementary predictive power beyond node-local transaction features. The detailed per-classifier performance across all signal categories is shown in Figure 2 (F1-scores with standard deviations), revealing consistent improvements for most model–signal combinations. For a fine-grained breakdown of results across all 24 individual graph signals and classifiers, see Appendix A (mean F1-scores) and Appendix B (standard deviations across seeds).

Figure 1: Average F1-score improvements across graph signal categories relative to the transaction-only baseline, aggregated over classifiers and random seeds (trimmed aggregation). Categories shown in order of decreasing gain: GNN, Proximity, Community, Cohesion, Structure, Centrality, Spectral.

Figure 2: Per-classifier F1-scores across graph signal categories. Cell values report mean F1-scores with standard deviations across random seeds. Color intensity indicates relative performance differences with respect to the transaction-only (TRX) baseline (green = improvement, red = degradation). The recoverable cell values are:

             TRX Only     Centrality   Cohesion     Community    Proximity    Spectral     Structure    GNN
MLP          0.866±0.009  0.880±0.005  0.859±0.007  0.872±0.006  0.902±0.003  0.821±0.007  0.843±0.008  0.918±0.006
LR           0.828±0.009  0.828±0.007  0.832±0.008  0.828±0.007  0.847±0.006  0.832±0.008  0.831±0.007  0.919±0.006
NB           0.291±0.002  0.293±0.003  0.292±0.002  0.292±0.002  0.292±0.002  0.292±0.003  0.319±0.003  0.365±0.003
SVC          0.648±0.071  0.703±0.087  0.758±0.036  0.850±0.007  0.872±0.005  0.728±0.020  0.793±0.004  0.915±0.008
RF           0.905±0.006  0.919±0.007  0.915±0.008  0.922±0.008  0.906±0.008  0.903±0.008  0.901±0.007  0.917±0.006
XGB          0.920±0.004  0.929±0.002  0.925±0.004  0.937±0.004  0.951±0.005  0.932±0.005  0.916±0.004  0.938±0.007
AVG (Top-5)  0.833        0.852        0.858        0.882        0.896        0.843        0.857        0.921
AVG vs Base  0.000        0.018        0.024        0.049        0.062        0.010        0.023        0.088

Among all evaluated categories, GNN-derived representations achieve the strongest average performance gains. Notably, GNNs are the only class of methods in this study that explicitly incorporate feature information from neighboring nodes during representation learning.

While GNN-derived representations achieve the strongest average performance, an upper-bound analysis in Appendix G shows that proximity-based graph signals achieve the highest absolute F1-score across all evaluated graph signal categories. Proximity-based and community-based signals consistently outperform purely structural or spectral representations.
This pattern aligns with the intuition that relational closeness and shared transactional context are particularly informative in fraud detection settings, where illicit activity often propagates through localized neighborhoods. Although proximity-based and community-based signals are treated as distinct categories in our taxonomy, both encode aspects of relational closeness at different levels of abstraction. The stronger performance of proximity-based representations suggests that modeling relational proximity in a continuous and fine-grained manner can provide richer information than discrete group assignments in this setting.

Overall, these results demonstrate that graph-based augmentation is generally beneficial in this setting, while also highlighting that performance gains are strongly dependent on the type of relational information being incorporated. In particular, representations that explicitly capture relational proximity and neighborhood context tend to yield the most pronounced improvements, whereas more global or purely structural descriptors provide only limited additional benefit.

While these average performance gains provide a first indication of the benefits of graph-derived signals, their statistical reliability and robustness are examined in the following sections, including a formal significance analysis in Section 7.5.

7.3 Classifier-Specific Effects

Building on the aggregated category-level trends shown in Figure 1, which summarize average effects across classifiers, we analyze how graph signal effectiveness varies across individual classifiers, as shown in Figure 2, reflecting differences in model capacity and inductive bias.

Before turning to classical classifiers, we briefly discuss the role of GNN-derived representations, which serve as a key source of graph signals in our evaluation.
To contextualize the performance of GNN-derived representations, we additionally compare end-to-end GNN models with configurations where fixed node embeddings extracted from the trained GNN encoders are concatenated with transaction-level features and processed by classical tabular classifiers. While end-to-end GNNs achieve strong predictive performance on the Elliptic dataset, the corresponding embedding-based configurations yield higher average F1-scores when combined with optimized tabular models. This observation suggests that, in the evaluated setting, GNNs primarily serve as effective representation learners, whereas downstream tabular classifiers provide more flexible decision boundaries under severe class imbalance. Importantly, this comparison is intended as an interpretative aid rather than a direct architectural comparison, as both approaches are evaluated under different modeling assumptions.

Tree-based ensemble models, particularly Random Forests and XGBoost, exhibit strong and stable performance gains across most graph signal categories. This behavior is consistent with the ability of tree-based ensembles to handle heterogeneous and partially redundant feature sets, which may facilitate the effective integration of complementary graph-derived signals. These models benefit consistently from additional structural features, with improvements observed across centrality, community, proximity, and GNN-based representations. This observation is consistent with recent findings in tabular machine learning, which show that tree-based ensemble models often outperform deep learning approaches when combined with informative feature representations, due to their robustness to heterogeneous feature scales and their ability to exploit complementary feature interactions [10].

Support Vector Classifiers show substantial relative improvements over the transaction-only baseline across multiple graph signal categories.
However, these improvements are accompanied by increased variability, indicating higher sensitivity to both feature selection and random initialization.

Linear models such as Logistic Regression and shallow neural models (MLP) display moderate but robust improvements, suggesting that even comparatively simple classifiers can exploit graph-derived signals when properly optimized. In contrast, Naive Bayes classifiers show minimal sensitivity to graph augmentation, with performance remaining largely unchanged across signal categories.

Overall, these results highlight that the effectiveness of graph-derived signals depends on the choice of downstream classifier, and that models with higher expressive capacity tend to benefit more consistently from additional relational information.

7.4 Stability and Variance Across Random Seeds

To assess the stability of observed improvements, we analyze the variability of F1-scores across random seeds. Across most classifiers and graph signal categories, the standard deviation of performance remains comparable to or lower than the standard deviation observed for the transaction-only baseline. Observed variability is generally low, with standard deviations typically remaining in the range of a few thousandths in F1-score.

Figure 6 reports the standard deviation of F1-scores across random seeds for individual graph signals and classifiers. Consistent patterns are observed when variability is aggregated at the graph signal category level (see Figure 1), confirming that stability trends persist beyond individual signal realizations. Observed performance improvements generally exceed the corresponding run-to-run variability, indicating that effect sizes are not dominated by random fluctuations.

This observation suggests that performance gains introduced by graph signals are not driven by favorable random initializations or specific data splits.
Instead, the consistently low variance across runs indicates that graph augmentation yields reliable and reproducible improvements within the evaluated setting. In several cases, graph augmentation even reduces variability, indicating a stabilizing effect on model behavior.

An exception is observed for Support Vector Classifiers combined with certain centrality-based signals, which exhibit increased variance. Nevertheless, the overall pattern indicates that improvements induced by graph-derived signals are generally robust and reproducible across runs.

7.5 Statistical Significance Analysis

To statistically validate the observed performance gains, we conduct paired McNemar tests to assess whether observed improvements are statistically significant across random seeds, following the paired, run-level testing protocol described in Section 2.

Figure 3 visualizes the aggregated McNemar test outcomes at the graph signal category level, summarizing the number of statistically significant improvements versus degradations across classifiers. The top row summarizes net significance patterns across all classifiers, revealing that significant improvements dominate significant degradations for all examined graph categories, with GNN-derived representations, proximity-based embeddings, and community-based signals exhibiting the strongest and most consistent dominance of significant improvements over degradations.

Overall, these results confirm that the observed performance gains from graph-derived signals are not only numerically meaningful but also statistically reliable under paired, run-level significance testing.

Appendix C provides a fine-grained, per-signal breakdown of McNemar test outcomes across classifiers and runs.
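The per-run significance test operates on the discordant pairs of the two paired prediction vectors. The sketch below uses the exact binomial variant of McNemar's test, a common small-sample form; the paper cites McNemar's test [15] without specifying the exact variant, and the discordant counts below are invented for illustration.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar test.
    b = test cases the baseline got right but the augmented model got wrong,
    c = cases the baseline got wrong but the augmented model got right.
    Returns the two-sided p-value under H0: discordances are 50/50."""
    n = b + c
    if n == 0:
        return 1.0
    # Probability of a result at least as extreme as min(b, c) under
    # Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts from one paired test-split comparison.
p = mcnemar_exact(b=8, c=31)
significant = p <= 0.05
```

Only the two discordant counts matter: cases where both models agree carry no evidence about which model is better, which is why this paired test suits run-level baseline-versus-augmented comparisons on identical splits.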
Figure 3: Aggregated McNemar test outcomes for graph signal categories across classifiers. Each cell represents the balance of statistically significant improvements (p ≤ 0.05) versus degradations (improvements/degradations). Darker green shades in the original heatmap indicate a higher frequency of significant performance gains over the transaction-only baseline. The recoverable cell values are:

            Centrality  Cohesion  Community  Proximity  Spectral  Structure  GNN
(all clfs)  27/5        21/7      37/3       31/9       28/11     22/11      50/0
MLP         5/1         3/4       3/1        8/0        1/9       1/7        10/0
LR          0/0         1/0       1/1        2/0        2/0       1/0        10/0
NB          4/1         1/2       5/1        0/9        10/0      10/0       10/0
SVC         7/3         9/1       10/0       10/0       8/1       10/0       10/0
RF          6/0         4/0       8/0        1/0        0/1       0/3        2/0
XGB         5/0         3/0       10/0       10/0       7/0       0/1        8/0

7.6 Robustness under Graph Perturbations

This section reports the results of the edge dropping robustness experiments introduced in Section 6.6. All graph-derived signals are computed on randomly sparsified transaction graphs with 25 % and 50 % edge removal, while all other aspects of the experimental setup remain unchanged. To ensure comparability across perturbation levels, all classifiers are evaluated using the previously determined optimal hyperparameters.

Figure 4 illustrates the average ΔF1 relative to the transaction-only baseline across graph signal categories as a function of increasing edge removal, aggregated across classifiers. Additional details are reported in Appendix D, where detailed result matrices for each edge removal level are provided in Figure 9 and Figure 10, and classifier-specific robustness trends are illustrated in Figure 11.

Figure 4: Average ΔF1 relative to the transaction-only baseline across graph signal categories (Centrality, Cohesion, Community, Proximity, Spectral, Structure, GNN) under increasing edge removal (0 %, 25 %, 50 %).

As illustrated in Figure 4, increasing edge removal leads to distinct degradation profiles across graph signal categories, indicating that robustness under structural perturbation varies substantially across different types of graph-derived signals.
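The edge-dropping perturbation from Section 6.6 amounts to sampling a fixed fraction of edges to keep while leaving the node set untouched. A minimal sketch, with a toy edge list and seed handling assumed:

```python
import random

def remove_edges(edges, rate, seed=0):
    """Randomly drop `rate` of the edges; the node set is tracked
    separately (as in the protocol), so nodes are never removed and
    isolated nodes simply lose their graph context."""
    rng = random.Random(seed)
    keep = len(edges) - int(len(edges) * rate)
    return rng.sample(edges, keep)

# Toy directed edge list; in the study this would be the transaction graph.
edges = [(i, i + 1) for i in range(100)]
perturbed = {rate: remove_edges(edges, rate) for rate in (0.25, 0.50)}
# Graph signals would then be recomputed on each perturbed variant and
# re-evaluated under the unchanged experimental protocol.
```

Fixing the sampling seed per perturbation level keeps the comparison paired: every signal category is recomputed on the same sparsified graph.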
Proximity-based signals exhibit the steepest performance degradation under edge removal, with average F1-score improvements declining rapidly as the graph becomes sparser. This behavior is consistent with the reliance of proximity-based representations on multi-hop neighborhood structure, which is particularly sensitive to random edge deletion.

Community-based signals show a more gradual but consistent decrease in performance, suggesting moderate robustness under structural degradation. While these signals remain beneficial under mild perturbations, their effectiveness diminishes as community structure becomes increasingly fragmented.

Cohesion-based and centrality-based signals display relatively mild degradation on average; however, their robustness is not uniform across classifiers (see Figure 11 in Appendix D). For some downstream models, these signals remain stable under moderate edge removal, whereas for others their contribution deteriorates, indicating classifier-dependent sensitivity.

Structural embeddings exhibit heterogeneous robustness behavior. While certain classifiers retain stable or only mildly degraded performance under edge removal, others show notable declines, highlighting that robustness for this category depends strongly on the interaction between the embedding method and the downstream classifier (see Figure 11 in Appendix D).

In contrast, GNN-derived representations demonstrate the strongest overall robustness across perturbation levels. Although performance declines with increasing edge removal, GNN-based signals consistently retain positive improvements relative to the transaction-only baseline, even under substantial graph sparsification.

Summary. Overall, these results show that robustness under graph perturbations is highly signal- and classifier-dependent.
Categories exhibiting similar average performance under the full graph can differ markedly in their sensitivity to structural degradation, underscoring the importance of evaluating robustness alongside effect size when assessing the practical utility of graph-derived signals.

8 Discussion

The results presented in Section 7 provide the empirical basis for addressing the research questions posed in the Introduction. In this section, we synthesize these findings by relating performance, robustness, and interpretability patterns to the different categories of graph-derived signals in the considered application setting.

Specifically, this discussion examines whether graph-derived signals provide statistically reliable performance improvements over transaction-only baselines and how these effects vary across signal categories (RQ1), how robust the observed performance gains are under controlled graph perturbations (RQ2), and to what extent a taxonomy-driven organization of graph signals supports interpretable, category-level insights into characteristic fraud-related patterns (RQ3).

The experimental results presented in Section 7 are the outcome of a deliberately controlled and statistically grounded evaluation protocol. By flexibly combining taxonomy-driven graph signal integration, automated hyperparameter optimization, multi-seed evaluation, and formal significance testing, the proposed evaluation protocol enables reliable, comparable, and interpretable assessment of graph-derived signals for tabular learning.

This study set out to systematically evaluate whether, and under what conditions, graph-derived signals provide measurable, statistically reliable, and robust performance benefits for graph-augmented tabular machine learning.
Using fraud detection on transaction networks as a case study, the results provide converging evidence that incorporating relational information yields consistent and statistically significant improvements over transaction-only baselines. At the same time, they reveal that the magnitude, stability, and robustness of these gains depend strongly on the type of graph-derived signal, the downstream classifier, and the quality of the underlying graph structure.

8.1 Effectiveness of Graph-Derived Signals

Across classifiers, graph signal categories, and individual graph signals, the integration of graph-derived signals leads to consistent and statistically reliable improvements over the transaction-only baseline, with gains observed across the large majority of evaluated classifier–signal combinations rather than being confined to isolated configurations. This indicates that relational information captured from the transaction graph provides complementary predictive power beyond node-local transaction features.

From a qualitative perspective, the effectiveness of graph-derived signals is strongly influenced by the type of relational information they encode. Signals that encode local relational proximity and neighborhood context emerge as particularly beneficial in this setting. This observation aligns with the intuition that illicit behavior in transaction networks often propagates through localized neighborhoods, making proximity-aware representations well suited for fraud detection tasks. In contrast, more global or frequency-based representations provide only limited additional benefit, suggesting that not all forms of structural information are equally informative for this application domain.

Although proximity-based and community-based signals are treated as distinct categories within the proposed taxonomy, both capture aspects of relational closeness at different levels of abstraction.
The stronger performance of proximity-based representations suggests that modeling relational proximity in a continuous and fine-grained manner can convey richer information than discrete group assignments in the evaluated fraud detection setting. This highlights the importance of considering not only which structural properties are encoded, but also how they are represented when assessing the utility of graph-derived signals.

Taken together, these findings underscore that the effectiveness of graph-based augmentation is closely tied to the alignment between signal type, representation granularity, and the relational characteristics of the underlying task.

8.2 Classifier-Specific Interactions and Model Capacity

The effectiveness of graph signals varies substantially across classifiers, reflecting differences in model capacity and inductive bias. Tree-based ensemble models, particularly Random Forests and XGBoost, benefit most consistently from graph-based augmentation. These models are well suited to exploit heterogeneous, potentially redundant feature sets and exhibit stable improvements across nearly all graph signal categories. These findings align with broader evidence that, in tabular learning settings, model performance is often driven more by feature quality than by architectural complexity [10].

Support Vector Classifiers show pronounced relative improvements over the transaction-only baseline across many graph signal configurations. However, these gains are accompanied by increased variability across random seeds, reflecting the known sensitivity of SVM-based models to data splits, feature scaling, and hyperparameter settings. While graph signals can substantially enhance SVC performance, these results underscore the importance of careful model tuning and stability analysis when using margin-based classifiers in graph-augmented settings.
Linear models and shallow neural networks display more moderate but robust improvements, suggesting that even comparatively simple classifiers can benefit from relational information when graph-derived signals are appropriately constructed. In contrast, Naive Bayes classifiers exhibit minimal sensitivity to graph augmentation, indicating limited compatibility between conditional independence assumptions and graph-derived signal representations.

8.3 Stability, Variability, and Robustness

Beyond average performance gains, the stability of graph-augmented models across random seeds is a critical consideration. The results demonstrate that standard deviations of F1-scores remain low across most classifiers and graph signal categories, typically on the order of a few thousandths. In many cases, graph augmentation even reduces run-to-run variability relative to the transaction-only baseline, indicating a stabilizing effect on model behavior.

Importantly, graph signal categories exhibiting similar average performance under the full graph can differ markedly in their robustness to structural perturbations, highlighting that average effect size alone is insufficient for assessing practical utility without explicit robustness evaluation.

From a practical perspective, these findings imply that the choice of graph-derived signals should be informed not only by average performance under ideal graph conditions, but also by the expected level of noise, incompleteness, or uncertainty in the underlying graph structure. Robustness patterns vary substantially across graph signal categories and perturbation levels, underscoring the importance of signal-type-aware robustness evaluation when selecting graph-derived indicators for downstream tasks.

At the same time, observed performance gains consistently exceed the corresponding variability across runs, indicating that the improvements reflect systematic effects of incorporating relational information. Finally, the results indicate that the benefit of graph-derived signals is not entirely classifier-invariant, suggesting that signal selection and classifier choice should be considered jointly when designing graph-augmented tabular learning pipelines.

8.4 Statistical Significance of Performance Improvements

Paired McNemar significance tests provide a statistical assessment of performance differences between graph-augmented models and transaction-only baselines. Across a large number of paired comparisons arising from the multi-classifier, multi-signal-group, and multi-seed evaluation setup introduced in Section 2, a substantial majority of comparisons yield statistically significant improvements over the transaction-only baseline. This pattern persists when results are aggregated at both the graph signal category level and the individual signal level, indicating that beneficial effects are systematic rather than incidental.

Proximity-based and GNN-derived representations exhibit particularly strong and consistent significance patterns, while spectral features show mixed outcomes, aligning with their limited average performance improvements. Importantly, significant degradations are rare overall and occur primarily in isolated classifier–signal combinations rather than as systematic trends.

Taken together, these results demonstrate that graph-derived signals not only improve predictive performance numerically, but do so in a statistically robust and reproducible manner.

8.5 Interpretation and Methodological Implications

The findings of this study demonstrate that incorporating graph-derived signals into tabular learning pipelines yields statistically significant and reproducible performance improvements in the context of cryptocurrency transaction fraud detection.
From a methodological perspective, the proposed evaluation protocol enables a systematic and taxonomy-driven integration of diverse graph-derived signals into tabular learning pipelines. By supporting flexible signal grouping, controlled hyperparameter optimization, and formal statistical evaluation, it allows practitioners to identify which types of graph signals yield statistically significant and stable benefits for a given downstream task. In particular, the results highlight the importance of selecting graph signal types that align with the structural characteristics of the target problem, as both performance impact and robustness under structural perturbations vary substantially across signal categories.

Moreover, the inclusion of controlled structural perturbations provides a practical mechanism to assess how different graph signal categories respond to incomplete or degraded graph structure. The observed heterogeneity in degradation patterns underscores that robustness is highly signal-type dependent and should be considered explicitly during signal selection.

In the evaluated cryptocurrency transaction fraud setting, graph neural network-based representations achieve the strongest overall performance gains. Notably, GNNs are the only class of methods considered in this study that explicitly incorporate feature information from neighboring nodes during representation learning. While multiple factors may contribute to their effectiveness, this observation highlights the potential relevance of local relational context in this application domain.

Beyond GNN-based representations, proximity-based and community-oriented graph signals also exhibit strong and consistent performance improvements, suggesting that relational closeness and mesoscopic grouping information are particularly informative. In contrast, purely structural or spectral representations show more heterogeneous effects.
Importantly, these observations are specific to the analyzed transaction network and should not be interpreted as a general ranking of graph signal types across domains.

Beyond predictive performance, the taxonomy-driven evaluation of graph signal categories provides a basis for pattern-oriented analysis of downstream tasks. By examining which types of graph-derived signals yield consistent and robust improvements, the evaluation protocol supports high-level interpretation of which structural aspects of the graph are informative for the task, such as fraud detection in transaction networks.

8.6 Practical Implications and Business Value

The empirical results of this study have direct practical implications for fraud detection systems operating under severe class imbalance and regulatory constraints. They demonstrate that explicitly extracted graph-derived signals can yield consistent and statistically significant improvements in fraud detection performance when integrated into classical tabular machine learning pipelines, improving both precision and recall, which are critical performance dimensions in financial crime detection.

From a practical perspective, the primary contribution of the proposed evaluation protocol lies not in prescribing a fixed ranking of graph signal types, but in providing a systematic and statistically grounded procedure to assess their utility for a given downstream task.

As demonstrated in the presented fraud detection case study, the relative effectiveness, stability, and robustness of graph-derived signals vary substantially across signal categories and classifiers. Importantly, these patterns are task-dependent and should not be assumed to generalize across application domains.
The proposed evaluation protocol enables practitioners to identify, under controlled experimental conditions, which types of graph-derived signals provide statistically significant and robust benefits for their specific task, data characteristics, and operational constraints. In practical terms, this implies that signal selection should account for known data quality constraints of the application. For example, in settings where transaction graphs are expected to be incomplete or partially observed, such as when access to certain transaction channels is limited, the results of our case study suggest that placing greater emphasis on more coarse-grained or role-based graph signal categories may yield more stable performance than relying primarily on proximity-based signals, which exhibited pronounced sensitivity to graph perturbations.

In regulated environments, interpretability and auditability are often as important as predictive performance. Several graph signal categories that perform particularly well in the evaluated setting, such as community-based indicators, are inherently interpretable and provide transparent representations of relational structure. This facilitates signal-level transparency and model auditability without relying on opaque end-to-end graph neural architectures.

Finally, the robustness experiments in our case study highlight that different graph signal categories exhibit markedly different sensitivity to incomplete or degraded graph structure. In the evaluated fraud detection setting, this allows practitioners to identify which types of graph-derived signals remain informative when transaction graphs are sparse, partially observed, or subject to data loss. The proposed evaluation protocol therefore provides a principled mechanism to assess, for a given downstream task, which graph-derived signals retain predictive utility under incomplete or degraded graph structure.
8.7 Limitations and Future Directions

While the proposed evaluation protocol and taxonomy-driven analysis are designed to be generic and applicable to a broad range of graph-augmented tabular learning tasks, the empirical results presented in this study are obtained from a single large-scale cryptocurrency fraud detection benchmark. Consequently, the quantitative performance gains, robustness patterns, and relative effectiveness of different graph signal categories should be interpreted in the context of this specific application setting.

At the same time, the Elliptic dataset constitutes a challenging and representative real-world benchmark, characterized by strong relational dependencies, severe class imbalance, and noisy graph structure. These properties make it well suited for stress-testing the statistical reliability and robustness of graph-derived signals. The observed patterns, such as the strong effectiveness of proximity-based signals and their pronounced sensitivity to graph perturbations, are therefore indicative of how different types of graph-derived signals behave under realistic conditions, rather than being benchmark-specific performance claims.

Importantly, the primary contribution of this work lies not in the absolute performance levels achieved on this dataset, but in the systematic, statistically grounded evaluation protocol and taxonomy-driven perspective. These components are directly transferable to other application domains and datasets, enabling comparable analyses of graph-derived signals beyond the specific case study considered here.

The set of graph-derived signals evaluated within each category is necessarily non-exhaustive. Although representative signal types are selected to cover a broad spectrum of structural information, additional graph signals and alternative formulations may further enrich the analysis.
Extending the set of signal representatives per category may provide finer-grained insights into category-internal variability.

Several design choices were made to prioritize comparability and statistical rigor, which also introduce constraints: graph signals are computed deterministically and held fixed across random seeds, enabling clean paired comparisons across classifiers and evaluation runs. However, this setup does not explicitly capture additional uncertainty arising from alternative graph construction procedures or stochastic graph signal generation. Similarly, while hyperparameter optimization is applied consistently across models, broader hyperparameter spaces, additional optimization trials, or alternative optimization objectives may further improve absolute performance.

Despite extensive measures to mitigate variability, including automated hyperparameter optimization, multiple random seeds, diverse classifier families, formal significance testing, and trimmed performance aggregation, residual stochastic effects and dataset-specific biases may still contribute to the observed results, as in all empirical machine learning studies.

Finally, we note that the proposed taxonomy constitutes an abstraction of a continuous and heterogeneous design space. While the categories capture distinct types of structural information, boundaries between categories are not always strict, and certain graph signals may exhibit characteristics of multiple groups. The taxonomy is therefore intended as a practical organizing scheme rather than a rigid classification.

Future work may extend the proposed evaluation protocol along several directions, including evaluating additional graph signal representatives and classifier families, incorporating stochastic or task-adaptive graph signal generation, extending robustness analysis to dynamic or temporally evolving graphs, and exploring alternative evaluation protocols beyond binary fraud detection.
More generally, the proposed evaluation protocol provides a foundation for systematic and extensible benchmarking of graph-derived signals under controlled experimental conditions.

9 Conclusion

This paper introduces a systematic, taxonomy-driven evaluation protocol for assessing graph-derived signals in tabular machine learning. The proposed taxonomy provides the conceptual framework that enables a structured and interpretable comparison across fundamentally different types of relational information. The protocol combines multi-seed statistical evaluation, formal significance testing, and robustness analysis under graph perturbations to enable fair and reliable comparisons.

Applied to a large-scale cryptocurrency fraud detection case study, the taxonomy-guided protocol yields three key insights. First, while graph-derived signals consistently provide statistically significant improvements over tabular baselines, their effectiveness varies substantially across the distinct categories defined by the taxonomy. Proximity-based and GNN-derived signals emerge as particularly impactful, suggesting that localized relational patterns are highly discriminative for fraud detection. Second, robustness under structural perturbations is highly signal-dependent, revealing distinct degradation profiles per category. For instance, while proximity-based signals perform well on the complete graph, they degrade rapidly under edge removal, making them less suitable for applications with incomplete or noisy transaction data. Third, the taxonomy-driven analysis provides interpretable insights by linking these highly discriminative signal categories to characteristic structural fraud patterns, such as the reliance on localized transaction neighborhoods.

Overall, this work demonstrates the importance of assessing graph-derived signals using statistically grounded evaluation criteria beyond average performance comparisons.
Accordingly, meaningful evaluation requires jointly considering effect size, statistical reliability, and robustness under structural uncertainty, which capture complementary aspects of signal utility and should not be conflated when assessing practical applicability. The proposed taxonomy further facilitates interpretable insights in applied domains. Using fraud detection as a representative case study, we show how the proposed evaluation protocol enables principled and reproducible assessment of the conditions under which graph-derived signals provide measurable value for tabular machine learning models.

Methodologically, the evaluation protocol offers an application-agnostic procedure for the informed selection of graph signals based on these complementary criteria. It can be directly applied to other domains to determine when and how graph-derived information reliably enhances tabular learning pipelines.

Funding

This research received no external funding.

Data Availability

The data and source code supporting the findings of this study are publicly available in the GitHub repository at https://github.com/graph-eval/graph-eval-protocol. The experimental source code and result artifacts are also available on Zenodo at https://doi.org/10.5281/zenodo.18351526.

Acknowledgments

The authors would like to thank Prof. Dr. Mª Mercedes Carmona Martínez and Dr. Vinny Flaviana Hyunanda from Universidad Católica San Antonio de Murcia (UCAM), as well as Claudia Schmitz from FOM University of Applied Sciences, for their kind administrative support throughout the doctoral process.

Conflicts of Interest

The authors declare no conflicts of interest.
A Mean F1-score per individual graph signal

Figure 5: Mean F1-scores across classifiers and graph signals. Cell values report the mean F1-score aggregated over the trimmed middle eight runs. Color intensity indicates relative performance differences with respect to the transaction-only baseline (green = improvement, red = degradation).

We analyze the distribution of performance changes across all classifier–graph signal combinations, as illustrated in Figure 5.
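The trimmed aggregation behind these cell values (the mean over the middle eight of ten seeded runs) can be sketched as follows. The helper name and example scores are illustrative, not taken from the paper.

```python
def trimmed_mean(scores, trim=1):
    """Mean over the middle runs: the `trim` lowest and `trim` highest
    scores are discarded (ten seeds -> middle eight runs)."""
    s = sorted(scores)
    kept = s[trim:len(s) - trim]
    return sum(kept) / len(kept)

# Ten per-seed F1 scores with one pessimistic and one optimistic outlier.
seed_f1 = [0.905, 0.911, 0.913, 0.913, 0.915, 0.915, 0.916, 0.917, 0.871, 0.942]
print(round(trimmed_mean(seed_f1), 4))  # 0.9131
```

Trimming the extremes makes the per-cell estimate less sensitive to occasional unstable runs, which matters particularly for seed-sensitive classifiers such as SVC.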
Out of 144 evaluated combinations (6 classifiers × 24 graph signals), 123 (85.4%) yield an improvement in F1-score relative to the transaction-only baseline, while only 21 (14.6%) result in a decrease. Averaged across all graph signal combinations, graph augmentation leads to a mean F1-score increase of +0.031, indicating that performance gains are not confined to isolated configurations but occur consistently across models and signal types. The corresponding variability of these performance estimates across random seeds is reported in Appendix B, allowing effect magnitude and stability to be assessed jointly.

B Standard deviation per individual graph signal

Figure 6: Standard deviation of F1-scores across random seeds. Values report the standard deviation of F1-scores over the trimmed middle eight runs.
Darker shading indicates higher variability. Graph signals are grouped by category.

This appendix reports the standard deviation of F1-scores across random seeds for each classifier–graph signal combination. Together with the mean performance values in Appendix A, these results allow the stability of graph-derived signals to be assessed alongside their average effect size.

C Statistical Significance Analysis - McNemar per individual signal

Figure 7: Signal-level aggregation of McNemar test outcomes across classifiers and runs (p ≤ 0.05). Cells show the number of statistically significant improvements versus degradations (#better / #worse) for each classifier–graph signal combination, aggregated over ten random seeds. Color intensity reflects the net balance between improvements and degradations.

Across all evaluated configurations, a total of 1,440 paired comparisons were conducted (6 classifiers × 24 graph signals × 10 random seeds).
Out of these, 628 comparisons (43.6%) show a statistically significant improvement over the transaction-only baseline at a significance level of p ≤ 0.05, whereas only 101 comparisons (7.0%) exhibit a significant degradation. This asymmetry indicates a strong overall bias toward beneficial effects of graph-based augmentation.

D Robustness Analysis under Graph Perturbations

Figure 8: Performance comparison under structural graph perturbations with 0% edge removal (transaction-only reference). Cell values report mean F1-scores with standard deviations across random seeds.
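The "mean ±std" cells reported in Figures 8-10 can be reproduced from per-seed scores with the standard library; the formatting helper below and its example scores are our own, not the paper's code.

```python
from statistics import mean, stdev

def cell(per_seed_f1):
    """Format a list of per-seed F1 scores as a 'mean ±std' table cell."""
    return f"{mean(per_seed_f1):.3f} ±{stdev(per_seed_f1):.3f}"

# Illustrative per-seed F1 scores for one classifier-signal combination.
seed_f1 = [0.905, 0.906, 0.898, 0.910, 0.903, 0.901, 0.909, 0.904, 0.905, 0.899]
print(cell(seed_f1))  # 0.904 ±0.004
```

Reporting the spread next to the mean is what allows effect size and run-to-run stability to be read jointly from a single table cell.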
Figure 9: Performance comparison under moderate structural degradation with 25% random edge removal.

Figure 10: Performance comparison under stronger structural degradation with 50% random edge removal.

Figure 11: Classifier-specific robustness trends illustrating how average ∆F1 improvements relative to the transaction-only baseline evolve under increasing levels of random edge removal (0%, 25%, 50%) for different graph signal categories.

E Classification Hyperparameters

This appendix reports the complete hyperparameter search spaces used for all evaluated classifiers. All hyperparameters listed here are optimized via Bayesian optimization using Hyperopt with a fixed budget of 50 trials per configuration. Parameters not listed are kept at library default values.

Table 2: Hyperparameter search spaces for all evaluated classifiers. Only the parameters listed are optimized via Bayesian optimization (Hyperopt) with a fixed trial budget; all remaining parameters are kept at library default values.
Logistic Regression
  penalty: l2, None
  C: log-uniform [0.001, 10]
  solver: lbfgs, saga
  balancing strategy: none, oversampling, class weights
  use_emb_pca: False, True
  emb_pca_n: 16, 32, 48, 64, 0.95

Random Forest
  n_estimators: quniform [100, 600], step 25
  max_depth: None, 5, 8, 12, 16, 20, 30, 40
  min_samples_split: quniform [2, 10], step 1
  min_samples_leaf: quniform [1, 5], step 1
  max_features: sqrt, log2, None
  balancing strategy: none, oversampling, class weights

SVC (linear kernel)
  C: log-uniform [10^-3, 10^2]

SVC (RBF kernel)
  C: log-uniform [10^-3, 10^2]
  gamma: log-uniform [10^-4, 10^1]

XGBoost
  n_estimators: quniform [50, 500], step 25
  max_depth: quniform [2, 8], step 1
  learning_rate: log-uniform [0.01, 0.3]
  subsample: uniform [0.6, 1.0]
  colsample_bytree: uniform [0.6, 1.0]
  min_child_weight: log-uniform [10^-1, 10]
  gamma: log-uniform [10^-3, 1]
  reg_alpha: log-uniform [10^-6, 1]
  reg_lambda: log-uniform [10^-3, 10]
  balancing strategy: none, oversampling

MLP
  hidden_layer_sizes: choice from (64), (128), (256), (64, 32), (128, 64), (128, 128), (256, 128)
  alpha: log-uniform [10^-6, 10^-2]
  lr_init: log-uniform [10^-4, 5×10^-2]
  activation: relu, tanh
  batch size: 64, 128, 256
  balancing strategy: none, oversampling

Naive Bayes
  var_smoothing: log-uniform [10^-6, 10^-1]
  balancing strategy: none, oversampling, class prior
  oversample ratio: uniform [0.5, 1.0]
  scaler: standard, minmax

F Graph Signal Generation Parameters

This appendix summarizes all parameters used for the generation of graph-derived signals. All graph signals are computed once on the fixed transaction graph and reused across all evaluation runs to ensure paired comparability.

Table 3: Graph signal generation parameters used across all evaluated signal categories. All graph-derived signals are computed once on the fixed transaction graph and reused across all evaluation runs to ensure paired comparability.
PageRank
  Damping factor α = 0.85; maximum iterations: 100; convergence tolerance: 10^-8; directed graph

Betweenness Centrality
  Approximate computation for large graphs; sampled nodes: min(2000, |V|); undirected graph

Eigenvector Centrality
  Computed per connected component; initialization: uniform; maximum iterations: 1000; convergence tolerance: 10^-6; undirected graph

Closeness Centrality
  Computed per connected component; undirected graph

Infomap
  Directed graph; unweighted edges

Leiden
  Undirected graph; resolution parameter: default

Proximity-Based Embeddings
  Number of walks per node: 10; walk length: 20; context window size: 10; embedding dimension: 64

role2vec
  Embedding dimension: 64; number of walks: 10; walk length: 20

GraphWave
  Target embedding dimension: 64; number of diffusion scales: 6; diffusion scale range τ ∈ [10^-2, 10^1] (log-spaced); eigen-decomposition: exact dense for small components, truncated sparse for components ≥ 1000 nodes; retained eigenpairs: 32; Gaussian random projection

ffstruc2vec
  Embedding dimension: 64; number of walks: 10; walk length: 20; number of layers considered: 2; cost function: mean- and variance-based structural distance; layer weights: layers 0–2: 1.0, layers ≥ 3: 0.0

Table 4: Hyperparameter search space for GNN-based models. The table reports the ranges and options considered during automated Bayesian hyperparameter optimization for GCN, GAT, and GCL-based models. All configurations are evaluated under identical optimization budgets and selection criteria.
GCN
  Reduce function: sum, max, mean
  Message function: SAFE_MSG_FUNCS (DGL message functions)
  Activation: relu, elu, gelu, leaky_relu, silu
  Bias: True, False
  #Layers: 1, ..., 4
  Hidden sizes: layer 2: 8–127; layer 3: 8–63; layer 4: 8–31
  Balancing strategy: class weights, oversampling, none
  Oversample ratio: uniform [0.1, 1.0]
  Dropout rate: uniform [0.0, 0.3]
  Learning rate: log-uniform [10^-4, 5×10^-2]
  Monotonic width constraints: for ≥3 layers, layer 3 < layer 2; for ≥4 layers, layer 4 < layer 3

GAT
  Optimizer: Adam, SGD, Adadelta, Adagrad, AdamW, Adamax, ASGD, RMSprop, Rprop
  Activation: relu, elu, gelu, leaky_relu, silu
  #Layers: 1, ..., 4
  Hidden sizes: layer 2: 8–127; layer 3: 8–63; layer 4: 8–31
  Balancing strategy: class weights, none
  Dropout rate (post-layer): uniform [0.0, 0.3]
  Learning rate: log-uniform [10^-4, 5×10^-2]
  Attention heads: 1, 2, 4, 8
  LeakyReLU negative slope: uniform [0.01, 0.3]
  Residual: True, False
  Attention dropout: uniform [0.0, 0.4]
  Feature dropout: uniform [0.0, 0.4]
  Monotonic width constraints: for ≥3 layers, layer 3 < layer 2; for ≥4 layers, layer 4 < layer 3

GCL
  hidden_dim: 32, 64, 128
  learning_rate_pretrain: log-uniform [5×10^-4, 5×10^-3]
  edge_drop_prob: uniform [0.1, 0.4]
  learning_rate_classifier: log-uniform [5×10^-4, 5×10^-2]
  weight_decay_classifier: log-uniform [10^-6, 10^-2]
  class_weight_base: log-uniform [0.5, 5.0]
  class_weight_exp: log-uniform [0.5, 3.0]
  encoder_lr_factor: 0.05, 0.1, 0.2, 0.3

All hyperparameters are optimized via Bayesian optimization (Hyperopt) with a fixed trial budget.

G Peak Performance Gains per Graph Signal Category

Figure 12 complements the average performance analysis in Section 8.2 by illustrating the best-case performance attainable per graph signal category across all classifiers.
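The best-case aggregation shown in Figure 12 reduces to taking, for each signal category, the maximum F1-score over all classifiers. A minimal sketch of that reduction, where every score is a hypothetical placeholder rather than a value from the paper:

```python
# Sketch of the aggregation behind Figure 12: best-case F1 per graph
# signal category across classifiers. All scores below are hypothetical
# placeholders, not results reported in the paper.
scores = {
    ("Proximity", "XGB"): 0.948,
    ("Proximity", "RF"): 0.941,
    ("GNN", "MLP"): 0.945,
    ("Centrality", "XGB"): 0.930,
}

# Keep the maximum F1 seen for each category over all classifiers.
peak = {}
for (category, _classifier), f1 in scores.items():
    peak[category] = max(peak.get(category, 0.0), f1)

# Categories ranked by their best attainable F1-score, i.e. the
# left-to-right ordering of bars in a figure like Figure 12.
ranking = sorted(peak.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Sorting the per-category maxima in descending order reproduces the kind of category ordering plotted on the figure's x-axis.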
[Figure 12: bar chart of the maximum F1-score (y-axis roughly 0.915–0.950) per graph signal category, ordered Proximity, GNN, Community, Spectral, Centrality, Cohesion, Structure.]

Figure 12: Maximum F1-score achieved per graph signal category across all evaluated classifiers. The dashed horizontal line denotes the best-performing transaction-only (TRX) baseline.
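The fixed-budget search described in Appendix E can be approximated with a stdlib-only sketch: random sampling stands in for Hyperopt's TPE, the samplers mirror the quniform and log-uniform primitives referenced in Tables 2 and 4, and the objective is a hypothetical placeholder for the cross-validated F1-score.

```python
import math
import random

random.seed(0)

def quniform(lo, hi, step):
    # Hyperopt-style quantized uniform: round(uniform(lo, hi) / q) * q.
    return round(random.uniform(lo, hi) / step) * step

def loguniform(lo, hi):
    # Hyperopt-style log-uniform: exp(uniform(log(lo), log(hi))).
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_config():
    # Illustrative slice of the XGBoost space in Table 2.
    return {
        "n_estimators": int(quniform(50, 500, 25)),
        "max_depth": int(quniform(2, 8, 1)),
        "learning_rate": loguniform(0.01, 0.3),
        "subsample": random.uniform(0.6, 1.0),
        "balancing_strategy": random.choice(["none", "oversampling"]),
    }

def evaluate(config):
    # Hypothetical objective; the protocol evaluates a cross-validated
    # F1-score here.
    return random.random()

# Fixed budget of 50 trials per configuration, as in Appendix E.
best_score, best_config = -1.0, None
for _ in range(50):
    config = sample_config()
    score = evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
```

With Hyperopt itself, the same space would instead be declared with hp.quniform, hp.loguniform, and hp.choice, and minimized via fmin with tpe.suggest and max_evals=50.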