
Paper deep dive

GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation

Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca

Year: 2026 | Venue: arXiv preprint | Area: cs.LG | Type: Preprint | Embeddings: 43

Abstract

There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, and despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred. In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%) · safety-evaluation (suggested, 80%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/13/2026, 1:04:12 AM

Summary

The paper introduces GraGOD, an open-source, modular framework for Time Series Anomaly Detection (TSAD) using Graph Neural Networks (GNNs). It addresses the lack of standardized evaluation practices in the field by providing tools for reproducible experimentation, comparing GNN-based architectures against baselines on real-world datasets (TELCO and SWaT), and critically analyzing the limitations of common point-wise and range-based evaluation metrics.

Entities (7)

GNN · machine-learning-model · 100%
GraGOD · software-framework · 100%
SWaT · dataset · 100%
TELCO · dataset · 100%
TSAD · research-field · 100%
GDN · machine-learning-model · 95%
MTAD-GAT · machine-learning-model · 95%

Relation Signals (3)

GNN used in TSAD

confidence 100% · There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs).

GraGOD supports TSAD

confidence 95% · GraGOD is designed as a collaborative, research-oriented framework where new models, datasets, and metrics can be seamlessly integrated for TSAD.

GraGOD evaluates SWaT

confidence 90% · Using our framework, we conduct a systematic comparative study of representative GNN-based methods and baselines across datasets... the SWaT dataset.

Cypher Suggestions (2)

Find all GNN-based models evaluated in the framework. · confidence 90% · unvalidated

MATCH (m:Model)-[:EVALUATED_IN]->(f:Framework {name: 'GraGOD'}) WHERE m.type = 'GNN' RETURN m.name

List datasets used for benchmarking TSAD models. · confidence 85% · unvalidated

MATCH (d:Dataset)<-[:BENCHMARKED_ON]-(m:Model) RETURN DISTINCT d.name

Full Text

43,033 characters extracted from source content.


GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation

Federico Bello (1), Gonzalo Chiarlone (1,2), Marcelo Fiori (1,3,a), Gastón García González (1,b) and Federico Larroca (1,3,c)

(1) Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
(2) Pento, Montevideo, Uruguay
(3) Centro Interdisciplinario en Ciencia de Datos y Aprendizaje Automático (CICADA), Universidad de la República, Uruguay
{federico.bello, gonzalo.chiarlone, mfiori, gastong, flarroca}@fing.edu.uy

Keywords: Multivariate Time Series, Graph Neural Networks, Evaluation Metrics, Score-based Anomaly Detection, Methodological Assessment

Abstract: There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, and despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred.
In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.

arXiv:2603.09675v1 [cs.LG] 10 Mar 2026

1 INTRODUCTION

Anomaly detection plays a central role in domains such as fraud detection (Hilal et al., 2022), cybersecurity (Siddiqui et al., 2019), industrial monitoring (Nizam et al., 2022), and medical diagnostics (Spence et al., 2001). Within this broad area, Time Series Anomaly Detection (TSAD) focuses on identifying unexpected behaviors in temporally ordered data (Shaukat et al., 2021).

(ORCID: a https://orcid.org/0000-0002-3732-1778, b https://orcid.org/0009-0002-6652-7713, c https://orcid.org/0000-0001-7893-2201)

In recent years, and driven by its success in other domains, Deep Learning (DL) has been increasingly applied to TSAD (Zamanzadeh Darban et al., 2024a). The typical pipeline for anomaly detection using deep learning consists of two key components: a backbone model and a scoring module (Jin et al., 2024). The backbone is trained under the assumption that most data is normal, and the scoring module flags deviations via reconstruction or prediction errors, serving as proxies for identifying unexpected patterns. However, standard DL models often treat multivariate time series as sequences of independent feature vectors, neglecting structural dependencies that may be essential for accurate and interpretable detection. Graph Neural Networks (GNNs), designed to operate on graph-structured data, have shown promise for modeling such dependencies. By representing time series as graphs, GNNs enable the joint modeling of temporal dynamics and inter-variable dependencies through message passing, effectively capturing complex relational structures (Chen et al., 2022b; Deng and Hooi, 2021).
This ability has spurred a growing interest in graph-based TSAD (Jin et al., 2024), where GNNs act as backbones for reconstruction- or prediction-based anomaly scoring. Yet, despite promising performance, the field remains fragmented: implementations are rarely comparable, evaluation practices vary widely, and metric design often leads to inconsistent or misleading conclusions. As a result, progress is difficult to quantify and reproduce.

To address these issues, we introduce a unified, modular, and open-source framework for graph-based TSAD. (1) Built in PyTorch with reproducibility and extensibility in mind, the framework provides standardized procedures for data handling, model configuration, and evaluation. It natively supports both graph-based and non-graph-based approaches, enabling fair comparisons across modeling paradigms. Crucially, it integrates a diverse set of evaluation metrics, from classical point-wise precision and recall to range-based (Lee et al., 2018; Tatbul et al., 2018) and threshold-agnostic measures such as the Volume Under Surface (VUS) (Paparrizos et al., 2022), offering a consistent environment for methodological analysis.

Using our framework, we conduct a systematic comparative study of representative GNN-based methods and baselines across datasets with contrasting structural characteristics. The results reveal how graph topology, thresholding strategy, and metric design interact to influence performance and interpretability. In particular, we find that attention-based GNNs offer robustness to uncertainty in graph structure while improving interpretability by localizing anomalies to specific nodes. Conversely, we show that common evaluation practices, especially those relying solely on point-wise or threshold-dependent metrics, can obscure genuine model differences.

(1) The source code and configuration files for our framework are available at https://github.com/GraGODs/GraGOD.
Beyond empirical benchmarking, this work contributes methodological insights into how graph-based representations and evaluation metrics shape the behavior of TSAD systems. The proposed framework establishes a reproducible foundation for future research in pattern recognition of time series over graphs, facilitating the development of more reliable and interpretable anomaly detection methods.

The rest of this paper is structured as follows. Section 2 formalizes the problem of time series anomaly detection and presents the models and datasets considered in this work. In Sec. 3 we discuss the main methodological challenges in TSAD evaluation, reviewing the limitations of conventional point-wise metrics (e.g. precision and recall), but also their range-based extensions which attempt to account for the temporal extent of anomalies. This section also introduces the proposed framework and its design principles for reproducible experimentation. Equipped with our framework, Sec. 4 presents and discusses the benchmark results obtained with different models, metrics, and graph topologies. Finally, Sec. 5 concludes the article.

2 Problem Statement, Methods and Datasets

Classic TSAD methods have been classified into taxonomies by several authors (Zamanzadeh Darban et al., 2024b; Boniol et al., 2024; Blázquez-García et al., 2021). At the coarsest level there are some common groups (possibly overlapping) such as Statistical-, Clustering-, Distance-, or Density-based, as well as Forecasting- or Reconstruction-based techniques, which are the focus of this work. Forecasting-based detection builds a model (statistical or machine-learning-based) to predict the next point in the time series. Anomaly scores are then obtained from the residual of the predicted and the real value, and points whose errors exceed a threshold are flagged.
Reconstruction-based detection trains an autoencoder, PCA, or matrix-factorization model to compress and then reconstruct windows of observations. If a window cannot be accurately reconstructed (i.e., its reconstruction error is high), the corresponding region is deemed anomalous.

Numerous methods leveraging GNNs for TSAD have been proposed in recent years. For instance, more than thirty such works are discussed in the comprehensive review by (Jin et al., 2024), to which we refer the reader for further details. Across these GNN-based approaches for multivariate TSAD, researchers have explored a diverse set of modeling tools to capture spatiotemporal dependencies and distinguish normal from anomalous behavior. Some methods (Deng and Hooi, 2021; Zhao et al., 2020) learn explicit dependency graphs among variables using attention mechanisms to perform predictive or reconstructive modeling, thereby offering interpretable relations between sensors. Others (Dai and Chen, 2022; Chen et al., 2022a) adopt probabilistic formulations, leveraging normalizing flows for likelihood-based anomaly scoring or combining variational inference with graph convolution and recurrent units to model uncertainty and temporal dynamics. Another family (Zhang et al., 2022; Han and Woo, 2022) focuses on relational or sparse graph learning, embedding graph-structure discovery directly within autoencoder or forecasting architectures to capture hidden dependencies. We now briefly present how the TSAD problem is formulated in this context, and describe the two state-of-the-art methods of the first family, included in this framework.

Let X be a set of N ∈ ℕ* distinct time series. We will denote a given time series as X_i = [x_i^(1), x_i^(2), ..., x_i^(T)], where x_i^(t) ∈ ℝ, and T is the length of the time series.
Depending on the application or the labeling of the data, the goal is to detect either an anomaly at a specific time series i ∈ [1 ... N], or an anomaly at a global level, meaning that the system presents an anomaly at a certain time.

The time series may inherently be connected through an underlying graph structure, which may be explicit, as is common in scenarios like sensor networks, or implicit, like causal dependencies in financial markets. In the latter case, a key assumption is that there exists some kind of correlation or dependency between some of the time series, and therefore a graph can (must) be inferred. In this work we use datasets from both scenarios, presented below. Therefore, given the multivariate set of N time series, we will consider a graph G with N nodes, each of which corresponds to a certain time series. The structure of the graph (i.e., the edges and their weights) plays a fundamental role. As mentioned, this structure may come beforehand from the problem itself, like an industrial pipeline with sensors, or may be abstract and learned from the data.

The experimental framework compares multiple GNN-based models, which produce their outputs by operating on the graph through message passing, alongside a structure-agnostic model serving as a benchmark. Each model is trained uniformly, functioning either as a forecaster or reconstructor. The models receive as input a window of datapoints of size w, defined as:

X^(t) = [x^(t−w+1), x^(t−w+2), ..., x^(t)] ∈ ℝ^(N×w),    (1)

where each x^(τ) is composed of the τ-th datapoint of each time series, i.e. x^(τ) = (x_1^(τ), x_2^(τ), ..., x_N^(τ)), and produce either an estimate x̂^(t+1) (forecaster) or X̂^(t) (reconstructor).

The approaches based on GNNs compute these outputs by combining the time series using different architectures and supporting graphs. As previously mentioned, the graph can either be learned from the data (e.g. through correlations) or provided by the user.
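The windowing in Eq. (1) can be sketched in a few lines of NumPy (a minimal illustration, not GraGOD's actual data loader; the function name and array layout are assumptions):

```python
import numpy as np

def make_windows(series: np.ndarray, w: int) -> np.ndarray:
    """Build sliding windows X^(t) = [x^(t-w+1), ..., x^(t)] of shape (N, w)
    from a multivariate series of shape (N, T). Returns (T - w + 1, N, w)."""
    N, T = series.shape
    return np.stack([series[:, t - w + 1 : t + 1] for t in range(w - 1, T)])

# Toy example: N = 3 series of length T = 10, window size w = 4.
X = np.arange(30).reshape(3, 10)
windows = make_windows(X, w=4)
print(windows.shape)  # (7, 3, 4)
```

Each window stacks the last w observations of every series, which is exactly the input a forecaster consumes to predict x̂^(t+1).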
For example, assume we are using a forecasting-based method and we have an actual network G connecting the nodes, which we will represent through the adjacency matrix A (which may include weights). Then, the trained GNN-based forecaster is a function Φ(A, θ, X^(t)) = x̂^(t+1), where each node has an associated w-dimensional signal as the input, and a 1-dimensional signal as the output. Note that A is fixed throughout all values of t even if the architecture uses attention mechanisms. Furthermore, if we estimate/infer the graph (and thus the adjacency matrix), we perform this estimation once, meaning that Â is also fixed for all values of t (see the discussion in Sec. 4.2).

Given a forecasting- or reconstruction-based method, anomaly scores computed as prediction/reconstruction errors are used to flag an anomaly when they are above a certain threshold. This anomaly may be at the node level (i.e., large errors in predicting/reconstructing an individual time series) or graph level (i.e. considering the error in all time series).

Methods. Currently, our experimental framework incorporates four distinct models. Firstly, a structure-agnostic Gated Recurrent Unit (GRU) and a custom-designed Graph Convolutional Network (GCN), both serving as baselines for comparative analysis. The GCN operates on a fixed given graph structure, and the anomaly scores are computed as forecasting errors.

We have also included two state-of-the-art GNN-based models, which we now briefly describe. The Graph Deviation Network (GDN) (Deng and Hooi, 2021) is a deep learning-based approach that learns the structure of dependencies between variables and, using this graph and an attention mechanism, produces a prediction of the next value. The Multivariate Time-series Anomaly Detection via Graph Attention Network (MTAD-GAT) (Zhao et al., 2020), similar to GDN, also uses a GNN approach.
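The score-and-threshold step described above, with node-level and graph-level flags, can be sketched as follows (a simplified stand-in for the framework's scoring module, using a plain absolute prediction error; the max-over-nodes aggregation and the threshold value are assumptions for illustration):

```python
import numpy as np

def anomaly_scores(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Node-level scores: absolute forecasting error per series, shape (N, T)."""
    return np.abs(y_true - y_pred)

def flag_anomalies(scores: np.ndarray, threshold: float):
    """Node-level flags, plus graph-level flags obtained by aggregating
    scores across nodes (here: max) before thresholding."""
    node_flags = scores > threshold
    graph_flags = scores.max(axis=0) > threshold
    return node_flags, graph_flags

# Two series, three timesteps; series 1 has a large error at t = 2.
y_true = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 8.0]])
y_pred = np.array([[1.1, 0.9, 1.0], [2.0, 2.1, 2.0]])
node, graph = flag_anomalies(anomaly_scores(y_true, y_pred), threshold=1.0)
print(graph)  # [False False  True]
```

The node-level flags also localize the anomaly to series 1, which is the interpretability benefit the paper highlights.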
The key of this model is the use of two different GATs, a feature-oriented GAT and a time-oriented GAT, to map both the relationships between the features and the temporal dependencies. The training involves both a reconstruction and a forecasting model at the same time. The anomaly score is then computed as a combination of both the forecast and reconstruction errors.

Datasets. To evaluate these methods we have chosen two representative datasets: the TELCO and SWaT datasets. The TELCO dataset (González et al., 2024) consists of twelve distinct time series. Each one represents common metrics tracked by a mobile internet service provider (normalized and anonymized), such as the quantity and value of prepaid data transfer fees, the number and cost of calls, the volume of data traffic, and additional related data. The dataset spans seven months, divided into three months for training, one month for validation, and three months for testing. The most noticeable aspect is the significant imbalance between anomaly and normal data within the dataset, a common characteristic in anomaly detection datasets. A key element is the absence of an explicit graph structure. While the time series data may exhibit correlations, there is no physical structure or defined relationship connecting the series.

The SWaT (Secure Water Treatment) dataset (Goh et al., 2017) is a widely used benchmark for evaluating anomaly detection methods. It consists of time series data collected from a scaled-down six-stage water treatment plant that replicates real-world industrial control systems. The dataset includes 51 physical and network-related features and spans eleven days, where the initial seven days record normal system operations, and the subsequent four days contain both normal and attack scenarios. The attacks were intentionally introduced and encompass both cyber and physical threats to the system.
Notably, the SWaT dataset exhibits an inherent relational structure, as sensors within the same treatment stage often measure correlated physical properties like flow rate, pressure, or water level. An undirected graph is constructed where an edge is added between two nodes if the sensors measure similar properties or are in the same stage of the process.

Figure 1: Example of point-wise evaluation limitations. Although only one long anomaly is detected, point-wise metrics report a high Recall (0.8) and perfect Precision (1.0). This gives the false impression of good performance, despite the fact that most anomaly ranges in the dataset remain undetected.

Naturally, both TELCO and SWaT have limitations, including mislabeled anomalies, distribution shifts, and run-to-failure bias, that affect model evaluation. Despite these challenges, they can serve as useful benchmarks when paired with qualitative analysis. Effective preprocessing, such as removing redundancies, fixing labels, and handling distribution shifts, is essential for reliable and fair model comparisons.

3 Challenges in current TSAD methodologies

Point-wise Metrics. Despite the growing interest in TSAD, the evaluation of model performance remains a challenging and often overlooked aspect. Many existing works still rely on Precision, Recall (sensitivity) and F1-score, based on the classification of each time point as either normal or anomalous. However, these point-wise metrics present significant limitations, as they fail to capture the sequential nature and typical range-based structure of anomalies in time series (Tatbul et al., 2018).

In practice, it is often more important to detect as many distinct anomaly ranges as possible, even if their exact boundaries are missed, since identifying the occurrence of each anomaly is typically more valuable than precisely locating every anomalous point. For instance, consider the example in Fig.
1, based on the SWaT dataset, which presents a very similar pattern with a single long anomaly and several short ones. In this example, only the long anomaly is correctly detected. Despite this, point-wise metrics report a Recall of 0.8 and a Precision of 1.0, suggesting high performance. In reality, however, the model misses the majority of the anomaly ranges in the dataset, highlighting a major shortcoming of these metrics.

Range-based metrics. To address these limitations, range-based variants of Precision, Recall, and F1-score (denoted P_T, R_T and F1_T respectively in the sequel) have been proposed (Tatbul et al., 2018). These metrics account for the temporal extent of anomalies by rewarding partial overlap, penalizing fragmentation, and weighting detections by positional relevance, offering a more faithful evaluation than point-wise measures.

Figure 2: Range-based recall example. R1 and R3 get a high existence reward, R2 none. R1 obtains a high size score, R3 a low one, and R2 none. Cardinality is high for R3 but lower for R1, since the anomaly is detected as two separate segments instead of one. No position reward is considered here.

For example, range-based Recall R_T evaluates how effectively a detector identifies true anomalous intervals by checking whether an anomaly is detected at all, even if partially (existence reward); how much of its duration is correctly identified (size); which parts are detected, e.g., if early detection is more critical (position); and whether it is reported as a single continuous range or split into fragments (cardinality). An illustrative example is shown in Fig. 2.

These metrics require careful configuration, as their (several) parameters strongly influence results, and poor choices can lead to misleading conclusions. For example, in datasets with long anomalies, neglecting the cardinality component may allow multiple overlapping predictions on the same anomaly to artificially boost precision.
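The existence and size rewards of range-based recall can be illustrated with a deliberately simplified sketch (only two of the four components of Tatbul et al. (2018); the position and cardinality terms, the alpha weighting, and the half-open range convention are assumptions made here for brevity):

```python
def range_recall(true_ranges, pred_ranges, alpha=0.5):
    """Simplified range-based recall over half-open ranges (start, end):
    for each true anomaly range, combine an existence reward (any overlap)
    with a size reward (fraction of the range covered by predictions)."""
    def covered(rng, preds):
        s, e = rng
        pts = set(range(s, e))
        hit = set()
        for ps, pe in preds:
            hit |= pts & set(range(ps, pe))
        return len(hit) / len(pts)

    total = 0.0
    for rng in true_ranges:
        frac = covered(rng, pred_ranges)
        existence = 1.0 if frac > 0 else 0.0
        total += alpha * existence + (1 - alpha) * frac
    return total / len(true_ranges)

# One long anomaly fully detected, two short ones entirely missed:
truth = [(0, 50), (60, 63), (70, 73)]
preds = [(0, 50)]
print(range_recall(truth, preds))  # ≈ 0.33: only 1 of 3 ranges rewarded
```

Unlike point-wise recall, which would score 50/56 ≈ 0.89 on this example, the range-based view reflects that two of the three anomaly events were missed.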
Figure 3 illustrates this risk. Here, a long anomaly is detected through several fragmented predictions, with an additional long (and mostly false) detection across the rest of the timeline. Yet, if range-based Recall is configured to reward only the existence of overlap, the score reaches 1.0, since every anomaly is at least partially detected. Likewise, neglecting the cardinality penalty in range-based Precision allows the multiple predictions within the long anomaly to counterbalance the extended false positive, producing a value close to 1.0 despite poor detection quality.

Figure 3: Example of how range-based metrics' configuration can misrepresent performance. The model produces an overall poor prediction: the long anomaly is detected through multiple fragmented predictions, and several extended false positives occur across the timeline. However, under certain range-based metric configurations, such as existence-only recall and precision without cardinality penalty, the evaluation yields a high performance, masking the model's true shortcomings.

Threshold-agnostic Metrics. To avoid the dependence on a specific value of the threshold, we will use the recently proposed Volume Under Surface (VUS), in its VUS-ROC and VUS-PR variants (Paparrizos et al., 2022). This metric generalizes the traditional AUC concept from binary classification by integrating model performance over multiple buffer sizes around the annotated anomaly ranges. This allows for a continuous assessment of robustness to label imprecision and misalignment. VUS computes the volume under the surface generated by simultaneously varying both the decision threshold and the buffer parameter, eliminating dependence on specific hyperparameters or fixed thresholds. As a result, it provides a more robust and comprehensive evaluation for anomaly detection models.

Lack of framework.
One of the most persistent challenges in TSAD research is the absence of a unified, standardized framework for systematically comparing detection methods. Existing implementations are typically tied to specific models, datasets, experimental setups, and metric choices, some of which, as discussed earlier, can produce misleading conclusions. This fragmentation limits reproducibility, constrains the scope of comparative studies, and makes it difficult to assess the real-world applicability of proposed models.

To address this gap, we introduce GraGOD, a modular and extensible open-source framework for the evaluation and comparison of machine learning and deep learning-based TSAD models. Unlike existing solutions (DHI, 2025) that offer limited flexibility, GraGOD is designed as a collaborative, research-oriented framework where new models, datasets, and metrics can be seamlessly integrated. Its architecture natively supports both graph-based and non-graph-based methods, enabling fair and transparent comparison across different paradigms.

GraGOD provides a comprehensive experimental management system for TSAD, with a focus on GNN research. It supports end-to-end experimentation, including data preprocessing, model training, prediction, and hyperparameter tuning. The framework's command-line interface allows users to orchestrate these processes, ensuring reproducibility and control over experiment configuration. GraGOD also integrates automated computation of all the metrics mentioned in this work, and provides modules for visualizing anomalies, helping researchers interpret model behaviors beyond numerical scores.

From a development perspective, GraGOD enforces a consistent project structure for code organization, dataset management, and result logging. It supports iterative experimentation and scalable execution, making it suitable for large-scale data analysis and computationally intensive model tuning.
The framework's design encourages community contributions by simplifying the addition of new datasets, models, and evaluation metrics through a well-documented API. Ultimately, GraGOD establishes a reproducible and extensible foundation for TSAD research, accelerating methodological progress and promoting transparent, comparable experimentation.

4 Benchmark

This section presents the experimental results, focusing on the impact of graph topology, threshold selection strategies, model interpretability, and the limitations of standard training paradigms. Training details can be consulted in the source code, although we highlight that thresholds (when used) were chosen to maximize the F1 score on the validation dataset, whereas the configuration of the VUS metrics is left as suggested in the original paper (Paparrizos et al., 2022).

4.1 Initial Benchmark Results

Table 1 reports baseline results for the four anomaly detection models on both datasets. In this first experiment, all graph-based models used a fully connected graph topology. A first observation is that the GDN, MTAD-GAT, and GRU models consistently achieve similar VUS scores, suggesting their predictive abilities are closely matched. It is interesting to note that the VUS metrics are much lower for the TELCO dataset than for SWaT. This is indicative of poor separability between normal and anomalous scores. Since a random classifier yields a VUS-ROC of 0.5, the low VUS-ROC indicates that the models cannot easily distinguish anomalies.

Furthermore, it is important to highlight the clear mismatch between VUS and threshold-dependent metrics: high performance in the former does not necessarily translate into strong

Table 1: Test metrics for the TELCO and SWaT datasets using fully connected graphs.
Dataset  Model     P     R     F1    P_T   R_T   F1_T  VUS-ROC  VUS-PR
SWaT     GCN       0.80  0.79  0.80  0.06  0.33  0.10  0.72     0.55
         GDN       1.00  0.75  0.85  1.00  0.07  0.13  0.85     0.73
         MTAD-GAT  0.00  0.00  0.00  0.00  0.00  0.00  0.88     0.74
         GRU       0.98  0.76  0.86  0.08  0.24  0.12  0.86     0.77
TELCO    GCN       0.15  0.10  0.08  0.09  0.29  0.11  0.64     0.05
         GDN       0.32  0.18  0.11  0.30  0.48  0.25  0.62     0.08
         MTAD-GAT  0.39  0.16  0.10  0.34  0.44  0.25  0.61     0.07
         GRU       0.39  0.14  0.12  0.35  0.48  0.30  0.58     0.09

performance in the latter. As we discussed before, VUS aggregates performance over all possible thresholds, so it can remain high even if no single threshold yields satisfactory results. An extreme example is MTAD-GAT on SWaT, which achieves highly competitive VUS values yet produces no correct predictions when thresholded. This suggests a threshold selection issue rather than a fundamental model failure. Since the threshold was chosen to maximize F1 on the validation set, the problem arises from a shift in the score distribution between validation and test data. This is further confirmed when using more sophisticated selection methods, such as Otsu's algorithm (Yoon et al., 2022), which yields an excellent F1 = 0.8 but a disappointing F1_T = 0.07. In contrast, a dynamic threshold based on the rolling mean and standard deviation of recent scores produces F1 = 0.41 and F1_T = 0.25, which, although modest in absolute terms, are notably more consistent with each other than the results obtained with other methods. These findings underscore the critical role of both metric choice and robust thresholding strategies in TSAD evaluation.

The score distributions in Fig. 4 further illustrate these challenges. Each histogram shows the normal (green) and anomalous (red) scores in the test set for all four methods. GRU and GDN produce relatively well-separated distributions, which helps explain their stronger and more consistent results. In contrast, GCN and MTAD-GAT exhibit substantial overlap between normal and anomalous scores, making reliable threshold selection considerably harder.
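The two thresholding strategies contrasted above, a static F1-maximizing threshold fit on validation scores and a dynamic rolling-statistics threshold, can be sketched as follows (a minimal illustration, not the framework's implementation; the window length and the k-sigma factor are assumptions):

```python
import numpy as np

def best_f1_threshold(scores, labels):
    """Static threshold: pick the candidate maximizing point-wise F1 on a
    (validation) set. A later shift in the score distribution invalidates it."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        prec = tp / max(pred.sum(), 1)
        rec = tp / max((labels == 1).sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def rolling_threshold(scores, window=5, k=3.0):
    """Dynamic threshold: flag a point when its score exceeds the rolling
    mean plus k rolling standard deviations of the preceding window."""
    flags = np.zeros(len(scores), dtype=bool)
    for i in range(window, len(scores)):
        recent = scores[i - window : i]
        flags[i] = scores[i] > recent.mean() + k * recent.std()
    return flags

scores = np.array([0.1, 0.2, 0.1, 0.2, 0.1, 0.15, 5.0, 0.1])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 0])
print(best_f1_threshold(scores, labels))       # 5.0 on this toy data
print(rolling_threshold(scores).nonzero()[0])  # [6]
```

The dynamic variant adapts to drifting score distributions, which is why the paper finds it more consistent across F1 and F1_T than static selection under validation-test shift.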
For MTAD-GAT in particular, the chosen threshold (dashed vertical line) is clearly suboptimal for the test data, an outcome of the distribution shift between validation and test sets. When analyzing the same plot for the TELCO dataset, we observe that the score histograms are not well separated and lack a bimodal structure, which aligns with the lower VUS results reported earlier.

These experiments illustrate a downside of deriving anomaly scores from reconstruction or prediction losses, used as proxies for detecting abnormal behavior, particularly in terms of threshold selection.

Figure 4: Anomaly score distributions on the SWaT test set for all models (in logarithmic scale). GRU and GDN exhibit better separation between normal (green bars) and anomalous (red) scores, while GCN and MTAD-GAT exhibit significant overlap, complicating threshold selection.

While some evaluation metrics such as VUS are threshold-agnostic, real-world applications still require binary decisions, making threshold selection a critical step. We have just shown that poor detection performance often stems not from the thresholding method itself (which may even be adaptive), but from the non-discriminative nature of the score distributions produced by certain models. These findings highlight the limitations of proxy-based scoring and point toward the need for more task-aligned objectives, such as learning inherently discriminative representations through, for instance, contrastive learning.
4.2 Impact of Graph Topology

We now assess whether incorporating a graph structure improves performance by comparing the GCN and GDN models on various topologies: a fully connected graph (as before), the predefined graph (SWaT only) or the learned graph (GDN only), and finally a graph statistically inferred with the popular Meinshausen-Bühlmann (MB) method (Meinshausen and Bühlmann, 2006).

Table 2: VUS metrics (ROC and PR) for different graph topologies in the SWaT and TELCO datasets using the GDN and GCN models. The highest value is shown in bold and the second highest is underlined for each model.

Dataset  Model  Graph Topology   VUS-ROC  VUS-PR
SWaT     GCN    Fully Connected  0.72     0.55
                System Topology  0.79     0.53
                MB               0.87     0.76
                Random Graph     0.82     0.63
         GDN    Fully Connected  0.82     0.70
                GDN Graph        0.85     0.73
                System Topology  0.85     0.75
                MB               0.83     0.71
                Random Graph     0.83     0.70
TELCO    GCN    Fully Connected  0.64     0.05
                MB               0.62     0.04
                Random Graph     0.64     0.05
         GDN    Fully Connected  0.60     0.08
                GDN Graph        0.65     0.08
                MB               0.64     0.05
                Random Graph     0.67     0.09

In the SWaT dataset (see the upper portion of Table 2), which features an underlying physical structure, employing an informative graph topology significantly improves performance, particularly in the GCN case. Note that in the GDN case, although the system topology obtains the best overall results, its attention mechanism makes it robust to the choice of topology. Furthermore, and quite interestingly, GCN's best results are obtained when using the MB graph and not the system topology; the relationships between variables are thus better captured by the MB method.

On the other hand, in the TELCO dataset no consistent improvement was observed when using different topologies, with the best performance being achieved by a random graph. This suggests that better graph inference methods could lead to improved results, although it is not clear how to obtain them when the dataset lacks an explicit graph structure.
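The MB method can be sketched as a set of per-node lasso regressions: each variable is regressed on all the others, and an edge is kept wherever a coefficient survives the lasso's sparsity penalty. A minimal sketch assuming scikit-learn is available; the regularization strength `alpha` and the symmetrization rule are illustrative choices, not the settings used in our experiments:

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_graph(X, alpha=0.1, rule="or"):
    """Meinshausen-Buhlmann neighborhood selection (sketch).
    X is an (n_samples, n_vars) data matrix. For each variable i we fit
    a lasso regression of X[:, i] on the remaining variables; a directed
    edge (i, j) exists if j's coefficient is nonzero. The two directed
    estimates are then combined with the OR rule (either regression
    selects the edge) or the AND rule (both must select it)."""
    n, d = X.shape
    A = np.zeros((d, d), dtype=bool)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        coef = Lasso(alpha=alpha).fit(X[:, others], X[:, i]).coef_
        for c, j in zip(coef, others):
            A[i, j] = abs(c) > 1e-10
    return A | A.T if rule == "or" else A & A.T
```

The resulting boolean adjacency matrix can be handed directly to a GNN backbone as the inferred topology; in practice `alpha` controls graph sparsity and is itself a hyperparameter to tune.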
4.3 Metric-Loss correlation

As we discussed before, in many TSAD models training is performed using regression objectives (e.g., forecasting or reconstruction loss), while evaluation relies on classification metrics. This raises a fundamental question: does minimizing the regression loss improve anomaly detection performance?

To explore this, we analyzed the correlation between the validation loss, calculated over normal data, and the evaluation metrics, computed over the full validation set (including anomalies), across 200 trials from hyperparameter tuning of the GCN, GDN, and GRU models on the SWaT dataset. Pearson correlation was used to measure linear relationships, with the ideal scenario corresponding to a strong negative correlation (i.e., lower loss leads to better metric scores). Figure 5 shows the results.

Figure 5: Correlation between the different metrics and the validation loss for the different models.

       P      R      F1     P_T    R_T    F1_T   VUS-ROC  VUS-PR
GCN  -0.31  -0.48  -0.62   0.21  -0.47   0.06   -0.59    -0.59
GDN  -0.18  -0.12  -0.20   0.14  -0.15   0.13   -0.30    -0.23
GRU  -0.03   0.16  -0.13  -0.01   0.12  -0.04    0.24    -0.02

The GCN model, which performs worst, exhibits the strongest negative correlation between loss and VUS, suggesting that better forecasting results in improved anomaly detection. In contrast, the GDN and GRU models achieve superior metric scores but show weak or even positive correlations, implying that a better regression fit does not translate into better detection performance.

These findings suggest that optimizing purely for regression loss may be suboptimal. Alternative approaches, such as contrastive learning (Liu et al., 2022; Darban et al., 2025), which leverage anomaly labels during training to structure the feature space more effectively, could offer a more aligned and robust solution for anomaly detection tasks.

4.4 Interpretability analysis

Beyond accurate detection, anomaly detection models should help identify where anomalies originate.
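A simple way to attribute a detected anomaly to specific sensors, in the spirit of the score-based ranking analyzed in this section, is to rank sensors by their mean per-sensor error inside the detected window. This is a minimal sketch of that heuristic, not the framework's implementation; the function and argument names are illustrative:

```python
import numpy as np

def rank_sensors(errors, anomaly_window, top_k=5):
    """Rank sensors by their mean per-sensor anomaly score inside a
    detected anomaly window.
    `errors` is a (time, sensors) array of per-sensor prediction or
    reconstruction errors; `anomaly_window` is a (start, end) index pair.
    Returns the indices of the top_k sensors and their mean errors."""
    start, end = anomaly_window
    contrib = errors[start:end].mean(axis=0)   # mean error per sensor
    order = np.argsort(contrib)[::-1]          # highest contribution first
    return order[:top_k], contrib[order[:top_k]]
```

If the truly anomalous sensor appears among the top-ranked indices, the score-based attribution is consistent with the ground truth; as discussed below, attention weights can then refine this ranking toward physically related sensors.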
Graph-based models, especially GDN with attention mechanisms, provide a natural framework for this by modeling sensor dependencies and highlighting influential nodes.

To assess interpretability, we analyze each model's ability to attribute detected anomalies to specific sensors in the SWaT dataset, focusing on a known event affecting sensor FIT401. We compare GDN, using both the learned and the predefined SWaT topologies, to the GRU baseline. For GDN, we further examine the distribution of attention weights to identify which sensors most strongly influence the anomaly score.

All models consistently rank the true anomalous sensor among the top sensors during the anomaly period. This indicates that, while score-based methods can suggest likely affected sensors, interpretability still requires further analysis.

In particular, attention visualization for GDN with the SWaT topology reveals more coherent and physically meaningful patterns. Figure 6 shows attention distributions during the anomaly. The node with the highest anomaly score corresponds to FIT601, while the actual anomaly occurs in FIT401. However, the strongest attention edges are concentrated among FIT sensors (FIT401, FIT101, FIT201), all of which measure related physical quantities. This indicates that the model correctly focuses on a group of related sensors, improving interpretability and helping to identify a consistent set of potentially anomalous sensors. When other graph structures are used, attention becomes dispersed across unrelated nodes, reducing interpretability.

Figure 6: Visualization of attention during the anomaly in FIT401 using the SWaT topology. The node with the highest anomaly score is shown in orange; red edges indicate the five neighbors with the highest attention weights. Attention concentrates on physically connected nodes, improving interpretability and mapping directly to system flow: an anomaly in a FIT sensor.

Using a graph topology not only improves attention distributions, but also stabilizes prediction scores, as exemplified in Fig. 7. Here, GDN's graph-based approach keeps forecasts stable and restricts anomaly effects to the affected sensor (PIT501, in stage 5 of the process in this example), making fault localization straightforward. In contrast, GRU forecasts are less stable; an anomaly in stage 5 also deteriorates predictions for stage 2 (P203), making it difficult to pinpoint the true source of the anomaly.

Figure 7: Forecast comparison between GDN with graph topology (right) and GRU (left) on the SWaT dataset. The blue line represents the true values of the time series, the green line shows the forecasted values, and the red shaded regions indicate anomaly labels. GDN's use of a meaningful system topology results in stable forecasts and clear, localized anomaly detection. In contrast, GRU lacks this structure, leading to unstable forecasts and reduced interpretability, as anomalies can affect all sensors.

Thus, predefined topologies enhance interpretability by aligning attention with real system structure and stabilizing the predictions of the considered models.

5 Conclusions

In this work, we introduced a modular, open-source framework for evaluating graph-based models in TSAD. The framework enables reproducible experimentation across datasets, architectures, graph topologies, and evaluation metrics. Using it, we conducted a comparative study of several GNN-based methods and baselines across two real-world datasets with differing structural characteristics.
Our findings show that GNNs can provide competitive, and in some cases superior, performance in TSAD tasks, particularly when there is an underlying and explicit graph. More importantly, they offer improved interpretability by localizing anomalies to specific nodes in the input graph. We also found that attention-based GNNs are more robust to uncertainty in graph construction, making them attractive for use in semi-structured or anonymized datasets. Alongside model evaluation, we critically examined the limitations of commonly used performance metrics and scoring strategies. In particular, we highlighted how score distributions and threshold sensitivity can undermine the reliability of evaluation.

Looking ahead, our findings suggest that moving beyond proxy-based scoring (e.g., reconstruction error) could further improve TSAD systems. In this context, contrastive learning offers a promising direction for producing more discriminative anomaly scores directly aligned with the detection task (Liu et al., 2022; Darban et al., 2025), which we plan to explore and integrate into our framework in future work.

REFERENCES

Blázquez-García, A., Conde, A., Mori, U., and Lozano, J. A. (2021). A review on outlier/anomaly detection in time series data. ACM Comput. Surv., 54(3).

Boniol, P., Liu, Q., Huang, M., Palpanas, T., and Paparrizos, J. (2024). Dive into time-series anomaly detection: A decade review. arXiv preprint arXiv:2412.20512.

Chen, W., Tian, L., Chen, B., Dai, L., Duan, Z., and Zhou, M. (2022a). Deep variational graph convolutional recurrent network for multivariate time series anomaly detection. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 3621-3633. PMLR.

Chen, Z., Chen, D., Zhang, X., Yuan, Z., and Cheng, X. (2022b).
Learning graph structures with transformer for multivariate time-series anomaly detection in IoT. IEEE Internet of Things Journal.

Dai, E. and Chen, J. (2022). Graph-augmented normalizing flows for anomaly detection of multiple time series. In International Conference on Learning Representations.

Darban, Z. Z., Webb, G. I., Pan, S., Aggarwal, C. C., and Salehi, M. (2025). CARLA: Self-supervised contrastive representation learning for time series anomaly detection. Pattern Recognition.

Deng, A. and Hooi, B. (2021). Graph neural network-based anomaly detection in multivariate time series. In AAAI Conference on Artificial Intelligence.

DHI (2025). tsod: Anomaly detection for time series data. https://github.com/DHI/tsod. Accessed: 2025-08-14.

Goh, J., Adepu, S., Junejo, K. N., and Mathur, A. (2017). A dataset to support research in the design of secure water treatment systems. In Critical Information Infrastructures Security, pages 88-99.

González, G. G., Tagliafico, S. M., Fernández, A., Sena, G. G., Acuña, J., and Casas, P. (2024). One model to find them all: Deep learning for multivariate time-series anomaly detection in mobile network data. IEEE Transactions on Network and Service Management.

Han, S. and Woo, S. S. (2022). Learning sparse latent graph representations for anomaly detection in multivariate time series. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2977-2986.

Hilal, W., Gadsden, S. A., and Yawney, J. (2022). Financial fraud: A review of anomaly detection techniques and recent advances. Expert Systems with Applications, 193:116429.

Jin, M., Koh, H. Y., Wen, Q., Zambon, D., Alippi, C., Webb, G. I., King, I., and Pan, S. (2024). A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Lee, T. J., Gottschlich, J., Tatbul, N., Metcalf, E., and Zdonik, S. (2018).
Precision and recall for range-based anomaly detection. In Proceedings of the SysML Conference 2018.

Liu, Y., Li, Z., Pan, S., Gong, C., Zhou, C., and Karypis, G. (2022). Anomaly detection on attributed networks via contrastive self-supervised learning. IEEE Transactions on Neural Networks and Learning Systems.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3).

Nizam, H., Zafar, S., Lv, Z., Wang, F., and Hu, X. (2022). Real-time deep anomaly detection framework for multivariate time-series data in industrial IoT. IEEE Sensors Journal, 22(23):22836-22849.

Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R. S., Elmore, A., and Franklin, M. J. (2022). Volume Under the Surface: A new accuracy evaluation measure for time-series anomaly detection. Proceedings of the VLDB Endowment.

Shaukat, K., Alam, T. M., Luo, S., Shabbir, S., Hameed, I. A., Li, J., Abbas, S. K., and Javed, U. (2021). A review of time-series anomaly detection techniques: A step to future perspectives. In Arai, K., editor, Advances in Information and Communication, pages 865-877.

Siddiqui, M. A., Stokes, J. W., Seifert, C., Argyle, E., McCann, R., Neil, J., and Carroll, J. (2019). Detecting cyber attacks using anomaly detection with explanations and expert feedback. In ICASSP 2019.

Spence, C., Parra, L., and Sajda, P. (2001). Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In MMBIA 2001.

Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., and Gottschlich, J. (2018). Precision and recall for time series. Advances in Neural Information Processing Systems, 31.

Yoon, J., Sohn, K., Li, C.-L., Arik, S. O., Lee, C.-Y., and Pfister, T. (2022). Self-supervise, refine, repeat: Improving unsupervised anomaly detection. Transactions on Machine Learning Research.

Zamanzadeh Darban, Z., Webb, G. I., Pan, S., Aggarwal, C., and Salehi, M.
(2024a). Deep learning for time series anomaly detection: A survey. ACM Comput. Surv., 57(1).

Zamanzadeh Darban, Z., Webb, G. I., Pan, S., Aggarwal, C., and Salehi, M. (2024b). Deep learning for time series anomaly detection: A survey. ACM Comput. Surv., 57(1).

Zhang, W., Zhang, C., and Tsung, F. (2022). Grelen: Multivariate time series anomaly detection from the perspective of graph relational learning. In IJCAI, pages 2390-2397.

Zhao, H., Wang, Y., Duan, J., Huang, C., Cao, D., Tong, Y., Xu, B., Bai, J., Tong, J., and Zhang, Q. (2020). Multivariate time-series anomaly detection via graph attention network. In ICDM 2020.